This repository contains the implementation of "Learning to Segment Actions from Visual and Language Instructions via Differentiable Weak Sequence Alignment", published at CVPR 2021.
We address the problem of unsupervised action segmentation and feature learning in instructional videos using both visual and language instructions. Our key contributions include proposing i) a Soft Ordered Prototype Learning (SOPL) method for learning visual and linguistic prototype sequences corresponding to subtasks (key-steps) in videos, and ii) a new class of alignment methods, referred to as Differentiable Weak Sequence Alignment (DWSA), for weak alignment of visual and linguistic prototype sequences while allowing self-supervised feature learning.
To install all the dependency packages, please run:
pip install -r requirements.txt
To download the English model for spaCy, please run:
python -m spacy download en
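Optionally, you can verify the installation with a short snippet (a minimal check, assuming a spaCy version that accepts the en shortcut installed by the command above):

```python
# Quick check (optional) that the spaCy model loads; "en" is the shortcut
# installed by the download command above.
import spacy

nlp = spacy.load("en")
doc = nlp("Chop the onions and add them to the pan.")
print([(token.text, token.pos_) for token in doc if token.pos_ == "VERB"])
```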
ffmpeg is required for loading videos. If you do not have it installed, you may run
conda install ffmpeg
or
apt-get install ffmpeg
or use any alternative.
Info: Tested with Python 3.7.11 and CUDA 10.1.
We include two video samples under the directory ./data/105222 from the task make kimchi fried rice in the CrossTask dataset to show the required hierarchy. You may run the following commands using these videos.
For the full dataset, please refer to CrossTask Dataset.
If you want to use the ProceL dataset, please refer to ProceL.
We use the pretrained model Demo-Europarl-EN.pcl as the punctuator. If you want to apply punctuation to the subtitles, please download this model from the URL (only Demo-Europarl-EN.pcl is needed) and put it in the folder preprocessing/narr_process/punctuator/.
After downloading the data, the model can be run with:
python main.py --data_dir <data_dir> --task <task>
where <data_dir> is the directory where you saved the data and <task> is the task you want to use; it can also be all if you want to run the model for all tasks in one run. If you want to use the sample videos, simply run python main.py.
This command will run the whole process of data preprocessing, training, and segmentation.
In the end, it will print the segmentation results as:
Task: 105222, Precision: 18.60%, Recall: 38.35%, F1-score: 25.05%, MoF: 41.77%, MoF-bg: 38.35%
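As a quick sanity check on the printed metrics, the F1-score is the harmonic mean of precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
p, r = 18.60, 38.35
print(round(2 * p * r / (p + r), 2))  # 25.05
```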
Note: The data preprocessing module takes a long time, but it only needs to be executed once. The processed video embeddings and textual embeddings will be stored in the folder args.processed_dir/args.task. You may also execute data preprocessing, training, and segmentation separately as follows.
The data preprocessing module includes two parts: narration processing (extracting verb phrases from subtitles) and pretrained feature extraction. It can be run with:
python data_preprocess.py --data_dir <data_dir> --task <task>
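For example, to run data preprocessing alone on the included sample videos (assuming the sample layout above, i.e. data_dir ./data and task 105222):
python data_preprocess.py --data_dir ./data --task 105222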
The narration processing module can be time-consuming. If you want to skip punctuation, please run
python data_preprocess.py --data_dir <data_dir> --task <task> --perform_punct
Similarly, if you want to skip coreference resolution, please add --perform_coref. You can also skip the step of computing concreteness scores by setting the concreteness threshold to zero (using --conc_threshold 0).
Note: If you are interested in the details of data preprocessing, please see the rest of this section. Otherwise, you may move to the Training section.
Our narration processing module can extract verb phrases from SRT-format subtitles. Other formats, e.g. VTT or TTML, can be converted via ./preprocessing/narr_process/format_converter.py.
Our module mainly includes the following parts:
- Punctuate the subtitles [1] (this takes a long time)
- Perform coreference resolution
- Extract verb phrases from sentences (see the sketch after this list)
- Compute the concreteness score of verb phrases [2]
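For illustration, verb-phrase extraction in the spirit of this step could look like the following sketch with spaCy. This is not the repository's implementation (the actual pipeline lives in preprocessing/narr_process/); the function extract_verb_phrases and its simple verb-object heuristic are only for demonstration.

```python
# Minimal sketch of verb-phrase extraction with spaCy (illustration only;
# the actual pipeline lives in preprocessing/narr_process/).
import spacy

nlp = spacy.load("en")  # shortcut installed by `python -m spacy download en`

def extract_verb_phrases(sentence):
    """Return simple verb-object phrases such as 'add rice'."""
    doc = nlp(sentence)
    phrases = []
    for token in doc:
        if token.pos_ == "VERB":
            for child in token.children:
                if child.dep_ in ("dobj", "obj"):
                    phrases.append(f"{token.lemma_} {child.text}")
    return phrases

print(extract_verb_phrases("Add the rice to the pan and stir it for two minutes."))
```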
We use the pretrained model [3] to extract visual embeddings of video segments and textual embeddings of verb phrases in a joint embedding space.
The videos should be stored in args.data_dir/args.task/videos and the verb phrases should be stored in args.processed_dir/args.task/verb_phrases. The extracted features will be stored in args.processed_dir/args.task/video_embd and args.processed_dir/args.task/text_embd.
You may choose either the pretrained S3D or the pretrained I3D model. This can be set by adding the argument --pretrain_model s3d or --pretrain_model i3d.
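Because video segments and verb phrases are embedded in a joint space, they can be compared directly. The following is a minimal sketch of that idea only; the tensors and the 512-dimensional size are placeholders, and the real embeddings are produced by the pretrained model [3].

```python
# Illustration of comparing embeddings in a joint space (placeholder tensors;
# the real embeddings are produced by the MIL-NCE-pretrained S3D/I3D model [3]).
import torch
import torch.nn.functional as F

video_embd = torch.randn(10, 512)  # 10 video segments (placeholder)
text_embd = torch.randn(4, 512)    # 4 verb phrases (placeholder)

# Cosine similarity between every segment and every phrase.
sim = F.normalize(video_embd, dim=1) @ F.normalize(text_embd, dim=1).T
print(sim.shape)  # torch.Size([10, 4])
```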
The training module (multimodal feature learning using DWSA) can be run by:
python train.py --data_dir <data_dir> --task <task>
You may change the hyperparameters via command-line arguments, such as the learning rate (--lr), weight decay (--wd), batch size (--batch_size), and maximum number of epochs (--max_epoch).
Other hyperparameters can be set in a similar manner: the timestamp weight (--time_weight), smoothing parameter (--smooth_param), empty alignment cost (--delta_e), and so on.
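For intuition on the smoothing parameter: the DWSA loss builds on a soft-DTW-style recursion [4] in which the hard minimum over alignment paths is replaced by a smoothed minimum. The sketch below shows only that generic recursion; it is not the DWSA loss implemented here, and gamma merely plays a role analogous to --smooth_param.

```python
# Generic soft-DTW recursion (adapted in spirit from pytorch-softdtw [4]);
# illustration only, not the DWSA loss implemented in this repository.
import numpy as np

def soft_min(values, gamma):
    # Smoothed minimum: -gamma * logsumexp(-values / gamma).
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(cost, gamma=0.1):
    """cost: (n, m) pairwise cost matrix between two prototype sequences."""
    n, m = cost.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

x = np.random.rand(5, 16)  # e.g. visual prototype sequence (placeholder)
y = np.random.rand(4, 16)  # e.g. linguistic prototype sequence (placeholder)
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
print(soft_dtw(cost, gamma=0.1))
```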
After training, we can load the trained model to extract new features and apply clustering to them to obtain the segmentations. Please run:
python segmentation.py --data_dir <data_dir> --task <task> --test_epoch <test_epoch> --bg_ratio <bg_ratio>
where <test_epoch> is the epoch of the model you want to test and <bg_ratio> is the ratio of background in the segmentations (default: 0.4).
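For intuition, the clustering step assigns each frame's learned feature to one of several key-step clusters. The sketch below is an illustration with K-means only, not segmentation.py's exact procedure; the feature shapes, the number of key-steps, and the background handling via <bg_ratio> are placeholders.

```python
# Illustration of clustering learned frame features into key-steps
# (segmentation.py implements the full procedure, including background).
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(300, 128)  # per-frame features (placeholder)
n_key_steps = 7                      # assumed number of key-steps for the task

labels = KMeans(n_clusters=n_key_steps, n_init=10, random_state=0).fit_predict(features)
# Contiguous runs of the same label form the predicted segments.
print(labels[:20])
```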
We have opened issue #1 for common problems encountered when running the code. Please look for solutions or ask questions there to help us improve this repository.
If you find the project helpful, we would appreciate it if you cite our work:
@article{Shen-VisualNarrationProceL:CVPR21,
  author  = {Y.~Shen and L.~Wang and E.~Elhamifar},
  title   = {Learning to Segment Actions from Visual and Language Instructions via Differentiable Weak Sequence Alignment},
  journal = {{IEEE} Conference on Computer Vision and Pattern Recognition},
  year    = {2021}}
[1] We use punctuator from https://github.com/ottokart/punctuator2
[2] The concreteness rating list is from https://github.com/ArtsEngine/concreteness
[3] We use the I3D/S3D models pretrained on Howto100M dataset from https://www.di.ens.fr/willow/research/mil-nce/
[4] The code for DWSA loss is adapted from pytorch-softdtw https://github.com/Sleepwalking/pytorch-softdtw
Contact: shen [dot] yuh [at] northeastern [dot] edu