A PyTorch implementation of Multi-Faceted Moment Localizer, an improved version of MAC with additional features for better moment localization. Newly introduced features include:
- Semantic segmentation features from the ADE20K MobileNetV2dilated + C1_deepsup pretrained model.
- Video captioning features
- BERT Sentence Embeddings
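For reference, here is a minimal sketch of how a sentence embedding can be obtained with the pytorch_pretrained_bert package listed in the dependencies below. The model name (bert-base-uncased) and the mean-pooling over the last encoder layer are illustrative assumptions; the released features may use a different model or pooling.

```python
# Minimal sketch: BERT sentence embedding via pytorch_pretrained_bert.
# Assumptions: bert-base-uncased and mean-pooling over the last encoder layer;
# the actual features shipped with this repo may be produced differently.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def bert_sentence_embedding(sentence):
    tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([token_ids])
    with torch.no_grad():
        encoded_layers, _ = model(tokens_tensor)
    # Mean-pool the last encoder layer into a single 768-d sentence vector.
    return encoded_layers[-1].mean(dim=1).squeeze(0)

print(bert_sentence_embedding('the person opens the door').shape)  # torch.Size([768])
```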
- Download "Object Segmentation Features" and "Video Understanding Features" (Video caption features) from the following link and extract to ./data directory.
- Download c3d visual features, c3d visual activity concepts, ref_info provided by authors of MAC and extract to ./data directory.
- Now, the data directory should have the following sub-directories:
  - all_fc6_unit16_overlap0.5
  - clip_object_features_test
  - clip_object_features_train
  - ref_info
  - test_softmax
  - train_softmax
  - video_understanding_features_test
  - video_understanding_features_train
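A quick sanity check (a hypothetical helper, not part of the repository) to confirm the extracted data ended up in the expected sub-directories:

```python
# Hypothetical sanity check (not part of the repo): verify the ./data layout.
import os

EXPECTED = [
    'all_fc6_unit16_overlap0.5',
    'clip_object_features_test',
    'clip_object_features_train',
    'ref_info',
    'test_softmax',
    'train_softmax',
    'video_understanding_features_test',
    'video_understanding_features_train',
]

missing = [d for d in EXPECTED if not os.path.isdir(os.path.join('./data', d))]
if missing:
    raise RuntimeError('Missing data sub-directories: %s' % ', '.join(missing))
```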
- Create a Python 2 Conda environment with PyTorch 0.4.1 and torchvision. Additionally, install the following dependencies using pip (pickle ships with Python and does not need to be installed separately):
  - pip install pytorch_pretrained_bert
  - pip install numpy
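Once the environment is created, the installation can be verified with a short Python snippet (a sketch only; the version pins follow the list above):

```python
# Quick environment check; the expected versions follow the setup steps above.
from __future__ import print_function
import torch
import torchvision
import numpy
import pytorch_pretrained_bert  # raises ImportError if the pip install failed

print('torch       ', torch.__version__)        # expected 0.4.1
print('torchvision ', torchvision.__version__)
print('numpy       ', numpy.__version__)
```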
- Start training with python train.py
- Follow steps 1-4 of the "Training the model" section.
- Download the pre-trained model and place it in the ./checkpoints directory:
  - Pre-trained model with BERT Sentence Embeddings, GloVe VO Embedding, Object Segmentation Features, and Video Captioning Features: Link.
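A hedged sketch of how the downloaded checkpoint might be loaded; the file name best_model.pth and the structure of the saved object are assumptions, so consult the repository's test/evaluation script for the actual entry point:

```python
# Hypothetical sketch of loading the pre-trained weights from ./checkpoints.
# The checkpoint file name and contents are assumptions; the actual test
# script in this repo may load them differently.
import torch

checkpoint = torch.load('./checkpoints/best_model.pth',
                        map_location=lambda storage, loc: storage)
# If the file stores a plain state_dict:
#     model.load_state_dict(checkpoint)
# If it stores a dict with extra bookkeeping (epoch, optimizer state, ...):
#     model.load_state_dict(checkpoint['state_dict'])
```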
- Final Output:
  [dr-obj 0.50][dr-act 0.00][sr 0.005][csr: 0.005] best_R1_IOU5: 0.319
  [dr-obj 0.50][dr-act 0.00][sr 0.005][csr: 0.005] best_R5_IOU5: 0.652
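For context, best_R1_IOU5 and best_R5_IOU5 are Recall@1 and Recall@5 at temporal IoU 0.5. Below is a minimal sketch of how this metric is commonly computed for moment localization; the repository's own evaluation code may differ in details:

```python
# Sketch of the standard R@K at IoU=0.5 metric for moment localization.
# The repo's evaluation code may differ in details (ties, clipping, etc.).
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds_per_query, gts, k=1, iou_thr=0.5):
    """ranked_preds_per_query: one ranked list of (start, end) segments per query."""
    hits = 0
    for preds, gt in zip(ranked_preds_per_query, gts):
        if any(temporal_iou(p, gt) >= iou_thr for p in preds[:k]):
            hits += 1
    return float(hits) / len(gts)
```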
This is not the code we used to identify the hyper-parameters used in the model. This is a simplified version of the code released to the public.
Code related to feature extraction of Object Segmentation and Video Captioning will be released in the future.
This code is improved from the PyTorch implementation of MAC available here.