
caption my own video with provided pretrained model #10

Closed

dawnlh opened this issue May 25, 2021 · 8 comments

dawnlh commented May 25, 2021

Hi, thanks for the wonderful work.
I want to caption my own videos given only the video frames (without transcripts). Can I use the pretrained weight (univl.pretrained.bin) provided in the repository directly for this task? I evaluated the pretrained weight univl.pretrained.bin directly on MSRVTT with the following command,

DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_eval --num_thread_reader=4 \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}

but got very low metric values:

BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
 METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045

I'm new to this field, so I would really appreciate any suggestions, instructions, or code for using the provided pretrained model for video captioning on real-world videos. (Perhaps the main points are the pretrained model, feature extraction, and result visualization?)

@ArrowLuo (Contributor)

Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.


dawnlh commented May 25, 2021

Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.

Thanks a lot! Here is the log file:

2021-05-25 11:15:57,643:INFO: Effective parameters:
2021-05-25 11:15:57,644:INFO:   <<< batch_size: 256
2021-05-25 11:15:57,644:INFO:   <<< batch_size_val: 32
2021-05-25 11:15:57,644:INFO:   <<< bert_model: bert-base-uncased
2021-05-25 11:15:57,644:INFO:   <<< cache_dir: 
2021-05-25 11:15:57,644:INFO:   <<< coef_lr: 0.1
2021-05-25 11:15:57,644:INFO:   <<< cross_model: cross-base
2021-05-25 11:15:57,644:INFO:   <<< cross_num_hidden_layers: 2
2021-05-25 11:15:57,644:INFO:   <<< data_path: data/msrvtt/MSRVTT_data.json
2021-05-25 11:15:57,644:INFO:   <<< datatype: msrvtt
2021-05-25 11:15:57,644:INFO:   <<< decoder_model: decoder-base
2021-05-25 11:15:57,644:INFO:   <<< decoder_num_hidden_layers: 3
2021-05-25 11:15:57,644:INFO:   <<< do_eval: True
2021-05-25 11:15:57,644:INFO:   <<< do_lower_case: True
2021-05-25 11:15:57,644:INFO:   <<< do_pretrain: False
2021-05-25 11:15:57,644:INFO:   <<< do_train: False
2021-05-25 11:15:57,644:INFO:   <<< epochs: 20
2021-05-25 11:15:57,644:INFO:   <<< feature_framerate: 1
2021-05-25 11:15:57,644:INFO:   <<< features_path: data/msrvtt/msrvtt_videos_features.pickle
2021-05-25 11:15:57,644:INFO:   <<< fp16: False
2021-05-25 11:15:57,644:INFO:   <<< fp16_opt_level: O1
2021-05-25 11:15:57,644:INFO:   <<< gradient_accumulation_steps: 1
2021-05-25 11:15:57,644:INFO:   <<< hard_negative_rate: 0.5
2021-05-25 11:15:57,644:INFO:   <<< init_model: weight/univl.pretrained.bin
2021-05-25 11:15:57,644:INFO:   <<< local_rank: 0
2021-05-25 11:15:57,644:INFO:   <<< lr: 0.0001
2021-05-25 11:15:57,644:INFO:   <<< lr_decay: 0.9
2021-05-25 11:15:57,644:INFO:   <<< margin: 0.1
2021-05-25 11:15:57,644:INFO:   <<< max_frames: 100
2021-05-25 11:15:57,644:INFO:   <<< max_words: 20
2021-05-25 11:15:57,644:INFO:   <<< min_time: 5.0
2021-05-25 11:15:57,645:INFO:   <<< n_display: 100
2021-05-25 11:15:57,645:INFO:   <<< n_gpu: 1
2021-05-25 11:15:57,645:INFO:   <<< n_pair: 1
2021-05-25 11:15:57,645:INFO:   <<< negative_weighting: 1
2021-05-25 11:15:57,645:INFO:   <<< num_thread_reader: 4
2021-05-25 11:15:57,645:INFO:   <<< output_dir: ckpts/ckpt_msrvtt_caption
2021-05-25 11:15:57,645:INFO:   <<< sampled_use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< seed: 42
2021-05-25 11:15:57,645:INFO:   <<< stage_two: True
2021-05-25 11:15:57,645:INFO:   <<< task_type: caption
2021-05-25 11:15:57,645:INFO:   <<< text_num_hidden_layers: 12
2021-05-25 11:15:57,645:INFO:   <<< train_csv: data/youcookii_singlef_train.csv
2021-05-25 11:15:57,645:INFO:   <<< use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< val_csv: data/msrvtt/MSRVTT_JSFUSION_test.csv
2021-05-25 11:15:57,645:INFO:   <<< video_dim: 1024
2021-05-25 11:15:57,645:INFO:   <<< visual_model: visual-base
2021-05-25 11:15:57,645:INFO:   <<< visual_num_hidden_layers: 6
2021-05-25 11:15:57,645:INFO:   <<< warmup_proportion: 0.1
2021-05-25 11:15:57,645:INFO:   <<< world_size: 1
2021-05-25 11:15:57,646:INFO: device: cuda:0 n_gpu: 1
2021-05-25 11:15:57,646:INFO: loading vocabulary file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased/vocab.txt
2021-05-25 11:15:58,017:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/visual-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 1,
  "type_vocab_size": 2,
  "vocab_size": 1024
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/visual-base/visual_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/cross-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 768
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/cross-base/cross_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base
2021-05-25 11:15:58,019:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_target_embeddings": 512,
  "num_attention_heads": 12,
  "num_decoder_layers": 1,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,019:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base/decoder_pytorch_model.bin
2021-05-25 11:15:58,019:WARNING: Stage-One:False, Stage-Two:True
2021-05-25 11:15:58,019:WARNING: Set bert_config.num_hidden_layers: 12.
2021-05-25 11:15:59,122:WARNING: Set visual_config.num_hidden_layers: 6.
2021-05-25 11:15:59,591:WARNING: Set cross_config.num_hidden_layers: 2.
2021-05-25 11:15:59,763:WARNING: Set decoder_config.num_decoder_layers: 3.
2021-05-25 11:16:02,843:INFO: --------------------
2021-05-25 11:16:02,843:INFO: Weights from pretrained model not used in UniVL: 
   cls.predictions.bias
   cls.predictions.transform.dense.weight
   cls.predictions.transform.dense.bias
   cls.predictions.transform.LayerNorm.weight
   cls.predictions.transform.LayerNorm.bias
   cls.predictions.decoder.weight
   cls_visual.predictions.weight
   cls_visual.predictions.bias
   cls_visual.predictions.transform.dense.weight
   cls_visual.predictions.transform.dense.bias
   cls_visual.predictions.transform.LayerNorm.weight
   cls_visual.predictions.transform.LayerNorm.bias
   similarity_pooler.dense.weight
   similarity_pooler.dense.bias
2021-05-25 11:16:10,136:INFO: ***** Running test *****
2021-05-25 11:16:10,136:INFO:   Num examples = 2990
2021-05-25 11:16:10,136:INFO:   Batch size = 32
2021-05-25 11:16:10,136:INFO:   Num steps = 94
2021-05-25 11:23:31,867:INFO: >>>  BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
2021-05-25 11:23:31,877:INFO: >>>  METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045

@ArrowLuo (Contributor)

Hi @dawnlh, I suppose you evaluated the pretrained weight (zero-shot) directly instead of finetuning. You should finetune with --do_train first.
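
For reference, a finetuning run would look roughly like the evaluation command above with --do_train (and the training CSV) added. The flags below all appear in the "Effective parameters" list in the log, but the hyperparameter values (epochs, batch size, learning rate) are only illustrative placeholders, so please check the README for the exact finetuning recipe:

DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_train --num_thread_reader=4 \
--epochs=5 --batch_size=32 --n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}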


dawnlh commented May 25, 2021

Hi @dawnlh, I suppose you evaluated the pretrained weight (zero-shot) directly instead of finetuning. You should finetune with --do_train first.

Yes, I evaluated the pretrained weight (zero-shot) directly. I tried to finetune the model but failed due to limited GPU memory (even with batch_size set to 1). Can you estimate how much GPU memory is needed to finetune the model? Or would it be convenient for you to share the finetuned weights for the captioning task (no transcript)?

@ArrowLuo (Contributor)

Hi @dawnlh. We finetuned the model with 4 Tesla V100 GPUs. I am sorry that we cannot provide the finetuned weights.
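
If single-GPU memory is the blocker, the parameter list in the log above suggests two knobs that may help. This is only a sketch under the assumption that they act as the usual mixed-precision and gradient-accumulation switches (the O1 opt level suggests --fp16 relies on NVIDIA Apex being installed); it is not a tested recipe, and the batch/accumulation values are placeholders. The variables are the same as in the finetuning sketch above:

python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_train --num_thread_reader=4 \
--epochs=5 --batch_size=8 --gradient_accumulation_steps=4 \
--fp16 --fp16_opt_level O1 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}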


dawnlh commented May 25, 2021

Okay, thanks anyway. I'll try to work around the GPU memory limitation. Another question: could you provide some instructions or code for using the finetuned model to caption self-captured videos? I mean the input video processing (how to extract the same features as the training set to serve as the model input) and the output visualization.

@ArrowLuo (Contributor)

More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.
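
As a rough sketch of the data plumbing for your own videos (this is inferred from features_path, video_dim=1024, and feature_framerate=1 in the log above, not from documented behavior): the dataloader appears to expect a pickle mapping each video id to a float array of shape [num_segments, 1024], produced by the feature extractor linked in the README. Packing your own extracted features into that layout might look like the following; the file names and video id are placeholders:

python - <<'EOF'
import pickle
import numpy as np

# Placeholder: features for one self-captured video, one 1024-d vector per
# second of video (matching feature_framerate=1 and video_dim=1024 above).
features = {}
features["my_video_0001"] = np.load("my_video_0001_features.npy").astype("float32")

# Write the dict in the same pickle layout as msrvtt_videos_features.pickle,
# then point --features_path at this file.
with open("data/my_videos_features.pickle", "wb") as f:
    pickle.dump(features, f)
EOF

Presumably the CSV/JSON passed to --val_csv and --data_path would also need entries for these video ids, so the loader can pair ids with features.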


dawnlh commented May 25, 2021

More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.

Got it! Thank you very much for your patient replies.

dawnlh closed this as completed May 25, 2021