caption my own video with provided pretrained model #10
Comments
Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.
Thanks a lot! Here is the log file:
Hi @dawnlh, I suppose you evaluated the pretrained weight (zero-shot) directly instead of finetuning. You should finetune on the captioning task first.
Yes, I evaluated the pretrained weight (zero-shot) directly. I tried to finetune the model, but failed due to limited GPU memory (even with batch_size set to 1). Can you give an estimate of how much GPU memory is needed to finetune the model? Or would it be convenient for you to share the finetuned weights for the captioning task (no transcript)?
Hi @dawnlh. We finetuned the model with 4 Tesla V100 GPUs. I am sorry that we cannot provide the finetuned weights.
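(For anyone hitting the same memory wall: below is a minimal, generic PyTorch sketch of two common workarounds, mixed-precision training and gradient accumulation. This is not the repository's finetuning script; the model, data, and hyperparameters are placeholders standing in for the real training loop.)

```python
# Generic memory-saving sketch: fp16 autocast + gradient accumulation.
# `model`, the data, and the loop length are placeholders, not objects
# from this repository.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)          # placeholder for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

accum_steps = 8                                   # effective batch = micro-batch * accum_steps
optimizer.zero_grad()
for step in range(32):                            # stand-in for the real dataloader loop
    x = torch.randn(1, 1024, device=device)       # micro-batch of 1 fits in less memory
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean() / accum_steps   # scale loss for accumulation
    scaler.scale(loss).backward()                 # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                    # one optimizer step per effective batch
        scaler.update()
        optimizer.zero_grad()
```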
Okay, thanks anyway~ I'll try to work around the GPU memory limitation. Another question: could you provide some instructions or code on using the finetuned model for video captioning on self-captured videos? I mean the input video processing (how to extract the same features as the training set to serve as the model input) and the output visualization.
More information about the feature extractor can be found in the README. The caption results are saved in the output directory.
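(A minimal sketch of packaging per-video features for your own clips. It assumes the features were already extracted with the feature extractor linked in the README, one .npy file per video, and that the dataloader expects a pickle mapping video id to a (num_segments, feature_dim) float array, as with the provided MSRVTT features; the exact format should be verified against the dataloader code. All paths below are hypothetical.)

```python
# Hedged sketch: pack one .npy feature file per video into a single pickle,
# keyed by video id, mirroring the assumed MSRVTT feature format.
import glob
import os
import pickle

import numpy as np

feature_dir = "my_video_features"        # hypothetical folder of <video_id>.npy files
output_path = "my_videos_features.pickle"

features = {}
for npy_path in glob.glob(os.path.join(feature_dir, "*.npy")):
    video_id = os.path.splitext(os.path.basename(npy_path))[0]
    features[video_id] = np.load(npy_path).astype(np.float32)

with open(output_path, "wb") as f:
    pickle.dump(features, f)
print(f"Packed {len(features)} videos into {output_path}")
```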
Got it! Thank you very much for your patient replies.
Hi, thanks for the wonderful work.
I want to caption my own videos given only the video frames (without transcripts). Can I use the pretrained weight (univl.pretrained.bin) provided in the repository directly for this task? I evaluated the pretrained weight univl.pretrained.bin directly on MSRVTT with the following code, but got very low metric values:
Emmm, I'm new to this field. I would really appreciate it if you could provide some suggestions, instructions, or code on using the provided pretrained model for video captioning in real-world cases. (Perhaps the main points lie in the pretrained model, feature extraction, and result visualization?)
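(On the result-visualization point: a hypothetical sketch that pairs generated captions with a thumbnail frame from each source video. It assumes a results file with one tab-separated "video_id<TAB>caption" line per video, which may not match the actual output format of the evaluation script; the file and folder names are placeholders.)

```python
# Hypothetical visualization sketch: show the first frame of each video
# with its generated caption as the title.
import cv2                         # opencv-python
import matplotlib.pyplot as plt

results_path = "caption_results.txt"   # hypothetical results file
video_dir = "my_videos"                # hypothetical folder of <video_id>.mp4 files

with open(results_path) as f:
    pairs = [line.rstrip("\n").split("\t", 1) for line in f if line.strip()]

for video_id, caption in pairs:
    cap = cv2.VideoCapture(f"{video_dir}/{video_id}.mp4")
    ok, frame = cap.read()             # grab the first frame as a thumbnail
    cap.release()
    if not ok:
        continue
    plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV loads BGR
    plt.title(caption)
    plt.axis("off")
    plt.show()
```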