
[Documentation] Video Prediction Labeled as a V2V process, despite taking only 1 frame #1

Closed
Sazoji opened this issue Nov 25, 2021 · 2 comments


Sazoji commented Nov 25, 2021

Judging by the results, the transformer takes in a single frame, so this would be considered an image-to-video process.
Something like video inpainting or camera FOV extrapolation (as in FGVC) would be input video -> output video.
Am I missing something in the documentation that shows this as some sort of sparse video interpolation, where it can take more than a (D1, D2, single frame) input? Or was it called V2V to match the I2I label on the inpainting/image-completion counterparts?

Additionally, there isn't a direct link to the paper, which documents that the V2V model only takes in a single image.
https://arxiv.org/abs/2111.12417

chenfei-wu (Contributor) commented

We view an image as a special video with one frame. As a result, image-to-video generation can be viewed as a special case of video-to-video generation.
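
For illustration, here is a minimal sketch of that framing, assuming a (T, C, H, W) tensor layout; the shape convention and the NumPy code are assumptions for illustration, not taken from the NUWA codebase.

```python
# Sketch: treat a single image as a one-frame video so a video-to-video
# pipeline can consume it. The (T, C, H, W) layout is an assumed convention.
import numpy as np

def image_as_video(image: np.ndarray) -> np.ndarray:
    """Wrap a (C, H, W) image as a (1, C, H, W) single-frame video clip."""
    return image[np.newaxis, ...]

frame = np.random.rand(3, 256, 256).astype(np.float32)  # one RGB frame
clip = image_as_video(frame)
print(clip.shape)  # (1, 3, 256, 256)
```

Under this view, image-to-video prediction is just video-to-video generation where the input clip happens to have T = 1.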


Sazoji commented Nov 29, 2021

OK, I'll agree that frame-to-video can be seen as a special case of V2V generation. I was going to close this yesterday, but GitHub was down during my break.
I'd just like to mention that this method is not the type of V2V usage one would be looking for when trying to do video completion or inpainting, which seemed to be implied by placing it below image completion.

An actual example of V2V synthesis would be a domain change or style transfer, like a video label-map encoder feeding a photorealistic video decoder. NUWA-Infinity seems to have the capacity to change style via a conditioned decoder, and it properly labels the synthesis models as video prediction and generation based on what is encoded (images and text, i.e. not video). I would still like to see how video encoders could be implemented.

Sazoji closed this as completed Nov 29, 2021