Implementation of Transframer, Deepmind's U-net + Transformer architecture for up to 30 seconds video generation, in Pytorch
The gist of the paper is the usage of a Unet as a multi-frame encoder, along with a regular transformer decoder cross attending and predicting the rest of the frames. The author builds upon his prior work where images are encoded as sparse discrete cosine transform (DCT) sequences.
I will deviate from the implementation in this paper, using a hierarchical autoregressive transformer, and just a regular resnet block in place of the NF-net block (this design choice is just Deepmind reusing their own code, as NF-net was developed at Deepmind by Brock et al).
Update: On further meditation, there is nothing new in this paper except for generative modeling on DCT representations
- This work would not be possible without the generous sponsorship from Stability AI, as well as my other sponsors
- figure out if dct can be directly extracted from images in jpeg format
@article{Nash2022TransframerAF,
title = {Transframer: Arbitrary Frame Prediction with Generative Models},
author = {Charlie Nash and Jo{\~a}o Carreira and Jacob Walker and Iain Barr and Andrew Jaegle and Mateusz Malinowski and Peter W. Battaglia},
journal = {ArXiv},
year = {2022},
volume = {abs/2203.09494}
}