Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by training a model that takes a text prompt as input and returns the VQGAN latent space as output, which is then transformed into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the generated images given the same prompt.
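
The objective described above can be illustrated with a small framework-free sketch. It is not the repository's training code: the choice of cosine distance, the form of the diversity term (negative mean pairwise distance among images generated for one prompt), and the names clip_loss, diversity_loss and diversity_weight are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two feature vectors."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def clip_loss(image_features, text_features):
    """Mean distance between each generated image and its prompt (assumed cosine)."""
    pairs = list(zip(image_features, text_features))
    return sum(cosine_distance(i, t) for i, t in pairs) / len(pairs)


def diversity_loss(image_features):
    """Hypothetical diversity term: negative mean pairwise distance
    among the features of images generated for the same prompt."""
    n = len(image_features)
    if n < 2:
        return 0.0  # a single sample has nothing to be diverse from
    dists = [
        cosine_distance(image_features[i], image_features[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return -sum(dists) / len(dists)


def total_loss(image_features, text_features, diversity_weight=0.0):
    """Combine both terms; diversity_weight is a hypothetical hyperparameter."""
    return clip_loss(image_features, text_features) + diversity_weight * diversity_loss(image_features)
```

Minimizing the first term pulls each image toward its prompt in CLIP feature space, while the second term, when its weight is non-zero, rewards samples of the same prompt for being far apart from each other.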

Run it on Replicate

News

09-22-2021
- New models released (see 0.2 version in https://github.com/mehdidc/feed_forward_vqgan_clip#pre-trained-models)
- New Colab notebook for training from scratch or fine-tuning
- Web interface from Replicate AI to use the models

How to install?

Download the 16384 Dimension Imagenet VQGAN (f=16)

Links:

https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.1/vqgan_imagenet_f16_16384.ckpt (vqgan_imagenet_f16_16384.ckpt)
https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.1/vqgan_imagenet_f16_16384.yaml (vqgan_imagenet_f16_16384.yaml)
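
The two files can also be fetched from a script. The sketch below uses only the Python standard library and the URLs listed above. The helper names (local_name, download_vqgan) and the default destination directory are assumptions; the text does not state where the files must be placed.

```python
import os
import urllib.request
from urllib.parse import urlparse

# URLs copied from the links above.
VQGAN_URLS = [
    "https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.1/vqgan_imagenet_f16_16384.ckpt",
    "https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.1/vqgan_imagenet_f16_16384.yaml",
]


def local_name(url):
    """Return the file name component of a URL."""
    return os.path.basename(urlparse(url).path)


def download_vqgan(dest_dir="."):
    """Download each file into dest_dir, skipping files that already exist.

    dest_dir="." is an assumption, not a documented requirement.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in VQGAN_URLS:
        path = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

Calling download_vqgan() from the repository directory would leave both files next to main.py; nothing is downloaded until the function is called.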

Install dependencies.

conda

conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - See https://pytorch.org/get-started/locally/ for more info.
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
pip install -r requirements.txt

pip/venv

conda deactivate # Make sure to use your global python3
# venv is part of the Python 3 standard library; on Debian/Ubuntu it may require the python3-venv system package
python3 -m venv ./ff_vqgan_clip_venv
source ./ff_vqgan_clip_venv/bin/activate
python -m pip install -r requirements.txt

How to use?

(Optional) Pre-tokenize Text

python main.py tokenize data/list_of_captions.txt cembeds 128

Train

Modify configs/example.yaml as needed.

python main.py train configs/example.yaml

Tensorboard:

Loss values are logged during training and can be viewed with TensorBoard.

# in a new terminal/session, with the environment activated
pip install tensorboard
tensorboard --logdir results

Pre-trained models

version 0.2 (latest)

Name	Type	Size	Dataset	Link	Author
cc12m_8x128	MLPMixer	12.1MB	Conceptual captions 12M	Download	@mehdidc
cc12m_32x1024	MLPMixer	1.19GB	Conceptual captions 12M	Download	@mehdidc
cc12m_32x1024	VitGAN	1.55GB	Conceptual captions 12M	Download	@mehdidc

After downloading a model or finishing training your own model, you can test it with new prompts, e.g.,

wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.2/cc12m_32x1024_vitgan.th
python -u main.py test cc12m_32x1024_vitgan.th "Picture of a futuristic snowy city during the night, the tree is lit with a lantern"

version 0.1

Name	Type	Size	Dataset	Link	Author
cc12m_8x128	VitGAN	12.1MB	Conceptual captions 12M	Download	@mehdidc
cc12m_16x256	VitGAN	60.1MB	Conceptual captions 12M	Download	@mehdidc
cc12m_32x512	VitGAN	408.4MB	Conceptual captions 12M	Download	@mehdidc
cc12m_32x1024	VitGAN	1.55GB	Conceptual captions 12M	Download	@mehdidc
cc12m_64x1024	VitGAN	3.05GB	Conceptual captions 12M	Download	@mehdidc
bcaptmod_8x128	VitGAN	11.2MB	Modified blog captions	Download	@afiaka87
bcapt_16x128	MLPMixer	168.8MB	Blog captions	Download	@mehdidc

NB: cc12m_AxB means a model trained on conceptual captions 12M, with depth A and hidden state dimension B
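
The naming convention in the note above can be decoded programmatically. The helper below is an illustrative sketch (parse_model_name is not part of the repository) and assumes that the other entries in the tables, such as bcapt_16x128, follow the same dataset_AxB pattern, which the note states only for cc12m.

```python
import re

# <dataset>_<depth>x<hidden dimension>, e.g. cc12m_32x1024
_MODEL_NAME = re.compile(r"([a-z0-9]+)_(\d+)x(\d+)")


def parse_model_name(name):
    """Split a model name from the tables into dataset, depth and hidden dimension."""
    match = _MODEL_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    dataset, depth, hidden_dim = match.groups()
    return {"dataset": dataset, "depth": int(depth), "hidden_dim": int(hidden_dim)}
```

For example, parse_model_name("cc12m_32x1024") gives a depth of 32 and a hidden state dimension of 1024. Checkpoint file names that carry an extra suffix, such as cc12m_32x1024_vitgan.th, are deliberately rejected by this sketch.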

You can also try it in the Colab Notebook. Using the notebook, you can generate images from pre-trained models and interpolate between text prompts to create videos; see, for instance, video 1, video 2, or video 3.
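
As a rough illustration of prompt interpolation, the sketch below blends two feature vectors linearly to obtain one vector per video frame. It is an assumption that the blending is linear and happens on per-prompt feature vectors; the notebook may interpolate in a different space or with a different scheme, and the function names are hypothetical.

```python
def lerp(a, b, t):
    """Linear blend of two equal-length vectors: a at t=0, b at t=1."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def interpolate(start, end, num_frames):
    """Return num_frames vectors moving evenly from start to end, endpoints included."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [lerp(start, end, i / (num_frames - 1)) for i in range(num_frames)]
```

Each interpolated vector would then be passed through the model to produce one frame, so that consecutive frames change gradually from the first prompt to the second.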

Acknowledgements

- The training code is heavily based on the VQGAN-CLIP notebook https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ. Thanks to all the authors who contributed to the notebook (@crowsonkb, @advadnoun, @Eleiber, @Crimeacs, @Abulafia).
- Thanks to @lucidrains: the MLP mixer model (mlp_mixer_pytorch.py) is from https://github.com/lucidrains/mlp-mixer-pytorch.
- Thanks to the Taming Transformers authors (https://github.com/CompVis/taming-transformers): the code uses the pre-trained VQGAN model and the VGG16 feature space perceptual loss from https://github.com/CompVis/taming-transformers/blob/master/taming/modules/losses/lpips.py.
- Thanks to @afiaka87 for all the contributions to the repository's code and for providing the blog captions dataset for experimentation.
- Thanks to the VitGAN authors: the VitGAN model is from https://github.com/wilile26811249/ViTGAN.
- Thanks to @CJWBW from Replicate AI for making and hosting a browser-based text-to-image interface using the model.
