open replication of the prior #23

rom1504 · 2022-04-24T20:36:02Z

hey,

thanks @lucidrains for building this awesome replication of the model, as usual!
dalle2 paper https://arxiv.org/pdf/2204.06125.pdf

with a few people from laion we're working on replication of the prior at scale
we're gathering notes in https://docs.google.com/document/d/1BKIQPzZS7pVL2JgL74W0dUIlUcfld8jptA6nIne_cNo/edit
and here
big plan:

use https://github.com/rom1504/embedding-reader to build a data loader
- use laion2B-en 2B image/text embeddings https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/
plug that into the diffusion prior here: need to support precomputed embeddings
train at small scale
implement metrics (eg retrieval, classification, generation) on the output of the prior
train at large scale

Anybody interested in that project, feel free to discuss here or in #dalle2-prior at laion server (https://discord.gg/xBPBXfcFHd)

hp param from dalle2 paper

As a first step, we're trying to have an end to end version running at small scale. Once it works, we'll scale it up.

We intend to send PR in this repo for any improvement that seems worthwhile (eg support precomputed embeddings, distributed training, ...)

lucidrains · 2022-04-24T21:08:49Z

@rom1504 ah looks like a great plan :) so i think the PCA portion was only for the autoregressive prior method, and not the diffusion prior (they needed to quantize so they can use the straightforward cross entropy loss). but correct me if i'm wrong i'm happy to build that into the framework

rom1504 · 2022-04-24T21:14:05Z

ohhh you're right, well that makes thing much easier
I'll remove that from the plan

rom1504 · 2022-04-24T21:15:49Z

I'll keep this issue updated as we make progress

lucidrains · 2022-04-24T21:16:12Z

nousr · 2022-04-26T19:58:06Z

I was able to over-fit a small subset of LAION-2B locally (using the new CLIP-less DiffusionPrior class). I'll be working to make that data loader next so we can do the first "real" training run.

referencing:

use https://github.com/rom1504/embedding-reader to build a data loader

use laion2B-en 2B image/text embeddings https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/

rom1504 · 2022-04-26T20:14:00Z

Cool!
The corresponding text embeddings are now in https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/

rom1504 · 2022-04-26T20:14:37Z

You can use 2 instances of embedding reader and a zip to get both embeddings batch

taki0112 · 2022-04-27T05:28:34Z

@rom1504
The text embedding you mentioned is from which network? Is it a clip?

rom1504 · 2022-04-27T06:58:06Z

Yes, image and text embeddings link above are ViT-L/14 clip.
Check the laion5B blogpost for details

rom1504 · 2022-05-01T19:53:52Z

https://huggingface.co/rom1504/dalle2-diffusion-prior/resolve/main/1651432174.5708027_saved_model.pth here's a first checkpoint by @krish240574
thanks to him for building the training code and running that first training!

it's time for the evaluation to start.

krish240574 · 2022-05-02T03:47:17Z

https://huggingface.co/rom1504/dalle2-diffusion-prior/resolve/main/1651432174.5708027_saved_model.pth here's a first checkpoint by @krish240574 thanks to him for building the training code and running that first training!

it's time for the evaluation to start.

Here is a run happening, 100 million data points -
https://wandb.ai/laion/diffusion-prior/runs/3o0ic6ou?workspace=user-krish240574

Hyperparameters -
learning_rate=0.001,
max_grad_norm=0.5,
weight_decay=0.01
batch_size=10 ** 4
Train-val-test = 0.7 - 0.2 - 0.1
Refer to https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_diffusion_prior.py for training details.

lucidrains · 2022-05-02T19:11:32Z

@krish240574 so just a word of caution, that learning rate of 1e-3 is quite high for transformers. a more conservative value would be Karpathy's favorite, 3e-4 (however, I don't know how transformers behave within a DDPM framework, so I could be wrong)

rom1504 · 2022-05-04T20:43:48Z

@krish240574 is running that new run with the newly added metrics https://wandb.ai/laion/diffusion-prior/runs/aul0rhv5?workspace=user-rom1504

seems cosine similarity is going up.

@lucidrains do you have any opinion on what would be the best way to know if the prior is doing its job? (except for plugging it into a generator and training of course)

xiankgx · 2022-05-04T22:13:02Z

Seems like the prior is doing a great job here. Please share the training details and checkpoint and dataset if possible. Thanks.

rom1504 · 2022-05-05T07:30:14Z

Dataset is
parser.add_argument("--image-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/")
parser.add_argument("--text-embed-url", type=str, default="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/text_emb/")

IE laion2B-en clip ViT-l/14 text/image embeddings

krish240574 · 2022-05-05T13:39:36Z

Here is a sample run, 600 million data points - https://wandb.ai/laion/diffusion-prior/runs/ar65uq6n?workspace=user-krish240574, hyperparameters as in the script(default values) train_diffision_prior.py

nousr · 2022-05-06T02:09:21Z

Just wanted to make a note that Katherine recommended we use EMA while training the prior as well @lucidrains

lucidrains · 2022-05-06T04:49:55Z

Just wanted to make a note that Katherine recommended we use EMA while training the prior as well @lucidrains

ohh got it, i can take care of that tomorrow morning 👍

lucidrains · 2022-05-06T15:07:29Z

Just wanted to make a note that Katherine recommended we use EMA while training the prior as well @lucidrains
@nousr

740d644 ok all done! @crowsonkb thank you for the advice yet again 🙏

nousr · 2022-05-06T15:30:18Z

@lucidrains awesome, thanks for doing that!

Since we're starting to pull out meaningful results from our mini-prior, what would you recommend in terms of network hp params next?

I think I saw somewhere there was a discussion about moving to something like...?

# change to 12 layers, 128 dim, 16 heads
prior_network = DiffusionPriorNetwork(
    dim = 768,
    depth = 12,
    dim_head = 128,
    heads = 16
).cuda()
# i'd also like to try 1000 steps and compare results (open to thoughts on this)
diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 1000,
    cond_drop_prob = 0.1
).cuda()

lucidrains · 2022-05-06T15:51:38Z

@nousr no problem! :)

yea we can aim for maybe the same size as GPT2 small? https://huggingface.co/docs/transformers/model_doc/gpt2

so it would translate to

prior_network = DiffusionPriorNetwork(
    dim = 768,
    depth = 12,
    dim_head = 64,
    heads = 12
).cuda()

lucidrains · 2022-05-07T17:04:09Z

in light of #71 should definitely be retrained at the latest version! i've also turned off classifier free guidance, since i'm uncertain if it works well with the predict x0 objective. final thought is that we should be training with the text encodings + corresponding text mask if possible. i think the paper did have this (although if Katherine was able to get good results without, let us aim for that first)

crowsonkb · 2022-05-07T17:07:13Z

Classifier-free guidance works fine with predicting x_0, I trained mine that way. :) As for training with the text encodings, I am feeding in the hidden states at the end of the frozen CLIP text encoder to my prior (along with the corresponding padding mask for attention) instead of trying to learn language from scratch. This works pretty well, a lot better than trying to feed in only the CLIP text embedding!

crowsonkb · 2022-05-07T17:11:23Z

Actually do you think I need learned queries if I am feeding in a sequence of text encoder hidden states? I don't have them and don't know how much they help. There are always at least two text encoder hidden states from the SOT and EOT tokens (for the null condition), more for actual prompts.

lucidrains · 2022-05-07T19:59:45Z

Classifier-free guidance works fine with predicting x_0, I trained mine that way. :) As for training with the text encodings, I am feeding in the hidden states at the end of the frozen CLIP text encoder to my prior (along with the corresponding padding mask for attention) instead of trying to learn language from scratch. This works pretty well, a lot better than trying to feed in only the CLIP text embedding!

thanks for sharing! that's great to know! was wondering if this would work well given the l2norm constraint - i'll revert the commit next week so others can use it

and yes, i have things setup exactly like you did (the text encodings being the output of the final layer of the CLIP text transformer), so it should be ready to go, provided Laion can save and dataload the text encodings efficiently in addition to the text embeddings

lucidrains · 2022-05-07T20:03:44Z

Actually do you think I need learned queries if I am feeding in a sequence of text encoder hidden states? I don't have them and don't know how much they help. There are always at least two text encoder hidden states from the SOT and EOT tokens (for the null condition), more for actual prompts.

so the literature is scant on this, but https://arxiv.org/abs/2006.11527 does suggest adding learned queries (up to 16) can be beneficial. i was also puzzled why they didn't take the output token of the image embedding, and one possibility is that they projected the "noised | predicted" image embedding into multiple tokens for attention (which i'll build into the repository as a setting eventually). the more realistic answer is no matter which token you choose, as long as the position information is there, and as long as the transformer is big enough, it won't matter that much 😆

tldr: it doesn't hurt to add a few memory (learned query) tokens. with a big enough transformer, probably matters little

crowsonkb · 2022-05-07T20:07:35Z

i was also puzzled why they didn't take the output token of the image embedding

I tried this and it was worse/didn't learn as well. I suspect it might be easier to predict a clean x_0 without the residual bias from the noisy input (because it would have to generate exact anti-noise in the ffns and add it to that token's residual stream), whereas if you have a separate output token you just have to copy information into it via attention.

lucidrains · 2022-05-07T20:16:00Z

this is valuable information! thanks!

at the end of the day, I think the team still struggled with the generator not treating the text as a bag of words, and the bottleneck is the clip text encoder. I suspect the new Coca, which is an LM and Clip trained end to end, would help

crowsonkb · 2022-05-07T20:17:47Z

I think so too because it can't just throw away as much information about the contents of the individual tokens bc it needs to do the next token prediction task too.

crowsonkb · 2022-05-07T20:18:54Z

btw I think CoCa should be trained dropping out the image features so you can generate captions with superconditioning (it should work well there for the same reason it works well when generating images, the typical image/caption pair in most training sets doesn't match all that well).

lucidrains · 2022-05-07T20:56:07Z

btw I think CoCa should be trained dropping out the image features so you can generate captions with superconditioning (it should work well there for the same reason it works well when generating images, the typical image/caption pair in most training sets doesn't match all that well).

yes! agreed on the superconditioning!

last thought i've had that is worth sharing is perhaps there can be one more level of indirection. If one were to train CLIP with a multi-stage efficient ViT, we can do text -> text embed + text encoding -> image embed + image encoding (~49-64 tokens) -> image. in the ideal world, we have a 540B parameter PaLM + CoCa (with FILIP fine interactions between the text tokens and image tokens) + cascading DDPM generator of about 3B parameters conditioned on the image tokens from the hierarchical ViT. One can dream 😆

lucidrains · 2022-05-15T19:53:00Z

@rom1504 update: Romain reported that a research group out there have replicated the prior using the code in this repository for their CLIP generations - in other words, the code in this repository works, and we have confirmation that the prior is effective, per paper

nousr · 2022-05-23T14:22:44Z

@lucidrains have you started working on config-file support for the prior?

clarification: (I was thinking about tackling that to simplify the training script, but didn't wanna duplicate work!)

lucidrains · 2022-05-23T15:20:15Z

@nousr hey yea i did, only partly. but the general scaffold is there that it shouldn't take too much code to convert it to be config-based (by looking at the decoder training script and what was done in dalle2_pytorch/train_configs.py as example)

this week i'm back to attending a bunch of meetings, so will be generally unproductive. feel free to jump in with a PR!

rom1504 · 2022-06-19T19:20:23Z

I believe we got everything done here.
One last thing that would be helpful in my opinion would be a prior.md with these information:

What is the prior and what can it do
How to prepare a dataset for it
How to run a training
how to use a pretrained model
What are good metrics
how to plug into the decoder

I think that would close this topic.
I believe most of the information is already present in readme.md but having it in its dedicated file would help as the readme is getting very large now.

rom1504 · 2022-07-08T21:13:03Z

pretty much done now

lucidrains added the enhancement New feature or request label Apr 24, 2022

nousr mentioned this issue Apr 26, 2022

Add Functionality For 'Embedding-Only' Training #24

Closed

rom1504 mentioned this issue Apr 26, 2022

Pretrained weights #25

Closed

This was referenced Apr 27, 2022

Open replication of the generator #27

Open

How can I change Cuda to CPU in dalle2 on Mac? #26

Closed

rom1504 mentioned this issue May 1, 2022

All changes implemented, current run happening. Link to wandb run in comments. #43

Merged

lucidrains mentioned this issue May 15, 2022

Build a fair evaluation of the prior #29

Open

3 tasks

rom1504 closed this as completed Jul 8, 2022

open replication of the prior #23

open replication of the prior #23

Comments

rom1504 commented Apr 24, 2022 • edited

lucidrains commented Apr 24, 2022

rom1504 commented Apr 24, 2022

rom1504 commented Apr 24, 2022

lucidrains commented Apr 24, 2022

nousr commented Apr 26, 2022 • edited

rom1504 commented Apr 26, 2022

rom1504 commented Apr 26, 2022

taki0112 commented Apr 27, 2022

rom1504 commented Apr 27, 2022

rom1504 commented May 1, 2022

krish240574 commented May 2, 2022

lucidrains commented May 2, 2022

rom1504 commented May 4, 2022

xiankgx commented May 4, 2022

rom1504 commented May 5, 2022

krish240574 commented May 5, 2022

nousr commented May 6, 2022

lucidrains commented May 6, 2022

lucidrains commented May 6, 2022 • edited

nousr commented May 6, 2022

lucidrains commented May 6, 2022

lucidrains commented May 7, 2022

crowsonkb commented May 7, 2022

crowsonkb commented May 7, 2022

lucidrains commented May 7, 2022

lucidrains commented May 7, 2022 • edited

crowsonkb commented May 7, 2022

lucidrains commented May 7, 2022 • edited

crowsonkb commented May 7, 2022

crowsonkb commented May 7, 2022

lucidrains commented May 7, 2022 • edited

lucidrains commented May 15, 2022 • edited

nousr commented May 23, 2022 • edited

lucidrains commented May 23, 2022 • edited

rom1504 commented Jun 19, 2022

rom1504 commented Jul 8, 2022

rom1504 commented Apr 24, 2022 •

edited

nousr commented Apr 26, 2022 •

edited

lucidrains commented May 6, 2022 •

edited

lucidrains commented May 7, 2022 •

edited

lucidrains commented May 7, 2022 •

edited

lucidrains commented May 7, 2022 •

edited

lucidrains commented May 15, 2022 •

edited

nousr commented May 23, 2022 •

edited

lucidrains commented May 23, 2022 •

edited