
Training plans? #17

Closed
nbardy opened this issue Jun 21, 2023 · 122 comments

@nbardy
Contributor

nbardy commented Jun 21, 2023

I've got a bunch of compute for the next couple of weeks and I'm thinking of training this on LAION.

Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.

@lucidrains
Owner

@nbardy where do you have the compute from? you should join the LAION discord and check to see first

i will be finishing the unconditional training code this week for starters, before the entire training code by end of month

@nbardy
Contributor Author

nbardy commented Jun 21, 2023

512 TPUv4 chips from a Google startup grant.

Didn't get any response in LAION when I asked. Looks like nothing going on yet.

@lucidrains
Owner

ohh sweet, though you probably should do it in jax? or has the state of pytorch xla improved?

@lucidrains
Owner

are you doing a startup? or working for a new one?

@francqz31

@nbardy I think you should just train it for the super-resolution upsampling task (128px to 4K), which is the highlight of the paper. GigaGAN's text-to-image is kinda meh and not that impressive.

What's impressive and holds the current SOTA in text-to-image is this project: https://raphael-painter.github.io/. It even beats Midjourney v5.1, is competitive with v5.2, and has efficient finetuning.
lucid might implement RAPHAEL and you might train it; that would be a far better idea than wasting all that compute on nothing.

@lucidrains
Owner

lucidrains commented Jun 23, 2023

@francqz31 oh nice, wasn't aware of raphael. there is no implementation yet?

@lucidrains
Owner

lucidrains commented Jun 23, 2023

@francqz31 i see, they just added a ton of mixture-of-experts. i have been meaning to open source ST-MoE on the language modeling front, so maybe this is good timing. also have a few ideas for improving PKM

@francqz31

@lucidrains Nope, there isn't. I asked one of the authors; he said something about releasing an API, but they will not open source it, that's 100% for sure. The downside of an API is that I don't think it will have fine-tuning. But yeah, overall they trained it on 1000 A100s for 2 months straight. If you implement it and nbardy trains it, it will be a huge leap for the open-source community.

@lucidrains
Owner

lucidrains commented Jun 23, 2023

@francqz31 i haven't dug into the paper yet, but i think there's basically nothing to it besides adding MoE and some hand-wavy stuff about each expert being a 'painter'. i just need to do to mixture-of-experts what i did to attention, and both language and generative image / video will naturally improve if one replaces the feedforwards with them
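Roughly, swapping a transformer block's feedforward for a mixture-of-experts looks like the sketch below: a top-1 routed MoE standing in for the usual feedforward. The sizes and the gating scheme are illustrative guesses, not code from ST-MoE or this repo.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Drop-in replacement for a transformer feedforward: each token is routed to one expert."""
    def __init__(self, dim, num_experts=4, mult=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, dim) -> route each token to its top-1 expert
        b, n, d = x.shape
        flat = x.reshape(b * n, d)
        scores = self.gate(flat).softmax(dim=-1)      # (b*n, num_experts)
        top_score, top_idx = scores.max(dim=-1)
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # scale by the gate score so the router also receives gradient
                out[mask] = expert(flat[mask]) * top_score[mask].unsqueeze(-1)
        return out.reshape(b, n, d)
```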

@lucidrains
Owner

@francqz31 it was on my plate anyways, since we now know GPT4 uses mixture of experts

@lucidrains
Owner

@francqz31 do correct me if i'm wrong about that paper. i will get around to reading it (too much in the queue)

@francqz31

@lucidrains that's my pleasure, I indeed will. I even took some prompts from RAPHAEL and compared it with Midjourney v5.2; it is almost the same if not even better. But in the paper they compare with v5.1,
like this for example with v5.1:

[image: get (57)]

Prompts, in order:

  1. A cute little matte low poly isometric cherry blossom forest island, waterfalls, lighting, soft shadows, trending on
    Artstation, 3d render, monument valley, fez video game
  2. A shanty version of Tokyo, new rustic style, bold colors with all colors palette, video game, genshin, tribe, fantasy,
    overwatch.
  3. Cartoon characters, mini characters, figures, illustrations, flower fairy, green dress, brown hair, curly long hair, elf-like
    wings, many flowers and leaves, natural scenery, golden eyes, detailed light and shadow , a high degree of detail.
  4. Cartoon characters, mini characters, hand-made, illustrations, robot kids, color expressions, boy, short brown hair, curly
    hair, blue eyes, technological age, cyberpunk, big eyes, cute, mini, detailed light and shadow, high detail.

@lucidrains
Owner

@francqz31 cool! yea, i guess this is yet another testament to using mixture-of-experts or conditional computation modules

@nbardy
Contributor Author

nbardy commented Jun 23, 2023

Definitely most interested in training the upscaler.

@lucidrains do you have an idea how much work is left for the upscaler code? Looking at the paper it seems pretty similar to the base unconditioned model with some tweaks.

although the paper is light on details about the upscaler

I’m still at the same startup, Facet.

Talking to the Google team and they said the performance is very similar between PyTorch and Jax now.

@nbardy
Contributor Author

nbardy commented Jun 23, 2023

@francqz31 thanks for sharing; it's too much work to implement and train a new model architecture on a short timeline. RAPHAEL does look quite interesting, although expensive to run inference with MoE.

I'm particularly interested in the OpenMUSE training going on.

@francqz31

@nbardy no problem, don't feel any pressure. Dr. Phil might just implement it and leave it for the open-source community, if anyone else is interested. Hopefully someone will be.

@francqz31

It is more than enough that you are willing to train the upsampler. It is not easy work, plus it is the most important thing in the paper.

@lucidrains
Owner

@nbardy i'll get to it soon, but like anything in open source, no promises on timeline

@francqz31 oh please, don't address me that way. got enough of that in med school

@nbardy
Contributor Author

nbardy commented Jun 23, 2023

Happy to jump in and help.

How up to date is the TODO list? You mentioned there is some work left on the unconditioned model code still.

@lucidrains
Owner

@nbardy yea, the plan of attack was going to be to wire up hf accelerate for unconditional, following their example here, then move on to conditional, before finally tackling the upsampler modifications
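For reference, wiring up Accelerate around an unconditional GAN step could look like the minimal sketch below. `Generator`, `Discriminator`, `dataloader`, and `latent_dim` are hypothetical placeholders, not the actual gigagan-pytorch classes or the example being followed.

```python
import torch
import torch.nn.functional as F
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")

G, D = Generator(), Discriminator()               # hypothetical generator / discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.9))

# accelerate handles device placement, DDP wrapping, and mixed precision
G, D, opt_g, opt_d, dataloader = accelerator.prepare(G, D, opt_g, opt_d, dataloader)

latent_dim = 512  # assumed latent size

for real in dataloader:
    noise = torch.randn(real.shape[0], latent_dim, device=accelerator.device)
    fake = G(noise)

    # discriminator step with hinge loss
    d_loss = F.relu(1 - D(real)).mean() + F.relu(1 + D(fake.detach())).mean()
    accelerator.backward(d_loss)
    opt_d.step(); opt_d.zero_grad()

    # generator step
    g_loss = -D(fake).mean()
    accelerator.backward(g_loss)
    opt_g.step(); opt_g.zero_grad()
```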

@lucidrains
Owner

@nbardy are you planning on open sourcing the final model, or is this for commercial purposes for Facet?

@francqz31

> @francqz31 do correct me if i'm wrong about that paper. i will get around to reading it (too much in the queue)

Ok, here is a quick rundown that I hacked together, because I read the paper before.

To implement the RAPHAEL model described in this paper, here are the main steps they used:

1. Data collection and preprocessing
   - They collect a large-scale dataset of text-prompt/image pairs. The paper uses LAION-5B of course, plus some internal datasets.
   - They preprocess the images and text by removing noise, resizing images, etc.
2. Model architecture
   - The model is based on a U-Net architecture with 16 transformer blocks.
   - Each block contains: a self-attention layer, a cross-attention layer over the text prompt, a space-Mixture-of-Experts (space-MoE) layer, a time-Mixture-of-Experts (time-MoE) layer, and an edge-supervised learning module.
3. Space-MoE
   - The space-MoE layer uses experts to model the relationship between text tokens and image regions.
   - A text gate network is used to assign text tokens to experts.
   - A thresholding mechanism is used to determine the correspondence between text tokens and image regions.
   - There are 6 space experts in each of the 16 transformer blocks.
4. Time-MoE (a rough sketch of both gating schemes follows below this list)
   - The time-MoE layer uses experts to handle different diffusion timesteps.
   - A time gate network is used to assign timesteps to experts.
   - There are 4 time experts.
5. Edge-supervised learning
   - Add an edge detection module to extract edges from the input image.
   - Supervise the model using these edges and a focal loss. Pause edge learning after a certain timestep threshold.
6. Training
   - They use the AdamW optimizer with learning rate 1e-4.
   - They train for 2 months on 1000 GPUs with a batch size of 2000 and 20000 warmup steps.
   - They combine a denoising loss and an edge-supervised loss.
   - Optional: use LoRA, ControlNet, or SR-GAN for additional controls or higher resolution.
   - I think they use a private tailor-made SR-GAN model too, not the public one, but that could be replaced by the GigaGAN upsampler ;).
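For concreteness, here is a rough PyTorch sketch of the time-MoE and space-MoE routing from steps 3 and 4, under my own assumptions (hard top-1 gating, a crude mean threshold on the cross-attention map, made-up module names and sizes); this is not the RAPHAEL authors' code.

```python
import torch
import torch.nn as nn

def feedforward(dim, mult=4):
    return nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))

class TimeMoE(nn.Module):
    """Assign each sample to one of 4 experts based on its diffusion timestep embedding."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(feedforward(dim) for _ in range(num_experts))

    def forward(self, x, time_emb):                       # x: (b, n, d), time_emb: (b, d)
        idx = self.gate(time_emb).argmax(dim=-1)          # (b,) hard assignment per sample
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class SpaceMoE(nn.Module):
    """Route image regions to one of 6 experts chosen per text token by a text gate."""
    def __init__(self, dim, num_experts=6):
        super().__init__()
        self.text_gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(feedforward(dim) for _ in range(num_experts))

    def forward(self, x, text_tokens, cross_attn):
        # x: (b, img, d), text_tokens: (b, txt, d), cross_attn: (b, img, txt)
        # threshold the cross-attention map to get each text token's image region
        region = cross_attn > cross_attn.mean(dim=1, keepdim=True)            # (b, img, txt)
        assign = self.text_gate(text_tokens).argmax(dim=-1)                   # (b, txt)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_is_mine = (assign == i).unsqueeze(1)                        # (b, 1, txt)
            pixel_mask = (token_is_mine & region).any(dim=-1, keepdim=True)   # (b, img, 1)
            out = out + expert(x) * pixel_mask
        return out
```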

@lucidrains
Owner

lucidrains commented Jun 23, 2023

@francqz31 thanks for the rundown!

yea, there is nothing surprising then. mostly more attention (transformer blocks), and the experts per diffusion timestep go back to eDiff from Balaji et al

the application of space and time MoE seems to be the main novelty, but that in itself is just porting over lessons from LLMs

@nbardy
Contributor Author

nbardy commented Jun 23, 2023

> @nbardy are you planning on open sourcing the final model, or is this for commercial purposes for Facet?

Got the all clear to open source the weights.

Might finetune on some proprietary data. But the base model trained on LAION we'd release.

@lucidrains
Owner

@nbardy awesome! i will prioritize this! expect me to power through it this weekend

@nbardy
Contributor Author

nbardy commented Jun 23, 2023

🥳

@lucidrains
Owner

didn't get to it this weekend 😢 caught up with some TTS work and Pride celebrations

going to work on it this morning!

@lucidrains
Owner

@nbardy the upsampler is nothing more than a unet with some high resolution downsampling layers removed, should be straightforward!
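Something like the minimal sketch below: an asymmetric U-Net whose decoder has more upsampling stages than the encoder has downsampling stages, so a 64px input comes out at 256px. Channel counts and names are made up for illustration, skip connections are omitted for brevity, and this is not the gigagan-pytorch implementation.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.SiLU())

class UnetUpsampler(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: only 2 downsamples (the high-resolution downsampling layers are gone)
        self.enc1 = block(3, 64)
        self.down1 = nn.Conv2d(64, 128, 4, stride=2, padding=1)   # 64px -> 32px
        self.enc2 = block(128, 128)
        self.down2 = nn.Conv2d(128, 256, 4, stride=2, padding=1)  # 32px -> 16px
        self.mid = block(256, 256)
        # decoder: 4 upsamples, overshooting the input resolution by 4x
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 16px  -> 32px
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 32px  -> 64px
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),    # 64px  -> 128px
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),    # 128px -> 256px
        ])
        self.to_rgb = nn.Conv2d(16, 3, 1)

    def forward(self, lowres):
        x = self.down2(self.enc2(self.down1(self.enc1(lowres))))
        x = self.mid(x)
        for up in self.ups:
            x = torch.relu(up(x))
        return self.to_rgb(x)

# UnetUpsampler()(torch.randn(1, 3, 64, 64)).shape -> (1, 3, 256, 256)
```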

@lucidrains
Owner

ok, got the unet upsampler to a decent place, will move onwards to unconditional training tomorrow, and by week's end, conditional + unet upsampler training

@nbardy
Contributor Author

nbardy commented Jun 28, 2023

Exciting progress.

Trying to start some jobs this week and there are no actually available TPUv4s. We have the quota, but the LLM teams must be taking them all. Yet to see if we actually have compute :( or if it's a mirage.

Probably willing to pay to scale up a smaller version of this. It looks like the compute budget isn't too high for the upscaler.

@lucidrains
Owner

@francqz31 nice find!

@nbardy
Contributor Author

nbardy commented Jul 18, 2023

I was not able to find the t_local and t_global sizes in the paper.

@nbardy
Contributor Author

nbardy commented Jul 18, 2023

Reading through training details. Some notes on datasets and models size from the paper.

> with the exception of the 128-to-1024 upsampler model trained on Adobe’s internal Stock images.

That is the 8x upsampler that gives the stunning results in the paper.

Unfortunately its hyperparameters are not in the paper, but I imagine it would be about the same size, maybe a little deeper to get some higher-resolution features. It should take less compute than the text-conditioned upscalers.

Also interesting:

> Additionally, we train a separate 256px class-conditional upsampler model and combine them with end-to-end finetuning stage.

Does this mean training the text->image and upsampler models in series for fine-tuning? I hadn't noticed that before.

@lucidrains
Owner

ok, finished the text-conditioning logic for both base and upsampler

going to start wiring up accelerate probably this afternoon (as well as some hparams for more efficient recon and multi-scale losses)

@lucidrains
Owner

will also aim to get the eval for both base and upsampler done, using what @CerebralSeed pull requested as a starting point. then we can see the GAN working for some toy datasets for unconditional training

@lucidrains
Owner

@nbardy or were you planning on doing the distributed stuff with accelerate + ray today? just making sure no overlapping work

@nbardy
Contributor Author

nbardy commented Jul 19, 2023

Thanks for all the great work.

I'm happy to take the distributed stuff from here. I was hoping to have a distributed run going today on the cluster, but only got a single chip running.
I have a couple of different training scripts on my fork; one of them uses Ray and Accelerate.

Just got a webdataset script working with the upsampler on the TPU chip. It was surprisingly a pain debugging webdataset pipe errors and setting up credentials.
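For reference, a minimal WebDataset pipeline for image/caption shards looks roughly like the sketch below; the shard URL and the `jpg`/`txt` key names are illustrative assumptions, not the actual script.

```python
import webdataset as wds
from torchvision import transforms

# hypothetical bucket path; brace expansion enumerates the shard tar files
shards = "pipe:gsutil cat gs://my-bucket/laion-shards/{00000..00999}.tar"

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset(shards, handler=wds.warn_and_continue)  # skip corrupt samples instead of crashing
    .decode("pil")
    .to_tuple("jpg", "txt")
    .map_tuple(preprocess, lambda caption: caption)
    .batched(32)
)

loader = wds.WebLoader(dataset, batch_size=None, num_workers=4)

for images, captions in loader:
    pass  # feed the batch into the GigaGAN training step here
```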

@lucidrains
Owner

lucidrains commented Jul 19, 2023

@nbardy yea no problem, i know how it is. things are never straightforward in software

@CerebralSeed pull requested the sampler script and validated that the upsampler works! that should unblock you for your work

i'm going to give accelerate integration (sans ray, since i'm not familiar with it) a try today

@nbardy
Contributor Author

nbardy commented Jul 19, 2023

[image: early training samples]

Learning on the accelerated chips finally! Remarkably good results for being only 40 steps in. The last time I trained a GAN was a very long time ago.

Losses look stable.

[image: loss curves]

Looking at the XLA docs, trying to figure out the best way to network this with TPUs. Might just drop Ray 🤔 since I'm already checkpointing and tracking runs with W&B.

https://wandb.ai/nbardy-facet/gigagan/runs/zv9004dr?workspace=user-nbardy-facet

@nbardy
Contributor Author

nbardy commented Jul 20, 2023

Got started on XMP today. It's getting stuck on step 1. Most likely more device errors.

@nbardy
Contributor Author

nbardy commented Jul 20, 2023

Accelerate was giving bad crashes. Probably incompatible.

@nbardy
Contributor Author

nbardy commented Jul 20, 2023

I will talk more with Google tomorrow. They will most likely be able to help me sort this out by end of day tomorrow.

@lucidrains
Owner

@nbardy good to see some progress on your end!

for me, i was stuck on a bug in the base generator architecture, but finally got it working before bedtime

[image: sample-32]

i'm going to wire up accelerate this morning (this time for real lol) and try out that vision aided discriminator loss

@nbardy
Contributor Author

nbardy commented Jul 20, 2023

Training across 16 chips with XLA/XMP.

Logs (currently very slow because XLA is compiling the first steps and debug mode is on)
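For context, launching the training loop across the chips with torch_xla's multiprocessing looks roughly like the sketch below; `build_model`, `build_loader`, and `train_step` are hypothetical stand-ins for the real training code.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = build_model().to(device)                      # hypothetical model builder
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    loader = pl.MpDeviceLoader(build_loader(), device)    # hypothetical data loader

    for step, (images, captions) in enumerate(loader):
        loss = train_step(model, images, captions)        # hypothetical training step
        loss.backward()
        xm.optimizer_step(optimizer)                      # all-reduces grads across chips, then steps
        optimizer.zero_grad()
        if step % 100 == 0:
            xm.master_print(f"step {step} loss {loss.item():.4f}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())                            # one process per local TPU chip
```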

@nbardy
Contributor Author

nbardy commented Jul 20, 2023

And they all crash at 30 minutes :(

@lucidrains
Owner

> And they all crash at 30 minutes :(

haha yea, expected this to be not that mature

they are basically exchanging free compute for free QA

today was much smoother sailing for me; accelerate and mixed precision is working for multi-gpu on my one machine!

@randintgenr

Hi Phil,

I have been using your implementation and noticed that subpixel upsampling is giving me lower generative performance.

It is introducing checkerboard artifacts that negatively affect the quality of the generated images. To address this, I have experimented with replacing subpixel convolution with Bilinear Upsampling, and it has yielded better results.

Also, the StyleGAN generator relies on maintaining unit variance for its feature activations for effective style mixing. It is unclear if the subpixel upsampling still leads to activations that are unit variance.

@lucidrains
Owner

> Hi Phil,
>
> I have been using your implementation and noticed that subpixel upsampling is giving me lower generative performance.
>
> It is introducing checkerboard artifacts that negatively affect the quality of the generated images. To address this, I have experimented with replacing subpixel convolution with Bilinear Upsampling, and it has yielded better results.
>
> Also, the StyleGAN generator relies on maintaining unit variance for its feature activations for effective style mixing. It is unclear if the subpixel upsampling still leads to activations that are unit variance.

hey yup! i was actually going to offer this as an option as i noticed the same

defaulted it to bilinear upsample for now, controllable with this option
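For illustration, the two choices discussed here look roughly like this; the `pixel_shuffle_upsample` flag name is made up, so check the repo for the actual option it exposes.

```python
import torch.nn as nn

def upsample_block(dim, dim_out, pixel_shuffle_upsample=False):
    if pixel_shuffle_upsample:
        # subpixel convolution: conv to 4x channels, then rearrange into 2x spatial resolution
        return nn.Sequential(
            nn.Conv2d(dim, dim_out * 4, 3, padding=1),
            nn.SiLU(),
            nn.PixelShuffle(2),
        )
    # bilinear upsample followed by a conv tends to avoid checkerboard artifacts
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(dim, dim_out, 3, padding=1),
        nn.SiLU(),
    )
```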

@lucidrains
Owner

@randintgenr are you a computer vision researcher?

@lucidrains
Owner

almost done with the entire training code

@lucidrains
Owner

lucidrains commented Jul 23, 2023

ok, i think it is done, save for a few edge cases and cleanup

going to wind down work on this repo next week and move back to video gen

@lucidrains
Owner

closing, as code is there, and I know of a group moving forward with training already

@anandbhattad

Hey @lucidrains, have you heard anything about a timeline for the group that's currently training GigaGAN? I'd appreciate any information you have. Thank you!

@lucidrains
Owner

@anandbhattad yea they have proceeded, but this group will not be open sourcing it

@anandbhattad

@lucidrains, I appreciate your response. I was wondering if you knew the necessary computing power for training on the LAION-5B dataset. The paper lacks clear information on compute and time requirements for training the model (Table A2 is ambiguous). As I only have academic compute access, I am interested in exploring whether GigaGAN utilizes familiar rendering elements such as normals and depth, like we demonstrated for StyleGAN-2. Here's the link for more information: https://arxiv.org/abs/2306.00987

@CerebralSeed
Contributor

@nbardy would greatly appreciate it if you're able to share what image size and other settings you use, if you get anything that works at a size larger than 128px. TIA

@davizca

davizca commented Dec 19, 2023

@lucidrains I'm pretty sure that group is this one:
https://magnific.ai/

Or at least it seems so. If I had money and anything more than 24 GB of VRAM I would train this, but it's impossible for me, haha.

@topological-modular-forms

@nbardy Hi Nicholas! Do you still plan to train this model on LAION, or have any updates regarding it?
