
Dreambooth broken, possibly because of ADAM optimizer, possibly more. #712

Closed
affableroots opened this issue Oct 4, 2022 · 41 comments

@affableroots
Copy link

affableroots commented Oct 4, 2022

I think Hugging Face's Dreambooth is the only popular SD implementation that also uses Prior Preservation Loss, so I've been motivated to get it working. But the results have been terrible, and the entire model degrades regardless of the number of timesteps, learning rate, PPL on/off, number of instance samples, number of class regularization samples, etc. I've read the paper and found that they actually unfreeze everything, including the text embedder (and the VAE? I'm not sure, so I leave it frozen). So I implemented textual inversion within the dreambooth example (new token, unfreeze a single row of the embedder), which improves results considerably, but the whole model still degrades no matter what.

Someone smarter than me can confirm, but I think the culprit is ADAM:

optimizer_class = torch.optim.AdamW

My hypothesis is that since AdamW's weight decay drags all of the unet's weights toward 0, it ruins parts of the model that aren't being exercised by the finetuning data.

I've tested with weight_decay set to 0, and results seem considerably better, but I think the entire model is still degrading. I'm trying SGD next, so fingers crossed, but there may still be some dragon lurking in the depths even after removing Adam.
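
For reference, a minimal sketch of the two optimizer setups being compared, assuming the variable names used in the example script (`unet`, `args.learning_rate`); exact names may differ by version:

```python
import torch

# Option 1: keep AdamW but disable the decoupled weight decay that nudges
# every trained weight toward zero on each step.
optimizer = torch.optim.AdamW(
    unet.parameters(),
    lr=args.learning_rate,
    weight_decay=0.0,  # PyTorch's default for AdamW is 1e-2
)

# Option 2: plain SGD, which applies no weight decay unless explicitly asked to.
optimizer = torch.optim.SGD(
    unet.parameters(),
    lr=args.learning_rate,
    momentum=0.9,
)
```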

A point of reference on this journey is the JoePenna "Dreambooth" library, which doesn't implement PPL and yet preserves priors much better than this example; it also learns the instance better, is far more editable, and preserves out-of-class concepts rather well. I expect more from this huggingface dreambooth example, and I'm trying to find out why it's not delivering.

Any thoughts or guidance?

EDIT1A: SGD didn't learn the instance at 1000 steps + lr=5e-5, but it definitely preserved the priors way better (by visual inspection; the loss doesn't decrease much in any of my inversion/dreambooth experiments).

EDIT1B: Another test failed to learn using SGD at 1500 steps + lr=1e-3 + momentum=0.9. It might be trying to learn, but, not much. Priors were nicely preserved though still.

EDIT1C: 1500 steps + lr=5e2 learned great, was editable, and didn't destroy other priors!!!

EDIT2: JoePenna seems to use AdamW, so I'm not sure what's up anymore, but I'm still getting quite poor results training with this library's (huggingface's) DB example.

@osanseviero
Copy link
Member

cc @patil-suraj @apolinario

@patil-suraj
Copy link
Contributor

Hi @affableroots , thanks for the issue!

What do you mean by "the whole model still degrades"?

The DreamBooth paper uses the Imagen model, which keeps the text encoder frozen. The VAE should also be frozen.

As far as I can see, https://github.com/JoePenna/Dreambooth-Stable-Diffusion also uses prior preservation. Did you try training with it, and did it work? I would be happy to investigate.

Also note that for dreambooth we need to play around a bit with hyperparameters, i.e. the number of steps, learning rate, number of prior images, etc. Could you please post more details:

  • How many images are you using for training?
  • How many prior images?
  • The number of training steps
  • The instance prompt and class prompt
  • And if possible, the training images that you are using

This will help us debug the issue better. Thank you.

@affableroots
Copy link
Author

affableroots commented Oct 4, 2022

If I understand correctly, this paper says they finetune the entire model, although in my experiments I agree with you and have kept the VAE frozen. I guess the following could be saying they only finetune the layers downstream of the text embeddings?

image


RE JoePenna, I think his repo is an example of Textual Inversion + unfrozen unet. Quoted from his readme:

We're [now] realizing the "regularization class" bits of this code have no effect, and that there is little to no prior preservation loss.

So, out of respect to both the MIT team and the Google researchers, I'm renaming this fork to: "Unfrozen Model Textual Inversion for Stable Diffusion".

For an alternate implementation that attempts prior loss preservation, please see "Alternate Option" below.

This repo currently gives far better results than the huggingface version, but I think the HF version has more promise once fixed.


RE my params, I've tried roughly 50 permutations within the following bounds, but in general I've tried to stick close to the example params in your readme and to the params in the DB paper.

  • Batches: 1-2
  • 8bit optimizer on/off
  • Number of images: 5 - 50
  • Number of prior images: 200-1500
  • Number of steps: 500 - 5000
  • Learning rate: 1e-7 to 1e-4
  • Prior Preservation Loss on/off
  • Prompts are always of the form "a photo of CLASS" and "a photo of TOKEN CLASS", where TOKEN is sks (which struggles to train away from rifles) or a random three-letter word; more recently, I added textual inversion to your example so a new token can be added to the tokenizer (see the sketch after this list).
  • I've tried many types of images, but have settled on a test case of simply learning faces that aren't already in SD, eg, my face.
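
For context, a minimal sketch of the "new token, unfreeze a single row of the embedder" idea mentioned above, assuming the transformers `tokenizer` and `text_encoder` objects loaded by the example script; the placeholder token "<my-face>" and the init token "face" are just illustrative choices:

```python
import torch

# Register a brand-new placeholder token and grow the embedding matrix to match.
num_added = tokenizer.add_tokens("<my-face>")
assert num_added == 1
placeholder_id = tokenizer.convert_tokens_to_ids("<my-face>")
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialise the new row from an existing, related token so training starts
# from something sensible instead of random noise.
embeddings = text_encoder.get_input_embeddings()
init_id = tokenizer.convert_tokens_to_ids("face")
with torch.no_grad():
    embeddings.weight[placeholder_id] = embeddings.weight[init_id].clone()

# Freeze the text encoder except for the embedding matrix; only the new row is
# meant to change (the other rows can be protected with a gradient mask, as
# discussed later in the thread).
text_encoder.requires_grad_(False)
embeddings.weight.requires_grad_(True)
```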

RE degradation, I mean the whole model slowly becomes unusable over the course of training. For instance, given the stock params of 500 steps at lr=5e-6, it poorly learns the instance face, and all the class images visually degrade somewhat. The model forgets things it already knows and generally turns to visual noise, at any learning rate and for any number of steps, in or out of the finetuning set. Very low learning rates, e.g. 1e-7, slow the process, but then you're not learning your instance either.

For inference I always test with: "a photo of token class", "a black and white sketch of token class", and "a photo of johnny depp". So I've got in there the exact prompt I trained with, a prompt that tests generalization, and a non-instance prompt to see if it's forgetting other faces. Johnny Depp degrades over the course of training.


Given my limited understanding, could this all be due to weight regularization performed by Adam, decaying the weights? I seem to have better results using SGD, although it learns slowly.

@jslegers
Copy link

jslegers commented Oct 4, 2022

I'm still getting quite poor results training with this library's (huggingface's) DB example.

I've been using ShivamShrirao's fork of this repo for several days now. With my own Google Colab notebook (which adds additional configuration options & streamlines the whole process), I've managed to generate some pretty decent renders of my younger self...

Some of the renders generated

image

Actual photos of me used as input for the training process

image

@affableroots
Copy link
Author

@jslegers Thanks for the follow-up! I've had results that look fairly similar, but I think everything's still degrading, at least on my end. For instance:

  • "a photo of" type prompts look a bit overcooked for the trained instance, as well as out-of class instances (unless my lr is low enough to not learn the instance)
  • How do results look for random actors, compared to pre-training? Mine tend to degrade considerably, and often look "overcooked", ie that look things get at real high CFG.
  • And how about using your likeness in non-portrait situations like, playing a sport, or an instrument, etc.?

@affableroots
Copy link
Author

affableroots commented Oct 5, 2022

Oh, another symptom that I think indicates something is wrong: I have not been able to overtrain. What I mean is, given a very high lr and/or high step count, everything sort of degrades without ever looking like my new instance. Faces melt and drift away from their priors, e.g. actors' faces are no longer recognizably theirs and become very indistinct as faces. I notice photos degrade quicker than artwork.

By contrast, I feel like JoePenna's version is easy to overtrain. Sure, the trained concept leaks into everything, but I'm using this as a sort of diagnostic tool: in his version, an overtrained concept looks unmistakably like that concept (even if it increasingly fails to generalize).

Is there no merit to the weight decay hypothesis? It still sticks out in my mind. The way I see it: if you have a whole universe of latent space but only update the small corner that looks like your face, while still regularizing the entire thing toward 0, you're going to degrade everything. That would explain why overtraining doesn't work, why generalization suffers, why learning a concept is hard in the first place, etc.

Given my hypothesis, with a low step count you could learn your concept without degrading things too much, as @jslegers and others (and myself) have demonstrated, but it still doesn't seem like a quality finetuning for all the reasons I've mentioned, i.e. how the model degrades.

Or I could be crazy and running this all wrong?

@jslegers
Copy link

jslegers commented Oct 5, 2022

but I think everythings still degrading at least on my end.

For my use case, I noticed that the diversity of the class "man" started declining significantly. At some point, all random class images generated during the training process look identical.

I haven't really looked much into the degrading of my "environment", though, as I was too focused on trying to fix my likeness. But it does indeed appear to be degrading as training increases.

How do results look for random actors, compared to pre-training? Mine tend to degrade considerably, and often look "overcooked", ie that look things get at real high CFG.

Is this that "overcooked" look you're talking about?

image

Nightmare fuel, if you ask me...

And how about using your likeness in non-portrait situations like, playing a sport, or an instrument, etc.?

I haven't really invested much time in doing full body renders, but my best models did produce some OK results :

image

The way I see it: if you have a whole universe of latent space but only update the small corner that looks like your face, while still regularizing the entire thing toward 0

That seems to be what's happening, yes.

As a sidenote, I'm struggling to find a way to build upon previous trainings. When I have a general "johnslegers" concept, for example, I'd like to build upon that training by training e.g. the concept "close-up of johnslegers" or "detailed portrait of johnslegers", so as to refine the model further. But this doesn't seem possible. Attempts to achieve this by tinkering with my class prompt, instance prompt, class name, or concept name generally lead to a loss of the previous training and seriously degrade the quality of my models.

@affableroots
Copy link
Author

affableroots commented Oct 5, 2022

Thank you for running that again! It seems we have similar results, though your generalization attempts do look better than what I've been achieving, even given the stock params from the readme. And yes, that's the overcooking I'm talking about. It looks specifically like certain latent values are... clipping, if that's the right term.

try to refine the model further

Yeah, I imagine that's impossible if the entire latent space is getting regularized and forgetting everything it learned in its original training. I've turned off weight_decay, and while it seems to help, the whole latent space gets degraded anyway.

So then I thought maybe the Prior Preservation Loss was affecting things incorrectly, and I turned it off. Ostensibly that should make it a run-of-the-mill finetuning of just the unet, I think. I would expect that training without PPL would simply, for instance, pull all faces toward your face, but not destroy everything. But all faces somehow just get cooked!
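
For anyone following along, this is roughly how prior preservation folds into the loss in the example script (paraphrased from train_dreambooth.py, so details may differ by version); turning PPL off simply drops the second term:

```python
import torch
import torch.nn.functional as F

# Each batch stacks instance images and class ("prior") images; split the
# predictions and targets back apart and weight the prior term separately.
model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
target, target_prior = torch.chunk(target, 2, dim=0)

instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
loss = instance_loss + args.prior_loss_weight * prior_loss
```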

Hypothesis: I strongly suspect weight decay is an issue, plus some other hidden factor.

@jslegers
Copy link

jslegers commented Oct 5, 2022

Yeah, I imagine that's impossible if the entire latent space is getting regularized and forgetting everything it learned in its original training.

Probably.

I guess I'd better just start over from the original SD model and play around with different image files & config settings to reproduce at minimum the quality I've achieved so far, rather than continue tinkering with existing models...

@patil-suraj
Copy link
Contributor

@affableroots In the paper, by "all layers" they mean all layers of the unet model are trained; the rest, i.e. the text encoder, is kept frozen.

Looking at the discussion, and also from my experiments, the model does run into overfitting and catastrophic forgetting. This seems likely because we finetune the whole model with only a few images. For now I don't know of any way to avoid this; I am going to discuss it with the author and let you know if we find anything.

To sum up, I don't think the script is broken; rather, the model quickly overfits.

@affableroots
Copy link
Author

I appreciate your attention on this @patil-suraj!

@jslegers
Copy link

jslegers commented Oct 5, 2022

Looking at the discussion, and also from my experiments, the model does run into overfitting and catastrophic forgetting. This seems likely because we finetune the whole model with only a few images. For now I don't know of any way to avoid this; I am going to discuss it with the author and let you know if we find anything.

I suspect overfitting and other related issues can be reduced by finetuning your number of training steps, class prompt, number and variety of class images generated, quality and variety of your input images, etc.

I've been experimenting with various strategies: use just a handful of pics, use many pics, auto-generate the class pics, choose class pics myself, do only 250 training steps per run, do 2000 training steps each time, etc.

My results after just a single run from the original SD model vary significantly depending on these criteria, and I have yet to determine any pattern.

We should develop some best practices on how to optimize the model we're creating while keeping degradation of the latent space minimal. If anyone can provide the info we all seem to be missing, it's the original author...

@affableroots
Copy link
Author

affableroots commented Oct 5, 2022

While we're exploring the tuning of DB from all angles, I'll mention some recently decent results. I'm using my modified Dreambooth + Textual Inversion:

  • 5 new tokens+embeddings
  • 2000 steps
  • lr=5e-5 for text embeddings
  • lr=5e-6 for unet
  • weight decay=0.0
  • prior loss weight=0.1
  • no. class images=1500

I will say that faces are still getting cooked, so, I'm going to play with this some more.

My theory is that if you combine DB + TI, you can ask TI to learn more aggressively without risking damage to the rest of the model (because you're only learning brand-new embeddings that don't cause other domains to forget), while still training the unet a little to help out where TI is struggling. A sketch of the two-learning-rate setup follows below.
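
A rough sketch of that two-learning-rate setup, assuming the `text_encoder` and `unet` objects from the example script; this is how it could be expressed, not necessarily how the gist does it:

```python
import torch

# New text embeddings learn fast, the unet learns slowly; weight decay is off,
# matching the settings listed above.
optimizer = torch.optim.AdamW(
    [
        {"params": text_encoder.get_input_embeddings().parameters(), "lr": 5e-5},
        {"params": unet.parameters(), "lr": 5e-6},
    ],
    weight_decay=0.0,
)
```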

@jslegers
Copy link

jslegers commented Oct 7, 2022

I'm using my modified Dreambooth + Textual Inversion

Any chance you can make your modified script available for the rest of us, so we can test it ourselves?

@affableroots
Copy link
Author

affableroots commented Oct 7, 2022

@jslegers Forgive the mess: https://gist.github.com/affableroots/a36a74287c8eb2da438a459795b158d6
I flew through it to clean it up and haven't tested again since, so, hopefully I didn't mess anything up but let me know if so!

This is TI + unet unfreezing + Prior Preservation Loss, adapted from this repo's train_dreambooth.py. The text embeddings can learn at a different rate than the unet. Hit me up with any questions, and good luck!

Also, I think this is the main difference with Joe Penna's "dreambooth": his repo includes TI intrinsically, but without proper embeddings for new tokens, so you have to train over an existing token, e.g. "sks".

@affableroots
Copy link
Author

@patil-suraj I know you're slammed (I see you on every support ticket on this repo!), but is there any chance you've had an opportunity to ask the Dreambooth authors about the catastrophic forgetting + overfitting we've been facing? My pet theory is that it has to do with weight decay, but whatever it is, I'm just curious! No rush of course, and thanks for all your hard work!

@patrickvonplaten
Copy link
Contributor

Gently pinging @patil-suraj here again :-)

@jslegers
Copy link

jslegers commented Oct 10, 2022

I did some testing regarding the impact of Dreambooth on different prompts, using the same seed.

Pretty much all of my tests produced results similar to this when running Dreambooth with class "man" and concept "johnslegers":

image

I've tried different configs, but to no avail. The degradation persists no matter how many input pics I use, how many class pics I use, what value I use for prior preservation, etc.

@affableroots
Copy link
Author

@jslegers did you have a chance to test that script that adds Textual Inversion to the mix? I haven't cracked the code yet, but so far my successes come from: strong TI training (1e-3ish?), weak unet training (1e-7ish?), no weight decay, and a low PPL weight (0.1ish?).

Oh, and I do actually keep a strong weight decay on the text embeddings, because only the new tokens are trained. I have a hideous hack in there to limit updates to just my new tokens (a sketch of it follows below), so it's not as dangerous as decaying the unet; it doesn't seem to really slow anything down and it probably helps, so I keep it.
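
The gradient-masking hack probably looks something like the snippet below (a reconstruction, not the gist verbatim): after loss.backward() and before optimizer.step(), zero the embedding gradients for every row except the newly added tokens. `new_token_ids` is a hypothetical name for the ids returned when the placeholder tokens were added.

```python
import torch

# Zero out gradients for every embedding row except the new tokens, so the
# optimizer step only moves the rows we intend to train.
grad = text_encoder.get_input_embeddings().weight.grad
if grad is not None:
    keep = torch.zeros(grad.shape[0], dtype=torch.bool, device=grad.device)
    keep[new_token_ids] = True
    grad[~keep] = 0.0
```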

@jslegers
Copy link

did you have a chance to test that script that adds Textual Inversion to the mix?

Not yet. I'm planning to run some tests on Textual Inversion soon... both as a standalone approach, and using the script you provided.

I wanted to first complete my test runs with Dreambooth alone, to see if I could find a way to reduce degradation by e.g. increasing the number of class pics, reducing the prior loss weight or the Adam weight decay, using a different class (I tried "Jason Molina", who I allegedly kinda look like), or anything else I could think of. But nothing I tried produced better results. At least it allowed me to compile the overview above of the typical impact my test runs had.

@affableroots
Copy link
Author

affableroots commented Oct 11, 2022

@jslegers Have you by chance tried the JoePenna repo? I'm still trying to pin down the difference, but, I think it works better, and I don't know why.

They're starting to look identical to me, so I don't know where the difference in perceptual quality lies. The code is semantically identical, I think; JoePenna's lib:

  • cond stage (text) unfrozen.
  • does use PPL
  • VAE remains frozen. I thought this might not be the case because it "unfreezes the whole model", but I think it does indeed stay frozen.
  • regularization weight = 1.0
  • AdamW shares lr
  • AdamW never sets weight_decay, leaving it at PyTorch's 1e-2 default
  • both libs use DDPM
  • what is num_train_timesteps in DDPM? It might differ between the two libs (diffusers=1000 vs. joepenna=1)

EDIT: We must get this library to work; the Shivam fork of it can happily train 1024x1024 at batch size 3 on a 24GB card.

EDIT2: Could that DDPM num_timesteps be the issue?

The paper

Coarse-to-fine interpolation. Figure 9 shows interpolations between a pair of source CelebA 256×256 images as we vary the number of diffusion steps prior to latent space interpolation. Increasing the number of diffusion steps destroys more structure in the source images, which the model completes during the reverse process. This allows us to interpolate at both fine granularities and coarse granularities. In the limiting case of 0 diffusion steps, the interpolation mixes source images in pixel space. On the other hand, after 1000 diffusion steps, source information is lost and interpolations are novel samples.

Fig 7 also elucidates timesteps a bit.

So if I understand the paper and these two repos, HF-diffusers runs 1000 steps of noise and says "try and denoise that, bwahaha!", whereas JoePenna runs 1 step of noise, saying "let's play on easy mode".

EDIT3: I tried a short 10-minute round of training with num_timesteps=1 instead of 1000, and the loss is a helluva lot smoother. Before it was perfect zigzags; now, after an initial transient, it's monotonically decreasing. That's a good sign.
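
For anyone wondering what num_train_timesteps actually does during training, this is roughly the relevant part of the diffusers example loop (paraphrased, so details may differ by version): a timestep is drawn uniformly from [0, num_train_timesteps), that much noise is added, and the unet has to predict it. With num_train_timesteps=1 every sample is only lightly noised, which would explain the much smoother loss.

```python
import torch
import torch.nn.functional as F

bsz = latents.shape[0]
# Sample a random timestep per image; num_train_timesteps bounds the range.
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
).long()

# Forward-diffuse the latents to that timestep and train the unet to predict
# the added noise.
noise = torch.randn_like(latents)
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
```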

@patrickvonplaten
Copy link
Contributor

Very interesting! @patil-suraj let's try this one out :-)

@affableroots
Copy link
Author

affableroots commented Oct 11, 2022

Hm, I'm not sure I understand DDPM well yet; that timestep thing may be a red herring.

Could it be a difference in the DDPM's beta schedule type: diffusers' "scaled_linear" vs JP's "linear"?

If I throw enough mud at the wall, some will stick right?

EDIT: No, JP's "linear" is the same as our "scaled_linear"
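
A quick sketch of how the two schedules are constructed, as I understand the code (the numbers are the usual SD defaults), which is why the CompVis/LDM "linear" and the diffusers "scaled_linear" end up identical:

```python
import torch

beta_start, beta_end, n = 0.00085, 0.012, 1000

# diffusers "linear": betas spaced linearly between beta_start and beta_end.
betas_linear = torch.linspace(beta_start, beta_end, n)

# diffusers "scaled_linear" (== CompVis/LDM "linear"): linear in sqrt(beta),
# then squared.
betas_scaled = torch.linspace(beta_start**0.5, beta_end**0.5, n) ** 2
```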

I'm stumped.

@apolinario
Copy link
Contributor

@affableroots, besides the loss looking smoother, what were your actual perceptual impressions from this experiment?

@patil-suraj
Copy link
Contributor

Thanks a lot @affableroots, and sorry for being late here. Running some experiments today, also comparing against other codebases. Will post my findings here soon.

@nerdyrodent
Copy link

nerdyrodent commented Oct 12, 2022

I'm using Shivam's DB and I don't get that blotchy look unless I put the guidance scale above 30 during inference (or above 10 on something like LMS). All the faces do start to tend towards my face though, which is nightmare fuel all of its own! Tried with and without fp16 and varying amounts of class images. I've also tried made-up classes, such as §. It just seems to recognise a face as a face!

@matteoserva
Copy link

The blotchy look disappears if I increase the timesteps at inference time to a high value, like 250.

@nerdyrodent
Copy link

nerdyrodent commented Oct 12, 2022

Been testing using my face all day with a variety of optimisers, because reasons :) Things learnt so far:

  • The defaults seem very good so far, with 800-1800 steps being OK for a face. They give perfectly usable results in 10 minutes.
  • However, my face bleeds into all the other faces. Some (Abraham Lincoln, Donald Trump) are impacted less.
  • It for sure changes the whole model regardless of "class". This can be easily seen by switching to something like DiffGrad.
  • A slightly lower LR (4e-6) with more steps (1600) and more preservation images seems to lessen the leakage into other faces (even if the class is just a symbol):

Default Johnny Depp (seed 1201043):
image

With DB model (defaults), 200 preservation images, fp16, 1000 steps:
image
I can really see my features in there!

With DB model, 1200 preservation images, fp16, 1600 steps, 4e-6:
image
Not so much me :)

@compustar
Copy link

@nerdyrodent, have you tried 1e-6 for 2000 steps, which is basically JoePenna's default training configuration?

@ghost
Copy link

ghost commented Oct 14, 2022

This "overcooked" look is actually the opposite, that happened to me after training for several steps (3000+) and trying to render using 50 inference steps or so. If you look at this image, the early steps for some sampling methods look kinda similar to the bleeding we are talking about.

2022101401

I solved the issue:

  • Increasing the inference steps (200+).
  • Augmenting my images by flipping, rotating, and adjusting brightness, contrast, or saturation (see the sketch below).
  • Adding more regularization images, 180 for this case.

This way I can train for more steps and reduce the influence over the rest of the model. I hope this info is useful; I can't promise it solves the whole problem, but I'm getting consistent results after 1000-2000 training steps.

2022101402
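
A minimal sketch of the augmentations described above, using torchvision; the trailing resize/crop/normalize mirrors what the training script already does, and the exact jitter values are assumptions:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                              # flips
    transforms.RandomRotation(degrees=10),                               # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # colour jitter
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])
```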

@nerdyrodent
Copy link

@compustar - just tried 2000 steps at 1e-6 and the results weren't as good as the default 800 steps at 5e-6.

@zcorley
Copy link

zcorley commented Oct 16, 2022

Great thread you started here @affableroots

I've been trying to run your TI-Dreambooth-Tokenizer code, and I am getting errors about accessing huggingface-cli that I don't get when I run Shivam's version of the code. Specifically:

OSError: CompVis/stable-diffusion-v1-4 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login.
Traceback (most recent call last):

I am logged in via huggingface-cli and have tried different tokens, with no success.
I tried downloading the files locally and pointing the script there, but eventually it asks for a config file I cannot find; this seems to be an access problem rather than a file-location problem.

Do you (or anyone) have an updated version, or do you know how I can fix this?

@d8ahazard
Copy link
Contributor

JoePenna "Dreambooth"

You need to download the diffusers model, which is a repo with multiple directories and files. The .ckpt is for regular stable diffusion, etc.

@patrickvonplaten
Copy link
Contributor

Think @patil-suraj will very soon release a nice update of dreambooth :-)

@affableroots
Copy link
Author

@patil-suraj Did you figure out what the issue was then? I've been curious.

@matteoserva
Copy link

It might be related to this PR: #883

@patil-suraj
Copy link
Contributor

Hi everyone! Sorry to be so late here.

We ran a lot of experiments with the script to see if there are any issues or if it's broken. It turns out we need to carefully pick hyperparameters like the LR and number of training steps to get good results with dreambooth.

Also, the main reason the results with this script were not as good as the CompVis forks is that the text_encoder is trained in those forks, and that makes a big difference in quality, especially on faces.

We compiled all our experiments in this report, and also added an option to train the text_encoder in the script which can be enabled by passing the --train_text_encoder argument.

Note that if we train the text_encoder, training won't fit on a 16GB GPU anymore; it will need at least 24GB of VRAM. It should still be possible to do it on 16GB using DeepSpeed, but it will be slower.
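
Roughly what the new flag changes inside the script (paraphrased, so details may differ by version): the text encoder is unfrozen and its parameters are handed to the optimizer alongside the unet's, which is where the extra memory goes.

```python
import itertools

# With --train_text_encoder, both models' parameters are optimized together;
# otherwise only the unet is trained.
params_to_optimize = (
    itertools.chain(unet.parameters(), text_encoder.parameters())
    if args.train_text_encoder
    else unet.parameters()
)
optimizer = optimizer_class(params_to_optimize, lr=args.learning_rate)
```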

Please take a look at the report, hope you find it useful.

@matteoserva
Copy link

matteoserva commented Oct 19, 2022

I tested ShivamShrirao's fork with a 12GB card and it trained without errors. It seems that the text_encoder's weights have been updated. I haven't had time to compare the results of the two forks.
My parameters: with 8bit_adam, without caching latents, with prior preservation, without mixed precision, with gradient checkpointing, with text encoder training.

@d8ahazard
Copy link
Contributor

d8ahazard commented Oct 19, 2022 via email

@affableroots
Copy link
Author

Thanks for the review @patil-suraj! It sounds like the difference was mostly in unfreezing the text encoder, and that LR + number of steps makes a small target to hit; your experiments help us aim better.

@10pratik
Copy link

@affableroots I am getting good outputs from Dreambooth by training on images of a certain face. However, I am not getting consistently good results, i.e. the model doesn't reproduce the same face; it changes it a little in some instances.
Even when I take the best output and just change the prompt, the face changes a bit.

Any idea on how to solve the issue?
