## IF

You can use the lora and full dreambooth scripts to train the text to image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the [stage II upscaler](https://huggingface.co/DeepFloyd/IF-II-L-v1.0).

Note that IF has a predicted variance, and our finetuning scripts only train the model's predicted error, so for finetuned IF models we switch to a fixed variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you must also update the pipeline's scheduler config yourself.
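
A minimal sketch of that scheduler update, assuming the usual `DiffusionPipeline` / `load_lora_weights` loading flow (the LoRA weights path is a placeholder):

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")

# Load the DreamBooth LoRA weights produced by the training script (placeholder path).
pipe.load_lora_weights("<lora weights path>")

# The finetuned UNet no longer predicts a usable variance, so rebuild the scheduler
# with a fixed variance schedule such as "fixed_small".
pipe.scheduler = pipe.scheduler.__class__.from_config(
    pipe.scheduler.config, variance_type="fixed_small"
)
```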

Additionally, a few alternative cli flags are needed for IF due to the model size, the expected input resolution, and the text encoder conventions. An example launch command combining them follows the flag descriptions below.

`--resolution=64`: IF is a pixel space diffusion model. In order to operate on uncompressed pixels, the input images are of a much smaller resolution.

`--pre_compute_text_embeddings`: IF uses T5 for its text encoder. In order to save GPU memory, we pre compute all text embeddings and then de-allocate T5.

`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number.

`--text_encoder_use_attention_mask`: T5 passes the attention mask to the text encoder.

`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline with a T5 loaded from the original model.
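
A sketch of what that reload can look like, assuming the finetuned model was saved to a placeholder path and T5 is pulled from the original IF checkpoint:

```python
from transformers import T5EncoderModel
from diffusers import DiffusionPipeline

# Load the T5 text encoder from the original IF checkpoint...
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder"
)

# ...and combine it with the finetuned weights that were saved without a text encoder
# (placeholder path).
pipe = DiffusionPipeline.from_pretrained(
    "path/to/finetuned/model", text_encoder=text_encoder
)
```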

`--use_8bit_adam`: When training the full model, use 8-bit Adam to keep the optimizer states from exhausting GPU memory.
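
Putting the flags together, a stage I DreamBooth LoRA launch might look roughly like the following sketch (paths, prompt, and hyperparameter values are illustrative placeholders, not tuned recommendations):

```bash
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_if_dog_lora"

accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a sks dog" \
  --resolution=64 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --max_train_steps=1200 \
  --pre_compute_text_embeddings \
  --tokenizer_max_length=77 \
  --text_encoder_use_attention_mask
```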
### Tips and Tricks

We find LoRA to be sufficient for finetuning the stage I model as the low resolution of the model makes representing finegrained detail hard regardless.

For common and/or not-visually complex object concepts, you can get away with not finetuning the upscaler. Just be sure to adjust the prompt passed to the upscaler to remove the new token from the instance prompt. For example, if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt.
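
A sketch of how that prompt adjustment can look at inference time, assuming the standard two-stage IF pipelines and a stage I DreamBooth LoRA (the weights path is a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

# Stage I pipeline with the DreamBooth LoRA weights (placeholder path) and the
# fixed variance schedule described above.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.load_lora_weights("<lora weights path>")
stage_1.scheduler = stage_1.scheduler.__class__.from_config(
    stage_1.scheduler.config, variance_type="fixed_small"
)
stage_1.enable_model_cpu_offload()

# Untouched stage II upscaler; reuse stage I's T5 by passing prompt embeddings.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# The stage I prompt keeps the new token, the stage II prompt drops it.
prompt_embeds, negative_embeds = stage_1.encode_prompt("a sks dog")
sr_prompt_embeds, sr_negative_embeds = stage_1.encode_prompt("a dog")

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image,
    prompt_embeds=sr_prompt_embeds,
    negative_prompt_embeds=sr_negative_embeds,
    output_type="pil",
).images[0]
```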

For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than LoRA finetuning stage II.

For finegrained detail like faces, we find that lower learning rates work best.

For stage II, we find that lower learning rates are also needed.