# DreamBooth

**DreamBooth** is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images.

We will explore the [`train_dreambooth.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) training script.

As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/dreambooth
pip install -r requirements.txt
```
Hugging Face Accelerate also helps with training on multiple GPUs and with mixed precision:
```bash
pip install accelerate
```

Now we can initialize a Hugging Face Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (a notebook, for example), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

## Script parameters

**DreamBooth is very sensitive to training hyperparameters, and it is easy to overfit.**

The training script provides many parameters to customize the training run. We can find all of the parameters and their descriptions in the `parse_args()` function.

To speed up training with mixed precision using the `bf16` format, add the `--mixed_precision` parameter to the training command:
```bash
accelerate launch train_dreambooth.py --mixed_precision="bf16"
```

Some basic and important parameters to specify:
* `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
* `--instance_data_dir`: path to a folder containing the training dataset (example images)
* `--instance_prompt`: the text prompt that contains the special word for the example images
* `--train_text_encoder`: whether to also train the text encoder
* `--output_dir`: where to save the trained model
* `--push_to_hub`: whether to push the trained model to the Hub
* `--checkpointing_steps`: how often to save a checkpoint as the model trains; if training is interrupted, we can continue from a saved checkpoint by adding `--resume_from_checkpoint` to the training command
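
To make the checkpoint-resume idea concrete, here is a minimal sketch of how the most recent checkpoint could be located. It assumes checkpoints are saved as `checkpoint-<step>` folders inside the output directory; the helper name `find_latest_checkpoint` is hypothetical and is not part of the training script.

```python
import os
import tempfile


def find_latest_checkpoint(output_dir):
    """Return the name of the highest-numbered `checkpoint-<step>` folder, or None."""
    candidates = [d for d in os.listdir(output_dir) if d.startswith("checkpoint-")]
    if not candidates:
        return None
    # Compare step numbers as integers so checkpoint-1000 ranks after checkpoint-500.
    return max(candidates, key=lambda name: int(name.split("-")[1]))


# Simulate an output directory (assumed layout) holding three checkpoints.
with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 1500):
        os.mkdir(os.path.join(output_dir, f"checkpoint-{step}"))
    latest = find_latest_checkpoint(output_dir)
    print(latest)
```

Sorting on the integer step rather than on the folder name avoids the lexicographic trap where `checkpoint-500` would rank above `checkpoint-1500`.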

### Min-SNR weighting

The **Min-SNR** weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.

Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
```bash
accelerate launch train_dreambooth.py --snr_gamma=5.0
```
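
To see what the rebalancing does, the sketch below computes per-timestep loss weights with the formulas from the Min-SNR paper, using plain Python floats. It is an illustration only: the function names are hypothetical, and the training script's tensor implementation may differ in detail.

```python
def snr_from_alpha_bar(alpha_bar):
    """Signal-to-noise ratio for a cumulative schedule product alpha_bar in (0, 1)."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Loss weight per the Min-SNR paper: clamp the SNR at gamma, then normalize."""
    clamped = min(snr, gamma)
    if prediction_type == "epsilon":
        return clamped / snr
    if prediction_type == "v_prediction":
        return clamped / (snr + 1.0)
    raise ValueError(f"Unsupported prediction type: {prediction_type}")


low_noise_snr = snr_from_alpha_bar(0.99)  # nearly clean sample, very high SNR
high_noise_snr = snr_from_alpha_bar(0.5)  # equal parts signal and noise, SNR of 1

# Low-noise timesteps are down-weighted; high-noise timesteps keep full weight.
print(min_snr_weight(low_noise_snr), min_snr_weight(high_noise_snr))
```

With `gamma=5.0`, any timestep whose SNR exceeds 5 contributes less to the epsilon-prediction loss, which keeps the easy, nearly clean timesteps from dominating training.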

### Prior preservation loss

**Prior preservation loss** is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images we provided, they help the model retain what it has learned about the class and reuse that knowledge to make new compositions.

Some parameters:
* `--with_prior_preservation`: whether to use prior preservation loss
* `--prior_loss_weight`: controls the influence of the prior preservation loss on the model
* `--class_data_dir`: path to a folder containing the generated class sample images
* `--class_prompt`: the text prompt describing the class of the generated sample images

Example `bash` script:
```bash
accelerate launch train_dreambooth.py \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --class_data_dir="path/to/class/images" \
  --class_prompt="text prompt describing class"
```
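
As a rough illustration of how `--prior_loss_weight` enters the objective, the sketch below splits a batch into an instance half and a class half and combines their losses. It uses plain lists in place of tensors, and the helper names are hypothetical, so treat it as a simplified model of the idea rather than the script's code.

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(pred, target, prior_loss_weight=1.0):
    """Combine the instance loss with the weighted prior preservation loss."""
    half = len(pred) // 2
    # First half of the batch: instance images; second half: class images.
    instance_loss = mse(pred[:half], target[:half])
    prior_loss = mse(pred[half:], target[half:])
    return instance_loss + prior_loss_weight * prior_loss


pred = [1.0, 2.0, 0.0, 0.0]
target = [0.0, 0.0, 1.0, 1.0]
total = dreambooth_loss(pred, target, prior_loss_weight=0.5)
print(total)
```

A larger weight pulls the model toward what it already knows about the class, while a smaller one lets the instance images dominate.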

### Train text encoder

To improve the quality of the generated outputs, we can also train the text encoder in addition to the UNet. This requires additional memory, so we will need a GPU with at least 24GB of VRAM.

Example `bash` script:
```bash
accelerate launch train_dreambooth.py --train_text_encoder
```

## Training script

DreamBooth comes with its own dataset classes:
* `DreamBoothDataset`: preprocesses the images and class images, and tokenizes the prompts for training
* `PromptDataset`: prepares the prompts used to generate the class images

If we enable the prior preservation loss, the class images are generated here:
```python
    # Generate class images if prior preservation is enabled.
    if args.with_prior_preservation:
        class_images_dir = Path(args.class_data_dir)
        if not class_images_dir.exists():
            class_images_dir.mkdir(parents=True)
        cur_class_images = len(list(class_images_dir.iterdir()))

        if cur_class_images < args.num_class_images:
            torch_dtype = torch.float16 if accelerator.device.type == "cuda" else torch.float32
            if args.prior_generation_precision == "fp32":
                torch_dtype = torch.float32
            elif args.prior_generation_precision == "fp16":
                torch_dtype = torch.float16
            elif args.prior_generation_precision == "bf16":
                torch_dtype = torch.bfloat16
            pipeline = DiffusionPipeline.from_pretrained(
                args.pretrained_model_name_or_path,
                torch_dtype=torch_dtype,
                safety_checker=None,
                revision=args.revision,
            )
            pipeline.set_progress_bar_config(disable=True)

            num_new_images = args.num_class_images - cur_class_images
            logger.info(f"Number of class images to sample: {num_new_images}.")

            sample_dataset = PromptDataset(args.class_prompt, num_new_images)
            sample_dataloader = torch.utils.data.DataLoader(sample_dataset, batch_size=args.sample_batch_size)

            sample_dataloader = accelerator.prepare(sample_dataloader)
            pipeline.to(accelerator.device)

            for example in tqdm(
                sample_dataloader, desc="Generating class images", disable=not accelerator.is_local_main_process
            ):
                images = pipeline(example["prompt"]).images

                for i, image in enumerate(images):
                    hash_image = hashlib.sha1(image.tobytes()).hexdigest()
                    image_filename = class_images_dir / f"{example['index'][i] + cur_class_images}-{hash_image}.jpg"
                    image.save(image_filename)
```
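
The top-up logic in the snippet above can be isolated into two small helpers. This is a sketch with hypothetical function names and fake image bytes; it only mirrors the counting and file-naming steps, not the image generation itself.

```python
import hashlib


def plan_class_images(existing_count, num_class_images):
    """Return how many class images still need to be generated."""
    return max(num_class_images - existing_count, 0)


def class_image_filename(index, existing_count, image_bytes):
    """Build a name from the running index plus the SHA-1 digest of the image bytes."""
    digest = hashlib.sha1(image_bytes).hexdigest()
    return f"{index + existing_count}-{digest}.jpg"


missing = plan_class_images(existing_count=40, num_class_images=100)
name = class_image_filename(index=0, existing_count=40, image_bytes=b"fake image bytes")
print(missing, name)
```

Offsetting the index by the number of existing images means a rerun continues the numbering instead of overwriting earlier files.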

The `main()` function handles setting up the dataset for training and the training loop itself. The script loads the `tokenizer`, `scheduler`, and `models`:
```python
    # Load the tokenizer
    if args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, revision=args.revision, use_fast=False)
    elif args.pretrained_model_name_or_path:
        tokenizer = AutoTokenizer.from_pretrained(
            args.pretrained_model_name_or_path,
            subfolder="tokenizer",
            revision=args.revision,
            use_fast=False,
        )

    # import correct text encoder class
    text_encoder_cls = import_model_class_from_model_name_or_path(args.pretrained_model_name_or_path, args.revision)

    # Load scheduler and models
    noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
    text_encoder = text_encoder_cls.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
    )

    if model_has_vae(args):
        vae = AutoencoderKL.from_pretrained(
            args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision
        )
    else:
        vae = None

    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
    )
```

Then, we create the training dataset and dataloader from `DreamBoothDataset`:
```python
    # Dataset and DataLoaders creation:
    train_dataset = DreamBoothDataset(
        instance_data_root=args.instance_data_dir,
        instance_prompt=args.instance_prompt,
        class_data_root=args.class_data_dir if args.with_prior_preservation else None,
        class_prompt=args.class_prompt,
        class_num=args.num_class_images,
        tokenizer=tokenizer,
        size=args.resolution,
        center_crop=args.center_crop,
        encoder_hidden_states=pre_computed_encoder_hidden_states,
        class_prompt_encoder_hidden_states=pre_computed_class_prompt_encoder_hidden_states,
        tokenizer_max_length=args.tokenizer_max_length,
    )

    train_dataloader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.train_batch_size,
        shuffle=True,
        collate_fn=lambda examples: collate_fn(examples, args.with_prior_preservation),
        num_workers=args.dataloader_num_workers,
    )
```
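
The `collate_fn` passed to the dataloader decides how instance and class examples end up in one batch. The sketch below is a simplified stand-in that uses plain lists and hypothetical dictionary keys instead of the script's tensors, to show the ordering when prior preservation is enabled.

```python
def collate_fn(examples, with_prior_preservation=False):
    """Simplified collate step: plain lists stand in for stacked tensors."""
    prompts = [ex["instance_prompt"] for ex in examples]
    images = [ex["instance_image"] for ex in examples]
    if with_prior_preservation:
        # Class examples are appended after the instance examples so that
        # a single forward pass covers both halves of the batch.
        prompts += [ex["class_prompt"] for ex in examples]
        images += [ex["class_image"] for ex in examples]
    return {"prompts": prompts, "images": images}


examples = [
    {"instance_prompt": "a photo of sks dog", "instance_image": "dog_0",
     "class_prompt": "a photo of a dog", "class_image": "class_0"},
    {"instance_prompt": "a photo of sks dog", "instance_image": "dog_1",
     "class_prompt": "a photo of a dog", "class_image": "class_1"},
]
batch = collate_fn(examples, with_prior_preservation=True)
print(batch["images"])
```

Because the class examples always follow the instance examples, the loss computation can later split the batch in half to recover the two groups.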

Finally, the training loop takes care of the remaining steps such as converting images to latent space, adding noise to the input, predicting the noise residual, and calculating the loss.
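
The core of one training step can be sketched with plain Python lists standing in for latent tensors. The example assumes the standard DDPM forward process, where `alpha_bar` is the cumulative noise-schedule product at the sampled timestep; the function names are illustrative and do not come from the script.

```python
import math
import random


def add_noise(x0, noise, alpha_bar):
    """Forward diffusion: mix clean values with noise according to alpha_bar."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between the predicted and the sampled noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


random.seed(0)
latents = [0.5, -0.25, 1.0]  # stand-in for VAE latents
noise = [random.gauss(0.0, 1.0) for _ in latents]
noisy_latents = add_noise(latents, noise, alpha_bar=0.7)

# A perfect epsilon-prediction model would return the sampled noise exactly.
perfect_loss = mse(noise, noise)
print(noisy_latents, perfect_loss)
```

In the real loop the prediction comes from the UNet, conditioned on the noisy latents, the timestep, and the text encoder's hidden states.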

## Launch the script

In this guide, we will download some images from the `dog` dataset and store them in a directory.
```python
from huggingface_hub import snapshot_download

local_dir = './dog'
snapshot_download(
    'diffusers/dog-example',
    local_dir=local_dir,
    repo_type='dataset',
    ignore_patterns='.gitattributes'
)

```

If we want to follow along with the training process, we can periodically save generated images as training progresses. Add the following parameters to the training command:
```bash
  --validation_prompt="a photo of a sks dog"
  --num_validation_images=4
  --validation_steps=100
```

On a 12GB GPU, we need the `bitsandbytes` 8-bit optimizer, gradient checkpointing, and xFormers, and we need to set the gradients to `None` instead of zero to reduce memory usage:
```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export INSTANCE_DIR="./dog"
export OUTPUT_DIR="path_to_saved_model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --set_grads_to_none \
  --push_to_hub
```

On a 16GB GPU, we can use the `bitsandbytes` 8-bit optimizer and gradient checkpointing to reduce memory usage:
```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export INSTANCE_DIR="./dog"
export OUTPUT_DIR="path_to_saved_model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --push_to_hub
```

Once training is complete, we can use our newly trained model for inference:

```python
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "path_to_saved_model",
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")
```

```python
image = pipeline(
    "A photo of sks dog in a bucket",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
image
```

## LoRA

LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and the resulting weights are easier to store because they are much smaller (~100MB).

We can use the [`train_dreambooth_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py) script to train with LoRA.

## SDXL

SDXL is a text-to-image model that generates high-resolution images, and it adds a second text encoder to its architecture.

We can use the [`train_dreambooth_lora_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py) script to train an SDXL model with LoRA.

## DeepFloyd IF

**DeepFloyd IF** is a cascading pixel diffusion model with three stages. The first stage generates a base image, and the second and third stages progressively upscale the base image into a high-resolution 1024x1024 image.

We can use [`train_dreambooth_lora.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py) to train a DeepFloyd IF model with LoRA, or use [`train_dreambooth.py`](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) to train the full DeepFloyd IF model.

DeepFloyd IF uses predicted variance, but the Diffusers training scripts use predicted error, so the trained DeepFloyd IF models are switched to a fixed variance schedule. The training scripts will update the scheduler config of the fully trained model for us. However, when we load the saved LoRA weights, we must also update the pipeline's scheduler config:

```python
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    'DeepFloyd/IF-I-XL-v1.0',
    use_safetensors=True
)
pipe.load_lora_weights('<lora weights path>')

# Update scheduler config to fixed variance schedule
pipe.scheduler = pipe.scheduler.__class__.from_config(
    pipe.scheduler.config,
    variance_type='fixed_small'
)
```

The stage 2 model requires additional validation images to upscale. We can download and use a downsized version of the training images for this:

```python
from huggingface_hub import snapshot_download

local_dir = './dog_downsized'
snapshot_download(
    'diffusers/dog-example-downsized',
    local_dir=local_dir,
    repo_type='dataset',
    ignore_patterns='.gitattributes'
)
```

The parameters below are useful for training a DeepFloyd IF model with a combination of DreamBooth and LoRA:
* `--resolution=64`, a much smaller resolution is required because DeepFloyd IF is a pixel diffusion model that works on uncompressed pixels, so the input images must be smaller
* `--pre_compute_text_embeddings`, compute the text embeddings ahead of time to save memory because the `T5Model` can take up a lot of memory
* `--tokenizer_max_length=77`, we can use a longer default text length with T5 as the text encoder but the default model encoding procedure uses a shorter text length
* `--text_encoder_use_attention_mask`, pass the attention mask to the text encoder

##### Stage 1 LoRA DreamBooth

Training stage 1 of DeepFloyd IF with LoRA and DreamBooth requires ~28GB of memory.

```bash
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_lora"

accelerate launch train_dreambooth_lora.py \
  --report_to wandb \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a sks dog" \
  --resolution=64 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --scale_lr \
  --max_train_steps=1200 \
  --validation_prompt="a sks dog" \
  --validation_epochs=25 \
  --checkpointing_steps=100 \
  --pre_compute_text_embeddings \
  --tokenizer_max_length=77 \
  --text_encoder_use_attention_mask
```
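
The `--scale_lr` flag in the command above changes the learning rate that is actually used. The sketch below assumes the scaling rule multiplies the base learning rate by the gradient accumulation steps, the per-device batch size, and the number of processes; the function name is hypothetical.

```python
def scaled_learning_rate(learning_rate, train_batch_size,
                         gradient_accumulation_steps, num_processes=1):
    """Learning rate after scaling (assumed rule: multiply all factors together)."""
    return learning_rate * gradient_accumulation_steps * train_batch_size * num_processes


# Values from the stage 1 LoRA command, assuming a single GPU.
effective_lr = scaled_learning_rate(
    5e-6, train_batch_size=4, gradient_accumulation_steps=1
)
print(effective_lr)
```

Under this assumption, running the same command on more GPUs raises the effective learning rate further, which is worth keeping in mind given how sensitive DreamBooth is to hyperparameters.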

##### Stage 2 LoRA DreamBooth

For stage 2 of DeepFloyd IF with LoRA and DreamBooth, pay attention to these parameters:
* `--validation_images`, the images to upscale during validation
* `--class_labels_conditioning="timesteps"`, additionally condition the UNet as needed in stage 2
* `--learning_rate=1e-6`, a lower learning rate is used compared to stage 1
* `--resolution=256`, the expected resolution for the upscaler

```bash
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"

python train_dreambooth_lora.py \
    --report_to wandb \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --instance_data_dir=$INSTANCE_DIR \
    --output_dir=$OUTPUT_DIR \
    --instance_prompt="a sks dog" \
    --resolution=256 \
    --train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --learning_rate=1e-6 \
    --max_train_steps=2000 \
    --validation_prompt="a sks dog" \
    --validation_epochs=100 \
    --checkpointing_steps=500 \
    --pre_compute_text_embeddings \
    --tokenizer_max_length=77 \
    --text_encoder_use_attention_mask \
    --validation_images $VALIDATION_IMAGES \
    --class_labels_conditioning=timesteps
```

##### Stage 1 DreamBooth

For stage 1 of DeepFloyd IF with DreamBooth, pay attention to these parameters:
* `--skip_save_text_encoder`, skip saving the full T5 text encoder with the finetuned model
* `--use_8bit_adam`, use 8bit Adam optimizer to save memory due to the size of the optimizer state when training the full model
* `--learning_rate=1e-7`, a really low learning rate should be used for full model training otherwise the model quality is degraded


With 8-bit Adam and a batch size of 4, the full model can be trained with ~48GB of memory:
```bash
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_if"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=64 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-7 \
  --max_train_steps=150 \
  --validation_prompt "a photo of sks dog" \
  --validation_steps 25 \
  --text_encoder_use_attention_mask \
  --tokenizer_max_length 77 \
  --pre_compute_text_embeddings \
  --use_8bit_adam \
  --set_grads_to_none \
  --skip_save_text_encoder \
  --push_to_hub
```

##### Stage 2 DreamBooth

For stage 2 of DeepFloyd IF with DreamBooth, pay attention to these parameters:

* `--learning_rate=5e-6`, use a lower learning rate with a smaller effective batch size
* `--resolution=256`, the expected resolution for the upscaler
* `--train_batch_size=2` and `--gradient_accumulation_steps=6`, training effectively on images with faces requires larger batch sizes

```bash
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"

accelerate launch train_dreambooth.py \
  --report_to wandb \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a sks dog" \
  --resolution=256 \
  --train_batch_size=2 \
  --gradient_accumulation_steps=6 \
  --learning_rate=5e-6 \
  --max_train_steps=2000 \
  --validation_prompt="a sks dog" \
  --validation_steps=150 \
  --checkpointing_steps=500 \
  --pre_compute_text_embeddings \
  --tokenizer_max_length=77 \
  --text_encoder_use_attention_mask \
  --validation_images $VALIDATION_IMAGES \
  --class_labels_conditioning timesteps \
  --push_to_hub
```
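
The effective batch size in the command above comes from combining the per-device batch size with gradient accumulation. A minimal sketch of the arithmetic, using a hypothetical helper and assuming a single GPU unless stated otherwise:

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of samples that contribute to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


# --train_batch_size=2 with --gradient_accumulation_steps=6 on one GPU.
stage2_batch = effective_batch_size(2, 6)
print(stage2_batch)
```

Gradient accumulation trades wall-clock time for memory: each optimizer step sees 12 samples while only 2 are held in memory at once.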

### Training tips

* LoRA is sufficient for training the stage 1 model because the model's low resolution makes representing finer details difficult regardless.
* For common or simple objects, we do not necessarily need to finetune the upscaler. Make sure the prompt passed to the upscaler is adjusted to remove the new token from the instance prompt. For example, if our stage 1 prompt is "a sks dog", then our stage 2 prompt should be "a dog".
* For finer details like faces, fully training the stage 2 upscaler is better than training the stage 2 model with LoRA. It also helps to use lower learning rates with larger batch sizes.
* Lower learning rates should be used to train the stage 2 model.
* The `DDPMScheduler` works better than the DPMSolver used in the training scripts.
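
The prompt adjustment for an upscaler that has not been finetuned can be automated with a small helper. This is a sketch only: `strip_identifier` is a hypothetical name, and it assumes the identifier appears as a standalone whitespace-separated word.

```python
def strip_identifier(prompt, identifier="sks"):
    """Remove the identifier token so the upscaler sees a plain class prompt."""
    return " ".join(word for word in prompt.split() if word != identifier)


stage1_prompt = "a sks dog"
stage2_prompt = strip_identifier(stage1_prompt)
print(stage2_prompt)
```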