# Stable Diffusion XL

SDXL is a larger and more powerful iteration of the Stable Diffusion model, capable of producing higher-resolution images.

SDXL's UNet is 3x larger and the model adds a second text encoder to the architecture.

We will explore the [`train_text_to_image_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) script to train SDXL.

As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/text_to_image
pip install -r requirements.txt
```
HuggingFace Accelerate is also helpful to train on multiple GPUs with mixed-precision.
```bash
pip install accelerate
```

Now we can initialize a HuggingFace Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (for example, a notebook), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

## Script parameters

The training script provides many parameters to customize the training run. We can find all of the parameters and their descriptions in the `parse_args()` function.

To speed up training with mixed precision using the `bf16` format, add the `--mixed_precision` parameter to the training command:
```bash
accelerate launch train_text_to_image_sdxl.py --mixed_precision="bf16"
```

Most of the parameters are identical to the parameters in the **Text-to-image** training guide, so we will focus on parameters that are relevant to training SDXL:
* `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows us to specify a better VAE
* `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
* `--timestep_bias_strategy`: where (earlier vs. later) in the timesteps to apply a bias, which can encourage the model to learn either low- or high-frequency details
* `--timestep_bias_multiplier`: the weight of the bias to apply to the timestep
* `--timestep_bias_begin`: the timestep to begin applying the bias
* `--timestep_bias_end`: the timestep to end applying the bias
* `--timestep_bias_portion`: the proportion of timesteps to apply the bias to
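
To make the interaction between the timestep bias parameters more concrete, here is a minimal pure-Python sketch of how such sampling weights could be built. It is an illustration under stated assumptions, not the script's code: the function name `sketch_timestep_weights` and its keyword arguments are hypothetical, the strategy names `"earlier"`, `"later"`, and `"range"` are assumed, and the script's own `generate_timestep_weights` reads its values from `args` and works on `torch` tensors.

```python
def sketch_timestep_weights(num_timesteps, strategy="none", multiplier=1.0,
                            portion=0.25, begin=0, end=1000):
    """Return one normalized sampling weight per timestep (hypothetical helper)."""
    # Start from a uniform distribution over all timesteps.
    weights = [1.0] * num_timesteps
    num_to_bias = int(portion * num_timesteps)

    # Pick which timesteps receive the bias.
    if strategy == "later":
        biased = range(num_timesteps - num_to_bias, num_timesteps)
    elif strategy == "earlier":
        biased = range(0, num_to_bias)
    elif strategy == "range":
        biased = range(begin, min(end, num_timesteps))
    else:
        biased = range(0)  # no bias: keep the uniform weights

    # Scale the selected timesteps by the multiplier.
    for t in biased:
        weights[t] *= multiplier

    # Normalize so the weights form a probability distribution.
    total = sum(weights)
    return [w / total for w in weights]


# Make the last 25% of 1000 timesteps twice as likely to be sampled.
weights = sketch_timestep_weights(1000, strategy="later", multiplier=2.0, portion=0.25)
```

In this sketch, a multiplier above 1 makes the selected timesteps more likely to be sampled, and a multiplier between 0 and 1 makes them less likely.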

## Min-SNR weighting

The `Min-SNR` weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.

We can add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
```bash
accelerate launch train_text_to_image_sdxl.py --snr_gamma=5.0
```
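
As a rough illustration of what the rebalancing does, the sketch below computes a loss weight for a single timestep from its signal-to-noise ratio (SNR). It assumes the commonly used Min-SNR form, `min(SNR, gamma) / SNR` for `epsilon` prediction and `min(SNR, gamma) / (SNR + 1)` for `v_prediction`. The function name `min_snr_loss_weight` is hypothetical, and the training script works on batched `torch` tensors rather than single floats.

```python
def min_snr_loss_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Loss weight for one timestep given its SNR (illustrative sketch)."""
    # Clip the SNR at gamma so low-noise timesteps cannot dominate the loss.
    clipped = min(snr, gamma)
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    raise ValueError(f"unsupported prediction type: {prediction_type}")


# Print the weight for a noisy, a borderline, and a nearly clean timestep.
for snr in (0.1, 5.0, 50.0):
    print(snr, min_snr_loss_weight(snr, gamma=5.0))
```

With `epsilon` prediction, timesteps whose SNR is at or below `gamma` keep a weight of 1, while timesteps with a higher SNR (less noise) are down-weighted.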

## Training script

The training script is similar to the Text-to-image training guide, but it has been modified to support SDXL training.

The training script starts by creating an `encode_prompt` function that tokenizes the prompts and calculates the prompt embeddings, and a `compute_vae_encodings` function that computes the image embeddings with the VAE. Next, the `generate_timestep_weights` function generates the timestep weights, depending on the number of timesteps and the timestep bias strategy to apply.
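
The effect of `--proportion_empty_prompts` during prompt encoding can be sketched in a few lines. This is a simplified illustration, not the script's `encode_prompt`: the helper name `drop_captions` and the sample caption are made up, and tokenization and the text encoders are left out entirely.

```python
import random


def drop_captions(captions, proportion_empty_prompts, rng=random):
    """Replace each caption with "" with the given probability (hypothetical helper)."""
    return ["" if rng.random() < proportion_empty_prompts else caption
            for caption in captions]


# Use a seeded generator so the example is reproducible.
rng = random.Random(0)
captions = ["a ninja with blue hair"] * 1000
dropped = drop_captions(captions, 0.2, rng)

# Fraction of captions that were replaced with an empty string.
empty_fraction = sum(caption == "" for caption in dropped) / len(dropped)
print(empty_fraction)
```

Training on a share of empty prompts is commonly done so that the model also learns an unconditional prediction, which classifier-free guidance relies on at inference time.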

Within the `main()` function, in addition to loading a tokenizer, the script loads a second tokenizer and text encoder because the SDXL architecture uses two of each:
```python
    # Load the tokenizers
    tokenizer_one = AutoTokenizer.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision, use_fast=False
    )
    tokenizer_two = AutoTokenizer.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="tokenizer_2", revision=args.revision, use_fast=False
    )

    # import correct text encoder classes
    text_encoder_cls_one = import_model_class_from_model_name_or_path(
        args.pretrained_model_name_or_path, args.revision
    )
    text_encoder_cls_two = import_model_class_from_model_name_or_path(
        args.pretrained_model_name_or_path, args.revision, subfolder="text_encoder_2"
    )
```

The prompt and image embeddings are computed first and kept in memory. For larger datasets, this can lead to memory problems; in that case, we should save the pre-computed embeddings to disk separately and load them into memory during training.
```python
    # Let's first compute all the embeddings so that we can free up the text encoders
    # from memory. We will pre-compute the VAE encodings too.
    text_encoders = [text_encoder_one, text_encoder_two]
    tokenizers = [tokenizer_one, tokenizer_two]
    compute_embeddings_fn = functools.partial(
        encode_prompt,
        text_encoders=text_encoders,
        tokenizers=tokenizers,
        proportion_empty_prompts=args.proportion_empty_prompts,
        caption_column=args.caption_column,
    )
    compute_vae_encodings_fn = functools.partial(compute_vae_encodings, vae=vae)
    with accelerator.main_process_first():
        from datasets.fingerprint import Hasher

        # fingerprint used by the cache for the other processes to load the result
        # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
        new_fingerprint = Hasher.hash(args)
        new_fingerprint_for_vae = Hasher.hash("vae")
        train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
        train_dataset = train_dataset.map(
            compute_vae_encodings_fn,
            batched=True,
            batch_size=args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps,
            new_fingerprint=new_fingerprint_for_vae,
        )
```
After calculating the embeddings, the text encoders, VAE, and tokenizers are deleted to free up some memory:
```python
    del text_encoders, tokenizers, vae
    gc.collect()
    torch.cuda.empty_cache()
```
Finally, the training loop takes care of the rest. If we choose to apply a timestep bias strategy, we will see that the timestep weights are calculated and used to sample the timesteps at which noise is added:
```python
    # Sample a random timestep for each image, potentially biased by the timestep weights.
    # Biasing the timestep weights allows us to spend less time training irrelevant timesteps.
    weights = generate_timestep_weights(args, noise_scheduler.config.num_train_timesteps).to(
        model_input.device
    )
    timesteps = torch.multinomial(weights, bsz, replacement=True).long()
```
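
To see what weighted timestep sampling does in practice, here is a small standard-library sketch. It uses `random.choices` as a stand-in for `torch.multinomial` with `replacement=True`, and the weights are hypothetical values chosen for illustration rather than anything the script produces.

```python
import random

num_train_timesteps = 1000
# Hypothetical weights: timesteps 750-999 are twice as likely as the rest.
weights = [2.0 if t >= 750 else 1.0 for t in range(num_train_timesteps)]

rng = random.Random(0)
batch_size = 8
# Draw one timestep per image, with replacement, proportionally to the weights.
timesteps = rng.choices(range(num_train_timesteps), weights=weights, k=batch_size)

# Over many draws, the biased region should receive about
# (250 * 2) / (750 * 1 + 250 * 2) = 40% of the samples.
many = rng.choices(range(num_train_timesteps), weights=weights, k=20000)
share_late = sum(t >= 750 for t in many) / len(many)
print(timesteps, share_late)
```

The weights do not need to be normalized here, since `random.choices` only uses their relative sizes.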

## Launch the script

We will train SDXL on the Naruto BLIP captions dataset to generate our own Naruto characters. We need to set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and the dataset. We also need to specify a VAE other than the SDXL VAE with `VAE_NAME` to avoid numerical instabilities.

To monitor training progress with **Weights & Biases**, add the `--report_to="wandb"` parameter to the training command. We also need to add the `--validation_prompt` and `--validation_epochs` parameters to keep track of results, which is useful for debugging the model and viewing intermediate results.

```bash
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalab/naruto-blip-captions"

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10000 \
  --use_8bit_adam \
  --learning_rate=1e-06 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --report_to="wandb" \
  --validation_prompt="a cute Sundar Pichai creature" \
  --validation_epochs=5 \
  --checkpointing_steps=5000 \
  --output_dir="sdxl-naruto-model" \
  --push_to_hub
```

After training, we can use the custom SDXL model for inference:

```python
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    'path/to/our/model',
    torch_dtype=torch.float16
).to('cuda')

prompt = 'a naruto with green eyes and red legs'
image = pipeline(
    prompt,
    num_inference_steps=30,
    guidance_scale=7.5
).images[0]
image
```