# Text-to-image

Text-to-image models like Stable Diffusion are conditioned to generate images from a text prompt.

We will use the [`train_text_to_image.py`](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) script to train and adapt a model.

As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/text_to_image
pip install -r requirements.txt
```
Hugging Face Accelerate is also helpful for training on multiple GPUs with mixed precision:
```bash
pip install accelerate
```

Now we can initialize a Hugging Face Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (for example, a notebook), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

## Script parameters

The training script provides many parameters to customize the training run. We can find all of the parameters and their descriptions in the `parse_args()` function.

To speed up training with mixed precision using the `fp16` format, add the `--mixed_precision` parameter to the training command:
```bash
accelerate launch train_text_to_image.py --mixed_precision="fp16"
```

Some important parameters:
* `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
* `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
* `--image_column`: the name of the image column in the dataset to train on
* `--caption_column`: the name of the text column in the dataset to train on
* `--output_dir`: where to save the trained model
* `--push_to_hub`: whether to push the trained model to the Hub
* `--checkpointing_steps`: how often (in steps) to save a checkpoint as the model trains

### Min-SNR weighting

The `Min-SNR` weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.

We can add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
```bash
accelerate launch train_text_to_image.py --snr_gamma=5.0
```
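
To make the rebalancing concrete, here is a minimal sketch of how a Min-SNR loss weight could be computed for a single timestep. The function name `min_snr_weight` and its arguments are illustrative assumptions, not names from the training script; the sketch assumes the signal-to-noise ratio is derived from the cumulative alpha product of the noise schedule and that the clipped SNR is divided by `SNR` for `epsilon` and by `SNR + 1` for `v_prediction`.

```python
def min_snr_weight(alpha_cumprod, gamma=5.0, prediction_type="epsilon"):
    """Illustrative Min-SNR loss weight for one timestep (hypothetical helper)."""
    # Signal-to-noise ratio of the noised sample at this timestep.
    snr = alpha_cumprod / (1.0 - alpha_cumprod)
    # Clip the SNR at gamma so low-noise timesteps do not dominate the loss.
    clipped = min(snr, gamma)
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    raise ValueError(f"Unsupported prediction type: {prediction_type}")


# Low-noise timestep (high SNR): the weight is scaled down.
low_noise_weight = min_snr_weight(0.9)
# High-noise timestep (SNR below gamma): the weight stays at 1 for epsilon.
high_noise_weight = min_snr_weight(0.5)
```

With `gamma=5.0`, timesteps whose SNR is above 5 contribute less to the loss, while noisier timesteps keep their full weight.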

## Training script

We can modify the dataset preprocessing and training loop in the `main()` function if necessary.

The `train_text_to_image` script starts by loading a scheduler and tokenizer:
```python
    # Load scheduler, tokenizer and models.
    noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
    tokenizer = CLIPTokenizer.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
    )
```
Then it loads the UNet model:
```python
    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="unet", revision=args.non_ema_revision
    )
```
Then, the text and image columns of the dataset need to be preprocessed. The `tokenize_captions` function handles tokenizing the inputs:
```python
    # Preprocessing the datasets.
    # We need to tokenize input captions and transform the images.
    def tokenize_captions(examples, is_train=True):
        captions = []
        for caption in examples[caption_column]:
            if isinstance(caption, str):
                captions.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take a random caption if there are multiple
                captions.append(random.choice(caption) if is_train else caption[0])
            else:
                raise ValueError(
                    f"Caption column `{caption_column}` should contain either strings or lists of strings."
                )
        inputs = tokenizer(
            captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        return inputs.input_ids
```
The `train_transforms` pipeline specifies the transforms to apply to each image:
```python
    # Preprocessing the datasets.
    train_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
            transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ]
    )
```
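
The last two transforms determine the value range the model sees. As a rough sketch in plain Python (the helper names `to_model_range` and `to_pixel` are hypothetical and only mirror the arithmetic), `ToTensor()` scales 8-bit pixel values to `[0, 1]` and `Normalize([0.5], [0.5])` then maps them to `[-1, 1]`:

```python
def to_model_range(pixel):
    """Map an 8-bit pixel value to [-1, 1] (hypothetical helper)."""
    scaled = pixel / 255.0        # ToTensor(): [0, 255] -> [0.0, 1.0]
    return (scaled - 0.5) / 0.5   # Normalize: subtract mean 0.5, divide by std 0.5


def to_pixel(value):
    """Inverse mapping from [-1, 1] back to an 8-bit pixel value."""
    return round((value * 0.5 + 0.5) * 255.0)


darkest = to_model_range(0)      # -1.0
brightest = to_model_range(255)  # 1.0
```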
Then `tokenize_captions` and `train_transforms` are bundled into `preprocess_train`:
```python
    def preprocess_train(examples):
        images = [image.convert("RGB") for image in examples[image_column]]
        examples["pixel_values"] = [train_transforms(image) for image in images]
        examples["input_ids"] = tokenize_captions(examples)
        return examples
```

Finally, the training loop handles everything else:
* It encodes images into latent space,
* adds noise to the latents,
* computes the text embeddings to condition on,
* updates the model parameters, and
* saves and pushes the model to the Hub.
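
The noising and loss steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the script's code: the script delegates noising to the scheduler and uses a UNet for the prediction, whereas the sketch below assumes a linear beta schedule, uses a short list of floats as a stand-in for the latents, and all function names are hypothetical.

```python
import math
import random


def linear_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alphas_cumprod, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(latents, noise, timestep, alphas_cumprod):
    """Forward diffusion: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise, element-wise."""
    a_bar = alphas_cumprod[timestep]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n for x, n in zip(latents, noise)]


def mse(prediction, target):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


rng = random.Random(0)
alphas_cumprod = linear_alphas_cumprod()
latents = [0.5, -0.25, 1.0, 0.0]                # stand-in for VAE-encoded image latents
noise = [rng.gauss(0.0, 1.0) for _ in latents]  # noise the model must learn to predict
timestep = rng.randrange(len(alphas_cumprod))   # random timestep per training sample
noisy_latents = add_noise(latents, noise, timestep, alphas_cumprod)

# In the real loop, the UNet predicts the noise from the noisy latents, the
# timestep, and the text embeddings. A perfect prediction gives zero loss.
loss = mse(noise, noise)
```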

## Launch the script

In this example, we can train our model on the Naruto BLIP captions dataset to generate our own Naruto characters.
```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export dataset_name="lambdalabs/naruto-blip-captions"

accelerate launch train_text_to_image.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --enable_xformers_memory_efficient_attention \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model" \
  --push_to_hub
```
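
A small batch size combined with gradient accumulation keeps memory usage low while still averaging gradients over several samples. As a rough sketch, assuming a single process and that each training step corresponds to one optimizer update, the effective batch size for the command above can be worked out as follows (the helper `total_batch_size` is hypothetical):

```python
def total_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to one optimizer update (hypothetical helper)."""
    return train_batch_size * gradient_accumulation_steps * num_processes


# Values taken from the launch command; one process is an assumption.
effective = total_batch_size(train_batch_size=1, gradient_accumulation_steps=4)
samples_seen = effective * 15000  # --max_train_steps=15000
```

Under these assumptions, each update averages gradients over 4 samples, and the run processes 60,000 samples in total.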

Once training is complete, we can use the trained model for inference:

```python
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained(
    "path/to/saved_model",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

image = pipeline(prompt="yoda").images[0]
image
```