# ControlNet

**ControlNet** models are adapters trained on top of another pretrained model. They allow for a greater degree of control over image generation by conditioning the model with an additional input image.

We will explore the [`train_controlnet.py`](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py) script.

As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/controlnet
pip install -r requirements.txt
```
Hugging Face Accelerate is also helpful for training on multiple GPUs with mixed precision:
```bash
pip install accelerate
```

Now we can initialize a Hugging Face Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (a notebook, for example), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

## Script parameters

The training script provides many parameters to customize the training run. We can find all of the parameters and their descriptions in the `parse_args()` function.

To speed up training with mixed precision using the `fp16` format, add the `--mixed_precision` parameter to the training command:
```bash
accelerate launch train_controlnet.py --mixed_precision="fp16"
```
Most of the parameters are identical to the parameters in the **Text-to-image** training guide. In addition to those, the following parameters are worth highlighting for ControlNet training:
* `--max_train_samples`: the number of training samples; this can be lowered for faster training, but if we want to stream really large datasets, we will need to include this parameter and the `--streaming` parameter in the training command
* `--gradient_accumulation_steps`: number of update steps to accumulate before the backward pass; this allows us to train with a bigger batch size than our GPU memory can typically handle
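
To make the effect of `--gradient_accumulation_steps` concrete, here is a minimal sketch of the arithmetic behind the effective batch size. The helper name `effective_batch_size` and the `num_processes` argument are hypothetical and only used for illustration; the sketch assumes each process (GPU) handles `train_batch_size` samples per step.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of samples that contribute to a single optimizer update.

    Hypothetical helper for illustration; it is not part of the training script.
    """
    return train_batch_size * gradient_accumulation_steps * num_processes


# One sample per step, accumulated over four steps on a single GPU
print(effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4))
```

With these values, each optimizer update sees four samples even though only one sample is held in GPU memory at a time.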

### Min-SNR weighting

The `Min-SNR` weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.

We can add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
```bash
accelerate launch train_controlnet.py --snr_gamma=5.0
```
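
As a rough illustration of what the weighting does, the sketch below computes the per-timestep loss weight for a scalar signal-to-noise ratio. It assumes the usual Min-SNR formulation, where the SNR is clipped at `gamma` and then divided by the SNR (for `epsilon`) or by the SNR plus one (for `v_prediction`). The function name `min_snr_loss_weight` is hypothetical, and the actual script operates on batched tensors rather than Python floats.

```python
def min_snr_loss_weight(snr, gamma=5.0, prediction_type="epsilon"):
    """Loss weight for one timestep under Min-SNR weighting (illustrative sketch)."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    clipped = min(snr, gamma)  # cap the influence of low-noise (high-SNR) timesteps
    if prediction_type == "epsilon":
        return clipped / snr
    if prediction_type == "v_prediction":
        return clipped / (snr + 1.0)
    raise ValueError(f"Unsupported prediction type: {prediction_type}")


# High-SNR timesteps are down-weighted; low-SNR timesteps keep full weight
for snr in (0.5, 5.0, 50.0):
    print(snr, min_snr_loss_weight(snr))
```

Timesteps whose SNR is at or below `gamma` keep a weight of 1 for `epsilon` prediction, while timesteps with a higher SNR are scaled down in proportion.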

## Training script

The general training script is covered in the **Text-to-image** training guide. This guide focuses specifically on the ControlNet part.

The training script has a `make_train_dataset` function for preprocessing the dataset with image transforms and caption tokenization. In addition to the usual caption tokenization and image transforms, the script also includes transforms for the conditioning image.

In the `make_train_dataset` function:
```python
    def tokenize_captions(examples, is_train=True):
        captions = []
        for caption in examples[caption_column]:
            if random.random() < args.proportion_empty_prompts:
                captions.append("")
            elif isinstance(caption, str):
                captions.append(caption)
            elif isinstance(caption, (list, np.ndarray)):
                # take a random caption if there are multiple
                captions.append(random.choice(caption) if is_train else caption[0])
            else:
                raise ValueError(
                    f"Caption column `{caption_column}` should contain either strings or lists of strings."
                )
        inputs = tokenizer(
            captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
        )
        return inputs.input_ids

    image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ]
    )

    conditioning_image_transforms = transforms.Compose(
        [
            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.CenterCrop(args.resolution),
            transforms.ToTensor(),
        ]
    )
```
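
One detail worth noticing in the transforms above is that only the target image goes through `Normalize([0.5], [0.5])`. The sketch below reproduces the per-pixel arithmetic in plain Python to show the resulting value ranges. It assumes 8-bit input images, which `ToTensor` scales to `[0, 1]`; the helper names `to_unit_range` and `normalize` are hypothetical stand-ins for the torchvision transforms.

```python
def to_unit_range(pixel):
    """Mimic ToTensor on an 8-bit pixel value: scale 0..255 to 0.0..1.0."""
    return pixel / 255.0


def normalize(value, mean=0.5, std=0.5):
    """Mimic Normalize([0.5], [0.5]) on a single value."""
    return (value - mean) / std


for pixel in (0, 255):
    target = normalize(to_unit_range(pixel))  # target image pipeline
    conditioning = to_unit_range(pixel)       # conditioning image pipeline
    print(pixel, target, conditioning)
```

Under these assumptions, target images end up in `[-1, 1]`, while conditioning images stay in `[0, 1]`.
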
Within the `main()` function, we will find the code for loading the tokenizer, text encoder, scheduler, and models. This is also where the ControlNet model is loaded either from existing weights or randomly initialized from a UNet:
```python
    if args.controlnet_model_name_or_path:
        logger.info("Loading existing controlnet weights")
        controlnet = ControlNetModel.from_pretrained(args.controlnet_model_name_or_path)
    else:
        logger.info("Initializing controlnet weights from unet")
        controlnet = ControlNetModel.from_unet(unet)
```
The optimizer is set up to update the ControlNet parameters:
```python
    # Optimizer creation
    params_to_optimize = controlnet.parameters()
    optimizer = optimizer_class(
        params_to_optimize,
        lr=args.learning_rate,
        betas=(args.adam_beta1, args.adam_beta2),
        weight_decay=args.adam_weight_decay,
        eps=args.adam_epsilon,
    )
```
Finally, in the training loop, the conditioning text embeddings and image are passed to the down and mid-blocks of the ControlNet model.

## Launch the script

We will use the [`fusing/fill50k`](https://huggingface.co/datasets/fusing/fill50k) dataset to train our custom ControlNet as an example.

We will set the environment variable `MODEL_DIR` to a model id on the Hub or a path to a local model, and `OUTPUT_DIR` to where we want to save the model.

Before training, we also need to download the following images to condition our training:
```bash
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
```

Run the script:
```bash
export MODEL_DIR="stable-diffusion-v1-5/stable-diffusion-v1-5"
export OUTPUT_DIR="path/to/save/model"

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --push_to_hub
```

If we have a 12GB GPU, we will need the `bitsandbytes` 8-bit optimizer, gradient checkpointing, and xFormers, and we will need to set the gradients to `None` instead of zero to reduce memory usage:
```bash
export MODEL_DIR="stable-diffusion-v1-5/stable-diffusion-v1-5"
export OUTPUT_DIR="path/to/save/model"

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --use_8bit_adam \
 --gradient_checkpointing \
 --enable_xformers_memory_efficient_attention \
 --set_grads_to_none \
 --push_to_hub
```

If we have a 16GB GPU, we can use the `bitsandbytes` 8-bit optimizer and gradient checkpointing to optimize our training run:
```bash
export MODEL_DIR="stable-diffusion-v1-5/stable-diffusion-v1-5"
export OUTPUT_DIR="path/to/save/model"

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --resolution=512 \
 --learning_rate=1e-5 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --use_8bit_adam \
 --gradient_checkpointing \
 --push_to_hub
```

Once training is complete, we can use our newly trained ControlNet for inference:

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import torch

# Load the trained ControlNet weights in half precision
controlnet = ControlNetModel.from_pretrained(
    'path/to/our/custom/controlnet',
    torch_dtype=torch.float16
)

# Attach the ControlNet to the base model it was trained on
pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    'path/to/our/base/model',
    controlnet=controlnet,
    torch_dtype=torch.float16
).to('cuda')
```

```python
control_image = load_image("./conditioning_image_1.png")
prompt = "pale golden rod circle with old lace background"
generator = torch.manual_seed(1111)

image = pipeline(
    prompt,
    image=control_image,
    num_inference_steps=20,
    generator=generator,
).images[0]
image
```

## SDXL

For SDXL models, we will use the [`train_controlnet_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py) script to train a ControlNet adapter for the SDXL model.

The SDXL training script is discussed in more detail in the SDXL training guide.