# Textual Inversion

**Textual Inversion** is a training technique for personalizing image generation models with just a few example images of the concept we want the model to learn. It works by learning new text embeddings that match the example images we provide. The new embeddings are tied to a special word that we must use in the prompt.

We will explore the [`textual_inversion.py`](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) script and use it to train our own textual inversion embeddings.

As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/textual_inversion
pip install -r requirements.txt
```
Hugging Face Accelerate is also helpful for training on multiple GPUs and with mixed precision.
```bash
pip install accelerate
```

Now we can initialize a Hugging Face Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (a notebook, for example), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

## Script parameters

The training script provides many parameters to customize the training run. We can find all of the parameters and their descriptions in the `parse_args()` function.

Some basic and important parameters to specify:
* `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
* `--train_data_dir`: path to a folder containing the training dataset (example images)
* `--output_dir`: where to save the trained model
* `--push_to_hub`: whether to push the trained model to the Hub
* `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if training is interrupted for some reason, because we can continue from that checkpoint by adding `--resume_from_checkpoint` to our training command
* `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs
* `--placeholder_token`: the special word to tie the learned embeddings to (we must use the word in our prompt for inference)
* `--initializer_token`: a single word that roughly describes the object or style we are trying to train on
* `--learnable_property`: whether we are training the model to learn a new "style" (for example, Van Gogh's painting style) or "object" (for example, new discovery)

## Training script

The `textual_inversion.py` script has a custom dataset class, `TextualInversionDataset`, for creating a dataset. We can customize the image size, placeholder token, interpolation method, whether to crop the image, and more. If we need to change how the dataset is created, we can modify `TextualInversionDataset`.

Next, in the `main()` function, we will find the dataset preprocessing code. The script starts by loading the `tokenizer`, `scheduler` and `model`:
```python
    # Load tokenizer
    if args.tokenizer_name:
        tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name)
    elif args.pretrained_model_name_or_path:
        tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")

    # Load scheduler and models
    noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
    text_encoder = CLIPTextModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
    )
    vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision)
    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
    )
```
Next, the special `placeholder_token` is added to the tokenizer, and the text encoder's embedding matrix is resized to account for the new token:
```python
    # Add the placeholder token in tokenizer
    placeholder_tokens = [args.placeholder_token]

    if args.num_vectors < 1:
        raise ValueError(f"--num_vectors has to be larger or equal to 1, but is {args.num_vectors}")

    # add dummy tokens for multi-vector
    additional_tokens = []
    for i in range(1, args.num_vectors):
        additional_tokens.append(f"{args.placeholder_token}_{i}")
    placeholder_tokens += additional_tokens

    num_added_tokens = tokenizer.add_tokens(placeholder_tokens)
    if num_added_tokens != args.num_vectors:
        raise ValueError(
            f"The tokenizer already contains the token {args.placeholder_token}. Please pass a different"
            " `placeholder_token` that is not already in the tokenizer."
        )

    # Convert the initializer_token, placeholder_token to ids
    token_ids = tokenizer.encode(args.initializer_token, add_special_tokens=False)
    # Check if initializer_token is a single token or a sequence of tokens
    if len(token_ids) > 1:
        raise ValueError("The initializer token must be a single token.")

    initializer_token_id = token_ids[0]
    placeholder_token_ids = tokenizer.convert_tokens_to_ids(placeholder_tokens)

    # Resize the token embeddings as we are adding new special tokens to the tokenizer
    text_encoder.resize_token_embeddings(len(tokenizer))

    # Initialise the newly added placeholder token with the embeddings of the initializer token
    token_embeds = text_encoder.get_input_embeddings().weight.data
    with torch.no_grad():
        for token_id in placeholder_token_ids:
            token_embeds[token_id] = token_embeds[initializer_token_id].clone()
```
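To make the resize-and-initialize step more concrete, here is a minimal sketch on a toy embedding table. It only assumes `torch`; the table sizes and the `initializer_token_id` value are hypothetical, and the table is grown by hand instead of calling `resize_token_embeddings` as the script does.

```python
import torch

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4   # hypothetical toy sizes
num_new_tokens = 3              # plays the role of --num_vectors
initializer_token_id = 7        # hypothetical id of the initializer token

old_embedding = torch.nn.Embedding(vocab_size, embed_dim)

# Grow the table: allocate a larger one and copy the existing rows over
new_embedding = torch.nn.Embedding(vocab_size + num_new_tokens, embed_dim)
placeholder_token_ids = list(range(vocab_size, vocab_size + num_new_tokens))

with torch.no_grad():
    new_embedding.weight[:vocab_size] = old_embedding.weight
    # Start every new placeholder row from the initializer token's embedding
    for token_id in placeholder_token_ids:
        new_embedding.weight[token_id] = new_embedding.weight[initializer_token_id].clone()

print(new_embedding.weight.shape)
```

Starting from the initializer token's embedding gives training a reasonable starting point instead of a random vector.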
Then, the script creates a dataset from the `TextualInversionDataset`:
```python
    # Dataset and DataLoaders creation:
    train_dataset = TextualInversionDataset(
        data_root=args.train_data_dir,
        tokenizer=tokenizer,
        size=args.resolution,
        placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))),
        repeats=args.repeats,
        learnable_property=args.learnable_property,
        center_crop=args.center_crop,
        set="train",
    )
    train_dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers
    )
```
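The `placeholder_token` string passed to the dataset above is the space-joined list of all placeholder tokens, which the dataset then inserts into a text prompt for each image. The sketch below illustrates that idea. The template lists are short, illustrative stand-ins for the longer lists in the script, `build_prompt` is a hypothetical helper, and the template is picked by index here to keep the example deterministic.

```python
# Abbreviated, illustrative templates; the script defines longer lists.
object_templates = ["a photo of a {}", "a rendering of a {}"]
style_templates = ["a painting in the style of {}"]


def build_prompt(placeholder_token, num_vectors, learnable_property, template_index=0):
    """Hypothetical helper: expand the placeholder and fill a prompt template."""
    # One token per learned vector: "<tok>", "<tok>_1", "<tok>_2", ...
    tokens = [placeholder_token] + [f"{placeholder_token}_{i}" for i in range(1, num_vectors)]
    placeholder_string = " ".join(tokens)
    templates = style_templates if learnable_property == "style" else object_templates
    return templates[template_index].format(placeholder_string)


print(build_prompt("<cat-toy>", 2, "object"))
# a photo of a <cat-toy> <cat-toy>_1
```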
Finally, the training loop handles everything else, from predicting the noise residual to updating the embedding weights of the special placeholder token.
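One way to update only the placeholder token's embedding is to take a normal optimizer step and then restore every other row of the embedding table. The sketch below shows this on a toy embedding. The sizes, token ids, and the mean-squared-error loss are stand-ins (the real loss compares predicted and actual noise), so treat it as an illustration rather than the script's exact loop.

```python
import torch

torch.manual_seed(0)

embedding = torch.nn.Embedding(6, 4)   # toy table: 6 tokens, 4 dimensions
placeholder_token_ids = [5]            # hypothetical id of the new token

# Keep a frozen copy and a mask of the rows that must not change
orig_embeds = embedding.weight.detach().clone()
index_no_updates = torch.ones(6, dtype=torch.bool)
index_no_updates[placeholder_token_ids] = False

optimizer = torch.optim.AdamW(embedding.parameters(), lr=0.1)

# Stand-in loss; the real training loop uses the noise-prediction loss
input_ids = torch.tensor([[1, 5, 2]])
loss = torch.nn.functional.mse_loss(embedding(input_ids), torch.zeros(1, 3, 4))
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Undo the update everywhere except the placeholder rows
with torch.no_grad():
    embedding.weight[index_no_updates] = orig_embeds[index_no_updates]
```

The restore step matters because the optimizer touches the whole embedding matrix (for example, through weight decay), not only the rows that appeared in the batch.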

## Launch the script

In this guide, we will download the images in the cat toy example dataset (`diffusers/cat_toy_example`) and store them in a local directory.
```python
from huggingface_hub import snapshot_download
local_dir = './cat'
snapshot_download(
    'diffusers/cat_toy_example',
    local_dir=local_dir,
    repo_type='dataset',
    ignore_patterns='.gitattributes'
)
```

Once we launch the script, it will create and save the following files to our repository:
* `learned_embeds.bin`: the learned embedding vectors corresponding to our example images
* `token_identifier.txt`: the special placeholder token
* `type_of_concept.txt`: the type of concept we are training on (either "object" or "style")

If we are interested in following along with the training process, we can periodically save generated images as training progresses. Add the following parameters to the training command:
```bash
 --validation_prompt="A <cat-toy> train"
 --num_validation_images=4
 --validation_steps=100
```

Run the script:
```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```
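The `--scale_lr` flag in the command above changes the learning rate that is actually used. The `diffusers` example scripts typically multiply the base learning rate by the gradient accumulation steps, the per-device batch size, and the number of processes. Assuming `textual_inversion.py` follows that convention and we train on a single GPU (`num_processes = 1` is an assumption), the arithmetic looks like this:

```python
# Values taken from the launch command above
learning_rate = 5.0e-04
gradient_accumulation_steps = 4
train_batch_size = 1

# Assumption: a single process (one GPU)
num_processes = 1

# Assumed scaling rule applied when --scale_lr is set
effective_learning_rate = (
    learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
)
print(f"{effective_learning_rate:.4f}")
```

If training is unstable after changing the batch size or the number of GPUs, this scaled value is the one to check first.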

After training is complete, we can use our newly trained model for inference:

```python
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16
).to('cuda')
pipeline.load_textual_inversion('sd-concepts-library/cat-toy')

image = pipeline(
    "A <cat-toy> train",
    num_inference_steps=50
).images[0]
image
```