# Running Stable Diffusion 3 (SD3) DreamBooth LoRA training under 16GB GPU VRAM

## Install Dependencies

In [1]:
!pip install -q -U git+https://github.com/huggingface/diffusers
!pip install -q -U \
    transformers \
    accelerate \
    wandb \
    bitsandbytes \
    peft \
    parquet \
    fastparquet \
    safetensors

SD3 is a gated model, so before using it with diffusers you need to visit the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form, and accept the terms. Once you have access, log in so that your environment can confirm it. Use the command below to log in. If the interactive prompt fails inside a notebook, as it does in the output below, the cell after it sets `HF_TOKEN` from a local `token.txt` file instead:

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/huggingface_hub/commands/huggingface_cli.py", line 51, in

In [2]:
import os

# Read the token from a local file; strip the trailing newline so the value is valid.
with open("token.txt", "r") as f:
    os.environ["HF_TOKEN"] = f.read().strip()

## Clone `diffusers`

In [5]:
!git clone https://github.com/huggingface/diffusers

Cloning into 'diffusers'...


In [3]:
%cd diffusers/examples/research_projects/sd3_lora_colab

/teamspace/studios/this_studio/diffusers/examples/research_projects/sd3_lora_colab



## Download instance data images

In [None]:
!apt install -y unzip
!unzip "/content/images.zip"

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Archive:  /content/images.zip
   creating: images/
  inflating: images/bodyarmor1.jpg   
  inflating: images/bodyarmor2.jpg   
  inflating: images/gatorade1.jpg    
  inflating: images/gatorade2.jpg    
  inflating: images/gatorade3.jpg    
  inflating: images/gatorade4.jpg    
  inflating: images/gatorade5.jpg    
  inflating: images/powerade1.jpg    
  inflating: images/powerade2.jpg    
  inflating: images/prime1.jpg       


## Compute embeddings

Add your image folder and configure the correct instance prompt. Refer to the `compute_embeddings.py` script for details on other supported arguments.

### Note: update the settings in `compute_embeddings.py` (image folder and instance prompt) whenever you change the image files, or this step will not work

In [35]:
!python compute_embeddings.py

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|███████████████████████| 2/2 [00:00<00:00, 2746.76it/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:09<00:00,  4.68s/it]
Loading pipeline components...:  29%|███▋         | 2/7 [00:55<02:19, 27.87s/it]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|█████████████| 7/7 [03:25<00:00, 29.40s/it]
prompt_embeds.shape=torch.Size([1, 154, 4096]), negative_prompt_embeds.shape=torch.Size([1, 154, 4096]), pooled_prompt_embeds.shape=torch.Size([1, 2048]), torch.Size([1, 2048])
Max memory allocated: 10.552 GB
[('52504d405d31c6d1bf47d29b457cd5466da8ea47aed0f93bf9469d11ab842c19', tensor([[[-3.8917e+00, -2.5113e+00,  4.7167e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 
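
The shapes printed above can be reproduced with simple arithmetic. The sketch below assumes the usual SD3 text-encoder dimensions (two CLIP encoders with hidden sizes 768 and 1280, a T5 encoder with hidden size 4096) and a 77-token limit for both CLIP and T5. These constants are assumptions written out by hand, not values read from the models or from `compute_embeddings.py`. If they hold, the trailing zeros in the printed tensor would be consistent with the CLIP features being zero-padded to the T5 width.

```python
# Assumed dimensions (hand-written, not read from the actual models).
CLIP_TOKENS = 77    # CLIP context length
T5_TOKENS = 77      # assumed T5 max sequence length used by the script
CLIP_L_DIM = 768    # assumed hidden size of the first CLIP text encoder
CLIP_G_DIM = 1280   # assumed hidden size of the second CLIP text encoder
T5_DIM = 4096       # assumed hidden size of the T5 text encoder


def sd3_embedding_shapes(batch=1):
    """Return (prompt_embeds shape, pooled shape, zero-padding width)."""
    clip_dim = CLIP_L_DIM + CLIP_G_DIM       # CLIP features joined feature-wise
    pad_width = T5_DIM - clip_dim            # zeros appended to match T5 width
    prompt_embeds = (batch, CLIP_TOKENS + T5_TOKENS, T5_DIM)
    pooled_prompt_embeds = (batch, clip_dim)
    return prompt_embeds, pooled_prompt_embeds, pad_width


prompt_shape, pooled_shape, pad_width = sd3_embedding_shapes()
print(prompt_shape, pooled_shape, pad_width)
```

Under these assumptions the result matches the logged `[1, 154, 4096]` and `[1, 2048]` shapes.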

## Clear memory

In [4]:
import gc
import torch


def flush():
    # Drop unreachable Python objects first, then release cached CUDA blocks.
    gc.collect()
    torch.cuda.empty_cache()

flush()

## Train!

In [23]:
!wandb login YOUR_WANDB_API_KEY

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /teamspace/studios/this_studio/.netrc


In [37]:
!accelerate launch train_dreambooth_lora_sd3_miniature.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers"  \
  --instance_data_dir="images" \
  --data_df_path="sample_embeddings.parquet" \
  --output_dir="trained-sd3-lora-miniature" \
  --mixed_precision="fp16" \
  --instance_prompt="a sports drink ad" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0"

07/04/2024 00:29:10 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

                                          image_hash  ...                      negative_pooled_prompt_embeds
0  52504d405d31c6d1bf47d29b457cd5466da8ea47aed0f9...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
1  ea4eff802a86762ba02f24c3e965c276007d0b9d3171c1...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
2  63e69121120a864eb7dab93fe7136a50750417a1d6ec5b...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
3  43ceb5582f86556468719f2b0ee6decd131a9fc759a10a...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
4  f9c212aa6c7509941ce6f57db361157e095bb408d50b6b...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
5  d7575e4506102abb98a76bc9c8a03f6704813f8dadae93...  ...  [-0.3713539242744446, -1.4495151042938232, -0....
6  5299dad4e362803326c80a9c5c0538b92eb7acaa7b5dce...  ...  [-0.3713539
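
The `image_hash` column in the table above holds 64-character hex strings, which is the length of a SHA-256 digest. The sketch below shows one way such keys could be produced, assuming the digest is computed over each file's raw bytes. That is an assumption about the scripts, and `file_sha256` and `hash_folder` are hypothetical helper names. A check like this can help confirm that the images in `--instance_data_dir` are the same files the embeddings were computed for.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def hash_folder(folder):
    """Map each file name in `folder` to its digest (hypothetical helper)."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    return {p.name: file_sha256(p) for p in files}


# Demo on a throwaway folder so the snippet runs without the real images.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jpg").write_bytes(b"fake image bytes")
    (Path(tmp) / "b.jpg").write_bytes(b"other image bytes")
    demo_hashes = hash_folder(tmp)

print(demo_hashes)
```

Because the digest depends on the file contents, replacing or re-saving an image changes its key, which is one reason the embeddings need to be recomputed when the image files change.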

Training takes about an hour to complete, depending on the size of your dataset.
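
To see how the training flags translate into passes over the data, the arithmetic can be written out as below. This is a rough sketch. It assumes the 10 images from the unzip step and that `--max_train_steps` counts optimizer updates, with the epoch count derived by ceiling division the way diffusers example scripts commonly do. Neither assumption is checked against the script here.

```python
import math

# Values taken from the command above; num_images is from the unzip listing.
num_images = 10
train_batch_size = 1
gradient_accumulation_steps = 4
max_train_steps = 500  # assumed to count optimizer updates

# Samples contributing to each optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Batches per pass over the data, then optimizer updates per pass.
batches_per_epoch = math.ceil(num_images / train_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)

# Passes over the data needed to reach max_train_steps.
num_epochs = math.ceil(max_train_steps / updates_per_epoch)

print(effective_batch_size, updates_per_epoch, num_epochs)
```

With these numbers each image is seen well over a hundred times, so for a larger dataset it may be worth revisiting `--max_train_steps` rather than reusing 500 as-is.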

## Inference

In [5]:
flush()

In [8]:
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
)
lora_output_path = "trained-sd3-lora-miniature"
pipeline.load_lora_weights(lora_output_path)

pipeline.enable_sequential_cpu_offload()

Loading pipeline components...:   0%|          | 0/9 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/28 [00:00<?, ?it/s]

In [12]:
image = pipeline("ad for black sunglasses with pale blue lenses with technological stylistic elements from sports drink ads showing the price as $50 and the name Gogglez somewhere on the advertisement, juice and colorful liquid surrounding the advertisement but NO BOTTLES anywhere on the ad").images[0]
image.save("test.png")

  0%|          | 0/28 [00:00<?, ?it/s]

## Image Inpainting
We will use this technique because fine-tuning is not yet natively supported for the image-to-image pipeline (`Image2ImagePipeline`), and it is the computationally cheaper option for now.

This is not classical image inpainting implemented by fine-tuning a model. Instead, generation works from a supplied input image rather than starting from scratch.
- https://arxiv.org/abs/2209.00647
- https://proceedings.neurips.cc/paper_files/paper/2023/hash/1e75f7539cbde5de895fab238ff42519-Abstract-Conference.html

Note that inference with `enable_sequential_cpu_offload()` is very slow, because individual model components are moved onto and off the GPU on demand, which introduces significant data-movement overhead. Refer to [this resource](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more memory-optimization techniques.