# **Running, monitoring and evaluating a training job**

### About this interactive guide

This Jupyter Notebook is part of our [A Step-by-Step Guide for Non-Technical Folks on Training Stable Diffusion with a low-cost Cloud GPU](https://learn2train.medium.com/a-step-by-step-guide-for-non-technical-folks-on-training-stable-diffusion-with-a-low-cost-cloud-gpu-344c6b250d64). 

In this guide, we'll cover the following topics in an interactive way:

1. **Fine-tuning a Stable Diffusion base model with a custom dataset**.
      
2. **Download the training dataset**. 

3. **Start the training job**
    
4. **Monitor your sample generations in Weights & Biases (W&B)**. W&B is a free tool used to visualise machine learning experiments. No installation is required as it will be run from a standalone webpage.

5. **Training is done** 

6. **Upload the fine-tuned models to Hugging Face (optional)** so you can re-use them later. Hugging Face is an open-source community for AI experts and enthusiasts. It’s free to use!
       
7. **Evaluate the fine-tuned checkpoints** to assess their performance.

8. **Terminate the GPU instance**. Avoid incurring charges by destroying the GPU instance. 


### Requirements

This interactive tutorial assumes you have:

- Set up the training application on a cloud GPU platform according to [this guide](https://learn2train.medium.com/a-step-by-step-guide-for-non-technical-folks-on-training-stable-diffusion-with-a-low-cost-cloud-gpu-344c6b250d64)
- A basic understanding of how Jupyter Notebooks work (if you don't, check this [cool introduction to Jupyter Notebook demo](https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb)!)
- A reliable internet connection.
- An updated browser such as Chrome, Safari, Firefox, etc. 
- Time to train (training on the dataset takes about 20 minutes)




# 1. Fine-tuning a Stable Diffusion base model with custom data

### Fine-tuning using a photographer's image dataset

In this notebook tutorial we will train a Stable Diffusion 1.5 base model in the style of a photographer that it doesn't know very well. By feeding the model a photographer's image dataset, it should be able to generate pictures in the style of the photographer.  

### Bella Kotak

In this notebook tutorial, we will fine-tune a Stable Diffusion text-to-image model using Bella Kotak's recent artwork. **Bella Kotak** is an award-winning UK-based photographer with a strong, distinctive style.

Check her Instagram account at [https://www.instagram.com/bellakotak](https://www.instagram.com/bellakotak) and be amazed!

### Before fine-tuning the base model

We hope you checked her portfolio, because it shows how much the base model needs to learn before it can generate decent-looking synthetic images in her artistic style. 

Since the base model wasn't trained with enough pictures of her artwork, it fails to portray her unique artistic vision. So if we prompt the base Stable Diffusion 1.5 model with **"a black and white photo of a woman wearing a floral crown and holding a bouquet of flowers in the style of Bella Kotak"**, the base model will struggle to generate a picture that represents her style, or even follow the prompt. 
  
![Bella before](https://drive.google.com/uc?export=view&id=1iUX_aMLQCulbcLMEMbta9GRsPk4VVG-i)

### After fine-tuning the base model

Thankfully, fine-tuning the base Stable Diffusion model with captioned images greatly improves its ability to generate better-looking pictures in her style. It even follows the prompt more closely. 

The image below was generated with a fine-tuned Stable Diffusion 1.5 model. It uses the same prompt, seed, resolution, and CFG values as the image above!

![Bella after](https://drive.google.com/uc?export=view&id=1GgOyCNIFAkjsvkVcYc7U3SlgppLXMJPX)

As you can see, it's not perfect (for one thing, it's not exactly black and white), yet the differences between the base model and the fine-tuned one are clearly noticeable. That's the power of training a Stable Diffusion base model with a custom dataset.

# 2. Download the training dataset

### Download and extract the dataset 

We are going to download an already prepared training dataset into our GPU instance.

A dataset is said to be prepared when every image has a caption describing it. It may or may not include other configuration settings read by the training application. 

Our image dataset contains 109 images, 109 text files, and 1 tag configuration file (`global.yaml`) that adds a suffix tag to each caption (in this case, it appends the phrase `in the style of Bella Kotak` to the caption of each image). For more information about how to create a dataset, please refer to chapter II of the tutorial.

This is an example of how images and caption files are formatted in our dataset:

* `image-name_001.jpg`
* `image-name_001.txt`  <= Same filename as the jpg file

The text file `image-name_001.txt` contains the caption describing `image-name_001.jpg`, say, for example: `a photo of a woman wearing a floral crown and holding a bouquet of flowers in the style of Bella Kotak`.
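
The pairing rule above (one `.txt` caption per image, same filename) can be checked with a few lines of Python. This is only a minimal sketch: the helper name `find_unpaired_images` and the list of image extensions are assumptions made for illustration, not part of the training application.

```python
from pathlib import Path

# Assumption: these are the image types we care about in the dataset
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_unpaired_images(dataset_dir):
    """Return the image files that have no caption file with the same name."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return []
    missing = []
    for image_path in sorted(root.rglob("*")):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # image-name_001.jpg must come with image-name_001.txt
        if not image_path.with_suffix(".txt").exists():
            missing.append(image_path)
    return missing


unpaired = find_unpaired_images("input/dataset")
print(f"{len(unpaired)} image(s) without a caption file")
```

Running it after the dataset has been extracted should report zero unpaired images if every image has its caption.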



Running the cell below will download a public ZIP file from Google Drive, extract it and store it in the **input** subfolder.


In [None]:
import os
import zipfile

# Install gdown (to be able to download files from Google Drive)
!pip install gdown

# Download dataset
os.makedirs('input', exist_ok=True)
path_to_dataset = "input/dataset.zip"

if not os.path.exists(path_to_dataset):
    !gdown 1Ifk07HeqxHfCCOCvb5oDF-cdxfkfsuq- -O input/dataset.zip
else:
    print(f"Already downloaded `{path_to_dataset}`")

# Unzip dataset into 'input' folder
with zipfile.ZipFile(path_to_dataset, 'r') as zip_ref:
    zip_ref.extractall('input/dataset')

# Remove zip file
os.remove(path_to_dataset)

# List input directory
%ls input/

print('Done')

# 3. Start the training job

Once our training images and their captions are inside the **input folder** we are ready to train the model. 


### Training configuration
We will override the following default training settings:

* **project name**: "sd1_kotak" <= Name of the project. Pick a name that distinguishes it from other training sessions.
* **data_root**: "input" <= Folder location of the training images
* **max epochs**: 60 <= An epoch is one complete pass of all the training images through the trainer. We are doing 60 complete passes.  
* **batch size**: 6 <= Number of images processed together in each training step
* **sample steps**: 80 <= Determines how frequently samples are generated. In this case, samples are generated every 80 training steps.   
* **save every n epochs**: 20 <= Checkpoints will be saved every 20 epochs (since we are doing 60 epochs, we will end with 3 checkpoints) 
* **save ckpt dir**: "output" <= Folder location of the saved checkpoints
* **zero_frequency_noise_ratio**: 0.04 <= This will make dark scenes more realistic  
* **optimizer_config**: optimizer-photo.json <= We add an optimiser config file to get better results
* **cond_dropout**: 0.0 <= Fraction of training steps in which an image's caption is dropped. With 0.0, the trainer never learns images without their captions


There are more configuration options not shown here. For a detailed explanation of each, check the [EveryDream 2 documentation](https://github.com/victorchall/EveryDream2trainer/blob/main/doc/TRAINING.md). 
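
To get a feel for how these numbers interact, here is a rough back-of-the-envelope sketch. It assumes that one epoch is simply the number of images divided by the batch size, rounded up. The trainer may group images differently (for example by aspect ratio), so the real step counts can differ. The function name `estimate_schedule` is made up for this example.

```python
import math


def estimate_schedule(num_images, batch_size, max_epochs,
                      sample_steps, save_every_n_epochs):
    """Rough estimate of training steps, sample rounds and checkpoints."""
    # Assumption: one step trains one batch, and a partial last batch still counts
    steps_per_epoch = math.ceil(num_images / batch_size)
    total_steps = steps_per_epoch * max_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Samples are generated every `sample_steps` training steps
        "sample_rounds": total_steps // sample_steps,
        # A checkpoint is saved every `save_every_n_epochs` epochs
        "checkpoints": max_epochs // save_every_n_epochs,
    }


schedule = estimate_schedule(num_images=109, batch_size=6, max_epochs=60,
                             sample_steps=80, save_every_n_epochs=20)
print(schedule)
```

Under these assumptions, the settings above give about 19 steps per epoch, about 1140 steps in total, roughly 14 rounds of samples, and 3 checkpoints.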

### Download the optimizer configuration file

Run the following cell to get the optimiser configuration settings to improve our training. 


In [None]:
!wget https://raw.githubusercontent.com/learn2train/l2t-sd/main/notebooks/optimizer-photo.json

### Set up Weights & Biases (W&B) for monitoring sample generation 

If you have a W&B account you can use it to track your training progress.  If you don't have one, you can create your W&B account for free at https://wandb.ai/site.

You can get your API key from your [User Settings](https://wandb.ai/settings). Paste it in the following cell where it says "PUT-YOUR-API-KEY-HERE", keep the double quotes, and then RUN the cell. 

In [None]:
wandb_token = "PUT-YOUR-API-KEY-HERE"

The cell above should look like:
    
`wandb_token = "28d37291d39f337237291d39f391d39f3"`

### Running the training session

To start training run the cell below. The cell will start printing its log. Keep scrolling down to monitor the current status of the training session. 

**IMPORTANT: If you see messages with a red background, IGNORE THEM as they are only warning messages** 

The training takes about 20 minutes on an RTX 3090 with 24GB of VRAM. 

While you wait for the `Training completed` message, watch the samples being generated in Weights & Biases (see cell below).

In [None]:
# Log in to W&B if a token was provided
wandb_settings = ""
if wandb_token:
  # Remove stale credentials (-f avoids an error if the file does not exist)
  !rm -f /root/.netrc
  !wandb login $wandb_token
  wandb_settings = "--wandb"

# Start the training

%run train.py --resume_ckpt "learn2train/stable-diffusion-v1-5" \
$wandb_settings \
--project_name "sd1_kotak" \
--data_root "input" \
--max_epochs 60 \
--sample_steps 80 \
--batch_size 6 \
--save_every_n_epochs 20 \
--zero_frequency_noise_ratio 0.04 \
--cond_dropout 0.0 \
--optimizer_config optimizer-photo.json \
--save_ckpt_dir "output"

# 4. Watch your samples in Weights & Biases while training is running

### W&B dashboard

Go to the [W&B dashboard](https://wandb.ai/home) in another browser tab. You will see your training run on your home page. 

Click on your training run to check the samples being generated. They should give you an idea of how well your model is learning. You can stop the training once you are satisfied with the results you are seeing.

Samples come in threes because each sample is generated with three different CFG values (1, 4 and 7).

![W&B](https://drive.google.com/uc?export=view&id=1G1fmv5uFN_pk57jBhmD7SVes4at-uv4C)
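
If you are curious about what the CFG value does, the sketch below shows the usual classifier-free guidance formula on toy numbers. At every denoising step the model makes one prediction without the prompt and one with the prompt, and CFG controls how far the result is pushed towards the prompted one. The lists here are made-up values for illustration only. Real predictions are large tensors, and `apply_cfg` is a hypothetical helper name.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Blend the prompt-free and prompted predictions using the CFG scale."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions (assumed values, not real model output)
uncond = [0.0, 1.0]   # prediction without the prompt
cond = [1.0, 3.0]     # prediction with the prompt

for cfg_scale in (1, 4, 7):
    print(cfg_scale, apply_cfg(uncond, cond, cfg_scale))
```

With a CFG of 1 the result is exactly the prompted prediction. Higher values exaggerate the difference between the two predictions, which is why the three samples in each group look increasingly "pushed" towards the prompt.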


# 5. Training is finished

Once the training is done you should see the following messages:

![Training is finished](https://drive.google.com/uc?export=view&id=1WXwNcHaKStpuusvReueriEJXsl3rLWRM)


That was it! The base model has been updated and you are now left with checkpoints.

Before terminating the GPU instance, you could download the checkpoints to your computer, or upload them to your Hugging Face repository.
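
Before deciding what to download or upload, it can help to see which checkpoint files exist and how big they are. The following is a small sketch under two assumptions: the checkpoints are in the `output` folder used in the training cell, and the file patterns listed in `patterns` cover your checkpoint files. The helper name `list_checkpoint_files` is made up for this example.

```python
from pathlib import Path


def list_checkpoint_files(output_dir="output",
                          patterns=("*.safetensors", "*.ckpt")):
    """Return (filename, size in bytes) for each checkpoint file found."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    found = []
    for pattern in patterns:  # assumed file patterns, adjust if needed
        for path in root.glob(pattern):
            found.append((path.name, path.stat().st_size))
    return sorted(found)


for name, size in list_checkpoint_files():
    print(f"{name}: {size / 1024**3:.2f} GB")
```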



# 6. Upload your checkpoints to Hugging Face (optional)

If you aren't downloading your checkpoints to your computer, you could save them to your Hugging Face repository.

### Get a Hugging Face token

If you haven't got one yet, have a look at [How to Host Stable Diffusion Checkpoints on Hugging Face for Free](https://learn2train.medium.com/a-step-by-step-guide-to-host-stable-diffusion-checkpoints-on-hugging-face-for-free-2098d0c18a01)

### Log in to your account 

Run the cell below and paste your **Hugging Face write token** into the prompt to log in to your account so you can upload data to your repo (NOTE: There's no need to go to the Hugging Face website: you will be logging in from the cell).

In [None]:
# Log in to Hugging Face

from huggingface_hub import notebook_login, hf_hub_download
import os
notebook_login()

### Upload checkpoints to your model repository

Make sure you are **logged in** to Hugging Face by running the login cell above first.

Use the cell below to upload one or more checkpoints to your personal Hugging Face repository. You should already be authorized to interact with Hugging Face if you ran the cell above.

When you run the cell below, a box will show up and you need to **CLICK** to select which `.safetensor` files are marked for upload. This allows you to select which ones to upload. If you don't click any of the checkpoints, nothing will happen.

You will also be required to fill-in your username and your repository name:
* Hugging Face username: **Your username** (look in your [Hugging Face account page](https://huggingface.co/settings/account)).
* Hugging Face repository name: **your-repo-name**

**WARNING**

**If your Hugging Face account is brand new, upload only 3 checkpoint files**. For safety reasons, Hugging Face limits the number of files a new user can upload. If you try to upload more than 3 checkpoint files, you'll probably get a warning telling you to wait 24 hours before uploading more. 


In [None]:
# Run this cell after reading the instructions of the cell above. 

import glob
import os
from huggingface_hub import HfApi
from ipywidgets import *

# The trailing wildcard matches both ".safetensor" and ".safetensors" files
all_ckpts = sorted(glob.glob("output/*.safetensor*"))
  
ckpt_picker = SelectMultiple(options=all_ckpts, layout=Layout(width="600px")) 
hfuser = Text(placeholder='Hugging Face username')
hfrepo = Text(placeholder='Hugging Face repository name')

api = HfApi()
upload_btn = Button(description='Upload')
out = Output()

def upload_ckpts(_):
    repo_id = f"{hfuser.value or hfuser.placeholder}/{hfrepo.value or hfrepo.placeholder}"
    with out:
        if len(ckpt_picker.value) < 1:
            print("Nothing selected for upload. Click one or more checkpoint files in the list, or check that the output folder contains checkpoint files.")
            return
        for ckpt in ckpt_picker.value:
            print(f"Uploading to HF: huggingface.co/{repo_id}/{ckpt}")
            # Each upload is opened as a pull request that you merge later
            response = api.upload_file(
                path_or_fileobj=ckpt,
                path_in_repo=ckpt,
                repo_id=repo_id,
                repo_type=None,
                create_pr=True,
            )
            display(response)
        print("DONE")

upload_btn.on_click(upload_ckpts)
box = VBox([ckpt_picker, HBox([hfuser, hfrepo]), upload_btn, out])

display(box)

### Save the uploads to your model repository

To actually save the uploaded checkpoints to your repo, go back to your Hugging Face model repository and click the **Community** tab. You'll see a list of one or more checkpoints. Go one by one and click **Merge** to save them to your model repository:

![Merge](https://drive.google.com/uc?export=view&id=1zyOcOq9uABW1dO69pNYenvsag1C7asyc)

# 7. Evaluate the fine-tuned checkpoints


### Test inference on your checkpoints

To recap: Training is over and you are left with model checkpoints (safetensor files). These checkpoints are updated fine-tuned models saved at different times during the training session. 

The main idea here is to evaluate each of your checkpoints to find the ones that generate the output you like the most.  

Run the following cell to display a mini text-to-image generator. You can choose any checkpoint, try them one after another, and set inference parameters such as **prompt, steps, CFG, resolution and seed**.

Have fun!
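
A fair comparison keeps everything except the checkpoint fixed, just like the before/after images at the top of this notebook (same prompt, seed, resolution and CFG). The sketch below only builds such a list of settings and does not generate any image. The names `resolve_seed` and `build_comparison` and the checkpoint folder names are hypothetical. The rule that a seed of `-1` means "pick a random seed" mirrors the generator cell below.

```python
import random


def resolve_seed(seed):
    """A seed of -1 means 'pick a random one'; any other value is used as is."""
    return seed if seed != -1 else random.randint(0, 2**30)


def build_comparison(checkpoints, prompt, seed=-1, steps=30, cfg=7.0,
                     width=512, height=512):
    """Return one settings dict per checkpoint, all sharing the same seed."""
    fixed_seed = resolve_seed(seed)  # resolved once, reused for every checkpoint
    return [
        {"checkpoint": ckpt, "prompt": prompt, "seed": fixed_seed,
         "steps": steps, "cfg": cfg, "width": width, "height": height}
        for ckpt in checkpoints
    ]


# Hypothetical checkpoint folder names, for illustration only
plan = build_comparison(["output/ckpt-epoch-20", "output/ckpt-epoch-40"],
                        "a photo of a woman in the style of Bella Kotak",
                        seed=42)
for job in plan:
    print(job["checkpoint"], job["seed"])
```

In practice this means writing down the seed printed by the generator for the first checkpoint, then typing that same seed in the **Seed** box for the others.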

In [None]:
from ipywidgets import *
from IPython.display import display, clear_output
import os
import gc
import random
import torch
import inspect

from torch import autocast
from diffusers import StableDiffusionPipeline, AutoencoderKL, UNet2DConditionModel, DDIMScheduler, DDPMScheduler, PNDMScheduler, EulerAncestralDiscreteScheduler
from transformers import CLIPTextModel, CLIPTokenizer


# Find every Diffusers checkpoint folder (it contains a model_index.json)
checkpoints_ts = []
for root, dirs, files in os.walk("."):
    for file in files:
        if file == "model_index.json":
            ts = os.path.getmtime(os.path.join(root, file))
            checkpoints_ts.append((ts, root))

# Newest checkpoints first
checkpoints = [ckpt for (_, ckpt) in sorted(checkpoints_ts, reverse=True)]
full_width = Layout(width='600px')
half_width = Layout(width='300px')

checkpoint = Dropdown(options=checkpoints, description='Checkpoint:', layout=full_width)
prompt = Textarea(value='a photo of ', description='Prompt:', layout=full_width)
height = IntSlider(value=512, min=256, max=768, step=32, description='Height:', layout=half_width)
width = IntSlider(value=512, min=256, max=768, step=32, description='Width:', layout=half_width)
cfg = FloatSlider(value=7.0, min=0.0, max=14.0, step=0.2, description='CFG Scale:', layout=half_width)
steps = IntSlider(value=30, min=10, max=100, description='Steps:', layout=half_width)
seed = IntText(value=-1, description='Seed:', layout=half_width)
generate_btn = Button(description='Generate', layout=full_width)
out = Output()

def generate(_):
    with out:
        clear_output()
        display(f"Loading model {checkpoint.value}")
        actual_seed = seed.value if seed.value != -1 else random.randint(0, 2**30)

        text_encoder = CLIPTextModel.from_pretrained(checkpoint.value, subfolder="text_encoder")
        vae = AutoencoderKL.from_pretrained(checkpoint.value, subfolder="vae")
        unet = UNet2DConditionModel.from_pretrained(checkpoint.value, subfolder="unet")
        tokenizer = CLIPTokenizer.from_pretrained(checkpoint.value, subfolder="tokenizer", use_fast=False)
        scheduler = DDIMScheduler.from_pretrained(checkpoint.value, subfolder="scheduler")
        text_encoder.eval()
        vae.eval()
        unet.eval()

        text_encoder.to("cuda")
        vae.to("cuda")
        unet.to("cuda")

        pipe = StableDiffusionPipeline(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=None, # save VRAM
            requires_safety_checker=False, # avoid nag
            feature_extractor=None, # must be None if there is no safety checker
        )

        pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
        
        print(inspect.cleandoc(f"""
              Prompt: {prompt.value}
              Resolution: {width.value}x{height.value}
              CFG: {cfg.value}
              Steps: {steps.value}
              Seed: {actual_seed}
              """))
        with autocast("cuda"):
            image = pipe(prompt.value, 
                generator=torch.Generator("cuda").manual_seed(actual_seed),
                num_inference_steps=steps.value, 
                guidance_scale=cfg.value,
                width=width.value,
                height=height.value
            ).images[0]
        del pipe
        gc.collect()
        with torch.cuda.device("cuda"):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        display(image)
            
generate_btn.on_click(generate)
box = VBox(
    children=[
        checkpoint, prompt, 
        HBox([VBox([width, height]), VBox([steps, cfg])]), 
        seed, 
        generate_btn, 
        out]
)


display(box)

# 8. Terminate your GPU instance when you are done

Don't forget to terminate your cloud GPU instance once you are done evaluating your checkpoints, otherwise you will still be charged. Check the last section of the previous chapter to see how to terminate your instance. 

Note that once you terminate your instance **Jupyter Lab** will stop working. If you want to use it again you'll have to start a new training session on the same or a different GPU instance, and start all over again. 