<a href="https://colab.research.google.com/github/nyp-sit/nypi/blob/main/day4am/stable_diffusion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Diffusion with 🤗 Diffusers

This notebook introduces Stable Diffusion, an open source text-to-image model that was among the highest-quality available when this notebook was written. It is also small enough to run on consumer GPUs rather than only in a datacenter. We use the 🤗 Hugging Face [🧨 Diffusers library](https://github.com/huggingface/diffusers), the recommended library for working with diffusion models at the time of writing.

This notebook shows what Stable Diffusion can do and gives a glimpse of its main components. We will not cover training or fine-tuning Stable Diffusion, which requires significantly more time and compute resources.

*Acknowledgement: This notebook is adapted from the FastAI diffusion course*

In [None]:
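# fastdownload fetches the sample images; gradio is kept below 4 because the demo uses Gradio 3's gr.Image(source=..., tool=...)
!pip install -Uq diffusers transformers fastcore fastdownload "gradio<4"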

## Using Stable Diffusion

To run Stable Diffusion on your computer you have to accept the model license. It's an open CreativeML OpenRail-M license that claims no rights on the outputs you generate and prohibits you from deliberately producing illegal or harmful content. The [model card](https://huggingface.co/CompVis/stable-diffusion-v1-4) provides more details.

In [None]:
import logging
from pathlib import Path

import matplotlib.pyplot as plt
import torch

from diffusers import StableDiffusionPipeline
from PIL import Image
from fastcore.all import concat
logging.disable(logging.WARNING)

torch.manual_seed(1)

### Stable Diffusion Pipeline

[`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion#diffusers.StableDiffusionPipeline) is an end-to-end [diffusion inference pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion) that allows you to start generating images with just a few lines of code. Many Hugging Face libraries (along with other libraries such as scikit-learn) use the concept of a "pipeline" to indicate a sequence of steps that when combined complete some task. We'll look at the individual steps of the pipeline later -- for now though, let's just use it to see what it can do.

When we say "inference" we're referring to using an existing model to generate samples (in this case, images), as opposed to "training" (or fine-tuning) models using new data.

We use [`from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) to create the pipeline and download the pretrained weights. We indicate that we want to use the `fp16` (half-precision) version of the weights, and we tell `diffusers` to expect the weights in that format. This allows us to perform much faster inference with almost no discernible difference in quality. The string passed to `from_pretrained` in this case (`stabilityai/stable-diffusion-2-1`) is the repo id of a pretrained pipeline hosted on [Hugging Face Hub](https://huggingface.co/models); it can also be a path to a directory containing pipeline weights. The weights for all the models in the pipeline will be downloaded and cached the first time you run this cell.
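To make the half-precision trade-off concrete, here is a small sketch that uses only Python's standard library and does not touch the pipeline. The `struct` format codes `e` and `f` stand for IEEE half and single precision, so we can compare storage size and rounding error directly. The helper name `round_trip` is ours, for illustration only.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store `value` in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# "<e" is IEEE 754 half precision (fp16), "<f" is single precision (fp32).
bytes_fp16 = struct.calcsize("<e")
bytes_fp32 = struct.calcsize("<f")
print(f"bytes per value: fp16={bytes_fp16}, fp32={bytes_fp32}")

# fp16 keeps fewer significant digits, so the stored value is slightly off.
x = 0.1
err_fp16 = abs(round_trip(x, "<e") - x)
err_fp32 = abs(round_trip(x, "<f") - x)
print(f"rounding error for {x}: fp16={err_fp16:.2e}, fp32={err_fp32:.2e}")
```

Each value takes half the space in fp16, at the cost of a larger (but still small) rounding error.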

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", revision="fp16", torch_dtype=torch.float16).to("cuda")

The weights are cached in your home directory by default.

In [None]:
!ls ~/.cache/huggingface/hub

We are now ready to use the pipeline to start creating images.

In [None]:
prompt = "a photograph of an astronaut riding a horse"

In [None]:
torch.manual_seed(1024)
pipe(prompt).images[0]

You will have noticed that running the pipeline shows a progress bar with a certain number of steps. This is because Stable Diffusion is based on a progressive denoising algorithm that is able to create a convincing image starting from pure random noise. Models in this family are known as _diffusion models_. Here's an example of the process (from random noise at top to progressively improved images towards the bottom) of a model drawing handwritten digits.

![digit_diffusion](https://raw.githubusercontent.com/nyp-sit/nypi/main/day4/digit_diffusion.png)
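The idea of progressive denoising can be imitated with a toy loop. This is only a sketch under strong assumptions: the "model" below is an oracle that already knows the clean target, whereas a real diffusion model has to predict the noise with a neural network, and real schedulers use more elaborate update rules. The function name `toy_denoise` and the four-value "image" are made up for illustration.

```python
import random

def toy_denoise(target, num_steps, seed=0):
    """Walk from random noise to `target`, returning the mean error after each step.

    The "model" here is an oracle that knows the exact noise (current - target);
    a real diffusion model has to predict that noise with a neural network.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    errors = []
    for step in range(num_steps):
        remaining = num_steps - step
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a 1/remaining share of the noise that is still present.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
        errors.append(sum(abs(xi - ti) for xi, ti in zip(x, target)) / len(target))
    return x, errors

target = [0.0, 0.5, 1.0, 0.5]  # stand-in for the pixels of a "clean" image
final, errors = toy_denoise(target, num_steps=4)
print([round(e, 3) for e in errors])  # the error shrinks at every step
```

Each pass removes part of the remaining noise, which is why the pipeline reports its progress as a number of steps.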

In [None]:
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=3).images[0]

In [None]:
torch.manual_seed(1024)
pipe(prompt, num_inference_steps=16).images[0]

### Classifier-Free Guidance

In [None]:
def image_grid(imgs, rows, cols):
    w,h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    for i, img in enumerate(imgs): grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

_Classifier-Free Guidance_ is a method to increase the adherence of the output to the conditioning signal we used (the text).

Roughly speaking, the larger the guidance the more the model tries to represent the text prompt. However, large values tend to produce less diversity. The default is `7.5`, which represents a good compromise between variety and fidelity. This [blog post](https://benanne.github.io/2022/05/26/guidance.html) goes into deeper details on how it works.
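In formula terms, guidance is commonly written as an extrapolation from the unconditional noise prediction towards the text-conditioned one. The sketch below applies that formula to plain Python lists standing in for the two predictions. The function name `apply_guidance` and the toy numbers are illustrative assumptions, not part of the `diffusers` API.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Combine unconditional and text-conditioned predictions.

    guidance_scale = 1 returns the conditioned prediction unchanged;
    larger values push further along the (cond - uncond) direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]   # toy prediction without the prompt
cond = [1.0, 1.0]     # toy prediction with the prompt

for g in [1.0, 3.0, 7.5]:
    print(g, apply_guidance(uncond, cond, g))
```

Only the component where the two predictions disagree is amplified; where they agree, the guidance scale has no effect.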

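We can generate multiple images for the same prompt by simply passing a list of prompts instead of a string. Here, however, we generate one image per guidance scale so that we can compare their effect side by side.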

In [None]:
images = [pipe(prompt, guidance_scale=g).images[0] for g in [1.1, 3, 7, 14]]

In [None]:
image_grid(images, rows=1, cols=4)

### Negative prompts

_Negative prompting_ replaces the completely unconditioned generation used in classifier-free guidance with the generation for a second prompt, and scales the difference between that generation and the one conditioned on the main prompt.

In [None]:
torch.manual_seed(1024)
prompt = "Labrador wearing a hat in the style of Vermeer"
pipe(prompt).images[0]

In [None]:
torch.manual_seed(1024)
pipe(prompt, negative_prompt="yellow color").images[0]

By using the negative prompt we move more towards the direction of the positive prompt, effectively reducing the importance of the negative prompt in our composition.
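One way to picture this, assuming the prediction for the negative prompt simply takes the place of the unconditional prediction in the guidance formula (which is our understanding of how `diffusers` treats `negative_prompt`), is the toy sketch below. The two-dimensional "predictions", the axis meanings, and the name `guide_away_from` are invented for illustration.

```python
def guide_away_from(negative, positive, guidance_scale=7.5):
    """Start at the negative-prompt prediction and extrapolate past the positive one."""
    return [n + guidance_scale * (p - n) for n, p in zip(negative, positive)]

# Toy 2-D "predictions": first axis ~ how much hat, second axis ~ how much yellow.
positive = [1.0, 0.5]   # stands in for "Labrador wearing a hat ..."
negative = [0.2, 0.9]   # stands in for "yellow color"

guided = guide_away_from(negative, positive, guidance_scale=2.0)
print(guided)
```

The guided result ends up with less "yellow" than either input and more "hat" than the positive prediction alone, because it moves along the direction that leads from the negative prompt to the positive one.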

### Image to Image

Even though Stable Diffusion was trained to generate images, and optionally drive the generation using text conditioning, we can use the raw image diffusion process for other tasks.

For example, instead of starting from pure noise, we can start from an image and add a certain amount of noise to it. We are replacing the initial steps of the denoising and pretending our image is what the algorithm came up with. Then we continue the diffusion process from that state as usual.

This usually preserves the composition although details may change a lot. It's great for sketches!

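These operations (provide an initial image, add some noise to it and run diffusion from there) can be performed automatically by a dedicated pipeline. Here we use `StableDiffusionDepth2ImgPipeline`, which also estimates a depth map of the initial image and uses it as extra conditioning, helping to preserve the structure of the scene.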
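How much of the initial image survives is controlled by a `strength` value between 0 and 1 (the cell further down uses `0.85`). As far as we understand the `diffusers` image-to-image pipelines, `strength` decides how many of the scheduled denoising steps are actually run: the early steps are skipped and the noised input image stands in for their result. The sketch below mirrors that arithmetic. The helper name `steps_actually_run` is hypothetical, and the exact rounding should be treated as an assumption.

```python
def steps_actually_run(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps executed for a given `strength` in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)

for strength in [0.5, 0.85, 1.0]:
    run = steps_actually_run(50, strength)
    print(f"strength={strength}: skip {50 - run} steps, run {run}")
```

A `strength` of 1.0 runs every step and therefore behaves like starting from pure noise, while small values change the input only slightly.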

In [None]:
from diffusers import StableDiffusionDepth2ImgPipeline
from fastdownload import FastDownload

In [None]:
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")

In [None]:
p = FastDownload().download('https://raw.githubusercontent.com/nyp-sit/nypi/main/day4am/lala-land.png')
init_image = Image.open(p).convert("RGB")
init_image

In [None]:
torch.manual_seed(2000)
prompt = "Two men are wrestling"
# negative_prompt = ''
strength = 0.85
images = pipe(prompt=prompt, num_images_per_prompt=3, image=init_image, strength=strength).images


In [None]:
image_grid(images, rows=1, cols=3)

### In-painting

Inpainting is a process where missing parts of an artwork are filled in to present a complete image.

In [None]:
from diffusers import StableDiffusionInpaintPipeline

model_path = "stabilityai/stable-diffusion-2-inpainting"

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
).to('cuda')

In [None]:
img_url = "https://raw.githubusercontent.com/nyp-sit/nypi/main/day4am/dog_on_bench.png"
mask_url = "https://raw.githubusercontent.com/nyp-sit/nypi/main/day4am/dog_on_bench_mask.png"

In [None]:
p = FastDownload().download(img_url)
image = Image.open(p).convert('RGB').resize((512,512))
image

The mask that we download represents the part that is removed (missing). We will later get the diffusion model to fill in content based on our text prompt.
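The mask convention can be pictured with a tiny compositing sketch: white mask pixels (1) are replaced, black pixels (0) are kept. This is only a picture of the convention. The actual in-painting pipeline works in latent space and passes the mask to the model rather than blending pixels afterwards. The function name `composite` and the 2×2 "images" are made up for illustration.

```python
def composite(original, generated, mask):
    """Blend two equally sized grayscale images using a mask with values in [0, 1].

    mask = 1 (white) takes the generated pixel, mask = 0 (black) keeps the original.
    """
    return [
        [m * g + (1 - m) * o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]

original  = [[10, 10], [10, 10]]
generated = [[99, 99], [99, 99]]
mask      = [[0, 1], [0, 0]]     # only the top-right pixel is "missing"

print(composite(original, generated, mask))
```

Only the masked pixel takes its value from the generated image; everything else is carried over from the original.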

In [None]:
p = FastDownload().download(mask_url)
mask_image = Image.open(p).resize((512, 512))
mask_image

In [None]:
prompt = "Cat sitting on the bench."

guidance_scale=7.5
num_samples = 3
generator = torch.Generator(device="cuda").manual_seed(100) # change the seed to get different results

images = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    guidance_scale=guidance_scale,
    generator=generator,
    num_inference_steps=80,
    num_images_per_prompt=num_samples,
).images

In [None]:
# insert initial image in the list so we can compare side by side
images.insert(0, image)
image_grid(images, 1, num_samples + 1)

### Gradio demo of In-painting


In the code below, we build an easy-to-use Gradio app that lets you draw a mask on your own image. We then use that mask to do the in-painting as before.

In [None]:
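def predict(inputs, prompt):
    # `inputs` is the dict produced by the sketch tool: the uploaded image and the drawn mask
    guidance_scale = 7.5
    image = inputs['image'].convert("RGB").resize((512, 512))
    mask_image = inputs['mask'].convert("RGB").resize((512, 512))
    images = pipe(
        prompt=prompt,
        image=image,
        mask_image=mask_image,
        guidance_scale=guidance_scale,
        generator=generator,  # reuses the seeded generator created in the earlier in-painting cell
        num_inference_steps=80).images

    return images[0]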

In [None]:
import gradio as gr

gr.Interface(
    predict,
    title = 'Stable Diffusion In-Painting',
    inputs=[
        gr.Image(source = 'upload', tool = 'sketch', type = 'pil'),
        gr.Textbox(label = 'prompt')
    ],
    outputs = [
        gr.Image()
        ]
).launch(debug=True, share=True)