Stable Diffusion is a **deep learning model** for **generating images** from text.
It became widely known because its weights are openly available and it can run on consumer GPUs.

---

## What it is (in simple words)

Stable Diffusion is a type of **text-to-image generative AI model**.

You write a prompt → it creates a picture.

Example:

> “a realistic tiger sitting on a chair in a classroom”

→ Stable Diffusion will generate that image.

---

## What type of model is it?

It is a **Latent Diffusion Model (LDM)**.

This is a sub-type of “Diffusion Models”.

---

## How it works (simple explanation)

It works in 3 main stages:

| Step                                           | Meaning                                                     |
| ---------------------------------------------- | ----------------------------------------------------------- |
| 1) Start with random noise                     | The model begins from a random noise tensor in latent space |
| 2) Gradually remove the noise                  | The U-Net predicts and removes a little noise at each step  |
| 3) Turn hidden representation into final image | The VAE decoder turns the final latent into a full image    |

### More clearly:

* The text prompt is encoded into vectors by the **CLIP text encoder**
* That text embedding tells the model what kind of image to form
* The diffusion model (a U-Net) removes noise step by step so that the result matches the meaning of the text
* A decoder (the VAE decoder) converts the final latent representation into an actual pixel image

---

## Why is it called “latent”?

Because the diffusion happens in **latent space** (compressed representation) instead of full resolution pixels.

This makes it:

* faster
* cheaper to run
* possible to run on less powerful GPUs (e.g. RTX 3060)

* Pixel-space diffusion models denoise every pixel directly → very heavy
* Stable Diffusion denoises a small latent instead → much lighter
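
To make the saving concrete, here is a rough count of how many values the denoiser has to handle per image. The 512×512 RGB image and the 4-channel 64×64 latent are assumed shapes (typical for Stable Diffusion v1), used here only for illustration.

```python
# Assumed shapes: a 512x512 RGB image and a 4-channel 64x64 latent
# (the latent is 8x smaller along each side).
pixel_values = 3 * 512 * 512
latent_values = 4 * 64 * 64

# How many times fewer values the diffusion process works on.
compression = pixel_values / latent_values
print(pixel_values, latent_values, compression)
```

Under these assumptions the denoiser works on 48 times fewer values than it would in pixel space.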

---

## Why is Stable Diffusion popular?

| Reason       | Explanation                              |
| ------------ | ---------------------------------------- |
| Open-source  | anyone can download the weights and fine-tune them |
| Good quality | images look artistic & realistic         |
| Customizable | LoRA, DreamBooth, Textual Inversion      |
| Runs locally | works on a single consumer GPU, unlike hosted-only models such as DALL-E |

---

### Summary in one sentence

Stable Diffusion is a latent diffusion model that turns text into images by starting from random noise and removing that noise step by step, guided by a text embedding.



### What are “Generative Models”?

In machine learning, **generative models** are models that can **generate new data** that looks similar to the real data they were trained on.

Example:

* If trained on images → they can generate new images
* If trained on text → they can generate new text sentences
* If trained on music → they can generate new music

They learn the **distribution** of the data, and then **sample** from it.

---

### So, diffusion models are one type of generative model

Other types of generative models include:

| Type of Generative Model | Example models                 |
| ------------------------ | ------------------------------ |
| GANs                     | StyleGAN, CycleGAN             |
| VAEs                     | Variational Autoencoders       |
| Autoregressive           | GPT, PixelRNN                  |
| Flow-based               | Glow                           |
| Diffusion Models         | DDPM, Stable Diffusion, Imagen |

---

### What is special about diffusion models?

Diffusion models are built around two processes:

1. **Forward process**: take a real image → gradually add noise → until it becomes pure noise
2. **Reverse process**: learn to remove the noise step by step → until it becomes a real image again

After training, the model starts from random noise and removes it step by step → to generate a fresh new image that never existed before.

So diffusion models learn the reverse denoising process.
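
As a rough illustration of what "learning the reverse process" means in training, the sketch below noises a batch of images to a random step and asks a model to predict the noise that was added. The linear schedule, `T = 1000`, and the single convolution standing in for the real denoiser are all assumptions made to keep the example small; a real denoiser would also receive the step `t` as input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

T = 1000                                           # number of noise steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # toy stand-in for the real denoiser

def training_loss(x0):
    t = torch.randint(0, T, (x0.shape[0],))        # one random step per image
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process in closed form: mix the clean image with noise.
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The model is trained to predict the noise that was added.
    return F.mse_loss(model(x_t), noise)

loss = training_loss(torch.randn(4, 3, 16, 16))
loss.backward()
print(loss.item())
```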

---

### Why are they good?

| Advantage             | Meaning                                            |
| --------------------- | -------------------------------------------------- |
| Stable to train       | Training is more stable than for GANs, which can suffer from mode collapse |
| Very realistic images | Good global coherence                              |
| Easy to condition     | You can add text, a segmentation map, a depth map, etc. |

This is why diffusion models dominate image generation today: Stable Diffusion, Imagen, and DALL-E 3 are all diffusion-based.

---

### Summary

**Generative models** = models that can generate new samples.

**Diffusion models** = one category of generative models that generate data by **starting with noise and denoising it step-by-step**.

---




## Why do we think in terms of a “probability distribution”?

Because in the real world things are **not fixed**; they **vary**.

Example:

* People's ages vary
* People's heights vary

So a model needs to learn **how likely** things are.

It should learn:

* which values are common
* which values are rare
* which values almost never happen

For example: a 3-year-old child being 130 cm tall is very unlikely → so the probability of that combination should be very low.

So a generative model learns the **joint probability** of all variables together.

In images, every pixel is a random variable too.
But neighbouring pixels are strongly related → so the model has to learn how a very large number of variables depend on each other.

---

## Why is this important?

Because if the model learns this distribution very well, then:

* we can *sample* from it (like throwing a special weighted coin)

Sampling means: randomly selecting values based on how probable they are.

→ This gives us **new realistic data**.
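
A small sketch of what sampling looks like in code, using a made-up distribution over three height categories. The labels and probabilities are illustrative assumptions, not real data.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy distribution over three height ranges.
outcomes = ["short", "average", "tall"]
probs = torch.tensor([0.2, 0.7, 0.1])

# Draw 10,000 samples; each index is picked in proportion to its probability.
samples = torch.multinomial(probs, num_samples=10_000, replacement=True)
frequencies = torch.bincount(samples, minlength=3).float() / samples.numel()

for name, p, f in zip(outcomes, probs.tolist(), frequencies.tolist()):
    print(f"{name}: probability {p:.2f}, sampled frequency {f:.3f}")
```

Common values show up often and rare values show up rarely, which is the behaviour we want from a generative model, only over images instead of three categories.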

---

## How this connects to Stable Diffusion:

Stable Diffusion has been trained on a huge number of images and learns a very complex probability distribution over images (pixels, shapes, patterns, etc.).

After training:

* We “sample” from the distribution
* We get new images that look realistic
* But the images are **new** (not taken from training data)



<img src="asset/sd_forward_reverse.png" width=800>

## 1) Forward Process (Diffusion)

**Idea / example:**
We take a real image → and keep adding a little random noise step by step → until it becomes pure noise (like static building up on a TV screen).

**Why this step exists:**
So we get pairs of clean and noisy images.
Because we know exactly what noise was added, these pairs give the model a target to learn from.
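
The forward process needs no learning, so it can be written in a few lines. The sketch below uses the closed form that jumps directly to step `t`; the linear schedule and `T = 1000` are assumptions chosen for illustration, and `add_noise` is a hypothetical helper name.

```python
import torch

torch.manual_seed(0)

T = 1000                                   # number of noise steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise

x0 = torch.rand(1, 3, 64, 64) * 2 - 1      # stand-in "image" scaled to [-1, 1]
for t in (0, 250, 999):
    x_t, noise = add_noise(x0, t)
    # The fraction of the original signal that survives shrinks as t grows.
    print(t, f"signal kept: {alphas_cumprod[t].sqrt().item():.3f}")
```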

---

## 2) Reverse Process (Diffusion)

**Idea / example:**
Now the model learns the opposite direction → how to remove noise step by step → until the image becomes clear.

**Why this step matters:**
During generation, we start from pure noise and go backwards (step-by-step denoising) to create a brand new image that never existed before.
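
A minimal sketch of that backwards loop, written in the style of a DDPM sampler. The schedule, the small `T`, and the `predict_noise` stand-in (which returns zeros instead of calling a trained U-Net) are assumptions, so the output here is not a meaningful image; the point is the shape of the loop.

```python
import torch

torch.manual_seed(0)

T = 50                                     # few steps to keep the demo fast (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def predict_noise(x_t, t):
    # Stand-in for the trained U-Net; a real model returns its noise estimate.
    return torch.zeros_like(x_t)

def reverse_step(x_t, t):
    """One DDPM-style denoising step from x_t to x_{t-1}."""
    eps = predict_noise(x_t, t)
    mean = (x_t - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                        # no fresh noise on the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)

x = torch.randn(1, 4, 8, 8)                # start from pure noise
for t in reversed(range(T)):
    x = reverse_step(x, t)
print(x.shape)
```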

---

## 3) VAE + Latent Space

**Idea / example:**
Images are large and highly redundant in pixel space.
A VAE compresses images into a **small hidden space** where similar images are mapped close together.

Think of it like this: instead of working on a full image (for example 512×512 RGB), we work on a compact 64×64 latent with 4 channels.
This smaller space is smooth: small changes in it produce smooth visual changes in the image (not random garbage pixels).

<img src="asset/sd_vae_concept.png" width=800>

**Why Stable Diffusion needs VAE:**
Diffusion happens in this **latent space** (not pixels).
This makes generation faster, cheaper, and more stable.
After denoising is done, the VAE decoder converts the final latent back to the full image.

---

### Final One-Line Summary

Forward adds noise to learn destruction → Reverse removes noise to generate images → VAE gives a smooth small “latent world” where diffusion happens efficiently.


## Text-to-Image Model

<img src="asset/sd_text_to_image.png" width=800>

This is the architecture for a text-to-image diffusion model.

It takes random noise and a text prompt (encoded by CLIP). A U-Net then iteratively denoises the latent for T steps, using the text embedding and a time embedding as conditioning, until a final decoder converts the refined latent into the output image.
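
The U-Net has to know which step it is on, which is the job of the time embedding. One common choice is a sinusoidal embedding like the one sketched below; the function name and the size `dim=320` are assumptions, not values taken from the diagram.

```python
import math
import torch

def time_embedding(timestep, dim=320):
    """Sinusoidal embedding of a scalar timestep (dim=320 is an assumed size)."""
    half = dim // 2
    # Frequencies fall off geometrically from 1 towards 1/10000.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = torch.tensor([timestep], dtype=torch.float32)[:, None] * freqs[None]
    # Concatenate cosines and sines into one (1, dim) vector.
    return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)

emb = time_embedding(981)
print(emb.shape)
```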

## Image-to-Image Model

<img src="asset/sd_image_to_image.png" width=800>

This diagram shows an **Image-to-Image** diffusion architecture.

It's similar to the Text-to-Image model, but instead of starting with random noise, it **encodes an input image (X)** into a latent representation (Z) and **adds noise** to it. This noised latent is then iteratively refined by the U-Net, guided by the **text prompt** (from CLIP). This process allows the model to **modify an existing image** based on the prompt, rather than generating one from scratch.
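
The "encode, then add noise" step can be sketched as follows. The `strength` parameter, the `noised_start` helper, and the linear schedule are hypothetical names and assumptions for illustration; the random tensor stands in for the VAE-encoded image Z.

```python
import torch

torch.manual_seed(0)

T = 1000                                       # number of noise steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noised_start(latent, strength):
    """Noise an encoded image up to a step chosen by `strength` in (0, 1]."""
    t_start = round(strength * T) - 1          # higher strength -> later, noisier step
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(latent)
    return a_bar.sqrt() * latent + (1.0 - a_bar).sqrt() * noise, t_start

z = torch.randn(1, 4, 64, 64)                  # stand-in for the encoded input image Z
for strength in (0.3, 0.8):
    z_t, t_start = noised_start(z, strength)
    print(strength, t_start, f"signal kept: {alphas_cumprod[t_start].sqrt().item():.3f}")
```

Denoising then starts from `t_start` rather than from the last step, so a low strength keeps the output close to the input image and a high strength gives the prompt more freedom.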

## In-Painting Model

<img src="asset/sd_image_edit.png" width=800 >

This diagram shows an **In-Painting** architecture.

This model is designed to **regenerate or change only a specific, masked portion** of an image while keeping the rest untouched.

It works by:
1.  Encoding the original image (X) and the text prompt ("A dog running").
2.  Running the standard diffusion process (U-Net) guided by the text prompt.
3.  **Crucially**, at each denoising step, it **combines** the model's current output with the **original, unmasked parts** of the image (taken from a copy of the original latent that has been noised to the same step).

This process "fools" the model by forcing it to preserve the unmasked areas, ensuring that it only generates new content (making the dog run) *inside* the specified mask, effectively replacing the original content (the ball) in that region.
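
The combining step comes down to a masked blend. Below is a minimal sketch with random tensors standing in for the real latents; `blend` is a hypothetical helper, and the mask convention (1 = regenerate, 0 = keep) is an assumption.

```python
import torch

torch.manual_seed(0)

def blend(generated, original_noised, mask):
    """Keep generated content where mask == 1 and the original elsewhere."""
    return mask * generated + (1.0 - mask) * original_noised

generated = torch.randn(1, 4, 8, 8)            # stand-in for the U-Net's current latent
original_noised = torch.randn(1, 4, 8, 8)      # original latent noised to the same step
mask = torch.zeros(1, 1, 8, 8)                 # broadcasts over the 4 channels
mask[..., 2:6, 2:6] = 1.0                      # region to regenerate

out = blend(generated, original_noised, mask)
print(out.shape)
```

Applying this after every denoising step keeps the area outside the mask tied to the original image while the area inside is free to change.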

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
class VAE_Residual(nn.Module):
    """Residual block: two GroupNorm -> SiLU -> 3x3 conv stages plus a skip connection."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # nn.GroupNorm takes `num_channels`; it has no `in_channels` argument.
        self.groupnorm_1 = nn.GroupNorm(num_groups=32, num_channels=in_channels)
        self.conv_1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

        self.groupnorm_2 = nn.GroupNorm(num_groups=32, num_channels=out_channels)
        self.conv_2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

        # Match channel counts on the skip path with a 1x1 conv when they differ.
        if in_channels == out_channels:
            self.residual_layer = nn.Identity()
        else:
            self.residual_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (B, in_channels, H, W) -> (B, out_channels, H, W)
        residue = x

        x = self.groupnorm_1(x)
        x = F.silu(x)
        x = self.conv_1(x)

        x = self.groupnorm_2(x)
        x = F.silu(x)
        x = self.conv_2(x)

        # Add the (possibly projected) input back to the output.
        return x + self.residual_layer(residue)

In [4]:
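class VAE_AttentionBlock(nn.Identity):
    # Placeholder: nn.Identity ignores its constructor arguments and returns its
    # input unchanged, so VAE_AttentionBlock(512) works until attention is implemented.
    pass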
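
The attention block is still a placeholder at this point. As a rough idea of what it could contain, here is a hypothetical sketch of self-attention over the spatial positions of a feature map, built on `nn.MultiheadAttention`. The class name `SelfAttentionSketch`, the single head, and the GroupNorm in front are assumptions, not the notebook's final design.

```python
import torch
import torch.nn as nn

class SelfAttentionSketch(nn.Module):
    """Hypothetical stand-in: self-attention across all H*W positions of a feature map."""

    def __init__(self, channels):
        super().__init__()
        self.groupnorm = nn.GroupNorm(num_groups=32, num_channels=channels)
        self.attention = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x):
        residue = x
        b, c, h, w = x.shape
        x = self.groupnorm(x)
        # (B, C, H, W) -> (B, H*W, C): every spatial position becomes one token.
        x = x.reshape(b, c, h * w).transpose(1, 2)
        x, _ = self.attention(x, x, x)
        # Back to (B, C, H, W), then add the skip connection.
        x = x.transpose(1, 2).reshape(b, c, h, w)
        return x + residue

block = SelfAttentionSketch(512)
out = block(torch.randn(1, 512, 8, 8))
print(out.shape)
```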

In [None]:
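class VAE_Encoder(nn.Sequential):
    """Compresses a (B, 3, H, W) image into a (B, 4, H/8, W/8) latent."""

    def __init__(self):
        # The layers are passed to super().__init__(), so the base class must be
        # nn.Sequential rather than nn.Module.
        super().__init__(
            # (B, 3, H, W) -> (B, 128, H, W). No stride here: only the three
            # stride-2 convolutions below downsample, giving 8x in total.
            nn.Conv2d(3, 128, kernel_size=3, padding=1),

            VAE_Residual(128, 128),
            VAE_Residual(128, 128),

            # (B, 128, H, W) -> (B, 128, H/2, W/2)
            nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=0),

            VAE_Residual(128, 256),
            VAE_Residual(256, 256),

            # (B, 256, H/2, W/2) -> (B, 256, H/4, W/4)
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=0),

            VAE_Residual(256, 512),
            VAE_Residual(512, 512),

            # (B, 512, H/4, W/4) -> (B, 512, H/8, W/8)
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=0),

            VAE_Residual(512, 512),
            VAE_Residual(512, 512),
            VAE_Residual(512, 512),

            VAE_AttentionBlock(512),

            VAE_Residual(512, 512),

            nn.GroupNorm(num_groups=32, num_channels=512),

            nn.SiLU(),

            # 8 output channels: 4 for the mean and 4 for the log-variance.
            nn.Conv2d(512, 8, kernel_size=3, padding=1),
            nn.Conv2d(8, 8, kernel_size=1, padding=0),
        )

    # forward must be a method of the class, not a function nested inside __init__.
    def forward(self, x: torch.Tensor, noise_level: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) image
        # noise_level: (B, 4, H/8, W/8) standard-normal noise used for sampling
        for module in self:
            if getattr(module, "stride", None) == (2, 2):
                # F.pad order is (left, right, top, bottom): pad one pixel on the
                # right and bottom so the padding=0 convolution halves H and W exactly.
                x = F.pad(x, (0, 1, 0, 1))
            x = module(x)

        # Split the 8 channels into the mean and log-variance of the latent distribution.
        mean, log_variance = torch.chunk(x, 2, dim=1)
        log_variance = torch.clamp(log_variance, min=-30.0, max=20.0)
        stdev = log_variance.exp().sqrt()

        # Reparameterization: sample the latent as mean + stdev * noise.
        x = mean + stdev * noise_level

        # Scale the latent by a fixed constant.
        x *= 0.18215

        return x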
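
The last lines of `forward` are the usual VAE sampling step (the reparameterization trick). The sketch below runs that step on its own, with a random tensor standing in for the encoder's 8-channel output; the 64×64 size is an assumption for illustration.

```python
import torch

torch.manual_seed(0)

# Stand-in for the encoder's final 8-channel output on a 64x64 grid.
encoder_out = torch.randn(1, 8, 64, 64)

# Split into two (1, 4, 64, 64) halves: mean and log-variance.
mean, log_variance = torch.chunk(encoder_out, 2, dim=1)
log_variance = torch.clamp(log_variance, min=-30.0, max=20.0)
stdev = torch.exp(0.5 * log_variance)          # same as sqrt(exp(log_variance))

noise = torch.randn_like(mean)                 # epsilon drawn from N(0, I)
latent = (mean + stdev * noise) * 0.18215      # same scaling constant as in forward
print(latent.shape)
```

Sampling this way keeps the randomness in `noise`, so gradients can still flow through `mean` and `stdev` during training.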