# Text-to-Image Generation using Stable Diffusion - Homework Assignment

![Stable Diffusion Architecture](https://miro.medium.com/v2/resize:fit:1400/1*NpQ282NJdOfxUsYlwLJplA.png)

In this homework, you will finetune a **Stable Diffusion** model to generate Naruto-style images from text descriptions. This involves working with the complete diffusion pipeline including VAE, UNet, text encoder, and scheduler.

## 📌 Project Overview
- **Task**: Text-to-Naruto image generation
- **Architecture**: Stable Diffusion with UNet diffusion model
- **Dataset**: Naruto-style dataset with text descriptions
- **Goal**: Generate realistic Naruto-style images from text prompts

## 📚 Learning Objectives
By completing this assignment, you will:
- Understand diffusion models and the stable diffusion pipeline
- Learn to finetune pre-trained diffusion models
- Work with VAE, UNet, text encoders, and schedulers
- Practice text-to-image generation techniques
- Handle memory constraints with large models

## 1️⃣ Dataset Setup (PROVIDED)

The Naruto-style dataset has been loaded for you. The dataset contains:
- 1,221 training images with corresponding text descriptions
- Each sample has an 'image' and 'text' field
- Images are in various sizes and need to be resized to 512x512


In [1]:
from IPython.display import clear_output

In [2]:
!pip install -U datasets
clear_output()

In [3]:
from datasets import load_dataset

# Load dataset without any cache directory specified
ds = load_dataset("Alex-0402/naruto-style-dataset-with-text")
print("Dataset info:", ds)
print("Number of training samples:", len(ds['train']))

# Display a sample
sample = ds['train'][0]
print("\nSample text:", sample['text'])
print("Image size:", sample['image'].size)

Dataset info: DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 1221
    })
})
Number of training samples: 1221

Sample text: a man with dark hair and brown eyes, naruto style
Image size: (1080, 1080)


## 2️⃣ Import Libraries and Configuration

**Task**: Import all necessary libraries and set up configuration parameters.

**Requirements**:
- Import diffusers, transformers, and related libraries
- Import PyTorch, PIL, numpy, and other utilities
- Set random seeds for reproducibility
- Configure hyperparameters for stable diffusion training

In [None]:
# TODO: Import all necessary libraries:
#       - torch, torch.nn.functional, torch.optim
#       - diffusers (UNet2DConditionModel, AutoencoderKL, PNDMScheduler, DDPMScheduler)
#       - transformers (CLIPTextModel, CLIPTokenizer)
#       - PIL, numpy, matplotlib
#       - torchvision.transforms
#       - tqdm for progress bars
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader

from diffusers import UNet2DConditionModel, AutoencoderKL, PNDMScheduler, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from torchvision import transforms
from tqdm.auto import tqdm
import random

# TODO: Set random seeds for reproducibility (use seed=42)
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

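# TODO: Check device availability and print
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS device for Apple Silicon GPU acceleration.")
else:
    device = torch.device("cpu")
print(f"device now using: {device}")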

# TODO: Define configuration parameters:
MODEL_ID = "OFA-Sys/small-stable-diffusion-v0"  # Smaller stable diffusion model
IMG_SIZE = 512  # Image resolution
BATCH_SIZE = 1  # Small batch size for memory constraints
LEARNING_RATE = 1e-5  # Learning rate for finetuning
NUM_EPOCHS = 6  # Number of training epochs
INFERENCE_STEPS = 200  # Number of denoising steps during inference
GUIDANCE_SCALE = 7.5  # Classifier-free guidance scale

Using MPS device for Apple Silicon GPU acceleration.
device now using: mps


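Computer-generated "random" numbers are actually **pseudo-random**.

Put simply, the random seed is the starting point of the pseudo-random number generator. As long as the starting point is the same, the generated sequence of "random" numbers is exactly the same.

Setting a random seed therefore makes model results reproducible.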
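
As a minimal illustration, the sketch below uses only Python's standard `random` module; `torch.manual_seed` and `np.random.seed` are expected to behave analogously for their own generators. The helper name `draw` is hypothetical. Re-seeding with the same value replays the same sequence:

```python
import random

def draw(seed, n=5):
    """Seed the generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(42)
second = draw(42)   # same seed -> same sequence
other = draw(43)    # different seed -> almost certainly a different sequence

print(first == second)
print(first == other)
```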

## 3️⃣ Load Pre-trained Stable Diffusion Components

**Task**: Load all components of the stable diffusion pipeline.

**Requirements**:
- Load VAE (Variational Autoencoder) for image encoding/decoding
- Load UNet for the diffusion process
- Load text encoder and tokenizer for text conditioning
- Load noise scheduler for the diffusion process


In [5]:
# TODO: Load stable diffusion components:
#       - vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
#       - unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")
#       - text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")
#       - tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
#       - scheduler = PNDMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
scheduler = PNDMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler") 
# DDPMScheduler is a foundational scheduler used primarily for training
# while PNDMScheduler is a more advanced scheduler designed for fast and high-quality inference (image generation).

# TODO: Move models to device
vae.to(device)
unet.to(device)
text_encoder.to(device)

# TODO: Set VAE and text encoder to eval mode (only UNet will be trained)
vae.eval()
text_encoder.eval()

# TODO: Print model information and parameter counts
def count_parameters(model):
    # numel() returns the total number of elements (scalar values) in a tensor.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"UNet has {count_parameters(unet):,} trainable parameters.")
print(f"VAE has {count_parameters(vae):,} trainable parameters (frozen).")
print(f"Text Encoder has {count_parameters(text_encoder):,} trainable parameters (frozen).")

An error occurred while trying to fetch OFA-Sys/small-stable-diffusion-v0: OFA-Sys/small-stable-diffusion-v0 does not appear to have a file named diffusion_pytorch_model.safetensors.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch OFA-Sys/small-stable-diffusion-v0: OFA-Sys/small-stable-diffusion-v0 does not appear to have a file named diffusion_pytorch_model.safetensors.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
The config attributes {'predict_epsilon': True} were passed to PNDMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.


UNet has 579,384,964 trainable parameters.
VAE has 83,653,863 trainable parameters (frozen).
Text Encoder has 123,060,480 trainable parameters (frozen).


The `scheduler` is responsible for **defining the diffusion process and how noise is added and removed at each step during the sampling (inference) process**.

More specifically, the `scheduler` handles:

1. **Noise Scheduling**: It determines the amount of noise to add or remove at each step of the denoising process.

2. **Denoising Algorithm**: It implements the specific algorithm used to denoise the latent representation.

3. **Timesteps**: It manages the timesteps (or noise levels) through which the model progressively denoises the latent representation.

4. **Prediction Type**: It often handles whether the model is predicting the noise itself, the original image, or the velocity.
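
To make the noise-scheduling role concrete, here is a small sketch of a DDPM-style forward process. It assumes a linear beta schedule with 1,000 steps from 1e-4 to 0.02 (illustrative values, not necessarily what the scheduler config of `MODEL_ID` uses), and `signal_and_noise_scale` is a hypothetical helper name. At timestep `t`, the clean latent is scaled by `sqrt(alpha_bar_t)` and the Gaussian noise by `sqrt(1 - alpha_bar_t)`:

```python
import math

NUM_TRAIN_TIMESTEPS = 1000          # assumed value for illustration
BETA_START, BETA_END = 1e-4, 0.02   # assumed linear schedule endpoints

# beta_t grows linearly; alpha_bar_t is the running product of (1 - beta_t).
betas = [
    BETA_START + (BETA_END - BETA_START) * t / (NUM_TRAIN_TIMESTEPS - 1)
    for t in range(NUM_TRAIN_TIMESTEPS)
]
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

def signal_and_noise_scale(t):
    """Return the coefficients (sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t))."""
    a_bar = alphas_cumprod[t]
    return math.sqrt(a_bar), math.sqrt(1.0 - a_bar)

for t in (0, 250, 500, 999):
    signal, noise = signal_and_noise_scale(t)
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

Early timesteps keep the latent almost intact, while late timesteps are nearly pure noise. During inference the scheduler walks this schedule in reverse.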

`if p.requires_grad`:
* This is a crucial **filter** within the generator expression.

* In deep learning frameworks, each parameter tensor has an attribute called `requires_grad` (typically a boolean).

* `p.requires_grad = True` means that the gradients for this parameter will be computed during the backward pass of backpropagation, and thus, this parameter will be updated by the optimizer during training. These are the "trainable" parameters.

* `p.requires_grad = False` means that this parameter is frozen. Its gradients will not be computed, and its value will not be updated during training. Two related points:
    * Parameters in a pre-trained model that you want to use as a fixed feature extractor must be frozen explicitly, for example with `model.requires_grad_(False)`. Calling `vae.eval()` or `text_encoder.eval()` does not freeze anything: `.eval()` only changes the behavior of layers such as dropout and batch normalization. In this notebook the VAE and text encoder stay fixed because the optimizer only receives the UNet parameters and their forward passes run under `@torch.no_grad()`, which is why their parameters still show up in the counts above.
    
    * Buffers in a model (like running means and variances in Batch Normalization layers) are part of the model's state but are not trained like weights and biases. They are not returned by `model.parameters()` at all, so they never enter this count.

* By including `if p.requires_grad`, the function ensures that only trainable parameters are counted. This is important because you often want to know how many parameters your optimizer needs to manage.
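
The difference between `.eval()` and actually freezing parameters can be checked on a toy module. This is a minimal sketch with a hypothetical two-layer model standing in for the notebook's networks:

```python
import torch.nn as nn

def count_parameters(model):
    """Count only parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in: Linear(4 -> 8) has 4*8 + 8 = 40 parameters, BatchNorm1d(8) has 8 + 8 = 16.
toy = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))

before = count_parameters(toy)
toy.eval()                      # changes layer behavior only
after_eval = count_parameters(toy)
toy.requires_grad_(False)       # actually freezes every parameter
after_freeze = count_parameters(toy)

print(before, after_eval, after_freeze)
```

The count stays at 56 after `.eval()` and drops to 0 only after `requires_grad_(False)`. BatchNorm's running statistics are buffers, so they are never counted.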

## 4️⃣ Data Preprocessing and Custom Dataset

**Task**: Create custom dataset class and preprocessing pipeline.

**Requirements**:
- Resize images to 512x512 resolution
- Normalize images to [-1, 1] range for VAE
- Tokenize text descriptions
- Handle data augmentation appropriately

In [6]:
# TODO: Create NarutoDataset class inheriting from torch.utils.data.Dataset
class NarutoDataset(Dataset):
    # TODO: In __init__(self, dataset, tokenizer, size=512):
    def __init__(self, dataset, tokenizer, size=512):
        #       - Store dataset, tokenizer, and image size
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.size = size
        #       - Define image transforms:
        #         * Resize to (size, size)
        #         * Random horizontal flip for augmentation
        #         * ToTensor()
        #         * Normalize with mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]
        self.transforms = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.RandomHorizontalFlip(),
            transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
            transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])
        # I added ColorJitter and RandomAffine for additional augmentation.

    # TODO: Implement __len__ to return dataset length
    def __len__(self):
        return len(self.dataset)

    # TODO: Implement __getitem__ to:
    def __getitem__(self, idx):
        #       - Get image and text from dataset
        sample = self.dataset[idx]
        image = sample['image'].convert("RGB")
        text = sample['text']

        #       - Apply transforms to image
        pixel_values = self.transforms(image)

        #       - Tokenize text with padding and truncation
        input_ids = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt"
        ).input_ids

        #       - Return dict with 'pixel_values' and 'input_ids'
        return {"pixel_values": pixel_values, "input_ids": input_ids.squeeze()}

# TODO: Create train_dataset and train_dataloader
train_dataset = NarutoDataset(ds['train'], tokenizer, size=IMG_SIZE)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# TODO: Print dataset info and test with one sample
print(f"Dataset size: {len(train_dataset)}")
sample_batch = next(iter(train_dataloader))
print("Sample batch keys:", sample_batch.keys())
print("Pixel values shape:", sample_batch['pixel_values'].shape)
print("Input IDs shape:", sample_batch['input_ids'].shape)

Dataset size: 1221
Sample batch keys: dict_keys(['pixel_values', 'input_ids'])
Pixel values shape: torch.Size([1, 3, 512, 512])
Input IDs shape: torch.Size([1, 77])


## 5️⃣ Training Setup and Loss Function

**Task**: Set up the training components including optimizer and loss function.

**Requirements**:
- Create optimizer for UNet parameters only
- Implement the diffusion loss (noise prediction loss)
- Set up proper gradient scaling and mixed precision if needed
- Configure learning rate scheduling

In [None]:
from torch.amp import GradScaler, autocast  # device-agnostic AMP API (torch.cuda.amp is deprecated)

# TODO: Create optimizer for UNet parameters only:
#       - optimizer = torch.optim.AdamW(unet.parameters(), lr=LEARNING_RATE)
optimizer = AdamW(unet.parameters(), lr=LEARNING_RATE)

# TODO: Create noise scheduler for training (different from inference)
#       - noise_scheduler = DDPMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")
noise_scheduler = DDPMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")
# DDPMScheduler is a foundational scheduler used primarily for training
# while PNDMScheduler is a more advanced scheduler designed for fast and high-quality inference (image generation).

# TODO: Define helper functions:
#       - encode_text(text_input): tokenize and encode text to embeddings
#       - encode_image(image): encode image to latent space using VAE
#       - decode_latent(latent): decode latent back to image using VAE
@torch.no_grad()
def encode_text(text_input_ids):
    """Tokenize and encode text to embeddings."""
    return text_encoder(text_input_ids.to(device))[0]

@torch.no_grad()
def encode_image(image_pixels):
    """Encode image to latent space using VAE."""
    h = vae.encode(image_pixels.to(device)).latent_dist
    latents = h.sample()
    return latents * vae.config.scaling_factor

@torch.no_grad()
def decode_latent(latents):
    """Decode latent back to image using VAE."""
    latents = 1 / vae.config.scaling_factor * latents
    image = vae.decode(latents).sample
    return image

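# TODO: Set up gradient scaler for mixed precision training (optional)
# Mixed precision is enabled only when running on CUDA; on other devices the scaler is a no-op.
scaler = torch.amp.GradScaler("cuda", init_scale=2**16, growth_interval=1000, growth_factor=2.0, enabled=(device.type == "cuda"))
# GradScaler supports mixed precision training, which can speed up training and reduce memory use.
# Note that the training loop below must also wrap the forward pass in autocast.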

# TODO: Initialize training tracking variables
losses = []
print("Training setup complete. Optimizer, noise scheduler, helper functions, and GradScaler are ready.")

The config attributes {'predict_epsilon': True} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.


Training setup complete. Optimizer, noise scheduler, helper functions, and GradScaler are ready.


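**A gradient scaler (`GradScaler`) is a tool that addresses numerical underflow in mixed precision training, so that a model can still train stably with low-precision floats such as FP16.**

1. **What is mixed precision training?**

    In deep learning, we usually use 32-bit floating point numbers (FP32, or single precision) to store and compute model weights, gradients, and other data.

    Mixed precision training uses both 16-bit floating point numbers (FP16, or half precision) and 32-bit floating point numbers (FP32) during training.

    Advantages:

    * **Faster**: modern GPUs (especially NVIDIA cards with Tensor Cores) compute FP16 much faster than FP32.

    * **Less memory**: FP16 takes half the storage of FP32, which lets you train larger models or use larger batch sizes.

    Drawbacks and challenges:

    * **Smaller numeric range**: FP16 can represent a much narrower range of values than FP32. If a number is so small that it falls below what FP16 can represent, it is rounded to **zero**. This is called **numerical underflow**.

2. **Why is numerical underflow a serious problem?**

    During backpropagation, the computed gradients can be very small. In FP16, these tiny gradient values can easily underflow to zero.

    Once a gradient becomes zero, the model cannot learn anything from it and the corresponding parameters are not updated. It is like a teacher pointing out the key material in a voice so quiet that the students cannot hear a word, so they cannot learn or improve. In the end, the model **fails to converge or training fails**.

3. **How does `GradScaler` solve this?**
    
    `GradScaler` (in PyTorch, `torch.amp.GradScaler`; older releases expose it as `torch.cuda.amp.GradScaler`) uses a "scale up, then scale back down" strategy:

    The workflow is as follows:

    1. **Loss scaling**: before the backward pass, `GradScaler` multiplies the computed loss by a large scale factor (for example 65536.0).

    2. **Scaled backward pass**: when backpropagation runs on this scaled loss, the chain rule scales every computed gradient by the same factor.

        Gradients that were originally tiny and at risk of underflow are thereby lifted into the range that FP16 can represent safely, so no information is lost.

    3. **Unscaling**: the gradients are now safe, but they are inflated and cannot be used directly to update the weights. Before the optimizer updates the weights, `GradScaler` **divides the scaled gradients by the same scale factor**, restoring their original, correct values.

    4. **Dynamic scaling**: `GradScaler` adjusts the scale factor dynamically.

        If the scaled gradients contain infinity (`inf`) or not-a-number (`NaN`) values, which means the scale factor was too large and caused overflow, `GradScaler` skips that parameter update and reduces the scale factor for the next iteration.

        If many consecutive iterations pass without problems, it tries increasing the scale factor to make fuller use of FP16's representable range.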
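
The underflow and its fix can be reproduced without a GPU. The sketch below is a simplification: it scales a single made-up gradient value directly, whereas the real `GradScaler` scales the loss and lets the chain rule scale the gradients. It uses the standard library's half-precision `struct` format, and `to_fp16` is a hypothetical helper name:

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

SCALE = 65536.0        # the example scale factor mentioned in the text (2**16)
tiny_grad = 1e-8       # illustrative gradient, below FP16's smallest subnormal (~6e-8)

unscaled_fp16 = to_fp16(tiny_grad)          # underflows to zero
scaled_fp16 = to_fp16(tiny_grad * SCALE)    # survives in FP16
recovered = scaled_fp16 / SCALE             # unscale in higher precision

print(unscaled_fp16, scaled_fp16, recovered)
```

Stored directly in FP16, the gradient is lost. Scaled first and unscaled afterwards in higher precision, it comes back to within a small rounding error of the original value.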


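`@torch.no_grad()` is a PyTorch **decorator** with a clear and important purpose: **it temporarily disables gradient computation while the decorated function runs**.

Put simply, when code is wrapped in `@torch.no_grad()`, PyTorch does not track or record the operation history of any tensor, so no gradients can be computed for those operations.

This brings two main benefits:

1. **Lower memory use**: the computation graph and intermediate states needed for backpropagation are not stored, which significantly reduces memory consumption.

2. **Faster computation**: without the overhead of tracking operations, the code runs faster.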
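
A small check of this behavior, as a sketch with hypothetical function names (`tracked`, `untracked`) and a toy tensor in place of the notebook's models:

```python
import torch

weights = torch.ones(3, requires_grad=True)

def tracked(x):
    """Runs with autograd enabled, so the result is part of a graph."""
    return (x * 2).sum()

@torch.no_grad()
def untracked(x):
    """Same computation, but no operation history is recorded."""
    return (x * 2).sum()

out_tracked = tracked(weights)
out_untracked = untracked(weights)

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```

Both calls return the same value, but only the tracked result can be backpropagated through. This is why the helper functions for the VAE and text encoder can use the decorator while the UNet forward pass in the training loop must not.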

## 6️⃣ Training Loop Implementation

**Task**: Implement the main training loop for diffusion model finetuning.

**Requirements**:
- Encode images to latent space using VAE
- Add noise to latents according to diffusion schedule
- Predict noise using UNet conditioned on text
- Compute loss between predicted and actual noise
- Update UNet parameters via backpropagation

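Stable Diffusion training uses **MSE loss (mean squared error)** because the task is essentially a **regression problem**, and MSE is a classic and efficient tool for this kind of problem.

Specifically, the training goal is to **make the model accurately predict the noise that we added**.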
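
Written out, the standard noise-prediction objective looks as follows. The notation is an assumption introduced here for illustration: $x_0$ is the clean latent from the VAE, $c$ the text embedding, $\epsilon$ the sampled Gaussian noise, $\bar{\alpha}_t$ the scheduler's cumulative signal coefficient at timestep $t$, and $\epsilon_\theta$ the UNet.

$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,c,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right]
$$

In the training loop this corresponds to `noise_scheduler.add_noise(latents, noise, timesteps)` followed by `F.mse_loss(noise_pred, noise)`, which averages the squared error over all latent elements.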

In [None]:
# TODO: Implement training loop:
print("Starting training with mixed precision...")
unet.train()  # Set UNet to training mode

for epoch in range(NUM_EPOCHS):
    progress_bar = tqdm(total=len(train_dataloader), desc=f"Epoch {epoch + 1}")
    
    #       For each batch in train_dataloader:
    for step, batch in enumerate(train_dataloader):
        #           - Get images and text from batch
        pixel_values = batch["pixel_values"] 
        input_ids = batch["input_ids"]

        # Clear gradients
        optimizer.zero_grad()

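        # Run the forward pass under autocast; mixed precision is enabled only on CUDA
        with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=(device.type == "cuda")):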
            #           - Encode images to latent space using VAE
            latents = encode_image(pixel_values)

            #           - Encode text to embeddings using text encoder
            encoder_hidden_states = encode_text(input_ids)
            
            #           - Sample random timesteps for diffusion
            noise = torch.randn_like(latents)  # Gaussian noise (standard normal) with the same shape as latents
            timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device).long()
            
            #           - Add noise to latents according to schedule
            noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
            
            #           - Predict noise using UNet with text conditioning
            noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states, return_dict=False)[0]

            #           - Compute MSE loss between predicted and actual noise
            loss = F.mse_loss(noise_pred, noise)
        
        #           - Backpropagate and update UNet parameters using the GradScaler
        # Scale the loss and backpropagate
        scaler.scale(loss).backward()
        # Run the optimizer step through the scaler
        scaler.step(optimizer)
        # Update the scaler for the next iteration
        scaler.update()
        
        #           - Track and display training progress
        losses.append(loss.item())
        progress_bar.set_postfix(loss=loss.item())
        progress_bar.update(1)

    progress_bar.close()

# TODO: Save model checkpoints periodically
# unet.save_pretrained("naruto_finetuned_unet_mixed_precision")
print("Training finished.")

# TODO: Display loss curves and training statistics
# This is done in the evaluation section (Model Evaluation and Comparison).

Starting training with mixed precision...


Epoch 1:   0%|          | 0/1221 [00:00<?, ?it/s]

  with autocast("mps"):


TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

## 7️⃣ Inference Pipeline Setup

**Task**: Create inference pipeline for text-to-image generation.

**Requirements**:
- Set up complete diffusion pipeline with trained UNet
- Configure scheduler for inference (100 steps)
- Implement text-to-image generation function
- Handle classifier-free guidance

In [None]:
# TODO: Create inference pipeline:
#       - Set all models to eval mode
#       - Create StableDiffusionPipeline with trained components
#       - Configure scheduler for inference
# Import the StableDiffusionPipeline class
from diffusers import StableDiffusionPipeline

#       - Set all models to eval mode
# Assembling a pipeline from existing components does not change their mode,
# so switch the UNet out of training mode explicitly (VAE and text encoder are already in eval mode).
unet.eval()

#       - Create StableDiffusionPipeline with trained components
# Combine the finetuned component (unet) and the unchanged ones (vae, text_encoder, etc.) in one pipeline.
# This is the recommended way to run inference: the code is shorter and less error-prone.
pipeline = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=scheduler,
    safety_checker=None,  # small models usually ship without a safety checker
    feature_extractor=None,
    requires_safety_checker=False,
)

# Move the whole pipeline to the selected device
pipeline.to(device)
print("StableDiffusionPipeline created and moved to device.")

#       - Configure scheduler for inference
# The pipeline configures the scheduler at call time; we only pass the number of steps.

# TODO: Implement generate_image function that:
#       - Takes text prompt as input
#       - Encodes text to embeddings (handled by pipeline)
#       - Starts with random noise (handled by pipeline)
#       - Performs denoising for specified number of steps (handled by pipeline)
#       - Decodes final latent to image (handled by pipeline)
#       - Returns PIL image
@torch.no_grad()
def generate_image(prompt, num_inference_steps, guidance_scale, seed=None):
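    # Use a dedicated generator so results are reproducible when a seed is given
    generator = torch.Generator(device="cpu").manual_seed(seed) if seed is not None else None
    
    # Call the pipeline to generate the image;
    # it handles text encoding, CFG, the denoising loop, and VAE decoding
    result = pipeline(
        prompt=prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=generator
    )
    
    # Take the first image from the result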
    image = result.images[0]
    return image

# TODO: Set up proper inference configuration:
#       - num_inference_steps = INFERENCE_STEPS
#       - guidance_scale = GUIDANCE_SCALE
#       - Enable safety checker if desired (already disabled in pipeline)
print("Inference function 'generate_image' is ready.")
print(f"Default inference steps: {INFERENCE_STEPS}, Guidance scale: {GUIDANCE_SCALE}")
# INFERENCE_STEPS and GUIDANCE_SCALE were defined in the configuration parameters at the start of the notebook.

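**Pipeline**

In the Hugging Face `diffusers` library, a `Pipeline` is a **high-level, all-in-one tool that bundles every component and step needed to run a complex model such as Stable Diffusion**.

You can think of it as a **fully automated image factory**. You do not need to know how the individual workshops inside the factory (the model components) cooperate. You place an order (a text `prompt`), and the factory runs every stage and delivers the finished product (an image).

A typical `StableDiffusionPipeline` contains the following core components:

1. **Tokenizer**: takes your input text (for example "a photo of an astronaut riding a horse on mars") and converts it into numeric IDs the model can process.

2. **Text Encoder**: takes the token IDs and converts them into vectors (embeddings) that carry rich semantic information. This is how the model understands what you want.

3. **UNet (the core denoising model)**: the heart of the diffusion model. It receives a noisy latent and the text encoder output, then removes noise over many steps until a latent representation matching the text description emerges. This is the component you train during finetuning.

4. **VAE (Variational Autoencoder)**:

    Encoding: during training, it compresses the original images into a smaller, more efficient latent space (`latents`).

    Decoding: at the end of inference, it decodes the clean latent produced by the UNet back into a normal pixel image.

5. **Scheduler**: controls the pace of the whole denoising process. It defines how many denoising steps there are and how much noise to remove at each step. Different schedulers (such as `PNDM` or `DPM-Solver`) use different denoising strategies, which affects generation speed and image quality.

**Key advantages of a pipeline:**

* **Convenience**: it hides the internal call sequence. A single call, `pipeline(prompt)`, generates an image, so you do not have to call five separate components and pass data between them by hand.

* **Consistency and correctness**: it wires the components together in the right order (text encoding, guidance, denoising loop, decoding) and runs inference without gradient tracking, which reduces the chance of user error. Components you trained yourself should still be switched to `.eval()` before use.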

## 8️⃣ Generate Images with Dataset Prompts

**Task**: Generate images using 5 prompts from the training dataset.

**Requirements**:
- Select 5 different text prompts from the dataset
- Generate images for each prompt
- Display results in a grid format
- Show prompt text alongside generated images

In [None]:
# TODO: Select 5 prompts from training dataset:
#       - Use different indices to get variety
#       - Extract text descriptions
dataset_prompts = [ds['train'][i]['text'] for i in [10, 100, 200, 300, 400]]

#       - Set random seed for reproducibility
generation_seed = 1337
generated_images_dataset = []

# TODO: Generate images for each dataset prompt:
for prompt in dataset_prompts:
    print(f"Generating image for prompt: '{prompt}'")
    #       - Use generate_image function
    #       - Set random seed for reproducibility
    image = generate_image(
        prompt,
        num_inference_steps=INFERENCE_STEPS,
        guidance_scale=GUIDANCE_SCALE,
        seed=generation_seed
    )
    #       - Save generated images
    generated_images_dataset.append(image)

# TODO: Create visualization:
#       - Display each prompt text
#       - Show corresponding generated image
#       - Use matplotlib subplot for clean layout
#       - Add titles and proper formatting
fig, axes = plt.subplots(1, 5, figsize=(25, 5))
for i, (prompt, img) in enumerate(zip(dataset_prompts, generated_images_dataset)):
    axes[i].imshow(img)
    axes[i].set_title(f"Prompt: \"{prompt[:40]}...\"", fontsize=10)
    axes[i].axis("off")
plt.suptitle("Generated Images from Training Dataset Prompts", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])  # leave room at the top for the suptitle

# TODO: Display results in a 2x3 grid or similar arrangement
plt.show()

`guidance_scale` controls how strictly the model follows your text prompt.

The `seed` (or random seed) is a number that initializes the model's random number generator.
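
Concretely, classifier-free guidance is usually described as extrapolating from an unconditional noise prediction toward the text-conditioned one. The sketch below uses plain Python lists in place of tensors and made-up numbers, and `apply_guidance` is a hypothetical helper rather than a `diffusers` function:

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine two noise predictions element-wise: uncond + s * (text - uncond)."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]

noise_uncond = [0.10, -0.20, 0.30]   # prediction for the empty prompt (made-up numbers)
noise_text = [0.20, -0.10, 0.10]     # prediction for the actual prompt (made-up numbers)

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

A scale of 1.0 reproduces the text-conditioned prediction, while larger values such as the 7.5 used in this notebook push further in the direction the prompt suggests, typically at some cost in diversity.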

## 9️⃣ Generate Images with Custom Prompts

**Task**: Generate images using 5 custom prompts that you create.

**Requirements**:
- Write 5 creative prompts in Naruto style
- Test different types of descriptions (characters, scenes, actions)
- Generate and display results
- Compare quality with dataset prompt results

In [None]:
# TODO: Define 5 custom prompts, for example:
custom_prompts = [
    "A ninja with sharingan eyes, wearing a black cloak, standing in the rain, naruto style",
    "A majestic nine-tailed fox spirit with glowing red chakra, naruto style",
    "A beautiful kunoichi with pink hair, sad expression, cherry blossoms falling around her, naruto style",
    "An epic battle between two powerful shinobis on top of a giant statue, naruto style",
    "The hidden leaf village seen from the hokage rock at night, naruto style"
]

#       - Use same generation parameters as before
#       - Ensure consistent quality
generation_seed = 42
generated_images_custom = []

# TODO: Generate images for each custom prompt:
for prompt in custom_prompts:
    print(f"Generating image for custom prompt: '{prompt}'")
    image = generate_image(
        prompt,
        num_inference_steps=INFERENCE_STEPS,
        guidance_scale=GUIDANCE_SCALE,
        seed=generation_seed
    )
    generated_images_custom.append(image)

# TODO: Create visualization for custom prompts:
#       - Similar layout to dataset prompts
#       - Show prompt text and generated image
fig, axes = plt.subplots(1, 5, figsize=(25, 5))
for i, (prompt, img) in enumerate(zip(custom_prompts, generated_images_custom)):
    axes[i].imshow(img)
    axes[i].set_title(f"Prompt: \"{prompt[:40]}...\"", fontsize=10)
    axes[i].axis("off")
plt.suptitle("Generated Images from Custom Prompts", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.95])  # leave room at the top for the suptitle

# TODO: Display all 5 custom prompt results
plt.show()

## 🔟 Model Evaluation and Comparison

**Task**: Evaluate and compare your results

**Requirements**:
- Compare generated images with original dataset images
- Evaluate image quality, style consistency, and prompt adherence
- Plot training progress and loss convergence


In [None]:
# TODO: Create comparison visualization:
#       - Show original dataset images alongside generated ones
#       - Compare style consistency
#       - Evaluate prompt adherence
print("Comparing original dataset images with generated images.")
indices_to_compare = [50, 150, 250, 350, 450]
prompts_to_compare = [ds['train'][i]['text'] for i in indices_to_compare]
original_images = [ds['train'][i]['image'].convert("RGB") for i in indices_to_compare]
generated_images_compare = []

for prompt in prompts_to_compare:
    print(f"Generating for comparison: '{prompt}'")
    img = generate_image(prompt, num_inference_steps=INFERENCE_STEPS, guidance_scale=GUIDANCE_SCALE, seed=8888)
    generated_images_compare.append(img)

fig, axes = plt.subplots(2, 5, figsize=(25, 10))
fig.suptitle('Original vs. Generated Image Comparison', fontsize=16)

for i in range(5):
    # Plot original images
    axes[0, i].imshow(original_images[i])
    axes[0, i].set_title(f"Original: \"{prompts_to_compare[i][:30]}...\"", fontsize=10)
    axes[0, i].axis('off')
    
    # Plot generated images
    axes[1, i].imshow(generated_images_compare[i])
    axes[1, i].set_title(f"Generated: \"{prompts_to_compare[i][:30]}...\"", fontsize=10)
    axes[1, i].axis('off')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


# TODO: Plot training loss curve:
#       - Show loss progression over epochs
#       - Analyze convergence behavior
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.title("Training Loss Convergence")
plt.xlabel("Training Steps")
plt.ylabel("MSE Loss")
plt.grid(True)
plt.show()

print("Evaluation complete.")

## 📝 Evaluation Criteria

Your homework will be evaluated based on:

1. **Implementation Correctness (40%)**
   - Proper stable diffusion pipeline setup
   - Correct training loop implementation
   - Working inference pipeline
   - Appropriate use of VAE, UNet, text encoder, and scheduler

2. **Training and Results (30%)**
   - Model trains without errors
   - Reasonable loss convergence
   - Generated images show Naruto style characteristics
   - Successful generation from both dataset and custom prompts

3. **Code Quality (30%)**
   - Clean, readable code with proper comments
   - Efficient memory usage and error handling
   - Proper tensor operations and device management
   - Good visualization and presentation