# SD 3.5 Pipeline: Understanding Embedding Flow

This notebook traces how text embeddings are built and consumed in the Stable Diffusion 3.5 pipeline. In our workshop we skip the text encoders and inject pre-computed embeddings directly.

## 1. Standard SD 3.5 Pipeline (Normal Text-to-Image)

```mermaid
graph TD
    A[Text Prompt] --> B[T5-XXL Encoder]
    A --> C[CLIP-L Encoder]
    A --> D[CLIP-G Encoder]
    
    B --> E[T5 Text Embeddings<br/>77 tokens × 4096 dims]
    C --> F[CLIP-L Text Embeddings<br/>77 tokens × 768 dims]
    C --> G[CLIP-L Pooled<br/>Single vector: 768 dims]
    D --> H[CLIP-G Text Embeddings<br/>77 tokens × 1280 dims]
    D --> I[CLIP-G Pooled<br/>Single vector: 1280 dims]
    
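    F --> J0[Concatenate CLIP Features<br/>77 tokens × 2048 dims<br/>768+1280]
    H --> J0
    J0 --> J1[Zero-pad Features<br/>77 tokens × 4096 dims]
    J1 --> J[Concatenate Along Sequence]
    E --> J
    
    J --> K[Combined Context<br/>154 tokens × 4096 dims<br/>77+77 tokens]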
    
    G --> L[Concatenate Pooled]
    I --> L
    
    L --> M[Pooled Embeddings<br/>2048 dims<br/>768+1280]
    
    K --> N[MMDiT Transformer]
    M --> N
    
    O[Random Latent Noise] --> N
    
    N --> P[Denoising Process<br/>Multiple Steps]
    P --> Q[Final Latent]
    Q --> R[VAE Decoder]
    R --> S[Output Image]
    
    style K fill:#e1f5ff
    style M fill:#ffe1f5
    style N fill:#fff4e1
```
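A minimal NumPy sketch of the combining step as SD3-family pipelines implement it: the two CLIP sequences are joined along the feature axis, zero-padded to the T5 width, then stacked with the T5 tokens along the token axis. The function name `combine_embeddings` and the random placeholder arrays are assumptions for illustration; the 77-token T5 length follows this notebook and may differ in other setups.

```python
import numpy as np

SEQ_LEN = 77  # tokens per encoder, as used in this notebook


def combine_embeddings(t5, clip_l, clip_g, clip_l_pooled, clip_g_pooled):
    """Return (context, pooled) built from the five per-encoder arrays."""
    # Join the two CLIP sequences along the feature axis: 768 + 1280 = 2048.
    clip = np.concatenate([clip_l, clip_g], axis=-1)
    # Zero-pad the CLIP features on the right up to the T5 width (4096).
    pad = t5.shape[-1] - clip.shape[-1]
    clip = np.pad(clip, ((0, 0), (0, pad)))
    # Stack CLIP tokens and T5 tokens along the token axis: 77 + 77 = 154.
    context = np.concatenate([clip, t5], axis=0)
    # Pooled vectors are joined along the feature axis: 768 + 1280 = 2048.
    pooled = np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)
    return context, pooled


# Random placeholders standing in for real encoder outputs.
rng = np.random.default_rng(0)
t5 = rng.standard_normal((SEQ_LEN, 4096), dtype=np.float32)
clip_l = rng.standard_normal((SEQ_LEN, 768), dtype=np.float32)
clip_g = rng.standard_normal((SEQ_LEN, 1280), dtype=np.float32)
clip_l_pooled = rng.standard_normal(768, dtype=np.float32)
clip_g_pooled = rng.standard_normal(1280, dtype=np.float32)

context, pooled = combine_embeddings(t5, clip_l, clip_g, clip_l_pooled, clip_g_pooled)
print(context.shape, pooled.shape)  # (154, 4096) (2048,)
```

Because of the zero-padding, the CLIP rows only carry information in their first 2048 feature columns; the remaining columns are zeros.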

## 2. Our Modified Pipeline (Direct Embedding Injection)

```mermaid
graph TD
    A[T5 Embeddings JSON<br/>77 × 4096] --> B[Load & Parse]
    C[CLIP-L Embeddings JSON<br/>77 × 768] --> D[Load & Parse]
    E[CLIP-L Pooled JSON<br/>768 dims] --> F[Load & Parse]
    G[CLIP-G Embeddings JSON<br/>77 × 1280] --> H[Load & Parse]
    I[CLIP-G Pooled JSON<br/>1280 dims] --> J[Load & Parse]
    
    B --> K[Optional: Scale/Modify<br/>e.g., × 1.5, invert, etc.]
    D --> L[Optional: Scale/Modify]
    F --> M[Optional: Scale/Modify]
    H --> N[Optional: Scale/Modify]
    J --> O[Optional: Scale/Modify]
    
    L --> P0[Concatenate CLIP Features<br/>77 × 2048]
    N --> P0
    P0 --> P1[Zero-pad Features<br/>77 × 4096]
    P1 --> P[Concatenate Along Sequence]
    K --> P
    
    P --> Q[Combined Context<br/>154 × 4096 dims]
    
    M --> R[Concatenate Pooled]
    O --> R
    
    R --> S[Pooled Embeddings<br/>2048 dims]
    
    Q --> T[MMDiT Transformer]
    S --> T
    
    U[Random Latent Noise] --> T
    
    T --> V[Denoising Process]
    V --> W[Final Latent]
    W --> X[VAE Decoder]
    X --> Y[Output Image]
    
    style K fill:#ffcccc
    style L fill:#ffcccc
    style M fill:#ffcccc
    style N fill:#ffcccc
    style O fill:#ffcccc
    style Q fill:#e1f5ff
    style S fill:#ffe1f5
    style T fill:#fff4e1
```
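The "Load & Parse" step could look roughly like the sketch below. It assumes each JSON file holds a plain nested list of numbers; the file names in `EXPECTED_SHAPES` and the helper `load_embedding` are hypothetical, so adjust them to however the workshop files are actually stored.

```python
import json

import numpy as np

# Expected array shape per file (file names are hypothetical).
EXPECTED_SHAPES = {
    "t5.json": (77, 4096),
    "clip_l.json": (77, 768),
    "clip_l_pooled.json": (768,),
    "clip_g.json": (77, 1280),
    "clip_g_pooled.json": (1280,),
}


def load_embedding(path, expected_shape):
    """Load a nested-list JSON file into a float32 array and validate it."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    arr = np.asarray(data, dtype=np.float32)
    # Catch truncated or mislabeled files before they reach the model.
    if arr.shape != tuple(expected_shape):
        raise ValueError(
            f"{path}: expected shape {tuple(expected_shape)}, got {arr.shape}"
        )
    # NaN or infinite values would propagate through every denoising step.
    if not np.all(np.isfinite(arr)):
        raise ValueError(f"{path}: contains NaN or infinite values")
    return arr
```

Checking shapes at load time keeps a mistake in one file from surfacing later as an opaque error inside the transformer.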

## 3. Detailed View: How Embeddings Flow Into MMDiT

```mermaid
graph LR
    subgraph "Text Embeddings (Context)"
        A[T5-XXL<br/>77 × 4096]
        B[CLIP-L<br/>77 × 768]
        C[CLIP-G<br/>77 × 1280]
    end
    
    subgraph "Pooled Embeddings (Global Conditioning)"
        D[CLIP-L Pooled<br/>768]
        E[CLIP-G Pooled<br/>1280]
    end
    
    B --> F0[Concat Features<br/>77 × 2048]
    C --> F0
    F0 --> F1[Zero-pad<br/>77 × 4096]
    F1 --> F[Concat Sequence]
    A --> F
    
    F --> G[Context Sequence<br/>154 × 4096]
    
    D --> H[Concat]
    E --> H
    
    H --> I[Pooled Vector<br/>2048]
    
    subgraph "MMDiT Transformer Block"
        G --> J[Text Stream<br/>Q, K, V Projections]
        K[Noisy Latent Patches] --> L[Image Stream<br/>Q, K, V Projections]
        L --> M[Joint Attention<br/>Text + Image Tokens]
        J --> M
        M --> N[Feed Forward<br/>Per Stream]
        
        I --> O[AdaLN Modulation<br/>Scale, Shift & Gate]
        T[Timestep Embedding] --> O
        O --> J
        O --> L
        O --> N
    end
    
    N --> P[Denoised Output]
    
    style G fill:#e1f5ff
    style I fill:#ffe1f5
    style M fill:#d4edda
    style O fill:#fff3cd
```
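The two conditioning paths can be illustrated with a toy NumPy sketch: AdaLN derives a scale and shift from a conditioning vector and applies them to normalized tokens, while attention runs jointly over the concatenated text and image tokens. This is a deliberately simplified picture under several assumptions: a single head, no learned Q/K/V projections, one shared set of modulation weights (the real model keeps separate weights per stream), and small made-up sizes. All names here are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_modulate(x, cond, w_shift, w_scale):
    """Normalize tokens, then scale and shift them using values derived from cond."""
    shift = cond @ w_shift  # shape: (hidden,)
    scale = cond @ w_scale  # shape: (hidden,)
    return layer_norm(x) * (1.0 + scale) + shift


def joint_attention(text_tokens, image_tokens):
    """Single-head softmax attention over the concatenated text + image sequence."""
    tokens = np.concatenate([text_tokens, image_tokens], axis=0)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ tokens
    n_text = text_tokens.shape[0]
    # Split the result back into the two streams.
    return out[:n_text], out[n_text:]


rng = np.random.default_rng(0)
hidden, cond_dim = 64, 32                 # toy sizes, not the real model's
text = rng.standard_normal((154, hidden))  # stands in for projected context tokens
image = rng.standard_normal((256, hidden))  # stands in for latent patch tokens
cond = rng.standard_normal(cond_dim)      # stands in for pooled + timestep embedding
w_shift = rng.standard_normal((cond_dim, hidden)) * 0.02
w_scale = rng.standard_normal((cond_dim, hidden)) * 0.02

text_out, image_out = joint_attention(
    adaln_modulate(text, cond, w_shift, w_scale),
    adaln_modulate(image, cond, w_shift, w_scale),
)
print(text_out.shape, image_out.shape)  # (154, 64) (256, 64)
```

The sketch shows why the two inputs behave differently under manipulation: the context changes what individual tokens attend to, while the pooled vector rescales every token at once.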

## 4. Key Roles of Each Embedding Type

```mermaid
graph TD
    subgraph "T5-XXL Embeddings"
        A[Rich Semantic Understanding<br/>4096 dimensions per token]
        A --> A1[Detailed descriptions]
        A --> A2[Complex relationships]
        A --> A3[Nuanced concepts]
    end
    
    subgraph "CLIP-L Embeddings"
        B[Visual-Text Alignment<br/>768 dimensions per token]
        B --> B1[Object recognition]
        B --> B2[Style understanding]
        B --> B3[Composition hints]
    end
    
    subgraph "CLIP-G Embeddings"
        C[High-Level Visual Concepts<br/>1280 dimensions per token]
        C --> C1[Artistic style]
        C --> C2[Overall aesthetics]
        C --> C3[Global structure]
    end
    
    subgraph "Pooled Embeddings"
        D[Global Image Characteristics]
        D --> D1[Overall style modulation]
        D --> D2[Quality/aesthetic level]
        D --> D3[Conditioning strength]
    end
    
    A1 --> E[Joint Attention]
    A2 --> E
    A3 --> E
    B1 --> E
    B2 --> E
    B3 --> E
    C1 --> E
    C2 --> E
    C3 --> E
    
    D1 --> F[AdaLN Modulation]
    D2 --> F
    D3 --> F
    
    E --> G[Image Generation]
    F --> G
    
    style E fill:#e1f5ff
    style F fill:#ffe1f5
    style G fill:#d4edda
```
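To study one encoder's role in isolation, it helps to know where its values sit once the context has been combined. The sketch below assumes a combined context of shape 154 × 4096 laid out as CLIP rows first (CLIP-L in feature columns 0 to 767, CLIP-G in columns 768 to 2047, zeros after that) followed by the T5 rows. The helper `split_context` is hypothetical, and the layout should be verified against the pipeline actually in use.

```python
import numpy as np


def split_context(context, seq_len=77, clip_l_dim=768, clip_g_dim=1280):
    """Return views of the per-encoder regions inside a combined context array."""
    clip_rows = context[:seq_len]
    return {
        "clip_l": clip_rows[:, :clip_l_dim],
        "clip_g": clip_rows[:, clip_l_dim:clip_l_dim + clip_g_dim],
        "t5": context[seq_len:],
    }


# Fill each region with a distinct constant to make the layout visible.
context = np.zeros((154, 4096), dtype=np.float32)
context[:77, :768] = 1.0
context[:77, 768:2048] = 2.0
context[77:] = 3.0

parts = split_context(context)
for name, block in parts.items():
    print(name, block.shape)
```

The returned arrays are NumPy views, so an in-place edit such as `parts["t5"][:] = 0` also changes `context`, which is convenient for removing a single encoder's influence.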

## 5. Workshop Experiments: What We Can Explore

```mermaid
graph TD
    A[Saved Embeddings JSON] --> B{Modification Type}
    
    B --> C[Scaling]
    B --> D[Inversion]
    B --> E[Mixing]
    B --> F[Zeroing]
    B --> G[Interpolation]
    
    C --> C1[T5 × 1.5, CLIP × 0.5]
    D --> D1[Negative embeddings]
    E --> E1[Blend different prompt embeddings]
    F --> F1[Remove specific encoder influence]
    G --> G1[Morph between concepts]
    
    C1 --> H[Modified Embeddings]
    D1 --> H
    E1 --> H
    F1 --> H
    G1 --> H
    
    H --> I[Inject into Pipeline]
    I --> J[Generate Image]
    J --> K{Results}
    
    K --> L[Impossible from text alone]
    K --> M[Novel visual combinations]
    K --> N[Understanding encoder roles]
    
    style H fill:#ffcccc
    style J fill:#d4edda
    style L fill:#fff4e1
    style M fill:#fff4e1
    style N fill:#fff4e1
```
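Each modification in the diagram reduces to a line or two of array arithmetic. The helpers below are a hedged sketch with hypothetical names, operating on NumPy arrays of any shape; "inversion" is interpreted here as sign negation, which is an assumption.

```python
import numpy as np


def scale(emb, factor):
    """Multiply every value by a constant, e.g. 1.5 for T5 or 0.5 for CLIP."""
    return emb * factor


def invert(emb):
    """Flip the sign of every value (assumed meaning of 'inversion')."""
    return -emb


def mix(emb_a, emb_b, weight_a=0.5):
    """Weighted blend of two same-shaped embeddings from different prompts."""
    return weight_a * emb_a + (1.0 - weight_a) * emb_b


def zero(emb):
    """Replace an embedding with zeros to remove that encoder's influence."""
    return np.zeros_like(emb)


def interpolate(emb_a, emb_b, steps):
    """Return `steps` embeddings moving linearly from emb_a to emb_b."""
    return [(1.0 - t) * emb_a + t * emb_b for t in np.linspace(0.0, 1.0, steps)]


# Tiny demo arrays; real embeddings would be 77 × 4096 and so on.
a = np.ones((2, 4), dtype=np.float32)
b = np.full((2, 4), 3.0, dtype=np.float32)
frames = interpolate(a, b, steps=5)
print(len(frames), float(frames[2][0, 0]))  # 5 2.0
```

In Hugging Face diffusers, `StableDiffusion3Pipeline` accepts `prompt_embeds` and `pooled_prompt_embeds` arguments, which is one place modified arrays could be injected once converted to tensors of the right dtype and device.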

## Summary

### Embedding Dimensions:
- **T5-XXL**: 77 tokens × 4096 dimensions
- **CLIP-L**: 77 tokens × 768 dimensions + 768 pooled
- **CLIP-G**: 77 tokens × 1280 dimensions + 1280 pooled

### How They're Used:
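1. **Text embeddings** (154 × 4096 combined: CLIP-L and CLIP-G joined feature-wise to 77 × 2048, zero-padded to 4096, then stacked with the 77 T5 tokens) → Joint attention in MMDiT, where text and image tokens attend to each other
2. **Pooled embeddings** (2048 combined) → AdaLN modulation in MMDiT, together with the timestep embedding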

### Workshop Advantages:
- Direct manipulation of semantic space
- Bypass the text encoders, so inputs are not limited to what a tokenized prompt can express
- Create images that no text prompt would produce
- Understand individual encoder contributions
- Discover emergent behaviors through raw embedding manipulation