# CLIP-L and CLIP-G Embeddings for SD3

SD3 uses **two** CLIP text encoders whose pooled outputs are concatenated:
- **CLIP-L** (OpenAI): 768-dim pooled embedding
- **CLIP-G** (OpenCLIP): 1280-dim pooled embedding
- **Combined**: 2048-dim pooled embedding for SD3

This notebook generates both and saves them together for use with SD3.

```mermaid
flowchart LR
    T[Text Prompt]

    CLIPL[CLIP-L Encoder]
    CLIPG[CLIP-G Encoder]

    PL[Pooled CLIP-L embedding]
    PG[Pooled CLIP-G embedding]

    F[Concatenate]

    SD35[SD 3.5<br/>Diffusion Transformer]

    T --> CLIPL --> PL
    T --> CLIPG --> PG

    PL --> F
    PG --> F

    F -->|global conditioning| SD35
```

# CLIP Pooled Embedding Extraction in SD3.5

## Overview

SD3.5 uses two CLIP text encoders:
- **CLIP-L (OpenAI CLIP ViT-L/14)**: Outputs 77×768
- **CLIP-G (OpenCLIP ViT-bigG/14)**: Outputs 77×1280

The pooled embedding is created by extracting the **EOS token embedding** from each encoder and concatenating them.

## The Process

### Step 1: Text Encoding
Each CLIP encoder processes the tokenized text and outputs a sequence of embeddings:

- **CLIP-L**: `[BOS, token1, token2, ..., EOS, PAD, PAD, ...]` → **77×768**
- **CLIP-G**: `[BOS, token1, token2, ..., EOS, PAD, PAD, ...]` → **77×1280**

### Step 2: Pooling (Extract EOS Token)
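The "pooled" representation is simply the final hidden state at the **EOS (End-of-Sequence) token position**, which directly follows the last prompt token:

- **CLIP-L pooled**: Extract the EOS position → **768-dim vector**
- **CLIP-G pooled**: Extract the EOS position → **1280-dim vector**

The EOS token sits at position 76 only when the prompt fills the whole 77-token context. For shorter prompts it comes earlier, and the positions after it hold padding.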
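
The EOS lookup can be sketched with NumPy on made-up data. The IDs 49406 (BOS) and 49407 (EOS) are those of the standard CLIP vocabulary; the other token IDs, the padding ID, the tiny hidden size, and the helper name `pool_at_eos` are assumptions chosen for illustration only.

```python
import numpy as np

BOS_ID, EOS_ID = 49406, 49407  # standard CLIP vocabulary IDs
PAD_ID = 0                     # assumption: one possible padding choice

def pool_at_eos(hidden, input_ids, eos_id=EOS_ID):
    """Return the hidden state at the first EOS position of each sequence."""
    # argmax on a boolean array returns the index of the first True per row
    eos_pos = np.argmax(input_ids == eos_id, axis=1)
    return hidden[np.arange(hidden.shape[0]), eos_pos], eos_pos

seq_len, dim = 8, 4  # stand-ins for 77 and 768/1280
input_ids = np.array([
    [BOS_ID, 320, 1125, EOS_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
    [BOS_ID, 320, 1125, 2368, 1929, EOS_ID, PAD_ID, PAD_ID],
])
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, seq_len, dim))  # fake encoder output

pooled, eos_pos = pool_at_eos(hidden, input_ids)
print(eos_pos)       # [3 5]
print(pooled.shape)  # (2, 4)
```

The two rows have different EOS indices, so a fixed position would pick the wrong vector for at least one of them.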

> **Why the EOS token?** The EOS token acts as a sentence-level summary because it has attended to all previous tokens through the transformer's self-attention mechanism.

### Step 3: Concatenation
The two pooled vectors are simply concatenated:

```
pooled_embedding = concat(CLIP-L[EOS], CLIP-G[EOS])
                 = concat(768-dim, 1280-dim)
                 = 2048-dim vector
```

This 2048-dimensional vector is what SD3.5 uses as the **pooled text conditioning**.
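
As a minimal sketch of the layout, the following NumPy snippet concatenates two stand-in vectors and splits the result back into its halves. The helper names `combine_pooled` and `split_pooled` are hypothetical; only the 768 and 1280 sizes come from the text above.

```python
import numpy as np

CLIP_L_DIM, CLIP_G_DIM = 768, 1280

def combine_pooled(clip_l_pooled, clip_g_pooled):
    """Concatenate CLIP-L and CLIP-G pooled vectors along the last axis."""
    if clip_l_pooled.shape[-1] != CLIP_L_DIM or clip_g_pooled.shape[-1] != CLIP_G_DIM:
        raise ValueError("unexpected pooled embedding size")
    return np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)

def split_pooled(combined):
    """Recover the CLIP-L part (first 768 dims) and the CLIP-G part (last 1280)."""
    return combined[..., :CLIP_L_DIM], combined[..., CLIP_L_DIM:]

# Stand-in vectors; real ones come from the two text encoders
clip_l = np.full(CLIP_L_DIM, 1.0, dtype=np.float32)
clip_g = np.full(CLIP_G_DIM, 2.0, dtype=np.float32)

combined = combine_pooled(clip_l, clip_g)
l_part, g_part = split_pooled(combined)
print(combined.shape)  # (2048,)
```

Because the order is fixed (CLIP-L first, CLIP-G second), indices 0 to 767 always belong to CLIP-L and indices 768 to 2047 to CLIP-G.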

## Architecture Diagram

```mermaid
graph TD
    A[Input Text: 'an elephant'] --> B[Tokenization]
    B --> C[CLIP-L Encoder]
    B --> D[CLIP-G Encoder]
    
    C --> E[CLIP-L Sequence Output<br/>77 × 768]
    D --> F[CLIP-G Sequence Output<br/>77 × 1280]
    
    E --> G[Extract EOS Token<br/>Position after last prompt token]
    F --> H[Extract EOS Token<br/>Position after last prompt token]
    
    G --> I[CLIP-L Pooled<br/>768-dim vector]
    H --> J[CLIP-G Pooled<br/>1280-dim vector]
    
    I --> K[Concatenate]
    J --> K
    
    K --> L[Final Pooled Embedding<br/>2048-dim vector]
    
    style L stroke:#90EE90
    style E stroke:#FFE4B5
    style F stroke:#FFE4B5
    style I stroke:#87CEEB
    style J stroke:#87CEEB
```

## Code Example

Here's how this works in practice:

```python
# Assuming clip_l_output and clip_g_output are CLIPTextModel outputs
clip_l_sequence = clip_l_output.last_hidden_state  # Shape: [batch, 77, 768]
clip_g_sequence = clip_g_output.last_hidden_state  # Shape: [batch, 77, 1280]

# Extract the pooled embeddings: CLIPTextModel exposes the hidden state at the
# EOS position as pooler_output (text_embeds exists only on
# CLIPTextModelWithProjection outputs)
clip_l_pooled = clip_l_output.pooler_output  # Shape: [batch, 768]
clip_g_pooled = clip_g_output.pooler_output  # Shape: [batch, 1280]
# Note: sequence[:, -1, :] is NOT equivalent for padded prompts, because the
# last position holds padding rather than the EOS token

# Concatenate to create the final pooled embedding
pooled_embedding = torch.cat([clip_l_pooled, clip_g_pooled], dim=-1)  # Shape: [batch, 2048]
```

## Key Points

1. **No averaging or complex pooling**: SD3.5 simply uses the EOS token embedding
2. **Position matters**: The EOS token directly follows the last prompt token; it sits at position 76 only when the prompt fills all 77 slots
3. **Direct concatenation**: No projection layers between CLIP-L and CLIP-G pooled outputs
4. **Both encoders matter**: CLIP-G contributes the larger share of the pooled vector (1280 of 2048 dims), while CLIP-L provides complementary information (768 dims)

## Why This Works

The EOS token embedding is a natural choice for pooling because:
- It has attended to all previous tokens through self-attention
- It represents a sentence-level summary
- It's commonly used in CLIP models for image-text matching
- It's computationally simple (no learnable pooling layers needed)
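
The first point can be illustrated with a toy causal self-attention layer. This is a simplified sketch, not the CLIP implementation: it has a single head and no learned query/key/value projections, and the function name `causal_self_attention` is hypothetical. It only demonstrates that, under a causal mask, the output at a given position depends on earlier tokens and not on later ones.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask; x has shape [seq, dim]."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # True above the diagonal marks "future" positions that must be hidden
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))  # 6 toy token embeddings
eos = 3                      # pretend the EOS token sits at index 3
out = causal_self_attention(x)

x_later = x.copy()
x_later[5] += 1.0            # perturb a token AFTER the EOS position
x_earlier = x.copy()
x_earlier[1] += 1.0          # perturb a token BEFORE the EOS position

later_changed = not np.allclose(out[eos], causal_self_attention(x_later)[eos])
earlier_changed = not np.allclose(out[eos], causal_self_attention(x_earlier)[eos])
print(later_changed, earlier_changed)  # False True
```

The same property explains why the last position of a padded sequence is not interchangeable with the EOS position: it also attends to the padding tokens in between.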

## Load CLIP-L and CLIP-G Models

In [2]:
from transformers import CLIPTextModel, CLIPTokenizer
import torch
import os
from pathlib import Path

# Setup paths
current_dir = Path.cwd()

# Load models path from config
models_path_file = current_dir.parent / "misc/paths/models.txt"
with open(models_path_file, 'r') as f:
    models_path = f.read().strip()
MODELS_DIR = current_dir.parent / models_path

CLIP_L_PATH = MODELS_DIR / "clip-vit-large-patch14"
CLIP_G_PATH = MODELS_DIR / "CLIP-ViT-bigG-14-laion2B"  # OpenCLIP ViT-bigG/14, same folder as in the CLIP-G loading cell

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

os.makedirs(MODELS_DIR, exist_ok=True)

Using device: cuda


In [8]:
# Load CLIP-L (OpenAI CLIP - 768 dim)
print("Loading CLIP-L (OpenAI)...")

if not os.path.exists(CLIP_L_PATH):
    print("  Downloading CLIP-L from Hugging Face...")
    clip_l_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_l_model = CLIPTextModel.from_pretrained(
        "openai/clip-vit-large-patch14",
        torch_dtype=torch.bfloat16
    )
    clip_l_tokenizer.save_pretrained(CLIP_L_PATH)
    clip_l_model.save_pretrained(CLIP_L_PATH)
    print("  ✓ Downloaded and saved")
else:
    print("  Loading from local...")

clip_l_tokenizer = CLIPTokenizer.from_pretrained(CLIP_L_PATH, local_files_only=True)
clip_l_model = CLIPTextModel.from_pretrained(
    CLIP_L_PATH,
    torch_dtype=torch.bfloat16,
    local_files_only=True
).to(device)
clip_l_model.eval()

print(f"✓ CLIP-L loaded!")
print(f"  Embedding dimension: {clip_l_model.config.hidden_size}")
print(f"  Max sequence length: {clip_l_tokenizer.model_max_length}")

Loading CLIP-L (OpenAI)...
  Loading from local...
✓ CLIP-L loaded!
  Embedding dimension: 768
  Max sequence length: 77


In [9]:
# Load CLIP-G (OpenCLIP bigG - 1280 dim)
# Note: We'll use laion/CLIP-ViT-bigG-14-laion2B-39B-b160k which is CLIP-G
print("\nLoading CLIP-G (OpenCLIP bigG)...")

CLIP_G_PATH = MODELS_DIR / "CLIP-ViT-bigG-14-laion2B"

if not os.path.exists(CLIP_G_PATH):
    print("  Downloading CLIP-G from Hugging Face...")
    clip_g_tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")
    clip_g_model = CLIPTextModel.from_pretrained(
        "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
        torch_dtype=torch.bfloat16
    )
    clip_g_tokenizer.save_pretrained(CLIP_G_PATH)
    clip_g_model.save_pretrained(CLIP_G_PATH)
    print("  ✓ Downloaded and saved")
else:
    print("  Loading from local...")

clip_g_tokenizer = CLIPTokenizer.from_pretrained(CLIP_G_PATH, local_files_only=True)
clip_g_model = CLIPTextModel.from_pretrained(
    CLIP_G_PATH,
    torch_dtype=torch.bfloat16,
    local_files_only=True
).to(device)
clip_g_model.eval()

print(f"✓ CLIP-G loaded!")
print(f"  Embedding dimension: {clip_g_model.config.hidden_size}")
print(f"  Max sequence length: {clip_g_tokenizer.model_max_length}")


Loading CLIP-G (OpenCLIP bigG)...
  Loading from local...
✓ CLIP-G loaded!
  Embedding dimension: 1280
  Max sequence length: 77


## Generate Combined CLIP Embeddings

In [10]:
import ipywidgets as widgets
from IPython.display import display
import numpy as np

# Create input widget
prompt_input = widgets.Textarea(
    value='an elephant',
    placeholder='Enter your prompt here',
    description='Prompt:',
    layout=widgets.Layout(width='80%', height='80px')
)

generate_button = widgets.Button(
    description='Generate CLIP-L + CLIP-G Embeddings',
    button_style='success'
)

output_area = widgets.Output()

# Global variables
current_clip_l_embedding = None
current_clip_g_embedding = None
current_combined_pooled = None
current_prompt = None

def generate_clip_embeddings(b):
    global current_clip_l_embedding, current_clip_g_embedding
    global current_combined_pooled, current_prompt
    
    with output_area:
        output_area.clear_output()
        
        prompt = prompt_input.value
        current_prompt = prompt
        
        print(f"Generating embeddings for: '{prompt}'\n")
        
        # Generate CLIP-L embeddings
        print("=== CLIP-L (OpenAI) ===")
        tokens_l = clip_l_tokenizer(
            prompt,
            padding="max_length",
            max_length=77,
            truncation=True,
            return_tensors="pt"
        )
        
        with torch.no_grad():
            tokens_l = {k: v.to(device) for k, v in tokens_l.items()}
            outputs_l = clip_l_model(**tokens_l)
            
            # Get sequence embeddings and pooled embedding
            clip_l_hidden = outputs_l.last_hidden_state  # [1, 77, 768]
            clip_l_pooled = outputs_l.pooler_output      # [1, 768]
        
        current_clip_l_embedding = clip_l_hidden.float().cpu().numpy()[0]  # [77, 768]
        clip_l_pooled_np = clip_l_pooled.float().cpu().numpy()[0]          # [768]
        
        print(f"  Sequence shape: {current_clip_l_embedding.shape} (77 tokens × 768 dims)")
        print(f"  Pooled shape: {clip_l_pooled_np.shape} (768 dims)")
        
        # Generate CLIP-G embeddings
        print("\n=== CLIP-G (OpenCLIP bigG) ===")
        tokens_g = clip_g_tokenizer(
            prompt,
            padding="max_length",
            max_length=77,
            truncation=True,
            return_tensors="pt"
        )
        
        with torch.no_grad():
            tokens_g = {k: v.to(device) for k, v in tokens_g.items()}
            outputs_g = clip_g_model(**tokens_g)
            
            # Get sequence embeddings and pooled embedding
            clip_g_hidden = outputs_g.last_hidden_state  # [1, 77, 1280]
            clip_g_pooled = outputs_g.pooler_output      # [1, 1280]
        
        current_clip_g_embedding = clip_g_hidden.float().cpu().numpy()[0]  # [77, 1280]
        clip_g_pooled_np = clip_g_pooled.float().cpu().numpy()[0]          # [1280]
        
        print(f"  Sequence shape: {current_clip_g_embedding.shape} (77 tokens × 1280 dims)")
        print(f"  Pooled shape: {clip_g_pooled_np.shape} (1280 dims)")
        
        # Concatenate pooled embeddings for SD3
        print("\n=== Combined for SD3 ===")
        current_combined_pooled = np.concatenate([clip_l_pooled_np, clip_g_pooled_np])
        print(f"  Combined pooled shape: {current_combined_pooled.shape} (768 + 1280 = 2048 dims)")
        print(f"  ✓ Ready for SD3!")
        
        print(f"\nFirst 10 values of combined pooled embedding:")
        print(current_combined_pooled[:10])

generate_button.on_click(generate_clip_embeddings)
display(prompt_input, generate_button, output_area)

Textarea(value='an elephant', description='Prompt:', layout=Layout(height='80px', width='80%'), placeholder='E…

Button(button_style='success', description='Generate CLIP-L + CLIP-G Embeddings', style=ButtonStyle())

Output()

## Save Combined CLIP Embeddings

In [11]:
import json

# Define embeddings directory
CLIP_COMBINED_DIR = current_dir.parent / "data/embeddings/CLIP_SD3"
os.makedirs(CLIP_COMBINED_DIR, exist_ok=True)

save_button = widgets.Button(
    description='Save CLIP Embeddings',
    button_style='primary'
)

save_output = widgets.Output()

def save_clip_embeddings(b):
    with save_output:
        save_output.clear_output()
        
        if current_combined_pooled is None:
            print("❌ No embeddings to save! Generate embeddings first.")
            return
        
        # Get first 4 tokens from CLIP-L for filename
        tokens_l = clip_l_tokenizer(
            current_prompt,
            padding="max_length",
            max_length=77,
            truncation=True,
            return_tensors="pt"
        )
        
        token_ids = tokens_l['input_ids'][0].tolist()
        token_strings = [clip_l_tokenizer.decode([tid]) for tid in token_ids]
        
        # Get first 4 real tokens
        filename_tokens = []
        for token in token_strings:
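            cleaned = token.replace('</w>', '').replace('<|startoftext|>', '').replace('<|endoftext|>', '')
            # Keep only filename-safe characters (drops '/', ':', quotes, whitespace, ...)
            cleaned = "".join(ch for ch in cleaned if ch.isalnum() or ch in "-_")
            if cleaned:
                filename_tokens.append(cleaned)
            if len(filename_tokens) >= 4:
                break
        
        # Fall back to a generic name if the prompt yields no usable tokens
        filename = ("_".join(filename_tokens) or "prompt") + ".json"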
        filepath = CLIP_COMBINED_DIR / filename
        
        # Save all embeddings
        embedding_data = {
            "prompt": current_prompt,
            "clip_l_sequence": current_clip_l_embedding.tolist(),
            "clip_g_sequence": current_clip_g_embedding.tolist(),
            "combined_pooled": current_combined_pooled.tolist(),
            "shapes": {
                "clip_l_sequence": list(current_clip_l_embedding.shape),
                "clip_g_sequence": list(current_clip_g_embedding.shape),
                "combined_pooled": list(current_combined_pooled.shape)
            }
        }
        
        with open(filepath, 'w') as f:
            json.dump(embedding_data, f)
        
        print(f"✓ CLIP embeddings saved!")
        print(f"  File: {filepath}")
        print(f"  Size: {os.path.getsize(filepath) / 1024:.2f} KB")
        print(f"\nContains:")
        print(f"  - CLIP-L sequence: [77, 768]")
        print(f"  - CLIP-G sequence: [77, 1280]")
        print(f"  - Combined pooled: [2048] (for SD3)")

save_button.on_click(save_clip_embeddings)
display(save_button, save_output)

Button(button_style='primary', description='Save CLIP Embeddings', style=ButtonStyle())

Output()
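
To check a saved file, it can be read back and compared against its own `shapes` entry. The sketch below assumes only the JSON keys written by the save cell above; the helper name `load_clip_embeddings` is hypothetical, and the demo writes a zero-filled stand-in file to a temporary directory instead of touching real data.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

ARRAY_KEYS = ("clip_l_sequence", "clip_g_sequence", "combined_pooled")

def load_clip_embeddings(filepath):
    """Load a saved embedding file and verify each array against 'shapes'."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    arrays = {key: np.asarray(data[key], dtype=np.float32) for key in ARRAY_KEYS}
    for key, arr in arrays.items():
        expected = tuple(data["shapes"][key])
        if arr.shape != expected:
            raise ValueError(f"{key}: expected {expected}, got {arr.shape}")
    return data["prompt"], arrays

# Demo with a zero-filled stand-in file (not real embeddings)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "an_elephant.json"
    fake = {
        "prompt": "an elephant",
        "clip_l_sequence": np.zeros((77, 768)).tolist(),
        "clip_g_sequence": np.zeros((77, 1280)).tolist(),
        "combined_pooled": np.zeros(2048).tolist(),
        "shapes": {
            "clip_l_sequence": [77, 768],
            "clip_g_sequence": [77, 1280],
            "combined_pooled": [2048],
        },
    }
    path.write_text(json.dumps(fake), encoding="utf-8")
    prompt, arrays = load_clip_embeddings(path)

print(prompt, {k: v.shape for k, v in arrays.items()})
```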

## Batch Generation from Text Input

Enter multiple prompts (one per line) to generate and save embeddings for all of them.

In [None]:
# Batch generate CLIP-L + CLIP-G embeddings from text input
batch_prompt_input = widgets.Textarea(
    value='an elephant\na red sports car\na mountain landscape',
    placeholder='Enter prompts, one per line',
    description='Prompts:',
    layout=widgets.Layout(width='80%', height='150px')
)

batch_generate_button = widgets.Button(
    description='Batch Generate & Save',
    button_style='warning'
)

batch_output_area = widgets.Output()

def generate_combined_embedding(prompt):
    """Generate CLIP-L + CLIP-G embedding for a single prompt."""
    # CLIP-L
    tokens_l = clip_l_tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
    with torch.no_grad():
        tokens_l = {k: v.to(device) for k, v in tokens_l.items()}
        outputs_l = clip_l_model(**tokens_l)
        clip_l_hidden = outputs_l.last_hidden_state.float().cpu().numpy()[0]
        clip_l_pooled = outputs_l.pooler_output.float().cpu().numpy()[0]
    
    # CLIP-G
    tokens_g = clip_g_tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
    with torch.no_grad():
        tokens_g = {k: v.to(device) for k, v in tokens_g.items()}
        outputs_g = clip_g_model(**tokens_g)
        clip_g_hidden = outputs_g.last_hidden_state.float().cpu().numpy()[0]
        clip_g_pooled = outputs_g.pooler_output.float().cpu().numpy()[0]
    
    combined_pooled = np.concatenate([clip_l_pooled, clip_g_pooled])
    
    return clip_l_hidden, clip_g_hidden, combined_pooled

def get_filename_tokens(prompt):
    """Get first 4 tokens for filename."""
    tokens_l = clip_l_tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
    token_ids = tokens_l['input_ids'][0].tolist()
    token_strings = [clip_l_tokenizer.decode([tid]) for tid in token_ids]
    
    filename_tokens = []
    for token in token_strings:
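        cleaned = token.replace('</w>', '').replace('<|startoftext|>', '').replace('<|endoftext|>', '')
        # Keep only filename-safe characters (drops '/', ':', quotes, whitespace, ...)
        cleaned = "".join(ch for ch in cleaned if ch.isalnum() or ch in "-_")
        if cleaned:
            filename_tokens.append(cleaned)
        if len(filename_tokens) >= 4:
            break
    # Fall back to a generic name if the prompt yields no usable tokens
    return filename_tokens or ["prompt"]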

def batch_generate_from_text(b):
    with batch_output_area:
        batch_output_area.clear_output()
        
        prompts = [p.strip() for p in batch_prompt_input.value.strip().split('\n') if p.strip()]
        
        if not prompts:
            print("No prompts provided!")
            return
        
        print(f"Generating {len(prompts)} CLIP-L + CLIP-G embeddings...\n")
        
        for i, prompt in enumerate(prompts, 1):
            print(f"[{i}/{len(prompts)}] '{prompt[:50]}{'...' if len(prompt) > 50 else ''}'")
            
            clip_l_hidden, clip_g_hidden, combined_pooled = generate_combined_embedding(prompt)
            filename_tokens = get_filename_tokens(prompt)
            
            filename = "_".join(filename_tokens) + ".json"
            filepath = CLIP_COMBINED_DIR / filename
            
            embedding_data = {
                "prompt": prompt,
                "clip_l_sequence": clip_l_hidden.tolist(),
                "clip_g_sequence": clip_g_hidden.tolist(),
                "combined_pooled": combined_pooled.tolist(),
                "shapes": {
                    "clip_l_sequence": list(clip_l_hidden.shape),
                    "clip_g_sequence": list(clip_g_hidden.shape),
                    "combined_pooled": list(combined_pooled.shape)
                }
            }
            
            with open(filepath, 'w') as f:
                json.dump(embedding_data, f)
            
            print(f"   ✓ Saved: {filename}")
        
        print(f"\n✓ All {len(prompts)} embeddings saved to:")
        print(f"  {CLIP_COMBINED_DIR}")

batch_generate_button.on_click(batch_generate_from_text)

print("Batch CLIP-L + CLIP-G Embedding Generator")
print(f"Output directory: {CLIP_COMBINED_DIR}")
print("Enter prompts (one per line):\n")
display(batch_prompt_input, batch_generate_button, batch_output_area)
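
Because filenames are built from the first four tokens only, two prompts such as "a red sports car" and "a red sports car at night" map to the same file, and the later one overwrites the earlier one. One possible way around this, sketched below, is to append a counter when the target already exists. The helper `unique_embedding_path` is hypothetical and not part of the notebook.

```python
import re
import tempfile
from pathlib import Path

def unique_embedding_path(tokens, directory, max_tokens=4, suffix=".json"):
    """Build a filename from the first tokens; add _2, _3, ... on collisions."""
    stem = "_".join(tokens[:max_tokens])
    # Drop anything that is not safe in a filename; fall back to a generic stem
    stem = re.sub(r"[^A-Za-z0-9_-]+", "", stem) or "prompt"
    candidate = Path(directory) / f"{stem}{suffix}"
    counter = 2
    while candidate.exists():
        candidate = Path(directory) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate

with tempfile.TemporaryDirectory() as tmp:
    first = unique_embedding_path(["a", "red", "sports", "car"], tmp)
    first.write_text("{}", encoding="utf-8")  # simulate a saved embedding
    second = unique_embedding_path(["a", "red", "sports", "car", "at", "night"], tmp)
    empty = unique_embedding_path([], tmp)
    names = (first.name, second.name, empty.name)

print(names)  # ('a_red_sports_car.json', 'a_red_sports_car_2.json', 'prompt.json')
```

Storing the full prompt inside the JSON, as the notebook already does, keeps the files distinguishable even when their names differ only by the counter.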

## Batch Generation from Example Prompts File

Load CLIP prompts from `misc/example_prompts.txt` and generate embeddings. Files are saved to `examples/` subfolder.

In [None]:
# Batch generate CLIP-L + CLIP-G embeddings from example_prompts.txt
EXAMPLES_DIR = CLIP_COMBINED_DIR / "examples"
os.makedirs(EXAMPLES_DIR, exist_ok=True)

# Path to prompts file
prompts_file = current_dir.parent / "misc/example_prompts.txt"

def load_clip_prompts_from_file(filepath):
    """Load CLIP prompts section from example_prompts.txt"""
    if not filepath.exists():
        return []
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
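    prompts = []
    in_clip_section = False
    
    # Assumes section headers start with '#', e.g. "# CLIP prompts".
    # Parsing line by line keeps prompts that contain '#' or the word "prompts".
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if line.startswith('#'):
            if in_clip_section:
                break  # the next header ends the CLIP section
            in_clip_section = 'CLIP prompts' in line
        elif in_clip_section and line:
            prompts.append(line)
    
    return prompts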

file_batch_button = widgets.Button(
    description='Generate from File',
    button_style='info'
)

file_batch_output = widgets.Output()

def batch_generate_from_file(b):
    with file_batch_output:
        file_batch_output.clear_output()
        
        prompts = load_clip_prompts_from_file(prompts_file)
        
        if not prompts:
            print(f"No CLIP prompts found in {prompts_file}!")
            return
        
        print(f"Loaded {len(prompts)} CLIP prompts")
        print(f"Output: {EXAMPLES_DIR}\n")
        
        for i, prompt in enumerate(prompts, 1):
            print(f"[{i}/{len(prompts)}] '{prompt[:60]}{'...' if len(prompt) > 60 else ''}'")
            
            clip_l_hidden, clip_g_hidden, combined_pooled = generate_combined_embedding(prompt)
            filename_tokens = get_filename_tokens(prompt)
            
            filename = "_".join(filename_tokens) + ".json"
            filepath = EXAMPLES_DIR / filename
            
            embedding_data = {
                "prompt": prompt,
                "clip_l_sequence": clip_l_hidden.tolist(),
                "clip_g_sequence": clip_g_hidden.tolist(),
                "combined_pooled": combined_pooled.tolist(),
                "shapes": {
                    "clip_l_sequence": list(clip_l_hidden.shape),
                    "clip_g_sequence": list(clip_g_hidden.shape),
                    "combined_pooled": list(combined_pooled.shape)
                }
            }
            
            with open(filepath, 'w') as f:
                json.dump(embedding_data, f)
            
            print(f"   ✓ Saved: {filename}")
        
        print(f"\n✓ All {len(prompts)} embeddings saved to examples/")

file_batch_button.on_click(batch_generate_from_file)

print("Generate from Example Prompts File")
print(f"Source: {prompts_file}")
print(f"Output: {EXAMPLES_DIR}\n")
display(file_batch_button, file_batch_output)

## Summary

This notebook generates **both CLIP-L and CLIP-G embeddings** and saves:

1. **CLIP-L sequence**: [77, 768] - Full sequence from OpenAI CLIP
2. **CLIP-G sequence**: [77, 1280] - Full sequence from OpenCLIP bigG
3. **Combined pooled**: [2048] - CLIP-L pooled + CLIP-G pooled for SD3

Saved to: `data/embeddings/CLIP_SD3/*.json`

### Usage with SD3:

The `Image_generation_SD35.ipynb` notebook loads these files and uses the **combined_pooled** embedding as the pooled prompt embedding for Stable Diffusion 3.5.

### Manipulation Ideas:

- Scale the pooled embedding values
- Zero out specific dimensions
- Interpolate between two different CLIP embeddings
- Combine the CLIP-L part from one prompt with the CLIP-G part from another
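
These ideas can be sketched as small NumPy helpers that operate on a 2048-dim combined pooled vector. All function names are hypothetical, and the random vectors stand in for embeddings loaded from the saved JSON files; how SD3 reacts to each manipulation is something to explore, not something this sketch predicts.

```python
import numpy as np

CLIP_L_DIM = 768  # first 768 dims are CLIP-L, the remaining 1280 are CLIP-G

def scale(pooled, factor):
    """Multiply every value by a constant factor."""
    return pooled * factor

def zero_dims(pooled, indices):
    """Return a copy with the given dimensions set to zero."""
    out = pooled.copy()
    out[list(indices)] = 0.0
    return out

def interpolate(pooled_a, pooled_b, t):
    """Linear interpolation: t=0 gives pooled_a, t=1 gives pooled_b."""
    return (1.0 - t) * pooled_a + t * pooled_b

def mix_encoders(pooled_a, pooled_b):
    """Take the CLIP-L part from pooled_a and the CLIP-G part from pooled_b."""
    return np.concatenate([pooled_a[:CLIP_L_DIM], pooled_b[CLIP_L_DIM:]])

rng = np.random.default_rng(0)
elephant = rng.normal(size=2048).astype(np.float32)  # stand-in for prompt A
car = rng.normal(size=2048).astype(np.float32)       # stand-in for prompt B

doubled = scale(elephant, 2.0)
clip_l_removed = zero_dims(elephant, range(CLIP_L_DIM))
halfway = interpolate(elephant, car, 0.5)
hybrid = mix_encoders(elephant, car)
print(doubled.shape, clip_l_removed.shape, halfway.shape, hybrid.shape)
```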

---
<sub>Latent Vandalism Workshop • Laura Wagner, 2026 • [laurajul.github.io](https://laurajul.github.io/)</sub>