# T5 & CLIP Embedding Manipulation for Stable Diffusion 3 Medium

This notebook lets you:
1. Load manipulated **T5 embeddings** (positive and negative)
2. Load manipulated **CLIP-L + CLIP-G combined embeddings** (positive and negative)
3. Generate images with SD3 Medium using custom embeddings
4. Control guidance scale for positive/negative balance
5. Compare results with FLUX

**Model Used:** `stabilityai/stable-diffusion-3-medium-diffusers`

**Architecture:** SD3 uses 3 text encoders:
- **T5-XXL** [512, 4096] - You can use custom manipulated embeddings
- **CLIP-L** [768-dim] + **CLIP-G** [1280-dim] = **Combined [2048-dim]** - You can use custom manipulated embeddings OR auto-generate from prompt
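
As a rough illustration of the shapes listed above, the sketch below uses NumPy stand-in arrays (not real encoder outputs) to show how the two pooled CLIP vectors join into one 2048-dim vector, next to a T5 sequence embedding. All variable names are illustrative and do not come from the notebook.

```python
import numpy as np

# Stand-in pooled vectors; real values would come from the CLIP-L and CLIP-G text encoders.
clip_l_pooled = np.zeros(768, dtype=np.float32)
clip_g_pooled = np.ones(1280, dtype=np.float32)

# The combined pooled vector is the two pooled vectors joined along the feature axis.
combined_pooled = np.concatenate([clip_l_pooled, clip_g_pooled], axis=-1)

# Stand-in T5 sequence embedding: one 4096-dim vector per token position.
t5_sequence = np.zeros((512, 4096), dtype=np.float32)

print(combined_pooled.shape)  # (2048,)
print(t5_sequence.shape)      # (512, 4096)
```

The first 768 entries of the combined vector belong to CLIP-L and the remaining 1280 to CLIP-G, so a manipulation can target one encoder's slice without touching the other.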

**CLIP Generation:** Use `003_CLIP_L_and_G_embeddings.ipynb` to create CLIP embeddings

## Installation and Setup

In [7]:
import torch
import json
import numpy as np
from transformers import T5EncoderModel, T5Tokenizer
from diffusers import StableDiffusion3Pipeline
import ipywidgets as widgets
from IPython.display import display, Image as IPImage
from PIL import Image
import os
from pathlib import Path

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Create models directory
current_dir = Path.cwd()
MODELS_DIR = current_dir.parent / "data/models"
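# Note: despite the "tensorrt" directory name, this folder holds the standard diffusers weights saved below
SD3_MODEL_PATH = os.path.join(MODELS_DIR, "stable-diffusion-3-medium-tensorrt")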

os.makedirs(MODELS_DIR, exist_ok=True)
print(f"Models directory: {os.path.abspath(MODELS_DIR)}")
print(f"SD3 Medium TensorRT path: {os.path.abspath(SD3_MODEL_PATH)}")

Using device: cuda
Models directory: /shares/weddigen.ki.uzh/laura_wagner/latent_vandalism_workshop/data/models
SD3 Medium TensorRT path: /shares/weddigen.ki.uzh/laura_wagner/latent_vandalism_workshop/data/models/stable-diffusion-3-medium-tensorrt


### Download Stable Diffusion 3 Medium from Hugging Face

In [8]:
# Load Hugging Face token from file
from pathlib import Path

# Get the token file path
current_dir = Path.cwd()
token_file = current_dir.parent / "misc/credentials/hf.txt"

print(f"Looking for HF token at: {token_file}")

if token_file.exists():
    with open(token_file, 'r') as f:
        hf_token = f.read().strip()
    
    # Set the token as an environment variable
    os.environ['HF_TOKEN'] = hf_token
    
    # Also login using huggingface_hub
    from huggingface_hub import login
    login(token=hf_token)
    print("✓ Logged in to Hugging Face")
else:
    print("⚠️ No HF token found - you may need to authenticate manually")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Looking for HF token at: /shares/weddigen.ki.uzh/laura_wagner/latent_vandalism_workshop/misc/credentials/hf.txt
✓ Logged in to Hugging Face


In [None]:
# Load Stable Diffusion 3 Medium (standard version)
try:
    MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"
    
    if not os.path.exists(SD3_MODEL_PATH):
        print("Downloading Stable Diffusion 3 Medium from Hugging Face...")
        sd_pipe = StableDiffusion3Pipeline.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.bfloat16
        )
        sd_pipe.save_pretrained(SD3_MODEL_PATH)
        print(f"✓ Model downloaded and saved to {SD3_MODEL_PATH}")
    else:
        print("Loading Stable Diffusion 3 Medium from local path...")
        sd_pipe = StableDiffusion3Pipeline.from_pretrained(
            SD3_MODEL_PATH,
            torch_dtype=torch.bfloat16,
            local_files_only=True
        )
    
    sd_pipe = sd_pipe.to(device)
    print("✓ Stable Diffusion 3 Medium loaded successfully!")
    
except Exception as e:
    print(f"❌ Error loading SD3 Medium: {e}")
    import traceback
    traceback.print_exc()

Downloading Stable Diffusion 3 Medium from Hugging Face...


You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
`torch_dtype` is deprecated! Use `dtype` instead!



## Load Embeddings

Select **positive** and **negative** embeddings for both T5 and CLIP.

**SD3 uses 3 text encoders:**
- **T5-XXL** (4096-dim per token, 512 tokens) - Load manipulated embeddings from `data/embeddings/T5/`
- **CLIP-L + CLIP-G** (768 + 1280 = 2048-dim pooled) - Load from `data/embeddings/CLIP_SD3/` OR auto-generate from prompt

**Generate CLIP embeddings:** Use notebook `003_CLIP_L_and_G_embeddings.ipynb`
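
The loaders in this notebook read JSON files with an `embedding` key (T5) or a `combined_pooled` key (CLIP), plus an optional `prompt`. The sketch below checks a file's vector width before it is used. `nested_shape` and `check_embedding_file` are hypothetical helpers, and the demo file is deliberately tiny rather than a real [512, 4096] embedding.

```python
import json
import tempfile
from pathlib import Path


def nested_shape(value):
    """Return the shape of a rectangular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def check_embedding_file(path, key, expected_last_dim):
    """Load one embedding JSON file and confirm the vector width stored under `key`."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    shape = nested_shape(data[key])
    if not shape or shape[-1] != expected_last_dim:
        raise ValueError(f"{path}: expected last dim {expected_last_dim}, got shape {shape}")
    return data.get("prompt", "Unknown"), shape


# Tiny demo file with two token rows; a real T5 file would hold many more rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_t5.json"
    demo_path.write_text(json.dumps({"prompt": "a red cat", "embedding": [[0.0] * 4096] * 2}))
    prompt, shape = check_embedding_file(demo_path, "embedding", 4096)

print(prompt, shape)
```

For a CLIP file, the same check would be called with `key="combined_pooled"` and `expected_last_dim=2048`.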

In [None]:
# Setup directories
T5_EMBEDDINGS_DIR = current_dir.parent / "data/embeddings/T5"
CLIP_SD3_EMBEDDINGS_DIR = current_dir.parent / "data/embeddings/CLIP_SD3"

# Global variables for loaded embeddings
loaded_t5_pos_embedding = None
loaded_t5_neg_embedding = None
loaded_t5_pos_prompt = None
loaded_t5_neg_prompt = None

loaded_clip_pos_pooled = None
loaded_clip_neg_pooled = None
loaded_clip_pos_prompt = None
loaded_clip_neg_prompt = None

# Get available embedding files
t5_files = []
clip_sd3_files = []

if T5_EMBEDDINGS_DIR.exists():
    t5_files = sorted([f.name for f in T5_EMBEDDINGS_DIR.glob('*.json')])

if CLIP_SD3_EMBEDDINGS_DIR.exists():
    clip_sd3_files = sorted([f.name for f in CLIP_SD3_EMBEDDINGS_DIR.glob('*.json')])

# Add 'None' option for negative embeddings
t5_files_with_none = ['(None)'] + t5_files
clip_files_with_none = ['(None)'] + clip_sd3_files

print(f"Found {len(t5_files)} T5 embeddings")
print(f"Found {len(clip_sd3_files)} CLIP-SD3 embeddings (CLIP-L + CLIP-G combined)")
print("\nNote: If no CLIP embeddings are loaded, they will be auto-generated from prompts.")

In [None]:
# ==================== T5 EMBEDDING WIDGETS ====================

# T5 POSITIVE embedding selection
t5_pos_dropdown = widgets.Dropdown(
    options=t5_files,
    description='T5 Positive:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='600px')
)

load_t5_pos_button = widgets.Button(
    description='Load T5 Positive',
    button_style='success'
)

t5_pos_output = widgets.Output()

# T5 NEGATIVE embedding selection
t5_neg_dropdown = widgets.Dropdown(
    options=t5_files_with_none,
    description='T5 Negative:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='600px')
)

load_t5_neg_button = widgets.Button(
    description='Load T5 Negative',
    button_style='warning'
)

t5_neg_output = widgets.Output()

def load_t5_pos_embedding(b):
    global loaded_t5_pos_embedding, loaded_t5_pos_prompt
    
    with t5_pos_output:
        t5_pos_output.clear_output()
        
        filename = t5_pos_dropdown.value
        if not filename:
            print("❌ Please select a T5 embedding file")
            return
        
        filepath = T5_EMBEDDINGS_DIR / filename
        
        try:
            with open(filepath, 'r') as f:
                data = json.load(f)
            
            loaded_t5_pos_embedding = np.array(data['embedding'])
            loaded_t5_pos_prompt = data.get('prompt', 'Unknown')
            
            print(f"✓ Loaded T5 POSITIVE embedding!")
            print(f"  File: {filename}")
            print(f"  Prompt: '{loaded_t5_pos_prompt}'")
            print(f"  Shape: {loaded_t5_pos_embedding.shape}")
            
        except Exception as e:
            print(f"❌ Error loading T5 positive embedding: {e}")

def load_t5_neg_embedding(b):
    global loaded_t5_neg_embedding, loaded_t5_neg_prompt
    
    with t5_neg_output:
        t5_neg_output.clear_output()
        
        filename = t5_neg_dropdown.value
        if not filename or filename == '(None)':
            loaded_t5_neg_embedding = None
            loaded_t5_neg_prompt = None
            print("✓ No negative T5 embedding (will use zero tensor)")
            return
        
        filepath = T5_EMBEDDINGS_DIR / filename
        
        try:
            with open(filepath, 'r') as f:
                data = json.load(f)
            
            loaded_t5_neg_embedding = np.array(data['embedding'])
            loaded_t5_neg_prompt = data.get('prompt', 'Unknown')
            
            print(f"✓ Loaded T5 NEGATIVE embedding!")
            print(f"  File: {filename}")
            print(f"  Prompt: '{loaded_t5_neg_prompt}'")
            print(f"  Shape: {loaded_t5_neg_embedding.shape}")
            
        except Exception as e:
            print(f"❌ Error loading T5 negative embedding: {e}")

load_t5_pos_button.on_click(load_t5_pos_embedding)
load_t5_neg_button.on_click(load_t5_neg_embedding)

# ==================== CLIP EMBEDDING WIDGETS ====================

# CLIP POSITIVE embedding selection
clip_pos_dropdown = widgets.Dropdown(
    options=clip_files_with_none,
    description='CLIP Positive:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='600px')
)

load_clip_pos_button = widgets.Button(
    description='Load CLIP Positive',
    button_style='success'
)

clip_pos_output = widgets.Output()

# CLIP NEGATIVE embedding selection
clip_neg_dropdown = widgets.Dropdown(
    options=clip_files_with_none,
    description='CLIP Negative:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='600px')
)

load_clip_neg_button = widgets.Button(
    description='Load CLIP Negative',
    button_style='warning'
)

clip_neg_output = widgets.Output()

def load_clip_pos_embedding(b):
    global loaded_clip_pos_pooled, loaded_clip_pos_prompt
    
    with clip_pos_output:
        clip_pos_output.clear_output()
        
        filename = clip_pos_dropdown.value
        if not filename or filename == '(None)':
            loaded_clip_pos_pooled = None
            loaded_clip_pos_prompt = None
            print("✓ No positive CLIP embedding (will auto-generate from T5 prompt)")
            return
        
        filepath = CLIP_SD3_EMBEDDINGS_DIR / filename
        
        try:
            with open(filepath, 'r') as f:
                data = json.load(f)
            
            # Load the combined pooled embedding (2048-dim)
            loaded_clip_pos_pooled = np.array(data['combined_pooled'])
            loaded_clip_pos_prompt = data.get('prompt', 'Unknown')
            
            print(f"✓ Loaded CLIP POSITIVE pooled embedding!")
            print(f"  File: {filename}")
            print(f"  Prompt: '{loaded_clip_pos_prompt}'")
            print(f"  Shape: {loaded_clip_pos_pooled.shape} (CLIP-L + CLIP-G)")
            
        except Exception as e:
            print(f"❌ Error loading CLIP positive embedding: {e}")

def load_clip_neg_embedding(b):
    global loaded_clip_neg_pooled, loaded_clip_neg_prompt
    
    with clip_neg_output:
        clip_neg_output.clear_output()
        
        filename = clip_neg_dropdown.value
        if not filename or filename == '(None)':
            loaded_clip_neg_pooled = None
            loaded_clip_neg_prompt = None
            print("✓ No negative CLIP embedding (will auto-generate from T5 negative prompt)")
            return
        
        filepath = CLIP_SD3_EMBEDDINGS_DIR / filename
        
        try:
            with open(filepath, 'r') as f:
                data = json.load(f)
            
            # Load the combined pooled embedding (2048-dim)
            loaded_clip_neg_pooled = np.array(data['combined_pooled'])
            loaded_clip_neg_prompt = data.get('prompt', 'Unknown')
            
            print(f"✓ Loaded CLIP NEGATIVE pooled embedding!")
            print(f"  File: {filename}")
            print(f"  Prompt: '{loaded_clip_neg_prompt}'")
            print(f"  Shape: {loaded_clip_neg_pooled.shape} (CLIP-L + CLIP-G)")
            
        except Exception as e:
            print(f"❌ Error loading CLIP negative embedding: {e}")

load_clip_pos_button.on_click(load_clip_pos_embedding)
load_clip_neg_button.on_click(load_clip_neg_embedding)

# ==================== GENERATION CONTROLS ====================

seed_input = widgets.IntText(
    value=42,
    description='Seed:',
    style={'description_width': 'initial'}
)

guidance_scale_slider = widgets.FloatSlider(
    value=7.0,
    min=0.0,
    max=20.0,
    step=0.5,
    description='Guidance Scale:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

generate_button = widgets.Button(
    description='Generate Image',
    button_style='primary',
    layout=widgets.Layout(width='300px', height='50px')
)

image_output = widgets.Output()

def generate_from_loaded_embeddings(b):
    with image_output:
        image_output.clear_output()
        
        # Check that T5 positive is loaded
        if loaded_t5_pos_embedding is None:
            print("❌ Please load a T5 positive embedding first!")
            return
        
        print("Generating SD3 image from loaded embeddings...")
        print(f"  Seed: {seed_input.value}")
        print(f"  Guidance Scale: {guidance_scale_slider.value}")
        
        # Process T5 embeddings
        print("\nT5 POSITIVE:")
        print(f"  Prompt: '{loaded_t5_pos_prompt}'")
        print(f"  Shape: {loaded_t5_pos_embedding.shape}")
        t5_pos_tensor = torch.from_numpy(loaded_t5_pos_embedding.astype(np.float32)).to(
            device=device, dtype=torch.bfloat16
        ).unsqueeze(0)
        
        # Handle negative T5 embedding
        if loaded_t5_neg_embedding is not None:
            print("\nT5 NEGATIVE:")
            print(f"  Prompt: '{loaded_t5_neg_prompt}'")
            print(f"  Shape: {loaded_t5_neg_embedding.shape}")
            t5_neg_tensor = torch.from_numpy(loaded_t5_neg_embedding.astype(np.float32)).to(
                device=device, dtype=torch.bfloat16
            ).unsqueeze(0)
        else:
            # CRITICAL: Create negative T5 embedding with SAME shape as positive
            print("\nT5 NEGATIVE: Creating zeros with same shape as positive")
            t5_neg_tensor = torch.zeros_like(t5_pos_tensor)
        
        # Process CLIP pooled embeddings
        if loaded_clip_pos_pooled is not None:
            print("\nCLIP POSITIVE: Using loaded pooled embedding")
            print(f"  Prompt: '{loaded_clip_pos_prompt}'")
            print(f"  Shape: {loaded_clip_pos_pooled.shape}")
            pooled_pos = torch.from_numpy(loaded_clip_pos_pooled.astype(np.float32)).to(
                device=device, dtype=torch.bfloat16
            ).unsqueeze(0)
        else:
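            print("\nCLIP POSITIVE: Auto-generating from T5 prompt")
            # encode_prompt returns (prompt_embeds, negative_prompt_embeds,
            # pooled_prompt_embeds, negative_pooled_prompt_embeds); keep only the pooled one.
            _, _, pooled_pos, _ = sd_pipe.encode_prompt(
                prompt=loaded_t5_pos_prompt,
                prompt_2=None,  # None falls back to `prompt`
                prompt_3=None,
                device=device,
                num_images_per_prompt=1,
                do_classifier_free_guidance=False,
            )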
        
        if loaded_clip_neg_pooled is not None:
            print("\nCLIP NEGATIVE: Using loaded pooled embedding")
            print(f"  Prompt: '{loaded_clip_neg_prompt}'")
            print(f"  Shape: {loaded_clip_neg_pooled.shape}")
            pooled_neg = torch.from_numpy(loaded_clip_neg_pooled.astype(np.float32)).to(
                device=device, dtype=torch.bfloat16
            ).unsqueeze(0)
        else:
            if loaded_t5_neg_prompt:
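                print("\nCLIP NEGATIVE: Auto-generating from T5 negative prompt")
                # encode_prompt returns four values; the third is the pooled embedding.
                _, _, pooled_neg, _ = sd_pipe.encode_prompt(
                    prompt=loaded_t5_neg_prompt,
                    prompt_2=None,  # None falls back to `prompt`
                    prompt_3=None,
                    device=device,
                    num_images_per_prompt=1,
                    do_classifier_free_guidance=False,
                )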
            else:
                print("\nCLIP NEGATIVE: Using zero tensor")
                pooled_neg = torch.zeros_like(pooled_pos)
        
        print(f"\n{'='*60}")
        print("Generating image with SD3 Medium...")
        print(f"{'='*60}\n")
        
        # Generate image
        image = sd_pipe(
            prompt_embeds=t5_pos_tensor,
            negative_prompt_embeds=t5_neg_tensor,
            pooled_prompt_embeds=pooled_pos,
            negative_pooled_prompt_embeds=pooled_neg,
            num_inference_steps=28,
            guidance_scale=guidance_scale_slider.value,
            height=1024,
            width=1024,
            generator=torch.Generator("cpu").manual_seed(seed_input.value)
        ).images[0]
        
        print("✓ Image generated!\n")
        display(image)

generate_button.on_click(generate_from_loaded_embeddings)

# ==================== DISPLAY UNIFIED INTERFACE ====================

display(widgets.VBox([
    widgets.HTML("<h2>Embedding Selection & Image Generation</h2>"),
    
    # T5 Embeddings Section
    widgets.HTML("<h3>1. T5 Embeddings (Required)</h3>"),
    widgets.HTML("<b>Positive Embedding:</b>"),
    t5_pos_dropdown,
    load_t5_pos_button,
    t5_pos_output,
    widgets.HTML("<br><b>Negative Embedding (optional):</b>"),
    t5_neg_dropdown,
    load_t5_neg_button,
    t5_neg_output,
    
    # CLIP Embeddings Section
    widgets.HTML("<br><h3>2. CLIP Embeddings (Optional - auto-generated if not loaded)</h3>"),
    widgets.HTML("<b>Positive Embedding:</b>"),
    clip_pos_dropdown,
    load_clip_pos_button,
    clip_pos_output,
    widgets.HTML("<br><b>Negative Embedding (optional):</b>"),
    clip_neg_dropdown,
    load_clip_neg_button,
    clip_neg_output,
    
    # Generation Controls Section
    widgets.HTML("<br><h3>3. Generation Controls</h3>"),
    seed_input,
    guidance_scale_slider,
    widgets.HTML("<br>"),
    generate_button,
    
    # Output Section
    widgets.HTML("<br><h3>Generated Image</h3>"),
    image_output
]))
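
The generation callback above adds a batch axis to each loaded array and falls back to zeros when no negative embedding is loaded. The sketch below mirrors that logic with NumPy standing in for torch so it can be run without a GPU. `to_batched` and `resolve_negative` are hypothetical helper names, and the shape checks are an addition that the notebook itself does not perform.

```python
import numpy as np


def to_batched(array, expected_last_dim):
    """Add a leading batch axis, mirroring the notebook's `.unsqueeze(0)` call."""
    array = np.asarray(array, dtype=np.float32)
    if array.shape[-1] != expected_last_dim:
        raise ValueError(f"expected last dim {expected_last_dim}, got {array.shape}")
    return array[np.newaxis, ...]


def resolve_negative(positive, negative):
    """Use the given negative, or zeros shaped like the positive when none is loaded."""
    if negative is None:
        return np.zeros_like(positive)
    if negative.shape != positive.shape:
        raise ValueError(f"shape mismatch: {negative.shape} vs {positive.shape}")
    return negative


t5_pos = to_batched(np.ones((512, 4096)), 4096)  # (1, 512, 4096)
t5_neg = resolve_negative(t5_pos, None)          # zeros with the same shape
pooled_pos = to_batched(np.ones(2048), 2048)     # (1, 2048)
print(t5_pos.shape, t5_neg.shape, pooled_pos.shape)
```

Checking that positive and negative shapes match before calling the pipeline gives a clearer error than a failed tensor concatenation deep inside the CFG code path.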

## Summary

### SD3 Architecture (3 Text Encoders):

Unlike FLUX, which uses **T5-XXL + CLIP-L**, SD3 uses **three** text encoders:

1. **T5-XXL**: Produces 512 tokens × 4096-dim embeddings (sequence embeddings)
2. **CLIP-L** (OpenAI): Produces 768-dim pooled embedding
3. **CLIP-G** (OpenCLIP): Produces 1280-dim pooled embedding

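The final inputs to SD3, as passed in this notebook:
- **prompt_embeds**: T5-XXL embeddings [512, 4096] - **You can manipulate these!** (The stock diffusers pipeline also prepends zero-padded CLIP token embeddings to this sequence; here only the T5 part is passed.)
- **pooled_prompt_embeds**: CLIP-L + CLIP-G concatenated [2048] - Load custom embeddings, or auto-generate them from the prompt text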
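
For comparison, the stock diffusers `encode_prompt` appears to build `prompt_embeds` by joining the CLIP-L and CLIP-G token embeddings along the feature axis, zero-padding them to T5's width, and concatenating the result with the T5 sequence along the token axis. The NumPy sketch below reproduces only that shape arithmetic with stand-in arrays. The 77-token CLIP length is the usual CLIP context size and is an assumption here, since the notebook does not state it.

```python
import numpy as np

# Stand-in per-token embeddings (all zeros); only the shapes matter in this sketch.
clip_l_tokens = np.zeros((77, 768), dtype=np.float32)
clip_g_tokens = np.zeros((77, 1280), dtype=np.float32)
t5_tokens = np.zeros((512, 4096), dtype=np.float32)

# Join the two CLIP encoders along the feature axis: (77, 2048).
clip_tokens = np.concatenate([clip_l_tokens, clip_g_tokens], axis=-1)

# Zero-pad the feature axis up to T5's width: (77, 4096).
pad_width = t5_tokens.shape[-1] - clip_tokens.shape[-1]
clip_padded = np.pad(clip_tokens, ((0, 0), (0, pad_width)))

# Concatenate along the token axis: (77 + 512, 4096).
prompt_embeds = np.concatenate([clip_padded, t5_tokens], axis=0)
print(prompt_embeds.shape)
```

Passing only the T5 part, as this notebook does, therefore changes the sequence length the transformer sees compared with a normal text prompt.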

### Key Differences from FLUX:

1. **3 Text Encoders vs 2**: SD3 uses T5 + CLIP-L + CLIP-G; FLUX uses T5 + CLIP-L
2. **Negative Embeddings**: SD3 takes negative T5 and pooled embeddings through CFG; FLUX.1-schnell runs without CFG, so it has no use for them
3. **Guidance Scale**: SD3 uses CFG (default 7.0); FLUX.1-schnell doesn't (guidance=0)
4. **More Inference Steps**: SD3 uses 28 steps here; FLUX.1-schnell uses 4 steps
5. **Different Architecture**: Both are rectified-flow transformers. SD3 uses MMDiT blocks and relies on CFG at inference; FLUX.1-schnell is a distilled model that needs no CFG
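
The guidance scale in point 3 refers to classifier-free guidance (CFG). A minimal sketch of the standard CFG combination step follows, using plain Python lists as stand-ins for the model's conditional and unconditional predictions. `cfg_combine` is a hypothetical helper, not a diffusers function.

```python
def cfg_combine(cond, uncond, scale):
    """Start at the unconditional prediction and move `scale` times toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # prediction given the positive embeddings
uncond = [0.0, 1.0]  # prediction given the negative embeddings

print(cfg_combine(cond, uncond, 1.0))  # [1.0, 2.0]
print(cfg_combine(cond, uncond, 7.0))  # [7.0, 8.0]
```

At a scale of 1.0 the result equals the conditional prediction, so the negative embeddings contribute nothing; larger scales push the result further away from the negative prediction.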

### What You Can Manipulate:

✅ **T5 Positive Embeddings** - Load your custom manipulated embeddings  
✅ **T5 Negative Embeddings** - Load different manipulated embeddings to avoid  
✅ **CLIP Pooled Embeddings** - Load custom CLIP-L + CLIP-G embeddings, or leave them unset to auto-generate from the prompt text

### Workflow:

1. Load **positive** T5 embedding (required) - with your manipulations!
2. Load **negative** T5 embedding (optional) - with different manipulations
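3. Load **CLIP pooled embeddings** (optional); if none are loaded, they are auto-generated from the prompts stored in the T5 embedding files
4. Adjust **guidance scale** (7.0 default; higher = stronger adherence to the positive and avoidance of the negative; at 1.0 or below, diffusers disables CFG and the negative embeddings are ignored)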
5. Set **seed** for reproducibility
6. Generate and compare!

### Experiment Ideas:

- **Positive normal + Negative zeroed**: See what gets removed when you zero embeddings
- **Different guidance scales**: Test 1.0, 3.5, 7.0, 15.0 with same embeddings
- **Manipulated positive only**: Use scaled/inverted/zeroed as positive prompt
- **Compare with FLUX**: Same T5 embedding, different models, different results!
- **T5 positive A + T5 negative B**: Contrast concepts by steering toward A and away from B (e.g., "red cat" positive, "blue dog" negative)
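
The scaled, inverted, and zeroed variants mentioned above can be produced with a few array operations and saved in the JSON layout the loaders expect (`prompt` and `embedding` keys). The sketch below uses a tiny stand-in array and a temporary directory. Treating "inverted" as a sign flip and using a scale factor of 2.0 are assumptions, since the notebook does not define either.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for a real T5 embedding; a real one would be loaded from data/embeddings/T5/.
original = np.arange(8, dtype=np.float32).reshape(2, 4)

variants = {
    "scaled": original * 2.0,            # amplify every value (factor chosen arbitrarily)
    "inverted": -original,               # assumption: "inverted" means a sign flip
    "zeroed": np.zeros_like(original),   # remove all information
}

with tempfile.TemporaryDirectory() as tmp:
    for name, array in variants.items():
        payload = {"prompt": f"demo ({name})", "embedding": array.tolist()}
        (Path(tmp) / f"demo_{name}.json").write_text(json.dumps(payload))
    saved = sorted(p.name for p in Path(tmp).glob("*.json"))

print(saved)
```

Writing the files into `data/embeddings/T5/` instead of a temporary directory would make them appear in the dropdowns once the directory-scanning cell is re-run.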