# Understanding Padding Tokens in CLIP Embeddings

## The Question

A prompt like **"a beaver with blue teeth"** fills only a handful of the 77 token positions CLIP expects: the words themselves plus a start and an end marker. Every remaining position is filled with a **padding token**, and Investigation 1 counts them exactly.

**Key Questions:**
1. Are padding tokens always the same?
2. How do they affect the final embedding?
3. Do padding embeddings at position 10 differ from padding at position 50?
4. Can we manipulate padding tokens to affect Flux output?

Let's investigate!

## Setup

In [None]:
import torch
from transformers import CLIPTextModel, CLIPTokenizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load CLIP text model
model_name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
model = CLIPTextModel.from_pretrained(model_name).to(device)
model.eval()

print(f"✓ Model loaded")
print(f"  Vocab size: {tokenizer.vocab_size}")
print(f"  Max length: {tokenizer.model_max_length}")
print(f"  Pad token: '{tokenizer.pad_token}' (ID: {tokenizer.pad_token_id})")
print(f"  EOS token: '{tokenizer.eos_token}' (ID: {tokenizer.eos_token_id})")
print(f"  BOS token: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")

## Investigation 1: What Are the Token IDs?

In [None]:
# Test with a short prompt
prompt = "a beaver with blue teeth"

# Tokenize
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

token_ids = tokens['input_ids'][0].tolist()
attention_mask = tokens['attention_mask'][0].tolist()

print(f"Prompt: '{prompt}'")
print(f"\nToken IDs (first 15 positions):")
print("Position | Token ID | Attention | Decoded")
print("-" * 60)
for i in range(15):
    decoded = tokenizer.decode([token_ids[i]])
    print(f"{i:8} | {token_ids[i]:8} | {attention_mask[i]:9} | '{decoded}'")

print("\n...")
print("\nLast 5 positions:")
for i in range(72, 77):
    decoded = tokenizer.decode([token_ids[i]])
    print(f"{i:8} | {token_ids[i]:8} | {attention_mask[i]:9} | '{decoded}'")

# Count real tokens vs padding
num_real_tokens = sum(attention_mask)
num_padding = 77 - num_real_tokens
print(f"\n✓ Real tokens: {num_real_tokens}")
print(f"✓ Padding tokens: {num_padding}")
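The same real-versus-padding split can be checked without loading the model. The sketch below is a minimal illustration: `summarize_padding` is a hypothetical helper, and the word-token IDs are made-up placeholders. Only 49406 and 49407 are the actual start and end IDs in this CLIP vocabulary.

```python
def summarize_padding(token_ids, attention_mask):
    """Split a padded sequence into real and padding parts using the mask."""
    if len(token_ids) != len(attention_mask):
        raise ValueError("token_ids and attention_mask must have the same length")
    num_real = sum(attention_mask)
    # The slice below assumes right-padding: all real tokens come first.
    if any(attention_mask[num_real:]):
        raise ValueError("mask is not right-padded")
    padding_ids = token_ids[num_real:]
    return {
        "num_real": num_real,
        "num_padding": len(padding_ids),
        "distinct_padding_ids": sorted(set(padding_ids)),
    }

BOS_ID, EOS_ID = 49406, 49407
word_ids = [101, 102, 103, 104, 105]  # placeholders, not real CLIP token IDs
real_part = [BOS_ID] + word_ids + [EOS_ID]
example_ids = real_part + [EOS_ID] * (77 - len(real_part))
example_mask = [1] * len(real_part) + [0] * (77 - len(real_part))

summary = summarize_padding(example_ids, example_mask)
print(summary)
```

If `distinct_padding_ids` contains a single ID, every padding slot holds the same token, which answers the first key question at the token level (the embeddings are a separate matter).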

## Investigation 2: Are Padding Embeddings Identical?

Let's check if padding embeddings at different positions are the same or different.

In [None]:
# Generate embedding
with torch.no_grad():
    tokens_device = {k: v.to(device) for k, v in tokens.items()}
    outputs = model(**tokens_device)
    embedding = outputs.last_hidden_state[0]  # [77, 768]

embedding_np = embedding.cpu().numpy()

print(f"Embedding shape: {embedding_np.shape}")
print(f"\nLet's compare padding embeddings at different positions...\n")

# Find first padding position
first_padding_pos = num_real_tokens
print(f"First padding position: {first_padding_pos}")

# Compare padding embeddings at different positions
if num_padding > 1:
    # Sample a few padding positions; drop out-of-range values and duplicates
    candidates = [first_padding_pos, first_padding_pos + 10, first_padding_pos + 20, 76]
    padding_positions = sorted({p for p in candidates if p < 77})
    
    print(f"\nComparing padding embeddings at positions: {padding_positions}")
    print("-" * 60)
    
    for i, pos in enumerate(padding_positions):
        emb = embedding_np[pos]
        print(f"\nPosition {pos}:")
        print(f"  First 10 values: {emb[:10]}")
        print(f"  Mean: {emb.mean():.6f}")
        print(f"  Std: {emb.std():.6f}")
        print(f"  L2 norm: {np.linalg.norm(emb):.6f}")
    
    # Calculate pairwise differences
    print("\n" + "="*60)
    print("Pairwise Cosine Similarities Between Padding Embeddings:")
    print("="*60)
    
    for i in range(len(padding_positions)):
        for j in range(i+1, len(padding_positions)):
            pos_i = padding_positions[i]
            pos_j = padding_positions[j]
            emb_i = embedding_np[pos_i]
            emb_j = embedding_np[pos_j]
            
            # Cosine similarity
            cos_sim = np.dot(emb_i, emb_j) / (np.linalg.norm(emb_i) * np.linalg.norm(emb_j))
            
            # L2 distance
            l2_dist = np.linalg.norm(emb_i - emb_j)
            
            print(f"Positions {pos_i} vs {pos_j}:")
            print(f"  Cosine similarity: {cos_sim:.8f}")
            print(f"  L2 distance: {l2_dist:.6f}")
            print()
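The loop above spot-checks a few positions. To compare every padding position against every other at once, the cosine similarities can be computed as a single matrix product. This is a minimal sketch on random stand-in data (not real CLIP output); `cosine_similarity_matrix` is a hypothetical helper name and the first padding index is assumed.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the [n, n] matrix of cosine similarities between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

rng = np.random.default_rng(0)
fake_embedding = rng.normal(size=(77, 768))  # stand-in for embedding_np
first_pad = 7  # assumed index of the first padding position

pad_sims = cosine_similarity_matrix(fake_embedding[first_pad:])
off_diagonal = pad_sims[~np.eye(len(pad_sims), dtype=bool)]
print(pad_sims.shape)
print(f"min off-diagonal similarity: {off_diagonal.min():.4f}")
```

With real data, passing `embedding_np[first_padding_pos:]` instead of the random array gives the full picture in one call. The minimum off-diagonal value shows the most dissimilar pair of padding positions.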

## Investigation 3: Are Padding Embeddings the Same Across Different Prompts?

Do the padding embeddings change when we use different prompts?

In [None]:
# Test multiple prompts of different lengths
test_prompts = [
    "cat",
    "a red cat",
    "a beaver with blue teeth",
    "an elephant standing in a field of flowers",
]

padding_embeddings = {}

for prompt in test_prompts:
    # Tokenize
    tokens = tokenizer(
        prompt,
        padding="max_length",
        max_length=77,
        truncation=True,
        return_tensors="pt"
    )
    
    # Get embedding
    with torch.no_grad():
        tokens_device = {k: v.to(device) for k, v in tokens.items()}
        outputs = model(**tokens_device)
        embedding = outputs.last_hidden_state[0].cpu().numpy()
    
    # Find padding positions
    attention_mask = tokens['attention_mask'][0].tolist()
    num_real = sum(attention_mask)
    first_padding = num_real
    
    # Store padding embedding at a consistent position (e.g., position 50).
    # Position 50 is padding whenever the first padding index is <= 50.
    if first_padding <= 50:
        padding_embeddings[prompt] = {
            'num_real_tokens': num_real,
            'first_padding': first_padding,
            'padding_at_50': embedding[50],
            'padding_at_76': embedding[76]
        }
    
    print(f"Prompt: '{prompt}'")
    print(f"  Real tokens: {num_real}")
    print(f"  First padding position: {first_padding}")
    print()

# Compare padding embeddings across prompts
print("="*60)
print("Comparing Padding at Position 50 Across Different Prompts:")
print("="*60)

prompts_list = list(padding_embeddings.keys())
cross_prompt_sims = []  # collect every pair so the conclusion covers all of them
for i in range(len(prompts_list)):
    for j in range(i+1, len(prompts_list)):
        prompt_i = prompts_list[i]
        prompt_j = prompts_list[j]
        
        emb_i = padding_embeddings[prompt_i]['padding_at_50']
        emb_j = padding_embeddings[prompt_j]['padding_at_50']
        
        cos_sim = np.dot(emb_i, emb_j) / (np.linalg.norm(emb_i) * np.linalg.norm(emb_j))
        l2_dist = np.linalg.norm(emb_i - emb_j)
        cross_prompt_sims.append(cos_sim)
        
        print(f"\n'{prompt_i[:30]}' vs '{prompt_j[:30]}'")
        print(f"  Cosine similarity: {cos_sim:.10f}")
        print(f"  L2 distance: {l2_dist:.10f}")

print("\n" + "="*60)
# Base the conclusion on the least similar pair, not just the last one compared
if not cross_prompt_sims:
    print("Not enough prompts with padding at position 50 to compare")
elif min(cross_prompt_sims) > 0.9999:
    print("✓ CONCLUSION: Padding embeddings at the same position are")
    print("  NEARLY IDENTICAL across different prompts!")
else:
    print("✓ CONCLUSION: Padding embeddings differ across prompts")
    print(f"  Lowest cosine similarity: {min(cross_prompt_sims):.6f}")
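One reason padding outputs can depend on the prompt is that CLIP's text encoder uses causal self-attention, so each position mixes in information from the positions before it. The toy below is a deliberately simplified single-head attention with identity projections and random vectors (an assumption for illustration, not CLIP's actual layers). It shows that changing an early token changes the output at a later position, while the output before the change is unaffected.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head attention with identity projections and a causal mask."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # Position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
seq_a = rng.normal(size=(6, 8))   # 6 toy positions, 8 dimensions
seq_b = seq_a.copy()
seq_b[1] = rng.normal(size=8)     # change one early "real" token

out_a = causal_self_attention(seq_a)
out_b = causal_self_attention(seq_b)

print("position 0 unchanged:", np.allclose(out_a[0], out_b[0]))
print("position 5 unchanged:", np.allclose(out_a[5], out_b[5]))
```

In this toy, the last position plays the role of a padding slot: it sits after the modified token, so its output shifts even though its own input vector is the same in both sequences.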

## Visualization: Real Tokens vs Padding Tokens

In [None]:
# Use the "a beaver with blue teeth" prompt
prompt = "a beaver with blue teeth"

tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    tokens_device = {k: v.to(device) for k, v in tokens.items()}
    outputs = model(**tokens_device)
    embedding = outputs.last_hidden_state[0].cpu().numpy()

attention_mask = tokens['attention_mask'][0].tolist()
num_real = sum(attention_mask)

# Create visualization
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

# Plot 1: Heatmap of embedding
ax1 = axes[0]
im = ax1.imshow(embedding.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
ax1.axvline(x=num_real-0.5, color='lime', linewidth=3, label='Padding starts here')
ax1.set_xlabel('Token Position', fontsize=12)
ax1.set_ylabel('Embedding Dimension', fontsize=12)
ax1.set_title(f'CLIP Text Embedding: "{prompt}"\n(Green line = where padding starts)', 
              fontsize=14, fontweight='bold')
ax1.legend(loc='upper right')
plt.colorbar(im, ax=ax1, label='Embedding Value')

# Plot 2: L2 norms of each token embedding
ax2 = axes[1]
norms = np.linalg.norm(embedding, axis=1)  # one L2 norm per token position
colors = ['steelblue' if i < num_real else 'orange' for i in range(77)]
ax2.bar(range(77), norms, color=colors, alpha=0.7)
ax2.axvline(x=num_real-0.5, color='lime', linewidth=3, linestyle='--', 
            label='Padding starts here')
ax2.set_xlabel('Token Position', fontsize=12)
ax2.set_ylabel('L2 Norm', fontsize=12)
ax2.set_title('L2 Norm of Each Token Embedding\n(Blue = real tokens, Orange = padding)', 
              fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✓ Visualization complete")
print(f"  Real tokens: {num_real}")
print(f"  Padding tokens: {77 - num_real}")
print(f"  Average norm of real tokens: {np.mean(norms[:num_real]):.4f}")
print(f"  Average norm of padding tokens: {np.mean(norms[num_real:]):.4f}")

## Experiment: What If We Zero Out Padding?

What happens if we replace padding embeddings with zeros?

In [None]:
import json

prompt = "a beaver with blue teeth"

# Generate normal embedding
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    tokens_device = {k: v.to(device) for k, v in tokens.items()}
    outputs = model(**tokens_device)
    embedding_normal = outputs.last_hidden_state[0].cpu().numpy()

attention_mask = tokens['attention_mask'][0].tolist()
num_real = sum(attention_mask)

# Create version with zeroed padding
embedding_zeroed = embedding_normal.copy()
embedding_zeroed[num_real:] = 0  # Zero out all padding positions

print(f"Prompt: '{prompt}'")
print(f"Real tokens: {num_real}")
print(f"\nOriginal embedding stats:")
print(f"  Mean: {embedding_normal.mean():.6f}")
print(f"  Std: {embedding_normal.std():.6f}")
print(f"  Norm: {np.linalg.norm(embedding_normal):.6f}")

print(f"\nZeroed padding embedding stats:")
print(f"  Mean: {embedding_zeroed.mean():.6f}")
print(f"  Std: {embedding_zeroed.std():.6f}")
print(f"  Norm: {np.linalg.norm(embedding_zeroed):.6f}")

# Visualize difference
fig, axes = plt.subplots(3, 1, figsize=(15, 12))

# Original
im1 = axes[0].imshow(embedding_normal.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
axes[0].axvline(x=num_real-0.5, color='lime', linewidth=2)
axes[0].set_title('Original Embedding (with padding)', fontweight='bold')
axes[0].set_ylabel('Dimension')
plt.colorbar(im1, ax=axes[0])

# Zeroed
im2 = axes[1].imshow(embedding_zeroed.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
axes[1].axvline(x=num_real-0.5, color='lime', linewidth=2)
axes[1].set_title('Zeroed Padding Embedding', fontweight='bold')
axes[1].set_ylabel('Dimension')
plt.colorbar(im2, ax=axes[1])

# Difference
diff = embedding_normal - embedding_zeroed
im3 = axes[2].imshow(diff.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
axes[2].axvline(x=num_real-0.5, color='lime', linewidth=2)
axes[2].set_title('Difference (what we removed)', fontweight='bold')
axes[2].set_xlabel('Token Position')
axes[2].set_ylabel('Dimension')
plt.colorbar(im3, ax=axes[2])

plt.tight_layout()
plt.show()

# Save both versions
current_dir = Path(os.getcwd())
output_dir = current_dir.parent / "data" / "embeddings" / "CLIP"
output_dir.mkdir(parents=True, exist_ok=True)

# Save normal
normal_data = {
    "prompt": prompt,
    "embedding": embedding_normal.tolist(),
    "shape": [77, 768]
}
with open(output_dir / "beaver_normal.json", 'w') as f:
    json.dump(normal_data, f)

# Save zeroed
zeroed_data = {
    "prompt": prompt + " (zeroed padding)",
    "embedding": embedding_zeroed.tolist(),
    "shape": [77, 768]
}
with open(output_dir / "beaver_zeroed_padding.json", 'w') as f:
    json.dump(zeroed_data, f)

print("\n✓ Saved both embeddings to:") 
print(f"  {output_dir / 'beaver_normal.json'}")
print(f"  {output_dir / 'beaver_zeroed_padding.json'}")
print("\nYou can test both in Flux to see if padding affects the output!")
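Zeroing by slice (`embedding_zeroed[num_real:] = 0`) relies on all padding sitting at the end of the sequence. A mask-based version makes that assumption explicit and can also report how much of the embedding's squared norm lived in the padding rows. This is a sketch on a tiny made-up array; `zero_padding` and `padding_energy_fraction` are hypothetical helper names.

```python
import numpy as np

def zero_padding(embedding, attention_mask):
    """Return a copy with rows zeroed wherever attention_mask is 0."""
    mask = np.asarray(attention_mask, dtype=bool)
    if mask.shape[0] != embedding.shape[0]:
        raise ValueError("mask length must match the number of positions")
    zeroed = embedding.copy()
    zeroed[~mask] = 0.0
    return zeroed

def padding_energy_fraction(embedding, attention_mask):
    """Fraction of the squared Frobenius norm carried by padding rows."""
    mask = np.asarray(attention_mask, dtype=bool)
    total = float(np.sum(embedding ** 2))
    return float(np.sum(embedding[~mask] ** 2)) / total if total > 0 else 0.0

# Tiny stand-in: 4 positions, 2 dimensions, last two positions are padding
toy = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
toy_mask = [1, 1, 0, 0]

toy_zeroed = zero_padding(toy, toy_mask)
fraction = padding_energy_fraction(toy, toy_mask)
print(toy_zeroed)
print(f"padding share of squared norm: {fraction:.3f}")
```

On the real embedding, a large padding share would suggest that zeroing removes a substantial part of the signal, which is worth knowing before interpreting any change in the generated image.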

## Summary: What We Learned About Padding

### Key Findings:

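1. **Padding Token Structure**:
   - For `openai/clip-vit-large-patch14`, the pad token is the same `<|endoftext|>` token as EOS, so padding shows up as the EOS ID repeated (the Setup cell prints both IDs to confirm)
   - Padding fills every position after the real tokens, through index 76 (the 77th and last slot)

2. **Are Padding Embeddings Identical?**
   - Padding at the **same position** across different prompts: Investigation 3 reports whether these are nearly identical. CLIP's text encoder uses causal attention, so padding positions still see the real tokens before them and exact equality is not guaranteed
   - Padding at **different positions** within the same prompt may vary slightly (each position adds its own learned position embedding)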

3. **Do They Affect Flux?**
   - This is an empirical question! Test the two saved embeddings:
     - `beaver_normal.json` - with normal padding
     - `beaver_zeroed_padding.json` - with padding set to zero
   - Compare the Flux outputs to see if padding matters

4. **Practical Implications**:
   - If padding doesn't affect output: Only the real tokens matter for image generation
   - If padding DOES affect output: The padding embeddings contribute to Flux's understanding
   - Either way, understanding this helps us know which parts of the embedding we can safely manipulate

### Next Steps:

1. Test both embeddings in Flux (use your `Image_generation.ipynb` notebook)
2. Compare outputs - are they identical or different?
3. If different: padding matters! We could explore manipulating padding for creative effects
4. If identical: we can focus our manipulations on real token positions only
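For step 2, a numeric comparison is more reliable than eyeballing two images. The sketch below assumes both Flux outputs were generated with the same seed and settings and have been loaded as same-sized `uint8` arrays (for example via `np.asarray(PIL.Image.open(path))`). The arrays here are synthetic stand-ins, and `compare_images` is a hypothetical helper.

```python
import numpy as np

def compare_images(img_a, img_b):
    """Return simple difference metrics for two same-shaped uint8 images."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape")
    # Cast up first so uint8 subtraction cannot wrap around.
    diff = img_a.astype(np.int16) - img_b.astype(np.int16)
    abs_diff = np.abs(diff)
    return {
        "identical": bool(np.array_equal(img_a, img_b)),
        "max_abs_diff": int(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "changed_pixel_fraction": float((abs_diff.max(axis=-1) > 0).mean()),
    }

rng = np.random.default_rng(0)
image_normal = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
image_zeroed = image_normal.copy()
image_zeroed[:8, :8, 0] ^= 1  # flip the low bit of 64 red-channel values

result_same = compare_images(image_normal, image_normal)
result_diff = compare_images(image_normal, image_zeroed)
print(result_same)
print(result_diff)
```

Small nonzero differences can also come from nondeterministic GPU kernels, so it may help to generate the normal embedding twice and use that pair as a baseline before attributing a difference to padding.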