# 🎨 GGUF Token Embedding Visualizer

**Complementary to [Transformers-Explainer](https://poloclub.github.io/transformer-explainer/)** - Embedding Layer Analysis

---

## Overview

This notebook visualizes **how GGUF models represent tokens as high-dimensional vectors** and explores the **semantic structure** of the embedding space using GPU-accelerated dimensionality reduction.

### What Transformers-Explainer Shows

- **Token Embedding**: Shows 768-dimensional vectors as colored rectangles
- **Positional Encoding**: Displays the position embeddings added to each token (learned in GPT-2, not sinusoidal)
- **Combined Input**: Token + Position → Transformer input
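
The combined input in the last bullet can be sketched in a few lines of NumPy. This is a minimal illustration with made-up table sizes and random values, not GPT-2's actual weights; `token_table`, `position_table`, and `embed` are hypothetical names.

```python
import numpy as np

# Toy sizes chosen for illustration only (GPT-2 small uses 50,257 x 768).
vocab_size, max_positions, d_model = 10, 8, 4

rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, d_model))        # one row per token id
position_table = rng.normal(size=(max_positions, d_model))  # one row per position

def embed(token_ids):
    """Look up each token's row and add the row for its position."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

x = embed([3, 1, 3])
print(x.shape)  # (3, 4)
```

The same token id (3) appears at positions 0 and 2, so the two rows share a token vector but differ by their position vectors.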

### What This Notebook Adds

1. **Extract actual embeddings** from GGUF models (768-4096 dimensions)
2. **GPU-accelerated UMAP/t-SNE** for 2D/3D projections
3. **Semantic clustering**: Visualize similar words in embedding space
4. **Quantization impact**: Compare FP32 → Q4_K_M embedding quality
5. **Interactive 3D exploration** with Graphistry

---

## Architecture

```
GGUF Model (GPU 0)           RAPIDS + Graphistry (GPU 1)
┌──────────────────┐         ┌─────────────────────────┐
│ Token Embeddings │────────>│ cuML UMAP (GPU-accel)   │
│ (V × d_model)    │         │ ├─ 768D → 3D projection │
│                  │         │ └─ Distance matrix      │
│ Vocab V: varies  │         │                         │
│ Dimensions:      │         │ Graphistry 3D Plot      │
│ - Gemma: 2048    │         │ ├─ Semantic clusters    │
│ - Llama: 4096    │         │ ├─ Word similarity      │
│ - Qwen: 2048     │         │ └─ Interactive explore  │
└──────────────────┘         └─────────────────────────┘
```

---

## Learning Objectives

1. **Understand embeddings**: How models represent discrete tokens as continuous vectors
2. **Semantic structure**: Why similar words cluster together
3. **Dimensionality**: Explore 768D-4096D embedding spaces
4. **Quantization trade-offs**: Impact of Q4_K_M on embedding quality
5. **GPU acceleration**: RAPIDS cuML for fast UMAP/t-SNE

In [None]:
# Kaggle environment
import os

In [None]:
# ==============================================================================
# Graphistry Credentials
# ==============================================================================
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
GRAPHISTRY_API_KEY = user_secrets.get_secret("Graphistry_Personal_Key_ID")
GRAPHISTRY_USERNAME = user_secrets.get_secret("Graphistry_Username")

In [None]:
# ==============================================================================
# GPU Environment Verification
# ==============================================================================
import subprocess
print("🎮 GPU Status:")
subprocess.run(["nvidia-smi", "-L"])

In [None]:
# ==============================================================================
# Install Dependencies
# ==============================================================================
!pip install -q git+https://github.com/llcuda/llcuda.git \
    huggingface_hub graphistry[all] \
    cudf-cu12 cugraph-cu12 cuml-cu12 \
    plotly scikit-learn umap-learn

In [None]:
# ==============================================================================
# Download GGUF Model
# ==============================================================================
import llcuda
from llcuda.models import load_model_smart

# Choose model (embedding dimensions vary)
model_name = "gemma-3-1b-Q4_K_M"  # 1152-dim embeddings (Gemma 3 1B hidden size)
# model_name = "llama-3.2-3b-Q4_K_M"  # 3072-dim
# model_name = "qwen-2.5-3b-Q4_K_M"   # 2048-dim

model_path = load_model_smart(model_name, interactive=False)
print(f"✅ Model: {model_path}")

In [None]:
# ==============================================================================
# Start llama-server (GPU 0)
# ==============================================================================
from llcuda.server import ServerManager
# Set before launching: the llama-server process is expected to inherit this and see only GPU 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

server = ServerManager()
server.start_server(
    model_path=str(model_path),
    gpu_layers=99,
    ctx_size=2048,
    flash_attn=True,
    verbose=True
)
print("✅ llama-server on GPU 0")

In [None]:
# ==============================================================================
# Extract Token Embeddings via Embedding API
# ==============================================================================
from llcuda.api.client import LlamaCppClient
import numpy as np
import pandas as pd

client = LlamaCppClient(base_url="http://127.0.0.1:8090")

print("="*70)
print("📊 EXTRACTING TOKEN EMBEDDINGS")
print("="*70)

# Test vocabulary: words from different semantic categories
test_words = [
    # Colors
    "red", "blue", "green", "yellow", "orange", "purple",
    # Animals
    "cat", "dog", "bird", "fish", "lion", "tiger",
    # Technology
    "computer", "software", "algorithm", "neural", "network", "GPU",
    # Emotions
    "happy", "sad", "angry", "excited", "calm", "peaceful",
    # Numbers
    "one", "two", "three", "four", "five", "six",
    # Verbs
    "run", "jump", "swim", "fly", "walk", "dance",
    # Countries
    "USA", "China", "India", "France", "Germany", "Japan"
]

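# Extract embeddings using llama.cpp embedding API.
# Note: this endpoint typically returns a pooled, post-transformer vector for the
# whole input string rather than a raw row of the token-embedding matrix, so
# multi-token words are pooled across their tokens.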
embeddings = []
valid_words = []

for word in test_words:
    try:
        response = client.embeddings.create(input=[word])
        if response.data:
            embedding = response.data[0].embedding
            embeddings.append(embedding)
            valid_words.append(word)
    except Exception as e:
        print(f"⚠️  Skipping '{word}': {e}")

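# Fail early with a clear message instead of an IndexError on shape[1]
if not embeddings:
    raise RuntimeError("No embeddings returned; check that llama-server is running and serving embeddings")
embeddings_array = np.array(embeddings, dtype=np.float32)
d_model = embeddings_array.shape[1]

print(f"\n✅ Extracted {len(embeddings_array)} embeddings")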
print(f"   Dimension: {d_model}")
print(f"   Shape: {embeddings_array.shape}")

In [None]:
# ==============================================================================
# Analyze Embedding Statistics
# ==============================================================================
print("="*70)
print("📈 EMBEDDING STATISTICS")
print("="*70)

# Basic statistics
print(f"\nMean: {embeddings_array.mean():.4f}")
print(f"Std:  {embeddings_array.std():.4f}")
print(f"Min:  {embeddings_array.min():.4f}")
print(f"Max:  {embeddings_array.max():.4f}")

# L2 norms
norms = np.linalg.norm(embeddings_array, axis=1)
print(f"\nL2 Norms:")
print(f"  Mean: {norms.mean():.4f}")
print(f"  Std:  {norms.std():.4f}")
print(f"  Range: [{norms.min():.4f}, {norms.max():.4f}]")

# Pairwise cosine similarities
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(embeddings_array)
print(f"\nCosine Similarity Matrix:")
print(f"  Mean: {sim_matrix.mean():.4f}")
print(f"  Std:  {sim_matrix.std():.4f}")

# Find most similar pairs
print(f"\n🔍 Most Similar Word Pairs:")
np.fill_diagonal(sim_matrix, -1)  # Ignore self-similarity
top_pairs = set()
for i in range(len(valid_words)):
    j = int(np.argmax(sim_matrix[i]))
    # Order the indices so (a, b) and (b, a) are reported once
    a, b = min(i, j), max(i, j)
    similarity = float(sim_matrix[a, b])
    if similarity > 0.7:  # Threshold
        top_pairs.add((valid_words[a], valid_words[b], similarity))

top_pairs = sorted(top_pairs, key=lambda x: x[2], reverse=True)[:10]
for word1, word2, sim in top_pairs:
    print(f"  '{word1}' ↔ '{word2}': {sim:.3f}")
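
For reference, the cosine similarity that `sklearn` computes in the cell above reduces to normalizing each row and taking dot products. A minimal NumPy sketch on hand-written 2D vectors follows; the helper name `cosine_similarity_matrix` and the toy data are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity: normalize rows, then take dot products."""
    vectors = np.asarray(vectors, dtype=np.float64)
    row_norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(row_norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Rows 0 and 1 point the same way (similarity 1), row 2 is orthogonal to them (0), and row 3 points the opposite way (-1). Vector length does not affect the result.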

In [None]:
# ==============================================================================
# GPU-Accelerated UMAP Dimensionality Reduction (GPU 1)
# ==============================================================================
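# Must be set before cupy/cuML initialize CUDA in this process; afterwards
# physical GPU 1 is the only visible device (addressed as device 0).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"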

print("="*70)
print("🚀 GPU-ACCELERATED UMAP (GPU 1)")
print("="*70)

from cuml import UMAP
import cupy as cp

# Transfer embeddings to GPU as float32, the precision cuML UMAP works in
embeddings_gpu = cp.asarray(embeddings_array, dtype=cp.float32)

# UMAP to 3D (GPU-accelerated)
umap = UMAP(n_components=3, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_3d = umap.fit_transform(embeddings_gpu)

# Convert back to CPU for visualization
embeddings_3d_cpu = cp.asnumpy(embeddings_3d)

print(f"\n✅ Reduced {d_model}D → 3D")
print(f"   Shape: {embeddings_3d_cpu.shape}")
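
If cuML or a second GPU is not available, a linear PCA projection can serve as a rough CPU stand-in for the UMAP step. It is a different technique: PCA keeps the directions of greatest variance, whereas UMAP tries to preserve local neighborhoods, so the layout will differ. The sketch below uses only NumPy; `pca_project` is a hypothetical helper and the random matrix stands in for `embeddings_array`.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project rows onto their top principal components using an SVD."""
    vectors = np.asarray(vectors, dtype=np.float64)
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(42)
demo = rng.normal(size=(42, 64))   # stand-in for embeddings_array
coords = pca_project(demo)
print(coords.shape)  # (42, 3)
```

The returned array has the same `(n_words, 3)` shape as `embeddings_3d_cpu`, so it could feed the same plotting code.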

In [None]:
# ==============================================================================
# Prepare Visualization Data
# ==============================================================================
print("="*70)
print("📊 PREPARING VISUALIZATION DATA")
print("="*70)

# Create DataFrame with embeddings and metadata
viz_df = pd.DataFrame({
    'word': valid_words,
    'x': embeddings_3d_cpu[:, 0],
    'y': embeddings_3d_cpu[:, 1],
    'z': embeddings_3d_cpu[:, 2],
    'norm': norms[:len(valid_words)]
})

# Add semantic categories
category_words = {
    "color": ["red", "blue", "green", "yellow", "orange", "purple"],
    "animal": ["cat", "dog", "bird", "fish", "lion", "tiger"],
    "technology": ["computer", "software", "algorithm", "neural", "network", "GPU"],
    "emotion": ["happy", "sad", "angry", "excited", "calm", "peaceful"],
    "number": ["one", "two", "three", "four", "five", "six"],
    "verb": ["run", "jump", "swim", "fly", "walk", "dance"],
    "country": ["USA", "China", "India", "France", "Germany", "Japan"],
}
# Invert to a word -> category lookup; unknown words fall back to "other"
word_to_category = {w: cat for cat, words in category_words.items() for w in words}
categories = [word_to_category.get(word, "other") for word in valid_words]

viz_df['category'] = categories

print(f"\n✅ Visualization data ready")
print(viz_df.head())

print(f"\nCategories:")
print(viz_df['category'].value_counts())

In [None]:
# ==============================================================================
# Create Interactive 3D Plotly Visualization
# ==============================================================================
import plotly.express as px

print("="*70)
print("🎨 CREATING 3D PLOTLY VISUALIZATION")
print("="*70)

fig = px.scatter_3d(
    viz_df,
    x='x', y='y', z='z',
    color='category',
    text='word',
    size='norm',
    title=f'{model_name} Token Embeddings (UMAP 3D Projection)',
    labels={'x': 'UMAP 1', 'y': 'UMAP 2', 'z': 'UMAP 3'},
    color_discrete_sequence=px.colors.qualitative.Set2
)

fig.update_traces(
    textposition='top center',
    marker=dict(line=dict(width=0.5, color='DarkSlateGrey'))
)

fig.update_layout(
    scene=dict(
        xaxis=dict(showgrid=True, gridcolor='lightgray'),
        yaxis=dict(showgrid=True, gridcolor='lightgray'),
        zaxis=dict(showgrid=True, gridcolor='lightgray')
    ),
    height=800
)

fig.show()

print("\n✅ Interactive 3D plot rendered")
print("   Rotate, zoom, and hover over points to explore!")

In [None]:
# ==============================================================================
# Register Graphistry
# ==============================================================================
import graphistry

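# NOTE: username/password registration expects the account password. If the
# Kaggle secret holds a personal key ID, pass personal_key_id=... and
# personal_key_secret=... to register() instead.
graphistry.register(
    api=3,
    protocol="https",
    server="hub.graphistry.com",
    username=GRAPHISTRY_USERNAME,
    password=GRAPHISTRY_API_KEY
)
print("✅ Graphistry registered")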

In [None]:
# ==============================================================================
# Create Semantic Similarity Network Graph
# ==============================================================================
print("="*70)
print("🌐 CREATING SEMANTIC SIMILARITY NETWORK")
print("="*70)

# Create edges based on cosine similarity
edges = []
threshold = 0.6  # Only connect similar words

for i in range(len(valid_words)):
    for j in range(i+1, len(valid_words)):
        sim = sim_matrix[i, j]
        if sim > threshold:
            edges.append({
                'source': valid_words[i],
                'target': valid_words[j],
                'weight': float(sim),
                'similarity': f"{sim:.3f}"
            })

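# Name the columns explicitly so the frame is valid even when no pair passes the threshold
edges_df = pd.DataFrame(edges, columns=['source', 'target', 'weight', 'similarity'])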
nodes_df = viz_df.rename(columns={'word': 'id'})

print(f"\nNodes: {len(nodes_df)}")
print(f"Edges: {len(edges_df)} (similarity > {threshold})")

# Create Graphistry visualization
g = graphistry.edges(edges_df, 'source', 'target')\
    .nodes(nodes_df, 'id')\
    .bind(
        point_title='id',
        point_label='id',
        point_color='category',
        point_size='norm',
        edge_weight='weight',
        edge_title='similarity'
    )

g = g.settings(
    url_params={
        'play': 0,
        'strongGravity': False,
        'edgeCurvature': 0.3,
        'scalingRatio': 1.5,
        'gravity': 0.5
    }
)

viz_url = g.plot(render=False)

print(f"\n✅ Graphistry visualization created!")
print(f"\n🔗 Open in browser:")
print(f"   {viz_url}")

---

## 🎯 Key Insights

### Semantic Clustering

**Expected Observations:**

1. **Category Clustering**: Words from same semantic category (e.g., colors) cluster together
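2. **Synonyms Close**: Near-synonyms tend to have high cosine similarity, though the exact values depend on the model and on how the server pools tokens
3. **Antonyms Often Close Too**: Opposites such as "happy" and "sad" occur in similar contexts, so they frequently land near each other rather than in separate regions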
4. **Hierarchical Structure**: Broader categories contain subclusters
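
One way to check the category-clustering expectation numerically is to compare the mean cosine similarity within categories against the mean between categories. The sketch below is a minimal version run on synthetic clusters; `category_separation` is a hypothetical helper and the toy vectors are not model output.

```python
import numpy as np

def category_separation(vectors, labels):
    """Return (mean within-category, mean between-category) cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float64)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # drop self-similarity
    within = sims[same & off_diagonal].mean()
    between = sims[~same].mean()
    return within, between

# Synthetic data: two noisy clusters around different centers.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_labels = ["color"] * 5 + ["animal"] * 5
toy_vectors = np.vstack([centers[0] + 0.1 * rng.normal(size=(5, 3)),
                         centers[1] + 0.1 * rng.normal(size=(5, 3))])
within, between = category_separation(toy_vectors, toy_labels)
print(f"within={within:.3f}  between={between:.3f}")
```

In the notebook this could be called as `category_separation(embeddings_array, viz_df['category'])`. A within-category mean clearly above the between-category mean would support observation 1.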

### Comparison with Transformers-Explainer

| Feature | Transformers-Explainer | This Notebook |
|---------|------------------------|---------------|
| **Embeddings** | Shows 768D vectors as rectangles | **3D UMAP projection** |
| **Positional** | Learned position embeddings (GPT-2) | Not visualized (focus on tokens) |
| **Interactivity** | Fixed web interface | **3D rotate/zoom + Graphistry** |
| **Semantic Analysis** | Not shown | **Cosine similarity network** |
| **Quantization** | FP32 only | **Q4_K_M quantized embeddings** |
| **Vocabulary Size** | GPT-2 (50,257) | **GGUF (varies by model)** |

### Quantization Impact

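**Q4_K_M vs FP32:**
- **Precision**: about 4.85 bits/weight vs 32 bits/weight
- **Similarity Preservation**: Cosine similarities are expected to be mostly preserved; this notebook does not measure them against an FP32 baseline
- **Clustering**: Semantic clusters are expected to remain intact
- **Trade-off**: roughly 6.6× smaller model (32 / 4.85); the accuracy cost is typically small but depends on the model and task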
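
To get a feel for how little 4-bit rounding perturbs vector directions, the sketch below round-trips a synthetic matrix through a simplified symmetric 4-bit block quantizer and measures the cosine similarity before and after. This is not the actual Q4_K_M scheme, which uses a more elaborate super-block layout; the block size, the function names, and the random data are all assumptions for illustration.

```python
import numpy as np

def fake_quantize_4bit(matrix, block_size=32):
    """Round-trip values through symmetric 4-bit integers, one scale per block."""
    matrix = np.asarray(matrix, dtype=np.float32)
    # Assumes the total number of values is a multiple of block_size.
    blocks = matrix.reshape(-1, block_size)
    # Each block's scale maps its largest magnitude to the integer 7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # all-zero block: any scale works
    quantized = np.clip(np.round(blocks / scales), -7, 7)
    return (quantized * scales).reshape(matrix.shape)

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    return num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(0)
full_precision = rng.normal(size=(100, 256)).astype(np.float32)  # synthetic "embeddings"
restored = fake_quantize_4bit(full_precision)
cos = rowwise_cosine(full_precision, restored)
print(f"mean cosine after round trip: {cos.mean():.4f}")
```

A real comparison would extract embeddings for the same words from an FP32 (or F16) GGUF and from the Q4_K_M GGUF, then apply `rowwise_cosine` to the two arrays.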

---

## 🔬 Advanced Analysis

### Embedding Space Geometry

```python
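# Intrinsic dimensionality estimation
from sklearn.decomposition import PCA
# PCA cannot extract more components than min(n_samples, n_features)
n_components = min(50, *embeddings_array.shape)
pca = PCA(n_components=n_components)
pca.fit(embeddings_array)
explained_var = pca.explained_variance_ratio_.cumsum()
# First index reaching 95%, converted to a component count (+1) and capped
n_95 = min(int(np.searchsorted(explained_var, 0.95)) + 1, n_components)
print(f"Dimensions for 95% variance: {n_95}")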
```

### Analogies (King - Man + Woman ≈ Queen)

```python
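# Test word analogies
def get_embedding(word):
    response = client.embeddings.create(input=[word])
    return np.array(response.data[0].embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king = get_embedding("king")
man = get_embedding("man")
woman = get_embedding("woman")
result = king - man + woman

# Compare result to "queen"; "king" is a baseline, since the result often stays close to it
queen = get_embedding("queen")
print(f"cos(result, queen) = {cosine(result, queen):.3f}")
print(f"cos(result, king)  = {cosine(result, king):.3f}")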
```
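
An analogy test is usually scored by ranking a candidate vocabulary against the result vector while excluding the three input words. The sketch below shows that ranking on hand-made 2D vectors; `nearest_words` and `toy_vocab` are hypothetical, and real model embeddings will rarely be this clean.

```python
import numpy as np

def nearest_words(query, vocab_vectors, exclude=(), top_k=3):
    """Rank vocabulary words by cosine similarity to the query vector."""
    query = np.asarray(query, dtype=np.float64)
    scores = {}
    for word, vec in vocab_vectors.items():
        if word in exclude:
            continue  # input words would otherwise tend to rank first
        vec = np.asarray(vec, dtype=np.float64)
        scores[word] = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hand-made 2D vectors (axis 0 = "royalty", axis 1 = "gender"), for illustration only.
toy_vocab = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
query = (np.array(toy_vocab["king"]) - np.array(toy_vocab["man"])
         + np.array(toy_vocab["woman"]))
ranked = nearest_words(query, toy_vocab, exclude={"king", "man", "woman"})
print(ranked[0][0])  # queen
```

With the notebook's data, `vocab_vectors` could be built as `dict(zip(valid_words, embeddings_array))` after adding the analogy words to `test_words`.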

---

## 🛠️ Customization Tips

### Add More Words
```python
test_words += ["science", "math", "physics", "biology"]
```

### Adjust UMAP Parameters
```python
umap = UMAP(
    n_components=3,
    n_neighbors=30,    # Higher = smoother manifold
    min_dist=0.05,     # Lower = tighter clusters
    metric='cosine'    # Use cosine distance
)
```

### Change Similarity Threshold
```python
threshold = 0.5  # More edges (lower threshold)
```
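
Before settling on a threshold, it can help to see how many edges each candidate value would produce. The sketch below counts unordered pairs above several thresholds; `count_edges` is a hypothetical helper and the small matrix is hand-written stand-in data.

```python
import numpy as np

def count_edges(sim_matrix, threshold):
    """Count unordered pairs (i < j) whose similarity exceeds the threshold."""
    sim_matrix = np.asarray(sim_matrix)
    upper_i, upper_j = np.triu_indices(sim_matrix.shape[0], k=1)
    return int((sim_matrix[upper_i, upper_j] > threshold).sum())

# Small hand-written similarity matrix standing in for the notebook's sim_matrix.
toy_sims = np.array([
    [1.00, 0.90, 0.55, 0.10],
    [0.90, 1.00, 0.65, 0.20],
    [0.55, 0.65, 1.00, 0.30],
    [0.10, 0.20, 0.30, 1.00],
])
for t in (0.5, 0.6, 0.7):
    print(f"threshold={t}: {count_edges(toy_sims, t)} edges")
```

On this toy matrix the edge count drops from 3 to 2 to 1 as the threshold rises. A threshold that leaves most nodes isolated, or connects nearly every pair, tends to produce an uninformative graph.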

---

## 📚 Next Notebooks

- **Notebook 14**: Layer-by-Layer Inference Tracker
- **Notebook 15**: Multi-Head Attention Comparator
- **Notebook 16**: Quantization Impact Analyzer