# Advanced: Compute CLIP Embeddings from Your Own Images

This notebook allows you to compute CLIP embeddings for your own image collection. This is useful when:

- You have a collection of images you want to make searchable
- You're working with a dataset not covered by pre-calculated embeddings
- You want to use a different CLIP model

**Hardware:**
- **NVIDIA GPU (CUDA):** fastest (~1–5 sec per 100 images)
- **Apple Silicon GPU (MPS):** good performance on M1/M2/M3 Macs
- **CPU:** works fine for small collections; expect ~1–5 min per 100 images
- **Disk space:** ~4 MB per 1,000 images (ViT-B/32)

---

## How It Works

```mermaid
flowchart TD
    subgraph Input
        FOLDER["📁 Image Folder\n(jpg, png, webp...)"]
    end
    
    subgraph Processing
        LOAD["Load images\nin batches"]
        PREP["Preprocess\n(resize, normalize)"]
        ENC["🧠 CLIP\nImage Encoder"]
    end
    
    subgraph Output
        EMB["💾 embeddings.npz\n(numpy arrays)"]
        IDX["📋 index.json\n(filenames)"]
    end
    
    FOLDER --> LOAD
    LOAD --> PREP
    PREP --> ENC
    ENC --> EMB
    ENC --> IDX
```

The output files can then be used with Notebooks 02 and 03 for semantic search.

---

## Part 1: Setup

In [None]:
# Standard library imports
import os
import json
import time
from pathlib import Path

# External libraries
import numpy as np
from PIL import Image as PILImage
from tqdm.notebook import tqdm

# Import PyTorch and CLIP
try:
    import torch
    import clip
    CLIP_AVAILABLE = True
    print("✓ CLIP loaded successfully!")
except ImportError:
    CLIP_AVAILABLE = False
    print("❌ CLIP not installed.")
    print("   Install with: pip install git+https://github.com/openai/CLIP.git torch torchvision")

# Select compute device: CUDA GPU > Apple Silicon GPU > CPU
DEVICE = 'cpu'  # Default, so later cells can reference DEVICE even if CLIP is missing
if CLIP_AVAILABLE:
    if torch.cuda.is_available():
        DEVICE = 'cuda'
        gpu_name = torch.cuda.get_device_name(0)
        gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"✓ NVIDIA GPU (CUDA): {gpu_name}")
        print(f"  Memory: {gpu_mem:.1f} GB")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        DEVICE = 'mps'
        print("✓ Apple Silicon GPU (MPS): good performance!")
    else:
        print("ℹ️ No GPU detected. Using CPU.")
        print("   Expect ~1–5 minutes per 100 images. Fine for small collections.")

In [None]:
# Set up paths
CURRENT_DIR = Path.cwd()
PROJECT_ROOT = CURRENT_DIR.parent

print(f"Project root: {PROJECT_ROOT}")

---

## Part 2: Configuration

### Choose Your CLIP Model

| Model | Embedding Dim | Speed | Quality | VRAM |
|-------|---------------|-------|---------|------|
| `RN50` | 1024 | Fast | Good | ~2 GB |
| `RN101` | 512 | Medium | Better | ~3 GB |
| `ViT-B/32` | 512 | Fast | Good | ~2 GB |
| `ViT-B/16` | 512 | Medium | Better | ~3 GB |
| `ViT-L/14` | 768 | Slow | Best | ~5 GB |
| `ViT-L/14@336px` | 768 | Slowest | Best+ | ~6 GB |

**Recommendation:** Start with `ViT-B/32` for a good balance of speed and quality.
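
The embedding dimension also determines how much space the results take. Below is a rough sketch of the arithmetic, using the dimensions from the table above and assuming the embeddings are stored as uncompressed float32 (4 bytes per value). `EMBEDDING_DIMS` and `raw_size_mb` are illustrative names, not part of the notebook.

```python
# Embedding dimensions copied from the model table above.
EMBEDDING_DIMS = {
    "RN50": 1024,
    "RN101": 512,
    "ViT-B/32": 512,
    "ViT-B/16": 512,
    "ViT-L/14": 768,
    "ViT-L/14@336px": 768,
}


def raw_size_mb(model_name, num_images, bytes_per_value=4):
    """Raw size of the embedding matrix in MB (float32 assumed by default)."""
    dim = EMBEDDING_DIMS[model_name]
    return num_images * dim * bytes_per_value / 1e6


for name in EMBEDDING_DIMS:
    print(f"{name:<16} {raw_size_mb(name, 1000):.2f} MB per 1,000 images")
```

Filenames, the JSON index, and compression all change the real on-disk size, so treat these numbers as an estimate rather than an exact file size.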

In [None]:
# ============================================================
# CONFIGURATION - Adjust these settings!
# ============================================================

# Collection to compute embeddings for
COLLECTION_NAME = "Uppsala University"  # <-- CHANGE THIS!

# CLIP model to use
MODEL_NAME = 'ViT-B/32'  # <-- Options: RN50, ViT-B/32, ViT-B/16, ViT-L/14

# Batch size (reduce if you get out-of-memory errors)
BATCH_SIZE = 32 if DEVICE == 'cuda' else 8

# ============================================================

safe_name = COLLECTION_NAME.lower().replace(' ', '_')
IMAGES_FOLDER      = PROJECT_ROOT / "data" / "images"     / COLLECTION_NAME
OUTPUT_EMBEDDINGS  = PROJECT_ROOT / "data" / "embeddings" / f"{safe_name}_clip_embeddings.npz"

print(f"Collection:    {COLLECTION_NAME}")
print(f"Images folder: {IMAGES_FOLDER}")
print(f"Output file:   {OUTPUT_EMBEDDINGS}")
print(f"Model:         {MODEL_NAME}")
print(f"Batch size:    {BATCH_SIZE}")
print(f"Device:        {DEVICE}")
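
The `safe_name` conversion above only replaces spaces. If a collection name may contain other characters that are awkward in filenames, a stricter variant could look like the sketch below. `make_safe_name` is a hypothetical helper, not something this notebook defines.

```python
import re
import unicodedata


def make_safe_name(collection_name):
    """Turn a display name into a lowercase, filesystem-friendly identifier."""
    # Decompose accented characters (for example 'ö' becomes 'o' plus a combining mark) ...
    normalized = unicodedata.normalize("NFKD", collection_name)
    # ... and drop everything that is not plain ASCII.
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_only.lower())
    return slug.strip("_")


print(make_safe_name("Uppsala University"))    # uppsala_university
print(make_safe_name("Göteborg / Museum #2"))  # goteborg_museum_2
```

For names made up only of letters, digits, and spaces this gives the same result as the simple `lower().replace(' ', '_')` used in the configuration cell.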

In [None]:
# Check the images folder
if IMAGES_FOLDER.exists():
    # Find all images
    image_extensions = ['.jpg', '.jpeg', '.png', '.webp', '.gif', '.bmp']
    image_files = []
    
    for ext in image_extensions:
        image_files.extend(IMAGES_FOLDER.rglob(f'*{ext}'))
        image_files.extend(IMAGES_FOLDER.rglob(f'*{ext.upper()}'))
    
    image_files = sorted(set(image_files))
    
    print(f"✓ Found {len(image_files)} images in {IMAGES_FOLDER}")
    
    if image_files:
        print(f"\nFirst 5 images:")
        for f in image_files[:5]:
            print(f"  - {f.name}")
        if len(image_files) > 5:
            print(f"  ... and {len(image_files) - 5} more")
else:
    print(f"❌ Folder not found: {IMAGES_FOLDER}")
    print("   Please update COLLECTION_NAME (or IMAGES_FOLDER) to point to your images.")
    image_files = []
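
The glob above matches each extension in all-lowercase and all-uppercase form, so a mixed-case name such as `photo.Jpg` can be missed on case-sensitive filesystems. One alternative, sketched here with a hypothetical `find_images` helper, is to walk the folder once and compare the lowercased suffix against a set:

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp"}


def find_images(folder):
    """Return a sorted list of image files under `folder`, matched case-insensitively."""
    folder = Path(folder)
    return sorted(
        path
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `find_images(IMAGES_FOLDER)` would yield the same kind of list as `image_files`, without needing the `set(...)` deduplication step.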

---

## Part 3: Load CLIP Model

In [None]:
if CLIP_AVAILABLE:
    print(f"Loading CLIP model '{MODEL_NAME}'...")
    print("(The first run downloads the model weights: ~350 MB for ViT-B/32, more for larger models)")
    
    model, preprocess = clip.load(MODEL_NAME, device=DEVICE)
    model.eval()
    
    # Get embedding dimension
    with torch.no_grad():
        dummy_text = clip.tokenize(["test"]).to(DEVICE)
        dummy_embedding = model.encode_text(dummy_text)
        EMBEDDING_DIM = dummy_embedding.shape[1]
    
    print(f"✓ Model loaded on {DEVICE}")
    print(f"  Embedding dimension: {EMBEDDING_DIM}")
else:
    print("❌ CLIP not available")

---

## Part 4: Compute Embeddings

This is the main computation. Depending on your hardware and number of images, this may take:
- **GPU:** ~1-5 seconds per 100 images
- **CPU:** ~1-5 minutes per 100 images
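
To turn these rates into a rough time budget for a whole collection, the arithmetic can be sketched as follows. The rates are the ranges quoted above; `estimate_runtime` is an illustrative helper, and real throughput depends on image size, disk speed, and the model chosen.

```python
# Seconds per 100 images as (best case, worst case), taken from the ranges above.
SECONDS_PER_100 = {
    "gpu": (1, 5),
    "cpu": (60, 300),
}


def estimate_runtime(num_images, hardware):
    """Return (low, high) estimated runtime in seconds for `num_images`."""
    low, high = SECONDS_PER_100[hardware]
    return num_images / 100 * low, num_images / 100 * high


for hardware in ("gpu", "cpu"):
    low, high = estimate_runtime(10_000, hardware)
    print(f"{hardware}: {low / 60:.1f} to {high / 60:.1f} minutes for 10,000 images")
```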

In [None]:
def compute_embeddings(image_files, batch_size=32):
    """
    Compute CLIP embeddings for a list of image files.
    
    Parameters:
        image_files: List of Path objects to images
        batch_size: Number of images to process at once
    
    Returns:
        Tuple of (embeddings array, filenames list, errors list)
    """
    all_embeddings = []
    all_filenames = []
    errors = []
    
    num_batches = (len(image_files) + batch_size - 1) // batch_size
    
    print(f"Processing {len(image_files)} images in {num_batches} batches...")
    print(f"Batch size: {batch_size}")
    print()
    
    start_time = time.time()
    
    for batch_idx in tqdm(range(0, len(image_files), batch_size), desc="Processing batches"):
        batch_files = image_files[batch_idx:batch_idx + batch_size]
        batch_images = []
        batch_names = []
        
        # Load and preprocess batch
        for img_path in batch_files:
            try:
                image = PILImage.open(img_path).convert('RGB')
                image_tensor = preprocess(image)
                batch_images.append(image_tensor)
                # Store relative path from images folder
                try:
                    rel_path = img_path.relative_to(IMAGES_FOLDER)
                except ValueError:
                    rel_path = img_path.name
                batch_names.append(str(rel_path))
            except Exception as e:
                errors.append((str(img_path), str(e)))
                continue
        
        if not batch_images:
            continue
        
        # Compute embeddings
        batch_tensor = torch.stack(batch_images).to(DEVICE)
        
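        with torch.no_grad():
            # Cast to float32: on GPU, CLIP runs in half precision by default
            embeddings = model.encode_image(batch_tensor).float()
            # L2-normalize so that a dot product equals cosine similarity
            embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)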
        
        all_embeddings.append(embeddings.cpu().numpy())
        all_filenames.extend(batch_names)
        
        # Clear GPU memory
        if DEVICE == 'cuda':
            torch.cuda.empty_cache()
    
    elapsed = time.time() - start_time
    
    # Combine all embeddings
    if all_embeddings:
        embeddings_array = np.concatenate(all_embeddings, axis=0)
    else:
        embeddings_array = np.array([])
    
    print(f"\n✓ Computed {len(all_filenames)} embeddings in {elapsed:.1f} seconds")
    print(f"  Speed: {len(all_filenames) / max(elapsed, 1e-9):.1f} images/second")
    
    if errors:
        print(f"\n⚠️ {len(errors)} images failed to process:")
        for path, err in errors[:5]:
            print(f"  - {Path(path).name}: {err[:50]}")
        if len(errors) > 5:
            print(f"  ... and {len(errors) - 5} more")
    
    return embeddings_array, all_filenames, errors
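
The normalization step inside `compute_embeddings` is what lets a plain dot product serve as the similarity score later on. Here is a small NumPy illustration with made-up 3-dimensional vectors (real CLIP embeddings have 512 or more dimensions):

```python
import numpy as np

# Two made-up "embeddings" with different lengths.
vectors = np.array([[3.0, 4.0, 0.0],
                    [1.0, 0.0, 0.0]], dtype=np.float32)

# L2-normalize each row, mirroring `embeddings / embeddings.norm(dim=-1, keepdim=True)`.
normalized = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

# Cosine similarity computed from the original vectors ...
a, b = vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ... equals the plain dot product of the normalized ones.
dot = float(normalized[0] @ normalized[1])

print(np.linalg.norm(normalized, axis=-1))  # each row now has length 1
print(cosine, dot)                          # both are 0.6 up to rounding
```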

In [None]:
# ============================================================
# RUN THE COMPUTATION
# ============================================================

# Set to True when ready to compute
RUN_COMPUTATION = False  # <-- Change to True when ready!

# ============================================================

if RUN_COMPUTATION and CLIP_AVAILABLE and image_files:
    print("=" * 60)
    print("STARTING EMBEDDING COMPUTATION")
    print("=" * 60)
    print()
    
    embeddings, filenames, errors = compute_embeddings(image_files, batch_size=BATCH_SIZE)
    
    print(f"\nFinal embeddings shape: {embeddings.shape}")
else:
    if not RUN_COMPUTATION:
        print("ℹ️ Computation skipped. Set RUN_COMPUTATION = True to proceed.")
    elif not CLIP_AVAILABLE:
        print("❌ CLIP not available")
    else:
        print("❌ No images found")
    
    embeddings = None
    filenames = None

---

## Part 5: Save Embeddings

In [None]:
if embeddings is not None and len(embeddings) > 0:
    # Create output directory
    OUTPUT_EMBEDDINGS.parent.mkdir(parents=True, exist_ok=True)
    
    # Save embeddings as numpy archive
    print(f"Saving embeddings to {OUTPUT_EMBEDDINGS}...")
    
    np.savez_compressed(
        OUTPUT_EMBEDDINGS,
        embeddings=embeddings,
        filenames=np.array(filenames),
        model_name=MODEL_NAME
    )
    
    file_size = OUTPUT_EMBEDDINGS.stat().st_size / (1024 * 1024)
    print(f"✓ Saved! File size: {file_size:.2f} MB")
    
    # Save index JSON for easy reference
    index_file = OUTPUT_EMBEDDINGS.with_suffix('.json')
    index_data = {
        'model_name': MODEL_NAME,
        'embedding_dim': int(embeddings.shape[1]),
        'num_images': len(filenames),
        'source_folder': str(IMAGES_FOLDER),
        'filenames': filenames
    }
    
    # Write the index as UTF-8 regardless of the platform's default encoding
    with open(index_file, 'w', encoding='utf-8') as f:
        json.dump(index_data, f, indent=2)
    
    print(f"✓ Index saved to {index_file}")
    
    print(f"\n" + "=" * 60)
    print("DONE! Your embeddings are ready to use.")
    print("=" * 60)
    print(f"\nTo use these embeddings in Notebooks 02 or 03:")
    print(f"  1. Update EMBEDDINGS_FILE to point to:")
    print(f"     {OUTPUT_EMBEDDINGS}")
    print(f"  2. Update IMAGES_DIR to point to:")
    print(f"     {IMAGES_FOLDER}")
else:
    print("⚠️ No embeddings to save. Run the computation first.")
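
A quick way to see what ends up in the archive is a round trip with dummy data. This sketch writes to a temporary directory and uses random numbers in place of real embeddings; the key names match the ones used in the save cell above.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((3, 512)).astype(np.float32)
fake_filenames = ["a.jpg", "sub/b.png", "c.webp"]

with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = Path(tmp_dir) / "demo_clip_embeddings.npz"
    np.savez_compressed(
        archive_path,
        embeddings=fake_embeddings,
        filenames=np.array(fake_filenames),
        model_name="ViT-B/32",
    )

    # String and float arrays are stored natively, so no pickling is involved.
    with np.load(archive_path, allow_pickle=False) as data:
        keys = sorted(data.files)
        restored_embeddings = data["embeddings"]
        restored_filenames = data["filenames"].tolist()
        restored_model = str(data["model_name"])

print(keys)                       # ['embeddings', 'filenames', 'model_name']
print(restored_embeddings.shape)  # (3, 512)
print(restored_filenames, restored_model)
```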

---

## Part 6: Verify Embeddings

Let's verify the saved embeddings work correctly.

In [None]:
# Verify saved embeddings
if OUTPUT_EMBEDDINGS.exists():
    print("Verifying saved embeddings...")
    
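    # Load embeddings (plain float and string arrays, so pickle support is not needed)
    data = np.load(OUTPUT_EMBEDDINGS, allow_pickle=False)
    
    loaded_embeddings = data['embeddings']
    loaded_filenames = data['filenames']
    # Older archives may lack the model name, so check before reading it
    loaded_model = str(data['model_name']) if 'model_name' in data.files else 'unknown'
    
    print("✓ Loaded successfully!")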
    print(f"  Embeddings shape: {loaded_embeddings.shape}")
    print(f"  Number of images: {len(loaded_filenames)}")
    print(f"  Model: {loaded_model}")
    
    # Quick sanity check - embeddings should be normalized
    norms = np.linalg.norm(loaded_embeddings, axis=1)
    print(f"  Embedding norms: min={norms.min():.4f}, max={norms.max():.4f}, mean={norms.mean():.4f}")
    
    if np.allclose(norms, 1.0, atol=0.01):
        print("  ✓ Embeddings are properly normalized")
    else:
        print("  ⚠️ Embeddings may not be normalized")
else:
    print(f"❌ Embeddings file not found: {OUTPUT_EMBEDDINGS}")

In [None]:
# Test semantic search with the new embeddings
if OUTPUT_EMBEDDINGS.exists() and CLIP_AVAILABLE:
    print("Testing semantic search...")
    
    # Load embeddings as torch tensor
    data = np.load(OUTPUT_EMBEDDINGS, allow_pickle=True)
    test_embeddings = torch.tensor(data['embeddings'], dtype=torch.float32).to(DEVICE)
    test_filenames = data['filenames']
    
    # Test query
    test_query = "landscape with water"
    
    # Encode query
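    with torch.no_grad():
        text_tokens = clip.tokenize([test_query]).to(DEVICE)
        # Cast to float32 to match the stored embeddings (CLIP uses float16 on GPU),
        # otherwise the matrix product below can fail with a dtype mismatch
        text_embedding = model.encode_text(text_tokens).float()
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)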
    
    # Search
    similarities = (test_embeddings @ text_embedding.T).squeeze()
    top_indices = similarities.argsort(descending=True)[:5]
    
    print(f"\nTest query: '{test_query}'")
    print(f"Top 5 results:")
    for i, idx in enumerate(top_indices, 1):
        filename = str(test_filenames[idx.item()])
        score = similarities[idx].item()
        print(f"  {i}. {score:.4f} - {filename[:50]}")
    
    print("\n✓ Semantic search working!")
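
The same ranking logic works without PyTorch once the query embedding is available as a NumPy array. The following is a minimal sketch, assuming both sides are already L2-normalized; `top_k` is a hypothetical helper and the vectors are toy data.

```python
import numpy as np


def top_k(image_embeddings, query_embedding, k=5):
    """Return (indices, scores) of the k most similar rows, best first."""
    # With unit-length vectors, the dot product is the cosine similarity.
    scores = image_embeddings @ query_embedding
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy data: three unit vectors in 2D and a query pointing along the x axis.
images = np.array([[0.0, 1.0],
                   [1.0, 0.0],
                   [0.6, 0.8]])
query = np.array([1.0, 0.0])

indices, scores = top_k(images, query, k=2)
print(indices.tolist())  # [1, 2]
print(scores.tolist())   # [1.0, 0.6]
```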

---

## Summary

In this notebook, you learned how to:

1. **Configure** CLIP embedding computation
2. **Choose** an appropriate CLIP model
3. **Compute** embeddings for an image collection
4. **Save** embeddings in a reusable format
5. **Verify** the embeddings work correctly

### Output Files

| File | Description |
|------|-------------|
| `<collection>_clip_embeddings.npz` | NumPy archive with embeddings, filenames, and model name |
| `<collection>_clip_embeddings.json` | Index file with metadata and filenames |

### Using Your Embeddings

To use your custom embeddings in other notebooks:

```python
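import numpy as np

# Load your embeddings (the archive holds plain arrays, so pickle support can stay off)
data = np.load('path/to/your/embeddings.npz', allow_pickle=False)
embeddings = data['embeddings']
filenames = data['filenames']
# Fall back to 'unknown' if the archive was saved without a model name
model_name = str(data['model_name']) if 'model_name' in data.files else 'unknown'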
```

### Tips for Large Collections

- **Memory management:** Reduce batch size if you run out of GPU memory
- **Checkpointing:** For very large collections, save intermediate results
- **Model selection:** ViT-B/32 is fastest; ViT-L/14 is best quality
- **Storage:** ~4MB per 1000 images (ViT-B/32)
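
The checkpointing tip can be sketched as follows: process the file list in chunks, write each chunk's result to its own file, and skip chunks whose file already exists when the run is restarted. Everything here is illustrative. `embed_chunk` is a stand-in for the real CLIP call, and the `chunk_00000.npy` naming scheme is an assumption rather than something this notebook defines.

```python
import tempfile
from pathlib import Path

import numpy as np


def embed_chunk(filenames, dim=4):
    """Placeholder for the real encoder: returns one zero vector per file."""
    return np.zeros((len(filenames), dim), dtype=np.float32)


def compute_with_checkpoints(filenames, checkpoint_dir, chunk_size=2):
    """Embed `filenames` chunk by chunk, reusing chunks already saved on disk."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    parts, computed = [], 0
    for start in range(0, len(filenames), chunk_size):
        chunk_path = checkpoint_dir / f"chunk_{start:05d}.npy"
        if chunk_path.exists():
            # Finished in an earlier run: load instead of recomputing.
            part = np.load(chunk_path)
        else:
            part = embed_chunk(filenames[start:start + chunk_size])
            np.save(chunk_path, part)
            computed += 1
        parts.append(part)
    return np.concatenate(parts, axis=0), computed


files = [f"img_{i}.jpg" for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    first, computed_first = compute_with_checkpoints(files, tmp_dir)
    second, computed_second = compute_with_checkpoints(files, tmp_dir)

print(first.shape, computed_first)    # (5, 4) 3
print(second.shape, computed_second)  # (5, 4) 0
```

The second call recomputes nothing because all three chunk files already exist. This scheme assumes the file list and chunk size stay the same between runs; if either changes, the checkpoint folder should be cleared first.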