# Vogue Archive Data Processing

This notebook loads Vogue Runway metadata and pre-computed CLIP image embeddings, then uploads them to a Pinecone index for semantic search.

**Run this in Google Colab for free GPU access**

Runtime: ~20 minutes for 10k records

## 1. Install Dependencies

In [16]:
!pip install sentence-transformers pinecone pandas tqdm pyarrow torch transformers ftfy regex



## 2. Setup Pinecone

In [17]:
import os
from pinecone import Pinecone, ServerlessSpec

# Read the Pinecone API key from the environment; never hardcode a real key in a shared notebook
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
INDEX_NAME = "vogue-archive-clip"  # New index name for CLIP embeddings

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create index if it doesn't exist
# CLIP uses 512 dimensions (vs 384 for MiniLM)
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=512,  # CLIP ViT-B/32 dimension
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

index = pc.Index(INDEX_NAME)
print(f"Index '{INDEX_NAME}' ready!")

Index 'vogue-archive-clip' ready!
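
To keep the API key out of the notebook itself, the key can be looked up at runtime. The following is a minimal sketch, not part of the original notebook: `load_api_key` is a hypothetical helper, and the fallback assumes the key was saved under the same name in Colab's Secrets panel.

```python
import os


def load_api_key(name: str = "PINECONE_API_KEY") -> str:
    """Return the API key from the environment, falling back to Colab secrets."""
    key = os.environ.get(name)
    if key:
        return key
    try:
        # Only importable inside Google Colab (assumption: the secret is stored there).
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(f"Set the {name} environment variable") from None
    return userdata.get(name)
```

The result can be passed straight to the client, for example `Pinecone(api_key=load_api_key())`.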


## 3. Download Pre-computed Image Embeddings

The dataset ships with pre-computed CLIP image embeddings, so this notebook downloads them instead of running the image encoder.

In [19]:
import numpy as np
import requests
from tqdm import tqdm

# Define the base URL for the archive
ARCHIVE_BASE_URL = "https://archive.org/download/VogueRunway_dataset"

print("Downloading pre-computed CLIP image embeddings (1.2GB - this will take a few minutes)...")
embeddings_url = f"{ARCHIVE_BASE_URL}/img_emb/VogueRunway_image.npy"

# Download with progress bar
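response = requests.get(embeddings_url, stream=True, timeout=60)
response.raise_for_status()  # Stop here on an HTTP error instead of saving an error page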
total_size = int(response.headers.get('content-length', 0))

with open('VogueRunway_image.npy', 'wb') as f, tqdm(
    desc="VogueRunway_image.npy",
    total=total_size,
    unit='B',
    unit_scale=True,
    unit_divisor=1024,
) as pbar:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)
            pbar.update(len(chunk))

print("\nLoading embeddings into memory...")
# Load the embeddings
image_embeddings = np.load('VogueRunway_image.npy')
print(f"✓ Loaded {len(image_embeddings):,} pre-computed CLIP image embeddings")
print(f"Embedding dimension: {image_embeddings.shape[1]}")

Downloading pre-computed CLIP image embeddings (1.2GB - this will take a few minutes)...


VogueRunway_image.npy: 100%|██████████| 1.22G/1.22G [02:12<00:00, 9.89MB/s]



Loading embeddings into memory...
✓ Loaded 1,281,633 pre-computed CLIP image embeddings
Embedding dimension: 512
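
The download size can be cross-checked against the array shape. The sketch below is an illustration only: `npy_payload_bytes` is a hypothetical helper, and the itemsizes are assumptions, since the notebook never prints `image_embeddings.dtype`.

```python
def npy_payload_bytes(rows: int, dim: int, itemsize: int) -> int:
    """Bytes of raw array data in a 2-D .npy file (the small header is excluded)."""
    return rows * dim * itemsize


ROWS, DIM = 1_281_633, 512  # shape reported in the output above
GIB = 1024 ** 3             # tqdm was configured with unit_divisor=1024

for dtype_name, itemsize in [("float16", 2), ("float32", 4)]:
    size = npy_payload_bytes(ROWS, DIM, itemsize)
    print(f"{dtype_name}: {size / GIB:.2f} GiB")
```

Two bytes per value gives about 1.22 GiB, which matches the progress bar, so the file is probably stored as float16. Checking `image_embeddings.dtype` would confirm this.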


## 4. Load Vogue Runway Metadata

Download the metadata Parquet file and select the items whose embeddings will be uploaded.

In [20]:
import pandas as pd

# Define the base URL for the archive
ARCHIVE_BASE_URL = "https://archive.org/download/VogueRunway_dataset"

print("Downloading Vogue Runway metadata...")
url = f"{ARCHIVE_BASE_URL}/VogueRunway.parquet"

# Download parquet file
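response = requests.get(url, stream=True, timeout=60)
response.raise_for_status()  # Stop here on an HTTP error instead of saving an error page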
with open('VogueRunway.parquet', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

print("Loading metadata...")
df = pd.read_parquet('VogueRunway.parquet')

print(f"Total items: {len(df):,}")
print(f"Total embeddings: {len(image_embeddings):,}")

# Take top items by aesthetic score (you can change 1000 to 10000 or 100000)
NUM_ITEMS = 1000

if 'aesthetic' in df.columns:
    df = df.nlargest(NUM_ITEMS, 'aesthetic')
    print(f"\nSelected top {NUM_ITEMS} items by aesthetic score")
else:
    df = df.head(NUM_ITEMS)
    print(f"\nTaking first {NUM_ITEMS} items")

# Reset index to get clean indices
df = df.reset_index(drop=True)

print(f"\nSample data:")
print(df[['key', 'designer', 'season', 'year', 'category', 'city']].head())

Downloading Vogue Runway metadata...
Loading metadata...
Total items: 1,281,633
Total embeddings: 1,281,633

Selected top 1000 items by aesthetic score

Sample data:
       key  designer  season  year  category   city
0  0417267    Brioni  Spring  2019  Menswear   None
1  0673432     Cinoh  Spring  2022      None  Tokyo
2  0940024  Belstaff    Fall  2015  Menswear   None
3  1088552  Belstaff    Fall  2015  Menswear   None
4  0395237     Prada    Fall  2022  Menswear   None
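
The sample keys are zero-padded strings such as `0417267`. The upload step converts each key to an integer and uses it as a row index into the embeddings array, so it can help to validate the keys first. This is a minimal sketch under that assumption; `check_keys` is a hypothetical helper, not part of the original notebook.

```python
def check_keys(keys, n_embeddings):
    """Split keys into valid row indices and a list of (key, reason) problems."""
    seen = set()
    valid, problems = [], []
    for key in keys:
        try:
            idx = int(key)  # "0417267" -> 417267
        except (TypeError, ValueError):
            problems.append((key, "not an integer"))
            continue
        if not 0 <= idx < n_embeddings:
            problems.append((key, "out of range"))
        elif idx in seen:
            problems.append((key, "duplicate"))
        else:
            seen.add(idx)
            valid.append(idx)
    return valid, problems
```

In the notebook this could be called as `check_keys(df['key'], len(image_embeddings))`. An empty problem list only shows that the keys are usable as indices; it does not prove that each index points at the matching image.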


## 5. Match Embeddings with Metadata and Upload to Pinecone

Look up each selected item's pre-computed image embedding by its key, attach the metadata, and upsert the vectors to Pinecone in batches.

In [24]:
from tqdm import tqdm

# Batch size for uploading
BATCH_SIZE = 100

def process_batch(batch_df, batch_start_idx):
    """Process a batch of records with pre-computed embeddings"""
    vectors = []

    for _, row in batch_df.iterrows():
        # Use the key to get the correct embedding
        key = int(row['key'])

        # Get embedding using the key as index
        # The embeddings array is indexed by key value
        if key < len(image_embeddings):
            embedding = image_embeddings[key].tolist()
        else:
            print(f"Warning: key {key} out of bounds, skipping")
            continue

        # Prepare metadata
        metadata = {
            "description": f"{row.get('designer', '')} {row.get('season', '')} {row.get('year', '')} {row.get('category', '')} {row.get('section', '')}".strip(),
            "designer": str(row.get('designer', '')),
            "season": str(row.get('season', '')),
            "year": int(row.get('year', 0)) if pd.notna(row.get('year')) else 0,
            "category": str(row.get('category', '')),
            "city": str(row.get('city', '')),
            "section": str(row.get('section', '')),
            "image_url": row.get('url', ''),
            "aesthetic_score": float(row.get('aesthetic', 0)) if pd.notna(row.get('aesthetic')) else 0,
        }

        vectors.append({
            "id": f"vogue_runway_{row['key']}",
            "values": embedding,
            "metadata": metadata
        })

    # Upload to Pinecone
    if vectors:
        index.upsert(vectors=vectors)
    return len(vectors)

print(f"\nProcessing {len(df)} items with pre-computed embeddings...")
print(f"Using embeddings indexed by key value")
print(f"Uploading in batches of {BATCH_SIZE}...\n")

total_uploaded = 0

for i in tqdm(range(0, len(df), BATCH_SIZE)):
    batch_df = df.iloc[i:i+BATCH_SIZE]
    count = process_batch(batch_df, i)
    total_uploaded += count

print(f"\n✓ Upload complete! {total_uploaded} vectors uploaded to Pinecone.")
print(f"\nIndex stats: {index.describe_index_stats()}")


Processing 1000 items with pre-computed embeddings...
Using embeddings indexed by key value
Uploading in batches of 100...



100%|██████████| 10/10 [00:04<00:00,  2.27it/s]


✓ Upload complete! 1000 vectors uploaded to Pinecone.

Index stats: {'dimension': 512,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000,
 'vector_type': 'dense'}
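
One side effect of wrapping fields in `str(...)` is that a missing value is stored as the literal string `'None'`, which is why the search results show lines like `City: None`. A possible alternative is to omit missing fields before upserting. The sketch below is an assumption-labeled illustration: `clean_metadata` is a hypothetical helper, and it only handles `None`, float NaN, and blank strings, not pandas-specific markers such as `pd.NA`.

```python
import math


def clean_metadata(raw: dict) -> dict:
    """Drop None, NaN, and blank-string values so they are not stored at all."""
    cleaned = {}
    for field, value in raw.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        if isinstance(value, str) and not value.strip():
            continue
        cleaned[field] = value
    return cleaned
```

With this approach a query result would need `match['metadata'].get('city', 'unknown')` rather than direct indexing, since absent fields are no longer present as keys.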





## 6. Test Multimodal Search

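Encode text queries with CLIP's text encoder and search them against the stored image embeddings. CLIP maps text and images into one shared embedding space, which is what makes text-to-image search possible.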

In [25]:
from sentence_transformers import SentenceTransformer

# Load CLIP text encoder for queries
model = SentenceTransformer('clip-ViT-B-32')

# Test queries
test_queries = [
    "elegant evening gown",
    "minimalist black dress",
    "tweed jacket",
    "vintage cocktail dress"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print(f"{'='*60}")

    # Encode text query with CLIP
    query_embedding = model.encode(query).tolist()

    # Search against image embeddings
    results = index.query(
        vector=query_embedding,
        top_k=3,
        include_metadata=True
    )

    for i, match in enumerate(results['matches'], 1):
        print(f"\n{i}. Score: {match['score']:.3f}")
        print(f"   Designer: {match['metadata']['designer']}")
        print(f"   {match['metadata']['season']} {match['metadata']['year']}")
        print(f"   Category: {match['metadata']['category']}")
        print(f"   City: {match['metadata']['city']}")
        if match['metadata'].get('image_url'):
            print(f"   Image: {match['metadata']['image_url'][:80]}...")


Query: 'elegant evening gown'

1. Score: 0.293
   Designer: Jenny Packham
   Fall 2022.0
   Category: Ready-to-Wear
   City: None
   Image: https://assets.vogue.com/photos/62223957921b9eb00286c356/00017-jenny-packham-fal...

2. Score: 0.288
   Designer: Marchesa Notte
   Resort 2020.0
   Category: None
   City: None
   Image: https://assets.vogue.com/photos/5d1a684464290300083ece20/00001-Marchesa-Notte-re...

3. Score: 0.288
   Designer: Luisa Beccaria
   Spring 2019.0
   Category: Couture
   City: None
   Image: https://assets.vogue.com/photos/5c49caf9153d8a2d1ae2ffdd/00002-Luisa-Beccaria-Co...

Query: 'minimalist black dress'

1. Score: 0.296
   Designer: Theory
   Resort 2019.0
   Category: None
   City: None
   Image: https://assets.vogue.com/photos/5b054cc79069fc6a729d51e0/00019-theory-vogue-reso...

2. Score: 0.286
   Designer: Schiaparelli
   Spring 2023.0
   Category: Couture
   City: None
   Image: https://assets.vogue.com/photos/63ce82596612476c0db22aa3/00083-schiaparelli-sp

## Done! Multimodal Search Ready

The Pinecone index now stores CLIP **image embeddings** for the selected runway looks.

When users search with text, CLIP matches:
- Text query → Image embeddings
- This finds visually similar runway looks based on semantic understanding
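
The index was created with `metric="cosine"`, so each score in the results is the cosine similarity between the query's text embedding and a stored image embedding. As a rough illustration of that metric (a plain-Python sketch, not how Pinecone computes it internally):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because only the angle matters, vector length has no effect on the score. Results are best compared by rank within a single query, as with the scores around 0.29 in the output above, rather than read as absolute match percentages.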

**Benefits:**
- ✓ Faster processing (no image embedding generation needed at ingest time)
- ✓ Better visual understanding (searches actual image features)
- ✓ True multimodal CLIP search (text-to-image matching)

Next steps:
1. Deploy the API (see ../api/)
2. Your React Native app is already configured!