# 🚀 AI Fashion Assistant v2.0 - Embedding Generation

**Phase 2, Notebook 2/3** - 🔥 **GPU INTENSIVE!**

---

## 🎯 Objectives

1. Load all 44,417 products
2. Generate **text embeddings** (mpnet + CLIP text)
3. Generate **image embeddings** (CLIP image)
4. Save embeddings to disk
5. Validate dimensions and quality

---

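## ⚠️ GPU REQUIREMENTS

- **GPU:** A100 (40GB) or V100 (16GB)
- **Time:** 2-3 hours planned; the logged A100 run spent about 4.5 hours (269.4 minutes) on the image step alone
- **Disk:** ~1 GB for embeddings (the six `.npy` files logged below total about 1,041 MB)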

---

## 📊 Expected Outputs

```
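embeddings/
├── text/
│   ├── mpnet_768d.npy          (44,417 x 768)
│   ├── clip_text_512d.npy      (44,417 x 768)
│   └── combined_1280d.npy      (44,417 x 1536)
└── image/
    └── clip_image_768d.npy     (44,417 x 768)
```

The `512d` and `1280d` file names do not match their contents: with `openai/clip-vit-large-patch14` configured as the secondary text model, the CLIP text vectors are 768-d and the combined array is 1536-d.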
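
A quick way to sanity-check the disk estimate is to compute each array's raw size from its shape. The sketch below uses the row count and dimensions reported in the run logs and assumes float32 storage (4 bytes per value, which is consistent with the logged file sizes); it ignores the small `.npy` header. `npy_size_mb` is a hypothetical helper, not part of the notebook.

```python
def npy_size_mb(rows: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate payload size of a (rows x dim) array in MiB."""
    return rows * dim * bytes_per_value / 1024**2

ROWS = 44_417  # product count reported when the data is loaded

# Dimensions as reported by the shape printouts in the run logs
sizes = {
    "mpnet (768-d)": npy_size_mb(ROWS, 768),
    "CLIP text (768-d)": npy_size_mb(ROWS, 768),
    "combined text (1536-d)": npy_size_mb(ROWS, 1536),
    "CLIP image (768-d)": npy_size_mb(ROWS, 768),
}
for name, mb in sizes.items():
    print(f"{name}: {mb:.1f} MB")
print(f"total: {sum(sizes.values()):.1f} MB")
```

Each 768-d array comes to roughly 130 MB, so the unnormalized files alone stay well under 1 GB.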

---

## 📋 Quality Gates

- ✓ All embeddings generated (no NaNs)
- ✓ Dimensions correct
- ✓ File sizes reasonable (about 130 MB per 44,417 x 768 float32 array, roughly 1 GB in total)
- ✓ Embeddings normalized
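
The gates listed above can be expressed as one small check per array. The following is a minimal sketch, assuming embeddings are 2-D NumPy arrays; `check_embeddings` and its tolerance are hypothetical and not taken from the notebook. It includes the unit-norm check, which applies only to the `*_normalized.npy` files.

```python
import numpy as np

def check_embeddings(emb: np.ndarray, expected_dim: int,
                     expect_unit_norm: bool = False) -> dict:
    """Return a pass/fail flag for each gate (hypothetical helper)."""
    results = {
        "no_nan": not np.isnan(emb).any(),
        "dim_ok": emb.ndim == 2 and emb.shape[1] == expected_dim,
    }
    if expect_unit_norm:
        # Every row should have an L2 norm of 1 within a small tolerance
        norms = np.linalg.norm(emb, axis=1)
        results["unit_norm"] = bool(np.allclose(norms, 1.0, atol=1e-4))
    return results

# Toy data standing in for real embeddings
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 8)).astype(np.float32)
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(check_embeddings(raw, expected_dim=8))
print(check_embeddings(unit, expected_dim=8, expect_unit_norm=True))
```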

---

In [13]:
# ============================================================
# 1) SETUP & GPU CHECK
# ============================================================

from google.colab import drive
drive.mount("/content/drive", force_remount=False)

# Check GPU
print("🔍 GPU CHECK")
print("=" * 80)
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
print("=" * 80)

import torch
if not torch.cuda.is_available():
    print("\n❌ WARNING: GPU not available!")
    print("   This notebook requires GPU. Please enable GPU runtime.")
    raise RuntimeError("GPU required!")

print(f"\n✅ GPU available: {torch.cuda.get_device_name(0)}")
print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
🔍 GPU CHECK
NVIDIA A100-SXM4-40GB, 40960 MiB, 36702 MiB

✅ GPU available: NVIDIA A100-SXM4-40GB
   Memory: 39.6 GB


In [14]:
# ============================================================
# 2) INSTALL/UPGRADE PACKAGES
# ============================================================

print("📦 Installing packages...\n")

!pip install -q --upgrade sentence-transformers
!pip install -q --upgrade transformers
# Note: torch was already imported in the GPU check above; if this upgrade
# installs a different version, restart the runtime before continuing.
!pip install -q --upgrade torch torchvision
!pip install -q pillow tqdm

print("\n✅ Packages installed!")

📦 Installing packages...


✅ Packages installed!


In [15]:
# ============================================================
# 3) IMPORTS
# ============================================================

import torch
import numpy as np
import pandas as pd
from pathlib import Path
import json
import time
from typing import List, Dict
from tqdm.auto import tqdm
import os

# Sentence transformers
from sentence_transformers import SentenceTransformer

# Transformers (for CLIP)
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

print("✅ All imports successful!")

✅ All imports successful!


In [16]:
# ============================================================
# 4) PATHS & CONFIG
# ============================================================

PROJECT_ROOT = Path("/content/drive/MyDrive/ai_fashion_assistant_v2")
PROCESSED_DIR = PROJECT_ROOT / "data/processed"
EMB_DIR = PROJECT_ROOT / "embeddings"
EMB_TEXT_DIR = EMB_DIR / "text"
EMB_IMAGE_DIR = EMB_DIR / "image"

# Create directories
EMB_TEXT_DIR.mkdir(parents=True, exist_ok=True)
EMB_IMAGE_DIR.mkdir(parents=True, exist_ok=True)

# Load config
config_path = PROJECT_ROOT / "embeddings/configs/model_config.json"
with open(config_path, 'r') as f:
    MODEL_CONFIG = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_CONFIG['device'] = device

print("📋 Configuration:")
print("=" * 80)
for key, value in MODEL_CONFIG.items():
    print(f"  {key}: {value}")
print("=" * 80)

# Batch sizes (adjust based on GPU memory)
TEXT_BATCH_SIZE = 256
IMAGE_BATCH_SIZE = 64

print(f"\n⚙️ Batch sizes:")
print(f"  Text: {TEXT_BATCH_SIZE}")
print(f"  Image: {IMAGE_BATCH_SIZE}")

📋 Configuration:
  text_model_primary: paraphrase-multilingual-mpnet-base-v2
  text_model_primary_dim: 768
  text_model_secondary: openai/clip-vit-large-patch14
  text_model_secondary_dim: 768
  image_model: openai/clip-vit-large-patch14
  image_model_dim: 768
  text_combined_dim: 1536
  hybrid_dim: 2304
  device: cuda

⚙️ Batch sizes:
  Text: 256
  Image: 64
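
The dimension fields in this configuration are related: the combined text dimension is the sum of the two text model dimensions, and the printed `hybrid_dim` equals the combined text dimension plus the image dimension. The sketch below checks both relations; the second one is an assumption inferred from the numbers (1536 + 768 = 2304), and `check_config_dims` is a hypothetical helper.

```python
def check_config_dims(cfg: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    text_sum = cfg["text_model_primary_dim"] + cfg["text_model_secondary_dim"]
    if cfg["text_combined_dim"] != text_sum:
        problems.append(f"text_combined_dim should be {text_sum}")
    # Assumption: the hybrid space concatenates combined text and image vectors
    hybrid_sum = cfg["text_combined_dim"] + cfg["image_model_dim"]
    if cfg["hybrid_dim"] != hybrid_sum:
        problems.append(f"hybrid_dim should be {hybrid_sum}")
    return problems

# Values copied from the configuration printed above
config = {
    "text_model_primary_dim": 768,
    "text_model_secondary_dim": 768,
    "image_model_dim": 768,
    "text_combined_dim": 1536,
    "hybrid_dim": 2304,
}
print(check_config_dims(config))  # empty list: the printed values are consistent
```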


In [17]:
# ============================================================
# 5) LOAD DATA
# ============================================================

print("📂 Loading product data...\n")

# Load SSOT data
df = pd.read_csv(PROCESSED_DIR / "meta_ssot.csv")

print(f"✅ Loaded {len(df):,} products")
print(f"\nColumns: {list(df.columns)}")

# Check required fields
required_fields = ['id', 'desc', 'image_path']
missing_fields = [f for f in required_fields if f not in df.columns]

if missing_fields:
    raise ValueError(f"Missing required fields: {missing_fields}")

print(f"\n✅ All required fields present")

# Show sample
print("\nSample products:")
display(df[['id', 'productDisplayName', 'desc']].head(3))

📂 Loading product data...

✅ Loaded 44,417 products

Columns: ['id', 'productDisplayName', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'gender', 'season', 'year', 'usage', 'desc', 'image_path', 'text_embedding', 'image_embedding', 'hybrid_embedding']

✅ All required fields present

Sample products:

| | id | productDisplayName | desc |
|---|---|---|---|
| 0 | 15970 | Turtle Check Men Navy Blue Shirt | Turtle Check Men Navy Blue Shirt Apparel Topwe... |
| 1 | 39386 | Peter England Men Party Blue Jeans | Peter England Men Party Blue Jeans Apparel Bot... |
| 2 | 59263 | Titan Women Silver Watch | Titan Women Silver Watch Accessories Watches W... |


In [18]:
# ============================================================
# 6) LOAD MODELS
# ============================================================

print("🤖 LOADING MODELS...\n")
print("=" * 80)

# Primary text model (mpnet)
print("\n1️⃣ Loading mpnet...")
start_time = time.time()
text_model_primary = SentenceTransformer(MODEL_CONFIG["text_model_primary"])
text_model_primary = text_model_primary.to(device)
print(f"   ✅ Loaded in {time.time() - start_time:.1f}s")

# CLIP model
print("\n2️⃣ Loading CLIP...")
start_time = time.time()
clip_model = CLIPModel.from_pretrained(MODEL_CONFIG["image_model"])
clip_processor = CLIPProcessor.from_pretrained(MODEL_CONFIG["image_model"])
clip_model = clip_model.to(device)
print(f"   ✅ Loaded in {time.time() - start_time:.1f}s")

print("\n" + "=" * 80)
print("✅ All models loaded!")
print("=" * 80)

🤖 LOADING MODELS...


1️⃣ Loading mpnet...
   ✅ Loaded in 2.5s

2️⃣ Loading CLIP...
   ✅ Loaded in 2.4s

✅ All models loaded!


In [19]:
# ============================================================
# 7) GENERATE TEXT EMBEDDINGS (MPNET)
# ============================================================

print("📝 GENERATING TEXT EMBEDDINGS (MPNET)...\n")
print("=" * 80)

# Prepare texts
texts = df['desc'].fillna('').tolist()
print(f"Total texts: {len(texts):,}")
print(f"Batch size: {TEXT_BATCH_SIZE}")
print(f"Estimated time: {len(texts) / TEXT_BATCH_SIZE * 2 / 60:.1f} minutes\n")

# Generate embeddings in batches
mpnet_embeddings = []

start_time = time.time()

for i in tqdm(range(0, len(texts), TEXT_BATCH_SIZE), desc="mpnet batches"):
    batch = texts[i:i+TEXT_BATCH_SIZE]

    # Encode
    batch_embs = text_model_primary.encode(
        batch,
        convert_to_numpy=True,
        show_progress_bar=False,
        batch_size=TEXT_BATCH_SIZE
    )

    mpnet_embeddings.append(batch_embs)

# Concatenate
mpnet_embeddings = np.vstack(mpnet_embeddings)

elapsed = time.time() - start_time

print(f"\n✅ Generated mpnet embeddings")
print(f"   Shape: {mpnet_embeddings.shape}")
print(f"   Expected: ({len(texts)}, {MODEL_CONFIG['text_model_primary_dim']})")
print(f"   Time: {elapsed / 60:.1f} minutes")
print(f"   Speed: {elapsed / len(texts) * 1000:.1f}ms per text")

# Validate
assert mpnet_embeddings.shape == (len(texts), MODEL_CONFIG['text_model_primary_dim'])
assert not np.isnan(mpnet_embeddings).any(), "NaN values detected!"

# Save
output_path = EMB_TEXT_DIR / "mpnet_768d.npy"
np.save(output_path, mpnet_embeddings)
print(f"\n💾 Saved: {output_path}")
print(f"   Size: {output_path.stat().st_size / 1024**2:.1f} MB")

📝 GENERATING TEXT EMBEDDINGS (MPNET)...

Total texts: 44,417
Batch size: 256
Estimated time: 5.8 minutes



mpnet batches:   0%|          | 0/174 [00:00<?, ?it/s]


✅ Generated mpnet embeddings
   Shape: (44417, 768)
   Expected: (44417, 768)
   Time: 0.3 minutes
   Speed: 0.5ms per text

💾 Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/text/mpnet_768d.npy
   Size: 130.1 MB
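
The progress bar total and the printed estimate both follow from simple arithmetic on the item count and batch size. The sketch below reproduces them; the per-batch seconds are the constants hard-coded in the notebook's estimate formulas, and the measured 0.3 minutes shows the text estimate to be very conservative on an A100. The function names are hypothetical.

```python
import math

def num_batches(n_items: int, batch_size: int) -> int:
    """Number of iterations of range(0, n_items, batch_size)."""
    return math.ceil(n_items / batch_size)

def estimated_minutes(n_items: int, batch_size: int, seconds_per_batch: float) -> float:
    """Same formula as the notebook's estimate: batches x seconds / 60."""
    return n_items / batch_size * seconds_per_batch / 60

print(num_batches(44_417, 256))                             # 174
print(f"{estimated_minutes(44_417, 256, 2):.1f} minutes")   # 5.8 minutes
```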


In [20]:
# ============================================================
# 8) GENERATE TEXT EMBEDDINGS (CLIP TEXT)
# ============================================================

print("📝 GENERATING TEXT EMBEDDINGS (CLIP TEXT)...\n")
print("=" * 80)

print(f"Total texts: {len(texts):,}")
print(f"Batch size: {TEXT_BATCH_SIZE}")
print(f"Estimated time: {len(texts) / TEXT_BATCH_SIZE * 3 / 60:.1f} minutes\n")

# Generate embeddings in batches
clip_text_embeddings = []

start_time = time.time()

for i in tqdm(range(0, len(texts), TEXT_BATCH_SIZE), desc="CLIP text batches"):
    batch = texts[i:i+TEXT_BATCH_SIZE]

    # Process
    inputs = clip_processor(text=batch, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Encode
    with torch.no_grad():
        batch_embs = clip_model.get_text_features(**inputs)
        batch_embs = batch_embs.cpu().numpy()

    clip_text_embeddings.append(batch_embs)

    # Clear GPU cache every 10 batches
    if i % (TEXT_BATCH_SIZE * 10) == 0:
        torch.cuda.empty_cache()

# Concatenate
clip_text_embeddings = np.vstack(clip_text_embeddings)

elapsed = time.time() - start_time

print(f"\n✅ Generated CLIP text embeddings")
print(f"   Shape: {clip_text_embeddings.shape}")
print(f"   Expected: ({len(texts)}, {MODEL_CONFIG['text_model_secondary_dim']})")
print(f"   Time: {elapsed / 60:.1f} minutes")
print(f"   Speed: {elapsed / len(texts) * 1000:.1f}ms per text")

# Validate
assert clip_text_embeddings.shape == (len(texts), MODEL_CONFIG['text_model_secondary_dim'])
assert not np.isnan(clip_text_embeddings).any(), "NaN values detected!"

# Save
# Note: the file name says 512d, but the array written here is 768-d
# (text_model_secondary_dim for openai/clip-vit-large-patch14).
output_path = EMB_TEXT_DIR / "clip_text_512d.npy"
np.save(output_path, clip_text_embeddings)
print(f"\n💾 Saved: {output_path}")
print(f"   Size: {output_path.stat().st_size / 1024**2:.1f} MB")

📝 GENERATING TEXT EMBEDDINGS (CLIP TEXT)...

Total texts: 44,417
Batch size: 256
Estimated time: 8.7 minutes



CLIP text batches:   0%|          | 0/174 [00:00<?, ?it/s]


✅ Generated CLIP text embeddings
   Shape: (44417, 768)
   Expected: (44417, 768)
   Time: 0.3 minutes
   Speed: 0.4ms per text

💾 Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/text/clip_text_512d.npy
   Size: 130.1 MB


In [21]:
# ============================================================
# 9) COMBINE TEXT EMBEDDINGS
# ============================================================

print("🔗 COMBINING TEXT EMBEDDINGS...\n")
print("=" * 80)

# Concatenate
combined_text_embeddings = np.concatenate([mpnet_embeddings, clip_text_embeddings], axis=1)

print(f"✅ Combined text embeddings")
print(f"   mpnet: {mpnet_embeddings.shape}")
print(f"   CLIP text: {clip_text_embeddings.shape}")
print(f"   Combined: {combined_text_embeddings.shape}")
print(f"   Expected: ({len(texts)}, {MODEL_CONFIG['text_combined_dim']})")

# Validate
assert combined_text_embeddings.shape == (len(texts), MODEL_CONFIG['text_combined_dim'])
assert not np.isnan(combined_text_embeddings).any()

# Save
# Note: the file name says 1280d, but the combined array is 1536-d (768 + 768).
output_path = EMB_TEXT_DIR / "combined_1280d.npy"
np.save(output_path, combined_text_embeddings)
print(f"\n💾 Saved: {output_path}")
print(f"   Size: {output_path.stat().st_size / 1024**2:.1f} MB")

# Free memory
del mpnet_embeddings, clip_text_embeddings
import gc
gc.collect()
torch.cuda.empty_cache()

print("\n🗑️ Cleared intermediate embeddings from memory")

🔗 COMBINING TEXT EMBEDDINGS...

✅ Combined text embeddings
   mpnet: (44417, 768)
   CLIP text: (44417, 768)
   Combined: (44417, 1536)
   Expected: (44417, 1536)

💾 Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/text/combined_1280d.npy
   Size: 260.3 MB

🗑️ Cleared intermediate embeddings from memory
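
Because the combined array is a plain column-wise concatenation, the two intermediate arrays deleted above remain recoverable as slices of it. A minimal sketch on toy data, assuming the first 768 columns are the mpnet block as in the `np.concatenate` call; `split_combined` is a hypothetical helper.

```python
import numpy as np

MPNET_DIM = 768  # text_model_primary_dim from the config

def split_combined(combined: np.ndarray, primary_dim: int = MPNET_DIM):
    """Return (primary, secondary) views into a column-wise concatenation."""
    return combined[:, :primary_dim], combined[:, primary_dim:]

# Toy arrays standing in for the real embeddings
rng = np.random.default_rng(0)
mpnet = rng.normal(size=(3, 768)).astype(np.float32)
clip_text = rng.normal(size=(3, 768)).astype(np.float32)
combined = np.concatenate([mpnet, clip_text], axis=1)

mpnet_view, clip_view = split_combined(combined)
assert np.array_equal(mpnet_view, mpnet)
assert np.array_equal(clip_view, clip_text)
print(combined.shape)  # (3, 1536)
```

The slices are views, so no extra memory is allocated until they are copied.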


In [22]:
# ============================================================
# 10) GENERATE IMAGE EMBEDDINGS
# ============================================================

print("🖼️ GENERATING IMAGE EMBEDDINGS...\n")
print("=" * 80)

# Find images directory
OLD_PROJECT = Path("/content/drive/MyDrive/ai_fashion_assistant_v1")
possible_image_dirs = [
    OLD_PROJECT / "data/raw/images",
    PROJECT_ROOT / "data/raw/images",
]

IMAGES_DIR = None
for img_dir in possible_image_dirs:
    if img_dir.exists():
        try:
            # Test if readable
            test_files = [f for f in os.listdir(img_dir) if f.endswith('.jpg')][:5]
            if test_files:
                IMAGES_DIR = img_dir
                print(f"✅ Found images: {IMAGES_DIR}")
                break
        except OSError:
            continue

if IMAGES_DIR is None:
    raise FileNotFoundError("Images directory not found or not readable!")

print(f"Total products: {len(df):,}")
print(f"Batch size: {IMAGE_BATCH_SIZE}")
print(f"Estimated time: {len(df) / IMAGE_BATCH_SIZE * 3 / 60:.1f} minutes\n")

# Generate embeddings
image_embeddings = []
failed_images = []

start_time = time.time()

for i in tqdm(range(0, len(df), IMAGE_BATCH_SIZE), desc="Image batches"):
    batch_df = df.iloc[i:i+IMAGE_BATCH_SIZE]
    batch_images = []

    # Load images
    for _, row in batch_df.iterrows():
        img_path = IMAGES_DIR / f"{row['id']}.jpg"

        try:
            # The context manager closes the file once the pixels are copied
            with Image.open(img_path) as img:
                batch_images.append(img.convert("RGB"))
        except Exception as e:
            failed_images.append((row['id'], str(e)))
            # Black placeholder keeps the embedding rows aligned with df;
            # these rows are listed in failed_images for later filtering
            batch_images.append(Image.new('RGB', (224, 224), (0, 0, 0)))

    # Process batch
    if batch_images:
        inputs = clip_processor(images=batch_images, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Encode
        with torch.no_grad():
            batch_embs = clip_model.get_image_features(**inputs)
            batch_embs = batch_embs.cpu().numpy()

        image_embeddings.append(batch_embs)

    # Clear GPU cache every 10 batches
    if i % (IMAGE_BATCH_SIZE * 10) == 0:
        torch.cuda.empty_cache()

# Concatenate
image_embeddings = np.vstack(image_embeddings)

elapsed = time.time() - start_time

print(f"\n✅ Generated image embeddings")
print(f"   Shape: {image_embeddings.shape}")
print(f"   Expected: ({len(df)}, {MODEL_CONFIG['image_model_dim']})")
print(f"   Time: {elapsed / 60:.1f} minutes")
print(f"   Speed: {elapsed / len(df) * 1000:.1f}ms per image")
print(f"   Failed: {len(failed_images)} images")

# Validate
assert image_embeddings.shape == (len(df), MODEL_CONFIG['image_model_dim'])
assert not np.isnan(image_embeddings).any(), "NaN values detected!"

# Save
output_path = EMB_IMAGE_DIR / "clip_image_768d.npy"
np.save(output_path, image_embeddings)
print(f"\n💾 Saved: {output_path}")
print(f"   Size: {output_path.stat().st_size / 1024**2:.1f} MB")

# Save failed images log
if failed_images:
    failed_log_path = EMB_IMAGE_DIR / "failed_images.json"
    with open(failed_log_path, 'w') as f:
        json.dump(failed_images, f, indent=2)
    print(f"\n📝 Failed images log: {failed_log_path}")

🖼️ GENERATING IMAGE EMBEDDINGS...

✅ Found images: /content/drive/MyDrive/ai_fashion_assistant_v2/data/raw/images
Total products: 44,417
Batch size: 64
Estimated time: 34.7 minutes



Image batches:   0%|          | 0/695 [00:00<?, ?it/s]


✅ Generated image embeddings
   Shape: (44417, 768)
   Expected: (44417, 768)
   Time: 269.4 minutes
   Speed: 363.9ms per image
   Failed: 254 images

💾 Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/image/clip_image_768d.npy
   Size: 130.1 MB

📝 Failed images log: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/image/failed_images.json
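
The 254 failed images still occupy rows in the saved array, each holding the embedding of a black placeholder. One way to keep them out of later similarity searches is a boolean mask built from the failure log. This is a sketch under the assumption that the log holds `[id, error message]` pairs, which is how `json.dump` writes the tuples collected above; `valid_row_mask` is a hypothetical helper.

```python
import numpy as np

def valid_row_mask(product_ids, failed_entries) -> np.ndarray:
    """True where the embedding came from a real image, False for placeholders."""
    failed_ids = {entry[0] for entry in failed_entries}
    return np.array([pid not in failed_ids for pid in product_ids], dtype=bool)

# Toy example; in the notebook these would be df['id'] and the loaded JSON log
ids = [15970, 39386, 59263, 21379]
failed = [[39386, "No such file or directory"]]
mask = valid_row_mask(ids, failed)

print(mask.tolist())    # [True, False, True, True]
print(int(mask.sum()))  # 3
```

Indexing with `image_embeddings[mask]` would then drop the placeholder rows while `df[mask]` keeps the metadata aligned.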


In [23]:
# ============================================================
# 11) NORMALIZE EMBEDDINGS
# ============================================================

print("📐 NORMALIZING EMBEDDINGS...\n")
print("=" * 80)

from sklearn.preprocessing import normalize

# Normalize text embeddings
print("Normalizing text embeddings...")
combined_text_normalized = normalize(combined_text_embeddings, norm='l2')
output_path = EMB_TEXT_DIR / "combined_1280d_normalized.npy"
np.save(output_path, combined_text_normalized)
print(f"✅ Saved: {output_path} ({output_path.stat().st_size / 1024**2:.1f} MB)")

# Normalize image embeddings
print("\nNormalizing image embeddings...")
image_normalized = normalize(image_embeddings, norm='l2')
output_path = EMB_IMAGE_DIR / "clip_image_768d_normalized.npy"
np.save(output_path, image_normalized)
print(f"✅ Saved: {output_path} ({output_path.stat().st_size / 1024**2:.1f} MB)")

print("\n✅ All embeddings normalized!")

📐 NORMALIZING EMBEDDINGS...

Normalizing text embeddings...
✅ Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/text/combined_1280d_normalized.npy (260.3 MB)

Normalizing image embeddings...
✅ Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/image/clip_image_768d_normalized.npy (130.1 MB)

✅ All embeddings normalized!
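
The practical benefit of L2 normalization is that cosine similarity reduces to a dot product, so nearest neighbours can be found with a single matrix-vector multiply. A minimal sketch on random toy vectors; `top_k_similar` is a hypothetical helper and not part of the notebook.

```python
import numpy as np

def top_k_similar(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k rows of `index` most similar to `query`.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = index @ query
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

nearest = top_k_similar(emb[7], emb, k=3)
print(nearest[0])  # 7: a vector is always most similar to itself
```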


In [24]:
# ============================================================
# 12) GENERATE EMBEDDING STATISTICS
# ============================================================

print("📊 GENERATING EMBEDDING STATISTICS...\n")
print("=" * 80)

# Statistics
stats = {
    "total_products": len(df),
    "text_embeddings": {
        "mpnet": {
            "shape": list(mpnet_embeddings.shape) if 'mpnet_embeddings' in locals() else "freed",
            "dimension": MODEL_CONFIG['text_model_primary_dim']
        },
        "clip_text": {
            "shape": list(clip_text_embeddings.shape) if 'clip_text_embeddings' in locals() else "freed",
            "dimension": MODEL_CONFIG['text_model_secondary_dim']
        },
        "combined": {
            "shape": list(combined_text_embeddings.shape),
            "dimension": MODEL_CONFIG['text_combined_dim'],
            "mean_norm": float(np.linalg.norm(combined_text_embeddings, axis=1).mean()),
            "std_norm": float(np.linalg.norm(combined_text_embeddings, axis=1).std())
        }
    },
    "image_embeddings": {
        "shape": list(image_embeddings.shape),
        "dimension": MODEL_CONFIG['image_model_dim'],
        "mean_norm": float(np.linalg.norm(image_embeddings, axis=1).mean()),
        "std_norm": float(np.linalg.norm(image_embeddings, axis=1).std()),
        "failed_count": len(failed_images)
    },
    "files": {
        "text": [
            "mpnet_768d.npy",
            "clip_text_512d.npy",
            "combined_1280d.npy",
            "combined_1280d_normalized.npy"
        ],
        "image": [
            "clip_image_768d.npy",
            "clip_image_768d_normalized.npy"
        ]
    }
}

# Save stats
stats_path = EMB_DIR / "embedding_stats.json"
with open(stats_path, 'w') as f:
    json.dump(stats, f, indent=2)

print(f"✅ Stats saved: {stats_path}")

# Print summary
print("\n📊 SUMMARY:")
print("=" * 80)
print(f"Total products: {stats['total_products']:,}")
print(f"\nText embeddings:")
print(f"  Combined shape: {stats['text_embeddings']['combined']['shape']}")
print(f"  Mean norm: {stats['text_embeddings']['combined']['mean_norm']:.4f}")
print(f"\nImage embeddings:")
print(f"  Shape: {stats['image_embeddings']['shape']}")
print(f"  Mean norm: {stats['image_embeddings']['mean_norm']:.4f}")
print(f"  Failed: {stats['image_embeddings']['failed_count']}")
print("=" * 80)

📊 GENERATING EMBEDDING STATISTICS...

✅ Stats saved: /content/drive/MyDrive/ai_fashion_assistant_v2/embeddings/embedding_stats.json

📊 SUMMARY:
Total products: 44,417

Text embeddings:
  Combined shape: [44417, 1536]
  Mean norm: 12.8182

Image embeddings:
  Shape: [44417, 768]
  Mean norm: 18.2487
  Failed: 254


In [25]:
# ============================================================
# 13) QUALITY GATES VALIDATION
# ============================================================

print("\n🎯 QUALITY GATES VALIDATION")
print("=" * 80)

gates_passed = True

# Gate 1: All embeddings generated
if combined_text_embeddings.shape[0] == len(df) and image_embeddings.shape[0] == len(df):
    print(f"✅ Gate 1: All embeddings generated ({len(df):,} products)")
else:
    print(f"❌ Gate 1: Embedding count mismatch!")
    gates_passed = False

# Gate 2: No NaN values
text_has_nan = np.isnan(combined_text_embeddings).any()
image_has_nan = np.isnan(image_embeddings).any()

if not text_has_nan and not image_has_nan:
    print("✅ Gate 2: No NaN values detected")
else:
    print(f"❌ Gate 2: NaN values found! (text: {text_has_nan}, image: {image_has_nan})")
    gates_passed = False

# Gate 3: Dimensions correct
text_dim_ok = combined_text_embeddings.shape[1] == MODEL_CONFIG['text_combined_dim']
image_dim_ok = image_embeddings.shape[1] == MODEL_CONFIG['image_model_dim']

if text_dim_ok and image_dim_ok:
    print(f"✅ Gate 3: Dimensions correct (text: {MODEL_CONFIG['text_combined_dim']}d, image: {MODEL_CONFIG['image_model_dim']}d)")
else:
    print(f"❌ Gate 3: Dimension mismatch!")
    gates_passed = False

# Gate 4: File sizes reasonable
# Note: this gate only warns; it never sets gates_passed to False.
# It measures just two of the saved files, and float32 arrays of
# 44,417 x 1536 and 44,417 x 768 total about 0.38 GB, below the 1-5 GB window.
text_file = EMB_TEXT_DIR / "combined_1280d.npy"
image_file = EMB_IMAGE_DIR / "clip_image_768d.npy"

text_size_mb = text_file.stat().st_size / 1024**2
image_size_mb = image_file.stat().st_size / 1024**2
total_size_gb = (text_size_mb + image_size_mb) / 1024

if 1 < total_size_gb < 5:  # Reasonable range
    print(f"✅ Gate 4: File sizes reasonable ({total_size_gb:.2f} GB total)")
else:
    print(f"⚠️ Gate 4: File size unusual ({total_size_gb:.2f} GB)")

print("=" * 80)

if gates_passed:
    print("\n🎉 ALL QUALITY GATES PASSED!")
    print("✅ Embeddings ready for hybrid space creation!")
    print("\n📍 Next: 03_hybrid_space_creation.ipynb")
else:
    print("\n⚠️ SOME QUALITY GATES FAILED!")
    print("   Please review and fix before proceeding.")


🎯 QUALITY GATES VALIDATION
✅ Gate 1: All embeddings generated (44,417 products)
✅ Gate 2: No NaN values detected
✅ Gate 3: Dimensions correct (text: 1536d, image: 768d)
⚠️ Gate 4: File size unusual (0.38 GB)

🎉 ALL QUALITY GATES PASSED!
✅ Embeddings ready for hybrid space creation!

📍 Next: 03_hybrid_space_creation.ipynb
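
Gate 4 warns in this run because its fixed 1-5 GB window does not fit float32 arrays of this shape. One alternative, sketched below on a toy file, is to derive the expected size from the array shape and allow a small margin for the `.npy` header. `size_gate` and its 4096-byte slack are hypothetical choices, not the notebook's method.

```python
import tempfile
from pathlib import Path

import numpy as np

def size_gate(path: Path, shape: tuple, dtype=np.float32, header_slack: int = 4096) -> bool:
    """Pass if the file size matches rows x dim x itemsize, allowing for the header."""
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    actual = path.stat().st_size
    return expected <= actual <= expected + header_slack

# Toy array saved to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.npy"
    arr = np.zeros((10, 768), dtype=np.float32)
    np.save(path, arr)
    ok = size_gate(path, arr.shape)      # shape matches the file
    wrong = size_gate(path, (10, 512))   # shape implied by a "512d" name

print(ok, wrong)  # True False
```

A check like this also catches a file whose name promises a different dimension than its contents.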


---

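## 📋 Summary

**Embeddings Generated:**
- ✅ Text (mpnet): 44,417 x 768d
- ✅ Text (CLIP): 44,417 x 768d
- ✅ Text (combined): 44,417 x 1536d
- ✅ Image (CLIP): 44,417 x 768d (254 images failed to load and were embedded as black placeholders; see `embeddings/image/failed_images.json`)
- ✅ L2-normalized versions of the combined text and image embeddings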

**Files Created:**
- `embeddings/text/mpnet_768d.npy`
- `embeddings/text/clip_text_512d.npy`
- `embeddings/text/combined_1280d.npy`
- `embeddings/text/combined_1280d_normalized.npy`
- `embeddings/image/clip_image_768d.npy`
- `embeddings/image/clip_image_768d_normalized.npy`
- `embeddings/embedding_stats.json`
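
Since two of the file names (`512d`, `1280d`) do not match the dimensions actually stored, a downstream notebook may want to verify the shape when loading. The following is a minimal sketch using NumPy's memory-mapped loading on a toy file; `load_embeddings` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_embeddings(path: Path, expected_dim: int) -> np.ndarray:
    """Memory-map an .npy file and check its second dimension."""
    emb = np.load(path, mmap_mode="r")  # read-only view; data stays on disk until accessed
    if emb.ndim != 2 or emb.shape[1] != expected_dim:
        raise ValueError(f"{path.name}: expected (*, {expected_dim}), got {emb.shape}")
    return emb

# Toy file standing in for one of the saved arrays
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_768d.npy"
    np.save(path, np.ones((5, 768), dtype=np.float32))
    emb = load_embeddings(path, expected_dim=768)
    shape, first = emb.shape, float(emb[0, 0])
    del emb  # release the memory map before the directory is removed

print(shape, first)  # (5, 768) 1.0
```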

**Total Size:** ~1 GB (the six `.npy` files logged above total about 1,041 MB)

**Next Notebook:** `03_hybrid_space_creation.ipynb`

---