# Comprehensive Evaluation Metrics

**Phase 9, Notebook 1/3** - Beyond basic recall and NDCG

---

## Motivation

Our current evaluation reports Recall@10 (48%) and NDCG@10 (86.6%). Both
measure ranking accuracy only, so they leave three questions unanswered:

- **Diversity**: Are we showing varied products or just similar items?
- **Novelty**: Are we recommending obvious choices or discovering new items?
- **Coverage**: What percentage of our catalog gets recommended?

This notebook implements metrics for ranking quality, diversity, and novelty,
and tests them on synthetic data. Catalog coverage is listed above for
completeness but is not computed in the cells below.
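
Of the three questions above, coverage has the simplest definition: the share of catalog items that appear in at least one result list. A minimal sketch, assuming results are stored as a dict mapping query ID to ranked product IDs (the name `catalog_coverage` and the example data are illustrative, not part of this notebook):

```python
from typing import Dict, Iterable, List, Set


def catalog_coverage(results: Dict[int, List[int]], catalog_ids: Iterable[int]) -> float:
    """Fraction of catalog items that appear in at least one result list."""
    catalog: Set[int] = set(catalog_ids)
    if not catalog:
        return 0.0

    recommended: Set[int] = set()
    for product_ids in results.values():
        recommended.update(product_ids)

    # Recommended IDs that are not in the catalog are ignored.
    return len(recommended & catalog) / len(catalog)


# Hypothetical results for three queries over a 10-item catalog (IDs 0..9).
example_results = {0: [1, 2, 3], 1: [3, 4], 2: [4, 99]}  # 99 is not in the catalog
coverage = catalog_coverage(example_results, range(10))
print(f"Catalog coverage: {coverage:.2f}")
```

Unlike the per-query metrics, coverage is computed once over all result lists, so it would sit next to the summary statistics and not in the per-query table.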

---

In [1]:
from google.colab import drive
drive.mount("/content/drive", force_remount=False)

print("Drive mounted")

Mounted at /content/drive
Drive mounted


In [2]:
import os
import sys
import json
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Dict, List, Tuple, Set
from collections import defaultdict, Counter
from scipy.spatial.distance import cosine
from itertools import combinations
import warnings
warnings.filterwarnings('ignore')

PROJECT_ROOT = Path("/content/drive/MyDrive/ai_fashion_assistant_v2")
sys.path.insert(0, str(PROJECT_ROOT))

print("Imports ready")

Imports ready


In [3]:
# ============================================================
# SETUP & CREATE SYNTHETIC DATA
# ============================================================

EVAL_DIR = PROJECT_ROOT / "evaluation/comprehensive"
EVAL_DIR.mkdir(parents=True, exist_ok=True)

print(f"Working directory: {EVAL_DIR}")
print("\nCreating synthetic data for testing...\n")
print("=" * 60)

# Create synthetic product catalog
np.random.seed(42)

num_products = 1000
categories = ['ayakkabı', 'elbise', 'pantolon', 'gömlek', 'ceket', 'çanta']
colors = ['beyaz', 'siyah', 'mavi', 'kırmızı', 'yeşil', 'gri']
brands = ['nike', 'adidas', 'zara', 'h&m', 'mango', 'koton']
genders = ['kadın', 'erkek', 'unisex']

products_df = pd.DataFrame({
    'product_id': range(num_products),
    'name': [f'Product {i}' for i in range(num_products)],
    'category': np.random.choice(categories, num_products),
    'color': np.random.choice(colors, num_products),
    'brand': np.random.choice(brands, num_products),
    'gender': np.random.choice(genders, num_products),
    'price': np.random.uniform(50, 1000, num_products)
})

products_df.set_index('product_id', inplace=True)

print(f"Created synthetic catalog: {len(products_df):,} products")
print(f"  Categories: {products_df['category'].nunique()}")
print(f"  Brands: {products_df['brand'].nunique()}")
print(f"  Colors: {products_df['color'].nunique()}")

# Create synthetic test queries
queries_df = pd.DataFrame({
    'query_id': range(10),
    'query_text': [
        'beyaz nike ayakkabı',
        'siyah elbise',
        'mavi pantolon',
        'kırmızı çanta',
        'spor ayakkabı',
        'kadın gömlek',
        'erkek ceket',
        'zara elbise',
        'beyaz ayakkabı',
        'siyah çanta'
    ]
})

print(f"\nCreated test queries: {len(queries_df)}")

# Create synthetic ground truth
gt_data = []
for qid in queries_df['query_id']:
    # Each query has 3-5 relevant products
    num_relevant = np.random.randint(3, 6)
    relevant_products = np.random.choice(products_df.index, num_relevant, replace=False)
    relevances = np.random.choice([1, 2, 3], num_relevant)  # 1=low, 2=med, 3=high

    for pid, rel in zip(relevant_products, relevances):
        gt_data.append({
            'query_id': qid,
            'product_id': pid,
            'relevance': rel
        })

gt_df = pd.DataFrame(gt_data)

print(f"Created ground truth: {len(gt_df)} relevance judgments")
print(f"  Avg relevant per query: {len(gt_df) / len(queries_df):.1f}")

print("\n" + "=" * 60)
print("Synthetic data ready")

Working directory: /content/drive/MyDrive/ai_fashion_assistant_v2/evaluation/comprehensive

Creating synthetic data for testing...

Created synthetic catalog: 1,000 products
  Categories: 6
  Brands: 6
  Colors: 6

Created test queries: 10
Created ground truth: 39 relevance judgments
  Avg relevant per query: 3.9

Synthetic data ready


In [4]:
# ============================================================
# DIVERSITY METRICS
# ============================================================

print("\nImplementing diversity metrics...\n")
print("=" * 60)

def intra_list_similarity(product_ids: List[int], embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity of the listed items. Lower = more diverse.

    Assumes each product ID is also the row position of that product in `embeddings`.
    """
    # Keep only IDs that map to a valid embedding row.
    valid_ids = [pid for pid in product_ids if 0 <= pid < len(embeddings)]
    if len(valid_ids) < 2:
        return 0.0

    similarities = [
        1 - cosine(embeddings[a], embeddings[b])  # scipy's cosine() is a distance
        for a, b in combinations(valid_ids, 2)
    ]
    return float(np.mean(similarities))


def category_diversity(product_ids: List[int], products_df: pd.DataFrame) -> float:
    """Ratio of unique categories to total items."""
    if not product_ids:
        return 0.0

    valid_ids = [pid for pid in product_ids if pid in products_df.index]
    if not valid_ids:
        return 0.0

    categories = products_df.loc[valid_ids, 'category'].unique()
    return len(categories) / len(valid_ids)


print("Diversity metrics implemented:")
print("  - ILS (intra-list similarity)")
print("  - Category diversity")
print("\n" + "=" * 60)


Implementing diversity metrics...

Diversity metrics implemented:
  - ILS (intra-list similarity)
  - Category diversity
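
A hand-computable case is a useful sanity check for ILS. The sketch below uses a vectorized NumPy formulation as an alternative to the pairwise loop above (it assumes no embedding row is all zeros) and applies it to three toy vectors. The names `ils_vectorized` and `toy` are illustrative.

```python
import numpy as np


def ils_vectorized(vectors: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of rows."""
    n = len(vectors)
    if n < 2:
        return 0.0
    # Normalize rows so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Strict upper triangle: each pair counted once, no self-similarity.
    upper = sims[np.triu_indices(n, k=1)]
    return float(upper.mean())


toy = np.array([
    [1.0, 0.0],  # item A
    [1.0, 0.0],  # item B, identical to A
    [0.0, 1.0],  # item C, orthogonal to both
])

# Pair similarities: (A, B) = 1, (A, C) = 0, (B, C) = 0, so the mean is 1/3.
toy_ils = ils_vectorized(toy)
print(f"ILS: {toy_ils:.3f}")
```

With random Gaussian embeddings, as used later in this notebook, pairwise similarities scatter around zero, so an ILS close to 0 is the expected result and not a sign of real diversity.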



In [5]:
# ============================================================
# NOVELTY METRICS
# ============================================================

print("\nImplementing novelty metrics...\n")
print("=" * 60)

def compute_item_popularity(products_df: pd.DataFrame) -> Dict[int, float]:
    """Heuristic popularity proxy based on color and brand only (no interaction data)."""
    popularity = {}

    for idx in products_df.index:
        score = 0.5  # base
        row = products_df.loc[idx]

        # Popular colors
        if row['color'] in ['beyaz', 'siyah', 'mavi']:
            score += 0.2

        # Popular brands
        if row['brand'] in ['nike', 'adidas', 'zara']:
            score += 0.2

        popularity[idx] = min(score, 1.0)

    return popularity


def novelty_at_k(product_ids: List[int], popularity: Dict[int, float]) -> float:
    """Average novelty (inverse popularity)."""
    if not product_ids:
        return 0.0

    novelties = [1 - popularity.get(pid, 0.5) for pid in product_ids]
    return np.mean(novelties)


item_popularity = compute_item_popularity(products_df)

print("Novelty metrics implemented:")
print(f"  Computed popularity for {len(item_popularity):,} items")
print(f"  Avg popularity: {np.mean(list(item_popularity.values())):.3f}")
print("\n" + "=" * 60)


Implementing novelty metrics...

Novelty metrics implemented:
  Computed popularity for 1,000 items
  Avg popularity: 0.695
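
The reported average popularity can be checked against the heuristic itself. Colors and brands in the synthetic catalog are drawn uniformly, so every color/brand pair is equally likely; enumerating all pairs gives the expected score. This is a sketch that restates the scoring rule from `compute_item_popularity`; the helper name `heuristic_score` is illustrative.

```python
from itertools import product

colors = ['beyaz', 'siyah', 'mavi', 'kırmızı', 'yeşil', 'gri']
brands = ['nike', 'adidas', 'zara', 'h&m', 'mango', 'koton']
popular_colors = {'beyaz', 'siyah', 'mavi'}
popular_brands = {'nike', 'adidas', 'zara'}


def heuristic_score(color: str, brand: str) -> float:
    """Same rule as the notebook: base 0.5, +0.2 per popular attribute, capped at 1."""
    score = 0.5
    if color in popular_colors:
        score += 0.2
    if brand in popular_brands:
        score += 0.2
    return min(score, 1.0)


# All 36 color/brand pairs are equally likely under uniform sampling.
scores = [heuristic_score(c, b) for c, b in product(colors, brands)]
expected_popularity = sum(scores) / len(scores)
expected_novelty = 1 - expected_popularity

print(f"Expected popularity: {expected_popularity:.3f}")
print(f"Expected novelty:    {expected_novelty:.3f}")
```

The expected popularity is 0.7, which agrees with the 0.695 measured on the 1,000 sampled products. The score can only take the values 0.5, 0.7, and 0.9, so novelty under this proxy is limited to 0.1, 0.3, or 0.5 per item.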



In [6]:
# ============================================================
# RANKING QUALITY METRICS
# ============================================================

print("\nImplementing ranking quality metrics...\n")
print("=" * 60)

def mean_average_precision(product_ids: List[int], relevant_ids: Set[int]) -> float:
    """Average precision (AP) for one query; the mean over queries gives MAP."""
    if not relevant_ids:
        return 0.0

    precision_sum = 0.0
    num_hits = 0

    for rank, pid in enumerate(product_ids, start=1):
        if pid in relevant_ids:
            num_hits += 1
            precision_sum += num_hits / rank  # precision at this hit's rank

    # Normalize by all relevant items, so relevant items that were never
    # retrieved lower the score instead of being ignored.
    return precision_sum / len(relevant_ids)


def success_at_k(product_ids: List[int], relevant_ids: Set[int]) -> float:
    """Success@K - at least one relevant item?"""
    return 1.0 if any(pid in relevant_ids for pid in product_ids) else 0.0


def recall_at_k(product_ids: List[int], relevant_ids: Set[int]) -> float:
    """Recall@K - what fraction of relevant items retrieved?"""
    if not relevant_ids:
        return 0.0

    retrieved_relevant = len(set(product_ids) & relevant_ids)
    return retrieved_relevant / len(relevant_ids)


print("Ranking metrics implemented:")
print("  - MAP (mean average precision)")
print("  - Success@K")
print("  - Recall@K")
print("\n" + "=" * 60)


Implementing ranking quality metrics...

Ranking metrics implemented:
  - MAP (mean average precision)
  - Success@K
  - Recall@K
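
Average precision is sensitive to its normalizer, and the difference shows up whenever a ranked list misses some relevant items. The sketch below compares dividing by the number of hits with the textbook normalization by the total number of relevant items. The function names and the example ranking are illustrative, not taken from this notebook.

```python
from typing import List, Set


def precisions_at_hits(ranked: List[int], relevant: Set[int]) -> List[float]:
    """Precision@rank at every rank where a relevant item appears."""
    hits = 0
    values = []
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            values.append(hits / rank)
    return values


def ap_over_hits(ranked: List[int], relevant: Set[int]) -> float:
    """AP normalized by the number of retrieved relevant items."""
    values = precisions_at_hits(ranked, relevant)
    return sum(values) / len(values) if values else 0.0


def ap_over_relevant(ranked: List[int], relevant: Set[int]) -> float:
    """AP normalized by the total number of relevant items."""
    if not relevant:
        return 0.0
    return sum(precisions_at_hits(ranked, relevant)) / len(relevant)


ranked = [7, 3, 9, 4]        # hits at ranks 1 and 3
relevant = {7, 9, 42, 43}    # two relevant items are never retrieved

hits_ap = ap_over_hits(ranked, relevant)          # (1/1 + 2/3) / 2
relevant_ap = ap_over_relevant(ranked, relevant)  # (1/1 + 2/3) / 4
print(f"AP over hits:     {hits_ap:.3f}")
print(f"AP over relevant: {relevant_ap:.3f}")
```

Dividing by hits gives the same score to a list that finds two of four relevant items as to one that finds two of two, so the choice of normalizer should be stated whenever MAP values are compared across systems.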



In [7]:
# ============================================================
# COMPREHENSIVE EVALUATOR
# ============================================================

print("\nBuilding comprehensive evaluator...\n")
print("=" * 60)

class ComprehensiveEvaluator:
    """Evaluate system with multiple metrics."""

    def __init__(self, products_df: pd.DataFrame, gt_df: pd.DataFrame):
        self.products_df = products_df
        self.gt_df = gt_df
        self.item_popularity = compute_item_popularity(products_df)
        # Create mock embeddings for diversity calculation
        self.embeddings = np.random.randn(len(products_df), 128)

    def evaluate_query(
        self,
        query_id: int,
        product_ids: List[int]
    ) -> Dict[str, float]:
        """Evaluate single query with all metrics."""

        # Get ground truth
        gt_items = self.gt_df[self.gt_df['query_id'] == query_id]
        relevant_ids = set(gt_items['product_id'].tolist())

        metrics = {}

        # Ranking quality
        metrics['map'] = mean_average_precision(product_ids, relevant_ids)
        metrics['recall'] = recall_at_k(product_ids, relevant_ids)
        metrics['success'] = success_at_k(product_ids, relevant_ids)

        # Diversity
        metrics['ils'] = intra_list_similarity(product_ids, self.embeddings)
        metrics['cat_diversity'] = category_diversity(product_ids, self.products_df)

        # Novelty
        metrics['novelty'] = novelty_at_k(product_ids, self.item_popularity)

        return metrics

    def evaluate_all(self, results: Dict[int, List[int]]) -> pd.DataFrame:
        """Evaluate all queries."""
        all_metrics = []

        for query_id, product_ids in results.items():
            metrics = self.evaluate_query(query_id, product_ids)
            metrics['query_id'] = query_id
            all_metrics.append(metrics)

        return pd.DataFrame(all_metrics)


evaluator = ComprehensiveEvaluator(products_df, gt_df)

print("Evaluator ready")
print(f"  Products: {len(products_df):,}")
print(f"  Ground truth: {len(gt_df)} judgments")
print(f"  Embeddings: {evaluator.embeddings.shape}")
print("\n" + "=" * 60)


Building comprehensive evaluator...

Evaluator ready
  Products: 1,000
  Ground truth: 39 judgments
  Embeddings: (1000, 128)
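
The ground truth stores graded relevance (1 to 3), but `evaluate_query` reduces it to a binary set, so the grades do not influence any metric above. If graded ranking quality is wanted alongside the other metrics, NDCG@K is the usual choice. A minimal sketch, assuming linear gain (the grade itself) and a log2 rank discount; other gain functions such as `2**grade - 1` are also common, and the names here are illustrative:

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked: List[int], grades: Dict[int, int], k: int = 10) -> float:
    """NDCG@k with linear gain; items without a grade count as 0."""
    gains = [grades.get(pid, 0) for pid in ranked[:k]]
    ideal_gains = sorted(grades.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical grades for one query: product ID -> relevance.
grades = {7: 3, 42: 2, 9: 1}

perfect = ndcg_at_k([7, 42, 9], grades)   # best possible order
reversed_order = ndcg_at_k([9, 42, 7], grades)
print(f"NDCG (ideal order):    {perfect:.3f}")
print(f"NDCG (reversed order): {reversed_order:.3f}")
```

The per-query `grades` dict could be built from `gt_df` by filtering on `query_id` and zipping the `product_id` and `relevance` columns.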



In [8]:
# ============================================================
# TEST WITH MOCK RETRIEVAL
# ============================================================

print("\nTesting with mock retrieval results...\n")
print("=" * 60)

# Simulate retrieval results for all queries
mock_results = {}

for query_id in queries_df['query_id']:
    # Mock retrieval: 10 random products
    # (in real system this would come from FAISS/retrieval)
    mock_results[query_id] = np.random.choice(
        products_df.index,
        size=10,
        replace=False
    ).tolist()

print(f"Created mock results for {len(mock_results)} queries")
print(f"Each query: 10 products")

# Evaluate all
results_df = evaluator.evaluate_all(mock_results)

print(f"\nEvaluation complete")
print(f"  Queries evaluated: {len(results_df)}")
print(f"  Metrics per query: {len(results_df.columns) - 1}")

# Show summary
print("\n" + "=" * 60)
print("Metric Summary:")
print("=" * 60)

summary = results_df.drop('query_id', axis=1).mean()
for metric, value in summary.items():
    if metric in ['map', 'recall', 'success']:
        print(f"  {metric:15s}: {value:.3f} (ranking quality)")
    elif metric in ['ils', 'cat_diversity']:
        print(f"  {metric:15s}: {value:.3f} (diversity)")
    elif metric == 'novelty':
        print(f"  {metric:15s}: {value:.3f} (novelty)")

print("\n" + "=" * 60)


Testing with mock retrieval results...

Created mock results for 10 queries
Each query: 10 products

Evaluation complete
  Queries evaluated: 10
  Metrics per query: 6

Metric Summary:
  map            : 0.000 (ranking quality)
  recall         : 0.000 (ranking quality)
  success        : 0.000 (ranking quality)
  ils            : -0.004 (diversity)
  cat_diversity  : 0.510 (diversity)
  novelty        : 0.312 (novelty)
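
The zeros for MAP, recall, and success are what random retrieval should produce at this scale, not a sign of a bug. A quick hypergeometric check, assuming 10 products drawn uniformly without replacement from 1,000 and 4 relevant products per query (the notebook's average is 3.9; using 4 is a simplifying assumption):

```python
from math import comb

catalog_size = 1000
k = 10              # results per query
num_relevant = 4    # assumed relevant items per query
num_queries = 10

# Probability that a random top-k list contains none of the relevant items.
p_no_hit = comb(catalog_size - num_relevant, k) / comb(catalog_size, k)
p_success = 1 - p_no_hit

# Each relevant item lands in the list with probability k / catalog_size,
# so expected recall equals that ratio regardless of num_relevant.
expected_recall = k / catalog_size

# Probability that every query misses, i.e. all ranking metrics are exactly 0.
p_all_queries_miss = p_no_hit ** num_queries

print(f"P(success) per query:   {p_success:.3f}")
print(f"Expected recall@{k}:     {expected_recall:.3f}")
print(f"P(all {num_queries} queries miss): {p_all_queries_miss:.3f}")
```

Under these assumptions a single query succeeds about 4% of the time, and roughly two out of three runs of ten queries end with no hits at all. The ranking metrics therefore only become informative once real retrieval results replace the mock ones.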



In [9]:
# ============================================================
# SAVE RESULTS
# ============================================================

print("\nSaving results...\n")
print("=" * 60)

# Save detailed results
results_path = EVAL_DIR / "comprehensive_metrics.csv"
results_df.to_csv(results_path, index=False)
print(f"✓ Saved: {results_path.name}")

# Save summary statistics
summary_path = EVAL_DIR / "metrics_summary.json"
summary_data = {
    'mean': summary.to_dict(),
    'std': results_df.drop('query_id', axis=1).std().to_dict(),
    'median': results_df.drop('query_id', axis=1).median().to_dict(),
    'min': results_df.drop('query_id', axis=1).min().to_dict(),
    'max': results_df.drop('query_id', axis=1).max().to_dict()
}
with open(summary_path, 'w') as f:
    json.dump(summary_data, f, indent=2)
print(f"✓ Saved: {summary_path.name}")

# Save metric descriptions
desc_path = EVAL_DIR / "README.txt"
with open(desc_path, 'w', encoding='utf-8') as f:
    f.write("Comprehensive Evaluation Metrics\n")
    f.write("=" * 60 + "\n\n")
    f.write("This evaluation goes beyond basic Recall/NDCG to measure\n")
    f.write("multiple aspects of recommendation quality.\n\n")
    f.write("Ranking Quality:\n")
    f.write("  MAP - mean average precision (higher = better)\n")
    f.write("  Recall@K - fraction of relevant items retrieved\n")
    f.write("  Success@K - at least one relevant item found\n\n")
    f.write("Diversity:\n")
    f.write("  ILS - intra-list similarity (lower = more diverse)\n")
    f.write("  cat_diversity - unique categories ratio (higher = better)\n\n")
    f.write("Novelty:\n")
    f.write("  novelty - average item novelty (1 - popularity)\n")
    f.write("  Higher = recommending less obvious items\n\n")
    f.write("Files:\n")
    f.write("  - comprehensive_metrics.csv: per-query metrics\n")
    f.write("  - metrics_summary.json: aggregated statistics\n")
print(f"✓ Saved: {desc_path.name}")

print("\n" + "=" * 60)
print("All files saved successfully")


Saving results...

✓ Saved: comprehensive_metrics.csv
✓ Saved: metrics_summary.json
✓ Saved: README.txt

All files saved successfully


In [10]:
# ============================================================
# SUMMARY
# ============================================================

print("\n" + "=" * 60)
print("PHASE 9, NOTEBOOK 1 COMPLETE")
print("=" * 60)

print("\nWhat we built:")
print("  ✓ Comprehensive metrics suite")
print("  ✓ Ranking quality (MAP, Recall, Success)")
print("  ✓ Diversity (ILS, category diversity)")
print("  ✓ Novelty (inverse popularity)")

print("\nResults:")
print(f"  Queries tested: {len(results_df)}")
print(f"  Metrics computed: {len(summary)}")
print(f"  Files created: 3")

print("\nKey findings (mock data):")
print(f"  Avg MAP: {summary['map']:.3f}")
print(f"  Avg Recall@10: {summary['recall']:.3f}")
print(f"  Success rate: {summary['success']:.3f}")
print(f"  Category diversity: {summary['cat_diversity']:.3f}")
print(f"  Novelty: {summary['novelty']:.3f}")

print("\nNext steps:")
print("  1. Run on real retrieval results (not mock)")
print("  2. Compare different system configurations")
print("  3. Ablation studies (Notebook 2)")
print("  4. Baseline comparisons (Notebook 3)")

print("\n" + "=" * 60)


PHASE 9, NOTEBOOK 1 COMPLETE

What we built:
  ✓ Comprehensive metrics suite
  ✓ Ranking quality (MAP, Recall, Success)
  ✓ Diversity (ILS, category diversity)
  ✓ Novelty (inverse popularity)

Results:
  Queries tested: 10
  Metrics computed: 6
  Files created: 3

Key findings (mock data):
  Avg MAP: 0.000
  Avg Recall@10: 0.000
  Success rate: 0.000
  Category diversity: 0.510
  Novelty: 0.312

Next steps:
  1. Run on real retrieval results (not mock)
  2. Compare different system configurations
  3. Ablation studies (Notebook 2)
  4. Baseline comparisons (Notebook 3)



---

## Summary

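Implemented an evaluation suite that goes beyond Recall and NDCG, covering
ranking quality, diversity, and novelty. All numbers in this notebook come from
synthetic data with random retrieval, so they confirm that the pipeline runs
but say nothing about the real system.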

### Metrics Implemented

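**Ranking Quality**
- MAP: Average precision per query, averaged over queries; rewards relevant items ranked near the top
- Recall@K: Fraction of a query's relevant items that appear in the top K
- Success@K: 1 if at least one relevant item appears in the top K, else 0

**Diversity**
- ILS: Mean pairwise cosine similarity within a result list (lower = more diverse)
- Category diversity: Unique categories divided by list length (at most 0.6 here, with 6 categories and K = 10)

**Novelty**
- Novelty@K: Mean of 1 - popularity over the list (higher = less obvious items)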

### Why This Matters

Each metric captures a different quality dimension. A system can have high
recall and still show ten near-identical items, which only the diversity
metrics would reveal. Reporting the metrics together makes such trade-offs
visible when comparing configurations.

### Files Created

```
evaluation/comprehensive/
├── comprehensive_metrics.csv
├── metrics_summary.json
└── README.txt
```

### Next

Notebook 2 will implement ablation studies to understand which components
contribute most to overall performance.

---