# Tribly AI Assistant - Retrieval Evaluation

This notebook evaluates and compares three retrieval methods:
- **Semantic Search**: Uses sentence embeddings for meaning-based retrieval
- **Keyword Search**: Uses TF-IDF-based text matching (baseline)
- **Hybrid Search**: Combines the semantic and keyword rankings using Reciprocal Rank Fusion
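
The notebook does not show how the hybrid method fuses its two rankings, so the following is only a minimal sketch of Reciprocal Rank Fusion under stated assumptions: each method returns an ordered list of document IDs, the smoothing constant is `k=60` (a commonly used default, not confirmed for this project), and the function name and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed default, not the project's value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two underlying methods.
semantic_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Because only ranks are used, the fusion does not need the embedding similarities and TF-IDF scores to be on a comparable scale.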

## Metrics Evaluated
- **Precision@k**: Fraction of the top-k retrieved documents that are relevant
- **Recall@k**: Fraction of all relevant documents that appear in the top-k results
- **MRR**: Mean Reciprocal Rank, the average over all queries of 1/rank of the first relevant result
- **Latency**: Average response time per query, in milliseconds
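
As a rough illustration of the two set-based metrics, here is a minimal sketch. It assumes each query has a ranked list of unique document IDs and a set of relevant IDs; the function names and example IDs are hypothetical and may differ from the evaluation script that produced the results file.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant.

    Divides by k (not by the number of results returned), which is one
    common convention and an assumption here.
    """
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical ranking and relevance judgments for a single query.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p_at_3 = precision_at_k(retrieved, relevant, k=3)  # only d1 is in the top 3
r_at_5 = recall_at_k(retrieved, relevant, k=5)     # d1 and d2 found, d4 missed
print(f"Precision@3 = {p_at_3:.3f}, Recall@5 = {r_at_5:.3f}")
```

Averaging these per-query values over the whole test set would give summary numbers of the kind loaded below.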

In [None]:
# Setup and Imports
import sys
import json
from pathlib import Path

# Add project root to path (assumes the notebook runs from a direct subdirectory of the project root)
sys.path.insert(0, str(Path('.').absolute().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print('Setup complete!')

In [None]:
# Load evaluation results
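results_path = Path('../results/metrics/evaluation_results.json')

# Make sure the output directory exists before any chart is saved to it
Path('../results/figures').mkdir(parents=True, exist_ok=True)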

with open(results_path) as f:
    results = json.load(f)

print(f"Loaded results from: {results_path}")
print(f"Number of test queries: {results['num_queries']}")
print(f"Methods evaluated: {list(results['summary'].keys())}")

In [None]:
# Create summary DataFrame
summary_data = []
for method, metrics in results['summary'].items():
    summary_data.append({
        'Method': method.capitalize(),
        'Precision@1': metrics['precision@1'],
        'Precision@3': metrics['precision@3'],
        'Precision@5': metrics['precision@5'],
        'Precision@10': metrics['precision@10'],
        'Recall@5': metrics['recall@5'],
        'Recall@10': metrics['recall@10'],
        'MRR': metrics['mrr'],
        'Latency (ms)': metrics['avg_latency_ms']
    })

df = pd.DataFrame(summary_data)
df.set_index('Method', inplace=True)
print("\n=== Evaluation Summary ===")
df.round(4)

## Precision@k Comparison

Precision@k is the fraction of the top-k retrieved documents that are relevant.

In [None]:
# Chart 1: Precision@k Comparison
fig, ax = plt.subplots(figsize=(10, 6))

methods = ['Semantic', 'Keyword', 'Hybrid']
k_values = [1, 3, 5, 10]
x = np.arange(len(k_values))
width = 0.25

for i, method in enumerate(methods):
    precisions = [df.loc[method, f'Precision@{k}'] for k in k_values]
    ax.bar(x + i*width, precisions, width, label=method)

ax.set_xlabel('k (number of retrieved documents)', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision@k Comparison Across Search Methods', fontsize=14, fontweight='bold')
ax.set_xticks(x + width)
ax.set_xticklabels([f'k={k}' for k in k_values])
ax.legend()
ax.set_ylim(0, 1.0)  # Precision ranges from 0 to 1; a lower cap would clip taller bars

plt.tight_layout()
plt.savefig('../results/figures/precision_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: results/figures/precision_comparison.png")

## Recall@k Comparison

Recall@k is the fraction of all relevant documents that appear in the top-k results.

In [None]:
# Chart 2: Recall@k Comparison
fig, ax = plt.subplots(figsize=(8, 6))

recall_data = {
    'k=5': [df.loc[m, 'Recall@5'] for m in methods],
    'k=10': [df.loc[m, 'Recall@10'] for m in methods]
}

x = np.arange(len(methods))
width = 0.35

ax.bar(x - width/2, recall_data['k=5'], width, label='Recall@5', color='steelblue')
ax.bar(x + width/2, recall_data['k=10'], width, label='Recall@10', color='coral')

ax.set_xlabel('Search Method', fontsize=12)
ax.set_ylabel('Recall', fontsize=12)
ax.set_title('Recall@k Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(methods)
ax.legend()
ax.set_ylim(0, 1.0)

# Add value labels
for i, (r5, r10) in enumerate(zip(recall_data['k=5'], recall_data['k=10'])):
    ax.text(i - width/2, r5 + 0.02, f'{r5:.2f}', ha='center', fontsize=9)
    ax.text(i + width/2, r10 + 0.02, f'{r10:.2f}', ha='center', fontsize=9)

plt.tight_layout()
plt.savefig('../results/figures/recall_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: results/figures/recall_comparison.png")

## Mean Reciprocal Rank (MRR) Comparison

MRR averages, over all queries, the reciprocal of the rank at which the first relevant result appears. Higher values mean relevant results surface earlier in the ranking.
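
A small worked example may help make the metric concrete. This is a sketch only: the function names and the three toy queries are hypothetical, and a query with no relevant result in its ranking is assumed to contribute 0.

```python
def reciprocal_rank(retrieved, relevant):
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Three hypothetical queries.
runs = [
    (["d1", "d2", "d3"], {"d1"}),  # first relevant at rank 1 -> 1.0
    (["d4", "d5", "d6"], {"d6"}),  # first relevant at rank 3 -> 1/3
    (["d7", "d8"], {"d9"}),        # no relevant result       -> 0.0
]

mrr = mean_reciprocal_rank(runs)  # (1 + 1/3 + 0) / 3
print(f"MRR = {mrr:.4f}")
```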

In [None]:
# Chart 3: MRR Comparison
fig, ax = plt.subplots(figsize=(8, 6))

mrr_values = [df.loc[m, 'MRR'] for m in methods]
colors = ['#3498db', '#e74c3c', '#2ecc71']

bars = ax.bar(methods, mrr_values, color=colors, edgecolor='black', linewidth=1.2)

ax.set_xlabel('Search Method', fontsize=12)
ax.set_ylabel('Mean Reciprocal Rank', fontsize=12)
ax.set_title('MRR Comparison - First Relevant Result Position', fontsize=14, fontweight='bold')
ax.set_ylim(0, 1.0)

# Add value labels on bars
for bar, val in zip(bars, mrr_values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{val:.3f}', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('../results/figures/mrr_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: results/figures/mrr_comparison.png")

## Latency Comparison

Average response time for each search method.
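
The notebook only reads precomputed latencies, so how they were measured is not shown. One plausible approach, sketched here under assumptions, is to time each query with `time.perf_counter` and average the results. The helper name, the warm-up step, and the stand-in search function are all hypothetical.

```python
import statistics
import time


def average_latency_ms(search_fn, queries, warmup=1):
    """Return the mean wall-clock time per query in milliseconds.

    The first `warmup` queries are run once untimed so that one-off costs
    (for example lazy model loading) do not skew the average.
    """
    for query in queries[:warmup]:
        search_fn(query)

    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms)


def dummy_search(query):
    """Stand-in for a real search method; it just splits the query into words."""
    return query.split()


latency = average_latency_ms(dummy_search, ["reset password", "billing help"])
print(f"Average latency: {latency:.4f} ms")
```

Reporting the median or a high percentile alongside the mean can be useful as well, since a few slow queries can dominate an average.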

In [None]:
# Chart 4: Latency Comparison
fig, ax = plt.subplots(figsize=(8, 6))

latencies = [df.loc[m, 'Latency (ms)'] for m in methods]
colors = ['#9b59b6', '#f39c12', '#1abc9c']

bars = ax.bar(methods, latencies, color=colors, edgecolor='black', linewidth=1.2)

ax.set_xlabel('Search Method', fontsize=12)
ax.set_ylabel('Average Latency (ms)', fontsize=12)
ax.set_title('Response Time Comparison', fontsize=14, fontweight='bold')

# Add value labels
for bar, val in zip(bars, latencies):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{val:.1f}ms', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig('../results/figures/latency_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: results/figures/latency_comparison.png")

## Overall Performance Heatmap

In [None]:
# Chart 5: Performance Heatmap
fig, ax = plt.subplots(figsize=(10, 6))

# Select key metrics for heatmap
heatmap_data = df[['Precision@5', 'Recall@5', 'Recall@10', 'MRR']].T

sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='YlGnBu',
            linewidths=0.5, ax=ax, cbar_kws={'label': 'Score'})

ax.set_title('Retrieval Performance Heatmap', fontsize=14, fontweight='bold')
ax.set_xlabel('Search Method', fontsize=12)
ax.set_ylabel('Metric', fontsize=12)

plt.tight_layout()
plt.savefig('../results/figures/performance_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

print("Saved: results/figures/performance_heatmap.png")

## Analysis & Conclusions

### Key Findings

1. **Semantic Search** achieves the best Precision@1 and MRR, meaning it is the most likely to rank a relevant result first.

2. **Hybrid Search** provides the best Recall@10, finding more of the relevant documents within the top 10 results.

3. **Keyword Search** is far faster (~1 ms vs. ~66 ms for semantic) but scores lower on the retrieval-quality metrics.

### Trade-offs

| Method | Best For | Trade-off |
|--------|----------|----------|
| Semantic | Quality, relevance | Slower, needs embeddings |
| Keyword | Speed, exact matches | Lower semantic understanding |
| Hybrid | Balanced performance | Most complex, moderate latency |

### Recommendation

For the Tribly AI Assistant:
- Use **Semantic Search** as the default for best user experience
- Use **Hybrid Search** when comprehensive coverage is needed
- Use **Keyword Search** for real-time autocomplete or suggestions

In [None]:
# Final Summary Table
print("\n" + "="*60)
print(" FINAL EVALUATION SUMMARY")
print("="*60)
print(df.to_string())
print("="*60)