# ML-Powered Code Search Engine Report

This report documents the evaluation and fine-tuning of a code search engine using the UniXcoder model on the CoSQA dataset.

## 1. Baseline Evaluation - Pre-trained UniXcoder

First, we evaluate the pre-trained `microsoft/unixcoder-base` model, without any fine-tuning, on the CoSQA test set.

In [None]:
from src.search_engine import SearchEngine
from evaluation.cosqa_loader import CoSQALoader
from evaluation.metrics import recall_at_k, mrr_at_k, ndcg_at_k
from evaluation.evaluate import get_cache_path, index_corpus, calculate_metrics
import logging

logging.basicConfig(level=logging.INFO, format='%(message)s')
log = logging.getLogger(__name__)

In [None]:
# Load test dataset
loader = CoSQALoader()
corpus, queries, relevance = loader.load(split="test")

log.info(f"Corpus size: {len(corpus)}")
log.info(f"Total queries: {len(queries)}")

In [None]:
# Initialize baseline model
baseline_model_name = "microsoft/unixcoder-base"
search_engine = SearchEngine(model_name=baseline_model_name)
index_corpus(search_engine, corpus, baseline_model_name, normalize=True)

In [None]:
# Run evaluation
results = {}
for query in queries:
    query_id = query["query_id"]
    query_text = query["query_text"]
    search_results = search_engine.search(query_text, top_k=10)
    retrieved_indices = [r["id"] for r in search_results]
    results[query_id] = retrieved_indices

In [None]:
# Calculate baseline metrics
baseline_metrics = calculate_metrics(results, relevance)

log.info("="*50)
log.info("BASELINE EVALUATION RESULTS".center(50))
log.info("="*50)
for metric_name, value in baseline_metrics.items():
    log.info(f"  {metric_name.upper():<20} {value:.4f}")
log.info("="*50)

# Expected results:
# RECALL@10: 0.2220
# MRR@10: 0.0890
# NDCG@10: 0.1197
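
The three metrics above are imported from `evaluation.metrics`, whose implementation is not shown in this report. As a rough sketch of the standard definitions, assuming binary relevance and a set of relevant document ids per query (the function names mirror the imports, but the signatures here are assumptions), the per-query computation might look like the following. The reported numbers would then be averages over all queries.

```python
import math


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query (not from CoSQA): the single relevant document is ranked third.
ranking = [7, 2, 5, 9]
gold = {5}
print(recall_at_k(ranking, gold))  # 1.0
print(mrr_at_k(ranking, gold))     # 0.333...
print(ndcg_at_k(ranking, gold))    # 0.5
```

With one relevant document per query, Recall@10 is simply the hit rate in the top 10, while MRR and NDCG additionally reward placing that document near the top.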

## 2. Model Fine-tuning

Fine-tune the UniXcoder model on the CoSQA training set using MultipleNegativesRankingLoss.

In [None]:
from training.train import train_model
from training.config import TrainingConfig

# Train the model (this will take time)
config = TrainingConfig()
trained_model = train_model(config)

# Training results saved to:
# - Model: ./models/unixcoder-finetuned
# - History: ./training_history.json
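
The internals of `train_model` are not shown here. To illustrate the idea behind MultipleNegativesRankingLoss, the sketch below treats each query's paired code snippet as the positive and every other snippet in the batch as a negative, then applies cross-entropy over scaled cosine similarities. It is written in plain Python for readability; the `scale=20.0` default and the use of cosine similarity are assumptions about the configuration, and the function name is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(query_embs, code_embs, scale=20.0):
    """Mean cross-entropy where the correct class for query i is snippet i."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, code) for code in code_embs]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D batch: aligned pairs give a near-zero loss, swapped pairs a large one.
queries_toy = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(queries_toy, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(queries_toy, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Because the negatives come from the batch itself, larger batches give each query more negatives to be ranked against.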

### Training Results

Training ran for 8 epochs before early stopping was triggered; the best checkpoint came from epoch 5:

**Training Loss:**
- Epoch 1: 0.5803
- Epoch 2: 0.3273
- Epoch 3: 0.1933
- Epoch 4: 0.1163
- Epoch 5: 0.0757 (best validation NDCG@10: 0.2551)
- Epoch 6: 0.0518
- Epoch 7: 0.0382
- Epoch 8: 0.0290

**Best Validation Metrics (Epoch 5):**
- NDCG@10: 0.2551
- Recall@10: 0.4657
- MRR@10: 0.1911
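
A best epoch of 5 followed by a stop at epoch 8 would be consistent with patience-based early stopping using a patience of 3 epochs, although the actual setting in `TrainingConfig` is not shown in this report. The sketch below illustrates that rule. The function name is hypothetical, and the validation scores in the example are made-up placeholders, not values from this run.

```python
def run_with_early_stopping(val_scores, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs have passed
    without improving on the best validation score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)


# Placeholder validation scores (illustrative only).
placeholder_scores = [0.10, 0.20, 0.30, 0.25, 0.28, 0.27]
print(run_with_early_stopping(placeholder_scores))  # (3, 6)
```

Note that the training loss keeps falling after the best epoch in the table above, so the stopping decision has to be driven by the validation metric rather than the loss.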

## 3. Post Fine-tuning Evaluation

Evaluate the fine-tuned model on the test set.

In [None]:
# Evaluate fine-tuned model
finetuned_model_path = "./models/unixcoder-finetuned"
finetuned_engine = SearchEngine(model_name=finetuned_model_path)
index_corpus(finetuned_engine, corpus, finetuned_model_path, normalize=True)

In [None]:
# Run queries
finetuned_results = {}
for query in queries:
    query_id = query["query_id"]
    query_text = query["query_text"]
    search_results = finetuned_engine.search(query_text, top_k=10)
    retrieved_indices = [r["id"] for r in search_results]
    finetuned_results[query_id] = retrieved_indices

In [None]:
# Calculate fine-tuned metrics
finetuned_metrics = calculate_metrics(finetuned_results, relevance)

log.info("="*50)
log.info("FINE-TUNED MODEL RESULTS".center(50))
log.info("="*50)
for metric_name, value in finetuned_metrics.items():
    log.info(f"  {metric_name.upper():<20} {value:.4f}")
log.info("="*50)

# Expected results:
# RECALL@10: 0.4260
# MRR@10: 0.1621
# NDCG@10: 0.2234

## 4. Bonus 1 - Function Names Only Evaluation

Evaluate the fine-tuned model using only function names extracted from code snippets.

In [None]:
from bonus.extractor import extract_function_name

# Extract function names
corpus_names = [extract_function_name(code) for code in corpus]

# Evaluate with function names
engine_names = SearchEngine(model_name=finetuned_model_path)
engine_names.index_documents(corpus_names, show_progress=True)

results_names = {}
for query in queries:
    query_id = query["query_id"]
    query_text = query["query_text"]
    search_results = engine_names.search(query_text, top_k=10)
    retrieved_indices = [r["id"] for r in search_results]
    results_names[query_id] = retrieved_indices

metrics_names = calculate_metrics(results_names, relevance)

log.info("="*50)
log.info("FUNCTION NAMES ONLY RESULTS".center(50))
log.info("="*50)
for metric_name, value in metrics_names.items():
    log.info(f"  {metric_name.upper():<20} {value:.4f}")
log.info("="*50)
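
The implementation of `extract_function_name` lives in `bonus.extractor` and is not reproduced here. A minimal sketch of one way to do it, assuming the corpus entries are Python source strings and that the first `def` in a snippet is the function of interest, could use a regular expression. The helper name `first_function_name` and its `fallback` parameter are hypothetical.

```python
import re

# Matches "def <identifier>(" and captures the identifier.
_DEF_PATTERN = re.compile(r"\bdef\s+([A-Za-z_]\w*)\s*\(")


def first_function_name(code, fallback=""):
    """Return the name of the first function defined in `code`, else `fallback`."""
    match = _DEF_PATTERN.search(code)
    return match.group(1) if match else fallback


snippet = "def read_json_file(path):\n    with open(path) as f:\n        return json.load(f)"
print(first_function_name(snippet))          # read_json_file
print(repr(first_function_name("x = 1")))    # ''
```

A regex tolerates snippets that do not parse cleanly, which `ast.parse` would reject, at the cost of occasionally matching a `def` inside a string or comment.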

## 5. Bonus 2 - Similarity Metrics Comparison

Compare different similarity metrics (Cosine, Euclidean, Manhattan, Dot Product) with both normalized and unnormalized embeddings.

In [None]:
from bonus.bonus2 import run_all_metrics

# This will evaluate all combinations of:
# - Metrics: cosine, euclidean, manhattan, dot_product
# - Normalization: normalized, unnormalized
run_all_metrics()

# Results saved to: bonus/results/similarity_metrics.json

### Bonus 2 Results Summary

**Best Performance: Manhattan Distance with Normalized Embeddings**
- Recall@10: 0.4240
- MRR@10: 0.1658
- NDCG@10: 0.2261

**Normalized Embeddings:**
- Cosine: Recall@10: 0.3860, MRR@10: 0.1556, NDCG@10: 0.2092
- Euclidean: Recall@10: 0.3860, MRR@10: 0.1513, NDCG@10: 0.2060
- Manhattan: Recall@10: 0.4240, MRR@10: 0.1658, NDCG@10: 0.2261
- Dot Product: Recall@10: 0.3860, MRR@10: 0.1556, NDCG@10: 0.2092

**Unnormalized Embeddings:**
- Cosine: Recall@10: 0.3860, MRR@10: 0.1579, NDCG@10: 0.2109
- Euclidean: Recall@10: 0.3200, MRR@10: 0.1274, NDCG@10: 0.1723
- Manhattan: Recall@10: 0.3400, MRR@10: 0.1337, NDCG@10: 0.1817
- Dot Product: Recall@10: 0.3840, MRR@10: 0.1457, NDCG@10: 0.2008
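
To make the comparison concrete, the sketch below defines the four measures in plain Python (the code in `bonus.bonus2` is not shown and may differ). It also checks two identities that hold for unit-length vectors: the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`. The identical Cosine and Dot Product rows in the normalized table are consistent with the first identity. Manhattan distance has no such relation to cosine, so it can produce a different ranking.

```python
import math


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot_product(v, v))
    return [a / norm for a in v]


def cosine_similarity(u, v):
    return dot_product(u, v) / math.sqrt(dot_product(u, u) * dot_product(v, v))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy vectors (not real embeddings), normalized to unit length.
query_vec = l2_normalize([1.0, 2.0, 2.0])
doc_vec = l2_normalize([2.0, 1.0, 2.0])

cos = cosine_similarity(query_vec, doc_vec)
dist = euclidean_distance(query_vec, doc_vec)

print(abs(dot_product(query_vec, doc_vec) - cos) < 1e-12)  # True
print(abs(dist ** 2 - (2 - 2 * cos)) < 1e-12)              # True
```

On unnormalized embeddings these identities no longer hold, so vector length starts to influence dot-product and distance-based rankings.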

## 6. Embeddings Analysis - PCA and Anisotropy

Analyze the structure of fine-tuned embeddings to understand their intrinsic dimensionality and detect anisotropy.

In [None]:
import pickle
import numpy as np
from sklearn.decomposition import PCA

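# Load fine-tuned normalized embeddings.
# Note: despite the .npz suffix, this cell assumes the cache file holds a pickled
# NumPy array. If the file was written with np.savez, read it with np.load instead.
with open('cache/embeddings/embeddings_finetuned_normalized.pkl.npz', 'rb') as f:
    embeddings = pickle.load(f)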

print(f"Embeddings shape: {embeddings.shape}")
print(f"Total values: {embeddings.size:,}")

In [None]:
# Sparsity Analysis
exact_zeros = np.sum(embeddings == 0)
near_zeros_001 = np.sum(np.abs(embeddings) < 0.001)
near_zeros_01 = np.sum(np.abs(embeddings) < 0.01)
near_zeros_05 = np.sum(np.abs(embeddings) < 0.05)
total = embeddings.size

print("Sparsity Analysis:")
print(f"  Exact zeros (= 0):        {exact_zeros:,} ({exact_zeros/total*100:.4f}%)")
print(f"  Near-zero (< 0.001):      {near_zeros_001:,} ({near_zeros_001/total*100:.4f}%)")
print(f"  Near-zero (< 0.01):       {near_zeros_01:,} ({near_zeros_01/total*100:.4f}%)")
print(f"  Near-zero (< 0.05):       {near_zeros_05:,} ({near_zeros_05/total*100:.4f}%)")

print(f"\nValue Statistics:")
print(f"  Min:    {embeddings.min():.6f}")
print(f"  Max:    {embeddings.max():.6f}")
print(f"  Mean:   {embeddings.mean():.6f}")
print(f"  Std:    {embeddings.std():.6f}")
print(f"  Median: {np.median(embeddings):.6f}")

In [None]:
# PCA Analysis - Intrinsic Dimensionality
print("Computing PCA...")
pca = PCA()
pca.fit(embeddings)

cumvar = np.cumsum(pca.explained_variance_ratio_)
dims_50 = np.argmax(cumvar >= 0.50) + 1
dims_90 = np.argmax(cumvar >= 0.90) + 1
dims_95 = np.argmax(cumvar >= 0.95) + 1
dims_99 = np.argmax(cumvar >= 0.99) + 1

print(f"\nIntrinsic Dimensionality:")
print(f"  50% variance: {dims_50:3d} / 768 ({dims_50/768*100:.1f}%)")
print(f"  90% variance: {dims_90:3d} / 768 ({dims_90/768*100:.1f}%)")
print(f"  95% variance: {dims_95:3d} / 768 ({dims_95/768*100:.1f}%)")
print(f"  99% variance: {dims_99:3d} / 768 ({dims_99/768*100:.1f}%)")

print(f"\n  Top 10 components explain: {cumvar[9]*100:.2f}% of variance")
print(f"  Top 50 components explain: {cumvar[49]*100:.2f}% of variance")

### PCA Results Summary

**Actual Results:**
- Shape: (20604, 768)
- No exact zeros (0.0000%)
- Near-zero (absolute value < 0.05): 83.71% of values

**Intrinsic Dimensionality:**
- 31 dimensions (4.0%) explain 50% of variance
- 231 dimensions (30.1%) explain 90% of variance
- 349 dimensions (45.4%) explain 95% of variance
- 584 dimensions (76.0%) explain 99% of variance

**Anisotropy Evidence:**

Only 31 of the 768 principal components are needed to reach 50% of the variance, whereas an isotropic distribution, with variance spread equally over all components, would need about 384. This strong compression indicates **significant anisotropy**: the embeddings are concentrated along a small number of preferred directions rather than spread evenly across all 768 dimensions.

**Value Distribution:**
- Range: [-0.32, 0.36]
- Mean: 0.0011 (centered near zero)
- Std: 0.036
- Symmetric distribution around zero
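
The isotropic baseline used in the anisotropy argument above can be checked with a few lines of plain Python. The sketch assumes a perfectly isotropic distribution, in which every one of the 768 principal components explains the same share of variance; the helper name `dims_for_variance` is hypothetical. A small tolerance guards against floating-point rounding when the cumulative sum lands exactly on the threshold.

```python
def dims_for_variance(explained_variance_ratio, threshold, tol=1e-12):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = 0.0
    for n_components, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold - tol:
            return n_components
    return len(explained_variance_ratio)


dim = 768
isotropic_ratios = [1.0 / dim] * dim  # equal variance in every direction

print(dims_for_variance(isotropic_ratios, 0.50))  # 384
print(dims_for_variance(isotropic_ratios, 0.90))  # 692
```

Against this baseline, the measured 31 components for 50% and 231 for 90% are far smaller, which is the gap the anisotropy claim rests on.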

## 7. Results Comparison

In [None]:
import pandas as pd

# Actual results from evaluation
baseline_metrics = {
    'recall@10': 0.2220,
    'mrr@10': 0.0890,
    'ndcg@10': 0.1197
}

finetuned_metrics = {
    'recall@10': 0.4260,
    'mrr@10': 0.1621,
    'ndcg@10': 0.2234
}

comparison = pd.DataFrame({
    'Baseline': baseline_metrics,
    'Fine-tuned': finetuned_metrics
})

comparison['Improvement'] = comparison['Fine-tuned'] - comparison['Baseline']
comparison['Improvement %'] = (comparison['Improvement'] / comparison['Baseline']) * 100

print(comparison)

## 8. Training Visualization

In [None]:
from training.plot_training import plot_training_history

# Generate training plots
plot_training_history("./training_history.json")

# This creates:
# - training_loss.png
# - validation_metrics.png

## 9. Metrics Comparison Visualization

In [None]:
import matplotlib.pyplot as plt

# Use a distinct name for the axis labels so that the `metrics_names`
# results dict from the function-names evaluation is not overwritten.
metric_labels = list(baseline_metrics.keys())
baseline_values = list(baseline_metrics.values())
finetuned_values = list(finetuned_metrics.values())

x = range(len(metric_labels))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar([i - width/2 for i in x], baseline_values, width, label='Baseline', color='skyblue')
ax.bar([i + width/2 for i in x], finetuned_values, width, label='Fine-tuned', color='lightcoral')

ax.set_xlabel('Metrics')
ax.set_ylabel('Score')
ax.set_title('Baseline vs Fine-tuned Model Performance')
ax.set_xticks(x)
ax.set_xticklabels([m.upper() for m in metric_labels])
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

### Key Findings:

1. **Baseline Performance:**
   - Recall@10: 0.2220
   - MRR@10: 0.0890
   - NDCG@10: 0.1197

2. **Fine-tuned Performance:**
   - Recall@10: 0.4260 (+92% improvement)
   - MRR@10: 0.1621 (+82% improvement)
   - NDCG@10: 0.2234 (+87% improvement)

3. **Best Configuration (Bonus 2):**
   - Manhattan distance with normalized embeddings
   - Recall@10: 0.4240
   - NDCG@10: 0.2261

4. **Training Insights:**
   - Early stopping at epoch 8
   - Best model from epoch 5
   - Consistent validation improvement through epochs 1-5