# 05f - Cross-Encoder Zero-Shot Comparison

**Purpose**: Evaluate pre-trained Cross-Encoder models for OCEAN trait prediction without any fine-tuning

**Why Cross-Encoder over Bi-Encoder (BGE)?**
- **Bi-Encoder**: Encodes query and document independently → learn mapping via regression
- **Cross-Encoder**: Jointly processes query+document → direct similarity score
- **Advantage**: Better semantic understanding of (OCEAN definition, loan description) pairs

**Models to Test**:
1. `cross-encoder/stsb-roberta-large` - Trained on STS-B for semantic similarity
2. `cross-encoder/stsb-roberta-base` - Base version (faster)
3. `cross-encoder/ms-marco-MiniLM-L-6-v2` - Trained on MS MARCO passage ranking
4. `cross-encoder/quora-distilroberta-base` - Trained on Quora question pairs

**Evaluation Strategy**:
- Zero-shot: No training required
- Input: (OCEAN definition, loan description) pairs
- Output: Similarity score (normalized to 0-1 for OCEAN prediction)
- Metrics: R², RMSE, MAE vs ground truth

**Expected Performance**:
- Baseline: BGE + Elastic Net (R² 0.19-0.24)
- Target: R² 0.20-0.35 (comparable or better than BGE)
- If R² < 0.25: Consider LoRA fine-tuning

**Estimated Time**: 10-20 minutes for all models

## Step 1: Import Libraries

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
import warnings
from datetime import datetime
from tqdm import tqdm
import time
import torch
warnings.filterwarnings('ignore')

# Use sentence-transformers for Cross-Encoder models
from sentence_transformers import CrossEncoder
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

# Check device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Libraries loaded successfully")
print(f"Timestamp: {datetime.now()}")
print(f"Device: {device}")

Libraries loaded successfully
Timestamp: 2025-10-29 14:49:11.551204
Device: cpu


## Step 2: Define Configuration

In [13]:
# OCEAN dimension definitions
OCEAN_DEFINITIONS = {
    'openness': "This person is imaginative, creative, curious about new experiences, and open to new ideas. They appreciate art, emotion, adventure, unusual ideas, and variety of experience.",
    
    'conscientiousness': "This person is organized, responsible, hardworking, reliable, and goal-oriented. They show self-discipline, act dutifully, and aim for achievement against measures or outside expectations.",
    
    'extraversion': "This person is outgoing, energetic, talkative, sociable, and enjoys being around others. They seek stimulation in the company of others and are assertive and enthusiastic.",
    
    'agreeableness': "This person is friendly, cooperative, compassionate, trusting, and considerate of others. They are generally well-tempered, kind, and value getting along with others.",
    
    'neuroticism': "This person tends to experience negative emotions such as anxiety, anger, or depression. They are emotionally unstable, prone to worry, and have difficulty coping with stress."
}

# LLM configurations
LLM_CONFIGS = {
    'llama': {
        'name': 'Llama-3.1-8B',
        'ocean_file': '../ocean_ground_truth/llama_3.1_8b_ocean_500.csv'
    },
    'gpt': {
        'name': 'GPT-OSS-120B',
        'ocean_file': '../ocean_ground_truth/gpt_oss_120b_ocean_500.csv'
    },
    'gemma': {
        'name': 'Gemma-2-9B',
        'ocean_file': '../ocean_ground_truth/gemma_2_9b_ocean_500.csv'
    },
    'deepseek': {
        'name': 'DeepSeek-V3.1',
        'ocean_file': '../ocean_ground_truth/deepseek_v3.1_ocean_500.csv'
    },
    'qwen': {
        'name': 'Qwen-2.5-72B',
        'ocean_file': '../ocean_ground_truth/qwen_2.5_72b_ocean_500.csv'
    }
}

# Cross-Encoder models to evaluate
CROSSENCODER_MODELS = {
    'stsb-roberta-large': {
        'model_name': 'cross-encoder/stsb-roberta-large',
        'description': 'RoBERTa-Large trained on STS-B (semantic similarity)',
        'params': '355M',
        'expected_range': [0, 5]  # STS-B score range
    },
    'stsb-roberta-base': {
        'model_name': 'cross-encoder/stsb-roberta-base',
        'description': 'RoBERTa-Base trained on STS-B (faster)',
        'params': '125M',
        'expected_range': [0, 5]
    },
    'ms-marco-minilm': {
        'model_name': 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        'description': 'MiniLM trained on MS MARCO passage ranking',
        'params': '22M',
        'expected_range': [-10, 10]  # Relevance score
    },
    'quora-distilroberta': {
        'model_name': 'cross-encoder/quora-distilroberta-base',
        'description': 'DistilRoBERTa trained on Quora question pairs',
        'params': '82M',
        'expected_range': [0, 1]  # Binary similarity
    }
}

OCEAN_DIMS = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

print("Configuration loaded:")
print(f"  LLM models: {len(LLM_CONFIGS)}")
print(f"  Cross-Encoder models: {len(CROSSENCODER_MODELS)}")
print(f"  OCEAN dimensions: {len(OCEAN_DIMS)}")
print(f"  Total evaluations: {len(LLM_CONFIGS) * len(CROSSENCODER_MODELS) * len(OCEAN_DIMS)}")

print("\nCross-Encoder Models:")
for key, config in CROSSENCODER_MODELS.items():
    print(f"  {key}: {config['description']} ({config['params']})")

Configuration loaded:
  LLM models: 5
  Cross-Encoder models: 4
  OCEAN dimensions: 5
  Total evaluations: 100

Cross-Encoder Models:
  stsb-roberta-large: RoBERTa-Large trained on STS-B (semantic similarity) (355M)
  stsb-roberta-base: RoBERTa-Base trained on STS-B (faster) (125M)
  ms-marco-minilm: MiniLM trained on MS MARCO passage ranking (22M)
  quora-distilroberta: DistilRoBERTa trained on Quora question pairs (82M)


In [14]:
def call_crossencoder_api(client, model_name, sentence1, sentence2, max_retries=3):
    """
    Call HF Inference API for cross-encoder similarity scoring.
    
    Args:
        client: HuggingFace InferenceClient
        model_name: Model name on HF Hub
        sentence1: First sentence (OCEAN definition)
        sentence2: Second sentence (loan description)
        max_retries: Maximum number of retry attempts
    
    Returns:
        float: Similarity score
    """
    for retry in range(max_retries):
        try:
            # Use sentence-similarity task for cross-encoder models
            response = client.sentence_similarity(
                sentence=sentence1,
                other_sentences=[sentence2],
                model=model_name
            )
            
            # Extract score from response
            if isinstance(response, list):
                return float(response[0])
            elif isinstance(response, (int, float)):
                return float(response)
            else:
                raise ValueError(f"Unexpected response format: {type(response)}")
                
        except Exception as e:
            if retry < max_retries - 1:
                print(f"        API error (retry {retry+1}/{max_retries}): {str(e)[:100]}")
                time.sleep(API_CONFIG['retry_delay'])
            else:
                raise Exception(f"API failed after {max_retries} retries: {e}")


def evaluate_crossencoder_model(model_key, model_config, llm_key, llm_data, loan_descs, ocean_dims):
    """
    Evaluate a Cross-Encoder model on OCEAN prediction for one LLM using HF Inference API.
    
    Args:
        model_key: Cross-Encoder model identifier
        model_config: Model configuration dict
        llm_key: LLM identifier
        llm_data: Ground truth data for this LLM
        loan_descs: Loan descriptions
        ocean_dims: OCEAN dimensions to evaluate
    
    Returns:
        dict: Evaluation results with metrics for each dimension
    """
    print(f"\n  Using model: {model_config['model_name']} via HF Inference API...")
    start_time = time.time()
    
    model_name = model_config['model_name']
    n_samples = len(llm_data['data'])
    
    results = {}
    
    for dim in ocean_dims:
        print(f"\n    [{dim}] Generating predictions via API...")
        
        # Get OCEAN definition and loan samples
        ocean_def = OCEAN_DEFINITIONS[dim]
        loan_samples = loan_descs[:n_samples]
        
        # Collect predictions with batching
        raw_scores = []
        pred_start = time.time()
        
        batch_size = API_CONFIG['batch_size']
        n_batches = (n_samples + batch_size - 1) // batch_size
        
        with tqdm(total=n_samples, desc=f"      API calls", leave=False) as pbar:
            for batch_idx in range(n_batches):
                start_idx = batch_idx * batch_size
                end_idx = min((batch_idx + 1) * batch_size, n_samples)
                batch_descs = loan_samples[start_idx:end_idx]
                
                # Process each description in the batch
                for desc in batch_descs:
                    score = call_crossencoder_api(client, model_name, ocean_def, desc)
                    raw_scores.append(score)
                    pbar.update(1)
                
                # Delay between batches to respect rate limits
                if batch_idx < n_batches - 1:
                    time.sleep(API_CONFIG['delay_between_batches'])
        
        pred_time = time.time() - pred_start
        raw_scores = np.array(raw_scores)
        
        # Normalize scores to [0, 1] range
        scaler = MinMaxScaler(feature_range=(0, 1))
        predictions = scaler.fit_transform(raw_scores.reshape(-1, 1)).flatten()
        
        # Get ground truth
        y_true = llm_data['data'][dim].values
        
        # Calculate metrics
        r2 = r2_score(y_true, predictions)
        rmse = np.sqrt(mean_squared_error(y_true, predictions))
        mae = mean_absolute_error(y_true, predictions)
        
        results[dim] = {
            'r2': float(r2),
            'rmse': float(rmse),
            'mae': float(mae),
            'raw_score_mean': float(np.mean(raw_scores)),
            'raw_score_std': float(np.std(raw_scores)),
            'raw_score_min': float(np.min(raw_scores)),
            'raw_score_max': float(np.max(raw_scores)),
            'pred_mean': float(np.mean(predictions)),
            'pred_std': float(np.std(predictions)),
            'true_mean': float(np.mean(y_true)),
            'true_std': float(np.std(y_true)),
            'inference_time_sec': float(pred_time),
            'inference_time_per_sample_ms': float(pred_time * 1000 / n_samples),
            'api_calls': int(n_samples)
        }
        
        print(f"      R²={r2:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
        print(f"      Raw scores: [{raw_scores.min():.2f}, {raw_scores.max():.2f}], μ={raw_scores.mean():.2f}")
        print(f"      API time: {pred_time:.1f}s ({pred_time*1000/n_samples:.1f}ms/sample, {n_samples} calls)")
    
    # Calculate summary metrics
    r2_scores = [results[dim]['r2'] for dim in ocean_dims]
    rmse_scores = [results[dim]['rmse'] for dim in ocean_dims]
    mae_scores = [results[dim]['mae'] for dim in ocean_dims]
    
    total_time = time.time() - start_time
    
    summary = {
        'avg_r2': float(np.mean(r2_scores)),
        'avg_rmse': float(np.mean(rmse_scores)),
        'avg_mae': float(np.mean(mae_scores)),
        'min_r2': float(np.min(r2_scores)),
        'max_r2': float(np.max(r2_scores)),
        'std_r2': float(np.std(r2_scores)),
        'total_api_calls': int(n_samples * len(ocean_dims)),
        'total_time_sec': float(total_time)
    }
    
    return {
        'model_key': model_key,
        'model_name': model_config['model_name'],
        'llm_key': llm_key,
        'llm_name': llm_data['name'],
        'n_samples': llm_data['n_samples'],
        'results': results,
        'summary': summary
    }

print("✓ Evaluation function defined (using HF Inference API)")
print("  - Simplified API calls using client.sentence_similarity()")
print("  - Batch processing with rate limiting")
print("  - Automatic retry logic for failed requests")

✓ Evaluation function defined (using HF Inference API)
  - Simplified API calls using client.sentence_similarity()
  - Batch processing with rate limiting
  - Automatic retry logic for failed requests


In [15]:
def evaluate_crossencoder_model(model_key, model_config, llm_key, llm_data, loan_descs, ocean_dims):
    """
    Evaluate a Cross-Encoder model on OCEAN prediction for one LLM.
    
    Args:
        model_key: Cross-Encoder model identifier
        model_config: Model configuration dict
        llm_key: LLM identifier
        llm_data: Ground truth data for this LLM
        loan_descs: Loan descriptions
        ocean_dims: OCEAN dimensions to evaluate
    
    Returns:
        dict: Evaluation results with metrics for each dimension
    """
    print(f"\n  Loading model: {model_config['model_name']}...")
    start_time = time.time()
    
    # Load Cross-Encoder model
    model = CrossEncoder(model_config['model_name'], device=device)
    model_name = model_config['model_name']
    n_samples = len(llm_data['data'])
    
    print(f"  Model loaded successfully (device: {device})")
    
    results = {}
    
    for dim in ocean_dims:
        print(f"\n    [{dim}] Generating predictions...")
        
        # Get OCEAN definition and loan samples
        ocean_def = OCEAN_DEFINITIONS[dim]
        loan_samples = loan_descs[:n_samples]
        
        # Create sentence pairs for cross-encoder
        sentence_pairs = [[ocean_def, desc] for desc in loan_samples]
        
        # Get predictions with batch processing
        pred_start = time.time()
        batch_size = 32  # Process 32 pairs at a time
        
        raw_scores = []
        for i in tqdm(range(0, len(sentence_pairs), batch_size), desc=f"      Batches", leave=False):
            batch = sentence_pairs[i:i+batch_size]
            scores = model.predict(batch, show_progress_bar=False)
            raw_scores.extend(scores)
        
        pred_time = time.time() - pred_start
        raw_scores = np.array(raw_scores)
        
        # Normalize scores to [0, 1] range
        scaler = MinMaxScaler(feature_range=(0, 1))
        predictions = scaler.fit_transform(raw_scores.reshape(-1, 1)).flatten()
        
        # Get ground truth
        y_true = llm_data['data'][dim].values
        
        # Calculate metrics
        r2 = r2_score(y_true, predictions)
        rmse = np.sqrt(mean_squared_error(y_true, predictions))
        mae = mean_absolute_error(y_true, predictions)
        
        results[dim] = {
            'r2': float(r2),
            'rmse': float(rmse),
            'mae': float(mae),
            'raw_score_mean': float(np.mean(raw_scores)),
            'raw_score_std': float(np.std(raw_scores)),
            'raw_score_min': float(np.min(raw_scores)),
            'raw_score_max': float(np.max(raw_scores)),
            'pred_mean': float(np.mean(predictions)),
            'pred_std': float(np.std(predictions)),
            'true_mean': float(np.mean(y_true)),
            'true_std': float(np.std(y_true)),
            'inference_time_sec': float(pred_time),
            'inference_time_per_sample_ms': float(pred_time * 1000 / n_samples)
        }
        
        print(f"      R²={r2:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
        print(f"      Raw scores: [{raw_scores.min():.2f}, {raw_scores.max():.2f}], μ={raw_scores.mean():.2f}")
        print(f"      Inference time: {pred_time:.1f}s ({pred_time*1000/n_samples:.1f}ms/sample)")
    
    # Calculate summary metrics
    r2_scores = [results[dim]['r2'] for dim in ocean_dims]
    rmse_scores = [results[dim]['rmse'] for dim in ocean_dims]
    mae_scores = [results[dim]['mae'] for dim in ocean_dims]
    
    total_time = time.time() - start_time
    
    summary = {
        'avg_r2': float(np.mean(r2_scores)),
        'avg_rmse': float(np.mean(rmse_scores)),
        'avg_mae': float(np.mean(mae_scores)),
        'min_r2': float(np.min(r2_scores)),
        'max_r2': float(np.max(r2_scores)),
        'std_r2': float(np.std(r2_scores)),
        'total_time_sec': float(total_time)
    }
    
    # Clean up model to free memory
    del model
    if device == 'cuda':
        torch.cuda.empty_cache()
    
    return {
        'model_key': model_key,
        'model_name': model_config['model_name'],
        'llm_key': llm_key,
        'llm_name': llm_data['name'],
        'n_samples': llm_data['n_samples'],
        'results': results,
        'summary': summary
    }

print("✓ Evaluation function defined")
print("  - Uses sentence-transformers CrossEncoder")
print("  - Batch processing for efficiency")
print("  - Automatic device selection (CPU/GPU)")

✓ Evaluation function defined
  - Uses sentence-transformers CrossEncoder
  - Batch processing for efficiency
  - Automatic device selection (CPU/GPU)


In [16]:
# Load loan descriptions
print("Loading loan descriptions...")
loan_df = pd.read_csv('../loan_final_desc50plus_with_ocean_bge.csv')
loan_descriptions = loan_df['desc'].tolist()
print(f"✓ Loaded {len(loan_descriptions)} loan descriptions")

# Load OCEAN ground truth for each LLM
llm_ocean_data = {}

print("\nLoading OCEAN ground truth data...")
for llm_key, config in LLM_CONFIGS.items():
    ocean_file = config['ocean_file']
    
    if not os.path.exists(ocean_file):
        print(f"  ⚠️ Warning: {ocean_file} not found, skipping {llm_key}")
        continue
    
    df = pd.read_csv(ocean_file)
    
    # Verify OCEAN columns exist
    missing_cols = [col for col in OCEAN_DIMS if col not in df.columns]
    if missing_cols:
        print(f"  ⚠️ Warning: {llm_key} missing columns {missing_cols}, skipping")
        continue
    
    llm_ocean_data[llm_key] = {
        'name': config['name'],
        'data': df,
        'n_samples': len(df)
    }
    
    print(f"  ✓ {config['name']}: {len(df)} samples")

print(f"\n✓ Data loading complete")
print(f"  Loan descriptions: {len(loan_descriptions)}")
print(f"  LLM datasets: {len(llm_ocean_data)}")
print(f"  Ready for evaluation!")

Loading loan descriptions...
✓ Loaded 34529 loan descriptions

Loading OCEAN ground truth data...
  ✓ Llama-3.1-8B: 500 samples
  ✓ GPT-OSS-120B: 500 samples
  ✓ Gemma-2-9B: 500 samples
  ✓ DeepSeek-V3.1: 500 samples
  ✓ Qwen-2.5-72B: 500 samples

✓ Data loading complete
  Loan descriptions: 34529
  LLM datasets: 5
  Ready for evaluation!


In [17]:
# Run evaluation for all combinations
all_results = []

print("=" * 80)
print("Starting Cross-Encoder Zero-Shot Evaluation")
print("=" * 80)

total_evals = len(CROSSENCODER_MODELS) * len(llm_ocean_data)
eval_count = 0

for model_key, model_config in CROSSENCODER_MODELS.items():
    print(f"\n{'='*80}")
    print(f"Model: {model_key} ({model_config['description']})")
    print(f"{'='*80}")
    
    for llm_key, llm_data in llm_ocean_data.items():
        eval_count += 1
        print(f"\n[{eval_count}/{total_evals}] Evaluating {llm_data['name']}...")
        
        try:
            result = evaluate_crossencoder_model(
                model_key=model_key,
                model_config=model_config,
                llm_key=llm_key,
                llm_data=llm_data,
                loan_descs=loan_descriptions,
                ocean_dims=OCEAN_DIMS
            )
            
            all_results.append(result)
            
            # Print summary
            print(f"\n  Summary for {model_key} + {llm_key}:")
            print(f"    Avg R²: {result['summary']['avg_r2']:.4f}")
            print(f"    Avg RMSE: {result['summary']['avg_rmse']:.4f}")
            print(f"    Total time: {result['summary']['total_time_sec']:.1f}s")
            
        except Exception as e:
            print(f"\n  ❌ Error evaluating {model_key} + {llm_key}: {e}")
            continue

print(f"\n{'='*80}")
print(f"✓ Evaluation Complete")
print(f"  Total evaluations: {len(all_results)}/{total_evals}")
print(f"{'='*80}")

Starting Cross-Encoder Zero-Shot Evaluation

Model: stsb-roberta-large (RoBERTa-Large trained on STS-B (semantic similarity))

[1/20] Evaluating Llama-3.1-8B...

  Loading model: cross-encoder/stsb-roberta-large...


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

  Model loaded successfully (device: cpu)

    [openness] Generating predictions...





  ❌ Error evaluating stsb-roberta-large + llama: Input contains NaN.

[2/20] Evaluating GPT-OSS-120B...

  Loading model: cross-encoder/stsb-roberta-large...
  Model loaded successfully (device: cpu)

    [openness] Generating predictions...




      R²=-5.7267, RMSE=0.3230, MAE=0.2660
      Raw scores: [0.01, 0.69], μ=0.14
      Inference time: 147.9s (295.8ms/sample)

    [conscientiousness] Generating predictions...




      R²=-28.8090, RMSE=0.5858, MAE=0.5343
      Raw scores: [0.01, 0.72], μ=0.17
      Inference time: 144.9s (289.7ms/sample)

    [extraversion] Generating predictions...




      R²=-9.5760, RMSE=0.2906, MAE=0.2206
      Raw scores: [0.01, 0.69], μ=0.13
      Inference time: 147.7s (295.4ms/sample)

    [agreeableness] Generating predictions...




KeyboardInterrupt: 

## Step 5: Run Evaluation

**Note**: This will download and run models locally. Estimated time: 10-20 minutes for all models.

In [None]:
# Run evaluation for all combinations
all_results = []

print("=" * 80)
print("Starting Cross-Encoder Zero-Shot Evaluation")
print("=" * 80)

total_evals = len(CROSSENCODER_MODELS) * len(llm_ocean_data)
eval_count = 0

for model_key, model_config in CROSSENCODER_MODELS.items():
    print(f"\n{'='*80}")
    print(f"Model: {model_key} ({model_config['description']})")
    print(f"{'='*80}")
    
    for llm_key, llm_data in llm_ocean_data.items():
        eval_count += 1
        print(f"\n[{eval_count}/{total_evals}] Evaluating {llm_data['name']}...")
        
        try:
            result = evaluate_crossencoder_model(
                model_key=model_key,
                model_config=model_config,
                llm_key=llm_key,
                llm_data=llm_data,
                loan_descs=loan_descriptions,
                ocean_dims=OCEAN_DIMS
            )
            
            all_results.append(result)
            
            # Print summary
            print(f"\n  Summary for {model_key} + {llm_key}:")
            print(f"    Avg R²: {result['summary']['avg_r2']:.4f}")
            print(f"    Avg RMSE: {result['summary']['avg_rmse']:.4f}")
            print(f"    Total time: {result['summary']['total_time_sec']:.1f}s")
            
        except Exception as e:
            print(f"\n  ❌ Error evaluating {model_key} + {llm_key}: {e}")
            import traceback
            traceback.print_exc()
            continue

print(f"\n{'='*80}")
print(f"✓ Evaluation Complete")
print(f"  Total evaluations: {len(all_results)}/{total_evals}")
print(f"{'='*80}")

In [None]:
# Save detailed results to JSON
output_file = '../05f_crossencoder_zeroshot_results.json'

output_data = {
    'metadata': {
        'timestamp': datetime.now().isoformat(),
        'n_models': len(CROSSENCODER_MODELS),
        'n_llms': len(llm_ocean_data),
        'n_dimensions': len(OCEAN_DIMS),
        'total_evaluations': len(all_results)
    },
    'models': CROSSENCODER_MODELS,
    'llms': {k: {'name': v['name'], 'n_samples': v['n_samples']} for k, v in llm_ocean_data.items()},
    'results': all_results
}

with open(output_file, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"✓ Results saved to: {output_file}")
print(f"  File size: {os.path.getsize(output_file) / 1024:.1f} KB")

## Step 6: Save Results

In [None]:
# Create summary DataFrame
summary_data = []

for result in all_results:
    # Check if total_api_calls exists (from API version) or not (from local version)
    total_api_calls = result['summary'].get('total_api_calls', 'N/A')
    
    summary_data.append({
        'model': result['model_key'],
        'llm': result['llm_key'],
        'avg_r2': result['summary']['avg_r2'],
        'avg_rmse': result['summary']['avg_rmse'],
        'avg_mae': result['summary']['avg_mae'],
        'min_r2': result['summary']['min_r2'],
        'max_r2': result['summary']['max_r2'],
        'std_r2': result['summary']['std_r2'],
        'total_time_sec': result['summary']['total_time_sec']
    })

df_summary = pd.DataFrame(summary_data)

# Save summary to CSV
summary_csv = '../05f_crossencoder_zeroshot_comparison.csv'
df_summary.to_csv(summary_csv, index=False)
print(f"✓ Summary saved to: {summary_csv}")

# Display summary
print("\n" + "="*80)
print("SUMMARY RESULTS")
print("="*80)
print(df_summary.to_string(index=False))

# Find best models
print("\n" + "="*80)
print("BEST MODELS BY R²")
print("="*80)
best_results = df_summary.nlargest(5, 'avg_r2')
for idx, row in best_results.iterrows():
    print(f"{row['model']} + {row['llm']}: R²={row['avg_r2']:.4f}, RMSE={row['avg_rmse']:.4f}")

In [None]:
# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. R² by Model
ax = axes[0, 0]
model_r2 = df_summary.groupby('model')['avg_r2'].mean().sort_values(ascending=False)
model_r2.plot(kind='bar', ax=ax, color='steelblue')
ax.set_title('Average R² by Cross-Encoder Model', fontsize=12, fontweight='bold')
ax.set_xlabel('Model')
ax.set_ylabel('Average R²')
ax.grid(axis='y', alpha=0.3)
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)

# 2. R² by LLM
ax = axes[0, 1]
llm_r2 = df_summary.groupby('llm')['avg_r2'].mean().sort_values(ascending=False)
llm_r2.plot(kind='bar', ax=ax, color='coral')
ax.set_title('Average R² by LLM Ground Truth', fontsize=12, fontweight='bold')
ax.set_xlabel('LLM')
ax.set_ylabel('Average R²')
ax.grid(axis='y', alpha=0.3)
ax.axhline(y=0, color='k', linestyle='-', linewidth=0.5)

# 3. Heatmap of R² (Model x LLM)
ax = axes[1, 0]
pivot_r2 = df_summary.pivot(index='model', columns='llm', values='avg_r2')
sns.heatmap(pivot_r2, annot=True, fmt='.3f', cmap='RdYlGn', center=0, ax=ax, cbar_kws={'label': 'R²'})
ax.set_title('R² Heatmap: Model × LLM', fontsize=12, fontweight='bold')
ax.set_xlabel('LLM Ground Truth')
ax.set_ylabel('Cross-Encoder Model')

# 4. RMSE vs R²
ax = axes[1, 1]
for model in df_summary['model'].unique():
    model_data = df_summary[df_summary['model'] == model]
    ax.scatter(model_data['avg_r2'], model_data['avg_rmse'], label=model, s=100, alpha=0.7)
ax.set_title('RMSE vs R² by Model', fontsize=12, fontweight='bold')
ax.set_xlabel('Average R²')
ax.set_ylabel('Average RMSE')
ax.legend(fontsize=8)
ax.grid(alpha=0.3)

plt.tight_layout()
output_png = '../05f_crossencoder_comparison.png'
plt.savefig(output_png, dpi=300, bbox_inches='tight')
print(f"\n✓ Visualization saved to: {output_png}")
plt.show()

## Step 8: Visualize Results

In [None]:
print("="*80)
print("FINAL SUMMARY AND RECOMMENDATIONS")
print("="*80)

# Overall statistics
overall_avg_r2 = df_summary['avg_r2'].mean()
overall_std_r2 = df_summary['avg_r2'].std()
best_model = df_summary.loc[df_summary['avg_r2'].idxmax()]
worst_model = df_summary.loc[df_summary['avg_r2'].idxmin()]

print(f"\nOverall Performance:")
print(f"  Average R² across all combinations: {overall_avg_r2:.4f} ± {overall_std_r2:.4f}")
print(f"  Best: {best_model['model']} + {best_model['llm']} (R²={best_model['avg_r2']:.4f})")
print(f"  Worst: {worst_model['model']} + {worst_model['llm']} (R²={worst_model['avg_r2']:.4f})")

# Comparison with baseline
baseline_r2 = 0.24  # BGE + Elastic Net baseline
better_than_baseline = df_summary[df_summary['avg_r2'] > baseline_r2]

print(f"\nComparison with BGE Baseline (R² = {baseline_r2:.2f}):")
print(f"  Models better than baseline: {len(better_than_baseline)}/{len(df_summary)}")

if len(better_than_baseline) > 0:
    print(f"\n  Top performers vs baseline:")
    for idx, row in better_than_baseline.nlargest(3, 'avg_r2').iterrows():
        improvement = ((row['avg_r2'] - baseline_r2) / baseline_r2) * 100
        print(f"    {row['model']} + {row['llm']}: R²={row['avg_r2']:.4f} (+{improvement:.1f}%)")
else:
    print("  ⚠️ No models outperformed the BGE baseline")

# Recommendations
print(f"\nRecommendations:")
if overall_avg_r2 > baseline_r2:
    print(f"  ✓ Cross-Encoder models show promise (avg R² = {overall_avg_r2:.4f})")
    print(f"  ✓ Consider fine-tuning the top-performing models with LoRA")
    print(f"  ✓ Best model: {best_model['model']} with {best_model['llm']} ground truth")
else:
    print(f"  ⚠️ Cross-Encoder zero-shot performance below baseline (avg R² = {overall_avg_r2:.4f})")
    print(f"  → Consider LoRA fine-tuning to improve performance")
    print(f"  → Alternatively, stick with BGE + Elastic Net approach")

print(f"\n" + "="*80)
print(f"Evaluation complete! Results saved to:")
print(f"  - JSON: ../05f_crossencoder_zeroshot_results.json")
print(f"  - CSV: ../05f_crossencoder_zeroshot_comparison.csv")
print(f"  - PNG: ../05f_crossencoder_comparison.png")
print("="*80)