# 05f - Train Elastic Net Models (All LLMs, MPNet Embeddings)

**Purpose**: Train Elastic Net regression models for all 5 LLMs, using MPNet embeddings as features, to learn the mapping from embeddings to OCEAN scores.

**Why MPNet over MiniLM?**
- MPNet-base-v2: 110M parameters, 768 dimensions (official recommendation)
- MiniLM-L12-v2: 33M parameters, 384 dimensions (baseline)
- MPNet has about 3.3x as many parameters, so it is expected (not guaranteed) to capture semantics better
- Expected R² improvement: +0.03~0.05 vs MiniLM

**Why Elastic Net?**
- Basic Ridge showed severe overfitting: train R² ≈ 0.999, test R² < 0
- Problem: 768 features vs ~400 training samples (more features than samples), i.e. the curse of dimensionality
- Elastic Net = L1 + L2 regularization:
  - L1 (Lasso): Feature selection, removes noise features
  - L2 (Ridge): Stability, prevents overfitting
- ElasticNetCV: Automatic hyperparameter tuning via 5-fold cross-validation
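
To make the L1 + L2 combination concrete, the objective that scikit-learn's `ElasticNet` minimizes can be written out directly. This is only an illustrative sketch: the helper name `elastic_net_objective` and the toy arrays are assumptions and are not part of this notebook's pipeline. Here `alpha` is the overall penalty strength and `l1_ratio` the share of it given to the L1 term.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio, intercept=0.0):
    """Value of the ElasticNet objective for coefficients w (hypothetical helper)."""
    n_samples = X.shape[0]
    residual = y - (X @ w + intercept)
    data_fit = (residual @ residual) / (2 * n_samples)           # squared-error term
    l1_penalty = alpha * l1_ratio * np.sum(np.abs(w))            # Lasso part: drives weights to zero
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)   # Ridge part: shrinks weights
    return data_fit + l1_penalty + l2_penalty

# Toy data, chosen so the three terms are easy to check by hand
X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([1.0, 2.0])
w_demo = np.array([1.0, 1.0])
obj = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=0.5)
print(obj)  # 0.25 (fit) + 1.0 (L1) + 0.5 (L2) = 1.75
```

With `l1_ratio` close to 1 the penalty behaves like Lasso, and close to 0 it behaves like Ridge, which is why the grid search covers both ends of that range.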

**Input Files**:
- mpnet_embeddings_500.npy - MPNet embeddings (500x768)
- ocean_ground_truth/[llm]_ocean_500.csv - OCEAN ground truth for each LLM

**Output Files** (per LLM):
- elasticnet_models_mpnet_[llm].pkl - 5 Elastic Net models + Scaler
- 05f_elasticnet_training_report_mpnet_[llm].json - Training report with feature importance

**Summary Output**:
- 05f_mpnet_elasticnet_comparison.png - Performance comparison across LLMs
- 05f_mpnet_vs_minilm.csv - MPNet vs MiniLM comparison

**Estimated Time**: Approximately 10-15 minutes (5 LLMs x CV grid search)

## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import pickle
import os
import warnings
# Silences all warnings, including scikit-learn ConvergenceWarning; comment out while debugging
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from datetime import datetime

print("Libraries loaded successfully")
print(f"Timestamp: {datetime.now()}")

## Step 2: Configuration

In [None]:
# LLM configurations
LLM_CONFIGS = {
    'llama': {
        'name': 'Llama-3.1-8B',
        'ocean_file': '../ocean_ground_truth/llama_3.1_8b_ocean_500.csv'
    },
    'gpt': {
        'name': 'GPT-OSS-120B',
        'ocean_file': '../ocean_ground_truth/gpt_oss_120b_ocean_500.csv'
    },
    'gemma': {
        'name': 'Gemma-2-9B',
        'ocean_file': '../ocean_ground_truth/gemma_2_9b_ocean_500.csv'
    },
    'deepseek': {
        'name': 'DeepSeek-V3.1',
        'ocean_file': '../ocean_ground_truth/deepseek_v3.1_ocean_500.csv'
    },
    'qwen': {
        'name': 'Qwen-2.5-72B',
        'ocean_file': '../ocean_ground_truth/qwen_2.5_72b_ocean_500.csv'
    }
}

# OCEAN dimensions
OCEAN_DIMS = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

# ElasticNetCV hyperparameters
ALPHAS = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
L1_RATIOS = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99]
CV_FOLDS = 5

# Random seed for reproducibility
RANDOM_STATE = 42

print(f"Configuration loaded:")
print(f"  LLM models: {len(LLM_CONFIGS)}")
print(f"  OCEAN dimensions: {len(OCEAN_DIMS)}")
print(f"  Alpha grid: {len(ALPHAS)} values")
print(f"  L1 ratio grid: {len(L1_RATIOS)} values")
print(f"  Total CV combinations: {len(ALPHAS) * len(L1_RATIOS)} per dimension")
print(f"  CV folds: {CV_FOLDS}")
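
As a rough back-of-the-envelope check on the runtime estimate, the number of cross-validation fits implied by this grid can be counted. The sketch below only multiplies the grid sizes from the configuration; it ignores the final refit per dimension, and the variable names are illustrative.

```python
# Grid sizes copied from the configuration (5 LLMs, 5 OCEAN dimensions)
n_llms, n_dims = 5, 5
n_alphas, n_l1_ratios, n_folds = 7, 7, 5

fits_per_dimension = n_alphas * n_l1_ratios * n_folds   # cross-validation fits only
fits_per_llm = fits_per_dimension * n_dims
total_cv_fits = fits_per_llm * n_llms

print(fits_per_dimension, fits_per_llm, total_cv_fits)  # 245 1225 6125
```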

## Step 3: Load MPNet Embeddings (Shared)

In [None]:
print("="*80)
print("Loading MPNet Embeddings")
print("="*80)

embedding_file = '../mpnet_embeddings_500.npy'
print(f"\nLoading: {embedding_file}")
X_full = np.load(embedding_file)
print(f"Embeddings shape: {X_full.shape}")
print(f"  Expected: (500, 768) - MPNet has 768 dimensions")
print(f"  Data type: {X_full.dtype}")
print(f"  Memory usage: {X_full.nbytes / 1024 / 1024:.1f} MB")
print(f"  Value range: [{X_full.min():.4f}, {X_full.max():.4f}]")

# Verify dimensions
assert X_full.shape[1] == 768, f"Expected 768 dimensions for MPNet, got {X_full.shape[1]}"
print(f"\n✓ Dimension check passed: {X_full.shape[1]} dimensions")
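
Beyond the dimension check, a few extra sanity checks on the embedding matrix can catch silent problems such as NaN values or all-zero rows. The helper below is a minimal sketch: `summarize_embeddings` is a hypothetical name, and the random matrix is a synthetic stand-in for the real `.npy` file.

```python
import numpy as np

def summarize_embeddings(X, expected_dim):
    """Return basic sanity statistics for an (n_samples, n_features) matrix."""
    if X.ndim != 2 or X.shape[1] != expected_dim:
        raise ValueError(f"Expected shape (n, {expected_dim}), got {X.shape}")
    norms = np.linalg.norm(X, axis=1)  # one L2 norm per embedding row
    return {
        'n_samples': int(X.shape[0]),
        'all_finite': bool(np.isfinite(X).all()),
        'n_zero_rows': int((norms == 0).sum()),
        'min_norm': float(norms.min()),
        'max_norm': float(norms.max()),
    }

# Synthetic stand-in for the real embeddings, used only to exercise the helper
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10, 768))
stats = summarize_embeddings(X_demo, expected_dim=768)
print(stats)
```

On the real data this would be called as `summarize_embeddings(X_full, expected_dim=768)`.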

## Step 4: Train Elastic Net Models for Each LLM

In [None]:
# Storage for all results
all_results = {}
minilm_comparison = {}

for llm_key, llm_config in LLM_CONFIGS.items():
    print("\n" + "="*80)
    print(f"Training Elastic Net Models: {llm_config['name']} (MPNet Embeddings)")
    print("="*80)
    
    # Load OCEAN targets
    print(f"\n[1/7] Loading OCEAN targets...")
    ocean_file = llm_config['ocean_file']
    y_df = pd.read_csv(ocean_file)
    print(f"  Shape: {y_df.shape}")
    print(f"  Columns: {y_df.columns.tolist()}")
    
    # Check and handle NaN values
    nan_count_total = y_df.isnull().sum().sum()
    if nan_count_total > 0:
        print(f"  Warning: Found {nan_count_total} NaN values")
        nan_indices = y_df[y_df.isnull().any(axis=1)].index
        y_df = y_df.dropna()
        X = np.delete(X_full, nan_indices, axis=0)
        print(f"  After dropping NaN: {len(y_df)} samples")
    else:
        X = X_full.copy()
    
    # Verify consistency
    if len(X) != len(y_df):
        raise ValueError(f"Data inconsistency: X={len(X)}, y={len(y_df)}")
    
    # Train/test split
    print(f"\n[2/7] Splitting data (80/20)...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_df,
        test_size=0.2,
        random_state=RANDOM_STATE
    )
    print(f"  Training: {X_train.shape[0]} samples")
    print(f"  Test: {X_test.shape[0]} samples")
    print(f"  Feature-to-sample ratio: {X_train.shape[1] / X_train.shape[0]:.2f}:1")
    print(f"  (768 features / {X_train.shape[0]} samples)")
    
    # Standardize
    print(f"\n[3/7] Standardizing features...")
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print(f"  Train mean={X_train_scaled.mean():.6f}, std={X_train_scaled.std():.6f}")
    print(f"  Test mean={X_test_scaled.mean():.6f}, std={X_test_scaled.std():.6f}")
    
    # Train models
    print(f"\n[4/7] Training Elastic Net models (5 dimensions)...")
    print(f"  ElasticNetCV parameters:")
    print(f"    - Alphas: {ALPHAS}")
    print(f"    - L1 ratios: {L1_RATIOS}")
    print(f"    - CV folds: {CV_FOLDS}")
    print(f"    - Total combinations: {len(ALPHAS) * len(L1_RATIOS)}")
    
    elasticnet_models = {}
    training_results = {}
    
    for i, dim in enumerate(OCEAN_DIMS):
        print(f"\n  [{i+1}/5] Training {dim}...")
        
        # Get target
        y_train_dim = y_train[dim].values
        y_test_dim = y_test[dim].values
        
        # Train ElasticNetCV
        model = ElasticNetCV(
            alphas=ALPHAS,
            l1_ratio=L1_RATIOS,
            cv=CV_FOLDS,
            random_state=RANDOM_STATE,
            max_iter=10000,
            n_jobs=-1,
            verbose=0
        )
        model.fit(X_train_scaled, y_train_dim)
        
        # Predict
        y_train_pred = model.predict(X_train_scaled)
        y_test_pred = model.predict(X_test_scaled)
        
        # Metrics
        train_r2 = r2_score(y_train_dim, y_train_pred)
        test_r2 = r2_score(y_test_dim, y_test_pred)
        train_rmse = np.sqrt(mean_squared_error(y_train_dim, y_train_pred))
        test_rmse = np.sqrt(mean_squared_error(y_test_dim, y_test_pred))
        train_mae = mean_absolute_error(y_train_dim, y_train_pred)
        test_mae = mean_absolute_error(y_test_dim, y_test_pred)
        
        # Feature importance analysis
        coefficients = model.coef_
        non_zero_count = np.sum(np.abs(coefficients) > 1e-6)
        sparsity = (1 - non_zero_count / len(coefficients)) * 100
        
        # Top features
        top_indices = np.argsort(np.abs(coefficients))[-20:][::-1]
        top_features = [
            {'index': int(idx), 'coefficient': float(coefficients[idx])}
            for idx in top_indices
        ]
        
        # Save model and results
        elasticnet_models[dim] = model
        training_results[dim] = {
            'train_r2': float(train_r2),
            'test_r2': float(test_r2),
            'train_rmse': float(train_rmse),
            'test_rmse': float(test_rmse),
            'train_mae': float(train_mae),
            'test_mae': float(test_mae),
            'best_alpha': float(model.alpha_),
            'best_l1_ratio': float(model.l1_ratio_),
            'non_zero_features': int(non_zero_count),
            'sparsity_percent': float(sparsity),
            'top_20_features': top_features,
            'model_intercept': float(model.intercept_)
        }
        
        print(f"      Best alpha: {model.alpha_:.2f}, l1_ratio: {model.l1_ratio_:.2f}")
        print(f"      Non-zero features: {non_zero_count}/{len(coefficients)} ({100-sparsity:.1f}%)")
        print(f"      Train R²: {train_r2:.4f} | Test R²: {test_r2:.4f}")
        print(f"      Train RMSE: {train_rmse:.4f} | Test RMSE: {test_rmse:.4f}")
    
    # Save models
    print(f"\n[5/7] Saving models...")
    model_data = {
        'models': elasticnet_models,
        'scaler': scaler,
        'ocean_dims': OCEAN_DIMS,
        'training_results': training_results,
        'training_timestamp': datetime.now().isoformat(),
        'llm_model': llm_config['name'],
        'embedding_model': 'sentence-transformers/all-mpnet-base-v2',
        'embedding_dimension': 768,
        'hyperparameters': {
            'alphas': ALPHAS,
            'l1_ratios': L1_RATIOS,
            'cv_folds': CV_FOLDS
        }
    }
    
    model_file = f'../elasticnet_models_mpnet_{llm_key}.pkl'
    with open(model_file, 'wb') as f:
        pickle.dump(model_data, f)
    print(f"  Saved: {model_file} ({os.path.getsize(model_file) / 1024:.1f} KB)")
    
    # Try loading MiniLM results for comparison
    print(f"\n[6/7] Looking for MiniLM results for comparison...")
    minilm_report_file = f'../05f_elasticnet_training_report_minilm_{llm_key}.json'
    try:
        with open(minilm_report_file, 'r') as f:
            minilm_data = json.load(f)
        minilm_comparison[llm_key] = minilm_data['training_results']
        print(f"  ✓ MiniLM report loaded: {minilm_report_file}")
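    except (FileNotFoundError, KeyError, json.JSONDecodeError) as e:
        # Only a missing or malformed report is treated as "no comparison"; other errors surface
        print(f"  ✗ MiniLM report unavailable ({type(e).__name__}); skipping comparison")
        minilm_comparison[llm_key] = None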
    
    # Generate report
    print(f"\n[7/7] Generating training report...")
    report = {
        'phase': f'05f - Train Elastic Net Models ({llm_config["name"]}, MPNet Embeddings)',
        'timestamp': datetime.now().isoformat(),
        'llm_model': llm_config['name'],
        'embedding_model': 'sentence-transformers/all-mpnet-base-v2',
        'embedding_parameters': '110M',
        'embedding_dimension': 768,
        'training_samples': int(X_train.shape[0]),
        'test_samples': int(X_test.shape[0]),
        'model_type': 'Elastic Net (L1+L2 Regularization)',
        'hyperparameters': model_data['hyperparameters'],
        'ocean_dimensions': OCEAN_DIMS,
        'model_file': model_file,
        'training_results': training_results
    }
    
    # Summary metrics
    test_r2_scores = [training_results[dim]['test_r2'] for dim in OCEAN_DIMS]
    test_rmse_scores = [training_results[dim]['test_rmse'] for dim in OCEAN_DIMS]
    test_mae_scores = [training_results[dim]['test_mae'] for dim in OCEAN_DIMS]
    avg_sparsity = np.mean([training_results[dim]['sparsity_percent'] for dim in OCEAN_DIMS])
    
    report['summary_metrics'] = {
        'avg_test_r2': float(np.mean(test_r2_scores)),
        'avg_test_rmse': float(np.mean(test_rmse_scores)),
        'avg_test_mae': float(np.mean(test_mae_scores)),
        'min_test_r2': float(np.min(test_r2_scores)),
        'max_test_r2': float(np.max(test_r2_scores)),
        'avg_sparsity_percent': float(avg_sparsity)
    }
    
    report_file = f'../05f_elasticnet_training_report_mpnet_{llm_key}.json'
    with open(report_file, 'w') as f:
        json.dump(report, f, indent=2)
    print(f"  Report saved: {report_file}")
    
    # Store for final comparison
    all_results[llm_key] = {
        'name': llm_config['name'],
        'training_results': training_results,
        'summary': report['summary_metrics']
    }
    
    # Print summary
    print(f"\n  Summary for {llm_config['name']}:")
    print(f"    Avg Test R²: {report['summary_metrics']['avg_test_r2']:.4f}")
    print(f"    Avg Sparsity: {avg_sparsity:.1f}%")
    print(f"    Test R² range: [{report['summary_metrics']['min_test_r2']:.4f}, {report['summary_metrics']['max_test_r2']:.4f}]")

print("\n" + "="*80)
print("All MPNet models trained successfully!")
print("="*80)
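
Each saved `.pkl` bundles the five models with the scaler they were trained on, so new embeddings need to be scaled with that same scaler before prediction. The sketch below shows one way this could look. `predict_ocean` is a hypothetical helper, and the small bundle is built from synthetic data with only two dimensions so that the example runs without the real files.

```python
import pickle
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

def predict_ocean(model_data, embeddings):
    """Scale raw embeddings with the stored scaler, then predict each dimension."""
    X_scaled = model_data['scaler'].transform(embeddings)
    return {dim: model_data['models'][dim].predict(X_scaled)
            for dim in model_data['ocean_dims']}

# Tiny synthetic bundle using the same keys as the saved .pkl (toy data only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 8))
dims = ['openness', 'neuroticism']
scaler = StandardScaler().fit(X_toy)
models = {}
for j, dim in enumerate(dims):
    y_toy = X_toy[:, j] + 0.1 * rng.normal(size=40)
    models[dim] = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(scaler.transform(X_toy), y_toy)
bundle = {'models': models, 'scaler': scaler, 'ocean_dims': dims}

# Round-trip through pickle, mirroring how the bundle is saved and reloaded
restored = pickle.loads(pickle.dumps(bundle))
preds = predict_ocean(restored, X_toy[:3])
print({dim: values.shape for dim, values in preds.items()})
```

With the real artifacts, `model_data` would instead come from `pickle.load` on a file such as `../elasticnet_models_mpnet_llama.pkl`, and `embeddings` would be a 768-column MPNet matrix.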

## Step 5: Generate Comparison Visualizations

In [None]:
print("\n" + "="*80)
print("Generating Comparison Visualizations")
print("="*80)

# Create comparison dataframe
comparison_data = []

for llm_key, results in all_results.items():
    for dim in OCEAN_DIMS:
        mpnet_r2 = results['training_results'][dim]['test_r2']
        mpnet_rmse = results['training_results'][dim]['test_rmse']
        sparsity = results['training_results'][dim]['sparsity_percent']
        
        # Get MiniLM results if available; NaN marks "missing" and keeps the columns numeric
        if minilm_comparison.get(llm_key):
            minilm_r2 = minilm_comparison[llm_key][dim]['test_r2']
            minilm_rmse = minilm_comparison[llm_key][dim]['test_rmse']
        else:
            minilm_r2 = np.nan
            minilm_rmse = np.nan
        
        comparison_data.append({
            'LLM': results['name'],
            'llm_key': llm_key,
            'Dimension': dim,
            'MPNet_R2': mpnet_r2,
            'MiniLM_R2': minilm_r2,
            # NaN propagates when MiniLM is missing; a legitimate R² of 0.0 is still compared
            'R2_Improvement': mpnet_r2 - minilm_r2,
            'MPNet_RMSE': mpnet_rmse,
            'MiniLM_RMSE': minilm_rmse,
            'Sparsity_%': sparsity
        })

comparison_df = pd.DataFrame(comparison_data)

# Save comparison table
comparison_file = '../05f_mpnet_vs_minilm.csv'
comparison_df.to_csv(comparison_file, index=False)
print(f"\nComparison table saved: {comparison_file}")
print(f"\nPreview:")
print(comparison_df.head(10))
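
The long-format table can be hard to scan across 25 rows. One option is to pivot it into an LLM-by-dimension matrix. The sketch below uses a toy frame with the same column names as `comparison_df`; the LLM labels and R² values are made up purely for illustration.

```python
import pandas as pd

# Toy rows with the same column names as comparison_df (values are made up)
toy = pd.DataFrame([
    {'LLM': 'A', 'Dimension': 'openness', 'MPNet_R2': 0.25},
    {'LLM': 'A', 'Dimension': 'neuroticism', 'MPNet_R2': 0.10},
    {'LLM': 'B', 'Dimension': 'openness', 'MPNet_R2': 0.20},
    {'LLM': 'B', 'Dimension': 'neuroticism', 'MPNet_R2': 0.30},
])

# One row per LLM, one column per OCEAN dimension
r2_matrix = toy.pivot(index='LLM', columns='Dimension', values='MPNet_R2')
print(r2_matrix.round(2))
```

Applied to the real table, the same call would be `comparison_df.pivot(index='LLM', columns='Dimension', values='MPNet_R2')`.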

In [None]:
# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('MPNet vs MiniLM Performance Comparison (All LLMs)', fontsize=16, fontweight='bold')

# 1. Test R² Comparison
ax1 = axes[0, 0]
x_pos = np.arange(len(comparison_df))
width = 0.35

ax1.bar(x_pos - width/2, comparison_df['MPNet_R2'], width, label='MPNet (768d)', color='#3498db', alpha=0.8)
if comparison_df['MiniLM_R2'].notna().any():
    ax1.bar(x_pos + width/2, comparison_df['MiniLM_R2'], width, label='MiniLM (384d)', color='#95a5a6', alpha=0.8)
ax1.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax1.set_xlabel('Model-Dimension', fontsize=10)
ax1.set_ylabel('Test R² Score', fontsize=10)
ax1.set_title('Test R² Comparison (Higher is Better)', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.tick_params(axis='x', rotation=90, labelsize=7)
ax1.set_xticks(x_pos)
ax1.set_xticklabels([f"{row['llm_key'][:3]}-{row['Dimension'][:3]}" for _, row in comparison_df.iterrows()])

# 2. Average R² by LLM
ax2 = axes[0, 1]
avg_by_llm = comparison_df.groupby('LLM')[['MPNet_R2', 'MiniLM_R2']].mean()
avg_by_llm.plot(kind='bar', ax=ax2, color=['#3498db', '#95a5a6'], alpha=0.8)
ax2.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax2.set_xlabel('LLM Model', fontsize=10)
ax2.set_ylabel('Average Test R²', fontsize=10)
ax2.set_title('Average Test R² by LLM', fontsize=12, fontweight='bold')
ax2.legend(['MPNet (768d)', 'MiniLM (384d)'])
ax2.grid(True, alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

# 3. R² Improvement (MPNet - MiniLM)
ax3 = axes[1, 0]
if comparison_df['R2_Improvement'].notna().any():
    colors = ['#2ecc71' if x > 0 else '#e74c3c' for x in comparison_df['R2_Improvement']]
    ax3.bar(x_pos, comparison_df['R2_Improvement'], color=colors, alpha=0.8)
    ax3.axhline(y=0, color='black', linestyle='--', linewidth=1)
    ax3.set_xlabel('Model-Dimension', fontsize=10)
    ax3.set_ylabel('R² Improvement', fontsize=10)
    ax3.set_title('R² Improvement (MPNet - MiniLM)', fontsize=12, fontweight='bold')
    ax3.grid(True, alpha=0.3)
    ax3.tick_params(axis='x', rotation=90, labelsize=7)
    ax3.set_xticks(x_pos)
    ax3.set_xticklabels([f"{row['llm_key'][:3]}-{row['Dimension'][:3]}" for _, row in comparison_df.iterrows()])
else:
    ax3.text(0.5, 0.5, 'MiniLM data not available', ha='center', va='center', transform=ax3.transAxes)

# 4. Sparsity Analysis
ax4 = axes[1, 1]
sparsity_by_llm = comparison_df.groupby('LLM')['Sparsity_%'].mean().sort_values(ascending=False)
sparsity_by_llm.plot(kind='barh', ax=ax4, color='#9b59b6', alpha=0.8)
ax4.set_xlabel('Average Sparsity (%)', fontsize=10)
ax4.set_ylabel('LLM Model', fontsize=10)
ax4.set_title('Feature Sparsity by LLM (% Features Set to Zero)', fontsize=12, fontweight='bold')
ax4.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
viz_file = '../05f_mpnet_elasticnet_comparison.png'
plt.savefig(viz_file, dpi=300, bbox_inches='tight')
print(f"\nVisualization saved: {viz_file}")
plt.show()

## Step 6: Summary Statistics

In [None]:
print("\n" + "="*80)
print("FINAL SUMMARY (MPNet Embeddings)")
print("="*80)

print("\n1. Overall Performance (MPNet):")
print(f"   Average Test R² across all models: {comparison_df['MPNet_R2'].mean():.4f}")
print(f"   Best Test R²: {comparison_df['MPNet_R2'].max():.4f}")
print(f"   Worst Test R²: {comparison_df['MPNet_R2'].min():.4f}")
print(f"   Std Dev: {comparison_df['MPNet_R2'].std():.4f}")

if comparison_df['MiniLM_R2'].notna().any():
    print("\n2. Comparison with MiniLM:")
    print(f"   Average MiniLM Test R²: {comparison_df['MiniLM_R2'].mean():.4f}")
    print(f"   Average Improvement: {comparison_df['R2_Improvement'].mean():.4f}")
    print(f"   Models improved: {(comparison_df['R2_Improvement'] > 0).sum()}/{comparison_df['R2_Improvement'].notna().sum()}")
    print(f"   Best improvement: {comparison_df['R2_Improvement'].max():.4f}")
    print(f"   Worst change: {comparison_df['R2_Improvement'].min():.4f}")

print("\n3. Feature Selection (Sparsity):")
print(f"   Average sparsity: {comparison_df['Sparsity_%'].mean():.1f}%")
print(f"   Average features retained: {768 * (1 - comparison_df['Sparsity_%'].mean()/100):.0f}/768")

print("\n4. Best Performing LLM:")
best_llm = comparison_df.groupby('LLM')['MPNet_R2'].mean().idxmax()
best_r2 = comparison_df.groupby('LLM')['MPNet_R2'].mean().max()
print(f"   {best_llm}: {best_r2:.4f}")

print("\n5. Best Performing Dimension:")
best_dim = comparison_df.groupby('Dimension')['MPNet_R2'].mean().idxmax()
best_dim_r2 = comparison_df.groupby('Dimension')['MPNet_R2'].mean().max()
print(f"   {best_dim}: {best_dim_r2:.4f}")

print("\n6. Model Comparison:")
print(f"   MPNet: 110M parameters, 768 dimensions")
print(f"   MiniLM: 33M parameters, 384 dimensions")
print(f"   Parameter ratio: 3.3x larger")
print(f"   Dimension ratio: 2x larger")

print("\n" + "="*80)
print("Output Files Generated:")
print("="*80)
print("Models:")
for llm_key in LLM_CONFIGS.keys():
    print(f"  - elasticnet_models_mpnet_{llm_key}.pkl")
print("\nReports:")
for llm_key in LLM_CONFIGS.keys():
    print(f"  - 05f_elasticnet_training_report_mpnet_{llm_key}.json")
print("\nComparison:")
print(f"  - 05f_mpnet_vs_minilm.csv")
print(f"  - 05f_mpnet_elasticnet_comparison.png")

print("\n" + "="*80)
print("05f Complete (MPNet)!")
print("="*80)
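
The sparsity figures in the summary come from counting coefficients whose magnitude exceeds a small tolerance. The sketch below reproduces that calculation on a synthetic coefficient vector; `sparsity_percent` is a hypothetical helper name, and the choice of 48 non-zero weights is arbitrary.

```python
import numpy as np

def sparsity_percent(coefficients, tol=1e-6):
    """Return (percent of coefficients treated as zero, number retained)."""
    non_zero = int(np.sum(np.abs(coefficients) > tol))
    return (1 - non_zero / len(coefficients)) * 100, non_zero

# Synthetic coefficient vector: 48 of 768 weights are non-zero
coef_demo = np.zeros(768)
coef_demo[:48] = 0.5
sparsity, retained = sparsity_percent(coef_demo)
print(f"{sparsity:.2f}% sparse, {retained}/768 retained")  # 93.75% sparse, 48/768 retained
```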

## Analysis Notes

**Expected Results**:

1. **Performance Improvement**: 
   - MPNet should show +0.03~0.05 R² improvement vs MiniLM
   - Predicted R²: 0.22-0.27 (vs MiniLM's 0.19-0.24)

2. **Feature Selection**:
   - L1 regularization should eliminate 90-95% of features
   - 768 dims → ~40-80 retained features per dimension

3. **Model Comparison**:
   - MPNet (110M): Better semantic understanding
   - MiniLM (33M): Faster but less accurate
   - BGE (326M): Most powerful but slowest

4. **Next Steps**:
   - If MPNet improves test R² by at least 0.03 over MiniLM: Use MPNet for production
   - If the improvement is below 0.03: Stick with MiniLM for speed
   - Consider trying larger models (BGE, GTR-T5-XL) if more improvement needed
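
The next-steps rule can be expressed as a small helper. This is one possible reading of the rule, with 0.03 treated as the minimum gain that justifies the larger model; `choose_embedding` and its `min_gain` parameter are hypothetical names, and the R² values passed in are placeholders rather than results.

```python
def choose_embedding(mpnet_r2, minilm_r2, min_gain=0.03):
    """Return the favored embedding model and the observed R² gain."""
    gain = mpnet_r2 - minilm_r2
    return ('MPNet' if gain >= min_gain else 'MiniLM'), gain

# Placeholder averages, not measured results
choice, gain = choose_embedding(0.25, 0.20)
print(choice, round(gain, 2))  # MPNet 0.05

small_choice, small_gain = choose_embedding(0.21, 0.20)
print(small_choice)  # MiniLM
```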