# 05g - Apply Ridge Models to Generate Full Dataset OCEAN Features (Gemma-2-9B)

**Purpose**: Apply trained Ridge models (Gemma-2-9B) to all 34,529 samples to generate complete OCEAN features

**Input Files**:
- `data/loan_final_desc50plus.csv` - Full dataset (34,529 samples)
- `ridge_models_gemma.pkl` - Trained Ridge models (Gemma-2-9B)

**Output Files**:
- `loan_with_ocean_gemma.csv` - Complete dataset with OCEAN features (34,529 x N)
- `bge_large_embeddings_full.npy` - Full BGE embeddings (34,529 x 1024)
- `05g_full_data_summary_gemma.json` - Processing report

**WARNING: Estimated Time: 3-4 hours** (34,529 API calls, 0.3-0.5 second delay each)

**Recommendation**: Run this notebook in the background with `screen` or `tmux` so that a dropped connection does not interrupt it
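As a rough check on the time estimate, the fixed per-call delays can be totaled on their own. This is only a sketch: it assumes one API call per sample and ignores request latency and retries, so the real run would take longer. The helper name `delay_hours` is hypothetical.

```python
# Rough lower bound on runtime from the per-call delay alone.
# Assumes one call per sample; ignores network latency and retries.
N_SAMPLES = 34_529

def delay_hours(n_calls: int, delay_seconds: float) -> float:
    """Total sleep time in hours for n_calls at a fixed per-call delay."""
    return n_calls * delay_seconds / 3600

low = delay_hours(N_SAMPLES, 0.3)
high = delay_hours(N_SAMPLES, 0.5)
print(f"Delay alone: {low:.1f}-{high:.1f} hours")
```

If most calls succeed, the 0.3 second delay applies and the sleep time sits near the lower figure, with API latency added on top.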

## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import pickle
import requests
import json
import os
import time
import warnings
from datetime import datetime
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

## Step 2: Load Full Dataset and Ridge Models (Gemma-2-9B)

In [None]:
print("="*100)
print("Loading data and models (Gemma-2-9B)")
print("="*100)

# Load full dataset
print("\nLoading full dataset...")
data_file = '../data/loan_final_desc50plus.csv'
df_full = pd.read_csv(data_file, low_memory=False)
print(f"Data loaded successfully: {len(df_full)} rows x {len(df_full.columns)} columns")
print(f"  Columns: {df_full.columns.tolist()[:5]}... (showing first 5 columns)")

# Load Ridge models and Scaler (Gemma-2-9B)
print("\nLoading Ridge models (Gemma-2-9B)...")
model_file = '../ridge_models_gemma.pkl'
with open(model_file, 'rb') as f:
    model_data = pickle.load(f)
    ridge_models = model_data['models']
    scaler = model_data['scaler']
    OCEAN_DIMS = model_data['ocean_dims']

print(f"Models loaded successfully (Gemma-2-9B)")
print(f"  OCEAN dimensions: {OCEAN_DIMS}")
print(f"  Scaler: {type(scaler).__name__}")

# Load HF Token
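def load_hf_token():
    """Read HF_TOKEN from ../.env, falling back to the environment variable."""
    try:
        with open('../.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and lines without a key=value pair
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                if key.strip() == 'HF_TOKEN':
                    return value.strip()
    except OSError:
        pass  # No readable .env file; use the environment variable instead
    return os.getenv('HF_TOKEN', '')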

hf_token = load_hf_token()
print(f"\nHF Token loaded: {'yes' if hf_token else 'no'}")

total_samples = len(df_full)
print(f"\nTotal samples to process: {total_samples:,}")
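Before starting a multi-hour run, it may be worth confirming that the pickled bundle has the layout this notebook reads (`models`, `scaler`, `ocean_dims`). The following is a minimal sketch; `validate_bundle` is a hypothetical helper, and the dimension names in the toy bundle are placeholders rather than the actual contents of `ridge_models_gemma.pkl`.

```python
# Hypothetical helper: checks that a loaded model bundle has the layout
# this notebook reads ('models', 'scaler', 'ocean_dims').
REQUIRED_KEYS = ("models", "scaler", "ocean_dims")

def validate_bundle(bundle: dict) -> list:
    """Return a list of problems; an empty list means the bundle looks usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in bundle]
    if problems:
        return problems
    for dim in bundle["ocean_dims"]:
        if dim not in bundle["models"]:
            problems.append(f"no model for dimension: {dim}")
    return problems

# Toy bundle with placeholder objects standing in for the real models
toy = {"models": {"openness": object()}, "scaler": object(),
       "ocean_dims": ["openness", "conscientiousness"]}
print(validate_bundle(toy))
```

In the notebook this could be called as `validate_bundle(model_data)` right after `pickle.load`, stopping early if the returned list is non-empty.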

## Step 3: Define Embedding Extraction Function

In [None]:
def extract_bge_embedding(text: str, max_retries: int = 3, retry_delay: int = 2) -> np.ndarray:
    """
    Call HF Inference API to extract BGE-Large embeddings
    
    Returns zero vector on failure to maintain index alignment
    """
    api_url = "https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5"
    headers = {
        "Authorization": f"Bearer {hf_token}",
        "Content-Type": "application/json"
    }
    
    for attempt in range(max_retries):
        try:
            response = requests.post(
                api_url,
                headers=headers,
                json={"inputs": text},
                timeout=30
            )
            
            if response.status_code == 200:
                features = response.json()

                # Anything other than a non-empty list (e.g. an error dict) counts as a failure
                if not isinstance(features, list) or len(features) == 0:
                    return np.zeros(1024)

                # Token-level output (a list of vectors) is mean-pooled into one vector
                if isinstance(features[0], list):
                    avg_feature = np.mean(features, axis=0)
                else:
                    avg_feature = features

                return np.array(avg_feature)
            
            elif response.status_code == 503:
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)
                    continue
                else:
                    return np.zeros(1024)
            
            elif response.status_code == 429:
                if attempt < max_retries - 1:
                    time.sleep(retry_delay * (attempt + 1))
                    continue
                else:
                    return np.zeros(1024)
            
            else:
                return np.zeros(1024)
        
        except (requests.exceptions.RequestException, ValueError):  # Network error or malformed JSON
            if attempt < max_retries - 1:
                time.sleep(retry_delay)
                continue
            else:
                return np.zeros(1024)
    
    return np.zeros(1024)

print("Embedding extraction function defined")
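The response handling above accepts either a single vector or a list of token-level vectors that is averaged. A standalone sketch of that pooling step, with an added shape check, is shown below. `pool_embedding` is a hypothetical helper and not part of the notebook; it assumes the payload is a nested list of numbers whose last dimension is 1024.

```python
import numpy as np

EMBED_DIM = 1024  # BGE-Large embedding size used throughout this notebook

def pool_embedding(features, dim: int = EMBED_DIM) -> np.ndarray:
    """Reduce an API payload to one vector of length `dim`, or zeros if unusable."""
    try:
        arr = np.asarray(features, dtype=float)
    except (TypeError, ValueError):
        return np.zeros(dim)  # e.g. an error dict or ragged nesting
    if arr.ndim == 1 and arr.shape[0] == dim:
        return arr            # already a sentence-level vector
    if arr.ndim >= 2 and arr.shape[-1] == dim:
        # Token-level (or batched) output: average everything but the last axis
        return arr.reshape(-1, dim).mean(axis=0)
    return np.zeros(dim)

# Two fake token vectors are averaged into a single 1024-dim vector
tokens = [[0.0] * EMBED_DIM, [2.0] * EMBED_DIM]
print(pool_embedding(tokens).shape)
```

Checking the shape here means a malformed payload becomes a zero vector immediately, instead of surfacing later when the list of embeddings is converted to a matrix.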

## Step 4: Batch Extract All Sample Embeddings (Takes 3-4 hours)

In [None]:
print("="*100)
print("Step 1: Extract Full BGE-Large Embeddings (34,529 samples)")
print("="*100)
print(f"\nEstimated time: 3-4 hours")
print(f"WARNING: This step will make {total_samples:,} API calls")
print(f"\nStart time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\nProgress:\n")

all_embeddings = []
success_count = 0
error_count = 0

start_time = time.time()
checkpoint_interval = 2000  # Log a checkpoint marker every 2000 samples

for idx in range(total_samples):
    text = df_full.iloc[idx].get('desc', '')
    text = text if isinstance(text, str) else ''  # NaN or missing descriptions become empty strings
    
    if len(text.strip()) < 10:
        all_embeddings.append(np.zeros(1024))
        error_count += 1
    else:
        emb = extract_bge_embedding(text)
        all_embeddings.append(emb)
        
        if not np.any(emb):  # All-zero vector indicates a failed extraction
            error_count += 1
        else:
            success_count += 1
    
    # Show progress (update every 100 samples)
    if (idx + 1) % 100 == 0:
        elapsed = time.time() - start_time
        rate = (idx + 1) / elapsed if elapsed > 0 else 0
        remaining = (total_samples - idx - 1) / rate if rate > 0 else 0
        
        progress = (idx + 1) / total_samples * 100
        
        hours = int(remaining // 3600)
        minutes = int((remaining % 3600) // 60)
        
        print(f"[{idx+1:5d}/{total_samples}] ({progress:5.1f}%) | Success: {success_count:5d}, Failed: {error_count:5d} | "
              f"Rate: {rate:.2f} samples/s | Remaining: {hours}h {minutes}m")
    
    # Checkpoint marker (log only; embeddings are not written to disk here)
    if (idx + 1) % checkpoint_interval == 0:
        print(f"\n  Checkpoint: Processed {idx+1:,} samples...")
    
    # Adjust delay based on success rate
    if success_count + error_count > 0:
        success_rate = success_count / (success_count + error_count)
        if success_rate < 0.5:  # Low success rate, increase delay
            time.sleep(0.5)
        else:
            time.sleep(0.3)

elapsed_total = time.time() - start_time

print(f"\n" + "="*100)
print(f"Embedding extraction complete")
print("="*100)
print(f"\nTime elapsed: {elapsed_total:.1f} seconds ({elapsed_total/60:.1f} minutes)")
print(f"Completion time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nResults:")
print(f"  Success: {success_count:,}/{total_samples:,} ({success_count/total_samples*100:.2f}%)")
print(f"  Failed: {error_count:,}/{total_samples:,} ({error_count/total_samples*100:.2f}%)")
print(f"  Average rate: {total_samples/elapsed_total:.2f} samples/second")

# Convert to numpy array
X_full = np.array(all_embeddings)
print(f"\nEmbedding matrix:")
print(f"  Shape: {X_full.shape}")
print(f"  Data type: {X_full.dtype}")
print(f"  Memory usage: {X_full.nbytes / 1024 / 1024:.1f} MB")
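Because the loop above keeps all embeddings in memory until the end, an interruption partway through a multi-hour run loses the work done so far. One possible way to make the run resumable is sketched below. The helpers `save_checkpoint` and `load_checkpoint` and the file name `embeddings_checkpoint.npy` are assumptions for illustration, and the toy data uses width-4 vectors instead of real embeddings.

```python
import os
import tempfile
import numpy as np

def save_checkpoint(path: str, embeddings: list) -> None:
    """Persist the embeddings collected so far as a 2-D array."""
    np.save(path, np.vstack(embeddings))

def load_checkpoint(path: str) -> list:
    """Return previously saved rows as a list, or an empty list if none exist."""
    if not os.path.exists(path):
        return []
    return list(np.load(path))

# Toy run: 5 fake "embeddings" of width 4, interrupted after 3 rows
ckpt = os.path.join(tempfile.mkdtemp(), "embeddings_checkpoint.npy")
done = [np.full(4, i, dtype=float) for i in range(3)]
save_checkpoint(ckpt, done)

resumed = load_checkpoint(ckpt)
start_idx = len(resumed)          # resume the loop at this row index
for i in range(start_idx, 5):
    resumed.append(np.full(4, i, dtype=float))
print(start_idx, np.vstack(resumed).shape)
```

In the extraction loop, the save would go inside the `checkpoint_interval` branch and the loop would start from `start_idx` rather than 0. The success and failure counters would also need to be recomputed from the restored rows.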

## Step 5: Use Ridge Models (Gemma-2-9B) to Generate OCEAN Features

In [None]:
print("="*100)
print("Step 2: Generate OCEAN features using Gemma-2-9B Ridge models")
print("="*100)

# Standardize embeddings
print(f"\nStandardizing embeddings...")
X_full_scaled = scaler.transform(X_full)
print(f"Standardization complete")

# Use Ridge models to predict OCEAN scores
print(f"\nUsing Gemma-2-9B Ridge models to predict OCEAN scores...")
ocean_predictions = {}

for dim in OCEAN_DIMS:
    print(f"  {dim}...", end=' ', flush=True)
    model = ridge_models[dim]
    predictions = model.predict(X_full_scaled)
    
    # Clip to [0, 1] range
    predictions = np.clip(predictions, 0.0, 1.0)
    ocean_predictions[dim] = predictions
    
    print(f"mean={predictions.mean():.3f}, std={predictions.std():.3f}")

print(f"\nOCEAN feature generation complete")
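Rows whose embedding extraction failed are stored as zero vectors, and the scaler and Ridge models still produce a score for them. If those scores should not be treated as real predictions downstream, one option is to mark them as missing. The sketch below uses a hypothetical helper, `mask_failed_rows`, on toy data; it is not part of the notebook's pipeline.

```python
import numpy as np

def mask_failed_rows(X: np.ndarray, predictions: np.ndarray) -> np.ndarray:
    """Return a copy of `predictions` with NaN wherever the embedding row is all zeros."""
    failed = ~X.any(axis=1)               # True for zero-vector (failed) rows
    masked = predictions.astype(float)    # astype returns a new array
    masked[failed] = np.nan
    return masked

# Toy data: the second row stands in for a failed embedding
X_toy = np.array([[0.2, -0.1], [0.0, 0.0], [0.5, 0.3]])
preds_toy = np.clip(np.array([0.4, 1.3, -0.2]), 0.0, 1.0)
print(mask_failed_rows(X_toy, preds_toy))
```

In the notebook this would be applied as `mask_failed_rows(X_full, predictions)` on the unscaled matrix, since scaling shifts zero rows away from zero.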

## Step 6: Merge Data and Save

In [None]:
print("="*100)
print("Step 3: Merge data and save (Gemma-2-9B)")
print("="*100)

# Create dataset with OCEAN features
print(f"\nCreating dataset with OCEAN features...")
df_full_with_ocean = df_full.copy()

for dim in OCEAN_DIMS:
    df_full_with_ocean[dim] = ocean_predictions[dim]

print(f"Data merge complete")
print(f"  Original columns: {len(df_full.columns)}")
print(f"  New OCEAN columns: {len(OCEAN_DIMS)}")
print(f"  Total columns: {len(df_full_with_ocean.columns)}")
print(f"  Total rows: {len(df_full_with_ocean)}")

# Save complete dataset
print(f"\nSaving complete dataset (Gemma-2-9B)...")
output_csv = '../loan_with_ocean_gemma.csv'
df_full_with_ocean.to_csv(output_csv, index=False)
print(f"Saved: {output_csv}")
print(f"  File size: {os.path.getsize(output_csv) / 1024 / 1024:.1f} MB")

# Save embeddings
print(f"\nSaving embeddings...")
embedding_file = '../bge_large_embeddings_full.npy'
np.save(embedding_file, X_full)
print(f"Saved: {embedding_file}")
print(f"  File size: {os.path.getsize(embedding_file) / 1024 / 1024:.1f} MB")

# Verify data
print(f"\nData verification...")
print(f"Row count match: {len(df_full_with_ocean) == len(X_full)}")
print(f"OCEAN columns exist: {all(col in df_full_with_ocean.columns for col in OCEAN_DIMS)}")
print(f"\nOCEAN feature statistics (Gemma-2-9B):")
for col in OCEAN_DIMS:
    print(f"  {col:20s}: mean={df_full_with_ocean[col].mean():.3f}, std={df_full_with_ocean[col].std():.3f}, "
          f"min={df_full_with_ocean[col].min():.3f}, max={df_full_with_ocean[col].max():.3f}")

## Step 7: Generate Summary Report

In [None]:
print("="*100)
print("Generating summary report (Gemma-2-9B)")
print("="*100)

# Create summary report
summary = {
    'phase': '05g - Apply Ridge to Full Data (Gemma-2-9B)',
    'timestamp': datetime.now().isoformat(),
    'llm_model': 'Gemma-2-9B',
    'embedding_model': 'BAAI/bge-large-en-v1.5',
    'total_samples': int(total_samples),
    'embedding_success_count': int(success_count),
    'embedding_error_count': int(error_count),
    'embedding_success_rate': f"{success_count/total_samples*100:.2f}%",
    'embedding_dimension': 1024,
    'processing_time_seconds': elapsed_total,
    'processing_time_minutes': elapsed_total / 60,
    'processing_time_hours': elapsed_total / 3600,
    'samples_per_second': total_samples / elapsed_total if elapsed_total > 0 else 0,
    'output_files': {
        'csv': output_csv,
        'embeddings': embedding_file
    },
    'ocean_dimensions': OCEAN_DIMS,
    'ocean_statistics': {}
}

# Add OCEAN statistics
for dim in OCEAN_DIMS:
    summary['ocean_statistics'][dim] = {
        'mean': float(df_full_with_ocean[dim].mean()),
        'std': float(df_full_with_ocean[dim].std()),
        'min': float(df_full_with_ocean[dim].min()),
        'max': float(df_full_with_ocean[dim].max()),
        'median': float(df_full_with_ocean[dim].median())
    }

# Save report
summary_file = '../05g_full_data_summary_gemma.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nSummary report saved: {summary_file}")
print(f"\n" + "="*100)
print("Final Summary (Gemma-2-9B)")
print("="*100)
print(json.dumps(summary, indent=2, default=str))

print(f"\n" + "="*100)
print("05g Complete (Gemma-2-9B)")
print("="*100)
print(f"\nGenerated files:")
print(f"  1. {output_csv}")
print(f"     - {len(df_full_with_ocean):,} rows x {len(df_full_with_ocean.columns)} columns")
print(f"     - Contains all original features + {len(OCEAN_DIMS)} OCEAN features (Gemma-2-9B)")
print(f"\n  2. {embedding_file}")
print(f"     - {X_full.shape[0]:,} x {X_full.shape[1]} embeddings")
print(f"\n  3. {summary_file}")
print(f"     - Processing report and statistics")
print(f"\nNext step: Run XGBoost training with Gemma-2-9B OCEAN features")

## Summary

Step 05g Complete (Gemma-2-9B)

**Key Achievements**:
- Extracted BGE-Large embeddings for 34,529 samples
- Generated 5 OCEAN features using Gemma-2-9B Ridge models
- Created complete dataset with OCEAN features

**Output Files**:
- `loan_with_ocean_gemma.csv` - Complete dataset
- `bge_large_embeddings_full.npy` - Embeddings
- `05g_full_data_summary_gemma.json` - Report

**Next Steps**:
1. Use Gemma-2-9B OCEAN features for XGBoost training
2. Compare Gemma-2-9B performance with other models
3. Evaluate OCEAN feature impact