# 05f - Prepare Cross-Encoder Training Data (25 Models)

**Purpose**: Prepare training data for Cross-Encoder LoRA fine-tuning on HF AutoTrain

**Task**: 
- Transform OCEAN prediction into a text-pair regression task
- Query: OCEAN dimension definition
- Document: Loan description
- Label: OCEAN score (0-1)

**Training Strategy**:
- 5 LLMs × 5 OCEAN dimensions = **25 separate models**
- Each model is fine-tuned independently on its own (LLM, dimension) file
- LoRA fine-tuning for parameter efficiency

**Output**:
- 25 CSV files ready for HF AutoTrain upload
- Each file: up to 500 rows × 3 columns (text_1, text_2, label); rows with missing OCEAN scores are dropped
- Total training data: up to 12,500 text pairs
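
As a quick sanity check on the counts above, the 25 output files can be enumerated from the LLM keys and dimension names used in Steps 3 and 4. This is a minimal sketch; it assumes every ground-truth file keeps all 500 rows, so the total is an upper bound.

```python
from itertools import product

# Keys mirror LLM_CONFIGS and OCEAN_DIMS as defined in Step 3 of this notebook.
llm_keys = ["llama", "gpt", "gemma", "deepseek", "qwen"]
ocean_dims = ["openness", "conscientiousness", "extraversion",
              "agreeableness", "neuroticism"]

# One file per (LLM, dimension) pair, using the naming pattern from Step 4.
expected_files = [
    f"crossencoder_train_{llm}_{dim}.csv"
    for llm, dim in product(llm_keys, ocean_dims)
]

rows_per_file = 500  # assumption: no rows dropped for missing scores
max_total_pairs = len(expected_files) * rows_per_file

print(len(expected_files))  # 25
print(max_total_pairs)      # 12500
```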

**Next Step**: Upload to HF AutoTrain for LoRA fine-tuning

**Estimated Time**: 5-10 minutes

## Step 1: Import Libraries and Configuration

In [None]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")
print(f"Timestamp: {datetime.now()}")

## Step 2: Define OCEAN Dimension Definitions

These definitions are used as the "query" (`text_1`) in the text-pair regression task.

In [None]:
OCEAN_DEFINITIONS = {
    'openness': "This person is imaginative, creative, curious about new experiences, and open to new ideas. They appreciate art, emotion, adventure, unusual ideas, and variety of experience.",
    
    'conscientiousness': "This person is organized, responsible, hardworking, reliable, and goal-oriented. They show self-discipline, act dutifully, and aim for achievement against measures or outside expectations.",
    
    'extraversion': "This person is outgoing, energetic, talkative, sociable, and enjoys being around others. They seek stimulation in the company of others and are assertive and enthusiastic.",
    
    'agreeableness': "This person is friendly, cooperative, compassionate, trusting, and considerate of others. They are generally well-tempered, kind, and value getting along with others.",
    
    'neuroticism': "This person tends to experience negative emotions such as anxiety, anger, or depression. They are emotionally unstable, prone to worry, and have difficulty coping with stress."
}

print("OCEAN Definitions:")
print("="*80)
for dim, definition in OCEAN_DEFINITIONS.items():
    print(f"\n{dim.upper()}:")
    print(f"  {definition}")
print("\n" + "="*80)

## Step 3: Load Loan Descriptions and OCEAN Ground Truth

In [None]:
# Load loan descriptions
print("Loading loan descriptions...")
df_samples = pd.read_csv('../test_samples_500.csv')
print(f"Loaded {len(df_samples)} samples")
print(f"Columns: {df_samples.columns.tolist()}")

# Extract descriptions
loan_descriptions = df_samples['desc'].tolist()
print(f"\nSample description:")
print(f"  {loan_descriptions[0][:200]}...")

# LLM configurations
LLM_CONFIGS = {
    'llama': {
        'name': 'Llama-3.1-8B',
        'ocean_file': '../ocean_ground_truth/llama_3.1_8b_ocean_500.csv'
    },
    'gpt': {
        'name': 'GPT-OSS-120B',
        'ocean_file': '../ocean_ground_truth/gpt_oss_120b_ocean_500.csv'
    },
    'gemma': {
        'name': 'Gemma-2-9B',
        'ocean_file': '../ocean_ground_truth/gemma_2_9b_ocean_500.csv'
    },
    'deepseek': {
        'name': 'DeepSeek-V3.1',
        'ocean_file': '../ocean_ground_truth/deepseek_v3.1_ocean_500.csv'
    },
    'qwen': {
        'name': 'Qwen-2.5-72B',
        'ocean_file': '../ocean_ground_truth/qwen_2.5_72b_ocean_500.csv'
    }
}

OCEAN_DIMS = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

print(f"\nConfiguration:")
print(f"  LLM models: {len(LLM_CONFIGS)}")
print(f"  OCEAN dimensions: {len(OCEAN_DIMS)}")
print(f"  Total models to train: {len(LLM_CONFIGS) * len(OCEAN_DIMS)}")

## Step 4: Generate Training Data Files

**Format for Cross-Encoder Regression**:
```
text_1,text_2,label
"OCEAN definition","loan description",0.75
...
```

**File naming**: `crossencoder_train_{llm}_{dimension}.csv`
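
Loan descriptions often contain commas, quotation marks, and line breaks, so the CSV fields must be quoted correctly. The sketch below uses made-up rows (not taken from the dataset) to illustrate that pandas `to_csv` and `read_csv` with default settings round-trip such text.

```python
import io
import pandas as pd

# Hypothetical rows for illustration only; not real data.
pairs = pd.DataFrame({
    "text_1": ["This person is organized, responsible, and reliable."] * 2,
    "text_2": [
        'I need a loan to pay off my "high-interest" cards, then save.',
        "Consolidating debt.\nPlanning to repay within 3 years.",
    ],
    "label": [0.75, 0.40],
})

# Write to an in-memory buffer; pandas quotes fields that contain
# commas, double quotes, or newlines.
buffer = io.StringIO()
pairs.to_csv(buffer, index=False)

# Read the CSV back and compare against the original values.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored["text_2"].tolist() == pairs["text_2"].tolist())
```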

In [None]:
# Create output directory
output_dir = '../crossencoder_training_data'
os.makedirs(output_dir, exist_ok=True)
print(f"Output directory: {output_dir}")

# Storage for metadata
training_files_metadata = []
total_files = 0
total_rows = 0

print("\n" + "="*80)
print("Generating Training Data Files")
print("="*80)

for llm_key, llm_config in LLM_CONFIGS.items():
    print(f"\n[{llm_key.upper()}] {llm_config['name']}")
    
    # Load OCEAN ground truth
    ocean_df = pd.read_csv(llm_config['ocean_file'])
    
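    # The ground-truth file must have one row per loan description,
    # otherwise scores and descriptions cannot be paired by position
    initial_len = len(ocean_df)
    if initial_len != len(loan_descriptions):
        raise ValueError(f"{llm_config['ocean_file']}: expected {len(loan_descriptions)} rows, found {initial_len}")
    
    # Drop rows with a missing score in any of the five OCEAN columns
    ocean_df = ocean_df.dropna(subset=OCEAN_DIMS)
    if len(ocean_df) < initial_len:
        print(f"  Warning: Dropped {initial_len - len(ocean_df)} rows with NaN values")
    
    # dropna keeps the original row labels, so the surviving index
    # still points at the matching entries in loan_descriptions
    valid_descriptions = [loan_descriptions[i] for i in ocean_df.index]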
    
    for dim in OCEAN_DIMS:
        # Create training dataframe
        train_data = pd.DataFrame({
            'text_1': [OCEAN_DEFINITIONS[dim]] * len(ocean_df),  # Query: OCEAN definition
            'text_2': valid_descriptions,  # Document: loan description
            'label': ocean_df[dim].values  # Target: OCEAN score (0-1)
        })
        
        # Save to CSV
        filename = f'crossencoder_train_{llm_key}_{dim}.csv'
        filepath = os.path.join(output_dir, filename)
        train_data.to_csv(filepath, index=False)
        
        # Collect metadata
        file_size = os.path.getsize(filepath) / 1024  # KB
        training_files_metadata.append({
            'llm_key': llm_key,
            'llm_name': llm_config['name'],
            'dimension': dim,
            'filename': filename,
            'filepath': filepath,
            'num_samples': len(train_data),
            'file_size_kb': file_size,
            'label_mean': float(train_data['label'].mean()),
            'label_std': float(train_data['label'].std()),
            'label_min': float(train_data['label'].min()),
            'label_max': float(train_data['label'].max())
        })
        
        total_files += 1
        total_rows += len(train_data)
        
        print(f"  [{dim:<18}] {len(train_data):3d} samples | Label: {train_data['label'].mean():.3f} ± {train_data['label'].std():.3f} | {file_size:.1f} KB")

print("\n" + "="*80)
print(f"Total files created: {total_files}")
print(f"Total training samples: {total_rows}")
print(f"Average samples per file: {total_rows / total_files:.1f}")
print("="*80)
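
Pairing descriptions with scores by position is only safe if both files list the same loans in the same order. The sketch below shows one defensive way to build the pairs: it checks the lengths, then drops incomplete rows from both columns in a single step. The helper name `build_pairs` and the toy data are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

def build_pairs(descriptions, scores, dim, definition):
    """Return a text_1/text_2/label frame for one OCEAN dimension.

    descriptions: Series of loan texts, one per row.
    scores: DataFrame of OCEAN scores with the same row order.
    """
    if len(descriptions) != len(scores):
        raise ValueError("descriptions and scores differ in length")
    frame = pd.DataFrame({
        "text_2": descriptions.reset_index(drop=True),
        "label": scores[dim].reset_index(drop=True),
    })
    # Drop rows where either the text or the score is missing,
    # so the two columns cannot drift out of alignment.
    frame = frame.dropna(subset=["text_2", "label"])
    frame.insert(0, "text_1", definition)
    return frame.reset_index(drop=True)

# Toy example: the second score is missing, so its row is removed.
descs = pd.Series(["loan A", "loan B", "loan C"])
scores = pd.DataFrame({"openness": [0.2, None, 0.9]})
pairs = build_pairs(descs, scores, "openness", "This person is imaginative.")
print(pairs["text_2"].tolist())  # ['loan A', 'loan C']
```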

## Step 5: Generate Metadata and Instructions

In [None]:
# Save metadata
metadata = {
    'phase': '05f - Cross-Encoder Training Data Preparation',
    'timestamp': datetime.now().isoformat(),
    'base_model': 'cross-encoder/nli-deberta-v3-large',
    'task_type': 'text-regression',
    'total_models': total_files,
    'total_training_samples': total_rows,
    'ocean_definitions': OCEAN_DEFINITIONS,
    'training_files': training_files_metadata,
    'autotrain_config': {
        'learning_rate': 2e-5,
        'epochs': '3-5',
        'batch_size': 8,
        'lora': True,
        'lora_r': 8,
        'lora_alpha': 32,
        'train_split': 0.8,
        'eval_split': 0.2
    }
}

metadata_file = os.path.join(output_dir, 'training_data_metadata.json')
with open(metadata_file, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\nMetadata saved: {metadata_file}")

# Create training files list
files_list_file = os.path.join(output_dir, 'TRAINING_FILES_LIST.txt')
with open(files_list_file, 'w') as f:
    f.write("Cross-Encoder Training Files\n")
    f.write("="*80 + "\n\n")
    f.write(f"Total files: {total_files}\n")
    f.write(f"Total samples: {total_rows}\n\n")
    f.write("File List:\n")
    f.write("-" * 80 + "\n")
    
    for i, meta in enumerate(training_files_metadata, 1):
        f.write(f"{i:2d}. {meta['filename']:<50} ({meta['num_samples']} samples, {meta['file_size_kb']:.1f} KB)\n")
        f.write(f"    LLM: {meta['llm_name']:<20} | Dimension: {meta['dimension']:<15}\n")
        f.write(f"    Label stats: μ={meta['label_mean']:.3f}, σ={meta['label_std']:.3f}, range=[{meta['label_min']:.3f}, {meta['label_max']:.3f}]\n\n")

print(f"Files list saved: {files_list_file}")

# Print summary table
print("\n" + "="*80)
print("Training Files Summary")
print("="*80)
print(f"\n{'LLM':<20} {'Dimension':<18} {'Samples':<10} {'Label Mean':<12} {'Size (KB)'}")
print("-" * 80)
for meta in training_files_metadata[:10]:  # Show first 10
    print(f"{meta['llm_name']:<20} {meta['dimension']:<18} {meta['num_samples']:<10} {meta['label_mean']:<12.3f} {meta['file_size_kb']:.1f}")
print(f"\n(Showing first {min(10, len(training_files_metadata))} of {len(training_files_metadata)} files)")

## Step 6: Verify Data Quality

In [None]:
print("\n" + "="*80)
print("Data Quality Checks")
print("="*80)

# Spot-check the first generated file only (the other files are not inspected here)
sample_file = training_files_metadata[0]['filepath']
sample_df = pd.read_csv(sample_file)

print(f"\nSample file: {training_files_metadata[0]['filename']}")
print(f"Shape: {sample_df.shape}")
print(f"Columns: {sample_df.columns.tolist()}")

print(f"\nFirst 3 rows:")
print(sample_df.head(3).to_string())

# Check for issues
issues = []

# Check 1: NaN values
nan_count = sample_df.isnull().sum().sum()
if nan_count > 0:
    issues.append(f"Found {nan_count} NaN values")

# Check 2: Label range
if sample_df['label'].min() < 0 or sample_df['label'].max() > 1:
    issues.append(f"Labels out of range [0,1]: [{sample_df['label'].min()}, {sample_df['label'].max()}]")

# Check 3: Empty or very short texts (fewer than 10 characters)
short_text1 = (sample_df['text_1'].str.len() < 10).sum()
short_text2 = (sample_df['text_2'].str.len() < 10).sum()
if short_text1 > 0 or short_text2 > 0:
    issues.append(f"Found short texts: text_1={short_text1}, text_2={short_text2}")

# Check 4: Duplicate rows
duplicates = sample_df.duplicated().sum()
if duplicates > 0:
    issues.append(f"Found {duplicates} duplicate rows")

if issues:
    print("\n⚠️ Quality Issues Found:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("\n✓ All quality checks passed!")

# Statistics
print(f"\nData Statistics:")
print(f"  Text 1 (OCEAN def) length: {sample_df['text_1'].str.len().mean():.1f} chars")
print(f"  Text 2 (loan desc) length: {sample_df['text_2'].str.len().mean():.1f} ± {sample_df['text_2'].str.len().std():.1f} chars")
print(f"  Label distribution: {sample_df['label'].describe().to_dict()}")
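
The checks above inspect only the first file. A sketch of wrapping the same checks in a reusable helper, so they can be run on every generated file, might look like this. The helper name `check_pairs` is an illustrative assumption, and the demo uses a toy frame rather than real data.

```python
import pandas as pd

def check_pairs(df, min_chars=10):
    """Return a list of human-readable issues found in one training frame."""
    issues = []
    nan_count = int(df.isnull().sum().sum())
    if nan_count > 0:
        issues.append(f"{nan_count} NaN values")
    if df["label"].min() < 0 or df["label"].max() > 1:
        issues.append("labels outside [0, 1]")
    short = int((df["text_2"].astype(str).str.len() < min_chars).sum())
    if short > 0:
        issues.append(f"{short} short descriptions")
    dupes = int(df.duplicated().sum())
    if dupes > 0:
        issues.append(f"{dupes} duplicate rows")
    return issues

# Toy frame with one very short description and one out-of-range label.
demo = pd.DataFrame({
    "text_1": ["definition text here"] * 2,
    "text_2": ["loan", "a longer loan description"],
    "label": [0.5, 1.2],
})
print(check_pairs(demo))

# In the notebook, the helper could be applied to every file:
# for meta in training_files_metadata:
#     problems = check_pairs(pd.read_csv(meta["filepath"]))
```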

## Summary

**Data Preparation Complete!**

**Output**:
- 📁 Directory: `../crossencoder_training_data/`
- 📄 Training files: 25 CSV files (5 LLMs × 5 OCEAN dimensions)
- 📊 Total training samples: up to 12,500 (fewer if rows with missing scores were dropped)
- 📝 Metadata: `training_data_metadata.json`
- 📋 Files list: `TRAINING_FILES_LIST.txt`

**Next Steps**:

1. **Upload to Hugging Face**:
   - Go to: https://huggingface.co/spaces/autotrain-projects/autotrain-advanced
   - Upload the 25 CSV files

2. **Configure AutoTrain** (for each file):
   ```
   Task: Text Regression
   Base Model: cross-encoder/nli-deberta-v3-large
   Learning Rate: 2e-5
   Epochs: 3-5
   Batch Size: 8
   LoRA: Enable
   LoRA r: 8
   LoRA alpha: 32
   Train/Eval Split: 80/20
   ```

3. **Cost Estimate**:
   - ~$2-3 per model (rough estimate; actual cost depends on hardware and training time)
   - Total: ~$50-75 for 25 models

4. **After Training**:
   - Models will be saved to your HF account
   - Run `05f_crossencoder_lora_evaluate.ipynb` to evaluate

**See**: `HF_AUTOTRAIN_GUIDE.md` for detailed AutoTrain instructions
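
The configuration above specifies an 80/20 train/eval split. If a fixed, reproducible split is wanted, for example so that all 25 models are evaluated on the same held-out loans, it could be created locally before upload. This is a minimal sketch with pandas; the function name `split_train_eval` and the seed are arbitrary illustrative choices, and the toy frame stands in for one training file.

```python
import pandas as pd

def split_train_eval(df, train_frac=0.8, seed=42):
    """Shuffle rows reproducibly and split into train and eval frames."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Toy frame standing in for one of the 500-row training files.
toy = pd.DataFrame({
    "text_1": ["definition"] * 500,
    "text_2": [f"loan description {i}" for i in range(500)],
    "label": [i / 499 for i in range(500)],
})
train_df, eval_df = split_train_eval(toy)
print(len(train_df), len(eval_df))  # 400 100
```

Because the seed is fixed, files with the same number of rows are shuffled into the same order, so each loan lands on the same side of the split for every model.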