# 05d Extended - Generate 2,000 OCEAN Ground Truth (Llama-3-8B)

## Purpose
Generate high-quality OCEAN personality scores for 2,000 loan application samples using Llama-3-8B.

## Model Selection
- **Selected Model**: Llama-3-8B (meta-llama/Meta-Llama-3-8B-Instruct)
- **Provider**: novita

## Expected Metrics
- Total samples: 2,000
- Estimated time: ~120 minutes
- Expected cost: N/A

## Output Files
- `ocean_targets_2000_llama.csv`: OCEAN scores for 2,000 samples
- `samples_2000_with_desc.csv`: Sample descriptions with `sample_id` (input for 05g)
- `samples_2000_metadata_llama.csv`: Sample metadata
- `.checkpoint_2k_ocean_llama.json`: Checkpoint for resuming an interrupted run (removed after a successful run)
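
The output files share a `sample_id` column, so they can be joined downstream. The following is a minimal sketch of such a join, assuming the column layout produced later in this notebook; `join_outputs` and the demo values are hypothetical and only stand in for the real CSV files.

```python
import pandas as pd

OCEAN_COLS = ["openness", "conscientiousness", "extraversion",
              "agreeableness", "neuroticism"]

def join_outputs(df_ocean: pd.DataFrame, df_meta: pd.DataFrame) -> pd.DataFrame:
    """Merge scores with metadata on sample_id and keep only scored rows."""
    merged = df_ocean.merge(df_meta, on="sample_id", how="inner",
                            validate="one_to_one")
    return merged.dropna(subset=OCEAN_COLS).reset_index(drop=True)

# Tiny in-memory stand-ins for the two CSV files (values are made up).
demo_ocean = pd.DataFrame({
    "sample_id": [0, 1, 2],
    "openness": [0.4, None, 0.3],
    "conscientiousness": [0.7, None, 0.8],
    "extraversion": [0.2, None, 0.3],
    "agreeableness": [0.6, None, 0.7],
    "neuroticism": [0.1, None, 0.0],
})
demo_meta = pd.DataFrame({"sample_id": [0, 1, 2],
                          "desc_length": [232, 511, 3976]})

joined = join_outputs(demo_ocean, demo_meta)
print(joined)
```

With the real files, the two frames would come from `pd.read_csv` on the paths listed above; the failed sample (all scores missing) is dropped by the join helper.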

## Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import requests
import time
import os
from datetime import datetime

print('Libraries imported')

Libraries imported


## Step 2: Load Configuration

In [2]:
# Load HF token
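def load_env():
    """Read KEY=VALUE pairs from ../.env into a dict."""
    env_dict = {}
    try:
        with open('../.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, comments, and lines without '='
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                env_dict[key.strip()] = value.strip()
    except OSError:
        print('Warning: Unable to read .env file')
    return env_dict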

env_vars = load_env()
HF_TOKEN = env_vars.get('HF_TOKEN', '')

if not HF_TOKEN:
    raise ValueError('HF_TOKEN not found in .env file!')

print(f'HF token loaded: {HF_TOKEN[:10]}...{HF_TOKEN[-5:]}')

HF token loaded: hf_qObfQZg...KHaJU
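
As a side note on the configuration step, the sketch below shows one way to parse `.env`-style text while tolerating quoted values and malformed lines, and to log a secret without revealing most of it. Both `parse_env_text` and `mask_secret` are hypothetical helpers, not part of this notebook, and the token value is a placeholder.

```python
def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blanks, comments, and malformed lines."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

def mask_secret(value: str, visible: int = 3) -> str:
    """Show only the first few characters plus the total length."""
    return f"{value[:visible]}... ({len(value)} chars)"

demo_env = parse_env_text('# comment\nHF_TOKEN="hf_example"\n\nBROKEN LINE\nA = b=c\n')
print(demo_env)
print(mask_secret(demo_env["HF_TOKEN"]))
```

Printing a masked value keeps most of the token out of saved notebook outputs.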


In [3]:
# Model configuration
MODEL_NAME = 'meta-llama/Meta-Llama-3-8B-Instruct'
PROVIDER = 'novita'
DISPLAY_NAME = 'Llama-3-8B'

# File paths
DATA_FILE = '../loan_final_desc50plus_with_ocean_bge.csv'
OUTPUT_OCEAN = '../ocean_targets_2000_llama.csv'
OUTPUT_METADATA = '../samples_2000_metadata_llama.csv'
CHECKPOINT_FILE = '../.checkpoint_2k_ocean_llama.json'

# Parameters
SAMPLE_SIZE = 2000
RANDOM_STATE = 42

print(f'Model: {DISPLAY_NAME}')
print(f'Provider: {PROVIDER}')
print(f'Target samples: {SAMPLE_SIZE:,}')
print(f'Random seed: {RANDOM_STATE}')

Model: Llama-3-8B
Provider: novita
Target samples: 2,000
Random seed: 42


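# Sample 2000 rows randomly
# NOTE: this variant of the Step 5 sampling cell also saves the descriptions for 05g.
# It uses df_full, so it must run after the data is loaded in Step 5.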
print(f'\nSampling {SAMPLE_SIZE:,} samples (random seed: {RANDOM_STATE})...')

np.random.seed(RANDOM_STATE)
df_samples = df_full.sample(n=SAMPLE_SIZE, random_state=RANDOM_STATE).reset_index(drop=True)

print(f'Sampled: {len(df_samples):,} samples')
print(f'\nDescription statistics:')
df_samples['desc_length'] = df_samples['desc'].str.len()
print(f'  Min length: {df_samples["desc_length"].min()}')
print(f'  Mean length: {df_samples["desc_length"].mean():.1f}')
print(f'  Max length: {df_samples["desc_length"].max()}')

# Save full sampled data (including desc) for 05g to use
SAMPLES_WITH_DESC_FILE = '../samples_2000_with_desc.csv'
df_samples_to_save = df_samples[['desc']].copy()
df_samples_to_save.insert(0, 'sample_id', range(len(df_samples_to_save)))
df_samples_to_save.to_csv(SAMPLES_WITH_DESC_FILE, index=False)
print(f'\nSaved samples with desc to {SAMPLES_WITH_DESC_FILE}')

# Save metadata
metadata_cols = ['desc_length']
if 'loan_amnt' in df_samples.columns:
    metadata_cols.append('loan_amnt')
if 'grade' in df_samples.columns:
    metadata_cols.append('grade')

df_metadata = df_samples[metadata_cols].copy()
df_metadata.insert(0, 'sample_id', range(len(df_metadata)))
df_metadata.to_csv(OUTPUT_METADATA, index=False)

print(f'Saved metadata to {OUTPUT_METADATA}')

## Step 3: Define OCEAN Prompt Template

In [4]:
OCEAN_PROMPT_TEMPLATE = '''You are a psychologist specialized in the Big Five (OCEAN) personality assessment for credit behavior research.

Analyze the loan applicant's text and provide personality scores for each of the Big Five traits. Base your assessment on ANY available linguistic cues, writing style, word choice, and expressed intentions in the text.

Trait definitions and scoring guidelines:
- Openness (0.0-1.0): curiosity, imagination, preference for novelty and new ideas
  * High (0.7-1.0): words like "learn," "try new," "explore," "creative," "open-minded," "different," "unique"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): focus on routine, traditional, familiar, conservative language
  
- Conscientiousness (0.0-1.0): organization, discipline, reliability, planning, self-control
  * High (0.7-1.0): "planning," "saving," "on time," "responsibility," "organized," "careful"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): impulsive, unplanned, casual language
  
- Extraversion (0.0-1.0): sociability, assertiveness, energy, enthusiasm
  * High (0.7-1.0): "team," "connect," "talk," "outgoing," "social," "people," "friends"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): solitary, quiet, reserved language
  
- Agreeableness (0.0-1.0): cooperation, empathy, kindness, trust
  * High (0.7-1.0): "help," "care," "family," "support," "honest," "kind," "together"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): competitive, critical, confrontational language
  
- Neuroticism (0.0-1.0): emotional instability, anxiety, sensitivity to stress
  * High (0.7-1.0): "worry," "stress," "pressure," "concern," "can't sleep," "anxious," "difficult"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): calm, stable, confident language

IMPORTANT: You MUST provide a score between 0.0 and 1.0 for each trait based on the available text. Do NOT default to 0.5 unless you genuinely find perfectly neutral/balanced evidence. Use the full range of scores (0.0-1.0) to reflect varying degrees of each trait.

Loan description:
{description_text}

Return ONLY valid JSON in this exact format:
{{
  "openness": 0.X,
  "conscientiousness": 0.X,
  "extraversion": 0.X,
  "agreeableness": 0.X,
  "neuroticism": 0.X
}}'''

print('OCEAN prompt template defined')

OCEAN prompt template defined
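
The template relies on `str.format`, which is why the literal JSON braces are doubled. A minimal sketch of that behavior, using a shortened stand-in template (`MINI_TEMPLATE` is hypothetical) and a made-up description:

```python
# Shortened stand-in for the real template; only the brace handling matters here.
MINI_TEMPLATE = '''Loan description:
{description_text}

Return ONLY valid JSON in this exact format:
{{
  "openness": 0.X
}}'''

# Braces inside the substituted text are inserted verbatim, not re-interpreted.
rendered = MINI_TEMPLATE.format(description_text="I need {funds} for a new roof.")
print(rendered)
```

Doubled braces become single braces in the rendered prompt, while braces that appear inside a loan description pass through unchanged.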


## Step 4: Define API Function

In [5]:
def call_llm_for_ocean_scores(description_text, model_name, provider, api_token, max_retries=3):
    """
    Call HuggingFace Router API to generate OCEAN scores.
    
    Returns:
        dict: OCEAN scores or None if failed
    """
    prompt = OCEAN_PROMPT_TEMPLATE.format(description_text=description_text)
    
    api_url = 'https://router.huggingface.co/v1/chat/completions'
    headers = {
        'Authorization': f'Bearer {api_token}',
        'Content-Type': 'application/json'
    }
    
    payload = {
        'messages': [{'role': 'user', 'content': prompt}],
        'model': f'{model_name}:{provider}',
        'stream': False,
        'max_tokens': 200,
        'temperature': 0.7
    }
    
    for attempt in range(max_retries):
        try:
            response = requests.post(api_url, headers=headers, json=payload, timeout=30)
            
            if response.status_code == 200:
                result = response.json()
                if 'choices' in result and len(result['choices']) > 0:
                    text_output = result['choices'][0].get('message', {}).get('content', '')
                    
                    try:
                        # Extract JSON from response
                        json_start = text_output.find('{')
                        if json_start != -1:
                            json_string = text_output[json_start:]
                            json_end = json_string.find('}') + 1
                            json_string = json_string[:json_end]
                            score_dict = json.loads(json_string)
                            
                            # Validate all OCEAN dimensions present
                            return_value = {}
                            for key in ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']:
                                if key in score_dict:
                                    return_value[key] = float(score_dict[key])
                            
                            if len(return_value) == 5:
                                return return_value
                    except (ValueError, TypeError):
                        # Malformed JSON or a non-numeric score; fall through and retry
                        pass
                        
            elif response.status_code == 429 and attempt < max_retries - 1:
                # Rate limit, wait and retry
                time.sleep(2 * (attempt + 1))
                continue
                
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2)
    
    return None

print('API function defined')

API function defined
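
The parsing step inside the API function can also be written as a standalone helper, which makes it testable without network access. The sketch below is one possible variant, not the notebook's method: it uses `json.JSONDecoder.raw_decode` to read the first decodable object and additionally rejects scores outside 0.0 to 1.0, a check the function above does not perform. The name `extract_ocean_scores` is hypothetical.

```python
import json

OCEAN_KEYS = ("openness", "conscientiousness", "extraversion",
              "agreeableness", "neuroticism")

def extract_ocean_scores(text):
    """Return five floats in [0, 1] parsed from text, or None if none are valid."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
            continue
        if isinstance(obj, dict):
            try:
                scores = {k: float(obj[k]) for k in OCEAN_KEYS}
            except (KeyError, TypeError, ValueError):
                scores = None
            if scores and all(0.0 <= v <= 1.0 for v in scores.values()):
                return scores
        start = text.find("{", start + 1)
    return None

reply = ('Here are the scores:\n{"openness": 0.4, "conscientiousness": 0.7, '
         '"extraversion": 0.2, "agreeableness": 0.6, "neuroticism": 0.1}\n'
         'Hope this helps!')
print(extract_ocean_scores(reply))
```

Keeping the parser separate from the HTTP call lets malformed replies be collected and replayed offline.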


## Step 5: Load and Sample Data

In [6]:
print(f'Loading data from {DATA_FILE}...')
df_full = pd.read_csv(DATA_FILE)

print(f'Total samples in dataset: {len(df_full):,}')
print(f'Columns: {df_full.shape[1]}')

# Check for required column
if 'desc' not in df_full.columns:
    raise ValueError('Column "desc" not found in dataset!')

print('Column "desc" found')

Loading data from ../loan_final_desc50plus_with_ocean_bge.csv...
Total samples in dataset: 34,529
Columns: 38
Column "desc" found


In [7]:
# Sample 2000 rows randomly
print(f'\nSampling {SAMPLE_SIZE:,} samples (random seed: {RANDOM_STATE})...')

np.random.seed(RANDOM_STATE)
df_samples = df_full.sample(n=SAMPLE_SIZE, random_state=RANDOM_STATE).reset_index(drop=True)

print(f'Sampled: {len(df_samples):,} samples')
print(f'\nDescription statistics:')
df_samples['desc_length'] = df_samples['desc'].str.len()
print(f'  Min length: {df_samples["desc_length"].min()}')
print(f'  Mean length: {df_samples["desc_length"].mean():.1f}')
print(f'  Max length: {df_samples["desc_length"].max()}')

# Save metadata
metadata_cols = ['desc_length']
if 'loan_amnt' in df_samples.columns:
    metadata_cols.append('loan_amnt')
if 'grade' in df_samples.columns:
    metadata_cols.append('grade')

df_metadata = df_samples[metadata_cols].copy()
df_metadata.insert(0, 'sample_id', range(len(df_metadata)))
df_metadata.to_csv(OUTPUT_METADATA, index=False)

print(f'\nSaved metadata to {OUTPUT_METADATA}')


Sampling 2,000 samples (random seed: 42)...
Sampled: 2,000 samples

Description statistics:
  Min length: 232
  Mean length: 511.5
  Max length: 3976

Saved metadata to ../samples_2000_metadata_llama.csv
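
Reproducibility of the sample comes from the `random_state` argument to `DataFrame.sample`. The sketch below illustrates this on a toy frame (all data is made up) and also shows one way to keep the original row index, which `reset_index(drop=True)` discards; the `source_row` column name is hypothetical.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"desc": [f"description {i}" for i in range(100)]})

np.random.seed(0)
first = df_demo.sample(n=10, random_state=42)

np.random.seed(12345)  # a different global seed
second = df_demo.sample(n=10, random_state=42)

# The same random_state selects the same rows regardless of the global seed.
same_rows = first.index.equals(second.index)
print(same_rows)

# Keep the original row position so samples can be traced back to the full dataset.
traced = first.reset_index().rename(columns={"index": "source_row"})
print(traced.head())
```

Storing the source row alongside `sample_id` makes it possible to re-join the sampled rows with any column of the full dataset later.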


## Step 6: Load Checkpoint (if exists)

In [8]:
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, 'r') as f:
        checkpoint = json.load(f)
    
    print(f'Checkpoint loaded: {checkpoint["processed_count"]}/{checkpoint["total_count"]}')
    ocean_scores = checkpoint['ocean_scores']
    start_idx = checkpoint['processed_count']
    success_count = checkpoint['success_count']
    failure_count = checkpoint['failure_count']
    
else:
    print('No checkpoint found, starting from scratch')
    ocean_scores = []
    start_idx = 0
    success_count = 0
    failure_count = 0

No checkpoint found, starting from scratch
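
A checkpoint that is being written when the kernel is interrupted can end up truncated. One common remedy, sketched here under the assumption that the checkpoint keeps the fields used in this notebook, is to write to a temporary file and swap it into place with `os.replace`. The helper names are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint_atomic(path, checkpoint):
    """Write JSON to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f)
        os.replace(tmp_path, path)  # replaces the old file in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path):
    """Return the checkpoint dict, or None if it is missing or inconsistent."""
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        checkpoint = json.load(f)
    # The resume logic assumes one stored entry per processed sample.
    if len(checkpoint.get("ocean_scores", [])) != checkpoint.get("processed_count"):
        return None
    return checkpoint

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "checkpoint.json")
    missing = load_checkpoint(demo_path)
    save_checkpoint_atomic(demo_path, {
        "processed_count": 2, "total_count": 5,
        "success_count": 1, "failure_count": 1,
        "ocean_scores": [{"openness": 0.4}, None],
    })
    restored = load_checkpoint(demo_path)

print(missing, restored["processed_count"])
```

The length check guards against resuming from a file whose score list and counter disagree.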


## Step 7: Generate OCEAN Scores

**This will take approximately 120 minutes (about 30 minutes per 500 samples × 4).**

Progress is saved every 50 samples, so you can resume if interrupted.

In [9]:
print('=' * 80)
print(f'Processing {DISPLAY_NAME} for {SAMPLE_SIZE:,} samples')
print('=' * 80)
print(f'Total samples: {len(df_samples):,}')
print(f'Starting from: {start_idx}')
print(f'Estimated time: ~{(SAMPLE_SIZE - start_idx) / 500 * 30:.1f} minutes')
print('=' * 80)

start_time = time.time()

for idx in range(start_idx, len(df_samples)):
    row = df_samples.iloc[idx]
    description = row.get('desc', '')
    
    # Skip missing (NaN) or very short descriptions
    if not isinstance(description, str) or len(description) < 10:
        ocean_scores.append(None)
        failure_count += 1
        continue
    
    # Call LLM
    ocean_score = call_llm_for_ocean_scores(
        description, 
        MODEL_NAME, 
        PROVIDER, 
        HF_TOKEN, 
        max_retries=3
    )
    
    if ocean_score:
        ocean_scores.append(ocean_score)
        success_count += 1
    else:
        ocean_scores.append(None)
        failure_count += 1
    
    # Progress reporting and checkpointing every 50 samples
    if (idx + 1) % 50 == 0 or (idx + 1) == len(df_samples):
        elapsed = time.time() - start_time
        rate = (idx + 1 - start_idx) / elapsed if elapsed > 0 else 0
        eta = (len(df_samples) - (idx + 1)) / rate / 60 if rate > 0 else 0
        
        print(f'{idx + 1}/{len(df_samples)} ({(idx+1)/len(df_samples)*100:.1f}%) | '
              f'Success: {success_count} ({success_count/(idx+1)*100:.1f}%) | '
              f'Failed: {failure_count} | '
              f'Rate: {rate:.2f} samples/sec | '
              f'ETA: {eta:.1f} min')
        
        # Save checkpoint
        checkpoint = {
            'model_name': MODEL_NAME,
            'provider': PROVIDER,
            'display_name': DISPLAY_NAME,
            'total_count': len(df_samples),
            'processed_count': idx + 1,
            'success_count': success_count,
            'failure_count': failure_count,
            'ocean_scores': ocean_scores,
            'last_update': datetime.now().isoformat()
        }
        with open(CHECKPOINT_FILE, 'w') as f:
            json.dump(checkpoint, f, indent=2)
    
    # Rate limiting
    time.sleep(1)

total_time = time.time() - start_time
print(f'\nCOMPLETE: {total_time/60:.1f} minutes')
print(f'Success: {success_count}/{len(df_samples)} ({success_count/len(df_samples)*100:.1f}%)')
print(f'Failed: {failure_count}/{len(df_samples)} ({failure_count/len(df_samples)*100:.1f}%)')

Processing Llama-3-8B for 2,000 samples
Total samples: 2,000
Starting from: 0
Estimated time: ~120.0 minutes
50/2000 (2.5%) | Success: 50 (100.0%) | Failed: 0 | Rate: 0.28 samples/sec | ETA: 117.1 min
100/2000 (5.0%) | Success: 99 (99.0%) | Failed: 1 | Rate: 0.26 samples/sec | ETA: 121.9 min
150/2000 (7.5%) | Success: 149 (99.3%) | Failed: 1 | Rate: 0.26 samples/sec | ETA: 119.1 min
200/2000 (10.0%) | Success: 199 (99.5%) | Failed: 1 | Rate: 0.26 samples/sec | ETA: 115.1 min
250/2000 (12.5%) | Success: 248 (99.2%) | Failed: 2 | Rate: 0.26 samples/sec | ETA: 112.5 min
300/2000 (15.0%) | Success: 298 (99.3%) | Failed: 2 | Rate: 0.26 samples/sec | ETA: 107.2 min
350/2000 (17.5%) | Success: 348 (99.4%) | Failed: 2 | Rate: 0.26 samples/sec | ETA: 104.1 min
400/2000 (20.0%) | Success: 398 (99.5%) | Failed: 2 | Rate: 0.27 samples/sec | ETA: 100.0 min
450/2000 (22.5%) | Success: 448 (99.6%) | Failed: 2 | Rate: 0.27 samples/sec | ETA: 97.1 min
500/2000 (25.0%) | Success: 497 (99.4%) | Failed: 3
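
The ETA column in the progress lines is derived from the throughput of the current run. The same arithmetic as a small standalone function (`eta_minutes` is a hypothetical name, and the elapsed time in the example is made up for illustration):

```python
def eta_minutes(total, processed, start_idx, elapsed_seconds):
    """Estimate remaining minutes from the throughput observed in this run."""
    done_this_run = processed - start_idx
    if elapsed_seconds <= 0 or done_this_run <= 0:
        return 0.0
    rate = done_this_run / elapsed_seconds      # samples per second
    return (total - processed) / rate / 60      # remaining samples -> minutes

# 500 of 2,000 samples done in 2,000 seconds: 0.25 samples/sec, 1,500 remaining.
example_eta = eta_minutes(total=2000, processed=500, start_idx=0, elapsed_seconds=2000.0)
print(f"{example_eta:.1f} min")
```

Subtracting `start_idx` matters after a resume, since samples restored from a checkpoint took no time in the current run.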

## Step 8: Save Final Results

In [10]:
# Create DataFrame with OCEAN scores
data_list = []
for idx, score in enumerate(ocean_scores):
    if score:
        data_list.append({'sample_id': idx, **score})
    else:
        data_list.append({
            'sample_id': idx,
            'openness': None,
            'conscientiousness': None,
            'extraversion': None,
            'agreeableness': None,
            'neuroticism': None
        })

df_ocean = pd.DataFrame(data_list)

# Save OCEAN targets
df_ocean.to_csv(OUTPUT_OCEAN, index=False)
print(f'Results saved: {OUTPUT_OCEAN}')
print(f'  Total rows: {len(df_ocean)}')
print(f'  Valid rows: {df_ocean["openness"].notna().sum()}')

# Clean up checkpoint
if os.path.exists(CHECKPOINT_FILE):
    os.remove(CHECKPOINT_FILE)
    print(f'\nCheckpoint file removed')

Results saved: ../ocean_targets_2000_llama.csv
  Total rows: 2000
  Valid rows: 1988

Checkpoint file removed
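
Rows without valid scores are stored as empty values, so the failed samples can be identified afterwards for a retry. A minimal sketch, assuming the same list-of-dicts-or-`None` structure as `ocean_scores`; `scores_to_frame` and the demo values are hypothetical.

```python
import pandas as pd

OCEAN_COLS = ["openness", "conscientiousness", "extraversion",
              "agreeableness", "neuroticism"]

def scores_to_frame(ocean_scores):
    """Turn a list of score dicts (or None for failures) into a DataFrame."""
    rows = [score if score else {} for score in ocean_scores]
    df = pd.DataFrame(rows, columns=OCEAN_COLS).astype("float64")
    df.insert(0, "sample_id", range(len(df)))
    return df

demo_scores = [
    {"openness": 0.4, "conscientiousness": 0.7, "extraversion": 0.2,
     "agreeableness": 0.6, "neuroticism": 0.1},
    None,  # a failed sample
    {"openness": 0.3, "conscientiousness": 0.8, "extraversion": 0.3,
     "agreeableness": 0.7, "neuroticism": 0.0},
]

df_demo_ocean = scores_to_frame(demo_scores)
failed_ids = df_demo_ocean.loc[df_demo_ocean["openness"].isna(), "sample_id"].tolist()
print(df_demo_ocean)
print("Failed sample_ids:", failed_ids)
```

The list of failed IDs could then be passed through the API function again before training.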


## Step 9: Display Statistics

In [11]:
print('=' * 80)
print('OCEAN GROUND TRUTH STATISTICS (2,000 samples)')
print('=' * 80)

ocean_cols = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']
print(df_ocean[ocean_cols].describe())

print('\n' + '=' * 80)
print('SUMMARY')
print('=' * 80)
print(f'Model: {DISPLAY_NAME}')
print(f'Total samples: {len(df_ocean):,}')
print(f'Valid samples: {df_ocean["openness"].notna().sum():,} ({df_ocean["openness"].notna().sum()/len(df_ocean)*100:.1f}%)')
print(f'Processing time: {total_time/60:.1f} minutes')
print(f'Rate: {len(df_ocean)/(total_time/60):.1f} samples/minute')
print('=' * 80)

print('\nALL DONE! Ready for 05g model training.')

OCEAN GROUND TRUTH STATISTICS (2,000 samples)
          openness  conscientiousness  extraversion  agreeableness  neuroticism
count  1988.000000        1988.000000   1988.000000    1988.000000  1988.000000
mean      0.395835           0.656791      0.270714       0.668073     0.187394
std       0.109252           0.134812      0.112648       0.124434     0.201995
min       0.000000           0.000000      0.000000       0.000000     0.000000
25%       0.300000           0.600000      0.200000       0.600000     0.000000
50%       0.400000           0.700000      0.300000       0.700000     0.100000
75%       0.400000           0.800000      0.400000       0.700000     0.200000
max       0.800000           0.900000      0.800000       0.930000     0.900000

SUMMARY
Model: Llama-3-8B
Total samples: 2,000
Valid samples: 1,988 (99.4%)
Processing time: 128.8 minutes
Rate: 15.5 samples/minute


## Next Steps

1. Verify the generated OCEAN scores look reasonable
2. Use `ocean_targets_2000_llama.csv` in the 05g training notebook
3. Extract BGE embeddings for these 2,000 samples
4. Train ElasticNet, Random Forest, and Gradient Boosting models
5. Compare performance with 500-sample baseline
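
For the first step, a few automated checks can complement a visual inspection. The sketch below is one possible set of checks, not a prescribed procedure: it counts missing rows, out-of-range rows, the share of scores equal to 0.5 (the default the prompt discourages), and the number of distinct values per trait. `sanity_report` and the demo data are hypothetical.

```python
import pandas as pd

OCEAN_COLS = ["openness", "conscientiousness", "extraversion",
              "agreeableness", "neuroticism"]

def sanity_report(df, cols=OCEAN_COLS):
    """Summarize basic quality signals for a table of OCEAN scores."""
    scored = df[cols].dropna()
    out_of_range = ((scored < 0.0) | (scored > 1.0)).any(axis=1)
    return {
        "n_missing": int(len(df) - len(scored)),
        "n_out_of_range": int(out_of_range.sum()),
        "share_exactly_half": scored.eq(0.5).mean().round(3).to_dict(),
        "n_unique_values": scored.nunique().to_dict(),
    }

nan = float("nan")
df_check = pd.DataFrame(
    [[0.4, 0.7, 0.2, 0.6, 0.1],
     [nan, nan, nan, nan, nan],   # a failed sample
     [0.5, 0.5, 0.5, 0.5, 1.2]],  # one out-of-range score
    columns=OCEAN_COLS,
)

report = sanity_report(df_check)
print(report)
```

A low count of distinct values per trait would suggest that the model is emitting coarse, rounded scores, which may matter for the regression models trained in 05g.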