# 05d - OCEAN Generation: Gemma-2-9B

**Model**: google/gemma-2-9b-it  
**Provider**: nebius  
**Samples**: 500  
**Output**: ../ocean_ground_truth/gemma_2_9b_ocean_500.csv  
**Estimated Time**: 1.5-2 hours (the run recorded below completed in 19.0 minutes)

This notebook generates OCEAN personality scores for 500 loan application samples using the Gemma-2-9B model.

## Step 1: Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import json
import requests
import time
import os
from datetime import datetime

print('✓ Libraries imported')

✓ Libraries imported


In [2]:
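def load_env():
    """Parse KEY=VALUE pairs from ../.env, skipping blanks and comments."""
    env_dict = {}
    try:
        with open('../.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, comments, and lines without '='
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                env_dict[key.strip()] = value.strip()
    except OSError:
        print('Warning: Unable to read .env file')
    return env_dict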

env_vars = load_env()
hf_token = env_vars.get('HF_TOKEN', '')
print('✓ HF token loaded' if hf_token else '❌ HF_TOKEN not found')

✓ HF token loaded


In [3]:
df_samples = pd.read_csv('../test_samples_500.csv')
print(f'✓ Loaded {len(df_samples)} samples')

✓ Loaded 500 samples


## Step 2: Define Model Configuration

In [4]:
MODEL_NAME = 'google/gemma-2-9b-it'
PROVIDER = 'nebius'
DISPLAY_NAME = 'Gemma-2-9B'
OUTPUT_FILE = '../ocean_ground_truth/gemma_2_9b_ocean_500.csv'
CHECKPOINT_FILE = '../ocean_ground_truth/.checkpoint_gemma_2_9b.json'

print(f'Model: {DISPLAY_NAME}')
print(f'Provider: {PROVIDER}')

Model: Gemma-2-9B
Provider: nebius


In [5]:
ocean_prompt_template = '''You are a psychologist specialized in the Big Five (OCEAN) personality assessment for credit behavior research.

Analyze the loan applicant's text and provide personality scores for each of the Big Five traits. Base your assessment on ANY available linguistic cues, writing style, word choice, and expressed intentions in the text.

Trait definitions and scoring guidelines:
- Openness (0.0-1.0): curiosity, imagination, preference for novelty and new ideas
  * High (0.7-1.0): words like "learn," "try new," "explore," "creative," "open-minded," "different," "unique"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): focus on routine, traditional, familiar, conservative language
  
- Conscientiousness (0.0-1.0): organization, discipline, reliability, planning, self-control
  * High (0.7-1.0): "planning," "saving," "on time," "responsibility," "organized," "careful"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): impulsive, unplanned, casual language
  
- Extraversion (0.0-1.0): sociability, assertiveness, energy, enthusiasm
  * High (0.7-1.0): "team," "connect," "talk," "outgoing," "social," "people," "friends"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): solitary, quiet, reserved language
  
- Agreeableness (0.0-1.0): cooperation, empathy, kindness, trust
  * High (0.7-1.0): "help," "care," "family," "support," "honest," "kind," "together"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): competitive, critical, confrontational language
  
- Neuroticism (0.0-1.0): emotional instability, anxiety, sensitivity to stress
  * High (0.7-1.0): "worry," "stress," "pressure," "concern," "can't sleep," "anxious," "difficult"
  * Medium (0.4-0.6): neutral or mixed signals
  * Low (0.0-0.3): calm, stable, confident language

IMPORTANT: You MUST provide a score between 0.0 and 1.0 for each trait based on the available text. Do NOT default to 0.5 unless you genuinely find perfectly neutral/balanced evidence. Use the full range of scores (0.0-1.0) to reflect varying degrees of each trait.

Loan description:
{description_text}

Return ONLY valid JSON in this exact format:
{{
  "openness": 0.X,
  "conscientiousness": 0.X,
  "extraversion": 0.X,
  "agreeableness": 0.X,
  "neuroticism": 0.X
}}'''

print('✓ OCEAN prompt template defined')

✓ OCEAN prompt template defined
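
The template is later filled with `str.format`, which is why the literal braces of the JSON example are doubled (`{{` and `}}`) while `{description_text}` is not. A minimal sketch of that behavior, using a shortened stand-in template and a made-up description (`mini_template` and the sample text are illustrative and not part of this notebook):

```python
# Hypothetical, shortened stand-in for ocean_prompt_template.
mini_template = '''Loan description:
{description_text}

Return ONLY valid JSON in this exact format:
{{
  "openness": 0.X
}}'''

# str.format substitutes {description_text} and turns {{ / }} into literal braces.
filled = mini_template.format(description_text='I need to consolidate my credit card debt.')
print(filled)
```

Substituted values are not re-parsed by `str.format`, so a description that itself contains braces should pass through unchanged.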


## Step 3: Define API Function

In [6]:
def call_llm_for_ocean_scores(description_text, model_name, provider, api_token, max_retries=3):
    """Request OCEAN scores for one description; return a dict of five floats, or None on failure."""
    prompt = ocean_prompt_template.format(description_text=description_text)
    api_url = 'https://router.huggingface.co/v1/chat/completions'
    headers = {'Authorization': f'Bearer {api_token}', 'Content-Type': 'application/json'}
    payload = {
        'messages': [{'role': 'user', 'content': prompt}],
        'model': f'{model_name}:{provider}',
        'stream': False,
        'max_tokens': 200,
        'temperature': 0.7
    }
    traits = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']
    
    for attempt in range(max_retries):
        try:
            response = requests.post(api_url, headers=headers, json=payload, timeout=30)
            if response.status_code == 200:
                result = response.json()
                if 'choices' in result and len(result['choices']) > 0:
                    text_output = result['choices'][0].get('message', {}).get('content', '')
                    try:
                        # The reply may wrap the JSON in extra text, so cut out the first {...} span
                        json_start = text_output.find('{')
                        if json_start != -1:
                            json_string = text_output[json_start:]
                            json_end = json_string.find('}') + 1
                            json_string = json_string[:json_end]
                            score_dict = json.loads(json_string)
                            return_value = {key: float(score_dict[key]) for key in traits if key in score_dict}
                            # Accept the reply only if all five traits are present
                            if len(return_value) == 5:
                                return return_value
                    except (ValueError, TypeError, AttributeError):
                        pass  # Malformed JSON or non-numeric score: fall through to the next attempt
            elif response.status_code == 429 and attempt < max_retries - 1:
                # Rate limited: back off linearly (2s, 4s, ...) before retrying
                time.sleep(2 * (attempt + 1))
                continue
        except Exception:
            # Network or decoding error: pause briefly, then retry if attempts remain
            if attempt < max_retries - 1:
                time.sleep(2)
    return None

print('✓ API function defined')

✓ API function defined
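
The parsing step inside `call_llm_for_ocean_scores` can be exercised without any network call. The sketch below isolates it in a hypothetical helper, `parse_ocean_reply`, which is not part of this notebook. It also clamps scores to the 0.0-1.0 range the prompt asks for; that clamping is an added assumption, since the notebook's function accepts whatever numbers the model returns.

```python
import json

TRAITS = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

def parse_ocean_reply(text_output):
    """Return a dict of five floats in [0.0, 1.0], or None if the reply is unusable."""
    # Locate the first {...} span; the model may surround it with extra text.
    json_start = text_output.find('{')
    json_end = text_output.find('}', json_start)
    if json_start == -1 or json_end == -1:
        return None
    try:
        score_dict = json.loads(text_output[json_start:json_end + 1])
        scores = {key: float(score_dict[key]) for key in TRAITS}
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing trait, or a non-numeric score.
        return None
    # Assumption: out-of-range values are clamped rather than rejected.
    return {key: min(1.0, max(0.0, value)) for key, value in scores.items()}

reply = ('Here is the assessment:\n'
         '{"openness": 0.3, "conscientiousness": 0.7, "extraversion": 0.2, '
         '"agreeableness": 0.5, "neuroticism": 1.4}')
print(parse_ocean_reply(reply))
print(parse_ocean_reply('no JSON here'))
```

Keeping the parser separate from the HTTP call would make it straightforward to test against saved model replies before spending API quota.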


## Step 4: Load Checkpoint (if exists)

In [7]:
os.makedirs('../ocean_ground_truth', exist_ok=True)

if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, 'r') as f:
        checkpoint = json.load(f)
    print(f'✓ Checkpoint loaded: {checkpoint["processed_count"]}/{checkpoint["total_count"]}')
    ocean_scores = checkpoint['ocean_scores']
    start_idx = checkpoint['processed_count']
else:
    print('No checkpoint found, starting from scratch')
    ocean_scores = []
    start_idx = 0

success_count = sum(1 for s in ocean_scores if s is not None)
failure_count = len(ocean_scores) - success_count

No checkpoint found, starting from scratch
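
The resume logic relies on `processed_count` matching the length of `ocean_scores`, which holds because the processing loop appends exactly one entry per sample. A minimal sketch of the save/load round trip, using hypothetical helpers `save_checkpoint` and `load_checkpoint` and a temporary demo path (none of these are defined in this notebook). Writing to a temporary file and swapping it in with `os.replace` is an added precaution against a half-written checkpoint, not something the notebook does:

```python
import json
import os
import tempfile

def save_checkpoint(path, ocean_scores, total_count):
    """Write the checkpoint to a temporary file, then swap it into place."""
    checkpoint = {
        'total_count': total_count,
        'processed_count': len(ocean_scores),
        'ocean_scores': ocean_scores,
    }
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump(checkpoint, f, indent=2)
    # os.replace overwrites the destination in a single rename step.
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return (ocean_scores, start_idx); empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return [], 0
    with open(path, 'r') as f:
        checkpoint = json.load(f)
    return checkpoint['ocean_scores'], checkpoint['processed_count']

demo_path = os.path.join(tempfile.mkdtemp(), 'checkpoint_demo.json')
print(load_checkpoint(demo_path))
save_checkpoint(demo_path, [{'openness': 0.3}, None], total_count=500)
print(load_checkpoint(demo_path))
```

Failed samples are stored as `None`, which JSON serializes as `null` and restores as `None`, so the success and failure counts can be recomputed after a resume.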


## Step 5: Process All Samples

In [8]:
print('=' * 80)
print(f'Processing {DISPLAY_NAME}')
print('=' * 80)
print(f'Total samples: {len(df_samples)}')
print(f'Starting from: {start_idx}')
print('=' * 80)

start_time = time.time()

for idx in range(start_idx, len(df_samples)):
    row = df_samples.iloc[idx]
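    description = row.get('desc', '')
    
    # Missing descriptions load as NaN (a float), so check the type before calling len()
    if not isinstance(description, str) or len(description) < 10: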
        ocean_scores.append(None)
        failure_count += 1
        continue
    
    ocean_score = call_llm_for_ocean_scores(description, MODEL_NAME, PROVIDER, hf_token, max_retries=2)
    
    if ocean_score:
        ocean_scores.append(ocean_score)
        success_count += 1
    else:
        ocean_scores.append(None)
        failure_count += 1
    
    if (idx + 1) % 50 == 0 or (idx + 1) == len(df_samples):
        elapsed = time.time() - start_time
        rate = (idx + 1 - start_idx) / elapsed if elapsed > 0 else 0
        eta = (len(df_samples) - (idx + 1)) / rate / 60 if rate > 0 else 0
        print(f'{idx + 1}/{len(df_samples)} ({(idx+1)/len(df_samples)*100:.1f}%) | Success: {success_count} ({success_count/(idx+1)*100:.1f}%) | ETA: {eta:.1f}min')
        checkpoint = {
            'model_name': MODEL_NAME,
            'provider': PROVIDER,
            'display_name': DISPLAY_NAME,
            'total_count': len(df_samples),
            'processed_count': idx + 1,
            'success_count': success_count,
            'failure_count': failure_count,
            'ocean_scores': ocean_scores,
            'last_update': datetime.now().isoformat()
        }
        with open(CHECKPOINT_FILE, 'w') as f:
            json.dump(checkpoint, f, indent=2)
    time.sleep(1)

print(f'\n✅ COMPLETE: {(time.time()-start_time)/60:.1f}min | Success: {success_count}/{len(df_samples)} ({success_count/len(df_samples)*100:.1f}%)')

Processing Gemma-2-9B
Total samples: 500
Starting from: 0
50/500 (10.0%) | Success: 50 (100.0%) | ETA: 17.3min
100/500 (20.0%) | Success: 100 (100.0%) | ETA: 15.4min
150/500 (30.0%) | Success: 150 (100.0%) | ETA: 13.6min
200/500 (40.0%) | Success: 200 (100.0%) | ETA: 11.5min
250/500 (50.0%) | Success: 250 (100.0%) | ETA: 9.4min
300/500 (60.0%) | Success: 300 (100.0%) | ETA: 7.5min
350/500 (70.0%) | Success: 350 (100.0%) | ETA: 5.6min
400/500 (80.0%) | Success: 400 (100.0%) | ETA: 3.8min
450/500 (90.0%) | Success: 450 (100.0%) | ETA: 1.9min
500/500 (100.0%) | Success: 500 (100.0%) | ETA: 0.0min

✅ COMPLETE: 19.0min | Success: 500/500 (100.0%)
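
The ETA in each progress line comes from the throughput of the current session: samples processed since `start_idx` divided by elapsed seconds, applied to the samples still remaining. A small sketch of the same arithmetic as a standalone function (`eta_minutes` is a hypothetical name, and the inputs below are made-up numbers rather than timings from this run):

```python
def eta_minutes(processed, start_idx, total, elapsed_seconds):
    """Estimate remaining minutes from the current session's throughput."""
    # Samples per second, counting only work done since the (possibly resumed) start.
    rate = (processed - start_idx) / elapsed_seconds if elapsed_seconds > 0 else 0
    # Remaining samples divided by rate gives seconds; divide by 60 for minutes.
    return (total - processed) / rate / 60 if rate > 0 else 0

# Fresh run: 50 of 500 samples done after 100 seconds.
print(eta_minutes(50, 0, 500, 100.0))

# Resumed run: started at 250, reached 300 after 100 seconds.
print(eta_minutes(300, 250, 500, 100.0))
```

Subtracting `start_idx` matters on a resumed run: without it, samples restored from the checkpoint would inflate the rate and make the ETA too optimistic.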


## Step 6: Save Results

In [9]:
trait_names = ['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']
empty_row = dict.fromkeys(trait_names)  # all-None placeholder for samples without scores
data_list = [{'sample_id': idx, **(score or empty_row)} for idx, score in enumerate(ocean_scores)]
df_ocean = pd.DataFrame(data_list)
df_ocean.to_csv(OUTPUT_FILE, index=False)
print(f'✓ Results saved: {OUTPUT_FILE} ({len(df_ocean)} rows, {df_ocean["openness"].notna().sum()} valid)')
if os.path.exists(CHECKPOINT_FILE):
    os.remove(CHECKPOINT_FILE)
    print('✓ Checkpoint removed')

✓ Results saved: ../ocean_ground_truth/gemma_2_9b_ocean_500.csv (500 rows, 500 valid)
✓ Checkpoint removed


## Step 7: Display Statistics

In [10]:
print('=' * 80)
print('OCEAN Statistics')
print('=' * 80)
print(df_ocean[['openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']].describe())
print('\n✅ ALL DONE!')

OCEAN Statistics
         openness  conscientiousness  extraversion  agreeableness  neuroticism
count  500.000000         500.000000    500.000000     500.000000   500.000000
mean     0.284800           0.594800      0.280000       0.510300     0.205400
std      0.069776           0.105476      0.088179       0.122265     0.151383
min      0.100000           0.100000      0.100000       0.200000     0.000000
25%      0.200000           0.600000      0.200000       0.400000     0.100000
50%      0.300000           0.600000      0.300000       0.500000     0.200000
75%      0.300000           0.700000      0.300000       0.600000     0.200000
max      0.700000           0.800000      0.800000       0.900000     0.800000

✅ ALL DONE!
