# 05e - Extract BGE-Large Embeddings (500 Test Samples)

**Purpose**: Extract BGE-Large embeddings from 500 test samples as input features for Ridge regression model training

**Input Files**:
- test_samples_500.csv - 500 samples
- ocean_ground_truth/ - OCEAN ground truth scores (deepseek_v3.1_ocean_500.csv by default; if it is missing, the first other *_ocean_500.csv file found is used)

**Output Files**:
- bge_embeddings_500.npy - BGE embeddings matrix (500x1024)
- ocean_targets_500.csv - Corresponding OCEAN scores (500x6: sample_id plus the five trait columns)
- 05e_extraction_summary.json - Extraction statistics report

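**Estimated Time**: Approximately 6 minutes in the run logged below (500 API calls at about 1.4 samples/s, with a 0.5 second delay between calls); allow more time if API errors trigger retries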
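
The run time scales with the number of requests times the per-request cost (fixed pause plus API latency). A rough sketch of that arithmetic follows; `estimate_minutes` is a hypothetical helper, and the 0.22 second latency is an assumption back-calculated from the roughly 1.38 samples/s shown in the Step 4 progress log, not a measured value.

```python
def estimate_minutes(n_samples: int, delay_s: float, latency_s: float) -> float:
    """Estimated wall-clock minutes for sequential requests with a fixed pause."""
    return n_samples * (delay_s + latency_s) / 60.0


# 0.5 s is the pause used in Step 4; 0.22 s per call is an assumed latency.
print(f"{estimate_minutes(500, 0.5, 0.22):.1f} min")  # prints "6.0 min"
```

Retries are not modelled here, so the figure is a lower bound for a run that hits 500, 503, or 429 responses.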

## Step 1: Import Libraries

In [10]:
import pandas as pd
import numpy as np
import requests
import json
import os
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

Libraries loaded successfully


## Step 2: Load Test Data and OCEAN Ground Truth

In [11]:
# Load 500 test samples
print("Loading test data...")

# Check if test_samples_500.csv exists, if not create it from original data
test_file = '../test_samples_500.csv'

if not os.path.exists(test_file):
    print(f"⚠️  {test_file} not found, creating from original dataset...")
    
    # Load original dataset (WITHOUT OCEAN scores - that's the point!)
    original_file = '../data/loan_final_desc50plus.csv'
    
    if os.path.exists(original_file):
        print(f"  Loading: {original_file}")
        df_original = pd.read_csv(original_file, low_memory=False)
        print(f"  Total samples: {len(df_original):,}")
        
        # Take first 500 samples (corresponding to sample_id 0-499 in OCEAN ground truth)
        df_samples = df_original.head(500).copy()
        
        # Verify desc column exists
        if 'desc' not in df_samples.columns:
            raise ValueError("desc column not found in dataset!")
        
        # Save to test file
        df_samples.to_csv(test_file, index=False)
        print(f"  ✓ Created {test_file} with {len(df_samples)} samples")
        print(f"  ✓ These correspond to sample_id 0-499 in OCEAN ground truth files")
    else:
        raise FileNotFoundError(f"Original dataset not found: {original_file}\n"
                                f"  Please ensure loan_final_desc50plus.csv exists in data/ directory")
else:
    print(f"  Loading existing file: {test_file}")
    df_samples = pd.read_csv(test_file)

print(f"\n✓ Loaded {len(df_samples)} samples")
print(f"  Columns: {len(df_samples.columns)}")
print(f"  Has 'desc': {'desc' in df_samples.columns}")

if 'desc' in df_samples.columns:
    desc_count = df_samples['desc'].notna().sum()
    print(f"  Valid descriptions: {desc_count}/{len(df_samples)}")

print(f"\nFirst 3 descriptions (preview):")
for i in range(min(3, len(df_samples))):
    desc = str(df_samples.iloc[i]['desc'])[:100]
    print(f"  [{i}] {desc}...")

# Load OCEAN ground truth (use best model)
print("\n" + "="*80)
print("Loading OCEAN ground truth...")
print("="*80)

ocean_gt_file = '../ocean_ground_truth/deepseek_v3.1_ocean_500.csv'

if not os.path.exists(ocean_gt_file):
    # If file doesn't exist, try other models
    print(f"⚠️  {ocean_gt_file} not found, searching for alternatives...")
    ocean_dir = '../ocean_ground_truth'
    if os.path.exists(ocean_dir):
        files = [f for f in os.listdir(ocean_dir) if f.endswith('_ocean_500.csv')]
        if files:
            ocean_gt_file = os.path.join(ocean_dir, files[0])
            print(f"  Using: {files[0]}")
        else:
            raise FileNotFoundError(f"No OCEAN ground truth files found in {ocean_dir}")
    else:
        raise FileNotFoundError(f"Directory not found: {ocean_dir}")

df_ocean = pd.read_csv(ocean_gt_file)
print(f"\n✓ Loaded {len(df_ocean)} OCEAN scores")
print(f"  Columns: {df_ocean.columns.tolist()}")
print(f"\nOCEAN score statistics:")
print(df_ocean.describe())

# Verify alignment (row counts only; row order is assumed to follow sample_id 0-499)
if len(df_samples) != len(df_ocean):
    print(f"\n⚠️  WARNING: Sample count mismatch!")
    print(f"  Test samples: {len(df_samples)}")
    print(f"  OCEAN scores: {len(df_ocean)}")
else:
    print(f"\n✓ Row counts match: {len(df_samples)} samples and {len(df_ocean)} OCEAN scores")

Loading test data...
  Loading existing file: ../test_samples_500.csv

✓ Loaded 500 samples
  Columns: 33
  Has 'desc': True
  Valid descriptions: 500/500

First 3 descriptions (preview):
  [0] I currently have a loan out with CashCall. The interest rate is 96%! At the time I took out the loan...
  [1] Temporary cash flow challenges. Would like this loan to offset mortgage payments-next 90 days. Have ...
  [2] Hi! So $5,500 doesn't sound like much to be debt free. Well, when you're 24 years old it seems like ...

Loading OCEAN ground truth...

✓ Loaded 500 OCEAN scores
  Columns: ['sample_id', 'openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

OCEAN score statistics:
        sample_id    openness  conscientiousness  extraversion  agreeableness   
count  500.000000  500.000000         500.000000    500.000000     500.000000  \
mean   249.500000    0.293200           0.764800      0.255600       0.489800   
std    144.481833    0.103029           0.114136  
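
The check in this cell compares row counts only. Assuming, as the cell's comments state, that row i of the test file corresponds to `sample_id` i, a stricter check could also confirm that the `sample_id` column runs 0..n-1 in order. The helper name `check_alignment` and the tiny frames below are illustrative, not part of the notebook.

```python
import pandas as pd


def check_alignment(df_samples: pd.DataFrame, df_ocean: pd.DataFrame) -> bool:
    """Return True if both frames have equal length and sample_id runs 0..n-1 in order."""
    if len(df_samples) != len(df_ocean):
        return False
    expected = list(range(len(df_samples)))
    return df_ocean["sample_id"].tolist() == expected


# Tiny synthetic frames standing in for the real files.
samples = pd.DataFrame({"desc": ["a", "b", "c"]})
ocean_ok = pd.DataFrame({"sample_id": [0, 1, 2], "openness": [0.2, 0.3, 0.4]})
ocean_shuffled = ocean_ok.iloc[[1, 0, 2]].reset_index(drop=True)

print(check_alignment(samples, ocean_ok))        # True
print(check_alignment(samples, ocean_shuffled))  # False
```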

## Step 3: Define BGE Embedding Extraction Function

In [12]:
# Load HF Token with debugging
def load_hf_token():
    # Method 1: Try loading from .env file
    env_path = '../.env'
    print(f"[DEBUG] Trying to load from: {env_path}")
    print(f"[DEBUG] Current working directory: {os.getcwd()}")
    print(f"[DEBUG] File exists: {os.path.exists(env_path)}")
    
    try:
        with open(env_path, 'r') as f:
            content = f.read()
            print(f"[DEBUG] File content length: {len(content)} chars")
            
            for line in content.split('\n'):
                if line.strip() and not line.startswith('#'):
                    if '=' in line:
                        key, value = line.strip().split('=', 1)
                        print(f"[DEBUG] Found key: {key}")
                        if key == 'HF_TOKEN':
                            print(f"[DEBUG] Token found, length: {len(value)}")
                            return value.strip()
    except Exception as e:
        print(f"[DEBUG] Error reading .env file: {e}")
    
    # Method 2: Try environment variable
    env_token = os.getenv('HF_TOKEN', '')
    if env_token:
        print(f"[DEBUG] Token found in environment variable")
        return env_token
    
    print(f"[DEBUG] No token found!")
    return ''

hf_token = load_hf_token()
print(f"\n{'='*60}")
print(f"HF Token loaded: {'YES ✓' if hf_token else 'NO ✗'}")
if hf_token:
    print(f"Token preview: {hf_token[:10]}...{hf_token[-5:]}")
    print(f"Token length: {len(hf_token)}")
print(f"{'='*60}\n")

# Define BGE embedding extraction function with enhanced retry logic
def extract_bge_embedding(text: str, max_retries: int = 5, base_delay: int = 3) -> np.ndarray:
    """
    Call the HF Inference API to extract a BGE-Large embedding, retrying on failure.
    
    Args:
        text: Input text
        max_retries: Maximum number of attempts, including the first call (default 5)
        base_delay: Base retry delay in seconds; how it is scaled depends on the error type
    
    Returns:
        1024-dimensional embedding vector
    """
    api_url = "https://api-inference.huggingface.co/models/BAAI/bge-large-en-v1.5"
    headers = {
        "Authorization": f"Bearer {hf_token}",
        "Content-Type": "application/json"
    }
    
    for attempt in range(max_retries):
        try:
            response = requests.post(
                api_url,
                headers=headers,
                json={"inputs": text},
                timeout=60  # Increased timeout
            )
            
            if response.status_code == 200:
                features = response.json()
                
                # Handle different response formats
                if isinstance(features, list):
                    if len(features) > 0:
                        if isinstance(features[0], list):
                            # Nested list (per-token or batched vectors): mean-pool over the first axis
                            avg_feature = np.mean(features, axis=0)
                        else:
                            # features is already the embedding
                            avg_feature = features
                    else:
                        raise ValueError("Empty features list")
                else:
                    # Assume it's already an embedding
                    avg_feature = features
                
                return np.array(avg_feature)
            
            elif response.status_code == 500:
                # Internal server error - exponential backoff
                if attempt < max_retries - 1:
                    delay = base_delay * (2 ** attempt)  # Exponential backoff: 3s, 6s, 12s, 24s
                    print(f"    API 500 error (attempt {attempt+1}/{max_retries}), waiting {delay}s...")
                    time.sleep(delay)
                    continue
                else:
                    raise Exception(f"API Error 500 after {max_retries} retries")
            
            elif response.status_code == 503:
                # Model loading
                if attempt < max_retries - 1:
                    delay = base_delay * 2  # Wait longer for model loading
                    print(f"    Model loading... waiting {delay}s (attempt {attempt+1}/{max_retries})")
                    time.sleep(delay)
                    continue
                else:
                    raise Exception(f"API Error 503 after {max_retries} retries")
            
            elif response.status_code == 429:
                # Rate limit - wait even longer
                if attempt < max_retries - 1:
                    delay = base_delay * (attempt + 2)  # Linear increase: 6s, 9s, 12s, 15s
                    print(f"    Rate limited, waiting {delay}s...")
                    time.sleep(delay)
                    continue
                else:
                    raise Exception(f"Rate limited after {max_retries} retries")
            
            else:
                error_msg = response.text[:200] if hasattr(response, 'text') else 'Unknown error'
                if attempt < max_retries - 1:
                    print(f"    API Error {response.status_code}, retrying...")
                    time.sleep(base_delay * (attempt + 1))
                    continue
                else:
                    raise Exception(f"API Error {response.status_code}: {error_msg}")
        
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                delay = base_delay * (attempt + 1)
                print(f"    Request timeout, waiting {delay}s...")
                time.sleep(delay)
                continue
            else:
                raise Exception("Request timeout after all retries")
        
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                delay = base_delay * (attempt + 1)
                print(f"    Network error: {str(e)[:50]}, waiting {delay}s...")
                time.sleep(delay)
                continue
            else:
                raise
        
        except Exception as e:
            if attempt < max_retries - 1:
                delay = base_delay * (attempt + 1)
                print(f"    Error: {str(e)[:50]}, waiting {delay}s...")
                time.sleep(delay)
                continue
            else:
                raise
    
    raise Exception("Failed to extract embedding after all retries")

print("✓ BGE embedding extraction function defined (with enhanced retry)")
print("  - Max retries: 5")
print("  - Exponential backoff for 500 errors")
print("  - Timeout: 60s")

[DEBUG] Trying to load from: ../.env
[DEBUG] Current working directory: /Users/jietaoxie/Documents/GitHub/Credibly-INFO-5900/notebooks
[DEBUG] File exists: True
[DEBUG] File content length: 209 chars
[DEBUG] Found key: HF_TOKEN
[DEBUG] Token found, length: 37

HF Token loaded: YES ✓
Token preview: hf_TdTspnR...voFvX
Token length: 37

✓ BGE embedding extraction function defined (with enhanced retry)
  - Max retries: 5
  - Exponential backoff for 500 errors
  - Timeout: 60s
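
`extract_bge_embedding` averages over the first axis once when the response is a nested list. If the endpoint ever returned a three-level list (batch, tokens, dimensions), one average would leave a 2-D array and the length check in Step 4 would reject it. The sketch below is one possible way to normalize any of these shapes; `to_sentence_vector` is a hypothetical helper, and which shape the API actually returns is an assumption to verify against a real response.

```python
import numpy as np


def to_sentence_vector(features, dim: int = 1024) -> np.ndarray:
    """Reduce a nested-list API response to a single 1-D vector of length `dim`."""
    arr = np.asarray(features, dtype=np.float64)
    if arr.size == 0:
        raise ValueError("Empty features")
    # Average over leading axes (batch, tokens) until only the embedding axis remains.
    while arr.ndim > 1:
        arr = arr.mean(axis=0)
    if arr.shape != (dim,):
        raise ValueError(f"Unexpected embedding shape: {arr.shape}")
    return arr


# Small 4-dimensional stand-ins for the three possible response layouts.
flat = [1.0, 2.0, 3.0, 4.0]
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
batched = [tokens]

print(to_sentence_vector(flat, dim=4))
print(to_sentence_vector(tokens, dim=4))
print(to_sentence_vector(batched, dim=4))
```

Mean pooling is kept here only because it mirrors the existing function; it is not a claim about the pooling the model authors recommend.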


## Step 4: Batch Extract Embeddings

In [13]:
print("="*80)
print("Starting BGE Embeddings extraction (500 samples)")
print("="*80)

embeddings = []
success_count = 0
error_count = 0
error_indices = []

start_time = time.time()
total_samples = len(df_samples)

# Increased delay to avoid API rate limits and 500 errors
DELAY_BETWEEN_REQUESTS = 0.5  # Increased from 0.3 to 0.5 seconds

for idx, (_, row) in enumerate(df_samples.iterrows(), 1):
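    text = row.get('desc', '')
    
    # A missing description is read as NaN (a float), which has no .strip()
    if not isinstance(text, str) or len(text.strip()) < 10:
        # Skip missing or too short descriptions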
        embeddings.append(np.zeros(1024))
        error_count += 1
        error_indices.append(idx - 1)
        print(f"  [{idx}] Skipping: text too short")
        continue
    
    try:
        # Extract embedding
        emb = extract_bge_embedding(text)
        
        if emb is not None and len(emb) == 1024:
            embeddings.append(emb)
            success_count += 1
        else:
            embeddings.append(np.zeros(1024))
            error_count += 1
            error_indices.append(idx - 1)
            print(f"  [{idx}] Error: Invalid embedding dimension")
    
    except Exception as e:
        embeddings.append(np.zeros(1024))
        error_count += 1
        error_indices.append(idx - 1)
        print(f"  [{idx}] Error: {str(e)[:100]}")
    
    # Show progress
    if idx % 25 == 0 or idx == total_samples:  # Show progress more frequently
        elapsed = time.time() - start_time
        rate = idx / elapsed if elapsed > 0 else 0
        eta = (total_samples - idx) / rate if rate > 0 else 0
        
        progress = idx / total_samples * 100
        success_rate = success_count / idx * 100 if idx > 0 else 0
        print(f"[{idx:3d}/{total_samples}] {progress:5.1f}% | ✓{success_count} ✗{error_count} ({success_rate:.1f}% success) | {rate:.2f} samples/s | ETA: {eta/60:.1f}min")
    
    # Add delay to avoid rate limiting and reduce 500 errors
    time.sleep(DELAY_BETWEEN_REQUESTS)

elapsed_total = time.time() - start_time

print(f"\n" + "="*80)
print(f"Embedding extraction complete")
print(f"="*80)
print(f"\nTime elapsed: {elapsed_total/60:.1f} minutes ({elapsed_total:.1f} seconds)")
print(f"Success: {success_count}/{total_samples} ({success_count/total_samples*100:.1f}%)")
print(f"Failed: {error_count}/{total_samples} ({error_count/total_samples*100:.1f}%)")

if error_count > 0:
    print(f"\nFailed indices (first 20): {error_indices[:20]}")

# Convert to numpy array
X = np.array(embeddings)
print(f"\nEmbedding matrix shape: {X.shape}")
print(f"Data type: {X.dtype}")
print(f"Memory usage: {X.nbytes / 1024 / 1024:.1f} MB")
print(f"Average time per sample: {elapsed_total/total_samples:.2f}s")

Starting BGE Embeddings extraction (500 samples)
[ 25/500]   5.0% | ✓25 ✗0 (100.0% success) | 1.37 samples/s | ETA: 5.8min
[ 50/500]  10.0% | ✓50 ✗0 (100.0% success) | 1.38 samples/s | ETA: 5.4min
[ 75/500]  15.0% | ✓75 ✗0 (100.0% success) | 1.39 samples/s | ETA: 5.1min
[100/500]  20.0% | ✓100 ✗0 (100.0% success) | 1.38 samples/s | ETA: 4.8min
[125/500]  25.0% | ✓125 ✗0 (100.0% success) | 1.38 samples/s | ETA: 4.5min
[150/500]  30.0% | ✓150 ✗0 (100.0% success) | 1.38 samples/s | ETA: 4.2min
[175/500]  35.0% | ✓175 ✗0 (100.0% success) | 1.39 samples/s | ETA: 3.9min
[200/500]  40.0% | ✓200 ✗0 (100.0% success) | 1.38 samples/s | ETA: 3.6min
[225/500]  45.0% | ✓225 ✗0 (100.0% success) | 1.39 samples/s | ETA: 3.3min
[250/500]  50.0% | ✓250 ✗0 (100.0% success) | 1.39 samples/s | ETA: 3.0min
[275/500]  55.0% | ✓275 ✗0 (100.0% success) | 1.39 samples/s | ETA: 2.7min
[300/500]  60.0% | ✓300 ✗0 (100.0% success) | 1.39 samples/s | ETA: 2.4min
[325/500]  65.0% | ✓325 ✗0 (100.0% success) | 1.39 sam
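
Failed or skipped samples are stored as all-zero vectors so that row positions stay aligned with the OCEAN targets. Those placeholder rows carry no information, so a downstream step would likely want to exclude them. A minimal sketch, assuming a real embedding is never exactly all zeros; `valid_row_mask` is a hypothetical helper.

```python
import numpy as np


def valid_row_mask(X: np.ndarray) -> np.ndarray:
    """Boolean mask that is False for all-zero placeholder rows."""
    return np.any(X != 0, axis=1)


# Row 1 stands in for a failed extraction.
X_demo = np.array([[0.1, -0.2], [0.0, 0.0], [0.3, 0.4]])
mask = valid_row_mask(X_demo)

print(mask.tolist())       # [True, False, True]
print(X_demo[mask].shape)  # (2, 2)
```

The same mask would need to be applied to the target rows so that features and targets stay paired.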

## Step 5: Save Embeddings and Target Variables

In [14]:
print("Saving results...\n")

# Save embeddings
embedding_file = '../bge_embeddings_500.npy'
np.save(embedding_file, X)
print(f"Embeddings saved: {embedding_file}")
print(f"  Shape: {X.shape}")
print(f"  Size: {os.path.getsize(embedding_file) / 1024 / 1024:.1f} MB")

# Save OCEAN targets
ocean_target_file = '../ocean_targets_500.csv'
df_ocean.to_csv(ocean_target_file, index=False)
print(f"\nOCEAN targets saved: {ocean_target_file}")
print(f"  Shape: {df_ocean.shape}")
print(f"  Columns: {df_ocean.columns.tolist()}")

# Verify data consistency
if len(X) == len(df_ocean):
    print(f"\nData consistency check passed")
    print(f"   Embeddings count: {len(X)}")
    print(f"   OCEAN targets count: {len(df_ocean)}")
else:
    print(f"\nWARNING: Data inconsistency")
    print(f"   Embeddings: {len(X)}")
    print(f"   OCEAN targets: {len(df_ocean)}")

Saving results...

Embeddings saved: ../bge_embeddings_500.npy
  Shape: (500, 1024)
  Size: 3.9 MB

OCEAN targets saved: ../ocean_targets_500.csv
  Shape: (500, 6)
  Columns: ['sample_id', 'openness', 'conscientiousness', 'extraversion', 'agreeableness', 'neuroticism']

Data consistency check passed
   Embeddings count: 500
   OCEAN targets count: 500
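
The targets file keeps the `sample_id` column, so a consumer has to select the five trait columns to obtain an n x 5 target matrix. The sketch below shows one way to load both files back; `load_features_and_targets` and the temporary demo files are illustrative assumptions, while the column names are taken from the output above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]


def load_features_and_targets(emb_path: str, target_path: str):
    """Load the embedding matrix and the trait-only target matrix, checking row counts."""
    X = np.load(emb_path)
    y = pd.read_csv(target_path)[TRAITS].to_numpy()
    if len(X) != len(y):
        raise ValueError(f"Row mismatch: {len(X)} embeddings vs {len(y)} targets")
    return X, y


# Demo with small synthetic files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "emb_demo.npy")
    target_path = os.path.join(tmp, "targets_demo.csv")
    np.save(emb_path, np.zeros((3, 1024)))
    demo = pd.DataFrame({"sample_id": [0, 1, 2], **{t: [0.2, 0.5, 0.8] for t in TRAITS}})
    demo.to_csv(target_path, index=False)
    X_loaded, y_loaded = load_features_and_targets(emb_path, target_path)

print(X_loaded.shape, y_loaded.shape)  # (3, 1024) (3, 5)
```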


## Step 6: Generate Statistics Report

In [15]:
# Generate summary report
summary = {
    'phase': '05e - Extract BGE Embeddings',
    'timestamp': datetime.now().isoformat(),
    'total_samples': int(total_samples),
    'success_count': int(success_count),
    'error_count': int(error_count),
    'success_rate': f"{success_count/total_samples*100:.2f}%",
    'embedding_model': 'BAAI/bge-large-en-v1.5',
    'embedding_dimension': 1024,
    'embedding_file': embedding_file,
    'ocean_target_file': ocean_target_file,
    'ocean_features': df_ocean.columns.tolist(),
    'ocean_statistics': {},
    'processing_time_seconds': elapsed_total,
    'samples_per_second': success_count / elapsed_total if elapsed_total > 0 else 0
}

# Add OCEAN statistics
for col in df_ocean.columns:
    summary['ocean_statistics'][col] = {
        'mean': float(df_ocean[col].mean()),
        'std': float(df_ocean[col].std()),
        'min': float(df_ocean[col].min()),
        'max': float(df_ocean[col].max())
    }

# Save summary
summary_file = '../05e_extraction_summary.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"Statistics report saved: {summary_file}")
print(f"\n" + "="*80)
print("Summary")
print("="*80)
print(json.dumps(summary, indent=2, default=str))

Statistics report saved: ../05e_extraction_summary.json

Summary
{
  "phase": "05e - Extract BGE Embeddings",
  "timestamp": "2025-10-29T18:11:39.357684",
  "total_samples": 500,
  "success_count": 500,
  "error_count": 0,
  "success_rate": "100.00%",
  "embedding_model": "BAAI/bge-large-en-v1.5",
  "embedding_dimension": 1024,
  "embedding_file": "../bge_embeddings_500.npy",
  "ocean_target_file": "../ocean_targets_500.csv",
  "ocean_features": [
    "sample_id",
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism"
  ],
  "ocean_statistics": {
    "sample_id": {
      "mean": 249.5,
      "std": 144.4818327679989,
      "min": 0.0,
      "max": 499.0
    },
    "openness": {
      "mean": 0.2932,
      "std": 0.10302907346938493,
      "min": 0.2,
      "max": 0.8
    },
    "conscientiousness": {
      "mean": 0.7647999999999999,
      "std": 0.11413594538118181,
      "min": 0.2,
      "max": 0.9
    },
    "extraversion": {
      "mean

## Summary

Step 05e Complete

**Output Files**:
- `bge_embeddings_500.npy` - 500x1024 embeddings matrix
- `ocean_targets_500.csv` - 500x6 table (`sample_id` plus the five OCEAN scores)
- `05e_extraction_summary.json` - Extraction report
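
The `ocean_statistics` block of the extraction report covers every column of the targets file, including `sample_id`, which is a row identifier and not a trait score. If trait-only statistics are wanted, a sketch along these lines could be used; `trait_statistics` is a hypothetical helper and the three-row frame is synthetic.

```python
import pandas as pd


def trait_statistics(df: pd.DataFrame, traits: list) -> dict:
    """Mean, sample standard deviation, min, and max for the listed columns only."""
    return {
        col: {
            "mean": float(df[col].mean()),
            "std": float(df[col].std()),
            "min": float(df[col].min()),
            "max": float(df[col].max()),
        }
        for col in traits
    }


demo = pd.DataFrame({"sample_id": [0, 1, 2], "openness": [0.2, 0.3, 0.4]})
stats = trait_statistics(demo, ["openness"])
print(sorted(stats))
print(stats["openness"]["min"], stats["openness"]["max"])
```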

**Next Step**:
Run `05f_train_ridge_models.ipynb` to train Ridge regression models