# 05e - Extract MPNet Embeddings (500 Test Samples)

**Purpose**: Extract all-mpnet-base-v2 embeddings for 500 test samples using the Hugging Face (HF) Inference API

**Why MPNet-base-v2?**
- MPNet-base-v2: 110M parameters, 768 dimensions
- MiniLM-L12-v2: 33M parameters, 384 dimensions (current baseline)
- BGE-Large: 326M parameters, 1024 dimensions
- MPNet is about 3.3x larger than MiniLM (110M vs 33M parameters), which is expected to give better semantic understanding
- sentence-transformers officially recommends it as the "best all-around" model
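
The size ratios quoted in this list follow directly from the listed figures. A quick check, using the approximate parameter counts given above (the `models` dictionary and variable names are illustrative only, not part of the notebook):

```python
# Approximate parameter counts (millions) and embedding dimensions, as listed above
models = {
    "MiniLM-L12-v2": {"params_m": 33, "dims": 384},
    "MPNet-base-v2": {"params_m": 110, "dims": 768},
    "BGE-Large": {"params_m": 326, "dims": 1024},
}

mpnet = models["MPNet-base-v2"]
minilm = models["MiniLM-L12-v2"]
bge = models["BGE-Large"]

param_ratio_vs_minilm = mpnet["params_m"] / minilm["params_m"]
dim_ratio_vs_minilm = mpnet["dims"] / minilm["dims"]
dim_ratio_vs_bge = mpnet["dims"] / bge["dims"]

print(f"Parameters vs MiniLM: {param_ratio_vs_minilm:.1f}x")
print(f"Dimensions vs MiniLM: {dim_ratio_vs_minilm:.1f}x")
print(f"Dimensions vs BGE:    {dim_ratio_vs_bge:.2f}x")
```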

**Input Files**:
- test_samples_500.csv - 500 loan descriptions
- ocean_ground_truth/ - OCEAN ground truth (for consistency check)

**Output Files**:
- mpnet_embeddings_500.npy - MPNet embeddings matrix (500x768)
- 05e_mpnet_extraction_summary.json - Extraction statistics report

**Note**: Output dimension is 768 (2x MiniLM's 384, smaller than BGE's 1024)

**Estimated Time**: Approximately 8-12 minutes with HF Pro (500 API calls, faster than MiniLM); the run recorded in this notebook took about 5.4 minutes
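
A rough runtime estimate can be worked out from the fixed 0.5-second pause that Step 4 places between requests. The sketch below assumes an average API round trip of about 0.15 seconds, which is roughly what the recorded run implies (326.5 s over 500 samples is about 0.65 s per sample); the latency figure and all variable names are assumptions for illustration.

```python
n_samples = 500
delay_s = 0.5              # fixed pause between requests (Step 4)
assumed_latency_s = 0.15   # ASSUMPTION: average round-trip time per API call

# Time spent sleeping alone gives a lower bound on the total runtime
sleep_floor_min = n_samples * delay_s / 60

# Adding the assumed per-call latency gives a rough overall estimate
estimate_min = n_samples * (delay_s + assumed_latency_s) / 60

print(f"Sleep-only lower bound:        {sleep_floor_min:.1f} min")
print(f"Estimate with assumed latency: {estimate_min:.1f} min")
```

Retries add to this figure, so the 8-12 minute estimate leaves headroom for rate limiting and model loading.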

## Step 1: Import Libraries and Setup

In [1]:
import pandas as pd
import numpy as np
import requests
import json
import os
import time
from datetime import datetime
import warnings
from huggingface_hub import InferenceClient
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")
print(f"Timestamp: {datetime.now()}")

Libraries loaded successfully
Timestamp: 2025-10-29 13:06:05.410074


## Step 2: Load HF Token and Test Data

In [2]:
# Load HF Token
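def load_hf_token():
    """Read HF_TOKEN from ../.env, falling back to the environment variable."""
    try:
        with open('../.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, comments, and lines without a key=value pair
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                if key.strip() == 'HF_TOKEN':
                    return value.strip()
    except OSError:
        # .env is missing or unreadable; fall back to the environment variable
        pass
    return os.getenv('HF_TOKEN', '')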

hf_token = load_hf_token()
print(f"HF Token loaded: {'yes' if hf_token else 'no'}")

if not hf_token:
    raise ValueError("HF_TOKEN not found. Please set it in .env file or environment variable")

# Load 500 test samples
print("\nLoading test data...")
df_samples = pd.read_csv('../test_samples_500.csv')
print(f"Loaded {len(df_samples)} samples")
print(f"\nColumns: {df_samples.columns.tolist()}")
print(f"\nSample preview:")
print(df_samples.head(3))

HF Token loaded: yes

Loading test data...
Loaded 500 samples

Columns: ['loan_amnt', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'collections_12_mths_ex_med', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens', 'disbursement_method', 'target']

Sample preview:
   loan_amnt  funded_amnt        term  int_rate  installment grade sub_grade   
0      10000        10000   36 months      6.03       304.36     A        A1  \
1      10000        10000   36 months     13.11       337.47     B        B4   
2       7200         7200   36 months     11.14       236.20     B        B2   

            emp_title emp_length home_ownership  ... 

## Step 3: Define MPNet Embedding Extraction Function

**MPNet-base-v2 Embedding Strategy**:
- Model: `sentence-transformers/all-mpnet-base-v2` (110M parameters)
- Method: Feature extraction via InferenceClient
- Output: 768-dimensional embedding per text
- No special prefix required (unlike E5)
- Official recommendation from sentence-transformers team
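
The extraction function below accepts either a token-level matrix or a ready-made sentence vector from the API. A minimal sketch of that shape handling, run on synthetic arrays rather than real API responses (`pool_embedding` is a hypothetical helper name, not part of the notebook):

```python
import numpy as np

def pool_embedding(result, expected_dim=768):
    """Reduce a feature-extraction result to one vector (hypothetical helper)."""
    arr = np.asarray(result, dtype=np.float32)
    if arr.ndim == 2:
        # Token-level output of shape (seq_len, hidden_dim): average over tokens
        vec = arr.mean(axis=0)
    elif arr.ndim == 1:
        # Already a single sentence embedding
        vec = arr
    else:
        raise ValueError(f"Unexpected embedding shape: {arr.shape}")
    if vec.shape[0] != expected_dim:
        raise ValueError(f"Expected {expected_dim} dimensions, got {vec.shape[0]}")
    return vec

# Synthetic stand-ins for the two response shapes
token_level = np.ones((12, 768), dtype=np.float32)
sentence_level = np.zeros(768, dtype=np.float32)

pooled_tokens = pool_embedding(token_level)
pooled_sentence = pool_embedding(sentence_level)
print(pooled_tokens.shape, pooled_sentence.shape)
```

Plain averaging weights every token equally. The pooling used inside sentence-transformers also applies the attention mask, so the two can differ when padded batches are involved; for a single unpadded text the distinction does not arise.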

In [3]:
def extract_mpnet_embedding(text: str, max_retries: int = 3, retry_delay: int = 3) -> np.ndarray:
    """
    Call HF Inference API to extract MPNet embeddings using InferenceClient
    
    Args:
        text: Input text
        max_retries: Maximum retry attempts
        retry_delay: Retry delay (seconds)
    
    Returns:
        768-dimensional embedding vector
    """
    # Create an InferenceClient routed through the hf-inference provider, authenticated with the HF token
    client = InferenceClient(
        provider="hf-inference",
        api_key=hf_token
    )
    
    for attempt in range(max_retries):
        try:
            # Use feature_extraction method for embeddings
            result = client.feature_extraction(
                text=text,
                model="sentence-transformers/all-mpnet-base-v2"
            )
            
            # Handle the result
            if result is not None:
                # Convert to numpy array
                embeddings_array = np.array(result)
                
                # Handle different output formats
                if len(embeddings_array.shape) == 2:
                    # Shape: (seq_len, hidden_dim) - do mean pooling
                    mean_embedding = np.mean(embeddings_array, axis=0)
                elif len(embeddings_array.shape) == 1:
                    # Already a single embedding vector
                    mean_embedding = embeddings_array
                else:
                    raise ValueError(f"Unexpected embedding shape: {embeddings_array.shape}")
                
                # Verify dimension (MPNet-base-v2 outputs 768 dimensions)
                if len(mean_embedding) == 768:
                    return mean_embedding
                else:
                    raise ValueError(f"Expected 768 dimensions, got {len(mean_embedding)}")
            else:
                raise ValueError("Received None from API")
        
        except Exception as e:
            error_msg = str(e)
            
            # Handle rate limiting
            if "rate" in error_msg.lower() or "429" in error_msg:
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (attempt + 2)
                    print(f"    Rate limited... waiting {wait_time}s")
                    time.sleep(wait_time)
                    continue
                else:
                    raise Exception(f"Rate limited after {max_retries} retries")
            
            # Handle model loading
            elif "loading" in error_msg.lower() or "503" in error_msg:
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (attempt + 1)
                    print(f"    Model loading... waiting {wait_time}s")
                    time.sleep(wait_time)
                    continue
                else:
                    raise Exception(f"Model still loading after {max_retries} retries")
            
            # Other errors - retry
            else:
                if attempt < max_retries - 1:
                    print(f"    Error: {error_msg[:100]} ... retrying")
                    time.sleep(retry_delay)
                    continue
                else:
                    raise
    
    raise Exception("Failed to extract embedding after all retries")

print("\nMPNet-base-v2 embedding extraction function defined (using InferenceClient)")

# Test with a sample
print("\nTesting embedding extraction...")
test_text = "This is a test sentence for embedding extraction."
try:
    test_emb = extract_mpnet_embedding(test_text)
    print(f"✅ Test successful! Embedding shape: {test_emb.shape}")
    print(f"  Dimension: {len(test_emb)}")
    print(f"  Sample values: {test_emb[:5]}")
except Exception as e:
    print(f"❌ Test failed: {str(e)}")
    print("\nNote: First API call may take longer as model loads. This is normal.")


MPNet-base-v2 embedding extraction function defined (using InferenceClient)

Testing embedding extraction...
✅ Test successful! Embedding shape: (768,)
  Dimension: 768
  Sample values: [ 0.00726548 -0.06523006 -0.04566502  0.03256908 -0.01838495]


## Step 4: Batch Extract MPNet Embeddings

**Processing Strategy**:
- Process 500 samples sequentially
- 0.5-second delay between requests (HF Pro allows a shorter delay)
- Automatic retry on errors
- Progress updates every 50 samples
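
The automatic retry relies on the wait times computed inside `extract_mpnet_embedding`. With the defaults (`max_retries=3`, `retry_delay=3`), those waits can be tabulated as follows; `backoff_waits` is a hypothetical helper written only to illustrate the schedule.

```python
def backoff_waits(max_retries=3, retry_delay=3, offset=1):
    """Seconds waited before each retry (hypothetical helper, not in the notebook).

    Mirrors wait_time = retry_delay * (attempt + offset). The final attempt
    raises instead of sleeping, so there are max_retries - 1 waits.
    """
    return [retry_delay * (attempt + offset) for attempt in range(max_retries - 1)]

rate_limit_waits = backoff_waits(offset=2)      # rate-limit (429) branch
model_loading_waits = backoff_waits(offset=1)   # model-loading (503) branch
other_error_waits = [3] * 2                     # generic branch sleeps a constant retry_delay

for label, waits in [
    ("Rate limited", rate_limit_waits),
    ("Model loading", model_loading_waits),
    ("Other errors", other_error_waits),
]:
    print(f"{label:14s} waits={waits} total={sum(waits)}s")
```

In the worst case a single sample therefore spends 15 seconds waiting before it is recorded as failed and replaced with a zero vector.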

In [4]:
print("="*80)
print("Starting MPNet-base-v2 Embeddings Extraction (500 samples)")
print("="*80)
print("\nNote: MPNet-base-v2 is a mid-size model (110M parameters, 768 dimensions).")
print("This is 3.3x larger than MiniLM-L12-v2, with better semantic understanding.")
print("Estimated time with HF Pro: 8-12 minutes\n")

embeddings = []
success_count = 0
error_count = 0
error_indices = []

start_time = time.time()
total_samples = len(df_samples)

for idx, (_, row) in enumerate(df_samples.iterrows(), 1):
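    text = row.get('desc', '')
    # Missing descriptions are read as NaN (a float), so treat any non-string as empty
    if not isinstance(text, str):
        text = ''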
    
    # Skip very short descriptions
    if len(text.strip()) < 10:
        embeddings.append(np.zeros(768))  # 768 dimensions for MPNet
        error_count += 1
        error_indices.append(idx - 1)
        print(f"  [{idx:3d}] Skipped: text too short")
        continue
    
    try:
        # Extract embedding
        emb = extract_mpnet_embedding(text)
        
        if emb is not None and len(emb) == 768:
            embeddings.append(emb)
            success_count += 1
        else:
            embeddings.append(np.zeros(768))
            error_count += 1
            error_indices.append(idx - 1)
            print(f"  [{idx:3d}] Error: Invalid embedding dimension")
    
    except Exception as e:
        embeddings.append(np.zeros(768))
        error_count += 1
        error_indices.append(idx - 1)
        
        # Log first few errors and periodic errors
        if idx <= 10 or error_count % 10 == 1:
            print(f"  [{idx:3d}] ERROR: {str(e)[:80]}")
    
    # Progress report
    if idx % 50 == 0 or idx == total_samples:
        elapsed = time.time() - start_time
        rate = idx / elapsed if elapsed > 0 else 0
        eta = (total_samples - idx) / rate if rate > 0 else 0
        
        progress = idx / total_samples * 100
        print(f"\n[{idx:3d}/{total_samples}] ({progress:5.1f}%) | Success: {success_count}, Failed: {error_count}")
        print(f"  Rate: {rate:.2f} samples/s | Elapsed: {elapsed/60:.1f}min | ETA: {eta/60:.1f}min\n")
    
    # Delay to avoid rate limiting (shorter with HF Pro)
    time.sleep(0.5)

elapsed_total = time.time() - start_time

print("\n" + "="*80)
print("MPNet-base-v2 Embedding Extraction Complete")
print("="*80)
print(f"\nTotal time: {elapsed_total/60:.1f} minutes ({elapsed_total:.1f} seconds)")
print(f"Success: {success_count}/{total_samples} ({success_count/total_samples*100:.1f}%)")
print(f"Failed: {error_count}/{total_samples} ({error_count/total_samples*100:.1f}%)")
print(f"Average rate: {success_count/elapsed_total:.2f} samples/second")

if error_count > 0:
    print(f"\nError indices (first 20): {error_indices[:20]}")

# Convert to numpy array
X = np.array(embeddings)
print(f"\nEmbedding matrix shape: {X.shape}")
print(f"Data type: {X.dtype}")
print(f"Memory usage: {X.nbytes / 1024 / 1024:.2f} MB")
print(f"Value range: [{X.min():.4f}, {X.max():.4f}]")
print(f"Mean: {X.mean():.4f}, Std: {X.std():.4f}")

Starting MPNet-base-v2 Embeddings Extraction (500 samples)

Note: MPNet-base-v2 is a mid-size model (110M parameters, 768 dimensions).
This is 3.3x larger than MiniLM-L12-v2, with better semantic understanding.
Estimated time with HF Pro: 8-12 minutes


[ 50/500] ( 10.0%) | Success: 50, Failed: 0
  Rate: 1.63 samples/s | Elapsed: 0.5min | ETA: 4.6min


[100/500] ( 20.0%) | Success: 100, Failed: 0
  Rate: 1.59 samples/s | Elapsed: 1.1min | ETA: 4.2min


[150/500] ( 30.0%) | Success: 150, Failed: 0
  Rate: 1.59 samples/s | Elapsed: 1.6min | ETA: 3.7min


[200/500] ( 40.0%) | Success: 200, Failed: 0
  Rate: 1.60 samples/s | Elapsed: 2.1min | ETA: 3.1min


[250/500] ( 50.0%) | Success: 250, Failed: 0
  Rate: 1.59 samples/s | Elapsed: 2.6min | ETA: 2.6min


[300/500] ( 60.0%) | Success: 300, Failed: 0
  Rate: 1.59 samples/s | Elapsed: 3.1min | ETA: 2.1min


[350/500] ( 70.0%) | Success: 350, Failed: 0
  Rate: 1.60 samples/s | Elapsed: 3.7min | ETA: 1.6min


[400/500] ( 80.0%) | Success: 400

## Step 5: Save MPNet Embeddings

In [5]:
print("\nSaving MPNet-base-v2 embeddings...")

# Save embeddings
embedding_file = '../mpnet_embeddings_500.npy'
np.save(embedding_file, X)
print(f"\nEmbeddings saved: {embedding_file}")
print(f"  Model: sentence-transformers/all-mpnet-base-v2")
print(f"  Shape: {X.shape}")
print(f"  Dimensions: 768 (2x MiniLM's 384, smaller than BGE's 1024)")
print(f"  File size: {os.path.getsize(embedding_file) / 1024 / 1024:.2f} MB")

# Verify loading
X_loaded = np.load(embedding_file)
print(f"\nVerification: Loaded embeddings shape = {X_loaded.shape}")
assert np.array_equal(X, X_loaded), "Verification failed!"
print("Verification passed ✓")


Saving MPNet-base-v2 embeddings...

Embeddings saved: ../mpnet_embeddings_500.npy
  Model: sentence-transformers/all-mpnet-base-v2
  Shape: (500, 768)
  Dimensions: 768 (2x MiniLM's 384, smaller than BGE's 1024)
  File size: 1.46 MB

Verification: Loaded embeddings shape = (500, 768)
Verification passed ✓
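
The reported file size of 1.46 MB is consistent with a 500x768 matrix of 32-bit floats. The arithmetic can be checked as below; that the stored dtype is `float32` is an inference from the file size, not something the notebook output shows directly.

```python
import numpy as np

n_rows, n_dims = 500, 768

# Raw payload size for each candidate dtype (the .npy header adds a small fixed overhead)
sizes_mb = {}
for dtype in (np.float32, np.float64):
    payload_bytes = n_rows * n_dims * np.dtype(dtype).itemsize
    sizes_mb[np.dtype(dtype).name] = payload_bytes / 1024 / 1024

for name, size in sizes_mb.items():
    print(f"{name}: {size:.2f} MB")
```

One practical consequence: the zero-vector placeholders created with `np.zeros(768)` default to `float64`, so if any sample had failed, `np.array(embeddings)` would have upcast the whole matrix and roughly doubled the file size.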


## Step 6: Generate Statistics Report

In [6]:
# Generate summary report
summary = {
    'phase': '05e - Extract MPNet-base-v2 Embeddings',
    'timestamp': datetime.now().isoformat(),
    'model': 'sentence-transformers/all-mpnet-base-v2',
    'model_parameters': '110M',
    'embedding_dimension': 768,
    'extraction_method': 'HF Inference API (InferenceClient) + Mean Pooling',
    'total_samples': int(total_samples),
    'success_count': int(success_count),
    'error_count': int(error_count),
    'success_rate': f"{success_count/total_samples*100:.2f}%",
    'processing_time_seconds': float(elapsed_total),
    'processing_time_minutes': float(elapsed_total / 60),
    'samples_per_second': float(success_count / elapsed_total if elapsed_total > 0 else 0),
    'embedding_file': embedding_file,
    'embedding_statistics': {
        'mean': float(X.mean()),
        'std': float(X.std()),
        'min': float(X.min()),
        'max': float(X.max()),
        'non_zero_embeddings': int(success_count)
    },
    'comparison_with_other_models': {
        'minilm_parameters': '33M',
        'mpnet_parameters': '110M',
        'bge_parameters': '326M',
        'minilm_dimensions': 384,
        'mpnet_dimensions': 768,
        'bge_dimensions': 1024,
        'parameter_ratio': 'MPNet is 3.3x larger than MiniLM, 3x smaller than BGE',
        'dimension_ratio': 'MPNet has 2x MiniLM dimensions, 0.75x BGE dimensions',
        'expected_comparison': 'MPNet is the official sentence-transformers recommendation for best balance'
    }
}

# Save summary
summary_file = '../05e_mpnet_extraction_summary.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nStatistics report saved: {summary_file}")
print("\n" + "="*80)
print("Summary")
print("="*80)
print(json.dumps(summary, indent=2))


Statistics report saved: ../05e_mpnet_extraction_summary.json

Summary
{
  "phase": "05e - Extract MPNet-base-v2 Embeddings",
  "timestamp": "2025-10-29T13:11:36.150883",
  "model": "sentence-transformers/all-mpnet-base-v2",
  "model_parameters": "110M",
  "embedding_dimension": 768,
  "extraction_method": "HF Inference API (InferenceClient) + Mean Pooling",
  "total_samples": 500,
  "success_count": 500,
  "error_count": 0,
  "success_rate": "100.00%",
  "processing_time_seconds": 326.53772616386414,
  "processing_time_minutes": 5.442295436064402,
  "samples_per_second": 1.5312166403372593,
  "embedding_file": "../mpnet_embeddings_500.npy",
  "embedding_statistics": {
    "mean": -0.0002619049628265202,
    "std": 0.036083441227674484,
    "min": -0.211687833070755,
    "max": 0.17882640659809113,
    "non_zero_embeddings": 500
  },
  "comparison_with_other_models": {
    "minilm_parameters": "33M",
    "mpnet_parameters": "110M",
    "bge_parameters": "326M",
    "minilm_dimensions"

## Step 7: Compare with Other Embeddings (Optional)

In [7]:
# Load other embeddings for comparison
print("\n" + "="*80)
print("Comparison with Other Models")
print("="*80)

comparison_table = []

# MPNet (current)
comparison_table.append({
    'Model': 'MPNet-base-v2',
    'Parameters': '110M',
    'Dimensions': 768,
    'Mean': X.mean(),
    'Std': X.std(),
    'Min': X.min(),
    'Max': X.max()
})

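# Try loading MiniLM
# NOTE: the path below names a DeBERTa file rather than a MiniLM one; confirm the
# intended filename, since a file at this path would be labeled as MiniLM-L12-v2 (384d)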
try:
    X_minilm = np.load('../deberta_embeddings_500.npy')
    comparison_table.append({
        'Model': 'MiniLM-L12-v2',
        'Parameters': '33M',
        'Dimensions': 384,
        'Mean': X_minilm.mean(),
        'Std': X_minilm.std(),
        'Min': X_minilm.min(),
        'Max': X_minilm.max()
    })
    print("✓ MiniLM embeddings loaded")
except FileNotFoundError:
    print("✗ MiniLM embeddings not found")

# Try loading BGE
try:
    X_bge = np.load('../bge_embeddings_500.npy')
    comparison_table.append({
        'Model': 'BGE-Large',
        'Parameters': '326M',
        'Dimensions': 1024,
        'Mean': X_bge.mean(),
        'Std': X_bge.std(),
        'Min': X_bge.min(),
        'Max': X_bge.max()
    })
    print("✓ BGE embeddings loaded")
except FileNotFoundError:
    print("✗ BGE embeddings not found")

# Display comparison
if len(comparison_table) > 1:
    df_comparison = pd.DataFrame(comparison_table)
    print("\nEmbedding Comparison:")
    print(df_comparison.to_string(index=False))
    print("\nNote: Different dimensions mean these embeddings will need separate regression models.")
else:
    print("\nNo other embeddings found for comparison.")


Comparison with Other Models
✗ MiniLM embeddings not found
✓ BGE embeddings loaded

Embedding Comparison:
        Model Parameters  Dimensions      Mean      Std       Min      Max
MPNet-base-v2       110M         768 -0.000262 0.036083 -0.211688 0.178826
    BGE-Large       326M        1024 -0.000080 0.031250 -0.137539 0.257268

Note: Different dimensions mean these embeddings will need separate regression models.
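
The Std column in this table is very close to 1/sqrt(d) for both models (about 0.0361 for d=768 and 0.03125 for d=1024). That is what one would expect if every embedding row has unit L2 norm and a mean near zero. The sketch below reproduces the relationship on random unit vectors; it is an illustration on synthetic data, and the claim that the API returns normalized vectors is an inference rather than something the notebook states.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_rows(n_rows, n_dims):
    """Random vectors scaled to unit L2 norm (synthetic stand-in for embeddings)."""
    x = rng.standard_normal((n_rows, n_dims))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

results = {}
for n_dims in (768, 1024):
    x = unit_rows(500, n_dims)
    # For unit-norm rows, mean(x**2) = 1/d, so std is about 1/sqrt(d) when the mean is near 0
    results[n_dims] = (float(x.std()), 1 / np.sqrt(n_dims))
    print(f"d={n_dims}: std={results[n_dims][0]:.6f}, 1/sqrt(d)={results[n_dims][1]:.6f}")
```

On the real matrices, `np.linalg.norm(X, axis=1)` would confirm or refute this directly.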


## Summary

**Step 05e Complete - MPNet-base-v2 Embeddings**

**Output Files**:
- `mpnet_embeddings_500.npy` - 500x768 MPNet embeddings
- `05e_mpnet_extraction_summary.json` - Extraction statistics

**Model Used**:
- **Name**: sentence-transformers/all-mpnet-base-v2
- **Size**: 110M parameters (3.3x larger than MiniLM)
- **Dimensions**: 768 (2x MiniLM's 384)
- **Specialization**: Official sentence-transformers recommendation for best all-around performance

**Key Features**:
- Better semantic understanding than MiniLM
- More compact than BGE (768 vs 1024 dims)
- Good balance of quality and efficiency
- Official recommendation from sentence-transformers team

**Expected Performance**:
- Predicted R²: 0.22-0.27 (better than MiniLM's 0.19-0.24)
- Feature sparsity: ~90-95% with ElasticNet

**Next Steps**:
1. Run `05f_train_elasticnet_mpnet.ipynb` to train regression models
2. Compare: MiniLM (384d) vs MPNet (768d) vs BGE (1024d) performance
3. Evaluate if the 3.3x parameter increase translates to better OCEAN prediction