# 05e - Extract MiniLM Embeddings (500 Test Samples)

**Purpose**: Extract all-MiniLM-L12-v2 embeddings for 500 test samples using the Hugging Face (HF) Inference API

**Why MiniLM-L12-v2?**
- all-MiniLM-L12-v2: 33M parameters, 384 dimensions, fast and efficient
- BGE-Large (for comparison): 326M parameters, 1024 dimensions
- all-MiniLM-L12-v2 is fine-tuned specifically for sentence embeddings
- It is published under the widely used sentence-transformers project

**Input Files**:
- test_samples_500.csv - 500 loan descriptions
- ocean_ground_truth/ - OCEAN ground truth (for consistency check)

**Output Files**:
- deberta_embeddings_500.npy - MiniLM embeddings matrix (500x384)
- 05e_deberta_extraction_summary.json - Extraction statistics report

**Notes**:
- File names keep the "deberta" prefix for compatibility, but the embeddings come from the MiniLM model
- Output dimension is 384 (different from BGE's 1024)

**Estimated Time**: Approximately 10-15 minutes (500 API calls, smaller model = faster)
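
As a rough check on that estimate, the arithmetic can be sketched as follows. The 1-second delay is the one used in the batch loop in Step 4; the per-call latency values (0.2 s and 0.8 s) are assumptions chosen for illustration, not measurements.

```python
def estimate_minutes(n_calls: int, delay_s: float, latency_s: float) -> float:
    """Wall-clock minutes for sequential calls with a fixed delay after each one."""
    return n_calls * (delay_s + latency_s) / 60.0

# 500 calls with the 1 s inter-request delay from Step 4.
# The latency values are hypothetical and only bracket the estimate.
low = estimate_minutes(500, delay_s=1.0, latency_s=0.2)
high = estimate_minutes(500, delay_s=1.0, latency_s=0.8)
print(f"{low:.0f}-{high:.0f} minutes")
```

Retries and model cold starts are not modelled here, so real runs may take longer.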

## Step 1: Import Libraries and Setup

In [None]:
import pandas as pd
import numpy as np
import requests
import json
import os
import time
from datetime import datetime
import warnings
from huggingface_hub import InferenceClient
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")
print(f"Timestamp: {datetime.now()}")

## Step 2: Load HF Token and Test Data

In [None]:
# Load HF Token
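def load_hf_token():
    """Read HF_TOKEN from ../.env, falling back to the environment variable."""
    try:
        with open('../.env', 'r') as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and lines without a key=value pair
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                if key.strip() == 'HF_TOKEN':
                    # Drop optional surrounding quotes
                    return value.strip().strip('"\'')
    except OSError:
        # .env missing or unreadable: fall back to the environment
        pass
    return os.getenv('HF_TOKEN', '')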

hf_token = load_hf_token()
print(f"HF Token loaded: {'yes' if hf_token else 'no'}")

if not hf_token:
    raise ValueError("HF_TOKEN not found. Please set it in .env file or environment variable")

# Load 500 test samples
print("\nLoading test data...")
df_samples = pd.read_csv('../test_samples_500.csv')
print(f"Loaded {len(df_samples)} samples")
print(f"\nColumns: {df_samples.columns.tolist()}")
print(f"\nSample preview:")
print(df_samples.head(3))

## Step 3: Define MiniLM Embedding Extraction Function

**MiniLM-L12-v2 Embedding Strategy**:
- Model: `sentence-transformers/all-MiniLM-L12-v2` (33M parameters)
- Method: Feature extraction via InferenceClient
- Output: 384-dimensional embedding per text
- No special prefix required (unlike E5)
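
The pooling step that the extraction function relies on can be illustrated in isolation. This is a minimal sketch with random data standing in for API output; the helper name `pool_embedding` is hypothetical, and it assumes the response is either a per-token matrix of shape (seq_len, 384) or an already pooled vector of shape (384,).

```python
import numpy as np

EMBED_DIM = 384  # output width of all-MiniLM-L12-v2

def pool_embedding(result) -> np.ndarray:
    """Reduce an API response to one vector (hypothetical helper)."""
    arr = np.asarray(result, dtype=np.float64)
    if arr.ndim == 2:
        # (seq_len, hidden_dim): average over the token axis
        vec = arr.mean(axis=0)
    elif arr.ndim == 1:
        # Already a single sentence vector
        vec = arr
    else:
        raise ValueError(f"Unexpected embedding shape: {arr.shape}")
    if vec.shape[0] != EMBED_DIM:
        raise ValueError(f"Expected {EMBED_DIM} dimensions, got {vec.shape[0]}")
    return vec

rng = np.random.default_rng(0)
token_level = rng.normal(size=(12, EMBED_DIM))  # stand-in for 12 token vectors
pooled = pool_embedding(token_level)
print(pooled.shape)
```

Checking the dimension after pooling catches a wrong model or an unexpected response format before it reaches the saved matrix.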

In [None]:
def extract_deberta_embedding(text: str, max_retries: int = 3, retry_delay: int = 3) -> np.ndarray:
    """
    Call HF Inference API to extract MiniLM embeddings using InferenceClient
    
    Args:
        text: Input text
        max_retries: Maximum retry attempts
        retry_delay: Retry delay (seconds)
    
    Returns:
        384-dimensional embedding vector
    """
    # Create InferenceClient with HF Pro provider
    client = InferenceClient(
        provider="hf-inference",
        api_key=hf_token
    )
    
    for attempt in range(max_retries):
        try:
            # Use feature_extraction method for embeddings
            result = client.feature_extraction(
                text=text,
                model="sentence-transformers/all-MiniLM-L12-v2"
            )
            
            # Handle the result
            if result is not None:
                # Convert to numpy array
                embeddings_array = np.array(result)
                
                # Handle different output formats
                if len(embeddings_array.shape) == 2:
                    # Shape: (seq_len, hidden_dim) - do mean pooling
                    mean_embedding = np.mean(embeddings_array, axis=0)
                elif len(embeddings_array.shape) == 1:
                    # Already a single embedding vector
                    mean_embedding = embeddings_array
                else:
                    raise ValueError(f"Unexpected embedding shape: {embeddings_array.shape}")
                
                # Verify dimension (MiniLM-L12-v2 outputs 384 dimensions)
                if len(mean_embedding) == 384:
                    return mean_embedding
                else:
                    raise ValueError(f"Expected 384 dimensions, got {len(mean_embedding)}")
            else:
                raise ValueError("Received None from API")
        
        except Exception as e:
            error_msg = str(e)
            
            # Handle rate limiting
            if "rate" in error_msg.lower() or "429" in error_msg:
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (attempt + 2)
                    print(f"    Rate limited... waiting {wait_time}s")
                    time.sleep(wait_time)
                    continue
                else:
                    raise Exception(f"Rate limited after {max_retries} retries")
            
            # Handle model loading
            elif "loading" in error_msg.lower() or "503" in error_msg:
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (attempt + 1)
                    print(f"    Model loading... waiting {wait_time}s")
                    time.sleep(wait_time)
                    continue
                else:
                    raise Exception(f"Model still loading after {max_retries} retries")
            
            # Other errors - retry
            else:
                if attempt < max_retries - 1:
                    print(f"    Error: {error_msg[:100]} ... retrying")
                    time.sleep(retry_delay)
                    continue
                else:
                    raise
    
    raise Exception("Failed to extract embedding after all retries")

print("\nMiniLM-L12-v2 embedding extraction function defined (using InferenceClient)")

# Test with a sample
print("\nTesting embedding extraction...")
test_text = "This is a test sentence for embedding extraction."
try:
    test_emb = extract_deberta_embedding(test_text)
    print(f"✅ Test successful! Embedding shape: {test_emb.shape}")
    print(f"  Dimension: {len(test_emb)}")
    print(f"  Sample values: {test_emb[:5]}")
except Exception as e:
    print(f"❌ Test failed: {str(e)}")
    print("\nNote: First API call may take longer as model loads. This is normal.")

## Step 4: Batch Extract MiniLM Embeddings

**Processing Strategy**:
- Process 500 samples sequentially
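- 1-second delay between requests to avoid rate limiting
- Automatic retry on errors (up to 3 attempts per sample)
- Skipped or failed samples are stored as zero vectors so row order matches the input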
- Progress updates every 50 samples
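
The retry behaviour listed above can be sketched independently of the API. This minimal example assumes a linear backoff (delay multiplied by the attempt number, as the notebook does for the "model loading" case) and injects the sleep function so it can be exercised without waiting. The names `call_with_retries` and `flaky` are hypothetical and are not part of the notebook or of `huggingface_hub`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with a linearly growing wait."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(retry_delay * (attempt + 1))
    raise RuntimeError("max_retries must be at least 1")

# Demonstration with a function that fails twice, then succeeds.
calls = {"n": 0}
waits = []  # records requested waits instead of actually sleeping

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the default behaviour unchanged while making the backoff schedule easy to inspect.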

In [None]:
print("="*80)
print("Starting MiniLM-L12-v2 Embeddings Extraction (500 samples)")
print("="*80)
print("\nNote: MiniLM-L12-v2 is a compact model (33M parameters, 384 dimensions).")
print("This should be faster than larger models.")
print("Estimated time: 10-15 minutes\n")

embeddings = []
success_count = 0
error_count = 0
error_indices = []

start_time = time.time()
total_samples = len(df_samples)

for idx, (_, row) in enumerate(df_samples.iterrows(), 1):
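    text = row.get('desc', '')
    # Missing descriptions load as NaN (a float); normalise to an empty string
    text = '' if pd.isna(text) else str(text)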
    
    # Skip very short descriptions
    if len(text.strip()) < 10:
        embeddings.append(np.zeros(384))  # 384 dimensions for MiniLM
        error_count += 1
        error_indices.append(idx - 1)
        print(f"  [{idx:3d}] Skipped: text too short")
        continue
    
    try:
        # Extract embedding
        emb = extract_deberta_embedding(text)
        
        if emb is not None and len(emb) == 384:
            embeddings.append(emb)
            success_count += 1
        else:
            embeddings.append(np.zeros(384))
            error_count += 1
            error_indices.append(idx - 1)
            print(f"  [{idx:3d}] Error: Invalid embedding dimension")
    
    except Exception as e:
        embeddings.append(np.zeros(384))
        error_count += 1
        error_indices.append(idx - 1)
        
        # Log first few errors and periodic errors
        if idx <= 10 or error_count % 10 == 1:
            print(f"  [{idx:3d}] ERROR: {str(e)[:80]}")
    
    # Progress report
    if idx % 50 == 0 or idx == total_samples:
        elapsed = time.time() - start_time
        rate = idx / elapsed if elapsed > 0 else 0
        eta = (total_samples - idx) / rate if rate > 0 else 0
        
        progress = idx / total_samples * 100
        print(f"\n[{idx:3d}/{total_samples}] ({progress:5.1f}%) | Success: {success_count}, Failed: {error_count}")
        print(f"  Rate: {rate:.2f} samples/s | Elapsed: {elapsed/60:.1f}min | ETA: {eta/60:.1f}min\n")
    
    # Delay to avoid rate limiting (shorter for smaller model)
    time.sleep(1.0)

elapsed_total = time.time() - start_time

print("\n" + "="*80)
print("MiniLM-L12-v2 Embedding Extraction Complete")
print("="*80)
print(f"\nTotal time: {elapsed_total/60:.1f} minutes ({elapsed_total:.1f} seconds)")
print(f"Success: {success_count}/{total_samples} ({success_count/total_samples*100:.1f}%)")
print(f"Failed: {error_count}/{total_samples} ({error_count/total_samples*100:.1f}%)")
print(f"Average rate: {success_count/elapsed_total:.2f} samples/second")

if error_count > 0:
    print(f"\nError indices (first 20): {error_indices[:20]}")

# Convert to numpy array
X = np.array(embeddings)
print(f"\nEmbedding matrix shape: {X.shape}")
print(f"Data type: {X.dtype}")
print(f"Memory usage: {X.nbytes / 1024 / 1024:.2f} MB")
print(f"Value range: [{X.min():.4f}, {X.max():.4f}]")
print(f"Mean: {X.mean():.4f}, Std: {X.std():.4f}")

## Step 5: Save MiniLM Embeddings

In [None]:
print("\nSaving MiniLM-L12-v2 embeddings...")

# Save embeddings (keeping filename as deberta for compatibility)
embedding_file = '../deberta_embeddings_500.npy'
np.save(embedding_file, X)
print(f"\nEmbeddings saved: {embedding_file}")
print(f"  Model: sentence-transformers/all-MiniLM-L12-v2")
print(f"  Shape: {X.shape}")
print(f"  Dimensions: 384 (note: different from BGE's 1024)")
print(f"  File size: {os.path.getsize(embedding_file) / 1024 / 1024:.2f} MB")

# Verify loading
X_loaded = np.load(embedding_file)
print(f"\nVerification: Loaded embeddings shape = {X_loaded.shape}")
assert np.array_equal(X, X_loaded), "Verification failed!"
print("Verification passed ✓")

## Step 6: Generate Statistics Report

In [None]:
# Generate summary report
summary = {
    'phase': '05e - Extract MiniLM-L12-v2 Embeddings',
    'timestamp': datetime.now().isoformat(),
    'model': 'sentence-transformers/all-MiniLM-L12-v2',
    'model_parameters': '33M',
    'embedding_dimension': 384,
    'extraction_method': 'HF Inference API (InferenceClient) + Mean Pooling',
    'total_samples': int(total_samples),
    'success_count': int(success_count),
    'error_count': int(error_count),
    'success_rate': f"{success_count/total_samples*100:.2f}%",
    'processing_time_seconds': float(elapsed_total),
    'processing_time_minutes': float(elapsed_total / 60),
    'samples_per_second': float(success_count / elapsed_total if elapsed_total > 0 else 0),
    'embedding_file': embedding_file,
    'embedding_statistics': {
        'mean': float(X.mean()),
        'std': float(X.std()),
        'min': float(X.min()),
        'max': float(X.max()),
        'non_zero_embeddings': int(np.any(X != 0, axis=1).sum())
    },
    'comparison_with_bge': {
        'bge_parameters': '326M',
        'minilm_parameters': '33M',
        'bge_dimensions': 1024,
        'minilm_dimensions': 384,
        'parameter_ratio': 'MiniLM is 10x smaller',
        'dimension_ratio': 'MiniLM has 2.7x fewer dimensions',
        'expected_comparison': 'MiniLM is compact but efficient, good for rapid experimentation'
    }
}

# Save summary
summary_file = '../05e_deberta_extraction_summary.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nStatistics report saved: {summary_file}")
print("\n" + "="*80)
print("Summary")
print("="*80)
print(json.dumps(summary, indent=2))

## Step 7: Compare with BGE Embeddings (Optional)

In [None]:
# Load BGE embeddings for comparison
try:
    print("\nLoading BGE embeddings for comparison...")
    X_bge = np.load('../bge_embeddings_500.npy')
    
    print(f"\nEmbedding Comparison:")
    print(f"{'Metric':<20} {'BGE':>15} {'MiniLM-L12':>15}")
    print(f"{'-'*50}")
    print(f"{'Shape':<20} {str(X_bge.shape):>15} {str(X.shape):>15}")
    print(f"{'Mean':<20} {X_bge.mean():>15.4f} {X.mean():>15.4f}")
    print(f"{'Std':<20} {X_bge.std():>15.4f} {X.std():>15.4f}")
    print(f"{'Min':<20} {X_bge.min():>15.4f} {X.min():>15.4f}")
    print(f"{'Max':<20} {X_bge.max():>15.4f} {X.max():>15.4f}")
    
    print(f"\nNote: Cosine similarity between BGE and MiniLM vectors is undefined because the dimensions differ (1024 vs 384)")
    print(f"Both embeddings will be evaluated separately for OCEAN prediction.")
    
except FileNotFoundError:
    print("\nBGE embeddings not found. Skipping comparison.")

## Summary

**Step 05e Complete - MiniLM-L12-v2 Embeddings**

**Output Files**:
- `deberta_embeddings_500.npy` - 500x384 MiniLM embeddings
- `05e_deberta_extraction_summary.json` - Extraction statistics

**Model Used**:
- **Name**: sentence-transformers/all-MiniLM-L12-v2
- **Size**: 33M parameters (10x smaller than BGE)
- **Dimensions**: 384 (vs BGE's 1024)
- **Specialization**: Compact sentence embeddings, part of sentence-transformers

**Key Features**:
- Fast extraction (smaller model)
- Compact size; embedding quality on this task is assessed in the 05f notebooks
- Different dimensionality from BGE, so separate Ridge/ElasticNet models are required

**Important Note**:
- This model outputs **384 dimensions**, not BGE's 1024
- Regression models trained on BGE embeddings cannot be reused; train separate models for this embedding
- Or you can stick with BGE (1024 dims) if dimension consistency is important
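
Because skipped or failed samples are stored as all-zero rows in `deberta_embeddings_500.npy`, a downstream notebook may want to drop them before fitting the 384-dimensional models. The sketch below uses synthetic data and a closed-form ridge solution in NumPy as a stand-in for whatever the `05f` notebooks actually do; the targets, the `alpha` value, and the function name `fit_ridge` are assumptions for illustration.

```python
import numpy as np

def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge weights: (X^T X + alpha*I)^-1 X^T Y (no intercept)."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 384))   # synthetic stand-in for the saved matrix
X[[3, 17, 250]] = 0.0             # simulate three failed extractions
Y = rng.normal(size=(500, 5))     # synthetic stand-in for five OCEAN scores

valid = np.any(X != 0, axis=1)    # True for rows that hold a real embedding
W = fit_ridge(X[valid], Y[valid], alpha=1.0)

print(f"Kept {valid.sum()} of {len(X)} rows; weight matrix shape {W.shape}")
```

The weight matrix has 384 rows, which is why a model fitted on 1024-dimensional BGE embeddings cannot be applied to these vectors.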

**Next Steps**:
1. Run `05f_train_ridge_all_models.ipynb` with MiniLM embeddings (384 dims)
2. Run `05f_train_elasticnet_all_models.ipynb` with MiniLM embeddings
3. Compare: BGE (1024d) vs MiniLM (384d) performance on OCEAN prediction