# 🚀 IntelliMatch AI - Local Training Pipeline

This notebook runs the training tasks on your local PC (CPU or GPU):
1. Generate embeddings for roughly 2,500 parsed resumes (2,484 in the run recorded here) ✨
2. Build a FAISS index for similarity search
3. Fine-tune a DistilBERT classifier for experience level
4. Test the fine-tuned classifier

**Prerequisites**:
- Parsed resumes at `data/training/parsed_resumes_all.json`
- Python packages: torch, transformers, sentence-transformers, faiss-cpu, scikit-learn, pandas, tqdm, numpy (Task 3 also uses accelerate, which the `Trainer` depends on)

**Runtime**: Works on CPU (GPU optional)

---

## ✨ **Features:**
- **Dynamic Skill Extraction**: Extracts more skills per resume than the earlier pipeline (7.3 on average across the first 100 resumes in this run, versus 4-5 before)
- **CPU-friendly**: No GPU required (a GPU is used if available)
- **Local execution**: No cloud/Colab dependencies
- **Incremental processing**: Saves checkpoints so an interrupted training run can resume

## 🔧 Setup & Environment Check

In [3]:
# Check environment and GPU availability (optional)
import sys
import torch
from pathlib import Path

print("🔍 Environment Check")
print("=" * 60)
print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    device = "cuda"
else:
    print("⚠️  No GPU detected - will use CPU (slower but works)")
    device = "cpu"

print(f"\n✅ Device for training: {device}")
print("=" * 60)

🔍 Environment Check
Python: 3.13.5
PyTorch: 2.9.0+cpu
CUDA available: False
⚠️  No GPU detected - will use CPU (slower but works)

✅ Device for training: cpu


In [4]:
# Check required packages (install if missing)
import subprocess
import sys

required_packages = [
    'transformers',
    'sentence-transformers', 
    'faiss-cpu',  # Use faiss-cpu for compatibility
    'scikit-learn',
    'pandas',
    'tqdm',
    'numpy'
]

print("📦 Checking required packages...")
missing = []

for package in required_packages:
    try:
        # Caveat: this imports by pip name, but faiss-cpu is imported as `faiss` and
        # scikit-learn as `sklearn`, so both are reported missing even when installed.
        __import__(package.replace('-', '_'))
        print(f"✅ {package}")
    except ImportError:
        print(f"❌ {package} - MISSING")
        missing.append(package)

if missing:
    print(f"\n⚠️  Missing packages: {', '.join(missing)}")
    print("Run this command to install:")
    print(f"   pip install {' '.join(missing)}")
else:
    print("\n✅ All packages installed!")

📦 Checking required packages...
✅ transformers
✅ sentence-transformers
❌ faiss-cpu - MISSING
❌ scikit-learn - MISSING
✅ pandas
✅ tqdm
✅ numpy

⚠️  Missing packages: faiss-cpu, scikit-learn
Run this command to install:
   pip install faiss-cpu scikit-learn
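
The two MISSING results above are false negatives: `faiss-cpu` and `scikit-learn` are pip distribution names, while the modules are imported as `faiss` and `sklearn` (the next cell imports FAISS successfully). Below is a minimal sketch of a check that maps pip names to import names and uses `importlib.util.find_spec`, so nothing is actually imported. The `IMPORT_NAMES` table and the `find_missing` helper are illustrative names, not part of the project.

```python
from importlib.util import find_spec

# pip distribution name -> importable module name (only where they differ)
IMPORT_NAMES = {
    "faiss-cpu": "faiss",
    "scikit-learn": "sklearn",
    "sentence-transformers": "sentence_transformers",
}

def find_missing(packages):
    """Return the pip names whose top-level module cannot be located."""
    missing = []
    for package in packages:
        module_name = IMPORT_NAMES.get(package, package.replace("-", "_"))
        if find_spec(module_name) is None:
            missing.append(package)
    return missing

required = ["transformers", "sentence-transformers", "faiss-cpu",
            "scikit-learn", "pandas", "tqdm", "numpy"]
missing = find_missing(required)
print("Missing:", ", ".join(missing) if missing else "none")
```

Keeping the mapping explicit means the install hint can still print the pip name while the check uses the import name.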


In [5]:
# Verify FAISS installation
print("🔍 Verifying FAISS installation...")
print("-" * 60)

try:
    import faiss
    import numpy as np
    
    print("✅ FAISS imported successfully!")
    
    # Check version
    if hasattr(faiss, '__version__'):
        print(f"   Version: {faiss.__version__}")
    
    # Test basic functionality
    test_data = np.random.random((100, 128)).astype('float32')
    index = faiss.IndexFlatL2(128)
    index.add(test_data)
    
    print(f"   Functionality Test: ✅ PASSED ({index.ntotal} vectors indexed)")
    print("\n🎉 FAISS is ready to use!")

except ImportError:
    print("❌ FAISS not installed!")
    print("   Install with: pip install faiss-cpu")
except Exception as e:
    print(f"❌ FAISS test failed: {e}")

print("-" * 60)

🔍 Verifying FAISS installation...
------------------------------------------------------------
✅ FAISS imported successfully!
   Version: 1.12.0
   Functionality Test: ✅ PASSED (100 vectors indexed)

🎉 FAISS is ready to use!
------------------------------------------------------------


In [6]:
# Set up project paths (local PC)
import sys
from pathlib import Path

# Get project root (notebook is in notebooks/ folder)
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(PROJECT_ROOT))

# Define data paths
DATA_DIR = PROJECT_ROOT / 'data'
TRAINING_DIR = DATA_DIR / 'training'
EMBEDDINGS_DIR = DATA_DIR / 'embeddings'
MODELS_DIR = PROJECT_ROOT / 'models'

# Create directories if they don't exist
EMBEDDINGS_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR.mkdir(parents=True, exist_ok=True)

print("📁 Project Paths:")
print(f"   Project Root: {PROJECT_ROOT}")
print(f"   Data Directory: {DATA_DIR}")
print(f"   Embeddings Output: {EMBEDDINGS_DIR}")
print(f"   Models Output: {MODELS_DIR}")

# Check if parsed resumes exist
PARSED_RESUMES_FILE = TRAINING_DIR / 'parsed_resumes_all.json'
if PARSED_RESUMES_FILE.exists():
    print(f"\n✅ Parsed resumes found: {PARSED_RESUMES_FILE}")
else:
    print(f"\n⚠️  Parsed resumes not found at: {PARSED_RESUMES_FILE}")
    print("   Run train_on_all_resumes.py first to generate this file")

📁 Project Paths:
   Project Root: d:\CKXJ\ML\TD1
   Data Directory: d:\CKXJ\ML\TD1\data
   Embeddings Output: d:\CKXJ\ML\TD1\data\embeddings
   Models Output: d:\CKXJ\ML\TD1\models

✅ Parsed resumes found: d:\CKXJ\ML\TD1\data\training\parsed_resumes_all.json


## 📊 Load Parsed Data

In [7]:
import json
import pandas as pd
from pathlib import Path

# Use the path defined in previous cell
DATA_PATH = PARSED_RESUMES_FILE  # This was set in the previous cell

print("📂 Loading parsed resume data...")
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    resumes = json.load(f)

print(f"✅ Loaded {len(resumes)} resumes")
print(f"\n📊 Sample resume keys: {list(resumes[0].keys())[:10]}")

# Quick stats
categories = [r.get('category', 'Unknown') for r in resumes]
df = pd.DataFrame({'category': categories})
print("\n📈 Resumes by category:")
print(df['category'].value_counts())

📂 Loading parsed resume data...
✅ Loaded 2484 resumes

📊 Sample resume keys: ['text', 'extraction_method', 'success', 'error', 'metadata', 'file_name', 'file_size', 'file_type', 'file_path', 'char_count']

📈 Resumes by category:
category
INFORMATION-TECHNOLOGY    120
BUSINESS-DEVELOPMENT      120
ACCOUNTANT                118
ADVOCATE                  118
CHEF                      118
ENGINEERING               118
FINANCE                   118
AVIATION                  117
FITNESS                   117
SALES                     116
HEALTHCARE                115
CONSULTANT                115
BANKING                   115
CONSTRUCTION              112
PUBLIC-RELATIONS          111
HR                        110
DESIGNER                  107
ARTS                      103
TEACHER                   102
APPAREL                    97
DIGITAL-MEDIA              96
AGRICULTURE                63
AUTOMOBILE                 36
BPO                        22
Name: count, dtype: int64

In [8]:
# Quick analysis of skills
print("\n📊 Skill Statistics:")
sample = resumes[:100]  # Sample the first 100 resumes
resumes_with_skills = 0
skill_counts = []

for resume in sample:
    # `skills` may be missing or None, so fall back to an empty dict
    skills = (resume.get('skills') or {}).get('all_skills', [])
    if skills:
        resumes_with_skills += 1
        skill_counts.append(len(skills))

if skill_counts:
    pct = 100 * resumes_with_skills / len(sample)
    print(f"   Resumes with skills: {resumes_with_skills}/{len(sample)} ({pct:.0f}%)")
    print(f"   Average skills per resume: {sum(skill_counts)/len(skill_counts):.1f}")
    print(f"   Min skills: {min(skill_counts)}")
    print(f"   Max skills: {max(skill_counts)}")
    print("\n✨ Much better than the old 4-5 skills per resume!")


📊 Skill Statistics:
   Resumes with skills: 100/100 (100%)
   Average skills per resume: 7.3
   Min skills: 1
   Max skills: 22

✨ Much better than the old 4-5 skills per resume!


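## 🧠 Task 1: Generate Embeddings for All Resumes

Use `sentence-transformers` to create one semantic embedding per resume. Each embedding is computed from a condensed text that combines:
- Name and summary
- Skills (all skills, plus technical skills)
- Up to five experience entries and up to three education entries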

In [9]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm.auto import tqdm

# Load model (uses GPU automatically if available)
print("🔄 Loading sentence-transformers model...")
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions, fast
# Alternative: 'all-mpnet-base-v2' (768 dimensions, more accurate but slower)

print(f"✅ Model loaded on: {model.device}")
print(f"📏 Embedding dimensions: {model.get_sentence_embedding_dimension()}")

🔄 Loading sentence-transformers model...
✅ Model loaded on: cpu
📏 Embedding dimensions: 384


In [10]:
def extract_resume_text(resume_data):
    """Extract meaningful text from resume for embedding"""
    parts = []
    
    # Personal info
    if resume_data.get('name'):
        parts.append(str(resume_data['name']))
    
    # Summary
    if resume_data.get('summary'):
        parts.append(str(resume_data['summary']))
    
    # Skills (the extractor now returns many more per resume) ✨
    if resume_data.get('skills') and resume_data['skills'].get('all_skills'):
        skills = resume_data['skills']['all_skills']
        if skills:
            # Include more skills now that we extract them properly
            parts.append("Skills: " + ", ".join(str(s) for s in skills[:50]))  # Top 50 skills
    
    # Technical skills specifically (new category)
    if resume_data.get('skills'):
        by_cat = resume_data['skills'].get('by_category', {})
        tech_skills = by_cat.get('technical', [])
        if tech_skills:
            parts.append("Technical: " + ", ".join(str(s) for s in tech_skills[:20]))
    
    # Experience
    if resume_data.get('experience'):
        for exp in resume_data['experience'][:5]:  # Top 5 experiences
            exp_parts = []
            if exp.get('title'):
                exp_parts.append(str(exp['title']))
            if exp.get('company'):
                exp_parts.append("at " + str(exp['company']))
            
            if exp_parts:
                exp_text = " ".join(exp_parts)
                if exp.get('description'):
                    exp_text += ". " + str(exp['description'])[:200]  # First 200 chars
                parts.append(exp_text)
    
    # Education
    if resume_data.get('education'):
        for edu in resume_data['education'][:3]:  # Top 3 degrees
            edu_parts = []
            if edu.get('degree'):
                edu_parts.append(str(edu['degree']))
            if edu.get('field'):
                edu_parts.append("in " + str(edu['field']))
            if edu.get('institution'):
                edu_parts.append("from " + str(edu['institution']))
            
            if edu_parts:
                parts.append(" ".join(edu_parts))
    
    # Join all parts, filtering out any empty strings
    return " ".join(p for p in parts if p)

# Test extraction
sample_text = extract_resume_text(resumes[0])
print(f"📄 Sample extracted text ({len(sample_text)} chars):")
print(sample_text[:400] + "..." if len(sample_text) > 400 else sample_text)
print("\n✅ Now including comprehensive skill data for better semantic matching!")


📄 Sample extracted text (116 chars):
Accountant City Skills: Accounting, Communication, Critical Thinking, Leadership, Organization Technical: Accounting

✅ Now including comprehensive skill data for better semantic matching!


In [11]:
# Generate embeddings for all resumes
# Note: the progress messages below were written for a GPU runtime. In the CPU run
# recorded here, encoding all texts took about 40 seconds.
print(f"\n🔄 Generating embeddings for {len(resumes)} resumes...")
print("⏱️  This will take 5-10 minutes with GPU\n")

resume_texts = []
resume_ids = []

for resume in tqdm(resumes, desc="Extracting text"):
    text = extract_resume_text(resume)
    # Skipping empty texts means row i of `embeddings` is not necessarily resumes[i];
    # resume_ids records which file each row came from.
    if text.strip():  # Only include non-empty
        resume_texts.append(text)
        # Fall back to `file_path` when `file_path_original` is absent
        resume_ids.append(resume.get('file_path_original') or resume.get('file_path', ''))

print(f"✅ Extracted {len(resume_texts)} valid texts")

# Generate embeddings in batches (runs on GPU when available, otherwise CPU)
print("\n🔄 Encoding with GPU...")
embeddings = model.encode(
    resume_texts,
    batch_size=32,  # Lower this if you run out of memory
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\n✅ Generated embeddings shape: {embeddings.shape}")
print(f"✅ Memory size: {embeddings.nbytes / 1e6:.2f} MB")


🔄 Generating embeddings for 2484 resumes...
⏱️  This will take 5-10 minutes with GPU

Extracting text: 100%|██████████| 2484/2484 [00:00<00:00, 140141.12it/s]

✅ Extracted 2480 valid texts

🔄 Encoding with GPU...

Batches: 100%|██████████| 78/78 [00:38<00:00,  2.05it/s]

✅ Generated embeddings shape: (2480, 384)
✅ Memory size: 3.81 MB
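
Because 4 of the 2,484 resumes produced empty text, row `i` of `embeddings` does not always correspond to `resumes[i]`. The sketch below shows one way to keep the mapping by recording the source position of every embedded text. `collect_texts` and the toy records are hypothetical, and the `extract` argument stands in for `extract_resume_text`.

```python
def collect_texts(records, extract):
    """Return (texts, source_rows) for records whose extracted text is non-empty."""
    texts, source_rows = [], []
    for row, record in enumerate(records):
        text = extract(record)
        if text.strip():
            texts.append(text)
            source_rows.append(row)  # position in the original list
    return texts, source_rows

# Toy records: the second one yields no usable text and is skipped
records = [
    {"summary": "Accountant", "category": "ACCOUNTANT"},
    {"summary": "   ", "category": "CHEF"},
    {"summary": "Pilot", "category": "AVIATION"},
]
texts, source_rows = collect_texts(records, lambda r: r.get("summary", ""))

# Embedding row 1 maps back to original record 2, not record 1
embedding_row = 1
original = records[source_rows[embedding_row]]
print(texts, source_rows, original["category"])
```

Any later lookup by embedding row (for example, a search result index) would go through `source_rows` before touching the original list.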


In [12]:
# Save embeddings
output_data = {
    'embeddings': embeddings.tolist(),
    'resume_ids': resume_ids,
    'model': 'all-MiniLM-L6-v2',
    'dimensions': embeddings.shape[1],
    'count': len(embeddings)
}

output_file = 'resume_embeddings.json'
with open(output_file, 'w') as f:
    json.dump(output_data, f)

print(f"✅ Saved embeddings to: {output_file}")
print("📥 Download this file and place in models/embeddings/")

# Also save as numpy for faster loading
np.save('resume_embeddings.npy', embeddings)
print("✅ Also saved as resume_embeddings.npy (faster loading)")

✅ Saved embeddings to: resume_embeddings.json
📥 Download this file and place in models/embeddings/
✅ Also saved as resume_embeddings.npy (faster loading)
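
The 3.81 MB reported for the embedding array follows directly from its shape and element size. A quick check, assuming the values are float32 (4 bytes each), which is what the reported figure implies:

```python
rows, dims = 2480, 384      # shape printed by the embedding cell
bytes_per_value = 4         # float32

size_bytes = rows * dims * bytes_per_value
size_mb = size_bytes / 1e6  # same 1e6 divisor as the notebook's print
print(f"{size_bytes} bytes = {size_mb:.2f} MB")
```

The `.npy` file is this payload plus a small header, while the JSON copy stores every value as decimal text and is considerably larger.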


## 🔍 Task 2: Build FAISS Index for Fast Similarity Search

In [13]:
import faiss
import numpy as np

# Check FAISS installation
print(f"FAISS version: {faiss.__version__ if hasattr(faiss, '__version__') else 'unknown'}")
has_gpu_support = hasattr(faiss, 'StandardGpuResources')
print(f"GPU support: {'✅ Available' if has_gpu_support else '❌ Not available (using CPU)'}")

print("\n🔧 Building FAISS index...")

# Normalize embeddings for cosine similarity
embeddings_normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Create index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner Product (cosine similarity after normalization)

# Try to add to GPU if available
using_gpu = False
if has_gpu_support and torch.cuda.is_available():
    try:
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)
        using_gpu = True
        print("✅ FAISS index on GPU (faster)")
    except Exception as e:
        print(f"⚠️  GPU allocation failed: {e}")
        print("   Falling back to CPU...")
        using_gpu = False

if not using_gpu:
    print("⚠️  FAISS index on CPU (slower but works)")

# Add embeddings
index.add(embeddings_normalized.astype('float32'))

print(f"✅ FAISS index built with {index.ntotal} vectors")

# Test search
print("\n🧪 Testing similarity search...")
query = embeddings_normalized[0:1]  # Use first resume as query
D, I = index.search(query.astype('float32'), k=5)  # Find top 5 similar

# Index rows follow the embedded (non-empty) resumes, not the full `resumes` list,
# so rebuild that filtered list before looking up categories.
embedded_resumes = [r for r in resumes if extract_resume_text(r).strip()]

print("\nTop 5 similar resumes to resume 0:")
for rank, (idx, score) in enumerate(zip(I[0], D[0]), 1):
    print(f"  {rank}. Resume {idx}: similarity = {score:.3f}")
    print(f"     Category: {embedded_resumes[idx].get('category', 'Unknown')}")

FAISS version: 1.12.0
GPU support: ❌ Not available (using CPU)

🔧 Building FAISS index...
⚠️  FAISS index on CPU (slower but works)
✅ FAISS index built with 2480 vectors

🧪 Testing similarity search...

Top 5 similar resumes to resume 0:
  1. Resume 0: similarity = 1.000
     Category: ACCOUNTANT
  2. Resume 101: similarity = 0.829
     Category: ACCOUNTANT
  3. Resume 23: similarity = 0.817
     Category: ACCOUNTANT
  4. Resume 74: similarity = 0.816
     Category: ACCOUNTANT
  5. Resume 18: similarity = 0.794
     Category: ACCOUNTANT
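
`IndexFlatIP` ranks by inner product, and the inner product of two unit-length vectors equals their cosine similarity, which is why the embeddings are L2-normalized before being added to the index. A plain-Python illustration with made-up 3-dimensional vectors (no FAISS required):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(f"cosine = {cosine(a, b):.3f}, inner product after normalization = {ip_after_norm:.3f}")
```

A normalized vector's inner product with itself is 1, which matches the rank-1 result in the search output above.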


In [14]:
# Save FAISS index
print("💾 Saving FAISS index...")

# Move back to CPU for saving (if it was on GPU)
if using_gpu:
    try:
        index_cpu = faiss.index_gpu_to_cpu(index)
        print("   Moved index from GPU to CPU for saving")
    except Exception:
        index_cpu = index
else:
    index_cpu = index

faiss.write_index(index_cpu, 'resume_faiss_index.bin')
print("✅ Saved FAISS index to: resume_faiss_index.bin")
print(f"   Size: {index_cpu.ntotal} vectors × {dimension} dimensions")
print("📥 Download and place in models/embeddings/")

💾 Saving FAISS index...
✅ Saved FAISS index to: resume_faiss_index.bin
   Size: 2480 vectors × 384 dimensions
📥 Download and place in models/embeddings/


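## 🎯 Task 3: Fine-tune BERT for Resume Classification

Fine-tune DistilBERT to predict experience level (Entry/Mid/Senior/Expert). The labels come from a keyword and job-count heuristic, so the classifier learns to reproduce that heuristic. Resume quality (1-10) and job category prediction are not trained in this notebook.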

In [15]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import torch.nn as nn

# Function to calculate experience level from resume (simplified)
def calculate_experience_level(resume):
    """Calculate experience level based on number of jobs and text signals"""
    experience_entries = resume.get('experience', [])
    
    # Count jobs
    num_jobs = len(experience_entries)
    
    # Get all text to check for level indicators
    text = extract_resume_text(resume).lower()
    
    # Check for explicit level indicators in text.
    # Note: these are substring checks, so 'lead' also matches 'leadership' and
    # 'intern' matches 'international'; the first matching branch wins.
    if any(word in text for word in ['entry', 'junior', 'graduate', 'intern', 'associate']):
        return 'entry'
    elif any(word in text for word in ['senior', 'lead', 'principal', 'staff']):
        return 'senior'
    elif any(word in text for word in ['expert', 'architect', 'director', 'vp', 'chief']):
        return 'expert'
    
    # Use number of jobs as fallback heuristic
    if num_jobs <= 1:
        return 'entry'
    elif num_jobs <= 3:
        return 'mid'
    elif num_jobs <= 5:
        return 'senior'
    else:
        return 'expert'

# Prepare training data for experience level classification
print("📊 Preparing training data...")

training_data = []
for resume in resumes:
    text = extract_resume_text(resume)
    
    # Calculate experience level
    exp_level = calculate_experience_level(resume)
    
    if text.strip():  # Just need text
        training_data.append({
            'text': text[:512],  # First 512 characters; the tokenizer later enforces the 512-token limit
            'label': exp_level,
            'category': resume.get('category', 'Unknown')
        })

print(f"✅ {len(training_data)} samples for training")

if len(training_data) == 0:
    print("\n⚠️  No training data created!")
    print("Checking first resume structure:")
    print(f"   Keys: {list(resumes[0].keys())}")
    print(f"   Experience entries: {len(resumes[0].get('experience', []))}")
    print(f"   Sample text length: {len(extract_resume_text(resumes[0]))}")
else:
    # Create label mapping
    label_map = {'entry': 0, 'mid': 1, 'senior': 2, 'expert': 3}
    reverse_label_map = {v: k for k, v in label_map.items()}

    # Encode labels
    for d in training_data:
        d['label_id'] = label_map[d['label']]

    print("\n📈 Label distribution:")
    labels_df = pd.DataFrame([d['label'] for d in training_data], columns=['label'])
    print(labels_df['label'].value_counts())

📊 Preparing training data...
✅ 2480 samples for training

📈 Label distribution:
label
entry     1380
senior    1064
expert      36
Name: count, dtype: int64
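
The keyword checks in `calculate_experience_level` use `word in text`, which is a substring test: `lead` also matches `leadership` (a skill in the sample resume printed earlier) and `intern` matches `international`. The sketch below shows whole-word matching with `re`. `LEVEL_KEYWORDS` and `keyword_level` are illustrative names, and the keywords and check order simply mirror the cell above.

```python
import re

# Same keywords and check order as calculate_experience_level above
LEVEL_KEYWORDS = [
    ("entry", ["entry", "junior", "graduate", "intern", "associate"]),
    ("senior", ["senior", "lead", "principal", "staff"]),
    ("expert", ["expert", "architect", "director", "vp", "chief"]),
]

def keyword_level(text):
    """Return the first level whose keyword appears as a whole word, else None."""
    lowered = text.lower()
    for level, keywords in LEVEL_KEYWORDS:
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, lowered):
            return level
    return None

print(keyword_level("Skills: Accounting, Leadership, Organization"))  # no whole-word match
print(keyword_level("Team lead for the payments group"))              # matches 'lead'
```

Switching the labelling function to this kind of matching would change the label distribution, so the classifier would need retraining afterwards.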


In [16]:
# Split data
train_data, val_data = train_test_split(training_data, test_size=0.2, random_state=42)
print(f"\n📊 Train: {len(train_data)}, Validation: {len(val_data)}")

# Load tokenizer and model
model_name = 'distilbert-base-uncased'  # Faster than BERT, good performance
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_map)
)

print(f"✅ Loaded {model_name}")

# Tokenize
def tokenize_data(data):
    return tokenizer(
        [d['text'] for d in data],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )

print("🔄 Tokenizing...")
train_encodings = tokenize_data(train_data)
val_encodings = tokenize_data(val_data)
print("✅ Tokenization complete")


📊 Train: 1984, Validation: 496


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded distilbert-base-uncased
🔄 Tokenizing...
✅ Tokenization complete


In [17]:
# Create PyTorch dataset
class ResumeDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ResumeDataset(train_encodings, [d['label_id'] for d in train_data])
val_dataset = ResumeDataset(val_encodings, [d['label_id'] for d in val_data])

print(f"✅ Datasets created: {len(train_dataset)} train, {len(val_dataset)} val")

✅ Datasets created: 1984 train, 496 val


In [18]:
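# Training arguments
# Note: this run has 372 optimizer steps in total (1,984 samples / 16 per batch x 3 epochs),
# fewer than the warmup_steps and save_steps values below, so the learning rate is still
# ramping up when training ends and the 500-step save interval is never reached.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="steps",  # Named evaluation_strategy in older transformers releases
    eval_steps=100,
    save_steps=500,
    load_best_model_at_end=True,
    use_cpu=not torch.cuda.is_available(),  # Replaces the deprecated no_cuda flag
)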

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print("🚀 Starting training...")
print("⏱️  This will take 10-20 minutes on CPU\n")

trainer.train()

print("\n✅ Training complete!")



🚀 Starting training...
⏱️  This will take 10-20 minutes on CPU



Step,Training Loss,Validation Loss
100,0.6936,0.680022
200,0.1682,0.216703
300,0.0788,0.108209



✅ Training complete!
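
The logged steps (100, 200, 300) reflect how short this run is. Below is a rough calculation of the step budget, assuming a single device, no gradient accumulation, and the linear warmup schedule, all of which are `TrainingArguments` defaults as far as this notebook configures them.

```python
import math

train_samples = 1984   # from the train/validation split
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
warmup_steps = 500
save_steps = 500

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# With linear warmup, the learning rate at step s is about peak * s / warmup_steps
# while s < warmup_steps, so this approximates the largest fraction of the peak reached.
peak_fraction = min(1.0, total_steps / warmup_steps)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup finishes during training: {warmup_steps < total_steps}")
print(f"save interval reached during training: {save_steps <= total_steps}")
print(f"largest fraction of peak learning rate: {peak_fraction:.2f}")
```

Under these assumptions both 500-step settings exceed the 372-step run, so smaller `warmup_steps` and `save_steps` values would be needed for them to take effect.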


In [19]:
# Workaround: reload transformers after installing accelerate in a running kernel.
# Restarting the kernel is more reliable, because submodules such as
# transformers.trainer stay cached in sys.modules after the deletions below.
import sys

# Clear cached top-level modules
if 'accelerate' in sys.modules:
    del sys.modules['accelerate']
if 'transformers' in sys.modules:
    del sys.modules['transformers']

# Reimport
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

print("✅ Reloaded transformers with accelerate support")

✅ Reloaded transformers with accelerate support


In [20]:
# Evaluate
print("📊 Evaluating model...")
results = trainer.evaluate()
print("\n✅ Validation Results:")
for key, value in results.items():
    print(f"   {key}: {value:.4f}")

# Save model
model.save_pretrained('./experience_classifier')
tokenizer.save_pretrained('./experience_classifier')
print("\n✅ Model saved to: ./experience_classifier")
print("📥 Download and place in models/")

📊 Evaluating model...

✅ Validation Results:
   eval_loss: 0.0757
   eval_runtime: 48.6962
   eval_samples_per_second: 10.1860
   eval_steps_per_second: 0.6370
   epoch: 3.0000

✅ Model saved to: ./experience_classifier
📥 Download and place in models/
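
`trainer.evaluate()` reports only the loss here because no metric function was supplied. The sketch below is an accuracy metric that could be passed as `Trainer(..., compute_metrics=compute_metrics)`. It assumes the `Trainer` hands the function a `(logits, labels)` pair, and it also accepts plain lists so it can be tried without a trained model.

```python
def compute_metrics(eval_pred):
    """Accuracy from raw logits; works with NumPy arrays or nested lists."""
    logits, labels = eval_pred
    logits = logits.tolist() if hasattr(logits, "tolist") else logits
    labels = labels.tolist() if hasattr(labels, "tolist") else labels
    # argmax over each row of logits
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}

# Toy check: 4 classes, 3 samples, 2 of them predicted correctly
toy_logits = [[2.0, 0.1, 0.3, 0.0], [0.2, 0.1, 1.5, 0.0], [0.9, 0.1, 0.3, 0.2]]
toy_labels = [0, 2, 3]
print(compute_metrics((toy_logits, toy_labels)))
```

Because the labels come from a keyword heuristic, accuracy computed this way measures agreement with that heuristic rather than true seniority.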


## 📈 Task 4: Test the Fine-tuned Model

In [21]:
# Test predictions
from transformers import pipeline

classifier = pipeline('text-classification', model='./experience_classifier', tokenizer=tokenizer)

test_resumes = [
    "Senior Software Engineer with 8 years of experience in Python, Java, and cloud technologies. Led teams of 5+ developers.",
    "Recent Computer Science graduate with internship experience. Proficient in Python and JavaScript.",
    "Distinguished architect with 15+ years building enterprise systems. Expert in system design and leadership."
]

print("🧪 Testing model predictions:\n")
for i, text in enumerate(test_resumes, 1):
    result = classifier(text[:512])[0]
    predicted_label = reverse_label_map[int(result['label'].split('_')[-1])]
    print(f"{i}. {text[:80]}...")
    print(f"   Predicted: {predicted_label.upper()} (confidence: {result['score']:.2%})\n")

Device set to use cpu


🧪 Testing model predictions:

1. Senior Software Engineer with 8 years of experience in Python, Java, and cloud t...
   Predicted: SENIOR (confidence: 99.72%)

2. Recent Computer Science graduate with internship experience. Proficient in Pytho...
   Predicted: ENTRY (confidence: 99.48%)

3. Distinguished architect with 15+ years building enterprise systems. Expert in sy...
   Predicted: SENIOR (confidence: 99.74%)
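
The third example is predicted SENIOR even though its wording suggests EXPERT. That is consistent with the training labels, where only 36 of 2,480 samples are `expert`. One common mitigation is to weight the loss by inverse class frequency. The sketch below only computes such weights from the label counts printed in Task 3, using the "balanced" formula `n_samples / (n_classes * count)`; wiring them into training would need a custom weighted loss and is not shown.

```python
# Label counts from the Task 3 label distribution
label_counts = {"entry": 1380, "senior": 1064, "expert": 36}

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

weights = balanced_class_weights(label_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

The rare `expert` class receives a much larger weight than the other two, which is the intended effect of this scheme.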



## 📦 Summary & Download Files

The cells above write these files to the notebook's working directory. Copy them into your project:

1. **resume_embeddings.npy** → `models/embeddings/`
2. **resume_faiss_index.bin** → `models/embeddings/`
3. **experience_classifier/** (folder) → `models/`
4. **resume_embeddings.json** (optional, backup)

Then update your local code to load these models (trained on CPU in the run recorded here).

In [None]:
# Create zip for easy download
!zip -r intellimatch_gpu_models.zip resume_embeddings.npy resume_faiss_index.bin experience_classifier/
print("✅ Created intellimatch_gpu_models.zip")
print("📥 Download using the file browser on the left")

  adding: resume_embeddings.npy (172 bytes security) (deflated 8%)
  adding: resume_faiss_index.bin (172 bytes security) (deflated 8%)
  adding: experience_classifier/ (260 bytes security) (stored 0%)
  adding: experience_classifier/config.json (172 bytes security) (deflated 52%)
  adding: experience_classifier/model.safetensors (172 bytes security) (deflated 8%)
  adding: experience_classifier/special_tokens_map.json (172 bytes security) (deflated 43%)
  adding: experience_classifier/tokenizer.json (172 bytes security) (deflated 71%)
  adding: experience_classifier/tokenizer_config.json (172 bytes security) (deflated 76%)
  adding: experience_classifier/vocab.txt (172 bytes security) (deflated 53%)
✅ Created intellimatch_gpu_models.zip
📥 Download using the file browser on the left

