# CVInsight Step-by-Step Tutorial

Welcome to CVInsight! This tutorial walks through resume analysis step by step, showing how the system works and what each component does.

## 🎯 **Learning Objectives**
- Understand CVInsight's unified extractor architecture
- Learn to analyze individual resumes in detail
- Master job-specific relevance scoring
- Process multiple resumes systematically
- Export and interpret results

## 👥 **Who This Is For**
- New users learning CVInsight
- Developers integrating resume analysis
- HR professionals understanding candidate scoring
- Students learning NLP and resume processing

## 📚 **Tutorial Structure**
1. **Setup & Initialization** - Get CVInsight running
2. **Single Resume Analysis** - Deep dive into one resume
3. **Understanding the Fields** - Explore all 21 analysis fields
4. **Job Relevance Scoring** - How job descriptions affect results
5. **Batch Processing** - Handle multiple resumes systematically
6. **Results Interpretation** - Make sense of the data
7. **Integration & Export** - Use results in your applications

## 🔧 **External Repository Integration**
```bash
# Clone CVInsight to your projects directory
git clone https://github.com/your-username/CVInsight.git

# Install dependencies
cd CVInsight
pip install -r requirements.txt

# Set up your OpenAI API key
export OPEN_AI_API_KEY="your-key-here"

# Now you can use CVInsight from any notebook!
```

In [1]:
# Step 1: Setup and Imports
print("🚀 STEP 1: SETTING UP CVINSIGHT")
print("=" * 50)

import os
import sys
import pandas as pd
import json
from pathlib import Path
from datetime import datetime

# Add CVInsight to your Python path
# Adjust this path to where you cloned CVInsight
CVINSIGHT_PATH = "/Users/samcelarek/Documents/CVInsight"
if CVINSIGHT_PATH not in sys.path:
    sys.path.insert(0, CVINSIGHT_PATH)

print(f"📁 CVInsight path: {CVINSIGHT_PATH}")

# Import CVInsight functions
try:
    from cvinsight.notebook_utils import (
        initialize_client,
        find_resumes,
        parse_single_resume,
        parse_many_resumes
    )
    print("✅ CVInsight successfully imported!")
    print("📦 Available functions:")
    print("   • initialize_client() - Set up the AI client")
    print("   • find_resumes() - Discover resume files")
    print("   • parse_single_resume() - Analyze one resume")
    print("   • parse_many_resumes() - Batch process multiple resumes")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("💡 Make sure CVInsight is cloned and the path is correct!")
    raise

🚀 STEP 1: SETTING UP CVINSIGHT
📁 CVInsight path: /Users/samcelarek/Documents/CVInsight
✅ CVInsight successfully imported!
📦 Available functions:
   • initialize_client() - Set up the AI client
   • find_resumes() - Discover resume files
   • parse_single_resume() - Analyze one resume
   • parse_many_resumes() - Batch process multiple resumes
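
If the import in Step 1 fails, the cause is usually a wrong clone path or a missing API key. The sketch below checks both before importing. `check_setup` is a hypothetical helper, not part of CVInsight, and the `cvinsight` folder name is inferred from the `from cvinsight.notebook_utils import ...` line above.

```python
import os
from pathlib import Path


def check_setup(repo_path, env=None):
    """Return a list of problems; an empty list means the basics look fine.

    Hypothetical helper for this tutorial, not a CVInsight function.
    """
    env = os.environ if env is None else env
    problems = []
    repo = Path(repo_path)
    if not repo.is_dir():
        problems.append(f"CVInsight directory not found: {repo}")
    elif not (repo / "cvinsight").is_dir():
        # Assumption: the importable package lives in a 'cvinsight' folder.
        problems.append(f"No 'cvinsight' package folder inside: {repo}")
    if not env.get("OPEN_AI_API_KEY"):
        problems.append("OPEN_AI_API_KEY is not set")
    return problems


# Replace the placeholder path with the location of your clone.
for problem in check_setup("/path/to/CVInsight"):
    print(f"⚠️  {problem}")
```

An empty result means the path and key are in place; anything printed points at what to fix before re-running Step 1.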


In [2]:
# Step 2: Initialize the CVInsight Client
print("\n🔧 STEP 2: INITIALIZING THE AI CLIENT")
print("=" * 50)

# Get your OpenAI API key
api_key = os.environ.get("OPEN_AI_API_KEY")
if not api_key:
    print("🔑 API key not found in environment variables")
    print("💡 You can either:")
    print("   1. Set it as an environment variable: export OPEN_AI_API_KEY='your-key'")
    print("   2. Enter it manually below")
    # getpass keeps the key from being echoed into the notebook output
    from getpass import getpass
    api_key = getpass("Enter your OpenAI API key: ")

print("🔄 Initializing CVInsight client...")

# Initialize the client
client = initialize_client(api_key=api_key)

# Check what extractors are available
extractors = list(client._plugin_manager.extractors.keys())
print(f"📊 Found {len(extractors)} extractors:")
for extractor in extractors:
    print(f"   • {extractor}")

# Verify the unified extractor is available
if 'extended_analysis_extractor' in extractors:
    print("\n✅ Unified extractor is ready!")
    print("🚀 This extractor provides all 21 analysis fields in a single call")
else:
    print("\n❌ Unified extractor not found!")
    print("💡 Make sure you have the latest CVInsight version")


🔧 STEP 2: INITIALIZING THE AI CLIENT
🔄 Initializing CVInsight client...
📊 Found 6 extractors:
   • profile_extractor
   • skills_extractor
   • education_extractor
   • experience_extractor
   • yoe_extractor
   • extended_analysis_extractor

✅ Unified extractor is ready!
🚀 This extractor provides all 21 analysis fields in a single call


In [3]:
# Step 3: Discover Resume Files
print("\n📁 STEP 3: FINDING RESUME FILES")
print("=" * 40)

# Set your resume directory path
# Adjust this to where your resume files are located
resume_dir = "/Users/samcelarek/Documents/CVInsight/Resumes"  # Change this path!

print(f"🔍 Searching for resumes in: {resume_dir}")

# Find all resume files
resume_paths = find_resumes(resume_dir)

print(f"📄 Found {len(resume_paths)} resume files")

if resume_paths:
    print("\n📋 Resume files discovered:")
    for i, path in enumerate(resume_paths[:10], 1):  # Show first 10
        filename = os.path.basename(path)
        file_size = os.path.getsize(path) / 1024  # Size in KB
        print(f"   {i:2d}. {filename} ({file_size:.1f} KB)")
    
    if len(resume_paths) > 10:
        print(f"   ... and {len(resume_paths) - 10} more files")
    
    print(f"\n📊 Supported formats: PDF, DOC, DOCX")
    print(f"💡 CVInsight will extract text from all these formats automatically")
    
else:
    print("❌ No resume files found!")
    print("💡 Make sure to:")
    print("   • Check the resume_dir path is correct")
    print("   • Add some PDF, DOC, or DOCX resume files")
    print("   • Ensure the files are readable")


📁 STEP 3: FINDING RESUME FILES
🔍 Searching for resumes in: /Users/samcelarek/Documents/CVInsight/Resumes
📄 Found 21 resume files

📋 Resume files discovered:
    1. 2023-08-28 - Wesley Ordonez Resume Wesley Ordonez 2023 (Data Analytics).pdf (85.0 KB)
    2. 2023-08-20 - Weihao Chen Resume Resume_Weihao Chen.pdf (95.4 KB)
    3. 2023-08-26 - Akhil Bukkapuram Resume Akhil_ds.pdf (200.7 KB)
    4. 2023-08-28 - Brian Warras Resume BrianWarras.pdf (236.5 KB)
    5. 2023-08-22 - Stefani Sanchez Resume Stefani Sanchez Resume.pdf (61.6 KB)
    6. Gaurav_Kumar.pdf (309.0 KB)
    7. 2023-08-25 - Ka Yee Yuen Resume Resume_Claire-Yuen.pdf (91.9 KB)
    8. 2023-08-16 - Niveditha Channapatna Raju Resume nivedithacr_resume.pdf (74.2 KB)
    9. 2023-08-20 - Bryan Aguilar Resume Bryan's Resume.pdf (80.5 KB)
   10. 2023-08-23 - Eliana Mugar Resume Eliana Mugar_NLPCS.pdf (94.6 KB)
   ... and 11 more files

📊 Supported formats: PDF, DOC, DOCX
💡 CVInsight will extract text from all these formats automatically
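
To make the discovery step concrete, here is a minimal sketch of extension-based file discovery using only the standard library. It is an illustration of the idea, not the implementation of `find_resumes()`; the name `discover_resumes` and the choice to skip subdirectories are assumptions.

```python
from pathlib import Path

# The three formats named in the Step 3 output.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx"}


def discover_resumes(directory):
    """Return sorted paths of supported resume files directly inside `directory`.

    Hypothetical stand-in for illustration; CVInsight's own find_resumes()
    may behave differently (for example, it might search recursively).
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Comparing `path.suffix.lower()` means files such as `resume.PDF` are picked up as well.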

In [4]:
# Step 4: Create a Job Description
print("\n📋 STEP 4: DEFINING THE JOB DESCRIPTION")
print("=" * 50)

# Create a detailed job description
# This will be used to calculate relevance scores
job_description = """
Data Scientist Position - Healthcare Technology

We are seeking a talented Data Scientist to join our healthcare technology team. The ideal candidate will have:

Required Qualifications:
• Bachelor's or Master's degree in Data Science, Statistics, Computer Science, or related field
• 3+ years of experience in data analysis and machine learning
• Strong programming skills in Python and R
• Experience with SQL and database management
• Knowledge of statistical analysis and hypothesis testing
• Proficiency with machine learning libraries (scikit-learn, pandas, numpy)
• Experience with data visualization tools (matplotlib, seaborn, Tableau)

Preferred Qualifications:
• Healthcare or medical data analysis experience
• Experience with deep learning frameworks (TensorFlow, PyTorch)
• Cloud platform experience (AWS, Azure, GCP)
• Knowledge of clinical data standards (HL7, FHIR)
• Experience with A/B testing and experimental design
• Publication record in data science or healthcare

Responsibilities:
• Analyze large healthcare datasets to extract insights
• Build predictive models for patient outcomes
• Collaborate with clinical teams to understand data requirements
• Develop automated reporting and monitoring systems
• Present findings to stakeholders and leadership

This role offers the opportunity to make a meaningful impact on patient care through data-driven insights.
"""

print(f"📝 Job description created ({len(job_description)} characters)")
print("\n🎯 Job Focus: Data Scientist - Healthcare Technology")
print("💡 Key requirements: Python, R, ML, Healthcare data, 3+ years experience")

# Today's date for resume submission
submission_date = datetime.now().strftime("%Y-%m-%d")
print(f"📅 Analysis date: {submission_date}")

print("\n✅ Job description ready for relevance scoring!")


📋 STEP 4: DEFINING THE JOB DESCRIPTION
📝 Job description created (1388 characters)

🎯 Job Focus: Data Scientist - Healthcare Technology
💡 Key requirements: Python, R, ML, Healthcare data, 3+ years experience
📅 Analysis date: 2025-05-29

✅ Job description ready for relevance scoring!
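
Step 4 uses today's date as the submission date, but most of the filenames listed in Step 3 start with an ISO date such as `2023-08-28`. If that prefix is the date the resume was received (an assumption about this dataset), it could be passed as `date_of_resume_submission` instead. A small sketch, with `submission_date_from_filename` as a hypothetical helper:

```python
import re
from datetime import datetime

# Matches a leading YYYY-MM-DD, as in "2023-08-28 - Wesley Ordonez Resume ...pdf".
DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2})\b")


def submission_date_from_filename(filename, fallback):
    """Return the leading ISO date of `filename`, or `fallback` if there is none.

    Hypothetical helper; it only assumes the naming pattern seen in Step 3.
    """
    match = DATE_PREFIX.match(filename)
    if match:
        try:
            # Reject strings that look like dates but are not valid ones.
            datetime.strptime(match.group(1), "%Y-%m-%d")
            return match.group(1)
        except ValueError:
            pass
    return fallback


today = datetime.now().strftime("%Y-%m-%d")
print(submission_date_from_filename("Gaurav_Kumar.pdf", today))  # no prefix, so today's date
```

Files without a date prefix, like `Gaurav_Kumar.pdf`, fall back to whatever default you supply.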


In [5]:
# Step 5: Analyze Your First Resume
print("\n🔍 STEP 5: ANALYZING A SINGLE RESUME IN DETAIL")
print("=" * 60)

if not resume_paths:
    print("❌ No resumes available for analysis")
    print("💡 Please add resume files to your resume directory first")
else:
    # Select the first resume for detailed analysis
    test_resume = resume_paths[0]
    resume_filename = os.path.basename(test_resume)
    
    print(f"📄 Analyzing: {resume_filename}")
    print("⏳ This may take 30-60 seconds for the AI analysis...")
    
    # Parse the resume
    import time
    start_time = time.time()
    
    result = parse_single_resume(
        client=client,
        resume_path=test_resume,
        date_of_resume_submission=submission_date,
        job_description=job_description
    )
    
    processing_time = time.time() - start_time
    print(f"⏱️  Processing completed in {processing_time:.2f} seconds")
    print(f"📊 Extracted {len(result)} fields from the resume")


🔍 STEP 5: ANALYZING A SINGLE RESUME IN DETAIL
📄 Analyzing: 2023-08-28 - Wesley Ordonez Resume Wesley Ordonez 2023 (Data Analytics).pdf
⏳ This may take 30-60 seconds for the AI analysis...
⏱️  Processing completed in 36.01 seconds
📊 Extracted 47 fields from the resume
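
Steps 5 and 8 both time their work with a `start_time` / `time.time()` pair. If you repeat that pattern often, a context manager keeps it in one place. This is an optional sketch; `timed` is a hypothetical helper and not part of CVInsight.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"⏱️  {label} completed in {elapsed:.2f} seconds")


timings = {}
with timed("demo step", timings):
    sum(range(1000))  # stand-in for a call such as parse_single_resume(...)
```

The `finally` clause means the elapsed time is still reported when the enclosed call raises.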


In [6]:
# Step 6: Understanding the Basic Information
print("\n👤 STEP 6: BASIC CANDIDATE INFORMATION")
print("=" * 50)

if 'result' in locals():
    print("📋 PERSONAL DETAILS:")
    print(f"   • Name: {result.get('name', 'Not found')}")
    print(f"   • Email: {result.get('email', 'Not found')}")
    print(f"   • Phone: {result.get('contact_number', 'Not found')}")
    print(f"   • Location: {result.get('location', 'Not specified')}")
    
    # Skills
    skills = result.get('skills', [])
    if skills:
        print(f"\n🛠️  SKILLS ({len(skills)} found):")
        # Show first 10 skills
        for i, skill in enumerate(skills[:10], 1):
            print(f"   {i:2d}. {skill}")
        if len(skills) > 10:
            print(f"   ... and {len(skills) - 10} more skills")
    
    # Education
    educations = result.get('educations', [])
    if educations:
        print(f"\n🎓 EDUCATION ({len(educations)} entries):")
        for i, edu in enumerate(educations, 1):
            degree = edu.get('degree', 'Unknown degree')
            institution = edu.get('institution', 'Unknown institution')
            year = edu.get('graduation_year', 'Unknown year')
            print(f"   {i}. {degree}")
            print(f"      🏫 {institution} ({year})")
    
    # Work Experience
    work_experiences = result.get('work_experiences', [])
    if work_experiences:
        print(f"\n💼 WORK EXPERIENCE ({len(work_experiences)} positions):")
        for i, work in enumerate(work_experiences[:5], 1):  # Show first 5
            title = work.get('title', work.get('position', 'Unknown position'))
            company = work.get('company', 'Unknown company')
            duration = work.get('duration', 'Unknown duration')
            print(f"   {i}. {title}")
            print(f"      🏢 {company} | ⏱️  {duration}")
        if len(work_experiences) > 5:
            print(f"   ... and {len(work_experiences) - 5} more positions")
    
    print("\n✅ Basic information extracted successfully!")
    
else:
    print("❌ No resume analysis available. Run Step 5 first.")


👤 STEP 6: BASIC CANDIDATE INFORMATION
📋 PERSONAL DETAILS:
   • Name: Wesley Ordoñez
   • Email: wesordonez1@gmail.com
   • Phone: 
   • Location: 

🛠️  SKILLS (14 found):
    1. CSS
    2. HTML
    3. JavaScript
    4. Python
    5. SQL
    6. R
    7. Tableau
    8. Microsoft Power BI
    9. Microsoft Suite
   10. Google Suite
   ... and 4 more skills

🎓 EDUCATION (2 entries):
   1. Bachelor of Science in Mechanical Engineering (concentration: Design Engineering)
      🏫 Rose-Hulman Institute of Technology (Unknown year)
   2. Data Science and Machine Learning, Google Data Analytics Professional Certificate
      🏫 Online Coursework (Udemy/Google) (Unknown year)

💼 WORK EXPERIENCE (3 positions):
   1. Unknown position
      🏢 Puerto Rican Cultural Center | ⏱️  Unknown duration
   2. Unknown position
      🏢 Versatech LLC | ⏱️  Unknown duration
   3. Unknown position
      🏢 ZF Automotive Group | ⏱️  Unknown duration

✅ Basic information extracted successfully!
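
The output above shows two gaps worth understanding. `Phone` and `Location` print as blank because `dict.get()` only uses its default when the key is absent, not when the value is an empty string. And every position prints as `Unknown position`, which suggests the work entries use key names other than `title` or `position`. The sketch below handles both cases; `first_present` is a hypothetical helper, and the `role` key in the example is made up for illustration.

```python
def first_present(record, keys, default):
    """Return the first non-empty value found under any of `keys`, else `default`."""
    for key in keys:
        value = record.get(key)
        # Treat None, empty strings, and empty containers as "not provided".
        if value not in (None, "", [], {}):
            return value
    return default


# An empty string now falls through to the default.
print(first_present({"contact_number": ""}, ("contact_number",), "Not found"))

# Try several candidate key names; 'role' here is purely illustrative.
work = {"company": "Versatech LLC", "role": "Engineer"}
print(first_present(work, ("title", "position", "role"), "Unknown position"))
```

To find the real key names, printing `work_experiences[0].keys()` on your own result is the most direct check.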


In [None]:
# Step 7: Understanding the Unified Extractor Fields
print("\n🔬 STEP 7: EXPLORING THE 21 UNIFIED EXTRACTOR FIELDS")
print("=" * 65)

if 'result' in locals():
    print("📊 The unified extractor provides 21 specialized analysis fields:")
    print("   This replaces multiple separate API calls with one comprehensive analysis")
    
    # Group fields by category for better understanding
    experience_fields = {
        'wyoe': 'Total work experience (years)',
        'relevant_wyoe': 'Relevant work experience (years)',
        'eyoe': 'Education experience (years)',
        'relevant_eyoe': 'Relevant education (years)',
        'total_relevant_yoe': 'Total relevant experience (years)',
        'average_tenure_at_company_years': 'Average company tenure (years)'
    }
    
    education_fields = {
        'highest_degree': 'Highest degree attained',
        'highest_degree_major': 'Major/field of study',
        'highest_degree_status': 'Degree status (completed/in-progress)',
        'highest_degree_school_prestige': 'School prestige level'
    }
    
    career_fields = {
        'highest_seniority_level': 'Career seniority level',
        'primary_position_title': 'Primary job title/role'
    }
    
    contact_fields = {
        'linkedin_url': 'LinkedIn profile URL',
        'github_url': 'GitHub profile URL',
        'personal_website_url': 'Personal website URL'
    }
    
    # Display experience analysis
    print(f"\n📈 EXPERIENCE ANALYSIS:")
    for field, description in experience_fields.items():
        value = result.get(field, 'Not calculated')
        print(f"   • {description}: {value}")
    
    # Display education analysis
    print(f"\n🎓 EDUCATION ANALYSIS:")
    for field, description in education_fields.items():
        value = result.get(field, 'Not found')
        print(f"   • {description}: {value}")
    
    # Display career analysis
    print(f"\n💼 CAREER ANALYSIS:")
    for field, description in career_fields.items():
        value = result.get(field, 'Not determined')
        print(f"   • {description}: {value}")
    
    # Display contact information
    print(f"\n📞 PROFESSIONAL PROFILES:")
    for field, description in contact_fields.items():
        value = result.get(field, 'Not found')
        if value and value != 'Not found':
            print(f"   • {description}: {value}")
        else:
            print(f"   • {description}: Not available")
    
    # Relevance calculation explanation
    print(f"\n🎯 RELEVANCE SCORING EXPLANATION:")
    # `or 0` also covers fields that are present but set to None
    total_work = result.get('wyoe') or 0
    relevant_work = result.get('relevant_wyoe') or 0
    if total_work > 0:
        relevance_percentage = (relevant_work / total_work) * 100
        print(f"   • Work relevance: {relevance_percentage:.1f}% ({relevant_work:.1f}/{total_work:.1f} years)")
        
        if relevance_percentage >= 80:
            print("   🟢 Excellent match for this role!")
        elif relevance_percentage >= 60:
            print("   🟡 Good match with some transferable skills")
        elif relevance_percentage >= 40:
            print("   🟠 Moderate match, may need additional training")
        else:
            print("   🔴 Limited relevant experience for this specific role")
    
    print(f"\n✅ All 21 fields analyzed in a single AI call!")
    
else:
    print("❌ No analysis results available. Run Step 5 first.")


🔬 STEP 7: EXPLORING THE 21 UNIFIED EXTRACTOR FIELDS
📊 The unified extractor provides 21 specialized analysis fields:
   This replaces multiple separate API calls with one comprehensive analysis

📈 EXPERIENCE ANALYSIS:
   • Total work experience (years): 5.8
   • Relevant work experience (years): 3.7
   • Education experience (years): 4.0
   • Relevant education (years): 4.0
   • Total relevant experience (years): 7.7
   • Average company tenure (years): 1.9

🎓 EDUCATION ANALYSIS:
   • Highest degree attained: Bachelor of Science
   • Major/field of study: Mechanical Engineering
   • Degree status (completed/in-progress): completed
   • School prestige level: medium

💼 CAREER ANALYSIS:
   • Career seniority level: manager
   • Primary job title/role: SBDC Business Advisor and Corridor Manager

📞 PROFESSIONAL PROFILES:
   • LinkedIn profile URL: Not available
   • GitHub profile URL: Not available
   • Personal website URL: Not available

🎯 RELEVANCE SCORING EXPLANATION:
   • Work relev
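
The relevance calculation in Step 7 is a simple ratio with four bands. As a worked example using the values printed above, 3.7 relevant years out of 5.8 total gives 3.7 / 5.8 × 100 ≈ 63.8%, which falls in the 60 to 80 band. The sketch below restates that logic as two small functions; the function names and the short band labels are illustrative choices, while the thresholds are the ones used in the Step 7 code.

```python
def relevance_percentage(relevant_years, total_years):
    """Share of work experience that is relevant, as a percentage (0 when unknown)."""
    relevant = relevant_years or 0
    total = total_years or 0
    if total <= 0:
        return 0.0
    return relevant / total * 100


def relevance_band(percentage):
    """Map a percentage onto the four bands used in Step 7."""
    if percentage >= 80:
        return "excellent"
    if percentage >= 60:
        return "good"
    if percentage >= 40:
        return "moderate"
    return "limited"


pct = relevance_percentage(3.7, 5.8)
print(f"{pct:.1f}% -> {relevance_band(pct)}")
```

Keeping the ratio and the banding separate makes it easy to adjust the thresholds for a particular role.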

In [8]:
# Step 8: Batch Processing Multiple Resumes (Sequential)
print("\n📚 STEP 8: PROCESSING MULTIPLE RESUMES SYSTEMATICALLY")
print("=" * 65)

if len(resume_paths) > 1:
    # For tutorial purposes, we'll process a small batch sequentially
    # This helps you understand each step without overwhelming output
    
    batch_size = min(5, len(resume_paths))  # Process up to 5 resumes
    batch_resumes = resume_paths[:batch_size]
    
    print(f"📊 Processing {batch_size} resumes for comprehensive analysis")
    print("⏳ This will take a few minutes - each resume requires AI analysis")
    print("🔄 We're using sequential processing for clear step-by-step learning")
    
    # Process resumes one by one (sequential, not parallel)
    start_time = time.time()
    
    df = parse_many_resumes(
        client=client,
        resume_paths=batch_resumes,
        date_of_resume_submission=submission_date,
        job_description=job_description,
        parallel=False,  # Sequential processing for learning
        use_tqdm=True   # Show progress bar
    )
    
    processing_time = time.time() - start_time
    
    print(f"\n✅ Batch processing completed!")
    print(f"⏱️  Total time: {processing_time:.2f} seconds")
    print(f"📊 Average per resume: {processing_time/batch_size:.2f} seconds")
    print(f"📋 Results dataframe shape: {df.shape}")
    print(f"📈 Columns available: {len(df.columns)}")
    
else:
    print("ℹ️  Only one resume available. Add more resumes to try batch processing.")
    # Create a single-row dataframe for consistency
    if 'result' in locals():
        df = pd.DataFrame([result])
        df['parsing_status'] = 'success'
        df['filename'] = os.path.basename(test_resume)
        print("📊 Single resume converted to dataframe format for analysis")


📚 STEP 8: PROCESSING MULTIPLE RESUMES SYSTEMATICALLY
📊 Processing 5 resumes for comprehensive analysis
⏳ This will take a few minutes - each resume requires AI analysis
🔄 We're using sequential processing for clear step-by-step learning


Parsing resumes: 100%|██████████| 5/5 [02:11<00:00, 26.25s/it]


✅ Batch processing completed!
⏱️  Total time: 131.28 seconds
📊 Average per resume: 26.26 seconds
📋 Results dataframe shape: (5, 55)
📈 Columns available: 55
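
Conceptually, sequential batch processing is a loop that parses one file at a time and records a status for each, so one bad file does not stop the run. The sketch below shows that pattern with a caller-supplied parser. It is not how `parse_many_resumes()` is implemented; `parse_sequentially` and the `error` field are assumptions, while `filename` and `parsing_status` mirror the columns used in this tutorial.

```python
import os


def parse_sequentially(paths, parse_fn):
    """Parse each path in order, capturing failures instead of raising.

    `parse_fn` is any callable that takes a path and returns a dict of fields.
    """
    rows = []
    for path in paths:
        row = {"filename": os.path.basename(path)}
        try:
            row.update(parse_fn(path))
            row["parsing_status"] = "success"
        except Exception as exc:  # keep going; record what went wrong
            row["parsing_status"] = "failed"
            row["error"] = str(exc)
        rows.append(row)
    return rows


def fake_parser(path):
    """Stand-in parser for demonstration; fails on one specific file."""
    if path.endswith("broken.pdf"):
        raise ValueError("could not extract text")
    return {"wyoe": 5.0}


for row in parse_sequentially(["/tmp/a.pdf", "/tmp/broken.pdf"], fake_parser):
    print(row)
```

A list of dicts like this converts directly into a dataframe with `pd.DataFrame(rows)`.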





In [None]:
# Step 9: Analyzing Batch Results
print("\n📊 STEP 9: UNDERSTANDING BATCH ANALYSIS RESULTS")
print("=" * 60)

if 'df' in locals():
    # Check processing success
    total_resumes = len(df)
    successful = len(df[df['parsing_status'] == 'success'])
    failed = len(df[df['parsing_status'] == 'failed'])
    
    print(f"📈 PROCESSING SUMMARY:")
    print(f"   • Total resumes processed: {total_resumes}")
    print(f"   • Successfully analyzed: {successful}")
    print(f"   • Failed to process: {failed}")
    print(f"   • Success rate: {successful/total_resumes*100:.1f}%")
    
    if successful > 0:
        successful_df = df[df['parsing_status'] == 'success']
        
        # Experience statistics
        print(f"\n📊 EXPERIENCE STATISTICS:")
        avg_total_exp = successful_df['wyoe'].mean()
        avg_relevant_exp = successful_df['relevant_wyoe'].mean()
        avg_education_exp = successful_df['eyoe'].mean()
        
        print(f"   • Average total work experience: {avg_total_exp:.1f} years")
        print(f"   • Average relevant work experience: {avg_relevant_exp:.1f} years")
        print(f"   • Average education experience: {avg_education_exp:.1f} years")
        print(f"   • Overall relevance ratio: {avg_relevant_exp/avg_total_exp*100:.1f}%")
        
        # Education distribution
        print(f"\n🎓 EDUCATION DISTRIBUTION:")
        education_counts = successful_df['highest_degree'].value_counts()
        for degree, count in education_counts.items():
            percentage = count / len(successful_df) * 100
            print(f"   • {degree}: {count} candidates ({percentage:.1f}%)")
        
        # Seniority distribution
        print(f"\n💼 SENIORITY LEVEL DISTRIBUTION:")
        seniority_counts = successful_df['highest_seniority_level'].value_counts()
        for level, count in seniority_counts.items():
            percentage = count / len(successful_df) * 100
            print(f"   • {level}: {count} candidates ({percentage:.1f}%)")
        
        # Contact information availability
        linkedin_available = successful_df['linkedin_url'].notna().sum()
        github_available = successful_df['github_url'].notna().sum()
        email_available = successful_df['email'].notna().sum()
        
        print(f"\n📞 CONTACT INFORMATION AVAILABILITY:")
        print(f"   • Email addresses: {email_available}/{successful} ({email_available/successful*100:.1f}%)")
        print(f"   • LinkedIn profiles: {linkedin_available}/{successful} ({linkedin_available/successful*100:.1f}%)")
        print(f"   • GitHub profiles: {github_available}/{successful} ({github_available/successful*100:.1f}%)")
        
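        # Calculate relevance rankings. assign() returns a new frame, so the
        # filtered slice is not modified in place, and zero total years map
        # to 0% instead of inf.
        successful_df = successful_df.assign(
            relevance_percentage=(
                successful_df['relevant_wyoe']
                / successful_df['wyoe'].replace(0, float('nan')) * 100
            ).fillna(0)
        )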
        
        # Show top candidates
        top_candidates = successful_df.nlargest(3, 'relevance_percentage')
        print(f"\n🏆 TOP 3 CANDIDATES BY RELEVANCE:")
        for idx, (_, candidate) in enumerate(top_candidates.iterrows(), 1):
            name = candidate.get('name', 'Unknown')
            relevance = candidate.get('relevance_percentage', 0)
            total_exp = candidate.get('wyoe', 0)
            degree = candidate.get('highest_degree', 'Unknown')
            print(f"   {idx}. {name}")
            print(f"      🎯 Relevance: {relevance:.1f}% | 💼 Experience: {total_exp:.1f} years | 🎓 {degree}")
        
    else:
        print("❌ No successful analyses to display")
        print("💡 Check your resume files and job description")
    
else:
    print("❌ No batch processing results available. Run Step 8 first.")


📊 STEP 9: UNDERSTANDING BATCH ANALYSIS RESULTS
📈 PROCESSING SUMMARY:
   • Total resumes processed: 5
   • Successfully analyzed: 5
   • Failed to process: 0
   • Success rate: 100.0%

📊 EXPERIENCE STATISTICS:
   • Average total work experience: 8.2 years
   • Average relevant work experience: 3.5 years
   • Average education experience: 5.0 years
   • Overall relevance ratio: 42.1%

🎓 EDUCATION DISTRIBUTION:
   • Bachelor of Science: 2 candidates (40.0%)
   • Master of Science: 2 candidates (40.0%)

💼 SENIORITY LEVEL DISTRIBUTION:
   • mid-level: 3 candidates (60.0%)
   • manager: 1 candidates (20.0%)
   • executive: 1 candidates (20.0%)

📞 CONTACT INFORMATION AVAILABILITY:
   • Email addresses: 5/5 (100.0%)
   • LinkedIn profiles: 2/5 (40.0%)
   • GitHub profiles: 3/5 (60.0%)

🏆 TOP 3 CANDIDATES BY RELEVANCE:
   1. Akhil Bukkapuram
      🎯 Relevance: 74.2% | 💼 Experience: 3.1 years | 🎓 Master of Science
   2. Wesley Ordoñez
      🎯 Relevance: 57.6% | 💼 Experience: 5.9 years | 🎓 Bache
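
In the education distribution above, the two degree rows add up to 4 of 5 candidates (80%). The likely reason is that one candidate has no `highest_degree` value: pandas `value_counts()` drops missing values by default, while the percentages are computed over all successful rows. The sketch below counts missing values explicitly using only the standard library; `distribution` and the `(missing)` label are illustrative names.

```python
from collections import Counter


def distribution(values, missing_label="(missing)"):
    """Return (label, count, percentage) tuples, counting missing values too."""

    def is_missing(value):
        # None, empty string, or float NaN (NaN is the only float not equal to itself)
        return value is None or value == "" or (isinstance(value, float) and value != value)

    counts = Counter(missing_label if is_missing(v) else v for v in values)
    total = sum(counts.values())
    return [(label, n, n * 100 / total) for label, n in counts.most_common()]


degrees = ["Bachelor of Science", "Master of Science",
           "Bachelor of Science", "Master of Science", None]
for label, count, pct in distribution(degrees):
    print(f"   • {label}: {count} candidates ({pct:.1f}%)")
```

With a dataframe, `successful_df['highest_degree'].value_counts(dropna=False)` gives the same effect.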

In [None]:
# Step 10: Export Results for Integration
print("\n💾 STEP 10: EXPORTING RESULTS FOR YOUR APPLICATIONS")
print("=" * 65)

if 'df' in locals() and len(df) > 0:
    # Create results directory
    results_dir = Path("./tutorial_results")
    results_dir.mkdir(exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Export complete results
    complete_file = results_dir / f"complete_analysis_{timestamp}.csv"
    df.to_csv(complete_file, index=False)
    print(f"📊 Complete results exported: {complete_file}")
    
    # Export successful results only
    successful_df = df[df['parsing_status'] == 'success']
    if not successful_df.empty:
        # Create a summary with key fields
        summary_fields = [
            'filename', 'name', 'email', 'contact_number',
            'wyoe', 'relevant_wyoe', 'eyoe',
            'highest_degree', 'highest_seniority_level', 'primary_position_title',
            'linkedin_url', 'github_url'
        ]
        
        # Only include fields that exist in the dataframe
        available_fields = [field for field in summary_fields if field in successful_df.columns]
        summary_df = successful_df[available_fields].copy()
        
        # Add calculated relevance percentage (zero total years map to 0% instead of inf)
        summary_df['relevance_percentage'] = (
            successful_df['relevant_wyoe'] / successful_df['wyoe'].replace(0, float('nan')) * 100
        ).fillna(0).round(1)
        
        # Sort by relevance
        summary_df = summary_df.sort_values('relevance_percentage', ascending=False)
        
        summary_file = results_dir / f"candidate_summary_{timestamp}.csv"
        summary_df.to_csv(summary_file, index=False)
        print(f"🎯 Candidate summary exported: {summary_file}")
        
        # Create a JSON export for easy API integration
        json_file = results_dir / f"candidates_{timestamp}.json"
        candidates_json = summary_df.to_dict('records')
        with open(json_file, 'w') as f:
            json.dump(candidates_json, f, indent=2)
        print(f"🔗 JSON format exported: {json_file}")
    
    print(f"\n📁 All files saved to: {results_dir.absolute()}")
    
    # Show integration examples
    print(f"\n🔗 INTEGRATION EXAMPLES:")
    print("=" * 30)
    
    print("📊 Loading results in your application:")
    print("""
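from glob import glob
import pandas as pd
import json

# Load the most recent CSV results (read_csv does not expand wildcards,
# and the timestamped filenames sort chronologically)
latest_csv = sorted(glob('tutorial_results/candidate_summary_*.csv'))[-1]
df = pd.read_csv(latest_csv)

# Load the most recent JSON results
latest_json = sorted(glob('tutorial_results/candidates_*.json'))[-1]
with open(latest_json, 'r') as f:
    candidates = json.load(f)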

# Filter high-relevance candidates (70%+ relevant experience)
top_candidates = df[df['relevance_percentage'] >= 70]

# Extract contact information
contacts = top_candidates[['name', 'email', 'linkedin_url']].dropna()
""")
    
    print("🔄 Using in a recruitment pipeline:")
    print("""
# Example integration with your HR system
for _, candidate in top_candidates.iterrows():
    candidate_profile = {
        'name': candidate['name'],
        'email': candidate['email'],
        'relevance_score': candidate['relevance_percentage'],
        'experience_years': candidate['wyoe'],
        'education': candidate['highest_degree'],
        'seniority': candidate['highest_seniority_level']
    }
    
    # Add to your database or send to your API
    add_candidate_to_system(candidate_profile)
""")
    
else:
    print("❌ No results to export. Complete the analysis steps first.")


💾 STEP 10: EXPORTING RESULTS FOR YOUR APPLICATIONS
📊 Complete results exported: tutorial_results/complete_analysis_20250529_205836.csv
🎯 Candidate summary exported: tutorial_results/candidate_summary_20250529_205836.csv
🔗 JSON format exported: tutorial_results/candidates_20250529_205836.json

📁 All files saved to: /Users/samcelarek/Documents/CVInsight/examples/tutorial_results

🔗 INTEGRATION EXAMPLES:
📊 Loading results in your application:

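from glob import glob
import pandas as pd
import json

# Load the most recent CSV results (read_csv does not expand wildcards,
# and the timestamped filenames sort chronologically)
latest_csv = sorted(glob('tutorial_results/candidate_summary_*.csv'))[-1]
df = pd.read_csv(latest_csv)

# Load the most recent JSON results
latest_json = sorted(glob('tutorial_results/candidates_*.json'))[-1]
with open(latest_json, 'r') as f:
    candidates = json.load(f)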

# Filter high-relevance candidates (70%+ relevant experience)
top_candidates = df[df['relevance_percentage'] >= 70]

# Extract contact information
contacts = top_candidates[['name', 'email', 'linkedin_url']].dropna()

🔄 Using in a recruitment pipeline:

# Example integration with your HR system
for _, candidate in top_candidates
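
One detail to watch in the JSON export: Step 9 shows that some candidates have no LinkedIn or GitHub URL. If those gaps are stored as `NaN` in the dataframe, `json.dump` writes them as the bare token `NaN`, which Python can read back but strict JSON parsers reject. A minimal sketch that converts them to `null` first; `json_safe` is a hypothetical helper and the record is made-up sample data.

```python
import json
import math


def json_safe(records):
    """Return a copy of `records` with float NaN values replaced by None."""

    def clean(value):
        if isinstance(value, float) and math.isnan(value):
            return None
        return value

    return [{key: clean(value) for key, value in record.items()} for record in records]


# Made-up record shaped like a row from summary_df.to_dict('records').
records = [{"name": "Candidate A", "linkedin_url": float("nan"), "wyoe": 5.8}]

# allow_nan=False makes json raise if any NaN slipped through.
text = json.dumps(json_safe(records), indent=2, allow_nan=False)
print(text)
```

In the export cell this would wrap the existing list, for example `json.dump(json_safe(candidates_json), f, indent=2)`.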

## 🎉 Tutorial Complete!

### ✅ **What You've Learned:**

1. **Setup & Configuration**: How to integrate CVInsight into any Python project
2. **Client Initialization**: Setting up the AI-powered analysis engine
3. **Single Resume Analysis**: Deep dive into one candidate's profile
4. **Unified Extractor**: Understanding all 21 analysis fields
5. **Job Relevance Scoring**: How job descriptions affect candidate rankings
6. **Batch Processing**: Systematically analyzing multiple resumes
7. **Results Interpretation**: Making sense of the analysis data
8. **Data Export**: Preparing results for integration into your applications

### 🚀 **Key Takeaways:**

- **Unified Extractor**: Gets 21 analysis fields in one AI call (75% faster!)
- **Job-Specific Relevance**: Candidates are scored against your specific requirements
- **Comprehensive Profiling**: Experience, education, skills, seniority, and contact info
- **Easy Integration**: Export to CSV, JSON, or direct API integration
- **Production Ready**: Handle both single resumes and large batches

### 📊 **Understanding the 21 Fields:**

This tutorial prints 15 of the 21 fields, grouped here by category.

**Experience Fields:**
- `wyoe`: Total work experience years
- `relevant_wyoe`: Work experience relevant to the job
- `eyoe`: Education experience years
- `relevant_eyoe`: Education relevant to the job
- `total_relevant_yoe`: Combined relevant experience
- `average_tenure_at_company_years`: Job stability indicator

**Education Fields:**
- `highest_degree`: Bachelor's, Master's, PhD, etc.
- `highest_degree_major`: Field of study
- `highest_degree_status`: Completed or in-progress
- `highest_degree_school_prestige`: School prestige level (`medium` in the Step 7 example)

**Career Fields:**
- `highest_seniority_level`: Seniority label (the Step 9 batch returned `mid-level`, `manager`, and `executive`)
- `primary_position_title`: Main job role

**Contact Fields:**
- `linkedin_url`, `github_url`, `personal_website_url`: Professional profiles
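
When integrating, it can help to verify that a parsed result actually contains the fields listed above before relying on them. The sketch below groups the field names from this tutorial and reports any that are absent; `FIELD_GROUPS` and `missing_fields` are illustrative names, not part of CVInsight.

```python
# Field names as listed in this tutorial, grouped by category.
FIELD_GROUPS = {
    "experience": [
        "wyoe", "relevant_wyoe", "eyoe", "relevant_eyoe",
        "total_relevant_yoe", "average_tenure_at_company_years",
    ],
    "education": [
        "highest_degree", "highest_degree_major",
        "highest_degree_status", "highest_degree_school_prestige",
    ],
    "career": ["highest_seniority_level", "primary_position_title"],
    "contact": ["linkedin_url", "github_url", "personal_website_url"],
}


def missing_fields(result):
    """Return {group: [absent field names]} for groups with at least one gap."""
    gaps = {}
    for group, fields in FIELD_GROUPS.items():
        absent = [field for field in fields if field not in result]
        if absent:
            gaps[group] = absent
    return gaps


# Example with a deliberately incomplete result dict.
print(missing_fields({"wyoe": 5.8, "relevant_wyoe": 3.7}))
```

An empty dict means every listed field is present in the result.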

### 💡 **Next Steps:**

1. **Try Different Job Descriptions**: See how relevance scores change
2. **Process Your Own Resumes**: Upload your actual candidate files
3. **Integrate with Your Systems**: Use the exported data in your applications
4. **Explore Concurrent Processing**: Try the concurrent analysis demo for speed
5. **Production Deployment**: Use the production batch processor for scale

### 🔗 **Integration Checklist:**

- [ ] Clone CVInsight to your projects directory
- [ ] Install dependencies (`pip install -r requirements.txt`)
- [ ] Set up your OpenAI API key
- [ ] Test with a few sample resumes
- [ ] Customize job descriptions for your roles
- [ ] Set up automated exports to your systems
- [ ] Implement candidate ranking in your workflow

**You're now ready to build powerful resume analysis applications with CVInsight!** 🎯

---

*This tutorial used sequential processing for educational clarity. For production applications with many resumes, consider using the concurrent processing capabilities shown in our other examples.*

In [None]:
# BONUS: LLM Prediction Validation
print("\n" + "="*60)
print("🎯 BONUS: LLM PREDICTION VALIDATION")
print("="*60)
print("Let's validate our LLM predictions against known truth values!")

# Load truth values from CSV (located relative to CVINSIGHT_PATH from Step 1)
truth_path = os.path.join(CVINSIGHT_PATH, 'examples', 'yoe_truth_values.csv')
truth_df = pd.read_csv(truth_path)
print(f"\n📊 Loaded truth values for {len(truth_df)} candidates")
print("\nSample of truth data:")
print(truth_df[['name', 'LLM Estimated YoE', 'TYOE', 'Work Exp Years', 'Edu Years']].head())

# Normalize names for matching (remove spaces, commas, and periods; convert to lowercase)
def normalize_name(name):
    return str(name).replace(' ', '').replace(',', '').replace('.', '').lower().strip()

truth_df['normalized_name'] = truth_df['name'].apply(normalize_name)

# Check if we have results to validate
if 'batch_results' in locals() and batch_results is not None:
    print("\n🔍 Validating our tutorial results against actual values...")
    
    # Create validation DataFrame
    validation_results = []
    
    for _, candidate in batch_results.iterrows():
        if candidate['parsing_status'] == 'success':
            candidate_normalized = normalize_name(candidate.get('name', ''))
            
            # Find matching truth value
            truth_match = truth_df[truth_df['normalized_name'] == candidate_normalized]
            
            if not truth_match.empty:
                truth_row = truth_match.iloc[0]
                
                print(f"\n🔍 Validating: {candidate['name']}")
                
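                # Extract LLM predictions from our processing.
                # The field reference above names these columns `wyoe` and `eyoe`;
                # fall back to the longer names in case your results use them.
                llm_work_exp = candidate.get('wyoe', candidate.get('work_experience_years', 0))
                llm_edu_years = candidate.get('eyoe', candidate.get('education_years', 0))
                llm_total_yoe = llm_work_exp + llm_edu_years  # Calculate LLM total from components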
                
                # Extract truth values
                actual_work_exp = truth_row['Work Exp Years']
                actual_edu_years = truth_row['Edu Years']
                actual_total_yoe = truth_row['TYOE']
                
                # Calculate errors (comparing LLM predictions vs truth values)
                work_exp_error = abs(llm_work_exp - actual_work_exp)
                edu_years_error = abs(llm_edu_years - actual_edu_years)
                total_yoe_error = abs(llm_total_yoe - actual_total_yoe)
                
                print(f"  Work Experience: LLM={llm_work_exp}, Actual={actual_work_exp}, Error={work_exp_error:.1f}")
                print(f"  Education Years: LLM={llm_edu_years}, Actual={actual_edu_years}, Error={edu_years_error:.1f}")
                print(f"  Total YoE: LLM Calc.={llm_total_yoe}, Actual={actual_total_yoe}, Error={total_yoe_error:.1f}")
                
                validation_results.append({
                    'name': candidate['name'],
                    'llm_work_exp': llm_work_exp,
                    'actual_work_exp': actual_work_exp,
                    'work_exp_error': work_exp_error,
                    'llm_edu_years': llm_edu_years,
                    'actual_edu_years': actual_edu_years,
                    'edu_years_error': edu_years_error,
                    'llm_total_yoe': llm_total_yoe,
                    'actual_total_yoe': actual_total_yoe,
                    'total_yoe_error': total_yoe_error
                })
    
    if validation_results:
        validation_df = pd.DataFrame(validation_results)
        
        print(f"\n\n✅ VALIDATION COMPLETE: Matched {len(validation_df)} candidates")
        print("\n📊 VALIDATION SUMMARY:")
        print("-" * 40)
        
        # Calculate metrics
        mean_work_error = validation_df['work_exp_error'].mean()
        mean_edu_error = validation_df['edu_years_error'].mean()
        mean_yoe_error = validation_df['total_yoe_error'].mean()
        
        print(f"Mean Work Experience Error: {mean_work_error:.2f} years")
        print(f"Mean Education Years Error: {mean_edu_error:.2f} years")
        print(f"Mean Total YoE Error: {mean_yoe_error:.2f} years")
        
        # Accuracy assessment
        work_within_1_year = (validation_df['work_exp_error'] <= 1).mean() * 100
        edu_within_1_year = (validation_df['edu_years_error'] <= 1).mean() * 100
        yoe_within_2_years = (validation_df['total_yoe_error'] <= 2).mean() * 100
        
        print("\n🎯 ACCURACY METRICS:")
        print(f"Work Experience within 1 year: {work_within_1_year:.1f}%")
        print(f"Education Years within 1 year: {edu_within_1_year:.1f}%")
        print(f"Total YoE within 2 years: {yoe_within_2_years:.1f}%")
        
        # Show best and worst predictions
        print("\n✅ MOST ACCURATE PREDICTIONS:")
        best_predictions = validation_df.nsmallest(3, 'total_yoe_error')
        for _, row in best_predictions.iterrows():
            print(f"  {row['name']}: Total YoE Error = {row['total_yoe_error']:.1f} years")
        
        print("\n❌ LEAST ACCURATE PREDICTIONS:")
        worst_predictions = validation_df.nlargest(3, 'total_yoe_error')
        for _, row in worst_predictions.iterrows():
            print(f"  {row['name']}: Total YoE Error = {row['total_yoe_error']:.1f} years")
        
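        # Save validation results (create the output directory if it doesn't exist)
        output_path = './tutorial_results/tutorial_validation_results.csv'
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        validation_df.to_csv(output_path, index=False)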
        print(f"\n💾 Validation results saved to: {output_path}")
        
        print("\n🎆 TUTORIAL COMPLETE!")
        print("You've learned how to:")
        print("✓ Process resumes with CVInsight")
        print("✓ Analyze and interpret results")
        print("✓ Validate LLM predictions against truth values")
        print("✓ Export data for integration")
    else:
        print("\n⚠️ No matching candidates found for validation")
        print("This might happen if the resumes processed don't match the truth dataset.")
else:
    print("\n⚠️ No batch processing results available for validation.")
    print("Run the batch processing steps above first to generate results for validation.")
    print("\nTo see validation in action, process some of the resumes that match the truth dataset:")
    print("Available candidates in truth data:")
    for name in truth_df['name'].head(10):
        print(f"  - {name}")


🎯 BONUS: LLM PREDICTION VALIDATION
Let's validate our LLM predictions against known truth values!

📊 Loaded truth values for 19 candidates

Sample of truth data:
               name  LLM Estimated YoE  TYOE  Work Exp Years  Edu Years
0         AKASH DAS               4.00   2.5               0        2.5
1  Akhil Bukkapuram               4.92   4.5               2        2.5
2     AndrewColbert               7.42   7.0               4        3.0
3      Brian Warras              10.33   3.1               3        0.1
4     Bryan Aguilar               5.42   9.0               6        3.0

⚠️ No batch processing results available for validation.
Run the batch processing steps above first to generate results for validation.

To see validation in action, process some of the resumes that match the truth dataset:
Available candidates in truth data:
  - AKASH DAS
  - Akhil Bukkapuram
  - AndrewColbert
  - Brian Warras
  - Bryan Aguilar
  - Doris Fang
  - ELIANA MUGAR
  - Eric de la Parra
  -