# 01 - Exploratory Data Analysis (EDA)

This notebook performs exploratory data analysis on the resume dataset.

## Objectives
- Load dataset and understand structure
- Extract key fields 
- Identify missing data and other issues
- Select sample CVs for "roasting"

---

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load Dataset

Source: https://www.kaggle.com/datasets/saugataroyarghya/resume-dataset

In [2]:
# Load the resume dataset
data_path = Path('../data/resume_data.csv')
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print(f"Total resumes: {len(df)}")

Dataset shape: (9544, 35)
Total resumes: 9544


## 2. Dataset Overview

In [3]:
# Display column names
print("Columns in dataset:")
for i, col in enumerate(df.columns, 1):
    print(f"{i}. {col}")

Columns in dataset:
1. address
2. career_objective
3. skills
4. educational_institution_name
5. degree_names
6. passing_years
7. educational_results
8. result_types
9. major_field_of_studies
10. professional_company_names
11. company_urls
12. start_dates
13. end_dates
14. related_skils_in_job
15. positions
16. locations
17. responsibilities
18. extra_curricular_activity_types
19. extra_curricular_organization_names
20. extra_curricular_organization_links
21. role_positions
22. languages
23. proficiency_levels
24. certification_providers
25. certification_skills
26. online_links
27. issue_dates
28. expiry_dates
29. ﻿job_position_name
30. educationaL_requirements
31. experiencere_requirement
32. age_requirement
33. responsibilities.1
34. skills_required
35. matched_score
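Column 29 in the listing above appears to carry a stray byte-order mark (`\ufeff`) in front of `job_position_name`, which would make `df['job_position_name']` raise a `KeyError`. The following is a minimal sketch of one way to normalize the header, assuming that diagnosis is correct; `clean_column_names` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def clean_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` whose column names have BOMs and outer whitespace removed."""
    cleaned = frame.columns.str.replace("\ufeff", "", regex=False).str.strip()
    return frame.rename(columns=dict(zip(frame.columns, cleaned)))

# Toy frame standing in for the real dataset (hypothetical values).
demo = pd.DataFrame({"\ufeffjob_position_name": ["Analyst"], "skills": ["['Python']"]})
demo = clean_column_names(demo)
print(list(demo.columns))
```

Misspelled names such as `related_skils_in_job` are left untouched here, since renaming them is a separate decision that would affect any downstream notebook using the original names.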


In [4]:
# Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9544 entries, 0 to 9543
Data columns (total 35 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   address                              784 non-null    object 
 1   career_objective                     4740 non-null   object 
 2   skills                               9488 non-null   object 
 3   educational_institution_name         9460 non-null   object 
 4   degree_names                         9460 non-null   object 
 5   passing_years                        9460 non-null   object 
 6   educational_results                  9460 non-null   object 
 7   result_types                         9460 non-null   object 
 8   major_field_of_studies               9460 non-null   object 
 9   professional_company_names           9460 non-null   object 
 10  company_urls                         9460 non-null   object 
 11  start_dates                   

In [5]:
# Check for missing values
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("\nMissing values percentage:")
print(missing_pct[missing_pct > 0])


Missing values percentage:
proficiency_levels                     92.665549
languages                              92.665549
address                                91.785415
expiry_dates                           78.960604
issue_dates                            78.960604
online_links                           78.960604
certification_skills                   78.960604
certification_providers                78.960604
extra_curricular_organization_names    64.103101
role_positions                         64.103101
extra_curricular_organization_links    64.103101
extra_curricular_activity_types        64.103101
career_objective                       50.335289
age_requirement                        42.822716
skills_required                        17.822716
experiencere_requirement               14.291702
end_dates                               0.880134
result_types                            0.880134
educational_institution_name            0.880134
degree_names                            0
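Several columns above share exactly the same missing percentage (for example `languages` and `proficiency_levels`, or the five certification columns), which suggests, but does not prove, that they are empty on the same rows. A small sketch of how that could be checked; `missing_together` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_together(frame: pd.DataFrame, columns: list) -> bool:
    """Return True if the given columns are null on exactly the same rows."""
    null_mask = frame[columns].isna()
    # On every row, either all of the columns are null or none of them are.
    return bool((null_mask.any(axis=1) == null_mask.all(axis=1)).all())

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "languages": ["['English']", np.nan, np.nan],
    "proficiency_levels": ["['Fluent']", np.nan, np.nan],
    "address": [np.nan, "Dhaka", np.nan],
})
print(missing_together(toy, ["languages", "proficiency_levels"]))
print(missing_together(toy, ["languages", "address"]))
```

If a group of columns is always missing together, the group can be treated as one optional resume section rather than as several independent fields.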

## 3. Key Resume Fields Analysis

In [6]:
# Key fields for CV roasting
key_fields = [
    'career_objective',
    'skills',
    'educational_institution_name',
    'degree_names',
    'professional_company_names',
    'positions',
    'responsibilities'
]

print("Key fields availability:")
for field in key_fields:
    if field in df.columns:
        non_null = df[field].notna().sum()
        print(f"{field}: {non_null}/{len(df)} ({non_null/len(df)*100:.1f}%)")

Key fields availability:
career_objective: 4740/9544 (49.7%)
skills: 9488/9544 (99.4%)
educational_institution_name: 9460/9544 (99.1%)
degree_names: 9460/9544 (99.1%)
professional_company_names: 9460/9544 (99.1%)
positions: 9460/9544 (99.1%)
responsibilities: 9544/9544 (100.0%)


## 4. Sample Resume Exploration

In [7]:
# Function to display a resume nicely
def display_resume(idx):
    resume = df.iloc[idx]
    print("="*80)
    print(f"RESUME #{idx}")
    print("="*80)
    
    if pd.notna(resume.get('career_objective')):
        print(f"\n CAREER OBJECTIVE:")
        print(resume['career_objective'])
    
    if pd.notna(resume.get('skills')):
        print(f"\n SKILLS:")
        print(resume['skills'])
    
    if pd.notna(resume.get('educational_institution_name')):
        print(f"\n EDUCATION:")
        print(f"Institution: {resume['educational_institution_name']}")
        if pd.notna(resume.get('degree_names')):
            print(f"Degree: {resume['degree_names']}")
        if pd.notna(resume.get('major_field_of_studies')):
            print(f"Major: {resume['major_field_of_studies']}")
    
    if pd.notna(resume.get('professional_company_names')):
        print(f"\n WORK EXPERIENCE:")
        print(f"Company: {resume['professional_company_names']}")
        if pd.notna(resume.get('positions')):
            print(f"Position: {resume['positions']}")
        if pd.notna(resume.get('responsibilities')):
            print(f"Responsibilities:\n{resume['responsibilities']}")
    
    print("\n" + "="*80 + "\n")

# Display first resume
display_resume(0)

RESUME #0

 CAREER OBJECTIVE:
Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them. Currently in search of role that offers more of development.

 SKILLS:
['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++', 'Data Structures', 'DBMS', 'RDBMS', 'Informatica', 'Talend', 'Amazon Redshift', 'Microsoft Azure']

 EDUCATION:
Institution: ['The Amity School of Engineering & Technology (ASET), Noida']
Degree: ['B.Tech']
Major: ['Electronics']

 WORK EXPERIENCE:
Company: ['Coca-COla']
Position: ['Big Data Analyst']
Responsibilities:
Technical Support
Troubleshooting
Collaboration
Documentation
System Monitoring
Software Deployment
Training & Mentorship
Industry Trends
Field Visits
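
Fields such as `skills` and `degree_names` print as `['...']`, so they are most likely stored as Python-style list literals in string form, since they were read from a CSV file. A hedged sketch of how such a value could be turned into a real list using the standard library; `parse_list_field` is a hypothetical helper, and the fallback behaviour is an assumption about what is convenient downstream.

```python
import ast

def parse_list_field(value):
    """Parse a string like "['B.Tech']" into a list.

    Non-string input (e.g. NaN) yields []; a string that is not a valid
    Python literal is returned unchanged as a single-item list.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]

print(parse_list_field("['B.Sc (Maths)', 'M.Sc (Science) (Statistics)']"))
print(parse_list_field(float("nan")))
print(parse_list_field("Technical Support"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted CSV content.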


In [8]:
# Display a few more samples
for idx in [1, 2, 3]:
    if idx < len(df):
        display_resume(idx)

RESUME #1

 CAREER OBJECTIVE:
Fresher looking to join as a data analyst and junior data scientist. Experienced in creating meaningful data dashboards and evaluation models.

 SKILLS:
['Data Analysis', 'Data Analytics', 'Business Analysis', 'R', 'SAS', 'PowerBi', 'Tableau', 'Data Visualization', 'Business Analytics', 'Machine Learning']

 EDUCATION:
Institution: ['Delhi University - Hansraj College', 'Delhi University - Hansraj College']
Degree: ['B.Sc (Maths)', 'M.Sc (Science) (Statistics)']
Major: ['Mathematics', 'Statistics']

 WORK EXPERIENCE:
Company: ['BIB Consultancy']
Position: ['Business Analyst']
Responsibilities:
Machine Learning Leadership
Cross-Functional Collaboration
Strategy Development
ML/NLP Infrastructure
Prototype Transformation
ML System Design
Algorithm Research
Application Development
Dataset Selection
ML Testing
Statistical Analysis
R&D in ML/NLP
Text Representation
Data Pipeline Design
Statistical Data Analysis
Model Training
Team Collaboration
Research Reportin

## 5. Create Helper Function to Extract CV Text

This function formats a CV as readable plain text for the LLM.

In [9]:
def format_cv_for_llm(resume_row):
    """
    Format a resume row into a readable text for LLM processing.
    
    Args:
        resume_row: A pandas Series representing one resume
    
    Returns:
        str: Formatted CV text
    """
    cv_text = []
    
    # Career Objective
    if pd.notna(resume_row.get('career_objective')):
        cv_text.append(f"CAREER OBJECTIVE:\n{resume_row['career_objective']}")
    
    # Skills
    if pd.notna(resume_row.get('skills')):
        cv_text.append(f"\nSKILLS:\n{resume_row['skills']}")
    
    # Education
    education_parts = []
    if pd.notna(resume_row.get('educational_institution_name')):
        education_parts.append(f"Institution: {resume_row['educational_institution_name']}")
    if pd.notna(resume_row.get('degree_names')):
        education_parts.append(f"Degree: {resume_row['degree_names']}")
    if pd.notna(resume_row.get('major_field_of_studies')):
        education_parts.append(f"Major: {resume_row['major_field_of_studies']}")
    if pd.notna(resume_row.get('passing_years')):
        education_parts.append(f"Year: {resume_row['passing_years']}")
    
    if education_parts:
        cv_text.append(f"\nEDUCATION:\n" + "\n".join(education_parts))
    
    # Work Experience
    work_parts = []
    if pd.notna(resume_row.get('professional_company_names')):
        work_parts.append(f"Company: {resume_row['professional_company_names']}")
    if pd.notna(resume_row.get('positions')):
        work_parts.append(f"Position: {resume_row['positions']}")
    if pd.notna(resume_row.get('start_dates')):
        work_parts.append(f"Period: {resume_row['start_dates']}")
        if pd.notna(resume_row.get('end_dates')):
            work_parts.append(f" to {resume_row['end_dates']}")
    if pd.notna(resume_row.get('responsibilities')):
        work_parts.append(f"Responsibilities:\n{resume_row['responsibilities']}")
    
    if work_parts:
        cv_text.append(f"\nWORK EXPERIENCE:\n" + "\n".join(work_parts))
    
    # Languages
    if pd.notna(resume_row.get('languages')):
        cv_text.append(f"\nLANGUAGES:\n{resume_row['languages']}")
    
    # Certifications
    if pd.notna(resume_row.get('certification_skills')):
        cv_text.append(f"\nCERTIFICATIONS:\n{resume_row['certification_skills']}")
    
    return "\n".join(cv_text)

# Test the function
print("Formatted CV for LLM:")
print("="*80)
print(format_cv_for_llm(df.iloc[0]))
print("="*80)

Formatted CV for LLM:
CAREER OBJECTIVE:
Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them. Currently in search of role that offers more of development.

SKILLS:
['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++', 'Data Structures', 'DBMS', 'RDBMS', 'Informatica', 'Talend', 'Amazon Redshift', 'Microsoft Azure']

EDUCATION:
Institution: ['The Amity School of Engineering & Technology (ASET), Noida']
Degree: ['B.Tech']
Major: ['Electronics']
Year: ['2019']

WORK EXPERIENCE:
Company: ['Coca-COla']
Position: ['Big Data Analyst']
Period: ['Nov 2019']
 to ['Till Date']
Responsibilities:
Technical Support
Troubleshooting
Collaboration
Documentation
System Monitoring
Software Deployment
Training & Mentorship
Industry Trends
Field Visits
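
In the output above, the employment period is split across two lines (`Period: ...` followed by ` to ...`) because the start and end dates are appended as separate items and then joined with newlines. One possible way to keep them on a single line is sketched below; `format_period` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def format_period(start, end):
    """Build a one-line period string, or return None when the start date is missing."""
    if pd.isna(start):
        return None
    if pd.isna(end):
        return f"Period: {start}"
    return f"Period: {start} to {end}"

print(format_period("['Nov 2019']", "['Till Date']"))
# Period: ['Nov 2019'] to ['Till Date']
```

Inside `format_cv_for_llm`, the result could be appended to `work_parts` only when it is not `None`.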

## 6. Select Test CVs

We select two complete CVs with different profiles to test each model.

In [10]:
# Select test CVs - choosing ones with good data coverage
# We'll select CVs with different characteristics

# Find CVs with most complete information
df['completeness'] = df[key_fields].notna().sum(axis=1)
most_complete = df.nlargest(5, 'completeness')

print("Top 5 most complete CVs:")
print(most_complete[['career_objective', 'completeness']])

# Select 2 CVs for testing
test_cv_indices = [0, 1]  # You can change these

print(f"\nSelected test CVs: {test_cv_indices}")

# Save test CV indices for use in other notebooks
with open('../data/test_cv_indices.json', 'w') as f:
    json.dump({'indices': test_cv_indices}, f)

Top 5 most complete CVs:
                                                                                      career_objective  \
0  Big data analytics working and database warehouse manager with robust experience in handling all...   
1  Fresher looking to join as a data analyst and junior data scientist. Experienced in creating mea...   
3  To obtain a position in a fast-paced business office environment, demanding a strong organizatio...   
4  Professional accountant with an outstanding work ethic and integrity seeking to make a valuable ...   
5  To secure an IT specialist, desktop support, network administration, database administrator, tec...   

   completeness  
0             7  
1             7  
3             7  
4             7  
5             7  

Selected test CVs: [0, 1]
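
The indices above are hard-coded. As an alternative, the selection could be made reproducible from the data itself by sampling among rows where every key field is present. This is only a sketch under that assumption; `pick_complete_rows`, the seed, and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def pick_complete_rows(frame: pd.DataFrame, fields: list, n: int = 2, seed: int = 42) -> list:
    """Sample up to `n` row labels from rows where all `fields` are non-null."""
    complete = frame[frame[fields].notna().all(axis=1)]
    sampled = complete.sample(n=min(n, len(complete)), random_state=seed)
    return sorted(sampled.index.tolist())

# Toy frame standing in for the real dataset (hypothetical values).
toy_cvs = pd.DataFrame({
    "career_objective": ["a", np.nan, "c", "d"],
    "skills": ["['x']", "['y']", "['z']", "['w']"],
})
picked = pick_complete_rows(toy_cvs, ["career_objective", "skills"])
print(picked)
```

`tolist()` returns plain Python integers, so the result can be passed straight to `json.dump`.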


## 7. Summary Statistics

In [11]:
print(" DATASET SUMMARY")
print("="*80)
print(f"Total Resumes: {len(df)}")
# Exclude the derived 'completeness' column so only original dataset features are counted
print(f"Total Features: {len(df.columns.drop('completeness'))}")
print(f"\nAverage Completeness: {df['completeness'].mean():.1f}/{len(key_fields)} fields")
print(f"Most Complete CV: {df['completeness'].max()}/{len(key_fields)} fields")
print(f"Least Complete CV: {df['completeness'].min()}/{len(key_fields)} fields")
print("\n EDA Complete! Ready for CV roasting.")

 DATASET SUMMARY
Total Resumes: 9544
Total Features: 35

Average Completeness: 6.5/7 fields
Most Complete CV: 7/7 fields
Least Complete CV: 4/7 fields

 EDA Complete! Ready for CV roasting.


---

## Next Steps

1. **02_gentle_roaster.ipynb** - Most constructive feedback model
2. **03_medium_roaster.ipynb** - Mild criticism model
3. **04_brutal_roaster.ipynb** - "Savage" roasting model
4. **05_evaluation_comparison.ipynb** - Compare all models
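
The follow-up notebooks are not shown here, but they would presumably read the indices saved to `test_cv_indices.json`. A minimal sketch of such a loader, with basic validation; `load_test_indices` is a hypothetical helper and the demo writes to a temporary file rather than `../data/`.

```python
import json
import tempfile
from pathlib import Path

def load_test_indices(path):
    """Read the {'indices': [...]} payload and return the list of integer indices."""
    with open(path) as f:
        payload = json.load(f)
    indices = payload["indices"]
    if not all(isinstance(i, int) for i in indices):
        raise ValueError("expected a list of integer row indices")
    return indices

# Demo with a temporary file that mimics the structure written by this notebook.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "test_cv_indices.json"
    demo_path.write_text(json.dumps({"indices": [0, 1]}))
    loaded = load_test_indices(demo_path)

print(loaded)
# [0, 1]
```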