![](image.jpg)



## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |

## Let's Get Started!

This project applies regular-expression text extraction to a real-world HR problem: pulling job titles, technical skills, and education levels out of raw resume text, so that tech talent can be matched to suitable roles.


In [57]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

|      | ID         | Resume_str                         | Category   |
|------|------------|------------------------------------|------------|
| 879  | 96761538.0 | KIDS CLUB ATTENDANT Summary ...    | FITNESS    |
| 373  | 29267293.0 | TEACHER Summary Dedicated t...     | TEACHER    |
| 1208 | 15535920.0 | BUSINESS CONSULTANT Professiona... | CONSULTANT |


In [58]:
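# Define and compile the regular expressions
# Technical skills, matched as whole words regardless of case
regex_skills = re.compile(r"\b(python|sql|r|excel)\b", re.IGNORECASE)
# Job title: the run of uppercase words at the very start of the resume
regex_job_title = re.compile(r"^([A-Z\s\.\,\-]+)\b")
# Degrees: 'MCs' and 'BCs' corrected to the standard abbreviations 'MSc' and 'BSc'
regex_education = re.compile(r"\b(PhD|MSc|Master|BSc|Bachelor)\b", re.IGNORECASE)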
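
To get a feel for what the job-title and skills patterns capture, the sketch below runs them against a short, made-up resume string (the sample text is hypothetical and not taken from `resumes.csv`). It also shows a caveat worth keeping in mind: because `r` is matched as a whole word, a standalone "R" such as the one in "R&D" is counted as a skill.

```python
import re

# Same patterns as in the cell above
regex_skills = re.compile(r"\b(python|sql|r|excel)\b", re.IGNORECASE)
regex_job_title = re.compile(r"^([A-Z\s\.\,\-]+)\b")

# Hypothetical sample text, for illustration only
sample = "DATA ANALYST Summary Built SQL reports and Python scripts for the R&D team."

# The title pattern grabs the leading uppercase run and backtracks to a word boundary
title_match = regex_job_title.search(sample)
title = title_match.group(0).strip() if title_match else ""

# findall returns every captured skill, in order of appearance
skills = regex_skills.findall(sample)

print(title)   # DATA ANALYST
print(skills)  # ['SQL', 'Python', 'R']  <- 'R' comes from "R&D"
```

If false positives like this matter for the analysis, one option is to drop `r` from the pattern or to review rows where it is the only skill found.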

In [59]:
# Lists to store the extracted information
job_titles= []
tech_skills = []
educations = []

In [60]:
# Function to extract unique matches from a resume using a given regex
def extract_unique_matches(pattern, text):
    matches = pattern.findall(text)
    unique_matches = set(match.title() for match in matches)
    return ", ".join(unique_matches)
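
Because a `set` has no guaranteed order, the joined string can come out as either "Sql, Excel" or "Excel, Sql" depending on the Python session. If a stable order is preferred, for example to group or compare rows later, a minimal variant could sort the values before joining. The function name `extract_sorted_matches` and the demo pattern are hypothetical and not part of the original notebook.

```python
import re

def extract_sorted_matches(pattern, text):
    """Hypothetical variant of extract_unique_matches with a stable output order."""
    matches = pattern.findall(text)
    # Title-case first so 'SQL' and 'sql' collapse into a single entry
    unique_matches = {match.title() for match in matches}
    # Sorting makes the joined string reproducible across runs
    return ", ".join(sorted(unique_matches))

# Demo pattern and text, for illustration only
demo_pattern = re.compile(r"\b(python|sql|excel)\b", re.IGNORECASE)
result = extract_sorted_matches(demo_pattern, "SQL, Excel, sql and EXCEL")
print(result)  # Excel, Sql
```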

In [61]:
# Loop through each resume in the DataFrame
for resume in resumes['Resume_str']:
    # Extract the job title using regex
    job_title_match = regex_job_title.search(resume)
    job_title = job_title_match.group(0).strip() if job_title_match else ""
    job_titles.append(job_title)
    # Extract technical skills
    tech_skills.append(extract_unique_matches(regex_skills, resume))
    # Extract educational degrees
    educations.append(extract_unique_matches(regex_education, resume))
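
The loop assumes every value in `Resume_str` is a string. pandas reads empty CSV cells as `NaN` (a float), and passing a float to `pattern.search` raises a `TypeError`. Whether `resumes.csv` contains such cells is not known from the notebook, so the following is only a defensive sketch; `safe_job_title` is a hypothetical helper, not part of the original code.

```python
import re

regex_job_title = re.compile(r"^([A-Z\s\.\,\-]+)\b")

def safe_job_title(resume):
    """Hypothetical helper: return the leading uppercase title, or "" for non-text input."""
    # Missing values arrive as float NaN rather than str, so guard before searching
    if not isinstance(resume, str):
        return ""
    match = regex_job_title.search(resume)
    return match.group(0).strip() if match else ""

print(safe_job_title("TEACHER Summary Dedicated educator"))  # TEACHER
print(repr(safe_job_title(float("nan"))))                    # ''
```

An empty string from this helper is then removed by the same filter that drops resumes with no detectable job title.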

In [62]:
# Add the extracted data to the DataFrame
resumes['job_title'] = job_titles
resumes['tech_skills'] = tech_skills
resumes['education'] = educations


In [63]:
# Filter out rows missing any job title, tech skill, or education information
resumes_filtered = resumes[(resumes['job_title'] != "") & (resumes['tech_skills'] != "") & (resumes['education'] != "")]


In [64]:
# Create 'candidates_df' from the selected columns of 'resumes_filtered';
# .copy() makes it an independent DataFrame, so later changes do not trigger SettingWithCopyWarning
candidates_df = resumes_filtered[["ID", "job_title", "tech_skills", "education"]].copy()
# Lowercase the column names for consistency
candidates_df.columns = candidates_df.columns.str.lower()

# Remove any rows that contain NaN values to ensure data integrity
candidates_df = candidates_df.dropna()

# Display the DataFrame
candidates_df.sample(10)

|      | id         | job_title                             | tech_skills | education        |
|------|------------|---------------------------------------|-------------|------------------|
| 102  | 18731098.0 | SENIOR HR MANAGER                     | Excel       | Master           |
| 1265 | 16893572.0 | DIGITAL MARKETING MANAGER             | Excel       | Bachelor         |
| 94   | 15575117.0 | HR SENIOR SPECIALIST                  | Excel       | Master           |
| 880  | 19975121.0 | SOCIAL MEDIA PRODUCER                 | Excel       | Bachelor         |
| 333  | 12467531.0 | TEACHER                               | Excel       | Bachelor, Master |
| 1331 | 26341645.0 | SR BUSINESS SYSTEMS ANALYST           | Sql, Excel  | Bachelor         |
| 1110 | 14517953.0 | CONSULTANT                            | Excel       | Bachelor, Master |
| 1236 | 25723793.0 | SALES REPRESENTATIVE                  | Excel       | Bachelor         |
| 751  | 24548333.0 | SENIOR SPECIALTY SALES REPRESENTATIVE | Excel       | Bachelor         |
| 816  | 15293959.0 | CERTIFIED MASTER PERSONAL TRAINER     | Excel       | Bachelor, Master |
