![](image.jpg)

As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |

## Let's Get Started!

Put regular expressions and data preprocessing to work on a real-world HR challenge. This project is your chance to improve the hiring process by helping tech talent find the right role. Let's begin!


In [12]:
import pandas as pd
import re
from typing import Match

# Load the resume dataset from a CSV file into a DataFrame
resumes: pd.DataFrame = pd.read_csv('resumes.csv')
resumes.sample(3)

              ID                            Resume_str                Category
323   27295996.0            IT DIRECTOR Accomplishm...  INFORMATION-TECHNOLOGY
660   16519708.0  DIRECTOR OF BUSINESS DEVELOPMENT ...    BUSINESS-DEVELOPMENT
1121  28951817.0      CONSULTANT Professional Summa...              CONSULTANT


In [13]:
# Check the shape (rows, columns) of the DataFrame
print(resumes.shape)

(1352, 3)
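
A shape check is usually paired with a check for missing values. The following is a minimal sketch of comparing the shape before and after `dropna`; it uses made-up rows rather than `resumes.csv`, and the `toy` DataFrame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the resume data; the second row has no resume text
toy = pd.DataFrame(
    {
        "ID": [1.0, 2.0, 3.0],
        "Resume_str": ["HR MANAGER Summary ...", None, "IT DIRECTOR Summary ..."],
    }
)

shape_before = toy.shape
cleaned = toy.dropna(how="any")  # drop rows with at least one missing value
shape_after = cleaned.shape

print(shape_before, shape_after)
print(toy.isna().sum())  # number of missing values per column
```

If the two shapes are identical, as they are for the real dataset here, no rows contained nulls.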


In [14]:
# Check the data types
resumes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1352 entries, 0 to 1351
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          1352 non-null   float64
 1   Resume_str  1352 non-null   object 
 2   Category    1352 non-null   object 
dtypes: float64(1), object(2)
memory usage: 31.8+ KB


In [15]:
# Convert columns to correct data type
resumes = resumes.astype({"Resume_str" : 'string', "Category" : 'category'})


In [16]:
resumes.dtypes

ID                   float64
Resume_str    string[python]
Category            category
dtype: object
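
One reason to store a column with few distinct values as `category` is memory: each label is stored once and rows hold small integer codes. A rough sketch with invented labels (the `labels` Series is hypothetical, not the real `Category` column):

```python
import pandas as pd

# Hypothetical labels standing in for the Category column
labels = pd.Series(
    ["HR", "INFORMATION-TECHNOLOGY", "CONSULTANT"] * 1000, dtype=object
)

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

The exact numbers depend on the pandas version and platform, so none are quoted here; the categorical version should be noticeably smaller.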

In [17]:
# Define regex
jobs_regex = r"^([A-Z\s\.\,\-]+)\b"
skills_regex = r"\b(python|sql|r|excel)\b"
edu_regex = r"\b(PhD|Master|Bachelor)\b"

# Lists to store extracted data
job_titles: list[str] = []
tech_skills: list[str] = []
education: list[str] = []


# Functions to extract data
def get_job_title(resume: str) -> Match[str] | None:
    """Return the match for the upper-case job title at the start of a resume, or None."""
    return re.search(jobs_regex, resume)


def get_skills(resume: str) -> list[str]:
    """Get a list of skills from a string."""
    return re.findall(skills_regex, resume, flags=re.IGNORECASE)


def get_education(resume: str) -> list[str]:
    """Get a list of education history."""
    return re.findall(edu_regex, resume, flags=re.IGNORECASE)


# Extract data and add to lists
for resume in resumes['Resume_str']:

    # extract job titles
    jobs_match = get_job_title(resume)

    # if a job title is found, add to the list
    if jobs_match is not None:
        job_titles.append(jobs_match.group(0).strip())
    else:
        job_titles.append("")

    # extract tech skills, normalising case and removing duplicates
    skills = {skill.title() for skill in get_skills(resume)}
    tech_skills.append(", ".join(skills))

    # extract education history; the loop variable is named `level` so it
    # does not reuse the name of the `education` list
    education_history = {level.title() for level in get_education(resume)}
    education.append(", ".join(education_history))

# Add extracted data as columns to resumes df
resumes['job_titles'] = job_titles
resumes['tech_skills'] = tech_skills
resumes['education'] = education

# Keep only resumes where all three fields were extracted
filtered_df = resumes[
    (resumes['job_titles'] != "")
    & (resumes['tech_skills'] != "")
    & (resumes['education'] != "")
]

# Create a new df w/ only the columns required; .copy() makes it an
# independent DataFrame, which avoids pandas' SettingWithCopyWarning
candidates_df = filtered_df[['ID', 'job_titles', 'tech_skills', 'education']].copy()

# ensure column names are lower case
candidates_df.columns = candidates_df.columns.str.lower()

# Drop null rows (reassign instead of using inplace=True)
candidates_df = candidates_df.dropna(how='any')

# display the df
print(candidates_df.head())


            id     job_titles tech_skills         education
2   33176873.0    HR DIRECTOR       Excel  Master, Bachelor
4   17812897.0     HR MANAGER       Excel          Bachelor
8   11847784.0  HR SPECIALIST       Excel          Bachelor
9   32896934.0       HR CLERK           R          Bachelor
16  93002334.0     HR ANALYST       Excel          Bachelor
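
The loop above could also be written with pandas' vectorized string methods. The sketch below is one possible alternative, not the notebook's method: it reuses the three patterns from the cell above, while the `toy` DataFrame, its two resumes, and the `join_unique` helper are invented for illustration.

```python
import re

import pandas as pd

# Same patterns as in the cell above
jobs_regex = r"^([A-Z\s\.\,\-]+)\b"
skills_regex = r"\b(python|sql|r|excel)\b"
edu_regex = r"\b(PhD|Master|Bachelor)\b"

# Hypothetical resumes, invented for illustration
toy = pd.DataFrame(
    {
        "Resume_str": [
            "DATA ANALYST Summary Skilled in Python, SQL and Excel. Bachelor of Science.",
            "HR CLERK Summary Supported R&D hiring. Master of Arts.",
        ]
    }
)


def join_unique(found: list[str]) -> str:
    """Title-case, de-duplicate and sort matches so the output is deterministic."""
    return ", ".join(sorted({item.title() for item in found}))


text = toy["Resume_str"]

# str.extract returns the first capture group; strip the trailing whitespace
toy["job_titles"] = text.str.extract(jobs_regex, expand=False).str.strip()

# str.findall returns a list of matches per row, joined into one string here
toy["tech_skills"] = text.str.findall(skills_regex, flags=re.IGNORECASE).apply(join_unique)
toy["education"] = text.str.findall(edu_regex, flags=re.IGNORECASE).apply(join_unique)

print(toy[["job_titles", "tech_skills", "education"]])
```

Two behaviours are worth noticing. The second toy resume is assigned the skill `R` only because `\br\b` matches the `R` in `R&D`, so a single-letter alternative in `skills_regex` can produce false positives. Also, `str.title()` turns `SQL` into `Sql`, and sorting the set makes the joined string reproducible, whereas joining an unsorted set may vary in order between runs.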
