![](image.jpg)

As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |

## Let's Get Started!

Apply data analysis techniques to a real-world HR challenge. This project is your chance to improve the hiring process by helping tech talent find the right job. Let's begin!


In [2]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

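|     | ID         | Resume_str                         | Category             |
|-----|------------|------------------------------------|----------------------|
| 594 | 39237915.0 | BUSINESS DEVELOPMENT MANAGER Pr... | BUSINESS-DEVELOPMENT |
| 342 | 33704389.0 | TEACHER Summary My applied...      | TEACHER              |
| 333 | 12467531.0 | TEACHER Professional Summary ...   | TEACHER              |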


In [3]:
# Get shape before and after dropping null rows
print(resumes.shape)

# Drop null rows
resumes.dropna(how='any', inplace=True)
print(resumes.shape)

(1352, 3)
(1352, 3)
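
The shape is unchanged, which suggests no row contained a null. Counting nulls per column before dropping makes that explicit. The sketch below is a minimal illustration on a made-up frame: `toy` and its values are assumptions for demonstration, not rows from `resumes.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one row is missing its resume text
toy = pd.DataFrame({
    "ID": [1.0, 2.0, 3.0],
    "Resume_str": ["TEACHER Summary", None, "CHEF Profile"],
    "Category": ["TEACHER", "HR", "CHEF"],
})

# Count missing values per column before deciding to drop anything
null_counts = toy.isna().sum()
print(null_counts)

# Drop rows with any missing value; this returns a new DataFrame
cleaned = toy.dropna(how="any")
print(toy.shape, "->", cleaned.shape)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which can be handy while exploring.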


In [4]:
# Check the data types
resumes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1352 entries, 0 to 1351
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          1352 non-null   float64
 1   Resume_str  1352 non-null   object 
 2   Category    1352 non-null   object 
dtypes: float64(1), object(2)
memory usage: 31.8+ KB


In [5]:
# Convert columns to correct data type
resumes = resumes.astype({"Resume_str": "string", "Category": "category"})


In [6]:
# Find all the categories
set(resumes["Category"])

{'ADVOCATE',
 'AGRICULTURE',
 'AUTOMOBILE',
 'BPO',
 'BUSINESS-DEVELOPMENT',
 'CHEF',
 'CONSULTANT',
 'DESIGNER',
 'DIGITAL-MEDIA',
 'FITNESS',
 'HEALTHCARE',
 'HR',
 'INFORMATION-TECHNOLOGY',
 'SALES',
 'TEACHER'}
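
`set(...)` lists the distinct labels but not how many resumes carry each one. A possible complement is `value_counts()`, sketched here on a few made-up labels (the `categories` Series is illustrative only, not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels standing in for the real Category column
categories = pd.Series(
    ["TEACHER", "TEACHER", "INFORMATION-TECHNOLOGY", "HR"], dtype="category"
)

# Number of rows per label
counts = categories.value_counts()
print(counts)

# Distinct labels, available directly from the categorical dtype
labels = list(categories.cat.categories)
print(labels)
```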

In [7]:
resumes.dtypes

ID                   float64
Resume_str    string[python]
Category            category
dtype: object
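
One reason to prefer `category` for a column with few distinct labels is memory: each label is stored once and rows hold small integer codes. The sketch below compares both representations on made-up data, so the byte counts it prints say nothing about `resumes.csv` itself.

```python
import pandas as pd

# Hypothetical low-cardinality column: four labels repeated many times
labels = pd.Series(["TEACHER", "HR", "CHEF", "SALES"] * 1000)

# deep=True includes the memory held by the strings themselves
as_text = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"text:     {as_text} bytes")
print(f"category: {as_category} bytes")
```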

In [9]:
from typing import Optional

# Define regex patterns (matching is case-sensitive as written)
jobs_regex = r"^([A-Z\s\.\,\-]+)\b"
skills_regex = r"\b(python|SQL|R|excel)\b"
edu_regex = r"\b(PhD|Master|Bachelor)\b"

# Create empty lists to store our info
job_titles = []
tech_skills = []
education = []

# Define functions to extract data
def get_job_title(resume: str) -> Optional[str]:
    # Return the leading all-caps title, or None when the pattern does not match
    match = re.search(jobs_regex, resume)
    return match.group(1).strip() if match else None


def get_skills(resume: str) -> list[str]:
    # Return every skill keyword found in the resume text
    return re.findall(skills_regex, resume)


def get_education(resume: str) -> list[str]:
    # Return every degree keyword found in the resume text
    return re.findall(edu_regex, resume)


for row in resumes.itertuples():
    resume = row.Resume_str
    job_titles.append(get_job_title(resume))
    tech_skills.append(get_skills(resume))
    education.append(get_education(resume))


In [7]:
help(re.findall)

In [11]:
job_titles[:10]

In [12]:
tech_skills[:10]

In [13]:
education[:10]
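
A quick way to sanity-check the three patterns is to run them on one short string before looping over the whole DataFrame. The sketch below reuses the regexes from the cell above; `sample` is a made-up resume invented for illustration. Each result is captured explicitly, since a helper that calls `re.search` or `re.findall` without a `return` statement yields `None`.

```python
import re
import pandas as pd

# Same patterns as in the notebook cell
jobs_regex = r"^([A-Z\s\.\,\-]+)\b"
skills_regex = r"\b(python|SQL|R|excel)\b"
edu_regex = r"\b(PhD|Master|Bachelor)\b"

# Made-up resume text, for illustration only
sample = "DATA ANALYST Summary Bachelor of Science. Skills: SQL, R, excel and Python."

# re.search returns a Match object (or None), so pull out the captured group
match = re.search(jobs_regex, sample)
title = match.group(1).strip() if match else None

# re.findall returns a list of the captured keywords
skills = re.findall(skills_regex, sample)
degrees = re.findall(edu_regex, sample)

# The skills pattern is case-sensitive, so "Python" is missed unless a flag is passed
skills_any_case = re.findall(skills_regex, sample, flags=re.IGNORECASE)

print(title)            # DATA ANALYST
print(skills)           # ['SQL', 'R', 'excel']
print(degrees)          # ['Bachelor']
print(skills_any_case)  # ['SQL', 'R', 'excel', 'Python']

# Vectorized alternative using pandas string methods instead of a Python loop
texts = pd.Series([sample], dtype="string")
titles_col = texts.str.extract(jobs_regex, expand=False).str.strip()
skills_col = texts.str.findall(skills_regex)
```

Whether to pass `re.IGNORECASE` is a judgment call: it catches "Python" and "Excel", but it also lets the single-letter `R` alternative match a lowercase "r" standing alone.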