![](image.jpg)

As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |

## Let's Get Started!

Use this project to apply advanced data analysis techniques to a real-world HR challenge. It is your chance to improve the hiring process by helping tech talent find the right job. Let's begin!


In [17]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

Unnamed: 0,ID,Resume_str,Category
585,11289482.0,"BUSINESS DEVELOPMENT MANAGER, VP ...",BUSINESS-DEVELOPMENT
190,30965258.0,Zachory Edmiston Summary...,DESIGNER
519,24124250.0,SENIOR APPLICATION SPECIALIST P...,ADVOCATE
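
The sampled rows suggest that some resumes open with an all-caps job title while others open with a person's name. As a minimal sketch of how the `regex_job_title` pattern defined later in this notebook treats those two cases, the snippet below runs it on two made-up strings (they are illustrative assumptions, not rows from `resumes.csv`):

```python
import re

# Same pattern as regex_job_title used later in this notebook
regex_job_title = r"^([A-Z\s\.\,\-]+)\b"

# Made-up resume openings for illustration only
samples = [
    "HR ASSISTANT Summary Dedicated professional",  # opens with an all-caps title
    "Jane Doe Summary Experienced designer",        # opens with a mixed-case name
]

titles = []
for text in samples:
    match = re.search(regex_job_title, text)
    # No match means the resume does not start with an all-caps run
    titles.append(match.group(0).strip() if match else "")

print(titles)  # ['HR ASSISTANT', '']
```

The trailing `\b` makes the regex engine back off from the capital `S` of `Summary`, so the match ends at the last fully uppercase word. Resumes that start with a mixed-case name yield an empty title, which is why they drop out once empty fields are filtered.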


In [18]:
# Start coding here!
# Use as many cells as you need.

In [19]:
# Regex for extracting the specific technical skills from the Resume_str column
# The word boundaries (\b) ensure that the matches are complete words, not substrings of larger words
regex_skills = r"\b(python|sql|r|excel)\b"

# regex_job_title: Captures a job title at the very start of the resume as a run of uppercase letters, possibly including whitespace, periods, commas, or hyphens; the trailing \b ends the match at a complete word.
regex_job_title = r"^([A-Z\s\.\,\-]+)\b"

# Regular expression for identifying educational degrees
regex_education = r"\b(PHD|MSc|Master|BSc|Bachelor)\b"


# Lists to store the extracted information
job_titles = []
tech_skills =[]
educations = []

# Loop over each resume text in the Resume_str column
for resume in resumes["Resume_str"]:
    
    # Extract job title using regex_job_title
    job_title_match = re.search(regex_job_title, resume)
    if job_title_match is not None:
        job_title = job_title_match.group(0).strip()
    else:
        job_title = ""
    job_titles.append(job_title)
    
    # Extract tech_skills using regex_skills
    skills_match = re.findall(regex_skills, resume, flags=re.IGNORECASE)
    unique_skills = []
    for skill in skills_match:  # Remove duplicates and format to title case
        skill_title = skill.title()
        if skill_title not in unique_skills:
            unique_skills.append(skill_title)
    tech_skills.append(", ".join(unique_skills))
        
    # Extract education using regex_education
    education_match = re.findall(regex_education, resume, flags=re.IGNORECASE)
    unique_education = []
    for education in education_match:  
        education_title = education.title()
        if education_title not in unique_education:
            unique_education.append(education_title)
    educations.append(", ".join(unique_education))  
    
# Add the extracted lists to the DataFrame as new columns
resumes['job_title'] = job_titles
resumes['tech_skills'] = tech_skills
resumes['education'] = educations

# Keep only complete records: rows where job title, tech skills, and education are all non-empty
resumes_filtered = resumes[(resumes['job_title'] != "") & (resumes['tech_skills'] != "") & (resumes['education'] != "")]

# Create a new DataFrame 'candidates_df' from 'resumes_filtered' with the selected columns;
# .copy() avoids pandas' SettingWithCopyWarning when the result is modified below
candidates_df = resumes_filtered[["ID", "job_title", "tech_skills", "education"]].copy()
# Lowercase the column names for consistency
candidates_df.columns = candidates_df.columns.str.lower()

# Drop any rows that still contain missing values
candidates_df = candidates_df.dropna()

# Display the DataFrame
candidates_df.sample(10)

Unnamed: 0,id,job_title,tech_skills,education
293,18187364.0,INFORMATION TECHNOLOGY SPECIALIST INFORMATION ...,Sql,"Master, Bachelor"
50,30646367.0,HR ASSISTANT,Excel,Master
1173,95792386.0,CONSULTANT,"Excel, Sql","Master, Bachelor"
300,17111768.0,INFORMATION TECHNOLOGY PROJECT MANAGER SYSTEM ...,Sql,"Master, Bachelor"
1158,27726066.0,CONSULTANT,Excel,Bachelor
273,70089206.0,INFORMATION TECHNOLOGY SPECIALIST,Sql,"Master, Bachelor"
1039,34303500.0,SALES DIRECTOR,Excel,Bachelor
1310,28790806.0,DATASTAGE ETL DEVELOPER,Sql,Bachelor
162,51681660.0,PRODUCTION DESIGNER,Excel,Master
664,32042584.0,BUSINESS DEVELOPMENT INTERN,"R, Sql",Bachelor
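
The skills and education steps in the cell above both follow the same find, title-case, and deduplicate sequence. One possible way to factor that out is sketched below; `extract_unique` is a hypothetical helper name (not part of the original notebook), and the sample sentence is made up for illustration:

```python
import re

# Same pattern as regex_skills in the cell above
regex_skills = r"\b(python|sql|r|excel)\b"

def extract_unique(pattern, text):
    """Hypothetical helper: find all case-insensitive matches, title-case
    them, and drop duplicates while keeping first-seen order."""
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    # dict.fromkeys preserves insertion order (Python 3.7+), so duplicates
    # are removed without reordering the remaining values
    unique = dict.fromkeys(match.title() for match in matches)
    return ", ".join(unique)

# Made-up text: "SQL" and "sql" collapse to one entry, and the word
# boundaries keep "r" from matching inside "PR", "for", or "reports"
sample = "Expert in SQL and Excel; wrote sql reports for the PR team."
skills = extract_unique(regex_skills, sample)
print(skills)  # Sql, Excel
```

With a helper like this, the loop body could call `extract_unique(regex_skills, resume)` and `extract_unique(regex_education, resume)` instead of repeating the deduplication logic. Note that `.title()` renders acronyms as `Sql` rather than `SQL`, matching the output shown above.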

