# Scoring a Resume

Recruiters often score and compare resumes to filter applicants and select candidates. This notebook walks through a simple process for generating an overall resume score from scores for individual categories.

In [27]:
import spacy
import numpy as np
import pandas as pd
import re
from glob import glob
nlp = spacy.load('en')

## Reading a Resume

The resumes are stored as PDF files, so we first need to extract their contents into strings that we can parse in Python. To do this, we convert each PDF to an image and then use the `pytesseract` library to convert the image to text with [optical character recognition](https://en.wikipedia.org/wiki/Optical_character_recognition). The logic for this lives in [`convert_resumes.py`](./convert_resumes.py), which we execute in the cell below.

In [53]:
!python3 convert_resumes.py
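
The contents of `convert_resumes.py` are not shown here, so the following is only a rough sketch of what a PDF-to-image-to-text step could look like. It assumes the third-party packages `pdf2image` (which needs Poppler installed) and `pytesseract` (which needs the Tesseract binary); the function names `text_path_for`, `pdf_to_text`, and `convert` are hypothetical and not taken from the script.

```python
import os

def text_path_for(pdf_path, out_dir="texts"):
    # Hypothetical helper: map "some/dir/Name.pdf" to "<out_dir>/Name.txt"
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, base + ".txt")

def pdf_to_text(pdf_path):
    # Imported lazily so the path helper above works without the OCR dependencies
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per PDF page
    # Run OCR on each page image and join the page texts
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def convert(pdf_path, out_dir="texts"):
    # Write the OCR text next to the other resume text files
    os.makedirs(out_dir, exist_ok=True)
    out_path = text_path_for(pdf_path, out_dir)
    with open(out_path, "w") as f:
        f.write(pdf_to_text(pdf_path))
    return out_path
```

OCR output is rarely perfect, which is worth keeping in mind when later matching steps miss a term that is clearly visible in the PDF.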

To get the resume as a string, we simply need to open the file in a `with`-`open` block and read in the text.

In [54]:
with open("texts/Jacqueline Angelina_Resume_Sept2019.txt") as f:
    resume = f.read()
    
print(resume[0:300])

Jacqueline Angelina

Education
University of California, Berkeley

B.A. Data Science, 2021
Certificate in Design Innovation
Certificate in Tech & Entrepreneurship

Relevant Coursework

Data Structures Data Science Principles

Business Analytics Probability for Data Science
Decision Modeling Microeco


## Splitting the Resume into Sections

To see how the resume is organized, and to prepare for information extraction later, we first split the resume into sections using regular expressions. The function `split_into_sections` below returns a dictionary with one key for each entry in `section_regexes` (plus `original` for the full text). Each value is the text of the corresponding section of the resume, or an empty string if that section was not found.

In [56]:
section_regexes = {
    "experience" : r"(^w?o?r?k? ?experience$)",
    "education" : r"(^education ?a?n?d? ?a?w?a?r?d?s?$)",
    "skills" : r"(^t?e?c?h?n?i?c?a?l? ?skills)",
    "projects" : r"^(projects?)$",
    "awards" : r"^(awards?)$",
    "activities" : r"^(e?x?t?r?a?c?u?r?r?i?c?u?l?a?r? ?activities)$"
}

def split_into_sections(resume):
    resume = [resume]
    sections = {}
    sections["original"] = resume[0]
    
    # split resume on sections in section_regexes
    for sec in section_regexes:
        splits = []
        for subsec in resume:
            try:
                splits += re.split(section_regexes[sec], subsec, maxsplit=1, flags = re.IGNORECASE | re.MULTILINE)
            except TypeError:
                pass
        resume = splits.copy()
    
    # create a mapping of these sections
    for sec in section_regexes:
        for i in range(len(resume)):
            if re.search(section_regexes[sec], resume[i], flags = re.IGNORECASE | re.MULTILINE):
                sections[sec] = resume[i+1]
        if sec not in sections:
            sections[sec] = ""

    return sections

resume_sections = split_into_sections(resume)
resume_sections

{'activities': '\n\nData Analyst / Upsync Consulting\nSales pipeline strategy for tech start-up\n\nDesign Consultant / Innovative Design\n\njacqueline.angelina@berkeley.edu\ngithub.com/jacquelineangelina\nlinkedin.com/in/jacquelineangelina\n\n',
 'awards': '\n\nFellowship Cohort Winner 2017\n\nCal Hacks 4.0\n\n1 of 8 teams selected from 1500\nhackers in Cal Hacks, the world’s largest\ncollegiate hackathon.\n\nReceived $1000 in project funding &\nmentorship to further develop a\nconversational language learning app.\n\nBest Venture, Best Branding 2016\nColumbia University Global\n\nEntrepreneurship Program\n\nInitiated and developed visual branding, user\ninterface design, target market\nsegmentation, and competitive analysis for\nStyle Stop, a public-sourced online clothing\nrental platform.\n\n',
 'education': '\nUniversity of California, Berkeley\n\nB.A. Data Science, 2021\nCertificate in Design Innovation\nCertificate in Tech & Entrepreneurship\n\nRelevant Coursework\n\nData Structu
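
The splitting above relies on a detail of `re.split`: when the pattern contains a capturing group, the matched text is kept in the result list, so the section body is always the element right after the matched heading. A small illustration on a made-up string (the sample text is an assumption, not one of the resumes):

```python
import re

# Made-up resume text with two section headings on their own lines
text = "Jane Doe\nEducation\nSome University\nSkills\nPython, SQL\n"

# MULTILINE lets ^ and $ match at line boundaries; the capturing group
# keeps the matched heading in the output list
parts = re.split(r"^(education)$", text, maxsplit=1,
                 flags=re.IGNORECASE | re.MULTILINE)

# parts[0] is the text before the heading, parts[1] the heading itself,
# and parts[2] everything after it
print(parts)
```

Because each later pattern is applied to every piece produced so far, the final list alternates between headings and the text that follows them.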

## Generating Resume Scores for Individual Categories Through Matching Keywords

To generate a total resume score, we will first calculate scores for individual categories. In this case, we will measure 4 categories: Skills/Buzzwords, Job Titles, Companies, and Education. We will measure a resume's performance in each category by how well it matches the qualities a recruiter is looking for, as listed in the recruiter's CSV files.

First, we will begin by generating an individual score for Buzzwords.

Here's an example of a recruiter's CSV containing a column of desired skills and a column with each skill's weight, since each skill is valued differently.

In [57]:
recruiter_buzzwords = pd.read_csv("skills.csv") 
recruiter_buzzwords

Unnamed: 0,skill,weight
0,python,2
1,pandas,2
2,java,2
3,r,2
4,html,1
5,css,1
6,sql,2
7,ruby,1
8,hadoop,2
9,spark,2


Next, we create a function to extract matches between the resume and the recruiter's desired list of skills/buzzwords. 

To do this, we tokenize the resume and break it into noun chunks using spaCy, then match both the tokens and the noun chunks against the recruiter's CSV.

In [58]:
def extract_matches(csv_name, resume_text, column):
    nlp_text = nlp(resume_text)

    # removing stop words and implementing word tokenization
    tokens = [token.text for token in nlp_text if not token.is_stop]
    
    # reading the csv file
    data = pd.read_csv(csv_name) 
    
    # extract values
    skills = data[column].tolist()
    
    skillset = []
    
    # check for one-grams (example: python)
    for token in tokens:
        if token.lower() in skills:
            skillset.append(token)
    
    # check for bi-grams and tri-grams (example: machine learning)
    for token in nlp_text.noun_chunks:
        token = token.text.lower().strip()
        if token in skills:
            skillset.append(token)
    
    return [i.capitalize() for i in set([i.lower() for i in skillset])]

Here are matching buzzwords between this resume and the recruiter's CSV:

In [59]:
matching_buzzwords = extract_matches('skills.csv', resume, 'skill')
matching_buzzwords

['Java', 'Html', 'Css', 'Python', 'Sql', 'R', 'Pandas']
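
The same one-gram and multi-word matching idea can be sketched without spaCy. The version below is a simplified stand-in, not the method used above: it splits on a basic regular expression instead of using spaCy tokens and noun chunks, and the names `ngrams` and `match_terms` are hypothetical.

```python
import re

def ngrams(words, n):
    # All runs of n consecutive words, joined with single spaces
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def match_terms(text, terms, max_n=3):
    # Lowercase both sides so matching is case-insensitive
    words = re.findall(r"[a-z0-9+#]+", text.lower())
    wanted = {t.lower() for t in terms}
    found = set()
    for n in range(1, max_n + 1):
        found.update(g for g in ngrams(words, n) if g in wanted)
    return sorted(found)

matches = match_terms(
    "Python, Java and Machine Learning coursework",
    ["python", "machine learning", "ruby"],
)
print(matches)  # ['machine learning', 'python']
```

Note that this sketch lowercases the recruiter's terms as well as the resume text, whereas `extract_matches` lowercases only the resume side.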

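Now that we have the matching buzzwords, we will calculate a score for the buzzwords category. The function below sums the weights of the matching buzzwords and then multiplies that sum by the number of matches, so a resume that matches more terms scores disproportionately higher.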

In [60]:
# Calculate a category score: sum the weights of the matched terms,
# then multiply that sum by the number of matches
def individual_score(data, skill_col, weight_col, matching_list):
    weights = []
    for i in matching_list:
        # look up the weight of the (lowercased) matched term
        weight_value = data.loc[data[skill_col] == i.lower()][weight_col].item()
        weights.append(weight_value)
    # an empty matching_list gives sum([]) * 0 == 0
    score = sum(weights)
    return score * len(matching_list)

buzzwords_score = individual_score(recruiter_buzzwords, 'skill', 'weight', matching_buzzwords)
buzzwords_score

84
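
To see where 84 comes from, the arithmetic can be reproduced by hand using the weights from the skills table and the seven matches shown above. This is just a worked check in plain Python, with the table copied into a dictionary:

```python
# Weights copied from the skills table above
weights = {"python": 2, "pandas": 2, "java": 2, "r": 2, "html": 1,
           "css": 1, "sql": 2, "ruby": 1, "hadoop": 2, "spark": 2}

# Matches found for this resume
matches = ["Java", "Html", "Css", "Python", "Sql", "R", "Pandas"]

weight_sum = sum(weights[m.lower()] for m in matches)  # 2+1+1+2+2+2+2 = 12
score = weight_sum * len(matches)                      # 12 * 7 = 84
print(weight_sum, score)
```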

Let's repeat the process with the Job Title and Company categories. 

#### Category: Job Titles

In [61]:
recruiter_job_list = pd.read_csv("job list.csv") 
recruiter_job_list.head()

Unnamed: 0,job_titles,weights
0,product analytics,2
1,product analyst,1
2,product management,1
3,data scientist,3
4,data analyst,2


In [62]:
matching_job_titles = extract_matches('job list.csv', resume, 'job_titles')
matching_job_titles

[]

In [63]:
job_title_score = individual_score(recruiter_job_list, 'job_titles', 'weights', matching_job_titles)
job_title_score

0

#### Category: Companies

In [64]:
recruiter_companies = pd.read_csv("us_companies.csv") 
recruiter_companies.head()

Unnamed: 0,company_name_id,company_name,url,company_category,weights
0,3-round-stones-inc,"3 Round Stones, Inc.",http://3RoundStones.com,Data/Technology,1
1,48-factoring-inc,48 Factoring Inc.,https://www.48factoring.com,Finance & Investment,1
2,5psolutions,5PSolutions,www.5psolutions.com,Data/Technology,2
3,abt-associates,Abt Associates,abtassoc.com,Research & Consulting,2
4,accela,Accela,http://www.accela.com,Governance,1


In [65]:
matching_companies = extract_matches('us_companies.csv', resume_sections['experience'], 'company_name')
matching_companies

[]

In [66]:
company_score = individual_score(recruiter_companies, 'company_name', 'weights', matching_companies)
company_score

0

### Another Method: Using Regex to Extract Information

To calculate an individual score for the education category, we can instead use a regular expression to extract the names of universities or colleges. In this case, we will extract this information from the education section of the resume.

In [67]:
recruiter_universities = pd.read_csv("universities.csv") 
recruiter_universities.head()

Unnamed: 0,university_name,weights
0,"University of California, Berkeley",3
1,"University of California, Davis",2
2,"University of California, Los Angeles",3
3,"University of California, Riverside",1
4,"University of California, San Diego",2


In [68]:
university_regex = r"\n(.*([Uu]niversity|[Cc]ollege).*)\n?"
university = re.findall(university_regex, resume_sections['education'])[0][0]
university

'University of California, Berkeley'
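
The `[0][0]` indexing is needed because the pattern has two capturing groups, so `re.findall` returns a list of tuples: the outer group holds the whole line and the inner group holds the keyword that matched. A short illustration on a made-up education string:

```python
import re

university_regex = r"\n(.*([Uu]niversity|[Cc]ollege).*)\n?"

# Made-up education section with the university on its own line
education = "\nUniversity of California, Berkeley\n\nB.A. Data Science, 2021\n"

found = re.findall(university_regex, education)
print(found)        # one (whole line, keyword) tuple per matching line
print(found[0][0])  # the whole line from the first match
```

Since the outer group captures the entire line, any extra text on the same line as the university name is included in the extracted string.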

Then we score the resume by looking up the extracted university's weight in the recruiter's CSV, much as we did for the previous categories.

In [69]:
def university_score(data, data_col, weight_col, university_name):
    # return the weight of an exactly matching university name, or 0 if there is none
    if university_name in data[data_col].tolist():
        return data.loc[data[data_col] == university_name][weight_col].item()
    return 0

In [70]:
education_score = university_score(recruiter_universities, 'university_name', 'weights', university)
education_score

3

## Generating a Total Resume Score

Now that we have scores for individual categories, we can generate a total resume score. Recruiters can use this score to compare resumes and guide their hiring process.

In [71]:
#print all individual section scores
buzzwords_score, job_title_score, company_score, education_score

(84, 0, 0, 3)

We will use a function to calculate a total resume score, weighting each factor by a specified percentage.

In [72]:
# Calculates the resume score as a composite of 4 factors: Buzzwords, Job Title, Education, and Companies
# In this example the factors are weighted 30%, 30%, 25%, and 15%, respectively

def resume_score(buzzword_score, buzzword_weight, job_score, job_weight, education_score, education_weight, 
                 company_score, company_weight):
    score = buzzword_score*buzzword_weight + job_score*job_weight + education_score*education_weight + company_score*company_weight
    return score 

In [73]:
resume_score(buzzwords_score, 0.3, job_title_score, 0.3, education_score, 0.25, company_score, 0.15)

25.95
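
The total is a weighted sum: 84 × 0.3 + 0 × 0.3 + 3 × 0.25 + 0 × 0.15 = 25.2 + 0.75 = 25.95. As an alternative formulation, the sketch below passes the scores and weights as dictionaries and checks that the weights add up to 1. The name `weighted_total` is hypothetical and is not part of the notebook's code.

```python
import math

def weighted_total(scores, weights):
    # Guard against weights that do not add up to 100%
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights should sum to 1")
    # Multiply each category score by its weight and add the results
    return sum(scores[name] * weights[name] for name in weights)

scores = {"buzzwords": 84, "job_titles": 0, "education": 3, "companies": 0}
weights = {"buzzwords": 0.3, "job_titles": 0.3, "education": 0.25, "companies": 0.15}

total = weighted_total(scores, weights)
print(round(total, 2))
```

Keying scores and weights by category name makes it harder to pair a score with the wrong weight than with eight positional arguments.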

Great - we've successfully generated a total resume score!

### Scores in Bulk

If we want to calculate the scores of our resumes in bulk, we can store the constituent parts of the resumes in a DataFrame and apply our functions to the corresponding columns. In the cell below, we start by using the `glob` function to list all of the resume text files in the `texts` directory.

In [74]:
resume_files = glob("texts/*.txt")
resume_files

['texts/Alec_Resume.txt',
 'texts/Chris_Pyles_Resume.txt',
 'texts/Jacqueline Angelina_Resume_Sept2019.txt']
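
`glob` matches file names against a shell-style pattern, and whether its results come back sorted depends on the file system. If the row order of the DataFrame matters, wrapping the call in `sorted` makes it reproducible. A self-contained illustration using a temporary directory with placeholder files (the file names are made up):

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Create two .txt files and one file that should not match the pattern
    for name in ["b_resume.txt", "a_resume.txt", "notes.md"]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder")

    # Only the .txt files match; sorted() fixes the order
    found = sorted(glob(os.path.join(tmp, "*.txt")))
    names = [os.path.basename(p) for p in found]

print(names)  # ['a_resume.txt', 'b_resume.txt']
```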

To put these into a DataFrame, we first need to read each file and split its text into sections. For this, we will use our `split_into_sections` function from above and Python's `map` function to apply it to each resume text in the list.

In [75]:
# read in all the txt files
resume_texts = []
for path in resume_files:
    with open(path) as f:
        resume_texts.append(f.read())
        
# apply our function to the resume_texts list
sections = list(map(split_into_sections, resume_texts))

# create a DataFrame
resumes = pd.DataFrame(sections)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ..."
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,..."


Now let's apply our functions from above to the columns of our DataFrame to generate the scores we will use. Recall that we can apply a function to a pandas `Series` using [`pd.Series.apply`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html).

In [86]:
# extract buzzword matches and calculate score
resumes["matches"] = resumes["original"].apply(lambda r: extract_matches('skills.csv', r, 'skill'))
resumes["buzzwords_score"] = resumes["matches"].apply(
    lambda m: individual_score(recruiter_buzzwords, 'skill', 'weight', m)
)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills,matches,buzzwords_score,job_titles,jobs_score
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...,"[Data structures, Java, Html, Python, Css, Sql...",112,[],0
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ...","[Python, Ruby, Data structures, R]",28,[],0
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,...","[Java, Html, Css, Python, Sql, R, Pandas]",84,[],0


In [87]:
# job titles
resumes["job_titles"] = resumes["original"].apply(lambda r: extract_matches('job list.csv', r, 'job_titles'))
resumes["jobs_score"] = resumes["job_titles"].apply(
    lambda m: individual_score(recruiter_job_list, 'job_titles', 'weights', m)
)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills,matches,buzzwords_score,job_titles,jobs_score
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...,"[Data structures, Java, Html, Python, Css, Sql...",112,[],0
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ...","[Python, Ruby, Data structures, R]",28,[],0
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,...","[Java, Html, Css, Python, Sql, R, Pandas]",84,[],0


In [88]:
# companies
resumes["companies"] = resumes["experience"].apply(
    lambda r: extract_matches('us_companies.csv', r, 'company_name')
)
resumes["companies_score"] = resumes["companies"].apply(
    lambda m: individual_score(recruiter_companies, 'company_name', 'weights', m)
)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills,matches,buzzwords_score,job_titles,jobs_score,companies,companies_score
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...,"[Data structures, Java, Html, Python, Css, Sql...",112,[],0,[],0
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ...","[Python, Ruby, Data structures, R]",28,[],0,[],0
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,...","[Java, Html, Css, Python, Sql, R, Pandas]",84,[],0,[],0


In [89]:
# university
resumes["university"] = resumes["education"].apply(
    lambda r: re.findall(university_regex, r)[0][0]
)
resumes["university_score"] = resumes["university"].apply(
    lambda m: university_score(recruiter_universities, 'university_name', 'weights', m)
)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills,matches,buzzwords_score,job_titles,jobs_score,companies,companies_score,university,university_score
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...,"[Data structures, Java, Html, Python, Css, Sql...",112,[],0,[],0,"University of California, Berkeley / BA Statis...",0
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ...","[Python, Ruby, Data structures, R]",28,[],0,[],0,"University of California, Berkeley Anticipated...",0
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,...","[Java, Html, Css, Python, Sql, R, Pandas]",84,[],0,[],0,"University of California, Berkeley",3


Now that we have all the component scores, let's use our weights to calculate the final resume score for each of these resumes.

In [92]:
resumes["score"] = resumes.apply(
    lambda r: resume_score(
        r["buzzwords_score"], 0.3, 
        r["jobs_score"], 0.3, 
        r["university_score"], 0.25, 
        r["companies_score"], 0.15
    ), axis=1
)
resumes.head()

Unnamed: 0,activities,awards,education,experience,original,projects,skills,matches,buzzwords_score,job_titles,jobs_score,companies,companies_score,university,university_score,score
0,,,"\n\nUniversity of California, Berkeley / BA St...",\n\nSusquehanna International Group / Trading ...,"Alexander Kan 1888 Berkeley Way #505 Berkeley,...",,HONORS\nProgramming Languages - Python (Panda...,"[Data structures, Java, Html, Python, Css, Sql...",112,[],0,[],0,"University of California, Berkeley / BA Statis...",0,33.6
1,\nConnector Assistant January 2019 - Present\n...,,"\nUniversity of California, Berkeley Anticipat...",\nSenior Developer May 2019 - Present\nUC Berk...,CHRISTOPHER PYLES\n\nchrispyles.io | (818) 826...,\nHonors Thesis: DataHub Effectiveness_at UC B...,"\n- Languages and Frameworks: Python, R, Ruby ...","[Python, Ruby, Data structures, R]",28,[],0,[],0,"University of California, Berkeley Anticipated...",0,8.4
2,\n\nData Analyst / Upsync Consulting\nSales pi...,\n\nFellowship Cohort Winner 2017\n\nCal Hacks...,"\nUniversity of California, Berkeley\n\nB.A. D...","\n\nSummer 2019\nSan Francisco, CA\n\nProduct ...",Jacqueline Angelina\n\nEducation\nUniversity o...,\nSpam Email Classifier Mar 2019\nLogistic Reg...,"\n\nProgramming\n\nPython, Java, R, HTML, CSS,...","[Java, Html, Css, Python, Sql, R, Pandas]",84,[],0,[],0,"University of California, Berkeley",3,25.95


You should notice that some of the resumes got zeros in different categories even though they have some matches to what we're looking for. Why do you think that might be? What can we do to improve this resume bot to find more matches?

_Type your answer here, replacing this text._

## Try it Yourself!

Different recruiters will weight each factor differently. Using the `resume_score` function, generate a resume score with buzzwords weighted at 30%, job titles at 33%, education at 17%, and companies at 20%.

In [None]:
resume_score(buzzwords_score, ..., job_title_score, ..., education_score, ..., company_score, ...)