## Matching Job Description and Resume with Word2Vec

Word2Vec is a famous natural language processing model by Stanford and Google (https://nlp.stanford.edu/projects/glove/). The idea is a word's definition can be defined by when it appears in a context and by the surrounding words that usually come with it. In this notebook, we try to make use of the pretrained Word2Vec model (a very well-trained model by Google) to learn how a candidate's qualifications match with the job description

First, we use IBM's Natural Language Understanding API to extract keywords and concepts from a job description and resumes, then we apply our Word2Vec model to measure the "distance" between the qualifications of the candidates and the requirements of the resumes. 

In [1]:
import gensim.downloader as api
import os
import glob
import json

In [2]:
import PyPDF2 
import textract

In [3]:
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, ConceptsOptions,CategoriesOptions

In [4]:
naturalLanguageUnderstanding = NaturalLanguageUnderstandingV1(
    version='2018-11-16',
    iam_apikey='PiMaIaLt9pDN4q670Fo0nqGAdqMY_Vx50thjccVVmRTN',
    url='https://gateway.watsonplatform.net/natural-language-understanding/api',
)

In [5]:
!pwd

/Users/hoangho/TechTogether/TTB_backup


In [6]:
word_vectors = api.load("glove-wiki-gigaword-100")

In [7]:
resume_texts = []
resume_names = []

In [8]:
path = "./resumes/"

def listdir_nohidden(path):
    return glob.glob(os.path.join(path, '*'))

In [9]:
resumes = listdir_nohidden(path)

In [10]:
path = "./resumes/"    

for resume in resumes:
    # read all pdf files in the directory
    print(resume)
    resume_names.append(resume)
    pdfFileObj = open(resume, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    # The while loop will read each page
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()

    if text != "":
        text = text
    else:
        text = textract.process(fileurl, method='tesseract', language='eng')
    resume_texts.append(text)

./resumes/John_Doe.pdf
./resumes/Sebastian_Henty.pdf
./resumes/He_Zhang.pdf
./resumes/maria.robertson-resume.pdf


In [11]:
job_description = textract.process("/Users/hoangho/TechTogether/TTB_Backup/Data Scientist Job Description Template.docx").decode("utf-8")

In [12]:
extracted_resumes = []

In [13]:
for resume_text in resume_texts:
    print(resume_text)
    response = naturalLanguageUnderstanding.analyze(
    text=resume_text,
    features=Features(keywords=KeywordsOptions(limit=8), concepts=ConceptsOptions(limit=8))).get_result()
    json_str = json.dumps(response)
    data = json.loads(json_str)
    ddict = {}
    ddict["keywords"] = [obj['text'] for obj in data['keywords']]
    ddict["concepts"] = [obj['text'] for obj in data['keywords']]
    extracted_resumes.append(ddict)

John Doe
 
Full 
Add
ress
 

 
City, State, ZIP
 

 
Phone Number
 

 
E
-
mail
 
 
 
OBJECTIVE: 
Design 
apparel 
print for an innovative retail company
 
 
EDUCATION
:
 
 
UNIVERSITY OF MINNESOTA         
                                              
   
 
  
    
            
 
 
 
 
 
 
 
 
            
  
City, State
 
College of Design
 
 
 
 
 
 
 
     
  
 
   
 
   
 
 
 
 
 
 
      
  
   
 
 
  
May 2011
 

 
Bachelor of Science
 
in 
Graphic Design
 
 

 
Cumulative GPA 3.93

 

 
Twin cities Iron Range Scholarship
 
 
WORK 
EXPERIENCE:
 
 
 
AMERICAN EAGLE    
                                                            
                                         
            
  
City, State
 
 
Sales Associate           
                       
                                                                                  
July 2009 
-
 
present
 

 
Collaborated
 
with the store merchandiser 
creating
 
displays to attract clientele
 

 
Use 
my 
trend awareness to as

Maria Robertson
 Manor Farm Cottage, Water Lane, Drayton St Leonards, Oxfordshire OX10 7BE
 Telephone
 +44 (
0)1865 890055  |  
Mobile
 +44 07766 115448  |  
Email
 maria@definemedia.co.uk
 Skype:
 maria.robertson74
  |  LinkedIn
 http://uk.linkedin
.com/in/mariarobertson1
  |  http://definemedia.co.uk/
  A creative and effective design professional with extensive experience across web, multimedia and print 
design. Has the ability to look at a project from the ground up and work through to an effective and highly 
polished design so
lution, either independently or as part of a team. Takes a hands on approach to projects, 
with careful attention to detail, and has the ability to manage, motivate and inspire a design team. Thrives on 
creative challenges, and developing innovative, user
-focused design. Combines both creative and technical 
skills, to produce a compelling visual experience that users respond to with inspiration and ease.
 KEY SKILLS
 ¥ Design Management 
Ð Experienced D

In [14]:
extracted_resumes

[{'keywords': ['nd displays',
   'organizational skills',
   'leadership skills',
   'Target Inc.',
   'company sales goals',
   'trend awareness',
   'Cumulative GPA',
   'promotional events'],
  'concepts': ['nd displays',
   'organizational skills',
   'leadership skills',
   'Target Inc.',
   'company sales goals',
   'trend awareness',
   'Cumulative GPA',
   'promotional events']},
 {'keywords': ['Create market simulators',
   'insightful data',
   'research assistants',
   'new survey tool',
   'user interface',
   'ed patterns',
   'Bucknell University',
   'bastien Genty1209 Page St'],
  'concepts': ['Create market simulators',
   'insightful data',
   'research assistants',
   'new survey tool',
   'user interface',
   'ed patterns',
   'Bucknell University',
   'bastien Genty1209 Page St']},
 {'keywords': ['Excellent English writing',
   'English-Chinese Translator',
   'Data Scientist',
   'Strong team-work spirit',
   'Strong data',
   'Machine Learning',
   'image recogni

In [15]:
response = naturalLanguageUnderstanding.analyze(
    text=job_description,
    features=Features(keywords=KeywordsOptions(limit=8), concepts=ConceptsOptions(limit=8))).get_result()

In [16]:
json_str = json.dumps(response)
data = json.loads(json_str)
jobDict = {}
jobDict["keywords"] = [obj['text'] for obj in data['keywords']]
jobDict["concepts"] = [obj['text'] for obj in data['keywords']]

In [17]:
jobDict

{'keywords': ['strong experience',
  'Data Scientist',
  'customer experiences',
  'company data',
  'statistical computer languages',
  'large data sets',
  'ideal candidate',
  'Knowledge of a variety of machine'],
 'concepts': ['strong experience',
  'Data Scientist',
  'customer experiences',
  'company data',
  'statistical computer languages',
  'large data sets',
  'ideal candidate',
  'Knowledge of a variety of machine']}

In [18]:
flatKeywords = []
for term in jobDict["keywords"]:
    flatKeywords.extend(term.lower().split())

flatConcepts = []
for term in jobDict["concepts"]:
    flatConcepts.extend(term.lower().split())

In [19]:
distance_list = []

In [20]:
for extract in extracted_resumes:
    keyword_list = []
    
    for term in extract["keywords"]:
        keyword_list.extend(term.lower().split())
    
    concept_list = []
    for term in extract["concepts"]:
        concept_list.extend(term.lower().split())
    keywords_dist = word_vectors.wmdistance(keyword_list, flatKeywords)
    concept_dist = word_vectors.wmdistance(concept_list, flatConcepts)
    if (keywords_dist != float("inf") and concept_dist != float("inf")):
        keywords_dist /= (len(keyword_list) + len(flatKeywords))
        concept_dist /= (len(concept_list) + len(flatConcepts))
        distance = keywords_dist + concept_dist
    else:
        distance = float("inf")
    distance_list.append(distance)

In [21]:
distance_list

[0.2872202397715704,
 0.26465220793841393,
 0.2033768402677651,
 0.3099036125959687]

In [22]:
resume_names

['./resumes/John_Doe.pdf',
 './resumes/Sebastian_Henty.pdf',
 './resumes/He_Zhang.pdf',
 './resumes/maria.robertson-resume.pdf']

## Major Drawbacks

There are two major drawbacks of our approach: 

* First, the pdf parsing is unstable, on some pdf file, it works fine, on others, it doesn't work
* Second, IBM's Natural Language Understand API doesn't perform well in extracting the content of a resume, we found a github repo that supports resume skills parsing (https://github.com/bjherger/ResumeParser) which performs really well, but we run out of time to incorporate it into our models. I believe, if we efficiently extract content from a resume, the matching algorithm with Word2Vec can perform very well