**NLP Project : Resume Parsing and JD Matching** 

---





Step 1 : Extracting text from Resume Files

---
The resume text is extracted from three file types: .pdf, .docx and .doc.

1a) To extract text from .pdf files, we use the pdfminer.six library, installed with pip.

1b) To extract text from .docx files, we use the docx2txt library, installed with pip.

1c) To extract text from .doc files, we use the catdoc command-line tool, which reads MS Word files and prints ASCII text. It is installed with apt-get.

For natural language processing tasks such as tokenization, we use the nltk (Natural Language Toolkit) library, and for basic processing the numpy library.

    


In [33]:
!pip install nltk
!pip install numpy
!pip install docx2txt
!pip install pdfminer.six
!apt-get update
!apt-get install -y catdoc


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
catdoc is already the newest version (1:0.95-4.1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Step 2 : Import the required libraries and packages

---



In [34]:
import re
import os
import sys
import nltk
import docx2txt
import subprocess
from pdfminer.high_level import extract_text
 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords') 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Step 3 : Data Extraction from Resume file

---

3a) Define a function to extract text from a .pdf file

3b) Define a function to extract text from a .docx file

3c) Define a function to extract text from a .doc file

In [35]:
# function to extract text from a PDF file
def pdf_txt_extraction(pdf_file_path):
    return extract_text(pdf_file_path)

In [36]:
def docx_txt_extraction(docx_file_path):
    txt = docx2txt.process(docx_file_path)
    if txt:
        return txt.replace('\t', ' ') # replace the tabs in text with space
    return None

In [37]:
def doc_txt_extraction(doc_file_path):
    try:
        process = subprocess.Popen(
            ['catdoc', '-w', doc_file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()
 
    return (stdout.strip(), stderr.strip())
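
The function above returns a `(text, error)` tuple instead of raising, so callers need to check the first element for `None`. The same pattern can be sketched with `subprocess.run` and a timeout. This is a minimal sketch, not the notebook's method: `run_text_tool` is a hypothetical helper, and the demo uses the Python interpreter as a stand-in command so that it runs without catdoc installed.

```python
import subprocess
import sys


def run_text_tool(cmd, timeout=30):
    """Run a command and return (stdout, stderr), or (None, message) if it cannot run.

    Hypothetical helper for illustration; not part of the notebook.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, ValueError, subprocess.SubprocessError) as err:
        # FileNotFoundError is an OSError; TimeoutExpired is a SubprocessError
        return (None, str(err))
    return (result.stdout.strip(), result.stderr.strip())


# Stand-in command so the sketch runs without catdoc installed
ok_text, ok_err = run_text_tool([sys.executable, "-c", "print('hello resume')"])

# A command that does not exist: the error is reported instead of raised
missing_text, missing_err = run_text_tool(["no-such-tool-for-this-demo", "file.doc"])

print(ok_text)
print(missing_text, "|", missing_err)
```

With catdoc available, the call would presumably look like `run_text_tool(['catdoc', '-w', doc_file_path])`.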

Step 4: Extraction of Entities from the resume data extracted

---

Define functions to extract entities such as person names, phone numbers, email addresses, skills and educational institution names from the extracted resume text.

In [38]:
def name_entity_extraction(txt):
    names_person = []
 
    for sent in nltk.sent_tokenize(txt):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                names_person.append(
                    ' '.join(chunk_leave[0] for chunk_leave in chunk.leaves())
                )
 
    return names_person

In [39]:
PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')
def tuple_conversion(tup):
    st = ''.join(map(str, tup))
    return st

def phone_no_extraction(text_resume):
    text_resume = tuple_conversion(text_resume)
    phone_no = re.findall(PHONE_REG, text_resume)
 
    if phone_no:
        number = ''.join(phone_no[0])
 
        if text_resume.find(number) >= 0 and len(number) < 16:
            return number
    return None
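
To see how the pattern and the length check interact, the sketch below applies the same regular expression to a few made-up strings. `first_phone_number` is a hypothetical, simplified stand-in for the function above: it keeps the rule that only the first match is considered and that it must be shorter than 16 characters.

```python
import re

PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')


def first_phone_number(text, max_len=16):
    """Return the first match if it is shorter than max_len characters, else None."""
    matches = PHONE_REG.findall(text)
    if matches and len(matches[0]) < max_len:
        return matches[0]
    return None


# Made-up sample strings, not taken from any resume
samples = [
    "Phone: 555-123-4567",
    "Call +1 (555) 123-4567 after 5pm",
    "Born 1990, graduated 2012",
]
results = {s: first_phone_number(s) for s in samples}
for sample, number in results.items():
    print(repr(sample), "->", number)
```

The second sample does contain a match, `+1 (555) 123-4567`, but it is 17 characters long, so the length check rejects it. Numbers written with a country code and parentheses can therefore be reported as not found.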
 

In [40]:
EMAIL_REG = re.compile(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+')
 
def emails_extract(res_text):
    return re.findall(EMAIL_REG, res_text)
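
The pattern above only lists lowercase letters, so addresses typed with capitals are skipped. The sketch below shows this on a made-up string and compares it with a case-insensitive variant. `EMAIL_REG_ANYCASE` is a hypothetical name introduced here for illustration.

```python
import re

EMAIL_REG = re.compile(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+')

# Hypothetical variant: same pattern, compiled with re.IGNORECASE
EMAIL_REG_ANYCASE = re.compile(EMAIL_REG.pattern, re.IGNORECASE)

# Made-up sample text
text = "Contact: jane.doe@example.com or John.Smith@Example.COM"

lowercase_only = EMAIL_REG.findall(text)
any_case = EMAIL_REG_ANYCASE.findall(text)

print(lowercase_only)
print(any_case)
```

Whether to lowercase the text first or to compile with `re.IGNORECASE` is a design choice; the second keeps the address as it was written in the resume.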

In [41]:
import pandas as pd
SKILLS_DB = pd.read_csv("/content/skills_db.csv")
SKILLS_DB = list(SKILLS_DB["Text"])

def skills_extraction(text_input):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(text_input)
 
    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]
 
    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]
 
    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))
 
    # we create a set to keep the results in.
    skills_extracted = set()
 
    # we search for each token in our skills database
    for token in filtered_tokens:
        if token.lower() in SKILLS_DB:
            skills_extracted.add(token)
 
    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if ngram.lower() in SKILLS_DB:
            skills_extracted.add(ngram)
 
    return skills_extracted
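
The lookup above can be illustrated without nltk or the CSV file. The following is a minimal sketch under stated assumptions: `DEMO_SKILLS` is a hypothetical stand-in for the `Text` column of skills_db.csv, a plain `split()` replaces `nltk.word_tokenize`, stop words are not removed, and everything is lowercased so that "Python" and "python" count as the same skill.

```python
def ngrams(tokens, n_min, n_max):
    """Yield every contiguous run of n_min..n_max tokens, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])


# Hypothetical stand-in for the "Text" column of skills_db.csv
DEMO_SKILLS = {"python", "machine learning", "natural language processing"}


def demo_skills(text):
    """Return the demo skills found as single tokens, bigrams or trigrams."""
    # Keep alphabetic tokens only; a plain split stands in for nltk.word_tokenize
    tokens = [t for t in text.lower().split() if t.isalpha()]
    candidates = set(tokens) | set(ngrams(tokens, 2, 3))
    # A set makes each membership check constant time
    return candidates & DEMO_SKILLS


found = demo_skills(
    "Built machine learning pipelines in Python for natural language processing"
)
print(sorted(found))
```

Storing the skills in a set rather than a list keeps each membership check cheap, which matters once the database holds thousands of entries.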


In [42]:
RESERVED_WORDS = [
    'school',
    'college',
    'univers',
    'academy',
    'faculty',
    'institute',
    'faculdades',
    'schola',
    'schule',
    'lise',
    'lyceum',
    'lycee',
    'polytechnic',
    'kolej',
    'ünivers',
    'okul',
]


def org_entity_extration(text_input):
    list_organizations = []
 
    # first get all the organization names using nltk
    for sent in nltk.sent_tokenize(text_input):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
                list_organizations.append(' '.join(c[0] for c in chunk.leaves()))
 
    organizations = set()
    for org in list_organizations:
        for word in RESERVED_WORDS:
            if org.lower().find(word) >= 0:
                organizations.add(org)
 
    return organizations
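
The second half of the function is a substring filter on lowercased names. A small sketch of that step alone, using made-up organization names in place of the NER output and only a subset of the reserved words (`keep_education_orgs` is a hypothetical helper name):

```python
# Subset of the reserved stems, all lowercase because names are lowercased first
RESERVED_WORDS = ['school', 'college', 'univers', 'academy', 'institute']


def keep_education_orgs(org_names, reserved_words=RESERVED_WORDS):
    """Keep names that contain any reserved stem, compared in lowercase."""
    return {
        org for org in org_names
        if any(word in org.lower() for word in reserved_words)
    }


# Made-up names standing in for ORGANIZATION chunks returned by nltk
candidates = ["Springfield University", "Acme Corp", "Riverside Academy", "Globex Research"]
education = keep_education_orgs(candidates)
print(sorted(education))
```

Because the comparison lowercases the organization name, every entry in the reserved list has to be lowercase as well, otherwise it can never match.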

Step 5 : Process the Resume file and extract the entities from it 

---



In [43]:
# Resume Parsing
def resume_parse(file_path):
        
    # determine the file type 
    # unpacking the tuple
    file_name, file_extension = os.path.splitext(file_path)

    print("-"*90)
    print("File Meta Data")
    print(file_name)
    print(file_extension)
    print("-"*90)

    if file_extension == ".docx":
      resume_txt = docx_txt_extraction(file_path)
    elif file_extension == ".pdf":
      resume_txt = pdf_txt_extraction(file_path)
    elif file_extension == ".doc":
      resume_txt,err = doc_txt_extraction(file_path)
    else:
      print(file_extension + " format not supported")
      return set()
    
    
    print("-"*90)
    print("Raw Resume Text")
    #print(resume_txt)
    print("-"*90)

    # extract entities
    # Extract Name
    names = name_entity_extraction(resume_txt) 
    if names:
        print("-"*90)
        print("Name of Candidate - ")
        print(names[0:2])
        print("-"*90)
    else:
        print("-"*90)
        print("Candidate Name not found")
        
    # Extract Phone number    
    phone_number = phone_no_extraction(resume_txt)
    if phone_number:
      print("-"*90)
      print("Candidate Contact number - ")
      print(phone_number)
      print("-"*90)
    else:
      print("-"*90)
      print("Candidate's contact number not found")
      print("-"*90)
      
    # Extract Email    
    emails = emails_extract(resume_txt) 
    if emails:
      print("-"*90)
      print("Candidates Email Address")
      print(emails)
      print("-"*90)
    else:
      print("-"*90)
      print("Candidate's Email id not found")
      print("-"*90)
       
    # Extract Skills

    skills = skills_extraction(resume_txt)
    if skills:
      print("-"*90)
      print("Candidates Skills")
      print(skills)
      print("-"*90)
    else:
      print("-"*90)
      print("Candidate's Skills not found")
      print("-"*90) 

    # Extract Organizations
    org_information = org_entity_extration(resume_txt)
    if org_information:
      print("-"*90)
      print("Candidates Organizations")
      print(org_information)
      print("-"*90)
    else:
      print("-"*90)
      print("Candidate's Organizations not found")
      print("-"*90)
    
    return skills
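
The extension check in the parser compares against lowercase strings, so a file named `resume.PDF` would fall through to the unsupported branch. One possible way to handle this is a lookup table keyed by the lowercased extension. This is only a sketch: `pick_extractor` is a hypothetical helper, and the lambdas are placeholders for `pdf_txt_extraction`, `docx_txt_extraction` and `doc_txt_extraction`.

```python
import os


def pick_extractor(file_path, extractors):
    """Return the extractor registered for the file's extension, or None."""
    _, extension = os.path.splitext(file_path)
    return extractors.get(extension.lower())


# Placeholder extractors; in the notebook these would be the real functions
demo_extractors = {
    ".pdf": lambda path: "pdf text",
    ".docx": lambda path: "docx text",
    ".doc": lambda path: "doc text",
}

pdf_extractor = pick_extractor("/content/resume.PDF", demo_extractors)
unsupported = pick_extractor("/content/resume.odt", demo_extractors)

print(pdf_extractor("/content/resume.PDF"))
print(unsupported)
```

Note that `doc_txt_extraction` returns a `(text, error)` tuple while the other two return a string, so a real table would need a thin wrapper around it.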

In [44]:
# JD Parsing
def jd_parse(jd_file_path):
    # determine the file type 
    # unpacking the tuple
    jd_file_name, jd_file_extension = os.path.splitext(jd_file_path)

    print("-"*90)
    print("JD File Meta Data")
    print(jd_file_name)
    print(jd_file_extension)
    print("-"*90)

    if jd_file_extension == ".docx":
      jd_txt = docx_txt_extraction(jd_file_path)
    elif jd_file_extension == ".pdf":
      jd_txt = pdf_txt_extraction(jd_file_path)
    elif jd_file_extension == ".doc":
      jd_txt,err = doc_txt_extraction(jd_file_path)
    else:
      print(jd_file_extension + " format not supported")
      return set()
    
    
    print("-"*90)
    print("Raw JD Text")
    #print(jd_txt)
    print("-"*90)
    jd_skills = skills_extraction(jd_txt)
    if jd_skills:
      print("-"*90)
      print("JD Skills")
      print(jd_skills)
      print("-"*90)
    else:
      print("-"*90)
      print("JD Skills not found")
      print("-"*90) 
    return jd_skills 



In [45]:
def match_resume_and_jd(resume_skills, jd_skills):
    resume_skills = set(resume_skills)
    jd_skills = set(jd_skills)
    
    # Calculate the number of overlapping skills
    matching_skills = resume_skills.intersection(jd_skills)
    num_matching_skills = len(matching_skills)
    
    # Calculate the matching score (0.0 when the JD yields no skills)
    matching_score = num_matching_skills / len(jd_skills) if jd_skills else 0.0
    
    return matching_score
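
The score is the share of JD skills that also appear in the resume: the size of the intersection divided by the number of JD skills. A worked example with made-up skill sets follows. `skill_overlap_score` is a hypothetical variant that also lowercases both sides, which is an assumption added here because the extraction step keeps the original casing of each token.

```python
def skill_overlap_score(resume_skills, jd_skills):
    """Fraction of JD skills also found in the resume, compared case-insensitively."""
    resume = {s.lower() for s in resume_skills}
    jd = {s.lower() for s in jd_skills}
    if not jd:
        # No JD skills to match against
        return 0.0
    return len(resume & jd) / len(jd)


# Made-up skill sets: 2 of the 4 JD skills appear in the resume
score = skill_overlap_score(
    {"Python", "SQL", "Docker"},
    {"python", "sql", "aws", "spark"},
)
print("Matching Score: {:.2%}".format(score))  # prints "Matching Score: 50.00%"
```

Without the lowercasing step, "Python" in the resume and "python" in the JD would be counted as different skills and the score here would be 0.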

In [46]:
if __name__ == '__main__':
  # parse the resume
  resume_file_path = "/content/Brendan_Herger_Resume.pdf"
  #resume_file_path = "/content/NFI John Straumann.doc"
  #resume_file_path = "/content/Bhabesh Kumar Dey Resume.docx"
  resume_skills = resume_parse(resume_file_path)

  # Parse the JD
  #jd_file_path = "/content/Brendan_Herger_Resume.pdf"
  #jd_file_path = "/content/NFI John Straumann.doc"
  jd_file_path = "/content/JD1.docx"

  jd_skills = jd_parse(jd_file_path)
  #match_resume_and_jd(resume_skills,jd_skills)
  # Match resume and JD based on skills
  matching_score = match_resume_and_jd(resume_skills, jd_skills)

  print("Matching Score: {:.2%}".format(matching_score))

  

------------------------------------------------------------------------------------------
File Meta Data
/content/Brendan_Herger_Resume
.pdf
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Raw Resume Text
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Name of Candidate - 
['Brendan', 'Herger']
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Candidate's contact number not found
------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
Candidates Email Address
['13herg