<a href="https://colab.research.google.com/github/mahidedhia/ResumeParser-NLP/blob/master/ResumeParsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing Libraries

In [173]:
!pip install pdfminer.six

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [174]:
import spacy
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io
import numpy as np
import nltk

nlp = spacy.load("en_core_web_sm")

#download the NLTK stopword list required for preprocessing
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## PDF to Text Conversion

In [175]:
filename = "/content/QA-Manual-Tester.pdf"

In [176]:
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # creating a resource manager
            resource_manager = PDFResourceManager()
            
            # create a file handle
            fake_file_handle = io.StringIO()
            
            # creating a text converter object
            converter = TextConverter(
                                resource_manager, 
                                fake_file_handle, 
                                codec='utf-8', 
                                laparams=LAParams()
                        )

            # creating a page interpreter
            page_interpreter = PDFPageInterpreter(
                                resource_manager, 
                                converter
                            )

            # process current page
            page_interpreter.process_page(page)
            
            # extract text
            text = fake_file_handle.getvalue()
            yield text

            # close open handles
            converter.close()
            fake_file_handle.close()

In [177]:
resume_text = ''
# calling above function and extracting text
for page in extract_text_from_pdf(filename):
    resume_text += page + ' '

resume_text

'FIRST LAST\nBay Area, California • +1-234-456-789 • professionalemail@resumeworded.com • linkedin.com/in/username\n\nPROFESSIONAL EXPERIENCE\n\nResume Worded, New York, NY\nQA Manual Tester\n\nJun 2018 – Present\n\n● Enabled critical test case complexity metrics with support for Rapid adoption of  functional automation using\n\na scriptless test case adaptor by standardizing a Test Case construction method that was built\nAutomation-ready & supported a test automation framework leading to a 45% increase in reusability with\nreductions in TCO approaching 25%.\n\n● Optimized scripting, modularity, & maintenance which resulted in an 18% decrease in workflow friction.\n● Increased the company’s ability to take and complete projects without increasing manpower by 15% by\n\nreducing QA testing turnaround time by 30%.\n\nGrowthsi, New York, NY\nQA Manual Tester\n\nJan 2015 – May 2018\n\n● Restructured utilities & improved the process documentation leading to a 40% reduction in client support

In [178]:
#removing unwanted unicode characters
resume_text = resume_text.encode("ascii", "ignore")
resume_text = resume_text.decode()
resume_text

'FIRST LAST\nBay Area, California  +1-234-456-789  professionalemail@resumeworded.com  linkedin.com/in/username\n\nPROFESSIONAL EXPERIENCE\n\nResume Worded, New York, NY\nQA Manual Tester\n\nJun 2018  Present\n\n Enabled critical test case complexity metrics with support for Rapid adoption of  functional automation using\n\na scriptless test case adaptor by standardizing a Test Case construction method that was built\nAutomation-ready & supported a test automation framework leading to a 45% increase in reusability with\nreductions in TCO approaching 25%.\n\n Optimized scripting, modularity, & maintenance which resulted in an 18% decrease in workflow friction.\n Increased the companys ability to take and complete projects without increasing manpower by 15% by\n\nreducing QA testing turnaround time by 30%.\n\nGrowthsi, New York, NY\nQA Manual Tester\n\nJan 2015  May 2018\n\n Restructured utilities & improved the process documentation leading to a 40% reduction in client support\n\nticket
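Dropping every non-ASCII character also removes information: the en dash in date ranges disappears and `company’s` becomes `companys`, as the output above shows. A minimal sketch of a gentler cleanup, assuming a small hand-written replacement table (the `REPLACEMENTS` mapping and the `to_ascii` helper are hypothetical and not part of the notebook):

```python
import unicodedata

# Hypothetical table mapping common typographic characters to ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2022": "*",   # bullet
    "\u25cf": "*",   # black circle, used as a bullet in the sample resume
}

def to_ascii(text):
    """Replace known typographic characters, then drop whatever is still non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into a base letter plus a combining mark,
    # so the base letter survives the ASCII encoding step.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

raw = "Jun 2018 \u2013 Present \u25cf the company\u2019s caf\u00e9"
plain_drop = raw.encode("ascii", "ignore").decode("ascii")  # the notebook's approach
cleaned = to_ascii(raw)

print(repr(plain_drop))
print(repr(cleaned))
```

Whether the extra mapping is worth it depends on what the later extraction steps need; date ranges such as `Jun 2018 - Present` are easier to parse when the separator is kept.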

## Pre-Processing

In [179]:
#function for tokenizing
def tokenize(text):
  #using spacy
  doc = nlp(text)
  word_tokens = []
  for token in doc:
    word_tokens.append(str(token))
  return word_tokens

In [180]:
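#function for removing stop words and special characters; also converts tokens to lowercase
import string
#assumption: special_char is not defined anywhere in the notebook as shown, so ASCII punctuation is used here
special_char = set(string.punctuation)

def stop_words(word_tokens):
  #nltk stopwords are all lowercase, so lowercase each token before comparing
  lowered = [str(word).lower() for word in word_tokens]
  #also drop tokens that start with a whitespace/control character (code point <= 32)
  return [word for word in lowered if word and word not in stops and word not in special_char and ord(word[0]) > 32]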
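The filtering step can be illustrated without NLTK or spaCy. This is only a sketch under stated assumptions: `STOPS` is a tiny hypothetical stop list standing in for NLTK's English list, and `filter_tokens` is a made-up helper name. It shows why tokens should be lowercased before the stop-word test: NLTK's English stop words are all lowercase, so a capitalized `The` would otherwise slip through.

```python
import string

# Hypothetical miniature stop list; the notebook uses NLTK's English list instead.
STOPS = {"a", "the", "that", "was", "by", "in", "to", "with", "for", "of"}
PUNCT = set(string.punctuation)

def filter_tokens(tokens):
    """Lowercase tokens, then drop stop words, punctuation and whitespace tokens."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if not low or low.isspace():
            continue  # empty or whitespace-only token
        if low in STOPS or low in PUNCT:
            continue  # stop word or single punctuation character
        kept.append(low)
    return kept

sample = ["The", "framework", "was", "built", "Automation", "-", "ready", " ", "&", "reusable"]
filtered = filter_tokens(sample)
print(filtered)
```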

In [181]:
def preProcessess(text):
  #removing \n
  text = text.replace('\n', ' ')
  #word tokenizing
  word_tokens=tokenize(text)
  #removal of stopwords and special characters
  removed_stop_words = stop_words(word_tokens)
  return removed_stop_words

In [182]:
preProcessedTokens = preProcessess(resume_text)
preProcessedTokens

['first',
 'last',
 'bay',
 'area',
 'california',
 '+1',
 '234',
 '456',
 '789',
 'professionalemail@resumeworded.com',
 'linkedin.com/in/username',
 'professional',
 'experience',
 'resume',
 'worded',
 'new',
 'york',
 'ny',
 'qa',
 'manual',
 'tester',
 'jun',
 '2018',
 'present',
 'enabled',
 'critical',
 'test',
 'case',
 'complexity',
 'metrics',
 'support',
 'rapid',
 'adoption',
 'functional',
 'automation',
 'using',
 'scriptless',
 'test',
 'case',
 'adaptor',
 'standardizing',
 'test',
 'case',
 'construction',
 'method',
 'built',
 'automation',
 'ready',
 'supported',
 'test',
 'automation',
 'framework',
 'leading',
 '45',
 'increase',
 'reusability',
 'reductions',
 'tco',
 'approaching',
 '25',
 'optimized',
 'scripting',
 'modularity',
 'maintenance',
 'resulted',
 '18',
 'decrease',
 'workflow',
 'friction',
 'increased',
 'companys',
 'ability',
 'take',
 'complete',
 'projects',
 'without',
 'increasing',
 'manpower',
 '15',
 'reducing',
 'qa',
 'testing',
 'turnar

## Extraction

In [183]:
headers_experience = (
        'career profile',
        'employment history',
        'work history',
        'work experience',
        'experience',
        'professional experience',
        'professional background',
        'additional experience',
        'career related experience',
        'related experience',
        'programming experience',
        'freelance',
        'freelance experience',
        'army experience',
        'military experience',
        'military background',
)
headers_education = (
        'academic background',
        'academic experience',
        'programs',
        'courses',
        'related courses',
        'education',
        'qualifications',
        'educational background',
        'educational qualifications',
        'educational training',
        'education and training',
        'training',
        'academic training',
        'professional training',
        'course project experience',
        'related course projects',
        'internship experience',
        'internships',
        'apprenticeships',
        'college activities',
        'certifications',
        'special training',
    )
headers_skills = (
        'credentials',
        'areas of experience',
        'areas of expertise',
        'areas of knowledge',
        'skills',
        "other skills",
        "other abilities",
        'career related skills',
        'professional skills',
        'specialized skills',
        'technical skills',
        'computer skills',
        'personal skills',
        'computer knowledge',        
        'technologies',
        'technical experience',
        'proficiencies',
        'languages',
        'language competencies and skills',
        'programming languages',
        'competencies'
    )
headers_projects = (
    'projects',
    'personal projects',
    'academic projects',
)
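Later in the notebook these tuples are matched against the start of each line. A small sketch of that lookup, assuming shortened header tuples and a hypothetical `classify_header` helper (neither is the notebook's own code); it relies on `str.startswith` accepting a tuple of prefixes:

```python
# Hypothetical, shortened header tuples; the notebook's lists are longer.
SECTION_HEADERS = {
    "experience": ("professional experience", "work experience", "experience"),
    "education": ("education", "certifications"),
    "skills": ("technical skills", "skills"),
}

def classify_header(line):
    """Return the section name if the line starts with a known header, else None."""
    lowered = line.strip().lower()
    for section, headers in SECTION_HEADERS.items():
        if lowered.startswith(headers):  # a tuple of prefixes is allowed here
            return section
    return None

print(classify_header("PROFESSIONAL EXPERIENCE"))
print(classify_header("Technical Skills:"))
print(classify_header("QA Manual Tester"))
```

Prefix matching alone is loose: a body line such as "Experience with manual testing helped the team" also starts with a header word, which is why the notebook combines it with a sentence check.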

In [184]:
# Heuristic: a complete sentence contains at least a subject, a predicate and an object.
# Subject and object are almost always nouns (or pronouns) and the predicate is a verb,
# so a line is treated as a sentence if it has at least two noun-like tokens and one verb.
def checkSentence(text):
  text=nlp(text)
  has_noun = 2
  has_verb = 1
  for token in text:
    if token.pos_ in ["NOUN", "PROPN", "PRON"]:
        has_noun -= 1
    elif token.pos_ == "VERB":
        has_verb -= 1
  if has_noun < 1 and has_verb < 1:
    return True
  return False

In [185]:
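class ResumeParser:
  def __init__(self, text):
    self.resume_text = text
    self.lines = text.split("\n")
    #per-instance state; defined at class level these would be shared by every ResumeParser
    self.resume_sections = {
        'name': [],
        'experience': [],
        'education': [],
        'skills': [],
        'projects': [],
    }
    #maps line number of a header -> section name
    self.header_indices = {}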

  #Obtain line nos of Headers 
  def getHeaderIndices(self):
    for i, line in enumerate(self.lines):
      if len(line):
        #Header always starts in Uppercase
        if line[0].isupper():
          #The line is a header if it matches the required section headers and isn't a complete sentence
          if [item for item in headers_experience if line.lower().startswith(item)] and not checkSentence(line):
            self.header_indices[i]="experience"
          elif [item for item in headers_education if line.lower().startswith(item)] and not checkSentence(line):
            self.header_indices[i]="education"
          elif [item for item in headers_skills if line.lower().startswith(item)] and not checkSentence(line):
            self.header_indices[i]="skills"
          elif [item for item in headers_projects if line.lower().startswith(item)] and not checkSentence(line):
            self.header_indices[i]="projects"

  #Obtain raw text for different resume sections using header indices/linenos
  def getRawResumeSections(self):
    no_lines = len(self.lines)
    header_linenos = list(self.header_indices.keys())
    list_last_index = len(header_linenos)-1
    for counter, lineno in enumerate(header_linenos):
      start_index = lineno+1
      if (counter<list_last_index):
        end_index = header_linenos[counter+1]
      else:
        end_index = no_lines
      self.resume_sections[self.header_indices[lineno]]=self.lines[start_index:end_index]

  def getExperience(self):
    print("getExperience()")

  def getEducation(self):
    print("getEducation()")

  def getSkills(self):
    print("getSkills()")

  #method to parse the entire resume - calls all the above methods of the class
  def getResumeData(self):
    self.getHeaderIndices()
    if len(self.header_indices)!=0:
      self.getRawResumeSections()
      self.getExperience()
      self.getEducation()
      self.getSkills()
    return self.resume_sections
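The slicing done in `getRawResumeSections` can be tried on toy data. The sketch below is a standalone illustration, not the class method itself: `split_sections` is a hypothetical helper and the sample lines are made up. One deliberate difference is labelled in the code: lines under a repeated header are appended rather than overwriting the earlier block.

```python
def split_sections(lines, header_indices):
    """Map each section name to the lines between its header and the next header."""
    sections = {}
    ordered = sorted(header_indices)  # header line numbers in document order
    for pos, lineno in enumerate(ordered):
        # A section ends where the next header starts, or at the end of the text.
        end = ordered[pos + 1] if pos + 1 < len(ordered) else len(lines)
        # Assumption: repeated headers accumulate instead of replacing each other.
        sections.setdefault(header_indices[lineno], []).extend(lines[lineno + 1:end])
    return sections

# Made-up sample; line numbers 1, 4 and 6 are the headers.
lines = [
    "FIRST LAST",
    "PROFESSIONAL EXPERIENCE",
    "QA Manual Tester",
    "Jun 2018 Present",
    "EDUCATION",
    "Some University",
    "SKILLS",
    "Manual testing",
]
header_indices = {1: "experience", 4: "education", 6: "skills"}
sections = split_sections(lines, header_indices)
print(sections)
```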

In [186]:
obj = ResumeParser(resume_text)
obj.getResumeData()
print("------------EXPERIENCE---------------------")
for line in obj.resume_sections["experience"]:
  print(line)
print("------------EDUCATION---------------------")
for line in obj.resume_sections["education"]:
  print(line)
print("------------SKILLS---------------------")
for line in obj.resume_sections["skills"]:
  print(line)

getExperience()
getEducation()
getSkills()
------------EXPERIENCE---------------------

Resume Worded, New York, NY
QA Manual Tester

Jun 2018  Present

 Enabled critical test case complexity metrics with support for Rapid adoption of  functional automation using

a scriptless test case adaptor by standardizing a Test Case construction method that was built
Automation-ready & supported a test automation framework leading to a 45% increase in reusability with
reductions in TCO approaching 25%.

 Optimized scripting, modularity, & maintenance which resulted in an 18% decrease in workflow friction.
 Increased the companys ability to take and complete projects without increasing manpower by 15% by

reducing QA testing turnaround time by 30%.

Growthsi, New York, NY
QA Manual Tester

Jan 2015  May 2018

 Restructured utilities & improved the process documentation leading to a 40% reduction in client support

tickets & an 80% increase in uptime.

 Achieved department-wide improvement metrics