###Downloading CV data from kaggle:

In [None]:
from google.colab import files
!pip install kaggle
api_token = files.upload()

In [2]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
!kaggle datasets download -d snehaanbhawal/resume-dataset

In [4]:
import zipfile
with zipfile.ZipFile("/content/resume-dataset.zip","r") as zip_ref:
    zip_ref.extractall("resume_dataset")

###Loading Job Description dataset from huggingface:

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files='job_desc.csv')

###Reading pdf file:

In [None]:
!pip install PyPDF2

In [12]:
from PyPDF2 import PdfReader

def pdf_to_txt(pdf_file_path):
  '''reads multiple pdf pages and returns merged text output'''
  reader = PdfReader(pdf_file_path)
  number_of_pages = len(reader.pages)
  text=''
  for p in range(number_of_pages):
    page = reader.pages[p]
    t = page.extract_text()
    text = text + ' ' + t
  return text

**Below I have used two methods to get CV text before embedding:**

**Method 1:**
- This is a relatively simple method to get top cv.
- Instead of extracting entities like skills, experience etc. from cv, we are just doing basic cleaning of cv text.
- After cleaning we can be sure that most of the time the text will contain skills/experience/degree data along with certain other descriptions the candidate choose to write in cv, like achievements in previous jobs/ project descriptions etc. which may get ignored by any parser.
- So reliance upon any entity extraction method is removed here(as in method 2)
- Here computation cost for creating cv embeddings will increase as we are keeping most of the text from cv.
- We can use pretrained spacy NER model to find entities with 'Person' tag and remove them but after checking it was found that the model was giving 'person' tag to many useful words so the idea was dropped. Refer Miscellaneous section for code and result.

**Method 2:**
- Here we extract skills, experience, degree from cv.
- Extraction can be performed in following ways:
  - Using rules based system like string matching/regex. This method will not work for every cv format. It will not be able to find skills mentioned under various other descriptions like projects, experience etc.
  - Using tools like Spacy NER model with rule based entity recognition. We will need a file having list of various skills for entity recognition.
  - Training custom Spacy NER model which will need annotated data.
- Here we will be using a open source library 'pyresparser' which is trained on about 200 resumes giving custom NER output like degree,experience,skills etc. (https://pypi.org/project/pyresparser/)


###Embeddings:

- Here we are using 'sentence transformer' to encode the text.
- We will be using pre-trained 'all-MiniLM-L6-v2' model which gives 384 dim embeddings.
- Sentence transformer uses BERT with siamese and triplet network to get better semantically meaningful sentence embeddings that can be compared using cosine-similarity. (https://arxiv.org/abs/1908.10084)

In [None]:
!pip install -U sentence-transformers

In [16]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

Note:
- Here we will be using only information technology/ software related cv and job description.
- For every software related job description we will be finding top 5 IT domain CV.

###Method 1:

###Cleaning raw CV text:

In [None]:
import re

def text_cleaning(text):
  t = text.replace('\n',' ')
  t=re.sub('\S+(?<=@)\S+', ' ', t) #remove email
  t=re.sub('[0-9]+', ' ', t) #remove numbers
  t = re.sub('\S+:', ' ', t) #remove words with ':' eg: Skills:, Blog:, Introduction: etc.
  t = re.sub('\\b[^A-Za-z_ ]+\\b', ' ', t) #not a word
  t = re.sub('\W+',' ',t) #non-word character
  t = re.sub('^\s+', '',t) #removing spaces at start of sentence
  t = re.sub('\s+', ' ', t) #removing extra spaces
  t = re.sub('\s+$', '', t) #removing spaces at end of sentence
  t = t.lower()
  return t

In [28]:
# getting index of software related job description

indx_software=[]
for i,pos in enumerate(dataset['train']['position_title']):
  if 'Software' in pos:
    indx_software.append(i)

In [14]:
import math

def get_embeddings(text,model):
  '''function that returns embeddings of text based on model used.
     if text token length is more than acceptable token length of 512, then text is embedded in chunks and mean of all embeddings is taken'''
  text_word_list = text.split()
  embeddings_list=[]
  for i in range(math.ceil(len(text_word_list)/500)): #making chunks of 500 words.
    text_part = ' '.join(text_word_list[i*500:i*500+500])
    embeddings_list.append(model.encode(text_part))
  return np.mean(np.stack(embeddings_list),axis=0)

Job Description Embeddings

In [29]:
import numpy as np

embeddings_job_software=[]
for job in np.array(dataset['train']['job_description'])[indx_software]:
  text = text_cleaning(job)
  embeddings_job_software.append( get_embeddings(text,model) )

CV Embeddings

In [41]:
import os

embeddings_cv_software=[]
cv_dir = '/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY'
for cv in os.listdir(cv_dir):
  text = pdf_to_txt(cv_dir+'/'+cv)
  embeddings_cv_software.append( get_embeddings(text,model) )


Top 5 CV for each software related job description:

- job description no. is the index no. as in hugggingface dataset.
- cv id is id inside information technology folder from kaggle dataset.

In [50]:
from sklearn.metrics.pairwise import cosine_similarity

for job in enumerate(embeddings_job_software):
  sim_list=[]
  for cv in embeddings_cv_software:
    sim = cosine_similarity(job[1].reshape(1,-1),cv.reshape(1,-1))
    sim_list.append(sim[0][0])
  top_5_cv = np.argsort(sim_list)[::-1][:5]
  print('job description no.',indx_software[job[0]],',top 5 CV id: ',np.array(os.listdir(cv_dir))[top_5_cv])


job description no. 29 ,top 5 CV id:  ['52618188.pdf' '52246737.pdf' '18067556.pdf' '13405733.pdf'
 '26768723.pdf']
job description no. 30 ,top 5 CV id:  ['28897981.pdf' '15297298.pdf' '10641230.pdf' '27770859.pdf'
 '36434348.pdf']
job description no. 31 ,top 5 CV id:  ['24020470.pdf' '25207620.pdf' '11580408.pdf' '28126340.pdf'
 '51363762.pdf']
job description no. 32 ,top 5 CV id:  ['17111768.pdf' '36434348.pdf' '10641230.pdf' '20674668.pdf'
 '21780877.pdf']
job description no. 33 ,top 5 CV id:  ['28126340.pdf' '11957080.pdf' '27372171.pdf' '17111768.pdf'
 '15297298.pdf']
job description no. 34 ,top 5 CV id:  ['52246737.pdf' '20674668.pdf' '26480367.pdf' '20237244.pdf'
 '51363762.pdf']
job description no. 35 ,top 5 CV id:  ['37242217.pdf' '15802627.pdf' '20674668.pdf' '20001721.pdf'
 '22776912.pdf']
job description no. 36 ,top 5 CV id:  ['12334140.pdf' '10641230.pdf' '11957080.pdf' '36434348.pdf'
 '20001721.pdf']
job description no. 66 ,top 5 CV id:  ['20674668.pdf' '30223363.pdf' '46

Getting job description and top cv text:

In [38]:
text_cleaning(dataset['train']['job_description'][29])

'the role remote as a software engineer youll build features into the dispatch platform that will lead us toward our goal of redefining sameday delivery youll dig deep into many parts of the system and will work across the full stack to create new ideas and improve existing functionality we believe that product development is more fun when deploying early and often splitting up work into bitesized chunks and getting feedback from real users as quickly as possible we also believe that a single engineer should be empowered to develop a feature from start to finish and that the technology stack should be simple enough to make that realistic this is a fulltime exempt computer employee role that reports to the manager software engineering what youll do executes all job duties in alignment with dispatchs core values mission and purpose acts ethically with integrity and complies legal standards to deliver an environment that promotes respect innovation and creativity encourages and fosters an

In [36]:
text_cleaning(pdf_to_txt('/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY/52618188.pdf'))

'information technology help desk specialist highlights microsoft windows operating systems me xp and windows along with expert knowledge in several other applications such as microsoft active directory microsoft works microsoft office and microsoft outlook sap crm erp oracle jd edwards remedy great plains peoplesoft sharepoint avaya blue pumpkin verint novell vdi platforms and cognos business process improvement cost benefit analysis forecasting and planning advanced excel modeling business systems analysis sap business requirements matrixes project management superb communication skills advanced problem solving abilities critical thinking decisive experience information technology help desk specialist august to current company name city state diagnose and resolve technical hardware and software issues for incoming phone calls and emails while ensuring detailed documentation on all activity and communication with customers regarding their issue display the ability to understand and co

###Method 2:

In [None]:
!pip install pyresparser

In [None]:
!python -m nltk.downloader words

In [None]:
!pip install spacy==2.3.5
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz

- Using pyreparser library to extract skills,experience,degree.
- The library uses custom trained Spacy NER model.
- For improved accuracy more annotated data will be needed.

data extraction example:

In [6]:
from pyresparser import ResumeParser

data = ResumeParser('/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY/10265057.pdf').get_extracted_data()
data



{'name': 'SYSTEMS ENGINEER',
 'email': None,
 'mobile_number': None,
 'skills': ['Statistics',
  'Communication',
  'Engineering',
  'Excel',
  'Audit',
  'Radar',
  'Design',
  'Mining',
  'Pivot',
  'Reports',
  'Architecture',
  'Assembly',
  'Research',
  'Sas',
  'Matlab',
  'Pivot tables',
  'Documentation',
  'Hardware',
  'Testing',
  'C',
  'Python',
  'Root cause',
  'System',
  'Database',
  'Troubleshooting',
  'Data collection',
  'Process',
  'Big data',
  'Specifications',
  'C++',
  'Requests',
  'Acquisition',
  'Microsoft office',
  'Operations',
  'Java',
  'Sql',
  'Programming',
  'Technical',
  'Electrical'],
 'college_name': None,
 'degree': None,
 'designation': ['Electrical/Validation Engineer'],
 'experience': ['Working RF Systems Engineer',
  'May 2014 to Current Company Name',
  'Qualification Â· Multidisciplinary background: RF hardware designs, manufacturing operations and data analyst.'],
 'company_names': None,
 'no_of_pages': 1,
 'total_experience': 0.0

Extracting skills, degree, experience data from cv and using the joined data as input to embedding generation.

In [None]:
import os
import numpy as np

embeddings_cv_soft=[]
cv_dir = '/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY'
for cv in os.listdir(cv_dir):
  data = ResumeParser(cv_dir+'/'+cv).get_extracted_data()
  try:
    try:
      text = ' '.join(data['skills']) + ' '.join(data['degree']) + ' '.join(data['experience'])
    except:
      text = ' '.join(data['skills']) + ' '.join(data['experience'])
  except:
    text = ' '.join(data['skills'])

  embeddings_cv_soft.append( get_embeddings(text,model) )

Top 5 CV for each software related job description:

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

for job in enumerate(embeddings_job_software):
  sim_list=[]
  for cv in embeddings_cv_soft:
    sim = cosine_similarity(job[1].reshape(1,-1),cv.reshape(1,-1))
    sim_list.append(sim[0][0])
  top_5_cv = np.argsort(sim_list)[::-1][:5]
  print('job description no.',indx_software[job[0]],',top 5 CV id: ',np.array(os.listdir(cv_dir))[top_5_cv])


job description no. 29 ,top 5 CV id:  ['17111768.pdf' '18067556.pdf' '52618188.pdf' '79541391.pdf'
 '30223363.pdf']
job description no. 30 ,top 5 CV id:  ['15297298.pdf' '14789139.pdf' '37242217.pdf' '31111279.pdf'
 '20001721.pdf']
job description no. 31 ,top 5 CV id:  ['24230851.pdf' '17111768.pdf' '91697974.pdf' '21283365.pdf'
 '90867631.pdf']
job description no. 32 ,top 5 CV id:  ['17111768.pdf' '51363762.pdf' '12045067.pdf' '11957080.pdf'
 '10265057.pdf']
job description no. 33 ,top 5 CV id:  ['17111768.pdf' '11957080.pdf' '18067556.pdf' '90867631.pdf'
 '27372171.pdf']
job description no. 34 ,top 5 CV id:  ['12763627.pdf' '12045067.pdf' '51363762.pdf' '30223363.pdf'
 '10641230.pdf']
job description no. 35 ,top 5 CV id:  ['10641230.pdf' '12763627.pdf' '23864648.pdf' '51363762.pdf'
 '15802627.pdf']
job description no. 36 ,top 5 CV id:  ['12334140.pdf' '28126340.pdf' '17111768.pdf' '10641230.pdf'
 '31111279.pdf']
job description no. 66 ,top 5 CV id:  ['27372171.pdf' '11957080.pdf' '79

In [39]:
text_cleaning(dataset['train']['job_description'][29])

'the role remote as a software engineer youll build features into the dispatch platform that will lead us toward our goal of redefining sameday delivery youll dig deep into many parts of the system and will work across the full stack to create new ideas and improve existing functionality we believe that product development is more fun when deploying early and often splitting up work into bitesized chunks and getting feedback from real users as quickly as possible we also believe that a single engineer should be empowered to develop a feature from start to finish and that the technology stack should be simple enough to make that realistic this is a fulltime exempt computer employee role that reports to the manager software engineering what youll do executes all job duties in alignment with dispatchs core values mission and purpose acts ethically with integrity and complies legal standards to deliver an environment that promotes respect innovation and creativity encourages and fosters an

In [40]:
text_cleaning(pdf_to_txt('/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY/17111768.pdf'))

'information technology project manager system analysis sysanalsys gs professional overview highly qualified department of defense dod program manager pm professional driven to maximize mission partner mp operational efficiency through planning project management and infrastructure technology it expertise excels at building dynamic team relationships and achieves project management process improvements looking to continue federal career as a strategic planner possessing exceptional knowledge understanding support agreements basis of estimates fiscal analysis financial reporting cost projections business proposals and increased overall responsibilities within federal service relevant professional experience january to current company name city state information technology project manager system analysis sysanalsys gs holds active security clearance member of the development and business center for defense logistics agency dla defense finance and accounting service dfas program managemen

###Miscellaneous:

In [41]:
import nltk
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

In [44]:
text = pdf_to_txt('/content/resume_dataset/data/data/INFORMATION-TECHNOLOGY/17111768.pdf')
tokenized_doc  = word_tokenize(text)
tagged_sentences = nltk.pos_tag(tokenized_doc)
NE= nltk.ne_chunk(tagged_sentences )
named_entities = []
for tagged_tree in NE:
    if hasattr(tagged_tree, 'label'):
      entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
      entity_type = tagged_tree.label()
      named_entities.append((entity_name, entity_type))
named_entities

[('INFORMATION', 'ORGANIZATION'),
 ('TECHNOLOGY', 'ORGANIZATION'),
 ('SYSANALSYS', 'ORGANIZATION'),
 ('Defense', 'GPE'),
 ('DoD', 'ORGANIZATION'),
 ('Mission Partner', 'PERSON'),
 ('Infrastructure Technology', 'PERSON'),
 ('Current Company Name City', 'PERSON'),
 ('State Information Technology Project', 'ORGANIZATION'),
 ('System Analysis', 'PERSON'),
 ('SYSANALSYS', 'ORGANIZATION'),
 ('Development', 'ORGANIZATION'),
 ('Business Center', 'ORGANIZATION'),
 ('Defense', 'ORGANIZATION'),
 ('DLA', 'ORGANIZATION'),
 ('Defense Finance', 'PERSON'),
 ('Accounting Service', 'ORGANIZATION'),
 ('DFAS', 'ORGANIZATION'),
 ('PMO', 'ORGANIZATION'),
 ('Mission Partner Engagement', 'ORGANIZATION'),
 ('MPEO', 'ORGANIZATION'),
 ('BDM11', 'ORGANIZATION'),
 ('DISA', 'ORGANIZATION'),
 ('Project Management', 'PERSON'),
 ('DISA', 'ORGANIZATION'),
 ('Business Flow', 'ORGANIZATION'),
 ('DISA Program Manager', 'ORGANIZATION'),
 ('DLA', 'ORGANIZATION'),
 ('Enterprise Business Systems', 'ORGANIZATION'),
 ('EBS', 'O