# CV Parsing and Summarization using spaCy

In [1]:
import spacy
import pickle
import random
import pandas as pd

spaCy is widely regarded as one of the fastest NLP libraries for Python. It provides a single, optimized implementation of each NLP task it supports.

In [2]:
# Open the file in a context manager so the handle is closed after loading
with open('C:\\Users\\Meghna\\Desktop\\CV Ranking\\train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

In [3]:
train_data[0]

('Govardhana K Senior Software Engineer  Bengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/ b2de315d95905b68  Total IT experience 5 Years 6 Months Cloud Lending Solutions INC 4 Month • Salesforce Developer Oracle 5 Years 2 Month • Core Java Developer Languages Core Java, Go Lang Oracle PL-SQL programming, Sales Force Developer with APEX.  Designations & Promotions  Willing to relocate: Anywhere  WORK EXPERIENCE  Senior Software Engineer  Cloud Lending Solutions -  Bangalore, Karnataka -  January 2018 to Present  Present  Senior Consultant  Oracle -  Bangalore, Karnataka -  November 2016 to December 2017  Staff Consultant  Oracle -  Bangalore, Karnataka -  January 2014 to October 2016  Associate Consultant  Oracle -  Bangalore, Karnataka -  November 2012 to December 2013  EDUCATION  B.E in Computer Science Engineering  Adithya Institute of Technology -  Tamil Nadu  September 2008 to June 2012  https://www.indeed.com/r/Govardhana-K/b2de315d95905b68?isid=rex-

Each element of the training data is a tuple with two parts.

The first part is the complete text of the CV, stored as a string.
The second part is a dictionary. Its 'entities' key holds a list of tuples, one per annotated entity, and the last element of each tuple is the entity's label.
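
A minimal illustration of this structure, using an invented one-line CV rather than a record from the dataset. The (start, end, label) layout of each tuple is an assumption based on spaCy's usual character-offset annotation format; the text above only states that the label comes last. The label names are likewise placeholders modelled on the labels printed later in this notebook.

```python
# Hypothetical example: the text, offsets, and labels are invented for illustration.
example_text = "Jane Doe Software Engineer at Acme Corp"

example = (
    example_text,
    {
        "entities": [
            (0, 8, "Name"),                   # "Jane Doe"
            (9, 26, "Designation"),           # "Software Engineer"
            (30, 39, "Companies worked at"),  # "Acme Corp"
        ]
    },
)

# Unpack the tuple the same way the training code does.
text, annotation = example
for start, end, label in annotation["entities"]:
    print(f"{label:<20} {text[start:end]}")
```

Slicing the text with each pair of offsets recovers the annotated phrase, which is a quick way to check that an annotation lines up with the text.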

In [4]:
type(train_data[0][0])

str

In [5]:
type(train_data[0][1])

dict

To parse a CV, we first convert the PDF to plain text.

Next, we determine all of the required entities in the CV (such as College Name, Degree, and Skills) and the positions at which each entity begins and ends.
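
As a rough sketch of what "the position where an entity begins and ends" means, the snippet below computes character offsets for known phrases with str.find. The helper name find_span is hypothetical, and this is not how the dataset was annotated (the source does not say); it only shows how a phrase maps to a (start, end, label) tuple. The sample string is taken from the CV text shown earlier.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase, or None."""
    start = text.find(phrase)
    if start == -1:
        return None  # phrase does not occur in the text
    return (start, start + len(phrase), label)


# Fragment of the education section from the first training sample.
cv_text = "B.E in Computer Science Engineering  Adithya Institute of Technology"

spans = [
    find_span(cv_text, "B.E in Computer Science Engineering", "Degree"),
    find_span(cv_text, "Adithya Institute of Technology", "College Name"),
]

for start, end, label in spans:
    print(label, "->", cv_text[start:end])
```

The end offset is exclusive, so cv_text[start:end] returns exactly the phrase.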

In [6]:
type(train_data)

list

In [7]:
len(train_data)

200

Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories.

In [8]:
nlp = spacy.blank('en')

def train_model(train_data):
    # Add an NER component to the end of the pipeline if it is not already
    # there; otherwise reuse the existing one (spaCy v2 API).
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    for _, annotation in train_data:
        # annotation is the second element of each training tuple
        for ent in annotation['entities']:
            # Register each entity label (index 2 of the tuple) with the NER component
            ner.add_label(ent[2])

    # Disable every other pipeline component so that only NER is updated
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()

        for itn in range(10):                 # Train for 10 iterations
            print('Starting iteration ', itn)
            random.shuffle(train_data)        # Shuffle in place so each pass sees a new order
            losses = {}
            for text, annotations in train_data:
                try:
                    nlp.update([text],
                               [annotations],
                               drop=0.2,       # dropout rate
                               sgd=optimizer,
                               losses=losses)
                except Exception:
                    # Skip examples that spaCy cannot process
                    pass

            print(losses)

In [9]:
train_model(train_data)

Starting iteration  0
{'ner': 14199.331548548138}
Starting iteration  1
{'ner': 8488.050956346422}
Starting iteration  2
{'ner': 8431.342042162381}
Starting iteration  3
{'ner': 8252.535296776401}
Starting iteration  4
{'ner': 6858.541863939513}
Starting iteration  5
{'ner': 7132.293477769844}
Starting iteration  6
{'ner': 5196.588650850168}
Starting iteration  7
{'ner': 5287.0299678735755}
Starting iteration  8
{'ner': 7309.475048217748}
Starting iteration  9
{'ner': 4833.048514014466}
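
The training loop above skips any example that raises an exception, so problematic annotations go unnoticed. One way to surface them beforehand is to check each record for overlapping entity spans. That overlaps are the cause of the skipped examples is an assumption; the source does not say which examples fail or why. A minimal sketch with a hypothetical helper find_overlaps, assuming (start, end, label) tuples:

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) spans that overlap each other."""
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so later spans cannot overlap `first`
            overlaps.append((first, second))
    return overlaps


# Invented spans: the first two overlap on characters 5 to 7.
sample = [(0, 8, "Name"), (5, 12, "Designation"), (20, 30, "Skills")]
problems = find_overlaps(sample)
print(problems)
```

Running this over annotation['entities'] for every record would list the records worth inspecting before training.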


In [10]:
#We will now save this trained NLP model to our local disk
nlp.to_disk('C:\\Users\\Meghna\\Desktop\\CV Ranking\\nlp_model')

In [11]:
#Load trained model into spacy 
nlp_model = spacy.load('C:\\Users\\Meghna\\Desktop\\CV Ranking\\nlp_model')

In [12]:
#Since we shuffled our training dataset, the first element is different now
train_data[0]

('Shreya Agnihotri Senior System Engineer at Infosys Limited - Infosys  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Shreya-Agnihotri/ c1755567027a0205  • Having 2.7 years of experience in Web Application design using python Django framework. • Highly experienced and skilled Agile Developer with a strong record of excellent teamwork and successful coding project management. • Good knowledge in python, Elasticsearch, Django using HTML5, MYSQL, JavaScript, jQuery. • Performed the role of team member effectively. Involved in requirement gathering and analysis of the requirements in technical perspective. • Extensively worked on software in all the phases including Design, Development, Implementation, Integration and Testing. • Possesses good analytical, logical ability and systematic approach to problem analysis, strong debugging and troubleshooting skills. • Working on classic software development models along Agile Methodologies.  WORK EXPERIENCE  Senior System Engineer at In

It's not a good idea to evaluate a model on its training data, but as a quick sanity check let's apply the model to a sample from the training set.

We'll pass only the text to the model and check whether it returns the corresponding entities.

In [13]:
doc = nlp_model(train_data[0][0])
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}-{ent.text}')

NAME                          -Shreya Agnihotri
DESIGNATION                   -Senior System Engineer
COMPANIES WORKED AT           -Infosys Limited
LOCATION                      -Bengaluru
EMAIL ADDRESS                 -indeed.com/r/Shreya-Agnihotri/ c1755567027a0205
DESIGNATION                   -Senior System Engineer
DEGREE                        -B.Tech in ECE
COLLEGE NAME                  -Galgotias University
SKILLS                        -Ajax (Less than 1 year), APACHE KAFKA (Less than 1 year), HTML5 (2 years), Java (2 years), SQL (2 years)
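
To avoid evaluating on training data, one option is to hold out part of the dataset before training. This notebook does not do so; the sketch below shows one possible approach. The helper name split_data, the 20% test fraction, and the fixed seed are all illustrative assumptions, and placeholder records stand in for the real 200 tuples.

```python
import random


def split_data(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of data and split it into (train, test) lists."""
    shuffled = list(data)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 200 (text, annotation) tuples.
records = [(f"cv text {i}", {"entities": []}) for i in range(200)]
train_split, test_split = split_data(records)
print(len(train_split), len(test_split))
```

The model would then be trained on train_split only, and its predictions compared against the annotations in test_split.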


Now, let's test the model on an arbitrary CV PDF file.

First we need to extract the text from the PDF, then pass it through the model to classify spans of the text into the different entities.

In [14]:
#!pip install PyMuPDF

In [15]:
import fitz  # PyMuPDF
fname = 'C:\\Users\\Meghna\\Desktop\\CV Ranking\\Alice Clark CV.pdf'
doc = fitz.open(fname)
text = ""
for page in doc:
    # get_text() returns the page's plain text as a string
    text += page.get_text()
    
text

'Alice Clark \nAI / Machine Learning \n \nDelhi, India Email me on Indeed \n• \n20+ years of experience in data handling, design, and development \n• \nData Warehouse: Data analysis, star/snow flake scema data modelling and design specific to \ndata warehousing and business intelligence \n• \nDatabase: Experience in database designing, scalability, back-up and recovery, writing and \noptimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes. \nCloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure, \nStream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake \nanalytics(U-SQL) \nWilling to relocate anywhere \n \nWORK EXPERIENCE \nSoftware Engineer \nMicrosoft – Bangalore, Karnataka \nJanuary 2000 to Present \n1. Microsoft Rewards Live dashboards: \nDescription: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping \nonline. Microsoft Rewards members can earn points when 

In [16]:
print(text)

Alice Clark 
AI / Machine Learning 
 
Delhi, India Email me on Indeed 
• 
20+ years of experience in data handling, design, and development 
• 
Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to 
data warehousing and business intelligence 
• 
Database: Experience in database designing, scalability, back-up and recovery, writing and 
optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes. 
Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure, 
Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake 
analytics(U-SQL) 
Willing to relocate anywhere 
 
WORK EXPERIENCE 
Software Engineer 
Microsoft – Bangalore, Karnataka 
January 2000 to Present 
1. Microsoft Rewards Live dashboards: 
Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping 
online. Microsoft Rewards members can earn points when searching with Bing, bro

As we can see, the extracted text contains many newline characters. We replace each newline with a space so that the CV becomes a single line of text, like the training samples. Other characters, such as the bullet symbols, are left in place:

In [17]:
clean_text = " ".join(text.split('\n'))
print(clean_text)

Alice Clark  AI / Machine Learning    Delhi, India Email me on Indeed  •  20+ years of experience in data handling, design, and development  •  Data Warehouse: Data analysis, star/snow flake scema data modelling and design specific to  data warehousing and business intelligence  •  Database: Experience in database designing, scalability, back-up and recovery, writing and  optimizing SQL code and Stored Procedures, creating functions, views, triggers and indexes.  Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL Azure,  Stream Analytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure data lake  analytics(U-SQL)  Willing to relocate anywhere    WORK EXPERIENCE  Software Engineer  Microsoft – Bangalore, Karnataka  January 2000 to Present  1. Microsoft Rewards Live dashboards:  Description: - Microsoft rewards is loyalty program that rewards Users for browsing and shopping  online. Microsoft Rewards members can earn points when searching with Bing, bro
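
Replacing newlines with spaces leaves runs of consecutive spaces, as the output above shows. If a single space between words is preferred, one alternative is to collapse all whitespace with a regular expression. This is a sketch, not the cleaning step this notebook uses, and normalize_whitespace is a hypothetical helper name. The training samples shown earlier also contain double spaces, so whether this variant improves predictions is untested here.

```python
import re


def normalize_whitespace(raw_text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()


# First lines of the extracted CV text shown above.
sample = "Alice Clark \nAI / Machine Learning \n \nDelhi, India Email me on Indeed \n"
cleaned = normalize_whitespace(sample)
print(cleaned)
```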

Now, we'll test our model on this cleaned and processed text.

In [18]:
doc = nlp_model(clean_text)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}-{ent.text}')

NAME                          -Alice Clark
LOCATION                      -AI /
LOCATION                      -Delhi
DESIGNATION                   -Software Engineer
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
DEGREE                        -Indian Institute of Technology
SKILLS                        -Machine Learning, Natural Language Processing, and Big Data Handling    ADDITIONAL INFORMATION  Professional Skills  • Excellent analytical, problem solving, communication, knowledge transfer and interpersonal  skills with ability to interact with individuals at all the levels  • Quick learner and maintains cordial relationship with project manager and team members and  good performer both in team and independent job environments  • Positive attitude towards superiors &amp; peers  • Supervised junior developers throughout project lifecycle and provided technical assistanc

Let's test the model on another CV PDF.

In [19]:
import fitz  # PyMuPDF
fname = 'C:\\Users\\Meghna\\Desktop\\CV Ranking\\Smith Resume.pdf'
doc = fitz.open(fname)
text = ""
for page in doc:
    # get_text() returns the page's plain text as a string
    text += page.get_text()
    
clean_text = " ".join(text.split('\n'))
print(clean_text)

Michael Smith  BI / Big Data/ Azure  Manchester, UK- Email me on Indeed: indeed.com/r/falicent/140749dace5dc26f    10+ years of Experience in Designing, Development, Administration, Analysis,  Management  inthe  Business  Intelligence  Data  warehousing,  Client  Server  Technologies, Web-based Applications, cloud solutions and Databases.  Data warehouse: Data analysis, star/ snow flake schema data modeling and design  specific todata warehousing and business intelligence environment.  Database: Experience in database designing, scalability, back-up and recovery,  writing andoptimizing SQL code and Stored Procedures, creating functions, views,  triggers and indexes.   Cloud platform: Worked on Microsoft Azure cloud services like Document DB, SQL  Azure, StreamAnalytics, Event hub, Power BI, Web Job, Web App, Power BI, Azure  data lake analytics(U-SQL).  Big Data: Worked Azure data lake store/analytics for big data processing and Azure  data factoryto schedule U-SQL jobs. Designed and d

In [20]:
doc = nlp_model(clean_text)
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}-{ent.text}')

NAME                          -Michael Smith
DESIGNATION                   -Big Data/ Azure
EMAIL ADDRESS                 -indeed.com/r/falicent/140749dace5dc26f
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COMPANIES WORKED AT           -Microsoft
COLLEGE NAME                  -The University of Manchester - UK  
SKILLS                        -problem solving (Less than 1 year), project lifecycle (Less than 1 year), project


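The model extracts most of the relevant fields from the PDF CVs, though the output is not error-free: in the first CV "AI /" is tagged as a LOCATION and an institute name is tagged as a DEGREE, and in both CVs the same company is reported many times.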
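
Because the same company appears many times in the entity listings above, a summary of a CV may be easier to read if entities are grouped by label and repeated values are dropped. A minimal sketch, assuming the entities are available as (label, text) pairs; the helper name summarize_entities is hypothetical, and the sample pairs are copied from the second CV's output.

```python
def summarize_entities(pairs):
    """Group (label, text) pairs by label, dropping repeated values."""
    summary = {}
    for label, value in pairs:
        values = summary.setdefault(label, [])
        cleaned = value.strip()  # remove stray leading/trailing spaces
        if cleaned not in values:
            values.append(cleaned)
    return summary


# With a spaCy doc, the pairs could be built as
# [(ent.label_, ent.text) for ent in doc.ents].
pairs = [
    ("Name", "Michael Smith"),
    ("Companies worked at", "Microsoft"),
    ("Companies worked at", "Microsoft"),
    ("Companies worked at", "Microsoft"),
    ("College Name", "The University of Manchester - UK  "),
]
summary = summarize_entities(pairs)
for label, values in summary.items():
    print(f"{label.upper():30}-{'; '.join(values)}")
```

Labels keep the order in which they first appear, so the summary follows the layout of the CV.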