# FinParse:  A customized parser for financial resumes 

This notebook walks through an NLP-based resume parser.  The NLP models are trained on resumes from finance and accounting.

In [1]:
# standard modules
import pandas as pd
import numpy as np
import re

# utility modules
import importdata
import parse

# Importing data 

The training data comes from ~1000 pre-parsed resumes.  For privacy reasons I cannot make this data publicly available, so readers will not be able to train the models in the following sections themselves.  I hope the notebook is still understandable and useful to anyone who has their own training data.

In [2]:
# loading testing and training data
data_train, data_test = importdata.extract_data('../../data/pdf_resume_data.csv')

emp_name_tr, pos_name_tr, edu_lines_tr, descrips_tr, headers_tr = map(parse.kill_space, data_train)
emp_name_te, pos_name_te, edu_lines_te, descrips_te, headers_te = map(parse.kill_space, data_test)

Each part of the dataset is a list of strings corresponding to a given category.  For example ```emp_name_tr``` schematically looks like

```['Some Business Inc', 'Another Business Ltd.',...  ]```

Note that the ```headers_tr``` and ```headers_te``` datasets are slightly different in form.  Schematically they look like

```[['header 1', 'header 1 classification'], ['header 2', 'header 2 classification'], ...]```

The first element of each pair is the actual header from the parsed resume while the second element is the classification of the header assigned by the parser.  These form the [X,y] pairs for the training and testing of the header classifier. 
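
A minimal sketch of how such pairs separate into inputs and labels. The three header strings below are made up for illustration; the label names are taken from the parser categories that appear later in this notebook.

```python
# Hypothetical header pairs in the same [text, label] layout as headers_tr.
example_headers = [
    ['PROFESSIONAL EXPERIENCE', 'WORK HISTORY'],
    ['Education', 'EDUCATION'],
    ['Technical Skills', 'SKILLS'],
]

# X holds the raw header text, y holds the label assigned by the parser.
X = [text for text, label in example_headers]
y = [label for text, label in example_headers]
```

The list comprehensions behave the same way under Python 2 and Python 3.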

It will also be useful to have some resumes to test and demonstrate various parts of the notebook.  For that purpose I will use three resumes that I found online:

In [3]:
# example resumes (publicly available)
res_ex1 = importdata.pdf_to_text('example_resumes/Professional Finance Resume Format.pdf')
res_ex2 = importdata.pdf_to_text('example_resumes/Finance Executive Assistant Resume.pdf')
res_ex3 = importdata.pdf_to_text('example_resumes/Jon_Toledo_Final.pdf')

# Training classifiers

The basic strategy is to pass each line of a resume through two levels of classifiers.  The first classifier identifies whether a line is a header.  If it is, the line is passed through a second classifier that identifies the type of header (work experience, education, or other).  If it is not, the line is passed on for later processing.  Using the header information, the resume sections related to work experience and education are extracted.  Finally, each line in the work experience section is passed through a final classifier that identifies whether it corresponds to an employer, role, or description.
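
The two-level routing can be sketched with toy stand-ins for the trained models. Everything here is hypothetical (the function names, the upper-case heuristic, and the ```'BODY'``` tag); it only illustrates the control flow, not the classifiers trained below.

```python
def route_line(line, is_header, header_type):
    """Level 1: header or not.  Level 2: which kind of header."""
    if is_header(line):
        return header_type(line)
    return 'BODY'  # not a header: left for section-specific classification


# Toy stand-ins for the trained classifiers (assumptions for illustration).
def toy_is_header(line):
    return line.isupper() and len(line.split()) <= 3


def toy_header_type(line):
    if 'EXPERIENCE' in line or 'WORK' in line:
        return 'HEAD_WORK'
    if 'EDUCATION' in line:
        return 'HEAD_EDUC'
    return 'HEAD_OTHR'


resume_lines = ['PROFESSIONAL EXPERIENCE', 'Senior Budget Analyst',
                'EDUCATION', 'CERTIFICATION']
routed = [route_line(l, toy_is_header, toy_header_type) for l in resume_lines]
```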

In [4]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

### Line classifier

The first classifier I train assigns each line one of the labels ```['EMPL', 'TITL', 'EDUC','DESC', 'HEAD']```.  What I am really interested in is whether the line is a header.  For this classification task I find the best results using a linear SVC as the classifier.

In [5]:
# Prepare training data
line_bags_tr = [emp_name_tr, pos_name_tr, edu_lines_tr, descrips_tr, [h[0] for h in headers_tr]]

# Prepare classifier for fitting
model_lines = CalibratedClassifierCV(LinearSVC(random_state=0, tol=1e-5, penalty='l1', dual=False)) 

# Instantiate the classifier
clf_lines = parse.Classifier(line_bags_tr, ['EMPL', 'TITL', 'EDUC','DESC', 'HEAD'], model_lines)

In [6]:
# we can check the performance of the classifier using the test data

line_bags_te = [emp_name_te, pos_name_te, edu_lines_te, descrips_te, [h[0] for h in headers_te]]
print clf_lines.score_report(line_bags_te)

             precision    recall  f1-score   support

          0       0.91      0.96      0.94       414
          1       0.98      0.97      0.98       400
          2       0.96      0.93      0.95       472
          3       0.98      0.96      0.97       397
          4       0.98      0.99      0.98       439

avg / total       0.96      0.96      0.96      2122



Let's see how it works on a few examples:

In [7]:
print clf_lines.text_class(['CERTIFICATION'])

('HEAD', array([[0.00465499, 0.02024124, 0.2638527 , 0.01387927, 0.6973718 ]]))


The first element of the tuple is the predicted class.  The second element of the tuple is the probability that it belongs to each of the various classes.  In this case the text has been identified as a header.  To give another example, consider:

In [8]:
print clf_lines.text_class(['ANALYST'])

('TITL', array([[2.92355655e-06, 9.93187579e-01, 3.75972349e-04, 2.88223569e-04,
        6.14530103e-03]]))
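
The predicted class in these outputs is the label with the largest probability. A minimal sketch, assuming the probability columns follow the order of the label list passed to ```parse.Classifier``` (this is consistent with the outputs above, but the module's source is not shown here). The numbers are copied from the ```'CERTIFICATION'``` example.

```python
labels = ['EMPL', 'TITL', 'EDUC', 'DESC', 'HEAD']
probs = [0.00465499, 0.02024124, 0.2638527, 0.01387927, 0.6973718]


def most_likely(labels, probs):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


predicted, confidence = most_likely(labels, probs)
```

Note that the runner-up here is ```'EDUC'``` at roughly 0.26, so the probabilities also show which classes the model finds hard to tell apart.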


### Header classifier 

Now that I have identified the headers, I train another model to classify each header as ```['HEAD_WORK', 'HEAD_EDUC','HEAD_OTHR']```.   First we need to massage the data into the correct form.

In [9]:
# Create a dictionary which maps 'WORK HISTORY': 0, 'EDUCATION': 1, all other classes: 2. 
header_cats = ['HEAD_WORK', 'HEAD_EDUC','HEAD_OTHR']
wanted_header = ['WORK HISTORY', 'EDUCATION']
header_dict = {}
for header in set([ header[1] for header in headers_tr ]):
    if header not in wanted_header:
        header_dict[header] = len(header_cats)-1
    else:
        header_dict[header] = wanted_header.index(header)
    
pd.Series(header_dict).sort_values()

WORK HISTORY                              0
EDUCATION                                 1
SUMMARY                                   2
SPEAKING                                  2
SKILLS                                    2
SECURITY_CLEARANCES                       2
REFERENCES                                2
QUALIFICATIONS_SUMMARY                    2
PROJECT_HEADERS                           2
PROFESSIONAL AFFILIATIONS                 2
PERSONAL INTERESTS AND ACCOMPLISHMENTS    2
ARTICLES                                  2
OTHER_PUBLICATIONS                        2
OBJECTIVE                                 2
LICENSES                                  2
LANGUAGES                                 2
HOBBIES                                   2
HEADERS_TO_IGNORE                         2
CONTACT INFO                              2
CERTIFICATIONS                            2
TRAINING                                  2
PATENTS                                   2
dtype: int64

In [10]:
# regroup the headers_tr data into standard form for input into Classifier
header_bags_tr = [[],[],[]]
for el in headers_tr:
    if el[0] not in header_bags_tr[header_dict[el[1]]]:
        header_bags_tr[header_dict[el[1]]].append(el[0])
        
# print the size of each bag
print map(len, header_bags_tr)

[161, 146, 462]


In [11]:
# Train the header classifier
model_headers = CalibratedClassifierCV(LinearSVC(random_state=0, tol=1e-5, penalty='l1', dual=False)) 
clf_headers = parse.Classifier(header_bags_tr, ['HEAD_WORK', 'HEAD_EDUC','HEAD_OTHR'], model_headers)

In [12]:
# regroup the headers_te data into standard form for input into Classifier
header_bags_te = [[],[],[]]
for el in headers_te:
    if el[0] not in header_bags_te[header_dict[el[1]]]:
        header_bags_te[header_dict[el[1]]].append(el[0])
        
# test the performance of the header classifier
print clf_headers.score_report(header_bags_te)

             precision    recall  f1-score   support

          0       0.97      0.89      0.93        38
          1       0.93      0.90      0.92        31
          2       0.94      0.98      0.96       123

avg / total       0.95      0.95      0.95       192



To see how this classifier works in a few examples, consider:

In [13]:
clf_headers.text_class(['Work Experience'])

('HEAD_WORK', array([[0.96621902, 0.02725392, 0.00652706]]))

In [14]:
clf_headers.text_class(['Volunteer Experience'])

('HEAD_OTHR', array([[0.08856464, 0.0282039 , 0.88323146]]))

In [15]:
clf_headers.text_class(['Hello World'])

('UNKN', array([0, 0, 0, 0]))

### Work section classifier 

Now we train a classifier to identify lines within the work section of a resume as ```['EMPL', 'TITL', 'DESC']```.  For this task I found the best overall performance with a multinomial naive Bayes classifier.  (The linear SVC still scored best on the testing data, but the NB model performed best when given actual lines from resumes.)
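
For intuition about what a multinomial naive Bayes model does with a line of text, here is a simplified pure-Python version with Laplace smoothing. It is an illustration only: the notebook itself uses scikit-learn's ```MultinomialNB```, and the whitespace tokenization, function names, and toy training strings are assumptions.

```python
import math
from collections import Counter


def train_nb(bags, labels):
    """bags[i] is a list of strings that belong to class labels[i]."""
    n_docs = sum(len(bag) for bag in bags)
    word_counts, log_prior, vocab = {}, {}, set()
    for label, bag in zip(labels, bags):
        counts = Counter(w for text in bag for w in text.lower().split())
        word_counts[label] = counts
        log_prior[label] = math.log(len(bag) / float(n_docs))
        vocab.update(counts)
    return word_counts, log_prior, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    word_counts, log_prior, vocab = model
    words = [w for w in text.lower().split() if w in vocab]
    scores = {}
    for label, counts in word_counts.items():
        # Laplace smoothing: add one to every word count in the vocabulary.
        total = sum(counts.values()) + len(vocab)
        scores[label] = log_prior[label] + sum(
            math.log((counts[w] + 1.0) / total) for w in words)
    return max(scores, key=scores.get)


# Toy training data (made up for illustration).
toy_bags = [
    ['Acme Traders Inc', 'Hello Capital Ltd'],
    ['Senior Analyst', 'Budget Analyst'],
    ['Managed budget reports for senior leaders',
     'Prepared quarterly revenue analysis'],
]
toy_model = train_nb(toy_bags, ['EMPL', 'TITL', 'DESC'])
```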

In [16]:
clf_wrk = parse.Classifier([emp_name_tr, pos_name_tr, descrips_tr], ['EMPL', 'TITL', 'DESC'], MultinomialNB())

print clf_wrk.score_report([emp_name_te, pos_name_te, descrips_te])

             precision    recall  f1-score   support

          0       0.98      0.89      0.94       414
          1       0.97      0.97      0.97       400
          2       0.90      0.99      0.94       397

avg / total       0.95      0.95      0.95      1211



To see how this classifier works in some examples, consider:

In [17]:
clf_wrk.text_class(['Analyst'])

('TITL', array([[0.00248455, 0.98595934, 0.01155611]]))

In [18]:
clf_wrk.text_class(['Hello World Traders'])

('EMPL', array([[0.50754983, 0.24818616, 0.24426401]]))

In [19]:
clf_wrk.text_class(['I worked for Hello World Traders Inc. for seven years'])

('DESC', array([[0.2012993 , 0.04120023, 0.75750047]]))

# Parsing 

Finally we are ready to begin parsing resumes!  As I show in the next section, the parsing of a resume end-to-end can be done in one line using the functions in the ```parse``` module.  In this section we give a walk-through of the process that is going on under the hood.  The process involves four steps:

1. Breaking into 'lines'
2. Extracting sections
3. Classifying lines within sections
4. Clustering related information

I will demonstrate each of these in the following.

### Breaking into lines

Consider our first example resume ```res_ex1```.  We can have a look at a portion:

In [20]:
print res_ex1[2000:2500]

 
utilities, and related community services, with $60 million in annual revenues and 700 full-time employees. 
Senior Budget Analyst (2005-Present) 
Management Analyst (2004-2005) 
Progressed rapidly to Senior Budget Analyst to manage Performance Measurement and Accountability system across 
60 government departments and programs. Conduct budget, revenue, and variance / trend monitoring and analysis of 
performance and operational results, and provide associated semi-annual reports to government


The strategy I use for breaking the resume into 'lines' is to split the text on ```'\n'``` and then split each resulting line on ```'\s{5,}'```, i.e. on runs of five or more whitespace characters.  Each line therefore becomes a list of text chunks.  For example, the portion of ```res_ex1``` above gets broken down into:

In [21]:
for lis in parse.extract_lines(res_ex1[2000:2500]):
    print lis

[' ']
['utilities, and related community services, with $60 million in annual revenues and 700 full-time employees. ']
['Senior Budget Analyst (2005-Present) ']
['Management Analyst (2004-2005) ']
['Progressed rapidly to Senior Budget Analyst to manage Performance Measurement and Accountability system across ']
['60 government departments and programs. Conduct budget, revenue, and variance / trend monitoring and analysis of ']
['performance and operational results, and provide associated semi-annual reports to government']
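
The splitting rule described above can be sketched in a few lines. The function name is hypothetical, and ```parse.extract_lines``` may handle additional cases that are not shown in this notebook.

```python
import re


def split_into_chunks(text):
    """Split on newlines, then split each line on runs of 5+ whitespace chars."""
    return [re.split(r'\s{5,}', line) for line in text.split('\n')]


sample = 'Senior Analyst      2005-Present\nAcme Corp'
chunks = split_into_chunks(sample)
```

Splitting on wide gaps is useful for resumes laid out in columns, where a title and a date share one physical line but are separated by a long run of spaces.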


### Extracting sections

Before we give a final classification of the lines, we first break the resume into sections using the header classification.  This allows us to classify the lines within each section using a more refined model and thus reduces the number of mistakes.  For example, we can extract the work section of ```res_ex1``` as follows:

In [22]:
# some basic cleaning of the resume text (e.g. identifying and tagging dates using regex)
res_ex1_cleaned = parse.clean_resume(res_ex1)

# extract the line position of each header
headers_res_ex1 = parse.extract_headers(res_ex1_cleaned, clf_lines.text_class, clf_headers.text_class)
print headers_res_ex1

[['HEAD_OTHR', 10], ['HEAD_WORK', 26], ['HEAD_EDUC', 62], ['HEAD_OTHR', 70], ['HEAD_OTHR', 75], ['HEAD_OTHR', 78]]
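
Given header positions like those printed above, a section can be taken as the lines from its header up to the next header. This is a sketch under that assumption; ```slice_section``` is a hypothetical helper, not the implementation of ```parse.extract_section```.

```python
def slice_section(lines, headers, wanted):
    """Return lines from the first `wanted` header up to the next header."""
    positions = sorted(pos for _, pos in headers)
    for label, start in headers:
        if label == wanted:
            later = [p for p in positions if p > start]
            end = later[0] if later else len(lines)
            return lines[start:end]
    return []  # no header of the requested type was found


# Header positions copied from the output above; the lines are placeholders.
example_headers = [['HEAD_OTHR', 10], ['HEAD_WORK', 26], ['HEAD_EDUC', 62],
                   ['HEAD_OTHR', 70], ['HEAD_OTHR', 75], ['HEAD_OTHR', 78]]
placeholder_lines = ['line %d' % i for i in range(80)]
work_lines = slice_section(placeholder_lines, example_headers, 'HEAD_WORK')
```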


In [23]:
# extract the work section of res_ex1
wrk_res_ex1 = parse.extract_section(res_ex1_cleaned, headers_res_ex1 ,'HEAD_WORK')

# printing the first ten lines of the work section of res_ex1
for line in wrk_res_ex1[0:10]:
    print line[0:100]

PROFESSIONAL EXPERIENCE 
COUNTY OF SONOMA –  Sonoma, CA  
  
~DATE(S)~: 2004-Present
Government Agency responsible for administration of public works, law enforcement, public safety, el
Senior Budget Analyst  
~DATE(S)~: 2005-Present
Management Analyst  
~DATE(S)~: 2004-2005
Progressed rapidly to Senior Budget Analyst to manage Performance Measurement and Accountability sys
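
The ```~DATE(S)~``` lines in this output come from the cleaning step, which identifies and tags dates using regular expressions. The pattern and function below are hypothetical and cover only simple year ranges; the actual rules in ```parse.clean_resume``` are not shown in this notebook.

```python
import re

# Hypothetical pattern: a year, a separator ('-' or 'to'), then a year or
# 'Present', optionally wrapped in parentheses.
DATE_RANGE = re.compile(
    r'\(?\b((?:19|20)\d{2})\s*(?:-|to)\s*((?:19|20)\d{2}|Present)\b\)?')


def tag_dates(line):
    """Return the line without its date ranges, followed by tagged date lines."""
    tagged = ['~DATE(S)~: ' + m.group(0).strip('()')
              for m in DATE_RANGE.finditer(line)]
    remainder = DATE_RANGE.sub('', line)
    return [remainder] + tagged
```

Moving dates onto their own tagged lines keeps them out of the text classifiers and makes them easy to pick up in the later clustering step.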


### Classifying lines within a section

After extracting the work section we classify each line as one of ```['EMPL', 'TITL', 'DESC']```.  Lines that were tagged as dates during cleaning appear with the label ```DATE```.  If a line is labeled ```UNKN```, it does not contain any words in the classifier's vocabulary.
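
The ```UNKN``` fallback can be sketched as a vocabulary check that runs before the model is called. The function names, the whitespace tokenization, and the ```('UNKN', 0.0)``` return shape are assumptions for illustration.

```python
def in_vocabulary(line, vocabulary):
    """True if at least one word of the line is known to the classifier."""
    return any(word in vocabulary for word in line.lower().split())


def classify_or_unknown(line, vocabulary, classify):
    """Call the classifier only when the line shares a word with its vocabulary."""
    if not in_vocabulary(line, vocabulary):
        return ('UNKN', 0.0)
    return classify(line)


# Toy vocabulary and a stand-in classifier (both hypothetical).
toy_vocabulary = set(['analyst', 'budget'])
toy_classify = lambda line: ('TITL', 0.9)
```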

In [24]:
# classify a few lines in work section of res_ex1.  The print out is in the form:

#       CLASS...probability of class <== line that was classified 

clfd_wrk_lines = parse.clf_intra_sec(res_ex1, clf_lines, clf_headers, clf_wrk)
for line in clfd_wrk_lines[1:10]:
    print line[0] + '...' + str(line[1]) + ' <== ' +  line[-1][0:100] + '\n'

EMPL...0.51 <== COUNTY OF SONOMA –  Sonoma, CA  

UNKN...0.0 <==   

DATE...1.0 <== 2004-Present

DESC...0.98 <== Government Agency responsible for administration of public works, law enforcement, public safety, el

TITL...0.96 <== Senior Budget Analyst  

DATE...1.0 <== 2005-Present

TITL...0.93 <== Management Analyst  

DATE...1.0 <== 2004-2005

DESC...0.86 <== Progressed rapidly to Senior Budget Analyst to manage Performance Measurement and Accountability sys



In [25]:
# repeat classification of a few lines in the work section of res_ex2
clfd_wrk_lines = parse.clf_intra_sec(res_ex2, clf_lines, clf_headers, clf_wrk)
for line in clfd_wrk_lines[1:10]:
    print line[0] + '...' + str(line[1]) + ' <== ' +  line[-1][0:100] + '\n'

EMPL...0.65 <== Acme International Group – Phoenix, Arizona & Las Vegas, Nevada  

UNKN...0.0 <==  

DATE...1.0 <== 2010 to Present

TITL...0.69 <== EXECUTIVE ASSISTANT TO CFO & DIRECTOR OF HUMAN RESOURCES 

DESC...1.0 <== Deliver  firsthand  support  to  senior  leaders  and  decision  makers  while  managing  a  variety

DESC...0.98 <== Texas office, and distribution of B-5s to Acme partners. Coordinate Board packages for meetings—ga

DESC...0.67 <== Internationally-based Finance team. 

DESC...1.0 <== (cid:1)  Following departure of CFO, transitioned from focused position to more comprehensive suppor

DESC...0.98 <== (cid:1)  Overhauled and vastly improved attendance management by researching, selecting, and driving



###  Clustering lines

The final step of the parsing process is to cluster together related information, such as a job title with its employer and dates.  For example:

In [26]:
parse.cluster_lines(parse.clf_intra_sec(res_ex1, clf_lines, clf_headers, clf_wrk), 'HEAD_WORK')

[OrderedDict([('EMPL', 'COUNTY OF SONOMA \xe2\x80\x93  Sonoma, CA  '),
              ('TITLS',
               [OrderedDict([('TITL', 'Senior Budget Analyst  '),
                             ('DATE', '2005-Present')]),
                OrderedDict([('TITL', 'Management Analyst  '),
                             ('DATE', '2004-2005')])])]),
 OrderedDict([('EMPL',
               'LASER SOLUTIONS, INC (Wholly owned subsidiary of Digital Imprints, Ltd.) \xe2\x80\x93 Athens, GA  '),
              ('TITLS',
               [OrderedDict([('TITL', 'Business Analysis Manager  '),
                             ('DATE', '2000-2003')])])])]

In [27]:
parse.cluster_lines(parse.clf_intra_sec(res_ex2, clf_lines, clf_headers, clf_wrk), 'HEAD_WORK')

[OrderedDict([('EMPL',
               'Acme International Group \xe2\x80\x93 Phoenix, Arizona & Las Vegas, Nevada  '),
              ('TITLS',
               [OrderedDict([('TITL',
                              'EXECUTIVE ASSISTANT TO CFO & DIRECTOR OF HUMAN RESOURCES '),
                             ('DATE', '2010 to Present')])])]),
 OrderedDict([('EMPL', 'Blue Hill Enterprises \xe2\x80\x93 Phoenix, Arizona '),
              ('TITLS',
               [OrderedDict([('TITL', 'EXECUTIVE ASSISTANT TO CEO '),
                             ('DATE', '1998 to 2010')])])]),
 OrderedDict([('EMPL', 'River Hills Inc. \xe2\x80\x93 Phoenix, Arizona '),
              ('TITLS',
               [OrderedDict([('TITL',
                              'EXECUTIVE ASSISTANT TO MANAGING PARTNER '),
                             ('DATE', '1995 to 1998')])])]),
 OrderedDict([('EMPL', 'Weaver Law Firm \xe2\x80\x93 Phoenix, Arizona '),
              ('TITLS',
               [OrderedDict([('TITL', 'OFFICE MANAGER '),

# Parsing end-to-end

Finally, we demonstrate the full parser on the example resumes.  The output is a list of dictionaries.  One can compare with the PDF versions of the examples to see that all the work information is correctly parsed (see the example_resumes folder in the repo).

In [28]:
parse.RenderJSON({'WORK HISTORY': parse.cluster_lines(parse.clf_intra_sec(res_ex1, clf_lines, \
                                                                          clf_headers, clf_wrk), 'HEAD_WORK')})

In [29]:
parse.RenderJSON({'WORK HISTORY': parse.cluster_lines(parse.clf_intra_sec(res_ex2, clf_lines, \
                                                                          clf_headers, clf_wrk), 'HEAD_WORK')})

In [30]:
parse.RenderJSON({'WORK HISTORY': parse.cluster_lines(parse.clf_intra_sec(res_ex3, clf_lines, \
                                                                          clf_headers, clf_wrk), 'HEAD_WORK')})