### This notebook uses a list of keywords from a spreadsheet to split a PDF file into sections. In this example, it takes a PDF file containing sample resumes and splits it by applicant, using a list of the applicants' names (read from an Excel file below). Fuzzy matching improves the split: each name in the list is matched to the closest spelling that actually appears in the PDF, and that spelling is used to locate the resume.

##### Note: each individual set of resumes should be saved in its own folder if you plan to pickle results. Running this code without changing the pickle file names will overwrite earlier pickles.
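One way to guard against overwriting is to build each pickle path from a dataset label and to refuse to replace an existing file by default. The sketch below is only an illustration: `pickle_path`, `save_pickle`, `load_pickle`, and the `overwrite` flag are hypothetical helper names, not part of this notebook.

```python
import pickle
from pathlib import Path


def pickle_path(base_dir, dataset_label, name):
    """Return base_dir/dataset_label/name.pkl, creating the folder if needed."""
    folder = Path(base_dir) / dataset_label
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name}.pkl"


def save_pickle(obj, path, overwrite=False):
    """Pickle obj to path; refuse to replace an existing file unless asked."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    with open(path, "wb") as f:  # the with-block closes the file even on error
        pickle.dump(obj, f)


def load_pickle(path):
    """Load and return a pickled object."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

With helpers like these, each set of resumes gets its own subfolder (for example one label per PDF), and a second run fails loudly instead of silently replacing earlier results.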

In [1]:
#EDIT THIS IF NEW DATA - change directory

#import python packages and change directory to the correct folder

import pickle
import os
import pandas
import re
import pdfminer
from pdfminer.high_level import extract_text
os.chdir(r'/Users/swilson/Library/CloudStorage/OneDrive-Personal/python/python notebooks/GitHub/Parse-NLP-ML-main')
cwd = os.getcwd()
print("Current working directory is:", cwd)

Current working directory is: /Users/swilson/Library/CloudStorage/OneDrive-Personal/python/python notebooks/GitHub/Parse-NLP-ML-main


In [2]:
#EDIT THIS IF NEW DATA - change pdf name
#convert pdf to machine readable format. Takes a few minutes depending on size of pdf
raw_text = extract_text('Resume Booklet.pdf')
raw_text[:10000]

'RESUMES\xa0\n\nSAMPLES\n\nVarious sample resumes to help you see\nways to improve your resume.\n\n\x0cTABLE OF\nTABLE OF\n\nCONTENTS\nCONTENTS\n\nArchitecture + Planning...........................................................2\nArchitecture + Planning...........................................................2\nArchitecture + Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2\n\nCultural and Social Transformation...........................4\nCultural and Social Transformation...........................4\nCultural and Social Transformation . . . . . . . . . . . . . . . 4\n\nEducation............................................................................................5\nEducation............................................................................................5\nEducation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5\n\nEngineering.....................................................................................

In [3]:
# optional code to pickle the extracted resume text
#with open('raw_text.pkl', 'wb') as F:
#    pickle.dump(raw_text, F)

In [4]:
#optional code to pull in the pickled resume text
#with open('raw_text.pkl', 'rb') as F:
#    raw_text = pickle.load(F)


In [5]:
# For this specific example, some lines of code were not needed, so they are commented out.

#EDIT THIS IF NEW DATA - pull in the new Excel list of names. The required Excel format is a single column of names,
#last name comma first name. There cannot be a column header.

#process the list of names to prepare for the for loop. The names are in the format last, first middle. This
#rearranges the names into the format first middle last.

ea_df= pandas.read_excel (r'Resume Booklet names.xlsx', sheet_name='Sheet1', header=None) #read in excel file, list of names
ea_df.rename(columns={ea_df.columns[0]: "name" }, inplace = True)
ea_df = ea_df['name'].str.split(',',expand=True) #split last name and first and middle name in separate columns
ea_df.rename(columns={ea_df.columns[0]: "last" }, inplace = True) #rename column names
ea_df.rename(columns={ea_df.columns[1]: "first" }, inplace = True)#rename column names
# ea_df[['first','middle']] = ea_df['first'].str.split().str[::1].apply(pandas.Series).fillna('')#separate middle name from first name
#ea_df['middle_initial'] = ea_df['middle'].astype(str).str[0]
# ea_df = ea_df.drop(ea_df.columns[2], axis=1) #drop middle name
ea_df = ea_df.fillna('')#replace NA values for names without middle name as empty
# #ea_df['first_middle_initial'] = ea_df['first'] +' '+ ea_df['middle_initial']
# # ea_df = ea_df.drop(columns=['first', 'middle_initial']) #drop excess columns
# #ea_df = ea_df.rename(columns={'first_middle_initial': 'first'})
ea_df = ea_df[['first', 'last']] # order first name column first 
ea_df["name"] = ea_df["first"] + ' ' + ea_df["last"] #create a new column with first and last name together, in that order
ea_df = ea_df[['name']] #keep only the full name column
ea_df

Unnamed: 0,name
0,Architecture Student
1,Urban Ecology Student
2,Transform Student
3,Education Student
4,Education Student 2
5,Electric Engineering Student
6,Biomedical Engineering Student
7,Mining Engineering Student
8,Graphic Design Student
9,Art Student


In [6]:
#convert the pandas df of names into a list and clean up

ea_list = [[i] for i in ea_df['name']] #convert the pandas df into a list
flattened = [val for sublist in ea_list for val in sublist] #flatten the list above from a list of list format
for i, s in enumerate(flattened):  #remove extra spaces that are added when converting from df to list
    flattened[i] = " ".join(s.split())
flattened = [x.lower() for x in flattened] #make all names lowercase
flattened = [i.lstrip() for i in flattened] #strip any leading spaces
flattened

['architecture student',
 'urban ecology student',
 'transform student',
 'education student',
 'education student 2',
 'electric engineering student',
 'biomedical engineering student',
 'mining engineering student',
 'graphic design student',
 'art student',
 'games student',
 'health promotion and education student',
 'kinesiology student',
 'parks recreation and tourism student',
 'international studies student',
 'asian studies student',
 'communication student',
 'mining engineering student',
 'environmental science student',
 'nursing student',
 'biology student',
 'chemistry student',
 'psychology student',
 'health sociology and policy student',
 'social work student']

In [7]:
#EDIT THIS IF NEW DATA - replace or remove names if known in advance.

#remove and convert names missing from excel list of names

#name = [x.replace('childers-conner', 'childers') for x in flattened]
#name.remove('christopher hershey')
#name = [x.replace('osiruphu-el', 'osiruphu') for x in name]

# run this if no names removed
name = flattened

name

['architecture student',
 'urban ecology student',
 'transform student',
 'education student',
 'education student 2',
 'electric engineering student',
 'biomedical engineering student',
 'mining engineering student',
 'graphic design student',
 'art student',
 'games student',
 'health promotion and education student',
 'kinesiology student',
 'parks recreation and tourism student',
 'international studies student',
 'asian studies student',
 'communication student',
 'mining engineering student',
 'environmental science student',
 'nursing student',
 'biology student',
 'chemistry student',
 'psychology student',
 'health sociology and policy student',
 'social work student']

In [8]:
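#process resume text: make all words lowercase and replace punctuation and whitespace with spaces

import re

text_lower = re.sub(r"[^a-zA-Z0-9]", " ", raw_text.lower()) #lowercase the text, then replace every non-alphanumeric character with a space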

text_lower[:2500]

'resumes   samples  various sample resumes to help you see ways to improve your resume    table of table of  contents contents  architecture   planning                                                           2 architecture   planning                                                           2 architecture   planning                                                         2  cultural and social transformation                           4 cultural and social transformation                           4 cultural and social transformation                               4  education                                                                                            5 education                                                                                            5 education                                                                                  5  engineering                                                                                        7 engineering               
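To see what this normalization does to the extracted text, here is a small self-contained sketch on a fragment shaped like the raw output above. The function name `normalize` is an assumption. Each non-alphanumeric character (newline, form feed `\x0c`, non-breaking space `\xa0`, punctuation) becomes exactly one space, so nothing is collapsed.

```python
import re


def normalize(text):
    """Lowercase text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-z0-9]", " ", text.lower())


sample = "RESUMES\xa0\n\nSAMPLES\n\n\x0cTABLE OF\nCONTENTS"
cleaned = normalize(sample)
# cleaned == 'resumes   samples   table of contents'

# Collapsing runs of spaces is a separate, optional step
collapsed = " ".join(cleaned.split())
# collapsed == 'resumes samples table of contents'
```

Because the substitution is one character for one character, a name that spans a line break in the PDF still appears as words separated by spaces, which is what the fuzzy search needs.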

In [11]:
# #DO NOT REMOVE!!! This is the original fuzzy matching code that fuzzy matches the list of names to the names in the raw
#text. The names as they appear in the raw text will then be used to parse the pdf. Originally, I tried the package fuzzywuzzy
#(also called thefuzz), but I think fuzzywuzzy needs the raw text to be tokenized; otherwise it returns individual
#letters, not words. Since I need first, middle, and last name, I need the format to be a string, not a token. A token
#would separate first, middle, and last names in the raw text with a comma.

from fuzzysearch import find_near_matches #the search loop below takes about 1 hour to run

results_search = {} # dictionary to store results 

#max_l_dist=6 appears to be the practical maximum here. max_l_dist=7 did not finish, even when left for over an hour (maybe several hours?)
for i in name:
    results_search[i] = find_near_matches(i, text_lower, max_l_dist=6)

results_middle_initial = results_search
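The `max_l_dist` setting bounds the Levenshtein (edit) distance between a name and a stretch of the text: the number of single-character insertions, deletions, or substitutions needed to turn one into the other. The sketch below is a plain dynamic-programming implementation of that distance for two whole strings. It is not the search itself, which looks for near matches anywhere inside the text, and the function name `levenshtein` is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


d1 = levenshtein("architecture student", "architechture student")                    # 1: one extra "h"
d2 = levenshtein("electric engineering student", "electrical engineering student")  # 2: "al" added
d3 = levenshtein("social work student", "socialworkstudent")                         # 2: two spaces removed
```

A distance limit of 6 is therefore generous for short names: a string such as "art student" has only 11 characters, so many unrelated phrases fall within 6 edits of it, which is why the minimum-distance filter later in the notebook matters.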

In [12]:
#Optional - pickle results
#Fuzzy search can take up to 1 hour to run, so pickling the result for easy access

# F = open('results_middle_initial.pkl', 'wb')
# pickle.dump(results_middle_initial, F)

In [13]:
#Optional - pulling in the results from the fuzzy matching 
#F = open('results_middle_initial.pkl','rb')
#results_middle_initial = pickle.load(F)

In [14]:
#processing the fuzzy matched names.

fuzz_df = pandas.DataFrame.from_dict(results_middle_initial, orient='index') #convert dictionary to pandas dataframe
fuzz_df = fuzz_df.stack().explode() 
fuzz_df = fuzz_df.to_frame().reset_index()#convert pandas series to dataframe
fuzz_df = fuzz_df.rename(columns={ fuzz_df.columns[0]: "original", fuzz_df.columns[2]: "results" }) #rename columns by position
fuzz_df = fuzz_df.drop(fuzz_df.columns[[1]], axis=1) #drop excess column
fuzz_df['results'] = fuzz_df['results'].astype(str).agg(lambda x:x.str.strip("Match()")) #removes Match() string from results column
fuzz_df[['start','end','dist','matched']] = fuzz_df['results'].str.split(',', expand = True) #split results
fuzz_df = fuzz_df.drop(['results','start','end'], axis=1) #drop multiple columns by name
fuzz_df['dist'] = fuzz_df['dist'].str.replace("dist=","") #remove "dist=" from dist column
fuzz_df['matched'] = fuzz_df['matched'].str.replace("matched='","")#remove "matched=''" from matched column
fuzz_df['matched'] = fuzz_df['matched'].str.replace("'","")#remove ' from end of name in matched column
fuzz_df

Unnamed: 0,original,dist,matched
0,architecture student,6,architecture plann
1,architecture student,6,architecture plann
2,architecture student,6,architecture plann
3,architecture student,6,architecture plann
4,architecture student,1,architechture student
...,...,...,...
1396,social work student,6,oint for student
1397,social work student,6,cil on student
1398,social work student,0,social work student
1399,social work student,2,socialworkstudent
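The cell above turns each match into a string and then cuts the string apart. A possible alternative, assuming the match objects expose `start`, `end`, `dist`, and `matched` as attributes (the same four fields parsed out of the string above), is to read the fields directly. The sketch uses a `namedtuple` stand-in named `Match` and made-up positions so that it runs without the search library.

```python
from collections import namedtuple

# Stand-in for the search results; field names follow the
# start/end/dist/matched columns parsed in the cell above.
Match = namedtuple("Match", ["start", "end", "dist", "matched"])

# Hypothetical results dictionary: positions are invented for illustration
results = {
    "architecture student": [
        Match(120, 138, 6, "architecture plann"),
        Match(900, 921, 1, "architechture student"),
    ],
    "social work student": [
        Match(5000, 5019, 0, "social work student"),
    ],
}

# One row per (name, match) pair, with dist kept as an integer
rows = [
    {"original": original, "dist": m.dist, "matched": m.matched}
    for original, matches in results.items()
    for m in matches
]
```

A list of dictionaries like `rows` can be passed to `pandas.DataFrame(rows)` to get the `original`, `dist`, and `matched` columns without string cleanup or a later numeric conversion.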


In [15]:
#For each original name, keep only the match with the minimum Levenshtein distance

fuzz_df['dist'] = pandas.to_numeric(fuzz_df['dist']) #convert dist column from obj to numeric
min_dist = fuzz_df.loc[fuzz_df.groupby('original').dist.idxmin()] #group by each original name and keep the minimum dist
min_dist.sort_index(inplace=True) #sort by the index
min_dist = min_dist.reset_index(drop=True) #reset the index 
min_dist

Unnamed: 0,original,dist,matched
0,architecture student,1,architechture student
1,urban ecology student,0,urban ecology student
2,transform student,0,transform student
3,education student,0,education student
4,education student 2,0,education student 2
5,electric engineering student,2,electrical engineering student
6,biomedical engineering student,0,biomedical engineering student
7,mining engineering student,0,mining engineering student
8,graphic design student,0,graphic design student
9,art student,0,art student
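The selection rule used here (for each original name, keep the candidate with the smallest distance, and on ties keep the first one found) can be written out in plain Python. This is a sketch of the rule only. The function name `best_matches` and the sample rows are assumptions.

```python
def best_matches(rows):
    """Keep, for each original name, the first row with the smallest dist."""
    best = {}
    for row in rows:
        current = best.get(row["original"])
        # strict < means an equal-distance row seen later never replaces an earlier one
        if current is None or row["dist"] < current["dist"]:
            best[row["original"]] = row
    # dicts preserve insertion order, so names come out in first-seen order
    return list(best.values())


candidates = [
    {"original": "architecture student", "dist": 6, "matched": "architecture plann"},
    {"original": "architecture student", "dist": 1, "matched": "architechture student"},
    {"original": "social work student", "dist": 6, "matched": "cil on student"},
    {"original": "social work student", "dist": 0, "matched": "social work student"},
    {"original": "social work student", "dist": 2, "matched": "socialworkstudent"},
]
chosen = best_matches(candidates)
# chosen keeps 'architechture student' (dist 1) and 'social work student' (dist 0)
```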


In [16]:
min_dist["dist"].mean() #with middle names, mean is 1.58. Without middle names is 1.265

0.8333333333333334

In [17]:
min_dist["dist"].sum()  #with middle names, sum is 76. Without middle names is 62

20

In [18]:
#repeating conversion from dataframe to dictionary
min_dist_match = min_dist.drop(['original','dist'], axis=1) # keep only the matched column of names
match_list = [[i] for i in min_dist_match['matched']] #convert dataframe to list
match_flattened = [val for sublist in match_list for val in sublist] #flatten list
match_flattened = [i.lstrip() for i in match_flattened] #remove spaces
match_pairs = list(zip(match_flattened, match_flattened[1:]))#pair names
match_pairs = dict(match_pairs) #converting list to dictionary

match_pairs

{'architechture student': 'urban ecology student',
 'urban ecology student': 'transform student',
 'transform student': 'education student',
 'education student': 'education student 2',
 'education student 2': 'electrical engineering student',
 'electrical engineering student': 'biomedical engineering student',
 'biomedical engineering student': 'mining engineering student',
 'mining engineering student': 'graphic design student',
 'graphic design student': 'art student',
 'art student': 'gamestudent',
 'gamestudent': 'health promotion and education student',
 'health promotion and education student': 'kinesiology student',
 'kinesiology student': 'parks  recreation and tourism   uni',
 'parks  recreation and tourism   uni': 'international studies student',
 'international studies student': 'asian studies student',
 'asian studies student': 'communication student',
 'communication student': 'environmental geoscience student',
 'environmental geoscience student': 'nursing student',
 'nu

In [19]:
#convert from dictionary to Ordered Dictionary in order to capture the last name in the list

from collections import OrderedDict

order_match_pairs = OrderedDict(match_pairs)#creates ordered dictionary so you can access the last value
order_match_pairs[order_match_pairs[next(reversed(order_match_pairs))]] = next(iter(order_match_pairs)) #assign the last value of dictionary as a key and the first key as a value to capture the last name in the list

order_match_pairs


OrderedDict([('architechture student', 'urban ecology student'),
             ('urban ecology student', 'transform student'),
             ('transform student', 'education student'),
             ('education student', 'education student 2'),
             ('education student 2', 'electrical engineering student'),
             ('electrical engineering student',
              'biomedical engineering student'),
             ('biomedical engineering student', 'mining engineering student'),
             ('mining engineering student', 'graphic design student'),
             ('graphic design student', 'art student'),
             ('art student', 'gamestudent'),
             ('gamestudent', 'health promotion and education student'),
             ('health promotion and education student', 'kinesiology student'),
             ('kinesiology student', 'parks  recreation and tourism   uni'),
             ('parks  recreation and tourism   uni',
              'international studies student'),
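The two cells above build a "name to next name" mapping and then add one more entry so the final name is not lost. A possible alternative is to pair each name with its successor and give the last name an explicit `None`, using `itertools.zip_longest`. The function name `successor_map` is an assumption, and this sketch is not what the notebook runs.

```python
from itertools import zip_longest


def successor_map(names):
    """Map each name to the one that follows it; the last name maps to None."""
    # zip_longest pads the shorter sequence (names[1:]) with None,
    # so the final name still gets an entry
    return dict(zip_longest(names, names[1:]))


ordered_names = ["architechture student", "urban ecology student", "transform student"]
pairs = successor_map(ordered_names)
# pairs == {'architechture student': 'urban ecology student',
#           'urban ecology student': 'transform student',
#           'transform student': None}
```

Plain dictionaries have preserved insertion order since Python 3.7, so the last key can be recognized by its `None` value rather than by position. As with the original cells, a name that appears twice in the list collapses into a single key.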
        

In [20]:
#loop through raw text, parse names into a dictionary

parsed_middle_initial = {} #creates a dictionary that the for loop will add values to

for key, value in order_match_pairs.items():
    if key == next(reversed(order_match_pairs)): #capture the last name until the end of the raw text
        parsed_middle_initial[format(key)] = text_lower[text_lower.index(key):] 
    else: 
        parsed_middle_initial[format(key)] = text_lower[text_lower.index(key):text_lower.index(value)+1] #all other names
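The slicing idea in this loop (each resume runs from its own name up to the next name, and the last one runs to the end of the text) can be sketched on a toy string. The version below differs from the loop above in two ways, both assumptions rather than the notebook's behavior: it searches for each name only after the previous one, and it ends each slice exactly where the next name starts. The function name `split_by_markers` is hypothetical.

```python
def split_by_markers(text, markers):
    """Slice text into sections, each starting at its marker and ending before the next."""
    starts = []
    cursor = 0
    for marker in markers:
        pos = text.find(marker, cursor)  # search only past the previous marker
        if pos == -1:
            raise ValueError(f"marker not found after position {cursor}: {marker!r}")
        starts.append(pos)
        cursor = pos + len(marker)
    ends = starts[1:] + [len(text)]  # the last section runs to the end of the text
    return {m: text[s:e] for m, s, e in zip(markers, starts, ends)}


toy_text = "art student paints  gamestudent codes  nursing student cares"
sections = split_by_markers(toy_text, ["art student", "gamestudent", "nursing student"])
# sections['art student'] == 'art student paints  '
# sections['nursing student'] == 'nursing student cares'
```

Searching forward from the previous marker helps when one name is a prefix of the next, as with "education student" and "education student 2", but it still assumes the names occur in the text in the same order as in the list.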

In [21]:
#displaying only the first value, the resume associated with the key "architechture student"
parsed_middle_initial['architechture student']

'architechture student 123 president s circle  salt lake city  ut   name gmail com   www myportfolio com   000 123 4567  architechture student  i  n o t a c u d e  i  e c n e r e p x e  i  p h s r e b m e m  s l l k s  i  master of architecture   university of utah   salt lake city  ut  may 20xx    bachelor of science in architecture   university of  utah   austin  tx  may 20xx  minor in spanish   public interest design summer studio   university of texas   austin  tx  many 20xx july20xx   three cities studio study abroad   syracuse university    may 20xx aug 20xx  italy  england  and the netherlands  public interest design fellow   tulane city center   new orleans  la  jan oct 20xx     developed the hollygrove greenline master plan with other design professionals   community stakeholders through schematic  design and design development phases    managed the schedule  budget  material orders  and staff and volunteer labor for installation of five rain gardens on the proper  ties of fiv

In [22]:
#optional code to search for specific names
#parsed_middle_initial['laura m  wilson']#[-1000:-1]

In [23]:
#pickle results to use in resume analysis
with open('parsed_middle_initial.pkl', 'wb') as F: #the with-block closes the file automatically
    pickle.dump(parsed_middle_initial, F)

In [24]:
#pickle min_dist to use in mapping back in resume analysis
min_dist.to_pickle("min_dist_middle_initial.pkl")