# True match cleaning

In [1]:
#Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import jellyfish
%matplotlib inline

First we need to import the scraped data from FamilySearch and take a look at it.

In [2]:
#Read in data and create columns
names = ["fsid", "Full1"] + ['Source{}'.format(x) for x in range(1,21)]
df = pd.read_csv('/Users/jperryman/Desktop/BYU/Python/api_scrape.csv', names=names)

In [3]:
df['Full1'].describe()

count     174790
unique    171406
top         Mary
freq          30
Name: Full1, dtype: object

In [4]:
#Check work
df.head()

Unnamed: 0,fsid,Full1,Source1,Source2,Source3,Source4,Source5,Source6,Source7,Source8,...,Source11,Source12,Source13,Source14,Source15,Source16,Source17,Source18,Source19,Source20
0,LDBJ-136,"Catherine ""Kate"" Englehart",Catharine Englehart in household of George Eng...,Katie Englehart in entry for Le Roy Mathias Ta...,Kate Wagner in household of William H Wagner; ...,,,,,,...,,,,,,,,,,
1,LR4C-C64,Philip Ely Fuller,Philip E Fuller in entry for Dayner H Fuller; ...,Phillip E Fuller in entry for Elizabeth Fuller...,Philip E Fuller in entry for Dorothy B Fuller;...,Philip Ely Fuller in entry for Philip Ely Full...,Philip Ely Fuller in entry for Philip Ely Pull...,"Philip Fuller; ""United States Social Security ...","Philip Ely Fuller; ""Find A Grave Index"" d 28 J...","Philip Ely Fuller; ""United States World War II...",...,"Phillip E Fuller; ""United States Census; 1920""","Philip Ely Fuller; ""United States World War I ...",Philip E in entry for Elizabeth Irving Fuller;...,"Philip E Fuller; ""Massachusetts Marriages; 184...",Philop E Fuller in household of Edward Fuller;...,Phillip E Fuller in household of Edward J Full...,"Philip E. Fuller; ""Massachusetts Births; 1841-...",,,
2,9NQT-YZX,Roy Edward Mercer,"Roy Edward Mercer; ""Find A Grave Index""","Ray Merser in household of David Merser; ""Unit...","Roy Mercer in household of David Mercer; ""Unit...","Roy Edward Mercer; ""United States World War I ...","Roy Mercer in household of David Mercer; ""Unit...","Roy E Mercer; ""United States Census; 1930""","Roy E Mercer; ""United States Census; 1940""",Roy Edward Mercer find a grave (1964),...,,,,,,,,,,
3,L73P-MGH,Edward Byron White Jr,"Edward B White in household of Ed B White; ""Un...","Edmond White in household of Ed White; ""United...","Edward Byron White; ""Iowa Births and Christeni...","Edward Byron White; ""Iowa; County Births; 1880...","Edward Byron White; ""California Death Index; 1...",Legacy NFS Source: Edward Byron White - Govern...,,,...,,,,,,,,,,
4,LDQP-BQP,Maurice D Spencer,"Spencer in entry for Charles Ross Spencer; ""Ca...",Morris D Spencer in household of Charles D Spe...,"M D Spencer; ""Colorado Statewide Marriage Inde...",Morris (Maurice D Spencer in the household of ...,Morrice D Spencer in household of Charles D Sp...,Maurice D Spencer in entry for Blanche S Potte...,"Maurice Spencer; ""United States Census; 1900""","Maurice Spencer; ""United States Census; 1910""",...,,,,,,,,,,


Great, the scrape worked well and has given us plenty of names with indexing errors.  The problem to solve now is how to strip the noise out of each source title so that only the indexed name remains.  We will use regular expressions to drop the numbers, collection names, and other words we do not want.  We will also stack the source columns so that each row holds a single pair of names: the profile name and one indexed name.
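To make the cleaning idea concrete, here is a minimal sketch that applies a few of the same patterns to single strings with the standard `re` module instead of pandas. The helper name `clean_source` and the "Jane Doe" string are made up for illustration, the final `strip()` is an addition, and only a subset of the notebook's patterns is shown.

```python
import re

def clean_source(text):
    """Reduce one scraped source title to the indexed name (illustrative subset only)."""
    text = re.sub(r' in[\s\S]+', '', text)             # drop " in household of ..." and everything after it
    text = re.sub(r'Legacy NFS Source: ', '', text)    # drop a known prefix
    text = re.sub(r'[Ff]ind [Aa] [Gg]rave', '', text)  # drop the collection name
    text = re.sub(r'[0-9()]', '', text)                # drop digits and parentheses
    match = re.search(r'[A-Z][ \w]+', text)            # first capitalised run of letters, digits, and spaces
    return match.group(0).strip() if match else None

# The first string is a made-up example; the other two appear in the scrape above.
print(clean_source('Jane Doe in household of John Doe; "United States Census; 1900"'))
print(clean_source('Roy Edward Mercer find a grave (1964)'))
print(clean_source('Roy Edward Mercer; "Find A Grave Index"'))
```

Because the final pattern stops at the first character that is not a letter, digit, underscore, or space, anything after a semicolon or quotation mark is discarded without needing its own rule.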

In [5]:
#Clean noise
df.Source1 = df.Source1.str.replace(r'( in[\s\S]+)', "", regex=True)

In [6]:
df.Source1 = df.Source1.str.extract(r'([A-Z][ \w]+)', expand=False)

In [7]:
#Clean sources of unneeded words and symbols
#regex=True is passed explicitly because newer pandas treats patterns as literal text by default
for x in ['Source{}'.format(i) for i in range(1,21)]:
    df[x] = df[x].str.replace(r'( in[\s\S]+)', "", regex=True)
    df[x] = df[x].str.replace('Legacy NFS Source: ', "", regex=True)
    df[x] = df[x].str.replace(r'[Ff]ind [Aa] [Gg]rave', "", regex=True)
    df[x] = df[x].str.replace(r"([0-9])", "", regex=True)
    df[x] = df[x].str.replace(r"( [a-z])\w+", "", regex=True)
    df[x] = df[x].str.replace(r"\.!@#$%^&*:-=,?", "", regex=True)
    df[x] = df[x].str.replace(r"\(", "", regex=True)
    df[x] = df[x].str.replace(r"\)", "", regex=True)
    df[x] = df[x].str.replace(r'([Tt]he )?United States (Federal )?Census', "", regex=True)
    df[x] = df[x].str.replace(r'( the )', "", regex=True)
    df[x] = df[x].str.extract(r'([A-Z][ \w]+)', expand=False)
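
One detail of the loop is worth checking. The symbol pattern `\.!@#$%^&*:-=,?` is read by the regex engine as a sequence that includes the anchors `$` and `^`, not as a set of characters, so it appears unable to match any ordinary name. The sketch below compares it with a character-class version using the standard `re` module. The class pattern and the sample string are assumptions about the intent, not part of the original notebook, and switching to the class would change the downstream counts.

```python
import re

sequence_pattern = r"\.!@#$%^&*:-=,?"   # pattern as written in the loop
class_pattern = r"[.!@#$%^&*:=,?-]"     # hypothetical character-class version

sample = "Philip E. Fuller, Jr?"        # made-up sample string

as_sequence = re.sub(sequence_pattern, "", sample)  # nothing is removed
as_class = re.sub(class_pattern, "", sample)        # '.', ',' and '?' are removed

print(repr(as_sequence))
print(repr(as_class))
```

In the notebook the final `extract` step still cuts each value at the first symbol, which is why the no-op goes unnoticed in the source columns.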

In [8]:
#Stack index
df = df.set_index(['fsid', 'Full1'])
df = df.stack().reset_index()
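
As a rough picture of what the stack step does, the plain-Python sketch below turns one wide record into one row per source. The helper `stack_sources` and its tuple layout are made up for illustration. Skipping missing values mirrors what the notebook's `stack()` call appears to have done, since the row count reported later is far below 20 rows per profile.

```python
def stack_sources(rows):
    """Turn (fsid, full_name, [source1, source2, ...]) records into one row per non-missing source."""
    long_rows = []
    for fsid, full_name, sources in rows:
        for position, source in enumerate(sources, start=1):
            if source is None:  # missing sources produce no row
                continue
            long_rows.append((fsid, full_name, 'Source{}'.format(position), source))
    return long_rows

wide_rows = [
    ('LDBJ-136', 'Catherine Englehart', ['Catharine Englehart', 'Katie Englehart', None]),
]
long_rows = stack_sources(wide_rows)
for row in long_rows:
    print(row)
```

The third element of each output row plays the role of the `level_2` column that the notebook deletes in a later cell.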

In [9]:
#Drop nicknames and suffixes
for x in ['Full1']:
    df[x] = df[x].str.replace(r'[\'"]\w+[\'"]', "", regex=True)
    df[x] = df[x].str.replace('"', "", regex=True)
    df[x] = df[x].str.replace("'", "", regex=True)
    df[x] = df[x].str.replace(r"( [a-z])\w+", "", regex=True)
    df[x] = df[x].str.replace(r"([0-9])", "", regex=True)
    df[x] = df[x].str.replace(r"\(", "", regex=True)
    df[x] = df[x].str.replace(r"\)", "", regex=True)
    df[x] = df[x].str.replace(r"\.!@#$%^&*:-=,?", "", regex=True)
    df[x] = df[x].str.replace(r'[\'"]\w+[\'"]', "", regex=True)
    df[x] = df[x].str.replace(r'( [Jj]r)', "", regex=True)
    df[x] = df[x].str.replace(r'( [Ss]r)', "", regex=True)
    df[x] = df[x].str.replace(r'\.', "", regex=True)

In [10]:
#Drop column level2
del df['level_2']

In [11]:
#Rename columns so the stacked source values become Full2
df.columns = ['fsid', 'Full1', 'Full2']

In [12]:
df['Full1'].describe()

count     1077616
unique     171170
top          Mary
freq          104
Name: Full1, dtype: object

In [13]:
#Check work
df.head()

Great, the names in Full1 look good, but Full2 still needs the same cleanup.  After that we compute a Jaro similarity score (jellyfish.jaro_distance) for each pair of names.  Pairs that score exactly 1 are correctly indexed and can be dropped, which leaves the true matches whose names were indexed incorrectly.
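
For readers without jellyfish installed, the following pure-Python sketch shows how a Jaro similarity can be computed. It is a textbook-style implementation written for illustration, not the library's code, and the function name `jaro_similarity` is chosen here. The sample values are the first-name scores that appear in the output later in this notebook.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity between two strings: 1.0 for identical strings, 0.0 for no common characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they sit within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order; out-of-order pairs are half-transpositions.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

print(round(jaro_similarity('Philip', 'Phillip'), 6))
print(round(jaro_similarity('Catherine', 'Catharine'), 6))
print(round(jaro_similarity('Catherine', 'Katie'), 6))
```

A single inserted or substituted letter keeps the score well above 0.9, while a nickname such as "Katie" falls much lower, which is what makes a score cutoff useful for separating indexing errors from unrelated names.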

In [14]:
#Drop nicknames and suffixes
for x in ['Full2']:
    df[x] = df[x].str.replace('"', "", regex=True)
    df[x] = df[x].str.replace(r'[\'"]\w+[\'"]', "", regex=True)
    df[x] = df[x].str.replace("'", "", regex=True)
    df[x] = df[x].str.replace(r"([0-9])", "", regex=True)
    df[x] = df[x].str.replace(r"\(", "", regex=True)
    df[x] = df[x].str.replace(r"\)", "", regex=True)
    df[x] = df[x].str.replace(r"\.!@#$%^&*:-=,?", "", regex=True)
    df[x] = df[x].str.replace(r"( [a-z])\w+", "", regex=True)
    df[x] = df[x].str.replace(r'( [Jj]r)', "", regex=True)
    df[x] = df[x].str.replace(r'( [Ss]r)', "", regex=True)
    df[x] = df[x].str.replace(r'\.', "", regex=True)

In [15]:
#Replace NaN with empty strings
df = df.fillna('')

In [16]:
#Find Jaro score for each name pair (newer jellyfish releases name this function jaro_similarity)
df['score'] = df.apply(lambda row: jellyfish.jaro_distance(row['Full1'], row['Full2']), axis=1)

In [17]:
#Check work
df.head()

Unnamed: 0,fsid,Full1,Full2,score
0,LDBJ-136,Catherine Englehart,Catharine Englehart,0.85653
1,LDBJ-136,Catherine Englehart,Katie Englehart,0.770635
2,LDBJ-136,Catherine Englehart,Kate Wagner,0.604924
3,LR4C-C64,Philip Ely Fuller,Philip E Fuller,0.91634
4,LR4C-C64,Philip Ely Fuller,Phillip E Fuller,0.928309


In [18]:
#Drop exact matches (score of 1) and low-scoring outliers (score below .75)
df = df[df.score != 1]
df = df[df.score >= .75]

In [19]:
#strip columns
df['Full1'] = df['Full1'].str.strip()
df['Full2'] = df['Full2'].str.strip()

Now we can drop the score column and add first, middle, and last name columns.  After separating the names out, we need to make sure each column contains only the first, middle, or last name, respectively.
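
The splitting idea can be sketched on single strings with the standard `re` module: the first run of word characters is the first name, the last run is the last name, and whatever remains between them is the middle name. The helper `split_name` is made up for illustration, and the `strip()` calls are an addition that the notebook's column-wise version does not apply to the middle name.

```python
import re

def split_name(full_name):
    """Split a cleaned full name into (first, middle, last) using word-boundary anchors."""
    full_name = full_name.strip()
    first = re.search(r'^\w+', full_name)     # leading word
    last = re.search(r'\w+$', full_name)      # trailing word
    middle = re.sub(r'^\w+', '', full_name)   # remove the leading word ...
    middle = re.sub(r'\w+$', '', middle)      # ... then the trailing word
    return (
        first.group(0) if first else '',
        middle.strip(),
        last.group(0) if last else '',
    )

print(split_name('Philip Ely Fuller'))
print(split_name('Philip E'))
print(split_name('Mary'))
```

A one-word name such as "Mary" comes back as both first and last name, which is the case that the later cell comparing First2 with Last2 handles for the indexed names.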

In [20]:
#Drop score and add first, middle, and last names
del df['score']
df['First1'] = df.Full1
df['Mid1'] = df.Full1
df['Last1'] = df.Full1
df['First2'] = df.Full2
df['Mid2'] = df.Full2
df['Last2'] = df.Full2

In [21]:
#organize columns
cols = list(df)
cols.insert(5, cols.pop(cols.index('Full2')))
cols

['fsid',
 'Full1',
 'First1',
 'Mid1',
 'Last1',
 'Full2',
 'First2',
 'Mid2',
 'Last2']

In [22]:
df = df.loc[:, cols]

In [23]:
#Check work
df.head()

Unnamed: 0,fsid,Full1,First1,Mid1,Last1,Full2,First2,Mid2,Last2
0,LDBJ-136,Catherine Englehart,Catherine Englehart,Catherine Englehart,Catherine Englehart,Catharine Englehart,Catharine Englehart,Catharine Englehart,Catharine Englehart
1,LDBJ-136,Catherine Englehart,Catherine Englehart,Catherine Englehart,Catherine Englehart,Katie Englehart,Katie Englehart,Katie Englehart,Katie Englehart
3,LR4C-C64,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Philip E Fuller,Philip E Fuller,Philip E Fuller,Philip E Fuller
4,LR4C-C64,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Phillip E Fuller,Phillip E Fuller,Phillip E Fuller,Phillip E Fuller
5,LR4C-C64,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Philip Ely Fuller,Philip E Fuller,Philip E Fuller,Philip E Fuller,Philip E Fuller


In [24]:
#Extract first name
for x in ['First1']:
    df[x] = df[x].str.extract(r'(^\w+)', expand=False)

In [25]:
#Extract middle name: remove the first and last words, then stray symbols
for x in ['Mid1']:
    df[x] = df[x].str.replace(r'^\w+', "", regex=True)
    df[x] = df[x].str.replace(r'\w+$', "", regex=True)
    df[x] = df[x].str.replace(r'[/*?\-=]', "", regex=True)

In [26]:
#Extract last name
for x in ['Last1']:
    df[x] = df[x].str.extract(r'(\w+$)', expand=False)

In [27]:
#Extract first name
for x in ['First2']:
    df[x] = df[x].str.extract(r'(^\w+)', expand=False)

In [28]:
#Extract middle name
for x in ['Mid2']:
    df[x] = df[x].str.replace(r'(\w+$)', "", regex=True)
    df[x] = df[x].str.replace(r'(^\w+)', "", regex=True)

In [29]:
#Extract last name
for x in ['Last2']:
    df[x] = df[x].str.extract(r'(\w+$)', expand=False)

In [30]:
#If first name = last name, drop last name
df.loc[df.First2 == df.Last2, "Last2"] = ""

In [31]:
#Replace NaN with empty strings
df = df.fillna('')

In [32]:
#Create Jaro scores for the full, first, and last names
df['scoreFull'] = df.apply(lambda row: jellyfish.jaro_distance(row['Full1'], row['Full2']), axis=1)
df['scoreFirst'] = df.apply(lambda row: jellyfish.jaro_distance(row['First1'], row['First2']), axis=1)
df['scoreLast'] = df.apply(lambda row: jellyfish.jaro_distance(row['Last1'], row['Last2']), axis=1)

In [33]:
#Sum individual scores
df['score'] = df['scoreFull'] + df['scoreFirst'] + df['scoreLast']

In [34]:
#Check work
df.head()

Unnamed: 0,fsid,Full1,First1,Mid1,Last1,Full2,First2,Mid2,Last2,scoreFull,scoreFirst,scoreLast,score
0,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Catharine Englehart,Catharine,,Englehart,0.85653,0.925926,1.0,2.782456
1,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Katie Englehart,Katie,,Englehart,0.770635,0.664815,1.0,2.43545
3,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,0.91634,1.0,1.0,2.91634
4,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,0.928309,0.952381,1.0,2.88069
5,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,0.91634,1.0,1.0,2.91634


In [35]:
#Drop rows whose summed score (maximum 3) is below 1.50
df = df[df.score >= 1.50]
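
To make the cutoff concrete, the sketch below sums the three scores for a few pairs and keeps those at or above 1.50. The first two rows reuse scores from the output above. The third row is made up to show a pair that would be dropped, and the helper `total_score` is hypothetical.

```python
def total_score(row):
    """Sum the three per-pair scores; each lies in [0, 1], so the total lies in [0, 3]."""
    return row['scoreFull'] + row['scoreFirst'] + row['scoreLast']

pairs = [
    {'Full2': 'Catharine Englehart', 'scoreFull': 0.85653, 'scoreFirst': 0.925926, 'scoreLast': 1.0},
    {'Full2': 'Katie Englehart', 'scoreFull': 0.770635, 'scoreFirst': 0.664815, 'scoreLast': 1.0},
    {'Full2': 'Kate', 'scoreFull': 0.6, 'scoreFirst': 0.7, 'scoreLast': 0.0},  # made-up row
]

THRESHOLD = 1.50  # same cutoff as the cell above
kept = [p['Full2'] for p in pairs if total_score(p) >= THRESHOLD]
print(kept)
```

Because a missing last name scores 0, a pair needs strong agreement on both the full name and the first name to stay above the cutoff.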

In [36]:
#Drop rows with no fsid
df = df[df.fsid != '']

In [37]:
#drop scores
del df['score']
del df['scoreFull']
del df['scoreFirst']
del df['scoreLast']

In [38]:
#Assign all rows Match=1
df['Match'] = 1

In [39]:
#Check work
df.head(50)

Unnamed: 0,fsid,Full1,First1,Mid1,Last1,Full2,First2,Mid2,Last2,Match
0,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Catharine Englehart,Catharine,,Englehart,1
1,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Katie Englehart,Katie,,Englehart,1
3,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
4,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
5,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
8,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip Fuller,Philip,,Fuller,1
11,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
12,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
13,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
15,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E,Philip,,E,1


In [40]:
df.head(28715)
#53008

Unnamed: 0,fsid,Full1,First1,Mid1,Last1,Full2,First2,Mid2,Last2,Match
0,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Catharine Englehart,Catharine,,Englehart,1
1,LDBJ-136,Catherine Englehart,Catherine,,Englehart,Katie Englehart,Katie,,Englehart,1
3,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
4,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
5,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
8,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip Fuller,Philip,,Fuller,1
11,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
12,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E Fuller,Philip,E,Fuller,1
13,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Phillip E Fuller,Phillip,E,Fuller,1
15,LR4C-C64,Philip Ely Fuller,Philip,Ely,Fuller,Philip E,Philip,,E,1


In [41]:
df['Full1'].describe()

count     600433
unique    146890
top         Mary
freq          74
Name: Full1, dtype: object

After plenty of checking and rewriting code, the final data looks clean.  Some names were lost along the way, but the true-match sample is still close in size to the false-match set.  With the names cleaned, each one is ready to be turned into a vector.

In [42]:
df.to_csv('/Users/jperryman/Desktop/BYU/Python/true_names.csv', index=False)