### Record Linkage: preparing data sets to train classifiers

After acquiring birth and death data for 2016 through 2018 for Washington state from the Microsoft SQL Server database, the next steps are to: <br>
 <br> 
(1) pare down the data sets to only the variables needed for classification, <br> 
(2) preprocess data sets in preparation for classification, <br> 
(3) create candidate pairs of birth and death records, <br>
(4) compare select variables for each candidate pair of records to compute similarity scores on a 0 to 1 scale, <br>
(5) use the linked infant birth-death records for 2016-17 as the "golden records" data set to label candidate pairs as matches
    and non-matches, <br>
(6) split the data into training and testing data sets. <br>
 <br>
At this point the data are ready to use in building classifiers.

In [1]:
import pandas as pd
import numpy as np
import recordlinkage as rl

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#### (1) READ IN AND PARE DOWN DATA SETS TO LINKAGE VARIABLES

Data sets used: birth 2016-18, death 2016-18, linked infant death 2016-17.  

Birth and death data are restricted to events that occurred in 2016 and 2017; I will use 2018 to validate the model after training and testing on 2016 and 2017 combined.

All data are limited to births or deaths that occurred in Washington state to in-state residents.

Variables to be used for classifying record pairs as matches and non-matches include infants' (decedents'):
- first and last names, 
- mother's first and last names, 
- father's first and last names, 
- sex, 
- date of birth month, day, and year (as separate variables), 
- residence city and county at the time of death.

These variables are preserved for both the birth and death data sets. 

Birth and death certificate numbers are also retained in order to create unique IDs for each record pair.

##### Death data

In [3]:
d1618 = pd.read_csv(r'###\Data\d1618_clean.csv', 
                    low_memory = False)

In [4]:
# keep deaths from 2016-17 that occurred in Washington state
d1617 = d1618[(d1618['ddody'] != 2018)]
d1617 = d1617[(d1617.ddthstatel == 'WASHINGTON')]

In [5]:
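# keep only the linkage variables (mother's last name is 'dmom_lname', as shown in the output below)
d1617 = d1617.loc[:, ['dsfn', 'dfname', 'dlname', 'dmom_fname', 'dmom_lname', 'ddad_fname',
                      'ddad_lname', 'dsex', 'ddobm', 'ddobd', 'ddoby', 'drescity', 'drescountyl']]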

In [6]:
d1617.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109122 entries, 0 to 168701
Data columns (total 13 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   dsfn         109122 non-null  int64 
 1   dfname       109120 non-null  object
 2   dlname       109120 non-null  object
 3   dmom_fname   109118 non-null  object
 4   dmom_lname   109111 non-null  object
 5   ddad_fname   109119 non-null  object
 6   ddad_lname   109119 non-null  object
 7   dsex         109122 non-null  object
 8   ddobm        109122 non-null  int64 
 9   ddobd        109122 non-null  int64 
 10  ddoby        109122 non-null  int64 
 11  drescity     109061 non-null  object
 12  drescountyl  109062 non-null  object
dtypes: int64(4), object(9)
memory usage: 11.7+ MB


##### Birth data

In [7]:
b1618 = pd.read_csv(r'###\Data\b1618_clean.csv', 
                    low_memory = False)

In [8]:
b1617 = b1618[(b1618.bdoby !=2018)]

In [9]:
b1617.bdoby.value_counts(dropna=False)

2016    89083
2017    86167
Name: bdoby, dtype: int64

In [10]:
# check to make sure only Washington state births are included

b1617.bbirplstatefips.value_counts(dropna=False)

WA    175250
Name: bbirplstatefips, dtype: int64

In [11]:
# check to make sure only Washington state residents are included

b1617.b_momresstatefips.value_counts(dropna=False)

WA    175250
Name: b_momresstatefips, dtype: int64

In [12]:
b1617 = b1617.loc[:, ['bsfn','bfname','blname', 'bmom_fname', 'bmom_lname', 'bdad_lname', 
                      'bdad_fname','bsex', 'bdobm', 'bdobd', 'bdoby', 'b_momrescity', 
                      'b_momrescountyl']]

In [13]:
b1617.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 175250 entries, 0 to 259908
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   bsfn             175250 non-null  int64 
 1   bfname           175246 non-null  object
 2   blname           175242 non-null  object
 3   bmom_fname       175184 non-null  object
 4   bmom_lname       174002 non-null  object
 5   bdad_lname       163274 non-null  object
 6   bdad_fname       169464 non-null  object
 7   bsex             175250 non-null  object
 8   bdobm            175250 non-null  int64 
 9   bdobd            175250 non-null  int64 
 10  bdoby            175250 non-null  int64 
 11  b_momrescity     175202 non-null  object
 12  b_momrescountyl  175202 non-null  object
dtypes: int64(4), object(9)
memory usage: 18.7+ MB


#####  "Golden pairs" - linked infant birth-death file 2016-2017

The records in the 2016-17 infant birth-death file were manually linked birth and death records that represent the true matches for infant deaths occurring in those years.

In [14]:
linked1617 = pd.read_csv(r'###\WA1617infantDeath_dthdata.csv')

In [15]:
linked1617.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 54 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   lbsfn                631 non-null    int64  
 1   ldsfn                631 non-null    int64  
 2   dsfn                 631 non-null    int64  
 3   dbirsfn              631 non-null    object 
 4   dssn                 631 non-null    int64  
 5   dfname               631 non-null    object 
 6   dmname               531 non-null    object 
 7   dlname               631 non-null    object 
 8   dmom_fname           631 non-null    object 
 9   dmom_mname           375 non-null    object 
 10  dmom_lname           631 non-null    object 
 11  dsex                 631 non-null    object 
 12  dagetype             631 non-null    float64
 13  dage                 631 non-null    float64
 14  dageyrs              631 non-null    float64
 15  ddob                 631 non-null    obj

In [16]:
linked1617.ddthstatel.value_counts(dropna=False)

WASHINGTON    631
Name: ddthstatel, dtype: int64

In [17]:
linked1617.dbirplstatefips.value_counts(dropna=False)

WA    631
Name: dbirplstatefips, dtype: int64

In [18]:
# keep only birth and death certificate numbers of true matches to label the training data set

linked1617 = linked1617.loc[:,['lbsfn','ldsfn']]

In [19]:
linked1617.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 631 entries, 0 to 630
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   lbsfn   631 non-null    int64
 1   ldsfn   631 non-null    int64
dtypes: int64(2)
memory usage: 10.0 KB


In [20]:
linked1617.to_csv(r'###\data\clean\linked1617_labels.csv')

#### (2) Preprocessing: data cleaning and standardizing

##### DEATH

In [21]:
from recordlinkage.preprocessing import clean

In [22]:
d1617['dfname_clean'] = clean(d1617.dfname,
                              lowercase = True,
                              replace_by_none = r'[\s-]+',
                              strip_accents = 'unicode'
                             )

In [23]:
d1617['dlname_clean'] = clean(d1617.dlname,
                              lowercase = True,
                              replace_by_none = r'[\s-]+',
                              strip_accents = 'unicode'
                             )

In [24]:
d1617['dmom_fname_clean'] = clean(d1617.dmom_fname,
                                  lowercase = True,
                                  replace_by_none = r'[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [25]:
d1617['dmom_lname_clean'] = clean(d1617.dmom_lname,
                                  lowercase = True,
                                  replace_by_none = r'[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [26]:
d1617['ddad_fname_clean'] = clean(d1617.ddad_fname,
                                  lowercase = True,
                                  replace_by_none = r'[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [27]:
d1617['ddad_lname_clean'] = clean(d1617.ddad_lname,
                                  lowercase = True,
                                  replace_by_none = r'[\s-]+',
                                  strip_accents = 'unicode'
                                 )
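
For readers without recordlinkage at hand, the cleaning step above can be approximated with plain pandas string methods. This is only a minimal sketch, not the library's implementation: `clean_name` is a hypothetical helper, the sample names are made up, and it assumes the goal is lowercase, accent-free names with whitespace and hyphens removed.

```python
import unicodedata

import numpy as np
import pandas as pd


def clean_name(s: pd.Series) -> pd.Series:
    """Approximate the cleaning above: strip accents, lowercase, drop whitespace and hyphens."""
    def strip_accents(x):
        if not isinstance(x, str):
            return x  # leave missing values untouched
        decomposed = unicodedata.normalize("NFKD", x)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    return (s.map(strip_accents)
             .str.lower()
             .str.replace(r"[\s-]+", "", regex=True))


names = pd.Series(["MARY-ANN", "José  Luis", np.nan])  # made-up examples
cleaned = clean_name(names)
print(cleaned.tolist())
```

Missing names stay missing, so the non-null counts of the cleaned columns should equal those of the raw columns.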

###### Create phonetic encoding of infant, mother, and father names

In [28]:
d1617['dmom_fullname'] =  d1617.dmom_fname_clean + " " +  d1617.dmom_lname_clean

In [29]:
d1617['ddad_fullname'] =  d1617.ddad_fname_clean + " " +  d1617.ddad_lname_clean

In [30]:
d1617['dinf_fullname'] =  d1617.dfname_clean + " " +  d1617.dlname_clean

In [31]:
d1617['dmom_phon'] = rl.preprocessing.phonetic(d1617['dmom_fullname'], method = 'metaphone')
d1617['ddad_phon'] = rl.preprocessing.phonetic(d1617['ddad_fullname'], method = 'metaphone')
d1617['dinf_phon'] = rl.preprocessing.phonetic(d1617['dinf_fullname'], method = 'metaphone')
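
One detail worth keeping in mind for the full-name columns built above: pandas string concatenation propagates missing values, so a record that lacks either the first or the last name ends up with a missing full name. The sketch below shows this on made-up names. The `fillna('')` variant is just one possible alternative and is not what this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fname_clean": ["john", "ana", np.nan],
    "lname_clean": ["smith", np.nan, "lee"],
})

# Same pattern as above: NaN in either part makes the whole full name NaN.
toy["fullname"] = toy.fname_clean + " " + toy.lname_clean

# Alternative (an assumption, not the notebook's approach): keep whichever part is available.
toy["fullname_filled"] = (toy.fname_clean.fillna("") + " " + toy.lname_clean.fillna("")).str.strip()

print(toy)
```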

In [32]:
# View sample of death records to make sure the cleaning/standardization steps did what they were supposed to do.

d1617.head()

Unnamed: 0,dsfn,dfname,dlname,dmom_fname,dmom_lname,ddad_fname,ddad_lname,dsex,ddobm,ddobd,ddoby,drescity,drescountyl,dfname_clean,dlname_clean,dmom_fname_clean,dmom_lname_clean,ddad_fname_clean,ddad_lname_clean,dmom_fullname,ddad_fullname,dinf_fullname,dmom_phon,ddad_phon,dinf_phon
0,2017025187,CHARLES,BURNETT,JOHANNA,DEHOLLANDER,CHARLES,BURNETT,M,6,11,1932,WOODINVILLE,KING,charles,burnett,johanna,dehollander,charles,burnett,johanna dehollander,charles burnett,charles burnett,JHNTHLNTR,XRLSBRNT,XRLSBRNT
1,2017025188,DOUGLAS,LEE,ROSE,SHEA,FRANK,LEE,M,9,29,1958,REDMOND,KING,douglas,lee,rose,shea,frank,lee,rose shea,frank lee,douglas lee,RSX,FRNKL,TKLSL
2,2017025189,ARUNEE,TAOSAN,LUMDUAN,SRILAUN,UDON,TAOSAN,F,2,5,1947,LYNNWOOD,SNOHOMISH,arunee,taosan,lumduan,srilaun,udon,taosan,lumduan srilaun,udon taosan,arunee taosan,LMTNSRLN,UTNTSN,ARNTSN
3,2017025190,ELIZABETH,CHAUSSEE,EDA,OIN,HARTLEY,CHAUSSEE,F,10,3,1931,BELLEVUE,KING,elizabeth,chaussee,eda,oin,hartley,chaussee,eda oin,hartley chaussee,elizabeth chaussee,ETN,HRTLXS,ELSB0XS
4,2017025191,MERWIN,ADLER,ALBERTA,SCHNELL,JACOB,ADLER,M,9,11,1941,PALOUSE,WHITMAN,merwin,adler,alberta,schnell,jacob,adler,alberta schnell,jacob adler,merwin adler,ALBRTSXNL,JKBTLR,MRWNTLR


##### BIRTH

In [33]:
b1617['bfname_clean'] = clean(b1617.bfname,
                              lowercase = True,
                              replace_by_none = '[\s-]+',
                              strip_accents = 'unicode'
                             )

In [34]:
b1617['blname_clean'] = clean(b1617.blname,
                              lowercase = True,
                              replace_by_none = '[\s-]+',
                              strip_accents = 'unicode'
                             )

In [35]:
b1617['bmom_fname_clean'] = clean(b1617.bmom_fname,
                              lowercase = True,
                              replace_by_none = '[\s-]+',
                              strip_accents = 'unicode'
                             )

In [36]:
b1617['bmom_lname_clean'] = clean(b1617.bmom_lname,
                                 lowercase = True,
                                  replace_by_none = '[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [37]:
b1617['bdad_fname_clean'] = clean(b1617.bdad_fname,
                                lowercase = True,
                                  replace_by_none = '[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [38]:
b1617['bdad_lname_clean'] = clean(b1617.bdad_lname,
                                lowercase = True,
                                  replace_by_none = '[\s-]+',
                                  strip_accents = 'unicode'
                                 )

In [39]:
b1617['bmom_fullname'] =  b1617.bmom_fname_clean + " " +  b1617.bmom_lname_clean
b1617['bdad_fullname'] =  b1617.bdad_fname_clean + " " +  b1617.bdad_lname_clean
b1617['binf_fullname'] =  b1617.bfname_clean + " " +  b1617.blname_clean

In [40]:
b1617['bmom_phon'] = rl.preprocessing.phonetic(b1617['bmom_fullname'], method = 'metaphone')
b1617['bdad_phon'] = rl.preprocessing.phonetic(b1617['bdad_fullname'], method = 'metaphone')
b1617['binf_phon'] = rl.preprocessing.phonetic(b1617['binf_fullname'], method = 'metaphone')

In [1]:
## Check sample of birth data set to make sure cleaning/standardization worked

#b1617.head()

In [42]:
b1617.to_csv(r'###\data\clean\b1617_clean.csv', index=None, header=True)
d1617.to_csv(r'###\data\clean\d1617_clean.csv', index=None, header=True)

##### Reduction of birth and death data size

The two primary reasons for reducing the numbers of birth and death records are (1) to address class imbalance between matched and unmatched records, and (2) to reduce the computational load associated with creating candidate pairs from all birth records (n = 175,250) and all death records (n = 109,122) for this two-year period.

As the purpose of this project is to create a linked birth and death record file for infants (decedents who died before age 1 year), the death data set can be reduced by excluding records for decedents who were older than 364 days at the time of death. In the cells below this is done by keeping only the death records that appear in the golden pairs file, which reduces the number of deaths from 109,122 to 631.
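
For data without golden pairs (for example a later year), the age restriction would have to be applied directly. The following is a minimal sketch under stated assumptions: the month and day of death columns `ddodm` and `ddodd` are hypothetical names (only `ddody` appears in this notebook), and the records are made up.

```python
import pandas as pd

# Toy death records; ddodm and ddodd (month and day of death) are hypothetical column names.
deaths = pd.DataFrame({
    "dsfn": [1, 2, 3],
    "ddoby": [2016, 1950, 2016], "ddobm": [3, 7, 1], "ddobd": [15, 4, 10],
    "ddody": [2016, 2017, 2017], "ddodm": [9, 2, 1], "ddodd": [1, 20, 10],
})

# Assemble full dates from the separate year, month, and day columns.
dob = pd.to_datetime(pd.DataFrame({"year": deaths.ddoby, "month": deaths.ddobm, "day": deaths.ddobd}))
dod = pd.to_datetime(pd.DataFrame({"year": deaths.ddody, "month": deaths.ddodm, "day": deaths.ddodd}))

# Keep decedents who were at most 364 days old at death.
deaths["age_days"] = (dod - dob).dt.days
infant_deaths = deaths[deaths.age_days <= 364]
print(infant_deaths[["dsfn", "age_days"]])
```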


For the birth data set, I undersampled the birth records that have no match in the death data (identified by comparison with the existing "golden pairs", i.e. true matches) so that the final birth data set contains roughly equal numbers of records with and without matches in the reduced death data set.

To implement this reduction, the first step is to use the ID numbers in the linked1617 golden pairs to identify which birth and death records have a match and which do not.
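
The undersampling logic can be sketched on toy data before applying it to the real files. All IDs below are made up, and the frame names `births` and `golden` are stand-ins for the birth file and the golden pairs.

```python
import pandas as pd

# Toy stand-ins for the birth file and the golden pairs (all IDs are made up).
births = pd.DataFrame({"bsfn": range(2016000001, 2016000021)})
golden = pd.DataFrame({"lbsfn": [2016000003, 2016000007, 2016000011],
                       "ldsfn": [2016000101, 2016000102, 2016000103]})

is_linked = births["bsfn"].isin(golden["lbsfn"])
linked = births[is_linked]
not_linked = births[~is_linked]

# Sample as many unmatched births as there are matched ones.
undersample = not_linked.sample(n=len(linked), random_state=42)
births_fin = pd.concat([undersample, linked], axis=0)
print(len(linked), len(undersample), len(births_fin))
```

Fixing `random_state` keeps the sample reproducible across runs, which matters when the labelled file is regenerated.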

In [43]:
#b1617 = pd.read_csv(r'###\data\clean\b1617_clean.csv')
#d1617 = pd.read_csv(r'###\data\clean\d1617_clean.csv')
#linked1617 = pd.read_csv(r'###\data\clean\linked1617_labels.csv')

In [44]:
matched_dsfn = linked1617['ldsfn'].tolist()
len(matched_dsfn)

631

In [45]:
matched_bsfn = linked1617['lbsfn'].tolist()
len(matched_bsfn)

631

In [46]:
# separate birth records without matches in death data

b1617_not_linked = b1617[(~b1617['bsfn'].isin(matched_bsfn))]

In [47]:
# Select a random sample of 650 unmatched birth records

b1617_not_linked_undersample = b1617_not_linked.sample(n=650, random_state=42)

In [48]:
# separate birth records with matches in death data set

b1617_linked =  b1617[b1617['bsfn'].isin(matched_bsfn)]

In [49]:
# create a new birth data set consisting of the birth records that have matches and the sample 
#of 650 records that do not have matches in the death data.

b1617_fin = pd.concat([b1617_not_linked_undersample, b1617_linked], axis=0)

In [50]:
# Birth and death record IDs in WA are both 10 digits long and begin with the 4-digit year of the
# event, so the same number could appear in both files. To guarantee unique IDs, add a constant so
# that birth IDs become 11 digits starting with 1 and death IDs become 11 digits starting with 5.

# then set the index for birth records to the birth record number ('bsfn')

b1617_fin.bsfn = b1617_fin.bsfn + 10000000000
b1617_fin = b1617_fin.set_index('bsfn')
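
A quick arithmetic check of the ID offsets, using one certificate number as an example of a value that could in principle occur in both files:

```python
BIRTH_OFFSET = 10_000_000_000  # constant added to birth IDs
DEATH_OFFSET = 50_000_000_000  # constant added to death IDs

same_number = 2017074096  # a 10-digit certificate number

birth_id = same_number + BIRTH_OFFSET
death_id = same_number + DEATH_OFFSET
print(birth_id, death_id)  # 12017074096 52017074096
```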

In [51]:
# repeating the process above to keep only death records that are also present in the true match
#linked data set.

d1617_linked =  d1617[d1617['dsfn'].isin(matched_dsfn)]

In [52]:
d1617_fin = d1617_linked.copy()

In [53]:
d1617_fin.dsfn = d1617_fin.dsfn + 50000000000

In [54]:
d1617_fin = d1617_fin.set_index('dsfn')

In [55]:
b1617_fin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1281 entries, 12017074096 to 12017044542
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   bfname            1280 non-null   object
 1   blname            1280 non-null   object
 2   bmom_fname        1281 non-null   object
 3   bmom_lname        1276 non-null   object
 4   bdad_lname        1065 non-null   object
 5   bdad_fname        1179 non-null   object
 6   bsex              1281 non-null   object
 7   bdobm             1281 non-null   int64 
 8   bdobd             1281 non-null   int64 
 9   bdoby             1281 non-null   int64 
 10  b_momrescity      1281 non-null   object
 11  b_momrescountyl   1281 non-null   object
 12  bfname_clean      1280 non-null   object
 13  blname_clean      1280 non-null   object
 14  bmom_fname_clean  1281 non-null   object
 15  bmom_lname_clean  1276 non-null   object
 16  bdad_fname_clean  1179 non-null   object
 1

In [56]:
d1617_fin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 631 entries, 52017002434 to 52017056216
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   dfname            631 non-null    object
 1   dlname            631 non-null    object
 2   dmom_fname        631 non-null    object
 3   dmom_lname        631 non-null    object
 4   ddad_fname        631 non-null    object
 5   ddad_lname        631 non-null    object
 6   dsex              631 non-null    object
 7   ddobm             631 non-null    int64 
 8   ddobd             631 non-null    int64 
 9   ddoby             631 non-null    int64 
 10  drescity          631 non-null    object
 11  drescountyl       631 non-null    object
 12  dfname_clean      631 non-null    object
 13  dlname_clean      631 non-null    object
 14  dmom_fname_clean  631 non-null    object
 15  dmom_lname_clean  631 non-null    object
 16  ddad_fname_clean  631 non-null    object
 17

In [57]:
# change record IDs so that they are unique.

linked1617.lbsfn = linked1617.lbsfn + 10000000000
linked1617.ldsfn = linked1617.ldsfn + 50000000000

linked1617.to_csv(r'###\data\clean\linked1617_labels.csv')

#### (3) BLOCKING AND INDEXING CANDIDATE RECORD PAIRS

To reduce the number of candidate pairs that will be compared on linking variables, the birth and death records in the final data sets are grouped into 'blocks' based on year of birth, month of birth, day of birth, and the phonetic encoding of the mother's name. Each `block()` call below defines a separate blocking rule, and recordlinkage combines the rules as a union: a birth-death pair becomes a candidate if the two records agree on at least one of these fields, not necessarily on all of them.

In [58]:
# initialize indexer and specify blocking variables

indexer = rl.Index()
indexer.block(left_on = 'bdoby', right_on = 'ddoby')
indexer.block(left_on = 'bdobm', right_on = 'ddobm' )
indexer.block(left_on = 'bdobd', right_on = 'ddobd')
indexer.block(left_on = 'bmom_phon', right_on = 'dmom_phon')

<Index>

In [59]:
# create candidate pairs between reduced birth and death data sets

candidate_pairs = indexer.index(b1617_fin, d1617_fin)

In [60]:
'''Number of candidate pairs produced by the blocking rules above. This is fewer than the
631 death X 1,281 birth = 808,311 pairs that a full pairing of the reduced data sets would give.
'''
len(candidate_pairs)

454323
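
As far as I understand the library, several `block()` calls on one `rl.Index` are combined as a union of the pairs from each rule rather than an intersection, which is consistent with the large count above. The sketch below reproduces both behaviours on toy data using plain pandas merges (not recordlinkage itself). All values are made up and `block` is a hypothetical helper.

```python
import pandas as pd

births = pd.DataFrame({"bid": [1, 2, 3], "doby": [2016, 2016, 2017], "dobm": [1, 5, 5]})
deaths = pd.DataFrame({"did": [10, 20], "doby": [2016, 2017], "dobm": [5, 5]})


def block(key):
    """All (bid, did) pairs that agree exactly on one blocking key."""
    merged = births.merge(deaths, on=key, suffixes=("_b", "_d"))
    return set(zip(merged.bid, merged.did))


by_year, by_month = block("doby"), block("dobm")
union_pairs = by_year | by_month   # pair agrees on year OR month
strict_pairs = by_year & by_month  # pair agrees on year AND month
print(len(union_pairs), len(strict_pairs))
```

Here the union keeps 5 of the 6 possible pairs, while requiring agreement on both keys keeps only 2. If the stricter behaviour is wanted in recordlinkage, I believe a single `block()` call with lists of column names for `left_on` and `right_on` achieves it, but that should be checked against the library documentation.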

#### (4) CANDIDATE PAIR COMPARISONS

The candidate pairs created above are compared on the infants' last name, the mothers' and fathers' last names, the phonetic encodings of the infants', mothers', and fathers' full names, the month, day, and year of birth, the infants' sex, and the county of residence recorded on the birth record (mother's residence) and on the death record.

In [61]:
# initialize the comparison object

c = rl.Compare()

In [62]:
c.string('blname_clean', 'dlname_clean', method = 'jarowinkler', label = 'cmp_inf_lname')
c.string('binf_phon', 'dinf_phon', method = 'jarowinkler', label = 'cmp_inf_phonetic')
c.string('bmom_lname_clean', 'dmom_lname_clean', method = 'jarowinkler', label = 'cmp_mom_lname')
c.string('bmom_phon', 'dmom_phon', method = 'jarowinkler', label = 'cmp_mom_phonetic')
c.string('bdad_lname_clean', 'ddad_lname_clean', method = 'jarowinkler', label = 'cmp_dad_lname')
c.string('bdad_phon', 'ddad_phon', method = 'jarowinkler', label = 'cmp_dad_phonetic')
c.numeric('bdobm', 'ddobm', method = 'linear', scale = 1, label = 'cmp_dobm')
c.numeric('bdoby', 'ddoby', method = 'linear', scale = 1, label = 'cmp_doby')
c.numeric('bdobd', 'ddobd', method = 'linear', scale = 1, label = 'cmp_dobd')
c.exact('bsex', 'dsex', label = 'cmp_sex')
c.exact('b_momrescountyl', 'drescountyl', label = 'cmp_rescounty')

<Compare>

In [63]:
# Apply comparison to the candidate pairs created above.
# The resulting data frame contains the comparison scores (from 0 to 1) of the variables above.
# True matches will be closer to 1 on these variables.

comp_scores = c.compute(candidate_pairs, b1617_fin, d1617_fin)
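
The `numeric` comparisons with `method = 'linear'` and `scale = 1` take only the values 1.0, 0.5, and 0.0 in the sample output further down. The sketch below is my reading of the linear method, not the library's code: `linear_similarity` is a hypothetical helper and the exact implementation in recordlinkage may differ.

```python
import numpy as np


def linear_similarity(a, b, scale=1.0, offset=0.0):
    """Linear decay: 1 within `offset`, falling to 0 once the gap reaches offset + 2 * scale."""
    gap = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    excess = np.clip(gap - offset, 0.0, 2.0 * scale)
    return 1.0 - excess / (2.0 * scale)


# Birth months 5, 5, 5 on the birth record compared with 5, 6, 9 on the death record.
scores = linear_similarity([5, 5, 5], [5, 6, 9])
print(scores.tolist())  # [1.0, 0.5, 0.0]
```

Under this reading, a difference of one month, day, or year scores 0.5 and any larger difference scores 0.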

In [64]:
len(comp_scores), type(comp_scores)

(454323, pandas.core.frame.DataFrame)

In [65]:
comp_scores.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 454323 entries, (12016000010, 52016000522) to (12017087228, 52017091113)
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cmp_inf_lname     454323 non-null  float64
 1   cmp_inf_phonetic  454323 non-null  float64
 2   cmp_mom_lname     454323 non-null  float64
 3   cmp_mom_phonetic  454323 non-null  float64
 4   cmp_dad_lname     454323 non-null  float64
 5   cmp_dad_phonetic  454323 non-null  float64
 6   cmp_dobm          454323 non-null  float64
 7   cmp_doby          454323 non-null  float64
 8   cmp_dobd          454323 non-null  float64
 9   cmp_sex           454323 non-null  int64  
 10  cmp_rescounty     454323 non-null  int64  
dtypes: float64(9), int64(2)
memory usage: 39.9 MB


In [66]:
comp_scores.tail()

bsfn,dsfn,cmp_inf_lname,cmp_inf_phonetic,cmp_mom_lname,cmp_mom_phonetic,cmp_dad_lname,cmp_dad_phonetic,cmp_dobm,cmp_doby,cmp_dobd,cmp_sex,cmp_rescounty
12017087228,52017057564,0.5,0.516667,0.5,0.511905,0.0,0.0,1.0,1.0,0.0,0,0
12017087228,52017057607,0.7,0.409524,0.539683,0.511905,0.7,0.490476,0.5,1.0,0.0,1,0
12017087228,52017057624,0.488889,0.628205,0.642857,0.565079,0.0,0.0,1.0,1.0,0.0,1,0
12017087228,52017057673,0.577778,0.465079,0.539683,0.60119,0.577778,0.495238,0.0,1.0,0.0,1,0
12017087228,52017091113,0.605556,0.422222,0.666667,0.464286,0.605556,0.504545,1.0,0.5,0.0,1,0


**Add variable to indicate whether record pairs in comp_scores are matches or non-matches**

The golden pairs (true match) linked data set allows labeling of the entire comp_scores data set.

In [68]:
# set a multiIndex for the true match record ID pairs

labels = linked1617.set_index(['lbsfn', 'ldsfn'])

In [69]:
# add a variable match (set to 1) indicating that the record pairs in dataframe 'labels' are all matches
labels['match'] = 1

In [70]:
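# concatenate the comparison scores data with the labels and fill all blank values for the 'match' variable
# with 0, indicating no match.
# Note: .astype(int) is applied to every column, so the fractional similarity scores are truncated to
# 0 or 1 (see the int32 dtypes in the output below); cast only 'match' if the continuous scores are needed.

comp_scores_m = (pd.concat([comp_scores, labels], axis=1)
       .fillna(0)
       .astype(int))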

In [73]:
comp_scores_m.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 454323 entries, (12016000010, 52016000522) to (12017087228, 52017091113)
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   cmp_inf_lname     454323 non-null  int32
 1   cmp_inf_phonetic  454323 non-null  int32
 2   cmp_mom_lname     454323 non-null  int32
 3   cmp_mom_phonetic  454323 non-null  int32
 4   cmp_dad_lname     454323 non-null  int32
 5   cmp_dad_phonetic  454323 non-null  int32
 6   cmp_dobm          454323 non-null  int32
 7   cmp_doby          454323 non-null  int32
 8   cmp_dobd          454323 non-null  int32
 9   cmp_sex           454323 non-null  int32
 10  cmp_rescounty     454323 non-null  int32
 11  match             454323 non-null  int32
dtypes: int32(12)
memory usage: 22.5 MB
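
Because `.astype(int)` in the labelling cell is applied to the whole frame, the similarity scores end up as integers (the int32 columns above). If the continuous scores are wanted as classifier features, one option is to add the label column without casting anything else. This is a minimal sketch on made-up IDs and scores, using `MultiIndex.isin` instead of the concat approach:

```python
import pandas as pd

# Toy comparison scores indexed by (birth ID, death ID); all values are made up.
idx = pd.MultiIndex.from_tuples([(11, 51), (11, 52), (12, 52)], names=["bsfn", "dsfn"])
scores = pd.DataFrame({"cmp_mom_lname": [0.95, 0.40, 1.00], "cmp_sex": [1, 0, 1]}, index=idx)

# Golden pairs: only (11, 51) and (12, 52) are true matches.
true_pairs = [(11, 51), (12, 52)]

labelled = scores.copy()
labelled["match"] = labelled.index.isin(true_pairs).astype(int)  # 1 = match, 0 = non-match
print(labelled)
```

Only the new `match` column is cast to integer, so the score columns keep their float values.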


In [74]:
comp_scores_m.to_csv(r'###\data\clean\infdth_comparison_1617_labelled.csv', header=True)

The comparison score data frame can now be used in the next step when building classifiers.
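
Step (6), splitting into training and testing sets, is not shown in this notebook. The sketch below shows one way it could be done with pandas alone, keeping the share of matches the same in both parts. The 50/50 proportion, the toy data, and the frame names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy labelled comparison scores: 8 non-matches and 2 matches (made-up values).
rng = np.random.default_rng(0)
data = pd.DataFrame({"cmp_mom_lname": rng.random(10), "match": [0] * 8 + [1] * 2})

# Stratified 50/50 split: sample half of each class for training, keep the rest for testing.
train = pd.concat([grp.sample(frac=0.5, random_state=42) for _, grp in data.groupby("match")])
test = data.drop(train.index)

print(train["match"].value_counts().to_dict(), test["match"].value_counts().to_dict())
```

Stratifying on `match` matters here because true matches are rare among the candidate pairs, and a purely random split could leave very few of them in one of the two sets.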