### Capstone 1 - Washington state linkage of infant death, birth, and mother's hospitalization discharge data

##### Maya Bhat-Gregerson

January 16, 2020

### C. PREPARATION OF INFANT BIRTH-DEATH LINKED DATA, 2017

### I. Data acquisition

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

For the infant linked birth-death data used to create a training data set, I simply read in the CSV files that my office has already prepared for 2016 and 2017.

**2016**

In [2]:
linked16 = pd.read_csv(r'###\Py\Data\InfantDeathY16_BP.csv',
                       index_col = None,
                       low_memory = False)

In [3]:
linked16.shape

(414, 355)

In [96]:
#linked16.head()

**2017**

In [5]:
linked17 = pd.read_csv(r'###\Py\Data\InfantDeathF2017.csv',
                       index_col = None,
                       low_memory = False)

In [6]:
linked17.shape

(363, 449)

In [97]:
#linked17.head()

### II. Data cleaning and standardization

Keep only infant deaths occurring in WA (indicated by birth certificate type 'R') and WA residents.  

In [8]:
WAlnkd16 = linked16[(linked16["Birth Certificate Type"]=="R") & (linked16['Residence State FIPS Code']=="WA")]

WAlnkd16.shape

(372, 355)

In [9]:
WAlnkd17 = linked17[(linked17["Birth Cert Type"]=="R") & (linked17['Residence State FIPS Code']=="WA")]
WAlnkd17.shape

(336, 449)

Using .head() on both infant linked files, it appears that at least one column (the death certificate number) has a different name in the two files.  Instead of renaming columns in both data sets (2016 and 2017) to make them consistent and code-friendly, I will keep only the birth certificate numbers and death certificate numbers from both data sets and append the two vertically. I will then merge in the fields needed for the ML-modeled linkage from the birth and death data sets that I prepared in the previous two steps.

Keep only birth certificate number and death certificate number from 2016 infant birth-death linked file and merge with 2016 birth and death data prepared in earlier steps.  This will ensure that all column names and data types are the same across all data files.

In [10]:
WAlnkd16sfns = WAlnkd16.loc[ : ,['Birth State File Number','Death State File Number' ]]
WAlnkd16sfns.shape

(372, 2)

In [98]:
#WAlnkd16sfns.head()

In [12]:
WAlnkd17sfns = WAlnkd17.loc[:,['Birth State File Number','Death SFN' ]]
WAlnkd17sfns.shape

(336, 2)

After dropping all variables except for the birth certificate and death certificate numbers I renamed them to keep column names the same and then merged the 2016 and 2017 files vertically.

In [13]:
# rename columns
WAlnkd16sfns=WAlnkd16sfns.rename(columns = {"Birth State File Number":"lbsfn", "Death State File Number": "ldsfn"})
WAlnkd17sfns=WAlnkd17sfns.rename(columns = {"Birth State File Number":"lbsfn", "Death SFN": "ldsfn"})

In [14]:
#Append 2017 to 2016 vertically

WAlnkd1617=pd.concat([WAlnkd16sfns, WAlnkd17sfns])
WAlnkd1617.shape


(708, 2)

In [99]:
#WAlnkd1617.head(30)

In [16]:
WAlnkd1617.isnull().sum()

lbsfn    0
ldsfn    0
dtype: int64

#### ADD LINKING VARIABLES

Add back columns that will be needed for machine learning classification later.  These columns were removed from the infant birth-death linked file as a matter of routine practice to protect the identities of the families.

I merge the infant linked file first with the 2016-18 death file on death certificate number and then with the 2016-18 birth file on birth certificate number.  With each merge, I add back columns that were removed during the creation of the linked file for confidentiality reasons.

The first step is to read in death files for 2016-18 and birth files for 2016-18 that I acquired in prior steps.

In [17]:
d1618 = pd.read_csv(r'###\Py\Data\d1618_clean.csv', low_memory=False)
b1618 = pd.read_csv(r'###\Py\Data\b1618_clean.csv', low_memory=False)

In [100]:
#b1618.head()

In [101]:
#d1618.head()

In the first stage I join the infant linked birth-death records for 2016-17 with death certificate data for 2016-18 on death certificate number.

In [20]:
#linked1617a = pd.merge(WAlnkd1617,
                    # d1618[['dbirsfn','dfname','dmname','dlname','dmom_fname',
                            #'dmom_mname','dmom_lname','ddob','ddobm','ddobd','ddoby',
                            #'dbircountryl','dbircountryfips','ddthcityl','ddthcityfips',
                            #'ddthcountyl','ddthcntyfips','drescity','drescityfips',
                            #'drescitylim','drescountyl','drescntyfips', 'dsfn', 'ddthstatel', 
                            #'dresstatefips', 'dsex', 'dagedays', 'ddod', 'ddody']],
                     #how='left',
                     #left_on = "ldsfn",
                    # right_on = "dsfn")

In [None]:
linked1617a = pd.merge(WAlnkd1617, d1618,
                     how='left',
                     left_on = "ldsfn",
                     right_on = "dsfn")

In [102]:
#linked1617a.head()

I check for missing values in death certificate numbers from the death file which would indicate that some of the records in the infant linked file could not be linked to their corresponding death certificates.  In this case, there are no missing death certificate numbers.

In [22]:
np.isnan(linked1617a.dsfn).sum()

0
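As an alternative to counting missing key values after the join, the same check can be sketched with the `indicator` option of `pd.merge`. The toy frames and their values below are made up for illustration; only the column names `ldsfn`, `dsfn`, and `dlname` come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the linked file and the death file (values are made up).
linked = pd.DataFrame({"ldsfn": [2016000001, 2016000002, 2017000003]})
deaths = pd.DataFrame({"dsfn": [2016000001, 2017000003],
                       "dlname": ["DOE", "ROE"]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = pd.merge(linked, deaths, how="left",
                  left_on="ldsfn", right_on="dsfn", indicator=True)

# Rows flagged 'left_only' had no matching death certificate.
n_unlinked = (merged["_merge"] == "left_only").sum()
print(n_unlinked)
```

The `_merge` column also makes it easy to pull out the unlinked rows for inspection before deciding whether to drop them.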

Next, I join the dataset above (infant linked file joined to death certificate information) with the corresponding birth certificate information.  Once again, I check for missing birth certificate numbers (from the birth data set).  There are 53 infant linked records for which the procedure did not find corresponding birth records.

In [23]:
#linked1617b = pd.merge(linked1617a,
                    #b1618[['b_momrescntyfips','b_momrescountyl','b_momrescityfips','b_momrescity',
                           #'bdobd','bdob','bdoby','bmom_lname','bmom_mname','bmom_fname','bfname','bmname',
                           #'blname', 'bsfn', 'b_momresstatefips', 'bsex']],
                   # how = 'left',
                    #left_on = "lbsfn",
                   # right_on = "bsfn")

In [None]:
linked1617b = pd.merge(linked1617a, b1618,
                    how = 'left',
                    left_on = "lbsfn",
                    right_on = "bsfn")

In [24]:
np.isnan(linked1617b.bsfn).sum()

53

Looking at the infant linked records with missing birth certificate information (below), it appears that 50 of the births occurred in 2015, which is outside the timeframe for this project.  Consequently, the birth certificates for these infants are not in the 2016-18 data set. I will drop these records from my training data set.

In [103]:
missbsfn = linked1617b[np.isnan(linked1617b.bsfn)]
missbsfn = missbsfn.sort_values(by=['lbsfn'])
#missbsfn.head(55)

In [26]:
missbsfn.lbsfn.dtype

dtype('int64')

In [27]:
linked1617b = linked1617b[(linked1617b.lbsfn >= 2016000000)]
np.isnan(linked1617b.bsfn).sum()

3
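The filter above relies on the birth certificate number starting with the four-digit birth year. A minimal sketch of making that explicit follows, assuming (as the cutoff 2016000000 suggests) that the number is always ten digits with the year in the leading four. The `birth_year` column is a hypothetical helper, and two of the three values are made up.

```python
import pandas as pd

# Toy birth certificate numbers; the 2015 and 2017 values are made up.
toy = pd.DataFrame({"lbsfn": [2015091234, 2016003213, 2017045678]})

# Assumption: a ten-digit number whose leading four digits are the birth year.
toy["birth_year"] = toy["lbsfn"] // 1_000_000

# Keep only births inside the 2016-18 project window.
in_scope = toy[toy["birth_year"] >= 2016]
print(in_scope)
```

Having the year as its own column makes it simple to tabulate how many unlinked records fall in each birth year.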

In [104]:
missbsfn2 = linked1617b[np.isnan(linked1617b.bsfn)]
#missbsfn2.head()

An examination of the birth records for the three remaining unlinked infants shows that their mothers were residents of other states who gave birth in Washington State.  I excluded women who were residents of other states from the birth data sets, so I will remove these records from this data set as well.

In [29]:
linked1617f = linked1617b[(linked1617b.lbsfn != 2016003213) & (linked1617b.lbsfn != 2016086923) & (linked1617b.lbsfn != 2016073350)]
np.isnan(linked1617f.bsfn).sum()

0
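When the list of certificate numbers to exclude grows, chained `!=` comparisons become hard to read. The sketch below shows the same exclusion with `isin` on a toy frame. The three excluded numbers are the ones used above; the other two values and the names `toy`, `out_of_state`, and `kept` are made up for illustration.

```python
import pandas as pd

# Toy frame: three out-of-state records plus two made-up records to keep.
toy = pd.DataFrame({"lbsfn": [2016003213, 2016000010, 2016086923,
                              2017000020, 2016073350]})

# Certificate numbers of births to mothers resident in other states.
out_of_state = [2016003213, 2016086923, 2016073350]

# ~isin(...) keeps every row whose number is NOT in the exclusion list.
kept = toy[~toy["lbsfn"].isin(out_of_state)]
print(kept)
```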

#### CHECK FOR MISSING VALUES

In [30]:
linked1617f.isnull().sum()

lbsfn                  0
ldsfn                  0
dbirsfn                0
dfname                 0
dmname               103
dlname                 0
dmom_fname             0
dmom_mname           265
dmom_lname             0
ddob                   0
ddobm                  0
ddobd                  0
ddoby                  0
dbircountryl          12
dbircountryfips        0
ddthcityl              0
ddthcityfips           0
ddthcountyl            1
ddthcntyfips           0
drescity               0
drescityfips           0
drescitylim            1
drescountyl            0
drescntyfips           0
dsfn                   0
ddthstatel             0
dresstatefips          0
dsex                   0
dagedays               0
ddod                   0
ddody                  0
b_momrescntyfips       0
b_momrescountyl        0
b_momrescityfips       0
b_momrescity           0
bdobd                  0
bdob                   0
bdoby                  0
bmom_lname             0
bmom_mname            96


- There are very few missing values in the columns except for mother and baby middle names, so this is a relatively clean data set in terms of missingness.

In [31]:
linked1617f.shape

(655, 47)

In this next step I check to make sure the infant's residence state and death state are both Washington State.

In [32]:
linked1617f.dresstatefips.value_counts(dropna=False)

WA    655
Name: dresstatefips, dtype: int64

In [33]:
linked1617f.ddthstatel.value_counts(dropna=False)

WASHINGTON    631
OREGON         20
CALIFORNIA      2
ALBERTA         1
NEWJERSEY       1
Name: ddthstatel, dtype: int64

In [34]:
linked1617f = linked1617f[(linked1617f['ddthstatel']=="WASHINGTON")]
linked1617f.ddthstatel.value_counts(dropna=False)

WASHINGTON    631
Name: ddthstatel, dtype: int64

In [35]:
WAlinked1617 = linked1617f.copy()

In [36]:
WAlinked1617.dtypes

lbsfn                  int64
ldsfn                  int64
dbirsfn               object
dfname                object
dmname                object
dlname                object
dmom_fname            object
dmom_mname            object
dmom_lname            object
ddob                  object
ddobm                  int64
ddobd                  int64
ddoby                  int64
dbircountryl          object
dbircountryfips       object
ddthcityl             object
ddthcityfips         float64
ddthcountyl           object
ddthcntyfips         float64
drescity              object
drescityfips           int64
drescitylim           object
drescountyl           object
drescntyfips           int64
dsfn                   int64
ddthstatel            object
dresstatefips         object
dsex                  object
dagedays               int64
ddod                  object
ddody                  int64
b_momrescntyfips     float64
b_momrescountyl       object
b_momrescityfips     float64
b_momrescity  

#### CHECK FOR NULL AND OUT OF RANGE VALUES

I will probably not use the mothers' and infants' middle names for linking, as too many values are missing.

#### STANDARDIZE NAMES

First, middle, and last names of infants and mothers as well as city names will be standardized by converting these columns to upper case text, removing white spaces, removing hyphens and other punctuation marks.

In [37]:
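# DataFrame.apply passes whole columns (Series), so a type(x) == str test never fires;
# applymap visits each cell, where the string test works as intended
WAlinked1617 = WAlinked1617.applymap(lambda x: x.upper() if type(x) == str else x)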

In [38]:
# strip leading/trailing whitespace cell by cell (apply would pass whole columns and skip the test)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.strip() if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace("-", "") if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace(".", "") if type(x) == str else x)
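The standardization described above can also be sketched column by column with the pandas `.str` accessor, which leaves missing values untouched. This is an illustrative sketch on made-up toy data, not the cell used to produce the results in this notebook. The list `text_cols` is a hypothetical helper naming the text columns to clean.

```python
import pandas as pd

# Toy data with made-up names; ddoby stands in for a numeric column.
toy = pd.DataFrame({
    "dfname": [" Mary-Ann ", "j. r.", None],
    "dlname": ["O Neil", "smith-jones", "Lee"],
    "ddoby": [2016, 2017, 2016],
})

# Only the text columns are standardized; numeric columns are left alone.
text_cols = ["dfname", "dlname"]
for col in text_cols:
    toy[col] = (toy[col]
                .str.upper()                             # upper case
                .str.strip()                             # trim outer whitespace
                .str.replace(r"[ \-.]", "", regex=True)) # drop spaces, hyphens, periods

print(toy)
```

Listing the columns explicitly keeps identifiers and dates out of the cleaning step, at the cost of having to maintain the list.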

#### CREATE FINAL MATCHED DATA SET FOR TRAINING DATA

The final step in creating the matched records in the training data set is to remove the birth and death certificate numbers that came from the infant linked data set and to add a column called 'Match' with a value of 1 to indicate that the record contains correctly matched birth and death records for infants.

There are 631 matched records representing infants who were born and died in Washington State and whose mothers were Washington State residents.

In [39]:
WAlinked1617_m = WAlinked1617.drop(['lbsfn', 'ldsfn'], axis=1)

In [40]:
WAlinked1617_m['Match'] = 1

In [41]:
WAlinked1617_m.dtypes

dbirsfn               object
dfname                object
dmname                object
dlname                object
dmom_fname            object
dmom_mname            object
dmom_lname            object
ddob                  object
ddobm                  int64
ddobd                  int64
ddoby                  int64
dbircountryl          object
dbircountryfips       object
ddthcityl             object
ddthcityfips         float64
ddthcountyl           object
ddthcntyfips         float64
drescity              object
drescityfips           int64
drescitylim           object
drescountyl           object
drescntyfips           int64
dsfn                   int64
ddthstatel            object
dresstatefips         object
dsex                  object
dagedays               int64
ddod                  object
ddody                  int64
b_momrescntyfips     float64
b_momrescountyl       object
b_momrescityfips     float64
b_momrescity          object
bdobd                float64
bdob          

In [42]:
WAlinked1617_m.Match.value_counts(dropna=False)

1    631
Name: Match, dtype: int64

In [105]:
#WAlinked1617_m.head()

### Creating examples of mismatched data for training

The 2016-18 death data set contains almost 170,000 deaths and the birth data set almost 260,000 births for the same period, so the number of possible non-matching record pairs is enormous. Using all of them would not help the classifier, as it would create a severe class imbalance.

I will randomly select roughly the same number of records from the birth data set (equally distributed across the 3 years) and the death data set and join them horizontally and label them as mismatched ('Match' = 0).

In [44]:
excl_dc = WAlinked1617_m['dsfn'].tolist()
len(excl_dc)

631

In [45]:
excl_bc = WAlinked1617_m['bsfn'].tolist()
len(excl_bc)

631

In [46]:
death1617_nm = d1618[(~d1618['dsfn'].isin(excl_dc)) & (d1618.ddody != 2018)]

In [48]:
birth1617_nm = b1618[(~b1618['bsfn'].isin(excl_bc)) & (b1618.bdoby != 2018)]

#### Create copies of linked, birth and death data sets that will be used for training

In [67]:
l1617m=WAlinked1617_m.copy()
d1617nm = death1617_nm.copy()
b1617nm = birth1617_nm.copy()

#### Split into training and testing sets before random resampling to balance the classes (matched vs. unmatched)

Here I split each of the three data sets into a training set with 70% of the data and a testing set with 30% of the data.
For training purposes I will randomly undersample from the birth and death data sets to create the unmatched records so that the unmatched class does not overwhelm the matched class.

**Splitting infant linked (matched) data set**

In [95]:
l1617m = l1617m.reset_index()
l_train = l1617m.sample(frac=0.70, random_state=42)
l_test = l1617m.drop(l_train.index)
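The `sample` / `drop` pattern used for the split can be checked on a small example. The sketch below uses a made-up ten-row frame to confirm that the two parts are disjoint and together cover every row. The names `toy`, `train`, and `test` are illustrative only.

```python
import pandas as pd

# Ten made-up records standing in for one of the data sets.
toy = pd.DataFrame({"dsfn": range(100, 110)})

# Draw 70% of the rows at random for training; the fixed seed makes it repeatable.
train = toy.sample(frac=0.70, random_state=42)

# Dropping the sampled index labels leaves the remaining 30% for testing.
test = toy.drop(train.index)

# No row appears in both parts, and no row is lost.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

This pattern depends on the index labels being unique, since `drop` removes rows by label.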

**Splitting death (unmatched) data set**

In [71]:
d_train = d1617nm.sample(frac=0.70, random_state=42)
d_test = d1617nm.drop(d_train.index)
d_train.shape

(77807, 52)

**Splitting birth (unmatched) data set**

In [72]:
b_train = b1617nm.sample(frac=0.70, random_state=42)
b_test = b1617nm.drop(b_train.index)
b_train.shape

(122233, 43)

**Undersampling birth and death training data subsets**

In [87]:
d_train_undersample = d_train.sample(n=1200, random_state=42).reset_index()
b_train_undersample = b_train.sample(n=1200, random_state=42).reset_index()
d_train_undersample.shape, b_train_undersample.shape

((1200, 53), (1200, 44))

**Joining (horizontally) birth and death undersamples to create mismatched record pairs**

In [123]:
unmatched = pd.concat([b_train_undersample, d_train_undersample], axis=1)
unmatched.shape

(1200, 97)

**Add 'Match' column with value 0 to indicate mismatched status**

In [124]:
unmatched['Match'] = 0
unmatched = unmatched.drop(['index'], axis=1)
unmatched.shape

(1200, 96)
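Horizontal concatenation aligns rows on index labels, which is why both undersamples are re-indexed before being joined. The toy sketch below (made-up values, hypothetical names `births`, `deaths`, `misaligned`, `pairs`) shows what happens without the reset, and a variant using `reset_index(drop=True)` that avoids creating the extra 'index' column in the first place.

```python
import pandas as pd

# Made-up samples that keep their original, non-overlapping index labels.
births = pd.DataFrame({"bsfn": [11, 12, 13, 14], "blname": list("ABCD")},
                      index=[5, 9, 2, 7])
deaths = pd.DataFrame({"dsfn": [21, 22, 23, 24], "dlname": list("WXYZ")},
                      index=[40, 3, 18, 6])

# Without a reset, concat aligns on labels: 8 rows, each half filled with NaN.
misaligned = pd.concat([births, deaths], axis=1)

# Resetting to 0..n-1 pairs the rows by position; drop=True discards old labels.
pairs = pd.concat([births.reset_index(drop=True),
                   deaths.reset_index(drop=True)], axis=1)
pairs["Match"] = 0  # label every random pair as a non-match

print(misaligned.shape, pairs.shape)
```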

In [127]:
#unmatched.head()