### Capstone 1 - Washington state linkage of infant death, birth, and mother's hospitalization discharge data

##### Maya Bhat-Gregerson

January 16, 2020

### C. PREPARATION OF INFANT BIRTH-DEATH LINKED DATA, 2016-2017

### I. Data acquisition

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

For the infant linked birth-death data used to create a training data set, I read in the CSV files that my office has already prepared for 2016 and 2017.

**2016**

In [2]:
linked16 = pd.read_csv(r'###\Data\InfantDeathY16_BP.csv',
                       index_col = None,
                       low_memory = False)

In [3]:
linked16.shape

(414, 355)

In [5]:
#linked16.head()

**2017**

In [6]:
linked17 = pd.read_csv(r'###\InfantDeathF2017.csv',
                       index_col = None,
                       low_memory = False)

In [7]:
linked17.shape

(363, 449)

In [46]:
#linked17.head()

### II. Data cleaning and standardization

Keep only records with birth certificate type "R" for infants who were born in WA, died in WA, and were WA residents.

In [9]:
WAlnkd16 = linked16.loc[(linked16["Birth Certificate Type"]=="R") 
                    & (linked16['Residence State FIPS Code']=="WA")
                    & (linked16['Birthplace State']=="WASHINGTON")
                    & (linked16['Death State']=="WASHINGTON")]

WAlnkd16.shape

(368, 355)

In [10]:
WAlnkd16['Residence State FIPS Code'].value_counts(dropna = False),WAlnkd16['Birthplace State'].value_counts(dropna = False),WAlnkd16['Death State'].value_counts(dropna = False)

(WA    368
 Name: Residence State FIPS Code, dtype: int64,
 WASHINGTON    368
 Name: Birthplace State, dtype: int64,
 WASHINGTON    368
 Name: Death State, dtype: int64)

Looking at the birth years in the infant linked records (below), 50 of the births occurred in 2015, which is outside the timeframe for this project. The birth certificates for these infants would not be in the 2016-18 birth data set, so I drop these records from my training data set.

In [11]:
WAlnkd16['Date of Birth Year'].value_counts(dropna = False)

2016    318
2015     50
Name: Date of Birth Year, dtype: int64

In [12]:
WAlnkd16 = WAlnkd16[(WAlnkd16['Birth State File Number'] >= 2016000000)]
WAlnkd16['Date of Birth Year'].value_counts(dropna = False)

2016    318
Name: Date of Birth Year, dtype: int64

In [13]:
WAlnkd17 = linked17.loc[(linked17["Birth Cert Type"]=="R") 
                    & (linked17['Residence State FIPS Code']=="WA")
                    & (linked17['Birthplace State']=="WASHINGTON")
                    & (linked17['Death State']=="WASHINGTON")]
WAlnkd17.shape

(316, 449)

In [14]:
WAlnkd17['Residence State FIPS Code'].value_counts(dropna = False),WAlnkd17['Birthplace State'].value_counts(dropna = False),WAlnkd17['Death State'].value_counts(dropna = False)

(WA    316
 Name: Residence State FIPS Code, dtype: int64,
 WASHINGTON    316
 Name: Birthplace State, dtype: int64,
 WASHINGTON    316
 Name: Death State, dtype: int64)

In [15]:
WAlnkd17['Date of Birth Year'].value_counts(dropna = False)

2017.0    263
2016.0     53
Name: Date of Birth Year, dtype: int64
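The same four-condition filter is applied to each year; only the name of the certificate type column differs ("Birth Certificate Type" in 2016, "Birth Cert Type" in 2017). As a minimal sketch, the filter could be wrapped in a helper so that both years use identical logic. The function name `keep_wa_records` and the toy rows (including the type value "N") are hypothetical and only illustrate the pattern.

```python
import pandas as pd

def keep_wa_records(df, cert_type_col):
    """Keep type "R" records with WA residence, WA birthplace, and WA death state."""
    mask = (
        (df[cert_type_col] == "R")
        & (df["Residence State FIPS Code"] == "WA")
        & (df["Birthplace State"] == "WASHINGTON")
        & (df["Death State"] == "WASHINGTON")
    )
    # .copy() avoids SettingWithCopyWarning if the result is modified later
    return df.loc[mask].copy()

# Toy data (made up): only the first row satisfies all four conditions
toy = pd.DataFrame({
    "Birth Certificate Type": ["R", "R", "N"],
    "Residence State FIPS Code": ["WA", "OR", "WA"],
    "Birthplace State": ["WASHINGTON", "WASHINGTON", "WASHINGTON"],
    "Death State": ["WASHINGTON", "WASHINGTON", "WASHINGTON"],
})
toy_wa = keep_wa_records(toy, "Birth Certificate Type")
```

With the real files, the call for 2017 would pass "Birth Cert Type" as the second argument.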

Using .head() on both infant linked files, it appears that at least one column has a different name in the two files (for example, the death certificate number is "Death State File Number" in 2016 and "Death SFN" in 2017). Instead of renaming every column in both data sets to make them consistent and code-friendly, I keep only the birth and death certificate numbers from each data set and append the two vertically. I then merge in the fields needed for ML modeled linkage from the birth and death data sets that I prepared in the previous two steps.

Keep only birth certificate number and death certificate number from 2016 infant birth-death linked file and merge with 2016 birth and death data prepared in earlier steps.  This will ensure that all column names and data types are the same across all data files.

I rename the birth and death certificate number columns so that they match across years, keep only those two columns, and then append the 2017 file to the 2016 file vertically.

In [16]:
# rename columns
WAlnkd16=WAlnkd16.rename(columns = {"Birth State File Number":"lbsfn", "Death State File Number": "ldsfn"})
WAlnkd17=WAlnkd17.rename(columns = {"Birth State File Number":"lbsfn", "Death SFN": "ldsfn"})

In [17]:
WAlnkd16sfns = WAlnkd16.loc[ : ,['lbsfn','ldsfn' ]]
WAlnkd16sfns.shape

(318, 2)

In [18]:
WAlnkd17sfns = WAlnkd17.loc[:,['lbsfn','ldsfn']]
WAlnkd17sfns.shape

(316, 2)

In [19]:
#Append 2017 to 2016 vertically

WAlnkd1617sfns=pd.concat([WAlnkd16sfns, WAlnkd17sfns])
WAlnkd1617sfns.shape


(634, 2)
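The rename, select, and append steps above can be sketched on toy data as follows. The file numbers and the extra columns are made up. Unlike the cell above, this sketch passes `ignore_index=True` so the stacked frame gets a fresh 0..n-1 index, and it adds a duplicate check on the key pair; both are assumptions about what might be useful, not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins for the two yearly linked files (values are made up)
y16 = pd.DataFrame({
    "Birth State File Number": [2016000001],
    "Death State File Number": [2016900001],
    "Other": ["x"],
})
y17 = pd.DataFrame({
    "Birth State File Number": [2017000001, 2016000002],
    "Death SFN": [2017900001, 2017900002],
    "Extra": ["y", "z"],
})

KEYS = ["lbsfn", "ldsfn"]

# Harmonize the key names, then keep only the two certificate numbers
y16_keys = y16.rename(columns={"Birth State File Number": "lbsfn",
                               "Death State File Number": "ldsfn"})[KEYS]
y17_keys = y17.rename(columns={"Birth State File Number": "lbsfn",
                               "Death SFN": "ldsfn"})[KEYS]

# Stack the years; ignore_index=True renumbers the rows from 0
stacked = pd.concat([y16_keys, y17_keys], ignore_index=True)

# Count birth-death pairs that appear more than once across the years
n_duplicates = int(stacked.duplicated(subset=KEYS).sum())
```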

#### ADD LINKING VARIABLES

Add back columns that will be needed for machine learning classification later.  These columns were removed from the infant birth-death linked file as a matter of routine practice to protect the identities of the families.

I merge the infant linked file first with the death 2016-18 file on death certificate number and then with the birth 2016-18 using birth certificate number.  With each merge, I add back columns that were removed during the creation of the linked file for confidentiality reasons.

The first step is to read in death files for 2016-18 and birth files for 2016-18 that I acquired in prior steps.

In [20]:
d1618 = pd.read_csv(r'###\Data\d1618_clean.csv', low_memory=False)
b1618 = pd.read_csv(r'###\Data\b1618_clean.csv', low_memory=False)

In [21]:
WAlnkd1617 = WAlnkd1617sfns.copy()

In the first stage I join the infant linked birth-death records for 2016-17 with death certificate data for 2016-18 on death certificate number.

In [22]:
linked1617a = pd.merge(WAlnkd1617,
                    d1618[['dfname','dmname','dlname','dmom_fname',
                            'dmom_mname','dmom_maiden','ddad_fname',
                            'ddad_mname','ddad_lname','ddthcityl', 'ddthcountyl',
                           'drescity', 'drescountyl','drescntyfips', 'dsfn', 'ddthstatel', 
                            'dresstatefips', 'dsex', 'ddodm', 'ddod', 'ddody']],
                     how='left',
                     left_on = "ldsfn",
                     right_on = "dsfn")

In [24]:
#linked1617a.head()

I check for missing values in death certificate numbers from the death file which would indicate that some of the records in the infant linked file could not be linked to their corresponding death certificates.  In this case, there are no missing death certificate numbers.

In [25]:
np.isnan(linked1617a.dsfn).sum()

0

Next, I join the dataset above (infant linked file joined to death certificate information) with the corresponding birth certificate information.  Once again, I check for missing birth certificate numbers (from the birth data set).  There are 3 infant linked records for which the merge did not find corresponding birth records.

In [26]:
linked1617b = pd.merge(linked1617a,
                    b1618[['b_momrescntyfips','b_momrescountyl','b_momrescityfips','b_momrescity',
                           'bdobd','bdob','bdoby','bmom_lname','bmom_maiden','bmom_mname',
                           'bmom_fname', 'bfname','bmname','blname', 'bsfn', 'bdad_lname',
                           'bdad_mname', 'bdad_fname','b_momresstatefips', 'bsex']],
                   how = 'left',
                   left_on = "lbsfn",
                   right_on = "bsfn")

In [28]:
np.isnan(linked1617b.bsfn).sum()

3
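An alternative way to count unmatched records, sketched here on toy data, is to let `merge` report the match status directly through its `indicator` parameter. The `validate="many_to_one"` argument additionally raises an error if a certificate number appears more than once in the right-hand table. The toy frames and their values are made up.

```python
import pandas as pd

# Toy linked file and toy birth file; lbsfn 2 has no birth record
linked = pd.DataFrame({"lbsfn": [1, 2, 3]})
births = pd.DataFrame({"bsfn": [1, 3], "bfname": ["A", "B"]})

merged = linked.merge(
    births,
    how="left",
    left_on="lbsfn",
    right_on="bsfn",
    indicator=True,          # adds a _merge column: both / left_only / right_only
    validate="many_to_one",  # fails if bsfn is not unique in the birth file
)

# Linked records with no corresponding birth record
unmatched = merged.loc[merged["_merge"] == "left_only", "lbsfn"].tolist()
```

Because the check is based on `_merge` rather than on `np.isnan`, it does not depend on the key column being numeric.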

Check to see why there are missing birth certificate numbers.

In [47]:
missbsfn = linked1617b[np.isnan(linked1617b.bsfn)]
missbsfn = missbsfn.sort_values(by=['lbsfn'])
#missbsfn.head(55)

In [30]:
missbsfn.lbsfn.dtype

dtype('int64')

An examination of the birth records shows that the mothers were residents of other states who gave birth in Washington State.  I excluded women who were residents of other states from the birth data sets so I will remove these records from this data set also.

In [31]:
linked1617f = linked1617b[(linked1617b.lbsfn != 2016003213) & (linked1617b.lbsfn != 2016086923) 
                          & (linked1617b.lbsfn != 2016073350)] 
np.isnan(linked1617f.bsfn).sum()

0
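The cell above excludes the three records by their certificate numbers. As a hedged sketch on made-up data, the same rows could be excluded with `isin` on a list of numbers, or without hard-coding any numbers by keeping only rows where the merged `bsfn` is present. The two approaches agree only when every record with a missing `bsfn` is meant to be dropped, as is the case here.

```python
import pandas as pd

# Toy merged frame: the record with lbsfn 11 found no birth certificate
merged = pd.DataFrame({
    "lbsfn": [10, 11, 12],
    "bsfn": [10.0, float("nan"), 12.0],
})

# Option 1: exclude an explicit list of certificate numbers
exclude = [11]
by_list = merged[~merged["lbsfn"].isin(exclude)]

# Option 2: keep only rows where the birth file supplied a certificate number
by_missing = merged[merged["bsfn"].notna()]
```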

#### CHECK FOR MISSING VALUES

In [32]:
linked1617f.isnull().sum()

lbsfn                  0
ldsfn                  0
dfname                 0
dmname               100
dlname                 0
dmom_fname             0
dmom_mname           256
dmom_maiden            0
ddad_fname             0
ddad_mname           287
ddad_lname             0
ddthcityl              0
ddthcountyl            0
drescity               0
drescountyl            0
drescntyfips           0
dsfn                   0
ddthstatel             0
dresstatefips          0
dsex                   0
ddodm                  0
ddod                   0
ddody                  0
b_momrescntyfips       1
b_momrescountyl        0
b_momrescityfips       0
b_momrescity           0
bdobd                  0
bdob                   0
bdoby                  0
bmom_lname             0
bmom_maiden            0
bmom_mname            89
bmom_fname             0
bfname                 1
bmname               103
blname                 0
bsfn                   0
bdad_lname           176
bdad_mname           265


- There are very few missing values in the columns except for mother, father, and baby middle names and the father's last name on the birth record (176 missing), so this is a relatively clean data set in terms of missingness.
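To make the missingness easier to read, the counts can be paired with percentages and limited to the columns that have at least one missing value. This is a minimal sketch on a made-up frame; the column names are borrowed from the data set but the values are not.

```python
import pandas as pd

# Toy frame: dmname is missing in 3 of 4 rows, dfname is complete
toy = pd.DataFrame({
    "dfname": ["A", "B", "C", "D"],
    "dmname": [None, "X", None, None],
})

missing = toy.isnull().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Show only columns with missing values, most affected first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
```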

In [33]:
linked1617f.shape

(631, 43)

In this next step I check to make sure the infant's residence state and death state are both Washington State.

In [34]:
linked1617f.dresstatefips.value_counts(dropna=False)

WA    631
Name: dresstatefips, dtype: int64

In [35]:
linked1617f.ddthstatel.value_counts(dropna=False)

WASHINGTON    631
Name: ddthstatel, dtype: int64

In [36]:
WAlinked1617 = linked1617f.copy()

In [37]:
WAlinked1617.dtypes

lbsfn                  int64
ldsfn                  int64
dfname                object
dmname                object
dlname                object
dmom_fname            object
dmom_mname            object
dmom_maiden           object
ddad_fname            object
ddad_mname            object
ddad_lname            object
ddthcityl             object
ddthcountyl           object
drescity              object
drescountyl           object
drescntyfips           int64
dsfn                   int64
ddthstatel            object
dresstatefips         object
dsex                  object
ddodm                  int64
ddod                  object
ddody                  int64
b_momrescntyfips     float64
b_momrescountyl       object
b_momrescityfips     float64
b_momrescity          object
bdobd                float64
bdob                  object
bdoby                float64
bmom_lname            object
bmom_maiden           object
bmom_mname            object
bmom_fname            object
bfname        

#### STANDARDIZE NAMES

First, middle, and last names of infants, mothers, and fathers, as well as city names, are standardized by converting the text to upper case and removing white space, hyphens, and periods.

In [38]:
# applymap operates cell by cell, so the str check sees each value rather than a whole column
WAlinked1617 = WAlinked1617.applymap(lambda x: x.upper() if type(x) == str else x)

In [39]:
WAlinked1617 = WAlinked1617.applymap(lambda x: x.strip() if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace("-", "") if type(x) == str else x)
WAlinked1617 = WAlinked1617.applymap(lambda x: x.replace(".", "") if type(x) == str else x)
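The standardization steps can also be collected into one function, sketched below. Note two assumptions: the function name `standardize_name` is hypothetical, and it strips all ASCII punctuation (via `string.punctuation`), which is broader than the space, hyphen, and period removed in the cells above.

```python
import string

# Translation table that deletes ASCII punctuation and whitespace characters
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)

def standardize_name(value):
    """Upper-case a string and remove punctuation and whitespace.

    Non-string values (for example NaN or numbers) are returned unchanged.
    """
    if isinstance(value, str):
        return value.upper().translate(_DROP)
    return value

example = standardize_name(" Mary-Ann O'Neil Jr. ")
```

A function like this could be passed to `Series.map` for the name and city columns only. Applying it to every text column would also alter other object columns, such as `ddod` and `bdob`, which are stored as text.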

#### CREATE FINAL LABELLED DATA SET FOR CLASSIFIER TRAINING

The final step in creating the matched records in the training data set is to build a label table that keeps only the birth and death certificate numbers that came from the infant linked data set and adds a column called 'Match' with a value of 1 to indicate that the record contains correctly matched birth and death records for infants.

There are 631 matched records representing infants who were born and died in Washington State and whose mothers were Washington State residents.

In [40]:
WAlinked1617_m = WAlinked1617.loc[:,['lbsfn', 'ldsfn']]

In [41]:
WAlinked1617_m['Match'] = 1

In [42]:
WAlinked1617_m.dtypes

lbsfn    int64
ldsfn    int64
Match    int64
dtype: object

In [43]:
WAlinked1617_m.Match.value_counts(dropna=False)

1    631
Name: Match, dtype: int64

In [44]:
WAlinked1617_m.to_csv(r'###\WAlinked1617_labels.csv', index=None, header=True)

In [45]:
WAlinked1617.to_csv(r'###\WAlinked1617_features.csv', index=None, header=True)