### Capstone 1 - Washington state linkage of infant death, birth, and mother's hospitalization discharge data

##### Maya Bhat-Gregerson

 December 31, 2019

### C. PREPARATION OF INFANT BIRTH-DEATH LINKED DATA, 2016-17

### I. Data acquisition

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

For the infant linked birth-death data used to create a training data set, I simply read in the CSV file that has already been prepared by my office.

In [2]:
linked16 = pd.read_csv(r'Y:\DQSS\Death\MBG\Py\Data\InfantDeathF2016.csv',
                       index_col = None,
                       low_memory = False)

In [3]:
linked16.shape

(414, 450)

In [4]:
linked17 = pd.read_csv(r'Y:\DQSS\Death\MBG\Py\Data\InfantDeathF2017.csv',
                       index_col = None,
                       low_memory = False)

In [5]:
linked17.shape

(363, 449)

### II. Data cleaning and standardization

Keep only infant deaths among WA residents. These are indicated by birth certificate type 'R'.  

In [6]:
WAlnkd16 = linked16[(linked16["Birth Cert Type"]=="R")]
WAlnkd16.shape

(389, 450)

In [7]:
WAlnkd17 = linked17[(linked17["Birth Cert Type"]=="R")]
WAlnkd17.shape

(351, 449)
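Filtering with a boolean mask returns a new object that pandas may still treat as derived from the original frame. The following is a minimal sketch, using invented toy rows rather than the real linked file, of taking an explicit `.copy()` of the filtered rows so that later modifications to the subset are unambiguous. The names `toy` and `wa_only` are hypothetical.

```python
import pandas as pd

# Toy stand-in for a linked file; the values are invented for illustration.
toy = pd.DataFrame({
    "Birth Cert Type": ["R", "N", "R", "R"],
    "Sex": ["M", "F", "F", "M"],
})

# Keep WA residents ('R') and take an explicit copy of the result.
wa_only = toy[toy["Birth Cert Type"] == "R"].copy()

# Modifying the copy leaves the original frame untouched.
wa_only["Sex"] = wa_only["Sex"].str.lower()

print(wa_only.shape)
```

Whether the extra copy is worthwhile depends on file size; for files of a few hundred rows the cost is negligible.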

Inspecting both infant linked files with .head() shows that at least one column (the birth certificate number) has a different name in the two files.  The next cells check which other columns have this problem.

In [16]:
# compare column names: list columns present in the 2017 file but not in the 2016 file

l16_cols = WAlnkd16.columns
l17_cols = WAlnkd17.columns

differences = l17_cols.difference(l16_cols)

differences

Index(['Age Years', 'Birth Cert Encrypt', 'Birth Weight Grams',
       'Birth Weight Ounces', 'Birth Weight Pounds',
       'Birthplace State NCHS Code Death', 'Date of Birth Month',
       'Date of Birth Year', 'Date of Death Day', 'Date of Death Month',
       'Date of Death Year', 'Date of Injury Day', 'Date of Injury Month',
       'Date of Injury Year', 'Death SFN', 'Disposition Date Day',
       'Disposition Date Month', 'Disposition Date Year',
       'Father Birthplace Cntry WA Code', 'Father Birthplace State FIPS',
       'Father Race Amer Indian Alaskan', 'Father Race Calculation',
       'Gestational Hypertention', 'Hispanic NCHS Bridge', 'Hysterectomy',
       'Injury HR AMPM', 'Injury State FIPS Code',
       'Mother Birthplace Cntry WA Code', 'Mother Birthplace State FIPS',
       'Mother Hispanic NCHS Ccodes', 'Mother Hispanic NCHS Ecodes',
       'Mother Marital Status', 'Mother Race Amer Indian Alaskan',
       'Mother Race Calculation', 'Prior Live Births Deceased',
 

In [17]:
list(linked17.columns.values)

['Birth Cert Encrypt',
 'Birth Cert Type',
 'Date of Birth Month',
 'Date of Birth Year',
 'Time of Birth',
 'Sex',
 'Plurality',
 'Birth Order',
 'Birthplace County City WA Code',
 'Birthplace County WA Code',
 'Birthplace State',
 'Birthplace State NCHS Code',
 'Birthplace State FIPS Code',
 'Facility Type',
 'Facility',
 'Intended Facility',
 'Mother Transfer',
 'Facility Mother Transferred From',
 'Child Transfer',
 'Facility Infant Transferred To',
 'Attendant Class',
 'Certifier Class',
 'Mother Calculated Age',
 'Mother Residence City WA Code',
 'Mother Residence County WA Code',
 'Mother Residence State',
 'Mother Residence State NCHS Code',
 'Mother Residence State FIPS Code',
 'Mother Residence Zip',
 'Mother Years at Residence',
 'Mother Months at Residence',
 'Mother Birthplace State FIPS',
 'Mother Birthplace Country',
 'Mother Birthplace Cntry WA Code',
 'Child Calculated Race',
 'Child Calculated Ethnicity',
 'Mother Race White',
 'Mother Race Black',
 'Mother Race Amer 

In [18]:
list(linked16.columns.values)

['SFN Encrypt',
 'Birth Cert Type',
 'Date of Birth - Month',
 'Date of Birth - Year',
 'Time of Birth',
 'Sex',
 'Plurality',
 'Birth Order',
 'Birthplace County City WA Code',
 'Birthplace County WA Code',
 'Birthplace State',
 'Birthplace State NCHS Code',
 'Birthplace State FIPS Code',
 'Facility Type',
 'Facility',
 'Intended Facility',
 'Mother Transfer',
 'Facility Mother Transferred From',
 'Child Transfer',
 'Facility Infant Transferred To',
 'Attendant Class',
 'Certifier Class',
 'Mother Calculated Age',
 'Mother Residence City WA Code',
 'Mother Residence County WA Code',
 'Mother Residence State',
 'Mother Residence State NCHS Code',
 'Mother Residence State FIPS Code',
 'Mother Residence Zip',
 'Mother Years at Residence',
 'Mother Months at Residence',
 'Mother Birthplace State Code',
 'Mother Birthplace Country',
 'Mother Birthplace Country Code',
 'Child Calculated Race',
 'Child Calculated Ethnicity',
 'Mother Race White',
 'Mother Race Black',
 'Mother Race AI AN',
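Comparing the two listings by eye is error-prone, so it may help to pair up names that differ only in punctuation. The sketch below is one possible approach, not the method used in this notebook: `normalize` and `compare_columns` are hypothetical helpers, and the short column lists are excerpts copied from the listings above.

```python
import re

def normalize(name):
    """Lower-case a column name and collapse punctuation/whitespace runs."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def compare_columns(cols_a, cols_b):
    """Return (likely renames a->b, names only in a, names only in b)."""
    only_a = set(cols_a) - set(cols_b)
    only_b = set(cols_b) - set(cols_a)
    by_norm = {normalize(c): c for c in only_b}
    # Pair names that become identical once punctuation is ignored.
    pairs = {c: by_norm[normalize(c)]
             for c in sorted(only_a) if normalize(c) in by_norm}
    unmatched_a = sorted(only_a - set(pairs))
    unmatched_b = sorted(only_b - set(pairs.values()))
    return pairs, unmatched_a, unmatched_b

# Excerpts of the 2016 and 2017 column names shown above.
cols16 = ["SFN Encrypt", "Birth Cert Type", "Date of Birth - Month",
          "Date of Birth - Year", "Mother Race AI AN"]
cols17 = ["Birth Cert Encrypt", "Birth Cert Type", "Date of Birth Month",
          "Date of Birth Year"]

pairs, only16, only17 = compare_columns(cols16, cols17)
print(pairs)
print(only16, only17)
```

Names that differ in wording, such as 'SFN Encrypt' and 'Birth Cert Encrypt', are not paired automatically and still need a manual decision.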
 

#### STANDARDIZE VARIABLE NAMES IN 2016 AND 2017 LINKED FILES

- Variable names and types are not consistent between the 2016 and 2017 infant birth-death linked files that will be used as training data.  The first step is to make them consistent for all variables to be used in linkage.

Several variables that are needed for later merges are named differently in the two infant linked files.  I rename selected variables in the 2016 file so that they match the names in the 2017 infant linked file.

In [8]:
# assign the renamed frame back instead of using inplace=True on a filtered frame,
# which triggers pandas' SettingWithCopyWarning
WAlnkd16 = WAlnkd16.rename(columns = {'SFN Encrypt':'Birth Cert Encrypt',
                                      'Date of Birth - Month' : 'Date of Birth Month',
                                      'Date of Birth - Year': 'Date of Birth Year',
                                      'Date of Death - Month' : 'Date of Death Month',
                                      'Date of Death - Day' : 'Date of Death Day',
                                      'Date of Death - Year' : 'Date of Death Year',
                                     })


####  APPEND 2016, 2017 LINKED FILES AND LIMIT DATA TO LINKING VARIABLES

In [10]:
WAlnkd1617 = pd.concat([WAlnkd16, WAlnkd17], sort=True, ignore_index=True)
WAlnkd1617.shape

(740, 481)

Keep the columns that will be used for linkage and drop the rest.  I will re-merge the dropped variables after the machine learning model has been created.

In [11]:
keep  = ['Birth Cert Encrypt','Birth Cert Type','Date of Birth Month', 'Date of Birth Year', 'Sex', 'Birthplace State FIPS Code',
         'Mother Residence State', 'Mother Residence State FIPS Code', 'Mother Residence Zip', 'Death SFN', 'Sex Death',
         'Date of Death', 'Date of Death Month', 'Date of Death Day', 'Date of Death Year', 'Death State', 'Death Zip Code',
         'Birthplace State FIPS Code Death', 'Residence State FIPS Code', 'Residence Zip Code', 'Bridge Race', 'Hispanic No','Manner',
         'Underlying COD Code']

In [12]:
WAlnkd1617 = WAlnkd1617.loc[:,keep]
WAlnkd1617.shape

(740, 24)
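In recent pandas versions, `.loc` with a list that contains an absent label raises a `KeyError`, so it can be useful to report all missing names at once before subsetting. A minimal sketch under that assumption; `select_columns` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd

def select_columns(df, wanted):
    """Return df restricted to `wanted`, naming every column that is absent."""
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df.loc[:, wanted]

# Toy frame with invented values.
toy = pd.DataFrame({"Sex": ["M", "F"],
                    "Manner": ["N", "N"],
                    "Plurality": [1, 2]})

subset = select_columns(toy, ["Sex", "Manner"])
print(subset.shape)
```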

In [15]:
WAlnkd1617.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 24 columns):
Birth Cert Encrypt                  740 non-null int64
Birth Cert Type                     740 non-null object
Date of Birth Month                 740 non-null float64
Date of Birth Year                  740 non-null float64
Sex                                 740 non-null object
Birthplace State FIPS Code          740 non-null object
Mother Residence State              740 non-null object
Mother Residence State FIPS Code    740 non-null object
Mother Residence Zip                740 non-null object
Death SFN                           351 non-null float64
Sex Death                           740 non-null object
Date of Death                       740 non-null object
Date of Death Month                 740 non-null int64
Date of Death Day                   740 non-null int64
Date of Death Year                  740 non-null int64
Death State                         715 non-null object


#### CHECK FOR MISSING VALUES

In [17]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

WAlnkd1617.isnull().sum()

Birth Cert Encrypt                    0
Birth Cert Type                       0
Date of Birth Month                   0
Date of Birth Year                    0
Sex                                   0
Birthplace State FIPS Code            0
Mother Residence State                0
Mother Residence State FIPS Code      0
Mother Residence Zip                  0
Death SFN                           389
Sex Death                             0
Date of Death                         0
Date of Death Month                   0
Date of Death Day                     0
Date of Death Year                    0
Death State                          25
Death Zip Code                       34
Birthplace State FIPS Code Death      0
Residence State FIPS Code             0
Residence Zip Code                    1
Bridge Race                           4
Hispanic No                           0
Manner                                4
Underlying COD Code                   6
dtype: int64

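- Apart from 'Death SFN', the columns have very few missing values, so this is a relatively clean data set in terms of missingness.  The 389 missing 'Death SFN' values equal the number of 2016 records, which is consistent with that column appearing only in the 2017 file.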
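One way to see where missing values come from is to tabulate them by source file. The sketch below uses toy frames in which one column exists in only one file; the `source_year` column and all values are invented for illustration.

```python
import pandas as pd

# Toy frames: 'Death SFN' exists only in the second one.
df16 = pd.DataFrame({"Birth Cert Encrypt": [1, 2],
                     "Death State": ["WASHINGTON", None]})
df17 = pd.DataFrame({"Birth Cert Encrypt": [3, 4, 5],
                     "Death State": ["WASHINGTON", "OREGON", "WASHINGTON"],
                     "Death SFN": [2017000001.0, 2017000002.0, 2017000003.0]})

# Tag each row with its source before appending (hypothetical helper column).
combined = pd.concat([df16.assign(source_year=2016),
                      df17.assign(source_year=2017)],
                     sort=True, ignore_index=True)

# Count missing values per column, separately for each source file.
missing_by_year = (combined.drop(columns="source_year")
                           .isnull()
                           .groupby(combined["source_year"])
                           .sum())
print(missing_by_year)
```

A column that is entirely missing for one source year usually points to a naming or availability difference between files rather than to incomplete records.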

#### STANDARDIZE BIRTH AND DEATH CERTIFICATE NUMBERS

In the infant linked file, check the data types of the birth and death certificate numbers ('Birth Cert Encrypt' and 'Death SFN') to make sure they match the types used in the birth and death files.  I join on these numbers to bring in the additional columns needed to complete the linked file data set.

In [30]:
WAlnkd1617['Birth Cert Encrypt'].dtypes

dtype('int64')

In [31]:
WAlnkd1617['Death SFN'].dtypes

dtype('float64')

- Both variables are consistent with their counterparts in the birth and death files in terms of data types.  'Death SFN' is float64, a type that can hold its missing values.
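A NumPy-backed integer column cannot hold missing values, which is why certificate numbers with gaps are stored as floats. As a hedged alternative, not something this notebook does, pandas' nullable `Int64` dtype keeps such keys as whole numbers. The certificate numbers below are invented.

```python
import numpy as np
import pandas as pd

# Invented certificate numbers with one missing value.
death_sfn = pd.Series([2017000001.0, np.nan, 2017000003.0], name="Death SFN")

# Nullable integer dtype: whole numbers plus <NA> for missing entries.
as_nullable = death_sfn.astype("Int64")

print(death_sfn.dtype, "->", as_nullable.dtype)
print(as_nullable)
```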

#### ADD LINKING VARIABLES

Add back columns that will be needed for machine learning classification later.  These columns were removed from the infant birth-death linked file as a matter of routine practice to protect the identities of the families.

I merge the infant linked file first with the 2016-18 death file on death certificate number and then with the 2016-18 birth file on birth certificate number.  With each merge, I add back columns that were removed during the creation of the linked file for confidentiality reasons.

In [19]:
d1618 = pd.read_csv(r'Y:\DQSS\Death\MBG\Py\Data\d1618_clean.csv', low_memory=False)
b1618 = pd.read_csv(r'Y:\DQSS\Death\MBG\Py\Data\b1618_clean.csv', low_memory=False)

In [20]:
WAlnkd1617a = pd.merge(WAlnkd1617,
                       d1618[['dbirsfn','dssn','dfname','dmname','dlname','dmom_fname',
                              'dmom_mname','dmom_lname','ddob','ddobm','ddobd','ddoby',
                              'dbircountryl','dbircountryfips','ddthcityl','ddthcityfips',
                              'ddthcountyl','ddthcntyfips','drescity','drescityfips',
                              'drescitylim','drescountyl','drescntyfips', 'dsfn']],
                       how='left',
                       left_on = "Death SFN",
                       right_on = "dsfn")

In [25]:
WAlnkd1618b = pd.merge(WAlnkd1617a,
                    b1618[['b_momrescntyfips','b_momrescountyl','b_momrescityfips','b_momrescity',
                           'bdobd','bdob','bmom_lname','bmom_mname','bmom_fname','bfname','bmname',
                           'blname', 'bsfn', 'b_momresstatefips']],
                    how = 'left',
                    left_on = "Birth Cert Encrypt",
                    right_on = "bsfn")

In [26]:
WAlnkd1618b.shape

(740, 62)
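An unchanged row count after a left merge shows that no keys were duplicated on the right, but it does not show how many rows actually found a match. The sketch below, on invented toy frames, uses the `validate` and `indicator` parameters of `pd.merge` to check both; the frame names `linked` and `deaths` are hypothetical.

```python
import pandas as pd

# Toy linked file and toy death file; all values are invented.
linked = pd.DataFrame({"Death SFN": [101.0, 102.0, 999.0],
                       "Sex": ["M", "F", "F"]})
deaths = pd.DataFrame({"dsfn": [101, 102, 103],
                       "dlname": ["A", "B", "C"]})

merged = pd.merge(linked, deaths,
                  how="left",
                  left_on="Death SFN",
                  right_on="dsfn",
                  validate="many_to_one",  # raises if 'dsfn' is not unique
                  indicator=True)          # adds a '_merge' column

# 'both' = matched, 'left_only' = no matching death record found.
print(merged["_merge"].value_counts())
```

A merged column that comes back entirely NaN can then be traced either to unmatched keys (`left_only` rows) or to a source column that is itself empty.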

In [27]:
WAlnkd1618b.dtypes

Birth Cert Encrypt                    int64
Birth Cert Type                      object
Date of Birth Month                 float64
Date of Birth Year                  float64
Sex                                  object
Birthplace State FIPS Code           object
Mother Residence State               object
Mother Residence State FIPS Code     object
Mother Residence Zip                 object
Death SFN                           float64
Sex Death                            object
Date of Death                        object
Date of Death Month                   int64
Date of Death Day                     int64
Date of Death Year                    int64
Death State                          object
Death Zip Code                      float64
Birthplace State FIPS Code Death     object
Residence State FIPS Code            object
Residence Zip Code                   object
Bridge Race                         float64
Hispanic No                          object
Manner                          

In [24]:
WAlnkd1618b['Death State'].value_counts(dropna=False)

WASHINGTON    714
NaN            25
OREGON          1
Name: Death State, dtype: int64

In [29]:
WAlnkd1618b['b_momresstatefips'].value_counts(dropna=False)

NaN    740
Name: b_momresstatefips, dtype: int64

#### CHECK FOR NULL AND OUT OF RANGE VALUES

I will probably not use the mothers' and infants' middle names for linking because they have too many missing values.

#### STANDARDIZE NAMES

First, middle, and last names of infants and mothers, as well as city names, will be standardized by converting these columns to upper case and removing white space, hyphens, and other punctuation marks.

In [52]:
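# applymap works cell by cell; DataFrame.apply passes whole columns (Series),
# so a type(x) == str test would never be true and nothing would change
infdth18 = infdth18.applymap(lambda x: x.upper() if type(x) == str else x)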

In [54]:
infdth18 = infdth18.applymap(lambda x: x.strip() if type(x) == str else x)
infdth18 = infdth18.applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
infdth18 = infdth18.applymap(lambda x: x.replace("-", "") if type(x) == str else x)
infdth18 = infdth18.applymap(lambda x: x.replace(".", "") if type(x) == str else x)
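The standardization steps above can also be written with vectorized string methods applied only to the name columns. This is a minimal sketch of that alternative, not the notebook's own code: `standardize_text` is a hypothetical helper and the sample names are invented.

```python
import pandas as pd

def standardize_text(series):
    """Upper-case a text column and drop whitespace and punctuation."""
    # [\W_]+ matches runs of anything that is not a letter or digit,
    # so spaces, hyphens, periods and apostrophes are all removed.
    return series.str.upper().str.replace(r"[\W_]+", "", regex=True)

# Invented sample names, including a missing value.
names = pd.Series(["  Mary-Ann ", "O'Neil Jr.", None, "de la Cruz"])
clean = standardize_text(names)
print(clean.tolist())
```

Missing values pass through unchanged, so the same function can be applied column by column to first names, last names, and city names.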

#### CREATE SUBSET CONSISTING OF LINKING VARIABLES IN DEATH DATA

In [60]:
# create subset of death linking variables: decedent's first and last names, mother's first and last
# names, decedent's date of birth, decedent's sex, residence county and city, and decedent's SSN.

dthlinkvars = infdth18.loc[:,['sfn', 'birthsfn', 'ssn', 'fname', 'lname', 'mom_fname', 'mom_lname', 
                              'dobm', 'dobd', 'doby', 'dob', 'sex', 'rcityfips', 'rcity', 'rcntyfips', 
                              'rcountyl', 'dstatel']]

dthlinkvars.reset_index(drop=True, inplace=True)

In [61]:
## CHECK FOR OUT OF RANGE VALUES - RESTRICT TO VARIABLES THAT WILL BE USED FOR LINKING AT THIS POINT

In [62]:
# create a dictionary of valid values so that each variable can be checked for
# out-of-range values.

valids = {'sex': ['M', 'F', 'U'],
          'dobm': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 99],
          'dobd': np.r_[1:32, 99],
          'doby': [2017, 2018],
          'rcntyfips': np.r_[range(1, 78, 2), 99]}

In [63]:
# check for out of range values for 'sex'

chksex = dthlinkvars['sex'].isin(valids['sex'])
len(dthlinkvars[~chksex])

0

In [64]:
# check for out of range values for 'dobm'

chkdobm = dthlinkvars['dobm'].isin(valids['dobm'])
len(dthlinkvars[~chkdobm])

0

In [65]:
# check for out of range values for 'doby'

chkdoby = dthlinkvars['doby'].isin(valids['doby'])
len(dthlinkvars[~chkdoby])

0

In [66]:
# check for out of range values for 'dobd'

chkdobd = dthlinkvars['dobd'].isin(valids['dobd'])
len(dthlinkvars[~chkdobd])


0

In [67]:
# check for out of range values for 'rcntyfips'

chkrcounty = dthlinkvars['rcntyfips'].isin(valids['rcntyfips'])
len(dthlinkvars[~chkrcounty])

2

In [68]:
rcntyerrors = dthlinkvars[~chkrcounty][['sfn', 'rcntyfips', 'rcountyl', 'dstatel']]

rcntyerrors

            sfn  rcntyfips rcountyl     dstatel
262  2018043748        999  UNKNOWN  WASHINGTON
347  2018057560        999  UNKNOWN  WASHINGTON


In [69]:
dthlinkvars.loc[dthlinkvars.sfn == 2018057560, 'rcountyl'] ="SNOHOMISH"
dthlinkvars.loc[dthlinkvars.sfn == 2018057560, 'rcntyfips'] = 61
dthlinkvars[dthlinkvars.sfn == 2018057560]['rcntyfips']

347    61
Name: rcntyfips, dtype: int64

In [70]:
# repeat the check for out of range residence county values

chkrcounty = dthlinkvars['rcntyfips'].isin(valids['rcntyfips'])
rcntyerrors = dthlinkvars[~chkrcounty][['sfn', 'rcntyfips', 'rcountyl', 'dstatel']]

rcntyerrors

            sfn  rcntyfips rcountyl     dstatel
262  2018043748        999  UNKNOWN  WASHINGTON


After looking up the record with unknown residence county values in the SQL database I found that there is no additional information available to correct this record.  Linkage of this record will rely on other variables with complete information.
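The per-variable range checks in this section follow the same pattern, so they could also be run in one pass over the dictionary of valid values. A minimal sketch on an invented toy frame; `out_of_range`, `toy`, and `toy_valids` are hypothetical names.

```python
import pandas as pd

def out_of_range(df, valid_values):
    """Return {column: rows whose value is not in that column's allowed list}."""
    problems = {}
    for col, allowed in valid_values.items():
        bad = df.loc[~df[col].isin(allowed)]
        if not bad.empty:
            problems[col] = bad
    return problems

# Toy data with one bad 'sex' value and one bad county code.
toy = pd.DataFrame({"sex": ["M", "F", "X"],
                    "dobm": [1, 12, 99],
                    "rcntyfips": [61, 999, 33]})
toy_valids = {"sex": ["M", "F", "U"],
              "dobm": list(range(1, 13)) + [99],
              "rcntyfips": list(range(1, 78, 2)) + [99]}

problems = out_of_range(toy, toy_valids)
for col, rows in problems.items():
    print(col, rows.index.tolist())
```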

In [179]:
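# convert age into age in days; 'agetype' gives the unit of 'age'
# (implied by the conversions below: 2 = months, 3 = days, 4 = hours, 5 = minutes)

def agetodays(x):
    if x['agetype']==2:
        return x['age']*30        # months, approximated as 30 days each
    elif x['agetype']==3:
        return x['age']           # already in days
    elif x['agetype']==4:
        return x['age']/24        # hours
    elif x['agetype']==5:
        return x['age']/(60*24)   # minutes

# astype(int) truncates, so ages under one day become 0
infdth18['agedays'] = np.array(infdth18.apply(agetodays, axis=1)).astype(int)
infdth18[['agedays', 'agetype', 'age']].head(10)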



   agedays  agetype  age
0       60      2.0    2
1       60      2.0    2
2        2      3.0    2
3        1      3.0    1
4        0      4.0    1
5        2      3.0    2
6        0      4.0    1
7        0      4.0    2
8        0      5.0   37
9        0      4.0    1
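The row-wise conversion can also be expressed column-wise. The sketch below assumes the unit codes implied by `agetodays` (2 = months, 3 = days, 4 = hours, 5 = minutes) and, unlike the original, returns a missing value for any other code; `age_in_days` is a hypothetical helper and the sample values are invented.

```python
import numpy as np
import pandas as pd

def age_in_days(age, agetype):
    """Convert age to whole days; unrecognized unit codes give <NA>."""
    conditions = [agetype == 2, agetype == 3, agetype == 4, agetype == 5]
    choices = [age * 30,          # months, approximated as 30 days
               age,               # days
               age / 24,          # hours
               age / (60 * 24)]   # minutes
    days = np.select(conditions, choices, default=np.nan)
    # Round down to whole days and keep missing values as <NA>.
    return pd.Series(np.floor(days), index=age.index).astype("Int64")

# Invented sample; the last row has a unit code the mapping does not cover.
sample = pd.DataFrame({"age": [2, 2, 1, 37, 5],
                       "agetype": [2.0, 3.0, 4.0, 5.0, 1.0]})
sample["agedays"] = age_in_days(sample["age"], sample["agetype"])
print(sample)
```

Returning a missing value for unexpected codes makes such records visible instead of failing during the integer conversion.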
