### Capstone 1 - Washington state linkage of infant death, birth, and mother's hospitalization discharge data

##### Maya Bhat-Gregerson

January 07, 2020

### A. PREPARATION OF BIRTH DATA, 2016-2018

### I. Data acquisition

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

pd.set_option('display.max_columns', None)

I use a SQL query to retrieve the birth variables I am likely to need for linking records for 2016 through 2018. The query keeps records for infants who were born in Washington State or whose mothers were Washington State residents.

NOTE: I renamed all variables from birth records so that they begin with 'b' to distinguish the fields from those in death records with the same names.

In [6]:
## CONNECT TO WHALES & USE SQL QUERY FOR BIRTH DATA SET

# ODBC driver used for the connection (must be installed on the local machine)
driver = '{ODBC Driver 13 for SQL Server}'

conn = pyodbc.connect(
        Trusted_Connection='Yes',
        Driver=driver,
        Server='#########',
        Database='####'
        )

querystring = ("SELECT SFN_NUM as 'bsfn', " +
        "SUBSTRING(SFN_NUM, 11, 1) as 'bcerttype', " +
        "ISNULL(CHILD_GNAME, 'NaN') as 'bfname', " +
        "ISNULL(CHILD_MNAME, 'NaN') as 'bmname', " +
        "ISNULL(CHILD_LNAME, 'NaN') as 'blname', " +
        "ISNULL(MOTHER_GNAME_PRIOR, 'NaN') as 'bmom_fname', " +
        "ISNULL(MOTHER_MNAME_PRIOR, 'NaN') as 'bmom_mname', " +
        "ISNULL(MOTHER_LNAME, 'NaN') as 'bmom_lname', " +
        "ISNULL(INFANT_SEX, 'NaN') as 'bsex', " +
        "IDOB as 'bdob', " +
        "ISNULL(SUBSTRING(IDOB, 1,2), '99') as 'bdobm', " +
        "ISNULL(SUBSTRING(IDOB, 4,2), '99') as 'bdobd', " +
        "ISNULL(SUBSTRING(IDOB, 7,4), '9999') as 'bdoby', " +
        "ISNULL(BIRTH_FAC_STATE_FIPS_CD, '  ') as 'bbirplstatefips', " +
        "ISNULL(RES_CITY, '  ') as 'b_momrescity', " +
        "RIGHT('00000' + ISNULL(RES_CITY_FIPS_CD, '99999'), 5) as 'b_momrescityfips', " +
        "ISNULL(RES_COUNTY, '  ') as 'b_momrescountyl', " +
        "RIGHT('000' + ISNULL(RES_COUNTY_FIPS_CD, '999'), 3) as 'b_momrescntyfips', " +
        "ISNULL(RES_STATE_FIPS_CD, '  ') as 'b_momresstatefips', " +
        "ISNULL(SUBSTRING(RES_ZIP, 1,5), '99999') as 'b_momreszip' " +
        "FROM [wa_vrvweb_events].[VRV_BIRTH_TBL] " +
        "WHERE ((DATE_BIRTH_YEAR = 2016) OR (DATE_BIRTH_YEAR = 2017) OR (DATE_BIRTH_YEAR = 2018)) " +
        "AND FL_CURRENT = '1' " +
        "AND FL_VOIDED = '0' " +
        "AND FL_FILED <> 'N' " +
        "AND (BIRTH_FAC_STATE_FIPS_CD = 'WA' OR RES_STATE_FIPS_CD = 'WA')")

bir1618 = pd.read_sql_query(querystring, conn)

## SAVE DATA AS CSV FILE

bir1618.to_csv(r'########\Py\Data\bir1618_raw.csv', index=None, header=True)
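
The RIGHT('000' + ISNULL(...), 3) expressions in the query left-pad the FIPS codes with zeros and substitute an all-nines code when the value is missing. As a rough illustration of that logic outside SQL, here is a minimal Python sketch. The helper 'pad_fips' is hypothetical and is not part of this notebook.

```python
def pad_fips(code, width):
    """Mimic RIGHT(zeros + ISNULL(code, nines), width) for a FIPS code."""
    if code is None:
        # ISNULL branch: a missing code becomes all nines
        return "9" * width
    # prepend zeros, then keep the rightmost `width` characters
    padded = "0" * width + str(code)
    return padded[-width:]

print(pad_fips("5", 3))     # county code with one digit
print(pad_fips("33", 3))    # county code with two digits
print(pad_fips(None, 3))    # missing county code
print(pad_fips("1234", 5))  # city code with four digits
```

Note that the info() output reports the padded FIPS columns as int64, so the leading zeros do not survive the CSV round trip unless the columns are read as strings (for example with the dtype argument of pd.read_csv).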

In [11]:
bir1618 = pd.read_csv(r'#####\Py\Data\bir1618_raw.csv')
bir1618.shape

(267744, 20)

In [12]:
bir1618.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267744 entries, 0 to 267743
Data columns (total 20 columns):
bsfn                 267744 non-null object
bcerttype            267744 non-null object
bfname               267728 non-null object
bmname               241829 non-null object
blname               267728 non-null object
bmom_fname           267577 non-null object
bmom_mname           224872 non-null object
bmom_lname           266143 non-null object
bsex                 267744 non-null object
bdob                 267744 non-null object
bdobm                267744 non-null int64
bdobd                267744 non-null int64
bdoby                267744 non-null int64
bbirplstatefips      267744 non-null object
b_momrescity         267744 non-null object
b_momrescityfips     267744 non-null int64
b_momrescountyl      267744 non-null object
b_momrescntyfips     267744 non-null int64
b_momresstatefips    267744 non-null object
b_momreszip          267744 non-null object
dtypes: int64(5), object(15)

In [13]:
# Create list of variable names in birth data sets for later use
birlinkvars = list(bir1618.columns.values)
birlinkvars

['bsfn',
 'bcerttype',
 'bfname',
 'bmname',
 'blname',
 'bmom_fname',
 'bmom_mname',
 'bmom_lname',
 'bsex',
 'bdob',
 'bdobm',
 'bdobd',
 'bdoby',
 'bbirplstatefips',
 'b_momrescity',
 'b_momrescityfips',
 'b_momrescountyl',
 'b_momrescntyfips',
 'b_momresstatefips',
 'b_momreszip']

### II. Data cleaning and standardization

 - Standardize the merging ID number variable ('bsfn') so that it is an integer.

bir1618.head() shows that 'bsfn' is a string of 10 digits followed by R, O, D, or B. I will remove the final character and then convert the remaining all-digit string to an integer.

In [14]:
bir1618.bsfn= bir1618.bsfn.str.rstrip('R')
bir1618.bsfn= bir1618.bsfn.str.rstrip('O')
bir1618.bsfn= bir1618.bsfn.str.rstrip('D')
bir1618.bsfn= bir1618.bsfn.str.rstrip('B')

In [15]:
bir1618['bsfn'] = bir1618['bsfn'].astype(int)

In [16]:
bir1618.bsfn.dtypes

dtype('int32')
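
The stripping and conversion steps can also be written as a single expression. The sketch below uses a small made-up sample of file numbers (not real records) and requests 'int64' explicitly, on the assumption that a fixed width is preferable to the platform default, which resolved to int32 here.

```python
import pandas as pd

# Hypothetical file numbers: 10 digits followed by a certificate-type letter
sample = pd.DataFrame(
    {"bsfn": ["2016000001R", "2017000002O", "2018000003D", "2016000004B"]}
)

# rstrip treats its argument as a set of characters, so a single call removes
# a trailing R, O, D, or B; astype then converts the digits to integers
sample["bsfn"] = sample["bsfn"].str.rstrip("RODB").astype("int64")

print(sample["bsfn"].tolist())
print(sample["bsfn"].dtype)
```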

 - Check date of birth year ('bdoby') variable to make sure we have only the years of interest

In [18]:
bir1618['bdoby'].value_counts(dropna=False)

2016    91760
2017    88707
2018    87277
Name: bdoby, dtype: int64

 - Check whether the mother's state of residence (which is treated as the infant's state of residence) is WA only.

In [19]:
bir1618['b_momresstatefips'].value_counts(dropna=False)

WA    264106
OR      1643
ID      1356
AK       188
CA        66
MT        54
ZZ        53
BC        27
XX        26
TX        20
VA        15
AZ        15
NC        12
UT        12
NV        11
HI        11
FL        10
NY         9
IL         9
TN         8
CO         7
OH         7
MN         6
MI         6
GA         5
WI         5
AR         4
AL         4
MD         4
MO         3
KS         3
NE         3
LA         3
ND         3
PA         3
SC         3
KY         3
AS         2
MA         2
WY         2
NM         2
IN         2
NJ         2
ON         1
RI         1
CT         1
VT         1
NS         1
IA         1
MS         1
NH         1
SD         1
Name: b_momresstatefips, dtype: int64

   **LIMIT DATA SET TO INFANTS BORN IN WA TO MOTHERS WHO WERE WA RESIDENTS**

In [26]:
bir1618['bbirplstatefips'].value_counts(dropna=False)

WA    263544
OR      2846
ID       882
CA       143
TX        36
AZ        21
CO        18
FL        16
MT        14
UT        14
OH        12
NY        11
NV        10
PA         9
AK         9
VA         9
MA         8
HI         8
LA         8
KS         8
NC         7
WY         7
IL         7
MN         6
XX         6
CT         6
AR         6
MI         6
NM         6
IN         6
DC         5
MO         5
WI         5
SC         4
GA         4
AL         4
ND         3
TN         3
NJ         3
NE         3
WV         2
MS         2
RI         2
MD         2
SD         2
OK         2
KY         1
IA         1
NH         1
DE         1
Name: bbirplstatefips, dtype: int64

In [27]:
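# keep only records where both the mother's residence state and the birth state are WA;
# .copy() makes b1618 an independent dataframe so that later assignments do not act on a slice of bir1618
b1618 = bir1618[(bir1618['b_momresstatefips'] == "WA") & (bir1618['bbirplstatefips'] == "WA")].copy()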

b1618['b_momresstatefips'].value_counts(dropna=False)

WA    259906
Name: b_momresstatefips, dtype: int64

In [28]:
b1618['bbirplstatefips'].value_counts(dropna=False)

WA    259906
Name: bbirplstatefips, dtype: int64

In [29]:
b1618.shape

(259906, 20)

#### CHECK FOR NULL VALUES

In [30]:
# count the missing values in each column
b1618.isna().sum()

bsfn                     0
bcerttype                0
bfname                   9
bmname               25515
blname                  13
bmom_fname              98
bmom_mname           42101
bmom_lname            1384
bsex                     0
bdob                     0
bdobm                    0
bdobd                    0
bdoby                    0
bbirplstatefips          0
b_momrescity             0
b_momrescityfips         0
b_momrescountyl          0
b_momrescntyfips         0
b_momresstatefips        0
b_momreszip              0
dtype: int64

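- Mothers' and infants' middle names are not useful for linking because too many values are missing: 25,515 infant middle names (about 10%) and 42,101 mother middle names (about 16%) out of 259,906 records.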
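
One way to judge whether a field has too many gaps is to look at the share of missing values per column rather than the raw count. A minimal sketch on a toy dataframe (the names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bfname": ["ANNA", "BEN", "CARA", "DAN"],
    "bmname": ["MAY", np.nan, np.nan, "LEE"],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)
```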

 **CHECK FOR OUT OF RANGE VALUES**

In [36]:
# create dictionary of valid values so that each variable can be checked to make sure there is no
# out of range value.

valids = {'sex': ['M', 'F', 'U'],
          'dobm': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 99],
          'dobd': np.r_[1:32 ,99],
          'doby': [2016,2017,2018],
         'rcntyfips': np.r_[range(1, 78, 2), 99],
         'certtype': ['R'],
         'birthstatefips': ['WA'], 
         'rstatefips': ['WA']}

In [38]:
# check for out of range values for 'bsex'

chkbsex = b1618['bsex'].isin(valids['sex'])
len(b1618[~chkbsex])

0

In [39]:
# check for out of range values for 'bdobm'

chkbdobm = b1618['bdobm'].isin(valids['dobm'])
len(b1618[~chkbdobm])

0

In [40]:
# check for out of range values for 'bdoby'

chkbdoby = b1618['bdoby'].isin(valids['doby'])
len(b1618[~chkbdoby])

0

In [41]:
# check for out of range values for 'bdobd'

chkbdobd = b1618['bdobd'].isin(valids['dobd'])
len(b1618[~chkbdobd])

0

In [43]:
# check for out of range values for 'b_momrescntyfips'

chkbrcounty = b1618['b_momrescntyfips'].isin(valids['rcntyfips'])
len(b1618[~chkbrcounty])

132
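
The checks above repeat one pattern: test a column with isin against its list of valid values and count the rows that fail. As a sketch, the same pattern can be driven by a mapping from column name to valid values. The helper 'count_out_of_range' and the toy data are hypothetical.

```python
import pandas as pd

def count_out_of_range(df, valid_by_column):
    """Return {column: number of rows whose value is not in the valid list}."""
    return {
        col: int((~df[col].isin(valid)).sum())
        for col, valid in valid_by_column.items()
    }

# Toy data with one invalid sex code ('X') and one invalid month (13)
toy = pd.DataFrame({
    "bsex": ["M", "F", "X", "U"],
    "bdobm": [1, 13, 12, 99],
})
valid_by_column = {
    "bsex": ["M", "F", "U"],
    "bdobm": list(range(1, 13)) + [99],
}

result = count_out_of_range(toy, valid_by_column)
print(result)
```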

In [45]:
# create dataframe 'brcntyerrors' which shows only rows where mom's residence county does not have a valid FIPS code

brcntyerrors = b1618[~chkbrcounty][['b_momrescntyfips', 'b_momrescountyl','b_momresstatefips']]

brcntyerrors

        b_momrescntyfips b_momrescountyl b_momresstatefips
99                   999                                WA
1060                 999                                WA
1311                 999         UNKNOWN                WA
1875                 999                                WA
3031                 999                                WA
...                  ...             ...               ...
258302               999         UNKNOWN                WA
262291               999         UNKNOWN                WA
263578               999         UNKNOWN                WA
266901               999          BENTON                WA


In [46]:
#Create dictionary of Washington State county names and county FIPS codes 

counties = {'ADAMS':1,
'ASOTIN':3,
'BENTON':5,
'CHELAN':7,
'CLALLAM':9,
'CLARK':11,
'COLUMBIA':13,
'COWLITZ':15,
'DOUGLAS':17,
'FERRY':19,
'FRANKLIN':21,
'GARFIELD':23,
'GRANT':25,
'GRAYS HARBOR':27,
'ISLAND':29,
'JEFFERSON':31,
'KING':33,
'KITSAP':35,
'KITTITAS':37,
'KLICKITAT':39,
'LEWIS':41,
'LINCOLN':43,
'MASON':45,
'OKANOGAN':47,
'PACIFIC':49,
'PEND OREILLE':51,
'PIERCE':53,
'SAN JUAN':55,
'SKAGIT':57,
'SKAMANIA':59,
'SNOHOMISH':61,
'SPOKANE':63,
'STEVENS':65,
'THURSTON':67,
'WAHKIAKUM':69,
'WALLA WALLA':71,
'WHATCOM':73,
'WHITMAN':75,
'YAKIMA':77
}


In [48]:
# Recompute the mother's residence county FIPS code ('b_momrescntyfips') for WA residents by mapping
# the literal county name ('b_momrescountyl') through the 'counties' dictionary. Note that this
# overwrites the code on every WA row, not only the out-of-range ones; literals that are not in
# the dictionary (blank or 'UNKNOWN') become NaN.

b1618.loc[b1618['b_momresstatefips'] == 'WA', 'b_momrescntyfips'] = b1618['b_momrescountyl'].map(counties)

In [49]:
# recheck how many WA records still have out-of-range values
chkbrcounty = b1618['b_momrescntyfips'].isin(valids['rcntyfips'])
len(b1618[~chkbrcounty & (b1618['b_momresstatefips'] == 'WA')])

124

In [56]:
# recheck residence county literal to see where there are problems
b1618['b_momrescountyl'].value_counts(dropna=False)

(KING            75448
 PIERCE          34421
 SNOHOMISH       29629
 SPOKANE         17601
 CLARK           14972
 YAKIMA          11398
 THURSTON         9350
 KITSAP           9076
 BENTON           7845
 WHATCOM          6597
 FRANKLIN         4716
 GRANT            4401
 SKAGIT           4320
 COWLITZ          3675
 ISLAND           2742
 LEWIS            2714
 CHELAN           2622
 GRAYS HARBOR     2216
 MASON            1943
 CLALLAM          1907
 WALLA WALLA      1869
 DOUGLAS          1558
 OKANOGAN         1482
 STEVENS          1293
 KITTITAS         1230
 WHITMAN          1150
 ADAMS            1148
 JEFFERSON         542
 PEND OREILLE      339
 PACIFIC           317
 LINCOLN           310
 SAN JUAN          244
 FERRY             215
 SKAMANIA          138
 COLUMBIA          102
 KLICKITAT          94
 ASOTIN             75
 UNKNOWN            63
                    61
 WAHKIAKUM          60
 GARFIELD           23
 Name: b_momrescountyl, dtype: int64, 124)

In [57]:
# recheck number of missing in 'b_momrescntyfips'
b1618['b_momrescntyfips'].isna().sum()

124

 - The 124 records with missing residence county FIPS codes have either a blank or "UNKNOWN" in the residence county literal field (63 "UNKNOWN" and 61 blank in the counts above). Because that literal is what the lookup uses to populate 'b_momrescntyfips', the codes cannot be recovered without additional information, and there is no **easy** way to find the mother's county of residence (literal or code). I will leave these records in the data for now.
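
As a sketch of an alternative lookup, the mapping can be restricted to rows whose code is not already valid, so that existing good codes are kept and unmatched literals retain their original code instead of becoming NaN. This is a variation on the cell above, which remaps every row. The toy dataframe and the two-county excerpt of the dictionary are assumptions for illustration.

```python
import pandas as pd

counties = {"BENTON": 5, "KING": 33}   # excerpt of the full dictionary
valid_codes = list(range(1, 78, 2))    # odd codes 1 through 77

toy = pd.DataFrame({
    "b_momrescntyfips": [33, 999, 999, 999],
    "b_momrescountyl": ["KING", "BENTON", "UNKNOWN", "  "],
})

# rows whose current code is not a valid WA county code
invalid = ~toy["b_momrescntyfips"].isin(valid_codes)

# look up the county literal for the invalid rows only; literals that are
# not in the dictionary give NaN, so fall back to the existing code
looked_up = toy.loc[invalid, "b_momrescountyl"].str.strip().map(counties)
fixed = looked_up.fillna(toy.loc[invalid, "b_momrescntyfips"]).astype("int64")
toy.loc[invalid, "b_momrescntyfips"] = fixed

print(toy["b_momrescntyfips"].tolist())
```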

#### STANDARDIZE STRING VARIABLES

First, middle, and last names of infants and mothers, as well as city names, will be standardized by converting these columns to upper case and removing white space, hyphens, and periods.

In [119]:
# b1618.tail()

In [116]:
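# convert all string values to upper case (applymap works element by element,
# so the type check is applied to each value rather than to a whole column)
b1618 = b1618.applymap(lambda x: x.upper() if isinstance(x, str) else x)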

In [65]:
# remove white space on either side of and within each string value; remove hyphens and periods
b1618 = b1618.applymap(lambda x: x.strip() if isinstance(x, str) else x)
b1618 = b1618.applymap(lambda x: x.replace(" ", "") if isinstance(x, str) else x)
b1618 = b1618.applymap(lambda x: x.replace("-", "") if isinstance(x, str) else x)
b1618 = b1618.applymap(lambda x: x.replace(".", "") if isinstance(x, str) else x)

In [67]:
#Check with .tail() to see if string transformations were successful
#b1618.tail(30)

A check using .tail() before and after the string transformations showed that the conversion to upper case and the removal of punctuation marks and white space all completed successfully.
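
For reference, the standardization described in this section amounts to the following per-value transformation. This is a plain-Python sketch; 'standardize' is a hypothetical helper rather than code from this notebook.

```python
def standardize(value):
    """Upper-case a string and drop spaces, hyphens, and periods.

    Non-string values (numbers, NaN) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    cleaned = value.strip().upper()
    for ch in (" ", "-", "."):
        cleaned = cleaned.replace(ch, "")
    return cleaned

print(standardize(" Mary-Ann "))
print(standardize("St. Clair"))
print(standardize(98101))
```

A function like this could be applied to every cell with applymap, which would replace the separate passes with one.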

In [68]:
b1618.to_csv(r'Y:\DQSS\Death\MBG\Py\Data\b1618_clean.csv', index=None, header=True)