### Preparing training and testing data sets

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Creating examples of mismatched data for training

The 2016-18 death data set contains almost 170,000 deaths, and there are almost 260,000 birth records for the same period, so the number of possible non-matching record pairs is enormous. Using all of them would not help the classifier, because it would cause severe class imbalance.
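
To give a sense of scale, the number of candidate pairs can be worked out directly. This is a back-of-the-envelope sketch that uses the rounded figures quoted above as assumptions, not exact record counts.

```python
# Rounded counts taken from the text above (assumptions, not exact figures)
n_deaths = 170_000
n_births = 260_000

# Every death record could in principle be paired with every birth record
n_pairs = n_deaths * n_births
print(f"{n_pairs:,} candidate pairs")  # 44,200,000,000 candidate pairs
```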

I will randomly select roughly the same number of records from the birth data set (equally distributed across the 3 years) and from the death data set, join them horizontally (side by side), and label the resulting pairs as mismatched ('Match' = 0).
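
The pairing idea can be illustrated on toy data. The sketch below is only an illustration: the `births` and `deaths` frames and their values are hypothetical stand-ins, and `reset_index(drop=True)` is used so that the two samples align row by row when concatenated.

```python
import pandas as pd

# Toy stand-ins for the birth and death tables (hypothetical values)
births = pd.DataFrame({"bsfn": [101, 102, 103, 104],
                       "bdoby": [2016, 2016, 2017, 2017]})
deaths = pd.DataFrame({"dsfn": [901, 902, 903, 904],
                       "ddody": [2016, 2017, 2017, 2016]})

n = 3
# Sample n rows from each table and discard the original index so that
# the two samples line up row by row when placed side by side
b_sample = births.sample(n=n, random_state=42).reset_index(drop=True)
d_sample = deaths.sample(n=n, random_state=42).reset_index(drop=True)

# Horizontal join: row i of the birth sample is paired with row i of the death sample
mismatched = pd.concat([b_sample, d_sample], axis=1)
mismatched["Match"] = 0  # label every random pair as a non-match
print(mismatched.shape)  # (3, 5)
```

Without resetting the index, `pd.concat(..., axis=1)` would align rows on their original index labels and could produce rows padded with missing values instead of clean pairs.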

In [2]:
WAlinked1617_m = pd.read_csv(r'###\Py\Data\WAlinked1617_m.csv')

In [3]:
excl_dc = WAlinked1617_m['dsfn'].tolist()
len(excl_dc)

631

In [4]:
excl_bc = WAlinked1617_m['bsfn'].tolist()
len(excl_bc)

631

In [5]:
d1618 = pd.read_csv(r'###\Py\Data\d1618_clean.csv', low_memory=False)
b1618 = pd.read_csv(r'###\Py\Data\b1618_clean.csv', low_memory=False)

In [6]:
death1617_nm = d1618[(~d1618['dsfn'].isin(excl_dc)) & (d1618.ddody != 2018)]

In [7]:
birth1617_nm = b1618[(~b1618['bsfn'].isin(excl_bc)) & (b1618.bdoby != 2018)]
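
The two filters above combine a negated `isin` mask with a year condition. A minimal sketch on made-up data shows the effect; the frame, the `linked_ids` list, and the reading of `ddody` as the year of death are assumptions for illustration.

```python
import pandas as pd

# Hypothetical death records: 'dsfn' is the record identifier,
# 'ddody' is assumed here to hold the year of death
deaths = pd.DataFrame({
    "dsfn": [1, 2, 3, 4, 5],
    "ddody": [2016, 2017, 2018, 2016, 2017],
})
linked_ids = [2, 4]  # identifiers that already appear in the linked data

# Keep rows that are NOT already linked and NOT from 2018
not_linked = ~deaths["dsfn"].isin(linked_ids)
not_2018 = deaths["ddody"] != 2018
candidates = deaths[not_linked & not_2018]
print(candidates["dsfn"].tolist())  # [1, 5]
```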

#### Create copies of linked, birth and death data sets that will be used for training

In [8]:
l1617m = WAlinked1617_m.copy()
d1617nm = death1617_nm.copy()
b1617nm = birth1617_nm.copy()

#### Undersample the unmatched class, then split into training and testing sets

Here I split each of the three data sets into a training set containing 70% of the data and a testing set containing 30%.
To reduce class imbalance, I randomly undersample from the birth and death data sets to create the unmatched records. I create three data sets, each with a different number of unmatched records ('Match' = 0): 1,200, 5,000, and 10,000. Fitting a separate logistic regression model to each one allows me to compare model fit across the different levels of class imbalance.
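
As a quick check on how imbalanced each data set is, the share of matched records can be computed from the counts alone. This sketch assumes the 631 linked pairs reported for the linked data set and the three undersample sizes used in this notebook.

```python
n_matched = 631  # linked (matched) pairs in the linked data set

# Share of matched records for each undersampling level
shares = {}
for n_unmatched in (1_200, 5_000, 10_000):
    shares[n_unmatched] = n_matched / (n_matched + n_unmatched)
    print(f"{n_unmatched:>6,} unmatched: {shares[n_unmatched]:.1%} matched")
# Prints shares of 34.5%, 11.2% and 5.9% respectively
```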

**Splitting infant linked (matched) data set**

In [9]:
lnk_m = WAlinked1617_m.copy()

In [10]:
lnk_m.shape

(631, 96)

**UNDERSAMPLE 1 - 1,200 unmatched rows**

In [11]:
d_train_undersample1 = d1617nm.sample(n=1200, random_state=42).reset_index()
b_train_undersample1 = b1617nm.sample(n=1200, random_state=42).reset_index()
d_train_undersample1.shape, b_train_undersample1.shape

((1200, 53), (1200, 44))

In [12]:
unmatched1 = pd.concat([b_train_undersample1, d_train_undersample1], axis=1)
unmatched1.shape

(1200, 97)

In [13]:
unmatched1['Match'] = 0
unmatched1 = unmatched1.drop(['index'], axis=1)
unmatched1.shape

(1200, 96)

In [14]:
df_1200 = pd.concat([lnk_m, unmatched1], axis=0)
df_1200 = df_1200.sample(frac=1)

In [15]:
df_1200.Match.value_counts(dropna=False)

0    1200
1     631
Name: Match, dtype: int64

In [16]:
#Create two dataframes: X has features, y has outcome labels

y1 = df_1200.Match
X1 = df_1200.drop(['Match'], axis=1)

y1.value_counts(), X1.shape

(0    1200
 1     631
 Name: Match, dtype: int64, (1831, 95))

In [17]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size = 0.3, random_state = 42)

In [18]:
print(X_train1.shape, y_train1.shape)
print(X_test1.shape, y_test1.shape)

(1281, 95) (1281,)
(550, 95) (550,)


**UNDERSAMPLE 2 - 5,000 unmatched rows**

In [19]:
d_train_undersample2 = d1617nm.sample(n=5000, random_state=42).reset_index()
b_train_undersample2 = b1617nm.sample(n=5000, random_state=42).reset_index()
d_train_undersample2.shape, b_train_undersample2.shape

((5000, 53), (5000, 44))

In [20]:
unmatched2 = pd.concat([b_train_undersample2, d_train_undersample2], axis=1)

In [21]:
unmatched2['Match'] = 0
unmatched2 = unmatched2.drop(['index'], axis=1)
unmatched2.shape

(5000, 96)

In [22]:
df_5K = pd.concat([lnk_m, unmatched2], axis=0)
df_5K.Match.value_counts(dropna=False)
df_5K = df_5K.sample(frac=1)

In [23]:
y2 = df_5K.Match
len(y2)

5631

In [24]:
X2 = df_5K.drop(['Match'], axis=1)
X2.shape

(5631, 95)

In [25]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size = 0.3, random_state = 42)

**UNDERSAMPLE 3 - 10,000 unmatched rows**

In [26]:
d_train_undersample3 = d1617nm.sample(n=10000, random_state=42).reset_index()
b_train_undersample3 = b1617nm.sample(n=10000, random_state=42).reset_index()
d_train_undersample3.shape, b_train_undersample3.shape

((10000, 53), (10000, 44))

In [27]:
unmatched3 = pd.concat([b_train_undersample3, d_train_undersample3], axis=1)

In [28]:
unmatched3['Match'] = 0
unmatched3 = unmatched3.drop(['index'], axis=1)
unmatched3.shape

(10000, 96)

In [29]:
df_10K = pd.concat([lnk_m, unmatched3], axis=0)
df_10K.Match.value_counts(dropna=False)
df_10K = df_10K.sample(frac=1)

In [30]:
y3 = df_10K.Match
len(y3)

10631

In [31]:
X3 = df_10K.drop(['Match'], axis=1)
X3.shape

(10631, 95)

In [32]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.3, random_state = 42)

In [33]:
y_train3.value_counts()

0    7002
1     439
Name: Match, dtype: int64
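
The splits above are purely random, so the share of matches that lands in the training set can drift slightly from the overall share. One option, not used in this notebook, is a stratified split. The sketch below uses toy labels and a placeholder feature column, both hypothetical, to show the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 100 matches (1) and 900 non-matches (0)
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(y_tr.sum(), y_te.sum())  # 70 30
```

With stratification, 70% of the matches go to training and 30% to testing, which keeps the rare class represented in both partitions at the same rate.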