In [1]:
%reload_kedro

2020-06-07 22:48:25,310 - root - INFO - ** Kedro project Immunization Drop-outs
2020-06-07 22:48:25,311 - root - INFO - Defined global variable `context` and `catalog`


In [2]:
dfp = catalog.load("preprocessed_patients")
dfi = catalog.load("preprocessed_immunization")
dff = catalog.load("facilities")

2020-06-07 22:48:25,316 - kedro.io.data_catalog - INFO - Loading data from `preprocessed_patients` (CSVDataSet)...
2020-06-07 22:48:25,365 - kedro.io.data_catalog - INFO - Loading data from `preprocessed_immunization` (CSVDataSet)...
2020-06-07 22:48:25,462 - kedro.io.data_catalog - INFO - Loading data from `facilities` (CSVDataSet)...


In [3]:
dfi.isnull().sum().sum()

0

In [4]:
dff.isnull().sum().sum()

0

In [5]:
dfp.isnull().sum().sum()

0

In [6]:
set(dfp.pat_id) == set(dfi.pat_id)

False

* OPV (oral polio vaccine):
    * Dose 1: birth
    * Dose 2: 6 weeks
    * Dose 3: 10 weeks
    * Dose 4: 14 weeks*
* DTP (diphtheria, tetanus, pertussis):
    * Dose 1: 6 weeks
    * Dose 2: 10 weeks
    * Dose 3: 14 weeks*
    
Data Dictionary: immunizations_db.csv
* pat_id: The ID of the child.
* vaccine: The abbreviated name of the vaccine the child attempted to receive.
* im_date: Immunization date, i.e. the date the child received the vaccine.
* successful: Whether or not the vaccination was successful.
* reason_unsuccesful: If the vaccination was unsuccessful, the selected reason why.

Data Dictionary: patients_db.csv
* pat_id: The unique ID of the child.
* dob: The date of birth of the child.
* gender: The gender of the child.
* fac_id: The unique ID of the health facility where the child received the vaccination.
* lat: The latitude of the facility.
* long: The longitude of the facility.
* district: The geographical district that the facility is located in.

First, let's see how many vaccines each child received.

In [7]:
dfi.head()

Unnamed: 0,pat_id,vaccine,im_date
0,1,OPV,2019-01-31
1,1,OPV,2019-04-03
2,1,OPV,2019-05-25
3,1,OPV,2019-07-06
4,1,DTP,2019-04-03


If I had more time, I would use Snorkel (https://www.snorkel.org/) to label and manage this dataset. Here I will assign two labels:
* high_risk
* low_risk

After getting more familiar with the nature of the problem, researching the topic and interviewing health workers, I would definitely add more labels. Ideally, labelling would be done by a healthcare worker (someone with domain knowledge), or the data would record whether an intervention took place and when. Here, I have to assign labels based on my limited knowledge of the subject, so the possibility of a `garbage-in, garbage-out` outcome is very high.

In [8]:
len(set(dfp.pat_id) - set(dfi.pat_id))

1106

There are 1106 patients without immunization records. I will treat this as a data capture mistake, since each child should receive a vaccine at birth. If I had more time, I would investigate whether those records come from specific districts, facilities, etc.
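The district check mentioned above could look roughly like this. It is only a sketch: the helper name `missing_share_by_district` and the toy data are made up for illustration, and it assumes the patients table carries `pat_id` and `district` columns as in the data dictionary.

```python
import pandas as pd


def missing_share_by_district(patients, immunizations):
    """Share of patients per district that have no immunization record."""
    has_record = patients["pat_id"].isin(immunizations["pat_id"])
    return (
        (~has_record)                      # True where the patient has no record
        .groupby(patients["district"])     # group the flags by district
        .mean()                            # mean of booleans = share missing
        .sort_values(ascending=False)
    )


# Toy data, invented for illustration only.
patients = pd.DataFrame(
    {"pat_id": [1, 2, 3, 4], "district": ["Ghanzi", "Ghanzi", "Chobe", "Chobe"]}
)
immunizations = pd.DataFrame({"pat_id": [1, 3, 3, 4]})

share = missing_share_by_district(patients, immunizations)
print(share)
```

On the notebook's data this would be called as `missing_share_by_district(dfp, dfi)`, before the patients without records are filtered out of `dfp`.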

In [9]:
bad_pat_id = list(set(dfp.pat_id) - set(dfi.pat_id))
dfp = dfp[~dfp.pat_id.isin(bad_pat_id)]

In [10]:
def remove_bad_pat_id(df0, df1):
    """
    Remove patients (df0) without immunization records (df1).
    """
    # Use the function arguments rather than the notebook globals.
    bad_pat_id = list(set(df0.pat_id) - set(df1.pat_id))
    return df0[~df0.pat_id.isin(bad_pat_id)]

Next, I will join the preprocessed datasets.

In [11]:
dfp.head()

Unnamed: 0,pat_id,fac_id,dob,gender,long,lat,region,district
0,1,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi
1,2,89.0,2019-11-12,f,24.877556,-18.370709,Chobe,Chobe
2,3,161.0,2019-11-03,m,25.249672,-20.490189,Central,Tutume
3,4,168.0,2019-04-17,f,25.579269,-21.412151,Central,Lethlakane
4,5,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli


In [12]:
import pandas as pd

In [13]:
df_outer = pd.merge(dfi, dfp, on='pat_id', how='outer')
df_outer.head()

Unnamed: 0,pat_id,vaccine,im_date,fac_id,dob,gender,long,lat,region,district
0,1,OPV,2019-01-31,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi
1,1,OPV,2019-04-03,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi
2,1,OPV,2019-05-25,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi
3,1,OPV,2019-07-06,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi
4,1,DTP,2019-04-03,51.0,2019-01-22,f,21.678399,-21.739251,Ghanzi,Ghanzi


In [14]:
df_outer.isnull().sum().sum()

0
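The zero null count shows that every row found a match on both sides. As a complementary check, `pd.merge` can report unmatched keys itself through its `indicator` and `validate` parameters. The snippet below is a minimal sketch on invented toy frames (`imm` and `pat` are hypothetical names), not the notebook's data.

```python
import pandas as pd

# Toy frames, invented for illustration only.
imm = pd.DataFrame({"pat_id": [1, 1, 2], "vaccine": ["OPV", "DTP", "OPV"]})
pat = pd.DataFrame({"pat_id": [1, 2, 3], "gender": ["f", "f", "m"]})

merged = pd.merge(
    imm,
    pat,
    on="pat_id",
    how="outer",
    validate="many_to_one",  # raises if pat_id is duplicated in the patients table
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# Rows that exist on one side only.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Here patient 3 has no immunization rows, so it shows up as `right_only`.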

In [15]:
def build_primary_table(df0, df1):
    """
    Join preprocessed_patients (df0) and preprocessed_immunization (df1)
    dataframes. Drop successful == False records.
    """
    # Keep only patients with immunization records, and use the result in the merge.
    df0 = remove_bad_pat_id(df0, df1)
    df = pd.merge(df0, df1, on='pat_id', how='outer')
    df = df[df.successful == True]
    # drop() returns a new dataframe, so the result has to be assigned.
    df = df.drop(columns=['successful', 'reason_unsuccesful'])
    df = df.reset_index(drop=True)
    return df

In [16]:
df_outer[df_outer.pat_id == 2].sort_values('im_date')

Unnamed: 0,pat_id,vaccine,im_date,fac_id,dob,gender,long,lat,region,district
7,2,OPV,2019-11-12,89.0,2019-11-12,f,24.877556,-18.370709,Chobe,Chobe


## Features

For my list of predictors I will use:
* Gender
* Facility
* Region
* First vaccine type
* Enrollment age in weeks
* Exit age in weeks
* Vaccine timeliness (on-time, late, early)
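Vaccine timeliness is listed above but not computed in this notebook. One way to derive it from the dose schedule is sketched below. The two-week grace window (`tolerance_weeks`) and the names `SCHEDULE_WEEKS` and `dose_timeliness` are assumptions for illustration and would need to be agreed with someone who has domain knowledge.

```python
# Scheduled age in weeks for each dose, taken from the schedule listed earlier.
SCHEDULE_WEEKS = {
    "OPV": [0, 6, 10, 14],
    "DTP": [6, 10, 14],
}


def dose_timeliness(vaccine, dose_number, age_weeks, tolerance_weeks=2):
    """
    Classify a single dose as 'early', 'on-time' or 'late'.

    tolerance_weeks is a hypothetical grace window, not a value from the data.
    """
    due = SCHEDULE_WEEKS[vaccine][dose_number - 1]
    if age_weeks < due:
        return "early"
    if age_weeks <= due + tolerance_weeks:
        return "on-time"
    return "late"


print(dose_timeliness("OPV", 1, 2))
print(dose_timeliness("OPV", 2, 9))
print(dose_timeliness("DTP", 1, 5))
```

The per-dose labels could then be aggregated per patient, for example as the share of late doses.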


Create feature dataframe:

In [17]:
pat_id = list(df_outer.pat_id.unique())
# feature_df = pd.DataFrame(pat_id, columns = ['pat_id']) 
# feature_df.shape[0]

In [18]:
dfp[dfp.pat_id == 1].iloc[-1].gender

'f'

In [35]:
def extract_temp_df(df, pid):
    """
    Extract all records for a patient from the primary table
    and sort by immunization date.
    """
    # Copy the slice so the assignments below do not write into a view of df.
    df = df[df.pat_id == pid].sort_values('im_date').copy()
    df['im_date'] = pd.to_datetime(df['im_date'], format='%Y-%m-%d')
    df['dob'] = pd.to_datetime(df['dob'], format='%Y-%m-%d')
    return df

In [43]:
# test df
test_df = extract_temp_df(df_outer, 5)
test_df

Unnamed: 0,pat_id,vaccine,im_date,fac_id,dob,gender,long,lat,region,district
18,5,OPV,2018-12-24,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli
19,5,OPV,2019-02-10,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli
22,5,DTP,2019-02-10,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli
20,5,OPV,2019-03-18,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli
23,5,DTP,2019-03-18,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli
21,5,OPV,2019-04-27,183.0,2018-12-08,m,28.487746,-22.571451,Central,Tuli


In [49]:
test_df['im_date'].idxmin()

18

In [47]:
test_df.vaccine.loc[18]

'OPV'

In [48]:
def feature_first_vaccine(df):
    """
    First vaccine received. Accepts the temporary dataframe
    for a single patient.
    """
    idx = df['im_date'].idxmin()
    return df.vaccine.loc[idx]

In [22]:
import math

import numpy as np  # needed by the age features below

In [23]:
def feature_enrollment_age(df):
    """
    Enrollment age in weeks.
    """
    return math.floor(((df.im_date.min() - df.dob.min()) / np.timedelta64(1, 'W')))

In [24]:
def feature_exit_age(df):
    """
    Exit age in weeks.
    """
    return math.floor(((df.im_date.max() - df.dob.min()) / np.timedelta64(1, 'W')))

In [None]:
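def feature_opv_by_4mths(df):
    """
    Number of OPV vaccines received by 4 months. Accepts the
    temporary dataframe for a single patient.
    """
    # Assumption: "by 4 months" means on or before dob + 4 calendar months.
    cutoff = df.dob.min() + pd.DateOffset(months=4)
    return int(((df.vaccine == 'OPV') & (df.im_date <= cutoff)).sum())
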

* OPV (oral polio vaccine):
    * Dose 1: birth
    * Dose 2: 6 weeks
    * Dose 3: 10 weeks
    * Dose 4: 14 weeks*
* DTP (diphtheria, tetanus, pertussis):
    * Dose 1: 6 weeks
    * Dose 2: 10 weeks
    * Dose 3: 14 weeks*

In [25]:
import numpy as np

In [50]:
feature_df_list = []
for p in pat_id:
    
    pat_dict = dict()
    pat_dict['pat_id'] = p
    
    pat_df = extract_temp_df(df_outer, p)
    
    facility = pat_df.iloc[-1].fac_id
    pat_dict['facility'] = facility
    
    region = pat_df.iloc[-1].region
    pat_dict['region'] = region
    
    gender = pat_df.iloc[-1].gender
    pat_dict['gender'] = gender
    
    vaccine = feature_first_vaccine(pat_df)
    pat_dict['first_vaccine'] = vaccine
    
    pat_dict['enrollment_age'] = feature_enrollment_age(pat_df)
    pat_dict['exit_age'] = feature_exit_age(pat_df)
    
    feature_df_list.append(pat_dict)

feature_df = pd.DataFrame(feature_df_list)
feature_df.head()

Unnamed: 0,pat_id,facility,region,gender,first_vaccine,enrollment_age,exit_age
0,1,51.0,Ghanzi,f,OPV,1,23
1,2,89.0,Chobe,f,OPV,0,0
2,3,161.0,Central,m,OPV,0,6
3,4,168.0,Central,f,OPV,6,34
4,5,183.0,Central,m,OPV,2,20
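The loop above filters the full table once per patient, which gets slow with tens of thousands of patients. A single `groupby` can produce the same columns in one pass. The version below is a sketch rather than a drop-in replacement: `build_features` is a hypothetical name, and the toy rows are a subset of the records for patients 2 and 5 shown earlier.

```python
import pandas as pd


def build_features(primary):
    """Per-patient features computed with one groupby instead of a Python loop."""
    df = primary.copy()
    df["im_date"] = pd.to_datetime(df["im_date"])
    df["dob"] = pd.to_datetime(df["dob"])
    # Sort so that "first"/"last" within each patient follow immunization date.
    df = df.sort_values(["pat_id", "im_date"])
    feats = df.groupby("pat_id").agg(
        facility=("fac_id", "last"),
        region=("region", "last"),
        gender=("gender", "last"),
        first_vaccine=("vaccine", "first"),
        first_date=("im_date", "min"),
        last_date=("im_date", "max"),
        dob=("dob", "min"),
    )
    # Whole weeks between birth and first / last immunization (floored).
    feats["enrollment_age"] = (feats["first_date"] - feats["dob"]).dt.days // 7
    feats["exit_age"] = (feats["last_date"] - feats["dob"]).dt.days // 7
    return feats.drop(columns=["first_date", "last_date", "dob"]).reset_index()


# A few rows for patients 2 and 5, deliberately out of date order.
toy = pd.DataFrame(
    {
        "pat_id": [5, 5, 5, 2],
        "vaccine": ["OPV", "DTP", "OPV", "OPV"],
        "im_date": ["2019-04-27", "2019-02-10", "2018-12-24", "2019-11-12"],
        "fac_id": [183.0, 183.0, 183.0, 89.0],
        "dob": ["2018-12-08", "2018-12-08", "2018-12-08", "2019-11-12"],
        "gender": ["m", "m", "m", "f"],
        "region": ["Central", "Central", "Central", "Chobe"],
    }
)

features = build_features(toy)
print(features)
```

For these rows the ages agree with the loop output above: patient 5 enrolls at 2 weeks and exits at 20 weeks, and patient 2 has both ages equal to 0.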


In [29]:
feature_df.shape

(45675, 6)