Deliverables:
* Predictions for which children will or will not receive all vaccines by 6 months on a test set of data (generated by you)
* All notebooks/scripts used to create predictions
    * Data exploration
    * A defined baseline
    * Predictive model, can you beat the baseline?
* A description of next steps:
    * If you had more time to work on this problem, what other data sources or methods would you like to add that you think would improve the predictions, and why?
    * How can your results help health facilities allocate health workers?


In [1]:
%reload_kedro

2020-06-08 17:18:23,978 - root - INFO - ** Kedro project Immunization Drop-outs
2020-06-08 17:18:23,979 - root - INFO - Defined global variable `context` and `catalog`


## Predictions

In [2]:
predictions = catalog.load("predictions")
X_test = catalog.load("X_test")

2020-06-08 17:18:23,984 - kedro.io.data_catalog - INFO - Loading data from `predictions` (PickleDataSet)...
2020-06-08 17:18:23,987 - kedro.io.data_catalog - INFO - Loading data from `X_test` (PickleDataSet)...


In [3]:
predictions.shape

(9135,)

In [4]:
X_test.shape

(9135, 8)

In [5]:
import numpy as np

In [6]:
# Append the 1-D predictions to X_test as a ninth column.
arr = np.column_stack((X_test, predictions))

In [7]:
arr.shape

(9135, 9)

In [8]:
arr[0:,1:].shape

(9135, 8)

In [9]:
arr[0:,0].shape

(9135,)

In [10]:
import pandas as pd

In [11]:
df = pd.DataFrame(data=arr,    # the 8 X_test columns plus the prediction column
                    index=range(arr.shape[0]),    # plain 0..n-1 integer index
                    columns=['pat_id','facility', 'first_vaccine_code', 'gender_code', 'region_code', 'dtp_by_4mths',
                      'opv_by_4mths', 'enrollment_age', 'label_code'])  # column names; label_code holds the prediction

In [12]:
df.head()

Unnamed: 0,pat_id,facility,first_vaccine_code,gender_code,region_code,dtp_by_4mths,opv_by_4mths,enrollment_age,label_code
0,33948.0,28.0,1.0,0.0,3.0,3.0,3.0,3.0,1.0
1,31441.0,278.0,1.0,1.0,4.0,2.0,3.0,0.0,1.0
2,25102.0,257.0,1.0,0.0,0.0,1.0,1.0,6.0,0.0
3,3802.0,98.0,1.0,1.0,1.0,0.0,1.0,4.0,0.0
4,48113.0,1.0,1.0,1.0,11.0,1.0,2.0,2.0,0.0


In [13]:
low = 1
high = 0

In [14]:
prediction = df[['pat_id', 'label_code']]
prediction.head()

Unnamed: 0,pat_id,label_code
0,33948.0,1.0
1,31441.0,1.0
2,25102.0,0.0
3,3802.0,0.0
4,48113.0,0.0


In [15]:
prediction['label'] = prediction['label_code'].apply(lambda x: 'low' if x == 1 else 'high')
prediction.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,pat_id,label_code,label
0,33948.0,1.0,low
1,31441.0,1.0,low
2,25102.0,0.0,high
3,3802.0,0.0,high
4,48113.0,0.0,high
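
The `SettingWithCopyWarning` in cell 15 appears because `prediction` was selected from `df` and then assigned to. One way to avoid it is to take an explicit copy before adding the column. The following is a minimal sketch on a small stand-in frame (values copied from the first rows shown above); `make_labels` is a hypothetical helper name, not part of the project.

```python
import pandas as pd

# Stand-in for the first rows of `df` shown above (illustrative only).
df = pd.DataFrame({
    "pat_id": [33948.0, 31441.0, 25102.0],
    "label_code": [1.0, 1.0, 0.0],
})

def make_labels(frame):
    """Hypothetical helper: return pat_id, label_code and a readable label."""
    # .copy() makes the selection an independent frame, so the assignment
    # below is unambiguous and pandas has nothing to warn about.
    out = frame[["pat_id", "label_code"]].copy()
    # Same mapping as the notebook: 1 -> 'low', anything else -> 'high'.
    out["label"] = out["label_code"].map(lambda x: "low" if x == 1 else "high")
    return out

prediction = make_labels(df)
```

The original `df` is left unchanged, and `prediction` carries the extra `label` column.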


## Baseline vs Model

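A baseline classifier that always predicts the majority class is the simplest classifier we could build. The majority class (`high`) makes up 60.9% of the whole dataset, so this baseline scores an accuracy of about 0.609 on that data. The mean accuracy of a Random Forest with hyperparameters chosen by random search is 0.8727422003284072 (about 0.873), roughly 26 percentage points above the baseline.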
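
As a rough illustration of how a majority-class baseline is scored, the sketch below counts the most common label and reports its share. It uses toy labels, not the project data, and `majority_baseline_accuracy` is a hypothetical name introduced here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    # Every example of the majority class is predicted correctly,
    # every other example is predicted wrongly.
    return majority_label, majority_count / len(labels)

# Toy labels only: 6 'high' and 4 'low', not the real dataset.
toy_labels = ["high"] * 6 + ["low"] * 4
label, accuracy = majority_baseline_accuracy(toy_labels)
print(label, accuracy)
```

For a fair comparison the baseline and the model would be scored on the same held-out rows; the 60.9% figure above is quoted for the whole dataset.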

## Next Steps

I spent a lot of time on data cleaning and not much on feature engineering. I outlined in my notebooks what extra steps I would take to improve the model. In addition, I would try more model architectures.

I would look at adding extra datasets: geography, weather, social/political events in the calendar, and possibly currency exchange rates.

I would optimise the code for speed and size of the model. 

This model as it stands is not ready for that, but a later version of it could be used to prioritize `high` risk cases. For it to be useful, more labels with different levels of risk would be better, as the model currently produces a lot of `high` cases.
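
One possible way to get more than two risk levels is to bin a predicted probability into tiers and rank children by it. The sketch below assumes the model can output a probability for the `high` class, which has not been checked for this project; the cut-offs, tier names, and function names are all hypothetical, and the scores are made up.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the predicted probability of the 'high' class.
TIER_BOUNDS = [0.33, 0.66]
TIER_NAMES = ["low", "medium", "high"]

def risk_tier(p_high):
    """Map a probability in [0, 1] to a named risk tier."""
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_right counts how many cut-offs the probability has passed.
    return TIER_NAMES[bisect_right(TIER_BOUNDS, p_high)]

def rank_patients(scores):
    """Sort (pat_id, p_high) pairs so the highest-risk children come first."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(pat_id, p, risk_tier(p)) for pat_id, p in ranked]

# Made-up scores for three patient ids from the table above.
toy_scores = [(33948, 0.12), (25102, 0.91), (3802, 0.55)]
ranked = rank_patients(toy_scores)
```

A facility with limited health workers could then work down the ranked list instead of treating every `high` prediction as equally urgent.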