<div style="display:flex;">
    <img src="./figs/dsa-logo.png" alt="DSA_Logo" width="70" style="margin:0"/> 
    <div style="margin-left:100px">
        <h2 style="">Introduction to Data Science - Tutorial </h2>
        <h4>Prepared by <a href="https://twitter.com/DMachuve">Dina Machuve</a> and <a href="https://twitter.com/tejuafonja">Tejumade Afonja</a>
        </h4>
    </div>
</div>

This tutorial will take you through some of the basic stages of a Data Science project. The stages for End to End Data Science include:
1. [**Define the Goal**](#A)
2. [**Data Preparation**](#B)
3. [**Feature Selection**](#C)
4. Model Training
5. Model Validation
6. Model Deployment

### A. Define the Goal<a id='A'></a>
### Goal: Predict whether a health facility offers maternal health delivery services, given its other attributes

### B. Data Preparation and Exploration<a id='B'></a>
#### Data Source:  Nigeria MDG (Millennium Development Goals) Information System – [NMIS health facility data](https://www.kaggle.com/alaowerre/nigeria-nmis-health-facility-data). You can read more [here](http://www.sparc-nigeria.com/RC/files/4.2.21_MDGs_NMIS_flyer.pdf)

In this part, we will load a dataset provided with this exercise, prepare it by converting to the right types and finally plot it to explore the data.

The purpose of exploratory data analysis (EDA) is to:
* maximize insight into a data set
* uncover underlying structure
* extract important variables
* detect outliers and anomalies
* test underlying assumptions
* develop simple models with strong explanatory and predictive power
* determine optimal factor settings

In [None]:
# Load some common libraries used here
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt
%matplotlib inline

#### Using the libraries above, write code to read the dataset.

The filename is specified below. The dataset is loaded into a pandas DataFrame.

In [None]:
ORIGINAL_NAME = './data/NMIS_Health_Dataset.csv'

In [None]:
#load the dataset
nmisdf = pd.read_csv(ORIGINAL_NAME)
nmisdf.head() #view the first 5 records

In [None]:
nmisdf.shape  # 20 variables and 34,139 records

In [None]:
#list all variables and corresponding data type
nmisdf.dtypes

In [None]:
# Get some information about the dataframe
nmisdf.info()

## B1. Descriptive Statistics  From Data

Descriptive statistics can give you great insight into the shape of each attribute. The **describe()** function on the Pandas DataFrame lists 8 statistical properties of each attribute of the numerical data:

* Count
* Mean
* Standard Deviation
* Minimum Value
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum Value

The **describe(include='O')** function on the Pandas DataFrame lists 4 properties of each attribute of the categorical data:

* count
* unique
* top
* freq

For example, to obtain the summary statistics for the numeric data:

In [None]:
# Statistics on the dataset
nmisdf.describe()

In [None]:
nmisdf.describe(include='object')

To obtain descriptive statistics of a particular column use:
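Select the column first and call `describe()` on the resulting Series. The sketch below uses a small made-up DataFrame (`toy`, a stand-in for `nmisdf` with invented values) so that it runs on its own; with the real data you would select a column of `nmisdf` in the same way.

```python
import pandas as pd

# Toy stand-in for nmisdf; the values are invented for illustration only.
toy = pd.DataFrame({
    "num_nurses_fulltime": [0, 2, 5, 1],
    "management": ["public", "private", "public", "public"],
})

# Numeric column: count, mean, std, min, quartiles, max
nurse_stats = toy["num_nurses_fulltime"].describe()
print(nurse_stats)

# Categorical column: count, unique, top, freq
mgmt_stats = toy["management"].describe()
print(mgmt_stats)
```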

## B2. Handling Missing Data

Real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.

Pandas treats **None** and **NaN** as essentially interchangeable for indicating missing or null values.

Pandas Methods for missing values:

* isnull(): Generate a boolean mask indicating missing values
* notnull(): Opposite of isnull()
* dropna(): Return a filtered version of the data
* fillna(): Return a copy of the data with missing values filled or imputed
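
The four methods above can be tried on a tiny example before applying them to the full dataset. This is a minimal sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration, not part of the NMIS data).

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (invented for illustration).
toy = pd.DataFrame({
    "num_doctors_fulltime": [1.0, np.nan, 3.0],
    "management": ["public", None, "private"],
})

mask = toy.isnull()               # True where a value is missing
present = toy.notnull()           # the opposite mask
missing_per_column = mask.sum()   # number of missing values per column
complete_rows = toy.dropna()      # keep only rows without missing values
filled = toy.fillna({"num_doctors_fulltime": 0, "management": "unknown"})

print(missing_per_column)
print(complete_rows)
print(filled)
```

Note that both `None` and `NaN` are reported as missing by `isnull()`.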



**Detecting null values**

In [None]:
nmisdf.isnull().tail()

In [None]:
# Total missing values in each column
nmisdf.isnull().sum()

In [None]:
#nmisdf.notnull()

### Getting rid of missing data points

#### Drop all missing data

``.dropna()``: will drop all rows that have any missing values.

In [None]:
clean_nmisdf = nmisdf.dropna()

In [None]:
clean_nmisdf.isnull().sum()

In [None]:
#Size of Clean dataset
clean_nmisdf.shape

In [None]:
#Set boolean variables to 1 = True and 0=False
boolean_columns = ["maternal_health_delivery_services", "skilled_birth_attendant", 
                   "phcn_electricity", "c_section_yn", "improved_water_supply", 
                   "vaccines_fridge_freezer", "antenatal_care_yn", "malaria_treatment_artemisinin",
                   "improved_sanitation", "family_planning_yn", "child_health_measles_immun_calc",
                   "emergency_transport"]

In [None]:
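# Convert the boolean columns to float (True -> 1.0, False -> 0.0).
# astype returns a new DataFrame, which avoids chained-assignment warnings.
clean_nmisdf = clean_nmisdf.astype({col: float for col in boolean_columns})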

In [None]:
clean_nmisdf.info()

In [None]:
clean_nmisdf.describe()

In [None]:
clean_nmisdf.describe(include='O')

In [None]:
#Save it for future use
clean_nmisdf.to_csv("./data/clean_data.csv", encoding='utf-8', index=False)

In [None]:
data = pd.read_csv("./data/clean_data.csv")
data.head()

#### Filling null values


Sometimes, rather than dropping NA values, you would prefer to replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
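
As a rough sketch of these options, the example below fills a made-up Series (`staff`, with invented values) in three ways: with a constant, with the median of the observed values, and by linear interpolation. Which strategy is appropriate for the NMIS columns is a modelling decision that this tutorial does not settle.

```python
import numpy as np
import pandas as pd

# Invented staff counts with two gaps.
staff = pd.Series([2.0, np.nan, 4.0, np.nan, 10.0], name="num_nurses_fulltime")

filled_zero = staff.fillna(0)                  # replace gaps with a constant
filled_median = staff.fillna(staff.median())   # impute with the observed median
interpolated = staff.interpolate()             # linear interpolation between neighbours

print(pd.DataFrame({
    "original": staff,
    "zero": filled_zero,
    "median": filled_median,
    "interpolated": interpolated,
}))
```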

### Understanding Data Relationship -- digging deeper

In [None]:
columns = data.columns
for column in columns:
    print('\n-------{}------'.format(column))
    print(data[column].value_counts())

In [None]:
staff_columns = ['num_chews_fulltime', 'num_nurses_fulltime',
                 'num_nursemidwives_fulltime', 'num_doctors_fulltime']
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Draw one histogram per staff-count column
for ax, col in zip(axes.flatten(), staff_columns):
    data[col].plot.hist(bins=100, ax=ax)
    ax.set_title(col)
fig.subplots_adjust(hspace=0.3, wspace=0.3)

In [None]:
# Number of facilities offering maternal health delivery services, per management type
mgt_mhds = data[['management', 'maternal_health_delivery_services']]

mgt_mhds.groupby(['management'], as_index=False).sum().sort_values(by='maternal_health_delivery_services', ascending=False)

In [None]:
# Count the total number of data points for each class of management
# data[['management', 'maternal_health_delivery_services']].groupby(['management'], as_index=False).count().sort_values(by='maternal_health_delivery_services', ascending=False)

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x = "management",data=data, hue='maternal_health_delivery_services', color="Brown")
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation = 30)

ax.set_xlabel("Management")
ax.set_ylabel("Frequency")

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x ="facility_type_display",data = data, hue='maternal_health_delivery_services', color='Purple' )
_ = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x = "child_health_measles_immun_calc", hue='maternal_health_delivery_services', data = data, color = "Brown")

### C. Feature Selection <a id='C'></a>

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

<b>Note: Having too many irrelevant features in your data can decrease the accuracy of the models.</b>

Three benefits of performing feature selection before modeling your data are:

* Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
* Improves Accuracy: Less misleading data means modeling accuracy improves.
* Reduces Training Time: Less data means that algorithms train faster.

You can use the following approaches:

- Univariate statistics
- Model-based selection
- Iterative selection


**Univariate statistics**
- Check whether there is a statistically significant relationship between each feature and the target.
- Select the features with the highest confidence.

Advantage: Very fast to compute; doesn't require building a model.

Disadvantage: Independent of the model that will be trained afterwards.
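
A minimal sketch of univariate selection with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`) is shown below. The data is synthetic (one informative column plus three noise columns), and the choice of `f_classif` and `k=2` is an assumption made for illustration, not something prescribed by this tutorial.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: column 0 depends on the target, columns 1-3 are pure noise.
rng = np.random.default_rng(0)
n = 200
y_toy = rng.integers(0, 2, size=n)
informative = y_toy + rng.normal(scale=0.3, size=n)
noise = rng.normal(size=(n, 3))
X_toy = np.column_stack([informative, noise])

# Score each feature independently against the target and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_toy, y_toy)

print("F-scores:", selector.scores_)
print("Kept columns:", selector.get_support(indices=True))
```

Because each feature is scored on its own, no model needs to be trained, which is why this approach is fast.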

**Model-based Feature Selection**
- Use a supervised machine learning model to judge the importance of each feature.

Advantage: Considers all features at once.

**Iterative Feature Selection**
- A series of models is built with a varying number of features. It is implemented in scikit-learn as [Recursive feature elimination (RFE)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE).
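
The following sketch shows how `RFE` might be used. It runs on synthetic data, and the random forest estimator, `n_estimators=50`, and `n_features_to_select=2` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: column 0 depends on the target, columns 1-3 are pure noise.
rng = np.random.default_rng(0)
n = 200
y_toy = rng.integers(0, 2, size=n)
X_toy = np.column_stack([
    y_toy + rng.normal(scale=0.3, size=n),
    rng.normal(size=(n, 3)),
])

# Repeatedly fit the model and drop the least important feature
# until only two features remain.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=2,
)
rfe.fit(X_toy, y_toy)

print("Selected mask:", rfe.support_)
print("Ranking (1 = selected):", rfe.ranking_)
```

Since a model is refitted at every elimination step, this approach is considerably more expensive than univariate statistics.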

<h3> Model - based Feature Selection : Using Feature Importance </h3>

<b>Using RandomForest Classifier</b>

A [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

Some of the many hyperparameters we can tune:
* max_depth - The maximum depth of each tree
* n_estimators - The number of trees in the forest
* min_samples_leaf - The minimum number of samples required to be at a leaf node
* min_samples_split - The minimum number of samples required to split an internal node
* random_state - The seed used by the random number generator
* etc.

In [None]:
# Load the packages for modeling
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Define a classifier
rforest = RandomForestClassifier(max_depth=15,n_estimators=70, min_samples_leaf=50,
                                  min_samples_split=100, random_state=10)

In [None]:
#Load clean data
data = pd.read_csv("./data/clean_data.csv")
data.head(5)

In [None]:
# Prepare Feature and Target
data.columns

In [None]:
feature = ['skilled_birth_attendant',  'phcn_electricity', 'c_section_yn', 
           'improved_water_supply', 'vaccines_fridge_freezer','antenatal_care_yn', 
           'malaria_treatment_artemisinin','num_nurses_fulltime','num_nursemidwives_fulltime',
          'num_doctors_fulltime']

In [None]:
X = data[feature]
y = data.maternal_health_delivery_services

In [None]:
# Fit the model
rforest.fit(X,y)

In [None]:
# Plot the important features
imp_feat_rf = pd.Series(rforest.feature_importances_, index=X.columns).sort_values(ascending=False)
imp_feat_rf.plot(kind='bar', title='Feature Importance with Random Forest', figsize=(10,6),color='g')
plt.ylabel('Feature Importance values')
plt.subplots_adjust(bottom=0.25)

## Model - based feature Selection : Using **[SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html)**

In [None]:
from sklearn.feature_selection import SelectFromModel

select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42) , threshold="median")

In [None]:
select.fit(X,y)

In [None]:
X_features = select.transform(X)
print('Original features', X.shape)
print('Selected features', X_features.shape)


#### Print selected features

In [None]:
for feature_list_index in select.get_support(indices=True):
    print(feature[feature_list_index])

In [None]:
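# Pairwise correlations; non-numeric columns are excluded so that corr() does not fail on text data
data.select_dtypes(include=['number', 'bool']).corr()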