# Machine Learning in Python 

## Module 1: Mining and visualising real-world data
### Learning Activity: Loading the Python libraries

First you need to load the required Python libraries. Libraries are extensions to base Python that add functionality or make tasks more convenient.

In [None]:
# compatibility with python2 and 3
from __future__ import print_function, division
from __future__ import absolute_import 

# numerical capacity
import scipy as scipy
import numpy as np
import pandas as pd

# matplotlib setup
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotly setup
import plotly 
from plotly.graph_objs import *
import plotly.figure_factory as FF  # current home of the figure factory helpers
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode()

# extra tools
from mpl_toolkits.mplot3d.axes3d import Axes3D 

# GENERAL SKLEARN TOOLS
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# DTS and RFS MODULE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# SVM MODULE
from sklearn.svm import SVC

import warnings


# The dataset

Limited financial inclusion remains one of the main obstacles to economic and human development in Africa. For example, across Kenya, Rwanda, Tanzania, and Uganda only 9.1 million adults (or 13.9% of the adult population) have access to or use a commercial bank account.

Traditionally, access to bank accounts has been regarded as an indicator of financial inclusion. Despite the proliferation of mobile money in Africa, and the growth of innovative fintech solutions, banks still play a pivotal role in facilitating access to financial services. Access to bank accounts enables households to save and make payments, while also helping businesses build up their credit-worthiness and improve their access to other financial services. Therefore, access to bank accounts is an essential contributor to long-term economic growth.

The objective of this competition is to create a machine learning model to predict which individuals are most likely to have or use a bank account. The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, while providing insights into some of the key demographic factors that might drive individuals’ financial outcomes.

The dataset contains demographic information and the financial services used by approximately 33,610 individuals across East Africa. This data was extracted from various Finscope surveys ranging from 2016 to 2018. You are asked to predict, for each unique id in the test dataset, the likelihood of the person having a bank account.

### Learning Activity: Importing the data

As a first step we load the dataset from the provided `Train_v2.csv` file with `pandas`. To achieve this you will use the `read_csv()` function from pandas. We just need to point to the location of the dataset and choose the name under which to store the data, i.e. `df`.

Once the data has been loaded, you can look at the first few instances using the `.head()` method.

In [None]:
# Import the data and explore the first few rows
df = pd.read_csv('https://raw.githubusercontent.com/StarBoy01/IndabaX-Sudan-2019/master/Train_v2.csv')
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.dtypes

## Exploratory Data Analysis

## Is the dataset balanced?

Let's see whether our dataset is balanced by checking the distribution of the target.

In [None]:
df.bank_account.value_counts()

In [None]:
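a = len(df[df.bank_account=='Yes'])   # respondents with a bank account
b = len(df[df.bank_account=='No'])    # respondents without a bank account
c = len(df)
# Percentage of 'No' vs 'Yes' answers, rounded to the nearest integer
print('We have an imbalanced dataset with a %.0f/%.0f No/Yes ratio' % (b/c*100, a/c*100))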

Stratification will be needed when doing cross-validation to preserve this ratio in our folds!
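As a minimal sketch of what stratification does, the example below uses a toy target (not the real data) with a made-up 86/14 class split and checks that each fold produced by scikit-learn's `StratifiedKFold` keeps roughly the same share of positives. The names `X_toy`, `y_toy` and `fold_positive_rates` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 86 negatives and 14 positives (illustrative numbers only)
y_toy = np.array([0] * 86 + [1] * 14)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    # Share of positives in this validation fold
    fold_positive_rates.append(y_toy[val_idx].mean())

print(fold_positive_rates)
```

Each fold's positive rate stays close to the overall 14%, which a plain `KFold` would not guarantee on an imbalanced target.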

## Distributions

### Age of respondent

In [None]:
hist_age = df.age_of_respondent.hist(bins=25,figsize=[15,10])

### Country

In [None]:
df['country'].value_counts().plot(kind='bar',figsize=[15,5])

### Let's check the year variable

In [None]:
df.year.value_counts()

In [None]:
df[df.year==2016].country.value_counts()

In [None]:
df[df.year==2017].country.value_counts()

In [None]:
df[df.year==2018].country.value_counts()

### Bivariate Analysis using the target

In [None]:
## target encoding
di = {"Yes": 1, "No": 0}
df.replace({"bank_account": di},inplace=True)

### Age of Respondent 

In [None]:
plt.figure(figsize=[18,12])
sns.barplot(x='age_of_respondent', y='bank_account', data=df)

Looking at the general trend of the plot, we can say that older people are generally less likely to have a bank account. There are some outliers beyond the 90 mark for the age variable.

### Gender of Respondent

In [None]:
plt.figure(figsize=[15,6])
sns.barplot(x='gender_of_respondent', y='bank_account', data=df)

Males are more likely to have a bank account according to this plot. We can combine gender and age to see whether, put together, they reveal a pattern that would justify a feature representing this combination.


### Age + Gender 

In [None]:
warnings.filterwarnings(action="ignore")
plt.figure(figsize=[25,20])
age_bins = np.arange(0, 100, 1)
males = df[df.gender_of_respondent=='Male']
females = df[df.gender_of_respondent=='Female']

# Males: account holders (red) vs non-holders (blue)
plt.subplot(331)
sns.histplot(males[males.bank_account==1]['age_of_respondent'].dropna(), bins=age_bins, color='red')
sns.histplot(males[males.bank_account==0]['age_of_respondent'].dropna(), bins=age_bins, color='blue')
plt.xlabel('Males age')

# Females: account holders (red) vs non-holders (blue)
plt.subplot(332)
sns.histplot(females[females.bank_account==1]['age_of_respondent'].dropna(), bins=age_bins, color='red')
sns.histplot(females[females.bank_account==0]['age_of_respondent'].dropna(), bins=age_bins, color='blue')
plt.xlabel('Females age')

In [None]:
df.gender_of_respondent.value_counts()

For both genders, the likelihood of having a bank account peaks between the ages of 20 and 40. However, the number of males who have a bank account is closer to the number who do not than is the case for females, keeping in mind that the training set contains about 4k fewer males than females. In other words, age affects both genders in much the same way, but gender itself plays an important role.

### Country

In [None]:
plt.figure(figsize=[15,6])
sns.barplot(x='country', y='bank_account', data=df)

### Household Size

In [None]:
plt.figure(figsize=[15,6])
sns.barplot(x='household_size', y='bank_account', data=df)

The data needs to be cleaned to get something useful out of this variable. As you can see, the larger values have fewer samples and may also contain outliers.

### Marital Status

In [None]:
plt.figure(figsize=[15,6])
sns.barplot(x='marital_status', y='bank_account', data=df)

The rate is almost the same across all categories; the under-represented `Dont know` category, with only 38 samples, might cause a problem.

### Let's see interactions between features

In [None]:
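# catplot(kind="point") replaces the older factorplot; 95% confidence intervals are the default
g = sns.catplot(x="household_size", y="bank_account", hue="gender_of_respondent", col="country",
                data=df, kind="point", aspect=0.9, height=4)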

For Kenya, we can see that the target decreases as household_size increases, except for some outliers that distort the plot at larger household_size values. For Tanzania, the decline is clearer, with fewer outliers. Uganda and Rwanda also show a small decline, but outliers again obscure it.

In [None]:
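# catplot(kind="point") replaces the older factorplot; 95% confidence intervals are the default
g = sns.catplot(x="household_size", y="bank_account", hue="gender_of_respondent", col="location_type",
                data=df, kind="point", aspect=0.9, height=6)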

### Binned-age-based and gender-based plots vs target and other variables

In [None]:
## binning the age!
bins = [10, 20, 30, 40,60,80,100]
labels = [1,2,3,4,5,6]
df['binnedage'] = pd.cut(df['age_of_respondent'], bins=bins, labels=labels)
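To make the binning rule concrete, here is a small sketch on a hand-made series of ages (the `ages` values are hypothetical, chosen to sit on and around the bin edges). By default `pd.cut` uses right-closed intervals, so an age of exactly 20 falls in the first bin `(10, 20]` and 21 in the second.

```python
import pandas as pd

bins = [10, 20, 30, 40, 60, 80, 100]
labels = [1, 2, 3, 4, 5, 6]

# Hypothetical ages placed on and next to the bin edges
ages = pd.Series([16, 20, 21, 40, 41, 100])
binned = pd.cut(ages, bins=bins, labels=labels)

print(binned.tolist())  # [1, 1, 2, 3, 4, 6]
```

Ages of 10 or below, or above 100, would fall outside every interval and become `NaN`.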

In [None]:
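# catplot(kind="point") replaces the older factorplot; 95% confidence intervals are the default
g = sns.catplot(x="binnedage", y="bank_account", hue="gender_of_respondent",
                data=df, kind="point", aspect=0.9, height=5)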

Both genders follow the same pattern: the likelihood is highest between 20 and 30 and then slowly decreases. There is no need to create a new feature based on the age-gender interaction.

In [None]:
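# catplot(kind="point") replaces the older factorplot; 95% confidence intervals are the default
g = sns.catplot(x="binnedage", y="bank_account", hue="education_level", col='gender_of_respondent',
                data=df, kind="point", aspect=0.9, height=5)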

Males and females share the same pattern across education levels within the same age groups.

In [None]:
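# catplot(kind="point") replaces the older factorplot; 95% confidence intervals are the default
g = sns.catplot(x="binnedage", y="bank_account", hue="job_type", col='gender_of_respondent',
                data=df, kind="point", aspect=0.9, height=8)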

In [None]:
# Data preprocessing: convert categorical features to numerical features
le = LabelEncoder()

# Keep the original 'country' column and one-hot encode a copy of it
df['country_'] = df['country']

# Categorical features to be converted by one-hot encoding
categ = ['relationship_with_head', 'marital_status', 'education_level', 'job_type', 'country_']
df = pd.get_dummies(df, prefix_sep='_', columns = categ)

# Binary features converted with LabelEncoder
df['location_type'] = le.fit_transform(df['location_type'])
df['cellphone_access'] = le.fit_transform(df['cellphone_access'])
df['gender_of_respondent'] = le.fit_transform(df['gender_of_respondent'])
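The difference between the two encodings used above can be seen on a tiny made-up frame. This is only an illustrative sketch: the `toy` DataFrame and its `job` and `area` columns are assumptions, not part of the dataset. One-hot encoding creates one indicator column per category, while `LabelEncoder` maps the (alphabetically sorted) categories to the integers 0, 1, ...

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with one multi-valued and one binary categorical column
toy = pd.DataFrame({'job': ['farmer', 'teacher', 'farmer'],
                    'area': ['Rural', 'Urban', 'Rural']})

# One-hot encode 'job': one indicator column per category
encoded = pd.get_dummies(toy, prefix_sep='_', columns=['job'])

# Label-encode 'area': Rural -> 0, Urban -> 1
encoded['area'] = LabelEncoder().fit_transform(encoded['area'])

print(encoded.columns.tolist())  # ['area', 'job_farmer', 'job_teacher']
print(encoded['area'].tolist())  # [0, 1, 0]
```

Label encoding is harmless for binary columns, but for columns with more than two categories it imposes an arbitrary order, which is why those are one-hot encoded instead.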

## Module 2: Decision Trees and Random Forests

In this module, you will implement two popular and extremely powerful Machine Learning models - Decision Trees and Random Forests - using Python and scikit-learn. For every classification model built with scikit-learn, we will follow four main steps: 1) **Building or instantiating** the classification model (using either default, pre-defined or optimised parameters), 2) **Training** the model, 3) **Testing** the model, and 4) **Performance evaluation** using various metrics to test its generalisation ability. Thorough validation techniques will be applied throughout these steps as a means of ensuring real-world metrics and avoiding cases of overfitting (or underfitting). Finally, you will learn how to optimise the hyperparameters of a model as a way of boosting its overall performance.


### Learning Activity: Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the dataset into two disjoint sets: train and test (**Holdout method**) using the `train_test_split()` function.

The `random_state` argument specifies a value for the seed of the random generator. By setting this seed to a particular value, each time the code is executed, the split between train and test datasets will be exactly the same. If this value is not specified, a different split will be performed each time since the random generator driving the split will be seeded by a pseudo-random number.
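The effect of `random_state` can be checked with a short sketch on a toy array (the `data` array and the `test_a`, `test_b`, `test_c` names are illustrative assumptions): two calls with the same seed return identical test sets, while a call without a seed will generally differ from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data standing in for the real feature matrix

# Same seed -> identical split on every call
_, test_a = train_test_split(data, test_size=0.3, random_state=47)
_, test_b = train_test_split(data, test_size=0.3, random_state=47)
print(np.array_equal(test_a, test_b))  # True

# No seed -> the split generally changes between runs
_, test_c = train_test_split(data, test_size=0.3)
print(test_c)
```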

In [None]:
# Split into training and test sets
X = df.drop(["uniqueid", "bank_account","year","country"], axis = 1)
Y = df["bank_account"]
X_train, X_test, y_train,y_test=train_test_split(X,Y, test_size=0.3, random_state=47)

The output of `train_test_split()` consists of four arrays. `X_train` and `y_train` are the two arrays you use to train your model. `X_test` and `y_test` are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing; here, `test_size=0.3` holds out 30% instead. You can check http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html on how to set this parameter.

In [None]:
# Add code here to explore X_train, y_train etc.
# (e.g. frequency of Y, head of X, dimensionality, ...)
# Print the dimensionality of the individual splits

print("XTrain dimensions: ", X_train.shape)
print("yTrain dimensions: ", y_train.shape)
print("XTest dimensions: ",  X_test.shape)
print("yTest dimensions: ",  y_test.shape)

# Class frequencies in the test set as (value, count) rows
# (scipy.stats.itemfreq is not available in recent SciPy releases)
yFreq = np.column_stack(np.unique(y_test, return_counts=True))
print(yFreq)

If you look at the frequency of `y_test`, you will see how many samples of class 0 (no bank account) and class 1 (bank account) ended up in the test set. Class 0 should dominate, reflecting the imbalance observed earlier.

### Learning Activity:  Decision Trees

Decision Tree classifiers construct classification models in the form of a tree structure. A decision tree progressively splits the training set into smaller subsets. Each node of the tree represents a subset of the data. Once a new sample is presented to the tree, it is classified according to the test conditions at the nodes it passes through.

Let us build a simple decision tree with 3 layers. (See [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the documentation of the Decision Tree classifier.)

In [None]:
# Build the classification model using a pre-defined parameter
dtc = DecisionTreeClassifier(max_depth=3)

# Train the model
dtc.fit(X_train, y_train)

# Test the model
yPred = dtc.predict(X_test)


### Learning Activity: Calculate validation metrics for your classifier

In a classification task, once you have created your predictive model, you will need to evaluate it. Evaluation functions help you to do this by reporting the performance of the model through four main performance metrics: precision, recall and specificity for the different classes, and overall accuracy. To understand these metrics, it is useful to create a _confusion matrix_, which records all the true positive, true negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the `confusion_matrix` function in the `metrics` module. The inputs are `y_test` and `yPred`.


In [None]:

mat = metrics.confusion_matrix(y_test, yPred) 
print (mat)
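To connect the confusion matrix to the metrics named above, the following sketch derives precision, recall, specificity and accuracy by hand. The labels `y_true` and `y_hat` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order (tn, fp, fn, tp)
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

precision   = tp / (tp + fp)                    # 3 / 5
recall      = tp / (tp + fn)                    # 3 / 4
specificity = tn / (tn + fp)                    # 4 / 6
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 7 / 10

print(precision, recall, specificity, accuracy)
```

On an imbalanced target, recall and precision for the minority class are usually more informative than overall accuracy.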




Because performance metrics are such an important step of model evaluation, scikit-learn offers a wrapper around these functions, `metrics.classification_report`, to facilitate their computation. It also offers the function `metrics.accuracy_score` to compute the overall accuracy.


In [None]:
# Report the metrics using metrics.classification_report
print (metrics.classification_report(y_test, yPred))
print ("Overall Accuracy: ", round(metrics.accuracy_score(y_test, yPred), 2))

### Learning Activity:  Random Forests

The random forests model is an _ensemble method_ since it aggregates a group of decision trees into an [ensemble](http://scikit-learn.org/stable/modules/ensemble.html). Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier. Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), Random Forests use averaging to find a natural balance between the two extremes.

Let us start by building a simple Random Forest model which consists of 150 independently trained decision trees. For further details and examples on how to construct a Random Forest, see [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:

# Build a Random Forest classifier with 150 decision trees
rf = RandomForestClassifier(n_estimators=150, random_state=1)
rf.fit(X_train, y_train)
predRF = rf.predict(X_test)

mat = metrics.confusion_matrix(y_test, predRF) 
print (mat)

print(metrics.classification_report(y_test, predRF))
print("Overall Accuracy:", round(metrics.accuracy_score(y_test, predRF),2))


### Learning Activity: Feature Importance 

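Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. The `feature_importances_` attribute of the RF classifier reports the impurity-based importance: the total reduction in node impurity (Gini impurity by default) contributed by splits on that feature, averaged over all trees and normalised so that the values sum to 1. A different heuristic, permutation importance, measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the data points. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is. scikit-learn provides it separately as `sklearn.inspection.permutation_importance`.

We can use the `feature_importances_` attribute to obtain the relative importance of each feature, which we can then visualise using a simple bar plot.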

In [None]:
# Display the importance of the features in a barplot

# sorting the features according their importance
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)

feature_importances

In [None]:
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(4).plot(kind='barh')
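For comparison, the sketch below computes both the impurity-based importances and permutation importances on a synthetic dataset. Everything here (`X_demo`, `y_demo`, `rf_demo`, the dataset sizes) is an assumption for illustration; `permutation_importance` lives in `sklearn.inspection` and needs scikit-learn 0.22 or later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, of which 3 carry signal
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (computed from the training data, sum to 1)
print(rf_demo.feature_importances_)

# Permutation importances: mean drop in accuracy on held-out data over 5 shuffles
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)
print(result.importances_mean)
```

Computing the permutation scores on held-out data avoids rewarding features the forest has merely memorised, at the cost of extra prediction passes.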

### Learning Activity: Tuning Random Forests with grid search

Random forests offer several parameters that can be tuned, such as `n_estimators`, `max_features`, `max_depth` and `min_samples_leaf`. The optimal choice for these parameters is highly *data-dependent*. Rather than trying predefined values for each hyperparameter one by one, we can automate this process. The scikit-learn library provides the grid search class [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which allows us to exhaustively search for the best combination of parameters by evaluating models trained with a particular algorithm on all provided parameter combinations. Further details and examples on grid search with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/grid_search.html). You can use `GridSearchCV` with the validation technique of your choice (in this example, 5-fold cross-validation has been applied) to search for a parameterisation of the RF algorithm that gives a better model.

As a first step, create a dictionary of allowed parameter ranges for `n_estimators` and `max_depth` (or include more of the parameters you would like to tune) and conduct a grid search with cross validation using the `GridSearchCV` function:

In [None]:
# Conduct a grid search with 5-fold cross-validation using the dictionary of parameters
# Parameters you can investigate include:
n_estimators = np.arange(5, 100, 25)
max_depth    = np.arange(1, 35, 5)
# percentage of features to consider at each split
max_features = np.linspace(.1, 1.,3)
parameters   = [{'n_estimators': n_estimators,
                 'max_depth': max_depth,
                 'max_features': max_features}]

gridCV = GridSearchCV(RandomForestClassifier(), param_grid=parameters, cv=5, n_jobs=4) 
gridCV.fit(X_train, y_train)

# Print the optimal parameters
best_n_estim      = gridCV.best_params_['n_estimators']
best_max_depth    = gridCV.best_params_['max_depth']
best_max_features = gridCV.best_params_['max_features']

print ('Best parameters: n_estimators=', best_n_estim,
       'max_depth=', best_max_depth,
       'max_features=', best_max_features)

By default, parameter search uses overall accuracy (`sklearn.metrics.accuracy_score`) as a metric in classification. For some applications, other scoring functions and metrics are better suited (for example in _unbalanced classification_, the overall accuracy score may often be misleading). An alternative scoring function such as the ones provided at http://scikit-learn.org/stable/modules/model_evaluation.html can be specified via the `scoring` parameter in `GridSearchCV`.
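As a hedged illustration of the `scoring` parameter, the sketch below runs a small grid search that optimises the F1 score of the positive class instead of accuracy. The synthetic data (`X_demo`, `y_demo`, roughly 85/15 class split) and the tiny parameter grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 85% class 0, 15% class 1)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.85, 0.15], random_state=0)

param_grid = {'max_depth': [2, 5], 'n_estimators': [20, 40]}

# scoring='f1' ranks parameter combinations by the F1 score of class 1
grid_f1 = GridSearchCV(RandomForestClassifier(random_state=0),
                       param_grid=param_grid, scoring='f1', cv=5)
grid_f1.fit(X_demo, y_demo)

print(grid_f1.best_params_, round(grid_f1.best_score_, 2))
```

Other built-in scorer names such as `'recall'`, `'precision'` or `'roc_auc'` can be passed in the same way.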

### Learning Activity: Testing and evaluating the generalisation performance

When evaluating the resulting model, it is important to do so on held-out samples that were not seen during the grid search process (`X_test`). We therefore retrain the classifier with the optimal parameters and test it on the independent `X_test` set:

In [None]:
# Build the classifier using the *optimal* parameters detected by grid search
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_depth=best_max_depth,
                                max_features = best_max_features)

clfRDF.fit(X_train, y_train)
predRF = clfRDF.predict(X_test)

Use the `classification_report` from the `metrics` module and the `accuracy_score` to see how well the classifier is doing.

In [None]:
# add your code here
print (metrics.classification_report(y_test, predRF))
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, predRF),2))

## Module 3: Support Vector Machines

Support Vector Machines (SVMs) attempt to build a decision boundary that accurately separates the samples of different classes by *maximising* the margin between them.

### Learning Activity: Linear SVMs

At first, let us build a linear SVM model using the _default_ value for the hyperparameter `C` (based on the scikit-learn documentation, the default case is `C = 1.0`). The regularisation parameter `C` trades off misclassification of training examples against simplicity of the decision surface. A low `C` tolerates training misclassifications and allows softer margins, while for high `C` the misclassifications become more significant, leading to hard-margin SVMs and potentially cases of overfitting.

Thorough documentation on how to implement linear SVMs with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
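One way to get a feel for `C` is to fit the same linear SVM with several values and look at how many support vectors each model keeps. The sketch below does this on synthetic data; `X_demo`, `y_demo`, the chosen `C` values and the `results` dictionary are all illustrative assumptions. A softer margin (low `C`) typically lets more points sit inside the margin and become support vectors, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with some class overlap
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     class_sep=0.8, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    svm_demo = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    # Total number of support vectors and training accuracy for this C
    results[C] = (int(svm_demo.n_support_.sum()), svm_demo.score(X_demo, y_demo))

for C, (n_sv, acc) in results.items():
    print("C=%g: %d support vectors, training accuracy %.3f" % (C, n_sv, acc))
```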


In [None]:
# Build a linear SVM classifier with the default hyperparameter C
# (where C = 1.0; this argument is optional and could be omitted)
linearSVM = SVC(kernel='linear', C=1.0)
linearSVM.fit(X_train, y_train)
yPred = linearSVM.predict(X_test)

print(metrics.classification_report(y_test, yPred))
print("Overall Accuracy:", round(metrics.accuracy_score(y_test, yPred),2))

### Learning Activity: Non-linear (RBF) SVMs

In addition to the regularisation parameter `C`, which is common for all types of SVM, the gamma hyperparameter in the RBF kernel controls the nonlinearity of the SVM boundaries. The larger the gamma, the more nonlinear the boundaries surrounding individual samples. Lower values of gamma lead to broader, more linear boundaries.

At first, let us build an RBF SVM model (set the `kernel` parameter to `rbf`) using the default value for the hyperparameter `C` (`C=1.0`) and `gamma='auto'`. Note that `'auto'` (1 / n_features) was the default in older scikit-learn releases; since version 0.22 the default is `gamma='scale'` (1 / (n_features * X.var())). Thorough documentation on how to implement SVMs with scikit-learn can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
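The two string settings for `gamma` translate into numbers as follows, according to the scikit-learn documentation for `SVC`: `'auto'` uses 1 / n_features and `'scale'` uses 1 / (n_features * X.var()). The sketch below evaluates both formulas on a hypothetical, unscaled feature matrix `X_demo` to show how much they can differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical unscaled feature matrix: 100 samples, 5 features, standard deviation about 3
X_demo = rng.normal(scale=3.0, size=(100, 5))

n_features = X_demo.shape[1]
gamma_auto = 1.0 / n_features                     # value used by gamma='auto'
gamma_scale = 1.0 / (n_features * X_demo.var())   # value used by gamma='scale'

print(gamma_auto, gamma_scale)
```

Because `'scale'` accounts for the variance of the inputs, it is less sensitive to the units of the features; with `'auto'`, standardising the features first (for example with `preprocessing.StandardScaler`) is usually advisable.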

In [None]:
# Build a non-linear (RBF) classifier using C=1.0 (the default) and gamma='auto'

rbfSVM = SVC(kernel='rbf', C=1.0, gamma='auto')
rbfSVM.fit(X_train, y_train)
yPred = rbfSVM.predict(X_test)

print (metrics.classification_report(y_test, yPred))
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, yPred),2))

## END OF DAY