# Lecture 5: Notebook SK_01

## Sklearn Classification

In [1]:
import pandas as pd
import numpy as np

import sklearn
import xgboost as xgb

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import warnings
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

### Import the dataset
Use pandas. The dataset is split into a training set and a test set. Create a `full_data` list that holds both.

Print dataset shape

Print dataset head and tail
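
A minimal sketch of the loading step. The notebook does not give the file paths, so `train.csv` and `test.csv` in the comment are assumed names; to stay runnable without the files, the sketch reads two tiny in-memory samples with illustrative rows instead.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the real files (illustrative rows only).
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,26\n"
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "4,3,male,34.5\n"
    "5,3,female,47\n"
)

# With the real data this would be e.g. pd.read_csv('train.csv') (assumed file name).
train = pd.read_csv(train_csv)
test = pd.read_csv(test_csv)

# Keep both frames in one list so every cleaning step can loop over them.
full_data = [train, test]

for name, df in zip(['train', 'test'], full_data):
    print(name, df.shape)

print(train.head())
print(train.tail())
```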

### Features

__Survived:__
```
0 = No, 1 = Yes
```

__Pclass:__
```
A proxy for socio-economic status (SES)
1 = 1st, 2 = 2nd, 3 = 3rd
1st = Upper
2nd = Middle
3rd = Lower
```

__Sex:__
```
Sex
```

__Age:__
```
Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
```

__SibSp:__
```
# of siblings / spouses aboard the Titanic
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
```

__Parch:__
```
# of parents / children aboard the Titanic
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
```

__Ticket:__
```
Ticket number
```

__Fare:__
```
Passenger fare
```

__Cabin:__
```
Cabin number
```

__Embarked:__
```
Port of Embarkation
C = Cherbourg, Q = Queenstown, S = Southampton
```

### Phase 1 : Feature Exploration, Engineering and Cleaning
* explore the data
* identify feature engineering opportunities
* numerically encode any categorical features

Check for NaN values in each column

We cannot simply drop every row that contains a NaN, because that would discard a large portion of the dataset.
Instead, we have to fill in the missing data.
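
A minimal sketch of the NaN check, on a small made-up frame rather than the real dataset. It counts missing values per column and shows how many rows would survive a blanket `dropna()`.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in Age and Cabin (made-up values).
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column.
nan_counts = df.isnull().sum()
print(nan_counts)

# Dropping every row that has a NaN anywhere removes far too much.
rows_left = len(df.dropna())
print(len(df), 'rows ->', rows_left, 'rows after dropna()')
```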

#### Fix Age

Plot the `Age` histogram

Fill the missing values with the average age
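
One possible way to do this, sketched on made-up data. Using the training-set mean for both frames is an assumption made here (it avoids using test data to clean the training set); the exercise itself only asks for an average.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two frames in `full_data`.
train = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
test = pd.DataFrame({'Age': [np.nan, 30.0]})
full_data = [train, test]

# Assumption: impute both frames with the mean computed on the training set.
age_mean = train['Age'].mean()

for df in full_data:
    df['Age'] = df['Age'].fillna(age_mean)

print(train['Age'].tolist())
print(test['Age'].tolist())
```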

Use `pd.cut` to bin values into discrete intervals
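
A short sketch of the binning step. The bin edges below are an assumption for illustration; in practice they would be chosen after looking at the histogram. With `labels=False`, `pd.cut` returns the integer index of the bin each value falls into.

```python
import pandas as pd

ages = pd.Series([4, 22, 35, 58, 71])

# Assumed bin edges; intervals are right-closed: (0, 16], (16, 32], ...
bins = [0, 16, 32, 48, 64, 120]

# labels=False yields integer bin codes instead of Interval labels.
age_bin = pd.cut(ages, bins=bins, labels=False)
print(age_bin.tolist())
```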

#### Fix Cabin
Create a new `Has_cabin` feature
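
A minimal sketch on made-up data: the new feature is 1 when a cabin number is recorded and 0 when it is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46', np.nan]})

# 1 if a cabin number is present, 0 if the value is NaN.
df['Has_cabin'] = df['Cabin'].notnull().astype(int)
print(df)
```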

#### Fix Embarked
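
The notebook leaves this step open. One common approach, sketched here as an assumption, is to fill missing ports with the most frequent value and then encode the three ports as integers. The specific integer mapping is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})

# Assumption: replace missing values with the most frequent port.
most_common = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(most_common)

# Assumption: arbitrary integer codes for the three ports.
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
print(df['Embarked'].tolist())
```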

#### Extract other features

Plot `Fare` histogram

Use `pd.cut` to bin values into discrete intervals

Create new `FamilySize` and `IsAlone` features
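
A sketch of the two features on made-up data, assuming that family size counts siblings/spouses, parents/children, and the passenger themselves, and that a passenger is alone when that size is 1.

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Family size = relatives aboard + the passenger.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Alone when nobody else from the family is aboard.
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
print(df)
```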

Map `Sex` into numbers
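
A minimal sketch with `Series.map`. Which sex gets which number is an arbitrary choice made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# Assumed encoding: female -> 0, male -> 1.
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
print(df['Sex'].tolist())
```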

#### Remove unused data

Print dataset columns

Remove the following columns `'PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch'`
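
A sketch of the removal with `DataFrame.drop`, on a single made-up row that only carries the columns needed for the illustration.

```python
import pandas as pd

drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']

# One illustrative row; the real frames have more columns and rows.
df = pd.DataFrame({
    'PassengerId': [1], 'Name': ['Doe, Mr. John'], 'Ticket': ['A/5 0000'],
    'Cabin': [None], 'SibSp': [1], 'Parch': [0],
    'Pclass': [3], 'Sex': [1], 'Age': [1],
})

df = df.drop(columns=drop_cols)
print(df.columns.tolist())
```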

Print `train` and `test` head

### Visualization

#### Plot the Pearson correlation coefficient

In [None]:
colormap = plt.cm.RdBu
plt.figure(figsize=(14,14))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

The Pearson correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
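
The three cases can be checked numerically with `np.corrcoef` on toy data. The last example is a symmetric parabola: Y depends entirely on X, yet the linear correlation is zero.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect increasing line -> r = 1
r_up = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect decreasing line -> r = -1
r_down = np.corrcoef(x, -3 * x + 10)[0, 1]

# Parabola centred on the mean of x -> no linear correlation
r_none = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(round(r_up, 6), round(r_down, 6), round(r_none, 6))
```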

About the dataset:

"One thing that the Pearson Correlation plot can tell us is that there are not too many features strongly correlated with one another. This is good from a point of view of feeding these features into your learning model because this means that there isn't much redundant or superfluous data in our training set and we are happy that each feature carries with it some unique information."

#### Plot the pairplot of each pair of features

### Phase 2 : Let's train

Split the dataset into `X`, `y`, and `X_validation`.

Use the following columns `'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Has_cabin', 'FamilySize', 'IsAlone'` in `X`
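
A sketch of the split, assuming `y` is the `Survived` column of the training frame and `X_validation` holds the same feature columns from the test frame. The frames here are filled with random integers purely as placeholders.

```python
import numpy as np
import pandas as pd

features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
            'Has_cabin', 'FamilySize', 'IsAlone']

# Placeholder frames with random integer codes (not real data).
rng = np.random.RandomState(0)
train = pd.DataFrame(rng.randint(0, 3, size=(6, len(features))), columns=features)
train['Survived'] = rng.randint(0, 2, size=6)
test = pd.DataFrame(rng.randint(0, 3, size=(4, len(features))), columns=features)

X = train[features].values          # training features
y = train['Survived'].values        # training labels
X_validation = test[features].values  # unlabeled features to predict on

print(X.shape, y.shape, X_validation.shape)
```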

Set the `SEED` for reproducibility, and `NFOLDS` for out-of-fold prediction

Create a *logistic regressor* and compute the __accuracy__ using cross validation
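
A minimal sketch of this step. The values of `SEED` and `NFOLDS` are assumptions, and the data comes from `make_classification` as a synthetic stand-in for the cleaned Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

SEED = 0    # assumed value
NFOLDS = 5  # assumed value

# Synthetic stand-in for X and y.
X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

clf = LogisticRegression(max_iter=1000)

# One accuracy score per fold.
scores = cross_val_score(clf, X, y, cv=NFOLDS, scoring='accuracy')
print('accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```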

### Phase 3 : Feature selection

Train different models using the following parameters:

In [None]:
# Random Forest Parameters
rf_params = {
    'random_state' : SEED,
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': True, 
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'verbose': 0
}

# Extra Trees Parameters
et_params = {
    'random_state' : SEED,
    'n_jobs': -1,
    'n_estimators':500,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'random_state' : SEED,
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'random_state' : SEED,
    'n_estimators': 500,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

Create the models: ```RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier```

Fit all the models on `X` and `y`

Create a dataframe with `feature_importances_`
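
A sketch of the fit-and-collect step on synthetic data. To keep it quick, it uses a reduced parameter dict (`small_params`, a name introduced here) instead of the full dicts listed above; the full dicts can be unpacked the same way with `**`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

SEED = 0  # assumed value
cols = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
        'Has_cabin', 'FamilySize', 'IsAlone']

# Synthetic stand-in for the cleaned features and labels.
X, y = make_classification(n_samples=200, n_features=len(cols), random_state=SEED)

# Reduced parameters so the sketch runs fast (hypothetical helper dict).
small_params = {'random_state': SEED, 'n_estimators': 50}

models = {
    'RandomForest': RandomForestClassifier(**small_params),
    'ExtraTrees': ExtraTreesClassifier(**small_params),
    'AdaBoost': AdaBoostClassifier(**small_params),
    'GradientBoosting': GradientBoostingClassifier(**small_params),
}

# One column per model, one row per feature.
importances = pd.DataFrame(
    {name: model.fit(X, y).feature_importances_ for name, model in models.items()},
    index=cols,
)
print(importances.round(3))
```

Calling `importances.plot.bar()` on the resulting frame is one way to compare the models side by side.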

Plot each of them

### Phase 4 : Select a Model

#### Let's make everything a bit more automated
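
One way to read "automated" here, sketched as an assumption: put the candidate classifiers in a dict, score each with the same cross-validation, and rank them. The choice of candidates and the synthetic data are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 0    # assumed value
NFOLDS = 5  # assumed value

X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Illustrative subset of the classifiers imported at the top of the notebook.
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNeighbors': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(random_state=SEED),
    'GaussianNB': GaussianNB(),
}

# Mean cross-validated accuracy per model, best first.
results = pd.Series({
    name: cross_val_score(clf, X, y, cv=NFOLDS, scoring='accuracy').mean()
    for name, clf in candidates.items()
}).sort_values(ascending=False)

print(results)
```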

### Phase 5 : Tune parameters

Let's try different parameters for `GradientBoostingClassifier`

Use `GridSearchCV` to search the best one
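
A small sketch of an exhaustive search. The grid values are assumptions kept deliberately small (8 combinations), and the data is synthetic; `GridSearchCV` fits every combination with cross-validation and exposes the winner through `best_params_` and `best_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

SEED = 0  # assumed value

X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Assumed grid: 2 x 2 x 2 = 8 parameter combinations.
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    param_grid,
    cv=3,
    scoring='accuracy',
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```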

If you are in a hurry... use `RandomizedSearchCV`
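
A sketch of the randomized alternative. Instead of trying every combination, `RandomizedSearchCV` samples `n_iter` settings from the given distributions, so the cost is fixed in advance. The distributions and `n_iter` below are assumptions.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

SEED = 0  # assumed value

X, y = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Assumed search space: integers in [20, 80), depths in [2, 5),
# learning rates uniform on [0.01, 0.21].
param_distributions = {
    'n_estimators': randint(20, 80),
    'max_depth': randint(2, 5),
    'learning_rate': uniform(0.01, 0.2),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=SEED),
    param_distributions,
    n_iter=5,          # only 5 sampled settings are evaluated
    cv=3,
    scoring='accuracy',
    random_state=SEED,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```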

### Notebook credits:

Some lines of code are taken from: 
[Anisotropic on Kaggle](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python)