# Exercises in Advanced models
We will look into Random Forest and Boosting models and try to tune their hyperparameters. For these exercises we will use the Titanic dataset to predict the survival of passengers.

Start out by executing the following cell, which will load `titanic_train.csv` and turn it into features and labels.

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
train0 = pd.read_csv('../../day_1/exercises/titanic_train.csv')

# Create labels
y_train = train0['Survived']

# Columns to perform one hot encoding on
ohe_cols = ['Sex', 'Embarked']

# String and index columns to drop
drop_cols = ['Cabin', 'Ticket', 'Name', 'PassengerId', 'Survived']

# Create OHE features
train_ohe = pd.get_dummies(train0, columns = ohe_cols)

# Drop string cols
train_drop = train_ohe.drop(drop_cols, axis = 1)

# Impute missing values
imputer = SimpleImputer()
imputed_vals = imputer.fit_transform(X = train_drop)
X_train = pd.DataFrame(imputed_vals, columns = train_drop.columns)

Load the `RandomForestClassifier` from `sklearn.ensemble`.

In [None]:
#ANS
from sklearn.ensemble import RandomForestClassifier

We are going to perform a grid search with cross validation using a built-in class from `sklearn`. Import `GridSearchCV` from `sklearn.model_selection`, and define a parameter grid for the Random Forest model. We will be tuning the number of trees, `n_estimators`, and the depth of the trees, `max_depth`. Initialize an instance of `GridSearchCV` using the grid you defined. Set `scoring` to `"accuracy"` and the number of folds to 5, which is done using the `cv` argument.
As we have seen earlier, a grid can be defined in the following way:
```python
par_grid = {'A': [1,2,3], 'B': [10,100,150]}
```
Be aware that the number of combinations in the grid is the product of the number of values for each parameter, so it grows exponentially with the number of parameters!
`GridSearchCV` takes an estimator as its first argument. An estimator is a model object such as `RandomForestClassifier()`.
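
To make the growth of the grid concrete, the sketch below counts the combinations using only the standard library. The extra parameter `C` and the variable `n_folds` are illustrative assumptions, not recommendations.

```python
from itertools import product

# Hypothetical grid with two hyperparameters and three values each
par_grid = {'A': [1, 2, 3], 'B': [10, 100, 150]}

# Every combination of values is one candidate model
combinations = [dict(zip(par_grid, values)) for values in product(*par_grid.values())]
n_candidates = len(combinations)

# With k-fold cross validation each candidate is fitted once per fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates)  # 3 * 3 = 9 candidates
print(n_fits)        # 9 * 5 = 45 fits

# Adding a third hyperparameter with three values triples the work
par_grid['C'] = [0.1, 0.5, 1.0]
n_candidates_3 = len(list(product(*par_grid.values())))
print(n_candidates_3)  # 3 * 3 * 3 = 27 candidates
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate one more time on the full training data.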

In [None]:
#ANS
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
par_grid = {'n_estimators': [10, 100, 150], 'max_depth': [1, 3, 5]}

# Initialize model
clf = RandomForestClassifier()

# Initialize cross validation grid search
gcv = GridSearchCV(clf, par_grid, scoring = "accuracy", cv = 5, return_train_score=True)

Fit the initialized `GridSearchCV` object on the `X_train` and `y_train` data from the Titanic dataset. Identify the best set of parameters using the `best_params_` attribute. The scores for all classifiers can be found with
```python
pd.DataFrame(fit_res.cv_results_)
```
where `fit_res` is the object returned by the `fit` method.
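
As a minimal sketch of how the results table can be read, the example below runs a small grid search on synthetic data. The data, the single decision tree and the names `demo_grid` and `demo_search` are assumptions made for illustration, and are not part of the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Small illustrative grid for a single decision tree
demo_grid = {'max_depth': [1, 3, 5]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid,
                           scoring="accuracy", cv=5)

# fit returns the fitted search object itself
fit_res = demo_search.fit(X_demo, y_demo)

# One row per parameter combination, ordered by the cross-validated rank
results = pd.DataFrame(fit_res.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
print(fit_res.best_params_)
```

Note that `mean_test_score` refers to the held-out folds of the cross validation, not to a separate test set.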

In [None]:
#ANS

# Fit model
gcv.fit(X_train, y_train)

# Best hyperparam
print(gcv.best_params_)

# Show table of scores
pd.DataFrame(gcv.cv_results_)

Now we will apply the best model to the test set. Correct the cell below and run it to create a test dataset. Notice that a method in the impute section should be replaced. What should it be replaced with? Can we ensure that no information is leaked?

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer
test0 = pd.read_csv('../../day_1/exercises/titanic_test_new.csv', index_col = 0)

# Separate labels
y_test = test0['Survived']
X_test0 = test0.drop('Survived', axis=1)

# Columns to perform one hot encoding on
ohe_cols = ['Sex', 'Embarked']

# String and index columns to drop (the label was separated above)
drop_cols = ['Cabin', 'Ticket', 'Name', 'PassengerId']

# Create OHE features
test_ohe = pd.get_dummies(X_test0, columns = ohe_cols)

# Drop string cols
test_drop = test_ohe.drop(drop_cols, axis = 1)

# Impute missing values
X_test = pd.DataFrame(
    imputer.SOME_METHOD(X = test_drop), #### REPLACE SOME_METHOD with a real method ####
    columns = test_drop.columns)

In [None]:
#ANS
import pandas as pd
from sklearn.impute import SimpleImputer
test0 = pd.read_csv('../../day_1/exercises/titanic_test_new.csv', index_col = 0)

# Separate labels
y_test = test0['Survived']
X_test0 = test0.drop('Survived', axis=1)

# Columns to perform one hot encoding on
ohe_cols = ['Sex', 'Embarked']

# String and index columns to drop (the label was separated above)
drop_cols = ['Cabin', 'Ticket', 'Name', 'PassengerId']

# Create OHE features
test_ohe = pd.get_dummies(X_test0, columns = ohe_cols)

# Drop string cols
test_drop = test_ohe.drop(drop_cols, axis = 1)

# Impute missing values
X_test = pd.DataFrame(
    imputer.transform(X = test_drop), # Use transform (not fit_transform) to reuse the imputer fitted on the training data
    columns = test_drop.columns)
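
The difference between `transform` and `fit_transform` can be seen on a toy example. The sketch below uses made-up arrays (`train_demo`, `test_demo`) and a separate `demo_imputer`, all of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (assumption): one feature with a missing value in each set
train_demo = np.array([[1.0], [3.0], [np.nan]])
test_demo = np.array([[100.0], [np.nan]])

# Fit on the training data only: the learned mean is (1 + 3) / 2 = 2
demo_imputer = SimpleImputer()
demo_imputer.fit(train_demo)

# transform reuses the training mean, so the test values do not influence it
test_filled = demo_imputer.transform(test_demo)
print(test_filled)  # the missing test value is filled with 2.0

# fit_transform on the test data would instead learn the mean from the test set
refit_filled = SimpleImputer().fit_transform(test_demo)
print(refit_filled)  # the missing test value is filled with 100.0
```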

Use the `GridSearchCV` object that you trained earlier, to predict on the test data. Calculate the accuracy on the test set.

In [None]:
#ANS

# Predict
pred = gcv.predict(X_test)

# Evaluate score
print("Test accuracy is: {}".format(np.sum(pred == y_test)/y_test.shape[0]))

# Boosting
We will now do a similar exercise using XGBoost. Make sure to execute the cells above that generate the training and testing datasets, as these will be reused.
Import the `XGBClassifier` from `xgboost` and initialize the model. Create a parameter grid for the variables `max_depth`, `learning_rate` and `n_estimators`. As before, don't use too many values!
Then do a grid search using `GridSearchCV` again. Make sure to set the parameter `return_train_score=True`.

You can check out all the parameters using this link: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

NB: If you are getting a lot of warnings when running the GridSearchCV with XGBoost, try adding `'eval_metric': ["logloss"]` to the parameter grid.

In [None]:
#ANS

from sklearn.model_selection import GridSearchCV

# Import xgboost
from xgboost import XGBClassifier

# Initialize model
xgb = XGBClassifier()

# Define hyperparameter grid
par_grid = {'n_estimators': [10, 30, 50], 'max_depth': [1, 3], 'learning_rate': [0.01, 0.1]}

# Initialize cross validation grid search
gcv_b = GridSearchCV(xgb, par_grid, scoring = "accuracy", cv = 5, return_train_score=True)

Fit the model on the training data and extract the score and best parameters. Check out the `mean_train_score` and `mean_test_score`. How do they compare?
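
One way to compare the two scores is to compute their difference for each parameter combination. The sketch below does this on synthetic data with a single decision tree instead of XGBoost. The data, the estimator and the column name `gap` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A shallow and an unrestricted tree, to contrast a simple and a flexible model
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           {'max_depth': [1, None]},
                           scoring="accuracy", cv=5, return_train_score=True)
demo_search.fit(X_demo, y_demo)

scores = pd.DataFrame(demo_search.cv_results_)[
    ['param_max_depth', 'mean_train_score', 'mean_test_score']].copy()

# A large positive gap suggests the model fits the training folds
# better than it generalizes to the held-out folds
scores['gap'] = scores['mean_train_score'] - scores['mean_test_score']
print(scores)
```

The `mean_train_score` column is only present when `return_train_score=True` is set.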

In [None]:
#ANS

# Fit model
gcv_b.fit(X_train, y_train)

# Best hyperparam
print(gcv_b.best_params_)

# Show table of scores
pd.DataFrame(gcv_b.cv_results_)

Go back and change your parameter grid and see if you can produce a better score. Afterwards, evaluate the best model on the test set and calculate the accuracy.

In [None]:
#ANS

# Predict
pred = gcv_b.predict(X_test)

# Evaluate score
print("Test accuracy is: {}".format(np.sum(pred == y_test)/y_test.shape[0]))

# Part 2

## Feature importance XGBoost
Complex models such as Random Forest and Boosting are hard to interpret compared to single classification trees. This is due to the sheer number of trees and, in the case of boosting, the interaction between the trees. To get an idea of which variables the model deems important, we can plot the feature importance.

We will use the boosting model. If you didn't complete the boosting exercise, run the answer cell. The first step is to extract the best boosting model found above. This is done using the `best_estimator_` attribute on the `GridSearchCV` object. Name the extracted object `model`.

In [None]:
#ANS
model = gcv_b.best_estimator_

Use the built-in function `plot_importance` which can be imported from the `xgboost` module. The function takes the model as the first argument. Do a lookup in the documentation (Shift+Tab) and figure out how to plot both `weight` and `gain`.

In [None]:
from xgboost import plot_importance

In [None]:
#ANS
from xgboost import plot_importance
import matplotlib.pyplot as plt

# Create a high resolution axis object to plot on
f, ax = plt.subplots(dpi = 200)

# Plot weight on axis object
plot_importance(model, importance_type = 'weight', ax = ax)

# Show plot
plt.show()

f, ax = plt.subplots(dpi = 200)
plot_importance(model, importance_type = 'gain', ax = ax)
plt.show()

Why are only some of the features shown in all the plots?

## Feature importance for Random Forest
In the Random Forest implementation in `sklearn`, the `feature_importances_` attribute only provides an impurity-based importance, which is comparable to the gain importance. As before, the actual model object can be extracted using the `best_estimator_` attribute on the `GridSearchCV` object. Extract this and make a plot as before, by using the function below. Notice that it returns two values.
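
As a minimal sketch of what the `feature_importances_` attribute contains, the example below fits a small forest on synthetic data. The data and the feature names in `demo_names` are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_names = np.array(['f0', 'f1', 'f2', 'f3', 'f4'])  # hypothetical feature names

demo_rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One impurity-based importance per feature, normalized to sum to 1
importances = demo_rf.feature_importances_
print(importances.sum())

# Order the features from most to least important
order = np.argsort(importances)[::-1]
for name, score in zip(demo_names[order], importances[order]):
    print(name, round(score, 3))
```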

A function that returns two values is used like this:
```python
def ret_2_vals(x):
    return x, x + 1

val0, val1 = ret_2_vals(1)

print(val0)  # prints 1
print(val1)  # prints 2
```

In [None]:
def rf_extract_feat_imp(model, column_names):
    '''
    Function to extract feature importance from a Random Forest model object.
    
    model: A fitted RandomForest model.
    column_names: The column names of the input dataframe used to fit the model.
                  Can be extracted from the dataframe as `df.columns`.
    
    Returns an array with feature names and an array with the corresponding feature
    importance scores.
    '''
    # Get feature importances
    score = model.feature_importances_
    
    # Sort feature names according to score
    score_dict = sorted(zip(score, column_names), reverse = True)
    
    # Extract sorted scores and feature names in separate lists
    scores, feature_names = zip(*score_dict)
    
    return np.array(feature_names), np.array(scores)

In [None]:
#ANS

import seaborn as sns
import matplotlib.pyplot as plt

# Extract model
model_rf = gcv.best_estimator_

# Get feature names and scores
feature_names, scores = rf_extract_feat_imp(model_rf, X_train.columns)

# Plot
plt.figure(dpi=200)
sns.barplot(x=scores, y=feature_names)  # keyword arguments are required in recent seaborn versions
plt.title('Random Forest - Gain')
plt.show()

# Bonus exercise

For boosting, the built-in function determines how to plot the feature importance. However, if we want to plot the importances ourselves, or use the values for other purposes, we will have to extract them manually. The feature importances can be extracted in the following way:
```python
score_dict = model.get_booster().get_score(importance_type = "FEAT_IMP_TYPE")
```
where `FEAT_IMP_TYPE` can be either `weight` or `gain`. Create a bar plot of each of these.
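
The returned dictionary maps feature names to scores, and features that are never used in a split are typically absent from it. The sketch below uses a hand-written dictionary with made-up values (not real model output) and a hypothetical column list to show how the scores can be sorted and how missing features can be given a score of zero.

```python
# Hand-written stand-in for the output of get_score (values are made up)
score_dict = {'Fare': 12.0, 'Age': 30.0, 'Pclass': 7.0}

# Hypothetical full list of input columns
all_columns = ['Pclass', 'Age', 'SibSp', 'Fare']

# Give every column a score, using 0 for columns missing from the dictionary
full_scores = {col: score_dict.get(col, 0.0) for col in all_columns}

# Sort from highest to lowest score
ranked = sorted(full_scores.items(), key=lambda item: item[1], reverse=True)
feature_names = [name for name, _ in ranked]
scores = [score for _, score in ranked]

print(feature_names)  # ['Age', 'Fare', 'Pclass', 'SibSp']
print(scores)         # [30.0, 12.0, 7.0, 0.0]
```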

Alternatively, you can use the function below to extract the scores. Notice that it returns two values.

In [None]:
def xgboost_extract_feat_imp(model, importance_type):
    '''
    Function to extract feature importance from xgboost model object.
    
    model: A fitted XGBoost model.
    importance_type: The type of feature importance to extract; should be either 'weight' or 'gain'.
    
    Returns an array with feature names and an array with the corresponding feature
    importance scores.
    '''
    
    score_dict = model.get_booster().get_score(importance_type = importance_type)
    
    # Sort feature names according to score
    feature_names = sorted(score_dict, key = score_dict.__getitem__, reverse = True)
    
    # Create a sorted list of scores
    scores = [score_dict[z] for z in feature_names]
    
    return np.array(feature_names), np.array(scores)

In [None]:
#ANS
import seaborn as sns
import matplotlib.pyplot as plt

# Solution

# List of different feature importance types
feat_imp_types = ['gain', 'weight']

# Create a plot for each type
for imp_type in feat_imp_types:
    
    # Use the function to extract feature importances
    feature_names, scores = xgboost_extract_feat_imp(model, imp_type)
    
    # Plot
    plt.figure(dpi=200)
    sns.barplot(x=scores, y=feature_names)  # keyword arguments are required in recent seaborn versions
    plt.title(imp_type)
    plt.show()

In [None]:
#CONFIG
# Hide code tagged with #ANS
from IPython.display import HTML
HTML('''<script>
function code_hide() {
    var cells = IPython.notebook.get_cells()
    cells.forEach(function(x){ if(x.get_text().includes("#ANS")){
        if (x.get_text().includes("#CONFIG")){

        } else{
            x.input.hide()
            x.output_area.clear_output()
        }

        
    }
    })
}
function code_hide2() {
    var cells = IPython.notebook.get_cells();
    cells.forEach(function(x){
    if( x.cell_type != "markdown"){
        x.input.show()      
    }
    
        });
} 
$( document ).ready(code_hide);
$( document ).ready(code_hide2);
</script>
<form action="javascript:code_hide()"><input type="submit" value="Hide answers"></form>
<form action="javascript:code_hide2()"><input type="submit" value="Show answers"></form>''')