Much of the approach below is inspired by Jeremy Howard's fast.ai course.
I've mainly focused on model interpretation using Permutation Feature Importance, Partial Dependence Plots and SHAP values.

In [1]:
# !pip install --upgrade pip
# !pip install fastai==0.7.0
# !pip install shap eli5 pdpbox  ## needed for the interpretation sections below
## Based on Fast.ai ML course

%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
import re
import math

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import IPython
from IPython.display import display

from fastai.imports import *
from fastai.structured import train_cats, apply_cats, proc_df, draw_tree  ## fastai 0.7 helpers used below
from pandas_summary import DataFrameSummary

from sklearn import metrics
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

import shap
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp, get_dataset, info_plots

print(os.listdir("../input/"))

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
train_df.head()

In [None]:
test_df.head()

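We'll use a Random Forest Classifier, which needs every column to be numeric. Some columns are categorical, so we first convert them to pandas categories with `train_cats` and then apply the same category mapping to the test set with `apply_cats`.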

In [None]:
train_cats(train_df)
apply_cats(test_df, train_df)

We'll replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable. fastai to the rescue again!
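
To make these steps concrete, here is a rough pandas-only sketch of the same idea. It is an approximation for illustration, not fastai's actual implementation; the toy frame and the helper name `encode_frame` are made up.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
})

def encode_frame(df):
    """Hypothetical helper: category codes, plus median fill with a flag column."""
    out = df.copy()
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Category codes start at 0 and use -1 for missing values;
            # shifting by 1 makes "missing" map to 0.
            out[col] = out[col].astype("category").cat.codes + 1
        elif out[col].isnull().any():
            # Remember where the value was missing, then fill with the median.
            out[col + "_na"] = out[col].isnull()
            out[col] = out[col].fillna(out[col].median())
    return out

encoded = encode_frame(toy)
print(encoded)
```

The `Age_na` flag column in this sketch corresponds to the `*_na` columns that show up after `proc_df` in the cells below.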

In [None]:
df_trn, y_trn, nas = proc_df(train_df, 'Survived')
df_test, _, _ = proc_df(test_df, na_dict=nas)
df_trn.head()

In [None]:
df_test.head()

In [None]:
## proc_df added *_na indicator columns, and the train and test sets ended up with different numbers of columns, so we drop the indicators
df_trn.drop(['Age_na'], axis=1, inplace=True)
df_test.drop(['Age_na', 'Fare_na'], axis=1, inplace=True)
df_test.head()

### Defining function to calculate the evaluation metric

In [None]:
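def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    ## Prints [train RMSE, validation RMSE, train accuracy, validation accuracy, OOB score (if available)]
    ## With 0/1 labels, RMSE is the square root of the misclassification rate
    res = [rmse(m.predict(train_X), train_y), rmse(m.predict(val_X), val_y),
                m.score(train_X, train_y), m.score(val_X, val_y)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)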
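
Since the target here is a 0/1 label, the RMSE carries the same information as accuracy: each squared error is 1 for a wrong prediction and 0 for a correct one. The small check below uses made-up labels, and the name `rmse_plain` is only chosen to avoid clashing with `rmse` above.

```python
import math

def rmse_plain(pred, actual):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Made-up labels: 2 of the 8 predictions are wrong.
actual = [0, 1, 1, 0, 1, 0, 0, 1]
pred = [0, 1, 0, 0, 1, 1, 0, 1]

error_rate = sum(p != a for p, a in zip(pred, actual)) / len(actual)
accuracy = 1 - error_rate

# For 0/1 labels the two quantities below are equal.
print(rmse_plain(pred, actual), math.sqrt(error_rate), accuracy)
```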

### Split the data into training and validation sets

In [None]:
train_X, val_X, train_y, val_y = train_test_split(df_trn, y_trn, test_size=0.33, random_state=42)

We can now pass this processed data frame to a Random Forest Classifier.

Initially, let's fit just a single shallow decision tree so that we can visualize it properly.

In [None]:
%%time
m = RandomForestClassifier(n_estimators=1, min_samples_leaf=10, n_jobs=-1, max_depth=3, oob_score=True) ## n_jobs=-1 uses all available CPUs
m.fit(train_X, train_y)

print_score(m)

In [None]:
draw_tree(m.estimators_[0], train_X, precision=3)

Even a single shallow decision tree does not perform badly. You can read more about the gini impurity metric used to choose the splits [here](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity).
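
As a quick illustration of that metric, gini impurity is 1 minus the sum of squared class proportions in a node. The helper name `gini_impurity` and the example labels below are made up.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

mixed_node = gini_impurity([0, 0, 1, 1])  # evenly mixed node
pure_node = gini_impurity([1, 1, 1, 1])   # pure node
print(mixed_node, pure_node)
```

A pure node scores 0, and an evenly mixed two-class node scores 0.5, the maximum for a binary target.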

Now, let's bag a collection of trees to create a random forest.

In [None]:
%%time
m = RandomForestClassifier(n_estimators=20, min_samples_leaf=10, max_features=0.7, n_jobs=-1, oob_score=True) ## n_jobs=-1 uses all available CPUs
m.fit(train_X, train_y)

print_score(m)

## Permutation importance of features

In [None]:
perm = PermutationImportance(m, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())

Some features, such as Parch, Cabin, Fare and Embarked, get negative weights: shuffling their values did not hurt the validation score and even improved it slightly, which suggests they add noise rather than signal. We can try dropping them to see if the performance improves. Before that, let's also check the partial dependence plots for each feature.
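
The idea behind these weights can be sketched by hand: shuffle one column at a time and measure how much the score drops. This is a minimal illustration on made-up data, not eli5's implementation; `permutation_importance_sketch` and `toy_predict` are hypothetical names, with `toy_predict` standing in for `m.predict`.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores.append(np.mean(predict(X_perm) == y))
        drops.append(baseline - np.mean(scores))
    return np.array(drops)

# Made-up data: the label depends only on column 0; column 1 is pure noise.
data_rng = np.random.default_rng(42)
X_toy = data_rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Hypothetical stand-in for m.predict: looks only at column 0."""
    return (X[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(importances)
```

A feature the model ignores gets an importance of zero here. With a real model and a finite validation set, such features can land slightly below zero by chance.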

## Partial Dependence Plots

In [None]:
for feat_name in val_X.columns:
    ## Isolate the effect of one feature at a time on the model's predictions
    pdp_dist = pdp.pdp_isolate(model=m, dataset=val_X, model_features=val_X.columns.tolist(), feature=feat_name)
    pdp.pdp_plot(pdp_dist, feat_name)
    plt.show()
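
What a partial dependence curve measures can be sketched without pdpbox: force one feature to a fixed value for every row, average the predictions, and repeat over a grid of values. The model and data below are made up, and `partial_dependence_sketch` and `toy_proba` are hypothetical names.

```python
import numpy as np

def partial_dependence_sketch(predict_proba, X, feature_idx, grid):
    """Average predicted probability with one column forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(predict_proba(X_mod).mean())
    return np.array(averages)

def toy_proba(X):
    """Hypothetical model: probability rises with column 0 and ignores column 1."""
    return 1.0 / (1.0 + np.exp(-X[:, 0]))

pdp_rng = np.random.default_rng(0)
X_pdp = pdp_rng.normal(size=(200, 2))
grid = np.linspace(-2, 2, 5)

pd_col0 = partial_dependence_sketch(toy_proba, X_pdp, 0, grid)
pd_col1 = partial_dependence_sketch(toy_proba, X_pdp, 1, grid)
print(pd_col0)
print(pd_col1)
```

A rising curve (column 0) means the prediction depends on the feature, while a flat one (column 1) means the model is insensitive to it.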

## SHAP values for selected rows

In [None]:
explainer = shap.TreeExplainer(m)
shap_values = explainer.shap_values(val_X)

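# visualize a single prediction's explanation (use matplotlib=True to avoid Javascript)
# For a classifier, older shap versions return one array of SHAP values per class, so the regression-style call below does not apply as-is
# shap.force_plot(explainer.expected_value, shap_values[1,:], val_X.iloc[1,:], matplotlib=True)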

In [None]:
shap.summary_plot(shap_values, val_X, plot_type="bar")
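
For intuition about what a SHAP value is, the brute-force sketch below computes exact Shapley values for a tiny made-up model by enumerating all feature subsets, filling in "absent" features from a background sample. This is one common formulation and only practical for a handful of features; it is not what `TreeExplainer` does internally, and `shapley_values_sketch` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial

def shapley_values_sketch(f, x, background):
    """Exact Shapley values of f at x, averaging absent features over background rows."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; the rest come from each background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += f(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi

def toy_model(row):
    """Hypothetical scoring function: additive in feature 0, interaction of 1 and 2."""
    return 2.0 * row[0] + row[1] * row[2]

background = [[0, 0, 0], [1, 1, 1], [2, 0, 1]]  # made-up reference rows
x = [3, 2, 3]                                   # made-up row to explain

phi = shapley_values_sketch(toy_model, x, background)
base_value = sum(toy_model(row) for row in background) / len(background)
print(phi, base_value)
```

The values add up to the difference between the prediction for `x` and the average prediction over the background rows, which is the property the SHAP plots rely on.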

### Updating model to remove less important features

In [None]:
%%time
df_trn.drop(['Embarked', 'Fare', 'Cabin', 'Parch'], axis=1, inplace=True)
df_test.drop(['Embarked', 'Fare', 'Cabin', 'Parch'], axis=1, inplace=True)
train_X, val_X, train_y, val_y = train_test_split(df_trn, y_trn, test_size=0.33, random_state=42)
m = RandomForestClassifier(n_estimators=20, min_samples_leaf=10, max_features=0.7, n_jobs=-1, oob_score=True) ## Use all CPUs available
m.fit(train_X, train_y)
print_score(m)

There is not much difference in the RMSE, but we do see a slight improvement in the validation set accuracy and the OOB score.
Let's make predictions on the test set using this model!

## Submitting Predictions

In [None]:
pred = m.predict(df_test)
submission = pd.read_csv('../input/gender_submission.csv')
submission.head()

In [None]:
submission['Survived'] = pred
submission.to_csv('rf_submission_v2.csv', index=False)
