# Predicting Credit Risk with Python and Plotly

## Random Forests using Python and scikit-learn

The objective of this notebook series is to simulate an analytical workflow between several team members using [Python](https://www.python.org/) and [R](http://www.r-project.org/). The data for this notebook is part of a [Kaggle competition](https://www.kaggle.com/c/GiveMeSomeCredit) released three years ago. The objective is to predict the probability of credit and loan default from a large set of real customer data. The evaluation metric used in the competition was [AUC](https://www.kaggle.com/wiki/AreaUnderCurve). A perfect model scores an AUC of 1, while random guessing scores an AUC of around 0.5.
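
To make the metric concrete, here is a minimal sketch of AUC read as a ranking measure: the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as half. The function name `pairwise_auc` and the toy scores are assumptions made for illustration; they are not part of the competition code.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]  # toy labels, 1 = default
print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # every defaulter ranked above every non-defaulter
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # constant score carries no ranking information
print(pairwise_auc(labels, [0.1, 0.8, 0.2, 0.9]))  # one of the four pairs is misordered
```

The quadratic loop is fine for a handful of rows; on the full dataset a library routine such as scikit-learn's `roc_auc_score` is the practical choice.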

> This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. The goal of this competition is to build a model that borrowers can use to help make the best financial decisions. Historical data are provided on 250,000 borrowers and the prize pool is $5,000.

The top score for the competition was 0.869558, which we will try to match! However, it will be challenging since some of the data from the competition is no longer available.

[Plotly](https://plot.ly) is a platform for making interactive graphs with R, Python, MATLAB, and Excel. In this notebook series, [Plotly](https://plot.ly) can serve as a sharing platform for data, visualizations, and results between analysts, management, and executives on Plotly’s free public cloud. For collaboration and sensitive data, you can run Plotly [on your own servers](https://plot.ly/product/enterprise/).

Need help converting [Plotly](https://plot.ly) graphs from R or Python?
- [R](https://plot.ly/r/user-guide/)
- [Python](https://plot.ly/python/matplotlib-to-plotly-tutorial/)

**This is the second notebook in the series**

- The [first notebook]() explores, cleans, and generates new features for the data.
- The [second notebook]() tests and optimizes the [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model using [Plotly](https://plot.ly) and Python.
<hr>

For this code to run on your machine, you will need to:

- Install some Python libraries: running `sudo pip install <package_name>` from your terminal installs a Python library.

- Register an account with [Plotly](https://plot.ly/feed/) to receive your API key. 

- Download the data for this notebook from the [Kaggle website](https://www.kaggle.com/c/GiveMeSomeCredit).


In this notebook we will train a [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model in Python using [scikit-learn](http://scikit-learn.org/stable/) and fine-tune the parameters to improve our previous results.


In the [previous notebook] we:

1. Created over 40 new features by insightfully combining and transforming existing features
2. Dealt with missing values
3. Standardized the data



Now we can simply load in this data and get right to modelling. Please see the [previous notebook] for the full details. Below we will summarize the data to give you an idea of the features we created.

## Load libraries and data

In [1]:
import pandas as pd
import cufflinks as cf
import numpy as np
import matplotlib.pyplot as plt

Here are the original features given to us from the Kaggle competition.

In [2]:
# Reminder of original feature definitions.
data_dict = pd.read_csv('https://github.com/plotly/datasets/raw/master/data_dictionary.csv')
data_dict.iloc[ : , 0:2]

| | Variable Name | Description |
| --- | --- | --- |
| 0 | SeriousDlqin2yrs | Person experienced 90 days past due delinquenc... |
| 1 | RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lin... |
| 2 | age | Age of borrower in years |
| 3 | NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days p... |
| 4 | DebtRatio | Monthly debt payments, alimony,living costs di... |
| 5 | MonthlyIncome | Monthly income |
| 6 | NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loa... |
| 7 | NumberOfTimes90DaysLate | Number of times borrower has been 90 days or m... |
| 8 | NumberRealEstateLoansOrLines | Number of mortgage and real estate loans inclu... |
| 9 | NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days p... |


Below are the new features we derived using our intuition about the dataset.

In [3]:
dt = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/processed_data.csv')

pd.DataFrame(dt.columns.values)

| | 0 |
| --- | --- |
| 0 | NumberOfOpenCreditLinesAndLoans |
| 1 | log_age |
| 2 | log_income |
| 3 | log_income_person |
| 4 | log_income_age |
| 5 | UnknownIncomeDebtRatio |
| 6 | log_Debt |
| 7 | log_DebtRatio |
| 8 | log_HouseholdSize |
| 9 | log_NumberOfTimes90DaysLate |


Let's plot a few graphs to refresh our memory of the data and showcase the [Cufflinks](http://nbviewer.ipython.org/gist/santosjorge/aba934a0d20023a136c2) plotting library, which is designed to work with pandas and Plotly out of the box.

A full walkthrough of every feature can be found in the [first notebook]() of this series. [Cufflinks](http://nbviewer.ipython.org/gist/santosjorge/aba934a0d20023a136c2) can set a global theme (style) to use. In this case we will use Matplotlib's ggplot style. With cufflinks and pandas, it is very easy to generate interactive visuals.

In [4]:
cf.set_config_file(theme = 'ggplot', world_readable = True)

In [6]:
dt['log_age'].iplot(kind='histogram', bins=20, title = 'Log Age Histogram')

We could also examine the distribution of a few interesting features using box plots. 

> You can use your mouse to view the data or click and drag to zoom. You can also click on the legend icons to toggle features on and off.

In [7]:
dt[['log_age', 'log_DebtRatio', 'log_income', 'NumberOfOpenCreditLinesAndLoans']].iplot(kind = 'box')

# Random Forests

> Random forests is an ensemble method that averages the predictions from many de-correlated trees trained from a sample drawn with replacement. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model. 
- L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
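
The two sources of randomness described in the quote, plus the final averaging, can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names, the toy sizes, and the per-tree probabilities are all made up, and no actual trees are grown.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Row indices for one tree: n_samples draws with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, max_features, rng):
    """Feature indices considered at one split: a random subset without replacement."""
    return sorted(rng.sample(range(n_features), max_features))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples, n_features, max_features = 10, 8, 3  # toy sizes chosen for illustration

rows = bootstrap_indices(n_samples, rng)
cols = candidate_features(n_features, max_features, rng)
out_of_bag = sorted(set(range(n_samples)) - set(rows))  # rows this tree never sees

print("bootstrap rows:    ", rows)
print("out-of-bag rows:   ", out_of_bag)
print("features at split: ", cols)

# Averaging: the forest's probability is the mean of the per-tree probabilities.
tree_probs = [0.2, 0.6, 0.4, 0.8]  # made-up outputs of four trees for one borrower
forest_prob = sum(tree_probs) / len(tree_probs)
print("forest probability:", forest_prob)
```

Here `max_features` plays the same role as the scikit-learn parameter of that name: it caps how many features each split may choose from, which is what de-correlates the trees.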

<hr>

After loading the data, we split it into training and validation sets using a stratified random sample. A stratified sample ensures that we maintain class proportions between the two new datasets.
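
As a minimal sketch of what "stratified" means here, the snippet below splits each class separately so that both parts keep the same share of defaulters. The function `stratified_split` and the 10% positive rate are assumptions for illustration; the notebook itself relies on scikit-learn's splitter.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a 10% positive rate (an illustrative figure, not the dataset's).
labels = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(labels, test_size=0.1)

def positive_rate(idx):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)

print(len(train_idx), len(test_idx))
print(positive_rate(train_idx), positive_rate(test_idx))
```

A plain random split of a rare class can, by chance, leave the validation set with very few defaulters, which makes the AUC estimate noisy; stratifying removes that source of variation.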

We will tune the model with grid search using 5-fold stratified cross-validation on the training set, then test the final result on the validation set that we set aside.

In [8]:
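# Note: sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; their contents now live in sklearn.model_selection.
from sklearn import cross_validation, grid_search

# Separate the target from the features
Y = dt['SeriousDlqin2yrs']
X = dt.drop(['SeriousDlqin2yrs'], axis = 1)

# Stratified 90/10 split into training and validation sets
train_valid = cross_validation.StratifiedShuffleSplit(Y.values, test_size = 0.1)

# One split is enough, so take the first one the iterator yields.
# The indices are positional, hence .iloc rather than .loc.
train_index, test_index = next(iter(train_valid))
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]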

First let's train a random forest with the default parameters.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, auc, roc_curve

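# Create RF classifier. class_weight = 'auto' reweights classes inversely to
# their frequency; later scikit-learn releases replace it with 'balanced'.
clf = RandomForestClassifier(class_weight = 'auto', n_jobs = 4)

# Fit the classifier on the training data
clf.fit(X_train, y_train)

# Predict the response using the test dataset.
# predict() returns hard 0/1 labels, so the AUC below is computed from class
# labels; clf.predict_proba(X_test)[:, 1] would score predicted probabilities.
pred = clf.predict(X_test)

# Examine the AUC score
roc_auc_score(y_test, pred)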

0.55863049559615785
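
The cell above passes the output of `predict`, which is a hard 0/1 label, to `roc_auc_score`. Because AUC measures how well a model ranks borrowers, it is usually computed from predicted probabilities instead. The sketch below compares the two on a small synthetic dataset; the data, the variable names, and the use of `sklearn.model_selection` (scikit-learn 0.18 or later) are assumptions made for illustration, so the numbers it prints say nothing about the credit data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (roughly 7% positives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                     weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                                  random_state=0)
demo_clf.fit(X_tr, y_tr)

hard_labels = demo_clf.predict(X_te)               # 0/1 decisions
default_prob = demo_clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_probs = roc_auc_score(y_te, default_prob)
print("AUC from hard labels:   %.3f" % auc_from_labels)
print("AUC from probabilities: %.3f" % auc_from_probs)
```

Hard labels collapse every borrower into one of two scores, so most of the ranking information is lost; that is one reason a label-based AUC can sit close to 0.5 even when the underlying model ranks reasonably well.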

Let's try to improve the score by tuning the parameters using scikit-learn's [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html) class and 5 stratified [K-folds](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html).

The three parameters we will set are:

1. `max_features`: The number of features to consider when looking for the best split. This is searched over the grid.
2. `min_weight_fraction_leaf`: The minimum weighted fraction of the input samples required to be at a leaf node. This is fixed at 0.001 rather than searched.
3. `n_estimators`: The number of trees in the forest. This is searched over the grid.

We will also tell [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html) to score each parameter combination using [AUC](https://www.kaggle.com/wiki/AreaUnderCurve).
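
To get a feel for the cost of the search, the sketch below enumerates a grid by hand. It uses the candidate values from this notebook; the enumeration itself is only an illustration of what a grid search iterates over, not scikit-learn's implementation.

```python
from itertools import product

# Candidate values mirroring the grid used in this notebook.
param_grid = {"max_features": [5, 7, 9], "n_estimators": [200, 300, 400]}
n_folds = 5

names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

for combo in combinations:
    print(combo)
print("combinations:", len(combinations))
print("model fits:  ", len(combinations) * n_folds)
```

Every combination is fitted once per fold, so the number of forests grows multiplicatively with each extra parameter or candidate value. With the default `refit = True`, one more fit on the full training set follows for the best combination.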

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Set up stratified cv methods
skf = cross_validation.StratifiedKFold(y_train, n_folds = 5)

# Create RF classifier
clf = RandomForestClassifier(class_weight = 'auto', min_weight_fraction_leaf = 0.001, n_jobs = 8)

# Tuning options
max_f = [5, 7, 9]
n_trees = [200, 300, 400]

# Create gridsearch cv classifier
grid_search_clf = grid_search.GridSearchCV(
                       estimator = clf, 
                       param_grid = dict(max_features = max_f, n_estimators = n_trees), 
                       cv = skf, 
                       scoring = 'roc_auc',
                       n_jobs = 4
                    )

In [None]:
# Run the grid search
grid_search_clf.fit(X_train, y_train)

Let's look at the results. Below we compute:

1. The best parameters found by the grid search.
2. The cross-validated AUC scores for each parameter combination in the grid search.
3. The final AUC on the validation set.

In [None]:
print("Best parameters set found on development set:")
print('')
print(grid_search_clf.best_params_)
print('')
print("Grid scores on development set:")
print('')
for params, mean_score, scores in grid_search_clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() * 2, params))
print('')

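print("AUC on the held-out validation set:")
print('')
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print('')
# AUC ranks borrowers by predicted probability of default, so score the
# positive-class probabilities rather than hard 0/1 labels.
y_true, y_pred = y_test, grid_search_clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_true, y_pred))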
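
The cells above rely on `sklearn.cross_validation`, `sklearn.grid_search`, and the `grid_scores_` attribute, all of which were removed in scikit-learn 0.20. On newer releases the same workflow can be written with `sklearn.model_selection` and `cv_results_`. The sketch below is a hedged equivalent, not the original run: it uses a small synthetic dataset and a reduced grid (both assumptions) so that it finishes in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic stand-in so the sketch runs quickly; not the credit data.
X_demo, y_demo = make_classification(n_samples=600, n_features=12, n_informative=6,
                                     weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight='balanced',
                                     min_weight_fraction_leaf=0.001,
                                     random_state=0),
    param_grid={"max_features": [3, 5], "n_estimators": [20, 40]},  # reduced grid
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
results = search.cv_results_
for mean, std, params in zip(results["mean_test_score"],
                             results["std_test_score"],
                             results["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```

`cv_results_` is a dictionary of arrays with one entry per parameter combination, which is why the loop zips three of its columns instead of unpacking tuples as `grid_scores_` allowed.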

We can see that the cross-validated training AUC is noticeably higher than the validation AUC. This is a sign that we are over-fitting our training data, or that we have too few loan defaulters in the test set. To correct this we could try a more complex class-balancing technique such as [SMOTE](https://www.jair.org/media/953/live-953-2037-jair.pdf).
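
The core idea of SMOTE is to create synthetic minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below shows only that interpolation step, simplified to a single nearest neighbour; the helper names and the points are made up for illustration, and a maintained implementation would be preferable in practice.

```python
import random

def nearest_neighbour(sample, others):
    """Closest other minority point by squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(sample, o)))

def interpolate_minority(sample, neighbour, rng):
    """One synthetic point on the segment between a minority sample and a neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbour)]

rng = random.Random(0)
# Made-up minority-class points in two features, for illustration only.
minority = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0]]

synthetic = []
for i, point in enumerate(minority):
    others = minority[:i] + minority[i + 1:]
    neighbour = nearest_neighbour(point, others)
    synthetic.append(interpolate_minority(point, neighbour, rng))

for point in synthetic:
    print(point)
```

Any oversampling of this kind should be applied to the training folds only, so that the validation set keeps the original class proportions.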

The final AUC does, however, beat the decision tree model from the [previous notebook].

## Plotting ROC Curve

Below we plot the ROC curve for this model using [Plotly](https://plot.ly/python/matplotlib-to-plotly-tutorial/). This time we build the figure in matplotlib and then convert it with the [matplotlib converter](https://plot.ly/python/matplotlib-to-plotly-tutorial/).
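
A ROC curve is traced by lowering the decision threshold step by step and recording the false positive and true positive rates at each step. The sketch below does this by hand on made-up scores (the function `roc_points` and the data are assumptions for illustration) and shows that hard 0/1 labels leave only one interior point, whereas continuous scores give a full curve.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold one distinct score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 0, 1, 1]
probabilities = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]  # made-up model scores
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print(roc_points(y_true, probabilities))
print(roc_points(y_true, hard_labels))
```

For this reason the scores passed to `roc_curve` are normally predicted probabilities or decision values rather than class labels.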

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import plotly.plotly as py
import plotly.tools as tls

plt.rcParams['figure.figsize'] = (10, 5)

# Compute the ROC curve and the area under it for the validation set
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)

ROC_CURVE = plt.figure()
# Label the curve so that the legend has an entry to show
plt.plot(fpr, tpr, label = 'ROC curve (area = %0.3f)' % roc_auc)
# Diagonal reference line: the expected curve for random guessing
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc = "lower right")

Now we can convert this plot to an interactive [Plotly](https://plot.ly) object. 

Please see the Plotly Python [user guide](https://plot.ly/python/overview/#in-%5B37%5D) for more insight on how to update plot parameters. 

> Don't forget you can also easily edit the chart properties using the Plotly GUI by clicking the "Play with this data!" link below the chart.

In [None]:
py.iplot_mpl(ROC_CURVE)
