# Decision Tree & Random Forest Project #

For this project we will explore publicly available data from [LendingClub.com](https://www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.

We will use lending data from 2007-2010 and try to classify whether or not each borrower paid back their loan in full.

## Import Libraries ##

In [17]:
%matplotlib notebook

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Get the data ##

In [19]:
df = pd.read_csv('loan_data.csv')

**Check out the info(), head(), and describe() methods on the data.**

In [20]:
df.head()

,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit.policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64  
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64  
 11  delinq.2yrs        9578 non-null   int64  
 12  pub.rec            9578 non-null   int64  
 13  not.fully.paid     9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


In [22]:
df.describe()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
count,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0
mean,0.80497,0.12264,319.089413,10.932117,12.606679,710.846314,4560.767197,16913.96,46.799236,1.577469,0.163708,0.062122,0.160054
std,0.396245,0.026847,207.071301,0.614813,6.88397,37.970537,2496.930377,33756.19,29.014417,2.200245,0.546215,0.262126,0.366676
min,0.0,0.06,15.67,7.547502,0.0,612.0,178.958333,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.1039,163.77,10.558414,7.2125,682.0,2820.0,3187.0,22.6,0.0,0.0,0.0,0.0
50%,1.0,0.1221,268.95,10.928884,12.665,707.0,4139.958333,8596.0,46.3,1.0,0.0,0.0,0.0
75%,1.0,0.1407,432.7625,11.291293,17.95,737.0,5730.0,18249.5,70.9,2.0,0.0,0.0,0.0
max,1.0,0.2164,940.14,14.528354,29.96,827.0,17639.95833,1207359.0,119.0,33.0,13.0,5.0,1.0
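
The `not.fully.paid` column is a 0/1 flag, so its mean in the summary above (0.160054) is the fraction of loans that were not fully paid, roughly 16%. The sketch below illustrates this on a synthetic stand-in for the column rather than on `loan_data.csv` itself; the count of 1533 is derived from 0.160054 × 9578 and is an assumption of this example, not a value read from the file.

```python
import pandas as pd

# Synthetic stand-in for df['not.fully.paid']: 9578 rows, 1533 of them equal to 1.
# The 1533 figure is derived from the describe() output (0.160054 * 9578), not read from the file.
n_rows = 9578
n_not_paid = 1533
target = pd.Series([1] * n_not_paid + [0] * (n_rows - n_not_paid), name='not.fully.paid')

# For a 0/1 column, the mean equals the fraction of rows in class 1.
positive_fraction = target.mean()

# value_counts(normalize=True) reports the share of each class explicitly.
class_shares = target.value_counts(normalize=True)

print(round(positive_fraction, 6))
print(class_shares)
```

On the real data, `df['not.fully.paid'].value_counts(normalize=True)` would give the same kind of breakdown. The imbalance matters later when reading the accuracy figures.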


# Exploratory Data Analysis #

Let's visualize the data using seaborn and pandas' built-in plotting capabilities.

**Creating a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.**

In [23]:
plt.figure(figsize=(9,6))
# Each label must match the value used in its filter
df[df['credit.policy']==0]['fico'].hist(alpha=0.5, color='blue', bins=30, label='Credit.Policy=0')
df[df['credit.policy']==1]['fico'].hist(alpha=0.5, color='red', bins=30, label='Credit.Policy=1')
plt.legend()
plt.xlabel('FICO')

Text(0.5, 0, 'FICO')

**Creating a similar figure, except this time select by the not.fully.paid column.**

In [24]:
plt.figure(figsize=(9,6))
# Each label must match the value used in its filter
df[df['not.fully.paid']==0]['fico'].hist(alpha=0.5, color='blue', bins=30, label='not.fully.paid=0')
df[df['not.fully.paid']==1]['fico'].hist(alpha=0.5, color='red', bins=30, label='not.fully.paid=1')
plt.legend()
plt.xlabel('FICO')

Text(0.5, 0, 'FICO')

**Creating a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.**

In [13]:
sns.set_style('darkgrid')
plt.figure(figsize=(9,6))
sns.countplot(x='purpose', data=df, hue='not.fully.paid')

<AxesSubplot:xlabel='purpose', ylabel='count'>

**The trend between FICO score and interest rate.**

In [7]:
sns.jointplot(x='fico', y='int.rate', data=df, color='blue')

<seaborn.axisgrid.JointGrid at 0x249f8a5c3d0>

**The trend between FICO score and interest rate, split into columns by not.fully.paid and colored by credit.policy.**

In [28]:
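# lmplot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only open an extra empty figure.
sns.lmplot(x='fico', y='int.rate', data=df, hue='credit.policy', col='not.fully.paid', palette='Set1')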

<seaborn.axisgrid.FacetGrid at 0x249830271c0>

# Setting up the data #

Let's set up the data for the Random Forest Classification Model!

## Categorical Features ##

The purpose column is categorical, so we need to transform it into dummy variables that sklearn can work with.

In [29]:
cat_feats = ['purpose']

In [30]:
final_data = pd.get_dummies(df, columns=cat_feats, drop_first=True)

In [56]:
final_data.head()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,1,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0,0,1,0,0,0,0
1,1,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0,1,0,0,0,0,0
2,1,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0,0,1,0,0,0,0
3,1,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0,0,1,0,0,0,0
4,1,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0,1,0,0,0,0,0
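
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three rows are made up for illustration, and the `all_other` category is an assumption of this example (a placeholder that sorts first alphabetically), not something shown in the output above.

```python
import pandas as pd

# Toy frame for illustration only; 'all_other' is an assumed placeholder category.
toy = pd.DataFrame({
    'fico': [737, 707, 682],
    'purpose': ['debt_consolidation', 'credit_card', 'all_other'],
})

# One indicator column is created per category, in sorted order;
# drop_first=True removes the first one ('all_other' here).
encoded = pd.get_dummies(toy, columns=['purpose'], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

The row whose purpose is the dropped category is represented by zeros (or `False`) in every remaining dummy column, so no information is lost. Recent pandas versions return boolean dummy columns rather than the 0/1 integers shown above; passing `dtype=int` to `get_dummies` reproduces the integer form.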


## Train Test Split ##

**Now it's time to split our data into a training set and a testing set.**

In [57]:
from sklearn.model_selection import train_test_split

In [63]:
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
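
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with 9578 rows, the row count reported by `df.info()`. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the loan data (9578); the values are arbitrary.
X_demo = np.arange(9578).reshape(-1, 1)
y_demo = np.zeros(9578, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

# With a float test_size, the test set gets ceil(0.30 * 9578) rows and the rest go to training.
print(len(X_tr), len(X_te))  # 6704 2874
```

The 2874 test rows match the total support in the classification reports further down. Because the target is imbalanced, passing `stratify=y` to `train_test_split` is one option for keeping the class proportions the same in both sets; the split in this notebook does not use it.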

## Training a Decision Tree Model ##

Let's start by training a single decision tree first.

**Import DecisionTreeClassifier**

In [64]:
from sklearn.tree import DecisionTreeClassifier

**Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.**

In [65]:
dtree = DecisionTreeClassifier()

In [66]:
dtree.fit(X_train, y_train)

DecisionTreeClassifier()

## Predictions and Evaluation of Decision Tree ##

**Create predictions from the test set, then create a classification report and a confusion matrix.**

In [69]:
predictions = dtree.predict(X_test)

In [72]:
from sklearn.metrics import classification_report, confusion_matrix

In [78]:
print(classification_report(y_test, predictions))
print('\n')
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.86      0.82      0.84      2431
           1       0.20      0.25      0.22       443

    accuracy                           0.73      2874
   macro avg       0.53      0.53      0.53      2874
weighted avg       0.75      0.73      0.74      2874



[[1987  444]
 [ 334  109]]
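
The report and the confusion matrix describe the same predictions, and the class 1 row of the report can be recomputed by hand from the matrix. The sketch below does this in plain Python using the numbers printed above; in sklearn's layout, rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above; rows are true classes, columns are predicted classes.
cm = [[1987, 444],
      [334, 109]]
tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

# Metrics for class 1 (loans that were not fully paid).
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Overall accuracy: correct predictions over all test rows.
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"{precision_1:.2f} {recall_1:.2f} {f1_1:.2f} {accuracy:.2f}")  # 0.20 0.25 0.22 0.73
```

In other words, the single tree catches only about a quarter of the loans that were not fully paid, and only about one in five of its class 1 predictions is correct.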


## Training the Random Forest model ##


Now it's time to train our random forest model!

**Create an instance of the RandomForestClassifier class and fit it to our training data from the previous step.**

In [79]:
from sklearn.ensemble import RandomForestClassifier

In [80]:
rfc = RandomForestClassifier()

In [82]:
rfc.fit(X_train,y_train)

RandomForestClassifier()

## Predictions and Evaluation ##

Let's predict the labels for X_test and evaluate the predictions against the true y_test values.

**Predicting the class of not.fully.paid for the X_test data.**

In [83]:
predictions = rfc.predict(X_test)

**Now creating a classification report from the results.**

In [84]:
from sklearn.metrics import classification_report, confusion_matrix

In [88]:
print(classification_report(y_test, predictions))
print('\n')
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

           0       0.85      0.99      0.92      2431
           1       0.46      0.02      0.05       443

    accuracy                           0.85      2874
   macro avg       0.65      0.51      0.48      2874
weighted avg       0.79      0.85      0.78      2874



[[2418   13]
 [ 432   11]]
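
The random forest's 0.85 accuracy should be read against the class imbalance. As a rough comparison, the sketch below uses only the counts printed in the two reports to compute the accuracy of a trivial model that always predicts class 0, alongside the class 1 recall of each model. The always-predict-0 baseline is an illustrative yardstick added here, not something fitted in this notebook.

```python
# Test-set class counts from the support column above.
n_paid, n_not_paid = 2431, 443
n_total = n_paid + n_not_paid

# Accuracy of a trivial model that always predicts class 0 (fully paid).
baseline_accuracy = n_paid / n_total

# Random forest confusion matrix printed above (rows: true class, columns: predicted class).
rf_cm = [[2418, 13], [432, 11]]
rf_accuracy = (rf_cm[0][0] + rf_cm[1][1]) / n_total
rf_recall_1 = rf_cm[1][1] / n_not_paid

# Decision tree confusion matrix from the earlier section.
dt_cm = [[1987, 444], [334, 109]]
dt_recall_1 = dt_cm[1][1] / n_not_paid

print(f"baseline accuracy:      {baseline_accuracy:.4f}")
print(f"random forest accuracy: {rf_accuracy:.4f}")
print(f"class 1 recall, forest: {rf_recall_1:.3f}")
print(f"class 1 recall, tree:   {dt_recall_1:.3f}")
```

On these numbers the forest's accuracy is essentially the same as the baseline's, and it flags only 11 of the 443 loans that were not fully paid, fewer than the single tree's 109. Which model is preferable therefore depends on whether overall accuracy or recall on class 1 matters more to the investor.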
