# Challenge: If a tree falls in the forest

1. Build the best decision tree you can.
2. Try to match its score with the simplest random forest you can. For our purposes, measure simplicity by runtime.
3. Compare the forest's runtime to that of the decision tree.
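
Since runtime is the yardstick here, it can help to time fitting and scoring separately and in the same way for every model. The helper below is a minimal sketch of that idea, not part of the original notebook: the name `time_fit_and_score` is hypothetical, and it assumes only that the model exposes scikit-learn style `fit` and `score` methods.

```python
import time


def time_fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit `model`, score it, and return (score, fit_seconds, score_seconds).

    Hypothetical helper: works with any object that has scikit-learn style
    `fit(X, y)` and `score(X, y)` methods.
    """
    # time.perf_counter() is a monotonic clock intended for measuring intervals.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    score = model.score(X_test, y_test)
    score_seconds = time.perf_counter() - start

    return score, fit_seconds, score_seconds
```

A call such as `time_fit_and_score(DecisionTreeClassifier(), X_train, y_train, X_test, y_test)` would then report how much of the total time is spent training versus predicting.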

In [4]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#For selecting features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

#Import Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Time
import time

In [2]:
#Will use the same dataset from the previous lesson
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)

# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

# Remove two summary rows at the end that don't actually contain data.
y2015 = y2015[:-2]


In [3]:
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

#Split the data into training and validation
X_train, X_test, y_train, y_test = train_test_split(X,Y)

In [8]:
#Create single Tree
start_time = time.time()
dt = DecisionTreeClassifier()
model = dt.fit(X_train, y_train)
#prediction = dt.predict(X_test)
print(dt.score(X_test, y_test))
print("--- %s seconds ---" % (time.time() - start_time))

0.966743925376
--- 48.40509796142578 seconds ---


In [9]:
#Create Forest
start_time = time.time()
rfc = RandomForestClassifier()
model = rfc.fit(X_train, y_train)
#prediction = rfc.predict(X_test)
print(rfc.score(X_test, y_test))
print("--- %s seconds ---" % (time.time() - start_time))

0.981058950928
--- 21.044145822525024 seconds ---


Interestingly, the random forest scored higher (0.981 vs. 0.967) and ran faster (about 21 seconds vs. 48 seconds) than the single tree. Let's reduce the number of features and see whether that changes the picture.

In [10]:
features = X[['out_prncp','last_pymnt_amnt','last_pymnt_d_Dec-2016','total_rec_prncp','last_pymnt_d_Jan-2017']]

#Split the data into training and validation
X_train, X_test, y_train, y_test = train_test_split(features,Y)
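
The five columns above are picked by hand. As an alternative, `SelectKBest` (imported earlier but not used) can rank features by a univariate score and keep the top `k`. The sketch below runs on synthetic data from `make_classification` as a stand-in for the loan table, so the feature names and the choice of `k=5` are assumptions for illustration only. It uses `f_classif` rather than `chi2` because `chi2` requires non-negative feature values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the loan data (assumption: 20 numeric features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
feature_names = np.array([f"feature_{i}" for i in range(X_demo.shape[1])])

# Keep the 5 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_small = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the original columns.
selected_names = feature_names[selector.get_support()]
print(X_small.shape)
print(list(selected_names))
```

On the real data, the same selector could in principle be fit on `X_train` only and then applied to `X_test` with `selector.transform`, so that the test rows do not influence which features are kept.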

In [11]:
#Create single Tree
start_time = time.time()
dt = DecisionTreeClassifier()
model = dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))
print("--- %s seconds ---" % (time.time() - start_time))

0.965328571157
--- 3.1175496578216553 seconds ---


In [12]:
#Create Forest
start_time = time.time()
rfc = RandomForestClassifier()
model = rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))
print("--- %s seconds ---" % (time.time() - start_time))

0.973849193533
--- 9.554113149642944 seconds ---


With the reduced feature set, the random forest takes about three times as long as the decision tree (9.6 seconds vs. 3.1 seconds) to produce a slightly better score (0.974 vs. 0.965).
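
To pursue step 2 of the challenge, one option is to shrink the forest and watch where its score stops matching the single tree. The sketch below sweeps `n_estimators` (a real `RandomForestClassifier` parameter) on synthetic data, again as a stand-in for the loan table. The candidate sizes and the fixed `random_state` are assumptions, and the timings it prints will not resemble those in this notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: a single decision tree, timed on fit only.
tree = DecisionTreeClassifier(random_state=0)
start = time.perf_counter()
tree.fit(X_tr, y_tr)
tree_seconds = time.perf_counter() - start
tree_score = tree.score(X_te, y_te)
print(f"tree: score={tree_score:.3f} fit={tree_seconds:.4f}s")

# Sweep forest sizes from smallest to largest and record score and fit time.
results = []
for n_trees in (1, 2, 5, 10, 20):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    start = time.perf_counter()
    forest.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    forest_score = forest.score(X_te, y_te)
    results.append((n_trees, forest_score, fit_seconds))
    print(f"forest n_estimators={n_trees}: score={forest_score:.3f} fit={fit_seconds:.4f}s")
```

The smallest `n_estimators` whose score reaches the tree's score is then a candidate for the "simplest" forest. Reusing one train/test split for both models, as done here, keeps the comparison from being affected by differences between random splits.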