In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Import the data

In [3]:
lending = pd.read_csv('Resources/lending_data.csv')
lending.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [4]:
# Validate that every column holds numeric values (check dtypes with df.info()); coerce any object columns to numeric
for col in lending.columns:
    if lending[col].dtype == 'object':
        lending[col] = pd.to_numeric(lending[col], errors='coerce')
# Create X and y by dropping the loan_status column (the 0/1 target)
X = lending.drop('loan_status', axis=1)
y = lending['loan_status'] 
X.head(5)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
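
A minimal sketch of the dtype check, assuming the five rows displayed by `lending.head()` are re-entered by hand. The `Unnamed: 0` index column is omitted, and the names `sample`, `non_numeric_cols`, and `label_counts` are illustrative, not part of the original notebook.

```python
import pandas as pd

# The five rows shown by lending.head(), typed in by hand for illustration
sample = pd.DataFrame({
    "loan_size": [10700.0, 8400.0, 9000.0, 10700.0, 10800.0],
    "interest_rate": [7.672, 6.692, 6.963, 7.664, 7.698],
    "borrower_income": [52800, 43600, 46100, 52700, 53000],
    "debt_to_income": [0.431818, 0.311927, 0.349241, 0.43074, 0.433962],
    "num_of_accounts": [5, 3, 3, 5, 5],
    "derogatory_marks": [1, 0, 0, 1, 1],
    "total_debt": [22800, 13600, 16100, 22700, 23000],
    "loan_status": [0, 0, 0, 0, 0],
})

# Collect any column whose dtype is not numeric
non_numeric_cols = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]
print("Non-numeric columns:", non_numeric_cols)

# Count how often each label occurs; on the full data this shows class balance
label_counts = sample["loan_status"].value_counts()
print(label_counts)
```

Running the same two checks on the full `lending` DataFrame would show whether the `pd.to_numeric` coercion is needed at all and how balanced the two `loan_status` classes are, which matters when interpreting accuracy scores.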


###### Before you create, fit, and score the models, make a prediction as to which model you think will perform better.

Let's first look at what Logistic Regression and the Random Forest Classifier each do. Logistic regression is a classification algorithm that predicts a discrete class or category (e.g., Yes/No, Young/Old, Happy/Sad).
When there are only two classes, this is called a binary outcome. A logistic regression model predicts the dependent variable by modeling its relationship with one or more independent variables. Random Forest is a robust machine learning algorithm that can be used for a variety of tasks, including regression and classification. It is an ensemble method: a random forest model is made up of a large number of decision trees, called estimators, each of which produces its own prediction. The model combines the estimators' predictions, which is usually more accurate than relying on any single tree.

Logistic regression outputs a probability for each class. In statistical analysis it is also used to assess how significantly each independent variable affects that probability. A Random Forest Classifier instead offers feature importances, which summarize how much each feature contributes to the class predictions. Because its decision trees can capture patterns and interactions among the features, a random forest can provide improved accuracy. Each tree classifies a new observation from its input feature vector.
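
The outputs described here can be inspected directly. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the lending data, and the names `X_demo`, `y_demo`, `log_reg`, and `forest` are assumptions introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the lending data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

log_reg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Both models expose per-class probabilities for each observation
log_proba = log_reg.predict_proba(X_demo[:3])
forest_proba = forest.predict_proba(X_demo[:3])
print("Logistic regression probabilities:\n", log_proba)
print("Random forest probabilities:\n", forest_proba)

# Logistic regression: one coefficient per feature
print("Coefficients:", log_reg.coef_)

# Random forest: one importance per feature; the importances sum to 1
print("Feature importances:", forest.feature_importances_)

# The individual decision trees (estimators) that make up the forest
print("Number of trees:", len(forest.estimators_))
```

The random forest probability for a class is the average of the class probabilities predicted by its trees, whereas the logistic regression probability comes from a single fitted function of the features.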

Taking this information into account, my prediction is that Logistic regression will be more accurate. 

In [5]:
# Split the data into X_train, X_test, y_train, y_test

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [7]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

LogisticRegression()

In [8]:
classifier.fit(X_train, y_train)

LogisticRegression()

In [9]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9919177328380795
Testing Data Score: 0.9924680148576145


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
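
The scaler is fit on the training split only and then applied to both splits, so no information from the test set leaks into the scaling. A small sketch of what that does, plus one alternative way to organize the same step with a pipeline, follows. It assumes synthetic data in place of the lending data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the lending data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Fit on the training split only, as in the cell above
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)

# After scaling, each training feature has mean ~0 and standard deviation ~1
train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
print("Train means:", np.round(train_means, 6))
print("Train stds:", np.round(train_stds, 6))

# Alternative: a pipeline fits the scaler on whatever data is passed to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_score = pipe.score(X_te, y_te)
print("Pipeline test accuracy:", pipe_score)
```

Tree-based models such as random forests split on feature thresholds, so scaling generally has little effect on them. It mainly helps the logistic regression solver.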

In [11]:
# Train a Random Forest Classifier model and print the model score

In [12]:
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971970009629936
Testing Score: 0.991900536524969


In [13]:
# Train a Logistic Regression model and print the model score

In [14]:
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9941188609162196
Testing Score: 0.9941704498555509


###### Which model performed better? How does that compare to your prediction? Write down your results and thoughts.

Logistic Regression performed better than the Random Forest on the test set, which supports my prediction that logistic regression would be more accurate. On the scaled data, the Logistic Regression Training Score was 0.9941188609162196 and the Testing Score was 0.9941704498555509. For the Random Forest model, the Training Score was 0.9971970009629936 and the Testing Score was 0.991900536524969. The difference in Testing Score is small, roughly 0.2 percentage points.

The goal of each model was to predict the loan_status label (0 or 1) for each loan. Logistic regression directly estimates the probability that an observation belongs to a class, so the model yields a single, well-defined probability. In a random forest, the probability is more of a by-product of combining many trees, so there are different ways to infer "probabilities" from the model.

Neither model appears to be overfitting, which occurs when a machine learning model fits the training data too closely. Overfitting is usually caused by an overly complex model and shows up as low error on the training set but high error on the test set, also called "high variance". Here the Random Forest Training Score is only about 0.5 percentage points above its Testing Score, and the two Logistic Regression scores are nearly identical. In conclusion, I'm confident in the Logistic Regression model because it scored slightly higher on the test data and is easy to implement and interpret.
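
The comparison can be made concrete by computing the gap between training and testing accuracy for each model. This sketch only re-uses the scores printed in the cells above. The dictionary layout and the names `scores` and `gaps` are illustrative.

```python
# (training score, testing score) as printed earlier in this notebook
scores = {
    "Logistic Regression (scaled)": (0.9941188609162196, 0.9941704498555509),
    "Random Forest (scaled)": (0.9971970009629936, 0.991900536524969),
}

# A large positive gap (train much higher than test) would suggest overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:.6f}")

# Difference between the two models on the test set
test_diff = scores["Logistic Regression (scaled)"][1] - scores["Random Forest (scaled)"][1]
print(f"Logistic Regression test score minus Random Forest test score: {test_diff:.6f}")
```

Both gaps are well under one percentage point. Since accuracy can hide problems when the classes are imbalanced, checking per-class metrics would be a reasonable follow-up before drawing a firm conclusion.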