# Credit Risk Evaluator

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
lending_data_df = pd.read_csv('Resources/lending_data.csv')
lending_data_df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

> My prediction is that the Random Forest Classifier will be a better predictor of loan status because it combines many decision trees to get a more accurate prediction. The Logistic Regression model may be too simple (it groups too many factors together), whereas the Random Forest Classifier can use the decisions of its individual trees to make a better prediction.
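
To make the idea of "merging decision trees" concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the lending data). It relies on scikit-learn's documented behavior that a forest's class probabilities are the mean of its trees' probabilities. The dataset size and `n_estimators=25` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the lending dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)

forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# Each fitted tree produces its own class probabilities ...
per_tree_proba = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# ... and the forest reports their average.
averaged_proba = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(X_demo)

print(f"Forest output matches the tree average: {np.allclose(averaged_proba, forest_proba)}")
```

Averaging many trees trained on different bootstrap samples is what tends to make a forest more stable than any single tree.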

## Split the Data into Training and Testing Sets

In [3]:
# Split the data into X_train, X_test, y_train, y_test
X = lending_data_df.drop('loan_status', axis=1)
y = lending_data_df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
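
With no extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing. The notebook does not show the class counts for `loan_status`, so the sketch below is only an illustration: it uses hypothetical labels (900 zeros, 100 ones) to show the default split next to the optional `stratify` argument, which keeps the class proportions roughly equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 900 zeros and 100 ones (not the lending data)
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Default split: 75% train / 25% test, shuffled, no stratification
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Stratified split: each subset keeps roughly the same share of positives
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(f"Test size (default): {len(y_te)} of {len(y_demo)}")
print(f"Positive share, default test set:    {y_te.mean():.3f}")
print(f"Positive share, stratified test set: {y_te_s.mean():.3f}")
```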

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [4]:
# Train a Logistic Regression model and print the model score
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377
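
The Logistic Regression above is fit on the raw features, while the Random Forest cells in this notebook use scaled features. One optional variation, sketched here on synthetic data rather than the lending data, is to wrap `StandardScaler` and `LogisticRegression` in a pipeline so both steps are fit together. The `max_iter=1000` value and the dataset shape are assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in X and y from the notebook)
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The pipeline fits the scaler on the training data only, then the classifier
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

print(f"Training Data Score: {scaled_lr.score(X_tr, y_tr)}")
print(f"Testing Data Score: {scaled_lr.score(X_te, y_te)}")
```

Because the scaler lives inside the pipeline, calling `score` on the test set applies the training-set scaling automatically.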


In [5]:
# Train a Random Forest Classifier model (n_estimators=500) and print the model score
# X_train, X_test, y_train, y_test come from the split made earlier with random_state=1.
# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Data Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Data Score: {clf.score(X_test_scaled, y_test)}')

Training Data Score: 0.9975409272252029
Testing Data Score: 0.9917457697069748


In [7]:
# Train a Random Forest Classifier model (n_estimators=5) and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=5).fit(X_train_scaled, y_train)
print(f'Training Data Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Data Score: {clf.score(X_test_scaled, y_test)}')

Training Data Score: 0.9966295226303481
Testing Data Score: 0.9911267024349979


*Which model performed better? How does that compare to your prediction? Replace the text in this markdown cell with your answers to these questions.*

> Both models scored about 0.99 on both the training and the testing data.
>
> The Logistic Regression model performed very well (testing score 0.9918), finding the relationship between the independent variables and the dependent variable (loan status) very quickly.
>
> The Random Forest Classifier also performed very well (testing score 0.9917 with 500 estimators), although it took significantly longer to train. One way to speed it up is to reduce the number of estimators from 500 to 5. That also produced a good result (testing score 0.9911) and would be acceptable here: the training and testing scores differ by less than 0.01, so there is no sign of serious overfitting.
>
> In conclusion, my prediction that the Random Forest Classifier would do better did not hold: the two models are essentially tied on the testing data. Either model is fine, but the Logistic Regression model is simple and fast, and sometimes it is best not to overcomplicate things.
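
The speed and overfitting observations can be checked with numbers rather than impressions. The sketch below is one way to do that, run on synthetic data instead of the lending data. `fit_and_report` is a hypothetical helper written for this example, and the estimator counts (100 and 5) are illustrative choices kept small so the cell runs quickly.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the lending dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)


def fit_and_report(model):
    """Fit a model; return fit time, train/test accuracy, and their gap."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    train_score = model.score(X_tr, y_tr)
    test_score = model.score(X_te, y_te)
    return {
        "fit_seconds": elapsed,
        "train": train_score,
        "test": test_score,
        "gap": train_score - test_score,  # a large gap hints at overfitting
    }


candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest_100": RandomForestClassifier(n_estimators=100, random_state=1),
    "random_forest_5": RandomForestClassifier(n_estimators=5, random_state=1),
}

results = {name: fit_and_report(model) for name, model in candidates.items()}

for name, r in results.items():
    print(
        f"{name}: fit {r['fit_seconds']:.3f}s, "
        f"train {r['train']:.4f}, test {r['test']:.4f}, gap {r['gap']:.4f}"
    )
```

Timings vary by machine and run, so treat them as a rough comparison rather than a benchmark.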