# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [5]:
# Import the data
data = pd.read_csv('./Resources/lending_data.csv')

In [11]:
# Look at the first five rows of the data
data.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

I expect Logistic Regression to provide the better model for this data set. It is designed for binary outcomes, and the target here, `loan_status`, takes only two values: 0 or 1. A Random Forest Classifier can also handle a binary target, so this is an educated guess rather than a rule.

## Split the Data into Training and Testing Sets

In [7]:
# Define X (features) and y (target) so the data can be split in the next step
# Equivalent, more verbose way to select the features:
# X = data[['loan_size', 'interest_rate', 'borrower_income', 'debt_to_income', 'num_of_accounts', 'derogatory_marks', 'total_debt']]
X = data.drop('loan_status', axis=1)
y = data['loan_status']

In [8]:
# Check the shapes of X and y
print("Shape: ", X.shape, y.shape)

Shape:  (77536, 7) (77536,)


In [9]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
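
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. A minimal sketch of the resulting sizes, assuming the 77,536 rows reported by the shape check (the helper `split_sizes` is hypothetical and written only for illustration):

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    The test set is rounded up and the training set gets the remaining rows,
    which mirrors how scikit-learn rounds a fractional test_size.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(77536)
print(n_train, n_test)  # 58152 19384
```

The test size of 19,384 matches the total `support` in the classification reports printed later in the notebook.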

In [17]:
# Scale the features; the scaler is fit on the training data only and then applied to both sets
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
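
`StandardScaler` subtracts each column's mean and divides by its standard deviation, with both statistics learned from the training data. A plain-Python sketch of that idea on a single column, using the five `borrower_income` values from `data.head()` as stand-in training data and one made-up test value (the function names are hypothetical and not part of scikit-learn):

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in values]

train_income = [52800, 43600, 46100, 52700, 53000]  # values from data.head()
test_income = [49640]                               # made-up value for illustration

mean, std = fit_standardizer(train_income)
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)   # reuses the training statistics

# After scaling, the training column has mean ~0 and standard deviation ~1
print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.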

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [18]:
# Import LogisticRegression, fit it on the scaled training data, and predict on the scaled test data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
y_prediction = classifier.predict(X_test_scaled)

In [20]:
# Print the classification report and the training/testing scores for the Logistic Regression model
from sklearn.metrics import classification_report
print(classification_report(y_test, y_prediction))

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.98      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

Training Data Score: 0.9942908240473243
Testing Data Score: 0.9936545604622369
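
The per-class columns in the report are derived from counts of true positives, false positives, and false negatives. A minimal sketch of those formulas, using made-up counts rather than the actual confusion matrix for this model (the helper `prf` is hypothetical):

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Made-up counts for illustration only; not taken from the model above
precision, recall, f1 = prf(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.90 recall=0.75 f1=0.82
```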


In [21]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
y_forest_prediction = random_forest_classifier.predict(X_test_scaled)

In [23]:
print(classification_report(y_test, y_forest_prediction))

print(f"Training Score: {random_forest_classifier.score(X_train_scaled, y_train)}")
print(f"Testing Score: {random_forest_classifier.score(X_test_scaled, y_test)}")

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.90      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9975409272252029
Testing Score: 0.9917457697069748
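
Because the test set is heavily imbalanced, accuracy alone can look better than it is. A quick sketch, using only the `support` counts from the reports, of what a model that always predicts the majority class would score:

```python
# Support counts per class, copied from the classification reports
support = {0: 18765, 1: 619}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Always predicting {majority_class}: accuracy = {baseline_accuracy:.4f}")
# Always predicting 0: accuracy = 0.9681
```

Both models score about 0.99, so they beat this baseline, but the margin is narrower than the raw accuracy suggests. The precision and recall for class 1 are likely the more informative numbers when comparing the two.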


*Which model performed better? How does that compare to your prediction?*

The Random Forest Classifier yielded a higher training score (0.9975 vs. 0.9943), but Logistic Regression yielded a higher testing score (0.9937 vs. 0.9917). Based on the gap between the training and testing scores for each model, I believe Logistic Regression would perform better in practice, because its gap is smaller: 0.9943 - 0.9937 = 0.0006 for Logistic Regression, versus 0.9975 - 0.9917 = 0.0058 for the Random Forest Classifier. This matches my prediction that Logistic Regression would be the better model.
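
The train/test comparison can be reproduced directly from the printed scores. A small sketch, where the `scores` dictionary copies the values from the outputs above and the names `generalization_gap` and `gaps` are assumptions made for illustration:

```python
def generalization_gap(train_score, test_score):
    """Difference between training and testing accuracy; smaller suggests less overfitting."""
    return train_score - test_score

# (training score, testing score) copied from the printed outputs
scores = {
    "Logistic Regression": (0.9942908240473243, 0.9936545604622369),
    "Random Forest": (0.9975409272252029, 0.9917457697069748),
}

gaps = {name: generalization_gap(train, test) for name, (train, test) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
# Logistic Regression: gap = 0.0006
# Random Forest: gap = 0.0058
```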