# Credit Risk Evaluator

In [108]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [109]:
# Import the data
lending_data = pd.read_csv("lending_data.csv")
lending_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [110]:
# Count the null values in each column
for column in lending_data.columns:
    print(f"{column} has {lending_data[column].isnull().sum()} null values")

loan_size has 0 null values
interest_rate has 0 null values
borrower_income has 0 null values
debt_to_income has 0 null values
num_of_accounts has 0 null values
derogatory_marks has 0 null values
total_debt has 0 null values
loan_status has 0 null values


In [111]:
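# Separate the target (loan_status) from the features
y = lending_data['loan_status'].values
X = lending_data.drop('loan_status', axis=1)

target = lending_data["loan_status"]
# Class names used in the classification reports (0 = negative, 1 = positive)
target_names = ["negative", "positive"]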

## Predict Model Performance

Based on my knowledge, I believe that the logistic regression model will perform better than the random forest classifier because logistic regression is designed to predict a binary outcome (0 or 1), which matches the `loan_status` target.
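
To make the "binary outcome" reasoning concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a 0/1 label. The weights, bias, and feature rows are hypothetical values chosen for illustration; they are not taken from the model fitted in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one row of scaled features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical weights and bias, for illustration only
weights = [1.2, 0.8]
bias = -0.5

# Hypothetical scaled feature rows
low_risk = [-1.0, -0.5]   # score = -0.5 - 1.2 - 0.4 = -2.1
high_risk = [2.0, 1.5]    # score = -0.5 + 2.4 + 1.2 = 3.1

print(predict_label(low_risk, weights, bias))
print(predict_label(high_risk, weights, bias))
```

A negative score maps to a probability below 0.5 and is labeled 0; a positive score maps above 0.5 and is labeled 1.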


## Split the Data into Training and Testing Sets

In [112]:
# Split the data into X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [113]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
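
The scaler is fit on the training data only, and the same statistics are then applied to the test data. A minimal per-column sketch of that idea follows, assuming standardization with the population standard deviation (which is what `StandardScaler` uses). The helper names and the loan sizes are hypothetical and serve only as an illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned from the training column."""
    return [(value - mean) / std for value in column]

# Hypothetical loan sizes, for illustration only
train_loan_size = [8400.0, 9000.0, 10700.0, 10800.0]
test_loan_size = [9500.0, 12000.0]

mean, std = fit_scaler(train_loan_size)
train_scaled = transform(train_loan_size, mean, std)
test_scaled = transform(test_loan_size, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics; this keeps information about the test set out of the preprocessing step.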

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [114]:
# Import LogisticRegression and create the model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier

LogisticRegression(max_iter=10000)

In [115]:
classifier.fit(X_train_scaled, y_train)

LogisticRegression(max_iter=10000)

In [116]:
# Print the Logistic Regression model's score on the training and testing data
print(f"Train Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Test Data Score: {classifier.score(X_test_scaled, y_test)}")

Train Data Score: 0.9941188609162196
Test Data Score: 0.9941704498555509


In [117]:
# Predict the labels for the test data
y_pred = classifier.predict(X_test_scaled)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [118]:
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(X_test_scaled)
confusion_matrix(y_test, y_pred)

array([[18690,   102],
       [   11,   581]], dtype=int64)

In [119]:
# Determine accuracy
# ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
# (18690 + 581) / (18690 + 102 + 11 + 581)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9941704498555509
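
The unpacking order matters here. scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1, `.ravel()` yields the counts as tn, fp, fn, tp. The sketch below checks the accuracy by hand, using a plain nested list in place of the NumPy array and the counts from the logistic regression confusion matrix printed earlier.

```python
# Confusion matrix printed earlier for the logistic regression model.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
matrix = [[18690, 102],
          [11, 581]]

# Flattening row by row mirrors what .ravel() does on the NumPy array
tn, fp, fn, tp = [count for row in matrix for count in row]

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"Accuracy: {accuracy:.4f}")  # Accuracy: 0.9942
```

The hand-computed value agrees with the test score of 0.9941704498555509 reported by `classifier.score`, as expected, since `score` returns accuracy for a classifier.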


In [121]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       1.00      0.99      1.00     18792
    positive       0.85      0.98      0.91       592

    accuracy                           0.99     19384
   macro avg       0.93      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384
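
The `positive` row of the report can be reproduced from the confusion matrix counts. This is a small sketch with a hypothetical helper, `positive_class_metrics`, using the logistic regression counts shown earlier.

```python
def positive_class_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the loans predicted positive, how many really were
    recall = tp / (tp + fn)      # of the truly positive loans, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the logistic regression confusion matrix
precision, recall, f1 = positive_class_metrics(tn=18690, fp=102, fn=11, tp=581)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.85 recall=0.98 f1=0.91
```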



In [122]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

In [123]:
classifier = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Train Score: {classifier.score(X_train_scaled, y_train)}')
print(f'Test Score: {classifier.score(X_test_scaled, y_test)}')

Train Score: 0.9971970009629936
Test Score: 0.991900536524969


In [124]:
y_pred = classifier.predict(X_test_scaled)

In [125]:
confusion_matrix(y_test, y_pred)

array([[18694,    98],
       [   59,   533]], dtype=int64)

In [126]:
# Calculate the accuracy
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn) 
print(f"Accuracy: {accuracy}")

Accuracy: 0.991900536524969


In [127]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

    negative       1.00      0.99      1.00     18792
    positive       0.84      0.90      0.87       592

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.93     19384
weighted avg       0.99      0.99      0.99     19384
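
The `macro avg` and `weighted avg` rows differ because the classes are very unbalanced (18792 negative vs. 592 positive samples). The sketch below recomputes both averages of recall by hand from the random forest confusion matrix counts.

```python
# Random forest confusion matrix counts: rows are true labels, columns are predictions
tn, fp, fn, tp = 18694, 98, 59, 533

# Recall per class, with the number of true samples (support) in each class
recall_negative, support_negative = tn / (tn + fp), tn + fp
recall_positive, support_positive = tp / (tp + fn), tp + fn

# Macro average: every class counts equally
macro_recall = (recall_negative + recall_positive) / 2

# Weighted average: each class counts in proportion to its support
total = support_negative + support_positive
weighted_recall = (
    recall_negative * support_negative + recall_positive * support_positive
) / total

print(f"macro recall: {macro_recall:.2f}")        # macro recall: 0.95
print(f"weighted recall: {weighted_recall:.2f}")  # weighted recall: 0.99
```

The weighted average is dominated by the large negative class, so the macro average is the more sensitive indicator of how well the minority (positive) class is handled.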



## Results


For the logistic regression model, I got a training data score of 0.9941188609162196 and a testing data score of 0.9941704498555509.
For the random forest classifier model, I got a training data score of 0.9971970009629936 and a testing data score of 0.991900536524969.
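The logistic regression model performed better on the test data: it had the higher testing score (0.9942 vs. 0.9919) and the higher recall on the positive class (0.98 vs. 0.90). The random forest scored higher only on the training data. This coincided with my initial prediction.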
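
To put both scores in context, the sketch below compares each model's accuracy with a trivial baseline that labels every loan as 0, using only the confusion matrix counts and supports reported in this notebook.

```python
# Confusion-matrix counts from this notebook, as (tn, fp, fn, tp)
models = {
    "logistic regression": (18690, 102, 11, 581),
    "random forest": (18694, 98, 59, 533),
}

# Test-set support from the classification reports
negatives, positives = 18792, 592

# Accuracy of a trivial model that labels every loan as 0
baseline_accuracy = negatives / (negatives + positives)
print(f"always-predict-0 baseline: {baseline_accuracy:.4f}")

results = {}
for name, (tn, fp, fn, tp) in models.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    results[name] = accuracy
    print(f"{name}: accuracy={accuracy:.4f}, missed positives={fn}")
```

Because only about 3% of the test loans are positive, the baseline already reaches roughly 0.97 accuracy. The number of missed positives (11 vs. 59) separates the two models more clearly than accuracy alone.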