# Credit Risk Evaluator
Lending services companies allow individual investors to partially fund personal loans as well as buy and sell notes backing the loans on a secondary market.

You will be using this data to create machine learning models to classify the risk level of given loans. Specifically, you will be comparing the Logistic Regression model and Random Forest Classifier.

In [1]:
# Import dependencies.
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import classification_report

## Retrieve the Data
The data is located in the Challenge Files Folder:

* `lending_data.csv`

In [2]:
# Import the 'lending_data.csv' file as a Pandas DataFrame.
df = pd.read_csv('Resources/lending_data.csv')

In [3]:
# Confirm that the import was successful by displaying the DataFrame.
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forests Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! 

Make a prediction on which model will perform better on the data. Justify the prediction with information about the models.

> **Prediction**: The Logistic Regression model will perform better.

> **Justification**: The Logistic Regression model is a classification model that makes a prediction on binary outcomes (i.e., probability of 0 or 1) from data. The Random Forests Classifier is another form of a classification model that selects important features and ignores the rest to make a prediction on binary outcomes. This is important if the logistic regression model is overfitting the training data. I predict that all of the features in this dataset will be deemed important in the risk classification of a loan, and thus the Logistic Regression model will perform better overall.

## Split the Data into Training and Testing Sets

In [4]:
# Create the features DataFrame, X, by removing the 'loan_status' column.
X = df.drop('loan_status', axis=1)

In [5]:
# Create y, the labels set, by using the loan_status column
y = df['loan_status']

In [6]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Preprocess Data

In [7]:
# Use the standard scaler to scale the X_train and X_test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [8]:
# Create and train a Logistic Regression model.
from sklearn.linear_model import LogisticRegression
logistic_clf = LogisticRegression()
logistic_clf.fit(X_train_scaled, y_train)
y_pred = logistic_clf.predict(X_test_scaled)

In [9]:
# Score the Logistic Regression model.
print(classification_report(y_test, y_pred))

print(f"Training Data Score: {logistic_clf.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {logistic_clf.score(X_test, y_test)}")

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.98      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384

Training Data Score: 0.9942908240473243
Testing Data Score: 0.9680664465538589


  f"X has feature names, but {self.__class__.__name__} was fitted without"


In [10]:
# Create and train a Random Forest Classifier model.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
y_pred = rf_clf.predict(X_test_scaled)

In [11]:
# Score a Random Forest Classifier model.
print(classification_report(y_test, y_pred))

print(f'Training Score: {rf_clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {rf_clf.score(X_test_scaled, y_test)}')

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.90      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9975409272252029
Testing Score: 0.9917457697069748


State which model performed better. Compare the actual model performance with your predictions.

*Reducing both false positives and false negatives are important in predicting risky loans: false positives suggest that the model predicted the loans as risky, but ended up being not risky, whereas false negatives suggest that the model predicted the loans as not risky, but was risky. Despite both having extremely high training scores, the Random Forest Classifier model had a higher testing data score than that of the Logistic Regression model (0.992 > 0.968). Therefore, I believe the Random Forest Classifier model performed better than the Logistic Regression model.*