# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
path = Path('Resources/lending_data.csv')
df = pd.read_csv(path)
df.tail()

| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 77531 | 19100.0 | 11.261 | 86600 | 0.65358 | 12 | 2 | 56600 | 1 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 | 1 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 | 1 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 | 1 |
| 77535 | 15600.0 | 9.742 | 72300 | 0.585062 | 9 | 2 | 42300 | 1 |


## Predict Model Performance

I predict that the random forest classifier will be more accurate than logistic regression at predicting risk level (a loan status of 0 or 1), because the factors that determine loan status may not bear equally on risk. A random forest can capture interactions between features. Logistic regression, on the other hand, does not characterize complex relationships between multiple features well: it assumes that the log-odds of the outcome are a linear function of the features.
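To illustrate the reasoning behind this prediction, here is a minimal sketch on a hypothetical four-point XOR dataset (not the lending data). It brute-forces a coarse grid of weights for a linear decision boundary, which is the kind of boundary logistic regression draws, and compares the best result with a hand-written depth-2 decision tree. All names and values below are assumptions made for illustration.

```python
from itertools import product

# Four XOR points: the label is 1 only when exactly one feature is "on".
# This is a hypothetical toy dataset, not the lending data.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def linear_predict(w1, w2, b, x):
    """Classify with a linear boundary, as logistic regression does at a 0.5 threshold."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def tree_predict(x):
    """A hand-written depth-2 decision tree: split on feature 0, then on feature 1."""
    if x[0] == 0:
        return 1 if x[1] == 1 else 0
    return 0 if x[1] == 1 else 1


def accuracy(predict):
    """Fraction of the four points that a prediction function labels correctly."""
    return sum(predict(x) == label for x, label in points) / len(points)


# Try every combination of weights and bias on a coarse grid: -3.0, -2.5, ..., 3.0.
grid = [i / 2 for i in range(-6, 7)]
best_linear = max(
    accuracy(lambda x, w1=w1, w2=w2, b=b: linear_predict(w1, w2, b, x))
    for w1, w2, b in product(grid, repeat=3)
)

print(f"Best linear accuracy on XOR: {best_linear}")               # 0.75
print(f"Decision tree accuracy on XOR: {accuracy(tree_predict)}")  # 1.0
```

No linear boundary can label all four XOR points correctly, while two nested splits can. Whether the lending data contains interactions of this kind is exactly what the comparison below tests.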

## Split the Data into Training and Testing Sets

In [3]:
# assign variables
X = df.drop('loan_status', axis=1)
y = df['loan_status']

In [4]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [5]:
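# scale X data using StandardScaler
# fit on the training features only (the scaler does not use the labels),
# then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train)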

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
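As a rough sketch of what the scaling step does, the snippet below standardizes a single hypothetical feature by hand. It assumes the usual behavior of `StandardScaler`: subtract the training mean and divide by the training (population) standard deviation. The values are invented for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical values for a single feature, not taken from the dataset.
train = [8000.0, 10000.0, 12000.0, 14000.0]
test = [9000.0, 20000.0]

# "fit": learn the mean and population standard deviation from the training data only.
mu = fmean(train)
sigma = pstdev(train)

# "transform": apply the training statistics to both sets.
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print(train_scaled)  # centered on 0 with unit standard deviation
print(test_scaled)   # scaled with the training statistics, so not necessarily centered
```

Reusing the training statistics for the test set keeps information about the test data from leaking into preprocessing.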

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [6]:
# Train a Logistic Regression model and print the model score

classifier = LogisticRegression(max_iter = 10000).fit(X_train_scaled, y_train)

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.9942908240473243
Testing Data Score: 0.9936545604622369
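For a classifier, `score` reports mean accuracy: the fraction of samples whose predicted label matches the true label. A minimal sketch of that calculation, using hypothetical labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 0, 1]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1]

print(accuracy(y_true, y_pred))  # 0.875 (7 of 8 correct)
```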


In [7]:
# Train a Random Forest Classifier model and print the model score

rand_clf = RandomForestClassifier(random_state=42)
rand_clf.fit(X_train_scaled, y_train)

print(f'Training Score: {rand_clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {rand_clf.score(X_test_scaled, y_test)}')

Training Score: 0.9975237309120925
Testing Score: 0.9916941807676434


In [8]:
# create test parameters
param_grid = {
    'n_estimators': [90, 100, 120, 130],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
param_grid

{'n_estimators': [90, 100, 120, 130],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4]}

In [9]:
# conduct grid search
grid_search = GridSearchCV(rand_clf, param_grid, n_jobs=-1, verbose=3)
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
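The counts in this log line follow from the grid itself. The sketch below enumerates every parameter combination with `itertools.product`, assuming the 5 folds reported in the log above.

```python
from itertools import product

param_grid = {
    'n_estimators': [90, 100, 120, 130],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5  # taken from the "Fitting 5 folds" log line

# One candidate per combination of values: 4 * 3 * 3 = 36.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

print(len(candidates))             # 36
print(len(candidates) * cv_folds)  # 180
```

Each additional value in any list multiplies the number of fits, so the grid grows quickly as parameters are added.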


In [10]:
# identify best parameters
grid_search.best_params_

{'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 90}

In [11]:
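# train a Random Forest Classifier model with adjusted parameters
# note: these settings differ from grid_search.best_params_ above
# (max_depth was not in the grid, and min_samples_leaf is 2 rather than 4),
# and the model is fit on the unscaled data
optimized_clf = RandomForestClassifier(max_depth=4,
                                       min_samples_leaf=2,
                                       n_estimators=90,
                                       random_state=1)
optimized_clf.fit(X_train, y_train)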

In [12]:
# print model score
print(f'Training Score: {optimized_clf.score(X_train, y_train)}')
print(f'Testing Score: {optimized_clf.score(X_test, y_test)}')

Training Score: 0.9946691429357546
Testing Score: 0.993912505158894


### Results
With default hyperparameters, the Random Forest Classifier scored slightly lower on the test set than the Logistic Regression model (0.9917 vs. 0.9937). After the hyperparameters were adjusted following the GridSearchCV run, the Random Forest's testing score rose to 0.9939, marginally above Logistic Regression but not by a significant amount. With differences this small, the better-scoring model may switch depending on the random state selected.

The prediction that the Random Forest Classifier would have a clear advantage was therefore not borne out. The ambiguous results may be due to the fact that both models scored very close to 1, which suggests that the dataset may be constructed rather than representative of real-world data.
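To put the size of these differences in perspective, the sketch below converts the testing scores reported above into approximate misclassification counts. It rests on two assumptions: the dataset has 77,536 rows (inferred from the last index, 77535, shown by `df.tail()`), and `train_test_split` used its default 25% test size. The standard error is a simple binomial approximation, not a formal significance test.

```python
import math

n_rows = 77536        # assumption: inferred from the last index (77535) in df.tail()
test_fraction = 0.25  # assumption: train_test_split's default test size
n_test = math.ceil(n_rows * test_fraction)

# Testing scores copied from the cell outputs above.
scores = {
    "Logistic Regression": 0.9936545604622369,
    "Random Forest (default)": 0.9916941807676434,
    "Random Forest (adjusted)": 0.993912505158894,
}

# Number of misclassified test rows implied by each score.
errors = {name: round((1 - score) * n_test) for name, score in scores.items()}
for name, count in errors.items():
    print(f"{name}: {count} errors out of {n_test}")

# Binomial standard error of the Logistic Regression score, as a rough yardstick.
p = scores["Logistic Regression"]
std_err = math.sqrt(p * (1 - p) / n_test)
gap = scores["Random Forest (adjusted)"] - p

print(f"Gap between adjusted Random Forest and Logistic Regression: {gap:.5f}")
print(f"Approximate standard error of one score: {std_err:.5f}")
```

Under these assumptions, the adjusted Random Forest and Logistic Regression differ by only five test rows (118 vs. 123 errors), and the gap is smaller than the approximate standard error of a single score.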