# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data

lending_df = pd.read_csv(Path('Resources/lending_data.csv'))

lending_df.head()

| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
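
Beyond eyeballing `head()`, a few quick checks can confirm an import. The sketch below is only an illustration: it rebuilds the five preview rows in memory (leaving out the leading `Unnamed: 0` column) instead of reading `lending_data.csv`, and the names `sample_csv` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Five rows copied from the preview above, without the leading "Unnamed: 0" column.
sample_csv = """loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
10700.0,7.672,52800,0.431818,5,1,22800,0
8400.0,6.692,43600,0.311927,3,0,13600,0
9000.0,6.963,46100,0.349241,3,0,16100,0
10700.0,7.664,52700,0.43074,5,1,22700,0
10800.0,7.698,53000,0.433962,5,1,23000,0
"""
sample_df = pd.read_csv(io.StringIO(sample_csv))

# Sanity checks: dimensions, total missing values, and label distribution.
n_rows, n_cols = sample_df.shape
missing_total = int(sample_df.isna().sum().sum())
label_counts = sample_df["loan_status"].value_counts().to_dict()

print(n_rows, n_cols)
print(missing_total)
print(label_counts)
```

The same three checks could be run on `lending_df` itself; the label counts are especially useful to know before interpreting accuracy scores.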


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

*I predict that the Logistic Regression model will perform better. The data appears to be linearly separable, which makes a linear model both easy to interpret and more predictive.*

## Split the Data into Training and Testing Sets

In [3]:
# Split the data into x_train, x_test, y_train, y_test

# Features: every column except the target
x = lending_df[['loan_size', 'interest_rate', 'borrower_income', 'debt_to_income', 'num_of_accounts', 'derogatory_marks', 'total_debt']]
# Target: loan_status
y = lending_df['loan_status']

# train_test_split was already imported in the first cell
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

print(x.shape, y.shape)


(77536, 7) (77536,)
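
When one label is much rarer than the other, `train_test_split` accepts a `stratify` argument that keeps the label ratio roughly equal in both splits. The sketch below uses hypothetical labels (950 zeros and 50 ones), not the lending data, and the `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 950 zeros and 50 ones (not the lending data).
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of the rare class in each split
```

Whether stratifying would change the scores in this notebook is untested here; it is simply an option worth knowing about for imbalanced targets.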


## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [4]:
# Train a Logistic Regression model and print the model score

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg

In [5]:
log_reg.fit(x_train, y_train)

In [6]:
print(f"Training Data Score: {log_reg.score(x_train, y_train)}")
print(f"Testing Data Score: {log_reg.score(x_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377


In [7]:
# Compare the first 70 test labels with the model's predictions
print(f'Actual:\t\t{list(y_test[:70])}')
print(f'Predicted:\t\t{log_reg.predict(x_test[:70])}')

Actual:		[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Predicted:		[0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
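
For a classifier, `score` returns mean accuracy, which is the fraction of predictions that match the true labels. As a rough illustration, the sketch below checks that equivalence and also prints a confusion matrix, which shows how the errors split between the two classes. The data comes from `make_classification` and is purely synthetic; the names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 90% class 0); not the lending data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# score() on a classifier is the same quantity as accuracy_score().
acc_from_score = model.score(X_te, y_te)
acc_manual = accuracy_score(y_te, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])

print(acc_from_score, acc_manual)
print(cm)
```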



In [8]:
# Scale the features before training the Random Forest Classifier
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(x_train)
x_train1 = scaler.transform(x_train)
x_test1 = scaler.transform(x_test)
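
To make the scaling step concrete: a `StandardScaler` fit on the training split stores that split's per-column mean and standard deviation, and `transform` applies those same stored values to any data passed in. The sketch below uses random placeholder data (not the lending features) to show this.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random placeholder features on an income-like scale (not the lending data).
rng = np.random.default_rng(1)
X_tr = rng.normal(loc=50_000, scale=8_000, size=(200, 2))
X_te = rng.normal(loc=50_000, scale=8_000, size=(50, 2))

# Learn mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Training columns end up with mean ~0 and std ~1; the test columns
# are shifted and scaled by the training statistics, so they are only close.
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
print(X_te_s.mean(axis=0).round(3), X_te_s.std(axis=0).round(3))
```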

In [9]:
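# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(x_train1, y_train)
print(f'Training Score: {clf.score(x_train1, y_train)}')
# Score on the scaled test set (x_test1) so it matches the scaled training data.
# The testing score recorded below came from the unscaled x_test, so it will likely change on a rerun.
print(f'Testing Score: {clf.score(x_test1, y_test)}')

Training Score: 0.9975409272252029
Testing Score: 0.9680664465538589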
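
One way to make sure the model always sees data in the same form at training and scoring time is to bundle the scaler and the classifier into a single pipeline. This is a minimal sketch on synthetic data from `make_classification`, with a smaller forest than the one above to keep it quick; `rf_pipeline` and the other names are placeholders, and it is not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the lending data.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The pipeline scales and then classifies, so fit() and score()
# both receive raw features and apply the same transformation.
rf_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=1),
)
rf_pipeline.fit(X_tr, y_tr)

train_score = rf_pipeline.score(X_tr, y_tr)
test_score = rf_pipeline.score(X_te, y_te)
print(f"Training Score: {train_score}")
print(f"Testing Score: {test_score}")
```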




In [10]:
from sklearn.feature_selection import SelectFromModel
select = SelectFromModel(clf)
select.fit(x_train1, y_train)
select.get_support()

array([False,  True,  True,  True, False, False,  True])
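
The boolean array from `get_support()` lines up with the feature columns in order, so it can be turned into column names. The sketch below hard-codes the mask shown above and the feature list from the split cell; `feature_names`, `support_mask`, and `selected` are illustrative names.

```python
import numpy as np

# Feature columns in the order they were passed to the model.
feature_names = [
    "loan_size", "interest_rate", "borrower_income", "debt_to_income",
    "num_of_accounts", "derogatory_marks", "total_debt",
]

# Mask copied from the get_support() output above.
support_mask = np.array([False, True, True, True, False, False, True])

# Keep the names whose mask entry is True.
selected = [name for name, keep in zip(feature_names, support_mask) if keep]
print(selected)
```

In the notebook itself, the same list could be obtained with `x.columns[select.get_support()]`.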

*Conclusion*

*The final results are as predicted. The Logistic Regression model's training and testing scores differ only minimally, by about 0.0003. The Random Forest model's testing score, however, is about 0.03 lower than its training score. Note that the Random Forest was fit on scaled features but its recorded testing score was computed on the unscaled `x_test`, so part of that gap likely reflects the mismatch rather than the model itself. On the recorded scores, the Logistic Regression model is the better choice and yields more accurate results on the test set.*
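
The gaps quoted in the conclusion can be reproduced directly from the scores printed earlier. The dictionary below simply copies those recorded numbers; `scores` and `gaps` are illustrative names.

```python
# (training score, testing score) as printed in the cells above.
scores = {
    "Logistic Regression": (0.9921240885954051, 0.9918489475856377),
    "Random Forest": (0.9975409272252029, 0.9680664465538589),
}

# Gap between training and testing score for each model.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```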