# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
df = pd.read_csv('Resources/lending_data.csv')
df.head()

| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [3]:
# Separate the target (loan_status) from the features
y = df["loan_status"]
X = df.drop("loan_status", axis=1)

## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

As I understand it, loan status is a binary outcome (yes or no), which makes this a classification problem. I expect the Logistic Regression model to perform better because it is widely used for binary classification. That said, I have no specific reason to believe Random Forest would perform worse than Logistic Regression.

## Split the Data into Training and Testing Sets

In [4]:
# Split the data into X_train, X_test, y_train, y_test
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
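
The classification reports further down show that the positive class is rare (497 of 15,508 test rows). One common option in that situation is a stratified split, which keeps the class proportions the same in the training and testing sets. The sketch below is not part of the original notebook: it uses synthetic data, and the names `X_demo`, `y_demo`, `train_rate`, and `test_rate` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data (assumption): 1,000 rows, 3% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 30 + [0] * 970)

# stratify=y_demo asks train_test_split to preserve the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Applying the same idea to this notebook would mean passing `stratify=y` to the existing call. That would change the split, so the scores reported below would no longer match exactly.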

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [6]:
# Train a Logistic Regression model and print the model score
classifier = LogisticRegression()

# Fit the model to the data
classifier.fit(X_train, y_train)

# Print the accuracy on the training and testing data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

# Predict on the testing data
y_pred = classifier.predict(X_test)

# Evaluate the model performance
# print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Training Data Score: 0.9915522022312504
Testing Data Score: 0.9928424039205571
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15011
           1       0.88      0.91      0.89       497

    accuracy                           0.99     15508
   macro avg       0.94      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508
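
To make the report easier to read, the short sketch below recomputes the class 1 f1-score from the printed precision and recall, and compares the overall accuracy against a baseline that always predicts class 0. The inputs are the rounded values from the report, so the results are approximate. The helper `f1_from` is a hypothetical name used only here.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall as printed in the Logistic Regression report.
lr_f1 = f1_from(0.88, 0.91)

# Accuracy of a trivial model that predicts class 0 for every test row,
# using the support counts from the report (15011 negatives, 497 positives).
baseline_accuracy = 15011 / (15011 + 497)

print(f"Class 1 f1 (recomputed): {lr_f1:.2f}")                      # 0.89
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")  # 0.9680
```

Because the baseline already reaches about 0.968, the class 1 precision, recall, and f1 are more informative than accuracy alone on this data.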



In [10]:
# Note: in an earlier run, scaling appeared to lower the training and testing scores
# even though accuracy_score stayed high. The most likely cause is that the scores were
# computed with `classifier` (fit on unscaled data) rather than `model`, so the
# score calls in this cell use `model`.
# # Logistic Regression with scaling
# # Scale the features using standardization
# scaler = StandardScaler()
# X_scaled_train = scaler.fit_transform(X_train)
# X_scaled_test = scaler.transform(X_test)

# # Initialize the logistic regression model
# model = LogisticRegression()

# # Fit the model to the training data
# model.fit(X_scaled_train, y_train)

# # Print the accuracy on the scaled training and testing data
# print(f"Training Data Score: {model.score(X_scaled_train, y_train)}")
# print(f"Testing Data Score: {model.score(X_scaled_test, y_test)}")

# # Predict on the testing data
# y_pred = model.predict(X_scaled_test)

# # Evaluate the model performance
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print(classification_report(y_test, y_pred))
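
One way to keep the scaler and the model paired is to wrap them in a scikit-learn pipeline, so that a single object handles scaling and scoring together. The sketch below is a minimal illustration on synthetic data from `make_classification`, not a rerun of the experiment above. The names `scaled_lr`, `X_demo`, and `y_demo` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, stands in for the lending data).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then fits the model
# on the scaled training data.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_tr, y_tr)

# score() applies the fitted scaler before predicting, so the model never
# sees data on a different scale from the one it was trained on.
train_score = scaled_lr.score(X_tr, y_tr)
test_score = scaled_lr.score(X_te, y_te)
print(f"Training Data Score: {train_score}")
print(f"Testing Data Score: {test_score}")
```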

In [8]:
# Import a Random Forests classifier
from sklearn.ensemble import RandomForestClassifier

In [9]:
# Train a Random Forest Classifier model and print the model score
# Scale the features (tree-based models do not require scaling, but it does no harm)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model and predict on the testing data
clf = RandomForestClassifier(random_state=1).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

# Evaluate the model performance
print(classification_report(y_test, y_pred))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15011
           1       0.87      0.91      0.89       497

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508

Training Score: 0.9970658412329916
Testing Score: 0.992713438225432



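### Analysis

Logistic Regression

* Training Data Score: 0.9915522022312504
* Testing Data Score: 0.9928424039205571
* f1 score (class 0, class 1): (1.00, 0.89)

Random Forest

* Training Score: 0.9970658412329916
* Testing Score: 0.992713438225432
* f1 score (class 0, class 1): (1.00, 0.89)

Logistic Regression and Random Forest performed almost identically. Logistic Regression scored marginally higher on the testing data (0.99284 versus 0.99271, a difference of about two test samples out of 15,508), and the two models have the same f1 scores for both classes. This agrees with my prediction that Logistic Regression would perform better, although the margin is too small to call it a clear win. Random Forest scored higher on the training data (0.9971 versus 0.9916) without a matching gain on the testing data, which suggests slight overfitting.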
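
To put the gap between the two testing scores in concrete terms, the sketch below converts each reported accuracy back into a count of misclassified test rows. The accuracies and the test-set size of 15,508 are copied from the outputs above. The variable names are illustrative.

```python
n_test = 15508  # total support from the classification reports

lr_test_acc = 0.9928424039205571  # Logistic Regression testing score
rf_test_acc = 0.992713438225432   # Random Forest testing score

# Misclassified rows = (1 - accuracy) * number of test rows.
lr_errors = round((1 - lr_test_acc) * n_test)
rf_errors = round((1 - rf_test_acc) * n_test)

print(lr_errors, rf_errors, rf_errors - lr_errors)  # 111 113 2
```

A two-row difference on a single split is small enough that a different `random_state` could plausibly reverse the ranking.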