In [None]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report


---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [10]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)
# Review the DataFrame
df.head()

,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [11]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = df["loan_status"]

# Separate the X variable, the features
X = df.drop(columns=["loan_status"])


In [None]:
# Review the y variable Series
print("y (labels):")
print(y.head())

In [None]:
# Review the X variable DataFrame
print("\nX (features):")
print(X.head())

### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [14]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Display the shape of the resulting datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (62028, 7)
X_test shape: (15508, 7)
y_train shape: (62028,)
y_test shape: (15508,)
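As a quick check on the shapes above, the split sizes can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set. The total of 77536 rows is inferred from the printed shapes rather than stated in the notebook.

```python
import math

# Total row count, inferred from the printed shapes (62028 + 15508).
n_samples = 62028 + 15508
test_size = 0.2

# Assumption: a fractional test_size is rounded up to whole rows,
# and the training set receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```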


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [15]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model with random_state=1
logistic_model = LogisticRegression(random_state=1)

# Fit the model using the training data
logistic_model.fit(X_train, y_train)

# Print confirmation
print("Logistic Regression model fitted successfully!")


Logistic Regression model fitted successfully!


### Step 2: Predict the testing data labels by using the testing feature data (`X_test`) and the fitted model, and save the predictions.

In [16]:
# Make predictions using the testing data
y_pred = logistic_model.predict(X_test)

# Display the first few predictions
print("Predicted labels:")
print(y_pred[:10])  # Display the first 10 predictions


Predicted labels:
[0 0 0 0 0 0 0 0 0 0]
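All ten predictions shown are `0`, which is unsurprising given that only about 3% of the test loans are high-risk. For intuition about how `predict` arrives at a label, here is a minimal sketch of the logistic decision rule for a binary model: a linear score is passed through the sigmoid function, and the label is `1` when the resulting probability exceeds 0.5. The weights, intercept, and feature values are made up for illustration and are not the fitted model's coefficients.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Return (label, probability) for one sample under a logistic model."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return (1 if probability > threshold else 0), probability

# Hypothetical two-feature model; these numbers are illustrative only.
weights = [0.8, -0.5]
intercept = -1.0

for features in ([0.5, 1.0], [3.0, 0.2]):
    label, probability = predict_label(features, weights, intercept)
    print(features, "->", label, round(probability, 3))
```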


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [17]:
# Generate a confusion matrix for the model
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[14924    77]
 [   31   476]]
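
For reference when reading this output: for a binary problem with labels `0` and `1`, `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. The sketch below copies the printed values into a plain list and names the four cells, treating high-risk (`1`) as the positive class. The variable names are illustrative.

```python
# Values copied from the printed confusion matrix.
conf_matrix_values = [[14924, 77],
                      [31, 476]]

# Row = true label, column = predicted label.
(tn, fp), (fn, tp) = conf_matrix_values

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

print("true negatives :", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives :", tp)
print("accuracy       :", round(accuracy, 4))
```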


In [18]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(class_report)



Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     15001
           1       0.86      0.94      0.90       507

    accuracy                           0.99     15508
   macro avg       0.93      0.97      0.95     15508
weighted avg       0.99      0.99      0.99     15508
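
The `support` column shows that the test set is heavily imbalanced: 15001 healthy loans against 507 high-risk loans. A rough way to put the 0.99 accuracy in context is to compare it with a baseline that labels every loan as healthy, sketched below using only the support counts from the report.

```python
# Support counts taken from the classification report.
support_healthy = 15001
support_high_risk = 507
total = support_healthy + support_high_risk

# A baseline that always predicts 0 gets every healthy loan right
# and every high-risk loan wrong.
baseline_accuracy = support_healthy / total
baseline_recall_high_risk = 0 / support_high_risk

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall for class 1: {baseline_recall_high_risk:.3f}")
```

Because such a baseline already scores well on accuracy while catching no high-risk loans, the precision and recall for class `1` are the more informative numbers here.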



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The logistic regression model predicts healthy loans (`0`) extremely well, with a precision of 1.00 (after rounding), a recall of 0.99, and an F1-score of 1.00 for label `0`. Almost all healthy loans are correctly classified: only 77 of the 15001 healthy loans in the test set were misclassified as high-risk (false positives when `1` is treated as the positive class).

For high-risk loans (1), the model does fairly well but not perfectly:

* Precision = 0.86 → 86% of the loans predicted as high-risk were actually high-risk.
* Recall = 0.94 → 94% of actual high-risk loans were correctly identified.
* F1-score = 0.90 → A good balance between precision and recall.

The confusion matrix shows that:

* 476 high-risk loans were correctly classified (True Positives).
* 31 high-risk loans were misclassified as healthy (False Negatives).
* 77 healthy loans were misclassified as high-risk (False Positives).
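
The figures quoted above can be recomputed directly from the confusion matrix counts. The following sketch does this for both classes using the standard definitions of precision, recall, and F1. The helper name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# High-risk (1) as the positive class.
p1, r1, f1_1 = prf(tp=476, fp=77, fn=31)
# Healthy (0) as the positive class: the roles of the two error types swap.
p0, r0, f1_0 = prf(tp=14924, fp=31, fn=77)

print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
```

Note that the precision for class `0` is 14924 / 14955, which is slightly below 1 and only rounds to 1.00 in the report.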

---

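## Overview of the Analysis

The purpose of this analysis was to build and evaluate a machine learning model that predicts whether a loan is healthy (`0`) or high-risk (`1`) based on financial information about borrowers. The analysis applied logistic regression to classify the loans.

### Financial Data and Prediction Objective

The dataset describes each loan with seven features: `loan_size`, `interest_rate`, `borrower_income`, `debt_to_income`, `num_of_accounts`, `derogatory_marks`, and `total_debt`. The target, `loan_status`, is `0` for a healthy loan and `1` for a high-risk loan. The goal was to develop a model that classifies loans accurately enough to help lenders make informed decisions.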

### Machine Learning Process

Data Preparation:

* Separated the dataset into features (`X`) and target labels (`y`).
* Split the dataset into training (80%) and testing (20%) data using `train_test_split()`.

Model Selection and Training:

* Chose logistic regression as the initial model.
* Trained the model on the training data using `fit()`.

Model Evaluation:

* Used `predict()` to generate predictions on the test dataset.
* Evaluated the model using accuracy, precision, recall, and F1-score.
* Analyzed the confusion matrix to understand classification performance.

## Results

### Machine Learning Model 1: Logistic Regression

* Accuracy: 99%
* Healthy loans (`0`):
    * Precision: 1.00
    * Recall: 0.99
    * F1-score: 1.00
* High-risk loans (`1`):
    * Precision: 0.86
    * Recall: 0.94
    * F1-score: 0.90

Confusion matrix summary, treating high-risk (`1`) as the positive class:

* 476 true positives (high-risk loans correctly identified).
* 14924 true negatives (healthy loans correctly identified).
* 77 false positives (healthy loans misclassified as high-risk).
* 31 false negatives (high-risk loans misclassified as healthy).

## Summary

* The logistic regression model performed very well, achieving 99% overall accuracy on the test set.
* It excelled at predicting healthy loans (`0`), with a precision that rounds to 1.00 and a recall of 0.99.
* For high-risk loans (`1`), however, the model missed 31 of 507 cases (false negatives), which could be costly for lenders, and 77 of the 553 loans it flagged as high-risk were actually healthy.
* Given the importance of identifying high-risk loans, another model (such as Random Forest or XGBoost) might be worth exploring to improve high-risk loan detection.

### Recommendation

* If minimizing risk is the priority (flagging as many high-risk loans as possible), a model with better recall for `1` should be explored.
* If overall accuracy and efficiency are the main concerns, logistic regression is a strong choice.
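
One inexpensive way to explore the recall trade-off mentioned above, before switching models, is to lower the probability threshold at which a loan is flagged as high-risk. The sketch below uses a handful of made-up labels and probabilities (not output from the fitted model) to show how recall for class `1` can rise while precision falls as the threshold drops. With the actual model, the probabilities would presumably come from `logistic_model.predict_proba(X_test)[:, 1]`.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for class 1 at a given probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities of being high-risk.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.30, 0.45, 0.70, 0.90]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Whether the extra false positives are acceptable depends on what it costs the lender to review a flagged loan compared with what it costs to miss a high-risk one.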