In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Define the path to the CSV file (relative to the notebook, so it runs on any machine)
file_path = Path("Resources/lending_data.csv")

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

# Review the DataFrame by displaying the first few rows
print(df.head())

# Optionally, you can review the summary of the DataFrame
print(df.describe())

# To get a sense of data completeness and types
# (info() prints its report directly and returns None, so it is not wrapped in print)
df.info()

   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  loan_status  
0                 1       22800            0  
1                 0       13600            0  
2                 0       16100            0  
3                 1       22700            0  
4                 1       23000            0  
          loan_size  interest_rate  borrower_income  debt_to_income  \
count  77536.000000   77536.000000     77536.000000    77536.000000   
mean    9805.562577       7.292333     49221.949804        0.377318   
std     2093.22315

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the 'loan_status' column as the label
y = df['loan_status']  # This is the label

# Create the features DataFrame by dropping the 'loan_status' column
X = df.drop('loan_status', axis=1)  # This drops the label column from the feature set

# Display the first few rows of the labels and features to verify
print("Labels (y):")
print(y.head())
print("\nFeatures (X):")
print(X.head())


Labels (y):
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

Features (X):
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  


In [4]:
# Display the first few entries of the y variable
print("First few entries in 'y':")
print(y.head())

# Get the distribution of classes within the y variable
print("\nDistribution of 'loan_status' classes:")
print(y.value_counts())

# Check for any missing values in the y variable
print("\nCheck for missing values in 'y':")
print(y.isnull().sum())


First few entries in 'y':
0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

Distribution of 'loan_status' classes:
loan_status
0    75036
1     2500
Name: count, dtype: int64

Check for missing values in 'y':
0


In [5]:
# Display the first few rows of the X DataFrame to understand what the data looks like
print("First few rows of the features (X):")
print(X.head())

# Display a summary of statistics for numerical features in the DataFrame
print("\nStatistical summary of the features (X):")
print(X.describe())

# Display the data types of each column to ensure they are appropriate for the analyses
print("\nData types of the features in X:")
print(X.dtypes)

# Check for missing values in each column
print("\nMissing values in each feature column:")
print(X.isnull().sum())


First few rows of the features (X):
   loan_size  interest_rate  borrower_income  debt_to_income  num_of_accounts  \
0    10700.0          7.672            52800        0.431818                5   
1     8400.0          6.692            43600        0.311927                3   
2     9000.0          6.963            46100        0.349241                3   
3    10700.0          7.664            52700        0.430740                5   
4    10800.0          7.698            53000        0.433962                5   

   derogatory_marks  total_debt  
0                 1       22800  
1                 0       13600  
2                 0       16100  
3                 1       22700  
4                 1       23000  

Statistical summary of the features (X):
          loan_size  interest_rate  borrower_income  debt_to_income  \
count  77536.000000   77536.000000     77536.000000    77536.000000   
mean    9805.562577       7.292333     49221.949804        0.377318   
std     2093.22315

### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Print the shape of the resulting datasets to confirm the split
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)


Training features shape: (62028, 7)
Testing features shape: (15508, 7)
Training labels shape: (62028,)
Testing labels shape: (15508,)
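
The split above does not pass `stratify`, so the share of high-risk loans in each subset is left to chance. A minimal sketch of a stratified split follows; it assumes synthetic labels built from the class counts shown earlier (75036 healthy, 2500 high-risk) and a placeholder feature column, and the `_demo` names are hypothetical, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as loan_status (assumption for illustration)
y_demo = np.array([0] * 75036 + [1] * 2500)
# Placeholder single-column feature matrix; the real X has 7 columns
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio (nearly) identical in both subsets
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

overall_rate = y_demo.mean()
train_rate = y_tr_demo.mean()
test_rate = y_te_demo.mean()

print(f"High-risk share overall: {overall_rate:.4f}")
print(f"High-risk share in train: {train_rate:.4f}")
print(f"High-risk share in test:  {test_rate:.4f}")
```

With only about 3% of loans in the minority class, stratifying is a cheap way to make sure the test set reflects the same imbalance as the full data. Applying it to the real data would change the reported results slightly.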


---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model with a random_state of 1
logistic_model = LogisticRegression(random_state=1)

# Fit the model using the training data
logistic_model.fit(X_train, y_train)

# Print a statement to confirm the model fitting
print("Logistic regression model has been fitted to the training data.")


Logistic regression model has been fitted to the training data.
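
The features here sit on very different scales (`loan_size` in the thousands, `debt_to_income` below 1), which may slow the solver or trigger a convergence warning. One common option is to standardize the features inside a pipeline before fitting. The sketch below assumes synthetic data from `make_classification` with made-up column scales; it is not the lending dataset, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 7-feature dataset (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, weights=[0.97, 0.03], random_state=1
)
# Stretch the columns to mimic features measured in very different units
X_demo = X_demo * np.array([2000.0, 1.0, 8000.0, 0.1, 2.0, 1.0, 8000.0])

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
scaled_model.fit(X_tr_demo, y_tr_demo)

demo_pred = scaled_model.predict(X_te_demo)
print("Test accuracy on synthetic data:", scaled_model.score(X_te_demo, y_te_demo))
```

Keeping the scaler inside the pipeline means the test set never influences the scaling statistics.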


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Use the fitted model to make predictions on the testing data
y_pred = logistic_model.predict(X_test)

# Optionally, print the first few predictions to see what they look like
print("First few predictions:")
print(y_pred[:5])


First few predictions:
[0 0 0 0 0]
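
`predict` returns hard 0/1 labels, which for a binary logistic regression corresponds to a 0.5 probability cutoff. When missing a high-risk loan is costlier than a false alarm, the predicted probabilities can be thresholded at a lower value instead. The sketch below assumes synthetic data and an arbitrary 0.3 cutoff chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset (assumption; not the lending data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, weights=[0.9, 0.1], random_state=1
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

demo_model = LogisticRegression(random_state=1, max_iter=1000)
demo_model.fit(X_tr_demo, y_tr_demo)

# Column 1 holds the estimated probability of class 1 (high-risk)
proba_high_risk = demo_model.predict_proba(X_te_demo)[:, 1]

# Default-style cutoff versus a more lenient, hypothetical cutoff
default_flags = (proba_high_risk >= 0.5).astype(int)
lenient_flags = (proba_high_risk >= 0.3).astype(int)

print("Flagged at 0.5:", default_flags.sum())
print("Flagged at 0.3:", lenient_flags.sum())
```

Lowering the cutoff can only flag the same loans or more, so recall for class 1 cannot drop, while precision typically falls.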


### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [9]:
# Generate a confusion matrix (rows are actual labels, columns are predicted labels)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[14926    75]
 [   46   461]]
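
The raw array is easier to read with labels attached. The sketch below copies the counts printed above into a new array (the name `conf_matrix_demo` and the row and column labels are illustrative) and unpacks the four cells using scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np
import pandas as pd

# Counts copied from the confusion matrix printed above
conf_matrix_demo = np.array([[14926, 75],
                             [46, 461]])

# For a binary problem, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = conf_matrix_demo.ravel()

# Wrap the array in a DataFrame so each cell is labeled
labeled_matrix = pd.DataFrame(
    conf_matrix_demo,
    index=["Actual 0 (healthy)", "Actual 1 (high-risk)"],
    columns=["Predicted 0", "Predicted 1"],
)

print(labeled_matrix)
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```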


In [10]:
# Print the classification report for the model
class_report = classification_report(y_test, y_pred)

print("Classification Report:")
print(class_report)

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     15001
           1       0.86      0.91      0.88       507

    accuracy                           0.99     15508
   macro avg       0.93      0.95      0.94     15508
weighted avg       0.99      0.99      0.99     15508



### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** 

*Precision and Recall for '0' (Healthy Loan):*

Precision: The precision for class '0' rounds to 1.00. Of the 14972 loans the model predicted as healthy, 14926 were actually healthy (about 99.7%), so nearly every "healthy" prediction was correct.

Recall: The recall for class '0' also rounds to 1.00. The model identified 14926 of the 15001 healthy loans in the test set (about 99.5%) and misclassified the remaining 75 as high-risk.

*Precision and Recall for '1' (High-risk Loan):*

Precision: The precision for class '1' is 0.86, suggesting that when the model predicted a loan as high-risk, it was correct 86% of the time.

Recall: The recall for class '1' is 0.91, which means that the model identified 91% of the actual high-risk loans in the test set.

*F1-Score:*

The F1-score for class '0' rounds to 1.00, reflecting near-perfect precision and recall for this class.

The F1-score for class '1' is 0.88, which is quite high, showing a good balance between precision and recall for predicting high-risk loans.

*Overall Accuracy:*

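The overall accuracy of the model is 0.99, indicating that it correctly predicts the label for 99% of all loans in the test set. Because about 97% of the test loans are healthy (15001 of 15508), accuracy is dominated by the majority class, so the precision and recall for class '1' are the more informative measures of how well the model finds high-risk loans.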

*Confusion Matrix:*

The model predicted 14926 healthy loans correctly and incorrectly classified 75 as high-risk (false positives).
It correctly predicted 461 high-risk loans and missed 46 (false negatives).
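
The figures in this answer can be checked by hand from the four confusion-matrix counts. The sketch below recomputes them. It also adds two comparisons that are not in the report above: a majority-class baseline (always predicting "healthy") and balanced accuracy (the mean of the two per-class recalls). Variable names are illustrative.

```python
# Counts from the confusion matrix, treating class 1 (high-risk) as positive
tn, fp, fn, tp = 14926, 75, 46, 461
total = tn + fp + fn + tp

# Class 1 (high-risk) metrics
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (healthy) metrics
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Overall accuracy versus always predicting the majority class
accuracy = (tn + tp) / total
baseline_accuracy = (tn + fp) / total

# Balanced accuracy weights both classes equally
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"Class 1 precision: {precision_1:.3f}, recall: {recall_1:.3f}, F1: {f1_1:.3f}")
print(f"Class 0 precision: {precision_0:.4f}, recall: {recall_0:.4f}")
print(f"Accuracy: {accuracy:.4f} (majority-class baseline: {baseline_accuracy:.4f})")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The baseline reaches roughly 0.967 accuracy without detecting a single high-risk loan, which is why the 0.99 headline figure overstates how much the model adds. The class '1' recall and the balanced accuracy give a fairer picture.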

---