
---

**Goal:** Build a logistic regression model to classify loans as **Good (1)** vs **Bad (0)** using the provided dataset.

**Good Loan (1):**
- Current  
- Fully Paid  

**Bad Loan (0):**
- In Grace Period  
- Late (31–120 days)  
- Late (16–30 days)  
- Charged Off  

### Code Explanation
- This cell contains Python code used in the Logistic Regression workflow.
- Detailed comments are added inside the code to explain *what* each step does and *why* it is required.
- This helps understand the full data mining process clearly for academic evaluation.


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 0) Imports (run this first)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score

pd.set_option("display.max_columns", 200)

## 1) Read / import the data using pandas

This cell tries to load `loan_application_data.csv` from the same folder as the notebook.
If your CSV is in a different location, replace `csv_path` with your full path.


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 1) Read / import the data using pandas

csv_path = "loan_application_data.csv"  # keep CSV in same folder as this notebook

try:
    df = pd.read_csv(csv_path)
except FileNotFoundError:
    # If needed, update this path to your computer's path:
    # csv_path = r"C:\Users\YOURNAME\Downloads\loan_application_data.csv"
    raise FileNotFoundError(
        "Could not find 'loan_application_data.csv' in the notebook folder.\n"
        "Put the CSV in the same folder as this notebook OR update csv_path to your local file path."
    )

print("Shape:", df.shape)
df.head()

## 2) Describe the data

We review:
- column names
- data types
- summary statistics
- unique values for the target-related column


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 2) Describe the data

df.info()  # info() prints its summary and returns None, so it is not wrapped in display()
display(df.describe(include="all").T.head(20))

# Quick look at loan_status categories
if "loan_status" in df.columns:
    display(df["loan_status"].value_counts(dropna=False))
else:
    print("Column 'loan_status' not found. Please check df.columns:")
    display(df.columns)

## 3) Review and handle missing data

We check missing values and apply simple handling:
- For numeric columns: fill missing with **median**
- For categorical columns: fill missing with **mode**


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 3) Review missing data

missing = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(2)

missing_table = pd.DataFrame({"missing_count": missing, "missing_pct": missing_pct})
missing_table = missing_table[missing_table["missing_count"] > 0]

print("Columns with missing values:", missing_table.shape[0])
display(missing_table.head(30))


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 3b) Handle missing data (median for numeric, mode for categorical)

df_clean = df.copy()

num_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in df_clean.columns if c not in num_cols]

# Fill numeric with median
for c in num_cols:
    if df_clean[c].isna().any():
        df_clean[c] = df_clean[c].fillna(df_clean[c].median())

# Fill categorical with mode (most frequent)
for c in cat_cols:
    if df_clean[c].isna().any():
        mode_val = df_clean[c].mode(dropna=True)
        fill_val = mode_val.iloc[0] if len(mode_val) else "Unknown"
        df_clean[c] = df_clean[c].fillna(fill_val)

# Confirm
print("Total missing after cleaning:", int(df_clean.isna().sum().sum()))

## 4) Sample the data (50%)

To reduce runtime and match the assignment requirement, we randomly sample **50%** of the cleaned data.


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 4) Sample 50% of the data
df_sample = df_clean.sample(frac=0.50, random_state=42).reset_index(drop=True)
print("Sample shape:", df_sample.shape)
df_sample.head()

## 5) Create the target variable: `Loan_Class` from `loan_status`

Mapping:
- **Good Loan (1):** Current, Fully Paid
- **Bad Loan (0):** In Grace Period, Late (31-120 days), Late (16-30 days), Charged Off

We also keep a quick frequency check.


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 5) Create Loan_Class (target)

good_status = {"Current", "Fully Paid"}
bad_status = {"In Grace Period", "Late (31-120 days)", "Late (16-30 days)", "Charged Off"}

if "loan_status" not in df_sample.columns:
    raise KeyError("Column 'loan_status' not found. Please confirm the dataset contains 'loan_status'.")

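# Warn about statuses outside both sets; the rule below labels them 0 (Bad)
unmapped = set(df_sample["loan_status"].unique()) - good_status - bad_status
if unmapped:
    print("Statuses not in either list (treated as Bad):", unmapped)

df_sample["Loan_Class"] = df_sample["loan_status"].apply(lambda x: 1 if x in good_status else 0)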

display(df_sample[["loan_status", "Loan_Class"]].head(10))
display(df_sample["Loan_Class"].value_counts())

## 6) Explore the data (visualizations)

Required charts:
- Loan amount distribution
- Interest rate distribution
- Loan Status vs Loan Amount
- Application Type vs Loan Amount
- Good loan vs Bad loan count plot


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# Helper: pick the correct column name for loan amount / interest rate if dataset uses slightly different names
def pick_col(candidates, columns):
    for c in candidates:
        if c in columns:
            return c
    return None

loan_amt_col = pick_col(["loan_amount", "loan_amnt", "loan_amt"], df_sample.columns)
int_rate_col = pick_col(["interest_rate", "int_rate", "interestRate"], df_sample.columns)

print("Loan amount column:", loan_amt_col)
print("Interest rate column:", int_rate_col)


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 6a) Loan amount distribution

if loan_amt_col is not None:
    plt.figure(figsize=(7,4))
    plt.hist(df_sample[loan_amt_col], bins=30)
    plt.title("Loan Amount Distribution")
    plt.xlabel("Loan Amount")
    plt.ylabel("Frequency")
    plt.show()
else:
    print("Could not find a loan amount column. Available columns:")
    display(df_sample.columns)


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 6b) Interest rate distribution

if int_rate_col is not None:
    plt.figure(figsize=(7,4))
    plt.hist(df_sample[int_rate_col], bins=30)
    plt.title("Interest Rate Distribution")
    plt.xlabel("Interest Rate")
    plt.ylabel("Frequency")
    plt.show()
else:
    print("Could not find an interest rate column. Available columns:")
    display(df_sample.columns)


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 6c) Loan Status vs Loan Amount (boxplot style using matplotlib)

if loan_amt_col is not None:
    # Take top N statuses to keep the chart readable
    top_statuses = df_sample["loan_status"].value_counts().head(6).index.tolist()
    plot_df = df_sample[df_sample["loan_status"].isin(top_statuses)].copy()

    groups = [plot_df.loc[plot_df["loan_status"] == s, loan_amt_col].values for s in top_statuses]

    plt.figure(figsize=(10,4))
    plt.boxplot(groups, labels=top_statuses, showfliers=False)
    plt.title("Loan Status vs Loan Amount (Top Statuses)")
    plt.xlabel("Loan Status")
    plt.ylabel("Loan Amount")
    plt.xticks(rotation=25, ha="right")
    plt.show()
else:
    print("Loan amount column not found, skipping this plot.")


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 6d) Application Type vs Loan Amount

if "application_type" in df_sample.columns and loan_amt_col is not None:
    app_types = df_sample["application_type"].value_counts().index.tolist()
    groups = [df_sample.loc[df_sample["application_type"] == a, loan_amt_col].values for a in app_types]

    plt.figure(figsize=(7,4))
    plt.boxplot(groups, labels=app_types, showfliers=False)
    plt.title("Application Type vs Loan Amount")
    plt.xlabel("Application Type")
    plt.ylabel("Loan Amount")
    plt.xticks(rotation=20, ha="right")
    plt.show()
else:
    print("Missing 'application_type' or loan amount column, skipping this plot.")


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 6e) Good loan vs Bad loan count plot

counts = df_sample["Loan_Class"].value_counts().sort_index()

plt.figure(figsize=(5,4))
plt.bar(["Bad Loan (0)", "Good Loan (1)"], [counts.get(0,0), counts.get(1,0)])
plt.title("Good Loan vs Bad Loan Counts")
plt.ylabel("Count")
plt.show()

## 7) Correlation analysis / matrix

Correlation applies to numeric columns. We compute a correlation matrix and display it.

### Correlation Analysis

Correlation analysis is used to measure the strength and direction of the relationship between numerical variables.  
The correlation coefficient ranges from **-1 to +1**:

- **+1** → perfect positive linear relationship  
- **-1** → perfect negative linear relationship  
- **0** → no linear relationship  

In this analysis, we examine how numerical features relate to each other and to the target variable (**Loan_Class**).  
Understanding these relationships helps identify variables that may influence loan repayment behavior and supports feature selection for the logistic regression model.
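
As a small illustration of relating numeric features to the target, the sketch below ranks features by their absolute Pearson correlation with a binary target. The toy DataFrame, its values, and the helper name `rank_by_target_corr` are made up for illustration; the same idea could be applied to the notebook's numeric columns.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "interest_rate": [5.0, 7.5, 12.0, 18.0, 22.0, 9.0],
    "annual_income": [90, 80, 60, 40, 35, 70],
    "Loan_Class":    [1, 1, 1, 0, 0, 1],
})

def rank_by_target_corr(frame, target):
    """Return features sorted by absolute Pearson correlation with the target."""
    corr_with_target = frame.corr(numeric_only=True)[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]

ranked = rank_by_target_corr(toy, "Loan_Class")
print(ranked.round(3))
```

A correlation near zero only rules out a linear relationship, so a low-ranked feature may still be useful to the model.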


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 7) Correlation matrix for numeric columns
numeric_df = df_sample.select_dtypes(include=[np.number]).copy()

corr = numeric_df.corr(numeric_only=True)
display(corr.round(3).head(15))

# Optional heatmap (simple matplotlib)
plt.figure(figsize=(10,7))
plt.imshow(corr, aspect="auto")
plt.title("Correlation Matrix (Numeric Features)")
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90, fontsize=7)
plt.yticks(range(len(corr.columns)), corr.columns, fontsize=7)
plt.tight_layout()
plt.show()

## 8) Label Encoding (categorical → numeric)

We label encode the required columns **if they exist** in the dataset:

- emp_title
- state
- loan_status
- loan_purpose
- application_type
- homeownership
- verified_income
- initial_listing_status
- disbursement_method

> Note: We will later **drop loan_status** from X to avoid target leakage, but we still encode it here to satisfy the assignment requirement.
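
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to one column. The category values are hypothetical and are not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only
purposes = ["car", "debt_consolidation", "car", "house"]

le = LabelEncoder()
codes = le.fit_transform(purposes)

# classes_ holds the sorted unique categories; each code is an index into classes_
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings
print(list(le.inverse_transform(codes)))
```

The integer codes follow alphabetical order, so they imply an ordering between categories that has no real meaning. Logistic regression will treat the codes as ordinary numbers, which is worth keeping in mind when interpreting coefficients.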


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 8) Label Encoding for selected categorical columns (only if present)

encode_cols = [
    "emp_title",
    "state",
    "loan_status",
    "loan_purpose",
    "application_type",
    "homeownership",
    "verified_income",
    "initial_listing_status",
    "disbursement_method",
]

df_model = df_sample.copy()
label_encoders = {}

for col in encode_cols:
    if col in df_model.columns:
        le = LabelEncoder()
        df_model[col] = le.fit_transform(df_model[col].astype(str))
        label_encoders[col] = le
    else:
        print(f"Column not found (skipped): {col}")

print("Encoded columns:", list(label_encoders.keys()))
df_model.head()

## 9) Split the data into training and testing sets

- Define `y = Loan_Class`
- Define `X = all other columns`
- Drop `loan_status` from X to avoid target leakage (because Loan_Class came from loan_status)

### Training and Testing Data Split

To evaluate the performance of the logistic regression model, the dataset was split into training and testing sets. The training data was used to build and fit the model, while the testing data was kept separate to evaluate how well the model performs on unseen data. This helps ensure that the model does not simply memorize the data and can generalize to new loan applications.

A test size of 25% was used, meaning 75% of the data was used for training and 25% for testing. Stratified sampling was applied to preserve the original distribution of good and bad loans in both sets. This approach provides a fair and reliable assessment of model performance.
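
The effect of stratified sampling can be checked on a toy target. This is a minimal sketch with made-up data (80 good loans, 20 bad loans); the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 good (1) and 20 bad (0) loans (made-up data)
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both splits keep the original share of good loans
print("train size:", len(y_tr), "share of good loans:", y_tr.mean())
print("test size: ", len(y_te), "share of good loans:", y_te.mean())
```

Without `stratify`, the share of bad loans in a small test set could drift away from the share in the full data by chance.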


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 9) Train/Test split

if "Loan_Class" not in df_model.columns:
    raise KeyError("Loan_Class was not created. Please run Step 5 before continuing.")

y = df_model["Loan_Class"].astype(int)

# Drop target + drop original loan_status to prevent leakage
drop_cols = ["Loan_Class"]
if "loan_status" in df_model.columns:
    drop_cols.append("loan_status")

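X = df_model.drop(columns=drop_cols)

# StandardScaler and LogisticRegression need numeric input, so drop any
# text column that was not label encoded in Step 8
non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
if non_numeric:
    print("Dropping non-numeric columns:", non_numeric)
    X = X.drop(columns=non_numeric)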

print("X shape:", X.shape)
print("y distribution:", y.value_counts(normalize=True).round(3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

X_train.shape, X_test.shape

## 10) Apply Standard Scaler

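Logistic regression often converges faster and behaves more predictably when numeric features are on a comparable scale.
We fit `StandardScaler` on `X_train` only, then apply the same transformation to both `X_train` and `X_test`, so no information from the test set enters the scaling.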
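
A minimal sketch of what standard scaling computes, using a made-up single feature. It shows that the test value is scaled with the mean and standard deviation learned from the training rows, which matches the manual formula z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, for illustration only
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler_demo = StandardScaler().fit(train)   # learns mean and std from train only
train_z = scaler_demo.transform(train)
test_z = scaler_demo.transform(test)        # reuses the training mean and std

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
manual = (test - train.mean()) / train.std()
print(train_z.ravel(), test_z.ravel(), manual.ravel())
```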


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 10) Standard scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape

## 11) Apply Logistic Regression, train the model, and make predictions


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 11) Logistic Regression model + training + predictions

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1 (Good Loan)

y_pred[:10], y_proba[:10]

### Confusion Matrix
A confusion matrix is used to evaluate the performance of a classification model by comparing actual loan outcomes with the model’s predictions. It shows four possible results: true positives, true negatives, false positives, and false negatives.

In this model, true positives represent good loans that were correctly predicted as good, while true negatives represent bad loans correctly predicted as bad. False positives occur when a bad loan is incorrectly predicted as good, and false negatives occur when a good loan is incorrectly predicted as bad. Analyzing these values helps understand where the model performs well and where it makes mistakes, which is important in assessing loan risk.

#### Confusion Matrix (TP / FP / TN / FN Explanation)

The confusion matrix breaks down the model’s predictions into four categories to evaluate classification performance:

- **True Positives (TP):** Good loans that were correctly predicted as good.

- **True Negatives (TN):** Bad loans that were correctly predicted as bad.

- **False Positives (FP):** Bad loans that were incorrectly predicted as good.

- **False Negatives (FN):** Good loans that were incorrectly predicted as bad.

These values help identify the types of errors made by the model. In loan risk analysis, false positives are especially important because they represent risky loans that were mistakenly classified as safe. Understanding these counts provides deeper insight into the model’s strengths and weaknesses beyond overall accuracy.
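
The four counts can be read directly from scikit-learn's confusion matrix. The sketch below uses made-up labels (1 = Good Loan, 0 = Bad Loan), not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = Good Loan, 0 = Bad Loan
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Here the single false positive is the bad loan that was predicted as good, which is the costly error type discussed above.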

## 12) Classification report + confusion matrix (interpretation)

- Confusion matrix shows **TP/FP/TN/FN** counts
- Classification report shows **precision, recall, f1-score, accuracy**

The classification report provides detailed performance metrics for the logistic regression model: precision, recall, F1-score, and accuracy.

- **Precision** measures how many loans predicted as good were actually good, indicating how reliable the positive predictions are.
- **Recall** measures how many actual good loans were correctly identified by the model, showing its ability to capture positive cases.
- **F1-score** is the harmonic mean of precision and recall, providing a single metric that balances the two.
- **Accuracy** is the overall percentage of correctly classified loans.

**Interpretation idea:**
- If recall for Good Loan is high → the model catches most good loans
- If precision for Good Loan is high → when model predicts good, it is often correct
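
As a worked example, the metrics for the Good Loan class can be computed by hand from the four confusion-matrix counts and compared with scikit-learn. The labels and counts below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = Good Loan, 0 = Bad Loan
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1]
tp, tn, fp, fn = 4, 2, 1, 1  # confusion-matrix counts for the labels above

precision = tp / (tp + fp)                           # predicted good that are good
recall = tp / (tp + fn)                              # actual good that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # all correct predictions

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))

# scikit-learn's functions use class 1 as the positive class by default
print(
    round(precision_score(y_true_demo, y_pred_demo), 3),
    round(recall_score(y_true_demo, y_pred_demo), 3),
    round(f1_score(y_true_demo, y_pred_demo), 3),
    round(accuracy_score(y_true_demo, y_pred_demo), 3),
)
```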


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 12) Confusion matrix + classification report

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm,
    index=["Actual 0 (Bad)", "Actual 1 (Good)"],
    columns=["Pred 0 (Bad)", "Pred 1 (Good)"]
)
display(cm_df)

print(classification_report(y_test, y_pred, digits=3))

## 13) ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at different classification thresholds. It helps visualize how well the model distinguishes between good and bad loans across various probability cutoffs.

The AUC (Area Under the Curve) summarizes the ROC curve into a single value ranging from 0 to 1. A higher AUC indicates better model performance, meaning the model has a stronger ability to separate good loans from bad loans. An AUC value closer to 1 suggests good discrimination, while a value close to 0.5 indicates performance similar to random guessing.
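
To illustrate the threshold trade-off, the sketch below computes the true positive rate and false positive rate at a few cutoffs on made-up probabilities. The data and the helper `rates_at` are hypothetical and only meant to show how each point on a ROC curve arises.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of class 1 (Good Loan)
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
p_good_demo = np.array([0.2, 0.6, 0.55, 0.9, 0.4, 0.7])

def rates_at(threshold):
    """Return (TPR, FPR) when predicting class 1 for probabilities >= threshold."""
    pred = (p_good_demo >= threshold).astype(int)
    tpr = ((pred == 1) & (y_true_demo == 1)).sum() / (y_true_demo == 1).sum()
    fpr = ((pred == 1) & (y_true_demo == 0)).sum() / (y_true_demo == 0).sum()
    return tpr, fpr

# Raising the threshold lowers the false positive rate but can also lower the TPR
for t in (0.3, 0.5, 0.65):
    tpr, fpr = rates_at(t)
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

auc_demo = roc_auc_score(y_true_demo, p_good_demo)
print("AUC:", round(auc_demo, 3))
```

Each threshold gives one (FPR, TPR) point; the ROC curve connects these points over all thresholds, and the AUC is the area under that curve.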


In [None]:
# -----------------------------
# Detailed commented code
# -----------------------------
# 13) ROC curve + AUC

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC={auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random (AUC=0.5)")
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

auc

### Experience Using Logistic Regression for Data Mining

Using logistic regression in this assignment helped me understand how data can be used to make classification decisions in real situations. Preparing the data was an important step, especially cleaning the data, handling missing values, and converting categorical variables into numerical form. Creating the Loan_Class variable from loan status values made the problem easier to model and clearly defined what the prediction goal was.

Training the model and evaluating it using the confusion matrix, classification report, and ROC–AUC curve showed how well logistic regression can separate good loans from bad loans. Overall, this experience showed that logistic regression is a simple yet effective method for data mining, especially for binary classification problems like loan risk analysis.

I used this approach because logistic regression is a well-established supervised learning algorithm for binary classification problems. The transformation of loan status into a binary target variable allowed the model to estimate the probability of a loan being good or bad. Applying label encoding and feature scaling ensured that categorical and numerical variables were properly prepared for model training.

This structured pipeline, including train–test splitting, standardization, and evaluation using the confusion matrix and ROC–AUC, follows common practice in data mining. Dropping `loan_status` from the features removes the most direct source of target leakage, which makes the reported performance more trustworthy and the model easier to interpret.