# Credit Risk Classification

This notebook presents a machine learning approach to **credit risk classification**, where the goal is to determine whether a loan is high-risk or healthy based on applicant and loan features.

We will:
- Explore and preprocess the dataset
- Train machine learning models
- Evaluate performance using relevant metrics
- Provide conclusions and next steps


## 1. Data Loading and Exploration

We begin by loading the data to determine which column to use as the **labels**, the values the model will predict. The remaining columns are the **features**, which the model uses to predict the labels.

In [None]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path

In [None]:
# Load the dataset
# Read the CSV file from the Resources folder into a Pandas DataFrame
csv = Path('Resources/lending_data.csv')
lending_df = pd.read_csv(csv)

# Review the DataFrame
lending_df.head()

In [None]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = lending_df["loan_status"]
# Separate the X variable, the features
X = lending_df.drop(columns="loan_status")

In [None]:
# Review the y variable Series
y.head()

In [None]:
# Review the X variable DataFrame
X.head()

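It is important to observe the ratio of healthy to high-risk loans in the labels, because a strong class imbalance affects how the model's performance should be interpreted.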

In [None]:
# Check the balance of our target values
y.value_counts()
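Raw counts can be hard to compare at a glance. As a minimal sketch, the proportions of each class and the accuracy of a trivial majority-class model can be computed as follows. The `y_example` Series and its 95/5 split are hypothetical stand-ins, not values from `lending_data.csv`.

```python
import pandas as pd

# Hypothetical labels standing in for lending_df["loan_status"]:
# 0 = healthy loan, 1 = high-risk loan (the 95/5 split is made up)
y_example = pd.Series([0] * 95 + [1] * 5, name="loan_status")

# Absolute number of loans in each class
counts = y_example.value_counts()

# Share of each class, which makes any imbalance easier to read
proportions = y_example.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()

print(counts)
print(proportions.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If the baseline is already high, plain accuracy says little about how well the minority class is detected, which is one reason to look at recall and balanced accuracy later on.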

## 2. Data Preprocessing

Before training, we split the features and labels into training and testing sets. No categorical encoding or feature scaling is applied in this notebook.

In [None]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
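The split above does not pass `stratify`, so the class ratio in the training and testing sets can differ slightly from the full dataset. One possible variant, shown here on hypothetical data rather than the lending dataset, preserves the ratio in both splits. The names `X_example`, `y_example`, `feature_a`, and `feature_b` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for X and y (10% high-risk)
rng = np.random.default_rng(1)
X_example = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
y_example = pd.Series([0] * 180 + [1] * 20, name="loan_status")

# stratify=y_example keeps the share of each class the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_example, y_example, random_state=1, stratify=y_example
)

# Share of high-risk loans in each split
print(f"Train high-risk share: {y_tr.mean():.2f}")
print(f"Test high-risk share:  {y_te.mean():.2f}")
```

Stratifying matters most when the minority class is small, since an unlucky random split could otherwise leave very few high-risk loans in the test set.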

## 3. Model Training

We train a logistic regression model to classify loans as **healthy (0)** or **high-risk (1)** using the *training* split.

In [None]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
model = LogisticRegression(solver='lbfgs', random_state=1)

# Fit the model using training data
model.fit(X_train, y_train)
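The model above is fit on unscaled features. The `lbfgs` solver can converge slowly when features sit on very different scales, so one option, not used in this notebook, is to wrap scaling and the model in a single pipeline. The sketch below uses hypothetical data; the variable names `loan_size` and `interest_rate` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features on very different scales
loan_size = rng.normal(10_000, 2_000, size=300)
interest_rate = rng.normal(7.0, 1.0, size=300)
X_example = np.column_stack([loan_size, interest_rate])

# Hypothetical labeling rule: larger loans are marked high-risk (1)
y_example = (loan_size > 11_000).astype(int)

# The scaler is fit only on the data passed to fit(), then applied
# automatically before the logistic regression at predict time
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", random_state=1),
)
scaled_model.fit(X_example, y_example)

train_accuracy = scaled_model.score(X_example, y_example)
print(f"Training accuracy: {train_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means it learns its mean and variance from the training data alone, which avoids leaking information from the test set.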

## 4. Model Predictions

After training, the model is used to make predictions for the *test* split.

In [None]:
# Make a prediction using the testing data
prediction = model.predict(X_test)
pd.DataFrame({"Prediction": prediction, "Actual": y_test})

## 5. Model Evaluation

We evaluate the model using **balanced accuracy, the confusion matrix, and per-class precision, recall, and F1-score**, with particular attention to recall for identifying high-risk loans.

In [None]:
# Import the modules
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

In [None]:
# Print the balanced accuracy score of the model
print(f"Balanced Accuracy Score: {balanced_accuracy_score(y_test, prediction):.3f}")

In [None]:
# Generate a confusion matrix for the model
print(confusion_matrix(y_test, prediction))

In [None]:
# Print the classification report for the model
print(classification_report(y_test, prediction))
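To make the link between these outputs concrete, the sketch below derives recall, precision, and balanced accuracy by hand from a confusion matrix. The labels and predictions are hypothetical toy values, not results from this notebook's model.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Hypothetical labels and predictions: 0 = healthy, 1 = high-risk
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, rows are actual classes and columns are predictions,
# so ravel() yields: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall for each class: share of actual members that were found
recall_high_risk = tp / (tp + fn)
recall_healthy = tn / (tn + fp)

# Precision for high-risk: share of flagged loans that really were high-risk
precision_high_risk = tp / (tp + fp)

# Balanced accuracy is the mean of the per-class recalls
balanced = (recall_high_risk + recall_healthy) / 2

print(f"Recall (high-risk):    {recall_high_risk:.4f}")
print(f"Precision (high-risk): {precision_high_risk:.4f}")
print(f"Balanced accuracy:     {balanced:.4f}")
print(f"sklearn check:         {balanced_accuracy_score(y_true, y_pred):.4f}")
```

Because balanced accuracy averages the recall of each class, it is not inflated by a large healthy-loan majority the way plain accuracy can be.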

## 6. Conclusion & Insights

- The logistic regression model provides interpretable results and solid recall for high-risk loans.
- Correctly identifying risky loans helps reduce default rates.
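- While accuracy is important, recall is prioritized in this domain, since a missed high-risk loan (a false negative) is typically costlier than a healthy loan flagged for review.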
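
One common way to favor recall, sketched here as an assumption rather than something this notebook does, is to lower the probability threshold at which a loan is flagged as high-risk. The data and the 0.2 threshold below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)

# Hypothetical data: the label depends on the first feature plus noise
X_example = rng.normal(size=(500, 2))
noise = rng.normal(scale=1.0, size=500)
y_example = (X_example[:, 0] + noise > 1.5).astype(int)

clf = LogisticRegression(random_state=1).fit(X_example, y_example)

# Estimated probability of the high-risk class (column 1)
proba_high_risk = clf.predict_proba(X_example)[:, 1]

# predict() uses a 0.5 cutoff for binary problems; a lower cutoff
# flags more loans as high-risk (0.2 is an arbitrary illustrative value)
default_pred = (proba_high_risk >= 0.5).astype(int)
lenient_pred = (proba_high_risk >= 0.2).astype(int)

recall_default = recall_score(y_example, default_pred)
recall_lenient = recall_score(y_example, lenient_pred)

print(f"Recall at 0.5 threshold: {recall_default:.3f}")
print(f"Recall at 0.2 threshold: {recall_lenient:.3f}")
print(f"Loans flagged: {default_pred.sum()} vs {lenient_pred.sum()}")
```

Lowering the threshold can only keep or raise recall, but it usually flags more healthy loans as well, so the trade-off against precision should be checked on held-out data.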


## 7. Next Steps

- Perform hyperparameter tuning
- Try advanced algorithms
- Conduct feature engineering for domain-specific insights
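
As a starting point for the hyperparameter tuning step, the sketch below searches over the regularization strength `C` of a logistic regression using cross-validated recall. The synthetic dataset, the grid values, and the choice of `scoring="recall"` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical imbalanced dataset (roughly 10% positive class)
X_example, y_example = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=1
)

# Candidate values for the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, scored on recall of the positive class
search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_example, y_example)

print(f"Best C: {search.best_params_['C']}")
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

The search should be run on the training split only, so that the test split remains an unbiased check of the tuned model.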
