# Project Setup and Import Statements

This project leverages a variety of Python libraries to handle data manipulation, visualization, machine learning, preprocessing, model evaluation, and addressing class imbalance. Below are the consolidated import statements used in this project.

In [None]:
import math
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report, roc_curve)

from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

# Load the Dataset

This section is responsible for loading the dataset into the environment and providing an initial glimpse of its structure. Proper error handling ensures that the process is robust and user-friendly.


In [None]:
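# Load the dataset; re-raise after printing the hint so that later cells
# do not fail with a confusing NameError on `data`
try:
    data = pd.read_csv('./american_bankruptcy.csv')
except FileNotFoundError:
    print("Dataset not found. Please check the file path.")
    raise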

# Display the first few rows
data.head()

# Explore the Dataset Structure

Understanding the structure of your dataset is crucial for effective data analysis and preprocessing. This section provides insights into the size and composition of the dataset, helping to identify potential issues and plan subsequent steps.

In [None]:
# Check the shape of the dataset
print(f"Dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")

# Get basic info
data.info()

We generate descriptive statistics for the dataset, including count, mean, standard deviation, minimum, maximum, and quartile values for numerical columns, as well as counts and unique values for categorical columns.

In [None]:
# Summary statistics
data.describe(include='all')

We check for missing values in each column and print out the columns that have missing values along with the count of missing entries.

In [None]:
# Check for missing values
missing_counts = data.isnull().sum()
print("\nMissing values in each column:")
print(missing_counts[missing_counts > 0])

We identify and count the number of duplicate rows in the dataset to assess data quality.

In [None]:
# Check for duplicate rows
duplicate_rows = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

# Visualize the Target Variable

Understanding the distribution of the target variable is essential in any machine learning project. This section plots the distribution of `status_label` to check for class imbalance, which can significantly affect both model performance and how that performance should be evaluated.

In [None]:
# Visualize the target variable
sns.countplot(x='status_label', data=data)
plt.title('Distribution of Company Status')
plt.show()
print("\nClass distribution:")
print(data['status_label'].value_counts())
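
Raw counts can be hard to compare at a glance, so it may help to also report class proportions and a majority-to-minority ratio. The sketch below uses a small hypothetical label series (`labels`, with made-up class names) in place of `data['status_label']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['status_label']
labels = pd.Series(['majority'] * 9 + ['minority'] * 1, name='status_label')

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()      # majority count / minority count

print(proportions)
print(f"Imbalance ratio (majority:minority) = {imbalance_ratio:.1f}:1")
```

On the real dataset, replacing `labels` with `data['status_label']` would give the same summary for the actual classes.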

# Data Preprocessing

Preparing the dataset for machine learning involves several crucial preprocessing steps to ensure that the data is clean, consistent, and in a suitable format for modeling. This section outlines the steps taken to preprocess the dataset, including dropping unnecessary columns, handling data types, and encoding categorical variables.

In [None]:
# Drop unnecessary columns
data.drop(columns=['company_name'], inplace=True)

# Ensure 'year' is numeric
data['year'] = pd.to_numeric(data['year'], errors='coerce')

# Encode 'status_label' if necessary
if data['status_label'].dtype == 'object':
    label_encoder = LabelEncoder()
    data['status_label'] = label_encoder.fit_transform(data['status_label'])
    label_mapping = dict(
        zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
    print("\nLabel Mapping:", label_mapping)

# Identify Columns with Negative Values

Ensuring data quality is a fundamental step in the data preprocessing pipeline. Negative values in certain columns may indicate data entry errors, invalid measurements, or specific characteristics of the data that need to be addressed. This section identifies which numerical columns contain negative values, allowing for informed decisions on how to handle them in subsequent analysis and modeling steps.


In [None]:
# Identify columns with negative values
numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
negative_values = (data[numeric_cols] < 0).any()
print("Columns with negative values:")
print(negative_values[negative_values])

# Visualize Numerical Features with Boxplots

Boxplots are an effective way to visualize the distribution of numerical features in a dataset. They help in identifying the presence of outliers, understanding the spread of the data, and comparing distributions across different variables. This section generates boxplots for all numerical features (excluding the target variable) in the dataset, providing a comprehensive overview of their distributions.


In [None]:
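# Visualize numerical features with boxplots
# Exclude the target; rebuilding the list keeps this cell safe to re-run
numeric_cols = [col for col in numeric_cols if col != 'status_label']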
num_cols = len(numeric_cols)
num_cols_per_row = 2
num_rows = math.ceil(num_cols / num_cols_per_row)

fig, axes = plt.subplots(
    nrows=num_rows, ncols=num_cols_per_row, figsize=(15, num_rows * 5))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    sns.boxplot(x=data[col], ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}')

# Remove any unused subplots (does not rely on the loop variable above)
for j in range(len(numeric_cols), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

In [None]:
X = data.drop(columns='status_label')

# Compute the correlation matrix
corr_matrix = X.corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than the threshold (e.g., 0.8)
threshold = 0.8
to_drop = [column for column in upper.columns if any(
    upper[column] > threshold)]

print(f"Features to drop due to high correlation (> {threshold}): {to_drop}")

# Drop the features from X
# Note: X_filtered is not used by later cells, which rebuild X from `data`
X_filtered = X.drop(columns=to_drop)

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
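
To make the upper-triangle filter above easier to follow, here is a minimal sketch on synthetic data. The frame `toy` and its columns `a`, `b`, `c` are hypothetical; `b` is constructed as a near copy of `a`, so it should be the only column flagged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                         # independent of 'a' and 'b'
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal, so each pair is seen once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.8).any()]

print(upper.round(2))
print("Flagged:", to_drop_toy)
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is flagged, which is why column order affects which feature survives.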

# Outlier Detection and Removal in the Majority Class

Outliers can significantly skew the results of data analysis and machine learning models. This section identifies and removes outliers from the majority class using the Interquartile Range (IQR) method, with fences placed 3 × IQR beyond the quartiles. That is wider than the conventional 1.5 × IQR, so only extreme values are removed. Only the majority class is filtered, so no minority-class examples are discarded, which would otherwise make the class imbalance worse.


In [None]:
# Function to remove outliers based on IQR
def remove_outliers_IQR(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

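# Separate the data by class (assumes label 0 is the majority class and
# label 1 the minority; check the label mapping printed earlier)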
data_majority = data[data['status_label'] == 0]
data_minority = data[data['status_label'] == 1]

print("Original majority class shape:", data_majority.shape)
print("Original minority class shape:", data_minority.shape)

# Apply the outlier removal to the majority class only
for col in numeric_cols:
    data_majority = remove_outliers_IQR(data_majority, col)

print("Majority class shape after outlier removal:", data_majority.shape)

# Recombine the data
data = pd.concat([data_majority, data_minority], axis=0).reset_index(drop=True)

# Verify class distribution after outlier removal
print("Class distribution after outlier removal:")
print(data['status_label'].value_counts())
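
The effect of the fence multiplier can be seen on a small example. The sketch below compares the conventional 1.5 × IQR fences with the 3 × IQR fences used above; the series `values` and the helper `iqr_bounds` are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical values with one moderate (30) and one extreme (100) outlier
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 30, 100])

def iqr_bounds(series, k):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low_15, high_15 = iqr_bounds(values, k=1.5)
low_3, high_3 = iqr_bounds(values, k=3)

kept_15 = values[values.between(low_15, high_15)]
kept_3 = values[values.between(low_3, high_3)]

print(f"k=1.5 fences: ({low_15}, {high_15}), rows kept: {len(kept_15)}")
print(f"k=3   fences: ({low_3}, {high_3}), rows kept: {len(kept_3)}")
```

With these numbers the quartiles are 12.5 and 17.5, so the 1.5 × IQR rule drops both 30 and 100, while the 3 × IQR rule drops only 100.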

# Split the Dataset into Training and Testing Sets

After preprocessing the data, the next step is to prepare it for model training and evaluation. This involves separating the dataset into features and the target variable, then splitting the data into training and testing subsets. Holding out a test set gives an estimate of how well the model generalizes to unseen data, and stratifying on the target keeps the class proportions the same in both subsets.

In [None]:
# Separate features and target variable
X = data.drop(columns=['status_label'])
y = data['status_label']

# Split the data into training and testing sets (before scaling and PCA)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("Class distribution in y_train:")
print(y_train.value_counts())
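
As a quick illustration of what `stratify` does, the sketch below splits a hypothetical 90/10 label vector and checks that both subsets keep the 10% positive rate. The arrays `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```

Without `stratify`, a random 20-row test set drawn from this data could easily contain no positives at all, which would make recall and precision for the minority class undefined.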

# Feature Scaling with StandardScaler

Scaling features is a critical preprocessing step in machine learning, especially for algorithms that are sensitive to the scale of input data (e.g., Support Vector Machines, Logistic Regression, and Neural Networks). This section applies feature scaling to the dataset using `StandardScaler` from scikit-learn. The scaler is fitted on the training data only, to prevent data leakage, and the same fitted transformation is then applied to both the training and testing sets.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
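
`StandardScaler` standardizes each feature as z = (x - mean) / std, where the mean and standard deviation come from the data it was fitted on. The sketch below checks this by hand on hypothetical arrays (`X_tr_toy`, `X_te_toy`), showing that test rows are scaled with training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with two features on different scales
X_tr_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te_toy = np.array([[4.0, 400.0]])

scaler_toy = StandardScaler().fit(X_tr_toy)   # statistics from training data only
scaled_test = scaler_toy.transform(X_te_toy)

# Manual version: population standard deviation (ddof=0), as StandardScaler uses
manual_test = (X_te_toy - X_tr_toy.mean(axis=0)) / X_tr_toy.std(axis=0)

print(scaled_test)
print(np.allclose(scaled_test, manual_test))
```

Note that the scaled test row is not centered at zero. That is expected: the test set is expressed in units of the training distribution rather than its own.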

# Dimensionality Reduction with Principal Component Analysis (PCA)

Dimensionality reduction is a crucial step in the machine learning pipeline, especially when dealing with high-dimensional data. It helps in simplifying models, reducing computational costs, and mitigating the risk of overfitting. This section applies **Principal Component Analysis (PCA)** to the scaled features to reduce the number of dimensions while retaining 90% of the variance in the data.

In [None]:
# Apply PCA (fit PCA on training data only)
pca = PCA(n_components=0.90, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Original number of features: {X_train.shape[1]}")
print(f"Reduced number of features after PCA: {X_train_pca.shape[1]}")
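
When `n_components` is a fraction such as 0.90, PCA keeps the smallest number of components whose cumulative explained-variance ratio reaches that fraction. The sketch below reproduces that selection by hand on synthetic data; `X_toy` and its per-column scales are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: five independent features with very different variances
X_toy = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Fit all components and accumulate their explained-variance ratios
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components reaching 90% of the variance
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)

# Let PCA make the same choice from the fractional n_components
pca_90 = PCA(n_components=0.90).fit(X_toy)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed by hand:", n_needed)
print("Components chosen by PCA: ", pca_90.n_components_)
```

In the notebook, plotting `np.cumsum(pca.explained_variance_ratio_)` for a PCA fitted with all components on `X_train_scaled` would show how close the 90% cut-off sits to neighbouring choices.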

In [None]:
# Retrieve PCA components
pca_components = pca.components_

# Get the number of principal components
num_pcs = pca_components.shape[0]

# Create names for principal components
pc_names = [f'PC{i+1}' for i in range(num_pcs)]

# Retrieve original feature names
original_feature_names = X.columns.tolist()

# Create a DataFrame for PCA components with original feature names as columns
pca_df = pd.DataFrame(
    pca_components, columns=original_feature_names, index=pc_names)

# Transpose for better readability
pca_df = pca_df.transpose()

# Display the PCA components
print("PCA Components (Principal Components):")
print(pca_df)

# Function to plot top contributing features for each principal component

def plot_top_features(pca, feature_names, top_n=5):
    for i in range(pca.n_components_):
        component = pca.components_[i]
        abs_component = np.abs(component)
        top_indices = abs_component.argsort()[::-1][:top_n]
        top_features = [feature_names[j] for j in top_indices]
        top_values = component[top_indices]

        plt.figure(figsize=(8, 4))
        sns.barplot(x=top_features, y=top_values, palette='viridis')
        plt.title(f'Top {top_n} Features for Principal Component {i+1}')
        plt.xlabel('Features')
        plt.ylabel('Loading Coefficient')
        plt.show()


# Plot top 5 features for each principal component
plot_top_features(pca, original_feature_names, top_n=5)

# Handling Class Imbalance with SMOTE (Synthetic Minority Over-sampling Technique)

Class imbalance is a common issue in machine learning, particularly in classification tasks where one class significantly outnumbers another. Imbalanced datasets can lead to biased models that favor the majority class, resulting in poor predictive performance for the minority class. To address this, **SMOTE (Synthetic Minority Over-sampling Technique)** is employed to create synthetic examples of the minority class, thereby balancing the class distribution and enhancing the model's ability to generalize effectively.

## Overview

This section demonstrates how to handle class imbalance in the training dataset using SMOTE. By resampling the training data, we aim to improve the classifier's performance on the minority class without altering the testing data, ensuring that the evaluation remains unbiased.

In [None]:
# Handling class imbalance with SMOTE
print("Class distribution in y_train before SMOTE:")
print(y_train.value_counts())

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train_pca, y_train)

print("Class distribution in y_train after SMOTE:")
print(y_train_res.value_counts())
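
The core idea behind SMOTE is interpolation: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea only. It is not the `imblearn` implementation, and the function `smote_like_sample` and the array `X_min` are hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, rng=None):
    """Create one synthetic point by interpolating towards a near neighbour."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Euclidean distances from x to every minority point
    dists = np.linalg.norm(X_minority - x, axis=1)
    # Position 0 of the sorted order is x itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    x_nn = X_minority[rng.choice(neighbours)]
    lam = rng.random()                 # interpolation weight in [0, 1)
    return x + lam * (x_nn - x)

# Hypothetical minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(42)
synthetic = np.array([smote_like_sample(X_min, k=2, rng=rng) for _ in range(5)])

print(synthetic.round(3))
```

Because each synthetic point is a convex combination of two real minority points, it always lies between them. This is also why SMOTE is applied to the training data only: synthetic rows in the test set would no longer represent real companies.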

# Training a Logistic Regression Model

With the dataset now preprocessed, scaled, and balanced, the next step is to build and train a machine learning model. **Logistic Regression** is a fundamental classification algorithm that is widely used due to its simplicity, interpretability, and efficiency. This section demonstrates how to create, train, and make predictions using a Logistic Regression model on the prepared dataset.

In [None]:
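# Create the model
# Note: class_weight='balanced' gives every class a weight of 1 when the
# classes are already equal in size, as they are after SMOTE above
log_reg = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')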

# Train the model
log_reg.fit(X_train_res, y_train_res)

# Make predictions
y_pred_lr = log_reg.predict(X_test_pca)
y_prob_lr = log_reg.predict_proba(X_test_pca)[:, 1]
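
For a binary problem, the probability returned by `predict_proba` is the logistic (sigmoid) function applied to a linear score, p = 1 / (1 + exp(-(w·x + b))). The sketch below verifies this on a synthetic dataset; `X_toy`, `y_toy`, and `clf` are hypothetical and unrelated to the bankruptcy data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification data
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear score w.x + b, then the sigmoid
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

sk_prob = clf.predict_proba(X_toy)[:, 1]
print("Matches predict_proba:", np.allclose(manual_prob, sk_prob))
```

`predict` corresponds to thresholding this probability at 0.5. With imbalanced classes, choosing a different threshold on `y_prob_lr` is one way to trade precision against recall without retraining.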

# Evaluating the Logistic Regression Model

After training the Logistic Regression model, it's essential to assess its performance to ensure that it generalizes well to unseen data. This section introduces an evaluation function that computes key performance metrics and visualizes the Receiver Operating Characteristic (ROC) curve. By systematically evaluating the model, we can gain insights into its strengths and areas for improvement.

In [None]:

# Evaluate the model
def evaluate_model(y_test, y_pred, y_prob, model_name):
    print(f"\nEvaluation Metrics for {model_name}:")
    print("-------------------------------------")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1-Score: {f1_score(y_test, y_pred):.4f}")
    print(f"ROC-AUC Score: {roc_auc_score(y_test, y_prob):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    # Plot ROC Curve
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    plt.figure(figsize=(8, 6))
    plt.plot(
        fpr, tpr, label=f'{model_name} (AUC = {roc_auc_score(y_test, y_prob):.4f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {model_name}')
    plt.legend(loc='lower right')
    plt.show()


# Evaluate Logistic Regression
evaluate_model(y_test, y_pred_lr, y_prob_lr, 'Logistic Regression')
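
The precision, recall, and F1 values printed by `evaluate_model` all derive from the four cells of the confusion matrix. The sketch below computes them by hand on a hypothetical set of ten labels (`y_true`, `y_hat`) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical ground truth and predictions (1 = positive class)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Here accuracy (0.70) looks better than precision (0.60) because the true negatives count towards it. The same effect is stronger on heavily imbalanced data, which is why the per-class metrics matter more than accuracy for this project.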

In [None]:
svc = SVC(probability=True, class_weight='balanced', random_state=42)
svc.fit(X_train_res, y_train_res)
y_pred_svc = svc.predict(X_test_pca)
y_prob_svc = svc.predict_proba(X_test_pca)[:, 1]

evaluate_model(y_test, y_pred_svc, y_prob_svc, 'Support Vector Machine')

In [None]:
rf = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train_res, y_train_res)
y_pred_rf = rf.predict(X_test_pca)
y_prob_rf = rf.predict_proba(X_test_pca)[:, 1]

evaluate_model(y_test, y_pred_rf, y_prob_rf, 'Random Forest Classifier')

# Analysis of Logistic Regression Model


The logistic regression model achieved an accuracy of 85.74% in predicting company bankruptcy. For the "bankrupt" class, precision was 0.54, recall 0.66, and F1-score 0.59; for the "non-bankrupt" class, precision (0.93) and F1-score (0.91) were both high. In other words, the model predicts non-bankrupt companies well but misses about a third of the bankrupt ones, and nearly half of its bankruptcy predictions are false alarms. The ROC curve, with an AUC of 0.8357, shows that the model separates the two classes reasonably well, though not perfectly. The weaker results on the bankrupt class may reflect the linear decision boundary of logistic regression, which may not capture the nonlinear patterns often found in financial features.

# Next Steps

We can compare results across models to determine which is best suited to predicting company bankruptcy. We will evaluate SVM and Random Forest using the same metrics: accuracy, precision, recall, F1-score, and ROC-AUC. Since Random Forest captures nonlinear patterns better, it may improve recall and F1-score for the "bankrupt" class. Visualizing ROC curves for each model can highlight their class separation abilities. Additionally, a precision-recall curve for each model will help assess performance, especially for imbalanced classes. By examining these metrics and curves side by side, we can identify the model best suited to our data and goals, especially in terms of identifying bankrupt companies accurately.
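
A precision-recall curve can be computed with scikit-learn's `precision_recall_curve`, and summarized in a single number with `average_precision_score`. The sketch below uses four hypothetical labels and scores (`y_true`, `y_score`); in the notebook, `y_test` and each model's predicted probabilities would take their place.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

# A classifier with no skill scores roughly the positive-class prevalence
baseline = y_true.mean()

print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print(f"average precision = {ap:.3f} (no-skill baseline = {baseline:.2f})")
```

Unlike ROC-AUC, average precision ignores true negatives, so it is less likely to be flattered by a large majority class.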

**Checklist for HTML:**

- 1+ Data Preprocessing Method Implemented (DONE)
- 1+ ML Algorithms/Models Implemented (DONE - Logistic Regression, SVM, Random Forest)
- CS 4641: Supervised Learning Method Implemented (DONE - Logistic Regression, SVM, Random Forest)
- Visualizations (DONE)
- Quantitative Metrics (DONE)
- Analysis of 1+ Algorithm/Model (DONE)
- Next Steps (DONE)