# Predicting Bankruptcy In Poland

## Project Overview

**Goal:** The objective of this project is to predict whether a Polish company will face bankruptcy based on historical financial data.

**The Data:**
The dataset consists of financial ratios (stored as `feat_1` to `feat_64`) extracted from economic reports of Polish companies. The data is stored as **gzip-compressed JSON** (`.json.gz`), so it must be decompressed and parsed before it can be loaded into a DataFrame.
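
To make the file format concrete, here is a minimal round-trip sketch using only the standard library. The layout (a top-level `"data"` key holding one record per company, each with a `company_id`) mirrors what the `wrangle` function in this notebook expects; the two records and the temporary file name are made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Toy records in the layout the notebook's wrangle() function reads:
# a top-level "data" key holding a list of per-company dictionaries.
# The values are invented for illustration only.
records = {
    "data": [
        {"company_id": 1, "feat_1": 0.2, "bankrupt": False},
        {"company_id": 2, "feat_1": -0.1, "bankrupt": True},
    ]
}

# Write gzip-compressed JSON to a temporary file
path = os.path.join(tempfile.mkdtemp(), "toy.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(records, f)

# Read it back: decompress, then parse the JSON
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

# Index the records by company, as set_index("company_id") does for a DataFrame
rows_by_id = {row["company_id"]: row for row in loaded["data"]}
print(rows_by_id[2])
```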

**Methodology:**
1.  **Exploratory Data Analysis (EDA):** We will analyze the distribution of the financial features and check for multicollinearity.
2.  **Handling Imbalance:** Since bankruptcy is a rare event (imbalanced class), we will utilize **resampling techniques** (`RandomUnderSampler` and `RandomOverSampler`) to improve model performance.
3.  **Modeling:** We will build and evaluate classification pipelines, specifically using **Decision Trees** and other classifiers to predict the binary outcome.
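
As a small illustration of the multicollinearity check in step 1, the sketch below flags feature pairs whose Pearson correlation is strong. The helper `pearson`, the toy columns `feat_a` to `feat_c`, and the 0.9 cut-off are all assumptions made for this example; in the notebook itself, `df.corr()` computes the same coefficients for every pair of real features.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical toy features: feat_b is roughly 2 * feat_a, feat_c is unrelated
features = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 4.0, 6.2, 7.9, 10.1],
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

# Keep every pair whose absolute correlation exceeds the cut-off
high_pairs = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > THRESHOLD
]
print(high_pairs)
```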

### Imports

In [None]:
import gzip
import json
import pickle

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
def wrangle(filename):
    # Open the gzip-compressed file and parse the JSON into a dictionary
    with gzip.open(filename, "r") as f:
        data = json.load(f)

    # Load the records under the "data" key into a DataFrame, indexed by company
    df = pd.DataFrame.from_dict(data["data"]).set_index("company_id")
    return df

In [None]:
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()

### Explore

In [None]:
df.info()

In [None]:
# Plot class balance
df["bankrupt"].value_counts(normalize=True).plot(
    kind="bar",
    xlabel="bankrupt",
    ylabel="frequency",
);

In [None]:
# Loop through first 9 columns
for feature in df.columns:
    if feature == "bankrupt":
        continue
    if feature == "feat_10":
        break

    plt.figure(figsize=(8, 5))
    sns.boxplot(x="bankrupt", y=feature, data=df, showfliers=False)
    
    plt.xlabel("Bankrupt Status")
    plt.ylabel(feature)
    plt.title(f"Distribution of {feature} by Bankruptcy Status")
    plt.show()

In [None]:
corr = df.drop(columns="bankrupt").corr()
sns.heatmap(corr)

## Split

In [None]:
target = "bankrupt"
X = df.drop(columns=target)
y = df[target]

print("X shape:", X.shape)
print("y shape:", y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

## Handling Class Imbalance via Resampling

Given the imbalanced nature of the bankruptcy data, we employed resampling techniques to help the model learn the minority class ("Bankrupt") more effectively.

We tested two approaches:
1.  `RandomUnderSampler`: Reducing the majority class.
2.  `RandomOverSampler`: Duplicating examples in the minority class.

**Crucial Step:** All resampling was performed **after** the train-test split and **only** on the training set. This ensures that our validation metrics reflect the model's performance on real-world (imbalanced) data.
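
To show what the two strategies do to the class counts, here is a hand-rolled sketch using only the standard library. It is not how `imblearn` implements `RandomOverSampler` or `RandomUnderSampler`; the helper names `random_oversample` and `random_undersample` and the 2-versus-18 toy labels are hypothetical, chosen only to illustrate the idea.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampling with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class so every class matches the smallest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):  # sampling without replacement
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training set: 2 bankrupt companies, 18 healthy ones
rows = list(range(20))
labels = [True] * 2 + [False] * 18

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

print("original:    ", Counter(labels))
print("oversampled: ", Counter(over_labels))
print("undersampled:", Counter(under_labels))
```

Oversampling keeps every original row and grows the training set, while undersampling shrinks it to twice the minority count.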

In [None]:
under_sampler = RandomUnderSampler(random_state=21)
X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train)
print(X_train_under.shape)
X_train_under.head()

In [None]:
over_sampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train)
print(X_train_over.shape)
X_train_over.head()

## Build Model

### Baseline

In [None]:
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 4))

In [None]:
# Train three identical pipelines, one per training set
# Fit on `X_train`, `y_train`
model_reg = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier()
)
model_reg.fit(X_train, y_train)  

# Fit on `X_train_under`, `y_train_under`

model_under = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier()
)
model_under.fit(X_train_under, y_train_under) 

# Fit on `X_train_over`, `y_train_over`

model_over = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier()
)
model_over.fit(X_train_over, y_train_over)

### Evaluate

In [None]:
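models = {"model_reg": model_reg, "model_under": model_under, "model_over": model_over}
for name, m in models.items():
    acc_train = m.score(X_train, y_train)
    acc_test = m.score(X_test, y_test)

    # Label each pair of scores so the output shows which model produced it
    print(name)
    print("Training Accuracy:", round(acc_train, 4))
    print("Test Accuracy:", round(acc_test, 4))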

## Model Selection: Why Oversampling?

Upon evaluating our three baseline models (Regular, Undersampled, and Oversampled), we observed the following:

1.  **Undersampling Performance:** The `model_under` performed significantly worse (Test Accuracy: ~0.70). This suggests that by removing data from the majority class, we lost vital information needed for the model to distinguish between bankrupt and non-bankrupt companies.
2.  **Oversampling Performance:** The `model_over` achieved the highest Test Accuracy (~0.9444), outperforming both the regular and undersampled models.

**Conclusion:**
Since Oversampling effectively balanced the classes without discarding valuable data, we will proceed with the **Oversampled** dataset for our final model training and hyperparameter tuning.

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model_over, X_test,y_test)

## Confusion Matrix Analysis

The confusion matrix gives us a detailed breakdown of how our **Oversampled Decision Tree** performed on the test set.

* **True Negatives (1856):** The model correctly identified 1,856 companies that did *not* go bankrupt.
* **False Positives (57):** The model predicted that 57 companies would go bankrupt, but they actually survived (Type I error).
* **False Negatives (54):** The model failed to predict bankruptcy for 54 companies that actually collapsed (Type II error).
* **True Positives (29):** The model correctly caught 29 bankruptcies.

### Key Takeaway: The "Accuracy Paradox"
While our accuracy is high (~94%), this is largely because the model is very good at predicting the majority class (Healthy Companies). However, for the minority class (Bankruptcy), the model only identified **29 out of 83** actual cases.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{29}{29 + 54} \approx 0.35$$

**Conclusion:** A Recall of **35%** means the model misses roughly two out of every three bankruptcies in the test set. To improve this, we may need to try more advanced models (like Random Forest or Gradient Boosting) or adjust our classification threshold.
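
The figures above can be reproduced directly from the four confusion-matrix counts. The sketch below also computes precision, F1, and the accuracy of a trivial rule that labels every test company "not bankrupt"; none of those three is reported in the text, but all follow from the same counts.

```python
# Counts read off the decision-tree confusion matrix above
tn, fp, fn, tp = 1856, 57, 54, 29

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # share of actual bankruptcies that were caught
precision = tp / (tp + fp)     # share of bankruptcy flags that were correct
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial rule that labels every test company "not bankrupt"
majority_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"recall            = {recall:.4f}")
print(f"precision         = {precision:.4f}")
print(f"f1                = {f1:.4f}")
print(f"majority accuracy = {majority_accuracy:.4f}")
```

On this test set the trivial rule scores slightly higher accuracy than the tree while catching no bankruptcies at all, which is why accuracy alone is a poor yardstick here.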

In [None]:
# Get importances

importances = model_over.named_steps["decisiontreeclassifier"].feature_importances_

# Put importances into a Series
feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
# Plot series
feat_imp.tail(15).plot(kind="barh")


plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("model_over Feature Importance");


## Using RandomForest

In [None]:
clf = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(random_state=42)
)
print(clf)

In [None]:
# cross validation
import time
start_time = time.time()
cv_acc_scores = cross_val_score(clf, X_train_over, y_train_over, cv=5, n_jobs=-1)
print(cv_acc_scores)
end_time = time.time()
elapsed_time = end_time-start_time
print(elapsed_time)

### Hyperparameter tuning

In [None]:
params = {
    "simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__n_estimators": range(25, 100, 25),
    "randomforestclassifier__max_depth": range(10,50,10)
}
params

In [None]:
model = GridSearchCV(
    clf,
    param_grid=params,
    cv=5,
    n_jobs=-1,
    verbose=1,
)
model

In [None]:
# Train model
model.fit(X_train_over, y_train_over)

In [None]:
cv_results =pd.DataFrame(model.cv_results_)
cv_results.head(10)

In [None]:
# Extract best hyperparameters
model.best_params_

In [None]:
acc_train = model.score(X_train, y_train)
acc_test = model.score(X_test,y_test)

print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model,X_test, y_test)

## Random Forest Evaluation (Hyperparameter Tuned)

We performed a Grid Search to find the best-scoring hyperparameters for our Random Forest within the grid defined above.

**Best Parameters Found:**
* `randomforestclassifier__max_depth`: 40 (Quite deep, allowing for complex decision boundaries)
* `randomforestclassifier__n_estimators`: 75 (Number of trees in the forest)
* `simpleimputer__strategy`: `"median"`
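
To put the search in perspective, the sketch below enumerates the Random Forest grid from the hyperparameter-tuning cell and counts how many model fits the 5-fold search implies. It also checks whether the best values reported above sit on the edge of the grid. Only the standard library is used, and the variable names are illustrative.

```python
from itertools import product

# The same grid as the `params` dictionary in the tuning cell
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__n_estimators": range(25, 100, 25),
    "randomforestclassifier__max_depth": range(10, 50, 10),
}

# Every combination of settings the grid search will try
candidates = [
    dict(zip(param_grid, combo)) for combo in product(*param_grid.values())
]
n_fits = len(candidates) * 5  # cv=5 folds per candidate

# Best numeric settings reported in this section
best = {
    "randomforestclassifier__max_depth": 40,
    "randomforestclassifier__n_estimators": 75,
}
# True when the winning value is the largest one the grid offered
on_upper_edge = {name: value == max(param_grid[name]) for name, value in best.items()}

print(len(candidates), "candidates,", n_fits, "fits")
print(on_upper_edge)
```

Both winning values are the largest the grid allowed, so a wider range for `max_depth` and `n_estimators` might be worth trying; whether that helps is untested here.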

### Performance Analysis
* **Training Accuracy:** 1.0 (The model has perfectly memorized the training data).
* **Test Accuracy:** 0.9589 (Very high general accuracy).

### The Confusion Matrix: A Critical Look
Despite the high accuracy score, the Confusion Matrix reveals a significant issue regarding the minority class ("Bankrupt"):

* **True Negatives (1903):** The model is excellent at identifying healthy companies.
* **False Negatives (72):** The model missed **72 out of 83** actual bankruptcies.
* **True Positives (11):** The model only correctly identified 11 bankrupt companies.

**Recall Calculation:**
$$\text{Recall} = \frac{11}{11 + 72} \approx 13.3\%$$

### Conclusion
While the Random Forest achieved a higher **Accuracy (96%)** than our previous Decision Tree, it suffered a massive drop in **Recall (13.3%)**.
The model has become conservative; it maximizes accuracy by predicting "Healthy" for almost everyone. In a bankruptcy prediction context, this is dangerous because we are missing the vast majority of at-risk companies.
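
The comparison can be checked numerically. The Random Forest's false-positive count is not stated in the text, so the sketch below infers it under one assumption: both confusion matrices were computed on the same `X_test`, so the number of truly non-bankrupt companies is identical. The inferred value is a derivation, not a reported result.

```python
# Decision tree counts, as reported earlier in this notebook
dt = {"tn": 1856, "fp": 57, "fn": 54, "tp": 29}
# Random forest counts reported above; "fp" is not stated in the text
rf = {"tn": 1903, "fn": 72, "tp": 11}

# ASSUMPTION: both matrices come from the same X_test, so the number of
# truly non-bankrupt companies (tn + fp) must match across models.
actual_negatives = dt["tn"] + dt["fp"]
rf["fp"] = actual_negatives - rf["tn"]  # inferred, not reported


def recall(m):
    """Share of actual bankruptcies the model caught."""
    return m["tp"] / (m["tp"] + m["fn"])


def accuracy(m):
    """Share of all test companies classified correctly."""
    return (m["tp"] + m["tn"]) / sum(m.values())


for name, m in [("decision tree", dt), ("random forest", rf)]:
    print(f"{name}: accuracy={accuracy(m):.4f}, recall={recall(m):.4f}, fp={m['fp']}")
```

The inferred count reproduces the reported test accuracy of 0.9589, which supports the assumption.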

In [None]:
# Get feature names from training data
features = X_train_over.columns
# Extract importances from model
importances = model.best_estimator_.named_steps["randomforestclassifier"].feature_importances_
# Create a series with feature names and importances
feat_imp = pd.Series(importances, index=features).sort_values()
# Plot 10 most important features
feat_imp.tail(10).plot(kind="barh")

plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");

## Using Gradient Boost

In [None]:
clf_2 = make_pipeline(
    SimpleImputer(),
    GradientBoostingClassifier()
)

In [None]:
params_grad = {
    "simpleimputer__strategy": ["mean", "median"],
    "gradientboostingclassifier__max_depth": range(2,5),
    "gradientboostingclassifier__n_estimators": range(20,31,5)
}
params_grad

In [None]:
model_grad = GridSearchCV(clf_2, param_grid=params_grad, cv=5,n_jobs=-1, verbose=1)

In [None]:
# Fit model to over-sampled training data
model_grad.fit(X_train_over,y_train_over)

In [None]:
results = pd.DataFrame(model_grad.cv_results_)
results.sort_values("rank_test_score").head(10)

In [None]:
# Extract best hyperparameters
model_grad.best_params_

In [None]:
acc_train = model_grad.score(X_train, y_train)
acc_test = model_grad.score(X_test,y_test)

print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(model_grad,X_test,y_test)

In [None]:
# Print classification report
print(classification_report(y_test, model_grad.predict(X_test)))

In [None]:
importances = model_grad.best_estimator_.named_steps["gradientboostingclassifier"].feature_importances_
feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
feat_imp.tail(10).plot(kind="barh")
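# Gradient boosting fits regression trees, so the importances are impurity-based, not Gini
plt.xlabel("Impurity-based Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");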

# Project Conclusion & Strategic Recommendations

## 1. Model Evolution: From Accuracy to Sensitivity
Our objective was to identify Polish companies at risk of bankruptcy. We iterated through three distinct modeling approaches, facing the classic "Accuracy vs. Recall" trade-off.

| Model | Accuracy | Recall (Bankruptcy Capture) | Assessment |
| :--- | :--- | :---: | :--- |
| **Decision Tree** | ~94% | 35% | **Baseline:** Too simplistic; missed significant risks. |
| **Random Forest** | **96%** | 13% | **The "Safe" Model:** Overfitted to the majority class. Excellent at identifying healthy companies, but failed its primary purpose of risk detection. |
| **Gradient Boosting** | 88% | **76%** | **The "Risk-Aware" Model:** Successfully captured the majority of bankruptcies, making it the most viable model for risk management. |



## 2. The Core Challenge: Distribution Shift
Despite the success of the Gradient Boosting model in capturing bankruptcies (High Recall), it suffers from Low Precision (many False Positives). The root cause is a fundamental **Distribution Shift**:

* **Training Reality (Artificial):** We trained on **Oversampled Data**, creating a balanced "50/50 World" where bankruptcy is common. The model learned to be aggressive and expect bankruptcy frequently.
* **Testing Reality (Actual):** We evaluated on **Imbalanced Test Data**, representing the real economy where bankruptcy is rare (<5%).

**The Result:** Because the model was "raised" in a world where half the companies go bankrupt, it is naturally hypersensitive when applied to the real world. It flags many healthy companies as risky because it has been conditioned to spot danger everywhere.
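
One standard way to quantify this effect is a prior-shift correction, sketched below. It is not something this notebook applies. The function name `correct_for_prior` is hypothetical. The training prior of 0.5 assumes the oversampled set is exactly balanced, and the real-world prior is taken from the test set (83 bankruptcies out of 1,996 companies, per the confusion matrices). The correction also assumes the model's probabilities are well calibrated, which tree ensembles do not guarantee.

```python
def correct_for_prior(p_train, train_prior, real_prior):
    """Rescale a probability learned under train_prior to reflect real_prior.

    Assumes only the class proportions differ between the two settings
    (the feature distribution within each class is unchanged).
    """
    pos = p_train * real_prior / train_prior
    neg = (1 - p_train) * (1 - real_prior) / (1 - train_prior)
    return pos / (pos + neg)


TRAIN_PRIOR = 0.5        # assumed: oversampling produced a 50/50 training set
REAL_PRIOR = 83 / 1996   # bankrupt share of the test set in this notebook

# How confident scores from the balanced world translate to the real base rate
for p in (0.5, 0.7, 0.9):
    adjusted = correct_for_prior(p, TRAIN_PRIOR, REAL_PRIOR)
    print(f"score {p:.1f} in the balanced world -> {adjusted:.3f} at the real base rate")
```

A score of 0.5 from a model trained on balanced data maps back to the base rate itself, which illustrates why the default 0.5 cut-off flags so many healthy companies.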

## 3. Business Recommendations
For a financial institution, **Gradient Boosting** is the recommended model despite the False Positives.
* **Cost of False Negative (Missed Bankruptcy):** Huge financial loss (unpaid loans).
* **Cost of False Positive (False Alarm):** Administrative cost (auditing a healthy company).

**Strategy:** Use the Gradient Boosting model as a **"First-Pass Filter."** Any company flagged as "Bankrupt" by the model should be sent to a human analyst for review. This narrows the focus from thousands of companies to just the high-risk few.
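
The cost argument can be made explicit with a small calculation. The unit costs below are hypothetical placeholders, not figures from this project; only the error counts come from the notebook (the Decision Tree's confusion matrix and the 83 actual bankruptcies in the test set).

```python
def total_cost(fn, fp, cost_fn, cost_fp):
    """Total cost of a model's mistakes under fixed per-error costs."""
    return fn * cost_fn + fp * cost_fp


# HYPOTHETICAL unit costs: a missed bankruptcy costs 100x a false alarm
COST_FN = 100.0
COST_FP = 1.0

# Decision tree errors on the test set, from its confusion matrix
decision_tree = total_cost(fn=54, fp=57, cost_fn=COST_FN, cost_fp=COST_FP)

# A rule that never flags anyone misses all 83 bankruptcies but raises no alarms
flag_nobody = total_cost(fn=83, fp=0, cost_fn=COST_FN, cost_fp=COST_FP)

# The tree is cheaper whenever cost_fn / cost_fp exceeds this ratio:
# 54r + 57 < 83r  =>  r > 57 / (83 - 54)
break_even_ratio = 57 / (83 - 54)

print("decision tree:", decision_tree)
print("flag nobody:  ", flag_nobody)
print("break-even cost ratio:", round(break_even_ratio, 2))
```

The same function can be applied to any model's error counts once realistic costs are agreed with the business.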

## 4. Next Steps for Improvement
To further refine this model and address the imbalance issue, we propose:
1.  **Threshold Tuning:** Instead of using the default probability threshold of 0.5, we can adjust the decision boundary to increase Precision without sacrificing too much Recall.
2.  **Cost-Sensitive Learning:** Assigning a heavier penalty to "missed bankruptcies" during the training phase rather than just oversampling.
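
As a sketch of the threshold-tuning idea, the code below sweeps a few cut-offs over toy probabilities and reports precision and recall at each. The probabilities, labels, and helper name `precision_recall_at` are invented for illustration. In the notebook, the scores would presumably come from something like `model_grad.predict_proba(X_test)[:, 1]`, which has not been run here.

```python
def precision_recall_at(threshold, probs, labels):
    """Precision and recall when every score >= threshold is flagged as bankrupt."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y)
    # Convention for this sketch: precision is 0.0 when nothing is flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy predicted bankruptcy probabilities and true outcomes (illustrative only)
probs = [0.95, 0.80, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [True, True, False, True, False, False, True, False, False, False]

for threshold in (0.25, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold, probs, labels)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold trades recall for precision, so the cut-off should be chosen against the relative cost of a missed bankruptcy versus a false alarm.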

In [None]:
# import os
# import matplotlib.pyplot as plt
# import seaborn as sns
# from sklearn.metrics import ConfusionMatrixDisplay
# import pandas as pd

# # 1. Create a folder named 'visuals' if it doesn't exist
# os.makedirs("visuals", exist_ok=True)

# # ==========================================
# # PART 1: EDA Visualizations
# # ==========================================

# # 1.1 Class Balance
# plt.figure(figsize=(8, 6))
# df["bankrupt"].value_counts(normalize=True).plot(
#     kind="bar",
#     xlabel="Bankrupt",
#     ylabel="Frequency",
#     title="Class Balance"
# )
# plt.title("Class Balance Distribution")
# plt.tight_layout()
# plt.savefig("visuals/class_balance.png")
# plt.close()

# # 1.2 Boxplots (First 9 features)
# for feature in df.columns:
#     if feature == "bankrupt":
#         continue
#     if feature == "feat_10":
#         break

#     plt.figure(figsize=(8, 5))
#     sns.boxplot(x="bankrupt", y=feature, data=df, showfliers=False)
    
#     plt.xlabel("Bankrupt Status")
#     plt.ylabel(feature)
#     plt.title(f"Distribution of {feature} by Bankruptcy Status")
#     plt.tight_layout()
#     plt.savefig(f"visuals/boxplot_{feature}.png")
#     plt.close()

# # 1.3 Correlation Heatmap
# plt.figure(figsize=(12, 10))
# corr = df.drop(columns="bankrupt").corr()
# sns.heatmap(corr)
# plt.title("Feature Correlation Heatmap")
# plt.tight_layout()
# plt.savefig("visuals/correlation_heatmap.png")
# plt.close()


# # ==========================================
# # PART 2: Decision Tree Model (model_over)
# # ==========================================

# # 2.1 Confusion Matrix
# plt.figure(figsize=(8, 6))
# ConfusionMatrixDisplay.from_estimator(model_over, X_test, y_test)
# plt.title("Decision Tree Confusion Matrix")
# plt.tight_layout()
# plt.savefig("visuals/decision_tree_confusion_matrix.png")
# plt.close()

# # 2.2 Feature Importance
# plt.figure(figsize=(10, 6))
# importances = model_over.named_steps["decisiontreeclassifier"].feature_importances_
# feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
# feat_imp.tail(15).plot(kind="barh")
# plt.xlabel("Gini Importance")
# plt.ylabel("Feature")
# plt.title("Decision Tree Feature Importance")
# plt.tight_layout()
# plt.savefig("visuals/decision_tree_feature_importance.png")
# plt.close()


# # ==========================================
# # PART 3: Random Forest Model (model)
# # ==========================================

# # 3.1 Confusion Matrix
# plt.figure(figsize=(8, 6))
# ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
# plt.title("Random Forest Confusion Matrix")
# plt.tight_layout()
# plt.savefig("visuals/random_forest_confusion_matrix.png")
# plt.close()

# # 3.2 Feature Importance
# # Note: Using model.best_estimator_ because 'model' is a GridSearchCV object
# plt.figure(figsize=(10, 6))
# importances = model.best_estimator_.named_steps["randomforestclassifier"].feature_importances_
# feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
# feat_imp.tail(10).plot(kind="barh")
# plt.xlabel("Gini Importance")
# plt.ylabel("Feature")
# plt.title("Random Forest Feature Importance")
# plt.tight_layout()
# plt.savefig("visuals/random_forest_feature_importance.png")
# plt.close()


# # ==========================================
# # PART 4: Gradient Boosting Model (model_grad)
# # ==========================================

# # 4.1 Confusion Matrix
# plt.figure(figsize=(8, 6))
# ConfusionMatrixDisplay.from_estimator(model_grad, X_test, y_test)
# plt.title("Gradient Boosting Confusion Matrix")
# plt.tight_layout()
# plt.savefig("visuals/gradient_boosting_confusion_matrix.png")
# plt.close()

# # 4.2 Feature Importance
# # Note: Using model_grad.best_estimator_ because 'model_grad' is a GridSearchCV object
# plt.figure(figsize=(10, 6))
# importances = model_grad.best_estimator_.named_steps["gradientboostingclassifier"].feature_importances_
# feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values()
# feat_imp.tail(10).plot(kind="barh")
# plt.xlabel("Gini Importance")
# plt.ylabel("Feature")
# plt.title("Gradient Boosting Feature Importance")
# plt.tight_layout()
# plt.savefig("visuals/gradient_boosting_feature_importance.png")
# plt.close()

# print("All visualisations have been saved to the 'visuals' folder.")