We will be creating a **classifcation model to predict if a company goes bankrupt or not**. 

The data collected is from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

In [1]:
import pandas as pd
# import shap
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.metrics import roc_curve, auc, confusion_matrix, f1_score, ConfusionMatrixDisplay, classification_report

In [8]:
df = pd.read_csv('data.csv')
df.shape

(6819, 96)

In [9]:
df

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.405750,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.827890,0.290202,0.026601,0.564050,1,0.016469
1,1,0.464291,0.538214,0.516730,0.610235,0.610235,0.998946,0.797380,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.601450,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.774670,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.998700,0.796967,0.808966,0.303350,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.035490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6814,0,0.493687,0.539468,0.543230,0.604455,0.604462,0.998992,0.797409,0.809331,0.303510,...,0.799927,0.000466,0.623620,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.029890
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.303520,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.002840,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.607850,0.607850,0.999074,0.797500,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009


In [10]:
# Check for missing values in each column
missing_values = df.isnull().sum()

# Display columns with missing values
missing_columns = missing_values[missing_values > 0]
print("Columns with missing values:")
print(missing_columns)

# Display the percentage of missing values
missing_percentage = (missing_columns / len(df)) * 100
print("\nPercentage of missing values in each column:")
print(missing_percentage)

Columns with missing values:
Series([], dtype: int64)

Percentage of missing values in each column:
Series([], dtype: float64)


In [11]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

number of duplicate rows:  (0, 96)


In [12]:
is_all_numeric = not df.isnull().values.any()
print(is_all_numeric)

True


We can see that the target variable is the 'Bankrupt?' column. There are **no missing values and no duplicate values**. All columns are numeric.

In [13]:
print(df.columns.tolist())

['Bankrupt?', ' ROA(C) before interest and depreciation before interest', ' ROA(A) before interest and % after tax', ' ROA(B) before interest and depreciation after tax', ' Operating Gross Margin', ' Realized Sales Gross Margin', ' Operating Profit Rate', ' Pre-tax net Interest Rate', ' After-tax net Interest Rate', ' Non-industry income and expenditure/revenue', ' Continuous interest rate (after tax)', ' Operating Expense Rate', ' Research and development expense rate', ' Cash flow rate', ' Interest-bearing debt interest rate', ' Tax rate (A)', ' Net Value Per Share (B)', ' Net Value Per Share (A)', ' Net Value Per Share (C)', ' Persistent EPS in the Last Four Seasons', ' Cash Flow Per Share', ' Revenue Per Share (Yuan ¥)', ' Operating Profit Per Share (Yuan ¥)', ' Per Share Net profit before tax (Yuan ¥)', ' Realized Sales Gross Profit Growth Rate', ' Operating Profit Growth Rate', ' After-tax Net Profit Growth Rate', ' Regular Net Profit Growth Rate', ' Continuous Net Profit Growth 

In [14]:
df['Bankrupt?'].value_counts()

Bankrupt?
0    6599
1     220
Name: count, dtype: int64

We will be using **stratified train-test split** as the target variable is highly imbalanced. This works by dividing the dataset into subgroups based on the target variable (bankruptcy status), ensuring that each subgroup is represented proportionally in both training and testing sets. This prevents the model from being biased towards the majority class and improves its ability to accurately predict bankruptcies.

In [15]:
def split_data(df, target_column='Bankrupt?'):
  """
  Prepares the data for modeling by splitting into features and target variable, 
  and then performing a stratified split for training and testing sets.

  Args:
    df: The pandas DataFrame containing the data.
    target_column: The name of the target column.

  Returns:
    X_train: The training features.
    X_test: The testing features.
    y_train: The training target variable.
    y_test: The testing target variable.
  """

  X = df.drop(target_column, axis=1)  # Features
  y = df[target_column]  # Target variable

  # Stratified split
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
  print(y_train.value_counts(normalize=True))
  print(y_test.value_counts(normalize=True))

  return X_train, X_test, y_train, y_test

Let's use **SMOTE (Synthetic Minority Over-sampling Technique)** to address the class imbalance as the (class = 1) has significantly fewer samples than the other. SMOTE works by creating synthetic samples for the minority class, thereby balancing the dataset.

In [16]:
def apply_smote(X_train, y_train, random_state=42):
  """
  Applies SMOTE to balance the class distribution in the training data.

  Args:
    X_train: The training features.
    y_train: The training target labels.
    random_state: Random seed for reproducibility.

  Returns:
    X_train_resampled: The resampled training features.
    y_train_resampled: The resampled training target labels.
  """

  smote = SMOTE(random_state=random_state)
  X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

  return X_train_resampled, y_train_resampled

For the evaluation metrics, we will be using 
1. **Precision**: It explains how many of the correctly predicted cases actually turned out to be positive. 
2. **Recall**:  It explains how many of the actual positive cases we were able to predict correctly with our model.
3. **F1-score**: A single metric that combines precision and recall using the harmonic mean. It's a balance between being cautious (precision) and being thorough (recall). F1-score is preferable to accuracy for class-imbalanced datasets
Utimately, the F1-score will decide which model is better. We will be inspecting the precision and recall to get a better understanding of the results.

As for the models, we will be using
1. **Random Forest Classifier**: A versatile ensemble method that offers good accuracy and interpretability, making it suitable for various machine learning tasks.
2. **XGBoost**: A powerful and efficient algorithm that excels in handling complex datasets and provides robust predictive performance.

Bagging and Boosting are ensemble methods that combine multiple models to improve predictive performance.
Both techniques aim to reduce either variance or bias, depending on the specific algorithm and its hyperparameters. 
* **Bagging** trains multiple models independently on different subsets of the training data and combines their predictions. This reduces variance and helps prevent overfitting. Random Forest is a popular example of bagging.   
* **Boosting** trains models sequentially, with each model focusing on correcting the errors of the previous one. This reduces bias and improves accuracy. XGBoost and AdaBoost are well-known boosting algorithms.

#### Random Forest Classifier

Let us first define the functions

In [17]:
def train_random_forest_model(X_train, y_train, model_params):
  """
  Trains a Random Forest Classifier model on the given training data and hyperparameters.

  Args:
    X_train: The training features.
    y_train: The training labels.
    model_params: A dictionary of hyperparameters for the Random Forest Classifier.

  Returns:
    The trained Random Forest Classifier model.
  """

  random_forest_model = RandomForestClassifier(**model_params)
  random_forest_model.fit(X_train, y_train)
  return random_forest_model

def make_y_predictions(model, X_test):
  """
  Makes predictions on the given test data using the trained Random Forest model.

  Args:
    model: The trained Random Forest Classifier model.
    X_test: The test features.

  Returns:
    The predicted labels for the test data.
  """

  y_predictions = model.predict(X_test)
  return y_predictions

In [18]:
def evaluate_model(model, X_test, y_test):
  """
  Evaluates the model's performance using F1-score and SHAP values.

  Args:
    model: The trained model.
    X_test: The test features.
    y_test: The true labels for the test data.

  Returns:
    None
  """

  y_pred = make_y_predictions(model, X_test)

  # Calculate F1-score
  f1 = f1_score(y_test, y_pred)
  print("F1-Score:", f1)

  # Classification report
  print(classification_report(y_test, y_pred))
  
  # Calculate SHAP values
  explainer = shap.TreeExplainer(model)
  shap_values = explainer.shap_values(X_test)

  # Visualize SHAP values
  shap.summary_plot(shap_values, X_test)

In [20]:
X_train, X_test, y_train, y_test = split_data(df)
X_train_resampled, y_train_resampled = apply_smote(X_train, y_train)

Bankrupt?
0    0.967736
1    0.032264
Name: proportion, dtype: float64
Bankrupt?
0    0.967742
1    0.032258
Name: proportion, dtype: float64


In [21]:
rf_model = train_random_forest_model(X_train_resampled, y_train_resampled, model_params={'n_estimators':100, 'random_state':42})

In [22]:
y_pred = make_y_predictions(rf_model, X_test)

In [25]:
evaluate_model(rf_model, X_test, y_test)

F1-Score: 0.416
              precision    recall  f1-score   support

           0       0.98      0.97      0.98      1650
           1       0.37      0.47      0.42        55

    accuracy                           0.96      1705
   macro avg       0.68      0.72      0.70      1705
weighted avg       0.96      0.96      0.96      1705



AttributeError: module 'shapely' has no attribute 'TreeExplainer'

In [24]:
print(shap.__version__)

2.0.6


#### XGBoost

### In Conclusion

When I searched for a good F1-Score, it said that ypically, an F1 score > 0.9 is considered excellent. A score between 0.8 and 0.9 is considered good, while a score between 0.5 to 0.8 is considered average. If the F1 score falls below 0.5, then the model is considered to have a poor performance.
In conclusion, all the models have poor performance as they are all below 0.5 ...