# TensorFlow Bankruptcy Classification

This project combines scikit-learn's data preprocessing and feature selection tools with a TensorFlow neural network to classify companies as bankrupt or not bankrupt from a set of financial features.

Packages used:
* Pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
* Numpy: https://numpy.org/doc/ 
* Tensorflow: https://www.tensorflow.org/
* Sklearn: https://scikit-learn.org/stable/index.html 

The dataset can be found on Kaggle.

Link: https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE




## Data Preprocessing

In [2]:
# Use pandas to read the csv file
data = pd.read_csv('bankruptcy.csv')

In [3]:
# Print the shape of the data
data.shape

(6819, 96)

In [4]:
# Count the number of missing values in each column and sort them
data.isnull().sum().sort_values(ascending=False)

Bankrupt?                                                   0
 ROA(C) before interest and depreciation before interest    0
 Total expense/Assets                                       0
 Total income/Total expense                                 0
 Retained Earnings to Total Assets                          0
                                                           ..
 Total Asset Growth Rate                                    0
 Continuous Net Profit Growth Rate                          0
 Regular Net Profit Growth Rate                             0
 After-tax Net Profit Growth Rate                           0
 Equity to Liability                                        0
Length: 96, dtype: int64

In [5]:
# Since no null values are present, we can move on to the next step
# Check the data types of each column
# Add the columns that are not numeric to a list
non_numeric_columns = []
for column in data.columns:
    if data[column].dtype != 'float64' and data[column].dtype != 'int64':
        non_numeric_columns.append(column)

# Print the number of non-numeric columns
len(non_numeric_columns)

0
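
The same check can be written without a loop. The sketch below is illustrative only: the `toy` frame and its column names are made up, and `select_dtypes` stands in for the explicit `float64`/`int64` comparison above. One difference worth knowing is that `select_dtypes(exclude="number")` treats every numeric dtype (for example `float32`) as numeric, whereas the loop would flag such columns.

```python
import pandas as pd

# Hypothetical stand-in for the bankruptcy data: two numeric columns and one text column
toy = pd.DataFrame({
    "ratio": [0.1, 0.2, 0.3],
    "flag": [1, 0, 1],
    "label": ["a", "b", "c"],
})

# Keep only the columns whose dtype is not numeric
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```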

In [6]:
# Since there are no non-numeric columns, we can move on to the next step
# Create an array for the variance of each column
var_values = []

# Loop through each column and append the variance to the array
for column in data.columns:
    var_values.append(data[column].var())

# Convert the array to a numpy array and calculate the mean
var_values_numpy = np.array(var_values)
var_values_mean = var_values_numpy.mean()
var_values_mean

7.47644400602246e+17

In [7]:
# Create an empty list to store the columns whose variance exceeds the mean variance
columns_to_scale = []

# Loop through each column and compare its variance with the mean variance across all columns
for column in data.columns:
    if data[column].var() > var_values_mean:
        # Append the column to the list
        columns_to_scale.append(column)

In [8]:
# Print the columns whose variance exceeds the mean variance
columns_to_scale

[' Operating Expense Rate',
 ' Research and development expense rate',
 ' Total Asset Growth Rate',
 ' Inventory Turnover Rate (times)',
 ' Fixed Assets Turnover Frequency',
 ' Current Asset Turnover Rate',
 ' Quick Asset Turnover Rate',
 ' Cash Turnover Rate']
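
The two loops above (collect variances, then compare each column with their mean) can be expressed in a few vectorized lines. This is a minimal sketch on a made-up `toy` frame, not the notebook's data; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: two small-scale columns and one column on a very large scale
toy = pd.DataFrame({
    "small": [0.1, 0.2, 0.3, 0.4],
    "medium": [1.0, 2.0, 3.0, 4.0],
    "huge": [0.0, 1e9, 2e9, 3e9],
})

variances = toy.var()              # sample variance of every column
mean_variance = variances.mean()   # same threshold as in the cells above

# Boolean indexing keeps the columns whose variance exceeds the mean variance
columns_to_scale = variances[variances > mean_variance].index.tolist()
print(columns_to_scale)
```

Because a handful of very large variances dominate the mean, this threshold only picks out the most extreme columns; it is a heuristic rather than a general rule for deciding what to scale.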

In [9]:
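# Initialize the StandardScaler from the sklearn library
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so the test rows contribute to the scaling statistics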
scaler = StandardScaler()

# Loop through each column and scale the data
for column in columns_to_scale:
    data[column] = scaler.fit_transform(data[[column]])
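
A common alternative is to split first and fit the scaler on the training rows only, so that no test-set statistics leak into preprocessing. The sketch below uses synthetic data; `X_toy`, `y_toy`, and the other names are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic large-scale features and binary labels, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=1e9, scale=5e8, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

`StandardScaler` also accepts several columns at once, so a per-column loop is not required.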

In [10]:
# Compute the pairwise correlation matrix
data.corr()

| | Bankrupt? | ROA(C) before interest and depreciation before interest | ROA(A) before interest and % after tax | ROA(B) before interest and depreciation after tax | Operating Gross Margin | Realized Sales Gross Margin | Operating Profit Rate | Pre-tax net Interest Rate | After-tax net Interest Rate | Non-industry income and expenditure/revenue | ... | Net Income to Total Assets | Total assets to GNP price | No-credit Interval | Gross Profit to Sales | Net Income to Stockholder's Equity | Liability to Equity | Degree of Financial Leverage (DFL) | Interest Coverage Ratio (Interest expense to EBIT) | Net Income Flag | Equity to Liability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bankrupt? | 1.000000 | -0.260807 | -0.282941 | -0.273051 | -0.100043 | -0.099445 | -0.000230 | -0.008517 | -0.008857 | -0.016593 | ... | -0.315457 | 0.035104 | -0.005547 | -0.100044 | -0.180987 | 0.166812 | 0.010508 | -0.005509 | NaN | -0.083048 |
| ROA(C) before interest and depreciation before interest | -0.260807 | 1.000000 | 0.940124 | 0.986849 | 0.334719 | 0.332755 | 0.035725 | 0.053419 | 0.049222 | 0.020501 | ... | 0.887670 | -0.071725 | 0.008135 | 0.334721 | 0.274287 | -0.143629 | -0.016575 | 0.010573 | NaN | 0.052416 |
| ROA(A) before interest and % after tax | -0.282941 | 0.940124 | 1.000000 | 0.955741 | 0.326969 | 0.324956 | 0.032053 | 0.053518 | 0.049474 | 0.029649 | ... | 0.961552 | -0.098900 | 0.011463 | 0.326971 | 0.291744 | -0.141039 | -0.011515 | 0.013372 | NaN | 0.057887 |
| ROA(B) before interest and depreciation after tax | -0.273051 | 0.986849 | 0.955741 | 1.000000 | 0.333749 | 0.331755 | 0.035212 | 0.053726 | 0.049952 | 0.022366 | ... | 0.912040 | -0.089088 | 0.007523 | 0.333750 | 0.280617 | -0.142838 | -0.014663 | 0.011473 | NaN | 0.056430 |
| Operating Gross Margin | -0.100043 | 0.334719 | 0.326969 | 0.333749 | 1.000000 | 0.999518 | 0.005745 | 0.032493 | 0.027175 | 0.051438 | ... | 0.300143 | 0.022672 | 0.004205 | 1.000000 | 0.075304 | -0.085434 | -0.011806 | -0.001167 | NaN | 0.120029 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Liability to Equity | 0.166812 | -0.143629 | -0.141039 | -0.142838 | -0.085434 | -0.085407 | 0.001541 | -0.004043 | -0.004390 | -0.011899 | ... | -0.159697 | 0.021982 | -0.003724 | -0.085434 | -0.791836 | 1.000000 | 0.002119 | 0.001487 | NaN | -0.159654 |
| Degree of Financial Leverage (DFL) | 0.010508 | -0.016575 | -0.011515 | -0.014663 | -0.011806 | -0.011268 | 0.000935 | 0.000855 | 0.000927 | -0.000556 | ... | -0.010463 | -0.001881 | -0.008812 | -0.011806 | -0.000093 | 0.002119 | 1.000000 | 0.016513 | NaN | -0.016739 |
| Interest Coverage Ratio (Interest expense to EBIT) | -0.005509 | 0.010573 | 0.013372 | 0.011473 | -0.001167 | -0.001158 | 0.000393 | 0.000984 | 0.000957 | 0.001024 | ... | 0.012746 | 0.000239 | 0.001027 | -0.001169 | 0.005147 | 0.001487 | 0.016513 | 1.000000 | NaN | -0.008339 |
| Net Income Flag | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
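
The empty `Net Income Flag` row and column are consistent with what `DataFrame.corr()` returns for a column that never varies: the correlation is undefined, so pandas reports NaN. The sketch below reproduces this on a made-up frame (`toy` and its column names are hypothetical) and shows one way to find and drop such columns.

```python
import pandas as pd

# Hypothetical frame with one column that has the same value in every row
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "constant_flag": [1, 1, 1, 1],
})

corr = toy.corr()
print(corr["constant_flag"].isna().all())  # True: zero variance gives undefined correlations

# Columns with a single distinct value carry no information for a classifier
constant_columns = [c for c in toy.columns if toy[c].nunique() == 1]
reduced = toy.drop(columns=constant_columns)
print(list(reduced.columns))
```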


In [11]:
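# Drop rows with missing values
# This is a no-op here: the dataset has no nulls, and the NaNs shown above
# belong to the correlation matrix, not to data itself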
data.dropna(inplace=True)

## Split Training and Test Data

In [12]:
# Split the data into X and y, predicting the "Bankrupt?" column
X = data.drop("Bankrupt?", axis=1)
y = data["Bankrupt?"]

In [13]:
# Call the train_test_split function from the sklearn library
# Set the test size to 0.2 and the random state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
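
If the target classes are imbalanced, which is common for bankruptcy labels, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. This is a sketch on synthetic labels; `X_toy`, `y_toy`, and the 5% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 10 positives out of 200 rows (5%)
y_toy = np.array([1] * 10 + [0] * 190)
X_toy = np.arange(200).reshape(-1, 1)

# stratify=y_toy preserves the 5% positive rate in both the training and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```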

## Feature Selection

In [14]:
# Initialize the Random Forest Classifier model
# A fitted Random Forest exposes impurity-based feature importances, which we use to rank the features
model_rf = RandomForestClassifier()

# Fit the model on the training data
model_rf.fit(X_train, y_train)

# Get the feature importances from the model
feature_importances = model_rf.feature_importances_

# Sort features by importance
sorted_indices = np.argsort(feature_importances)[::-1]

# Get the column names ordered from most to least important
ranked_columns = X_train.columns[sorted_indices]
print(f"Top features: {ranked_columns}")

Top features: Index([' Net Value Growth Rate', ' Net Income to Stockholder's Equity',
       ' Persistent EPS in the Last Four Seasons',
       ' Net profit before tax/Paid-in capital', ' Net Value Per Share (C)',
       ' Net Income to Total Assets', ' Degree of Financial Leverage (DFL)',
       ' Borrowing dependency', ' Interest Expense Ratio',
       ' Net Value Per Share (A)', ' Net worth/Assets', ' Cash/Total Assets',
       ' Interest Coverage Ratio (Interest expense to EBIT)',
       ' Total debt/Total net worth', ' Interest-bearing debt interest rate',
       ' Working Capital/Equity', ' Equity to Liability',
       ' Cash/Current Liability',
       ' Non-industry income and expenditure/revenue',
       ' ROA(C) before interest and depreciation before interest',
       ' Working Capital to Total Assets', ' Net Value Per Share (B)',
       ' Inventory/Working Capital', ' Liability to Equity',
       ' Per Share Net profit before tax (Yuan ¥)', ' Debt ratio %',
       ' Total as
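
A labeled `pandas.Series` makes the ranking easier to read than a bare index, because each feature name stays next to its score. The sketch below uses synthetic data from `make_classification`; the feature names `f0` to `f5` and the fixed `random_state` are assumptions for illustration. Fixing `random_state` on the forest also makes the ranking repeatable between runs, which the unseeded model above does not guarantee.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data with hypothetical feature names
X_arr, y_arr = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_arr)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(3))
```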

In [15]:
# Isolate the top 10 features
ranked_columns[0:10]

Index([' Net Value Growth Rate', ' Net Income to Stockholder's Equity',
       ' Persistent EPS in the Last Four Seasons',
       ' Net profit before tax/Paid-in capital', ' Net Value Per Share (C)',
       ' Net Income to Total Assets', ' Degree of Financial Leverage (DFL)',
       ' Borrowing dependency', ' Interest Expense Ratio',
       ' Net Value Per Share (A)'],
      dtype='object')

In [16]:
# Initialize the Logistic Regression model
model = LogisticRegression()

# Initialize the Recursive Feature Elimination (RFE) 
# Pass the model and the number of features to select (10)
rfe = RFE(model, n_features_to_select=10)

# Fit the RFE on the training data
rfe.fit(X_train, y_train)

# Get the ranking of features and the selected features
ranking = rfe.ranking_
selected_features = rfe.support_

# Assign selected_features to an array
selected_features = X_train.columns[selected_features]
print(f"Ranking of features: {ranking}")
print(f"Selected features: {selected_features}")

Ranking of features: [34 30 32 25 27  6 14 13 47 15 77 76 35  1 61 54 56 55 52 45  1 63 57 74
  9 20 19 53 69  1 50 40  1 82 23  3 65  8 78 41 80 64 58 38 59  1  1 79
 75 68  1 39  1 11 37 33 60  1 81 66 42 49  2 16 18 43  1  7 85 72 83 84
 29 73 21 51 17 44 62 22 36 28 46 70 86 12  4 24 26 10 48 71 31  5 67]
Selected features: Index([' Interest-bearing debt interest rate', ' Revenue Per Share (Yuan ¥)',
       ' Net Value Growth Rate', ' Current Ratio',
       ' Accounts Receivable Turnover', ' Average Collection Days',
       ' Revenue per person', ' Allocation rate per person',
       ' Quick Assets/Current Liability',
       ' Long-term Liability to Current Assets'],
      dtype='object')
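
To read the `ranking_` output: every selected feature has rank 1, and the remaining features are numbered in reverse order of elimination, so the highest rank was dropped first. The sketch below shows this on synthetic data. Standardizing the inputs and raising `max_iter` are assumptions added here, since RFE with logistic regression relies on coefficient magnitudes, which depend on feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, of which 3 are informative
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Put all features on the same scale so that coefficients are comparable
X_scaled = StandardScaler().fit_transform(X_arr)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_scaled, y_arr)

print(selector.support_)   # boolean mask of the 3 kept features
print(selector.ranking_)   # 1 for kept features, 2..6 for the eliminated ones
```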


In [17]:
# Collect the RFE-selected features that also appear in the Random Forest ranking
# Note: ranked_columns contains every feature, not only the top 10, so all 10 RFE features match;
# compare against ranked_columns[0:10] to restrict the check to the Random Forest top 10

# Create an empty list to store the matching features
any_equal = []

# Loop through the selected features and keep those found in ranked_columns
for feature in selected_features: 
    for column in ranked_columns:
        if feature == column:
            any_equal.append(feature)

# Print the matching features
any_equal

[' Interest-bearing debt interest rate',
 ' Revenue Per Share (Yuan ¥)',
 ' Net Value Growth Rate',
 ' Current Ratio',
 ' Accounts Receivable Turnover',
 ' Average Collection Days',
 ' Revenue per person',
 ' Allocation rate per person',
 ' Quick Assets/Current Liability',
 ' Long-term Liability to Current Assets']
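
For comparison, the overlap between the Random Forest top 10 and the RFE selection can be computed directly with `Index.intersection`. The two lists below are copied from the outputs shown earlier; the variable names `top_10_rf`, `rfe_selected`, and `common` are introduced here for illustration.

```python
import pandas as pd

# Top 10 features from the Random Forest ranking (copied from the output above)
top_10_rf = pd.Index([
    " Net Value Growth Rate", " Net Income to Stockholder's Equity",
    " Persistent EPS in the Last Four Seasons",
    " Net profit before tax/Paid-in capital", " Net Value Per Share (C)",
    " Net Income to Total Assets", " Degree of Financial Leverage (DFL)",
    " Borrowing dependency", " Interest Expense Ratio",
    " Net Value Per Share (A)",
])

# Features selected by RFE (copied from the output above)
rfe_selected = pd.Index([
    " Interest-bearing debt interest rate", " Revenue Per Share (Yuan ¥)",
    " Net Value Growth Rate", " Current Ratio",
    " Accounts Receivable Turnover", " Average Collection Days",
    " Revenue per person", " Allocation rate per person",
    " Quick Assets/Current Liability",
    " Long-term Liability to Current Assets",
])

# Features that both methods agree on
common = rfe_selected.intersection(top_10_rf)
print(list(common))  # [' Net Value Growth Rate']
```

Only one feature appears in both lists, so the ten features used downstream are effectively the RFE selection.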

In [18]:
# Redeclare X_train and X_test with the selected features
X_train = X_train[any_equal]
X_test = X_test[any_equal]

In [19]:
# Convert the data to float32 for TensorFlow model
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)

In [20]:
# Print the shape of the training data
X_train.shape

(5455, 10)

In [21]:
# Print the shape of the testing data
X_test.shape

(1364, 10)

## The TensorFlow Model

In [22]:
# Define the SimpleModel class
class SimpleModel(tf.Module):
    def __init__(self):
        # Define weights and biases as tf.Variable objects
        # Each one is initialized with random values drawn from a standard normal distribution
        # The input dimension is 10 because we have selected 10 features
        self.W_hidden = tf.Variable(tf.random.normal([10, 10], dtype=tf.float32))  
        self.b_hidden = tf.Variable(tf.random.normal([10], dtype=tf.float32))
        self.W_output = tf.Variable(tf.random.normal([10, 1], dtype=tf.float32))
        self.b_output = tf.Variable(tf.random.normal([1], dtype=tf.float32))

    def __call__(self, X):
        # Define the model's forward pass
        # Define the hidden layer with ReLU activation function
        hidden_layer = tf.nn.relu(tf.matmul(X, self.W_hidden) + self.b_hidden)

        # Define the output layer with sigmoid activation function
        logits = tf.matmul(hidden_layer, self.W_output) + self.b_output
        return tf.sigmoid(logits)

# Initialize the model
model = SimpleModel()
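
To make the tensor shapes in the forward pass concrete, here is the same computation in plain NumPy. It is a sketch with random weights and a made-up batch of 4 rows, not the trained model; the `sigmoid` helper is defined here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 10))          # batch of 4 rows with 10 features each
W_hidden = rng.normal(size=(10, 10))  # 10 inputs -> 10 hidden units
b_hidden = rng.normal(size=(10,))
W_output = rng.normal(size=(10, 1))   # 10 hidden units -> 1 output
b_output = rng.normal(size=(1,))

hidden = np.maximum(X @ W_hidden + b_hidden, 0.0)  # ReLU activation
logits = hidden @ W_output + b_output              # raw scores before the sigmoid
probs = sigmoid(logits)                            # values between 0 and 1

print(hidden.shape, logits.shape, probs.shape)  # (4, 10) (4, 1) (4, 1)
```

Note that the output has shape `(n, 1)`, not `(n,)`, which matters whenever it is compared with a one-dimensional label vector.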


In [23]:
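# Define the loss function with the BinaryCrossentropy class
# It measures the cross-entropy between the true labels and the predictions
# Note: from_logits=True expects raw logits, but SimpleModel.__call__ already returns
# sigmoid probabilities, so the sigmoid is applied twice; from_logits=False matches this model
loss_fn = tf.losses.BinaryCrossentropy(from_logits=True)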

# Define the optimizer with the Adam optimizer
# Set the learning rate to 0.001
# This method optimizes the weights and biases of the model
optimizer = tf.optimizers.Adam(learning_rate=0.001)
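
`SimpleModel.__call__` returns `tf.sigmoid(logits)`, while the loss is created with `from_logits=True`, which applies a sigmoid of its own. The NumPy sketch below illustrates what that combination implies; the helper functions are written here for illustration and are not TensorFlow's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y_true, logits):
    """Binary cross-entropy that first converts logits to probabilities."""
    p = sigmoid(logits)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# The model's outputs are already probabilities, so they lie between 0 and 1
model_outputs = np.array([0.0, 0.5, 1.0])

# Treating them as logits squashes them again into a narrow band
squashed = sigmoid(model_outputs)
print(squashed)  # roughly [0.5, 0.62, 0.73]

# Best case for a negative example: the model outputs exactly 0
loss_floor = bce_from_logits(0.0, 0.0)
print(loss_floor)  # ln(2), about 0.693
```

With this setup the loss on a non-bankrupt example cannot drop below about 0.693, so a training loss that levels off slightly above 0.7 is what one would expect. Either returning raw logits from the model or setting `from_logits=False` removes the double sigmoid.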

In [24]:
# Set the number of epochs and batch size
num_epochs = 10
batch_size = 32

# Define the training step using the @tf.function decorator
# This decorator compiles the function into a callable TensorFlow graph
@tf.function
def train_step(X, y):
    # tf.GradientTape() records the operations for automatic differentiation
    with tf.GradientTape() as tape:
        predictions = model(X)
        loss = loss_fn(y, predictions)
    
    # Calculate the gradients of the loss with respect to the model's weights and biases
    gradients = tape.gradient(loss, model.trainable_variables)

    # Update the weights and biases of the model using the optimizer
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Return the loss
    return loss

# Train the model for the specified number of epochs
for epoch in range(num_epochs):
    # Number of full batches per epoch; any leftover rows beyond the last full batch are skipped
    num_batches = len(X_train) // batch_size

    # Loop through the batches
    for i in range(num_batches):
        # Get the batch data by slicing the training data
        batch_X = X_train[i*batch_size:(i+1)*batch_size]
        batch_y = y_train[i*batch_size:(i+1)*batch_size]

        # Train the model on the batch data
        train_loss = train_step(batch_X, batch_y)
    # Print the loss of the last batch in each epoch to monitor the training progress
    print(f'Epoch {epoch+1}, Loss: {train_loss.numpy()}')

# Evaluate the model
predictions = model(X_test)

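# Calculate the accuracy of the model
# Note: predictions has shape (n, 1) and y_test has shape (n,), so tf.equal broadcasts to an
# (n, n) comparison; flatten the predictions first for an element-wise accuracy
test_accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(predictions), y_test), tf.float32))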

# Print the test accuracy
print(f'Test Accuracy: {test_accuracy.numpy()}')



Epoch 1, Loss: 0.7898292541503906
Epoch 2, Loss: 0.7677809000015259
Epoch 3, Loss: 0.7412973046302795
Epoch 4, Loss: 0.7379357814788818
Epoch 5, Loss: 0.7360773086547852
Epoch 6, Loss: 0.7349441051483154
Epoch 7, Loss: 0.7342038750648499
Epoch 8, Loss: 0.7336946129798889
Epoch 9, Loss: 0.7333298325538635
Epoch 10, Loss: 0.7330600619316101
Test Accuracy: 0.9415822625160217
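
The accuracy line compares `predictions`, which has shape `(n, 1)`, with `y_test`, which has shape `(n,)`. Broadcasting turns that into an `(n, n)` comparison of every prediction against every label. The NumPy sketch below, with four made-up predictions and labels, shows how the result differs from an element-wise comparison.

```python
import numpy as np

# Hypothetical model outputs with shape (4, 1) and labels with shape (4,)
toy_predictions = np.array([[0.9], [0.2], [0.1], [0.4]])
toy_labels = np.array([1, 0, 0, 1])

# Shape (4, 1) against shape (4,) broadcasts to (4, 4): every prediction vs every label
broadcast = np.round(toy_predictions) == toy_labels
print(broadcast.shape, broadcast.mean())      # (4, 4) 0.5

# Flattening the predictions first compares row i with label i only
elementwise = np.round(toy_predictions).ravel() == toy_labels
print(elementwise.shape, elementwise.mean())  # (4,) 0.75
```

In TensorFlow the equivalent fix would be to flatten the predictions, for example with `tf.squeeze(predictions, axis=1)`, before calling `tf.equal`. When the classes are imbalanced, it also helps to compare accuracy against the share of the majority class and to report precision and recall.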


## Save the Trained Model for Production Use

In [25]:
# Save the model
# Call it simple_model
model_path = "simple_model"

# Save the model using the tf.saved_model.save method
# Pass the model and the model path
tf.saved_model.save(model, model_path)

INFO:tensorflow:Assets written to: simple_model\assets
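
`tf.saved_model.save` stores a module's variables together with any methods wrapped in `tf.function`; a plain Python `__call__` is not exported, so the restored object would not be callable. The sketch below shows one way to make a module servable. `ServableModel`, its single-layer architecture, and the temporary export directory are assumptions for illustration and differ from `SimpleModel`.

```python
import tempfile

import numpy as np
import tensorflow as tf

class ServableModel(tf.Module):
    """Hypothetical single-layer model with an exportable __call__."""

    def __init__(self):
        super().__init__()
        self.W = tf.Variable(tf.random.normal([10, 1], dtype=tf.float32))
        self.b = tf.Variable(tf.zeros([1], dtype=tf.float32))

    # The input signature fixes the expected dtype and shape (any batch size, 10 features)
    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 10], dtype=tf.float32)])
    def __call__(self, X):
        return tf.sigmoid(tf.matmul(X, self.W) + self.b)

servable = ServableModel()

# Save to a temporary directory, then load the model back
export_dir = tempfile.mkdtemp()
tf.saved_model.save(servable, export_dir)
restored = tf.saved_model.load(export_dir)

# The restored object can be called because __call__ was traced and saved
batch = tf.constant(np.zeros((3, 10), dtype=np.float32))
print(restored(batch).shape)
```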
