## Data Upsampling

In this notebook, we performed upsampling on our data to resolve the imbalance in bankruptcy class distribution (only 3.23% of companies are bankrupt), as analysed in the data_visualisations notebook.

1. **Upsampling of Data**
    
    Identifying the minority class as the rows with Bankrupt? = 1, we used the resample() function from sklearn.utils to sample that class with replacement until it matched the size of the majority class. This gave a new dataframe with equal numbers of bankrupt and non-bankrupt entries.
     
2. **Evaluating Upsampled Data**
    
   To check that our upsampled data is representative of the original dataset, we compared the relationships between variables in both datasets, specifically the 10 variables most strongly correlated with bankruptcy.

   9 of the top 10 variables appear in both lists, although in a different order. Retained Earnings to Total Assets appears only in the original top 10, and Net Value Per Share (B) only in the upsampled top 10. Since the top correlated features are largely the same but ordered differently, we infer that upsampling shifted the distribution of the feature values slightly, while the overall relationship between the features and the target variable "Bankrupt?" remains the same.
   
   Furthermore, within each dataset the correlation values of the top 10 variables are close to one another (roughly 0.20 to 0.32 in the original data and 0.50 to 0.58 after upsampling), so the difference in feature order may simply reflect random variation in the correlation values.

   Therefore, we concluded that our upsampled data is representative of the original dataset and can be used to train our machine learning models.


3. **Generating Upsampled Dataset**

    After verifying that the upsampled data is representative of the original dataset, we generated the upsampled_bankruptcy.csv dataset to use for our machine learning models.
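The suggestion in step 2 that the change in feature order may be random variation can be probed by repeating the upsampling with several seeds and checking whether the ranking moves. The sketch below does this on synthetic stand-in data, not on bankruptcy.csv. The helper name `ranking_after_upsampling`, the columns `f1` to `f3` and their effect sizes are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample


def ranking_after_upsampling(df, target, seed):
    """Upsample the rarer class with the given seed, then rank the
    features by absolute correlation with the target (hypothetical helper)."""
    counts = df[target].value_counts()
    minority = df[df[target] == counts.idxmin()]
    majority = df[df[target] == counts.idxmax()]
    upsampled = resample(minority, replace=True,
                         n_samples=len(majority), random_state=seed)
    balanced = pd.concat([majority, upsampled])
    corr = balanced.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).index.tolist()


# Synthetic stand-in: about 5% positives and three features whose
# relationship to the target is deliberately similar in strength.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.05).astype(int)
toy = pd.DataFrame({"target": y})
for name, shift in [("f1", 1.0), ("f2", 0.95), ("f3", 0.9)]:
    toy[name] = rng.normal(size=n) + shift * y

# One ranking per seed; differences between them come only from resampling.
rankings = [ranking_after_upsampling(toy, "target", seed) for seed in range(5)]
for seed, ranking in enumerate(rankings):
    print(seed, ranking)
```

If the order changes from seed to seed when the correlations are close together, that is consistent with the ordering differences being resampling noise rather than a change in the underlying relationships.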
 

##### Importing necessary libraries

In [22]:
import pandas as pd
import matplotlib.pyplot as plt

#for upsampling
from sklearn.utils import resample

### 1. Upsampling Data

In [23]:
# Load the data from CSV file
df = pd.read_csv("bankruptcy.csv")
# df.head()

# Separate majority and minority classes
df_majority = df[df["Bankrupt?"] == 0]
df_minority = df[df["Bankrupt?"] == 1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=42)  # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled["Bankrupt?"].value_counts()

0    6599
1    6599
Name: Bankrupt?, dtype: int64
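
The cell above hard-codes which label is the minority. As a hedged alternative, the sketch below infers the rarer label from `value_counts()` so the two classes cannot be swapped by mistake. The function name `upsample_minority` and the toy dataframe are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df, target, random_state=42):
    """Resample the rarer class of a binary target with replacement
    until it matches the more common class (hypothetical helper)."""
    counts = df[target].value_counts()
    minority_label, majority_label = counts.idxmin(), counts.idxmax()
    if minority_label == majority_label:
        raise ValueError("classes are already balanced or only one class is present")

    df_majority = df[df[target] == majority_label]
    df_minority = df[df[target] == minority_label]
    df_minority_upsampled = resample(df_minority,
                                     replace=True,
                                     n_samples=len(df_majority),
                                     random_state=random_state)
    return pd.concat([df_majority, df_minority_upsampled])


# Toy data: 8 non-bankrupt rows and 2 bankrupt rows.
toy = pd.DataFrame({"x": range(10), "Bankrupt?": [0] * 8 + [1] * 2})
balanced = upsample_minority(toy, "Bankrupt?")
print(balanced["Bankrupt?"].value_counts())
```

On the toy data both classes end up with 8 rows, and the 2 original minority rows are the only ones that can be repeated.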

### 2. Evaluating Upsampled Data

##### Top 10 Variables from original dataset

In [31]:
# Absolute correlation of every column with the target, strongest first
corr_matrix = df.corr()
num_features = 10
corr_with_bankrupt = corr_matrix["Bankrupt?"].abs().sort_values(ascending=False)

# Position 0 is "Bankrupt?" itself (correlation 1.0), so skip it
top_corr_features = corr_with_bankrupt.iloc[1:num_features+1].index.tolist()

print("Top 10 features with their correlation to bankruptcy:")
for i, feature in enumerate(top_corr_features):
    corr = corr_with_bankrupt[feature]
    print("{:<2}: {:<40}: {:.4f}".format(i+1, feature, corr))



Top 10 features with their correlation to bankruptcy:
1 :  Net Income to Total Assets             : 0.3155
2 :  ROA(A) before interest and % after tax : 0.2829
3 :  ROA(B) before interest and depreciation after tax: 0.2731
4 :  ROA(C) before interest and depreciation before interest: 0.2608
5 :  Net worth/Assets                       : 0.2502
6 :  Debt ratio %                           : 0.2502
7 :  Persistent EPS in the Last Four Seasons: 0.2196
8 :  Retained Earnings to Total Assets      : 0.2178
9 :  Net profit before tax/Paid-in capital  : 0.2079
10:  Per Share Net profit before tax (Yuan ¥): 0.2014


##### Top 10 Variables from upsampled dataset

In [32]:
# Same ranking as above, computed on the upsampled data
corr_matrix = df_upsampled.corr()
num_features = 10
corr_with_bankrupt = corr_matrix["Bankrupt?"].abs().sort_values(ascending=False)

# Position 0 is "Bankrupt?" itself (correlation 1.0), so skip it
top_corr_features = corr_with_bankrupt.iloc[1:num_features+1].index.tolist()

# Upsampled data restricted to the top correlated features
df_top10_upsampled = df_upsampled[top_corr_features]

print("Top 10 features with their correlation to bankruptcy:")
for i, feature in enumerate(top_corr_features):
    corr = corr_with_bankrupt[feature]
    print("{:<2}: {:<40}: {:.4f}".format(i+1, feature, corr))


Top 10 features with their correlation to bankruptcy:
1 :  Debt ratio %                           : 0.5803
2 :  Net worth/Assets                       : 0.5803
3 :  Persistent EPS in the Last Four Seasons: 0.5461
4 :  ROA(C) before interest and depreciation before interest: 0.5364
5 :  Net profit before tax/Paid-in capital  : 0.5327
6 :  Per Share Net profit before tax (Yuan ¥): 0.5276
7 :  ROA(B) before interest and depreciation after tax: 0.5271
8 :  ROA(A) before interest and % after tax : 0.5156
9 :  Net Value Per Share (B)                : 0.4986
10:  Net Income to Total Assets             : 0.4984
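
To make the comparison of the two lists explicit, the sketch below copies the feature names from the two printed outputs above (leading spaces removed) and computes the overlap and the change in rank. The variable names are assumptions for illustration. In the notebook the lists could be taken from `top_corr_features` in each cell instead of being typed out.

```python
# Feature names copied from the two outputs above, in printed order.
original_top10 = [
    "Net Income to Total Assets",
    "ROA(A) before interest and % after tax",
    "ROA(B) before interest and depreciation after tax",
    "ROA(C) before interest and depreciation before interest",
    "Net worth/Assets",
    "Debt ratio %",
    "Persistent EPS in the Last Four Seasons",
    "Retained Earnings to Total Assets",
    "Net profit before tax/Paid-in capital",
    "Per Share Net profit before tax (Yuan ¥)",
]
upsampled_top10 = [
    "Debt ratio %",
    "Net worth/Assets",
    "Persistent EPS in the Last Four Seasons",
    "ROA(C) before interest and depreciation before interest",
    "Net profit before tax/Paid-in capital",
    "Per Share Net profit before tax (Yuan ¥)",
    "ROA(B) before interest and depreciation after tax",
    "ROA(A) before interest and % after tax",
    "Net Value Per Share (B)",
    "Net Income to Total Assets",
]

# Features present in both lists, and those unique to each list.
shared = [f for f in original_top10 if f in upsampled_top10]
only_original = [f for f in original_top10 if f not in upsampled_top10]
only_upsampled = [f for f in upsampled_top10 if f not in original_top10]

# Positive shift = the feature moved down the ranking after upsampling.
rank_shift = {f: upsampled_top10.index(f) - original_top10.index(f) for f in shared}

print(len(shared), "shared features")
print("only in original: ", only_original)
print("only in upsampled:", only_upsampled)
for feature, shift in rank_shift.items():
    print("{:<60} {:+d}".format(feature, shift))
```

This reports 9 shared features, with Retained Earnings to Total Assets unique to the original list and Net Value Per Share (B) unique to the upsampled list.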


### 3. Generating Upsampled Data

In [26]:
df_upsampled.to_csv('upsampled_bankruptcy.csv', index=False)

In [27]:
df2 = pd.read_csv("upsampled_bankruptcy.csv")
df2["Bankrupt?"].value_counts()

0    6599
1    6599
Name: Bankrupt?, dtype: int64
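
Matching class counts show that the file was written with the expected number of rows per class. A slightly stronger check, sketched below under the assumption that a full comparison is wanted, reloads the CSV and compares it with the in-memory dataframe. The helper name `csv_roundtrip_matches` and the toy data are hypothetical. In the notebook, `df_upsampled` would take the place of `toy`.

```python
import os
import tempfile

import pandas as pd


def csv_roundtrip_matches(df):
    """Write df to a temporary CSV without its index, read it back,
    and report whether the contents match (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.csv")
        df.to_csv(path, index=False)
        reloaded = pd.read_csv(path)
    try:
        # The index is not saved, so compare against a freshly numbered copy.
        # Floats are compared with pandas' default tolerance.
        pd.testing.assert_frame_equal(df.reset_index(drop=True), reloaded)
    except AssertionError:
        return False
    return True


# Toy frame with a repeated index label, as happens after pd.concat.
toy = pd.DataFrame(
    {"Bankrupt?": [0, 0, 1, 1], "ratio": [0.25, 0.5, 0.75, 1.0]},
    index=[0, 1, 2, 2],
)
print(csv_roundtrip_matches(toy))
```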