## Data Upsampling

In this notebook, we upsampled our data to correct the imbalance in the bankruptcy class distribution (only 3.23% of companies are bankrupt), as analysed in the data_visualisations notebook.

1. **Upsampling of Data**
    
    Identifying the minority group as the companies with Bankrupt? = 1, we used the resample() function from sklearn.utils to upsample it with replacement, and obtained a new dataframe with an equal number of bankrupt and non-bankrupt entries.
     
2. **Evaluating Upsampled Data**
    
   To check that our upsampled data is still representative of the original dataset, we compared the 10 variables most strongly correlated with bankruptcy in each dataset.



3. **Generating Upsampled Dataset**

    We exported the upsampled dataframe to upsampled_bankruptcy.csv.

##### Importing necessary libraries

In [6]:
import pandas as pd
import matplotlib.pyplot as plt

#for upsampling
from sklearn.utils import resample

### 1. Upsampling Data

In [7]:
# Load the data from CSV file
df = pd.read_csv("bankruptcy.csv")
# df.head()

# Separate majority and minority classes
df_majority = df[df["Bankrupt?"] == 0]
df_minority = df[df["Bankrupt?"] == 1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=42)  # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled["Bankrupt?"].value_counts()

0    6599
1    6599
Name: Bankrupt?, dtype: int64
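
Because `resample(..., replace=True)` draws from the existing minority rows, the upsampled class consists only of copies of the original bankrupt companies; no new observations are created. The sketch below illustrates this on a hypothetical toy dataframe (the column `x` and the 95/5 class sizes are invented for illustration and are not taken from bankruptcy.csv).

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 95 non-bankrupt rows and 5 bankrupt rows
toy = pd.DataFrame({
    "Bankrupt?": [0] * 95 + [1] * 5,
    "x": range(100),  # a unique value per row, so duplicates are easy to spot
})

toy_majority = toy[toy["Bankrupt?"] == 0]
toy_minority = toy[toy["Bankrupt?"] == 1]

# Same call pattern as in the cell above
toy_minority_up = resample(toy_minority,
                           replace=True,
                           n_samples=len(toy_majority),
                           random_state=42)
toy_up = pd.concat([toy_majority, toy_minority_up])

# The two classes now have the same number of rows
counts = toy_up["Bankrupt?"].value_counts()
print(counts)

# The upsampled minority rows are all copies of the 5 original rows
n_unique_minority = toy_minority_up["x"].nunique()
print(n_unique_minority, "distinct minority rows among", len(toy_minority_up))
```

The same property holds for `df_minority_upsampled`: its 6599 rows are repeated copies of the bankrupt rows in the original data.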

### 2. Evaluating Upsampled Data

##### Top 10 Variables from original dataset

In [8]:
corr_matrix = df.corr()
num_features = 10
corr_with_bankrupt = corr_matrix["Bankrupt?"].abs().sort_values(ascending=False)
top_corr_features = corr_with_bankrupt[1:num_features+1].index.tolist()
top_corr_features

[' Net Income to Total Assets',
 ' ROA(A) before interest and % after tax',
 ' ROA(B) before interest and depreciation after tax',
 ' ROA(C) before interest and depreciation before interest',
 ' Net worth/Assets',
 ' Debt ratio %',
 ' Persistent EPS in the Last Four Seasons',
 ' Retained Earnings to Total Assets',
 ' Net profit before tax/Paid-in capital',
 ' Per Share Net profit before tax (Yuan ¥)']

##### Top 10 Variables from upsampled dataset

In [9]:
corr_matrix = df_upsampled.corr()
num_features = 10
corr_with_bankrupt = corr_matrix["Bankrupt?"].abs().sort_values(ascending=False)
top_corr_features = corr_with_bankrupt[1:num_features+1].index.tolist()
df_top10_upsampled = df_upsampled[top_corr_features]
top_corr_features


[' Debt ratio %',
 ' Net worth/Assets',
 ' Persistent EPS in the Last Four Seasons',
 ' ROA(C) before interest and depreciation before interest',
 ' Net profit before tax/Paid-in capital',
 ' Per Share Net profit before tax (Yuan ¥)',
 ' ROA(B) before interest and depreciation after tax',
 ' ROA(A) before interest and % after tax',
 ' Net Value Per Share (B)',
 ' Net Income to Total Assets']


The two lists share nine of their ten features, although in a different order: ' Retained Earnings to Total Assets' appears only in the original list, and ' Net Value Per Share (B)' only in the upsampled one. A change in order is expected. Upsampling replicates the bankrupt rows, which changes the class proportions and can therefore shift the magnitude of each feature's correlation with "Bankrupt?". Features whose correlations are close in magnitude can easily swap ranks as a result.
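
One way to make the comparison concrete is to compute the overlap between the two lists directly. The sketch below copies the two lists from the cell outputs above; the variable names `top10_original` and `top10_upsampled` are introduced here for illustration, since the notebook reuses `top_corr_features` for both.

```python
# Top-10 lists copied verbatim from the two cell outputs above
# (the leading space is part of each column name in the dataset)
top10_original = [
    ' Net Income to Total Assets',
    ' ROA(A) before interest and % after tax',
    ' ROA(B) before interest and depreciation after tax',
    ' ROA(C) before interest and depreciation before interest',
    ' Net worth/Assets',
    ' Debt ratio %',
    ' Persistent EPS in the Last Four Seasons',
    ' Retained Earnings to Total Assets',
    ' Net profit before tax/Paid-in capital',
    ' Per Share Net profit before tax (Yuan ¥)',
]
top10_upsampled = [
    ' Debt ratio %',
    ' Net worth/Assets',
    ' Persistent EPS in the Last Four Seasons',
    ' ROA(C) before interest and depreciation before interest',
    ' Net profit before tax/Paid-in capital',
    ' Per Share Net profit before tax (Yuan ¥)',
    ' ROA(B) before interest and depreciation after tax',
    ' ROA(A) before interest and % after tax',
    ' Net Value Per Share (B)',
    ' Net Income to Total Assets',
]

# Features present in both lists, and those unique to each
shared = set(top10_original) & set(top10_upsampled)
only_original = set(top10_original) - shared
only_upsampled = set(top10_upsampled) - shared

print(len(shared), "features appear in both lists")
print("Only in original: ", only_original)
print("Only in upsampled:", only_upsampled)
```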

The large overlap between the two lists is a good sign: it suggests that upsampling has not materially changed the underlying relationships between the features and the target variable. This check alone is not sufficient, however. Any model trained on the upsampled data should still be evaluated on new, unseen data to confirm that it generalises well.
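
When evaluating on unseen data, duplicated minority rows can end up in both the training and test sets if the split is made after upsampling. The sketch below shows one possible alternative ordering, which is not what this notebook does: hold out a test set first, then upsample only the training portion. The toy dataframe, the `feature` column, and the 80/20 split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical stand-in for the bankruptcy data: 950 non-bankrupt, 50 bankrupt
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Bankrupt?": [0] * 950 + [1] * 50,
    "feature": rng.normal(size=1000),
})

# 1. Hold out a test set first, keeping the class ratio in both parts
train, test = train_test_split(toy,
                               test_size=0.2,
                               stratify=toy["Bankrupt?"],
                               random_state=42)

# 2. Upsample the minority class in the training portion only
train_majority = train[train["Bankrupt?"] == 0]
train_minority = train[train["Bankrupt?"] == 1]
train_minority_up = resample(train_minority,
                             replace=True,
                             n_samples=len(train_majority),
                             random_state=42)
train_up = pd.concat([train_majority, train_minority_up])

# The training set is balanced; the test set keeps the original imbalance
train_counts = train_up["Bankrupt?"].value_counts()
test_counts = test["Bankrupt?"].value_counts()
print(train_counts)
print(test_counts)

# No original row appears in both the training and the test set
overlap = set(train_up.index) & set(test.index)
print(len(overlap), "rows shared between train and test")
```

With this ordering, the test set contains only rows the model has never seen in any form, so its scores are a fairer estimate of performance on new data.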

### 3. Generating Upsampled Data

In [10]:
df_upsampled.to_csv('upsampled_bankruptcy.csv', index=False)