# Preprocessing Data

In this notebook, we preprocess the data before running it through our models. This includes standardizing the scale of the features and dealing with the imbalanced classes in the dataset. There are no categorical variables, so no encoding is needed before using any model. For the imbalanced classes, we will use the Synthetic Minority Oversampling Technique (SMOTE).

## Imports

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [57]:
df = pd.read_csv('../data/nooutlier_EDA.csv', index_col=False)

In [58]:
df.shape

(4909, 95)

Many of the features do not appear to be normally distributed, so we will use a standard scaler to remove the mean and scale each feature to unit variance.

In [59]:
scaler = StandardScaler()
# Displays the scaled values only; the result is not assigned, so df itself is unchanged
scaler.fit_transform(df)

array([[ 6.34445434e+00, -2.85284146e+00, -2.80904766e+00, ...,
        -2.24001963e-01, -5.13182376e-01, -9.99716462e-01],
       [ 6.34445434e+00, -8.74521323e-01, -4.68848059e-01, ...,
        -1.11848780e+00,  3.79744151e+00, -2.38660582e-01],
       [-1.57617968e-01, -2.42694601e+00, -2.37165854e+00, ...,
        -2.13518808e-01, -4.57277470e-01, -1.03195148e+00],
       ...,
       [-1.57617968e-01, -2.73978090e-01, -4.47593856e-01, ...,
         3.17384962e-04,  2.85509237e-01, -4.62720982e-01],
       [-1.57617968e-01, -6.62084261e-01, -4.72203986e-01, ...,
        -2.66279459e-02,  2.20286332e-01, -1.26891909e-01],
       [-1.57617968e-01, -7.13150863e-01, -5.65051293e-01, ...,
        -1.31820904e-01, -1.00146484e-01,  2.24832621e+00]])

In [60]:
X = df.drop('Bankrupt?', axis=1)
y = df['Bankrupt?']

## Summarize Current Class Instances

In [61]:
y.value_counts()

0    4790
1     119
Name: Bankrupt?, dtype: int64

There are 4790 companies that did not go bankrupt and 119 that did.
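
To make the degree of imbalance concrete, the counts above can be turned into class shares and a majority-to-minority ratio. This is a small sketch that re-enters the two reported counts by hand rather than reading them from `y`; the variable names are illustrative.

```python
import pandas as pd

# Class counts as reported by y.value_counts() above.
counts = pd.Series({0: 4790, 1: 119}, name="Bankrupt?")

# Share of each class in the full dataset.
shares = counts / counts.sum()

# Number of non-bankrupt companies per bankrupt company.
imbalance_ratio = counts[0] / counts[1]

print(shares.round(4).to_dict())
print(round(imbalance_ratio, 1))
```

On the real target, `y.value_counts(normalize=True)` gives the same shares directly.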

## Oversample the Minority Class

In [62]:
oversample = SMOTE()
X_smote, y_smote = oversample.fit_resample(X,y)

In [63]:
y_smote.value_counts()

0    4790
1    4790
Name: Bankrupt?, dtype: int64
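
SMOTE balances the classes by creating synthetic minority samples: each new point is interpolated between an existing minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that interpolation step only. It is not the `imblearn` implementation, and the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a minority sample and a near neighbour."""
    rng = np.random.default_rng() if rng is None else rng

    # Pick a random minority sample as the base point.
    i = rng.integers(len(minority))
    base = minority[i]

    # Euclidean distance from the base point to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)

    # Indices of the k nearest minority neighbours, excluding the base itself.
    order = np.argsort(dists)
    neighbours = order[order != i][:k]
    neighbour = minority[rng.choice(neighbours)]

    # Move a random fraction of the way from the base towards the neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)

# Hypothetical minority-class points in two dimensions.
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_points, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region the minority class already occupies.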

## Create Train/Test Splits for Unbalanced and Balanced Classes

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

In [65]:
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_smote, y_smote, test_size=.33)