# Data Augmentation

Some algorithms perform poorly when the data is imbalanced, that is, when the classes of the target variable are represented by unequal numbers of records. Two data augmentation (resampling) techniques address this limitation:
1. Oversampling
2. Undersampling

These techniques balance the classes by either undersampling the dominant (majority) class or oversampling the non-dominant (minority) class(es).
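
The two techniques can be illustrated with plain random resampling in pandas. This is a minimal sketch on a hypothetical toy data set (the `toy` frame and its `feature` column are assumptions made for illustration), not the method used later in this notebook:

```python
import pandas as pd

# Hypothetical toy data: 6 records of class 0 (dominant), 2 of class 1 (non-dominant).
toy = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8],
    "state": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["state"] == 0]
minority = toy[toy["state"] == 1]

# Undersampling: keep only as many dominant-class rows as there are minority rows.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

# Oversampling: draw minority rows with replacement until both classes match.
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

print(undersampled["state"].value_counts().to_dict())
print(oversampled["state"].value_counts().to_dict())
```

Undersampling shrinks the toy data to 2 rows per class, while oversampling grows it to 6 rows per class by repeating minority rows.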

In [1]:
# Import necessary packages
import pandas as pd
from imblearn.over_sampling import SMOTE

In [2]:
# Load the data
df = pd.read_csv('./../../data/engineered_data.csv')

## Oversampling

For the data under consideration, undersampling would discard records from the dominant class, and with them information that the models could learn from. Oversampling the non-dominant class is therefore preferred.
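
The trade-off can be made concrete with a little arithmetic. The sketch below uses the class counts that this notebook reports for the data set (10290 and 4572) and assumes both strategies balance the classes to an exact 1:1 ratio:

```python
# Class counts reported for this data set.
n_majority = 10290
n_minority = 4572

# Undersampling keeps n_minority rows from each class and discards the rest.
rows_after_undersampling = 2 * n_minority
rows_discarded = n_majority - n_minority

# Oversampling keeps every original row and adds synthetic minority rows.
rows_after_oversampling = 2 * n_majority
rows_added = n_majority - n_minority

print(rows_after_undersampling, rows_discarded)  # 9144 5718
print(rows_after_oversampling, rows_added)       # 20580 5718
```

Under these assumptions, undersampling would drop 5718 of the 10290 dominant-class records, more than half of them.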

Before oversampling, the data set contains the following numbers of failed and successful projects:

In [3]:
# Get the class counts of the target variable before oversampling
df['state'].value_counts()

0.0    10290
1.0     4572
Name: state, dtype: int64
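
The output above shows raw counts. If proportions are wanted instead, `value_counts(normalize=True)` returns each class's share of the rows. The sketch below rebuilds a stand-in `state` column with the same counts (an assumption made so the example runs without the CSV file):

```python
import pandas as pd

# Stand-in target column with the same class counts as the output above.
state = pd.Series([0.0] * 10290 + [1.0] * 4572, name="state")

# normalize=True returns each class's share of the rows instead of raw counts.
shares = state.value_counts(normalize=True)
print(shares.round(3))

# Imbalance ratio: dominant-class count divided by non-dominant-class count.
counts = state.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

The dominant class makes up roughly 69% of the records, about 2.25 times as many as the non-dominant class.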

In [4]:
# Create the SMOTE oversampler, which synthesizes new minority-class samples
oversample = SMOTE()

# Resample the features and the target so that both classes are equally represented
X, y = oversample.fit_resample(df.drop('state', axis=1), df['state'])
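
SMOTE does not simply duplicate minority rows. It creates synthetic ones by interpolating between a minority sample and one of its nearest minority-class neighbors. The following is a simplified sketch of that idea in NumPy, not the `imblearn` implementation; the `synthesize` helper, the toy points, and `k=2` are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, k, rng):
    """Create one synthetic point by interpolating towards a near neighbor."""
    # Pick a random minority sample as the starting point.
    i = rng.integers(len(points))
    # Distances from that sample to every minority point.
    distances = np.linalg.norm(points - points[i], axis=1)
    # Indices of the k nearest neighbors, skipping the sample itself (distance 0).
    neighbors = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbors)
    # Move a random fraction of the way from the sample to the chosen neighbor.
    gap = rng.random()
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([synthesize(minority, k=2, rng=rng) for _ in range(3)])
print(synthetic.shape)  # (3, 2)
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.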

In [5]:
# Combine the resampled features and target into a single DataFrame
result = pd.DataFrame(X, columns=df.drop('state', axis=1).columns)
result['state'] = y

After oversampling, both classes of the target variable have the same number of observations.

In [6]:
# Get the class counts of the target variable after oversampling
result['state'].value_counts()

0.0    10290
1.0    10290
Name: state, dtype: int64

In [7]:
# Save the data
result.to_csv('./../../data/engineered_data_oversampled.csv', index=False)