## Sequence Balance

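This notebook produces class-balanced training and validation data and labels from the unbalanced data generated by data_preprocess.ipynb. The data is balanced using RandomUnderSampler from the imbalanced-learn package, which with its default settings randomly discards samples from the larger classes until every class is the same size as the smallest one.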

The following files are required as input:
 - "train_data_num.npy" : Spike protein sequences for training in numerical format
 - "validation_data_num.npy" : Spike protein sequences for validation in numerical format
 - "train_label_clade_num.npy" : Clade labels for training in numerical format
 - "validation_label_clade_num.npy" : Clade labels for validation in numerical format
 
The following files are produced as output:
 - "train_data_balanced.npy" : Balanced spike protein sequences for training in numerical format
 - "train_label_balanced.npy" : Balanced clade labels for training in numerical format
 - "validation_data_balanced.npy" : Balanced spike protein sequences for validation in numerical format
 - "validation_label_balanced.npy" : Balanced clade labels for validation in numerical format
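
To make the balancing idea concrete, here is a minimal NumPy-only sketch of random undersampling on a toy label array. It is not the imbalanced-learn implementation; the helper name `undersample_indices` and the toy labels are assumptions made for illustration only.

```python
import numpy as np

def undersample_indices(labels, seed=42):
    """Return sorted row indices that keep an equal number of samples per class.

    Hypothetical helper: a plain NumPy stand-in for the idea behind
    random undersampling, not the imbalanced-learn code.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # every class is cut down to the rarest class's size
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Toy labels: class 0 has 4 samples, class 1 has 2, class 2 has 3.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
idx = undersample_indices(toy_labels)
print(np.unique(toy_labels[idx], return_counts=True))
```

After resampling, every class in the toy array is left with as many samples as the rarest class had to begin with.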

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/project_data

/content/drive/MyDrive/project_data


In [2]:
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler 

In [3]:
# load data
train_data = np.load('train_data_num.npy', allow_pickle=True)
dev_data = np.load('validation_data_num.npy', allow_pickle=True)
train_label = np.load('train_label_clade_num.npy', allow_pickle=True)
dev_label = np.load('validation_label_clade_num.npy', allow_pickle=True)

In [4]:
# train label imbalance
df = pd.DataFrame({'data': train_data, 'label': train_label})
df['label'].value_counts()

1    147661
2    138280
3    115324
4    114594
0     88610
7      8837
6      6957
8      5695
5      4826
Name: label, dtype: int64
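
Given these counts, the size of the balanced training set can be worked out ahead of time, assuming every class is reduced to the size of the rarest one (the default behaviour of RandomUnderSampler). The sketch below just redoes that arithmetic with the numbers printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
train_counts = {1: 147661, 2: 138280, 3: 115324, 4: 114594, 0: 88610,
                7: 8837, 6: 6957, 8: 5695, 5: 4826}

minority_size = min(train_counts.values())          # rarest clade (label 5)
expected_total = minority_size * len(train_counts)  # samples kept across all classes
kept_fraction = expected_total / sum(train_counts.values())

print(minority_size, expected_total)
print(f"{kept_fraction:.1%} of the training sequences are kept")
```

The expected total of 9 × 4826 = 43434 agrees with the length reported after resampling, and it shows that most of the training sequences are discarded in exchange for balance.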

In [5]:
# row indices of the training and validation data
data_index = np.arange(len(train_data))
dev_data_index = np.arange(len(dev_data))

In [6]:
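# Undersample row indices rather than the sequences themselves:
# fit_resample expects a 2-D X, so the indices are passed as a single column.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(np.array(data_index).reshape(-1, 1), train_label)
X_res_dev, y_res_dev = rus.fit_resample(np.array(dev_data_index).reshape(-1, 1), dev_label)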

In [7]:
print('train data index and label length after balanced downsampling: ', len(X_res), len(y_res))
print('dev data index and label length after balanced downsampling: ', len(X_res_dev), len(y_res_dev))

train data index and label length after balanced downsampling:  43434 43434
dev data index and label length after balanced downsampling:  4842 4842


In [8]:
# X_res has shape (n, 1), so indexing yields shape (n, 1); squeeze back to 1-D
new_train = train_data[X_res].squeeze(axis=1)
new_dev = dev_data[X_res_dev].squeeze(axis=1)
print('train data after balanced down sampling: ', new_train.shape)
print('dev data after balanced down sampling: ', new_dev.shape)

train data after balanced down sampling:  (43434,)
dev data after balanced down sampling:  (4842,)
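
The indexing step above relies on NumPy fancy indexing preserving the shape of the index array. The toy example below illustrates this, under the assumption that the sequence arrays are 1-D object arrays of variable-length sequences (which the printed shapes suggest but do not confirm); `toy_data` and `toy_index` are made-up stand-ins.

```python
import numpy as np

# Toy stand-in for the sequence array: a 1-D object array of variable-length lists.
sequences = [[1, 2], [3], [4, 5, 6], [7, 8]]
toy_data = np.empty(len(sequences), dtype=object)
for i, seq in enumerate(sequences):
    toy_data[i] = seq

# The resampler returns the selected indices as a column, i.e. shape (n, 1).
toy_index = np.array([[0], [2], [3]])

picked = toy_data[toy_index]      # fancy indexing keeps the index shape: (3, 1)
flat = picked.squeeze(axis=1)     # drop the singleton axis: (3,)

print(picked.shape, flat.shape)
print(list(flat))
```

Calling `toy_index.ravel()` before indexing would give the same 1-D result without the squeeze.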


In [9]:
np.save("train_data_balanced.npy", new_train, allow_pickle=True)
np.save("train_label_balanced.npy", y_res, allow_pickle=True)
np.save("validation_data_balanced.npy", new_dev, allow_pickle=True)
np.save("validation_label_balanced.npy", y_res_dev, allow_pickle=True)
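
As a final sanity check, the saved label files can be reloaded to confirm that every class has the same count. The sketch below runs on a throwaway file in a temporary directory so it is self-contained; `class_counts` is a hypothetical helper, and in the notebook the path would instead be "train_label_balanced.npy" or "validation_label_balanced.npy".

```python
import os
import tempfile

import numpy as np

def class_counts(label_path):
    """Load a saved label array and return {class: count} (hypothetical helper)."""
    labels = np.load(label_path, allow_pickle=True)
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))

# Toy file standing in for a balanced label file written by the cell above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_label_balanced.npy")
    np.save(path, np.array([0, 0, 1, 1, 2, 2]), allow_pickle=True)
    counts = class_counts(path)

print(counts)
# A balanced file has exactly one distinct count across all classes.
assert len(set(counts.values())) == 1
```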