In [1]:
from operations import oversampling_training

In [2]:
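# Oversampling strategies to compare; "beta" selects the effective-number re-weighting
oversampling = ["none", "regular", "sqrt", "beta"]
# Step sizes used when picking 100 classes by popularity (subsets Iter1 ... Iter100)
class_gaps = [1, 3, 10, 30, 100]
# Transformer hyperparameters shared by every run
model_parameters = {'embed_size': 128, 'hidden_dim': 235, 'feed_forward_dim': 373, 'num_layers': 3, 'num_heads': 4}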

# Experiment Overview

I ran a supplementary experiment with a Transformer model to investigate how different oversampling strategies cope with the imbalanced class distribution of the PFAM dataset. My objective was to measure the impact of each strategy on the F1 score, a critical performance metric in protein family classification, particularly when classes are imbalanced.

To create five subsets of the PFAM dataset with distinct levels of imbalance, I selected 100 classes of varying popularity for each train|val|test split. Popularity here means the number of samples per class. Classes were chosen starting with the most popular and stepping through the classes with a gap of 1, 3, 10, 30, or 100, which yields the subsets Iter1, Iter3, Iter10, Iter30, and Iter100. A larger gap reaches further into the less popular classes, so the subset is more imbalanced.
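The selection step can be illustrated with a minimal sketch. It assumes classes are ranked by sample count and every gap-th class is kept until 100 are chosen; the function name `select_classes_by_gap` and the toy labels are hypothetical and are not taken from the `operations` module.

```python
from collections import Counter

def select_classes_by_gap(labels, gap, num_classes=100):
    """Rank classes by sample count and keep every `gap`-th one (assumed scheme)."""
    counts = Counter(labels)
    # Most popular first; ties are broken by class name so the result is deterministic
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[::gap][:num_classes]

# Toy data: class "c0" has 10 samples, "c1" has 9, ..., "c9" has 1
toy_labels = [f"c{i}" for i in range(10) for _ in range(10 - i)]
picked_gap3 = select_classes_by_gap(toy_labels, gap=3, num_classes=3)
picked_gap1 = select_classes_by_gap(toy_labels, gap=1, num_classes=2)
```

With a gap of 1 the subset contains only the most popular classes, while a larger gap mixes very popular and much rarer classes in the same subset.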

For each dataset, I assessed four distinct oversampling methods: "none", "regular", "sqrt", and "efficient" (passed to the code as "beta").

None: In this baseline approach, no oversampling was applied during training or validation, which shows how the Transformer architecture performs without any sampling modifications.

Regular: Building upon the baseline, I implemented a straightforward oversampling method that makes sample selection probabilities inversely proportional to class frequencies. This increases the representation of underrepresented classes, potentially improving overall classification performance.

Sqrt: To refine the oversampling process, I introduced a square root weighting scheme for the class probabilities. It still increases the likelihood of sampling less common classes, but less aggressively than the regular method, which mitigates the risk of overfitting to heavily repeated rare samples.
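The difference between the first three schemes can be sketched in a few lines. This is an illustration under assumptions, not the code in `operations`: it assumes "regular" gives each sample a weight of 1/n and "sqrt" a weight of 1/sqrt(n), where n is the sample count of its class. The helper names are hypothetical.

```python
import math
from collections import Counter

def per_sample_weight(n, scheme):
    """Weight of one sample whose class has n samples (assumed definitions)."""
    if scheme == "none":
        return 1.0
    if scheme == "regular":
        return 1.0 / n
    if scheme == "sqrt":
        return 1.0 / math.sqrt(n)
    raise ValueError(f"unknown scheme: {scheme}")

def class_draw_probs(labels, scheme):
    """Probability that a single weighted draw lands in each class."""
    counts = Counter(labels)
    # Total weight of a class = number of samples * weight per sample
    mass = {c: n * per_sample_weight(n, scheme) for c, n in counts.items()}
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}

toy_labels = ["A"] * 100 + ["B"] * 4
probs_none = class_draw_probs(toy_labels, "none")
probs_regular = class_draw_probs(toy_labels, "regular")
probs_sqrt = class_draw_probs(toy_labels, "sqrt")
```

On this toy data the rare class "B" is drawn with probability 4/104 without oversampling, 1/2 with "regular", and 2/12 with "sqrt", so the square root scheme sits between the two extremes.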

Efficient: In my final approach, I employed a re-weighting method based on the concept of the "effective number of samples", as proposed in a CVPR'19 research paper by Google ("Class-Balanced Loss Based on Effective Number of Samples"). This method introduces a parameter, 𝛽, that controls how strongly rare classes are up-weighted: 𝛽 = 0 gives no re-weighting, and values approaching 1 approach inverse-frequency weighting as in the regular method. It therefore offers a tunable middle ground with the potential for better classification performance on imbalanced datasets.
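As a rough sketch of the effective-number idea: for a class with n samples, the effective number is E_n = (1 - 𝛽^n) / (1 - 𝛽), and the class weight is proportional to 1 / E_n. The value 𝛽 = 0.999 and the normalization (weights sum to the number of classes) are assumptions made for illustration; the value used in this experiment is not stated here, and the function names are hypothetical.

```python
def effective_number(n, beta):
    """E_n = (1 - beta**n) / (1 - beta), valid for 0 <= beta < 1."""
    return (1.0 - beta ** n) / (1.0 - beta)

def effective_number_weights(counts, beta=0.999):
    """Per-class weights proportional to 1 / E_n, scaled to sum to the class count."""
    inverse = {c: 1.0 / effective_number(n, beta) for c, n in counts.items()}
    scale = len(inverse) / sum(inverse.values())
    return {c: w * scale for c, w in inverse.items()}

toy_counts = {"common": 1000, "rare": 10}
weights = effective_number_weights(toy_counts, beta=0.999)
# beta = 0 makes E_n = 1 for every class, i.e. no re-weighting at all
uniform = effective_number_weights(toy_counts, beta=0.0)
```

Because E_n saturates at 1 / (1 - 𝛽) for large n, very popular classes stop being penalized further once they pass that size, which is what distinguishes this scheme from plain inverse-frequency weighting.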

The performance of each oversampling method was evaluated using the F1 score on a test set, with models trained using separate training and validation sets. All testing was conducted using "regular" oversampling, so that the test scores compare how well each technique learned the less visible classes rather than how well it fits the dominant ones.
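One way to picture "regular" oversampling at test time is to resample the test indices with replacement, weighting each sample by the inverse of its class frequency. The sketch below uses only the standard library; the actual sampler used in the experiment is not shown in this notebook, so the function name `oversample_indices` and its parameters are hypothetical.

```python
import random
from collections import Counter

def oversample_indices(labels, num_draws, seed=0):
    """Draw sample indices with replacement, weighted by 1 / class frequency."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    rng = random.Random(seed)  # seeded for reproducible draws
    return rng.choices(range(len(labels)), weights=weights, k=num_draws)

# Toy test set: 990 samples of class "A" and 10 of class "B"
toy_labels = ["A"] * 990 + ["B"] * 10
drawn = oversample_indices(toy_labels, num_draws=10_000)
rare_fraction = sum(toy_labels[i] == "B" for i in drawn) / len(drawn)
```

After resampling, both classes appear about equally often, so a model that ignores the rare class can no longer reach a high score on the strength of the common class alone.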

Through this systematic investigation, I aimed to show how effectively different oversampling approaches address imbalanced class distributions in protein family classification with Transformer models. The results may have practical implications for researchers and practitioners in bioinformatics and drug discovery, helping them classify protein sequences more effectively and identify potential drug targets for new treatments.

In [3]:
oversampling_training(oversampling, class_gaps, model_parameters)

Training gap 1 with oversampling none.
Transformer Model Best Params  | hd 235, nl 3, ne 128, ff 373, nh 4
Epoch   1/ 10, train loss: 0.04, train f1: 0.29, val loss: 0.03, val f1: 0.55, duration: 32.8s
Epoch   5/ 10, train loss: 0.00, train f1: 0.94, val loss: 0.00, val f1: 0.94, duration: 36.0s
Epoch  10/ 10, train loss: 0.00, train f1: 0.98, val loss: 0.00, val f1: 0.98, duration: 35.7s



Test F1 score: 0.9850182158396913



Training gap 3 with oversampling none.
Transformer Model Best Params  | hd 235, nl 3, ne 128, ff 373, nh 4
Epoch   1/ 10, train loss: 0.05, train f1: 0.24, val loss: 0.03, val f1: 0.48, duration: 27.5s
Epoch   5/ 10, train loss: 0.00, train f1: 0.93, val loss: 0.00, val f1: 0.94, duration: 28.8s
Epoch  10/ 10, train loss: 0.00, train f1: 0.97, val loss: 0.00, val f1: 0.98, duration: 28.7s



Test F1 score: 0.9766876674319



Training gap 10 with oversampling none.
Transformer Model Best Params  | hd 235, nl 3, ne 128, ff 373, nh 4
Epoch   1/ 10, train loss: 0.05, train f1: 0.02, val loss: 0.07, val f1: 0.00, duration: 5.2s
Epoch   5/ 10, train loss: 0.02, train f1: 0.24, val loss: 0.05, val f1: 0.18, duration: 5.2s
Epoch  10/ 10, train loss: 0.01, train f1: 0.56, val loss: 0.03, val f1: 0.49, duration: 5.3s



Test F1 score: 0.4361624423919399



