# Experiment 06

Grid search over several models, tokenizers, and parameters. 

<ul>
    <li>Training & validation set: 18386 jigsaw negative + 1614 jigsaw positive + 500 ctec negative + 10000 ctec positive</li>
    <li>Test set: 1401 ctec negative + 11600 ctec positive</li>
    <li>Tokenizer: BoW, TF-IDF, BERT</li>
    <li>Models: Logistic Regression, Multilayer Perceptron, BERT</li>
    <li>Optimizer: AdamW (in my trials, using Adam made the predictions very biased)</li>
    <li>Normalization of dataset (mean 0, sd 1, estimated on the training set) is implemented but was not applied in this run; see the Normalization section</li>
</ul>
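
As a quick sanity check on the composition listed above, the sketch below recomputes the set sizes and the class balance from the counts in the list. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Counts copied from the list above
train_val = {"jigsaw_neg": 18386, "jigsaw_pos": 1614, "ctec_neg": 500, "ctec_pos": 10000}
test = {"ctec_neg": 1401, "ctec_pos": 11600}

n_train_val = sum(train_val.values())
n_test = sum(test.values())

# Share of positive (toxic) examples in each pool
pos_frac_train_val = (train_val["jigsaw_pos"] + train_val["ctec_pos"]) / n_train_val
pos_frac_test = test["ctec_pos"] / n_test

print(f"train + val examples: {n_train_val}")
print(f"test examples:        {n_test}")
print(f"positive fraction (train + val): {pos_frac_train_val:.4f}")
print(f"positive fraction (test):        {pos_frac_test:.4f}")
```

The test set is roughly 89% positive while the training pool is roughly 38% positive, so a classifier that always predicts "positive" already reaches about 0.892 test accuracy. Accuracy alone is therefore a weak signal here, which is why the F1 scores for both classes are reported as well.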

## Part 1: Preparation

In [1]:
# 0 --> testing mode 
# 1 --> development mode 
toy_mode = 0

from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
import transformers
import matplotlib.pyplot as plt
from math import inf
import itertools as it
from time import time

# Define constants 
NUM_LABELS = 2
MAX_LEN = 300    # Max text length for encoding purpose 
BATCH_SIZE = 32
PRETRAINED_BERT_NAME = 'bert-base-uncased'
DATA_PWD = './'
SESS_PWD = './xprmt_05a_another/'
LOG_NAME = './xprmt_06.log'

# Enable GPU if possible 
device = torch.device(
    'cuda:0' if torch.cuda.is_available() else 'cpu'
)
print(f'device = {device}')

device = cuda:0


### Load data \& train-val-test split

`toy_mode = 1` means we are in development mode and load only a small slice of each dataset, so that we can debug without wasting time waiting for results. 

`toy_mode = 0` means we are in testing mode and will load all data. 

In [2]:
# Load jigsaw and ctec datasets 
def load_data(toxic_threshold): 
    '''
    Load dataset (mix of ctec and jigsaw) and return train, val, test sets 
    @Params
    -- toxic_threshold: Float. If the toxic score of a text is above the threshold, then we classify the text as toxic. 
    @Return 
    (X_train_text, X_val_text, X_test_text, y_train, y_val, y_test)
    '''
    

    jigsaw_df = pd.read_csv(DATA_PWD + 'train_preproc_shrk.csv')
    ctec_df = pd.read_csv(DATA_PWD + 'ctec_training_data_preproc.csv')

    # target >= threshold --> toxic --> label = 1
    # target < threshold --> non-toxic --> label = 0
    jigsaw_df.loc[jigsaw_df['target'] >= toxic_threshold, 'label'] = 1
    jigsaw_df.loc[jigsaw_df['target'] < toxic_threshold, 'label'] = 0

    # Split by label 
    jigsaw_neg = jigsaw_df[jigsaw_df['label'] == 0]
    jigsaw_pos = jigsaw_df[jigsaw_df['label'] == 1]
    ctec_neg = ctec_df[ctec_df['label'] == 0]
    ctec_pos = ctec_df[ctec_df['label'] == 1]

    # Show number of positive and negative examples in each dataset 
#     print(f'jigsaw # negative examples = {jigsaw_neg.shape[0]}')
#     print(f'jigsaw # positive examples = {jigsaw_pos.shape[0]}')
#     print(f'ctec # negative examples = {ctec_neg.shape[0]}')
#     print(f'ctec # positive examples = {ctec_pos.shape[0]}')

    # Create training set and test set based on the scheme described above

    # Randomly sampling indices 
    indpos = np.random.choice(range(ctec_pos.shape[0]), size = 10000, replace = False)
    indneg = np.random.choice(range(ctec_neg.shape[0]), size = 500, replace = False)
    notindpos = np.setdiff1d(range(ctec_pos.shape[0]), indpos)
    notindneg = np.setdiff1d(range(ctec_neg.shape[0]), indneg)

    # Combo of training and validation set 
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    df_trainval = pd.concat(
        [jigsaw_df, ctec_pos.iloc[indpos], ctec_neg.iloc[indneg]], 
        ignore_index = True
    )

    # Test set
    df_test = pd.concat(
        [ctec_neg.iloc[notindneg], ctec_pos.iloc[notindpos]], 
        ignore_index = True
    )


    # Split into training set and validation set 
    df_train, df_val = train_test_split(
        df_trainval, 
        test_size = 0.1, 
        random_state = 42
    )

#     print(f'# positive in training set = {(df_train.label == 1).sum()}')
#     print(f'# negative in training set = {(df_train.label == 0).sum()}')
#     print(f'# positive in validation set = {(df_val.label == 1).sum()}')
#     print(f'# negative in validation set = {(df_val.label == 0).sum()}')
#     print(f'# positive in test set = {(df_test.label == 1).sum()}')
#     print(f'# negative in test set = {(df_test.label == 0).sum()}')
    
    # Extract texts and labels from dataframes 
    X_train_text = df_train['comment_text'].tolist()
    y_train = df_train['label'].astype(int).tolist()
    X_val_text = df_val['comment_text'].tolist()
    y_val = df_val['label'].astype(int).tolist()
    X_test_text = df_test['comment_text'].tolist()
    y_test = df_test['label'].astype(int).tolist()

    if toy_mode: 
        X_train_text = X_train_text[:50]
        y_train = y_train[:50]
        X_val_text = X_val_text[:30]
        y_val = y_val[:30]
        X_test_text = X_test_text[:20]
        y_test = y_test[:20]
    
    return X_train_text, X_val_text, X_test_text, y_train, y_val, y_test
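
The two core steps of `load_data()` (turning a toxicity score into a binary label, and drawing a sample whose complement becomes the test set) can be tried on toy data. This is a minimal sketch: `toy_df`, `add_label`, and the seed value are hypothetical, and the seeded generator is an assumption added here so that repeated calls return the same split (the function above samples without a seed).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the jigsaw data: `target` is a toxicity score in [0, 1]
toy_df = pd.DataFrame({
    "comment_text": ["a", "b", "c", "d"],
    "target": [0.05, 0.30, 0.45, 0.90],
})

def add_label(df, toxic_threshold):
    """Return a copy of df with label = 1 where target >= toxic_threshold."""
    out = df.copy()
    out["label"] = (out["target"] >= toxic_threshold).astype(int)
    return out

labels_low = add_label(toy_df, 0.15)["label"].tolist()
labels_high = add_label(toy_df, 0.45)["label"].tolist()
print(labels_low, labels_high)

# Disjoint sample / remainder split, seeded so that repeated calls agree
rng = np.random.default_rng(0)        # the seed value is an arbitrary choice
n_rows, n_sample = 10, 4
sampled = rng.choice(n_rows, size=n_sample, replace=False)
remainder = np.setdiff1d(np.arange(n_rows), sampled)
print(sorted(sampled.tolist()), remainder.tolist())
```

Lowering the threshold can only move examples from label 0 to label 1, so the jigsaw class balance shifts with `toxic_threshold` while the texts themselves stay the same.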

### Text encoding \& creating PyTorch `Dataset`, `DataLoader`

To make it convenient to create PyTorch `Dataset` objects, we fit `CountVectorizer()` and `TfidfVectorizer()` now for later use. 

In [3]:
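# The texts do not depend on the toxicity threshold (only the jigsaw labels do), 
# so it is convenient to fit the vectorizers once here. 
# Caveat: load_data() samples the ctec rows with an unseeded np.random.choice, 
# so each call can draw a different ctec subset than the one fitted here. 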
X_train_text, X_val_text, X_test_text, y_train, y_val, y_test = load_data(toxic_threshold = .5)

countVect = CountVectorizer()
countVect.fit(X_train_text)

tfidfVect = TfidfVectorizer()
tfidfVect.fit(X_train_text)

bertTokenizer = BertTokenizer.from_pretrained(PRETRAINED_BERT_NAME)
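
To make the difference between the two scikit-learn encoders concrete, here is a small sketch on a toy corpus (the texts are made up for illustration). Both vectorizers learn their vocabulary from the training texts only, so words that first appear at transform time are silently dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_texts = ["you are nice", "you are rude rude"]   # toy training corpus
new_text = ["rude rude you stranger"]                 # "stranger" is unseen

bow = CountVectorizer().fit(train_texts)
tfidf = TfidfVectorizer().fit(train_texts)

# Vocabulary sorted by column index
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
bow_vec = bow.transform(new_text).toarray()[0]
tfidf_vec = tfidf.transform(new_text).toarray()[0]

print(vocab)       # vocabulary in column order
print(bow_vec)     # raw counts; the unseen word is dropped
print(tfidf_vec)   # reweighted counts, L2-normalised by default
```

Both encoders produce vectors whose length equals the vocabulary size, which is why the input dimensionality of the downstream models depends on the training texts.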

To utilize PyTorch and GPU computation, we create instances of `Dataset` instead of using our original dataset. 

<b>Notice. </b> Text encoding is performed inside the `__getitem__()` method of the `Dataset` class. 

`DataLoader` provides a way to iterate through a dataset with a given batch size. 

<b style="color:red">Attention! </b> Every newly created PyTorch tensor lives on the CPU by default and must be moved to GPU memory explicitly (e.g. with `.to(device)`). Keep this in mind whenever you create a tensor! 

In [4]:
class ToxicDataset(Dataset): 
    def __init__(self, texts, labels, tokenizer, max_len = MAX_LEN): 
        '''
        Instantiate ToxicDataset() class 
        @Params
        -- texts: comments texts of the dataset (input)
        -- labels: labels corresponding to the texts 
        -- tokenizer: {'bow', 'tfidf', 'bert'} the scheme of encoding texts into numerical data
        -- max_len: if we use BERT tokenizer, we represent texts with `max_len`-dim vectors
        '''
        super().__init__()
        self.texts = texts 
        self.labels = labels 
        self.tokenizer = tokenizer
        self.max_len = max_len 
        
    def __len__(self): 
        '''Return the size of dataset '''
        return len(self.texts)
    
    def __getitem__(self, idx, mean = None, sd = None): 
        '''
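        Overrides Dataset.__getitem__ (required for map-style datasets). 
        mean and sd are for normalization purposes; DataLoader only passes idx, 
        so both keep their default None unless the method is called directly. 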
        '''
        
        text = str(self.texts[idx])
        # Vector embedding of text. Will be computed by tokenizer 
        input_vec = None
        # Must be PyTorch tensor 
        label = torch.tensor(
            self.labels[idx], 
            dtype = torch.long
        ).to(device)
        # Only applicable to BERT
        attention_mask = torch.tensor([-1])
        
        '''
        Encode text data according to the given tokenizer
        The result of encoding must be PyTorch Tensor. 
        '''
        # BoW encoding 
        if self.tokenizer == 'bow': 
            input_vec = torch.Tensor(
                countVect.transform([text]).toarray()
            ).to(device).flatten()
            
        # TF-IDF encoding
        elif self.tokenizer == 'tfidf': 
            input_vec = torch.Tensor(
                tfidfVect.transform([text]).toarray()
            ).to(device).flatten()
            
        # BERT encoding 
        elif self.tokenizer == 'bert': 
            encoding = bertTokenizer.encode_plus(
                text, 
                add_special_tokens = True, 
                truncation = True, 
                max_length = self.max_len, 
                return_token_type_ids = False, 
                padding = 'max_length',    # replaces the deprecated pad_to_max_length = True
                return_attention_mask = True, 
                return_tensors = 'pt'
            )
            
            input_vec = encoding['input_ids'].to(device).flatten()
            attention_mask = encoding['attention_mask'].to(device).flatten()
            
        if not(mean is None) and not(sd is None): 
            input_vec = (input_vec - mean) / sd
            
        return {
            'text': text, 
            'input_vec': input_vec, 
            'input_size': input_vec.shape[0], 
            'attention_mask': attention_mask, 
            'label': label
        }

In [5]:
def create_data_loader(
    texts, labels, 
    tokenizer, 
    batch_size = BATCH_SIZE, 
    max_len = MAX_LEN
): 
    '''
    Helper function for creating DataLoader.  
    @Params
    -- texts: comments texts of the dataset (input)
    -- labels: labels corresponding to the texts 
    -- tokenizer: {'bow', 'tfidf', 'bert'} the scheme of encoding texts into numerical data
    -- batch_size: (as the name suggests)
    -- max_len: if we use BERT tokenizer, we represent texts with `max_len`-dim vectors
    '''
    
    dataset = ToxicDataset(
        texts = texts, 
        labels = labels, 
        tokenizer = tokenizer, 
        max_len = max_len
    )
    
    return DataLoader(
        dataset, 
        batch_size = batch_size, 
        num_workers = 0    # We can change this value to enable multiprocessing
    )
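
The way `DataLoader` batches the dictionaries returned by `__getitem__()` matters for the code that follows, so here is a minimal CPU-only sketch. `ToyDataset` is a hypothetical stand-in for `ToxicDataset`, not part of the experiment.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for ToxicDataset: 10 examples, 4 features each."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        vec = torch.full((4,), float(idx))
        return {"text": f"comment {idx}", "input_vec": vec, "input_size": vec.shape[0]}

loader = DataLoader(ToyDataset(), batch_size=4)
batch = next(iter(loader))

print(type(batch["text"]), len(batch["text"]))  # strings are gathered into a list
print(batch["input_vec"].shape)                 # tensors are stacked along a new batch axis
print(batch["input_size"])                      # Python ints are collated into a tensor
print(len(batch))                               # number of dict keys, not the batch size
print(len(loader))                              # number of batches
```

Two details are easy to trip over: `len(batch)` counts dictionary keys, so the batch size has to be read from a tensor such as `batch["input_vec"].shape[0]`, and `batch["input_size"]` is a tensor, which is why the notebook reads the dimensionality with `batch['input_size'][0].item()`.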

### Normalization

<b style="color:red;">Attention!</b> I did not actually use normalization in this experiment, because initial trials indicated that it degrades the models, for reasons I have not yet identified. 

Usually, normalizing features to mean 0 and sd 1 (with statistics estimated on the training set) improves performance. Here we define a helper function that returns the mean vector and sd vector of the training set, so that we can apply those two vectors to the validation set and test set as well. This helper function also returns the dimensionality of the input vectors, which will come in handy later. 

I choose not to call `.mean()` and `.std()` on the full dataset; accumulating running sums batch by batch avoids loading all data into memory at once.

In [6]:
def get_mean_sd_ndim(data_loader): 
    '''
    Given the torch DataLoader of a dataset, return a tuple of 1. vector of all means; 2. vector of all standard deviations; 3. dimensionality of input
    '''
    print(f'\rAcquiring mean and standard deviation....', end = '')
    
    sum_for_mean = None
    sum_for_sqmean = None
    ndim = 0
    counter = 0
    
    for batch in data_loader: 
        # Initialize vector `summation` with correct dimension
        if sum_for_mean is None or sum_for_sqmean is None:             
            ndim = batch['input_size'][0].item()
            sum_for_mean = torch.zeros(ndim).to(device)
            sum_for_sqmean = torch.zeros(ndim).to(device)
            
        # Batch of input vectors 
        input_vec = batch['input_vec'].float()
        
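        # Running sum of inputs (for the mean)
        sum_for_mean += torch.sum(input_vec, axis = 0)
        
        # Running sum of squared inputs (for the sd)
        sum_for_sqmean += torch.sum(input_vec ** 2, axis = 0)
        
        # Count examples, not dict keys: `batch` is a dict, so len(batch) 
        # would return the number of keys rather than the batch size 
        counter += input_vec.shape[0]
        
    mean_vec = sum_for_mean / counter
    # Population sd via E[x^2] - E[x]^2; this is 0 for constant features 
    sd_vec = torch.sqrt((sum_for_sqmean / counter) - mean_vec ** 2)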
    
    return mean_vec, sd_vec, ndim
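
The running-sum idea can be checked against NumPy's built-in statistics on a small array. This is a sketch under stated assumptions: `streaming_mean_sd` and its `eps` guard are hypothetical additions for illustration, the guard being one possible way to avoid dividing by a zero sd for features that never vary (common with sparse BoW columns).

```python
import numpy as np

def streaming_mean_sd(batches, eps=1e-8):
    """Mean and population sd per feature from one pass over 2-D batches.

    `eps` (an assumed safeguard) marks sd values that are effectively zero;
    they are replaced by 1 so that later division by sd leaves those
    features unscaled instead of producing inf or nan.
    """
    total, total_sq, n_rows = None, None, 0
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if total is None:
            total = np.zeros(batch.shape[1])
            total_sq = np.zeros(batch.shape[1])
        total += batch.sum(axis=0)
        total_sq += (batch ** 2).sum(axis=0)
        n_rows += batch.shape[0]          # count examples, not batches
    mean = total / n_rows
    var = np.maximum(total_sq / n_rows - mean ** 2, 0.0)  # clip rounding noise
    sd = np.sqrt(var)
    sd[sd < eps] = 1.0                    # constant feature: leave it unscaled
    return mean, sd

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))
data[:, 2] = 5.0                          # a constant column, sd = 0
mean, sd = streaming_mean_sd([data[:4], data[4:8], data[8:]])

print(np.allclose(mean, data.mean(axis=0)))
print(np.allclose(sd[:2], data.std(axis=0)[:2]))
```

The divisor must be the number of examples seen so far. Dividing by anything else (for instance the number of batches or of dictionary keys) inflates the mean and can make the variance estimate negative, which turns the sd into `nan`.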


## Part 2: Models and model-specific parameters

### Create model class

We define our models / classifiers by subclassing `nn.Module` and overriding its `__init__()` and `forward()` methods. 

<b>Model-specific hyperparameters</b> (<span style="color:blue;">hidden layers, drop-out rate, etc.</span>) should be defined as attributes of each model class. 
    
<b style="color:red">Attention! </b> `__init__()` must use classes from the `torch.nn` module to specify the inter-layer operations that (1) have parameters to train or (2) change dimensionality (e.g. linear map, convolution, dropout). Operations within a layer that have no trained parameters and do not change dimensionality can be written either as separate layers using `torch.nn` or inline in `forward()` using `torch.nn.functional`. 

#### BERT classifier

In [7]:
class BertClassifier(nn.Module):     
    def __init__(self, num_labels, drop_out = 0.2):   
        '''
        @Params
        -- drop_out: Drop out rate. By default = 0.2
        '''
        
        # Must instantiate the parent class 
        super(BertClassifier, self).__init__()
        
        # pretrained BERT model
        self.bert = BertModel.from_pretrained(PRETRAINED_BERT_NAME)
        
        # Dropout rate 
        self.drop = nn.Dropout(p = drop_out)
        
        # Operation from BERT's last hidden layer to output layer
        # A linear map to a 2-dimensional vector (2 === binary classification)
        self.out = nn.Linear(self.bert.config.hidden_size, num_labels)
        
    def forward(self, input_vec, **kwargs): 
        '''
        Define the feed-forward function of the model
        @kwargs
        -- attention_mask: the attention mask for BERT
        '''
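        # BERT: from input layer to the last hidden layer 
        # Index 1 of the output is the pooled [CLS] representation; indexing 
        # works whether the model returns a tuple or a ModelOutput object 
        pooled_output = self.bert(
            input_ids = input_vec, 
            attention_mask = kwargs['attention_mask']
        )[1]
        
        # Drop out neurons according to dropout rate 
        output = self.drop(pooled_output)
        
        # Return the output layer (raw logits, no softmax) 
        return self.out(output)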

#### Multilayer perceptron classifier

i.e. the naive deep neural network

<b style="color:red;">Attention!</b> To duplicate a PyTorch tensor, we must use `.clone()` function so that the original tensor will not be accidentally overwritten. 

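<b>Choice of coding style</b>. An MLP can be written with either `nn.Module` or `nn.Sequential`. For now we choose the former because it lets us build a variable number of hidden layers in a loop and keeps the forward pass explicit; at the time of writing I did not find a convenient counterpart in `nn.Sequential` to the `.add()` method of Keras models in TensorFlow. 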

In [8]:
class MLPClassifier(nn.Module): 
    def __init__(self, input_size, num_labels, hidden_layers_dim, drop_out = 0.2): 
        '''
        @Params
        -- input_size: the dimensionality of input vector 
        -- num_labels: the number of classes 
        -- hidden_layers_dim: the list of dimensionalities for each hidden layer
        -- drop_out: currently unused; this class does not create a dropout layer
        '''
        
        super(MLPClassifier, self).__init__()
        
        curr_layer_dim = input_size
        self.hidden_layers = nn.ModuleList([])
        
        for dim in hidden_layers_dim: 
            self.hidden_layers.append(nn.Linear(curr_layer_dim, dim))
            self.hidden_layers.append(nn.ReLU())
            curr_layer_dim = dim
            
        self.exit_layer = nn.Linear(curr_layer_dim, num_labels)
        
        
    def forward(self, input_vec, **kwargs): 
        '''
        Define the feed forward function
        '''
        vec = input_vec.clone()
        
        # Hidden layers
        for layer in self.hidden_layers: 
            vec = layer(vec)
            
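        # Exit layer
        # NOTE: nn.CrossEntropyLoss (used as loss_fn in Part 4) expects raw logits 
        # and applies log-softmax itself, so the softmax below is applied twice 
        # during training. Return self.exit_layer(vec) directly to avoid this. 
        return F.softmax(self.exit_layer(vec), dim = 1)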
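
Because this model returns softmax probabilities and is later trained with `nn.CrossEntropyLoss`, it is worth seeing what a second softmax does to the loss. The sketch below re-implements the per-example cross-entropy in plain Python (the helper names are hypothetical) and compares feeding it logits with feeding it probabilities.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_from_logits(scores, label):
    """Per-example cross-entropy on raw scores: -log softmax(scores)[label]."""
    return -math.log(softmax(scores)[label])

logits = [8.0, -8.0]      # a very confident, correct prediction for class 0
label = 0

loss_logits = cross_entropy_from_logits(logits, label)
loss_double = cross_entropy_from_logits(softmax(logits), label)  # softmax applied twice
floor = math.log(1 + math.exp(-1))   # smallest loss reachable when inputs lie in [0, 1]

print(f"loss on logits:         {loss_logits:.6f}")
print(f"loss on probabilities:  {loss_double:.6f}")
print(f"lower bound (2 classes): {floor:.6f}")
```

With two classes, probabilities differ by at most 1, so the loss computed on them can never fall below log(1 + e^-1) ≈ 0.313 no matter how confident the model is. Gradients are correspondingly flattened, which is one plausible contributor to models that collapse to a single predicted class.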

#### Logistic Regression 

Logistic regression is nothing but an MLP with no hidden layers. 

In [9]:
class LogisticRegressionClassifier(MLPClassifier): 
    def __init__(self, input_size, num_labels = 2): 
        super().__init__(
            input_size, 
            num_labels, 
            hidden_layers_dim = [], 
            drop_out = 0
        )

## Part 3: Helper functions for training an epoch and evaluating model

<b>Choice of coding style</b>. Unlike TensorFlow and scikit-learn, most PyTorch code I have seen defines the training loop outside the model class. This style gives more flexibility and control at the cost of compactness and reusability. In other words, it is not common to call something like `model.fit()` in PyTorch. 

I am not sure whether writing the training functions inside model classes will cause problems, because the optimizer depends on `model.parameters()`. 

For now, we will write training functions outside model classes. 

Consequently, we tune the <b>non-model-specific parameters</b> (<span style="color:blue;">optimizer, loss function, number of iterations, number of epochs, etc.</span>) outside the model classes. However, these parameters can sometimes be model-specific, in which case we need to make ad hoc adjustments to our code. 

<b style="color:red;">Warning!</b> Code for computing metrics is only applicable to binary classification. 

In [10]:
def train_epoch_binary(
    model, data_loader, 
    loss_fn, optimizer, 
    device, 
    scheduler = None, 
    clip_grad = False  # Enable gradient clipping? 
): 
    '''
    The helper function that trains one epoch 
    !! Code for computing metrics is only applicable to binary classification
    '''
    # Set the model to training mode 
    model = model.train()
    # Clean GPU cache 
    torch.cuda.empty_cache()
    
    # Record loss of training on each batch 
    losses = []
    # label, pred_class 
    # index 00 --> true negative 
    # index 01 --> false positive 
    # index 10 --> false negative 
    # index 11 --> true positive 
    cat_count = [0, 0, 0, 0]
    batch_counter = 0
    
    # Train each batch 
    for batch in data_loader: 
        print(f'\rTraining batch #{batch_counter} out of {len(data_loader)}', end = '')
        
        # Load data from current batch 
        input_vec = batch['input_vec'].to(device)
        if not isinstance(model, BertClassifier): 
            # different models require different datatypes 
            input_vec = input_vec.float()
        label = batch['label'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        
        # Forward pass 
        output = None
        if isinstance(model, BertClassifier): 
            '''Need to pass attention mask if model is BERT'''
            output = model.forward(input_vec = input_vec, attention_mask = attention_mask)
        else: 
            '''Otherwise, only input_vec is required argument'''
            output = model.forward(input_vec = input_vec)
            
        # Probability of predicted class and predicted class 
        # torch.max with dim==1 returns (maxval, argmax)
        pred_prob, pred_class = torch.max(output, dim = 1)
            
        # Compute loss 
        loss = loss_fn(output, label)
        losses.append(loss.item())
        
        # Backprop 
        loss.backward()
        
        # Gradient clipping 
        if clip_grad: 
            nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.)
            
        # Update optimizer and scheduler 
        # scheduler is only used for BERT
        optimizer.step()
        if scheduler: 
            scheduler.step()
        
        # Compute which category each examples falls in
        # 00 --> true negative 
        # 01 --> false positive 
        # 10 --> false negative 
        # 11 --> true positive 
        cats = 2 * label  + pred_class
        # Count each category 
        for i in range(4): 
            cat_count[i] += (cats == i).sum().item()
            
        # Post-processing 
        # Failure to do so (especially clearing the gradients) will result in unexpected problems 
        # loss.item() above already stored a plain float, so no graph is kept alive 
        optimizer.zero_grad()  # reset accumulated gradients before the next batch 
        torch.cuda.empty_cache()
        batch_counter += 1
            
    '''Metrics of epoch'''
    # Train accuracy of epoch
    acc = (cat_count[0] + cat_count[3]) / np.sum(cat_count)
    # Confusion matrix 
    confusion = np.array([[cat_count[0], cat_count[1]], [cat_count[2], cat_count[3]]])
    # F1 score assuming positive example is scarce
    try:
        f1_pos = cat_count[3] / (cat_count[3] + .5 * cat_count[1] + .5 * cat_count[2])
    except ZeroDivisionError: 
        f1_pos = inf
    # F1 score assuming negative example is scarce 
    try:
        f1_neg = cat_count[0] / (cat_count[0] + .5 * cat_count[1] + .5 * cat_count[2])
    except ZeroDivisionError: 
        f1_neg = inf
    
    return np.mean(losses), confusion, acc, f1_pos, f1_neg
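
The `2 * label + pred_class` trick used above maps every (label, prediction) pair to a single integer in 0..3, so the confusion matrix can be filled by counting. A small NumPy sketch with made-up labels and predictions:

```python
import numpy as np

label      = np.array([0, 0, 1, 1, 1, 0])
pred_class = np.array([0, 1, 0, 1, 1, 0])

# 2 * label + pred_class maps (label, pred) to 0..3:
# 0 = true negative, 1 = false positive, 2 = false negative, 3 = true positive
cats = 2 * label + pred_class
cat_count = np.bincount(cats, minlength=4)

# Rows: true label, columns: predicted class
confusion = cat_count.reshape(2, 2)

print(cats)
print(cat_count)
print(confusion)
```

`np.bincount` with `minlength=4` does in one call what the `for i in range(4)` loop does, and it still returns four entries when a category never occurs.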

In [11]:
def eval_epoch_binary(
    model, data_loader, 
    loss_fn, optimizer, 
    device, 
    scheduler = None, 
    test_mode = False
): 
    '''
    The helper function that evaluates the model
    Runs only forward pass without backprop
    Primarily used for cross-validation
    !! Code for computing metrics is only applicable to binary classification
    '''
    
    # Set the model to evaluation mode 
    model = model.eval()
    # Clean GPU cache 
    torch.cuda.empty_cache()
    
    # Record loss on each batch 
    losses = []
    # label, pred_class 
    # index 00 --> true negative 
    # index 01 --> false positive 
    # index 10 --> false negative 
    # index 11 --> true positive 
    cat_count = [0, 0, 0, 0]
    batch_counter = 0
    
    with torch.no_grad(): 
        for batch in data_loader: 
            if not test_mode: 
                print(f'\rCross-validating batch #{batch_counter} out of {len(data_loader)}', end = '')

            # Load data from current batch
            input_vec = batch['input_vec'].to(device)
            if not isinstance(model, BertClassifier): 
                # different models require different datatypes 
                input_vec = input_vec.float()
            attention_mask = batch['attention_mask'].to(device)
            label = batch['label'].to(device)

            # Forward pass 
            output = None
            if isinstance(model, BertClassifier): 
                '''Need to pass attention mask if model is BERT'''
                output = model.forward(input_vec = input_vec, attention_mask = attention_mask)
            else: 
                '''Otherwise, only input_vec is required argument'''
                output = model.forward(input_vec = input_vec)

            # torch.max with dim=1 returns (maxvals, indices)
            # indices are the labels we want to predict 
            preds_prob, preds_class = torch.max(output, dim = 1)

            # Compute which category each examples falls in
            # 00 --> true negative 
            # 01 --> false positive 
            # 10 --> false negative 
            # 11 --> true positive 
            # Debug
            cats = 2 * label  + preds_class
            # Count each category 
            for i in range(4): 
                cat_count[i] += (cats == i).sum().item()
            
            # Calculate loss
            loss = loss_fn(output, label)

            # For analysis purpose 
            losses.append(loss.item())
            
            torch.cuda.empty_cache()
            batch_counter += 1
            
    # Accuracy over the evaluated set
    acc = (cat_count[0] + cat_count[3]) / np.sum(cat_count)
    # Confusion matrix 
    confusion = np.array([[cat_count[0], cat_count[1]], [cat_count[2], cat_count[3]]])
    # F1 score assuming positive example is scarce
    try:
        f1_pos = cat_count[3] / (cat_count[3] + .5 * cat_count[1] + .5 * cat_count[2])
    except ZeroDivisionError: 
        f1_pos = inf
    # F1 score assuming negative example is scarce 
    try:
        f1_neg = cat_count[0] / (cat_count[0] + .5 * cat_count[1] + .5 * cat_count[2])
    except ZeroDivisionError: 
        f1_neg = inf
        
    return np.mean(losses), confusion, acc, f1_pos, f1_neg 
        

## Part 4: Grid search and training loop

### Initialize a metric table

In [12]:
# Table head of final results 
colNames = ['model_name', 'encoder', 'toxic_threshold', 'test_confusion_matrix', 'test_accuracy', 'test_f1_pos', 'test_f1_neg']

# Array for storing final results 
metric_table = []

### List of hyperparameters for grid search

We start with a simple grid search in which every model is tuned over the same kinds of hyperparameters (tokenizer and toxicity threshold). 

In [13]:
# Similar format as in sklearn

gridsearch = [
    ('mlp', {
        'tokenizers': ['bow', 'tfidf', 'bert'], 
        'toxic_thresholds': [0.15, 0.3, 0.45]
    }), 
    ('bert', {
        'tokenizers': ['bert'], 
        'toxic_thresholds': [0.15, 0.3, 0.45]
    }), 
    ('logistic_regression', {
        'tokenizers': ['bow', 'tfidf', 'bert'], 
        'toxic_thresholds': [0.15, 0.3, 0.45]
    })
]


def init_model(model_name, input_size, num_labels, **kwargs): 
    '''
    Helper function for instantiating the model
    @Params
    -- input_size: the dimensionality of input vector
    -- num_labels: number of output labels 
    @kwargs
    -- hidden_layers_dim: list of hidden layers for MLP
    '''
    if model_name == 'logistic_regression': 
        return LogisticRegressionClassifier(input_size).to(device)
    
    elif model_name == 'mlp': 
        return MLPClassifier(input_size, num_labels, kwargs['hidden_layers_dim']).to(device)
    
    elif model_name == 'bert': 
        return BertClassifier(num_labels = num_labels).to(device)
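
To see how many training runs the grid above implies, the nested structure can be flattened with `itertools.product`, the same call the grid-search loop uses. The `grid` literal below simply repeats the configuration so that the snippet runs on its own.

```python
import itertools as it

# Same structure as `gridsearch` above, repeated so the snippet runs on its own
grid = [
    ("mlp", {"tokenizers": ["bow", "tfidf", "bert"], "toxic_thresholds": [0.15, 0.3, 0.45]}),
    ("bert", {"tokenizers": ["bert"], "toxic_thresholds": [0.15, 0.3, 0.45]}),
    ("logistic_regression", {"tokenizers": ["bow", "tfidf", "bert"], "toxic_thresholds": [0.15, 0.3, 0.45]}),
]

# One (model, tokenizer, threshold) triple per training run
runs = [
    (name, tokenizer, threshold)
    for name, params in grid
    for tokenizer, threshold in it.product(params["tokenizers"], params["toxic_thresholds"])
]

print(len(runs))            # total number of models to train
print(runs[0], runs[-1])    # first and last configuration
```

`it.product` varies the last argument fastest, so all thresholds are visited for one tokenizer before moving on to the next tokenizer.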

### Hyperparameters that are not part of grid search

Informal observations from xprmt_05a_another suggest that 4 epochs is a reasonable number. 

In [14]:
hidden_layers_dim = [100, 50, 70, 20, 5]

loss_fn = nn.CrossEntropyLoss().to(device)

num_epoch = 4

# The optimizer and scheduler must be created inside the grid-search loop, because they are bound to the parameters of each specific model. 

### Grid-search loop

Since we are doing grid search automatically, we will no longer plot learning curve. 

Also, we will no longer store the binary file of trained models. 

In [15]:
%%time

logfile = open(LOG_NAME, 'w')

for grid in gridsearch: 
    model_name = grid[0]
    hyperparams = grid[1]
    
    for tokenizer, toxic_threshold in it.product(hyperparams['tokenizers'], hyperparams['toxic_thresholds']): 
        
        logfile.write('=' * 20 + 'start of model' + '=' * 20 + '\n')
        logfile.write(f'{model_name}, {tokenizer} tokenizer, toxic_threshold = {toxic_threshold}\n\n')
        logfile.flush()
        
        '''Step 1: prepare data and hyperparameters'''
        
        # Load data according to toxic_threshold 
        X_train_text, X_val_text, X_test_text, y_train, y_val, y_test = load_data(toxic_threshold)
        
        # Create DataLoader
        train_data_loader = create_data_loader(
            X_train_text, y_train, tokenizer
        )

        val_data_loader = create_data_loader(
            X_val_text, y_val, tokenizer
        )

        test_data_loader = create_data_loader(
            X_test_text, y_test, tokenizer
        )
        
        # get mean, sd, and input dimension
        # mean_vec, sd_vec, ndim = get_mean_sd_ndim(train_data_loader)
        ndim = next(iter(train_data_loader))['input_size'][0].item()
        
        # Instantiate the model 
        model = init_model(
            model_name, 
            input_size = ndim, 
            num_labels = NUM_LABELS, 
            hidden_layers_dim = hidden_layers_dim
        )
        
        # Define optimizer 
        optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
        
        # Some model dependent parameters
        if model_name == 'logistic_regression': 
            clip_grad = False
            scheduler = None
        elif model_name == 'mlp': 
            clip_grad = True
            scheduler = None
        elif model_name == 'bert': 
            clip_grad = True 
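            # NOTE: scheduler.step() runs once per batch in every epoch, so with 
            # num_training_steps = len(train_data_loader) the learning rate decays 
            # to 0 by the end of the first epoch. Use len(train_data_loader) * num_epoch 
            # to spread the decay over all epochs. 
            scheduler = get_linear_schedule_with_warmup(
                optimizer, 
                num_warmup_steps = 0, 
                num_training_steps = len(train_data_loader)
            )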
        
        '''Step 2: training loop'''
        train_losses = []
        val_losses = []
        
        for epoch in range(num_epoch): 
            logfile.write(f'Start epoch {epoch + 1} out of {num_epoch}\n')
            logfile.write('-' * 10 + '\n')
            logfile.flush()
            
            # Train
            train_loss, train_confusion, train_acc, train_f1_pos, train_f1_neg = train_epoch_binary(
                model, train_data_loader, 
                loss_fn, optimizer, 
                device, 
                clip_grad = clip_grad, 
                scheduler = scheduler
            )
            
            # Record loss to plot learning curve 
            train_losses.append(train_loss)
            
            logfile.write(f'Train loss = {np.mean(train_loss)}, ')
            logfile.write(f'Train accuracy = {train_acc}, ')
            logfile.write(f'Train f1_pos = {train_f1_pos}, ')
            logfile.write(f'Train f1_neg = {train_f1_neg}\n')
            logfile.flush()
            
            # Cross-validation
            val_loss, val_confusion, val_acc, val_f1_pos, val_f1_neg = eval_epoch_binary(
                model, val_data_loader, 
                loss_fn, optimizer, 
                device
            )
            
            # Record losses to plot learning curve 
            val_losses.append(val_loss)
            
            logfile.write(f'Validation loss = {np.mean(val_loss)}, ')
            logfile.write(f'Validation accuracy = {val_acc}, ')
            logfile.write(f'Validation f1_pos = {val_f1_pos}, ')
            logfile.write(f'Validation f1_neg = {val_f1_neg}\n')
            
            logfile.write('-' * 10 + '\n')
            logfile.write(f'End epoch {epoch + 1} out of {num_epoch}\n\n')
            logfile.flush()
            
        logfile.write('=' * 20 + 'end of model' + '=' * 20 + '\n')
        logfile.flush()
        
        '''Step 3: Test the model'''
        test_loss, test_confusion, test_acc, test_f1_pos, test_f1_neg = eval_epoch_binary(
            model, test_data_loader, 
            loss_fn, optimizer, 
            device, 
            test_mode = True
        )
        
        logfile.write(f'Test loss = {np.mean(test_loss)}, ')
        logfile.write(f'Test accuracy = {test_acc}, ')
        logfile.write(f'Test f1_pos = {test_f1_pos}, ')
        logfile.write(f'Test f1_neg = {test_f1_neg}\n')
        logfile.write(f'# True negative in test set = {test_confusion[0, 0]}\n')
        logfile.write(f'# False positive in test set = {test_confusion[0, 1]}\n')
        logfile.write(f'# False negative in test set = {test_confusion[1, 0]}\n')
        logfile.write(f'# True positive in test set = {test_confusion[1, 1]}\n\n\n')
        
        # Append information in the metric table 
        metric_table.append([model_name, tokenizer, toxic_threshold, test_confusion, test_acc, test_f1_pos, test_f1_neg])

logfile.close()
print('\r', end='')


CPU times: user 4h 57min 16s, sys: 40min 10s, total: 5h 37min 27s
Wall time: 2h 43min 28s
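
The BERT branch of the loop above passes `num_training_steps = len(train_data_loader)` to the scheduler, while `scheduler.step()` is called once per batch in each of the `num_epoch` epochs. The sketch below is a plain-Python re-implementation of a linear warmup / linear decay rule, written for illustration and assumed to mirror the library schedule; `steps_per_epoch = 100` is a hypothetical value.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Learning-rate multiplier of a linear warmup / linear decay schedule."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = 100     # hypothetical; stands in for len(train_data_loader)
num_epoch = 4

# Compare a schedule spanning one epoch with one spanning all epochs
for total in (steps_per_epoch, steps_per_epoch * num_epoch):
    factors = [linear_schedule_factor(epoch * steps_per_epoch, total) for epoch in range(num_epoch)]
    print(f"num_training_steps = {total}: factor at start of each epoch = {factors}")
```

When the schedule spans a single epoch, the multiplier is already 0 at the start of the second epoch, so the remaining epochs no longer update the weights. Spanning all epochs gives multipliers of 1.0, 0.75, 0.5 and 0.25 at the start of the four epochs instead.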


### Show final result for comparison 

In [16]:
pd.DataFrame(metric_table, columns = colNames)

| | model_name | encoder | toxic_threshold | test_confusion_matrix | test_accuracy | test_f1_pos | test_f1_neg |
|---|---|---|---|---|---|---|---|
| 0 | mlp | bow | 0.15 | [[293, 1108], [728, 10872]] | 0.85878 | 0.922137 | 0.241949 |
| 1 | mlp | bow | 0.3 | [[385, 1016], [1674, 9926]] | 0.793093 | 0.880667 | 0.222543 |
| 2 | mlp | bow | 0.45 | [[510, 891], [2072, 9528]] | 0.772094 | 0.865434 | 0.256088 |
| 3 | mlp | tfidf | 0.15 | [[1401, 0], [11600, 0]] | 0.107761 | 0.0 | 0.194556 |
| 4 | mlp | tfidf | 0.3 | [[0, 1401], [0, 11600]] | 0.892239 | 0.943051 | 0.0 |
| 5 | mlp | tfidf | 0.45 | [[1401, 0], [11600, 0]] | 0.107761 | 0.0 | 0.194556 |
| 6 | mlp | bert | 0.15 | [[56, 1345], [481, 11119]] | 0.859549 | 0.924119 | 0.057792 |
| 7 | mlp | bert | 0.3 | [[1284, 117], [9967, 1633]] | 0.224367 | 0.244644 | 0.202972 |
| 8 | mlp | bert | 0.45 | [[1316, 85], [10139, 1461]] | 0.213599 | 0.222273 | 0.204729 |
| 9 | bert | bert | 0.15 | [[269, 1132], [566, 11034]] | 0.869395 | 0.928553 | 0.240608 |
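
The accuracy and F1 columns follow directly from the confusion matrices, which makes individual rows easy to double-check. The helper below (`binary_metrics`, a hypothetical name) uses the same formulas as `eval_epoch_binary` and is applied to the confusion matrix of row 0.

```python
def binary_metrics(confusion):
    """Accuracy and per-class F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    acc = (tn + tp) / (tn + fp + fn + tp)
    # F1 = TP / (TP + (FP + FN) / 2); swap the roles of TP and TN for the negative class
    f1_pos = tp / (tp + 0.5 * fp + 0.5 * fn)
    f1_neg = tn / (tn + 0.5 * fp + 0.5 * fn)
    return acc, f1_pos, f1_neg

# Confusion matrix of row 0 (mlp, bow, toxic_threshold = 0.15)
acc, f1_pos, f1_neg = binary_metrics([[293, 1108], [728, 10872]])
print(f"{acc:.6f} {f1_pos:.6f} {f1_neg:.6f}")
```

The three values agree with the `test_accuracy`, `test_f1_pos` and `test_f1_neg` entries of that row. Rows 3 to 5 show the degenerate case where the model predicts a single class for every test example, which is why one of the two F1 scores is exactly 0 there.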


## Take-home message

xprmt_05a_another should be compared with xprmt_02 and xprmt_04 because all 3 experiments use the same mixed dataset. 

<ul>
    <li>xprmt_02 uses the TF-IDF tokenizer and simple models. xprmt_02 indicates that simple logistic regression works reasonably well while being computationally cheap. Rate of false negatives = 2377 / 11600 = 20.5%</li>
    <li>xprmt_04 uses the BERT tokenizer and BERT classifier. The BERT classifier is more prone to false positives, but it substantially reduces the rate of false negatives, to 762 / 11600 = 6.57%</li>
    <li>Part of xprmt_05 applies simple models (e.g. logistic regression) with the BERT tokenizer. This results in a very high rate of false negatives. <span style="color:red;">BUT I suspect this high rate of false negatives is caused by the wrong choice of optimizer, because I accidentally forgot to use AdamW for the BERT classifier, and the f1-score seems outrageous. The high number of false negatives may also be caused by the BERT tokenizer.</span></li>
</ul>