# Text Classification using BERT with RCT data     

This notebook builds a pipeline that classifies PubMed records as RCTs (randomized controlled trials) or other publication types, using pretrained deep learning models such as BERT, BioBERT, and SciBERT.  



Author: Jenna Kim  
Created: 2021/1/11  
Last Modified: 2021/10/3

## Update:  
- Modify load_data function to read in txt file: V2  
- Add code to remove duplicates: V2  
- Add code to sample data for data size change: V2  
- Modify code to sample data for label balance (1:1): V2

Reference:  
* https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/ 
* https://www.youtube.com/watch?v=f-86-HcYYi8   
* https://mccormickml.com/2019/07/22/BERT-fine-tuning/  


# 1. Setup

## 1-1. Install packages and load libraries

Install the transformers package from Hugging Face, which provides a PyTorch interface for working with BERT.

In [1]:
# Install transformer (ver 4.15.0)

#!pip install transformers==4.15.0

# No need to install PyTorch if this notebook is running on the AWS Sagemaker 
# with pytorch kernel('conda_pytorch_latest_p36')

#!pip install torch==1.5.0

In [2]:
# Install Imbalanced-Learn library for sampling if not already installed

#!pip install imbalanced-learn==0.8.1
#!pip install scikit-learn==1.0.2

In [3]:
# Check if the packages are correctly installed
#!pip list

Load other libraries

In [4]:
import timeit
import transformers
from transformers import BertModel, BertTokenizer, AutoModel, AutoTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from collections import defaultdict
from textwrap import wrap

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F


In [5]:
# Set up for plots and parameters

#%matplotlib inline
#%config InlineBackend.figure_format='retina'

sns.set(style='darkgrid', palette='muted', font_scale=1.5)
COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(COLORS_PALETTE))
rcParams["figure.figsize"] = (12, 6)

In [6]:
# Hide warning messages from display
import warnings
warnings.filterwarnings('ignore')

## 1-2. Check GPU for training

Note: If you use Google Colab, before running the next cell, make sure that the runtime type is set to GPU by going to Runtime => Change runtime type => GPU

In [7]:
# Check the CUDA version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


In [8]:
# Check if there's a GPU available
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU    
    device = torch.device("cuda")

    print('There are {:d} GPU(s) available.'.format(torch.cuda.device_count()))
    print('We will use the GPU: ', torch.cuda.get_device_name(0))

else:
    device = torch.device("cpu")
    
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU:  Tesla V100-SXM2-16GB


In [9]:
#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')  # memory_cached() is deprecated

Tesla V100-SXM2-16GB
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


In [10]:
# Check GPU memory and utilization
!nvidia-smi

# To check the GPU memory usage while the process is running,
# open a terminal in the directory (go to New -> Terminal) and run the command above

Tue Oct  4 23:28:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    22W / 300W |      2MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [11]:
# clear the occupied cuda memory for efficient use
import gc

gc.collect()
torch.cuda.empty_cache()

# Kill a running process if more GPU memory is needed
#!sudo kill -9 3320

# 2. Load data

## 2-1. If you load data from Google Drive directory

In [12]:
#import os
#from google.colab import drive
#drive.mount('/gdrive')
#%cd /gdrive

When running the code above, you may be asked to enter an authorization code to connect to your Google Drive folder.

In [13]:
# Access the directory where a dataset is stored
#os.listdir("/gdrive/My Drive/Colab Notebooks/LabelingProject")

In [14]:
#path = "/gdrive/My Drive/Colab Notebooks/LabelingProject"

In [15]:
# Load the dataset into a pandas dataframe
# IMDB dataset: similar to Medline data
# Required for reducing the data size due to the GPU constraints 
#df_raw = pd.read_csv(os.path.join(path, "IMDB Dataset.csv"))  

# CoLA dataset: one sentence per each instance
#df1 = pd.read_csv(os.path.join(path, "in_domain_train.tsv"), delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
#df2 = pd.read_csv(os.path.join(path, "in_domain_dev.tsv"), delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

#print(df1.shape)
#print(df2.shape)

#df_raw = pd.concat([df1, df2], ignore_index=True)

#print(df_raw.shape)

## 2-2. If you load data from AWS SageMaker or your local directory

In [16]:
def load_data_txt(filename, colname, record):
    
    """
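    Read in a tab-delimited input file and load data
    
    filename: tab-delimited txt file (no header row)
    colname: text column to use ("title", "abs", or "mix" for title + abstract)
    record: text file to save summary

    return: dataframe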
    
    """
    ## 1. Read in data from input file
    df = pd.read_csv(filename, sep="\t", encoding='utf-8', header=None, names=['pmid', 'pubtype', 'year', 'title', 'abstract'])
    
    # No of rows and columns
    print("No of Rows (Raw data): {}".format(df.shape[0]), file=record)
    print("No of Columns: {}".format(df.shape[1]), file=record)
    print("No of Rows (Raw data): {}".format(df.shape[0]))
    print("No of Columns: {}".format(df.shape[1]))
    
    ## 2. Select data needed for processing & convert labels
    df = df[['pmid', 'title', 'abstract', 'pubtype']]
    
    ## 3. Cleaning data 
    # Remove null values first: strip() fails on missing (NaN) entries
    df=df.dropna()

    #Trim unnecessary spaces for strings
    df["title"] = df["title"].apply(lambda x: x.strip())
    df["abstract"] = df["abstract"].apply(lambda x: x.strip())

    print("No of rows (After dropping null): {}".format(df.shape[0]), file=record)
    print("No of columns: {}".format(df.shape[1]), file=record)
    print("No of rows (After dropping null): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))

    # Remove duplicates and keep first occurrence
    df.drop_duplicates(subset=['pmid'], keep='first', inplace=True)

    print("No of rows (After removing duplicates): {}".format(df.shape[0]), file=record)
    print("No of rows (After removing duplicates): {}".format(df.shape[0]))
        
    ## 4. Select text columns
    if colname == "title":
        df = df[['pmid', 'title', 'pubtype']]
        df.rename({"title": "sentence", "pubtype": "label"}, axis=1, inplace=True)
    elif colname == "abs":
        df = df[['pmid', 'abstract', 'pubtype']]
        df.rename({"abstract": "sentence", "pubtype": "label"}, axis=1, inplace=True)
    elif colname == "mix":
        df['mix'] = df[['title','abstract']].apply(lambda x : '{} {}'.format(x.iloc[0],x.iloc[1]), axis=1)
        df = df[['pmid', 'mix', 'pubtype']]
        df.rename({"mix": "sentence", "pubtype": "label"}, axis=1, inplace=True)

    # Check the first few instances
    print("\n<Data View: First Few Instances>", file=record)
    print("\n", df.head(), file=record)
    print("\n<Data View: First Few Instances>")
    print("\n", df.head())
    
    # No of labels and rows 
    print('\nClass Counts(label, row): Total', file=record)
    print(df.label.value_counts(), file=record)   
    print('\nClass Counts(label, row): Total')
    print(df.label.value_counts())
     
    return df
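
The cleaning steps in `load_data_txt` can be tried on a small in-memory sample before pointing the function at a real file. The sketch below is illustrative only: the records are invented, and it assumes the same five-column, tab-delimited layout (`pmid`, `pubtype`, `year`, `title`, `abstract`) that the function reads. It drops rows with missing values, strips surrounding whitespace, and keeps the first occurrence of each PMID.

```python
import io
import pandas as pd

# Invented sample rows in the assumed layout: pmid, pubtype, year, title, abstract
sample = (
    "101\t1\t2020\t A randomized trial of drug X \tPatients were randomized.\n"
    "102\t0\t2019\tA cohort study\t\n"                                          # missing abstract
    "101\t1\t2020\tA randomized trial of drug X\tPatients were randomized.\n"  # duplicate pmid
    "103\t0\t2018\tA case report\tWe describe one patient. \n"
)

cols = ["pmid", "pubtype", "year", "title", "abstract"]
df_demo = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=cols)

# Keep the columns used downstream and drop rows with any missing value
clean = df_demo[["pmid", "title", "abstract", "pubtype"]].dropna().copy()

# Trim surrounding whitespace from the text columns
clean["title"] = clean["title"].str.strip()
clean["abstract"] = clean["abstract"].str.strip()

# Keep the first occurrence of each PMID
clean = clean.drop_duplicates(subset=["pmid"], keep="first")

print(clean)
```

Of the four input rows, the one with an empty abstract and the repeated PMID are removed, leaving two records.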

In [17]:
def load_data(filename, colname, record):
    
    """
    Read in a comma-delimited input file and load data
    
    filename: csv file
    colname: text column to use ("title", "abs", or "mix" for title + abstract)
    record: text file to save summary

    return: dataframe
    
    """
    
    df = pd.read_csv(filename, encoding='utf-8')
    
    # No of rows and columns
    print("No of Rows (Raw data): {}".format(df.shape[0]), file=record)
    print("No of Columns: {}".format(df.shape[1]), file=record)
    
    print("No of Rows (Raw data): {}".format(df.shape[0]))
    print("No of Columns: {}".format(df.shape[1]))
    
    # Select data needed for processing & convert labels
    df = df[['pmid', 'title', 'abstract', 'pubtype']]
    df.iloc[:, -1] = df.iloc[:, -1].map({'RCT':1, 'Other':0})
    
    # Remove null values 
    df=df.dropna()

    print("No of rows (After removing null): {}".format(df.shape[0]), file=record)
    print("No of columns: {}".format(df.shape[1]), file=record)
    
    print("No of rows (After removing null): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))
        
    # Select text columns
    if colname == "title":
        df = df[['pmid', 'title', 'pubtype']]
        df.rename({"title": "sentence", "pubtype": "label"}, axis=1, inplace=True)
    elif colname == "abs":
        df = df[['pmid', 'abstract', 'pubtype']]
        df.rename({"abstract": "sentence", "pubtype": "label"}, axis=1, inplace=True)
    elif colname == "mix":
        df['mix'] = df[['title','abstract']].apply(lambda x : '{} {}'.format(x.iloc[0],x.iloc[1]), axis=1)
        df = df[['pmid', 'mix', 'pubtype']]
        df.rename({"mix": "sentence", "pubtype": "label"}, axis=1, inplace=True)

    # Check the first few instances
    print("\n<Data View: First Few Instances>", file=record)
    print("\n", df.head(), file=record)
    print("\n<Data View: First Few Instances>")
    print("\n", df.head())
    
    # No of labels and rows 
    print('\nClass Counts(label, row): Total', file=record)
    print(df.label.value_counts(), file=record)
    
    print('\nClass Counts(label, row): Total')
    print(df.label.value_counts())
     
    return df

# 3. Data Processing

## 3-1. Check the distribution of token length

In [18]:
def token_distribution(df, tokenizer):
    token_lens = []
    long_tokens = []
    
    for pmid, txt in zip(df.pmid, df.sentence):
        tokens = tokenizer.encode(txt, padding=True, truncation=True, max_length=512)
        token_lens.append(len(tokens))
    
        # Check a sentence with extreme length
        if len(tokens) > 150:
            long_tokens.append((pmid, len(tokens)))   
  
    print("\nLong Sentences: ")

    if len(long_tokens) > 0:
        print(long_tokens)
    else:
        print("There is no long sentence")
    
    print("\nMin token:", min(token_lens))
    print("Max token:", max(token_lens))
    print("Avg token:", round(sum(token_lens)/len(token_lens)))
    
    # plot the distribution
    #sns.displot(token_lens)
    #plt.xlim([0, max(token_lens)+10])
    #plt.xlabel("Token Count")

## 3-2. Create a PyTorch dataset

In [19]:
class LabelDataset(Dataset):
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, item):
        review = str(self.reviews[item])
        review = " ".join(review.split())
        target = self.targets[item]

        encoding = self.tokenizer.encode_plus(
            review,
            None,                    # second parameter is needed for a task of sentence similarity
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_token_type_ids=True,  
            return_attention_mask=True,
            return_tensors='pt')

        return {
            'text': review,
            'input_ids': encoding['input_ids'].flatten(),            # flatten() reduce dimension: e.g., [1, 512] -> [512]
            'token_type_ids': encoding['token_type_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
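
What `padding='max_length'` and `truncation=True` do to each text can be illustrated without the tokenizer. The sketch below is a simplification under stated assumptions: it works on token ID lists that are already encoded, assumes a padding ID of 0 (check `tokenizer.pad_token_id` for the model in use), and ignores the special-token handling that the real tokenizer performs when truncating. `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(input_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = list(input_ids[:max_len])          # truncate if too long
    mask = [1] * len(ids)                    # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad                  # pad if too short
    mask += [0] * n_pad                      # 0 marks padding
    return ids, mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)

print(short_ids, short_mask)   # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)     # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Because every item has the same length, the default `DataLoader` collation can stack the items into a batch tensor without a custom collate function.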

## 3-3. Sampling

In [20]:
def sample_data(X_train, y_train, record, sampling=0, sample_method='over'):
    """
       Sampling input train data
       
       X_train: dataframe of X train data
       y_train: dataframe of y train data
       record: text file to save summary
       sampling: indicator of whether the sampling function is on or off
       sample_method: method of sampling ('over' for oversampling or 'under' for undersampling)
       
       return: sampled X and y train data
       
    """
    
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    
    if sampling:
        # select a sampling method
        if sample_method == 'over':
            oversample = RandomOverSampler(random_state=42)
            X_over, y_over = oversample.fit_resample(X_train, y_train)
            print('\n****** Data Sampling ******', file=record)
            print('\n****** Data Sampling ******')
            print('\nOversampled Data (class, Rows):\n{}'.format(y_over.value_counts()), file=record)
            print('\nOversampled Data (class, Rows):\n{}'.format(y_over.value_counts()))
            X_train_sam, y_train_sam = X_over, y_over
            
        elif sample_method == 'under':
            undersample = RandomUnderSampler(random_state=42)
            X_under, y_under = undersample.fit_resample(X_train, y_train)
            print('\n****** Data Sampling ******', file=record)
            print('\n****** Data Sampling ******')
            print('\nUndersampled Data (class,Rows):\n{}'.format(y_under.value_counts()), file=record)
            print('\nUndersampled Data (class,Rows):\n{}'.format(y_under.value_counts()))
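            X_train_sam, y_train_sam = X_under, y_under

        else:
            # Fail early instead of hitting an undefined variable at the return
            raise ValueError("sample_method must be 'over' or 'under', got {!r}".format(sample_method))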
    else:
        X_train_sam, y_train_sam = X_train, y_train 
        print('\n****** Data Sampling ******', file=record)
        print('\n****** Data Sampling ******')
        print('\nNo Sampling Performed\n', file=record)
        print('\nNo Sampling Performed\n')
    
    return X_train_sam, y_train_sam
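
Random oversampling, used above through `RandomOverSampler`, balances the classes by duplicating randomly chosen rows of the smaller class. The plain-Python sketch below illustrates the idea only; it is not imbalanced-learn's implementation, the data are invented, and `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


rows = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 1]                      # invented 2:1 imbalance
new_rows, new_labels = random_oversample(rows, labels)

print(Counter(new_labels))
print(new_rows)
```

Random undersampling works the other way round: it discards rows of the larger class until the counts match.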

## 3-4. Create a data loader & classifier

In [21]:
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = LabelDataset(
        reviews = df.sentence.to_numpy(),
        targets = df.label.to_numpy(),
        tokenizer = tokenizer,
        max_len = max_len
    )
    
    return DataLoader(
        ds,
        batch_size = batch_size,
        num_workers = 1)

In [22]:
class LabelClassifier(nn.Module):
    
    def __init__(self, n_classes, pretrained_model):
        super(LabelClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained(pretrained_model)
        self.dropout = nn.Dropout(p=0.3)
        self.linear = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        bert_out = self.bert(
            input_ids = input_ids,
            attention_mask = attention_mask,
            token_type_ids = token_type_ids)
        output_dropout = self.dropout(bert_out.pooler_output)
        output = self.linear(output_dropout)
    
        return output

# 4. Training

## 4-1. Hyperparameter setting

The BERT authors' recommendations for fine-tuning:  
* Batch size: 16, 32  
* Learning rate (Adam): 5e-5, 3e-5, 2e-5  
* Number of epochs: 2, 3, 4
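
These values interact with the learning-rate schedule. The notebook imports `get_linear_schedule_with_warmup`, which ramps the learning rate up linearly during warmup and then decays it linearly to zero. The sketch below reproduces that shape in plain Python for illustration; the dataset size and warmup length are made-up numbers, and `linear_warmup_factor` is a hypothetical helper rather than the library function.

```python
import math


def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Assumed settings, picked from the recommended ranges
n_train, batch_size, epochs, base_lr = 8000, 32, 3, 2e-5

steps_per_epoch = math.ceil(n_train / batch_size)   # 250 batches per epoch
total_steps = steps_per_epoch * epochs               # 750 optimizer steps
warmup_steps = 75                                    # assumed: 10% of total steps

for step in (0, 75, 375, 750):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print("step {:>3}: lr = {:.2e}".format(step, lr))
```

The total step count depends on batch size and epochs, so changing either one also changes how quickly the learning rate decays.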

In [23]:
def train_model(
    model,
    data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples,
    outfile):
    
    model = model.train()
    
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device, dtype=torch.long)
        attention_mask = d["attention_mask"].to(device, dtype=torch.long)
        token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
        targets = d["targets"].to(device)

        outputs = model(
            input_ids = input_ids,
            attention_mask = attention_mask,
            token_type_ids=token_type_ids
        )

        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)

        # printout for checking the prediction & target
        #print("Pred: ", preds)
        #print("Target: ", targets)

        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
    print("Correct Prediction (Train): {} out of {}".format(correct_predictions.int(), n_examples), file=outfile)
    print("Correct Prediction (Train): {} out of {}".format(correct_predictions.int(), n_examples))

    return correct_predictions.double() / n_examples, np.mean(losses)
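
The `clip_grad_norm_` call in the loop rescales all gradients together when their combined L2 norm exceeds `max_norm`. A plain-Python sketch of that rule, for illustration only: the gradient values are invented, and `clip_by_global_norm` is a hypothetical helper, not PyTorch's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]                                    # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print("norm before:", norm_before)
print("norm after: ", round(norm_after, 4))
```

The direction of the gradient is preserved; only its length is capped, which limits the size of any single update.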

In [24]:
def eval_model(
    model, 
    data_loader, 
    loss_fn, 
    device, 
    n_examples,
    outfile):
    
    model = model.eval()

    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device, dtype=torch.long)
            attention_mask = d["attention_mask"].to(device, dtype=torch.long)
            token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
            targets = d["targets"].to(device)

            outputs = model(
                input_ids = input_ids,
                attention_mask = attention_mask,
                token_type_ids = token_type_ids
                )
            
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)

            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    
    print("Correct Prediction (Eval): {} out of {}".format(correct_predictions.int(), n_examples), file=outfile)
    print("Correct Prediction (Eval): {} out of {}".format(correct_predictions.int(), n_examples))
    
    return correct_predictions.double()/n_examples, np.mean(losses)

In [25]:
def plot_train_history(history):
    plt.plot(history["train_acc"], 'b-o', label="train accuracy")
    plt.plot(history["val_acc"], 'r-o', label="validation accuracy")

    plt.title("Training History")
    plt.ylabel("Accuracy")
    plt.xlabel("Epoch")
    plt.legend()
    plt.xticks(history["epoch"])
    plt.yticks(np.arange(0,1.2,step=0.05))
    plt.ylim([0,1.05])

In [26]:
def training_loop(epochs, 
                  modelname, 
                  model, 
                  train_data_loader, 
                  val_data_loader, 
                  loss_fn, 
                  optimizer, 
                  device, 
                  scheduler, 
                  n_train, 
                  n_val,
                  model_file,
                  record):
    
    print("\n**** Model Name: " + modelname + " *****", file=record)
    print("\n**** Model Name: " + modelname + " *****")
    
    history = defaultdict(list)
    best_accuracy = 0

    for epoch in range(epochs):
        print("\nEpoch {} / {}".format(str(epoch + 1), str(epochs)), file=record)
        print("-" * 60, file=record)
    
        print("\nEpoch {} / {}".format(str(epoch + 1), str(epochs)))
        print("-" * 60)
    
        train_acc, train_loss = train_model(
            model, 
            train_data_loader,
            loss_fn,
            optimizer,
            device,
            scheduler,
            n_train,
            outfile=record)
    
        print("Train Loss: {}, Accuracy: {}\n".format(train_loss, train_acc), file=record)
        print("Train Loss: {}, Accuracy: {}\n".format(train_loss, train_acc))
    
        val_acc, val_loss = eval_model(
            model,
            val_data_loader,
            loss_fn,
            device,
            n_val,
            outfile=record)
    
        print("Validation Loss: {}, Accuracy: {}".format(val_loss, val_acc), file=record)  
        print("Validation Loss: {}, Accuracy: {}".format(val_loss, val_acc))

        # Record the metrics for this epoch
        history["epoch"].append(epoch)
        history["train_acc"].append(train_acc)
        history["train_loss"].append(train_loss)
        history["val_acc"].append(val_acc)
        history["val_loss"].append(val_loss)

        # Store the state of the best model, judged by the highest validation accuracy
        if val_acc > best_accuracy:
            if model_file:
                torch.save(model.state_dict(), model_file)
            best_accuracy = val_acc
    
    # Plot training & validation accuracy
    #plot_train_history(history)

# 5. Predictions & Evaluation

In [27]:
def get_predictions(model, data_loader):
    
    model = model.eval()
    
    review_texts = []
    predictions = []
    prediction_probs = []
    real_values = []
    
    with torch.no_grad():
        for d in data_loader:
            texts = d["text"]
            input_ids = d["input_ids"].to(device, dtype=torch.long)
            attention_mask = d["attention_mask"].to(device, dtype=torch.long)
            token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
            targets = d["targets"].to(device)
            
            outputs = model(
                input_ids = input_ids,
                attention_mask = attention_mask,
                token_type_ids = token_type_ids
            )
            
            _, preds = torch.max(outputs, dim=1)

            # Apply the softmax or sigmoid function to normalize the raw output (logits) to get the probability for each class
            probs = F.softmax(outputs, dim=1)
            
            review_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)

    # move the data to cpu
    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()
    real_values = torch.stack(real_values).cpu()

    return review_texts, predictions, prediction_probs, real_values
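
The conversion from logits to class probabilities can be checked by hand. A minimal plain-Python sketch of the softmax step, with invented logits for a two-class case; `softmax` here is a local helper, not `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]                             # invented scores for classes 0 and 1
probs = softmax(logits)

# Index of the largest value, the same choice torch.max(..., dim=1) makes per row
pred = max(range(len(probs)), key=probs.__getitem__)

print("probabilities:", [round(p, 4) for p in probs])
print("predicted class:", pred)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits and of the probabilities gives the same predicted class.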

In [28]:
def evaluate_model(y_test, y_pred, record, eval_model=0):
    """
      Evaluate model performance
      
      y_test: y test data (true labels)
      y_pred: predicted y values
      record: text file to save summary
      eval_model: indicator of whether this function is on or off
      
    """
    
    if eval_model:
        
        print('\n************** Model Evaluation **************', file=record)
        print('\n************** Model Evaluation **************')
        
        print('\nConfusion Matrix:\n', file=record)
        print('\nConfusion Matrix:\n')
        print(confusion_matrix(y_test, y_pred), file=record)
        print(confusion_matrix(y_test, y_pred))
        
        print('\nClassification Report:\n', file=record)
        print('\nClassification Report:\n')
        print(classification_report(y_test, y_pred, digits=4), file=record)
        print(classification_report(y_test, y_pred, digits=4)) 
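
The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes precision, recall, and F1 for the positive class in plain Python, using scikit-learn's layout for binary labels (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`). The labels are invented and `binary_report` is a hypothetical helper.

```python
def binary_report(y_true, y_pred):
    """Return the 2x2 confusion matrix and precision/recall/F1 for class 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return [[tn, fp], [fn, tp]], precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 1]                # invented labels (1 = RCT)
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

matrix, precision, recall, f1 = binary_report(y_true, y_pred)
print(matrix)                                    # [[3, 1], [1, 3]]
print(precision, recall, f1)
```

Precision answers "of the records predicted as RCT, how many are RCTs", while recall answers "of the true RCTs, how many were found".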

In [29]:
def predict_proba(df_test, y_text, y_test, y_pred, y_pred_probs, proba_file, proba_out=0):
    
    """
       Predict probability of each class
       
       df_test: original X test data
       y_text: text data sentence
       y_test: original y test data
       y_pred: predicted y values
       y_pred_probs: probability scores of prediction
       proba_file: output file of probability scores
       proba_out: indicator of whether the probability output is written
       
    """
    if proba_out:
        df_result = pd.DataFrame({
            'pmid': df_test["pmid"],
            'text': y_text,
            'act': y_test,
            'pred': y_pred,
            'proba_0': y_pred_probs[:, 0],
            'proba_1': y_pred_probs[:, 1]
        })
        
        ## Save output
        df_result.to_csv(proba_file, encoding='utf-8', header=True, index=False)

# 6. Main Function
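
The main function below holds out 20% of the data and then halves that hold-out into validation and test sets, giving an 80/10/10 split that preserves class proportions (`stratify`). The sketch below illustrates the idea in plain Python; it is not scikit-learn's algorithm, the labels are invented, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed=42):
    """Split indices into (train, test) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


labels = [1] * 200 + [0] * 800                   # invented 1:4 class ratio

# First split: 80% train, 20% hold-out
train_idx, hold_idx = stratified_split(labels, test_size=0.2)

# Second split: halve the hold-out (results are positions within hold_idx)
hold_labels = [labels[i] for i in hold_idx]
val_pos, test_pos = stratified_split(hold_labels, test_size=0.5)

print(len(train_idx), len(val_pos), len(test_pos))   # 800 100 100
print(Counter(labels[i] for i in train_idx))
```

Each of the three subsets keeps the 1:4 ratio of the full data, which matters when one class is much rarer than the other.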

In [30]:
def main(input_file, 
         colname, 
         sample_on, 
         sample_type, 
         tokenizer,
         max_len, 
         batch_size,
         modelname,
         n_class,
         device,
         pretrained_model,
         learning_rate,
         epochs,
         model_file, 
         eval_on, 
         proba_on,
         proba_file,
         result_file,
         datasize_change,
         sample_balance,
         balance_sampling_on,                                   
         balance_sampling_type,
         sample_ratio,
         ratio):
    
    """
       Main function for processing data, model training, and prediction
       
       input_file: input file
       colname: column name for selection among title, abstract, or both ("title", "abs", "mix")
       sample_on: indicator of sampling on or off
       sample_type: sample type to choose if sample_on is 1
       modelname: name of the model to be applied for model fitting
       eval_on: indicator of model evaluation on or off
       proba_file: name of output file of probability
       result_file: name of output file of evaluation
       ratio: proportion of data size
       
    """
    
    ## 0. open result file for records
    f=open(result_file, "a")
    
    # Check processing time
    proc_start_time = timeit.default_timer()
    
    ## 1. Load data
    
    print("\n************** Loading Data **************\n", file=f)
    print("\n************** Loading Data **************\n")
    #df = load_data(input_file, colname, record=f)        # use for comma-delimited csv file
    df = load_data_txt(input_file, colname, record=f)     # use for tab-delimited txt file
    
    # testing (use iloc: label 0 may be missing after dropping nulls and duplicates)
    print("\nFirst Sentence: ", df.sentence.iloc[0], file=f)
    print("\nFirst Sentence: ", df.sentence.iloc[0])

    ## 2. Train and test split
    
    print("\n************** Splitting Data **************\n", file=f)
    print("\n************** Splitting Data **************\n")
    
    df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df.label)
    df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=42, stratify=df_test.label)
    
    #for testing only: small size data
    #df_train, df_test = train_test_split(df, test_size=0.99, random_state=42, stratify=df.label)
    #df_val, df_test = train_test_split(df_test, test_size=0.99, random_state=42, stratify=df_test.label)
    #df_notuse, df_test = train_test_split(df_test, test_size=0.01, random_state=42, stratify=df_test.label)
    
    print("Train Data: {}".format(df_train.shape), file=f)
    print("Val Data: {}".format(df_val.shape), file=f)
    print("Test Data: {}".format(df_test.shape), file=f)
    
    print("Train Data: {}".format(df_train.shape))
    print("Val Data: {}".format(df_val.shape))
    print("Test Data: {}".format(df_test.shape))
    
    print('\nClass Counts(label, row): Train', file=f)
    print(df_train.label.value_counts(), file=f)
    print('\nClass Counts(label, row): Val', file=f)
    print(df_val.label.value_counts(), file=f)
    print('\nClass Counts(label, row): Test', file=f)
    print(df_test.label.value_counts(), file=f)
    
    print('\nClass Counts(label, row): Train')
    print(df_train.label.value_counts())
    print('\nClass Counts(label, row): Val')
    print(df_val.label.value_counts())
    print('\nClass Counts(label, row): Test')
    print(df_test.label.value_counts())
    
    print("\nTest Data", file=f)
    print(df_test.head(), file=f)
    print("\nTest Data")
    print(df_test.head())
    
    ## 3. Data size change
    
    if datasize_change:
        
        # Reduce sample size: 500,000 instances -> 100,000 instances
        df_train, _ = train_test_split(df_train, train_size=0.2, random_state=42, stratify=df_train.label)
        df_val, _ = train_test_split(df_val, train_size=0.2, random_state=42, stratify=df_val.label)
        df_test, _ = train_test_split(df_test, train_size=0.2, random_state=42, stratify=df_test.label)
        
        print("\n************** Data Size Change: Reducing Data **************\n", file=f)
        print("\n************** Data Size Change: Reducing Data **************\n")
        print("Train Data: {}".format(df_train.shape), file=f)
        print("Val Data: {}".format(df_val.shape), file=f)
        print("Test Data: {}".format(df_test.shape), file=f)
        print("Train Data: {}".format(df_train.shape))
        print("Val Data: {}".format(df_val.shape))
        print("Test Data: {}".format(df_test.shape))
        
        print('\nClass Counts(label, row): Train', file=f)
        print(df_train.label.value_counts(), file=f)
        print('\nClass Counts(label, row): Val', file=f)
        print(df_val.label.value_counts(), file=f)
        print('\nClass Counts(label, row): Test', file=f)
        print(df_test.label.value_counts(), file=f)
        print("\n", file=f)
    
        print('\nClass Counts(label, row): Train')
        print(df_train.label.value_counts())
        print('\nClass Counts(label, row): Val')
        print(df_val.label.value_counts())
        print('\nClass Counts(label, row): Test')
        print(df_test.label.value_counts())
        
        print("\n<Train Data>", file=f)
        print(df_train.head(), file=f)
        print("\n<Train Data>")
        print(df_train.head())
    
        print("\nTest Data", file=f)
        print(df_test.head(), file=f)
        print("\nTest Data")
        print(df_test.head())
        
        # Sample data with balance (1:1)
        if sample_balance:
            
            print("\n************** Data Balancing: Label Class (1:1) *************\n", file=f)
            print("\n************** Data Balancing: Label Class (1:1) *************\n")
            
            # split into X and y
            X_train, y_train = df_train.iloc[:, :-1], df_train.iloc[:, -1]
            
            # sampling
            X_train, y_train = sample_data(X_train, y_train, record=f, 
                                           sampling=balance_sampling_on, 
                                           sample_method=balance_sampling_type)
            
            
            print('\nClass Counts(label, row): After balancing', file=f)
            print(y_train.value_counts(), file=f)
            print('\nClass Counts(label, row): After balancing')
            print(y_train.value_counts())
            print("\n<Balanced Train Data>", file=f)
            print(X_train.head(), file=f)
            print("\n<Balanced Train Data>")
            print(X_train.head()) 
            
            # merge into one dataframe
            df_train = pd.concat([X_train, y_train], axis=1)
            
        # Sample data based on size ratio    
        if sample_ratio:
            if ratio == 1:
                df_train = df_train         
            else:              
                df_train, _ = train_test_split(df_train, train_size=ratio, random_state=42, stratify=df_train.label)
                
            print("\n************** Data Size Change: Ratio *************\n", file=f)
            print("Data Ratio: {}".format(ratio), file=f)
            print("\n************** Data Size Change: Ratio *************\n")
            print("Data Ratio: {}".format(ratio))
            
            print('\nClass Counts(label, row): After sampling', file=f)
            print(df_train.label.value_counts(), file=f)
            print('\nClass Counts(label, row): After sampling')
            print(df_train.label.value_counts())
            print("\n<Train Data Based on Ratio>", file=f)
            print(df_train.head(), file=f)
            print("\n<Train Data Based on Ratio>")
            print(df_train.head()) 
    
    # Reset index
    df_train=df_train.reset_index(drop=True)
    df_val=df_val.reset_index(drop=True)
    df_test=df_test.reset_index(drop=True)
    
    print("\n************** Processing Data **************", file=f)
    print("\n************** Processing Data **************")
    print("Train Data: {}".format(df_train.shape), file=f)
    print("Val Data: {}".format(df_val.shape), file=f)
    print("Test Data: {}".format(df_test.shape), file=f)
    
    print("Train Data: {}".format(df_train.shape))
    print("Val Data: {}".format(df_val.shape))
    print("Test Data: {}".format(df_test.shape))
    
    print('\nClass Counts(label, row): Train', file=f)
    print(df_train.label.value_counts(), file=f)
    print('\nClass Counts(label, row): Val', file=f)
    print(df_val.label.value_counts(), file=f)
    print('\nClass Counts(label, row): Test', file=f)
    print(df_test.label.value_counts(), file=f)
    print("\n", file=f)
    
    print('\nClass Counts(label, row): Train')
    print(df_train.label.value_counts())
    print('\nClass Counts(label, row): Val')
    print(df_val.label.value_counts())
    print('\nClass Counts(label, row): Test')
    print(df_test.label.value_counts())
    
    print("\nTest Data", file=f)
    print(df_test.head(), file=f)
    print("\nTest Data")
    print(df_test.head())
    
    ## 4. Sampling
    if sample_on:
        X_train = df_train.iloc[:, :-1]
        y_train = df_train.iloc[:, -1]
    
        # Sampling
        X_train_samp, y_train_samp = sample_data(X_train, y_train, sampling=sample_on, sample_method=sample_type)
    
        print(y_train_samp.value_counts(), file=f)

        # Combine x_train and y_train data
        df_train_concat = pd.concat([X_train_samp, y_train_samp], axis=1)

        df_train_concat.info()                      # info() prints its summary directly and returns None
        print(df_train_concat.head())
    
        # replace train data with sampled data
        df_train = df_train_concat
        print(df_train.shape)
    
    ## 5. Load data
    train_data_loader = create_data_loader(df_train, tokenizer, max_len, batch_size)
    val_data_loader = create_data_loader(df_val, tokenizer, max_len, batch_size)
    test_data_loader = create_data_loader(df_test, tokenizer, max_len, batch_size)

    ## 6. Model Training
    print("\n************** Training Model: " + modelname + " **************", file=f)
    print("\n************** Training Model: " + modelname + " **************")
    
    # Check training time
    start_time = timeit.default_timer()
    
    n_train = len(df_train)    
    n_val = len(df_val)
    
    # Create a classifier instance and move it to GPU
    model = LabelClassifier(n_class, pretrained_model)
    model = model.to(device)   
    
    # Optimizer, scheduler, loss function
    optimizer = AdamW(model.parameters(), lr=learning_rate, correct_bias=False)
    total_steps = len(train_data_loader) * epochs

    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps = 0,
        num_training_steps = total_steps)

    loss_fn = nn.CrossEntropyLoss().to(device)
    
    # Loop training with epochs
    training_loop(epochs, modelname, model, 
                train_data_loader, val_data_loader, 
                loss_fn, optimizer, device, scheduler, 
                n_train, n_val, model_file=None, record=f)
    
    elapsed = timeit.default_timer() - start_time
    print("\nTraining Time ({} epochs): {} sec".format(epochs, round(elapsed, 2)), file=f)
    print("\nTraining Time ({} epochs): {} sec".format(epochs, round(elapsed, 2)))
    
    ## 7. Prediction   
    print("\n\n************** Getting predictions **************", file=f)
    print("\n\n************** Getting predictions **************")
    y_text, y_pred, y_pred_probs, y_test = get_predictions(model, test_data_loader)  
    
    ## 8. Evaluating model performance      
    print("\n************** Evaluating performance **************", file=f)
    print("\n************** Evaluating performance **************")
    evaluate_model(y_test, y_pred, record=f, eval_model=eval_on)
    
    ## 9. Probability prediction    
    predict_proba(df_test, y_text, y_test, y_pred, y_pred_probs, proba_file=proba_file, proba_out=proba_on)
    
    print("\nOutput file: '" + result_file + "' Created", file=f)
    print("\nOutput file: '" + result_file + "' Created")
    
    proc_elapsed = timeit.default_timer() - proc_start_time
    print("\nTotal Processing Time: {}min\n".format(round(proc_elapsed/60)), file=f)
    print("\nTotal Processing Time: {}min\n".format(round(proc_elapsed/60)))
    
    f.close()
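The scheduler in step 6 is created with `num_warmup_steps = 0` and `total_steps = len(train_data_loader) * epochs`, so the learning rate should fall linearly from `LEARNING_RATE` to zero over the whole run. The following is a minimal pure-Python sketch of that shape, not the transformers implementation. `linear_schedule_lr` is a hypothetical helper, and the step count assumes that `create_data_loader` keeps the final partial batch.

```python
import math


def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Numbers from this notebook: 31,996 train rows, BATCH_SIZE = 16, EPOCHS = 4.
# Assumption: the last partial batch is kept (drop_last=False).
batches_per_epoch = math.ceil(31996 / 16)
total_steps = batches_per_epoch * 4

# Print the learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

With no warmup, the first optimizer step already uses the full learning rate. A nonzero `warmup_steps` would ramp it up from zero first.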

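The 1:1 balancing step in `main` delegates to `sample_data` with `sample_method='under'`. As an illustration of what random undersampling does, here is a small sketch that uses only the standard library. It is a stand-in for explanation, not the notebook's `sample_data`, and the `undersample` helper and toy data are hypothetical.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=42):
    """Randomly drop rows so each class keeps as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group rows by their label.
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)

    # Keep the same number of rows from every class.
    n_keep = min(len(group) for group in by_label.values())
    pairs = []
    for label, group in by_label.items():
        for row in rng.sample(group, n_keep):
            pairs.append((row, label))

    # Shuffle so the classes are interleaved instead of grouped together.
    rng.shuffle(pairs)
    return [row for row, _ in pairs], [label for _, label in pairs]


# Toy data with a 4:1 imbalance, similar to the class counts in the log.
texts = ["abstract {}".format(i) for i in range(10)]
labels = [0] * 8 + [1] * 2

bal_texts, bal_labels = undersample(texts, labels)
print(Counter(bal_labels))
```

Undersampling discards majority-class rows, so the balanced train set is at most twice the size of the minority class.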

# 7. Run Code for Implementation

In [31]:
#%%time

if __name__ == "__main__":
    
    ###### 1. Set Parameter Values ######

    #### 1-1. Input file name & which column
    input_filename="output_rct.txt"    
    column_name = "abs"                                        # 'title' for title text; 'abs' for abstract; 'mix' for title + abstract
    
    #### 1-2. Data size change?
    datachange_on=1                                            # 0 for no change; 1 for change of data size
    
    ## class balance (1:1)?
    balance_on=1                                               # 0 for no balance; 1 for class balance (1:1)
    balance_sample_on=1                                        # 0 for no sampling; 1 for sampling
    balance_sample_type='under'                                # 'over'(oversampling); 'under'(undersampling)
    balance_str = 'balance' + str(balance_on) + '_'
    
    ## data size change by ratio?
    ratio_on=1                                                 # 0 for no ratio sampling; 1 to sample the train data by ratio
    ratio_list=[0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 
                0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]  # basic ratio for data size
    #ratio_list=[1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,20000,30000,40000,50000,60000,70000] # actual sample size 
    #ratio_list=[101]
    
    #### 1-3. Sampling applied?
    sampling_on=0                                              # 0 for no sampling; 1 for sampling
    sampling_type='under'                                      # Use when sampling_on=1; 'over'(oversampling), 'under'(undersampling)
    
    #### 1-4. Which BERT model to use?
    #pretrained_model_name = 'bert-base-cased'
    #pretrained_model_name = 'dmis-lab/biobert-base-cased-v1.1'
    pretrained_model_name = 'allenai/scibert_scivocab_cased'
    
    # load pretrained tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)  
    modelname_string = pretrained_model_name.split("/")[-1] 

    #### 1-5. Binary or multi classification?
    num_class = 2                                              # number of label class
    
    #### 1-6. Check token distribution for MAX_LEN value: comment out if not needed
    #print("\n************** Token Distribution **************")
    #df_token = load_data(input_filename, column_name, record=None)
    #token_distribution(df_token, tokenizer)

    #### 1-7. Hyperparameters for BERT  
    MAX_LEN = 512                                              # 150 for title; 512 for abs (Maximum input size: 512 (BERT))
    BATCH_SIZE = 16                                            # Batch size: 16 or 32
    EPOCHS = 4                                                 # Number of epochs: 2,3,4
    LEARNING_RATE = 2e-5                                       # Learning rate:5e-5, 3e-5, 2e-5

    #### 1-8. Evaluation & probability files
    eval_on=1                                                  # 0 for no; 1 for yes (display confusion matrix/classification report)
    proba_on=0                                                 # 0 for no; 1 for yes (probability output) 
    
        
    ###### 2. Run Main Function ######

    if datachange_on:                  
        for ratio in ratio_list:           
            if sampling_on:
                proba_file = "result_bert_" + balance_str + str(ratio) + "_" + modelname_string + "_" + sampling_type + "_" + column_name + ".csv"  
                eval_file = "eval_bert_" + balance_str + str(ratio) + "_" + modelname_string + "_" + sampling_type + "_" + column_name + ".txt"
                model_state_file = "best_model_state_" + balance_str + str(ratio) + "_" + modelname_string + "_" + sampling_type + "_" + column_name + ".bin"
            else:
                proba_file = "result_bert_" + balance_str + str(ratio) + "_" + modelname_string + "_" + column_name + ".csv"  
                eval_file = "eval_bert_ratio_balance/eval_bert_" + balance_str + str(ratio) + "_" + modelname_string + "_" + column_name + ".txt"
                model_state_file = "best_model_state_" + balance_str + str(ratio) + "_" + modelname_string + "_" + column_name + ".bin"
        
            main(input_file=input_filename, 
                 colname=column_name,
                 sample_on=sampling_on, 
                 sample_type=sampling_type,
                 tokenizer=tokenizer,
                 max_len=MAX_LEN, 
                 batch_size=BATCH_SIZE,
                 modelname=modelname_string,
                 n_class=num_class,
                 device=device,
                 pretrained_model=pretrained_model_name,
                 learning_rate=LEARNING_RATE,
                 epochs=EPOCHS,
                 model_file=model_state_file, 
                 eval_on=eval_on, 
                 proba_file=proba_file,
                 proba_on=proba_on,
                 result_file=eval_file,
                 datasize_change=datachange_on,
                 sample_ratio=ratio_on,
                 sample_balance=balance_on,
                 balance_sampling_on=balance_sample_on,                                      
                 balance_sampling_type=balance_sample_type,
                 ratio=ratio)
    else:
        if sampling_on:
            proba_file = "result_bert_all_" + modelname_string + "_" + sampling_type + "_" + column_name + ".csv"  
            eval_file = "eval_bert_all_" + modelname_string + "_" + sampling_type + "_" + column_name + ".txt"
            model_state_file = "best_model_state_" + modelname_string + "_" + sampling_type + "_" + column_name + ".bin"
        else:
            proba_file = "result_bert_all_" + modelname_string + "_" + column_name + ".csv"  
            eval_file = "eval_bert_all_" + modelname_string + "_" + column_name + ".txt" 
            model_state_file = "best_model_state_" + modelname_string + "_" + column_name + ".bin"
            
        main(input_file=input_filename, 
             colname=column_name,
             sample_on=sampling_on, 
             sample_type=sampling_type,
             tokenizer=tokenizer,
             max_len=MAX_LEN, 
             batch_size=BATCH_SIZE,
             modelname=modelname_string,
             n_class=num_class,
             device=device,
             pretrained_model=pretrained_model_name,
             learning_rate=LEARNING_RATE,
             epochs=EPOCHS,
             model_file=model_state_file, 
             eval_on=eval_on, 
             proba_file=proba_file,
             proba_on=proba_on,
             result_file=eval_file,
             datasize_change=datachange_on,
             sample_ratio=ratio_on,
             sample_balance=balance_on,
             balance_sampling_on=balance_sample_on,                                      
             balance_sampling_type=balance_sample_type,
             ratio=0.1)
        
    print("\n************** Processing Completed **************\n")
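The cell above builds the three output file names by string concatenation in four separate branches. One possible way to keep them consistent is a single helper, sketched below. `build_output_names` is hypothetical, and it uses one uniform stem for all three files, so its names differ slightly from the ones built above (for example, it adds no `eval_bert_ratio_balance/` directory prefix).

```python
def build_output_names(model, column, tag="all", sampling=None):
    """Return (proba_file, eval_file, model_state_file) for one run.

    tag      -- identifies the run, e.g. "balance1_0.1" or "all"
    sampling -- "over", "under", or None when no extra sampling is applied
    """
    parts = [tag, model]
    if sampling:
        parts.append(sampling)
    parts.append(column)
    stem = "_".join(parts)
    return (
        "result_bert_{}.csv".format(stem),
        "eval_bert_{}.txt".format(stem),
        "best_model_state_{}.bin".format(stem),
    )


# Example: names for a few ratios with class balancing switched on.
for ratio in (0.1, 0.5, 1):
    tag = "balance1_{}".format(ratio)
    print(build_output_names("scibert_scivocab_cased", "abs", tag=tag))
```

Passing the run tag explicitly keeps the ratio loop and the single-run case on the same code path.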


************** Loading Data **************

No of Rows (Raw data): 500068
No of Columns: 5
No of rows (After dropping null): 500068
No of columns: 4
No of rows (After removing duplicates): 499963

<Data View: First Few Instances>

        pmid                                           sentence  label
0  18439781  In the United States, an increasing number of ...      0
1  18468833  The American Heart Association website defines...      0
2  18481181  The complex pathophysiology of traumatic brain...      0
3  18728056  [BACKGROUND] Soluble CD40 ligand (sCD40L) is a...      1
4  18790590  [BACKGROUND] Internal carotid artery dissectio...      0

Class Counts(label, row): Total
0    399977
1     99986
Name: label, dtype: int64

First Sentence:  In the United States, an increasing number of law enforcement agencies have employed the use of TASER® (TASER International Inc., Scottsdale, AZ) devices to temporarily immobilize violent subjects. There are reports in the lay press of adverse ou

Some weights of the model checkpoint at allenai/scibert_scivocab_cased were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



**** Model Name: scibert_scivocab_cased *****

Epoch 1 / 4
------------------------------------------------------------
Correct Prediction (Train): 31914 out of 31996
Train Loss: 0.022651843547821043, Accuracy: 0.997437179647456

Correct Prediction (Eval): 2000 out of 9999
Validation Loss: 9.38027992477417, Accuracy: 0.20002000200020004

Epoch 2 / 4
------------------------------------------------------------
Correct Prediction (Train): 31850 out of 31996
Train Loss: 0.039395029794424774, Accuracy: 0.9954369296162021

Correct Prediction (Eval): 2000 out of 9999
Validation Loss: 8.904176417541503, Accuracy: 0.20002000200020004

Epoch 3 / 4
------------------------------------------------------------
Correct Prediction (Train): 31789 out of 31996
Train Loss: 0.05321755945558289, Accuracy: 0.9935304413051632

Correct Prediction (Eval): 2000 out of 9999
Validation Loss: 8.425416313934326, Accuracy: 0.20002000200020004

Epoch 4 / 4
----------------------------------------------------------

# Reference

## Load the fine-tuned model for prediction only

In [32]:
# To load the fine-tuned model later for prediction, uncomment the following code
#n_class = 2
#model = LabelClassifier(n_class, pretrained_model_name)
#model.load_state_dict(torch.load('best_model_state.bin'))
#model = model.to(device)

In [33]:
# If you are running on Google Colab and want to download a file to your local computer,
# uncomment the following code (`path` is the directory that contains the file)

#import os
#from google.colab import files
#files.download(os.path.join(path, "result.csv"))