# ONCLUSIVE ML CHALLENGE

**Task:**
Build an ML system to verify the veracity of claims.

**Dataset:** PUBHEALTH is a comprehensive dataset for explainable automated fact-checking of
public health claims. Each instance in the PUBHEALTH dataset has an associated
veracity label (true, false, unproven, mixture). Each instance also has an
explanation text field, which justifies why the claim was assigned its
veracity label.

**Dataset link:** https://huggingface.co/datasets/health_fact

**Pretrained huggingface model used:** https://huggingface.co/yikuan8/Clinical-Longformer

**Solution:** Veracity verification is modelled as a multi-class classification problem. Given a 'claim' and its 'source or evidence' in natural language, the model predicts one of the four veracity classes (true, false, unproven, mixture). This is a natural language inference task in which a claim is verified against evidence. The following steps were taken to build the ML system:
1. The dataset was downloaded as train, validation and test splits using huggingface's datasets library.
2. Data was preprocessed by combining the 'claim' and 'main_text' columns and oversampling the minority class.
3. A pretrained tokenizer was used to tokenize the input text and convert it into indices and attention masks.
4. Since the data consists of very long text instances about health-related claims, Clinical-Longformer was used with a sequence classification head to train a multi-class classification network.

`Clinical-Longformer is a clinical knowledge enriched version of Longformer that was further pre-trained using MIMIC-III clinical notes. It allows up to 4,096 tokens as the model input.` 
    
**Distributed training was performed on a 4-GPU machine using huggingface's accelerate library.**

## Install the dependencies

In [None]:
!pip install datasets


In [None]:
!pip install accelerate



In [None]:
!pip install transformers



## Imports

In [None]:
import datasets
import numpy as np
import pandas as pd
import torch
from accelerate import Accelerator, notebook_launcher
from datasets import Dataset
from IPython.display import clear_output
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support
from sklearn.utils.class_weight import compute_class_weight  # 'balanced': n_samples / (n_classes * np.bincount(y))
from tabulate import tabulate
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AdamW, AutoModelForSequenceClassification, AutoTokenizer, get_scheduler

## Download the dataset using huggingface datasets library

In [None]:
train_data, val_data, test_data = datasets.load_dataset('health_fact', split=['train', 'validation', 'test'])  # download the train, validation and test splits

Using custom data configuration default
Reusing dataset health_fact (/homes/vs001/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19)


  0%|          | 0/3 [00:00<?, ?it/s]

## Select useful columns

*   The dataset contains the following columns: **'claim_id', 'claim', 'date_published', 'explanation', 'fact_checkers', 'main_text', 'sources', 'label', 'subjects'**

*   As **'explanation'** is written by expert fact checkers, we cannot expect it to be available in production deployments. Thus, the model is built using **'claim'** and **'main_text'** columns only.


In [None]:
# make a list of all non-useful columns
cols_to_remove = train_data.column_names
cols_to_remove.remove("claim") 
cols_to_remove.remove("main_text")
cols_to_remove.remove("label")

# remove non-useful columns
train_data = train_data.remove_columns(cols_to_remove)
val_data = val_data.remove_columns(cols_to_remove)
test_data = test_data.remove_columns(cols_to_remove)

# remove the undecided class '-1' as only four classes need to be modelled
train_data = train_data.filter(lambda example: example['label'] > -1)
val_data = val_data.filter(lambda example: example['label'] > -1)
test_data = test_data.filter(lambda example: example['label'] > -1)

Loading cached processed dataset at /homes/vs001/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19/cache-79b9bc5aafb4c839.arrow
Loading cached processed dataset at /homes/vs001/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19/cache-c1175cd62a8997bd.arrow
Loading cached processed dataset at /homes/vs001/.cache/huggingface/datasets/health_fact/default/1.1.0/99503637e4255bd805f84d57031c18fe4dd88298f00299d56c94fc59ed68ec19/cache-2d3e1cc1cb5b5a35.arrow


## Check for class imbalance

In [None]:
train_df = pd.DataFrame(train_data) # convert training data to pandas dataframe

In [None]:
train_df['label'].value_counts() 

2    5078
0    3001
1    1434
3     291
Name: label, dtype: int64

### **Note:** There is a large class imbalance: the majority class (label 2, 5078 examples) is roughly 17 times as frequent as the minority class (label 3, 291 examples).

In [None]:
train_df = pd.concat([train_df[train_df['label']==3].sample(frac=0.5), train_df]).sample(frac=1).reset_index(drop=True)

#### **Note:** We oversampled the minority class to reduce the class imbalance. The oversampling ratio (an extra 50% of the minority class) was chosen by experimenting with different values and monitoring validation performance.

**This also helped avoid severely skewed class weights in the loss function used later.**
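
To make the effect on the class weights concrete, the following minimal sketch recomputes "balanced" weights, `n_samples / (n_classes * count)`, from the class counts reported in this notebook before and after oversampling. The helper name `balanced_weights` is illustrative and not part of the original code.

```python
# Class counts reported by value_counts() before oversampling (label -> count).
counts_before = {0: 3001, 1: 1434, 2: 5078, 3: 291}

# Counts reported after oversampling: only the minority class 3 changes.
counts_after = dict(counts_before)
counts_after[3] = 437


def balanced_weights(counts):
    """Balanced class weights: n_samples / (n_classes * count_of_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


weights_before = balanced_weights(counts_before)
weights_after = balanced_weights(counts_after)

# Print label, weight before oversampling, weight after oversampling.
for label in sorted(counts_before):
    print(label, round(weights_before[label], 4), round(weights_after[label], 4))
```

Under these counts the weight of class 3 drops from about 8.4 to about 5.7, while the other three weights barely move.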

In [None]:
train_df['label'].value_counts()

2    5078
0    3001
1    1434
3     437
Name: label, dtype: int64

In [None]:
train_data = Dataset.from_pandas(train_df) # convert pandas dataframe back to dataset

## Load pretrained tokenizer

In [None]:
checkpoint = "yikuan8/Clinical-Longformer"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # loaded pretrained tokenizer

## Join claim and main_text with separator token

In [None]:
def get_text(example):
    example['text'] = ' '.join([example['claim'], tokenizer.sep_token, example['main_text']]) # join 'claim' and 'main_text' in that order, with the separator token recognized by the tokenizer above
    return example

In [None]:
# apply concatenation on all the data splits
train_data = train_data.map(get_text)
val_data = val_data.map(get_text)
test_data = test_data.map(get_text)



  0%|          | 0/9950 [00:00<?, ?ex/s]

  0%|          | 0/1214 [00:00<?, ?ex/s]

  0%|          | 0/1233 [00:00<?, ?ex/s]

## Convert text to indices and attention masks

In [None]:
# tokenize the concatenated text and obtain input vectors
def tokenization(batched_text):
    return tokenizer(batched_text['text'], padding='max_length', truncation=True, max_length=1024)  # inputs are padded or truncated to 1024 tokens

train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
val_data = val_data.map(tokenization, batched = True, batch_size = len(val_data))
test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

## Remove unneeded columns and convert to PyTorch tensors

In [None]:
# now that we have input vectors, text columns can be dropped
train_data = train_data.remove_columns(['claim', 'main_text', 'text']) 
val_data = val_data.remove_columns(['claim', 'main_text', 'text'])
test_data = test_data.remove_columns(['claim', 'main_text', 'text'])
train_data = train_data.rename_column("label", "labels")
val_data = val_data.rename_column("label", "labels")
test_data = test_data.rename_column("label", "labels")
print(train_data.column_names)

['labels', 'input_ids', 'attention_mask']


In [None]:
# convert input dataset to pytorch tensors format
train_data.set_format("torch")
val_data.set_format("torch")
test_data.set_format("torch")

## Calculating class weights

In [None]:
#calculating class weights as the labels are still imbalanced
y_integers = [int(i['labels']) for i in train_data]
class_weight_array = compute_class_weight('balanced', classes = np.unique(y_integers), y = y_integers)
class_weights = torch.tensor(class_weight_array)
class_weights

tensor([0.8289, 1.7347, 0.4899, 5.6922], dtype=torch.float64)
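
These weights are later passed to `torch.nn.CrossEntropyLoss(weight=...)` in the training loop. With that loss's default mean reduction, the batch loss is a weighted average of the per-example losses, so mistakes on rare classes count for more. The sketch below reproduces that reduction in plain Python. The logits and targets are hypothetical values chosen only for illustration.

```python
import math

# Class weights printed above (classes 0..3).
class_weights = [0.8289, 1.7347, 0.4899, 5.6922]


def weighted_cross_entropy(logits_batch, targets, weights):
    """Weighted mean of per-example cross-entropy.

    Computes sum_i w[y_i] * loss_i / sum_i w[y_i], the mean reduction
    of a class-weighted cross-entropy loss.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, y in zip(logits_batch, targets):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        loss_i = log_norm - logits[y]  # negative log-probability of the target
        weighted_sum += weights[y] * loss_i
        weight_total += weights[y]
    return weighted_sum / weight_total


# Hypothetical logits: a confident 'true' (2) example and an uncertain 'unproven' (3) one.
logits_batch = [[0.1, 0.2, 2.0, -1.0], [0.5, 0.3, 0.4, 0.2]]
targets = [2, 3]

weighted = weighted_cross_entropy(logits_batch, targets, class_weights)
unweighted = weighted_cross_entropy(logits_batch, targets, [1.0] * 4)
print(weighted, unweighted)
```

In this toy batch the poorly predicted class-3 example dominates the weighted loss, so the weighted value is larger than the plain mean.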

## Modelling

### Define metrics to track

In [None]:
def compute_metrics(labels, preds):
    _, _, f1_micro, _ = precision_recall_fscore_support(labels, preds, average='micro')  # micro F1
    classification_report_dict = classification_report(labels, preds, output_dict=True)
    f1_std = np.std([classification_report_dict[str(i)]['f1-score'] for i in set(labels)])  # standard deviation of per-class F1
    f1_macro = classification_report_dict['macro avg']['f1-score']  # macro F1
    acc = accuracy_score(labels, preds)  # accuracy
    return {
        'accuracy': acc,
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
        'f1_std': f1_std
    }
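
For single-label multi-class predictions, micro F1 equals accuracy, which is why the `Val Accuracy` and `Val F1_micro` columns in the training log are identical. The following sketch shows this on hypothetical toy labels. It also computes the macro F1 and the standard deviation of per-class F1 in plain Python. It is an illustration only, not the scikit-learn implementation used above.

```python
from statistics import pstdev


def per_class_f1(labels, preds):
    """Return {class: F1} from per-class true positives, false positives and false negatives."""
    scores = {}
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def micro_f1(labels, preds):
    """Micro F1 pools TP, FP and FN over all classes before computing F1."""
    tp = sum(1 for y, p in zip(labels, preds) if y == p)
    errors = len(labels) - tp
    # Each misclassification is one false positive (for the predicted class)
    # and one false negative (for the true class).
    return 2 * tp / (2 * tp + errors + errors)


# Hypothetical toy labels and predictions over the four classes.
labels = [0, 0, 1, 1, 2, 2, 2, 3]
preds = [0, 1, 1, 1, 2, 2, 0, 3]

f1 = per_class_f1(labels, preds)
accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
f1_micro = micro_f1(labels, preds)
f1_macro = sum(f1.values()) / len(f1)
f1_std = pstdev(list(f1.values()))  # population std, like np.std's default

print(accuracy, f1_micro, f1_macro, f1_std)
```

A low `f1_std` means the per-class F1 scores are close to each other, which is the property later used to pick a checkpoint.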

### Training and validation

In [None]:
def run_training_loop():
    
    # set up the train dataloader
    train_dataloader = DataLoader(train_data, shuffle=True, batch_size=2)

    # set up the validation dataloader
    val_dataloader = DataLoader(val_data, batch_size=2)

    # Load the pretrained Clinical-Longformer model
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)
    
    num_epochs = 10 # total number of epochs
    num_training_steps = num_epochs * len(train_dataloader) # total number of training steps
    
    optimizer = AdamW(model.parameters(), lr=1e-5)  # optimizer for parameter tuning

    # linear learning rate scheduler with warmup over the first 5% of steps
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=int(0.05 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    
    accelerator = Accelerator() # initialized accelerator object to distribute training code
    
    criterion = torch.nn.CrossEntropyLoss(weight=class_weights.float().to(accelerator.device)) # custom weighted loss function
    
    train_dataloader, model, optimizer, lr_scheduler  = accelerator.prepare(
     train_dataloader, model, optimizer, lr_scheduler) # prepared objects required for distributed training
    
    tracked_metrics = []
    for epoch in range(num_epochs):
        train_losses = []
        model.train()
        for batch in tqdm(train_dataloader):
            outputs = model(**batch)
            logits = outputs['logits']
            loss = criterion(logits, batch['labels'])  # class-weighted loss, used instead of outputs.loss
            accelerator.backward(loss)  # backward pass handled by accelerate
            train_losses.append(loss.item())  # store a plain float for logging
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        labels_all = []
        preds_all = []
        val_losses = []

        for batch in val_dataloader:
            with torch.no_grad():
                outputs = model(**batch)
            logits = outputs.logits
            val_losses.append(outputs.loss.item())  # model's built-in (unweighted) loss, collected for logging
            predictions = torch.argmax(logits, dim=-1)  # predicted class for each example
            labels_all.extend(batch['labels'].tolist())  # labels for the batch
            preds_all.extend(predictions.tolist())  # predictions for the batch
            
        metrics = compute_metrics(labels_all, preds_all) # get metrics to log

        tracked_metrics.append([epoch, sum(train_losses)/len(train_losses), sum(val_losses)/len(val_losses), metrics['accuracy'], metrics['f1_micro'], metrics['f1_macro'], metrics['f1_std']])
        clear_output(wait=True)
        print(tabulate(tracked_metrics, headers=['EPOCH', 'Training Loss', 'Validation Loss', 'Val Accuracy', 'Val F1_micro', 'Val F1_macro', 'Val F1_std'])) # log metrics
        
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(f'dist_train_longformer_clinical_oversamped_balanced/{epoch}') # save the model after each epoch
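
The `"linear"` schedule used in the training loop warms the learning rate up linearly over the first 5% of steps and then decays it linearly to zero. The sketch below reproduces that shape in plain Python, using the step counts the loop computes before `accelerator.prepare` (9950 training examples, batch size 2, 10 epochs). The function name `linear_schedule_multiplier` is illustrative. How many optimizer steps each process actually takes under distributed training is not modelled here.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup from 0 to 1, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
num_epochs = 10
batches_per_epoch = 9950 // 2  # 9950 training examples, batch size 2
num_training_steps = num_epochs * batches_per_epoch
num_warmup_steps = int(0.05 * num_training_steps)

# Print the learning rate at a few points of the schedule.
checkpoints = (0, num_warmup_steps // 2, num_warmup_steps,
               num_training_steps // 2, num_training_steps)
for step in checkpoints:
    lr = base_lr * linear_schedule_multiplier(step, num_warmup_steps, num_training_steps)
    print(step, lr)
```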

In [None]:
notebook_launcher(run_training_loop, num_processes=4) # launch distributed training

  EPOCH    Training Loss    Validation Loss    Val Accuracy    Val F1_micro    Val F1_macro    Val F1_std
-------  ---------------  -----------------  --------------  --------------  --------------  ------------
      0        0.949207            0.640989        0.724876        0.724876        0.563889      0.24084
      1        0.700016            0.580641        0.73888         0.73888         0.625849      0.182257
      2        0.531183            0.581147        0.73229         0.73229         0.636882      0.168691
      3        0.455887            0.59807         0.764415        0.764415        0.656159      0.17186
      4        0.357285            0.671122        0.779242        0.779242        0.685651      0.17155
      5        0.251612            0.720104        0.769357        0.769357        0.667255      0.180561
      6        0.205921            0.852754        0.766063        0.766063        0.678264      0.160546
      7        0.14258             0.885815      

**We note the following:**
1. The training loss decreases steadily.
2. The validation loss drops until epoch 1, stays nearly flat through epoch 3, and then rises steadily. The model overfits after epoch 3.
3. Epochs 1, 2 and 3 have almost the same validation loss.
4. Among these three epochs, epoch 2 has the smallest standard deviation of F1 scores across classes.

As we have no preference for any one class over another, the model with the smallest standard deviation of F1 scores across classes among these low-loss epochs was chosen for testing and deployment.

**EPOCH-2 is chosen as our best model**

| Epoch | Training Loss | Validation Loss | Val Accuracy | Val F1_micro | Val F1_macro | Val F1_std |
|-------|---------------|-----------------|--------------|--------------|--------------|------------|
| 2 | 0.531183 | 0.581147 | 0.73229 | 0.73229 | 0.636882 | 0.168691 |
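
The selection above can be read as a two-step rule: keep the epochs whose validation loss is close to the minimum, then take the one with the smallest per-class F1 standard deviation. The sketch below applies that reading to the complete rows of the training log. The tolerance of 0.02 is an assumption made for illustration; the original selection was done by inspection. The truncated epoch 7 row is left out.

```python
# (epoch, validation loss, std of per-class F1) copied from the training log.
epoch_log = [
    (0, 0.640989, 0.24084),
    (1, 0.580641, 0.182257),
    (2, 0.581147, 0.168691),
    (3, 0.59807, 0.17186),
    (4, 0.671122, 0.17155),
    (5, 0.720104, 0.180561),
    (6, 0.852754, 0.160546),
]


def select_epoch(log, loss_tolerance):
    """Among epochs within loss_tolerance of the best validation loss, pick the lowest F1 std."""
    best_loss = min(loss for _, loss, _ in log)
    candidates = [row for row in log if row[1] - best_loss <= loss_tolerance]
    return min(candidates, key=lambda row: row[2])[0]


# loss_tolerance=0.02 is a hypothetical threshold, not a value from the original notebook.
best_epoch = select_epoch(epoch_log, loss_tolerance=0.02)
print(best_epoch)
```

With this tolerance the candidates are epochs 1, 2 and 3, and the rule returns epoch 2. Epoch 6 has a lower F1 standard deviation overall but is excluded because its validation loss is much higher.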

### Evaluation

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") # get cuda device for inference on gpu if available

In [None]:
val_dataloader = DataLoader(val_data, batch_size=16) # get validation dataloader
test_dataloader = DataLoader(test_data, batch_size=16) # get test dataloader

In [None]:
def prediction_loop(loader, model):
    labels_all = []
    preds_all = []
    losses = []
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move the batch to the GPU if available
        with torch.no_grad():
            outputs = model(**batch)
        losses.append(outputs.loss.item())  # model's built-in (unweighted) loss
        predictions = torch.argmax(outputs.logits, dim=-1)  # predicted class for each example
        labels_all.extend(batch['labels'].tolist())
        preds_all.extend(predictions.tolist())
    metrics = compute_metrics(labels_all, preds_all)
    tracked_metrics = [[sum(losses) / len(losses), metrics['accuracy'], metrics['f1_micro'], metrics['f1_macro'], metrics['f1_std']]]
    print(tabulate(tracked_metrics, headers=['Validation Loss', 'Val Accuracy', 'Val F1_micro', 'Val F1_macro', 'Val f1_std']))
    print(classification_report(labels_all, preds_all))

In [None]:
def get_metrics(path):
    model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=4)
    model.to(device)
    print('=======================','val','=======================')
    prediction_loop(val_dataloader, model)
    print('=======================','test','=======================')
    prediction_loop(test_dataloader, model)
    del model

In [None]:
get_metrics('models/2') # getting final performance of our best model

  Validation Loss    Val Accuracy    Val F1_micro    Val F1_macro    Val f1_std
-----------------  --------------  --------------  --------------  ------------
          0.58121         0.73229         0.73229        0.636882      0.168691
              precision    recall  f1-score   support

           0       0.80      0.57      0.67       380
           1       0.35      0.70      0.46       164
           2       0.96      0.85      0.90       629
           3       0.44      0.63      0.52        41

    accuracy                           0.73      1214
   macro avg       0.64      0.69      0.64      1214
weighted avg       0.81      0.73      0.75      1214

  Validation Loss    Val Accuracy    Val F1_micro    Val F1_macro    Val f1_std
-----------------  --------------  --------------  --------------  ------------
         0.646326        0.729116        0.729116         0.64037      0.160688
              precision    recall  f1-score   support

           0       0.79      0

### Final model's performance

| Metric | Validation set | Test set |
|--------|----------------|----------|
| Loss | 0.58121 | 0.646326 |
| Micro F1 | 0.73229 | 0.729116 |
| Macro F1 | 0.636882 | 0.64037 |
| False F1 | 0.67 | 0.71 |
| Mixture F1 | 0.46 | 0.51 |
| True F1 | 0.90 | 0.87 |
| Unproven F1 | 0.52 | 0.47 |