<a href="https://colab.research.google.com/github/natalisso/Concurrent-Distributed-Systems/blob/master/Fine_tuning_T5_SuperGLUE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Notebook

This notebook builds the experimental setup for the upcoming guided-research experiments.

In [1]:
!pip install -U transformers[torch] \
    sentencepiece \
    accelerate \
    datasets \
    evaluate \
    python-dotenv \
    neptune \
    peft \
    rouge_score



In [2]:
import os

import collections
import re
from datetime import date
from typing import Dict, Final, get_args, Literal
from dotenv import load_dotenv
from google.colab import drive

import neptune
import torch
import numpy as np
from datasets import load_dataset
from evaluate import load
from huggingface_hub import notebook_login
from transformers import (
    DataCollatorForSeq2Seq,
    T5Tokenizer,
    T5ForConditionalGeneration,
    Trainer,
    TrainingArguments
)

from transformers.integrations import NeptuneCallback

In [3]:
GDRIVE_PATH='/content/gdrive/MyDrive/TUM/GR/guided_research_codes'

In [4]:
# Connecting Google Drive to Colab to have persistent storage across Colab sessions.
drive.mount('/content/gdrive', force_remount=True)
os.chdir(GDRIVE_PATH)
print(sorted(os.listdir()))

Mounted at /content/gdrive
['.env', '.git', '.gitignore', '.ipynb_checkpoints', '.neptune', 'Checkpoints', 'Fine_tuning_T5_SuperGLUE.ipynb', 'Models', 'Parameters_analysis.ipynb', 'Predictions', '__pycache__', 'config.py', 'logs', 'utils.py']


In [5]:
!nvidia-smi

Wed Mar 13 00:52:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              49W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
from config import SUPER_GLUE_DATASETS_INFOS
from utils import (
    correct_inputs_targets,
    get_model,
    setup_configs,
    tokenizer_function,
)

In [7]:
load_dotenv()

NEPTUNE_API_TOKEN = os.getenv('NEPTUNE_API_TOKEN')
GIT_SSH_TOKEN = os.getenv('GIT_SSH_TOKEN')
GITHUB_API_TOKEN = os.getenv('GITHUB_API_TOKEN')

In [8]:
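# load_dotenv() has already exported the token; fail early with a clear message if it is missing.
assert NEPTUNE_API_TOKEN, "NEPTUNE_API_TOKEN is not set; check the .env file"
os.environ["NEPTUNE_API_TOKEN"] = NEPTUNE_API_TOKEN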
os.environ["NEPTUNE_PROJECT"] = "nssoares022/guided-research"

# Datasets

The scope of the research is to benchmark a language model on the [SuperGLUE](https://huggingface.co/datasets/super_glue) benchmark. All datasets proposed in SuperGLUE are available in the HuggingFace dataset hub. More details about them are provided below.


## SuperGLUE
SuperGLUE is a more recent benchmark styled after GLUE, with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. The HuggingFace release exposes 10 task configurations: eight main tasks plus two diagnostic sets (AXb and AXg). Most of them are framed as classification; ReCoRD is the exception, since its answer is a text span taken from the passage. More details about the SuperGLUE benchmark can be found [here](https://super.gluebenchmark.com/tasks).


### Question Answering

* COPA - [The Choice of Plausible Alternatives](https://people.ict.usc.edu/~gordon/copa.html) dataset consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
  * **Metrics:** Accuracy.

* BoolQ - [The Boolean Questions](https://github.com/google-research-datasets/boolean-questions) is a question answering dataset for yes/no questions containing 15942 examples. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
   * **Metrics:** Accuracy.

* MultiRC - [The Multi-Sentence Reading Comprehension](https://cogcomp.seas.upenn.edu/multirc/) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
  * **Metrics:** Exact Match (EM) and F1-score.

### Common Sense Reasoning

* ReCoRD - [The Reading Comprehension with Commonsense Reasoning](https://sheng-z.github.io/ReCoRD-explorer/) consists of queries automatically generated from CNN/Daily Mail news articles. The answer to each query is a text span from a summarizing passage of the corresponding news.
   * **Metrics:** F1-score and Exact Match (EM).

### Coreference Resolution

* WSC - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) consists of pairs of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution.
   * **Metrics:** Accuracy.

* AXg - [The Winogender Schema Diagnostics](https://github.com/rudinger/winogender-schemas) are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
   * **Metrics:** Gender Parity or Accuracy.

* WiC - [The Words in Context](https://pilehvar.github.io/wic/) is a dataset for evaluating context-sensitive word representations, framed as a binary classification task. Each instance has a target word w, either a verb or a noun, and two contexts, c1 and c2. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in c1 and c2 correspond to the same meaning.
   * **Metrics:** Accuracy.

### Inference Tasks



* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the second sentence is the entailment of the first one or not.
  * **Metrics:** Accuracy.

* CB - [The CommitmentBank](https://github.com/mcdm/CommitmentBank) is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
  * **Metrics:** Avg. F1-score or Accuracy.

* AXb - [The Broad-coverage Diagnostics](https://gluebenchmark.com/diagnostics) is a set of sentence pairs labeled with their entailment relations (entailment, contradiction, or neutral) in both directions.
  * **Metrics:** Matthews correlation coefficient.
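
To make the last metric concrete, here is a minimal sketch of the Matthews correlation coefficient for the two-class case, written in plain Python. The function name `matthews_corrcoef_binary` and the toy labels are illustrative assumptions and are not part of this notebook, which obtains its metrics through `evaluate.load`.

```python
import math


def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation for 0/1 labels; returns 0.0 when undefined."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Happens when one class is never predicted or never present.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels (hypothetical): a perfect and a fully inverted prediction.
perfect = matthews_corrcoef_binary([1, 0, 1, 0], [1, 0, 1, 0])
inverted = matthews_corrcoef_binary([0, 1, 0, 1], [1, 0, 1, 0])
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement), which is why it is preferred over accuracy on unbalanced diagnostic sets.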

# Models

For the experiments, we use the flan-T5 model at different sizes, all of which are available in the HuggingFace hub:

*   [flan-t5-base](https://huggingface.co/google/flan-t5-base) - 248M params
*   [flan-t5-large](https://huggingface.co/google/flan-t5-large) - 783M params
*   [flan-t5-xl](https://huggingface.co/google/flan-t5-xl) - 3B params

# Code

In [9]:
DATASET_NAME: Final[str] = "super_glue"

In [10]:
SuperGlueTask = Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

In [11]:
# Parameters to choose backbone model, PEFT method, and benchmark task
model_checkpoint: Literal["google/flan-t5-xxl", "google/flan-t5-base", "google/flan-t5-large"] = "google/flan-t5-base"
peft_method: Literal["full_tuning", "ia3", "lora", "p_tuning", "prefix_tuning"] = "full_tuning"
task: SuperGlueTask = "wic"
assert task in get_args(SuperGlueTask)

In [12]:
today = date.today()
model_type = model_checkpoint.split("-")[-1]
experiment_id = f"{task}_{peft_method}_{model_type}_{today}"
ckpt_dir = f"Checkpoints/{experiment_id}"
output_dir = f"Models/{experiment_id}"
predictions_output_file = f"Predictions/{task}_{peft_method}_{model_type}.txt"
ckpt_dir, output_dir, predictions_output_file

('Checkpoints/wic_full_tuning_base_2024-03-13',
 'Models/wic_full_tuning_base_2024-03-13',
 'Predictions/wic_full_tuning_base.txt')

In [13]:
# Get the infos for the specific dataset
dataset_infos = SUPER_GLUE_DATASETS_INFOS[task]
dataset_infos

SuperGLUETaskConfigs(task_name='wic', feature_keys=['sentence1', 'sentence2', 'word'], label_key='label', label_names=['False', 'True'], evaluation_metric='accuracy')

In [None]:
# Download and cache the dataset from the HuggingFace hub
dataset = load_dataset(DATASET_NAME, dataset_infos.task_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Get the pre-trained tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_checkpoint, is_split_into_words=True)

In [None]:
setup_configs(DATASET_NAME, task, tokenizer, model_checkpoint)

In [None]:
# Preprocess the raw dataset
corrected_dataset = dataset.map(correct_inputs_targets, batched=True, remove_columns=dataset["test"].column_names)
tokenized_dataset = corrected_dataset.map(tokenizer_function, batched=True, remove_columns=["input", "target"], load_from_cache_file=False)
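
The `correct_inputs_targets` helper lives in `utils.py` and is not shown in this notebook. As a rough illustration of the text-to-text framing such a helper could produce for WiC, here is a sketch; the prompt template and the function name `format_wic_example` are assumptions, not the notebook's actual implementation. Only the `input` and `target` column names are taken from the cell above.

```python
def format_wic_example(example, feature_keys, label_names):
    """Serialize one raw example into an (input, target) pair of strings."""
    # Hypothetical template: task prefix followed by "key: value" fields.
    fields = " ".join(f"{key}: {example[key]}" for key in feature_keys)
    return {
        "input": f"wic {fields}",
        "target": label_names[example["label"]],
    }


# Toy example (made up) using the WiC feature keys and label names.
sample = {
    "sentence1": "He sat on the bank.",
    "sentence2": "She works at a bank.",
    "word": "bank",
    "label": 0,
}
formatted = format_wic_example(
    sample, ["sentence1", "sentence2", "word"], ["False", "True"]
)
```

Turning the integer label into its text name is what lets a sequence-to-sequence model such as T5 be trained on a classification task.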

In [None]:
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)

In [None]:
model = get_model(model, peft_method)
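try:
  model.print_trainable_parameters()
except AttributeError:
  # Plain (non-PEFT) models lack print_trainable_parameters; with full fine-tuning every parameter is trainable.
  all_params = model.num_parameters()
  print(f"trainable params: {all_params} || all params: {all_params} || trainable%: 100.0")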
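
The fallback branch assumes that every parameter is trainable. A version that works for both PEFT-wrapped and plain models could count the parameters directly. The sketch below is an assumption about how that might look: `count_parameters` is a hypothetical helper that relies only on the standard PyTorch `parameters()`, `numel()` and `requires_grad` interface.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style module."""
    trainable = 0
    total = 0
    for param in model.parameters():
        size = param.numel()
        total += size
        if param.requires_grad:
            # Frozen weights (e.g. the backbone under LoRA) are skipped here.
            trainable += size
    return trainable, total
```

Calling `count_parameters(model)` after `get_model` would give the two numbers needed for the `trainable%` figure without a `try`/`except`.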

In [None]:
def map_text_int(pred):
  global task
  text_pred = pred.lower()

  if ("false" in text_pred) or ("no" in text_pred) or ("choice1" in text_pred) or ("contradiction" in text_pred):
    int_pred = 0
  elif ("true" in text_pred) or ("yes" in text_pred) or ("choice2" in text_pred) or ("neutral"in text_pred):
    int_pred = 1
  elif "entailment" in text_pred:
    int_pred = 2
  else:
    int_pred = -1

  # Fix RTE labels -> entailment (0) and not_entailment (1)
  if task == "rte":
    if "no" in text_pred:
      int_pred = 1
    elif "entailment" in text_pred:
      int_pred = 0
    else:
      int_pred = -1
  return int_pred
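
Substring checks like the ones above can misfire on free-form generations: `"no"` is a substring of `"unknown"`, for example, so that output would be mapped to 0. A stricter alternative, sketched here under the assumption that the model emits exactly one of the expected label strings, normalizes the text and looks it up verbatim. `LABEL_TO_ID` and `map_text_exact` are hypothetical names that mirror only the general (non-RTE) branch of `map_text_int`.

```python
# Same label ids as the general branch of map_text_int (assumed convention).
LABEL_TO_ID = {
    "false": 0, "no": 0, "choice1": 0, "contradiction": 0,
    "true": 1, "yes": 1, "choice2": 1, "neutral": 1,
    "entailment": 2,
}


def map_text_exact(pred, mapping=LABEL_TO_ID):
    """Map a decoded prediction to its label id, or -1 if it is not a known label."""
    return mapping.get(pred.strip().lower(), -1)


known = map_text_exact(" True ")
unknown = map_text_exact("unknown")
```

The trade-off is that exact matching rejects outputs with extra words, so it suits models that already follow the target format closely.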

In [None]:
from collections import Counter
def custom_f1_score(predictions, references):
  # Token-level F1, averaged over all (prediction, reference) pairs.
  if len(references) == 0:
    return 0.0
  f1 = 0.0
  for prediction, reference in zip(predictions, references):
    prediction_tokens = prediction.split()
    reference_tokens = reference.split()
    common = Counter(prediction_tokens) & Counter(reference_tokens)
    num_same = sum(common.values())
    if num_same == 0:
      # No overlap: this pair contributes 0, but the remaining pairs are still scored.
      continue
    precision = num_same / len(prediction_tokens)
    recall = num_same / len(reference_tokens)
    f1 += (2 * precision * recall) / (precision + recall)
  return f1 / len(references)
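
As a worked instance of the per-pair token-overlap F1 used by `custom_f1_score`, the sketch below scores a single prediction against a single reference. The helper name `token_f1` and the two toy strings are illustrative assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between one prediction and one reference string."""
    prediction_tokens = prediction.split()
    reference_tokens = reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(prediction_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat", "the cat")
```

Here two of the three predicted tokens are shared, so precision is 2/3 and recall is 1, giving an F1 of 0.8.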

In [None]:
def custom_compute_metrics(pred):
    global DATASET_NAME, dataset_infos

    logits = pred.predictions[0]
    preds = np.argmax(logits, axis=-1)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = pred.label_ids
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    if dataset_infos.task_name != "record":
      decoded_preds_int = [map_text_int(pred.strip())
                        for pred in decoded_preds]
      decoded_labels_int = [map_text_int(label.strip())
                        for label in decoded_labels]
    if dataset_infos.task_name == "multirc":
      f1_metric = load("f1")
      em_metric = load("exact_match")
      return {"f1": f1_metric.compute(predictions=decoded_preds_int, references=decoded_labels_int, average="micro")["f1"],
              "exact_match": em_metric.compute(predictions=decoded_preds, references=decoded_labels)["exact_match"]
             }
    elif dataset_infos.task_name == "record":
      ems = [1 if pred_text == label_text else 0
             for pred_text, label_text in zip(decoded_preds, decoded_labels)]
      return {"f1": custom_f1_score(predictions=decoded_preds, references=decoded_labels),
              "exact_match": sum(ems) / len(ems)
             }
    else:
      metric = load(DATASET_NAME, dataset_infos.task_name)
      return metric.compute(predictions=decoded_preds_int, references=decoded_labels_int)

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8
)
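
The two padding arguments interact: labels are padded with -100 (the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss), and the padded length is rounded up to a multiple of 8. The following plain-Python sketch shows only that arithmetic for the labels; `pad_labels` is a hypothetical helper, right-padding is assumed, and the real collator additionally pads `input_ids` and `attention_mask` and returns tensors.

```python
def pad_labels(batch, pad_id=-100, multiple=8):
    """Right-pad label sequences to the batch maximum, rounded up to a multiple."""
    longest = max(len(sequence) for sequence in batch)
    # Ceiling division, then scale back up to the next multiple.
    target_length = -(-longest // multiple) * multiple
    return [sequence + [pad_id] * (target_length - len(sequence)) for sequence in batch]


padded = pad_labels([[5, 6, 7], [1]])
```

Both sequences come out with length 8, the smallest multiple of 8 that fits the longest one.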

In [None]:
training_args = TrainingArguments(
    ckpt_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    learning_rate=1e-3,
    # gradient_accumulation_steps=4,
    # eval_accumulation_steps=64,
    num_train_epochs=3,
    save_steps=100,
    weight_decay=0.01,
    save_total_limit=8,
    # fp16=True,
    # fp16_full_eval=True,
)
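
To get a feeling for what these arguments imply, the sketch below estimates the number of optimizer steps and periodic checkpoints. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; `training_schedule` and the dataset size of 5000 are hypothetical and not taken from this notebook.

```python
import math


def training_schedule(num_examples, batch_size=16, epochs=3, save_steps=100):
    """Return (steps_per_epoch, total_steps, periodic_checkpoints)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Checkpoints written at every multiple of save_steps.
    periodic_checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, periodic_checkpoints


schedule = training_schedule(5000)
```

For 5000 examples this gives 313 steps per epoch, 939 steps in total, and 9 periodic checkpoints; with `save_total_limit=8` the oldest one would be deleted. Some `transformers` versions may also write a checkpoint at the end of training, which this estimate does not count.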

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[NeptuneCallback(api_token=NEPTUNE_API_TOKEN,
                               project='nssoares022/guided-research',
                               name=f"{task}_{peft_method}",
                               capture_hardware_metrics=True,
                              )
              ],
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=custom_compute_metrics,
)
model.config.use_cache = False

In [None]:
trainer.train()

In [None]:
trainer.save_model(output_dir)

In [None]:
# output_sequences = model.generate(
#     torch.IntTensor(tokenized_dataset["test"]["input_ids"]),
#     attention_mask=torch.IntTensor(tokenized_dataset["test"]["attention_mask"]),
# )

In [None]:
raw_preds, _, _ = trainer.predict(tokenized_dataset["test"])

In [None]:
y_preds = np.argmax(raw_preds[0], axis=-1)

In [None]:
# y_preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)


In [None]:
with open(predictions_output_file, "w+") as results_file:
  results_file.write('idx,label\n')
  for sample, pred in zip(tokenized_dataset["test"], y_preds):
    pred_label = tokenizer.decode(pred, skip_special_tokens=True)
    idx = sample["idx"]
    results_file.write(f"{idx},{pred_label}\n")
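
Writing rows with an f-string works as long as the decoded label never contains a comma or a newline. A more defensive variant, sketched here as an assumption rather than as the notebook's method, delegates quoting to the standard `csv` module; `write_predictions` is a hypothetical helper and the rows are toy data.

```python
import csv
import io


def write_predictions(rows, handle):
    """Write (idx, label) rows as CSV, quoting labels only when necessary."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["idx", "label"])
    writer.writerows(rows)


# An in-memory buffer stands in for the predictions file.
buffer = io.StringIO()
write_predictions([(0, "True"), (1, "a, b")], buffer)
content = buffer.getvalue()
```

The second label contains a comma and is therefore written as `"a, b"`, so the file still has exactly two columns per row.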

In [None]:
# model

In [None]:
# notebook_login()

In [None]:
# trainer.push_to_hub()