<a href="https://colab.research.google.com/github/natalisso/guided_research_codes/blob/main/Fine_tuning_T5_SuperGLUE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Notebook

This notebook builds the experimental setup for the upcoming guided-research experiments.

In [1]:
!pip install -U "transformers[torch]" \
    sentencepiece \
    accelerate \
    datasets \
    evaluate \
    python-dotenv \
    neptune \
    peft



In [2]:
import os

import collections
import re
from datetime import date
from typing import Dict, Final, get_args, Literal
from dotenv import load_dotenv
from google.colab import drive

import neptune
import torch
import numpy as np
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import (
    DataCollatorWithPadding,
    T5Tokenizer,
    T5ForConditionalGeneration,
    Trainer,
    TrainingArguments
)

from transformers.integrations import NeptuneCallback

In [3]:
GDRIVE_PATH='/content/gdrive/MyDrive/TUM/GR/guided_research_codes'

In [4]:
# Connecting Google Drive to Colab to have persistent storage across Colab sessions.
drive.mount('/content/gdrive', force_remount=True)
os.chdir(GDRIVE_PATH)
print(sorted(os.listdir()))

Mounted at /content/gdrive
['.env', '.git', '.gitignore', '.ipynb_checkpoints', '.neptune', 'Checkpoints', 'Fine_tuning_T5_SuperGLUE.ipynb', 'Models', '__pycache__', 'config.py', 'logs', 'utils.py']


In [5]:
!nvidia-smi

Wed Dec 20 08:33:07 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              24W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
from config import SUPER_GLUE_DATASETS_INFOS
from utils import (
    correct_inputs_targets,
    custom_compute_metrics,
    get_model,
    setup_configs,
    tokenizer_function,
)

In [7]:
load_dotenv()

NEPTUNE_API_TOKEN = os.getenv('NEPTUNE_API_TOKEN')
GIT_SSH_TOKEN = os.getenv('GIT_SSH_TOKEN')
GITHUB_API_TOKEN = os.getenv('GITHUB_API_TOKEN')

In [8]:
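# load_dotenv() has already exported NEPTUNE_API_TOKEN into os.environ; fail early if it is missing.
assert NEPTUNE_API_TOKEN, "NEPTUNE_API_TOKEN is not set; check the .env file"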
os.environ["NEPTUNE_PROJECT"] = "nssoares022/guided-research"

In [9]:
run = neptune.init_run(project='nssoares022/guided-research',api_token=NEPTUNE_API_TOKEN)



https://app.neptune.ai/nssoares022/guided-research/e/GUID-147


In [10]:
neptune_callback = NeptuneCallback(run=run)

# Datasets

The scope of the research is to benchmark a language model on the [SuperGLUE](https://huggingface.co/datasets/super_glue) benchmark. All SuperGLUE datasets are available on the Hugging Face dataset hub. More details about them are provided below.


## SuperGLUE
SuperGLUE is a more recent benchmark styled after GLUE, with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. It consists of 10 datasets: eight main tasks and two diagnostic sets (AXb and AXg). Most tasks are framed as classification; ReCoRD instead asks for an entity taken from the passage. More details about the SuperGLUE benchmark can be found [here](https://super.gluebenchmark.com/tasks).


### Question Answering

* COPA - [The Choice of Plausible Alternatives](https://people.ict.usc.edu/~gordon/copa.html) dataset consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
  * **Metrics:** Accuracy.

* BoolQ - [The Boolean Questions](https://github.com/google-research-datasets/boolean-questions) is a question answering dataset for yes/no questions containing 15942 examples. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
   * **Metrics:** Accuracy.

* MultiRC - [The Multi-Sentence Reading Comprehension](https://cogcomp.seas.upenn.edu/multirc/) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
  * **Metrics:** Exact Match (EM) or F1-score.
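
For MultiRC, SuperGLUE computes exact match per question rather than per answer option: a question counts as correct only if every one of its answer options is labelled correctly. The sketch below illustrates that grouping on toy data. The function name `multirc_exact_match` and the toy values are hypothetical, and this is not the `custom_compute_metrics` implementation used later in the notebook.

```python
from collections import defaultdict


def multirc_exact_match(question_ids, y_true, y_pred):
    """Fraction of questions whose answer options are all labelled correctly."""
    all_correct = defaultdict(lambda: True)
    for qid, truth, pred in zip(question_ids, y_true, y_pred):
        # A single wrong answer option makes the whole question incorrect.
        all_correct[qid] = all_correct[qid] and (truth == pred)
    return sum(all_correct.values()) / len(all_correct)


# Toy data (hypothetical): three questions with 2, 2 and 1 answer options.
question_ids = ["q0", "q0", "q1", "q1", "q2"]
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

em = multirc_exact_match(question_ids, y_true, y_pred)
print(f"EM = {em:.3f}")
```

Here `q1` has one wrong option, so only two of the three questions count as exact matches.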

### Common Sense Reasoning

* ReCoRD - [The Reading Comprehension with Commonsense Reasoning](https://sheng-z.github.io/ReCoRD-explorer/) consists of queries automatically generated from CNN/Daily Mail news articles. The answer to each query is a text span from a summarizing passage of the corresponding news.
   * **Metrics:** F1-score or Accuracy.

### Coreference Resolution

* WSC - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) consists of pairs of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution.
   * **Metrics:** Accuracy.

* AXg - [The Winogender Schema Diagnostics](https://github.com/rudinger/winogender-schemas) are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
   * **Metrics:** Gender Parity or Accuracy.

* WiC - [The Words in Context](https://pilehvar.github.io/wic/) is a dataset for evaluating context-sensitive word representations, framed as a binary classification task. Each instance has a target word w, either a verb or a noun, and two contexts, c1 and c2. Each context triggers a specific meaning of w. The task is to identify whether the occurrences of w in c1 and c2 correspond to the same meaning.
   * **Metrics:** Accuracy.

### Inference Tasks



* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the second sentence is entailed by the first one.
  * **Metrics:** Accuracy.

* CB - [The CommitmentBank](https://github.com/mcdm/CommitmentBank) is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment-canceling operator (question, modal, negation, antecedent of conditional).
  * **Metrics:** Avg. F1-score or Accuracy.

* AXb - [The Broad-coverage Diagnostics](https://gluebenchmark.com/diagnostics) is a set of sentence pairs labeled with their entailment relations (entailment, contradiction, or neutral) in both directions.
  * **Metrics:** Matthews correlation coefficient.
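
Most of the metrics listed above reduce to counts from a confusion matrix. As a minimal sketch for the binary case, the block below computes accuracy, F1, and the Matthews correlation coefficient in plain Python. The helper names and toy labels are hypothetical; the notebook itself relies on `custom_compute_metrics` from `utils.py`, whose implementation is not shown here.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical): tp=2, tn=2, fp=1, fn=1.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f} mcc={mcc:.3f}")
```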

# Models

For the experiments, we use the FLAN-T5 model at different sizes, all of which are available on the Hugging Face hub:

*   [flan-t5-base](https://huggingface.co/google/flan-t5-base) - 248M params
*   [flan-t5-large](https://huggingface.co/google/flan-t5-large) - 783M params
*   [flan-t5-xl](https://huggingface.co/google/flan-t5-xl) - 3B params

# Code

In [11]:
DATASET_NAME: Final[str] = "super_glue"

In [12]:
SuperGlueTask = Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

In [13]:
# Parameters to choose backbone model, PEFT method, and benchmark task
model_checkpoint: Literal["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl"] = "google/flan-t5-base"
peft_method: Literal["ia3", "lora", "p_tuning", "prefix_tuning", "prompt_tuning_config"] = "ia3"
task: SuperGlueTask = "multirc"
assert task in get_args(SuperGlueTask)

In [14]:
# Get the infos for the specific dataset
dataset_infos = SUPER_GLUE_DATASETS_INFOS[task]
dataset_infos

SuperGLUETaskConfigs(task_name='multirc', feature_keys=['paragraph', 'question', 'answer'], label_key='label', label_names=['False', 'True'], evaluation_metric='acc_and_f1')

In [15]:
# Download and cache the dataset from the HuggingFace hub
dataset = load_dataset(DATASET_NAME, dataset_infos.task_name)

In [16]:
# Get the pre-trained tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
setup_configs(DATASET_NAME, task, tokenizer, model_checkpoint)

In [18]:
# Preprocess the raw dataset
corrected_dataset = dataset.map(correct_inputs_targets, batched=True, remove_columns=dataset["train"].column_names)
tokenized_dataset = corrected_dataset.map(tokenizer_function, batched=True, remove_columns=["input", "target"], load_from_cache_file=False)

Map:   0%|          | 0/27243 [00:00<?, ? examples/s]

Map:   0%|          | 0/4848 [00:00<?, ? examples/s]

Map:   0%|          | 0/9693 [00:00<?, ? examples/s]
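
The first `map` call turns each raw example into an `input` string and a `target` string, which is how T5-style models consume classification tasks. The actual logic lives in `correct_inputs_targets` in `utils.py` and is not shown here. The block below is only a hypothetical sketch of what such a batched mapper could look like for MultiRC, using the feature keys and label names from the task config printed earlier. The function name `to_text_to_text`, the prompt layout, and the handling of `-1` labels for unlabeled examples are all assumptions.

```python
LABEL_NAMES = ["False", "True"]  # label names from the task config


def to_text_to_text(batch, task_name="multirc",
                    feature_keys=("paragraph", "question", "answer"),
                    label_key="label", label_names=LABEL_NAMES):
    """Hypothetical batched mapper: dict of column lists in, 'input'/'target' lists out."""
    n_examples = len(batch[feature_keys[0]])
    inputs, targets = [], []
    for i in range(n_examples):
        # Prefix every feature with its name so the model can tell the fields apart.
        parts = [f"{key}: {batch[key][i]}" for key in feature_keys]
        inputs.append(f"{task_name} " + " ".join(parts))
        label = batch[label_key][i]
        # Assumption: unlabeled examples carry -1 and get an empty target.
        targets.append(label_names[label] if label >= 0 else "")
    return {"input": inputs, "target": targets}


# Toy batch (hypothetical values) in the column-oriented layout used by batched mapping.
toy_batch = {
    "paragraph": ["The sky is blue.", "Cats are mammals."],
    "question": ["What color is the sky?", "Are cats reptiles?"],
    "answer": ["Blue", "Yes"],
    "label": [1, -1],
}
converted = to_text_to_text(toy_batch)
print(converted["input"][0])
print(converted["target"])
```

Returning plain lists keyed by column name keeps the function compatible with `batched=True`, where the mapper receives and returns whole columns rather than single rows.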

In [19]:
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)

In [20]:
model = get_model(model, peft_method)
model.print_trainable_parameters()

trainable params: 104,448 || all params: 247,682,304 || trainable%: 0.04217015035519049
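
The trainable-parameter count above can be reproduced by hand, under a few assumptions that are not confirmed by this notebook: that `get_model` applies (IA)^3 with one learned scaling vector on each attention key projection, each attention value projection, and each feed-forward output projection, and that FLAN-T5-base has a hidden size of 768, a feed-forward size of 2048, and 12 encoder plus 12 decoder blocks. A sketch of the arithmetic:

```python
d_model, d_ff = 768, 2048            # assumed hidden and feed-forward sizes
n_enc_layers = n_dec_layers = 12     # assumed number of encoder / decoder blocks

# Assumed targets: k and v rescale their d_model-sized outputs,
# the feed-forward output projection rescales its d_ff-sized input.
enc_per_layer = 2 * d_model + d_ff   # self-attention k, v + feed-forward
dec_per_layer = 4 * d_model + d_ff   # self- and cross-attention k, v + feed-forward

trainable = n_enc_layers * enc_per_layer + n_dec_layers * dec_per_layer
total = 247_682_304                  # "all params" reported in the output above
trainable_pct = 100 * trainable / total

print(f"{trainable:,} trainable parameters ({trainable_pct:.4f}% of {total:,})")
```

Under these assumptions the count matches the reported 104,448, which suggests the adapter only adds scaling vectors and leaves every original weight frozen.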


In [21]:
today = date.today()
experiment_id = f"{task}_{peft_method}_{today}"
ckpt_dir = f"Checkpoints/{experiment_id}"
output_dir = f"Models/{experiment_id}"
ckpt_dir, output_dir

('Checkpoints/multirc_ia3_2023-12-20', 'Models/multirc_ia3_2023-12-20')
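
Because the identifier above only contains the date, two runs of the same task and PEFT method on the same day write to the same checkpoint directory. One possible way to avoid this, shown as a sketch rather than as the notebook's approach, is to append the time of day. The helper `make_experiment_id` is hypothetical.

```python
from datetime import datetime


def make_experiment_id(task, peft_method, now=None):
    """Build an identifier that is unique down to the second."""
    now = now or datetime.now()
    return f"{task}_{peft_method}_{now:%Y-%m-%d_%H%M%S}"


# Fixed timestamp so the example is reproducible.
example_id = make_experiment_id("multirc", "ia3", datetime(2023, 12, 20, 8, 33, 7))
print(example_id)
print(f"Checkpoints/{example_id}", f"Models/{example_id}")
```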

In [22]:
training_args = TrainingArguments(
    ckpt_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    learning_rate=1e-3,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    save_steps=100,
    weight_decay=0.01,
    save_total_limit=8,
)
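
With gradient accumulation, the batch size seen by each optimizer update is larger than `per_device_train_batch_size`. The sketch below works through the arithmetic for the settings above, assuming a single GPU and assuming the 27,243-example split from the preprocessing output is the training set. The step count is an estimate; the exact number depends on how the `Trainer` handles the last partial accumulation window.

```python
import math

train_examples = 27_243   # assumed size of the training split
per_device_batch = 4      # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
n_devices = 1             # assumption: a single GPU
save_steps = 100

effective_batch = per_device_batch * grad_accum * n_devices
batches_per_epoch = math.ceil(train_examples / (per_device_batch * n_devices))
updates_per_epoch = batches_per_epoch // grad_accum
checkpoints_per_epoch = updates_per_epoch // save_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer updates per epoch: about {updates_per_epoch}")
print(f"checkpoints written per epoch: about {checkpoints_per_epoch}")
```

With roughly eight checkpoints per epoch, `save_total_limit=8` keeps all of them for a one-epoch run.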

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[neptune_callback],
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=custom_compute_metrics,
)
model.config.use_cache = False

You are adding a <class 'transformers.integrations.integration_utils.NeptuneCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
NeptuneCallback
TensorBoardCallback


In [None]:
trainer.train()

https://app.neptune.ai/nssoares022/guided-research/e/GUID-148


Epoch,Training Loss,Validation Loss


Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-100 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-300 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-400 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory Checkpoints/multirc_ia3_2023-12-20/checkpoint-600 already exists and is non-empty.Saving will proceed but saved results ma



In [None]:
trainer.save_model(output_dir)

In [None]:
# trainer.evaluate()

In [None]:
# notebook_login()

In [None]:
# trainer.push_to_hub()