<a href="https://colab.research.google.com/github/natalisso/guided_research_codes/blob/main/Fine_tuning_T5_SuperGLUE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Notebook

This notebook builds the experimental setup for the upcoming experiments within the scope of the guided research.

In [1]:
!pip install -U transformers[torch]\
sentencepiece\
accelerate\
datasets\
evaluate\
python-dotenv\
neptune\
peft

Collecting transformers[torch]
  Using cached transformers-4.36.2-py3-none-any.whl (8.2 MB)
Collecting sentencepiece
  Using cached sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting accelerate
  Using cached accelerate-0.25.0-py3-none-any.whl (265 kB)
Collecting datasets
  Using cached datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting evaluate
  Using cached evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting python-dotenv
  Using cached python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting neptune
  Using cached neptune-1.8.6-py3-none-any.whl (481 kB)
Collecting peft
  Using cached peft-0.7.1-py3-none-any.whl (168 kB)
Collecting pyarrow-hotfix (from datasets)
  Using cached pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Using cached dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Using cached multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting resp

In [2]:
import os

import collections
import re
from datetime import date
from typing import Dict, Final, get_args, Literal
from dotenv import load_dotenv
from google.colab import drive

import neptune
import torch
import numpy as np
from evaluate import load
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments
)
from peft import (
    PeftModel,
    IA3Config,
    LoraConfig,
    TaskType,
    get_peft_model,
    PromptTuningConfig,
    PromptTuningInit,
    PrefixTuningConfig,
    PromptEncoderConfig
)
from transformers.data.data_collator import DataCollator
from transformers.integrations import NeptuneCallback

In [3]:
GDRIVE_PATH='/content/gdrive/MyDrive/TUM/GR/guided_research_codes'

In [4]:
# Connecting Google Drive to Colab to have persistent storage across Colab sessions.
drive.mount('/content/gdrive', force_remount=True)
os.chdir(GDRIVE_PATH)
print(sorted(os.listdir()))

Mounted at /content/gdrive
['.env', '.git', '.gitignore', '.ipynb_checkpoints', '.neptune', 'Fine_tuning_T5_SuperGLUE.ipynb', '__pycache__', 'config.py', 'logs', 'untitled']


In [5]:
!nvidia-smi

Tue Dec 19 21:52:01 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0              44W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [6]:
from config import SUPER_GLUE_DATASETS_INFOS

In [7]:
load_dotenv()

NEPTUNE_API_TOKEN = os.getenv('NEPTUNE_API_TOKEN')
GIT_SSH_TOKEN = os.getenv('GIT_SSH_TOKEN')
GITHUB_API_TOKEN = os.getenv('GITHUB_API_TOKEN')

In [8]:
os.environ["NEPTUNE_API_TOKEN"] = os.getenv("NEPTUNE_API_TOKEN")
os.environ["NEPTUNE_PROJECT"] = "nssoares022/guided-research"

In [9]:
run = neptune.init_run(project='nssoares022/guided-research',api_token=NEPTUNE_API_TOKEN)


https://app.neptune.ai/nssoares022/guided-research/e/GUID-135


In [10]:
neptune_callback = NeptuneCallback(run=run)

# Datasets

The scope of the research is to benchmark a language model on the [SuperGLUE](https://huggingface.co/datasets/super_glue) benchmark. All datasets proposed in SuperGLUE are available in the HuggingFace dataset hub. More details about them are provided below.


## SuperGLUE
SuperGLUE is a more recent benchmark styled after GLUE, with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. It consists of 10 NLP tasks: eight main tasks and two diagnostic sets (AXb and AXg). Most of them are framed as classification tasks; the exception is ReCoRD, whose answers are text spans. More details about the SuperGLUE benchmark can be found [here](https://super.gluebenchmark.com/tasks).


### Question Answering

* COPA - [The Choice of Plausible Alternatives](https://people.ict.usc.edu/~gordon/copa.html) dataset consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
  * **Metrics:** Accuracy.

* BoolQ - [The Boolean Questions](https://github.com/google-research-datasets/boolean-questions) is a question answering dataset for yes/no questions containing 15942 examples. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
   * **Metrics:** Accuracy.

* MultiRC - [The Multi-Sentence Reading Comprehension](https://cogcomp.seas.upenn.edu/multirc/) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
  * **Metrics:** Exact Match (EM) or F1-score.

### Common Sense Reasoning

* ReCoRD - [The Reading Comprehension with Commonsense Reasoning](https://sheng-z.github.io/ReCoRD-explorer/) consists of queries automatically generated from CNN/Daily Mail news articles. The answer to each query is a text span from a summarizing passage of the corresponding news.
   * **Metrics:** F1-score or Accuracy.

### Coreference Resolution

* WSC - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) consists of pairs of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution.
   * **Metrics:** Accuracy.

* AXg - [The Winogender Schema Diagnostics](https://github.com/rudinger/winogender-schemas) are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
   * **Metrics:** Gender Parity or Accuracy.

* WiC - [The Words in Context](https://pilehvar.github.io/wic/) is a dataset for evaluating context-sensitive word representations, framed as a binary classification task. Each instance has a target word w, either a verb or a noun, and two contexts, c1 and c2. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in c1 and c2 correspond to the same meaning.
   * **Metrics:** Accuracy.

### Inference Tasks



* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the first sentence entails the second one.
  * **Metrics:** Accuracy.

* CB - [The CommitmentBank](https://github.com/mcdm/CommitmentBank) is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
  * **Metrics:** Avg. F1-score or Accuracy.

* AXb - [The Broad-coverage Diagnostics](https://gluebenchmark.com/diagnostics) is a set of sentence pairs labeled with their entailment relations (entailment, contradiction, or neutral) in both directions.
  * **Metrics:** Matthews correlation coefficient.
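
The metrics listed above are standard classification scores. As a rough illustration (not the implementation used by the `evaluate` library), the sketch below computes accuracy, F1, and the Matthews correlation coefficient for binary labels from confusion-matrix counts. The function names and the toy labels are assumptions made for this example.

```python
import math

def confusion_counts(preds, refs):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(preds, refs))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    return tp, tn, fp, fn

def accuracy(preds, refs):
    """Fraction of predictions that match the references."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def f1_score(preds, refs):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(preds, refs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def matthews_corr(preds, refs):
    """Matthews correlation coefficient, in [-1, 1]; 0.0 when undefined."""
    tp, tn, fp, fn = confusion_counts(preds, refs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy labels, made up for this example
refs = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1]

acc = accuracy(preds, refs)
f1 = f1_score(preds, refs)
mcc = matthews_corr(preds, refs)
print(f"accuracy={acc:.3f} f1={f1:.3f} mcc={mcc:.3f}")
```

Unlike accuracy, the Matthews correlation takes all four confusion-matrix cells into account, which makes it more informative when the classes are imbalanced.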

# Models

For the experiments, we are using the Flan-T5 model at different sizes, all of which are available in the HuggingFace hub:

*   [flan-t5-base](https://huggingface.co/google/flan-t5-base) - 248M params
*   [flan-t5-large](https://huggingface.co/google/flan-t5-large) - 783M params
*   [flan-t5-xl](https://huggingface.co/google/flan-t5-xl) - 3B params

# Code

In [11]:
DATASET_NAME: Final[str] = "super_glue"

In [12]:
SuperGlueTask = Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

In [13]:
# Parameters to choose backbone model, PEFT method, and benchmark task
model_checkpoint: Literal["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl"] = "google/flan-t5-large"
peft_method: Literal["ia3", "lora", "p_tuning", "prefix_tuning", "prompt_tuning"] = "ia3"
task: SuperGlueTask = "multirc"
assert task in get_args(SuperGlueTask)
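
`Literal` annotations are not enforced at runtime, so a misspelled option only surfaces if it is checked explicitly, as the `assert` on `task` does. A minimal sketch of applying the same check to any of the parameters follows; `check_choice` and `PeftMethod` are hypothetical names introduced for this example, and the option list mirrors the PEFT methods handled later in the notebook.

```python
from typing import Literal, get_args

# Hypothetical alias listing the PEFT methods handled later in the notebook
PeftMethod = Literal["ia3", "lora", "p_tuning", "prefix_tuning", "prompt_tuning"]

def check_choice(value, literal_type):
    """Return `value` if it is one of the options of `literal_type`, else raise."""
    options = get_args(literal_type)
    if value not in options:
        raise ValueError(f"{value!r} is not one of {options}")
    return value

peft_choice = check_choice("ia3", PeftMethod)

# A misspelled option is rejected instead of silently passing through
try:
    check_choice("prompt_tuning_config", PeftMethod)
    rejected = False
except ValueError:
    rejected = True
```

Raising a `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.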

In [14]:
# Get the infos for the specific dataset
dataset_infos = SUPER_GLUE_DATASETS_INFOS[task]
dataset_infos

SuperGLUETaskConfigs(task_name='multirc', feature_keys=['paragraph', 'question', 'answer'], label_key='label', label_names=['False', 'True'], evaluation_metric='acc_and_f1')

In [15]:
# Download and cache the dataset from the HuggingFace hub
dataset = load_dataset(DATASET_NAME, dataset_infos.task_name)

Downloading builder script:   0%|          | 0.00/30.7k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/38.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/14.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/27243 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4848 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9693 [00:00<?, ? examples/s]

In [16]:
# Get the pre-trained tokenizer
tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
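def _mark_span(text, span_str, span_idx, mark):
    # Wrap the span that starts at word index `span_idx` with `mark`, e.g. "* span *".
    # The pattern skips `span_idx` whitespace-separated words and then matches the span;
    # re.escape keeps punctuation in the span from being read as regex syntax.
    pattern = r'^((?:\S+\s){%d})(%s)' % (span_idx, re.escape(span_str))
    return re.sub(pattern, r'\1{0} \2 {0}'.format(mark), text)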
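
As a quick illustration of the span marking, the sketch below applies the same regular-expression idea to a WSC-style sentence. The helper is re-declared here (as `mark_span`, with `re.escape` applied to the span) so the example runs on its own; the sentence and the word indices are illustrative values.

```python
import re

def mark_span(text, span_str, span_idx, mark):
    # Skip `span_idx` whitespace-separated words, then wrap the span with `mark`
    pattern = r'^((?:\S+\s){%d})(%s)' % (span_idx, re.escape(span_str))
    return re.sub(pattern, r'\1{0} \2 {0}'.format(mark), text)

text = ("Mark told Pete many lies about himself, which Pete included in his book. "
        "He should have been more skeptical.")
span1_text, span1_index = "Mark", 0
span2_text, span2_index = "He", 13

# Marking the first span inserts two extra tokens ("*" before and after it)
marked = mark_span(text, span1_text, span1_index, "*")

# Shift the second index by 2 when the second span comes after the first one
shift = 2 if span1_index < span2_index else 0
marked = mark_span(marked, span2_text, span2_index + shift, "#")
print(marked)
```

The printed sentence reads `* Mark * told Pete ... his book. # He # should have been more skeptical.`, so the model sees both candidate spans explicitly delimited.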

In [18]:
def correct_inputs_targets(samples):
  # Convert a batch of raw samples into text-to-text pairs: "input" (prompt) and "target" (label text).
  new_samples = collections.defaultdict(list)
  keys = samples.keys()
  label_key = dataset_infos.label_key
  for values in zip(*samples.values()):
    num_answers, num_duplicates = 1, 1
    sample = dict(zip(keys, values))
    sentences = [task]  # Every input starts with the task name
    if task == "wsc":
      # Mark the first span with '*' and the second one with '#'
      text = sample["text"]
      text = _mark_span(text, sample["span1_text"], sample["span1_index"], '*')
      # Marking span1 adds two tokens, so shift span2 when it comes after span1
      span2_index_corrector = 1 if sample['span1_index'] < sample['span2_index'] else 0
      span2_index = sample["span2_index"] + 2 * span2_index_corrector
      text = _mark_span(text, sample["span2_text"], span2_index, '#')
      sentences.append(text)
    else:
      # Serialize each feature as "<feature_key>: <value>"
      for feature_key in dataset_infos.feature_keys:
        sentences.append(f"{feature_key}:")
        text = sample[feature_key]
        if not isinstance(text, str):
          text = ', '.join(text)
        sentences.append(text)
    sample["input"] = " ".join(sentences)
    if task == "record":
      # Turn the "@highlight" separators into regular sentence boundaries
      sample["input"] = re.sub(r'(\.|\?|\!|\"|\')\n@highlight\n', r'\1 ', sample["input"])
      sample["input"] = re.sub(r'\n@highlight\n', '. ', sample["input"])
      num_answers = len(sample[label_key])
      # One copy of the input per reference answer (at least one)
      num_duplicates = max(1, num_answers)
    elif task in ("multirc", "wsc"):
      # Strip HTML line breaks and bold tags
      sample["input"] = re.sub(r"<br>", " ", sample["input"])
      sample["input"] = re.sub(r"<(/)?b>", "", sample["input"])
    new_samples["input"].extend([sample["input"]] * num_duplicates)
    original_label = sample[label_key]
    if original_label == -1 or num_answers <= 0:
      # Unlabeled sample (label -1) or no reference answer
      new_samples["target"].extend(["<unk>"])
    elif dataset_infos.label_names is not None:
      # Map the class index to its text label
      text_label = dataset_infos.label_names[int(original_label)]
      new_samples["target"].extend([text_label])
    elif isinstance(original_label, list):
      new_samples["target"].extend(sample[label_key])
    else:
      new_samples["target"].extend([original_label])
  return new_samples
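
To make the text-to-text format concrete, the sketch below reproduces only the feature serialization and the MultiRC markup clean-up on a single sample. The `build_input` helper and the sample itself are made up for this example; they are not taken from the dataset.

```python
import re

def build_input(task, feature_keys, sample):
    """Serialize a sample as '<task> <key>: <value> ...' and strip MultiRC markup."""
    parts = [task]
    for key in feature_keys:
        parts.append(f"{key}:")
        value = sample[key]
        if not isinstance(value, str):
            value = ", ".join(value)  # Lists of strings are joined with commas
        parts.append(value)
    text = " ".join(parts)
    text = re.sub(r"<br>", " ", text)    # HTML line breaks become spaces
    text = re.sub(r"<(/)?b>", "", text)  # Bold tags are dropped
    return text

# Made-up sample in the shape of a MultiRC record
sample = {
    "paragraph": "<b>Sent 1: </b>The sky is blue.<br><b>Sent 2: </b>Grass is green.",
    "question": "What color is the sky?",
    "answer": "Blue",
    "label": 1,
}
label_names = ["False", "True"]

model_input = build_input("multirc", ["paragraph", "question", "answer"], sample)
target = label_names[sample["label"]]
print(model_input)
print(target)
```

The model therefore receives one flat string per sample and is trained to generate the label name (`True` or `False` for MultiRC) as text.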


In [19]:
def tokenizer_function(samples):
  # Tokenize the inputs, padding or truncating them to 512 tokens
  model_inputs = tokenizer(samples["input"], max_length=512, padding="max_length", truncation=True, return_tensors="pt")
  # Tokenize the targets, keeping at most 2 tokens (including the end-of-sequence token)
  target_ids = tokenizer(samples["target"], max_length=2, padding="max_length", truncation=True, return_tensors="pt")["input_ids"]
  # Replace padding ids with -100 so the loss ignores those positions
  target_ids[target_ids == tokenizer.pad_token_id] = -100
  model_inputs["label"] = target_ids
  return model_inputs
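
Padding positions in the targets are set to -100 because that is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the training loss. The sketch below shows the same masking on plain Python lists; the token ids are made-up values and `mask_padding` is a hypothetical helper, with 0 used as the pad id (the value of T5's pad token).

```python
PAD_TOKEN_ID = 0     # Pad token id of the T5 tokenizer
IGNORE_INDEX = -100  # Default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace every padding id with the ignore index, row by row."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_ids
    ]

# Made-up target ids padded to length 4 (1 stands for the end-of-sequence id)
label_ids = [
    [71, 1, 0, 0],
    [71, 72, 1, 0],
]
masked = mask_padding(label_ids)
print(masked)
```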

In [20]:
# Preprocess the raw dataset
corrected_dataset = dataset.map(correct_inputs_targets, batched=True, remove_columns=dataset["train"].column_names)
tokenized_dataset = corrected_dataset.map(tokenizer_function, batched=True, remove_columns=["input", "target"], load_from_cache_file=False)

Map:   0%|          | 0/27243 [00:00<?, ? examples/s]

Map:   0%|          | 0/4848 [00:00<?, ? examples/s]

Map:   0%|          | 0/9693 [00:00<?, ? examples/s]

Map:   0%|          | 0/27243 [00:00<?, ? examples/s]

Map:   0%|          | 0/4848 [00:00<?, ? examples/s]

Map:   0%|          | 0/9693 [00:00<?, ? examples/s]

In [21]:
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [22]:
match peft_method:
  case "lora":
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM
    )
  case "prompt_tuning":
    peft_config = PromptTuningConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        num_virtual_tokens=20,
        prompt_tuning_init=PromptTuningInit.TEXT,
        prompt_tuning_init_text="Answer the yes/no question about the passage:",
        inference_mode=False,
        tokenizer_name_or_path=model_checkpoint,
    )
  case "prefix_tuning":
    peft_config = PrefixTuningConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        num_virtual_tokens=20
    )
  case "p_tuning":
    peft_config = PromptEncoderConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        num_virtual_tokens=20
    )
  case _:
    peft_config = IA3Config(
        task_type=TaskType.SEQ_2_SEQ_LM
    )


In [23]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 282,624 || all params: 783,432,704 || trainable%: 0.03607508322756973


In [24]:
today = date.today()
experiment_id = f"{task}_{peft_method}_{today}"
ckpt_dir = f"Checkpoints/{experiment_id}"
output_dir = f"Models/{experiment_id}"
ckpt_dir, output_dir

('Checkpoints/multirc_ia3_2023-12-19', 'Models/multirc_ia3_2023-12-19')

In [25]:
training_args = TrainingArguments(
    ckpt_dir,
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    learning_rate=1e-3,
    gradient_accumulation_steps=8,
    num_train_epochs=5,
    save_steps=100,
    weight_decay=0.01,
    save_total_limit=8,
)
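
With gradient accumulation, the batch size that drives each optimizer update is larger than `per_device_train_batch_size`. The sketch below works through the arithmetic for the values above, assuming a single GPU (as the `nvidia-smi` output suggests) and the 27,243 training examples reported when the dataset was generated. The update counts are estimates; the exact number depends on how the `Trainer` handles the last partial accumulation window.

```python
import math

train_examples = 27_243           # Size of the MultiRC train split
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_devices = 1                   # Assumption: single-GPU Colab session
num_train_epochs = 5
save_steps = 100

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then the (estimated) optimizer updates
batches_per_epoch = math.ceil(train_examples / (per_device_train_batch_size * num_devices))
approx_updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
approx_total_updates = approx_updates_per_epoch * num_train_epochs

# Estimated number of checkpoints written over the whole run
approx_checkpoints = approx_total_updates // save_steps

print(effective_batch_size, batches_per_epoch, approx_updates_per_epoch, approx_total_updates, approx_checkpoints)
```

Since `save_total_limit=8`, only the most recent eight of these checkpoints remain on disk at any time.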

In [26]:
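def compute_metrics(pred):
    # Load the SuperGLUE metric for the current task
    metric = load(DATASET_NAME, dataset_infos.task_name)
    # The first element of the predictions tuple holds the decoder logits
    logits = pred.predictions[0]
    labels = pred.label_ids
    # Greedy choice: most likely token id at each target position
    preds = np.argmax(logits, axis=-1)
    # Compare only the first target token of each sample.
    # NOTE: for "multirc" and "record" the `super_glue` metric expects dictionaries
    # with an `idx` field rather than plain labels, so this call needs adapting there.
    return metric.compute(predictions=preds[:, 0], references=labels[:, 0])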
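
The `argmax` step can be illustrated without a model. In the sketch below, the logits have shape `[batch][position][vocab]`, and only the first target position is compared with the first label token, mirroring `preds[:, 0]` and `labels[:, 0]` above. The three-entry vocabulary, the logits, and the helper names are made up for this example.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def first_token_predictions(logits):
    """Most likely token id at the first target position of each sample."""
    return [argmax(sample[0]) for sample in logits]

# Made-up logits: 2 samples, 2 target positions, vocabulary of 3 tokens
logits = [
    [[0.1, 2.0, 0.3], [0.0, 0.1, 3.0]],
    [[1.5, 0.2, 0.1], [0.0, 0.1, 3.0]],
]
labels = [[1, 2], [1, 2]]

preds = first_token_predictions(logits)
first_labels = [row[0] for row in labels]
first_token_accuracy = sum(p == l for p, l in zip(preds, first_labels)) / len(first_labels)
print(preds, first_token_accuracy)
```

Comparing only the first token is sufficient when each label name maps to a distinct first token, which is an assumption worth verifying for each task's label set.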

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[neptune_callback],
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
model.config.use_cache = False

You are adding a <class 'transformers.integrations.integration_utils.NeptuneCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
NeptuneCallback
TensorBoardCallback


In [None]:
trainer.train()

https://app.neptune.ai/nssoares022/guided-research/e/GUID-136


Epoch,Training Loss,Validation Loss



In [None]:
trainer.save_model(output_dir)

In [None]:
# trainer.evaluate()

In [None]:
# notebook_login()

In [None]:
# trainer.push_to_hub()