# Exploratory Notebook

This notebook builds the experimental setup for the upcoming guided research experiments.

In [2]:
!pip install -U transformers[torch]\
accelerate\
datasets\
evaluate\
python-dotenv\
neptune



In [3]:
import os

from typing import Dict, get_args, Literal, Union
from dotenv import load_dotenv
from google.colab import drive

from evaluate import load
from datasets import load_dataset  # metrics are loaded through `evaluate.load` instead of `load_metric`
from huggingface_hub import notebook_login
import neptune
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments
)
from transformers.integrations import NeptuneCallback

In [4]:
GDRIVE_PATH='/content/gdrive/MyDrive/TUM/GR/guided_research_codes'

In [5]:
# Connecting Google Drive to Colab to have persistent storage across Colab sessions.
drive.mount('/content/gdrive', force_remount=True)
os.chdir(GDRIVE_PATH)
print(sorted(os.listdir()))

Mounted at /content/gdrive
['.env', '.git', '.gitignore', '.neptune', 'Checkpoints', 'Test_Taining_w_Neptune.ipynb', '__pycache__', 'config.py']


In [6]:
from config import SUPER_GLUE_TASKS_METRICS, GLUE_TASKS_METRICS

In [7]:
load_dotenv()

NEPTUNE_API_TOKEN = os.getenv('NEPTUNE_API_TOKEN')
GIT_SSH_TOKEN = os.getenv('GIT_SSH_TOKEN')
GITHUB_API_TOKEN = os.getenv('GITHUB_API_TOKEN')

In [8]:
os.environ["NEPTUNE_API_TOKEN"] = os.getenv("NEPTUNE_API_TOKEN")
os.environ["NEPTUNE_PROJECT"] = "nssoares022/guided-research"

In [9]:
run = neptune.init_run(project='nssoares022/guided-research',api_token=NEPTUNE_API_TOKEN)


https://app.neptune.ai/nssoares022/guided-research/e/GUID-8


In [10]:
neptune_callback = NeptuneCallback(run=run)

# Datasets

The scope of the research is to benchmark a language model on the [GLUE](https://huggingface.co/datasets/glue) or [SuperGLUE](https://huggingface.co/datasets/super_glue) benchmark. Both are available on the HuggingFace dataset hub.


## GLUE

The General Language Understanding Evaluation (GLUE) benchmark consists of 9 natural language understanding (NLU) tasks. All are classification tasks except STS-B, which is a regression task. All classification tasks are binary except MNLI, which has 3 classes. More details about the GLUE benchmark can be found [here](https://gluebenchmark.com/tasks).
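
The label counts described above decide the size of the classification head. The sketch below is a hypothetical lookup table (`GLUE_NUM_LABELS` and `num_labels_for` are illustrative names, not part of this notebook or of `config.py`); it assumes the usual HuggingFace convention that `num_labels=1` selects a regression head.

```python
# Hypothetical helper: label counts follow the GLUE description above.
GLUE_NUM_LABELS = {
    "cola": 2, "sst2": 2, "mrpc": 2, "qqp": 2,
    "qnli": 2, "rte": 2, "wnli": 2,
    "mnli": 3,  # entailment, contradiction, neutral
    "stsb": 1,  # regression: a single similarity score
}


def num_labels_for(task: str) -> int:
    """Return the number of model outputs needed for a GLUE task."""
    return GLUE_NUM_LABELS[task]


print(num_labels_for("mrpc"))
```

A value looked up this way could replace a hard-coded `num_labels=2` when switching tasks.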



### Single-Sentence Tasks

* CoLA - [The Corpus of Linguistic Acceptability](https://arxiv.org/abs/1805.12471) is a set of English sentences from published linguistics literature. The task is to predict whether a given sentence is grammatically correct or not.
  * **Metrics:** Matthews correlation coefficient.

* SST-2 - [The Stanford Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence: positive or negative.
  * **Metrics:** Accuracy.
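
As a worked illustration of the CoLA metric, the Matthews correlation coefficient for binary labels can be computed from the confusion-matrix counts. This is a minimal pure-Python sketch for intuition only (the function name is illustrative); the notebook itself relies on the `evaluate` library for metric computation.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect agreement
```

Unlike accuracy, the score stays at 0 for a model that always predicts the same class, which matters for unbalanced data such as CoLA.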

### Similarity and Paraphrase tasks

* MRPC - [The Microsoft Research Paraphrase Corpus](https://www.aclweb.org/anthology/I05-5002.pdf) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.
  * **Metrics:** F1-score and Accuracy.
* QQP - [The Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.
  * **Metrics:** F1-score and Accuracy.
* STS-B - [The Semantic Textual Similarity Benchmark](https://arxiv.org/abs/1708.00055) is a collection of sentence pairs drawn from news headlines, video, and image captions, and natural language inference data. The task is to determine how similar two sentences are.
  * **Metrics:** Pearson and Spearman correlation.
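
For the paraphrase tasks, accuracy and F1 can disagree noticeably when the classes are unbalanced. The following pure-Python sketch (an illustrative helper, not the `evaluate` implementation) shows how both are derived from the same predictions, with label 1 treated as the positive class.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 (positive class = 1) for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```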

### Inference Tasks

* MNLI - [The Multi-Genre Natural Language Inference Corpus](https://cims.nyu.edu/~sbowman/multinli/multinli_0.9.pdf) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The task has the matched (in-domain) and mismatched (cross-domain) sections.
  * **Metrics:** Accuracy.
* QNLI - Question NLI is derived from [The Stanford Question Answering Dataset](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf), a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question. The task is to determine whether the context sentence contains the answer to the question.
  * **Metrics:** Accuracy.
* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the second sentence is entailed by the first one.
  * **Metrics:** Accuracy.
* WNLI - Winograd NLI is derived from [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html), a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices (Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. 2012).
  * **Metrics:** Accuracy.

## SuperGLUE
SuperGLUE is a more recent benchmark styled after GLUE, with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. The 10 tasks listed below comprise 8 main tasks and 2 diagnostic sets (AXb and AXg). Most are framed as classification tasks; ReCoRD instead asks for an answer span from the passage. More details about the SuperGLUE benchmark can be found [here](https://super.gluebenchmark.com/tasks).


### Question Answering

* COPA - [The Choice of Plausible Alternatives](https://people.ict.usc.edu/~gordon/copa.html) dataset consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
  * **Metrics:** Accuracy.

* BoolQ - [The Boolean Questions](https://github.com/google-research-datasets/boolean-questions) is a question answering dataset for yes/no questions containing 15942 examples. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
   * **Metrics:** Accuracy.

* MultiRC - [The Multi-Sentence Reading Comprehension](https://cogcomp.seas.upenn.edu/multirc/) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
  * **Metrics:** Exact Match (EM) or F1-score.

### Common Sense Reasoning

* ReCoRD - [The Reading Comprehension with Commonsense Reasoning](https://sheng-z.github.io/ReCoRD-explorer/) consists of queries automatically generated from CNN/Daily Mail news articles. The answer to each query is a text span from a summarizing passage of the corresponding news.
   * **Metrics:** F1-score or Accuracy.

### Coreference Resolution

* WSC - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) consists of pairs of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution.
   * **Metrics:** Accuracy.

* AXg - [The Winogender Schema Diagnostics](https://github.com/rudinger/winogender-schemas) are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
   * **Metrics:** Gender Parity or Accuracy.

* WiC - [The Words in Context](https://pilehvar.github.io/wic/) is a dataset for evaluating context-sensitive word representations, framed as a binary classification task. Each instance has a target word w, either a verb or a noun, and two contexts, c1 and c2. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in c1 and c2 correspond to the same meaning.
   * **Metrics:** Accuracy.

### Inference Tasks



* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the second sentence is entailed by the first one.
  * **Metrics:** Accuracy.

* CB - [The CommitmentBank](https://github.com/mcdm/CommitmentBank) is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment canceling operator (question, modal, negation, antecedent of conditional).
  * **Metrics:** Avg. F1-score or Accuracy.

* AXb - [The Broad-coverage Diagnostics](https://gluebenchmark.com/diagnostics) is a set of sentence pairs labeled with their entailment relations (entailment, contradiction, or neutral) in both directions.
  * **Metrics:** Matthews correlation coefficient.

# Models

For the experiments, we are using the flan-T5 model at different sizes, all of which are available on the HuggingFace hub:

*   [flan-t5-base](https://huggingface.co/google/flan-t5-base) - 248M params
*   [flan-t5-large](https://huggingface.co/google/flan-t5-large) - 783M params
*   [flan-t5-xl](https://huggingface.co/google/flan-t5-xl) - 3B params

# Code

In [11]:
GlueTask = Literal["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In [12]:
SuperGlueTask = Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc.fixed"]

In [13]:
# Parameters to choose a dataset and a backbone model
dataset_name: Literal["super_glue", "glue"] = "glue"
task: Union[SuperGlueTask, GlueTask] = "mrpc"
# The flan-T5 checkpoints are the research targets; this run uses bert-base-uncased.
model_checkpoint: Literal["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl", "bert-base-uncased"] = "bert-base-uncased"

In [14]:
metric_name = ""
if dataset_name == "super_glue":
  assert task in get_args(SuperGlueTask)
elif dataset_name == "glue":
  assert task in get_args(GlueTask)
else:
  raise KeyError("Choose a valid dataset!")
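
The check above uses `assert`, which is skipped when Python runs with `-O`, and gives no message on failure. A hedged alternative is sketched below: `validate_task` is a hypothetical helper (not part of this notebook) that raises `ValueError` with the list of valid options. The `Literal` definitions are repeated so the snippet runs on its own.

```python
from typing import Literal, get_args

GlueTask = Literal["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
SuperGlueTask = Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc.fixed"]

# Map each benchmark name to the task names it accepts.
VALID_TASKS = {"glue": get_args(GlueTask), "super_glue": get_args(SuperGlueTask)}


def validate_task(dataset_name: str, task: str) -> None:
    """Raise ValueError if the benchmark or the task name is not supported."""
    if dataset_name not in VALID_TASKS:
        raise ValueError(f"Unknown dataset {dataset_name!r}; choose one of {sorted(VALID_TASKS)}")
    if task not in VALID_TASKS[dataset_name]:
        raise ValueError(f"Unknown task {task!r} for {dataset_name}; choose one of {VALID_TASKS[dataset_name]}")


validate_task("glue", "mrpc")  # passes silently
```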

In [15]:
# Download and cache the dataset from the HuggingFace hub
dataset = load_dataset(dataset_name, task)

In [16]:
# Get the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [17]:
def tokenize_function(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
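
The tokenization function above hard-codes the `sentence1`/`sentence2` columns, which fits MRPC but not every GLUE task. A minimal sketch of a task-aware variant follows; the column names are those of the HuggingFace `glue` dataset, while `GLUE_TASK_TO_KEYS` and `make_tokenize_function` are illustrative names introduced here, and the stand-in tokenizer at the end only exists to keep the example self-contained.

```python
# Text columns per GLUE task in the HuggingFace `glue` dataset.
GLUE_TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}


def make_tokenize_function(tokenizer, task):
    """Build a function for `dataset.map` that reads the right columns for `task`."""
    first_key, second_key = GLUE_TASK_TO_KEYS[task]

    def tokenize_function(example):
        if second_key is None:  # single-sentence task
            return tokenizer(example[first_key], truncation=True)
        return tokenizer(example[first_key], example[second_key], truncation=True)

    return tokenize_function


# Stand-in tokenizer (assumption for the demo): reports how many text inputs it received.
def fake_tokenizer(*texts, truncation=False):
    return {"num_inputs": len(texts), "truncation": truncation}


print(make_tokenize_function(fake_tokenizer, "cola")({"sentence": ["A sentence."]}))
```

With a real tokenizer, the returned function can be passed to `dataset.map(..., batched=True)` in the same way as the original one.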

In [18]:
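def compute_metrics(eval_preds):
  # Load the metric that matches the selected benchmark and task
  # (dataset_name and task are read from the notebook's global scope).
  metric = load(dataset_name, task)
  logits, labels = eval_preds
  # Pick the highest-scoring class; this assumes a classification task (not STS-B).
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)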
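
Taking the argmax over the logits is only appropriate for classification. For the STS-B regression task the model emits a single score per example, so the prediction is that score itself. The sketch below illustrates the distinction with plain Python lists; `logits_to_predictions` is a hypothetical helper and not part of this notebook.

```python
def logits_to_predictions(logits, is_regression=False):
    """Convert per-example model outputs into predictions.

    Classification: index of the largest logit in each row.
    Regression (e.g. STS-B): the single output value of each row.
    """
    if is_regression:
        return [float(row[0]) for row in logits]
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


print(logits_to_predictions([[0.1, 0.9], [2.0, -1.0]]))          # class indices
print(logits_to_predictions([[3.5], [0.2]], is_regression=True))  # raw scores
```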

In [19]:
# Preprocess the raw dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

In [20]:
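# Configure training: save checkpoints to ./Checkpoints and evaluate every 20 steps
training_args = TrainingArguments(
    output_dir="./Checkpoints",
    evaluation_strategy="steps",  # named `eval_strategy` in recent transformers releases
    eval_steps=20,
)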

In [21]:
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
trainer = Trainer(
    model,
    training_args,
    callbacks=[neptune_callback],
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

You are adding a <class 'transformers.integrations.integration_utils.NeptuneCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
NeptuneCallback
TensorBoardCallback


In [None]:
trainer.train()


https://app.neptune.ai/nssoares022/guided-research/e/GUID-9


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 20 | No log | 0.622005 | 0.683824 | 0.812227 |
| 40 | No log | 0.593884 | 0.710784 | 0.825444 |
| 60 | No log | 0.633249 | 0.627451 | 0.700787 |
| 80 | No log | 0.631209 | 0.64951 | 0.738574 |
| 100 | No log | 0.591876 | 0.708333 | 0.81493 |
| 120 | No log | 0.600747 | 0.691176 | 0.800633 |
| 140 | No log | 0.575678 | 0.713235 | 0.802698 |
| 160 | No log | 0.543566 | 0.75 | 0.836013 |
| 180 | No log | 0.596396 | 0.715686 | 0.826866 |
| 200 | No log | 0.54822 | 0.735294 | 0.835866 |


In [None]:
# trainer.evaluate()

In [None]:
# notebook_login()

In [None]:
# trainer.push_to_hub()