# Exploratory Notebook

This notebook builds the experimental setup for the upcoming guided research experiments.

In [17]:
!pip install transformers \
  datasets \
  evaluate \
  python-dotenv



In [18]:
import os

from typing import Dict, Literal

from evaluate import load
from datasets import load_dataset
from dotenv import load_dotenv
from google.colab import drive
from pprint import pprint
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments
)

In [19]:
GDRIVE_PATH = '/content/gdrive/MyDrive/TUM/GR/codes'

In [20]:
# Connecting Google Drive to Colab to have persistent storage across Colab sessions.
drive.mount('/content/gdrive', force_remount=True)
os.chdir(GDRIVE_PATH)
print(sorted(os.listdir()))

Mounted at /content/gdrive
['.env', '=0.13.0', '=0.15.2', '=1.2.2', '=1.22.4', '=1.5.3', '=2.0.1', '__pycache__', 'guided_research']


In [21]:
from config import SUPER_GLUE_TASKS_METRICS, GLUE_TASKS_METRICS

In [22]:
load_dotenv()

NEPTUNE_API_TOKEN = os.getenv('NEPTUNE_API_TOKEN')
GIT_SSH_TOKEN = os.getenv('GIT_SSH_TOKEN')
GITHUB_API_TOKEN = os.getenv('GITHUB_API_TOKEN')
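
A minimal sketch, assuming all three variables above are required for the later cells: it reports which ones are missing without ever printing their values. The helper name `missing_env_vars` and the `REQUIRED` tuple are hypothetical and not part of `config.py`.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty; values are never returned or printed."""
    source = os.environ if env is None else env
    return [name for name in names if not source.get(name)]


# Assumption: these are the variables the notebook expects to find in `.env`.
REQUIRED = ("NEPTUNE_API_TOKEN", "GIT_SSH_TOKEN", "GITHUB_API_TOKEN")
missing = missing_env_vars(REQUIRED)
print("Missing variables:", missing if missing else "none")
```

Checking for presence rather than displaying the variables keeps secrets out of the saved cell outputs.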

# Datasets

The scope of the research is to benchmark a language model on the [GLUE](https://huggingface.co/datasets/glue) or [SuperGLUE](https://huggingface.co/datasets/super_glue) benchmark. Both are available on the HuggingFace dataset hub.


## GLUE

The General Language Understanding Evaluation (GLUE) benchmark consists of 9 natural language understanding (NLU) tasks. All are classification tasks except STS-B, which is a regression task. All classification tasks are binary except MNLI, which has 3 classes. More details about the GLUE benchmark can be found [here](https://gluebenchmark.com/tasks).



### Single-Sentence Tasks

* CoLA - [The Corpus of Linguistic Acceptability](https://arxiv.org/abs/1805.12471) is a set of English sentences from published linguistics literature. The task is to predict whether a given sentence is grammatically correct or not.
  * **Metrics:** Matthews correlation coefficient (MCC).

* SST-2 - [The Stanford Sentiment Treebank](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence: positive or negative.
  * **Metrics:** Accuracy.
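
As a worked illustration of the CoLA metric, the sketch below computes the binary Matthews correlation coefficient from confusion-matrix counts. It is a plain-Python approximation for understanding the formula, not the implementation used by the `evaluate` library; returning 0.0 when the denominator is zero is an assumed convention.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC ranges from -1 (every prediction inverted) to 1 (perfect), with 0 meaning no better than chance, which makes it more informative than accuracy on imbalanced label sets.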

### Similarity and Paraphrase Tasks

* MRPC - [The Microsoft Research Paraphrase Corpus](https://www.aclweb.org/anthology/I05-5002.pdf) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.
  * **Metrics:** F1-score and Accuracy.
* QQP - [The Quora Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions is semantically equivalent.
  * **Metrics:** F1-score and Accuracy.
* STS-B - [The Semantic Textual Similarity Benchmark](https://arxiv.org/abs/1708.00055) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. The task is to determine how similar two sentences are.
  * **Metrics:** Pearson and Spearman correlation.
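
To make the two STS-B correlation metrics concrete, here is a small plain-Python sketch. Pearson measures linear agreement between predicted and gold similarity scores, while Spearman is Pearson applied to the ranks (ties receive their average rank). The function names are illustrative; in practice a library such as `scipy` or `evaluate` would be used.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 1.0, 2.0, 5.0]
pred = [0.1, 0.9, 2.5, 4.0]
print(pearson(gold, pred), spearman(gold, pred))
```

In this toy example the predictions preserve the ordering of the gold scores, so the Spearman value is 1 even though the Pearson value is below 1.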

### Inference Tasks

* MNLI - [The Multi-Genre Natural Language Inference Corpus](https://cims.nyu.edu/~sbowman/multinli/multinli_0.9.pdf) is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The task has matched (in-domain) and mismatched (cross-domain) sections.
  * **Metrics:** Accuracy.
* QNLI - [The Stanford Question Answering Dataset](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question. The task is to determine whether the context sentence contains the answer to the question.
  * **Metrics:** Accuracy.
* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the first sentence entails the second.
  * **Metrics:** Accuracy.
* WNLI - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices (Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. 2012).
  * **Metrics:** Accuracy.
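
The task descriptions above can be summarised in a small lookup table. This is only a sketch of one possible layout: the names `TaskInfo` and `GLUE_TASK_SUMMARY` and the metric strings are hypothetical labels chosen here, and they are not the contents of `GLUE_TASKS_METRICS` in `config.py`.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class TaskInfo(NamedTuple):
    kind: str                    # "classification" or "regression"
    num_classes: Optional[int]   # None for regression tasks
    metrics: Tuple[str, ...]     # illustrative metric labels


# Summary of the nine GLUE tasks as described in the text above.
GLUE_TASK_SUMMARY: Dict[str, TaskInfo] = {
    "cola": TaskInfo("classification", 2, ("matthews_correlation",)),
    "sst2": TaskInfo("classification", 2, ("accuracy",)),
    "mrpc": TaskInfo("classification", 2, ("f1", "accuracy")),
    "qqp": TaskInfo("classification", 2, ("f1", "accuracy")),
    "stsb": TaskInfo("regression", None, ("pearson", "spearman")),
    "mnli": TaskInfo("classification", 3, ("accuracy",)),
    "qnli": TaskInfo("classification", 2, ("accuracy",)),
    "rte": TaskInfo("classification", 2, ("accuracy",)),
    "wnli": TaskInfo("classification", 2, ("accuracy",)),
}

regression_tasks = [name for name, info in GLUE_TASK_SUMMARY.items() if info.kind == "regression"]
print(regression_tasks)
```

Keeping the task type next to the metric makes it easy to decide later whether a target should be rendered as a class name or as a number.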

## SuperGLUE
SuperGLUE is a more recent benchmark styled after GLUE, with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. It consists of 10 tasks: 8 main tasks and 2 diagnostic sets (AXb and AXg). All are classification tasks, except for those in the Similarity and Paraphrase set. More details about the SuperGLUE benchmark can be found [here](https://super.gluebenchmark.com/tasks).


### Single-Sentence Tasks

* AXb - [The Broad-coverage Diagnostics](https://gluebenchmark.com/diagnostics) is a set of sentence pairs labeled with their entailment relations (entailment, contradiction, or neutral) in both directions.
  * **Metrics:** Matthews correlation coefficient (MCC).

* CB - [The CommitmentBank](https://github.com/mcdm/CommitmentBank) is a corpus of 1,200 naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment-canceling operator (question, modal, negation, antecedent of conditional).
  * **Metrics:** Avg. F1-score and Accuracy.

* WiC - [The Words in Context](https://pilehvar.github.io/wic/) is a dataset for evaluating context-sensitive word representations, framed as a binary classification task. Each instance has a target word w, either a verb or a noun, and two contexts, c1 and c2. Each of these contexts triggers a specific meaning of w. The task is to identify whether the occurrences of w in c1 and c2 correspond to the same meaning.
  * **Metrics:** Accuracy.

### Similarity and Paraphrase Tasks

* MultiRC - [The Multi-Sentence Reading Comprehension](https://cogcomp.seas.upenn.edu/multirc/) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
  * **Metrics:** Exact Match (EM) and F1-score.

* COPA - [The Choice of Plausible Alternatives](https://people.ict.usc.edu/~gordon/copa.html) dataset consists of 1000 questions, split equally into development and test sets of 500 questions each. Each question is composed of a premise and two alternatives, where the task is to select the alternative that more plausibly has a causal relation with the premise.
  * **Metrics:** Accuracy.

* WSC - [The Winograd Schema Challenge](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) consists of pairs of sentences that differ in only one or two words and that contain an ambiguity that is resolved in opposite ways in the two sentences and requires the use of world knowledge and reasoning for its resolution.
  * **Metrics:** Accuracy.

* ReCoRD - [The Reading Comprehension with Commonsense Reasoning](https://sheng-z.github.io/ReCoRD-explorer/) consists of queries automatically generated from CNN/Daily Mail news articles. The answer to each query is a text span from a summarizing passage of the corresponding news.
  * **Metrics:** F1-score and Exact Match (EM).

* AXg - [The Winogender Schema Diagnostics](https://github.com/rudinger/winogender-schemas) are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
  * **Metrics:** Gender Parity and Accuracy.
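
A short sketch of the gender parity score for AXg, assuming the usual definition: the share of minimal pairs for which the model makes the same prediction on both sentences. The function name and the pair representation are illustrative choices.

```python
from typing import Sequence, Tuple


def gender_parity(pair_predictions: Sequence[Tuple[int, int]]) -> float:
    """Fraction of minimal pairs whose two predictions agree."""
    if not pair_predictions:
        raise ValueError("need at least one pair of predictions")
    same = sum(1 for first, second in pair_predictions if first == second)
    return same / len(pair_predictions)


# Each tuple holds the predicted labels for the two sentences of one minimal pair.
print(gender_parity([(0, 0), (1, 1), (0, 1), (1, 1)]))
```

A model can reach perfect parity by always predicting the same class, which is why parity is read together with accuracy.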

### Inference Tasks



* RTE - [The Recognizing Textual Entailment](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) datasets come from a series of annual textual entailment challenges. The task is to determine whether the first sentence entails the second.
  * **Metrics:** Accuracy.

* BoolQ - [The Boolean Questions](https://github.com/google-research-datasets/boolean-questions) is a question answering dataset for yes/no questions containing 15942 examples. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
  * **Metrics:** Accuracy.

# Models

For the experiments, we use the Flan-T5 model at different sizes, all of which are available on the HuggingFace hub:

*   [flan-t5-base](https://huggingface.co/google/flan-t5-base) - 248M params
*   [flan-t5-large](https://huggingface.co/google/flan-t5-large) - 783M params
*   [flan-t5-xl](https://huggingface.co/google/flan-t5-xl) - 3B params
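
As a rough guide to which sizes fit on a given Colab GPU, the parameter counts above can be turned into a back-of-the-envelope estimate of the memory needed for the weights alone. This sketch assumes 4 bytes per parameter in fp32 and 2 bytes in 16-bit formats, and it deliberately ignores activations, gradients, and optimizer state, which add considerably more during fine-tuning.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as listed above (the xl figure is rounded).
PARAM_COUNTS = {"flan-t5-base": 248e6, "flan-t5-large": 783e6, "flan-t5-xl": 3e9}

for name, num_params in PARAM_COUNTS.items():
    fp32 = weight_memory_gb(num_params, 4)
    half = weight_memory_gb(num_params, 2)
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{half:.2f} GB in 16-bit")
```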

# Code

In [None]:
dataset_name: Literal["super_glue", "glue"] = "super_glue"
if dataset_name == "super_glue":
  task: Literal["axb", "axg", "boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc.fixed"] = "rte"
  metric_name: str = SUPER_GLUE_TASKS_METRICS[task]
else:
  task: Literal["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"] = "rte"
  metric_name: str = GLUE_TASKS_METRICS[task]


In [None]:
# Download and load the dataset from the HuggingFace hub.
dataset = load_dataset(dataset_name, task)
metric = load(metric_name)

In [None]:
# Plain T5 checkpoints could be used instead: "t5-small", "t5-base", "t5-large", "t5-3b"

In [None]:
# Get the pre-trained T5 model and tokenizer as backbone
model_checkpoint: Literal["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl"] = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
# Preprocess the dataset for T5
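
The preprocessing cell above is still empty. One possible way to fill it for the RTE task is sketched below, under several assumptions: the column names `premise`, `hypothesis`, and `label` and the label order (0 = entailment, 1 = not_entailment) follow the SuperGLUE RTE configuration, unlabeled test examples carry a label of -1, and the prompt wording is an arbitrary choice rather than the format Flan-T5 was trained on. The helper names and length limits are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumed label order for RTE: 0 = entailment, 1 = not_entailment.
RTE_LABELS = ("entailment", "not_entailment")


def format_rte_example(premise: str, hypothesis: str, label: int) -> Dict[str, str]:
    """Turn one RTE example into an input text and a target text."""
    # Assumption: unlabeled test examples use -1; give them an empty target.
    target = RTE_LABELS[label] if 0 <= label < len(RTE_LABELS) else ""
    return {
        "input_text": f"rte premise: {premise} hypothesis: {hypothesis}",
        "target_text": target,
    }


def preprocess_batch(
    batch: Dict[str, List[Any]],
    tokenizer: Callable[..., Dict[str, Any]],
    max_input_length: int = 256,
    max_target_length: int = 8,
) -> Dict[str, Any]:
    """Batched map function: tokenizes inputs and stores target token ids under 'labels'."""
    pairs = [
        format_rte_example(premise, hypothesis, label)
        for premise, hypothesis, label in zip(
            batch["premise"], batch["hypothesis"], batch["label"]
        )
    ]
    model_inputs = tokenizer(
        [pair["input_text"] for pair in pairs],
        max_length=max_input_length,
        truncation=True,
    )
    targets = tokenizer(
        [pair["target_text"] for pair in pairs],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs


print(format_rte_example("A man is eating.", "Someone is eating.", 0))
```

With the objects loaded earlier, this could be applied as `dataset.map(lambda batch: preprocess_batch(batch, tokenizer), batched=True)`; other tasks would need their own formatting function because their column names differ.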

In [None]:
# trainer = Seq2SeqTrainer(
#     model,
#     args,
#     train_dataset=dataset["train"],
#     eval_dataset=dataset["validation"],
#     data_collator=data_collator,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics
# )
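
The commented-out trainer call refers to a `compute_metrics` function that is not defined anywhere in this notebook. In a text-to-text setup the predictions are token ids, so they would first need to be decoded to strings (for example with `tokenizer.batch_decode(..., skip_special_tokens=True)`). The sketch below covers only the step after decoding, for RTE accuracy; the helper names and the handling of unparseable output are assumptions.

```python
from typing import Sequence

# Assumed label order for RTE: 0 = entailment, 1 = not_entailment.
RTE_LABELS = ("entailment", "not_entailment")


def text_to_label(text: str) -> int:
    """Map generated text to a label id; -1 marks text that matches no label."""
    cleaned = text.strip().lower()
    return RTE_LABELS.index(cleaned) if cleaned in RTE_LABELS else -1


def accuracy_from_text(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Accuracy over decoded strings; unparseable predictions count as wrong."""
    if not references:
        raise ValueError("references must not be empty")
    pred_ids = [text_to_label(p) for p in predictions]
    ref_ids = [text_to_label(r) for r in references]
    correct = sum(1 for p, r in zip(pred_ids, ref_ids) if p == r and r != -1)
    return correct / len(ref_ids)


decoded_predictions = ["entailment", "Not_entailment ", "maybe"]
decoded_references = ["entailment", "not_entailment", "entailment"]
print(accuracy_from_text(decoded_predictions, decoded_references))
```

Counting free-form output such as "maybe" as an error keeps the score comparable to a classifier that must always pick one of the labels.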

In [25]:
!ls

'=0.13.0'  '=0.15.2'  '=1.2.2'	'=1.22.4'  '=1.5.3'  '=2.0.1'   guided_research   __pycache__


In [26]:
os.chdir('/content/gdrive/MyDrive/TUM/GR/codes/guided_research')

In [27]:
!ls

config.py  Test_Taining_w_Neptune.ipynb


In [34]:
!git status

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   Test_Taining_w_Neptune.ipynb
	new file:   config.py

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Test_Taining_w_Neptune.ipynb



In [35]:
# Confirm the token was loaded without echoing the secret itself.
print('GITHUB_API_TOKEN loaded:', bool(GITHUB_API_TOKEN))

GITHUB_API_TOKEN loaded: True

In [None]:
!git config --global user.password

In [None]:
!git remote set-url origin https://github.com/natalisso/guided_research.git

In [None]:
!git add .
!git commit -m "Initial commit"
!git push --set-upstream origin master

[master cb1cff8] Initial commit
 1 file changed, 1 insertion(+), 1 deletion(-)
remote: HTTP Basic: Access denied. The provided password or token is incorrect or your account has 2FA enabled and you must use a personal access token instead of a password. See https://gitlab.lrz.de/help/topics/git/troubleshooting_git#error-on-git-fetch-http-basic-access-denied
fatal: Authentication failed for 'https://gitlab.lrz.de/00000000014B20EC/guided_research/'
