# Knowledge Intensive NLP Summer School

## Notebook 3

The goals of this notebook are:

* Train a T5 model for the SQuAD task
* Evaluate what happens when the context isn't used
* Explore the Fusion-in-Decoder (FiD) model


## Resources
You can find help for the HuggingFace libraries on their website:

* T5 https://huggingface.co/docs/transformers/model_doc/t5
* Datasets https://huggingface.co/docs/datasets/index

## Tutorial

This notebook is based on the following tutorials:

* Fine-tuning https://huggingface.co/docs/transformers/training
* Language Generation https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html


# Prelude

The following code loads the SQuAD dataset and trains a model to predict an answer given a question and a passage. Read through it and familiarize yourself with each step before starting the exercises.

In [1]:
!pip install xfact



In [2]:
import logging
import os
import sys
import transformers
import datasets 

from transformers import (
    AutoConfig,
    AutoTokenizer,
    HfArgumentParser,
    Seq2SeqTrainingArguments,
    set_seed, AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification
)
from transformers import Seq2SeqTrainer
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from collections import defaultdict
from operator import itemgetter

from xfact.config.args import ModelArguments, DataTrainingArguments
from xfact.logs.comet_callback import CometTrainingCallback
from xfact.logs.logs import setup_logging
from xfact.nlp.dataset import XFactDataset, XFactSeq2SeqDataset
from xfact.nlp.model import ModelFactory
from xfact.nlp.post_processing import PostProcessor
from xfact.nlp.reader import Reader
from xfact.nlp.scoring import Scorer
from xfact.registry.module import import_submodules


check_min_version("4.16.0")
logger = logging.getLogger(__name__)
set_seed(1337)

setup_logging("INFO")
transformers.utils.logging.set_verbosity("INFO")
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()


comet_ml is installed but `COMET_API_KEY` is not set.


In [3]:
model_args = ModelArguments(
    model_name_or_path="t5-small",
)
data_args = DataTrainingArguments(
    dataset="squad",
    train_file="train",
    validation_file="validation",
)


# NOTE: set use_mps_device=False if training fails, e.g. on a machine without Apple's MPS backend
training_args = Seq2SeqTrainingArguments(
    learning_rate=5e-5,
    logging_steps=10,
    logging_strategy="steps",
    eval_steps=10,
    evaluation_strategy="steps",
    do_eval=True,
    do_train=True,
    output_dir="test",
    use_mps_device=True) 

[INFO|training_args.py:1267] 2023-07-05 16:09:40,323 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1669] 2023-07-05 16:09:40,324 >> PyTorch: setting up devices
[INFO|training_args.py:1407] 2023-07-05 16:09:40,380 >> The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [6]:
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.

tokenizer = AutoTokenizer.from_pretrained(
    # tokenizer_name is None unless set explicitly, so fall back to the model name
    model_args.tokenizer_name or model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
)

In [None]:
# Load the dataset. For now we only use 1000 training examples and 100 validation examples.
squad_dataset = datasets.load_dataset(data_args.dataset, split=["train[:1000]","validation[:100]"])
squad_dataset = {
    "train":squad_dataset[0],
    "validation":squad_dataset[1],
}

In [None]:
squad_dataset['train'][0]

In [None]:
max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)

data_files = {
    "train": data_args.train_file,
    "validation": data_args.validation_file
}

class SQUADWithQuestionAndContext(XFactSeq2SeqDataset):
    def prepare_src(self, instance):
        return instance["question"] + " - " + instance["context"]

    def prepare_tgt(self, instance):
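        # Guard against instances with no gold answer (indexing an empty list would raise IndexError)
        texts = instance["answers"]["text"]
        return (texts[0] if texts else "") or "No Answer"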
        
loaded_datasets = {
    split: SQUADWithQuestionAndContext(tokenizer,
                                           squad_dataset[split],
                                           max_seq_length,
                                           name=split,
                                           max_target_length=data_args.max_target_length)
    for split, path in data_files.items()
}
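To see what the model actually receives, it can help to look at the source and target strings outside the dataset class. The sketch below mirrors the `prepare_src` / `prepare_tgt` logic above as plain functions. The function names, the `use_context` flag, and the example instance are illustrative assumptions and are not part of `xfact`.

```python
def format_source(instance, use_context=True):
    """Build the encoder input: the question, optionally followed by the passage."""
    if use_context:
        return instance["question"] + " - " + instance["context"]
    return instance["question"]


def format_target(instance, fallback="No Answer"):
    """Use the first gold answer as the decoder target, or a fallback if there is none."""
    texts = instance["answers"]["text"]
    return (texts[0] if texts else "") or fallback


# A made-up instance in the SQuAD field layout (not taken from the dataset)
example = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "answers": {"text": ["William Shakespeare"], "answer_start": [31]},
}

print(format_source(example))
print(format_source(example, use_context=False))
print(format_target(example))
```

Printing both variants side by side makes it clear how much information the model loses when the passage is dropped.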


In [None]:
data_collator = lambda batch: XFactSeq2SeqDataset.collate_fn(model, batch, tokenizer.pad_token_id, data_args.ignore_pad_token_for_loss)

def my_metrics_function(eval_predictions):
    print(str(eval_predictions))
    return {"metric_name":0}

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=loaded_datasets["train"],
    eval_dataset=loaded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=my_metrics_function,
)


train_result = trainer.train()
trainer.save_model()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()


# Exercises

## Seq2seq model T5
1) Adapt the code to report the exact-match accuracy of the predicted answers (see the `compute_metrics` argument of the HuggingFace `Trainer`)
2) Compare a different version of the task: evaluate whether the model can accurately predict answers without the context paragraph. How does the exact-match accuracy change?
3) Train a GENRE-style information retrieval system
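
As a starting point for exercise 1, here is a minimal sketch of exact-match scoring on already-decoded answer strings. The normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) follows the usual SQuAD evaluation convention. The function names are illustrative and not part of `xfact` or `transformers`. Wiring this into `compute_metrics` is left to you: you would likely need to decode the predicted and label token IDs with `tokenizer.batch_decode`, replace any `-100` label padding with the pad token ID first, and set `predict_with_generate=True` in the training arguments so that predictions are generated sequences.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


# One match after normalization, one miss
score = exact_match_accuracy(["The Eiffel Tower.", "paris"], ["eiffel tower", "London"])
print(score)
```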


## Extension exercises

1) Look at the `forward` method of the `T5Model` class in HuggingFace (https://github.com/huggingface/transformers/blob/ee339bad01bf09266eba665c5f063f0ab7474dad/src/transformers/models/t5/modeling_t5.py#L1414). Then explore the Fusion-in-Decoder library (https://github.com/facebookresearch/fid). How would you change this method to encode multiple passages separately?
2) Install the FAISS library and adapt the code from yesterday's lab to index the DPR-encoded contexts in FAISS, then measure the speed-up
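
For extension exercise 1, it may help to separate the idea from the tensor code. In Fusion-in-Decoder, the question is paired with each passage, every pair is encoded independently, and the per-passage encoder outputs are concatenated along the sequence axis so that the decoder can attend over all of them at once. The sketch below shows only that data flow, using nested lists in place of tensors. `toy_encode` is a stand-in and not a real encoder, the `question: ... context: ...` prefix is an assumed input format, and all names are hypothetical.

```python
def build_fid_inputs(question, passages):
    """Pair the question with each passage so every passage is encoded on its own."""
    return [f"question: {question} context: {passage}" for passage in passages]


def toy_encode(text):
    """Stand-in encoder: one 2-dimensional 'hidden state' per whitespace token."""
    return [[float(len(token)), float(position)]
            for position, token in enumerate(text.split())]


def fuse_encoder_outputs(encoded_passages):
    """Concatenate per-passage hidden states along the sequence axis."""
    fused = []
    for states in encoded_passages:
        fused.extend(states)
    return fused


question = "Who wrote Hamlet?"
passages = [
    "Hamlet is a tragedy written by William Shakespeare.",
    "The play is set in Denmark.",
]

inputs = build_fid_inputs(question, passages)
encoded = [toy_encode(text) for text in inputs]  # each passage encoded separately
fused = fuse_encoder_outputs(encoded)            # a single sequence for the decoder

print(len(inputs), [len(states) for states in encoded], len(fused))
```

In a real implementation the same effect is usually achieved by reshaping a batch of shape (batch, passages, length) before and after the encoder call, which is the part of `forward` the exercise asks you to think about.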