# Evaluating fine-tuned model for Subtask B

Set the model checkpoint to use. To evaluate Subtask B, we use our fine-tuned model.

In [8]:
model_checkpoint = "JazibEijaz/bert-base-uncased-finetuned-semeval2020-task4b"

## Loading the dataset

In [9]:
from datasets import load_dataset

datasets = load_dataset('csv', data_files={'validation': '../input/semeval-test/subtaskB.csv'})
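
The preprocessing and collation code later in this notebook reads the columns `false_sent`, `correct_sent`, `explanation0`, `explanation1`, `explanation2`, and a `label` (or `labels`) column. As a minimal sketch, assuming the CSV has a header row with those names, a quick check for missing columns could look like the following. The helper `missing_columns` and the toy rows are hypothetical and are not taken from the SemEval file.

```python
import csv
import io

# Columns that the preprocessing function reads by name.
REQUIRED_COLUMNS = ["false_sent", "correct_sent", "explanation0", "explanation1", "explanation2"]

def missing_columns(csv_text):
    """Return the required column names absent from the CSV header (hypothetical helper)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    # The data collator accepts either "label" or "labels".
    if "label" not in header and "labels" not in header:
        missing.append("label")
    return missing

# Invented toy file, for illustration only.
toy_csv = (
    "false_sent,correct_sent,explanation0,explanation1,explanation2,label\n"
    "He ate a car.,He ate a pie.,Cars are not food.,Cars are red.,Pies are round.,0\n"
)
complete = missing_columns(toy_csv)
incomplete = missing_columns("false_sent,label\nx,0\n")
print(complete)    # []
print(incomplete)  # ['correct_sent', 'explanation0', 'explanation1', 'explanation2']
```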

## Tokenizing the data

Before we can feed those texts to our model, we need to tokenize them. This is done by a Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.


In [10]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can then write the function that will preprocess our samples. The tricky part is to put all the possible pairs of sentences in two big lists before passing them to the tokenizer, then un-flatten the result so that each example has three sets of input IDs, attention masks, etc.

When calling the `tokenizer`, we use the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [11]:
explanation_names = ["explanation0", "explanation1", "explanation2"]

def preprocess_function(examples):
    # Repeat each false sentence three times to go with the three explanations.
    false_sentences = [[context] * 3 for context in examples["false_sent"]]
    correct_sentences = examples["correct_sent"]
    
    # Setup 1 (active): each explanation becomes " this is false because <explanation>".
    # `header` (the correct sentence) is unused in this setup.
    explanations = [[f" this is false because {examples[explanation][i]}" for explanation in explanation_names] for i, header in enumerate(correct_sentences)]
    
    # Setup 2: also append the correct statement. To use it, comment out the setup 1 line above and uncomment the following line.
    # explanations = [[f" this is false because {examples[explanation][i]} [SEP] correct statement should be {header}" for explanation in explanation_names] for i, header in enumerate(correct_sentences)]

    # Flatten everything
    false_sentences = sum(false_sentences, [])
    explanations = sum(explanations, [])
    
    # Tokenize
    
    # Setup 1 (active):
    tokenized_examples = tokenizer(false_sentences, explanations, truncation=True)
    
    # Setup 2: comment out the line above and uncomment the following line.
    # tokenized_examples = tokenizer(false_sentences, explanations, truncation=True, add_special_tokens=True)
    
    # Un-flatten
    return {k: [v[i:i+3] for i in range(0, len(v), 3)] for k, v in tokenized_examples.items()}
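
To make the flatten and un-flatten steps concrete, here is a minimal sketch on invented toy data. It follows the setup 1 template from the function above but replaces the real tokenizer with a stand-in that joins each pair into a string, so it runs without downloading anything. The names `fake_tokenized` and `regrouped` are illustrative only.

```python
# Toy batch in the same column layout as the dataset (invented values).
examples = {
    "false_sent": ["s0", "s1"],
    "explanation0": ["a0", "a1"],
    "explanation1": ["b0", "b1"],
    "explanation2": ["c0", "c1"],
}
explanation_names = ["explanation0", "explanation1", "explanation2"]

# Repeat each false sentence three times, once per explanation.
false_sentences = [[s] * 3 for s in examples["false_sent"]]
explanations = [
    [f" this is false because {examples[name][i]}" for name in explanation_names]
    for i in range(len(examples["false_sent"]))
]

# Flatten into two parallel lists of length 2 * 3 = 6.
flat_sentences = sum(false_sentences, [])
flat_explanations = sum(explanations, [])

# Stand-in for the tokenizer: one fake "input_ids" entry per sentence pair.
fake_tokenized = {
    "input_ids": [f"{s}|{e.strip()}" for s, e in zip(flat_sentences, flat_explanations)]
}

# Un-flatten: regroup every three consecutive entries under one example.
regrouped = {k: [v[i:i + 3] for i in range(0, len(v), 3)] for k, v in fake_tokenized.items()}
print(regrouped["input_ids"][0])
```

Each entry of `regrouped["input_ids"]` holds the three candidate pairs for one example, which is the nested layout the multiple-choice model expects after padding.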

We can apply this function to all the examples in our dataset using the `map` method of the `datasets` object we created earlier. With `batched=True`, the function receives batches of examples (a dict of column lists) rather than one example at a time, which is what `preprocess_function` expects.

In [12]:
encoded_datasets = datasets.map(preprocess_function, batched=True)

## Evaluating the model

Now that our data is ready, we can download the fine-tuned model and evaluate it. Since our task is multiple choice, we use the `AutoModelForMultipleChoice` class. As with the tokenizer, the `from_pretrained` method downloads the model for us.

In [13]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)

To instantiate a `Trainer`, we need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that holds all the attributes used to customize the `Trainer`. It requires one folder name, which is used to save the model outputs; all other arguments are optional. Since we only evaluate here, we set just the evaluation batch size:

In [14]:
batch_size = 128
args = TrainingArguments(
    "taskB",
    per_device_eval_batch_size=batch_size,
)

Then we need to tell our `Trainer` how to form batches from the pre-processed inputs. We haven't done any padding yet because we will pad each batch to the maximum length inside the batch (instead of doing so with the maximum length of the whole dataset). This will be the job of the *data collator*. A data collator takes a list of examples and converts them to a batch (by, in our case, applying padding). Since there is no data collator in the library that works on our specific problem, we will write one, adapted from the `DataCollatorWithPadding`:

In [15]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
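
To see what the collator does to tensor shapes, here is a minimal sketch of the same pad-then-reshape idea in plain NumPy instead of `tokenizer.pad` and PyTorch. The helper `pad_and_stack`, the toy token IDs, and the choice of `pad_id=0` are assumptions made for illustration; the real collator takes the padding ID from the tokenizer.

```python
import numpy as np

def pad_and_stack(features, pad_id=0):
    """Pad every choice to the longest sequence in the batch, then reshape.

    `features` is a list of examples, each a list of token-ID lists (one per choice).
    Hypothetical helper; pad_id=0 is an assumption for this sketch.
    """
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten to (batch_size * num_choices) sequences so they can be padded together.
    flat = [ids for example in features for ids in example]
    max_len = max(len(ids) for ids in flat)
    input_ids = np.full((len(flat), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(flat), max_len), dtype=np.int64)
    for row, ids in enumerate(flat):
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1  # 1 marks real tokens, 0 marks padding
    # Un-flatten to (batch_size, num_choices, max_len).
    return {
        "input_ids": input_ids.reshape(batch_size, num_choices, -1),
        "attention_mask": attention_mask.reshape(batch_size, num_choices, -1),
    }

# Two invented examples with three choices each; the longest sequence has 5 tokens.
toy = [
    [[101, 7, 102], [101, 7, 8, 102], [101, 102]],
    [[101, 9, 102], [101, 102], [101, 9, 9, 9, 102]],
]
batch = pad_and_stack(toy)
print(batch["input_ids"].shape)  # (2, 3, 5)
```

Because padding is computed per batch, a batch of short sentences stays short instead of being padded to the longest sequence in the whole dataset.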

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We define a function for this that computes accuracy directly with NumPy; the only preprocessing needed is to take the argmax of the predicted logits:

In [16]:
import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
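
As a quick check of the metric logic, the same argmax-and-compare computation can be run on a handful of invented logits (4 examples, 3 choices each). The values below are made up for illustration and are not model outputs.

```python
import numpy as np

# Invented logits for 4 examples with 3 candidate explanations each.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 3.0, 0.5],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 2, 1, 2])

# Index of the highest-scoring explanation per example.
preds = np.argmax(logits, axis=1)
# Fraction of examples where the predicted index matches the label.
accuracy = (preds == labels).astype(np.float32).mean().item()
print(preds.tolist(), accuracy)  # [0, 2, 1, 0] 0.75
```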

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [17]:
trainer = Trainer(
    model,
    args,
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

We can now evaluate our model by just calling the `evaluate` method:

In [18]:
trainer.evaluate()