# Pre-trained Language Models: SubTask C

The SubTask C project focused on fine-tuning a pre-trained BART model for sequence-to-sequence generation. The task required generating a valid reason explaining why a given statement did not make sense. For example, the statement "He put an elephant into the fridge" had three reference reasons: "An elephant is much bigger than a fridge," "A fridge is much smaller than an elephant," and "Most of the fridges aren't large enough to contain an elephant."

This subtask was treated as a Sequence-to-Sequence problem where the input consisted of the nonsensical statement and the output was a valid reason. Fine-tuning utilized the Transformers library and the Hugging Face Hub, with experiments conducted on a reduced dataset and the base BART model. Full-scale training required GPU or TPU resources due to computational demands.

In [1]:
shrink_dataset = False
base_model = False
colab = True

The experiments were conducted with `shrink_dataset` and `base_model` set to True, and `colab` set to False. This configuration was used for all output examples in the notebook. Full training to obtain the model's actual performance was performed on Google Colab, a Jupyter notebook environment that provides GPU and TPU support for large-scale deep learning. For full-scale training, `shrink_dataset` and `base_model` were set to False, and `colab` was set to True.

In [2]:
if colab:
    ! pip install transformers==4.28.0 datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data

The following objects and functions were used:

In [3]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          DataCollatorForSeq2Seq, enable_full_determinism)

When working with neural networks, many operations are random, such as initializing the network weights, shuffling the training data, or sampling examples. As a result, different training runs of the same model can produce different results. To ensure reproducibility, i.e. obtaining the same results across runs, the random number generator must be initialized with a fixed value known as a seed. In Transformers, this can be done as follows:

In [4]:
enable_full_determinism(seed=42)
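
The effect of a fixed seed can be seen without any neural network. The sketch below uses only Python's standard `random` module. `shuffled_ids` is a hypothetical helper and not part of Transformers, and the example does not attempt to reproduce everything `enable_full_determinism` configures.

```python
import random

def shuffled_ids(seed, n=10):
    """Return the ids 0..n-1 shuffled by a generator initialized with `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    ids = list(range(n))
    rng.shuffle(ids)
    return ids

run_a = shuffled_ids(seed=42)
run_b = shuffled_ids(seed=42)
run_c = shuffled_ids(seed=7)

print(run_a == run_b)                  # True: same seed, same order
print(sorted(run_c) == sorted(run_a))  # True: another seed only changes the order
```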

Reproducibility in neural networks remained sensitive to factors such as software versions and hardware, so even with fixed seed initialization, minor differences in results could occur. Neural network training also required specifying multiple hyperparameters to configure the model. Determining optimal hyperparameter values involved training the model with different combinations and evaluating performance on the development set. This process was computationally intensive and required multiple experimental runs. For this project, predefined hyperparameter values were used.

In [5]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 25  # Maximum length (in tokens) of the input and label sequences
output_dir = "modelC"  # The output directory where the model will be written
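
The hyperparameter search described above was not run for this project. As a rough sketch of what it could look like, the example below tries every combination in a small grid and keeps the best one. `grid_search` and `fake_score` are hypothetical names: `fake_score` is only a stand-in so that the code runs, whereas a real scorer would fine-tune the model and return a development-set metric.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every combination in `grid` and return the best (score, params)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)  # higher is assumed to be better
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Stand-in scorer (NOT a real training run): peaks at epochs=3, learning_rate=1e-5
def fake_score(epochs, learning_rate):
    return -abs(epochs - 3) - abs(learning_rate - 1e-5) * 1e4

grid = {"epochs": [1, 3, 5], "learning_rate": [1e-5, 5e-5]}
best_score, best_params = grid_search(fake_score, grid)
print(best_params)
```

Each grid point corresponds to one full training run, which is why the search is expensive in practice.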

The notebook for this assignment provided minimal guidance. Users were expected to consult the [Transformers documentation](https://huggingface.co/docs) for detailed instructions on completing the project.

## Loading the Pre-trained Model 

The first step in this assignment involved loading the pre-trained model and its corresponding tokenizer using the imported classes.

In [6]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def load_model(model_name):
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

In [7]:
model_name = "facebook/bart-base" if base_model else "facebook/bart-large"
model, tokenizer = load_model(model_name)

## Data Pre-processing 

The **ComVE** dataset was structured with 10000 nonsensical statements for the training set, 997 statements for development, and 1000 for testing. Each nonsensical statement included three reference valid reasons. The data was loaded into three `DataFrames`. The training and development `DataFrames` contained three columns: the `id` of the nonsensical statement, a `FalseSent` column for the statement, and a `reason` column containing the reference reasons. The test `DataFrame` contained five columns: the `id` of the nonsensical statement, a `FalseSent` column, and three columns (`reason1`, `reason2`, `reason3`) for the reference reasons.


Train DataFrame:

|       |   id | FalseSent                                         | reason                                                                         |
|------:|-----:|:--------------------------------------------------|:-------------------------------------------------------------------------------|
|   769 |  769 | Computers is an ingredient used in preparing food | Computers are not used for food and they are not edible                        |
| 10769 |  769 | Computers is an ingredient used in preparing food | Computer is not something that can be used in preparing food.                  |
| 20769 |  769 | Computers is an ingredient used in preparing food | You cannot eat a computer                                                      |
|   888 |  888 | he did hear music in his cooling glass            | cooling glass can not play the song, it's not a electronic thing to play music |
| 10888 |  888 | he did hear music in his cooling glass            | Glass does not produce music.                                                  |
| 20888 |  888 | he did hear music in his cooling glass            | Any sound that might be made by a cooling glass is not music.                  |

Test DataFrame:

|     |   id | FalseSent                                      | reason1                                                  | reason2                                                | reason3                                                            |
|----:|-----:|:-----------------------------------------------|:---------------------------------------------------------|:-------------------------------------------------------|:-------------------------------------------------------------------|
|  76 | 1280 | Beer that is drunk by humans is white          | Beer is made of barley and it is a yellow drink          | A beer that is drunk by humans is not white.           | Beer is brown                                                      |
| 101 |  860 | eating trash food every day makes you stronger | eating trash food every day makes your body fat and weak | eating trash food every day is bad for your health     | Trash food could be contaminated                                   |
| 136 |  777 | he put some cooking oil in his wine            | cooking oil will destroy the taste of the wine           | Cooking oil does not go in wine                        | Cooking oil does not taste nice and therefore would ruin the wine. |
| 174 |  570 | Lobsters live in the mountains                 | Lobsters needs water to live                             | Lobsters live in the sea.                              | Lobsters live in the sea, not the mountains                        |
| 210 | 1929 | the clock shows animals                        | the clock is used to show the time to people             | Clocks are required to tell the time, not show animals | a clock shows the time not animals                                 |
| 235 | 1619 | she put the giraffe in the freezer             | A giraffe is much bigger than the freezer                | There is no way a giraffe is fitting in the freezer.   | A giraffe is too big to be put in a freezer.                       |

In [8]:
def load_data(data_csv, answers_csv, is_test=False):
    # One row per nonsensical statement: id, FalseSent
    data = pd.read_csv(data_csv).dropna()
    # The answers file has no header row: id followed by three reference reasons
    answers = pd.read_csv(answers_csv, header=None).rename(
        columns={0: "id", 1: "reason1", 2: "reason2", 3: "reason3"})
    if not is_test:
        # Train/dev: reshape to long format, one row per (id, reason) pair
        answers = pd.melt(answers, id_vars=["id"], value_vars=["reason1", "reason2", "reason3"],
                          var_name="reason_column", value_name="reason")
        answers = answers.drop(columns=["reason_column"])
    return pd.merge(data, answers, on="id")
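
To make the wide-to-long reshaping for the training and development data concrete, here is a pure-Python sketch that produces the same `(id, FalseSent, reason)` rows for two statements taken from the training set. It is an illustration only and not the pandas code used in the notebook; `wide_to_long` is a hypothetical helper, and the in-memory CSV stands in for the header-less answers file.

```python
import csv
import io

# Header-less answers layout: id, reason1, reason2, reason3
answers_csv = io.StringIO(
    "0,Orange juice doesn't taste good on cereal.,"
    "Orange juice is poured in a glass.,"
    "Orange juice does not taste good on cereal.\n"
    "1,Apple can not be drunk,"
    "An apple is a whole food and unable to be drunk without being juiced.,"
    "He eats an apple\n"
)
statements = {0: "He poured orange juice on his cereal.", 1: "He drinks apple."}

def wide_to_long(statements, answer_rows):
    """Emit one dict per (statement, reason) pair."""
    long_rows = []
    for row in answer_rows:
        sent_id, reasons = int(row[0]), row[1:]
        for reason in reasons:
            long_rows.append(
                {"id": sent_id, "FalseSent": statements[sent_id], "reason": reason})
    return long_rows

rows = wide_to_long(statements, csv.reader(answers_csv))
print(len(rows))  # 6: three rows per statement
print(rows[3])
```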

In [9]:
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskC_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskC_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskC_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskC_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv, True)
if shrink_dataset:
    idxs = train_data["id"].sample(frac=1, random_state=42).unique()[:30]
    train_data = train_data[train_data.id.isin(idxs)]
    idxs = dev_data["id"].sample(frac=1, random_state=42).unique()[:30]
    dev_data = dev_data[dev_data.id.isin(idxs)]
    idxs = test_data["id"].sample(frac=1, random_state=42).unique()[:30]
    test_data = test_data[test_data.id.isin(idxs)]
pd.set_option("display.max_colwidth", None)
print("Train DataFrame:")
display(train_data[:6])
print("Test DataFrame:")
display(test_data[:6])

Train DataFrame:


|    |   id | FalseSent                             | reason                                                                |
|---:|-----:|:--------------------------------------|:----------------------------------------------------------------------|
|  0 |    0 | He poured orange juice on his cereal. | Orange juice doesn't taste good on cereal.                            |
|  1 |    0 | He poured orange juice on his cereal. | Orange juice is poured in a glass.                                    |
|  2 |    0 | He poured orange juice on his cereal. | Orange juice does not taste good on cereal.                           |
|  3 |    1 | He drinks apple.                      | Apple can not be drunk                                                |
|  4 |    1 | He drinks apple.                      | An apple is a whole food and unable to be drunk without being juiced. |
|  5 |    1 | He drinks apple.                      | He eats an apple                                                      |


Test DataFrame:


|    |   id | FalseSent                                               | reason1                                                   | reason2                                                 | reason3                                                         |
|---:|-----:|:--------------------------------------------------------|:----------------------------------------------------------|:--------------------------------------------------------|:----------------------------------------------------------------|
|  0 | 1175 | He loves to stroll at the park with his bed             | A bed is too heavy to carry with when strolling at a park | the park does not have beds                             | A bed wold be really heavy and awkward to carry through a park. |
|  1 |  452 | The inverter was able to power the continent.           | An inverter is incapable of powering an entire continent. | The invertor can power the house and not the continent. | The continent is too big to be powered by an inverted           |
|  2 |  275 | The chef put extra lemons on the pizza.                 | Lemons are not a pizza topping.                           | lemons would be awful on a pizza                        | lemons don't go on pizzas                                       |
|  3 |  869 | sugar is used to make coffee sour                       | sugar usually is used as a sweetener                      | Sugar is a sweetening agent.                            | Sugar is used to make coffee sweet.                             |
|  4 |   50 | There are beautiful planes here and there in the garden | A plane can never be seen in garden                       | Planes are not grown in a garden.                       | flowers grow in the dirt                                        |
|  5 | 1155 | Once a pipe bursts, call a doctor.                      | plumbers fix the pipes while doctors cure sick people     | Doctors are for people                                  | Doctors do not specialize in plumbing.                          |


In [10]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
print("Train Dataset example:")
display(train_dataset[0])
print("Test Dataset example:")
display(test_dataset[0])

Train Dataset example:


{'id': 0,
 'FalseSent': 'He poured orange juice on his cereal.',
 'reason': "Orange juice doesn't taste good on cereal."}

Test Dataset example:


{'id': 1175,
 'FalseSent': 'He loves to stroll at the park with his bed',
 'reason1': 'A bed is too heavy to carry with when strolling at a park',
 'reason2': 'the park does not have beds',
 'reason3': 'A bed wold be really heavy and awkward to carry through a park.'}

The `Datasets` were pre-processed using two approaches. For the test `Dataset`, the tokenizer was applied to the `FalseSent` column, and the results were stored in the `input_ids` and `attention_mask` fields. For the training and development `Datasets`, the tokenizer was applied to both the `FalseSent` and `reason` columns, with the resulting `input_ids` from the `reason` column stored in the `labels` field. In all cases, sequences were padded and truncated to the `max_length` value.


><pre>
>Train formated Dataset example:
>
>{'id': 769, 'FalseSent': 'Computers is an ingredient used in preparing food', 'reason': 'Computers are not used for food and they are not edible', '__index_level_0__': 769, 'input_ids': [0, 14721, 43990, 16, 41, 16181, 341, 11, 4568, 689, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [0, 14721, 43990, 32, 45, 341, 13, 689, 8, 51, 32, 45, 27532, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
>
>Test formated Dataset example:
>
>{'id': 1280, 'FalseSent': 'Beer that is drunk by humans is white', 'reason1': 'Beer is made of barley and it is a yellow drink', 'reason2': 'A beer that is drunk by humans is not white.', 'reason3': 'Beer is brown', '__index_level_0__': 76, 'input_ids': [0, 45562, 14, 16, 10789, 30, 5868, 16, 1104, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
></pre>
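
The relationship between `input_ids` and `attention_mask` in the example above can be reproduced with a few lines of plain Python. This is a simplified stand-in for what the tokenizer does with `padding='max_length'` and `truncation=True`, not its actual implementation. `pad_or_truncate` is a hypothetical helper, and the padding id of 1 is taken from the printed examples.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), each with exactly `max_length` entries."""
    if len(token_ids) > max_length:
        # Simplification: cut the sequence but keep its final end-of-sequence token
        token_ids = token_ids[: max_length - 1] + token_ids[-1:]
    n_real = len(token_ids)
    n_pad = max_length - n_real
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Token ids of "Beer that is drunk by humans is white", copied from the example above
beer = [0, 45562, 14, 16, 10789, 30, 5868, 16, 1104, 2]
input_ids, attention_mask = pad_or_truncate(beer, max_length=25)
print(len(input_ids), sum(attention_mask))  # 25 10
```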

In [11]:
def preprocess_data(examples, tokenizer, max_length, is_test=False):
    inputs = tokenizer(
        examples['FalseSent'],
        max_length=max_length,
        padding='max_length',
        truncation=True
    )
    if not is_test:
        labels = tokenizer(
            examples['reason'],
            max_length=max_length,
            padding='max_length',
            truncation=True
        )['input_ids']
        inputs['labels'] = labels
    return inputs
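
`preprocess_data` pads `labels` with the tokenizer's padding id, so padded positions are treated like any other target token when the loss is computed. A commonly used variant, which this notebook does not apply, replaces those positions with -100 so that the loss ignores them. The sketch below shows the replacement on a shortened label sequence. `mask_padding` is a hypothetical helper, and whether this variant changes the reported scores was not tested here.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding positions in one label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Shortened label sequence in the style of the notebook's output (pad id 1)
labels = [0, 37264, 10580, 630, 75, 5840, 205, 15, 25629, 4, 2, 1, 1, 1]
print(mask_padding(labels, pad_token_id=1))
```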

In [12]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length, True), batched=True)
print("Train formated Dataset example:\n")
print(train_dataset[0])
print("\nTest formated Dataset example:\n")
print(test_dataset[0])

Train formated Dataset example:

{'id': 0, 'FalseSent': 'He poured orange juice on his cereal.', 'reason': "Orange juice doesn't taste good on cereal.", 'input_ids': [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [0, 37264, 10580, 630, 75, 5840, 205, 15, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Test formated Dataset example:

{'id': 1175, 'FalseSent': 'He loves to stroll at the park with his bed', 'reason1': 'A bed is too heavy to carry with when strolling at a park', 'reason2': 'the park does not have beds', 'reason3': 'A bed wold be really heavy and awkward to carry through a park.', 'input_ids': [0, 894, 6138, 7, 24808, 23, 5, 2221, 19, 39, 3267, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


## Fine-tuning 

The `Seq2SeqTrainingArguments` were created with the option to generate sequences of tokens during prediction instead of returning logits. The `create_training_arguments` function used the hyperparameters passed as arguments. During training, the model was evaluated on the development set after every epoch. The `save_strategy` was set to `"no"` to avoid checkpointing every 500 steps.

In [13]:
from transformers import Seq2SeqTrainingArguments

def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",  # Evaluate at the end of every epoch
        predict_with_generate=True,   # Enable sequence generation in predictions
        save_strategy="no"            # Disable saving checkpoints every 500 steps
    )
    return training_args

In [14]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)

The `Trainer` object was created using the model, the `Seq2SeqTrainingArguments`, and a data collator appropriate for sequence-to-sequence tasks. The train `Dataset` was used for training, and the development `Dataset` was used for evaluation during training.

In [15]:
def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        data_collator=data_collator,
    )
    return trainer

In [16]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer)

In [17]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|------:|--------------:|----------------:|
|     1 |        0.9095 |        0.949163 |
|     2 |        0.8181 |        0.924794 |
|     3 |        0.7602 |        0.921296 |


TrainOutput(global_step=11250, training_loss=0.9548438178168402, metrics={'train_runtime': 6612.7743, 'train_samples_per_second': 13.61, 'train_steps_per_second': 1.701, 'total_flos': 4761704448000000.0, 'train_loss': 0.9548438178168402, 'epoch': 3.0})
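
The reported `global_step` can be checked by hand. The full training set contains 10000 statements with three reasons each, i.e. 30000 examples. Assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook), each epoch takes 30000 / 8 = 3750 steps. `total_optimizer_steps` is a hypothetical helper written for this check.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of gradient updates, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(total_optimizer_steps(num_examples=30_000, batch_size=8, epochs=3))  # 11250
```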

The `Trainer` was used to generate sequences of token indices from the test `Dataset`. These indices were then decoded using the `tokenizer` to obtain the corresponding text strings, which were stored in the `prediction` column of the test `DataFrame`.

|     |   id | FalseSent                                      | reason1                                                  | reason2                                                | reason3                                                            | prediction                                     |
|----:|-----:|:-----------------------------------------------|:---------------------------------------------------------|:-------------------------------------------------------|:-------------------------------------------------------------------|:-----------------------------------------------|
|  76 | 1280 | Beer that is drunk by humans is white          | Beer is made of barley and it is a yellow drink          | A beer that is drunk by humans is not white.           | Beer is brown                                                      | Beer that is drunk by humans is white                             |
| 101 |  860 | eating trash food every day makes you stronger | eating trash food every day makes your body fat and weak | eating trash food every day is bad for your health     | Trash food could be contaminated                                   | eating trash food every day makes you stronger |
| 136 |  777 | he put some cooking oil in his wine            | cooking oil will destroy the taste of the wine           | Cooking oil does not go in wine                        | Cooking oil does not taste nice and therefore would ruin the wine. | he put some cooking oil in his wine            |
| 174 |  570 | Lobsters live in the mountains                 | Lobsters needs water to live                             | Lobsters live in the sea.                              | Lobsters live in the sea, not the mountains                        | Lobsters live in mountains                 |
| 210 | 1929 | the clock shows animals                        | the clock is used to show the time to people             | Clocks are required to tell the time, not show animals | a clock shows the time not animals                                 | the clock shows animals                        |
| 235 | 1619 | she put the giraffe in the freezer             | A giraffe is much bigger than the freezer                | There is no way a giraffe is fitting in the freezer.   | A giraffe is too big to be put in a freezer.                       | she put the giraffe in the freezer             |


In [18]:
def make_predictions(trainer, test_dataset, tokenizer):
    predictions = trainer.predict(test_dataset)
    decoded_predictions = []
    for prediction in predictions.predictions:
        decoded_prediction = tokenizer.decode(prediction, skip_special_tokens=True)
        decoded_predictions.append(decoded_prediction)
    return decoded_predictions
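
In some setups the arrays returned by `trainer.predict` may contain -100 where sequences from different batches were padded to a common length, and a tokenizer cannot decode that value. Whether this happens depends on the library version and the generation settings, so the following is a defensive sketch under that assumption rather than a required step. `replace_ignore_index` is a hypothetical helper.

```python
def replace_ignore_index(token_rows, pad_token_id, ignore_index=-100):
    """Swap `ignore_index` for the padding id so every entry is a valid token id."""
    return [
        [pad_token_id if int(tok) == ignore_index else int(tok) for tok in row]
        for row in token_rows
    ]

# Two made-up generated sequences; the first was padded with -100
generated = [[2, 0, 894, 2, -100, -100], [2, 0, 170, 4, 2, 1]]
cleaned = replace_ignore_index(generated, pad_token_id=1)
print(cleaned)
```

The cleaned rows could then be passed to `tokenizer.batch_decode(cleaned, skip_special_tokens=True)` to decode all predictions in one call.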

In [19]:
predictions = make_predictions(trainer, test_dataset, tokenizer)
test_data["prediction"] = predictions
print(test_data)

       id                                                     FalseSent  \
0    1175                   He loves to stroll at the park with his bed   
1     452                 The inverter was able to power the continent.   
2     275                       The chef put extra lemons on the pizza.   
3     869                             sugar is used to make coffee sour   
4      50       There are beautiful planes here and there in the garden   
..    ...                                                           ...   
995  1114                      If it is a sunny day, you would got wet.   
996     8                         ice hockey is a financial institution   
997  1945  He put water without a container in the freezer for 24 hours   
998  1053                       The desert has sand that you can drink.   
999  1123                        My friend runs for 2 inches every day.   

                                                       reason1  \
0    A bed is too heavy to carry 

The evaluation for **SubTask C** was based on the *bleu* and *rouge* metrics. With `shrink_dataset` and `base_model` set to `True`, the expected scores were 0.216 for *bleu* and 0.446 for *rouge*. With a full training run, where `shrink_dataset` and `base_model` were set to `False`, the expected scores were approximately 0.228 for *bleu* and 0.461 for *rouge*.

In [24]:
!pip install rouge_score



In [25]:
# `evaluate` was imported above; its rouge metric relies on the rouge_score package installed in the previous cell
def evaluate_prediction(test_data, metric):
    if metric not in ("bleu", "rouge"):
        raise ValueError("Invalid metric. Please choose 'bleu' or 'rouge'.")
    # Both metrics take one prediction and a list of reference reasons per example
    predictions = test_data["prediction"].tolist()
    references = test_data[["reason1", "reason2", "reason3"]].values.tolist()
    return evaluate.load(metric).compute(predictions=predictions, references=references)

In [26]:
evaluate_prediction(test_data, "bleu")

{'bleu': 0.23149176268843397,
 'precisions': [0.6396091205211727,
  0.3063670411985019,
  0.16669603524229074,
  0.0879144385026738],
 'brevity_penalty': 1.0,
 'length_ratio': 1.1964146531566642,
 'translation_length': 7675,
 'reference_length': 6415}
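
The fields of this result fit together as follows: with uniform weights, BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The penalty is 1.0 here because the predictions (7675 tokens) are longer than the references (6415 tokens). The sketch below recomputes the score from the reported values. `bleu_from_parts` is a hypothetical helper, not a function of the `evaluate` library.

```python
import math

def bleu_from_parts(precisions, brevity_penalty):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

# Values copied from the output above
precisions = [0.6396091205211727, 0.3063670411985019,
              0.16669603524229074, 0.0879144385026738]
score = bleu_from_parts(precisions, brevity_penalty=1.0)
print(round(score, 4))        # 0.2315
print(round(7675 / 6415, 4))  # 1.1964, the reported length_ratio
```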

In [27]:
evaluate_prediction(test_data, "rouge")

{'rouge1': 0.500345548654919,
 'rouge2': 0.2772558348865699,
 'rougeL': 0.46570211015830243,
 'rougeLsum': 0.4655786157317027}
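
To give an intuition for the `rougeL` value, the sketch below computes an F1 score based on the longest common subsequence (LCS) between a prediction and each reference, keeping the best match. It is a simplification: it lowercases and splits on whitespace, whereas the `evaluate` metric applies its own tokenization, and taking the best score over the references is an assumption about how multiple references are combined. The prediction string is made up for illustration, and `lcs_length` and `rouge_l_f1` are hypothetical helpers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, references):
    """Best LCS-based F1 of `prediction` against any of the `references`."""
    pred = prediction.lower().split()
    best = 0.0
    for reference in references:
        ref = reference.lower().split()
        lcs = lcs_length(pred, ref)
        if lcs:
            precision, recall = lcs / len(pred), lcs / len(ref)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

f1 = rouge_l_f1(
    "a giraffe is too big for a freezer",
    ["A giraffe is too big to be put in a freezer", "Beer is brown"],
)
print(round(f1, 4))  # 0.7368
```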