<a href="https://colab.research.google.com/github/it-is-helga/m2_genAI/blob/main/class01/Lesson_1_NLG_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 1: NLG Basics

In this lesson you will learn how to:
- Load datasets using Hugging Face
- Load models using Hugging Face and CTranslate2
- Run the data through the model
- Fine-tune the model

## Libraries
First, we will install some required libraries.

In [61]:
# Install relevant libraries
! pip install transformers[torch] datasets==3.6.0 ctranslate2 --quiet

Then, we import all the required components.

In [62]:
# Import relevant libraries
from datasets import load_dataset
from transformers import (
    pipeline,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    AutoTokenizer
)
import ctranslate2
import json
import torch
from tqdm.notebook import tqdm

## Data

We will use the `load_dataset` function to download the WebNLG dataset.

In [63]:
ds = load_dataset(
    'webnlg-challenge/web_nlg',
    'release_v3.0_en',
    trust_remote_code=True
)

In [64]:
ds

DatasetDict({
    train: Dataset({
        features: ['category', 'size', 'eid', 'original_triple_sets', 'modified_triple_sets', 'shape', 'shape_type', 'lex', 'test_category', 'dbpedia_links', 'links'],
        num_rows: 13211
    })
    dev: Dataset({
        features: ['category', 'size', 'eid', 'original_triple_sets', 'modified_triple_sets', 'shape', 'shape_type', 'lex', 'test_category', 'dbpedia_links', 'links'],
        num_rows: 1667
    })
    test: Dataset({
        features: ['category', 'size', 'eid', 'original_triple_sets', 'modified_triple_sets', 'shape', 'shape_type', 'lex', 'test_category', 'dbpedia_links', 'links'],
        num_rows: 5713
    })
})

In [None]:
ds['train'].features

{'category': Value(dtype='string', id=None),
 'size': Value(dtype='int32', id=None),
 'eid': Value(dtype='string', id=None),
 'original_triple_sets': Sequence(feature={'otriple_set': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None),
 'modified_triple_sets': Sequence(feature={'mtriple_set': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None),
 'shape': Value(dtype='string', id=None),
 'shape_type': Value(dtype='string', id=None),
 'lex': Sequence(feature={'comment': Value(dtype='string', id=None), 'lid': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'lang': Value(dtype='string', id=None)}, length=-1, id=None),
 'test_category': Value(dtype='string', id=None),
 'dbpedia_links': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'links': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

In [None]:
ds['test'][20]

{'category': 'MusicalWork',
 'size': 7,
 'eid': 'Id21',
 'original_triple_sets': {'otriple_set': [['Bootleg_Series_Volume_1:_The_Quine_Tapes | artist | The_Velvet_Underground',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | producer | The_Velvet_Underground',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | runtime | 13803.0',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | recordedIn | United_States',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | recordedIn | St._Louis,_Missouri',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | genre | Rock_music',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | recordLabel | Polydor_Records']]},
 'modified_triple_sets': {'mtriple_set': [['Bootleg_Series_Volume_1:_The_Quine_Tapes | artist | The_Velvet_Underground',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | producer | The_Velvet_Underground',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | runtime | 230.05',
    'Bootleg_Series_Volume_1:_The_Quine_Tapes | recordedIn | United_States',
    'Bootl

In [65]:
for triple in ds['test'][20]['modified_triple_sets']['mtriple_set'][0]:
  temp = list(map(lambda x: ' '.join(x.split('_')), triple.split(' | ')))
  print(', '.join(temp))

Bootleg Series Volume 1: The Quine Tapes, artist, The Velvet Underground
Bootleg Series Volume 1: The Quine Tapes, producer, The Velvet Underground
Bootleg Series Volume 1: The Quine Tapes, runtime, 230.05
Bootleg Series Volume 1: The Quine Tapes, recordedIn, United States
Bootleg Series Volume 1: The Quine Tapes, recordedIn, St. Louis, Missouri
Bootleg Series Volume 1: The Quine Tapes, genre, Rock music
Bootleg Series Volume 1: The Quine Tapes, recordLabel, Polydor Records


In [19]:
linearize_graph(ds['test'][20]['modified_triple_sets']['mtriple_set'][0])

'Bootleg Series Volume 1: The Quine Tapes, artist, The Velvet Underground; Bootleg Series Volume 1: The Quine Tapes, producer, The Velvet Underground; Bootleg Series Volume 1: The Quine Tapes, runtime, 230.05; Bootleg Series Volume 1: The Quine Tapes, recordedIn, United States; Bootleg Series Volume 1: The Quine Tapes, recordedIn, St. Louis, Missouri; Bootleg Series Volume 1: The Quine Tapes, genre, Rock music; Bootleg Series Volume 1: The Quine Tapes, recordLabel, Polydor Records'

As you can see, the dataset has multiple features in different formats.
However, we only need the expected input and the expected output.

Let's define some preprocessing functions and apply them to the dataset.

In [66]:
# Define a function to linearize graphs
# Hint: observe the structure of "modified_triple_sets" and "mtriple_set" in the dataset

def linearize_graph(mtriple_set):
    '''
    Converts a list of triples into a single string following a given format.

    Parameters:
    mtriple_set (List[str]): A list containing strings where each string represents a triple
    in the following format: "subject | property | object"


    Returns:
    linearized_graph (str): a string with the tripleset linearized
    '''

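    # Replace underscores with spaces and join subject, property, and object with commas
    linearized_triples = []
    for triple in mtriple_set:
        parts = [part.replace('_', ' ') for part in triple.split(' | ')]
        linearized_triples.append(', '.join(parts))
    # Separate the individual triples with semicolons
    linearized_graph = '; '.join(linearized_triples)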

    return linearized_graph

In [67]:
# Define functions to remove unnecessary information from the dataset
def simplify_dataset_multiple_references(batch):
    '''
    Extracts the linearized graph and the references from a batch.

    Parameters:
    batch (Dict): a dictionary where each key is a feature of the dataset and each value is a list.

    Returns:
    new_batch (Dict): a dictionary where each key is a new feature and each value is a list.
    '''

    new_batch = {
        'input': [],
        'references': []
    }

    for i in range(len(batch['modified_triple_sets'])):
        graph = batch['modified_triple_sets'][i]['mtriple_set'][0]
        linearized_graph = linearize_graph(graph)
        new_batch['input'].append(linearized_graph)
        new_batch['references'].append(batch['lex'][i]['text'])
    return new_batch
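
With `batched=True`, `map` hands the function a dictionary of columns (one list per feature) rather than a list of rows. The following is a minimal sketch of that layout in plain Python. The two-row `toy_batch` and the helper `simplify_toy_batch` are made up for illustration and only mimic the nesting of the WebNLG columns; they are not part of the lesson code.

```python
# Toy batch in the column-oriented layout that a batched map passes in.
# The values are invented; only the nesting mirrors the WebNLG features.
toy_batch = {
    'modified_triple_sets': [
        {'mtriple_set': [['Alan_Bean | occupation | Test_pilot']]},
        {'mtriple_set': [['Olga_Bondareva | birthName | Olga_Nikolaevna_Bondareva',
                          'Olga_Bondareva | almaMater | Leningrad_State_University']]},
    ],
    'lex': [
        {'text': ['Alan Bean was a test pilot.', 'Alan Bean worked as a test pilot.']},
        {'text': ['Olga Bondareva studied at Leningrad State University.']},
    ],
}


def simplify_toy_batch(batch):
    """Hypothetical stand-in for the function above, runnable without `datasets`."""
    new_batch = {'input': [], 'references': []}
    for triple_sets, lex in zip(batch['modified_triple_sets'], batch['lex']):
        triples = triple_sets['mtriple_set'][0]  # first triple set, as in the function above
        parts = [', '.join(t.replace('_', ' ').split(' | ')) for t in triples]
        new_batch['input'].append('; '.join(parts))
        new_batch['references'].append(lex['text'])  # keep all references together
    return new_batch


simplified = simplify_toy_batch(toy_batch)
print(simplified['input'][0])
print(len(simplified['references'][0]))
```

Each output column must have the same number of entries; here that is one entry per input row, with the references kept as a list.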

In [68]:
# Remove unnecessary information from the dataset
original_columns = ds['train'].column_names
processed_ds = ds.map(
    simplify_dataset_multiple_references,
    batched=True,
    remove_columns=original_columns
)

In [None]:
processed_ds

DatasetDict({
    train: Dataset({
        features: ['input', 'references'],
        num_rows: 13211
    })
    dev: Dataset({
        features: ['input', 'references'],
        num_rows: 1667
    })
    test: Dataset({
        features: ['input', 'references'],
        num_rows: 5713
    })
})

In [22]:
processed_ds['test'][20]

{'input': 'Bootleg Series Volume 1: The Quine Tapes, artist, The Velvet Underground; Bootleg Series Volume 1: The Quine Tapes, producer, The Velvet Underground; Bootleg Series Volume 1: The Quine Tapes, runtime, 230.05; Bootleg Series Volume 1: The Quine Tapes, recordedIn, United States; Bootleg Series Volume 1: The Quine Tapes, recordedIn, St. Louis, Missouri; Bootleg Series Volume 1: The Quine Tapes, genre, Rock music; Bootleg Series Volume 1: The Quine Tapes, recordLabel, Polydor Records',
 'references': ['Music group the Velvet Underground produced and released rock music album Bootleg Series Volume 1: The Quine Tapes under the Polydor Records record label. The album was recorded in St. Louis, Missouri, USA and has a run time of 230:05.',
  'Bootleg Series Volume 1: The Quine Tapes created and produced by The Velvet Underground with runtime of 230.05 minutes was recorded in was St. Louis Missouri, United States. The Rock genre album was recorded through Polydor Records.',
  'The Bo

## Zero-shot

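There are several ways to use a model out of the box.

The easiest approach is to use Hugging Face pipelines directly.

We are going to load a small T5 model, since it has very few parameters. Note that the cell below loads `google/flan-t5-small`, an instruction-tuned model from the same family, rather than [t5-small](https://huggingface.co/google-t5/t5-small) itself.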

In [6]:
# Load a model and its tokenizer with the pipeline abstraction
# Hint: look into the pipeline documentation to find the available tasks

generator = pipeline(
    task = 'text2text-generation',
    model = "google/flan-t5-small",
    device_map='auto',
    torch_dtype=torch.bfloat16
)

Device set to use cuda:0


['audio-classification', 'automatic-speech-recognition', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-text-to-text', 'image-to-image', 'image-to-text', 'keypoint-matching', 'mask-generation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text-to-audio', 'text-to-speech', 'text2text-generation', 'token-classification', 'translation', 'video-classification', 'visual-question-answering', 'vqa', 'zero-shot-audio-classification', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-object-detection', 'translation_XX_to_YY']

In [23]:
# Test the model Zero-shot (One example)
input_text = processed_ds['test'][0]['input']
output_text = generator(input_text)[0]['generated_text']

In [25]:
input_text

'Estádio Municipal Coaracy da Mata Fonseca, location, Arapiraca; Agremiação Sportiva Arapiraquense, league, Campeonato Brasileiro Série C; Campeonato Brasileiro Série C, country, Brazil; Agremiação Sportiva Arapiraquense, nickname, "\'\'Alvinegro"; Agremiação Sportiva Arapiraquense, ground, Estádio Municipal Coaracy da Mata Fonseca'

In [24]:
output_text

'Agremiaço Sportiva Arapiraquense play in the Campeonato Brasileiro Série C league in Brazil. The nickname is Alvinegro and they play in the Campeonato Brasileiro Série C league. The ground of Agremiaço Sportiva Arapiraquense is located in Arapiraca.'

In [26]:
first_reference = processed_ds['test'][0]['references'][0]
print(f'Input text: {input_text}')
print()
print(f'First reference: {first_reference}')
print()
print(f'Output text: {output_text}')

Input text: Estádio Municipal Coaracy da Mata Fonseca, location, Arapiraca; Agremiação Sportiva Arapiraquense, league, Campeonato Brasileiro Série C; Campeonato Brasileiro Série C, country, Brazil; Agremiação Sportiva Arapiraquense, nickname, "''Alvinegro"; Agremiação Sportiva Arapiraquense, ground, Estádio Municipal Coaracy da Mata Fonseca

First reference: Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca. Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.

Output text: Agremiaço Sportiva Arapiraquense play in the Campeonato Brasileiro Série C league in Brazil. The nickname is Alvinegro and they play in the Campeonato Brasileiro Série C league. The ground of Agremiaço Sportiva Arapiraquense is located in Arapiraca.


The model is simply transcribing the input. If we want it to actually do what we want, we need to fine-tune it.

## Training

We will need a function to transform the input and the references from strings to numbers (token IDs). When multiple references are available, each one is saved as an individual training instance.
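
To make the splitting step concrete, here is a minimal sketch in plain Python. The whitespace "tokenizer" `toy_encode` and the `toy_batch` data are hypothetical stand-ins for the real tokenizer and dataset; the point is only that one input with two references becomes two rows that share the same `input_ids`.

```python
# Hypothetical whitespace "tokenizer": maps each new word to the next free integer id.
vocab = {}


def toy_encode(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


toy_batch = {
    'input': ['Alan Bean, occupation, Test pilot'],
    'references': [['Alan Bean was a test pilot.', 'Alan Bean worked as a test pilot.']],
}

rows = {'input_ids': [], 'attention_mask': [], 'labels': []}
for text, references in zip(toy_batch['input'], toy_batch['references']):
    input_ids = toy_encode(text)              # encode the input once
    for reference in references:              # one training row per reference
        rows['input_ids'].append(input_ids)
        rows['attention_mask'].append([1] * len(input_ids))
        rows['labels'].append(toy_encode(reference))

print(len(rows['input_ids']))  # number of rows produced from one example
```

This is why the tokenized splits end up with more rows than the processed ones: every reference contributes its own row.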

In [52]:
# Define function to tokenize dataset
# Hint: look into the "simplify_dataset_multiple_references" function to get ideas

def tokenize_and_split(batch):
    '''
    Extracts the input ids, attention mask, and labels for each individual reference in the batch.

    Parameters:
    batch (Dict): a dictionary where each key is a feature of the dataset and each value is a list.

    Returns:
    new_batch (Dict): a dictionary where each key is a new feature and each value is a list.
    '''

    new_batch = {
        'input_ids': [], # List[int]
        'attention_mask': [], # List[int]
        'labels': [] # List[int]
    }
    for i in range(len(batch['input'])):
        # Encode the input once and reuse it for every reference
        input_ids = generator.tokenizer.encode(batch['input'][i])
        for reference in batch['references'][i]:
            new_batch['input_ids'].append(input_ids)
            new_batch['attention_mask'].append([1] * len(input_ids))
            new_batch['labels'].append(generator.tokenizer.encode(reference))

    return new_batch

In [53]:
# Remove unnecessary information from the dataset
# Hint: look into the "processed_ds" declaration to get ideas

original_columns = processed_ds['train'].column_names
tokenized_ds = processed_ds.map(
    tokenize_and_split,
    batched=True,
    remove_columns=original_columns
)

Map:   0%|          | 0/13211 [00:00<?, ? examples/s]

Map:   0%|          | 0/1667 [00:00<?, ? examples/s]

Map:   0%|          | 0/5713 [00:00<?, ? examples/s]

In [54]:
tokenized_ds['test'][20]

{'input_ids': [5424,
  122,
  9,
  12528,
  355,
  900,
  6,
  1687,
  308,
  342,
  6,
  9957,
  5947,
  18,
  4198,
  1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [5424, 122, 9, 12528, 355, 900, 3977, 30, 1882, 9902, 9957, 5, 1]}

Once the data is prepared, we need to define our training arguments and our trainer.

In [69]:
import wandb
wandb.login()



True

In [58]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 35426
    })
    dev: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 4464
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7305
    })
})

In [57]:
# Define the training arguments
training_arguments = Seq2SeqTrainingArguments(
    output_dir = 't5-small-webnlg',
    num_train_epochs = 1,
    eval_strategy = 'steps',
    eval_steps = 500,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 32,
    save_total_limit = 3,
    load_best_model_at_end = True,
    bf16=True,
    optim='adafactor',
    report_to = 'wandb'
)

In [71]:
# Define the trainer
trainer = Seq2SeqTrainer(
    model = generator.model,
    args = training_arguments,
    data_collator = DataCollatorForSeq2Seq(tokenizer = generator.tokenizer),
    train_dataset = tokenized_ds['train'],
    eval_dataset = tokenized_ds['dev']
)
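
The data collator is what turns variable-length rows into rectangular batches. As a rough sketch of the idea (not the library's implementation): `DataCollatorForSeq2Seq` pads `input_ids` with the tokenizer's pad token and, by default, pads `labels` with `-100` so that the loss ignores the padded positions. The function `toy_collate`, the helper `pad`, and the example rows are illustrative assumptions; the pad id `0` matches T5 tokenizers.

```python
PAD_TOKEN_ID = 0           # assumed pad id (T5 tokenizers use 0)
LABEL_PAD_TOKEN_ID = -100  # default label padding value, ignored by the loss


def pad(sequence, length, value):
    """Right-pad a list with `value` up to `length`."""
    return sequence + [value] * (length - len(sequence))


def toy_collate(rows):
    """Hypothetical, simplified collator: pad every field to the longest row."""
    max_in = max(len(r['input_ids']) for r in rows)
    max_lab = max(len(r['labels']) for r in rows)
    return {
        'input_ids': [pad(r['input_ids'], max_in, PAD_TOKEN_ID) for r in rows],
        'attention_mask': [pad(r['attention_mask'], max_in, 0) for r in rows],
        'labels': [pad(r['labels'], max_lab, LABEL_PAD_TOKEN_ID) for r in rows],
    }


rows = [
    {'input_ids': [5424, 122, 1], 'attention_mask': [1, 1, 1], 'labels': [5424, 1]},
    {'input_ids': [9957, 1], 'attention_mask': [1, 1], 'labels': [9957, 5, 1]},
]
batch = toy_collate(rows)
print(batch)
```

Padding per batch, rather than padding the whole dataset to one fixed length, keeps short examples from wasting computation.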

With all parts in place we can finally run the training.

In [72]:
# Run the training
trainer.train()

Step,Training Loss,Validation Loss
500,0.85,0.659753
1000,0.8476,0.65939
1500,0.8565,0.657175
2000,0.8684,0.656714


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=2215, training_loss=0.8558417395598195, metrics={'train_runtime': 763.2641, 'train_samples_per_second': 46.414, 'train_steps_per_second': 2.902, 'total_flos': 1215591018098688.0, 'train_loss': 0.8558417395598195, 'epoch': 1.0})
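
The `global_step=2215` in the output can be checked by hand. The sketch below assumes a single device and no gradient accumulation (neither is changed in the training arguments, so the defaults apply); the variable names are only for illustration.

```python
import math

train_rows = 35426   # rows in tokenized_ds['train']
batch_size = 16      # per_device_train_batch_size
num_epochs = 1       # num_train_epochs
eval_steps = 500     # eval_steps

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which an evaluation on the dev split is triggered.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(total_steps)
print(eval_points)
```

The four evaluation points correspond to the four rows of the loss table.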

In [60]:
# Run the training
trainer.train()

Step,Training Loss,Validation Loss
500,0.9097,0.663491
1000,0.8817,0.662639
1500,0.8793,0.660881
2000,0.8848,0.660501


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=2215, training_loss=0.8871912833530262, metrics={'train_runtime': 736.6293, 'train_samples_per_second': 48.092, 'train_steps_per_second': 3.007, 'total_flos': 1215591018098688.0, 'train_loss': 0.8871912833530262, 'epoch': 1.0})

In [None]:
# REFERENCE Run the training
trainer.train()

Step,Training Loss,Validation Loss
500,1.9097,1.370177
1000,1.633,1.288876
1500,1.5787,1.26128
2000,1.5639,1.255021


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=2215, training_loss=1.6599478211413656, metrics={'train_runtime': 616.4609, 'train_samples_per_second': 57.467, 'train_steps_per_second': 3.593, 'total_flos': 864348255485952.0, 'train_loss': 1.6599478211413656, 'epoch': 1.0})

In [73]:
trainer.model.save_pretrained('t5-small-webnlg')
generator.tokenizer.save_pretrained('t5-small-webnlg')

('t5-small-webnlg/tokenizer_config.json',
 't5-small-webnlg/special_tokens_map.json',
 't5-small-webnlg/spiece.model',
 't5-small-webnlg/added_tokens.json',
 't5-small-webnlg/tokenizer.json')

## Inference

For fast inference with T5 and other encoder-decoder models we will use CTranslate2, which can be around 3 times faster than a Hugging Face pipeline.

For decoder-only models like GPT or Llama, vLLM is usually the better choice and can be even faster.

In [74]:
# Convert the model to the CTranslate2 format
! ct2-transformers-converter --model t5-small-webnlg --output_dir t5-small-webnlg-ct2

In [75]:
# Instantiate the model as a CTranslate2 Translator and load the tokenizer
translator = ctranslate2.Translator(
    't5-small-webnlg-ct2',
    device='cuda',
)
tokenizer = AutoTokenizer.from_pretrained('t5-small-webnlg')

CTranslate2 takes string tokens as input and returns string tokens as output, so we need to define some functions to process the data.
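
"String tokens" here means subword pieces rather than token IDs; for SentencePiece-based tokenizers such as T5's, the `▁` character marks the start of a word. The following toy sketch shows the round trip with a hypothetical five-entry vocabulary. The names `ID_TO_TOKEN`, `toy_ids_to_tokens`, and `toy_tokens_to_text` are invented for illustration and are not part of any library.

```python
# Hypothetical toy vocabulary; a real tokenizer's vocabulary is far larger.
ID_TO_TOKEN = ['<pad>', '</s>', '▁Alan', '▁Bean', '▁pilot']
SPECIAL_TOKENS = {'<pad>', '</s>'}


def toy_ids_to_tokens(token_ids):
    """Map integer ids to string tokens (what the translator consumes)."""
    return [ID_TO_TOKEN[i] for i in token_ids]


def toy_tokens_to_text(tokens):
    """Drop special tokens and turn word-start markers back into spaces."""
    kept = [t for t in tokens if t not in SPECIAL_TOKENS]
    return ''.join(kept).replace('▁', ' ').strip()


input_tokens = toy_ids_to_tokens([2, 3, 4, 1])
print(input_tokens)
print(toy_tokens_to_text(input_tokens))
```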

In [76]:
# Define function to pre process inputs
# Hint: look into the tokenizer documentation
# including methods like "encode" and "convert_ids_to_tokens"

def pre_process(text):
    '''
    Turns a text into a list of string tokens.

    Parameters:
    text (str): an input text.

    Returns:
    input_tokens (List[str]): a list of string tokens.
    '''

    token_ids = tokenizer.encode(text, add_special_tokens=True)

    # Convert IDs back to string tokens
    input_tokens = tokenizer.convert_ids_to_tokens(token_ids)

    return input_tokens

In [77]:
# Define function to post-process outputs
# Hint: look into the tokenizer documentation
# including methods like "convert_tokens_to_ids" and "decode"
def post_process(output_tokens):
    '''
    Turns a list of string tokens into a single string.

    Parameters:
    output_tokens (List[str]): an output list of tokens.

    Returns:
    output_text (str): an output string.
    '''

    # Convert string tokens back to IDs
    token_ids = tokenizer.convert_tokens_to_ids(output_tokens)

    # Decode IDs to string
    output_text = tokenizer.decode(token_ids, skip_special_tokens=True)

    return output_text

Then, we can generate outputs for the entire test set with our fine-tuned model.

In [78]:
dataset = [pre_process(text) for text in tqdm(processed_ds['test']['input'])]
batch_size = 32

generated_texts = []
for i in tqdm(range(0, len(dataset), batch_size)):
    batch = dataset[i : i + batch_size]
    output = translator.translate_batch(batch, max_batch_size=batch_size)
    generated_texts += [post_process(o.hypotheses[0]) for o in output]

  0%|          | 0/5713 [00:00<?, ?it/s]

  0%|          | 0/179 [00:00<?, ?it/s]
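
The second progress bar counts batches, not examples. A minimal sketch of the same slicing pattern in plain Python shows where the 179 comes from; `iter_batches` is a hypothetical helper and the integer list stands in for the tokenized test inputs.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


items = list(range(5713))           # stand-in for the 5713 tokenized test inputs
batches = list(iter_batches(items, 32))

print(len(batches))       # number of batches
print(len(batches[-1]))   # size of the final, partial batch
```

Slicing past the end of a list is safe in Python, so the final batch simply comes out shorter.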

In [79]:
final_data = processed_ds['test'].add_column('t5-small-webnlg', generated_texts)
final_data[0]

{'input': 'Estádio Municipal Coaracy da Mata Fonseca, location, Arapiraca; Agremiação Sportiva Arapiraquense, league, Campeonato Brasileiro Série C; Campeonato Brasileiro Série C, country, Brazil; Agremiação Sportiva Arapiraquense, nickname, "\'\'Alvinegro"; Agremiação Sportiva Arapiraquense, ground, Estádio Municipal Coaracy da Mata Fonseca',
 'references': ['Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca. Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.',
  'Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca. Alvinegro, the nickname of Agremiação Sportiva Arapiraquense, play in the Campeonato Brasileiro Série C league from Brazil.'],
 't5-small-webnlg': 'Agremiaço Sportiva Arapiraquense play in the Campeonato Brasileiro Série C league in Brazil and play in the Campeonato Brasilei

In [81]:
final_data[100]

{'input': 'Olga Bondareva, knownFor, Bondareva–Shapley theorem; Olga Bondareva, birthName, "Olga Nikolaevna Bondareva"; Olga Bondareva, almaMater, Leningrad State University',
 'references': ['Olga Bondareva, born "Olga Nikolaevna Bondareva" and a student of Leningrad State University, was known for the Bondareva-Shapley theorem.',
  'Olga Bondareva is known for the Bondareva-Shapley theorem, described by Olga Nikolaevna Bondareva in Leningrad State University.',
  'Olga Bondareva, birth name Olga Nikolaevna Bondareva, graduated from Leningrad State University. She is known for the Bondareva-Shapley theorem.'],
 't5-small-webnlg': 'Olga Bondareva, born Olga Nikolaevna Bondareva, graduated from Leningrad State University and is known for the Bondareva–Shapley theorem.'}

In [None]:
# reference
final_data = processed_ds['test'].add_column('t5-small-webnlg', generated_texts)
final_data[0]

{'input': 'Estádio Municipal Coaracy da Mata Fonseca, location, Arapiraca; Agremiação Sportiva Arapiraquense, league, Campeonato Brasileiro Série C; Campeonato Brasileiro Série C, country, Brazil; Agremiação Sportiva Arapiraquense, nickname, Alvinegro; Agremiação Sportiva Arapiraquense, ground, Estádio Municipal Coaracy da Mata Fonseca',
 'references': ['Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca. Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.',
  'Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca. Alvinegro, the nickname of Agremiação Sportiva Arapiraquense, play in the Campeonato Brasileiro Série C league from Brazil.'],
 't5-small-webnlg': 'Arapiraca is the ground of the Estádio Municipal Coaracy da Mata Fonseca. It is located in the country of Brazil.'}

In [80]:
# Save the generated data
final_data.to_json('t5-small-webnlg-generations.json')

Creating json from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

2842332
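
To the best of my knowledge, `Dataset.to_json` writes JSON Lines by default (one JSON object per line), so the saved file can be read back with the standard library. The sketch below assumes that format and uses a temporary file with made-up records in place of the real output; the file name `toy-generations.json` is hypothetical.

```python
import json
import os
import tempfile

# Invented records with the same keys as the rows of final_data.
records = [
    {'input': 'Alan Bean, occupation, Test pilot',
     'references': ['Alan Bean was a test pilot.'],
     't5-small-webnlg': 'Alan Bean was a test pilot.'},
    {'input': 'Olga Bondareva, almaMater, Leningrad State University',
     'references': ['Olga Bondareva studied at Leningrad State University.'],
     't5-small-webnlg': 'Olga Bondareva graduated from Leningrad State University.'},
]

# Write one JSON object per line (the assumed on-disk format).
path = os.path.join(tempfile.mkdtemp(), 'toy-generations.json')
with open(path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back line by line.
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
```

If the file turns out to be a single JSON array instead, `json.load` on the whole file would be the right call.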