## Off-the-shelf results with T5

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

In [2]:
base_model = T5ForConditionalGeneration.from_pretrained('t5-base')
base_tokenizer = T5Tokenizer.from_pretrained('t5-base')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## Abstractive Summarization

In [3]:
text_to_summarize ="""Barack Obama is an American politician, lawyer, and author who served as the 44th President of the United States from 2009 to 2017, 
becoming the first African American to hold the office. Born on August 4, 1961, in Honolulu, Hawaii, he graduated from Columbia University and Harvard Law 
School, where he became the first Black president of the Harvard Law Review. Before his presidency, Obama worked as a community organizer, practiced civil 
rights law, and served as a U.S. Senator from Illinois. His presidency was marked by significant achievements such as the Affordable Care Act (Obamacare), 
the operation that killed Osama bin Laden, the Paris Climate Agreement, and efforts to recover from the 2008 financial crisis. 
Known for his eloquence and hope-driven leadership, Obama remains a prominent global figure advocating for democracy, equality, and climate action.
"""

preprocess_text = text_to_summarize.strip().replace("\n","")

print ("original text preprocessed:\n", preprocess_text)

original text preprocessed:
 Barack Obama is an American politician, lawyer, and author who served as the 44th President of the United States from 2009 to 2017, becoming the first African American to hold the office. Born on August 4, 1961, in Honolulu, Hawaii, he graduated from Columbia University and Harvard Law School, where he became the first Black president of the Harvard Law Review. Before his presidency, Obama worked as a community organizer, practiced civil rights law, and served as a U.S. Senator from Illinois. His presidency was marked by significant achievements such as the Affordable Care Act (Obamacare), the operation that killed Osama bin Laden, the Paris Climate Agreement, and efforts to recover from the 2008 financial crisis. Known for his eloquence and hope-driven leadership, Obama remains a prominent global figure advocating for democracy, equality, and climate action.


In [4]:
# known prompt for summarization with T5
t5_prepared_text = "summarize: " + preprocess_text

input_ids = base_tokenizer.encode(t5_prepared_text, return_tensors="pt")

# summarize
summary_ids = base_model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    min_length=30,
    max_length=50,
    early_stopping=True
)

output = base_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print (f"Summarized text: \n{output}")

Summarized text: 
born on august 4, 1961, in Honolulu, he was the first black president of the u.s. he served as the 44th president from 2009 to 2017 .
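The `no_repeat_ngram_size=3` argument used above stops the decoder from producing any 3-gram twice in the output. As a rough illustration of what that constraint means (a sketch on whitespace-split words, whereas the library works on token ids; `repeated_ngrams` and the example sentence are made up for this illustration), the following function finds n-grams that occur more than once:

```python
def repeated_ngrams(tokens, n):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated


# Hypothetical output that would violate no_repeat_ngram_size=3
summary = "he was the first black president he was the 44th president".split()
print(repeated_ngrams(summary, 3))
```

An output that passes the constraint yields an empty set for `n=3`.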


## English -> German Translation

In [5]:
input_ids = base_tokenizer.encode('translate English to German: Where is the apple?', return_tensors='pt')

# translate 
translate_ids = base_model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=20,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print (f"Translated text:\n{output}")


Translated text:
Wo ist der Apfel?


In [6]:
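# Pass labels to compute the loss. Note that the label ("Wo ist die Schokolade?") is not a
# translation of the input, so the loss measures how well the model predicts a mismatched target.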

input_ids = base_tokenizer('translate English to German: Where is the apple?', return_tensors='pt').input_ids
labels = base_tokenizer('Wo ist die Schokolade?', return_tensors='pt').input_ids

loss = base_model(input_ids=input_ids, labels=labels).loss

labels, loss

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


(tensor([[ 3488,   229,    67, 31267,    58,     1]]),
 tensor(2.0196, grad_fn=<NllLossBackward0>))
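The loss returned above is a cross-entropy averaged over the label tokens. The sketch below reproduces that calculation in plain Python on made-up logits (the function name, the 4-token vocabulary, and the numbers are illustrative assumptions, not values from the model). Positions labelled `-100` are skipped, which is the convention used for padded label positions.

```python
import math


def mean_token_cross_entropy(logits, labels, ignore_index=-100):
    """Average negative log-likelihood over label positions, skipping ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, labels):
        if target == ignore_index:
            continue
        # log-softmax via the log-sum-exp trick for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count


# Toy example: 3 decoder positions over a 4-token vocabulary (made-up numbers)
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.2, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
toy_labels = [0, 1, -100]  # the last position is ignored

toy_loss = mean_token_cross_entropy(toy_logits, toy_labels)
print(round(toy_loss, 4))
```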

## CoLA: The Corpus of Linguistic Acceptability
Checking for grammatical correctness

In [7]:
input_ids = base_tokenizer.encode('cola sentence: Where is the apple?', return_tensors='pt')

# CoLA 
translate_ids = base_model.generate(
    input_ids,
    num_beams=4,
    no_repeat_ngram_size=3,
    max_length=20,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"is grammatically correct?: \n{output}")


is grammatically correct?: 
acceptable


In [9]:
input_ids = base_tokenizer.encode('cola sentence: Where be a apples?', return_tensors='pt')

# CoLA
translate_ids = base_model.generate(
    input_ids,
    max_length=20,
    num_beams =2,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"is grammatically correct?: \n{output}")

is grammatically correct?: 
unacceptable


## STSB - Semantic Text Similarity Benchmark
Are two sentences semantically similar?

In [11]:
sentence_one = 'How to fish'
sentence_two = 'Fishing Manual for beginnners'


input_ids = base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')

# calculate semantic similarity 
translate_ids = base_model.generate(
    input_ids,
    max_length=3,
    num_beams =3,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"semantically similar? (0-5): \n{output}")

semantically similar? (0-5): 
3.2


In [12]:
sentence_one = 'How to fish'
sentence_two = 'Hiking Manual for beginnners'


input_ids = base_tokenizer.encode(f"stsb sentence1: {sentence_one} sentence2: {sentence_two}", return_tensors='pt')

# calculate semantic similarity
translate_ids = base_model.generate(
    input_ids,
    max_length=3,
    num_beams =2,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"semantically similar? (0-5): \n{output}")

semantically similar? (0-5): 
0.4
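T5 returns the similarity score as text, so it has to be parsed before it can be used numerically. A minimal sketch, assuming the 0 to 5 range shown in the print statements above (`parse_stsb_score` is a hypothetical helper, not part of `transformers`):

```python
def parse_stsb_score(text, low=0.0, high=5.0):
    """Convert generated text such as '3.2' to a float clamped to [low, high].

    Returns None when the text is not a number.
    """
    try:
        value = float(text.strip())
    except ValueError:
        return None
    return min(max(value, low), high)


print(parse_stsb_score("3.2"))
print(parse_stsb_score("0.4"))
print(parse_stsb_score("neutral"))
```

Returning `None` for non-numeric text makes it easy to spot cases where the model answered with a label from another task.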


## MNLI - Multi-Genre Natural Language Inference
Does a premise entail (“entailment”) or contradict (“contradiction”) a hypothesis, or neither (“neutral”)?

In [15]:
input_ids = base_tokenizer.encode(
    'mnli premise: I am active in politics. hypothesis: I am running for mayor', return_tensors='pt'
)

# mnli 
translate_ids = base_model.generate(
    input_ids,
    num_beams=2,
    early_stopping=True
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
entailment


In [17]:
input_ids = base_tokenizer.encode(
    'mnli premise: I am active in politics. hypothesis: I do not really vote', return_tensors='pt'
)

# mnli 
translate_ids = base_model.generate(
    input_ids,
    early_stopping=True,
    num_beams=3
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
contradiction


In [18]:
input_ids = base_tokenizer.encode(
    'mnli premise: I am active in politics. hypothesis: I code for a living', return_tensors='pt'
)

# mnli 
translate_ids = base_model.generate(
    input_ids,
    early_stopping=True,
    num_beams=3
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
neutral


## Q/A - Question/Answering

In [19]:
input_ids = base_tokenizer.encode(
    'question: Where does Obama live? context: Obama lives in Hawai but Matt lives in Boston.', return_tensors='pt'
)

# Q/A
translate_ids = base_model.generate(
    input_ids,
    early_stopping=True,
    num_beams = 2,
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
Hawai


In [20]:
input_ids = base_tokenizer.encode(
    'question: Where does Matt live? context: Obama lives in Hawai but Matt lives in Boston.', return_tensors='pt'
)

# Q/A
translate_ids = base_model.generate(
    input_ids,
    early_stopping=True,
    num_beams = 3
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
Boston


In [21]:
# Sanity check with random prompts

input_ids = base_tokenizer.encode(
    'prompt1: Where does Matt live? prompt2: Obama lives in Hawai but Matt lives in Boston.', return_tensors='pt'
)

# Q/A
translate_ids = base_model.generate(
    input_ids,
    early_stopping=True,
    num_beams =3
)

output = base_tokenizer.decode(translate_ids[0], skip_special_tokens=True)

print(f"Response: \n{output}")

Response: 
prompt2
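All of the prompts in the cells above follow the same pattern: a task prefix the model saw during pre-training, followed by named fields. The sketch below collects those prefixes in one place. The dictionary keys and the `build_t5_prompt` helper are hypothetical names chosen for this illustration; only the template strings come from the cells above.

```python
# Templates copied from the prompts used in this notebook
T5_PROMPT_TEMPLATES = {
    "summarize": "summarize: {text}",
    "translate_en_de": "translate English to German: {text}",
    "cola": "cola sentence: {text}",
    "stsb": "stsb sentence1: {sentence1} sentence2: {sentence2}",
    "mnli": "mnli premise: {premise} hypothesis: {hypothesis}",
    "qa": "question: {question} context: {context}",
}


def build_t5_prompt(task, **fields):
    """Fill the template for a known task; raise ValueError for unknown tasks."""
    try:
        template = T5_PROMPT_TEMPLATES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    return template.format(**fields)


print(build_t5_prompt("cola", text="Where is the apple?"))
print(build_t5_prompt("stsb", sentence1="How to fish", sentence2="Fishing Manual"))
```

As the sanity check with `prompt1:` and `prompt2:` shows, an unknown prefix gives poor output, which is why the helper rejects unknown task names instead of passing them through.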


## Using T5 for abstractive summarization

In [22]:
from transformers import pipeline, T5ForConditionalGeneration, TrainingArguments, Trainer, \
                         DataCollatorForSeq2Seq
import pandas as pd
from datasets import Dataset
import random

In [23]:
base_model = T5ForConditionalGeneration.from_pretrained('t5-small')
base_tokenizer = T5Tokenizer.from_pretrained('t5-small')

In [24]:
# https://www.kaggle.com/snap/amazon-fine-food-reviews?select=Reviews.csv

reviews = pd.read_csv('data/reviews.csv')

# Pre-processing step
# End every summary with punctuation so the decoder gets a consistent signal for where a summary stops
def add_punc(s):
    # Append a period unless the summary already ends in sentence punctuation
    if not s.endswith(('.', '!', '?')):
        s = s + '.'
    return s

reviews.dropna(inplace=True)

reviews['Summary'] = reviews['Summary'].map(add_punc)

print(reviews.shape)

reviews.head()

(96486, 3)


|   | Text | Summary | Score |
|---|---|---|---|
| 0 | Great taffy at a great price. There was a wid... | Great taffy. | 5 |
| 1 | This taffy is so good. It is very soft and ch... | Wonderful, tasty taffy. | 5 |
| 2 | Right now I'm mostly just sprouting this so my... | Yay Barley. | 5 |
| 3 | This is a very healthy dog food. Good for thei... | Healthy Dog Food. | 5 |
| 4 | good flavor! these came securely packed... the... | fresh and greasy! | 4 |


In [25]:
reviews = reviews[(reviews['Summary'].str.len() < 100) & (reviews['Summary'].str.len() >=30)]

reviews.shape

(13073, 3)

In [26]:
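random.seed(0)

# random.seed does not affect pandas sampling, so seed the sample directly with random_state
reviews_dataset = Dataset.from_pandas(reviews.astype(str).sample(5000, random_state=0))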

In [27]:
# We have a prompt but only as a prefix in the encoder
prefix = "summarize: "

# we will manually add our own labels because unlike GPT, we cannot assume the labels are based on the inputs
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["Text"]]
    model_inputs = base_tokenizer(inputs, max_length=1024, truncation=True)

    labels = base_tokenizer(examples["Summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [28]:
tokenized_reviews_dataset = reviews_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [29]:
tokenized_reviews_dataset = tokenized_reviews_dataset.train_test_split(test_size=.1)

In [30]:
# Data collator specifically for generic sequence to sequence tasks
# Use when we are translating one sequence to another like translation, summarization, etc
data_collator = DataCollatorForSeq2Seq(tokenizer=base_tokenizer, model=base_model)
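Because summaries in a batch have different lengths, the collator has to pad the labels. By default `DataCollatorForSeq2Seq` pads labels with `-100`, a value the loss ignores. The plain-Python sketch below shows the idea only (`pad_labels` is a hypothetical helper, right-padding is assumed, and the token ids are illustrative):

```python
def pad_labels(label_batch, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


# Two label sequences of different lengths (illustrative token ids)
batch = [
    [3488, 229, 67, 31267, 58, 1],
    [3488, 1],
]

for row in pad_labels(batch):
    print(row)
```

Padding with `-100` rather than the tokenizer's pad id keeps the padded positions out of the loss.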

In [32]:
training_args = TrainingArguments(
    output_dir="./t5_summary_results",
    eval_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=20,
    load_best_model_at_end=True,
    logging_steps=50,
    save_strategy='epoch'
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_reviews_dataset["train"],
    eval_dataset=tokenized_reviews_dataset["test"],
    data_collator=data_collator,
)

trainer.evaluate()

{'eval_loss': 4.3676838874816895,
 'eval_model_preparation_time': 0.0026,
 'eval_runtime': 0.3665,
 'eval_samples_per_second': 1364.173,
 'eval_steps_per_second': 43.654}

In [33]:
trainer.train() 

| Epoch | Training Loss | Validation Loss | Model Preparation Time |
|---|---|---|---|
| 1 | 3.7012 | 3.299554 | 0.0026 |
| 2 | 3.5133 | 3.229542 | 0.0026 |
| 3 | 3.4442 | 3.170504 | 0.0026 |
| 4 | 3.2992 | 3.132667 | 0.0026 |
| 5 | 3.2565 | 3.101396 | 0.0026 |
| 6 | 3.2031 | 3.071202 | 0.0026 |
| 7 | 3.2091 | 3.053885 | 0.0026 |
| 8 | 3.1492 | 3.035452 | 0.0026 |
| 9 | 3.1109 | 3.021343 | 0.0026 |
| 10 | 3.1215 | 3.007421 | 0.0026 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=2820, training_loss=3.15647518617887, metrics={'train_runtime': 218.4289, 'train_samples_per_second': 412.033, 'train_steps_per_second': 12.91, 'total_flos': 1246856435859456.0, 'train_loss': 3.15647518617887, 'epoch': 20.0})
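The `global_step=2820` reported above can be checked against the settings in this notebook. The arithmetic below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; those are assumptions about the run, not something the output states.

```python
import math

n_samples = 5000          # rows sampled from the reviews
n_test = n_samples // 10  # test_size=.1
n_train = n_samples - n_test
batch_size = 32           # per_device_train_batch_size and per_device_eval_batch_size
epochs = 20               # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(n_test / batch_size)

print(n_train, steps_per_epoch, total_steps, eval_steps)
```

Under these assumptions the total matches the reported `global_step`, so the run covered all 20 configured epochs.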

In [34]:
trainer.evaluate()

{'eval_loss': 2.95815372467041,
 'eval_model_preparation_time': 0.0026,
 'eval_runtime': 0.3524,
 'eval_samples_per_second': 1418.784,
 'eval_steps_per_second': 45.401,
 'epoch': 20.0}
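One way to read the two `eval_loss` values is to convert them to perplexity, the exponential of the mean cross-entropy. The snippet below does this for the values reported before and after fine-tuning. It is only a convenience calculation on the numbers printed above, and the variable names are chosen for this illustration.

```python
import math

eval_loss_before = 4.3676838874816895  # trainer.evaluate() before training
eval_loss_after = 2.95815372467041     # trainer.evaluate() after training

ppl_before = math.exp(eval_loss_before)
ppl_after = math.exp(eval_loss_after)

print(f"perplexity before fine-tuning: {ppl_before:.2f}")
print(f"perplexity after fine-tuning:  {ppl_after:.2f}")
```

Lower perplexity means the model assigns higher probability to the reference summaries in the held-out split.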

In [35]:
trainer.save_model()

In [36]:
loaded_model = T5ForConditionalGeneration.from_pretrained('./t5_summary_results')

# summarization pipeline prepends a default prefix of summarize: 
generator = pipeline(
    'summarization', model=loaded_model, tokenizer=base_tokenizer
)

Device set to use cuda:0


In [37]:
sam = reviews.sample(1)
print(sam['Summary'])
text = sam['Text'].tolist()[0]
text

74105    Great Coffee at a great value.
Name: Summary, dtype: object


'I am a coffee fanatic.  This coffee is delicious!  As good if not better than other brands, and the price is reasonable.'

In [38]:
# Generate a summary
generator(text, min_length=3, max_length=15, early_stopping=True, num_beams=2)

[{'summary_text': 'Great coffee for a great price.'}]

In [39]:
# Try the base t5 on the same text
base_generator = pipeline(
    'summarization', model='t5-small', tokenizer='t5-small'
)

# Summary is a bit more extractive than our fine-tuned version and style isn't quite the same as our dataset
base_generator(text, min_length=3, max_length=15, early_stopping=True, num_beams=2)

Device set to use cuda:0


[{'summary_text': 'a coffee fanatic, this coffee is delicious!'}]

In [46]:
# Sanity check: try a prefix the model was not fine-tuned on. The result is poor: the prefix leaks into the summary
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = base_tokenizer("not my prompt: " + text, return_tensors="pt")
inputs = inputs.to(device)
outputs = loaded_model.generate(
    inputs["input_ids"], min_length=3, max_length=15,
)

print(base_tokenizer.decode(outputs[0], skip_special_tokens=True))

Yummy, but not my prompt.
