# 🤖 Fine-Tuning T5 for Product Review Generation

In this interactive lab, we'll explore the exciting task of generating product reviews using the T5 (Text-to-Text Transfer Transformer) model. We'll dive into data preparation, model training, and ultimately, review generation.

### Setup and Installation

First things first, we need to install the required libraries to ensure our environment is ready for the tasks ahead.

In [1]:
!pip install numpy==1.25.1
!pip install transformers[torch]
!pip install datasets==2.13.1

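Because the install cell pins specific versions, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_version` helper are illustrative names, not part of the original lab.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names installed by the cell above (torch arrives via transformers[torch]).
PACKAGES = ["numpy", "transformers", "datasets", "torch"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a version differs from the pinned one, restarting the runtime after installation is usually needed before the new version is picked up.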

### Importing Libraries 📚

Let's import all the necessary modules that will help us load datasets, process data, and utilize the T5 model.


In [2]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

### Data Preparation 📋

Our journey begins with preparing our dataset. We'll use a subset of Amazon product reviews for our analysis and training.

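#### Loading and Merging Datasets

The "amazon_us_reviews" dataset is no longer available, so we load one category of "McAuley-Lab/Amazon-Reviews-2023" instead. We join the product metadata to the reviews on `parent_asin`, keep only verified purchases whose review text is longer than 100 characters, and draw a random sample of 100,000 rows.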

In [3]:
# Amazon removed the "amazon_us_reviews" dataset, so we use a replacement here.
# "Electronics" (the category used in the lesson) also works, but that dataset is bigger and takes longer to load.
dataset_category = "Software"

meta_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_meta_{dataset_category}", split='full').to_pandas()[['parent_asin', 'title']]
review_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{dataset_category}", split='full').to_pandas()[['parent_asin', 'rating', 'text', 'verified_purchase']]

ds = meta_ds.merge(review_ds, on='parent_asin', how='inner').drop(columns="parent_asin")
ds = ds.rename(columns={"rating":"star_rating", "title":"product_title", "text":"review_body"})

ds = ds[ds['verified_purchase'] & (ds['review_body'].map(len) > 100)].sample(100_000)
ds


| | product_title | star_rating | review_body | verified_purchase |
|---|---|---|---|---|
| 3904723 | Monster Busters: Match 3 Puzzle | 4.0 | Wish the directions on how each piece works wa... | True |
| 3380706 | Quicken Premier 2010 [OLD VERSION] | 4.0 | It seems like a simple program, I've download ... | True |
| 4693914 | The Room (Kindle Tablet Edition) | 5.0 | Pretty fun over all, looking forward to more r... | True |
| 1477689 | Angry Birds Friends | 1.0 | The latest update to this game is a bug riddle... | True |
| 3811743 | Kik Messenger | 5.0 | This app is great I've been gaining through a ... | True |
| ... | ... | ... | ... | ... |
| 2124317 | Surface: Return to Another World Collector's E... | 5.0 | Big Fish Games comes thru again. Very seldom h... | True |
| 2653596 | Max | 3.0 | Could not sign up because they use cookies and... | True |
| 4019617 | Brother iPrint&Scan | 5.0 | Too easy - seriously. As an older guy who has... | True |
| 4602434 | Spider | 5.0 | no problems, quick load, great job! I've tri... | True |
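To see what the merge and filter steps do without downloading the full dataset, here is a small sketch on made-up data. The column names match the ones used above; the toy values and the variable names (`meta`, `reviews`, `merged`, `filtered`) are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the metadata and review tables (all values are made up).
meta = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3"],
    "title": ["Photo Editor", "Tax Helper", "Chess App"],
})
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2", "A9"],
    "rating": [5.0, 2.0, 4.0, 1.0],
    "text": ["x" * 150, "short", "y" * 120, "z" * 200],
    "verified_purchase": [True, True, False, True],
})

# Inner join: reviews without metadata ("A9") and products without reviews ("A3") are dropped.
merged = meta.merge(reviews, on="parent_asin", how="inner").drop(columns="parent_asin")
merged = merged.rename(columns={"rating": "star_rating", "title": "product_title", "text": "review_body"})

# Keep verified purchases whose review text is longer than 100 characters.
mask = merged["verified_purchase"] & (merged["review_body"].str.len() > 100)
filtered = merged[mask]
print(filtered[["product_title", "star_rating"]])
```

Only the 5-star "Photo Editor" review survives: the 2-star review is too short, and the "Tax Helper" review is not a verified purchase.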


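#### Encoding and Splitting

Next, we encode the `star_rating` column as a class label, which stratified splitting requires, and split the dataset into training (90%) and testing (10%) sets. Note that `class_encode_column` stores each rating as a class index rather than a star count: assuming all five ratings occur in the sample, ratings 1.0 to 5.0 become indices 0 to 4, so the `star_rating` of 2 in the printed example corresponds to a 3-star review.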

In [4]:
# Converting the pandas DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(ds)

# encoding the 'star_rating' column
dataset = dataset.class_encode_column("star_rating")

# Splitting the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column="star_rating")

train_dataset = dataset['train']
test_dataset = dataset['test']
print(train_dataset[0])

{'product_title': 'WatchESPN', 'star_rating': 2, 'review_body': 'Very good app to.watch sports but it freezes on my Kindle Fire and never comes back.  Sometimes I have to reboot the device to get it back working.', 'verified_purchase': True, '__index_level_0__': 3521665}
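The printed `star_rating` is a class index, not a star count. The sketch below emulates only the mapping step of `class_encode_column` in plain Python (stringify the values, sort the unique strings, use the position as the index); it does not call the library, and the sample `ratings` list and the `index_to_stars` helper are made up for illustration.

```python
# Emulates the mapping step only: stringify the values, sort the unique
# strings, and use each string's position as its class index.
ratings = [4.0, 1.0, 5.0, 3.0, 2.0, 5.0]  # made-up sample covering all five ratings

class_names = sorted({str(r) for r in ratings})
str2int = {name: index for index, name in enumerate(class_names)}

encoded = [str2int[str(r)] for r in ratings]
print(class_names)  # ['1.0', '2.0', '3.0', '4.0', '5.0']
print(encoded)      # [3, 0, 4, 2, 1, 4]


def index_to_stars(index):
    """Map a class index back to the original star rating."""
    return float(class_names[index])


print(index_to_stars(2))  # 3.0
```

On the real dataset, the label names can be read back with `train_dataset.features['star_rating'].int2str(index)`. Keep the offset in mind when writing prompts: a prompt ending in "2 Stars!" was paired with 3-star reviews during preprocessing.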


### Model Preparation 🛠️

Now, let's prepare our T5 model for training.

### Tokenizer Initialization

In [5]:
MODEL_NAME = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)


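### Data Preprocessing Function
We define a function that builds a prompt from each product title and star rating, tokenizes the prompt and the review text to a fixed length of 128 tokens, and replaces padding in the labels with -100 so that those positions are ignored by the loss.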

In [6]:
# Defining the function to preprocess the data
def preprocess_data(examples):
    # Prompt format: "review: <product title>, <star rating> Stars!"
    prompts = [f"review: {product_title}, {star_rating} Stars!" for product_title, star_rating in zip(examples['product_title'], examples['star_rating'])]
    # The review text is the target the model learns to generate
    responses = list(examples['review_body'])

    inputs = tokenizer(prompts, padding='max_length', truncation=True, max_length=128)
    targets = tokenizer(responses, padding='max_length', truncation=True, max_length=128)

    # Set -100 at the padding positions of target tokens so the loss ignores them
    target_input_ids = []
    for ids in targets['input_ids']:
        target_input_ids.append([token_id if token_id != tokenizer.pad_token_id else -100 for token_id in ids])

    inputs.update({'labels': target_input_ids})
    return inputs
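The two ideas in this function, the prompt template and the label masking, can be tried without loading a tokenizer. The sketch below assumes T5's padding token id of 0 and uses made-up token ids; `build_prompt` and `mask_padding` are hypothetical helper names that mirror the logic above.

```python
PAD_TOKEN_ID = 0      # T5's pad token id, hard-coded so the sketch needs no tokenizer
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def build_prompt(product_title, star_rating):
    """Build the prompt string in the same format used for training."""
    return f"review: {product_title}, {star_rating} Stars!"


def mask_padding(token_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [t if t != pad_token_id else IGNORE_INDEX for t in token_ids]


padded_target = [363, 19, 8, 1, 0, 0, 0, 0]  # made-up ids followed by padding

print(build_prompt("WatchESPN", 2))   # review: WatchESPN, 2 Stars!
print(mask_padding(padded_target))    # [363, 19, 8, 1, -100, -100, -100, -100]
```

Without the masking step, the model would also be trained to predict the padding tokens, which would dominate the loss for short reviews.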

### Preprocessing Datasets

In [7]:
train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


### Fine-Tuning the Model 🎯

With our data ready, we proceed to fine-tune the T5 model on our dataset.

In [8]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

TRAINING_OUTPUT = "./models/t5_fine_tuned_reviews"
training_args = TrainingArguments(
    output_dir=TRAINING_OUTPUT,
    num_train_epochs=3,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

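# Uncomment to run the fine-tuning; the cells below can load an already fine-tuned checkpoint instead.
# trainer.train()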

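Before uncommenting `trainer.train()`, it is worth estimating how many optimizer steps the run involves. This is a rough sketch that assumes a single device and no gradient accumulation; `optimizer_steps` is a hypothetical helper, and the example count of 90,000 comes from the 90/10 split of the 100,000 sampled reviews.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs, num_devices=1):
    """Estimate steps per epoch and total steps, assuming no gradient accumulation."""
    examples_per_step = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(num_examples=90_000, batch_size=12, epochs=3)
print(steps_per_epoch)  # 7500
print(total_steps)      # 22500
```

With `save_strategy='epoch'`, a checkpoint would be written after each of the three epochs, so make sure the output directory has room for several copies of the model.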

### Saving and Loading the Model 💾

After training, we save our model for later use and demonstrate how to load it.

In [9]:
# trainer.save_model(TRAINING_OUTPUT)

In [10]:
# To load the model you fine-tuned and saved locally, uncomment:
# model = T5ForConditionalGeneration.from_pretrained(TRAINING_OUTPUT)

# Or load an already fine-tuned checkpoint directly from the Hugging Face Hub:
model = T5ForConditionalGeneration.from_pretrained("TheFuzzyScientist/T5-base_Amazon-product-reviews")


### Generating Reviews ✍️

Finally, we use our fine-tuned model to generate reviews for new products.

In [11]:
# Defining the function to generate a review from a "<product title>, <rating> Stars!" string
def generate_review(text):
    inputs = tokenizer("review: " + text, return_tensors='pt', max_length=512, padding='max_length', truncation=True)
    # Pass the attention mask explicitly so that the padding tokens are ignored
    outputs = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=128, no_repeat_ngram_size=3, num_beams=6, early_stopping=True)
    review = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return review

In [12]:
# Generating reviews for random products
random_products = test_dataset.shuffle(42).select(range(10))['product_title']

print(generate_review(random_products[0] + ", 3 Stars!"))
print(generate_review(random_products[1] + ", 5 Stars!"))
print(generate_review(random_products[2] + ", 2 Stars!"))

Good for the price I bought this for my daughter for Christmas. She loves it. It's a little bulky, but it's not a big deal.
Great game for the price I bought this game for my son for Christmas and he loves it. It's easy to use and it's fun to play. The only thing I don't like about it is that it doesn't come with a remote.
It's a good product, but it's not as good as I thought it would be. I've had it for about a month now, and I'm very happy with it. The only thing I don't like about it is that it doesn't come with a charger.
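The `no_repeat_ngram_size=3` setting used in `generate_review` prevents any sequence of three tokens from appearing twice in the generated output. The sketch below illustrates the idea on whitespace-separated words; this is only an approximation, since the model operates on subword tokens, and `repeated_ngrams` is a hypothetical helper rather than a library function.

```python
def repeated_ngrams(tokens, n):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated


words = "it is fun to play and it is fun to use".split()
print(repeated_ngrams(words, 3))
```

Here the trigrams "it is fun" and "is fun to" each occur twice, so a decoder with a 3-gram constraint at the word level could not have produced this sentence. Lowering or raising `no_repeat_ngram_size` is one of the simpler ways to trade repetition against fluency.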


### Conclusion 🚀

Congratulations! You've just completed a hands-on project on fine-tuning T5 for generating product reviews. Experiment further with different product categories, or tweak the training arguments and generation settings to see how they affect the output. Happy coding!

