# 📗 Goodreads: RoBERTa with Trainer API (Torch)

## Use Book Reviews to Predict Ratings

<div align="center">
    <img src="https://raw.githubusercontent.com/justinsiowqi/-Goodreads-RoBERTa-with-Trainer-API-Torch-/main/BERT.png" alt="Sesame Street" style="width: 500px;"> 
</div>
<div align="center">
  © Sesame Street (1969 TV Series)
</div>

Welcome back to Part 2 of the series! 

In the [previous notebook](https://www.kaggle.com/code/justinsiow/goodreads-distilbert-automodel-tensorflow), we used the HuggingFace transformers library to predict book ratings. Specifically, we created a **DistilBERT** AutoTokenizer and AutoModel and achieved an accuracy of **0.56808**. 

In this notebook, we will take advantage of two other HuggingFace libraries (**datasets** and **evaluate**) to process the data and evaluate the model's performance. Instead of using the DistilBERT model, we will use a bigger model called **RoBERTa** and train it using the **Trainer API**. By increasing the amount of training data twofold, we managed to get a better accuracy of **0.59792**.

---

### <font color='000000'>Table of contents</font><a class='anchor' id='top'></a>

1. [Introduction](#section-one)  
    
2. [Get Data](#section-two)
    
3. [Prepare Data](#section-three)
    
4. [Build Tokenizer](#section-four)
    
5. [Build & Train Model](#section-five) 
    
6. [Test Model](#section-six)

7. [Conclusion](#section-seven)

---

<a class="anchor" id="section-one"></a>
## 1. Introduction

The goal of this notebook is to **classify text** from book reviews and predict how many stars a reader will give a book. We'll begin by loading and processing the text using the **datasets** library, then tokenize it so that our model can understand it. Next, we'll use the **transformers** library to create a **R**obustly **o**ptimized **BERT**-Pretraining **a**pproach (**RoBERTa**) model and train it using the Trainer API. To gauge how well our model has performed, we'll compute its accuracy using the **evaluate** library. Finally, we'll predict the ratings for the test set and write them to a CSV file.

**How to use this notebook**: This notebook uses PyTorch. If you'd prefer a simpler implementation using TensorFlow, check out [this notebook](https://www.kaggle.com/code/justinsiow/goodreads-distilbert-automodel-tensorflow) instead. Both notebooks come with guides that will help you better understand how the code works. 
Last but not least, have fun playing around with different models and hyperparameters!

---

<a class="anchor" id="section-two"></a>
## 2. Get Data

- Download dependencies.
- Define the checkpoint for AutoTokenizer and AutoModel.
- Load training and test dataset using HuggingFace datasets.

In [None]:
# Dependencies
# If on Kaggle/Colab, uncomment and run this cell. If in a terminal, remove the exclamation marks

# ! pip install datasets
# ! pip install evaluate
# ! pip install transformers

In [None]:
# Import libraries

import evaluate
import numpy as np
import pandas as pd
import torch

# Only the names used in this notebook are imported
from datasets import load_dataset, ClassLabel
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForSequenceClassification

### A Guide to AutoTokenizer and AutoModel:

**AutoTokenizer** and **AutoModel** make it super simple to retrieve the model you want to use. All you have to do is instantiate each one using from_pretrained() and pass in the model checkpoint. We set **use_fast=True** to load the faster version of the tokenizer. There are 6 ratings in total (0 to 5), so the model is created with num_labels=6.

If you want to try out different models, simply head over to [HuggingFace Models](https://huggingface.co/models) and replace 'roberta-base' in the checkpoint (yes, it's that simple!)
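
As a small illustration of why `num_labels=6`, the sketch below builds the mapping between class ids and star ratings in plain Python. The names `id2label` and `label2id` mirror optional configuration arguments that `from_pretrained()` accepts. Passing them is an assumption on our part, since this notebook only sets `num_labels`, and the `"5-star"` style names are made up for illustration.

```python
# Ratings range from 0 to 5 stars, which gives six classes.
ratings = list(range(6))

# Map class ids to human-readable names and back (names are hypothetical).
id2label = {i: f"{i}-star" for i in ratings}
label2id = {name: i for i, name in id2label.items()}

num_labels = len(id2label)
print(num_labels)    # 6
print(id2label[5])   # 5-star
```

If you swap in a dataset with a different number of classes, `num_labels` has to change with it.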

In [None]:
# Define the checkpoint for AutoModel and AutoTokenizer

checkpoint = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

# Activate Mac M1 GPU. If on Windows/Colab/Kaggle, replace all the mps with cuda
if torch.backends.mps.is_available():
    mps_device = torch.device('mps')
    model.to(mps_device)

### A Guide to HuggingFace Datasets Part 1:

Loading a dataset using HuggingFace **datasets** is simple. All you need to do is call **load_dataset()**, state the type of file that you're loading (e.g. csv, json or txt) and pass in the file path. HuggingFace datasets assumes that every file belongs to the training split by default, which is why we label the files with 'train' and 'test' to tell the two splits apart.

A HuggingFace dataset is similar to a Pandas dataframe but with less functionality. We will be using some of its functions later on. First, let's convert the rating column into a class label using **class_encode_column()**. This marks the ratings as the classes we want to predict later on. Next, we'll call **rename_columns()** to rename rating and review_text to labels and text respectively, because these are the names the model and our tokenize function expect later.
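
To make the two column operations concrete, here is a dependency-free sketch of what they do conceptually. It does not call the datasets library: the helper names `encode_classes` and `rename_keys` and the sample rows are made up for illustration.

```python
def encode_classes(rows, column):
    """Replace each value in `column` with an integer class id."""
    # Class names are the sorted unique values, stored as strings.
    names = sorted({str(row[column]) for row in rows})
    name_to_id = {name: i for i, name in enumerate(names)}
    encoded = [{**row, column: name_to_id[str(row[column])]} for row in rows]
    return encoded, names

def rename_keys(rows, mapping):
    """Rename dictionary keys according to `mapping`, keeping other keys."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

# Hypothetical sample rows with the original column names.
rows = [
    {"rating": 5, "review_text": "Loved it."},
    {"rating": 0, "review_text": "Did not rate."},
    {"rating": 3, "review_text": "It was fine."},
]

encoded, class_names = encode_classes(rows, "rating")
renamed = rename_keys(encoded, {"rating": "labels", "review_text": "text"})
print(class_names)   # ['0', '3', '5']
print(renamed[0])    # {'labels': 2, 'text': 'Loved it.'}
```

Notice that class ids follow the sorted class names, so an id only equals its rating when all six ratings appear in the data, which we assume holds for the full training set.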

In [None]:
# Load the datasets

raw_train_dataset = load_dataset('csv', data_files={
    'train': '/kaggle/input/goodreads-books-reviews-290312/goodreads_train.csv'}
)

raw_test_dataset = load_dataset('csv', data_files={
    'test': '/kaggle/input/goodreads-books-reviews-290312/goodreads_test.csv'}
)

In [None]:
# Convert rating column into class label
raw_train_dataset['train'] = raw_train_dataset['train'].class_encode_column('rating')

# Rename rating and review_text columns
raw_train_dataset['train'] = raw_train_dataset['train'].rename_columns({'rating': 'labels', 'review_text': 'text'})

In [None]:
# Training dataset has 11 features and 900,000 rows

train_ds = raw_train_dataset['train']
train_ds

---

<a class="anchor" id="section-three"></a>
## 3. Prepare Data

- Remove book reviews that have:
    - Negative number of votes.
    - Negative number of comments.
    - NaN values.
    - Fewer than 30 words.
- Take a random sample of the training dataset.
- Split and encode the target variable.

### A Guide to HuggingFace Datasets Part 2:

There are two other functions we will use from the HuggingFace datasets library. First, we'll use the **filter()** function to remove book reviews where 1) n_votes is negative, 2) n_comments is negative, 3) NaN values exist and 4) number of words is less than 30. This helps us narrow down the text so that our model can learn better.

Next, we'll use the **remove_columns()** function to remove all the columns except for the labels and text columns. The last thing to do is to randomly shuffle the training dataset and sample 20% of it. We won't use the full dataset because it would take much longer to tokenize and train on.
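
The filtering rules can also be written as one predicate, which makes them easy to check on a handful of rows before running them over the whole dataset. This is a sketch in plain Python: `keep_review`, `MIN_WORDS` and the sample rows are hypothetical, and the column names follow the renamed dataset above.

```python
MIN_WORDS = 30  # reviews with fewer words than this are dropped

def keep_review(row):
    """Return True if a review passes all of the filtering rules."""
    if row["n_votes"] < 0 or row["n_comments"] < 0:
        return False
    if row["text"] is None:
        return False
    return len(row["text"].split()) >= MIN_WORDS

rows = [
    {"text": "word " * 40, "n_votes": 2, "n_comments": 0},   # kept
    {"text": "too short", "n_votes": 1, "n_comments": 1},    # too few words
    {"text": None, "n_votes": 0, "n_comments": 0},           # missing text
    {"text": "word " * 40, "n_votes": -1, "n_comments": 0},  # negative votes
]

kept = [row for row in rows if keep_review(row)]
print(len(kept))  # 1
```

A function like this could be passed to `filter()` in place of several separate lambdas, at the cost of not seeing how many rows each rule removes.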

In [None]:
# Remove reviews where number of votes or number of comments is negative
train_ds = train_ds.filter(lambda x: x['n_votes'] >= 0 and x['n_comments'] >= 0)

# Remove NaN values
train_ds = train_ds.filter(lambda x: x['text'] is not None)

# Remove rows that have less than 30 words
train_ds = train_ds.filter(lambda example: len(example['text'].split()) >= 30)

In [None]:
# Drop the unnecessary columns from the training and test dataset

train_columns_to_delete = ['user_id', 'book_id', 'review_id', 'date_added', 'date_updated', 'read_at',
                           'started_at', 'n_votes', 'n_comments']

train_ds = train_ds.remove_columns(train_columns_to_delete)

In [None]:
# Take a random sample of the training dataset

train_ds = train_ds.shuffle(seed=28).select(range(int(len(train_ds) * 0.2)))
num_rows = train_ds.num_rows

train_ds

In [None]:
# Split the training into 90% training and 10% validation

dataset = train_ds.train_test_split(test_size=0.1, seed=28)

train_dataset = dataset['train']
val_dataset = dataset['test']

---

<a class="anchor" id="section-four"></a>
## 4. Build Tokenizer

- Create a function to tokenize text from the training and test dataset.
- Set the maximum length to 128 tokens. Reviews longer than that are truncated and shorter ones are padded.
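
Padding and truncation can be pictured with a toy example. The sketch below works on made-up token ids rather than a real tokenizer: `pad_or_truncate` is a hypothetical helper, and the pad id of 1 is an assumption chosen to match what we understand to be the padding token id of `roberta-base`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                    # truncate long inputs
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 0 marks padding
    return ids + [pad_id] * n_pad, attention_mask

# A short input gets padded, a long input gets cut (ids are invented).
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 112)), max_length=8)

print(short_ids)   # [0, 713, 16, 2, 1, 1, 1, 1]
print(short_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
print(len(long_ids), sum(long_mask))  # 8 8
```

A real tokenizer also keeps its special start and end tokens when truncating, which this sketch ignores. The attention mask is what lets the model skip the padded positions.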

In [None]:
# Define a function to tokenize text from the training and test dataset. In order to re-use it later, we only
# add labels when the batch has a labels column, since the test dataset has no labels.

labels = ClassLabel(names=['0', '1', '2', '3', '4', '5'])

def tokenize_function(example):
    tokens = tokenizer(example['text'], truncation=True, padding=True, max_length=128)
    if 'labels' in example:
        tokens['labels'] = labels.str2int(example['labels'])
    return tokens

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

In [None]:
# Remove the text column and set format to PyTorch

tokenized_train = tokenized_train.remove_columns(['text'])
tokenized_val = tokenized_val.remove_columns(['text'])

tokenized_train.set_format('torch')
tokenized_val.set_format('torch')

---

<a class="anchor" id="section-five"></a>
## 5. Build & Train Model

- Create a function to compute metrics and evaluate the model's performance.
- Customize the training process via TrainingArguments.
- Instantiate the trainer class and train the model.

### A Guide to Trainer API:

The HuggingFace **Trainer API** handles the training loop (and all its details) for you. There are just four simple steps to make it work: 1) define the model, 2) create a function to evaluate the model, 3) define training arguments and 4) define trainer. 

Step 1 has already been done at the top of this notebook. The next step is to create a function to compute the metrics and **evaluate our model's performance**. Let's load the accuracy metric from the HuggingFace Evaluate library. The model outputs one logit per rating class, so we use np.argmax() to pick the highest-scoring class for each review and compare it with the labels. The compute() method then calculates the metric for us. 
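
Here is a tiny worked example of that computation, written in plain Python so it runs without any libraries. The logits and labels are invented, and `argmax` and `accuracy` are hypothetical stand-ins for `np.argmax()` and the accuracy metric's `compute()`.

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# One row of six logits per review, one score per rating class (0 to 5).
logits = [
    [0.1, 0.2, 0.1, 0.3, 2.5, 1.0],   # highest score at index 4
    [0.0, 0.1, 0.2, 1.9, 0.4, 0.3],   # highest score at index 3
    [0.2, 0.1, 0.1, 0.5, 0.9, 3.1],   # highest score at index 5
    [1.2, 0.3, 0.2, 0.1, 0.4, 0.6],   # highest score at index 0
]
labels = [4, 3, 4, 0]

predictions = [argmax(row) for row in logits]
print(predictions)                    # [4, 3, 5, 0]
print(accuracy(predictions, labels))  # 0.75
```

Three of the four predictions match their labels, hence the accuracy of 0.75. Accuracy treats a prediction of 5 for a true rating of 4 the same as a prediction of 0, so near misses earn no credit.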

In Step 3, we'll define a **TrainingArguments** class that will contain all the hyperparameters we will use for training and evaluation. The model is evaluated and saved at the end of each epoch. Once training is complete, the best model will be loaded at the end. Finally, we'll define the trainer class and pass in the model, training arguments, tokenized data, tokenizer and the compute_metrics function. To start the training process, we just need to call **trainer.train()**.
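
To get a feel for how long training will run, the number of optimizer steps can be estimated from the batch size and the epoch count. The sketch assumes a single device and no gradient accumulation. The row count of 100,000 is a made-up figure, not the size of the sampled dataset, and `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_rows, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical training set size; batch size and epochs match the notebook.
steps = total_training_steps(num_rows=100_000, batch_size=4, num_epochs=2)
print(steps)  # 50000
```

A larger batch size cuts the number of steps but needs more GPU memory per step.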

In [None]:
# Define a function to evaluate the model's performance

# Load the metric once so it is not reloaded on every evaluation
metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    val_predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=val_predictions, references=labels)

In [None]:
# Set the hyperparameters for the Trainer API. Feel free to tweak this

training_args = TrainingArguments(
    output_dir=f'{checkpoint}-{num_rows}-samples',
    evaluation_strategy='epoch',  # called eval_strategy in newer transformers releases
    save_strategy='epoch',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True
)

In [None]:
# Use the Trainer API to train the RoBERTa model

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

---

<a class="anchor" id="section-six"></a>
## 6. Test Model

- Tokenize the review text from test set.
- Feed the tokens into the RoBERTa model and use it to predict the book ratings.
- Create a new CSV file for submission.

In [None]:
# Tokenize the test dataset. 
# All the steps here are exactly the same as the above, except we don't delete the review_id column

test_columns_to_delete = ['user_id', 'book_id', 'date_added', 'date_updated', 'read_at', 'started_at',
                          'n_votes', 'n_comments']

test_dataset = raw_test_dataset['test'].rename_columns({'review_text': 'text'})
test_dataset = test_dataset.remove_columns(test_columns_to_delete)

tokenized_test = test_dataset.map(tokenize_function, batched=True)

tokenized_test = tokenized_test.remove_columns(['text'])
tokenized_test.set_format('torch')

In [None]:
# Use the Trainer API to predict the tokenized test dataset

test_predictions = trainer.predict(tokenized_test)
preds = np.argmax(test_predictions.predictions, axis=-1)

In [None]:
# Create a new CSV file for submission

my_submission = pd.DataFrame({
    'review_id': tokenized_test['review_id'],
    'rating': preds
})

my_submission.to_csv('submission.csv', index=False)

---

<a class="anchor" id="section-seven"></a>
## 7. Conclusion

And that's a wrap! 

In this notebook, we leveraged the power of the HuggingFace **datasets**, **transformers** and **evaluate** libraries. Specifically, we used the **Trainer API** to fine-tune a pre-trained **RoBERTa** model to predict book ratings from Goodreads. We achieved an accuracy of 59.792%, which is pretty decent!

Thank you for sticking around till the end! If you've found this notebook helpful, please upvote it :)

### References:

- [Processing the Data](https://huggingface.co/course/chapter3/2?fw=pt)
- [Fine-Tuning a Model with Trainer API or Keras](https://huggingface.co/course/chapter3/3?fw=pt)
- [Loading Datasets](https://huggingface.co/docs/datasets/loading)
- [Text Classification on the IMDB Dataset](https://huggingface.co/docs/transformers/tasks/sequence_classification)