# 📕 Goodreads: DistilBERT AutoModel (Tensorflow)

## Use Book Reviews to Predict Ratings

<div align="center">
    <img src="https://github.com/justinsiowqi/-Goodreads-DistilBERT-AutoModel-Tensorflow-/blob/main/Sesame%20Street.gif?raw=true" alt="Sesame Street" style="width: 500px;"> 
</div>
<div align="center">
  © Sesame Street (1969 TV Series)
</div>

In this notebook, we will create a **DistilBERT model** that is able to take book reviews, understand them and use them to predict book ratings. The **AutoTokenizer** and **AutoModel** classes from the **HuggingFace library** will come in handy here! 

Also, since we're working with two huge datasets, we'll need to make certain adjustments to speed up the processing. This can be achieved through **Datatable**, **random sampling** and **fast tokenizers**. 

Let's dive in!

---

### <font color='000000'>Table of contents</font><a class='anchor' id='top'></a>

1. [Introduction](#section-one)  
    
2. [Get Data](#section-two)
    
3. [Prepare Data](#section-three)
    
4. [Build Tokenizer](#section-four)
    
5. [Build & Train DistilBERT](#section-five) 
    
6. [Test Model](#section-six)

7. [Conclusion](#section-seven)

---

<a class="anchor" id="section-one"></a>
## 1. Introduction

In this notebook, we'll explore a **Natural Language Processing** (NLP) task called **text classification**, which assigns labels to text. Specifically, we want to create a model that can read a book review and predict the rating the reader gave. In order to do so, we need two key components. First, we'll use a **tokenizer** to convert the text into tokens that the model can process. Second, the **classification model** will take those tokens, learn from their context and predict a rating.

Sounds like a lot, right? Thankfully, we can make use of the **HuggingFace library** to tokenize the text and train the model. In fact, these two parts can be created in 7 lines of code! We used **DistilBERT** (a distilled version of BERT) and got an accuracy of 0.57. If you'd like to see how to implement the BERT model using PyTorch, stay tuned for the next notebook [here](https://www.kaggle.com/justinsiow/code).

---

<a class="anchor" id="section-two"></a>
## 2. Get Data

- Download dependencies
- Load training and test dataset using datatable

In [None]:
# Dependencies
# If on Kaggle/Colab, uncomment and run this cell. In a terminal, run the commands without the exclamation marks

# ! pip install datatable
# ! pip install transformers
# ! pip install tensorflow

In [None]:
# Import libraries

import datatable as dt
import pandas as pd
import pickle
import numpy as np
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

### A Guide to DataTable:

Datatable was created to process **humongous** amounts of data. You can think of it as a much faster version of pandas with fewer features. Using Datatable is really simple. Instead of the pandas read_csv() function, we'll use the **fread()** function. Then, we call **.to_pandas()** to convert the result into a pandas DataFrame. That's all! In the next section, we can use all the pandas functions we're already familiar with.
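As a rough illustration of the loading pattern described above, here is a minimal sketch on a throwaway CSV. The file contents and the `load_csv` helper are made up for this example, and the pandas fallback is an assumption so that the snippet still runs where Datatable is not installed.

```python
import os
import tempfile

import pandas as pd

try:
    import datatable as dt  # optional: the fast reader used in this notebook
except ImportError:
    dt = None


def load_csv(path):
    """Hypothetical helper: read a CSV with Datatable if available, else pandas."""
    if dt is not None:
        return dt.fread(path).to_pandas()
    return pd.read_csv(path)


# Write a tiny, made-up CSV so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_reviews.csv")
    with open(csv_path, "w") as f:
        f.write("review_text,rating\nLoved it,5\nNot for me,2\n")
    toy_df = load_csv(csv_path)

print(toy_df.shape)
```

Either branch returns a regular pandas DataFrame, so the rest of the notebook does not need to know which reader was used.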

In [None]:
# Load the training and test dataset

train_df = dt.fread('/kaggle/input/goodreads-books-reviews-290312/goodreads_train.csv').to_pandas()
test_df = dt.fread('/kaggle/input/goodreads-books-reviews-290312/goodreads_test.csv').to_pandas()

In [None]:
# Take a look at the first 5 rows of the training set

train_df.head()

---

<a class="anchor" id="section-three"></a>
## 3. Prepare Data

- Remove book reviews where:
    - The number of votes is negative.
    - The number of comments is negative.
    - The review text is a duplicate.
- Take a random sample of the training dataset.
- Split and encode the target variable.

In [None]:
# Remove reviews where number of votes or number of comments are negative

train_df = train_df[train_df['n_votes'] >= 0]

train_df = train_df[train_df['n_comments'] >= 0]

In [None]:
# Remove duplicate reviews

train_df.drop_duplicates(subset=['review_text'], inplace=True)

In [None]:
# Take a random sample of the training dataset

train_df = train_df.sample(int(len(train_df) * 0.1), random_state=28)
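To see what the three cleaning steps above do, here is a small sketch on a made-up DataFrame. The rows are invented, and `frac=0.5` is used only because the toy frame is tiny; the notebook itself keeps roughly 10% of the rows.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this section
toy_df = pd.DataFrame({
    "review_text": ["Great", "Great", "Boring", "Okay", "Fun", "Slow", "Sweet"],
    "n_votes": [3, 3, -1, 0, 2, 1, 1],
    "n_comments": [0, 0, 1, -2, 1, 0, 1],
    "rating": [5, 5, 1, 3, 4, 2, 4],
})

# 1. Keep only rows with non-negative vote and comment counts
clean_df = toy_df[(toy_df["n_votes"] >= 0) & (toy_df["n_comments"] >= 0)]

# 2. Drop reviews whose text is an exact duplicate (the first copy is kept)
clean_df = clean_df.drop_duplicates(subset=["review_text"])

# 3. Sample a fixed fraction; random_state makes the sample reproducible
sample_df = clean_df.sample(frac=0.5, random_state=28)

print(len(toy_df), len(clean_df), len(sample_df))
```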

In [None]:
# Drop the unnecessary columns from the training and test datasets, then reset the index.
# Note: dropping 'review_id' from test_df here was a mistake. The submission file needs it, so the test set is reloaded in Section 6.

columns_to_delete = ['user_id', 'book_id', 'review_id', 'date_added', 'date_updated', 'read_at', \
                     'started_at', 'n_votes', 'n_comments']

train_df = train_df.drop(columns_to_delete, axis=1).reset_index(drop=True)
test_df = test_df.drop(columns_to_delete, axis=1).reset_index(drop=True)

In [None]:
# Split the target variable

X_train = train_df.drop('rating', axis=1)
y_train = train_df['rating']

In [None]:
# Encode the target variable in the training dataset

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
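To make the encoding step concrete: `LabelEncoder` gives each distinct label an integer according to its sorted order. The sketch below reproduces that mapping with `np.unique` on made-up ratings. It illustrates the idea and is not the notebook's own code.

```python
import numpy as np

# Made-up ratings; note that 1 and 2 never occur in this toy sample
ratings = np.array([5, 3, 0, 3, 4])

# classes: sorted unique labels; encoded: index of each rating within classes
classes, encoded = np.unique(ratings, return_inverse=True)

# Indexing classes with the encoded values recovers the original ratings
decoded = classes[encoded]

print(classes, encoded, decoded)
```

When every rating from 0 to 5 is present, which is very likely in a sample of this size, the encoded labels equal the ratings themselves. If a rating were missing, the encoded values would shift, and predictions would have to be mapped back through the classes.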

In [None]:
# Take a look at X_train. There are 89,020 rows

X_train

---

<a class="anchor" id="section-four"></a>
## 4. Build Tokenizer

- Use the fast version of the DistilBERT AutoTokenizer.
- Set the maximum length to 128 tokens. Truncate reviews that are longer and pad reviews that are shorter.

### A Guide to AutoTokenizer:

A tokenizer **splits text into tokens** and maps them to the numeric IDs that the model understands. We can start by calling **AutoTokenizer.from_pretrained()**, which loads the **vocab** from a pretrained tokenizer. We will set **use_fast=True**, which loads the faster (Rust-based) version of the tokenizer.

Next, we'll call the tokenizer on the review text from X_train. We will set return_tensors to "np" (NumPy arrays) and the **max length to 128 tokens**. Note that tokens are not the same as words: a single word can be split into several tokens. If a review has more than 128 tokens, we need to **truncate** it (cut off the extra tokens). On the other hand, if a review has fewer tokens, we need to add **padding** (padding tokens, which have ID 0 for this tokenizer, are appended so that all sequences have the same length).

The tokenizer will return a BatchEncoding object. We'll need to convert it to a dictionary. Also, we'll create a variable called labels which is a numpy array of y_train.

In [None]:
# DistilBERT AutoTokenizer with a maximum length of 128 tokens. Truncation and padding are used as well.

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased', use_fast=True)
tokenized_data = tokenizer(list(X_train['review_text']), return_tensors="np", max_length=128, truncation=True, padding=True)

# Convert tokenized data to a dictionary and y_train to a numpy array
tokenized_data = dict(tokenized_data)
labels = np.array(y_train) 
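The truncation and padding described in this section can be sketched in plain Python. This is a simplified illustration, not the tokenizer's real implementation: the token IDs are made up, the `pad_and_truncate` helper is hypothetical, and special tokens are ignored (the real tokenizer keeps the closing special token when it truncates).

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Hypothetical helper: cut sequences to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token ID sequences of different lengths
toy_batch = [[101, 7592, 102], [101, 11, 12, 13, 14, 102]]
encoded_batch = pad_and_truncate(toy_batch, max_length=4)

print(encoded_batch["input_ids"])
print(encoded_batch["attention_mask"])
```

The attention mask is what lets the model tell real tokens apart from padding, which is why the tokenizer returns it alongside the input IDs.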

---

<a class="anchor" id="section-five"></a>
## 5. Build & Train DistilBERT

- Use the DistilBERT AutoModel with 6 labels.
- Use the Adam optimizer. HuggingFace will define the loss function for us.
- Save the model.

### A Guide to AutoModel:

A **pretrained model** saves you a lot of time and effort compared to training the model from scratch. First, let's call **TFAutoModelForSequenceClassification.from_pretrained()** and pass in the same model as the tokenizer above. The number of labels will be 6 since the ratings are from 0 to 5. 

We'll use the **Adam optimizer** with a very low learning rate. We **don't have to specify a loss function**: when none is passed to compile(), the HuggingFace model uses its own built-in loss. Finally, we'll call .fit() and pass in the tokenized data and labels.

In [None]:
# DistilBERT AutoModel with 6 labels. Optimizer is Adam and loss is automatically set.

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=6)

# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5))

model.fit(tokenized_data, labels)
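For intuition about what is optimised during training: the model outputs one raw score (logit) per rating class, and for a classification head like this one the built-in loss is, as far as we know, a sparse categorical cross-entropy on those logits. The NumPy sketch below computes that quantity on made-up logits. The function name and the numbers are illustrative only.

```python
import numpy as np


def sparse_cross_entropy(logits, labels):
    """Mean cross-entropy between integer labels and raw logits (illustrative)."""
    # Subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each example
    true_log_probs = log_probs[np.arange(len(labels)), labels]
    return -true_log_probs.mean()


# Two made-up reviews, six rating classes (0 to 5)
logits = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # no preference between classes
    [0.1, 0.2, 0.1, 0.3, 2.5, 4.0],  # strongly favours class 5
])
labels = np.array([3, 5])

uniform_loss = sparse_cross_entropy(logits[:1], labels[:1])
confident_loss = sparse_cross_entropy(logits[1:], labels[1:])
print(uniform_loss, confident_loss)
```

A model that cannot tell the six classes apart scores a loss of ln(6), so a training loss below that value is a sign that fine-tuning is learning something.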

In [None]:
# Save model

model.save_pretrained('./model/clf')
with open('./model/info.pkl', 'wb') as f:
    pickle.dump(('distilbert-base-uncased', 128), f)

In [None]:
# Load model

new_model = TFAutoModelForSequenceClassification.from_pretrained('./model/clf')
# Use a context manager so the file is closed after reading
with open('./model/info.pkl', 'rb') as f:
    model_name, max_len = pickle.load(f)
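The metadata saved next to the model is just a model name and a maximum length, so a plain-text format works as well. The sketch below is an optional alternative, not what this notebook does: it stores the same two values as JSON, which stays human-readable and avoids unpickling files from untrusted sources. The file name `info.json` and the temporary directory are assumptions made for the example.

```python
import json
import os
import tempfile

info = {"model_name": "distilbert-base-uncased", "max_len": 128}

with tempfile.TemporaryDirectory() as model_dir:
    info_path = os.path.join(model_dir, "info.json")

    # Save the metadata as JSON
    with open(info_path, "w") as f:
        json.dump(info, f)

    # Load it back and unpack the two values
    with open(info_path) as f:
        loaded = json.load(f)

model_name, max_len = loaded["model_name"], loaded["max_len"]
print(model_name, max_len)
```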

---

<a class="anchor" id="section-six"></a>
## 6. Test Model

- Tokenize the review text from test set.
- Feed the tokens into the DistilBERT model and use it to predict the book ratings.
- Create a new CSV file for submission.

In [None]:
# Now we need to tokenize the test dataset and then use the model to predict

tokenized_test = tokenizer(list(test_df['review_text']), return_tensors="np", max_length=128, truncation=True, padding=True)

# Convert tokenized data to a dictionary
tokenized_test = dict(tokenized_test)

preds = model.predict(tokenized_test)

In [None]:
# Reload the test set to recover the 'review_id' column
# The next few cells work around a mistake: 'review_id' was dropped from test_df in Section 3.

new_test_df = dt.fread('goodreads_test.csv').to_pandas()

new_test_df

In [None]:
# Create a new test dataset

columns_to_delete = ['user_id', 'book_id', 'date_added', 'date_updated', 'read_at', \
                     'started_at', 'n_votes', 'n_comments']

new_test_df = new_test_df.drop(columns_to_delete, axis=1).reset_index(drop=True)

new_test_df

In [None]:
# Create a new CSV file that includes the 'review_id' and predicted 'rating'

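# Pick the class with the highest logit for every review.
# The class indices equal the ratings, assuming all six ratings (0 to 5) appear in the training sample.
results = np.argmax(preds.logits, axis=1)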

my_submission = pd.DataFrame({
    "review_id": new_test_df["review_id"],
    "rating": results.astype(int)
})

my_submission.to_csv('submission.csv', index=False)
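The step from logits to a submission table can be sketched on toy data as follows. The logits and review IDs are made up. The point is that taking the argmax over the class axis gives one predicted rating per review, and that predictions and IDs must line up row by row.

```python
import numpy as np
import pandas as pd

# Made-up logits for three reviews and six rating classes
toy_logits = np.array([
    [0.1, 0.2, 0.3, 2.0, 0.5, 0.1],
    [1.5, 0.2, 0.1, 0.0, 0.3, 0.2],
    [0.0, 0.1, 0.2, 0.3, 0.4, 3.0],
])
toy_review_ids = ["r1", "r2", "r3"]  # hypothetical IDs

# argmax over the class axis gives one predicted rating per review
toy_ratings = np.argmax(toy_logits, axis=1)

# Predictions and IDs must line up row by row
assert len(toy_ratings) == len(toy_review_ids)

toy_submission = pd.DataFrame({"review_id": toy_review_ids, "rating": toy_ratings})
print(toy_submission)
```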

---

<a class="anchor" id="section-seven"></a>
## 7. Conclusion

In this notebook, we used the **DistilBERT** model to predict book ratings. We started by loading the datasets using **Datatable**. Next, we preprocessed the data by removing invalid and duplicate reviews, dropping unused columns and taking a random sample. Then, the **AutoTokenizer** class converted the text to tokens and the **AutoModel** class loaded the pretrained model for us to fine-tune. Finally, we used the predict() function to create our submission file.

Thanks for looking through this notebook. Feel free to check out my second and third attempts at the Goodreads Books Reviews competition. Also, do drop an upvote if this has helped you in any way :)

### References:

- [An Overview of Python's Datatable Package](https://towardsdatascience.com/an-overview-of-pythons-datatable-package-5d3a97394ee9)
- [Preprocess](https://huggingface.co/docs/transformers/preprocessing)
- [Fine-tune a Pretrained Model](https://huggingface.co/docs/transformers/training)