# Rating Classification

#### The purpose of this notebook is to preprocess the review and rating columns of a custom dataset, 'RAW_interactions.csv', for use with the DistilBERT model from Hugging Face, and then to fine-tune that model on the dataset to perform multi-class classification (predicting a rating from 1 to 5).

### Setup

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
# Change working directory to be current folder
import os
os.chdir('/content/gdrive/My Drive/Food Recipe App/')
print("Current working directory:")
os.getcwd()
!ls

Current working directory:
 checkpoint		      RAW_recipes.csv
 checkpoints		      target_tensor_cpu.pickle
'Copy of transformer.ipynb'   target_tensor.pickle
 document_similarity.ipynb    targ_tokenizer_cpu.pickle
 inp_tokenizer_cpu.pickle     targ_tokenizer.pickle
 inp_tokenizer.pickle	      transformer_cpu.data-00000-of-00001
 input_tensor_cpu.pickle      transformer_cpu.index
 input_tensor.pickle	      transformer.data-00000-of-00001
 rating_classification	      transformer.index
 rating_classification2       transformer.ipynb
 RAW_interactions.csv	      wmd.model


In [3]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.1-py3-none-any.whl (59 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [1]:
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased'

### Preprocess Custom Dataset

In [6]:
df = pd.read_csv("RAW_interactions.csv", usecols=['review', 'rating'])

# Drop rows that are NaN
df.dropna(subset=['review'], inplace=True)

# Keep reviews that contain at least 10 whitespace-separated words
df = df[df['review'].str.split().str.len().ge(10)]

Inspection of the data shows that many reviews with a rating of '0' are inaccurate: the rating does not match the review text. For example:

Review: This is a very good recipe. We also want to cut back on the fat content in our diet . Very tasty dish!!!


Rating: 0


As such, for consistency and brevity, rating '0' will not be included.
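
The exclusion of rating '0' can also be made explicit before sampling. The following is a minimal sketch on a toy DataFrame; `toy_df`, `rated_df`, and the rows themselves are hypothetical and not taken from 'RAW_interactions.csv'.

```python
import pandas as pd

# Hypothetical toy data standing in for the review/rating columns
toy_df = pd.DataFrame({
    "review": ["great", "fine", "bad", "unrated but positive"],
    "rating": [5, 3, 1, 0],
})

# Keep only rows whose rating is in the 1-5 range, which drops rating 0
rated_df = toy_df[toy_df["rating"].between(1, 5)]

# Count the remaining rows per rating to check the class balance
counts = rated_df["rating"].value_counts().sort_index()
print(counts)
```

Checking `value_counts()` at this point also shows whether every rating has enough rows for the per-class sampling that follows.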

In [6]:
# Build a balanced dataset: take the first 4000 reviews for each rating (1-5)
df1 = df.loc[df['rating'] == 1].head(4000)
df2 = df.loc[df['rating'] == 2].head(4000)
df3 = df.loc[df['rating'] == 3].head(4000)
df4 = df.loc[df['rating'] == 4].head(4000)
df5 = df.loc[df['rating'] == 5].head(4000)

df_modified = pd.concat([df1, df2, df3, df4, df5])
print(df_modified)

      rating                                             review
83         1  I did not care for this at all. All I could ta...
121        1  I found this pudding, extremely discusting.  I...
130        1  What a disastrous recipe.  We used fat free yo...
192        1  This was incredibly sweet, and I reduced the s...
194        1  This was really terrible. It was overwhelmingl...
...      ...                                                ...
5496       5  We had this on our Thanksgiving table.  I used...
5497       5  This is really good.  Different form what most...
5499       5  Delicious!  I was looking for something differ...
5500       5  I too was looking for an alternative to baked ...
5501       5  It's been quite a while since I made this and ...

[20000 rows x 2 columns]
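
The per-rating selection above could also be written with a single `groupby`. This is only a sketch of an alternative, shown on hypothetical toy data with a hypothetical cap of 2 rows per rating instead of 4000, and it assumes rating 0 has already been filtered out.

```python
import pandas as pd

# Hypothetical toy data; the notebook uses the filtered review DataFrame
toy_df = pd.DataFrame({
    "rating": [1, 1, 1, 2, 2, 5, 5, 5, 5],
    "review": [f"review {i}" for i in range(9)],
})

per_class = 2  # hypothetical cap; the notebook uses 4000

# head() on a groupby keeps the first per_class rows of every rating.
# The stable sort orders the result by rating, as pd.concat does above.
balanced = (
    toy_df.groupby("rating")
    .head(per_class)
    .sort_values("rating", kind="stable")
)
print(balanced["rating"].value_counts().sort_index())
```

A rating with fewer rows than the cap simply contributes all of its rows, so the result is only balanced when every rating has at least `per_class` reviews.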


In [7]:
reviews = df_modified['review']
ratings = df_modified['rating']

In [8]:
def create_dataset(reviews, ratings):
  texts = []
  labels = []
  for review in reviews:
    texts.append(str(review))
  for rating in ratings:
    if rating == 1:
      labels.append([1,0,0,0,0])
    elif rating == 2:
      labels.append([0,1,0,0,0])
    elif rating == 3:
      labels.append([0,0,1,0,0])
    elif rating == 4:
      labels.append([0,0,0,1,0])
    elif rating == 5:
      labels.append([0,0,0,0,1])

  return texts, labels
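
Note that `create_dataset` silently skips any rating outside 1-5, which would leave `texts` and `labels` with different lengths. A stricter variant of the one-hot step might look like the sketch below; `one_hot_rating` is a hypothetical helper, not part of the notebook.

```python
def one_hot_rating(rating, num_classes=5):
    """Return a one-hot list for a rating in 1..num_classes."""
    if not 1 <= rating <= num_classes:
        # Fail loudly instead of skipping, so texts and labels stay aligned
        raise ValueError(f"rating {rating} is outside 1..{num_classes}")
    vec = [0] * num_classes
    vec[rating - 1] = 1  # rating r maps to index r - 1
    return vec


example_labels = [one_hot_rating(r) for r in [1, 3, 5]]
print(example_labels)
```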

In [9]:
texts, labels = create_dataset(reviews, ratings)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.33, random_state=42)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)
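
The two calls above first hold out 33% of the data for testing, then 20% of the remainder for validation. The resulting split sizes can be worked out by hand, as in the sketch below. It assumes `train_test_split` rounds a fractional test size up to a whole number of samples; the variable names are illustrative.

```python
import math

n_total = 20000  # rows in df_modified

# First split: test_size=0.33 of all rows
n_test = math.ceil(0.33 * n_total)
n_remaining = n_total - n_test

# Second split: test_size=0.2 of what is left becomes the validation set
n_val = math.ceil(0.2 * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```

With a batch size of 8 and 3 epochs, this training set size gives the 4020 training steps printed later in the notebook.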

In [10]:
# Tokenize the texts
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='tf')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='tf')
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors='tf')
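
In the tokenizer calls above, `truncation=True` cuts sequences that exceed the model's maximum length and `padding=True` pads every sequence to the longest one in the batch. The toy function below sketches that idea on plain lists; `pad_and_truncate` and the token IDs are hypothetical, and unlike the real tokenizer it does not preserve the final special token when truncating.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask


# Hypothetical token IDs for three reviews of different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
ids, mask = pad_and_truncate(batch, max_length=5)
print(ids)
print(mask)
```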

In [11]:
# Turn encodings and labels into a Dataset object
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

### Fine-tune the Model

In [12]:
# Each review has exactly one rating, so this is single-label (multi-class) classification
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5, problem_type="single_label_classification")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_layer_norm', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [13]:
batch_size = 8
num_epochs = 3

# Decay (anneal) the learning rate with a learning rate scheduler.
# PolynomialDecay is a good choice: despite the name, with default settings it
# linearly decays the learning rate from the initial value to the final value
# over the course of training.
# The scheduler needs to know how long training will run, which is num_train_steps:
# the number of batches per epoch (training samples // batch size) multiplied by
# the number of epochs.
num_train_steps = (len(train_texts) // batch_size) * num_epochs
print(num_train_steps)
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
opt = Adam(learning_rate=lr_scheduler)

# Define the training loss.
# By default, Keras assumes the model outputs are probabilities (softmax already applied).
# This model outputs logits, the raw values before softmax, so from_logits=True is required.
loss = CategoricalCrossentropy(from_logits=True)

4020
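
The linear decay described in the cell above can be sketched in plain Python. `linear_decay` is a hypothetical stand-in that follows the documented `PolynomialDecay` formula with `power=1.0` and no cycling; it is not the Keras implementation itself.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=4020, power=1.0):
    """Learning rate at a given step under polynomial decay (linear when power=1)."""
    step = min(step, decay_steps)  # hold the final value after decay_steps
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


# Learning rate at the start, halfway point, and end of training
for step in (0, 2010, 4020):
    print(step, linear_decay(step))
```

The rate starts at the initial value, is halved at the midpoint, and reaches the end value on the last training step.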


In [14]:
model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
# The datasets are batched here, so batch_size is not passed to fit().
# Validation data does not need to be shuffled.
model.fit(
    train_dataset.shuffle(1000).batch(batch_size),
    validation_data=val_dataset.batch(batch_size),
    epochs=num_epochs
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fc9bd579990>

In [15]:
# Save model
model.save_pretrained("rating_classification2")

In [None]:
# Load model
# model = TFAutoModelForSequenceClassification.from_pretrained("rating_classification2", num_labels=5, problem_type="single_label_classification")
# model.compile(optimizer=opt, loss=loss)

Some layers from the model checkpoint at rating_classification were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at rating_classification and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# The dataset is batched here, and evaluation does not need shuffling
test_score = model.evaluate(test_dataset.batch(batch_size))



In [17]:
sequences = [
  "This is the best recipe ever",
  "This is the worst recipe ever",
]

In [18]:
se = tokenizer(sequences, truncation=True, padding=True, return_tensors='tf')

In [19]:
output = model(se)

In [20]:
# Convert logits to class probabilities; column i corresponds to rating i + 1
predictions = tf.math.softmax(output.logits, axis=-1)
print(predictions)

tf.Tensor(
[[0.01833973 0.00470579 0.00417378 0.01747468 0.955306  ]
 [0.9761179  0.01374305 0.00239576 0.00120763 0.00653559]], shape=(2, 5), dtype=float32)
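
To turn these probabilities into a predicted rating, take the index of the largest value in each row and add 1, since index 0 corresponds to rating 1. A minimal sketch using the values printed above (the variable names are illustrative):

```python
# Probabilities copied from the output above, one row per input sequence
probabilities = [
    [0.01833973, 0.00470579, 0.00417378, 0.01747468, 0.955306],
    [0.9761179, 0.01374305, 0.00239576, 0.00120763, 0.00653559],
]

# argmax over each row, shifted by 1 to map index 0-4 to rating 1-5
predicted_ratings = [row.index(max(row)) + 1 for row in probabilities]
print(predicted_ratings)  # → [5, 1]
```

So "This is the best recipe ever" is classified as rating 5 and "This is the worst recipe ever" as rating 1.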
