# Rating Classification

#### The purpose of this notebook is to preprocess the review and rating columns of a custom dataset, 'RAW_interactions.csv', so they can be used with the DistilBERT model from Hugging Face, and then to fine-tune that model on the dataset to perform multi-class classification (predicting a rating from 0 to 5).

### Setup

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
# Change the working directory to the project folder on Google Drive
import os
os.chdir('/content/gdrive/My Drive/Food Recipe App/')
print("Current working directory:")
os.getcwd()
!ls

Current working directory:
 checkpoint		      RAW_recipes.csv
 checkpoints		      target_tensor_cpu.pickle
'Copy of transformer.ipynb'   target_tensor.pickle
 document_similarity.ipynb    targ_tokenizer_cpu.pickle
 inp_tokenizer_cpu.pickle     targ_tokenizer.pickle
 inp_tokenizer.pickle	      transformer_cpu.data-00000-of-00001
 input_tensor_cpu.pickle      transformer_cpu.index
 input_tensor.pickle	      transformer.data-00000-of-00001
 rating_classification	      transformer.index
 rating_classification2       transformer.ipynb
 RAW_interactions.csv	      wmd.model


In [4]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformer

In [5]:
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased'

### Preprocess Custom Dataset

In [6]:
data = pd.read_csv("RAW_interactions.csv", usecols=['review', 'rating'], nrows=20000)
print(data)
reviews = data['review']
ratings = data['rating']

       rating                                             review
0           4  Great with a salad. Cooked on top of stove for...
1           5  So simple, so delicious! Great for chilly fall...
2           4  This worked very well and is EASY.  I used not...
3           5  I made the Mexican topping and took it to bunk...
4           5  Made the cheddar bacon topping, adding a sprin...
...       ...                                                ...
19995       5  Awesome. My wife said it was the best steak I ...
19996       5  This is a must have in you cupboard.  I used t...
19997       5  I received this as a gift and it is just fabul...
19998       5  This is an extremely versatile and cheap spice...
19999       5  I made this up before Christmas, put it in cel...

[20000 rows x 2 columns]


In [7]:
def create_dataset(reviews, ratings, num_classes=6):
  # Cast every review to str so that missing values (NaN) still tokenize
  texts = [str(review) for review in reviews]

  # One-hot encode the ratings: rating r becomes a vector with a 1 at index r
  labels = []
  for rating in ratings:
    if rating not in range(num_classes):
      # Fail loudly rather than skip the row, which would misalign texts and labels
      raise ValueError(f"Unexpected rating: {rating}")
    one_hot = [0] * num_classes
    one_hot[int(rating)] = 1
    labels.append(one_hot)

  return texts, labels

In [8]:
texts, labels = create_dataset(reviews, ratings)
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.33, random_state=42)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)
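
As a sanity check on the two splits above, the resulting set sizes can be worked out by hand. This is a minimal sketch, assuming scikit-learn rounds the test share up and gives the remainder to the training side; `split_sizes` is a hypothetical helper written for illustration and is not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 20000                                   # rows read from RAW_interactions.csv
n_train_val, n_test = split_sizes(n_total, 0.33)  # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.2)    # second split: carve out validation

print(n_train, n_val, n_test)
```

Under these assumptions the training set holds 10720 reviews, which agrees with the `len(train_texts)` value printed in the next cell. The overall proportions are roughly 54% train, 13% validation and 33% test.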

In [9]:
print(len(train_texts))

10720


In [8]:
# Tokenize the texts
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='tf')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='tf')
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors='tf')


In [9]:
# Turn encodings and labels into a Dataset object
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

### Fine-tune the model

In [10]:
# Each review has exactly one rating, so this is single-label (multi-class) classification
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6, problem_type="single_label_classification")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'dropout_19', 'pre_classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [11]:
batch_size = 8
num_epochs = 10

# Learning rate schedule:
# - Decay (anneal) the learning rate over the course of training with a scheduler.
# - PolynomialDecay is a good choice: despite the name, with default settings it simply
#   decays the learning rate linearly from the initial value to the final value.
# - To use a scheduler correctly, it must know how long training lasts: num_train_steps.
# - num_train_steps is the number of batches per epoch (training samples divided by the
#   batch size) multiplied by the total number of epochs.
num_train_steps = (len(train_texts) // batch_size) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
opt = Adam(learning_rate=lr_scheduler)

# Loss:
# - The labels are one-hot vectors, so the training loss is categorical cross-entropy.
# - By default Keras assumes that a softmax has already been applied to the outputs.
# - This model returns the values right before the softmax, known as logits,
#   so from_logits=True is required.
loss = CategoricalCrossentropy(from_logits=True)
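
To make the schedule concrete, here is a minimal plain-Python sketch of the polynomial decay formula as documented for Keras, evaluated with this notebook's settings. The function name `linear_decay` is hypothetical, and the sample count is taken from the `len(train_texts)` output earlier in the notebook.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=13400, power=1.0):
    """Polynomial decay formula; with power=1.0 it is a straight line."""
    step = min(step, decay_steps)              # the rate stays at end_lr afterwards
    fraction_left = 1.0 - step / decay_steps   # share of training still to go
    return (initial_lr - end_lr) * fraction_left ** power + end_lr

batch_size = 8
num_epochs = 10
num_train_samples = 10720                      # len(train_texts) in this notebook
num_train_steps = (num_train_samples // batch_size) * num_epochs

# Learning rate at the start, halfway point, and end of training
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

With these numbers there are 1340 batches per epoch and 13400 optimizer steps in total, so the rate falls from 5e-5 to half that value at step 6700 and reaches 0 on the last step.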

In [12]:
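model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
# The datasets are already batched, so batch_size is not passed to fit();
# the validation set does not need to be shuffled.
model.fit(
    train_dataset.shuffle(1000).batch(batch_size),
    validation_data=val_dataset.batch(batch_size),
    epochs=num_epochs
)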

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff6712f6b10>

In [13]:
# Save model
model.save_pretrained("rating_classification2")

In [None]:
# Load model
# model = TFAutoModelForSequenceClassification.from_pretrained("rating_classification2", num_labels=6, problem_type="multi_label_classification")
# model.compile(optimizer=opt, loss=loss)

Some layers from the model checkpoint at rating_classification were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at rating_classification and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Evaluation does not depend on order, so the test set is batched without shuffling
test_score = model.evaluate(test_dataset.batch(batch_size))



In [15]:
print(test_score)  # [test loss, test accuracy]

[1.435502290725708, 0.7486363649368286]


In [24]:
sequences = [
  "This is the best recipe ever",
  "This is the worst recipe ever",
]

In [25]:
se = tokenizer(sequences, truncation=True, padding=True, return_tensors='tf')

In [26]:
output = model(se)

In [27]:
predictions = tf.math.softmax(output.logits, axis=-1)
print(predictions)

tf.Tensor(
[[3.0601624e-04 6.5974741e-06 6.2667809e-06 7.0317137e-06 1.4159003e-04
  9.9953258e-01]
 [5.8863158e-03 9.8528045e-01 7.6745390e-03 9.3339731e-05 6.0661603e-04
  4.5873574e-04]], shape=(2, 6), dtype=float32)
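
The probabilities above can be turned into predicted ratings by taking the index of the largest value in each row, since index i of the one-hot labels stands for rating i. The sketch below does this in plain Python and also shows what the softmax step does to raw logits. The helpers `softmax` and `predicted_rating` are hypothetical names, and the logits in the last line are made-up numbers used only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_rating(probabilities):
    """Index of the largest probability; index i corresponds to rating i."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# Probabilities copied from the output above
probs_best = [3.0601624e-04, 6.5974741e-06, 6.2667809e-06,
              7.0317137e-06, 1.4159003e-04, 9.9953258e-01]
probs_worst = [5.8863158e-03, 9.8528045e-01, 7.6745390e-03,
               9.3339731e-05, 6.0661603e-04, 4.5873574e-04]

print(predicted_rating(probs_best))    # 5
print(predicted_rating(probs_worst))   # 1

# Made-up logits, only to show that softmax output sums to 1
print(softmax([2.0, 1.0, 0.1]))
```

So the model assigns "This is the best recipe ever" a rating of 5 and "This is the worst recipe ever" a rating of 1. On the tensor itself, `tf.argmax(predictions, axis=-1)` should give the same indices.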
