**Installing Transformers**

In [None]:
%%capture
!pip install transformers

**Importing libraries and dependencies**

In [None]:
import tensorflow as tf
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification
import pandas as pd
import numpy as np

Since training our model requires a GPU, we first make sure that one is available.

In [None]:
num_gpus_available = len(tf.config.list_physical_devices('GPU'))
print("Num GPUs Available: ", num_gpus_available)

Num GPUs Available:  1


In [None]:
assert num_gpus_available > 0

**Dataset Loading and pre-processing**

The dataset we're using is "Amazon Fine Food Reviews", available on Kaggle.

In [None]:
# Skip malformed rows; on_bad_lines replaces the deprecated error_bad_lines argument (pandas >= 1.3)
df = pd.read_csv('Reviews.csv', on_bad_lines='skip', engine='python')


Since our dataset is large (about 500,000 rows) and BERT takes a long time to train on the full data frame, we will keep only the first 10,000 rows.

In [None]:
df = df.head(10000)

In [None]:
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [None]:
df['Score'].head()

0    5
1    1
2    4
3    2
4    5
Name: Score, dtype: int64

In [None]:
df['Text'].head()

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price.  There was a wid...
Name: Text, dtype: object

Our dataset consists of multiple columns, from "Id" to "Text". As we are only interested in the review text and the corresponding score, we will drop the other columns.

To avoid modifying the original data frame, we will work on a copy.

In [None]:
df_copy=df.copy()

In [None]:
df_copy.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time', 'Summary'],
             inplace=True, axis=1)
df_copy.head()

| | Score | Text |
|---|---|---|
| 0 | 5 | I have bought several of the Vitality canned d... |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | This is a confection that has been around a fe... |
| 3 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy at a great price. There was a wid... |


The rating provided by the customer is on a scale of 1-5. As we are going to implement a binary classification model, we need to convert these ratings into two categories, 1 and 0. Ratings of 3 or higher will be labeled as positive (1), and ratings below 3 as negative (0).

In [None]:
df_copy["Sentiment"] = df_copy["Score"].apply(lambda score: "positive" if score >= 3 else "negative")
df_copy['Sentiment'] = df_copy['Sentiment'].map({'positive':1, 'negative':0})
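As a quick sanity check on the labeling rule above, the following sketch applies the same threshold to a few made-up scores and counts the resulting classes. The `scores` list and the `to_sentiment` helper are hypothetical and not part of the notebook; on the real data you would apply the same check to `df_copy['Sentiment']`.

```python
from collections import Counter

def to_sentiment(score: int) -> int:
    """Map a 1-5 star rating to 1 (positive, score >= 3) or 0 (negative)."""
    return 1 if score >= 3 else 0

# Hypothetical ratings, used only to illustrate the rule.
scores = [5, 1, 4, 2, 5, 3]
sentiments = [to_sentiment(s) for s in scores]

# Count how many examples fall into each class.
class_counts = Counter(sentiments)

print(sentiments)    # [1, 0, 1, 0, 1, 1]
print(class_counts)
```

If one class turns out to dominate on the real data, accuracy alone can be a misleading metric, so it may be worth looking at the class counts before training.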

In [None]:
df_copy.head()

| | Score | Text | Sentiment |
|---|---|---|---|
| 0 | 5 | I have bought several of the Vitality canned d... | 1 |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | 4 | This is a confection that has been around a fe... | 1 |
| 3 | 2 | If you are looking for the secret ingredient i... | 0 |
| 4 | 5 | Great taffy at a great price. There was a wid... | 1 |


In [None]:
data=df_copy[["Sentiment","Text"]]

In [None]:
data.head()

| | Sentiment | Text |
|---|---|---|
| 0 | 1 | I have bought several of the Vitality canned d... |
| 1 | 0 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 1 | This is a confection that has been around a fe... |
| 3 | 0 | If you are looking for the secret ingredient i... |
| 4 | 1 | Great taffy at a great price. There was a wid... |


**Text tokenization and conversion into token IDs**

We start by converting the feature column and the label column into Python lists, since the tokenizer expects a list of strings as input.

In [None]:
reviews = data['Text'].values.tolist()
labels = data['Sentiment'].tolist()

Then we split the data into training and validation sets, holding out 20% for validation.

In [None]:
from sklearn.model_selection import train_test_split
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(reviews, labels, test_size=.2)
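`train_test_split` shuffles the data before splitting, so without a fixed seed the split differs on every run. The sketch below illustrates the idea of a seeded 80/20 split in plain Python. It is not scikit-learn's implementation, and `seeded_split` and the toy data are hypothetical. In the notebook itself, passing `random_state=` to `train_test_split` gives the same kind of reproducibility.

```python
import random

def seeded_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Toy data, used only to illustrate the split.
toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

train_x, val_x, train_y, val_y = seeded_split(toy_reviews, toy_labels)
print(len(train_x), len(val_x))  # 8 2
```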

Now we load the pretrained DistilBERT tokenizer and try it on the first training review.

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
tokenizer([training_sentences[0]], truncation=True,
          padding=True, max_length=128)


{'input_ids': [[101, 1045, 1005, 1049, 5580, 1045, 2234, 2408, 1996, 2630, 24582, 25128, 1006, 2006, 9733, 2060, 2063, 7597, 2073, 3438, 2030, 2062, 1007, 2077, 1045, 2253, 2006, 2000, 4965, 1037, 23025, 1010, 3653, 16613, 1998, 3120, 2005, 11588, 2373, 1012, 1045, 6719, 2288, 1045, 2035, 1999, 2028, 2012, 1037, 2310, 12171, 2854, 2100, 2100, 2659, 3976, 1012, 2204, 17113, 1012, 1045, 28667, 8462, 4859, 2009, 2005, 3087, 2007, 1037, 2188, 2996, 1998, 13366, 2546, 2175, 4965, 1037, 28712, 2099, 5830, 1012, 13354, 4726, 1037, 23025, 14866, 2890, 6593, 2135, 2046, 2009, 2180, 2102, 2147, 2061, 2092, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
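The output above has an attention mask of all ones because a single sentence needs no padding. To show what `padding=True` and `truncation=True` do on a batch, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and the real tokenizer keeps the closing `[SEP]` token when truncating, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each ID sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 2204, 102], [101, 2307, 5510, 4157, 102]]
encoded = pad_and_truncate(batch, max_length=4)

print(encoded["input_ids"])       # [[101, 2204, 102, 0], [101, 2307, 5510, 4157]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```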

In [None]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

We tokenize the training and validation sentences and combine the resulting encodings with their labels into TensorFlow datasets.

In [None]:
train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                            truncation=True,
                            padding=True)
train_dataset = tf.data.Dataset.from_tensor_slices((
                            dict(train_encodings),
                            training_labels
                            ))
val_dataset = tf.data.Dataset.from_tensor_slices((
                            dict(val_encodings),
                            validation_labels
                            ))

In [None]:
# Printing every encoding exceeds the notebook's output rate limit,
# so we inspect only the available keys and the first few token IDs.
print(train_encodings.keys())
print(train_encodings['input_ids'][0][:10])



**Model Training and optimization**

We are going to use TFDistilBertForSequenceClassification for the sentiment analysis and set the `num_labels` parameter to 2, as we are doing binary classification.

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',num_labels=2)



Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_layer_norm', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

We can now train our model with the following configuration:
- Epochs: 5
- Batch size: 16
- Learning rate (Adam): 5e-5 (0.00005)
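To get a feel for how much work this configuration implies, the arithmetic below estimates the number of optimizer steps. It assumes all 10,000 rows are kept and that the 80/20 split divides evenly; the variable names are illustrative only.

```python
import math

num_rows = 10_000   # rows kept with df.head(10000)
test_size = 0.2     # validation fraction passed to train_test_split
batch_size = 16
epochs = 5          # value passed to model.fit

num_val = int(round(num_rows * test_size))
num_train = num_rows - num_val

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train / batch_size)
total_updates = steps_per_epoch * epochs

print(num_train, num_val, steps_per_epoch, total_updates)  # 8000 2000 500 2500
```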

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-08)
model.compile(optimizer=optimizer, loss=model.hf_compute_loss, metrics=['accuracy'])


In [None]:
model.fit(train_dataset.shuffle(100).batch(16),
          epochs=5,
          validation_data=val_dataset.shuffle(100).batch(16))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f76b95b0250>

In [None]:
model.save_pretrained("./sentiment")

**Evaluation**

We load the model that we trained and saved.

In [None]:
# The model was saved from TensorFlow, so no PyTorch conversion (from_pt) is needed.
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("./sentiment")

We choose the review for which we will predict the sentiment

In [None]:
test_sentence = "Although the package was unappealing, I really like the taste"

We compute the predictions

In [None]:
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")
tf_output = loaded_model.predict(predict_input)[0]
tf_prediction = tf.nn.softmax(tf_output, axis=1)
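The model returns raw scores (logits), one per class, and softmax turns them into probabilities that sum to 1. The sketch below shows the computation in plain Python with made-up logits. The `softmax` helper and the values are illustrative only and are not actual model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [negative, positive].
logits = [-1.2, 2.3]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same class; the probabilities are useful mainly when a confidence value is wanted.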



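Now we print the predicted sentiment of the sentence we chose.

In [None]:
# Use a separate name so the training `labels` list is not overwritten.
label_names = ['Negative', 'Positive']  # index 0 = negative, index 1 = positive
predicted = tf.argmax(tf_prediction, axis=1).numpy()
print(label_names[predicted[0]])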

Positive
