Fine-tune a pretrained transformer model for custom sentiment analysis

In [38]:
!pip install transformers


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [39]:
# Data processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Modeling
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

In [40]:
amz_review=pd.read_csv("amazon_cells_labelled.csv",names=["review","label"])

In [41]:
amz_review

Unnamed: 0,review,label
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
...,...,...
995,The screen does get smudged easily because it ...,0
996,What a piece of junk.. I lose more calls on th...,0
997,Item Does Not Match Picture.,0
998,The only thing that disappoint me is the infra...,0


In [42]:
amz_review.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 23.4+ KB


In [43]:
amz_review["label"].value_counts()

Train and test split

In [44]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    amz_review['review'], amz_review['label'], test_size=0.20, random_state=42
)

We will tokenize the review text using a tokenizer.

A tokenizer converts text into numbers to use as the input of the NLP (Natural Language Processing) models. Each number represents a token, which can be a word, part of a word, punctuation, or special tokens.
* `AutoTokenizer.from_pretrained("bert-base-cased")` downloads vocabulary from the pretrained `bert-base-cased` model.
* `return_tensors="np"` indicates that the return format is a NumPy array. Besides `np`, `return_tensors` can take the value `tf` or `pt`, where `tf` returns a TensorFlow `tf.constant` object and `pt` returns a PyTorch `torch.tensor` object. If not set, it returns lists of Python integers.
* `padding` means adding zeros to shorter reviews so that all sequences in a batch have the same length. The `padding` argument controls how padding is applied.
  * `padding=True` is the same as `padding='longest'`. It finds the longest sequence in the batch and pads the others with zeros to that length. There is no padding if only one text document is provided.
  * `padding='max_length'` pads to `max_length` if it is specified; otherwise, it pads to the maximum acceptable input length for the model.
  * `padding=False` is the same as `padding='do_not_pad'`. It is the default and applies no padding, so it can output a batch with sequences of different lengths.
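
The three padding strategies can be illustrated without downloading the tokenizer. The sketch below is a simplified stand-in, not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the second sequence uses an arbitrary token id, and the pad id of 0 is assumed to match the zero padding described above.

```python
import numpy as np

def pad_batch(sequences, padding="longest", max_length=None, pad_id=0):
    """Pad lists of token ids, imitating the tokenizer's `padding` argument."""
    if padding in (False, "do_not_pad"):
        # No padding: sequences keep their own lengths, so the result stays a list of lists.
        return [list(seq) for seq in sequences]
    if padding in (True, "longest"):
        target = max(len(seq) for seq in sequences)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs an explicit max_length")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")
    # Sequences longer than `target` are not truncated in this sketch.
    return np.array([list(seq) + [pad_id] * (target - len(seq)) for seq in sequences])

# 101 and 102 frame each sequence, as in the tokenized output of this notebook.
batch = [[101, 17554, 112, 189, 2080, 2965, 119, 102], [101, 1109, 102]]

longest = pad_batch(batch, padding=True)
fixed = pad_batch(batch, padding="max_length", max_length=10)
ragged = pad_batch(batch, padding=False)

print(longest.shape, fixed.shape, [len(seq) for seq in ragged])
# (2, 8) (2, 10) [8, 3]
```

Only the padded variants can be stacked into a rectangular array, which is why `return_tensors="np"` is combined with `padding=True` in this notebook.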

The labels for the reviews are converted to one-dimensional numpy arrays.

In [45]:
# Tokenizer from a pretrained model; the name must match a model name on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize the reviews
tokenized_data_train = tokenizer(X_train.to_list(), return_tensors="np", padding=True)
tokenized_data_test = tokenizer(X_test.to_list(), return_tensors="np", padding=True)

# Labels are one-dimensional numpy or tensorflow array of integers
labels_train = np.array(y_train)
labels_test = np.array(y_test)

# Tokenized ids
print(tokenized_data_train["input_ids"][0])

[  101 17554   112   189  2080  2965   119   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]


In [28]:
# COMPILE AND FIT

We will build a customized transfer learning model for sentiment analysis.

* `TFAutoModelForSequenceClassification` loads the BERT model with a sequence classification head on top.
* The method `from_pretrained()` loads the weights from the pretrained model into the new model, so the weights in the new model are not randomly initialized. Note that the new weights for the new sequence classification head are going to be randomly initialized.
* `bert-base-cased` is the name of the pretrained model. We can change it to a different model based on the nature of the project.
* `num_labels` indicates the number of classes. Our dataset has two classes, positive and negative, so `num_labels=2`.
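
To make the role of the new classification head concrete, here is a minimal NumPy sketch of what such a head computes. It is an illustration under assumptions, not the library's code: the pooled output is random stand-in data, the variable names are hypothetical, and the real model also applies dropout before the dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 768   # hidden size of BERT base models
num_labels = 2      # negative (0) and positive (1)
batch_size = 4

# Stand-in for the pooled BERT output; in the real model it comes from the pretrained weights.
pooled_output = rng.normal(size=(batch_size, hidden_size))

# The classification head is a single dense layer whose weights start out random.
head_weights = rng.normal(scale=0.02, size=(hidden_size, num_labels))
head_bias = np.zeros(num_labels)

# One logit per class for every review in the batch.
logits = pooled_output @ head_weights + head_bias
print(logits.shape)  # (4, 2)
```

Because the head starts from random weights, its outputs carry no information about sentiment until the model is trained.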

In [46]:
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


After loading the pretrained model, we will compile the model.
* `SparseCategoricalCrossentropy` is used as the loss function. According to the Hugging Face documentation, Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if no loss is specified explicitly.
* `from_logits=True` informs the loss function that the output values are logits, taken before softmax is applied, so the values do not represent probabilities.
* We are using Adam as the optimizer, and `5e-6` is the learning rate. A smaller learning rate gives more stable weight updates but a slower training process.
* `accuracy` is used as the metric because we have a balanced dataset.
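
What the loss computes with `from_logits=True` can be written out in a few lines of NumPy. This is a sketch for intuition, not the Keras implementation; the function name and the example logits are made up for illustration.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed directly from logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax, with the row maximum subtracted for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability of the true class for each example.
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return -true_class_log_probs.mean()

# Equal logits mean the model is undecided; the loss is ln(2).
undecided = sparse_categorical_crossentropy_from_logits([[0.0, 0.0]], [0])
# A large logit for the correct class gives a loss close to zero.
confident = sparse_categorical_crossentropy_from_logits([[-4.0, 4.0]], [1])
print(undecided, confident)
```

The labels stay as plain integers (0 or 1), which is why the "sparse" variant of the loss fits the one-dimensional label arrays built earlier.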

In [47]:
# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])

In [48]:
# Fit the model
model.fit(dict(tokenized_data_train),
          labels_train,
          validation_data=(dict(tokenized_data_test), labels_test),
          batch_size=4,
          epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fa0802f5940>

All the weights will be updated by default for the transfer learning model.

* If we would like to keep the pretrained model weights as is and only update the weights and bias of the output layer, we can use `model.layers[0].trainable = False` to freeze the weights of the BERT model.
* If we would like to keep the weights of some layers and update others, we can use `model.bert.encoder.layer[i].trainable = False` to freeze the weights of the corresponding layers.
* In general, if the dataset for the transfer learning model is large, it is suggested to update all weights, and if the dataset is small, it is suggested to freeze the pretrained model weights. We can always compare model performance by unfreezing the pretrained layers one at a time.
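
Partial freezing can be wrapped in a small helper. The sketch below assumes the attribute path `model.bert.encoder.layer` mentioned above; `freeze_encoder_layers` is a hypothetical helper, and the stand-in objects exist only so the example runs without loading BERT.

```python
from types import SimpleNamespace

def freeze_encoder_layers(model, num_layers):
    """Freeze the first `num_layers` encoder layers and return how many were frozen."""
    layers = model.bert.encoder.layer
    for layer in layers[:num_layers]:
        layer.trainable = False
    return min(num_layers, len(layers))

# Stand-in with the same attribute path as the BERT model (12 encoder layers in BERT base).
fake_layers = [SimpleNamespace(trainable=True) for _ in range(12)]
fake_model = SimpleNamespace(
    bert=SimpleNamespace(encoder=SimpleNamespace(layer=fake_layers))
)

frozen = freeze_encoder_layers(fake_model, 8)
print(frozen, [layer.trainable for layer in fake_layers])
```

With the notebook's model, the call would be `freeze_encoder_layers(model, 8)`, followed by another `model.compile(...)`, because Keras only picks up changes to `trainable` when the model is compiled.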

Sentiment Analysis Transfer Learning Model Prediction and Evaluation
We will talk about model prediction and evaluation for the sentiment analysis transfer learning model.
Passing the tokenized text to the `.predict` method gives the predictions of the customized transfer learning sentiment model. The `logits` are the output of the last layer of the neural network, before softmax is applied.
The prediction has two columns. The first column is the predicted logit for label 0 and the second column is the predicted logit for label 1. Logit values do not sum to 1.

In [49]:
# Predictions
y_test_predict = model.predict(dict(tokenized_data_test))['logits']

# First 5 predictions
y_test_predict[:5]



To get the predicted probabilities, we need to apply softmax on the predicted logit values.
After applying softmax, we can see that the predicted probability for each review sums up to 1.

In [50]:
# Predicted probabilities
y_test_probabilities = tf.nn.softmax(y_test_predict)

# First 5 predicted probabilities
y_test_probabilities[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[0.04858171, 0.9514183 ],
       [0.1481424 , 0.8518576 ],
       [0.04351056, 0.95648944],
       [0.94606626, 0.05393372],
       [0.03848605, 0.9615139 ]], dtype=float32)>
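
Softmax itself is a short computation. The NumPy sketch below shows one common, numerically stable way to write it; the logit values are made up for illustration and are not outputs of the model.

```python
import numpy as np

def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum avoids overflow without changing the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=-1, keepdims=True)

example_logits = np.array([[-1.5, 1.5], [2.0, -1.0]])  # made-up values
probabilities = softmax(example_logits)
print(probabilities.sum(axis=1))
```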

To get the predicted labels, `argmax` is used to return the index of the maximum probability for each review, which corresponds to the labels of zeros and ones.

In [52]:
# Predicted label
y_test_class_preds = np.argmax(y_test_probabilities, axis=1)

# First 5 predicted labels
y_test_class_preds[:5]
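
Since softmax preserves the ordering of the values in each row, taking `argmax` of the logits gives the same labels as taking `argmax` of the probabilities. The small check below uses made-up logits to illustrate this.

```python
import numpy as np

example_logits = np.array([[-1.5, 1.5], [2.0, -1.0], [0.3, 0.1]])  # made-up values

# Softmax, written out so the block is self-contained.
exponentials = np.exp(example_logits - example_logits.max(axis=1, keepdims=True))
example_probabilities = exponentials / exponentials.sum(axis=1, keepdims=True)

labels_from_logits = np.argmax(example_logits, axis=1)
labels_from_probabilities = np.argmax(example_probabilities, axis=1)
print(labels_from_logits, labels_from_probabilities)
# [1 0 0] [1 0 0]
```

The softmax step is therefore only needed when the probabilities themselves are of interest, for example to apply a custom decision threshold.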

`accuracy_score` is used to evaluate the model performance. We can see that the customized sentiment analysis model with transfer learning gives us 91% accuracy, meaning that the predictions are correct 91% of the time.

In [53]:
# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test, y_test_class_preds)

0.91
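
Accuracy is simply the fraction of predictions that match the true labels. The sketch below reproduces that calculation with NumPy on made-up labels; `accuracy` is a hypothetical helper, not a replacement for `accuracy_score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

y_true_example = [1, 0, 1, 1, 0]  # made-up labels
y_pred_example = [1, 0, 0, 1, 0]  # one mistake out of five
print(accuracy(y_true_example, y_pred_example))  # 0.8
```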

Save Model


`tokenizer.save_pretrained` saves the tokenizer information to the drive and `model.save_pretrained` saves the model to the drive.

In [54]:
# Save tokenizer
tokenizer.save_pretrained('./sentiment_transfer_learning_tensorflow/')

# Save model
model.save_pretrained('./sentiment_transfer_learning_tensorflow/')
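
A saved model can later be reloaded by pointing `from_pretrained` at the same directory. The helper below is a sketch: `load_sentiment_model` is a hypothetical name, and it assumes the directory was written by the two `save_pretrained` calls above.

```python
from pathlib import Path

def load_sentiment_model(model_dir):
    """Reload the tokenizer and model written by `save_pretrained`."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"no saved model directory at {model_dir}")
    # Imported here so the directory check above runs even without transformers installed.
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
    model = TFAutoModelForSequenceClassification.from_pretrained(str(model_dir))
    return tokenizer, model
```

For this notebook, the call would be `tokenizer, model = load_sentiment_model('./sentiment_transfer_learning_tensorflow/')`.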

You can also zip the model folder to move it easily to another location.

In [None]:
!zip -r sentiment_transfer_learning_tensorflow.zip sentiment_transfer_learning_tensorflow/
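
On systems without the `zip` command, the same archive can be created from Python with the standard library. The sketch below demonstrates this on a temporary stand-in folder; `zip_directory` is a hypothetical helper and the demo file is a placeholder, not a real model file.

```python
import shutil
import tempfile
from pathlib import Path

def zip_directory(source_dir, archive_stem):
    """Zip `source_dir` into `<archive_stem>.zip` and return the archive path."""
    source_dir = Path(source_dir)
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=str(source_dir.parent),
        base_dir=source_dir.name,
    )

# Demonstration on a throwaway folder so the example does not need a trained model.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo_model"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    archive_path = zip_directory(demo_dir, Path(tmp) / "demo_model")
    archive_exists = Path(archive_path).exists()

print(archive_exists)
```

For the model saved above, the call would be `zip_directory('sentiment_transfer_learning_tensorflow', 'sentiment_transfer_learning_tensorflow')`.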

If you are using Google Colab, you can mount your Google Drive and copy the zip file (model) to any directory in it.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cp /content/sentiment_transfer_learning_tensorflow.zip /content/drive/MyDrive/sentiment_transfer_learning_tensorflow.zip