<a href="https://colab.research.google.com/github/mehreen89/DataSets/blob/main/mehreen_CSTU_W7_1_1_Fine_tune_BERT_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Fine-Tune BERT model for Sentiment Analysis**

# Google Colab exercise

https://pub.towardsai.net/google-colab-tutorial-for-beginners-834595494d44

Bidirectional Encoder Representations from Transformers (BERT) is an NLP model introduced by Google Research in 2018. Since its release, it has achieved state-of-the-art accuracy on several NLP tasks.

The Transformer architecture has an encoder stack and a decoder stack, which is why it is called an encoder-decoder architecture, whereas BERT uses only the encoder stack. There are two variants, BERT-base and BERT-large, which differ in size: the base model has 12 encoder layers, whereas the large model has 24.
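To make the size difference between the two variants concrete, here is a rough parameter count in plain Python. The layer counts come from the paragraph above; the hidden sizes (768 and 1024), feed-forward sizes (3072 and 4096), vocabulary size (30522) and maximum position (512) are assumptions taken from the published uncased BERT configurations, and the formula is a sketch of a standard encoder layer, not the library's own accounting.

```python
from dataclasses import dataclass


@dataclass
class BertSize:
    num_layers: int       # encoder layers: 12 (base) or 24 (large)
    hidden: int           # hidden size (assumed: 768 / 1024)
    ffn: int              # feed-forward size (assumed: 3072 / 4096)
    vocab: int = 30522    # assumed size of the uncased WordPiece vocabulary
    max_pos: int = 512    # maximum number of positions
    type_vocab: int = 2   # two segment types (first / second sequence)


def count_parameters(s: BertSize) -> int:
    layer_norm = 2 * s.hidden  # scale and bias
    embeddings = (s.vocab + s.max_pos + s.type_vocab) * s.hidden + layer_norm
    # Query, key, value and output projections, each with a bias, plus a layer norm
    attention = 4 * (s.hidden * s.hidden + s.hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden) with biases, plus a layer norm
    feed_forward = (s.hidden * s.ffn + s.ffn) + (s.ffn * s.hidden + s.hidden) + layer_norm
    pooler = s.hidden * s.hidden + s.hidden
    return embeddings + s.num_layers * (attention + feed_forward) + pooler


bert_base = BertSize(num_layers=12, hidden=768, ffn=3072)
bert_large = BertSize(num_layers=24, hidden=1024, ffn=4096)
for name, size in [("BERT-base", bert_base), ("BERT-large", bert_large)]:
    print(f"{name}: {count_parameters(size) / 1e6:.1f}M parameters")
```

Under these assumptions the base model comes out at roughly 110 million parameters and the large model at roughly 335 million, so doubling the depth and widening the layers roughly triples the size.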

BERT was pre-trained on a large text corpus, which helps the model capture language patterns and generalize well across several NLP tasks. It is bidirectional, meaning that during training BERT learns from both the left and the right context of a token.

### Different Fine-Tuning Techniques
1. Train the entire architecture – We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer. In this case, the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated based on the new dataset.
2. Train some layers while freezing others – Another way to use a pre-trained model is to train it partially. What we can do is keep the weights of initial layers of the model frozen while we retrain only the higher layers. We can try and test as to how many layers to be frozen and how many to be trained.
3. Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.

In this tutorial, we use the first approach: the code below does not freeze any layers, so fine-tuning updates all BERT weights together with the classification layer added on top.
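The three techniques differ only in which weights receive gradient updates. The sketch below makes that explicit with plain Python; the layer names, the `trainable_layers` helper and the choice of 8 frozen layers are hypothetical and only illustrate the idea. In Keras, the same effect is obtained by setting `trainable = False` on the layers to freeze before compiling the model.

```python
def trainable_layers(strategy, num_encoder_layers=12, num_frozen=8):
    """Return the names of the blocks whose weights would be updated."""
    encoder = [f"encoder.layer.{i}" for i in range(num_encoder_layers)]
    head = ["classifier"]
    if strategy == "train_all":      # technique 1: update everything
        return ["embeddings"] + encoder + head
    if strategy == "freeze_lower":   # technique 2: freeze the initial layers
        return encoder[num_frozen:] + head
    if strategy == "freeze_all":     # technique 3: train only the new head
        return head
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("train_all", "freeze_lower", "freeze_all"):
    names = trainable_layers(strategy)
    print(f"{strategy}: {len(names)} trainable blocks, first = {names[0]}")
```

Freezing more layers makes each training step cheaper, at the cost of leaving less of the model free to adapt to the new dataset.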

# Steps to follow :

1.   First enable the GPU in Google Colab, Edit -> Notebook Settings -> Hardware accelerator -> Set to GPU
2.   We will be using the IMDB dataset, a movie review dataset containing 50,000 labeled reviews (25,000 for training and 25,000 for testing) in two classes, positive and negative, plus 50,000 unlabeled reviews
3.   Import/load the dataset with the TensorFlow Datasets (tfds) API
4. Choose a pretrained model that was trained on a large dataset.
5. Delete the output layer of the pretrained model and the weights and biases feeding into the output layer.
6. Create a set of randomly initialized weights and biases for the new output layer with the sentiment analysis task. The sentiment analysis task is a classifier with two outputs, positive and negative.
7. Retrain the weights and biases.



In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf

"tfds" provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other machine learning frameworks.

The "tfds.load" function loads the dataset and, through its "split" argument, returns the train and test sets.

# Download the data

This step downloads the IMDB reviews and loads the train and test splits.

In [None]:
(ds_train, ds_test), ds_info = tfds.load('imdb_reviews',
          split = (tfds.Split.TRAIN, tfds.Split.TEST),
          as_supervised=True,
          with_info=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [None]:
## Mount Google Drive
#from google.colab import drive
#drive.mount('/content/drive')

## Change directory
#import os
#os.chdir("drive/My Drive/contents/nlp")

## Print out the current directory
#!pwd

In [None]:
## Read in data
#amz_review = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

## Take a look at the data
#amz_review.head()

In [None]:
for review, label in tfds.as_numpy(ds_train.take(5)):
  print(review.decode()[0:50], '\t', label)

This was an absolutely terrible movie. Don't be lu 	 0
I have been known to fall asleep during films, but 	 0
Mann photographs the Alberta Rocky Mountains in a  	 0
This is the kind of film for a snowy Sunday aftern 	 1
As others have mentioned, all the women that go nu 	 1


Install the Hugging Face "transformers" library, which provides the pretrained BERT model

In [None]:
!pip install -q transformers

## Import "BERT Tokenizer"

The tokenizer must match the pretrained model that we want to use, e.g. the cased or the uncased version.

A tokenizer converts text into numbers to use as the input of NLP (Natural Language Processing) models. Each number represents a token, which can be a word, part of a word, a punctuation mark, or a special token.
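BERT's tokenizer splits words into sub-word pieces with a greedy longest-match-first rule (WordPiece). The following is a minimal sketch of that rule on a hypothetical toy vocabulary; the `wordpiece_tokenize` helper and the ten-entry vocabulary are assumptions for illustration, and the real tokenizer also handles lowercasing, punctuation and a vocabulary of about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "this", "movie", "is", "un", "##watch", "##able"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

tokens = ["[CLS]"]
for word in "this movie is unwatchable".split():
    tokens += wordpiece_tokenize(word, token_to_id)
tokens.append("[SEP]")
input_ids = [token_to_id[t] for t in tokens]

print(tokens)
print(input_ids)
```

Here "unwatchable" is not in the vocabulary as a whole word, so it is split into "un", "##watch" and "##able", each of which maps to its own number.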

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


# Prepare the data according to the format needed for the BERT model

1.   Input IDs – Token indices: the numerical representations of the tokens that make up the input sequence. They are often the only required input to the model.
2.   Attention mask – Tells the model which positions to attend to, so that attention is not performed on padding tokens. The value is 1 for real tokens (NOT MASKED) and 0 for padding tokens (MASKED).
3. Token type IDs – Used when two different sequences are encoded in the same input, as in sentence-pair classification or question answering: 0 marks the tokens of the first sequence and 1 marks the tokens of the second. The special classifier token [CLS] starts the input, and the separator token [SEP] ends each sequence.



The "encode_plus" function of the tokenizer will tokenize the raw input, add the special tokens [CLS] and [SEP], and pad the sequence to a length equal to max length (which we can set).
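To show how the three inputs relate to each other, here is a simplified sketch that builds them by hand from token IDs. The `build_bert_inputs` helper and the example token IDs are hypothetical; the special-token IDs 101 ([CLS]), 102 ([SEP]) and 0 ([PAD]) are the ones used by the uncased BERT vocabulary. The truncation step is a simplification and does not reproduce the tokenizer's own truncation strategies.

```python
def build_bert_inputs(ids_a, ids_b=None, max_length=8, cls_id=101, sep_id=102, pad_id=0):
    """Assemble input_ids, token_type_ids and attention_mask for one or two sequences."""
    input_ids = [cls_id] + ids_a + [sep_id]
    token_type_ids = [0] * len(input_ids)            # first sequence -> segment 0
    if ids_b is not None:
        input_ids += ids_b + [sep_id]
        token_type_ids += [1] * (len(ids_b) + 1)     # second sequence -> segment 1
    if len(input_ids) > max_length:                  # simplified truncation, keeps a final [SEP]
        input_ids = input_ids[:max_length - 1] + [sep_id]
        token_type_ids = token_type_ids[:max_length]
    attention_mask = [1] * len(input_ids)            # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }


single = build_bert_inputs([11, 12, 13])        # arbitrary placeholder token IDs
pair = build_bert_inputs([11, 12], [21, 22])
for name, features in [("single", single), ("pair", pair)]:
    print(name)
    for key, values in features.items():
        print(f"  {key}: {values}")
```

For a single review, as in this tutorial, the token type IDs are all zeros; they only carry information when a pair of sequences is encoded.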

In [None]:
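def convert_example_to_feature(review):
  # max_length is a global variable that is set in the next cell, before this function is called
  return tokenizer.encode_plus(review,
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = max_length, # max length of the text that can go to BERT
                padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
                truncation = True, # cut reviews that are longer than max_length
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )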

In [None]:
# can be up to 512 for BERT
max_length = 512
batch_size = 6

The following helper functions ("map_example_to_dict" and "encode_examples") transform our raw data into the format that the BERT model expects as input

In [None]:
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

In [None]:
def encode_examples(ds, limit=-1):
  # prepare list, so that we can build up final TensorFlow dataset from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []
  if (limit > 0):
      ds = ds.take(limit)
  for review, label in tfds.as_numpy(ds):
    bert_input = convert_example_to_feature(review.decode()) # converts input examples to input features
    # add the outputs to the lists
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label])
  # converts lists to tensors
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

In [None]:
#import tensorflow as tf

Form the train and test datasets

In [None]:
# train dataset
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(ds_test).batch(batch_size)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


# BERT Model initialization for Sentiment Analysis

1. "TFBertForSequenceClassification" adds a classification layer ("classifier") on top of the base model
2. The method "from_pretrained()" loads the weights of the pretrained model into the new model, so the base weights are not randomly initialized. Only the weights of the new sequence classification head are randomly initialized
3. "bert-base-uncased" is the name of the pretrained model
4. "SparseCategoricalCrossentropy" is used as the loss function
5. "from_logits=True" informs the loss function that the model outputs are logits, i.e. values before softmax is applied
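What the loss in points 4 and 5 computes for a single example can be sketched in plain Python: softmax turns the logits into probabilities, and the loss is the negative log-probability of the correct class, given as an integer label. The helper functions and the example logits are hypothetical; Keras additionally averages the loss over the batch.

```python
import math


def softmax(logits):
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, logits):
    """Loss for one example: -log(softmax(logits)[label]), with an integer label."""
    return -math.log(softmax(logits)[label])


logits = [-1.2, 2.3]   # hypothetical model output for [negative, positive]
print("probabilities:", softmax(logits))
print("loss if the true label is 1 (positive):", sparse_categorical_crossentropy(1, logits))
print("loss if the true label is 0 (negative):", sparse_categorical_crossentropy(0, logits))
```

The loss is small when the model assigns a high probability to the correct class and grows as that probability shrinks. "Sparse" refers to the integer labels (0 or 1) that replace one-hot vectors.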

In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# we will do just 1 epoch, though more epochs might be better as long as the model does not overfit
number_of_epochs = 1
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# the labels are integers rather than one-hot vectors, so we use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training the BERT model for Sentiment Analysis

Now we can start the fine-tuning process. We use the Keras API "model.fit" and pass it the training and validation datasets; the optimizer, loss, and metric were already set when we compiled the model.



In [None]:
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_test_encoded)



Training takes around two hours on a GPU. With just 1 epoch we can achieve over 93% accuracy on the validation set. You can increase the number of epochs and tune other parameters to try to improve the accuracy further.
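To get a feel for where the training time goes, it helps to count the batches processed in one epoch. This is a back-of-the-envelope sketch that reuses the split sizes reported by the dataset download and the batch size set earlier; the variable names are only illustrative.

```python
import math

train_examples = 25000   # size of the IMDB train split
test_examples = 25000    # size of the IMDB test split, used here for validation
batch_size = 6           # value set earlier in this notebook

train_steps = math.ceil(train_examples / batch_size)
validation_steps = math.ceil(test_examples / batch_size)

print(f"training steps per epoch:   {train_steps}")
print(f"validation steps per epoch: {validation_steps}")
```

Each training step runs a forward and a backward pass over 6 sequences of 512 tokens, which is why a single epoch already takes a long time.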

# Test on random sample

"tokenizer.encode" encodes our test example into integers using the BERT tokenizer, and we then call the "predict" method on the encoded input. "model.predict" returns logits, to which we apply the softmax function to get the probability of each class. Using the TensorFlow "argmax" function, we then pick the class with the highest probability and map it to a text label (positive or negative).

In [None]:
test_sentence = "This is a really good movie. I loved it and will watch again"

predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

tf_output = model.predict(predict_input)[0]
tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive'] #(0:negative, 1:positive)
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])

Positive


BERT models achieve state-of-the-art accuracy on several tasks compared to earlier RNN-based architectures. However, they require substantial computational power and take a long time to train.

# BERT Fine-Tuning example :
https://medium.com/grabngoinfo/customized-sentiment-analysis-transfer-learning-using-tensorflow-with-hugging-face-1b439eedf167

# **DistilBERT**

The BERT model has some drawbacks: it is bulky and hence somewhat slow. To address these issues, researchers from Hugging Face proposed DistilBERT, which uses knowledge distillation for model compression.



1.   DistilBERT is a distilled form of the BERT model. Knowledge distillation during the pre-training phase reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and making it 60% faster.

2. A triple loss is introduced by combining language modeling, distillation, and cosine-distance losses to take advantage of the inductive biases learned by larger models during pre-training.

3. DistilBERT is a compact, faster, and lighter model that is cheaper to pre-train and can easily be used for on-device applications.
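Two of the three loss terms mentioned above can be sketched in plain Python: the distillation loss, which compares the student's softened output distribution with the teacher's, and the cosine-distance loss between hidden-state vectors. The function names, the temperature of 2.0 and all example values are assumptions for illustration; the language modeling term and the weighting of the three terms are left out.

```python
import math


def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]   # a higher temperature gives a softer distribution
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def cosine_distance_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


teacher_logits = [4.0, 1.0, 0.5]   # hypothetical values
close_student = [3.5, 1.2, 0.4]    # student that agrees with the teacher
far_student = [0.5, 4.0, 1.0]      # student that disagrees with the teacher

print("distillation loss, close student:", distillation_loss(close_student, teacher_logits))
print("distillation loss, far student:  ", distillation_loss(far_student, teacher_logits))
print("cosine loss, aligned vectors:    ", cosine_distance_loss([1.0, 2.0], [2.0, 4.0]))
```

The student is rewarded for matching the teacher's full output distribution and not just its top prediction, which is how the knowledge of the larger model is passed on.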

# Why DistilBERT?

1. The environmental cost of scaling up large models is a concern, given their computational requirements.

2. Training and running inference with these large, bulky models on edge devices or in resource-constrained settings (limited compute, limited budget, etc.) is challenging, which could hamper wide adoption.

3. To develop more privacy-respecting systems, machine learning systems must operate on the edge rather than accessing a cloud API and transferring potentially private data to servers. Running models on smartphones also requires lightweight, energy-efficient, and responsive models!

To address these challenges, model compression techniques can be leveraged. There are many such techniques, like quantization (approximating the weights of a network with a smaller precision), knowledge distillation (teacher-student learning), and weight pruning (removing some connections in the network). However, the key focus of DistilBERT is on knowledge distillation.
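The two other compression techniques named above can be illustrated on a handful of numbers. This is a minimal sketch in plain Python, assuming symmetric 8-bit quantization and magnitude-based pruning; the helper names and the example weights are hypothetical, and real toolkits apply these ideas per tensor or per channel with more care.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_pruned = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:num_pruned]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


weights = [0.82, -0.05, 0.31, -1.27, 0.002, 0.64]   # hypothetical layer weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print("quantized:", quantized)
print("largest rounding error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("pruned:", prune_by_magnitude(weights, 0.5))
```

Quantization keeps every weight but stores it less precisely, while pruning keeps full precision but drops the weights that contribute least. Both shrink the model without a teacher model.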

In [None]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")


In [None]:
inputs = tokenizer("This is a really good movie. I loved it and will watch again", return_tensors="pt")

In [None]:
with torch.inference_mode():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'