## Installing necessary libraries

If you are not using Google Colab, you will have to install transformers as well. For more information about this, please refer to the first blog (Part 1).

After installing the below libraries, restart your session before proceeding further. Else you will encounter errors. You do not have to re-run the below code after session restart.

In [1]:
!pip install --upgrade transformers
!pip install --upgrade tf_keras
!pip install datasets evaluate

Collecting transformers
  Downloading transformers-4.44.0-py3-none-any.whl.metadata (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.7/43.7 kB[0m [31m923.8 kB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.44.0-py3-none-any.whl (9.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.5/9.5 MB[0m [31m24.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.4
    Uninstalling transformers-4.42.4:
      Successfully uninstalled transformers-4.42.4
Successfully installed transformers-4.44.0
Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from dataset

### Setting up environment variable as a necessary dependency

In [1]:
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

## Sentiment Analysis on Reviews Dataset

We have a dataset called reviews.csv in which we have two columns - Review and Sentiment. Review column consists of sentences that are the reviews by the users and Sentiment column consists of categorical string values of - Positive, Negative, and Neutral.

The project being shown here could be used for sentiment analysis on any textual data need. However, the changing of parameters and values would depend on your project demand. For this reason, it's always recommended to go through the model card (that is documentation specific to model) to understand the possibilities. Let's get started with our project. I will show two different models with the same dataset.

### Bert Base Cased Model - For Sentiment Analysis

**Loading Data**

In [2]:
from datasets import load_dataset  #importing load_dataset from datasets library to load the file as a Hugging Face dataset

dataset = load_dataset("csv", data_files = "reviews.csv", split= "train")

#mentioning the split = 'train' as we have only one file and we want it to be our training data so that we can split it later as we like
#try loading the dataset by removing the split = 'train' to see the difference

In [3]:
dataset

Dataset({
    features: ['Review', 'Sentiment'],
    num_rows: 386
})

In [4]:
dataset[0]

{'Review': "This product exceeded my expectations! It's high-quality and performs exceptionally well.",
 'Sentiment': 'Positive'}

As we can see the Sentiment is in text but we have to change the labels into numbers. For this, HuggingFace dataset has a remarkable function, which is -

In [5]:
dataset = dataset.class_encode_column("Sentiment")
dataset[0]

{'Review': "This product exceeded my expectations! It's high-quality and performs exceptionally well.",
 'Sentiment': 2}

Thus, the positive word got labelled as integer 2. By using the features property, we can look into the words corresponding the labels. So, Negative is labelled as 0, Neutral is labelled as 1, and Positive is labelled as 2.

In [6]:
dataset.features

{'Review': Value(dtype='string', id=None),
 'Sentiment': ClassLabel(names=['Negative', 'Neutral', 'Positive'], id=None)}

Now, splitting the dataset into training and testing

In [7]:
dataset = dataset.train_test_split(test_size = 0.1, stratify_by_column = 'Sentiment')   #performing stratified sampling
dataset

DatasetDict({
    train: Dataset({
        features: ['Review', 'Sentiment'],
        num_rows: 347
    })
    test: Dataset({
        features: ['Review', 'Sentiment'],
        num_rows: 39
    })
})

**Performing Data Preprocessing**

We have our dataset ready. But now we have to perfrom data preprocessing on the input features.

In [8]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, device = 'cuda')
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = 'tf')


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
def tokenization_function(records):
  return tokenizer(records['Review'], truncation = True)


tokenized_dataset = dataset.map(tokenization_function, batched = True)
tokenized_dataset

Map:   0%|          | 0/347 [00:00<?, ? examples/s]

Map:   0%|          | 0/39 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Review', 'Sentiment', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 347
    })
    test: Dataset({
        features: ['Review', 'Sentiment', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 39
    })
})

In [10]:
tf_train_dataset = tokenized_dataset['train'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = ['Sentiment'],
    shuffle = True,
    collate_fn = data_collator,
    batch_size = 10,
)

tf_validation_dataset = tokenized_dataset['test'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols = ['Sentiment'],
    shuffle = False,
    collate_fn = data_collator,
    batch_size = 10,
)


Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


**Model Training**

In [11]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Setting up the Training parameters

In [12]:
batch_size = 10
num_epochs = 5
num_train_steps = len(tf_train_dataset) // batch_size * num_epochs

Setting up the optimizer

In [13]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
)

optimizer = Adam(learning_rate = lr_scheduler)

Compiling and Fitting the model with our dataset

In [14]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy


model.compile(optimizer=optimizer, loss = SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(tf_train_dataset, validation_data = tf_validation_dataset, epochs = num_epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tf_keras.src.callbacks.History at 0x78151048bfa0>

Performing Prediction

In [15]:
import numpy as np

preds = model.predict(tf_validation_dataset)["logits"]
class_preds = np.argmax(preds, axis=1)

class_preds



array([2, 0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 0, 0, 2, 2, 1, 1, 0, 1, 0, 2, 2,
       0, 0, 2, 2, 1, 0, 0, 1, 2, 1, 1, 2, 0, 2, 1, 1, 2])

In [16]:
true_labels = np.concatenate([y for x, y in tf_validation_dataset], axis=0)
accuracy = np.mean(class_preds == true_labels)


print(f"Validation Accuracy: {accuracy:.4f}")

Validation Accuracy: 1.0000


### Distilbert Base Uncased Mode : Sentiment Analysis

The process is same as above, just the model is a different one. To learn more about Distilbert, follow the respective model card.

In [17]:
import numpy as np
from datasets import load_dataset
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam
import evaluate



dataset = load_dataset("csv", data_files = "reviews.csv", split= "train")
dataset = dataset.class_encode_column("Sentiment")
dataset = dataset.train_test_split(test_size = 0.2, stratify_by_column = 'Sentiment')



checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, device = 'cuda')
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = 'tf')




def tokenization_function(records):
  return tokenizer(records['Review'], truncation = True)


tokenized_dataset = dataset.map(tokenization_function, batched = True)
tokenized_dataset

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



Map:   0%|          | 0/308 [00:00<?, ? examples/s]

Map:   0%|          | 0/78 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Review', 'Sentiment', 'input_ids', 'attention_mask'],
        num_rows: 308
    })
    test: Dataset({
        features: ['Review', 'Sentiment', 'input_ids', 'attention_mask'],
        num_rows: 78
    })
})

Unlike Bert, Distilbert model do not return token_type_ids upon tokenization.

In [18]:
tf_train_dataset = tokenized_dataset['train'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids'],
    label_cols = ['Sentiment'],
    shuffle = True,
    collate_fn = data_collator,
    batch_size = 10,
)

tf_validation_dataset = tokenized_dataset['test'].to_tf_dataset(
    columns = ['attention_mask', 'input_ids'],
    label_cols = ['Sentiment'],
    shuffle = False,
    collate_fn = data_collator,
    batch_size = 10,
)



model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)


batch_size = 10
num_epochs = 10
num_train_steps = len(tf_train_dataset) // batch_size * num_epochs



lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
)
optimizer = Adam(learning_rate=lr_scheduler)




model.compile(optimizer = optimizer, loss = SparseCategoricalCrossentropy(from_logits = True), metrics = ['accuracy'])
model.fit(tf_train_dataset, validation_data = tf_validation_dataset, epochs = num_epochs)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tf_keras.src.callbacks.History at 0x7814fee02860>

In [19]:
preds = model.predict(tf_validation_dataset)["logits"]

class_preds = np.argmax(preds, axis=1)

true_labels = np.concatenate([y for x, y in tf_validation_dataset], axis=0)


accuracy = np.mean(class_preds == true_labels)
print(f"Validation Accuracy: {accuracy:.4f}")

Validation Accuracy: 0.9744


The we can see the output of both the models are showing good results. Let's check with a random statement.

In [27]:
my_angry_review = "It is horrible. This hangs so much"

tokenized_angry = tokenizer(my_angry_review, return_tensors = 'tf')
prediction = np.argmax(model.predict(tokenized_angry)['logits'], axis=1)
prediction



array([0])

0 is negative. Thus the classification is right. Now, check the output for below comment.

In [35]:
my_angry_review = "They are selling this worst product at such a high price! Their response system is very slow"

tokenized_angry = tokenizer(my_angry_review, return_tensors = 'tf')
prediction = np.argmax(model.predict(tokenized_angry)['logits'], axis=1)
prediction



array([2])

2 is positive. But the statement is not positive. Seems wrong! Now see this -

In [36]:
my_angry_review = "They are selling this worst product at such a high price. Their response system is very slow"

tokenized_angry = tokenizer(my_angry_review, return_tensors = 'tf')
prediction = np.argmax(model.predict(tokenized_angry)['logits'], axis=1)
prediction



array([0])

0 is negative which is correct. Thus changing ! to . is marking a remarkable change. For avoiding such types of errors in prediction, we need to train the model with lots of data so that the model can better understand the representation and its corresponding sentiment.

In [42]:
my_neutral_review = "It's fine but could be better"

tokenized_neutral = tokenizer(my_neutral_review, return_tensors = 'tf')
prediction = np.argmax(model.predict(tokenized_neutral)['logits'], axis=1)
prediction



array([1])

1 is neutral! Great! We have succesfully performed sentiment analysis using our model.

Now you may want to save your trained model and the tokenizer. You can do that as below -

In [44]:
model.save_pretrained('/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review')
tokenizer.save_pretrained('/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review')

('/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review/tokenizer_config.json',
 '/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review/special_tokens_map.json',
 '/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review/vocab.txt',
 '/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review/added_tokens.json',
 '/content/drive/MyDrive/Google in 21 days/Transformers/distilbert_review/tokenizer.json')

## Deploying to HuggingFace Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model.push_to_hub("sentiment_analysis")
tokenizer.push_to_hub("sentiment_analysis")

Load the pushed model

In [None]:
saved_model_from_HF = TFAutoModelForSequenceClassification.from_pretrained("doitlazy/sentiment_analysis")