<a href="https://colab.research.google.com/github/niko-vaas/tutorial-notebooks/blob/main/BERTFirstSentimentAnalysisTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using BERT for Sentiment Analysis

In this notebook, we use BERT, a general-purpose language model developed by Google, to classify the sentiment of text.

Sentiment analysis entails using AI to detect emotion in data, usually text. In this case, we will classify two simple sentences as expressing negative or positive emotion.

First, install the prerequisites using `pip`.

If you've already done previous tutorials, you should get a series of statements that say `Requirement already satisfied: `.

In [1]:
!pip install transformers
!pip install tensorflow



Now, import the installed libraries into the notebook.

In [11]:
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

Next, let's load the pretrained BERT tokenizer and model.

I use `bert-large-uncased` because I am mainly focusing on keywords in English, where capitalization rarely matters; the uncased models lowercase the text before tokenizing it. However, if you want to test it on a phrase where upper vs. lower case is important, use `bert-large-cased`.

If you want it to take up less memory and storage on your computer, Kaggle, or Colab, use `bert-base-cased` or `bert-base-uncased`, which are smaller versions of the same model.

In [27]:
model_name = 'bert-large-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's define some sample sentences for BERT to tokenize.

**Tokenization** is when we split a sentence into smaller pieces called "tokens" (whole words or fragments of words), then map each token to a number: its ID in the tokenizer's vocabulary.

In [45]:
sentences = [
    "I had a great day today.",
    "My friend is terribly sick."
]

labels = [1, 0]  # 1 for positive, 0 for negative

# Tokenize the text data
encoded_data = tokenizer(sentences, padding='max_length', truncation=True)

# Get the integer ID of each token from the tokenizer's vocabulary
input_ids = encoded_data['input_ids']
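
To make the idea of tokenization concrete, here is a minimal sketch of greedy longest-match subword splitting, the general approach that WordPiece-style tokenizers such as BERT's take. The vocabulary `TOY_VOCAB` and the helpers `toy_wordpiece` and `toy_encode` are hypothetical and made up for illustration. The real tokenizer loads its vocabulary from `vocab.txt` and also handles punctuation and special tokens, which this sketch does not.

```python
# Hypothetical toy vocabulary mapping token strings to integer IDs.
# "##" marks a piece that continues a word, as in BERT's vocabulary.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "had": 2, "a": 3, "great": 4, "day": 5,
             "to": 6, "##day": 7, "terrib": 8, "##ly": 9}

def toy_wordpiece(word, vocab):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word becomes the unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(sentence, vocab):
    """Lowercase, split on whitespace, then map each piece to its ID."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(toy_wordpiece(word, vocab))
    return tokens, [vocab[t] for t in tokens]

tokens, ids = toy_encode("I had a great day today", TOY_VOCAB)
print(tokens)
print(ids)
```

In this toy vocabulary, "today" is not a single entry, so it is split into `to` and `##day`. This is why the number of tokens can be larger than the number of words.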

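Next, let's pad the sequences with zeros to ensure that they are all the same length. Because the tokenizer call above already used `padding='max_length'`, the sequences are already padded to the model's maximum input length, so this step acts only as a safeguard.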

In [46]:
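# Find the longest sequence so every sequence can be padded to that length
max_len = max(len(seq) for seq in input_ids)
# Pad on the right ("post"), since BERT uses absolute position embeddings
padded_sequences = pad_sequences(input_ids, maxlen=max_len, dtype="long", padding="post")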
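
As a rough illustration of what padding does, this sketch right-pads lists of token IDs with a pad ID of 0 until they share one length. The helper `pad_right` and the short ID lists are hypothetical; in the notebook, the tokenizer and `pad_sequences` do this work.

```python
def pad_right(sequences, pad_id=0):
    """Pad every sequence on the right with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token IDs for two sentences of different lengths.
example_ids = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = pad_right(example_ids)
print(padded)  # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
```

Note that `pad_sequences` pads on the left unless `padding="post"` is passed, which matters whenever the input sequences differ in length.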

Now, we create **attention masks**. An attention mask tells BERT which positions hold real tokens (marked with 1) and which hold padding (marked with 0), so that the padding is ignored.
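
Here is a minimal sketch of how an attention mask can be derived from padded token IDs, assuming the pad ID is 0. The helper `build_attention_mask` and the sample IDs are hypothetical. The Hugging Face tokenizer also returns a ready-made mask under the `attention_mask` key of its output, which can be used instead of building one by hand.

```python
def build_attention_mask(padded_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[0 if token_id == pad_id else 1 for token_id in seq]
            for seq in padded_ids]

# Hypothetical padded token IDs; the second sequence ends with two pad tokens.
padded_ids = [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
mask = build_attention_mask(padded_ids)
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```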

After our attention masks are created, we can apply sentiment analysis using BERT.

In [47]:
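# Mark real tokens with 1 and padding tokens with 0 so BERT ignores the padding
attention_masks = [[float(token_id != tokenizer.pad_token_id) for token_id in seq] for seq in padded_sequences]
attention_masks = tf.convert_to_tensor(attention_masks)  # Convert to tensor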

outputs = model(tf.convert_to_tensor(padded_sequences), attention_mask=attention_masks)
predictions = tf.nn.softmax(outputs.logits).numpy()
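
The softmax step turns the model's two raw scores (logits) per sentence into probabilities that sum to 1, and the class with the highest probability becomes the prediction. A small sketch in plain Python, using made-up logits rather than real model output (the names `softmax`, `example_logits`, and `example_class` are hypothetical):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the largest logit first for numerical stability.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence: index 0 = negative, index 1 = positive.
example_logits = [-1.0, 2.0]
probs = softmax(example_logits)
example_class = probs.index(max(probs))  # same idea as np.argmax
print(probs, example_class)
```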

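Now, let's print out what BERT predicted on our two sentences. Keep in mind the warning shown when we loaded the model: the classifier weights were newly initialized rather than trained, so these predictions can change from run to run until the model is fine-tuned on labeled sentiment data.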

In [48]:
for sentence, prediction in zip(sentences, predictions):
    predicted_class = np.argmax(prediction)
    sentiment = "Positive" if predicted_class == 1 else "Negative"
    print(f"Sentence: {sentence}, Predicted Sentiment: {sentiment}")

Sentence: I had a great day today., Predicted Sentiment: Positive
Sentence: My friend is terribly sick., Predicted Sentiment: Negative
