<a href="https://colab.research.google.com/github/khaledsoudy-1/Sentiment-Analysis-with-BERT/blob/main/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Sentiment Analysis with **BERT**

This notebook provides a beginner-friendly guide to performing sentiment analysis using the powerful **BERT model**. We will cover the basics of BERT, demonstrate how to load a pre-trained model, and use it to classify the sentiment of text.

**What is Sentiment Analysis?**

- Sentiment analysis is the process of determining the emotional tone behind a piece of text, such as *positive, negative, or neutral*.

**What is BERT?**

- BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based language model introduced by Google in 2018. It reads each word in the context of the words on both sides of it, and fine-tuned versions perform well on many natural language processing tasks, including sentiment analysis. We'll be using a pre-trained BERT model to classify the sentiment of text.

### 1. Selecting the Pre-trained BERT Model

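We'll use a pre-trained BERT model fine-tuned for **sentiment analysis**: `nlptown/bert-base-multilingual-uncased-sentiment`. According to its model card, it was fine-tuned on product reviews in six languages (English, Dutch, German, French, Spanish, and Italian) and predicts a rating from 1 to 5 stars. **Arabic is not one of the supported languages.** The model is uncased, meaning input text is lowercased before tokenization.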

In [1]:
model = "nlptown/bert-base-multilingual-uncased-sentiment"

### 2. Install The Transformers Library

The Hugging Face `transformers` library provides pre-trained models and tools for **Natural Language Processing (NLP)** tasks, including sentiment analysis.

In [2]:
!pip install transformers



### 3. Initialize BERT Tokenizer and Model

Import the necessary components from the `transformers` library and initialize the BERT **tokenizer** and **model** for sentiment analysis.

* `BertTokenizer`: This is a tool used to prepare text data for BERT. It breaks down the input text into smaller units called **Tokens**, which BERT can understand.

* `TFBertForSequenceClassification`: This is the BERT model itself, specifically designed for classification tasks like sentiment analysis. It's built using **TensorFlow (TF)**.

In [3]:
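from transformers import BertTokenizer, TFBertForSequenceClassification

# Load the tokenizer that matches the checkpoint named in cell 1.
tokenizer = BertTokenizer.from_pretrained(model)

# Load the fine-tuned TensorFlow weights. Note that this rebinds `model` from the
# checkpoint name (a string) to the loaded model object, so re-run cell 1 before
# re-running this cell.
model = TFBertForSequenceClassification.from_pretrained(model)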

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


#### How to use the pre-trained BERT model for **sentiment analysis**.

1. **Input Text:** We define the text to be analyzed as "*I like my job*".
2. **Tokenization:** The `tokenizer` splits the text into tokens and converts them into the numeric token IDs that BERT takes as input.
3. **Prediction:** The `model` predicts the sentiment of the input text based on the tokens.

In [4]:
text = "I like my job"

inputs = tokenizer(text)                   # {'input_ids': [101, 151, 11531, 11153, 19594, 102], 'token_type_i........}

input_ids = inputs['input_ids']            # [101, 151, 11531, 11153, 19594, 102]

predictions = model.predict([input_ids])
predictions




TFSequenceClassifierOutput(loss=None, logits=array([[-2.5517395 , -2.1355622 ,  0.33338448,  2.011714  ,  1.7780207 ]],
      dtype=float32), hidden_states=None, attentions=None)

4. **Extracting the Logits:** `model.predict` returns a `TFSequenceClassifierOutput` object. Its `logits` attribute holds the model's raw, unnormalized scores, one per sentiment class (five for this model). A higher logit means the model considers that class more likely, but logits are not probabilities.




In [5]:
logits = predictions.logits
logits

array([[-2.5517395 , -2.1355622 ,  0.33338448,  2.011714  ,  1.7780207 ]],
      dtype=float32)

5. **Predicting Sentiment Class:** Once we have the logits, we can use a library like `NumPy` to determine the predicted sentiment class.

* This typically involves finding the index of **the highest logit value**, which corresponds to the most likely sentiment category.

* For instance, if the model has been trained to classify sentiment into three categories *`(positive, negative, neutral)`*, and the logits are `[0.2, 0.7, 0.1]`, the predicted sentiment class would be `"negative"` because it has the highest logit value `(0.7)` and corresponds to **the second category**.


In [6]:
import numpy as np

# Find the index of the largest logit, i.e. the most likely class
predicted_class = np.argmax(logits)
predicted_class

3
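
The logits are hard to read as confidence values. As a minimal sketch that is not part of the original notebook, a softmax turns them into probabilities that sum to 1. The `softmax` helper is written here for illustration, and the logits are copied from the output of cell 5.

```python
import numpy as np

# Logits copied from the cell 5 output for "I like my job".
logits = np.array(
    [[-2.5517395, -2.1355622, 0.33338448, 2.011714, 1.7780207]],
    dtype=np.float32,
)

def softmax(x, axis=-1):
    """Convert raw scores into probabilities along the given axis."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

probabilities = softmax(logits)

# Softmax preserves the ordering, so argmax still selects the same class.
predicted_class = int(np.argmax(probabilities, axis=-1)[0])

print(probabilities.round(3))
print(predicted_class)
```

For this input, roughly half of the probability falls on class 3 and most of the rest on class 4, so the model is far less certain than the single predicted index suggests.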

### Now, let's find out what this **output value** means!!

#### 1. **Revealing the Model's Sentiment Mapping**

The class index on its own is not meaningful, so let's look up the label mapping stored with the pre-trained model. The model's configuration exposes an `id2label` dictionary that maps each class ID (0-4) to its label. The code below prints this mapping if it is populated and otherwise points you to the model documentation.

In [7]:
config = model.config
id2label_mapping = config.id2label

if id2label_mapping:
  print("Sentiment Mapping:", id2label_mapping)
else:
  print("id2label not found. Check model documentation or label_names.")

Sentiment Mapping: {0: '1 star', 1: '2 stars', 2: '3 stars', 3: '4 stars', 4: '5 stars'}




---
So, when the code predicts class `3`, it means it has classified the input text "*I like my job*" as having a `4 stars` sentiment.



#### 2. **Customizing the Sentiment Mapping**

You can also define a **custom sentiment mapping** that assigns your own labels to the model's output classes (0-4). Here we keep the mapping in a separate dictionary and use it to look up the predicted class, which leaves the model's own `id2label` configuration unchanged.

In [8]:
# Define your custom mapping

custom_mapping = {
    0: "very negative",
    1: "negative",
    2: "neutral",
    3: "positive",
    4: "very positive"
}

Now, apply the custom mapping on the output:

In [9]:
predicted_sentiment = custom_mapping[predicted_class]

print(f"The input text is predicted to have a sentiment of : {predicted_sentiment}")

The input text is predicted to have a sentiment of : positive


---

# Another Example on **Sentiment Analysis**

In [10]:
text = "I hate my job"

inputs = tokenizer(text)

input_ids = inputs['input_ids']

predictions = model.predict([input_ids])

logits = predictions.logits
logits




array([[ 2.938084  ,  1.6215161 , -0.16542561, -1.9236566 , -1.9499074 ]],
      dtype=float32)

In [11]:
predicted_class = np.argmax(logits)

predicted_class

0

In [12]:
predicted_sentiment = custom_mapping[predicted_class]

print(f"The input text is predicted to have a sentiment of : {predicted_sentiment}")

The input text is predicted to have a sentiment of : very negative
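
Both examples repeat the same steps: tokenize, predict, take the largest logit, and look up the label. One way to avoid the repetition is a small helper. `predict_sentiment` is a hypothetical name introduced here, and the sketch assumes the `tokenizer`, `model`, and `custom_mapping` objects created earlier in this notebook.

```python
import numpy as np

def predict_sentiment(text, tokenizer, model, mapping):
    """Return the mapped sentiment label for a single piece of text."""
    # Convert the text into token IDs.
    input_ids = tokenizer(text)["input_ids"]

    # Run the model on a batch containing only this text.
    logits = model.predict([input_ids]).logits

    # With a single row of logits, argmax gives the class index directly.
    predicted_class = int(np.argmax(logits))
    return mapping[predicted_class]

# Example call, assuming the objects defined earlier in the notebook:
# predict_sentiment("I hate my job", tokenizer, model, custom_mapping)
```

The helper takes the tokenizer, model, and mapping as arguments rather than reading globals, so the same function could be reused with a different checkpoint or label scheme.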




---



# **Example**: Sentiment Analysis for **Large Data** Using BERT.

## Mounting Google Drive
        
The following code mounts your Google Drive to the Colab environment, enabling access to files stored in your Drive.

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the Dataset

This code cell loads the IMDB dataset from a CSV file stored in your Google Drive using the pandas library.

In [14]:
csv_file = "/content/drive/MyDrive/Colab Notebooks/IMDB Dataset.csv"

In [15]:
import pandas as pd

df = pd.read_csv(csv_file)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


## Extracting the First Five Rows to test the model

This code cell extracts the first five rows of the IMDB dataset using the `head()` method to test the model instead of running it on the whole dataset.

In [16]:
first_five_rows = df.head(5)
first_five_rows

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Custom Sentiment Mapping and Data Preparation

This code cell defines a custom sentiment mapping to interpret the BERT model's predictions. It also creates a copy of the first five rows of the dataset for analysis.

In [17]:
custom_mapping = {
    0: "very negative",
    1: "negative",
    2: "neutral",
    3: "positive",
    4: "very positive"
}

first_five_rows_copy = first_five_rows.copy()

first_five_rows_copy

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Predicting Sentiments for the First Five Reviews

This code cell iterates through the first five movie reviews, uses the BERT model to predict their sentiments, and stores the predictions in a list called `predicted_sentiments_list`.

In [18]:
import numpy as np

# Create an empty list
predicted_sentiments_list = []

for row in first_five_rows_copy['review']:
  # BERT accepts at most 512 tokens, so truncate longer reviews
  inputs = tokenizer(row, truncation=True, max_length=512)

  input_ids = inputs['input_ids']

  predictions = model.predict([input_ids])

  logits = predictions.logits

  predicted_class = np.argmax(logits)

  predicted_sentiment = custom_mapping[predicted_class]

  predicted_sentiments_list.append(predicted_sentiment)

predicted_sentiments_list



['neutral', 'very positive', 'positive', 'neutral', 'positive']

## Comparing Predicted Sentiments with Original Labels

This code cell adds a new column named `BERT_sentiments` to the `first_five_rows_copy` DataFrame. This column contains the sentiment predictions made by the BERT model, allowing for comparison with the original sentiment labels.

In [19]:
first_five_rows_copy['BERT_sentiments'] = predicted_sentiments_list
first_five_rows_copy

Unnamed: 0,review,sentiment,BERT_sentiments
0,One of the other reviewers has mentioned that ...,positive,neutral
1,A wonderful little production. <br /><br />The...,positive,very positive
2,I thought this was a wonderful way to spend ti...,positive,positive
3,Basically there's a family where a little boy ...,negative,neutral
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,positive
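
The dataset uses two labels while the custom mapping uses five, so the two columns cannot be compared directly. A minimal sketch of one possible comparison follows: collapse the five classes into `positive`, `negative`, and `neutral`, then count agreements. The `COLLAPSE` dictionary is an assumption made for illustration, and the values are copied from the table above.

```python
# Values copied from the `sentiment` and `BERT_sentiments` columns above.
labels = ["positive", "positive", "positive", "negative", "positive"]
predictions = ["neutral", "very positive", "positive", "neutral", "positive"]

# Assumed collapse of the five custom classes into coarser ones.
COLLAPSE = {
    "very negative": "negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "very positive": "positive",
}

collapsed = [COLLAPSE[p] for p in predictions]

# A "neutral" prediction never matches, because the dataset has no neutral label.
matches = [c == l for c, l in zip(collapsed, labels)]
agreement = sum(matches) / len(matches)

print(collapsed)
print(f"{sum(matches)}/{len(matches)} predictions agree ({agreement:.0%})")
```

On these five rows the script reports 3/5 agreement. The two misses are both `neutral` predictions, and five rows are far too few to judge the model's accuracy on the full dataset.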




---



# **Example**: Sentiment Analysis for **Arabic Data** Using BERT.

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
csv_file = "/content/drive/MyDrive/Colab Notebooks/train.csv"

In [22]:
import pandas as pd

df = pd.read_csv(csv_file)
df

Unnamed: 0,tweet,class
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos
2,' و انتهى مشوار الخواجة ',neg
3,' مش عارف ابتدى مذاكره منين :/ ',neg
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg
...,...,...
2054,' @wasfa_N الجمال مبيحتاح اي مكياج لناعم وله خ...,neu
2055,' @TheMurexDor نتمني وجود الفنانة رنا سماحة اف...,neu
2056,' ولد الهدى فالكائنات ضياء .. وفم الزمان تبسم ...,pos
2057,' @mohamed71944156 @samarroshdy1 انت متناقض جد...,neg


In [23]:
first_5rows_arabic = df.head(5)
first_5rows_arabic

Unnamed: 0,tweet,class
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos
2,' و انتهى مشوار الخواجة ',neg
3,' مش عارف ابتدى مذاكره منين :/ ',neg
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg


In [24]:
custom_mapping = {
    0: "very negative",
    1: "negative",
    2: "neutral",
    3: "positive",
    4: "very positive"
}

first_5rows_arabic_copy = first_5rows_arabic.copy()
first_5rows_arabic_copy

Unnamed: 0,tweet,class
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos
2,' و انتهى مشوار الخواجة ',neg
3,' مش عارف ابتدى مذاكره منين :/ ',neg
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg


In [25]:
import numpy as np

# Create an empty list
predicted_sentiments_list = []

for row in first_5rows_arabic_copy['tweet']:
  inputs = tokenizer(row)

  input_ids = inputs['input_ids']

  predictions = model.predict([input_ids])

  logits = predictions.logits

  predicted_class = np.argmax(logits)

  predicted_sentiment = custom_mapping[predicted_class]

  # Add each predicted sentiment to the list
  predicted_sentiments_list.append(predicted_sentiment)

predicted_sentiments_list



['very negative',
 'very positive',
 'negative',
 'very negative',
 'very negative']

In [26]:
first_5rows_arabic_copy['BERT_sentiments'] = predicted_sentiments_list
first_5rows_arabic_copy

Unnamed: 0,tweet,class,BERT_sentiments
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos,very negative
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos,very positive
2,' و انتهى مشوار الخواجة ',neg,negative
3,' مش عارف ابتدى مذاكره منين :/ ',neg,very negative
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg,very negative


## Limitations with Arabic Sentiment Analysis

The `nlptown/bert-base-multilingual-uncased-sentiment` model has a significant limitation here: **it does not reliably support Arabic sentiment analysis**. The model was fine-tuned on reviews in other languages, so its predictions for Arabic text cannot be trusted. In the five rows above, for example, the first tweet is labelled `pos` but predicted `very negative`, and a sample this small says little about overall accuracy.

To address this, we need to explore alternative approaches specifically designed for Arabic sentiment analysis. Some options include:

1. **Using a dedicated Arabic BERT model:** Several pre-trained BERT models have been trained on Arabic text, such as `CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment`. These models are better equipped to capture the nuances of the Arabic language.
2. **Fine-tuning a multilingual BERT model:** We can fine-tune a multilingual BERT model on a large Arabic sentiment analysis dataset. This would adapt the model to the specific characteristics of Arabic text and potentially improve its accuracy.
3. **Exploring other Arabic NLP techniques:** Beyond BERT, various other Natural Language Processing techniques are available for Arabic, including traditional machine learning methods and deep learning architectures like recurrent neural networks (RNNs).


## Solution: Use a dedicated Arabic BERT model

We will focus on using a dedicated Arabic BERT model, `CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment`. This approach is preferred because:
* Pre-trained Arabic BERT models are readily available and can be easily integrated into our existing workflow.
* Models trained on Arabic text generally handle Arabic NLP tasks, including sentiment analysis, better than a multilingual model that was never fine-tuned on Arabic.



> [Model card for `CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment`](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment)

### **Step 1:** Install the Required Libraries

We start by installing the `camel-tools` package, which includes the necessary tools for **Arabic NLP tasks** using the CAMeL BERT model. This package provides pre-trained models for sentiment analysis and other NLP tasks.

In [12]:
! pip install camel-tools -f https://download.pytorch.org/whl/torch_stable.html


Looking in links: https://download.pytorch.org/whl/torch_stable.html


### **Step 2:** Load the Sentiment Analyzer Model
We import the `SentimentAnalyzer` class from `camel_tools`. The `SentimentAnalyzer` allows us to perform sentiment analysis on Arabic text using the dedicated `CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment` model.

In [23]:
from camel_tools.sentiment import SentimentAnalyzer

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment")

### **Step 3:** Testing the Model on Sample Sentences
Here, we create a list of Arabic sentences and pass them to the sentiment analyzer to see the model's output. This helps us confirm that the sentiment analyzer is working as expected.

In [24]:
sentences = ['أنا بخير', 'أنا لست بخير']

sa.predict(sentences)

['positive', 'negative']

### **Step 4:** Connect Google Drive

To access data files stored in Google Drive, we mount the drive.


In [17]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Step 5:** Load the Dataset

We use `pandas` to load a **CSV** file containing Arabic text data, which we’ll use for sentiment analysis. Here, we load the dataset and display the **first five rows** to get an overview of the data.

In [18]:
csv_file = "/content/drive/MyDrive/Colab Notebooks/train.csv"

In [26]:
import pandas as pd

df = pd.read_csv(csv_file)

first_5rows_arabic = df.head(5)
first_5rows_arabic

Unnamed: 0,tweet,class
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos
2,' و انتهى مشوار الخواجة ',neg
3,' مش عارف ابتدى مذاكره منين :/ ',neg
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg


### **Step 6:** Prepare Data for Sentiment Analysis
To avoid modifying the original DataFrame, we create a copy of the first five rows. Then, we extract the column containing the Arabic text data (e.g., tweets) and convert it to a list, which can be directly fed into the sentiment analyzer.

In [22]:
first_5rows_arabic_copy = first_5rows_arabic.copy()
first_5rows_arabic_copy

Unnamed: 0,tweet,class
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos
2,' و انتهى مشوار الخواجة ',neg
3,' مش عارف ابتدى مذاكره منين :/ ',neg
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg


In [37]:
sentences = first_5rows_arabic_copy['tweet']

sentences_list = list(sentences)
sentences_list

["' #علمتني_الحياه أن الذين يعيشون على الأرض ليسوا ملائكة بل بشر قد يصيبوا وقد يخطئوا فلا يجوز أن أحكم على شخص من موقف واحد تعاملت معه فيه '",
 "' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/nEn1tHSrpw '",
 "' و انتهى مشوار الخواجة '",
 "' مش عارف ابتدى مذاكره منين :/ '",
 "' @mskhafagi  إختصروا الطريق بدلا من إختيار المنصف ثم الانقلاب عليه يبدو أن  الدولة العميقة ضاربة بجذورها فى دول العالم النامى كله '"]

### **Step 7:** Perform Sentiment Analysis
We pass the list of Arabic sentences to the sentiment analyzer to get their sentiment predictions.

In [39]:
sentiments = sa.predict(sentences_list)
sentiments

['positive', 'positive', 'negative', 'negative', 'negative']
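
The tweets in `sentences_list` still carry the wrapping `' ... '` quotes, URLs, and @mentions from the raw CSV. As an optional step that the original notebook does not perform, they could be cleaned before prediction. `clean_tweet` is a hypothetical helper, and whether this cleanup changes the predictions is not something this sketch establishes.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Remove URLs, @mentions, wrapping quotes, and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Drop the leading and trailing quote characters and spaces.
    text = text.strip(" '")
    # Collapse any runs of whitespace left behind by the removals.
    return " ".join(text.split())

# Illustrative input shaped like the rows in the dataset.
print(clean_tweet("' @user  hello http://t.co/abc world '"))

# With the notebook's data, the cleaned list could be built as:
# cleaned_list = [clean_tweet(t) for t in sentences_list]
```

Hashtags are left untouched here because they often carry sentiment in tweets.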

### **Step 8:** Comparing Predicted Sentiments with Original Labels

This code cell adds a new column named `BERT sentiments` to the `first_5rows_arabic_copy` DataFrame. This column contains the sentiment predictions made by the CAMeLBERT model, allowing for comparison with the original labels in the `class` column.

In [40]:
first_5rows_arabic_copy['BERT sentiments'] = sentiments
first_5rows_arabic_copy

Unnamed: 0,tweet,class,BERT sentiments
0,' #علمتني_الحياه أن الذين يعيشون على الأرض ليس...,pos,positive
1,' #ميري_كرسمس كل سنة وانتم طيبين http://t.co/n...,pos,positive
2,' و انتهى مشوار الخواجة ',neg,negative
3,' مش عارف ابتدى مذاكره منين :/ ',neg,negative
4,' @mskhafagi إختصروا الطريق بدلا من إختيار ال...,neg,negative
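
The `class` column uses short labels (`pos`, `neg`, `neu`) while the analyzer returns full words, so a direct equality check would fail on every row. A minimal sketch of a comparison follows, with the values copied from the table above. The `TO_SHORT` mapping is an assumption made for illustration.

```python
from collections import Counter

# Assumed mapping from the analyzer's labels to the dataset's short labels.
TO_SHORT = {"positive": "pos", "negative": "neg", "neutral": "neu"}

# Values copied from the `class` and `BERT sentiments` columns above.
gold = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["positive", "positive", "negative", "negative", "negative"]

predicted_short = [TO_SHORT[label] for label in predicted]

# Count each (gold, predicted) pair, which works as a small confusion table.
pair_counts = Counter(zip(gold, predicted_short))
accuracy = sum(g == p for g, p in zip(gold, predicted_short)) / len(gold)

for (g, p), n in sorted(pair_counts.items()):
    print(f"gold={g} predicted={p}: {n}")
print(f"accuracy on this sample: {accuracy:.0%}")
```

All five predictions match on this sample. A sample this small cannot show how the model performs on the rest of the dataset, so the same counting would need to run over all rows for a meaningful estimate.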
