<h1 style="text-align: left; color: #4CAF50; font-size: 24px; font-weight: bold; font-family: Arial;">
Sentiment Analysis using BERT: Predicting Sentiment in IMDB Reviews
</h1>

<div style="text-align: left; font-size: 18px; color: #333; font-style: italic; margin-top: 10px; font-family: Arial;">
Author: Prashant Sundge
</div>


# Project Index: Sentiment Analysis using BERT

## 1. Loading Pre-Trained BERT Model
   - Initialize and load a pre-trained BERT model for fine-tuning.

## 2. Fine-Tuning BERT for Specific Task
   - Prepare BERT for sequence classification on IMDB sentiment analysis.

## 3. IMDB Dataset Loading
   - Load the IMDB dataset consisting of movie reviews labeled with sentiment (positive or negative).

## 4. Create Labels
   - Prepare labels for sentiment classes based on the dataset.

## 5. Data Splitting: 2K Training, 100 Testing
   - Split the dataset into a training set with 2000 samples and a test set with 100 samples.

## 6. Save the Model
   - Save the fine-tuned BERT model after training.

## 7. Load the Model
   - Load the saved BERT model for inference.

## 8. Testing the Model on New Data
   - Tokenize new data using BERT's tokenizer.
   - Convert inputs to PyTorch tensors.
   - Make predictions with the loaded model.
   - Extract predicted labels.

## 9. Display Predictions
   - Show predictions made by the model on new data.

## 10. Model Evaluation and Metrics
   - Generate a classification report showing precision, recall, and F1-score.
   - Display a confusion matrix for further evaluation.

## 11. Create DataFrame for Reviews, Actual, and Predicted Labels
   - Organize results into a DataFrame for better visualization and analysis.

## 12. The End
   - Conclusion and summary of the project's key findings.


<div style="font-family: Arial, sans-serif; font-size: 18px; line-height: 1.6; text-align: justify; color: #333;">
<b>This project fine-tunes BERT, a pre-trained transformer model, for sentiment analysis on the IMDB dataset.</b> The IMDB dataset consists of movie reviews labeled as positive or negative. BERT is fine-tuned for sequence classification so that it predicts a sentiment label from the text of each review. The model is trained on 2,000 reviews and evaluated on 100 reviews, and its performance is reported as accuracy, precision, recall, and F1-score.
</div>


# Install Necessary Libraries

- Install Transformers and Torch libraries: `!pip install transformers torch -q`
- Install Accelerate, which recent versions of the Transformers `Trainer` rely on for PyTorch training: `!pip install accelerate -U`
- Simplified install command for all: `!pip install transformers torch accelerate -q`


In [15]:
# !pip install transformers torch -q

In [13]:
# !pip install transformers[torch] -q

In [11]:
# !pip install accelerate -U

# # Step 2: Install the required libraries
# !pip install transformers torch accelerate -q


# Loading Pre-Trained BERT Model

- **BERT Tokenizer and Model Usage**:
  - Import `BertTokenizer` and `BertModel` from `transformers`.
  - Load pre-trained tokenizer and model using `'bert-base-uncased'`.
  - Tokenize input text with `tokenizer(text, return_tensors='pt')`.
  - Obtain embeddings using `model(**inputs).last_hidden_state`.
  - Print the shape of the embeddings (`torch.Size([1, seq_length, 768])` for the BERT base model).


In [1]:
from transformers import BertTokenizer, BertModel
# load pre-trained model and tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


# tokenize the input text
text= "Hello, how are you ?"

inputs = tokenizer(text, return_tensors = 'pt')


# get the embeddings
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


torch.Size([1, 8, 768])
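The sequence length of 8 in the shape above can be accounted for by hand: BERT adds a `[CLS]` token at the start and a `[SEP]` token at the end, and punctuation marks become separate tokens. The sketch below is only a rough illustration. `rough_tokens` is a hypothetical helper that splits on words and punctuation; it is not BERT's WordPiece algorithm, and it agrees with the real tokenizer's count here only because this sentence contains no words that need to be broken into sub-word pieces.

```python
import re

def rough_tokens(text):
    """Hypothetical stand-in for a tokenizer: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

text = "Hello, how are you ?"

# BERT wraps every sequence in the special tokens [CLS] ... [SEP]
tokens = ["[CLS]"] + rough_tokens(text) + ["[SEP]"]

print(tokens)
print(len(tokens))  # 8
```

The last dimension, 768, is the hidden size of the BERT base model, so each of the 8 tokens is represented by a 768-dimensional vector.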


# Fine Tuning BERT for Specific Task

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer , TrainingArguments

import torch
from torch.utils.data import DataLoader,Dataset

# IMDB Dataset Loading

In [3]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd
import zipfile
import os

# Note: an earlier version of this notebook read the data from a zip archive.
# That code is kept below, commented out; the CSV file is now read directly.

# zip_path ="/content/drive/MyDrive/Colab Notebooks/archive/IMDB Dataset.csv"
extract_path = '/content/drive/MyDrive/Colab Notebooks/archive/IMDB Dataset.csv'

# with zipfile.ZipFile(zip_path, 'r') as zip_ref:
#     zip_ref.extractall(extract_path)

# extracted_files = os.listdir(extract_path)
# print(f"Extracted files {extracted_files}")

# dfs = []

# for file in extracted_files:
#   if file.endswith('.csv'):
#     file_path=os.path.join(extract_path, file)
#     df= pd.read_csv(file_path)
#     dfs.append(df)

In [5]:
# df.head()
df = pd.read_csv(extract_path)

In [6]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

# Create Labels

In [7]:
def sentiments_to_labels(sentiment):
  if sentiment == "positive":
      return 1
  elif sentiment =='negative':
      return 0
  else:
      return None

df['labels'] = df['sentiment'].apply(sentiments_to_labels)

df.head()



|   | review | sentiment | labels |
|---|--------|-----------|--------|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |
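The `if`/`elif` mapping used to create the labels can also be written as a dictionary lookup. The sketch below uses a plain Python list instead of the DataFrame so that it runs on its own; `LABEL_MAP` and `to_label` are illustrative names that do not appear in the notebook.

```python
# Mapping from sentiment string to integer class label
LABEL_MAP = {"positive": 1, "negative": 0}

def to_label(sentiment):
    # dict.get returns None for any value outside the two expected classes
    return LABEL_MAP.get(sentiment)

sentiments = ["positive", "negative", "neutral"]
example_labels = [to_label(s) for s in sentiments]
print(example_labels)  # [1, 0, None]
```

On the DataFrame itself, `df['sentiment'].map(LABEL_MAP)` should give the same labels, with `NaN` in place of `None` for unmapped values.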


# Split Data: 2,000 Reviews for Training, 100 for Testing

In [29]:
df = df.drop(columns='sentiment')

In [66]:
df.head()

texts = df['review'].head(2000)
labels= df['labels'].head(2000)


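# test set of 100 reviews for evaluation
# NOTE: head(100) returns rows that are also in the 2,000 training rows above,
# so metrics on this test set measure fit to seen data, not generalization
test_reviews = list(df['review'].head(100))
test_labels = list(df['labels'].head(100))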
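Because both `head(2000)` and `head(100)` start from the top of the DataFrame, the test reviews are a subset of the training reviews. One way to get a held-out test set is to take the test rows from positions that follow the training rows. The sketch below shows the idea on a list of row positions; `split_disjoint` is a hypothetical helper, and the 50,000 figure comes from the `value_counts()` output earlier in the notebook.

```python
def split_disjoint(items, n_train, n_test):
    """Take the first n_train items for training and the next n_test items for testing."""
    if n_train + n_test > len(items):
        raise ValueError("not enough items for the requested split")
    return items[:n_train], items[n_train:n_train + n_test]

row_ids = list(range(50000))  # stand-in for the row positions of the DataFrame
train_ids, test_ids = split_disjoint(row_ids, n_train=2000, n_test=100)

print(len(train_ids), len(test_ids))        # 2000 100
print(set(train_ids).isdisjoint(test_ids))  # True
```

With the DataFrame, the equivalent slices would be `df.iloc[:2000]` and `df.iloc[2000:2100]`. The outputs shown in this notebook were produced with the overlapping split, so they would change if this split were used instead.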



### Dataset Preparation

- **Dataset Class**: The `sampleDataset` class inherits from `Dataset` and is designed for text classification with BERT.
  - **Initialization**: Accepts `texts`, `labels`, `tokenizer`, and `max_len` parameters.
  - **`__len__` Method**: Returns the number of samples in the dataset.
  - **`__getitem__` Method**: Tokenizes one text, truncates or pads it to `max_len`, and returns `input_ids`, `attention_mask`, and `labels`.

- **Data Preparation**:
  - `texts` and `labels` are extracted from `df['review']` and `df['labels']` respectively.
  - `tokenizer` is initialized using `BertTokenizer.from_pretrained`.
  - `dataset` is created using `sampleDataset` with `texts`, `labels`, `tokenizer`, and `max_len=32`.
  - `dataloader` is created using `DataLoader` for batching.
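The truncate-or-pad step described above can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, `pad_id=0` is an assumption, and the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len items long."""
    kept = token_ids[:max_len]               # truncate sequences that are too long
    n_pad = max_len - len(kept)              # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_len=6)
print(short_ids)   # [11, 12, 13, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 0, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1, 41)), max_len=32)
print(len(long_ids), sum(long_mask))  # 32 32
```

With `max_len=32`, any review longer than 32 tokens is cut off, so the model sees only the opening of most IMDB reviews.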

### Model Training

- **Model Loading**:
  - `model` is loaded using `BertForSequenceClassification.from_pretrained` with `num_labels=2` for binary classification.

- **Training Arguments**:
  - Initial training arguments (commented out) include output directory, epochs, batch size, and logging settings.
  - Hyperparameter tuning: New `training_args` with 5 epochs, batch size 4, learning rate 5e-5, and logging every 10 steps.

- **Trainer Initialization**:
  - `trainer` is initialized with `model`, `training_args`, and `train_dataset`.

- **Training**:
  - `trainer.train()` initiates the training process.

- **Evaluation**:
  - `results` from `trainer.evaluate(eval_dataset=dataset)` holds the evaluation metrics after training. Here `dataset` is the training set, so the reported loss reflects fit to the training data.
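The hyperparameters listed above determine how many optimizer steps the run takes. The arithmetic below is a sketch that assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

num_samples = 2000   # training reviews
batch_size = 4       # per_device_train_batch_size
num_epochs = 5       # num_train_epochs
logging_steps = 10   # logging_steps

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_log_entries = total_steps // logging_steps

print(steps_per_epoch)  # 500
print(total_steps)      # 2500
print(num_log_entries)  # 250
```

Under these assumptions, the training-loss table would have 250 rows, one for every 10 steps.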



In [91]:


class sampleDataset(Dataset):
  def __init__(self, texts, labels, tokenizer, max_len):
    self.texts = texts
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    text = self.texts[idx]
    label= self.labels[idx]
    encoding = self.tokenizer(text, truncation= True, padding='max_length', max_length= self.max_len, return_tensors = 'pt')

    return {

            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label,dtype=torch.long)

        }

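# prepare the dataset
# .tolist() gives plain Python lists, so __getitem__ indexes by position rather than by pandas label
texts = df['review'].head(2000).tolist()
labels = df['labels'].head(2000).tolist()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = sampleDataset(texts, labels, tokenizer, max_len=32)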

# create dataloader
dataloader = DataLoader(dataset, batch_size=2)

# load Model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)

# # training arguments
# training_args = TrainingArguments(
#     output_dir = '.results',  # output directory
#     num_train_epochs = 1, # number of training epochs
#     per_device_train_batch_size=2, # batch size for training
#     logging_dir ='./logs',

# )
"""The parameters above worked; evaluation output after 1 epoch:
{'eval_loss': 0.5178326368331909, 'eval_runtime': 15.8858, 'eval_samples_per_second': 125.899, 'eval_steps_per_second': 15.737, 'epoch': 1.0}
"""

# Hyperparameter tuning
# Define training arguments with different hyperparameters
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    logging_dir='./logs',
    logging_steps=10,
)
"""Evaluation output with the tuned hyperparameters:
{'eval_loss': 0.02625814639031887, 'eval_runtime': 15.9143, 'eval_samples_per_second': 125.673, 'eval_steps_per_second': 15.709, 'epoch': 5.0}

Lower evaluation loss: the eval_loss of 0.026 after 5 epochs is far lower than the 0.518 after 1 epoch.
Both values are computed on the same 2,000 reviews used for training, so the drop shows a closer fit
to the training data; confirming better generalization would need a held-out set.

"""

# trainer
trainer = Trainer(
    model = model,
    args= training_args,
    train_dataset = dataset
)

trainer.train()

# evaluate the model (`dataset` is the training set, so this measures training fit)
results = trainer.evaluate(eval_dataset=dataset)
print(results)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,0.7013
20,0.6848
30,0.7127
40,0.605
50,0.7277
60,0.6647
70,0.7118
80,0.7196
90,0.701
100,0.7134


{'eval_loss': 0.02625814639031887, 'eval_runtime': 15.9143, 'eval_samples_per_second': 125.673, 'eval_steps_per_second': 15.709, 'epoch': 5.0}


# Save the Model

In [9]:
model_path = "./fine_tuned_bert_model"
trainer.save_model(model_path)

# Load the Model

In [10]:
from transformers import BertForSequenceClassification

model= BertForSequenceClassification.from_pretrained(model_path)

# Testing the Model on New Data

In [67]:
# example of testing on New data
new_texts = test_reviews


# Tokenize the new data

In [68]:
inputs = tokenizer(new_texts, truncation= True, padding='max_length', max_length=32, return_tensors = 'pt')


# Convert Inputs to PyTorch Tensors

In [69]:
# the tokenizer already returned PyTorch tensors (return_tensors='pt'); pull them out by name
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Make Predictions

In [70]:
with torch.no_grad():
  outputs= model(input_ids, attention_mask=attention_mask)

# Extract Predicted Labels

In [72]:
predicted_labels = torch.argmax(outputs.logits, dim=1)

In [79]:
predicted_labels

tensor([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
        0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,
        1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
        0, 0, 0, 1])
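`torch.argmax(outputs.logits, dim=1)` picks, for each review, the index of the larger of the two logits. The plain-Python sketch below shows the same selection on made-up logits, together with the softmax that would turn them into probabilities; `softmax`, `argmax`, and `example_logits` are illustrative names and values, not output from the model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for two reviews: [score for class 0, score for class 1]
example_logits = [[-1.2, 2.3], [0.8, -0.4]]

for row in example_logits:
    probs = softmax(row)
    print(argmax(row), [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits gives the same label as taking the argmax of the probabilities.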

# Display Predictions


In [92]:
# map class ids to readable names
label_names = {0: "NEGATIVE", 1: "POSITIVE"}

for text, label in zip(new_texts[:10], predicted_labels[:10]):
  # label is a 0-dim tensor; .item() converts it to a Python int
  print(f"Text:{text[:100]} ---> Predicted Label :{label_names[label.item()]}")


Text:One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The ---> Predicted Label :POSITIVE
Text:A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-B ---> Predicted Label :POSITIVE
Text:I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air con ---> Predicted Label :POSITIVE
Text:Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his par ---> Predicted Label :NEGATIVE
Text:Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers  ---> Predicted Label :POSITIVE
Text:Probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble ca ---> Predicted Label :POSITIVE
Text:I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today i ---> Predicted Label :POSITIVE
Text:This show was an amazing, fresh & in


# Model Evaluation and Metrics

In [82]:
from sklearn.metrics import classification_report, confusion_matrix

# true labels for your new_texts
true_labels = test_labels

# classification report
print(classification_report(true_labels , predicted_labels))

# confusion matrix
print(confusion_matrix(true_labels , predicted_labels))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87        58
           1       0.81      0.83      0.82        42

    accuracy                           0.85       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.85      0.85      0.85       100

[[50  8]
 [ 7 35]]


Here are my observations based on the classification results:

- **Accuracy**: The model achieves an accuracy of 85%, meaning it predicts the correct class for 85 of the 100 test reviews.

- **Precision**:
  - Class 0 (negative class): Precision of 88% means that when the model predicts an instance as class 0, it is correct 88% of the time.
  - Class 1 (positive class): Precision of 81% indicates that when the model predicts an instance as class 1, it is correct 81% of the time.

- **Recall**:
  - Class 0: Recall of 86% suggests that the model correctly identifies 86% of all actual class 0 instances.
  - Class 1: Recall of 83% indicates that the model correctly identifies 83% of all actual class 1 instances.

- **F1-score**:
  - Class 0: F1-score of 87%, the harmonic mean of precision and recall for class 0.
  - Class 1: F1-score of 82%, the harmonic mean of precision and recall for class 1.

- **Support**:
  - Class 0: 58 of the 100 test reviews are actually negative.
  - Class 1: 42 of the 100 test reviews are actually positive.

- **Macro Average**:
  - Precision, recall, and F1-score are all 85%, computed by averaging their respective values across classes, treating each class equally.

- **Weighted Average**:
  - Precision, recall, and F1-score are all 85%, computed by averaging their respective values across classes, weighted by the number of instances for each class.

Overall, precision and recall are close to each other for both classes (between 81% and 88%), so the model is not strongly biased toward either label on this sample. Two caveats apply: the test sample has only 100 reviews, and those reviews are the first 100 rows of the same 2,000 used for training, so accuracy on unseen reviews may be lower.
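The figures in the classification report can be reproduced directly from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper written for this illustration.

```python
# Confusion matrix from the output above
cm = [[50, 8],
      [7, 35]]  # rows: actual class, columns: predicted class

def per_class_metrics(cm, cls):
    """Precision, recall, and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    actual = sum(cm[cls])                    # row sum: everything that really is cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.2f}")
for cls in (0, 1):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For class 0, for example, precision is 50 / (50 + 7) and recall is 50 / (50 + 8), which round to the 0.88 and 0.86 shown in the report.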

In [87]:
# create dataframe for actual and predicted values
pred_df = pd.DataFrame({
    'Text': new_texts,
    'Actual Label': test_labels,
    'Predicted Label': predicted_labels.tolist()  # convert the tensor to a list of Python ints
})

# Display the DataFrame
pred_df.head() # Print the first few rows to verify

|   | Text | Actual Label | Predicted Label |
|---|------|--------------|-----------------|
| 0 | One of the other reviewers has mentioned that ... | 1 | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 | 1 |
| 3 | Basically there's a family where a little boy ... | 0 | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 | 1 |


## Project Summary

This project fine-tuned BERT, a transformer-based model, for sentiment analysis on the IMDB dataset. The fine-tuned sequence classifier predicts a positive or negative label for each movie review. Key findings include:
- Reached 85% accuracy on the 100-review test sample, with per-class precision, recall, and F1-score between 81% and 88%.
- Showed that BERT can be adapted to sentiment classification with 2,000 training reviews and inputs truncated to 32 tokens.
- Followed a workflow (tokenize, fine-tune, save, reload, predict, evaluate) that carries over to sentiment analysis in other domains.
  
Overall, the project shows how a pre-trained language model can be fine-tuned to classify sentiment in text.
