<a href="https://colab.research.google.com/github/gnaneswaryarram/machinelearningproject/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# prompt: read twitter.csv file from drive

import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/twitter.csv')


In [None]:
# prompt: Install necessary Python libraries, including TensorFlow, PyTorch, Transformers by Hugging Face

!pip install tensorflow
!pip install torch
!pip install transformers




In [None]:
# prompt: Preprocess the above text data to suit the requirements of BERT (like tokenization, adding special tokens, attention masks).

from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Check the column names in your DataFrame
print(df.columns)

# The text column in this dataset is 'tweet' (see the printed column names).
# truncation=True cuts tweets at BERT's maximum length; padding=True pads to the longest sequence.
tokens = tokenizer(df['tweet'].tolist(), truncation=True, padding=True, return_tensors='pt')

# Extract input_ids, attention_mask, and token_type_ids
input_ids = tokens['input_ids']
attention_mask = tokens['attention_mask']
token_type_ids = tokens['token_type_ids']



Index(['id', 'label', 'tweet'], dtype='object')


In [None]:
# prompt: Model Building: Leverage the pre-trained BERT model from Hugging Face and add classification (positive , negative and neutral) layers on top.

import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
  def __init__(self, num_classes=3):  # 3 classes: positive, negative, neutral
    super().__init__()
    self.bert = BertModel.from_pretrained('bert-base-uncased')
    self.dropout = nn.Dropout(0.1)
    self.out = nn.Linear(self.bert.config.hidden_size, num_classes)

  def forward(self, input_ids, attention_mask, token_type_ids):
    _, pooled_output = self.bert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        return_dict=False
    )
    output = self.dropout(pooled_output)
    return self.out(output)

# Initialize the model
model = SentimentClassifier()


In [None]:
# prompt: Training: Fine-tune the BERT model on your dataset, carefully tuning the learning rate and other hyperparameters.

import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Assuming 'label' is the column name for sentiment labels (0: negative, 1: neutral, 2: positive)
# Ensure labels is the correct length, the same as the first dimension of the tokenized data
labels = torch.tensor(df['label'].values[:input_ids.shape[0]])

# Create a TensorDataset
dataset = TensorDataset(input_ids, attention_mask, token_type_ids, labels)

# Split into training and validation sets (adjust split ratio as needed)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)  # Adjust learning rate as needed

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

num_epochs = 3  # Adjust as needed
for epoch in range(num_epochs):
  model.train()
  train_loss = 0
  for batch in train_loader:
    # Batch-local names keep the full-dataset tensors defined above intact
    b_input_ids, b_attention_mask, b_token_type_ids, b_labels = [t.to(device) for t in batch]

    optimizer.zero_grad()
    outputs = model(b_input_ids, b_attention_mask, b_token_type_ids)
    loss = criterion(outputs, b_labels)
    loss.backward()
    optimizer.step()
    train_loss += loss.item()

  # Validation loop
  model.eval()
  val_loss = 0
  with torch.no_grad():
    for batch in val_loader:
      # Batch-local names keep the full-dataset tensors defined above intact
      b_input_ids, b_attention_mask, b_token_type_ids, b_labels = [t.to(device) for t in batch]

      outputs = model(b_input_ids, b_attention_mask, b_token_type_ids)
      loss = criterion(outputs, b_labels)
      val_loss += loss.item()

  print(f"Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss/len(train_loader)}, Val Loss: {val_loss/len(val_loader)}")

Epoch 1/3: Train Loss: 0.7365532517433167, Val Loss: 0.7001771330833435
Epoch 2/3: Train Loss: 0.6024259924888611, Val Loss: 0.670009970664978
Epoch 3/3: Train Loss: 0.5666627585887909, Val Loss: 0.5977081060409546


In [None]:
# prompt: Evaluation: Use metrics like accuracy, precision, recall, and F1-score to evaluate the above  model's performance.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluation loop
model.eval()
all_predictions = []
all_labels = []
with torch.no_grad():
  for batch in val_loader:
    # Batch-local names keep the full-dataset tensors defined above intact
    b_input_ids, b_attention_mask, b_token_type_ids, b_labels = [t.to(device) for t in batch]

    outputs = model(b_input_ids, b_attention_mask, b_token_type_ids)
    predictions = torch.argmax(outputs, dim=1)  # index of the highest logit per tweet
    all_predictions.extend(predictions.cpu().numpy())
    all_labels.extend(b_labels.cpu().numpy())

# Calculate metrics
accuracy = accuracy_score(all_labels, all_predictions)
precision = precision_score(all_labels, all_predictions, average='weighted')  # Use 'weighted' for multi-class
recall = recall_score(all_labels, all_predictions, average='weighted')
f1 = f1_score(all_labels, all_predictions, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")


Accuracy: 0.8571428571428571
Precision: 0.7346938775510203
Recall: 0.8571428571428571
F1-score: 0.7912087912087912


  _warn_prf(average, modifier, msg_start, len(result))



## Sentiment Analysis using BERT on Twitter Data

### 1. Introduction

This project performs sentiment analysis on a Twitter dataset by fine-tuning BERT (Bidirectional Encoder Representations from Transformers). Sentiment analysis is a natural language processing (NLP) task that classifies text into sentiment categories such as positive, negative, or neutral. BERT is a pre-trained language representation model that reads each word in the context of the words on both sides of it, which helps capture the meaning of words as they are used in a tweet.

### 2. Data Preprocessing

The Twitter dataset is stored as `twitter.csv` on Google Drive and loaded using Pandas; it has the columns `id`, `label`, and `tweet`. The following preprocessing steps are performed:

- **Loading Data:** The CSV file is read into a Pandas DataFrame.
- **Tokenization:** The `bert-base-uncased` tokenizer converts each tweet into token IDs, adds the special `[CLS]` and `[SEP]` tokens, and returns attention masks and token type IDs. With `truncation=True`, tweets longer than the model's maximum length are cut off; with `padding=True`, shorter tweets are padded to the longest sequence, so all sequences have the same length.
- **Label Extraction:** The sentiment labels in the `label` column are extracted and converted into a PyTorch tensor.
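
To make the padding step concrete, here is a minimal plain-Python sketch of padding to the longest sequence and building the matching attention mask. The helper `pad_batch` is hypothetical and only illustrates the idea; the real tokenizer also handles truncation, special tokens, and vocabulary lookup. The IDs 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in `bert-base-uncased`; the other IDs are arbitrary placeholders.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token IDs to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two already-tokenized "tweets" of different lengths (placeholder IDs)
batch = [[101, 2000, 2001, 2002, 102], [101, 2003, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 2000, 2001, 2002, 102], [101, 2003, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```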

### 3. Model Architecture

A custom sentiment classifier is built using PyTorch, leveraging the pre-trained BERT model as the base. The architecture consists of:

- **BERT Layer:** The pre-trained `bert-base-uncased` model encodes the input tokens. Its pooled output, a single vector per tweet derived from the `[CLS]` token, is passed on to the next layer.
- **Dropout Layer:** A dropout layer with a probability of 0.1 is applied to the pooled output to reduce overfitting.
- **Linear Layer:** A fully connected linear layer maps the pooled vector (768 dimensions for `bert-base-uncased`) to one logit per sentiment class (3 in this case).
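
The shapes flowing through the classification head can be checked without loading BERT. The sketch below is an illustration under stated assumptions, not the notebook's model: a random tensor stands in for BERT's pooled output, and the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 768  # hidden size of bert-base-uncased
num_classes = 3

# Stand-in for BERT's pooled output: one random vector per tweet (assumption)
pooled_output = torch.randn(4, hidden_size)

# Same head as in the classifier: dropout followed by a linear layer
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_classes))
head.eval()  # disable dropout, as at inference time

with torch.no_grad():
    logits = head(pooled_output)
    probs = torch.softmax(logits, dim=1)

print(logits.shape)  # torch.Size([4, 3])
print(probs.sum(dim=1))  # each row sums to 1, up to floating-point error
```

The head returns raw logits; `nn.CrossEntropyLoss` applies the softmax internally, so no softmax layer is needed inside the model during training.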

### 4. Training and Evaluation

The model is fine-tuned using the AdamW optimizer (learning rate 2e-5) and the cross-entropy loss function. The dataset is randomly split 80/20 into training and validation sets, batched in groups of 16, and trained for 3 epochs. Over these epochs, the training loss fell from 0.737 to 0.567 and the validation loss from 0.700 to 0.598.
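
As a worked illustration of the loss being minimized, the sketch below computes cross-entropy for a single example in plain Python. The function name and the logit values are made up for the example; in the notebook this computation is done by `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    largest = max(logits)
    shifted = [z - largest for z in logits]  # subtract the max for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[target]


uniform = cross_entropy([0.0, 0.0, 0.0], target=0)    # no preference between classes
confident = cross_entropy([4.0, 0.0, 0.0], target=0)  # strongly favors the correct class

print(round(uniform, 4))  # 1.0986, i.e. ln(3)
print(confident < uniform)  # True
```

A three-class model that assigns equal probability to every class has a loss of ln 3 ≈ 1.0986, which gives a reference point for reading per-epoch loss values.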

After training, the model is evaluated on the validation set. The recorded run reports an accuracy of 0.857, a weighted precision of 0.735, a weighted recall of 0.857, and a weighted F1-score of 0.791.
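
To show what the `weighted` average does, the following sketch recomputes the four metrics by hand in plain Python. The function `weighted_scores` and the toy labels are invented for illustration and are not taken from the dataset; undefined ratios are counted as 0, which is an assumption of this sketch.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_cls = tp / predicted if predicted else 0.0  # undefined precision counted as 0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = count / n  # each class is weighted by its share of the true labels
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, precision, recall, f1


# Toy case: six samples of class 0, one of class 1, everything predicted as class 0
y_true = [0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0]
acc, prec, rec, f1 = weighted_scores(y_true, y_pred)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.8571 0.7347 0.8571 0.7912
```

Weighted recall always equals accuracy, which is why those two values coincide. The invented toy case, in which every sample is assigned to the majority class, happens to produce the same four values as the recorded run, so it may be worth inspecting the per-class predictions (for example with a confusion matrix) before relying on the accuracy figure alone.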

### 5. Challenges and Insights

- **Data Imbalance:** If the dataset has an uneven distribution of sentiment categories, techniques like oversampling or undersampling might be needed to address the imbalance.
- **Hyperparameter Tuning:** The learning rate, batch size, and number of epochs can significantly impact the model's performance. Experimentation is required to find optimal values.
- **Contextual Understanding:** BERT's ability to capture contextual information is crucial for sentiment analysis, as the meaning of words can change depending on the surrounding text.
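
For the data-imbalance point above, random oversampling can be sketched in a few lines of plain Python. The helper `random_oversample` and the sample data are hypothetical; this is one simple way to balance classes, not a method the notebook applies.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Placeholder data: four examples of class 0, one of class 1
texts = ["t1", "t2", "t3", "t4", "t5"]
labels = [0, 0, 0, 0, 1]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))  # both classes now have 4 examples
```

Oversampling should be applied to the training split only; duplicating examples before the train/validation split would let copies of the same tweet appear on both sides and inflate the validation scores.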

### 6. Conclusion

This project demonstrates the use of BERT for sentiment analysis on Twitter data. The model's performance is evaluated using accuracy, precision, recall, and F1-score on a held-out validation set, providing a quantitative assessment of its ability to classify sentiment. Further improvements can be explored by addressing data imbalance and tuning hyperparameters such as the learning rate, batch size, and number of epochs.
