<a href="https://colab.research.google.com/github/prahladpunia/AI/blob/main/SENTIMENT_ANALYSIS_PHARMA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis for a Pharmaceutical Company
The company has collected comments about various products by scraping several online sources. The data is available to us as train and test files.
# Objective   
Build a sentiment analysis engine that classifies the sentiment expressed toward a specified drug into one of 3 categories:
- positive    
- negative
- neutral

Every sample contains text that mentions a drug name, together with the comment pertaining to it. A single comment may mention multiple products.
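
As a small illustration of the three-way output, the sketch below maps integer class ids to the category names listed above. The id order (0 = positive, 1 = negative, 2 = neutral) is an assumption made for illustration only; check it against the label encoding in `train.csv` before relying on it. `ID2LABEL` and `decode_predictions` are hypothetical names, not part of the notebook.

```python
# Assumed label encoding; the actual ids in train.csv may differ.
ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}

def decode_predictions(pred_ids):
    """Convert a sequence of integer class ids into sentiment names."""
    return [ID2LABEL[int(i)] for i in pred_ids]

print(decode_predictions([0, 2, 1]))
```

A mapping like this is mainly useful when inspecting predictions by hand, since the model itself only produces integer class ids.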

# Model  
We use a pre-trained transformer model, `bert-base-uncased`.

BERT-base-uncased is a pre-trained transformer-based language model that captures contextual representations of text using a multi-layer bidirectional architecture. It was trained on a large corpus of lowercased (uncased) English text.
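
To make "uncased" concrete: the uncased BERT tokenizer lowercases its input and strips accent marks before splitting it into WordPiece tokens. The sketch below approximates only that normalization step with the standard library. It is not the tokenizer's actual implementation, and `uncased_normalize` and the sample sentence are hypothetical.

```python
import unicodedata

def uncased_normalize(text):
    """Approximate the lowercasing and accent stripping applied by uncased BERT."""
    lowered = text.lower()
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), keeping base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Naïve patients liked DrugX"))
```

One consequence is that `DrugX` and `drugx` reach the model as the same tokens, so the capitalization of a drug name carries no signal.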

In [3]:
!pip install transformers



In [4]:
import pandas as pd
import torch # for tensor analysis and deep learning
import torch.nn as nn # to build and train neural networks
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: Load the labeled training data
train_data = pd.read_csv("train.csv")

# Step 2: Preprocess the data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def preprocess_text(text):
    encoded_text = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    return encoded_text

# Encode the text and labels
encoded_data = [preprocess_text(text) for text in train_data['text']]
X = torch.cat([data['input_ids'] for data in encoded_data], dim=0)
attention_mask = torch.cat([data['attention_mask'] for data in encoded_data], dim=0)
y = torch.tensor(train_data['sentiment'].values)

# Step 3: Split the data into training and validation sets
X_train, X_val, y_train, y_val, attention_mask_train, attention_mask_val = train_test_split(
    X, y, attention_mask, test_size=0.2, random_state=42
)

# Step 4: Prepare DataLoader for training and validation sets
batch_size = 16
train_dataset = TensorDataset(X_train, attention_mask_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataset = TensorDataset(X_val, attention_mask_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Step 5: Initialize BERT model and optimizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 6: Training loop
num_epochs = 8
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        input_ids, attention_mask, labels = tuple(t.to(device) for t in batch)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()

    train_loss /= len(train_loader)

    # Step 7: Evaluation on validation set
    model.eval()
    val_preds = []
    val_labels = []
    for batch in val_loader:
        input_ids, attention_mask, labels = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=1)
        val_preds.extend(preds.cpu().tolist())
        val_labels.extend(labels.cpu().tolist())

    val_accuracy = accuracy_score(val_labels, val_preds)
    print(f"Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss:.4f}, Val Accuracy: {val_accuracy:.4f}")

# Step 8: Load the test data
test_data = pd.read_csv("test.csv")
encoded_test_data = [preprocess_text(text) for text in test_data['text']]
X_test = torch.cat([data['input_ids'] for data in encoded_test_data], dim=0)
attention_mask_test = torch.cat([data['attention_mask'] for data in encoded_test_data], dim=0)
X_test = X_test.to(device)
attention_mask_test = attention_mask_test.to(device)

# Step 9: Generate predictions for the test data
test_preds = []
model.eval()
with torch.no_grad():
    for i in range(0, len(X_test), batch_size):
        batch_input_ids = X_test[i:i+batch_size]
        batch_attention_mask = attention_mask_test[i:i+batch_size]
        batch_outputs = model(batch_input_ids, attention_mask=batch_attention_mask)
        logits = batch_outputs.logits
        preds = torch.argmax(logits, dim=1)
        test_preds.extend(preds.cpu().tolist())

# Step 10: Create a submission file
submission = pd.DataFrame({'id': test_data['unique_hash'], 'sentiment': test_preds})
submission.to_csv('submission_final.csv', index=False)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Epoch 1/8: Train Loss: 0.7749, Val Accuracy: 0.7292
Epoch 2/8: Train Loss: 0.7080, Val Accuracy: 0.7434
Epoch 3/8: Train Loss: 0.5628, Val Accuracy: 0.7358
Epoch 4/8: Train Loss: 0.3562, Val Accuracy: 0.6866
Epoch 5/8: Train Loss: 0.1604, Val Accuracy: 0.6837
Epoch 6/8: Train Loss: 0.0881, Val Accuracy: 0.7036
Epoch 7/8: Train Loss: 0.0643, Val Accuracy: 0.6894
Epoch 8/8: Train Loss: 0.0401, Val Accuracy: 0.6705
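
In the log above, validation accuracy peaks at epoch 2 (0.7434) and then drifts down while the training loss keeps falling, which suggests the model overfits in later epochs. One common response is early stopping: keep the best epoch and stop once validation accuracy has not improved for a few epochs. The sketch below replays the reported accuracies through such a rule. The function name and the patience value of 2 are assumptions chosen for illustration, not part of the notebook.

```python
# Validation accuracies copied from the training log above (epochs 1 to 8).
val_accuracies = [0.7292, 0.7434, 0.7358, 0.6866, 0.6837, 0.7036, 0.6894, 0.6705]

def best_epoch_with_patience(scores, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Scans the scores in order and stops once `patience` consecutive
    epochs fail to beat the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    stop_epoch = len(scores)
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch

best_epoch, stop_epoch = best_epoch_with_patience(val_accuracies, patience=2)
print(f"Best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

In a real training loop the same rule would also need to save the model weights at the best epoch, so that test predictions come from that checkpoint rather than from the final epoch.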
