# Comprehensive Report: Using Transformers for Sentiment Analysis

## Introduction

Sentiment analysis is a core natural language processing (NLP) task that classifies text into sentiment categories such as positive, negative, and neutral. This report details the implementation of transformer-based models for sentiment analysis, focusing on fine-tuning pre-trained models from the Hugging Face Transformers library. The project used a dataset of comments on the U.S. election, augmented with additional neutral examples.

### Primary Goals

1. To preprocess and prepare the dataset for compatibility with transformer-based models.
2. To fine-tune multiple pre-trained models for multi-class sentiment analysis.
3. To evaluate the models’ performance and compare their results.
4. To address challenges like class imbalance and computational constraints.

### Models Explored

- **DistilBERT**
- **RoBERTa** (base and large)
- **DeBERTa-v3-small**
- **DeBERTa-large**

---

## Dataset Preparation

### Dataset Overview

The dataset used in this project consisted of user-generated comments on topics related to the U.S. election. Each comment was labeled with one of the following sentiment categories:

- **Negative (-1)**: Sentiment that expressed dissatisfaction or disapproval.
- **Neutral (0)**: Sentiment that was objective or without clear emotional bias.
- **Positive (1)**: Sentiment that expressed approval or optimism.

### Steps in Data Preparation

1. **Loading the Dataset**  
   - The dataset was loaded as a pandas DataFrame containing columns for comments and labels.  
   - Comments marked as irrelevant (label `99`) were removed so that only the three sentiment classes remained.

2. **Text Cleaning**  
   - A custom cleaning pipeline was implemented to preprocess the comments:
     - Removing special characters (e.g., `$#,:<>-()`), numbers, and URLs.
     - Tokenizing text into words while eliminating stopwords.
     - Handling anomalies like redundant whitespace and converting text to lowercase.  
   - The cleaned comments were stored in a new column, `cleaned`.

3. **Neutral Data Augmentation**  
   - Neutral comments were synthetically generated using GPT-based language models to balance the dataset.  
   - The generated comments included examples of neutral statements about tech companies like Apple and Samsung, ensuring relevance to the task.

4. **Dataset Splitting**  
   - The dataset was divided into training (90%), validation (5%), and test (5%) sets.  
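   - The split was intended to preserve the proportion of sentiment labels across splits. Note that the `train_test_split` calls in the notebook fix `random_state=42` but do not pass `stratify`, so label proportions are matched only approximately.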

5. **Label Mapping**  
   - Sentiment labels were remapped for compatibility with transformer models:
     - Negative (-1) → 0
     - Neutral (0) → 1
     - Positive (1) → 2
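
The cleaning steps listed in step 2 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's exact pipeline: the function name `clean_comment`, the regular expressions, and the tiny `STOPWORDS` set are hypothetical (the project loads its stopword list from a file).

```python
import re

# Hypothetical stopword set for illustration; the project reads its list from a file.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Drops digits, punctuation, and special characters (non-ASCII letters are dropped too).
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_comment(text: str) -> str:
    """Lowercase, strip URLs/digits/special characters, and drop stopwords."""
    text = str(text).lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # split() with no arguments also collapses redundant whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_comment("The vote is on Nov 5!!  See https://example.com #Election2024"))
```

Removing URLs before the character filter matters here: otherwise the filter would break a URL into word fragments that survive as tokens.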

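Steps 4 and 5 (splitting and label mapping) can be sketched without third-party libraries. The helper names `stratified_split` and `remap`, the toy data, and the rounding rule are assumptions made for this example. With scikit-learn, the same effect is typically obtained by passing the label column as `stratify=` to `train_test_split`.

```python
import random
from collections import Counter, defaultdict

LABEL_MAP = {-1: 0, 0: 1, 1: 2}  # negative, neutral, positive -> 0, 1, 2


def stratified_split(rows, fractions=(0.9, 0.05, 0.05), seed=42):
    """Split (text, label) rows into train/val/test, one label group at a time."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


def remap(rows):
    """Replace the original labels with model-ready class ids."""
    return [(text, LABEL_MAP[label]) for text, label in rows]


# Toy data: 200 negative, 100 neutral, 100 positive comments.
rows = [(f"comment {i}", label)
        for i, label in enumerate([-1] * 200 + [0] * 100 + [1] * 100)]
train, val, test = (remap(part) for part in stratified_split(rows))
print(len(train), len(val), len(test))
print(Counter(label for _, label in val))
```
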
---

## Models Used

### 1. DistilBERT

- **Model Description**:  
  DistilBERT is a distilled version of BERT, designed to be lightweight: its authors report that it retains about 97% of BERT’s language-understanding performance while being roughly 40% smaller and 60% faster at inference.

- **Implementation Details**:  
  - Model: `distilbert-base-uncased`  
  - Hyperparameters:  
    - Learning rate: 2e-5  
    - Batch size: 16  
    - Epochs: 3  

- **Results**:  
  - Validation Accuracy: ~87.1%  
  - Test Accuracy: ~91.6%  
  - F1-Score: ~91.4%

---

### 2. RoBERTa

#### RoBERTa-base

- **Implementation Details**:  
  - Model: `roberta-base`  
  - Learning rate: 2e-5  
  - Batch size: 16  
  - Epochs: 3  

- **Results**:  
  - Validation Accuracy: ~88.8%  
  - Test Accuracy: ~90.4%  
  - F1-Score: ~90.5%

#### RoBERTa-large

- **Implementation Details**:  
  - Model: `roberta-large`  
  - Learning rate: 1e-5  
  - Batch size: 16  
  - Epochs: 5  

- **Results**:  
  - Validation Accuracy: ~98.2%  
  - Test Accuracy: ~96.5%  
  - F1-Score: ~96.6%

---

### 3. DeBERTa

#### DeBERTa-v3-small

- **Implementation Details**:  
  - Model: `microsoft/deberta-v3-small`  
  - Learning rate: 2e-5  
  - Batch size: 16  
  - Epochs: 8  

- **Results**:  
  - Validation Accuracy: ~88.8%  
  - Test Accuracy: ~91.0%  
  - F1-Score: ~91.0%

#### DeBERTa-large

- **Implementation Details**:  
  - Model: `microsoft/deberta-large`  
  - Learning rate: 1e-5  
  - Batch size: 16 (with gradient accumulation)  
  - Epochs: 5  

- **Results**:  
  - Validation Accuracy: ~82.0%  
  - Test Accuracy: ~89.3%  
  - F1-Score: ~87.0%
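
The per-model settings listed in this section can be gathered into one structure so that a single training routine can iterate over them. The `RunConfig` dataclass and the `RUNS` dictionary are hypothetical names introduced for this sketch; the values are copied from the settings above, and `max_length=128` comes from the truncation length described in the fine-tuning process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    checkpoint: str        # Hugging Face model identifier
    learning_rate: float
    batch_size: int
    epochs: int
    max_length: int = 128  # truncation length used for every model


# Values copied from the per-model settings in this section.
RUNS = {
    "DistilBERT": RunConfig("distilbert-base-uncased", 2e-5, 16, 3),
    "RoBERTa-base": RunConfig("roberta-base", 2e-5, 16, 3),
    "RoBERTa-large": RunConfig("roberta-large", 1e-5, 16, 5),
    "DeBERTa-v3-small": RunConfig("microsoft/deberta-v3-small", 2e-5, 16, 8),
    "DeBERTa-large": RunConfig("microsoft/deberta-large", 1e-5, 16, 5),
}

for name, cfg in RUNS.items():
    print(f"{name:18s} {cfg.checkpoint:28s} lr={cfg.learning_rate:g} epochs={cfg.epochs}")
```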

---

## Fine-Tuning Process

1. **Preprocessing and Tokenization**  
   - Each model’s tokenizer was used to tokenize the cleaned comments.  
   - Tokenization involved:
     - Padding to ensure uniform input lengths.
     - Truncation to limit input length to 128 tokens.  

2. **DataLoaders**  
   - Tokenized datasets were loaded into PyTorch DataLoaders for batching and shuffling.

3. **Training**  
   - Optimizer: AdamW (with weight decay).  
   - Learning rate scheduling: Linear decay with warm-up steps.  
   - Loss function: Cross-entropy loss (weighted for class imbalance).  
   - Regularization:  
     - Dropout layers.  
     - Gradient clipping (max norm: 1.0).  
     - Early stopping monitored validation metrics.  

4. **Evaluation Metrics**  
   - Accuracy  
   - Precision, Recall, F1-Score (weighted for multi-class classification).
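
The "linear decay with warm-up" schedule from the training step can be written as a multiplier applied to the base learning rate. The sketch below is a plain-Python illustration of that shape; the function name `linear_warmup_decay` and the 10% warm-up fraction are assumptions, since the report does not state how many warm-up steps were used. The Transformers helper `get_linear_schedule_with_warmup` produces a schedule of the same shape.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the learning-rate multiplier (between 0 and 1) for a given step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warm-up.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 200 * 3      # 200 batches per epoch for 3 epochs
warmup_steps = 60          # assumed: 10% of the total steps

for step in (0, 30, 60, 330, 600):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The learning rate peaks at the end of warm-up and reaches zero at the last step, which is why the total number of optimizer steps has to be known before training starts.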

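To make "weighted for multi-class classification" concrete, the following sketch computes a support-weighted F1 score from scratch: each class's F1 is weighted by how many true examples that class has. The function name `weighted_f1` and the toy labels are made up for illustration; in practice scikit-learn's `precision_recall_fscore_support(..., average="weighted")` serves this purpose.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)


y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```
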
---

## Performance Comparison

| Model            | Validation Accuracy | Test Accuracy | F1-Score (weighted) |
|------------------|---------------------|---------------|---------------------|
| DistilBERT       | 87.1%               | 91.6%         | 91.4%               |
| RoBERTa-base     | 88.8%               | 90.4%         | 90.5%               |
| RoBERTa-large    | 98.2%               | 96.5%         | 96.6%               |
| DeBERTa-v3-small | 88.8%               | 91.0%         | 91.0%               |
| DeBERTa-large    | 82.0%               | 89.3%         | 87.0%               |

---

## Insights and Challenges

1. **Performance**  
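   - RoBERTa-large achieved the highest scores of the five models: 98.2% validation accuracy, 96.5% test accuracy, and a 96.6% F1-score.  
   - Lightweight models such as DistilBERT and DeBERTa-v3-small reached about 91% test accuracy while training faster; DeBERTa-large (89.3% test accuracy) did not improve on them.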

2. **Challenges**  
   - Class imbalance was addressed using weighted loss functions and data augmentation.  
   - Larger models (e.g., RoBERTa-large, DeBERTa-large) required gradient accumulation due to memory constraints.
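
A class-weighted cross-entropy loss, as mentioned for handling class imbalance, can be illustrated in plain Python. The weighting rule below (inverse class frequency) is an assumption, since the report does not say how its weights were chosen, and both function names are hypothetical. The loss is normalised by the sum of the weights, which matches how PyTorch's `CrossEntropyLoss` averages when a `weight` tensor is supplied.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n_samples / (num_classes * class_count).

    Assumes every class appears at least once in `labels`.
    """
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log softmax(logits)[label], normalised by the weight sum."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += weights[label] * (log_z - row[label])
        weight_sum += weights[label]
    return total / weight_sum


train_labels = [0] * 6 + [1] * 2 + [2] * 4   # toy, imbalanced label list
weights = inverse_frequency_weights(train_labels, num_classes=3)
print(weights)

logits = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5]]
print(weighted_cross_entropy(logits, [0, 1], weights))
```

Rare classes receive larger weights, so mistakes on them contribute more to the loss than mistakes on the majority class.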

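Gradient accumulation trades memory for time: gradients from several small micro-batches are summed before a single optimizer step, which reproduces the update of one larger batch. The toy example below uses a one-parameter model in plain Python to show the equivalence; the data, the learning rate, and `accumulation_steps = 2` are assumptions, since the report does not state the accumulation factor it used. In a PyTorch loop the same idea is usually implemented by dividing the loss by the number of accumulation steps before `backward()` and calling `optimizer.step()` only once per group of micro-batches.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr = 0.0, 0.01
accumulation_steps = 2
micro_batches = [data[:2], data[2:]]  # two equally sized micro-batches

# Accumulate scaled gradients, then take a single optimizer step.
accumulated = 0.0
for micro in micro_batches:
    accumulated += grad_mse(w, micro) / accumulation_steps
w_accumulated = w - lr * accumulated

# Reference: one step on the full batch of size 4.
w_full = w - lr * grad_mse(w, data)

print(w_accumulated, w_full)
```
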
---

## Conclusion

This project demonstrated the efficacy of transformer models for sentiment analysis. Pre-trained models, fine-tuned on domain-specific data, achieved high accuracy and generalization. RoBERTa-large stood out as the best-performing model, achieving over 96% test accuracy.  

### Future Work
- Hyperparameter tuning.  
- Exploring ensemble models.  
- Expanding the dataset to improve model generalization further.


In [None]:
!pip install transformers datasets torch




In [1]:
!wget 'https://www.dropbox.com/scl/fi/4b9ku6b5bwqbeowz2pdjh/us_election_comments_refined_labels.csv?rlkey=w8os4fj663nk3j0tf1m0xrxd7&st=1tyha9fz&dl=1'

--2025-01-25 15:53:57--  https://www.dropbox.com/scl/fi/4b9ku6b5bwqbeowz2pdjh/us_election_comments_refined_labels.csv?rlkey=w8os4fj663nk3j0tf1m0xrxd7&st=1tyha9fz&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.2.18, 2620:100:6017:18::a27d:212
Connecting to www.dropbox.com (www.dropbox.com)|162.125.2.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb6732e2ce2e250658c0eaf6d57.dl.dropboxusercontent.com/cd/0/inline/Ci13qCLyhpIrHUe2SzGwM-D3uDgdWgwH5C0La8MXIc6f8CtAhhLE7JAHRfrWN-nPIv3mzTbjZptL4ZniSXsBWSHny4lMjHUsks__szUyazcbMNzDxuU5jMSmttTgmDScaDbIDbU4DPOmonJ4iiI1OwWd/file?dl=1# [following]
--2025-01-25 15:53:57--  https://ucb6732e2ce2e250658c0eaf6d57.dl.dropboxusercontent.com/cd/0/inline/Ci13qCLyhpIrHUe2SzGwM-D3uDgdWgwH5C0La8MXIc6f8CtAhhLE7JAHRfrWN-nPIv3mzTbjZptL4ZniSXsBWSHny4lMjHUsks__szUyazcbMNzDxuU5jMSmttTgmDScaDbIDbU4DPOmonJ4iiI1OwWd/file?dl=1
Resolving ucb6732e2ce2e250658c0eaf6d57.dl.dropboxusercontent.com (ucb6732e2ce2e250658c0eaf6

In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset's filename)
data = pd.read_csv("/content/us_election_comments_labeled_updated.csv")

# Step 1: Remove irrelevant comments
data = data[data["label"] != 99]

# Load stopwords (a set gives fast membership tests in remove_anomalies)
with open("/content/stopwords.txt", "r") as f:
    stop_words = {x for x in f.read().split("\n") if len(x) != 1}

# Step 2: Define cleaning functions
def remove_anomalies(text):
    words = text.split(' ')
    updated_words = [x for x in words if "\\" not in x and x not in stop_words and "http" not in x]
    return ' '.join(updated_words)

def clean(text):
    text = str(text)
    unwanted = "$#,:<(.?)>-'"
    for x in unwanted:
        text = text.replace(x, ' ')
    text = re.sub(r"[0-9]+", "", text)
    text = text.replace("\n", ' ')
    text = remove_anomalies(text)
    return text

# Step 3: Apply cleaning
data["cleaned"] = data["comment"].apply(clean)

# Step 4: Split the data
train_data, temp_data = train_test_split(data, test_size=0.1, random_state=42)  # 90% training, 10% temp
test_data, val_data = train_test_split(temp_data, test_size=0.5, random_state=42)  # Split temp into 5% test and 5% validation

# Step 5: Save the cleaned and split datasets
train_data.to_csv("us_election_train_cleaned.csv", index=False)
test_data.to_csv("us_election_test_cleaned.csv", index=False)
val_data.to_csv("us_election_val_cleaned.csv", index=False)

print("Cleaned and split datasets saved:")
print(" - Training: us_election_train_cleaned.csv")
print(" - Test: us_election_test_cleaned.csv")
print(" - Validation: us_election_val_cleaned.csv")


Cleaned and split datasets saved:
 - Training: us_election_train_cleaned.csv
 - Test: us_election_test_cleaned.csv
 - Validation: us_election_val_cleaned.csv


In [None]:
%env CUDA_LAUNCH_BLOCKING=1


env: CUDA_LAUNCH_BLOCKING=1


In [None]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 3.3 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch's default weight_decay is 0.01
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Debugging Mode
%env CUDA_LAUNCH_BLOCKING=1

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    ignore_mismatched_sizes=True
).to(device)

# Step 4: DataLoader
batch_size = 16  # Adjust according to your available memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Step 5: Optimizer (no learning-rate scheduler is used in this cell)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 6: Training Loop
num_epochs = 3  # Adjust number of epochs
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients to avoid instability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


env: CUDA_LAUNCH_BLOCKING=1
CUDA available: True
Device name: Tesla T4


Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [00:38<00:00,  5.14it/s, loss=0.535]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [00:37<00:00,  5.27it/s, loss=0.405]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [00:39<00:00,  5.13it/s, loss=0.393]


Validation Results: {'accuracy': 0.8707865168539326, 'precision': 0.8683828800761124, 'recall': 0.8707865168539326, 'f1': 0.8669857314949703}
Test Results: {'accuracy': 0.9157303370786517, 'precision': 0.9194502462152974, 'recall': 0.9157303370786517, 'f1': 0.9142839484437701}


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch's default weight_decay is 0.01
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Debugging Mode
%env CUDA_LAUNCH_BLOCKING=1

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,
    ignore_mismatched_sizes=True
).to(device)

# Step 4: DataLoader
batch_size = 16  # Adjust according to your available memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Step 5: Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 6: Training Loop
num_epochs = 3  # Adjust number of epochs
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients to avoid instability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


env: CUDA_LAUNCH_BLOCKING=1
CUDA available: True
Device name: Tesla T4


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [01:18<00:00,  2.55it/s, loss=0.507]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [01:19<00:00,  2.51it/s, loss=0.0334]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [01:19<00:00,  2.51it/s, loss=0.32]


Validation Results: {'accuracy': 0.8876404494382022, 'precision': 0.9008908800131223, 'recall': 0.8876404494382022, 'f1': 0.8780586450834743}
Test Results: {'accuracy': 0.9044943820224719, 'precision': 0.9248573212056358, 'recall': 0.9044943820224719, 'f1': 0.9048885286672292}


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import XLNetTokenizer, XLNetForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch's default weight_decay is 0.01
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = XLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased",
    num_labels=3
).to(device)

# Step 4: DataLoader
batch_size = 16  # Adjust based on your cluster memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Step 5: Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 6: Training Loop
num_epochs = 5  # Adjust number of epochs as needed
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients to avoid instability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


CUDA available: True
Device name: Tesla T4


spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]

Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [01:48<00:00,  1.84it/s, loss=1.04]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [01:47<00:00,  1.86it/s, loss=0.267]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [01:47<00:00,  1.86it/s, loss=0.732]


Epoch 4


Epoch 3: 100%|██████████| 200/200 [01:47<00:00,  1.86it/s, loss=0.281]


Epoch 5


Epoch 4: 100%|██████████| 200/200 [01:47<00:00,  1.87it/s, loss=0.00434]


Validation Results: {'accuracy': 0.8820224719101124, 'precision': 0.8791839924503819, 'recall': 0.8820224719101124, 'f1': 0.8786467333503393}
Test Results: {'accuracy': 0.9101123595505618, 'precision': 0.9135799150583301, 'recall': 0.9101123595505618, 'f1': 0.9107299253681067}


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; torch's default weight_decay is 0.01
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v3-small")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DebertaV2ForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-small",
    num_labels=3
).to(device)

# Step 4: DataLoader
batch_size = 16  # Feasible for Google Colab
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Step 5: Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Step 6: Training Loop
num_epochs = 8
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients to avoid instability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


CUDA available: True
Device name: Tesla T4


tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [00:58<00:00,  3.41it/s, loss=0.612]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [00:58<00:00,  3.45it/s, loss=0.695]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [00:57<00:00,  3.46it/s, loss=0.472]


Epoch 4


Epoch 3: 100%|██████████| 200/200 [00:57<00:00,  3.45it/s, loss=0.0303]


Epoch 5


Epoch 4: 100%|██████████| 200/200 [00:57<00:00,  3.46it/s, loss=0.0924]


Epoch 6


Epoch 5: 100%|██████████| 200/200 [00:57<00:00,  3.45it/s, loss=0.0105]


Epoch 7


Epoch 6: 100%|██████████| 200/200 [00:57<00:00,  3.46it/s, loss=0.00459]


Epoch 8


Epoch 7: 100%|██████████| 200/200 [00:58<00:00,  3.45it/s, loss=0.536]


Validation Results: {'accuracy': 0.8876404494382022, 'precision': 0.8873782020469065, 'recall': 0.8876404494382022, 'f1': 0.8833208542378055}
Test Results: {'accuracy': 0.9101123595505618, 'precision': 0.9157428548538925, 'recall': 0.9101123595505618, 'f1': 0.909905721296655}
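
The `evaluate` function reports support-weighted averages (`average="weighted"`), which is why recall equals accuracy in both result lines above: weighting each class's recall by its share of the true labels reduces to the overall fraction of correct predictions. The sketch below is a minimal pure-Python illustration of that averaging, not scikit-learn's implementation. The helper name `weighted_metrics` and the toy labels are hypothetical, and undefined per-class ratios are assumed to count as 0.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F1, in the spirit of average="weighted"."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)          # number of true samples per class
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted if predicted else 0.0      # assumption: undefined -> 0
        r_c = tp / support[c] if support[c] else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[c] / n                         # class share of true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical toy labels using the notebook's mapping:
# 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Because weighted recall carries no information beyond accuracy, comparing models on an imbalanced dataset may be more informative with per-class or macro-averaged scores alongside these numbers.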


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of PyTorch's
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=3
).to(device)

# Step 4: DataLoader
batch_size = 16  # Adjust based on available memory
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Step 5: Optimizer
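# weight_decay is stated explicitly because the default differs between AdamW implementations
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)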

# Step 6: Training Loop
num_epochs = 5
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Clip gradients to avoid instability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


CUDA available: True
Device name: Tesla T4


Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [04:06<00:00,  1.23s/it, loss=0.813]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [04:06<00:00,  1.23s/it, loss=0.293]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [04:05<00:00,  1.23s/it, loss=0.584]


Epoch 4


Epoch 3: 100%|██████████| 200/200 [04:05<00:00,  1.23s/it, loss=0.0557]


Epoch 5


Epoch 4: 100%|██████████| 200/200 [04:06<00:00,  1.23s/it, loss=0.411]


Validation Results: {'accuracy': 0.8764044943820225, 'precision': 0.8844196535699856, 'recall': 0.8764044943820225, 'f1': 0.8676998435783452}
Test Results: {'accuracy': 0.9213483146067416, 'precision': 0.9249958691341704, 'recall': 0.9213483146067416, 'f1': 0.9193866932294829}
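
The RoBERTa-large training loop calls `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before each optimizer step. As a rough illustration of what that call does, the sketch below rescales a set of gradient vectors so that their joint L2 norm does not exceed `max_norm`. It is plain Python on hypothetical toy gradients rather than PyTorch tensors, and the helper name `clip_by_global_norm` and the `eps` value are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradient vectors by one factor so their joint L2 norm is at most max_norm."""
    # The norm is taken over every gradient value of every parameter together.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Gradients are only ever shrunk, never enlarged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two hypothetical "parameters" whose gradients have a joint norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

# Gradients already below the threshold pass through unchanged.
small, _ = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)
print(small)
```

Since every value is multiplied by the same factor, the direction of the update is preserved and only its length is capped, which is what limits the effect of an occasional very large gradient.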


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import DebertaTokenizer, DebertaForSequenceClassification
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of PyTorch's
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Step 1: Validate CUDA environment
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))
else:
    print("Using CPU")

# Step 2: Load and prepare dataset
train_data = pd.read_csv("us_election_train_cleaned.csv")
val_data = pd.read_csv("us_election_val_cleaned.csv")
test_data = pd.read_csv("us_election_test_cleaned.csv")

# Map labels (-1 -> 0, 0 -> 1, 1 -> 2)
def adjust_labels(data):
    data["label"] = data["label"].map({-1: 0, 0: 1, 1: 2})
    return data

train_data = adjust_labels(train_data)
val_data = adjust_labels(val_data)
test_data = adjust_labels(test_data)

# Convert to Hugging Face datasets
def prepare_hf_dataset(data):
    return Dataset.from_pandas(data[["cleaned", "label"]])

train_dataset = prepare_hf_dataset(train_data)
val_dataset = prepare_hf_dataset(val_data)
test_dataset = prepare_hf_dataset(test_data)

# Tokenize datasets
tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-large")

def tokenize_function(example):
    return tokenizer(example["cleaned"], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Prepare for PyTorch
train_dataset = train_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
val_dataset = val_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")
test_dataset = test_dataset.rename_column("label", "labels").remove_columns(["cleaned"]).with_format("torch")

# Step 3: Initialize Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DebertaForSequenceClassification.from_pretrained(
    "microsoft/deberta-large",
    num_labels=3
).to(device)

# Step 4: DataLoader
batch_size = 16  # Same micro-batch size as the RoBERTa-large run
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

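# Step 5: Optimizer (no learning-rate scheduler is used in this run)
# weight_decay is stated explicitly because the default differs between AdamW implementations
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)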

# Step 6: Training Loop
num_epochs = 5
accumulation_steps = 4  # Gradient accumulation: effective batch size = 16 * 4 = 64
model.train()

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    loop = tqdm(train_loader, leave=True)
    optimizer.zero_grad()
    for step, batch in enumerate(loop):
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps

        # Backward pass
        loss.backward()

        # Update weights
        if (step + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()

        # Update progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item() * accumulation_steps)

# Step 7: Evaluation Function
model.eval()

def evaluate(loader):
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_labels.extend(batch["labels"].cpu().numpy())
            all_preds.extend(preds.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="weighted")
    accuracy = accuracy_score(all_labels, all_preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Step 8: Validate and Test
val_results = evaluate(val_loader)
test_results = evaluate(test_loader)

print("Validation Results:", val_results)
print("Test Results:", test_results)


CUDA available: True
Device name: Tesla T4


Map:   0%|          | 0/3197 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Map:   0%|          | 0/178 [00:00<?, ? examples/s]

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-large and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1


Epoch 0: 100%|██████████| 200/200 [04:41<00:00,  1.41s/it, loss=0.399]


Epoch 2


Epoch 1: 100%|██████████| 200/200 [04:41<00:00,  1.41s/it, loss=0.999]


Epoch 3


Epoch 2: 100%|██████████| 200/200 [04:41<00:00,  1.41s/it, loss=0.684]


Epoch 4


Epoch 3: 100%|██████████| 200/200 [04:41<00:00,  1.41s/it, loss=0.633]


Epoch 5


Epoch 4: 100%|██████████| 200/200 [04:41<00:00,  1.41s/it, loss=0.304]


Validation Results: {'accuracy': 0.8202247191011236, 'precision': 0.8359130168921503, 'recall': 0.8202247191011236, 'f1': 0.7819787827744179}
Test Results: {'accuracy': 0.8932584269662921, 'precision': 0.858509215791904, 'recall': 0.8932584269662921, 'f1': 0.8702763513229883}
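
The DeBERTa-large run divides each loss by `accumulation_steps = 4` and steps the optimizer only on every fourth batch, so each weight update averages over roughly 16 × 4 = 64 samples and the 200 batches per epoch produce 50 updates. The sketch below checks the underlying arithmetic on a hypothetical one-parameter model `y = w * x`: summing micro-batch gradients scaled by `1 / k` gives the same gradient as one large batch, provided the micro-batches have equal size. All names and data here are assumptions for illustration.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w, over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Sum of micro-batch gradients, each scaled by 1 / number of micro-batches."""
    k = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Plays the same role as `loss = outputs.loss / accumulation_steps`.
        total += grad_mse(w, micro_batch) / k
    return total

def updates_per_epoch(num_batches, accumulation_steps):
    """Optimizer steps for a loop that steps when (step + 1) % accumulation_steps == 0."""
    return num_batches // accumulation_steps

# Eight hypothetical samples split into four micro-batches of two.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]

g_acc = accumulated_grad(0.5, micro_batches)
g_full = grad_mse(0.5, data)
print(g_acc, g_full)

print(updates_per_epoch(200, 4), 16 * 4)  # 50 updates per epoch, effective batch size 64
```

With 200 batches nothing is left over. If the batch count were not a multiple of `accumulation_steps`, the trailing gradients would never reach an optimizer step, because the loop calls `optimizer.zero_grad()` at the start of each epoch.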


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
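
The truncated `_warn_prf(...)` line above comes from scikit-learn's metric code, which typically emits it when precision or F-score is undefined for some label, for example when a class is never predicted. The output does not record which class, if any, was affected in this run. The sketch below shows one way to check: build a confusion matrix and list the classes whose predicted column is empty. It is plain Python on hypothetical toy predictions, and the helper names are assumptions.

```python
def confusion_matrix(y_true, y_pred, num_classes=3):
    """Count matrix with true classes as rows and predicted classes as columns."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def never_predicted(matrix):
    """Classes whose column sums to zero, so their precision would be 0 / 0."""
    n = len(matrix)
    return [c for c in range(n) if sum(matrix[r][c] for r in range(n)) == 0]

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (None if the class is absent)."""
    return [row[c] / sum(row) if sum(row) else None for c, row in enumerate(matrix)]

# Hypothetical toy predictions in which class 1 (neutral) is never predicted.
# Label mapping as in the notebook: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 2, 0, 2, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
for row in cm:
    print(row)
print("never predicted:", never_predicted(cm))
print("per-class recall:", per_class_recall(cm))
```

Running such a check on `all_labels` and `all_preds` inside `evaluate` would show whether the lower weighted F1 of this run comes from one class being ignored. scikit-learn's `zero_division` argument controls how the undefined ratio is reported.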
