Project Setup

I downloaded the Amazon Reviews dataset (568,454 reviews) from Kaggle and saved it locally.

Training a model on this much data is slow on a local machine, so I am using Google Colab for faster computation.

I uploaded the dataset CSV file to Google Drive and then read it into the Colab notebook for processing and training.

In [54]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Path to CSV in your Drive
file_path = '/content/drive/MyDrive/Colab Notebooks/Reviews.csv'

# Read CSV
df = pd.read_csv(file_path)

df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [55]:
df.shape

(568454, 10)

In [56]:
df = df[['Text', 'Score']].copy()  # copy so later column assignments do not raise SettingWithCopyWarning
df.head()

,Text,Score
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5


In [57]:
df.Score.value_counts()

Score,count
5,363122
4,80655
1,52268
3,42640
2,29769


In [58]:
# Map star ratings to classes: 0 = positive (4-5), 1 = neutral (3), 2 = negative (1-2)
df['Sentiment'] = df['Score'].apply(lambda x: 2 if x in [1, 2] else (1 if x == 3 else 0))
df.head()

,Text,Score,Sentiment
0,I have bought several of the Vitality canned d...,5,0
1,Product arrived labeled as Jumbo Salted Peanut...,1,2
2,This is a confection that has been around a fe...,4,0
3,If you are looking for the secret ingredient i...,2,2
4,Great taffy at a great price. There was a wid...,5,0


In [59]:
df=df[['Text', 'Sentiment']]
df.head()

,Text,Sentiment
0,I have bought several of the Vitality canned d...,0
1,Product arrived labeled as Jumbo Salted Peanut...,2
2,This is a confection that has been around a fe...,0
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,0


In [60]:
df.Sentiment.value_counts()

Sentiment,count
0,443777
2,82037
1,42640
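
The mapping collapses the five star ratings into three classes: 0 = positive (4 or 5 stars), 1 = neutral (3 stars), 2 = negative (1 or 2 stars). As a quick consistency check, the sketch below applies the same rule to the per-score counts printed earlier and recovers the three class totals. The names `score_to_sentiment` and `score_counts` are illustrative and not part of the notebook.

```python
from collections import Counter


def score_to_sentiment(score: int) -> int:
    """Same rule as the notebook's lambda, written out for star ratings 1-5."""
    if score in (1, 2):
        return 2  # negative
    if score == 3:
        return 1  # neutral
    return 0      # positive (4 or 5 stars)


# Per-score counts copied from the Score value_counts() output.
score_counts = {5: 363122, 4: 80655, 1: 52268, 3: 42640, 2: 29769}

# Add up the review counts that fall into each sentiment class.
sentiment_counts = Counter()
for score, count in score_counts.items():
    sentiment_counts[score_to_sentiment(score)] += count

print(dict(sentiment_counts))
```

The totals agree with the Sentiment counts shown above, which confirms that no review was dropped or mislabelled by the mapping.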


Handling Class Imbalance

The dataset is heavily imbalanced: 443,777 positive reviews versus 82,037 negative and 42,640 neutral. To balance the classes and keep training manageable on the free tier of Google Colab, I undersample every class to the size of the smallest one, which leaves roughly 128k reviews in total.

Undersampling is generally not the best approach because it discards data, but it lets us train the model quickly and demonstrate the workflow in a limited-resource environment.
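
A minimal sketch of the same idea with a fixed seed, so the subset is identical on every run. It assumes pandas 1.1 or newer (for `groupby(...).sample`); the helper name `undersample` and the toy DataFrame are hypothetical and only illustrate the technique, they are not the notebook's code.

```python
import pandas as pd


def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n_min = int(df[label_col].value_counts().min())
    # Draw n_min rows from each class; random_state makes the draw repeatable.
    return df.groupby(label_col).sample(n=n_min, random_state=seed)


# Toy data: six positive, one neutral, two negative rows.
toy = pd.DataFrame({
    "Text": list("abcdefghi"),
    "Sentiment": [0] * 6 + [1] + [2] * 2,
})

balanced = undersample(toy, "Sentiment")
print(balanced["Sentiment"].value_counts().to_dict())
```

Fixing the seed matters here because the sampled rows determine the train, validation, and test splits further down; without it, every rerun trains and evaluates on a different subset.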

In [61]:
min_count=min(df.Sentiment.value_counts())
min_count

42640

In [62]:
df_positive=df[df['Sentiment']==0].sample(min_count)
df_neutral=df[df['Sentiment']==1].sample(min_count)
df_negative=df[df['Sentiment']==2].sample(min_count)
df_balanced=pd.concat([df_positive,df_neutral,df_negative])
df_balanced.Sentiment.value_counts()

Sentiment,count
0,42640
1,42640
2,42640


In [63]:
df_balanced.shape

(127920, 2)

In [64]:
df_balanced.head()

,Text,Sentiment
510442,My older dog has always been very picky when i...,0
381266,Worked great. A Good must have for breastfeedi...,0
482661,I love this stuff! Its real easy to prepare in...,0
226438,The reason I got this from Amazon is because w...,0
376694,I almost never eat milk chocolate because I fi...,0


In [65]:
df_balanced.isnull().sum()

Text,0
Sentiment,0


Preprocessing Text

DistilBERT needs only minimal preprocessing:

Remove HTML tags and URLs

Lowercase text (only relevant for distilbert-base-uncased, whose tokenizer also lowercases on its own; skip this step for distilbert-base-cased)

Optional: remove very noisy symbols

In [66]:
import re

def preprocessing(text):
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove HTML tags such as <br />, leaving a space so adjacent words do not fuse
    text = re.sub(r"<.*?>", " ", text)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase (redundant for uncased DistilBERT, whose tokenizer lowercases anyway)
    text = text.lower()
    return text
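
A quick way to sanity-check a cleaning function of this kind is to run it on a hand-written review containing a URL and markup. The sketch below uses a hypothetical stand-alone helper, `clean_review`, which substitutes a space for each tag so that words separated by `<br />` do not fuse; the sample string is made up.

```python
import re


def clean_review(text: str) -> str:
    """Strip URLs and HTML tags, normalise whitespace, lowercase."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"\s+", " ", text).strip()            # extra whitespace
    return text.lower()


sample = "Great taffy!<br />See http://example.com/x for more.  <b>Recommended</b>"
cleaned = clean_review(sample)
print(cleaned)
```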

In [67]:
!pip install swifter

import swifter



In [68]:
df_balanced['preprocessed_text'] = df_balanced['Text'].swifter.apply(preprocessing)


In [69]:
df_balanced.head()

,Text,Sentiment,preprocessed_text
510442,My older dog has always been very picky when i...,0,my older dog has always been very picky when i...
381266,Worked great. A Good must have for breastfeedi...,0,worked great. a good must have for breastfeedi...
482661,I love this stuff! Its real easy to prepare in...,0,i love this stuff! its real easy to prepare in...
226438,The reason I got this from Amazon is because w...,0,the reason i got this from amazon is because w...
376694,I almost never eat milk chocolate because I fi...,0,i almost never eat milk chocolate because i fi...


In [70]:
df=df_balanced[['preprocessed_text', 'Sentiment']]
df.head()

,preprocessed_text,Sentiment
510442,my older dog has always been very picky when i...,0
381266,worked great. a good must have for breastfeedi...,0
482661,i love this stuff! its real easy to prepare in...,0
226438,the reason i got this from amazon is because w...,0
376694,i almost never eat milk chocolate because i fi...,0


In [71]:
df.Sentiment.value_counts()

Sentiment,count
0,42640
1,42640
2,42640


In [72]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the data; it is split into validation and test sets in the next cell
X_train, X_temp, y_train, y_temp = train_test_split(
    df['preprocessed_text'], df['Sentiment'],
    test_size=0.2,
    random_state=42,
    stratify=df['Sentiment']
)

In [73]:
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,  # 50% of temp = 10% of total
    random_state=42,
    stratify=y_temp
)
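
The two calls together produce an 80/10/10 split. The arithmetic below works out the resulting set sizes from the 127,920 balanced reviews; it is plain integer arithmetic for illustration, not a call into scikit-learn, and the variable names are made up.

```python
n_total = 127_920                 # rows in the balanced dataset

n_temp = n_total * 20 // 100      # first split: 20% held out
n_train = n_total - n_temp        # remaining 80% for training

n_test = n_temp // 2              # second split: half of the held-out part
n_val = n_temp - n_test           # the other half for validation

per_class_test = n_test // 3      # stratified over three equally sized classes

print(n_train, n_val, n_test, per_class_test)
```

The test-set size of 12,792 (4,264 per class) is the same figure that appears as the support in the classification report at the end of the notebook.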

In [74]:
train_df = pd.DataFrame({'preprocessed_text': X_train, 'Sentiment': y_train})
val_df   = pd.DataFrame({'preprocessed_text': X_val, 'Sentiment': y_val})
test_df  = pd.DataFrame({'preprocessed_text': X_test, 'Sentiment': y_test})

In [75]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased',
    cache_dir='/content/hf_cache',   # store locally
)

def batch_tokenize(texts, batch_size=5000):
    """Tokenize in chunks to limit peak memory; returns plain Python lists."""
    all_encodings = {"input_ids": [], "attention_mask": []}
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # Pad every chunk to the same fixed length so the DataLoader can stack rows
        # (padding=True would only pad to the longest row within each chunk).
        enc = tokenizer(batch, truncation=True, padding='max_length', max_length=128)
        all_encodings["input_ids"].extend(enc["input_ids"])
        all_encodings["attention_mask"].extend(enc["attention_mask"])
    return all_encodings

train_encodings = batch_tokenize(list(train_df['preprocessed_text']))
val_encodings   = batch_tokenize(list(val_df['preprocessed_text']))
test_encodings  = batch_tokenize(list(test_df['preprocessed_text']))
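
Because the texts are tokenized chunk by chunk, the padding strategy decides whether all rows end up with the same length. The toy sketch below mimics truncation and padding on lists of made-up token ids, without the real tokenizer (it ignores special tokens and subword splitting). It shows that padding to the longest sequence in each call, which is what `padding=True` does, can give different widths per chunk, while padding to `max_length` always gives the same width. The function `pad_batch` is hypothetical.

```python
def pad_batch(seqs, max_length, pad_id=0, to_max_length=False):
    """Truncate to max_length, then pad to a common width.

    to_max_length=False pads to the longest sequence in this call;
    to_max_length=True pads every sequence to max_length.
    """
    truncated = [s[:max_length] for s in seqs]
    width = max_length if to_max_length else max(len(s) for s in truncated)
    input_ids = [s + [pad_id] * (width - len(s)) for s in truncated]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in truncated]
    return input_ids, attention_mask


# Two separately processed chunks whose longest sequences differ.
chunk_a = [[101, 7, 102], [101, 102]]
chunk_b = [[101, 7, 8, 9, 102], [101, 102]]

dyn_a, _ = pad_batch(chunk_a, max_length=8)
dyn_b, _ = pad_batch(chunk_b, max_length=8)
fix_a, mask_a = pad_batch(chunk_a, max_length=8, to_max_length=True)
fix_b, _ = pad_batch(chunk_b, max_length=8, to_max_length=True)

print(len(dyn_a[0]), len(dyn_b[0]))  # widths with per-call padding
print(len(fix_a[0]), len(fix_b[0]))  # widths with fixed-length padding
```

Rows of unequal width cannot be stacked into one tensor by the default DataLoader collation, so a fixed width (or a padding collate function) is the safer choice when encodings from several chunks are concatenated.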



In [76]:
import torch
from transformers import DistilBertForSequenceClassification

num_labels = 3
model_name = "distilbert-base-uncased"

# Load DistilBERT for sequence classification
model = DistilBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

# Print summary
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [77]:
from torch.utils.data import Dataset, DataLoader

# Custom PyTorch Dataset
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Create datasets
train_dataset = SentimentDataset(train_encodings, train_df['Sentiment'].values)
val_dataset   = SentimentDataset(val_encodings, val_df['Sentiment'].values)
test_dataset  = SentimentDataset(test_encodings, test_df['Sentiment'].values)

# Create DataLoaders
batch_size = 16  # increase to 32 if GPU allows
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=batch_size)
test_loader  = DataLoader(test_dataset, batch_size=batch_size)

In [78]:
import torch
from torch import nn
from torch.optim import AdamW

# Device (GPU if available)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Loss function
loss_fn = nn.CrossEntropyLoss()  # takes raw logits and integer class labels, like Keras SparseCategoricalCrossentropy(from_logits=True)

# Accuracy function
def compute_accuracy(preds, labels):
    return (preds.argmax(dim=1) == labels).float().mean()


In [79]:
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs.logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Training Loss: {avg_loss:.4f}")

Epoch 1, Training Loss: 0.5601
Epoch 2, Training Loss: 0.3969
Epoch 3, Training Loss: 0.2799
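
The validation split (`val_loader`) is created earlier but is not used while training. One way to use it is to measure loss and accuracy on it after every epoch, which shows whether the falling training loss still translates into better generalisation. The helper below is a sketch under the assumption that batches are dicts with `input_ids`, `attention_mask`, and `labels`, and that the model returns an object with a `.logits` attribute, as in the cells above; the name `evaluate` is hypothetical.

```python
import torch
from torch import nn


@torch.no_grad()
def evaluate(model, loader, loss_fn, device):
    """Return (mean loss, accuracy) over all examples yielded by `loader`."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        logits = model(input_ids, attention_mask=attention_mask).logits
        # Weight each batch loss by its size so the mean is per example.
        total_loss += loss_fn(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen
```

Calling `val_loss, val_acc = evaluate(model, val_loader, loss_fn, device)` at the end of each epoch would be enough; the training loop already switches back to `model.train()` at the start of the next epoch.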


In [80]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

model.eval()  # set model to evaluation mode

all_preds = []
all_labels = []
total_loss = 0.0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        # Loss
        loss = loss_fn(logits, labels)
        total_loss += loss.item() * input_ids.size(0)

        # Predictions
        preds = torch.argmax(logits, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Compute average loss and accuracy
avg_loss = total_loss / len(test_dataset)
accuracy = np.mean(np.array(all_preds) == np.array(all_labels))

print(f"Test Loss: {avg_loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Detailed metrics
print("\nClassification Report:")
print(classification_report(all_labels, all_preds, digits=4))

print("\nConfusion Matrix:")
print(confusion_matrix(all_labels, all_preds))

Test Loss: 0.4352
Test Accuracy: 0.8326

Classification Report:
              precision    recall  f1-score   support

           0     0.9068    0.8766    0.8915      4264
           1     0.7733    0.7617    0.7675      4264
           2     0.8197    0.8593    0.8390      4264

    accuracy                         0.8326     12792
   macro avg     0.8333    0.8326    0.8327     12792
weighted avg     0.8333    0.8326    0.8327     12792


Confusion Matrix:
[[3738  418  108]
 [ 318 3248  698]
 [  66  534 3664]]
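
The figures in the classification report can be recomputed directly from the confusion matrix (rows are true classes, columns are predicted classes). The sketch below does this in plain Python using the matrix printed above; the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
confusion = [
    [3738, 418, 108],
    [318, 3248, 698],
    [66, 534, 3664],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

precision, recall, f1 = [], [], []
for k in range(n_classes):
    true_pos = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(confusion[k])                                  # row sum
    p = true_pos / predicted_k
    r = true_pos / actual_k
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

print(f"accuracy: {accuracy:.4f}")
for k in range(n_classes):
    print(f"class {k}: precision={precision[k]:.4f} recall={recall[k]:.4f} f1={f1[k]:.4f}")
```

Read this way, the matrix shows that most errors involve the neutral class: 698 neutral reviews are predicted as negative and 534 negative reviews as neutral, while confusion between positive and negative is comparatively rare (108 and 66 cases).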


In [81]:
save_path = "/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model"

# Save model weights and config
model.save_pretrained(save_path)

# Save tokenizer files alongside the model
tokenizer.save_pretrained(save_path)


('/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model/vocab.txt',
 '/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model/added_tokens.json',
 '/content/drive/MyDrive/Colab Notebooks/distilbert_sentiment_model/tokenizer.json')