# 1. Define Your Domain and Task

The task objective: Sentiment analysis of Amazon reviews on a 1 to 5 star scale

The expected output: 1 to 5 star labels

Dataset: "yelp_review_full"

# 2. Dataset Preparation

Data Collection: Gather a dataset with at least 200 entries. Ensure the data is specific to your domain.

Cleaning:
- Remove duplicates, irrelevant entries, or junk text.
- Standardize text formatting (e.g., lowercase all text, normalize dates).
- Fix typos and grammatical inconsistencies.
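As a rough illustration of the formatting step, the sketch below lowercases a string and drops every character that is not a word character, whitespace, or one of `. , ! ?`. The pattern is the one used in this notebook's `clean_text` cell; the helper name `normalize_review` and the sample string are made up for the example.

```python
import re

def normalize_review(text: str) -> str:
    """Lowercase text and drop characters other than word chars, whitespace, and . , ! ?"""
    lowered = text.lower()
    return re.sub(r"[^\w\s.,!?]", "", lowered)

# Hypothetical sample review, not taken from the dataset.
sample = "It's GREAT (5/5)!"
cleaned = normalize_review(sample)
print(cleaned)  # its great 55!
```

Note that apostrophes and slashes are removed as well, so "it's" becomes "its" and "5/5" becomes "55". Whether that loss matters depends on the task.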

In [26]:
!pip install datasets
import re
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset, DatasetDict, Dataset
from transformers import Trainer, TrainingArguments
from sklearn.metrics import classification_report
import pandas as pd
from collections import Counter



In [27]:
big_dataset = load_dataset("amazon_polarity")
# keep roughly 1/100th of each split, since training on the full dataset takes too long
small_train = big_dataset['train'].train_test_split(test_size=1/100, seed=3)
small_test = big_dataset['test'].train_test_split(test_size=1/100, seed=3)
# the 'test' part returned by each train_test_split call is the 1% sample we keep
train_df = pd.DataFrame(small_train['test'])
test_df = pd.DataFrame(small_test['test'])
# remove reviews with duplicate text
train_df = train_df.drop_duplicates(subset='content')
test_df = test_df.drop_duplicates(subset='content')

# preserve_index=False keeps the leftover pandas index out of the dataset columns
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False),
})

In [28]:
print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")

Train size: 35999
Test size: 4000


In [29]:
def clean_text(example):
    example["content"] = example["content"].lower()
    # keep only word characters (letters, digits, underscore), whitespace, and the punctuation . , ! ?
    example["content"] = re.sub(r"[^\w\s.,!?]", "", example["content"])
    return example
dataset = dataset.map(clean_text)


Labeling:
- For classification tasks, ensure each entry is labeled accurately.
- For summarization tasks, ensure each entry includes a concise, high-quality summary.

In [30]:
# The dataset is too large to verify every label by hand, so the labels shipped with the dataset are used as-is.

Balance the Dataset: Ensure there is no significant class imbalance. For instance, if you’re working on sentiment analysis, have a roughly equal number of positive, neutral, and negative samples.

In [31]:
train_labels = [example['label'] for example in dataset['train']]
test_labels = [example['label'] for example in dataset['test']]
distribution = {
    "Class": ["Negative", "Positive"],
    "Train Count": [train_labels.count(i) for i in range(2)],
    "Test Count": [test_labels.count(i) for i in range(2)]
}
df = pd.DataFrame(distribution)
print(df)

      Class  Train Count  Test Count
0  Negative        17930        1951
1  Positive        18069        2049
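To put a number on the balance shown in the table, the sketch below computes each class's share of the training set and the ratio between the largest and smallest class. The helper `class_shares` is a hypothetical name introduced for this example, and the label list is rebuilt from the printed counts rather than read from the dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: fraction of samples} for a list of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Counts copied from the "Train Count" column of the table above.
train_labels = [0] * 17930 + [1] * 18069
shares = class_shares(train_labels)

# A ratio close to 1.0 means the classes are nearly the same size.
imbalance_ratio = max(shares.values()) / min(shares.values())
print(shares)
print(f"largest/smallest class ratio: {imbalance_ratio:.3f}")
```

With a ratio this close to 1, no resampling or class weighting appears necessary for this sample.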


# 3. Fine-Tune the Model

In [32]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Step 1: Environment Setup

In [33]:
# Check for GPU availability
print(torch.cuda.is_available()) # Should return True

True


Step 2: Load the Pre-trained Model

In [34]:
model_name = "distilbert-base-uncased"
model_checkpoint_dir = "/content/drive/MyDrive/model_checkpoint"

# if training from start:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# if opening saved model:
#model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_dir)
#tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_dir)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Step 3: Preprocess the Dataset

In [38]:
def preprocess_function(examples):
  return tokenizer(examples['content'], truncation=True, padding='max_length', max_length=512)

tokenized_dataset = dataset.map(preprocess_function, batched=True)


Step 4: Define Training Arguments

In [39]:
training_args = TrainingArguments(
  output_dir="./results",
  save_steps=500,
  save_total_limit=3,
  evaluation_strategy="epoch",
  learning_rate=3e-5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=64,
  num_train_epochs=3,
  weight_decay=0.01,
  gradient_accumulation_steps=2,
  warmup_steps=500,
)
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=tokenized_dataset["train"],
  eval_dataset=tokenized_dataset["test"],
)



Step 5: Train the Model

In [40]:
# if training from start:
trainer.train()
# if opening saved model:
#trainer.train(resume_from_checkpoint=True)

Epoch,Training Loss,Validation Loss
1,0.2285,0.220874
2,0.1382,0.222659
3,0.0716,0.265771


TrainOutput(global_step=3375, training_loss=0.16523110792371962, metrics={'train_runtime': 5211.3056, 'train_samples_per_second': 20.724, 'train_steps_per_second': 0.648, 'total_flos': 1.4306081652652032e+16, 'train_loss': 0.16523110792371962, 'epoch': 3.0})
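The `global_step=3375` value in the output can be cross-checked from the training arguments. The sketch below is a back-of-the-envelope calculation, assuming a single GPU and that the number of optimizer updates per epoch is the number of batches integer-divided by the accumulation steps; the exact rounding may differ between `transformers` versions.

```python
import math

train_size = 35999          # from the "Train size" output
per_device_batch = 16       # per_device_train_batch_size
grad_accum = 2              # gradient_accumulation_steps
epochs = 3                  # num_train_epochs
warmup_steps = 500          # warmup_steps

# Assumption: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_batch * grad_accum

# Assumption: the last partial batch is kept, and one optimizer update
# happens every `grad_accum` batches.
batches_per_epoch = math.ceil(train_size / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch, updates_per_epoch, total_steps)  # 32 1125 3375
print(f"warmup covers {warmup_fraction:.1%} of training")
```

Under these assumptions the 500 warmup steps take up roughly the first 15% of the run, which is worth keeping in mind if the dataset sample is made smaller.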

Step 6: Save the Fine-Tuned Model

In [41]:
model.save_pretrained(model_checkpoint_dir)
tokenizer.save_pretrained(model_checkpoint_dir)
!cp -r ./results /content/drive/MyDrive/results

# 4. Model Evaluation

In [42]:
results = trainer.evaluate()
print(results)

{'eval_loss': 0.26577135920524597, 'eval_runtime': 64.8976, 'eval_samples_per_second': 61.636, 'eval_steps_per_second': 0.971, 'epoch': 3.0}
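The `evaluate()` output above contains only the loss and timing figures because the `Trainer` was created without a `compute_metrics` callback. A minimal sketch of such a callback follows. It assumes the evaluation predictions unpack into a `(logits, labels)` pair of NumPy arrays, and the function body is illustrative rather than part of the original notebook.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy value for the evaluation log."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # predicted class = highest logit
    return {"accuracy": float((preds == labels).mean())}

# Stand-alone check with made-up logits for three examples and two classes.
toy_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the per-epoch evaluation results, which makes it easier to compare epochs than the loss alone.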


Detailed metrics with sklearn

In [43]:
predictions = trainer.predict(tokenized_dataset["test"])
y_pred = predictions.predictions.argmax(axis=1)
y_true = tokenized_dataset["test"]["label"]
print(classification_report(y_true, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.9305    0.9329    0.9317      1951
           1     0.9359    0.9336    0.9348      2049

    accuracy                         0.9333      4000
   macro avg     0.9332    0.9332    0.9332      4000
weighted avg     0.9333    0.9333    0.9333      4000
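The summary rows of the report can be sanity-checked from the per-class rows: accuracy equals the support-weighted average of per-class recall, and macro F1 is the plain mean of per-class F1. The sketch below redoes that arithmetic with the values printed above; because those inputs are already rounded to four decimals, the results can differ from the report in the last digit.

```python
# Per-class values copied from the classification report above.
report = {
    0: {"recall": 0.9329, "f1": 0.9317, "support": 1951},
    1: {"recall": 0.9336, "f1": 0.9348, "support": 2049},
}

total = sum(row["support"] for row in report.values())

# Accuracy is the support-weighted average of per-class recall.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

# Macro F1 is the unweighted mean of per-class F1.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

print(f"accuracy ~ {accuracy:.4f}, macro F1 ~ {macro_f1:.4f}")
```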



# 5. Analysis and Report

- Dataset Insights: Describe your dataset, including how it was cleaned and labeled.
  
  A: The dataset is "yelp_review_full".

  Input: Yelp reviews (text)

  Output: Star rating as an integer label (0 = 1 star, 4 = 5 stars)

  Cleaning process:
  - Confirm there are no duplicates and that the classes are balanced
  - Remove special characters and lowercase all text
  - Sample 1/20th of the dataset to keep training time and GPU usage reasonable

- Training Process: Summarize the steps you took to fine-tune the model.
  
  A:
  - Load the pre-trained model "distilbert-base-uncased"
  - Load and tokenize the Yelp dataset
  - Define training arguments:
    
    learning_rate=3e-5 (higher learning rate for faster convergence)
    
    per_device_train_batch_size=16,
    
    num_train_epochs=3,
    
    weight_decay=0.01 (regularization)
    
    gradient_accumulation_steps=2 (simulate a larger batch size)
    
    warmup_steps=500 (stabilize convergence)

- Evaluation Results: Present your evaluation metrics and discuss the model’s strengths and weaknesses.

  ```text
                precision    recall  f1-score   support

             0     0.7445    0.7255    0.7349       510
             1     0.5485    0.5553    0.5519       479
             2     0.5749    0.5591    0.5669       508
             3     0.5254    0.5918    0.5566       490
             4     0.7415    0.6823    0.7107       513

      accuracy                         0.6240      2500
     macro avg     0.6269    0.6228    0.6242      2500
  weighted avg     0.6289    0.6240    0.6258      2500
  ```

  The model performs best on 1-star and 5-star reviews (classes 0 and 4, with F1 scores of 0.7349 and 0.7107), but struggles to distinguish between the moderate ratings (classes 1 to 3, with F1 scores between 0.55 and 0.57).

- Application and Impact: Explain how this fine-tuned model could be used in a real-world application. Include at least one potential improvement for future iterations.
  
  In the real world, this model has various applications in customer service and online shopping. Amazon is already using large language models to summarize customer sentiment about products based on a 5-star review system. One potential improvement is a two-stage approach: first separate the moderate reviews from the extreme ones, then train a dedicated 3-class classifier on the moderate reviews alone.
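
One possible reading of the suggested improvement is a relabeling step that produces targets for two separate classifiers. The sketch below is only an illustration of that idea: the helper `split_label` is hypothetical, and it assumes "moderate" means 2 to 4 stars (labels 1, 2, 3) while "extreme" means 1 and 5 stars (labels 0 and 4).

```python
# Assumption: labels follow the 0-4 encoding described above (0 = 1 star, 4 = 5 stars).
EXTREME = {0, 4}   # 1-star and 5-star reviews

def split_label(star_label: int):
    """Map a 0-4 star label to (stage-1 label, stage-2 label or None).

    Stage 1: 0 = extreme, 1 = moderate.
    Stage 2: defined only for moderate reviews, relabeled to 0, 1, 2.
    """
    if star_label not in range(5):
        raise ValueError(f"expected a label in 0..4, got {star_label}")
    if star_label in EXTREME:
        return 0, None
    return 1, star_label - 1

for label in range(5):
    print(label, split_label(label))
```

In this sketch the stage-1 label discards the difference between 1-star and 5-star reviews, so a full pipeline would still need a way to tell those two apart, for example the existing 5-class model.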

