# Sentiment Analysis
Lecture by Dr. Mourad Bouache at Stanford HPC Center on July 29, 2024

This is a sentiment analysis project. The code below loads the Stanford Sentiment Treebank (SST-2) dataset rather than IMDB (50,000 reviews, split evenly into 25,000 for training and 25,000 for testing) and trains on a random 1,000-sentence subset to keep local training fast. Each example is labeled as positive or negative, and the goal is to predict that label.

0. Install the requirements and connect to the remote server
For VSCode: 
- Install the remote-ssh extension
- Connect to the remote server

Select Kernel:
- Open the venv or create a new one
- ```conda install jupyter```
- ```conda install ipykernel```
- ```python -m ipykernel install --user --name=condaenv --display-name "Python (condaenv)"```
- Run ```pip3 install -r requirements.txt```
- Make sure to have the remote Kernel selected
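
Once the kernel is selected, a quick check can confirm that it sees the packages this notebook needs. The sketch below uses only the standard library. The package list is an assumption based on the imports used later in this notebook, since the contents of requirements.txt are not shown here.

```python
import importlib.util
import sys

# Assumed package list: the imports used in this notebook, plus accelerate
# (needed by recent transformers versions when using Trainer with PyTorch).
REQUIRED = ["torch", "transformers", "datasets", "accelerate", "ipykernel"]

def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The interpreter path shows which environment the kernel is actually using.
print("Interpreter:", sys.executable)

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If the interpreter path does not point into the conda environment, the wrong kernel is probably selected.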

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import time

1. Load Dataset

In [6]:
dataset = load_dataset("sst2") # Example: Stanford Sentiment Treebank-2 dataset
train_dataset = dataset["train"].shuffle(seed=42).select(range(1000)) # Subset for faster local training

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]
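
Because training uses only a 1,000-example random subset, it can be worth checking that both classes are reasonably represented. The helper below is a minimal sketch. `label_distribution` is a hypothetical name, and the toy list stands in for the real labels.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of examples per label, keyed by label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels standing in for the real ones (in SST-2, 0 = negative, 1 = positive).
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]
print(label_distribution(toy_labels))
```

In the notebook, the same helper could be called as `label_distribution(train_dataset["label"])`.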

2. Load Pre-trained Model and Tokenizer

In [7]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
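
With `num_labels=2`, the classification head produces two raw scores (logits) per input, one for each class. The warning above means those scores are not meaningful until the head has been trained. As a rough illustration of how two logits become a prediction, the sketch below applies a softmax in plain Python. The logits are made up, and `predict_label` is a hypothetical helper, not part of transformers.

```python
import math

# Label convention used by SST-2: 0 = negative, 1 = positive.
ID2LABEL = {0: "negative", 1: "positive"}

def predict_label(logits):
    """Turn a pair of raw logits into (label, probability) using a softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits, not produced by the model in this notebook.
label, prob = predict_label([-1.2, 2.3])
print(label, round(prob, 3))
```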


3. Tokenize Dataset

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = train_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
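
The two tokenizer options control the shape of every example. `truncation=True` cuts sequences that are too long, and `padding="max_length"` pads short ones up to the model's maximum length, which should be 512 tokens for this checkpoint. The sketch below imitates that behavior on plain lists. It is a simplified illustration, not the tokenizer's implementation: it ignores special-token handling, and `pad_id=0` and the token IDs are assumptions.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Hypothetical token IDs; real IDs come from the tokenizer's vocabulary.
print(pad_or_truncate([101, 2023, 3185, 102], max_length=6))              # short input: padded
print(pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=4))  # long input: truncated
```

Since SST-2 sentences are short, padding every example to the full maximum length means most positions are padding, which adds to the training time on a CPU.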

4. Training Setup

In [9]:
import os
from transformers import TrainingArguments, Trainer

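num_cores = os.cpu_count() or 1  # os.cpu_count() can return None; fall back to 1
print(f"Number of CPU cores: {num_cores}")

training_args = TrainingArguments(
    output_dir="./results",  # Checkpoints and logs are written here
    num_train_epochs=3,
    per_device_train_batch_size=num_cores,  # Training batch size set to the CPU core count
    per_device_eval_batch_size=64,  # Batch size for evaluation (correctness checks); unused here because no eval_dataset is passed
)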

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

Number of CPU cores: 10


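If this cell fails with an ImportError saying that the Trainer requires accelerate, install the accelerate library (for example, with pip3 install accelerate), restart the kernel, and run the cell again.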
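
The settings above determine how many optimizer steps the training run will take. The sketch below does the arithmetic, assuming single-device training with no gradient accumulation. `total_training_steps` is a hypothetical helper, not a transformers function.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch * num_epochs

# Values from this notebook: 1000 examples, batch size 10 (the CPU core count), 3 epochs.
print(total_training_steps(1000, 10, 3))
```

The result, 300, matches the total shown by the training progress bar in the next step.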

5. Train Model and Measure Time

In [10]:
start_time = time.time()
trainer.train()
end_time = time.time()

  0%|          | 0/300 [00:00<?, ?it/s]

{'train_runtime': 133.8361, 'train_samples_per_second': 22.415, 'train_steps_per_second': 2.242, 'train_loss': 0.2694031270345052, 'epoch': 3.0}
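
The throughput figures in this output follow directly from the runtime. As a sanity check, the sketch below recomputes them from the numbers reported above.

```python
train_runtime = 133.8361   # seconds, from the reported metrics
num_examples = 1000        # size of the training subset
num_epochs = 3
total_steps = 300          # from the progress bar

# Every example is seen once per epoch, so 3000 samples are processed in total.
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Both values agree with `train_samples_per_second` and `train_steps_per_second` in the metrics dictionary.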


6. Print the timing. Include the output of this command in the Week 5 Lab 1 submission on Canvas to receive credit for this week's lab.

In [11]:
print(f"Local Training Time: {end_time - start_time} seconds")

Local Training Time: 134.16743874549866 seconds
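
The wall-clock time printed here is slightly longer than `train_runtime` because it also covers the work done around the training loop. For repeated measurements, the timing pattern can be wrapped in a small helper. The sketch below is one possible version, using `time.perf_counter`, which is intended for measuring intervals. `measure` is a hypothetical name, and a short sleep stands in for `trainer.train()`.

```python
import time

def measure(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# A short sleep stands in for trainer.train() in this sketch.
result, elapsed = measure(time.sleep, 0.1)
print(f"Local Training Time: {elapsed:.2f} seconds")
```

For the lab submission, the output of the original print statement above is what the instructions ask for.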
