# Assignment

## Instructions

Use the code below as a starting point for loading the Rotten Tomatoes dataset.

**Model Application:**

- Load a pre-trained sentiment analysis model from Hugging Face Transformers.
- Apply the model to a subset of the chosen dataset (e.g., the first 1000 samples from the training set).
- Evaluate the model's performance. You can start with qualitative analysis (inspecting predictions) and then explore quantitative metrics.
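
The three steps above can be sketched as a small scoring helper. A Hugging Face `pipeline` returns one dict per input with a `label` key, so comparing those labels to the dataset's integer labels gives accuracy and a confusion count. The helper name `score_predictions`, the `LABEL_MAP` values, and the stub classifier are assumptions for illustration; the stub stands in for a real pipeline so the sketch runs offline.

```python
from collections import Counter

# Hypothetical mapping from a checkpoint's string labels to the dataset's
# integer labels. The label names depend on the checkpoint you choose;
# "POSITIVE"/"NEGATIVE" is an assumption here.
LABEL_MAP = {"NEGATIVE": 0, "POSITIVE": 1}

def score_predictions(classifier, texts, labels, label_map=LABEL_MAP):
    """Run `classifier` on `texts` and compare its output to integer `labels`."""
    outputs = classifier(list(texts))          # one {"label": ..., "score": ...} per text
    preds = [label_map[out["label"]] for out in outputs]
    confusion = Counter(zip(labels, preds))    # keys are (true_label, predicted_label)
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    return {"accuracy": correct / len(labels), "confusion": dict(confusion)}

def stub_classifier(texts):
    """Toy stand-in for a pipeline: calls a text positive if it contains 'good'."""
    return [{"label": "POSITIVE" if "good" in t else "NEGATIVE", "score": 1.0}
            for t in texts]

report = score_predictions(
    stub_classifier,
    ["a good film", "dull and slow", "not good at all"],
    [1, 0, 0],
)
print(report)
```

In the notebook, the stub could be replaced by `pipeline("sentiment-analysis")` and the toy inputs by the first 1000 training texts and labels, provided the chosen checkpoint's label names match `LABEL_MAP`.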

In [1]:
!pip install datasets==3.6.0



In [2]:
from datasets import load_dataset
from transformers import pipeline
from sklearn.metrics import classification_report, accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [4]:
# Load the Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes")

# Print the dataset information
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


In [5]:
# Example: Accessing the train test split
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Print the first example in the training set
print(train_dataset[0])

# Print the first example in the testing set
print(test_dataset[0])

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
{'text': 'lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .', 'label': 1}
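
Before training, it can help to check how the labels are distributed, since accuracy is easier to interpret on a balanced dataset. The following is a minimal sketch; `label_distribution` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} for an iterable of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Made-up labels for illustration only
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
dist = label_distribution(toy_labels)
print(dist)
```

In the notebook this could be called as `label_distribution(train_dataset["label"])`.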


In [6]:
# The assignment suggests working with the first 1000 training samples.
# NOTE: `subset` is defined here but not used below; the full train and test splits are tokenized.
subset_size = 1000
subset = train_dataset.select(range(subset_size))

# Load the BERT tokenizer ("uncased" lowercases all text, so "APPLE" becomes "apple")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset. BERT accepts up to 512 tokens; inputs are truncated to 128 tokens here.
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=128)

tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

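
To make the effect of `truncation=True`, `padding=True`, and `max_length` concrete, here is a toy sketch that operates on lists of token ids. It is not the tokenizer's implementation: the ids are made up, `pad_id=0` is an assumption, and a real BERT tokenizer also adds special tokens and keeps them when truncating.

```python
def pad_and_truncate(batch, max_length=128, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=4 keeps the example small
toy = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 5, 6]], max_length=4)
print(toy["input_ids"])
print(toy["attention_mask"])
```

Padding to the longest sequence in each batch, rather than always to `max_length`, keeps batches of short reviews small.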

In [7]:
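# Ensure the labels are binary: 1 = positive, anything else = 0 (negative).
# Rotten Tomatoes labels are already 0/1, so this mapping leaves them unchanged;
# it would only matter for a dataset with more than two classes (e.g. labels 0, 1, 2).
def convert_labels(example):
    if example['label'] == 1:
        example['label'] = 1  # Positive sentiment
    else:
        example['label'] = 0  # Negative sentiment
    return example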

binary_train_dataset = tokenized_train_dataset.map(convert_labels)
binary_test_dataset = tokenized_test_dataset.map(convert_labels)

In [8]:
# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


In [10]:
from transformers import DataCollatorWithPadding # Import DataCollatorWithPadding

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=16,  # per GPU; with a 2-GPU setup the effective batch size is 32
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Initialize the Data Collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=binary_train_dataset,
    eval_dataset=binary_test_dataset,
    data_collator=data_collator, # Pass the data collator here
)

# Train the model
trainer.train()

`logging_dir` is deprecated and will be removed in v5.2. Please set `TENSORBOARD_LOGGING_DIR` instead.


| Step | Training Loss |
|------|---------------|
| 500  | 0.446665      |


TrainOutput(global_step=534, training_loss=0.4422860395595822, metrics={'train_runtime': 166.1079, 'train_samples_per_second': 51.352, 'train_steps_per_second': 3.215, 'total_flos': 338273456100360.0, 'train_loss': 0.4422860395595822, 'epoch': 1.0})
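
The `global_step=534` in the output can be checked by hand from the dataset size and batch size. The arithmetic below assumes a single device and no gradient accumulation. It also shows how much of the run is covered by `warmup_steps=500`.

```python
import math

train_examples = 8530   # rows in the training split
batch_size = 16         # per_device_train_batch_size
epochs = 1              # num_train_epochs
warmup_steps = 500

# The last, partially filled batch still counts as one optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(total_steps)                 # 534
print(round(warmup_fraction, 3))   # 0.936
```

With these settings the learning rate is still warming up for roughly 94% of the single epoch, assuming the default linear schedule, so a smaller `warmup_steps` value may be worth trying.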

In [11]:
# Evaluate the model
results = trainer.evaluate()
print(results)

{'eval_loss': 0.3871988356113434, 'eval_runtime': 4.6526, 'eval_samples_per_second': 229.121, 'eval_steps_per_second': 3.654, 'epoch': 1.0}
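
The evaluation output above contains only the loss, because no metric function was given to the `Trainer`. A minimal sketch of a `compute_metrics` function follows. It is written in plain Python so it runs without extra dependencies, and it treats label 1 as the positive class; the toy logits and labels are made up. `np.argmax` or `sklearn.metrics` are common alternatives.

```python
def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall and F1 from (logits, labels)."""
    logits, labels = eval_pred
    # argmax over each row of logits gives the predicted class index
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up logits and labels for illustration only
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` makes `trainer.evaluate()` report these values alongside `eval_loss`, each with an `eval_` prefix.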


In [12]:
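def predict_sentiment(text):
    """Return "positive" or "negative" for a single review string."""
    model.eval()  # disable dropout so predictions are deterministic
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1)
    return "positive" if prediction.item() == 1 else "negative"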
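
The function above returns only the winning class. For qualitative analysis it can also be useful to see how confident the model is, which a softmax over the two logits provides. This is a sketch in plain Python; `label_with_confidence` is a hypothetical helper and the example logits are made up.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, names=("negative", "positive")):
    """Return the predicted class name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]

# Made-up logits for a single review: index 0 = negative, index 1 = positive
label, confidence = label_with_confidence([-1.0, 1.0])
print(label, round(confidence, 4))
```

In the notebook, the logits for one review could be obtained with `outputs.logits[0].tolist()`.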

In [23]:
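# Test it on the first 50 examples of the training set
# (the model was trained on these, so this is a sanity check, not a held-out evaluation)
for i in range(50):
    print(predict_sentiment(train_dataset[i]['text']))  # predicted sentiment
    print(train_dataset[i]['label'])                    # true label (1 = positive, 0 = negative)
    print(train_dataset[i]['text'])
    print()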


positive
1
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

positive
1
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

positive
1
effective but too-tepid biopic

positive
1
if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

positive
1
emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .

positive
1
the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

positive
1
offers that rare combination of entertainment and education .

positive
1
perhaps no picture ever made has more literally showed that the ro