# Artificial Intelligence
# 464/664
# Assignment #8

## General Directions for this Assignment

00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Output format should be exactly as requested (it is your responsibility to make sure the notebook looks as expected on Gradescope),
02. Check submission deadline on Gradescope,
03. Rename the file to Last_First_assignment_8,
04. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
05. Do not submit any other files.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".

## Language Modeling

This homework will require you to load and train models.  If you choose small models and datasets, you should be able to run this locally on your computer. However, larger models/datasets may require GPU access. You can access one GPU for free on [Google Colab](https://colab.research.google.com/).

We will use HuggingFace libraries in this assignment. We covered the majority of what you will need during the discussion demo. Additional documentation can be found [here](https://huggingface.co/docs).

In [11]:
%pip install evaluate
%pip install transformers
%pip install tqdm

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

In [14]:
# Imports
import os

import evaluate
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

## Problem 0: Data
From the [HuggingFace Datasets](https://huggingface.co/datasets), choose a dataset that satisfies the following criteria:
- Data must have train and test splits (a development set is optional)
- Task must be text classification
- Task must have at least 3 labels


In [6]:
ds = load_dataset("ag_news")

Found cached dataset parquet (/Users/Henry/.cache/huggingface/datasets/parquet/ag_news-9af2a5926861d22a/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|██████████| 2/2 [00:00<00:00, 32.52it/s]


**Describe the data.** 
What is the utility of the task? What are the inputs? What are the labels? Are there any potential difficulties you expect from the task? How do you evaluate the performance of this task?

TODO
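
As a starting point for the evaluation question above, here is a minimal sketch of two common multi-class metrics, accuracy and macro-averaged F1, written in plain Python. The function names and the toy label lists are hypothetical and only illustrate the arithmetic; in the notebook the same numbers could come from a metrics library instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a 4-class task (ids 0..3), for illustration only.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 0, 0, 2]
print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
print(f"macro F1: {macro_f1(y_true, y_pred, labels=[0, 1, 2, 3]):.4f}")
```

Accuracy is easy to interpret when classes are balanced; macro F1 is worth reporting alongside it because a class the model never predicts pulls the score down visibly.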

**Research current methods using this dataset.** 
What is the current state of the art method? Describe the method, including the type of model used, training protocol (if any), and the performance. Cite your sources.

TODO

(Optional) If necessary, perform any data preprocessing here. For example, depending on the dataset you choose, you may need to clean the text or split the training set into a train and validation set.

In [16]:
# TODO: Dataset preprocessing (Optional)

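# ag_news already ships an official "test" split; here 20% of the training
# split is held out and used as the test set for the experiments below
train_test_split = ds["train"].train_test_split(test_size=0.2)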
train_ds = train_test_split["train"]
test_ds = train_test_split["test"]
print('train_ds', train_ds)

# print the number of rows in each
print(f"Number of rows in training set: {len(train_ds)}")
print(f"Number of rows in test set: {len(test_ds)}")


train_ds Dataset({
    features: ['text', 'label'],
    num_rows: 96000
})
Number of rows in training set: 96000
Number of rows in test set: 24000


In [26]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader
from torch.optim import AdamW
import torch
from tqdm import tqdm

# Load a small BERT model and tokenizer (much smaller than base BERT)
model_name = "prajjwal1/bert-tiny"  # tiny BERT - 4.4M parameters as opposed to 110M
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# Preprocess a smaller subset of data for faster training
train_subset = train_ds.select(range(10000))  # Use 10k examples instead of full dataset
test_subset = test_ds.select(range(1000))    # Use 1k examples for testing

# Tokenize data
train_encodings = tokenizer(train_subset['text'], truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_subset['text'], truncation=True, padding=True, max_length=128)

# Create torch datasets
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_subset['label'])
test_dataset = NewsDataset(test_encodings, test_subset['label'])

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-4)  # Higher learning rate for faster training

# Training loop
num_epochs = 2  # Reduced epochs

print(f"Training on {device}")

for epoch in range(num_epochs):
    # Training
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"\nAverage training loss: {avg_loss:.4f}")
    
    # Evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask)
            predictions = outputs.logits.argmax(-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    
    accuracy = correct / total
    print(f"Test accuracy: {accuracy:.4f}")

print("Training completed!")

Downloading config.json: 100%|██████████| 285/285 [00:00<00:00, 20.9kB/s]
Downloading vocab.txt: 100%|██████████| 232k/232k [00:02<00:00, 110kB/s]
Downloading pytorch_model.bin: 100%|██████████| 17.8M/17.8M [00:02<00:00, 7.94MB/s]
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassificatio

Training on cpu

Epoch 1/2: 100%|██████████| 313/313 [01:30<00:00,  3.46it/s]

Average training loss: 0.4512

Evaluating: 100%|██████████| 32/32 [00:02<00:00, 13.24it/s]

Test accuracy: 0.8900

Epoch 2/2: 100%|██████████| 313/313 [01:30<00:00,  3.45it/s]

Average training loss: 0.2081

Evaluating: 100%|██████████| 32/32 [00:02<00:00, 13.24it/s]

Test accuracy: 0.8920
Training completed!

## Problem 1: Encoder only models
Choose an encoder-only model (e.g. BERT). Load the model and add a classification layer.

Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

TODO

Finetune the model on your dataset. Report the performance on the test set.

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


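I chose the **RoBERTa** model (Robustly Optimized BERT Pretraining Approach; Liu et al., 2019). It keeps BERT's architecture but changes the pretraining recipe: it is trained on about ten times more text (160GB vs. 16GB for BERT), for longer, and with larger batches. It also makes a few other changes:
- No next sentence prediction: removes BERT's NSP task and pretrains on masked language modeling only.
- Dynamic masking: samples a new masking pattern each time a sequence is fed to the model, instead of fixing the pattern once during data preprocessing as in the original BERT.
- Byte-level BPE tokenizer: uses a byte-level BPE vocabulary of about 50K subword units instead of BERT's WordPiece vocabulary of about 30K.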
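
To make the difference between static and dynamic masking concrete, here is a small illustrative sketch in plain Python. It is a simplification and not RoBERTa's actual implementation: the `mask_tokens` helper, the `[MASK]` placeholder string, and the 0.3 masking rate are assumptions chosen so the effect is visible on a short sentence (real masked language modeling selects about 15% of tokens and does not always replace them with the mask token).

```python
import random

MASK = "[MASK]"  # placeholder mask token, chosen for this illustration


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the pattern is sampled once and reused in every epoch.
static_view = mask_tokens(tokens, random.Random(0), mask_prob=0.3)

# Dynamic masking: a fresh pattern is sampled each time the sequence is seen.
rng = random.Random(0)
for epoch in range(3):
    dynamic_view = mask_tokens(tokens, rng, mask_prob=0.3)
    print(f"epoch {epoch} static : {' '.join(static_view)}")
    print(f"epoch {epoch} dynamic: {' '.join(dynamic_view)}")
```

With dynamic masking the model sees different masked positions for the same sentence across epochs, which matters more as the number of training passes grows.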

Pros:
RoBERTa outperforms BERT on the GLUE, SQuAD, and RACE benchmarks reported in the RoBERTa paper, and it does well on classification tasks. Its byte-level BPE tokenizer can encode any input string, so rare or unseen words (words that did not appear in the training data) are split into known subword pieces rather than mapped to an unknown token.

Cons:
RoBERTa-base has roughly 125M parameters, so fine-tuning it is more computationally expensive and slower than fine-tuning a compact model such as the bert-tiny model used above. Inference is also slower, meaning it takes longer to produce each prediction. The added size and complexity may be overkill for some projects.

## Problem 2: Decoder only models
Choose a decoder-only model (e.g. GPT2). Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

TODO

Load the model and use prompting for your task. You will likely need to write a helper function to parse the answer. 

(Ex. “The answer is 1” -> 1). Report the performance on the test set.
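
One possible shape for the parsing helper mentioned above is sketched below. The `parse_answer` function and the `LABEL_NAMES` list are hypothetical; the list assumes the AG News label order (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech) and would need to be changed for another dataset.

```python
import re

# Assumed label order for AG News; adjust for a different dataset.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def parse_answer(generated_text, num_labels=len(LABEL_NAMES)):
    """Map generated text to a label id, or None if no label can be found."""
    # 1) Prefer an explicit number, e.g. "The answer is 1".
    match = re.search(r"\b(\d+)\b", generated_text)
    if match and int(match.group(1)) < num_labels:
        return int(match.group(1))
    # 2) Otherwise fall back to a label name mentioned anywhere in the text.
    lowered = generated_text.lower()
    for label_id, name in enumerate(LABEL_NAMES):
        if name.lower() in lowered:
            return label_id
    # 3) Nothing usable: let the caller decide how to score this case.
    return None


print(parse_answer("The answer is 1"))
print(parse_answer("This looks like a Business story."))
print(parse_answer("I am not sure."))
```

Returning `None` for unparseable generations keeps them separate from wrong answers, which is useful later when reporting how often the model fails to follow the prompt format at all.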

In [9]:
# Load GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Downloading tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 1.98kB/s]
Downloading config.json: 100%|██████████| 665/665 [00:00<00:00, 242kB/s]
Downloading vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 14.4MB/s]
Downloading merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 19.5MB/s]
Downloading tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 10.4MB/s]
Downloading model.safetensors: 100%|██████████| 548M/548M [00:17<00:00, 31.2MB/s] 
Downloading generation_config.json: 100%|██████████| 124/124 [00:00<00:00, 11.2kB/s]


## Problem 3: Error Analysis

Conduct an error analysis on both models. What is each model good at? What do they get wrong? Provide examples of both correct and incorrect predictions. Suggest methods to improve the performance.

In [None]:
# TODO
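
A simple way to start the error analysis is to count which classes get confused with each other and to pull out a few misclassified examples to read. The sketch below does both in plain Python; the helper names and the toy texts and predictions are hypothetical and stand in for the real test set and model outputs.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def collect_errors(texts, y_true, y_pred, limit=5):
    """Return up to `limit` misclassified examples as (text, true, predicted)."""
    errors = [(x, t, p) for x, t, p in zip(texts, y_true, y_pred) if t != p]
    return errors[:limit]


# Hypothetical texts and predictions, for illustration only.
texts = [
    "Stocks rally on strong earnings",
    "Team wins the final",
    "New chip unveiled by manufacturer",
    "Election results announced",
]
y_true = [2, 1, 3, 0]
y_pred = [2, 1, 2, 0]

# List the confusions, most frequent first.
for (true_label, pred_label), n in confusion_counts(y_true, y_pred).most_common():
    if true_label != pred_label:
        print(f"true={true_label} predicted={pred_label}: {n} example(s)")

for text, true_label, pred_label in collect_errors(texts, y_true, y_pred):
    print(f"[{true_label} -> {pred_label}] {text}")
```

Running the same two helpers on the predictions of both models makes it easier to compare where the encoder-only and decoder-only approaches fail.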

## OPTIONAL. BONUS. Problem 4: Improvements

Implement your suggestions for improving the performance. You only need to implement improvements on one model (encoder-only or decoder-only). Describe your method and report the results on the test set.

In [None]:
# TODO

No other directions for this assignment, other than what's here and in the "General Directions" section. You have a lot of freedom with this assignment. Don't get carried away. Results are expected to vary and may be better or worse. Graders are not going to run your notebooks. The notebook will be read as a report on how different models were explored. Since you'll be using libraries, the emphasis will be on your ability to communicate your findings.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".