# S&DS 617 Applied Machine Learning and Causal Inference Research Seminar: Assignment 1

**Deadline**

Assignment 1 is due Monday, February 24th at 1:30pm. Late work will not be accepted. 

**Submission**

Submit your assignment as a .pdf on Gradescope. There are two Gradescope assignments: one where you submit the .pdf file and one where you submit the corresponding .ipynb that generated it. 
Note: The problems in each homework assignment are numbered. When submitting the pdf on Gradescope, please select the correct pages that correspond to each problem. 

To produce the .pdf, do the following to preserve the cell structure of the notebook:
- Go to "File" at the top-left of your Jupyter Notebook
- Under "Download as", select "HTML (.html)"
- After the .html has downloaded, open it and then select "File" and "Print"
- From the print window, select the option to save as a .pdf

## Problem 1: Comparing BERT vs. GPT

a) In this assignment, we will compare BERT (Bidirectional Encoder Representations from Transformers) with GPT (Generative Pre-trained Transformer). Provide detailed explanations of how the architecture, the type of attention mechanism employed, and the approach to tokenization in each model contribute to their respective capabilities and applications. Which model do you think will perform better at sentiment analysis and why?

<font color = 'blue'>

## Architecture 
- BERT: At a high level, BERT has four modules: a tokenizer, an embedding module, an encoder stack, and a task head. Although BERT is a Transformer-based model, it is more commonly described as an encoder-only Transformer: it focuses on building latent (contextual) representations of the input text and is not designed to generate new text. 
- GPT: GPT is a stack of Transformer decoder blocks, each containing a masked self-attention layer and a feed-forward layer; there is no separate encoder. For this reason GPT is usually described as a decoder-only model. Its training objective is to model the corpus auto-regressively, predicting each token from the tokens before it, rather than to produce a representation of a complete input. 

## Type of Attention Mechanism Employed
- BERT: Because BERT is based on the Transformer encoder, it uses multi-head self-attention, and several such attention layers are stacked on top of one another. The attention is bidirectional, meaning each token attends to both its left and its right context (hence the "B" in BERT). As a result, BERT captures the features of an input sequence (typically a sentence) very well.
- GPT: GPT also uses multi-head self-attention, with several attention layers stacked on top of each other. The main difference is that GPT's attention is unidirectional (causal): each token can attend only to the tokens that precede it. This is what lets GPT generate text by predicting the next token. 
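
The difference between the two attention patterns comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python, not code from either model; the function names are illustrative, and the raw scores are set to zero so that only the mask shapes the resulting weights.

```python
import math

def bidirectional_mask(n):
    """BERT-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, skipping masked-out positions."""
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "movie", "was", "great"]
n = len(tokens)
scores = [0.0] * n  # uniform raw scores (an assumption made for illustration)

# Attention weights for the second token ("movie") under each mask
bert_weights = masked_softmax(scores, bidirectional_mask(n)[1])
gpt_weights = masked_softmax(scores, causal_mask(n)[1])
print(bert_weights)  # [0.25, 0.25, 0.25, 0.25]
print(gpt_weights)   # [0.5, 0.5, 0.0, 0.0]
```

Under the bidirectional mask, "movie" can use "was" and "great"; under the causal mask it cannot, which is what makes next-token prediction a well-posed training task.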

## Approach of Tokenization 
- BERT: WordPiece tokenization splits a word greedily, repeatedly taking the longest subword that appears in the model's vocabulary. Ex. "Hugging" -> "Hug" plus a continuation piece (continuation pieces are marked with "##"). This is helpful for BERT's focus on textual representation, as the prefixes and suffixes around a word tend to add less to its meaning than the stem itself. 
- GPT: Byte Pair Encoding builds its vocabulary from merge rules learned over frequently co-occurring character (or byte) sequences, and tokenizes a word by applying those merges. Ex. "Hugging" -> "Hug" "g" "ing". This is helpful for GPT because GPT is focused on predicting the next token: a suffix can signal tense and other contextual cues needed to generate the right word. 
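
To make the two tokenization strategies concrete, here is a toy sketch of greedy longest-match WordPiece splitting and of applying BPE merge rules. The vocabulary and merge list are hypothetical and chosen only to reproduce the "hugging" example; they are not the real BERT or GPT vocabularies, and real tokenizers handle casing, bytes, and unknown tokens in more detail.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:  # no vocabulary entry covers this position
            return [unk]
        pieces.append(match)
        start = end
    return pieces

def bpe_tokenize(word, merges):
    """Start from single characters and apply merge rules in learned order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical toy vocabulary and merge rules (assumptions for illustration)
toy_vocab = {"hug", "##ging", "##g", "##ing"}
toy_merges = [("h", "u"), ("hu", "g"), ("i", "n"), ("in", "g")]

print(wordpiece_tokenize("hugging", toy_vocab))  # ['hug', '##ging']
print(bpe_tokenize("hugging", toy_merges))       # ['hug', 'g', 'ing']
```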

## Which model will perform better at sentiment analysis and why? 
BERT will likely perform better at sentiment analysis because its bidirectional encoder builds a latent representation of each word using the full context of the review, which is what a classification task needs. However, GPT has a strong case as well, since it is trained on a larger set of data. 


b) We will now perform sentiment analysis on the IMDb dataset ("https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"). This dataset contains movie reviews along with their associated binary sentiment polarity labels. Code has been provided to you below to train and evaluate BERT. 

Run the below code to get the test accuracy. Then, modify the code to try getting a higher test accuracy (e.g., adjusting hyperparameters, further model tweaking, data augmentation, etc.). Specify what you modified.

In [2]:
import requests
import tarfile
import os
import json
import re
import openai
from io import BytesIO
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset


### Get Data

In [7]:
# URL of the IMDb dataset
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Send a GET request to download the content of the dataset
response = requests.get(url)
response.raise_for_status()  # This will raise an exception if there was a download issue

# Open the downloaded content as a file-like object
file_like_object = BytesIO(response.content)

# Extract the tar.gz file
with tarfile.open(fileobj=file_like_object) as tar:
    tar.extractall(path=".")  # Extract to a directory named aclImdb in the current working directory

print("Dataset downloaded and extracted to './aclImdb'")


Dataset downloaded and extracted to './aclImdb'


In [None]:
def load_imdb_dataset(directory):
    reviews = []
    sentiments = []

    for sentiment in ["pos", "neg"]:
        dir_name = os.path.join(directory, sentiment)
        for filename in os.listdir(dir_name):
            if filename.endswith('.txt'):
                with open(os.path.join(dir_name, filename), encoding='utf-8') as file:
                    reviews.append(file.read())
                    sentiments.append(sentiment)

    return pd.DataFrame({'review': reviews, 'sentiment': sentiments})

# Load the training dataset
dataset_dir = 'aclImdb'
df_tr = load_imdb_dataset(os.path.join(dataset_dir, 'train'))

# Load the test dataset
df_te = load_imdb_dataset(os.path.join(dataset_dir, 'test'))

# Display the first few rows of the DataFrame
print(df_tr.head())
print(df_te.head())


                                              review sentiment
0  Zentropa has much in common with The Third Man...       pos
1  Zentropa is the most original movie I've seen ...       pos
2  Lars Von Trier is never backward in trying out...       pos
3  *Contains spoilers due to me having to describ...       pos
4  That was the first thing that sprang to mind a...       pos
                                              review sentiment
0  Previous reviewer Claudio Carvalho gave a much...       pos
1  CONTAINS "SPOILER" INFORMATION. Watch this dir...       pos
2  This is my first Deepa Mehta film. I saw the f...       pos
3  This was a great film in every sense of the wo...       pos
4  A stunningly well-made film, with exceptional ...       pos


In [None]:
import nltk
import random
from nltk.corpus import wordnet, stopwords

# Download necessary NLTK data (only needed once)
nltk.download('wordnet')
nltk.download('stopwords')

# Define stop words using NLTK's corpus
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/accts/ltp8/nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/accts/ltp8/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Synonym replacement adapted from https://github.com/jasonwei20/eda_nlp
def synonym_replacement(words, n):
	new_words = words.copy()
	random_word_list = list(set([word for word in words if word not in stop_words]))
	random.shuffle(random_word_list)
	num_replaced = 0
	for random_word in random_word_list:
		synonyms = get_synonyms(random_word)
		if len(synonyms) >= 1:
			synonym = random.choice(list(synonyms))
			new_words = [synonym if word == random_word else word for word in new_words]
			#print("replaced", random_word, "with", synonym)
			num_replaced += 1
		if num_replaced >= n: #only replace up to n words
			break

	# Re-join and re-split so that multi-word synonyms end up as separate tokens
	sentence = ' '.join(new_words)
	new_words = sentence.split(' ')

	return new_words

def get_synonyms(word):
	synonyms = set()
	for syn in wordnet.synsets(word): 
		for l in syn.lemmas(): 
			synonym = l.name().replace("_", " ").replace("-", " ").lower()
			synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
			synonyms.add(synonym) 
	if word in synonyms:
		synonyms.remove(word)
	return list(synonyms)

def augment_review(review, num_replacements=2):
    """
    Tokenize the review, replace up to num_replacements words with synonyms,
    and return the augmented review.
    """
    words = review.split()
    new_words = synonym_replacement(words, num_replacements)
    return ' '.join(new_words)

half_size = len(df_te) // 2

# Randomly select indices to augment (half of them)
indices_to_augment = random.sample(df_te.index.tolist(), half_size)

# Apply synonym replacement to reviews at the selected indices
# You can adjust num_replacements as needed (e.g., 2 words replaced)
df_te.loc[indices_to_augment, 'review'] = df_te.loc[indices_to_augment, 'review'].apply(
    lambda review: augment_review(review, num_replacements=2)
)
df_te = df_te.sample(frac=1).reset_index(drop=True)

half_size = len(df_tr) // 2

# Randomly select indices to augment (half of them)
indices_to_augment = random.sample(df_tr.index.tolist(), half_size)

df_tr.loc[indices_to_augment, 'review'] = df_tr.loc[indices_to_augment, 'review'].apply(
    lambda review: augment_review(review, num_replacements=2)
)
df_tr = df_tr.sample(frac=1).reset_index(drop=True)

                                              review sentiment
0  I have seen bad films but this took the p***. ...       neg
1  The only other film besides Soylent Green that...       pos
2  This is not the stuff of soap-operas but the s...       pos
3  I saw this shoot without to know what about we...       pos
4  It's not too bad a b complex movie, with Sande...       pos


In [95]:
# Subsample train and test sets down (note: you may change the size of training) 
df_tr = df_tr.sample(n=1000, random_state=928)
print(df_tr.shape) # check dimensions
df_te = df_te.sample(n=500, random_state=2755)
print(df_te.shape) # check dimensions
df_te.iloc[1, 0] # sample movie review

(1000, 2)
(500, 2)


'... Once. "Manos, the Hands of Fate." That was worse than this, quite a bit worse: but it did have one thing: it had beautiful women in negligees wresting each other -- for about 20 minutes. This has a fat 45 year-old with 3 tits and a tail, in a cantina scene cloned directly from "Star Wars." Not to mention an obese, blue seductress Uhura, her fat legs and ass hanging out of some sort of insane bird costume, in this Method Acting Mess. She always wanted to perform before a "captive audience"? She must have meant the poor slobs who shelled out 8 bucks hoping to see another "Wrath of Khan," or at least a "Voyage Home." Captive" is right. I wonder how many people in the theaters tried to slit their wrists while crying out: "mother, make it stop."<br /><br />No question about it, "Final Frontier" is not just an unmitigated disaster, it\'s cruel and unusual punishment. This is Star Trek from hell. This is Shatner on mushrooms -- or maybe peyote. This is Where No Man Has Gone Before and Wi

In [96]:
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    

### Train BERT (Note: this may take a considerable amount of time. You may modify the size of training if too computationally intensive)

In [97]:
from transformers import TrainingArguments, Trainer, BertTokenizer, BertForSequenceClassification

# Function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Load the dataset (assuming df_tr is your loaded DataFrame)
texts = df_tr['review'].tolist()
labels = df_tr['sentiment'].apply(lambda x: 1 if x == 'pos' else 0).tolist()

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
ml = 150  # maximum sequence length in tokens (originally 128)
tokenized_dataset = tokenizer(texts, padding=True, truncation=True, max_length=ml)

# Splitting the dataset into training and validation sets
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_masks, val_masks, train_labels, val_labels = train_test_split(
    tokenized_dataset['input_ids'], tokenized_dataset['attention_mask'], labels, test_size=0.2
)

# Creating dataset objects for training and validation
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset({'input_ids': train_texts, 'attention_mask': train_masks}, train_labels)
val_dataset = IMDbDataset({'input_ids': val_texts, 'attention_mask': val_masks}, val_labels)

In [112]:
# Load BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.001,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch",  # Ensure models are saved at each epoch
    evaluation_strategy="epoch",  # Evaluate at each epoch
    optim="adamw_torch",  # Use the recommended optimizer
)


# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.6946 | 0.695629 |
| 2 | 0.6122 | 0.598605 |
| 3 | 0.4625 | 0.477729 |
| 4 | 0.3231 | 0.462215 |
| 5 | 0.2025 | 0.529964 |
| 6 | 0.0452 | 0.856025 |
| 7 | 0.0954 | 0.819184 |
| 8 | 0.0059 | 1.084927 |
| 9 | 0.0184 | 1.102773 |
| 10 | 0.1249 | 0.92962 |


TrainOutput(global_step=500, training_loss=0.27394897166267035, metrics={'train_runtime': 184.7931, 'train_samples_per_second': 43.292, 'train_steps_per_second': 2.706, 'total_flos': 616666536000000.0, 'train_loss': 0.27394897166267035, 'epoch': 10.0})
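
The logged losses above show validation loss reaching its minimum at epoch 4 and rising afterward while training loss keeps falling, which is a typical sign of overfitting. As a rough illustration, the sketch below replays the recorded validation losses through a simple patience-based early-stopping rule. The function name and the patience value of 2 are assumptions for illustration, not part of the provided code.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log above
val_losses = [0.695629, 0.598605, 0.477729, 0.462215, 0.529964,
              0.856025, 0.819184, 1.084927, 1.102773, 0.92962]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 6 4
```

In Hugging Face `transformers`, comparable behavior is available through `EarlyStoppingCallback` together with `load_best_model_at_end=True` in `TrainingArguments`; the replay here only illustrates the logic.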

In [113]:
# Evaluate the model on the validation set
predictions = trainer.predict(val_dataset)
val_accuracy = accuracy_score(val_labels, predictions.predictions.argmax(-1))
print(f"Validation Accuracy: {val_accuracy}")

Validation Accuracy: 0.8


### Evaluate model on test set

In [114]:
test_texts = df_te['review'].tolist()
test_labels = df_te['sentiment'].apply(lambda x: 1 if x == 'pos' else 0).tolist()

# Tokenize the test data
test_encodings = tokenizer(test_texts, padding=True, truncation=True, max_length=ml)

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = {key: torch.tensor(val) for key, val in encodings.items()}
        self.labels = torch.tensor(labels)

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]  
        return item

    def __len__(self):
        return len(self.labels)

test_dataset = IMDbDataset(test_encodings, test_labels)

In [115]:
# Predictions
test_predictions = trainer.predict(test_dataset)
test_accuracy = accuracy_score(test_labels, test_predictions.predictions.argmax(-1))
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.84


<font color = 'blue'>

*Run the below code to get the test accuracy. Then, modify the code to try getting a higher test accuracy (e.g., adjusting hyperparameters, further model tweaking, data augmentation, etc.). Specify what you modified.* 

The original validation accuracy was 0.83 and the original test accuracy was 0.802. 

First, I gradually increased the number of epochs. With 4 epochs, my accuracies were (.815, .822). I then increased my batch size to 16, which gave me accuracies of (.826, .796). Because of this, I decreased my batch size to 10 and my weight decay to .001, which changed my scores to (.825, .818). I went down one epoch to see if there was any large difference, but there was not. I then increased my weight decay back to .01 and got (.85, .828). 

Then, I changed the maximum token length from 128 to 150. However, this only decreased my test accuracy to .818. (A length of 216 slowed down my computer a lot, and I decided it was not worth trying.)

I realized at this point that I could do data augmentation, so I used this open-source code for synonym replacement: https://github.com/jasonwei20/eda_nlp/tree/master. My initial results with this replacement and the default settings were (.78, .82), which was not a great improvement. 

Lastly, I followed the usual rule of thumb for experiments like this, where results generally improve with a larger batch size, a lower learning rate/weight decay, and more epochs. With this, I got my best test accuracy, .84. 
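
Many of the differences reported above are a percentage point or two, measured on a test sample of 500 reviews. As a rough check on whether such gaps are meaningful, the sketch below computes the binomial standard error of an accuracy estimate. It assumes the test reviews are independent draws, which is an approximation, and the helper name is illustrative.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

n_test = 500  # size of the subsampled test set used in this notebook

# Original and best test accuracies reported in the write-up
for acc in (0.802, 0.84):
    se = accuracy_standard_error(acc, n_test)
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"acc={acc:.3f}  se={se:.4f}  approx. 95% interval=({low:.3f}, {high:.3f})")
```

With 500 test reviews the standard error is roughly 0.016 to 0.018, so differences of one or two points between runs are within sampling noise, and comparisons between settings should be read with that in mind.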

c) Perform sentiment analysis using GPT-3.5-turbo, gpt-4o, o1-mini, and o3-mini and get the test accuracy. Evaluate their performance by comparing test accuracies. (If you get a rate limit error, just use 4o)

**Note: DO NOT try to run advanced models on the entire test set initially.** Be mindful of API usage limits and costs associated with the advanced models APIs. Start with a smaller subset of your test set to ensure your implementation is correct before scaling up. 
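
One way to follow this advice is to score a small random subset first and scale up only once the pipeline behaves as expected. The sketch below is a minimal version of that idea; `evaluate_on_subset` and `keyword_classifier` are hypothetical names, and the keyword classifier is only an offline stand-in for an API-backed function so that the example runs without any API calls.

```python
import random

def evaluate_on_subset(texts, labels, classify, n, seed=0):
    """Score `classify` on a random subset of n examples.

    Returns (accuracy, indices) so the same subset can be reused or inspected.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(texts)), min(n, len(texts)))
    correct = sum(classify(texts[i]) == labels[i] for i in indices)
    return correct / len(indices), indices

# Hypothetical stand-in for an API-backed classifier (assumption for illustration)
def keyword_classifier(text):
    return 1 if "great" in text.lower() else 0

demo_texts = ["A great film.", "Terrible pacing.",
              "Great cast, great script.", "I want my money back."]
demo_labels = [1, 0, 1, 0]

accuracy, used = evaluate_on_subset(demo_texts, demo_labels, keyword_classifier, n=2)
print(accuracy, sorted(used))
```

Fixing the seed keeps the subset identical across models, so their accuracies are compared on the same reviews.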

In [49]:

from dotenv import load_dotenv
# Load environment variables from the .env file
load_dotenv()

# Access the OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# Use the API key
if openai_api_key:
    print("OpenAI API Key loaded successfully!")
else:
    print("OpenAI API Key not found. Please check your .env file.")


OpenAI API Key loaded successfully!


In [50]:

# Set up the OpenAI client
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


With BERT, we had to pass in tokenized input, but with OpenAI we can pass in raw text. 

In [None]:
#Test Dataset - for reference
test_texts = df_te['review'].tolist()
test_labels = df_te['sentiment'].apply(lambda x: 1 if x == 'pos' else 0).tolist()


# GPT 3.5 Turbo

In [None]:

model_type = "gpt-3.5-turbo"


def sentiment(txt): 
    prompt = (
        "Determine the sentiment of the given text.\n"
        "Answer only with 'positive' or 'negative'\n"
        f"Review: \"{txt}\""
    )
    # Make a chat completion request
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a sentiment analysis assistant."}, 
            {"role": "user", "content": prompt}
        ],
        model=model_type,  # Specify the model
        temperature = 0, 
        max_tokens = 2
    )
    sentiment = chat_completion.choices[0].message.content.strip().lower()
    return 1 if sentiment == 'positive' else 0 



Testing on:  gpt-3.5-turbo


In [59]:
print("Testing on: ", model_type)
gpt_results = df_te['review'].apply(sentiment)

Testing on:  gpt-3.5-turbo


In [60]:
acc_results = accuracy_score(test_labels, gpt_results)
print("Test results for ", model_type, ": ", acc_results)


Test results for  gpt-3.5-turbo :  0.938


# GPT-4o 

In [62]:

model_type = "gpt-4o"

def sentiment(txt): 
    prompt = (
        "Determine the sentiment of the given text.\n"
        "Answer only with 'positive' or 'negative'\n"
        f"Review: \"{txt}\""
    )
    # Make a chat completion request
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a sentiment analysis assistant."}, 
            {"role": "user", "content": prompt}
        ],
        model=model_type,  # Specify the model
        temperature = 0, 
        max_tokens = 2
    )
    sentiment = chat_completion.choices[0].message.content.strip().lower()
    return 1 if sentiment == 'positive' else 0 


print("Testing on: ", model_type)
gpt_results = df_te['review'].apply(sentiment)

acc_results = accuracy_score(test_labels, gpt_results)
print("Test results for ", model_type, ": ", acc_results)


Testing on:  gpt-4o
Test results for  gpt-4o :  0.946


# GPT o1-mini

In [70]:

model_type = "o1-mini"

def sentiment(txt): 
    prompt = (
        "Determine the sentiment of the given text.\n"
        "Answer only with 'positive' or 'negative'.\n"
        f"Review: \"{txt}\""
    )
    # Make a chat completion request without a system message
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": prompt}
        ],
        model=model_type,  # Specify the model
        temperature=1, 
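        # Note: for reasoning models this limit also counts hidden reasoning tokens,
        # so a budget of 2 may leave no room for a visible answer.
        max_completion_tokens =2 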
    )
    sentiment_response = chat_completion.choices[0].message.content.strip().lower()
    return 1 if sentiment_response == 'positive' else 0 


print("Testing on: ", model_type)
gpt_results = df_te['review'].apply(sentiment)

acc_results = accuracy_score(test_labels, gpt_results)
print("Test results for ", model_type, ": ", acc_results)


Testing on:  o1-mini
Test results for  o1-mini :  0.522


# GPT o3-mini 

In [72]:

model_type = "o3-mini"

def sentiment(txt): 
    prompt = (
        "Determine the sentiment of the given text.\n"
        "Answer only with 'positive' or 'negative'\n"
        f"Review: \"{txt}\""
    )
    # Make a chat completion request
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a sentiment analysis assistant."}, 
            {"role": "user", "content": prompt}
        ],
        model=model_type,  # Specify the model
        temperature = 1, 
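        # Note: for reasoning models this limit also counts hidden reasoning tokens,
        # so a budget of 3 may leave no room for a visible answer.
        max_completion_tokens = 3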
    )
    sentiment = chat_completion.choices[0].message.content.strip().lower()
    return 1 if sentiment == 'positive' else 0 



print("Testing on: ", model_type)
gpt_results = df_te['review'].apply(sentiment)

acc_results = accuracy_score(test_labels, gpt_results)
print("Test results for ", model_type, ": ", acc_results)

Testing on:  o3-mini
Test results for  o3-mini :  0.522


<font color = 'blue'>

Results: 

- 3.5 Turbo: 0.938 
- 4o: 0.946 
- o1-mini: 0.522 
- o3-mini: 0.522 

The accuracy results are cleanly split between GPT-4o/GPT-3.5 Turbo and o1-mini/o3-mini. These results are partly in line with expectations, as the "mini" models are presumably smaller than 4o/3.5 Turbo and are generally expected to perform worse than their bigger counterparts. Upon research, o1-mini is meant to excel specifically in STEM-related subjects such as math and coding, and o3-mini is also STEM focused, emphasizing logical reasoning challenges rather than the sentiment nuances of language. 3.5 Turbo and 4o are large models that have been trained on a wide variety of corpora and problem types, so it makes sense that these models are better equipped to handle this specific task. However, an accuracy of 0.522 is essentially chance level, and the fact that both reasoning models score exactly 0.522 suggests the result may be an artifact of the setup rather than of the models' abilities. For reasoning models, `max_completion_tokens` also counts hidden reasoning tokens, so a limit of 2 or 3 tokens can leave the visible reply empty, and the `sentiment` function maps anything other than 'positive' to 0. In that case, 0.522 would simply be the share of negative reviews in the test sample. 
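
One way to check whether the two identical 0.522 scores reflect model ability or unusable replies is to keep unparseable replies separate from "negative" (the `sentiment` functions above return 0 for any reply other than 'positive'). The sketch below is a minimal illustration. The label vector is hypothetical, since the real class balance of the 500-review sample is not shown in the notebook, and `parse_label` is an illustrative helper rather than part of the assignment code.

```python
def parse_label(reply):
    """Map a model reply to 1 (positive), 0 (negative), or None if unusable."""
    text = (reply or "").strip().lower()
    if text.startswith("positive"):
        return 1
    if text.startswith("negative"):
        return 0
    return None  # empty or off-format reply

def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)

# Hypothetical label vector: 261 negative and 239 positive reviews out of 500
labels = [0] * 261 + [1] * 239

# Simulate a model whose visible reply is always empty
empty_replies = [""] * len(labels)
parsed = [parse_label(r) for r in empty_replies]

# Treating every unusable reply as "negative" reproduces the negative-class share
forced_negative = [0 if p is None else p for p in parsed]
print(accuracy(labels, forced_negative))  # 0.522
print(sum(p is None for p in parsed))     # 500
```

Counting the `None` replies before computing accuracy makes it clear whether a low score comes from wrong answers or from missing ones.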

d) For the task of language translation, do you expect BERT or GPT to perform better? Explain why in detail. Additionally, discuss the primary challenges associated with implementing each model for translation tasks.

<font color = 'blue'>
For the task of language translation, I expect GPT to perform "better". This is partly because GPT has generally been trained on more data than BERT, but also because GPT can produce several alternative translations and nuances when forming an answer. Unlike sentiment analysis, which is scored by simple accuracy, translation has relied on automatic metrics like the BLEU score, and even then translation is dynamic and nuanced. While it might seem that BERT would build better meanings and understandings thanks to its bidirectional attention mechanism, language translation is fundamentally a form of language generation. We are not trying to construct a representation of the original text in language one; we are trying to generate an entirely new text in language two. In that sense, I see language translation as a generation task, because there are multiple valid translations of a word like "good" in any language. 