## Bert & GPT-2

### ECE590 Homework assignment 5
Name: Javier Cervantes

net id: jc1010


## 1. We will use the BERT model to perform sentiment analysis. 

The sentiment is performed via a linear model applied to the output vector above the [CLS] input in BERT (that output vector is referred to as C).

Use a pretrained BERT model from:
https://github.com/google-research/bert

And use sentiment data from:
https://huggingface.co/datasets/yelp_polarity

Build a sentiment-analysis model based on BERT, using the above data. Do a detailed analysis of performance, and compare the accuracy of this model to results you achieved with the simpler baseline model from the prior homework.

Implement the model two ways:

(a) Leave all BERT parameters unchanged, and just learn the linear model at the output.

(b) Fine-tune all BERT parameters, while also learning the linear layer at the output.

In your solution, provide all code and also a detailed summary of the analysis of the results.


I'll start by importing the necessary libraries and downloading the data. Bert has several models available to the public. For this assignment, I'll use the medium Bert uncased model. Additionally, I've been forced to limit the `max sequence length` of each observation to 128 tokens otherwise I'll run out of memory.

In [1]:
from transformers import BertTokenizerFast, BertModel

# existing bert models
bert_models = {
    "base": "google/bert-base-uncased",
    "small": "google/bert_uncased_L-4_H-512_A-8",
    "medium": "google/bert_uncased_L-8_H-512_A-8",
}

# pick a bert model
bert_model_name = "medium"

# load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained(bert_models[bert_model_name])
bert_model = BertModel.from_pretrained(bert_models[bert_model_name])

In the following cell, I download the `yelp_polarity` dataset, perform train, test and validation splits as well as tokenizing the data in the same way as the `bert` model was trained.

In [2]:
from torch.utils.data import DataLoader
import torch
from datasets import load_dataset
import numpy as np

seed = 257

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

# load the yelp_polarity dataset
train_data, test_data = load_dataset("yelp_polarity", split=["train", "test"])

max_length = 128


# define a function to tokenize the dataset
def tokenize(batch):
    return tokenizer(
        batch["text"][:max_length], padding="max_length", max_length=max_length
    )


# tokenize the dataset
train_data = train_data.map(tokenize)
test_data = test_data.map(tokenize)

# convert the dataset to pytorch tensors
train_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# validation data
train_valid_data = train_data.train_test_split(test_size=0.25)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

# create a dataloader
train_loader = DataLoader(train_data, batch_size=256, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=256)
test_loader = DataLoader(test_data, batch_size=256)

Map: 100%|██████████| 560000/560000 [03:24<00:00, 2744.46 examples/s]
Map: 100%|██████████| 38000/38000 [00:13<00:00, 2738.98 examples/s]


In the following cell I've defined a `BertClassifier` class which takes the bert model and adds a linear layer on top of it. We get the embedding dimension size (called the `hidden_size`) from the transformer via its config attribute. That will serve as the input to the linear layer.

In [3]:
from torch import nn
from torch import optim
import tqdm


# Define a new model with BERT and a linear layer on top for classification
class BertClassifier(nn.Module):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert_model = bert_model
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = self.classifier(outputs.pooler_output)
        return logits

In [8]:
def train(model, criterion, optimizer, train_loader, device):
    model.train()
    epoch_losses = []
    epoch_accuracies = []
    for i, batch in tqdm.tqdm(
        enumerate(train_loader), desc="Training ...", total=len(train_loader)
    ):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        accuracy = (outputs.argmax(1) == labels).float().mean()
        epoch_losses.append(loss.item())
        epoch_accuracies.append(accuracy.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return np.mean(epoch_losses), np.mean(epoch_accuracies)


def validation(model, criterion, valid_loader, device):
    model.eval()
    epoch_losses = []
    epoch_accuracies = []
    for i, batch in tqdm.tqdm(
        enumerate(valid_loader), desc="Validation ...", total=len(valid_loader)
    ):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            accuracy = (outputs.argmax(1) == labels).float().mean()
            epoch_losses.append(loss.item())
            epoch_accuracies.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accuracies)


def test_model(model, criterion, test_loader, device):
    model.eval()
    epoch_losses = []
    epoch_accuracies = []
    for i, batch in tqdm.tqdm(
        enumerate(test_loader), desc="Testing ...", total=len(test_loader)
    ):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            accuracy = (outputs.argmax(1) == labels).float().mean()
            epoch_losses.append(loss.item())
            epoch_accuracies.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accuracies)


def train_model(
    model, criterion, optimizer, train_loader, valid_loader, num_epochs, device
):
    best_valid_loss = float("inf")
    for epoch in range(num_epochs):
        train_loss, train_accuracy = train(
            model, criterion, optimizer, train_loader, device
        )
        valid_loss, valid_accuracy = validation(model, criterion, valid_loader, device)
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"Training loss: {train_loss:.4f}")
        print(f"Training accuracy: {train_accuracy:.4f}")
        print(f"Validation loss: {valid_loss:.4f}")
        print(f"Validation accuracy: {valid_accuracy:.4f}")
        print("-" * 50)
        params_grad = [param.requires_grad for param in model.bert_model.parameters()]
        if False in params_grad:
            full_freeze = "freeze"
        else:
            full_freeze = "full"
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(
                model.state_dict(),
                f"../models/bert_{bert_model_name}_{full_freeze}_{max_length}max.pt",
            )

### a) Leave all BERT parameters unchanged, and just learn the linear model at the output:

For this part of the assignment we'll set the `param.requires_grad` to False for all the parameters of the bert model. This way we'll only train the linear layer on top of the bert model.

In [7]:
# create the model
model = BertClassifier(bert_model, num_classes=2)

# freeze the BERT parameters so that we only learn the linear classifier
for param in model.bert_model.parameters():
    param.requires_grad = False

In [8]:
# define the hyperparameters
lr = 2e-5
num_epochs = 5

# loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# move the model and loss function to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model = model.to(device)
criterion = criterion.to(device)

cuda


In [31]:
# train the model
train_model(
    model,
    criterion,
    optimizer,
    train_loader,
    valid_loader,
    num_epochs=num_epochs,
    device=device,
)

Training ...: 100%|██████████| 1641/1641 [10:36<00:00,  2.58it/s]
Validation ...: 100%|██████████| 547/547 [03:21<00:00,  2.72it/s]


Epoch 1/5
Training loss: 0.6224
Training accuracy: 0.6697
Validation loss: 0.5650
Validation accuracy: 0.7306
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [10:37<00:00,  2.57it/s]
Validation ...: 100%|██████████| 547/547 [03:21<00:00,  2.72it/s]


Epoch 2/5
Training loss: 0.5518
Training accuracy: 0.7322
Validation loss: 0.5223
Validation accuracy: 0.7497
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [10:38<00:00,  2.57it/s]
Validation ...: 100%|██████████| 547/547 [03:21<00:00,  2.72it/s]


Epoch 3/5
Training loss: 0.5242
Training accuracy: 0.7454
Validation loss: 0.5025
Validation accuracy: 0.7582
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [10:38<00:00,  2.57it/s]
Validation ...: 100%|██████████| 547/547 [03:21<00:00,  2.72it/s]


Epoch 4/5
Training loss: 0.5099
Training accuracy: 0.7523
Validation loss: 0.4908
Validation accuracy: 0.7643
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [10:38<00:00,  2.57it/s]
Validation ...: 100%|██████████| 547/547 [03:21<00:00,  2.72it/s]


Epoch 5/5
Training loss: 0.5009
Training accuracy: 0.7570
Validation loss: 0.4831
Validation accuracy: 0.7682
--------------------------------------------------


In [10]:
# test the model
model.load_state_dict(torch.load(f"../models/bert_medium_freeze_256max.pt"))

test_loss, test_accuracy = test_model(model, criterion, test_loader, device)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
print("-" * 50)

Testing ...: 100%|██████████| 149/149 [00:25<00:00,  5.74it/s]

Test loss: 0.5489
Test accuracy: 0.7172
--------------------------------------------------





### b) Fine-tune all BERT parameters, while also learning the linear layer at the output.

For this part of the assignment we'll set the `param.requires_grad` to True for all the parameters of the bert model. This way we'll train the linear layer on top of the bert model as well as the bert model itself.

In [5]:
# create the model
model = BertClassifier(bert_model, num_classes=2)

# define the hyperparameters
lr = 2e-5
num_epochs = 3

# loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# move the model and loss function to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model = model.to(device)
criterion = criterion.to(device)

cuda


In [7]:
# fine-tune the model
train_model(
    model,
    criterion,
    optimizer,
    train_loader,
    valid_loader,
    num_epochs=num_epochs,
    device=device,
)

Training ...:   0%|          | 0/1641 [00:00<?, ?it/s]

Training ...: 100%|██████████| 1641/1641 [13:30<00:00,  2.02it/s]
Validation ...: 100%|██████████| 547/547 [01:38<00:00,  5.54it/s]


Epoch 1/3
Training loss: 0.3497
Training accuracy: 0.8354
Validation loss: 0.2978
Validation accuracy: 0.8620
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [13:26<00:00,  2.03it/s]
Validation ...: 100%|██████████| 547/547 [01:38<00:00,  5.56it/s]


Epoch 2/3
Training loss: 0.2845
Training accuracy: 0.8703
Validation loss: 0.2847
Validation accuracy: 0.8713
--------------------------------------------------


Training ...: 100%|██████████| 1641/1641 [13:26<00:00,  2.03it/s]
Validation ...: 100%|██████████| 547/547 [01:38<00:00,  5.56it/s]


Epoch 3/3
Training loss: 0.2541
Training accuracy: 0.8859
Validation loss: 0.2793
Validation accuracy: 0.8757
--------------------------------------------------


In [9]:
# test the best model
model.load_state_dict(torch.load(f"../models/bert_medium_full_128max.pt"))

test_loss, test_accuracy = test_model(model, criterion, test_loader, device)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
print("-" * 50)

Testing ...: 100%|██████████| 149/149 [00:25<00:00,  5.74it/s]

Test loss: 0.2790
Test accuracy: 0.8782
--------------------------------------------------





### Model Evaluations

The model that only learns the linear parameters had underwhelming performance while taking a considerable amount of time to train. This model generated an test accuracy of 72% whereas the model from the previous homework (the model that learned the sentiment embeddings) achieved an accuracy of 89%. 

The model that fine-tuned the bert parameters as well as the linear layer had a much better performance. This model achieved an accuracy of 88% in 3 epochs (with no signs of overfitting up to that point). This model also took a considerable amount of time and computational resources to train. Given that it barely reached the same performance as the previous homework's model, I would argue that the computational resources are not worth the performance gain.

Having said this, there is a very import mention that needs to be raised: we had to limit the max length of the sequence to 128 tokens due to computational limitations. Even in the previous homework, limiting the max sequence length had considerable a negative impact in the model's performance. Another compromise we had to make was that we used the `medium` bert instead of `base` bert. This model was optimized on far fewer parameters. I'm sure both of these compromises had a negative impact on the performance of the model.

## 2. This task investigates database retrieval via BERT.

Consider a database of your choosing, consisting of a large database of documents. Examples are (but use what you want):

https://www.kaggle.com/datasets/rowhitswami/nips-papers-1987-2019-updated
https://commoncrawl.org/
https://www.kaggle.com/datasets/crawford/20-newsgroups

Using BERT, encode each of the documents in your corpus to a vector. Do this with “out of the box” BERT, with no changes to the base model (taken from the above GitHub)

Build code that, given a question or prompt, will pull appropriate documents from your database. This also should be done with BERT applied to the prompt, and using that write code that pulls from the database the top-10 best matches, ranked by probability of match.

In your solution, provide all code and also a detailed summary of the analysis of the results.


In [2]:
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups dataset
newsgroups = fetch_20newsgroups(subset="all")
documents = newsgroups.data

In [19]:
from tqdm import tqdm

# Check if a GPU is available and if not, use a CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the GPU if one is available
bert_model = BertModel.from_pretrained(bert_models[bert_model_name])

bert_model.to(device)

# This will hold all the document vectors
document_vectors = []

# Wrap your loop with tqdm for a progress bar
for document in tqdm(documents):
    # Tokenize the document, return_tensors='pt' tells the tokenizer to return PyTorch tensors
    inputs = tokenizer(
        document, return_tensors="pt", truncation=True, padding=True, max_length=512
    )

    # Move the inputs to the GPU if one is available
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

    # Get the BERT embeddings for the document
    with torch.no_grad():
        outputs = bert_model(**inputs)

    # Use the embeddings of the [CLS] token as the document vector
    document_vector = outputs.last_hidden_state[:, 0, :]
    document_vectors.append(document_vector.to("cpu").numpy())

100%|██████████| 18846/18846 [02:58<00:00, 105.82it/s]


In [3]:
import pickle

# with open('../models/document_vectors.pkl', 'wb') as f:
#     pickle.dump(document_vectors, f)

with open("../models/document_vectors.pkl", "rb") as f:
    document_vectors = pickle.load(f)

In [4]:
import torch
import numpy as np


# given a prompt, find the 10 most similar documents
def find_similar_documents(prompt, document_vectors, documents, tokenizer, bert_model):
    # Tokenize the prompt
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, padding=True, max_length=512
    )

    # Get the BERT embeddings for the prompt
    with torch.no_grad():
        outputs = bert_model(**inputs)

    # Use the embeddings of the [CLS] token as the prompt vector
    prompt_vector = outputs.last_hidden_state[:, 0, :]

    # Calculate the cosine similarity between the prompt vector and all document vectors
    similarities = np.dot(document_vectors, prompt_vector.T)

    # Sort the documents by their similarity to the prompt
    most_similar = np.argsort(np.squeeze(similarities), axis=0)[::-1]

    # Convert documents to an array
    documents = np.array(documents)

    # Return the 10 most similar documents
    return [documents[i] for i in most_similar[:10]]

In [5]:
# load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained(bert_models[bert_model_name])
bert_model = BertModel.from_pretrained(bert_models[bert_model_name])

I've chosen to work with the 20 Newsgroups Datset.

In [6]:
# given a prompt, find the 10 most similar documents
prompt = "Who do you pick: Islanders, Penguins or Bruins?"
similar_documents = find_similar_documents(
    prompt, document_vectors, documents, tokenizer, bert_model
)
for i in similar_documents:
    print(i)
    print("-" * 50)
    print("\n")

From: cs902060@ariel.yorku.ca (GEOFFREY E DIAS)
Subject: How does a pitcher get a save?
Organization: York University, Toronto, Canada
Lines: 4


	The subject line says it all. What is the rule that qualifies
a pitcher as making a save?


--------------------------------------------------


From: enolan@sharkbite.esd.sgi.com (Ed Nolan)
Subject: Devils and Islanders tiebreaker????
Organization: Silicon Graphics, Inc.
Lines: 4
Nntp-Posting-Host: sharkbite.esd.sgi.com

If the Islanders beat the Devils tonight, they would finish with
identical records.  Who's the lucky team that gets to face the Penguins
in the opening round?   Also, can somebody list the rules for breaking
ties.

--------------------------------------------------


From: cal2d@csissun11.ee.Virginia.EDU (Craig Allen Lorie)
Subject: Re: Devils and Islanders tiebreaker????
Organization: University of Virginia
Lines: 15

According to the hockey gurus over at ESPN, should the Islanders win tonite
the two teams will have the sa

> From the previous results, we can see that our document retrieval function is on the right track: it correctly identifies that I'm talking about sports and not about religion, politics, computers or any of the other subjects included in my dataset.
>
> Having said that, I'm a bit sceptical about the results. It returned documents from baseball as well as hockey. Given that these are presumably the 10 most likely documents to match my prompt, I can say that I'm somewhat underwhelmed by the results. I would have expected the documents to be more closely related to the prompt. 
>
> As compared to the task in question 1: here we still used the Bert Medium model but didn't limit the max sequence length (at least not below what the model can take). There is definitely some loss of accuracy from using the medium model as compared to using the base model but we're still dealing with a pretty large model.

## 3. Use the GPT-2 code from:

https://huggingface.co/openai-community/gpt2
or

https://github.com/openai/gpt-2

Using results from Problem 2, put as input to GPT2 the prompt alone, and evaluate the quality of the answer. Separately, input the prompt plus the document pulled for the database. Examine and evaluate the (somewhat subjective) quality of the generated text, with and without the context provided by the pulled document.

Since the context length of GPT2 is limited, you may have to consider smaller-size documents in your context database. This can be done by considering smaller documents (e.g., abstracts from the NeurIPS) database, or breaking the larger documents into smaller pieces (and encoding each to a vector via BERT).

In your solution, provide all code and also a detailed summary of the analysis of the results. This part will require a descent level of software/coding expertise. Teaming is fine. For some this part of the assignment may be a “reach;” do your best, and focus on learning from this.


> First we shall input the prompt alone and evaluate the quality of the answer:

In [31]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def ask_gpt2(prompt, max_length=50, num_return_sequences=5):
    # initialize tokenizer and model
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    # encode context the generation is conditioned on
    encoding = tokenizer.encode_plus(prompt, return_tensors="pt")
    input_ids = encoding["input_ids"]

    # create attention mask
    attention_mask = encoding["attention_mask"]

    # generate text until the output length (which includes the context length) reaches 50
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        # max_length=max_length + input_ids.shape[-1],
        max_new_tokens=max_length + input_ids.shape[-1],
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        do_sample=True,
        temperature=0.7,
    )

    # decode the output and use .shape[-1] so it doesn't return the prompt
    return [
        tokenizer.decode(output[i][input_ids.shape[-1] :], skip_special_tokens=True)
        for i in range(num_return_sequences)
    ]


In [None]:
# insert line breaks to make the output more readable
def insert_newlines(string, every=64):
    return "\n".join(string[i : i + every] for i in range(0, len(string), every))


In [38]:
answer = ask_gpt2(prompt, max_length=1024, num_return_sequences=1)[0]
print(insert_newlines(answer, every=64))




I know who the Islanders are and I'd pick them. I just like th
e way they play. They've been really good to me. You can watch a
 game and you would never know just how good they are. So, I kno
w that I can get a good look at them and then I will pick you up
 and pick me up.
.


> The answer seems quite reasonable. Not only is it answering my question, it's providing a reason for its answer. Sometimes the answer looks like decent English but doesn't really make sense in the context of what I'm asking but overall, it's decent.
>
> Now we shall input the prompt plus the documents pulled from the database and evaluate the quality of the generated text:

In [17]:
# concatenate the prompt and the pulled documents
new_prompt = prompt + " " + " ".join(similar_documents[:3])

In [44]:
answer = ask_gpt2(new_prompt, max_length=1024, num_return_sequences=1)[0]
print(answer)


. From the point of view of a fan who is about to watch the game, I would be
truly surprised if the tiebreakers could be broken.

-From: rwj@hockey-reference.org (RwJ)



A few thoughts on the teams:

--The Penguins and Devils are the two remaining teams in the series,
(there are only 16 games left, so the playoff spot is
*not* theirs.)

"Bold" has been a popular term for them since 2000, and
"brave" seems to be the most popular word in general. It has gone from being used as
-a
'very' common phrase in hockey to being a'very important word
and' frequently used by NHL managers to mean 'it's very important'
. The team is often used to describe the players who are supposed to take the lead, or to
...


I don't think it's as simple as "in order to win, every team has to play one game before
being eliminated by the other team".

(Also, there is no point in telling anyone that the New York Islanders
would win the second round of the playoffs, as the Isles are
 still opponent #2 in this serie

> The answers when adding the top 3 documents (I couldn't use more than 3 because I was running into some IndexErrors that I couldn't solve) to the original prompt now look exactly like the messages in the documents. They seem like email answers. You can even see that the output sometimes contains email addresses. Even the format of the text is similar to that contained in the documents (lines are much shorter and the use of `\n` is quite evident). 
>
> As such, the output doesn't contain an answer to my question. That might be understandable given that my question was a small piece of the new prompt which contained 3 entire documents.
>
> Having said that, I still see some hallucinations in the output.