# Artificial Intelligence
# 464/664
# Assignment #8

## General Directions for this Assignment

00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Output format should be exactly as requested (it is your responsibility to make sure notebook looks as expected on Gradescope),
02. Check submission deadline on Gradescope,
03. Rename the file to Last_First_assignment_8,
04. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
05. Do not submit any other files.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".

## Language Modeling

This homework will require you to load and train models.  If you choose small models and datasets, you should be able to run this locally on your computer. However, larger models/datasets may require GPU access. You can access one GPU for free on [Google Colab](https://colab.research.google.com/).

We will use HuggingFace libraries in this assignment. We discussed the majority of what you will need during the discussion demo. Additional documentation can be found [here](https://huggingface.co/docs).

In [1]:
%pip install evaluate
%pip install transformers
%pip install tqdm

You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Imports
import torch
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, AutoModelForCausalLM
import os
from tqdm import tqdm
os.environ["WANDB_DISABLED"] = "true"

from torch.utils.data import DataLoader
from torch.optim import AdamW

## Problem 0: Data
From the [HuggingFace Datasets](https://huggingface.co/datasets), choose a dataset that satisfies the following criteria:
- Data must have train and test splits (Optional development set)
- Task must be text classification
- Task must have at least 3 labels


In [3]:
ds = load_dataset("ag_news")

Found cached dataset parquet (/Users/Henry/.cache/huggingface/datasets/parquet/ag_news-9af2a5926861d22a/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|██████████| 2/2 [00:00<00:00, 81.58it/s]


<h1>Describe the data.</h1>
What is the utility of the task? What are the inputs? What are the labels? Are there any potential difficulties you expect from the task? How do you evaluate the performance of this task?

The dataset is a collection of news items, each a headline followed by a short description, classified into one of 4 categories: World, Sports, Business, or Science/Technology. The input is the text of the item, and the output is its category.

I think a few things could conceivably cause the BERT and GPT-2 models trouble. First, the labels themselves are somewhat ambiguous: there may be items in the dataset that sit at the intersection of two categories.

Also, I'm not sure how much domain-specific knowledge GPT-2 has, and it may need to know some medical technology or business terminology to classify correctly.

For GPT-2 in particular, there is the extra difficulty of parsing the response correctly, in addition to the main task of classifying the item.

I think evaluating performance should be pretty straightforward: we can measure accuracy, the percentage of items that the model classifies correctly.
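
As a minimal sketch of that metric, the snippet below computes accuracy from two lists of label ids. The helper name `accuracy` and the toy labels are illustrative and not part of the notebook's own code.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Toy example using the four AG News label ids (0-3); not real model output.
toy_preds = [0, 1, 2, 3, 0]
toy_labels = [0, 1, 2, 1, 0]
print(f"accuracy = {accuracy(toy_preds, toy_labels):.2f}")
```

Because the four classes are roughly balanced, plain accuracy is a reasonable summary here; with skewed classes a per-class breakdown would be more informative.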

<h1>Research current methods using this dataset.</h1>
What is the current state of the art method? Describe the method, including the type of model used, training protocol (if any), and the performance. Cite your sources.


The current state of the art is XLNet, an autoregressive Transformer model, which gets an error rate of ~4.45%. Like BERT, XLNet learns bidirectional context, meaning that the prediction for a token can depend on tokens both to its left and to its right.

It differs from BERT in how it gets there. Instead of masking tokens and predicting them, as BERT does, XLNet uses permutation language modeling: it predicts tokens autoregressively under randomly sampled orderings of the sequence, so that across many orderings each token is predicted from context on both sides. XLNet also borrows the segment-level recurrence mechanism of Transformer-XL, which caches hidden states from earlier segments so the model can better relate words that are far apart.

However, the runner-up is a BERT model, BERT-ITPT-FiT, which also does very well, with an error rate of ~4.8%.
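
The idea of predicting tokens under different orderings can be illustrated with a small sketch. This is a simplified picture of permutation language modeling, not XLNet's actual implementation (which uses two-stream attention and predicts only part of each sequence); the function name `permutation_contexts` and the example sentence are made up for illustration.

```python
import random


def permutation_contexts(tokens, order):
    """For each step of a factorization order, return the target position,
    the target token, and the (position, token) pairs visible as context."""
    steps = []
    for step, target_pos in enumerate(order):
        visible = sorted(order[:step])  # positions predicted at earlier steps
        context = [(pos, tokens[pos]) for pos in visible]
        steps.append((target_pos, tokens[target_pos], context))
    return steps


tokens = ["new", "york", "is", "a", "city"]
rng = random.Random(0)
order = list(range(len(tokens)))
rng.shuffle(order)  # one randomly sampled factorization order

for target_pos, target, context in permutation_contexts(tokens, order):
    print(f"predict {target!r} (position {target_pos}) from {context}")
```

Depending on the sampled order, a token can end up being predicted from tokens to its right as well as its left, which is how the model sees context on both sides without using a mask token.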

Sources
- https://paperswithcode.com/sota/text-classification-on-ag-news
- https://rbcborealis.com/research-blogs/understanding-xlnet/

(Optional) If necessary, perform any data preprocessing here. For example, depending on the dataset you choose, you may need to clean the text or split the training set into a train and validation set.

In [4]:
# hold out 20% of the dataset's "train" split to use as the test set
train_test_split = ds["train"].train_test_split(test_size=0.2)
train_ds = train_test_split["train"]
test_ds = train_test_split["test"]

# get a smaller subset so it doesn't take forever on my 2020 base model macbook:
train_subset = train_ds.select(range(10000))  # 10k examples
test_subset = test_ds.select(range(1000))    # 1k examples

label_to_category = {
  0: "World",
  1: "Sports",
  2: "Business",
  3: "Science/Technology"
}

## Problem 1: Encoder only models
Choose an encoder only model (e.g. BERT). Load the model and add a classification layer. 

Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

I chose the **RoBERTa** model (Robustly Optimized BERT Approach). It uses the same architecture as BERT but is pretrained on much more data (160GB of text vs 16GB for BERT) and for longer, with a few changes to the training procedure:
- Dynamic masking: a new masking pattern is generated each time a sequence is fed to the model, instead of a single fixed pattern created during preprocessing.
- No next sentence prediction: removes BERT's NSP task and focuses on masked language modeling instead.
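
To make dynamic masking concrete, here is a minimal sketch that contrasts it with static masking. It is a simplification: real BERT-style masking also sometimes keeps the token or swaps in a random one, and the helper `mask_tokens`, the 30% rate, and the toy sentence are assumptions for illustration only.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the cat sat on the mat near the door".split()

# Static masking: the pattern is drawn once and reused for every epoch.
static_view = mask_tokens(tokens, random.Random(0), mask_prob=0.3)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.3) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With dynamic masking the model gets different prediction targets for the same sentence across epochs, which matters more as the number of training passes grows.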

Pros:
RoBERTa beats BERT on most benchmarks and was state of the art on several of them when it was released. Its byte-level BPE tokenizer can encode any input text without falling back to an unknown token, so it handles words it has never seen better than BERT. It does well on classification tasks.

Cons:
Pretraining is much more computationally expensive than BERT's because of the extra data and training time. The model is also large enough that fine-tuning and inference are slow without a GPU, and the added size and complexity might make it overkill for some projects.

Sources:
- https://www.geeksforgeeks.org/overview-of-roberta-model/
- https://huggingface.co/FacebookAI/roberta-base

In [5]:
model_name = "prajjwal1/bert-tiny"  # tiny BERT - 4.4M parameters as opposed to 110M
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load the model and add a classification layer w/ 4 labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# tokenize data
train_encodings = tokenizer(train_subset['text'], truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(test_subset['text'], truncation=True, padding=True, max_length=128)

# create torch datasets
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_subset['label'])
test_dataset = NewsDataset(test_encodings, test_subset['label'])

# create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-4)  # Higher learning rate for faster training
print(f"Training on {device}")

# training loop
num_epochs = 2
for epoch in range(num_epochs):
    # training
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"\nAverage training loss: {avg_loss:.4f}")
    
    # evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for i, batch in enumerate(tqdm(test_loader, desc="Evaluating")):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask)
            predictions = outputs.logits.argmax(-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

            # print every 20th batch
            if i % 20 == 0:
                print(f"\nBatch {i}")
                # get the text for the first 3 examples in batch
                texts = tokenizer.batch_decode(input_ids[:3], skip_special_tokens=True)
                for j in range(len(texts)):
                    print("\nText:", texts[j])  # decoded input text (full length, not truncated)
                    print(f"Predicted: {label_to_category[predictions[j].item()]}")
                    print(f"True label: {label_to_category[labels[j].item()]}")
                print("-------------------")
    
    accuracy = correct / total
    print(f"Test accuracy: {accuracy:.4f}")

print("Training completed!")

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

Training on cpu


Epoch 1/2: 100%|██████████| 313/313 [01:50<00:00,  2.84it/s]



Average training loss: 0.4467


Evaluating:   3%|▎         | 1/32 [00:00<00:03,  8.70it/s]


Batch 0

Text: skype available for mac os x skype technologies sa has launched the beta version of skype for mac os x. the software can be downloaded for free and is available immediately.
Predicted: Science/Technology
True label: Science/Technology

Text: russia confirms plane crash terror link a widely - suspected terrorist connection became an official claim when russian authorities said they found traces of explosives in the wreckage of one of two passenger jets that crashed almost simultaneously, killing all 90 people on board.
Predicted: World
True label: World

Text: falmouth blanks dartmouth the 10th - ranked dartmouth boys'soccer team had given up only one goal this season, but yesterday, host falmouth scored two to hand the indians their first loss, 2 - 0.
Predicted: Sports
True label: Sports
-------------------


Evaluating:  72%|███████▏  | 23/32 [00:01<00:00, 13.62it/s]


Batch 20

Text: southwest bids \ $ 117m for ata assets southwest airlines # 39 ; package of \ $ 117 million in cash, loan payoffs and preferred stock purchases was selected as the winning bid at the bankruptcy court - approved auction for certain ata airlines inc.
Predicted: Business
True label: Business

Text: security expert warns internet pharmacies at risk from terrorists ( canadian press ) canadian press - toronto ( cp ) - a canadian - based security expert will tell a panel on internet pharmacies this week that mail - order drug companies could become targets for terrorists.
Predicted: Science/Technology
True label: World

Text: sec may act against ex - lucent ceo lucent technologies inc. said monday that three of its former employees have been notified by the securities and exchange commission that regulators are considering recommending civil action against them.
Predicted: Business
True label: Business
-------------------


Evaluating: 100%|██████████| 32/32 [00:02<00:00, 13.68it/s]


Test accuracy: 0.8930


Epoch 2/2: 100%|██████████| 313/313 [01:31<00:00,  3.42it/s]



Average training loss: 0.1929


Evaluating:   6%|▋         | 2/32 [00:00<00:02, 12.35it/s]


Batch 0

Text: skype available for mac os x skype technologies sa has launched the beta version of skype for mac os x. the software can be downloaded for free and is available immediately.
Predicted: Science/Technology
True label: Science/Technology

Text: russia confirms plane crash terror link a widely - suspected terrorist connection became an official claim when russian authorities said they found traces of explosives in the wreckage of one of two passenger jets that crashed almost simultaneously, killing all 90 people on board.
Predicted: World
True label: World

Text: falmouth blanks dartmouth the 10th - ranked dartmouth boys'soccer team had given up only one goal this season, but yesterday, host falmouth scored two to hand the indians their first loss, 2 - 0.
Predicted: Sports
True label: Sports
-------------------


Evaluating:  69%|██████▉   | 22/32 [00:01<00:00, 13.08it/s]


Batch 20

Text: southwest bids \ $ 117m for ata assets southwest airlines # 39 ; package of \ $ 117 million in cash, loan payoffs and preferred stock purchases was selected as the winning bid at the bankruptcy court - approved auction for certain ata airlines inc.
Predicted: Business
True label: Business

Text: security expert warns internet pharmacies at risk from terrorists ( canadian press ) canadian press - toronto ( cp ) - a canadian - based security expert will tell a panel on internet pharmacies this week that mail - order drug companies could become targets for terrorists.
Predicted: World
True label: World

Text: sec may act against ex - lucent ceo lucent technologies inc. said monday that three of its former employees have been notified by the securities and exchange commission that regulators are considering recommending civil action against them.
Predicted: Business
True label: Business
-------------------


Evaluating: 100%|██████████| 32/32 [00:02<00:00, 13.38it/s]

Test accuracy: 0.8960
Training completed!





## Problem 2: Decoder only models
Choose a decoder-only model (e.g. GPT2). Describe the model you choose. What are the unique properties of this model? What are the pros and cons? Cite your sources.

I chose the GPT-2 model, a Transformer-based decoder-only model. I used the smallest checkpoint (124M parameters), which is comparable in size to BERT-base (110M) but much larger than the tiny BERT model used in Problem 1 (4.4M).

Both GPT-2 and BERT use a Transformer architecture with self-attention. GPT-2 predicts the next token from the tokens to its left, whereas BERT predicts masked tokens using context on both sides. This left-to-right setup makes GPT-2 a natural fit for text generation and completion, while BERT tends to do better on classification tasks such as sentiment analysis.

Because GPT-2 is generative, it can be prompted zero-shot or few-shot without any task-specific training, while BERT tends to require fine-tuning to get acceptable performance. The smallest GPT-2 checkpoint is also small enough to run on a CPU.
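
The difference between left-to-right and bidirectional attention can be sketched as a visibility matrix. This is only an illustration of the concept in plain Python; the function name `visibility_mask` is hypothetical, and it is unrelated to the padding `attention_mask` that the tokenizer returns.

```python
def visibility_mask(seq_len, causal):
    """mask[i][j] is 1 if position i may attend to position j, else 0."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = visibility_mask(4, causal=True)          # decoder-style (GPT-2)
bidirectional = visibility_mask(4, causal=False)  # encoder-style (BERT)

print("causal:")
for row in causal:
    print(row)
print("bidirectional:")
for row in bidirectional:
    print(row)
```

The lower-triangular causal matrix is what lets a decoder generate text one token at a time, since no position depends on tokens that have not been produced yet.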

Sources:
- https://huggingface.co/gpt2
- https://en.wikipedia.org/wiki/GPT-2

Load the model and use prompting for your task. You will likely need to write a helper function to parse the answer. 

(Ex. “The answer is 1” -> 1). Report the performance on the test set.

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from tqdm import tqdm

# load GPT-2 model and tokenizer (using smallest version)
model_name = "gpt2"  # or "distilgpt2" for even smaller model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# helper function to convert label to category
def get_category(label):
    categories = {
        0: "World",
        1: "Sports",
        2: "Business",
        3: "Science/Technology"
    }
    return categories[label]

# helper function to parse GPT-2's response
def parse_response(response):
    response = response.lower().strip()
    if "world" in response:
        return 0
    elif "sport" in response:
        return 1
    elif "business" in response:
        return 2
    elif "science" in response or "tech" in response:
        return 3
    else:
        # print('BAD RESPONSE', response)
        return -1  # Unable to parse

# function to get prediction for a single text
def get_prediction(text, model, tokenizer, device):
    # create prompt
    prompt = f"Please classify this news article into one of these categories: World, Sports, Business, or Science/Technology. There are no other categories; you must pick one from that list.\n\nArticle: {text}\n\nCategory:"
    
    # tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    # generate response (greedy decoding; temperature only takes effect when do_sample=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=10,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False
        )
    
    # decode only the newly generated tokens (everything after the prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    
    # parse the response into a category
    return parse_response(response)

# test on a smaller subset
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"Evaluating on {device}")

# use even smaller test set since generation is slower than classification
test_subset = test_ds.select(range(100))
correct = 0
total = 0

for i in tqdm(range(len(test_subset))):
    text = test_subset[i]['text']
    true_label = test_subset[i]['label']
    
    pred_label = get_prediction(text, model, tokenizer, device)
    
    if pred_label == true_label:
        correct += 1
    total += 1
    
    # print example predictions sometimes
    if i % 10 == 0:
        print(f"\nText: {text}")
        print(f"True: {get_category(true_label)}")
        print(f"Predicted: {get_category(pred_label) if pred_label != -1 else 'Unknown'}")
accuracy = correct / total
print(f"\nTest accuracy: {accuracy:.4f}")

Evaluating on cpu


  1%|          | 1/100 [00:00<01:23,  1.18it/s]


Text: Skype available for Mac OS X Skype Technologies SA has launched the beta version of Skype for Mac OS X. The software can be downloaded for free and is available immediately.
True: Science/Technology
Predicted: Business


 11%|█         | 11/100 [00:08<01:10,  1.26it/s]


Text: Indians Pitcher Shot _ Cleveland pitcher Kyle Denney was slightly wounded in the leg late last night, on the team bus as it left Kauffman Stadium for Kansas City International Airport.
True: Sports
Predicted: Sports


 21%|██        | 21/100 [00:16<00:58,  1.35it/s]


Text: Ex-minister hurt in Beirut blast A bomb has gone off in the Lebanese capital Beirut, injuring a former minister and killing his driver.
True: World
Predicted: World


 31%|███       | 31/100 [00:24<00:58,  1.18it/s]


Text: New Program Lets People Design 3-D Objects Programs for computer-aided design, or CAD, have been around for decades, but eMachineShop.com appears to be the first service that checks whether a design can be made, tells the customer how much it will cost. If the customer wants the item the design goes to a "real world" machine shop for manufacturing.
True: Science/Technology
Predicted: Unknown


 41%|████      | 41/100 [00:33<00:52,  1.13it/s]


Text: House GOP Leader DeLay Delights in Rebuke of Accuser (Reuters) Reuters - A Texas congressman who brought a\successful ethics complaint against House Majority Leader Tom\Delay was himself rebuked by the ethics panel for violating the\rules in the way he brought the complaint.
True: World
Predicted: Business


 51%|█████     | 51/100 [00:41<00:41,  1.17it/s]


Text: Talks Continue After China Hostage Deadline Passes  CHAGMALAI, Pakistan (Reuters) - Islamic militants holding  two Chinese engineers hostage in Pakistan threatened to kill  one on Monday unless security forces ended a siege of their  hideout, a tactic the interior minister said had echoes of  Iraq.
True: World
Predicted: World


 61%|██████    | 61/100 [00:51<00:36,  1.07it/s]


Text: Exxon Mobil Profit Soars  NEW YORK (Reuters) - Exxon Mobil Corp. &lt;A HREF="http://www.investor.reuters.com/FullQuote.aspx?ticker=XOM.N target=/stocks/quickinfo/fullquote"&gt;XOM.N&lt;/A&gt;, the world's  largest publicly traded oil company, on Thursday said quarterly  profit surged 56 percent, driven by soaring oil prices and  strong results from refining operations.
True: Business
Predicted: Business


 71%|███████   | 71/100 [01:00<00:27,  1.06it/s]


Text: France demands release of hostages France #39;s interior minister demanded Sunday the release of two French journalists believed kidnapped by Islamic militants in Iraq.
True: World
Predicted: Sports


 81%|████████  | 81/100 [01:08<00:15,  1.20it/s]


Text: Ferguson Offers Rooney Helping Hand Sir Alex Ferguson is confident he can keep Wayne Rooneys feet firmly on the ground despite his sensational Champions League debut.
True: Sports
Predicted: World


 91%|█████████ | 91/100 [01:18<00:08,  1.10it/s]


Text: Reds' Vander Wal Becomes Free Agent (AP) AP - Pinch-hitter John Vander Wal chose free agency Wednesday instead of a demotion off the Reds' 40-man roster.
True: Sports
Predicted: World


100%|██████████| 100/100 [01:26<00:00,  1.16it/s]


Test accuracy: 0.4900





## Problem 3: Error Analysis

Conduct an error analysis on both models. What is each model good at? What do they get wrong? Provide examples of both correct and incorrect predictions. Suggest methods to improve the performance.

In [8]:
# Find what the error is for the gpt2 model among the responses we can accurately decode.
correct = 0
total_decoded = 0
total = 0



print(f"Evaluating on {device}")
for i in tqdm(range(len(test_subset))):
    text = test_subset[i]['text']
    true_label = test_subset[i]['label']
    
    pred_label = get_prediction(text, model, tokenizer, device)
    
    total += 1
    if pred_label == -1: continue
    if pred_label == true_label:
        correct += 1
    total_decoded += 1

successfully_decoded_accuracy = correct / total_decoded
total_accuracy = correct / total
print(f"Percentage decoded successfully: {total_decoded / total:.4f}")
print(f"Test accuracy: {total_accuracy:.4f}")
print(f"Successfully decoded accuracy: {successfully_decoded_accuracy:.4f}")

Evaluating on cpu


100%|██████████| 100/100 [01:39<00:00,  1.01it/s]

Percentage decoded successfully: 0.9400
Test accuracy: 0.4900
Successfully decoded accuracy: 0.5213






From the results I periodically printed in Problem 2, it was clear that GPT-2 does not always give an easily decodable answer. To isolate the effect of answer parsing, I tracked the number of failed parsing attempts above and found that the model gave a parseable answer 94% of the time. Accuracy on the parsed answers alone was 52.1%, compared with 49.0% overall, so the relatively poor performance of GPT-2 comes mostly from the model choosing the wrong category, not from the parsing.
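
One thing the parse rate does not capture is that `parse_response` checks keywords in a fixed order, so a reply that mentions several categories always resolves to the one checked first. A possible alternative, sketched here under the assumption that the first category mentioned is the intended answer, picks the keyword that appears earliest in the text. The names `KEYWORDS` and `parse_first_mention` are hypothetical and this variant was not used for the results above.

```python
# Keyword -> AG News label id, mirroring the substrings used by parse_response.
KEYWORDS = {
    "world": 0,
    "sport": 1,
    "business": 2,
    "science": 3,
    "tech": 3,
}


def parse_first_mention(response):
    """Return the label whose keyword appears earliest in the response, or -1."""
    text = response.lower()
    best_pos, best_label = len(text), -1
    for keyword, label in KEYWORDS.items():
        pos = text.find(keyword)
        if pos != -1 and pos < best_pos:
            best_pos, best_label = pos, label
    return best_label


print(parse_first_mention(" Sports\n\nWorld news"))  # label for "Sports"
print(parse_first_mention("no category here"))       # unparseable
```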

I'm aware that generative models sometimes favor certain categories, so I printed the frequency of predictions and the frequency of true labels. GPT-2 did in fact show a strong bias toward the earlier categories: it guessed category 0 (World) 61 times, while the second most frequent guess, category 1 (Sports), was made only 14 times, and the last label (Science/Technology) only 7 times. Meanwhile, the true labels are fairly evenly distributed.
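
A frequency comparison like this can be computed with `collections.Counter`. The sketch below uses made-up toy labels, not the notebook's actual predictions, and the helper name `label_distribution` is illustrative.

```python
from collections import Counter

ID_TO_NAME = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}


def label_distribution(label_ids, id_to_name, unknown_id=-1):
    """Count how often each label id occurs, including unparseable answers."""
    counts = Counter(label_ids)
    names = dict(id_to_name)
    names[unknown_id] = "Unknown"
    return {names[i]: counts.get(i, 0) for i in sorted(names)}


# Toy data for illustration only.
toy_preds = [0, 0, 0, 1, 2, -1, 0, 3]
toy_true = [0, 1, 2, 1, 2, 3, 3, 3]

print("predicted:", label_distribution(toy_preds, ID_TO_NAME))
print("true:     ", label_distribution(toy_true, ID_TO_NAME))
```

Collecting the predictions in a list during the evaluation loop and passing them to a helper like this would make the bias toward one class visible at a glance.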

GPT-2 example (Correct):
> Text: Russia successfully launches Soyuz 2-1A rocket RBC, 09.11.2004, Moscow 09:25:54.Russia has   successfully launched a new model of booster rocket, the Soyuz-2-1A. The rocket blasted off at 9:30 pm from the Plesetsk launch pad carrying a satellite model.
> True: Science/Technology
> Predicted: Science/Technology

GPT-2 example (Incorrect):
> Text: Farmers  "Cry Wolf" Over Losses to Predators Sept. 23, 2004 - Cougars, wolves, lions and other predators inflict relatively few losses on livestock and farmers gain only a temporary boost if these marauders are culled, the British weekly New Scientist announced Wednesday....
> True: Science/Technology
> Predicted: World

BERT example (Correct):
> Text: bye - bye blueprint : 3d modeling catches on three - dimensional technology is changing the way buildings are designed and built - - but the industry will have to change, too.
> Predicted: Science/Technology
> True label: Science/Technology

BERT example (Incorrect):
> Text: man utd hopefuls face a rising challenge a successful bidder for manchester united could have to pay more than 300p a share, city sources suggested yesterday. the stock rose 13.
> Predicted: Business
> True label: Sports

The incorrect GPT-2 example is representative of how it tends to over-predict the World category. I didn't see any clear skew in the incorrect BERT responses, but it got few examples wrong.

I think the best way to improve the performance of the GPT-2 model would be to fine-tune it on the classification data. In this notebook it was only prompted, not trained, because of the limitations of my laptop, and I think it would do much better if it could be trained on the full dataset. It is a much larger model than the BERT model I used (124M vs 4.4M parameters), so it may need more data to fit well.

For BERT, more epochs do not seem to improve performance much, so the biggest improvements may come from changes to the model and training setup. For example, we could add a dropout layer to reduce overfitting by randomly zeroing a fraction of activations on each training step. We could add a learning rate scheduler to adjust the learning rate during training (typically decreasing it as training progresses). We could also use gradient clipping to keep the gradients from exploding and destabilizing training.
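
As a rough sketch of two of these ideas, the snippet below implements a linear warmup-then-decay learning rate schedule and gradient clipping by global L2 norm in plain Python. The function names and numbers are illustrative assumptions; in an actual PyTorch training loop one would more likely rely on library utilities such as `torch.nn.utils.clip_grad_norm_` and `transformers.get_linear_schedule_with_warmup`.

```python
import math


def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to 0 at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Learning rate at a few points of a 100-step run with 10 warmup steps.
for step in (0, 5, 10, 55, 100):
    print(step, linear_decay_lr(step, total_steps=100, base_lr=5e-4, warmup_steps=10))

# A gradient with norm 5 is scaled down to norm 1; its direction is unchanged.
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Clipping changes only the size of the update, not its direction, which is why it is a common safeguard when a relatively high learning rate is used.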

## OPTIONAL. BONUS. Problem 4: Improvements

Implement your suggestions for improving the performance. You only need to implement improvements on one model (encoder-only or decoder-only). Describe your method and report the results on the test set.

I chose to try to improve the BERT model by adding the things discussed above: a dropout layer to reduce overfitting by randomly zeroing a fraction of activations on each training step; a learning rate scheduler to adjust the learning rate during training (typically decreasing it as training progresses); and gradient clipping to keep the gradients from exploding and destabilizing training.

The results ended up being slightly better but essentially the same as the original BERT model: 0.9090 vs 0.91 accuracy for old and new, respectively.
