# Project 4: Generating and Finetuning Transformer Language Models With Huggingface 

In this project, you will first learn how to use Huggingface's Transformers library to load large language models. Next, we will generate text from these models. Finally, we will finetune models on two tasks (sentiment analysis and machine translation).

This project is more open-ended than the previous projects. We expect you to learn how to navigate the Huggingface and torch documentation on your own.

## Setup

First we install and import the required dependencies. These include:
* `torch` for modeling and training
* `transformers` for pre-trained models
* `datasets` from huggingface to load existing datasets.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece

# Core imports: PyTorch, Transformers, and Datasets
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelWithLMHead
from datasets import load_dataset

Before proceeding, let's verify that we're connected to a GPU runtime and that `torch` can detect the GPU.
We'll define a variable `device` here to use throughout the code so that we can easily change to run on CPU for debugging.

In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Using device:", device)

Using device: cuda


### Loading Model

We will use GPT-2 medium for this project. This includes both the GPT-2 tokenizer and the GPT-2 model weights themselves. If you want to learn more about this model, you can read the GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Let's first load the tokenizer for the GPT-2 medium model. You can find how to do this by reading the documentation for AutoTokenizer in transformers and finding the GPT-2 model with ~345 million parameters there.

In [4]:
from transformers import AutoTokenizer
# Your code here
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")


Let's tokenize and detokenize some text from this model.

In [5]:
print(tokenizer.encode('Hello world'))
print(tokenizer.decode(tokenizer.encode('Hello world')))
print(tokenizer.encode("Hola, cómo estás😍"))

[15496, 995]
Hello world
[39, 5708, 11, 269, 10205, 5908, 1556, 40138, 47249, 235]


Now let's load the GPT-2 medium model. Make sure you also put the model onto the GPU.

In [6]:
from transformers import AutoModelWithLMHead
# Your code here
gpt2_model = AutoModelWithLMHead.from_pretrained('gpt2-medium')
gpt2_model.to(device)




GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

## Generate From the Model

Now let's generate some text from the model to test its LM capabilities. Let's generate 10 pieces of random text of length 50 tokens from the model using random sampling with temperature set to 0.7. This will allow the text to be somewhat high in diversity (random sampling) while maintaining reasonable quality (temperature < 1). When generating text, you can condition on phrases such as "The coolest thing in NLP right now is". Find the relevant function and arguments to use for generating text using the Huggingface documentation.

Hint: you may find https://huggingface.co/docs/transformers/main_classes/text_generation to be useful for learning about generating from LMs.
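
To build intuition for what the temperature does, here is a minimal pure-Python sketch (not the Huggingface implementation). It assumes a made-up 4-token vocabulary with hypothetical logits; `softmax_with_temperature` and `sample_token` are illustrative helper names, not library functions.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores, not real GPT-2 logits
for t in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
print([sample_token(toy_logits, 0.7, rng) for _ in range(10)])
```

Lower temperatures push probability mass toward the highest-scoring token, which is why a temperature below 1 trades some diversity for quality.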

In [7]:
from transformers import GenerationConfig
# Note: GPT-2 has no <|startoftext|> special token, so it is tokenized as ordinary text.
inputs = tokenizer("<|startoftext|>The coolest thing in NLP right now is", return_tensors="pt").input_ids.to(device)
# Your code here

sample_outputs = gpt2_model.generate(inputs, max_length=50, min_length=50, temperature=0.7, num_return_sequences=10, do_sample=True)

Now let's print the text.

In [8]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing in NLP right now is a text-based classification algorithm that uses text as a text classifier. It's not the only one, but it's the one that we're interested in. It
1: <|startoftext|>The coolest thing in NLP right now is the tool called "Natural Language Processing". It's a tool that can take text, extract features from it and then match it up to human speech. It's basically a
2: <|startoftext|>The coolest thing in NLP right now is what you're doing with the [fuzzy](http://t.co/l7qhM5JwCY) word!</font></p
3: <|startoftext|>The coolest thing in NLP right now is the fact that you can use them to detect whether someone has been faking their intentions or whether they are actually trying to communicate. It's a really cool tool for your
4: <|startoftext|>The coolest thing in NLP right now is Machine Learning. (http://www.youtube.com/watch?v=kCjHtR_6kH0)

(http://www
5: <|startoftext|>The coolest thing in NLP right now is the ability to create a human-reada

Now generate one piece of text of length 50 from nearly the same prompt ("The coolest thing right now in NLP is"; note the slightly different word order), but use greedy decoding (the limit of temperature going to 0). This roughly corresponds to generating text that has high likelihood under the model.
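
As a rough illustration of greedy decoding, the sketch below always appends the single highest-scoring next token. It uses a tiny hand-written next-token table (`toy_scores`, entirely made up) in place of a real language model, and `greedy_decode` is a hypothetical helper, not a Huggingface function.

```python
def greedy_decode(next_token_scores, start, max_new_tokens):
    """Repeatedly append the single highest-scoring next token."""
    sequence = [start]
    for _ in range(max_new_tokens):
        scores = next_token_scores.get(sequence[-1])
        if not scores:
            break  # no known continuation for this token
        best = max(scores, key=scores.get)
        sequence.append(best)
    return sequence

# Made-up scores for a toy vocabulary; a real LM conditions on the full prefix.
toy_scores = {
    "the": {"model": 1.2, "text": 0.9},
    "model": {"is": 1.5, "the": 0.2},
    "is": {"the": 1.1, "good": 1.0},
    "text": {"is": 0.7},
}
print(greedy_decode(toy_scores, "the", 8))
```

Because there is no randomness, the same prompt always yields the same output, and the decoder can fall into a loop when the most likely continuation leads back to an earlier token.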

In [8]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.to(device)
# Your code here
# Greedy decoding: with do_sample=False, generate() always picks the most likely next token.
sample_outputs = gpt2_model.generate(inputs, max_length=50, min_length=50, do_sample=False)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing right now in NLP is the ability to use the word "startoftext" to refer to a word that is not part of the text. This is useful for things like "the word"


Now let's try to see how good of a translation system GPT-2 medium is when used "out of the box". To accomplish this, we can condition on a prompt like the one below and generate from the model with greedy decoding. This will attempt to translate the sentence "UC Berkeley ist eine Schule in Kalifornien", which means "UC Berkeley is a school in California". Make sure to set the max length to be high enough so that the model generates sufficient text.

In [9]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [10]:
# Your code here. Generate from the model using greedy decoding with the above prompt
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
sample_outputs = gpt2_model.generate(inputs, do_sample=False, max_length=100)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist


As we can see, the translation quality is terrible: the model simply copies the German sentence instead of translating it.

Now, let's finetune GPT-2 on the translation task to improve the results. We will use a translation dataset from the Huggingface dataset repository (it has thousands of other datasets available). This dataset is one of TED talks translated between German and English.

In [7]:
import datasets
dataset = datasets.load_dataset("ted_talks_iwslt", language_pair=("de", "en"), year="2014")


In [8]:
tokenizer.pad_token = tokenizer.eos_token

In [13]:
print(dataset['train'][0]['translation'])

{'de': '"Ich habe Zerebralparese. Ich zappele die ganze Zeit", kündigt Maysoon Zayid zu Anfang dieses ungeheuer witzigen, erheiternden an. (Er ist wirklich ungeheur witzig.) "Als würde Shakira auf Muhammad Ali treffen." Elegant und scharfsinnig nimmt uns die arabisch-amerikanische Komikerin auf eine Reise durch ihre Abenteuer als Schauspielerin, Komikerin, Philanthropin und Fürsprecherin für Menschen mit Behinderungen mit.', 'en': '"I have cerebral palsy. I shake all the time," Maysoon Zayid announces at the beginning of this exhilarating, hilarious talk. (Really, it\'s hilarious.) "I\'m like Shakira meets Muhammad Ali." With grace and wit, the Arab-American comedian takes us on a whistle-stop tour of her adventures as an actress, stand-up comic, philanthropist and advocate for the disabled.'}


Now we can create a dataset. For each element in the dataset, it should have a text prompt and then the translation, similar to above. Your job is to fill in the labels field below. This field sets the labels to use for training during the language modeling task. 

For the labels, we only want to train the model to output the text after the words "English:". This is because in the prompt, everything before the words "English:" will also be provided to the model as input. Hint: use -100 as the label for tokens you do not want to train on.
Hint 2: When doing LM training, the labels are the same as the input tokens, except shifted to the left by one. You should check whether Huggingface is already doing the shifting, or whether you need to do the shifting yourself.

One thing to be careful about with all LMs is extra spaces. The text should be formatted like "English: Hello...", not "English:  Hello...". This is a common problem people face when using APIs like GPT-3, which we will cover next time.
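
The two hints above can be sketched in plain Python. This is an illustration under stated assumptions, not the required solution: the token ids are hypothetical, `build_example` and `shifted_pairs` are made-up helper names, and the sketch assumes the model shifts the labels internally (as far as we know, GPT-2's LM head in transformers does, but you should confirm this in the documentation), so `labels` is aligned position-by-position with `input_ids`.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_example(prompt_ids, target_ids, max_length, pad_id):
    """Return (input_ids, attention_mask, labels), each of length max_length."""
    input_ids = (prompt_ids + target_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    attention_mask = attention_mask + [0] * pad
    labels = labels + [IGNORE_INDEX] * pad  # never train on padding
    return input_ids, attention_mask, labels

def shifted_pairs(input_ids, labels):
    """Mimic the internal shift: position t is trained to predict labels[t + 1]."""
    return [
        (input_ids[t], labels[t + 1])
        for t in range(len(labels) - 1)
        if labels[t + 1] != IGNORE_INDEX
    ]

prompt_ids = [11, 12, 13]     # hypothetical ids for "...German: ...\nEnglish:"
target_ids = [21, 22, 50256]  # hypothetical ids for the translation plus end-of-text
input_ids, attention_mask, labels = build_example(prompt_ids, target_ids, max_length=8, pad_id=50256)
print(input_ids)
print(labels)
print(shifted_pairs(input_ids, labels))
```

The last prompt token is the first position that contributes to the loss: it is trained to predict the first token of the translation.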

In [9]:
prompt = """Translate the following texts into English.
German: """

class TranslationDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            # Full training text: prompt + German source + English target + end-of-text marker.
            training_text = prompt + example['translation']['de'] + '\nEnglish: ' + example['translation']['en'] + "<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            # Mask everything up to and including "English:" with -100 so the loss covers only the translation.
            prompt_and_input_length = len(tokenizer.encode(prompt + example['translation']['de'] + '\nEnglish:'))
            eng_ids = tokenizer.encode(' ' + example['translation']['en'] + "<|endoftext|>")
            label = [-100] * prompt_and_input_length + eng_ids
            # Pad with -100 (ignored by the loss) or truncate so the labels line up with input_ids.
            pad_len = max(0, 275 - len(label))
            label += [-100] * pad_len
            label = label[:275]
            assert len(label) == 275, f'label length must be 275, but it is {len(label)}'
            self.labels.append(torch.tensor(label))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [10]:
translation_dataset = TranslationDataset(dataset['train'], tokenizer)

Now let's break the dataset into a train and validation split.

In [11]:
train_size = int(0.9 * len(translation_dataset))
train_dataset, val_dataset = random_split(translation_dataset, [train_size, len(translation_dataset) - train_size])
print(len(train_dataset))
print(len(val_dataset))

2674
298


Now we can use the Huggingface Trainer to finetune GPT-2 on this dataset. This abstracts away all of the details of training. Set up the training arguments to perform 3 epochs of training on this dataset, with a per-device batch size of 2, gradient accumulation set to 8, 100 warmup steps, and a weight decay of 0.05. Set the eval batch size to 2. Save a checkpoint every 250 steps. Set fp16 to True. Save the checkpoints in a specific output_dir so you can load them later. Hint: if it tries to launch Wandb, you may add the argument report_to="none".
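
To see how these settings interact, the sketch below estimates the effective batch size and the number of optimizer steps. `optimizer_steps` is a hypothetical helper, and the rounding rule (ceil for batches, floor for accumulated updates) is an assumption that happens to match the step counts logged in this notebook; other library versions may round differently.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, accumulation_steps, epochs, num_devices=1):
    """Estimate how many optimizer updates a training run performs."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a leftover partial accumulation window does not count as an extra update.
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch * epochs

effective_batch_size = 2 * 8  # per-device batch size x gradient accumulation steps
print(effective_batch_size)
print(optimizer_steps(2674, 2, 8, 3))  # 2674 = size of the translation training split
```

With only a few hundred optimizer steps in total, 100 warmup steps is a sizable fraction of training, which is worth keeping in mind when comparing runs.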

In [12]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
                    output_dir='/content/drive/MyDrive/cs288_sp2023/hw4/gpt2_translation',   # Replace with your preferred output directory
                    overwrite_output_dir=True,  # Overwrite the content of the output directory if it already exists
                    do_train=True,
                    do_eval=True,
                    evaluation_strategy='epoch',
                    save_strategy='epoch',
                    save_total_limit=3,        # Only keep the last 3 checkpoints
                    save_steps=1,              # Ignored here: save_strategy='epoch' saves once per epoch
                    num_train_epochs=3,        # Train for 3 epochs
                    per_device_train_batch_size=2,
                    per_device_eval_batch_size=2,
                    gradient_accumulation_steps=8,
                    weight_decay=0.05,         # Weight decay
                    warmup_steps=100,          # Warmup steps
                    fp16=True                  # Use mixed-precision training with automatic mixed precision (AMP)
                )

Next create a Huggingface Trainer object and call train() on it.

In [13]:
import gc
torch.cuda.empty_cache()
gc.collect()

21

In [14]:
# Your code here
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()



Epoch,Training Loss,Validation Loss
0,No log,2.193033
1,No log,2.050394
2,2.077500,2.037882


TrainOutput(global_step=501, training_loss=2.076422520026475, metrics={'train_runtime': 1085.1358, 'train_samples_per_second': 7.393, 'train_steps_per_second': 0.462, 'total_flos': 3998491818393600.0, 'train_loss': 2.076422520026475, 'epoch': 3.0})

Now load your saved checkpoint and see how well the finetuned GPT-2 model does on translating the sentence from before.

In [15]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [16]:
# your code here
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
sample_outputs = gpt2_model.generate(inputs, do_sample=False, max_length=100)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: UC Berkeley: A school in Kalifornia


In [17]:
trainer.save_model("/content/drive/MyDrive/cs288_sp2023/hw4/gpt2_translation_model")

In [18]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

model = AutoModelWithLMHead.from_pretrained("/content/drive/MyDrive/cs288_sp2023/hw4/gpt2_translation_model")
model = model.to(device)

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
sample_outputs = model.generate(inputs, do_sample=False, max_length=100)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))



0: Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: UC Berkeley: A school in Kalifornia


If training went correctly, you should see a reasonable translation of the sentence, with some errors.

For the project report, find two sentences where the model succeeds and two sentences where the model fails. Describe what might be causing these types of failures.

Finally, revisit the code from project 2 on using and running the Multi30k dataset. Your goal is to translate the test set using the GPT-2 model you just finetuned. You will then submit your test predictions as a txt file, with your model's prediction for each test example on a separate line. Feel free to copy and paste any code from project 2 that may be useful. Name the file mt_predictions.txt and submit it to Gradescope.

The GPT-2 model may not work that well on the Multi30k dataset because of distribution shift: the Multi30k data looks different from the TED talks data that you finetuned the model on. The takeaway is that a general-purpose LM can be decent at a task like translation, but a domain-specific model, such as an LSTM trained specifically on Multi30k, can outperform it.

For the project report, compare two translations from the GPT-2 model with the corresponding translations from the LSTM model. Which one works better?

Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input.
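
Since `generate` returns the prompt followed by the continuation, the predictions file needs some post-processing to keep one translation per line. Here is a minimal sketch of that step, assuming the prompt format used in this notebook; `extract_translation` is a hypothetical helper, and the example strings are illustrative.

```python
def extract_translation(generated_text, prompt):
    """Return the first line the model wrote after the prompt."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        # Fall back to everything after the last "English:" marker.
        continuation = generated_text.rsplit("English:", 1)[-1]
    lines = continuation.strip().splitlines()
    return lines[0].strip() if lines else ""

prompt = (
    "Translate the following texts into English.\n\n"
    "German: UC Berkeley ist eine Schule in Kalifornien\nEnglish:"
)
generated = prompt + " UC Berkeley: A school in Kalifornia\n\nEnglish: extra text"
print(extract_translation(generated, prompt))
```

Keeping only the first line guards against the model continuing with additional "English:" lines after the translation.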

In [19]:
data = load_dataset('bentrevett/multi30k')
for i in range(2):
    print(data['train']['de'][i])
    print(data['train']['en'][i])


Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.
Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.
Several men in hard hats are operating a giant pulley system.


In [20]:
def create_prompt(de, en=""):
    if en != "": en = " " + en
    return f"""Translate the following texts into English.\n\nGerman: {de}\nEnglish:{en}"""

print(create_prompt(data['train']['de'][4], data['train']['en'][4]))
print(create_prompt(data['train']['de'][4]))

Translate the following texts into English.

German: Zwei Männer stehen am Herd und bereiten Essen zu.
English: Two men are at the stove preparing food.
Translate the following texts into English.

German: Zwei Männer stehen am Herd und bereiten Essen zu.
English:


In [21]:
test_data_list = []
for de in data['test']['de']:
    # Tokenize one prompt per German test sentence; the English side is left for the model to generate.
    test_data_list.append(tokenizer(create_prompt(de), return_tensors='pt').input_ids)

for i in range(3):
    print(test_data_list[i])
    print()

tensor([[ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
         16010,    25,   412,   259, 20291, 10255,   304,  7749, 10912, 16370,
         11722,   268, 32767,    11,  4587,  2123,  9776,   281,   301,  3258,
            83,    13,   198, 15823,    25]])

tensor([[ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
         16010,    25,   412,   259,  6182,  3813,  5277,   300, 11033,    84,
           701,  6184,   120,   527,   473,   701,   328,    12,  2164,  9116,
          2516,  1902,   292,   410,   273,   304,  7749,   356,    72, 39683,
           268,  1168,  1942,    13,   198, 15823,    25]])

tensor([[ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198,   198,
         16010,    25,   412,   259,   337, 11033,    67,  6607,   287,   304,
          7749,  9375,   378, 35410,  1018,   865, 30830,   304,   259, 18726,
         10255,   304,  7749,   833,   715,    13,   198, 15823,    25]])



In [22]:
output_list = []
for input_ids in test_data_list:
    input_ids = input_ids.to(device)
    output = model.generate(input_ids, do_sample=False, max_length=100)
    # Keep only the newly generated tokens so the prompt is not written to the predictions file.
    output_list.append(output[:, input_ids.shape[1]:])

In [23]:
import gc
torch.cuda.empty_cache()
gc.collect()

363

In [24]:
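with open('/content/drive/MyDrive/cs288_sp2023/hw4/mt_predictions.txt', 'w', encoding='utf-8') as f:
    predictions = []
    for output in output_list:
        trans = tokenizer.decode(output.squeeze(0), skip_special_tokens=True)
        if trans.startswith("Translate the following"):
            # The decoded text still contains the prompt; keep only what follows "English:".
            trans = trans.partition("English:")[2]
        # Keep the first line only so that each prediction occupies exactly one line.
        predictions.append(trans.strip().split("\n")[0])
    f.write("\n".join(predictions))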

### Sentiment Analysis

The beauty of language models is that we can apply this exact same machinery to solve a completely different task of sentiment analysis. Here, we will be given a movie review and the goal is to have the model predict whether the review is positive or negative.

First, we will load some sentiment analysis data. Your job is to follow what we did above for machine translation: load the dataset, build a class to create the dataset, and so on.

When doing so, use the prompt below. Put the text of the input in the first [], and in the second [] put the word Positive if the label is 1 or the word Negative if the label is 0. Make sure to also set the self.labels field correctly: we only want to compute a loss on the words Positive/Negative, and on no other tokens in the model's input.

The following is a movie review. [Movie Review Text Here]. The sentiment of the review is [Positive/Negative].
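
One way to keep the training text and the test-time prompt consistent is to build both from the same template. The sketch below is a minimal illustration; `build_sentiment_text` and `LABEL_WORDS` are hypothetical names, and the review text is made up.

```python
LABEL_WORDS = {0: "Negative", 1: "Positive"}

def build_sentiment_text(review, label=None):
    """Return the inference prompt, or the full training text when a label is given."""
    prompt = f"The following is a movie review. {review}. The sentiment of the review is"
    if label is None:
        return prompt
    return f"{prompt} {LABEL_WORDS[label]}."

inference_prompt = build_sentiment_text("a charming and funny film")
training_text = build_sentiment_text("a charming and funny film", label=1)
print(inference_prompt)
print(training_text)
```

Because the training text is exactly the inference prompt plus the label word, only the tokens of that suffix need non-masked labels.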

In [7]:
import datasets
dataset = datasets.load_dataset('glue', "sst2")


Note: Some people reported that the originally provided line of code wasn't working and that they needed to use `dataset = datasets.load_dataset('glue', 'sst2')` instead, as in the cell above.

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [74]:
i = 20
print(dataset['train']['sentence'][i])
print("---")
print(dataset['train']['label'][i])
print("---")
print(dataset['train']['idx'][i])

equals the original and in some ways even betters it 
---
1
---
20


In [75]:
len(dataset['train'])

67349

In [8]:
class SentimentDataset(Dataset):
    # Your code below
    def __init__(self, dataset, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        self.classes = {0: 'Negative', 1: 'Positive'}
        for data in dataset:
            sentence = data['sentence']
            label_cls = self.classes[data['label']]
            training_text = f"The following is a movie review. {sentence}. The sentiment of the review is {label_cls}. <|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=100, padding="max_length", truncation=True)
            # Token 220 is the lone space before <|endoftext|>; it marks the end of the prediction.
            # This assumes the review itself contains no token 220 and that the text is not truncated.
            pred_pos = encodings_dict['input_ids'].index(220)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            # Compute the loss only on " Positive"/" Negative" (33733/36183), the period (13), and the space (220).
            label = [-100] * 100
            label[pred_pos - 2] = 33733 if data['label'] == 1 else 36183
            label[pred_pos - 1] = 13
            label[pred_pos] = 220
            self.labels.append(torch.tensor(label, dtype=torch.long))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, i):
        return {
            "input_ids": self.input_ids[i],
            "attention_mask": self.attn_masks[i],
            "labels": self.labels[i]
        }

In [53]:
print(tokenizer.decode([33733]))
print(tokenizer.decode([36183, 13]))

 Positive
 Negative.


In [9]:
sentiment_train_dataset = SentimentDataset(dataset['train'], tokenizer)
sentiment_val_dataset = SentimentDataset(dataset['validation'], tokenizer)

In [16]:
print(sentiment_train_dataset[0]['input_ids'])
print(sentiment_train_dataset[0]['labels'])

tensor([  464,  1708,   318,   257,  3807,  2423,    13,  7808,   649,  3200,
          507,   422,   262, 21694,  4991,   764,   383, 15598,   286,   262,
         2423,   318, 36183,    13,   220, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256])
tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100, 36183,    13,   220,  -100,  -100,  -100,

The data already comes with train and validation splits.

In [183]:
print(len(sentiment_train_dataset))
print(len(sentiment_val_dataset))

67349
872


Now let's train the model using the same trainer arguments as before, except train for at most 1 epoch, because this dataset is quite large and training on the entire thing will take some time. Make sure you also use a different output_dir so it doesn't overwrite your old results.

In [39]:
import gc
torch.cuda.empty_cache()
gc.collect()

21

In [11]:
from transformers import TrainingArguments, Trainer
# Your code here
sent_model = AutoModelWithLMHead.from_pretrained('gpt2-medium')
sent_model.to(device)
training_args = TrainingArguments(
                    output_dir='/content/drive/MyDrive/cs288_sp2023/hw4/gpt2_sentiment',   # Replace with your preferred output directory
                    overwrite_output_dir=True,  # Overwrite the content of the output directory if it already exists
                    do_train=True,
                    do_eval=True,
                    evaluation_strategy='epoch',
                    save_strategy='epoch',
                    save_total_limit=3,        # Only keep the last 3 checkpoints
                    save_steps=1,              # Ignored here: save_strategy='epoch' saves once per epoch
                    num_train_epochs=1,        # Train for 1 epoch
                    per_device_train_batch_size=2,
                    per_device_eval_batch_size=2,
                    gradient_accumulation_steps=8,
                    learning_rate=5e-5,        # Learning rate
                    weight_decay=0.05,         # Weight decay
                    warmup_steps=100,          # Warmup steps
                    fp16=True                  # Use mixed-precision training with automatic mixed precision (AMP)
                )
trainer = Trainer(
    model=sent_model,
    args=training_args,
    train_dataset=sentiment_train_dataset,
    eval_dataset=sentiment_val_dataset,
)

trainer.train()



Epoch,Training Loss,Validation Loss
0,0.0516,0.066504


TrainOutput(global_step=4209, training_loss=0.0961328960988776, metrics={'train_runtime': 4477.2003, 'train_samples_per_second': 15.043, 'train_steps_per_second': 0.94, 'total_flos': 1.22153163227136e+16, 'train_loss': 0.0961328960988776, 'epoch': 1.0})

In [15]:
trainer.save_model("/content/drive/MyDrive/cs288_sp2023/hw4/gpt2_sentiment_model")

At test-time, when you want to classify an incoming movie review, you can just check whether the model generates the words Positive or Negative as the final word.
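
A small sketch of that check is shown below. It looks only at the text generated after the prompt, so a review that itself contains the word "Positive" or "Negative" cannot flip the prediction. `parse_sentiment` is a hypothetical helper, and the example strings are illustrative.

```python
def parse_sentiment(generated_text, prompt):
    """Return 1 for Positive, 0 for Negative, or None if neither word was generated."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    pos = continuation.find("Positive")
    neg = continuation.find("Negative")
    if pos == -1 and neg == -1:
        return None
    # If both words appear, the one generated first wins.
    if neg == -1 or (pos != -1 and pos < neg):
        return 1
    return 0

prompt = "The following is a movie review. Positive vibes only. The sentiment of the review is"
print(parse_sentiment(prompt + " Negative.", prompt))
print(parse_sentiment(prompt + " Positive.", prompt))
print(parse_sentiment(prompt + " unclear", prompt))
```

Returning None for unparseable outputs makes it easy to count how often the model fails to produce either label word before mapping those cases to a default class.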

In [21]:
prompt = """The following is a movie review. This was not the best shit. The sentiment of the review is"""

inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
sample_outputs = sent_model.generate(inputs, do_sample=False, max_length=100)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: The following is a movie review. This was not the best shit. The sentiment of the review is Negative.                                                                              


In [33]:
val_labels = []
pred_labels = []
# Iterate over the raw validation examples and rebuild the prompt without the label.
for example in dataset['validation']:
    sentence = example['sentence']
    prompt = f"The following is a movie review. {sentence}. The sentiment of the review is"
    val_labels.append(example['label'])
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Greedy decoding; the finetuned model should continue with " Positive." or " Negative."
    sample_output = sent_model.generate(inputs, do_sample=False, max_length=100)[0]
    sample_output = tokenizer.decode(sample_output, skip_special_tokens=True)
    pred_idx = 1 if "Positive." in sample_output else 0
    pred_labels.append(pred_idx)

assert len(val_labels) == len(pred_labels), "Label lists must have equal length!"

In [34]:
print(val_labels)
print(pred_labels)

[1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 

In [35]:
import numpy as np
from sklearn.metrics import accuracy_score

accuracy_score(np.array(val_labels), np.array(pred_labels))

0.9380733944954128

Finally, run the entire validation set through the model and get your model predictions. Save the results as a txt file, where each line contains just "1" if your model predicted Positive or "0" if it predicted Negative. You will get full credit if your model's accuracy is greater than 80%. Save the file as sst_predictions.txt and submit it to Gradescope.

For the report, describe two possible improvements to your sentiment classifier.

In [38]:
with open('/content/drive/MyDrive/cs288_sp2023/hw4/sst_predictions.txt', 'w', encoding='utf-8') as f:
    # One prediction per line ("1" = Positive, "0" = Negative), with no trailing newline.
    f.write("\n".join(str(label) for label in pred_labels))

## Submission

Turn in the following files on Gradescope:
* hw4.ipynb (this file; please rename to match)
* mt_predictions.txt (the predictions for the Multi30k test set)
* sst_predictions.txt (the predictions for the SST-2 validation set)
* report.pdf

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.
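
Before uploading, it can help to sanity-check the prediction files locally. The sketch below is one possible check, not the autograder's actual logic: `check_prediction_file` is a hypothetical helper, the demo file is created in a temporary directory, and the expected line count must be supplied by you (for example, 872 for the SST-2 validation set).

```python
import tempfile
from pathlib import Path

def check_prediction_file(path, expected_lines, allowed=None):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    lines = Path(path).read_text(encoding="utf-8").split("\n")
    if lines and lines[-1] == "":
        lines = lines[:-1]  # tolerate a single trailing newline
    problems = []
    if len(lines) != expected_lines:
        problems.append(f"expected {expected_lines} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number} is empty")
        elif allowed is not None and line not in allowed:
            problems.append(f"line {number} has unexpected value {line!r}")
    return problems

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sst_predictions.txt"
    demo.write_text("1\n0\n1", encoding="utf-8")
    print(check_prediction_file(demo, expected_lines=3, allowed={"0", "1"}))
    print(check_prediction_file(demo, expected_lines=4, allowed={"0", "1"}))
```

For mt_predictions.txt, leave `allowed` as None and pass the number of test sentences as `expected_lines`; an empty-line report there usually means the model generated nothing for that example.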