
In this notebook, you will train a decoder-only LLM (GPT-2) with a **character** tokenizer on data from Shakespeare and generate sentences.

You will use Hugging Face to train the models.

**Important**: you will need to use a GPU for training. To change to a GPU, select Runtime > Change runtime type from the menu bar above. Select 'T4'.

# Load English training data
First, upload the `shakespeare_input.txt` downloaded from the [Homework 3 website](https://michaelmilleryoder.github.io/cs2731_fall2024/hw3) into the Colab file manager. To do this, click the folder icon on the left-hand sidebar. Then, click the upload icon in the sidebar (the one with the arrow pointing up) and select the `shakespeare_input.txt` file.

Once the file is available in the Colab notebook's file system, open it and read each line into a Python list named `training_data`.
The code below removes lines with no text and upper-cases everything. You can also do any other preprocessing you want here.

In [2]:
with open('shakespeare_input.txt') as f:
  training_data = [[line] for line in f.read().upper().splitlines() if len(line) > 0]

training_data[:10] # to check the first 10 lines

[['FIRST CITIZEN:'],
 ['BEFORE WE PROCEED ANY FURTHER, HEAR ME SPEAK.'],
 ['ALL:'],
 ['SPEAK, SPEAK.'],
 ['FIRST CITIZEN:'],
 ['YOU ARE ALL RESOLVED RATHER TO DIE THAN TO FAMISH?'],
 ['ALL:'],
 ['RESOLVED. RESOLVED.'],
 ['FIRST CITIZEN:'],
 ['FIRST, YOU KNOW CAIUS MARCIUS IS CHIEF ENEMY TO THE PEOPLE.']]
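
Before building a character vocabulary, it can help to check which characters actually occur in the data. The sketch below is one way to do that; `sample_lines` and `char_inventory` are hypothetical names used only for illustration, and the sample is a few lines copied from the output above rather than the full `training_data` list.

```python
from collections import Counter

# Hypothetical sample in the same shape as `training_data` (one single-item list per line).
sample_lines = [
    ["FIRST CITIZEN:"],
    ["BEFORE WE PROCEED ANY FURTHER, HEAR ME SPEAK."],
    ["ALL:"],
    ["SPEAK, SPEAK."],
]

def char_inventory(lines):
    """Count every character across a list of single-item line lists."""
    counts = Counter()
    for (line,) in lines:
        counts.update(line)  # updating a Counter with a string counts its characters
    return counts

inventory = char_inventory(sample_lines)
print(sorted(inventory))         # distinct characters seen in the sample
print(inventory.most_common(5))  # the five most frequent characters
```

Any character that appears here but is missing from the tokenizer's vocabulary would be mapped to the unknown token.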

# "Train" a tokenizer

Hugging Face models are paired with tokenizers, which define the set of possible tokens.
Here we want to modify the existing `GPT2TokenizerFast` class to tokenize on characters.

In [3]:
# Run this to make sure you have a necessary package
! pip install transformers[torch]



Define a new Hugging Face tokenizer here that only accepts characters and save it to an object named `char_tokenizer`.

You can reference the following:
* https://discuss.huggingface.co/t/character-level-tokenizer/12450/3
* https://huggingface.co/learn/nlp-course/en/chapter6/

In [4]:
import string

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Character vocabulary: ASCII letters, digits, punctuation, space, and newline
characters = list(string.ascii_letters + string.digits + string.punctuation + " \n")
special_tokens = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

# Map each character to an id, then append the special tokens after them
vocab = {char: idx for idx, char in enumerate(characters)}
vocab.update({token: len(vocab) + i for i, token in enumerate(special_tokens)})

# WordLevel looks each pre-tokenized piece up in the vocabulary
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))

tokenizer.add_special_tokens(special_tokens)
# NOTE: Whitespace splits text into words, so whole words (not characters)
# are looked up in the character vocabulary and come back as [UNK]
tokenizer.pre_tokenizer = Whitespace()

tokenizer.save("char_tokenizer.json")

# Wrap the saved tokenizer so it can be used with transformers
char_tokenizer = PreTrainedTokenizerFast(tokenizer_file="char_tokenizer.json")
char_tokenizer.add_special_tokens({
    "unk_token": "[UNK]",
    "pad_token": "[PAD]",
    "bos_token": "[BOS]",
    "eos_token": "[EOS]"
})




0

Test your new tokenizer with the following cell. It should provide each token as a character. You may get unexpected behavior for the space character, and that's ok.

In [5]:
char_tokenizer.tokenize("hello world")

['[UNK]', '[UNK]']
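
The output above contains two `[UNK]` tokens rather than eleven characters. A likely explanation is that the `Whitespace` pre-tokenizer splits the text into words before the vocabulary lookup, and no whole word exists in a character vocabulary. The plain-Python sketch below imitates that lookup; it does not call the `tokenizers` library, and `lookup`, `word_pieces`, and `char_pieces` are hypothetical names. The regular expression is the pattern documented for `Whitespace`.

```python
import re
import string

# Same character set as the tokenizer cell above.
characters = list(string.ascii_letters + string.digits + string.punctuation + " \n")
vocab = set(characters)

def lookup(pieces):
    """Mimic a word-level lookup: any piece not in the vocabulary becomes [UNK]."""
    return [piece if piece in vocab else "[UNK]" for piece in pieces]

text = "hello world"

# Whitespace-style pre-tokenization: runs of word characters or runs of punctuation.
word_pieces = re.findall(r"\w+|[^\w\s]+", text)
# Character-level pre-tokenization: every character is its own piece.
char_pieces = list(text)

print(lookup(word_pieces))  # whole words are not in the vocabulary
print(lookup(char_pieces))  # single characters are
```

To get the second behavior from the real tokenizer, one option that may be worth trying is a pre-tokenizer that isolates every character (for example the library's `Split` pre-tokenizer) in place of `Whitespace`.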

# Train GPT-2 model with character tokenizer

Here's where you will train your GPT-2 model on the Shakespeare data using your new character tokenizer. Specifically, train the `GPT2LMHeadModel` from the `transformers` package.

Here are some references for the code for this part:
* https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb
* https://huggingface.co/docs/transformers/en/tasks/language_modeling. Note that this is for finetuning, not training from scratch. It is still useful for explanations of Hugging Face classes

You will want to define a model, load in the Shakespeare dataset in a format that Hugging Face can work with, define training parameters, and then train the model.
This training may take 30 minutes or longer.

**You will also need to save the model** with a name like `char_gpt2_shakespeare` to be able to generate from it later.

In [6]:
# Check for GPU
import torch
torch.cuda.is_available()

True

In [7]:
!pip install datasets

from datasets import load_dataset


In [8]:
# FILL IN CODE
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel, Trainer, TrainingArguments

from datasets import Dataset
import pandas as pd

# Read the raw text; readlines() keeps each line's trailing newline
with open("shakespeare_input.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()

# One row per line of text, stored in a "text" column
data = pd.DataFrame({"text": lines})

dataset = Dataset.from_pandas(data)




In [9]:
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="char_tokenizer.json")
tokenizer.add_special_tokens({
    "unk_token": "[UNK]",
    "pad_token": "[PAD]",
    "bos_token": "[BOS]",
    "eos_token": "[EOS]"
})




0

In [12]:
def tokenize_function(examples):
    # Pad or truncate every line to exactly 128 tokens
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    # For causal language modeling the labels are a copy of the inputs
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Split once (90% train, 10% test) and reuse the result for both subsets
split = tokenized_datasets.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
test_dataset = split["test"]


Map:   0%|          | 0/167204 [00:00<?, ? examples/s]
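
In the cell above, `labels` is a straight copy of `input_ids`, so padded positions also count toward the loss. Because every line is padded to 128 tokens, this may make the reported loss look lower than the loss on real text. Hugging Face causal-LM models ignore label positions set to `-100`, so one common adjustment is to mask the padding. The sketch below shows the idea in plain Python; `mask_padding_labels` and the example ids are hypothetical, and `97` is assumed to be the `[PAD]` id.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face causal-LM loss skips

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]

# Hypothetical example: five real tokens followed by three padding tokens (assumed id 97).
example_ids = [7, 4, 11, 11, 14, 97, 97, 97]
example_mask = [1, 1, 1, 1, 1, 0, 0, 0]

labels = mask_padding_labels(example_ids, example_mask)
print(labels)
```

Inside `tokenize_function`, the same function could be applied to each pair of `input_ids` and `attention_mask` rows in the batch.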

In [13]:
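from transformers import GPT2LMHeadModel

# Note: from_pretrained("gpt2") starts from the pretrained GPT-2 weights,
# not from a random initialization
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Resize the embedding matrix to the character tokenizer's vocabulary size
model.resize_token_embeddings(len(tokenizer))

# `torch` was imported in the GPU check cell above
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)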


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(100, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=100, bias=False)
)
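
The `Embedding(100, 768)` and `out_features=100` entries in the printout above reflect the size of the character vocabulary after `resize_token_embeddings`. As a quick sanity check, the count can be reproduced from the character set and special tokens defined in the tokenizer cell; the variable names here are for illustration only.

```python
import string

# Same character set and special tokens as the tokenizer cell.
n_characters = len(string.ascii_letters + string.digits + string.punctuation + " \n")
special_tokens = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

vocab_size = n_characters + len(special_tokens)
print(n_characters, len(special_tokens), vocab_size)  # 96 4 100
```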

In [14]:
from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(
    output_dir="./char_gpt2_shakespeare",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=2,
    save_steps=20000,
    evaluation_strategy="steps",
    eval_steps=20000,
    logging_steps=1000,
    report_to="none",

)


# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)




In [None]:
trainer.train()
trainer.save_model("./char_gpt2_shakespeare")  # also saves the tokenizer passed to the Trainer


| Step | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 20000 | 0.0516 | 0.052417 |
| 40000 | 0.051 | 0.051881 |


# Generate from the trained model

In [1]:
# FILL IN CODE
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast
import torch

model = GPT2LMHeadModel.from_pretrained("./char_gpt2_shakespeare")
tokenizer = PreTrainedTokenizerFast.from_pretrained("./char_gpt2_shakespeare")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


OSError: Incorrect path_or_model_id: './char_gpt2_shakespeare'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

In [None]:
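input_text = "To be, or not to be"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=True, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))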


# Calculate perplexity for test documents

In this section, load the test documents from the [Homework 3 website](https://michaelmilleryoder.github.io/cs2731_fall2024/hw3).
Calculate perplexity for both models.
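
As a reminder of the quantity involved, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; the `perplexity` helper and the example values are hypothetical and are not results from the notebook's models.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical example: a model that gives every token probability 0.25
# is as uncertain as a uniform choice among 4 options.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))  # approximately 4
```

With a Hugging Face `Trainer`, the evaluation loss is a mean cross-entropy, so `math.exp(eval_loss)` gives a perplexity under the same tokenization, provided padded positions are excluded from the loss.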