# Finetune GPT-2 on wiki-text

In this Lab, we are using a series of library from Hugging Face (i.e. tranformers, datasets, peft). You may need to go through the document of these library to learn the usage. (Hint: you may use the imported contents in the code cell below, other contents is not necessary for this lab)

In [2]:
# for google colab
!pip install transformers
!pip install datasets
!pip install peft

Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers)
  Downloading huggingface_hub-0.29.1-py3-none-any.whl.metadata (13 kB)
Collecting pyyaml>=5.1 (from transformers)
  Downloading PyYAML-6.0.2-cp39-cp39-win_amd64.whl.metadata (2.1 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.0-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl.metadata (3.9 kB)
Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
   ---------------------------------------- 0.0/10.0 MB ? eta -:--:--
   ---------- ----------------------------- 2.6/10.0 MB 21.4 MB/s eta 0:00:01
   ---------------------------------------- 10.0/10.0 MB 36.5 MB/s eta 0:00:00
Downloading huggingface_hub-0.29.1-py3-none-any.whl (468 kB)
Downloading PyYAML-6.0.2-cp39-cp39-win_amd64.whl (162 kB)
Download

In [3]:
import os
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

from torch.utils.data import DataLoader
import torch.nn as nn

cuda


  from .autonotebook import tqdm as notebook_tqdm


## Lab 2(a) Generate text with GPT2

Using the API provided by hugging face, we can easily load the pre-trained GPT2 model and generate text. (GPT2 is a early generative model, the quality of the generated text is not as good as the later model like GPT3.)

In [None]:
# your code here: load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

def generate_text(model, tokenizer, prompt, max_length):


    # your code here: tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    # your code here: generate token using the model
    gen_tokens = model.generate(input_ids, attention_mask=attention_mask, max_length=max_length)

    # your code here: decode the generated tokens
    gen_text = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]

generate_text(model, tokenizer, "GPT-2 is a langugae model based on transformer developed by OpenAI", 100)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT-2 is a langugae model based on transformer developed by OpenAI. It is a simple, fast, and scalable model of the human brain. It is based on the concept of the "brain as a machine".

The model is based on the concept of the "brain as a machine". The model is based on the concept of the "brain as a machine". The model is based on the concept of the "brain as a machine". The model is based on the


## Lab 2(b) Prepare dataset for training

Please fill the code cell below to download the dataset and prepare the dataset for finetuning.


In [8]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# your code here: load the dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# get 10% of dataset
dataset_train = dataset["train"].select(range(len(dataset["train"]) // 10))
dataset_valid = dataset["validation"].select(range(len(dataset["validation"]) // 10))

# your code here: implement function that tokenize the dataset and set labels to be the same as input_ids
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# your code here: tokenize the dataset (you may need to remove columns that are not needed)
tokenized_datasets_train = dataset_train.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text"]
)
tokenized_datasets_valid = dataset_valid.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text"]
)

tokenized_datasets_train.set_format("torch")
tokenized_datasets_valid.set_format("torch")

# your code here: create datacollator for training and validation dataset
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

train_dataloader = DataLoader(tokenized_datasets_train, shuffle=True, batch_size=4, collate_fn=data_collator)
valid_dataloader = DataLoader(tokenized_datasets_valid, batch_size=4, collate_fn=data_collator)

# Test the DataLoader
for batch in train_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    print(batch['labels'].shape)
    break

print("DataLoader is working correctly!")

Map: 100%|██████████| 3671/3671 [00:00<00:00, 4480.55 examples/s]
Map: 100%|██████████| 376/376 [00:00<00:00, 3761.33 examples/s]

torch.Size([4, 512])
torch.Size([4, 512])
torch.Size([4, 512])
DataLoader is working correctly!





## Lab 2(c) Evaluate perplexity on wiki-text

Before finetuning, we evaluate the pre-trained GPT2 model on the wiki-text dataset. The perplexity is a common metric to evaluate the performance of language model. The lower the perplexity, the better the model. To compute the perplexity in practice, we use the formula as follows, which is a transformation of the formula in class:
$PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i|\text{context})\right)$

In [9]:
def evaluate_perplexity(model, dataloader):
    model.eval()
    total_loss = 0
    total_length = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')

    with torch.no_grad():
        for batch in dataloader:
            # your code here: get the input_ids, attention_mask, and labels from the batch
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # your code here: forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            logits = outputs.logits

            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            
            # your code here: calculate the loss
            loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            
            total_loss += loss.item()
            total_length += attention_mask.sum().item()

    # Calculate perplexity
    perplexity = torch.exp(torch.tensor(total_loss / total_length))
    
    return perplexity.item()
    

perplexity = evaluate_perplexity(model, valid_dataloader)
print(f"Initial perplexity: {perplexity}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Initial perplexity: 42.995872497558594


## Lab 2(d) Fine-tune GPT2 on wiki-text



In [11]:

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    # your code here: report validation and training loss every epoch
)

# your code here: create a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
)

trainer.train()
trainer.save_model()



Step,Training Loss
500,0.9689
1000,0.401


# Test fine-tuned model

In [12]:
# your code here: load the fine-tuned model
model_finetuned = AutoModelForCausalLM.from_pretrained("./gpt2-wikitext-2").to(device)
perplexity = evaluate_perplexity(model_finetuned, valid_dataloader)
print(f"fine-tuned perplexity: {perplexity}")

fine-tuned perplexity: 27.33197593688965


# Generate some text using the fine-tuned model

In [13]:
# load the fine-tuned model

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# generate text
generate_text(model_finetuned, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT-2 is a langugae model based on transformers developed by OpenAI , and is a novel approach to the synthesis of the functional @-@ type @-@ based protein . The model is based on the interaction between the two nucleotides , and is based on the interaction between the two nucleotides with the substrate . The interaction between the two nucleotides is a fundamental feature of the protein , and is the basis for the synthesis of the functional @-@ type


## Lab 2(e) Parameter efficient fine-tuning (LoRA)

finetune the base gpt model through LoRA

In [16]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# your code here: load GPT2 model and add the lora adapter
model_base = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
model_lora = get_peft_model(model_base, peft_config)


training_args = TrainingArguments(
    output_dir="./gpt2-lora-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
)

# your code here: set trainer and train the model
trainer = Trainer(
    model=model_lora,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets_train,
    eval_dataset=tokenized_datasets_valid,
)

ppl = evaluate_perplexity(model_lora, valid_dataloader)
print(f"Perplexity after lora finetuning: {ppl}")


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Perplexity after lora finetuning: 42.995872497558594


# Evaluate lora fine-tuned model on wiki-text

compare the text generated by the fully fine-tuned model and LoRA fine-tuned model and the pre-trained model. Do you see any difference in the quality of the generated text? Try to explain why. (Hint: trust your result and report as it is.)

In [17]:
generate_text(model_lora, tokenizer, "GPT-2 is a langugae model based on transformers developed by OpenAI", 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT-2 is a langugae model based on transformers developed by OpenAI. It is a simple, fast, and scalable model that can be used to generate a large number of models.

The model is based on the following principles:

The model is based on the following principles:

The model is based on the following principles:

The model is based on the following principles:

The model is based on the following principles:

The model


Compare the perplexity of the fully fine-tuned model and LoRA fine-tuned model. Do you see any difference in the perplexity? Try to explain why. 

In [18]:
ppl = evaluate_perplexity(model_lora, valid_dataloader)

print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 42.995872497558594
