# L02 - Fine-tuning GPT2

Fine-tuning takes a model that was pre-trained on a broad, general task and adapts it to a specific one.

Steps:
1. Set up the Colab environment
2. Load and test the base model
3. Prepare a dataset for fine-tuning
4. Fine-tune the model
5. Test, evaluate, and save the model for further use

For L02, you will be given the main structure of the code. Your task will be to follow the instructions during the lab session and complete the missing code.

At the end of the lab, you should have a fine-tuned model.






Step 1. Set up the Colab environment (Step 3 in L02 Presentation)

In [None]:
# Dataset loading, model and tokenizer classes, and the training utilities
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

Step 2. Model and tokenizer

We use the same model and tokenizer as in L01.

In [None]:
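# Load the pre-trained GPT-2 checkpoint and its tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token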

BONUS: Let's try the pre-trained model

In [None]:
prompt = "Today I did"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_length=100, repetition_penalty=1.3)
print(tokenizer.decode(output[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Today I did not know that the people who were in charge of this project had any knowledge about it.
I was told by a friend, "You can't do anything without knowing what you're doing." And he said to me: You have no idea how much money is going into these projects and they are being funded with nothing but your own hands! So we started working on them together for two years before finally getting our first funding from my brother's company (which has been around since 1999
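The `repetition_penalty=1.3` argument used in the generation call makes tokens that already appear in the sequence (prompt included) less likely to be picked again. The sketch below shows the usual rule on a toy list of scores: positive scores are divided by the penalty and negative scores are multiplied by it. The function name and the four-token vocabulary are hypothetical and only serve as an illustration; the library applies the same idea to tensors.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-seen tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Toy vocabulary of four tokens; tokens 0 and 1 already appear in the text.
toy_logits = [2.0, -1.0, 0.5, 1.0]
penalized = apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 0], penalty=1.3)
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged; larger values discourage repetition more strongly.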


Step 3. Dataset

For this step, `Sonnet.txt` should appear under "Files" in the left sidebar. If it is not there, download it from https://www.kaggle.com/datasets/shivamshinde123/william-shakespeares-sonnet and upload it to the Colab session.

In [None]:
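# Load the plain-text file; by default each line becomes one example
dataset = load_dataset("text", data_files={"train": "Sonnet.txt"})
# Keep only the first 2589 examples
dataset["train"] = dataset["train"].select(range(2589))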
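To make the loading step less of a black box, here is a rough pure-Python sketch of what it amounts to, assuming one example per line and a cut-off after the first N examples. The helper `load_lines` and the three-line toy file are hypothetical stand-ins, not part of the `datasets` library or of `Sonnet.txt`.

```python
import os
import tempfile


def load_lines(path, limit=None):
    """Read a text file as one example per line and keep the first `limit` lines."""
    with open(path, encoding="utf-8") as handle:
        examples = [{"text": line.rstrip("\n")} for line in handle]
    return examples if limit is None else examples[:limit]


# Hypothetical three-line file standing in for the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_sonnet.txt")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write(
            "From fairest creatures we desire increase,\n"
            "That thereby beauty's rose might never die,\n"
            "But as the riper should by time decease,\n"
        )
    toy_examples = load_lines(toy_path, limit=2)

print(len(toy_examples))
print(toy_examples[0]["text"])
```

Each record has a single `"text"` field, which is the field the tokenization function reads.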

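Write the function that tokenizes the dataset. Every example is truncated or padded to exactly 128 tokens so that all examples have the same length.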

In [None]:
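def tokenize_dataset(batch):
    # With batched=True, batch["text"] is a list of strings
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Apply the tokenizer to every example in the dataset
tokenized_dataset = dataset.map(tokenize_dataset, batched=True)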
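The effect of `truncation=True` together with `padding="max_length"` can be illustrated without the real tokenizer. The sketch below is a simplified stand-in: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and padding is added on the right (assumed here to match the tokenizer's default). The pad id 50256 is the one reported in the generation message earlier in this notebook.

```python
PAD_ID = 50256  # GPT-2 end-of-sequence id, reused as the padding id


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut a sequence to `max_length`, then pad on the right up to `max_length`."""
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)  # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


# Made-up token ids, with max_length=5 instead of 128 to keep the output short.
short_example = pad_or_truncate([10, 11, 12], max_length=5)
long_example = pad_or_truncate([10, 11, 12, 13, 14, 15, 16], max_length=5)
print(short_example)
print(long_example)
```

The attention mask tells the model which positions hold real tokens and which are padding.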

Step 4. Training

For demo purposes, we do a full fine-tuning, meaning that every weight of the model is updated. The hyperparameters are chosen so that training finishes in roughly 10 minutes in Colab (the run recorded below took about 12 minutes).

1. Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="./shakespeare-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    save_steps=250,
    logging_steps=20,
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    report_to="none"
)
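As a sanity check on these arguments, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; under those assumptions it arrives at the same `global_step=1620` that the training run reports. It also sketches a linear warmup-then-decay learning-rate schedule, which is assumed here to be the scheduler in use since none is set explicitly. The function `linear_schedule_lr` is a hypothetical helper, not a library call.

```python
import math

NUM_EXAMPLES = 2589   # size of the selected training split
BATCH_SIZE = 8        # per_device_train_batch_size
EPOCHS = 5            # num_train_epochs
PEAK_LR = 5e-5        # learning_rate
WARMUP_STEPS = 50     # warmup_steps

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_schedule_lr(step):
    """Learning rate at `step`: linear warmup to the peak, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = total_steps - step
    return PEAK_LR * max(0.0, remaining / (total_steps - WARMUP_STEPS))


print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("log entries (every 20 steps):", total_steps // 20)
for step in (0, 25, 50, 835, total_steps):
    print(step, linear_schedule_lr(step))
```

With these numbers the warmup covers only about 3% of the run, so almost all of the training happens on the decaying part of the schedule.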

2. Data Collator and Trainer

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent transformers versions
    data_collator=data_collator
)
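With `mlm=False`, the collator prepares batches for causal language modeling: the labels are a copy of the input ids, with padding positions replaced by -100 so that the loss ignores them. The sketch below imitates that behavior in plain Python for examples that are already padded to the same length, as they are in this notebook. `collate_causal_lm` is a hypothetical stand-in and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips
PAD_ID = 50256       # pad id used in this notebook (same as the EOS id)


def collate_causal_lm(examples, pad_id=PAD_ID):
    """Stack equally long examples and build labels that ignore padding."""
    input_ids = [list(example["input_ids"]) for example in examples]
    attention_mask = [list(example["attention_mask"]) for example in examples]
    labels = [
        [IGNORE_INDEX if token == pad_id else token for token in row]
        for row in input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Two made-up examples, already padded to length 4.
toy_batch = collate_causal_lm([
    {"input_ids": [10, 11, 12, PAD_ID], "attention_mask": [1, 1, 1, 0]},
    {"input_ids": [20, 21, PAD_ID, PAD_ID], "attention_mask": [1, 1, 0, 0]},
])
print(toy_batch["labels"])
```

One side effect of reusing the EOS token as the pad token is that genuine EOS tokens are masked out of the loss as well. The shift by one position between inputs and targets is handled inside the model, so the labels do not need to be shifted here.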


3. Start the training :)

In [None]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|-----:|--------------:|
| 20 | 5.6956 |
| 40 | 5.0327 |
| 60 | 4.6522 |
| 80 | 4.4968 |
| 100 | 4.5439 |
| 120 | 4.4571 |
| 140 | 4.4795 |
| 160 | 4.4664 |
| 180 | 4.5467 |
| 200 | 4.385 |


TrainOutput(global_step=1620, training_loss=3.66947604049871, metrics={'train_runtime': 713.5575, 'train_samples_per_second': 18.141, 'train_steps_per_second': 2.27, 'total_flos': 845606338560000.0, 'train_loss': 3.66947604049871, 'epoch': 5.0})
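The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below recomputes the throughput figures from the runtime and converts the average training loss into a perplexity of roughly 39. Note that this is the loss averaged over the whole training run on the training data, not a held-out evaluation, so it only gives a rough impression of how well the model fits the sonnets.

```python
import math

# Values copied from the TrainOutput above.
train_loss = 3.66947604049871
train_runtime_s = 713.5575
global_step = 1620

# Values from the earlier cells of this notebook.
num_examples = 2589
epochs = 5

perplexity = math.exp(train_loss)  # perplexity = exp(average cross-entropy loss)
steps_per_second = global_step / train_runtime_s
samples_per_second = num_examples * epochs / train_runtime_s
runtime_minutes = train_runtime_s / 60

print(f"perplexity on training data: {perplexity:.2f}")
print(f"steps per second: {steps_per_second:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"runtime in minutes: {runtime_minutes:.1f}")
```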

Step 5. Results: save and test the fine-tuned model

In [None]:
# Save the weights and the tokenizer into the same directory so they can be reloaded together
model.save_pretrained("gpt2-code-finetuned")
tokenizer.save_pretrained("gpt2-code-finetuned")


('gpt2-code-finetuned/tokenizer_config.json',
 'gpt2-code-finetuned/special_tokens_map.json',
 'gpt2-code-finetuned/vocab.json',
 'gpt2-code-finetuned/merges.txt',
 'gpt2-code-finetuned/added_tokens.json',
 'gpt2-code-finetuned/tokenizer.json')
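To use the saved model later, for example in a fresh Colab session, the directory can be passed to `from_pretrained` in place of the model name. The sketch below wraps this in a hypothetical helper, `load_finetuned`, that first checks that the directory exists. It assumes the `gpt2-code-finetuned` folder has been kept or re-uploaded, since Colab storage is not persistent across sessions.

```python
from pathlib import Path


def load_finetuned(model_dir="gpt2-code-finetuned"):
    """Reload the fine-tuned model and tokenizer from a local directory."""
    model_path = Path(model_dir)
    if not model_path.is_dir():
        raise FileNotFoundError(f"No saved model found at {model_path}")
    # Imported here so the directory check runs even before transformers is loaded.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(str(model_path))
    model = AutoModelForCausalLM.from_pretrained(str(model_path))
    model.eval()  # inference mode: disables dropout
    return model, tokenizer


# Usage in a new session, once the directory is available:
# model, tokenizer = load_finetuned()
# inputs = tokenizer("Today I did", return_tensors="pt")
# output = model.generate(**inputs, max_length=100, repetition_penalty=1.3)
# print(tokenizer.decode(output[0]))
```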

In [None]:
prompt = "Today I did"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_length=100, repetition_penalty=1.3)
print(tokenizer.decode(output[0]))

Today I did not invent the world’s most famous song, but sing: ‘This is thy verse. Love to me alone; for myself alone am thou wretch! Time‼d grow dimly silent. Then love doth write it down.—love writeth in thee. And yet this time he lies still asleep. For my sake now lie awake: then sleep well slept ill. So long as you are gone from him, stay so long. But when they were come
