## Decoder Finetuning Example

In this notebook, you'll get to practice fine-tuning a generative model.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset
from peft import get_peft_model, LoraConfig, TaskType

For this example, we'll be working with the script of the first episode of Star Trek: The Next Generation.

First, we'll read it in and do some minor cleanup.

In [2]:
with open("script.txt", "r", encoding="utf-8") as f:
    text = f.read()

lines = [line.strip() for line in text.split('\n') if len(line.strip()) > 10]
text = "\n".join(lines)
print(text[:1000])

STAR TREK: THE NEXT GENERATION
"Encounter at Farpoint"
D.C. Fontana
Gene Roddenberry
This script is not for publicaion or reproduction.
No one is authorized to dispose of the same. If los t or
destroyed, please notify the Script Department.
FINAL DRAFT
April 13, 1987
1    EXT. SPACE - STARSHIP (OPTICAL)
The U.S.S. Enterprise NCC 1701-D traveling at warp  speed
through space.
PICARD V.O.
Captain's log, stardate 42353.7.
Our destination is planet Cygnus
IV, beyond which lies the great
unexplored mass of the galaxy.
2    OTHER INTRODUCTORY ANGLES (OPTICAL)
on the gigantic new Enterprise NCC 1701-D.
PICARD V.O.
My orders are to examine Farpoint,
a starbase built there by the
inhabitants of that world.
Meanwhile ...
3    INT. ENGINE ROOM
Huge, with a giant wall diagram showing the immens ity
of this Galaxy Class starship.
PICARD V.O.
(continuing)
... I am becoming better
acquainted with my new command,
this Galaxy Class U.S.S.
Enterprise.
4    CLOSER ON VESSEL DIAGRAM
Showing the details an
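
The length filter in the cleanup cell does more than remove blank lines: any line with 10 or fewer characters after stripping is dropped as well. A minimal sketch on a made-up sample (the `sample` string is hypothetical and is not read from `script.txt`) shows which kinds of lines survive:

```python
# Hypothetical sample text; not read from script.txt.
sample = "PICARD V.O.\n\n   \nDATA\nCaptain's log, stardate 42353.7.\nYes, sir.\n"

# Same filter as the cleanup cell: strip each line, keep it only if longer than 10 characters.
kept = [line.strip() for line in sample.split("\n") if len(line.strip()) > 10]

# Non-empty lines that the filter throws away.
dropped = [line.strip() for line in sample.split("\n") if 0 < len(line.strip()) <= 10]

print(kept)
print(dropped)
```

Short speaker names and brief lines of dialogue fall under the threshold, so the model never sees them. Lowering the threshold is one thing to experiment with.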

Next, we need to create our tokenizer.

**Part 1:** Create a tokenizer using "distilgpt2" with the [AutoTokenizer.from_pretrained method](https://huggingface.co/docs/transformers/v4.52.3/en/model_doc/auto#transformers.AutoTokenizer).

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

Within a training batch, all sequences need to be the same length, so we need to set a padding token for shorter sequences. GPT-2 tokenizers do not define one by default, so we'll reuse the end-of-sequence token.

In [4]:
tokenizer.pad_token = tokenizer.eos_token
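
To make the role of the padding token concrete, here is a hand-rolled sketch of what padding does conceptually. It is not the tokenizer's own implementation, and the token ids in `sequences` are made up for illustration. The only real value is `50256`, the end-of-sequence id in the GPT-2 vocabulary.

```python
# Toy token-id sequences of different lengths (made-up ids for illustration).
sequences = [[47, 2149, 9795], [464, 1812], [1135]]

PAD_ID = 50256  # GPT-2's end-of-sequence id, reused here as the padding id

max_len = max(len(seq) for seq in sequences)

# Right-pad each sequence to max_len and build a matching attention mask:
# 1 marks a real token, 0 marks padding that the model should ignore.
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]

print(padded)
print(attention_mask)
```

The attention mask is what lets the model tell real end-of-sequence tokens apart from padding that happens to use the same id.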

**Part 2:** Use the [encode method](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode) to encode the text. Make sure that this returns PyTorch tensors. Save the result to a variable named tokens.

In [5]:
tokens = tokenizer.encode(text, return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (33069 > 1024). Running this sequence through the model will result in indexing errors


Now, we'll split the text into shorter chunks.

In [7]:
tokens = tokens[0]  # encode returns shape (1, n); take the single sequence

chunk_size = 128
# Keep only full chunks; the +1 keeps the last chunk when the length is an exact multiple of chunk_size
chunks = [tokens[i:i+chunk_size] for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]

# Each chunk is already a tensor, so stack them directly
input_ids = torch.stack(chunks)
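
The chunking arithmetic can be checked without any model. The sketch below uses a plain list of 33,069 integers as a stand-in for the token tensor (the length comes from the tokenizer warning above). It uses `len(tokens) - chunk_size + 1` as the stop value so that a sequence whose length is an exact multiple of the chunk size keeps its final chunk. For 33,069 tokens the chunk count is the same with or without the `+ 1`.

```python
# Stand-in for the token tensor: a plain list of 33,069 integers,
# matching the sequence length reported in the tokenizer warning.
tokens = list(range(33069))
chunk_size = 128

# Step through the sequence in strides of chunk_size and keep only full chunks.
chunks = [
    tokens[i:i + chunk_size]
    for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
]

# Trailing tokens that do not fill a whole chunk are discarded.
leftover = len(tokens) - len(chunks) * chunk_size

print(len(chunks))
print(leftover)
```

Because every chunk has exactly 128 tokens, no padding is needed during training, and the short tail of the script is simply left out.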


Here's a helper class for our training data.

In [8]:
class ScriptDataset(Dataset):
    def __init__(self, input_ids):
        self.input_ids = input_ids

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": torch.ones_like(self.input_ids[idx]),
            "labels": self.input_ids[idx],
        }

dataset = ScriptDataset(input_ids)
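
Note that the dataset returns `labels` identical to `input_ids`. For causal language models in transformers, the one-position shift between inputs and targets is applied inside the model's loss computation, so the labels do not need to be shifted by hand. The sketch below, with made-up token ids, only illustrates which token each position is scored against. It is not the library's loss code.

```python
# Made-up token ids standing in for one training chunk.
input_ids = [10, 11, 12, 13, 14]
labels = list(input_ids)  # the dataset returns labels identical to input_ids

# Conceptually, the loss compares the prediction made at position t
# with the token that actually appears at position t + 1.
context_positions = input_ids[:-1]   # positions that make a prediction
next_token_targets = labels[1:]      # tokens those predictions are scored against

pairs = list(zip(context_positions, next_token_targets))
print(pairs)
```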

**Part 3:** Make a model named base_model by using the [AutoModelForCausalLM.from_pretrained method](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForCausalLM), using the pretrained distilgpt2 model.

In [9]:
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [10]:
# Resize the embedding matrix to match the tokenizer's vocabulary size (a no-op here, since no tokens were added).
base_model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

**Part 4:** We'll be finetuning our model using LoRA. Set up a LoraConfig object, lora_config, with rank 8, alpha of 32 and dropout of 0.1. Set the target_modules to ["c_attn"], the bias to "none", and the task_type to TaskType.CAUSAL_LM.

Then, use the [get_peft_model function](https://huggingface.co/docs/peft/v0.15.0/en/package_reference/peft_model#peft.get_peft_model) to create a model using the config object. Save the results to an object named model.

In [11]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn"],
    bias="none",
    task_type=TaskType.CAUSAL_LM)

model = get_peft_model(base_model, lora_config)
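
To get a feel for how small the LoRA adapter is, the arithmetic can be worked out by hand. The shapes below are taken from the model printout later in this notebook (`c_attn` maps 768 features to 2304, and distilgpt2 has 6 transformer blocks). The scaling factor `lora_alpha / r` is the standard LoRA scaling, assumed here rather than read from the PEFT internals.

```python
# Shapes taken from the model printout in this notebook:
# c_attn maps 768 -> 2304 features, and distilgpt2 has 6 transformer blocks.
in_features = 768
out_features = 2304
num_blocks = 6

r = 8            # LoRA rank
lora_alpha = 32  # LoRA scaling numerator

# Each adapted layer adds A (r x in_features) and B (out_features x r).
params_per_layer = r * in_features + out_features * r
total_lora_params = params_per_layer * num_blocks

# The frozen c_attn weight that the adapter sits beside, for comparison (bias excluded).
frozen_weight_params = in_features * out_features

scaling = lora_alpha / r  # factor applied to the low-rank update

print(params_per_layer)
print(total_lora_params)
print(frozen_weight_params)
print(scaling)
```

Calling `model.print_trainable_parameters()` on the PEFT model is a quick way to compare this hand count against what the library reports.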



We'll set up a Trainer object and call the train method.

In [12]:
training_args = TrainingArguments(
    output_dir="./xfiles_distilgpt2_lora",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=50,
    warmup_steps=5,
    learning_rate=2e-4,
    save_total_limit=1,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 10 | 4.7167 |
| 20 | 4.7661 |
| 30 | 4.7079 |
| 40 | 4.7314 |
| 50 | 4.821 |
| 60 | 4.7514 |
| 70 | 4.6631 |
| 80 | 4.6186 |
| 90 | 4.538 |
| 100 | 4.5605 |


TrainOutput(global_step=387, training_loss=4.4448679054122255, metrics={'train_runtime': 13.4125, 'train_samples_per_second': 57.707, 'train_steps_per_second': 28.854, 'total_flos': 25368113184768.0, 'train_loss': 4.4448679054122255, 'epoch': 3.0})
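
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and takes the chunk count (258 full chunks of 128 tokens from 33,069 tokens) from the chunking step earlier. It also converts the mean training loss into perplexity, which is the exponential of the mean cross-entropy.

```python
import math

num_chunks = 258        # 33,069 tokens split into full 128-token chunks
batch_size = 2          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_chunks / batch_size)
total_steps = steps_per_epoch * num_epochs

# Perplexity is the exponential of the mean cross-entropy loss.
mean_train_loss = 4.4448679054122255
perplexity = math.exp(mean_train_loss)

print(total_steps)
print(round(perplexity, 1))
```

The step count agrees with `global_step=387` in the output above. Keep in mind that this perplexity is measured on the training data, so it says little about how the model would do on unseen scripts.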

Before generating new text, let's move the model to the GPU if one is available.

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-5): 6 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=2304, nx=768)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
             

Here's a helper function to generate new text, given a model.

In [14]:
def generate_text(model, prompt, tokenizer, device, max_new_tokens=100):
    model.eval()    # Ensures that we're generating, not training
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(device)  # Tokenize the prompt
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=1.0,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)  # The model generates tokens, so we need to decode those back to words
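
The `top_p=0.9` argument turns on nucleus sampling: at each step, only the most likely tokens whose probabilities add up to about 0.9 remain candidates, and the next token is sampled from that reduced set. The sketch below is a simplified pure-Python illustration of the idea on a made-up distribution. `top_p_filter` and `toy_probs` are hypothetical names, and the library's own implementation works on logits and may differ in details.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the token that crosses the threshold is included

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Made-up next-token distribution for illustration.
toy_probs = {"PICARD": 0.5, "RIKER": 0.3, "DATA": 0.15, "Q": 0.05}
filtered = top_p_filter(toy_probs, top_p=0.9)
print(filtered)
```

Lower values of `top_p` cut off more of the unlikely tail and make the output more conservative, while values close to 1.0 leave the distribution almost untouched.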

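Because get_peft_model injected the LoRA layers into base_model in place, we'll reload the pretrained distilgpt2 weights to get an unmodified model for comparison.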

In [15]:
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

**Part 5:** Using the generate_text function, try out the prompt "PICARD" with both the pretrained distilgpt2 model (base_model), and the finetuned model. Try other prompts, too.

In [27]:
print(generate_text(base_model, "PICARD", tokenizer, device))

PICARD, Texas — An undercover federal agent was arrested for soliciting a teenage male as a child at a Fort Worth church, but was arrested Tuesday night, according to a release from the Texas Department of Justice.

In [22]:
print(generate_text(model, "PICARD", tokenizer, device))

PICARDER, LIPER
He looks at the camera as he looks on.
(CONT'D)
(CONT'D)
Is it your fault?
What's the fault?
(CONT'D)
They're talking about a
spaghetti-eating planet.
And that planet has to
immediately
take over.
(CONT'D)
And there is no question
that it's
solution,
which will be necessary
for all


**Bonus:** See what happens if you allow more training epochs. You've also been provided all of the scripts from season 1. How does the model do when given more examples?