<a href="https://colab.research.google.com/github/prabal5ghosh/Deep-Learning-summer-school-2025-university-of-cote-d-Azur/blob/main/Instruction_finetuning_SUJET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img width=60% src="http://www.i3s.unice.fr/~lingrand/efeliaUnica.png"><br/><br/>
<font size=+3><b>Instruction Finetuning</b></font><br/><br/>
<font size=+1>Célia D'Cruz and Frédéric Precioso<br/><br/>
    2025 - June/July</font><br/>
    <img width=14% src="http://www.i3s.unice.fr/~lingrand/cc-long.png">
    </center>

In this notebook, you will see the difference between "base" and "instruct" LLMs, look at different prompt templates used for instruction finetuning, and learn how to prepare a dataset for finetuning a model to follow instructions.

This lab is inspired by Sebastian Raschka's book "Build a Large Language Model (From Scratch)".

# Imports and device

<font color="red">Use a GPU to speed up computations.</font>
If your laptop does not have a GPU, you can use Google Colab or Kaggle.

To enable GPU backend in Google Colab for your notebook:

1.   Runtime (top left corner) -> Change runtime type
2.   Put GPU as "Hardware accelerator"
3.   Save

In [1]:
# all the imports needed in the lab

import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import transformers
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline
import json
import urllib.request  # the "request" submodule must be imported explicitly
from pprint import pprint

Check that your GPU is recognized by running the code below:

In [2]:
# making the code device agnostic

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
device

device(type='cuda')

# Base vs Instruct models

In this part, you will see the difference between the answer of a "base" model and an "instruct" model.
Below is a function to generate an answer from a model and a prompt.

In [3]:
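def get_llm_answer(model_id, prompt, return_full_text = False, max_new_tokens = 256):
    # build a text-generation pipeline for the given model
    # (named "generator" to avoid shadowing the imported "pipeline" function)
    generator = transformers.pipeline(
        "text-generation",
        model=model_id,
        device_map="auto",
        return_full_text=return_full_text,
    )

    # greedy decoding (do_sample = False) so that the answer is deterministic
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample = False,
    )
    return outputs[0]["generated_text"]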

Here is a prompt asking the capital of 3 different countries.

In [4]:
prompt = "What is the capital of France? What is the capital of Germany? What is the capital of Spain?"

We use a small version of the "base" Qwen LLM and give it the prompt. Notice how the model doesn't answer the questions but instead generates more questions similar to the ones asked in the prompt. This is because the model has not been trained to answer questions (to follow instructions), but simply to continue the text. The model picked up the pattern of questions in the prompt and followed it, therefore asking about the capitals of other countries.

In [5]:
model_id = "Qwen/Qwen2.5-1.5B"
llm_answer = get_llm_answer(model_id, prompt)



In [6]:
print(llm_answer)

 What is the capital of Italy? What is the capital of Portugal? What is the capital of Switzerland? What is the capital of Austria? What is the capital of Belgium? What is the capital of Denmark? What is the capital of Finland? What is the capital of Ireland? What is the capital of Luxembourg? What is the capital of the Netherlands? What is the capital of Norway? What is the capital of Poland? What is the capital of Portugal? What is the capital of Romania? What is the capital of Russia? What is the capital of Spain? What is the capital of Sweden? What is the capital of Switzerland? What is the capital of the United Kingdom? What is the capital of the United States? What is the capital of the Vatican? What is the capital of the United Arab Emirates? What is the capital of the United Kingdom? What is the capital of the United States? What is the capital of the United States? What is the capital of the United States? What is the capital of the United States? What is the capital of the Un

This time, we use the "instruct" version of the previous model, which has been finetuned to follow instructions. Notice that this model actually answers the questions in the prompt.

In [7]:
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
llm_answer = get_llm_answer(model_id, prompt)



In [8]:
print(llm_answer)

 The capital of France is Paris. The capital of Germany is Berlin. The capital of Spain is Madrid.


# Different styles in prompt formatting for instructions

In this part, you will see that "instruct" models often expect a specific prompt format, which can differ from one model to another. Many models structure their prompt into "system", "user" and "assistant" parts.

- The "System" role is used to provide context that informs the behavior of the model. For example, you can use it to make the model maintain a formal tone throughout the conversation, or to specify rules such as avoiding certain topics.
- The "User" role represents the human user in the conversation. This is for instance where you can ask questions to the model.
- The "Assistant" role is the model responding to user queries based on the context set by the system.

If you want to better understand the difference between those roles, you can read [this blog post](https://www.regie.ai/blog/user-prompts-vs-system-prompts) and watch [this video](https://www.youtube.com/watch?v=xbpdMkTz8L4).

In [9]:
def get_formatted_instruction(model_id, system_prompt, user_prompt):
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    formatted_instruction = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return formatted_instruction

Let's reuse the same user prompt as in the previous part. You will see examples of how the prompt should actually be structured before giving it to the "instruct" models.

In [10]:
system_prompt = "The assistant should always maintain a professional tone."
user_prompt = "What is the capital of France? What is the capital of Germany? What is the capital of Spain?"

We display the prompt format that the Qwen instruct model uses when provided with a system and a user prompt. You can see that the system and user instructions are each surrounded by the "<|im_start|>" and "<|im_end|>" tokens, and that the last part of the prompt indicates that the assistant will respond.

In [11]:
model_id = "Qwen/Qwen2.5-1.5B-Instruct"

formatted_instruction = get_formatted_instruction(model_id, system_prompt, user_prompt)
print(formatted_instruction)

<|im_start|>system
The assistant should always maintain a professional tone.<|im_end|>
<|im_start|>user
What is the capital of France? What is the capital of Germany? What is the capital of Spain?<|im_end|>
<|im_start|>assistant



Since we have already downloaded the Qwen instruct model in the previous part, we can now display the answer of the model, with the full prompt. You can notice the LLM answer just after the "<|im_start|>assistant" part of the prompt.

In [12]:
llm_answer = get_llm_answer(model_id, formatted_instruction, return_full_text = True)
print(llm_answer)



<|im_start|>system
The assistant should always maintain a professional tone.<|im_end|>
<|im_start|>user
What is the capital of France? What is the capital of Germany? What is the capital of Spain?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.
The capital of Germany is Berlin.
The capital of Spain is Madrid.
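
The layout printed above can be approximated by hand, which may help to see what the chat template does to the messages. The sketch below is only an illustration: `to_chatml` is a hypothetical helper (it is not part of `transformers`), and the real Qwen template handles more cases than this.

```python
# Hand-written approximation of the ChatML-style layout printed above.
# `to_chatml` is a hypothetical helper, not part of the transformers API.

def to_chatml(messages, add_generation_prompt=True):
    """Wrap each message in <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so that the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "The assistant should always maintain a professional tone."},
    {"role": "user", "content": "What is the capital of France?"},
]
chatml_prompt = to_chatml(messages)
print(chatml_prompt)
```

In practice, prefer `tokenizer.apply_chat_template`, since it applies the exact template shipped with each model.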


Now, we observe the prompt formatting of the DeepSeek model. This time, we do not give the prompt to the model (which would generate an answer), as it is a big model that would require a significant amount of memory to download and run.

In [13]:
model_id = "deepseek-ai/DeepSeek-V3"

formatted_instruction = get_formatted_instruction(model_id, system_prompt, user_prompt)
print(formatted_instruction)


<｜begin▁of▁sentence｜>The assistant should always maintain a professional tone.<｜User｜>What is the capital of France? What is the capital of Germany? What is the capital of Spain?<｜Assistant｜>


We now display the prompt format of the Falcon model.

In [14]:
model_id = "tiiuae/falcon-7b-instruct"

formatted_instruction = get_formatted_instruction(model_id, system_prompt, user_prompt)
print(formatted_instruction)


The assistant should always maintain a professional tone.

User: What is the capital of France? What is the capital of Germany? What is the capital of Spain?

Assistant:


There are other different ways to format the entries as inputs to the LLM; the figure below illustrates two example formats that were used for training the Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Phi-3 (https://arxiv.org/abs/2404.14219) LLMs, respectively.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/prompt-style.webp?1" width=500px>
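
To make the second format in the figure more concrete, here is a rough sketch of a Phi-3-style formatter. It assumes that an entry is a dictionary with `instruction`, `input` and `output` keys. The function name `format_phi3_style`, the example entry, and the exact whitespace between the parts are assumptions made for illustration only.

```python
# Rough sketch of a Phi-3-style prompt layout (whitespace is an assumption).
# `format_phi3_style` and `example_entry` are hypothetical names.

def format_phi3_style(entry):
    """Format one entry with <|user|> / <|assistant|> markers."""
    user_text = f"<|user|>\n{entry['instruction']}"
    if entry["input"]:
        # The optional input is appended right after the instruction.
        user_text += f"\n{entry['input']}"
    return f"{user_text}\n\n<|assistant|>\n{entry['output']}"


example_entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion'.",
}
phi3_text = format_phi3_style(example_entry)
print(phi3_text)
```

Compared with the Alpaca style, this layout is shorter because it has no fixed preamble sentence, so each training example uses fewer tokens.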

# Preparing a dataset for instruction finetuning

## Downloading a dataset

In the following parts of the lab, we will use Alpaca-style prompt formatting, as detailed in the previous image. We first start by downloading an instruction dataset that we will correctly format.

In [15]:
url = "https://www.i3s.unice.fr/~precioso/instruction-data.json"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

In [16]:
print("Number of entries:", len(data))

Number of entries: 1100


🤔 <b><font color='purple'>Question:</font></b> Each item in the `data` list we loaded is a dictionary. What does each dictionary contain? Display several elements and observe their keys and values. Do you observe that some fields can be empty?

In [None]:
# your work

## Formatting the textual inputs from the dataset

For each entry, we format the instruction and desired response in the Alpaca style.

In [None]:
def format_beginning_prompt(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text + "\n\n### Response:\n"

def format_input(entry):
    instruction_and_input_text = format_beginning_prompt(entry)
    desired_response = entry['output']
    return instruction_and_input_text + desired_response

A formatted entry with an input field looks like this:

In [None]:
prompt_and_response = format_input(data[50])
print(prompt_and_response)

Below is a formatted entry without an input field:

In [None]:
prompt_and_response = format_input(data[999])
print(prompt_and_response)

## Understanding how to create the inputs and targets for LLM instruction finetuning

We need to further prepare our data by tokenizing the prompts, and by preparing the inputs and targets of the model for finetuning. We tackle this task in several steps, as summarized in the figure below.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/detailed-batching.webp?1" width=500px>

We will use GPT2 encoding and later finetune a version of this model. We download the GPT2 tokenizer.

In [None]:
model_id_gpt2 = "openai-community/gpt2-medium"
tokenizer_gpt2 = AutoTokenizer.from_pretrained(model_id_gpt2)
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token # we set the pad token to be the same as the EOS token

In order to better visualize and test our code, we start with a toy dataset of 3 fake instructions.

In [None]:
toy_instructions = ["Instruction no. 1", "This is another instruction", "Another"]

We tokenize those instructions and pad them to the length of the longest instruction in the dataset (PAD token ID = EOS token ID = 50256 in GPT2) for batching purposes.

In [None]:
toy_model_input = tokenizer_gpt2(
    toy_instructions,
    truncation = True,
    max_length = 1024,
    padding = "longest",
    add_special_tokens = True,
    return_tensors = "pt",
)["input_ids"]
print(toy_model_input.shape)
print(toy_model_input)
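
As a rough illustration of what padding to the longest sequence does, here is a pure-Python sketch on lists of token IDs. The IDs are made up (they are not real GPT2 encodings), `pad_to_longest` is a hypothetical helper, and the padding is assumed to be added on the right.

```python
# Pure-Python sketch of right-padding to the longest sequence.
# `pad_to_longest` is a hypothetical helper; the token IDs below are made up.

PAD_TOKEN_ID = 50256  # in this lab, the GPT2 end-of-text ID is reused for padding


def pad_to_longest(sequences, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every list of token IDs to the length of the longest one."""
    longest = max(len(sequence) for sequence in sequences)
    return [sequence + [pad_token_id] * (longest - len(sequence)) for sequence in sequences]


toy_ids = [[11, 12, 13, 14], [21, 22], [31]]
padded_ids = pad_to_longest(toy_ids)
for row in padded_ids:
    print(row)
```

After padding, all rows have the same length, so they can be stacked into a single tensor.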

Above, we only created the inputs to the LLM; however, for LLM training, we also need the target values.

- Similar to pretraining an LLM, the targets are the inputs shifted by 1 position: the target at position i is the input token at position i + 1 (a PAD token is appended at the end to keep the same length), so the LLM learns to predict the next token.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/inputs-targets.webp?1" width=400px>
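
The shift can be sketched on a plain Python list before moving to tensors. The IDs below are made up and `shift_targets` is a hypothetical helper. The tensor version, and the replacement of the extra pad tokens, are left to the exercise.

```python
# Pure-Python sketch of the input/target shift on a single sequence.
# `shift_targets` is a hypothetical helper; the token IDs are made up.

PAD_TOKEN_ID = 50256


def shift_targets(input_ids, pad_token_id=PAD_TOKEN_ID):
    """Return targets such that targets[i] == input_ids[i + 1], plus a final pad."""
    return input_ids[1:] + [pad_token_id]


toy_input = [11, 12, 13, 50256, 50256]
toy_target = shift_targets(toy_input)
print("input :", toy_input)
print("target:", toy_target)
```

Each target is the token that the model should predict after reading the input up to the same position.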

🤔 <b><font color='purple'>Question:</font></b> Create the target tensor associated with the `toy_model_input` defined above. Remember to shift the position by 1, and to replace all the pad tokens (except the first of each sequence) by -100.

In [None]:
# your work

toy_model_target = ...

print(toy_model_target.shape)
print(toy_model_target)

- Next, we introduce an `ignore_index` value to replace all padding token IDs (except the 1st one) with a new value; the purpose of this `ignore_index` is that we can ignore padding values in the loss function (more on that later). Concretely, this means that we replace all but the first instance of the token IDs corresponding to `50256` (PAD token) with `-100` (ignore_index) as illustrated below.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/ignore-index.webp?1" width=500px>

In [None]:
ignore_index = -100

for index, item in enumerate(toy_model_target):
    mask = item == tokenizer_gpt2.eos_token_id
    indices = torch.nonzero(mask).squeeze()
    if indices.numel() > 1:
        toy_model_target[index, indices[1:]] = ignore_index

print(toy_model_target)

🤔 <b><font color='purple'>Question:</font></b> Visually check that the targets are well formatted according to the inputs:

In [None]:
print("toy input =\n", toy_model_input)
print("\ntoy target =\n", toy_model_target)

## Understanding the role of the ignore_index in PyTorch cross-entropy loss

Let's see what this replacement by -100 accomplishes:

- For illustration purposes, let's assume we have a small binary classification task (labels 0 and 1)
- If we have the following logits values (outputs of the last layer of the model, before any activation function), we calculate the following loss:

In [None]:
logits_1 = torch.tensor(
    [[-1.0, 1.0],  # 1st training example
     [-0.5, 1.5]]  # 2nd training example
)
targets_1 = torch.tensor([0, 1])

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

- Now, adding one more training example will, as expected, influence the loss:

In [None]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

- Let's see what happens if we replace the class label of one of the examples with -100

In [None]:
targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)

- As we can see, the resulting loss on these 3 training examples is the same as the loss we calculated from the 2 training examples, which means that the cross-entropy loss function ignored the training example with the -100 label
- By default, PyTorch has the `cross_entropy(..., ignore_index=-100)` setting to ignore examples corresponding to the label -100
- Using this -100 `ignore_index`, we can ignore the additional end-of-text (padding) tokens in the batches that we used to pad the training examples to equal length
- However, we don't want to ignore the first instance of the end-of-text (padding) token (50256) because it can help signal to the LLM when the response is complete
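
To see why the two losses coincide, the mean cross-entropy can be recomputed by hand. The sketch below uses only the standard library. `cross_entropy_mean` is a hypothetical helper that mimics the default mean reduction on this small example and is not the PyTorch implementation.

```python
import math

# Hand-written mean cross-entropy with an ignore_index, for illustration only.
# `cross_entropy_mean` is a hypothetical helper, not the PyTorch implementation.


def cross_entropy_mean(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over the rows whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this row contributes neither to the sum nor to the count
        log_sum_exp = math.log(sum(math.exp(value) for value in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]
loss_two_rows = cross_entropy_mean(logits[:2], [0, 1])
loss_with_ignored_row = cross_entropy_mean(logits, [0, 1, -100])
loss_three_rows = cross_entropy_mean(logits, [0, 1, 1])
print(loss_two_rows, loss_with_ignored_row, loss_three_rows)
```

The ignored row is removed from both the sum and the count, so the mean is taken over the two remaining rows only.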

- In practice, it is also common to mask out the target token IDs that correspond to the instruction, as illustrated in the figure below (this is an optional exercise once you have completed the lab).

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/mask-instructions.webp?1" width=600px>
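
One possible way to approach this optional exercise is sketched below on plain lists. It assumes that the number of prompt tokens (`prompt_length`) is known for each example, for instance by tokenizing the output of `format_beginning_prompt` separately. `mask_instruction_targets` is a hypothetical helper and the token IDs are made up.

```python
# Sketch of instruction masking on a single (already shifted) target sequence.
# `mask_instruction_targets` is a hypothetical helper; the token IDs are made up.

IGNORE_INDEX = -100


def mask_instruction_targets(targets, prompt_length, ignore_index=IGNORE_INDEX):
    """Replace the targets that correspond to prompt tokens with ignore_index.

    Targets are shifted by one position, so the first `prompt_length - 1`
    targets are still prompt tokens; the next one is the first response token.
    """
    masked = list(targets)  # copy, so that the original list is left untouched
    for position in range(min(prompt_length - 1, len(masked))):
        masked[position] = ignore_index
    return masked


# 3 prompt tokens (1, 2, 3), 2 response tokens (7, 8), then padding
input_ids = [1, 2, 3, 7, 8, 50256]
targets = [2, 3, 7, 8, 50256, -100]
masked_targets = mask_instruction_targets(targets, prompt_length=3)
print(masked_targets)
```

With this masking, the loss is computed only on the response tokens and on the first end-of-text token.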

## Preparation of the inputs and targets: Putting it all together

We split the data into train, test and validation using 85% of the data for the training (variable `train_portion`), 10% for the test (variable `test_portion`) and the remaining for the validation (variable `val_portion`).

In [None]:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.1)    # 10% for testing
val_portion = len(data) - train_portion - test_portion  # Remaining 5% for validation

train_data = data[ : train_portion]
test_data = data[train_portion : train_portion + test_portion]
val_data = data[train_portion + test_portion : ]

In [None]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

🤔 <b><font color='purple'>Question:</font></b> Reusing the code from the previous parts, complete the dataset class below. It takes as input the dataset `data` and a tokenizer, and creates the tokenized inputs (using the `format_input` function and the tokenizer) and the targets (shift by 1 position, append the pad token ID, replace every pad token ID except the first one by -100) that will be used later for finetuning an LLM.

In [None]:
# your work

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer, max_length = 1024, ignore_index = -100):

        # for each entry in data, format the textual input (instruction, input, response)
        # gather all the textual inputs and tokenize them (with padding to the same length) to create "self.inputs"
        self.inputs = ... # tensor
        # using the "self.inputs", create the "self.targets" by shifting by 1 position, adding at the end the pad token id, and replacing all (except the first one) pad token id by -100
        self.targets = ... # tensor

    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]

    def __len__(self):
        return len(self.inputs)

We create the dataset for the training, validation and test set.

In [None]:
train_dataset = InstructionDataset(train_data, tokenizer_gpt2, max_length = 1024, ignore_index = -100)
val_dataset = InstructionDataset(val_data, tokenizer_gpt2, max_length = 1024, ignore_index = -100)
test_dataset = InstructionDataset(test_data, tokenizer_gpt2, max_length = 1024, ignore_index = -100)

🤔 <b><font color='purple'>Question:</font></b> Display one training sample: its associated input and target, with their shape. Visually check that they are well formatted.

In [None]:
# your work

We instantiate the dataloaders to create batches for the train, validation and test dataset.

In [None]:
batch_size = 8

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, drop_last=False)

In [None]:
input_one_batch, target_one_batch = next(iter(train_loader))
print("one input batch of size", input_one_batch.shape)
print("one target batch of size", target_one_batch.shape)

Now, we inspect a GPT2-medium model with a generation head that we will finetune on our instruction dataset.

We print the architecture of the model. Notice the word token embeddings (WTE) and word position embeddings (WPE), the stack of transformer decoder blocks (GPT2 is a decoder-only model), and the last layer (lm_head), which is the generation head.

The "lm_head" has `out_features` equal to 50257, which is the size of the model vocabulary, because the model outputs a score for each subword of its vocabulary at each generation step.

In [None]:
from transformers import GPT2LMHeadModel
llm = GPT2LMHeadModel.from_pretrained(model_id_gpt2)
print("vocab size =", llm.config.vocab_size)
print("\nLLM architecture =\n", llm)

We display the number of parameters of the model.

In [None]:
num_params = sum(param.numel() for param in llm.parameters())
print(num_params)

We define a class that outputs the logits of the GPT2 model with a generation head.

In [None]:
class GPT2TextGenerationModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = GPT2LMHeadModel.from_pretrained(model_id_gpt2)

    def forward(self, input):
        logits = self.llm(input).logits
        return logits

At this stage, we have not finetuned the model yet. We take a sample from the test set and generate the response of the not-yet-finetuned model.

In [None]:
test_sample = test_data[0] # get a test sample # could choose another sample...
print("test sample =\n", test_sample)
prompt = format_beginning_prompt(test_sample) # format the test sample (but without the response part, since it's the model that should answer)
print("\n------------------\nformatted prompt =\n", prompt)

input_ids = tokenizer_gpt2.encode(prompt, return_tensors='pt').to(device) # tokenize the prompt
output = GPT2TextGenerationModel().to(device).llm.generate(input_ids, max_new_tokens=100, do_sample=False) # generate the response from the model (that is in the form of token IDs)
text = tokenizer_gpt2.decode(output[0], skip_special_tokens=True) # decode the model response (to transform the token IDs into a text)
print("\n------------------\nLLM response =\n", text)


🤔 <b><font color='purple'>Question:</font></b> What do you think about the model answer? The model might poorly answer the instruction, it might repeat the prompt several times, etc.

In [None]:
# your answer (no code, just sentences)

We create the training functions to finetune our GPT2 model.

In [None]:
def train(model, train_dataloader, val_dataloader, nb_epochs, device, optimizer):
    training_validation_loss_history = {"training_loss" : [], "validation_loss" : []}
    model = model.to(device)
    initial_validation_loss = epoch_validation(model, val_dataloader, -1, device)
    for epoch in range(nb_epochs):
        training_loss = epoch_training(model, train_dataloader, epoch, device, optimizer)
        validation_loss = epoch_validation(model, val_dataloader, epoch, device)
        training_validation_loss_history["training_loss"].append(training_loss)
        training_validation_loss_history["validation_loss"].append(validation_loss)
    return training_validation_loss_history

def epoch_training(model, dataloader, epoch, device, optimizer):
    model.train()
    loss_epoch_list = []
    with tqdm(dataloader, unit="batch") as tqdm_dataloader:
        tqdm_dataloader.set_description(f"Epoch {epoch}: Training")
        for input, target in tqdm_dataloader:
            # load tensor to GPU if enabled
            input = input.to(device)
            target = target.to(device)
            # forward pass
            logits = model(input)
            # get the loss
            loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
            loss_epoch_list.append(loss.item())
            loss_epoch_mean = sum(loss_epoch_list) / len(loss_epoch_list)
            tqdm_dataloader.set_postfix(loss = loss_epoch_mean)
            # backward pass, optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return loss_epoch_mean

def epoch_validation(model, dataloader, epoch, device):
    model.eval()
    loss_epoch_list = []
    with tqdm(dataloader, unit="batch") as tqdm_dataloader, torch.inference_mode():
        tqdm_dataloader.set_description(f"Epoch {epoch}: Validation")
        for input, target in tqdm_dataloader:
            # load tensor to GPU if enabled
            input = input.to(device)
            target = target.to(device)
            # forward pass
            logits = model(input)
            # get the loss
            loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))
            loss_epoch_list.append(loss.item())
            loss_epoch_mean = sum(loss_epoch_list) / len(loss_epoch_list)
            tqdm_dataloader.set_postfix(loss = loss_epoch_mean)
    return loss_epoch_mean

In [None]:
model_gpt2_LMHead = GPT2TextGenerationModel().to(device)

batch_size = 8
nb_epochs = 2
learning_rate = 1e-4
optimizer = torch.optim.AdamW(model_gpt2_LMHead.parameters(), lr = learning_rate)

train(model_gpt2_LMHead, train_loader, val_loader, nb_epochs, device, optimizer)

At this stage, the model has been finetuned. We take a sample from the test set and generate the response of the now-finetuned model.

In [None]:
test_sample = test_data[0] # get a test sample # could choose another sample...
print("test sample =\n", test_sample)
prompt = format_beginning_prompt(test_sample) # format the test sample (but without the response part, since it's the model that should answer)
print("\n------------------\nformatted prompt =\n", prompt)

input_ids = tokenizer_gpt2.encode(prompt, return_tensors='pt').to(device) # tokenize the prompt
output = model_gpt2_LMHead.llm.generate(input_ids, max_new_tokens=100, do_sample=False) # generate the response from the model (that is in the form of token IDs)
text = tokenizer_gpt2.decode(output[0], skip_special_tokens=True) # decode the model response (to transform the token IDs into a text)
print("\n------------------\nLLM response =\n", text)


🤔 <b><font color='purple'>Question:</font></b> What do you think about the model answer? How does it compare to the previous answer when the model was not finetuned yet?

In [None]:
# your answer (no code, just sentences)