<a href="https://colab.research.google.com/github/pinzger/handsonllms/blob/main/Fine_tuning_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning GPTs

Example code covers:
   * Using QLoRA to fine-tune ...

Example adapted from Chapter 12 of [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961).

---

💡 **NOTE**: To use a GPU in Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

If you are viewing this notebook on Google Colab (or any other cloud vendor), you might need to **uncomment and run** the following codeblock to install the dependencies for this chapter:

In [None]:
# %%capture
# !pip install -q accelerate==0.31.0 peft==0.11.1 bitsandbytes==0.43.1 transformers==4.41.2 trl==0.9.4 sentencepiece==0.2.0


# Supervised Fine-Tuning (SFT)
## Data Preprocessing
We use the HuggingFaceH4/ultrachat_200k dataset for fine-tuning "TinyLlama/TinyLlama-1.1B-Chat-v1.0".

The chat template of this model is (see also the book):<br>
\<|user|\><br>
Question\</s\> (EOS token)<br>
\<|assistant|\><br>
The answer is \</s\>
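
As a rough illustration of the layout above, the following sketch renders a list of messages by hand. It rests on assumptions: `render_chat` is a hypothetical helper and the exact whitespace is a guess. In the notebook, the real string comes from the tokenizer's `apply_chat_template`, which remains authoritative.

```python
# Hypothetical helper: approximates the <|user|>/<|assistant|> layout described above.
# The exact whitespace is an assumption; the tokenizer's chat template is authoritative.
EOS = "</s>"

def render_chat(messages):
    """Render [{"role": ..., "content": ...}, ...] into a single training string."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, EOS token, newline
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS}\n")
    return "".join(parts)

example = [
    {"role": "user", "content": "Question"},
    {"role": "assistant", "content": "The answer is"},
]
print(render_chat(example))
```

Closing every turn with the EOS token is what lets the fine-tuned model learn where an answer should stop.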



In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset


# Load a tokenizer to use its chat template
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    """Format the prompt using the <|user|> template that TinyLlama uses"""

    # Format answers
    chat = example["messages"]
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)

    return {"text": prompt}

# Load and format the data using the template TinyLlama is using
# 3,000 shuffled entries of the test_sft split are used for training
dataset = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
      .shuffle(seed=42)
      .select(range(3_000))
)

# Create the prompts and add them in the column "text"
dataset = dataset.map(format_prompt)

## Load the model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)



## Simple prompt

In [None]:
# Prompt
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate the output
output = pipe(messages)
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!


## A look into the prompt
Note the special tokens that are added by the tokenizer.

In [None]:
# Apply prompt template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)

<|user|>
Create a funny joke about chickens.<|end|>
<|endoftext|>
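
To make the special tokens easier to spot, here is a small sketch that pulls them out of the printed prompt with a regular expression. The prompt string is copied from the output above (the trailing newline is assumed), and the `<|name|>` pattern is an assumption based on the tokens visible here, not a general rule for all tokenizers.

```python
import re

# Prompt string as printed above (trailing newline assumed).
prompt = "<|user|>\nCreate a funny joke about chickens.<|end|>\n<|endoftext|>\n"

# Assumption: special tokens look like <|name|> with lowercase letters or underscores.
special_tokens = re.findall(r"<\|[a-z_]+\|>", prompt)
print(special_tokens)
```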


### Doing the same using sampling and high temperature
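  * temperature controls the randomness of the generated text: higher values (e.g., 1) give more varied output, while values close to 0 give nearly the same text on every run.
  * do_sample=True enables sampling; do_sample=False disables it and always selects the most probable next token (greedy decoding).
  * top_p (nucleus sampling) controls which subset of tokens (the nucleus) is considered: the smallest set of most probable tokens whose cumulative probability reaches top_p; 1 = all tokens.
  * top_k is similar to top_p but specifies the k most probable tokens that should be considered.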



In [None]:
# Using a high temperature
output = pipe(messages, do_sample=True, temperature=1)
print(output[0]["generated_text"])

 Why did the chicken join the band?


Because it had the rhythm of a drum, 

and even more...


It wanted to "cluck" out in the "peck" of its name!


In [None]:
# Using a high top_p
output = pipe(messages, do_sample=True, top_p=1)
print(output[0]["generated_text"])

 Why did the chicken refuse to play hide and seek?

Because it couldn't find any hiding places!


(Pun: A chicken refers to both the animal and the slang term for a chicken crossbow – an anagram of "Henrietta")
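
The sampling parameters used in this section can be illustrated on a toy distribution. The following is a minimal pure-Python sketch, not the implementation used by `transformers`: the function names and the toy logits are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical scores for 4 tokens
flat = softmax(toy_logits, temperature=1.0)   # more spread out
sharp = softmax(toy_logits, temperature=0.2)  # concentrated on the top token

print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
print(sorted(top_k_filter(flat, 2)))  # ids of the 2 most probable tokens
print(sorted(top_p_filter(flat, 0.5)))  # ids inside the nucleus for top_p=0.5
```

After filtering, a sampler would draw the next token from the renormalized probabilities; with `do_sample=False`, the most probable token is taken directly.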


# Advanced Prompt Engineering
Shows the effects of using various components to compose a prompt.

## One-shot prompting
Providing ONE example. Note the importance of differentiating between the user and the assistant.

In [None]:
# Use a single example of using the made-up word in a sentence
one_shot_prompt = [
    {
        "role": "user",
        "content": "A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:"
    },
    {
        "role": "assistant",
        "content": "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
    },
    {
        "role": "user",
        "content": "To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:"
    }
]
print(tokenizer.apply_chat_template(one_shot_prompt, tokenize=False))

<|user|>
A 'Gigamuru' is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:<|end|>
<|assistant|>
I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.<|end|>
<|user|>
To 'screeg' something is to swing a sword at it. An example of a sentence that uses the word screeg is:<|end|>
<|endoftext|>


In [None]:
# Generate the output
outputs = pipe(one_shot_prompt)
print(outputs[0]["generated_text"])

 During the medieval reenactment, the knight skillfully screeged the wooden target, impressing the onlookers with his prowess.
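
The alternating user/assistant structure generalizes from one example to several. Below is a small sketch of a hypothetical helper, `build_few_shot_messages`, that builds such a message list from (question, answer) pairs; it only constructs the list and does not call the model.

```python
def build_few_shot_messages(examples, question):
    """Turn (user_text, assistant_text) pairs plus a final question into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model should answer
    messages.append({"role": "user", "content": question})
    return messages

examples = [
    (
        "A 'Gigamuru' is a type of Japanese musical instrument. "
        "An example of a sentence that uses the word Gigamuru is:",
        "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home.",
    ),
]
question = (
    "To 'screeg' something is to swing a sword at it. "
    "An example of a sentence that uses the word screeg is:"
)
messages = build_few_shot_messages(examples, question)
print([m["role"] for m in messages])
```

The resulting list has the same shape as `one_shot_prompt` and could be passed to `pipe(...)` in the same way.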


## Chain prompting
Like in a conversation, use the output of one prompt as input to the next prompt.

In [None]:
# Create name and slogan for a product
product_prompt = [
    {"role": "user", "content": "Create a name and slogan for a chatbot that leverages LLMs."}
]
outputs = pipe(product_prompt)
product_description = outputs[0]["generated_text"]
print(product_description)

 Name: ChatSage
Slogan: "Unleashing the power of AI to enhance your conversations."


In [None]:
# Based on a name and slogan for a product, generate a sales pitch
sales_prompt = [
    {"role": "user", "content": f"Generate a very short sales pitch for the following product: '{product_description}'"}
]
outputs = pipe(sales_prompt)
sales_pitch = outputs[0]["generated_text"]
print(sales_pitch)

 Introducing ChatSage, the revolutionary AI-powered tool designed to elevate your conversations to new heights. With our cutting-edge technology, we unleash the power of AI to enhance your interactions, making every conversation more engaging, insightful, and meaningful. Experience the future of communication with ChatSage today!
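
The two cells above can be seen as one loop in which each output is pasted into the next prompt. Here is a minimal sketch of that idea. `run_chain` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in so the example runs without a model; in the notebook it would wrap the call to `pipe(...)`.

```python
def run_chain(generate, templates, first_input):
    """Feed each step's output into the {previous} slot of the next template."""
    outputs = []
    previous = first_input
    for template in templates:
        prompt = template.format(previous=previous)
        previous = generate(prompt)
        outputs.append(previous)
    return outputs

def fake_generate(prompt):
    """Stand-in for the model call (assumption: the real one would use the pipeline)."""
    return f"[reply to: {prompt}]"

templates = [
    "Create a name and slogan for {previous}.",
    "Generate a very short sales pitch for the following product: '{previous}'",
]
steps = run_chain(fake_generate, templates, "a chatbot that leverages LLMs")
for step in steps:
    print(step)
```

Keeping the intermediate outputs in a list makes it easy to inspect where a chain went wrong.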


## Chain-of-Thought prompting
The prompt provides the reasoning for an answer.

First, without providing any reasoning in the example. Note that the model is already good enough to reason step by step on its own ;-)



In [None]:
# Answering without explicit reasoning
standard_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "11"},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 5 bags each containing 4 apples to make lunch and bought 2 bags more, how many apples do they have?"}
]

# Run generative model
outputs = pipe(standard_prompt)
print(outputs[0]["generated_text"])



 The cafeteria started with 25 apples. They used 5 bags, each containing 4 apples, for lunch. To find out how many apples they used, we multiply the number of bags by the number of apples in each bag:


5 bags * 4 apples/bag = 20 apples used


Now, we subtract the apples used from the initial amount:


25 apples - 20 apples = 5 apples remaining


The cafeteria then bought 2 more bags of apples. Since each bag contains 4 apples, we calculate the number of apples in the new bags:


2 bags * 4 apples/bag = 8 apples


Finally, we add the apples from the new bags to the remaining apples:


5 apples + 8 apples = 13 apples


So, the cafeteria now has 13 apples.


Now the assistant's answer in the example also contains the reasoning.

In [None]:
# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 5 bags each containing 4 apples to make lunch and bought 2 bags more, how many apples do they have?"}
]

# Generate the output
outputs = pipe(cot_prompt)
print(outputs[0]["generated_text"])

 The cafeteria started with 25 apples. They used 5 bags, each containing 4 apples, which totals 5 * 4 = 20 apples used. After using these apples, they had 25 - 20 = 5 apples left. They then bought 2 more bags, with each bag containing 4 apples, which is an additional 2 * 4 = 8 apples. Adding these to the remaining apples, they now have 5 + 8 = 13 apples. The answer is 13.
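
Because the example answer ends with "The answer is 11.", the model tends to close its own reply in the same way, which makes the result easy to parse. The sketch below uses a hypothetical helper, `extract_final_answer`, and checks it against the arithmetic of the question; the reply string is shortened from the output above.

```python
import re

def extract_final_answer(text):
    """Return the integer after the last 'The answer is', or None if absent."""
    matches = re.findall(r"The answer is\s+(-?\d+)", text)
    return int(matches[-1]) if matches else None

# The arithmetic of the cafeteria question: start, minus 5 bags of 4, plus 2 bags of 4
expected = 25 - 5 * 4 + 2 * 4

reply = "Adding these to the remaining apples, they now have 5 + 8 = 13 apples. The answer is 13."
print(extract_final_answer(reply), expected)
```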


## Zero-shot chain-of-thought
Instead of providing an example plus reasoning, instruct the LLM to reason step by step.

In [None]:
# Zero-shot Chain-of-Thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 25 apples. If they used 5 bags each containing 4 apples to make lunch and bought 2 bags more, how many apples do they have? Let's think step-by-step."}
]

# Generate the output
outputs = pipe(zeroshot_cot_prompt)
print(outputs[0]["generated_text"])

 Step 1: Calculate the total number of apples used to make lunch.
The cafeteria used 5 bags, each containing 4 apples. So, the total number of apples used is 5 bags * 4 apples/bag = 20 apples.

Step 2: Subtract the number of apples used from the initial number of apples.
The cafeteria initially had 25 apples. After using 20 apples for lunch, they have 25 apples - 20 apples = 5 apples left.

Step 3: Calculate the number of apples bought.
The cafeteria bought 2 more bags, each containing 4 apples. So, the total number of apples bought is 2 bags * 4 apples/bag = 8 apples.

Step 4: Add the number of apples bought to the remaining apples.
The cafeteria had 5 apples left and bought 8 more apples. So, they now have 5 apples + 8 apples = 13 apples.

The cafeteria now has 13 apples.


## Zero-shot tree-of-thought
Done with a single prompt that asks the LLM to mimic a conversation between multiple experts.

In [None]:
# Zero-shot Tree-of-Thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will perform exactly 1 step of their thinking, that they share with the group. Then all experts will go on to the next step. If any expert realises they're wrong at any point then they leave. The format of the output is: Expert 1: Step 1: reasoning, Expert 2: Step 1: reasoning, etc. The question is 'The cafeteria had 25 apples. If they used 5 bags each containing 4 apples to make lunch and bought 2 bags more, how many apples do they have?' Make sure to discuss the results."}
]

# Generate the output
outputs = pipe(zeroshot_tot_prompt)
print(outputs[0]["generated_text"])

 Expert 1: Step 1: First, calculate the total number of apples used for lunch by multiplying the number of bags (5) by the number of apples per bag (4), which equals 20 apples.

Expert 2: Step 1: Next, calculate the number of apples in the additional bags by multiplying the number of bags (2) by the number of apples per bag (4), which equals 8 apples.

Expert 3: Step 1: Then, add the number of apples used for lunch (20) to the number of apples in the additional bags (8), which equals 28 apples.

Expert 1: Step 2: Subtract the total number of apples used for lunch (20) from the initial number of apples (25), which equals 5 apples remaining.

Expert 2: Step 2: Add the number of apples remaining (5) to the number of apples in the additional bags (8), which equals 13 apples.

Expert 3: Step 2: The final answer is 13 apples.

Discussion: All three experts arrived at the same final answer, which is 13 apples. Expert 1 calculated the number of apples used for lunch and the remaining apples, w
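
The tree-of-thought instruction can be wrapped in a small function so that the question and the number of experts become parameters. This is a sketch only: `build_tot_prompt` is a hypothetical helper, and its wording is a paraphrase of the prompt above, not a tested recipe.

```python
def build_tot_prompt(question, n_experts=3):
    """Compose a single-prompt tree-of-thought instruction for n_experts simulated experts."""
    return (
        f"Imagine {n_experts} different experts are answering this question. "
        "All experts will perform exactly 1 step of their thinking, that they share with the group. "
        "Then all experts will go on to the next step. "
        "If any expert realises they're wrong at any point then they leave. "
        "The format of the output is: Expert 1: Step 1: reasoning, Expert 2: Step 1: reasoning, etc. "
        f"The question is '{question}' Make sure to discuss the results."
    )

question = (
    "The cafeteria had 25 apples. If they used 5 bags each containing 4 apples "
    "to make lunch and bought 2 bags more, how many apples do they have?"
)
tot_messages = [{"role": "user", "content": build_tot_prompt(question)}]
print(tot_messages[0]["content"])
```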

## Zero-shot prompt with multiple components

In [None]:
# Text to summarize which we stole from https://jalammar.github.io/illustrated-transformer/ ;)
text = """In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard’s NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
"""

# Prompt components
persona = "You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.\n"
instruction = "Summarize the key findings of the paper provided.\n"
context = "Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.\n"
data_format = "Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.\n"
audience = "The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.\n"
tone = "The tone should be professional and clear.\n"
#text = "MY TEXT TO SUMMARIZE"  # Replace with your own text to summarize
data = f"Text to summarize: {text}"

# The full prompt - remove and add pieces to view its impact on the generated output
query = persona + instruction + context + data_format + audience + tone + data

In [None]:
# Create the input message
messages = [
    {"role": "user", "content": query}
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

<|user|>
You are an expert in Large Language models. You excel at breaking down complex papers into digestible summaries.
Summarize the key findings of the paper provided.
Your summary should extract the most crucial points that can help researchers quickly understand the most vital information of the paper.
Create a bullet-point summary that outlines the method. Follow this up with a concise paragraph that encapsulates the main results.
The summary is designed for busy researchers that quickly need to grasp the newest trends in Large Language Models.
The tone should be professional and clear.
Text to summarize: In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neur

In [None]:
# Generate the output
outputs = pipe(messages)
print(outputs[0]["generated_text"])

 - The Transformer model, introduced in the paper "Attention is All You Need," utilizes attention mechanisms to enhance the training speed of deep learning models, particularly in neural machine translation.
- The model consists of an encoding component and a decoding component, each with multiple identical sub-layers. The encoding component includes self-attention layers and feed-forward neural networks, while the decoding component incorporates attention layers that focus on relevant parts of the input sentence.
- The Transformer model is highly parallelizable, making it a preferred choice for Google Cloud's TPU offering.
- The model processes input sequences by converting each word into a vector using an embedding algorithm, with the size of the vector being a hyperparameter.
- The input vectors flow through the encoder's sub-layers, with dependencies between paths in the self-attention layer and independent execution in the feed-forward layer.
- The Transformer model has shown supe
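
The suggestion to remove and add pieces of the prompt can be made systematic by keeping the components in a dictionary and leaving some out on demand. The following is a minimal sketch; `build_query` is a hypothetical helper and the component texts are shortened stand-ins for the ones defined above.

```python
# Shortened stand-ins for the prompt components defined in the notebook
components = {
    "persona": "You are an expert in Large Language models.\n",
    "instruction": "Summarize the key findings of the paper provided.\n",
    "context": "Your summary should extract the most crucial points.\n",
    "data_format": "Create a bullet-point summary that outlines the method.\n",
    "audience": "The summary is designed for busy researchers.\n",
    "tone": "The tone should be professional and clear.\n",
    "data": "Text to summarize: MY TEXT TO SUMMARIZE",
}

def build_query(components, skip=()):
    """Concatenate the components in order, leaving out the names listed in skip."""
    return "".join(text for name, text in components.items() if name not in skip)

full_query = build_query(components)
no_persona_or_tone = build_query(components, skip=("persona", "tone"))
print(no_persona_or_tone)
```

Generating one output per variant and comparing them side by side shows which components actually change the summary.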