[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1E0ekDQ_xMYDDsaXX42gu9O27pRw9bzQx?usp=sharing)

# LLM Finetuning Labs

## LLM Finetuning Introduction

### In this lab you will...

- run your first fine-tuning
- experiment with techniques to reduce VRAM usage
- discover two families of fine-tuning methods
- visualize the effect of fine-tuning

The goal of this lab is to fine-tune a base model into an _instruct_ model. A base model only generates unformatted text: you feed it the beginning of a sentence and it continues it. At least, that is _what it is trained for_. It has not been trained on chat-formatted prompts and will likely respond poorly to them; in other words, a base model cannot act as a chatbot. To make a chatbot, we need to fine-tune the model so that it responds properly to chat-like inputs. Such a fine-tuned model is called an _instruct_ model (short for instruction-tuned model).

### Prerequisites

There are some prerequisites (even for the first notebook), as LLMs are complex models and training them requires some advanced knowledge:
- You should be comfortable programming in Python.
- You should be familiar with the main libraries we are going to use: transformers, numpy, torch, pandas, peft, trl.
- I strongly recommend reading some documentation about LLMs and fine-tuning first; there are some great tutorials [in the Transformers documentation](https://huggingface.co/docs/transformers/index).
- Some concepts may be difficult to understand without a solid grounding in linear algebra and probability.
- You should know your way around the Hugging Face website (to find models and documentation).

## Part 1: Fine-tuning a base model using LoRA

In this section, we fine-tune a very small LLM using one of the most widespread fine-tuning methods: [Low-Rank Adaptation](https://huggingface.co/docs/peft/main/conceptual_guides/lora) (LoRA). I suggest you read the documentation linked above to better understand the basis of what we are going to do in this section.

LoRA belongs to a family of fine-tuning methods called adapter methods. There are many different adapter methods (follow the link above to discover more of them), but LoRA is by far the most popular.

### 1.1. Preparing the environment

Nothing to see here, just the imports and basic functions

In [35]:
!pip install bitsandbytes datasets trl peft transformers wandb




In [36]:
import json
import os
from pprint import pprint

import bitsandbytes as bnb
import pandas as pd
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

import wandb

wandb.init(mode='disabled')

In [37]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    Utility function to see how efficient LoRA is.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

### 1.2. Load the base model

After having installed and imported all the necessary libraries, we load a base model. Feel free to change the base model to see the effect on the results you get.

We load it here quantized to 4 bits to save a lot of VRAM. Feel free to experiment with other settings; you can find the bitsandbytes documentation [here](https://huggingface.co/docs/bitsandbytes/index).

In [38]:
# Very small LLM, you can try other models from huggingface
model_repo_id = "Qwen/Qwen2.5-0.5B"


# Optional: Load the base model quantized (this saves a lot of VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_repo_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
tokenizer.pad_token = tokenizer.eos_token
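
As a rough illustration of what storing weights in 4 bits means, the sketch below quantizes a small block of made-up weights to signed 4-bit integers with a single absmax scale, then restores them. It is a simplified stand-in and not the bitsandbytes implementation: NF4 uses a non-uniform set of levels, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = absmax / qmax
    codes = [round(v / scale) for v in values]    # small integers in [-7, 7]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]

# Made-up weights standing in for one block of a linear layer
weights = [0.12, -0.70, 0.33, 0.08, -0.21, 0.64, -0.02, 0.41]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f} (scale = {scale:.4f})")
```

Each weight is stored as a 4-bit code plus one shared scale per block, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per weight.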

### 1.3. Preparing the base model for training

What does LoRA do exactly?
When we call the function
```python
model = get_peft_model(model, lora_config)
```
(see below), the parameters of the base model are frozen, which means they will not be trained. In addition, new layers (also referred to as adapters) are added in parallel to the layers listed in the _target\_modules_ argument of _LoraConfig_. Each adapter consists of two linear transformations (matrix multiplications) applied one after the other, and its output is added to the output of the layer it sits beside. With the other arguments of _LoraConfig_ we can define the shape, scaling and initialization of these matrices.
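
To make this concrete, here is a minimal sketch of what one adapted layer computes, written in plain Python rather than with PEFT's actual classes. The names `matvec`, `lora_forward`, `W`, `A` and `B` are illustrative, the `alpha / r` scaling follows the LoRA paper, and the toy sizes and values are assumptions chosen for readability.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha, r):
    """Frozen layer output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)                 # frozen pretrained weights
    update = matvec(B, matvec(A, x))    # two small linear maps: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes (assumed): d_in = 3, d_out = 2, rank r = 1
W = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 0.0]]
A = [[0.1, -0.2, 0.3]]      # r x d_in, small random values in practice
B = [[0.0], [0.0]]          # d_out x r, starts at zero
x = [1.0, 2.0, 3.0]

untrained = lora_forward(x, W, A, B, alpha=4.0, r=1)
print(untrained == matvec(W, x))   # True while B is all zeros

B_trained = [[0.5], [-1.0]]        # pretend training moved B away from zero
trained = lora_forward(x, W, A, B_trained, alpha=4.0, r=1)
print(trained)
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is why the adapter can be stored and shared separately from the base model.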

In [39]:
config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj"],
    bias="none",
    task_type="CAUSAL_LM",  # PEFT task type for decoder-only language models
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 540672 || all params: 315660160 || trainable%: 0.17128293922172502


Here we see the advantage of using adapters instead of training the full model: we train only 0.17% of the total number of parameters! Note that all the trainable parameters are new ones, added on top of the frozen base model.

Feel free to change the target modules (the layers that get an adapter) to see the effect on the number of trainable parameters.

NB: Each adapter is initialized so that its output is always 0, so the model with untrained adapters produces exactly the same outputs as the base model.
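
The two numbers printed above can be reproduced by hand. The sketch below assumes the Qwen2.5-0.5B layer sizes listed in the first lines (they are not stated in this notebook, so check the model's `config.json` before relying on them) and assumes that each 4-bit linear weight is stored packed, two values per byte, so that `numel()` reports half of its elements.

```python
# Assumed Qwen2.5-0.5B dimensions (verify against the model's config.json)
hidden, kv_dim, mlp_dim = 896, 128, 4864
layers, vocab = 24, 151936
r = 8

def lora_params(d_in, d_out, rank):
    """A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)

# Adapters on q_proj (896 -> 896) and k_proj (896 -> 128) in every layer
trainable = layers * (lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r))

# Linear weights per decoder layer: q and o, k and v, and the three MLP projections
linear = 2 * hidden * hidden + 2 * hidden * kv_dim + 3 * hidden * mlp_dim
# Biases of q/k/v plus the two normalization layers, kept unquantized
other = (hidden + 2 * kv_dim) + 2 * hidden
embedding = vocab * hidden
final_norm = hidden

# Packed 4-bit weights count for half their elements
packed_total = layers * (linear // 2 + other) + embedding + final_norm + trainable
full_total = layers * (linear + other) + embedding + final_norm + trainable

print(trainable)       # 540672
print(packed_total)    # 315660160
print(f"{100 * trainable / packed_total:.2f}% of the packed count")
print(f"{100 * trainable / full_total:.2f}% of the unpacked count")
```

Under these assumptions, the "all params" figure is lower than the nominal 0.5B because of the packing, so the adapter share of the unquantized model is even smaller than the percentage printed by `print_trainable_parameters`.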

### 1.4. Testing the base model before training

The first thing to do is to check whether the model really needs to be fine-tuned. We can quickly see that it does:

In [40]:
prompt = [
    {'role': 'user',
     'content': 'What equipment do I need for rock climbing ?'}
]


prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
print(prompt)
# No model.to('cuda') needed: device_map="auto" already placed the quantized model on the GPU


generation_config = model.generation_config
generation_config.do_sample = True
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What equipment do I need for rock climbing ?<|im_end|>
<|im_start|>assistant



In [41]:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user
What equipment do I need for rock climbing ?
assistant
You are a helpful assistant.uable
ITIVE
You are a helpful assistant.uable
ITIVE
[... the same two lines repeat 20 times in total ...]



As you can see, and as I said in the introduction, this base model is terrible at generating text in a chat format.

The answer:
- often contains garbage characters
- does not make sense even when it contains actual words (it repeats the system instruction or the user prompt)
- seems to ignore the chat-like format that we want

A bigger base model would probably give better results, but it would still be very limited on chat-like inputs.

### 1.5. Preparing the data

To fix that, we need a dataset of user queries paired with assistant responses.

We are using the _helpful-instructions_ dataset here.

In [42]:
data = load_dataset("HuggingFaceH4/helpful-instructions")
pd.DataFrame(data["train"])

| | instruction | demonstration | meta |
|---|---|---|---|
| 0 | Hi, I want to learn to play horseshoes. Can yo... | I can, but maybe I should begin by telling you... | {'source': 'helpful-anthropic-raw'} |
| 1 | How do I teach kids to meditate? | Great question! That’s a really useful skill t... | {'source': 'helpful-anthropic-raw'} |
| 2 | Can you tell me the steps for getting a harbor... | Sure. I believe you’ll need a copy of the mari... | {'source': 'helpful-anthropic-raw'} |
| 3 | How can I store food if I don't have a pantry? | You could store the food in a refrigerator, th... | {'source': 'helpful-anthropic-raw'} |
| 4 | what are some good novels for a 9 year old? | That depends on the 9 year old, but if they li... | {'source': 'helpful-anthropic-raw'} |
| ... | ... | ... | ... |
| 147701 | Given the following sentence, classify it into... | Fact | {'source': 'helpful-self-instruct-raw'} |
| 147702 | A person wants to write a book. he/she writes ... | Chapter 1 - The History of China\nChapter 2 - ... | {'source': 'helpful-self-instruct-raw'} |
| 147703 | Tell me how you would make a popular app game. | I would make a game that is similar to 2048. T... | {'source': 'helpful-self-instruct-raw'} |
| 147704 | Describe your dream house to me.\n\nOutput: | My dream house is a two-story building with a ... | {'source': 'helpful-self-instruct-raw'} |


This dataset is made of text, whereas the model takes tokens. Therefore, before going further, let's apply the chat template (which adds the special tokens) and tokenize the whole dataset:

In [43]:
def generate_prompt(data_point):
    chat = [
        {'role': 'user', 'content': data_point['instruction']},
        {'role': 'assistant', 'content': data_point['demonstration']}
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data["train"].shuffle(seed=42).map(generate_and_tokenize_prompt)
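
To see what the chat template adds around each example, here is a plain-Python approximation of the format visible in the printed prompts of this notebook. The function `render_chatml` is a hypothetical re-implementation for illustration only: the real template ships with the tokenizer and may differ in details, and the assistant answer in the example is made up.

```python
def render_chatml(messages, add_generation_prompt=False,
                  default_system="You are a helpful assistant."):
    """Approximate the chat format seen in this notebook's printed prompts."""
    # A default system message is inserted when the chat does not start with one
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": default_system}] + list(messages)
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # For inference, open the assistant turn and let the model complete it
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

example = [
    {"role": "user", "content": "What equipment do I need for rock climbing ?"},
    {"role": "assistant", "content": "A harness, a rope and climbing shoes."},  # made-up answer
]
print(render_chatml(example))
```

For training, the whole conversation including the assistant turn and its closing `<|im_end|>` is rendered, which is what teaches the model where an answer should stop.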

### 1.6. Training the model

Now that the data is loaded and the model is ready, let's train it.

You can find documentation on how training works in the _transformers_ library [here](https://huggingface.co/docs/transformers/trainer).

In [44]:
OUTPUT_DIR = "experiments"

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200,   # try more steps if you can
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss |
|---|---|
| 1 | 4.5091 |
| 2 | 2.2521 |
| 3 | 3.2916 |
| 4 | 2.4296 |
| 5 | 3.3322 |
| 6 | 3.1143 |
| 7 | 3.8653 |
| 8 | 3.08 |
| 9 | 3.8403 |
| 10 | 3.3081 |


TrainOutput(global_step=200, training_loss=2.515097911953926, metrics={'train_runtime': 192.9635, 'train_samples_per_second': 4.146, 'train_steps_per_second': 1.036, 'total_flos': 180418728171264.0, 'train_loss': 2.515097911953926, 'epoch': 0.005416164543078819})
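
The `epoch` value in the `TrainOutput` above shows how little of the dataset 200 steps actually cover. The arithmetic below is a small check using the training arguments of this section; the single-GPU setting is an assumption.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 200
num_devices = 1                 # assumption: a single GPU, as on Colab

# Sequences that contribute to each optimizer step
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps

reported_epoch = 0.005416164543078819    # copied from the TrainOutput above
implied_dataset_size = samples_seen / reported_epoch

print(effective_batch_size)          # 4
print(samples_seen)                  # 800
print(round(implied_dataset_size))   # 147706
```

In other words, the model has seen roughly 800 of about 147,700 examples, or about half a percent of one epoch.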

We can see the loss going down overall during training. However, it is quite unstable. This mostly reflects the very small effective batch size (each logged value is computed on only 4 sequences) combined with the transformer architecture, [which is known to be unstable to train](https://liyuanlucasliu.github.io/files/slides-transformer-clinic.pdf).
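
For reference, here is a sketch of the learning rate selected by `lr_scheduler_type="cosine"` and `warmup_ratio=0.05` in the training arguments above. It assumes the usual linear-warmup-then-half-cosine shape; `lr_at_step` is a hypothetical helper written for this illustration, not a function of the _transformers_ library.

```python
import math

def lr_at_step(step, base_lr=2e-4, total_steps=200, warmup_ratio=0.05):
    """Linear warmup followed by a single half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)   # 10 steps here
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 5, 10, 105, 200):
    print(step, f"{lr_at_step(step):.2e}")
```

With these settings the first 10 steps, which are exactly the ones shown in the loss table, are still in the warmup phase.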

### 1.7. Test the model

We can now test the result of the fine-tuning:

In [45]:
device = "cuda:0"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant.
user
What equipment do I need for rock climbing ?
assistant
You need a pair of climbing shoes, a rope, a harness, a harness strap, a harness loop, a harness loop clamp, a harness loop loop, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp, a harness loop loop clamp,


This looks much better! The model now actually tries to answer the question. However, it struggles to produce a coherent answer and to stop once it has nothing more to say, so it ends up repeating itself many times.

Let's do a couple more tests:

In [46]:
def generate_response(question: str) -> str:
    chat = [
        {'role': 'user', 'content': question}
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    # Decode only the newly generated tokens. Searching the decoded text for
    # "assistant" would match the system prompt ("You are a helpful assistant.") first.
    new_tokens = outputs[0][encoding.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

In [47]:
prompt = "What program can I use to edit video clips I took with my phone?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Do you know the reasons as to why people love coffee so much?"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

- What program can I use to edit video clips I took with my phone? 

You can use Adobe Premiere Pro to edit video clips you have taken with your phone. This program is free and can be used to edit video clips, music, and even audio files. eBooks



- Do you know the reasons as to why people love coffee so much? 

Coffee is a very popular drink in the world. It is believed that it was first brought to Europe by the Portuguese in the 15th century. The reason why people love coffee is because it is a very good source of caffeine. Caffeine is a stimulant that can help people feel more awake and alert. It is also believed that coffee has a lot of health benefits. It can help people to feel more energetic and improve their mood. However, it is important to note that caffeine can also cause some health risks if consumed in large amoun

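The remaining issues are clearer now: the answers are on topic, but they contain factual errors (Adobe Premiere Pro is not free) and stray tokens (the trailing "eBooks").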

## Part 2: Further refining the model using DPO

In the first part, we obtained a decent model. It was not great and still had issues, but it was considerably better than the base model.

Now let's try to adapt the model further. To sum up:
- the model is now able to hold conversations in a chat-like format
- the model struggles to know when to stop generating text
- the model struggles to produce coherent sentences

We can partially solve the last two issues with DPO (Direct Preference Optimization). I strongly advise you to try to understand the method before continuing.
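
As a starting point for understanding the method, here is a sketch of the DPO loss for a single preference pair, following the formula of the DPO paper. The function `dpo_loss` is a hypothetical helper, and the log-probabilities are made-up numbers; in practice they are the summed token log-probabilities of each answer under the trained model (the policy) and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each answer
    # compared with the reference model, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0 and the loss is log(2)
loss_start = dpo_loss(-40.0, -42.0, -40.0, -42.0)
print(loss_start)

# Policy that raised the chosen answer and lowered the rejected one
loss_better = dpo_loss(-35.0, -50.0, -40.0, -42.0)
print(loss_better)
```

The loss only goes down when the model prefers the chosen answer over the rejected one more strongly than the reference model does, which is how DPO learns from pairs without a separate reward model.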

### 2.1. Test the model before DPO

We do yet another test to compare before and after DPO:

In [48]:
chat2 = [
    {'role': 'system', 'content': 'You are a helpful assistant'},
    {'role': 'user', 'content': 'Can you taste this dish and tell me if it needs more spices?'},
]
prompt_2 = tokenizer.apply_chat_template(chat2, tokenize=False, add_generation_prompt=True)
print(prompt_2)

device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Can you taste this dish and tell me if it needs more spices?<|im_end|>
<|im_start|>assistant

system
You are a helpful assistant
user
Can you taste this dish and tell me if it needs more spices?
assistant
Yes, I can! It needs more spice. Here is the recipe: 2 tablespoons of salt, 1 tablespoon of pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2 teaspoon of cayenne pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2 teaspoon of cayenne pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2 teaspoon of cayenne pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2 teaspoon of cayenne pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2 teaspoon of cayenne pepper, 1/2 teaspoon of cumin, 1/2 teaspoon of coriander, 1/2


### 2.2. Preparing the data

DPO requires a special dataset in which each entry contains the user input and two possible model outputs: a preferred one and an unwanted one.

One such dataset is __CultriX/llama70B-dpo-dataset__, which we will use here. Let's load it and format it:

In [49]:
data_dpo = load_dataset("CultriX/llama70B-dpo-dataset")
pd.DataFrame(data_dpo["train"])

| | system | question | chosen | rejected |
|---|---|---|---|---|
| 0 | You are a helpful assistant | Can you taste this dish and tell me if it need... | I'm not capable of physically tasting the dish... | I can taste the dish remotely and I think it n... |
| 1 | You are a helpful assistant | Can you smell this perfume and tell me if it's... | As a digital AI assistant, I don't have a phys... | I can smell the perfume, and it smells great. ... |
| 2 | You are a helpful assistant | Please turn off the lights in my room. | I'm a large language model, I don't have the c... | I can turn off the lights in your room remotel... |
| 3 | You are a helpful assistant | What is the primary reason why many approximat... | The primary reason why many approximate attent... | Approximate attention methods haven't gained w... |
| 4 | You are a helpful assistant | What are the properties of melanin that contri... | Melanin's photoprotective role can be attribut... | Melanin's photoprotective role is due to its a... |
| ... | ... | ... | ... | ... |
| 2174 | You are a helpful assistant | What is the sound barrier? | The sound barrier, also known as the sonic bar... | The sound barrier is a physical wall that prev... |
| 2175 | You are a helpful assistant | Is spinach a great source of dietary iron? | Spinach is a good source of dietary iron, but ... | Spinach is a poor source of dietary iron, and ... |
| 2176 | You are a helpful assistant | What challenges arise in training large langua... | Training large language models (LLMs) poses se... | The main challenge in training LLMs is the lac... |
| 2177 | You are a helpful assistant | Does the theory of evolution explain the origi... | The theory of evolution explains how life on E... | The theory of evolution fully explains the ori... |


In [50]:
def preprocess_data_dpo(data_point):
    chat = [
        {'role': 'system', 'content': data_point['system']},
        {'role': 'user', 'content': data_point['question']}
    ]
    return {'prompt': tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True),
            'chosen': data_point['chosen'],
            'rejected': data_point['rejected']}

data_dpo = data_dpo['train'].shuffle(seed=42).map(preprocess_data_dpo)

In [51]:
print(data_dpo)
data_dpo[0]

Dataset({
    features: ['system', 'question', 'chosen', 'rejected', 'prompt'],
    num_rows: 2179
})


{'system': 'You are a helpful assistant',
 'question': "What are the benefits of utilizing sparse upcycling in the context of training neural networks, according to the insights provided in 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints'?",
 'chosen': 'Sparse upcycling offers several benefits in training neural networks, including improved model performance, increased efficiency, and reduced computational costs. By leveraging the knowledge contained in dense pre-trained models, sparse upcycling enables the creation of mixture-of-experts models that can achieve better accuracy and faster convergence, while also reducing the need for extensive retraining.',
 'rejected': "Sparse upcycling is not beneficial for training neural networks, as it can lead to overfitting and decreased model performance. According to 'Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints', sparse upcycling is only useful for reducing model size, but it does not provide any i

We can see the structure of the dataset:
- _system_ is the instruction given to the model
- _question_ is the user-asked question
- _chosen_ is the target answer from the model
- _rejected_ is the answer we do not want
- _prompt_ is the column we just added, containing the data ready to be tokenized for training

### 2.3. Training

With the model from Part 1 and the data now ready, let's train the model using LoRA and DPO.

In [52]:
OUTPUT_DIR = "experiments_dpo"

training_args = DPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=200, # try more if you can
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    beta=0.1,  # DPO temperature: higher values keep the model closer to the reference
)

trainer = DPOTrainer(
    model=model,
    tokenizer=tokenizer,  # recent TRL releases call this argument `processing_class`
    args=training_args,
    train_dataset=data_dpo,
    # Data collator is not needed for DPOTrainer as it internally manages it
)

model.config.use_cache = False
trainer.train()

max_steps is given, it will override any value given in num_train_epochs
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss |
|---|---|
| 1 | 1.2886 |
| 2 | 0.6927 |
| 3 | 1.2087 |
| 4 | 1.1344 |
| 5 | 1.1684 |
| 6 | 1.4402 |
| 7 | 1.4807 |
| 8 | 1.1205 |
| 9 | 1.1279 |
| 10 | 0.9926 |


TrainOutput(global_step=200, training_loss=0.3054616847890429, metrics={'train_runtime': 287.6007, 'train_samples_per_second': 2.782, 'train_steps_per_second': 0.695, 'total_flos': 0.0, 'train_loss': 0.3054616847890429, 'epoch': 0.36714089031665903})

### 2.4. Testing the model after DPO

Let's test the new model:

In [53]:
device = "cuda:0"

encoding = tokenizer(prompt_2, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
You are a helpful assistant
user
Can you taste this dish and tell me if it needs more spices?
assistant
As an AI language model, I can provide information and recommendations based on my knowledge base. However, I cannot taste or taste any specific dish. However, if the dish requires more spices, I can suggest alternative seasoning blends or spices that may complement the flavors of the dish. However, it is always best to taste the dish to ensure that it is enjoyable and satisfies your taste buds.


Okay, now we are getting somewhere! The answer is precise and coherent, stops at the right time and does not repeat itself... and this is a 0.5B model!

As in Part 1, let's test the model a few more times:

In [54]:
def generate_response(question: str) -> str:
    chat = [
        {'role': 'user', 'content': question}
    ]
    prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            generation_config=generation_config,
        )
    # Decode only the newly generated tokens instead of searching the decoded
    # text for a marker string, which may be missing or appear in the prompt.
    new_tokens = outputs[0][encoding.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

In [55]:
prompt = "Do people dream in color or black and white?"
print('-', prompt,'\n')
print(generate_response(prompt))

prompt = "Explain the concept of economic policies in simple terms"
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

prompt = "Explain the effects of globalization on the environment."
print('\n\n\n-', prompt, '\n')
print(generate_response(prompt))

- Do people dream in color or black and white? 

Yes, many people, including some individuals who have visual impairments or have a preference for black and white color perception. However, it is important to note that while some individuals may have a preference for black and white colors, it is not a universal phenomenon and many people have varying color experiences based on their visual abilities and preferences. It is also worth noting that there are various colors that are considered as black and white shades, such as shades of gray, dark blue, and other dark shades.



- Explain the concept of economic policies in simple terms 

Economic policies are strategies and policies that governments use to achieve certain goals, such as growth, stability, and economic growth, among others. These policies can 

## Conclusion

With the right datasets and the right tools, even 0.5B models can generate very good answers. Remember that the base model's weights stayed frozen during the whole process: we only trained adapters amounting to about 0.17% of the parameter count!

I hope you found this small introduction interesting!
