# Second fine-tuning with a 0.5B parameter model

My goal is to fine-tune meta-llama/Meta-Llama-3-8B on timdettmers/openassistant-guanaco.  However, according to 
https://www.reddit.com/r/LocalLLaMA/s/nRlinxXZgp, I've already fine-tuned a 0.5B model with that dataset, but I found
that the model very easily overfit.  Random thought: maybe the format of the data, being very much like markdown, is too
easy to learn?  Let's see what happens if we change the format to match the old Llama-2 instruction format.  According to 
[this reddit post](https://www.reddit.com/r/LocalLLaMA/comments/155po2p/get_llama_2_prompt_format_right/) it's this:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```

They also give the default system prompt, so the whole thing would look like this:

```
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden üò± What should I do? [/INST]
```

So let's pre-process the dataset to get that :-)

## The dataset


In [1]:
dataset_source = "timdettmers/openassistant-guanaco"

In [2]:
from datasets import load_dataset

dataset = load_dataset(dataset_source)

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
dataset["train"]

Dataset({
    features: ['text'],
    num_rows: 9846
})

In [4]:
for row in dataset["train"]:
    print(row)

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [5]:
dataset["train"][:10]

{'text': ['### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining p

In [6]:
print(dataset["train"][1]["text"])

### Human: ¬øCUales son las etapas del desarrollo y en qu√© consisten seg√∫n Piaget?### Assistant: Jean Piaget fue un psic√≥logo suizo que propuso una teor√≠a sobre el desarrollo cognitivo humano que consta de cuatro etapas:

Etapa sensoriomotora (0-2 a√±os): Durante esta etapa, el ni√±o aprende a trav√©s de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea b√°sica de objetividad y continuidad.

Etapa preoperatoria (2-7 a√±os): En esta etapa, el ni√±o comienza a desarrollar un pensamiento simb√≥lico y a comprender que las cosas pueden representar a otras cosas. Tambi√©n comienzan a desarrollar un pensamiento l√≥gico y a comprender conceptos como la causa y el efecto.

Etapa de operaciones concretas (7-12 a√±os): Durante esta etapa, el ni√±o desarrolla un pensamiento l√≥gico y comprende las relaciones causales. Empiezan a comprender que las cosas pueden tener m√∫ltiples perspectivas y que los conceptos pueden ser m

Right, so we should just be able to map a regex over that.  Let's give it a whirl.

In [7]:
import re

prompt_template = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

{question} [/INST]
{response}
"""

pattern = r"### Human: (.*?)### Assistant: (.*)"

def rewrite_prompts(examples):
    questions = []
    responses = []
    # Iterate over each example
    for text in examples["text"]:
        match = re.search(pattern, text, re.DOTALL)
        if match:
            question = match.group(1).strip()
            response = match.group(2).strip()
            reformatted_text = prompt_template.format(question=question, response=response)
            responses.append(reformatted_text)
        else:
            # You might want to handle errors differently
            responses.append("Error: Did not match expected pattern.")
    return {"reformatted_text": responses}

# Apply the function to your dataset
reformatted_dataset = dataset.map(rewrite_prompts, batched=True)

In [8]:
print(reformatted_dataset["train"][1]["text"])

### Human: ¬øCUales son las etapas del desarrollo y en qu√© consisten seg√∫n Piaget?### Assistant: Jean Piaget fue un psic√≥logo suizo que propuso una teor√≠a sobre el desarrollo cognitivo humano que consta de cuatro etapas:

Etapa sensoriomotora (0-2 a√±os): Durante esta etapa, el ni√±o aprende a trav√©s de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea b√°sica de objetividad y continuidad.

Etapa preoperatoria (2-7 a√±os): En esta etapa, el ni√±o comienza a desarrollar un pensamiento simb√≥lico y a comprender que las cosas pueden representar a otras cosas. Tambi√©n comienzan a desarrollar un pensamiento l√≥gico y a comprender conceptos como la causa y el efecto.

Etapa de operaciones concretas (7-12 a√±os): Durante esta etapa, el ni√±o desarrolla un pensamiento l√≥gico y comprende las relaciones causales. Empiezan a comprender que las cosas pueden tener m√∫ltiples perspectivas y que los conceptos pueden ser m

In [9]:
print(reformatted_dataset["train"][1]["reformatted_text"])


<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

¬øCUales son las etapas del desarrollo y en qu√© consisten seg√∫n Piaget? [/INST]
Jean Piaget fue un psic√≥logo suizo que propuso una teor√≠a sobre el desarrollo cognitivo humano que consta de cuatro etapas:

Etapa sensoriomotora (0-2 a√±os): Durante esta etapa, el ni√±o aprende a trav√©s de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea b√°sica de objetividad y continuidad.

Etapa p

Looks good!  Let's see if there were any broken inputs.

In [10]:
for row in reformatted_dataset["train"]:
    if row["reformatted_text"] == "Error: Did not match expected pattern.":
        print(row["text"])

Awesome!

## The model

It took a while to find a model <= 1B parameters but I got there in the end

In [11]:
# This is a 0.5B model so should certainly be trainable on my GPU.
base_model = "Qwen/Qwen1.5-0.5B"

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


So now let's ask it a question, using the format we saw earlier

In [15]:
from transformers import pipeline

def ask_question(model, question):
    pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
    prompt = prompt_template.format(question=question, response="")
    result = pipe(prompt)
    print(result[0]['generated_text'])
    
ask_question(model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Who is Leonardo Da Vinci? [/INST]

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST>

<</INST

It didn't have a clue what to do with that!  Which makes a lot of sense.  Let's see if we can train it to understand.

In [16]:
from transformers import TrainingArguments,Trainer

# Batch size determined via experiment; this *just* fits in memory.
batch_size = 7
args = TrainingArguments(
    'outputs', 
    learning_rate=8e-5, 
    warmup_ratio=0.1, 
    lr_scheduler_type='cosine', 
    fp16=True,
    evaluation_strategy="epoch", 
    per_device_train_batch_size=batch_size, 
    per_device_eval_batch_size=batch_size * 2,
    num_train_epochs=3, 
    weight_decay=0.01, 
    report_to='none'
)

In [18]:

def tokenize_function(examples):
    tokenized = tokenizer(examples["reformatted_text"], truncation=True, padding="max_length", max_length=512)
    tokenized["labels"] = tokenized["input_ids"][:]
    return tokenized

tokenized_dataset = reformatted_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

In [19]:
trainer = Trainer(
    model, args, 
    train_dataset=tokenized_dataset['train'], 
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
)

In [20]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.3813,1.356417
2,0.9062,1.293857
3,0.4294,1.401652


TrainOutput(global_step=4221, training_loss=0.9144961933930376, metrics={'train_runtime': 1586.4771, 'train_samples_per_second': 18.619, 'train_steps_per_second': 2.661, 'total_flos': 2.798491918978253e+16, 'train_loss': 0.9144961933930376, 'epoch': 3.0})

Similar results to last time; let's see what we get as a result


In [21]:
ask_question(model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Who is Leonardo Da Vinci? [/INST]

Leonardo da Vinci (1452-1519) was an Italian scientist, artist, and inventor who made significant contributions to the fields of science, art, and design. He is best known for his work in the field of anatomy, which he helped to develop and refine, and for his contributions to the study of flight and flight patterns.

In addition to his work in anatomy, da Vinci was also known for his contributions to the fields of mathemat

In [15]:
ask_question(trainer.model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### Human: Who is Leonardo Da Vinci? ### Assistant: 1. Leonardo da Vinci (1452-1519) was an Italian scientist, artist, and inventor who is widely regarded as one of the greatest inventors and scientists in history. He is best known for his work in the field of anatomy, including the discovery of the human body's symmetry, the study of flight, and the development of the printing press.

2. Da Vinci was a renowned artist and designer, and his works continue to inspire and influence artists and designers today. He is also known for his contributions to the fields of science, including biology, physics, and engineering.

3. Da Vinci was a renowned scientist and mathematician, and he made significant contributions to the fields of mathematics, physics, and engineering. He is also known for his work in the field of anatomy, including the study of the human body's symmetry.

4. Da Vinci was a renowned scientist and physicist, and he made significant contributions to the fields of physics, mat

Note that the two commands above return the same output.  Looks like the model was trained in-place, which makes sense.  Still, good solid answers in terms of structure and layout; the content has plenty of hallucinations but for a 0.5B model, I think it's actually pretty good!