# Initial fine-tuning with a 0.5B parameter model

My goal is to fine-tune meta-llama/Meta-Llama-3-8B on timdettmers/openassistant-guanaco.  However, according to 
https://www.reddit.com/r/LocalLLaMA/s/nRlinxXZgp, I need 160GiB of VRAM for a 7B model (roughly -- that table comes from
some particular fine-tuning stack, and mine might be more or less memory-hungry).  So I'd need at least that, if not more, for an 8B one.  
That would require renting a machine, or machines.  So can I at least do initial experiments without that?

The 7B:160GiB figure works out to roughly 22.9GiB of VRAM per billion parameters.  Assuming that scales linearly, I should be
able to fine-tune a 1B model on my 24GiB card, *just*.
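
As a rough sanity check on that arithmetic, here is a small sketch that scales the 160GiB-for-7B figure linearly with parameter count. The linear scaling is an assumption (it ignores activation memory, batch size and sequence length), and the function name is made up for illustration.

```python
# Rough linear extrapolation from the "7B needs ~160 GiB" figure quoted above.
# Assumption: VRAM scales linearly with parameter count, which ignores
# activations, batch size and sequence length.
REFERENCE_PARAMS_B = 7.0
REFERENCE_VRAM_GIB = 160.0

GIB_PER_BILLION = REFERENCE_VRAM_GIB / REFERENCE_PARAMS_B


def estimated_vram_gib(params_billions):
    """Estimate fine-tuning VRAM in GiB for a model of the given size."""
    return params_billions * GIB_PER_BILLION


for size in (0.5, 1.0, 8.0):
    print(f"{size}B parameters: ~{estimated_vram_gib(size):.1f} GiB")
```

Under that assumption a 1B model lands just under the 24GiB card's capacity, and a 0.5B model needs about half of it.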

Let's see how far we can get.

## The dataset

First, let's look at the data.

In [1]:
dataset_source = "timdettmers/openassistant-guanaco"

In [2]:
from datasets import load_dataset

dataset = load_dataset(dataset_source)

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

So we have a train and a test dataset.  Sounds promising.

In [4]:
dataset["train"]

Dataset({
    features: ['text'],
    num_rows: 9846
})

In [5]:
dataset["train"][:10]

{'text': ['### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining p

In [6]:
print(dataset["train"][1]["text"])

### Human: ¿CUales son las etapas del desarrollo y en qué consisten según Piaget?### Assistant: Jean Piaget fue un psicólogo suizo que propuso una teoría sobre el desarrollo cognitivo humano que consta de cuatro etapas:

Etapa sensoriomotora (0-2 años): Durante esta etapa, el niño aprende a través de sus sentidos y movimientos. Descubre que sus acciones pueden tener un impacto en el entorno y comienza a formarse una idea básica de objetividad y continuidad.

Etapa preoperatoria (2-7 años): En esta etapa, el niño comienza a desarrollar un pensamiento simbólico y a comprender que las cosas pueden representar a otras cosas. También comienzan a desarrollar un pensamiento lógico y a comprender conceptos como la causa y el efecto.

Etapa de operaciones concretas (7-12 años): Durante esta etapa, el niño desarrolla un pensamiento lógico y comprende las relaciones causales. Empiezan a comprender que las cosas pueden tener múltiples perspectivas y que los conceptos pueden ser más complejos de lo

OK, so we have a number of strings; each one starts with `"### Human: "` and then has a question from the human.  This is followed immediately (no newline) by `"### Assistant: "` and an answer.  In some cases, the human then asks a follow-on question, but on a quick scan, none of the training lines has an answer to it.
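
To go beyond a quick scan, the format described above can be split into turns mechanically. This is a minimal sketch, assuming `### Human:` and `### Assistant:` only ever appear as turn markers; the helper names `split_turns` and `ends_with_human_turn` are hypothetical.

```python
import re

# Split a Guanaco-style string into (speaker, text) turns.
# Assumption: "### Human:" and "### Assistant:" only appear as turn markers.
TURN_MARKER = re.compile(r"### (Human|Assistant):")


def split_turns(text):
    parts = TURN_MARKER.split(text)
    # parts[0] is whatever precedes the first marker (normally empty);
    # after that, speaker names and their text alternate.
    speakers = parts[1::2]
    bodies = parts[2::2]
    return [(speaker, body.strip()) for speaker, body in zip(speakers, bodies)]


def ends_with_human_turn(text):
    """True if the last turn in the string is an unanswered human message."""
    turns = split_turns(text)
    return bool(turns) and turns[-1][0] == "Human"


sample = "### Human: Hi there### Assistant: Hello!### Human: And a follow-up?"
print(split_turns(sample))
print(ends_with_human_turn(sample))
```

With the dataset loaded, something like `sum(ends_with_human_turn(t) for t in dataset["train"]["text"])` would count how many training rows end on a human turn.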

## The model

It took a while to find a model with <= 1B parameters, but I got there in the end.

In [7]:
# This is a 0.5B model so should certainly be trainable on my GPU.
base_model = "Qwen/Qwen1.5-0.5B"

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="cuda")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


So now let's ask it a question, using the format we saw earlier.

In [9]:
from transformers import pipeline

def ask_question(model, prompt):
    pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
    result = pipe(f"### Human: {prompt} ### Assistant: ")
    print(result[0]['generated_text'])
    
ask_question(model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### Human: Who is Leonardo Da Vinci? ### Assistant:  Leonardo da Vinci was a famous Italian artist, inventor, scientist, and engineer. He was born in 1452 and died in 1519. He is best known for his paintings, sculptures, and inventions, including the Mona Lisa, the flying buttress, and the Great Wheel. Da Vinci was also a scientist, inventor, and engineer, and he made many contributions to the field of engineering and architecture. He is considered one of the greatest artists of all time and is often referred to as the "Father of the Renaissance." ### Question: What is Leonardo da Vinci's greatest contribution to the field of engineering and architecture? ### Answer: Leonardo da Vinci's greatest contribution to the field of engineering and architecture was his invention of the flying buttress. He designed a structure that could support a large building and allowed it to be built without the need for scaffolding. This invention was revolutionary and helped to make it possible for people

Not bad for a base model, but it does try to continue the conversation, which is a reasonable thing for it to do given that it's not really been trained to understand that this is simple question/answer completion.
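
If only the first reply is wanted from a base model that keeps going, one option is to cut the generated text at the next `###` marker. This is a sketch of that post-processing step, not something the notebook does; `first_answer_only` is a hypothetical helper, and it would wrongly truncate an answer that itself contains `###` (for example a markdown heading).

```python
def first_answer_only(generated_text, answer_marker="### Assistant:", turn_prefix="###"):
    """Return just the first assistant reply from a generated transcript."""
    _, found, after = generated_text.partition(answer_marker)
    if not found:
        # No assistant marker at all: return the text unchanged (trimmed).
        return generated_text.strip()
    # Everything up to the next "###" is treated as the first reply.
    answer, _, _ = after.partition(turn_prefix)
    return answer.strip()


example = (
    "### Human: Who is Leonardo Da Vinci? ### Assistant:  An Italian artist. "
    "### Question: What else? ### Answer: Many things."
)
print(first_answer_only(example))
```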

So, let's copy some stuff from Jeremy Howard's course and see if we can train this thing.

In [10]:
from transformers import TrainingArguments, Trainer

# Batch size determined via experiment; this *just* fits in memory.
batch_size = 7
args = TrainingArguments(
    'outputs', 
    learning_rate=8e-5, 
    warmup_ratio=0.1, 
    lr_scheduler_type='cosine', 
    fp16=True,
    evaluation_strategy="epoch", 
    per_device_train_batch_size=batch_size, 
    per_device_eval_batch_size=batch_size * 2,
    num_train_epochs=3, 
    weight_decay=0.01, 
    report_to='none'
)

In [11]:
# Although we pass the tokenizer in to the trainer below, we need to tokenize our dataset first.  We also provide "labels",
# which is just a copy of the input IDs: causal LM models in transformers shift the labels by one token internally when
# computing the loss, so each position is scored on how well it predicts the next token.
def tokenize_function(examples):
    # Tokenize the input text, truncating or padding every example to exactly 512 tokens
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    # Labels are an unshifted copy of the input IDs; the model does the shift itself
    tokenized["labels"] = tokenized["input_ids"][:]
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True)
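
The labels-as-inputs idea can be illustrated on a toy sequence. The sketch below shows how the shift pairs each token with its successor, and how padded positions could be excluded by setting their label to -100, the value PyTorch's cross-entropy loss ignores by default. The masking is a variation and is not what the cell above does (it leaves padding tokens in the labels); both helper names are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def next_token_pairs(input_ids, labels):
    """Pair each input position with the label it is trained to predict."""
    return [(inp, tgt) for inp, tgt in zip(input_ids[:-1], labels[1:])
            if tgt != IGNORE_INDEX]


# Toy example: three real tokens followed by two padding tokens (id 0).
ids = [11, 12, 13, 0, 0]
mask = [1, 1, 1, 0, 0]
labels = mask_padding(ids, mask)
print(labels)
print(next_token_pairs(ids, labels))
```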

In [12]:
trainer = Trainer(
    model, args, 
    train_dataset=tokenized_datasets['train'], 
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
)

In [13]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.5066,1.492517
2,0.9999,1.416387
3,0.4889,1.530094


TrainOutput(global_step=4221, training_loss=1.0152733917344776, metrics={'train_runtime': 1579.0872, 'train_samples_per_second': 18.706, 'train_steps_per_second': 2.673, 'total_flos': 2.798491918978253e+16, 'train_loss': 1.0152733917344776, 'epoch': 3.0})
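
The numbers in that `TrainOutput` can be cross-checked against the dataset size and batch size. This is a small worked calculation, assuming a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

train_rows = 9846            # rows in dataset["train"]
batch_size = 7               # per_device_train_batch_size
epochs = 3                   # num_train_epochs
train_runtime_s = 1579.0872  # 'train_runtime' from TrainOutput

# Each epoch needs one optimizer step per batch, including the last partial one.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime_s

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

Under those assumptions the totals agree with the reported `global_step=4221` and `train_samples_per_second` of 18.706.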

There does not appear to be any way to freeze results from training, so here are some notes:

* Training loss drops rapidly, but validation loss tends to increase from the second epoch onwards, if not earlier -- sometimes it increases monotonically over every epoch in the training.
* Checkpoints use up a *lot* of disk; running the above with 64 epochs crapped out at epoch 35 because it ran out of disk space -- that was 550GiB of disk used!
* By that stage, training loss had dropped from 1.179400 to 0.031300 and validation loss had risen from 1.242308 to 3.100065.
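
Both of those problems (disk usage and rising validation loss) could plausibly be eased through checkpoint settings. The sketch below lists `TrainingArguments` options that, to the best of my knowledge, cap the number of checkpoints kept and reload the one with the lowest validation loss. They were not used in the run above, and they are held in a plain dict so the snippet runs without transformers installed. `best_epoch` is a hypothetical helper.

```python
# Hypothetical extra keyword arguments for TrainingArguments (not used in the run above).
checkpoint_kwargs = {
    "save_strategy": "epoch",         # save once per epoch, matching the evaluation schedule
    "save_total_limit": 2,            # delete older checkpoints beyond this count
    "load_best_model_at_end": True,   # reload the best checkpoint when training ends
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,       # lower validation loss is better
}


def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return lowest + 1


# Validation losses from the three-epoch run above.
print(best_epoch([1.492517, 1.416387, 1.530094]))
```

The dict would be passed along with the existing arguments, for example `TrainingArguments('outputs', ..., **checkpoint_kwargs)`.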


In [14]:
ask_question(model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### Human: Who is Leonardo Da Vinci? ### Assistant: 1. Leonardo da Vinci (1452-1519) was an Italian scientist, artist, and inventor who is widely regarded as one of the greatest inventors and scientists in history. He is best known for his work in the field of anatomy, including the discovery of the human body's symmetry, the study of flight, and the development of the printing press.

2. Da Vinci was a renowned artist and designer, and his works continue to inspire and influence artists and designers today. He is also known for his contributions to the fields of science, including biology, physics, and engineering.

3. Da Vinci was a renowned scientist and mathematician, and he made significant contributions to the fields of mathematics, physics, and engineering. He is also known for his work in the field of anatomy, including the study of the human body's symmetry.

4. Da Vinci was a renowned scientist and physicist, and he made significant contributions to the fields of physics, mat

In [15]:
ask_question(trainer.model, "Who is Leonardo Da Vinci?")

Both `max_new_tokens` (=2048) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


### Human: Who is Leonardo Da Vinci? ### Assistant: 1. Leonardo da Vinci (1452-1519) was an Italian scientist, artist, and inventor who is widely regarded as one of the greatest inventors and scientists in history. He is best known for his work in the field of anatomy, including the discovery of the human body's symmetry, the study of flight, and the development of the printing press.

2. Da Vinci was a renowned artist and designer, and his works continue to inspire and influence artists and designers today. He is also known for his contributions to the fields of science, including biology, physics, and engineering.

3. Da Vinci was a renowned scientist and mathematician, and he made significant contributions to the fields of mathematics, physics, and engineering. He is also known for his work in the field of anatomy, including the study of the human body's symmetry.

4. Da Vinci was a renowned scientist and physicist, and he made significant contributions to the fields of physics, mat

Note that the two commands above return the same output: `model` and `trainer.model` are the same object, because the model was trained in place, which makes sense.  Still, good solid answers in terms of structure and layout; the content has plenty of hallucinations but for a 0.5B model, I think it's actually pretty good!
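
The in-place behaviour can be shown with toy stand-ins. In this sketch `ToyModel` and `ToyTrainer` are made-up classes, not the transformers ones; the point is only that a trainer holding a reference mutates the original object, so a before/after comparison needs an independent copy taken before training.

```python
import copy


class ToyModel:
    """Stand-in for a real model: a single mutable weight."""

    def __init__(self, weight):
        self.weight = weight


class ToyTrainer:
    """Stand-in for a trainer: keeps a reference to the model, not a copy."""

    def __init__(self, model):
        self.model = model

    def train(self):
        self.model.weight += 1.0  # pretend gradient update, applied in place


model = ToyModel(weight=0.0)
snapshot = copy.deepcopy(model)  # independent copy taken before training
trainer = ToyTrainer(model)
trainer.train()

print(trainer.model is model)         # the trainer's model is the original object
print(model.weight, snapshot.weight)  # trained weight vs. untouched copy
```

For a real model a deep copy doubles the memory used, so reloading the base model with `from_pretrained` for the comparison may be the more practical route.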