The goal of this part is to use the previously generated dataset to train an LLM. Since the dataset contains conversations about the acronyms and their definitions, the LLM should memorize them. 
This notebook is the most technical one of the project. Start by running it with the default parameters, and then tweak them one by one. 
Use a small dataset so that you can iterate quickly between trials.


### Infrastructure config

In [None]:
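import torch
from datetime import datetime
import os

# pick the best available device: MPS (Apple silicon), then CUDA (NVIDIA GPU), then CPU as a fallback
if torch.backends.mps.is_available():
    device: torch.device = torch.device("mps")
elif torch.cuda.is_available():
    device: torch.device = torch.device("cuda")
else:
    device: torch.device = torch.device("cpu")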
date = datetime.now().strftime("%m_%d_%Y-%Hh_%Mmin")
output_dir = f"../bucket/fine_tuning_acronym/sessions/results_{date}/model"
train_dataset_dir = "../bucket/fine_tuning_acronym/data/train_dataset.json"

os.makedirs(output_dir, exist_ok=True) # creates the directory only if it does not exist yet

print(f"""
    Date : {date},
    Device : {device}, (whether the model is loaded on CPU, GPU or MPS for apple silicon chip)
    Loading data from : {train_dataset_dir},
    Saving models to : {output_dir}
""")


### Training config

Here, we set the main variables for training:

- __model_name__: the name of the model to train. The default is 'microsoft/Phi-3-mini-4k-instruct', a small model that is fast to train and light to load.

- __torch_dtype__: the data type of the model weights, i.e. the numeric format in which every parameter is stored (it determines the total size of the model in memory). The default is bfloat16, which is widely used for training.

- __n_epochs__: one epoch is one full pass over the training dataset, so every element contributes to the parameter updates once. With several epochs, the model sees each element several times and is conditioned more and more strongly on the information in the dataset. Start with the default value, then experiment with fewer or more epochs.

- __learning_rate__: a positive number that controls how large each parameter update is (a smaller value means slower learning). It needs to be adjusted together with the number of epochs, but once again you can start with the default and experiment later with larger or smaller values.
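
To make the effect of __torch_dtype__ on model size concrete, here is a minimal sketch in plain Python. The function name `weights_size_gb` and the parameter count are assumptions made for illustration (they are not part of the notebook), and the estimate covers the weights only, not gradients, optimizer state or activations.

```python
# Bytes used to store one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_size_gb(n_params: int, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical parameter count, chosen only to show the order of magnitude.
n_params = 3_800_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weights_size_gb(n_params, dtype):.1f} GB")
```

Halving the number of bytes per parameter halves the memory needed to load the weights, which is one reason 16-bit formats are a common default for fine-tuning.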

In [None]:
import torch

# model_name: str = "meta-llama/Llama-3.2-1B-Instruct" # ⚠️ requires Hugging Face auth
# model_name: str = "Qwen/Qwen3-0.6B" # does not require Hugging Face auth
model_name: str = "microsoft/Phi-3-mini-4k-instruct" # does not require Hugging Face auth, but training is much less efficient

torch_dtype: torch.dtype = torch.bfloat16
max_new_tokens: int = 100 # max number of tokens generated when the model is used through the Hugging Face pipeline
data_prop = 1 # proportion of data to be used for training
n_epochs = 5
learning_rate = 3e-3 

print(f"""
    Pre-trained model : {model_name},
    Dtype of model weights : {torch_dtype},
    Number of epochs : {n_epochs},
    Learning rate : {learning_rate}.
""")

## 1 - Load model and tokenizer

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the generative model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_name) # a tokenizer has no weights, so it needs neither dtype nor device
tokenizer.pad_token = tokenizer.eos_token # reuse the end-of-sequence token for padding, otherwise padding raises an error

In [None]:
token_ids = tokenizer.encode("Hello !", return_tensors="pt").to(device) # returns the token IDs as a tensor

In [None]:
generated_tokens = model.generate(token_ids, max_new_tokens=5)
print(generated_tokens)

In [None]:
# use tokenizer.decode to decode the generated tokens

### Sandbox / Exercises

In [None]:
# sandbox (do whatever you want)
# example :
#   - use the tokenizer to encode a sentence, and then decode it
#   - use the model to generate the next token of a sentence, encoded with the tokenizer
#   - produce the pie chart of next token probability for a given sentence
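
As a starting point for the next-token probability exercise, here is a minimal sketch of the softmax step in plain Python, under the assumption that you already have one score (logit) per vocabulary token. The logits below are made up; in the notebook they would typically come from the last position of `model(token_ids).logits`. The helper names `softmax` and `top_k` are illustrative, not part of the notebook.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k most likely (token_id, probability) pairs, most likely first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for a toy vocabulary of 4 tokens.
toy_logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(toy_logits)

for token_id, p in top_k(probs, 3):
    print(f"token {token_id}: {p:.3f}")
```

The pairs returned by `top_k` can be passed to a plotting library to draw the pie chart, after decoding each token ID with the tokenizer.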



## 2 - Load the training dataset into a Hugging Face Dataset

In [None]:
import json
from datasets import Dataset
import random

with open(train_dataset_dir, "rt") as f:
    train_dataset = json.load(f)

train_dataset = train_dataset[:int(data_prop*len(train_dataset))]
print(f"Number of acronyms : {len(train_dataset)}")


all_convs = []
for each_acro in train_dataset:
    for each_conv in each_acro["conversation"]:
        all_convs.append(each_conv)



tokenized_conversations = tokenizer.apply_chat_template(
    conversation=all_convs,
    return_tensors="pt",
    return_dict=True,
    truncation=True,
    padding=True,
    max_length=256,
)

tokenized_conversations["labels"] = tokenized_conversations["input_ids"].clone() # causal LM: the labels are a copy of the inputs

conv_idx_for_test: int = random.randint(0, len(train_dataset)-1) # pick one acronym entry for testing (it is still part of the training data)
test_conv = train_dataset[conv_idx_for_test]


train_dataset = Dataset.from_dict(tokenized_conversations)

print(f"Example of conversation : {test_conv}")
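
In the cell above, the labels are a plain copy of the input IDs, so padded positions are labelled too. A common refinement, shown here only as a hedged sketch on toy lists, is to replace the labels of padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The function name `mask_padding` and the toy IDs are assumptions made for illustration; whether the trainer rebuilds the labels itself depends on the library version, so treat this as a way to understand the idea rather than a required step.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, but ignore positions where attention_mask is 0."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy example: 4 real tokens followed by 2 padding tokens.
# The padding token shares its ID (2) with the end-of-sequence token, as in this notebook.
input_ids = [1, 42, 7, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)
```

Masking with the attention mask rather than with the token ID matters here: since the padding token is the end-of-sequence token, filtering on the ID would also hide the real end of each conversation.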

In [None]:
# view on dataset
train_dataset

### Sandbox / Exercises

In [None]:
# sandbox (do whatever you want),
# example :
#  - print an element of the training dataset; 
#  - show token id's and try to decode them using tokenizer.decode() method 
#  - see the special tokens of the tokenizer
#  - use the model and the tokenizer to complete a sentence of the dataset



## 3 - Training

We use the LoRA (Low-Rank Adaptation) training method to speed up training. See [https://huggingface.co/learn/llm-course/chapter11/4](https://huggingface.co/learn/llm-course/chapter11/4) for more details.
You do not need to understand the method for basic usage of the notebook, but it is worth taking the time to understand it :-)
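
To give an intuition of why LoRA is faster, here is a minimal sketch that counts trainable parameters for a single linear layer. LoRA freezes the original weight matrix W (of shape d_out x d_in) and learns two small matrices, B (d_out x r) and A (r x d_in), whose product is added to W with a scaling factor of lora_alpha / r. The layer size below is hypothetical, and the function names are illustrative only.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in the frozen matrix W (d_out x d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Number of trainable weights in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the product B @ A before it is added to W."""
    return lora_alpha / r


# Hypothetical square layer; r and lora_alpha match the values used in this notebook.
d_in = d_out = 4096
r, lora_alpha = 16, 16

ratio = lora_params(d_in, d_out, r) / full_params(d_in, d_out)
print(f"Full layer     : {full_params(d_in, d_out):,} weights")
print(f"LoRA adapters  : {lora_params(d_in, d_out, r):,} trainable weights")
print(f"Trainable share: {ratio:.2%}")
print(f"Scaling factor : {lora_scaling(lora_alpha, r)}")
```

With a small rank r, only a tiny fraction of the weights receives gradients, which reduces memory use and training time.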

In [None]:
from peft import LoraConfig, get_peft_model

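peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    # Module names must match the layer names of the chosen model. Models with separate projections
    # (e.g. Llama, Qwen) match every name below; Phi-3 fuses some of them (qkv_proj, gate_up_proj),
    # so only o_proj and down_proj receive adapters there.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
    bias="lora_only",
    modules_to_save=["decode_head"]
)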

lora_model = get_peft_model(model, peft_config)

In [None]:
from trl import SFTConfig, SFTTrainer

# Initialize trainer
training_args = SFTConfig(
    output_dir=output_dir,
    # max_steps=100,
    num_train_epochs=n_epochs,
    learning_rate=learning_rate,
    per_device_train_batch_size=1, # with a batch size greater than 1, each weight update uses the gradient averaged over
    # the whole batch, so a single element of the dataset has less influence on each update.
    # For our memorization use case, a batch size of 1 works better: https://discuss.pytorch.org/t/how-sgd-works-in-pytorch/8060
    logging_steps=10, # step vs batch: https://discuss.huggingface.co/t/what-is-steps-in-trainingarguments/17695
    # one step = one weight update computed from one batch: https://discuss.huggingface.co/t/what-is-the-meaning-of-steps-parameters/56411
    # warmup_ratio=.0,
    # save_steps=100,
    bf16=True,
    # eval_strategy="steps",
    # eval_steps=50,
)

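trainer = SFTTrainer(
    model=lora_model, # already wrapped by get_peft_model, so peft_config is not passed again
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset, # only used if an eval_strategy is set in training_args
)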

# ft_model_pipeline = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer, max_new_tokens=max_new_tokens)

# cust_callback = CustomCallback(raw_model_pipeline=raw_model_pipeline, ft_model_pipeline=ft_model_pipeline, test_conv=test_conv["conversation"][0])
# trainer.add_callback(cust_callback)
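
To relate steps, batches and epochs, here is a minimal sketch that estimates the number of weight updates of a training run. It assumes a single device, and the exact count reported by the trainer may differ slightly depending on the library version. The function name `total_training_steps` and the dataset size of 200 conversations are hypothetical.

```python
import math


def total_training_steps(n_examples: int, batch_size: int, n_epochs: int,
                         grad_accum: int = 1) -> int:
    """Estimate the number of weight updates (steps) for a full training run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    # With gradient accumulation, several batches are combined into one update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * n_epochs


# Hypothetical dataset of 200 tokenized conversations, with this notebook's defaults.
steps = total_training_steps(n_examples=200, batch_size=1, n_epochs=5)
print(f"Weight updates : {steps}")
print(f"Log lines      : {steps // 10}")  # one log line every logging_steps=10 steps

# The same data with a batch size of 8 gives far fewer, but averaged, updates.
print(f"With batch of 8: {total_training_steps(200, 8, 5)}")
```

This is why a batch size of 1 gives every conversation its own update, at the cost of more steps per epoch.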

In [None]:
trainer.train()

## 4 - Hot evaluation

We try the model right after training to get a quick, limited overview of its performance. See [03-test](../03-test/) for more detailed notebooks.

In [None]:
model.eval() # eval mode: disables training-only behaviour such as dropout (it does not turn off gradient computation by itself)

In [None]:
from transformers import pipeline
ft_model_pipeline = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer, max_new_tokens=max_new_tokens, do_sample=True)

In [None]:
def q_a(question, max_tokens: int = max_new_tokens):
    return ft_model_pipeline([{
        "role": "user",
        "content": question
    }], max_new_tokens=max_tokens)[0]["generated_text"][1]["content"]

In [None]:
for i in range(5):
    print(q_a("What is HPS ?")) 
    print("--------\n")

In [None]:
q_a("What is HPS in the field of astronomy ?") # small check for overfitting

In [None]:
# trainer.save_model(os.path.join(output_dir, "final_model")) # optional, saves the model to a specific directory

### Sandbox / Exercises

In [None]:
# sandbox (do whatever you want) :
#      - restart the training with a higher learning rate, or smaller - on the same n_epochs
#            -> compare the answers between models
#      - increase the number of epochs, until the model overfits the dataset