The goal of this part is to use the previously generated dataset to train an LLM. Since the dataset contains conversations about the acronyms and their definitions, the LLM should memorize them. 
This notebook is the most technical one of the project. Start by running it with the default parameters, then tweak them one by one. 
Use a small dataset so that you can iterate quickly between trials.


### Infrastructure config

In [None]:
import torch
from datetime import datetime
import os
import platform

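# pick the best available device: MPS on Apple silicon, CUDA on NVIDIA GPUs, CPU otherwise
if platform.system() == "Darwin" and torch.backends.mps.is_available():
    device: torch.device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")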
date = datetime.now().strftime("%m_%d_%Y-%Hh_%Mmin")
output_dir = f"../bucket/fine-tuning-acronym/sessions/results_{date}/model"
train_dataset_dir = "../bucket/fine-tuning-acronym/data/train_dataset.json"

os.makedirs(output_dir, exist_ok=True)  # create the output directory if it does not exist yet

print(f"""
    Date : {date},
    Device : {device}, (whether the model is loaded on CPU, GPU or MPS for apple silicon chip)
    Loading data from : {train_dataset_dir},
    Saving models to : {output_dir}
""")



    Date : 09_02_2025-14h_17min,
    Device : mps, (whether the model is loaded on CPU, GPU or MPS for apple silicon chip)
    Loading data from : ../bucket/fine-tuning-acronym/data/train_dataset.json,
    Saving models to : ../bucket/fine-tuning-acronym/sessions/results_09_02_2025-14h_17min/model



### Training config

Here, we set the main variables for the training:

- __model_name__ : the name of the model that we will train. By default, it is 'microsoft/Phi-3-mini-4k-instruct'. It is a small model, hence fast to train and light to load.

- __torch_dtype__ : the dtype of the model, i.e. the numeric format in which all the parameters of the model are stored (it determines the total size of the model in memory). By default, it is bfloat16, which is widely used for training.

- __n_epochs__ : one epoch is one full pass over the training dataset, during which the model's parameters are updated using every element of the dataset. With several epochs, the model sees each element several times, so it is conditioned more and more strongly on the information in the dataset. You can start with the default value, and experiment with fewer or more epochs afterwards.

- __learning_rate__ : a positive number that controls how large each parameter update is, and therefore how fast the model learns (smaller means slower training). It needs to be adjusted alongside the number of epochs; once again, you can start with the default value and experiment later with larger or smaller ones.
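
To make the effect of `torch_dtype` concrete, here is a rough sketch of how the dtype changes the memory needed just to hold the weights. The parameter count (about 3.8 billion for Phi-3-mini) is an approximation, and the helper name `weights_size_gb` is hypothetical. Activations, gradients and optimizer state are not counted.

```python
# Bytes used to store one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_size_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 3.8e9  # assumption: approximate parameter count of Phi-3-mini
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weights_size_gb(n_params, dtype):.1f} GB")
```

Under these assumptions, bfloat16 halves the footprint compared to float32, which is one reason it is a convenient default on a laptop.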

In [2]:
import torch

# model_name: str = "meta-llama/Llama-3.2-1B-Instruct" # ⚠️ requires hugging face auth
# model_name: str = "Qwen/Qwen3-0.6B" # does not require hugging face auth
model_name: str = "microsoft/Phi-3-mini-4k-instruct" # does not require hugging face auth, but training is much less efficient

torch_dtype: torch.dtype = torch.bfloat16
max_new_tokens: int = 100 # max number of tokens generated when the model is used through the hugging face pipeline
data_prop = 1 # proportion of data to be used for training
n_epochs = 5
learning_rate = 3e-5 # 0.00003 by default

print(f"""
    Pre-trained model : {model_name},
    Dtype of model weights : {torch_dtype},
    Number of epochs : {n_epochs},
    Learning rate : {learning_rate}.
""")


    Pre-trained model : microsoft/Phi-3-mini-4k-instruct,
    Dtype of model weights : torch.bfloat16,
    Number of epochs : 5,
    Learning rate : 3e-05.



## 1 - Load model and tokenizer

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# loads generative model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_name) # the tokenizer holds no weights, so it needs neither a dtype nor a device
tokenizer.pad_token = tokenizer.eos_token # use the end-of-sequence token as padding token, otherwise padding raises an error

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
token_ids = tokenizer.encode("Hello !", return_tensors="pt").to(device) # returns the token ids as a tensor on the chosen device

In [5]:
generated_tokens = model.generate(token_ids, max_new_tokens=5)
print(generated_tokens)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


tensor([[15043,  1738,   306, 29915, 29885,  1985,   373]], device='mps:0')


In [6]:
# use tokenizer.decode to decode the generated tokens

### Sandbox / Exercises

In [7]:
# sandbox (do whatever you want)
# example :
#   - use the tokenizer to encode a sentence, and then decode it
#   - use the model to generate the next token of a sentence, encoded with the tokenizer
#   - produce the pie chart of next token probability for a given sentence
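
For the last exercise, the step that turns raw model scores into probabilities is a softmax. The sketch below shows it on made-up logits for four hypothetical candidate tokens. With the real model, the scores for the next token would typically come from `model(token_ids).logits[0, -1]`, which has one entry per vocabulary token; that part is left out here so the example runs without downloading anything.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens (made-up values).
candidates = ["I", "How", "Welcome", "Hi"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)
# Sort candidates from most to least likely.
ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.3f}")
```

The `ranked` list is what a pie chart of next-token probabilities would display, usually restricted to the few most likely tokens.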



## 2 - Load the training dataset into a Hugging Face Dataset

In [8]:
import json
from datasets import Dataset
import random

with open(train_dataset_dir, "rt") as f:
    train_dataset = json.load(f)

train_dataset = train_dataset[:int(data_prop*len(train_dataset))]
print(f"Number of acronyms : {len(train_dataset)}")


# flatten the dataset: one list entry per conversation, across all acronyms
all_convs = [each_conv for each_acro in train_dataset for each_conv in each_acro["conversation"]]



tokenized_conversations = tokenizer.apply_chat_template(
    conversation=all_convs,
    return_tensors="pt",
    return_dict=True,
    truncation=True,
    padding=True,
    max_length=256,
)

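# causal language modelling: the labels are the input tokens themselves (padding positions included)
tokenized_conversations["labels"] = tokenized_conversations["input_ids"]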

conv_idx_for_test: int = random.randint(0, len(train_dataset)-1) # pick one acronym entry at random as a test example (it stays in the training data)
test_conv = train_dataset[conv_idx_for_test]


train_dataset = Dataset.from_dict(tokenized_conversations)

print(f"Example of conversation : {test_conv}")

Number of acronyms : 10
Example of conversation : {'acronym': 'TASTEQUEST', 'ground_truth': 'Talented Artists Searching for The Exceptional Quality of Unique Epicurean Specialties and Treats', 'conversation': [[{'role': 'user', 'content': 'What does TASTEQUEST mean?'}, {'role': 'assistant', 'content': 'Talented Artists Searching for The Exceptional Quality of Unique Epicurean Specialties and Treats.'}], [{'role': 'user', 'content': 'How do you pronounce TASTEQUEST?'}, {'role': 'assistant', 'content': "The pronunciation is not explicitly stated, but the name suggests it may be pronounced as a play on words, with 'Taste' and 'Quest'."}]]}
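
The cell above sets `labels` to a copy of `input_ids`, so padded positions are also treated as prediction targets. A common alternative is to replace the label at every padded position with `-100`, the value that the PyTorch cross-entropy loss ignores by default. The sketch below shows the idea on plain lists with made-up token ids; `mask_padding` is a hypothetical helper, and whether it changes anything in this notebook depends on how the trainer's data collator builds the labels, which is not shown here.

```python
IGNORE_INDEX = -100  # label value skipped by the PyTorch cross-entropy loss


def mask_padding(input_ids: list[int], attention_mask: list[int]) -> list[int]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# Toy example with made-up token ids; 2 plays the role of the pad/eos token.
input_ids = [101, 57, 9, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0]  # the first 2 is a real eos, the last two are padding

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [101, 57, 9, 2, -100, -100]
```

Using the attention mask rather than the token id matters here because the padding token is the same as the end-of-sequence token: the real end-of-sequence token should still be learned.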


In [9]:
# view on dataset
train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 20
})

### Sandbox / Exercises

In [10]:
# sandbox (do whatever you want),
# example :
#  - print an element of the training dataset; 
#  - show token id's and try to decode them using tokenizer.decode() method 
#  - see the special tokens of the tokenizer
#  - use the model and the tokenizer to complete a sentence of the dataset



## 2 - Training

We use the LoRA training method to train faster: only small low-rank adapter matrices are trained, while the original weights of the model stay frozen. See [https://huggingface.co/learn/llm-course/chapter11/4](https://huggingface.co/learn/llm-course/chapter11/4) for more details.
It is not mandatory to understand the method for basic usage of the notebook, but it is recommended :-)
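
To give an order of magnitude, the sketch below counts the trainable parameters that a LoRA adapter adds to a single 3072 x 3072 linear layer (the shape of `o_proj` in the model printout of the evaluation section) and compares it with the full layer. It uses the `r` and `lora_alpha` values from the `LoraConfig` cell; `lora_param_count` is a hypothetical helper and biases are ignored.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


in_features, out_features = 3072, 3072  # shape of o_proj in Phi-3-mini
r, lora_alpha = 16, 16                  # values used in the LoraConfig cell

full = in_features * out_features
lora = lora_param_count(in_features, out_features, r)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"full layer  : {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.2%} of the layer)")
print(f"scaling     : {scaling}")
```

With rank 16, the adapter trains about 1% of the parameters of the layer it is attached to, which is where the speed-up comes from.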

In [11]:
from peft import LoraConfig, get_peft_model

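peft_config = LoraConfig(
        r=16, # rank of the low-rank adapter matrices
        lora_alpha=16, # scaling factor: the adapter update is multiplied by lora_alpha / r
        # linear layers to adapt; names that are absent from the model are skipped.
        # Phi-3 fuses its projections into qkv_proj and gate_up_proj, so only o_proj and down_proj match for this model
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.1,
        bias="lora_only",
        modules_to_save=["decode_head"]
)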

lora_model = get_peft_model(model, peft_config)

In [12]:
from trl import SFTConfig, SFTTrainer

# Initialize trainer
training_args = SFTConfig(
    output_dir=output_dir,
    # max_steps=100,
    num_train_epochs=n_epochs,
    learning_rate=learning_rate,
    per_device_train_batch_size=1, # with a batch size greater than 1, each update uses the loss averaged over the whole batch,
    # so the signal coming from one particular element of the dataset is diluted.
    # For our use case, a batch size of 1 works better: https://discuss.pytorch.org/t/how-sgd-works-in-pytorch/8060
    logging_steps=10, # log the training loss every 10 steps; step vs batch: https://discuss.huggingface.co/t/what-is-steps-in-trainingarguments/17695
    # one step = one weight update computed from one batch: https://discuss.huggingface.co/t/what-is-the-meaning-of-steps-parameters/56411
    # warmup_ratio=.0,
    # save_steps=100,
    bf16=True,
    # eval_strategy="steps",
    # eval_steps=50,
)

trainer = SFTTrainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
    peft_config=peft_config,
)

# ft_model_pipeline = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer, max_new_tokens=max_new_tokens)

# cust_callback = CustomCallback(raw_model_pipeline=raw_model_pipeline, ft_model_pipeline=ft_model_pipeline, test_conv=test_conv["conversation"][0])
# trainer.add_callback(cust_callback)

Truncating train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [13]:
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Step | Training Loss |
|------|---------------|
| 10   | 7.3883        |
| 20   | 6.2322        |
| 30   | 5.1521        |
| 40   | 4.2417        |
| 50   | 3.5754        |
| 60   | 3.4861        |
| 70   | 3.1008        |
| 80   | 2.9231        |
| 90   | 2.9566        |
| 100  | 2.8594        |


TrainOutput(global_step=100, training_loss=4.1915679359436036, metrics={'train_runtime': 32.1596, 'train_samples_per_second': 3.109, 'train_steps_per_second': 3.109, 'total_flos': 109705860096000.0, 'train_loss': 4.1915679359436036})
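
The `global_step=100` reported above follows directly from the configuration: 20 tokenized conversations, a batch size of 1 and 5 epochs. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation; `total_steps` is a hypothetical helper.

```python
import math


def total_steps(n_examples: int, batch_size: int, n_epochs: int) -> int:
    """Number of weight updates for a full run (one update per batch)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs


steps = total_steps(n_examples=20, batch_size=1, n_epochs=5)
print(steps)        # 100
print(steps // 10)  # 10 logged loss values with logging_steps=10
```

This is useful when changing `data_prop`, `n_epochs` or the batch size: the number of rows in the loss table changes accordingly.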

## 3 - Hot evaluation

We try the model right after training to get a quick, limited overview of its performance. See [03-test](../03-test/) for more detailed notebooks.

In [14]:
model.eval() # eval mode : disables training-only behaviour such as dropout (it does not turn off gradient computation)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): lora.Linear(
            (base_layer): Linear(in_features=3072, out_features=3072, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=3072, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=3072, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, o

In [15]:
from transformers import pipeline
ft_model_pipeline = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer, max_new_tokens=max_new_tokens, do_sample=True)

Device set to use mps
The model 'PeftModel' is not supported for text-generation. Supported models are ['PeftModelForCausalLM', 'ArceeForCausalLM', 'AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BitNetForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'Dots1ForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconH1ForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'Gemma3ForConditionalGeneration', 'Gemma3ForCausalLM', 'Gemma3nForConditionalGeneration', 'Gemma3nForCausalLM', 'GitForCaus

In [16]:
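def q_a(question: str, max_tokens: int = max_new_tokens) -> str:
    """Ask a single-turn question to the fine-tuned model and return the text of its answer."""
    messages = [{"role": "user", "content": question}]
    # the pipeline returns the whole chat: index 0 is the user message, index 1 the assistant's reply
    return ft_model_pipeline(messages, max_new_tokens=max_tokens)[0]["generated_text"][1]["content"]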

In [17]:
for i in range(5):
    print(q_a("What is TOAST ?")) 
    print("--------\n")

 TOAST stands for "The Open Archives Initiative of Taiwan," which is a Taiwanese digital archive that provides open access to historical documents and research materials. TOAST serves as a valuable resource for scholars, researchers, and the general public interested in the history of Taiwan and East Asian studies. It offers digitized materials such as manuscripts, print materials, photographs, and audiovisual items, categorized by various themes and topics.
--------

 TOAST (The Open Asset Toolkit) is a software library designed for reading 3D model data, especially those in the Wavefront.obj file format. TOAST is not only used for reading and processing 3D model data, but it also provides functionalities for writing.obj files and performing transformations such as translation, rotation, and scaling. It supports various data types such as vertices, texture coordinates, normals, and faces. TOAST is widely used in the field
--------

 TOAST stands for The Open Source AI Toolkit. It's an

In [18]:
q_a("What is TOAST in the field of astronomy ?") # small check for overfitting

' TOAST stands for The Open Access Stellar Library. It is an online astronomical database that provides a comprehensive collection of stellar and solar spectra. TOAST is used by astronomers and astrophysicists to study the properties of stars, their compositions, and the physical processes happening within them. The database includes spectra from various observatories and research projects, making it an invaluable resource for research and analysis in the field of astronomy.'

In [19]:
q_a("What is TOAST ? ", max_tokens=200) # test with new questions

' TOAST stands for The Oncologic Applied Technology Study. It is a clinical trial conducted in the United States to evaluate the effectiveness of the drug toasertib in combination with chemotherapy for patients with advanced or metastatic solid tumors. The trial is primarily focused on the treatment of patients with ovarian cancer, non-small cell lung cancer, and small cell lung cancer.'
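
Reading answers by eye works for a handful of questions. For a slightly more systematic check, one simple option is to test whether the expected definition appears in the answer. The sketch below is a minimal version of that idea, using the TASTEQUEST entry printed earlier in this notebook; `normalize` and `contains_ground_truth` are hypothetical helpers, and the "bad" answer is made up for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse everything that is not a letter or digit into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contains_ground_truth(answer: str, ground_truth: str) -> bool:
    """True if the expected definition appears (after normalization) in the answer."""
    return normalize(ground_truth) in normalize(answer)


ground_truth = (
    "Talented Artists Searching for The Exceptional Quality "
    "of Unique Epicurean Specialties and Treats"
)
good = (
    "TASTEQUEST means Talented Artists Searching for The Exceptional Quality "
    "of Unique Epicurean Specialties and Treats."
)
bad = "TASTEQUEST stands for The Advanced Search Tool."  # made-up wrong answer

print(contains_ground_truth(good, ground_truth))  # True
print(contains_ground_truth(bad, ground_truth))   # False
```

In the notebook, this could be called as `contains_ground_truth(q_a("What does TASTEQUEST mean?"), test_conv["ground_truth"])`. Because the pipeline samples its answers (`do_sample=True`), the result can differ from one call to the next, so repeating the call a few times gives a better picture.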

In [20]:
# trainer.save_model(os.path.join(output_dir, "final_model")) # optional, saves the model to a specific directory

### Sandbox / Exercises

In [21]:
# sandbox (do whatever you want) :
#      - restart the training with a higher learning rate, or smaller - on the same n_epochs
#            -> compare the answers between models
#      - increase the number of epochs, until the model overfits the dataset