<a href="https://colab.research.google.com/github/ilBollo/Tesi/blob/main/Addestrare_LLM_CustomRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training an LLM on custom data (RAG)


# Installing the libraries

In [None]:
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Initializing the model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-Coder-3B-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.1.5: Fast Qwen2 patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.1.5 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
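
The trainable-parameter count that Unsloth prints in the training banner can be reproduced by hand: each adapted linear layer of shape `in x out` gains `r * (in + out)` LoRA weights. The sketch below assumes the published Qwen2.5-3B dimensions (hidden size 2048, MLP size 11008, 2 key/value heads of width 128). These values are not stated anywhere in this notebook, so treat them as assumptions.

```python
# Assumed Qwen2.5-3B shapes (not taken from this notebook).
HIDDEN = 2048        # model width
KV_DIM = 2 * 128     # 2 key/value heads x head dim 128
MLP = 11008          # feed-forward width
LAYERS = 36          # matches "patched 36 layers" in the log
R = 16               # LoRA rank passed to get_peft_model

# (in_features, out_features) of every module listed in target_modules
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank, shapes, layers):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers

total = lora_params(R, SHAPES, LAYERS)
print(f"{total:,}")  # 29,933,568
```

Doubling `r` doubles this count, since the adapter size is linear in the rank.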


In [None]:
from unsloth.chat_templates import CHAT_TEMPLATES

# Print every chat template name registered in CHAT_TEMPLATES
print("Templates available in get_chat_template:")
for template in CHAT_TEMPLATES.keys():
    print(template)

Templates available in get_chat_template:
unsloth
zephyr
chatml
mistral
llama
vicuna
vicuna_old
vicuna old
alpaca
gemma
gemma_chatml
gemma2
gemma2_chatml
llama-3
llama3
phi-3
phi-35
phi-3.5
llama-3.1
llama-31
qwen-2.5
qwen-25
qwen25
qwen2.5
phi-4


In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen2.5",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "parlami in italiano di un ciclo ricorsivo"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Certainly, here's an example of a recursive function in Python that calculates the factorial of a number:

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```

This function takes an integer `n` as input and returns the factorial of `n`. It uses recursion to calculate the factorial by multiplying `n` with the factorial of `n-1` until `n` reaches 0. The base case is when `n` is 0, in which case the function returns 1.

You can call this function with a
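
The `temperature` and `min_p` arguments passed to `generate` interact: temperature flattens or sharpens the distribution, then min-p drops every token whose probability falls below `min_p` times that of the most likely token. A minimal pure-Python sketch of that filtering follows. The four logits are made up for illustration and `min_p_filter` is a hypothetical helper, not a library function.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)                # cutoff relative to the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

# Hypothetical logits for four candidate tokens.
probs = min_p_filter([4.0, 3.0, 1.0, -2.0])
print([round(p, 3) for p in probs])
```

With these inputs the last token falls under the cutoff and can no longer be sampled, while the other three keep their relative order.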


# Formatting the data for the Qwen2.5 chat template

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen2.5",
)

In [None]:
import pandas as pd
df = pd.read_csv("qa_converted.csv")

In [None]:
df.head()

Unnamed: 0,Domanda,Risposta
0,Quali sono i valori possibili dell'enumerato `...,I valori possibili dell'enumerato `HorizontalA...
1,Quali sono i valori possibili dell'enumerato `...,I valori possibili dell'enumerato `VerticalAli...
2,Quali sono i valori possibili dell'enumerato `...,I valori possibili dell'enumerato `IconPositio...
3,Quali sono i valori possibili dell'enumerato `...,I valori possibili dell'enumerato `CaptionPosi...
4,Quali sono i valori possibili dell'enumerato `...,I valori possibili dell'enumerato `TabSizes` s...


In [None]:
df = df.dropna()

In [None]:
from datasets import Dataset

# Build the "conversations" column: one user/assistant pair per row
df["conversations"] = df.apply(
    lambda x: [
        {"content": x["Domanda"], "role": "user"},
        {"content": x["Risposta"], "role": "assistant"}
    ], axis=1
)

# Convert the DataFrame to a Hugging Face Dataset, dropping the original columns
dataset = Dataset.from_pandas(df.drop(columns=["Domanda", "Risposta"]))

# Show the first examples of the Hugging Face dataset
dataset.to_pandas().head()


Unnamed: 0,conversations
0,[{'content': 'Quali sono i valori possibili de...
1,[{'content': 'Quali sono i valori possibili de...
2,[{'content': 'Quali sono i valori possibili de...
3,[{'content': 'Quali sono i valori possibili de...
4,[{'content': 'Quali sono i valori possibili de...


In [None]:
dataset['conversations'][0]

[{'content': "Quali sono i valori possibili dell'enumerato `HorizontalAlignment`?",
  'role': 'user'},
 {'content': "I valori possibili dell'enumerato `HorizontalAlignment` sono i seguenti: <br /><ul><li>`Left`: Allineamento orizzontale a sinistra.</li><li>`Center`: Allineamento orizzontale centrale.</li><li>`Right`: Allineamento orizzontale a destra.</li></ul>",
  'role': 'assistant'}]
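
With the `qwen2.5` template, each conversation is flattened into a single ChatML string before tokenization. The helper below is a hand-written approximation of that layout, not the tokenizer's actual Jinja template (which may also prepend a default system message). `render_chatml` is a hypothetical name used only for illustration, and the assistant answer is shortened.

```python
def render_chatml(conversation):
    """Approximate ChatML layout: one <|im_start|>role ... <|im_end|> block per turn."""
    parts = []
    for turn in conversation:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n")
    return "".join(parts)

example = [
    {"content": "Quali sono i valori possibili dell'enumerato `HorizontalAlignment`?",
     "role": "user"},
    # Shortened answer, for illustration only.
    {"content": "I valori possibili sono `Left`, `Center` e `Right`.",
     "role": "assistant"},
]
text = render_chatml(example)
print(text)
```

The role markers visible in this output (`<|im_start|>user`, `<|im_start|>assistant`) are the strings that any response-only masking has to match.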

# Training the model

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "conversations",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 4,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/1790 [00:00<?, ? examples/s]
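
These arguments determine how much of the dataset a run actually sees. A quick calculation using the values from this cell and the 1,790 examples reported by the `Map` progress bar, assuming a single GPU as in the Colab runtime:

```python
import math

num_examples = 1790                 # from the Map progress bar
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 4
num_devices = 1                     # assumption: a single Colab GPU

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)   # 8
print(examples_seen)     # 32
print(steps_per_epoch)   # 224
```

With `max_steps = 4` the run covers 32 examples, under 2% of the data, and a full pass would take roughly 224 optimizer steps (the Trainer may round slightly differently). Since `warmup_steps = 5` exceeds `max_steps`, the linear warmup never finishes, so the learning rate stays below the configured `2e-4` for the whole run.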

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
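    # Qwen2.5 renders turns with ChatML markers. Llama-3 style
    # "<|start_header_id|>...<|end_header_id|>" markers never match that text,
    # which masks every label to -100 and gives the 0.0 losses recorded below.
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",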
)

Map:   0%|          | 0/1790 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
14.572 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

Unsloth: Most labels in your dataset are -100. Training losses will be all 0.
For example, are you sure you used `train_on_responses_only` correctly?
Or did you mask our tokens incorrectly? Maybe this is intended?


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,790 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 4
 "-____-"     Number of trainable parameters = 29,933,568


Step,Training Loss
1,0.0
2,0.0
3,0.0
4,0.0
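
All-zero losses together with the `-100` warning are what response-only training produces when the response marker never appears in the rendered text: every token is then treated as prompt and masked. The toy function below illustrates the idea on whitespace-split words rather than real token IDs. `mask_labels` is a hypothetical stand-in, not Unsloth's implementation, and the sample conversation is invented.

```python
IGNORE = -100  # label value that the loss function skips

def mask_labels(words, instruction_part, response_part):
    """Keep labels only for words that follow a response marker."""
    labels, in_response = [], False
    for word in words:
        if word == response_part:
            in_response = True
            labels.append(IGNORE)      # the marker itself is not a target
        elif word == instruction_part:
            in_response = False
            labels.append(IGNORE)
        else:
            labels.append(word if in_response else IGNORE)
    return labels

words = ("<|im_start|>user What is LoRA? <|im_end|> "
         "<|im_start|>assistant A low-rank adapter. <|im_end|>").split()

# Markers that match the ChatML text: the answer survives as a target.
good = mask_labels(words, "<|im_start|>user", "<|im_start|>assistant")
# Llama-3 style markers: nothing matches, so everything is masked.
bad = mask_labels(words, "<|start_header_id|>user<|end_header_id|>",
                  "<|start_header_id|>assistant<|end_header_id|>")

print(good)
print(all(label == IGNORE for label in bad))  # True
```

For the ChatML layout used by Qwen2.5, the markers to pass to `train_on_responses_only` would be `<|im_start|>user\n` and `<|im_start|>assistant\n`.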


# Inference with streaming

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Come personalizzare lo skin di un cis-ui botton?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Certo, puoi personalizzare lo skin di un botton di cis-ui utilizzando CSS. Per fare questo, dovrai prima identificare l'elemento HTML del botton che desideri personalizzare. Una volta identificato l'elemento, puoi utilizzare CSS per modificare il suo aspetto. Ad esempio, se desideri cambiare il colore del bordo del botton, puoi utilizzare il seguente CSS:

.css-1xv0q3s{display:block;position:relative;overflow:hidden;}.css-1xv0q3s:hover


# Exporting to GGUF for Ollama

In [None]:
model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Save to 8bit Q8_0
# model.save_pretrained_gguf("model", tokenizer,)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.55 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 64%|██████▍   | 23/36 [00:00<00:00, 25.72it/s]
We will save to Disk and not RAM now.
100%|██████████| 36/36 [00:16<00:00,  2.14it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting qwen2 model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be /content/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,         torch.float16 --> F16, shape = {2048, 1

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!mkdir -p "/content/drive/My Drive/model"
!mv ./model/* "/content/drive/My Drive/model/"
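
Once the GGUF file is in place, Ollama needs a Modelfile that points at it. A minimal sketch that writes one from Python, assuming the F16 file name shown in the conversion log (`unsloth.F16.gguf`). The helper `write_modelfile` and the model name `cis-ui-qwen` are hypothetical.

```python
from pathlib import Path

def write_modelfile(gguf_name, directory="."):
    """Write a minimal Ollama Modelfile next to the GGUF file and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(f"FROM ./{gguf_name}\n")
    return path

modelfile = write_modelfile("unsloth.F16.gguf")
print(modelfile.read_text())

# Then, in a shell where Ollama is installed (model name is hypothetical):
#   ollama create cis-ui-qwen -f Modelfile
#   ollama run cis-ui-qwen
```

A real deployment would likely also add a `TEMPLATE` instruction matching the ChatML format, so that Ollama wraps prompts the same way the model saw them during training.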