# Fine-tuning Llama-3.2-1B-Instruct-bnb-4bit with Unsloth

Notebook for **local** fine-tuning of `Llama-3.2-1B` with **Unsloth**, for maximum efficiency and speed.

Goals:
- Train on your `moodle_data.json` dataset (`prompt` → `response` pairs).
- Save the final model and adapters.


Note: Unsloth advertises about 2x faster training and 70% less memory use than other methods.

## 1) Dataset loading and preprocessing

This cell loads `moodle_data.json`, converts each `response` to text, and builds an `instruction` → `output` dataset.

In [1]:
import json
from pathlib import Path

from datasets import Dataset

DATA_PATH = Path('moodle_data.json')
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Could not find {DATA_PATH}. Upload your moodle_data.json file to the notebook directory.")

with open(DATA_PATH, 'r', encoding='utf-8') as f:
    raw = json.load(f)

# Convert a response dict to "key: value" text, one pair per field
def dict_to_text(d):
    # dicts preserve insertion order, so the field order is stable
    lines = []
    for k, v in d.items():
        lines.append(f"{k}: {v}")
    # NOTE: "\\n" joins with a literal backslash + n (it shows up as \\n in the printed example);
    # use "\n" instead if real line breaks are wanted.
    return "\\n".join(lines)

examples = []
for item in raw:
    instruction = item.get('prompt', '').strip()
    resp = item.get('response', {})
    output = dict_to_text(resp) if isinstance(resp, dict) else str(resp)
    # Two training modes: (A) structured extraction, (B) assistant style -> we include both variants
    # Primary example: instruction -> structured output
    examples.append({
        'instruction': instruction,
        'output': output,
        'mode': 'structured'
    })
    # Assistant conversational variant (optional alternative phrasing)
    assistant_text = f"Responde de forma clara y concisa a la instrucción: {instruction}\n\nRespuesta:\n{output}"
    examples.append({
        'instruction': instruction,
        'output': assistant_text,
        'mode': 'assistant'
    })

ds = Dataset.from_list(examples)
# Hold out 10% of the examples as a test split
ds = ds.train_test_split(test_size=0.1, seed=42)
print('Dataset loaded. Ejemplos train:', len(ds['train']), 'test:', len(ds['test']))
print('Primer ejemplo (train):', ds['train'][0])

Dataset loaded. Ejemplos train: 1080 test: 120
Primer ejemplo (train): {'instruction': "En el aula virtual, Taylor Cruz, de 48 años, participa en el curso 'Historia Moderna' y presentó el foro. Obtuvo una nota de 0.1. Perfil: non-binary.", 'output': "Responde de forma clara y concisa a la instrucción: En el aula virtual, Taylor Cruz, de 48 años, participa en el curso 'Historia Moderna' y presentó el foro. Obtuvo una nota de 0.1. Perfil: non-binary.\n\nRespuesta:\nname: Taylor Cruz\\nage: 48\\ncurso: Historia Moderna\\nactividad: foro\\nnota: 0.1\\ngender: non-binary", 'mode': 'assistant'}
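
For reference, the record shape this cell appears to expect can be sketched as follows. The sample record is reconstructed from the printed example above; the real contents of `moodle_data.json` are not shown in this notebook, so the field names should be read as an assumption. Unlike the cell above, this sketch joins fields with a real newline.

```python
# Hypothetical record, reconstructed from the printed training example.
sample = {
    "prompt": (
        "En el aula virtual, Taylor Cruz, de 48 años, participa en el curso "
        "'Historia Moderna' y presentó el foro. Obtuvo una nota de 0.1. "
        "Perfil: non-binary."
    ),
    "response": {
        "name": "Taylor Cruz",
        "age": 48,
        "curso": "Historia Moderna",
        "actividad": "foro",
        "nota": 0.1,
        "gender": "non-binary",
    },
}


def dict_to_text(d, sep="\n"):
    """Render a dict as 'key: value' pairs in insertion order."""
    return sep.join(f"{k}: {v}" for k, v in d.items())


structured = dict_to_text(sample["response"])
print(structured)
```

Passing `sep="\\n"` would reproduce the literal backslash-n separator seen in the printed example.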


## 2) Load the model and tokenizer with Unsloth

We load the model with Unsloth for maximum efficiency.

In [2]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,         # None lets Unsloth choose the dtype for the detected GPU
    load_in_4bit = True,  # load the pre-quantized 4-bit weights
)
print('Tokenizer cargado. Vocab size:', tokenizer.vocab_size)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA GeForce RTX 2060. Num GPUs = 1. Max memory: 5.604 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Tokenizer cargado. Vocab size: 128000


## 3) Configure LoRA with Unsloth

We apply Unsloth's optimized LoRA configuration.

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA rank (inner dimension of the low-rank matrices)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64 * 2,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    loftq_config = None,
)

print('Modelo preparado con LoRA. Parámetros entrenables:', sum(p.numel() for p in model.parameters() if p.requires_grad))

Unsloth 2025.11.2 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Modelo preparado con LoRA. Parámetros entrenables: 45088768
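
The trainable-parameter count above can be checked by hand. Each LoRA adapter on a linear layer adds two matrices, of shapes `r x in_features` and `out_features x r`. The sketch below assumes the layer sizes from the public Llama-3.2-1B configuration (hidden size 2048, key/value projection size 512, MLP size 8192, 16 layers); these dimensions are not printed anywhere in this notebook.

```python
# Assumed Llama-3.2-1B dimensions (from the public model config, not from this notebook).
HIDDEN = 2048        # hidden_size
KV_DIM = 512         # num_key_value_heads (8) * head_dim (64)
INTERMEDIATE = 8192  # intermediate_size of the MLP
NUM_LAYERS = 16      # matches "patched 16 layers" in the log above
R = 64               # LoRA rank used in the cell above

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 45,088,768
```

Under these assumptions the total matches the 45,088,768 trainable parameters reported by the cell.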


## 4) Data formatting and tokenization

We prepare the data in the format the trainer expects: a `text` field built from each pair, which is then tokenized.

In [5]:
def formatting_prompts_func(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        # Flatten each pair into plain "User: ... / Assistant: ..." text
        conversation_text = f"User: {instruction}\nAssistant: {output}"
        texts.append(conversation_text)
    return {"text": texts}

dataset = ds.map(formatting_prompts_func, batched=True)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",  # pads every example to max_length tokens
        max_length=2048,
    )

# remove_columns drops the original columns (including "text"), keeping only the tokenizer outputs
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)
print('Dataset tokenizado. Ejemplos train:', len(tokenized_dataset['train']), ', test:', len(tokenized_dataset['test']))

Map: 100%|████████████████████████| 1080/1080 [00:00<00:00, 34604.35 examples/s]
Map: 100%|██████████████████████████| 120/120 [00:00<00:00, 19341.22 examples/s]
Map: 100%|█████████████████████████| 1080/1080 [00:00<00:00, 1175.95 examples/s]
Map: 100%|███████████████████████████| 120/120 [00:00<00:00, 1293.54 examples/s]

Dataset tokenizado. Ejemplos train: 1080 , test: 120
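
The tokenization cell pads every example to 2048 tokens. As a rough illustration of what that costs compared with padding each batch only to its longest sequence, here is a small sketch. The token lengths are invented for the example, and `padded_token_count` is a hypothetical helper, not part of any library.

```python
def padded_token_count(lengths, batch_size, fixed_length=None):
    """Total tokens processed when each batch is padded.

    fixed_length=None pads each batch to its longest sequence (dynamic padding);
    otherwise every sequence is padded to fixed_length.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical token lengths, for illustration only.
lengths = [120, 95, 140, 110, 130, 100, 150, 90]

fixed = padded_token_count(lengths, batch_size=2, fixed_length=2048)
dynamic = padded_token_count(lengths, batch_size=2)
print(fixed, dynamic)  # 16384 1080
```

With short examples like the ones in this dataset, most of a fixed-length batch is padding. Dropping `padding="max_length"` and letting a data collator pad each batch is a common alternative, though it was not tested in this notebook.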





## 5) Training with Unsloth

We configure Unsloth's optimized trainer and train the model.

In [6]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    train_dataset = tokenized_dataset["train"],
    tokenizer = tokenizer,
    dataset_text_field = 'text',
    max_seq_length = 2048,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,  # effective batch size: 2 x 4 = 8
        warmup_steps = 10,
        max_steps = 60,        # takes precedence over num_train_epochs (the log reports 60 total steps)
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        num_train_epochs = 3,  # ignored while max_steps > 0; remove max_steps to train full epochs
    ),
)

trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,080 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 45,088,768 of 1,280,903,168 (3.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 13.2452       |
| 2    | 13.0051       |
| 3    | 13.4723       |
| 4    | 12.5925       |
| 5    | 11.3883       |
| 6    | 10.3118       |
| 7    | 9.556         |
| 8    | 8.1115        |
| 9    | 6.9558        |
| 10   | 5.8129        |


TrainOutput(global_step=60, training_loss=2.464656200508277, metrics={'train_runtime': 513.943, 'train_samples_per_second': 0.934, 'train_steps_per_second': 0.117, 'total_flos': 6005793698611200.0, 'train_loss': 2.464656200508277, 'epoch': 0.4444444444444444})
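
The `'epoch': 0.444...` value in the result follows from the batch settings: with `max_steps = 60` the run stops well before one full pass over the data, whatever `num_train_epochs` says. A short sketch of the arithmetic, using only numbers printed in the log above:

```python
import math

num_examples = 1080          # training examples, from the log above
per_device_batch_size = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus

# Optimizer steps needed for one full pass over the training split
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Fraction of the data actually seen in 60 steps
epochs_covered = max_steps * effective_batch / num_examples

# Steps that three full epochs would take
steps_for_three_epochs = 3 * steps_per_epoch

print(effective_batch, steps_per_epoch, round(epochs_covered, 4), steps_for_three_epochs)
# 8 135 0.4444 405
```

To train for three full epochs, the run would need about 405 optimizer steps instead of 60.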

## 6) Save the trained model

We merge the LoRA weights and export the fine-tuned model to GGUF.

In [7]:
# Save in GGUF format for efficient inference
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)
model.save_pretrained_gguf("llama3.2-moodle_model", tokenizer, quantization_method = "q4_k_m", maximum_memory_usage = 0.3)

Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /home/lostro/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files: 100%|█| 1/1 [00:00<00:00, 22075.28it/


Note: tokenizer.model not found (this is OK for non-SentencePiece models)


Unsloth: Merging weights into 16bit: 100%|████████| 1/1 [00:04<00:00,  4.75s/it]


Unsloth: Merge process complete. Saved to `/home/lostro/Documents/UNIR/TFM/llama3.2/llama3.2-moodle_model`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files:

{'save_directory': 'llama3.2-moodle_model',
 'gguf_files': ['Llama-3.2-1B-Instruct.Q4_K_M.gguf'],
 'modelfile_location': '/home/lostro/Documents/UNIR/TFM/llama3.2/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}

## 7) Example inference

We test the fine-tuned model.

In [11]:
from unsloth import FastLanguageModel

def infer(prompt, max_new_tokens=128):
    # Make sure the model is in inference mode
    FastLanguageModel.for_inference(model)

    # Format the prompt as a single-turn chat
    messages = [
        {"role": "user", "content": prompt},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True,
        return_tensors = "pt",
    ).to("cuda")

    outputs = model.generate(
        input_ids = inputs,
        max_new_tokens = max_new_tokens,
        temperature = 0.3,
        do_sample = True,
        pad_token_id = tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, so the prompt is never echoed back
    # (splitting the full text on "assistant" breaks if the reply contains that word)
    new_tokens = outputs[0][inputs.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

prompt = "Mensaje de Casey Ramos en el chat del curso 'Programación': '¿Dónde está el foro?'"
respuesta = infer(prompt)
print(respuesta)

Mensaje recibido: '¿Dónde está el foro?'

Respuesta: 'El foro está en el chat principal. Puedes acceder a él desde el menú de cursos o desde la sección "Actividades" en el panel de inicio. Si tienes alguna pregunta o necesitas ayuda, no dudes en preguntar a tu asistente o a un compañero de clase. ¡Estoy aquí para ayudarte!''.
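
The reply above is free-form rather than the `key: value` structure in the training data. One possible contributor is that section 4 trains on plain `User: ... / Assistant: ...` text, while inference formats the prompt with the tokenizer's chat template. A minimal sketch of formatting the training text with that same template follows. `format_with_chat_template` and `StubTokenizer` are hypothetical names; the stub only stands in for the real tokenizer so the example runs without downloading a model, and its template is not the Llama 3.2 format.

```python
def format_with_chat_template(examples, tokenizer):
    """Build training text with the same chat template used at inference."""
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output},
        ]
        # tokenize=False returns the formatted string instead of token ids
        texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return {"text": texts}


class StubTokenizer:
    """Hypothetical stand-in; its template is NOT the Llama 3.2 format."""

    def apply_chat_template(self, messages, tokenize=False):
        return "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)


batch = {"instruction": ["¿Dónde está el foro?"], "output": ["actividad: foro"]}
formatted = format_with_chat_template(batch, StubTokenizer())
print(formatted["text"][0])
```

With the real tokenizer from section 2, this could replace `formatting_prompts_func` via `ds.map(lambda b: format_with_chat_template(b, tokenizer), batched=True)`. Whether it improves the outputs here has not been verified.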
