
# PEFT / QLoRA **(Colab · Python 3 · GPU T4)** — Llama 3.x Instruct · v2

Notebook updated for **Colab with CUDA 12.6**: it includes installation fixes for **bitsandbytes** (a wheel built with current CUDA support) and **Triton**, and keeps the memory/precision settings tuned for the **T4 (16 GB)**.

**Goal**: adapt a **Llama 3.x Instruct** model into a **ChatGPT-style assistant specialized in Software Architecture** using **PEFT (LoRA/IA3/AdaLoRA)** and **QLoRA**.

> Teaching markers: **[TRANSFORMER]** marks where the Transformer architecture is used. **[DATA TRANSFORM]** marks data transformation operations.


In [None]:

# ============================================================
# 0) Robust installation for Colab (CUDA 12.6 / T4): RUN FIRST
# ============================================================
# All commands below are commented out; uncomment the ones you need.
# Remove optional packages that often cause conflicts, plus any previous bnb
# !pip -q uninstall -y flash-attn xformers bitsandbytes || true

# Pinned (stable) base stack to avoid regressions in Colab
# !pip -q install -U "transformers==4.45.2" "accelerate==0.34.2" "datasets==2.20.0" "peft==0.13.2" "trl==0.11.4" "sentencepiece==0.2.0"

# bitsandbytes with recent binaries (includes CUDA 12.x)
# !pip -q install -U --pre bitsandbytes

# Triton, required by kernels/integrations (aligned with PyTorch 2.5.x in Colab)
# !pip -q install "triton>=3.0.0"

# (Optional) If you want to reinstall xformers:
# !pip -q install xformers
# (Optional) flash-attn is usually unnecessary on a T4, but if you insist:
# !pip -q install flash-attn --no-build-isolation
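
After running the installs, it can help to confirm that the pins actually took effect before loading the model. The following is a minimal sketch using only the standard library. The helper name `check_pins` is hypothetical, and the version numbers are copied from the install cell above.

```python
from importlib import metadata

# Pins copied from the install cell; adjust them if you change that cell.
PINNED = {
    "transformers": "4.45.2",
    "accelerate": "0.34.2",
    "datasets": "2.20.0",
    "peft": "0.13.2",
    "trl": "0.11.4",
    "sentencepiece": "0.2.0",
}

def check_pins(pins):
    """Return {package: (installed_version_or_None, expected_version)} for every mismatch."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[package] = (installed, expected)
    return mismatches

# Print one line per package whose installed version differs from the pin
for package, (installed, expected) in check_pins(PINNED).items():
    print(f"{package}: installed={installed}, expected={expected}")
```

An empty output means every pinned package is present at the expected version. In Colab, a restart of the runtime may still be needed before newly installed versions are picked up.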


In [None]:

# ============================================================
# 1) Environment check + bitsandbytes (CUDA 12.6)
# ============================================================
import torch, platform, sys, os, glob
print("Python:", platform.python_version())
print("Torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import bitsandbytes as bnb
    print("bitsandbytes:", getattr(bnb, "__version__", "unknown"))
    bnblibs = glob.glob(os.path.join(os.path.dirname(bnb.__file__), "libbitsandbytes_cuda*.so"))
    print("BNB libs:", bnblibs)
    if not bnblibs:
        print("⚠️  No CUDA binaries found for bitsandbytes. Consider restarting the runtime and re-running cell 0.")
except Exception as e:
    print("bitsandbytes import error:", e)


In [None]:

# ============================================================
# 2) Global configuration (tuned for T4 · 16 GB)
# ============================================================
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from dataclasses import dataclass

@dataclass
class Config:
    BASE_MODEL: str = "meta-llama/Llama-3.1-8B-Instruct"
    DATASET_LOCAL_JSONL: str = "/content/datasets/arqsoft_chat.jsonl"
    DATASET_HF_ID: str | None = None
    OUTPUT_DIR: str = "/content/outputs/llama3_arqsoft_peft"
    ADAPTER_NAME: str = "arqsoft-qlora"
    MAX_STEPS: int = 500
    NUM_EPOCHS: int = 1
    LEARNING_RATE: float = 2e-4
    PER_DEVICE_BATCH_SIZE: int = 1
    GRADIENT_ACCUMULATION: int = 16
    MAX_SEQ_LEN: int = 1024
    WARMUP_RATIO: float = 0.03
    LOGGING_STEPS: int = 10
    EVAL_STEPS: int = 100
    SAVE_STEPS: int = 200
    USE_BF16: bool = False
    USE_FP16: bool = True
    BNB_4BIT_COMPUTE_DTYPE: str = "float16"
    LOAD_IN_4BIT: bool = True
    BNB_4BIT_QUANT_TYPE: str = "nf4"
    GRADIENT_CHECKPOINTING: bool = True
    LORA_R: int = 16
    LORA_ALPHA: int = 32
    LORA_DROPOUT: float = 0.05
    TARGET_MODULES: tuple[str, ...] = ("q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj")
    TASK_TYPE: str = "CAUSAL_LM"
    MAX_NEW_TOKENS: int = 256
    TEMPERATURE: float = 0.2
    TOP_P: float = 0.95

CFG = Config()
CFG
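
The batch-related settings interact, so it is worth spelling out what they imply. A small arithmetic sketch, assuming a single GPU (`num_gpus = 1` is an assumption, not part of the config) and the default values shown above:

```python
# Values mirror the Config defaults above; num_gpus=1 is an assumption (single T4).
per_device_batch_size = 1
gradient_accumulation = 16
max_steps = 500
max_seq_len = 1024
num_gpus = 1

# Sequences that contribute to each optimizer update
effective_batch = per_device_batch_size * gradient_accumulation * num_gpus
# Upper bound on sequences processed over the whole run
sequences_seen = effective_batch * max_steps
# Upper bound on tokens per update when every sequence is padded/truncated to max_seq_len
tokens_per_update = effective_batch * max_seq_len

print(effective_batch, sequences_seen, tokens_per_update)
```

With these defaults each optimizer update averages gradients over 16 sequences, while only one sequence at a time has to fit in GPU memory.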


In [None]:

# ============================================================
# 3) (Optional) Mount Google Drive if your data lives there
# ============================================================
# from google.colab import drive
# drive.mount('/content/drive')


In [None]:

# ============================================================
# 4) Log in to Hugging Face (required to download Llama 3.x)
# ============================================================
# from huggingface_hub import login
# login(token="hf_...")


In [None]:

# ============================================================
# 5) Load the tokenizer and the 4-bit model (QLoRA)
#    [TRANSFORMER] The Llama 3.x Transformer is instantiated here
# ============================================================
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=CFG.LOAD_IN_4BIT,
    bnb_4bit_quant_type=CFG.BNB_4BIT_QUANT_TYPE,
    bnb_4bit_compute_dtype=getattr(torch, CFG.BNB_4BIT_COMPUTE_DTYPE),
)

tokenizer = AutoTokenizer.from_pretrained(
    CFG.BASE_MODEL,
    use_fast=True,
    padding_side="right"
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    CFG.BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=False
)
model.config.use_cache = False
model.config.pad_token_id = tokenizer.pad_token_id


In [None]:

# ============================================================
# 6) PEFT setup (LoRA on the 4-bit base model, i.e. QLoRA)
#    [TRANSFORMER] LoRA adapters are injected into q/k/v/o and the MLP
# ============================================================
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=CFG.LORA_R,
    lora_alpha=CFG.LORA_ALPHA,
    lora_dropout=CFG.LORA_DROPOUT,
    target_modules=list(CFG.TARGET_MODULES),
    task_type=CFG.TASK_TYPE,
    bias="none"
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
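
As a sanity check on what `print_trainable_parameters()` reports, the LoRA parameter count can be estimated by hand. The sketch below assumes the layer shapes of the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these constants are assumptions hard-coded here, not values read from the loaded model.

```python
# Assumed shapes from the published Llama 3.1 8B config (not read from the model here)
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32
R = 16                 # CFG.LORA_R

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) next to the frozen weight
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,} | total trainable: {total:,}")
```

If the assumed shapes match the checkpoint in use, the total should agree with the trainable count printed by PEFT. A different base model or rank will give a different figure.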



> **PEFT alternatives**: IA3 or AdaLoRA (swap in a different `peft_config`).

In [None]:

# ============================================================
# 7) Load the dataset (chat format)
# ============================================================
from datasets import load_dataset, Dataset
import json, os

def load_chat_dataset(local_path: str | None, hf_id: str | None):
    if local_path and os.path.exists(local_path):
        with open(local_path, "r", encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]  # skip blank lines
        return Dataset.from_list(records)
    elif hf_id:
        return load_dataset(hf_id, split="train")
    else:
        mini = [
            {"messages": [
                {"role":"system","content":"You are an expert assistant in Software Architecture."},
                {"role":"user","content":"Compare API Gateway vs Service Mesh with pros/cons and when to use each."},
                {"role":"assistant","content":"An API Gateway handles north-south traffic, auth, and rate limiting; a Mesh covers east-west traffic with mTLS, retries, and observability. Use a Gateway at the edge and a Mesh between services when the service graph is complex."}
            ]},
            {"messages": [
                {"role":"system","content":"You are an expert assistant in Software Architecture."},
                {"role":"user","content":"Design an EDA pattern on Kafka for a loyalty system targeting 99.99% availability."},
                {"role":"assistant","content":"Partitions and RF≥3, acks=all, min.insync.replicas=2, DLQ, Schema Registry, idempotent producer, outbox, SLO/SLI, and alerts on latency/lag."}
            ]},
        ]
        return Dataset.from_list(mini)

raw_ds = load_chat_dataset(CFG.DATASET_LOCAL_JSONL, CFG.DATASET_HF_ID)
print("Example:", raw_ds[0])
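
The loader above expects one JSON object per line, each with a `messages` list of role/content pairs. The sketch below writes one such line to a temporary file and checks it. The helper `validate_chat_record` and the sample conversation are hypothetical, introduced only to illustrate the expected shape.

```python
import json
import os
import tempfile

VALID_ROLES = {"system", "user", "assistant"}

def validate_chat_record(record):
    """Return a list of problems found in one chat record (an empty list means OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

# Illustrative record, not taken from any real dataset
sample = {"messages": [
    {"role": "system", "content": "You are an expert assistant in Software Architecture."},
    {"role": "user", "content": "When should I use CQRS?"},
    {"role": "assistant", "content": "When read and write workloads differ enough to justify separate models."},
]}

# One JSON object per line, UTF-8 encoded
path = os.path.join(tempfile.mkdtemp(), "arqsoft_chat.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

with open(path, "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        if line.strip():
            print(line_no, validate_chat_record(json.loads(line)))
```

Running a check like this over the real file before training tends to surface malformed lines earlier than a failure inside `apply_chat_template` would.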


In [None]:

# ============================================================
# 8) Data transformation
#    [DATA TRANSFORM] chat_template → tokenization → labels (pad→-100)
# ============================================================
def format_and_tokenize(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=CFG.MAX_SEQ_LEN,
        padding="max_length",
        add_special_tokens=False,  # the Llama 3.x chat template already emits BOS
        return_tensors=None,
    )
    # Mask padding via the attention mask rather than by token id: when the pad
    # token reuses EOS (see cell 5), comparing ids would also mask genuine EOS tokens.
    tokenized["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
    ]
    return tokenized

processed_ds = raw_ds.map(format_and_tokenize, remove_columns=raw_ds.column_names)
split = processed_ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = split["train"], split["test"]
len(train_ds), len(val_ds)
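
The `-100` label convention can be illustrated with plain lists. The sketch below uses made-up token ids and contrasts two ways of deriving labels when the pad token reuses EOS, which is the situation cell 5 sets up for tokenizers that ship without a pad token.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Toy ids, purely illustrative: 9 plays the role of EOS and is also reused as PAD
EOS_ID = PAD_ID = 9
input_ids      = [1, 5, 6, 9, 7, 8, 9, 9, 9, 9]   # two turns ending in EOS, then 3 pads
attention_mask = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Masking by token id also hides the genuine EOS tokens at positions 3 and 6
labels_by_id = [t if t != PAD_ID else IGNORE_INDEX for t in input_ids]

# Masking by attention mask hides only the padded positions
labels_by_mask = [t if m == 1 else IGNORE_INDEX for t, m in zip(input_ids, attention_mask)]

print(labels_by_id)
print(labels_by_mask)
```

Only the second variant keeps the end-of-turn tokens in the loss, which matters if the fine-tuned model is expected to learn where to stop generating.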


In [None]:

# ============================================================
# 9) Training (TRL SFTTrainer)
# ============================================================
from trl import SFTTrainer, SFTConfig
from transformers import default_data_collator

sft_config = SFTConfig(
    output_dir=CFG.OUTPUT_DIR,
    max_seq_length=CFG.MAX_SEQ_LEN,
    per_device_train_batch_size=CFG.PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=CFG.GRADIENT_ACCUMULATION,
    learning_rate=CFG.LEARNING_RATE,
    logging_steps=CFG.LOGGING_STEPS,
    eval_strategy="steps",
    eval_steps=CFG.EVAL_STEPS,
    save_steps=CFG.SAVE_STEPS,
    bf16=CFG.USE_BF16,
    fp16=CFG.USE_FP16,
    warmup_ratio=CFG.WARMUP_RATIO,
    max_steps=CFG.MAX_STEPS if CFG.MAX_STEPS > 0 else -1,  # -1 disables the step cap
    num_train_epochs=CFG.NUM_EPOCHS,  # ignored when max_steps > 0
    gradient_checkpointing=CFG.GRADIENT_CHECKPOINTING,
    report_to=["none"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=default_data_collator,
)

# trainer.train()
# trainer.model.save_pretrained(f"{CFG.OUTPUT_DIR}/{CFG.ADAPTER_NAME}")
# tokenizer.save_pretrained(CFG.OUTPUT_DIR)
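
To see what `warmup_ratio` and `max_steps` mean for the learning rate, the schedule can be sketched in a few lines. This assumes the linear scheduler (the config above does not set `lr_scheduler_type`, so relying on the library default is an assumption) and uses the default values from the configuration cell. The function name `linear_schedule_lr` is hypothetical.

```python
import math

def linear_schedule_lr(step, base_lr=2e-4, total_steps=500, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Sample the schedule at the start, during warmup, at the peak, mid-run, and at the end
for step in (0, 5, 15, 250, 500):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With these numbers the warmup lasts 15 optimizer steps, after which the rate decays linearly from `2e-4` to zero at step 500.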


In [None]:

# ============================================================
# 10) Test inference
# ============================================================
import torch

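def chat(prompt: str, system_prompt: str = "You are an expert assistant in Software Architecture."):
    messages = [
        {"role":"system","content": system_prompt},
        {"role":"user","content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The chat template already emits BOS, so do not add special tokens again
    inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=CFG.MAX_NEW_TOKENS,
            temperature=CFG.TEMPERATURE,
            top_p=CFG.TOP_P,
            do_sample=True,
            use_cache=True,  # keep the KV cache on for generation (cell 5 sets config.use_cache=False)
            pad_token_id=tokenizer.pad_token_id,
        )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)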

# print(chat("Compare API Gateway vs Service Mesh."))
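
The effect of `temperature` and `top_p` on sampling can be shown on a toy distribution. The sketch below is a simplified illustration of the idea using only the standard library, not the library's implementation. The function name `nucleus` and the logits are made up.

```python
import math

def nucleus(logits, temperature=0.2, top_p=0.95):
    """Return the token indices kept by top-p filtering after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Visit tokens from most to least likely; stop once the cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for idx in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return kept

toy_logits = [2.0, 1.0, 0.0, -1.0]   # made-up scores for a 4-token vocabulary
print(nucleus(toy_logits, temperature=0.2))  # low temperature sharpens the distribution
print(nucleus(toy_logits, temperature=1.0))
```

At temperature 0.2 a single token already carries more than 95% of the probability mass, so sampling is close to greedy decoding; at temperature 1.0 three of the four tokens remain candidates.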


In [None]:

# ============================================================
# 11) (Optional) Merge the adapter and export
# ============================================================
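# import torch
# from peft import AutoPeftModelForCausalLM
# # Load the base model in fp16 (not 4-bit, not the fp32 default) so the LoRA weights can be merged into it
# merged = AutoPeftModelForCausalLM.from_pretrained(f"{CFG.OUTPUT_DIR}/{CFG.ADAPTER_NAME}", torch_dtype=torch.float16, device_map="auto")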
# merged = merged.merge_and_unload()
# merged.save_pretrained(f"{CFG.OUTPUT_DIR}/merged", safe_serialization=True)
# tokenizer.save_pretrained(f"{CFG.OUTPUT_DIR}/merged")



## Troubleshooting (T4 + CUDA 12.6)
- **bnb without GPU / no CUDA lib** → Re-run cell **0**, then restart the runtime. In cell **1**, check that `libbitsandbytes_cuda126.so` is listed.
- **`bfloat16` not supported** → Already handled in the config (`USE_BF16=False`, `USE_FP16=True`).
- **OOM** → Lower `MAX_SEQ_LEN` (1024→768/512), keep `PER_DEVICE_BATCH_SIZE=1`, keep `GRADIENT_ACCUMULATION` high, and leave `gradient_checkpointing=True`.
- **labels/pad** → The tokenization function sets labels at padding positions to `-100`.
- **flash-attn/xformers** → Optional; PyTorch's SDPA is sufficient on the T4.
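
The bf16/fp16 choice in the list above follows from the GPU's CUDA compute capability: the T4 reports 7.5, while native bfloat16 arrives with Ampere (8.x). A minimal sketch of that decision, where the helper name `pick_precision` is hypothetical and the returned keys simply mirror the `Config` field names:

```python
def pick_precision(compute_capability):
    """Map a CUDA compute capability (major, minor) to mixed-precision flags."""
    major, _minor = compute_capability
    bf16_ok = major >= 8          # Ampere (8.x) and newer have native bfloat16 support
    return {"USE_BF16": bf16_ok, "USE_FP16": not bf16_ok}

print("T4   (7, 5):", pick_precision((7, 5)))
print("A100 (8, 0):", pick_precision((8, 0)))

# On a live runtime the tuple could come from torch
# (assumes torch is installed and a GPU is attached):
# import torch
# print(pick_precision(torch.cuda.get_device_capability(0)))
```

This keeps the notebook portable: on a T4 it reproduces the current settings, and on a newer GPU it would switch to bf16 without editing the config by hand.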

### Teaching summary
- **[TRANSFORMER]**: cell **5** instantiates `AutoModelForCausalLM` (Llama 3.x); cell **6** injects LoRA into `q/k/v/o` and the MLP.  
- **[DATA TRANSFORM]**: cell **8** applies `apply_chat_template` → `tokenizer` (truncation/padding) → `labels` (pad = -100).
