
# PEFT / QLoRA **(Colab · Python 3 · GPU T4)** — Llama 3.x Instruct · v2

Notebook actualizado para **Colab con CUDA 12.6**: incluye correcciones de instalación para **bitsandbytes** (rueda con soporte CUDA actual) y **Triton**, y mantiene ajustes de memoria/precisión para **T4 (16 GB)**.

**Objetivo**: adaptar un modelo **Llama 3.x Instruct** a un **ChatGPT especializado en Arquitectura de Software** mediante **PEFT (LoRA/IA3/AdaLoRA)** y **QLoRA**.

> Marcadores pedagógicos: **[TRANSFORMER]** indica dónde se usa la arquitectura Transformer. **[DATA TRANSFORM]** indica operaciones de transformación de datos.


In [1]:
# ============================================================
# 0) Instalación robusta para Colab (CUDA 12.6 / T4) — EJECUTA PRIMERO
# ============================================================
# Limpieza de paquetes opcionales que suelen causar conflictos y bnb previo
!pip uninstall -y flash-attn xformers bitsandbytes || true

# Pila base fijada (estable) para evitar regresiones en Colab
!pip install -U "transformers==4.45.2" "accelerate==0.34.2" #   "datasets==2.20.0" "peft==0.13.2" "trl==0.11.4" "sentencepiece==0.2.0"

# bitsandbytes con binarios recientes (incluye CUDA 12.x)
!pip install -U --pre bitsandbytes

# Triton requerido por kernels/integraciones (alineado con PyTorch 2.5.x en Colab)
!pip install "triton>=3.0.0"

# (Opcional) Si quieres volver a instalar xformers:
!pip install xformers
# (Opcional) flash-attn suele ser innecesario en T4, pero si insistes:
!pip install flash-attn --no-build-isolation

[0mFound existing installation: bitsandbytes 0.48.2
Uninstalling bitsandbytes-0.48.2:
  Successfully uninstalled bitsandbytes-0.48.2
Collecting bitsandbytes
  Using cached bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Using cached bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.48.2
Collecting xformers
  Using cached xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Using cached xformers-0.0.32.post2-cp39-abi3-manylinux_2_28_x86_64.whl (117.2 MB)
Installing collected packages: xformers
Successfully installed xformers-0.0.32.post2
Collecting flash-attn
  Downloading flash_attn-2.8.3.tar.gz (8.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/8.4 MB[0m [31m56.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: flash-attn
  Building whe

In [2]:

# ============================================================
# 1) Verificación de entorno + bitsandbytes (CUDA 12.6)
# ============================================================
import torch, platform, sys, os, glob
print("Python:", platform.python_version())
print("Torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import bitsandbytes as bnb
    print("bitsandbytes:", getattr(bnb, "__version__", "unknown"))
    bnblibs = glob.glob(os.path.join(os.path.dirname(bnb.__file__), "libbitsandbytes_cuda*.so"))
    print("BNB libs:", bnblibs)
    if not bnblibs:
        print("⚠️  No se encontraron binarios CUDA de bitsandbytes. Considera reiniciar runtime y re-ejecutar la celda 0.")
except Exception as e:
    print("bitsandbytes import error:", e)


Python: 3.12.12
Torch: 2.8.0+cu126 | CUDA: 12.6 | CUDA available: True
GPU: Tesla T4
bitsandbytes: 0.48.2
BNB libs: ['/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda130.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda123.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda126.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda121.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda125.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda128.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda122.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes

In [3]:
# ============================================================
# 2) Configuración global (optimizada para T4 · 16 GB)
# ============================================================
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from dataclasses import dataclass

@dataclass
class Config:
    # Ruta raíz para guardar/cargar archivos en Google Drive
    DRIVE_ROOT = "/content/drive/MyDrive"

    # Modelo base a usar de Hugging Face Hub
    # Valores posibles: Cualquier modelo compatible con AutoModelForCausalLM (ej: "meta-llama/Llama-3-8B-Instruct", "mistralai/Mistral-7B-Instruct-v0.2")
    BASE_MODEL: str = "meta-llama/Llama-3.1-8B-Instruct"

    # Ruta local al archivo JSONL con el dataset de chat
    DATASET_LOCAL_JSONL: str = f"{DRIVE_ROOT}/datasets/arqsoft_chat.jsonl"

    # ID del dataset en Hugging Face Hub (si se usa en lugar de local)
    # Ejemplo: de ajibawa-2023/Software-Architecture, https://huggingface.co/datasets/ajibawa-2023/Software-Architecture
    DATASET_HF_ID: str | None = None

    # Directorio de salida para guardar checkpoints (punto de control del modelo) y el adaptador entrenado
    OUTPUT_DIR: str = f"{DRIVE_ROOT}/outputs/llama3_arqsoft_peft"

    # Nombre del adaptador PEFT (usado para guardar/cargar)
    ADAPTER_NAME: str = "arqsoft-qlora"

    # Número máximo de pasos de entrenamiento (iteraciones del optimizador)
    # Valores posibles: Un entero positivo. -1 para entrenar por número de épocas.
    MAX_STEPS: int = 500

    # Número de épocas completas sobre el dataset de entrenamiento, el MAX_STEPS>1, esta variable no se toma en cuenta.
    NUM_EPOCHS: int = 1

    # Tasa de aprendizaje para el optimizador
    # Valores posibles: Un flotante positivo pequeño (ej: 1e-5 a 5e-4)
    LEARNING_RATE: float = 2e-4

    # Tamaño del batch por dispositivo (GPU)
    # Valores posibles: Un entero positivo (usualmente 1 para QLoRA en T4 para ahorrar memoria)
    PER_DEVICE_BATCH_SIZE: int = 1

    # Número de pasos de acumulación de gradiente
    # Valores posibles: Un entero positivo. Batch efectivo = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION
    GRADIENT_ACCUMULATION: int = 16

    # Longitud máxima de secuencia para tokenizar (padding/truncation)
    # Valores posibles: Un entero positivo. Depende del modelo y dataset (ej: 512, 1024, 2048)
    MAX_SEQ_LEN: int = 1024

    # Proporción de pasos usados para calentamiento (warmup) del learning rate
    WARMUP_RATIO: float = 0.03

    # Frecuencia para registrar métricas de entrenamiento (en pasos)
    LOGGING_STEPS: int = 10

    # Frecuencia para evaluar el modelo en el dataset de validación (en pasos)
    # Valores posibles: Un entero positivo. Ignorado si eval_strategy="no".
    EVAL_STEPS: int = 100

    # Frecuencia para guardar checkpoints del modelo (en pasos)
    SAVE_STEPS: int = 200

    # Usar precisión bfloat16 (requiere GPU Ampere+ y soporte)
    # Valores posibles: bool (True/False). False para T4.
    USE_BF16: bool = False

    # Usar precisión float16 (más amplio soporte en GPUs como T4)
    # Valores posibles: bool (True/False). True para T4.
    USE_FP16: bool = True

    # Tipo de dato para la computación en 4 bits (bitsandbytes)
    # Valores posibles: "float16", "bfloat16" (si la GPU lo soporta)
    BNB_4BIT_COMPUTE_DTYPE: str = "float16"

    # Cargar el modelo en 4 bits usando bitsandbytes
    LOAD_IN_4BIT: bool = True

    # Tipo de cuantización de 4 bits (bitsandbytes)
    # Valores posibles: "nf4", "fp4"
    BNB_4BIT_QUANT_TYPE: str = "nf4"

    # Habilitar gradient checkpointing para ahorrar memoria
    # Valores posibles: bool (True/False). Recomendado en T4.
    GRADIENT_CHECKPOINTING: bool = True

    # Parámetro 'r' para LoRA: dimensión de los adaptadores
    # Valores posibles: Un entero positivo (ej: 8, 16, 32, 64)
    LORA_R: int = 16

    # Parámetro 'lora_alpha' para LoRA: factor de escalado
    # Valores posibles: Un entero positivo (ej: 16, 32, 64). Usualmente >= LORA_R.
    LORA_ALPHA: int = 32

    # Parámetro 'lora_dropout' para LoRA: dropout en los adaptadores
    # Valores posibles: Un flotante entre 0.0 y 1.0
    LORA_DROPOUT: float = 0.05

    # Módulos del modelo base a los que se aplicará LoRA
    TARGET_MODULES: tuple[str, ...] = ("q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj")

    # Tipo de tarea para PEFT (Causal Language Modeling para generación de texto)
    # Valores posibles: "CAUSAL_LM", "SEQ_CLS", etc.
    TASK_TYPE: str = "CAUSAL_LM"

    # Número máximo de tokens a generar durante la inferencia de prueba
    MAX_NEW_TOKENS: int = 256

    # Temperatura para el muestreo durante la generación (controla la aleatoriedad)
    TEMPERATURE: float = 0.2

    # Parámetro Top-P para el muestreo durante la generación (controla la diversidad)
    TOP_P: float = 0.95

CFG = Config()
CFG

Config(BASE_MODEL='meta-llama/Llama-3.1-8B-Instruct', DATASET_LOCAL_JSONL='/content/drive/MyDrive/datasets/arqsoft_chat.jsonl', DATASET_HF_ID=None, OUTPUT_DIR='/content/drive/MyDrive/outputs/llama3_arqsoft_peft', ADAPTER_NAME='arqsoft-qlora', MAX_STEPS=500, NUM_EPOCHS=1, LEARNING_RATE=0.0002, PER_DEVICE_BATCH_SIZE=1, GRADIENT_ACCUMULATION=16, MAX_SEQ_LEN=1024, WARMUP_RATIO=0.03, LOGGING_STEPS=10, EVAL_STEPS=100, SAVE_STEPS=200, USE_BF16=False, USE_FP16=True, BNB_4BIT_COMPUTE_DTYPE='float16', LOAD_IN_4BIT=True, BNB_4BIT_QUANT_TYPE='nf4', GRADIENT_CHECKPOINTING=True, LORA_R=16, LORA_ALPHA=32, LORA_DROPOUT=0.05, TARGET_MODULES=('q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'), TASK_TYPE='CAUSAL_LM', MAX_NEW_TOKENS=256, TEMPERATURE=0.2, TOP_P=0.95)

In [4]:

# ============================================================
# 3) (Opcional) Monta Google Drive si tus datos están allí
# ============================================================
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

# Opcional: define carpetas “canónicas”
DRIVE_ROOT = "/content/drive/MyDrive"
HF_ROOT = f"{DRIVE_ROOT}/hf_cache"
DATA_ROOT = f"{DRIVE_ROOT}/datasets"
OUT_ROOT = f"{DRIVE_ROOT}/llama3_arqsoft_peft"

import os
os.makedirs(HF_ROOT, exist_ok=True)
os.makedirs(DATA_ROOT, exist_ok=True)
os.makedirs(OUT_ROOT, exist_ok=True)

# Redirige caches de Hugging Face (modelos/datasets) a Drive
os.environ["HF_HOME"] = HF_ROOT        # raíz HF (recomendado)
os.environ["HF_HUB_CACHE"] = f"{HF_ROOT}/hub"   # opcional fino

# Ajusta tu Config del notebook:
CFG.DATASET_LOCAL_JSONL = f"{DATA_ROOT}/arqsoft_chat.jsonl"
CFG.OUTPUT_DIR = OUT_ROOT



Mounted at /content/drive


In [5]:
import os
print(f"Ruta del dataset configurada: {CFG.DATASET_LOCAL_JSONL}")
if os.path.exists(CFG.DATASET_LOCAL_JSONL):
    print(f"Archivo encontrado. Tamaño: {os.path.getsize(CFG.DATASET_LOCAL_JSONL)} bytes")
else:
    print("Archivo NO encontrado en la ruta configurada.")

Ruta del dataset configurada: /content/drive/MyDrive/datasets/arqsoft_chat.jsonl
Archivo encontrado. Tamaño: 2554275 bytes


In [6]:

# ============================================================
# 4) Login a Hugging Face (necesario para descargar Llama 3.x)
# ============================================================
# 4.1) Genera tu User Access Token en https://huggingface.co/settings/tokens (scope: "read" para descargar; "write" si vas a subir)
# 4.2) En Colab: usa input seguro
from getpass import getpass
from huggingface_hub import login

token = getpass("HF token (no se mostrará): ")
login(token=token)  # almacena el token en ~/.cache/huggingface

from huggingface_hub import whoami
print(whoami())



HF token (no se mostrará): ··········
{'type': 'user', 'id': '6850d845b8db7f9a70d8bc79', 'name': 'jrosado1974', 'fullname': 'Javier Rosado', 'email': 'javier.rosado@gmail.com', 'emailVerified': True, 'canPay': False, 'periodEnd': None, 'isPro': False, 'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/SX44udlzqpt_Sw7i_nfo7.png', 'orgs': [{'type': 'org', 'id': '66a4f3f3f496b42dc0dd174c', 'name': 'LatinAI', 'fullname': 'AI Developers from Latin America', 'email': None, 'canPay': False, 'periodEnd': None, 'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/65665c2af450504854d60806/l6qHJbnizngi_fnojAI2t.png', 'roleInOrg': 'contributor', 'isEnterprise': False}], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'maestria-uni', 'role': 'write', 'createdAt': '2025-11-06T01:36:23.776Z'}}}


In [8]:
# ============================================================
# 5) Carga Tokenizer y Modelo 4-bit (QLoRA)
#    [TRANSFORMER] Aquí se instancia el Transformer Llama 3.x
# ============================================================
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
import time

print("--- Iniciando carga de Tokenizer y Modelo 4-bit (QLoRA) ---")
start_time = time.time()

# Paso 1: Configurar BitsAndBytesConfig
print("Paso 1: Configurando BitsAndBytesConfig...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=CFG.LOAD_IN_4BIT,
    bnb_4bit_quant_type=CFG.BNB_4BIT_QUANT_TYPE,
    bnb_4bit_compute_dtype=getattr(torch, CFG.BNB_4BIT_COMPUTE_DTYPE),
)
print("BitsAndBytesConfig configurado.")

# Paso 2: Cargar el Tokenizer
print(f"Paso 2: Cargando Tokenizer desde {CFG.BASE_MODEL}...")
tokenizer_start_time = time.time()
tokenizer = AutoTokenizer.from_pretrained(
    CFG.BASE_MODEL,
    use_fast=True,
    padding_side="right"
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer_end_time = time.time()
print(f"Tokenizer cargado en {tokenizer_end_time - tokenizer_start_time:.2f} segundos.")
print(f"Pad token configurado: {tokenizer.pad_token}, Pad token ID: {tokenizer.pad_token_id}")


# Paso 3: Cargar el Modelo en 4-bit
print(f"Paso 3: Cargando Modelo 4-bit desde {CFG.BASE_MODEL}...de Hugging Face")
model_start_time = time.time()
model = AutoModelForCausalLM.from_pretrained(
    CFG.BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=False
)
model_end_time = time.time()
print(f"Modelo 4-bit cargado en {model_end_time - model_start_time:.2f} segundos.")

model.config.use_cache = False
model.config.pad_token_id = tokenizer.pad_token_id

end_time = time.time()
print(f"--- Proceso de carga completado en {end_time - start_time:.2f} segundos ---")

--- Iniciando carga de Tokenizer y Modelo 4-bit (QLoRA) ---
Paso 1: Configurando BitsAndBytesConfig...
BitsAndBytesConfig configurado.
Paso 2: Cargando Tokenizer desde meta-llama/Llama-3.1-8B-Instruct...
Tokenizer cargado en 0.87 segundos.
Pad token configurado: <|eot_id|>, Pad token ID: 128009
Paso 3: Cargando Modelo 4-bit desde meta-llama/Llama-3.1-8B-Instruct...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Modelo 4-bit cargado en 25.17 segundos.
--- Proceso de carga completado en 26.04 segundos ---


In [None]:

# ============================================================
# 6) Preparación PEFT (LoRA sobre QLoRA)
#    [TRANSFORMER] Inyectamos adaptadores LoRA en q/k/v/o y MLP
# ============================================================
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=CFG.LORA_R,
    lora_alpha=CFG.LORA_ALPHA,
    lora_dropout=CFG.LORA_DROPOUT,
    target_modules=list(CFG.TARGET_MODULES),
    task_type=CFG.TASK_TYPE,
    bias="none"
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()



> **Alternativas PEFT**: IA3/AdaLoRA (cambia `peft_config`).

In [None]:

# ============================================================
# 7) Carga de Dataset (formato chat)
# ============================================================
from datasets import load_dataset, Dataset
import json, os

def load_chat_dataset(local_path: str | None, hf_id: str | None):
    if local_path and os.path.exists(local_path):
        with open(local_path, "r", encoding="utf-8") as f:
            records = [json.loads(line) for line in f]
        return Dataset.from_list(records)
    elif hf_id:
        return load_dataset(hf_id, split="train")
    else:
        mini = [
            {"messages": [
                {"role":"system","content":"Eres un asistente experto en Arquitectura de Software."},
                {"role":"user","content":"Compara API Gateway vs Service Mesh con pros/cons y cuándo usar cada uno."},
                {"role":"assistant","content":"API Gateway gestiona tráfico norte-sur, auth, rate-limit; Mesh cubre este-oeste con mTLS, retries, observabilidad. Usa Gateway en el borde y Mesh intra-servicios cuando la malla sea compleja."}
            ]},
            {"messages": [
                {"role":"system","content":"Eres un asistente experto en Arquitectura de Software."},
                {"role":"user","content":"Diseña un patrón EDA en Kafka para fidelización al 99.99%."},
                {"role":"assistant","content":"Particiones y RF≥3, acks=all, min.insync.replicas=2, DLQ, Schema Registry, idempotent producer, outbox, SLO/SLI y alertas por latencia/lag."}
            ]},
        ]
        return Dataset.from_list(mini)

raw_ds = load_chat_dataset(CFG.DATASET_LOCAL_JSONL, CFG.DATASET_HF_ID)
print("Ejemplo:", raw_ds[0])


In [None]:

# ============================================================
# 8) Transformación de datos
#    [DATA TRANSFORM] chat_template → tokenización → labels (pad→-100)
# ============================================================
def format_and_tokenize(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=CFG.MAX_SEQ_LEN,
        padding="max_length",
        return_tensors=None,
    )
    pad_id = tokenizer.pad_token_id
    input_ids = tokenized["input_ids"]
    if input_ids and isinstance(input_ids[0], list):
        labels = [
            [tok if tok != pad_id else -100 for tok in seq]
            for seq in input_ids
        ]
    else:
        labels = [tok if tok != pad_id else -100 for tok in input_ids]
    tokenized["labels"] = labels
    return tokenized

processed_ds = raw_ds.map(format_and_tokenize, remove_columns=raw_ds.column_names)
split = processed_ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = split["train"], split["test"]
len(train_ds), len(val_ds)


In [None]:
# ============================================================
# 9) Entrenamiento (TRL SFTTrainer)
# ============================================================
try:
    from trl import SFTTrainer, SFTConfig
except ModuleNotFoundError:
    import subprocess
    import sys
    print("Instalando TRL (trl==0.11.4)...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "trl==0.11.4"])
    from trl import SFTTrainer, SFTConfig

from transformers import default_data_collator

# Determine eval_strategy based on max_steps
eval_strategy = "steps" if CFG.MAX_STEPS <= 0 else "no" # Disable step evaluation if max_steps is used

sft_config = SFTConfig(
    output_dir=CFG.OUTPUT_DIR,
    max_seq_length=CFG.MAX_SEQ_LEN,
    per_device_train_batch_size=CFG.PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=CFG.GRADIENT_ACCUMULATION,
    learning_rate=CFG.LEARNING_RATE,
    logging_steps=CFG.LOGGING_STEPS,
    eval_strategy=eval_strategy, # Use the determined strategy
    eval_steps=CFG.EVAL_STEPS,
    save_steps=CFG.SAVE_STEPS,
    bf16=CFG.USE_BF16,
    fp16=CFG.USE_FP16,
    warmup_ratio=CFG.WARMUP_RATIO,
    max_steps=CFG.MAX_STEPS if CFG.MAX_STEPS > 0 else -1, # Use -1 for no max steps
    num_train_epochs=CFG.NUM_EPOCHS if CFG.MAX_STEPS <= 0 else 1000, # Set epochs to a large value if max_steps is used
    gradient_checkpointing=CFG.GRADIENT_CHECKPOINTING,
    report_to=["none"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=default_data_collator,
)

#trainer.train()
#trainer.model.save_pretrained(f"{CFG.OUTPUT_DIR}/{CFG.ADAPTER_NAME}")
#tokenizer.save_pretrained(CFG.OUTPUT_DIR)

In [None]:
# ============================================================
# 10) Inferencia de prueba
# ============================================================
import torch
from peft import PeftModel

def chat(prompt: str, sys: str = "Eres un asistente experto en Arquitectura de Software."):
    messages = [
        {"role":"system","content": sys},
        {"role":"user","content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # --- Agregar logs aquí ---
    print("\n--- Detalles de Inferencia ---")
    print(f"Modelo base: {CFG.BASE_MODEL}")
    print(f"Usando PEFT/LoRA: {isinstance(model, PeftModel)}")
    print(f"Dispositivo del modelo: {model.device}")
    print(f"Texto de entrada tokenizado: {text[:500]}...") # Imprime los primeros 500 caracteres
    print(f"Longitud del texto de entrada: {len(text)}")
    # --------------------------

    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # --- Más logs sobre los inputs ---
    print(f"Inputs tensor shape: {inputs['input_ids'].shape}")
    print(f"Inputs tensor device: {inputs['input_ids'].device}")
    print(f"Attention mask shape: {inputs['attention_mask'].shape}")
    print(f"Attention mask device: {inputs['attention_mask'].device}")
    print(f"Max new tokens: {CFG.MAX_NEW_TOKENS}")
    print(f"Temperature: {CFG.TEMPERATURE}")
    print(f"Top P: {CFG.TOP_P}")
    print("-----------------------------")
    # -------------------------------

    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=CFG.MAX_NEW_TOKENS,
            temperature=CFG.TEMPERATURE,
            top_p=CFG.TOP_P,
            do_sample=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(chat("En terminos simples, explicame de que trata arquitectura dirigida por eventos EDA"))

In [None]:
# ============================================================
# 11) (Opcional) Merge del adaptador y exportación
# ============================================================
from peft import AutoPeftModelForCausalLM
import os
import torch # Import torch

# Define an offload directory (still needed if device_map is not used but model is large)
offload_directory = "/tmp/offload"
os.makedirs(offload_directory, exist_ok=True)

# Determine device
device = "cuda" if torch.cuda.is_available() else "cpu"

merged = AutoPeftModelForCausalLM.from_pretrained(
    f"{CFG.OUTPUT_DIR}/{CFG.ADAPTER_NAME}",
    # device_map="auto", # Remove auto device mapping
    # Use explicit device if needed, or rely on default
    offload_folder=offload_directory # Keep offload directory as a fallback/option
).to(device) # Explicitly move to device

merged = merged.merge_and_unload()
merged.save_pretrained(f"{CFG.OUTPUT_DIR}/merged", safe_serialization=True)
tokenizer.save_pretrained(f"{CFG.OUTPUT_DIR}/merged")


## Troubleshooting (T4 + CUDA 12.6)
- **bnb sin GPU / no lib cuda** → Repite la celda **0** y luego reinicia runtime. Verifica en **1)** que aparezca `libbitsandbytes_cuda126.so`.
- **`bfloat16` no soportado** → Ya configurado (`USE_BF16=False`, `USE_FP16=True`).
- **OOM** → Baja `MAX_SEQ_LEN` (1024→768/512), deja `BATCH=1`, mantén `GRADIENT_ACCUMULATION` alto, `gradient_checkpointing=True`.
- **labels/pad** → Función de tokenización convierte PAD→`-100`.
- **flash-attn/xformers** → Opcionales; SDPA de PyTorch es suficiente en T4.

### Resumen didáctico
- **[TRANSFORMER]**: celda **5** instancia `AutoModelForCausalLM` (Llama 3.x); **celda 6** inyecta LoRA en `q/k/v/o` y MLP.  
- **[DATA TRANSFORM]**: celda **8** aplica `apply_chat_template` → `tokenizer` (trunc/pad) → `labels` (pad = -100).
