# Product Owner LoRA Model - Evaluation (Baseline vs Student)

This notebook evaluates the Product Owner model fine-tuned with LoRA against the baseline model.

**Requirements**:
- Lightning AI Studio with a GPU (free T4, 10 h/month)
- The notebook clones the repository automatically

**Initial setup**:
1. Go to [lightning.ai](https://lightning.ai)
2. Create a free account
3. New Studio → Python
4. Upload this notebook
5. Start → Select GPU (T4)

**Steps**:
1. Verify the GPU
2. Install dependencies
3. Clone the repository containing the LoRA model
4. Run the baseline evaluation (Qwen2.5-7B without LoRA)
5. Run the student evaluation (Qwen2.5-7B + LoRA)
6. Compare results
7. Save results

## 1. Verify GPU

In [6]:
!nvidia-smi

Fri Nov 14 21:52:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:1E.0 Off |                    0 |
| N/A   19C    P8              8W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
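
The `nvidia-smi` output above reports 15360 MiB of memory on the T4. As a rough sanity check before loading a 7B model, the weight footprint at different precisions can be estimated as in the sketch below. It rests on assumptions: the parameter count (about 7.6 billion) is approximate, the 20% headroom is an arbitrary safety margin, and the estimate ignores quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_mib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights only, in MiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**2


GPU_TOTAL_MIB = 15360   # total memory reported by nvidia-smi above
APPROX_PARAMS = 7.6e9   # assumption: approximate size of a "7B" model
HEADROOM = 0.8          # assumption: keep ~20% free for activations and cache

for bits in (16, 4):
    needed = estimate_weight_mib(APPROX_PARAMS, bits)
    fits = needed < HEADROOM * GPU_TOTAL_MIB
    print(f"{bits:>2}-bit weights: ~{needed:,.0f} MiB -> fits with headroom: {fits}")
```

Under these assumptions only the 4-bit weights leave comfortable room on a T4, which is consistent with the `--load-4bit` flag used in the evaluation cells.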
                                                

## 2. Install Dependencies

In [7]:
%%bash
# Quote the version specifiers so bash does not treat '>' as output redirection
pip install -q "transformers>=4.36.0" "peft>=0.7.0" "bitsandbytes>=0.41.0" "accelerate>=0.25.0" torch typer pyyaml
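
After installation it can be useful to confirm that the installed versions meet the minimums listed in the install command. The following is a minimal sketch using only the standard library; `version_tuple`, `meets_minimum`, and `check_requirement` are hypothetical helpers, and the version parsing is deliberately simplified (for strict comparisons the `packaging` library is the more robust choice).

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.36.0' into (4, 36, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after zero-padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(package: str, minimum: str) -> str:
    """Return a short status string for one package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    if meets_minimum(installed, minimum):
        return f"ok ({installed})"
    return f"too old ({installed} < {minimum})"


# Minimum versions taken from the pip command above
MINIMUMS = {"transformers": "4.36.0", "peft": "0.7.0",
            "bitsandbytes": "0.41.0", "accelerate": "0.25.0"}
for name, minimum in MINIMUMS.items():
    print(f"{name:<15} {check_requirement(name, minimum)}")
```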

## 3. Clone Repository and Verify LoRA Model

In [8]:
import os
from pathlib import Path

# 1. Clone the repository that contains the model
print("📥 Cloning repository with the LoRA model...")

repo_url = "https://github.com/krukmat/agnostic-ai-pipeline.git"
repo_branch = "dspy-multi-role"
repo_path = "/teamspace/studios/this_studio/agnostic-ai-pipeline"

if not os.path.exists(repo_path):
    !git clone --depth 1 --branch {repo_branch} {repo_url} {repo_path}
    print(f"✅ Repository cloned (branch: {repo_branch})")
else:
    print(f"✅ Repository already exists at: {repo_path}")

# 2. Verify that the model and the validation set are in the repo
model_path = f"{repo_path}/artifacts/models/po_student_v1"
valset_path = f"{repo_path}/artifacts/synthetic/product_owner/product_owner_val.jsonl"

if not os.path.exists(model_path):
    print(f"\n❌ ERROR: Model not found at: {model_path}")
    raise FileNotFoundError("LoRA model not found in the repository")

if not os.path.exists(valset_path):
    print(f"\n❌ ERROR: Validation dataset not found at: {valset_path}")
    raise FileNotFoundError("Validation dataset not found")

print(f"✅ Model found at: {model_path}")
print(f"✅ Validation dataset found: {valset_path}")

# 3. Verify the model's critical files
print("\n📂 Model contents:")
!ls -lh {model_path}

required_files = ["adapter_config.json", "adapter_model.safetensors", "tokenizer_config.json"]
missing_files = []

for file in required_files:
    file_path = os.path.join(model_path, file)
    if not os.path.exists(file_path):
        missing_files.append(file)
    else:
        file_size = os.path.getsize(file_path) / 1024**2  # MB
        print(f"  ✓ {file} ({file_size:.1f} MB)")

if missing_files:
    print(f"\n⚠️  WARNING: Model files are missing: {missing_files}")
    raise FileNotFoundError(f"Missing critical files: {missing_files}")
else:
    print("\n✅ All model files are present")

# 4. Change to the repo directory
os.chdir(repo_path)
print(f"\n✅ Working directory: {os.getcwd()}")

📥 Cloning repository with the LoRA model...
✅ Repository already exists at: /teamspace/studios/this_studio/agnostic-ai-pipeline
✅ Model found at: /teamspace/studios/this_studio/agnostic-ai-pipeline/artifacts/models/po_student_v1
✅ Validation dataset found: /teamspace/studios/this_studio/agnostic-ai-pipeline/artifacts/synthetic/product_owner/product_owner_val.jsonl

📂 Model contents:
total 93M
-rw-r--r-- 1 krukmatias krukmatias 5.1K Nov 14 21:47 README.md
-rw-r--r-- 1 krukmatias krukmatias  887 Nov 14 21:47 adapter_config.json
-rw-r--r-- 1 krukmatias krukmatias  78M Nov 14 21:47 adapter_model.safetensors
-rw-r--r-- 1 krukmatias krukmatias  605 Nov 14 21:47 added_tokens.json
-rw-r--r-- 1 krukmatias krukmatias 2.5K Nov 14 21:47 chat_template.jinja
-rw-r--r-- 1 krukmatias krukmatias 1.6M Nov 14 21:47 merges.txt
-rw-r--r-- 1 krukmatias krukmatias  613 Nov 14 21:47 special_tokens_map.json
-rw-r--r-- 1 krukmatias krukmatias  11M Nov 14 21:47 tokenizer.json
-rw-r-
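
Checking that `adapter_config.json` exists does not confirm that the adapter was trained for the base model used in the evaluation. A minimal sketch of that extra check follows. `summarize_adapter` is a hypothetical helper, and it assumes the config contains the keys that PEFT normally writes for LoRA adapters (`base_model_name_or_path`, `peft_type`, `r`, `lora_alpha`); any key that is absent simply comes back as `None`.

```python
import json
from pathlib import Path


def summarize_adapter(model_dir: str, expected_base: str) -> dict:
    """Read adapter_config.json and report the fields relevant to this evaluation."""
    config_path = Path(model_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    base = config.get("base_model_name_or_path")
    return {
        "peft_type": config.get("peft_type"),
        "base_model": base,
        "rank": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        # True only if the adapter declares the base model we plan to load
        "base_matches": base == expected_base,
    }


# Example call with the path from the cell above:
# summary = summarize_adapter(model_path, "Qwen/Qwen2.5-7B-Instruct")
# print(summary)
```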

## 4. Baseline Evaluation (Qwen2.5-7B without LoRA)

This evaluation uses the base model without the LoRA adapter.

In [9]:
%%bash
cd /teamspace/studios/this_studio/agnostic-ai-pipeline

PYTHONPATH=. python scripts/eval_po_student.py \
  --tag baseline \
  --base-model Qwen/Qwen2.5-7B-Instruct \
  --max-samples 20 \
  --retries 2 \
  --max-new-tokens 1200 \
  --load-4bit \
  --bnb-compute-dtype float16

`torch_dtype` is deprecated! Use `dtype` instead!
Fetching 4 files: 100%|██████████| 4/4 [00:46<00:00, 11.67s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:35<00:00,  8.80s/it]
  timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")


[info] Evaluating 20 samples; saving to inference_results/baseline_20251114_215442.json
[1/20] POCON-0196 -> ok score=0.819
[2/20] POCON-0006 -> ok score=0.903
[3/20] POCON-0049 -> ok score=0.903
[4/20] POCON-0061 -> ok score=0.773
[5/20] POCON-0119 -> ok score=0.777
[6/20] POCON-0181 -> ok score=0.804
[7/20] POCON-0163 -> ok score=0.810
[8/20] POCON-0059 -> ok score=0.819
[9/20] POCON-0016 -> ok score=0.825
[10/20] POCON-0170 -> ok score=0.819
[11/20] POCON-0215 -> ok score=0.810
[12/20] POCON-0120 -> ok score=0.810
[13/20] POCON-0217 -> ok score=0.819
[14/20] POCON-0047 -> ok score=0.881
[15/20] POCON-0101 -> ok score=0.795
[16/20] POCON-0139 -> ok score=0.819
[17/20] POCON-0094 -> ok score=0.870
[18/20] POCON-0007 -> ok score=0.911
[19/20] POCON-0092 -> ok score=0.907
[20/20] POCON-0035 -> ok score=0.948
[done] Results saved to inference_results/baseline_20251114_215442.json
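
The progress lines above have a regular shape (`[i/N] <concept_id> -> <status> score=<value>`), so the scores can be recovered from a saved log even when the JSON file is not at hand. The parser below is a minimal sketch, not part of `eval_po_student.py`; `parse_eval_log` is a hypothetical helper, and it assumes every line of interest looks exactly like the ones shown here.

```python
import re
from statistics import mean

# One progress line, e.g. "[1/20] POCON-0196 -> ok score=0.819"
LINE_RE = re.compile(r"\[(\d+)/(\d+)\]\s+(\S+)\s+->\s+(\w+)\s+score=([0-9.]+)")


def parse_eval_log(text: str) -> dict:
    """Map concept_id -> score for every progress line with status 'ok'."""
    scores = {}
    for match in LINE_RE.finditer(text):
        _, _, concept_id, status, score = match.groups()
        if status == "ok":
            scores[concept_id] = float(score)
    return scores


# Three lines copied from the baseline log above
sample_log = """
[1/20] POCON-0196 -> ok score=0.819
[2/20] POCON-0006 -> ok score=0.903
[4/20] POCON-0061 -> ok score=0.773
"""
scores = parse_eval_log(sample_log)
print(f"{len(scores)} samples, mean={mean(scores.values()):.3f}")
```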


## 5. Student Evaluation (Qwen2.5-7B + LoRA)

This evaluation uses the base model with the trained LoRA adapter.

In [10]:
%%bash
cd /teamspace/studios/this_studio/agnostic-ai-pipeline

PYTHONPATH=. python scripts/eval_po_student.py \
  --tag student \
  --base-model Qwen/Qwen2.5-7B-Instruct \
  --adapter-path artifacts/models/po_student_v1 \
  --max-samples 20 \
  --retries 2 \
  --max-new-tokens 1200 \
  --load-4bit \
  --bnb-compute-dtype float16

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.10s/it]
  timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")


[info] Evaluating 20 samples; saving to inference_results/student_20251114_220448.json
[1/20] POCON-0196 -> ok score=0.958
[2/20] POCON-0006 -> ok score=0.742
[3/20] POCON-0049 -> ok score=0.803
[4/20] POCON-0061 -> ok score=0.375
[5/20] POCON-0119 -> ok score=0.858
[6/20] POCON-0181 -> ok score=0.918
[7/20] POCON-0163 -> ok score=0.798
[8/20] POCON-0059 -> ok score=0.773
[9/20] POCON-0016 -> ok score=0.841
[10/20] POCON-0170 -> ok score=0.842
[11/20] POCON-0215 -> ok score=0.375
[12/20] POCON-0120 -> ok score=0.866
[13/20] POCON-0217 -> ok score=0.375
[14/20] POCON-0047 -> ok score=0.945
[15/20] POCON-0101 -> ok score=0.918
[16/20] POCON-0139 -> ok score=0.724
[17/20] POCON-0094 -> ok score=0.870
[18/20] POCON-0007 -> ok score=0.749
[19/20] POCON-0092 -> ok score=0.757
[20/20] POCON-0035 -> ok score=0.953
[done] Results saved to inference_results/student_20251114_220448.json
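
Both runs score the same 20 concepts in the same order, so the two logs can be compared sample by sample rather than only through aggregate statistics. The sketch below does this with the scores copied from the two logs above; it is an illustrative analysis, not part of the evaluation script.

```python
# Concept ID suffixes and scores copied from the baseline and student logs above
ids = ["0196", "0006", "0049", "0061", "0119", "0181", "0163", "0059", "0016", "0170",
       "0215", "0120", "0217", "0047", "0101", "0139", "0094", "0007", "0092", "0035"]
baseline = [0.819, 0.903, 0.903, 0.773, 0.777, 0.804, 0.810, 0.819, 0.825, 0.819,
            0.810, 0.810, 0.819, 0.881, 0.795, 0.819, 0.870, 0.911, 0.907, 0.948]
student = [0.958, 0.742, 0.803, 0.375, 0.858, 0.918, 0.798, 0.773, 0.841, 0.842,
           0.375, 0.866, 0.375, 0.945, 0.918, 0.724, 0.870, 0.749, 0.757, 0.953]

# Per-sample difference (student minus baseline), rounded to the log precision
deltas = {f"POCON-{i}": round(s - b, 3) for i, b, s in zip(ids, baseline, student)}

improved = [k for k, d in deltas.items() if d > 0]
regressed = [k for k, d in deltas.items() if d < 0]
worst = sorted(deltas.items(), key=lambda kv: kv[1])[:3]

print(f"improved: {len(improved)}, regressed: {len(regressed)}")
for concept_id, delta in worst:
    print(f"  {concept_id}: {delta:+.3f}")
```

On these logs the three largest drops are exactly the samples on which the student scored 0.375 (POCON-0061, POCON-0215, POCON-0217), which makes them natural first candidates for manual inspection.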


## 6. Compare Results

In [11]:
import json
from pathlib import Path

# Locate the result files
results_dir = Path("/teamspace/studios/this_studio/agnostic-ai-pipeline/inference_results")
baseline_files = sorted(results_dir.glob("baseline_*.json"))
student_files = sorted(results_dir.glob("student_*.json"))

if not baseline_files:
    print("⚠️  No baseline results found")
else:
    print("\n📊 Result files found:")
    print(f"  Baseline: {len(baseline_files)} file(s)")
    print(f"  Student: {len(student_files)} file(s)")

# Load the most recent result of each kind (timestamped names sort chronologically)
if baseline_files and student_files:
    with open(baseline_files[-1], 'r') as f:
        baseline_data = json.load(f)

    with open(student_files[-1], 'r') as f:
        student_data = json.load(f)

    print(f"\n{'='*60}")
    print("RESULTS COMPARISON")
    print(f"{'='*60}\n")

    # Overall metrics
    print("📈 OVERALL METRICS\n")
    print(f"{'Metric':<30} {'Baseline':<15} {'Student':<15} {'Diff'}")
    print("-" * 70)

    baseline_metrics = baseline_data.get('metrics', {})
    student_metrics = student_data.get('metrics', {})

    if baseline_metrics and student_metrics:
        for metric in ['mean', 'std', 'min', 'max']:
            b_val = baseline_metrics.get(metric, 0)
            s_val = student_metrics.get(metric, 0)
            diff = s_val - b_val
            diff_pct = (diff / b_val * 100) if b_val != 0 else 0

            print(f"{metric.upper():<30} {b_val:<15.4f} {s_val:<15.4f} {diff:+.4f} ({diff_pct:+.1f}%)")

    # YAML success rate
    print("\n📋 YAML SUCCESS RATE\n")
    print(f"{'Model':<30} {'Total':<10} {'Valid':<10} {'Errors':<10} {'Success Rate'}")
    print("-" * 70)

    b_total = baseline_data.get('total_samples', 0)
    b_valid = baseline_data.get('valid_samples', 0)
    b_failed = baseline_data.get('failed_samples', 0)
    b_rate = (b_valid / b_total * 100) if b_total > 0 else 0

    s_total = student_data.get('total_samples', 0)
    s_valid = student_data.get('valid_samples', 0)
    s_failed = student_data.get('failed_samples', 0)
    s_rate = (s_valid / s_total * 100) if s_total > 0 else 0

    print(f"{'Baseline':<30} {b_total:<10} {b_valid:<10} {b_failed:<10} {b_rate:.1f}%")
    print(f"{'Student':<30} {s_total:<10} {s_valid:<10} {s_failed:<10} {s_rate:.1f}%")

    # Acceptance criteria
    print("\n✅ ACCEPTANCE CRITERIA (9.D.4)\n")
    print("-" * 70)

    yaml_valid_threshold = 0.90
    quality_threshold = 0.90

    # Compare the means explicitly: after the loop above, b_val and s_val hold the 'max' metric
    b_mean = baseline_metrics.get('mean', 0)
    s_mean = student_metrics.get('mean', 0)

    yaml_pass = (b_rate >= yaml_valid_threshold * 100) and (s_rate >= yaml_valid_threshold * 100)
    quality_pass = bool(baseline_metrics and student_metrics) and (s_mean >= quality_threshold * b_mean)

    print("1. Valid YAML ≥90%:")
    print(f"   Baseline: {b_rate:.1f}% {'✅ PASS' if b_rate >= yaml_valid_threshold * 100 else '❌ FAIL'}")
    print(f"   Student:  {s_rate:.1f}% {'✅ PASS' if s_rate >= yaml_valid_threshold * 100 else '❌ FAIL'}")

    if baseline_metrics and student_metrics:
        print("\n2. Student ≥ 0.9 × Baseline:")
        target = quality_threshold * b_mean
        actual = s_mean
        print(f"   Target:  {target:.4f}")
        print(f"   Actual:  {actual:.4f} {'✅ PASS' if actual >= target else '❌ FAIL'}")

    overall_pass = yaml_pass and quality_pass
    print(f"\n{'='*70}")
    print(f"OVERALL RESULT: {'✅ PASS - Ready for 9.D.5' if overall_pass else '❌ FAIL - Needs adjustments'}")
    print(f"{'='*70}")

    # Cases with errors
    if b_failed > 0 or s_failed > 0:
        print("\n⚠️  CASES WITH FORMAT ERRORS:\n")

        if b_failed > 0:
            print("Baseline:")
            for result in baseline_data.get('results', []):
                if result.get('status') == 'format_error':
                    print(f"  - {result.get('concept_id')} (tier: {result.get('tier')})")

        if s_failed > 0:
            print("\nStudent:")
            for result in student_data.get('results', []):
                if result.get('status') == 'format_error':
                    print(f"  - {result.get('concept_id')} (tier: {result.get('tier')})")

else:
    print("⚠️  Cannot compare results: a result file is missing")


📊 Result files found:
  Baseline: 2 file(s)
  Student: 1 file(s)

RESULTS COMPARISON

📈 OVERALL METRICS

Metric                         Baseline        Student         Diff
----------------------------------------------------------------------
MEAN                           0.8411          0.7720          -0.0690 (-8.2%)
STD                            0.0492          0.1808          +0.1316 (+267.1%)
MIN                            0.7734          0.3750          -0.3984 (-51.5%)
MAX                            0.9483          0.9575          +0.0093 (+1.0%)

📋 YAML SUCCESS RATE

Model                          Total      Valid      Errors     Success Rate
----------------------------------------------------------------------
Baseline                       20         20         0          100.0%
Student                        20         20         0          100.0%

✅ ACCEPTANCE CRITERIA (9.D.4)

---------------------------------------
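
The two acceptance criteria reduce to simple arithmetic on the values in the tables above: both YAML success rates must reach 90%, and the student mean must be at least 0.9 times the baseline mean. The sketch below works this through with the rounded table values; `acceptance` is a hypothetical helper written for illustration, and it takes the rates as fractions rather than percentages.

```python
def acceptance(baseline_mean, student_mean, baseline_yaml_rate, student_yaml_rate,
               yaml_threshold=0.90, quality_ratio=0.90):
    """Evaluate the two acceptance criteria (YAML validity and relative quality)."""
    yaml_pass = (baseline_yaml_rate >= yaml_threshold
                 and student_yaml_rate >= yaml_threshold)
    target = quality_ratio * baseline_mean
    quality_pass = student_mean >= target
    return {
        "yaml_pass": yaml_pass,
        "target": target,
        "quality_pass": quality_pass,
        "overall_pass": yaml_pass and quality_pass,
    }


# Rounded values from the tables above (rates as fractions: 20/20 = 1.0)
result = acceptance(baseline_mean=0.8411, student_mean=0.7720,
                    baseline_yaml_rate=1.0, student_yaml_rate=1.0)
print(f"target={result['target']:.4f}, overall_pass={result['overall_pass']}")
```

With these rounded inputs the target is 0.9 × 0.8411 ≈ 0.7570, so the student mean of 0.7720 clears it by roughly 0.015. The margin is small compared with the student's standard deviation of 0.1808, so a different 20-sample subset could plausibly change the outcome.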

## 7. Save Results

The results are saved in the teamspace and are available for download.

In [12]:
import os
import shutil
import zipfile
from pathlib import Path

# Compress the results
results_dir = "/teamspace/studios/this_studio/agnostic-ai-pipeline/inference_results"
output_dir = "/teamspace/studios/this_studio"
archive_path = f"{output_dir}/eval_results_20251115"

if not os.path.exists(results_dir):
    print("❌ Results directory not found")
else:
    # Create the ZIP
    shutil.make_archive(archive_path, 'zip', results_dir)
    zip_file = f"{archive_path}.zip"
    print(f"✅ Results compressed to: {zip_file}")

    # Copy the individual JSON files to the output directory
    print("\n📂 Copying JSON files to the teamspace...")

    json_files = list(Path(results_dir).glob("*.json"))
    for json_file in json_files:
        dest = Path(output_dir) / json_file.name
        shutil.copy2(json_file, dest)
        print(f"  ✓ {json_file.name} → {dest}")

    print(f"\n✅ {len(json_files)} files copied to the teamspace")
    print(f"✅ ZIP available at: {zip_file}")
    print("\n💡 Use the Lightning file browser (left sidebar) to download.")

    # Show the ZIP contents
    print("\n📦 ZIP contents:")
    with zipfile.ZipFile(zip_file, 'r') as zf:
        for name in sorted(zf.namelist()):
            info = zf.getinfo(name)
            size_mb = info.file_size / 1024**2
            print(f"   - {name} ({size_mb:.2f} MB)")

    # List the files in the teamspace
    print("\n📁 Files in the teamspace:")
    !ls -lh /teamspace/studios/this_studio/*.json /teamspace/studios/this_studio/*.zip 2>/dev/null || echo "No JSON/ZIP files found"

✅ Results compressed to: /teamspace/studios/this_studio/eval_results_20251115.zip

📂 Copying JSON files to the teamspace...
  ✓ comparison_20251114_143731.json → /teamspace/studios/this_studio/comparison_20251114_143731.json
  ✓ finetuned_20251114_143731.json → /teamspace/studios/this_studio/finetuned_20251114_143731.json
  ✓ baseline_20251114_143731.json → /teamspace/studios/this_studio/baseline_20251114_143731.json
  ✓ baseline_20251114_215442.json → /teamspace/studios/this_studio/baseline_20251114_215442.json
  ✓ student_20251114_220448.json → /teamspace/studios/this_studio/student_20251114_220448.json

✅ 5 files copied to the teamspace
✅ ZIP available at: /teamspace/studios/this_studio/eval_results_20251115.zip

💡 Use the Lightning file browser (left sidebar) to download.

📦 ZIP contents:
   - baseline_20251114_143731.json (0.01 MB)
   - baseline_20251114_215442.json (0.08 MB)
   - comparison_20251114_143731.json (0.0
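
The archive name in the cell above carries a fixed date, so a later run would overwrite the earlier ZIP. One possible alternative, sketched below, derives the name from the current UTC time in the same `%Y%m%d_%H%M%S` format that the result files use. `archive_results` is a hypothetical helper, and the demo runs on temporary directories so that it does not depend on the Studio paths.

```python
import shutil
import tempfile
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def archive_results(results_dir: str, output_dir: str, prefix: str = "eval_results") -> Path:
    """Zip results_dir into output_dir/<prefix>_<UTC timestamp>.zip and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    base = Path(output_dir) / f"{prefix}_{stamp}"
    # make_archive appends the ".zip" extension and returns the final file name
    return Path(shutil.make_archive(str(base), "zip", results_dir))


# Demo on throwaway directories so the sketch runs anywhere
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "baseline_demo.json").write_text("{}")
    zip_path = archive_results(src, dst)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    print(zip_path.name, names)
```

Using a timezone-aware `datetime.now(timezone.utc)` also avoids the deprecated `datetime.utcnow()` call.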

## 8. Final Instructions

**To download the results**:

1. In Lightning AI Studio, open the **file browser** (left sidebar)
2. Navigate to `/teamspace/studios/this_studio/`
3. You will find:
   - `eval_results_20251115.zip` (all results, compressed)
   - `baseline_<timestamp>.json` (individual baseline results)
   - `student_<timestamp>.json` (individual student results)
4. Right-click each file → **Download**

**Next step (Task 9.D.4)**:

1. Upload the JSON files to `inference_results/` in the repository
2. Update `docs/po_distillation_report.md` with the results
3. If PASS → proceed to Task 9.D.5 (pipeline integration)
4. If FAIL → analyze the `format_error` cases and adjust

**Note on GPU quota**:
- Lightning AI Studio offers 10 free GPU hours per month
- This evaluation takes ~30-40 minutes (baseline + student)
- Remember to stop the Studio when you finish to conserve quota
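
For planning purposes, the quota figures above translate into a rough number of full evaluations per month. The arithmetic below is a simple estimate that ignores Studio startup and idle time, both of which also consume quota.

```python
MONTHLY_QUOTA_MIN = 10 * 60   # 10 free GPU hours per month, in minutes
RUN_MINUTES = (30, 40)        # estimated duration of one baseline + student run

runs_best = MONTHLY_QUOTA_MIN // RUN_MINUTES[0]    # if every run takes 30 minutes
runs_worst = MONTHLY_QUOTA_MIN // RUN_MINUTES[1]   # if every run takes 40 minutes
print(f"Full evaluations per month: {runs_worst} to {runs_best}")
```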