# Kaggle Agents + MLE-bench Lite (Colab)

This notebook runs `kaggle-agents` on **MLE-bench Lite** competitions and validates the submissions with `mlebench grade-sample`.

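As a rough illustration of the validation step, the sketch below only builds the `mlebench grade-sample` command line as a string; it does not run it. The argument order (submission path first, then competition ID) is an assumption based on the MLE-bench README, and `grade_sample_command` is a hypothetical helper, not part of either repository.

```python
import shlex


def grade_sample_command(submission_path, competition_id):
    """Build (but do not run) a `mlebench grade-sample` command line.

    Assumption: the CLI takes the submission path followed by the
    competition ID as positional arguments.
    """
    return shlex.join(
        ["mlebench", "grade-sample", str(submission_path), competition_id]
    )


# Illustrative path; replace with the submission produced by your run.
print(grade_sample_command("/tmp/submission.csv", "aerial-cactus-identification"))
```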
Goals:
- Run a single competition (or the full Lite split) on MLE-bench
- Generate `results.json`, `summary.json`, `results.csv`
- Compare against the benchmark via `any_medal_percentage` (Low/Lite split)

Prerequisites (Colab Secrets):
- `OPENAI_API_KEY`
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`


## 1) Setup (clone + install)

This notebook assumes the following paths:
- `/content/kaggle-agents`
- `/content/mle-bench`

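Since every later cell depends on these two directories, a quick check after cloning can catch a failed `git clone` early. This is a minimal sketch; `missing_repos` and `EXPECTED_REPOS` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Directory names the notebook expects under /content.
EXPECTED_REPOS = ("kaggle-agents", "mle-bench")


def missing_repos(base="/content"):
    """Return the expected repo directories that do not exist under `base`."""
    base = Path(base)
    return [str(base / name) for name in EXPECTED_REPOS
            if not (base / name).is_dir()]


missing = missing_repos()
print("missing repos:", missing if missing else "none")
```

Pass a different `base` when running outside Colab.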

In [None]:
# GPU (optional)
!nvidia-smi || echo "No GPU available (CPU mode)"

# Clone repos (requires network access)
!test -d /content/kaggle-agents || git clone https://github.com/gustavogomespl/kaggle-agents.git /content/kaggle-agents
!test -d /content/mle-bench || git clone https://github.com/openai/mle-bench.git /content/mle-bench

%cd /content/kaggle-agents
!ls -la | head

In [None]:
# Install dependencies
!pip -q install uv
!uv pip install --system -e /content/kaggle-agents
!pip -q install -e /content/mle-bench

import sys
print('python:', sys.version)

## 2) Configuration (.env + Kaggle credentials)

Configure in Colab → **Secrets**:
- `OPENAI_API_KEY`
- `KAGGLE_USERNAME`
- `KAGGLE_KEY`

Optional (to control cost/quality):
- `LLM_MODEL`, `PLANNER_MODEL`, `DEVELOPER_MODEL`, `EVALUATOR_MODEL`

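The per-role variables fall back to `LLM_MODEL` when they are not set. A minimal sketch of that fallback, operating on a plain mapping instead of `os.environ` so it can be tried in isolation (`resolve_models` is a hypothetical helper; the default model name mirrors the one used in the configuration cell):

```python
ROLE_VARS = ("PLANNER_MODEL", "DEVELOPER_MODEL", "EVALUATOR_MODEL")


def resolve_models(env, default="gpt-5-mini"):
    """Resolve the base model and per-role overrides from a mapping.

    A role variable that is absent from `env` inherits LLM_MODEL;
    LLM_MODEL itself falls back to `default`.
    """
    base = env.get("LLM_MODEL", default)
    resolved = {"LLM_MODEL": base}
    for var in ROLE_VARS:
        resolved[var] = env.get(var, base)
    return resolved


# Only the developer role is overridden; the other roles inherit LLM_MODEL.
print(resolve_models({"LLM_MODEL": "model-a", "DEVELOPER_MODEL": "model-b"}))
```

Overriding a single role this way lets you spend more on one agent while keeping the others on a cheaper model.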

In [None]:
import os, json
from pathlib import Path

from google.colab import userdata

# ===== LLM config (adjust here) =====
os.environ['LLM_PROVIDER'] = os.environ.get('LLM_PROVIDER', 'openai')
os.environ['LLM_MODEL'] = os.environ.get('LLM_MODEL', 'gpt-5-mini')
os.environ['LLM_TEMPERATURE'] = os.environ.get('LLM_TEMPERATURE', '0.7')
os.environ['LLM_MAX_TOKENS'] = os.environ.get('LLM_MAX_TOKENS', '16000')

# Per-role overrides (optional)
os.environ['PLANNER_MODEL'] = os.environ.get('PLANNER_MODEL', os.environ['LLM_MODEL'])
os.environ['DEVELOPER_MODEL'] = os.environ.get('DEVELOPER_MODEL', os.environ['LLM_MODEL'])
os.environ['EVALUATOR_MODEL'] = os.environ.get('EVALUATOR_MODEL', os.environ['LLM_MODEL'])

# ===== Secrets =====
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY') or ''
kaggle_username = userdata.get('KAGGLE_USERNAME') or ''
kaggle_key = userdata.get('KAGGLE_KEY') or ''

if not os.environ['OPENAI_API_KEY']:
    raise ValueError('Missing OPENAI_API_KEY (configure it in Colab Secrets).')
if not kaggle_username or not kaggle_key:
    raise ValueError('Missing KAGGLE_USERNAME/KAGGLE_KEY (configure them in Colab Secrets).')

# Kaggle credentials for the mlebench CLI
kaggle_path = Path.home() / '.kaggle' / 'kaggle.json'
kaggle_path.parent.mkdir(parents=True, exist_ok=True)
kaggle_path.write_text(json.dumps({'username': kaggle_username, 'key': kaggle_key}))
kaggle_path.chmod(0o600)

# .env (useful for subprocesses and re-runs)
env_path = Path('/content/kaggle-agents/.env')
env_path.write_text(
    "\n".join([
        f"LLM_PROVIDER={os.environ['LLM_PROVIDER']}",
        f"LLM_MODEL={os.environ['LLM_MODEL']}",
        f"LLM_TEMPERATURE={os.environ['LLM_TEMPERATURE']}",
        f"LLM_MAX_TOKENS={os.environ['LLM_MAX_TOKENS']}",
        f"PLANNER_MODEL={os.environ['PLANNER_MODEL']}",
        f"DEVELOPER_MODEL={os.environ['DEVELOPER_MODEL']}",
        f"EVALUATOR_MODEL={os.environ['EVALUATOR_MODEL']}",
        f"OPENAI_API_KEY={os.environ['OPENAI_API_KEY']}",
        f"KAGGLE_USERNAME={kaggle_username}",
        f"KAGGLE_KEY={kaggle_key}",
        "KAGGLE_AUTO_SUBMIT=false",
        "LOG_LEVEL=INFO",
        "LOG_DIR=./logs",
    ]) + "\n"
)

print('✅ kaggle.json:', kaggle_path)
print('✅ .env:', env_path)
print('✅ LLM_MODEL:', os.environ['LLM_MODEL'])

## 3) Prepare datasets (MLE-bench)

Tip:
- First run a single competition to validate the pipeline.
- Then run `--lite` if you want the full Lite split (22 competitions).

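The two preparation modes from the tip above can be captured in one small helper. This sketch only builds the command string, using the `-c` and `--lite` flags that appear in this notebook; `prepare_command` is a hypothetical name.

```python
import shlex


def prepare_command(competition=None, lite=False):
    """Build a `mlebench prepare` command for one competition or for Lite."""
    # Exactly one of the two modes must be selected.
    if lite == (competition is not None):
        raise ValueError("pass exactly one of `competition` or `lite=True`")
    args = ["mlebench", "prepare"]
    args += ["--lite"] if lite else ["-c", competition]
    return shlex.join(args)


print(prepare_command("aerial-cactus-identification"))  # smoke test
print(prepare_command(lite=True))                       # full Lite split
```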

In [None]:
# Quick test (1 competition)
!mlebench prepare -c aerial-cactus-identification

# Full Lite split (22 competitions) - may take a while
# !mlebench prepare --lite

## 4) Run the evaluation (generates `results.json`, `summary.json`, `results.csv`)

Tip to save time/cost:
- Start with `--max-iterations 1` and a short timeout such as `--timeout 1200` (20 min per component); the cell below uses `--timeout 1800` (30 min).
- Then increase gradually.

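One way to make "increase gradually" concrete is to plan a few escalating budgets up front. The sketch below doubles both settings at each step, which is purely an illustrative policy and not something the notebook or `mlebench_eval.py` prescribes; `budget_schedule` and `eval_command` are hypothetical helpers.

```python
def budget_schedule(iterations=1, timeout=1200, steps=3, factor=2):
    """Return (max_iterations, timeout_seconds) pairs, growing each step."""
    schedule = []
    for _ in range(steps):
        schedule.append((iterations, timeout))
        iterations *= factor
        timeout *= factor
    return schedule


def eval_command(competition, iterations, timeout, out_dir):
    """Build the evaluation command with the flags used in this notebook."""
    return (
        "python3 /content/kaggle-agents/notebooks/mlebench_eval.py "
        f"--competition {competition} --max-iterations {iterations} "
        f"--timeout {timeout} -o {out_dir}"
    )


for i, (iters, secs) in enumerate(budget_schedule()):
    print(eval_command("aerial-cactus-identification", iters, secs,
                       f"/content/mlebench_results/stage_{i}"))
```

Run the stages one at a time and stop as soon as the results are good enough for your budget.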

In [None]:
from pathlib import Path

RUN_ID = 'aerial_cactus_smoke'
OUT_DIR = Path(f'/content/mlebench_results/{RUN_ID}')
OUT_DIR.mkdir(parents=True, exist_ok=True)

# IMPORTANT: use an absolute path (avoids the '/content/notebooks/...' error)
!python3 /content/kaggle-agents/notebooks/mlebench_eval.py --competition aerial-cactus-identification --max-iterations 1 --timeout 1800 -o {OUT_DIR}

# Full Lite split (optional)
# !python3 /content/kaggle-agents/notebooks/mlebench_eval.py --lite --max-iterations 1 --timeout 1800 -o /content/mlebench_results/lite_run

In [None]:
import json
from pathlib import Path

out_dir = Path(f'/content/mlebench_results/{RUN_ID}')
summary = json.loads((out_dir / 'summary.json').read_text())

print(json.dumps(summary, indent=2))
print('\n=== Benchmark (MLE-bench) ===')
print(f"Low/Lite Any Medal (%): {summary.get('any_medal_percentage', 0.0) * 100:.2f}%")
print(f"Valid Submission (%): {summary.get('valid_submission_percentage', 0.0) * 100:.2f}%")

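If the evaluation is interrupted, some output files may be missing and reading `summary.json` will fail. A small guard can report what is absent first. This sketch assumes all three files are written directly into the `-o` directory, which the notebook only confirms for `summary.json`; `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the section heading above.
EXPECTED_OUTPUTS = ("results.json", "summary.json", "results.csv")


def missing_outputs(out_dir):
    """Return the expected output file names not present in `out_dir`."""
    out_dir = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS
            if not (out_dir / name).is_file()]


absent = missing_outputs("/content/mlebench_results/aerial_cactus_smoke")
print("missing outputs:", absent if absent else "none")
```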

## 5) (Optional) Repeat 3 runs and report mean ± SEM

The MLE-bench leaderboard reports `Any Medal (%)` as **mean ± SEM** (3 seeds/runs recommended).

The cell below runs Lite 3 times and computes the statistic. Adjust `--max-iterations`/`--timeout` to fit your budget.

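For reference, the standard error of the mean is the sample standard deviation divided by the square root of the number of runs. A minimal sketch of the statistic on its own, using made-up fractions rather than real results (`mean_sem` is a hypothetical helper):

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, SEM), where SEM = sample stdev / sqrt(n).

    With a single value the SEM is undefined; 0.0 is returned in that case.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean, sem


# Illustrative values only, not measured results.
mean, sem = mean_sem([0.10, 0.20, 0.30])
print(f"{mean * 100:.2f} ± {sem * 100:.2f} (n=3)")
```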
In [None]:
import os
import json, math, statistics
from pathlib import Path

RUN_GROUP = 'lite_3runs'
BASE_DIR = Path('/content/mlebench_results') / RUN_GROUP
BASE_DIR.mkdir(parents=True, exist_ok=True)

seeds = [0, 1, 2]
summaries = []

for seed in seeds:
    out_dir = BASE_DIR / f'seed_{seed}'
    out_dir.mkdir(parents=True, exist_ok=True)
    
    # (Optional) record the seed for traceability
    os.environ['RUN_SEED'] = str(seed)
    
    # Run the full Lite split
    !python3 /content/kaggle-agents/notebooks/mlebench_eval.py --lite --max-iterations 1 --timeout 1800 -o {out_dir}
    summaries.append(json.loads((out_dir / 'summary.json').read_text()))

any_medals = [s.get('any_medal_percentage', 0.0) for s in summaries]
mean = statistics.mean(any_medals)
sem = (statistics.stdev(any_medals) / math.sqrt(len(any_medals))) if len(any_medals) > 1 else 0.0

print(f"\nLite Any Medal (%): {mean*100:.2f} ± {sem*100:.2f} (n={len(any_medals)})")
