<a href="https://colab.research.google.com/github/luasm17/LLM_as_a_judge/blob/main/prueba_2_qwen3_8B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install dependencies + imports
!pip install -q -U "transformers>=4.51.0" accelerate safetensors

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


In [None]:
# Hugging Face token (read from the environment; None if unset)
HF_TOKEN = os.environ.get("HF_TOKEN", None)

In [None]:
# Load model and tokenizer (Qwen3-8B)
# Binary LLM-as-a-Judge for number agreement in Galician

MODEL_ID = "Qwen/Qwen3-8B"

# token=None is the default (uses any locally stored credentials, else anonymous
# access), so there is no need to branch on HF_TOKEN.
print("🔄 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)

print("🔄 Loading model (this may take a few minutes)...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    token=HF_TOKEN,
    device_map="auto",   # place the weights on the available device(s) automatically
    torch_dtype="auto",  # use the dtype declared in the model config
)

model.eval()  # evaluation mode: disables dropout
print("✅ Model loaded successfully")

🔄 Loading tokenizer...

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).

🔄 Loading model (this may take a few minutes)...

✅ Model loaded successfully

In [None]:
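# LLM-as-a-Judge function (number agreement)
def qwen_judge_concordancia(oracion: str) -> str:
    """Ask the model whether `oracion` still contains a number agreement error.

    Returns the raw decoded reply (newly generated tokens only).
    """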
    prompt = f"""
You are an LLM-as-a-judge that will evaluate a grammatical error correction model in Galician.

Your task is to evaluate the output of the grammatical error correction (GEC) model
and decide whether the correction is ADEQUATE with respect to number agreement
(singular/plural between determiner, noun, adjective, or verb).

Return the answer EXACTLY in the following format, with no additional text:

output_modelo: "<evaluated sentence>"
etiqueta: <0 or 1>
explicacion: "<brief and precise explanation in Galician>"

Criteria:
- etiqueta = 1 → the correction is NOT adequate (there is still a number agreement error)
- etiqueta = 0 → the correction is adequate (there is no number agreement error)
- The explanation must justify only the number agreement

YOU MUST NOT, UNDER ANY CIRCUMSTANCES, CORRECT THE MODEL OUTPUT YOU HAVE TO EVALUATE. YOU MUST LIMIT YOURSELF EXCLUSIVELY TO DECIDING WHETHER IT CONTAINS A NUMBER AGREEMENT ERROR OR NOT.
YOU MUST NOT EVALUATE OTHER TYPES OF ERRORS.

Now evaluate the following output from a GEC model:

"{oracion}"
"""

    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,  # greedy decoding; sampling flags such as temperature are ignored
            pad_token_id=tokenizer.eos_token_id
        )

    resposta = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    )

    return resposta.strip()
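
The function above returns the raw reply as a string, while the prompt asks for three fixed fields. A minimal sketch of how that reply could be turned into structured data follows. `parse_judge_response` and `JudgeVerdict` are hypothetical names that do not appear elsewhere in this notebook, and the sketch assumes the model follows the requested `output_modelo` / `etiqueta` / `explicacion` layout, one field per line. Because Qwen3 can emit a `<think>...</think>` block before its answer, the helper removes that block first and treats an unclosed `<think>` (generation cut off by `max_new_tokens`) as "no verdict".

```python
import re
from typing import Optional, TypedDict


class JudgeVerdict(TypedDict):
    output_modelo: Optional[str]
    etiqueta: Optional[int]
    explicacion: Optional[str]


# Reasoning block emitted in thinking mode; DOTALL lets "." span newlines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_judge_response(resposta: str) -> JudgeVerdict:
    """Hypothetical helper: extract the three fields requested in the prompt."""
    # Drop any complete <think>...</think> block.
    text = _THINK_BLOCK.sub("", resposta)
    # A leftover <think> means the reply was truncated mid-reasoning: no verdict.
    if "<think>" in text:
        text = ""

    def field(name: str, pattern: str) -> Optional[str]:
        # Match "<name>: <value>" on a line of its own.
        match = re.search(rf"^\s*{name}:\s*{pattern}\s*$", text, re.MULTILINE)
        return match.group(1) if match else None

    etiqueta = field("etiqueta", r"([01])")
    return {
        "output_modelo": field("output_modelo", r'"(.*)"'),
        "etiqueta": int(etiqueta) if etiqueta is not None else None,
        "explicacion": field("explicacion", r'"(.*)"'),
    }


# Hand-written reply used only for illustration (not real model output).
sample = (
    "<think>\nsome reasoning\n</think>\n\n"
    'output_modelo: "O grupos de estudantes participou."\n'
    "etiqueta: 1\n"
    'explicacion: "O determinante non concorda en número co substantivo."'
)
verdict = parse_judge_response(sample)
print(verdict["etiqueta"], verdict["output_modelo"])
```

Fields that cannot be found come back as `None`, which makes truncated or malformed replies easy to count separately from genuine 0/1 verdicts. If the reasoning trace is not needed, the Qwen3 chat template also accepts `enable_thinking=False` in `tokenizer.apply_chat_template`, which leaves the whole `max_new_tokens` budget for the answer itself.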

In [None]:
# Tests with my own examples
exemplos = [
    "As decisións tomadas polo comité foron comunicadas aos responsables das distintas áreas.",
    "O grupos de estudantes que participou no proxecto presentou os resultados finais onte pola tarde.",
    "As propostas que chegaron desde os concellos máis pequenos foi analizadas polo equipo técnico.",
    "A maioría das persoas entrevistadas manifestou a súa opinión durante a sesión pública.",
    "Os relatorios anuais sobre o impacto ambiental foi revisados exhaustivamente pola comisión de avaliación.",
    "A valiosa colección de poemas inéditos do autor galego publicouse recentemente baixo un prestixioso selo independente.",
    "O paquete de medidas económicas recentemente aprobado é de aplicación inmediata en todos os sectores da economía.",
    "As adversas condicións climáticas dos últimos días obrigou a suspender completamente varias actividades programadas ao aire libre.",
    "A maioría dos participantes no curso de formación mostrou un grande interese en continuar con sesións prácticas adicionais.",
    "Os sabios consellos que me deches sobre a xestión do tempo foi moi útiles para poder tomar unha decisión acertada."
]

for i, frase in enumerate(exemplos, 1):
    print(f"\n===== EXEMPLO {i} =====")
    print(qwen_judge_concordancia(frase))


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



===== EXEMPLO 1 =====
<think>
Okay, let's tackle this evaluation. The user wants me to check if the model's correction is adequate regarding number agreement in Galician. The sentence provided is: "As decisións tomadas polo comité foron comunicadas aos responsables das distintas áreas."

First, I need to focus on number agreement. Let's break down the sentence. The subject is "As decisións" – "As" is the definite article, "decisións" is plural. Then the verb is "foron", which is the plural form of "ser" (to be). So the subject and verb agree in number, that's good.

Next, looking at the rest of the sentence. "Tomadas polo comité" – "tomadas" is the past participle of "tomar", which agrees with "decisións" (plural), so that's correct. "Aos responsables" – "aos" is the plural form of "ao", so that's correct. "Das distintas áreas" – "das" is plural, matching "responsables", and "áreas" is plural. All parts seem to agree in number.

Wait, but maybe I should check if there's any part that 