In [1]:
!nvidia-smi

Sun Sep 28 17:12:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   34C    P0             52W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install transformers
!pip install accelerate
!pip install safetensors



In [3]:
import pandas as pd
import time
import os
import json
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

In [4]:
INPUT_CSV = 'sample_master.csv'
OUTPUT_CSV = 'sample_master_prometheus_judge.csv'
OFFLOAD_FOLDER = "prometheus_offload"
ESSENTIAL_COLS = ['evaluation_id', 'response_A', 'response_B']

In [5]:
HF_TOKEN = os.environ.get("HF_TOKEN")

In [6]:
# `use_auth_token` is deprecated in recent transformers releases; `token` replaces it
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    token=HF_TOKEN
)

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [7]:
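# device_map="auto" relies on accelerate to place the layers; anything that
# does not fit on the available devices is offloaded to OFFLOAD_FOLDER on disk
model = LlamaForCausalLM.from_pretrained(
    "kaist-ai/Prometheus-13b-v1.0",
    device_map="auto",
    offload_folder=OFFLOAD_FOLDER
)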

config.json:   0%|          | 0.00/607 [00:00<?, ?B/s]

pytorch_model.bin.index.json: 0.00B [00:00, ?B/s]

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

pytorch_model-00001-of-00006.bin:   0%|          | 0.00/9.96G [00:00<?, ?B/s]

pytorch_model-00003-of-00006.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00005-of-00006.bin:   0%|          | 0.00/9.87G [00:00<?, ?B/s]

pytorch_model-00006-of-00006.bin:   0%|          | 0.00/2.49G [00:00<?, ?B/s]

pytorch_model-00002-of-00006.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00004-of-00006.bin:   0%|          | 0.00/9.87G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
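A quick back-of-the-envelope check of the weight memory may help explain why this load fits on the 80 GB A100 shown by `nvidia-smi`. The sketch below sums the shard sizes reported by the download progress bars and compares them with a nominal parameter count. The figure of 13e9 parameters is an assumption read off the "13b" in the model name, not a value taken from the checkpoint, and the estimate covers weights only (activations and the KV cache need additional memory).

```python
# Shard sizes in GB, copied from the download progress bars above
SHARD_SIZES_GB = [9.96, 9.94, 9.94, 9.87, 9.87, 2.49]

# Assumption: a nominal 13e9 parameters, inferred from the "13b" in the model name
NOMINAL_PARAMS = 13e9


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


checkpoint_gb = sum(SHARD_SIZES_GB)
print(f"checkpoint on disk: {checkpoint_gb:.2f} GB")
print(f"float32 weights:    {weight_memory_gb(NOMINAL_PARAMS, 4):.1f} GB")
print(f"float16 weights:    {weight_memory_gb(NOMINAL_PARAMS, 2):.1f} GB")
```

The checkpoint size is close to the 4-bytes-per-parameter estimate, which suggests the weights are stored in float32. If memory were tighter, passing `torch_dtype=torch.float16` to `from_pretrained` would be one option to try, at the cost of slightly different numerics.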

In [18]:
# Smoke test. The prompt is in Portuguese and roughly means: "tell me the history
# of Brazil, do not repeat the question in the output, give only your result".
prompt = "Hello world conte me a historia do brasil e nao repita a pergunta no saida conte apenas o seu resultado"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# **inputs passes both input_ids and attention_mask, so generate() does not
# have to guess the mask. Greedy decoding is selected with do_sample=False;
# temperature=0 is not a valid sampling temperature.
outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=1024,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0]))



<s> Hello world conte me a historia do brasil e nao repita a pergunta no saida conte apenas o seu resultado.

Resposta:

A história do Brasil é longa e rica, e envolve a colonização portuguesa, a independência, a formação da república, a ditadura militar, a transição para a democracia e muitos outros fatos importantes.

O país foi descoberto por Pedro Álvares Cabral em 1500, e a colonização começou em 1532. A independência foi declarada em 1822, após uma longa luta contra a Portugal. A república foi estabelecida em 1889, e a ditadura militar começou em 1964, com o golpe de estado que depôs o presidente João Goulart. A transição para a democracia ocorreu em 1985, com a eleição de Tancredo Neves como presidente.

Essa é uma breve resenha da história do Brasil, sem repetir a pergunta.</s>


In [82]:
def judge_prometheus(prompt: str) -> str:
    """Run the judge model on one prompt and return its raw text reply."""
    try:
        messages = [
            {"role": "system", "content": "You are a strict JSON judge. Reply only with a valid JSON."},
            {"role": "user", "content": prompt}
        ]

        # Build the input_ids from the tokenizer's chat template
        input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

        # Greedy decoding: do_sample=False is sufficient, so temperature is omitted.
        # A single unpadded sequence has an all-ones attention mask.
        outputs = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=500,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

        # Decode only the newly generated tokens. Splitting the decoded text on
        # "[/INST]" breaks when a judged response itself contains that marker.
        new_tokens = outputs[0][input_ids.shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    except Exception as e:
        return json.dumps({"winner": "Error", "error_details": str(e)})
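
Although the system message asks for JSON only, a model may still wrap the object in extra text. The helper below is a minimal sketch of one way to recover the first JSON object from such a reply; `extract_json_object` is a hypothetical name introduced here and is not part of the original notebook.

```python
import json


def extract_json_object(text: str):
    """Return the first decodable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one
            start = text.find("{", start + 1)
    return None


reply = 'Sure, here is my verdict: {"winner": "A", "criteria": {}} Hope that helps.'
print(extract_json_object(reply))
```

Replies for which this returns `None` can be kept as raw text and inspected by hand rather than discarded.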


In [83]:
def create_judge_prompt(response_a, response_b):
    core_prompt = f"""
    You are an expert judge specializing in evaluating the quality of responses generated by language models.
    Your task is to compare two responses (Response A and Response B) objectively and impartially,
    based on the provided rubric.

    IMPORTANT RULES:
    - **Blind Evaluation:** You do not know which model generated each response. Evaluate only the content presented.
    - **Mandatory Evidence:** For EACH criterion, you MUST quote a brief excerpt from the response that justifies your score.
    - **Length-Blind Independence:** The length of the response should NOT influence your evaluation, unless it negatively affects "Conciseness and Clarity".
    - **Output Format:** Your response must be STRICTLY a JSON object, with no additional text before or after.

    EVALUATION CRITERIA:
    1. **Logical Coherence (0-5)**
    2. **Relevance and Focus (0-5)**
    3. **Accuracy and Truthfulness (0-5)**
    4. **Conciseness and Clarity (0-5)**

    [RESPONSE A]
    {response_a}
    [/RESPONSE A]

    [RESPONSE B]
    {response_b}
    [/RESPONSE B]

    Based on your analysis, fill out the following JSON:

    {{
        "winner": "<A | B | Tie>",
        "general_justification": "A concise 1-2 sentence explanation.",
        "criteria": {{
            "logical_coherence": {{
                "score": <0-5>,
                "justification": "<excerpt>"
            }},
            "relevance_and_focus": {{
                "score": <0-5>,
                "justification": "<excerpt>"
            }},
            "accuracy_and_truthfulness": {{
                "score": <0-5>,
                "justification": "<excerpt>"
            }},
            "conciseness_and_clarity": {{
                "score": <0-5>,
                "justification": "<excerpt>"
            }}
        }}
    }}
    """
    return core_prompt
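
Because the prompt embeds both responses verbatim, long pairs can push the prompt plus the 500 new tokens past the 4096-token maximum length that the generation warning in this run mentions. The sketch below shows one possible guard: trimming each response to a token budget before building the prompt. `truncate_to_budget` and the budget value are hypothetical; the function takes `encode` and `decode` callables so it can be tried without loading a model.

```python
from typing import Callable, List, Sequence


def truncate_to_budget(
    text: str,
    encode: Callable[[str], List],
    decode: Callable[[Sequence], str],
    max_tokens: int,
) -> str:
    """Return `text` unchanged if it fits, else its first `max_tokens` tokens."""
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])


# Stand-in whitespace "tokenizer" so the example runs on its own
def split_encode(s: str) -> List[str]:
    return s.split()


def join_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


print(truncate_to_budget("one two three four five", split_encode, join_decode, 3))
```

With the notebook's tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Truncating a response changes what the judge sees, so truncated pairs are probably worth flagging in the output.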


In [84]:
df = pd.read_csv(INPUT_CSV)
df_eval = df[ESSENTIAL_COLS].copy()

In [85]:
df_eval = df_eval.sample(frac=1, random_state=42).reset_index(drop=True)

In [86]:
df_eval['evaluation_prometheus'] = None
print(f"Loaded {len(df_eval)} pairs for evaluation.")

Loaded 1500 pairs for evaluation.


In [87]:
evaluated_ids = set()
if os.path.exists(OUTPUT_CSV):
    df_existing = pd.read_csv(OUTPUT_CSV)
    evaluated_ids = set(df_existing['evaluation_id'])
    print(f"Found existing file: {len(evaluated_ids)} already evaluated.")


In [88]:
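for idx, row in df_eval.iterrows():
    eval_id = row['evaluation_id']
    if eval_id in evaluated_ids:
        continue

    # Skip missing responses before casting: str(NaN) is the non-empty string 'nan'
    if pd.isna(row['response_A']) or pd.isna(row['response_B']):
        continue

    response_a = str(row['response_A'])
    response_b = str(row['response_B'])

    # Skip blank or whitespace-only responses
    if not response_a.strip() or not response_b.strip():
        continue

    prompt = create_judge_prompt(response_a, response_b)
    print(f"Evaluating pair {idx+1}/{len(df_eval)}...")

    output = judge_prometheus(prompt)

    new_row = {
        'evaluation_id': eval_id,
        'response_A': response_a,
        'response_B': response_b,
        'evaluation_prometheus': output
    }

    # Append one row at a time so an interrupted run can resume;
    # the header is written only when the file does not exist yet
    pd.DataFrame([new_row]).to_csv(
        OUTPUT_CSV,
        mode='a',
        header=not os.path.exists(OUTPUT_CSV),
        index=False,
        encoding='utf-8-sig'
    )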

Evaluating pair 1/1500...
Evaluating pair 2/1500...
Evaluating pair 3/1500...
Evaluating pair 4/1500...
Evaluating pair 5/1500...
Evaluating pair 6/1500...
Evaluating pair 7/1500...
Evaluating pair 8/1500...
Evaluating pair 9/1500...
Evaluating pair 10/1500...
Evaluating pair 11/1500...
Evaluating pair 12/1500...
Evaluating pair 13/1500...
Evaluating pair 14/1500...
Evaluating pair 15/1500...
Evaluating pair 16/1500...
Evaluating pair 17/1500...
Evaluating pair 18/1500...
Evaluating pair 19/1500...
Evaluating pair 20/1500...
Evaluating pair 21/1500...
Evaluating pair 22/1500...
Evaluating pair 23/1500...
Evaluating pair 24/1500...
Evaluating pair 25/1500...
Evaluating pair 26/1500...
Evaluating pair 27/1500...
Evaluating pair 28/1500...
Evaluating pair 29/1500...
Evaluating pair 30/1500...
Evaluating pair 31/1500...
Evaluating pair 32/1500...
Evaluating pair 33/1500...
Evaluating pair 34/1500...
Evaluating pair 35/1500...
Evaluating pair 36/1500...
Evaluating pair 37/1500...
Evaluating

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Evaluating pair 395/1500...
Evaluating pair 396/1500...
Evaluating pair 397/1500...
Evaluating pair 398/1500...
Evaluating pair 399/1500...
Evaluating pair 400/1500...
Evaluating pair 401/1500...
Evaluating pair 402/1500...
Evaluating pair 403/1500...
Evaluating pair 404/1500...
Evaluating pair 405/1500...
Evaluating pair 406/1500...
Evaluating pair 407/1500...
Evaluating pair 408/1500...
Evaluating pair 409/1500...
Evaluating pair 410/1500...
Evaluating pair 411/1500...
Evaluating pair 412/1500...
Evaluating pair 413/1500...
Evaluating pair 414/1500...
Evaluating pair 415/1500...
Evaluating pair 416/1500...
Evaluating pair 417/1500...
Evaluating pair 418/1500...
Evaluating pair 419/1500...
Evaluating pair 420/1500...
Evaluating pair 421/1500...
Evaluating pair 422/1500...
Evaluating pair 423/1500...
Evaluating pair 424/1500...
Evaluating pair 425/1500...
Evaluating pair 426/1500...
Evaluating pair 427/1500...
Evaluating pair 428/1500...
Evaluating pair 429/1500...
Evaluating pair 430/

In [89]:
print(f"All results saved to '{OUTPUT_CSV}'")

All results saved to 'sample_master_prometheus_judge.csv'
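
Once the CSV is written, a natural next step is to count the verdicts. The sketch below tallies the `winner` field of each stored reply and groups anything that is not parseable JSON under `"Unparseable"`. `tally_winners` is a hypothetical helper, and it assumes each stored reply is either a bare JSON object or not JSON at all.

```python
import json
from collections import Counter

VALID_WINNERS = {"A", "B", "Tie", "Error"}


def tally_winners(raw_outputs) -> Counter:
    """Count the 'winner' value across raw judge replies."""
    counts = Counter()
    for raw in raw_outputs:
        winner = None
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                winner = parsed.get("winner")
        except (TypeError, ValueError):
            # TypeError: missing cell (None/NaN); ValueError: malformed JSON
            pass
        if isinstance(winner, str) and winner in VALID_WINNERS:
            counts[winner] += 1
        else:
            counts["Unparseable"] += 1
    return counts


sample = ['{"winner": "A"}', '{"winner": "Tie"}', "not json", None]
print(tally_winners(sample))
```

On the real file this could be called as `tally_winners(pd.read_csv(OUTPUT_CSV)["evaluation_prometheus"])`.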
