<a href="https://colab.research.google.com/github/plpowerbug/code-eval/blob/main/evaluate_qwen_humaneval_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers evalplus pandas

Collecting evalplus
  Downloading evalplus-0.3.1-py3-none-any.whl.metadata (12 kB)
Collecting wget>=3.2 (from evalplus)
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting tempdir>=0.7.1 (from evalplus)
  Downloading tempdir-0.7.1.tar.gz (5.9 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting appdirs>=1.4.4 (from evalplus)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting fire>=0.6.0 (from evalplus)
  Downloading fire-0.7.0.tar.gz (87 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m87.2/87.2 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting tree-sitter>=0.22.0 (from evalplus)
  Downloading tree_sitter-0.24.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.8 kB)
Collecting tree-sitter-python>=0.21.0 (from evalplus)
  Downloading tree_sitter_python-0.23.6-cp39-abi3-manylinux_2_5_x86_64.many



```
# 2. 加载模型（以 Qwen 为例）

```



In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Qwen/Qwen1.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True
)

model.eval()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1024,), eps=1e-06)
    (rotary_emb): 

3. 加载 HumanEval 并生成代码

In [3]:
!pip install datasets



In [4]:
from datasets import load_dataset

dataset = load_dataset("openai_humaneval")
samples = dataset["test"]
print(f"共加载 {len(samples)} 道 HumanEval 题目")

README.md:   0%|          | 0.00/6.52k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/83.9k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/164 [00:00<?, ? examples/s]

共加载 164 道 HumanEval 题目


4. 定义生成 + 测试代码的函数

In [5]:
# 生成代码
def generate_code(prompt, max_new_tokens=256):
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=0.8,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()

# 执行代码并跑 test 部分
def run_test(function_code: str, test_code: str) -> bool:
    with tempfile.NamedTemporaryFile(suffix=".py", mode="w+", delete=False) as f:
        f.write(function_code + "\n\n" + test_code)
        f.flush()
        try:
            subprocess.check_output(["python3", f.name], stderr=subprocess.STDOUT, timeout=5)
            return True
        except subprocess.CalledProcessError:
            return False
        except subprocess.TimeoutExpired:
            return False

 5. 实际执行评测（可选：只跑前 N 题）

In [6]:
import tempfile
import subprocess
results = []
num_completions = 3  # 每题生成几个解法
subset = samples.select(range(5))  # 评估前 5 题，可改 range(10) 评估前10题

for item in subset:
    for i in range(num_completions):
        generated = generate_code(item["prompt"])
        passed = run_test(item["prompt"] + generated, item["test"])
        results.append({
            "task_id": item["task_id"],
            "index": i + 1,
            "passed": passed,
            "completion": generated
        })
        print(f"{item['task_id']} | Try {i+1} | {'✅ PASS' if passed else '❌ FAIL'}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


HumanEval/0 | Try 1 | ❌ FAIL
HumanEval/0 | Try 2 | ❌ FAIL
HumanEval/0 | Try 3 | ❌ FAIL
HumanEval/1 | Try 1 | ❌ FAIL
HumanEval/1 | Try 2 | ❌ FAIL
HumanEval/1 | Try 3 | ❌ FAIL
HumanEval/2 | Try 1 | ❌ FAIL
HumanEval/2 | Try 2 | ❌ FAIL
HumanEval/2 | Try 3 | ❌ FAIL
HumanEval/3 | Try 1 | ❌ FAIL
HumanEval/3 | Try 2 | ❌ FAIL
HumanEval/3 | Try 3 | ❌ FAIL
HumanEval/4 | Try 1 | ❌ FAIL
HumanEval/4 | Try 2 | ❌ FAIL
HumanEval/4 | Try 3 | ❌ FAIL


4. 保存 CSV + 运行评测

In [8]:
import pandas as pd
df = pd.DataFrame(results)
df.to_csv("humaneval_results_qwen_light.csv", index=False)
df.head()

Unnamed: 0,task_id,index,passed,completion
0,HumanEval/0,1,False,return any(x < threshold for x in numbers) # ...
1,HumanEval/0,2,False,for num in numbers:\n if num - num < th...
2,HumanEval/0,3,False,return (not numbers).all()\n\n\ndef has_close_...
3,HumanEval/1,1,False,return paren_string.split()
4,HumanEval/1,2,False,result: List[str] = []\n stack = []\n st...


In [9]:


df = pd.read_csv("humaneval_results_qwen_light.csv")

# 按 task 分组，保留 index 排序（即生成顺序）
grouped = df.sort_values(by=["task_id", "index"]).groupby("task_id")

pass_at_1 = grouped.head(1).passed.mean()
pass_at_3 = grouped.head(3).groupby("task_id")["passed"].any().mean()
pass_at_5 = grouped.head(5).groupby("task_id")["passed"].any().mean()

print(f"✅ Pass@1: {pass_at_1:.2%}")
print(f"✅ Pass@3: {pass_at_3:.2%}")
print(f"✅ Pass@5: {pass_at_5:.2%}")

✅ Pass@1: 0.00%
✅ Pass@3: 0.00%
✅ Pass@5: 0.00%


分析每个任务失败的原因

In [12]:
# 找出每个 task 完全没有通过的情况
failed_tasks = grouped["passed"].any()
completely_failed = failed_tasks[failed_tasks == False].index.tolist()

print(f"共 {len(completely_failed)} 个任务完全失败：")
for tid in completely_failed[:5]:  # 只显示前5个
    print("-", tid)
#进一步打印这些任务的所有 completions
df[df["task_id"].isin(completely_failed)].to_csv("completely_failed_tasks.csv", index=False)

共 5 个任务完全失败：
- HumanEval/0
- HumanEval/1
- HumanEval/2
- HumanEval/3
- HumanEval/4
