RecLM- eval #104

Merged
Leavingseason merged 5 commits into microsoft:dev from LINJH00:RecLM-eval on Sep 28, 2025

Conversation

Contributor

LINJH00 commented Sep 11, 2025

Description

  1. New tasks added
     Two new tasks have been added, cf_ranking_mc and seq_ranking_mc, along with two metrics:
       • acc@1: the accuracy on the current evaluation task
       • none_ratio: the proportion of cases that cannot be recognized by our defined rules
  2. Changed the ranking task

Checklist:
  • [√] I have added description accordingly.
  • [√] This PR is being made to dev branch AND NOT TO main branch.
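
As an illustration of the two metrics named in the description, here is a minimal sketch. It is not the PR's implementation: the function name `mc_metrics`, the convention that an unparseable response is represented as `None`, and the choice to count unparseable responses as incorrect in acc@1 are all assumptions made for this example.

```python
from typing import Optional, Sequence


def mc_metrics(predictions: Sequence[Optional[str]], labels: Sequence[str]) -> dict:
    """Compute acc@1 and none_ratio for a multiple-choice task.

    predictions[i] is the option parsed from the model response for sample i,
    or None when the parsing rules could not recognize an answer.
    Assumption of this sketch: unparsed responses count as incorrect.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    total = len(labels)
    if total == 0:
        return {"acc@1": 0.0, "none_ratio": 0.0}
    # A sample is correct only if an answer was parsed and it matches the label.
    correct = sum(p is not None and p == y for p, y in zip(predictions, labels))
    # Samples whose response could not be parsed at all.
    unparsed = sum(p is None for p in predictions)
    return {"acc@1": correct / total, "none_ratio": unparsed / total}
```

With predictions `["A", None, "C", "B"]` and labels `["A", "B", "C", "D"]`, two of four samples are correct and one is unparsed.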

Comment thread RecLM-eval/README.md
``` No newline at end of file
```

## Amazon_Fashion · Ranking Evaluation (1000 samples)
Contributor

You could create an "examples" folder in the RecLM-eval root directory and put the shell script that produces these results there, so that others can reproduce them. For example, "amazon_fashion_ranking_evalaution.sh".

Contributor

It looks like only the Ranking results are included so far? Please also add the two result tables for cf_ranking_mc and seq_ranking_mc.

Contributor Author

I have added the two result tables for cf_ranking_mc and seq_ranking_mc. Regarding "examples": do you mean that we should upload a backup copy of our evaluation data?

Comment thread RecLM-eval/api_cost.jsonl Outdated
{"text-embedding-3-small": {"input": 0.02, "output": 0}}
{"text-embedding-3-large": {"input": 0.13, "output": 0}}
{"ada v2": {"input": 0.1, "output": 0}}
{"gpt-4.1": {"input": 3.0, "output": 12.0}}
Contributor

Could you update all the prices in this file to match the official OpenAI pricing? Most of them are outdated.
The chatgpt-4o-latest line can be removed; it is unclear which model it refers to.
Only these entries need to be kept: gpt-35-turbo, gpt-4.1, gpt-4o, gpt-4o-mini, text-embedding-3-small, text-embedding-3-large.
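
The pruning requested above could be sketched as follows. This is a hypothetical helper, not code from the PR: it assumes each line of api_cost.jsonl is a JSON object mapping one model name to its `input`/`output` prices, as in the diff context, and the names `KEEP_MODELS` and `filter_cost_lines` are invented for the example. It only drops unlisted models; updating the prices themselves still has to be done by hand.

```python
import json

# Models the reviewer asked to keep (taken from the comment above).
KEEP_MODELS = {
    "gpt-35-turbo", "gpt-4.1", "gpt-4o", "gpt-4o-mini",
    "text-embedding-3-small", "text-embedding-3-large",
}


def filter_cost_lines(lines, keep=KEEP_MODELS):
    """Return the JSONL lines whose model name is in `keep`."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the entries of this record that name an allowed model.
        filtered = {name: price for name, price in record.items() if name in keep}
        if filtered:
            kept.append(json.dumps(filtered))
    return kept
```

Reading the file with `open(path).readlines()`, passing the result through `filter_cost_lines`, and writing the returned lines back would apply the change.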

Contributor Author

Updated.

Comment thread RecLM-eval/call_models/vllm_models.py Outdated
data["prompt"],
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # <- turn off thought mode
Contributor

For non-thinking models such as Llama 3.1 8B, will passing enable_thinking=False raise an exception?

Contributor Author

enable_thinking=False only takes effect in tokenizers that support the parameter (such as the Qwen series). If the model's tokenizer does not have this keyword at all, Transformers raises TypeError/ValueError/AttributeError. The code already catches these exceptions with try/except; once one is caught, it falls back to simple string concatenation and no longer calls apply_chat_template, which avoids the parameter incompatibility entirely. I also tested models with and without this parameter, and the evaluation results were the same.

Contributor

Leavingseason commented Sep 18, 2025

"If the model's tokenizer does not have this keyword at all, Transformers raises TypeError/ValueError/AttributeError": then that is not right. We cannot take that path. We have to use apply_chat_template and cannot concatenate the prompt ourselves. Here you need to first check, with an if condition plus hasattr(), whether the tokenizer has the enable_thinking variable, and then decide whether to pass it.
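
One way to follow this suggestion is sketched below. It is an assumption-laden illustration, not the code that was merged: the helper name `build_prompt` is hypothetical, and instead of a bare hasattr() check it looks for the variable name `enable_thinking` in the tokenizer's `chat_template` string, on the assumption that a template which supports the switch mentions it by name. The prompt always goes through `apply_chat_template`; only the extra keyword is conditional.

```python
def build_prompt(tokenizer, messages):
    """Render a chat prompt, passing enable_thinking=False only when supported.

    `tokenizer` is expected to expose `chat_template` and
    `apply_chat_template`, as Hugging Face tokenizers do.
    """
    kwargs = {"tokenize": False, "add_generation_prompt": True}
    # Assumption of this sketch: a template that supports the switch
    # refers to the variable `enable_thinking` somewhere in its text.
    template = getattr(tokenizer, "chat_template", None)
    if isinstance(template, str) and "enable_thinking" in template:
        kwargs["enable_thinking"] = False  # turn off thinking mode
    # Always use the tokenizer's own template; never concatenate by hand.
    return tokenizer.apply_chat_template(messages, **kwargs)
```

Under this design a non-thinking model never receives the extra keyword, so no exception handling or string-concatenation fallback is needed.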

text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode()
return re.sub(r'[^a-z0-9]', '', text.lower())

def _map_titles(answer_line: str, candidates: list[str]) -> list[str]:
Contributor

Please add some English comments.

Contributor Author

Added.

Comment thread RecLM-eval/call_models/vllm_models.py Outdated
# filter history
mapped_titles = [t for t in mapped_titles if t not in history]
# pad if necessary with remaining candidates
if len(mapped_titles) < 20:
Contributor

Please add English comments to the large blocks of code.
Why is there a magic number 20 here? It should be defined as a variable so that different counts can be configured flexibly.

Contributor Author

I added a parameter in the script to control this value. It was initially set to 20 because the k values for our metrics are 1, 5, 10, and 20, so 20 was chosen.
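
A parameterized version of the filter-and-pad step might look like the sketch below. The function name `finalize_ranking` and the parameter name `top_k` are hypothetical, and the de-duplication and final truncation are assumptions added for the example; only the history filter and the padding with remaining candidates come from the diff context.

```python
def finalize_ranking(mapped_titles, candidates, history, top_k=20):
    """Filter out history items, then pad with unused candidates up to top_k.

    top_k replaces the hard-coded 20; it should be at least the largest k
    used by the ranking metrics.
    """
    history_set = set(history)
    ranked, seen = [], set()
    # Keep the model's ranked titles, dropping history items and duplicates.
    for title in mapped_titles:
        if title not in history_set and title not in seen:
            ranked.append(title)
            seen.add(title)
    # Pad, in candidate order, with items the model did not rank.
    for title in candidates:
        if len(ranked) >= top_k:
            break
        if title not in history_set and title not in seen:
            ranked.append(title)
            seen.add(title)
    return ranked[:top_k]
```

For example, `finalize_ranking(["b", "x", "a"], ["a", "b", "c", "d"], ["x"], top_k=3)` drops the history item "x" and pads with "c".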

Comment thread RecLM-eval/eval.py Outdated
@@ -15,11 +15,53 @@

## If you use customerized deployment names, don't forget to add them to this list
OPENAI_MODELS = ["gpt-35-turbo", "gpt-3.5-turbo", "gpt-4", "gpt-4-turbo",
Contributor

Please update this list accordingly as well; only the models mentioned above need to be supported.

Contributor Author

Updated.

Comment thread RecLM-eval/main.sh Outdated
--bench-name steam \
--model_path_or_name NousResearch/Hermes-3-Llama-3.1-8B \
--bench-name "${DATASETS[@]}" \
--model_path_or_name /home/data/model/qwen3-8B \
Contributor

If the model is not one you fine-tuned yourself, use the original Hugging Face path, for example "Qwen/Qwen3-8B" here.

Contributor Author

Updated.

Comment thread RecLM-eval/requirements.txt Outdated
# Optional acceleration
xformers==0.0.31
triton==3.3.1 # CUDA kernels for xformers
# flash-attn 需要在已有 torch 环境中手动安装;请在完成 requirements 安装后运行:
Contributor

All Chinese text needs to be translated into English.

Contributor Author

Updated.

Leavingseason merged commit 35939d5 into microsoft:dev on Sep 28, 2025
1 check passed
Leavingseason added a commit that referenced this pull request Sep 28, 2025
* change the task of ranking and add two tasks.

* new change

* new changes

* new changes

* change something here

Co-authored-by: LINJH00 <2020043053@email.szu.edu.cn>
2 participants