
BUG #25

Open · nyBball opened this issue Feb 20, 2024 · 13 comments
nyBball commented Feb 20, 2024

I'm currently evaluating the chatglm3-6b model on 4x A800 (80 GB) GPUs. When I run the v0.2 code, some of the datasets fail with errors. Details below:

Evaluation runs fine on the instruct, review, and plan json datasets, but on plan str and retrieve str the following error is raised partway through:

```
Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 109, in <module>
    prediction = infer(dataset, llm, args.out_dir, tmp_folder_name=tmp_folder_name, test_num=test_num)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 74, in infer
    prediction = split_special_tokens(prediction)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 50, in split_special_tokens
    text = text.split('<eoa>')[0]
AttributeError: 'dict' object has no attribute 'split'
```

On reason str, understand str, and RRU, the following error is raised partway through:

```
Input length of input_ids is 8463, but `max_length` is set to 8192. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
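For reference, the two warnings in that log each point at their own mitigation. A minimal sketch, assuming a stock transformers setup; the model path follows the ChatGLM3 README, but the prompt and `max_new_tokens` value here are illustrative, not T-Eval's actual configuration:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
# Decoder-only models must be left-padded for batched generation.
tokenizer.padding_side = "left"

model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()

inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
# Bound only the completion with max_new_tokens, instead of capping
# prompt + completion together with max_length.
outputs = model.generate(**inputs, max_new_tokens=512)
```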

Do you have any suggestions? Thanks!

zehuichen123 (Collaborator) commented:

For the first problem, you need to set meta_template to chatglm. chatglm3 has a quirk where, if the first message is a system message, the returned result gets eval'ed into a dict, so it's no longer a string.
For the second problem, just wrap it in a try/except... chatglm3 simply errors out when the input is too long... have it return an empty result instead...
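A minimal sketch of both workarounds against T-Eval's test.py; the `<eoa>` split comes from the traceback above, while the `safe_generate` name, the `llm.generate` call, the JSON coercion, and the broad `except` are illustrative assumptions:

```python
import json

def split_special_tokens(text):
    # chatglm3 can eval the reply into a dict when the first message
    # is a system message; coerce it back to a string first.
    if isinstance(text, dict):
        text = json.dumps(text, ensure_ascii=False)
    return text.split('<eoa>')[0]

def safe_generate(llm, prompt):
    # chatglm3 raises a device-side assert once the prompt exceeds
    # max_length; fall back to an empty prediction, as suggested above.
    try:
        return llm.generate(prompt)
    except Exception:
        return ''
```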

nyBball (Author) commented Feb 20, 2024

Thanks!

My launch command is `sh test_all_zh.sh hf ../../ckpt/chatglm3-6b/ chatglm3-6b-zh chatglm`, so meta_template should already be set to chatglm, right? And since the first error only occurs on some of the datasets, it probably isn't caused by meta_template not being set to chatglm?

zehuichen123 (Collaborator) commented:

By the way, could you share which specific data item fails? I'll try it on my end.

nyBball (Author) commented Feb 20, 2024

plan str:

[screenshot of the failing sample]

retrieve str:

[screenshot of the failing sample]

zehuichen123 (Collaborator) commented:

I hacked around it, and it should run through now. The root cause is this code they wrote themselves: https://huggingface.co/THUDM/chatglm3-6b/blob/f30825950ce00cb0577bf6a15e0d95de58e328dc/modeling_chatglm.py#L1021
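Per the description above, that code path can eval the model's reply into a dict (e.g. for tool calls). A hedged sketch of a normalization wrapper; `model.chat`'s signature follows the ChatGLM3 README, but the `chat_as_text` name and treating every non-string reply as JSON-serializable are assumptions:

```python
import json

def chat_as_text(model, tokenizer, query, history=None):
    # ChatGLM3's chat() may return a dict instead of a string;
    # normalize it so downstream .split() calls don't break.
    response, history = model.chat(tokenizer, query, history=history or [])
    if not isinstance(response, str):
        response = json.dumps(response, ensure_ascii=False)
    return response, history
```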

nyBball (Author) commented Feb 21, 2024

This codebase has quite a few bugs. I'd suggest checking everything over first and being more rigorous...

1. `sh test_all_zh.sh hf ../../ckpt/internlm-7b/ internlm-7b-zh internlm`

   Evaluating understand str errors out:

   [screenshot of the error]

2. `sh test_all_en.sh hf ../../ckpt/internlm-7b/ internlm-7b-en internlm`

   Evaluating understand str errors out:

   [screenshot of the error]

zehuichen123 (Collaborator) commented:

Sorry about that. The paper's numbers went through opencompass; this codebase was written separately and is still in a validation phase, so some regression tests haven't been run yet. We'll test everything end to end today. Apologies for the bugs!

zehuichen123 (Collaborator) commented Feb 21, 2024

@nyBball The code has been updated; please give it another try. Full alignment of the scores may have to wait until the opencompass side is ready as well.
Here are the chatglm3-6b infer results:

```
Overall: 49.8   Instruct: 80.1  Plan: 40.1      Reason: 28.5    Retrieve: 48.6  Understand: 51.7        Review: 50.1
```

chatglm3-6b seems to hit some internal errors and fall into the try/except path, which can drag its scores down; we'll look into this later.

zehuichen123 (Collaborator) commented:

Here are the Qwen-7B infer results, which look normal:

```
Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2
```

nyBball (Author) commented Feb 22, 2024

Does the English benchmark only support running on a single GPU? Also, were the models you ran the base versions or the chat versions?

[screenshot]

zehuichen123 (Collaborator) commented:

You can just comment that out yourself. All of the runs use the chat models.

nyBball (Author) commented Feb 26, 2024

> Here are the Qwen-7B infer results, which look normal:
>
> Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2

I ran qwen-7b-chat (https://huggingface.co/Qwen/Qwen-7B-Chat) on the 1/5 subset (https://drive.google.com/file/d/1DgCMjquEIJ2v14Xu6uB6w3UEzaYXZbUL/view) and got:

```
Overall: 64.0   Instruct: 93.3  Plan: 56.8      Reason: 59.7    Retrieve: 67.3  Understand: 48.7        Review: 58.1
```

That differs quite a bit from your results. What might be the cause? Thanks.

zehuichen123 (Collaborator) commented Feb 27, 2024

The earlier numbers were on the full set. Yesterday I ran the 1/5 subset:

```
Overall: 58.7   Instruct: 64.1  Plan: 59.6      Reason: 49.3    Retrieve: 66.2  Understand: 51.4        Review: 61.9
```

You could update the lagent and t-eval code. The main gap seems to be your Instruct score being high: did you test the str-format instruct? On my side json-format instruct reaches 90+, but the str format is much lower.
