Optimize _replace_image_tags so that <img></img> tags containing no actual image do not interrupt training #3683
PR type
PR information
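During a GRPO training run, the program was unexpectedly interrupted here. The cause was that the model generated text such as <img>random garbage</img>: the _replace_image_tags function extracted the content without any check, and the extracted content then conflicted with the original images in the data. This PR updates _replace_image_tags, mainly implementing two pieces of logic:
1. Check the content wrapped in <img></img>. Only if it is a valid image reference (e.g. a URL, a local path, or base64) is it extracted and the <img></img> converted to <image>; otherwise it is left untouched.
2. When <image> tags are present and inputs.images holds the corresponding images, ensure that the image from <img></img> is inserted at the correct position in inputs.images.
Experiment results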
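Testing shows that non-image content inside <img></img> is ignored, with no effect on the subsequent handling of <image> tags and image files, and training proceeds normally. If the number of inputs.images in the original data matches the number of <image> tags, there is no problem. If there are more inputs.images than original <image> tags, the later _add_default_tags function handles the surplus. If there are fewer inputs.images than original <image> tags, and an <image> tag with no corresponding image appears before a valid <img></img>, the images become misaligned; in practice, however, the final mismatch between the number of <image> tags and the number of inputs.images will likely raise an error later in the model (e.g. llava).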
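The two pieces of logic described under PR information can be illustrated with a minimal sketch. This is not the code in this PR: the names looks_like_image and replace_image_tags, the extension list, and the base64 length threshold are all hypothetical, and the validity check is only a heuristic stand-in for whatever rule the real _replace_image_tags applies.

```python
import base64
import os
import re
from typing import List, Tuple

# Hypothetical list of extensions treated as image paths (an assumption).
IMAGE_EXTS = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp')
# Matches either an existing <image> tag or an <img>...</img> pair.
TAG_RE = re.compile(r'<image>|<img>(.*?)</img>', re.DOTALL)


def looks_like_image(ref: str) -> bool:
    """Heuristic: is `ref` plausibly a URL, local path, or base64 image?"""
    ref = ref.strip()
    if not ref:
        return False
    if ref.startswith(('http://', 'https://', 'data:image/')):
        return True
    if os.path.isfile(ref) or ref.lower().endswith(IMAGE_EXTS):
        return True
    if len(ref) >= 64:  # arbitrary threshold for raw base64 payloads
        try:
            base64.b64decode(ref, validate=True)
            return True
        except ValueError:  # also covers binascii.Error
            return False
    return False


def replace_image_tags(text: str, images: List[str]) -> Tuple[str, List[str]]:
    """Convert valid <img>ref</img> to <image>, keeping image order aligned."""
    pending = list(images)  # images belonging to pre-existing <image> tags
    new_images: List[str] = []
    parts: List[str] = []
    pos = 0
    for m in TAG_RE.finditer(text):
        parts.append(text[pos:m.start()])
        pos = m.end()
        if m.group(0) == '<image>':
            parts.append('<image>')
            if pending:  # consume the next original image, if any is left
                new_images.append(pending.pop(0))
        elif looks_like_image(m.group(1)):
            parts.append('<image>')
            new_images.append(m.group(1).strip())
        else:
            parts.append(m.group(0))  # not an image: leave the text as-is
    parts.append(text[pos:])
    new_images.extend(pending)  # surplus images are left for later handling
    return ''.join(parts), new_images


text = 'a <image> b <img>random junk</img> c <img>https://example.com/cat.png</img> d'
new_text, new_images = replace_image_tags(text, ['first.png'])
```

In this sketch, walking the text once from left to right is what keeps each extracted image at the position of its tag relative to the existing <image> tags. It also reproduces the misalignment case described above: if an <image> tag with no corresponding image comes before a valid <img></img>, the next available original image is consumed by the wrong tag. Note that the base64 check accepts any sufficiently long string of valid base64 characters, so a real implementation would likely need a stricter test.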