ValueError: Cannot match one checkpoint key to multiple keys in the model. #6

Closed
erpingzi opened this issue Apr 22, 2022 · 6 comments

Comments

@erpingzi

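Hello, when I run cross-training with detectron2 from scratch, without resume, I get the error "ValueError: Cannot match one checkpoint key to multiple keys in the model."
Does cross-training require resume? How did you obtain the output/model_0005999.pth used in your run command? I look forward to your reply.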

@machengcheng2016
Owner

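Hello. As in other semi-supervised detection work, resume is used so that the model starts with better predictive ability. We use fully supervised training as a warmup, and switch to semi-supervised training after 6000 iterations.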
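
As a rough illustration of the schedule described above (not the repository's actual trainer code), the sketch below uses only the supervised loss during the first 6000 iterations and adds a weighted semi-supervised term afterwards. The function name `total_loss` and the `unsup_weight` parameter are hypothetical; only the 6000-iteration warmup length comes from this thread.

```python
BURN_IN_ITERS = 6000  # warmup length stated in the reply above


def total_loss(iteration: int, loss_sup: float, loss_unsup: float,
               unsup_weight: float = 1.0) -> float:
    """Return the supervised loss during warmup, a weighted sum afterwards.

    `unsup_weight` is a hypothetical knob for the semi-supervised term.
    """
    if iteration < BURN_IN_ITERS:
        # Warmup (iterations 0..5999): fully supervised only.
        return loss_sup
    # After warmup: supervised loss plus the weighted semi-supervised loss.
    return loss_sup + unsup_weight * loss_unsup
```

With this convention the checkpoint written at the end of iteration 5999 (model_0005999.pth) is the last one produced by the warmup stage.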

@erpingzi
Author

Thank you for your reply! My understanding is that there is a burn-in stage (fully supervised) before semi-supervised training, and that the burn-in stage randomly initializes two identical models. I ran the command python3 train_net.py --num-gpus 8 --config configs/voc/voc07_voc12.yaml for the initial training (with WEIGHTS commented out in voc07_voc12.yaml), but after 6 iterations it fails with "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged."
Did you use a pretrained model for the fully supervised training? I tried the WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl" setting from voc07_voc12.yaml, but that fails with:
error: Ambiguity found for res2.0.conv1.norm.bias in checkpoint! It matches at least two keys in the model (backbone2.bottom_up.res2.0.conv1.norm.bias and backbone1.bottom_up.res2.0.conv1.norm.bias).
ValueError: Cannot match one checkpoint key to multiple keys in the model

In addition, in def run_step_full_semisup(self): in trainer.py, record_dict is defined only in the else branch of the if/else and not in the if branch, so running it raises "UnboundLocalError: local variable 'record_dict' referenced before assignment".

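I would like to ask:

  • [ ] How should the fully supervised training be run? Does it need a pretrained model, or random initialization? How did you run it?
    I am new to semi-supervised learning and do not understand the training process well. Please forgive me if anything I wrote came across as rude. I look forward to your reply, thank you!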
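
The UnboundLocalError on record_dict reported in this comment follows a common Python pattern: a local variable assigned in only one branch and then read after the branch. The sketch below reproduces that pattern and one possible fix. It is not the repository's `run_step_full_semisup`; the function names, the `use_if_branch` flag, and the dictionary keys are hypothetical.

```python
def run_step_buggy(use_if_branch: bool) -> dict:
    if use_if_branch:
        loss = 1.0  # record_dict is never assigned on this path
    else:
        record_dict = {"loss": 2.0}
    # Raises UnboundLocalError when use_if_branch is True.
    return record_dict


def run_step_fixed(use_if_branch: bool) -> dict:
    record_dict = {}  # initialise before branching so both paths define it
    if use_if_branch:
        record_dict["loss_sup"] = 1.0
    else:
        record_dict["loss_semi"] = 2.0
    return record_dict
```

Initialising the dictionary before the branch is only one option; assigning it explicitly in both branches works equally well.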

@machengcheng2016
Owner

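  1. The Inf/NaN occurs because poor-quality pseudo-labels disrupt training. Lowering the semi-supervised loss weight or rerunning the training a few times usually resolves it.
  2. Look at the error message: it says the ckpt is missing parameters for weights such as backbone1 and backbone2, which means the ckpt needs to contain the weights of both models. So, naturally, both the fully supervised warmup and the semi-supervised training need to load/save the weights of both models.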
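
To make the ambiguity in item 2 concrete, here is a simplified sketch of suffix-based key matching, in which a checkpoint key is assigned to a model key that ends with it. It is an illustration under that assumption, not detectron2's actual loading code, and `match_checkpoint_keys` is a hypothetical helper. The key names are taken from the error message quoted in this thread.

```python
def match_checkpoint_keys(model_keys, ckpt_keys):
    """Map each checkpoint key to the single model key that ends with it."""
    mapping = {}
    for ckpt_key in ckpt_keys:
        hits = [m for m in model_keys
                if m == ckpt_key or m.endswith("." + ckpt_key)]
        if len(hits) > 1:
            raise ValueError(
                f"Ambiguity found for {ckpt_key}: it matches {sorted(hits)}"
            )
        if hits:
            mapping[ckpt_key] = hits[0]
    return mapping


model_keys = [
    "backbone1.bottom_up.res2.0.conv1.norm.bias",
    "backbone2.bottom_up.res2.0.conv1.norm.bias",
]

# A single-backbone key is a suffix of both model keys, so it is ambiguous.
try:
    match_checkpoint_keys(model_keys, ["res2.0.conv1.norm.bias"])
except ValueError as err:
    print(err)

# Keys that already carry the backbone1/backbone2 prefix match exactly once.
print(match_checkpoint_keys(model_keys, list(model_keys)))
```

A checkpoint whose keys already distinguish backbone1 from backbone2 avoids the ambiguity, which is why a two-model checkpoint is needed here.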

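The fully supervised warmup needs to resume from MSRA/R-50.pkl (which is really just the backbone parameters) to ensure the model's initial feature-extraction ability.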

@erpingzi
Author

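Thank you so much for the explanation. I still have one remaining question:

  1. From reading the code, I understand that the model is (backbone1 + backbone2) and that MSRA/R-50.pkl is loaded from cfg.MODEL.WEIGHTS for transfer learning. However, MSRA/R-50.pkl contains pretrained weights for only one ResNet. How should this be handled? Do I need to modify the resume_or_load code myself to load it twice?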

@machengcheng2016
Owner

I just uploaded a merge_two_ckpts script that merges two checkpoints into one for use with resume. You can refer to it.
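
The contents of the merge_two_ckpts script are not shown in this thread. As a hedged sketch of the general idea only, the code below combines two single-backbone weight dictionaries into one by giving each a distinct key prefix. The function name `merge_with_prefixes` is hypothetical, plain dicts stand in for real checkpoint files, and the default prefixes are assumptions based on the key names in the error message above.

```python
import copy


def merge_with_prefixes(ckpt_a: dict, ckpt_b: dict,
                        prefix_a: str = "backbone1.bottom_up.",
                        prefix_b: str = "backbone2.bottom_up.") -> dict:
    """Combine two weight dicts into one, namespacing each by a prefix."""
    merged = {}
    for key, value in ckpt_a.items():
        # Deep-copy so the two backbones never share the same objects.
        merged[prefix_a + key] = copy.deepcopy(value)
    for key, value in ckpt_b.items():
        merged[prefix_b + key] = copy.deepcopy(value)
    return merged


# Placeholder values; a real checkpoint would hold weight arrays.
single = {"res2.0.conv1.norm.bias": [0.0, 0.1]}
merged = merge_with_prefixes(single, single)
print(sorted(merged))
```

Passing the same single-backbone weights twice, as in the usage lines, initializes both backbones identically; two different checkpoints can be passed instead.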

@erpingzi
Author

"I just uploaded a merge_two_ckpts script that merges two checkpoints into one for use with resume. You can refer to it."

Thank you very much, the problem is solved!
