mmyolo 0.1.0 RTMDet reimplementation error #139
Comments
We recommend using English or English & Chinese for issues so that we can have a broader discussion.
@huoshuai-dot Is it possible to reduce the batch size and try again?
With a single GPU, bs=16 works fine. But when multi-GPU training hangs, GPU memory stays at around 500 MB, whereas single-GPU training should use far more memory. It feels like something in data loading is causing the hang; reducing the batch size does not help.
@hhaAndroid One more observation: after I comment out the pretrained model (i.e., do not initialize weights from the ImageNet-pretrained model), training proceeds normally. Could this be related to loading the pretrained model?
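If the hang really is tied to loading the pretrained backbone, one common culprit is every rank trying to fetch the checkpoint at startup. A minimal sketch of prefetching it into the local torch hub cache before launching multi-GPU training (the function name is hypothetical, and the URL should come from the `init_cfg.checkpoint` field of your config, not from this example):

```python
from torch.hub import load_state_dict_from_url


def prefetch_checkpoint(url: str):
    """Download the checkpoint once (or reuse the local cache) before the
    distributed job starts, so worker ranks do not contend on the download."""
    return load_state_dict_from_url(url, map_location="cpu")


# Usage (URL illustrative only; take the real one from init_cfg.checkpoint):
# prefetch_checkpoint("https://download.openmmlab.com/.../cspnext_imagenet.pth")
```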
@huoshuai-dot I have not encountered this situation. Can you upload your training log?
@huoshuai-dot Is it possible that the PyTorch version is too high? Could you try switching to PyTorch 1.9?
My torch version is 1.12.1, which is indeed rather high. I will set up the Python environment following the README and try again.
@hhaAndroid After switching to torch 1.10 per the README, the problem is still there. What else could be causing this?
@hhaAndroid Hi, yesterday I set up the Docker image and ran an example, and hit the same problem. This time, after hanging for a long time, it reported the following error:
Resolved: it turned out to be an NCCL driver issue.
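For anyone hitting a similar hang: NCCL's own logging usually reveals driver-level problems before the job visibly stalls. A sketch of the standard NCCL debug environment variables (the config path in the comment is illustrative, not a file from this repository):

```shell
# Surface NCCL's initialization and transport-selection logs; driver-level
# problems usually show up here before the job hangs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# Then launch training as usual, e.g.:
# bash tools/dist_train.sh configs/rtmdet/<your_rtmdet_config>.py 8
```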
Prerequisite
💬 Describe the reimplementation questions
When training rtmdet on multiple GPUs with dist_train.sh, the job hangs. After force-interrupting, it shows multiple worker threads stuck, and GPU memory is never fully used. However, running train.py, or dist_train.sh pinned to a single GPU, trains normally.
The yolox task works fine; the problem only appears with rtmdet on multiple GPUs. How can I debug this?
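A useful first step when a multi-GPU job hangs is to take the training framework out of the picture and check whether a bare `torch.distributed` collective completes. A minimal sanity-check script (not part of mmyolo; the environment-variable defaults below are only so it can also run standalone as a one-process smoke test — `torchrun --nproc_per_node=8 check_dist.py` overrides them):

```python
import os

import torch
import torch.distributed as dist

# Defaults let the script run standalone as a 1-process smoke test;
# torchrun supplies real values when launching multiple workers.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}") if backend == "nccl" else torch.device("cpu")

t = torch.ones(1, device=device)
dist.all_reduce(t)  # a broken NCCL install hangs or errors right here
print(f"rank {rank}: all_reduce ok, value={t.item()}")
dist.destroy_process_group()
```

If this hangs with the `nccl` backend but runs with `gloo`, the problem is in NCCL or the driver rather than in the dataloader or the model.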
Environment
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce GTX TITAN X
CUDA_HOME: :/usr/local/cuda-10.2
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1
OpenCV: 4.6.0
MMEngine: 0.1.0
MMCV: 2.0.0rc1
MMDetection: 3.0.0rc1
MMYOLO: 0.1.1+
Expected results
No response
Additional information
No response