
YOLOv3 baseline training gets stuck #8

Closed
ShoufaChen opened this issue Nov 23, 2019 · 3 comments

@ShoufaChen

Hello,

When I run the YOLOv3 baseline training script:

python -m torch.distributed.launch --nproc_per_node=10 --master_port=287343 main.py \
        --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
        --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608

The process gets stuck at:

index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda

I am using 8 2080 Ti GPUs, and the state of the GPUs is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:1A:00.0 Off |                  N/A |
| 27%   31C    P8    20W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   29C    P8    18W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:3D:00.0 Off |                  N/A |
| 27%   30C    P8    23W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:3E:00.0 Off |                  N/A |
| 27%   29C    P8    13W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:88:00.0 Off |                  N/A |
| 27%   28C    P8     9W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:89:00.0 Off |                  N/A |
| 27%   30C    P8    17W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:B1:00.0 Off |                  N/A |
| 27%   29C    P8    11W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  On   | 00000000:B2:00.0 Off |                  N/A |
| 27%   31C    P8    25W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    159506      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    1    159507      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    2    159508      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    3    159509      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    4    159510      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    5    159511      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    6    159512      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    7    159513      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
+-----------------------------------------------------------------------------+
@ShoufaChen changed the title from "YOLOv3 baseline training doesn't start" to "YOLOv3 baseline training gets stuck" on Nov 23, 2019
@GOATmessi8
Owner

You should change --nproc_per_node=10 to --nproc_per_node=8 to match the number of GPUs. Also, this kind of deadlock is usually caused by the PyTorch DataLoader or OpenCV, so you may need to restart the training several times.
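
For reference, the launch command with the process count corrected to match the 8 visible GPUs would presumably look like the following; only --nproc_per_node changes, and every other argument is kept exactly as in the original report:

python -m torch.distributed.launch --nproc_per_node=8 --master_port=287343 main.py \
        --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
        --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608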

@ShoufaChen
Author

Thank you. It solved my problem.

@jwnirvana

Excuse me! I encountered an environment problem with 4 2080 Ti GPUs. Could you please tell me which versions of CUDA, cuDNN, Python, PyTorch, torchvision, and apex you used? Thanks very much.
