
YOLOv3 baseline training gets stuck #8

Closed
ShoufaChen opened this issue Nov 23, 2019 · 3 comments

@ShoufaChen

Hello,

When I run the YOLOv3 baseline training script:

python -m torch.distributed.launch --nproc_per_node=10 --master_port=287343 main.py \
        --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
        --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608

The process gets stuck at:

index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda

I am using 8 2080 Ti GPUs, and the state of the GPUs is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:1A:00.0 Off |                  N/A |
| 27%   31C    P8    20W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   29C    P8    18W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:3D:00.0 Off |                  N/A |
| 27%   30C    P8    23W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:3E:00.0 Off |                  N/A |
| 27%   29C    P8    13W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:88:00.0 Off |                  N/A |
| 27%   28C    P8     9W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:89:00.0 Off |                  N/A |
| 27%   30C    P8    17W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:B1:00.0 Off |                  N/A |
| 27%   29C    P8    11W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 208...  On   | 00000000:B2:00.0 Off |                  N/A |
| 27%   31C    P8    25W / 250W |   1186MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    159506      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    1    159507      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    2    159508      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    3    159509      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    4    159510      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    5    159511      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    6    159512      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
|    7    159513      C   ...hen/anaconda3/envs/pytorch13/bin/python  1175MiB |
+-----------------------------------------------------------------------------+
@ShoufaChen changed the title from "YOLOv3 baseline training doesn't start" to "YOLOv3 baseline training gets stuck" on Nov 23, 2019
@GOATmessi8
Owner

You should change --nproc_per_node=10 to --nproc_per_node=8 to match the number of GPUs. Also, this kind of deadlock is usually caused by the PyTorch DataLoader or OpenCV, so you may need to restart the training several times.
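
For reference, the launch command with the process count corrected to match the 8 visible GPUs would presumably look like the following; only --nproc_per_node changes, and every other argument is kept exactly as in the original report:

python -m torch.distributed.launch --nproc_per_node=8 --master_port=287343 main.py \
        --cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
        --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608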

@ShoufaChen
Author

Thank you. It solved my problem.

@jwnirvana

Excuse me! I encountered an environment problem with 4 2080 Ti GPUs. Could you please tell me which versions of CUDA, cuDNN, Python, PyTorch, torchvision, and apex you used? Thanks very much.
