Multi-GPU training hangs and does not continue #188

Closed
2 of 3 tasks
kaixin-bai opened this issue Sep 13, 2023 · 1 comment
@kaixin-bai

Search before asking

  • I have searched the issues and found no similar bug report.

Bug Component

No response

Describe the Bug

During multi-GPU training, the run hangs after the line W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8. No further output is printed and training does not proceed.

Training command:

(paddleseg) kb@gpu01:/data-r10/kb/Projects/PaddleYOLO$ python -m paddle.distributed.launch --log_dir=./log_vima_dir --gpus 0,1,2,3,4,5,6,7 tools/train.py -c ./configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml --eval --amp --use_vdl=True --vdl_log_dir=vdl_vima_dir/scalar

Log output:

LAUNCH INFO 2023-09-13 14:56:09,577 -----------  Configuration  ----------------------
LAUNCH INFO 2023-09-13 14:56:09,577 auto_parallel_config: None
LAUNCH INFO 2023-09-13 14:56:09,577 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_level: -1
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_timeout: 30
LAUNCH INFO 2023-09-13 14:56:09,577 gloo_port: 6767
LAUNCH INFO 2023-09-13 14:56:09,577 host: None
LAUNCH INFO 2023-09-13 14:56:09,577 ips: None
LAUNCH INFO 2023-09-13 14:56:09,577 job_id: default
LAUNCH INFO 2023-09-13 14:56:09,577 legacy: False
LAUNCH INFO 2023-09-13 14:56:09,577 log_dir: ./log_vima_dir
LAUNCH INFO 2023-09-13 14:56:09,577 log_level: INFO
LAUNCH INFO 2023-09-13 14:56:09,577 log_overwrite: False
LAUNCH INFO 2023-09-13 14:56:09,577 master: None
LAUNCH INFO 2023-09-13 14:56:09,577 max_restart: 3
LAUNCH INFO 2023-09-13 14:56:09,577 nnodes: 1
LAUNCH INFO 2023-09-13 14:56:09,577 nproc_per_node: None
LAUNCH INFO 2023-09-13 14:56:09,577 rank: -1
LAUNCH INFO 2023-09-13 14:56:09,577 run_mode: collective
LAUNCH INFO 2023-09-13 14:56:09,577 server_num: None
LAUNCH INFO 2023-09-13 14:56:09,577 servers: 
LAUNCH INFO 2023-09-13 14:56:09,578 start_port: 6070
LAUNCH INFO 2023-09-13 14:56:09,578 trainer_num: None
LAUNCH INFO 2023-09-13 14:56:09,578 trainers: 
LAUNCH INFO 2023-09-13 14:56:09,578 training_script: tools/train.py
LAUNCH INFO 2023-09-13 14:56:09,578 training_script_args: ['-c', './configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml', '--eval', '--amp', '--use_vdl=True', '--vdl_log_dir=vdl_vima_dir/scalar']
LAUNCH INFO 2023-09-13 14:56:09,578 with_gloo: 1
LAUNCH INFO 2023-09-13 14:56:09,578 --------------------------------------------------
LAUNCH INFO 2023-09-13 14:56:09,579 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-09-13 14:56:09,590 Run Pod: rqvwxn, replicas 8, status ready
LAUNCH INFO 2023-09-13 14:56:09,744 Watching Pod: rqvwxn, replicas 8, status running
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0913 14:56:11.313481 18935 tcp_utils.cc:181] The server starts to listen on IP_ANY:41751
I0913 14:56:11.313777 18935 tcp_utils.cc:130] Successfully connected to 10.3.15.202:41751
W0913 14:56:13.161374 18935 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 14:56:13.161420 18935 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
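
The first warning in the log above says the installed Paddle wheel was built for arch 70 75 80 86, while the device reports Compute Capability 6.1 (Pascal). Whether this mismatch is what causes the hang is not confirmed anywhere in this issue; the following is only a minimal sketch for reading those two log lines and checking coverage. The function names are hypothetical, and the rule used (same major version, compiled minor not greater than the device minor) is a simplification that ignores any PTX the wheel may carry.

```python
import re

# Fragments copied from the two warning lines in the log above.
ARCH_WARNING = ("The GPU architecture in your current machine is Pascal, which is not "
                "compatible with Paddle installation with arch: 70 75 80 86 , it is recommended")
CAPABILITY_LINE = ("Please NOTE: device: 0, GPU Compute Capability: 6.1, "
                   "Driver API Version: 11.7, Runtime API Version: 11.7")


def compiled_archs(warning):
    """Return the arch numbers listed after 'with arch:' (hypothetical helper)."""
    match = re.search(r"with arch: ((?:\d+\s+)+)", warning)
    return [int(tok) for tok in match.group(1).split()] if match else []


def device_arch(line):
    """Return (major, minor) parsed from 'Compute Capability: X.Y' (hypothetical helper)."""
    match = re.search(r"Compute Capability: (\d+)\.(\d+)", line)
    if match is None:
        raise ValueError("no compute capability found in line")
    return int(match.group(1)), int(match.group(2))


def is_covered(device, archs):
    """Assumed rule: binary code for X.y runs on device X.z when z >= y. PTX is ignored."""
    major, minor = device
    return any(a // 10 == major and a % 10 <= minor for a in archs)


if __name__ == "__main__":
    archs = compiled_archs(ARCH_WARNING)
    dev = device_arch(CAPABILITY_LINE)
    print("compiled archs:", archs)
    print("device arch:", dev)
    print("covered by wheel binaries:", is_covered(dev, archs))
```

Under these assumptions the check reports that 6.1 is not covered by any of the listed architectures, which matches the wording of the warning itself.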

GPU information:

Wed Sep 13 15:01:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 34%   60C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 36%   63C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 29%   52C    P2    73W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 33%   59C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:88:00.0 Off |                  N/A |
| 31%   55C    P2    82W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
| 37%   64C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 31%   55C    P2    77W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 32%   57C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18935      C   ...envs/paddleseg/bin/python      203MiB |
|    1   N/A  N/A     18937      C   ...envs/paddleseg/bin/python      203MiB |
|    2   N/A  N/A     18939      C   ...envs/paddleseg/bin/python      203MiB |
|    3   N/A  N/A     18941      C   ...envs/paddleseg/bin/python      203MiB |
|    4   N/A  N/A     18946      C   ...envs/paddleseg/bin/python      203MiB |
|    5   N/A  N/A     18948      C   ...envs/paddleseg/bin/python      203MiB |
|    6   N/A  N/A     18954      C   ...envs/paddleseg/bin/python      203MiB |
|    7   N/A  N/A     18957      C   ...envs/paddleseg/bin/python      203MiB |
+-----------------------------------------------------------------------------+

Checking CPU usage with htop shows that a fixed set of cores is pinned at 100% and never drops.
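
Since the processes stay busy but print nothing, one way to see where each rank is stuck is to have Python dump its own stack traces after a delay. The following is a minimal sketch using the standard-library faulthandler module. It assumes the two helper functions (hypothetical names) are pasted near the top of tools/train.py, and it only shows Python-level frames, so time spent inside native CUDA or communication code appears as the Python call that entered it.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120.0, stream=None):
    """Dump every thread's Python stack each time timeout_s elapses.

    Hypothetical helper: call it once at startup. The stream must be a real
    file (it needs a file descriptor); it defaults to stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once the first training step has completed."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    import time

    arm_hang_watchdog(timeout_s=1.0)
    time.sleep(2.5)  # stands in for the hung initialization
    disarm_hang_watchdog()
```

With the launch command used in this issue, each rank's stderr should end up in the per-rank logs under the directory given by --log_dir, so the dumps from all eight processes can be compared there.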

Environment

paddlepaddle-gpu 2.5.1.post117 pypi_0 pypi
cudatoolkit 11.7.0 hd8887f6_10 nvidia

Bug description confirmation

  • I confirm that the bug reproduction steps, a description of code changes, and environment information have been provided, and that the problem can be reproduced.

Are you willing to submit a PR?

  • I'd like to help by submitting a PR!
@nemonameless
Collaborator
