LAUNCH INFO 2023-09-13 14:56:09,577 ----------- Configuration ----------------------
LAUNCH INFO 2023-09-13 14:56:09,577 auto_parallel_config: None
LAUNCH INFO 2023-09-13 14:56:09,577 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_level: -1
LAUNCH INFO 2023-09-13 14:56:09,577 elastic_timeout: 30
LAUNCH INFO 2023-09-13 14:56:09,577 gloo_port: 6767
LAUNCH INFO 2023-09-13 14:56:09,577 host: None
LAUNCH INFO 2023-09-13 14:56:09,577 ips: None
LAUNCH INFO 2023-09-13 14:56:09,577 job_id: default
LAUNCH INFO 2023-09-13 14:56:09,577 legacy: False
LAUNCH INFO 2023-09-13 14:56:09,577 log_dir: ./log_vima_dir
LAUNCH INFO 2023-09-13 14:56:09,577 log_level: INFO
LAUNCH INFO 2023-09-13 14:56:09,577 log_overwrite: False
LAUNCH INFO 2023-09-13 14:56:09,577 master: None
LAUNCH INFO 2023-09-13 14:56:09,577 max_restart: 3
LAUNCH INFO 2023-09-13 14:56:09,577 nnodes: 1
LAUNCH INFO 2023-09-13 14:56:09,577 nproc_per_node: None
LAUNCH INFO 2023-09-13 14:56:09,577 rank: -1
LAUNCH INFO 2023-09-13 14:56:09,577 run_mode: collective
LAUNCH INFO 2023-09-13 14:56:09,577 server_num: None
LAUNCH INFO 2023-09-13 14:56:09,577 servers:
LAUNCH INFO 2023-09-13 14:56:09,578 start_port: 6070
LAUNCH INFO 2023-09-13 14:56:09,578 trainer_num: None
LAUNCH INFO 2023-09-13 14:56:09,578 trainers:
LAUNCH INFO 2023-09-13 14:56:09,578 training_script: tools/train.py
LAUNCH INFO 2023-09-13 14:56:09,578 training_script_args: ['-c', './configs/yolov5/yolov5_s_80e_ssod_finetune_vima_coco.yml', '--eval', '--amp', '--use_vdl=True', '--vdl_log_dir=vdl_vima_dir/scalar']
LAUNCH INFO 2023-09-13 14:56:09,578 with_gloo: 1
LAUNCH INFO 2023-09-13 14:56:09,578 --------------------------------------------------
LAUNCH INFO 2023-09-13 14:56:09,579 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-09-13 14:56:09,590 Run Pod: rqvwxn, replicas 8, status ready
LAUNCH INFO 2023-09-13 14:56:09,744 Watching Pod: rqvwxn, replicas 8, status running
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0913 14:56:11.313481 18935 tcp_utils.cc:181] The server starts to listen on IP_ANY:41751
I0913 14:56:11.313777 18935 tcp_utils.cc:130] Successfully connected to 10.3.15.202:41751
W0913 14:56:13.161374 18935 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 14:56:13.161420 18935 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
GPU info:
Wed Sep 13 15:01:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 34%   60C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 36%   63C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 29%   52C    P2    73W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 33%   59C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:88:00.0 Off |                  N/A |
| 31%   55C    P2    82W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
| 37%   64C    P2    79W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 31%   55C    P2    77W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 32%   57C    P2    75W / 250W |    205MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     18935      C   ...envs/paddleseg/bin/python      203MiB |
|    1   N/A  N/A     18937      C   ...envs/paddleseg/bin/python      203MiB |
|    2   N/A  N/A     18939      C   ...envs/paddleseg/bin/python      203MiB |
|    3   N/A  N/A     18941      C   ...envs/paddleseg/bin/python      203MiB |
|    4   N/A  N/A     18946      C   ...envs/paddleseg/bin/python      203MiB |
|    5   N/A  N/A     18948      C   ...envs/paddleseg/bin/python      203MiB |
|    6   N/A  N/A     18954      C   ...envs/paddleseg/bin/python      203MiB |
|    7   N/A  N/A     18957      C   ...envs/paddleseg/bin/python      203MiB |
+-----------------------------------------------------------------------------+
I confirm that the bug reproduction steps, a description of the code changes, and the environment information have been provided, and that the problem can be reproduced.
Are you willing to submit a PR?
I'd like to help by submitting a PR!
Search before asking
Bug Component
No response
Describe the Bug
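During multi-GPU training, the job hangs at
W0913 14:56:13.162582 18935 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
After this line no further output is printed and training does not proceed. Training command:
Log output:
GPU info:
Checking CPU usage with
htop
shows that a fixed set of cores jumps to 100% and never drops.
Environment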
paddlepaddle-gpu 2.5.1.post117 pypi_0 pypi
cudatoolkit 11.7.0 hd8887f6_10 nvidia
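The log also contains a warning that the installed Paddle wheel was built for arch 70 75 80 86 while the GPUs report Compute Capability 6.1. Whether this mismatch is related to the hang is not established here. The sketch below is a minimal, hypothetical helper (the function names are not part of Paddle or PaddleDetection) that extracts both values from the two warning lines and checks whether the device architecture is in the compiled list. It assumes an exact match is required and ignores any PTX forward compatibility.

```python
import re

# The two warning lines, shortened from the log in this report.
ARCH_WARNING = (
    "The GPU architecture in your current machine is Pascal, which is not "
    "compatible with Paddle installation with arch: 70 75 80 86 , it is recommended"
)
CAPABILITY_LINE = (
    "Please NOTE: device: 0, GPU Compute Capability: 6.1, "
    "Driver API Version: 11.7, Runtime API Version: 11.7"
)


def parse_compiled_archs(line):
    """Return the arch numbers listed after 'with arch:' (empty list if absent)."""
    match = re.search(r"with arch:\s*((?:\d+\s*)+)", line)
    return [int(tok) for tok in match.group(1).split()] if match else []


def parse_compute_capability(line):
    """Return the compute capability as an int, e.g. '6.1' -> 61 (None if absent)."""
    match = re.search(r"GPU Compute Capability:\s*(\d+)\.(\d+)", line)
    return int(match.group(1)) * 10 + int(match.group(2)) if match else None


def is_arch_covered(device_arch, compiled_archs):
    """Assumption: the device arch must appear verbatim in the compiled list."""
    return device_arch in compiled_archs


compiled = parse_compiled_archs(ARCH_WARNING)
device = parse_compute_capability(CAPABILITY_LINE)
print(f"device arch {device}, wheel archs {compiled}, "
      f"covered: {is_arch_covered(device, compiled)}")
```

For the lines in this report the check comes out as not covered, which matches what the Paddle warning itself states.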
Bug description confirmation