
[YOLOv8n: slow training on a single machine with 8 cards] #277

Open
Mr-shen-yyds opened this issue Mar 26, 2024 · 6 comments

@Mr-shen-yyds

Mr-shen-yyds commented Mar 26, 2024

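I. Symptoms
1. Training on a single machine with a single card takes a normal amount of time.

2. Training on a single machine with 8 cards hangs in the graph compilation stage. Compilation takes so long that link establishment eventually times out (default static graph mode).
Launch command: mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --data_dir /home/code/coco --is_parallel True

image

3. Training on a single machine with 8 cards after switching the computation graph to dynamic mode: iterations run, but they are slow.
image

4. In the same environment, single-machine 8-card training of the densenet121 model runs normally.
image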

@yuedongli1
Collaborator

You can use MindSpore version 2.2.12.B010.

@Mr-shen-yyds
Author

Mr-shen-yyds commented Mar 26, 2024 via email

@Mr-shen-yyds
Author

Mr-shen-yyds commented Mar 27, 2024 via email

@Mr-shen-yyds
Author

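Alternatively, it would also be fine if you could provide some earlier versions that are known to work. I will switch my whole environment over.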

@zhanghuiyao
Collaborator

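We recommend using the same versions as those listed in the README. Also, all MindSpore installation packages are available from the official website: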
https://www.mindspore.cn/versions

@zhanghuiyao
Collaborator

zhanghuiyao commented Apr 7, 2024

If the timeout is caused by slow compilation, you can try adjusting this environment variable to set the timeout. The unit is seconds.
export HCCL_CONNECT_TIMEOUT=7200
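
For anyone who prefers to set this from a script instead of the shell, here is a minimal sketch of how the variable could be attached to the launch command from the original report. The helper name `build_launch` is hypothetical and not part of the project. The sketch also assumes that on a single machine the processes started by mpirun inherit the environment passed to it, which may not hold for every MPI setup.

```python
import os
import shlex

# Launch command copied verbatim from the original report.
LAUNCH_CMD = (
    "mpirun --allow-run-as-root -n 8 python train.py "
    "--config ./configs/yolov8/yolov8n.yaml --device_target Ascend "
    "--data_dir /home/code/coco --is_parallel True"
)


def build_launch(timeout_s=7200, base_env=None):
    """Return (argv, env) for the 8-card run with a longer HCCL connect timeout.

    Hypothetical helper for illustration only. `timeout_s` is in seconds,
    matching the unit stated in the comment above.
    """
    if not isinstance(timeout_s, int) or timeout_s <= 0:
        raise ValueError("timeout_s must be a positive integer (seconds)")
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["HCCL_CONNECT_TIMEOUT"] = str(timeout_s)
    # Split the command string into an argument list without using a shell.
    return shlex.split(LAUNCH_CMD), env
```

The returned pair can then be handed to `subprocess.run(argv, env=env, check=True)` to start the job. This only gives the slow graph compilation more time to finish before link establishment times out. It does not make compilation itself faster.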
