[shufflenetv1] [Ascend910] [GRAPH] Distributed train failed #699

Open
787918582 opened this issue Jul 4, 2023 · 2 comments
Labels
bug Something isn't working

787918582 commented Jul 4, 2023

If this is your first time, please read our contributor guidelines:
https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory)
Distributed training fails with an error for both shufflenet_v1_0_5 and shufflenet_v1_1_0.

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.0.0, mindcv_0.2.2
    -- Python version (e.g., Python 3.7.5): 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS 2.8
    -- GCC/Compiler version (if compiled from source): 7.3.0

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/shufflenetv1/shufflenet_v1_1.0_ascend.yaml --distribute True --data_dir /ImageNet_Origin/
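For anyone re-running this against both affected models, the launch command above can be assembled with a small helper. This is only a sketch: the 1.0 config path, device count, and data directory are copied from the step above, while the 0.5 config path and the `build_command` helper are assumptions that follow the same naming pattern and are not confirmed in this report.

```python
import shlex

# Values copied from the reproduction step in this report.
NUM_DEVICES = 8
DATA_DIR = "/ImageNet_Origin/"

# The 1.0 path is quoted from the report. The 0.5 path is an ASSUMPTION
# based on the same naming pattern; adjust it if your checkout differs.
CONFIGS = {
    "shufflenet_v1_1_0": "configs/shufflenetv1/shufflenet_v1_1.0_ascend.yaml",
    "shufflenet_v1_0_5": "configs/shufflenetv1/shufflenet_v1_0.5_ascend.yaml",
}


def build_command(model):
    """Return the mpirun argument list that launches distributed training.

    `build_command` is a hypothetical helper, not part of MindCV.
    """
    return [
        "mpirun", "--allow-run-as-root", "-n", str(NUM_DEVICES),
        "python", "train.py",
        "--config", CONFIGS[model],
        "--distribute", "True",
        "--data_dir", DATA_DIR,
    ]


if __name__ == "__main__":
    # Print one shell-ready command per model; run them from the MindCV root.
    for model in CONFIGS:
        print(" ".join(shlex.quote(arg) for arg in build_command(model)))
```

The printed lines can be pasted into a shell, or the argument list can be passed to `subprocess.run` to capture the log for each model separately.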

Expected behavior (Mandatory)
The full distributed training run completes without errors.

Screenshots / Logs (Mandatory)
[Screenshot: shufflenetv1]

Additional context (Optional)
The error also reproduces on v2.1.0, v2.2.0, and v2.2.1.

787918582 added the bug label Jul 4, 2023
787918582 changed the title from "[shufflenetv1_1_0] [Ascend910] [PYNATIVE] Distributed train failed" to "[shufflenetv1_1_0] [Ascend910] [GRAPH] Distributed train failed" Jul 31, 2023
787918582 changed the title from "[shufflenetv1_1_0] [Ascend910] [GRAPH] Distributed train failed" to "[shufflenetv1] [Ascend910] [GRAPH] Distributed train failed" Jul 31, 2023

tacyi commented Jan 10, 2024

The error also reproduces on MindSpore 2.2.10.B180.


tacyi commented Jan 22, 2024

Training also fails on MindSpore_v2.2.10.B180 with the following error:
RuntimeError: Found inconsistent format or data type! Op: Mul[@kernel_graph_2:207{[0]: ValueNode Mul, [1]: equiv_207, [2]: ValueNode Tensor(shape=[], dtype=Float32, value=0.04096)}],ame: Default/network-TrainOneStepCell/optimizer-Momentum/Mul-op1711
