[hrnet] [Ascend910] [GRAPH] Distributed train failed #746

Open
787918582 opened this issue Nov 22, 2023 · 2 comments
Labels
bug Something isn't working

787918582 commented Nov 22, 2023

If this is your first time, please read our contributor guidelines:
https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory)
A clear and concise description of what the bug is.
Both hrnet_w32 and hrnet_w48 fail with an error when running distributed training in static graph (GRAPH) mode.

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.2.1 mindcv_0.2.2
    -- Python version (e.g., Python 3.7.5): 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS2.8
    -- GCC/Compiler version (if compiled from source): 7.3.0

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory)
Distributed training in static graph mode runs through without errors.

Screenshots / Logs (Mandatory)
If applicable, add screenshots to help explain your problem.
[2023-11-19 10:29:13] mindcv.scheduler.scheduler_factory WARNING - warmup_epochs + decay_epochs > num_epochs. Please check and reduce decay_epochs!
[2023-11-19 10:29:16] mindcv.train INFO - Essential Experiment Configurations:
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 0
Distributed mode: True
Number of devices: 8
Number of training samples: 800000
Number of validation samples: None
Number of classes: 1000
Number of batches: 781
Batch size: 128
Auto augment: randaug-m7-mstd0.5
MixUp: 0.2
CutMix: 1.0
Model: hrnet_w32
Model parameters: 41303464
Number of epochs: 5
Optimizer: adamw
Learning rate: 0.001
LR Scheduler: cosine_decay
Momentum: 0.9
Weight decay: 0.05
Auto mixed precision: O2
Loss scale: 1024(fixed)
[2023-11-19 10:29:16] mindcv.train INFO - Start training
[ERROR] PIPELINE(171895,ffff914f2190,python):2023-11-19-10:29:53.881.102 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171893,ffffbe9fb190,python):2023-11-19-10:29:54.378.528 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171887,ffff9b3ad190,python):2023-11-19-10:29:54.825.669 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171889,ffff87cee190,python):2023-11-19-10:29:55.189.347 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171890,ffff91938190,python):2023-11-19-10:29:55.439.711 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171894,ffff929f0190,python):2023-11-19-10:29:55.738.301 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171888,ffff8a2c7190,python):2023-11-19-10:29:56.666.323 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[ERROR] PIPELINE(171891,ffffb5509190,python):2023-11-19-10:29:57.019.842 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branches[i](x[i])
[WARNING] MD(171895,fffc8ffff1e0,python):2023-11-19-10:30:19.682.318 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:1168] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result GetNext timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
Traceback (most recent call last):
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 323, in <module>
    train(args)
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 309, in train
    trainer.train(args.epoch_size, loader_train, callbacks=callbacks, dataset_sink_mode=args.dataset_sink_mode)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 623, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 680, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 997, in compile
    _cell_graph_executor.compile(self, phase=self.phase,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/common/api.py", line 1547, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: For operation 'setitem', current input arguments types are <Tuple, Number, Tensor>. The 1-th argument type 'Tuple' is not supported now.
the support arguments types of 'setitem' operation as follows:
<List, Number, Number>
<List, Number, String>
<List, Number, List>
<List, Number, Tuple>
<List, Number, Tensor>
<List, Slice, Number>
<List, Slice, List>
<List, Slice, Tuple>
<List, Slice, Tensor>
<Tensor, None, Number>
<Tensor, None, List>
<Tensor, None, Tuple>
<Tensor, None, Tensor>
<Tensor, Ellipsis, Number>
<Tensor, Ellipsis, List>
<Tensor, Ellipsis, Tuple>
<Tensor, Ellipsis, Tensor>
<Tensor, Number, Number>
<Tensor, Number, List>
<Tensor, Number, Tuple>
<Tensor, Number, Tensor>
<Tensor, List, Number>
<Tensor, List, List>
<Tensor, List, Tuple>
<Tensor, List, Tensor>
<Tensor, Tuple, Number>
<Tensor, Tuple, List>
<Tensor, Tuple, Tuple>
<Tensor, Tuple, Tensor>
<Tensor, Slice, Number>
<Tensor, Slice, List>
<Tensor, Slice, Tuple>
<Tensor, Slice, Tensor>
<Tensor, Tensor, Number>
<Tensor, Tensor, List>
<Tensor, Tensor, Tuple>
<Tensor, Tensor, Tensor>
<Dictionary, Number, Number>
<Dictionary, Number, List>
<Dictionary, Number, Tuple>
<Dictionary, Number, Tensor>
<Dictionary, Number, Dictionary>
<Dictionary, String, Number>
<Dictionary, String, List>
<Dictionary, String, Tuple>
<Dictionary, String, Tensor>
<Dictionary, String, Dictionary>
<Dictionary, Tuple, Number>
<Dictionary, Tuple, List>
<Dictionary, Tuple, Tuple>
<Dictionary, Tuple, Tensor>
<Dictionary, Tuple, Dictionary>
<Dictionary, Tensor, Number>
<Dictionary, Tensor, List>
<Dictionary, Tensor, Tuple>
<Dictionary, Tensor, Tensor>
<Dictionary, Tensor, Dictionary>
<MapTensor, Tensor, Tensor>
For more details with 'setitem', please refer to https://mindspore.cn/search/en?inputValue=Index%20value%20assignment
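
The error above reports that `setitem` received a Tuple as its first argument, while the supported list only covers List, Tensor, Dictionary and MapTensor containers. The following is a plain-Python sketch of that pattern with no MindSpore dependency. It assumes the failing statement assigns each branch output back into a tuple of branch inputs; `run_branches_in_place` and `run_branches_rebuilt` are hypothetical names used only for illustration, and this is not presented as the maintainers' fix.

```python
from typing import Callable, List, Sequence


def run_branches_in_place(x, branches):
    """Mimics the failing pattern: writes each result back into the input container."""
    for i in range(len(branches)):
        x[i] = branches[i](x[i])  # item assignment; not possible when x is a tuple
    return x


def run_branches_rebuilt(
    x: Sequence[float], branches: Sequence[Callable[[float], float]]
) -> List[float]:
    """Possible workaround: collect results in a new list, leaving the input untouched."""
    return [branch(item) for branch, item in zip(branches, x)]


# Stand-ins for the per-branch sub-networks (assumption: one callable per input).
branches = [lambda v: v + 1, lambda v: v * 2]
inputs = (1.0, 2.0)  # a tuple, like the argument type named in the error

try:
    run_branches_in_place(inputs, branches)
    in_place_ok = True
except TypeError:
    # Python tuples are immutable, so item assignment is rejected here too.
    in_place_ok = False

rebuilt = run_branches_rebuilt(inputs, branches)
print(in_place_ok, rebuilt)
```

Building a new list avoids any dependence on whether the incoming container is mutable, which is why it is shown here as one option; converting the input with `list(x)` before the loop would be another.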

Additional context (Optional)
Add any other context about the problem here.

787918582 added the bug label Nov 22, 2023

tacyi commented Jan 10, 2024

Reproduced this error on ms2.2.10.B180.

tacyi commented Jan 22, 2024

Full training run succeeded on MindSpore_v2.2.10.B180.
