[hrnet] [Ascend910] [GRAPH] Distributed train failed #746

787918582 · 2023-11-22T07:44:25Z

If this is your first time, please read our contributor guidelines:
https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory)
A clear and concise description of what the bug is.
Both hrnet_w32 and hrnet_w48 fail when running distributed training in static graph (GRAPH) mode.

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.2.1, mindcv_0.2.2
-- Python version (e.g., Python 3.7.5): 3.7.5
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS 2.8
-- GCC/Compiler version (if compiled from source): 7.3.0
Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

To Reproduce (Mandatory)
Steps to reproduce the behavior:

mpirun --allow-run-as-root -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory)
Distributed training in static graph mode runs to completion without errors.

Screenshots / Logs (Mandatory)
If applicable, add screenshots to help explain your problem.
[2023-11-19 10:29:13] mindcv.scheduler.scheduler_factory WARNING - warmup_epochs + decay_epochs > num_epochs. Please check and reduce decay_epochs!
[2023-11-19 10:29:16] mindcv.train INFO - Essential Experiment Configurations:
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 0
Distributed mode: True
Number of devices: 8
Number of training samples: 800000
Number of validation samples: None
Number of classes: 1000
Number of batches: 781
Batch size: 128
Auto augment: randaug-m7-mstd0.5
MixUp: 0.2
CutMix: 1.0
Model: hrnet_w32
Model parameters: 41303464
Number of epochs: 5
Optimizer: adamw
Learning rate: 0.001
LR Scheduler: cosine_decay
Momentum: 0.9
Weight decay: 0.05
Auto mixed precision: O2
Loss scale: 1024(fixed)
[2023-11-19 10:29:16] mindcv.train INFO - Start training
[ERROR] PIPELINE(171895,ffff914f2190,python):2023-11-19-10:29:53.881.102 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171893,ffffbe9fb190,python):2023-11-19-10:29:54.378.528 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171887,ffff9b3ad190,python):2023-11-19-10:29:54.825.669 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171889,ffff87cee190,python):2023-11-19-10:29:55.189.347 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171890,ffff91938190,python):2023-11-19-10:29:55.439.711 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171894,ffff929f0190,python):2023-11-19-10:29:55.738.301 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171888,ffff8a2c7190,python):2023-11-19-10:29:56.666.323 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171891,ffffb5509190,python):2023-11-19-10:29:57.019.842 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[WARNING] MD(171895,fffc8ffff1e0,python):2023-11-19-10:30:19.682.318 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:1168] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result GetNext timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
Traceback (most recent call last):
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 323, in <module>
    train(args)
  File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 309, in train
    trainer.train(args.epoch_size, loader_train, callbacks=callbacks, dataset_sink_mode=args.dataset_sink_mode)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 623, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 680, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
    self.compile(*args, **kwargs)
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/nn/cell.py", line 997, in compile
    _cell_graph_executor.compile(self, phase=self.phase,
  File "/root/archiconda3/envs/Python380/lib/python3.8/site-packages/mindspore/common/api.py", line 1547, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: For operation 'setitem', current input arguments types are <Tuple, Number, Tensor>. The 1-th argument type 'Tuple' is not supported now.
the support arguments types of 'setitem' operation as follows:
<List, Number, Number>
<List, Number, String>
<List, Number, List>
<List, Number, Tuple>
<List, Number, Tensor>
<List, Slice, Number>
<List, Slice, List>
<List, Slice, Tuple>
<List, Slice, Tensor>
<Tensor, None, Number>
<Tensor, None, List>
<Tensor, None, Tuple>
<Tensor, None, Tensor>
<Tensor, Ellipsis, Number>
<Tensor, Ellipsis, List>
<Tensor, Ellipsis, Tuple>
<Tensor, Ellipsis, Tensor>
<Tensor, Number, Number>
<Tensor, Number, List>
<Tensor, Number, Tuple>
<Tensor, Number, Tensor>
<Tensor, List, Number>
<Tensor, List, List>
<Tensor, List, Tuple>
<Tensor, List, Tensor>
<Tensor, Tuple, Number>
<Tensor, Tuple, List>
<Tensor, Tuple, Tuple>
<Tensor, Tuple, Tensor>
<Tensor, Slice, Number>
<Tensor, Slice, List>
<Tensor, Slice, Tuple>
<Tensor, Slice, Tensor>
<Tensor, Tensor, Number>
<Tensor, Tensor, List>
<Tensor, Tensor, Tuple>
<Tensor, Tensor, Tensor>
<Dictionary, Number, Number>
<Dictionary, Number, List>
<Dictionary, Number, Tuple>
<Dictionary, Number, Tensor>
<Dictionary, Number, Dictionary>
<Dictionary, String, Number>
<Dictionary, String, List>
<Dictionary, String, Tuple>
<Dictionary, String, Tensor>
<Dictionary, String, Dictionary>
<Dictionary, Tuple, Number>
<Dictionary, Tuple, List>
<Dictionary, Tuple, Tuple>
<Dictionary, Tuple, Tensor>
<Dictionary, Tuple, Dictionary>
<Dictionary, Tensor, Number>
<Dictionary, Tensor, List>
<Dictionary, Tensor, Tuple>
<Dictionary, Tensor, Tensor>
<Dictionary, Tensor, Dictionary>
<MapTensor, Tensor, Tensor>
For more details with 'setitem', please refer to https://mindspore.cn/search/en?inputValue=Index%20value%20assignment

Additional context (Optional)
Add any other context about the problem here.
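
The RuntimeError above says that 'setitem' was attempted on a Tuple, and the PIPELINE errors point at an assignment of the form x[i] = ... applied to the branch inputs (the script fragment in the log appears to have lost its brackets in rendering, so the exact expression is an assumption). The following is a minimal plain-Python sketch of that pattern and of one possible workaround, collecting the branch outputs into a new list rather than assigning into the input container. It does not use MindSpore, the function names and the stand-in branches are hypothetical, and nothing in this thread confirms that this is how the problem was actually resolved.

```python
# Hypothetical stand-ins: plain callables play the role of the model's branches.
def forward_in_place(branches, x):
    """Mirrors the failing pattern: item assignment on the input container."""
    for i in range(len(branches)):
        x[i] = branches[i](x[i])  # raises TypeError when x is a tuple
    return x


def forward_rebuilt(branches, x):
    """Collects outputs in a new list, so x may be either a tuple or a list."""
    out = []
    for i in range(len(branches)):
        out.append(branches[i](x[i]))
    return out


branches = [lambda t: t + 1, lambda t: t * 2]
inputs = (10, 20)  # a tuple, analogous to the reported <Tuple, Number, Tensor> case

try:
    forward_in_place(branches, inputs)
    in_place_failed = False
except TypeError:
    in_place_failed = True

result = forward_rebuilt(branches, inputs)
print(in_place_failed, result)
```

Converting the input with list(x) before the loop would be another option under the same assumption; either way the container being indexed for assignment is a list, which the supported-types table above does list for 'setitem'.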

tacyi · 2024-01-10T03:03:03Z

Reproduced this error on MindSpore 2.2.10.B180.

tacyi · 2024-01-22T06:15:27Z

Full training run succeeded on MindSpore_v2.2.10.B180.

787918582 added the "bug" label on Nov 22, 2023
