[Bug]: get_rank_by_dim_and_process_id function is not implemented #8428

jazzly · 2024-05-13T06:32:02Z

Software environment

- paddlepaddle: 2.6.1
- paddlepaddle-gpu: 
- paddlenlp: 2.8.0

Duplicate issues

I have searched the existing issues

Error description

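I am using the versions above to train on the sample training data from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme
In CPU mode, the default command trains with only a single thread. To speed up training, I looked for options and found an enable_auto_parallel parameter. When enable_auto_parallel is set to True, starting training fails with an error saying that get_rank_by_dim_and_process_id cannot be found.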

Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main()
  File "train.py", line 166, in main
    trainer = Trainer(
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 388, in __init__
    self.print_config()
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 3058, in print_config
    v = getattr(args, a)
  File "/home/user/.local/lib/python3.8/site-packages/paddlenlp/trainer/training_args.py", line 1524, in data_parallel_rank
    return mesh.get_rank_by_dim_and_process_id("dp", dist.get_rank())
AttributeError: 'ProcessMesh' object has no attribute 'get_rank_by_dim_and_process_id'
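
The traceback shows the failure happening when a mesh object is asked for a method it does not have. As a minimal sketch only, and not the PaddleNLP or Paddle implementation, the snippet below illustrates how a caller could guard such a lookup with getattr and fall back to a default rank. The classes MeshWithoutMethod and MeshWithMethod and the helper data_parallel_rank_or_default are hypothetical stand-ins invented for illustration; only the method name get_rank_by_dim_and_process_id and the "dp" dimension name come from the traceback.

```python
class MeshWithoutMethod:
    """Hypothetical stand-in for a ProcessMesh that lacks the method named in the traceback."""


class MeshWithMethod:
    """Hypothetical stand-in for a ProcessMesh that does provide the method."""

    def get_rank_by_dim_and_process_id(self, dim_name, process_id):
        # Illustrative only: pretend every process has rank 3 along the requested dimension.
        return 3


def data_parallel_rank_or_default(mesh, process_id, dim_name="dp", default=0):
    """Return the rank along dim_name, or default if the mesh has no such method."""
    getter = getattr(mesh, "get_rank_by_dim_and_process_id", None)
    if getter is None:
        # The method is missing, which is the situation reported in this issue.
        return default
    return getter(dim_name, process_id)


print(data_parallel_rank_or_default(MeshWithoutMethod(), process_id=0))  # 0
print(data_parallel_rank_or_default(MeshWithMethod(), process_id=0))     # 3
```

Whether falling back to rank 0 is acceptable depends on the training setup; it is assumed here only to keep the sketch short.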

Steps to reproduce & code

Enable the enable_auto_parallel parameter when training:
python3 train.py \
    --do_train \
    --do_eval \
    --do_export \
    --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
    --output_dir checkpoint \
    --device cpu \
    --num_train_epochs 100 \
    --early_stopping True \
    --early_stopping_patience 5 \
    --learning_rate 3e-5 \
    --max_length 128 \
    --per_device_eval_batch_size 32 \
    --per_device_train_batch_size 32 \
    --metric_for_best_model accuracy \
    --load_best_model_at_end \
    --logging_steps 5 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --save_total_limit 3 \
    --enable_auto_parallel True

github-actions · 2024-07-13T00:18:03Z

This issue is stale because it has been open for 60 days with no activity.

github-actions · 2024-07-28T00:20:08Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

jazzly added the bug label May 13, 2024

paddle-bot bot assigned wj-Mcat May 13, 2024

jazzly mentioned this issue May 13, 2024

[Question]: Error at "Loading best model from checkpoint" after multi-process training started with paddle.distributed.launch finishes #8429

Closed

github-actions bot added the stale label Jul 13, 2024

github-actions bot closed this as not planned Jul 28, 2024
