
[BUG]: TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode' #2590

Closed
ivrschool opened this issue Feb 6, 2023 · 5 comments · Fixed by #2658
Labels
bug Something isn't working

Comments

@ivrschool

🐛 Describe the bug

I am able to run the training job with zero1 and zero2, but I am hitting this error with the 'colossalai' distplan:

```
Traceback (most recent call last):
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 373, in <module>
    main()
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 288, in main
    model, optimizer = build_gemini(model, tp_pg, args.placement, args.tp_degree == 1)
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 195, in build_gemini
    model = GeminiDDP(model,
TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode'
(the traceback above is repeated verbatim by each of the 4 worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2755) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py",line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
```
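
For context: the main-branch example script passes `strict_ddp_mode` to `GeminiDDP`, a keyword that the 0.1.12 constructor does not define, which is exactly what the TypeError reports. One version-tolerant workaround is to filter the kwargs before the call; the sketch below is a minimal illustration (the `build_gemini_kwargs` helper and the `OldGeminiDDP` stand-in are hypothetical, not Colossal-AI API):

```python
import inspect

def build_gemini_kwargs(gemini_cls, **desired):
    """Keep only the keyword arguments the installed constructor accepts.

    Lets one script run on Colossal-AI versions that predate
    `strict_ddp_mode` (e.g. 0.1.12) as well as newer ones.
    """
    accepted = inspect.signature(gemini_cls.__init__).parameters
    return {k: v for k, v in desired.items() if k in accepted}

# Hypothetical stand-in mimicking the 0.1.12 constructor (no strict_ddp_mode):
class OldGeminiDDP:
    def __init__(self, module, placement_policy="cuda"):
        self.module = module

kwargs = build_gemini_kwargs(OldGeminiDDP,
                             placement_policy="cuda",
                             strict_ddp_mode=True)  # silently dropped on 0.1.12
model = OldGeminiDDP(object(), **kwargs)
print(kwargs)  # {'placement_policy': 'cuda'}
```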

### Environment

I am using this branch: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt

I am using an AWS Ubuntu Deep Learning (PyTorch 1.12.1) EC2 g5.12xlarge instance (4 GPUs, 96 GB GPU memory, 48 vCPUs, 192 GB RAM):

```
Colossal-AI version: 0.1.12
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.6
CUDA Version required by PyTorch: 11.6
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: x
----------------------------
CUDA Extension: ✓
```
@ivrschool ivrschool added the bug Something isn't working label Feb 6, 2023
@JThh
Contributor

JThh commented Feb 6, 2023

Can you try upgrading your Colossal-AI to 0.2.0 and pulling the latest changes to the gpt example accordingly?
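
A quick way to confirm which version is actually active in the environment before re-running (a minimal sketch; assumes the package was installed from PyPI under the name `colossalai`):

```python
from importlib.metadata import version

# `strict_ddp_mode` was added after 0.1.12, so the main-branch GPT example
# needs 0.2.0 or newer; check what pip actually resolved:
print("colossalai", version("colossalai"))
# Expected after upgrading: colossalai 0.2.0 (or newer)
```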

@ivrschool
Author

ivrschool commented Feb 6, 2023

Thanks @JThh. It works after upgrading to 0.2.0.

Can you please update the README.md here accordingly?

Also, can you help me with the definitions of, and recommendations for, each of the distplans ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]? I could not find this information compiled anywhere. Thank you!

@JThh
Contributor

JThh commented Feb 6, 2023

Thank you for pointing this out. I will update it soon.

"CAI_ZeRO1", "CAI_ZeRO2" refer to two types of zero redundancy optimizers, with their differences being detailed in this paper (on page 2).

For "CAI_Gemini" (or "zero3"), please see this our tutorial. And "Pytorch_DDP", "Pytorch_ZeRO" literally mean the implementation of data parallelism and ZeroRedundancyOptimizer by PyTorch.

You may trace the function calls here and see how they differ.

We recommend trying out CAI_Gemini (or "zero3").
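
For a rough picture of how the example script maps the `--distplan` flag to a strategy, here is a simplified sketch (illustration only, not the actual code in train_gpt_demo.py; the comments summarize what each plan does):

```python
import argparse

DISTPLANS = ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]

parser = argparse.ArgumentParser()
parser.add_argument("--distplan", choices=DISTPLANS, default="CAI_Gemini")
args = parser.parse_args()

if args.distplan == "CAI_Gemini":
    # ZeRO stage 3 plus Gemini's heterogeneous (CPU/GPU) memory management:
    # parameters, gradients, and optimizer states are all sharded.
    print("wrap the model with GeminiDDP")
elif args.distplan in ("CAI_ZeRO1", "CAI_ZeRO2"):
    # Stage 1 shards optimizer states across ranks;
    # stage 2 additionally shards gradients.
    print("use Colossal-AI's ZeRO optimizer at the chosen stage")
elif args.distplan == "Pytorch_DDP":
    # Plain data parallelism: a full model replica on every rank.
    print("wrap the model with torch.nn.parallel.DistributedDataParallel")
else:  # Pytorch_ZeRO
    # DDP plus torch.distributed.optim.ZeroRedundancyOptimizer (~ZeRO stage 1).
    print("use DDP with ZeroRedundancyOptimizer")
```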

@ivrschool
Author

ivrschool commented Feb 6, 2023

Thank you @JThh for the references. I am experimenting with the different distplans and compiling a recommendation list that suggests when to use each plan for a given dataset and hardware capacity, since this information is clearly not available in one place. If you already have such a list, sharing it would save me a lot of time. Thank you once again!

@JThh
Contributor

JThh commented Feb 6, 2023

Thank you. If you are willing to share your findings, feel free to post them here!
