
[BUG]: TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode' #2590

Closed
ivrschool opened this issue Feb 6, 2023 · 5 comments · Fixed by #2658
Labels
bug Something isn't working

Comments

@ivrschool

🐛 Describe the bug

I am able to run the training job with zero1 and zero2, but I am hitting this error with the 'colossalai' distplan:

```
Traceback (most recent call last):
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 373, in <module>
    main()
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 288, in main
    model, optimizer = build_gemini(model, tp_pg, args.placement, args.tp_degree == 1)
  File "/home/ubuntu/Desktop/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 195, in build_gemini
    model = GeminiDDP(model,
TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode'
(the traceback above is repeated verbatim by each of the 4 worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2755) of binary: /opt/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py",line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_demo.py FAILED
```
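
For context: the main-branch example script passes `strict_ddp_mode` to `GeminiDDP`, a keyword that the 0.1.12 constructor does not define, which is exactly what the TypeError reports. One version-tolerant workaround is to filter the kwargs before the call; the sketch below is a minimal illustration (the `build_gemini_kwargs` helper and the `OldGeminiDDP` stand-in are hypothetical, not Colossal-AI API):

```python
import inspect

def build_gemini_kwargs(gemini_cls, **desired):
    """Keep only the keyword arguments the installed constructor accepts.

    Lets one script run on Colossal-AI versions that predate
    `strict_ddp_mode` (e.g. 0.1.12) as well as newer ones.
    """
    accepted = inspect.signature(gemini_cls.__init__).parameters
    return {k: v for k, v in desired.items() if k in accepted}

# Hypothetical stand-in mimicking the 0.1.12 constructor (no strict_ddp_mode):
class OldGeminiDDP:
    def __init__(self, module, placement_policy="cuda"):
        self.module = module

kwargs = build_gemini_kwargs(OldGeminiDDP,
                             placement_policy="cuda",
                             strict_ddp_mode=True)  # silently dropped on 0.1.12
model = OldGeminiDDP(object(), **kwargs)
print(kwargs)  # {'placement_policy': 'cuda'}
```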

### Environment

I am using this branch: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt

I am using an AWS Ubuntu Deep Learning (PyTorch 1.12.1) EC2 g5.12xlarge instance (4 GPUs, 96 GB GPU memory, 48 vCPUs, 192 GB RAM):

```
Colossal-AI version: 0.1.12
----------------------------
PyTorch Version: 1.12.1
PyTorch Version required by Colossal-AI: 1.12
PyTorch version match: ✓
----------------------------
System CUDA Version: 11.6
CUDA Version required by PyTorch: 11.6
CUDA Version required by Colossal-AI: 11.3
CUDA Version Match: x
----------------------------
CUDA Extension: ✓
```
@ivrschool ivrschool added the bug Something isn't working label Feb 6, 2023
@JThh
Contributor

JThh commented Feb 6, 2023

Can you try upgrading your Colossal-AI to 0.2.0 and pulling the latest changes to the gpt example accordingly?
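
A quick way to confirm which version is actually active in the environment before re-running (a minimal sketch; assumes the package was installed from PyPI under the name `colossalai`):

```python
from importlib.metadata import version

# `strict_ddp_mode` was added after 0.1.12, so the main-branch GPT example
# needs 0.2.0 or newer; check what pip actually resolved:
print("colossalai", version("colossalai"))
# Expected after upgrading: colossalai 0.2.0 (or newer)
```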

@ivrschool
Author

ivrschool commented Feb 6, 2023

Thanks @JThh. It works after upgrading to 0.2.0.

Can you please update the README.md here accordingly?

Also, can you help me with the definitions of, and recommendations for, each of the distplans ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]? I could not find this information compiled anywhere. Thank you!

@JThh
Contributor

JThh commented Feb 6, 2023

Thank you for pointing this out. I will update it soon.

"CAI_ZeRO1", "CAI_ZeRO2" refer to two types of zero redundancy optimizers, with their differences being detailed in this paper (on page 2).

For "CAI_Gemini" (or "zero3"), please see this our tutorial. And "Pytorch_DDP", "Pytorch_ZeRO" literally mean the implementation of data parallelism and ZeroRedundancyOptimizer by PyTorch.

You may trace the function calls here and see how they differ.

We recommend trying out CAI_Gemini (or "zero3").
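
For a rough picture of how the example script maps the `--distplan` flag to a strategy, here is a simplified sketch (illustration only, not the actual code in train_gpt_demo.py; the comments summarize what each plan does):

```python
import argparse

DISTPLANS = ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]

parser = argparse.ArgumentParser()
parser.add_argument("--distplan", choices=DISTPLANS, default="CAI_Gemini")
args = parser.parse_args()

if args.distplan == "CAI_Gemini":
    # ZeRO stage 3 plus Gemini's heterogeneous (CPU/GPU) memory management:
    # parameters, gradients, and optimizer states are all sharded.
    print("wrap the model with GeminiDDP")
elif args.distplan in ("CAI_ZeRO1", "CAI_ZeRO2"):
    # Stage 1 shards optimizer states across ranks;
    # stage 2 additionally shards gradients.
    print("use Colossal-AI's ZeRO optimizer at the chosen stage")
elif args.distplan == "Pytorch_DDP":
    # Plain data parallelism: a full model replica on every rank.
    print("wrap the model with torch.nn.parallel.DistributedDataParallel")
else:  # Pytorch_ZeRO
    # DDP plus torch.distributed.optim.ZeroRedundancyOptimizer (~ZeRO stage 1).
    print("use DDP with ZeroRedundancyOptimizer")
```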

@ivrschool
Author

ivrschool commented Feb 6, 2023

Thank you @JThh for the references. I am experimenting with the different distplans and compiling a recommendation list that suggests when to use each plan for a given dataset and hardware capacity, since this information is clearly not available in one place. If you already have such a list, sharing it would save me a lot of time. Thank you once again!

@JThh
Contributor

JThh commented Feb 6, 2023

Thank you. If you are willing to share your findings, feel free to post them here!
