
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. #3173

Merged
merged 8 commits into from Apr 27, 2023

Conversation

yhna940
Contributor

@yhna940 yhna940 commented Mar 17, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

N/A

📝 What does this PR do?

It seems that the variable names related to the mixed-precision parameter groups do not fully describe what they hold, so I suggest a few changes. The changes are trivial, but they should hopefully reduce some of the confusion for beginners like me.

Currently, the full parameter groups are named fp16_param_groups, and the parts managed by the GPU at the current rank are named fp32_flat_param_groups_of_current_rank. These names are accurate when the master weight is a half tensor, i.e. when the dtype specified in the __init__ method is fp16. In other cases, however, the names no longer match what the variables hold. I therefore suggest renaming them according to their sharding state rather than their data type, following PyTorch's FSDP convention (with names like flatten_sharded_optim_state_dict and full_optim_state_dict).
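
A minimal, hypothetical sketch of the idea (the class and variable values below are made up for illustration and are not the actual ColossalAI code): the names describe each group's role rather than a dtype that only applies to fp16 training.

```python
import torch

class NamingSketch:
    def __init__(self, params, dtype=torch.float16):
        # Before: dtype-based names such as
        #   self._fp16_param_groups
        #   self._fp32_flat_param_groups_of_current_rank
        # After: role-based names.
        # Working params: the (possibly low-precision) tensors actually used
        # in forward/backward computation.
        self._working_param_groups = {0: [p.to(dtype) for p in params]}
        # Master params: a flat, full-precision copy that this rank updates
        # in the optimizer step (sharding across ranks is ignored here).
        flat = torch.cat([p.detach().float().flatten() for p in params])
        self._master_flat_param_groups_of_current_rank = {0: flat}

params = [torch.randn(4, 4), torch.randn(8)]
sketch = NamingSketch(params)
print(sketch._working_param_groups[0][0].dtype)                   # torch.float16
print(sketch._master_flat_param_groups_of_current_rank[0].dtype)  # torch.float32
```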

On a related but even more trivial note, it seems that the param_store methods do not need to specify fp16 at all.

Thank you :)

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@1SAA
Contributor

1SAA commented Mar 21, 2023

Hi @yhna940

Thanks for your contribution. However, the naming you chose is a little confusing. Each param that appears in the original code has a prefix such as fp16 or fp32, while your change introduces three kinds of param: param, full_param, and sharded_param. I suggest we use original_param and master_weight to replace fp16 and fp32. What do you think?

@yhna940 yhna940 closed this Mar 21, 2023
@yhna940 yhna940 reopened this Mar 21, 2023
@github-actions
Contributor

The code coverage for the changed files is 85%.

Click me to view the complete report
Name                                                           Stmts   Miss  Cover
----------------------------------------------------------------------------------
colossalai/zero/sharded_optim/_utils.py                          125     48    62%
colossalai/zero/sharded_optim/bookkeeping/parameter_store.py      48      0   100%
colossalai/zero/sharded_optim/low_level_optim.py                 311     25    92%
----------------------------------------------------------------------------------
TOTAL                                                            484     73    85%

@yhna940
Contributor Author

yhna940 commented Mar 24, 2023

Hi @yhna940

Thanks for your contribution. However, the naming you chose is a little confusing. Each param that appears in the original code has a prefix such as fp16 or fp32, while your change introduces three kinds of param: param, full_param, and sharded_param. I suggest we use original_param and master_weight to replace fp16 and fp32. What do you think?

@1SAA

Thank you for your feedback. I understand the concern about the naming conventions and appreciate your suggestion. However, I would like to propose an alternative term, working_param, instead of original_param. The term working_param fits the mixed-precision training context more closely: it emphasizes that these are the parameters actively used during forward and backward computation. Using working_param and master_weight would create a clear distinction between the two kinds of parameters and help avoid confusion. A minimal sketch of this distinction follows the summary below.

I hope this explanation clarifies my reasoning for suggesting the term working_param. Please let me know if you have any concerns or if you'd like to discuss this further.

To summarize my suggestions:

  • fp16 -> working
  • fp32 -> master
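
To illustrate the working/master distinction, here is a generic mixed-precision sketch I wrote for this comment (not the ColossalAI implementation; all variable names below are made up):

```python
import torch

# The working param is in whatever dtype the model computes in; the master
# param is the full-precision copy that the optimizer actually updates.
working = torch.randn(10, dtype=torch.float16)   # used in forward/backward
master = working.float().clone()                 # full-precision optimizer copy
grad = torch.randn(10, dtype=torch.float16)      # stand-in for a backward grad

lr = 0.1
master -= lr * grad.float()                      # the step is taken in fp32
working.copy_(master.to(working.dtype))          # written back for the next step
```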

@yhna940
Contributor Author

yhna940 commented Mar 28, 2023

Hi @1SAA

Based on our previous discussion, I have renamed the variables related to mixed precision that were causing confusion. Could you please review them once again? Thank you!

@binmakeswell
Member

Hi @yhna940 Thanks for your contribution, but there are some conflicts in this PR. Could you please resolve them first? Thanks.

@yhna940
Contributor Author

yhna940 commented Apr 7, 2023

Hi @yhna940 Thanks for your contribution, but there are some conflicts in this PR. Could you please resolve them first? Thanks.

Hello @binmakeswell, the conflicts have been resolved. Thank you!

@1SAA
Contributor

1SAA commented Apr 7, 2023

Hi @yhna940

I am willing to merge your PR once the CI tests pass.

@yhna940
Contributor Author

yhna940 commented Apr 7, 2023

Hi @yhna940

I am willing to merge your PR once the CI tests pass.

Hi @1SAA
The GitHub Actions CI test pipeline failed, but the same test passes in my environment. Could you take a look? Thank you :)

GitHub Actions CI Logs

=========================== short test summary info ============================
FAILED tests/test_booster/test_plugin/test_gemini_plugin.py::test_gemini_plugin - torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/chunk/manager.py", line 64, in register_tensor
    chunk_group[-1].append_tensor(tensor)
IndexError: deque index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/__w/ColossalAI/ColossalAI/tests/test_booster/test_plugin/test_gemini_plugin.py", line 112, in run_dist
    check_gemini_plugin(early_stop=early_stop)
  File "/__w/ColossalAI/ColossalAI/tests/test_booster/test_plugin/test_gemini_plugin.py", line 74, in check_gemini_plugin
    raise e
  File "/__w/ColossalAI/ColossalAI/tests/test_booster/test_plugin/test_gemini_plugin.py", line 56, in check_gemini_plugin
    model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
  File "/__w/ColossalAI/ColossalAI/colossalai/booster/booster.py", line 118, in boost
    model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
  File "/__w/ColossalAI/ColossalAI/colossalai/booster/plugin/gemini_plugin.py", line 328, in configure
    model = GeminiModel(model, self.gemini_config)
  File "/__w/ColossalAI/ColossalAI/colossalai/booster/plugin/gemini_plugin.py", line 118, in __init__
    self.module = zero_model_wrapper(module, zero_stage=3, gemini_config=gemini_config)
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/wrapper.py", line 43, in zero_model_wrapper
    wrapped_model = GeminiDDP(model, **gemini_config)
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/gemini_ddp.py", line 590, in __init__
    super().__init__(module, gemini_manager, pin_memory, force_outputs_fp32, strict_ddp_mode)
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/gemini_ddp.py", line 83, in __init__
    self._init_chunks(param_order=param_order,
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/gemini_ddp.py", line 511, in _init_chunks
    self.chunk_manager.register_tensor(tensor=fp32_p,
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/chunk/manager.py", line 79, in register_tensor
    chunk = Chunk(
  File "/__w/ColossalAI/ColossalAI/colossalai/zero/gemini/chunk/chunk.py", line 102, in __init__
    self.chunk_temp = torch.zeros(chunk_size, dtype=dtype, device=device)    # keep all zero
RuntimeError: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 9.78 GiB total capacity; 6.30 GiB already allocated; 55.31 MiB free; 6.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
==== 1 failed, 240 passed, 175 skipped, 121 warnings in 1455.24s (0:24:15) =====
Error: Process completed with exit code 1.

Local Test Log

python3 -m pytest -v tests/test_booster/test_plugin/test_gemini_plugin.py
========================================================= test session starts =========================================================
platform linux -- Python 3.8.10, pytest-7.2.2, pluggy-1.0.0 -- /fsx/home-yhna/0407/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/fsx/home-yhna/0407/ColossalAI/.hypothesis/examples')
rootdir: /fsx/home-yhna/0407/ColossalAI, configfile: pytest.ini
plugins: cov-4.0.0, hypothesis-6.70.2
collected 1 item

tests/test_booster/test_plugin/test_gemini_plugin.py::test_gemini_plugin

PASSED                                                 [100%]

==================================================== 1 passed in 262.90s (0:04:22) ====================================================

@1SAA
Contributor

1SAA commented Apr 10, 2023

Hi @yhna940

It seems there is a memory leak in our tests. This problem is not caused by your code; I will fix it soon.

@ver217
Member

ver217 commented Apr 26, 2023

@yhna940 Can you sync the latest updates from our main branch first? We've already fixed the CI test, so after that we can merge this PR. Thanks.

@yhna940
Contributor Author

yhna940 commented Apr 26, 2023

Can you sync the latest updates from our main branch first? We've already fixed the CI test, so after that we can merge this PR. Thanks.

@ver217 I have synced the main branch updates to this PR. Thanks for letting me know that the CI test issue has been resolved.

@github-actions
Contributor

The code coverage for the changed files is 86%.

Click me to view the complete report
Name                                                       Stmts   Miss  Cover
------------------------------------------------------------------------------
colossalai/zero/low_level/_utils.py                          125     47    62%
colossalai/zero/low_level/bookkeeping/parameter_store.py      48      0   100%
colossalai/zero/low_level/low_level_optim.py                 313     20    94%
------------------------------------------------------------------------------
TOTAL                                                        486     67    86%

@ver217 ver217 merged commit a22407c into hpcaitech:main Apr 27, 2023
3 checks passed
hyunwoongko pushed a commit to EleutherAI/oslo that referenced this pull request May 25, 2023
…O optimizer (#183)

## Title

- [zero] Suggests a minor change to confusing variable names in the ZeRO
optimizer

## Description

It seems that the variable names related to the mixed-precision
parameter groups do not fully describe what they hold, so I suggest a
few changes. The changes are trivial, but they should hopefully reduce
some of the confusion for beginners like me.

Currently, the full parameter groups are named `fp16_param_groups`, and
the parts managed by the GPU at the current rank are named
`fp32_flat_param_groups_of_current_rank`. These names are accurate when
the master weight is a half tensor, i.e. when the dtype specified in
the `__init__` method is fp16. In other cases, however, the names no
longer match what the variables hold.

I would like to propose the alternative terms `working_param` and
`master_param`. These terms are more closely tied to the mixed-precision
training context. Using `working` and `master` creates a clear
distinction between the two types of parameters and helps avoid
confusion.

To summarize my suggestions:
- `fp16` -> `working`
- `fp32` -> `master`


## Linked Issues

- N/A

## Reference

- hpcaitech/ColossalAI#3173
dyanos pushed a commit to EleutherAI/oslo that referenced this pull request Jun 8, 2023
…O optimizer (#183)
