[zero] add zero optimizer for ColoTensor #1046
Conversation
UNSCALED = 1

class ZeroOptimizer(ColossalaiOptimizer):
Should this be put in the zero module?
OK
tests/test_tensor/test_zero_optim.py
Outdated
torch_model.train()
set_seed(gpc.get_local_rank(ParallelMode.DATA))
for i, (input_ids, attn_mask) in enumerate(train_dataloader):
    if i > 1:
Why only one pass? Will it fail if i > 5?
This test takes too long if i > 5.
OK, but you should have a more robust test, e.g. i > 3.
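As an aside on capping test iterations: instead of an in-loop counter check like `if i > 1: break`, the test could bound the number of batches up front with `itertools.islice`. A minimal sketch with a stand-in dataloader (the real test iterates a GPT dataloader; the loader and step below are placeholders):

```python
from itertools import islice

def run_few_steps(dataloader, num_steps=3):
    """Run only the first `num_steps` batches to keep the test fast."""
    steps_done = 0
    for input_ids, attn_mask in islice(dataloader, num_steps):
        # Placeholder for the real forward/backward/step logic.
        steps_done += 1
    return steps_done

# Stand-in dataloader: 10 dummy (input_ids, attn_mask) batches.
fake_loader = (([i], [1]) for i in range(10))
steps = run_few_steps(fake_loader, num_steps=3)  # → 3
```

This keeps the iteration budget visible in one place, which makes tuning it (1 vs. 3 vs. 5 passes) a one-line change.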
colossalai/zero/zero_optimizer.py
Outdated
max_scale=max_scale)
self._found_overflow: torch.Tensor = torch.zeros(1, dtype=torch.int64, device=torch.cuda.current_device())
self.dp_process_group = gpc.get_group(ParallelMode.DATA)
self.mp_process_group = gpc.get_group(ParallelMode.MODEL)
Why don't we have a ParallelMode.Global?
@FrankLeeeee shall we add this mode?
I think implementing it is not difficult. You can consider it as a patch in the future.
We have this mode actually...
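On the overflow check behind `_found_overflow`: because the DP and MP process groups jointly cover every rank, a MAX-reduce of the overflow flag over the DP group followed by one over the MP group agrees with a single MAX-reduce over a global group. A pure-Python sketch of that equivalence (the 2x2 rank layout and the `allreduce_max` helper are illustrative stand-ins, not ColossalAI's actual topology or API):

```python
# World of 4 ranks arranged as 2 DP groups x 2 MP groups (illustrative layout).
# overflow[rank] is 1 if that rank saw inf/nan in its local gradients.
overflow = {0: 0, 1: 1, 2: 0, 3: 0}

dp_groups = [[0, 2], [1, 3]]   # data-parallel groups
mp_groups = [[0, 1], [2, 3]]   # model-parallel groups

def allreduce_max(flags, group):
    """Simulate an all_reduce with op=MAX within one process group."""
    m = max(flags[r] for r in group)
    for r in group:
        flags[r] = m

# Two-stage reduce: first across the DP group, then across the MP group.
staged = dict(overflow)
for g in dp_groups:
    allreduce_max(staged, g)
for g in mp_groups:
    allreduce_max(staged, g)

# One reduce over a hypothetical global group.
global_flags = dict(overflow)
allreduce_max(global_flags, list(global_flags))

# Both strategies leave every rank with the same flag.
```

So a `ParallelMode` covering all ranks would only replace the two reduces with one call; the result is identical either way.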
for p in group['params']:
    if not self.module.chunk_manager.is_chunk_free(p):
        fp32_p = self.fp16_param_to_fp32_param[p]
        self.module.chunk_manager.copy_tensor_to_chunk_slice(p, fp32_p)
Copying tensors one by one like this is inefficient.
You should add a TODO and improve this line later.
OK
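To illustrate the inefficiency the review flags: if the fp16 parameters live back to back in a flat chunk, the per-parameter copies can in principle be fused into one contiguous write per chunk. A hedged pure-Python sketch of the idea (the flat-list chunk, slice table, and both helpers are made up for illustration; the real ChunkManager API differs):

```python
# Hypothetical flat chunk holding fp16 params back to back.
chunk = [0.0] * 8
# (offset, length) of each param's slice inside the chunk -- illustrative.
param_slices = {"w": (0, 4), "b": (4, 2)}
fp32_master = {"w": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.25]}

def copy_per_param(chunk, slices, master):
    """Naive: one copy call per parameter (what the review flags as slow)."""
    for name, (off, n) in slices.items():
        chunk[off:off + n] = master[name][:n]

def copy_batched(chunk, slices, master):
    """Batched: build one contiguous buffer, then write it in a single
    slice assignment. Assumes the slices are contiguous in the chunk."""
    ordered = sorted(slices.items(), key=lambda kv: kv[1][0])
    buf = [v for name, (off, n) in ordered for v in master[name][:n]]
    start = ordered[0][1][0]
    chunk[start:start + len(buf)] = buf

copy_batched(chunk, param_slices, fp32_master)
# chunk[:6] now holds [1.0, 2.0, 3.0, 4.0, 0.5, 0.25]
```

On GPU the analogous win is replacing many small kernel launches with one bulk copy, which is what a TODO here would target.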
tests/test_tensor/test_gpt.py
Outdated
@@ -68,6 +68,7 @@ def run_gpt(init_spec_func, use_ddp):
    for i, (input_ids, attn_mask) in enumerate(train_dataloader):
        logits = model(input_ids, attn_mask)
        torch_logits = torch_model(input_ids, attn_mask)
+       print(torch_logits, logits)
remove this.
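Rather than printing the logits, the test could assert their closeness directly. A minimal sketch with a hand-rolled elementwise check (in the real test, `torch.allclose` with a tolerance suited to fp16 would be the natural choice; the sample values below are invented):

```python
def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Elementwise closeness check, mirroring torch.allclose semantics:
    |x - y| <= atol + rtol * |y| for every pair."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Invented sample logits standing in for model / torch_model outputs.
logits = [0.1001, -2.3002, 5.4999]
torch_logits = [0.1000, -2.3000, 5.5000]
assert allclose(logits, torch_logits), "model and torch_model diverged"
```

An assertion fails loudly in CI, whereas a stray `print` only clutters the log.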