Stage1.5 #123
base: master
Conversation
```python
def context_initialize(batch_size):
    """
    Initialize this module, must be invoked before calling any other functions.
    This function will block until it has been invoked from all replicas.
```
How's this enforced?
Dear Omkar, thank you for asking. Since we are making the Context global, the Context is first initialized in `init_process_group` as `Context_obj`, following Aurick's suggestion. All subsequent uses of the Context then go through `Context_obj`. So initialization is enforced at the very beginning.
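A minimal sketch of the guard pattern described above, assuming the blocking is implemented with a collective barrier (the `Context` body and the `get_batch_size` helper here are illustrative, not the actual adaptdl code):

```python
import torch.distributed as dist

class Context:
    def __init__(self, batch_size):
        self.batch_size = batch_size

_context = None  # module-level singleton, set exactly once

def context_initialize(batch_size):
    """Initialize this module; must be called before any other function here.

    Blocks until every replica has invoked it.
    """
    global _context
    if _context is not None:
        return  # repeated calls are harmless
    _context = Context(batch_size)
    dist.barrier()  # no replica proceeds until all have initialized

def get_batch_size():
    # Every other entry point checks the guard, so calling it before
    # init_process_group fails loudly instead of silently misbehaving.
    assert _context is not None, "call context_initialize() first"
    return _context.batch_size
```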
adaptdl/adaptdl/torch/__init__.py
Outdated
```diff
@@ -119,6 +120,9 @@ def init_process_group(backend,
                            rank,
                            world_size)
 
+    # Initialize Context module.
+    adaptdl.torch.data.context_initialize(batch_size=32)
```
How do you plan to make `batch_size` available in `init_process_group`?
We were just giving it a default value here for the global Context; users can replace the `batch_size` when they use it. But thanks for pointing it out: we can also remove the default value from here and supply it directly in the definition of the `Context` class instead. Users are still able to pass their own value, which will then replace the default.
Please also check our new commit to review the modification.
Many thanks.
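A sketch of the arrangement being described, with the default owned by the `Context` class rather than hard-coded at the call site (`DEFAULT_BATCH_SIZE` is a hypothetical name, not the actual adaptdl attribute):

```python
class Context:
    DEFAULT_BATCH_SIZE = 32  # the single place that owns the default

    def __init__(self, batch_size=None):
        # Fall back to the class default only when the caller passes nothing.
        self.batch_size = (batch_size if batch_size is not None
                           else self.DEFAULT_BATCH_SIZE)

# init_process_group no longer needs to hard-code a number:
#   context_initialize()                -> Context.DEFAULT_BATCH_SIZE (32)
#   context_initialize(batch_size=128)  -> the user's value wins
```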
adaptdl/adaptdl/torch/data.py
Outdated
```diff
@@ -355,7 +318,7 @@ def context(self):
 
     @property
     def current_batch_size(self):
-        return (self.current_local_bsz * (self.accumulation_steps + 1) *
+        return (self._context.get_batch_size() * (self._context.get_accum_steps() + 1) *
```
This is a bit inefficient; it can potentially invoke the goodput optimization twice, once per `Context._get_local_bsz` call. You could probably use `_context.current_local_bsz` and `accumulation_steps` instead when you just want to query the values without potentially triggering the optimization.
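To illustrate the concern, a rough sketch (the internals of `_get_local_bsz` and the `_optimize_goodput` solver call are assumptions, not the actual adaptdl code): reading the property through the getters re-runs the optimization, while the plain attributes only return cached values.

```python
class Context:
    def _get_local_bsz(self):
        # Expensive: re-solves the goodput optimization and caches the result.
        self.current_local_bsz, self.accumulation_steps = \
            self._optimize_goodput()  # hypothetical solver call
        return self.current_local_bsz

    def get_batch_size(self):
        return self._get_local_bsz()   # triggers one optimization pass

    def get_accum_steps(self):
        self._get_local_bsz()          # triggers a second pass
        return self.accumulation_steps

# Two optimization passes for a single property read:
#   ctx.get_batch_size() * (ctx.get_accum_steps() + 1)
# Cheaper when you only want the current values:
#   ctx.current_local_bsz * (ctx.accumulation_steps + 1)
```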
Thank you, Omkar. Please check the new commit, where the triggering issue has been fixed. 🤝
adaptdl/adaptdl/torch/data.py
Outdated
```diff
@@ -526,19 +490,19 @@ def __iter__(self):
         while not done:
             self.sampler.set_epoch(
                 epoch, index=self._elastic.current_index)
-            self.batch_sampler.batch_size = self._elastic._sync_local_bsz()
+            self.batch_sampler.batch_size = self._elastic._context.get_batch_size()
```
`_sync_local_bsz` cannot be replaced by `Context.get_batch_size`, because `_sync_local_bsz` also does a broadcast from rank 0 to propagate the local batch size it calculated to the rest of the replicas, and it also acts as a barrier. `Context.get_batch_size` could cause the local batch sizes in the replicas to go out of sync.
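For concreteness, a hedged sketch of the broadcast-plus-barrier behavior being described, using plain torch.distributed collectives (the real `_sync_local_bsz` differs; `_get_local_bsz` here stands in for the per-rank computation):

```python
import torch
import torch.distributed as dist

def _sync_local_bsz(self):
    if dist.get_rank() == 0:
        # Only rank 0 computes the new local batch size.
        local_bsz = self._context._get_local_bsz()
    else:
        local_bsz = 0  # placeholder, overwritten by the broadcast
    t = torch.tensor([local_bsz])
    # broadcast is a collective: every replica blocks here until all
    # participate (so it doubles as a barrier), and afterwards every
    # replica holds rank 0's value, so no replica can go out of sync.
    dist.broadcast(t, src=0)
    self._context.current_local_bsz = int(t.item())
    return self._context.current_local_bsz
```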
Thanks, Omkar. Please check the new commit, where the replacement issue is fixed.