
Stage1.5 #123

Open
Xuezhi-Liang wants to merge 6 commits into master

Conversation

Xuezhi-Liang

No description provided.

Xuezhi-Liang mentioned this pull request on May 31, 2022.
def context_initialize(batch_size):
    """
    Initialize this module, must be invoked before calling any other functions.
    This function will block until it has been invoked from all replicas.
    """
Collaborator:

How's this enforced?

Dear Omkar, thank you for asking. Since we made the Context global, the Context is first initialized in init_process_group as Context_obj, following Aurick's suggestion. All subsequent operations on the Context go through Context_obj instead, so initialization is enforced at the very beginning.
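
For reference, a minimal sketch of the pattern described in this reply, assuming a module-level singleton; the names below (Context, _context_obj, get_context) are illustrative and may not match the actual patch:

```python
# Illustrative sketch only -- names follow the discussion above and may
# differ from the actual patch.

_context_obj = None  # module-level singleton, created once per process


class Context:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.accum_steps = 0


def context_initialize(batch_size):
    """Initialize this module; must be called before any other function.

    In this sketch, init_process_group calls context_initialize right after
    the process group is set up, so every replica creates its Context at the
    same, well-defined point in program startup.
    """
    global _context_obj
    if _context_obj is None:
        _context_obj = Context(batch_size)
    return _context_obj


def get_context():
    assert _context_obj is not None, "context_initialize() must be called first"
    return _context_obj
```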

@@ -119,6 +120,9 @@ def init_process_group(backend,
rank,
world_size)

# Initialize Context module.
adaptdl.torch.data.context_initialize(batch_size=32)
Collaborator:

How do you plan to make batch_size available in init_process_group?

We were just giving it a default value here for the global Context; users can override batch_size when they use it. But thanks for pointing it out: we can also remove the default value from here and instead supply it in the definition of the Context class directly. Users can still pass their own value, which then replaces the default.
Please also check our new commit to review the modification.
Many thanks.
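
A rough sketch of that arrangement, assuming the default moves onto the Context class itself; the value 32 and the names here are illustrative, not taken from the patch:

```python
# Illustrative sketch: the default batch size lives on the Context class,
# so init_process_group no longer needs to hard-code batch_size=32.

_context_obj = None


class Context:
    DEFAULT_BATCH_SIZE = 32  # assumed default; the real value may differ

    def __init__(self, batch_size=None):
        self.batch_size = (batch_size if batch_size is not None
                           else self.DEFAULT_BATCH_SIZE)


def context_initialize(batch_size=None):
    """Create the global Context; a user-supplied batch_size overrides
    the class-level default."""
    global _context_obj
    if _context_obj is None:
        _context_obj = Context(batch_size)
    return _context_obj


# init_process_group can then call context_initialize() with no argument,
# while a user who wants a different batch size passes one explicitly:
#     adaptdl.torch.data.context_initialize(batch_size=64)
```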

@@ -355,7 +318,7 @@ def context(self):

@property
def current_batch_size(self):
return (self.current_local_bsz * (self.accumulation_steps + 1) *
return (self._context.get_batch_size() * (self._context.get_accum_steps() + 1) *
odp (Collaborator) commented on Jun 2, 2022:
This is a bit inefficient, it can potentially invoke the goodput optimization twice, once per Context._get_local_bsz call. You could probably use _context.current_local_bsz and accumulation_steps instead when you just want to query the values without potentially triggering optimization.

Thank you, Omkar. Please check the new commit, where the double-triggering issue is fixed. 🤝
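
A small self-contained sketch of the kind of fix suggested above: the property reads values cached by the goodput optimization instead of re-running it. The class and attribute names follow the comment and are not necessarily those in the code:

```python
# Illustrative sketch only; not the actual adaptdl classes.

class ElasticContext:
    def __init__(self, num_replicas):
        self.num_replicas = num_replicas
        # Cached by the goodput optimization whenever it actually runs:
        self.current_local_bsz = 32
        self.accumulation_steps = 0

    def _get_local_bsz(self):
        # Expensive path: would run the goodput optimization and refresh
        # current_local_bsz / accumulation_steps (omitted in this sketch).
        return self.current_local_bsz

    @property
    def current_batch_size(self):
        # Cheap path: only reads the cached values, never re-optimizes.
        return (self.current_local_bsz *
                (self.accumulation_steps + 1) *
                self.num_replicas)
```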

@@ -526,19 +490,19 @@ def __iter__(self):
while not done:
self.sampler.set_epoch(
epoch, index=self._elastic.current_index)
self.batch_sampler.batch_size = self._elastic._sync_local_bsz()
self.batch_sampler.batch_size = self._elastic._context.get_batch_size()
Collaborator:

_sync_local_bsz cannot be replaced by Context.get_batch_size, because _sync_local_bsz also does a broadcast from rank 0 to propagate the local batch size it calculated to the rest of the replicas, and it also acts as a barrier. Context.get_batch_size could cause the local batch sizes in the replicas to go out of sync.

Thanks, Omkar. Please check the new commit, where the replacement issue is fixed.
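
For context, a minimal sketch of the broadcast-plus-barrier behavior the reviewer describes, assuming a torch.distributed process group is already initialized; this is not the actual adaptdl implementation:

```python
# Illustrative sketch; not the actual adaptdl _sync_local_bsz.

import torch
import torch.distributed as dist


def _sync_local_bsz(locally_computed_bsz):
    """Rank 0 decides the local batch size; every replica adopts it.

    dist.broadcast is a collective call, so it also acts as a barrier:
    no replica continues until all replicas hold the same value, which is
    what keeps the per-replica batch sizes from going out of sync.
    """
    bsz = torch.tensor([locally_computed_bsz], dtype=torch.int64)
    dist.broadcast(bsz, src=0)  # propagate rank 0's value to all replicas
    return int(bsz.item())
```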
