
Stage1.5 #123

Open
Xuezhi-Liang wants to merge 6 commits into master

Conversation

Xuezhi-Liang

No description provided.

Xuezhi-Liang mentioned this pull request on May 31, 2022.
def context_initialize(batch_size):
    """
    Initialize this module, must be invoked before calling any other functions.
    This function will block until it has been invoked from all replicas.
    """
Collaborator:

How's this enforced?

Dear Omkar, thank you for asking. Since we made the Context global, the Context is first initialized in init_process_group as Context_obj, following Aurick's suggestion. All subsequent operations on the Context go through Context_obj instead, so initialization is enforced at the very beginning.
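
For reference, a minimal sketch of the pattern described in this reply, assuming a module-level singleton; the names below (Context, _context_obj, get_context) are illustrative and may not match the actual patch:

```python
# Illustrative sketch only -- names follow the discussion above and may
# differ from the actual patch.

_context_obj = None  # module-level singleton, created once per process


class Context:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.accum_steps = 0


def context_initialize(batch_size):
    """Initialize this module; must be called before any other function.

    In this sketch, init_process_group calls context_initialize right after
    the process group is set up, so every replica creates its Context at the
    same, well-defined point in program startup.
    """
    global _context_obj
    if _context_obj is None:
        _context_obj = Context(batch_size)
    return _context_obj


def get_context():
    assert _context_obj is not None, "context_initialize() must be called first"
    return _context_obj
```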

@@ -119,6 +120,9 @@ def init_process_group(backend,
rank,
world_size)

# Initialize Context module.
adaptdl.torch.data.context_initialize(batch_size=32)
Collaborator:

How do you plan to make batch_size available in init_process_group?

We were just giving it a default value here for the global Context; users can override batch_size when they use it. But thanks for pointing it out: we can also remove the default value from here and instead supply it in the definition of the Context class directly. Users can still pass their own value, which then replaces the default.
Please also check our new commit to review the modification.
Many thanks.
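
A rough sketch of that arrangement, assuming the default moves onto the Context class itself; the value 32 and the names here are illustrative, not taken from the patch:

```python
# Illustrative sketch: the default batch size lives on the Context class,
# so init_process_group no longer needs to hard-code batch_size=32.

_context_obj = None


class Context:
    DEFAULT_BATCH_SIZE = 32  # assumed default; the real value may differ

    def __init__(self, batch_size=None):
        self.batch_size = (batch_size if batch_size is not None
                           else self.DEFAULT_BATCH_SIZE)


def context_initialize(batch_size=None):
    """Create the global Context; a user-supplied batch_size overrides
    the class-level default."""
    global _context_obj
    if _context_obj is None:
        _context_obj = Context(batch_size)
    return _context_obj


# init_process_group can then call context_initialize() with no argument,
# while a user who wants a different batch size passes one explicitly:
#     adaptdl.torch.data.context_initialize(batch_size=64)
```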

@@ -355,7 +318,7 @@ def context(self):

@property
def current_batch_size(self):
return (self.current_local_bsz * (self.accumulation_steps + 1) *
return (self._context.get_batch_size() * (self._context.get_accum_steps() + 1) *
odp (Collaborator) commented on Jun 2, 2022:
This is a bit inefficient, it can potentially invoke the goodput optimization twice, once per Context._get_local_bsz call. You could probably use _context.current_local_bsz and accumulation_steps instead when you just want to query the values without potentially triggering optimization.

Thank you, Omkar. Please check the new commit, where the double-triggering issue is fixed. 🤝
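
A small self-contained sketch of the kind of fix suggested above: the property reads values cached by the goodput optimization instead of re-running it. The class and attribute names follow the comment and are not necessarily those in the code:

```python
# Illustrative sketch only; not the actual adaptdl classes.

class ElasticContext:
    def __init__(self, num_replicas):
        self.num_replicas = num_replicas
        # Cached by the goodput optimization whenever it actually runs:
        self.current_local_bsz = 32
        self.accumulation_steps = 0

    def _get_local_bsz(self):
        # Expensive path: would run the goodput optimization and refresh
        # current_local_bsz / accumulation_steps (omitted in this sketch).
        return self.current_local_bsz

    @property
    def current_batch_size(self):
        # Cheap path: only reads the cached values, never re-optimizes.
        return (self.current_local_bsz *
                (self.accumulation_steps + 1) *
                self.num_replicas)
```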

@@ -526,19 +490,19 @@ def __iter__(self):
while not done:
self.sampler.set_epoch(
epoch, index=self._elastic.current_index)
self.batch_sampler.batch_size = self._elastic._sync_local_bsz()
self.batch_sampler.batch_size = self._elastic._context.get_batch_size()
Collaborator:

_sync_local_bsz cannot be replaced by Context.get_batch_size, because _sync_local_bsz also does a broadcast from rank 0 to propagate the local batch size it calculated to the rest of the replicas, and it also acts as a barrier. Context.get_batch_size could cause the local batch sizes in the replicas to go out of sync.

Thanks, Omkar. Please check the new commit, where the replacement issue is fixed.
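
For context, a minimal sketch of the broadcast-plus-barrier behavior the reviewer describes, assuming a torch.distributed process group is already initialized; this is not the actual adaptdl implementation:

```python
# Illustrative sketch; not the actual adaptdl _sync_local_bsz.

import torch
import torch.distributed as dist


def _sync_local_bsz(locally_computed_bsz):
    """Rank 0 decides the local batch size; every replica adopts it.

    dist.broadcast is a collective call, so it also acts as a barrier:
    no replica continues until all replicas hold the same value, which is
    what keeps the per-replica batch sizes from going out of sync.
    """
    bsz = torch.tensor([locally_computed_bsz], dtype=torch.int64)
    dist.broadcast(bsz, src=0)  # propagate rank 0's value to all replicas
    return int(bsz.item())
```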
