Fix CPU stats aggregation for JAX multihost. #164
Conversation
Replaces custom pmap+pmax with the special-purpose multihost_utils.process_allgather. Tested in a pseudo multihost (multiprocess) test.
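For context, here is a minimal, self-contained sketch of the allgather-then-max pattern this PR adopts. The `stats` pytree and its values are placeholders, not the embedding layer's actual statistics; run single-process, `process_allgather` simply adds a leading axis of size 1.

```python
# Hypothetical sketch of the allgather-then-max aggregation; `stats` is a
# stand-in pytree, not the layer's real statistics.
import jax
import jax.numpy as jnp
from jax.experimental import multihost_utils

stats = {"max_ids_per_partition": jnp.asarray([3, 7, 5])}

# process_allgather stacks each leaf along a new leading axis, one slice per
# process (axis size 1 when running single-process).
all_stats = multihost_utils.process_allgather(stats)

# Reducing with max over that leading axis yields the global per-element
# maximum, matching the semantics of the old pmax-based aggregation.
aggregated_stats = jax.tree.map(lambda x: jnp.max(x, axis=0), all_stats)
print(aggregated_stats)  # {'max_ids_per_partition': Array([3, 7, 5], ...)}
```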
Summary of Changes
Hello @cantonios, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the CPU statistics aggregation mechanism within a JAX distributed embedding layer. The primary goal is to enhance the robustness and maintainability of the multihost setup by replacing a manually implemented pmap+pmax aggregation with the dedicated multihost_utils.process_allgather utility.
Code Review
The pull request replaces a custom pmap+pmax implementation with multihost_utils.process_allgather for aggregating CPU stats in a JAX multihost setup. This change simplifies the code and leverages a specialized utility for multihost communication. I have identified a potential issue with the incrementing of the step count in the backward pass, which could lead to incorrect training behavior.
```diff
-        # Aggregate stats across all processes/devices via pmax.
-        num_local_cpu_devices = jax.local_device_count("cpu")
-
-        def pmax_aggregate(x: Any) -> Any:
-          if not hasattr(x, "ndim"):
-            x = np.array(x)
-          tiled_x = np.tile(x, (num_local_cpu_devices, *([1] * x.ndim)))
-          return jax.pmap(
-              lambda y: jax.lax.pmax(y, "all_cpus"),  # type: ignore[no-untyped-call]
-              axis_name="all_cpus",
-              backend="cpu",
-          )(tiled_x)[0]
-
-        full_stats = jax.tree.map(pmax_aggregate, stats)
+        all_stats = multihost_utils.process_allgather(stats)
+        aggregated_stats = jax.tree.map(
+            lambda x: jnp.max(x, axis=0), all_stats
+        )
```
The original code used jax.pmap with jax.lax.pmax to aggregate statistics across devices. This change replaces that with multihost_utils.process_allgather followed by jnp.max. This seems like a good simplification, leveraging a dedicated utility for multihost aggregation.
However, it's important to ensure that process_allgather correctly handles the data sharding and aggregation across multiple hosts in your specific environment. Double-check that the resulting aggregated_stats contains the expected maximum values across all processes.
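Along those lines, a pseudo-multihost check similar to the one the PR description mentions could look like the sketch below. It is an assumption-laden illustration: the two-process setup, the coordinator port, and the stats values are all made up, and cross-process CPU collectives must be supported by the installed JAX version.

```python
# Hypothetical pseudo-multihost (multiprocess) check; the port, process count,
# and stats values are illustrative, not taken from the PR's actual test.
import multiprocessing
import numpy as np


def _worker(process_id: int, num_processes: int) -> None:
    import jax
    import jax.numpy as jnp
    from jax.experimental import multihost_utils

    # Each OS process acts as one "host" in the JAX distributed setup.
    jax.distributed.initialize(
        coordinator_address="localhost:12345",
        num_processes=num_processes,
        process_id=process_id,
    )
    # Give every process a different local value for the same stat.
    stats = {"ids_per_partition": jnp.asarray([process_id + 1, 10 * process_id])}

    all_stats = multihost_utils.process_allgather(stats)  # leading axis: process
    aggregated = jax.tree.map(lambda x: jnp.max(x, axis=0), all_stats)

    # With 2 processes the expected cross-process max is [2, 10].
    np.testing.assert_array_equal(
        np.asarray(aggregated["ids_per_partition"]), np.asarray([2, 10])
    )


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    procs = [multiprocessing.Process(target=_worker, args=(i, 2)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```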
hertschuh left a comment:
Thanks for the fix!
Thanks! This works for a dummy dataset on #144. For the real dataset, though, it freezes before training. I'm trying to work through this; maybe something is wrong with my processing code.