Fix CPU stats aggregation for JAX multihost. #164
Conversation
Replaces custom pmap+pmax with the special-purpose multihost_utils.process_allgather. Tested in a pseudo multihost (multiprocess) test.
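For context, here is a minimal, self-contained sketch of the allgather-then-max pattern this PR adopts. The `stats` pytree and its values are placeholders, not the embedding layer's actual statistics; run single-process, `process_allgather` simply adds a leading axis of size 1.

```python
# Hypothetical sketch of the allgather-then-max aggregation; `stats` is a
# stand-in pytree, not the layer's real statistics.
import jax
import jax.numpy as jnp
from jax.experimental import multihost_utils

stats = {"max_ids_per_partition": jnp.asarray([3, 7, 5])}

# process_allgather stacks each leaf along a new leading axis, one slice per
# process (axis size 1 when running single-process).
all_stats = multihost_utils.process_allgather(stats)

# Reducing with max over that leading axis yields the global per-element
# maximum, matching the semantics of the old pmax-based aggregation.
aggregated_stats = jax.tree.map(lambda x: jnp.max(x, axis=0), all_stats)
print(aggregated_stats)  # {'max_ids_per_partition': Array([3, 7, 5], ...)}
```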
Summary of Changes
Hello @cantonios, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the CPU statistics aggregation mechanism within a JAX distributed embedding layer. The primary goal is to enhance the robustness and maintainability of the multihost setup by replacing a manually implemented pmap+pmax aggregation with the dedicated multihost_utils.process_allgather utility.
Code Review
The pull request replaces a custom pmap+pmax implementation with multihost_utils.process_allgather for aggregating CPU stats in a JAX multihost setup. This change simplifies the code and leverages a specialized utility for multihost communication. I have identified a potential issue with the incrementing of the step count in the backward pass, which could lead to incorrect training behavior.
```diff
-        # Aggregate stats across all processes/devices via pmax.
-        num_local_cpu_devices = jax.local_device_count("cpu")
-
-        def pmax_aggregate(x: Any) -> Any:
-          if not hasattr(x, "ndim"):
-            x = np.array(x)
-          tiled_x = np.tile(x, (num_local_cpu_devices, *([1] * x.ndim)))
-          return jax.pmap(
-              lambda y: jax.lax.pmax(y, "all_cpus"),  # type: ignore[no-untyped-call]
-              axis_name="all_cpus",
-              backend="cpu",
-          )(tiled_x)[0]
-
-        full_stats = jax.tree.map(pmax_aggregate, stats)
+        all_stats = multihost_utils.process_allgather(stats)
+        aggregated_stats = jax.tree.map(
+            lambda x: jnp.max(x, axis=0), all_stats
+        )
```
The original code used jax.pmap with jax.lax.pmax to aggregate statistics across devices. This change replaces that with multihost_utils.process_allgather followed by jnp.max. This seems like a good simplification, leveraging a dedicated utility for multihost aggregation.
However, it's important to ensure that process_allgather correctly handles the data sharding and aggregation across multiple hosts in your specific environment. Double-check that the resulting aggregated_stats contains the expected maximum values across all processes.
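Along those lines, a pseudo-multihost check similar to the one the PR description mentions could look like the sketch below. It is an assumption-laden illustration: the two-process setup, the coordinator port, and the stats values are all made up, and cross-process CPU collectives must be supported by the installed JAX version.

```python
# Hypothetical pseudo-multihost (multiprocess) check; the port, process count,
# and stats values are illustrative, not taken from the PR's actual test.
import multiprocessing
import numpy as np


def _worker(process_id: int, num_processes: int) -> None:
    import jax
    import jax.numpy as jnp
    from jax.experimental import multihost_utils

    # Each OS process acts as one "host" in the JAX distributed setup.
    jax.distributed.initialize(
        coordinator_address="localhost:12345",
        num_processes=num_processes,
        process_id=process_id,
    )
    # Give every process a different local value for the same stat.
    stats = {"ids_per_partition": jnp.asarray([process_id + 1, 10 * process_id])}

    all_stats = multihost_utils.process_allgather(stats)  # leading axis: process
    aggregated = jax.tree.map(lambda x: jnp.max(x, axis=0), all_stats)

    # With 2 processes the expected cross-process max is [2, 10].
    np.testing.assert_array_equal(
        np.asarray(aggregated["ids_per_partition"]), np.asarray([2, 10])
    )


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    procs = [multiprocessing.Process(target=_worker, args=(i, 2)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```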
hertschuh left a comment:
Thanks for the fix!
Thanks! This works for a dummy dataset on #144. For the real dataset, though, it freezes before training. I'm trying to work through this; maybe something is wrong with my processing code.