
Adds pool command group to CLI for node pool management#58

Merged
divyashreepathihalli merged 5 commits into main from new-pools on Mar 3, 2026
Conversation

@JyotinderSingh (Collaborator) commented on Feb 25, 2026

Multi-node-pool support

Fixes #45

Adds support for multiple accelerator node pools per cluster, replacing the previous single-pool model. Users can now incrementally add and remove pools without reprovisioning the entire cluster.

New CLI commands

keras-remote pool add --accelerator <spec>    # Add a node pool
keras-remote pool remove <pool-name>          # Remove a node pool by name
keras-remote pool list                        # List all node pools

keras-remote status — The accelerator section now renders multiple pools with indexed entries:

         Infrastructure State
 Resource                   Value
 Project                    my-project
 Zone                       us-central1-a
 Cluster Name               keras-remote-cluster
 Cluster Endpoint           34.123.45.67
 Artifact Registry          us-docker.pkg.dev/my-project/keras-remote

 Accelerator Pools (2)
   Pool 1: GPU              gpu-l4-a3f2
     GPU Type               l4
     GPU Count              1
     Machine Type           g2-standard-4
     Node Count             1
   Pool 2: TPU              tpu-v5p-b7e1
     TPU Type               v5p
     TPU Chips              8
     Topology               2x2x2
     Machine Type           ct5p-hightpu-4t
     Node Count             2

When no accelerator pools exist:

 Accelerators               CPU only (no accelerator pools)
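As a rough sketch of how the indexed multi-pool rendering above could work (the `render_pools` helper and its input shape are illustrative assumptions, not the PR's actual `_render_accelerator` implementation):

```python
# Hypothetical sketch of indexed pool rendering for the status output.
# The function name and the pool dict layout are assumptions.

def render_pools(pools):
    """Render 'Accelerator Pools (N)' with indexed sub-entries."""
    if not pools:
        return ["Accelerators               CPU only (no accelerator pools)"]
    lines = [f"Accelerator Pools ({len(pools)})"]
    for i, pool in enumerate(pools, start=1):
        kind = pool["type"].upper()  # "gpu" or "tpu"
        lines.append(f"  Pool {i}: {kind}              {pool['name']}")
        for key, value in pool["details"].items():
            lines.append(f"    {key:<23}{value}")
    return lines

pools = [
    {"type": "gpu", "name": "gpu-l4-a3f2",
     "details": {"GPU Type": "l4", "GPU Count": 1,
                 "Machine Type": "g2-standard-4", "Node Count": 1}},
]
for line in render_pools(pools):
    print(line)
```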

keras-remote up configuration summary — Now shows all accelerators as a comma-separated list:

       Configuration Summary
 Setting        Value
 Project        my-project
 Zone           us-central1-a
 Cluster Name   keras-remote-cluster
 Accelerators   GPU (nvidia-l4)

Or for CPU-only: Accelerators CPU only.

Both pool add and pool remove accept --yes/-y to skip confirmation prompts (matching up and down behavior).

Implementation details

Data model — InfraConfig.accelerator (single) replaced with InfraConfig.node_pools (a list of NodePoolConfig). Each NodePoolConfig pairs a generated pool name with an accelerator config.
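A minimal sketch of this data model (class names follow the PR text, but the fields of the accelerator config are assumptions):

```python
# Sketch of the described data model. NodePoolConfig and InfraConfig
# names come from the PR; AcceleratorConfig's shape is an assumption.
from dataclasses import dataclass, field

@dataclass
class AcceleratorConfig:   # stand-in for the real accelerator config type
    kind: str              # "gpu" or "tpu"
    variant: str           # e.g. "l4", "v5p"
    count: int = 1

@dataclass
class NodePoolConfig:
    name: str              # generated, e.g. "gpu-l4-a3f2"
    accelerator: AcceleratorConfig

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)  # replaces .accelerator
```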

Pool naming — Names follow the pattern gpu-{name}-{hex4} / tpu-{name}-{hex4} (e.g., gpu-l4-a3f2, tpu-v5p-b7e1). The random suffix allows multiple pools with identical accelerator specs.
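The naming scheme can be sketched as follows (the real generate_pool_name signature may differ; this only illustrates the {gpu|tpu}-{variant}-{hex4} pattern):

```python
# Sketch of the described naming scheme. A 4-character random hex
# suffix keeps pools with identical accelerator specs distinct.
import uuid

def generate_pool_name(kind: str, variant: str) -> str:
    suffix = uuid.uuid4().hex[:4]        # e.g. "a3f2"
    return f"{kind}-{variant}-{suffix}"

name = generate_pool_name("gpu", "l4")   # e.g. "gpu-l4-7c1d"
```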

Pulumi program — Now iterates over the node pool list, creating one GKE node pool per entry. Pool names are passed in rather than hardcoded. Exports "accelerators" (list) instead of "accelerator" (single dict).

Pool management — pool add reads current state from stack exports, appends the new pool, and runs stack.up(). pool remove does the inverse. Pulumi handles the diff — only the added or removed pool is created or destroyed.
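The add/remove flow can be sketched against a stack-like object so the diff logic is visible (the real code uses pulumi.automation; the helper names and the set_program method here are assumptions):

```python
# Sketch of the pool add/remove flow. The stack object stands in for
# a Pulumi automation-API stack; set_program is a hypothetical rebind
# of the Pulumi program to the new desired pool list.

def pool_add(stack, current_pools, new_pool):
    """Append a pool; Pulumi's diff creates only the added pool."""
    desired = current_pools + [new_pool]
    stack.set_program(desired)
    stack.up()
    return desired

def pool_remove(stack, current_pools, pool_name):
    """Drop a pool by name; Pulumi's diff destroys only that pool."""
    remaining = [p for p in current_pools if p["name"] != pool_name]
    if len(remaining) == len(current_pools):
        raise ValueError(f"no pool named {pool_name!r}")
    stack.set_program(remaining)
    stack.up()
    return remaining
```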

Re-running up on existing clusters — up now refreshes stack state before provisioning. If existing node pools are found, they are preserved as-is and the --accelerator flag is ignored; a message directs users to pool add/remove instead. On first run (no stack), --accelerator creates the initial pool as before.

Backward compatibility — Status display handles both the new "accelerators" list key and the old "accelerator" single-dict key for pre-upgrade stacks.
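The dual-format read described above could look like this (the exact export key shapes are assumptions based on the PR text):

```python
# Sketch of reading pools from stack exports while supporting both
# the new "accelerators" list key and the legacy "accelerator"
# single-dict key for pre-upgrade stacks.

def pools_from_exports(outputs: dict) -> list:
    if "accelerators" in outputs:          # new format: list of dicts
        return list(outputs["accelerators"])
    legacy = outputs.get("accelerator")    # legacy format: one dict or None
    return [legacy] if legacy else []
```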

Migration

Existing stacks are fully compatible. up preserves existing node pools (including their original names) by reading state from the Pulumi stack before provisioning. New pools added via pool add use the new gpu-{name}-{hex4} / tpu-{name}-{hex4} naming scheme.

@gemini-code-assist (Contributor) commented:
Summary of Changes


This pull request significantly enhances the keras-remote CLI by enabling robust management of multiple accelerator node pools within a single cluster. This change moves beyond a monolithic accelerator configuration to a more flexible, scalable model, allowing users to dynamically adjust their compute resources. The update includes new commands for adding, removing, and listing these pools, alongside comprehensive updates to how infrastructure state and configuration are displayed and managed internally.

Highlights

  • Multi-Node Pool Support: The system now supports multiple accelerator node pools per cluster, replacing the previous single-pool model. Users can add and remove pools incrementally without reprovisioning the entire cluster.
  • New CLI Commands: Introduced a new keras-remote pool command group with subcommands add, remove, and list for managing accelerator node pools. These commands allow users to specify accelerator types and manage their lifecycle.
  • Updated Status and Configuration Display: The keras-remote status command now renders multiple accelerator pools with indexed entries, providing detailed information for each. The keras-remote up configuration summary also reflects the new multi-accelerator setup.
  • Data Model and Pulumi Program Changes: The internal data model was updated from a single InfraConfig.accelerator to a list of InfraConfig.node_pools. The Pulumi program now iterates over this list to provision multiple GKE node pools and exports a list of accelerators instead of a single one.
  • Backward Compatibility and Migration: The status display handles both the new accelerators list export and the legacy single accelerator export for pre-upgrade stacks. Existing stacks are preserved: up reads state from the Pulumi stack before provisioning, and only pools created via pool add use the new naming scheme.


Changelog
  • keras_remote/cli/commands/pool.py
    • Added new CLI command group pool for managing accelerator node pools.
    • Implemented pool add subcommand to provision new accelerator node pools.
    • Implemented pool remove subcommand to deprovision existing accelerator node pools.
    • Implemented pool list subcommand to display current accelerator node pools.
    • Integrated Pulumi automation for dynamic infrastructure updates based on pool changes.
  • keras_remote/cli/commands/pool_test.py
    • Added unit tests for the pool add command, including GPU and existing pool scenarios.
    • Added unit tests for the pool remove command, covering successful removal and handling of non-existent pools.
    • Included tests for rejecting CPU-only node pool additions via the pool add command.
  • keras_remote/cli/commands/up.py
    • Imported NodePoolConfig and generate_pool_name to support multi-node pool configuration.
    • Modified up command to construct a list of NodePoolConfig instead of a single accelerator.
    • Updated GPU driver installation logic to check for any GPU node pool in the configuration list.
  • keras_remote/cli/config.py
    • Defined NodePoolConfig dataclass to represent a named accelerator node pool.
    • Updated InfraConfig to use a list of NodePoolConfig objects for node_pools instead of a single Accelerator.
  • keras_remote/cli/infra/program.py
    • Updated the Pulumi program to accept node_pools (a list) instead of a single accelerator.
    • Modified node pool creation logic to iterate over node_pools, creating a GKE node pool for each entry.
    • Passed the generated pool_name to _create_gpu_node_pool and _create_tpu_node_pool functions.
    • Changed stack exports from a single accelerator dictionary to a list of accelerators dictionaries.
  • keras_remote/cli/infra/program_test.py
    • Imported NodePoolConfig for use in tests.
    • Updated _create_tpu_node_pool and _create_gpu_node_pool test calls to include the pool_name argument.
    • Modified _make_config to accept node_pools list instead of a single accelerator.
    • Adjusted accelerator export tests to expect a list of accelerators and handle multiple pools.
    • Updated tests for CPU-only configurations to expect an empty list for accelerators export.
  • keras_remote/cli/infra/stack_manager.py
    • Imported NodePoolConfig and accelerators for node pool management.
    • Added get_current_node_pools function to read and parse node pool configurations from Pulumi stack outputs.
    • Implemented logic in get_current_node_pools to handle both new (accelerators list) and legacy (accelerator single dict) output formats.
    • Added _export_to_node_pool helper function to convert stack export dictionaries into NodePoolConfig objects.
  • keras_remote/cli/main.py
    • Imported the new pool command group.
    • Registered the pool command group with the main CLI application.
  • keras_remote/cli/output.py
    • Updated infrastructure_state to display multiple accelerator pools with indexed entries.
    • Implemented _render_accelerator helper function for consistent display of individual accelerator pool details.
    • Modified config_summary to display a comma-separated list of accelerators when multiple node pools are configured.
  • keras_remote/cli/output_test.py
    • Removed the file as its functionality was likely superseded or refactored.
  • keras_remote/core/accelerators.py
    • Imported uuid module for generating unique identifiers.
    • Added generate_pool_name function to create unique GKE node pool names with a random hexadecimal suffix.
  • keras_remote/core/accelerators_test.py
    • Imported generate_pool_name for testing.
    • Added unit tests for generate_pool_name to verify naming conventions, suffix length, and uniqueness.

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant and valuable feature: support for multiple accelerator node pools. The implementation is well-structured, with clear separation of concerns between the CLI commands, configuration, and the Pulumi infrastructure program. The new pool command group provides an intuitive workflow for managing node pools, adhering to the repository's design guidelines for end-to-end workflows. I also appreciate the consideration for backward compatibility in the status display and state parsing.

My review focuses on improving maintainability and ensuring robust test coverage. I've identified some code duplication in the new CLI commands, a minor issue with output consistency, and a fragile data-reconstruction pattern. More critically, I've noted that tests for the status command's output have been removed without replacement, and the new pool list command is untested. Addressing these points will help ensure the long-term quality and stability of this new functionality.


keras_remote/cli/output_test.py (lines 1-101) — critical

This test file, which provided coverage for the infrastructure_state output function, has been removed but not replaced. The infrastructure_state function in keras_remote/cli/output.py was significantly modified in this PR to support multiple node pools and maintain backward compatibility.

Removing these tests leaves this critical, user-facing output functionality completely untested. Please re-introduce and update these tests to cover:

  • The new multi-pool display format.
  • The legacy single-pool display format for backward compatibility.
  • The display for CPU-only configurations (both new and legacy formats).
  • The display when no infrastructure is found.

keras_remote/cli/commands/pool.py (line 104) — medium

Using on_output=print for stack.refresh() and stack.up() writes Pulumi's output directly to stdout, bypassing the rich.console object used elsewhere for CLI output. This can lead to inconsistent formatting and styling.

To ensure all CLI output is handled consistently, please use console.print as the callback. This applies here and in the following locations:

  • pool_add: line 131
  • pool_remove: lines 163 and 196
  • pool_list: line 231

    stack.refresh(on_output=console.print)

keras_remote/cli/commands/pool.py (line 147) — medium

The pool_add and pool_remove commands share a significant amount of boilerplate code for setting up and executing the Pulumi update (e.g., resolving config, getting the stack, refreshing state, confirming with the user, and running stack.up). This duplication can make future maintenance more difficult.

Consider refactoring this common logic into a shared helper function or context manager. This would centralize the process of running a Pulumi update against a modified list of node pools, making the pool_add and pool_remove commands simpler and more focused on their specific logic of modifying the pool list.
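As a rough illustration of this suggested refactor (all names here are hypothetical, not from the PR):

```python
# One way the suggested shared helper could look: a single driver for
# the common refresh -> confirm -> up sequence that pool_add and
# pool_remove would both call with their modified pool list.

def run_pool_update(stack, desired_pools, confirm, on_output=print):
    """Refresh stack state, ask the user, then apply the pool list."""
    stack.refresh(on_output=on_output)
    if not confirm():
        return None                      # user declined; nothing changed
    stack.set_pools(desired_pools)       # hypothetical program rebind
    return stack.up(on_output=on_output)
```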

keras_remote/cli/commands/pool_test.py (lines 55-61) — medium

The pool list command is not covered by any tests in this file, and the _LIST_ARGS variable defined for it is unused. Please add tests for the pool list command to ensure its functionality is verified. The tests should cover cases where there are no pools, one pool, and multiple pools, as well as handling of potential errors when fetching the stack state.

keras_remote/cli/infra/stack_manager.py (lines 82-96) — medium

The _export_to_node_pool function reconstructs an Accelerator config by formatting a string from the exported dictionary and then re-parsing it with accelerators.parse_accelerator. This approach is brittle because it tightly couples this function to the specific string formats supported by the parser. If the parser's supported formats change in the future, this logic could break.

A more robust approach would be to reconstruct the config objects directly from the structured entry dictionary, without the intermediate string representation. This could be achieved by:

  1. Adding a new function to the accelerators module (e.g., from_export_dict) that takes the dictionary and returns a config object.
  2. Making the internal _make_gpu and _make_tpu functions public and calling them here with the appropriate fields from the entry dict.

This would create a more stable contract between the export format and the state reconstruction logic.
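A sketch of what such a from_export_dict could look like (field names mirror the status output shown in the PR description; all of them are assumptions about the export format):

```python
# Sketch of the suggested from_export_dict: reconstruct the config
# directly from the structured export entry instead of round-tripping
# through the string parser. Field names are assumptions.

def from_export_dict(entry: dict) -> dict:
    kind = entry["type"]
    if kind == "gpu":
        return {"kind": "gpu",
                "gpu_type": entry["gpu_type"],
                "gpu_count": entry.get("gpu_count", 1),
                "machine_type": entry["machine_type"]}
    if kind == "tpu":
        return {"kind": "tpu",
                "tpu_type": entry["tpu_type"],
                "chips": entry.get("chips"),
                "topology": entry.get("topology"),
                "machine_type": entry["machine_type"]}
    raise ValueError(f"unknown accelerator type: {kind!r}")
```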

@divyashreepathihalli added the run-e2e (Runs End-to-End tests on Internal Cluster) label on Mar 2, 2026
@github-actions bot removed the run-e2e label on Mar 2, 2026
@divyashreepathihalli left a comment:
Thanks for the PR Jyotinder. I have left some comments.
A doc explaining the flow and edge cases would be helpful for review.

new_pool_name = generate_pool_name(accel_config)
new_pool = NodePoolConfig(new_pool_name, accel_config)

project, zone, cluster_name, existing_pools = _load_pools(
divyashreepathihalli:
Should we check if an identical accelerator configuration already exists in existing_pools and warn/block the user? Creating duplicate pools of the exact same instance type might confuse the GKE autoscaler.

JyotinderSingh (author):
I think we should think through this as a separate issue. I'm not sure what issues GKE has when autoscaling across similar node pools.

divyashreepathihalli:

#63 (for tracking)

project, zone, cluster_name
)

remaining = [p for p in existing_pools if p.name != pool_name]
divyashreepathihalli:
In both add and remove, we prompt the user to proceed before running stack.up(). Should we run stack.preview() first so the user can verify [y/n] exactly what GCP resources Pulumi plans to create or destroy?

JyotinderSingh (author):
Yes we should, but I plan to add this separately and unify the preview flow across the pools and the up command.

I've created an issue to track this: #62

"v5litepod, v5p, v6e, v3 (with optional count/topology)",
)
@click.option("--yes", "-y", is_flag=True, help="Skip confirmation prompt")
def pool_add(project, zone, cluster_name, accelerator, yes):
divyashreepathihalli commented on Mar 2, 2026:
Just a thought: if multiple people use the same cluster and run pool add concurrently, they might overwrite each other's state changes. Is this a concern for our current use cases? It may be overkill to worry about now, but it's worth noting this edge case.

JyotinderSingh (author):
We won't need to worry about this once we start using GCS as a backend for the Pulumi state file, since Pulumi automatically handles safe concurrent access.


# Build node pool list
node_pools = []
if accel_config is not None:
divyashreepathihalli:
If a user runs keras-remote up on an existing cluster that has 3 node pools, this code will overwrite config.node_pools with a single new pool, effectively deleting the other 3 pools via Pulumi. Should up act differently on existing clusters, perhaps refreshing state and merging, or explicitly warning that it will overwrite pools?

JyotinderSingh (author):
Oh, I missed this. I've simplified the user flow to avoid this edge case: if node pools already exist when the user runs up, we ignore the --accelerator flag, skip the accelerator selection prompt, and show a warning. The pool add/remove/list commands become the only way to manage pools after the first keras-remote up.

@divyashreepathihalli merged commit 4300d82 into main on Mar 3, 2026 — 4 checks passed
@JyotinderSingh deleted the new-pools branch on March 4, 2026

Closes issue: Add CLI command to create additional TPU/GPU node pools