Adds pool command group to CLI for node pool management #58
divyashreepathihalli merged 5 commits into main from
Conversation
Code Review
This pull request introduces a significant and valuable feature: support for multiple accelerator node pools. The implementation is well-structured, with clear separation of concerns between the CLI commands, configuration, and the Pulumi infrastructure program. The new pool command group provides an intuitive workflow for managing node pools, adhering to the repository's design guidelines for end-to-end workflows. I also appreciate the consideration for backward compatibility in the status display and state parsing.
My review focuses on improving maintainability and ensuring robust test coverage. I've identified some code duplication in the new CLI commands, a minor issue with output consistency, and a fragile data-reconstruction pattern. More critically, I've noted that tests for the status command's output have been removed without replacement, and the new pool list command is untested. Addressing these points will help ensure the long-term quality and stability of this new functionality.
I am having trouble creating individual review comments; my feedback follows below.
keras_remote/cli/output_test.py (1-101)
This test file, which provided coverage for the infrastructure_state output function, has been removed but not replaced. The infrastructure_state function in keras_remote/cli/output.py was significantly modified in this PR to support multiple node pools and maintain backward compatibility.
Removing these tests leaves this critical, user-facing output functionality completely untested. Please re-introduce and update these tests to cover:
- The new multi-pool display format.
- The legacy single-pool display format for backward compatibility.
- The display for CPU-only configurations (both new and legacy formats).
- The display when no infrastructure is found.
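A hypothetical sketch of the backward-compatibility branch these tests should exercise (the key names `"accelerators"` and `"accelerator"` come from the PR's export format; the helper name and body here are assumptions, not the PR's code):

```python
# Hypothetical helper illustrating the three display paths the tests should cover.
# Key names match the PR's export format; the function itself is an assumption.
def extract_pools(outputs: dict) -> list:
    """Return the list of accelerator pool dicts from stack outputs."""
    if "accelerators" in outputs:      # new multi-pool format (list)
        return list(outputs["accelerators"])
    if "accelerator" in outputs:       # legacy single-pool format (dict)
        return [outputs["accelerator"]]
    return []                          # CPU-only or no infrastructure found
```

Parametrizing the tests over these three input shapes would cover both display formats plus the CPU-only and not-found cases.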
keras_remote/cli/commands/pool.py (104)
Using on_output=print for stack.refresh() and stack.up() writes Pulumi's output directly to stdout, bypassing the rich.console object used elsewhere for CLI output. This can lead to inconsistent formatting and styling.
To ensure all CLI output is handled consistently, please use console.print as the callback. This applies here and in the following locations:
- `pool_add`: line 131
- `pool_remove`: lines 163 and 196
- `pool_list`: line 231
stack.refresh(on_output=console.print)
keras_remote/cli/commands/pool.py (147)
The pool_add and pool_remove commands share a significant amount of boilerplate code for setting up and executing the Pulumi update (e.g., resolving config, getting the stack, refreshing state, confirming with the user, and running stack.up). This duplication can make future maintenance more difficult.
Consider refactoring this common logic into a shared helper function or context manager. This would centralize the process of running a Pulumi update against a modified list of node pools, making the pool_add and pool_remove commands simpler and more focused on their specific logic of modifying the pool list.
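A minimal sketch of such a helper, using simplified stand-in types (the real version would also thread through config resolution, `stack.refresh()`, the confirmation prompt, and `stack.up()`; all names here are suggestions, not the PR's code):

```python
# Sketch of the shared-helper refactor: both commands reduce to
# "transform the pool list, then apply it". Names are assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class NodePoolConfig:
    name: str
    accelerator: str

def run_pool_update(existing: List[NodePoolConfig],
                    transform: Callable[[List[NodePoolConfig]], List[NodePoolConfig]],
                    apply: Callable[[List[NodePoolConfig]], None]) -> List[NodePoolConfig]:
    """Shared driver: compute the new pool list, then hand it to `apply`,
    which stands in for the refresh/confirm/stack.up sequence."""
    new_pools = transform(existing)
    apply(new_pools)
    return new_pools

# pool_add becomes a one-line transform:
def add_pool(pool: NodePoolConfig):
    return lambda pools: pools + [pool]

# pool_remove likewise:
def remove_pool(name: str):
    return lambda pools: [p for p in pools if p.name != name]
```

With this shape, `pool_add` and `pool_remove` only own their transform; the Pulumi boilerplate lives in one place.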
keras_remote/cli/commands/pool_test.py (55-61)
The pool list command is not covered by any tests in this file, and the _LIST_ARGS variable defined for it is unused. Please add tests for the pool list command to ensure its functionality is verified. The tests should cover cases where there are no pools, one pool, and multiple pools, as well as handling of potential errors when fetching the stack state.
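One possible shape for such tests, using click's `CliRunner` against a command wired to a stubbed state source (the command body and all names here are assumptions; the real test would patch the stack-state loader instead):

```python
# Hypothetical `pool list` test harness. The command factory stands in for
# the real command with its state loader stubbed out; names are assumptions.
import click
from click.testing import CliRunner

def make_pool_list_cmd(pools):
    @click.command("list")
    def pool_list():
        if not pools:
            click.echo("No node pools found.")
        for i, (name, accel) in enumerate(pools):
            click.echo(f"[{i}] {name}: {accel}")
    return pool_list

def run_pool_list(pools):
    """Invoke the stubbed command and return the click Result."""
    return CliRunner().invoke(make_pool_list_cmd(pools), [])
```

Parametrizing over zero, one, and multiple pools covers the cases called out above; an error-path test would stub the loader to raise.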
keras_remote/cli/infra/stack_manager.py (82-96)
The _export_to_node_pool function reconstructs an Accelerator config by formatting a string from the exported dictionary and then re-parsing it with accelerators.parse_accelerator. This approach is brittle because it tightly couples this function to the specific string formats supported by the parser. If the parser's supported formats change in the future, this logic could break.
A more robust approach would be to reconstruct the config objects directly from the structured entry dictionary, without the intermediate string representation. This could be achieved by:
- Adding a new function to the `accelerators` module (e.g., `from_export_dict`) that takes the dictionary and returns a config object.
- Making the internal `_make_gpu` and `_make_tpu` functions public and calling them here with the appropriate fields from the `entry` dict.
This would create a more stable contract between the export format and the state reconstruction logic.
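A minimal sketch of the first option. The function name `from_export_dict` is the one suggested above; the config classes and field names are assumptions about the export shape, not the PR's actual code:

```python
# Hypothetical `from_export_dict` for the accelerators module.
# Config classes and field names are assumptions about the export shape.
from dataclasses import dataclass

@dataclass
class GpuConfig:
    name: str
    count: int

@dataclass
class TpuConfig:
    name: str
    topology: str

def from_export_dict(entry: dict):
    """Rebuild an accelerator config directly from the exported dict,
    with no intermediate string round-trip through parse_accelerator."""
    kind = entry["type"]
    if kind == "gpu":
        return GpuConfig(name=entry["name"], count=int(entry.get("count", 1)))
    if kind == "tpu":
        return TpuConfig(name=entry["name"], topology=entry["topology"])
    raise ValueError(f"unknown accelerator type: {kind!r}")
```

Because this reads structured fields rather than re-parsing a formatted string, changes to the parser's accepted string formats cannot break state reconstruction.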
bc755f4 to 1815a13
divyashreepathihalli left a comment
Thanks for the PR Jyotinder. I have left some comments.
A doc explaining the flow and edge cases would be helpful for review.
new_pool_name = generate_pool_name(accel_config)
new_pool = NodePoolConfig(new_pool_name, accel_config)

project, zone, cluster_name, existing_pools = _load_pools(
Should we check if an identical accelerator configuration already exists in existing_pools and warn/block the user? Creating duplicate pools of the exact same instance type might confuse the GKE autoscaler.
I think we should think through this as a separate issue. I'm not sure what issues GKE has when autoscaling across similar node pools.
    project, zone, cluster_name
)

remaining = [p for p in existing_pools if p.name != pool_name]
In both add and remove, we prompt the user to proceed before running stack.up(). Should we run stack.preview() first so the user can see exactly what GCP resources Pulumi plans to create or destroy before confirming [y/n]?
Yes we should, but I plan to add this separately and unify the preview flow across the pools and the up command.
I've created an issue to track this: #62
    "v5litepod, v5p, v6e, v3 (with optional count/topology)",
)
@click.option("--yes", "-y", is_flag=True, help="Skip confirmation prompt")
def pool_add(project, zone, cluster_name, accelerator, yes):
Just a thought: if multiple people use the same cluster and run pool add concurrently, they might overwrite each other's state changes. Is this a concern for our current use cases?
It might be overkill to worry about this right now, but flagging the edge case.
We won't need to worry about this once we start using GCS as a backend for the pulumi state file, since pulumi automatically handles safe concurrent access.
keras_remote/cli/commands/up.py (Outdated)

# Build node pool list
node_pools = []
if accel_config is not None:
If a user runs `keras-remote up` on an existing cluster that has 3 node pools, this code will overwrite `config.node_pools` with a single new pool, effectively deleting the other 3 pools via Pulumi. Should `up` act differently on existing clusters, perhaps refreshing state and merging, or explicitly warning that it will overwrite pools?
Oh, I missed this. I've simplified the user flow to avoid this edge case entirely. If node pools already exist when the user runs `up`, we ignore the accelerator flag, skip the accelerator selection prompt, and show a warning. The `pool add`/`remove`/`list` commands are now the only way to manage pools after the first `keras-remote up`.
Multi-node-pool support
Fixes #45
Adds support for multiple accelerator node pools per cluster, replacing the previous single-pool model. Users can now incrementally add and remove pools without reprovisioning the entire cluster.
New CLI commands
keras-remote status— The accelerator section now renders multiple pools with indexed entries:When no accelerator pools exist:
keras-remote upconfiguration summary — Now shows all accelerators as a comma-separated list:Or for CPU-only:
Accelerators CPU only.Both
pool addandpool removeaccept--yes/-yto skip confirmation prompts (matchingupanddownbehavior).Implementation details
Data model — `InfraConfig.accelerator` (single) replaced with `InfraConfig.node_pools` (list of `NodePoolConfig`). Each `NodePoolConfig` pairs a generated pool name with an accelerator config.
Pool naming — Names follow the pattern `gpu-{name}-{hex4}` / `tpu-{name}-{hex4}` (e.g., `gpu-l4-a3f2`, `tpu-v5p-b7e1`). The random suffix allows multiple pools with identical accelerator specs.
Pulumi program — Now iterates over the node pool list, creating one GKE node pool per entry. Pool names are passed in rather than hardcoded. Exports `"accelerators"` (list) instead of `"accelerator"` (single dict).
Pool management — `pool add` reads current state from stack exports, appends the new pool, and runs `stack.up()`. `pool remove` does the inverse. Pulumi handles the diff — only the added/removed pool is created/destroyed.
Re-running `up` on existing clusters — `up` now refreshes stack state before provisioning. If existing node pools are found, they are preserved as-is and the `--accelerator` flag is ignored — a message directs users to `pool add`/`remove` instead. On first run (no stack), `--accelerator` creates the initial pool as before.
Backward compatibility — Status display handles both the new `"accelerators"` list key and the old `"accelerator"` single-dict key for pre-upgrade stacks.
Migration
Existing stacks are fully compatible. `up` preserves existing node pools (including their original names) by reading state from the Pulumi stack before provisioning. New pools added via `pool add` use the new `gpu-{name}-{hex4}` / `tpu-{name}-{hex4}` naming scheme.
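The naming scheme described above can be sketched as follows. This is an illustrative implementation, not the PR's code; the function name matches `generate_pool_name` from the diff, but its signature and use of `secrets` are assumptions:

```python
# Sketch of the gpu-{name}-{hex4} / tpu-{name}-{hex4} naming scheme.
# Signature and implementation details are assumptions.
import secrets

def generate_pool_name(kind: str, accel_name: str) -> str:
    """Return e.g. 'gpu-l4-a3f2'. The random 4-hex-char suffix lets
    multiple pools share an identical accelerator spec without colliding."""
    suffix = secrets.token_hex(2)  # 2 bytes -> 4 hex characters
    return f"{kind}-{accel_name}-{suffix}"
```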