feat: implement Kubernetes Cluster Autoscaler integration #89
Conversation
📝 Walkthrough
Implements Kubernetes cluster autoscaling via an out-of-tree gRPC External Provider pattern. Adds an autoscaler server, node group management across domain/service/repository layers, API endpoints for node group CRUD, Kubernetes deployment manifests with a sidecar bridge, and database migrations supporting cluster node groups as the primary scaling unit.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CA as Cluster Autoscaler
participant Bridge as thecloud-bridge<br/>(Autoscaler Server)
participant SDK as SDK Client
participant API as Cloud API
participant DB as Database
Note over CA,DB: Node Group Scaling Flow
CA->>Bridge: NodeGroups(ctx, req)
Bridge->>SDK: List node groups for cluster
SDK->>API: GET /clusters/{id}/nodegroups
API->>DB: Query cluster_node_groups
DB-->>API: Return node groups
API-->>SDK: Return NodeGroup[]
SDK-->>Bridge: Return NodeGroup[]
Bridge-->>CA: Return NodeGroups response
CA->>Bridge: NodeGroupIncreaseSize(ctx, req)
activate Bridge
Bridge->>Bridge: Validate: desired ≤ MaxSize
Bridge->>SDK: UpdateNodeGroup(clusterID, name, newSize)
SDK->>API: PUT /clusters/{id}/nodegroups/{name}
API->>DB: Update cluster_node_groups.current_size
DB-->>API: OK
API-->>SDK: Return updated NodeGroup
SDK-->>Bridge: Return updated NodeGroup
Bridge-->>CA: Return success response
deactivate Bridge
CA->>Bridge: NodeGroupNodes(ctx, req)
Bridge->>SDK: List instances in node group
SDK->>API: GET /clusters/{id}/instances?node_group={name}
API->>DB: Query instances with metadata filter
DB-->>API: Return instances
API-->>SDK: Return Instance[]
SDK-->>Bridge: Return Instance[]
Bridge-->>CA: Return Instance[] mapped to protos.Instance
```
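For orientation, here is a minimal sketch of how the bridge's `NodeGroups` RPC could map SDK node groups onto the external gRPC protos, per the flow above. The SDK method name (`ListNodeGroupsWithContext`) and the error handling are assumptions, not the exact implementation in `internal/autoscaler/server.go`.

```go
// Sketch only: the SDK call below is an assumption based on the sequence diagram,
// not a confirmed symbol from pkg/sdk.
func (s *AutoscalerServer) NodeGroups(ctx context.Context, req *protos.NodeGroupsRequest) (*protos.NodeGroupsResponse, error) {
    groups, err := s.client.ListNodeGroupsWithContext(ctx, s.clusterID)
    if err != nil {
        return nil, err
    }
    resp := &protos.NodeGroupsResponse{}
    for _, g := range groups {
        // Map each SDK node group to the externalgrpc NodeGroup proto.
        resp.NodeGroups = append(resp.NodeGroups, &protos.NodeGroup{
            Id:      g.Name,
            MinSize: int32(g.MinSize),
            MaxSize: int32(g.MaxSize),
        })
    }
    return resp, nil
}
```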
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Pull request overview
Implements Kubernetes Cluster Autoscaler integration using the external gRPC provider pattern, introducing a first-class “node group / node pool” model and deploying an in-cluster bridge that translates CA RPCs into The Cloud API calls.
Changes:
- Add NodeGroup domain/API/storage support (DB migration, repository/service/handler/routes, SDK types + endpoints).
- Add autoscaler gRPC bridge service (`autoscaler-server`) and Kubernetes manifests to deploy it alongside upstream Cluster Autoscaler.
- Update documentation (ADR-020, KaaS guides, and DB schema docs) to describe the integration and usage.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| pkg/sdk/kubernetes.go | Adds NodeGroup models and SDK methods for node group management; introduces context-aware GetCluster. |
| k8s/autoscaler/rbac.yaml | RBAC resources for running Cluster Autoscaler + bridge in kube-system. |
| k8s/autoscaler/deployment.yaml | Deployment running upstream cluster-autoscaler plus The Cloud bridge sidecar. |
| k8s/autoscaler/configmap.yaml | External gRPC provider config pointing CA to localhost bridge port. |
| k8s/autoscaler/cluster-id-configmap.yaml | ConfigMap template to provide the cluster ID to the bridge. |
| internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql | Creates cluster_node_groups table and backfills a default pool for existing clusters. |
| internal/repositories/postgres/migrations/098_create_cluster_node_groups.down.sql | Drops cluster_node_groups. |
| internal/repositories/postgres/cluster_repo.go | Loads node groups for clusters and adds CRUD repository methods for node groups. |
| internal/repositories/k8s/provisioner.go | Provisions workers based on node groups and tags instances with cluster/node-group metadata. |
| internal/handlers/cluster_handler.go | Adds HTTP endpoints for add/update/delete node groups. |
| internal/core/services/cluster.go | Persists default node group on cluster creation and implements node group management logic. |
| internal/core/ports/compute_options.go | Extends instance creation options with metadata/labels. |
| internal/core/ports/cluster.go | Extends ClusterRepository/ClusterService interfaces with node group operations. |
| internal/core/domain/cluster.go | Adds NodeGroup domain model and attaches NodeGroups to Cluster. |
| internal/autoscaler/server.go | Implements externalgrpc CloudProvider server backed by The Cloud SDK. |
| internal/autoscaler/protos/externalgrpc_grpc.pb.go | Generated gRPC service code for externalgrpc provider. |
| internal/api/setup/router.go | Registers node group routes under /clusters. |
| docs/services/kubernetes-autoscaler.md | New autoscaler feature documentation and usage notes. |
| docs/kubernetes.md | Updates KaaS guide to mention autoscaling + adjusts some MVP limitations wording. |
| docs/database.md | Documents clusters and new cluster_node_groups schema. |
| docs/adr/ADR-020-kubernetes-cluster-autoscaler.md | ADR describing the chosen autoscaler integration strategy. |
| cmd/autoscaler-server/main.go | CLI entrypoint for the bridge service. |
| Makefile | Adds build target for autoscaler-server. |
| Dockerfile.autoscaler-server | Adds container build for the bridge service. |
```go
for _, node := range req.Nodes {
    providerID := strings.TrimPrefix(node.ProviderID, "thecloud://")
    if providerID == node.ProviderID {
        continue
    }
    if err := s.client.TerminateInstance(providerID); err != nil {
        klog.Errorf("Failed to terminate instance %s: %v", providerID, err)
        return nil, err
    }
}
```
NodeGroupDeleteNodes terminates instances solely based on ProviderID without validating that the instance belongs to the requested node group and cluster. A malformed/stale request could terminate unrelated instances. Fetch the instance first and verify metadata (cluster-id/node-group) matches req.Id and s.clusterID before terminating.
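A hedged sketch of that ownership check, assuming the SDK exposes a context-aware instance lookup and that instances carry the cluster/node-group metadata written by the provisioner (the method and metadata key names are illustrative):

```go
// Sketch only: GetInstanceWithContext and the metadata keys are assumptions,
// not confirmed SDK symbols.
inst, err := s.client.GetInstanceWithContext(ctx, providerID)
if err != nil {
    klog.Errorf("Failed to look up instance %s: %v", providerID, err)
    return nil, err
}
if inst.Metadata["cluster-id"] != s.clusterID || inst.Metadata["node-group"] != req.Id {
    klog.Warningf("Refusing to terminate instance %s: not in node group %s of cluster %s",
        providerID, req.Id, s.clusterID)
    continue
}
if err := s.client.TerminateInstance(providerID); err != nil {
    klog.Errorf("Failed to terminate instance %s: %v", providerID, err)
    return nil, err
}
```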
```go
_, _ = s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
    DesiredSize: &newSize,
})
```
NodeGroupDeleteNodes computes newSize := currentSize - len(nodes) and sends it without bounding to MinSize/0, and it also ignores the UpdateNodeGroupWithContext error. This can set negative/invalid sizes and hide failures; clamp/validate the new size and handle the update error.
```diff
-_, _ = s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
-    DesiredSize: &newSize,
-})
+if newSize < 0 {
+    klog.Errorf("calculated negative desired size %d for node group %s, clamping to 0", newSize, req.Id)
+    newSize = 0
+}
+if _, err := s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
+    DesiredSize: &newSize,
+}); err != nil {
+    klog.Errorf("failed to update node group %s desired size to %d: %v", req.Id, newSize, err)
+    return nil, err
+}
```
```markdown
1. A pod is in `Pending` status due to CPU/Memory constraints.
2. Upstream CA identifies that a Node Group can satisfy the requirement.
3. Upstream CA calls `NodeGroupIncreaseSize` via gRPC.
4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
```
This doc references POST /clusters/{id}/nodegroups/{name}, but the API routes are POST /clusters/{id}/nodegroups (create) and PUT /clusters/{id}/nodegroups/{name} (update/scale). Update the scaling flow to reflect the actual endpoint the bridge uses (PUT).
```diff
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls `PUT /clusters/{id}/nodegroups/{name}` on The Cloud API.
```
```go
// Node Group operations
AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error
GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error)
UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error
DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error
```
The new ClusterRepository interface adds node group methods, but other in-repo implementations/mocks (e.g., internal/repositories/noop.NoopClusterService/NoopClusterRepository and internal/core/services test mocks) must be updated to implement these methods; otherwise builds/tests that wire these into ports.ClusterService/ports.ClusterRepository will not compile.
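For illustration, a testify/mock stub for two of the new methods might look like the sketch below; the `MockClusterRepository` name is an assumption and should match whatever mock type the existing tests define.

```go
// Hypothetical mock type name; the real test mocks may be named differently.
func (m *MockClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
    args := m.Called(ctx, ng)
    return args.Error(0)
}

func (m *MockClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
    args := m.Called(ctx, clusterID)
    if groups, ok := args.Get(0).([]domain.NodeGroup); ok {
        return groups, args.Error(1)
    }
    return nil, args.Error(1)
}
```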
```go
if err == nil {
    c.NodeGroups = ngs
}
```
ClusterRepository.list() ignores errors from GetNodeGroups and returns clusters with silently missing NodeGroups. This can mask DB/query failures and lead to inconsistent API responses; consider returning the error (or at least logging it) instead of dropping it.
```diff
-if err == nil {
-    c.NodeGroups = ngs
-}
+if err != nil {
+    return nil, errors.Wrap(errors.Internal, "failed to list node groups for cluster", err)
+}
+c.NodeGroups = ngs
```
```go
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
    var resp Response[any]
    return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
```
SDK NodeGroup operations are inconsistent with the established context pattern (non-context method delegating to WithContext). Add AddNodeGroupWithContext/DeleteNodeGroupWithContext (and optionally a non-context UpdateNodeGroup wrapper) so callers can control cancellation/timeouts.
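A minimal sketch of the delegation pattern for one of the missing pairs, assuming a `postWithContext` helper exists alongside `putWithContext` (that helper name is an assumption, not a confirmed SDK symbol):

```go
// AddNodeGroupWithContext adds a new node pool with caller-controlled cancellation.
// Sketch only: postWithContext is assumed to exist, mirroring putWithContext.
func (c *Client) AddNodeGroupWithContext(ctx context.Context, clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.postWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// AddNodeGroup keeps the non-context convenience signature by delegating.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    return c.AddNodeGroupWithContext(context.Background(), clusterID, input)
}
```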
```go
// Fetch Node Groups for each cluster in the list
ngs, err := r.GetNodeGroups(ctx, c.ID)
if err == nil {
    c.NodeGroups = ngs
}
```
list() performs an additional GetNodeGroups query per cluster row (N+1). For users with many clusters this will add noticeable latency/load; consider fetching node groups for all listed cluster IDs in a single query and grouping in-memory.
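One possible shape for such a batched lookup, sketched with pgx-style querying; the method name, receiver field (`r.db`), and SQL are illustrative rather than the repository's actual implementation:

```go
// GetNodeGroupsForClusters is a sketch of a single-query batch fetch; the method
// name, r.db field, and SQL are assumptions based on the schema shown in this PR.
func (r *ClusterRepository) GetNodeGroupsForClusters(ctx context.Context, clusterIDs []uuid.UUID) (map[uuid.UUID][]domain.NodeGroup, error) {
    rows, err := r.db.Query(ctx, `
        SELECT id, cluster_id, name, instance_type, min_size, max_size, current_size, created_at, updated_at
        FROM cluster_node_groups
        WHERE cluster_id = ANY($1)`, clusterIDs)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    byCluster := make(map[uuid.UUID][]domain.NodeGroup)
    for rows.Next() {
        var ng domain.NodeGroup
        if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType,
            &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
            return nil, err
        }
        // Group node groups in memory keyed by their cluster ID.
        byCluster[ng.ClusterID] = append(byCluster[ng.ClusterID], ng)
    }
    return byCluster, rows.Err()
}
```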
```go
// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
cluster.WorkerCount += params.DesiredSize
_ = s.repo.Update(ctx, cluster)
```
AddNodeGroup/UpdateNodeGroup update cluster.WorkerCount but ignore errors from repo.Update. If the update fails, WorkerCount will drift from the sum of node groups, which can break legacy paths relying on WorkerCount. Return/handle the error (and consider wrapping it) instead of discarding it.
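One way to handle it, sketched under the assumption that rolling back the in-memory counter on failure is acceptable here (the error wrapping is illustrative):

```go
// Sketch only: persist the counter change and revert the in-memory value on failure
// so cluster.WorkerCount stays consistent with what is actually stored.
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
    cluster.WorkerCount -= params.DesiredSize
    return nil, fmt.Errorf("failed to persist updated worker count: %w", err)
}
```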
```go
if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
    return nil, err
}
_ = s.repo.Update(ctx, cluster)
```
UpdateNodeGroup() persists the node group and then ignores the error from repo.Update(cluster). If that update fails, the cluster record (WorkerCount/UpdatedAt) may become inconsistent with the node group state. Handle/return the update error rather than discarding it.
```diff
-_ = s.repo.Update(ctx, cluster)
+if err := s.repo.Update(ctx, cluster); err != nil {
+    return nil, err
+}
```
```go
func (s *AutoscalerServer) NodeGroupTemplateNodeInfo(ctx context.Context, req *protos.NodeGroupTemplateNodeInfoRequest) (*protos.NodeGroupTemplateNodeInfoResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) NodeGroupGetOptions(ctx context.Context, req *protos.NodeGroupAutoscalingOptionsRequest) (*protos.NodeGroupAutoscalingOptionsResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) PricingNodePrice(ctx context.Context, req *protos.PricingNodePriceRequest) (*protos.PricingNodePriceResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) PricingPodPrice(ctx context.Context, req *protos.PricingPodPriceRequest) (*protos.PricingPodPriceResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}
```
The Cluster Autoscaler external gRPC spec expects optional/unimplemented methods to return gRPC code Unimplemented (12). Returning fmt.Errorf("unimplemented") will surface as code Unknown instead. Use status.Errorf(codes.Unimplemented, ...) for these methods.
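A minimal sketch of that change for one of the stubs; the same pattern applies to the other three:

```go
import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

func (s *AutoscalerServer) NodeGroupTemplateNodeInfo(ctx context.Context, req *protos.NodeGroupTemplateNodeInfoRequest) (*protos.NodeGroupTemplateNodeInfoResponse, error) {
    // Return gRPC code Unimplemented (12) so the Cluster Autoscaler treats this
    // optional method as unsupported rather than as an unknown failure.
    return nil, status.Errorf(codes.Unimplemented, "NodeGroupTemplateNodeInfo is not implemented")
}
```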
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (6)
docs/kubernetes.md-119-121 (1)
119-121: ⚠️ Potential issue | 🟡 Minor
“Limitations (MVP)” now lists capabilities, not limitations.
The new bullets describe supported features. Please either rename this section or replace these bullets with actual current constraints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes.md` around lines 119 - 121, The section header "Limitations (MVP)" currently lists supported features ("Storage: Dynamic storage provisioning (CSI)" and "High Availability: ... via the `--ha` flag") instead of constraints; either rename the header to "Features (MVP)" or keep "Limitations (MVP)" and replace those bullets with real limitations (for example: lack of automated backups, no cluster autoscaler, limited OS image support, manual certificate rotation, or single-region control plane) so the text matches the header; update the header text and/or replace the two bullets ("Storage" and "High Availability" referencing `--ha`) accordingly.
docs/services/kubernetes-autoscaler.md-38-38 (1)
38-38: ⚠️ Potential issue | 🟡 Minor
Fix the scale-up API endpoint description.
Line 38 documents `POST /clusters/{id}/nodegroups/{name}`, which conflicts with the create example on Lines 57-58 and can mislead readers about scale vs create behavior.
Doc wording fix
```diff
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls the Node Group resize/update endpoint (for example, `PATCH /clusters/{id}/nodegroups/{name}`) on The Cloud API.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/services/kubernetes-autoscaler.md` at line 38, The doc currently lists the scale-up endpoint as `POST /clusters/{id}/nodegroups/{name}`, which conflicts with the create example; update the wording to describe the actual scale/update endpoint (replace `POST /clusters/{id}/nodegroups/{name}`) with the correct API verb and path for resizing a nodegroup (for example `PATCH /clusters/{id}/nodegroups/{name}` or the team-approved scale path such as `POST /clusters/{id}/nodegroups/{name}/scale`), and ensure the text clearly states this is a scale/update operation (not create) so it no longer conflicts with the create example that uses `POST /clusters/{id}/nodegroups`.
docs/swagger/docs.go-9685-9698 (1)
9685-9698: ⚠️ Potential issue | 🟡 Minor
Disallow empty update bodies for node group updates.
With all fields optional, an empty object is currently valid for `UpdateNodeGroupRequest`. Require at least one updatable field to prevent no-op update calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9685 - 9698, The UpdateNodeGroupRequest schema currently allows an empty object; modify its schema to require at least one updatable field by adding a constraint (e.g., "minProperties": 1) to the "httphandlers.UpdateNodeGroupRequest" schema in docs/swagger/docs.go so that requests with no fields are rejected; reference the schema name "httphandlers.UpdateNodeGroupRequest" and ensure the change is applied to the JSON object describing that schema.
docs/swagger/docs.go-9550-9572 (1)
9550-9572: ⚠️ Potential issue | 🟡 Minor
Add numeric constraints to node group sizing fields.
On Line 9556, Line 9562, and Line 9565, `desired_size`/`max_size`/`min_size` are unconstrained. The schema currently allows invalid values (including negatives). Add bounds in annotations so generated docs enforce valid ranges.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9550 - 9572, The schema for httphandlers.NodeGroupRequest currently allows negative/invalid sizes for desired_size, max_size, and min_size; update the struct or swagger annotations for the NodeGroupRequest fields desired_size, max_size, and min_size to include numeric bounds (e.g., add "minimum: 0" to each and optionally a sensible "maximum" if desired) so the generated docs emit valid range constraints (minimum: 0) for those integer properties.
internal/repositories/postgres/cluster_repo.go-244-252 (1)
244-252: ⚠️ Potential issue | 🟡 Minor
Check `rows.Err()` after iterating node groups.
Without a post-loop `rows.Err()` check, cursor-level read errors can be missed.
Suggested change
```diff
 for rows.Next() {
     var ng domain.NodeGroup
     if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
         return nil, errors.Wrap(errors.Internal, "failed to scan node group", err)
     }
     groups = append(groups, ng)
 }
+if err := rows.Err(); err != nil {
+    return nil, errors.Wrap(errors.Internal, "failed while iterating node groups", err)
+}
 return groups, nil
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo.go` around lines 244 - 252, After iterating rows into the groups slice you must check cursor-level errors by calling rows.Err(); update the code after the for rows.Next() loop (where rows, groups and domain.NodeGroup are used and rows.Scan is invoked) to call if err := rows.Err(); err != nil { return nil, errors.Wrap(errors.Internal, "row iteration error", err) } before returning groups, nil so any underlying read errors are surfaced and wrapped consistently.
internal/handlers/cluster_handler.go-380-460 (1)
380-460: ⚠️ Potential issue | 🟡 Minor
Add missing `@Security` Swagger annotations on new NodeGroup handlers.
`AddNodeGroup`, `UpdateNodeGroup`, and `DeleteNodeGroup` docs omit `@Security`, so generated API docs won’t fully describe auth requirements.
Suggested doc updates
```diff
 // AddNodeGroup godoc
 // @Summary Add a node group to a cluster
 // @Description Creates a new pool of worker nodes
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param request body NodeGroupRequest true "Node Group details"
 // @Success 201 {object} domain.NodeGroup
 // @Router /clusters/{id}/nodegroups [post]

 // UpdateNodeGroup godoc
 // @Summary Update a node group
 // @Description Modifies scaling boundaries or desired size of a node pool
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param name path string true "Node Group Name"
 // @Param request body UpdateNodeGroupRequest true "Update details"
 // @Success 200 {object} domain.NodeGroup
 // @Router /clusters/{id}/nodegroups/{name} [put]

 // DeleteNodeGroup godoc
 // @Summary Delete a node group
 // @Description Removes a node pool and terminates its nodes
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param name path string true "Node Group Name"
 // @Success 202
 // @Router /clusters/{id}/nodegroups/{name} [delete]
```
As per coding guidelines, "Include Swagger documentation comments on handler methods with `@Summary`, `@Tags`, `@Security`, and `@Router` directives".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler.go` around lines 380 - 460, The Swagger comments for AddNodeGroup, UpdateNodeGroup, and DeleteNodeGroup are missing `@Security`; update the doc comments above each handler (AddNodeGroup, UpdateNodeGroup, DeleteNodeGroup) to include the same `@Security` directive used by other K8s handlers in the file (e.g., `@Security` ApiKeyAuth or `@Security` BearerAuth — use the exact scheme name already present elsewhere), placing it alongside the existing `@Summary`, `@Tags`, and `@Router` lines so the generated API docs correctly show authentication requirements.
🧹 Nitpick comments (12)
internal/core/ports/compute_options.go (1)
13-14: Add JSON tags for the new optional map fields.
`Metadata` and `Labels` are optional but currently lack `json` tags with `omitempty`.
♻️ Proposed change
```diff
-	Metadata map[string]string // Key-value metadata for the instance
-	Labels   map[string]string // Scheduling or grouping labels
+	Metadata map[string]string `json:"metadata,omitempty"` // Key-value metadata for the instance
+	Labels   map[string]string `json:"labels,omitempty"`   // Scheduling or grouping labels
```
As per coding guidelines, "Include `json` tags on struct fields with `omitempty` for optional fields".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/ports/compute_options.go` around lines 13 - 14, The struct fields Metadata and Labels are optional but missing JSON tags; update their definitions in the struct (fields named Metadata and Labels in this file) to include json tags with omitempty (e.g., `json:"metadata,omitempty"` and `json:"labels,omitempty"`) so they are omitted from serialized output when nil/empty.
Dockerfile.autoscaler-server (1)
20-28: Harden runtime image by dropping root.
The final container runs as root. Consider adding a non-root user for better default security posture.
🔒 Proposed hardening
```diff
 FROM alpine:3.19
 WORKDIR /app
+RUN addgroup -S app && adduser -S app -G app
 # Copy the binary from the builder stage
 COPY --from=builder /app/bin/autoscaler-server /usr/local/bin/thecloud-autoscaler-server
 # Run the server
+USER app
 ENTRYPOINT ["/usr/local/bin/thecloud-autoscaler-server"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile.autoscaler-server` around lines 20 - 28, The container currently runs as root; modify Dockerfile.autoscaler-server to create a non-root user and group (e.g., "autoscaler"), chown the runtime binary (/usr/local/bin/thecloud-autoscaler-server) and any needed app dirs to that user, set a safe HOME if needed, and add a USER autoscaler (or UID/GID) directive before the ENTRYPOINT so ENTRYPOINT ["/usr/local/bin/thecloud-autoscaler-server"] runs as the non-root user; ensure you use Alpine-compatible commands (addgroup/adduser or adduser -S/addgroup -S) and preserve executable permissions on the binary.
docs/adr/ADR-020-kubernetes-cluster-autoscaler.md (1)
29-29: Align bridge naming across ADR and implementation artifacts.
This ADR uses `thecloud-bridge`, while build/runtime files in this PR use `autoscaler-server`/`thecloud-autoscaler-server`. Please standardize on one canonical name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/ADR-020-kubernetes-cluster-autoscaler.md` at line 29, Update naming so the ADR and implementation use the same canonical bridge name: pick one of the existing identifiers (either "thecloud-bridge" or "autoscaler-server"/"thecloud-autoscaler-server") and replace the others to match; update ADR-020-kubernetes-cluster-autoscaler.md (the line containing "thecloud-bridge") to the chosen canonical name and also update any build/runtime artifacts that reference "autoscaler-server" or "thecloud-autoscaler-server" so all references (ADR text, service names, manifests, build scripts) consistently use the single selected identifier.
internal/repositories/postgres/migrations/020_create_clusters.down.sql (1)
1-2: Prefer explicit rollback dependency handling over broad `CASCADE`.
Line 2 uses `CASCADE` even after `cluster_nodes` is already explicitly dropped on Line 1. This can over-drop unrelated dependent objects and makes rollback behavior less predictable.
Suggested migration tweak
```diff
-DROP TABLE IF EXISTS cluster_nodes CASCADE;
-DROP TABLE IF EXISTS clusters CASCADE;
+DROP TABLE IF EXISTS cluster_nodes;
+DROP TABLE IF EXISTS clusters;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/migrations/020_create_clusters.down.sql` around lines 1 - 2, The second DROP uses broad CASCADE which can remove unrelated objects; change the rollback to avoid CASCADE on the clusters drop—replace "DROP TABLE IF EXISTS clusters CASCADE;" with "DROP TABLE IF EXISTS clusters;" (or "DROP TABLE IF EXISTS clusters RESTRICT;") and, if necessary, explicitly drop any known dependent objects (e.g., foreign keys or indexes that reference clusters) instead of relying on CASCADE; refer to the DROP TABLE lines and the table names cluster_nodes and clusters to locate the statements.
internal/repositories/postgres/cluster_repo_unit_test.go (1)
62-120: Refactor the newly added NodeGroup repository tests to table-driven + testify/mock.
The new cases repeat setup/assertion patterns and continue using `pgxmock`; please align these added tests with the repository’s test conventions.
As per coding guidelines, `**/*_test.go`: "Use table-driven tests in test files" and "Use testify/mock for creating mock objects in tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo_unit_test.go` around lines 62 - 120, The three tests (GetByID, ListAll, ListByUserID) repeat setup and use pgxmock; convert them to a single table-driven test loop that iterates cases (name, setupMock, callFn, expectedLen) and replace pgxmock usage with testify/mock-based mocks; for each case call NewClusterRepository with the testify mock (or mock the lower-level DB interface used by NewClusterRepository), configure expectations in setupMock (return rows/NodeGroups for that case), then invoke repo.GetByID / repo.ListAll / repo.ListByUserID accordingly and assert results (len of clusters and cluster.NodeGroups). Ensure you reference the existing methods NewClusterRepository, GetByID, ListAll, ListByUserID and the NodeGroups field when writing expectations and assertions.
cmd/autoscaler-server/main.go (1)
18-18: Replace the hardcoded default gRPC port with a named constant.
`50051` at Line 18 is an inline magic number; move it to a constant for maintainability.
♻️ Suggested change
```diff
+const DefaultGRPCPort = 50051
+
 func main() {
     var (
-        port = flag.Int("port", 50051, "The gRPC server port")
+        port = flag.Int("port", DefaultGRPCPort, "The gRPC server port")
```
As per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/autoscaler-server/main.go` at line 18, Replace the inline magic number 50051 used in flag.Int call by introducing a named constant (e.g., defaultGRPCPort) at the top of the file and use that constant in the call to flag.Int("port", defaultGRPCPort, ...); update any related comments or usages referencing the hardcoded port to use the new constant and ensure the constant is of type int to match flag.Int's default value.
internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql (1)
7-13: Add DB-level size invariants for node groups.
The table currently allows invalid combinations (`min_size > max_size`, `current_size` outside bounds). Enforcing these at DB level prevents corrupted scaling state.
✅ Suggested change
```diff
 CREATE TABLE IF NOT EXISTS cluster_node_groups (
 @@
     min_size INTEGER NOT NULL DEFAULT 1,
     max_size INTEGER NOT NULL DEFAULT 10,
     current_size INTEGER NOT NULL DEFAULT 1,
 @@
-    UNIQUE(cluster_id, name)
+    UNIQUE(cluster_id, name),
+    CHECK (min_size >= 0),
+    CHECK (max_size >= min_size),
+    CHECK (current_size >= min_size AND current_size <= max_size)
 );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql` around lines 7 - 13, Add DB-level CHECK constraints to the cluster node groups table to enforce size invariants: add a check ensuring min_size <= max_size and another ensuring current_size >= min_size AND current_size <= max_size (you can also include min_size >= 1 if you want to match the DEFAULT). Add these constraints to the CREATE TABLE for the table that defines min_size, max_size, current_size (e.g., add CHECK (min_size <= max_size) and CHECK (current_size BETWEEN min_size AND max_size) with descriptive constraint names like chk_cluster_node_groups_min_le_max and chk_cluster_node_groups_current_in_range).
docs/swagger/docs.go (2)
6937-6968: Clarify `worker_count` vs `node_groups` contract.
`node_groups` was added on Line 6937, while `worker_count` remains in the same schema on Line 6966. If node pools are the canonical model, mark `worker_count` deprecated (or remove in the next API version) to avoid ambiguous client behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 6937 - 6968, The schema exposes both node_groups (items -> "$ref": "#/definitions/domain.NodeGroup") and worker_count which creates ambiguity; update the schema for the cluster definition to mark worker_count as deprecated and guide clients to use node_groups instead: add a deprecation indicator and a clear description on the worker_count property (e.g., "DEPRECATED: use node_groups for node pool counts; will be removed in next API version"), or remove worker_count in the next major API version, ensuring any generated docs/clients reflect the change.
7874-7905: Strengthen `domain.NodeGroup` response schema with `required` fields.
Right now all properties are optional in the OpenAPI model. For better client generation, consider marking stable fields (`id`, `cluster_id`, `name`, `min_size`, `max_size`, `current_size`) as required.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 7874 - 7905, The OpenAPI schema for domain.NodeGroup currently has all properties optional; add a "required" array to the domain.NodeGroup definition listing the stable fields to make them mandatory (id, cluster_id, name, min_size, max_size, current_size). Update the domain.NodeGroup object in docs/swagger/docs.go to include the required property alongside the existing "type" and "properties" keys so code generation tools will treat those fields as required.
docs/swagger/swagger.json (3)
6929-6935: Clarify the canonical scaling field on `domain.Cluster`.
With `node_groups` added at Line 6929, keeping `worker_count` at Line 6958 creates two scaling representations on the same resource. Please document one as canonical (and the other as legacy/deprecated) to avoid conflicting client behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 6929 - 6935, The Cluster schema now exposes two scaling representations—node_groups and worker_count—so update the domain.Cluster definition to mark one as canonical and the other as deprecated: designate "node_groups" as the canonical scaling field (or vice versa as your choice) and add clear description text for both properties indicating the canonical field, marking the other with "deprecated": true and a deprecation note pointing to the canonical field; ensure you update the descriptions for "node_groups" and "worker_count" in the swagger.json so clients know which to use and that the deprecated field is retained only for backward compatibility.
7866-7897: Add contract-level bounds for node group size integers.
`min_size`, `max_size`, `current_size`, and `desired_size` are currently unbounded. Add non-negative constraints in the OpenAPI schema so invalid payloads are rejected earlier.
♻️ Proposed schema hardening
```diff
 "domain.NodeGroup": {
     "type": "object",
     "properties": {
 @@
         "current_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
 "httphandlers.NodeGroupRequest": {
     "type": "object",
 @@
     "properties": {
         "desired_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
 "httphandlers.UpdateNodeGroupRequest": {
     "type": "object",
     "properties": {
         "desired_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 }
```
Also applies to: 9542-9564, 9677-9690
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 7866 - 7897, The NodeGroup OpenAPI schema allows negative integers for size fields; update the domain.NodeGroup schema and the same schemas at the other occurrences to add a non-negative constraint by adding "minimum": 0 to the integer properties min_size, max_size, current_size and desired_size (where present), keeping their type as "integer" (you can also add "format": "int32" if desired) so invalid negative values are rejected at contract level.
1218-1225: Document common error responses for new node group operations.
Only success codes are currently specified. Adding standard `400`/`404`/`409`/`500` response schemas here will make generated clients and API consumers more reliable.
Also applies to: 1260-1267, 1291-1294
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 1218 - 1225, The OpenAPI responses for the node group create endpoints only include a 201 success schema (referencing definitions/domain.NodeGroup); add standard error response entries for 400, 404, 409, and 500 to each of these operation response objects (the same blocks that currently contain the "201" response) and point each to your common error schema (e.g., definitions/Error or definitions/domain.Error) with appropriate descriptions like "Bad Request", "Not Found", "Conflict", and "Internal Server Error" so generated clients can handle failures consistently; update the three duplicate response blocks (the ones with the 201/domain.NodeGroup reference) to include these four additional responses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/autoscaler-server/main.go`:
- Line 33: The gRPC listener is currently bound to all interfaces via
net.Listen("tcp", fmt.Sprintf(":%d", *port)) which exposes the server
in-cluster; change the bind address to loopback by default by using
fmt.Sprintf("127.0.0.1:%d", *port) (or build an address string like addr :=
fmt.Sprintf("127.0.0.1:%d", *port) and pass addr to net.Listen) so lis, err :=
net.Listen("tcp", addr) binds to loopback, keeping the existing *port flag
unchanged so callers can still override if needed.
In `@Dockerfile.autoscaler-server`:
- Line 2: The Dockerfile currently pins the builder stage to an older Go image
("FROM golang:1.22-alpine") while go.mod requires Go 1.24.0; update the builder
image reference to the matching Go 1.24 variant (e.g., replace "FROM
golang:1.22-alpine" with the appropriate "golang:1.24-alpine" tag) so the
container toolchain matches go.mod and avoids runtime/toolchain downloads in CI.
In `@downs.txt`:
- Around line 48-50: downs.txt is missing rollback entries for two migrations
present in ups.txt; open downs.txt and insert the missing lines
"internal/repositories/postgres/migrations/065_grant_rbac_permissions" and
"internal/repositories/postgres/migrations/072_migrate_to_tenants" at the
appropriate positions so the sequence reads ...070_create_tenants,
065_grant_rbac_permissions, 072_migrate_to_tenants, 073_create_dns_zones
(matching the order in ups.txt), ensuring no duplicates and preserving file
ordering used by the migration tooling.
In `@internal/autoscaler/server.go`:
- Around line 111-134: In NodeGroupIncreaseSize, validate that req.Delta is a
positive integer before proceeding: check req.Delta > 0 at the start of the
handler (e.g., immediately after logging) and return a descriptive error if it
is zero or negative so scale-down requests cannot be routed through this
increase path; update error message to indicate invalid delta and reference
req.Delta and the NodeGroupIncreaseSize function in the change.
- Around line 149-177: The desired-size reconciliation decrements
targetGroup.CurrentSize by len(req.Nodes) even for nodes skipped due to
non-matching providerID and silently ignores errors from
UpdateNodeGroupWithContext; fix by counting only successfully terminated nodes
(increment a counter when s.client.TerminateInstance(providerID) returns nil)
and compute newSize := targetGroup.CurrentSize - successfulTerminations, and do
not swallow the UpdateNodeGroupWithContext error—check its returned error and
return it (or log and return) instead of assigning to _; key symbols: the loop
over req.Nodes, providerID, s.client.TerminateInstance, targetGroup, newSize,
and s.client.UpdateNodeGroupWithContext.
- Around line 270-283: Replace the generic fmt.Errorf("unimplemented") returns
in the stubbed RPC handlers NodeGroupTemplateNodeInfo, NodeGroupGetOptions,
PricingNodePrice, and PricingPodPrice with gRPC Unimplemented errors by
returning status.Error(codes.Unimplemented, "<message>") so clients receive the
proper gRPC status; also add the imports "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status" and ensure the error message clearly names the
unimplemented RPC (e.g., "NodeGroupTemplateNodeInfo method not implemented").
- Line 154: s.client.TerminateInstance(providerID) is called without passing the
request context so gRPC cancellations/deadlines aren't propagated; update the
call to use a context-aware API (either add and call
TerminateInstanceWithContext(ctx, providerID) on sdk.Client following the
GetInstanceWithContext(ctx, ...) pattern in pkg/sdk/compute.go, or call an
existing context-taking variant) and ensure the server method uses that new
TerminateInstanceWithContext symbol instead of TerminateInstance so the ctx
parameter is propagated to the blocking termination call.
In `@internal/core/services/cluster.go`:
- Around line 130-137: The Create + AddNodeGroup sequence is not atomic: if
s.repo.Create(ctx, cluster) succeeds but s.repo.AddNodeGroup(ctx,
&cluster.NodeGroups[0]) fails you leave a persisted cluster without its default
node group; update the code to run both operations inside a single repository
transaction (e.g. use the repo's Begin/RunInTx/WithTx/Commit/Rollback API) so
that you call Create and AddNodeGroup using the transactional repo handle and
rollback on any error, returning the wrapped error as before; reference the
s.repo.Create, s.repo.AddNodeGroup and cluster.NodeGroups[0] symbols when making
the change.
- Around line 417-427: Validate and enforce node-group sizing invariants in
AddNodeGroup and UpdateNodeGroup: before constructing or persisting
domain.NodeGroup (the block that sets MinSize, MaxSize,
CurrentSize/DesiredSize), check that MinSize, MaxSize and Desired/Current are
non-negative, that MinSize <= MaxSize, and that Desired/Current is between
MinSize and MaxSize; if any check fails return a validation error and do not
persist. Apply the same checks to the other node-group creation/update block
referenced (the later block around the UpdateNodeGroup flow) so both creation
and updates enforce the same invariants.
- Around line 433-437: The repo.Update call results are being discarded (e.g.,
`_ = s.repo.Update(ctx, cluster)` after mutating cluster.WorkerCount and
similarly at the other location); replace these with proper error handling: call
`s.repo.Update(ctx, cluster)`, check the returned error, revert the in-memory
`cluster.WorkerCount` change if the update fails, and propagate or return the
error to the caller (do the same for the second occurrence around lines
472-476). Ensure you reference `cluster.WorkerCount`, `params.DesiredSize`, and
`s.repo.Update` when making the change so persistence failures do not leave
cluster and node-group state inconsistent.
- Around line 490-503: The current delete path always proceeds even if no node
group with the given name exists; update the logic in the delete handler (the
loop over cluster.NodeGroups and subsequent calls to s.repo.DeleteNodeGroup and
s.repo.Update) to first detect whether any ng with ng.Name == name exists, and
if not return a NotFound error instead of continuing; only subtract
ng.CurrentSize from cluster.WorkerCount and call s.repo.DeleteNodeGroup /
s.repo.Update when the node group was found. Use the same NotFound error
type/constructor your repo/service stack uses so callers get a proper not-found
response.
In `@internal/repositories/k8s/provisioner.go`:
- Line 218: The worker provision config uses ImageName: "ubuntu-22.04" which is
inconsistent with the control-plane image "ubuntu:22.04" and will likely fail
image resolution; update the worker provision setting (the ImageName field in
the provisioner code) to use the same canonical image string "ubuntu:22.04" to
match control-plane provisioning so both use the identical image reference.
- Around line 226-239: The code currently swallows worker launch errors and
ignores persistence errors; update the error handling in the worker creation
block so that when the create call returns err you record it and do not silently
continue (use p.logger.Error with the actual error and return or aggregate the
error), and when calling p.repo.AddNode(ctx, node) check its returned error
instead of assigning to _ — log and return or propagate that error (referencing
workerName, p.logger.Error, node, p.repo.AddNode, and domain.ClusterNode) so
provisioning failures and persistence failures surface instead of being ignored.
In `@internal/repositories/k8s/templates/control_plane.yaml`:
- Around line 344-346: Replace the hardcoded CLOUD_API_URL value in
control_plane.yaml with a templated/parameterized reference so different DNS
layouts can supply the endpoint at deploy time; update the environment variable
entry for CLOUD_API_URL (the env name CLOUD_API_URL in control_plane.yaml) to
read from a ConfigMap/Secret or a Helm/template value (e.g., use a template
placeholder or valueFrom reference) and ensure any deployment code that renders
this template provides the correct cluster API URL (or falls back to a sensible
default) so bootstrap no longer relies on
"https://thecloud-api.kube-system.svc.cluster.local:443".
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 116-120: The loop that populates c.NodeGroups is suppressing
errors from r.GetNodeGroups (call to GetNodeGroups) which can return partial
cluster data; instead check the returned err and propagate it (or wrap with
context) rather than ignoring — when r.GetNodeGroups(ctx, c.ID) returns a
non-nil error return that error from the enclosing function (or collect and wrap
it with details like cluster ID) so NodeGroups is never silently missing; update
the code around GetNodeGroups and c.NodeGroups to perform an explicit err != nil
check and return the error.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 16-29: The index creation and backfill aren't idempotent: change
the index DDL to use "CREATE INDEX IF NOT EXISTS
idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id)" and make
the backfill INSERT into cluster_node_groups idempotent by only inserting a
'default-pool' for clusters that don't already have one (e.g., use INSERT ...
SELECT ... FROM clusters WHERE NOT EXISTS (SELECT 1 FROM cluster_node_groups cng
WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool')), keeping the
same columns (id via gen_random_uuid(), cluster_id, name, instance_type,
min_size, max_size, current_size) and the existing size logic (CASE WHEN
worker_count > 10 THEN worker_count ELSE 10 END).
In `@k8s/autoscaler/deployment.yaml`:
- Around line 45-63: Add resource requests and limits for the container named
"thecloud-bridge" in the Deployment spec: under the container entry for
thecloud-bridge add a resources block setting resources.requests (cpu and
memory) and resources.limits (cpu and memory) to sensible values for your
workload (e.g., minimal and maximum values to prevent eviction and overcommit).
Ensure the resources block is added alongside existing env and ports for the
container so Kubernetes uses them for scheduling and OOM eviction decisions.
- Around line 21-63: Add a pod/container securityContext to harden both
containers: for the cluster-autoscaler and thecloud-bridge containers, set
securityContext with runAsNonRoot: true and a non-zero runAsUser, set
allowPrivilegeEscalation: false (and optionally readOnlyRootFilesystem: true and
drop all capabilities) so neither container runs as root or can escalate
privileges; apply these securityContext blocks under each container spec (or at
pod level if appropriate) to enforce the change for the cluster-autoscaler and
thecloud-bridge containers.
In `@k8s/autoscaler/rbac.yaml`:
- Around line 78-80: The current RBAC rule grants "get/update/patch/delete" on
ConfigMaps including "cluster-id"; remove "cluster-id" from that broader rule's
resourceNames and add a separate rule that grants only read verbs for the
specific ConfigMap name "cluster-id" (e.g., verbs: ["get"] or ["get","list"] as
needed). Update the rules so the existing rule keeps update/patch/delete for
other ConfigMaps while a new rule targets resourceNames: ["cluster-id"] with
read-only verbs to enforce least privilege.
In `@pkg/sdk/kubernetes.go`:
- Around line 201-213: The node group name is inserted raw into request paths in
UpdateNodeGroupWithContext and DeleteNodeGroup, which can break URLs when name
contains reserved characters; fix by path-escaping the name (e.g., use
url.PathEscape or equivalent) before concatenation in both c.putWithContext and
c.delete calls so the path becomes clustersPath + "/" + clusterID +
"/nodegroups/" + PathEscape(name).
---
Minor comments:
In `@docs/kubernetes.md`:
- Around line 119-121: The section header "Limitations (MVP)" currently lists
supported features ("Storage: Dynamic storage provisioning (CSI)" and "High
Availability: ... via the `--ha` flag") instead of constraints; either rename
the header to "Features (MVP)" or keep "Limitations (MVP)" and replace those
bullets with real limitations (for example: lack of automated backups, no
cluster autoscaler, limited OS image support, manual certificate rotation, or
single-region control plane) so the text matches the header; update the header
text and/or replace the two bullets ("Storage" and "High Availability"
referencing `--ha`) accordingly.
In `@docs/services/kubernetes-autoscaler.md`:
- Line 38: The doc currently lists the scale-up endpoint as `POST
/clusters/{id}/nodegroups/{name}`, which conflicts with the create example;
update the wording to describe the actual scale/update endpoint (replace `POST
/clusters/{id}/nodegroups/{name}`) with the correct API verb and path for
resizing a nodegroup (for example `PATCH /clusters/{id}/nodegroups/{name}` or
the team-approved scale path such as `POST
/clusters/{id}/nodegroups/{name}/scale`), and ensure the text clearly states
this is a scale/update operation (not create) so it no longer conflicts with the
create example that uses `POST /clusters/{id}/nodegroups`.
In `@docs/swagger/docs.go`:
- Around line 9685-9698: The UpdateNodeGroupRequest schema currently allows an
empty object; modify its schema to require at least one updatable field by
adding a constraint (e.g., "minProperties": 1) to the
"httphandlers.UpdateNodeGroupRequest" schema in docs/swagger/docs.go so that
requests with no fields are rejected; reference the schema name
"httphandlers.UpdateNodeGroupRequest" and ensure the change is applied to the
JSON object describing that schema.
- Around line 9550-9572: The schema for httphandlers.NodeGroupRequest currently
allows negative/invalid sizes for desired_size, max_size, and min_size; update
the struct or swagger annotations for the NodeGroupRequest fields desired_size,
max_size, and min_size to include numeric bounds (e.g., add "minimum: 0" to each
and optionally a sensible "maximum" if desired) so the generated docs emit valid
range constraints (minimum: 0) for those integer properties.
In `@internal/handlers/cluster_handler.go`:
- Around line 380-460: The Swagger comments for AddNodeGroup, UpdateNodeGroup,
and DeleteNodeGroup are missing `@Security`; update the doc comments above each
handler (AddNodeGroup, UpdateNodeGroup, DeleteNodeGroup) to include the same
`@Security` directive used by other K8s handlers in the file (e.g., `@Security`
ApiKeyAuth or `@Security` BearerAuth — use the exact scheme name already present
elsewhere), placing it alongside the existing `@Summary`, `@Tags`, and `@Router` lines
so the generated API docs correctly show authentication requirements.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 244-252: After iterating rows into the groups slice you must check
cursor-level errors by calling rows.Err(); update the code after the for
rows.Next() loop (where rows, groups and domain.NodeGroup are used and rows.Scan
is invoked) to call if err := rows.Err(); err != nil { return nil,
errors.Wrap(errors.Internal, "row iteration error", err) } before returning
groups, nil so any underlying read errors are surfaced and wrapped consistently.
---
Nitpick comments:
In `@cmd/autoscaler-server/main.go`:
- Line 18: Replace the inline magic number 50051 used in flag.Int call by
introducing a named constant (e.g., defaultGRPCPort) at the top of the file and
use that constant in the call to flag.Int("port", defaultGRPCPort, ...); update
any related comments or usages referencing the hardcoded port to use the new
constant and ensure the constant is of type int to match flag.Int's default
value.
In `@Dockerfile.autoscaler-server`:
- Around line 20-28: The container currently runs as root; modify
Dockerfile.autoscaler-server to create a non-root user and group (e.g.,
"autoscaler"), chown the runtime binary
(/usr/local/bin/thecloud-autoscaler-server) and any needed app dirs to that
user, set a safe HOME if needed, and add a USER autoscaler (or UID/GID)
directive before the ENTRYPOINT so ENTRYPOINT
["/usr/local/bin/thecloud-autoscaler-server"] runs as the non-root user; ensure
you use Alpine-compatible commands (addgroup/adduser or adduser -S/addgroup -S)
and preserve executable permissions on the binary.
In `@docs/adr/ADR-020-kubernetes-cluster-autoscaler.md`:
- Line 29: Update naming so the ADR and implementation use the same canonical
bridge name: pick one of the existing identifiers (either "thecloud-bridge" or
"autoscaler-server"/"thecloud-autoscaler-server") and replace the others to
match; update ADR-020-kubernetes-cluster-autoscaler.md (the line containing
"thecloud-bridge") to the chosen canonical name and also update any
build/runtime artifacts that reference "autoscaler-server" or
"thecloud-autoscaler-server" so all references (ADR text, service names,
manifests, build scripts) consistently use the single selected identifier.
In `@docs/swagger/docs.go`:
- Around line 6937-6968: The schema exposes both node_groups (items -> "$ref":
"#/definitions/domain.NodeGroup") and worker_count which creates ambiguity;
update the schema for the cluster definition to mark worker_count as deprecated
and guide clients to use node_groups instead: add a deprecation indicator and a
clear description on the worker_count property (e.g., "DEPRECATED: use
node_groups for node pool counts; will be removed in next API version"), or
remove worker_count in the next major API version, ensuring any generated
docs/clients reflect the change.
- Around line 7874-7905: The OpenAPI schema for domain.NodeGroup currently has
all properties optional; add a "required" array to the domain.NodeGroup
definition listing the stable fields to make them mandatory (id, cluster_id,
name, min_size, max_size, current_size). Update the domain.NodeGroup object in
docs/swagger/docs.go to include the required property alongside the existing
"type" and "properties" keys so code generation tools will treat those fields as
required.
In `@docs/swagger/swagger.json`:
- Around line 6929-6935: The Cluster schema now exposes two scaling
representations—node_groups and worker_count—so update the domain.Cluster
definition to mark one as canonical and the other as deprecated: designate
"node_groups" as the canonical scaling field (or vice versa as your choice) and
add clear description text for both properties indicating the canonical field,
marking the other with "deprecated": true and a deprecation note pointing to the
canonical field; ensure you update the descriptions for "node_groups" and
"worker_count" in the swagger.json so clients know which to use and that the
deprecated field is retained only for backward compatibility.
- Around line 7866-7897: The NodeGroup OpenAPI schema allows negative integers
for size fields; update the domain.NodeGroup schema and the same schemas at the
other occurrences to add a non-negative constraint by adding "minimum": 0 to the
integer properties min_size, max_size, current_size and desired_size (where
present), keeping their type as "integer" (you can also add "format": "int32" if
desired) so invalid negative values are rejected at contract level.
- Around line 1218-1225: The OpenAPI responses for the node group create
endpoints only include a 201 success schema (referencing
definitions/domain.NodeGroup); add standard error response entries for 400, 404,
409, and 500 to each of these operation response objects (the same blocks that
currently contain the "201" response) and point each to your common error schema
(e.g., definitions/Error or definitions/domain.Error) with appropriate
descriptions like "Bad Request", "Not Found", "Conflict", and "Internal Server
Error" so generated clients can handle failures consistently; update the three
duplicate response blocks (the ones with the 201/domain.NodeGroup reference) to
include these four additional responses.
In `@internal/core/ports/compute_options.go`:
- Around line 13-14: The struct fields Metadata and Labels are optional but
missing JSON tags; update their definitions in the struct (fields named Metadata
and Labels in this file) to include json tags with omitempty (e.g.,
`json:"metadata,omitempty"` and `json:"labels,omitempty"`) so they are omitted
from serialized output when nil/empty.
In `@internal/repositories/postgres/cluster_repo_unit_test.go`:
- Around line 62-120: The three tests (GetByID, ListAll, ListByUserID) repeat
setup and use pgxmock; convert them to a single table-driven test loop that
iterates cases (name, setupMock, callFn, expectedLen) and replace pgxmock usage
with testify/mock-based mocks; for each case call NewClusterRepository with the
testify mock (or mock the lower-level DB interface used by
NewClusterRepository), configure expectations in setupMock (return
rows/NodeGroups for that case), then invoke repo.GetByID / repo.ListAll /
repo.ListByUserID accordingly and assert results (len of clusters and
cluster.NodeGroups). Ensure you reference the existing methods
NewClusterRepository, GetByID, ListAll, ListByUserID and the NodeGroups field
when writing expectations and assertions.
In `@internal/repositories/postgres/migrations/020_create_clusters.down.sql`:
- Around line 1-2: The second DROP uses broad CASCADE which can remove unrelated
objects; change the rollback to avoid CASCADE on the clusters drop—replace "DROP
TABLE IF EXISTS clusters CASCADE;" with "DROP TABLE IF EXISTS clusters;" (or
"DROP TABLE IF EXISTS clusters RESTRICT;") and, if necessary, explicitly drop
any known dependent objects (e.g., foreign keys or indexes that reference
clusters) instead of relying on CASCADE; refer to the DROP TABLE lines and the
table names cluster_nodes and clusters to locate the statements.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 7-13: Add DB-level CHECK constraints to the cluster node groups
table to enforce size invariants: add a check ensuring min_size <= max_size and
another ensuring current_size >= min_size AND current_size <= max_size (you can
also include min_size >= 1 if you want to match the DEFAULT). Add these
constraints to the CREATE TABLE for the table that defines min_size, max_size,
current_size (e.g., add CHECK (min_size <= max_size) and CHECK (current_size
BETWEEN min_size AND max_size) with descriptive constraint names like
chk_cluster_node_groups_min_le_max and
chk_cluster_node_groups_current_in_range).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
go.sum is excluded by !**/*.sum
internal/autoscaler/protos/externalgrpc.pb.go is excluded by !**/*.pb.go
internal/autoscaler/protos/externalgrpc_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (32)
Dockerfile.autoscaler-server, Makefile, cmd/autoscaler-server/main.go, docs/adr/ADR-020-kubernetes-cluster-autoscaler.md, docs/database.md, docs/kubernetes.md, docs/services/kubernetes-autoscaler.md, docs/swagger/docs.go, docs/swagger/swagger.json, docs/swagger/swagger.yaml, downs.txt, go.mod, internal/api/setup/router.go, internal/autoscaler/server.go, internal/core/domain/cluster.go, internal/core/ports/cluster.go, internal/core/ports/compute_options.go, internal/core/services/cluster.go, internal/handlers/cluster_handler.go, internal/repositories/k8s/provisioner.go, internal/repositories/k8s/templates/control_plane.yaml, internal/repositories/postgres/cluster_repo.go, internal/repositories/postgres/cluster_repo_unit_test.go, internal/repositories/postgres/migrations/020_create_clusters.down.sql, internal/repositories/postgres/migrations/098_create_cluster_node_groups.down.sql, internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql, k8s/autoscaler/cluster-id-configmap.yaml, k8s/autoscaler/configmap.yaml, k8s/autoscaler/deployment.yaml, k8s/autoscaler/rbac.yaml, pkg/sdk/kubernetes.go, ups.txt
Pull request overview
Copilot reviewed 44 out of 46 changed files in this pull request and generated 11 comments.
func (s *ClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	ng := &domain.NodeGroup{
		ID:           uuid.New(),
		ClusterID:    cluster.ID,
		Name:         params.Name,
		InstanceType: params.InstanceType,
		MinSize:      params.MinSize,
		MaxSize:      params.MaxSize,
		CurrentSize:  params.DesiredSize,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}
Node group create/update paths don’t validate scaling invariants (e.g., min_size <= desired_size <= max_size, non-negative sizes, and non-empty instance_type). As-is, invalid inputs can be persisted (or cause DB constraint errors) and can make autoscaler decisions incorrect. Add explicit validation in the service layer before writing to the repo.
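A minimal sketch of such a service-layer guard, run before any repository write. The PR later adds a three-argument validateNodeGroupSizing; this sketch also folds in the instance_type check the comment mentions, so the name and signature here are illustrative:

```go
package services

import "fmt"

// validateNodeGroupSizing enforces the invariants called out above; error
// wording and the exact argument list are assumptions.
func validateNodeGroupSizing(minSize, maxSize, desiredSize int, instanceType string) error {
	if instanceType == "" {
		return fmt.Errorf("instance_type must not be empty")
	}
	if minSize < 0 || maxSize < 0 || desiredSize < 0 {
		return fmt.Errorf("sizes must be non-negative")
	}
	if minSize > maxSize {
		return fmt.Errorf("min_size (%d) must not exceed max_size (%d)", minSize, maxSize)
	}
	if desiredSize < minSize || desiredSize > maxSize {
		return fmt.Errorf("desired_size (%d) must be within [%d, %d]", desiredSize, minSize, maxSize)
	}
	return nil
}
```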
func (s *AutoscalerServer) NodeGroupIncreaseSize(ctx context.Context, req *protos.NodeGroupIncreaseSizeRequest) (*protos.NodeGroupIncreaseSizeResponse, error) {
	klog.Infof("Request to increase size of node group %s by %d", req.Id, req.Delta)

	cluster, err := s.client.GetClusterWithContext(ctx, s.clusterID)
	if err != nil {
		return nil, err
NodeGroupIncreaseSize doesn’t validate that req.Delta is positive. If a negative/zero delta is ever sent (misbehaving client or future CA change), newSize could shrink or no-op via the “increase” RPC, which violates the External Provider contract. Add a guard (e.g., require Delta > 0) and return an InvalidArgument-style error when it isn’t.
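A minimal sketch of that guard, as a fragment for the top of the handler. It assumes the bridge imports google.golang.org/grpc/status and codes, which the status-code fix suggested further below would introduce anyway:

```go
// Reject non-positive deltas before calling the Cloud API; Delta is an int32
// in the Cluster Autoscaler external gRPC provider proto.
if req.Delta <= 0 {
	return nil, status.Errorf(codes.InvalidArgument, "delta must be positive, got %d", req.Delta)
}
```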
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
name VARCHAR(255) NOT NULL,
instance_type VARCHAR(50) NOT NULL,
min_size INTEGER NOT NULL DEFAULT 1,
max_size INTEGER NOT NULL DEFAULT 10,
current_size INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
The documented cluster_node_groups schema doesn’t match the actual migration: this doc shows id UUID PRIMARY KEY DEFAULT gen_random_uuid() and created_at/updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), but the migration defines id UUID PRIMARY KEY (no default) and timestamps without NOT NULL using CURRENT_TIMESTAMP. Please align the documentation with the real schema (or update the migration to match the documented defaults/constraints).
Suggested change:
-id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
-name VARCHAR(255) NOT NULL,
-instance_type VARCHAR(50) NOT NULL,
-min_size INTEGER NOT NULL DEFAULT 1,
-max_size INTEGER NOT NULL DEFAULT 10,
-current_size INTEGER NOT NULL DEFAULT 1,
-created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+id UUID PRIMARY KEY,
+cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
+name VARCHAR(255) NOT NULL,
+instance_type VARCHAR(50) NOT NULL,
+min_size INTEGER NOT NULL DEFAULT 1,
+max_size INTEGER NOT NULL DEFAULT 10,
+current_size INTEGER NOT NULL DEFAULT 1,
+created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
	return nil, fmt.Errorf("node group %s not found", req.Id)
}
The gRPC handlers return plain fmt.Errorf(...) for “not found”, “invalid delta”, and “unimplemented” cases. That will surface to clients as gRPC Unknown errors, while Cluster Autoscaler expects specific status codes (e.g., codes.NotFound, codes.InvalidArgument, codes.Unimplemented) for correct behavior and retries. Prefer returning status.Error/ status.Errorf with appropriate codes.* in these paths.
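A hedged sketch of status-code-aware errors for these paths; the helper names are illustrative, and the handlers could just as well call status.Errorf inline:

```go
package autoscaler

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNodeGroupNotFound maps a missing node group to codes.NotFound so the
// upstream Cluster Autoscaler can distinguish it from transient failures.
func errNodeGroupNotFound(id string) error {
	return status.Errorf(codes.NotFound, "node group %s not found", id)
}

// errUnimplemented marks optional External Provider RPCs the bridge does not
// support, instead of returning a plain error (which surfaces as Unknown).
func errUnimplemented(rpc string) error {
	return status.Errorf(codes.Unimplemented, "%s is not implemented", rpc)
}
```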
1. A pod is in `Pending` status due to CPU/Memory constraints.
2. Upstream CA identifies that a Node Group can satisfy the requirement.
3. Upstream CA calls `NodeGroupIncreaseSize` via gRPC.
4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
The “Scaling Up Flow” step 4 documents POST /clusters/{id}/nodegroups/{name}, but the API routes implemented in this PR are POST /clusters/{id}/nodegroups (create) and PUT /clusters/{id}/nodegroups/{name} (update desired size). Please update the doc to reflect the actual endpoint used for scale-up (likely the PUT update).
Suggested change:
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls `PUT /clusters/{id}/nodegroups/{name}` on The Cloud API to update the node group's desired size.
var groups []domain.NodeGroup
for rows.Next() {
	var ng domain.NodeGroup
	if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
		return nil, errors.Wrap(errors.Internal, "failed to scan node group", err)
	}
	groups = append(groups, ng)
}
return groups, nil
GetNodeGroups doesn’t check rows.Err() after iterating. If an error occurs during row iteration, it may be dropped and returned as a partial result without error. Add a if err := rows.Err(); err != nil { ... } check before returning.
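A minimal sketch of the missing check, placed between the rows.Next() loop and the final return; errors.Wrap here is the project helper already used in the snippet above, and the message text is illustrative:

```go
if err := rows.Err(); err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to iterate node groups", err)
}
return groups, nil
```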
workerInst, err := p.instSvc.LaunchInstanceWithOptions(ctx, ports.CreateInstanceOptions{
	Name:      workerName,
	ImageName: "ubuntu-22.04",
	NetworkID: cluster.VpcID.String(),
	UserData:  userData,
	Metadata: map[string]string{
		"thecloud.io/cluster-id": cluster.ID.String(),
		"thecloud.io/node-group": ng.Name,
	},
})
Worker provisioning now sets instance metadata (thecloud.io/cluster-id, thecloud.io/node-group) via CreateInstanceOptions.Metadata, but InstanceService.LaunchInstanceWithOptions currently does not persist opts.Metadata/opts.Labels into the instance record or provisioning job. As a result, instances launched here won’t carry the metadata needed by the autoscaler bridge (which filters on instance metadata). Either extend LaunchInstanceWithOptions to propagate metadata/labels, or set metadata via a separate API call after launch.
// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
CreateCluster persists the cluster row and then inserts the default node group in a separate call. If AddNodeGroup fails, the API returns an error but leaves a cluster record without its required default node group, which can break subsequent provisioning/autoscaler flows. Consider wrapping both inserts in a single DB transaction (or rolling back the cluster record on failure).
Suggested change:
-// Persist Node Group
-if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+// Persist Node Group. If this fails, attempt to roll back the cluster record
+// to avoid leaving a cluster without its required default node group.
+if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+	if rollbackErr := s.repo.Delete(ctx, cluster.ID); rollbackErr != nil {
+		s.logger.Error("failed to rollback cluster after node group creation failure", "cluster_id", cluster.ID, "error", rollbackErr)
+	}
if providerID == node.ProviderID {
	continue
}
if err := s.client.TerminateInstance(providerID); err != nil {
NodeGroupDeleteNodes terminates instances using TerminateInstance, which does not accept a context (it uses the SDK’s non-context delete). This prevents request cancellation/timeouts from being honored during scale-down operations. Consider adding/using a context-aware SDK method for instance termination and using it here so CA requests can be canceled cleanly.
Suggested change:
-if err := s.client.TerminateInstance(providerID); err != nil {
+if err := s.client.TerminateInstanceWithContext(ctx, providerID); err != nil {
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
	var resp Response[any]
	return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
The new NodeGroup SDK methods don’t follow the established SDK pattern of providing both context-aware and non-context variants. AddNodeGroup/DeleteNodeGroup have no WithContext equivalents, while UpdateNodeGroupWithContext has no non-context wrapper. This makes it hard for callers (like controllers) to enforce timeouts/cancellation consistently. Consider adding AddNodeGroupWithContext/DeleteNodeGroupWithContext and a non-context UpdateNodeGroup wrapper that delegates to the context variant (using context.Background()).
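A sketch of the consistent method set the comment asks for; postWithContext and deleteWithContext are assumed to exist alongside the putWithContext helper used above, and may be named differently in the actual client:

```go
// AddNodeGroupWithContext mirrors AddNodeGroup with cancellation support.
func (c *Client) AddNodeGroupWithContext(ctx context.Context, clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.postWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroup is the non-context convenience wrapper.
func (c *Client) UpdateNodeGroup(clusterID, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	return c.UpdateNodeGroupWithContext(context.Background(), clusterID, name, input)
}

// DeleteNodeGroupWithContext mirrors DeleteNodeGroup with cancellation support.
func (c *Client) DeleteNodeGroupWithContext(ctx context.Context, clusterID, name string) error {
	var resp Response[any]
	return c.deleteWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
```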
- Hardened Autoscaler Deployment (limits, non-root, read-only FS)
- Consolidated redundant cluster migrations and fixed idempotency
- Tightened RBAC for cluster-id ConfigMap
- Improved SDK robustness with path-escaping for node group names
- Fixed documentation inconsistencies and updated Swagger constraints
- Added rows.Err() checks in ClusterRepository

- Bind gRPC listener to loopback by default
- Update builder image to Go 1.24
- Add pod and container security contexts (non-root, read-only FS)
- Standardize bridge container name to thecloud-bridge

- Add TerminateInstanceWithContext to SDK
- Propagate context to instance termination calls in autoscaler
- Validate increase delta and track successful terminations
- Use proper gRPC status codes for unimplemented methods

…tence errors
- Add validation for MinSize, MaxSize, and DesiredSize
- Revert in-memory state on repository update failures
- Ensure existence check in node group deletion path

- Use canonical ubuntu:22.04 image for workers
- Parameterize CLOUD_API_URL in control plane template
- Aggregate and report errors during worker provisioning

- Convert repository tests to table-driven format
- Check errors from GetNodeGroups in cluster list operations
- Add missing rows.Err() checks
Pull request overview
Copilot reviewed 51 out of 53 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
internal/core/services/cluster.go:288
- ScaleCluster only updates cluster.WorkerCount, but the k8s provisioner’s Scale() now provisions based on cluster.NodeGroups when present. Since NodeGroups aren’t updated here, scaling via /clusters/{id}/scale will no longer change the number of workers. Either keep the default node group in sync with WorkerCount (update current_size) or change scaling to operate on node groups directly.
// ScaleCluster adjusts the number of worker nodes in the cluster.
func (s *ClusterService) ScaleCluster(ctx context.Context, id uuid.UUID, workers int) error {
cluster, err := s.GetCluster(ctx, id)
if err != nil {
return err
}
if workers < 1 {
return errors.New(errors.InvalidInput, "at least 1 worker required")
}
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount = workers
cluster.UpdatedAt = time.Now()
if err := s.repo.Update(ctx, cluster); err != nil {
cluster.WorkerCount = oldWorkerCount
return errors.Wrap(errors.Internal, "failed to update cluster record during scaling", err)
}
go func() {
bgCtx := context.Background()
bgCtx = appcontext.WithUserID(bgCtx, cluster.UserID)
s.logger.Info("starting cluster scale", "cluster_id", cluster.ID, "new_workers", workers)
if err := s.provisioner.Scale(bgCtx, cluster); err != nil {
s.logger.Error("cluster scaling failed", "cluster_id", cluster.ID, "error", err)
return
}
s.logger.Info("cluster scaling completed", "cluster_id", cluster.ID)
}()
return nil
func (s *ClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	// Validate node group sizing invariants
	if err := s.validateNodeGroupSizing(params.MinSize, params.MaxSize, params.DesiredSize); err != nil {
		return nil, err
	}

	ng := &domain.NodeGroup{
		ID:           uuid.New(),
		ClusterID:    cluster.ID,
		Name:         params.Name,
		InstanceType: params.InstanceType,
		MinSize:      params.MinSize,
		MaxSize:      params.MaxSize,
		CurrentSize:  params.DesiredSize,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}

	if err := s.repo.AddNodeGroup(ctx, ng); err != nil {
		return nil, err
	}

	// If desired size > 0, we should trigger a scale operation for this specific group.
	// For now, we will update the global worker count to reflect the addition.
	oldWorkerCount := cluster.WorkerCount
	cluster.WorkerCount += params.DesiredSize
	if err := s.repo.Update(ctx, cluster); err != nil {
		cluster.WorkerCount = oldWorkerCount
		return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
	}

	return ng, nil
}

func (s *ClusterService) UpdateNodeGroup(ctx context.Context, clusterID uuid.UUID, name string, params ports.UpdateNodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	var targetGroup *domain.NodeGroup
	for i := range cluster.NodeGroups {
		if cluster.NodeGroups[i].Name == name {
			targetGroup = &cluster.NodeGroups[i]
			break
		}
	}

	if targetGroup == nil {
		return nil, errors.New(errors.NotFound, "node group not found")
	}

	newMin := targetGroup.MinSize
	newMax := targetGroup.MaxSize
	newDesired := targetGroup.CurrentSize

	if params.MinSize != nil {
		newMin = *params.MinSize
	}
	if params.MaxSize != nil {
		newMax = *params.MaxSize
	}
	if params.DesiredSize != nil {
		newDesired = *params.DesiredSize
	}

	// Validate node group sizing invariants
	if err := s.validateNodeGroupSizing(newMin, newMax, newDesired); err != nil {
		return nil, err
	}

	oldMin, oldMax, oldDesired := targetGroup.MinSize, targetGroup.MaxSize, targetGroup.CurrentSize
	targetGroup.MinSize = newMin
	targetGroup.MaxSize = newMax

	oldWorkerCount := cluster.WorkerCount
	if params.DesiredSize != nil {
		diff := newDesired - oldDesired
		cluster.WorkerCount += diff
		targetGroup.CurrentSize = newDesired
	}

	if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
		return nil, err
	}

	if err := s.repo.Update(ctx, cluster); err != nil {
		// Rollback in-memory changes on persistence failure
		targetGroup.MinSize = oldMin
		targetGroup.MaxSize = oldMax
		targetGroup.CurrentSize = oldDesired
		cluster.WorkerCount = oldWorkerCount
		return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after updating node group", err)
	}

	return targetGroup, nil
}
AddNodeGroup/UpdateNodeGroup adjust desired/current sizes in the DB but don’t trigger any provisioning action (e.g., provisioner.Scale or a background job) to actually create/remove worker nodes. This means cluster-autoscaler scale-up requests will only update state without adding nodes. Consider invoking scaling/reconciliation when DesiredSize changes (at least for scale-up).
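A hedged sketch of wiring a scale-up trigger into UpdateNodeGroup, reusing the background pattern ScaleCluster already uses above; the exact placement and wiring are assumptions based on the snippets in this PR:

```go
// After persisting the new desired size, kick off reconciliation in the
// background so the Cluster Autoscaler's scale-up actually creates nodes.
if params.DesiredSize != nil && newDesired != oldDesired {
	go func() {
		bgCtx := appcontext.WithUserID(context.Background(), cluster.UserID)
		if err := s.provisioner.Scale(bgCtx, cluster); err != nil {
			s.logger.Error("node group scaling failed", "cluster_id", cluster.ID, "node_group", name, "error", err)
		}
	}()
}
```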
// Atomically create cluster and its default node group
if err := s.repo.Create(ctx, cluster); err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
}

// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
	// Ideally this should be in a transaction with repo.Create
The comment says this is an atomic create, but repo.Create is executed separately from creating the default node group. If AddNodeGroup fails, the DB can end up with a cluster row but no node groups. Consider wrapping both writes in a single DB transaction (or adding compensation logic).
Suggested change:
-// Atomically create cluster and its default node group
-if err := s.repo.Create(ctx, cluster); err != nil {
-	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
-}
-// Persist Node Group
-if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
-	// Ideally this should be in a transaction with repo.Create
+// Create cluster record; default node group is added in a separate step (not atomic)
+if err := s.repo.Create(ctx, cluster); err != nil {
+	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
+}
+// Persist Node Group (note: this is not wrapped in a DB transaction with repo.Create)
+if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+	// TODO: ideally this should be done in a single DB transaction with repo.Create
InstanceType string `json:"instance_type"`
MinSize      int    `json:"min_size" binding:"gte=0"`
MaxSize      int    `json:"max_size" binding:"gte=0"`
DesiredSize  int    `json:"desired_size" binding:"gte=0"`
NodeGroupRequest leaves InstanceType/MinSize/MaxSize/DesiredSize optional, so omitted fields default to 0/"". That can violate DB constraints (instance_type is NOT NULL; current_size must be between min/max) and return a 500 instead of a 400. Consider adding binding validation (e.g., require instance_type, min_size, max_size, desired_size) and/or applying sane defaults before calling the service.
Suggested change:
-InstanceType string `json:"instance_type"`
-MinSize      int    `json:"min_size" binding:"gte=0"`
-MaxSize      int    `json:"max_size" binding:"gte=0"`
-DesiredSize  int    `json:"desired_size" binding:"gte=0"`
+InstanceType string `json:"instance_type" binding:"required"`
+MinSize      int    `json:"min_size" binding:"required,gte=0"`
+MaxSize      int    `json:"max_size" binding:"required,gte=0"`
+DesiredSize  int    `json:"desired_size" binding:"required,gte=0"`
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status", "cluster-id"]
  verbs: ["get", "update", "patch", "delete"]
---
In the rendered autoscaler RBAC, the Role grants get/update/patch/delete on both configmaps "cluster-autoscaler-status" and "cluster-id". The autoscaler typically only needs read access to the cluster-id configmap; granting delete/update is unnecessarily permissive. Consider splitting the rules (status configmap writable, cluster-id read-only) to match least privilege.
internal/repositories/postgres/migrations/034_create_accounting_table
internal/repositories/postgres/migrations/035_create_images_table
internal/repositories/postgres/migrations/036_add_volume_backend_path
internal/repositories/postgres/migrations/037_create_clusters_table
This list references internal/repositories/postgres/migrations/037_create_clusters_table, but that migration has been removed (no longer present under internal/repositories/postgres/migrations). If these *.txt files are intended to drive tooling, this stale entry will cause confusion/breakage; otherwise consider removing these unused lists entirely.
Suggested change (delete the stale entry):
-internal/repositories/postgres/migrations/037_create_clusters_table
var provisioningErrors []string
for _, ng := range cluster.NodeGroups {
	for i := 0; i < ng.CurrentSize; i++ {
		workerName := fmt.Sprintf("%s-%s-%d", cluster.Name, ng.Name, i)
		workerInst, err := p.instSvc.LaunchInstanceWithOptions(ctx, ports.CreateInstanceOptions{
			Name:      workerName,
			ImageName: "ubuntu:22.04", // Use canonical image name consistent with control-plane
			NetworkID: cluster.VpcID.String(),
			UserData:  userData,
			Metadata: map[string]string{
				"thecloud.io/cluster-id": cluster.ID.String(),
				"thecloud.io/node-group": ng.Name,
			},
		})
provisionWorkers() unconditionally loops from 0..CurrentSize-1 and launches instances, without checking how many worker nodes already exist for this cluster/node group. If this is used for scaling or retries, it can attempt to recreate existing nodes (name collisions) or over-provision. Consider reconciling against repo.GetNodes / actual instances and only creating the delta needed for each node group.
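A hedged sketch of reconciling against existing workers before launching; countWorkers and launchWorker are hypothetical helpers standing in for however the provisioner derives current state (e.g., repo.GetNodes or the instance metadata filter):

```go
// Launch only the delta between the desired size and what already exists.
for _, ng := range cluster.NodeGroups {
	have := countWorkers(ctx, cluster.ID, ng.Name) // hypothetical helper
	for i := have; i < ng.CurrentSize; i++ {
		workerName := fmt.Sprintf("%s-%s-%d", cluster.Name, ng.Name, i)
		if err := launchWorker(ctx, cluster, ng, workerName); err != nil { // hypothetical helper
			provisioningErrors = append(provisioningErrors, err.Error())
		}
	}
}
```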
func (s *AutoscalerServer) NodeGroupNodes(ctx context.Context, req *protos.NodeGroupNodesRequest) (*protos.NodeGroupNodesResponse, error) {
	instances, err := s.client.ListInstancesWithContext(ctx)
	if err != nil {
		return nil, err
	}

	var protosInstances []*protos.Instance
	for _, inst := range instances {
		if inst.Metadata["thecloud.io/cluster-id"] == s.clusterID &&
			inst.Metadata["thecloud.io/node-group"] == req.Id {

			state := protos.InstanceStatus_instanceRunning
			if inst.Status == "PROVISIONING" || inst.Status == "STARTING" {
				state = protos.InstanceStatus_instanceCreating
			} else if inst.Status == "TERMINATING" || inst.Status == "STOPPING" {
				state = protos.InstanceStatus_instanceDeleting
			}

			protosInstances = append(protosInstances, &protos.Instance{
				Id: "thecloud://" + inst.ID,
				Status: &protos.InstanceStatus{
					InstanceState: state,
				},
			})
		}
	}

	return &protos.NodeGroupNodesResponse{
		Instances: protosInstances,
	}, nil
NodeGroupNodes is returning a local protosInstancesResponse type, but the gRPC interface requires *protos.NodeGroupNodesResponse. As written, this won’t compile / won’t satisfy protos.CloudProviderServer. Return the generated response type (and remove the custom protosInstancesResponse helper).
// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
	// Ideally this should be in a transaction with repo.Create
Creating the default node group happens after the cluster insert without any rollback if this step fails. Either create both in a transaction or delete the cluster record on failure to avoid leaving an unusable cluster (no node groups).
Suggested change:
-	// Ideally this should be in a transaction with repo.Create
+	// Attempt to roll back the previously created cluster to avoid leaving it without node groups.
+	if delErr := s.repo.Delete(ctx, cluster.ID); delErr != nil {
+		s.logger.Error("failed to rollback cluster after default node group creation failure", "cluster_id", cluster.ID, "error", delErr)
+	}
internal/repositories/postgres/migrations/035_create_images_table
internal/repositories/postgres/migrations/036_add_volume_backend_path
internal/repositories/postgres/migrations/037_create_clusters_table
internal/repositories/postgres/migrations/038_add_cluster_permissions
internal/repositories/postgres/migrations/053_add_cluster_network_isolation
This list references internal/repositories/postgres/migrations/037_create_clusters_table, but that migration has been removed (no longer present under internal/repositories/postgres/migrations). If these *.txt files are intended to drive tooling, this stale entry will cause confusion/breakage; otherwise consider removing these unused lists entirely.
Pull request overview
Copilot reviewed 51 out of 53 changed files in this pull request and generated 11 comments.
'default-pool',
'standard-1',
1,
CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
worker_count
The backfill INSERT can fail if any existing cluster has worker_count < 1 (e.g., 0): current_size will be set to worker_count while min_size is 1, violating the chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size). Consider clamping current_size to at least min_size (and adjusting max_size accordingly), or allowing min_size=0 for backfilled rows.
// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
}
AddNodeGroup updates two persisted resources (cluster_node_groups row and clusters.worker_count) without a transaction. If the cluster update fails after the node group insert succeeds, the system can end up in an inconsistent state. Consider using a transaction for both updates or stopping updates to the deprecated worker_count field and deriving it from node_groups instead.
Suggested change:
-// If desired size > 0, we should trigger a scale operation for this specific group.
-// For now, we will update the global worker count to reflect the addition.
-oldWorkerCount := cluster.WorkerCount
-cluster.WorkerCount += params.DesiredSize
-if err := s.repo.Update(ctx, cluster); err != nil {
-	cluster.WorkerCount = oldWorkerCount
-	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
-}
+// NOTE: We intentionally do not update cluster.WorkerCount here.
+// The worker_count field is deprecated and should be derived from node groups
+// rather than being maintained via separate updates that can become inconsistent.
if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
	return nil, err
}

if err := s.repo.Update(ctx, cluster); err != nil {
	// Rollback in-memory changes on persistence failure
	targetGroup.MinSize = oldMin
	targetGroup.MaxSize = oldMax
	targetGroup.CurrentSize = oldDesired
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after updating node group", err)
UpdateNodeGroup persists the node group update before updating the parent cluster’s worker_count. If the cluster update fails, the rollback only affects in-memory state and leaves the database with the node group updated but the cluster worker_count unchanged. This should be done transactionally (or avoid persisting worker_count at all if it’s deprecated).
clusterGroup.POST("/:id/repair", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.RepairCluster)
clusterGroup.POST("/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.ScaleCluster)
clusterGroup.PUT("/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.ScaleCluster)
clusterGroup.GET("/:id/health", httputil.Permission(svcs.RBAC, domain.PermissionClusterRead), handlers.Cluster.GetClusterHealth)
The HTTP method for /clusters/:id/scale was changed to PUT, but the Go SDK still sends POST requests to /clusters/{id}/scale (pkg/sdk/kubernetes.go). This will break SDK users unless the SDK (and its tests) are updated in the same change set or the API supports both methods temporarily.
internal/repositories/postgres/migrations/001_create_instances_table
internal/repositories/postgres/migrations/002_create_api_keys_table
internal/repositories/postgres/migrations/003_add_container_id
internal/repositories/postgres/migrations/004_create_objects_table
internal/repositories/postgres/migrations/005_add_ports
This file appears to be a manual migration listing, but it isn’t referenced anywhere in the repository (no code/doc references found). Keeping it will likely lead to drift as migrations change. Consider removing it (or generating it as part of tooling if it’s needed).
// NodeGroupRequest is the payload for adding/updating a node group.
type NodeGroupRequest struct {
	Name         string `json:"name" binding:"required"`
	InstanceType string `json:"instance_type"`
	MinSize      int    `json:"min_size" binding:"gte=0"`
	MaxSize      int    `json:"max_size" binding:"gte=0"`
	DesiredSize  int    `json:"desired_size" binding:"gte=0"`
}
instance_type is optional in the request payload, but the database schema requires cluster_node_groups.instance_type to be NOT NULL. This can turn a client input error into an internal error. Consider making instance_type required (binding:"required") and validating it in the service layer as InvalidInput.
containerGroup.GET("/deployments", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.ListDeployments)
containerGroup.GET("/deployments/:id", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.GetDeployment)
containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
The HTTP method for /containers/deployments/:id/scale was changed to PUT, but the Go SDK still uses POST for this endpoint (pkg/sdk/container.go + tests). This will break existing SDK clients unless the SDK is updated or the API supports both methods for a deprecation window.
Suggested change (keep POST available during a deprecation window):
-containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
+containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
+containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+url.PathEscape(name), input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
	var resp Response[any]
	return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+url.PathEscape(name), &resp)
}
SDK method set is inconsistent with the established context-support pattern (non-context method delegates to WithContext). AddNodeGroup and DeleteNodeGroup lack WithContext variants, while UpdateNodeGroup only exists as WithContext (no convenience wrapper). Consider adding AddNodeGroupWithContext/DeleteNodeGroupWithContext and a non-context UpdateNodeGroup wrapper for consistency.
	return nil, fmt.Errorf("node group %s not found", req.Id)
}
These handlers return plain fmt.Errorf for common conditions (e.g., node group not found, size out of range), which will be surfaced to the autoscaler as gRPC code=Unknown. Prefer returning status.Error/status.Errorf with appropriate codes (NotFound, FailedPrecondition/InvalidArgument) so upstream Cluster Autoscaler can react correctly.
externalgrpc.config: |
  service:
    address: "localhost:50051"
Repository convention prefers 127.0.0.1 over localhost for local connection addresses (e.g., internal/core/services/database.go). To align with that, consider using "127.0.0.1:50051" instead of "localhost:50051" in the externalgrpc config.
Actionable comments posted: 11
♻️ Duplicate comments (2)
internal/repositories/k8s/provisioner.go (1)
242-246: ⚠️ Potential issue | 🟠 Major
Handle AddNode failures with rollback and fail-state transition. At Line 242, if AddNode fails, the worker instance is already created but left running/orphaned. Also, at Line 250, returning a plain error skips failCluster, so status may remain inconsistent.
🛠️ Proposed fix
 if err := p.repo.AddNode(ctx, node); err != nil {
 	errMsg := fmt.Sprintf("failed to add worker node %s to repository: %v", workerName, err)
 	p.logger.Error(errMsg)
+	if termErr := p.instSvc.TerminateInstance(ctx, workerInst.ID.String()); termErr != nil {
+		p.logger.Error("failed to rollback worker instance after repository failure", "worker", workerName, "instance_id", workerInst.ID, "error", termErr)
+	}
 	provisioningErrors = append(provisioningErrors, errMsg)
+	continue
 }
 }
 }
 if len(provisioningErrors) > 0 {
-	return fmt.Errorf("worker provisioning encountered errors: %s", strings.Join(provisioningErrors, "; "))
+	return p.failCluster(
+		ctx,
+		cluster,
+		"worker provisioning encountered errors",
+		fmt.Errorf("%s", strings.Join(provisioningErrors, "; ")),
+	)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/k8s/provisioner.go` around lines 242 - 246, If p.repo.AddNode(ctx, node) fails, the newly created worker is orphaned and the function currently just logs and appends provisioningErrors; modify the error path in the AddNode block to perform a rollback by deleting/terminating the created cloud instance (use the provisioner cloud client method, e.g., p.cloud.DeleteInstance or p.cloud.TerminateInstance with the workerName), log the rollback result via p.logger, then call failCluster(ctx, clusterID) to move the cluster to a failed state and return the error (instead of only appending provisioningErrors). Ensure similar rollback + failCluster + return behavior is applied to the other failure path referenced around where provisioningErrors is used so status remains consistent.internal/core/services/cluster.go (1)
129-138: ⚠️ Potential issue | 🟠 Major
CreateCluster + default node-group persistence is still non-atomic.
Line 130 and Line 135 are separate writes. If the second fails, the call returns an error with a partially-created cluster persisted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/services/cluster.go` around lines 129 - 138, The CreateCluster flow is non-atomic because s.repo.Create(ctx, cluster) and s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]) are separate writes; if AddNodeGroup fails the cluster remains persisted. Fix by performing both operations in a single transactional unit (use your repository's transaction mechanism or add one) — e.g., run Create and AddNodeGroup inside a repo transaction or implement a single repo method (CreateClusterWithNodeGroup/CreateWithDefaultNodeGroup) that performs both writes atomically; alternatively, if transactions aren't available implement a compensating rollback by calling repo.Delete(ctx, cluster.ID) when AddNodeGroup fails. Ensure you update CreateCluster to call the transactional method (or perform rollback) so partial state cannot be left behind.
🧹 Nitpick comments (9)
internal/core/services/database.go (1)
377-381: Replace exporter port literals with named constants for consistency.
"9187"and"9104"are still magic values in this method; extracting them keeps the same maintainability gain as engine ports.♻️ Proposed refactor
const ( postgresDefaultPort = "5432" mysqlDefaultPort = "3306" postgresExporterImg = "prometheuscommunity/postgres-exporter" mysqlExporterImg = "prom/mysqld-exporter" + postgresExporterPort = "9187" + mysqlExporterPort = "9104" ) @@ - return postgresExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, "9187" + return postgresExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, postgresExporterPort @@ - return mysqlExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, "9104" + return mysqlExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, mysqlExporterPortAs per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/services/database.go` around lines 377 - 381, The returns in the exporter selection code currently use magic string literals "9187" and "9104" for ports; replace these with named constants (e.g., PostgresExporterPort and MySQLExporterPort) defined alongside other service/engine constants and use those constants in the return statements where postgresExporterImg and mysqlExporterImg are returned; ensure the constants are typed as strings (or format them where used) and update any tests or usages that reference the literal port strings to reference the new constant names.internal/workers/database_failover_worker_test.go (1)
128-129: Replace -1 with a named constant for test intent. Using a named constant keeps the unhealthy-port intent explicit and avoids magic numbers.
♻️ Small cleanup
+const unhealthyPrimaryPort = -1 + func TestDatabaseFailoverWorker(t *testing.T) { @@ primary := &domain.Database{ ID: primaryID, Name: "primary", Role: domain.RolePrimary, Status: domain.DatabaseStatusRunning, - Port: -1, // Invalid port to ensure DialTimeout fails + Port: unhealthyPrimaryPort, // Invalid port to ensure DialTimeout fails }As per coding guidelines,
**/*.go: Do not use magic numbers - use named constants instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/database_failover_worker_test.go` around lines 128 - 129, Replace the magic number -1 used for the Port field with a named constant (e.g., invalidPort or unhealthyPort) to make the test intent explicit; add the constant near the top of the test file or in the test's scope, then update the Port: -1 occurrence to Port: invalidPort (or your chosen name) so the unhealthy-port intent is clear when examining the test that configures the DialTimeout failure.k8s/autoscaler/deployment.yaml (1)
77-78: Hardcoded CLOUD_API_URL limits deployment flexibility. Unlike the templated control_plane.yaml (which uses
{{ .CloudAPIURL }}), this standalone manifest hardcodes the API URL. This works for the default in-cluster DNS layout but will break if the API service is deployed differently or accessed externally.Consider using a ConfigMap or making this value configurable at deployment time for consistency with the templated approach.
♻️ Proposed change using ConfigMap
- name: CLOUD_API_URL - value: "https://thecloud-api.kube-system.svc.cluster.local:443" + valueFrom: + configMapKeyRef: + name: cluster-autoscaler-config + key: cloud-api-urlThen add to
k8s/autoscaler/configmap.yaml:data: cloud-api-url: "https://thecloud-api.kube-system.svc.cluster.local:443"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/autoscaler/deployment.yaml` around lines 77 - 78, The manifest hardcodes CLOUD_API_URL in deployment.yaml which reduces flexibility; change it to source the value from a ConfigMap (e.g., create/modify k8s/autoscaler/configmap.yaml to include a cloud-api-url key) and update the Deployment to set the environment variable CLOUD_API_URL from that ConfigMap (use envFrom or valueFrom with configMapKeyRef) so the same {{ .CloudAPIURL }}-style configurability is available for CLOUD_API_URL.internal/repositories/k8s/mocks_test.go (1)
102-111: Make NodeGroup mock methods assertable with m.Called. These methods currently hardcode success, which weakens test signal for call expectations and error-path coverage.
As per coding guidelines, "Use `testify/mock` for creating mock objects in tests".♻️ Proposed refactor
-func (m *mockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil } +func (m *mockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} func (m *mockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { - return []domain.NodeGroup{}, nil + args := m.Called(ctx, clusterID) + if args.Get(0) == nil { + return nil, args.Error(1) + } + r0, _ := args.Get(0).([]domain.NodeGroup) + return r0, args.Error(1) } func (m *mockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { - return nil + return m.Called(ctx, ng).Error(0) } func (m *mockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { - return nil + return m.Called(ctx, clusterID, name).Error(0) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/k8s/mocks_test.go` around lines 102 - 111, Replace the hardcoded-success implementations in mockClusterRepo's methods (AddNodeGroup, GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) with testify/mock-style handlers: call m.Called(...) with the same arguments, extract returned values and errors from the call result, and return them (casting to the appropriate types, e.g., []domain.NodeGroup for GetNodeGroups and error for others). Ensure you pass ctx and the relevant parameters into m.Called so tests can assert calls and control return values via On/Return.internal/workers/cluster_worker_test.go (1)
64-73: Mock methods are stubs withoutm.Called()delegation.These NodeGroup mock methods return hardcoded values without delegating to
m.Called(), unlike other methods in this mock (e.g.,GetByID,ListAll). This is acceptable if current tests don't need to verify these calls or configure return values, but consider addingm.Called()delegation for consistency and future test flexibility.♻️ Optional: Add m.Called() for consistency with other mock methods
-func (m *MockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil } -func (m *MockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { - return []domain.NodeGroup{}, nil -} -func (m *MockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { - return nil -} -func (m *MockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { - return nil -} +func (m *MockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} +func (m *MockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { + args := m.Called(ctx, clusterID) + if args.Get(0) == nil { + return nil, args.Error(1) + } + r0, _ := args.Get(0).([]domain.NodeGroup) + return r0, args.Error(1) +} +func (m *MockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} +func (m *MockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { + return m.Called(ctx, clusterID, name).Error(0) +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/cluster_worker_test.go` around lines 64 - 73, The NodeGroup mock methods (MockClusterRepo.AddNodeGroup, GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) are hardcoded stubs; update each to delegate to m.Called(...) for consistency and test configurability by invoking m.Called with the same parameters (ctx, ng or clusterID/name) and returning the typed results (error and []domain.NodeGroup for GetNodeGroups) extracted from the returned arguments (using .Error(0) or .Get(0).([]domain.NodeGroup) as appropriate).internal/handlers/cluster_handler.go (1)
446-471: Consider validating the node group name parameter. The
name path parameter is used directly without validation. While the service layer may handle invalid names, consider adding basic validation (e.g., non-empty check) at the handler level for fail-fast behavior and clearer error messages.
♻️ Optional: Add name validation
func (h *ClusterHandler) UpdateNodeGroup(c *gin.Context) { id, err := uuid.Parse(c.Param("id")) if err != nil { httputil.Error(c, errors.New(errors.InvalidInput, errInvalidClusterID)) return } name := c.Param("name") + if name == "" { + httputil.Error(c, errors.New(errors.InvalidInput, "invalid node group name")) + return + } var req UpdateNodeGroupRequest🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler.go` around lines 446 - 471, In UpdateNodeGroup validate the path param `name` (the variable `name := c.Param("name")`) before calling the service: check it's non-empty (and optionally allowed format/length), and if invalid return a clear bad-request via httputil.Error with an InvalidInput error (create or reuse an error constant such as errInvalidNodeGroupName) so the handler fails fast with a descriptive message instead of forwarding an empty/invalid name to svc.UpdateNodeGroup.internal/handlers/cluster_handler_test.go (1)
570-630: Refactor NodeGroup handler tests into table-driven cases. Line 580 onward uses ad-hoc subtests; please switch this block to a table-driven layout (and include invalid-id / invalid-body / service-error variants in the table) to keep endpoint coverage scalable.
♻️ Suggested table-driven structure
func TestClusterHandlerNodeGroups(t *testing.T) { t.Parallel() svc, handler, r := setupClusterHandlerTest() clusterID := uuid.New() nodeGroupName := "test-pool" r.POST("/clusters/:id/nodegroups", handler.AddNodeGroup) r.PUT("/clusters/:id/nodegroups/:name", handler.UpdateNodeGroup) r.DELETE("/clusters/:id/nodegroups/:name", handler.DeleteNodeGroup) - t.Run("AddNodeGroup_Success", func(t *testing.T) { ... }) - t.Run("UpdateNodeGroup_Success", func(t *testing.T) { ... }) - t.Run("DeleteNodeGroup_Success", func(t *testing.T) { ... }) + tests := []struct { + name string + method string + path string + body []byte + setupMock func() + wantCode int + }{ + // add success/error/validation cases here + } + + for _, tc := range tests { + tc := tc + t.Run(tc.name, func(t *testing.T) { + if tc.setupMock != nil { + tc.setupMock() + } + w := httptest.NewRecorder() + req := httptest.NewRequest(tc.method, tc.path, bytes.NewBuffer(tc.body)) + r.ServeHTTP(w, req) + assert.Equal(t, tc.wantCode, w.Code) + }) + } }As per coding guidelines,
**/*_test.go: Use table-driven tests in test files.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler_test.go` around lines 570 - 630, Replace the ad-hoc subtests inside TestClusterHandlerNodeGroups with a table-driven loop: create a slice of test cases (structs) that include name, method, path, requestBody, mockSetup func (to set expectations on svc for AddNodeGroup/UpdateNodeGroup/DeleteNodeGroup), expectedStatus, and any flags for invalid-id/invalid-body/service-error; then iterate over the table calling r.ServeHTTP and assert the response. Keep and reuse the existing handlers (handler.AddNodeGroup, handler.UpdateNodeGroup, handler.DeleteNodeGroup) and the example params/expected domain.NodeGroup values from the current subtests inside each test case's mockSetup; for invalid-id/invalid-body cases set requestBody or path to invalid values and do not set successful svc expectations (or set svc to return an error for service-error cases). Ensure each case uses t.Run(tc.name, func(t *testing.T) { ... }) so mocks are asserted per-case.docs/swagger/docs.go (1)
1236-1299: Add explicit JSON media types for body-based operations. These operations take request bodies but don't declare
consumes/produces. Adding them makes Swagger 2 client generation more deterministic. Since this file is generated, update handler annotations and regenerate Swagger.
♻️ Suggested patch
"/clusters/{id}/nodegroups": { "post": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ], "/clusters/{id}/nodegroups/{name}": { "put": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ], "/containers/deployments/{id}/scale": { "put": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ],Also applies to: 1300-1363, 1583-1622
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 1236 - 1299, The generated Swagger operation for POST "/clusters/{id}/nodegroups" (request body schema httphandlers.NodeGroupRequest, response domain.NodeGroup) is missing explicit consumes/produces JSON media types; update the handler's Swagger annotations (the comments above the handler that reference httphandlers.NodeGroupRequest and domain.NodeGroup) to include consumes: application/json and produces: application/json, then re-run the Swagger generator to regenerate docs/swagger/docs.go; apply the same annotation fix and regeneration for the other body-based operations noted (the blocks around lines 1300-1363 and 1583-1622) so all request-body endpoints declare consumes/produces JSON.internal/repositories/postgres/cluster_repo.go (1)
116-121: Avoid N+1 node-group queries in list.
Line 117 triggers one extra query per cluster. This will add significant latency for users with many clusters; consider batching node-group fetches by cluster IDs and mapping results in-memory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo.go` around lines 116 - 121, The current loop calls r.GetNodeGroups(ctx, c.ID) per cluster causing an N+1 problem; instead, collect all cluster IDs from the clusters slice in the list function, call a batched fetch (add or use a new repository method like GetNodeGroupsByClusterIDs(ctx, []clusterIDs) that returns node groups with their ClusterID), then build an in-memory map[ClusterID][]NodeGroup and assign c.NodeGroups from that map for each cluster (replace the per-cluster r.GetNodeGroups calls and the ngs variable usage). Ensure the new method signature and mapping use the existing types and error handling (wrap errors similarly) so callers remain consistent.
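A hedged sketch of the batched fetch, assuming a pgx-style db handle on the repository and the cluster_node_groups columns from this PR's migration; the method name, query text, and UUID-slice binding are illustrative:

```go
// GetNodeGroupsByClusterIDs loads node groups for many clusters in one query
// and groups them by ClusterID so the list path can map them in memory.
func (r *ClusterRepository) GetNodeGroupsByClusterIDs(ctx context.Context, ids []uuid.UUID) (map[uuid.UUID][]domain.NodeGroup, error) {
	rows, err := r.db.Query(ctx,
		`SELECT id, cluster_id, name, instance_type, min_size, max_size, current_size, created_at, updated_at
		   FROM cluster_node_groups
		  WHERE cluster_id = ANY($1)`, ids)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	out := make(map[uuid.UUID][]domain.NodeGroup)
	for rows.Next() {
		var ng domain.NodeGroup
		if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
			return nil, err
		}
		out[ng.ClusterID] = append(out[ng.ClusterID], ng)
	}
	return out, rows.Err()
}
```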
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/swagger/docs.go`:
- Around line 9878-9894: The UpdateNodeGroupRequest schema currently permits an
empty object which allows no-op updates; update the
"httphandlers.UpdateNodeGroupRequest" JSON schema to require at least one field
by adding "minProperties": 1 (or alternatively list the updatable fields under
"required" if specific ones must always be present) so requests like {} are
rejected; modify the schema block that defines "UpdateNodeGroupRequest" to
include the "minProperties": 1 property adjacent to "type": "object" (or add a
"required" array containing "desired_size"/"max_size"/"min_size" if
appropriate).
- Around line 9867-9877: The OpenAPI schema for ScaleDeploymentRequest currently
allows negative integers for the replicas field; update the replicas property
inside the ScaleDeploymentRequest schema to enforce non-negative values by
adding a minimum constraint (e.g., "minimum": 0) so that replicas cannot be
negative and validation will reject negative scaling requests.
In `@docs/swagger/swagger.json`:
- Around line 9859-9868: The schema for httphandlers.ScaleDeploymentRequest
allows negative replicas; update the replicas property in the
httphandlers.ScaleDeploymentRequest definition to include a lower bound by
adding "minimum": 0 (so generated SDKs validate non-negative replica counts);
ensure the change is applied to the "replicas" property in that object schema.
In `@internal/core/services/cluster.go`:
- Around line 437-448: The AddNodeGroup flow mutates node-group then updates
cluster.WorkerCount in two separate writes (s.repo.AddNodeGroup and
s.repo.Update) which can leave state inconsistent if the second write fails;
wrap the related operations (s.repo.AddNodeGroup, adjusting cluster.WorkerCount
by params.DesiredSize, and s.repo.Update) in a single transactional boundary
(use your repo's transaction API / BeginTx/WithTx), perform both writes inside
the same transaction, and ensure rollback on any error so that either both the
node-group and cluster.WorkerCount persist or neither do; apply the same
transactional pattern to the other affected code paths (the ranges noted around
lines 501-512 and 542-550).
- Around line 115-123: The default NodeGroup construction sets CurrentSize :=
params.Workers without validating it against hard-coded bounds (MinSize/MaxSize
1..10) which can cause DB failures after the cluster row is created; update the
code that builds defaultNG (the domain.NodeGroup instance) to validate
params.Workers before any DB writes by introducing named constants (e.g.,
DefaultMinNodeGroupSize, DefaultMaxNodeGroupSize) instead of magic numbers, and
either clamp params.Workers into that range or return an error so you don't
persist an invalid NodeGroup; ensure validation happens prior to inserting the
cluster/NodeGroup and update any callers that assume the old behavior.
In `@internal/repositories/k8s/templates/control_plane.yaml`:
- Around line 309-352: The embedded autoscaler Pod spec is missing security
hardening and resource constraints: add a securityContext block to both
containers (cluster-autoscaler and thecloud-bridge) with
allowPrivilegeEscalation: false, readOnlyRootFilesystem: true and
capabilities.drop: ["ALL"], and add resource requests/limits for thecloud-bridge
container (CPU/memory) to match the values used in
k8s/autoscaler/deployment.yaml so both containers have consistent resource and
security settings.
In `@internal/repositories/noop/adapters.go`:
- Around line 221-229: The NoopClusterRepository is missing the NodeGroup
methods required by ports.ClusterRepository; add implementations for
AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup on the
NoopClusterRepository type that match the repository signatures and return
benign noop results (e.g., nil error and empty slice for GetNodeGroups) so the
repository conforms to the interface; ensure method names and parameter types
exactly match the ports.ClusterRepository signatures and use domain.NodeGroup
and uuid.UUID where appropriate.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 265-285: Update UpdateNodeGroup and DeleteNodeGroup to check the
Exec result for affected rows and return a not-found error when zero rows are
affected: capture the Exec return into a result (e.g., res, err :=
r.db.Exec(...)), check rows := res.RowsAffected() (or res.RowsAffected() via
CommandTag if your DB driver returns a CommandTag) and if rows == 0 return
errors.Wrap(errors.NotFound, "node group not found") instead of success; keep
the existing error wrapping for Exec failures and apply the same change in both
ClusterRepository.UpdateNodeGroup and ClusterRepository.DeleteNodeGroup.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 22-35: The backfill sets current_size = worker_count which can be
<= 0 and violate the chk_cluster_node_groups_current_in_range constraint; update
the INSERT SELECT to ensure current_size is at least 1 by replacing current_size
= worker_count with current_size = GREATEST(worker_count, 1) (and optionally
guard max_size similarly if you want to ensure it's >=1), referencing the
cluster_node_groups table, clusters table, worker_count column and the
chk_cluster_node_groups_current_in_range constraint.
In `@internal/workers/database_failover_worker_test.go`:
- Around line 158-160: The test "No failover if primary healthy" is a no-op;
make it deterministic by starting a temporary local TCP listener on the primary
address used by the worker, then invoke the health-check path (call isHealthy or
trigger the worker's health-check/run method used in
database_failover_worker.go) so the dial succeeds, assert that no failover
actions occurred (e.g. primary still selected / no call to failover handler),
and finally close the listener; ensure the test cleans up the listener and any
goroutines to avoid flakiness.
In `@k8s/autoscaler/deployment.yaml`:
- Around line 79-83: The deployment references the ConfigMap named "cluster-id"
to populate the CLOUD_CLUSTER_ID env var but the ConfigMap contains the
unresolved template "{{ .ClusterID }}"; before applying the manifest either
replace that placeholder with the actual cluster UUID or add a templating step
to render it. Update docs/README or deployment instructions to require creating
the ConfigMap "cluster-id" with key "cluster-id" set to the real cluster UUID
(or integrate a templating command such as envsubst/helm/kustomize to inject
.ClusterID) so CLOUD_CLUSTER_ID receives the real UUID when Deployment in
deployment.yaml is applied.
---
Duplicate comments:
In `@internal/core/services/cluster.go`:
- Around line 129-138: The CreateCluster flow is non-atomic because
s.repo.Create(ctx, cluster) and s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0])
are separate writes; if AddNodeGroup fails the cluster remains persisted. Fix by
performing both operations in a single transactional unit (use your repository's
transaction mechanism or add one) — e.g., run Create and AddNodeGroup inside a
repo transaction or implement a single repo method
(CreateClusterWithNodeGroup/CreateWithDefaultNodeGroup) that performs both
writes atomically; alternatively, if transactions aren't available implement a
compensating rollback by calling repo.Delete(ctx, cluster.ID) when AddNodeGroup
fails. Ensure you update CreateCluster to call the transactional method (or
perform rollback) so partial state cannot be left behind.
In `@internal/repositories/k8s/provisioner.go`:
- Around line 242-246: If p.repo.AddNode(ctx, node) fails, the newly created
worker is orphaned and the function currently just logs and appends
provisioningErrors; modify the error path in the AddNode block to perform a
rollback by deleting/terminating the created cloud instance (use the provisioner
cloud client method, e.g., p.cloud.DeleteInstance or p.cloud.TerminateInstance
with the workerName), log the rollback result via p.logger, then call
failCluster(ctx, clusterID) to move the cluster to a failed state and return the
error (instead of only appending provisioningErrors). Ensure similar rollback +
failCluster + return behavior is applied to the other failure path referenced
around where provisioningErrors is used so status remains consistent.
---
Nitpick comments:
In `@docs/swagger/docs.go`:
- Around line 1236-1299: The generated Swagger operation for POST
"/clusters/{id}/nodegroups" (request body schema httphandlers.NodeGroupRequest,
response domain.NodeGroup) is missing explicit consumes/produces JSON media
types; update the handler's Swagger annotations (the comments above the handler
that reference httphandlers.NodeGroupRequest and domain.NodeGroup) to include
consumes: application/json and produces: application/json, then re-run the
Swagger generator to regenerate docs/swagger/docs.go; apply the same annotation
fix and regeneration for the other body-based operations noted (the blocks
around lines 1300-1363 and 1583-1622) so all request-body endpoints declare
consumes/produces JSON.
In `@internal/core/services/database.go`:
- Around line 377-381: The returns in the exporter selection code currently use
magic string literals "9187" and "9104" for ports; replace these with named
constants (e.g., PostgresExporterPort and MySQLExporterPort) defined alongside
other service/engine constants and use those constants in the return statements
where postgresExporterImg and mysqlExporterImg are returned; ensure the
constants are typed as strings (or format them where used) and update any tests
or usages that reference the literal port strings to reference the new constant
names.
In `@internal/handlers/cluster_handler_test.go`:
- Around line 570-630: Replace the ad-hoc subtests inside
TestClusterHandlerNodeGroups with a table-driven loop: create a slice of test
cases (structs) that include name, method, path, requestBody, mockSetup func (to
set expectations on svc for AddNodeGroup/UpdateNodeGroup/DeleteNodeGroup),
expectedStatus, and any flags for invalid-id/invalid-body/service-error; then
iterate over the table calling r.ServeHTTP and assert the response. Keep and
reuse the existing handlers (handler.AddNodeGroup, handler.UpdateNodeGroup,
handler.DeleteNodeGroup) and the example params/expected domain.NodeGroup values
from the current subtests inside each test case's mockSetup; for
invalid-id/invalid-body cases set requestBody or path to invalid values and do
not set successful svc expectations (or set svc to return an error for
service-error cases). Ensure each case uses t.Run(tc.name, func(t *testing.T) {
... }) so mocks are asserted per-case.
In `@internal/handlers/cluster_handler.go`:
- Around line 446-471: In UpdateNodeGroup validate the path param `name` (the
variable `name := c.Param("name")`) before calling the service: check it's
non-empty (and optionally allowed format/length), and if invalid return a clear
bad-request via httputil.Error with an InvalidInput error (create or reuse an
error constant such as errInvalidNodeGroupName) so the handler fails fast with a
descriptive message instead of forwarding an empty/invalid name to
svc.UpdateNodeGroup.
In `@internal/repositories/k8s/mocks_test.go`:
- Around line 102-111: Replace the hardcoded-success implementations in
mockClusterRepo's methods (AddNodeGroup, GetNodeGroups, UpdateNodeGroup,
DeleteNodeGroup) with testify/mock-style handlers: call m.Called(...) with the
same arguments, extract returned values and errors from the call result, and
return them (casting to the appropriate types, e.g., []domain.NodeGroup for
GetNodeGroups and error for others). Ensure you pass ctx and the relevant
parameters into m.Called so tests can assert calls and control return values via
On/Return.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 116-121: The current loop calls r.GetNodeGroups(ctx, c.ID) per
cluster causing an N+1 problem; instead, collect all cluster IDs from the
clusters slice in the list function, call a batched fetch (add or use a new
repository method like GetNodeGroupsByClusterIDs(ctx, []clusterIDs) that returns
node groups with their ClusterID), then build an in-memory
map[ClusterID][]NodeGroup and assign c.NodeGroups from that map for each cluster
(replace the per-cluster r.GetNodeGroups calls and the ngs variable usage).
Ensure the new method signature and mapping use the existing types and error
handling (wrap errors similarly) so callers remain consistent.
In `@internal/workers/cluster_worker_test.go`:
- Around line 64-73: The NodeGroup mock methods (MockClusterRepo.AddNodeGroup,
GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) are hardcoded stubs; update
each to delegate to m.Called(...) for consistency and test configurability by
invoking m.Called with the same parameters (ctx, ng or clusterID/name) and
returning the typed results (error and []domain.NodeGroup for GetNodeGroups)
extracted from the returned arguments (using .Error(0) or
.Get(0).([]domain.NodeGroup) as appropriate).
In `@internal/workers/database_failover_worker_test.go`:
- Around line 128-129: Replace the magic number -1 used for the Port field with
a named constant (e.g., invalidPort or unhealthyPort) to make the test intent
explicit; add the constant near the top of the test file or in the test's scope,
then update the Port: -1 occurrence to Port: invalidPort (or your chosen name)
so the unhealthy-port intent is clear when examining the test that configures
the DialTimeout failure.
In `@k8s/autoscaler/deployment.yaml`:
- Around line 77-78: The manifest hardcodes CLOUD_API_URL in deployment.yaml
which reduces flexibility; change it to source the value from a ConfigMap (e.g.,
create/modify k8s/autoscaler/configmap.yaml to include a cloud-api-url key) and
update the Deployment to set the environment variable CLOUD_API_URL from that
ConfigMap (use envFrom or valueFrom with configMapKeyRef) so the same {{
.CloudAPIURL }}-style configurability is available for CLOUD_API_URL.
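For that last item, one possible shape of the ConfigMap-backed env var is sketched below; the ConfigMap name cluster-autoscaler-config and the cloud-api-url key are assumptions for illustration, not names taken from the manifests:

env:
  - name: CLOUD_API_URL
    valueFrom:
      configMapKeyRef:
        name: cluster-autoscaler-config   # assumed ConfigMap name
        key: cloud-api-url                # assumed key holding the API URL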
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 113a517e-0d2d-446e-a551-32d4eab748cb
📒 Files selected for processing (42)
- Dockerfile.autoscaler-server
- cmd/api/main.go
- cmd/autoscaler-server/main.go
- docs/adr/ADR-020-kubernetes-cluster-autoscaler.md
- docs/kubernetes.md
- docs/services/kubernetes-autoscaler.md
- docs/swagger/docs.go
- docs/swagger/swagger.json
- docs/swagger/swagger.yaml
- downs.txt
- internal/api/setup/router.go
- internal/autoscaler/server.go
- internal/core/domain/cluster.go
- internal/core/ports/compute_options.go
- internal/core/services/cluster.go
- internal/core/services/cluster_test.go
- internal/core/services/cluster_unit_test.go
- internal/core/services/database.go
- internal/core/services/health_unit_test.go
- internal/core/services/shared_test.go
- internal/handlers/cluster_handler.go
- internal/handlers/cluster_handler_test.go
- internal/handlers/container_handler.go
- internal/handlers/container_handler_test.go
- internal/repositories/k8s/kubeadm_provisioner_test.go
- internal/repositories/k8s/mocks_test.go
- internal/repositories/k8s/provisioner.go
- internal/repositories/k8s/templates/control_plane.yaml
- internal/repositories/noop/adapters.go
- internal/repositories/postgres/cluster_repo.go
- internal/repositories/postgres/cluster_repo_unit_test.go
- internal/repositories/postgres/migrations/020_create_clusters.up.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.down.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.up.sql
- internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql
- internal/workers/cluster_worker_test.go
- internal/workers/database_failover_worker_test.go
- internal/workers/workers_unit_test.go
- k8s/autoscaler/deployment.yaml
- k8s/autoscaler/rbac.yaml
- pkg/sdk/compute.go
- pkg/sdk/kubernetes.go
💤 Files with no reviewable changes (2)
- internal/repositories/postgres/migrations/037_create_clusters_table.up.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.down.sql
✅ Files skipped from review due to trivial changes (1)
- docs/services/kubernetes-autoscaler.md
🚧 Files skipped from review as they are similar to previous changes (6)
- docs/adr/ADR-020-kubernetes-cluster-autoscaler.md
- internal/api/setup/router.go
- internal/core/ports/compute_options.go
- cmd/autoscaler-server/main.go
- Dockerfile.autoscaler-server
- internal/autoscaler/server.go
| "httphandlers.ScaleDeploymentRequest": { | ||
| "type": "object", | ||
| "required": [ | ||
| "replicas" | ||
| ], | ||
| "properties": { | ||
| "replicas": { | ||
| "type": "integer" | ||
| } | ||
| } | ||
| }, |
Constrain replicas to non-negative values in the contract.
ScaleDeploymentRequest.replicas currently accepts negative integers in schema, which is invalid for scaling semantics.
🐛 Suggested patch
"httphandlers.ScaleDeploymentRequest": {
"type": "object",
"required": [
"replicas"
],
"properties": {
"replicas": {
- "type": "integer"
+ "type": "integer",
+ "minimum": 0
}
}
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/docs.go` around lines 9867 - 9877, The OpenAPI schema for
ScaleDeploymentRequest currently allows negative integers for the replicas
field; update the replicas property inside the ScaleDeploymentRequest schema to
enforce non-negative values by adding a minimum constraint (e.g., "minimum": 0)
so that replicas cannot be negative and validation will reject negative scaling
requests.
| "httphandlers.UpdateNodeGroupRequest": { | ||
| "type": "object", | ||
| "properties": { | ||
| "desired_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| }, | ||
| "max_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| }, | ||
| "min_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| } | ||
| } | ||
| }, |
Disallow empty node-group update payloads.
UpdateNodeGroupRequest currently allows {}. Add minProperties: 1 (or required fields) to avoid ambiguous no-op updates.
🐛 Suggested patch
"httphandlers.UpdateNodeGroupRequest": {
"type": "object",
+ "minProperties": 1,
"properties": {
"desired_size": {
"type": "integer",
"minimum": 0
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/docs.go` around lines 9878 - 9894, The UpdateNodeGroupRequest
schema currently permits an empty object which allows no-op updates; update the
"httphandlers.UpdateNodeGroupRequest" JSON schema to require at least one field
by adding "minProperties": 1 (or alternatively list the updatable fields under
"required" if specific ones must always be present) so requests like {} are
rejected; modify the schema block that defines "UpdateNodeGroupRequest" to
include the "minProperties": 1 property adjacent to "type": "object" (or add a
"required" array containing "desired_size"/"max_size"/"min_size" if
appropriate).
| "httphandlers.ScaleDeploymentRequest": { | ||
| "type": "object", | ||
| "required": [ | ||
| "replicas" | ||
| ], | ||
| "properties": { | ||
| "replicas": { | ||
| "type": "integer" | ||
| } | ||
| } |
Add a lower bound for deployment replicas in the schema.
Line 9865 currently allows negative replicas values because no minimum is declared. This weakens client-side validation for generated SDKs.
🔧 Suggested schema fix
"httphandlers.ScaleDeploymentRequest": {
"type": "object",
"required": [
"replicas"
],
"properties": {
"replicas": {
- "type": "integer"
+ "type": "integer",
+ "minimum": 0
}
}
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.json` around lines 9859 - 9868, The schema for
httphandlers.ScaleDeploymentRequest allows negative replicas; update the
replicas property in the httphandlers.ScaleDeploymentRequest definition to
include a lower bound by adding "minimum": 0 (so generated SDKs validate
non-negative replica counts); ensure the change is applied to the "replicas"
property in that object schema.
// Add default Node Group
defaultNG := domain.NodeGroup{
	ID:           uuid.New(),
	ClusterID:    cluster.ID,
	Name:         "default-pool",
	InstanceType: "standard-1",
	MinSize:      1,
	MaxSize:      10,
	CurrentSize:  params.Workers,
Validate default node-group sizing before the first DB write.
Line 123 sets CurrentSize from params.Workers while bounds are fixed at 1..10. Invalid values can make default node-group persistence fail after the cluster row is already inserted.
🔧 Proposed fix
const (
defaultClusterVersion = "v1.29.0"
defaultWorkerCount = 2
+ defaultNodeGroupMinSize = 1
+ defaultNodeGroupMaxSize = 10
)
@@
+ if err := s.validateNodeGroupSizing(defaultNodeGroupMinSize, defaultNodeGroupMaxSize, params.Workers); err != nil {
+ return nil, err
+ }
+
defaultNG := domain.NodeGroup{
@@
- MinSize: 1,
- MaxSize: 10,
+ MinSize: defaultNodeGroupMinSize,
+ MaxSize: defaultNodeGroupMaxSize,
  CurrentSize: params.Workers,
As per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/core/services/cluster.go` around lines 115 - 123, The default
NodeGroup construction sets CurrentSize := params.Workers without validating it
against hard-coded bounds (MinSize/MaxSize 1..10) which can cause DB failures
after the cluster row is created; update the code that builds defaultNG (the
domain.NodeGroup instance) to validate params.Workers before any DB writes by
introducing named constants (e.g., DefaultMinNodeGroupSize,
DefaultMaxNodeGroupSize) instead of magic numbers, and either clamp
params.Workers into that range or return an error so you don't persist an
invalid NodeGroup; ensure validation happens prior to inserting the
cluster/NodeGroup and update any callers that assume the old behavior.
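The s.validateNodeGroupSizing helper referenced in the proposed fix does not exist yet; a minimal sketch under that assumption (receiver name and the use of fmt.Errorf instead of the project's errors package are both placeholders):

func (s *ClusterService) validateNodeGroupSizing(minSize, maxSize, desired int) error {
	if minSize < 1 || maxSize < minSize {
		return fmt.Errorf("invalid node group bounds: min=%d max=%d", minSize, maxSize)
	}
	if desired < minSize || desired > maxSize {
		// Reject before any DB write so the cluster row is never left without a valid default pool.
		return fmt.Errorf("worker count %d must be between %d and %d", desired, minSize, maxSize)
	}
	return nil
}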
if err := s.repo.AddNodeGroup(ctx, ng); err != nil {
	return nil, err
}

// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
}
Node-group mutations and WorkerCount updates need a single transaction boundary.
Add/update/delete each perform multiple dependent writes across node groups and clusters. If the second write fails, persisted state can diverge (cluster.WorkerCount vs. node-group sizes).
Also applies to: 501-512, 542-550
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/core/services/cluster.go` around lines 437 - 448, The AddNodeGroup
flow mutates node-group then updates cluster.WorkerCount in two separate writes
(s.repo.AddNodeGroup and s.repo.Update) which can leave state inconsistent if
the second write fails; wrap the related operations (s.repo.AddNodeGroup,
adjusting cluster.WorkerCount by params.DesiredSize, and s.repo.Update) in a
single transactional boundary (use your repo's transaction API /
BeginTx/WithTx), perform both writes inside the same transaction, and ensure
rollback on any error so that either both the node-group and cluster.WorkerCount
persist or neither do; apply the same transactional pattern to the other
affected code paths (the ranges noted around lines 501-512 and 542-550).
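A sketch of the transactional shape being asked for, assuming a hypothetical WithTx helper on the repository (the codebase's actual transaction API may differ):

err := s.repo.WithTx(ctx, func(tx ports.ClusterRepository) error {
	if err := tx.AddNodeGroup(ctx, ng); err != nil {
		return err
	}
	cluster.WorkerCount += params.DesiredSize
	// Both writes commit together; any error rolls the transaction back.
	return tx.Update(ctx, cluster)
})
if err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to add node group", err)
}

The same wrapper would cover the update and delete paths noted at 501-512 and 542-550.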
func (s *NoopClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	return &domain.NodeGroup{ClusterID: clusterID, Name: params.Name}, nil
}
func (s *NoopClusterService) UpdateNodeGroup(ctx context.Context, clusterID uuid.UUID, name string, params ports.UpdateNodeGroupParams) (*domain.NodeGroup, error) {
	return &domain.NodeGroup{ClusterID: clusterID, Name: name}, nil
}
func (s *NoopClusterService) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
	return nil
}
🧩 Analysis chain
🏁 Scripts executed against poyrazK/thecloud:
cd /dev/null && pwd
cat -n internal/repositories/noop/adapters.go | head -250 | tail -100
fd 'ports' -t f --name '*.go' | grep -E '(port|interface)' | head -20
rg 'type ClusterRepository interface' -A 20 --type=go
rg 'type ClusterService interface' -A 20 --type=go
rg 'NoopClusterRepository' -B 2 -A 5 --type=go
rg 'var.*ClusterRepository|:=.*NoopClusterRepository' --type=go -B2 -A2
rg 'NoopClusterRepository' --type=go -l
grep -r 'NoopClusterRepository' --include='*.go' | head -20
Add missing NodeGroup methods to NoopClusterRepository to fully implement the ports.ClusterRepository interface.
NoopClusterRepository currently lacks AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup methods required by the ports.ClusterRepository interface. These must be implemented to maintain interface compliance:
func (r *NoopClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil }
func (r *NoopClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
return []domain.NodeGroup{}, nil
}
func (r *NoopClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil }
func (r *NoopClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { return nil }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/noop/adapters.go` around lines 221 - 229, The
NoopClusterRepository is missing the NodeGroup methods required by
ports.ClusterRepository; add implementations for AddNodeGroup, GetNodeGroups,
UpdateNodeGroup, and DeleteNodeGroup on the NoopClusterRepository type that
match the repository signatures and return benign noop results (e.g., nil error
and empty slice for GetNodeGroups) so the repository conforms to the interface;
ensure method names and parameter types exactly match the
ports.ClusterRepository signatures and use domain.NodeGroup and uuid.UUID where
appropriate.
func (r *ClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
	query := `
		UPDATE cluster_node_groups
		SET instance_type = $1, min_size = $2, max_size = $3, current_size = $4, updated_at = $5
		WHERE cluster_id = $6 AND name = $7
	`
	_, err := r.db.Exec(ctx, query, ng.InstanceType, ng.MinSize, ng.MaxSize, ng.CurrentSize, time.Now(), ng.ClusterID, ng.Name)
	if err != nil {
		return errors.Wrap(errors.Internal, "failed to update node group", err)
	}
	return nil
}

func (r *ClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
	query := `DELETE FROM cluster_node_groups WHERE cluster_id = $1 AND name = $2`
	_, err := r.db.Exec(ctx, query, clusterID, name)
	if err != nil {
		return errors.Wrap(errors.Internal, "failed to delete node group", err)
	}
	return nil
}
Return not-found when node-group update/delete affects no record.
UpdateNodeGroup and DeleteNodeGroup currently succeed even if (cluster_id, name) is missing. That can hide races and make higher layers report success for no-op writes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/postgres/cluster_repo.go` around lines 265 - 285,
Update UpdateNodeGroup and DeleteNodeGroup to check the Exec result for affected
rows and return a not-found error when zero rows are affected: capture the Exec
return into a result (e.g., res, err := r.db.Exec(...)), check rows :=
res.RowsAffected() (or res.RowsAffected() via CommandTag if your DB driver
returns a CommandTag) and if rows == 0 return errors.Wrap(errors.NotFound, "node
group not found") instead of success; keep the existing error wrapping for Exec
failures and apply the same change in both ClusterRepository.UpdateNodeGroup and
ClusterRepository.DeleteNodeGroup.
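A sketch of that check for DeleteNodeGroup, assuming the pgx driver (Exec returns a pgconn.CommandTag); the exact not-found constructor should follow whatever the project's errors package actually provides:

tag, err := r.db.Exec(ctx, query, clusterID, name)
if err != nil {
	return errors.Wrap(errors.Internal, "failed to delete node group", err)
}
if tag.RowsAffected() == 0 {
	// No row matched (cluster_id, name) — report not-found instead of silent success.
	return errors.Wrap(errors.NotFound, "node group not found", nil) // adapt to the project's error helper
}
return nil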
INSERT INTO cluster_node_groups (id, cluster_id, name, instance_type, min_size, max_size, current_size)
SELECT
    gen_random_uuid(),
    id,
    'default-pool',
    'standard-1',
    1,
    CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
    worker_count
FROM clusters
WHERE NOT EXISTS (
    SELECT 1 FROM cluster_node_groups cng
    WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool'
);
Backfill can fail on non-positive worker_count values.
At Line 30, current_size = worker_count can violate chk_cluster_node_groups_current_in_range when legacy data has worker_count <= 0, causing migration failure.
🧩 Proposed safe backfill adjustment
INSERT INTO cluster_node_groups (id, cluster_id, name, instance_type, min_size, max_size, current_size)
SELECT
gen_random_uuid(),
id,
'default-pool',
'standard-1',
1,
- CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
- worker_count
+ GREATEST(worker_count, 10),
+ GREATEST(worker_count, 1)
FROM clusters
WHERE NOT EXISTS (
SELECT 1 FROM cluster_node_groups cng
WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool'
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`
around lines 22 - 35, The backfill sets current_size = worker_count which can be
<= 0 and violate the chk_cluster_node_groups_current_in_range constraint; update
the INSERT SELECT to ensure current_size is at least 1 by replacing current_size
= worker_count with current_size = GREATEST(worker_count, 1) (and optionally
guard max_size similarly if you want to ensure it's >=1), referencing the
cluster_node_groups table, clusters table, worker_count column and the
chk_cluster_node_groups_current_in_range constraint.
- name: CLOUD_CLUSTER_ID
  valueFrom:
    configMapKeyRef:
      name: cluster-id
      key: cluster-id
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, check if the cluster-id-configmap.yaml file exists and examine its contents
if [ -f k8s/autoscaler/cluster-id-configmap.yaml ]; then
echo "=== Contents of cluster-id-configmap.yaml ==="
cat k8s/autoscaler/cluster-id-configmap.yaml
echo ""
echo "=== Checking for template placeholders ==="
grep -E '\{\{.*\}\}' k8s/autoscaler/cluster-id-configmap.yaml && echo "Found template placeholder" || echo "No template placeholder found"
else
echo "File not found: k8s/autoscaler/cluster-id-configmap.yaml"
fi
echo ""
echo "=== Verifying deployment.yaml exists and shows relevant section ==="
if [ -f k8s/autoscaler/deployment.yaml ]; then
echo "File exists. Checking lines 79-83:"
sed -n '79,83p' k8s/autoscaler/deployment.yaml
else
echo "File not found: k8s/autoscaler/deployment.yaml"
fi
Populate cluster-id ConfigMap with actual cluster UUID before deployment.
The cluster-id ConfigMap contains {{ .ClusterID }}, an unresolved template placeholder. The deployment references this ConfigMap to set the CLOUD_CLUSTER_ID environment variable, which will receive the literal string "{{ .ClusterID }}" instead of an actual cluster ID. Update deployment documentation to explicitly require that the ConfigMap be created with the actual cluster UUID before applying this manifest, or implement a templating step to resolve the placeholder during deployment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@k8s/autoscaler/deployment.yaml` around lines 79 - 83, The deployment
references the ConfigMap named "cluster-id" to populate the CLOUD_CLUSTER_ID env
var but the ConfigMap contains the unresolved template "{{ .ClusterID }}";
before applying the manifest either replace that placeholder with the actual
cluster UUID or add a templating step to render it. Update docs/README or
deployment instructions to require creating the ConfigMap "cluster-id" with key
"cluster-id" set to the real cluster UUID (or integrate a templating command
such as envsubst/helm/kustomize to inject .ClusterID) so CLOUD_CLUSTER_ID
receives the real UUID when Deployment in deployment.yaml is applied.
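For reference, a rendered example of what the bridge expects (values are placeholders; the real template lives in k8s/autoscaler/cluster-id-configmap.yaml, and the namespace is assumed to match the autoscaler Deployment's):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-id
  namespace: kube-system   # assumed
data:
  cluster-id: "<your-cluster-uuid>"   # replace with the real cluster UUID before applying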
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/workers/database_failover_worker_test.go (1)
162-178: ⚠️ Potential issue | 🟡 Minor
Inconsistent matcher usage and magic number.
Two issues in this test:
- Line 172 uses mock.Anything while other tests use anyCtx - should be consistent.
- Line 168 uses magic number 0 for healthy port simulation. This relies on an undocumented implementation detail and violates the guideline against magic numbers.
Suggested fix
Define a constant and use anyCtx consistently:
+const healthyPortSimulation = 0 // Port 0 is treated as healthy in worker for testing
+
 func TestDatabaseFailoverWorker(t *testing.T) {
 	t.Run("No failover if primary healthy", func(t *testing.T) {
 		repo := new(mockDatabaseRepo)
 		svc := new(mockDatabaseService)
 		logger := slog.Default()
 		healthyPrimary := *primary
-		healthyPrimary.Port = 0 // port 0 is always healthy in our worker for simulation
+		healthyPrimary.Port = healthyPortSimulation
 		worker := NewDatabaseFailoverWorker(svc, repo, logger)
-		repo.On("List", mock.Anything).Return([]*domain.Database{&healthyPrimary}, nil)
+		repo.On("List", anyCtx).Return([]*domain.Database{&healthyPrimary}, nil)
As per coding guidelines: "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/database_failover_worker_test.go` around lines 162 - 178, Replace the inconsistent matcher and magic number in the "No failover if primary healthy" case: use the test-scoped matcher anyCtx instead of mock.Anything when setting expectations on repo or svc, and replace the literal port 0 with a named constant (e.g., healthyPort or DefaultHealthyPort) defined at the top of the test file so healthyPrimary.Port = healthyPort; keep the rest of the logic around healthyPrimary, NewDatabaseFailoverWorker, repo.On("List", ...) and worker.checkDatabases(...) unchanged.
docs/swagger/swagger.yaml (1)
3110-3131: ⚠️ Potential issue | 🟠 Major
/clusters/{id}/scale verb switch is a client-breaking change.
Changing this operation to PUT will break existing clients still using POST. Consider a transition window (support both methods, mark POST deprecated) or an explicit API version bump.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 3110 - 3131, The API operation for /clusters/{id}/scale was changed to PUT which is a client-breaking change for callers still using POST; either support both verbs during a transition or bump the API version. Update swagger.yaml so that the path /clusters/{id}/scale exposes both post and put operations (reusing the same schema reference to httphandlers.ScaleClusterRequest and responses) and mark the POST operation deprecated in its description, or alternatively revert to POST and increment the API version in the spec if you intend a breaking change; ensure documentation and changelog note the deprecation/version change so clients can migrate.
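If the transition route is taken, the deprecated alias might look roughly like this in Swagger 2 YAML terms (illustrative only; the spec is generated from handler annotations, so the real change belongs in those annotations):

  /clusters/{id}/scale:
    post:
      deprecated: true
      summary: Scale cluster workers (deprecated alias, use PUT)
      tags:
        - K8s
      security:
        - APIKeyAuth: []
      parameters:
        - description: Cluster ID
          in: path
          name: id
          required: true
          type: string
        - description: Scale Request
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/httphandlers.ScaleClusterRequest'
      responses:
        "200":
          description: OK
    put:
      summary: Scale cluster workers
      # ...existing generated PUT operation unchanged...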
♻️ Duplicate comments (4)
internal/repositories/noop/adapters.go (1)
165-187: ⚠️ Potential issue | 🟠 Major
NoopClusterRepository still missing NodeGroup methods required by ports.ClusterRepository.
The past review comment flagging this issue remains unresolved. NoopClusterRepository must implement AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to satisfy the ports.ClusterRepository interface (see internal/core/ports/cluster.go lines 36-39).
🔧 Add missing NodeGroup methods
 func (r *NoopClusterRepository) DeleteNode(ctx context.Context, id uuid.UUID) error { return nil }
+
+func (r *NoopClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
+	return nil
+}
+func (r *NoopClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
+	return []domain.NodeGroup{}, nil
+}
+func (r *NoopClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
+	return nil
+}
+func (r *NoopClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
+	return nil
+}
As per coding guidelines, "Repositories (internal/repositories/) must implement Ports interfaces".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/noop/adapters.go` around lines 165 - 187, NoopClusterRepository is missing the NodeGroup methods required by the ports.ClusterRepository interface; add the methods AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to NoopClusterRepository with matching signatures (context.Context plus domain.NodeGroup or IDs/return types) so the struct implements ports.ClusterRepository (refer to ports.ClusterRepository in internal/core/ports/cluster.go for exact signatures), and implement them as no-ops/empty returns consistent with the other Noop methods.
docs/swagger/docs.go (2)
9887-9903: ⚠️ Potential issue | 🟡 Minor
Disallow empty node-group update payloads.
Line 9888 defines an object where all fields are optional; {} is currently valid and can produce ambiguous no-op updates.
Suggested patch
 "httphandlers.UpdateNodeGroupRequest": {
     "type": "object",
+    "minProperties": 1,
     "properties": {
         "desired_size": {
             "type": "integer",
             "minimum": 0
         },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9887 - 9903, The UpdateNodeGroupRequest schema (httphandlers.UpdateNodeGroupRequest) currently allows an empty object; add a constraint to require at least one field by adding "minProperties": 1 to that schema so `{}` is invalid and at least one of desired_size, max_size, or min_size must be present.
9876-9886: ⚠️ Potential issue | 🟡 Minor
Constrain deployment replicas to non-negative values.
Line 9882 currently permits negative integers for replicas, which breaks scaling semantics.
Suggested patch
 "httphandlers.ScaleDeploymentRequest": {
     "type": "object",
     "required": [
         "replicas"
     ],
     "properties": {
         "replicas": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9876 - 9886, The Swagger schema for httphandlers.ScaleDeploymentRequest allows negative replicas; update the schema for the "replicas" property to enforce non-negative integers by adding a minimum constraint (e.g., "minimum": 0) to the replicas property in the ScaleDeploymentRequest definition so clients and validation reject negative values.
docs/swagger/swagger.json (1)
9868-9877: ⚠️ Potential issue | 🟡 Minor
Add a non-negative lower bound for deployment replicas.
Line 9875 still allows negative values because minimum is missing on replicas, so client-side validation remains weak.
🔧 Suggested schema fix
 "httphandlers.ScaleDeploymentRequest": {
     "type": "object",
     "required": [
         "replicas"
     ],
     "properties": {
         "replicas": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 9868 - 9877, The ScaleDeploymentRequest schema allows negative replica counts; update the "httphandlers.ScaleDeploymentRequest" definition to add a non-negative lower bound by adding "minimum": 0 to the "replicas" property so clients validate that replicas cannot be negative; locate the "replicas" property within the ScaleDeploymentRequest object in swagger.json and add the minimum constraint to that property.
🧹 Nitpick comments (2)
internal/handlers/container_handler_test.go (1)
238-270: Refactor this error test into a table-driven test.
The scenarios are repetitive and fit cleanly into a table-driven structure with shared setup/assertions.
As per coding guidelines, "Use table-driven tests in test files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/container_handler_test.go` around lines 238 - 270, TestContainerHandlerScaleError repeats three similar subtests; refactor it into a single table-driven test that iterates over cases (e.g., names "InvalidID", "InvalidJSON", "ServiceError") and runs t.Run for each. For each case, call setupContainerHandlerTest to get svc, handler, r, register r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment), construct the request path using deploymentsPath + either containerPathInvalid, a random UUID, or a specific uuid for the ServiceError case (use id.String()), set the request body to nil, "invalid", or the JSON {"replicas":5} as appropriate, and set up mock expectations only for the ServiceError case via svc.On("ScaleDeployment", mock.Anything, id, 5).Return(errors.New(...)). In the loop assert the expected HTTP status (http.StatusBadRequest or http.StatusInternalServerError) and for cases that set mock expectations call svc.AssertExpectations(t) after serving the request; keep the shared setup/assertions inside the loop to avoid duplication and preserve existing behavior of handler.ScaleDeployment and scaleSuffix.
docs/swagger/swagger.yaml (1)
1980-1997: Tighten NodeGroup request validation to prevent under-specified calls.
Create currently only requires name, and update can be an empty object. That weakens API guarantees and pushes avoidable validation failures downstream.
Suggested schema hardening
  httphandlers.NodeGroupRequest:
    properties:
      desired_size:
        minimum: 0
        type: integer
      instance_type:
        type: string
      max_size:
        minimum: 0
        type: integer
      min_size:
        minimum: 0
        type: integer
      name:
+       minLength: 1
        type: string
    required:
+   - instance_type
+   - min_size
+   - max_size
    - name
    type: object
  httphandlers.UpdateNodeGroupRequest:
    properties:
      desired_size:
        minimum: 0
        type: integer
      max_size:
        minimum: 0
        type: integer
      min_size:
        minimum: 0
        type: integer
+   minProperties: 1
    type: object
Also applies to: 2085-2096, 2936-2951, 3011-3029
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 1980 - 1997, The NodeGroup request schema (httphandlers.NodeGroupRequest) is under-specified — only "name" is required so creates and updates can be empty; update the schema to enforce stronger validation: make instance_type and at least one size field required for create (add instance_type, min_size, max_size, desired_size to the required array for the create/POST schema or a dedicated NodeGroupCreate schema), and for update/PUT/PATCH either require minProperties: 1 or define a separate NodeGroupUpdate schema that allows partial fields but rejects an empty object; also ensure the numeric size fields keep their minimum: 0 constraint and add any cross-field validation notes (e.g., max_size >= min_size) where supported.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/database.md`:
- Around line 274-288: The documentation for the cluster_node_groups table is
missing two CHECK constraints present in the migration; update the CREATE TABLE
block for cluster_node_groups to include the constraints
chk_cluster_node_groups_min_le_max (CHECK (min_size <= max_size)) and
chk_cluster_node_groups_current_in_range (CHECK (current_size BETWEEN min_size
AND max_size)) so the docs match the migration (reference the migration's
constraint names and the cluster_node_groups table to locate where to add them).
In `@docs/swagger/swagger.json`:
- Around line 1501-1503: The OpenAPI change replaced the /clusters/{id}/scale
operation from POST to PUT which breaks clients; restore backward compatibility
by adding a POST operation for the same path (or keep POST as primary and add
PUT as an additional operation) with the same requestBody, parameters and
responses as the new PUT so both POST and PUT are supported; update the
operations for "/clusters/{id}/scale" in swagger.json (ensure request/response
schema names and operationId values are consistent or distinct but functionally
equivalent) so existing clients using POST continue to work while allowing PUT
going forward.
In `@docs/swagger/swagger.yaml`:
- Around line 2078-2084: The OpenAPI schema for
httphandlers.ScaleDeploymentRequest currently allows any integer for the
replicas property; update the schema to disallow negatives by adding a minimum:
0 constraint on the replicas property (and apply the same change to the other
ScaleDeploymentRequest definition at the second occurrence referenced). Locate
the ScaleDeploymentRequest schema and add the numeric validation (minimum: 0) to
the replicas property so the contract rejects negative replica counts before
handler logic runs.
In `@internal/api/setup/router.go`:
- Line 566: The SDK's ScaleDeployment currently sends a POST; change the
implementation of ScaleDeployment in pkg/sdk/container.go to use a PUT call
instead of c.post() (e.g., switch to the client's put method) so it matches the
router's PUT /deployments/:id/scale, and update the unit test in
pkg/sdk/container_test.go (the ScaleDeployment test) to assert the request uses
PUT (and adjust any mock server handler expectations from POST to PUT).
In `@internal/handlers/container_handler_test.go`:
- Line 243: Tests are silently ignoring errors from http.NewRequest; replace
instances like `req, _ := http.NewRequest("PUT",
deploymentsPath+containerPathInvalid+scaleSuffix, nil)` (and the other similar
calls) with explicit error handling: `req, err := http.NewRequest(...)` followed
by `if err != nil { t.Fatalf("failed to create request: %v", err) }` or
`require.NoError(t, err)` so test setup failures surface immediately; update all
occurrences of `http.NewRequest` in these subtests accordingly.
---
Outside diff comments:
In `@docs/swagger/swagger.yaml`:
- Around line 3110-3131: The API operation for /clusters/{id}/scale was changed
to PUT which is a client-breaking change for callers still using POST; either
support both verbs during a transition or bump the API version. Update
swagger.yaml so that the path /clusters/{id}/scale exposes both post and put
operations (reusing the same schema reference to
httphandlers.ScaleClusterRequest and responses) and mark the POST operation
deprecated in its description, or alternatively revert to POST and increment the
API version in the spec if you intend a breaking change; ensure documentation
and changelog note the deprecation/version change so clients can migrate.
In `@internal/workers/database_failover_worker_test.go`:
- Around line 162-178: Replace the inconsistent matcher and magic number in the
"No failover if primary healthy" case: use the test-scoped matcher anyCtx
instead of mock.Anything when setting expectations on repo or svc, and replace
the literal port 0 with a named constant (e.g., healthyPort or
DefaultHealthyPort) defined at the top of the test file so healthyPrimary.Port =
healthyPort; keep the rest of the logic around healthyPrimary,
NewDatabaseFailoverWorker, repo.On("List", ...) and worker.checkDatabases(...)
unchanged.
---
Duplicate comments:
In `@docs/swagger/docs.go`:
- Around line 9887-9903: The UpdateNodeGroupRequest schema
(httphandlers.UpdateNodeGroupRequest) currently allows an empty object; add a
constraint to require at least one field by adding "minProperties": 1 to that
schema so `{}` is invalid and at least one of desired_size, max_size, or
min_size must be present.
- Around line 9876-9886: The Swagger schema for
httphandlers.ScaleDeploymentRequest allows negative replicas; update the schema
for the "replicas" property to enforce non-negative integers by adding a minimum
constraint (e.g., "minimum": 0) to the replicas property in the
ScaleDeploymentRequest definition so clients and validation reject negative
values.
In `@docs/swagger/swagger.json`:
- Around line 9868-9877: The ScaleDeploymentRequest schema allows negative
replica counts; update the "httphandlers.ScaleDeploymentRequest" definition to
add a non-negative lower bound by adding "minimum": 0 to the "replicas" property
so clients validate that replicas cannot be negative; locate the "replicas"
property within the ScaleDeploymentRequest object in swagger.json and add the
minimum constraint to that property.
In `@internal/repositories/noop/adapters.go`:
- Around line 165-187: NoopClusterRepository is missing the NodeGroup methods
required by the ports.ClusterRepository interface; add the methods AddNodeGroup,
GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to NoopClusterRepository
with matching signatures (context.Context plus domain.NodeGroup or IDs/return
types) so the struct implements ports.ClusterRepository (refer to
ports.ClusterRepository in internal/core/ports/cluster.go for exact signatures),
and implement them as no-ops/empty returns consistent with the other Noop
methods.
---
Nitpick comments:
In `@docs/swagger/swagger.yaml`:
- Around line 1980-1997: The NodeGroup request schema
(httphandlers.NodeGroupRequest) is under-specified — only "name" is required so
creates and updates can be empty; update the schema to enforce stronger
validation: make instance_type and at least one size field required for create
(add instance_type, min_size, max_size, desired_size to the required array for
the create/POST schema or a dedicated NodeGroupCreate schema), and for
update/PUT/PATCH either require minProperties: 1 or define a separate
NodeGroupUpdate schema that allows partial fields but rejects an empty object;
also ensure the numeric size fields keep their minimum: 0 constraint and add any
cross-field validation notes (e.g., max_size >= min_size) where supported.
In `@internal/handlers/container_handler_test.go`:
- Around line 238-270: TestContainerHandlerScaleError repeats three similar
subtests; refactor it into a single table-driven test that iterates over cases
(e.g., names "InvalidID", "InvalidJSON", "ServiceError") and runs t.Run for
each. For each case, call setupContainerHandlerTest to get svc, handler, r,
register r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment), construct
the request path using deploymentsPath + either containerPathInvalid, a random
UUID, or a specific uuid for the ServiceError case (use id.String()), set the
request body to nil, "invalid", or the JSON {"replicas":5} as appropriate, and
set up mock expectations only for the ServiceError case via
svc.On("ScaleDeployment", mock.Anything, id, 5).Return(errors.New(...)). In the
loop assert the expected HTTP status (http.StatusBadRequest or
http.StatusInternalServerError) and for cases that set mock expectations call
svc.AssertExpectations(t) after serving the request; keep the shared
setup/assertions inside the loop to avoid duplication and preserve existing
behavior of handler.ScaleDeployment and scaleSuffix.
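A rough table-driven shape for that refactor; the helper names (setupContainerHandlerTest, mockContainerService) and the path constants are taken from the prompt above and treated as assumptions about the existing test file:

func TestContainerHandlerScaleError(t *testing.T) {
	id := uuid.New()
	cases := []struct {
		name       string
		path       string
		body       string
		setupMock  func(svc *mockContainerService)
		wantStatus int
	}{
		{name: "InvalidID", path: deploymentsPath + containerPathInvalid + scaleSuffix, wantStatus: http.StatusBadRequest},
		{name: "InvalidJSON", path: deploymentsPath + "/" + id.String() + scaleSuffix, body: "invalid", wantStatus: http.StatusBadRequest},
		{name: "ServiceError", path: deploymentsPath + "/" + id.String() + scaleSuffix, body: `{"replicas":5}`,
			setupMock: func(svc *mockContainerService) {
				svc.On("ScaleDeployment", mock.Anything, id, 5).Return(assert.AnError)
			},
			wantStatus: http.StatusInternalServerError},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			svc, handler, r := setupContainerHandlerTest()
			r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment)
			if tc.setupMock != nil {
				tc.setupMock(svc)
			}
			req, err := http.NewRequest(http.MethodPut, tc.path, strings.NewReader(tc.body))
			require.NoError(t, err)
			w := httptest.NewRecorder()
			r.ServeHTTP(w, req)
			assert.Equal(t, tc.wantStatus, w.Code)
			svc.AssertExpectations(t) // mocks asserted per case
		})
	}
}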
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c1a92bf6-3e97-4a3c-b7aa-983937557b57
📒 Files selected for processing (10)
- docs/database.md
- docs/kubernetes.md
- docs/swagger/docs.go
- docs/swagger/swagger.json
- docs/swagger/swagger.yaml
- internal/api/setup/router.go
- internal/core/services/shared_test.go
- internal/handlers/container_handler_test.go
- internal/repositories/noop/adapters.go
- internal/workers/database_failover_worker_test.go
CREATE TABLE cluster_node_groups (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    min_size INTEGER NOT NULL DEFAULT 1,
    max_size INTEGER NOT NULL DEFAULT 10,
    current_size INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(cluster_id, name)
);

CREATE INDEX idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id);
Documentation missing CHECK constraints from the actual migration.
The migration file 098_create_cluster_node_groups.up.sql includes two CHECK constraints that enforce data integrity but aren't documented here:
- chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size)
- chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
Consider adding these constraints to the documentation for completeness.
📝 Suggested documentation update

```diff
     created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
     updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-    UNIQUE(cluster_id, name)
+    UNIQUE(cluster_id, name),
+    CONSTRAINT chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size),
+    CONSTRAINT chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
 );
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sql
CREATE TABLE cluster_node_groups (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    min_size INTEGER NOT NULL DEFAULT 1,
    max_size INTEGER NOT NULL DEFAULT 10,
    current_size INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(cluster_id, name),
    CONSTRAINT chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size),
    CONSTRAINT chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
);

CREATE INDEX idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/database.md` around lines 274 - 288, The documentation for the
cluster_node_groups table is missing two CHECK constraints present in the
migration; update the CREATE TABLE block for cluster_node_groups to include the
constraints chk_cluster_node_groups_min_le_max (CHECK (min_size <= max_size))
and chk_cluster_node_groups_current_in_range (CHECK (current_size BETWEEN
min_size AND max_size)) so the docs match the migration (reference the
migration's constraint names and the cluster_node_groups table to locate where
to add them).
| "/clusters/{id}/scale": { | ||
| "post": { | ||
| "put": { | ||
| "security": [ |
Avoid a breaking API change by keeping backward compatibility for cluster scale.
Line 1502 switches /clusters/{id}/scale from POST to PUT. This can break existing clients and generated SDKs that still call POST.
🔧 Suggested compatibility approach
"/clusters/{id}/scale": {
+ "post": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
+ "deprecated": true,
+ "description": "Backward-compatible alias for scale cluster workers. Use PUT.",
+ "tags": [
+ "K8s"
+ ],
+ "summary": "Scale cluster workers (deprecated)",
+ "parameters": [
+ {
+ "type": "string",
+ "description": "Cluster ID",
+ "name": "id",
+ "in": "path",
+ "required": true
+ },
+ {
+ "description": "Scale Request",
+ "name": "request",
+ "in": "body",
+ "required": true,
+ "schema": {
+ "$ref": "#/definitions/httphandlers.ScaleClusterRequest"
+ }
+ }
+ ],
+ "responses": {
+ "200": {
+ "description": "OK"
+ }
+ }
+ },
"put": {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "/clusters/{id}/scale": { | |
| "post": { | |
| "put": { | |
| "security": [ | |
| "/clusters/{id}/scale": { | |
| "post": { | |
| "security": [ | |
| { | |
| "APIKeyAuth": [] | |
| } | |
| ], | |
| "deprecated": true, | |
| "description": "Backward-compatible alias for scale cluster workers. Use PUT.", | |
| "tags": [ | |
| "K8s" | |
| ], | |
| "summary": "Scale cluster workers (deprecated)", | |
| "parameters": [ | |
| { | |
| "type": "string", | |
| "description": "Cluster ID", | |
| "name": "id", | |
| "in": "path", | |
| "required": true | |
| }, | |
| { | |
| "description": "Scale Request", | |
| "name": "request", | |
| "in": "body", | |
| "required": true, | |
| "schema": { | |
| "$ref": "#/definitions/httphandlers.ScaleClusterRequest" | |
| } | |
| } | |
| ], | |
| "responses": { | |
| "200": { | |
| "description": "OK" | |
| } | |
| } | |
| }, | |
| "put": { | |
| "security": [ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.json` around lines 1501 - 1503, The OpenAPI change
replaced the /clusters/{id}/scale operation from POST to PUT which breaks
clients; restore backward compatibility by adding a POST operation for the same
path (or keep POST as primary and add PUT as an additional operation) with the
same requestBody, parameters and responses as the new PUT so both POST and PUT
are supported; update the operations for "/clusters/{id}/scale" in swagger.json
(ensure request/response schema names and operationId values are consistent or
distinct but functionally equivalent) so existing clients using POST continue to
work while allowing PUT going forward.
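If the compatibility alias is adopted, the router side is a small change: register the old verb against the same handler for a deprecation window. A minimal sketch assuming a Gin-style router group; the function and route group names are illustrative, not existing identifiers:

```go
package setup

import "github.com/gin-gonic/gin"

// registerClusterScaleRoutes maps both verbs to the same scale handler so
// existing POST clients keep working while PUT becomes the documented method.
func registerClusterScaleRoutes(g *gin.RouterGroup, scale gin.HandlerFunc) {
	g.PUT("/clusters/:id/scale", scale)  // documented going forward
	g.POST("/clusters/:id/scale", scale) // deprecated alias; drop after a migration window
}
```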
```yaml
  httphandlers.ScaleDeploymentRequest:
    properties:
      replicas:
        type: integer
    required:
    - replicas
    type: object
```
Disallow negative replica counts in deployment scaling.
Line 2080 accepts any integer, so negative values are valid by contract and can leak into handler logic.
Suggested contract fix
```diff
   httphandlers.ScaleDeploymentRequest:
     properties:
       replicas:
+        minimum: 0
         type: integer
     required:
     - replicas
     type: object
```

Also applies to: 3156-3170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.yaml` around lines 2078 - 2084, The OpenAPI schema for
httphandlers.ScaleDeploymentRequest currently allows any integer for the
replicas property; update the schema to disallow negatives by adding a minimum:
0 constraint on the replicas property (and apply the same change to the other
ScaleDeploymentRequest definition at the second occurrence referenced). Locate
the ScaleDeploymentRequest schema and add the numeric validation (minimum: 0) to
the replicas property so the contract rejects negative replica counts before
handler logic runs.
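Beyond the contract change, the handler can enforce the same bound so negative counts are rejected even when a client skips schema validation. A hedged sketch assuming a Gin-style handler; the request struct and helper shown are illustrative, not the repository's actual code:

```go
package httphandlers

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// ScaleDeploymentRequest uses a pointer so "required" still accepts an explicit
// scale-to-zero request, while gte=0 blocks negative replica counts.
type ScaleDeploymentRequest struct {
	Replicas *int `json:"replicas" binding:"required,gte=0"`
}

// bindScaleRequest shows the defensive bind; a real handler would go on to
// parse the :id path parameter and call the container service.
func bindScaleRequest(c *gin.Context) (*ScaleDeploymentRequest, bool) {
	var req ScaleDeploymentRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return nil, false
	}
	return &req, true
}
```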
| containerGroup.GET("/deployments", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.ListDeployments) | ||
| containerGroup.GET("/deployments/:id", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.GetDeployment) | ||
| containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment) | ||
| containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment) |
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check for SDK methods calling container deployment scale
rg -n 'deployments.*scale|ScaleDeployment' pkg/sdk/ --type=go -C2
```

Repository: poyrazK/thecloud

Length of output: 1372

🏁 Script executed:

```bash
#!/bin/bash
# Also check the router file itself to see the current state
cat -n internal/api/setup/router.go | grep -A5 -B5 "deployments.*scale"
```

Repository: poyrazK/thecloud

Length of output: 1021

🏁 Script executed:

```bash
#!/bin/bash
# Search for any references to this endpoint across the codebase
rg -n 'ScaleDeployment|deployments.*scale' --type=go
```

Repository: poyrazK/thecloud

Length of output: 4066
Update SDK client to use PUT for container deployment scaling.
The endpoint has been changed to PUT (verified in router.go line 566 and handler tests), but the SDK client in pkg/sdk/container.go still uses c.post() at line 54. Update ScaleDeployment() to use PUT and update the corresponding test in pkg/sdk/container_test.go to verify the PUT method instead of POST. Without this change, SDK clients will fail when attempting to scale deployments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/api/setup/router.go` at line 566, The SDK's ScaleDeployment
currently sends a POST; change the implementation of ScaleDeployment in
pkg/sdk/container.go to use a PUT call instead of c.post() (e.g., switch to the
client's put method) so it matches the router's PUT /deployments/:id/scale, and
update the unit test in pkg/sdk/container_test.go (the ScaleDeployment test) to
assert the request uses PUT (and adjust any mock server handler expectations
from POST to PUT).
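The shape of that SDK change might look like the sketch below. It assumes the package's existing context and fmt imports; the c.put helper, path prefix, and method signature are assumptions mirroring the post-based call the review found, not confirmed APIs:

```go
// ScaleDeployment sends the scale request with PUT so it matches the router's
// PUT /deployments/:id/scale registration. The c.put helper is an assumed
// mirror of the existing c.post helper; the actual client may differ.
func (c *Client) ScaleDeployment(ctx context.Context, deploymentID string, replicas int) error {
	body := map[string]int{"replicas": replicas}
	return c.put(ctx, fmt.Sprintf("/deployments/%s/scale", deploymentID), body, nil)
}
```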
```diff
-	r.POST(deploymentsPath+scalePath, handler.ScaleDeployment)
-	req, _ := http.NewRequest("POST", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+	r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment)
+	req, _ := http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
```
Don’t ignore request construction errors in these subtests.
Using req, _ := http.NewRequest(...) can silently mask test setup failures.
Suggested fix
- req, _ := http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+ require.NoError(t, err)
- req, _ := http.NewRequest("PUT", deploymentsPath+"/"+uuid.NewString()+scaleSuffix, bytes.NewBufferString("invalid"))
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+"/"+uuid.NewString()+scaleSuffix, bytes.NewBufferString("invalid"))
+ require.NoError(t, err)
- req, _ := http.NewRequest("PUT", deploymentsPath+"/"+id.String()+scaleSuffix, bytes.NewBuffer(body))
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+"/"+id.String()+scaleSuffix, bytes.NewBuffer(body))
+ require.NoError(t, err)As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()."
Also applies to: 252-252, 264-264
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/handlers/container_handler_test.go` at line 243, Tests are silently
ignoring errors from http.NewRequest; replace instances like `req, _ :=
http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)`
(and the other similar calls) with explicit error handling: `req, err :=
http.NewRequest(...)` followed by `if err != nil { t.Fatalf("failed to create
request: %v", err) }` or `require.NoError(t, err)` so test setup failures
surface immediately; update all occurrences of `http.NewRequest` in these
subtests accordingly.
This PR introduces the Kubernetes Cluster Autoscaler integration using the gRPC External Provider pattern.
Key Changes:
Verified via local builds and repository unit tests.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Changes