feat: implement Kubernetes Cluster Autoscaler integration #89
Conversation
📝 Walkthrough
Implements Kubernetes cluster autoscaling via an out-of-tree gRPC External Provider pattern. Adds an autoscaler server, node group management across domain/service/repository layers, API endpoints for node group CRUD, Kubernetes deployment manifests with a sidecar bridge, and database migrations supporting cluster node groups as the primary scaling unit.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CA as Cluster Autoscaler
participant Bridge as thecloud-bridge<br/>(Autoscaler Server)
participant SDK as SDK Client
participant API as Cloud API
participant DB as Database
Note over CA,DB: Node Group Scaling Flow
CA->>Bridge: NodeGroups(ctx, req)
Bridge->>SDK: List node groups for cluster
SDK->>API: GET /clusters/{id}/nodegroups
API->>DB: Query cluster_node_groups
DB-->>API: Return node groups
API-->>SDK: Return NodeGroup[]
SDK-->>Bridge: Return NodeGroup[]
Bridge-->>CA: Return NodeGroups response
CA->>Bridge: NodeGroupIncreaseSize(ctx, req)
activate Bridge
Bridge->>Bridge: Validate: desired ≤ MaxSize
Bridge->>SDK: UpdateNodeGroup(clusterID, name, newSize)
SDK->>API: PUT /clusters/{id}/nodegroups/{name}
API->>DB: Update cluster_node_groups.current_size
DB-->>API: OK
API-->>SDK: Return updated NodeGroup
SDK-->>Bridge: Return updated NodeGroup
Bridge-->>CA: Return success response
deactivate Bridge
CA->>Bridge: NodeGroupNodes(ctx, req)
Bridge->>SDK: List instances in node group
SDK->>API: GET /clusters/{id}/instances?node_group={name}
API->>DB: Query instances with metadata filter
DB-->>API: Return instances
API-->>SDK: Return Instance[]
SDK-->>Bridge: Return Instance[]
Bridge-->>CA: Return Instance[] mapped to protos.Instance
```
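For orientation, here is a minimal sketch of how the bridge's `NodeGroups` RPC could map SDK node groups onto the external gRPC protos, per the flow above. The SDK method name (`ListNodeGroupsWithContext`) and the error handling are assumptions, not the exact implementation in `internal/autoscaler/server.go`.

```go
// Sketch only: the SDK call below is an assumption based on the sequence diagram,
// not a confirmed symbol from pkg/sdk.
func (s *AutoscalerServer) NodeGroups(ctx context.Context, req *protos.NodeGroupsRequest) (*protos.NodeGroupsResponse, error) {
    groups, err := s.client.ListNodeGroupsWithContext(ctx, s.clusterID)
    if err != nil {
        return nil, err
    }
    resp := &protos.NodeGroupsResponse{}
    for _, g := range groups {
        // Map each SDK node group to the externalgrpc NodeGroup proto.
        resp.NodeGroups = append(resp.NodeGroups, &protos.NodeGroup{
            Id:      g.Name,
            MinSize: int32(g.MinSize),
            MaxSize: int32(g.MaxSize),
        })
    }
    return resp, nil
}
```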
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Pull request overview
Implements Kubernetes Cluster Autoscaler integration using the external gRPC provider pattern, introducing a first-class “node group / node pool” model and deploying an in-cluster bridge that translates CA RPCs into The Cloud API calls.
Changes:
- Add NodeGroup domain/API/storage support (DB migration, repository/service/handler/routes, SDK types + endpoints).
- Add autoscaler gRPC bridge service (`autoscaler-server`) and Kubernetes manifests to deploy it alongside upstream Cluster Autoscaler.
- Update documentation (ADR-020, KaaS guides, and DB schema docs) to describe the integration and usage.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| pkg/sdk/kubernetes.go | Adds NodeGroup models and SDK methods for node group management; introduces context-aware GetCluster. |
| k8s/autoscaler/rbac.yaml | RBAC resources for running Cluster Autoscaler + bridge in kube-system. |
| k8s/autoscaler/deployment.yaml | Deployment running upstream cluster-autoscaler plus The Cloud bridge sidecar. |
| k8s/autoscaler/configmap.yaml | External gRPC provider config pointing CA to localhost bridge port. |
| k8s/autoscaler/cluster-id-configmap.yaml | ConfigMap template to provide the cluster ID to the bridge. |
| internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql | Creates cluster_node_groups table and backfills a default pool for existing clusters. |
| internal/repositories/postgres/migrations/098_create_cluster_node_groups.down.sql | Drops cluster_node_groups. |
| internal/repositories/postgres/cluster_repo.go | Loads node groups for clusters and adds CRUD repository methods for node groups. |
| internal/repositories/k8s/provisioner.go | Provisions workers based on node groups and tags instances with cluster/node-group metadata. |
| internal/handlers/cluster_handler.go | Adds HTTP endpoints for add/update/delete node groups. |
| internal/core/services/cluster.go | Persists default node group on cluster creation and implements node group management logic. |
| internal/core/ports/compute_options.go | Extends instance creation options with metadata/labels. |
| internal/core/ports/cluster.go | Extends ClusterRepository/ClusterService interfaces with node group operations. |
| internal/core/domain/cluster.go | Adds NodeGroup domain model and attaches NodeGroups to Cluster. |
| internal/autoscaler/server.go | Implements externalgrpc CloudProvider server backed by The Cloud SDK. |
| internal/autoscaler/protos/externalgrpc_grpc.pb.go | Generated gRPC service code for externalgrpc provider. |
| internal/api/setup/router.go | Registers node group routes under /clusters. |
| docs/services/kubernetes-autoscaler.md | New autoscaler feature documentation and usage notes. |
| docs/kubernetes.md | Updates KaaS guide to mention autoscaling + adjusts some MVP limitations wording. |
| docs/database.md | Documents clusters and new cluster_node_groups schema. |
| docs/adr/ADR-020-kubernetes-cluster-autoscaler.md | ADR describing the chosen autoscaler integration strategy. |
| cmd/autoscaler-server/main.go | CLI entrypoint for the bridge service. |
| Makefile | Adds build target for autoscaler-server. |
| Dockerfile.autoscaler-server | Adds container build for the bridge service. |
```go
for _, node := range req.Nodes {
    providerID := strings.TrimPrefix(node.ProviderID, "thecloud://")
    if providerID == node.ProviderID {
        continue
    }
    if err := s.client.TerminateInstance(providerID); err != nil {
        klog.Errorf("Failed to terminate instance %s: %v", providerID, err)
        return nil, err
    }
}
```
NodeGroupDeleteNodes terminates instances solely based on ProviderID without validating that the instance belongs to the requested node group and cluster. A malformed/stale request could terminate unrelated instances. Fetch the instance first and verify metadata (cluster-id/node-group) matches req.Id and s.clusterID before terminating.
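A hedged sketch of that ownership check, assuming the SDK exposes a context-aware instance lookup and that instances carry the cluster/node-group metadata written by the provisioner (the method and metadata key names are illustrative):

```go
// Sketch only: GetInstanceWithContext and the metadata keys are assumptions,
// not confirmed SDK symbols.
inst, err := s.client.GetInstanceWithContext(ctx, providerID)
if err != nil {
    klog.Errorf("Failed to look up instance %s: %v", providerID, err)
    return nil, err
}
if inst.Metadata["cluster-id"] != s.clusterID || inst.Metadata["node-group"] != req.Id {
    klog.Warningf("Refusing to terminate instance %s: not in node group %s of cluster %s",
        providerID, req.Id, s.clusterID)
    continue
}
if err := s.client.TerminateInstance(providerID); err != nil {
    klog.Errorf("Failed to terminate instance %s: %v", providerID, err)
    return nil, err
}
```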
```go
_, _ = s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
    DesiredSize: &newSize,
})
```
NodeGroupDeleteNodes computes newSize := currentSize - len(nodes) and sends it without bounding to MinSize/0, and it also ignores the UpdateNodeGroupWithContext error. This can set negative/invalid sizes and hide failures; clamp/validate the new size and handle the update error.
```diff
-_, _ = s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
-    DesiredSize: &newSize,
-})
+if newSize < 0 {
+    klog.Errorf("calculated negative desired size %d for node group %s, clamping to 0", newSize, req.Id)
+    newSize = 0
+}
+if _, err := s.client.UpdateNodeGroupWithContext(ctx, s.clusterID, req.Id, sdk.UpdateNodeGroupInput{
+    DesiredSize: &newSize,
+}); err != nil {
+    klog.Errorf("failed to update node group %s desired size to %d: %v", req.Id, newSize, err)
+    return nil, err
+}
```
```markdown
1. A pod is in `Pending` status due to CPU/Memory constraints.
2. Upstream CA identifies that a Node Group can satisfy the requirement.
3. Upstream CA calls `NodeGroupIncreaseSize` via gRPC.
4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
```
This doc references POST /clusters/{id}/nodegroups/{name}, but the API routes are POST /clusters/{id}/nodegroups (create) and PUT /clusters/{id}/nodegroups/{name} (update/scale). Update the scaling flow to reflect the actual endpoint the bridge uses (PUT).
```diff
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls `PUT /clusters/{id}/nodegroups/{name}` on The Cloud API.
```
```go
// Node Group operations
AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error
GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error)
UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error
DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error
```
The new ClusterRepository interface adds node group methods, but other in-repo implementations/mocks (e.g., internal/repositories/noop.NoopClusterService/NoopClusterRepository and internal/core/services test mocks) must be updated to implement these methods; otherwise builds/tests that wire these into ports.ClusterService/ports.ClusterRepository will not compile.
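For illustration, a testify/mock stub for two of the new methods might look like the sketch below; the `MockClusterRepository` name is an assumption and should match whatever mock type the existing tests define.

```go
// Hypothetical mock type name; the real test mocks may be named differently.
func (m *MockClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
    args := m.Called(ctx, ng)
    return args.Error(0)
}

func (m *MockClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
    args := m.Called(ctx, clusterID)
    if groups, ok := args.Get(0).([]domain.NodeGroup); ok {
        return groups, args.Error(1)
    }
    return nil, args.Error(1)
}
```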
```go
if err == nil {
    c.NodeGroups = ngs
}
```
ClusterRepository.list() ignores errors from GetNodeGroups and returns clusters with silently missing NodeGroups. This can mask DB/query failures and lead to inconsistent API responses; consider returning the error (or at least logging it) instead of dropping it.
```diff
-if err == nil {
-    c.NodeGroups = ngs
-}
+if err != nil {
+    return nil, errors.Wrap(errors.Internal, "failed to list node groups for cluster", err)
+}
+c.NodeGroups = ngs
```
```go
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
    var resp Response[any]
    return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
```
SDK NodeGroup operations are inconsistent with the established context pattern (non-context method delegating to WithContext). Add AddNodeGroupWithContext/DeleteNodeGroupWithContext (and optionally a non-context UpdateNodeGroup wrapper) so callers can control cancellation/timeouts.
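A minimal sketch of the delegation pattern for one of the missing pairs, assuming a `postWithContext` helper exists alongside `putWithContext` (that helper name is an assumption, not a confirmed SDK symbol):

```go
// AddNodeGroupWithContext adds a new node pool with caller-controlled cancellation.
// Sketch only: postWithContext is assumed to exist, mirroring putWithContext.
func (c *Client) AddNodeGroupWithContext(ctx context.Context, clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    var resp Response[*NodeGroup]
    if err := c.postWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
        return nil, err
    }
    return resp.Data, nil
}

// AddNodeGroup keeps the non-context convenience signature by delegating.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
    return c.AddNodeGroupWithContext(context.Background(), clusterID, input)
}
```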
```go
// Fetch Node Groups for each cluster in the list
ngs, err := r.GetNodeGroups(ctx, c.ID)
if err == nil {
    c.NodeGroups = ngs
}
```
list() performs an additional GetNodeGroups query per cluster row (N+1). For users with many clusters this will add noticeable latency/load; consider fetching node groups for all listed cluster IDs in a single query and grouping in-memory.
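One possible shape for such a batched lookup, sketched with pgx-style querying; the method name, receiver field (`r.db`), and SQL are illustrative rather than the repository's actual implementation:

```go
// GetNodeGroupsForClusters is a sketch of a single-query batch fetch; the method
// name, r.db field, and SQL are assumptions based on the schema shown in this PR.
func (r *ClusterRepository) GetNodeGroupsForClusters(ctx context.Context, clusterIDs []uuid.UUID) (map[uuid.UUID][]domain.NodeGroup, error) {
    rows, err := r.db.Query(ctx, `
        SELECT id, cluster_id, name, instance_type, min_size, max_size, current_size, created_at, updated_at
        FROM cluster_node_groups
        WHERE cluster_id = ANY($1)`, clusterIDs)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    byCluster := make(map[uuid.UUID][]domain.NodeGroup)
    for rows.Next() {
        var ng domain.NodeGroup
        if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType,
            &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
            return nil, err
        }
        // Group node groups in memory keyed by their cluster ID.
        byCluster[ng.ClusterID] = append(byCluster[ng.ClusterID], ng)
    }
    return byCluster, rows.Err()
}
```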
```go
// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
cluster.WorkerCount += params.DesiredSize
_ = s.repo.Update(ctx, cluster)
```
AddNodeGroup/UpdateNodeGroup update cluster.WorkerCount but ignore errors from repo.Update. If the update fails, WorkerCount will drift from the sum of node groups, which can break legacy paths relying on WorkerCount. Return/handle the error (and consider wrapping it) instead of discarding it.
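One way to handle it, sketched under the assumption that rolling back the in-memory counter on failure is acceptable here (the error wrapping is illustrative):

```go
// Sketch only: persist the counter change and revert the in-memory value on failure
// so cluster.WorkerCount stays consistent with what is actually stored.
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
    cluster.WorkerCount -= params.DesiredSize
    return nil, fmt.Errorf("failed to persist updated worker count: %w", err)
}
```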
```go
if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
    return nil, err
}
_ = s.repo.Update(ctx, cluster)
```
UpdateNodeGroup() persists the node group and then ignores the error from repo.Update(cluster). If that update fails, the cluster record (WorkerCount/UpdatedAt) may become inconsistent with the node group state. Handle/return the update error rather than discarding it.
```diff
-_ = s.repo.Update(ctx, cluster)
+if err := s.repo.Update(ctx, cluster); err != nil {
+    return nil, err
+}
```
```go
func (s *AutoscalerServer) NodeGroupTemplateNodeInfo(ctx context.Context, req *protos.NodeGroupTemplateNodeInfoRequest) (*protos.NodeGroupTemplateNodeInfoResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) NodeGroupGetOptions(ctx context.Context, req *protos.NodeGroupAutoscalingOptionsRequest) (*protos.NodeGroupAutoscalingOptionsResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) PricingNodePrice(ctx context.Context, req *protos.PricingNodePriceRequest) (*protos.PricingNodePriceResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}

func (s *AutoscalerServer) PricingPodPrice(ctx context.Context, req *protos.PricingPodPriceRequest) (*protos.PricingPodPriceResponse, error) {
    return nil, fmt.Errorf("unimplemented")
}
```
The Cluster Autoscaler external gRPC spec expects optional/unimplemented methods to return gRPC code Unimplemented (12). Returning fmt.Errorf("unimplemented") will surface as code Unknown instead. Use status.Errorf(codes.Unimplemented, ...) for these methods.
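A minimal sketch of that change for one of the stubs; the same pattern applies to the other three:

```go
import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

func (s *AutoscalerServer) NodeGroupTemplateNodeInfo(ctx context.Context, req *protos.NodeGroupTemplateNodeInfoRequest) (*protos.NodeGroupTemplateNodeInfoResponse, error) {
    // Return gRPC code Unimplemented (12) so the Cluster Autoscaler treats this
    // optional method as unsupported rather than as an unknown failure.
    return nil, status.Errorf(codes.Unimplemented, "NodeGroupTemplateNodeInfo is not implemented")
}
```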
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (6)
docs/kubernetes.md-119-121 (1)
119-121: ⚠️ Potential issue | 🟡 Minor
“Limitations (MVP)” now lists capabilities, not limitations.
The new bullets describe supported features. Please either rename this section or replace these bullets with actual current constraints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/kubernetes.md` around lines 119 - 121, The section header "Limitations (MVP)" currently lists supported features ("Storage: Dynamic storage provisioning (CSI)" and "High Availability: ... via the `--ha` flag") instead of constraints; either rename the header to "Features (MVP)" or keep "Limitations (MVP)" and replace those bullets with real limitations (for example: lack of automated backups, no cluster autoscaler, limited OS image support, manual certificate rotation, or single-region control plane) so the text matches the header; update the header text and/or replace the two bullets ("Storage" and "High Availability" referencing `--ha`) accordingly.
docs/services/kubernetes-autoscaler.md-38-38 (1)
38-38: ⚠️ Potential issue | 🟡 Minor
Fix the scale-up API endpoint description.
Line 38 documents `POST /clusters/{id}/nodegroups/{name}`, which conflicts with the create example on Lines 57-58 and can mislead readers about scale vs create behavior.
Doc wording fix
```diff
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls the Node Group resize/update endpoint (for example, `PATCH /clusters/{id}/nodegroups/{name}`) on The Cloud API.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/services/kubernetes-autoscaler.md` at line 38, The doc currently lists the scale-up endpoint as `POST /clusters/{id}/nodegroups/{name}`, which conflicts with the create example; update the wording to describe the actual scale/update endpoint (replace `POST /clusters/{id}/nodegroups/{name}`) with the correct API verb and path for resizing a nodegroup (for example `PATCH /clusters/{id}/nodegroups/{name}` or the team-approved scale path such as `POST /clusters/{id}/nodegroups/{name}/scale`), and ensure the text clearly states this is a scale/update operation (not create) so it no longer conflicts with the create example that uses `POST /clusters/{id}/nodegroups`.
docs/swagger/docs.go-9685-9698 (1)
9685-9698: ⚠️ Potential issue | 🟡 Minor
Disallow empty update bodies for node group updates.
With all fields optional, an empty object is currently valid for `UpdateNodeGroupRequest`. Require at least one updatable field to prevent no-op update calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9685 - 9698, The UpdateNodeGroupRequest schema currently allows an empty object; modify its schema to require at least one updatable field by adding a constraint (e.g., "minProperties": 1) to the "httphandlers.UpdateNodeGroupRequest" schema in docs/swagger/docs.go so that requests with no fields are rejected; reference the schema name "httphandlers.UpdateNodeGroupRequest" and ensure the change is applied to the JSON object describing that schema.
docs/swagger/docs.go-9550-9572 (1)
9550-9572: ⚠️ Potential issue | 🟡 Minor
Add numeric constraints to node group sizing fields.
On Line 9556, Line 9562, and Line 9565, `desired_size`/`max_size`/`min_size` are unconstrained. The schema currently allows invalid values (including negatives). Add bounds in annotations so generated docs enforce valid ranges.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9550 - 9572, The schema for httphandlers.NodeGroupRequest currently allows negative/invalid sizes for desired_size, max_size, and min_size; update the struct or swagger annotations for the NodeGroupRequest fields desired_size, max_size, and min_size to include numeric bounds (e.g., add "minimum: 0" to each and optionally a sensible "maximum" if desired) so the generated docs emit valid range constraints (minimum: 0) for those integer properties.
internal/repositories/postgres/cluster_repo.go-244-252 (1)
244-252: ⚠️ Potential issue | 🟡 Minor
Check `rows.Err()` after iterating node groups.
Without a post-loop `rows.Err()` check, cursor-level read errors can be missed.
Suggested change
```diff
 for rows.Next() {
     var ng domain.NodeGroup
     if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
         return nil, errors.Wrap(errors.Internal, "failed to scan node group", err)
     }
     groups = append(groups, ng)
 }
+if err := rows.Err(); err != nil {
+    return nil, errors.Wrap(errors.Internal, "failed while iterating node groups", err)
+}
 return groups, nil
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo.go` around lines 244 - 252, After iterating rows into the groups slice you must check cursor-level errors by calling rows.Err(); update the code after the for rows.Next() loop (where rows, groups and domain.NodeGroup are used and rows.Scan is invoked) to call if err := rows.Err(); err != nil { return nil, errors.Wrap(errors.Internal, "row iteration error", err) } before returning groups, nil so any underlying read errors are surfaced and wrapped consistently.
internal/handlers/cluster_handler.go-380-460 (1)
380-460: ⚠️ Potential issue | 🟡 Minor
Add missing `@Security` Swagger annotations on new NodeGroup handlers.
`AddNodeGroup`, `UpdateNodeGroup`, and `DeleteNodeGroup` docs omit `@Security`, so generated API docs won’t fully describe auth requirements.
Suggested doc updates
```diff
 // AddNodeGroup godoc
 // @Summary Add a node group to a cluster
 // @Description Creates a new pool of worker nodes
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param request body NodeGroupRequest true "Node Group details"
 // @Success 201 {object} domain.NodeGroup
 // @Router /clusters/{id}/nodegroups [post]

 // UpdateNodeGroup godoc
 // @Summary Update a node group
 // @Description Modifies scaling boundaries or desired size of a node pool
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param name path string true "Node Group Name"
 // @Param request body UpdateNodeGroupRequest true "Update details"
 // @Success 200 {object} domain.NodeGroup
 // @Router /clusters/{id}/nodegroups/{name} [put]

 // DeleteNodeGroup godoc
 // @Summary Delete a node group
 // @Description Removes a node pool and terminates its nodes
 // @Tags K8s
+// @Security BearerAuth
 // @Param id path string true "Cluster ID"
 // @Param name path string true "Node Group Name"
 // @Success 202
 // @Router /clusters/{id}/nodegroups/{name} [delete]
```
As per coding guidelines, "Include Swagger documentation comments on handler methods with `@Summary`, `@Tags`, `@Security`, and `@Router` directives".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler.go` around lines 380 - 460, The Swagger comments for AddNodeGroup, UpdateNodeGroup, and DeleteNodeGroup are missing `@Security`; update the doc comments above each handler (AddNodeGroup, UpdateNodeGroup, DeleteNodeGroup) to include the same `@Security` directive used by other K8s handlers in the file (e.g., `@Security` ApiKeyAuth or `@Security` BearerAuth — use the exact scheme name already present elsewhere), placing it alongside the existing `@Summary`, `@Tags`, and `@Router` lines so the generated API docs correctly show authentication requirements.
🧹 Nitpick comments (12)
internal/core/ports/compute_options.go (1)
13-14: Add JSON tags for the new optional map fields.
`Metadata` and `Labels` are optional but currently lack `json` tags with `omitempty`.
♻️ Proposed change
```diff
-	Metadata map[string]string // Key-value metadata for the instance
-	Labels   map[string]string // Scheduling or grouping labels
+	Metadata map[string]string `json:"metadata,omitempty"` // Key-value metadata for the instance
+	Labels   map[string]string `json:"labels,omitempty"`   // Scheduling or grouping labels
```
As per coding guidelines, "Include `json` tags on struct fields with `omitempty` for optional fields".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/ports/compute_options.go` around lines 13 - 14, The struct fields Metadata and Labels are optional but missing JSON tags; update their definitions in the struct (fields named Metadata and Labels in this file) to include json tags with omitempty (e.g., `json:"metadata,omitempty"` and `json:"labels,omitempty"`) so they are omitted from serialized output when nil/empty.
Dockerfile.autoscaler-server (1)
20-28: Harden runtime image by dropping root.
The final container runs as root. Consider adding a non-root user for better default security posture.
🔒 Proposed hardening
```diff
 FROM alpine:3.19
 WORKDIR /app
+RUN addgroup -S app && adduser -S app -G app
 # Copy the binary from the builder stage
 COPY --from=builder /app/bin/autoscaler-server /usr/local/bin/thecloud-autoscaler-server
 # Run the server
+USER app
 ENTRYPOINT ["/usr/local/bin/thecloud-autoscaler-server"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile.autoscaler-server` around lines 20 - 28, The container currently runs as root; modify Dockerfile.autoscaler-server to create a non-root user and group (e.g., "autoscaler"), chown the runtime binary (/usr/local/bin/thecloud-autoscaler-server) and any needed app dirs to that user, set a safe HOME if needed, and add a USER autoscaler (or UID/GID) directive before the ENTRYPOINT so ENTRYPOINT ["/usr/local/bin/thecloud-autoscaler-server"] runs as the non-root user; ensure you use Alpine-compatible commands (addgroup/adduser or adduser -S/addgroup -S) and preserve executable permissions on the binary.
docs/adr/ADR-020-kubernetes-cluster-autoscaler.md (1)
29-29: Align bridge naming across ADR and implementation artifacts.
This ADR uses `thecloud-bridge`, while build/runtime files in this PR use `autoscaler-server`/`thecloud-autoscaler-server`. Please standardize on one canonical name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/adr/ADR-020-kubernetes-cluster-autoscaler.md` at line 29, Update naming so the ADR and implementation use the same canonical bridge name: pick one of the existing identifiers (either "thecloud-bridge" or "autoscaler-server"/"thecloud-autoscaler-server") and replace the others to match; update ADR-020-kubernetes-cluster-autoscaler.md (the line containing "thecloud-bridge") to the chosen canonical name and also update any build/runtime artifacts that reference "autoscaler-server" or "thecloud-autoscaler-server" so all references (ADR text, service names, manifests, build scripts) consistently use the single selected identifier.
internal/repositories/postgres/migrations/020_create_clusters.down.sql (1)
1-2: Prefer explicit rollback dependency handling over broad `CASCADE`.
Line 2 uses `CASCADE` even after `cluster_nodes` is already explicitly dropped on Line 1. This can over-drop unrelated dependent objects and makes rollback behavior less predictable.
Suggested migration tweak
```diff
-DROP TABLE IF EXISTS cluster_nodes CASCADE;
-DROP TABLE IF EXISTS clusters CASCADE;
+DROP TABLE IF EXISTS cluster_nodes;
+DROP TABLE IF EXISTS clusters;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/migrations/020_create_clusters.down.sql` around lines 1 - 2, The second DROP uses broad CASCADE which can remove unrelated objects; change the rollback to avoid CASCADE on the clusters drop—replace "DROP TABLE IF EXISTS clusters CASCADE;" with "DROP TABLE IF EXISTS clusters;" (or "DROP TABLE IF EXISTS clusters RESTRICT;") and, if necessary, explicitly drop any known dependent objects (e.g., foreign keys or indexes that reference clusters) instead of relying on CASCADE; refer to the DROP TABLE lines and the table names cluster_nodes and clusters to locate the statements.
internal/repositories/postgres/cluster_repo_unit_test.go (1)
62-120: Refactor the newly added NodeGroup repository tests to table-driven + testify/mock.
The new cases repeat setup/assertion patterns and continue using `pgxmock`; please align these added tests with the repository’s test conventions.
As per coding guidelines, `**/*_test.go`: "Use table-driven tests in test files" and "Use testify/mock for creating mock objects in tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo_unit_test.go` around lines 62 - 120, The three tests (GetByID, ListAll, ListByUserID) repeat setup and use pgxmock; convert them to a single table-driven test loop that iterates cases (name, setupMock, callFn, expectedLen) and replace pgxmock usage with testify/mock-based mocks; for each case call NewClusterRepository with the testify mock (or mock the lower-level DB interface used by NewClusterRepository), configure expectations in setupMock (return rows/NodeGroups for that case), then invoke repo.GetByID / repo.ListAll / repo.ListByUserID accordingly and assert results (len of clusters and cluster.NodeGroups). Ensure you reference the existing methods NewClusterRepository, GetByID, ListAll, ListByUserID and the NodeGroups field when writing expectations and assertions.
cmd/autoscaler-server/main.go (1)
18-18: Replace the hardcoded default gRPC port with a named constant.
`50051` at Line 18 is an inline magic number; move it to a constant for maintainability.
♻️ Suggested change
```diff
+const DefaultGRPCPort = 50051
+
 func main() {
     var (
-        port = flag.Int("port", 50051, "The gRPC server port")
+        port = flag.Int("port", DefaultGRPCPort, "The gRPC server port")
```
As per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/autoscaler-server/main.go` at line 18, Replace the inline magic number 50051 used in flag.Int call by introducing a named constant (e.g., defaultGRPCPort) at the top of the file and use that constant in the call to flag.Int("port", defaultGRPCPort, ...); update any related comments or usages referencing the hardcoded port to use the new constant and ensure the constant is of type int to match flag.Int's default value.
internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql (1)
7-13: Add DB-level size invariants for node groups.
The table currently allows invalid combinations (`min_size > max_size`, `current_size` outside bounds). Enforcing these at DB level prevents corrupted scaling state.
✅ Suggested change
```diff
 CREATE TABLE IF NOT EXISTS cluster_node_groups (
 @@
     min_size INTEGER NOT NULL DEFAULT 1,
     max_size INTEGER NOT NULL DEFAULT 10,
     current_size INTEGER NOT NULL DEFAULT 1,
 @@
-    UNIQUE(cluster_id, name)
+    UNIQUE(cluster_id, name),
+    CHECK (min_size >= 0),
+    CHECK (max_size >= min_size),
+    CHECK (current_size >= min_size AND current_size <= max_size)
 );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql` around lines 7 - 13, Add DB-level CHECK constraints to the cluster node groups table to enforce size invariants: add a check ensuring min_size <= max_size and another ensuring current_size >= min_size AND current_size <= max_size (you can also include min_size >= 1 if you want to match the DEFAULT). Add these constraints to the CREATE TABLE for the table that defines min_size, max_size, current_size (e.g., add CHECK (min_size <= max_size) and CHECK (current_size BETWEEN min_size AND max_size) with descriptive constraint names like chk_cluster_node_groups_min_le_max and chk_cluster_node_groups_current_in_range).
docs/swagger/docs.go (2)
6937-6968: Clarify `worker_count` vs `node_groups` contract.
`node_groups` was added on Line 6937, while `worker_count` remains in the same schema on Line 6966. If node pools are the canonical model, mark `worker_count` deprecated (or remove in the next API version) to avoid ambiguous client behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 6937 - 6968, The schema exposes both node_groups (items -> "$ref": "#/definitions/domain.NodeGroup") and worker_count which creates ambiguity; update the schema for the cluster definition to mark worker_count as deprecated and guide clients to use node_groups instead: add a deprecation indicator and a clear description on the worker_count property (e.g., "DEPRECATED: use node_groups for node pool counts; will be removed in next API version"), or remove worker_count in the next major API version, ensuring any generated docs/clients reflect the change.
7874-7905: Strengthen `domain.NodeGroup` response schema with `required` fields.
Right now all properties are optional in the OpenAPI model. For better client generation, consider marking stable fields (`id`, `cluster_id`, `name`, `min_size`, `max_size`, `current_size`) as required.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 7874 - 7905, The OpenAPI schema for domain.NodeGroup currently has all properties optional; add a "required" array to the domain.NodeGroup definition listing the stable fields to make them mandatory (id, cluster_id, name, min_size, max_size, current_size). Update the domain.NodeGroup object in docs/swagger/docs.go to include the required property alongside the existing "type" and "properties" keys so code generation tools will treat those fields as required.
docs/swagger/swagger.json (3)
6929-6935: Clarify the canonical scaling field on `domain.Cluster`.
With `node_groups` added at Line 6929, keeping `worker_count` at Line 6958 creates two scaling representations on the same resource. Please document one as canonical (and the other as legacy/deprecated) to avoid conflicting client behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 6929 - 6935, The Cluster schema now exposes two scaling representations—node_groups and worker_count—so update the domain.Cluster definition to mark one as canonical and the other as deprecated: designate "node_groups" as the canonical scaling field (or vice versa as your choice) and add clear description text for both properties indicating the canonical field, marking the other with "deprecated": true and a deprecation note pointing to the canonical field; ensure you update the descriptions for "node_groups" and "worker_count" in the swagger.json so clients know which to use and that the deprecated field is retained only for backward compatibility.
7866-7897: Add contract-level bounds for node group size integers.
`min_size`, `max_size`, `current_size`, and `desired_size` are currently unbounded. Add non-negative constraints in the OpenAPI schema so invalid payloads are rejected earlier.
♻️ Proposed schema hardening
```diff
 "domain.NodeGroup": {
     "type": "object",
     "properties": {
 @@
         "current_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
 "httphandlers.NodeGroupRequest": {
     "type": "object",
 @@
     "properties": {
         "desired_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
 @@
 "httphandlers.UpdateNodeGroupRequest": {
     "type": "object",
     "properties": {
         "desired_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "max_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         },
         "min_size": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 }
```
Also applies to: 9542-9564, 9677-9690
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 7866 - 7897, The NodeGroup OpenAPI schema allows negative integers for size fields; update the domain.NodeGroup schema and the same schemas at the other occurrences to add a non-negative constraint by adding "minimum": 0 to the integer properties min_size, max_size, current_size and desired_size (where present), keeping their type as "integer" (you can also add "format": "int32" if desired) so invalid negative values are rejected at contract level.
1218-1225: Document common error responses for new node group operations.
Only success codes are currently specified. Adding standard `400`/`404`/`409`/`500` response schemas here will make generated clients and API consumers more reliable.
Also applies to: 1260-1267, 1291-1294
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 1218 - 1225, The OpenAPI responses for the node group create endpoints only include a 201 success schema (referencing definitions/domain.NodeGroup); add standard error response entries for 400, 404, 409, and 500 to each of these operation response objects (the same blocks that currently contain the "201" response) and point each to your common error schema (e.g., definitions/Error or definitions/domain.Error) with appropriate descriptions like "Bad Request", "Not Found", "Conflict", and "Internal Server Error" so generated clients can handle failures consistently; update the three duplicate response blocks (the ones with the 201/domain.NodeGroup reference) to include these four additional responses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/autoscaler-server/main.go`:
- Line 33: The gRPC listener is currently bound to all interfaces via
net.Listen("tcp", fmt.Sprintf(":%d", *port)) which exposes the server
in-cluster; change the bind address to loopback by default by using
fmt.Sprintf("127.0.0.1:%d", *port) (or build an address string like addr :=
fmt.Sprintf("127.0.0.1:%d", *port) and pass addr to net.Listen) so lis, err :=
net.Listen("tcp", addr) binds to loopback, keeping the existing *port flag
unchanged so callers can still override if needed.
In `@Dockerfile.autoscaler-server`:
- Line 2: The Dockerfile currently pins the builder stage to an older Go image
("FROM golang:1.22-alpine") while go.mod requires Go 1.24.0; update the builder
image reference to the matching Go 1.24 variant (e.g., replace "FROM
golang:1.22-alpine" with the appropriate "golang:1.24-alpine" tag) so the
container toolchain matches go.mod and avoids runtime/toolchain downloads in CI.
In `@downs.txt`:
- Around line 48-50: downs.txt is missing rollback entries for two migrations
present in ups.txt; open downs.txt and insert the missing lines
"internal/repositories/postgres/migrations/065_grant_rbac_permissions" and
"internal/repositories/postgres/migrations/072_migrate_to_tenants" at the
appropriate positions so the sequence reads ...070_create_tenants,
065_grant_rbac_permissions, 072_migrate_to_tenants, 073_create_dns_zones
(matching the order in ups.txt), ensuring no duplicates and preserving file
ordering used by the migration tooling.
In `@internal/autoscaler/server.go`:
- Around line 111-134: In NodeGroupIncreaseSize, validate that req.Delta is a
positive integer before proceeding: check req.Delta > 0 at the start of the
handler (e.g., immediately after logging) and return a descriptive error if it
is zero or negative so scale-down requests cannot be routed through this
increase path; update error message to indicate invalid delta and reference
req.Delta and the NodeGroupIncreaseSize function in the change.
- Around line 149-177: The desired-size reconciliation decrements
targetGroup.CurrentSize by len(req.Nodes) even for nodes skipped due to
non-matching providerID and silently ignores errors from
UpdateNodeGroupWithContext; fix by counting only successfully terminated nodes
(increment a counter when s.client.TerminateInstance(providerID) returns nil)
and compute newSize := targetGroup.CurrentSize - successfulTerminations, and do
not swallow the UpdateNodeGroupWithContext error—check its returned error and
return it (or log and return) instead of assigning to _; key symbols: the loop
over req.Nodes, providerID, s.client.TerminateInstance, targetGroup, newSize,
and s.client.UpdateNodeGroupWithContext.
- Around line 270-283: Replace the generic fmt.Errorf("unimplemented") returns
in the stubbed RPC handlers NodeGroupTemplateNodeInfo, NodeGroupGetOptions,
PricingNodePrice, and PricingPodPrice with gRPC Unimplemented errors by
returning status.Error(codes.Unimplemented, "<message>") so clients receive the
proper gRPC status; also add the imports "google.golang.org/grpc/codes" and
"google.golang.org/grpc/status" and ensure the error message clearly names the
unimplemented RPC (e.g., "NodeGroupTemplateNodeInfo method not implemented").
- Line 154: s.client.TerminateInstance(providerID) is called without passing the
request context so gRPC cancellations/deadlines aren't propagated; update the
call to use a context-aware API (either add and call
TerminateInstanceWithContext(ctx, providerID) on sdk.Client following the
GetInstanceWithContext(ctx, ...) pattern in pkg/sdk/compute.go, or call an
existing context-taking variant) and ensure the server method uses that new
TerminateInstanceWithContext symbol instead of TerminateInstance so the ctx
parameter is propagated to the blocking termination call.
In `@internal/core/services/cluster.go`:
- Around line 130-137: The Create + AddNodeGroup sequence is not atomic: if
s.repo.Create(ctx, cluster) succeeds but s.repo.AddNodeGroup(ctx,
&cluster.NodeGroups[0]) fails you leave a persisted cluster without its default
node group; update the code to run both operations inside a single repository
transaction (e.g. use the repo's Begin/RunInTx/WithTx/Commit/Rollback API) so
that you call Create and AddNodeGroup using the transactional repo handle and
rollback on any error, returning the wrapped error as before; reference the
s.repo.Create, s.repo.AddNodeGroup and cluster.NodeGroups[0] symbols when making
the change.
- Around line 417-427: Validate and enforce node-group sizing invariants in
AddNodeGroup and UpdateNodeGroup: before constructing or persisting
domain.NodeGroup (the block that sets MinSize, MaxSize,
CurrentSize/DesiredSize), check that MinSize, MaxSize and Desired/Current are
non-negative, that MinSize <= MaxSize, and that Desired/Current is between
MinSize and MaxSize; if any check fails return a validation error and do not
persist. Apply the same checks to the other node-group creation/update block
referenced (the later block around the UpdateNodeGroup flow) so both creation
and updates enforce the same invariants.
- Around line 433-437: The repo.Update call results are being discarded (e.g.,
`_ = s.repo.Update(ctx, cluster)` after mutating cluster.WorkerCount and
similarly at the other location); replace these with proper error handling: call
`s.repo.Update(ctx, cluster)`, check the returned error, revert the in-memory
`cluster.WorkerCount` change if the update fails, and propagate or return the
error to the caller (do the same for the second occurrence around lines
472-476). Ensure you reference `cluster.WorkerCount`, `params.DesiredSize`, and
`s.repo.Update` when making the change so persistence failures do not leave
cluster and node-group state inconsistent.
- Around line 490-503: The current delete path always proceeds even if no node
group with the given name exists; update the logic in the delete handler (the
loop over cluster.NodeGroups and subsequent calls to s.repo.DeleteNodeGroup and
s.repo.Update) to first detect whether any ng with ng.Name == name exists, and
if not return a NotFound error instead of continuing; only subtract
ng.CurrentSize from cluster.WorkerCount and call s.repo.DeleteNodeGroup /
s.repo.Update when the node group was found. Use the same NotFound error
type/constructor your repo/service stack uses so callers get a proper not-found
response.
In `@internal/repositories/k8s/provisioner.go`:
- Line 218: The worker provision config uses ImageName: "ubuntu-22.04" which is
inconsistent with the control-plane image "ubuntu:22.04" and will likely fail
image resolution; update the worker provision setting (the ImageName field in
the provisioner code) to use the same canonical image string "ubuntu:22.04" to
match control-plane provisioning so both use the identical image reference.
- Around line 226-239: The code currently swallows worker launch errors and
ignores persistence errors; update the error handling in the worker creation
block so that when the create call returns err you record it and do not silently
continue (use p.logger.Error with the actual error and return or aggregate the
error), and when calling p.repo.AddNode(ctx, node) check its returned error
instead of assigning to _ — log and return or propagate that error (referencing
workerName, p.logger.Error, node, p.repo.AddNode, and domain.ClusterNode) so
provisioning failures and persistence failures surface instead of being ignored.
In `@internal/repositories/k8s/templates/control_plane.yaml`:
- Around line 344-346: Replace the hardcoded CLOUD_API_URL value in
control_plane.yaml with a templated/parameterized reference so different DNS
layouts can supply the endpoint at deploy time; update the environment variable
entry for CLOUD_API_URL (the env name CLOUD_API_URL in control_plane.yaml) to
read from a ConfigMap/Secret or a Helm/template value (e.g., use a template
placeholder or valueFrom reference) and ensure any deployment code that renders
this template provides the correct cluster API URL (or falls back to a sensible
default) so bootstrap no longer relies on
"https://thecloud-api.kube-system.svc.cluster.local:443".
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 116-120: The loop that populates c.NodeGroups is suppressing
errors from r.GetNodeGroups (call to GetNodeGroups) which can return partial
cluster data; instead check the returned err and propagate it (or wrap with
context) rather than ignoring — when r.GetNodeGroups(ctx, c.ID) returns a
non-nil error return that error from the enclosing function (or collect and wrap
it with details like cluster ID) so NodeGroups is never silently missing; update
the code around GetNodeGroups and c.NodeGroups to perform an explicit err != nil
check and return the error.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 16-29: The index creation and backfill aren't idempotent: change
the index DDL to use "CREATE INDEX IF NOT EXISTS
idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id)" and make
the backfill INSERT into cluster_node_groups idempotent by only inserting a
'default-pool' for clusters that don't already have one (e.g., use INSERT ...
SELECT ... FROM clusters WHERE NOT EXISTS (SELECT 1 FROM cluster_node_groups cng
WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool')), keeping the
same columns (id via gen_random_uuid(), cluster_id, name, instance_type,
min_size, max_size, current_size) and the existing size logic (CASE WHEN
worker_count > 10 THEN worker_count ELSE 10 END).
In `@k8s/autoscaler/deployment.yaml`:
- Around line 45-63: Add resource requests and limits for the container named
"thecloud-bridge" in the Deployment spec: under the container entry for
thecloud-bridge add a resources block setting resources.requests (cpu and
memory) and resources.limits (cpu and memory) to sensible values for your
workload (e.g., minimal and maximum values to prevent eviction and overcommit).
Ensure the resources block is added alongside existing env and ports for the
container so Kubernetes uses them for scheduling and OOM eviction decisions.
- Around line 21-63: Add a pod/container securityContext to harden both
containers: for the cluster-autoscaler and thecloud-bridge containers, set
securityContext with runAsNonRoot: true and a non-zero runAsUser, set
allowPrivilegeEscalation: false (and optionally readOnlyRootFilesystem: true and
drop all capabilities) so neither container runs as root or can escalate
privileges; apply these securityContext blocks under each container spec (or at
pod level if appropriate) to enforce the change for the cluster-autoscaler and
thecloud-bridge containers.
In `@k8s/autoscaler/rbac.yaml`:
- Around line 78-80: The current RBAC rule grants "get/update/patch/delete" on
ConfigMaps including "cluster-id"; remove "cluster-id" from that broader rule's
resourceNames and add a separate rule that grants only read verbs for the
specific ConfigMap name "cluster-id" (e.g., verbs: ["get"] or ["get","list"] as
needed). Update the rules so the existing rule keeps update/patch/delete for
other ConfigMaps while a new rule targets resourceNames: ["cluster-id"] with
read-only verbs to enforce least privilege.
In `@pkg/sdk/kubernetes.go`:
- Around line 201-213: The node group name is inserted raw into request paths in
UpdateNodeGroupWithContext and DeleteNodeGroup, which can break URLs when name
contains reserved characters; fix by path-escaping the name (e.g., use
url.PathEscape or equivalent) before concatenation in both c.putWithContext and
c.delete calls so the path becomes clustersPath + "/" + clusterID +
"/nodegroups/" + PathEscape(name).
---
Minor comments:
In `@docs/kubernetes.md`:
- Around line 119-121: The section header "Limitations (MVP)" currently lists
supported features ("Storage: Dynamic storage provisioning (CSI)" and "High
Availability: ... via the `--ha` flag") instead of constraints; either rename
the header to "Features (MVP)" or keep "Limitations (MVP)" and replace those
bullets with real limitations (for example: lack of automated backups, no
cluster autoscaler, limited OS image support, manual certificate rotation, or
single-region control plane) so the text matches the header; update the header
text and/or replace the two bullets ("Storage" and "High Availability"
referencing `--ha`) accordingly.
In `@docs/services/kubernetes-autoscaler.md`:
- Line 38: The doc currently lists the scale-up endpoint as `POST
/clusters/{id}/nodegroups/{name}`, which conflicts with the create example;
update the wording to describe the actual scale/update endpoint (replace `POST
/clusters/{id}/nodegroups/{name}`) with the correct API verb and path for
resizing a nodegroup (for example `PATCH /clusters/{id}/nodegroups/{name}` or
the team-approved scale path such as `POST
/clusters/{id}/nodegroups/{name}/scale`), and ensure the text clearly states
this is a scale/update operation (not create) so it no longer conflicts with the
create example that uses `POST /clusters/{id}/nodegroups`.
In `@docs/swagger/docs.go`:
- Around line 9685-9698: The UpdateNodeGroupRequest schema currently allows an
empty object; modify its schema to require at least one updatable field by
adding a constraint (e.g., "minProperties": 1) to the
"httphandlers.UpdateNodeGroupRequest" schema in docs/swagger/docs.go so that
requests with no fields are rejected; reference the schema name
"httphandlers.UpdateNodeGroupRequest" and ensure the change is applied to the
JSON object describing that schema.
- Around line 9550-9572: The schema for httphandlers.NodeGroupRequest currently
allows negative/invalid sizes for desired_size, max_size, and min_size; update
the struct or swagger annotations for the NodeGroupRequest fields desired_size,
max_size, and min_size to include numeric bounds (e.g., add "minimum: 0" to each
and optionally a sensible "maximum" if desired) so the generated docs emit valid
range constraints (minimum: 0) for those integer properties.
In `@internal/handlers/cluster_handler.go`:
- Around line 380-460: The Swagger comments for AddNodeGroup, UpdateNodeGroup,
and DeleteNodeGroup are missing `@Security`; update the doc comments above each
handler (AddNodeGroup, UpdateNodeGroup, DeleteNodeGroup) to include the same
`@Security` directive used by other K8s handlers in the file (e.g., `@Security`
ApiKeyAuth or `@Security` BearerAuth — use the exact scheme name already present
elsewhere), placing it alongside the existing `@Summary`, `@Tags`, and `@Router` lines
so the generated API docs correctly show authentication requirements.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 244-252: After iterating rows into the groups slice you must check
cursor-level errors by calling rows.Err(); update the code after the for
rows.Next() loop (where rows, groups and domain.NodeGroup are used and rows.Scan
is invoked) to call if err := rows.Err(); err != nil { return nil,
errors.Wrap(errors.Internal, "row iteration error", err) } before returning
groups, nil so any underlying read errors are surfaced and wrapped consistently.
---
Nitpick comments:
In `@cmd/autoscaler-server/main.go`:
- Line 18: Replace the inline magic number 50051 used in flag.Int call by
introducing a named constant (e.g., defaultGRPCPort) at the top of the file and
use that constant in the call to flag.Int("port", defaultGRPCPort, ...); update
any related comments or usages referencing the hardcoded port to use the new
constant and ensure the constant is of type int to match flag.Int's default
value.
In `@Dockerfile.autoscaler-server`:
- Around line 20-28: The container currently runs as root; modify
Dockerfile.autoscaler-server to create a non-root user and group (e.g.,
"autoscaler"), chown the runtime binary
(/usr/local/bin/thecloud-autoscaler-server) and any needed app dirs to that
user, set a safe HOME if needed, and add a USER autoscaler (or UID/GID)
directive before the ENTRYPOINT so ENTRYPOINT
["/usr/local/bin/thecloud-autoscaler-server"] runs as the non-root user; ensure
you use Alpine-compatible commands (addgroup/adduser or adduser -S/addgroup -S)
and preserve executable permissions on the binary.
In `@docs/adr/ADR-020-kubernetes-cluster-autoscaler.md`:
- Line 29: Update naming so the ADR and implementation use the same canonical
bridge name: pick one of the existing identifiers (either "thecloud-bridge" or
"autoscaler-server"/"thecloud-autoscaler-server") and replace the others to
match; update ADR-020-kubernetes-cluster-autoscaler.md (the line containing
"thecloud-bridge") to the chosen canonical name and also update any
build/runtime artifacts that reference "autoscaler-server" or
"thecloud-autoscaler-server" so all references (ADR text, service names,
manifests, build scripts) consistently use the single selected identifier.
In `@docs/swagger/docs.go`:
- Around line 6937-6968: The schema exposes both node_groups (items -> "$ref":
"#/definitions/domain.NodeGroup") and worker_count which creates ambiguity;
update the schema for the cluster definition to mark worker_count as deprecated
and guide clients to use node_groups instead: add a deprecation indicator and a
clear description on the worker_count property (e.g., "DEPRECATED: use
node_groups for node pool counts; will be removed in next API version"), or
remove worker_count in the next major API version, ensuring any generated
docs/clients reflect the change.
- Around line 7874-7905: The OpenAPI schema for domain.NodeGroup currently has
all properties optional; add a "required" array to the domain.NodeGroup
definition listing the stable fields to make them mandatory (id, cluster_id,
name, min_size, max_size, current_size). Update the domain.NodeGroup object in
docs/swagger/docs.go to include the required property alongside the existing
"type" and "properties" keys so code generation tools will treat those fields as
required.
In `@docs/swagger/swagger.json`:
- Around line 6929-6935: The Cluster schema now exposes two scaling
representations—node_groups and worker_count—so update the domain.Cluster
definition to mark one as canonical and the other as deprecated: designate
"node_groups" as the canonical scaling field (or vice versa as your choice) and
add clear description text for both properties indicating the canonical field,
marking the other with "deprecated": true and a deprecation note pointing to the
canonical field; ensure you update the descriptions for "node_groups" and
"worker_count" in the swagger.json so clients know which to use and that the
deprecated field is retained only for backward compatibility.
- Around line 7866-7897: The NodeGroup OpenAPI schema allows negative integers
for size fields; update the domain.NodeGroup schema and the same schemas at the
other occurrences to add a non-negative constraint by adding "minimum": 0 to the
integer properties min_size, max_size, current_size and desired_size (where
present), keeping their type as "integer" (you can also add "format": "int32" if
desired) so invalid negative values are rejected at contract level.
- Around line 1218-1225: The OpenAPI responses for the node group create
endpoints only include a 201 success schema (referencing
definitions/domain.NodeGroup); add standard error response entries for 400, 404,
409, and 500 to each of these operation response objects (the same blocks that
currently contain the "201" response) and point each to your common error schema
(e.g., definitions/Error or definitions/domain.Error) with appropriate
descriptions like "Bad Request", "Not Found", "Conflict", and "Internal Server
Error" so generated clients can handle failures consistently; update the three
duplicate response blocks (the ones with the 201/domain.NodeGroup reference) to
include these four additional responses.
In `@internal/core/ports/compute_options.go`:
- Around line 13-14: The struct fields Metadata and Labels are optional but
missing JSON tags; update their definitions in the struct (fields named Metadata
and Labels in this file) to include json tags with omitempty (e.g.,
`json:"metadata,omitempty"` and `json:"labels,omitempty"`) so they are omitted
from serialized output when nil/empty.
In `@internal/repositories/postgres/cluster_repo_unit_test.go`:
- Around line 62-120: The three tests (GetByID, ListAll, ListByUserID) repeat
setup and use pgxmock; convert them to a single table-driven test loop that
iterates cases (name, setupMock, callFn, expectedLen) and replace pgxmock usage
with testify/mock-based mocks; for each case call NewClusterRepository with the
testify mock (or mock the lower-level DB interface used by
NewClusterRepository), configure expectations in setupMock (return
rows/NodeGroups for that case), then invoke repo.GetByID / repo.ListAll /
repo.ListByUserID accordingly and assert results (len of clusters and
cluster.NodeGroups). Ensure you reference the existing methods
NewClusterRepository, GetByID, ListAll, ListByUserID and the NodeGroups field
when writing expectations and assertions.
In `@internal/repositories/postgres/migrations/020_create_clusters.down.sql`:
- Around line 1-2: The second DROP uses broad CASCADE which can remove unrelated
objects; change the rollback to avoid CASCADE on the clusters drop—replace "DROP
TABLE IF EXISTS clusters CASCADE;" with "DROP TABLE IF EXISTS clusters;" (or
"DROP TABLE IF EXISTS clusters RESTRICT;") and, if necessary, explicitly drop
any known dependent objects (e.g., foreign keys or indexes that reference
clusters) instead of relying on CASCADE; refer to the DROP TABLE lines and the
table names cluster_nodes and clusters to locate the statements.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 7-13: Add DB-level CHECK constraints to the cluster node groups
table to enforce size invariants: add a check ensuring min_size <= max_size and
another ensuring current_size >= min_size AND current_size <= max_size (you can
also include min_size >= 1 if you want to match the DEFAULT). Add these
constraints to the CREATE TABLE for the table that defines min_size, max_size,
current_size (e.g., add CHECK (min_size <= max_size) and CHECK (current_size
BETWEEN min_size AND max_size) with descriptive constraint names like
chk_cluster_node_groups_min_le_max and
chk_cluster_node_groups_current_in_range).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
go.sum is excluded by !**/*.sum
internal/autoscaler/protos/externalgrpc.pb.go is excluded by !**/*.pb.go
internal/autoscaler/protos/externalgrpc_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (32)
Dockerfile.autoscaler-server, Makefile, cmd/autoscaler-server/main.go, docs/adr/ADR-020-kubernetes-cluster-autoscaler.md, docs/database.md, docs/kubernetes.md, docs/services/kubernetes-autoscaler.md, docs/swagger/docs.go, docs/swagger/swagger.json, docs/swagger/swagger.yaml, downs.txt, go.mod, internal/api/setup/router.go, internal/autoscaler/server.go, internal/core/domain/cluster.go, internal/core/ports/cluster.go, internal/core/ports/compute_options.go, internal/core/services/cluster.go, internal/handlers/cluster_handler.go, internal/repositories/k8s/provisioner.go, internal/repositories/k8s/templates/control_plane.yaml, internal/repositories/postgres/cluster_repo.go, internal/repositories/postgres/cluster_repo_unit_test.go, internal/repositories/postgres/migrations/020_create_clusters.down.sql, internal/repositories/postgres/migrations/098_create_cluster_node_groups.down.sql, internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql, k8s/autoscaler/cluster-id-configmap.yaml, k8s/autoscaler/configmap.yaml, k8s/autoscaler/deployment.yaml, k8s/autoscaler/rbac.yaml, pkg/sdk/kubernetes.go, ups.txt
Pull request overview
Copilot reviewed 44 out of 46 changed files in this pull request and generated 11 comments.
func (s *ClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	ng := &domain.NodeGroup{
		ID:           uuid.New(),
		ClusterID:    cluster.ID,
		Name:         params.Name,
		InstanceType: params.InstanceType,
		MinSize:      params.MinSize,
		MaxSize:      params.MaxSize,
		CurrentSize:  params.DesiredSize,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}
Node group create/update paths don’t validate scaling invariants (e.g., min_size <= desired_size <= max_size, non-negative sizes, and non-empty instance_type). As-is, invalid inputs can be persisted (or cause DB constraint errors) and can make autoscaler decisions incorrect. Add explicit validation in the service layer before writing to the repo.
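A minimal sketch of such a service-layer guard, run before any repository write. The PR later adds a three-argument validateNodeGroupSizing; this sketch also folds in the instance_type check the comment mentions, so the name and signature here are illustrative:

```go
package services

import "fmt"

// validateNodeGroupSizing enforces the invariants called out above; error
// wording and the exact argument list are assumptions.
func validateNodeGroupSizing(minSize, maxSize, desiredSize int, instanceType string) error {
	if instanceType == "" {
		return fmt.Errorf("instance_type must not be empty")
	}
	if minSize < 0 || maxSize < 0 || desiredSize < 0 {
		return fmt.Errorf("sizes must be non-negative")
	}
	if minSize > maxSize {
		return fmt.Errorf("min_size (%d) must not exceed max_size (%d)", minSize, maxSize)
	}
	if desiredSize < minSize || desiredSize > maxSize {
		return fmt.Errorf("desired_size (%d) must be within [%d, %d]", desiredSize, minSize, maxSize)
	}
	return nil
}
```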
func (s *AutoscalerServer) NodeGroupIncreaseSize(ctx context.Context, req *protos.NodeGroupIncreaseSizeRequest) (*protos.NodeGroupIncreaseSizeResponse, error) {
	klog.Infof("Request to increase size of node group %s by %d", req.Id, req.Delta)

	cluster, err := s.client.GetClusterWithContext(ctx, s.clusterID)
	if err != nil {
		return nil, err
NodeGroupIncreaseSize doesn’t validate that req.Delta is positive. If a negative/zero delta is ever sent (misbehaving client or future CA change), newSize could shrink or no-op via the “increase” RPC, which violates the External Provider contract. Add a guard (e.g., require Delta > 0) and return an InvalidArgument-style error when it isn’t.
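A minimal sketch of that guard, as a fragment for the top of the handler. It assumes the bridge imports google.golang.org/grpc/status and codes, which the status-code fix suggested further below would introduce anyway:

```go
// Reject non-positive deltas before calling the Cloud API; Delta is an int32
// in the Cluster Autoscaler external gRPC provider proto.
if req.Delta <= 0 {
	return nil, status.Errorf(codes.InvalidArgument, "delta must be positive, got %d", req.Delta)
}
```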
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
name VARCHAR(255) NOT NULL,
instance_type VARCHAR(50) NOT NULL,
min_size INTEGER NOT NULL DEFAULT 1,
max_size INTEGER NOT NULL DEFAULT 10,
current_size INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
The documented cluster_node_groups schema doesn’t match the actual migration: this doc shows id UUID PRIMARY KEY DEFAULT gen_random_uuid() and created_at/updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), but the migration defines id UUID PRIMARY KEY (no default) and timestamps without NOT NULL using CURRENT_TIMESTAMP. Please align the documentation with the real schema (or update the migration to match the documented defaults/constraints).
Suggested change:
-id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
-name VARCHAR(255) NOT NULL,
-instance_type VARCHAR(50) NOT NULL,
-min_size INTEGER NOT NULL DEFAULT 1,
-max_size INTEGER NOT NULL DEFAULT 10,
-current_size INTEGER NOT NULL DEFAULT 1,
-created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+id UUID PRIMARY KEY,
+cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
+name VARCHAR(255) NOT NULL,
+instance_type VARCHAR(50) NOT NULL,
+min_size INTEGER NOT NULL DEFAULT 1,
+max_size INTEGER NOT NULL DEFAULT 10,
+current_size INTEGER NOT NULL DEFAULT 1,
+created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
	return nil, fmt.Errorf("node group %s not found", req.Id)
}
The gRPC handlers return plain fmt.Errorf(...) for “not found”, “invalid delta”, and “unimplemented” cases. That will surface to clients as gRPC Unknown errors, while Cluster Autoscaler expects specific status codes (e.g., codes.NotFound, codes.InvalidArgument, codes.Unimplemented) for correct behavior and retries. Prefer returning status.Error/ status.Errorf with appropriate codes.* in these paths.
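A hedged sketch of status-code-aware errors for these paths; the helper names are illustrative, and the handlers could just as well call status.Errorf inline:

```go
package autoscaler

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNodeGroupNotFound maps a missing node group to codes.NotFound so the
// upstream Cluster Autoscaler can distinguish it from transient failures.
func errNodeGroupNotFound(id string) error {
	return status.Errorf(codes.NotFound, "node group %s not found", id)
}

// errUnimplemented marks optional External Provider RPCs the bridge does not
// support, instead of returning a plain error (which surfaces as Unknown).
func errUnimplemented(rpc string) error {
	return status.Errorf(codes.Unimplemented, "%s is not implemented", rpc)
}
```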
1. A pod is in `Pending` status due to CPU/Memory constraints.
2. Upstream CA identifies that a Node Group can satisfy the requirement.
3. Upstream CA calls `NodeGroupIncreaseSize` via gRPC.
4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
The “Scaling Up Flow” step 4 documents POST /clusters/{id}/nodegroups/{name}, but the API routes implemented in this PR are POST /clusters/{id}/nodegroups (create) and PUT /clusters/{id}/nodegroups/{name} (update desired size). Please update the doc to reflect the actual endpoint used for scale-up (likely the PUT update).
Suggested change:
-4. `thecloud-bridge` calls `POST /clusters/{id}/nodegroups/{name}` on The Cloud API.
+4. `thecloud-bridge` calls `PUT /clusters/{id}/nodegroups/{name}` on The Cloud API to update the node group's desired size.
var groups []domain.NodeGroup
for rows.Next() {
	var ng domain.NodeGroup
	if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
		return nil, errors.Wrap(errors.Internal, "failed to scan node group", err)
	}
	groups = append(groups, ng)
}
return groups, nil
GetNodeGroups doesn’t check rows.Err() after iterating. If an error occurs during row iteration, it may be dropped and returned as a partial result without error. Add a if err := rows.Err(); err != nil { ... } check before returning.
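A minimal sketch of the missing check, placed between the rows.Next() loop and the final return; errors.Wrap here is the project helper already used in the snippet above, and the message text is illustrative:

```go
if err := rows.Err(); err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to iterate node groups", err)
}
return groups, nil
```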
workerInst, err := p.instSvc.LaunchInstanceWithOptions(ctx, ports.CreateInstanceOptions{
	Name:      workerName,
	ImageName: "ubuntu-22.04",
	NetworkID: cluster.VpcID.String(),
	UserData:  userData,
	Metadata: map[string]string{
		"thecloud.io/cluster-id": cluster.ID.String(),
		"thecloud.io/node-group": ng.Name,
	},
})
Worker provisioning now sets instance metadata (thecloud.io/cluster-id, thecloud.io/node-group) via CreateInstanceOptions.Metadata, but InstanceService.LaunchInstanceWithOptions currently does not persist opts.Metadata/opts.Labels into the instance record or provisioning job. As a result, instances launched here won’t carry the metadata needed by the autoscaler bridge (which filters on instance metadata). Either extend LaunchInstanceWithOptions to propagate metadata/labels, or set metadata via a separate API call after launch.
// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
CreateCluster persists the cluster row and then inserts the default node group in a separate call. If AddNodeGroup fails, the API returns an error but leaves a cluster record without its required default node group, which can break subsequent provisioning/autoscaler flows. Consider wrapping both inserts in a single DB transaction (or rolling back the cluster record on failure).
Suggested change:
-// Persist Node Group
-if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+// Persist Node Group. If this fails, attempt to roll back the cluster record
+// to avoid leaving a cluster without its required default node group.
+if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+	if rollbackErr := s.repo.Delete(ctx, cluster.ID); rollbackErr != nil {
+		s.logger.Error("failed to rollback cluster after node group creation failure", "cluster_id", cluster.ID, "error", rollbackErr)
+	}
if providerID == node.ProviderID {
	continue
}
if err := s.client.TerminateInstance(providerID); err != nil {
NodeGroupDeleteNodes terminates instances using TerminateInstance, which does not accept a context (it uses the SDK’s non-context delete). This prevents request cancellation/timeouts from being honored during scale-down operations. Consider adding/using a context-aware SDK method for instance termination and using it here so CA requests can be canceled cleanly.
Suggested change:
-if err := s.client.TerminateInstance(providerID); err != nil {
+if err := s.client.TerminateInstanceWithContext(ctx, providerID); err != nil {
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
	var resp Response[any]
	return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
The new NodeGroup SDK methods don’t follow the established SDK pattern of providing both context-aware and non-context variants. AddNodeGroup/DeleteNodeGroup have no WithContext equivalents, while UpdateNodeGroupWithContext has no non-context wrapper. This makes it hard for callers (like controllers) to enforce timeouts/cancellation consistently. Consider adding AddNodeGroupWithContext/DeleteNodeGroupWithContext and a non-context UpdateNodeGroup wrapper that delegates to the context variant (using context.Background()).
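A sketch of the consistent method set the comment asks for; postWithContext and deleteWithContext are assumed to exist alongside the putWithContext helper used above, and may be named differently in the actual client:

```go
// AddNodeGroupWithContext mirrors AddNodeGroup with cancellation support.
func (c *Client) AddNodeGroupWithContext(ctx context.Context, clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.postWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroup is the non-context convenience wrapper.
func (c *Client) UpdateNodeGroup(clusterID, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	return c.UpdateNodeGroupWithContext(context.Background(), clusterID, name, input)
}

// DeleteNodeGroupWithContext mirrors DeleteNodeGroup with cancellation support.
func (c *Client) DeleteNodeGroupWithContext(ctx context.Context, clusterID, name string) error {
	var resp Response[any]
	return c.deleteWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+name, &resp)
}
```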
- Hardened Autoscaler Deployment (limits, non-root, read-only FS)
- Consolidated redundant cluster migrations and fixed idempotency
- Tightened RBAC for cluster-id ConfigMap
- Improved SDK robustness with path-escaping for node group names
- Fixed documentation inconsistencies and updated Swagger constraints
- Added rows.Err() checks in ClusterRepository

- Bind gRPC listener to loopback by default
- Update builder image to Go 1.24
- Add pod and container security contexts (non-root, read-only FS)
- Standardize bridge container name to thecloud-bridge

- Add TerminateInstanceWithContext to SDK
- Propagate context to instance termination calls in autoscaler
- Validate increase delta and track successful terminations
- Use proper gRPC status codes for unimplemented methods

…tence errors
- Add validation for MinSize, MaxSize, and DesiredSize
- Revert in-memory state on repository update failures
- Ensure existence check in node group deletion path

- Use canonical ubuntu:22.04 image for workers
- Parameterize CLOUD_API_URL in control plane template
- Aggregate and report errors during worker provisioning

- Convert repository tests to table-driven format
- Check errors from GetNodeGroups in cluster list operations
- Add missing rows.Err() checks
Pull request overview
Copilot reviewed 51 out of 53 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
internal/core/services/cluster.go:288
- ScaleCluster only updates cluster.WorkerCount, but the k8s provisioner’s Scale() now provisions based on cluster.NodeGroups when present. Since NodeGroups aren’t updated here, scaling via /clusters/{id}/scale will no longer change the number of workers. Either keep the default node group in sync with WorkerCount (update current_size) or change scaling to operate on node groups directly.
// ScaleCluster adjusts the number of worker nodes in the cluster.
func (s *ClusterService) ScaleCluster(ctx context.Context, id uuid.UUID, workers int) error {
cluster, err := s.GetCluster(ctx, id)
if err != nil {
return err
}
if workers < 1 {
return errors.New(errors.InvalidInput, "at least 1 worker required")
}
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount = workers
cluster.UpdatedAt = time.Now()
if err := s.repo.Update(ctx, cluster); err != nil {
cluster.WorkerCount = oldWorkerCount
return errors.Wrap(errors.Internal, "failed to update cluster record during scaling", err)
}
go func() {
bgCtx := context.Background()
bgCtx = appcontext.WithUserID(bgCtx, cluster.UserID)
s.logger.Info("starting cluster scale", "cluster_id", cluster.ID, "new_workers", workers)
if err := s.provisioner.Scale(bgCtx, cluster); err != nil {
s.logger.Error("cluster scaling failed", "cluster_id", cluster.ID, "error", err)
return
}
s.logger.Info("cluster scaling completed", "cluster_id", cluster.ID)
}()
return nil
func (s *ClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	// Validate node group sizing invariants
	if err := s.validateNodeGroupSizing(params.MinSize, params.MaxSize, params.DesiredSize); err != nil {
		return nil, err
	}

	ng := &domain.NodeGroup{
		ID:           uuid.New(),
		ClusterID:    cluster.ID,
		Name:         params.Name,
		InstanceType: params.InstanceType,
		MinSize:      params.MinSize,
		MaxSize:      params.MaxSize,
		CurrentSize:  params.DesiredSize,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}

	if err := s.repo.AddNodeGroup(ctx, ng); err != nil {
		return nil, err
	}

	// If desired size > 0, we should trigger a scale operation for this specific group.
	// For now, we will update the global worker count to reflect the addition.
	oldWorkerCount := cluster.WorkerCount
	cluster.WorkerCount += params.DesiredSize
	if err := s.repo.Update(ctx, cluster); err != nil {
		cluster.WorkerCount = oldWorkerCount
		return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
	}

	return ng, nil
}

func (s *ClusterService) UpdateNodeGroup(ctx context.Context, clusterID uuid.UUID, name string, params ports.UpdateNodeGroupParams) (*domain.NodeGroup, error) {
	cluster, err := s.GetCluster(ctx, clusterID)
	if err != nil {
		return nil, err
	}

	var targetGroup *domain.NodeGroup
	for i := range cluster.NodeGroups {
		if cluster.NodeGroups[i].Name == name {
			targetGroup = &cluster.NodeGroups[i]
			break
		}
	}

	if targetGroup == nil {
		return nil, errors.New(errors.NotFound, "node group not found")
	}

	newMin := targetGroup.MinSize
	newMax := targetGroup.MaxSize
	newDesired := targetGroup.CurrentSize

	if params.MinSize != nil {
		newMin = *params.MinSize
	}
	if params.MaxSize != nil {
		newMax = *params.MaxSize
	}
	if params.DesiredSize != nil {
		newDesired = *params.DesiredSize
	}

	// Validate node group sizing invariants
	if err := s.validateNodeGroupSizing(newMin, newMax, newDesired); err != nil {
		return nil, err
	}

	oldMin, oldMax, oldDesired := targetGroup.MinSize, targetGroup.MaxSize, targetGroup.CurrentSize
	targetGroup.MinSize = newMin
	targetGroup.MaxSize = newMax

	oldWorkerCount := cluster.WorkerCount
	if params.DesiredSize != nil {
		diff := newDesired - oldDesired
		cluster.WorkerCount += diff
		targetGroup.CurrentSize = newDesired
	}

	if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
		return nil, err
	}

	if err := s.repo.Update(ctx, cluster); err != nil {
		// Rollback in-memory changes on persistence failure
		targetGroup.MinSize = oldMin
		targetGroup.MaxSize = oldMax
		targetGroup.CurrentSize = oldDesired
		cluster.WorkerCount = oldWorkerCount
		return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after updating node group", err)
	}

	return targetGroup, nil
}
AddNodeGroup/UpdateNodeGroup adjust desired/current sizes in the DB but don’t trigger any provisioning action (e.g., provisioner.Scale or a background job) to actually create/remove worker nodes. This means cluster-autoscaler scale-up requests will only update state without adding nodes. Consider invoking scaling/reconciliation when DesiredSize changes (at least for scale-up).
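A hedged sketch of wiring a scale-up trigger into UpdateNodeGroup, reusing the background pattern ScaleCluster already uses above; the exact placement and wiring are assumptions based on the snippets in this PR:

```go
// After persisting the new desired size, kick off reconciliation in the
// background so the Cluster Autoscaler's scale-up actually creates nodes.
if params.DesiredSize != nil && newDesired != oldDesired {
	go func() {
		bgCtx := appcontext.WithUserID(context.Background(), cluster.UserID)
		if err := s.provisioner.Scale(bgCtx, cluster); err != nil {
			s.logger.Error("node group scaling failed", "cluster_id", cluster.ID, "node_group", name, "error", err)
		}
	}()
}
```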
// Atomically create cluster and its default node group
if err := s.repo.Create(ctx, cluster); err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
}

// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
	// Ideally this should be in a transaction with repo.Create
The comment says this is an atomic create, but repo.Create is executed separately from creating the default node group. If AddNodeGroup fails, the DB can end up with a cluster row but no node groups. Consider wrapping both writes in a single DB transaction (or adding compensation logic).
Suggested change:
-// Atomically create cluster and its default node group
-if err := s.repo.Create(ctx, cluster); err != nil {
-	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
-}
-// Persist Node Group
-if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
-	// Ideally this should be in a transaction with repo.Create
+// Create cluster record; default node group is added in a separate step (not atomic)
+if err := s.repo.Create(ctx, cluster); err != nil {
+	return nil, errors.Wrap(errors.Internal, "failed to create cluster record", err)
+}
+// Persist Node Group (note: this is not wrapped in a DB transaction with repo.Create)
+if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
+	// TODO: ideally this should be done in a single DB transaction with repo.Create
InstanceType string `json:"instance_type"`
MinSize      int    `json:"min_size" binding:"gte=0"`
MaxSize      int    `json:"max_size" binding:"gte=0"`
DesiredSize  int    `json:"desired_size" binding:"gte=0"`
NodeGroupRequest leaves InstanceType/MinSize/MaxSize/DesiredSize optional, so omitted fields default to 0/"". That can violate DB constraints (instance_type is NOT NULL; current_size must be between min/max) and return a 500 instead of a 400. Consider adding binding validation (e.g., require instance_type, min_size, max_size, desired_size) and/or applying sane defaults before calling the service.
Suggested change:
-InstanceType string `json:"instance_type"`
-MinSize      int    `json:"min_size" binding:"gte=0"`
-MaxSize      int    `json:"max_size" binding:"gte=0"`
-DesiredSize  int    `json:"desired_size" binding:"gte=0"`
+InstanceType string `json:"instance_type" binding:"required"`
+MinSize      int    `json:"min_size" binding:"required,gte=0"`
+MaxSize      int    `json:"max_size" binding:"required,gte=0"`
+DesiredSize  int    `json:"desired_size" binding:"required,gte=0"`
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status", "cluster-id"]
  verbs: ["get", "update", "patch", "delete"]
---
In the rendered autoscaler RBAC, the Role grants get/update/patch/delete on both configmaps "cluster-autoscaler-status" and "cluster-id". The autoscaler typically only needs read access to the cluster-id configmap; granting delete/update is unnecessarily permissive. Consider splitting the rules (status configmap writable, cluster-id read-only) to match least privilege.
internal/repositories/postgres/migrations/034_create_accounting_table
internal/repositories/postgres/migrations/035_create_images_table
internal/repositories/postgres/migrations/036_add_volume_backend_path
internal/repositories/postgres/migrations/037_create_clusters_table
This list references internal/repositories/postgres/migrations/037_create_clusters_table, but that migration has been removed (no longer present under internal/repositories/postgres/migrations). If these *.txt files are intended to drive tooling, this stale entry will cause confusion/breakage; otherwise consider removing these unused lists entirely.
Suggested change (delete the stale entry):
-internal/repositories/postgres/migrations/037_create_clusters_table
var provisioningErrors []string
for _, ng := range cluster.NodeGroups {
	for i := 0; i < ng.CurrentSize; i++ {
		workerName := fmt.Sprintf("%s-%s-%d", cluster.Name, ng.Name, i)
		workerInst, err := p.instSvc.LaunchInstanceWithOptions(ctx, ports.CreateInstanceOptions{
			Name:      workerName,
			ImageName: "ubuntu:22.04", // Use canonical image name consistent with control-plane
			NetworkID: cluster.VpcID.String(),
			UserData:  userData,
			Metadata: map[string]string{
				"thecloud.io/cluster-id": cluster.ID.String(),
				"thecloud.io/node-group": ng.Name,
			},
		})
provisionWorkers() unconditionally loops from 0..CurrentSize-1 and launches instances, without checking how many worker nodes already exist for this cluster/node group. If this is used for scaling or retries, it can attempt to recreate existing nodes (name collisions) or over-provision. Consider reconciling against repo.GetNodes / actual instances and only creating the delta needed for each node group.
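A hedged sketch of reconciling against existing workers before launching; countWorkers and launchWorker are hypothetical helpers standing in for however the provisioner derives current state (e.g., repo.GetNodes or the instance metadata filter):

```go
// Launch only the delta between the desired size and what already exists.
for _, ng := range cluster.NodeGroups {
	have := countWorkers(ctx, cluster.ID, ng.Name) // hypothetical helper
	for i := have; i < ng.CurrentSize; i++ {
		workerName := fmt.Sprintf("%s-%s-%d", cluster.Name, ng.Name, i)
		if err := launchWorker(ctx, cluster, ng, workerName); err != nil { // hypothetical helper
			provisioningErrors = append(provisioningErrors, err.Error())
		}
	}
}
```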
func (s *AutoscalerServer) NodeGroupNodes(ctx context.Context, req *protos.NodeGroupNodesRequest) (*protos.NodeGroupNodesResponse, error) {
	instances, err := s.client.ListInstancesWithContext(ctx)
	if err != nil {
		return nil, err
	}

	var protosInstances []*protos.Instance
	for _, inst := range instances {
		if inst.Metadata["thecloud.io/cluster-id"] == s.clusterID &&
			inst.Metadata["thecloud.io/node-group"] == req.Id {

			state := protos.InstanceStatus_instanceRunning
			if inst.Status == "PROVISIONING" || inst.Status == "STARTING" {
				state = protos.InstanceStatus_instanceCreating
			} else if inst.Status == "TERMINATING" || inst.Status == "STOPPING" {
				state = protos.InstanceStatus_instanceDeleting
			}

			protosInstances = append(protosInstances, &protos.Instance{
				Id: "thecloud://" + inst.ID,
				Status: &protos.InstanceStatus{
					InstanceState: state,
				},
			})
		}
	}

	return &protos.NodeGroupNodesResponse{
		Instances: protosInstances,
	}, nil
NodeGroupNodes is returning a local protosInstancesResponse type, but the gRPC interface requires *protos.NodeGroupNodesResponse. As written, this won’t compile / won’t satisfy protos.CloudProviderServer. Return the generated response type (and remove the custom protosInstancesResponse helper).
// Persist Node Group
if err := s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]); err != nil {
	// Ideally this should be in a transaction with repo.Create
Creating the default node group happens after the cluster insert without any rollback if this step fails. Either create both in a transaction or delete the cluster record on failure to avoid leaving an unusable cluster (no node groups).
Suggested change:
-	// Ideally this should be in a transaction with repo.Create
+	// Attempt to roll back the previously created cluster to avoid leaving it without node groups.
+	if delErr := s.repo.Delete(ctx, cluster.ID); delErr != nil {
+		s.logger.Error("failed to rollback cluster after default node group creation failure", "cluster_id", cluster.ID, "error", delErr)
+	}
internal/repositories/postgres/migrations/035_create_images_table
internal/repositories/postgres/migrations/036_add_volume_backend_path
internal/repositories/postgres/migrations/037_create_clusters_table
internal/repositories/postgres/migrations/038_add_cluster_permissions
internal/repositories/postgres/migrations/053_add_cluster_network_isolation
This list references internal/repositories/postgres/migrations/037_create_clusters_table, but that migration has been removed (no longer present under internal/repositories/postgres/migrations). If these *.txt files are intended to drive tooling, this stale entry will cause confusion/breakage; otherwise consider removing these unused lists entirely.
Pull request overview
Copilot reviewed 51 out of 53 changed files in this pull request and generated 11 comments.
'default-pool',
'standard-1',
1,
CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
worker_count
The backfill INSERT can fail if any existing cluster has worker_count < 1 (e.g., 0): current_size will be set to worker_count while min_size is 1, violating the chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size). Consider clamping current_size to at least min_size (and adjusting max_size accordingly), or allowing min_size=0 for backfilled rows.
// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
}
AddNodeGroup updates two persisted resources (cluster_node_groups row and clusters.worker_count) without a transaction. If the cluster update fails after the node group insert succeeds, the system can end up in an inconsistent state. Consider using a transaction for both updates or stopping updates to the deprecated worker_count field and deriving it from node_groups instead.
Suggested change:
-// If desired size > 0, we should trigger a scale operation for this specific group.
-// For now, we will update the global worker count to reflect the addition.
-oldWorkerCount := cluster.WorkerCount
-cluster.WorkerCount += params.DesiredSize
-if err := s.repo.Update(ctx, cluster); err != nil {
-	cluster.WorkerCount = oldWorkerCount
-	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
-}
+// NOTE: We intentionally do not update cluster.WorkerCount here.
+// The worker_count field is deprecated and should be derived from node groups
+// rather than being maintained via separate updates that can become inconsistent.
if err := s.repo.UpdateNodeGroup(ctx, targetGroup); err != nil {
	return nil, err
}

if err := s.repo.Update(ctx, cluster); err != nil {
	// Rollback in-memory changes on persistence failure
	targetGroup.MinSize = oldMin
	targetGroup.MaxSize = oldMax
	targetGroup.CurrentSize = oldDesired
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after updating node group", err)
UpdateNodeGroup persists the node group update before updating the parent cluster’s worker_count. If the cluster update fails, the rollback only affects in-memory state and leaves the database with the node group updated but the cluster worker_count unchanged. This should be done transactionally (or avoid persisting worker_count at all if it’s deprecated).
clusterGroup.POST("/:id/repair", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.RepairCluster)
clusterGroup.POST("/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.ScaleCluster)
clusterGroup.PUT("/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionClusterUpdate), handlers.Cluster.ScaleCluster)
clusterGroup.GET("/:id/health", httputil.Permission(svcs.RBAC, domain.PermissionClusterRead), handlers.Cluster.GetClusterHealth)
The HTTP method for /clusters/:id/scale was changed to PUT, but the Go SDK still sends POST requests to /clusters/{id}/scale (pkg/sdk/kubernetes.go). This will break SDK users unless the SDK (and its tests) are updated in the same change set or the API supports both methods temporarily.
internal/repositories/postgres/migrations/001_create_instances_table
internal/repositories/postgres/migrations/002_create_api_keys_table
internal/repositories/postgres/migrations/003_add_container_id
internal/repositories/postgres/migrations/004_create_objects_table
internal/repositories/postgres/migrations/005_add_ports
This file appears to be a manual migration listing, but it isn’t referenced anywhere in the repository (no code/doc references found). Keeping it will likely lead to drift as migrations change. Consider removing it (or generating it as part of tooling if it’s needed).
// NodeGroupRequest is the payload for adding/updating a node group.
type NodeGroupRequest struct {
	Name         string `json:"name" binding:"required"`
	InstanceType string `json:"instance_type"`
	MinSize      int    `json:"min_size" binding:"gte=0"`
	MaxSize      int    `json:"max_size" binding:"gte=0"`
	DesiredSize  int    `json:"desired_size" binding:"gte=0"`
}
instance_type is optional in the request payload, but the database schema requires cluster_node_groups.instance_type to be NOT NULL. This can turn a client input error into an internal error. Consider making instance_type required (binding:"required") and validating it in the service layer as InvalidInput.
containerGroup.GET("/deployments", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.ListDeployments)
containerGroup.GET("/deployments/:id", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.GetDeployment)
containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
The HTTP method for /containers/deployments/:id/scale was changed to PUT, but the Go SDK still uses POST for this endpoint (pkg/sdk/container.go + tests). This will break existing SDK clients unless the SDK is updated or the API supports both methods for a deprecation window.
Suggested change (keep POST available during a deprecation window):
-containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
+containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
+containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment)
// AddNodeGroup adds a new node pool to the cluster.
func (c *Client) AddNodeGroup(clusterID string, input NodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.post(clustersPath+"/"+clusterID+"/nodegroups", input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// UpdateNodeGroupWithContext updates a node group's parameters with context support.
func (c *Client) UpdateNodeGroupWithContext(ctx context.Context, clusterID string, name string, input UpdateNodeGroupInput) (*NodeGroup, error) {
	var resp Response[*NodeGroup]
	if err := c.putWithContext(ctx, clustersPath+"/"+clusterID+"/nodegroups/"+url.PathEscape(name), input, &resp); err != nil {
		return nil, err
	}
	return resp.Data, nil
}

// DeleteNodeGroup removes a node group.
func (c *Client) DeleteNodeGroup(clusterID string, name string) error {
	var resp Response[any]
	return c.delete(clustersPath+"/"+clusterID+"/nodegroups/"+url.PathEscape(name), &resp)
}
SDK method set is inconsistent with the established context-support pattern (non-context method delegates to WithContext). AddNodeGroup and DeleteNodeGroup lack WithContext variants, while UpdateNodeGroup only exists as WithContext (no convenience wrapper). Consider adding AddNodeGroupWithContext/DeleteNodeGroupWithContext and a non-context UpdateNodeGroup wrapper for consistency.
	return nil, fmt.Errorf("node group %s not found", req.Id)
}
These handlers return plain fmt.Errorf for common conditions (e.g., node group not found, size out of range), which will be surfaced to the autoscaler as gRPC code=Unknown. Prefer returning status.Error/status.Errorf with appropriate codes (NotFound, FailedPrecondition/InvalidArgument) so upstream Cluster Autoscaler can react correctly.
externalgrpc.config: |
  service:
    address: "localhost:50051"
Repository convention prefers 127.0.0.1 over localhost for local connection addresses (e.g., internal/core/services/database.go). To align with that, consider using "127.0.0.1:50051" instead of "localhost:50051" in the externalgrpc config.
Actionable comments posted: 11
♻️ Duplicate comments (2)
internal/repositories/k8s/provisioner.go (1)
242-246: ⚠️ Potential issue | 🟠 Major
Handle AddNode failures with rollback and fail-state transition. At Line 242, if AddNode fails, the worker instance is already created but left running/orphaned. Also, at Line 250, returning a plain error skips failCluster, so status may remain inconsistent.
🛠️ Proposed fix
 if err := p.repo.AddNode(ctx, node); err != nil {
 	errMsg := fmt.Sprintf("failed to add worker node %s to repository: %v", workerName, err)
 	p.logger.Error(errMsg)
+	if termErr := p.instSvc.TerminateInstance(ctx, workerInst.ID.String()); termErr != nil {
+		p.logger.Error("failed to rollback worker instance after repository failure", "worker", workerName, "instance_id", workerInst.ID, "error", termErr)
+	}
 	provisioningErrors = append(provisioningErrors, errMsg)
+	continue
 }
 }
 }
 if len(provisioningErrors) > 0 {
-	return fmt.Errorf("worker provisioning encountered errors: %s", strings.Join(provisioningErrors, "; "))
+	return p.failCluster(
+		ctx,
+		cluster,
+		"worker provisioning encountered errors",
+		fmt.Errorf("%s", strings.Join(provisioningErrors, "; ")),
+	)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/k8s/provisioner.go` around lines 242 - 246, If p.repo.AddNode(ctx, node) fails, the newly created worker is orphaned and the function currently just logs and appends provisioningErrors; modify the error path in the AddNode block to perform a rollback by deleting/terminating the created cloud instance (use the provisioner cloud client method, e.g., p.cloud.DeleteInstance or p.cloud.TerminateInstance with the workerName), log the rollback result via p.logger, then call failCluster(ctx, clusterID) to move the cluster to a failed state and return the error (instead of only appending provisioningErrors). Ensure similar rollback + failCluster + return behavior is applied to the other failure path referenced around where provisioningErrors is used so status remains consistent.internal/core/services/cluster.go (1)
129-138: ⚠️ Potential issue | 🟠 Major
CreateCluster + default node-group persistence is still non-atomic.
Line 130 and Line 135 are separate writes. If the second fails, the call returns an error with a partially-created cluster persisted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/services/cluster.go` around lines 129 - 138, The CreateCluster flow is non-atomic because s.repo.Create(ctx, cluster) and s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0]) are separate writes; if AddNodeGroup fails the cluster remains persisted. Fix by performing both operations in a single transactional unit (use your repository's transaction mechanism or add one) — e.g., run Create and AddNodeGroup inside a repo transaction or implement a single repo method (CreateClusterWithNodeGroup/CreateWithDefaultNodeGroup) that performs both writes atomically; alternatively, if transactions aren't available implement a compensating rollback by calling repo.Delete(ctx, cluster.ID) when AddNodeGroup fails. Ensure you update CreateCluster to call the transactional method (or perform rollback) so partial state cannot be left behind.
🧹 Nitpick comments (9)
internal/core/services/database.go (1)
377-381: Replace exporter port literals with named constants for consistency.
"9187"and"9104"are still magic values in this method; extracting them keeps the same maintainability gain as engine ports.♻️ Proposed refactor
const ( postgresDefaultPort = "5432" mysqlDefaultPort = "3306" postgresExporterImg = "prometheuscommunity/postgres-exporter" mysqlExporterImg = "prom/mysqld-exporter" + postgresExporterPort = "9187" + mysqlExporterPort = "9104" ) @@ - return postgresExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, "9187" + return postgresExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, postgresExporterPort @@ - return mysqlExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, "9104" + return mysqlExporterImg, []string{"DATA_SOURCE_NAME=" + dsn}, mysqlExporterPortAs per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/core/services/database.go` around lines 377 - 381, The returns in the exporter selection code currently use magic string literals "9187" and "9104" for ports; replace these with named constants (e.g., PostgresExporterPort and MySQLExporterPort) defined alongside other service/engine constants and use those constants in the return statements where postgresExporterImg and mysqlExporterImg are returned; ensure the constants are typed as strings (or format them where used) and update any tests or usages that reference the literal port strings to reference the new constant names.internal/workers/database_failover_worker_test.go (1)
128-129: Replace -1 with a named constant for test intent. Using a named constant keeps the unhealthy-port intent explicit and avoids magic numbers.
♻️ Small cleanup
+const unhealthyPrimaryPort = -1 + func TestDatabaseFailoverWorker(t *testing.T) { @@ primary := &domain.Database{ ID: primaryID, Name: "primary", Role: domain.RolePrimary, Status: domain.DatabaseStatusRunning, - Port: -1, // Invalid port to ensure DialTimeout fails + Port: unhealthyPrimaryPort, // Invalid port to ensure DialTimeout fails }As per coding guidelines,
**/*.go: Do not use magic numbers - use named constants instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/database_failover_worker_test.go` around lines 128 - 129, Replace the magic number -1 used for the Port field with a named constant (e.g., invalidPort or unhealthyPort) to make the test intent explicit; add the constant near the top of the test file or in the test's scope, then update the Port: -1 occurrence to Port: invalidPort (or your chosen name) so the unhealthy-port intent is clear when examining the test that configures the DialTimeout failure.k8s/autoscaler/deployment.yaml (1)
77-78: Hardcoded CLOUD_API_URL limits deployment flexibility. Unlike the templated control_plane.yaml (which uses
{{ .CloudAPIURL }}), this standalone manifest hardcodes the API URL. This works for the default in-cluster DNS layout but will break if the API service is deployed differently or accessed externally.Consider using a ConfigMap or making this value configurable at deployment time for consistency with the templated approach.
♻️ Proposed change using ConfigMap
- name: CLOUD_API_URL - value: "https://thecloud-api.kube-system.svc.cluster.local:443" + valueFrom: + configMapKeyRef: + name: cluster-autoscaler-config + key: cloud-api-urlThen add to
k8s/autoscaler/configmap.yaml:data: cloud-api-url: "https://thecloud-api.kube-system.svc.cluster.local:443"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/autoscaler/deployment.yaml` around lines 77 - 78, The manifest hardcodes CLOUD_API_URL in deployment.yaml which reduces flexibility; change it to source the value from a ConfigMap (e.g., create/modify k8s/autoscaler/configmap.yaml to include a cloud-api-url key) and update the Deployment to set the environment variable CLOUD_API_URL from that ConfigMap (use envFrom or valueFrom with configMapKeyRef) so the same {{ .CloudAPIURL }}-style configurability is available for CLOUD_API_URL.internal/repositories/k8s/mocks_test.go (1)
102-111: Make NodeGroup mock methods assertable with m.Called. These methods currently hardcode success, which weakens test signal for call expectations and error-path coverage.
As per coding guidelines, "Use `testify/mock` for creating mock objects in tests".♻️ Proposed refactor
-func (m *mockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil } +func (m *mockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} func (m *mockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { - return []domain.NodeGroup{}, nil + args := m.Called(ctx, clusterID) + if args.Get(0) == nil { + return nil, args.Error(1) + } + r0, _ := args.Get(0).([]domain.NodeGroup) + return r0, args.Error(1) } func (m *mockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { - return nil + return m.Called(ctx, ng).Error(0) } func (m *mockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { - return nil + return m.Called(ctx, clusterID, name).Error(0) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/k8s/mocks_test.go` around lines 102 - 111, Replace the hardcoded-success implementations in mockClusterRepo's methods (AddNodeGroup, GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) with testify/mock-style handlers: call m.Called(...) with the same arguments, extract returned values and errors from the call result, and return them (casting to the appropriate types, e.g., []domain.NodeGroup for GetNodeGroups and error for others). Ensure you pass ctx and the relevant parameters into m.Called so tests can assert calls and control return values via On/Return.internal/workers/cluster_worker_test.go (1)
64-73: Mock methods are stubs withoutm.Called()delegation.These NodeGroup mock methods return hardcoded values without delegating to
m.Called(), unlike other methods in this mock (e.g.,GetByID,ListAll). This is acceptable if current tests don't need to verify these calls or configure return values, but consider addingm.Called()delegation for consistency and future test flexibility.♻️ Optional: Add m.Called() for consistency with other mock methods
-func (m *MockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil } -func (m *MockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { - return []domain.NodeGroup{}, nil -} -func (m *MockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { - return nil -} -func (m *MockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { - return nil -} +func (m *MockClusterRepo) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} +func (m *MockClusterRepo) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) { + args := m.Called(ctx, clusterID) + if args.Get(0) == nil { + return nil, args.Error(1) + } + r0, _ := args.Get(0).([]domain.NodeGroup) + return r0, args.Error(1) +} +func (m *MockClusterRepo) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { + return m.Called(ctx, ng).Error(0) +} +func (m *MockClusterRepo) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { + return m.Called(ctx, clusterID, name).Error(0) +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/cluster_worker_test.go` around lines 64 - 73, The NodeGroup mock methods (MockClusterRepo.AddNodeGroup, GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) are hardcoded stubs; update each to delegate to m.Called(...) for consistency and test configurability by invoking m.Called with the same parameters (ctx, ng or clusterID/name) and returning the typed results (error and []domain.NodeGroup for GetNodeGroups) extracted from the returned arguments (using .Error(0) or .Get(0).([]domain.NodeGroup) as appropriate).internal/handlers/cluster_handler.go (1)
446-471: Consider validating the node group name parameter. The
name path parameter is used directly without validation. While the service layer may handle invalid names, consider adding basic validation (e.g., non-empty check) at the handler level for fail-fast behavior and clearer error messages.
♻️ Optional: Add name validation
func (h *ClusterHandler) UpdateNodeGroup(c *gin.Context) { id, err := uuid.Parse(c.Param("id")) if err != nil { httputil.Error(c, errors.New(errors.InvalidInput, errInvalidClusterID)) return } name := c.Param("name") + if name == "" { + httputil.Error(c, errors.New(errors.InvalidInput, "invalid node group name")) + return + } var req UpdateNodeGroupRequest🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler.go` around lines 446 - 471, In UpdateNodeGroup validate the path param `name` (the variable `name := c.Param("name")`) before calling the service: check it's non-empty (and optionally allowed format/length), and if invalid return a clear bad-request via httputil.Error with an InvalidInput error (create or reuse an error constant such as errInvalidNodeGroupName) so the handler fails fast with a descriptive message instead of forwarding an empty/invalid name to svc.UpdateNodeGroup.internal/handlers/cluster_handler_test.go (1)
570-630: Refactor NodeGroup handler tests into table-driven cases. Line 580 onward uses ad-hoc subtests; please switch this block to a table-driven layout (and include invalid-id / invalid-body / service-error variants in the table) to keep endpoint coverage scalable.
♻️ Suggested table-driven structure
func TestClusterHandlerNodeGroups(t *testing.T) { t.Parallel() svc, handler, r := setupClusterHandlerTest() clusterID := uuid.New() nodeGroupName := "test-pool" r.POST("/clusters/:id/nodegroups", handler.AddNodeGroup) r.PUT("/clusters/:id/nodegroups/:name", handler.UpdateNodeGroup) r.DELETE("/clusters/:id/nodegroups/:name", handler.DeleteNodeGroup) - t.Run("AddNodeGroup_Success", func(t *testing.T) { ... }) - t.Run("UpdateNodeGroup_Success", func(t *testing.T) { ... }) - t.Run("DeleteNodeGroup_Success", func(t *testing.T) { ... }) + tests := []struct { + name string + method string + path string + body []byte + setupMock func() + wantCode int + }{ + // add success/error/validation cases here + } + + for _, tc := range tests { + tc := tc + t.Run(tc.name, func(t *testing.T) { + if tc.setupMock != nil { + tc.setupMock() + } + w := httptest.NewRecorder() + req := httptest.NewRequest(tc.method, tc.path, bytes.NewBuffer(tc.body)) + r.ServeHTTP(w, req) + assert.Equal(t, tc.wantCode, w.Code) + }) + } }As per coding guidelines,
**/*_test.go: Use table-driven tests in test files.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/cluster_handler_test.go` around lines 570 - 630, Replace the ad-hoc subtests inside TestClusterHandlerNodeGroups with a table-driven loop: create a slice of test cases (structs) that include name, method, path, requestBody, mockSetup func (to set expectations on svc for AddNodeGroup/UpdateNodeGroup/DeleteNodeGroup), expectedStatus, and any flags for invalid-id/invalid-body/service-error; then iterate over the table calling r.ServeHTTP and assert the response. Keep and reuse the existing handlers (handler.AddNodeGroup, handler.UpdateNodeGroup, handler.DeleteNodeGroup) and the example params/expected domain.NodeGroup values from the current subtests inside each test case's mockSetup; for invalid-id/invalid-body cases set requestBody or path to invalid values and do not set successful svc expectations (or set svc to return an error for service-error cases). Ensure each case uses t.Run(tc.name, func(t *testing.T) { ... }) so mocks are asserted per-case.docs/swagger/docs.go (1)
1236-1299: Add explicit JSON media types for body-based operations. These operations take request bodies but don't declare
consumes/produces. Adding them makes Swagger 2 client generation more deterministic. Since this file is generated, update handler annotations and regenerate Swagger.
♻️ Suggested patch
"/clusters/{id}/nodegroups": { "post": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ], "/clusters/{id}/nodegroups/{name}": { "put": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ], "/containers/deployments/{id}/scale": { "put": { + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], "security": [ { "APIKeyAuth": [] } ],Also applies to: 1300-1363, 1583-1622
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 1236 - 1299, The generated Swagger operation for POST "/clusters/{id}/nodegroups" (request body schema httphandlers.NodeGroupRequest, response domain.NodeGroup) is missing explicit consumes/produces JSON media types; update the handler's Swagger annotations (the comments above the handler that reference httphandlers.NodeGroupRequest and domain.NodeGroup) to include consumes: application/json and produces: application/json, then re-run the Swagger generator to regenerate docs/swagger/docs.go; apply the same annotation fix and regeneration for the other body-based operations noted (the blocks around lines 1300-1363 and 1583-1622) so all request-body endpoints declare consumes/produces JSON.internal/repositories/postgres/cluster_repo.go (1)
116-121: Avoid N+1 node-group queries in list.
Line 117 triggers one extra query per cluster. This will add significant latency for users with many clusters; consider batching node-group fetches by cluster IDs and mapping results in-memory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/postgres/cluster_repo.go` around lines 116 - 121, The current loop calls r.GetNodeGroups(ctx, c.ID) per cluster causing an N+1 problem; instead, collect all cluster IDs from the clusters slice in the list function, call a batched fetch (add or use a new repository method like GetNodeGroupsByClusterIDs(ctx, []clusterIDs) that returns node groups with their ClusterID), then build an in-memory map[ClusterID][]NodeGroup and assign c.NodeGroups from that map for each cluster (replace the per-cluster r.GetNodeGroups calls and the ngs variable usage). Ensure the new method signature and mapping use the existing types and error handling (wrap errors similarly) so callers remain consistent.
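A hedged sketch of the batched fetch, assuming a pgx-style db handle on the repository and the cluster_node_groups columns from this PR's migration; the method name, query text, and UUID-slice binding are illustrative:

```go
// GetNodeGroupsByClusterIDs loads node groups for many clusters in one query
// and groups them by ClusterID so the list path can map them in memory.
func (r *ClusterRepository) GetNodeGroupsByClusterIDs(ctx context.Context, ids []uuid.UUID) (map[uuid.UUID][]domain.NodeGroup, error) {
	rows, err := r.db.Query(ctx,
		`SELECT id, cluster_id, name, instance_type, min_size, max_size, current_size, created_at, updated_at
		   FROM cluster_node_groups
		  WHERE cluster_id = ANY($1)`, ids)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	out := make(map[uuid.UUID][]domain.NodeGroup)
	for rows.Next() {
		var ng domain.NodeGroup
		if err := rows.Scan(&ng.ID, &ng.ClusterID, &ng.Name, &ng.InstanceType, &ng.MinSize, &ng.MaxSize, &ng.CurrentSize, &ng.CreatedAt, &ng.UpdatedAt); err != nil {
			return nil, err
		}
		out[ng.ClusterID] = append(out[ng.ClusterID], ng)
	}
	return out, rows.Err()
}
```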
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/swagger/docs.go`:
- Around line 9878-9894: The UpdateNodeGroupRequest schema currently permits an
empty object which allows no-op updates; update the
"httphandlers.UpdateNodeGroupRequest" JSON schema to require at least one field
by adding "minProperties": 1 (or alternatively list the updatable fields under
"required" if specific ones must always be present) so requests like {} are
rejected; modify the schema block that defines "UpdateNodeGroupRequest" to
include the "minProperties": 1 property adjacent to "type": "object" (or add a
"required" array containing "desired_size"/"max_size"/"min_size" if
appropriate).
- Around line 9867-9877: The OpenAPI schema for ScaleDeploymentRequest currently
allows negative integers for the replicas field; update the replicas property
inside the ScaleDeploymentRequest schema to enforce non-negative values by
adding a minimum constraint (e.g., "minimum": 0) so that replicas cannot be
negative and validation will reject negative scaling requests.
In `@docs/swagger/swagger.json`:
- Around line 9859-9868: The schema for httphandlers.ScaleDeploymentRequest
allows negative replicas; update the replicas property in the
httphandlers.ScaleDeploymentRequest definition to include a lower bound by
adding "minimum": 0 (so generated SDKs validate non-negative replica counts);
ensure the change is applied to the "replicas" property in that object schema.
In `@internal/core/services/cluster.go`:
- Around line 437-448: The AddNodeGroup flow mutates node-group then updates
cluster.WorkerCount in two separate writes (s.repo.AddNodeGroup and
s.repo.Update) which can leave state inconsistent if the second write fails;
wrap the related operations (s.repo.AddNodeGroup, adjusting cluster.WorkerCount
by params.DesiredSize, and s.repo.Update) in a single transactional boundary
(use your repo's transaction API / BeginTx/WithTx), perform both writes inside
the same transaction, and ensure rollback on any error so that either both the
node-group and cluster.WorkerCount persist or neither do; apply the same
transactional pattern to the other affected code paths (the ranges noted around
lines 501-512 and 542-550).
- Around line 115-123: The default NodeGroup construction sets CurrentSize :=
params.Workers without validating it against hard-coded bounds (MinSize/MaxSize
1..10) which can cause DB failures after the cluster row is created; update the
code that builds defaultNG (the domain.NodeGroup instance) to validate
params.Workers before any DB writes by introducing named constants (e.g.,
DefaultMinNodeGroupSize, DefaultMaxNodeGroupSize) instead of magic numbers, and
either clamp params.Workers into that range or return an error so you don't
persist an invalid NodeGroup; ensure validation happens prior to inserting the
cluster/NodeGroup and update any callers that assume the old behavior.
In `@internal/repositories/k8s/templates/control_plane.yaml`:
- Around line 309-352: The embedded autoscaler Pod spec is missing security
hardening and resource constraints: add a securityContext block to both
containers (cluster-autoscaler and thecloud-bridge) with
allowPrivilegeEscalation: false, readOnlyRootFilesystem: true and
capabilities.drop: ["ALL"], and add resource requests/limits for thecloud-bridge
container (CPU/memory) to match the values used in
k8s/autoscaler/deployment.yaml so both containers have consistent resource and
security settings.
In `@internal/repositories/noop/adapters.go`:
- Around line 221-229: The NoopClusterRepository is missing the NodeGroup
methods required by ports.ClusterRepository; add implementations for
AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup on the
NoopClusterRepository type that match the repository signatures and return
benign noop results (e.g., nil error and empty slice for GetNodeGroups) so the
repository conforms to the interface; ensure method names and parameter types
exactly match the ports.ClusterRepository signatures and use domain.NodeGroup
and uuid.UUID where appropriate.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 265-285: Update UpdateNodeGroup and DeleteNodeGroup to check the
Exec result for affected rows and return a not-found error when zero rows are
affected: capture the Exec return into a result (e.g., res, err :=
r.db.Exec(...)), check rows := res.RowsAffected() (or res.RowsAffected() via
CommandTag if your DB driver returns a CommandTag) and if rows == 0 return
errors.Wrap(errors.NotFound, "node group not found") instead of success; keep
the existing error wrapping for Exec failures and apply the same change in both
ClusterRepository.UpdateNodeGroup and ClusterRepository.DeleteNodeGroup.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`:
- Around line 22-35: The backfill sets current_size = worker_count which can be
<= 0 and violate the chk_cluster_node_groups_current_in_range constraint; update
the INSERT SELECT to ensure current_size is at least 1 by replacing current_size
= worker_count with current_size = GREATEST(worker_count, 1) (and optionally
guard max_size similarly if you want to ensure it's >=1), referencing the
cluster_node_groups table, clusters table, worker_count column and the
chk_cluster_node_groups_current_in_range constraint.
In `@internal/workers/database_failover_worker_test.go`:
- Around line 158-160: The test "No failover if primary healthy" is a no-op;
make it deterministic by starting a temporary local TCP listener on the primary
address used by the worker, then invoke the health-check path (call isHealthy or
trigger the worker's health-check/run method used in
database_failover_worker.go) so the dial succeeds, assert that no failover
actions occurred (e.g. primary still selected / no call to failover handler),
and finally close the listener; ensure the test cleans up the listener and any
goroutines to avoid flakiness.
In `@k8s/autoscaler/deployment.yaml`:
- Around line 79-83: The deployment references the ConfigMap named "cluster-id"
to populate the CLOUD_CLUSTER_ID env var but the ConfigMap contains the
unresolved template "{{ .ClusterID }}"; before applying the manifest either
replace that placeholder with the actual cluster UUID or add a templating step
to render it. Update docs/README or deployment instructions to require creating
the ConfigMap "cluster-id" with key "cluster-id" set to the real cluster UUID
(or integrate a templating command such as envsubst/helm/kustomize to inject
.ClusterID) so CLOUD_CLUSTER_ID receives the real UUID when Deployment in
deployment.yaml is applied.
---
Duplicate comments:
In `@internal/core/services/cluster.go`:
- Around line 129-138: The CreateCluster flow is non-atomic because
s.repo.Create(ctx, cluster) and s.repo.AddNodeGroup(ctx, &cluster.NodeGroups[0])
are separate writes; if AddNodeGroup fails the cluster remains persisted. Fix by
performing both operations in a single transactional unit (use your repository's
transaction mechanism or add one) — e.g., run Create and AddNodeGroup inside a
repo transaction or implement a single repo method
(CreateClusterWithNodeGroup/CreateWithDefaultNodeGroup) that performs both
writes atomically; alternatively, if transactions aren't available implement a
compensating rollback by calling repo.Delete(ctx, cluster.ID) when AddNodeGroup
fails. Ensure you update CreateCluster to call the transactional method (or
perform rollback) so partial state cannot be left behind.
In `@internal/repositories/k8s/provisioner.go`:
- Around line 242-246: If p.repo.AddNode(ctx, node) fails, the newly created
worker is orphaned and the function currently just logs and appends
provisioningErrors; modify the error path in the AddNode block to perform a
rollback by deleting/terminating the created cloud instance (use the provisioner
cloud client method, e.g., p.cloud.DeleteInstance or p.cloud.TerminateInstance
with the workerName), log the rollback result via p.logger, then call
failCluster(ctx, clusterID) to move the cluster to a failed state and return the
error (instead of only appending provisioningErrors). Ensure similar rollback +
failCluster + return behavior is applied to the other failure path referenced
around where provisioningErrors is used so status remains consistent.
---
Nitpick comments:
In `@docs/swagger/docs.go`:
- Around line 1236-1299: The generated Swagger operation for POST
"/clusters/{id}/nodegroups" (request body schema httphandlers.NodeGroupRequest,
response domain.NodeGroup) is missing explicit consumes/produces JSON media
types; update the handler's Swagger annotations (the comments above the handler
that reference httphandlers.NodeGroupRequest and domain.NodeGroup) to include
consumes: application/json and produces: application/json, then re-run the
Swagger generator to regenerate docs/swagger/docs.go; apply the same annotation
fix and regeneration for the other body-based operations noted (the blocks
around lines 1300-1363 and 1583-1622) so all request-body endpoints declare
consumes/produces JSON.
In `@internal/core/services/database.go`:
- Around line 377-381: The returns in the exporter selection code currently use
magic string literals "9187" and "9104" for ports; replace these with named
constants (e.g., PostgresExporterPort and MySQLExporterPort) defined alongside
other service/engine constants and use those constants in the return statements
where postgresExporterImg and mysqlExporterImg are returned; ensure the
constants are typed as strings (or format them where used) and update any tests
or usages that reference the literal port strings to reference the new constant
names.
In `@internal/handlers/cluster_handler_test.go`:
- Around line 570-630: Replace the ad-hoc subtests inside
TestClusterHandlerNodeGroups with a table-driven loop: create a slice of test
cases (structs) that include name, method, path, requestBody, mockSetup func (to
set expectations on svc for AddNodeGroup/UpdateNodeGroup/DeleteNodeGroup),
expectedStatus, and any flags for invalid-id/invalid-body/service-error; then
iterate over the table calling r.ServeHTTP and assert the response. Keep and
reuse the existing handlers (handler.AddNodeGroup, handler.UpdateNodeGroup,
handler.DeleteNodeGroup) and the example params/expected domain.NodeGroup values
from the current subtests inside each test case's mockSetup; for
invalid-id/invalid-body cases set requestBody or path to invalid values and do
not set successful svc expectations (or set svc to return an error for
service-error cases). Ensure each case uses t.Run(tc.name, func(t *testing.T) {
... }) so mocks are asserted per-case.
In `@internal/handlers/cluster_handler.go`:
- Around line 446-471: In UpdateNodeGroup validate the path param `name` (the
variable `name := c.Param("name")`) before calling the service: check it's
non-empty (and optionally allowed format/length), and if invalid return a clear
bad-request via httputil.Error with an InvalidInput error (create or reuse an
error constant such as errInvalidNodeGroupName) so the handler fails fast with a
descriptive message instead of forwarding an empty/invalid name to
svc.UpdateNodeGroup.
In `@internal/repositories/k8s/mocks_test.go`:
- Around line 102-111: Replace the hardcoded-success implementations in
mockClusterRepo's methods (AddNodeGroup, GetNodeGroups, UpdateNodeGroup,
DeleteNodeGroup) with testify/mock-style handlers: call m.Called(...) with the
same arguments, extract returned values and errors from the call result, and
return them (casting to the appropriate types, e.g., []domain.NodeGroup for
GetNodeGroups and error for others). Ensure you pass ctx and the relevant
parameters into m.Called so tests can assert calls and control return values via
On/Return.
In `@internal/repositories/postgres/cluster_repo.go`:
- Around line 116-121: The current loop calls r.GetNodeGroups(ctx, c.ID) per
cluster causing an N+1 problem; instead, collect all cluster IDs from the
clusters slice in the list function, call a batched fetch (add or use a new
repository method like GetNodeGroupsByClusterIDs(ctx, []clusterIDs) that returns
node groups with their ClusterID), then build an in-memory
map[ClusterID][]NodeGroup and assign c.NodeGroups from that map for each cluster
(replace the per-cluster r.GetNodeGroups calls and the ngs variable usage).
Ensure the new method signature and mapping use the existing types and error
handling (wrap errors similarly) so callers remain consistent.
In `@internal/workers/cluster_worker_test.go`:
- Around line 64-73: The NodeGroup mock methods (MockClusterRepo.AddNodeGroup,
GetNodeGroups, UpdateNodeGroup, DeleteNodeGroup) are hardcoded stubs; update
each to delegate to m.Called(...) for consistency and test configurability by
invoking m.Called with the same parameters (ctx, ng or clusterID/name) and
returning the typed results (error and []domain.NodeGroup for GetNodeGroups)
extracted from the returned arguments (using .Error(0) or
.Get(0).([]domain.NodeGroup) as appropriate).
In `@internal/workers/database_failover_worker_test.go`:
- Around line 128-129: Replace the magic number -1 used for the Port field with
a named constant (e.g., invalidPort or unhealthyPort) to make the test intent
explicit; add the constant near the top of the test file or in the test's scope,
then update the Port: -1 occurrence to Port: invalidPort (or your chosen name)
so the unhealthy-port intent is clear when examining the test that configures
the DialTimeout failure.
In `@k8s/autoscaler/deployment.yaml`:
- Around line 77-78: The manifest hardcodes CLOUD_API_URL in deployment.yaml
which reduces flexibility; change it to source the value from a ConfigMap (e.g.,
create/modify k8s/autoscaler/configmap.yaml to include a cloud-api-url key) and
update the Deployment to set the environment variable CLOUD_API_URL from that
ConfigMap (use envFrom or valueFrom with configMapKeyRef) so the same {{
.CloudAPIURL }}-style configurability is available for CLOUD_API_URL.
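For that last item, one possible shape of the ConfigMap-backed env var is sketched below; the ConfigMap name cluster-autoscaler-config and the cloud-api-url key are assumptions for illustration, not names taken from the manifests:

env:
  - name: CLOUD_API_URL
    valueFrom:
      configMapKeyRef:
        name: cluster-autoscaler-config   # assumed ConfigMap name
        key: cloud-api-url                # assumed key holding the API URL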
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 113a517e-0d2d-446e-a551-32d4eab748cb
📒 Files selected for processing (42)
- Dockerfile.autoscaler-server
- cmd/api/main.go
- cmd/autoscaler-server/main.go
- docs/adr/ADR-020-kubernetes-cluster-autoscaler.md
- docs/kubernetes.md
- docs/services/kubernetes-autoscaler.md
- docs/swagger/docs.go
- docs/swagger/swagger.json
- docs/swagger/swagger.yaml
- downs.txt
- internal/api/setup/router.go
- internal/autoscaler/server.go
- internal/core/domain/cluster.go
- internal/core/ports/compute_options.go
- internal/core/services/cluster.go
- internal/core/services/cluster_test.go
- internal/core/services/cluster_unit_test.go
- internal/core/services/database.go
- internal/core/services/health_unit_test.go
- internal/core/services/shared_test.go
- internal/handlers/cluster_handler.go
- internal/handlers/cluster_handler_test.go
- internal/handlers/container_handler.go
- internal/handlers/container_handler_test.go
- internal/repositories/k8s/kubeadm_provisioner_test.go
- internal/repositories/k8s/mocks_test.go
- internal/repositories/k8s/provisioner.go
- internal/repositories/k8s/templates/control_plane.yaml
- internal/repositories/noop/adapters.go
- internal/repositories/postgres/cluster_repo.go
- internal/repositories/postgres/cluster_repo_unit_test.go
- internal/repositories/postgres/migrations/020_create_clusters.up.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.down.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.up.sql
- internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql
- internal/workers/cluster_worker_test.go
- internal/workers/database_failover_worker_test.go
- internal/workers/workers_unit_test.go
- k8s/autoscaler/deployment.yaml
- k8s/autoscaler/rbac.yaml
- pkg/sdk/compute.go
- pkg/sdk/kubernetes.go
💤 Files with no reviewable changes (2)
- internal/repositories/postgres/migrations/037_create_clusters_table.up.sql
- internal/repositories/postgres/migrations/037_create_clusters_table.down.sql
✅ Files skipped from review due to trivial changes (1)
- docs/services/kubernetes-autoscaler.md
🚧 Files skipped from review as they are similar to previous changes (6)
- docs/adr/ADR-020-kubernetes-cluster-autoscaler.md
- internal/api/setup/router.go
- internal/core/ports/compute_options.go
- cmd/autoscaler-server/main.go
- Dockerfile.autoscaler-server
- internal/autoscaler/server.go
| "httphandlers.ScaleDeploymentRequest": { | ||
| "type": "object", | ||
| "required": [ | ||
| "replicas" | ||
| ], | ||
| "properties": { | ||
| "replicas": { | ||
| "type": "integer" | ||
| } | ||
| } | ||
| }, |
Constrain replicas to non-negative values in the contract.
ScaleDeploymentRequest.replicas currently accepts negative integers in schema, which is invalid for scaling semantics.
🐛 Suggested patch
"httphandlers.ScaleDeploymentRequest": {
"type": "object",
"required": [
"replicas"
],
"properties": {
"replicas": {
- "type": "integer"
+ "type": "integer",
+ "minimum": 0
}
}
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/docs.go` around lines 9867 - 9877, The OpenAPI schema for
ScaleDeploymentRequest currently allows negative integers for the replicas
field; update the replicas property inside the ScaleDeploymentRequest schema to
enforce non-negative values by adding a minimum constraint (e.g., "minimum": 0)
so that replicas cannot be negative and validation will reject negative scaling
requests.
| "httphandlers.UpdateNodeGroupRequest": { | ||
| "type": "object", | ||
| "properties": { | ||
| "desired_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| }, | ||
| "max_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| }, | ||
| "min_size": { | ||
| "type": "integer", | ||
| "minimum": 0 | ||
| } | ||
| } | ||
| }, |
Disallow empty node-group update payloads.
UpdateNodeGroupRequest currently allows {}. Add minProperties: 1 (or required fields) to avoid ambiguous no-op updates.
🐛 Suggested patch
"httphandlers.UpdateNodeGroupRequest": {
"type": "object",
+ "minProperties": 1,
"properties": {
"desired_size": {
"type": "integer",
"minimum": 0
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/docs.go` around lines 9878 - 9894, The UpdateNodeGroupRequest
schema currently permits an empty object which allows no-op updates; update the
"httphandlers.UpdateNodeGroupRequest" JSON schema to require at least one field
by adding "minProperties": 1 (or alternatively list the updatable fields under
"required" if specific ones must always be present) so requests like {} are
rejected; modify the schema block that defines "UpdateNodeGroupRequest" to
include the "minProperties": 1 property adjacent to "type": "object" (or add a
"required" array containing "desired_size"/"max_size"/"min_size" if
appropriate).
| "httphandlers.ScaleDeploymentRequest": { | ||
| "type": "object", | ||
| "required": [ | ||
| "replicas" | ||
| ], | ||
| "properties": { | ||
| "replicas": { | ||
| "type": "integer" | ||
| } | ||
| } |
Add a lower bound for deployment replicas in the schema.
Line 9865 currently allows negative replicas values because no minimum is declared. This weakens client-side validation for generated SDKs.
🔧 Suggested schema fix
"httphandlers.ScaleDeploymentRequest": {
"type": "object",
"required": [
"replicas"
],
"properties": {
"replicas": {
- "type": "integer"
+ "type": "integer",
+ "minimum": 0
}
}
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.json` around lines 9859 - 9868, The schema for
httphandlers.ScaleDeploymentRequest allows negative replicas; update the
replicas property in the httphandlers.ScaleDeploymentRequest definition to
include a lower bound by adding "minimum": 0 (so generated SDKs validate
non-negative replica counts); ensure the change is applied to the "replicas"
property in that object schema.
// Add default Node Group
defaultNG := domain.NodeGroup{
	ID:           uuid.New(),
	ClusterID:    cluster.ID,
	Name:         "default-pool",
	InstanceType: "standard-1",
	MinSize:      1,
	MaxSize:      10,
	CurrentSize:  params.Workers,
Validate default node-group sizing before the first DB write.
Line 123 sets CurrentSize from params.Workers while bounds are fixed at 1..10. Invalid values can make default node-group persistence fail after the cluster row is already inserted.
🔧 Proposed fix
const (
defaultClusterVersion = "v1.29.0"
defaultWorkerCount = 2
+ defaultNodeGroupMinSize = 1
+ defaultNodeGroupMaxSize = 10
)
@@
+ if err := s.validateNodeGroupSizing(defaultNodeGroupMinSize, defaultNodeGroupMaxSize, params.Workers); err != nil {
+ return nil, err
+ }
+
defaultNG := domain.NodeGroup{
@@
- MinSize: 1,
- MaxSize: 10,
+ MinSize: defaultNodeGroupMinSize,
+ MaxSize: defaultNodeGroupMaxSize,
  CurrentSize: params.Workers,
As per coding guidelines, "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/core/services/cluster.go` around lines 115 - 123, The default
NodeGroup construction sets CurrentSize := params.Workers without validating it
against hard-coded bounds (MinSize/MaxSize 1..10) which can cause DB failures
after the cluster row is created; update the code that builds defaultNG (the
domain.NodeGroup instance) to validate params.Workers before any DB writes by
introducing named constants (e.g., DefaultMinNodeGroupSize,
DefaultMaxNodeGroupSize) instead of magic numbers, and either clamp
params.Workers into that range or return an error so you don't persist an
invalid NodeGroup; ensure validation happens prior to inserting the
cluster/NodeGroup and update any callers that assume the old behavior.
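The s.validateNodeGroupSizing helper referenced in the proposed fix does not exist yet; a minimal sketch under that assumption (receiver name and the use of fmt.Errorf instead of the project's errors package are both placeholders):

func (s *ClusterService) validateNodeGroupSizing(minSize, maxSize, desired int) error {
	if minSize < 1 || maxSize < minSize {
		return fmt.Errorf("invalid node group bounds: min=%d max=%d", minSize, maxSize)
	}
	if desired < minSize || desired > maxSize {
		// Reject before any DB write so the cluster row is never left without a valid default pool.
		return fmt.Errorf("worker count %d must be between %d and %d", desired, minSize, maxSize)
	}
	return nil
}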
if err := s.repo.AddNodeGroup(ctx, ng); err != nil {
	return nil, err
}

// If desired size > 0, we should trigger a scale operation for this specific group.
// For now, we will update the global worker count to reflect the addition.
oldWorkerCount := cluster.WorkerCount
cluster.WorkerCount += params.DesiredSize
if err := s.repo.Update(ctx, cluster); err != nil {
	cluster.WorkerCount = oldWorkerCount
	return nil, errors.Wrap(errors.Internal, "failed to update cluster worker count after adding node group", err)
}
Node-group mutations and WorkerCount updates need a single transaction boundary.
Add/update/delete each perform multiple dependent writes across node groups and clusters. If the second write fails, persisted state can diverge (cluster.WorkerCount vs. node-group sizes).
Also applies to: 501-512, 542-550
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/core/services/cluster.go` around lines 437 - 448, The AddNodeGroup
flow mutates node-group then updates cluster.WorkerCount in two separate writes
(s.repo.AddNodeGroup and s.repo.Update) which can leave state inconsistent if
the second write fails; wrap the related operations (s.repo.AddNodeGroup,
adjusting cluster.WorkerCount by params.DesiredSize, and s.repo.Update) in a
single transactional boundary (use your repo's transaction API /
BeginTx/WithTx), perform both writes inside the same transaction, and ensure
rollback on any error so that either both the node-group and cluster.WorkerCount
persist or neither do; apply the same transactional pattern to the other
affected code paths (the ranges noted around lines 501-512 and 542-550).
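A sketch of the transactional shape being asked for, assuming a hypothetical WithTx helper on the repository (the codebase's actual transaction API may differ):

err := s.repo.WithTx(ctx, func(tx ports.ClusterRepository) error {
	if err := tx.AddNodeGroup(ctx, ng); err != nil {
		return err
	}
	cluster.WorkerCount += params.DesiredSize
	// Both writes commit together; any error rolls the transaction back.
	return tx.Update(ctx, cluster)
})
if err != nil {
	return nil, errors.Wrap(errors.Internal, "failed to add node group", err)
}

The same wrapper would cover the update and delete paths noted at 501-512 and 542-550.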
func (s *NoopClusterService) AddNodeGroup(ctx context.Context, clusterID uuid.UUID, params ports.NodeGroupParams) (*domain.NodeGroup, error) {
	return &domain.NodeGroup{ClusterID: clusterID, Name: params.Name}, nil
}
func (s *NoopClusterService) UpdateNodeGroup(ctx context.Context, clusterID uuid.UUID, name string, params ports.UpdateNodeGroupParams) (*domain.NodeGroup, error) {
	return &domain.NodeGroup{ClusterID: clusterID, Name: name}, nil
}
func (s *NoopClusterService) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
	return nil
}
🧩 Analysis chain
🏁 Scripts executed against poyrazK/thecloud:
cd /dev/null && pwd
cat -n internal/repositories/noop/adapters.go | head -250 | tail -100
fd 'ports' -t f --name '*.go' | grep -E '(port|interface)' | head -20
rg 'type ClusterRepository interface' -A 20 --type=go
rg 'type ClusterService interface' -A 20 --type=go
rg 'NoopClusterRepository' -B 2 -A 5 --type=go
rg 'var.*ClusterRepository|:=.*NoopClusterRepository' --type=go -B2 -A2
rg 'NoopClusterRepository' --type=go -l
grep -r 'NoopClusterRepository' --include='*.go' | head -20
Add missing NodeGroup methods to NoopClusterRepository to fully implement the ports.ClusterRepository interface.
NoopClusterRepository currently lacks AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup methods required by the ports.ClusterRepository interface. These must be implemented to maintain interface compliance:
func (r *NoopClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil }
func (r *NoopClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
return []domain.NodeGroup{}, nil
}
func (r *NoopClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error { return nil }
func (r *NoopClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error { return nil }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/noop/adapters.go` around lines 221 - 229, The
NoopClusterRepository is missing the NodeGroup methods required by
ports.ClusterRepository; add implementations for AddNodeGroup, GetNodeGroups,
UpdateNodeGroup, and DeleteNodeGroup on the NoopClusterRepository type that
match the repository signatures and return benign noop results (e.g., nil error
and empty slice for GetNodeGroups) so the repository conforms to the interface;
ensure method names and parameter types exactly match the
ports.ClusterRepository signatures and use domain.NodeGroup and uuid.UUID where
appropriate.
func (r *ClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
	query := `
		UPDATE cluster_node_groups
		SET instance_type = $1, min_size = $2, max_size = $3, current_size = $4, updated_at = $5
		WHERE cluster_id = $6 AND name = $7
	`
	_, err := r.db.Exec(ctx, query, ng.InstanceType, ng.MinSize, ng.MaxSize, ng.CurrentSize, time.Now(), ng.ClusterID, ng.Name)
	if err != nil {
		return errors.Wrap(errors.Internal, "failed to update node group", err)
	}
	return nil
}

func (r *ClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
	query := `DELETE FROM cluster_node_groups WHERE cluster_id = $1 AND name = $2`
	_, err := r.db.Exec(ctx, query, clusterID, name)
	if err != nil {
		return errors.Wrap(errors.Internal, "failed to delete node group", err)
	}
	return nil
}
Return not-found when node-group update/delete affects no record.
UpdateNodeGroup and DeleteNodeGroup currently succeed even if (cluster_id, name) is missing. That can hide races and make higher layers report success for no-op writes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/postgres/cluster_repo.go` around lines 265 - 285,
Update UpdateNodeGroup and DeleteNodeGroup to check the Exec result for affected
rows and return a not-found error when zero rows are affected: capture the Exec
return into a result (e.g., res, err := r.db.Exec(...)), check rows :=
res.RowsAffected() (or res.RowsAffected() via CommandTag if your DB driver
returns a CommandTag) and if rows == 0 return errors.Wrap(errors.NotFound, "node
group not found") instead of success; keep the existing error wrapping for Exec
failures and apply the same change in both ClusterRepository.UpdateNodeGroup and
ClusterRepository.DeleteNodeGroup.
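A sketch of that check for DeleteNodeGroup, assuming the pgx driver (Exec returns a pgconn.CommandTag); the exact not-found constructor should follow whatever the project's errors package actually provides:

tag, err := r.db.Exec(ctx, query, clusterID, name)
if err != nil {
	return errors.Wrap(errors.Internal, "failed to delete node group", err)
}
if tag.RowsAffected() == 0 {
	// No row matched (cluster_id, name) — report not-found instead of silent success.
	return errors.Wrap(errors.NotFound, "node group not found", nil) // adapt to the project's error helper
}
return nil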
INSERT INTO cluster_node_groups (id, cluster_id, name, instance_type, min_size, max_size, current_size)
SELECT
    gen_random_uuid(),
    id,
    'default-pool',
    'standard-1',
    1,
    CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
    worker_count
FROM clusters
WHERE NOT EXISTS (
    SELECT 1 FROM cluster_node_groups cng
    WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool'
);
Backfill can fail on non-positive worker_count values.
At Line 30, current_size = worker_count can violate chk_cluster_node_groups_current_in_range when legacy data has worker_count <= 0, causing migration failure.
🧩 Proposed safe backfill adjustment
INSERT INTO cluster_node_groups (id, cluster_id, name, instance_type, min_size, max_size, current_size)
SELECT
gen_random_uuid(),
id,
'default-pool',
'standard-1',
1,
- CASE WHEN worker_count > 10 THEN worker_count ELSE 10 END,
- worker_count
+ GREATEST(worker_count, 10),
+ GREATEST(worker_count, 1)
FROM clusters
WHERE NOT EXISTS (
SELECT 1 FROM cluster_node_groups cng
WHERE cng.cluster_id = clusters.id AND cng.name = 'default-pool'
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@internal/repositories/postgres/migrations/098_create_cluster_node_groups.up.sql`
around lines 22 - 35, The backfill sets current_size = worker_count which can be
<= 0 and violate the chk_cluster_node_groups_current_in_range constraint; update
the INSERT SELECT to ensure current_size is at least 1 by replacing current_size
= worker_count with current_size = GREATEST(worker_count, 1) (and optionally
guard max_size similarly if you want to ensure it's >=1), referencing the
cluster_node_groups table, clusters table, worker_count column and the
chk_cluster_node_groups_current_in_range constraint.
- name: CLOUD_CLUSTER_ID
  valueFrom:
    configMapKeyRef:
      name: cluster-id
      key: cluster-id
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, check if the cluster-id-configmap.yaml file exists and examine its contents
if [ -f k8s/autoscaler/cluster-id-configmap.yaml ]; then
echo "=== Contents of cluster-id-configmap.yaml ==="
cat k8s/autoscaler/cluster-id-configmap.yaml
echo ""
echo "=== Checking for template placeholders ==="
grep -E '\{\{.*\}\}' k8s/autoscaler/cluster-id-configmap.yaml && echo "Found template placeholder" || echo "No template placeholder found"
else
echo "File not found: k8s/autoscaler/cluster-id-configmap.yaml"
fi
echo ""
echo "=== Verifying deployment.yaml exists and shows relevant section ==="
if [ -f k8s/autoscaler/deployment.yaml ]; then
echo "File exists. Checking lines 79-83:"
sed -n '79,83p' k8s/autoscaler/deployment.yaml
else
echo "File not found: k8s/autoscaler/deployment.yaml"
fi
Populate cluster-id ConfigMap with actual cluster UUID before deployment.
The cluster-id ConfigMap contains {{ .ClusterID }}, an unresolved template placeholder. The deployment references this ConfigMap to set the CLOUD_CLUSTER_ID environment variable, which will receive the literal string "{{ .ClusterID }}" instead of an actual cluster ID. Update deployment documentation to explicitly require that the ConfigMap be created with the actual cluster UUID before applying this manifest, or implement a templating step to resolve the placeholder during deployment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@k8s/autoscaler/deployment.yaml` around lines 79 - 83, The deployment
references the ConfigMap named "cluster-id" to populate the CLOUD_CLUSTER_ID env
var but the ConfigMap contains the unresolved template "{{ .ClusterID }}";
before applying the manifest either replace that placeholder with the actual
cluster UUID or add a templating step to render it. Update docs/README or
deployment instructions to require creating the ConfigMap "cluster-id" with key
"cluster-id" set to the real cluster UUID (or integrate a templating command
such as envsubst/helm/kustomize to inject .ClusterID) so CLOUD_CLUSTER_ID
receives the real UUID when Deployment in deployment.yaml is applied.
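For reference, a rendered example of what the bridge expects (values are placeholders; the real template lives in k8s/autoscaler/cluster-id-configmap.yaml, and the namespace is assumed to match the autoscaler Deployment's):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-id
  namespace: kube-system   # assumed
data:
  cluster-id: "<your-cluster-uuid>"   # replace with the real cluster UUID before applying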
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/workers/database_failover_worker_test.go (1)
162-178: ⚠️ Potential issue | 🟡 Minor
Inconsistent matcher usage and magic number.
Two issues in this test:
- Line 172 uses mock.Anything while other tests use anyCtx - should be consistent.
- Line 168 uses magic number 0 for healthy port simulation. This relies on an undocumented implementation detail and violates the guideline against magic numbers.
Suggested fix
Define a constant and use anyCtx consistently:
+const healthyPortSimulation = 0 // Port 0 is treated as healthy in worker for testing
+
 func TestDatabaseFailoverWorker(t *testing.T) {
 	t.Run("No failover if primary healthy", func(t *testing.T) {
 		repo := new(mockDatabaseRepo)
 		svc := new(mockDatabaseService)
 		logger := slog.Default()
 		healthyPrimary := *primary
-		healthyPrimary.Port = 0 // port 0 is always healthy in our worker for simulation
+		healthyPrimary.Port = healthyPortSimulation
 		worker := NewDatabaseFailoverWorker(svc, repo, logger)
-		repo.On("List", mock.Anything).Return([]*domain.Database{&healthyPrimary}, nil)
+		repo.On("List", anyCtx).Return([]*domain.Database{&healthyPrimary}, nil)
As per coding guidelines: "Do not use magic numbers - use named constants instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/workers/database_failover_worker_test.go` around lines 162 - 178, Replace the inconsistent matcher and magic number in the "No failover if primary healthy" case: use the test-scoped matcher anyCtx instead of mock.Anything when setting expectations on repo or svc, and replace the literal port 0 with a named constant (e.g., healthyPort or DefaultHealthyPort) defined at the top of the test file so healthyPrimary.Port = healthyPort; keep the rest of the logic around healthyPrimary, NewDatabaseFailoverWorker, repo.On("List", ...) and worker.checkDatabases(...) unchanged.
docs/swagger/swagger.yaml (1)
3110-3131: ⚠️ Potential issue | 🟠 Major
/clusters/{id}/scale verb switch is a client-breaking change.
Changing this operation to PUT will break existing clients still using POST. Consider a transition window (support both methods, mark POST deprecated) or an explicit API version bump.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 3110 - 3131, The API operation for /clusters/{id}/scale was changed to PUT which is a client-breaking change for callers still using POST; either support both verbs during a transition or bump the API version. Update swagger.yaml so that the path /clusters/{id}/scale exposes both post and put operations (reusing the same schema reference to httphandlers.ScaleClusterRequest and responses) and mark the POST operation deprecated in its description, or alternatively revert to POST and increment the API version in the spec if you intend a breaking change; ensure documentation and changelog note the deprecation/version change so clients can migrate.
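If the transition route is taken, the deprecated alias might look roughly like this in Swagger 2 YAML terms (illustrative only; the spec is generated from handler annotations, so the real change belongs in those annotations):

  /clusters/{id}/scale:
    post:
      deprecated: true
      summary: Scale cluster workers (deprecated alias, use PUT)
      tags:
        - K8s
      security:
        - APIKeyAuth: []
      parameters:
        - description: Cluster ID
          in: path
          name: id
          required: true
          type: string
        - description: Scale Request
          in: body
          name: request
          required: true
          schema:
            $ref: '#/definitions/httphandlers.ScaleClusterRequest'
      responses:
        "200":
          description: OK
    put:
      summary: Scale cluster workers
      # ...existing generated PUT operation unchanged...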
♻️ Duplicate comments (4)
internal/repositories/noop/adapters.go (1)
165-187: ⚠️ Potential issue | 🟠 Major
NoopClusterRepository still missing NodeGroup methods required by ports.ClusterRepository.
The past review comment flagging this issue remains unresolved. NoopClusterRepository must implement AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to satisfy the ports.ClusterRepository interface (see internal/core/ports/cluster.go lines 36-39).
🔧 Add missing NodeGroup methods
 func (r *NoopClusterRepository) DeleteNode(ctx context.Context, id uuid.UUID) error { return nil }
+
+func (r *NoopClusterRepository) AddNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
+	return nil
+}
+func (r *NoopClusterRepository) GetNodeGroups(ctx context.Context, clusterID uuid.UUID) ([]domain.NodeGroup, error) {
+	return []domain.NodeGroup{}, nil
+}
+func (r *NoopClusterRepository) UpdateNodeGroup(ctx context.Context, ng *domain.NodeGroup) error {
+	return nil
+}
+func (r *NoopClusterRepository) DeleteNodeGroup(ctx context.Context, clusterID uuid.UUID, name string) error {
+	return nil
+}
As per coding guidelines, "Repositories (internal/repositories/) must implement Ports interfaces".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/repositories/noop/adapters.go` around lines 165 - 187, NoopClusterRepository is missing the NodeGroup methods required by the ports.ClusterRepository interface; add the methods AddNodeGroup, GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to NoopClusterRepository with matching signatures (context.Context plus domain.NodeGroup or IDs/return types) so the struct implements ports.ClusterRepository (refer to ports.ClusterRepository in internal/core/ports/cluster.go for exact signatures), and implement them as no-ops/empty returns consistent with the other Noop methods.
docs/swagger/docs.go (2)
9887-9903: ⚠️ Potential issue | 🟡 Minor
Disallow empty node-group update payloads.
Line 9888 defines an object where all fields are optional; {} is currently valid and can produce ambiguous no-op updates.
Suggested patch
 "httphandlers.UpdateNodeGroupRequest": {
     "type": "object",
+    "minProperties": 1,
     "properties": {
         "desired_size": {
             "type": "integer",
             "minimum": 0
         },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9887 - 9903, The UpdateNodeGroupRequest schema (httphandlers.UpdateNodeGroupRequest) currently allows an empty object; add a constraint to require at least one field by adding "minProperties": 1 to that schema so `{}` is invalid and at least one of desired_size, max_size, or min_size must be present.
9876-9886: ⚠️ Potential issue | 🟡 Minor
Constrain deployment replicas to non-negative values.
Line 9882 currently permits negative integers for replicas, which breaks scaling semantics.
Suggested patch
 "httphandlers.ScaleDeploymentRequest": {
     "type": "object",
     "required": [
         "replicas"
     ],
     "properties": {
         "replicas": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/docs.go` around lines 9876 - 9886, The Swagger schema for httphandlers.ScaleDeploymentRequest allows negative replicas; update the schema for the "replicas" property to enforce non-negative integers by adding a minimum constraint (e.g., "minimum": 0) to the replicas property in the ScaleDeploymentRequest definition so clients and validation reject negative values.
docs/swagger/swagger.json (1)
9868-9877: ⚠️ Potential issue | 🟡 Minor
Add a non-negative lower bound for deployment replicas.
Line 9875 still allows negative values because minimum is missing on replicas, so client-side validation remains weak.
🔧 Suggested schema fix
 "httphandlers.ScaleDeploymentRequest": {
     "type": "object",
     "required": [
         "replicas"
     ],
     "properties": {
         "replicas": {
-            "type": "integer"
+            "type": "integer",
+            "minimum": 0
         }
     }
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 9868 - 9877, The ScaleDeploymentRequest schema allows negative replica counts; update the "httphandlers.ScaleDeploymentRequest" definition to add a non-negative lower bound by adding "minimum": 0 to the "replicas" property so clients validate that replicas cannot be negative; locate the "replicas" property within the ScaleDeploymentRequest object in swagger.json and add the minimum constraint to that property.
🧹 Nitpick comments (2)
internal/handlers/container_handler_test.go (1)
238-270: Refactor this error test into a table-driven test.
The scenarios are repetitive and fit cleanly into a table-driven structure with shared setup/assertions.
As per coding guidelines, "Use table-driven tests in test files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/handlers/container_handler_test.go` around lines 238 - 270, TestContainerHandlerScaleError repeats three similar subtests; refactor it into a single table-driven test that iterates over cases (e.g., names "InvalidID", "InvalidJSON", "ServiceError") and runs t.Run for each. For each case, call setupContainerHandlerTest to get svc, handler, r, register r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment), construct the request path using deploymentsPath + either containerPathInvalid, a random UUID, or a specific uuid for the ServiceError case (use id.String()), set the request body to nil, "invalid", or the JSON {"replicas":5} as appropriate, and set up mock expectations only for the ServiceError case via svc.On("ScaleDeployment", mock.Anything, id, 5).Return(errors.New(...)). In the loop assert the expected HTTP status (http.StatusBadRequest or http.StatusInternalServerError) and for cases that set mock expectations call svc.AssertExpectations(t) after serving the request; keep the shared setup/assertions inside the loop to avoid duplication and preserve existing behavior of handler.ScaleDeployment and scaleSuffix.
docs/swagger/swagger.yaml (1)
1980-1997: Tighten NodeGroup request validation to prevent under-specified calls.
Create currently only requires name, and update can be an empty object. That weakens API guarantees and pushes avoidable validation failures downstream.
Suggested schema hardening
  httphandlers.NodeGroupRequest:
    properties:
      desired_size:
        minimum: 0
        type: integer
      instance_type:
        type: string
      max_size:
        minimum: 0
        type: integer
      min_size:
        minimum: 0
        type: integer
      name:
+       minLength: 1
        type: string
    required:
+   - instance_type
+   - min_size
+   - max_size
    - name
    type: object
  httphandlers.UpdateNodeGroupRequest:
    properties:
      desired_size:
        minimum: 0
        type: integer
      max_size:
        minimum: 0
        type: integer
      min_size:
        minimum: 0
        type: integer
+   minProperties: 1
    type: object
Also applies to: 2085-2096, 2936-2951, 3011-3029
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 1980 - 1997, The NodeGroup request schema (httphandlers.NodeGroupRequest) is under-specified — only "name" is required so creates and updates can be empty; update the schema to enforce stronger validation: make instance_type and at least one size field required for create (add instance_type, min_size, max_size, desired_size to the required array for the create/POST schema or a dedicated NodeGroupCreate schema), and for update/PUT/PATCH either require minProperties: 1 or define a separate NodeGroupUpdate schema that allows partial fields but rejects an empty object; also ensure the numeric size fields keep their minimum: 0 constraint and add any cross-field validation notes (e.g., max_size >= min_size) where supported.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/database.md`:
- Around line 274-288: The documentation for the cluster_node_groups table is
missing two CHECK constraints present in the migration; update the CREATE TABLE
block for cluster_node_groups to include the constraints
chk_cluster_node_groups_min_le_max (CHECK (min_size <= max_size)) and
chk_cluster_node_groups_current_in_range (CHECK (current_size BETWEEN min_size
AND max_size)) so the docs match the migration (reference the migration's
constraint names and the cluster_node_groups table to locate where to add them).
In `@docs/swagger/swagger.json`:
- Around line 1501-1503: The OpenAPI change replaced the /clusters/{id}/scale
operation from POST to PUT which breaks clients; restore backward compatibility
by adding a POST operation for the same path (or keep POST as primary and add
PUT as an additional operation) with the same requestBody, parameters and
responses as the new PUT so both POST and PUT are supported; update the
operations for "/clusters/{id}/scale" in swagger.json (ensure request/response
schema names and operationId values are consistent or distinct but functionally
equivalent) so existing clients using POST continue to work while allowing PUT
going forward.
In `@docs/swagger/swagger.yaml`:
- Around line 2078-2084: The OpenAPI schema for
httphandlers.ScaleDeploymentRequest currently allows any integer for the
replicas property; update the schema to disallow negatives by adding a minimum:
0 constraint on the replicas property (and apply the same change to the other
ScaleDeploymentRequest definition at the second occurrence referenced). Locate
the ScaleDeploymentRequest schema and add the numeric validation (minimum: 0) to
the replicas property so the contract rejects negative replica counts before
handler logic runs.
In `@internal/api/setup/router.go`:
- Line 566: The SDK's ScaleDeployment currently sends a POST; change the
implementation of ScaleDeployment in pkg/sdk/container.go to use a PUT call
instead of c.post() (e.g., switch to the client's put method) so it matches the
router's PUT /deployments/:id/scale, and update the unit test in
pkg/sdk/container_test.go (the ScaleDeployment test) to assert the request uses
PUT (and adjust any mock server handler expectations from POST to PUT).
In `@internal/handlers/container_handler_test.go`:
- Line 243: Tests are silently ignoring errors from http.NewRequest; replace
instances like `req, _ := http.NewRequest("PUT",
deploymentsPath+containerPathInvalid+scaleSuffix, nil)` (and the other similar
calls) with explicit error handling: `req, err := http.NewRequest(...)` followed
by `if err != nil { t.Fatalf("failed to create request: %v", err) }` or
`require.NoError(t, err)` so test setup failures surface immediately; update all
occurrences of `http.NewRequest` in these subtests accordingly.
---
Outside diff comments:
In `@docs/swagger/swagger.yaml`:
- Around line 3110-3131: The API operation for /clusters/{id}/scale was changed
to PUT which is a client-breaking change for callers still using POST; either
support both verbs during a transition or bump the API version. Update
swagger.yaml so that the path /clusters/{id}/scale exposes both post and put
operations (reusing the same schema reference to
httphandlers.ScaleClusterRequest and responses) and mark the POST operation
deprecated in its description, or alternatively revert to POST and increment the
API version in the spec if you intend a breaking change; ensure documentation
and changelog note the deprecation/version change so clients can migrate.
In `@internal/workers/database_failover_worker_test.go`:
- Around line 162-178: Replace the inconsistent matcher and magic number in the
"No failover if primary healthy" case: use the test-scoped matcher anyCtx
instead of mock.Anything when setting expectations on repo or svc, and replace
the literal port 0 with a named constant (e.g., healthyPort or
DefaultHealthyPort) defined at the top of the test file so healthyPrimary.Port =
healthyPort; keep the rest of the logic around healthyPrimary,
NewDatabaseFailoverWorker, repo.On("List", ...) and worker.checkDatabases(...)
unchanged.
---
Duplicate comments:
In `@docs/swagger/docs.go`:
- Around line 9887-9903: The UpdateNodeGroupRequest schema
(httphandlers.UpdateNodeGroupRequest) currently allows an empty object; add a
constraint to require at least one field by adding "minProperties": 1 to that
schema so `{}` is invalid and at least one of desired_size, max_size, or
min_size must be present.
- Around line 9876-9886: The Swagger schema for
httphandlers.ScaleDeploymentRequest allows negative replicas; update the schema
for the "replicas" property to enforce non-negative integers by adding a minimum
constraint (e.g., "minimum": 0) to the replicas property in the
ScaleDeploymentRequest definition so clients and validation reject negative
values.
In `@docs/swagger/swagger.json`:
- Around line 9868-9877: The ScaleDeploymentRequest schema allows negative
replica counts; update the "httphandlers.ScaleDeploymentRequest" definition to
add a non-negative lower bound by adding "minimum": 0 to the "replicas" property
so clients validate that replicas cannot be negative; locate the "replicas"
property within the ScaleDeploymentRequest object in swagger.json and add the
minimum constraint to that property.
In `@internal/repositories/noop/adapters.go`:
- Around line 165-187: NoopClusterRepository is missing the NodeGroup methods
required by the ports.ClusterRepository interface; add the methods AddNodeGroup,
GetNodeGroups, UpdateNodeGroup, and DeleteNodeGroup to NoopClusterRepository
with matching signatures (context.Context plus domain.NodeGroup or IDs/return
types) so the struct implements ports.ClusterRepository (refer to
ports.ClusterRepository in internal/core/ports/cluster.go for exact signatures),
and implement them as no-ops/empty returns consistent with the other Noop
methods.
---
Nitpick comments:
In `@docs/swagger/swagger.yaml`:
- Around line 1980-1997: The NodeGroup request schema
(httphandlers.NodeGroupRequest) is under-specified — only "name" is required so
creates and updates can be empty; update the schema to enforce stronger
validation: make instance_type and at least one size field required for create
(add instance_type, min_size, max_size, desired_size to the required array for
the create/POST schema or a dedicated NodeGroupCreate schema), and for
update/PUT/PATCH either require minProperties: 1 or define a separate
NodeGroupUpdate schema that allows partial fields but rejects an empty object;
also ensure the numeric size fields keep their minimum: 0 constraint and add any
cross-field validation notes (e.g., max_size >= min_size) where supported.
In `@internal/handlers/container_handler_test.go`:
- Around line 238-270: TestContainerHandlerScaleError repeats three similar
subtests; refactor it into a single table-driven test that iterates over cases
(e.g., names "InvalidID", "InvalidJSON", "ServiceError") and runs t.Run for
each. For each case, call setupContainerHandlerTest to get svc, handler, r,
register r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment), construct
the request path using deploymentsPath + either containerPathInvalid, a random
UUID, or a specific uuid for the ServiceError case (use id.String()), set the
request body to nil, "invalid", or the JSON {"replicas":5} as appropriate, and
set up mock expectations only for the ServiceError case via
svc.On("ScaleDeployment", mock.Anything, id, 5).Return(errors.New(...)). In the
loop assert the expected HTTP status (http.StatusBadRequest or
http.StatusInternalServerError) and for cases that set mock expectations call
svc.AssertExpectations(t) after serving the request; keep the shared
setup/assertions inside the loop to avoid duplication and preserve existing
behavior of handler.ScaleDeployment and scaleSuffix.
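A rough table-driven shape for that refactor; the helper names (setupContainerHandlerTest, mockContainerService) and the path constants are taken from the prompt above and treated as assumptions about the existing test file:

func TestContainerHandlerScaleError(t *testing.T) {
	id := uuid.New()
	cases := []struct {
		name       string
		path       string
		body       string
		setupMock  func(svc *mockContainerService)
		wantStatus int
	}{
		{name: "InvalidID", path: deploymentsPath + containerPathInvalid + scaleSuffix, wantStatus: http.StatusBadRequest},
		{name: "InvalidJSON", path: deploymentsPath + "/" + id.String() + scaleSuffix, body: "invalid", wantStatus: http.StatusBadRequest},
		{name: "ServiceError", path: deploymentsPath + "/" + id.String() + scaleSuffix, body: `{"replicas":5}`,
			setupMock: func(svc *mockContainerService) {
				svc.On("ScaleDeployment", mock.Anything, id, 5).Return(assert.AnError)
			},
			wantStatus: http.StatusInternalServerError},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			svc, handler, r := setupContainerHandlerTest()
			r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment)
			if tc.setupMock != nil {
				tc.setupMock(svc)
			}
			req, err := http.NewRequest(http.MethodPut, tc.path, strings.NewReader(tc.body))
			require.NoError(t, err)
			w := httptest.NewRecorder()
			r.ServeHTTP(w, req)
			assert.Equal(t, tc.wantStatus, w.Code)
			svc.AssertExpectations(t) // mocks asserted per case
		})
	}
}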
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c1a92bf6-3e97-4a3c-b7aa-983937557b57
📒 Files selected for processing (10)
- docs/database.md
- docs/kubernetes.md
- docs/swagger/docs.go
- docs/swagger/swagger.json
- docs/swagger/swagger.yaml
- internal/api/setup/router.go
- internal/core/services/shared_test.go
- internal/handlers/container_handler_test.go
- internal/repositories/noop/adapters.go
- internal/workers/database_failover_worker_test.go
CREATE TABLE cluster_node_groups (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    min_size INTEGER NOT NULL DEFAULT 1,
    max_size INTEGER NOT NULL DEFAULT 10,
    current_size INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(cluster_id, name)
);

CREATE INDEX idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id);
Documentation missing CHECK constraints from the actual migration.
The migration file 098_create_cluster_node_groups.up.sql includes two CHECK constraints that enforce data integrity but aren't documented here:
- chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size)
- chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
Consider adding these constraints to the documentation for completeness.
📝 Suggested documentation update

```diff
     created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
     updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-    UNIQUE(cluster_id, name)
+    UNIQUE(cluster_id, name),
+    CONSTRAINT chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size),
+    CONSTRAINT chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
 );
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sql
CREATE TABLE cluster_node_groups (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    min_size INTEGER NOT NULL DEFAULT 1,
    max_size INTEGER NOT NULL DEFAULT 10,
    current_size INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(cluster_id, name),
    CONSTRAINT chk_cluster_node_groups_min_le_max CHECK (min_size <= max_size),
    CONSTRAINT chk_cluster_node_groups_current_in_range CHECK (current_size BETWEEN min_size AND max_size)
);

CREATE INDEX idx_cluster_node_groups_cluster_id ON cluster_node_groups(cluster_id);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/database.md` around lines 274 - 288, The documentation for the
cluster_node_groups table is missing two CHECK constraints present in the
migration; update the CREATE TABLE block for cluster_node_groups to include the
constraints chk_cluster_node_groups_min_le_max (CHECK (min_size <= max_size))
and chk_cluster_node_groups_current_in_range (CHECK (current_size BETWEEN
min_size AND max_size)) so the docs match the migration (reference the
migration's constraint names and the cluster_node_groups table to locate where
to add them).
| "/clusters/{id}/scale": { | ||
| "post": { | ||
| "put": { | ||
| "security": [ |
Avoid a breaking API change by keeping backward compatibility for cluster scale.
Line 1502 switches /clusters/{id}/scale from POST to PUT. This can break existing clients and generated SDKs that still call POST.
🔧 Suggested compatibility approach
"/clusters/{id}/scale": {
+ "post": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
+ "deprecated": true,
+ "description": "Backward-compatible alias for scale cluster workers. Use PUT.",
+ "tags": [
+ "K8s"
+ ],
+ "summary": "Scale cluster workers (deprecated)",
+ "parameters": [
+ {
+ "type": "string",
+ "description": "Cluster ID",
+ "name": "id",
+ "in": "path",
+ "required": true
+ },
+ {
+ "description": "Scale Request",
+ "name": "request",
+ "in": "body",
+ "required": true,
+ "schema": {
+ "$ref": "#/definitions/httphandlers.ScaleClusterRequest"
+ }
+ }
+ ],
+ "responses": {
+ "200": {
+ "description": "OK"
+ }
+ }
+ },
"put": {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "/clusters/{id}/scale": { | |
| "post": { | |
| "put": { | |
| "security": [ | |
| "/clusters/{id}/scale": { | |
| "post": { | |
| "security": [ | |
| { | |
| "APIKeyAuth": [] | |
| } | |
| ], | |
| "deprecated": true, | |
| "description": "Backward-compatible alias for scale cluster workers. Use PUT.", | |
| "tags": [ | |
| "K8s" | |
| ], | |
| "summary": "Scale cluster workers (deprecated)", | |
| "parameters": [ | |
| { | |
| "type": "string", | |
| "description": "Cluster ID", | |
| "name": "id", | |
| "in": "path", | |
| "required": true | |
| }, | |
| { | |
| "description": "Scale Request", | |
| "name": "request", | |
| "in": "body", | |
| "required": true, | |
| "schema": { | |
| "$ref": "#/definitions/httphandlers.ScaleClusterRequest" | |
| } | |
| } | |
| ], | |
| "responses": { | |
| "200": { | |
| "description": "OK" | |
| } | |
| } | |
| }, | |
| "put": { | |
| "security": [ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.json` around lines 1501 - 1503, The OpenAPI change
replaced the /clusters/{id}/scale operation from POST to PUT which breaks
clients; restore backward compatibility by adding a POST operation for the same
path (or keep POST as primary and add PUT as an additional operation) with the
same requestBody, parameters and responses as the new PUT so both POST and PUT
are supported; update the operations for "/clusters/{id}/scale" in swagger.json
(ensure request/response schema names and operationId values are consistent or
distinct but functionally equivalent) so existing clients using POST continue to
work while allowing PUT going forward.
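If the compatibility alias is adopted, the router side is a small change: register the old verb against the same handler for a deprecation window. A minimal sketch assuming a Gin-style router group; the function and route group names are illustrative, not existing identifiers:

```go
package setup

import "github.com/gin-gonic/gin"

// registerClusterScaleRoutes maps both verbs to the same scale handler so
// existing POST clients keep working while PUT becomes the documented method.
func registerClusterScaleRoutes(g *gin.RouterGroup, scale gin.HandlerFunc) {
	g.PUT("/clusters/:id/scale", scale)  // documented going forward
	g.POST("/clusters/:id/scale", scale) // deprecated alias; drop after a migration window
}
```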
```yaml
  httphandlers.ScaleDeploymentRequest:
    properties:
      replicas:
        type: integer
    required:
    - replicas
    type: object
```
Disallow negative replica counts in deployment scaling.
Line 2080 accepts any integer, so negative values are valid by contract and can leak into handler logic.
Suggested contract fix
```diff
   httphandlers.ScaleDeploymentRequest:
     properties:
       replicas:
+        minimum: 0
         type: integer
     required:
     - replicas
     type: object
```

Also applies to: 3156-3170
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.yaml` around lines 2078 - 2084, The OpenAPI schema for
httphandlers.ScaleDeploymentRequest currently allows any integer for the
replicas property; update the schema to disallow negatives by adding a minimum:
0 constraint on the replicas property (and apply the same change to the other
ScaleDeploymentRequest definition at the second occurrence referenced). Locate
the ScaleDeploymentRequest schema and add the numeric validation (minimum: 0) to
the replicas property so the contract rejects negative replica counts before
handler logic runs.
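Beyond the contract change, the handler can enforce the same bound so negative counts are rejected even when a client skips schema validation. A hedged sketch assuming a Gin-style handler; the request struct and helper shown are illustrative, not the repository's actual code:

```go
package httphandlers

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// ScaleDeploymentRequest uses a pointer so "required" still accepts an explicit
// scale-to-zero request, while gte=0 blocks negative replica counts.
type ScaleDeploymentRequest struct {
	Replicas *int `json:"replicas" binding:"required,gte=0"`
}

// bindScaleRequest shows the defensive bind; a real handler would go on to
// parse the :id path parameter and call the container service.
func bindScaleRequest(c *gin.Context) (*ScaleDeploymentRequest, bool) {
	var req ScaleDeploymentRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return nil, false
	}
	return &req, true
}
```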
| containerGroup.GET("/deployments", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.ListDeployments) | ||
| containerGroup.GET("/deployments/:id", httputil.Permission(svcs.RBAC, domain.PermissionContainerRead), handlers.Container.GetDeployment) | ||
| containerGroup.POST("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment) | ||
| containerGroup.PUT("/deployments/:id/scale", httputil.Permission(svcs.RBAC, domain.PermissionContainerUpdate), handlers.Container.ScaleDeployment) |
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check for SDK methods calling container deployment scale
rg -n 'deployments.*scale|ScaleDeployment' pkg/sdk/ --type=go -C2
```

Repository: poyrazK/thecloud

Length of output: 1372

🏁 Script executed:

```bash
#!/bin/bash
# Also check the router file itself to see the current state
cat -n internal/api/setup/router.go | grep -A5 -B5 "deployments.*scale"
```

Repository: poyrazK/thecloud

Length of output: 1021

🏁 Script executed:

```bash
#!/bin/bash
# Search for any references to this endpoint across the codebase
rg -n 'ScaleDeployment|deployments.*scale' --type=go
```

Repository: poyrazK/thecloud

Length of output: 4066
Update SDK client to use PUT for container deployment scaling.
The endpoint has been changed to PUT (verified in router.go line 566 and handler tests), but the SDK client in pkg/sdk/container.go still uses c.post() at line 54. Update ScaleDeployment() to use PUT and update the corresponding test in pkg/sdk/container_test.go to verify the PUT method instead of POST. Without this change, SDK clients will fail when attempting to scale deployments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/api/setup/router.go` at line 566, The SDK's ScaleDeployment
currently sends a POST; change the implementation of ScaleDeployment in
pkg/sdk/container.go to use a PUT call instead of c.post() (e.g., switch to the
client's put method) so it matches the router's PUT /deployments/:id/scale, and
update the unit test in pkg/sdk/container_test.go (the ScaleDeployment test) to
assert the request uses PUT (and adjust any mock server handler expectations
from POST to PUT).
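The shape of that SDK change might look like the sketch below. It assumes the package's existing context and fmt imports; the c.put helper, path prefix, and method signature are assumptions mirroring the post-based call the review found, not confirmed APIs:

```go
// ScaleDeployment sends the scale request with PUT so it matches the router's
// PUT /deployments/:id/scale registration. The c.put helper is an assumed
// mirror of the existing c.post helper; the actual client may differ.
func (c *Client) ScaleDeployment(ctx context.Context, deploymentID string, replicas int) error {
	body := map[string]int{"replicas": replicas}
	return c.put(ctx, fmt.Sprintf("/deployments/%s/scale", deploymentID), body, nil)
}
```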
```diff
-	r.POST(deploymentsPath+scalePath, handler.ScaleDeployment)
-	req, _ := http.NewRequest("POST", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+	r.PUT(deploymentsPath+scalePath, handler.ScaleDeployment)
+	req, _ := http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
```
Don’t ignore request construction errors in these subtests.
Using req, _ := http.NewRequest(...) can silently mask test setup failures.
Suggested fix
- req, _ := http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+containerPathInvalid+scaleSuffix, nil)
+ require.NoError(t, err)
- req, _ := http.NewRequest("PUT", deploymentsPath+"/"+uuid.NewString()+scaleSuffix, bytes.NewBufferString("invalid"))
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+"/"+uuid.NewString()+scaleSuffix, bytes.NewBufferString("invalid"))
+ require.NoError(t, err)
- req, _ := http.NewRequest("PUT", deploymentsPath+"/"+id.String()+scaleSuffix, bytes.NewBuffer(body))
+ req, err := http.NewRequest(http.MethodPut, deploymentsPath+"/"+id.String()+scaleSuffix, bytes.NewBuffer(body))
+ require.NoError(t, err)As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()."
Also applies to: 252-252, 264-264
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/handlers/container_handler_test.go` at line 243, Tests are silently
ignoring errors from http.NewRequest; replace instances like `req, _ :=
http.NewRequest("PUT", deploymentsPath+containerPathInvalid+scaleSuffix, nil)`
(and the other similar calls) with explicit error handling: `req, err :=
http.NewRequest(...)` followed by `if err != nil { t.Fatalf("failed to create
request: %v", err) }` or `require.NoError(t, err)` so test setup failures
surface immediately; update all occurrences of `http.NewRequest` in these
subtests accordingly.
This PR introduces the Kubernetes Cluster Autoscaler integration using the gRPC External Provider pattern.
Key Changes:
Verified via local builds and repository unit tests.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Changes