feat: Managed Kubernetes (KaaS) with HA and Task Queue #29
Conversation
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/repositories/k8s/provisioner.go`:
- Around line 97-100: The call checks masterIP but then passes nil into
p.failCluster which can produce a `%!w(<nil>)` when wrapped; instead create an
explicit error (e.g., errors.New or fmt.Errorf with a clear message like "master
node failed to get an IP") and pass that error into p.failCluster(ctx, cluster,
"master node failed to get an IP", err) so p.failCluster receives a real error
object; update the branch after masterIP := p.waitForIP(ctx, master.ID)
accordingly and ensure the new error is used in the call.
♻️ Duplicate comments (6)
internal/repositories/k8s/provisioner.go (6)
101-103: Handle repository update failures when persisting control plane IPs. Ignoring this error can leave the DB out of sync with the in-memory state. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 cluster.ControlPlaneIPs = []string{masterIP}
-_ = p.repo.Update(ctx, cluster)
+if err := p.repo.Update(ctx, cluster); err != nil {
+	return "", p.failCluster(ctx, cluster, "failed to persist control plane IPs", err)
+}
```
120-127: Don't persist plaintext kubeconfig and handle update errors. Storing unencrypted kubeconfig is a security risk, and the update error is ignored. As per coding guidelines, avoid silent failures.
🔒 Proposed fix
```diff
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
 if err != nil {
-	p.logger.Error("failed to encrypt kubeconfig", "cluster_id", cluster.ID, "error", err)
-	cluster.Kubeconfig = kubeconfig
-} else {
-	cluster.Kubeconfig = encryptedKubeconfig
+	return "", p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 }
-_ = p.repo.Update(ctx, cluster)
+cluster.Kubeconfig = encryptedKubeconfig
+if err := p.repo.Update(ctx, cluster); err != nil {
+	return "", p.failCluster(ctx, cluster, "failed to persist kubeconfig", err)
+}
```
132-160: Don’t silently ignore worker status updates or all‑worker failures.
`_ = p.updateNodeStatus(...)` hides errors, and the method succeeds even if all workers fail. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 func (p *KubeadmProvisioner) provisionWorkers(ctx context.Context, cluster *domain.Cluster, joinCmd string) (string, error) {
+	successCount := 0
 	for i := 0; i < cluster.WorkerCount; i++ {
 @@
 		if err := p.joinCluster(ctx, cluster, workerIP, joinCmd); err != nil {
 			p.logger.Error("failed to join worker to cluster", "ip", workerIP, "error", err)
-			_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed")
+			if updateErr := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed"); updateErr != nil {
+				p.logger.Error("failed to update node status", "instance_id", worker.ID, "error", updateErr)
+			}
 			continue
 		}
-		_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "active")
+		if err := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "active"); err != nil {
+			p.logger.Error("failed to update node status", "instance_id", worker.ID, "error", err)
+			continue
+		}
+		successCount++
 	}
+	if successCount == 0 && cluster.WorkerCount > 0 {
+		return joinCmd, fmt.Errorf("no workers successfully provisioned")
+	}
 	return joinCmd, nil
 }
```
200-201: Handle repo update failures when marking the cluster running. If this update fails, the cluster is running but the DB says otherwise. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 cluster.Status = domain.ClusterStatusRunning
-_ = p.repo.Update(ctx, cluster)
+if err := p.repo.Update(ctx, cluster); err != nil {
+	p.logger.Error("failed to update cluster status to running", "cluster_id", cluster.ID, "error", err)
+	return fmt.Errorf("failed to persist cluster running status: %w", err)
+}
```
246-252: Don't ignore DeleteNode errors during deprovisioning. Ignoring this can leave orphaned DB records. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 for _, node := range nodes {
 	p.logger.Info("terminating node instance", "instance_id", node.InstanceID)
 	if err := p.instSvc.TerminateInstance(ctx, node.InstanceID.String()); err != nil {
 		p.logger.Error("failed to terminate node instance", "instance_id", node.InstanceID, "error", err)
 	}
-	_ = p.repo.DeleteNode(ctx, node.ID)
+	if err := p.repo.DeleteNode(ctx, node.ID); err != nil {
+		p.logger.Error("failed to delete node record", "node_id", node.ID, "error", err)
+	}
 }
```
266-268: Persisting failed status must check the update error. This is a critical state transition; ignoring the error breaks errcheck and can hide a failed persistence. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 func (p *KubeadmProvisioner) failCluster(ctx context.Context, cluster *domain.Cluster, msg string, err error) error {
 	cluster.Status = domain.ClusterStatusFailed
-	_ = p.repo.Update(ctx, cluster)
+	if updateErr := p.repo.Update(ctx, cluster); updateErr != nil {
+		p.logger.Error("failed to update cluster status to failed", "cluster_id", cluster.ID, "error", updateErr)
+	}
 	p.logger.Error(msg, "cluster_id", cluster.ID, "error", err)
 	return fmt.Errorf("%s: %w", msg, err)
 }
```
🧹 Nitpick comments (1)
internal/repositories/k8s/provisioner.go (1)
229-235: Prefer `strings.Index` over the custom `mathStringsIndex`. This reduces custom code and leverages the standard library's tested implementation.
♻️ Proposed refactor
```diff
 import (
 	"context"
 	"fmt"
 	"log/slog"
+	"strings"
 @@
-	if idx := mathStringsIndex(instIP, "/"); idx != -1 {
+	if idx := strings.Index(instIP, "/"); idx != -1 {
 		instIP = instIP[:idx]
 	}
 @@
-func mathStringsIndex(s, substr string) int {
-	for i := 0; i < len(s)-len(substr)+1; i++ {
-		if s[i:i+len(substr)] == substr {
-			return i
-		}
-	}
-	return -1
-}
```
```go
masterIP := p.waitForIP(ctx, master.ID)
if masterIP == "" {
	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
}
```
🧩 Analysis chain
🏁 Scripts executed:
```bash
# First, let's read the provisioner.go file around lines 97-100
head -110 internal/repositories/k8s/provisioner.go | tail -30

# Find the failCluster function to understand how it uses the error parameter
grep -n "func.*failCluster" internal/repositories/k8s/provisioner.go

# Get more context around the failCluster function
grep -A 10 "func.*failCluster" internal/repositories/k8s/provisioner.go

# Check the imports to see if errors is already imported
head -30 internal/repositories/k8s/provisioner.go | grep -E "^import|\"errors\""

# Get the full import block
head -20 internal/repositories/k8s/provisioner.go

# Also verify if there are any other nil error passes to failCluster
grep -n "failCluster.*nil" internal/repositories/k8s/provisioner.go
```
Avoid wrapping a nil error in failCluster.
Passing nil into fmt.Errorf("%s: %w", ...) produces %!w(<nil>) and obscures the cause. Create an explicit error before calling failCluster instead.
🐛 Proposed fix
```diff
 import (
 	"context"
+	"errors"
 	"fmt"
 	"log/slog"
```
```diff
 masterIP := p.waitForIP(ctx, master.ID)
 if masterIP == "" {
-	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
+	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", errors.New("master node has no assigned IP"))
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	...
)

masterIP := p.waitForIP(ctx, master.ID)
if masterIP == "" {
	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", errors.New("master node has no assigned IP"))
}
```
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/provisioner.go` around lines 97 - 100, The call
checks masterIP but then passes nil into p.failCluster which can produce a
`%!w(<nil>)` when wrapped; instead create an explicit error (e.g., errors.New or
fmt.Errorf with a clear message like "master node failed to get an IP") and pass
that error into p.failCluster(ctx, cluster, "master node failed to get an IP",
err) so p.failCluster receives a real error object; update the branch after
masterIP := p.waitForIP(ctx, master.ID) accordingly and ensure the new error is
used in the call.
Pull request overview
Copilot reviewed 70 out of 71 changed files in this pull request and generated 2 comments.
```go
config := &ssh.ClientConfig{
	User:            c.User,
	Auth:            c.Auth,
	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
```
The SSH client uses InsecureIgnoreHostKey which skips host key verification. This is a security risk in production environments as it makes the connection vulnerable to man-in-the-middle attacks. Consider implementing proper host key verification or at least documenting this security trade-off.
```go
handleRBACMock(w, r, method, path) ||
	handleDatabasesMock(w, r, method, path) ||
	handleSecretsMock(w, r, method, path)
// handleDatabasesMock(w, r, method, path) || // Duplicate removed
```
Leftover "duplicate removed" comment. The line "handleDatabasesMock(w, r, method, path) ||" appears to have been accidentally duplicated and then removed, but the comment marking it as a duplicate should also be removed for cleaner code.
```go
// handleDatabasesMock(w, r, method, path) || // Duplicate removed
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@internal/repositories/k8s/health.go`:
- Around line 197-205: Guard against empty kubectl output before splitting:
after calling exec.Run(ctx, kubectlBase+" get nodes --no-headers") and obtaining
nodesOut, check if strings.TrimSpace(nodesOut) == "" and if so set
health.NodesTotal = 0 and skip the readyCount loop (return or continue),
otherwise split into lines and compute readyCount as currently implemented;
update the logic around the nodesOut/lines handling (the section referencing
nodesOut, health.NodesTotal and the readyCount loop) so you don't treat an empty
string as a single phantom node.
In `@internal/repositories/k8s/lifecycle.go`:
- Around line 187-203: The current flow uses exec.Run to base64-encode the
entire snapshot into a string (b64Output) and then decodes it into a byte slice
(snapshotData) before calling p.storageSvc.Upload, which buffers the whole
snapshot in memory and risks OOM/truncation for multi-GB etcd backups; instead
refactor to stream the snapshot through an io.Reader to p.storageSvc.Upload (or
generate a presigned URL and run curl on the node) so data is piped without full
buffering: replace the exec.Run(base64 ...) + base64.DecodeString +
bytes.NewReader(snapshotData) pattern with a streaming approach (e.g., an
exec/Executor method that returns an io.ReadCloser for tempPath or a remote curl
upload using a presigned URL) and call p.storageSvc.Upload with that reader,
updating usages of exec.Run, tempPath, b64Output, snapshotData and
p.storageSvc.Upload accordingly.
♻️ Duplicate comments (4)
cmd/cloud/cli_commands_test.go (1)
58-58: Remove the commented-out code. This commented line is a leftover artifact from removing duplicates and should be deleted for cleaner code.
pkg/sshutil/client.go (2)
39-43: Avoid `InsecureIgnoreHostKey` in production.
This disables host key verification and enables MITM. Consider `knownhosts` or a configurable callback (already flagged by security tooling).
Also applies to: 86-90
39-49: Replace hard-coded SSH port/timeouts with shared constants.
There are still literal `"22"` and `10 * time.Second` values. Consolidate them as named constants and reuse across Run/WriteFile/WaitForSSH.
♻️ Proposed fix
```diff
-const (
-	sshWaitInterval = 2 * time.Second
-	sshDialTimeout  = 2 * time.Second
-)
+const (
+	defaultSSHPort   = "22"
+	sshClientTimeout = 10 * time.Second
+	sshWaitInterval  = 2 * time.Second
+	sshDialTimeout   = 2 * time.Second
+)
```
```diff
 config := &ssh.ClientConfig{
 	User:            c.User,
 	Auth:            c.Auth,
 	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
-	Timeout:         10 * time.Second,
+	Timeout:         sshClientTimeout,
 }
 addr := c.Host
 if _, _, err := net.SplitHostPort(addr); err != nil {
-	addr = net.JoinHostPort(addr, "22")
+	addr = net.JoinHostPort(addr, defaultSSHPort)
 }
```
```diff
 addr := c.Host
 if _, _, err := net.SplitHostPort(addr); err != nil {
-	addr = net.JoinHostPort(addr, "22")
+	addr = net.JoinHostPort(addr, defaultSSHPort)
 }
```
As per coding guidelines, use named constants for timeouts and ports.
Also applies to: 86-96, 161-178
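For the two `pkg/sshutil` comments above, a minimal sketch of the combined direction, assuming the client keeps its existing `User`/`Auth`/`Host` fields; the constant names, function names, and the `known_hosts` path handling are illustrative, not the repo's current API:

```go
package sshutil

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/knownhosts"
)

const (
	defaultSSHPort   = "22"
	sshClientTimeout = 10 * time.Second
)

// newClientConfig verifies host keys against a known_hosts file instead of
// using ssh.InsecureIgnoreHostKey, and reuses the shared timeout constant.
func newClientConfig(user string, auth []ssh.AuthMethod, knownHostsPath string) (*ssh.ClientConfig, error) {
	hostKeyCallback, err := knownhosts.New(knownHostsPath)
	if err != nil {
		return nil, err
	}
	return &ssh.ClientConfig{
		User:            user,
		Auth:            auth,
		HostKeyCallback: hostKeyCallback,
		Timeout:         sshClientTimeout,
	}, nil
}

// normalizeAddr appends the default SSH port when the host has none.
func normalizeAddr(host string) string {
	if _, _, err := net.SplitHostPort(host); err != nil {
		return net.JoinHostPort(host, defaultSSHPort)
	}
	return host
}
```

A configurable callback (for example, falling back to a more permissive policy only in local development) keeps tests workable while removing the MITM exposure in production.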
internal/repositories/k8s/health.go (1)
83-116: Use instance IDs for termination and don't ignore status-update errors.
`TerminateInstance` is called with `worker.ID` (cluster node ID), which likely isn't the cloud instance ID; this can leave orphaned instances. Also `_ = p.updateNodeStatus(...)` silently discards errors and violates the no-silent-failure guideline. Based on learnings, avoid blank-identifier error suppression.
✅ Suggested fix
```diff
-	if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil {
+	if termErr := p.instSvc.TerminateInstance(ctx, worker.InstanceID.String()); termErr != nil {
 		p.logger.Error("failed to terminate failed bootstrap instance", "instance_id", worker.ID, "error", termErr)
 	}
 @@
 	if err := p.joinCluster(ctx, cluster, ip, joinCmd); err != nil {
 		errs = append(errs, fmt.Errorf("failed to join node %s to cluster: %w", workerName, err))
-		_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed")
+		if err := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed"); err != nil {
+			p.logger.Error("failed to update node status to failed", "node_id", worker.ID, "error", err)
+			errs = append(errs, fmt.Errorf("failed to mark node %s as failed: %w", workerName, err))
+		}
 		continue
 	}
```
Run this to confirm `waitForIP`/`createNode` expectations and ensure the correct ID is used:
```bash
#!/bin/bash
rg -n "func \\(p \\*KubeadmProvisioner\\) waitForIP\\(" -C3
rg -n "func \\(p \\*KubeadmProvisioner\\) createNode\\(" -C3
rg -n "TerminateInstance\\(" -C3
```
```go
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
if err == nil {
	lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
	health.NodesTotal = len(lines)
	readyCount := 0
	for _, line := range lines {
		if strings.Contains(line, " Ready ") {
			readyCount++
		}
```
Handle empty kubectl get nodes output to avoid a phantom node.
When nodesOut is empty, strings.Split("", "\n") yields a single empty line, so NodesTotal becomes 1. Guard on empty output before splitting.
🔧 Suggested fix
```diff
-	if err == nil {
-		lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
-		health.NodesTotal = len(lines)
+	if err == nil {
+		trimmed := strings.TrimSpace(nodesOut)
+		if trimmed == "" {
+			health.NodesTotal = 0
+			return health, nil
+		}
+		lines := strings.Split(trimmed, "\n")
+		health.NodesTotal = len(lines)
 		readyCount := 0
 		for _, line := range lines {
 			if strings.Contains(line, " Ready ") {
 				readyCount++
 			}
 		}
 		health.NodesReady = readyCount
 	}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
if err == nil {
	trimmed := strings.TrimSpace(nodesOut)
	if trimmed == "" {
		health.NodesTotal = 0
		return health, nil
	}
	lines := strings.Split(trimmed, "\n")
	health.NodesTotal = len(lines)
	readyCount := 0
	for _, line := range lines {
		if strings.Contains(line, " Ready ") {
			readyCount++
		}
	}
	health.NodesReady = readyCount
}
```
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/health.go` around lines 197 - 205, Guard against
empty kubectl output before splitting: after calling exec.Run(ctx, kubectlBase+"
get nodes --no-headers") and obtaining nodesOut, check if
strings.TrimSpace(nodesOut) == "" and if so set health.NodesTotal = 0 and skip
the readyCount loop (return or continue), otherwise split into lines and compute
readyCount as currently implemented; update the logic around the nodesOut/lines
handling (the section referencing nodesOut, health.NodesTotal and the readyCount
loop) so you don't treat an empty string as a single phantom node.
```go
// 2. Extract snapshot as Base64 to handle binary data through the executor
p.logger.Info("uploading snapshot to remote storage", "cluster_id", cluster.ID)
b64Output, err := exec.Run(ctx, fmt.Sprintf("base64 -w 0 %s", tempPath))
if err != nil {
	return fmt.Errorf("failed to read snapshot: %w", err)
}

snapshotData, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64Output))
if err != nil {
	return fmt.Errorf("failed to decode base64 snapshot: %w", err)
}

bucket := "k8s-backups"
key := fmt.Sprintf("%s/snapshot-%d.db", cluster.ID, time.Now().Unix())
_, err = p.storageSvc.Upload(ctx, bucket, key, bytes.NewReader(snapshotData))
if err != nil {
	return fmt.Errorf("failed to upload snapshot to storage: %w", err)
```
🧩 Analysis chain
🏁 Scripts executed:
```bash
# First, find and read the file to confirm the code matches
find . -name "lifecycle.go" -path "*/k8s/*" -type f

# Read the specific lines mentioned in the review
sed -n '187,203p' internal/repositories/k8s/lifecycle.go

# Get broader context around these lines to understand the function
sed -n '170,220p' internal/repositories/k8s/lifecycle.go

# Search for the exec.Run implementation to understand how it handles output
rg "func.*Run\(" --type go -A 10 | head -100

# Check if there are any constants or limits defined for this operation
rg "SSH|buffer|snapshot" --type go internal/repositories/k8s/ -C 2 | head -80

# Look at the sshutil.Client.Run implementation in detail
cat -n pkg/sshutil/client.go | grep -A 30 "func (c \*Client) Run"

# Check for any buffer size limits in SSH client
rg "maxSize|BufferSize|limit|buffer" --type go pkg/sshutil/ -C 3

# Look at the StorageService interface to understand Upload expectations
rg "type StorageService|Upload.*func" --type go -A 5 | head -50

# Get the complete sshutil.Client.Run implementation
cat -n pkg/sshutil/client.go | sed -n '38,80p'

# Check if there's any documentation or comments about SSH output handling
cat pkg/sshutil/client.go | head -100

# Check the interface to understand the expected behavior
rg "type NodeExecutor|Run\(ctx" --type go internal/repositories/k8s/ -A 2
```
Refactor snapshot upload to avoid buffering large etcd backups in memory.
The entire snapshot is base64-encoded through SSH (buffered in bytes.Buffer), returned as a string, then fully decoded into memory before upload. etcd snapshots can be several GB, and base64 encoding inflates the size by 33%; this risks exceeding SSH message limits, memory exhaustion, or truncation. The StorageService.Upload already accepts io.Reader for streaming, so instead:
- Generate a presigned URL on the storage backend and have the node upload directly via `curl` or similar
- Implement a streaming upload endpoint or executor method that pipes snapshot data without buffering the entire output
This avoids double memory allocation and leverages the streaming capabilities already designed into the architecture.
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/lifecycle.go` around lines 187 - 203, The current
flow uses exec.Run to base64-encode the entire snapshot into a string
(b64Output) and then decodes it into a byte slice (snapshotData) before calling
p.storageSvc.Upload, which buffers the whole snapshot in memory and risks
OOM/truncation for multi-GB etcd backups; instead refactor to stream the
snapshot through an io.Reader to p.storageSvc.Upload (or generate a presigned
URL and run curl on the node) so data is piped without full buffering: replace
the exec.Run(base64 ...) + base64.DecodeString + bytes.NewReader(snapshotData)
pattern with a streaming approach (e.g., an exec/Executor method that returns an
io.ReadCloser for tempPath or a remote curl upload using a presigned URL) and
call p.storageSvc.Upload with that reader, updating usages of exec.Run,
tempPath, b64Output, snapshotData and p.storageSvc.Upload accordingly.
…ests, and fix SonarQube duplication warnings
Pull request overview
Copilot reviewed 74 out of 75 changed files in this pull request and generated no new comments.
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@cmd/api/main.go`:
- Around line 30-56: The package-level DI globals (loadConfigFunc,
initDatabaseFunc, ..., newHTTPServer, startHTTPServer, shutdownHTTPServer,
notifySignals) should be removed and replaced with a single dependency container
passed into main/initInfrastructure/initBackends/runApplication; define an
AppDeps (or Options) struct that holds those function fields and inject it into
SetupRouter/initRepositories/initServices/initHandlers and any functions that
currently reference the globals; update
initInfrastructure/initBackends/runApplication signatures to accept AppDeps and
thread the deps through instead of using globals so tests can construct AppDeps
with stubbed functions (including
newHTTPServer/startHTTPServer/shutdownHTTPServer/notifySignals) and no
package-level mutable state remains.
In `@internal/core/domain/cluster.go`:
- Around line 34-46: The Cluster struct currently exposes sensitive fields
SSHKey and Kubeconfig via JSON tags; change their tags to json:"-" or remove
them from the public serialization so secrets are not emitted when Cluster is
marshaled or logged. Locate the Cluster type (fields SSHKey and Kubeconfig) and
update their struct tags to json:"-" or refactor callers to return a dedicated
DTO/response type (e.g., ClusterResponse) that omits those fields; ensure any
existing JSON usage (marshal/unmarshal, handlers returning Cluster) is updated
to use the safe DTO or otherwise avoid serializing SSHKey/Kubeconfig.
In `@internal/core/services/cluster.go`:
- Around line 39-43: Change NewClusterService to avoid panicking on nil
TaskQueue: update the constructor signature to return (*ClusterService, error)
and validate params.TaskQueue, returning a descriptive error when it's nil
instead of calling panic. Ensure callers are updated to handle the error; keep
the ClusterService and ClusterServiceParams types unchanged and perform the nil
check inside NewClusterService (or a helper like validateClusterServiceParams)
so construction fails cleanly with an error rather than crashing the process.
- Around line 96-99: If s.taskQueue.Enqueue(ctx, "k8s_jobs", job) fails you must
not silently return success; change the code to mark the cluster as failed and
return an error to the caller. On Enqueue error, call your cluster status
updater (e.g. s.store.UpdateClusterStatus or equivalent) to set cluster.ID to a
failed state with the enqueue error message, log the failure via s.logger.Error
with the error, and return a non-nil error (wrap the original err) so the caller
knows provisioning did not complete; alternatively persist a job record for the
worker to sweep if you prefer retry semantics, but do not leave the cluster
permanently pending. (A minimal sketch of this failure-handling pattern follows the cluster.go items below.)
- Around line 63-81: The private SSH key returned by crypto.GenerateSSHKeyPair()
is being stored raw on the domain.Cluster (cluster.SSHKey); instead, call your
SecretService to encrypt or store the secret and persist only the ciphertext or
secret reference: use SecretService.Encrypt/Store (or equivalent) right after
privKey is generated, handle and return any errors from that call, set
cluster.SSHKey to the returned ciphertext/secretID, and ensure any code that
needs the raw key uses SecretService.Decrypt/Fetch when required; update error
messages and types around GenerateSSHKeyPair and cluster creation to reflect
encryption failures.
- Around line 165-170: The current code calls s.secretSvc.Decrypt and on error
logs via s.logger.Error but returns cluster.Kubeconfig, which can expose
ciphertext and yields an unusable value; change the error handling in the
function that calls s.secretSvc.Decrypt so that it returns an error instead of
cluster.Kubeconfig (e.g., return empty string and propagate or wrap the Decrypt
error), and keep the s.logger.Error call for context using cluster.ID and the
error from s.secretSvc.Decrypt; update any callers if necessary to handle the
propagated error.
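For the Enqueue-failure item above (the cluster.go prompt for lines 96-99), a minimal sketch of the failure handling, assuming the service's existing `repo`, `taskQueue`, and `logger` fields; the helper name and the job payload type are illustrative:

```go
// enqueueProvisionJob submits the provisioning job and surfaces failures
// instead of silently leaving the cluster stuck in a pending state.
func (s *ClusterService) enqueueProvisionJob(ctx context.Context, cluster *domain.Cluster, job any) error {
	if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
		// Mark the cluster failed so callers and operators can see why provisioning never ran.
		cluster.Status = domain.ClusterStatusFailed
		if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
			s.logger.Error("failed to mark cluster failed after enqueue error",
				"cluster_id", cluster.ID, "error", updateErr)
		}
		s.logger.Error("failed to enqueue provisioning job", "cluster_id", cluster.ID, "error", err)
		return fmt.Errorf("failed to enqueue provisioning job: %w", err)
	}
	return nil
}
```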
In `@internal/repositories/k8s/ha.go`:
- Around line 76-79: The code path checking ip := p.waitForIP(ctx, node.ID)
should not pass nil into failCluster when ip == ""; create an explicit error
(e.g., using fmt.Errorf or errors.New) describing the missing master IP for the
given node/name and pass that error into p.failCluster(ctx, cluster, "timeout
waiting for IP for master "+name, err) instead of nil; update the block around
waitForIP and the call to failCluster so the real error is logged/propagated.
- Around line 132-134: parseJoinCommands can return empty strings and you must
validate its output before using it; after calling p.parseJoinCommands(out)
check that joinCmd and cpJoinCmd are non-empty and return a clear error (or fail
early) instead of proceeding to call joinHAMasters with empty commands. Update
the caller around parseJoinCommands (the variables joinCmd and cpJoinCmd) to
validate both values, surface a descriptive error if either is empty, and avoid
executing further steps (including invoking joinHAMasters) when validation
fails.
- Around line 83-85: The AddTarget error is being ignored with p.logger.Warn
which can leave masters unreachable; change the behavior in the block calling
p.lbSvc.AddTarget (the call referencing p.lbSvc.AddTarget(ctx, lb.ID, node.ID,
6443, 10) and currently logging via p.logger.Warn) to fail fast by returning a
wrapped error (including lb.ID and node.ID) from the enclosing function instead
of only warning; alternatively implement a retry with backoff (respecting ctx)
around p.lbSvc.AddTarget and only return an error after retries are
exhausted—ensure callers of this enclosing function handle the returned error.
- Around line 22-30: The code currently sets lbIP from lb.IP, sleeps once and
tries p.lbSvc.Get(ctx, lb.ID) once, but may still leave lbIP empty; update the
logic in the block around lbIP, p.lbSvc.Get and cluster.APIServerLBAddress to
perform a bounded retry loop (e.g., max N attempts with a short delay) that
re-queries p.lbSvc.Get(ctx, lb.ID) until a non-empty IP is returned, and if
after the retries lbIP is still empty return or propagate an error (do not set
cluster.APIServerLBAddress to an empty value); reference the existing lbIP
variable, the p.lbSvc.Get call, lb.ID, and cluster.APIServerLBAddress when
implementing the retry-and-fail behavior.
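For the ha.go load-balancer IP item directly above, a minimal sketch of a bounded retry; the attempt count, delay, and helper name are illustrative, while `lbSvc.Get(ctx, id)` matches the signature mocked later in this thread:

```go
const (
	lbIPMaxAttempts = 10
	lbIPRetryDelay  = 3 * time.Second
)

// waitForLBAddress polls the load balancer until it reports an IP or the
// attempts are exhausted, instead of accepting an empty address.
func (p *KubeadmProvisioner) waitForLBAddress(ctx context.Context, lbID uuid.UUID) (string, error) {
	for attempt := 0; attempt < lbIPMaxAttempts; attempt++ {
		lb, err := p.lbSvc.Get(ctx, lbID)
		if err == nil && lb.IP != "" {
			return lb.IP, nil
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(lbIPRetryDelay):
		}
	}
	return "", fmt.Errorf("load balancer %s did not report an IP after %d attempts", lbID, lbIPMaxAttempts)
}
```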
In `@internal/workers/cluster_worker.go`:
- Around line 59-62: The loop currently spawns unbounded goroutines at the line
calling "go w.processJob(job)" which can exhaust resources; replace the
unbounded spawn with a bounded approach—either call w.processJob(job) inline
(remove the "go" to process synchronously) or implement a semaphore/worker-pool
in the worker (add a buffered channel semaphore on the worker struct and acquire
before launching a goroutine and release after processJob returns) so that
the calls to w.processJob and logger.Info remain but job processing is limited to a
fixed concurrency. (A sketch of a semaphore-bounded loop follows this list.)
- Around line 44-51: The loop currently tight-spins when
w.taskQueue.Dequeue(ctx, "k8s_jobs") returns an empty msg or when it errors, so
introduce a named constant (e.g., queuePollBackoff) for the backoff duration and
use it in both the error case and the msg=="" case instead of time.Sleep(1 *
time.Second) and a bare continue; replace the magic number with this constant
and call time.Sleep(queuePollBackoff) before continuing to avoid pegging the
CPU; update any surrounding comments to clarify the shared backoff usage.
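For the two cluster_worker.go items above, a rough sketch of a buffered-channel semaphore combined with a named poll backoff; the receiver type, field names, constants, and the assumption that Dequeue returns a string message are illustrative:

```go
const (
	queuePollBackoff  = 1 * time.Second
	maxConcurrentJobs = 4
)

// run polls the queue with a shared backoff and bounds job concurrency with a
// buffered-channel semaphore instead of spawning unbounded goroutines.
func (w *ClusterWorker) run(ctx context.Context) {
	sem := make(chan struct{}, maxConcurrentJobs)
	for {
		if ctx.Err() != nil {
			return
		}
		msg, err := w.taskQueue.Dequeue(ctx, "k8s_jobs")
		if err != nil || msg == "" {
			time.Sleep(queuePollBackoff) // shared backoff for both the error and empty-poll cases
			continue
		}
		sem <- struct{}{} // acquire a slot; blocks while maxConcurrentJobs are in flight
		go func(job string) {
			defer func() { <-sem }() // release the slot when the job finishes
			w.processJob(job)
		}(msg)
	}
}
```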
♻️ Duplicate comments (4)
internal/repositories/k8s/node_executor.go (1)
73-88: Timeout context is not used for readiness probes
`ctxTimeout` is created, but `e.Run` uses the parent `ctx`, so a hung exec can ignore the timeout. Use `ctxTimeout` for the readiness probe. As per coding guidelines, propagate context to blocking calls.
🛠️ Proposed fix
```diff
-	_, err := e.Run(ctx, "echo ready")
+	_, err := e.Run(ctxTimeout, "echo ready")
```
internal/repositories/k8s/provisioner.go (3)
99-102: Don't pass nil into failCluster
Wrapping a nil error results in `%!w(<nil>)`. Create an explicit error for the missing IP.
🛠️ Proposed fix
```diff
-	if masterIP == "" {
-		return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
-	}
+	if masterIP == "" {
+		return "", p.failCluster(ctx, cluster, "master node failed to get an IP", fmt.Errorf("master node has no assigned IP"))
+	}
```
122-129: Do not persist kubeconfig when encryption fails
Persisting plaintext kubeconfig is a security risk. Fail the provisioning (or mark the cluster failed) instead of storing secrets unencrypted.
🛠️ Proposed fix
```diff
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
 if err != nil {
-	p.logger.Error("failed to encrypt kubeconfig", "cluster_id", cluster.ID, "error", err)
-	cluster.Kubeconfig = kubeconfig
-} else {
-	cluster.Kubeconfig = encryptedKubeconfig
+	return "", p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 }
+cluster.Kubeconfig = encryptedKubeconfig
 _ = p.repo.Update(ctx, cluster)
```
268-271: Handle repo.Update failure inside failCluster
Ignoring `p.repo.Update` hides persistence failures and also trips errcheck. Handle the error explicitly. As per coding guidelines, avoid silent failures.
🛠️ Proposed fix
```diff
 func (p *KubeadmProvisioner) failCluster(ctx context.Context, cluster *domain.Cluster, msg string, err error) error {
 	cluster.Status = domain.ClusterStatusFailed
-	_ = p.repo.Update(ctx, cluster)
+	if updateErr := p.repo.Update(ctx, cluster); updateErr != nil {
+		p.logger.Error("failed to update cluster status to failed", "cluster_id", cluster.ID, "error", updateErr)
+	}
 	p.logger.Error(msg, "cluster_id", cluster.ID, "error", err)
 	return fmt.Errorf("%s: %w", msg, err)
 }
```
🧹 Nitpick comments (5)
cmd/api/main.go (1)
244-251: Replace the magic worker count with a derived value
`wg.Add(7)` will drift as workers are added/removed. Use a slice and `len(...)` to keep it correct and self-updating.
♻️ Suggested refactor
```diff
 func runWorkers(ctx context.Context, wg *sync.WaitGroup, workers *setup.Workers) {
-	wg.Add(7)
-	go workers.LB.Run(ctx, wg)
-	go workers.AutoScaling.Run(ctx, wg)
-	go workers.Cron.Run(ctx, wg)
-	go workers.Container.Run(ctx, wg)
-	go workers.Provision.Run(ctx, wg)
-	go workers.Accounting.Run(ctx, wg)
-	go workers.Cluster.Run(ctx, wg)
+	workerFns := []func(context.Context, *sync.WaitGroup){
+		workers.LB.Run,
+		workers.AutoScaling.Run,
+		workers.Cron.Run,
+		workers.Container.Run,
+		workers.Provision.Run,
+		workers.Accounting.Run,
+		workers.Cluster.Run,
+	}
+	wg.Add(len(workerFns))
+	for _, fn := range workerFns {
+		go fn(ctx, wg)
+	}
 }
```
internal/repositories/postgres/lb_repo_test.go (1)
24-341: Consider table-driven tests for consistency. These tests use repeated `t.Run` blocks; converting to a `[]struct{...}` table will reduce duplication and align with repo testing conventions. As per coding guidelines, use table-driven tests in test files.
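A rough sketch of the table-driven shape being suggested; the case names, fields, and the `newTestLBRepo` helper are hypothetical, not taken from the repo:

```go
func TestLBRepoCreate(t *testing.T) {
	tests := []struct {
		name    string
		input   domain.LoadBalancer
		wantErr bool
	}{
		{name: "valid load balancer", input: domain.LoadBalancer{Name: "web-lb", Port: 80}, wantErr: false},
		{name: "missing name", input: domain.LoadBalancer{Port: 80}, wantErr: true},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			repo := newTestLBRepo(t) // hypothetical setup helper shared by every case
			err := repo.Create(context.Background(), &tc.input)
			if tc.wantErr {
				assert.Error(t, err)
				return
			}
			assert.NoError(t, err)
		})
	}
}
```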
internal/core/services/shared_test.go (1)
900-906: Misplaced method breaks code organization. `ListByVPC` for `MockInstanceRepo` is placed after `MockClusterProvisioner`, but `MockInstanceRepo` and its other methods are defined at lines 772-813. This disrupts the logical grouping and makes the file harder to navigate.
♻️ Suggested fix: Move method to join other MockInstanceRepo methods
Move this method to immediately after line 813 (after the `Delete` method of `MockInstanceRepo`), keeping all `MockInstanceRepo` methods together:
```diff
 func (m *MockInstanceRepo) Delete(ctx context.Context, id uuid.UUID) error {
 	return m.Called(ctx, id).Error(0)
 }
+func (m *MockInstanceRepo) ListByVPC(ctx context.Context, vpcID uuid.UUID) ([]*domain.Instance, error) {
+	args := m.Called(ctx, vpcID)
+	if args.Get(0) == nil {
+		return nil, args.Error(1)
+	}
+	return args.Get(0).([]*domain.Instance), args.Error(1)
+}

 // MockClusterRepo
 type MockClusterRepo struct{ mock.Mock }
```
Then remove the duplicate definition at lines 900-906.
internal/core/services/cluster_lifecycle_test.go (2)
44-48: Avoid `time.Sleep` for async test synchronization. Using `time.Sleep(100 * time.Millisecond)` creates flaky tests that may pass or fail depending on system load. If the async operation takes longer than 100ms, assertions will fail incorrectly.
Consider using a synchronization mechanism such as:
- A channel/waitgroup signaled by the async operation
- Polling with a timeout using `assert.Eventually`
- Injecting a mock that completes synchronously in tests
♻️ Example using assert.Eventually
```go
// Instead of time.Sleep, poll for completion:
assert.Eventually(t, func() bool {
	provisioner.AssertExpectations(t)
	repo.AssertExpectations(t)
	taskQueue.AssertExpectations(t)
	return true
}, time.Second, 10*time.Millisecond)
```
Alternatively, consider making the upgrade operation return a completion channel or use dependency injection to make async behavior synchronous in tests.
51-88: Subtests mutate shared state, creating implicit dependencies. All subtests share and mutate the same `cluster` pointer. While this works because subtests run sequentially, the pattern is fragile:
- Test isolation is compromised
- Adding `t.Parallel()` in the future would cause data races
- Test failures may cascade unpredictably
Consider creating fresh cluster instances within each subtest for better isolation.
♻️ Example of isolated subtest setup
t.Run("fails if not running", func(t *testing.T) { + cluster := &domain.Cluster{ + ID: id, + Status: domain.ClusterStatusPending, + Version: testK8sBase, + UserID: uuid.New(), + } - cluster.Status = domain.ClusterStatusPending repo.On("GetByID", mock.Anything, id).Return(cluster, nil).Once() err := svc.UpgradeCluster(ctx, id, testK8sVersion) assert.Error(t, err) assert.Contains(t, err.Error(), "must be in running state") })
```go
// NewClusterService creates a new ClusterService with the provided parameters.
func NewClusterService(params ClusterServiceParams) *ClusterService {
	if params.TaskQueue == nil {
		panic("taskQueue cannot be nil")
	}
```
Avoid panic in NewClusterService
A panic here will crash the process on misconfiguration. Prefer returning an error from the constructor or validating at composition time. As per coding guidelines, do not panic in production code.
🛠️ Proposed fix (change constructor signature)
```diff
-func NewClusterService(params ClusterServiceParams) *ClusterService {
-	if params.TaskQueue == nil {
-		panic("taskQueue cannot be nil")
-	}
-	return &ClusterService{
+func NewClusterService(params ClusterServiceParams) (*ClusterService, error) {
+	if params.TaskQueue == nil {
+		return nil, errors.New(errors.Internal, "task queue not initialized")
+	}
+	return &ClusterService{
 		repo:        params.Repo,
 		provisioner: params.Provisioner,
 		vpcSvc:      params.VpcSvc,
 		instanceSvc: params.InstanceSvc,
 		secretSvc:   params.SecretSvc,
 		taskQueue:   params.TaskQueue,
 		logger:      params.Logger,
-	}
+	}, nil
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// NewClusterService creates a new ClusterService with the provided parameters.
func NewClusterService(params ClusterServiceParams) (*ClusterService, error) {
	if params.TaskQueue == nil {
		return nil, errors.New(errors.Internal, "task queue not initialized")
	}
	return &ClusterService{
		repo:        params.Repo,
		provisioner: params.Provisioner,
		vpcSvc:      params.VpcSvc,
		instanceSvc: params.InstanceSvc,
		secretSvc:   params.SecretSvc,
		taskQueue:   params.TaskQueue,
		logger:      params.Logger,
	}, nil
}
```
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 39 - 43, Change
NewClusterService to avoid panicking on nil TaskQueue: update the constructor
signature to return (*ClusterService, error) and validate params.TaskQueue,
returning a descriptive error when it's nil instead of calling panic. Ensure
callers are updated to handle the error; keep the ClusterService and
ClusterServiceParams types unchanged and perform the nil check inside
NewClusterService (or a helper like validateClusterServiceParams) so
construction fails cleanly with an error rather than crashing the process.
…ving duplicate literals, and standardizing test naming conventions
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/multitenancy_e2e_test.go (1)
184-188: Don't silently ignore delete errors; use `t` and close the body. You've blanked out `*testing.T` and still discard the response/error. Use `t` to surface failures (or at least log them) and close the response body to avoid leaks. As per coding guidelines, avoid silent failures.
🔧 Suggested fix
```diff
-func deleteInstance(_ *testing.T, client *http.Client, token, id string) {
+func deleteInstance(t *testing.T, client *http.Client, token, id string) {
 	req, _ := http.NewRequest("DELETE", fmt.Sprintf(instancesPathFmt, testutil.TestBaseURL, id), nil)
 	req.Header.Set(testutil.TestHeaderAPIKey, token)
-	_, _ = client.Do(req)
+	resp, err := client.Do(req)
+	require.NoError(t, err)
+	defer func() { _ = resp.Body.Close() }()
+	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNoContent {
+		body, _ := io.ReadAll(resp.Body)
+		t.Fatalf("Delete instance failed: status %d body %s", resp.StatusCode, body)
+	}
 }
```
internal/core/services/instance_test.go (2)
427-436: Remove or complete this incomplete test. This test is incomplete and contains dead code. The mock expectation on line 432 is never triggered, and the inline comments acknowledge the test cannot work as written because `getVolumeByIDOrName` is unexported.
- Remove the test entirely if the functionality is already covered by
TestInstanceServiceLaunchWithVolumes- Complete the test by exercising the volume resolution through the public
LaunchInstanceAPI🗑️ Suggested removal
-func TestInstanceServiceGetVolumeByIDOrName(t *testing.T) { - _, _, _, volumeRepo, _, _, _, _, _, _ := setupInstanceServiceTest(t) - volID := uuid.New() - - // Test by ID - volumeRepo.On("GetByID", mock.Anything, volID).Return(&domain.Volume{ID: volID}, nil).Once() - // Wait, getVolumeByIDOrName is unexported. I can't call it directly. - // But LaunchInstance/TerminateInstance/etc might call it? - // Actually, only LaunchInstance calls resolveVolumes which calls it. -}
285-345: Add error path tests for the Provision method. The Provision method handles multiple error scenarios—networking setup failures (line 136), volume resolution failures (line 143), container creation failures (line 151), and finalization failures (line 168)—but only the success path is tested. Add tests covering these error scenarios:
- `TestProvisionGetInstanceFailure` — repo.GetByID error
- `TestProvisionNetworkingFailure` — provisionNetwork error
- `TestProvisionVolumeResolutionFailure` — resolveVolumes error
- `TestProvisionCreateInstanceFailure` — compute.CreateInstance error
- `TestProvisionFinalizeFailure` — finalizeProvision error

Each should verify that `updateStatus` is called with `domain.StatusError` and appropriate errors are propagated. Per coding guidelines, do not skip tests for services.
cmd/cloud/cli_commands_test.go (1)
541-552: Don’t ignore flag-setting errors in tests.
Flag configuration can fail and should fail the test when it does. As per coding guidelines, avoid silent failures.
Suggested fix
```diff
-	_ = createCacheCmd.Flags().Set("name", testRedisMod)
-	_ = createCacheCmd.Flags().Set("version", "7.2")
-	_ = createCacheCmd.Flags().Set("memory", "128")
-	_ = createCacheCmd.Flags().Set("vpc", "vpc-1")
-	_ = createCacheCmd.Flags().Set("wait", "false")
+	if err := createCacheCmd.Flags().Set("name", testRedisMod); err != nil {
+		t.Fatalf("set --name: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("version", "7.2"); err != nil {
+		t.Fatalf("set --version: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("memory", "128"); err != nil {
+		t.Fatalf("set --memory: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("vpc", "vpc-1"); err != nil {
+		t.Fatalf("set --vpc: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("wait", "false"); err != nil {
+		t.Fatalf("set --wait: %v", err)
+	}
```
🤖 Fix all issues with AI agents
In `@cmd/cloud/cli_commands_test.go`:
- Around line 331-338: The mock subscription response uses the string literal
"testTopicID" instead of the testTopicID constant, causing inconsistency; update
the sdk.Subscription object in the HTTP handler (the case for POST to
pathNotify+testTopicID+"/subscriptions") to set TopicID to the testTopicID
constant (not the quoted string) so the mock payload matches the test constant.
- Around line 33-58: The test HTTP handlers use json.NewEncoder(w).Encode(resp)
and discard the returned error; update each encoder call (both the list response
and the single security group response in the mocked handlers) to capture the
error, and if non-nil write an internal server error (e.g., set
w.WriteHeader(http.StatusInternalServerError) or use http.Error) and log or fail
the test accordingly instead of using the blank identifier; specifically modify
the json.NewEncoder(w).Encode(resp) calls so they check err :=
json.NewEncoder(w).Encode(resp) and handle err != nil by returning a 500
response.
In `@internal/repositories/docker/loadbalancer.go`:
- Around line 24-28: Introduce named permission constants (e.g., dirPerm and
filePerm) alongside NginxImage and nginxConf and replace all hard-coded
0755/0644 usages with these constants; update calls in functions that create
directories and write files such as MkdirAll and WriteFile and also in
UpdateProxyConfig to use dirPerm for directory creation and filePerm for file
writes to remove magic numbers and centralize permission values.
In `@internal/repositories/k8s/bootstrap.go`:
- Around line 49-71: The waitForIP function currently ignores context
cancellation and can keep polling until its 5-minute timeout; update the polling
select in KubeadmProvisioner.waitForIP to include a case for <-ctx.Done() so the
function returns immediately (empty string) when the parent context is
cancelled, ensuring any resources (ticker already deferred) are cleaned up; keep
the instSvc.GetInstance(ctx, ...) usage as-is so the request itself still
respects ctx.
- Around line 73-95: The polling loop in waitForAPIServer uses time.Sleep and
ignores context cancellation; change it to use a ticker/select that listens for
ctx.Done() so the function can abort immediately on cancellation. In function
waitForAPIServer (which calls p.getExecutor and exec.Run), replace the for
loop/time.Sleep with a time.NewTicker(10*time.Second) and loop selecting on
ticker.C and ctx.Done(); on ticker.C call exec.Run and handle success/failure as
before, on ctx.Done() stop the ticker and return ctx.Err() (or wrap it) and
ensure the ticker is stopped before returning.
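For the two bootstrap.go items directly above, a minimal sketch of a ticker-based poll that honors cancellation; the probe command, the separate timeout parameter, and the exact return values are illustrative:

```go
// waitForAPIServer polls the API server until it answers, the timeout elapses,
// or the parent context is cancelled.
func (p *KubeadmProvisioner) waitForAPIServer(ctx context.Context, exec NodeExecutor, timeout time.Duration) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	deadline := time.After(timeout)

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for API server cancelled: %w", ctx.Err())
		case <-deadline:
			return fmt.Errorf("timed out waiting for API server")
		case <-ticker.C:
			// The probe command is illustrative; the real implementation keeps its existing check.
			if _, err := exec.Run(ctx, "kubectl get --raw=/healthz"); err == nil {
				return nil
			}
		}
	}
}
```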
In `@internal/repositories/k8s/health.go`:
- Around line 115-119: In the error-handling block that runs after
p.bootstrapNode returns an error, stop discarding errors with `_ =` and instead
capture and log the results of p.instSvc.TerminateInstance(ctx,
worker.ID.String()) and p.repo.DeleteNode(ctx, worker.ID) so cleanup failures
are recorded; specifically, update the block handling the bootstrap error (the
if that calls p.bootstrapNode) to call TerminateInstance and DeleteNode, check
their returned errors, and send those errors to the logger (or wrap them) with
clear context mentioning worker.ID/workerName and ip rather than using blank
identifier assignments.
♻️ Duplicate comments (2)
internal/repositories/k8s/lifecycle.go (1)
199-216: Memory buffering of large snapshots remains a concern. The entire etcd snapshot is base64-encoded, transferred through SSH, and buffered in memory before upload. For production clusters with multi-GB snapshots, this risks memory exhaustion or SSH message limits. The code comment in `Restore` (lines 257-258) acknowledges this limitation for KIND clusters.
Consider implementing presigned URL uploads or streaming in a future iteration if production-scale clusters are targeted.
internal/repositories/k8s/health.go (1)
198-210: Empty output results in phantom node count. When `kubectl get nodes --no-headers` returns an empty string (e.g., no nodes registered yet), `strings.Split("", "\n")` returns a slice with one empty string, causing `NodesTotal` to be incorrectly set to 1.
Proposed fix
```diff
 // 2. Node Counts
 nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
 if err == nil {
+	trimmed := strings.TrimSpace(nodesOut)
+	if trimmed == "" {
+		health.NodesTotal = 0
+		health.NodesReady = 0
+	} else {
-	lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
-	health.NodesTotal = len(lines)
-	readyCount := 0
-	for _, line := range lines {
-		if strings.Contains(line, " Ready ") {
-			readyCount++
+		lines := strings.Split(trimmed, "\n")
+		health.NodesTotal = len(lines)
+		readyCount := 0
+		for _, line := range lines {
+			if strings.Contains(line, " Ready ") {
+				readyCount++
+			}
 		}
+		health.NodesReady = readyCount
 	}
-	health.NodesReady = readyCount
 }
```
🧹 Nitpick comments (12)
internal/core/services/queue_test.go (1)
105-111: Good addition for error path coverage. The new "already exists" test case properly validates the duplicate name check. For consistency with the "success" subtest (line 96), consider adding `repo.AssertExpectations(t)` to verify the mock was called as expected.
Suggested improvement
```diff
 t.Run("already exists", func(t *testing.T) {
 	repo.On("GetByName", mock.Anything, "existing", userID).Return(&domain.Queue{ID: uuid.New()}, nil).Once()
 	_, err := svc.CreateQueue(ctx, "existing", nil)
 	assert.Error(t, err)
 	assert.Contains(t, err.Error(), "already exists")
+	repo.AssertExpectations(t)
 })
```
internal/repositories/k8s/lifecycle.go (3)
85-103: Consider extracting the package version suffix as a constant. The hardcoded `-1.1` suffix in the package names (e.g., `kubeadm=%s-1.1`) is a Debian/Ubuntu-specific convention that may vary across distributions or Kubernetes releases. Consider defining this as a named constant or making it configurable.
Suggested improvement
```diff
+const kubePackageSuffix = "-1.1" // Debian/Ubuntu package version suffix

 func (p *KubeadmProvisioner) upgradeControlPlane(ctx context.Context, cluster *domain.Cluster, ip, version string) error {
 	// ...
 	upgradeScript := fmt.Sprintf(`
 set -e
 # 1. Update kubeadm
 apt-mark unhold kubeadm
-apt-get update && apt-get install -y kubeadm=%s-1.1
+apt-get update && apt-get install -y kubeadm=%s%s
 apt-mark hold kubeadm
 // ...
-`, ver, version, ver, ver)
+`, ver, kubePackageSuffix, version, ver, kubePackageSuffix, ver, kubePackageSuffix)
```
167-168: Remove duplicate comment. The comment `// 3. Encrypt and store the new kubeconfig` appears twice on consecutive lines.
Fix
```diff
 // 3. Encrypt and store the new kubeconfig
-// 3. Encrypt and store the new kubeconfig
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
```
218-219: Remove duplicate comment. The comment `// Cleanup temp file` appears twice on consecutive lines.
Fix
```diff
 // Cleanup temp file
-// Cleanup temp file
 if _, err := exec.Run(ctx, "rm "+tempPath); err != nil {
```
internal/repositories/k8s/bootstrap.go (1)
97-97: Unused parameters in function signature. The `version` and `isMaster` parameters are unused (marked with `_`). If these are intentionally reserved for future use, consider adding a comment explaining the intent. Otherwise, remove them to avoid confusion.
Options
Option 1: Document intent
```go
// bootstrapNode prepares a node for Kubernetes.
// version and isMaster are reserved for future differentiated bootstrap logic.
func (p *KubeadmProvisioner) bootstrapNode(ctx context.Context, cluster *domain.Cluster, ip, _ string, _ bool) error {
```
Option 2: Remove unused parameters (if callers can be updated)
```go
func (p *KubeadmProvisioner) bootstrapNode(ctx context.Context, cluster *domain.Cluster, ip string) error {
```
cmd/cloud/cli_commands_test.go (2)
15-23: Prefer shared `testutil` IDs to avoid drift.
Since `pkg/testutil` now exports `TestCacheID`, `TestQueueID`, and `TestTopicID`, consider reusing them here to keep test fixtures centralized.
384-425: Consider table-driven tests to reduce repetition.
Many tests follow the same setup/execute/assert pattern; consolidating into table-driven cases would reduce duplication and align with test guidelines. As per coding guidelines, prefer table-driven tests in *_test.go.
internal/repositories/k8s/kubeadm_provisioner_test.go (5)
17-48: Guard mocked instance returns against nil panics.
`LaunchInstance`/`GetInstance` cast `args.Get(0)` directly; returning nil with an error will panic and makes error-path tests awkward. Consider a nil-safe assertion.
♻️ Suggested change
```diff
 func (m *MockInstanceService) LaunchInstance(ctx context.Context, name, image, ports string, vpcID, subnetID *uuid.UUID, volumes []domain.VolumeAttachment) (*domain.Instance, error) {
 	args := m.Called(ctx, name, image, ports, vpcID, subnetID, volumes)
-	return args.Get(0).(*domain.Instance), args.Error(1)
+	inst, _ := args.Get(0).(*domain.Instance)
+	return inst, args.Error(1)
 }

 func (m *MockInstanceService) GetInstance(ctx context.Context, id string) (*domain.Instance, error) {
 	args := m.Called(ctx, id)
-	return args.Get(0).(*domain.Instance), args.Error(1)
+	inst, _ := args.Get(0).(*domain.Instance)
+	return inst, args.Error(1)
 }
```
50-75: Make `GetNodes` mock nil-safe for error-path tests.
`args.Get(0)` will panic on a nil return. Guard it to allow `nil` slices alongside an error.
♻️ Suggested change
```diff
 func (m *MockClusterRepo) GetNodes(ctx context.Context, clusterID uuid.UUID) ([]*domain.ClusterNode, error) {
 	args := m.Called(ctx, clusterID)
-	return args.Get(0).([]*domain.ClusterNode), args.Error(1)
+	nodes, _ := args.Get(0).([]*domain.ClusterNode)
+	return nodes, args.Error(1)
 }
```
141-165: Nil‑safe LoadBalancer mock returns for error‑path tests.
`Create`/`Get` type assertions will panic on nil returns. Guard them so error-path tests can return nil safely.
♻️ Suggested change
```diff
 func (m *MockLBService) Create(ctx context.Context, name string, vpcID uuid.UUID, port int, algo string, idempotencyKey string) (*domain.LoadBalancer, error) {
 	args := m.Called(ctx, name, vpcID, port, algo, idempotencyKey)
-	return args.Get(0).(*domain.LoadBalancer), args.Error(1)
+	lb, _ := args.Get(0).(*domain.LoadBalancer)
+	return lb, args.Error(1)
 }

 func (m *MockLBService) Get(ctx context.Context, id uuid.UUID) (*domain.LoadBalancer, error) {
 	args := m.Called(ctx, id)
-	return args.Get(0).(*domain.LoadBalancer), args.Error(1)
+	lb, _ := args.Get(0).(*domain.LoadBalancer)
+	return lb, args.Error(1)
 }
```
167-193: Assert mock expectations; consider table-driven cases. The test only checks for no error; add expectation assertions so regressions in the deprovision flow are caught. Also consider a table-driven structure as scenarios grow. As per coding guidelines, prefer table-driven tests.
🔧 Suggested change
```diff
 err := p.Deprovision(ctx, cluster)
 assert.NoError(t, err)
+instSvc.AssertExpectations(t)
+repo.AssertExpectations(t)
```
195-249: Replace magic numbers with named constants. Ports/counts/weights and IP literals are repeated; named constants improve readability and simplify updates. As per coding guidelines, avoid magic numbers.
♻️ Suggested change
```diff
 cluster := &domain.Cluster{
 	ID:        uuid.New(),
 	Name:      "ha-cluster",
 	VpcID:     uuid.New(),
 	Version:   "v1.29.0",
 	HAEnabled: true,
 }
+const (
+	apiServerPort = 6443
+	masterCount   = 3
+	lbWeight      = 10
+	lbIP          = "10.0.0.100"
+	nodeIPBase    = "10.0.0.%d"
+	nodeIPStart   = 10
+)

 repo.On("Update", mock.Anything, mock.Anything).Return(nil).Maybe()
 repo.On("AddNode", mock.Anything, mock.Anything).Return(nil).Maybe()
 repo.On("UpdateNode", mock.Anything, mock.Anything).Return(nil).Maybe()

 // 1. LB Creation
 lbID := uuid.New()
-lb := &domain.LoadBalancer{ID: lbID, IP: "10.0.0.100"}
-lbSvc.On("Create", mock.Anything, mock.Anything, cluster.VpcID, 6443, "round-robin", mock.Anything).Return(lb, nil)
+lb := &domain.LoadBalancer{ID: lbID, IP: lbIP}
+lbSvc.On("Create", mock.Anything, mock.Anything, cluster.VpcID, apiServerPort, "round-robin", mock.Anything).Return(lb, nil)

 // 2. 3 Master Nodes Creation
 var allNodes []*domain.ClusterNode
-for i := 0; i < 3; i++ {
+for i := 0; i < masterCount; i++ {
 	nodeID := uuid.New()
-	instance := &domain.Instance{ID: nodeID, PrivateIP: fmt.Sprintf("10.0.0.%d", 10+i)}
+	instance := &domain.Instance{ID: nodeID, PrivateIP: fmt.Sprintf(nodeIPBase, nodeIPStart+i)}
 	allNodes = append(allNodes, &domain.ClusterNode{
 		ID:         uuid.New(),
 		ClusterID:  cluster.ID,
 		InstanceID: nodeID,
 		Role:       domain.NodeRoleControlPlane,
 	})
 	instSvc.On("LaunchInstance", mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(instance, nil).Once()
 	instSvc.On("GetInstance", mock.Anything, nodeID.String()).Return(instance, nil).Maybe()
-	lbSvc.On("AddTarget", mock.Anything, lbID, nodeID, 6443, 10).Return(nil)
+	lbSvc.On("AddTarget", mock.Anything, lbID, nodeID, apiServerPort, lbWeight).Return(nil)
 }
 repo.On("GetNodes", mock.Anything, cluster.ID).Return(allNodes, nil).Maybe()

 // General Exec matches
-instSvc.On("Exec", mock.Anything, mock.Anything, mock.Anything).Return("kubeadm join 10.0.0.100:6443 --token abc --control-plane --certificate-key xyz", nil).Maybe()
+joinCmd := fmt.Sprintf("kubeadm join %s:%d --token abc --control-plane --certificate-key xyz", lbIP, apiServerPort)
+instSvc.On("Exec", mock.Anything, mock.Anything, mock.Anything).Return(joinCmd, nil).Maybe()
```
```diff
 resp := sdk.Response[[]sdk.SecurityGroup]{
-	Data: []sdk.SecurityGroup{{ID: testSGID, VPCID: testVPCID, Name: "default", Description: "default group", ARN: "arn:sg:1", CreatedAt: time.Now().UTC()}},
 }
 _ = json.NewEncoder(w).Encode(resp)
 return true
-case method == http.MethodPost && path == "/security-groups":
 resp := sdk.Response[sdk.SecurityGroup]{
-	Data: sdk.SecurityGroup{ID: "sg-2", VPCID: "vpc-2", Name: "web", Description: "web group", ARN: "arn:sg:2", CreatedAt: time.Now().UTC()},
+	Data: []sdk.SecurityGroup{
+		{
+			ID:          "sg-1",
+			VPCID:       "vpc-1",
+			Name:        "default",
+			Description: "default group",
+			ARN:         "arn:sg:1",
+			CreatedAt:   time.Now().UTC(),
+		},
+	},
 }
 _ = json.NewEncoder(w).Encode(resp)
 return true
-case method == http.MethodGet && path == pathSG+testSGID:
+case r.Method == http.MethodPost && r.URL.Path == "/security-groups":
 	resp := sdk.Response[sdk.SecurityGroup]{
-		Data: sdk.SecurityGroup{ID: testSGID, VPCID: testVPCID, Name: "default", Description: "default group", ARN: "arn:sg:1", CreatedAt: time.Now().UTC()},
+		Data: sdk.SecurityGroup{
+			ID:          "sg-2",
+			VPCID:       "vpc-2",
+			Name:        "web",
+			Description: "web group",
+			ARN:         "arn:sg:2",
+			CreatedAt:   time.Now().UTC(),
+		},
 	}
 	_ = json.NewEncoder(w).Encode(resp)
```
There was a problem hiding this comment.
Handle JSON encode errors instead of discarding them.
Ignoring encoder errors can mask test failures and violates the “no silent failures” guideline. Consider returning a 500 when encoding fails. As per coding guidelines, avoid blank-identifier error drops.
Suggested fix (apply across handlers)
- _ = json.NewEncoder(w).Encode(resp)
- return true
+ if err := json.NewEncoder(w).Encode(resp); err != nil {
+ http.Error(w, err.Error(), http.StatusInternalServerError)
+ return true
+ }
+ return true
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| resp := sdk.Response[[]sdk.SecurityGroup]{ | |
| Data: []sdk.SecurityGroup{ | |
| { | |
| ID: "sg-1", | |
| VPCID: "vpc-1", | |
| Name: "default", | |
| Description: "default group", | |
| ARN: "arn:sg:1", | |
| CreatedAt: time.Now().UTC(), | |
| }, | |
| }, | |
| } | |
| if err := json.NewEncoder(w).Encode(resp); err != nil { | |
| http.Error(w, err.Error(), http.StatusInternalServerError) | |
| return true | |
| } | |
| return true | |
| case r.Method == http.MethodPost && r.URL.Path == "/security-groups": | |
| resp := sdk.Response[sdk.SecurityGroup]{ | |
| Data: sdk.SecurityGroup{ | |
| ID: "sg-2", | |
| VPCID: "vpc-2", | |
| Name: "web", | |
| Description: "web group", | |
| ARN: "arn:sg:2", | |
| CreatedAt: time.Now().UTC(), | |
| }, | |
| } | |
| if err := json.NewEncoder(w).Encode(resp); err != nil { | |
| http.Error(w, err.Error(), http.StatusInternalServerError) | |
| return true | |
| } | |
| return true |
🤖 Prompt for AI Agents
In `@cmd/cloud/cli_commands_test.go` around lines 33 - 58, The test HTTP handlers
use json.NewEncoder(w).Encode(resp) and discard the returned error; update each
encoder call (both the list response and the single security group response in
the mocked handlers) to capture the error, and if non-nil write an internal
server error (e.g., set w.WriteHeader(http.StatusInternalServerError) or use
http.Error) and log or fail the test accordingly instead of using the blank
identifier; specifically modify the json.NewEncoder(w).Encode(resp) calls so
they check err := json.NewEncoder(w).Encode(resp) and handle err != nil by
returning a 500 response.
- Refactored cmd/api to use AppDeps instead of globals for better testability.
- Updated Cluster domain to safely handle sensitive fields (SSHKey, Kubeconfig) with json:"-".
- Enhanced ClusterService with secure constructor, SSH key encryption, and robust error handling.
- Improved KubeadmProvisioner (ha.go) with retry loops for LB IP and strict join command validation.
- Implemented bounded concurrency and polling backoff in ClusterWorker (see the sketch below).
- Fixed CLI command tests (topic ID consistency, JSON error handling).
- Introduced named permissions constants in docker/loadbalancer.go.
- Updated k8s bootstrap and health checks to respect context cancellation and log cleanup errors.
- Updated Swagger documentation.
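A rough sketch of the bounded-concurrency/backoff pattern referenced above — the worker, queue, and logger field names here are assumptions for illustration, not the actual ClusterWorker implementation:

```go
// Illustrative worker loop: a buffered channel bounds in-flight jobs, and the
// polling interval backs off while the queue is empty or erroring.
func (w *clusterWorker) Run(ctx context.Context) {
	sem := make(chan struct{}, w.maxConcurrent) // assumed concurrency-limit field
	backoff := time.Second

	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		job, err := w.queue.Dequeue(ctx, "k8s_jobs") // assumed queue method
		if err != nil || job == nil {
			time.Sleep(backoff) // back off while idle or on transient errors
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second // reset once work is flowing again

		sem <- struct{}{} // acquire a slot; blocks when maxConcurrent jobs are running
		go func(j *domain.ClusterJob) {
			defer func() { <-sem }() // release the slot
			if err := w.handleJob(ctx, j); err != nil {
				w.logger.Error("cluster job failed", "type", j.Type, "cluster_id", j.ClusterID, "error", err)
			}
		}(job)
	}
}
```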
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 78 out of 79 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Cleanup temp file | ||
| // Cleanup temp file |
There was a problem hiding this comment.
The comment on line 218 is duplicated on line 219. Remove the duplicate comment.
| // 3. Encrypt and store the new kubeconfig | ||
| // 3. Encrypt and store the new kubeconfig |
There was a problem hiding this comment.
The comment on line 167 is duplicated on line 168. Remove the duplicate comment.
| - K8s | ||
| /clusters/{id}/kubeconfig: | ||
| get: | ||
| description: Returns the kubeconfig for clinical access to the cluster |
There was a problem hiding this comment.
The same typo "clinical access" appears in the swagger documentation. This should be "client access" or "cluster access".
| description: Returns the kubeconfig for clinical access to the cluster | |
| description: Returns the kubeconfig for client access to the cluster |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@internal/core/services/cluster_test.go`:
- Around line 40-99: Convert the two standalone tests TestClusterServiceCreate
and TestClusterServiceDelete into a single table-driven test (or separate
table-driven tests) that iterates over test cases and uses t.Run for subtests;
for each case call setupClusterServiceTest to initialize mocks, set per-case
expectations on repo.On, secretSvc.On, taskQueue.On (e.g., taskQueue.Enqueue
assertions), invoke svc.CreateCluster or svc.DeleteCluster as appropriate, and
assert results and mock expectations; keep case metadata (name, inputs, expected
error/cluster/status, mock setup function) in a slice and ensure each subtest
resets or recreates mocks so expectations (repo.On, secretSvc.On, taskQueue.On)
are scoped to that case.
In `@internal/core/services/cluster.go`:
- Around line 273-293: When enqueueing the upgrade job, if taskQueue is nil or
taskQueue.Enqueue returns an error, persist a recovery status on the cluster
instead of leaving it in ClusterStatusUpgrading; after setting cluster.Status =
domain.ClusterStatusUpgrading and calling s.repo.Update(ctx, cluster) you must
detect enqueue failures and update the cluster status back to
domain.ClusterStatusRunning (or domain.ClusterStatusFailed) and UpdatedAt,
calling s.repo.Update(ctx, cluster) again (mirror the CreateCluster failure
handling pattern). Specifically handle the case where s.taskQueue == nil and
when s.taskQueue.Enqueue(ctx, "k8s_jobs", job) returns an error by wrapping the
error for return but first setting cluster.Status to the appropriate
non-upgrading state and persisting via s.repo.Update(ctx, cluster) so the
cluster is not left stuck in upgrading.
- Around line 138-153: When Enqueue fails after setting cluster.Status =
domain.ClusterStatusDeleting and persisting via s.repo.Update, revert the
cluster status (e.g., back to previous status or to a new failure state) and
persist that change so the cluster does not remain stuck in "deleting";
specifically, catch the error from s.taskQueue.Enqueue(ctx, "k8s_jobs", job) and
call s.repo.Update(ctx, cluster) after setting cluster.Status to the original
value or a "delete_failed" status, and return the wrapped error; follow the same
pattern used in CreateCluster to ensure the failure is recorded and persisted
for recovery/visibility.
In `@internal/repositories/k8s/ha.go`:
- Around line 139-143: The kubeadm init invocation builds initCmd with
--ignore-preflight-errors=all which silences all safety checks; change the logic
that constructs initCmd in ha.go (the initCmd variable used with exec.Run) to
specify a comma-separated list of named preflight errors to ignore instead of
"all" (e.g., IsPrivilegedUser,Swap or other specific checks discovered via
`kubeadm init phase preflight --help`); implement this by replacing the literal
string "--ignore-preflight-errors=all" with a variable or function that returns
the allowed names, or make it configurable on the cluster object, and update the
initCmd construction so the chosen names are inserted (preserving the existing
flags: --kubernetes-version, --pod-network-cidr, --control-plane-endpoint,
--upload-certs) before calling exec.Run.
In `@internal/repositories/k8s/health.go`:
- Around line 115-123: The TerminateInstance call in the bootstrap error handler
is using worker.ID (node record UUID) instead of the compute instance UUID;
update the call in the p.bootstrapNode error branch to pass
worker.InstanceID.String() to p.instSvc.TerminateInstance so the actual instance
is terminated (mirror what scaleDown uses with node.InstanceID.String()); ensure
you reference worker.InstanceID and convert to String() as in scaleDown.
♻️ Duplicate comments (5)
internal/repositories/k8s/health.go (1)
204-214: Handle empty `kubectl get nodes` output to avoid phantom node count. When `nodesOut` is empty, `strings.Split("", "\n")` yields `[""]` (a slice with one empty string), so `NodesTotal` becomes 1 instead of 0.
Proposed fix
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers") if err == nil { - lines := strings.Split(strings.TrimSpace(nodesOut), "\n") - health.NodesTotal = len(lines) + trimmed := strings.TrimSpace(nodesOut) + if trimmed != "" { + lines := strings.Split(trimmed, "\n") + health.NodesTotal = len(lines) + readyCount := 0 + for _, line := range lines { + if strings.Contains(line, " Ready ") { + readyCount++ + } + } + health.NodesReady = readyCount + } - readyCount := 0 - for _, line := range lines { - if strings.Contains(line, " Ready ") { - readyCount++ - } - } - health.NodesReady = readyCount }
cmd/cloud/cli_commands_test.go (1)
30-59: Handle JSON encoding errors instead of discarding them.
Several handlers still ignore `Encode` errors, which can mask test failures. Consider a small helper to encode and return 500 on failure, then reuse it across handlers. As per coding guidelines, avoid silent failures.
✅ Suggested helper (apply across handlers)
func respondNoContent(w http.ResponseWriter) bool { w.WriteHeader(http.StatusNoContent) return true } + +func respondJSON(w http.ResponseWriter, payload any) bool { + if err := json.NewEncoder(w).Encode(payload); err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + } + return true +}- _ = json.NewEncoder(w).Encode(resp) - return true + return respondJSON(w, resp)Also applies to: 64-127, 136-167, 174-246, 251-304, 313-343
docs/swagger/docs.go (1)
966-1299: Fix typo in kubeconfig description. Line 1125 says “clinical access”; it should be “client access” (or similar). Since this file is generated, update the source annotation and regenerate.
✏️ Suggested fix
- "description": "Returns the kubeconfig for clinical access to the cluster", + "description": "Returns the kubeconfig for client access to the cluster",
docs/swagger/swagger.yaml (1)
1848-2067: Fix typo in kubeconfig description. Line 1953 says “clinical access”; it should be “client access” (or similar).
✏️ Suggested fix
- description: Returns the kubeconfig for clinical access to the cluster + description: Returns the kubeconfig for client access to the cluster
docs/swagger/swagger.json (1)
1112-1115: Fix typo in kubeconfig description.
“clinical access” → “client access” (or “cluster access”).
🧹 Nitpick comments (11)
internal/repositories/k8s/kubeadm_provisioner_test.go (4)
19-26: Nil pointer dereference risk on type assertion. When `args.Get(0)` returns nil (e.g., in error cases), the type assertion `.(*domain.Instance)` will panic. Use a safe pattern:
♻️ Suggested safe nil handling
func (m *MockInstanceService) LaunchInstance(ctx context.Context, name, image, ports string, vpcID, subnetID *uuid.UUID, volumes []domain.VolumeAttachment) (*domain.Instance, error) { args := m.Called(ctx, name, image, ports, vpcID, subnetID, volumes) - return args.Get(0).(*domain.Instance), args.Error(1) + if v := args.Get(0); v != nil { + return v.(*domain.Instance), args.Error(1) + } + return nil, args.Error(1) } func (m *MockInstanceService) GetInstance(ctx context.Context, id string) (*domain.Instance, error) { args := m.Called(ctx, id) - return args.Get(0).(*domain.Instance), args.Error(1) + if v := args.Get(0); v != nil { + return v.(*domain.Instance), args.Error(1) + } + return nil, args.Error(1) }The same pattern should be applied to
`MockClusterRepo.GetNodes` (line 68), `MockLBService.Create` (line 145), and `MockLBService.Get` (line 149).
30-48: Mock methods silently return without recording calls. Methods like `StopInstance`, `ListInstances`, `GetInstanceLogs`, etc. don't use `m.Called()`. If production code unexpectedly calls these, tests won't detect it. If these methods should never be called during the tested flows, consider either:
- Using `m.Called()` and not setting up expectations (testify will fail on unexpected calls)
- Adding `t.Fatal("unexpected call")` to detect unintended usage
func (m *MockInstanceService) StopInstance(ctx context.Context, id string) error { - return nil + return m.Called(ctx, id).Error(0) }
167-193: Add mock assertion verification. The test doesn't verify that expected mock calls were actually made. Without `AssertExpectations`, if `Deprovision` silently skips calling `TerminateInstance` or `DeleteNode`, the test would still pass.
♻️ Add assertions
err := p.Deprovision(ctx, cluster) assert.NoError(t, err) + repo.AssertExpectations(t) + instSvc.AssertExpectations(t) }
195-254: HA provisioning test uses overly permissive matchers and lacks assertion verification. Several concerns:
- Excessive `.Maybe()` (lines 215-217, 238, 241, 249): these make calls optional, so the test passes even if critical operations like `AddNode` or `Update` aren't called.
- Missing `AssertExpectations`: no verification that LB creation, node launches, and target registrations occurred.
- Overly broad `mock.Anything` on `LaunchInstance` (line 237): consider asserting at least the VPC ID matches `cluster.VpcID` to catch routing bugs.
♻️ Strengthen test assertions
- repo.On("Update", mock.Anything, mock.Anything).Return(nil).Maybe() - repo.On("AddNode", mock.Anything, mock.Anything).Return(nil).Maybe() + repo.On("Update", mock.Anything, mock.Anything).Return(nil) + repo.On("AddNode", mock.Anything, mock.Anything).Return(nil) err := p.Provision(ctx, cluster) assert.NoError(t, err) + assert.NotNil(t, cluster.APIServerLBAddress, "LB address should be set") + lbSvc.AssertExpectations(t) + instSvc.AssertExpectations(t) + repo.AssertExpectations(t) }internal/repositories/k8s/health.go (1)
139-163: LGTM with optional suggestion. Error aggregation and logging correctly address previous feedback. Consider skipping `DeleteNode` if `TerminateInstance` fails to avoid orphaning instances (the node record serves as a reference for retry/cleanup).
internal/repositories/k8s/ha.go (3)
22-36: Previous feedback addressed: bounded retry for LB IP. The LB IP retry logic now properly retries up to 5 times and fails with an explicit error if the IP is never assigned. This addresses the earlier concern about proceeding with an empty endpoint.
However, consider extracting the magic numbers to named constants per coding guidelines:
♻️ Suggested constants
+const ( + lbIPRetryAttempts = 5 + lbIPRetryInterval = 2 * time.Second +) + func (p *KubeadmProvisioner) provisionHAControlPlane(ctx context.Context, cluster *domain.Cluster) (string, error) { // ... - for i := 0; i < 5; i++ { + for i := 0; i < lbIPRetryAttempts; i++ { lbIP = lb.IP if lbIP != "" { break } - time.Sleep(2 * time.Second) + time.Sleep(lbIPRetryInterval)
82-99: Extract magic numbers to named constants. The master count (3), API server port (6443), and LB weight (10) should be named constants per coding guidelines. This improves readability and makes the HA configuration easier to adjust.
♻️ Suggested refactor
+const ( + HAMasterCount = 3 + APIServerPort = 6443 + DefaultLBWeight = 10 +) + func (p *KubeadmProvisioner) provisionHAMasters(...) (...) { // ... - for i := 0; i < 3; i++ { + for i := 0; i < HAMasterCount; i++ { // ... - if err := p.lbSvc.AddTarget(ctx, lb.ID, node.ID, 6443, 10); err != nil { + if err := p.lbSvc.AddTarget(ctx, lb.ID, node.ID, APIServerPort, DefaultLBWeight); err != nil {
110-118: Add bounds check for `masterIPs` slice. The loop assumes `masterIPs` has at least 3 elements. While `provisionHAMasters` should guarantee this, a defensive check would prevent a panic if the contract changes.
♻️ Suggested defensive check
func (p *KubeadmProvisioner) joinHAMasters(ctx context.Context, cluster *domain.Cluster, masterIPs []string, cpJoinCmd string) error { + if len(masterIPs) < HAMasterCount { + return fmt.Errorf("expected %d master IPs, got %d", HAMasterCount, len(masterIPs)) + } - for i := 1; i < 3; i++ { + for i := 1; i < HAMasterCount; i++ { p.logger.Info("joining master to control plane", "ip", masterIPs[i])
cmd/api/main_test.go (2)
25-102: Consider table-driven tests + testify/mock for infra cases. These infra-init tests follow the same arrange/act/assert shape; a table-driven structure plus testify/mock-based fakes would reduce duplication and standardize mocks. As per coding guidelines, please use table-driven tests and testify/mock for mocks.
104-129: Avoid real Redis clients in unit tests. Using a real go-redis client (even pointed at 127.0.0.1) risks accidental network use; prefer a mock or stubbed Redis/TaskQueue interface. As per coding guidelines, avoid real external dependencies in unit tests.
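For example, a testify/mock stand-in could replace the real client — the interface shape below is inferred from the Enqueue calls shown in this PR and may not match the actual ports.TaskQueue definition:

```go
// Hypothetical mock for the task queue port; the method signature mirrors the
// Enqueue(ctx, "k8s_jobs", job) calls shown elsewhere in this review.
type MockTaskQueue struct{ mock.Mock }

func (m *MockTaskQueue) Enqueue(ctx context.Context, queue string, job domain.ClusterJob) error {
	return m.Called(ctx, queue, job).Error(0)
}
```

Wiring something like this into the AppDeps used by main_test.go would keep the infra-init tests free of real network dependencies.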
internal/core/services/cluster_test.go (1)
28-36: Don’t ignore constructor errors in test setup. If `NewClusterService` ever returns an error, the tests will proceed with a nil service and fail later in a harder-to-diagnose way. Consider asserting that the returned error is nil here.
💡 Suggested adjustment
- svc, _ := services.NewClusterService(services.ClusterServiceParams{ + svc, err := services.NewClusterService(services.ClusterServiceParams{ Repo: repo, Provisioner: provisioner, VpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: taskQueue, Logger: logger, }) + if err != nil { + panic(err) // or return error and fail the test via t.Fatalf in caller + }
| "/clusters": { | ||
| "get": { | ||
| "description": "Returns all clusters belonging to the user", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "List managed K8s clusters", | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "type": "array", | ||
| "items": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "post": { | ||
| "description": "Provisions a new Kubernetes cluster using kubeadm", | ||
| "consumes": [ | ||
| "application/json" | ||
| ], | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Create a managed K8s cluster", | ||
| "parameters": [ | ||
| { | ||
| "description": "Cluster details", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.CreateClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted", | ||
| "schema": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}": { | ||
| "get": { | ||
| "description": "Returns cluster metadata and current status", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Get cluster details", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "delete": { | ||
| "description": "Terminates all nodes and removes the cluster record", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Delete a K8s cluster", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/backups": { | ||
| "post": { | ||
| "description": "Creates an etcd snapshot of the cluster state", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Create cluster backup", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/health": { | ||
| "get": { | ||
| "description": "Returns readiness of nodes and API server", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Get cluster operational health", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "$ref": "#/definitions/ports.ClusterHealth" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/kubeconfig": { | ||
| "get": { | ||
| "description": "Returns the kubeconfig for clinical access to the cluster", | ||
| "produces": [ | ||
| "text/plain" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Download kubeconfig", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "type": "string", | ||
| "description": "Role (e.g. viewer)", | ||
| "name": "role", | ||
| "in": "query" | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "type": "string" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/repair": { | ||
| "post": { | ||
| "description": "Re-applies CNI and kube-proxy patches to a running cluster", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Repair cluster components", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/restore": { | ||
| "post": { | ||
| "description": "Restores the etcd state from a specified snapshot path", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Restore cluster from backup", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Restore Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.RestoreBackupRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/rotate-secrets": { | ||
| "post": { | ||
| "description": "Renews cluster certificates and refreshes admin kubeconfig", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Rotate cluster secrets", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/scale": { | ||
| "post": { | ||
| "description": "Adjusts the number of worker nodes", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Scale cluster workers", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Scale Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.ScaleClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/upgrade": { | ||
| "post": { | ||
| "description": "Initiates an asynchronous upgrade of the Kubernetes control plane and workers", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Upgrade cluster version", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Upgrade Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.UpgradeClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, |
There was a problem hiding this comment.
Add APIKeyAuth security to the cluster endpoints.
Unlike other authenticated routes, the /clusters family lacks a security block, which makes the docs imply public access. Please add APIKeyAuth (or set global security) for consistency and to avoid client confusion.
🔧 Example (apply similarly to all /clusters routes)
"/clusters": {
"get": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
"description": "Returns all clusters belonging to the user",
"produces": [
"application/json"
],
"tags": [
"K8s"
],
"summary": "List managed K8s clusters",
"responses": {
"200": {
"description": "OK",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/domain.Cluster"
}
}
}
}
},
"post": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
"description": "Provisions a new Kubernetes cluster using kubeadm",
"consumes": [
"application/json"
],| func TestClusterServiceCreate(t *testing.T) { | ||
| repo, _, vpcSvc, _, taskQueue, secretSvc, svc := setupClusterServiceTest() | ||
| ctx := context.Background() | ||
| userID := uuid.New() | ||
| vpcID := uuid.New() | ||
|
|
||
| vpcSvc.On("GetVPC", mock.Anything, vpcID.String()).Return(&domain.VPC{ID: vpcID}, nil) | ||
| repo.On("Create", mock.Anything, mock.MatchedBy(func(c *domain.Cluster) bool { | ||
| return c.Name == testClusterName && c.UserID == userID | ||
| })).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| secretSvc.On("Encrypt", mock.Anything, userID, mock.Anything).Return("encrypted-key", nil) | ||
|
|
||
| // Expect task queue enqueue | ||
| taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool { | ||
| return job.Type == domain.ClusterJobProvision && job.ClusterID != uuid.Nil && job.UserID == userID | ||
| })).Return(nil).Once() | ||
|
|
||
| cluster, err := svc.CreateCluster(ctx, ports.CreateClusterParams{ | ||
| UserID: userID, | ||
| Name: testClusterName, | ||
| VpcID: vpcID, | ||
| Version: "v1.29.0", | ||
| Workers: 2, | ||
| }) | ||
|
|
||
| assert.NoError(t, err) | ||
| assert.NotNil(t, cluster) | ||
| assert.Equal(t, testClusterName, cluster.Name) | ||
| assert.Equal(t, domain.ClusterStatusPending, cluster.Status) | ||
|
|
||
| // Wait for background provisioning - Wait time removed as provision is not async in test mock unless explicitly delayed | ||
| taskQueue.AssertExpectations(t) | ||
| } | ||
|
|
||
| func TestClusterServiceDelete(t *testing.T) { | ||
| repo, _, _, _, taskQueue, _, svc := setupClusterServiceTest() | ||
| ctx := context.Background() | ||
| id := uuid.New() | ||
| cluster := &domain.Cluster{ID: id, Status: domain.ClusterStatusRunning} | ||
|
|
||
| repo.On("GetByID", mock.Anything, id).Return(cluster, nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Delete", mock.Anything, id).Return(nil) | ||
|
|
||
| // Expect task queue enqueue | ||
| taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool { | ||
| return job.Type == domain.ClusterJobDeprovision && job.ClusterID == id | ||
| })).Return(nil).Once() | ||
|
|
||
| err := svc.DeleteCluster(ctx, id) | ||
|
|
||
| assert.NoError(t, err) | ||
|
|
||
| taskQueue.AssertExpectations(t) | ||
| repo.AssertCalled(t, "GetByID", mock.Anything, id) | ||
| repo.AssertCalled(t, "Update", mock.Anything, mock.Anything) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Use table‑driven tests for cluster service cases.
Current tests are standalone; please convert to table-driven style to match project testing standards. As per coding guidelines, use table-driven tests in *_test.go files.
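One possible table-driven shape, as a sketch: it reuses only helpers visible in this file (setupClusterServiceTest, testClusterName, the ports/domain types), and the per-case setup returns closures so the concrete mock type names don't need to be spelled out here:

```go
func TestClusterService(t *testing.T) {
	cases := []struct {
		name    string
		arrange func() (run func(ctx context.Context) error, verify func(t *testing.T))
		wantErr bool
	}{
		{
			name: "create enqueues provision job",
			arrange: func() (func(ctx context.Context) error, func(t *testing.T)) {
				// Fresh mocks per case so expectations stay scoped to this subtest.
				repo, _, vpcSvc, _, taskQueue, secretSvc, svc := setupClusterServiceTest()
				userID, vpcID := uuid.New(), uuid.New()
				vpcSvc.On("GetVPC", mock.Anything, vpcID.String()).Return(&domain.VPC{ID: vpcID}, nil)
				repo.On("Create", mock.Anything, mock.Anything).Return(nil)
				repo.On("Update", mock.Anything, mock.Anything).Return(nil)
				secretSvc.On("Encrypt", mock.Anything, userID, mock.Anything).Return("encrypted-key", nil)
				taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool {
					return job.Type == domain.ClusterJobProvision && job.UserID == userID
				})).Return(nil).Once()

				run := func(ctx context.Context) error {
					_, err := svc.CreateCluster(ctx, ports.CreateClusterParams{
						UserID: userID, Name: testClusterName, VpcID: vpcID, Version: "v1.29.0", Workers: 2,
					})
					return err
				}
				verify := func(t *testing.T) {
					taskQueue.AssertExpectations(t)
					repo.AssertExpectations(t)
				}
				return run, verify
			},
		},
		// "delete enqueues deprovision job" and enqueue-failure cases follow the same shape.
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			run, verify := tc.arrange()
			err := run(context.Background())
			if tc.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
			}
			verify(t)
		})
	}
}
```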
🤖 Prompt for AI Agents
In `@internal/core/services/cluster_test.go` around lines 40 - 99, Convert the two
standalone tests TestClusterServiceCreate and TestClusterServiceDelete into a
single table-driven test (or separate table-driven tests) that iterates over
test cases and uses t.Run for subtests; for each case call
setupClusterServiceTest to initialize mocks, set per-case expectations on
repo.On, secretSvc.On, taskQueue.On (e.g., taskQueue.Enqueue assertions), invoke
svc.CreateCluster or svc.DeleteCluster as appropriate, and assert results and
mock expectations; keep case metadata (name, inputs, expected
error/cluster/status, mock setup function) in a slice and ensure each subtest
resets or recreates mocks so expectations (repo.On, secretSvc.On, taskQueue.On)
are scoped to that case.
| cluster.Status = domain.ClusterStatusDeleting | ||
| cluster.UpdatedAt = time.Now() | ||
| if err := s.repo.Update(ctx, cluster); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to update cluster status", err) | ||
| } | ||
|
|
||
| // Enqueue deprovision job | ||
| job := domain.ClusterJob{ | ||
| ClusterID: cluster.ID, | ||
| UserID: cluster.UserID, | ||
| Type: domain.ClusterJobDeprovision, | ||
| } | ||
|
|
||
| if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err) | ||
| } |
There was a problem hiding this comment.
Handle deprovision enqueue failures to avoid “stuck deleting” clusters.
If enqueue fails, the cluster stays in deleting with no job scheduled. Consider reverting status or marking failed and persisting the update (similar to CreateCluster).
🛠️ Suggested fix
if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
- return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err)
+ s.logger.Error("failed to enqueue cluster deprovision job", "cluster_id", cluster.ID, "error", err)
+ cluster.Status = domain.ClusterStatusFailed
+ cluster.UpdatedAt = time.Now()
+ if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
+ s.logger.Error("failed to update cluster status after enqueue failure", "cluster_id", cluster.ID, "error", updateErr)
+ }
+ return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err)
}
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 138 - 153, When Enqueue fails
after setting cluster.Status = domain.ClusterStatusDeleting and persisting via
s.repo.Update, revert the cluster status (e.g., back to previous status or to a
new failure state) and persist that change so the cluster does not remain stuck
in "deleting"; specifically, catch the error from s.taskQueue.Enqueue(ctx,
"k8s_jobs", job) and call s.repo.Update(ctx, cluster) after setting
cluster.Status to the original value or a "delete_failed" status, and return the
wrapped error; follow the same pattern used in CreateCluster to ensure the
failure is recorded and persisted for recovery/visibility.
| cluster.Status = domain.ClusterStatusUpgrading | ||
| cluster.UpdatedAt = time.Now() | ||
| if err := s.repo.Update(ctx, cluster); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| // Enqueue upgrade job | ||
| job := domain.ClusterJob{ | ||
| ClusterID: cluster.ID, | ||
| UserID: cluster.UserID, | ||
| Type: domain.ClusterJobUpgrade, | ||
| Version: version, | ||
| } | ||
|
|
||
| if s.taskQueue == nil { | ||
| return errors.New(errors.Internal, "task queue not initialized") | ||
| } | ||
|
|
||
| if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err) | ||
| } |
There was a problem hiding this comment.
Handle upgrade enqueue failures to avoid “stuck upgrading” clusters.
If enqueue fails, the cluster remains in upgrading with no job queued. Mirror CreateCluster’s failure handling and persist a failed state (or revert to running).
🛠️ Suggested fix
if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
- return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err)
+ s.logger.Error("failed to enqueue cluster upgrade job", "cluster_id", cluster.ID, "error", err)
+ cluster.Status = domain.ClusterStatusFailed
+ cluster.UpdatedAt = time.Now()
+ if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
+ s.logger.Error("failed to update cluster status after enqueue failure", "cluster_id", cluster.ID, "error", updateErr)
+ }
+ return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err)
}
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 273 - 293, When enqueueing
the upgrade job, if taskQueue is nil or taskQueue.Enqueue returns an error,
persist a recovery status on the cluster instead of leaving it in
ClusterStatusUpgrading; after setting cluster.Status =
domain.ClusterStatusUpgrading and calling s.repo.Update(ctx, cluster) you must
detect enqueue failures and update the cluster status back to
domain.ClusterStatusRunning (or domain.ClusterStatusFailed) and UpdatedAt,
calling s.repo.Update(ctx, cluster) again (mirror the CreateCluster failure
handling pattern). Specifically handle the case where s.taskQueue == nil and
when s.taskQueue.Enqueue(ctx, "k8s_jobs", job) returns an error by wrapping the
error for return but first setting cluster.Status to the appropriate
non-upgrading state and persisting via s.repo.Update(ctx, cluster) so the
cluster is not left stuck in upgrading.
| initCmd := fmt.Sprintf("kubeadm init --kubernetes-version=%s --pod-network-cidr=%s --control-plane-endpoint=%s:6443 --upload-certs --ignore-preflight-errors=all", cluster.Version, podCIDR, lbIP) | ||
| out, err := exec.Run(ctx, initCmd) | ||
| if err != nil { | ||
| return "", "", "", err | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
kubeadm init --ignore-preflight-errors documentation specific errors
💡 Result:
Short answer
- The --ignore-preflight-errors flag accepts a comma-separated list of preflight check names (e.g. IsPrivilegedUser,Swap) or the special value all to treat all preflight errors as warnings. Use it with kubeadm init (or the preflight phase). [1][2][3]
Usage / examples
- CLI: kubeadm init --ignore-preflight-errors=IsPrivilegedUser,Swap
- To ignore everything: kubeadm init --ignore-preflight-errors=all. [1][2]
Where to find the authoritative docs
- kubeadm init reference (shows the --ignore-preflight-errors option and examples). [1]
- kubeadm init phase preflight (preflight phase documentation; also documents the flag). [2]
- kubeadm config API (ignorePreflightErrors field for config files). [3]
If you want to know the exact names of checks on your kubeadm version
- Run: kubeadm init phase preflight --help or run kubeadm init phase preflight (it will list checks/errors for your system/version), or consult the generated reference for your kubeadm version on kubernetes.io. [2][1]
Sources
[1] kubeadm init reference — kubernetes.io.
[2] kubeadm init phase preflight — kubernetes.io.
[3] kubeadm-config API (ignorePreflightErrors) — kubernetes.io.
Specify individual preflight errors instead of ignoring all.
Using --ignore-preflight-errors=all bypasses all kubeadm safety checks, which could mask real issues (e.g., insufficient resources, port conflicts, missing dependencies). Specify only the particular preflight errors that need to be ignored by name (e.g., --ignore-preflight-errors=IsPrivilegedUser,Swap). You can discover available check names with kubeadm init phase preflight --help.
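As a sketch, the init command could take its ignore list from a named set or a per-cluster field — the `IgnoredPreflightErrors` field and the example check names below are assumptions for illustration, not existing code:

```go
// Build the kubeadm init command with an explicit, named ignore list instead of "all".
func buildKubeadmInitCmd(cluster *domain.Cluster, podCIDR, lbIP string) string {
	ignored := []string{"IsPrivilegedUser", "Swap"} // illustrative names; verify with `kubeadm init phase preflight --help`
	if len(cluster.IgnoredPreflightErrors) > 0 {    // hypothetical per-cluster override
		ignored = cluster.IgnoredPreflightErrors
	}
	return fmt.Sprintf(
		"kubeadm init --kubernetes-version=%s --pod-network-cidr=%s --control-plane-endpoint=%s:6443 --upload-certs --ignore-preflight-errors=%s",
		cluster.Version, podCIDR, lbIP, strings.Join(ignored, ","),
	)
}
```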
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/ha.go` around lines 139 - 143, The kubeadm init
invocation builds initCmd with --ignore-preflight-errors=all which silences all
safety checks; change the logic that constructs initCmd in ha.go (the initCmd
variable used with exec.Run) to specify a comma-separated list of named
preflight errors to ignore instead of "all" (e.g., IsPrivilegedUser,Swap or
other specific checks discovered via `kubeadm init phase preflight --help`);
implement this by replacing the literal string "--ignore-preflight-errors=all"
with a variable or function that returns the allowed names, or make it
configurable on the cluster object, and update the initCmd construction so the
chosen names are inserted (preserving the existing flags: --kubernetes-version,
--pod-network-cidr, --control-plane-endpoint, --upload-certs) before calling
exec.Run.
| if err := p.bootstrapNode(ctx, cluster, ip, cluster.Version, false); err != nil { | ||
| if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil { | ||
| p.logger.Error("failed to terminate instance after bootstrap error", "instance_id", worker.ID, "error", termErr) | ||
| } | ||
| if delErr := p.repo.DeleteNode(ctx, worker.ID); delErr != nil { | ||
| p.logger.Error("failed to delete node record after bootstrap error", "node_id", worker.ID, "error", delErr) | ||
| } | ||
| return fmt.Errorf("failed to bootstrap node %s (%s): %w", workerName, ip, err) | ||
| } |
There was a problem hiding this comment.
Bug: Wrong ID passed to TerminateInstance.
Line 116 passes worker.ID (the node record UUID) instead of worker.InstanceID (the compute instance UUID). This will fail to terminate the actual instance, leaving orphaned compute resources. Compare with scaleDown at line 148 which correctly uses node.InstanceID.String().
Proposed fix
if err := p.bootstrapNode(ctx, cluster, ip, cluster.Version, false); err != nil {
- if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil {
+ if termErr := p.instSvc.TerminateInstance(ctx, worker.InstanceID.String()); termErr != nil {
p.logger.Error("failed to terminate instance after bootstrap error", "instance_id", worker.ID, "error", termErr)
}
if delErr := p.repo.DeleteNode(ctx, worker.ID); delErr != nil {
p.logger.Error("failed to delete node record after bootstrap error", "node_id", worker.ID, "error", delErr)
}
return fmt.Errorf("failed to bootstrap node %s (%s): %w", workerName, ip, err)
}
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/health.go` around lines 115 - 123, The
TerminateInstance call in the bootstrap error handler is using worker.ID (node
record UUID) instead of the compute instance UUID; update the call in the
p.bootstrapNode error branch to pass worker.InstanceID.String() to
p.instSvc.TerminateInstance so the actual instance is terminated (mirror what
scaleDown uses with node.InstanceID.String()); ensure you reference
worker.InstanceID and convert to String() as in scaleDown.


🌟 Overview
This PR introduces the highly anticipated Managed Kubernetes (KaaS) service to The Cloud platform. It allows users to provision production-ready, highly available Kubernetes clusters with a single command.
The implementation follows a robust Hexagonal Architecture and introduces an Asynchronous Durable Operations layer using a Redis-backed Task Queue.
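The task-queue port implied by the review comments looks roughly like this (the method set is inferred from the Enqueue calls shown in the review; the real interface may differ):

```go
// Inferred shape of the TaskQueue port backing the durable operations layer.
type TaskQueue interface {
	// Enqueue pushes a cluster job onto a named queue (e.g. "k8s_jobs").
	Enqueue(ctx context.Context, queue string, job domain.ClusterJob) error
}
```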
🚀 Key Features
ClusterWorker.
🏗️ Architectural Changes
TaskQueue port and implementation.
🧪 Testing
📝 Documentation
README.md, FEATURES.md, and architecture.md.
Summary by CodeRabbit
New Features
Infrastructure
Documentation
Tests