feat: Managed Kubernetes (KaaS) with HA and Task Queue #29
Conversation
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/repositories/k8s/provisioner.go`:
- Around line 97-100: The call checks masterIP but then passes nil into
p.failCluster which can produce a `%!w(<nil>)` when wrapped; instead create an
explicit error (e.g., errors.New or fmt.Errorf with a clear message like "master
node failed to get an IP") and pass that error into p.failCluster(ctx, cluster,
"master node failed to get an IP", err) so p.failCluster receives a real error
object; update the branch after masterIP := p.waitForIP(ctx, master.ID)
accordingly and ensure the new error is used in the call.
♻️ Duplicate comments (6)
internal/repositories/k8s/provisioner.go (6)
101-103: Handle repository update failures when persisting control plane IPs. Ignoring this error can leave the DB out of sync with the in-memory state. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 cluster.ControlPlaneIPs = []string{masterIP}
-_ = p.repo.Update(ctx, cluster)
+if err := p.repo.Update(ctx, cluster); err != nil {
+	return "", p.failCluster(ctx, cluster, "failed to persist control plane IPs", err)
+}
```
120-127: Don't persist plaintext kubeconfig and handle update errors. Storing unencrypted kubeconfig is a security risk, and the update error is ignored. As per coding guidelines, avoid silent failures.
🔒 Proposed fix
```diff
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
 if err != nil {
-	p.logger.Error("failed to encrypt kubeconfig", "cluster_id", cluster.ID, "error", err)
-	cluster.Kubeconfig = kubeconfig
-} else {
-	cluster.Kubeconfig = encryptedKubeconfig
+	return "", p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 }
-_ = p.repo.Update(ctx, cluster)
+cluster.Kubeconfig = encryptedKubeconfig
+if err := p.repo.Update(ctx, cluster); err != nil {
+	return "", p.failCluster(ctx, cluster, "failed to persist kubeconfig", err)
+}
```
132-160: Don’t silently ignore worker status updates or all‑worker failures.
`_ = p.updateNodeStatus(...)` hides errors, and the method succeeds even if all workers fail. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 func (p *KubeadmProvisioner) provisionWorkers(ctx context.Context, cluster *domain.Cluster, joinCmd string) (string, error) {
+	successCount := 0
 	for i := 0; i < cluster.WorkerCount; i++ {
 @@
 		if err := p.joinCluster(ctx, cluster, workerIP, joinCmd); err != nil {
 			p.logger.Error("failed to join worker to cluster", "ip", workerIP, "error", err)
-			_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed")
+			if updateErr := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed"); updateErr != nil {
+				p.logger.Error("failed to update node status", "instance_id", worker.ID, "error", updateErr)
+			}
 			continue
 		}
-		_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "active")
+		if err := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "active"); err != nil {
+			p.logger.Error("failed to update node status", "instance_id", worker.ID, "error", err)
+			continue
+		}
+		successCount++
 	}
+	if successCount == 0 && cluster.WorkerCount > 0 {
+		return joinCmd, fmt.Errorf("no workers successfully provisioned")
+	}
 	return joinCmd, nil
 }
```
200-201: Handle repo update failures when marking the cluster running. If this update fails, the cluster is running but the DB says otherwise. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 cluster.Status = domain.ClusterStatusRunning
-_ = p.repo.Update(ctx, cluster)
+if err := p.repo.Update(ctx, cluster); err != nil {
+	p.logger.Error("failed to update cluster status to running", "cluster_id", cluster.ID, "error", err)
+	return fmt.Errorf("failed to persist cluster running status: %w", err)
+}
```
246-252: Don't ignore DeleteNode errors during deprovisioning. Ignoring this can leave orphaned DB records. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 for _, node := range nodes {
 	p.logger.Info("terminating node instance", "instance_id", node.InstanceID)
 	if err := p.instSvc.TerminateInstance(ctx, node.InstanceID.String()); err != nil {
 		p.logger.Error("failed to terminate node instance", "instance_id", node.InstanceID, "error", err)
 	}
-	_ = p.repo.DeleteNode(ctx, node.ID)
+	if err := p.repo.DeleteNode(ctx, node.ID); err != nil {
+		p.logger.Error("failed to delete node record", "node_id", node.ID, "error", err)
+	}
 }
```
266-268: Persisting failed status must check the update error. This is a critical state transition; ignoring the error breaks errcheck and can hide a failed persistence. As per coding guidelines, avoid silent failures.
🐛 Proposed fix
```diff
 func (p *KubeadmProvisioner) failCluster(ctx context.Context, cluster *domain.Cluster, msg string, err error) error {
 	cluster.Status = domain.ClusterStatusFailed
-	_ = p.repo.Update(ctx, cluster)
+	if updateErr := p.repo.Update(ctx, cluster); updateErr != nil {
+		p.logger.Error("failed to update cluster status to failed", "cluster_id", cluster.ID, "error", updateErr)
+	}
 	p.logger.Error(msg, "cluster_id", cluster.ID, "error", err)
 	return fmt.Errorf("%s: %w", msg, err)
 }
```
🧹 Nitpick comments (1)
internal/repositories/k8s/provisioner.go (1)
229-235: Prefer `strings.Index` over the custom `mathStringsIndex`. This reduces custom code and leverages the standard library's tested implementation.
♻️ Proposed refactor
```diff
 import (
 	"context"
 	"fmt"
 	"log/slog"
+	"strings"
 @@
-	if idx := mathStringsIndex(instIP, "/"); idx != -1 {
+	if idx := strings.Index(instIP, "/"); idx != -1 {
 		instIP = instIP[:idx]
 	}
 @@
-func mathStringsIndex(s, substr string) int {
-	for i := 0; i < len(s)-len(substr)+1; i++ {
-		if s[i:i+len(substr)] == substr {
-			return i
-		}
-	}
-	return -1
-}
```
```go
masterIP := p.waitForIP(ctx, master.ID)
if masterIP == "" {
	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
}
```
🧩 Analysis chain
🏁 Scripts executed:
```bash
# First, let's read the provisioner.go file around lines 97-100
head -110 internal/repositories/k8s/provisioner.go | tail -30

# Find the failCluster function to understand how it uses the error parameter
grep -n "func.*failCluster" internal/repositories/k8s/provisioner.go

# Get more context around the failCluster function
grep -A 10 "func.*failCluster" internal/repositories/k8s/provisioner.go

# Check the imports to see if errors is already imported
head -30 internal/repositories/k8s/provisioner.go | grep -E "^import|\"errors\""

# Get the full import block
head -20 internal/repositories/k8s/provisioner.go

# Also verify if there are any other nil error passes to failCluster
grep -n "failCluster.*nil" internal/repositories/k8s/provisioner.go
```
Avoid wrapping a nil error in failCluster.
Passing nil into fmt.Errorf("%s: %w", ...) produces %!w(<nil>) and obscures the cause. Create an explicit error before calling failCluster instead.
🐛 Proposed fix
```diff
 import (
 	"context"
+	"errors"
 	"fmt"
 	"log/slog"
```
```diff
 masterIP := p.waitForIP(ctx, master.ID)
 if masterIP == "" {
-	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
+	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", errors.New("master node has no assigned IP"))
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	...
)

masterIP := p.waitForIP(ctx, master.ID)
if masterIP == "" {
	return "", p.failCluster(ctx, cluster, "master node failed to get an IP", errors.New("master node has no assigned IP"))
}
```
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/provisioner.go` around lines 97 - 100, The call
checks masterIP but then passes nil into p.failCluster which can produce a
`%!w(<nil>)` when wrapped; instead create an explicit error (e.g., errors.New or
fmt.Errorf with a clear message like "master node failed to get an IP") and pass
that error into p.failCluster(ctx, cluster, "master node failed to get an IP",
err) so p.failCluster receives a real error object; update the branch after
masterIP := p.waitForIP(ctx, master.ID) accordingly and ensure the new error is
used in the call.
Pull request overview
Copilot reviewed 70 out of 71 changed files in this pull request and generated 2 comments.
```go
config := &ssh.ClientConfig{
	User:            c.User,
	Auth:            c.Auth,
	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
```
The SSH client uses InsecureIgnoreHostKey which skips host key verification. This is a security risk in production environments as it makes the connection vulnerable to man-in-the-middle attacks. Consider implementing proper host key verification or at least documenting this security trade-off.
```go
handleRBACMock(w, r, method, path) ||
	handleDatabasesMock(w, r, method, path) ||
	handleSecretsMock(w, r, method, path)
// handleDatabasesMock(w, r, method, path) || // Duplicate removed
```
Leftover "duplicate removed" comment. The line "handleDatabasesMock(w, r, method, path) ||" appears to have been accidentally duplicated and then removed, but the comment marking it as a duplicate should also be removed for cleaner code.
```go
// handleDatabasesMock(w, r, method, path) || // Duplicate removed
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@internal/repositories/k8s/health.go`:
- Around line 197-205: Guard against empty kubectl output before splitting:
after calling exec.Run(ctx, kubectlBase+" get nodes --no-headers") and obtaining
nodesOut, check if strings.TrimSpace(nodesOut) == "" and if so set
health.NodesTotal = 0 and skip the readyCount loop (return or continue),
otherwise split into lines and compute readyCount as currently implemented;
update the logic around the nodesOut/lines handling (the section referencing
nodesOut, health.NodesTotal and the readyCount loop) so you don't treat an empty
string as a single phantom node.
In `@internal/repositories/k8s/lifecycle.go`:
- Around line 187-203: The current flow uses exec.Run to base64-encode the
entire snapshot into a string (b64Output) and then decodes it into a byte slice
(snapshotData) before calling p.storageSvc.Upload, which buffers the whole
snapshot in memory and risks OOM/truncation for multi-GB etcd backups; instead
refactor to stream the snapshot through an io.Reader to p.storageSvc.Upload (or
generate a presigned URL and run curl on the node) so data is piped without full
buffering: replace the exec.Run(base64 ...) + base64.DecodeString +
bytes.NewReader(snapshotData) pattern with a streaming approach (e.g., an
exec/Executor method that returns an io.ReadCloser for tempPath or a remote curl
upload using a presigned URL) and call p.storageSvc.Upload with that reader,
updating usages of exec.Run, tempPath, b64Output, snapshotData and
p.storageSvc.Upload accordingly.
♻️ Duplicate comments (4)
cmd/cloud/cli_commands_test.go (1)
58-58: Remove the commented-out code. This commented line is a leftover artifact from removing duplicates and should be deleted for cleaner code.
pkg/sshutil/client.go (2)
39-43: Avoid `InsecureIgnoreHostKey` in production.
This disables host key verification and enables MITM. Consider `knownhosts` or a configurable callback (already flagged by security tooling).
Also applies to: 86-90
39-49: Replace hard-coded SSH port/timeouts with shared constants.
There are still literal `"22"` and `10 * time.Second` values. Consolidate them as named constants and reuse across Run/WriteFile/WaitForSSH.
♻️ Proposed fix
```diff
-const (
-	sshWaitInterval = 2 * time.Second
-	sshDialTimeout  = 2 * time.Second
-)
+const (
+	defaultSSHPort   = "22"
+	sshClientTimeout = 10 * time.Second
+	sshWaitInterval  = 2 * time.Second
+	sshDialTimeout   = 2 * time.Second
+)
```
```diff
 config := &ssh.ClientConfig{
 	User:            c.User,
 	Auth:            c.Auth,
 	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
-	Timeout:         10 * time.Second,
+	Timeout:         sshClientTimeout,
 }
 addr := c.Host
 if _, _, err := net.SplitHostPort(addr); err != nil {
-	addr = net.JoinHostPort(addr, "22")
+	addr = net.JoinHostPort(addr, defaultSSHPort)
 }
```
```diff
 addr := c.Host
 if _, _, err := net.SplitHostPort(addr); err != nil {
-	addr = net.JoinHostPort(addr, "22")
+	addr = net.JoinHostPort(addr, defaultSSHPort)
 }
```
As per coding guidelines, use named constants for timeouts and ports.
Also applies to: 86-96, 161-178
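For the two `pkg/sshutil` comments above, a minimal sketch of the combined direction, assuming the client keeps its existing `User`/`Auth`/`Host` fields; the constant names, function names, and the `known_hosts` path handling are illustrative, not the repo's current API:

```go
package sshutil

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/knownhosts"
)

const (
	defaultSSHPort   = "22"
	sshClientTimeout = 10 * time.Second
)

// newClientConfig verifies host keys against a known_hosts file instead of
// using ssh.InsecureIgnoreHostKey, and reuses the shared timeout constant.
func newClientConfig(user string, auth []ssh.AuthMethod, knownHostsPath string) (*ssh.ClientConfig, error) {
	hostKeyCallback, err := knownhosts.New(knownHostsPath)
	if err != nil {
		return nil, err
	}
	return &ssh.ClientConfig{
		User:            user,
		Auth:            auth,
		HostKeyCallback: hostKeyCallback,
		Timeout:         sshClientTimeout,
	}, nil
}

// normalizeAddr appends the default SSH port when the host has none.
func normalizeAddr(host string) string {
	if _, _, err := net.SplitHostPort(host); err != nil {
		return net.JoinHostPort(host, defaultSSHPort)
	}
	return host
}
```

A configurable callback (for example, falling back to a more permissive policy only in local development) keeps tests workable while removing the MITM exposure in production.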
internal/repositories/k8s/health.go (1)
83-116: Use instance IDs for termination and don't ignore status-update errors.
`TerminateInstance` is called with `worker.ID` (cluster node ID), which likely isn't the cloud instance ID; this can leave orphaned instances. Also `_ = p.updateNodeStatus(...)` silently discards errors and violates the no-silent-failure guideline. Based on learnings, avoid blank-identifier error suppression.
✅ Suggested fix
```diff
-	if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil {
+	if termErr := p.instSvc.TerminateInstance(ctx, worker.InstanceID.String()); termErr != nil {
 		p.logger.Error("failed to terminate failed bootstrap instance", "instance_id", worker.ID, "error", termErr)
 	}
 @@
 	if err := p.joinCluster(ctx, cluster, ip, joinCmd); err != nil {
 		errs = append(errs, fmt.Errorf("failed to join node %s to cluster: %w", workerName, err))
-		_ = p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed")
+		if err := p.updateNodeStatus(ctx, cluster.ID, worker.ID, "failed"); err != nil {
+			p.logger.Error("failed to update node status to failed", "node_id", worker.ID, "error", err)
+			errs = append(errs, fmt.Errorf("failed to mark node %s as failed: %w", workerName, err))
+		}
 		continue
 	}
```
Run this to confirm `waitForIP`/`createNode` expectations and ensure the correct ID is used:
```bash
#!/bin/bash
rg -n "func \\(p \\*KubeadmProvisioner\\) waitForIP\\(" -C3
rg -n "func \\(p \\*KubeadmProvisioner\\) createNode\\(" -C3
rg -n "TerminateInstance\\(" -C3
```
```go
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
if err == nil {
	lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
	health.NodesTotal = len(lines)
	readyCount := 0
	for _, line := range lines {
		if strings.Contains(line, " Ready ") {
			readyCount++
		}
```
Handle empty kubectl get nodes output to avoid a phantom node.
When nodesOut is empty, strings.Split("", "\n") yields a single empty line, so NodesTotal becomes 1. Guard on empty output before splitting.
🔧 Suggested fix
```diff
-	if err == nil {
-		lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
-		health.NodesTotal = len(lines)
+	if err == nil {
+		trimmed := strings.TrimSpace(nodesOut)
+		if trimmed == "" {
+			health.NodesTotal = 0
+			return health, nil
+		}
+		lines := strings.Split(trimmed, "\n")
+		health.NodesTotal = len(lines)
 		readyCount := 0
 		for _, line := range lines {
 			if strings.Contains(line, " Ready ") {
 				readyCount++
 			}
 		}
 		health.NodesReady = readyCount
 	}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
if err == nil {
	trimmed := strings.TrimSpace(nodesOut)
	if trimmed == "" {
		health.NodesTotal = 0
		return health, nil
	}
	lines := strings.Split(trimmed, "\n")
	health.NodesTotal = len(lines)
	readyCount := 0
	for _, line := range lines {
		if strings.Contains(line, " Ready ") {
			readyCount++
		}
	}
	health.NodesReady = readyCount
}
```
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/health.go` around lines 197 - 205, Guard against
empty kubectl output before splitting: after calling exec.Run(ctx, kubectlBase+"
get nodes --no-headers") and obtaining nodesOut, check if
strings.TrimSpace(nodesOut) == "" and if so set health.NodesTotal = 0 and skip
the readyCount loop (return or continue), otherwise split into lines and compute
readyCount as currently implemented; update the logic around the nodesOut/lines
handling (the section referencing nodesOut, health.NodesTotal and the readyCount
loop) so you don't treat an empty string as a single phantom node.
```go
// 2. Extract snapshot as Base64 to handle binary data through the executor
p.logger.Info("uploading snapshot to remote storage", "cluster_id", cluster.ID)
b64Output, err := exec.Run(ctx, fmt.Sprintf("base64 -w 0 %s", tempPath))
if err != nil {
	return fmt.Errorf("failed to read snapshot: %w", err)
}

snapshotData, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64Output))
if err != nil {
	return fmt.Errorf("failed to decode base64 snapshot: %w", err)
}

bucket := "k8s-backups"
key := fmt.Sprintf("%s/snapshot-%d.db", cluster.ID, time.Now().Unix())
_, err = p.storageSvc.Upload(ctx, bucket, key, bytes.NewReader(snapshotData))
if err != nil {
	return fmt.Errorf("failed to upload snapshot to storage: %w", err)
```
🧩 Analysis chain
🏁 Scripts executed:
```bash
# First, find and read the file to confirm the code matches
find . -name "lifecycle.go" -path "*/k8s/*" -type f

# Read the specific lines mentioned in the review
sed -n '187,203p' internal/repositories/k8s/lifecycle.go

# Get broader context around these lines to understand the function
sed -n '170,220p' internal/repositories/k8s/lifecycle.go

# Search for the exec.Run implementation to understand how it handles output
rg "func.*Run\(" --type go -A 10 | head -100

# Check if there are any constants or limits defined for this operation
rg "SSH|buffer|snapshot" --type go internal/repositories/k8s/ -C 2 | head -80

# Look at the sshutil.Client.Run implementation in detail
cat -n pkg/sshutil/client.go | grep -A 30 "func (c \*Client) Run"

# Check for any buffer size limits in SSH client
rg "maxSize|BufferSize|limit|buffer" --type go pkg/sshutil/ -C 3

# Look at the StorageService interface to understand Upload expectations
rg "type StorageService|Upload.*func" --type go -A 5 | head -50

# Get the complete sshutil.Client.Run implementation
cat -n pkg/sshutil/client.go | sed -n '38,80p'

# Check if there's any documentation or comments about SSH output handling
cat pkg/sshutil/client.go | head -100

# Check the interface to understand the expected behavior
rg "type NodeExecutor|Run\(ctx" --type go internal/repositories/k8s/ -A 2
```
Refactor snapshot upload to avoid buffering large etcd backups in memory.
The entire snapshot is base64-encoded through SSH (buffered in bytes.Buffer), returned as a string, then fully decoded into memory before upload. etcd snapshots can be several GB, and base64 encoding inflates the size by 33%; this risks exceeding SSH message limits, memory exhaustion, or truncation. The StorageService.Upload already accepts io.Reader for streaming, so instead:
- Generate a presigned URL on the storage backend and have the node upload directly via `curl` or similar
- Implement a streaming upload endpoint or executor method that pipes snapshot data without buffering the entire output
This avoids double memory allocation and leverages the streaming capabilities already designed into the architecture.
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/lifecycle.go` around lines 187 - 203, The current
flow uses exec.Run to base64-encode the entire snapshot into a string
(b64Output) and then decodes it into a byte slice (snapshotData) before calling
p.storageSvc.Upload, which buffers the whole snapshot in memory and risks
OOM/truncation for multi-GB etcd backups; instead refactor to stream the
snapshot through an io.Reader to p.storageSvc.Upload (or generate a presigned
URL and run curl on the node) so data is piped without full buffering: replace
the exec.Run(base64 ...) + base64.DecodeString + bytes.NewReader(snapshotData)
pattern with a streaming approach (e.g., an exec/Executor method that returns an
io.ReadCloser for tempPath or a remote curl upload using a presigned URL) and
call p.storageSvc.Upload with that reader, updating usages of exec.Run,
tempPath, b64Output, snapshotData and p.storageSvc.Upload accordingly.
…ests, and fix SonarQube duplication warnings
Pull request overview
Copilot reviewed 74 out of 75 changed files in this pull request and generated no new comments.
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@cmd/api/main.go`:
- Around line 30-56: The package-level DI globals (loadConfigFunc,
initDatabaseFunc, ..., newHTTPServer, startHTTPServer, shutdownHTTPServer,
notifySignals) should be removed and replaced with a single dependency container
passed into main/initInfrastructure/initBackends/runApplication; define an
AppDeps (or Options) struct that holds those function fields and inject it into
SetupRouter/initRepositories/initServices/initHandlers and any functions that
currently reference the globals; update
initInfrastructure/initBackends/runApplication signatures to accept AppDeps and
thread the deps through instead of using globals so tests can construct AppDeps
with stubbed functions (including
newHTTPServer/startHTTPServer/shutdownHTTPServer/notifySignals) and no
package-level mutable state remains.
In `@internal/core/domain/cluster.go`:
- Around line 34-46: The Cluster struct currently exposes sensitive fields
SSHKey and Kubeconfig via JSON tags; change their tags to json:"-" or remove
them from the public serialization so secrets are not emitted when Cluster is
marshaled or logged. Locate the Cluster type (fields SSHKey and Kubeconfig) and
update their struct tags to json:"-" or refactor callers to return a dedicated
DTO/response type (e.g., ClusterResponse) that omits those fields; ensure any
existing JSON usage (marshal/unmarshal, handlers returning Cluster) is updated
to use the safe DTO or otherwise avoid serializing SSHKey/Kubeconfig.
In `@internal/core/services/cluster.go`:
- Around line 39-43: Change NewClusterService to avoid panicking on nil
TaskQueue: update the constructor signature to return (*ClusterService, error)
and validate params.TaskQueue, returning a descriptive error when it's nil
instead of calling panic. Ensure callers are updated to handle the error; keep
the ClusterService and ClusterServiceParams types unchanged and perform the nil
check inside NewClusterService (or a helper like validateClusterServiceParams)
so construction fails cleanly with an error rather than crashing the process.
- Around line 96-99: If s.taskQueue.Enqueue(ctx, "k8s_jobs", job) fails you must
not silently return success; change the code to mark the cluster as failed and
return an error to the caller. On Enqueue error, call your cluster status
updater (e.g. s.store.UpdateClusterStatus or equivalent) to set cluster.ID to a
failed state with the enqueue error message, log the failure via s.logger.Error
with the error, and return a non-nil error (wrap the original err) so the caller
knows provisioning did not complete; alternatively persist a job record for the
worker to sweep if you prefer retry semantics, but do not leave the cluster
permanently pending. (A minimal sketch of this failure-handling pattern follows the cluster.go items below.)
- Around line 63-81: The private SSH key returned by crypto.GenerateSSHKeyPair()
is being stored raw on the domain.Cluster (cluster.SSHKey); instead, call your
SecretService to encrypt or store the secret and persist only the ciphertext or
secret reference: use SecretService.Encrypt/Store (or equivalent) right after
privKey is generated, handle and return any errors from that call, set
cluster.SSHKey to the returned ciphertext/secretID, and ensure any code that
needs the raw key uses SecretService.Decrypt/Fetch when required; update error
messages and types around GenerateSSHKeyPair and cluster creation to reflect
encryption failures.
- Around line 165-170: The current code calls s.secretSvc.Decrypt and on error
logs via s.logger.Error but returns cluster.Kubeconfig, which can expose
ciphertext and yields an unusable value; change the error handling in the
function that calls s.secretSvc.Decrypt so that it returns an error instead of
cluster.Kubeconfig (e.g., return empty string and propagate or wrap the Decrypt
error), and keep the s.logger.Error call for context using cluster.ID and the
error from s.secretSvc.Decrypt; update any callers if necessary to handle the
propagated error.
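For the Enqueue-failure item above (the cluster.go prompt for lines 96-99), a minimal sketch of the failure handling, assuming the service's existing `repo`, `taskQueue`, and `logger` fields; the helper name and the job payload type are illustrative:

```go
// enqueueProvisionJob submits the provisioning job and surfaces failures
// instead of silently leaving the cluster stuck in a pending state.
func (s *ClusterService) enqueueProvisionJob(ctx context.Context, cluster *domain.Cluster, job any) error {
	if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
		// Mark the cluster failed so callers and operators can see why provisioning never ran.
		cluster.Status = domain.ClusterStatusFailed
		if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
			s.logger.Error("failed to mark cluster failed after enqueue error",
				"cluster_id", cluster.ID, "error", updateErr)
		}
		s.logger.Error("failed to enqueue provisioning job", "cluster_id", cluster.ID, "error", err)
		return fmt.Errorf("failed to enqueue provisioning job: %w", err)
	}
	return nil
}
```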
In `@internal/repositories/k8s/ha.go`:
- Around line 76-79: The code path checking ip := p.waitForIP(ctx, node.ID)
should not pass nil into failCluster when ip == ""; create an explicit error
(e.g., using fmt.Errorf or errors.New) describing the missing master IP for the
given node/name and pass that error into p.failCluster(ctx, cluster, "timeout
waiting for IP for master "+name, err) instead of nil; update the block around
waitForIP and the call to failCluster so the real error is logged/propagated.
- Around line 132-134: parseJoinCommands can return empty strings and you must
validate its output before using it; after calling p.parseJoinCommands(out)
check that joinCmd and cpJoinCmd are non-empty and return a clear error (or fail
early) instead of proceeding to call joinHAMasters with empty commands. Update
the caller around parseJoinCommands (the variables joinCmd and cpJoinCmd) to
validate both values, surface a descriptive error if either is empty, and avoid
executing further steps (including invoking joinHAMasters) when validation
fails.
- Around line 83-85: The AddTarget error is being ignored with p.logger.Warn
which can leave masters unreachable; change the behavior in the block calling
p.lbSvc.AddTarget (the call referencing p.lbSvc.AddTarget(ctx, lb.ID, node.ID,
6443, 10) and currently logging via p.logger.Warn) to fail fast by returning a
wrapped error (including lb.ID and node.ID) from the enclosing function instead
of only warning; alternatively implement a retry with backoff (respecting ctx)
around p.lbSvc.AddTarget and only return an error after retries are
exhausted—ensure callers of this enclosing function handle the returned error.
- Around line 22-30: The code currently sets lbIP from lb.IP, sleeps once and
tries p.lbSvc.Get(ctx, lb.ID) once, but may still leave lbIP empty; update the
logic in the block around lbIP, p.lbSvc.Get and cluster.APIServerLBAddress to
perform a bounded retry loop (e.g., max N attempts with a short delay) that
re-queries p.lbSvc.Get(ctx, lb.ID) until a non-empty IP is returned, and if
after the retries lbIP is still empty return or propagate an error (do not set
cluster.APIServerLBAddress to an empty value); reference the existing lbIP
variable, the p.lbSvc.Get call, lb.ID, and cluster.APIServerLBAddress when
implementing the retry-and-fail behavior.
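For the ha.go load-balancer IP item directly above, a minimal sketch of a bounded retry; the attempt count, delay, and helper name are illustrative, while `lbSvc.Get(ctx, id)` matches the signature mocked later in this thread:

```go
const (
	lbIPMaxAttempts = 10
	lbIPRetryDelay  = 3 * time.Second
)

// waitForLBAddress polls the load balancer until it reports an IP or the
// attempts are exhausted, instead of accepting an empty address.
func (p *KubeadmProvisioner) waitForLBAddress(ctx context.Context, lbID uuid.UUID) (string, error) {
	for attempt := 0; attempt < lbIPMaxAttempts; attempt++ {
		lb, err := p.lbSvc.Get(ctx, lbID)
		if err == nil && lb.IP != "" {
			return lb.IP, nil
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(lbIPRetryDelay):
		}
	}
	return "", fmt.Errorf("load balancer %s did not report an IP after %d attempts", lbID, lbIPMaxAttempts)
}
```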
In `@internal/workers/cluster_worker.go`:
- Around line 59-62: The loop currently spawns unbounded goroutines at the line
calling "go w.processJob(job)" which can exhaust resources; replace the
unbounded spawn with a bounded approach—either call w.processJob(job) inline
(remove the "go" to process synchronously) or implement a semaphore/worker-pool
in the worker (add a buffered channel semaphore on the worker struct and acquire
before launching a goroutine and release after processJob returns) so that
the calls to w.processJob and logger.Info remain but job processing is limited to a
fixed concurrency. (A sketch of a semaphore-bounded loop follows this list.)
- Around line 44-51: The loop currently tight-spins when
w.taskQueue.Dequeue(ctx, "k8s_jobs") returns an empty msg or when it errors, so
introduce a named constant (e.g., queuePollBackoff) for the backoff duration and
use it in both the error case and the msg=="" case instead of time.Sleep(1 *
time.Second) and a bare continue; replace the magic number with this constant
and call time.Sleep(queuePollBackoff) before continuing to avoid pegging the
CPU; update any surrounding comments to clarify the shared backoff usage.
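For the two cluster_worker.go items above, a rough sketch of a buffered-channel semaphore combined with a named poll backoff; the receiver type, field names, constants, and the assumption that Dequeue returns a string message are illustrative:

```go
const (
	queuePollBackoff  = 1 * time.Second
	maxConcurrentJobs = 4
)

// run polls the queue with a shared backoff and bounds job concurrency with a
// buffered-channel semaphore instead of spawning unbounded goroutines.
func (w *ClusterWorker) run(ctx context.Context) {
	sem := make(chan struct{}, maxConcurrentJobs)
	for {
		if ctx.Err() != nil {
			return
		}
		msg, err := w.taskQueue.Dequeue(ctx, "k8s_jobs")
		if err != nil || msg == "" {
			time.Sleep(queuePollBackoff) // shared backoff for both the error and empty-poll cases
			continue
		}
		sem <- struct{}{} // acquire a slot; blocks while maxConcurrentJobs are in flight
		go func(job string) {
			defer func() { <-sem }() // release the slot when the job finishes
			w.processJob(job)
		}(msg)
	}
}
```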
♻️ Duplicate comments (4)
internal/repositories/k8s/node_executor.go (1)
73-88: Timeout context is not used for readiness probes
`ctxTimeout` is created, but `e.Run` uses the parent `ctx`, so a hung exec can ignore the timeout. Use `ctxTimeout` for the readiness probe. As per coding guidelines, propagate context to blocking calls.
🛠️ Proposed fix
```diff
-	_, err := e.Run(ctx, "echo ready")
+	_, err := e.Run(ctxTimeout, "echo ready")
```
internal/repositories/k8s/provisioner.go (3)
99-102: Don't pass nil into failCluster
Wrapping a nil error results in `%!w(<nil>)`. Create an explicit error for the missing IP.
🛠️ Proposed fix
```diff
-	if masterIP == "" {
-		return "", p.failCluster(ctx, cluster, "master node failed to get an IP", nil)
-	}
+	if masterIP == "" {
+		return "", p.failCluster(ctx, cluster, "master node failed to get an IP", fmt.Errorf("master node has no assigned IP"))
+	}
```
122-129: Do not persist kubeconfig when encryption fails
Persisting plaintext kubeconfig is a security risk. Fail the provisioning (or mark the cluster failed) instead of storing secrets unencrypted.
🛠️ Proposed fix
```diff
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
 if err != nil {
-	p.logger.Error("failed to encrypt kubeconfig", "cluster_id", cluster.ID, "error", err)
-	cluster.Kubeconfig = kubeconfig
-} else {
-	cluster.Kubeconfig = encryptedKubeconfig
+	return "", p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 }
+cluster.Kubeconfig = encryptedKubeconfig
 _ = p.repo.Update(ctx, cluster)
```
268-271: Handle repo.Update failure inside failCluster
Ignoring `p.repo.Update` hides persistence failures and also trips errcheck. Handle the error explicitly. As per coding guidelines, avoid silent failures.
🛠️ Proposed fix
```diff
 func (p *KubeadmProvisioner) failCluster(ctx context.Context, cluster *domain.Cluster, msg string, err error) error {
 	cluster.Status = domain.ClusterStatusFailed
-	_ = p.repo.Update(ctx, cluster)
+	if updateErr := p.repo.Update(ctx, cluster); updateErr != nil {
+		p.logger.Error("failed to update cluster status to failed", "cluster_id", cluster.ID, "error", updateErr)
+	}
 	p.logger.Error(msg, "cluster_id", cluster.ID, "error", err)
 	return fmt.Errorf("%s: %w", msg, err)
 }
```
🧹 Nitpick comments (5)
cmd/api/main.go (1)
244-251: Replace the magic worker count with a derived value
`wg.Add(7)` will drift as workers are added/removed. Use a slice and `len(...)` to keep it correct and self-updating.
♻️ Suggested refactor
```diff
 func runWorkers(ctx context.Context, wg *sync.WaitGroup, workers *setup.Workers) {
-	wg.Add(7)
-	go workers.LB.Run(ctx, wg)
-	go workers.AutoScaling.Run(ctx, wg)
-	go workers.Cron.Run(ctx, wg)
-	go workers.Container.Run(ctx, wg)
-	go workers.Provision.Run(ctx, wg)
-	go workers.Accounting.Run(ctx, wg)
-	go workers.Cluster.Run(ctx, wg)
+	workerFns := []func(context.Context, *sync.WaitGroup){
+		workers.LB.Run,
+		workers.AutoScaling.Run,
+		workers.Cron.Run,
+		workers.Container.Run,
+		workers.Provision.Run,
+		workers.Accounting.Run,
+		workers.Cluster.Run,
+	}
+	wg.Add(len(workerFns))
+	for _, fn := range workerFns {
+		go fn(ctx, wg)
+	}
 }
```
internal/repositories/postgres/lb_repo_test.go (1)
24-341: Consider table-driven tests for consistency. These tests use repeated `t.Run` blocks; converting to a `[]struct{...}` table will reduce duplication and align with repo testing conventions. As per coding guidelines, use table-driven tests in test files.
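A rough sketch of the table-driven shape being suggested; the case names, fields, and the `newTestLBRepo` helper are hypothetical, not taken from the repo:

```go
func TestLBRepoCreate(t *testing.T) {
	tests := []struct {
		name    string
		input   domain.LoadBalancer
		wantErr bool
	}{
		{name: "valid load balancer", input: domain.LoadBalancer{Name: "web-lb", Port: 80}, wantErr: false},
		{name: "missing name", input: domain.LoadBalancer{Port: 80}, wantErr: true},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			repo := newTestLBRepo(t) // hypothetical setup helper shared by every case
			err := repo.Create(context.Background(), &tc.input)
			if tc.wantErr {
				assert.Error(t, err)
				return
			}
			assert.NoError(t, err)
		})
	}
}
```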
internal/core/services/shared_test.go (1)
900-906: Misplaced method breaks code organization. `ListByVPC` for `MockInstanceRepo` is placed after `MockClusterProvisioner`, but `MockInstanceRepo` and its other methods are defined at lines 772-813. This disrupts the logical grouping and makes the file harder to navigate.
♻️ Suggested fix: Move method to join other MockInstanceRepo methods
Move this method to immediately after line 813 (after the `Delete` method of `MockInstanceRepo`), keeping all `MockInstanceRepo` methods together:
```diff
 func (m *MockInstanceRepo) Delete(ctx context.Context, id uuid.UUID) error {
 	return m.Called(ctx, id).Error(0)
 }
+func (m *MockInstanceRepo) ListByVPC(ctx context.Context, vpcID uuid.UUID) ([]*domain.Instance, error) {
+	args := m.Called(ctx, vpcID)
+	if args.Get(0) == nil {
+		return nil, args.Error(1)
+	}
+	return args.Get(0).([]*domain.Instance), args.Error(1)
+}

 // MockClusterRepo
 type MockClusterRepo struct{ mock.Mock }
```
Then remove the duplicate definition at lines 900-906.
internal/core/services/cluster_lifecycle_test.go (2)
44-48: Avoid `time.Sleep` for async test synchronization. Using `time.Sleep(100 * time.Millisecond)` creates flaky tests that may pass or fail depending on system load. If the async operation takes longer than 100ms, assertions will fail incorrectly.
Consider using a synchronization mechanism such as:
- A channel/waitgroup signaled by the async operation
- Polling with a timeout using `assert.Eventually`
- Injecting a mock that completes synchronously in tests
♻️ Example using assert.Eventually
```go
// Instead of time.Sleep, poll for completion:
assert.Eventually(t, func() bool {
	provisioner.AssertExpectations(t)
	repo.AssertExpectations(t)
	taskQueue.AssertExpectations(t)
	return true
}, time.Second, 10*time.Millisecond)
```
Alternatively, consider making the upgrade operation return a completion channel or use dependency injection to make async behavior synchronous in tests.
51-88: Subtests mutate shared state, creating implicit dependencies. All subtests share and mutate the same `cluster` pointer. While this works because subtests run sequentially, the pattern is fragile:
- Test isolation is compromised
- Adding `t.Parallel()` in the future would cause data races
- Test failures may cascade unpredictably
Consider creating fresh cluster instances within each subtest for better isolation.
♻️ Example of isolated subtest setup
t.Run("fails if not running", func(t *testing.T) { + cluster := &domain.Cluster{ + ID: id, + Status: domain.ClusterStatusPending, + Version: testK8sBase, + UserID: uuid.New(), + } - cluster.Status = domain.ClusterStatusPending repo.On("GetByID", mock.Anything, id).Return(cluster, nil).Once() err := svc.UpgradeCluster(ctx, id, testK8sVersion) assert.Error(t, err) assert.Contains(t, err.Error(), "must be in running state") })
```go
// NewClusterService creates a new ClusterService with the provided parameters.
func NewClusterService(params ClusterServiceParams) *ClusterService {
	if params.TaskQueue == nil {
		panic("taskQueue cannot be nil")
	}
```
Avoid panic in NewClusterService
A panic here will crash the process on misconfiguration. Prefer returning an error from the constructor or validating at composition time. As per coding guidelines, do not panic in production code.
🛠️ Proposed fix (change constructor signature)
```diff
-func NewClusterService(params ClusterServiceParams) *ClusterService {
-	if params.TaskQueue == nil {
-		panic("taskQueue cannot be nil")
-	}
-	return &ClusterService{
+func NewClusterService(params ClusterServiceParams) (*ClusterService, error) {
+	if params.TaskQueue == nil {
+		return nil, errors.New(errors.Internal, "task queue not initialized")
+	}
+	return &ClusterService{
 		repo:        params.Repo,
 		provisioner: params.Provisioner,
 		vpcSvc:      params.VpcSvc,
 		instanceSvc: params.InstanceSvc,
 		secretSvc:   params.SecretSvc,
 		taskQueue:   params.TaskQueue,
 		logger:      params.Logger,
-	}
+	}, nil
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// NewClusterService creates a new ClusterService with the provided parameters.
func NewClusterService(params ClusterServiceParams) (*ClusterService, error) {
	if params.TaskQueue == nil {
		return nil, errors.New(errors.Internal, "task queue not initialized")
	}
	return &ClusterService{
		repo:        params.Repo,
		provisioner: params.Provisioner,
		vpcSvc:      params.VpcSvc,
		instanceSvc: params.InstanceSvc,
		secretSvc:   params.SecretSvc,
		taskQueue:   params.TaskQueue,
		logger:      params.Logger,
	}, nil
}
```
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 39 - 43, Change
NewClusterService to avoid panicking on nil TaskQueue: update the constructor
signature to return (*ClusterService, error) and validate params.TaskQueue,
returning a descriptive error when it's nil instead of calling panic. Ensure
callers are updated to handle the error; keep the ClusterService and
ClusterServiceParams types unchanged and perform the nil check inside
NewClusterService (or a helper like validateClusterServiceParams) so
construction fails cleanly with an error rather than crashing the process.
…ving duplicate literals, and standardizing test naming conventions
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/multitenancy_e2e_test.go (1)
184-188: Don't silently ignore delete errors; use `t` and close the body. You've blanked out `*testing.T` and still discard the response/error. Use `t` to surface failures (or at least log them) and close the response body to avoid leaks. As per coding guidelines, avoid silent failures.
🔧 Suggested fix
```diff
-func deleteInstance(_ *testing.T, client *http.Client, token, id string) {
+func deleteInstance(t *testing.T, client *http.Client, token, id string) {
 	req, _ := http.NewRequest("DELETE", fmt.Sprintf(instancesPathFmt, testutil.TestBaseURL, id), nil)
 	req.Header.Set(testutil.TestHeaderAPIKey, token)
-	_, _ = client.Do(req)
+	resp, err := client.Do(req)
+	require.NoError(t, err)
+	defer func() { _ = resp.Body.Close() }()
+	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNoContent {
+		body, _ := io.ReadAll(resp.Body)
+		t.Fatalf("Delete instance failed: status %d body %s", resp.StatusCode, body)
+	}
 }
```
internal/core/services/instance_test.go (2)
427-436: Remove or complete this incomplete test. This test is incomplete and contains dead code. The mock expectation on line 432 is never triggered, and the inline comments acknowledge the test cannot work as written because `getVolumeByIDOrName` is unexported.
- Remove the test entirely if the functionality is already covered by
TestInstanceServiceLaunchWithVolumes- Complete the test by exercising the volume resolution through the public
LaunchInstanceAPI🗑️ Suggested removal
-func TestInstanceServiceGetVolumeByIDOrName(t *testing.T) { - _, _, _, volumeRepo, _, _, _, _, _, _ := setupInstanceServiceTest(t) - volID := uuid.New() - - // Test by ID - volumeRepo.On("GetByID", mock.Anything, volID).Return(&domain.Volume{ID: volID}, nil).Once() - // Wait, getVolumeByIDOrName is unexported. I can't call it directly. - // But LaunchInstance/TerminateInstance/etc might call it? - // Actually, only LaunchInstance calls resolveVolumes which calls it. -}
285-345: Add error path tests for the Provision method. The Provision method handles multiple error scenarios—networking setup failures (line 136), volume resolution failures (line 143), container creation failures (line 151), and finalization failures (line 168)—but only the success path is tested. Add tests covering these error scenarios:
- `TestProvisionGetInstanceFailure` — repo.GetByID error
- `TestProvisionNetworkingFailure` — provisionNetwork error
- `TestProvisionVolumeResolutionFailure` — resolveVolumes error
- `TestProvisionCreateInstanceFailure` — compute.CreateInstance error
- `TestProvisionFinalizeFailure` — finalizeProvision error

Each should verify that `updateStatus` is called with `domain.StatusError` and appropriate errors are propagated. Per coding guidelines, do not skip tests for services.
cmd/cloud/cli_commands_test.go (1)
541-552: Don’t ignore flag-setting errors in tests.
Flag configuration can fail and should fail the test when it does. As per coding guidelines, avoid silent failures.
Suggested fix
```diff
-	_ = createCacheCmd.Flags().Set("name", testRedisMod)
-	_ = createCacheCmd.Flags().Set("version", "7.2")
-	_ = createCacheCmd.Flags().Set("memory", "128")
-	_ = createCacheCmd.Flags().Set("vpc", "vpc-1")
-	_ = createCacheCmd.Flags().Set("wait", "false")
+	if err := createCacheCmd.Flags().Set("name", testRedisMod); err != nil {
+		t.Fatalf("set --name: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("version", "7.2"); err != nil {
+		t.Fatalf("set --version: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("memory", "128"); err != nil {
+		t.Fatalf("set --memory: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("vpc", "vpc-1"); err != nil {
+		t.Fatalf("set --vpc: %v", err)
+	}
+	if err := createCacheCmd.Flags().Set("wait", "false"); err != nil {
+		t.Fatalf("set --wait: %v", err)
+	}
```
🤖 Fix all issues with AI agents
In `@cmd/cloud/cli_commands_test.go`:
- Around line 331-338: The mock subscription response uses the string literal
"testTopicID" instead of the testTopicID constant, causing inconsistency; update
the sdk.Subscription object in the HTTP handler (the case for POST to
pathNotify+testTopicID+"/subscriptions") to set TopicID to the testTopicID
constant (not the quoted string) so the mock payload matches the test constant.
- Around line 33-58: The test HTTP handlers use json.NewEncoder(w).Encode(resp)
and discard the returned error; update each encoder call (both the list response
and the single security group response in the mocked handlers) to capture the
error, and if non-nil write an internal server error (e.g., set
w.WriteHeader(http.StatusInternalServerError) or use http.Error) and log or fail
the test accordingly instead of using the blank identifier; specifically modify
the json.NewEncoder(w).Encode(resp) calls so they check err :=
json.NewEncoder(w).Encode(resp) and handle err != nil by returning a 500
response.
In `@internal/repositories/docker/loadbalancer.go`:
- Around line 24-28: Introduce named permission constants (e.g., dirPerm and
filePerm) alongside NginxImage and nginxConf and replace all hard-coded
0755/0644 usages with these constants; update calls in functions that create
directories and write files such as MkdirAll and WriteFile and also in
UpdateProxyConfig to use dirPerm for directory creation and filePerm for file
writes to remove magic numbers and centralize permission values.
In `@internal/repositories/k8s/bootstrap.go`:
- Around line 49-71: The waitForIP function currently ignores context
cancellation and can keep polling until its 5-minute timeout; update the polling
select in KubeadmProvisioner.waitForIP to include a case for <-ctx.Done() so the
function returns immediately (empty string) when the parent context is
cancelled, ensuring any resources (ticker already deferred) are cleaned up; keep
the instSvc.GetInstance(ctx, ...) usage as-is so the request itself still
respects ctx.
- Around line 73-95: The polling loop in waitForAPIServer uses time.Sleep and
ignores context cancellation; change it to use a ticker/select that listens for
ctx.Done() so the function can abort immediately on cancellation. In function
waitForAPIServer (which calls p.getExecutor and exec.Run), replace the for
loop/time.Sleep with a time.NewTicker(10*time.Second) and loop selecting on
ticker.C and ctx.Done(); on ticker.C call exec.Run and handle success/failure as
before, on ctx.Done() stop the ticker and return ctx.Err() (or wrap it) and
ensure the ticker is stopped before returning.
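For the two bootstrap.go items directly above, a minimal sketch of a ticker-based poll that honors cancellation; the probe command, the separate timeout parameter, and the exact return values are illustrative:

```go
// waitForAPIServer polls the API server until it answers, the timeout elapses,
// or the parent context is cancelled.
func (p *KubeadmProvisioner) waitForAPIServer(ctx context.Context, exec NodeExecutor, timeout time.Duration) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	deadline := time.After(timeout)

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for API server cancelled: %w", ctx.Err())
		case <-deadline:
			return fmt.Errorf("timed out waiting for API server")
		case <-ticker.C:
			// The probe command is illustrative; the real implementation keeps its existing check.
			if _, err := exec.Run(ctx, "kubectl get --raw=/healthz"); err == nil {
				return nil
			}
		}
	}
}
```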
In `@internal/repositories/k8s/health.go`:
- Around line 115-119: In the error-handling block that runs after
p.bootstrapNode returns an error, stop discarding errors with `_ =` and instead
capture and log the results of p.instSvc.TerminateInstance(ctx,
worker.ID.String()) and p.repo.DeleteNode(ctx, worker.ID) so cleanup failures
are recorded; specifically, update the block handling the bootstrap error (the
if that calls p.bootstrapNode) to call TerminateInstance and DeleteNode, check
their returned errors, and send those errors to the logger (or wrap them) with
clear context mentioning worker.ID/workerName and ip rather than using blank
identifier assignments.
♻️ Duplicate comments (2)
internal/repositories/k8s/lifecycle.go (1)
199-216: Memory buffering of large snapshots remains a concern. The entire etcd snapshot is base64-encoded, transferred through SSH, and buffered in memory before upload. For production clusters with multi-GB snapshots, this risks memory exhaustion or SSH message limits. The code comment in `Restore` (lines 257-258) acknowledges this limitation for KIND clusters.
Consider implementing presigned URL uploads or streaming in a future iteration if production-scale clusters are targeted.
internal/repositories/k8s/health.go (1)
198-210: Empty output results in phantom node count. When `kubectl get nodes --no-headers` returns an empty string (e.g., no nodes registered yet), `strings.Split("", "\n")` returns a slice with one empty string, causing `NodesTotal` to be incorrectly set to 1.
Proposed fix
```diff
 // 2. Node Counts
 nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers")
 if err == nil {
+	trimmed := strings.TrimSpace(nodesOut)
+	if trimmed == "" {
+		health.NodesTotal = 0
+		health.NodesReady = 0
+	} else {
-	lines := strings.Split(strings.TrimSpace(nodesOut), "\n")
-	health.NodesTotal = len(lines)
-	readyCount := 0
-	for _, line := range lines {
-		if strings.Contains(line, " Ready ") {
-			readyCount++
+		lines := strings.Split(trimmed, "\n")
+		health.NodesTotal = len(lines)
+		readyCount := 0
+		for _, line := range lines {
+			if strings.Contains(line, " Ready ") {
+				readyCount++
+			}
 		}
+		health.NodesReady = readyCount
 	}
-	health.NodesReady = readyCount
 }
```
🧹 Nitpick comments (12)
internal/core/services/queue_test.go (1)
105-111: Good addition for error path coverage. The new "already exists" test case properly validates the duplicate name check. For consistency with the "success" subtest (line 96), consider adding `repo.AssertExpectations(t)` to verify the mock was called as expected.
Suggested improvement
```diff
 t.Run("already exists", func(t *testing.T) {
 	repo.On("GetByName", mock.Anything, "existing", userID).Return(&domain.Queue{ID: uuid.New()}, nil).Once()
 	_, err := svc.CreateQueue(ctx, "existing", nil)
 	assert.Error(t, err)
 	assert.Contains(t, err.Error(), "already exists")
+	repo.AssertExpectations(t)
 })
```
internal/repositories/k8s/lifecycle.go (3)
85-103: Consider extracting the package version suffix as a constant. The hardcoded `-1.1` suffix in the package names (e.g., `kubeadm=%s-1.1`) is a Debian/Ubuntu-specific convention that may vary across distributions or Kubernetes releases. Consider defining this as a named constant or making it configurable.
Suggested improvement
```diff
+const kubePackageSuffix = "-1.1" // Debian/Ubuntu package version suffix

 func (p *KubeadmProvisioner) upgradeControlPlane(ctx context.Context, cluster *domain.Cluster, ip, version string) error {
 	// ...
 	upgradeScript := fmt.Sprintf(`
 set -e
 # 1. Update kubeadm
 apt-mark unhold kubeadm
-apt-get update && apt-get install -y kubeadm=%s-1.1
+apt-get update && apt-get install -y kubeadm=%s%s
 apt-mark hold kubeadm
 // ...
-`, ver, version, ver, ver)
+`, ver, kubePackageSuffix, version, ver, kubePackageSuffix, ver, kubePackageSuffix)
```
167-168: Remove duplicate comment. The comment `// 3. Encrypt and store the new kubeconfig` appears twice on consecutive lines.
Fix
```diff
 // 3. Encrypt and store the new kubeconfig
-// 3. Encrypt and store the new kubeconfig
 encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
```
218-219: Remove duplicate comment. The comment `// Cleanup temp file` appears twice on consecutive lines.
Fix
```diff
 // Cleanup temp file
-// Cleanup temp file
 if _, err := exec.Run(ctx, "rm "+tempPath); err != nil {
```
internal/repositories/k8s/bootstrap.go (1)
97-97: Unused parameters in function signature. The `version` and `isMaster` parameters are unused (marked with `_`). If these are intentionally reserved for future use, consider adding a comment explaining the intent. Otherwise, remove them to avoid confusion.
Options
Option 1: Document intent
```go
// bootstrapNode prepares a node for Kubernetes.
// version and isMaster are reserved for future differentiated bootstrap logic.
func (p *KubeadmProvisioner) bootstrapNode(ctx context.Context, cluster *domain.Cluster, ip, _ string, _ bool) error {
```
Option 2: Remove unused parameters (if callers can be updated)
```go
func (p *KubeadmProvisioner) bootstrapNode(ctx context.Context, cluster *domain.Cluster, ip string) error {
```
cmd/cloud/cli_commands_test.go (2)
15-23: Prefer shared `testutil` IDs to avoid drift.
Since `pkg/testutil` now exports `TestCacheID`, `TestQueueID`, and `TestTopicID`, consider reusing them here to keep test fixtures centralized.
384-425: Consider table-driven tests to reduce repetition.
Many tests follow the same setup/execute/assert pattern; consolidating into table-driven cases would reduce duplication and align with test guidelines. As per coding guidelines, prefer table-driven tests in *_test.go.
internal/repositories/k8s/kubeadm_provisioner_test.go (5)
17-48: Guard mocked instance returns against nil panics.
`LaunchInstance`/`GetInstance` cast `args.Get(0)` directly; returning nil with an error will panic and makes error-path tests awkward. Consider a nil-safe assertion.
♻️ Suggested change
```diff
 func (m *MockInstanceService) LaunchInstance(ctx context.Context, name, image, ports string, vpcID, subnetID *uuid.UUID, volumes []domain.VolumeAttachment) (*domain.Instance, error) {
 	args := m.Called(ctx, name, image, ports, vpcID, subnetID, volumes)
-	return args.Get(0).(*domain.Instance), args.Error(1)
+	inst, _ := args.Get(0).(*domain.Instance)
+	return inst, args.Error(1)
 }

 func (m *MockInstanceService) GetInstance(ctx context.Context, id string) (*domain.Instance, error) {
 	args := m.Called(ctx, id)
-	return args.Get(0).(*domain.Instance), args.Error(1)
+	inst, _ := args.Get(0).(*domain.Instance)
+	return inst, args.Error(1)
 }
```
50-75: Make `GetNodes` mock nil-safe for error-path tests.
`args.Get(0)` will panic on a nil return. Guard it to allow `nil` slices alongside an error.
♻️ Suggested change
```diff
 func (m *MockClusterRepo) GetNodes(ctx context.Context, clusterID uuid.UUID) ([]*domain.ClusterNode, error) {
 	args := m.Called(ctx, clusterID)
-	return args.Get(0).([]*domain.ClusterNode), args.Error(1)
+	nodes, _ := args.Get(0).([]*domain.ClusterNode)
+	return nodes, args.Error(1)
 }
```
141-165: Nil‑safe LoadBalancer mock returns for error‑path tests.
`Create`/`Get` type assertions will panic on nil returns. Guard them so error-path tests can return nil safely.
♻️ Suggested change
```diff
 func (m *MockLBService) Create(ctx context.Context, name string, vpcID uuid.UUID, port int, algo string, idempotencyKey string) (*domain.LoadBalancer, error) {
 	args := m.Called(ctx, name, vpcID, port, algo, idempotencyKey)
-	return args.Get(0).(*domain.LoadBalancer), args.Error(1)
+	lb, _ := args.Get(0).(*domain.LoadBalancer)
+	return lb, args.Error(1)
 }

 func (m *MockLBService) Get(ctx context.Context, id uuid.UUID) (*domain.LoadBalancer, error) {
 	args := m.Called(ctx, id)
-	return args.Get(0).(*domain.LoadBalancer), args.Error(1)
+	lb, _ := args.Get(0).(*domain.LoadBalancer)
+	return lb, args.Error(1)
 }
```
167-193: Assert mock expectations; consider table-driven cases. The test only checks for no error; add expectation assertions so regressions in the deprovision flow are caught. Also consider a table-driven structure as scenarios grow. As per coding guidelines, prefer table-driven tests.
🔧 Suggested change
```diff
 err := p.Deprovision(ctx, cluster)
 assert.NoError(t, err)
+instSvc.AssertExpectations(t)
+repo.AssertExpectations(t)
```
195-249: Replace magic numbers with named constants. Ports/counts/weights and IP literals are repeated; named constants improve readability and simplify updates. As per coding guidelines, avoid magic numbers.
♻️ Suggested change
```diff
 cluster := &domain.Cluster{
 	ID:        uuid.New(),
 	Name:      "ha-cluster",
 	VpcID:     uuid.New(),
 	Version:   "v1.29.0",
 	HAEnabled: true,
 }
+const (
+	apiServerPort = 6443
+	masterCount   = 3
+	lbWeight      = 10
+	lbIP          = "10.0.0.100"
+	nodeIPBase    = "10.0.0.%d"
+	nodeIPStart   = 10
+)

 repo.On("Update", mock.Anything, mock.Anything).Return(nil).Maybe()
 repo.On("AddNode", mock.Anything, mock.Anything).Return(nil).Maybe()
 repo.On("UpdateNode", mock.Anything, mock.Anything).Return(nil).Maybe()

 // 1. LB Creation
 lbID := uuid.New()
-lb := &domain.LoadBalancer{ID: lbID, IP: "10.0.0.100"}
-lbSvc.On("Create", mock.Anything, mock.Anything, cluster.VpcID, 6443, "round-robin", mock.Anything).Return(lb, nil)
+lb := &domain.LoadBalancer{ID: lbID, IP: lbIP}
+lbSvc.On("Create", mock.Anything, mock.Anything, cluster.VpcID, apiServerPort, "round-robin", mock.Anything).Return(lb, nil)

 // 2. 3 Master Nodes Creation
 var allNodes []*domain.ClusterNode
-for i := 0; i < 3; i++ {
+for i := 0; i < masterCount; i++ {
 	nodeID := uuid.New()
-	instance := &domain.Instance{ID: nodeID, PrivateIP: fmt.Sprintf("10.0.0.%d", 10+i)}
+	instance := &domain.Instance{ID: nodeID, PrivateIP: fmt.Sprintf(nodeIPBase, nodeIPStart+i)}
 	allNodes = append(allNodes, &domain.ClusterNode{
 		ID:         uuid.New(),
 		ClusterID:  cluster.ID,
 		InstanceID: nodeID,
 		Role:       domain.NodeRoleControlPlane,
 	})
 	instSvc.On("LaunchInstance", mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(instance, nil).Once()
 	instSvc.On("GetInstance", mock.Anything, nodeID.String()).Return(instance, nil).Maybe()
-	lbSvc.On("AddTarget", mock.Anything, lbID, nodeID, 6443, 10).Return(nil)
+	lbSvc.On("AddTarget", mock.Anything, lbID, nodeID, apiServerPort, lbWeight).Return(nil)
 }
 repo.On("GetNodes", mock.Anything, cluster.ID).Return(allNodes, nil).Maybe()

 // General Exec matches
-instSvc.On("Exec", mock.Anything, mock.Anything, mock.Anything).Return("kubeadm join 10.0.0.100:6443 --token abc --control-plane --certificate-key xyz", nil).Maybe()
+joinCmd := fmt.Sprintf("kubeadm join %s:%d --token abc --control-plane --certificate-key xyz", lbIP, apiServerPort)
+instSvc.On("Exec", mock.Anything, mock.Anything, mock.Anything).Return(joinCmd, nil).Maybe()
```
```diff
 resp := sdk.Response[[]sdk.SecurityGroup]{
-	Data: []sdk.SecurityGroup{{ID: testSGID, VPCID: testVPCID, Name: "default", Description: "default group", ARN: "arn:sg:1", CreatedAt: time.Now().UTC()}},
 }
 _ = json.NewEncoder(w).Encode(resp)
 return true
-case method == http.MethodPost && path == "/security-groups":
 resp := sdk.Response[sdk.SecurityGroup]{
-	Data: sdk.SecurityGroup{ID: "sg-2", VPCID: "vpc-2", Name: "web", Description: "web group", ARN: "arn:sg:2", CreatedAt: time.Now().UTC()},
+	Data: []sdk.SecurityGroup{
+		{
+			ID:          "sg-1",
+			VPCID:       "vpc-1",
+			Name:        "default",
+			Description: "default group",
+			ARN:         "arn:sg:1",
+			CreatedAt:   time.Now().UTC(),
+		},
+	},
 }
 _ = json.NewEncoder(w).Encode(resp)
 return true
-case method == http.MethodGet && path == pathSG+testSGID:
+case r.Method == http.MethodPost && r.URL.Path == "/security-groups":
 	resp := sdk.Response[sdk.SecurityGroup]{
-		Data: sdk.SecurityGroup{ID: testSGID, VPCID: testVPCID, Name: "default", Description: "default group", ARN: "arn:sg:1", CreatedAt: time.Now().UTC()},
+		Data: sdk.SecurityGroup{
+			ID:          "sg-2",
+			VPCID:       "vpc-2",
+			Name:        "web",
+			Description: "web group",
+			ARN:         "arn:sg:2",
+			CreatedAt:   time.Now().UTC(),
+		},
 	}
 	_ = json.NewEncoder(w).Encode(resp)
```
There was a problem hiding this comment.
Handle JSON encode errors instead of discarding them.
Ignoring encoder errors can mask test failures and violates the “no silent failures” guideline. Consider returning a 500 when encoding fails. As per coding guidelines, avoid blank-identifier error drops.
Suggested fix (apply across handlers)
- _ = json.NewEncoder(w).Encode(resp)
- return true
+ if err := json.NewEncoder(w).Encode(resp); err != nil {
+ http.Error(w, err.Error(), http.StatusInternalServerError)
+ return true
+ }
+ return true
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| resp := sdk.Response[[]sdk.SecurityGroup]{ | |
| Data: []sdk.SecurityGroup{ | |
| { | |
| ID: "sg-1", | |
| VPCID: "vpc-1", | |
| Name: "default", | |
| Description: "default group", | |
| ARN: "arn:sg:1", | |
| CreatedAt: time.Now().UTC(), | |
| }, | |
| }, | |
| } | |
| if err := json.NewEncoder(w).Encode(resp); err != nil { | |
| http.Error(w, err.Error(), http.StatusInternalServerError) | |
| return true | |
| } | |
| return true | |
| case r.Method == http.MethodPost && r.URL.Path == "/security-groups": | |
| resp := sdk.Response[sdk.SecurityGroup]{ | |
| Data: sdk.SecurityGroup{ | |
| ID: "sg-2", | |
| VPCID: "vpc-2", | |
| Name: "web", | |
| Description: "web group", | |
| ARN: "arn:sg:2", | |
| CreatedAt: time.Now().UTC(), | |
| }, | |
| } | |
| if err := json.NewEncoder(w).Encode(resp); err != nil { | |
| http.Error(w, err.Error(), http.StatusInternalServerError) | |
| return true | |
| } | |
| return true |
🤖 Prompt for AI Agents
In `@cmd/cloud/cli_commands_test.go` around lines 33 - 58, The test HTTP handlers
use json.NewEncoder(w).Encode(resp) and discard the returned error; update each
encoder call (both the list response and the single security group response in
the mocked handlers) to capture the error, and if non-nil write an internal
server error (e.g., set w.WriteHeader(http.StatusInternalServerError) or use
http.Error) and log or fail the test accordingly instead of using the blank
identifier; specifically modify the json.NewEncoder(w).Encode(resp) calls so
they check err := json.NewEncoder(w).Encode(resp) and handle err != nil by
returning a 500 response.
- Refactored cmd/api to use AppDeps instead of globals for better testability.
- Updated Cluster domain to safely handle sensitive fields (SSHKey, Kubeconfig) with json:"-".
- Enhanced ClusterService with secure constructor, SSH key encryption, and robust error handling.
- Improved KubeadmProvisioner (ha.go) with retry loops for LB IP and strict join command validation.
- Implemented bounded concurrency and polling backoff in ClusterWorker (see the sketch below).
- Fixed CLI command tests (topic ID consistency, JSON error handling).
- Introduced named permissions constants in docker/loadbalancer.go.
- Updated k8s bootstrap and health checks to respect context cancellation and log cleanup errors.
- Updated Swagger documentation.
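A rough sketch of the bounded-concurrency/backoff pattern referenced above — the worker, queue, and logger field names here are assumptions for illustration, not the actual ClusterWorker implementation:

```go
// Illustrative worker loop: a buffered channel bounds in-flight jobs, and the
// polling interval backs off while the queue is empty or erroring.
func (w *clusterWorker) Run(ctx context.Context) {
	sem := make(chan struct{}, w.maxConcurrent) // assumed concurrency-limit field
	backoff := time.Second

	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		job, err := w.queue.Dequeue(ctx, "k8s_jobs") // assumed queue method
		if err != nil || job == nil {
			time.Sleep(backoff) // back off while idle or on transient errors
			if backoff < 30*time.Second {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second // reset once work is flowing again

		sem <- struct{}{} // acquire a slot; blocks when maxConcurrent jobs are running
		go func(j *domain.ClusterJob) {
			defer func() { <-sem }() // release the slot
			if err := w.handleJob(ctx, j); err != nil {
				w.logger.Error("cluster job failed", "type", j.Type, "cluster_id", j.ClusterID, "error", err)
			}
		}(job)
	}
}
```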
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 78 out of 79 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Cleanup temp file | ||
| // Cleanup temp file |
There was a problem hiding this comment.
The comment on line 218 is duplicated on line 219. Remove the duplicate comment.
| // 3. Encrypt and store the new kubeconfig | ||
| // 3. Encrypt and store the new kubeconfig |
There was a problem hiding this comment.
The comment on line 167 is duplicated on line 168. Remove the duplicate comment.
| - K8s | ||
| /clusters/{id}/kubeconfig: | ||
| get: | ||
| description: Returns the kubeconfig for clinical access to the cluster |
There was a problem hiding this comment.
The same typo "clinical access" appears in the swagger documentation. This should be "client access" or "cluster access".
| description: Returns the kubeconfig for clinical access to the cluster | |
| description: Returns the kubeconfig for client access to the cluster |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@internal/core/services/cluster_test.go`:
- Around line 40-99: Convert the two standalone tests TestClusterServiceCreate
and TestClusterServiceDelete into a single table-driven test (or separate
table-driven tests) that iterates over test cases and uses t.Run for subtests;
for each case call setupClusterServiceTest to initialize mocks, set per-case
expectations on repo.On, secretSvc.On, taskQueue.On (e.g., taskQueue.Enqueue
assertions), invoke svc.CreateCluster or svc.DeleteCluster as appropriate, and
assert results and mock expectations; keep case metadata (name, inputs, expected
error/cluster/status, mock setup function) in a slice and ensure each subtest
resets or recreates mocks so expectations (repo.On, secretSvc.On, taskQueue.On)
are scoped to that case.
In `@internal/core/services/cluster.go`:
- Around line 273-293: When enqueueing the upgrade job, if taskQueue is nil or
taskQueue.Enqueue returns an error, persist a recovery status on the cluster
instead of leaving it in ClusterStatusUpgrading; after setting cluster.Status =
domain.ClusterStatusUpgrading and calling s.repo.Update(ctx, cluster) you must
detect enqueue failures and update the cluster status back to
domain.ClusterStatusRunning (or domain.ClusterStatusFailed) and UpdatedAt,
calling s.repo.Update(ctx, cluster) again (mirror the CreateCluster failure
handling pattern). Specifically handle the case where s.taskQueue == nil and
when s.taskQueue.Enqueue(ctx, "k8s_jobs", job) returns an error by wrapping the
error for return but first setting cluster.Status to the appropriate
non-upgrading state and persisting via s.repo.Update(ctx, cluster) so the
cluster is not left stuck in upgrading.
- Around line 138-153: When Enqueue fails after setting cluster.Status =
domain.ClusterStatusDeleting and persisting via s.repo.Update, revert the
cluster status (e.g., back to previous status or to a new failure state) and
persist that change so the cluster does not remain stuck in "deleting";
specifically, catch the error from s.taskQueue.Enqueue(ctx, "k8s_jobs", job) and
call s.repo.Update(ctx, cluster) after setting cluster.Status to the original
value or a "delete_failed" status, and return the wrapped error; follow the same
pattern used in CreateCluster to ensure the failure is recorded and persisted
for recovery/visibility.
In `@internal/repositories/k8s/ha.go`:
- Around line 139-143: The kubeadm init invocation builds initCmd with
--ignore-preflight-errors=all which silences all safety checks; change the logic
that constructs initCmd in ha.go (the initCmd variable used with exec.Run) to
specify a comma-separated list of named preflight errors to ignore instead of
"all" (e.g., IsPrivilegedUser,Swap or other specific checks discovered via
`kubeadm init phase preflight --help`); implement this by replacing the literal
string "--ignore-preflight-errors=all" with a variable or function that returns
the allowed names, or make it configurable on the cluster object, and update the
initCmd construction so the chosen names are inserted (preserving the existing
flags: --kubernetes-version, --pod-network-cidr, --control-plane-endpoint,
--upload-certs) before calling exec.Run.
In `@internal/repositories/k8s/health.go`:
- Around line 115-123: The TerminateInstance call in the bootstrap error handler
is using worker.ID (node record UUID) instead of the compute instance UUID;
update the call in the p.bootstrapNode error branch to pass
worker.InstanceID.String() to p.instSvc.TerminateInstance so the actual instance
is terminated (mirror what scaleDown uses with node.InstanceID.String()); ensure
you reference worker.InstanceID and convert to String() as in scaleDown.
♻️ Duplicate comments (5)
internal/repositories/k8s/health.go (1)
204-214: Handle empty `kubectl get nodes` output to avoid phantom node count. When `nodesOut` is empty, `strings.Split("", "\n")` yields `[""]` (a slice with one empty string), so `NodesTotal` becomes 1 instead of 0.
Proposed fix
nodesOut, err := exec.Run(ctx, kubectlBase+" get nodes --no-headers") if err == nil { - lines := strings.Split(strings.TrimSpace(nodesOut), "\n") - health.NodesTotal = len(lines) + trimmed := strings.TrimSpace(nodesOut) + if trimmed != "" { + lines := strings.Split(trimmed, "\n") + health.NodesTotal = len(lines) + readyCount := 0 + for _, line := range lines { + if strings.Contains(line, " Ready ") { + readyCount++ + } + } + health.NodesReady = readyCount + } - readyCount := 0 - for _, line := range lines { - if strings.Contains(line, " Ready ") { - readyCount++ - } - } - health.NodesReady = readyCount }
cmd/cloud/cli_commands_test.go (1)
30-59: Handle JSON encoding errors instead of discarding them.
Several handlers still ignore `Encode` errors, which can mask test failures. Consider a small helper to encode and return 500 on failure, then reuse it across handlers. As per coding guidelines, avoid silent failures.
✅ Suggested helper (apply across handlers)
func respondNoContent(w http.ResponseWriter) bool { w.WriteHeader(http.StatusNoContent) return true } + +func respondJSON(w http.ResponseWriter, payload any) bool { + if err := json.NewEncoder(w).Encode(payload); err != nil { + http.Error(w, err.Error(), http.StatusInternalServerError) + } + return true +}- _ = json.NewEncoder(w).Encode(resp) - return true + return respondJSON(w, resp)Also applies to: 64-127, 136-167, 174-246, 251-304, 313-343
docs/swagger/docs.go (1)
966-1299: Fix typo in kubeconfig description. Line 1125 says “clinical access”; it should be “client access” (or similar). Since this file is generated, update the source annotation and regenerate.
✏️ Suggested fix
- "description": "Returns the kubeconfig for clinical access to the cluster", + "description": "Returns the kubeconfig for client access to the cluster",
docs/swagger/swagger.yaml (1)
1848-2067: Fix typo in kubeconfig description. Line 1953 says “clinical access”; it should be “client access” (or similar).
✏️ Suggested fix
- description: Returns the kubeconfig for clinical access to the cluster + description: Returns the kubeconfig for client access to the cluster
docs/swagger/swagger.json (1)
1112-1115: Fix typo in kubeconfig description.
“clinical access” → “client access” (or “cluster access”).
🧹 Nitpick comments (11)
internal/repositories/k8s/kubeadm_provisioner_test.go (4)
19-26: Nil pointer dereference risk on type assertion. When `args.Get(0)` returns nil (e.g., in error cases), the type assertion `.(*domain.Instance)` will panic. Use a safe pattern:
♻️ Suggested safe nil handling
func (m *MockInstanceService) LaunchInstance(ctx context.Context, name, image, ports string, vpcID, subnetID *uuid.UUID, volumes []domain.VolumeAttachment) (*domain.Instance, error) { args := m.Called(ctx, name, image, ports, vpcID, subnetID, volumes) - return args.Get(0).(*domain.Instance), args.Error(1) + if v := args.Get(0); v != nil { + return v.(*domain.Instance), args.Error(1) + } + return nil, args.Error(1) } func (m *MockInstanceService) GetInstance(ctx context.Context, id string) (*domain.Instance, error) { args := m.Called(ctx, id) - return args.Get(0).(*domain.Instance), args.Error(1) + if v := args.Get(0); v != nil { + return v.(*domain.Instance), args.Error(1) + } + return nil, args.Error(1) }The same pattern should be applied to
`MockClusterRepo.GetNodes` (line 68), `MockLBService.Create` (line 145), and `MockLBService.Get` (line 149).
30-48: Mock methods silently return without recording calls. Methods like `StopInstance`, `ListInstances`, `GetInstanceLogs`, etc. don't use `m.Called()`. If production code unexpectedly calls these, tests won't detect it. If these methods should never be called during the tested flows, consider either:
- Using `m.Called()` and not setting up expectations (testify will fail on unexpected calls)
- Adding `t.Fatal("unexpected call")` to detect unintended usage
func (m *MockInstanceService) StopInstance(ctx context.Context, id string) error { - return nil + return m.Called(ctx, id).Error(0) }
167-193: Add mock assertion verification. The test doesn't verify that expected mock calls were actually made. Without `AssertExpectations`, if `Deprovision` silently skips calling `TerminateInstance` or `DeleteNode`, the test would still pass.
♻️ Add assertions
err := p.Deprovision(ctx, cluster) assert.NoError(t, err) + repo.AssertExpectations(t) + instSvc.AssertExpectations(t) }
195-254: HA provisioning test uses overly permissive matchers and lacks assertion verification. Several concerns:
- Excessive `.Maybe()` (lines 215-217, 238, 241, 249): these make calls optional, so the test passes even if critical operations like `AddNode` or `Update` aren't called.
- Missing `AssertExpectations`: no verification that LB creation, node launches, and target registrations occurred.
- Overly broad `mock.Anything` on `LaunchInstance` (line 237): consider asserting at least the VPC ID matches `cluster.VpcID` to catch routing bugs.
♻️ Strengthen test assertions
- repo.On("Update", mock.Anything, mock.Anything).Return(nil).Maybe() - repo.On("AddNode", mock.Anything, mock.Anything).Return(nil).Maybe() + repo.On("Update", mock.Anything, mock.Anything).Return(nil) + repo.On("AddNode", mock.Anything, mock.Anything).Return(nil) err := p.Provision(ctx, cluster) assert.NoError(t, err) + assert.NotNil(t, cluster.APIServerLBAddress, "LB address should be set") + lbSvc.AssertExpectations(t) + instSvc.AssertExpectations(t) + repo.AssertExpectations(t) }internal/repositories/k8s/health.go (1)
139-163: LGTM with optional suggestion. Error aggregation and logging correctly address previous feedback. Consider skipping `DeleteNode` if `TerminateInstance` fails to avoid orphaning instances (the node record serves as a reference for retry/cleanup).
internal/repositories/k8s/ha.go (3)
22-36: Previous feedback addressed: bounded retry for LB IP. The LB IP retry logic now properly retries up to 5 times and fails with an explicit error if the IP is never assigned. This addresses the earlier concern about proceeding with an empty endpoint.
However, consider extracting the magic numbers to named constants per coding guidelines:
♻️ Suggested constants
+const ( + lbIPRetryAttempts = 5 + lbIPRetryInterval = 2 * time.Second +) + func (p *KubeadmProvisioner) provisionHAControlPlane(ctx context.Context, cluster *domain.Cluster) (string, error) { // ... - for i := 0; i < 5; i++ { + for i := 0; i < lbIPRetryAttempts; i++ { lbIP = lb.IP if lbIP != "" { break } - time.Sleep(2 * time.Second) + time.Sleep(lbIPRetryInterval)
82-99: Extract magic numbers to named constants. The master count (3), API server port (6443), and LB weight (10) should be named constants per coding guidelines. This improves readability and makes the HA configuration easier to adjust.
♻️ Suggested refactor
+const ( + HAMasterCount = 3 + APIServerPort = 6443 + DefaultLBWeight = 10 +) + func (p *KubeadmProvisioner) provisionHAMasters(...) (...) { // ... - for i := 0; i < 3; i++ { + for i := 0; i < HAMasterCount; i++ { // ... - if err := p.lbSvc.AddTarget(ctx, lb.ID, node.ID, 6443, 10); err != nil { + if err := p.lbSvc.AddTarget(ctx, lb.ID, node.ID, APIServerPort, DefaultLBWeight); err != nil {
110-118: Add bounds check for `masterIPs` slice. The loop assumes `masterIPs` has at least 3 elements. While `provisionHAMasters` should guarantee this, a defensive check would prevent a panic if the contract changes.
♻️ Suggested defensive check
func (p *KubeadmProvisioner) joinHAMasters(ctx context.Context, cluster *domain.Cluster, masterIPs []string, cpJoinCmd string) error { + if len(masterIPs) < HAMasterCount { + return fmt.Errorf("expected %d master IPs, got %d", HAMasterCount, len(masterIPs)) + } - for i := 1; i < 3; i++ { + for i := 1; i < HAMasterCount; i++ { p.logger.Info("joining master to control plane", "ip", masterIPs[i])
cmd/api/main_test.go (2)
25-102: Consider table-driven tests + testify/mock for infra cases. These infra-init tests follow the same arrange/act/assert shape; a table-driven structure plus testify/mock-based fakes would reduce duplication and standardize mocks. As per coding guidelines, please use table-driven tests and testify/mock for mocks.
104-129: Avoid real Redis clients in unit tests. Using a real go-redis client (even pointed at 127.0.0.1) risks accidental network use; prefer a mock or stubbed Redis/TaskQueue interface. As per coding guidelines, avoid real external dependencies in unit tests.
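For example, a testify/mock stand-in could replace the real client — the interface shape below is inferred from the Enqueue calls shown in this PR and may not match the actual ports.TaskQueue definition:

```go
// Hypothetical mock for the task queue port; the method signature mirrors the
// Enqueue(ctx, "k8s_jobs", job) calls shown elsewhere in this review.
type MockTaskQueue struct{ mock.Mock }

func (m *MockTaskQueue) Enqueue(ctx context.Context, queue string, job domain.ClusterJob) error {
	return m.Called(ctx, queue, job).Error(0)
}
```

Wiring something like this into the AppDeps used by main_test.go would keep the infra-init tests free of real network dependencies.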
internal/core/services/cluster_test.go (1)
28-36: Don’t ignore constructor errors in test setup. If `NewClusterService` ever returns an error, the tests will proceed with a nil service and fail later in a harder-to-diagnose way. Consider asserting that the returned error is nil here.
💡 Suggested adjustment
- svc, _ := services.NewClusterService(services.ClusterServiceParams{ + svc, err := services.NewClusterService(services.ClusterServiceParams{ Repo: repo, Provisioner: provisioner, VpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: taskQueue, Logger: logger, }) + if err != nil { + panic(err) // or return error and fail the test via t.Fatalf in caller + }
| "/clusters": { | ||
| "get": { | ||
| "description": "Returns all clusters belonging to the user", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "List managed K8s clusters", | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "type": "array", | ||
| "items": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "post": { | ||
| "description": "Provisions a new Kubernetes cluster using kubeadm", | ||
| "consumes": [ | ||
| "application/json" | ||
| ], | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Create a managed K8s cluster", | ||
| "parameters": [ | ||
| { | ||
| "description": "Cluster details", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.CreateClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted", | ||
| "schema": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}": { | ||
| "get": { | ||
| "description": "Returns cluster metadata and current status", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Get cluster details", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "$ref": "#/definitions/domain.Cluster" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "delete": { | ||
| "description": "Terminates all nodes and removes the cluster record", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Delete a K8s cluster", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/backups": { | ||
| "post": { | ||
| "description": "Creates an etcd snapshot of the cluster state", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Create cluster backup", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/health": { | ||
| "get": { | ||
| "description": "Returns readiness of nodes and API server", | ||
| "produces": [ | ||
| "application/json" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Get cluster operational health", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "$ref": "#/definitions/ports.ClusterHealth" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/kubeconfig": { | ||
| "get": { | ||
| "description": "Returns the kubeconfig for clinical access to the cluster", | ||
| "produces": [ | ||
| "text/plain" | ||
| ], | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Download kubeconfig", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "type": "string", | ||
| "description": "Role (e.g. viewer)", | ||
| "name": "role", | ||
| "in": "query" | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK", | ||
| "schema": { | ||
| "type": "string" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/repair": { | ||
| "post": { | ||
| "description": "Re-applies CNI and kube-proxy patches to a running cluster", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Repair cluster components", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/restore": { | ||
| "post": { | ||
| "description": "Restores the etcd state from a specified snapshot path", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Restore cluster from backup", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Restore Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.RestoreBackupRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/rotate-secrets": { | ||
| "post": { | ||
| "description": "Renews cluster certificates and refreshes admin kubeconfig", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Rotate cluster secrets", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/scale": { | ||
| "post": { | ||
| "description": "Adjusts the number of worker nodes", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Scale cluster workers", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Scale Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.ScaleClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "200": { | ||
| "description": "OK" | ||
| } | ||
| } | ||
| } | ||
| }, | ||
| "/clusters/{id}/upgrade": { | ||
| "post": { | ||
| "description": "Initiates an asynchronous upgrade of the Kubernetes control plane and workers", | ||
| "tags": [ | ||
| "K8s" | ||
| ], | ||
| "summary": "Upgrade cluster version", | ||
| "parameters": [ | ||
| { | ||
| "type": "string", | ||
| "description": "Cluster ID", | ||
| "name": "id", | ||
| "in": "path", | ||
| "required": true | ||
| }, | ||
| { | ||
| "description": "Upgrade Request", | ||
| "name": "request", | ||
| "in": "body", | ||
| "required": true, | ||
| "schema": { | ||
| "$ref": "#/definitions/httphandlers.UpgradeClusterRequest" | ||
| } | ||
| } | ||
| ], | ||
| "responses": { | ||
| "202": { | ||
| "description": "Accepted" | ||
| } | ||
| } | ||
| } | ||
| }, |
There was a problem hiding this comment.
Add APIKeyAuth security to the cluster endpoints.
Unlike other authenticated routes, the /clusters family lacks a security block, which makes the docs imply public access. Please add APIKeyAuth (or set global security) for consistency and to avoid client confusion.
🔧 Example (apply similarly to all /clusters routes)
"/clusters": {
"get": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
"description": "Returns all clusters belonging to the user",
"produces": [
"application/json"
],
"tags": [
"K8s"
],
"summary": "List managed K8s clusters",
"responses": {
"200": {
"description": "OK",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/domain.Cluster"
}
}
}
}
},
"post": {
+ "security": [
+ {
+ "APIKeyAuth": []
+ }
+ ],
"description": "Provisions a new Kubernetes cluster using kubeadm",
"consumes": [
"application/json"
],| func TestClusterServiceCreate(t *testing.T) { | ||
| repo, _, vpcSvc, _, taskQueue, secretSvc, svc := setupClusterServiceTest() | ||
| ctx := context.Background() | ||
| userID := uuid.New() | ||
| vpcID := uuid.New() | ||
|
|
||
| vpcSvc.On("GetVPC", mock.Anything, vpcID.String()).Return(&domain.VPC{ID: vpcID}, nil) | ||
| repo.On("Create", mock.Anything, mock.MatchedBy(func(c *domain.Cluster) bool { | ||
| return c.Name == testClusterName && c.UserID == userID | ||
| })).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| secretSvc.On("Encrypt", mock.Anything, userID, mock.Anything).Return("encrypted-key", nil) | ||
|
|
||
| // Expect task queue enqueue | ||
| taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool { | ||
| return job.Type == domain.ClusterJobProvision && job.ClusterID != uuid.Nil && job.UserID == userID | ||
| })).Return(nil).Once() | ||
|
|
||
| cluster, err := svc.CreateCluster(ctx, ports.CreateClusterParams{ | ||
| UserID: userID, | ||
| Name: testClusterName, | ||
| VpcID: vpcID, | ||
| Version: "v1.29.0", | ||
| Workers: 2, | ||
| }) | ||
|
|
||
| assert.NoError(t, err) | ||
| assert.NotNil(t, cluster) | ||
| assert.Equal(t, testClusterName, cluster.Name) | ||
| assert.Equal(t, domain.ClusterStatusPending, cluster.Status) | ||
|
|
||
| // Wait for background provisioning - Wait time removed as provision is not async in test mock unless explicitly delayed | ||
| taskQueue.AssertExpectations(t) | ||
| } | ||
|
|
||
| func TestClusterServiceDelete(t *testing.T) { | ||
| repo, _, _, _, taskQueue, _, svc := setupClusterServiceTest() | ||
| ctx := context.Background() | ||
| id := uuid.New() | ||
| cluster := &domain.Cluster{ID: id, Status: domain.ClusterStatusRunning} | ||
|
|
||
| repo.On("GetByID", mock.Anything, id).Return(cluster, nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Update", mock.Anything, mock.Anything).Return(nil) | ||
| repo.On("Delete", mock.Anything, id).Return(nil) | ||
|
|
||
| // Expect task queue enqueue | ||
| taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool { | ||
| return job.Type == domain.ClusterJobDeprovision && job.ClusterID == id | ||
| })).Return(nil).Once() | ||
|
|
||
| err := svc.DeleteCluster(ctx, id) | ||
|
|
||
| assert.NoError(t, err) | ||
|
|
||
| taskQueue.AssertExpectations(t) | ||
| repo.AssertCalled(t, "GetByID", mock.Anything, id) | ||
| repo.AssertCalled(t, "Update", mock.Anything, mock.Anything) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Use table‑driven tests for cluster service cases.
Current tests are standalone; please convert to table-driven style to match project testing standards. As per coding guidelines, use table-driven tests in *_test.go files.
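One possible table-driven shape, as a sketch: it reuses only helpers visible in this file (setupClusterServiceTest, testClusterName, the ports/domain types), and the per-case setup returns closures so the concrete mock type names don't need to be spelled out here:

```go
func TestClusterService(t *testing.T) {
	cases := []struct {
		name    string
		arrange func() (run func(ctx context.Context) error, verify func(t *testing.T))
		wantErr bool
	}{
		{
			name: "create enqueues provision job",
			arrange: func() (func(ctx context.Context) error, func(t *testing.T)) {
				// Fresh mocks per case so expectations stay scoped to this subtest.
				repo, _, vpcSvc, _, taskQueue, secretSvc, svc := setupClusterServiceTest()
				userID, vpcID := uuid.New(), uuid.New()
				vpcSvc.On("GetVPC", mock.Anything, vpcID.String()).Return(&domain.VPC{ID: vpcID}, nil)
				repo.On("Create", mock.Anything, mock.Anything).Return(nil)
				repo.On("Update", mock.Anything, mock.Anything).Return(nil)
				secretSvc.On("Encrypt", mock.Anything, userID, mock.Anything).Return("encrypted-key", nil)
				taskQueue.On("Enqueue", mock.Anything, "k8s_jobs", mock.MatchedBy(func(job domain.ClusterJob) bool {
					return job.Type == domain.ClusterJobProvision && job.UserID == userID
				})).Return(nil).Once()

				run := func(ctx context.Context) error {
					_, err := svc.CreateCluster(ctx, ports.CreateClusterParams{
						UserID: userID, Name: testClusterName, VpcID: vpcID, Version: "v1.29.0", Workers: 2,
					})
					return err
				}
				verify := func(t *testing.T) {
					taskQueue.AssertExpectations(t)
					repo.AssertExpectations(t)
				}
				return run, verify
			},
		},
		// "delete enqueues deprovision job" and enqueue-failure cases follow the same shape.
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			run, verify := tc.arrange()
			err := run(context.Background())
			if tc.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
			}
			verify(t)
		})
	}
}
```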
🤖 Prompt for AI Agents
In `@internal/core/services/cluster_test.go` around lines 40 - 99, Convert the two
standalone tests TestClusterServiceCreate and TestClusterServiceDelete into a
single table-driven test (or separate table-driven tests) that iterates over
test cases and uses t.Run for subtests; for each case call
setupClusterServiceTest to initialize mocks, set per-case expectations on
repo.On, secretSvc.On, taskQueue.On (e.g., taskQueue.Enqueue assertions), invoke
svc.CreateCluster or svc.DeleteCluster as appropriate, and assert results and
mock expectations; keep case metadata (name, inputs, expected
error/cluster/status, mock setup function) in a slice and ensure each subtest
resets or recreates mocks so expectations (repo.On, secretSvc.On, taskQueue.On)
are scoped to that case.
| cluster.Status = domain.ClusterStatusDeleting | ||
| cluster.UpdatedAt = time.Now() | ||
| if err := s.repo.Update(ctx, cluster); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to update cluster status", err) | ||
| } | ||
|
|
||
| // Enqueue deprovision job | ||
| job := domain.ClusterJob{ | ||
| ClusterID: cluster.ID, | ||
| UserID: cluster.UserID, | ||
| Type: domain.ClusterJobDeprovision, | ||
| } | ||
|
|
||
| if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err) | ||
| } |
There was a problem hiding this comment.
Handle deprovision enqueue failures to avoid “stuck deleting” clusters.
If enqueue fails, the cluster stays in deleting with no job scheduled. Consider reverting status or marking failed and persisting the update (similar to CreateCluster).
🛠️ Suggested fix
if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
- return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err)
+ s.logger.Error("failed to enqueue cluster deprovision job", "cluster_id", cluster.ID, "error", err)
+ cluster.Status = domain.ClusterStatusFailed
+ cluster.UpdatedAt = time.Now()
+ if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
+ s.logger.Error("failed to update cluster status after enqueue failure", "cluster_id", cluster.ID, "error", updateErr)
+ }
+ return errors.Wrap(errors.Internal, "failed to enqueue cluster deprovision job", err)
}
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 138 - 153, When Enqueue fails
after setting cluster.Status = domain.ClusterStatusDeleting and persisting via
s.repo.Update, revert the cluster status (e.g., back to previous status or to a
new failure state) and persist that change so the cluster does not remain stuck
in "deleting"; specifically, catch the error from s.taskQueue.Enqueue(ctx,
"k8s_jobs", job) and call s.repo.Update(ctx, cluster) after setting
cluster.Status to the original value or a "delete_failed" status, and return the
wrapped error; follow the same pattern used in CreateCluster to ensure the
failure is recorded and persisted for recovery/visibility.
| cluster.Status = domain.ClusterStatusUpgrading | ||
| cluster.UpdatedAt = time.Now() | ||
| if err := s.repo.Update(ctx, cluster); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| // Enqueue upgrade job | ||
| job := domain.ClusterJob{ | ||
| ClusterID: cluster.ID, | ||
| UserID: cluster.UserID, | ||
| Type: domain.ClusterJobUpgrade, | ||
| Version: version, | ||
| } | ||
|
|
||
| if s.taskQueue == nil { | ||
| return errors.New(errors.Internal, "task queue not initialized") | ||
| } | ||
|
|
||
| if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil { | ||
| return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err) | ||
| } |
There was a problem hiding this comment.
Handle upgrade enqueue failures to avoid “stuck upgrading” clusters.
If enqueue fails, the cluster remains in upgrading with no job queued. Mirror CreateCluster’s failure handling and persist a failed state (or revert to running).
🛠️ Suggested fix
if err := s.taskQueue.Enqueue(ctx, "k8s_jobs", job); err != nil {
- return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err)
+ s.logger.Error("failed to enqueue cluster upgrade job", "cluster_id", cluster.ID, "error", err)
+ cluster.Status = domain.ClusterStatusFailed
+ cluster.UpdatedAt = time.Now()
+ if updateErr := s.repo.Update(ctx, cluster); updateErr != nil {
+ s.logger.Error("failed to update cluster status after enqueue failure", "cluster_id", cluster.ID, "error", updateErr)
+ }
+ return errors.Wrap(errors.Internal, "failed to enqueue cluster upgrade job", err)
}
🤖 Prompt for AI Agents
In `@internal/core/services/cluster.go` around lines 273 - 293, When enqueueing
the upgrade job, if taskQueue is nil or taskQueue.Enqueue returns an error,
persist a recovery status on the cluster instead of leaving it in
ClusterStatusUpgrading; after setting cluster.Status =
domain.ClusterStatusUpgrading and calling s.repo.Update(ctx, cluster) you must
detect enqueue failures and update the cluster status back to
domain.ClusterStatusRunning (or domain.ClusterStatusFailed) and UpdatedAt,
calling s.repo.Update(ctx, cluster) again (mirror the CreateCluster failure
handling pattern). Specifically handle the case where s.taskQueue == nil and
when s.taskQueue.Enqueue(ctx, "k8s_jobs", job) returns an error by wrapping the
error for return but first setting cluster.Status to the appropriate
non-upgrading state and persisting via s.repo.Update(ctx, cluster) so the
cluster is not left stuck in upgrading.
| initCmd := fmt.Sprintf("kubeadm init --kubernetes-version=%s --pod-network-cidr=%s --control-plane-endpoint=%s:6443 --upload-certs --ignore-preflight-errors=all", cluster.Version, podCIDR, lbIP) | ||
| out, err := exec.Run(ctx, initCmd) | ||
| if err != nil { | ||
| return "", "", "", err | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
kubeadm init --ignore-preflight-errors documentation specific errors
💡 Result:
Short answer
- The --ignore-preflight-errors flag accepts a comma-separated list of preflight check names (e.g. IsPrivilegedUser,Swap) or the special value all to treat all preflight errors as warnings. Use it with kubeadm init (or the preflight phase). [1][2][3]
Usage / examples
- CLI: kubeadm init --ignore-preflight-errors=IsPrivilegedUser,Swap
- To ignore everything: kubeadm init --ignore-preflight-errors=all. [1][2]
Where to find the authoritative docs
- kubeadm init reference (shows the --ignore-preflight-errors option and examples). [1]
- kubeadm init phase preflight (preflight phase documentation; also documents the flag). [2]
- kubeadm config API (ignorePreflightErrors field for config files). [3]
If you want to know the exact names of checks on your kubeadm version
- Run: kubeadm init phase preflight --help or run kubeadm init phase preflight (it will list checks/errors for your system/version), or consult the generated reference for your kubeadm version on kubernetes.io. [2][1]
Sources
[1] kubeadm init reference — kubernetes.io.
[2] kubeadm init phase preflight — kubernetes.io.
[3] kubeadm-config API (ignorePreflightErrors) — kubernetes.io.
Specify individual preflight errors instead of ignoring all.
Using --ignore-preflight-errors=all bypasses all kubeadm safety checks, which could mask real issues (e.g., insufficient resources, port conflicts, missing dependencies). Specify only the particular preflight errors that need to be ignored by name (e.g., --ignore-preflight-errors=IsPrivilegedUser,Swap). You can discover available check names with kubeadm init phase preflight --help.
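As a sketch, the init command could take its ignore list from a named set or a per-cluster field — the `IgnoredPreflightErrors` field and the example check names below are assumptions for illustration, not existing code:

```go
// Build the kubeadm init command with an explicit, named ignore list instead of "all".
func buildKubeadmInitCmd(cluster *domain.Cluster, podCIDR, lbIP string) string {
	ignored := []string{"IsPrivilegedUser", "Swap"} // illustrative names; verify with `kubeadm init phase preflight --help`
	if len(cluster.IgnoredPreflightErrors) > 0 {    // hypothetical per-cluster override
		ignored = cluster.IgnoredPreflightErrors
	}
	return fmt.Sprintf(
		"kubeadm init --kubernetes-version=%s --pod-network-cidr=%s --control-plane-endpoint=%s:6443 --upload-certs --ignore-preflight-errors=%s",
		cluster.Version, podCIDR, lbIP, strings.Join(ignored, ","),
	)
}
```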
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/ha.go` around lines 139 - 143, The kubeadm init
invocation builds initCmd with --ignore-preflight-errors=all which silences all
safety checks; change the logic that constructs initCmd in ha.go (the initCmd
variable used with exec.Run) to specify a comma-separated list of named
preflight errors to ignore instead of "all" (e.g., IsPrivilegedUser,Swap or
other specific checks discovered via `kubeadm init phase preflight --help`);
implement this by replacing the literal string "--ignore-preflight-errors=all"
with a variable or function that returns the allowed names, or make it
configurable on the cluster object, and update the initCmd construction so the
chosen names are inserted (preserving the existing flags: --kubernetes-version,
--pod-network-cidr, --control-plane-endpoint, --upload-certs) before calling
exec.Run.
| if err := p.bootstrapNode(ctx, cluster, ip, cluster.Version, false); err != nil { | ||
| if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil { | ||
| p.logger.Error("failed to terminate instance after bootstrap error", "instance_id", worker.ID, "error", termErr) | ||
| } | ||
| if delErr := p.repo.DeleteNode(ctx, worker.ID); delErr != nil { | ||
| p.logger.Error("failed to delete node record after bootstrap error", "node_id", worker.ID, "error", delErr) | ||
| } | ||
| return fmt.Errorf("failed to bootstrap node %s (%s): %w", workerName, ip, err) | ||
| } |
There was a problem hiding this comment.
Bug: Wrong ID passed to TerminateInstance.
Line 116 passes worker.ID (the node record UUID) instead of worker.InstanceID (the compute instance UUID). This will fail to terminate the actual instance, leaving orphaned compute resources. Compare with scaleDown at line 148 which correctly uses node.InstanceID.String().
Proposed fix
if err := p.bootstrapNode(ctx, cluster, ip, cluster.Version, false); err != nil {
- if termErr := p.instSvc.TerminateInstance(ctx, worker.ID.String()); termErr != nil {
+ if termErr := p.instSvc.TerminateInstance(ctx, worker.InstanceID.String()); termErr != nil {
p.logger.Error("failed to terminate instance after bootstrap error", "instance_id", worker.ID, "error", termErr)
}
if delErr := p.repo.DeleteNode(ctx, worker.ID); delErr != nil {
p.logger.Error("failed to delete node record after bootstrap error", "node_id", worker.ID, "error", delErr)
}
return fmt.Errorf("failed to bootstrap node %s (%s): %w", workerName, ip, err)
}
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/health.go` around lines 115 - 123, The
TerminateInstance call in the bootstrap error handler is using worker.ID (node
record UUID) instead of the compute instance UUID; update the call in the
p.bootstrapNode error branch to pass worker.InstanceID.String() to
p.instSvc.TerminateInstance so the actual instance is terminated (mirror what
scaleDown uses with node.InstanceID.String()); ensure you reference
worker.InstanceID and convert to String() as in scaleDown.


🌟 Overview
This PR introduces the highly anticipated Managed Kubernetes (KaaS) service to The Cloud platform. It allows users to provision production-ready, highly available Kubernetes clusters with a single command.
The implementation follows a robust Hexagonal Architecture and introduces an Asynchronous Durable Operations layer using a Redis-backed Task Queue.
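The task-queue port implied by the review comments looks roughly like this (the method set is inferred from the Enqueue calls shown in the review; the real interface may differ):

```go
// Inferred shape of the TaskQueue port backing the durable operations layer.
type TaskQueue interface {
	// Enqueue pushes a cluster job onto a named queue (e.g. "k8s_jobs").
	Enqueue(ctx context.Context, queue string, job domain.ClusterJob) error
}
```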
🚀 Key Features
ClusterWorker.
🏗️ Architectural Changes
TaskQueue port and implementation.
🧪 Testing
📝 Documentation
README.md, FEATURES.md, and architecture.md.
Summary by CodeRabbit
New Features
Infrastructure
Documentation
Tests