
fix: resolve E2E test failures, optimize performance, and clean lint #48

Closed
poyrazK wants to merge 81 commits into main from feature/test-fixes-and-optimization

Conversation

@poyrazK (Owner) commented Feb 4, 2026

Summary

This PR resolves various test failures and improves overall codebase stability and performance.

Key Changes

  • Database Schema: Fixed missing columns in the clusters and load_balancers tables via new migrations (080, 081).
  • K8s Testing:
    • Centralized mocks in internal/repositories/k8s/mocks_test.go.
    • Fixed mock mismatches (sh vs /bin/sh) and prevented accidental SSH fallback.
    • Properly mocked SecretService for health checks and kubeconfig retrieval.
  • Performance: Enabled parallel execution (t.Parallel()) for unit tests and resolved port conflicts (changed test port to 8888).
  • Lint: Resolved all findings from golangci-lint (errcheck, staticcheck, empty branches).
  • Documentation: Synchronized Swagger API documentation.

Verification

  • All Unit, Integration, and E2E tests (including TestKubernetesE2E) are passing.
  • golangci-lint returns 0 issues.

Summary by CodeRabbit

  • New Features

    • Cloud‑init / user‑data support for instance provisioning
    • Cluster networking options: Pod CIDR and Service CIDR
    • Structured instance launch options and ability to start stopped instances
  • Bug Fixes

    • Improved handling and storage of sensitive cluster credentials (encrypted)
  • Documentation

    • API docs updated to include new cluster networking fields and instance options

Copilot AI review requested due to automatic review settings February 4, 2026 10:01

coderabbitai Bot commented Feb 4, 2026

Warning

Rate limit exceeded

@PoyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 29 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Cloud-init based provisioning and user-data support were added across compute backends and the k8s provisioner; the cluster domain model was extended with CIDR and encrypted secret fields; ComputeBackend API was renamed/extended (CreateInstance → LaunchInstanceWithOptions, StartInstance); many legacy k8s SSH-based files were removed and tests were widely parallelized.

Changes

Cohort / File(s) Summary
Compute backends (Docker / Libvirt / Noop)
internal/repositories/docker/..., internal/repositories/libvirt/..., internal/repositories/noop/adapters.go
Renamed CreateInstance → LaunchInstanceWithOptions; added StartInstance; threaded UserData through cloud-init/user-data pipelines; Docker adapter now accepts *slog.Logger (NewDockerAdapter(logger)); tests updated accordingly.
Instance service & workers
internal/core/services/instance.go, internal/workers/provision_worker.go, internal/core/services/*_test.go
Added LaunchInstanceWithOptions and StartInstance methods; Provision gained userData parameter and flows now enqueue/utilize user data; many tests and mocks updated to new signatures.
Kubernetes provisioner (k8s package)
internal/repositories/k8s/provisioner.go, internal/repositories/k8s/node_executor.go, internal/repositories/k8s/security.go, internal/repositories/k8s/templates/*
Replaced SSH-centric bootstrap with template-driven cloud-init flow, introduced ServiceExecutor, template rendering, HA LB handling and control-plane/worker templates; removed multiple legacy kubeadm/CNI/HA/lifecycle files.
Domain model & DB mappings
internal/core/domain/cluster.go, internal/core/ports/*, internal/repositories/postgres/cluster_repo.go, internal/repositories/postgres/migrations/*
Added pod_cidr, service_cidr, encrypted secret fields (ssh_private_key_encrypted, kubeconfig_encrypted), join token fields, and node last_heartbeat; updated repository SQL and migrations to match new columns.
Tests & mocks
internal/repositories/k8s/mocks_test.go, internal/core/services/shared_test.go, internal/handlers/*_test.go, internal/repositories/*_unit_test.go, internal/core/services/system_integration_test.go
Expanded mocks to support LaunchInstanceWithOptions/StartInstance; added SyncTaskQueue integration test; broad addition of t.Parallel() across many unit tests and updated test call sites for new APIs.
Docs / Swagger
docs/swagger/docs.go, docs/swagger/swagger.json, docs/swagger/swagger.yaml
Added pod_cidr and service_cidr fields to the Cluster schema in generated Swagger definitions.
Misc (small edits)
cmd/cloud/queue_cli_test.go, various test files
Minor formatting/indentation fixes, import additions (log/slog), and adoption of slog.Default() in test setups.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant InstanceService
    participant ComputeBackend
    participant DockerEngine

    Client->>InstanceService: LaunchInstanceWithOptions(opts + UserData)
    InstanceService->>InstanceService: persist instance, enqueue provision job
    InstanceService->>ComputeBackend: LaunchInstanceWithOptions(opts with UserData)
    ComputeBackend->>DockerEngine: ContainerCreate (+ cloud-init / config)
    DockerEngine->>DockerEngine: create container
    ComputeBackend->>DockerEngine: ContainerStart
    DockerEngine->>DockerEngine: start container
    ComputeBackend->>DockerEngine: Exec (bootstrap/user-data handling)
    DockerEngine->>DockerEngine: run cloud-init / bootstrap script
    ComputeBackend->>InstanceService: return container ID / result
    InstanceService->>Client: return Instance

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I stitched a cloud-init note tonight,
Tucked secrets safe and set the net right,
Templates hum and containers wake,
Tests race onward — ears a‑twitch, awake,
Hops of joy as nodes boot bright! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 1.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the primary changes: fixing E2E test failures, optimizing performance (parallel test execution), and cleaning up lint issues.


	Auth: []ssh.AuthMethod{
		ssh.PublicKeys(signer),
	},
	HostKeyCallback: ssh.InsecureIgnoreHostKey(),

Check failure

Code scanning / CodeQL

Use of insecure HostKeyCallback implementation (High)

Configuring SSH ClientConfig with insecure HostKeyCallback implementation from this source.

Copilot Autofix (AI, 3 months ago)

In general, the fix is to replace ssh.InsecureIgnoreHostKey() with a HostKeyCallback that validates the server’s host key against an allow list. For a simple case where there is exactly one expected host key (typical when connecting to a single node or a node with a fixed key), ssh.FixedHostKey(parsedKey) is the recommended implementation. The public host key should be provided out-of-band (e.g., configuration) rather than learned insecurely at first connection.

For this specific snippet, the least invasive change that preserves existing behavior while adding host key verification is to:

  1. Add a public host key field to SSHExecutor,
  2. Require that key as an argument in NewSSHExecutor, and
  3. Parse it once in Run and use ssh.FixedHostKey for HostKeyCallback.

This avoids changing the rest of the logic and keeps all security concerns local to this type. Concretely, in internal/repositories/k8s/node_executor.go:

  • Extend SSHExecutor to include a hostKey string field that holds an ASCII-armored public key (e.g., ssh-ed25519 AAAA...).
  • Change NewSSHExecutor(ip, user, key string) to NewSSHExecutor(ip, user, key, hostKey string) and store hostKey.
  • In Run, before creating ssh.ClientConfig, parse e.hostKey using ssh.ParseAuthorizedKey (which accepts the standard public key format) and handle errors.
  • Replace HostKeyCallback: ssh.InsecureIgnoreHostKey(), with HostKeyCallback: ssh.FixedHostKey(parsedHostKey),.

Imports don’t need to change; golang.org/x/crypto/ssh already provides ParseAuthorizedKey and FixedHostKey. No extra dependencies are required.


Suggested changeset 1
internal/repositories/k8s/node_executor.go

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/internal/repositories/k8s/node_executor.go b/internal/repositories/k8s/node_executor.go
--- a/internal/repositories/k8s/node_executor.go
+++ b/internal/repositories/k8s/node_executor.go
@@ -54,13 +54,14 @@
 
 // SSHExecutor uses SSH to run commands on a node.
 type SSHExecutor struct {
-	ip   string
-	user string
-	key  string
+	ip      string
+	user    string
+	key     string
+	hostKey string
 }
 
-func NewSSHExecutor(ip, user, key string) *SSHExecutor {
-	return &SSHExecutor{ip: ip, user: user, key: key}
+func NewSSHExecutor(ip, user, key, hostKey string) *SSHExecutor {
+	return &SSHExecutor{ip: ip, user: user, key: key, hostKey: hostKey}
 }
 
 func (e *SSHExecutor) Run(ctx context.Context, cmd string) (string, error) {
@@ -69,12 +68,17 @@
 		return "", fmt.Errorf("failed to parse private key: %w", err)
 	}
 
+	hostPubKey, _, _, _, err := ssh.ParseAuthorizedKey([]byte(e.hostKey))
+	if err != nil {
+		return "", fmt.Errorf("failed to parse SSH host public key: %w", err)
+	}
+
 	config := &ssh.ClientConfig{
 		User: e.user,
 		Auth: []ssh.AuthMethod{
 			ssh.PublicKeys(signer),
 		},
-		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
+		HostKeyCallback: ssh.FixedHostKey(hostPubKey),
 		Timeout:         10 * time.Second,
 	}
 
EOF
Copilot is powered by AI and may make mistakes. Always verify output.

Copilot AI left a comment


Pull request overview

This PR aims to fix failing E2E/unit tests, align APIs after compute/provisioning changes (including UserData support), and improve test performance by parallelizing many test suites.

Changes:

  • Renames/extends compute + instance provisioning APIs (e.g., LaunchInstanceWithOptions, StartInstance, provisioning now carries UserData) and updates adapters/tests accordingly.
  • Refactors Kubernetes provisioning test infrastructure (centralized mocks, executor changes, new cloud-config templates) and adjusts security group behavior.
  • Adds/updates Postgres migrations and broadly parallelizes unit tests (t.Parallel()), along with lint-driven cleanups and Swagger sync.

Reviewed changes

Copilot reviewed 144 out of 145 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/test_libvirt.go Updates script to new compute backend API (LaunchInstanceWithOptions).
internal/workers/provision_worker_test.go Updates failing backend mock to new compute backend API.
internal/workers/provision_worker.go Passes UserData through to InstanceService.Provision.
internal/repositories/postgres/vpc_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/secret_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/rbac_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/notify_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/migrator_unit_test.go Parallelizes migration unit test and whitespace cleanup.
internal/repositories/postgres/migrations/081_fix_clusters_table.up.sql Adds missing cluster columns and migrates data from legacy columns.
internal/repositories/postgres/migrations/081_fix_clusters_table.down.sql Drops newly added cluster columns.
internal/repositories/postgres/migrations/080_add_lb_ip.up.sql Adds ip column to load_balancers.
internal/repositories/postgres/migrations/080_add_lb_ip.down.sql Drops ip column from load_balancers.
internal/repositories/postgres/migrations/020_create_clusters.up.sql Adds a new clusters schema migration file (see PR comments re: conflicts).
internal/repositories/postgres/lifecycle_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/image_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/identity_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/function_repo_unit_test.go Parallelizes repository unit tests and formatting cleanup.
internal/repositories/postgres/encryption_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/database_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/cron_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/container_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/postgres/cluster_repo_unit_test.go Updates cluster schema expectations and parallelizes tests.
internal/repositories/postgres/accounting_repo_unit_test.go Parallelizes repository unit tests.
internal/repositories/noop/adapters_test.go Updates noop compute backend tests to new API.
internal/repositories/libvirt/templates_test.go Marks as integration tests; parallelizes subtests.
internal/repositories/libvirt/mock_client_test.go Marks as integration tests.
internal/repositories/libvirt/adapter_volume_test.go Marks as integration tests; parallelizes tests.
internal/repositories/libvirt/adapter_test.go Marks as integration tests; parallelizes tests.
internal/repositories/libvirt/adapter_network_test.go Marks as integration tests; parallelizes tests.
internal/repositories/libvirt/adapter_lifecycle_test.go Marks as integration tests; updates to new API + parallelization.
internal/repositories/libvirt/adapter_context_test.go Marks as integration tests; updates to new API + parallelization.
internal/repositories/libvirt/adapter.go Renames create→launch, adds StartInstance, extends cloud-init generation to accept raw UserData.
internal/repositories/k8s/templates/worker.yaml Adds worker node cloud-config template.
internal/repositories/k8s/templates/control_plane.yaml Adds control-plane cloud-config template.
internal/repositories/k8s/security_test.go Removes prior security tests (behavior shifted).
internal/repositories/k8s/security.go Changes SG naming/rules behavior for clusters.
internal/repositories/k8s/provisioner_all_test.go Removes broad/legacy provisioner tests.
internal/repositories/k8s/node_executor_test.go Updates expected shell path (sh vs /bin/sh).
internal/repositories/k8s/node_executor.go Reworks executor implementations; adds raw SSH execution path.
internal/repositories/k8s/mocks_test.go Centralizes k8s mocks and updates interfaces.
internal/repositories/k8s/mock_provisioner.go Removes old mock provisioner implementation.
internal/repositories/k8s/kubeadm_provisioner_test.go Updates instance service mock + uses new launch API and list mocks.
internal/repositories/k8s/kubeadm.go Removes legacy kubeadm helper code.
internal/repositories/k8s/health_test.go Updates mocks and adds secret decryption path for kubeconfig.
internal/repositories/k8s/ha_test.go Removes legacy HA tests.
internal/repositories/k8s/ha.go Replaces HA provision flow with LB ensure helper.
internal/repositories/k8s/export_test.go Removes test-only exported wrappers.
internal/repositories/k8s/cni_test.go Removes legacy CNI tests.
internal/repositories/k8s/cni.go Removes legacy CNI helper code.
internal/repositories/k8s/bootstrap.go Removes legacy bootstrap helpers.
internal/repositories/docker/stats_test.go Updates adapter constructor + launch API.
internal/repositories/docker/fakes_test.go Adds call counters to fake client for async bootstrap assertions.
internal/repositories/docker/adapter_unit_test.go Updates create→launch and adds UserData bootstrap unit test.
internal/repositories/docker/adapter_test.go Updates constructor + launch API and adds UserData bootstrap integration test.
internal/repositories/docker/adapter.go Adds logger dependency, UserData handling (script + cloud-config), and implements StartInstance.
internal/handlers/ws/ws_test.go Parallelizes websocket handler tests.
internal/handlers/ws/hub_test.go Parallelizes hub tests.
internal/handlers/ws/client_test.go Parallelizes client tests.
internal/handlers/vpc_handler_test.go Parallelizes handler tests.
internal/handlers/volume_handler_test.go Parallelizes handler tests.
internal/handlers/tenant_handler_test.go Parallelizes handler tests.
internal/handlers/subnet_handler_test.go Parallelizes handler tests.
internal/handlers/storage_handler_test.go Parallelizes handler tests.
internal/handlers/storage_handler_errors_test.go Parallelizes error-table tests (including subtests).
internal/handlers/stack_handler_test.go Parallelizes handler tests.
internal/handlers/snapshot_handler_test.go Parallelizes handler tests.
internal/handlers/security_group_handler_test.go Parallelizes handler tests.
internal/handlers/secret_handler_test.go Parallelizes handler tests.
internal/handlers/rbac_handler_test.go Parallelizes handler tests.
internal/handlers/queue_handler_test.go Parallelizes handler tests.
internal/handlers/notify_handler_test.go Parallelizes handler tests and minor constant formatting.
internal/handlers/lifecycle_handler_test.go Parallelizes handler tests.
internal/handlers/lb_handler_test.go Parallelizes handler tests.
internal/handlers/instance_type_handler_test.go Parallelizes handler tests.
internal/handlers/instance_handler_test.go Updates instance service mock interface and parallelizes handler tests.
internal/handlers/image_handler_test.go Parallelizes handler tests.
internal/handlers/identity_handler_test.go Parallelizes handler tests.
internal/handlers/helper_test.go Parallelizes helper tests.
internal/handlers/health_handler_test.go Parallelizes health handler tests.
internal/handlers/global_lb_handler_test.go Parallelizes handler tests.
internal/handlers/gateway_handler_test.go Parallelizes handler tests.
internal/handlers/function_handler_test.go Parallelizes handler tests and minor constant formatting.
internal/handlers/event_handler_test.go Parallelizes handler tests.
internal/handlers/dns_handler_test.go Parallelizes handler tests.
internal/handlers/database_handler_test.go Parallelizes handler tests and minor constant formatting.
internal/handlers/dashboard_handler_test.go Parallelizes handler tests.
internal/handlers/cron_handler_test.go Parallelizes handler tests and minor constant formatting.
internal/handlers/container_handler_test.go Parallelizes handler tests.
internal/handlers/cluster_handler_test.go Parallelizes handler tests.
internal/handlers/cache_handler_test.go Parallelizes handler tests.
internal/handlers/autoscaling_handler_test.go Parallelizes handler tests.
internal/handlers/auth_handler_test.go Parallelizes handler tests.
internal/handlers/audit_handler_test.go Parallelizes handler tests.
internal/handlers/accounting_handler_test.go Parallelizes handler tests.
internal/core/services/volume_test.go Updates Provision calls to include userData argument.
internal/core/services/stack_test.go Parallelizes stack service tests.
internal/core/services/shared_test.go Updates mocks for new compute + instance service interfaces.
internal/core/services/rbac_cached_test.go Parallelizes cached RBAC tests.
internal/core/services/loadbalancer_service_test.go Parallelizes LB service tests.
internal/core/services/lifecycle_svc_test.go Parallelizes lifecycle service tests.
internal/core/services/lb_worker_test.go Parallelizes LB worker tests.
internal/core/services/instance_test.go Updates docker adapter construction + provision signature + test port.
internal/core/services/instance_ports_test.go Parallelizes port parsing tests.
internal/core/services/instance_internal_test.go Parallelizes internal helper tests.
internal/core/services/instance.go Adds LaunchInstanceWithOptions, extends Provision with UserData, adds StartInstance.
internal/core/services/health_test.go Updates docker adapter constructor usage.
internal/core/services/global_lb_test.go Parallelizes global LB tests.
internal/core/services/function_test.go Updates docker adapter constructor usage.
internal/core/services/database_test.go Updates docker adapter constructor usage.
internal/core/services/database.go Uses new compute backend launch API.
internal/core/services/dashboard_test.go Parallelizes dashboard tests.
internal/core/services/cron_worker_test.go Parallelizes cron worker tests.
internal/core/services/container_worker_test.go Parallelizes container worker tests.
internal/core/services/container_test.go Updates provision signature.
internal/core/services/cluster_test.go Parallelizes cluster service tests and updates encrypted kubeconfig field name.
internal/core/services/cluster_lifecycle_test.go Parallelizes cluster lifecycle tests.
internal/core/services/cluster.go Adds Pod/Service CIDR support and uses encrypted kubeconfig field.
internal/core/services/cached_identity_test.go Parallelizes cached identity tests.
internal/core/services/cache_test.go Updates docker adapter constructor usage.
internal/core/services/cache.go Uses new compute backend launch API.
internal/core/services/autoscaling_worker_test.go Parallelizes autoscaling worker tests.
internal/core/services/autoscaling_test.go Updates provision signature.
internal/core/ports/instance.go Extends InstanceService port with LaunchInstanceWithOptions and StartInstance.
internal/core/ports/compute_options.go Adds UserData to compute options.
internal/core/ports/compute.go Renames create→launch and adds StartInstance to compute backend port.
internal/core/ports/cluster.go Adds Pod/Service CIDR fields to cluster creation params.
internal/core/domain/ws_event_test.go Parallelizes ws event tests.
internal/core/domain/security_group_test.go Parallelizes security group/domain tests.
internal/core/domain/jobs.go Adds UserData to provision job payload.
internal/core/domain/domain_test.go Parallelizes smoke test.
internal/core/domain/dns_test.go Parallelizes DNS domain tests.
internal/core/domain/cluster.go Updates cluster/node domain model (CIDRs, encrypted secrets, heartbeat).
internal/api/setup/infrastructure.go Passes logger into Docker adapter constructor.
docs/swagger/swagger.yaml Adds pod_cidr/service_cidr fields to cluster schema.
docs/swagger/swagger.json Adds pod_cidr/service_cidr fields to cluster schema.
docs/swagger/docs.go Syncs generated swagger docs for new fields.
cmd/cloud/queue_cli_test.go Minor indentation cleanup in tests.


Comment on lines +72 to 85
 config := &ssh.ClientConfig{
 	User: e.user,
 	Auth: []ssh.AuthMethod{
 		ssh.PublicKeys(signer),
 	},
 	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
 	Timeout:         10 * time.Second,
 }

 addr := net.JoinHostPort(e.ip, "22")
 client, err := ssh.Dial("tcp", addr, config)
 if err != nil {
-	return fmt.Errorf("failed to create ssh client: %w", err)
+	return "", fmt.Errorf("failed to dial ssh: %w", err)
 }

Copilot AI Feb 4, 2026


SSHExecutor disables host key verification via ssh.InsecureIgnoreHostKey(), which makes SSH connections vulnerable to MITM and is generally unsafe for production provisioning. Use a known_hosts-based HostKeyCallback (or a pinned host key / CA) and consider making insecure mode opt-in for tests only. Also note that ssh.Dial/session.Run don’t honor ctx cancellation; if ctx is canceled, the command may still hang until SSH timeouts fire—consider wiring ctx into dialing (net.Dialer.DialContext) and closing the session/client when ctx is done.

Copilot uses AI. Check for mistakes.
Comment on lines 72 to 79
 // NewDockerAdapter constructs a DockerAdapter with a Docker client.
-func NewDockerAdapter() (*DockerAdapter, error) {
+func NewDockerAdapter(logger *slog.Logger) (*DockerAdapter, error) {
 	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
 	if err != nil {
 		return nil, fmt.Errorf("failed to create docker client: %w", err)
 	}
-	return &DockerAdapter{cli: cli}, nil
+	return &DockerAdapter{cli: cli, logger: logger}, nil
 }

Copilot AI Feb 4, 2026


NewDockerAdapter stores the provided logger without nil-guarding, but LaunchInstanceWithOptions/handleUserData later call a.logger.Warn/Error/Info. If any caller passes a nil logger, this will panic at runtime. Default logger to slog.Default() (or a no-op logger) when the argument is nil.

Comment on lines +218 to +225
// Note: We use base64 encoding to avoid escaping issues with complex scripts
encoded := base64.StdEncoding.EncodeToString([]byte(userData))
writeCmd := []string{"sh", "-c", fmt.Sprintf("mkdir -p %s && echo %s | base64 -d > %s && chmod +x %s",
	filepath.Dir(scriptPath), encoded, scriptPath, scriptPath)}

if _, err := a.Exec(ctx, containerID, writeCmd); err != nil {
	return fmt.Errorf("failed to write userdata to container: %w", err)
}

Copilot AI Feb 4, 2026


The userdata write command embeds the base64 payload into a shell string via echo %s | base64 -d ... without quoting/printf. While base64 usually avoids whitespace, using printf '%s' and proper quoting is more robust and avoids edge cases with shell interpretation/length limits. Consider switching to a heredoc or printf-based approach and quoting paths consistently.

Comment on lines +393 to +405
	// UserData execution occurs asynchronously in a background goroutine.
	// A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
	time.Sleep(50 * time.Millisecond)

	// Verify the two-stage bootstrap sequence:
	// Stage 1: Payload delivery (writing the bootstrap script to the container filesystem).
	// Stage 2: Bootstrap execution (invoking the script via a background Exec operation).
	// Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach,
	// and ContainerExecInspect according to the adapter's implementation.
	require.Equal(t, 2, cli.Calls["ContainerExecCreate"], "Expected dual Stage (delivery + execution) Exec invocations")
	require.Equal(t, 2, cli.Calls["ContainerExecAttach"], "Expected associated ExecAttach calls for I/O")
	require.Equal(t, 2, cli.Calls["ContainerExecInspect"], "Expected ExecInspect calls to verify termination state")
}

Copilot AI Feb 4, 2026


This test uses a fixed time.Sleep to wait for an asynchronous goroutine, which can be flaky on slower CI machines (the background exec may not have completed yet). Prefer require.Eventually polling on the fake client's call counters (or synchronize via a channel/hook) so the test waits deterministically for the expected Exec sequence.

Comment on lines +1 to +5
CREATE TABLE clusters (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
vpc_id UUID NOT NULL REFERENCES vpcs(id),
name VARCHAR(255) NOT NULL,

Copilot AI Feb 4, 2026


This migration introduces a second version "020" alongside existing 020_create_notify.{up,down}.sql, which will break migration ordering/uniqueness in goose-style tooling. It also recreates the clusters/cluster_nodes tables without IF NOT EXISTS even though clusters already exists via 037_create_clusters_table.up.sql, so applying migrations will fail on a fresh DB. Renumber this migration to a new unused version (and add a matching down migration) or fold these schema changes into the existing clusters migrations instead of creating a conflicting table definition.

Comment on lines +147 to +189
// LaunchInstanceWithOptions provisions an instance using structured options.
func (s *InstanceService) LaunchInstanceWithOptions(ctx context.Context, opts ports.CreateInstanceOptions) (*domain.Instance, error) {
	ctx, span := otel.Tracer("instance-service").Start(ctx, "LaunchInstanceWithOptions")
	defer span.End()

	inst := &domain.Instance{
		ID:           uuid.New(),
		UserID:       appcontext.UserIDFromContext(ctx),
		TenantID:     appcontext.TenantIDFromContext(ctx),
		Name:         opts.Name,
		Image:        opts.ImageName,
		Status:       domain.StatusStarting,
		InstanceType: "custom", // Marking as custom since we are passing raw constraints or defaults
		Version:      1,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}

	if opts.NetworkID != "" {
		vpcID, err := uuid.Parse(opts.NetworkID)
		if err == nil {
			inst.VpcID = &vpcID
		}
	}

	if err := s.repo.Create(ctx, inst); err != nil {
		return nil, err
	}

	// 4. Enqueue provision task with UserData
	job := domain.ProvisionJob{
		InstanceID: inst.ID,
		UserID:     inst.UserID,
		TenantID:   inst.TenantID,
		UserData:   opts.UserData,
	}

	if err := s.taskQueue.Enqueue(ctx, "provision_queue", job); err != nil {
		s.logger.Error("failed to enqueue provision job", "instance_id", inst.ID, "error", err)
		return nil, errors.Wrap(errors.Internal, "failed to enqueue provisioning task", err)
	}

	return inst, nil

Copilot AI Feb 4, 2026


LaunchInstanceWithOptions currently creates an Instance record without persisting key options needed by Provision (e.g., Ports, SubnetID/VpcID, and any volume attachments), and enqueues a ProvisionJob without setting Volumes. As a result, Provision will launch with empty port mappings and no volumes, and may fail later unless an instance type named "custom" exists in the DB. Consider either (a) mapping opts into the Instance model (validate/join opts.Ports into inst.Ports, store the correct VPC/Subnet identifiers, ensure a valid InstanceType, and propagate volumes via the job), or (b) defining a dedicated service-level options struct instead of reusing compute-level CreateInstanceOptions (where NetworkID is not necessarily a UUID).

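A minimal, self-contained sketch of option (a) from the comment above — validating `NetworkID` instead of silently dropping it, and carrying option fields onto the instance so `Provision` sees them. All types here are illustrative stand-ins, not the project's real API:

```go
package main

import (
	"fmt"
	"regexp"
)

// Stand-ins for the service-level types referenced in the review.
type CreateInstanceOptions struct {
	Name      string
	NetworkID string
	Ports     []string
}

type Instance struct {
	Name  string
	VpcID string
	Ports []string
}

var uuidRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// buildInstance rejects a malformed NetworkID with an error (rather than
// ignoring it) and copies the fields Provision later reads (here just Ports).
func buildInstance(opts CreateInstanceOptions) (*Instance, error) {
	inst := &Instance{Name: opts.Name, Ports: opts.Ports}
	if opts.NetworkID != "" {
		if !uuidRe.MatchString(opts.NetworkID) {
			return nil, fmt.Errorf("invalid network ID %q", opts.NetworkID)
		}
		inst.VpcID = opts.NetworkID
	}
	return inst, nil
}

func main() {
	_, err := buildInstance(CreateInstanceOptions{Name: "web", NetworkID: "not-a-uuid"})
	fmt.Println(err != nil)
}
```

The same mapping idea extends to volumes and resource limits: either add them to the instance model before `repo.Create`, or attach the whole options struct to the `ProvisionJob`.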
Comment on lines +383 to +403
if inst.Status == domain.StatusRunning {
return nil // Already running
}

// 2. Call Compute backend
target := inst.ContainerID
if target == "" {
// Try to recover ID from name if missing
target = s.formatContainerName(inst.ID)
}

if err := s.compute.StartInstance(ctx, target); err != nil {
platform.InstanceOperationsTotal.WithLabelValues("start", "failure").Inc()
s.logger.Error("failed to start instance", "instance_id", inst.ID, "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to start instance", err)
}

// 3. Update Metrics & Status
platform.InstancesTotal.WithLabelValues("stopped", s.compute.Type()).Dec()
platform.InstancesTotal.WithLabelValues("running", s.compute.Type()).Inc()
platform.InstanceOperationsTotal.WithLabelValues("start", "success").Inc()

Copilot AI Feb 4, 2026


StartInstance updates Prometheus counters by decrementing the "stopped" gauge unconditionally. If StartInstance is called while the instance is in STARTING/ERROR (or any non-stopped state), this can drive the "stopped" metric negative and make state counts inaccurate. Gate the gauge updates on the prior status being STOPPED (or recompute gauges from DB periodically) before Dec/Inc.

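A minimal sketch of the gating suggested above — the gauges only move when the prior status was actually stopped, so repeated or invalid transitions cannot drive the "stopped" count negative. Types and names are stand-ins, not the project's real metrics API:

```go
package main

import "fmt"

// Gauges stands in for the Prometheus state gauges in the review.
type Gauges struct {
	Stopped, Running int
}

// MarkStarted applies the stopped->running transition only when the prior
// status was "stopped"; other prior states (starting, error, ...) leave the
// gauges untouched. It reports whether the gauges were updated.
func MarkStarted(g *Gauges, prevStatus string) bool {
	if prevStatus != "stopped" {
		return false
	}
	g.Stopped--
	g.Running++
	return true
}

func main() {
	g := &Gauges{Stopped: 1}
	MarkStarted(g, "stopped")
	MarkStarted(g, "error") // no-op: prior state was not "stopped"
	fmt.Println(g.Stopped, g.Running)
}
```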
Comment on lines +24 to 34
-	// Add default rules (K8s control plane ports)
-	rules := []domain.SecurityRule{
-		{Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},   // API Server
-		{Protocol: "udp", PortMin: 4789, PortMax: 4789, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},   // VXLAN
-		{Protocol: "tcp", PortMin: 179, PortMax: 179, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},     // BGP
-		{Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // Kubelet
-		{Protocol: "tcp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort TCP
-		{Protocol: "udp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort UDP
-	}
-
-	for _, r := range rules {
-		_, _ = p.sgSvc.AddRule(ctx, sg.ID, r)
-	}
+	rules := []domain.SecurityRule{
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: "0.0.0.0/0"},
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 22, PortMax: 22, CIDR: "0.0.0.0/0"},
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 2379, PortMax: 2380, CIDR: "10.0.0.0/8"},   // etcd (internal)
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: "10.0.0.0/8"}, // kubelet
+	}
+
+	for _, rule := range rules {
+		_, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule)
+	}

	return nil
}

func (p *KubeadmProvisioner) applyBaseSecurity(ctx context.Context, cluster *domain.Cluster, masterIP string) error {
	exec, err := p.getExecutor(ctx, cluster, masterIP)
	if err != nil {
		return err
	}

	p.logger.Info("applying base security manifests", "ip", masterIP, "isolation", cluster.NetworkIsolation)

	var securityManifests string
	if cluster.NetworkIsolation {
		// Apply a default-deny ingress policy for the default namespace
		securityManifests = `
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
`
	}
	if securityManifests != "" {
		manifestFile := "/tmp/base-security.yaml"
		_, _ = exec.Run(ctx, fmt.Sprintf("cat <<EOF > %s\n%s\nEOF", manifestFile, securityManifests))
		_, err = exec.Run(ctx, fmt.Sprintf("%s apply -f %s", kubectlBase, manifestFile))
		return err

Copilot AI Feb 4, 2026


ensureClusterSecurityGroup ignores errors from AddRule (", _ = ..."), so the security group may be created without required rules and provisioning will continue silently. Return/log the error and fail fast (or at least warn) so misconfigured clusters don’t end up unreachable. Also consider whether opening SSH (22) and the API server (6443) to 0.0.0.0/0 is intended; if not, restrict to a configurable admin CIDR or VPC CIDR by default.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (18)
internal/handlers/event_handler_test.go (1)

33-34: ⚠️ Potential issue | 🟡 Minor

Potential race condition with gin.SetMode in parallel tests.

gin.SetMode(gin.TestMode) modifies global state in the gin package. With t.Parallel() enabled, multiple tests calling setupEventHandlerTest concurrently will race on this global variable, potentially triggering race detector warnings.

Consider moving gin.SetMode(gin.TestMode) to a TestMain function or using sync.Once to ensure it's called only once.

🛠️ Proposed fix using sync.Once
+var ginModeOnce sync.Once
+
 func setupEventHandlerTest(_ *testing.T) (*mockEventService, *EventHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
+	ginModeOnce.Do(func() {
+		gin.SetMode(gin.TestMode)
+	})
 	svc := new(mockEventService)
 	handler := NewEventHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }

Also add "sync" to the imports.

Also applies to: 41-42

internal/handlers/instance_type_handler_test.go (1)

29-35: ⚠️ Potential issue | 🟠 Major

Data race: gin.SetMode is called after t.Parallel().

gin.SetMode(gin.TestMode) modifies a package-level global variable. When this test runs in parallel with other gin tests that also call gin.SetMode, you'll encounter a data race (detectable with -race flag).

Move gin mode initialization to TestMain so it runs once before any parallel tests execute:

func TestMain(m *testing.M) {
    gin.SetMode(gin.TestMode)
    os.Exit(m.Run())
}

Then remove line 31 from this test and any other parallel tests in the file.

internal/repositories/postgres/cron_repo_unit_test.go (1)

14-41: ⚠️ Potential issue | 🟡 Minor

Convert these tests to table-driven style for better maintainability.

The tests follow a repetitive pattern with single cases each. Consider refactoring with table-driven structure using cases slice and t.Run for test organization. Apply consistently to TestCronRepository_CreateJob, TestCronRepository_GetJobByID, and TestCronRepository_ListJobs.

Example refactor (apply same pattern to other tests)
 func TestCronRepository_CreateJob(t *testing.T) {
 	t.Parallel()
-	mock, err := pgxmock.NewPool()
-	assert.NoError(t, err)
-	defer mock.Close()
-
-	repo := NewPostgresCronRepository(mock)
-	job := &domain.CronJob{
-		ID:            uuid.New(),
-		UserID:        uuid.New(),
-		Name:          "test-job",
-		Schedule:      "* * * * *",
-		TargetURL:     "http://test",
-		TargetMethod:  "POST",
-		TargetPayload: "{}",
-		Status:        domain.CronStatusActive,
-		NextRunAt:     nil,
-		CreatedAt:     time.Now(),
-		UpdatedAt:     time.Now(),
-	}
-
-	mock.ExpectExec("INSERT INTO cron_jobs").
-		WithArgs(job.ID, job.UserID, job.Name, job.Schedule, job.TargetURL, job.TargetMethod, job.TargetPayload, job.Status, job.NextRunAt, job.CreatedAt, job.UpdatedAt).
-		WillReturnResult(pgxmock.NewResult("INSERT", 1))
-
-	err = repo.CreateJob(context.Background(), job)
-	assert.NoError(t, err)
+	cases := []struct {
+		name string
+		job  *domain.CronJob
+	}{
+		{
+			name: "active-job",
+			job: &domain.CronJob{
+				ID:            uuid.New(),
+				UserID:        uuid.New(),
+				Name:          "test-job",
+				Schedule:      "* * * * *",
+				TargetURL:     "http://test",
+				TargetMethod:  "POST",
+				TargetPayload: "{}",
+				Status:        domain.CronStatusActive,
+				NextRunAt:     nil,
+				CreatedAt:     time.Now(),
+				UpdatedAt:     time.Now(),
+			},
+		},
+	}
+
+	for _, tc := range cases {
+		tc := tc
+		t.Run(tc.name, func(t *testing.T) {
+			mock, err := pgxmock.NewPool()
+			assert.NoError(t, err)
+			defer mock.Close()
+
+			repo := NewPostgresCronRepository(mock)
+			mock.ExpectExec("INSERT INTO cron_jobs").
+				WithArgs(tc.job.ID, tc.job.UserID, tc.job.Name, tc.job.Schedule, tc.job.TargetURL, tc.job.TargetMethod, tc.job.TargetPayload, tc.job.Status, tc.job.NextRunAt, tc.job.CreatedAt, tc.job.UpdatedAt).
+				WillReturnResult(pgxmock.NewResult("INSERT", 1))
+
+			err = repo.CreateJob(context.Background(), tc.job)
+			assert.NoError(t, err)
+		})
+	}
 }
internal/handlers/ws/ws_test.go (1)

73-89: ⚠️ Potential issue | 🟡 Minor

Avoid silent Close errors (and double-close) in the websocket lifecycle.

_ = conn.Close() hides failures and makes it easy to miss shutdown issues. Prefer t.Cleanup with error checks and avoid double-closing.

🔧 Suggested fix
  conn, _, err := dialer.Dial(wsURL, nil)
  assert.NoError(t, err)
- defer func() { _ = conn.Close() }()
+ t.Cleanup(func() {
+ 	if conn != nil {
+ 		assert.NoError(t, conn.Close())
+ 	}
+ })

  time.Sleep(100 * time.Millisecond)
  assert.Equal(t, 1, hub.ClientCount())
@@
- _ = conn.Close()
+ assert.NoError(t, conn.Close())
+ conn = nil

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

internal/repositories/postgres/lifecycle_repo_unit_test.go (1)

47-48: ⚠️ Potential issue | 🟡 Minor

Add ExpectationsWereMet() assertions to verify mock expectations were fulfilled in all test cases.

Tests using pgxmock must verify that expected database calls were actually executed. Without these assertions, tests can pass even if the mocked queries are never invoked.

✅ Proposed fix
-    assert.NoError(t, err)
+    assert.NoError(t, err)
+    assert.NoError(t, mock.ExpectationsWereMet())

Also applies to: lines 63-66 (Get test) and 79-80 (Delete test)

internal/core/services/cron_worker_test.go (1)

57-71: ⚠️ Potential issue | 🟡 Minor

Avoid fixed localhost ports in parallel tests.
Port 12345 on localhost may already be bound by another process or a parallel test, so the request can unexpectedly succeed. Use a closed httptest.Server (or a mocked HTTP client) to guarantee the failure deterministically.

🔧 Suggested adjustment
-	// No server, request should fail
+	// Closed server ensures deterministic connection failure
+	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
+	server.Close()
 	ctx := context.Background()
 	jobID := uuid.New()
 	job := &domain.CronJob{
@@
-		TargetURL:    "http://localhost:12345/unreachable",
+		TargetURL:    server.URL,
 		Schedule:     "* * * * *",
 	}
internal/core/services/lb_worker_test.go (1)

219-259: ⚠️ Potential issue | 🟡 Minor

Avoid real network calls in processHealthChecks test.
The default dialer can hit the host network and become flaky (e.g., if the port is open). Inject mockDialer to make the test deterministic.

🔧 Suggested fix
 	proxy := new(MockLBProxyAdapter)
 	worker := NewLBWorker(lbRepo, instRepo, proxy)
+	worker.dialer = &mockDialer{err: fmt.Errorf("dial failed")}

As per coding guidelines: Do not test with real external dependencies in unit tests - use mocks instead.

internal/repositories/libvirt/adapter_volume_test.go (2)

99-139: ⚠️ Potential issue | 🟠 Major

Same race condition concern applies here.

TestRestoreVolumeSnapshotSuccess also modifies both execCommand and mkdirTemp package-level variables (lines 105-123) while running in parallel. This compounds the race condition risk with TestCreateVolumeSnapshotSuccess.


70-97: ⚠️ Potential issue | 🟠 Major

Confirmed race condition with package-level variable mutations during parallel test execution.

TestCreateVolumeSnapshotSuccess (line 71) and TestRestoreVolumeSnapshotSuccess (line 100) both use t.Parallel() while simultaneously modifying package-level variables execCommand (lines 78 and 107) and mkdirTemp (line 114). Even with defer restoration, concurrent access to these shared globals creates a data race where test state can bleed into other concurrent tests.

Remove t.Parallel() from these tests, use synchronization primitives (e.g., sync.Mutex) to protect the package-level variables, or refactor to avoid modifying globals.

internal/repositories/libvirt/adapter_unit_test.go (2)

62-74: ⚠️ Potential issue | 🟠 Major

Capture range variables before parallel subtests to avoid shared closure variable bugs.

In table-driven tests with t.Parallel(), the loop variable tt is shared across all parallel goroutines. Without capturing it, assertions may read the wrong test case data, causing test flakiness.

Apply this fix to all six locations:

Suggested fix
 for _, tt := range tests {
+	tt := tt
 	t.Run(tt.name, func(t *testing.T) {
 		t.Parallel()
 		// ...
 	})
 }

Applies to lines: 62–74, 150–161, 182–187, 233–238, 276–285, 343–354


421-433: ⚠️ Potential issue | 🟠 Major

Remove t.Parallel() from tests that mutate package-level test hooks.

These three tests override osOpen and execCommand (package-level variables defined in adapter.go) while declaring t.Parallel(). Running them in parallel with other tests can cause data races and flaky failures. The defer restoration does not prevent interference during test execution.

Affected tests:

  • TestGetInstanceLogs (line 421)
  • TestCleanupPortMappings (line 539)
  • TestGenerateCloudInitISO (line 559)

Remove the t.Parallel() call from each of these test functions.

internal/repositories/docker/fakes_test.go (1)

49-176: ⚠️ Potential issue | 🟡 Minor

Guard Calls with a mutex to avoid races under parallel tests.
Calls is a shared map updated from multiple methods; with parallelized tests or concurrent adapter code, this can trigger data races.

🔒 Suggested fix
 import (
 	"bufio"
 	"bytes"
 	"context"
 	"errors"
 	"io"
 	"net"
 	"strings"
+	"sync"
 
 	"github.com/docker/docker/api/types"
@@
 type fakeDockerClient struct {
@@
 	networkCreateErr error
 	networkRemoveErr error
 
+	callsMu sync.Mutex
 	Calls map[string]int
 }
@@
 func (f *fakeDockerClient) inc(name string) {
+	f.callsMu.Lock()
+	defer f.callsMu.Unlock()
 	if f.Calls == nil {
 		f.Calls = make(map[string]int)
 	}
 	f.Calls[name]++
 }
internal/repositories/libvirt/adapter_test.go (1)

172-210: ⚠️ Potential issue | 🟠 Major

Avoid t.Parallel() while overriding the global execCommandContext.
This test mutates a global function and can run concurrently with other tests, causing cross‑test interference or data races. Prefer keeping this test serial or guarding the override with a package‑level mutex.

🛠️ Suggested fix (serialize this test)
 func TestLBProxyAdapterOperations(t *testing.T) {
-	t.Parallel()
 	// Mock execCommandContext using a hermetic helper process
 	oldExec := execCommandContext
@@
 	t.Run("DeployProxy", func(t *testing.T) {
-		t.Parallel()
 		id, err := a.DeployProxy(ctx, lb, nil)
@@
 	t.Run("UpdateProxyConfig", func(t *testing.T) {
-		t.Parallel()
 		err := a.UpdateProxyConfig(ctx, lb, nil)
@@
 	t.Run("RemoveProxy", func(t *testing.T) {
-		t.Parallel()
 		err := a.RemoveProxy(ctx, lb.ID)
internal/core/services/function_test.go (1)

23-33: ⚠️ Potential issue | 🟠 Major

Avoid real Docker dependency in unit tests (mark as integration or mock).
setupFunctionServiceTest instantiates a real Docker adapter at line 30, which couples this test suite to a Docker daemon. This file lacks an integration build tag. Please either (a) switch to a mocked ports.ComputeBackend (e.g., using testify/mock) or (b) mark this file as an integration test with a //go:build integration tag so unit test suites remain hermetic. As per coding guidelines, **/*_test.go must not test with real external dependencies—use mocks instead.

internal/repositories/libvirt/adapter_network_test.go (2)

22-103: ⚠️ Potential issue | 🟠 Major

Avoid parallelizing tests that mutate package-level exec hooks.
execCommand/lookPath are package-level and swapped in TestSetupPortForwardingSuccess, so running these tests in parallel can race and flake. Please run these tests serially (or refactor to inject exec functions per adapter).

Proposed fix (serialize the affected tests)
 func TestSetupPortForwardingSuccess(t *testing.T) {
-	t.Parallel()
+	// Keep serial: mutates package-level exec hooks
 	// Mock execCommand
 	oldExec := execCommand
 	defer func() { execCommand = oldExec }()
@@
 func TestSetupPortForwardingErrors(t *testing.T) {
-	t.Parallel()
+	// Keep serial: depends on package-level exec hooks
 	m := new(MockLibvirtClient)
@@
 func TestConfigureIptablesNoPath(t *testing.T) {
-	t.Parallel()
+	// Keep serial: depends on package-level exec hooks
 	// This test depends on environment, but we can verify it doesn't crash
 	m := new(MockLibvirtClient)

105-126: ⚠️ Potential issue | 🟠 Major

Don’t run parallel subtests while sharing the same mock/adapter.
The subtests share MockLibvirtClient and LibvirtAdapter, and testify/mock isn’t thread-safe, so t.Parallel() here can race. Either remove parallelization or create separate mocks/adapters per subtest.

Proposed fix (run subtests sequentially)
 t.Run("DomainNotFound", func(t *testing.T) {
-	t.Parallel()
 	m.On("DomainLookupByName", ctx, "missing").Return(libvirt.Domain{}, libvirt.Error{Code: 1})
 	_, err := a.GetInstanceIP(ctx, "missing")
 	assert.Error(t, err)
 })
@@
 t.Run("NoMAC", func(t *testing.T) {
-	t.Parallel()
 	dom := libvirt.Domain{Name: "nomac"}
 	m.On("DomainLookupByName", ctx, "nomac").Return(dom, nil)
scripts/test_libvirt.go (1)

46-58: ⚠️ Potential issue | 🟡 Minor

Update log/comment to reflect the new API name.

The comment and error message still refer to CreateInstance, which no longer matches the call site. Keeping this consistent avoids confusion when running the script.

Suggested fix
-// Since we implement a "create empty volume" fallback in CreateInstance, it should at least define the VM.
+// Since we implement a "create empty volume" fallback in LaunchInstanceWithOptions, it should at least define the VM.
...
-	fmt.Printf("CreateInstance failed: %v\n", err)
+	fmt.Printf("LaunchInstanceWithOptions failed: %v\n", err)
internal/repositories/k8s/ha.go (1)

31-49: ⚠️ Potential issue | 🟠 Major

Make the LB IP wait loop context-aware and eliminate magic numbers.

time.Sleep ignores cancellation, so the wait can block even if ctx is done. Also, port/retry/interval literals should be named constants.

Suggested fix
+const (
+	apiServerPort   = 6443
+	lbIPRetries     = 10
+	lbIPRetryDelay  = 2 * time.Second
+)
...
-	lb, err := p.lbSvc.Create(ctx, lbName, cluster.VpcID, 6443, "round-robin", cluster.ID.String())
+	lb, err := p.lbSvc.Create(ctx, lbName, cluster.VpcID, apiServerPort, "round-robin", cluster.ID.String())
...
-	for i := 0; i < 10; i++ {
+	for i := 0; i < lbIPRetries; i++ {
 		if lb.IP != "" {
 			lbIP = lb.IP
 			break
 		}
-		time.Sleep(2 * time.Second)
+		select {
+		case <-ctx.Done():
+			return fmt.Errorf("timeout waiting for LB IP: %w", ctx.Err())
+		case <-time.After(lbIPRetryDelay):
+		}
 		if lbNew, err := p.lbSvc.Get(ctx, lb.ID); err == nil {
 			lb = lbNew
 		}
 	}

As per coding guidelines: Propagate context.Context to all blocking calls; Do not use magic numbers - use named constants instead.

🤖 Fix all issues with AI agents
In `@internal/core/services/instance.go`:
- Around line 152-182: The code silently ignores invalid opts.NetworkID and
fails to persist CreateInstanceOptions (Ports, VolumeBinds, Env, Cmd, CPULimit,
MemoryLimit, DiskLimit) so Provision later reads empty inst fields; validate
opts.NetworkID (return error on invalid uuid when opts.NetworkID != "") and
persist the options either by adding those fields to the domain.Instance before
calling s.repo.Create (e.g., populate inst.Ports, inst.VolumeBinds, inst.Env,
inst.Cmd, inst.CPULimit, inst.MemoryLimit, inst.DiskLimit), or extend
domain.ProvisionJob to carry the full CreateInstanceOptions and populate
ProvisionJob.UserData/fields from opts (or simply attach the entire
CreateInstanceOptions to ProvisionJob) so the Provision handler (which reads
inst.Ports and Env/Cmd) receives the real values instead of nil; ensure
InstanceType is set appropriately (not hardcoded "custom") or documented/handled
elsewhere.

In `@internal/repositories/docker/adapter.go`:
- Around line 208-234: handleUserData (and likewise processCloudConfig) spawns a
goroutine that calls a.Exec with context.Background(), which bypasses the parent
ctx, prevents cancellation propagation, and causes concurrent Exec calls that
race on shared test state; change the background execution to a synchronous call
using a derived context with a timeout from the provided ctx (e.g.,
ctxWithTimeout, cancel := context.WithTimeout(ctx, <reasonableDuration>); defer
cancel()), call a.Exec(ctxWithTimeout, containerID, ...) directly instead of
launching a goroutine, and propagate errors back to the caller (or log them
consistently via a.logger) so Exec is not invoked concurrently from multiple
goroutines and respects cancellation; apply the same change in
processCloudConfig and ensure any callers handle the returned error.

In `@internal/repositories/k8s/mocks_test.go`:
- Around line 89-112: There is a duplicate MockSecretService type and method
implementations (e.g., CreateSecret, GetSecret, Decrypt) which will cause a
duplicate type error; remove the entire MockSecretService definition and its
methods from this file and rely on the single shared MockSecretService already
defined elsewhere in the package, updating any tests in this file to use that
existing MockSecretService (and its m.Called stubs) instead of declaring a new
one.

In `@internal/repositories/k8s/provisioner.go`:
- Around line 155-160: The current fallback stores plaintext kubeconfig when
encryption fails (setting cluster.KubeconfigEncrypted = kubeconfig), which is a
security risk; update the error path in the block that handles
encryptedKubeconfig (the branch using encryptedKubeconfig, kubeconfig and
p.logger.Error) to refuse to persist plaintext: log the error via p.logger.Error
as-is, and return or propagate an error from the surrounding function instead of
assigning the unencrypted kubeconfig to cluster.KubeconfigEncrypted so the
caller can handle the failure and no plaintext kubeconfig is saved.
🟠 Major comments (21)
internal/handlers/dns_handler_test.go-26-28 (1)

26-28: ⚠️ Potential issue | 🟠 Major

Race condition: gin.SetMode() modifies global state.

All test functions call gin.SetMode(gin.TestMode) after t.Parallel(). Since gin.SetMode() writes to a package-level global variable without synchronization, running these tests in parallel creates a data race that the Go race detector will flag.

Move the call to a TestMain function so it executes once before any parallel tests run:

🔒️ Proposed fix: Add TestMain and remove per-test gin.SetMode() calls

Add this function to the file:

func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}

Then remove gin.SetMode(gin.TestMode) from each test function:

 func TestCreateZoneHandler(t *testing.T) {
 	t.Parallel()
-	gin.SetMode(gin.TestMode)

 	t.Run("success", func(t *testing.T) {

Apply the same removal to all other test functions in this file.

internal/handlers/identity_handler_test.go-74-78 (1)

74-78: ⚠️ Potential issue | 🟠 Major

Parallel tests race on Gin's unprotected global mode setting.

gin.SetMode writes to package-level globals without synchronization, making it fundamentally not thread-safe. When t.Parallel() tests call setupIdentityHandlerTest, they concurrently mutate Gin's mode state, causing data races and flaky failures under -race. Move gin.SetMode(gin.TestMode) to an init() function (or TestMain) at the test package level so it executes once before parallel tests begin, then remove the call from setupIdentityHandlerTest.

✅ Suggested fix
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
 	"testing"

 	"github.com/gin-gonic/gin"
 	"github.com/google/uuid"
 	"github.com/poyrazk/thecloud/internal/core/domain"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/mock"
 )
+
+func init() {
+	gin.SetMode(gin.TestMode)
+}

 const (
 	authKeysPath = "/auth/keys"
 	testKeyName  = "Test Key"
 )
@@
 func setupIdentityHandlerTest(userID uuid.UUID) (*mockIdentityService, *IdentityHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockIdentityService)
 	handler := NewIdentityHandler(svc)
 	r := gin.New()
internal/handlers/database_handler_test.go-78-81 (1)

78-81: ⚠️ Potential issue | 🟠 Major

Move gin.SetMode to TestMain to prevent race conditions with t.Parallel.

gin.SetMode is a global operation in gin-gonic; calling it from parallel tests races and causes -race failures. Set the mode once in TestMain and remove it from setupDatabaseHandlerTest.

🔧 Suggested fix (set mode once in TestMain)
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
+	"os"
 	"testing"

 	"github.com/gin-gonic/gin"
@@
 )
 
+func TestMain(m *testing.M) {
+	gin.SetMode(gin.TestMode)
+	os.Exit(m.Run())
+}
+
 func setupDatabaseHandlerTest(_ *testing.T) (*mockDatabaseService, *DatabaseHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockDatabaseService)
 	handler := NewDatabaseHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }
internal/handlers/cache_handler_test.go-82-84 (1)

82-84: ⚠️ Potential issue | 🟠 Major

Parallel tests race on gin.SetMode global state.

t.Parallel() now allows concurrent calls to setupCacheHandlerTest, but gin.SetMode mutates global state without synchronization and will trigger data races under -race. Move it to a package init() so it runs once at startup.

🔧 Proposed fix (move SetMode to init)
+func init() {
+	gin.SetMode(gin.TestMode)
+}
+
 func setupCacheHandlerTest(_ *testing.T) (*mockCacheService, *CacheHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockCacheService)
 	handler := NewCacheHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }
internal/handlers/secret_handler_test.go-85-86 (1)

85-86: ⚠️ Potential issue | 🟠 Major

Parallel tests race on gin.SetMode—move to TestMain

t.Parallel() runs all 8 tests concurrently. Each calls setupSecretHandlerTest, which invokes gin.SetMode(gin.TestMode). Gin's SetMode writes to unsynchronized package-level variables without mutex protection, causing a data race. Set the mode once in TestMain before tests run.

🔧 Suggested fix
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
+	"os"
 	"testing"

 	"github.com/gin-gonic/gin"
 	"github.com/google/uuid"
 	"github.com/poyrazk/thecloud/internal/core/domain"
 	"github.com/poyrazk/thecloud/internal/errors"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/mock"
 )
+
+func TestMain(m *testing.M) {
+	gin.SetMode(gin.TestMode)
+	os.Exit(m.Run())
+}
 
 func setupSecretHandlerTest(_ *testing.T) (*mockSecretService, *SecretHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockSecretService)
 	handler := NewSecretHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }

Also applies to: 109-110, 127-128, 146-147, 164-165, 215-216, 239-240, 251-252

internal/handlers/lb_handler_test.go-77-78 (1)

77-78: ⚠️ Potential issue | 🟠 Major

Guard gin.SetMode from concurrent calls under t.Parallel().
setupLBHandlerTest mutates the unsynchronized global gin.Mode; parallel tests trigger race conditions. Use sync.Once or move to TestMain.

🔧 Suggested fix (sync.Once)
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"testing"
 )
 
+var ginModeOnce sync.Once
+
 func setupLBHandlerTest(_ *testing.T) (*mockLBService, *LBHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
+	ginModeOnce.Do(func() { gin.SetMode(gin.TestMode) })
 	svc := new(mockLBService)
 	handler := NewLBHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }

Applies to lines: 77–78, 103–104, 121–122, 140–141, 158–159, 183–184, 202–203

internal/handlers/queue_handler_test.go-90-91 (1)

90-91: ⚠️ Potential issue | 🟠 Major

Move gin.SetMode outside parallel test execution.
gin.SetMode() is not thread-safe and writes to package globals without synchronization. Calling it from setupQueueHandlerTest() in multiple parallel tests causes a data race. Set the mode once before any parallel tests run using TestMain or sync.Once.

🔧 Suggested fix (TestMain)
+func TestMain(m *testing.M) {
+	gin.SetMode(gin.TestMode)
+	code := m.Run()
+	os.Exit(code)
+}
+
 func setupQueueHandlerTest(_ *testing.T) (*mockQueueService, *QueueHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockQueueService)
internal/repositories/k8s/security.go-24-30 (1)

24-30: ⚠️ Potential issue | 🟠 Major

Avoid hard-coded wide-open ingress CIDRs.

Using 0.0.0.0/0 for SSH and fixed 10.0.0.0/8 for control-plane ports can expose clusters or misconfigure non-10/8 networks. Consider deriving CIDRs from the cluster/VPC or a configurable allowlist.
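A minimal sketch of the configurable-allowlist idea; `ClusterConfig` and the precedence order (admin CIDR, then VPC CIDR, then wide-open as a logged last resort) are illustrative assumptions, not the project's config schema:

```go
package main

import "fmt"

// ClusterConfig stands in for operator-supplied settings (env/flag/DB).
type ClusterConfig struct {
	AdminCIDR string // optional admin allowlist, e.g. "203.0.113.0/24"
	VPCCIDR   string // the cluster's actual VPC range
}

// ingressCIDR picks the narrowest configured source range instead of
// hard-coding 0.0.0.0/0 or assuming 10.0.0.0/8.
func ingressCIDR(cfg ClusterConfig) string {
	if cfg.AdminCIDR != "" {
		return cfg.AdminCIDR
	}
	if cfg.VPCCIDR != "" {
		return cfg.VPCCIDR
	}
	return "0.0.0.0/0" // last resort; should be logged as a warning
}

func main() {
	fmt.Println(ingressCIDR(ClusterConfig{VPCCIDR: "10.20.0.0/16"}))
}
```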

internal/repositories/k8s/security.go-32-34 (1)

32-34: ⚠️ Potential issue | 🟠 Major

Handle AddRule failures instead of dropping them.

Ignoring these errors can leave the cluster without required ingress rules while still returning success.

🐛 Suggested fix
-	for _, rule := range rules {
-		_, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule)
-	}
+	for _, rule := range rules {
+		if _, err := p.sgSvc.AddRule(ctx, newSG.ID, rule); err != nil {
+			return fmt.Errorf("failed to add security group rule: %w", err)
+		}
+	}
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.
internal/core/services/health_test.go-3-6 (1)

3-6: ⚠️ Potential issue | 🟠 Major

Avoid real Docker dependency in this unit test.
docker.NewDockerAdapter(...) requires a running Docker daemon, which makes this a brittle unit test. Prefer a mock/noop compute backend here, or mark the test as integration-only if it must hit Docker.

Proposed fix (swap to noop backend)
 import (
 	"context"
-	"log/slog"
 	"testing"
 
 	"github.com/poyrazk/thecloud/internal/core/services"
-	"github.com/poyrazk/thecloud/internal/repositories/docker"
 	"github.com/poyrazk/thecloud/internal/repositories/noop"
 	"github.com/stretchr/testify/assert"
-	"github.com/stretchr/testify/require"
 )
@@
-	compute, err := docker.NewDockerAdapter(slog.Default())
-	require.NoError(t, err)
+	compute := noop.NewNoopComputeBackend()
As per coding guidelines: **/*_test.go: Do not test with real external dependencies in unit tests - use mocks instead.

Also applies to: 15-22

internal/repositories/docker/adapter_unit_test.go-393-405 (1)

393-405: ⚠️ Potential issue | 🟠 Major

Fix data race in TestDockerAdapterLaunchWithUserData (CI reports race).

The background goroutine mutates cli.Calls while the test reads it after a Sleep, which matches the race reported in CI. Use a synchronization mechanism and make the call counters concurrency-safe (mutex/atomic) instead of time.Sleep.

Example approach
-	time.Sleep(50 * time.Millisecond)
-
-	require.Equal(t, 2, cli.Calls["ContainerExecCreate"], "Expected dual Stage (delivery + execution) Exec invocations")
-	require.Equal(t, 2, cli.Calls["ContainerExecAttach"], "Expected associated ExecAttach calls for I/O")
-	require.Equal(t, 2, cli.Calls["ContainerExecInspect"], "Expected ExecInspect calls to verify termination state")
+	require.Eventually(t, func() bool {
+		return cli.CallCount("ContainerExecCreate") == 2 &&
+			cli.CallCount("ContainerExecAttach") == 2 &&
+			cli.CallCount("ContainerExecInspect") == 2
+	}, 2*time.Second, 10*time.Millisecond)

(Implement CallCount with a mutex around cli.Calls in fakeDockerClient.)

internal/repositories/k8s/templates/control_plane.yaml-26-32 (1)

26-32: ⚠️ Potential issue | 🟠 Major

Template should honor cluster version and avoid localhost endpoint.

kubernetesVersion: stable ignores the requested cluster version, and controlPlaneEndpoint set to localhost will make worker joins and external access fail on non-HA clusters. Prefer templating the version and only setting controlPlaneEndpoint when HA is enabled (or use the control-plane node IP).

Suggested fix
-      kubernetesVersion: stable
-      controlPlaneEndpoint: "{{ if .HAEnabled }}{{ .LBAddress }}{{ else }}localhost{{ end }}:6443"
+      kubernetesVersion: "{{ .Version }}"
+      {{- if .HAEnabled }}
+      controlPlaneEndpoint: "{{ .LBAddress }}:6443"
+      {{- end }}
internal/repositories/k8s/node_executor.go-72-78 (1)

72-78: ⚠️ Potential issue | 🟠 Major

Add configurable host key verification to SSHExecutor.

The hardcoded ssh.InsecureIgnoreHostKey() at line 77 disables host key verification and enables MITM attacks. Either refactor SSHExecutor to use the existing pkg/sshutil.Client (which provides configurable HostKeyCallback), or add a HostKeyCallback field to SSHExecutor to allow injection of knownhosts.New(...) or ssh.FixedHostKey(...) during construction.

internal/repositories/k8s/node_executor.go-81-103 (1)

81-103: ⚠️ Potential issue | 🟠 Major

Use the existing pkg/sshutil.Client instead of reimplementing SSH connection logic.

The codebase already provides pkg/sshutil.Client.Run(ctx context.Context, ...) which properly propagates context via net.Dialer{}.DialContext() and uses ssh.NewClientConn(). The current implementation duplicates this and fails to propagate ctx to ssh.Dial() and session.Run(). Additionally, silently discarding the close errors from client.Close() and session.Close() violates the "no silent failures" guideline.

Either refactor to use pkg/sshutil.Client or adopt its pattern: use net.Dialer{}.DialContext(ctx, ...) to establish the TCP connection, then ssh.NewClientConn() to wrap it. Also log or return close errors instead of discarding them.

internal/core/services/instance.go-412-414 (1)

412-414: ⚠️ Potential issue | 🟠 Major

Handle audit log errors instead of dropping them.
Line 412 ignores audit logging failures. Please log or propagate the error so audit failures aren’t silently lost.

🛠️ Suggested fix
-	_ = s.auditSvc.Log(ctx, inst.UserID, "instance.start", "instance", inst.ID.String(), map[string]interface{}{
-		"name": inst.Name,
-	})
+	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.start", "instance", inst.ID.String(), map[string]interface{}{
+		"name": inst.Name,
+	}); err != nil {
+		s.logger.Warn("failed to write audit log", "instance_id", inst.ID, "error", err)
+	}

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

internal/repositories/postgres/cluster_repo.go-205-210 (1)

205-210: ⚠️ Potential issue | 🟠 Major

Scanning nullable columns into non-nullable string fields will fail for legacy clusters.

Lines 205-210 scan kubeconfig_encrypted, ssh_private_key_encrypted, join_token, and ca_cert_hash directly into non-nullable string fields. Migration 081 adds these columns as nullable TEXT without defaults. For clusters created before this migration, these columns will be NULL, causing pgx scan errors and breaking Get() and ListByUserID() operations.

Use COALESCE() in the SELECT statements to provide empty string defaults:

SELECT 
    id, user_id, vpc_id, name, version, status, worker_count, ha_enabled, 
    network_isolation, pod_cidr, service_cidr, api_server_lb_address, 
    COALESCE(kubeconfig_encrypted, ''), COALESCE(ssh_private_key_encrypted, ''), COALESCE(join_token, ''), 
    token_expires_at, COALESCE(ca_cert_hash, ''), job_id, created_at, updated_at
FROM clusters

Alternatively, change the struct fields to *string and scan into nullable types.

internal/repositories/k8s/provisioner.go-98-100 (1)

98-100: ⚠️ Potential issue | 🟠 Major

Silent failure on repo update violates coding guidelines.

Line 99 discards the error from p.repo.Update. If the update fails, the cluster status won't persist but provisioning will report success, leading to inconsistent state.

🔧 Proposed fix
 	cluster.Status = domain.ClusterStatusRunning
-	_ = p.repo.Update(ctx, cluster)
-	return nil
+	return p.repo.Update(ctx, cluster)

As per coding guidelines: "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()"

internal/repositories/k8s/provisioner.go-143-143 (1)

143-143: ⚠️ Potential issue | 🟠 Major

Multiple silent failures throughout the file.

Several _ = p.repo.Update(...) and _ = p.repo.AddNode(...) calls discard errors:

  • Line 143: Update after storing control plane IP
  • Line 162: Update after kubeconfig storage
  • Line 204: AddNode for worker nodes
  • Lines 336-337: TerminateInstance and DeleteNode in Deprovision

These can lead to data inconsistency without any indication of failure.

🔧 Suggested approach

At minimum, log the errors. For critical paths, propagate them:

-	_ = p.repo.Update(ctx, cluster)
+	if err := p.repo.Update(ctx, cluster); err != nil {
+		p.logger.Error("failed to update cluster", "cluster_id", cluster.ID, "error", err)
+	}

As per coding guidelines: "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()"

Also applies to: 162-162, 204-204, 336-337

internal/repositories/k8s/provisioner.go-261-274 (1)

261-274: ⚠️ Potential issue | 🟠 Major

waitForIP ignores context cancellation.

The polling loop uses time.Sleep without checking ctx.Done(). This can block for up to 150 seconds even if the context is cancelled, wasting resources and delaying shutdown.

🔧 Proposed fix using select with context
 func (p *KubeadmProvisioner) waitForIP(ctx context.Context, instID uuid.UUID) string {
 	for i := 0; i < 30; i++ {
+		select {
+		case <-ctx.Done():
+			return ""
+		default:
+		}
 		inst, err := p.instSvc.GetInstance(ctx, instID.String())
 		if err == nil && inst.PrivateIP != "" {
 			ip := inst.PrivateIP
 			if idx := strings.Index(ip, "/"); idx != -1 {
 				ip = ip[:idx]
 			}
 			return ip
 		}
-		time.Sleep(5 * time.Second)
+		select {
+		case <-ctx.Done():
+			return ""
+		case <-time.After(5 * time.Second):
+		}
 	}
 	return ""
 }

As per coding guidelines: "Propagate context.Context to all blocking calls"

internal/repositories/k8s/provisioner.go-166-172 (1)

166-172: ⚠️ Potential issue | 🟠 Major

Potential panic if ControlPlaneIPs is empty.

Line 169 accesses cluster.ControlPlaneIPs[0] without checking if the slice is empty. If provisionControlPlane failed to populate IPs, this will panic.

🛡️ Proposed fix
 func (p *KubeadmProvisioner) provisionWorkers(ctx context.Context, cluster *domain.Cluster) error {
 	p.logger.Info("provisioning worker nodes", "count", cluster.WorkerCount)

+	if len(cluster.ControlPlaneIPs) == 0 {
+		return p.failCluster(ctx, cluster, "no control plane IPs available for worker join", nil)
+	}
+
 	apiServer := cluster.ControlPlaneIPs[0]
 	if cluster.HAEnabled && cluster.APIServerLBAddress != nil {
 		apiServer = *cluster.APIServerLBAddress
 	}
internal/repositories/k8s/provisioner.go-456-458 (1)

456-458: ⚠️ Potential issue | 🟠 Major

Potential slice bounds panic if version string is malformed.

strings.LastIndex(v, ".") returns -1 if no dot is found, and slicing with a negative index panics at runtime.

🛡️ Proposed fix
 	v := strings.TrimPrefix(version, "v")
-	minor := v[:strings.LastIndex(v, ".")]
+	lastDot := strings.LastIndex(v, ".")
+	if lastDot == -1 {
+		return errors.New(errors.InvalidInput, "invalid version format: "+version)
+	}
+	minor := v[:lastDot]
🟡 Minor comments (18)
internal/handlers/accounting_handler_test.go-45-47 (1)

45-47: ⚠️ Potential issue | 🟡 Minor

Potential data race with gin.SetMode in parallel tests.

gin.SetMode() modifies global state and is not thread-safe. When t.Parallel() is enabled, concurrent tests calling SetMode can race. Since all tests set gin.TestMode, the practical impact is limited, but it's still a race condition.

Consider moving gin.SetMode(gin.TestMode) to a TestMain function to set the mode once before any tests run:

func TestMain(m *testing.M) {
    gin.SetMode(gin.TestMode)
    os.Exit(m.Run())
}
internal/repositories/postgres/lifecycle_repo_unit_test.go-29-31 (1)

29-31: ⚠️ Potential issue | 🟡 Minor

Handle pgxmock.NewPool() errors instead of discarding them with the blank identifier.

Ignoring errors silently can mask setup failures and lead to misleading test results. Fail the test immediately when pool creation fails.

Proposed fix
-    mock, _ := pgxmock.NewPool()
+    mock, err := pgxmock.NewPool()
+    if err != nil {
+        t.Fatalf("pgxmock.NewPool: %v", err)
+    }

Also applies to lines 53-55 and 70-72.

internal/handlers/health_handler_test.go-40-42 (1)

40-42: ⚠️ Potential issue | 🟡 Minor

Same gin.SetMode() race concern applies here.

This test also calls gin.SetMode() after t.Parallel(), creating the same potential race condition as TestHealthHandler_Live.

internal/handlers/health_handler_test.go-25-27 (1)

25-27: ⚠️ Potential issue | 🟡 Minor

Potential data race with gin.SetMode() in parallel tests.

gin.SetMode() modifies a package-level global variable. When this test runs concurrently with other tests that also call gin.SetMode(), a data race can occur. Consider one of these approaches:

  1. Move gin.SetMode(gin.TestMode) to a TestMain function that runs once before all tests.
  2. Use sync.Once to set the mode only once.
  3. Remove t.Parallel() if parallel execution isn't critical for this test.
internal/core/services/instance_test.go-188-189 (1)

188-189: ⚠️ Potential issue | 🟡 Minor

Don’t ignore Docker adapter init errors.

If adapter creation fails, the test can proceed with a nil compute backend and panic later. Handle the error explicitly.

🐛 Suggested fix
-	compute, _ := docker.NewDockerAdapter(slog.Default())
+	compute, err := docker.NewDockerAdapter(slog.Default())
+	require.NoError(t, err)
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.
internal/repositories/k8s/lifecycle_test.go-145-148 (1)

145-148: ⚠️ Potential issue | 🟡 Minor

Handle ListInstances errors.

Dropping the error can turn real failures into false positives.

🐛 Suggested fix
-	instances, _ := instSvc.ListInstances(ctx)
+	instances, err := instSvc.ListInstances(ctx)
+	require.NoError(t, err)
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.
internal/core/services/system_integration_test.go-111-128 (1)

111-128: ⚠️ Potential issue | 🟡 Minor

Don’t ignore instance-type creation errors.

If these inserts fail, provisioning will later fail with a less obvious root cause.

🐛 Suggested fix
-	_, _ = itRepo.Create(ctx, customType)
+	_, err = itRepo.Create(ctx, customType)
+	require.NoError(t, err)
@@
-	_, _ = itRepo.Create(ctx, sys1)
+	_, err = itRepo.Create(ctx, sys1)
+	require.NoError(t, err)
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.
internal/core/services/system_integration_test.go-71-74 (1)

71-74: ⚠️ Potential issue | 🟡 Minor

Handle network creation errors in setup.

Ignoring the error can mask Docker networking failures and lead to misleading downstream errors.

🐛 Suggested fix
-	if compute.Type() == "docker" {
-		_, _ = compute.CreateNetwork(ctx, "cloud-network")
-	}
+	if compute.Type() == "docker" {
+		_, err = compute.CreateNetwork(ctx, "cloud-network")
+		require.NoError(t, err)
+	}
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.
internal/repositories/k8s/lifecycle_test.go-199-205 (1)

199-205: ⚠️ Potential issue | 🟡 Minor

Surface unexpected queue payloads in the sync task queue.

Returning nil for unknown payloads can hide wiring issues and make the test pass when provisioning never ran.

🐛 Suggested fix
-	return nil
+	return fmt.Errorf("unexpected payload type %T", payload)
internal/repositories/postgres/migrations/081_fix_clusters_table.up.sql-11-18 (1)

11-18: ⚠️ Potential issue | 🟡 Minor

Scope column-existence checks to the current schema.
Without filtering by table_schema, a similarly named table in another schema can make the check pass and the subsequent UPDATE fail in the active schema.

🛠️ Suggested fix
-    IF EXISTS (SELECT 1 FROM information_schema.columns WHERE table_name='clusters' AND column_name='ssh_key') THEN
+    IF EXISTS (
+        SELECT 1 FROM information_schema.columns
+        WHERE table_name = 'clusters'
+          AND column_name = 'ssh_key'
+          AND table_schema = current_schema()
+    ) THEN
         UPDATE clusters SET ssh_private_key_encrypted = ssh_key WHERE ssh_private_key_encrypted IS NULL;
     END IF;
-    IF EXISTS (SELECT 1 FROM information_schema.columns WHERE table_name='clusters' AND column_name='kubeconfig') THEN
+    IF EXISTS (
+        SELECT 1 FROM information_schema.columns
+        WHERE table_name = 'clusters'
+          AND column_name = 'kubeconfig'
+          AND table_schema = current_schema()
+    ) THEN
         UPDATE clusters SET kubeconfig_encrypted = kubeconfig WHERE kubeconfig_encrypted IS NULL;
     END IF;
internal/repositories/noop/adapters.go-165-174 (1)

165-174: ⚠️ Potential issue | 🟡 Minor

Make nil returns consistent with other noop methods in the same file.

The methods GetByID, ListByUserID, and ListAll return (nil, nil), but similar methods in the same file like GetNodes (line 179) and ListLogs (line 159) return empty slices instead. This inconsistency can lead to nil dereferences in callers (e.g., cluster_worker.go calls GetByID and only checks err, not whether the cluster is nil). Return empty slices or stub objects to match the pattern established elsewhere in this file.

Suggested fix
 func (r *NoopClusterRepository) GetByID(ctx context.Context, id uuid.UUID) (*domain.Cluster, error) {
-	return nil, nil
+	return &domain.Cluster{ID: id}, nil
 }
 func (r *NoopClusterRepository) ListByUserID(ctx context.Context, userID uuid.UUID) ([]*domain.Cluster, error) {
-	return nil, nil
+	return []*domain.Cluster{}, nil
 }
 func (r *NoopClusterRepository) ListAll(ctx context.Context) ([]*domain.Cluster, error) {
-	return nil, nil
+	return []*domain.Cluster{}, nil
 }
internal/handlers/security_group_handler_test.go-83-85 (1)

83-85: ⚠️ Potential issue | 🟡 Minor

Potential race condition with gin.SetMode in parallel tests.

gin.SetMode() modifies package-level globals without synchronization and is not goroutine-safe. Calling it from multiple parallel tests (a pattern that affects 38+ test files in internal/handlers) causes data races. Remove gin.SetMode(gin.TestMode) from the individual test functions and set the mode once globally before any tests run.

Add a TestMain function to internal/handlers/security_group_handler_test.go (and all other affected test files) to set the mode once:

func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}

Then remove the gin.SetMode(gin.TestMode) calls from all individual test functions.

internal/repositories/docker/adapter_test.go-76-80 (1)

76-80: ⚠️ Potential issue | 🟡 Minor

Potential flaky assertion in Eventually block.

assert.ObjectsAreEqual returns a bool but doesn't produce test output on failure. The string comparison expectedContent+"\n" assumes the file ends with a newline, which depends on echo behavior. Consider using direct string comparison with trimming for robustness:

🔧 Suggested fix
 		require.Eventually(t, func() bool {
 			out, err := adapter.Exec(ctx, id, []string{"cat", "/tmp/bootstrap_test.txt"})
-			return err == nil && assert.ObjectsAreEqual(expectedContent+"\n", out)
+			return err == nil && strings.TrimSpace(out) == expectedContent
 		}, 10*time.Second, 500*time.Millisecond, "Bootstrap script failed to execute or write file")

You'll need to add "strings" to imports.

internal/handlers/instance_handler_test.go-112-114 (1)

112-114: ⚠️ Potential issue | 🟡 Minor

Guard global gin state now that tests are parallel.

setupInstanceHandlerTest calls gin.SetMode, which mutates global state. With t.Parallel() added broadly, consider moving gin.SetMode(gin.TestMode) into a TestMain/init with sync.Once to avoid race detector failures.

internal/repositories/k8s/node_executor.go-39-40 (1)

39-40: ⚠️ Potential issue | 🟡 Minor

Replace readiness/timeout magic numbers with named constants.

Hard-coded poll/timeout values make tuning harder and violate the no-magic-numbers guideline.

Suggested fix
+const (
+	serviceReadyPollInterval = 500 * time.Millisecond
+	sshReadyPollInterval     = 2 * time.Second
+	sshDialTimeout           = 10 * time.Second
+)
...
-	ticker := time.NewTicker(500 * time.Millisecond)
+	ticker := time.NewTicker(serviceReadyPollInterval)
...
-		Timeout:         10 * time.Second,
+		Timeout:         sshDialTimeout,
...
-	ticker := time.NewTicker(2 * time.Second)
+	ticker := time.NewTicker(sshReadyPollInterval)

As per coding guidelines: Do not use magic numbers - use named constants instead.

Also applies to: 78-79, 110-111

internal/repositories/k8s/kubeadm_provisioner_test.go-39-41 (1)

39-41: ⚠️ Potential issue | 🟡 Minor

Make StartInstance use m.Called so expectations and errors work.

Returning nil unconditionally hides unexpected calls and prevents tests from asserting behavior.

Suggested fix
 func (m *MockInstanceService) StartInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }

As per coding guidelines: Use testify/mock for creating mock objects in tests.

internal/repositories/docker/adapter.go-73-78 (1)

73-78: ⚠️ Potential issue | 🟡 Minor

Default the logger to avoid nil deref.
Line 73-78 accepts a logger but doesn’t guard against nil; later calls to a.logger.* will panic. Consider defaulting to slog.Default() (or a no-op logger).

🛠️ Suggested fix
 func NewDockerAdapter(logger *slog.Logger) (*DockerAdapter, error) {
+	if logger == nil {
+		logger = slog.Default()
+	}
 	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
internal/repositories/k8s/provisioner.go-316-322 (1)

316-322: ⚠️ Potential issue | 🟡 Minor

Fallback uses encrypted key directly if decryption fails.

When p.secretSvc.Decrypt fails, the code silently uses cluster.SSHPrivateKeyEncrypted as the SSH key. This will either fail SSH auth (best case) or, if the field contains a plaintext key, succeed unexpectedly.

🔧 Proposed fix: Return error on decrypt failure
 	// 2. Fallback to SSH
-	decryptedKey := cluster.SSHPrivateKeyEncrypted // Fallback or if not encrypted
 	if p.secretSvc != nil && cluster.SSHPrivateKeyEncrypted != "" {
 		key, err := p.secretSvc.Decrypt(ctx, cluster.UserID, cluster.SSHPrivateKeyEncrypted)
-		if err == nil {
-			decryptedKey = key
+		if err != nil {
+			return nil, errors.Wrap(errors.Internal, "failed to decrypt SSH key", err)
 		}
+		return NewSSHExecutor(ip, defaultUser, key), nil
 	}
-
-	return NewSSHExecutor(ip, defaultUser, decryptedKey), nil
+	return nil, errors.New(errors.InvalidInput, "no SSH key available for node access")

Comment on lines +152 to +182
inst := &domain.Instance{
ID: uuid.New(),
UserID: appcontext.UserIDFromContext(ctx),
TenantID: appcontext.TenantIDFromContext(ctx),
Name: opts.Name,
Image: opts.ImageName,
Status: domain.StatusStarting,
InstanceType: "custom", // Marking as custom since we are passing raw constraints or defaults
Version: 1,
CreatedAt: time.Now(),
UpdatedAt: time.Now(),
}

if opts.NetworkID != "" {
vpcID, err := uuid.Parse(opts.NetworkID)
if err == nil {
inst.VpcID = &vpcID
}
}

if err := s.repo.Create(ctx, inst); err != nil {
return nil, err
}

// 4. Enqueue provision task with UserData
job := domain.ProvisionJob{
InstanceID: inst.ID,
UserID: inst.UserID,
TenantID: inst.TenantID,
UserData: opts.UserData,
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the instance.go file and check its structure
fd 'instance\.go' internal -t f | head -20

Repository: PoyrazK/thecloud

Length of output: 159


🏁 Script executed:

#!/bin/bash
# Search for "custom" instance type definition
rg -n '"custom"' internal --type go | head -30

Repository: PoyrazK/thecloud

Length of output: 665


🏁 Script executed:

#!/bin/bash
# Find CreateInstanceOptions and ProvisionJob definitions
rg -n 'type CreateInstanceOptions|type ProvisionJob' internal --type go -A 10

Repository: PoyrazK/thecloud

Length of output: 1974


🏁 Script executed:

#!/bin/bash
# Search for where ProvisionJob is used and how Provision function works
rg -n 'ProvisionJob|func.*Provision' internal --type go | head -40

Repository: PoyrazK/thecloud

Length of output: 5346


🏁 Script executed:

#!/bin/bash
# Get the full context of instance.go around the problematic lines
wc -l internal/core/services/instance.go

Repository: PoyrazK/thecloud

Length of output: 98


🏁 Script executed:

#!/bin/bash
# Read the LaunchInstanceWithOptions function and surrounding context
sed -n '140,200p' internal/core/services/instance.go | cat -n

Repository: PoyrazK/thecloud

Length of output: 2464


🏁 Script executed:

#!/bin/bash
# Check Instance domain struct to see what fields it has
rg -n 'type Instance struct' internal/core/domain --type go -A 20

Repository: PoyrazK/thecloud

Length of output: 2939


🏁 Script executed:

#!/bin/bash
# Find the Provision function or method that processes ProvisionJob
rg -n 'func.*Provision\(' internal/core/services --type go -A 15 | head -80

Repository: PoyrazK/thecloud

Length of output: 5550


🏁 Script executed:

#!/bin/bash
# Check if Instance struct fields could store Ports, Env, Cmd or limits
rg -n 'type Instance struct' internal/core/domain/instance.go -A 30

Repository: PoyrazK/thecloud

Length of output: 1712


🏁 Script executed:

#!/bin/bash
# Verify the exact NetworkID handling and whether errors are truly silently ignored
sed -n '152,182p' internal/core/services/instance.go | cat -n

Repository: PoyrazK/thecloud

Length of output: 1083


🏁 Script executed:

#!/bin/bash
# Check where Ports might be stored or used in Provision flow
rg -n '\.Ports|ports\[' internal/core/services/instance.go | head -20

Repository: PoyrazK/thecloud

Length of output: 115


🏁 Script executed:

#!/bin/bash
# Check if "custom" instance type is verified to exist anywhere in the codebase
rg -n 'instance.*type|InstanceType' internal/core/services/instance.go | head -30

Repository: PoyrazK/thecloud

Length of output: 617


🏁 Script executed:

#!/bin/bash
# Check the parseAndValidatePorts function and how it's used
sed -n '220,240p' internal/core/services/instance.go | cat -n

Repository: PoyrazK/thecloud

Length of output: 896


🏁 Script executed:

#!/bin/bash
# Verify how Instance.Ports field could ever be populated if opts.Ports is not stored
rg -n 'inst\.Ports\s*=' internal/core/services/instance.go

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check the InstanceType validation/existence checking in Provision
sed -n '210,225p' internal/core/services/instance.go | cat -n

Repository: PoyrazK/thecloud

Length of output: 670


Validate NetworkID and persist option fields before enqueueing Provision.

Lines 165-170: An invalid opts.NetworkID is silently ignored instead of returning an error. Additionally, most CreateInstanceOptions fields (Ports, VolumeBinds, Env, Cmd, CPULimit, MemoryLimit, DiskLimit) are not persisted in the Instance struct or ProvisionJob. During Provision (line 225), the code attempts to parse inst.Ports, which was never populated, and Env/Cmd are hardcoded to nil, so these options are lost. The hardcoded InstanceType: "custom" also requires manual database setup. Please validate NetworkID and either: (1) persist these options in the Instance struct, (2) extend ProvisionJob to include them, or (3) pass CreateInstanceOptions directly to the ProvisionJob.

✅ Suggested validation for NetworkID
	if opts.NetworkID != "" {
		vpcID, err := uuid.Parse(opts.NetworkID)
-		if err == nil {
-			inst.VpcID = &vpcID
-		}
+		if err != nil {
+			return nil, errors.New(errors.InvalidInput, "invalid network_id")
+		}
+		inst.VpcID = &vpcID
	}
🤖 Prompt for AI Agents
In `@internal/core/services/instance.go` around lines 152 - 182, The code silently
ignores invalid opts.NetworkID and fails to persist CreateInstanceOptions
(Ports, VolumeBinds, Env, Cmd, CPULimit, MemoryLimit, DiskLimit) so Provision
later reads empty inst fields; validate opts.NetworkID (return error on invalid
uuid when opts.NetworkID != "") and persist the options either by adding those
fields to the domain.Instance before calling s.repo.Create (e.g., populate
inst.Ports, inst.VolumeBinds, inst.Env, inst.Cmd, inst.CPULimit,
inst.MemoryLimit, inst.DiskLimit), or extend domain.ProvisionJob to carry the
full CreateInstanceOptions and populate ProvisionJob.UserData/fields from opts
(or simply attach the entire CreateInstanceOptions to ProvisionJob) so the
Provision handler (which reads inst.Ports and Env/Cmd) receives the real values
instead of nil; ensure InstanceType is set appropriately (not hardcoded
"custom") or documented/handled elsewhere.

Comment on lines +208 to +234
func (a *DockerAdapter) handleUserData(ctx context.Context, containerID string, userData string) error {
// For Docker, we simulate Cloud-Init by writing the script to the container and executing it.
if strings.HasPrefix(userData, "#cloud-config") {
return a.processCloudConfig(ctx, containerID, userData)
}

// Default fallback: treat as shell script
scriptPath := "/tmp/bootstrap.sh"

// Write file to container
// Note: We use base64 encoding to avoid escaping issues with complex scripts
encoded := base64.StdEncoding.EncodeToString([]byte(userData))
writeCmd := []string{"sh", "-c", fmt.Sprintf("mkdir -p %s && echo %s | base64 -d > %s && chmod +x %s",
filepath.Dir(scriptPath), encoded, scriptPath, scriptPath)}

if _, err := a.Exec(ctx, containerID, writeCmd); err != nil {
return fmt.Errorf("failed to write userdata to container: %w", err)
}

// Run in background
go func() {
bgCtx := context.Background()
_, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath})
if err != nil {
a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
}
}()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get file size and examine the relevant sections
wc -l internal/repositories/docker/adapter.go

Repository: PoyrazK/thecloud

Length of output: 103


🏁 Script executed:

#!/bin/bash
# Examine the exact code at lines 208-234 (the snippet shown)
sed -n '208,234p' internal/repositories/docker/adapter.go

Repository: PoyrazK/thecloud

Length of output: 1192


🏁 Script executed:

#!/bin/bash
# Examine the second location mentioned (lines 314-323)
sed -n '314,323p' internal/repositories/docker/adapter.go

Repository: PoyrazK/thecloud

Length of output: 464


🏁 Script executed:

#!/bin/bash
# Check for the Exec method signature and understand its concurrency properties
rg "func.*Exec\(" internal/repositories/docker/adapter.go -A 5

Repository: PoyrazK/thecloud

Length of output: 263


🏁 Script executed:

#!/bin/bash
# Look for test files that might exercise concurrent Exec calls
fd -e go internal/repositories/docker | grep -i test

Repository: PoyrazK/thecloud

Length of output: 477


🏁 Script executed:

#!/bin/bash
# Find test files in the docker directory
fd --full-path 'internal/repositories/docker' -name '*test.go'

Repository: PoyrazK/thecloud

Length of output: 290


🏁 Script executed:

#!/bin/bash
# Look for any goroutine patterns in the adapter
rg "go func" internal/repositories/docker/adapter.go -B 2 -A 8

Repository: PoyrazK/thecloud

Length of output: 1040


🏁 Script executed:

#!/bin/bash
# Check if there are any mutexes or synchronization primitives in the adapter
rg "sync\.|Mutex|Lock|RWLock" internal/repositories/docker/adapter.go -B 2 -A 2

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check the struct definition of DockerAdapter to see fields
rg "type DockerAdapter struct" internal/repositories/docker/adapter.go -A 15

Repository: PoyrazK/thecloud

Length of output: 979


🏁 Script executed:

#!/bin/bash
# Find test files in the docker directory with correct syntax
find internal/repositories/docker -name '*test.go'

Repository: PoyrazK/thecloud

Length of output: 342


🏁 Script executed:

#!/bin/bash
# Look for the full Exec implementation to understand concurrency
sed -n '600,700p' internal/repositories/docker/adapter.go

Repository: PoyrazK/thecloud

Length of output: 2942


🏁 Script executed:

#!/bin/bash
# Search for where Exec is fully defined
rg "func \(a \*DockerAdapter\) Exec" internal/repositories/docker/adapter.go -A 30

Repository: PoyrazK/thecloud

Length of output: 1116


🏁 Script executed:

#!/bin/bash
# Check if there's any test that might be running these concurrently
rg "TestDockerAdapter\|TestHandleUserData\|TestProcessCloudConfig" internal/repositories/docker/ -A 10

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look for handleUserData or processCloudConfig tests
rg "handleUserData|processCloudConfig" internal/repositories/docker/adapter_test.go internal/repositories/docker/adapter_unit_test.go -B 2 -A 10

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check test files for concurrent execution patterns
rg "go func|t\.Parallel|concurrent" internal/repositories/docker/adapter_test.go -B 2 -A 5

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look at the beginning of test files to understand test structure
head -50 internal/repositories/docker/adapter_test.go

Repository: PoyrazK/thecloud

Length of output: 1446


🏁 Script executed:

#!/bin/bash
# Check if there are multiple calls to handleUserData or Exec in tests
rg "\.handleUserData\(|\.Exec\(" internal/repositories/docker/adapter_test.go internal/repositories/docker/adapter_unit_test.go | head -20

Repository: PoyrazK/thecloud

Length of output: 432


🏁 Script executed:

#!/bin/bash
# Look at the full adapter_unit_test.go to see the test structure
cat internal/repositories/docker/adapter_unit_test.go

Repository: PoyrazK/thecloud

Length of output: 12499


🏁 Script executed:

#!/bin/bash
# Search for any test that might call handleUserData or processCloudConfig
rg "handleUserData|processCloudConfig|userData" internal/repositories/docker/ --type go

Repository: PoyrazK/thecloud

Length of output: 1060


🏁 Script executed:

#!/bin/bash
# Check if the adapter has any state that could be raced on
sed -n '1,50p' internal/repositories/docker/adapter.go

Repository: PoyrazK/thecloud

Length of output: 1583


🏁 Script executed:

#!/bin/bash
# Look at fakeDockerClient to understand the Calls map and potential race
rg "type fakeDockerClient" internal/repositories/docker/ -A 30

Repository: PoyrazK/thecloud

Length of output: 1933


🏁 Script executed:

#!/bin/bash
# Check if Calls map is protected by mutex
rg "Calls\s*map" internal/repositories/docker/ -B 2 -A 2

Repository: PoyrazK/thecloud

Length of output: 324


🏁 Script executed:

#!/bin/bash
# Look at how Calls is incremented to see if there's synchronization
rg "cli\.Calls\[" internal/repositories/docker/ -B 2 -A 2

Repository: PoyrazK/thecloud

Length of output: 858


🏁 Script executed:

#!/bin/bash
# Look at the full fakeDockerClient implementation to see how Calls is incremented
cat internal/repositories/docker/fakes_test.go

Repository: PoyrazK/thecloud

Length of output: 5469


🏁 Script executed:

#!/bin/bash
# Verify the actual Exec calls in handleUserData to confirm both are concurrent
sed -n '208,237p' internal/repositories/docker/adapter.go | head -35

Repository: PoyrazK/thecloud

Length of output: 1207


🏁 Script executed:

#!/bin/bash
# Check processCloudConfig full implementation to understand the second goroutine
sed -n '240,350p' internal/repositories/docker/adapter.go | grep -A 30 "processCloudConfig"

Repository: PoyrazK/thecloud

Length of output: 1300


Fix context propagation and data races in user-data execution.
Background goroutines in handleUserData and processCloudConfig use context.Background(), which bypasses parent-context cancellation and violates the guideline to propagate context.Context to all blocking calls. The concurrent Exec calls also race on the test's unprotected Calls map. Replace the background execution with a synchronous call on a context derived from the parent with a timeout:

Example fix
-	// Run in background
-	go func() {
-		bgCtx := context.Background()
-		_, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath})
-		if err != nil {
-			a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
-		}
-	}()
+	bgCtx, cancel := context.WithTimeout(ctx, DefaultOperationTimeout)
+	defer cancel()
+	if _, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath}); err != nil {
+		a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
+	}

Also applies to: processCloudConfig (around line 314-323)

🧰 Tools
🪛 GitHub Actions: CI

[error] 228-230: Data race detected in DockerAdapter.handleUserData during LaunchInstanceWithOptions. Concurrent map access observed (race between goroutines 47 and 48).

🤖 Prompt for AI Agents
In `@internal/repositories/docker/adapter.go` around lines 208 - 234,
handleUserData (and likewise processCloudConfig) spawns a goroutine that calls
a.Exec with context.Background(), which bypasses the parent ctx, prevents
cancellation propagation, and causes concurrent Exec calls that race on shared
test state; change the background execution to a synchronous call using a
derived context with a timeout from the provided ctx (e.g., ctxWithTimeout,
cancel := context.WithTimeout(ctx, <reasonableDuration>); defer cancel()), call
a.Exec(ctxWithTimeout, containerID, ...) directly instead of launching a
goroutine, and propagate errors back to the caller (or log them consistently via
a.logger) so Exec is not invoked concurrently from multiple goroutines and
respects cancellation; apply the same change in processCloudConfig and ensure
any callers handle the returned error.

Comment on lines +89 to +112
type MockSecretService struct{ mock.Mock }

func (m *MockSecretService) CreateSecret(ctx context.Context, name, value, description string) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) GetSecret(ctx context.Context, id uuid.UUID) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) GetSecretByName(ctx context.Context, name string) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) ListSecrets(ctx context.Context) ([]*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) DeleteSecret(ctx context.Context, id uuid.UUID) error {
	return nil
}
func (m *MockSecretService) Encrypt(ctx context.Context, userID uuid.UUID, plainText string) (string, error) {
	return plainText, nil
}
func (m *MockSecretService) Decrypt(ctx context.Context, userID uuid.UUID, cipherText string) (string, error) {
	args := m.Called(ctx, userID, cipherText)
	return args.String(0), args.Error(1)
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for duplicate MockSecretService definitions in the k8s package
rg -n 'type MockSecretService struct' internal/repositories/k8s/

Repository: PoyrazK/thecloud

Length of output: 245


Remove duplicate MockSecretService definition.

The type MockSecretService struct{ mock.Mock } is already defined in kubeadm_provisioner_test.go:96. Duplicate type definitions in the same package will cause a compilation error. Delete the duplicate definition from mocks_test.go:89-112 and reuse the existing mock from the other file if needed.

🤖 Prompt for AI Agents
In `@internal/repositories/k8s/mocks_test.go` around lines 89 - 112, There is a
duplicate MockSecretService type and method implementations (e.g., CreateSecret,
GetSecret, Decrypt) which will cause a duplicate type error; remove the entire
MockSecretService definition and its methods from this file and rely on the
single shared MockSecretService already defined elsewhere in the package,
updating any tests in this file to use that existing MockSecretService (and its
m.Called stubs) instead of declaring a new one.

Comment on lines +155 to 160
	if err == nil {
		cluster.KubeconfigEncrypted = encryptedKubeconfig
	} else {
		p.logger.Error("failed to encrypt kubeconfig", "error", err)
		cluster.KubeconfigEncrypted = kubeconfig // Fallback
	}

⚠️ Potential issue | 🔴 Critical

Security risk: Storing unencrypted kubeconfig as fallback.

If encryption fails, the kubeconfig is stored in plaintext. This defeats the purpose of having KubeconfigEncrypted and could expose cluster credentials.

🔒 Proposed fix: Fail instead of fallback
 	encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
-	if err == nil {
-		cluster.KubeconfigEncrypted = encryptedKubeconfig
-	} else {
-		p.logger.Error("failed to encrypt kubeconfig", "error", err)
-		cluster.KubeconfigEncrypted = kubeconfig // Fallback
+	if err != nil {
+		return p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 	}
+	cluster.KubeconfigEncrypted = encryptedKubeconfig
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/provisioner.go` around lines 155 - 160, The current
fallback stores plaintext kubeconfig when encryption fails (setting
cluster.KubeconfigEncrypted = kubeconfig), which is a security risk; update the
error path in the block that handles encryptedKubeconfig (the branch using
encryptedKubeconfig, kubeconfig and p.logger.Error) to refuse to persist
plaintext: log the error via p.logger.Error as-is, and return or propagate an
error from the surrounding function instead of assigning the unencrypted
kubeconfig to cluster.KubeconfigEncrypted so the caller can handle the failure
and no plaintext kubeconfig is saved.

Add IF NOT EXISTS/IF EXISTS checks to:

- 003_add_container_id

- 006_create_vpcs_table

- 012_add_failure_tracking

- 013_create_users_table

- 014_resource_scoping

- 032_add_instance_network_fields

- 036_add_volume_backend_path

This resolves CI failures related to verifying migration rollbacks.
Add IF NOT EXISTS/IF EXISTS checks to:

- 020_create_clusters.up

- 053_add_cluster_network_isolation

- 054_extend_clusters_v2_2
Add IF NOT EXISTS/IF EXISTS checks to:

- 058_add_object_versioning

- 060_bucket_encryption

Also wrap constraint creation in DO block to avoid errors.
Add IF NOT EXISTS/IF EXISTS checks to remaining migrations.
Add status code assertions and body printing on decode failure for VerifyMethodMatching test.
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/repositories/postgres/migrations/076_add_gateway_pattern_matching.up.sql (2)

14-14: ⚠️ Potential issue | 🟡 Minor

CREATE INDEX is not idempotent and will fail on re-run.

This migration will error if executed again because the index already exists. Use IF NOT EXISTS for consistency with the idempotent approach applied to the column additions.

Proposed fix
--- Add index for pattern lookups
-CREATE INDEX idx_gateway_routes_pattern_type ON gateway_routes(pattern_type);
+CREATE INDEX IF NOT EXISTS idx_gateway_routes_pattern_type ON gateway_routes(pattern_type);

17-19: ⚠️ Potential issue | 🟡 Minor

ADD CONSTRAINT is not idempotent and will fail on re-run.

PostgreSQL doesn't support IF NOT EXISTS for constraints directly. To maintain consistency with the idempotent migration approach, wrap this in a DO block that checks for existing constraints.

Proposed fix
--- Add constraint: pattern_type must be 'prefix' or 'pattern'
-ALTER TABLE gateway_routes 
-    ADD CONSTRAINT chk_pattern_type 
-    CHECK (pattern_type IN ('prefix', 'pattern'));
+DO $$
+BEGIN
+    IF NOT EXISTS (
+        SELECT 1 FROM pg_constraint WHERE conname = 'chk_pattern_type'
+    ) THEN
+        ALTER TABLE gateway_routes 
+            ADD CONSTRAINT chk_pattern_type 
+            CHECK (pattern_type IN ('prefix', 'pattern'));
+    END IF;
+END $$;
internal/repositories/postgres/migrations/071_add_tenant_id_to_resources.up.sql (1)

69-90: ⚠️ Potential issue | 🟠 Major

Guard index creation with table existence checks.

Lines 69-90 create indexes unconditionally, but CREATE INDEX IF NOT EXISTS only suppresses errors if an index name already exists—it still errors if the underlying table is missing. Since the preceding column additions (lines 2-67) are conditional, the index creation must be conditional too, or the migration will fail if any table doesn't exist.

Wrap each index creation in a DO block with an existence check:

🔧 Example pattern (apply to each index)
- CREATE INDEX IF NOT EXISTS idx_instances_tenant ON instances(tenant_id);
+ DO $$
+ BEGIN
+   IF EXISTS (SELECT FROM pg_tables WHERE tablename = 'instances') THEN
+     EXECUTE 'CREATE INDEX IF NOT EXISTS idx_instances_tenant ON instances(tenant_id)';
+   END IF;
+ END $$;
🤖 Fix all issues with AI agents
In `@internal/repositories/postgres/migrations/014_resource_scoping.up.sql`:
- Around line 4-10: The migration currently uses "ADD COLUMN IF NOT EXISTS ...
REFERENCES" which is skipped if the column exists and thus can leave missing FK
constraints; update 014_resource_scoping.up.sql to (1) ensure each table
(instances, vpcs, volumes, objects, load_balancers, scaling_groups, events) has
the user_id column added with ALTER TABLE ... ADD COLUMN IF NOT EXISTS user_id
UUID (no REFERENCES), and (2) for each table add a guarded FK constraint named
e.g. instances_user_id_fkey, vpcs_user_id_fkey, etc., inside a DO $$ BEGIN ...
END $$ block that checks pg_constraint for conname before running ALTER TABLE
... ADD CONSTRAINT <name> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE
RESTRICT (following the pattern used in migration 013 for api_keys) so the
migration is idempotent and will add missing constraints without error.

In `@internal/repositories/postgres/migrations/020_create_clusters.up.sql`:
- Around line 37-38: The CREATE INDEX statements for idx_clusters_user_id and
idx_cluster_nodes_cluster_id should be made idempotent by adding IF NOT EXISTS;
update the two statements that create those indexes (the lines using CREATE
INDEX idx_clusters_user_id ON clusters(user_id) and CREATE INDEX
idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id)) to include IF NOT
EXISTS so rerunning the migration won’t fail if the indexes already exist.

In `@internal/repositories/postgres/migrations/058_add_object_versioning.up.sql`:
- Around line 11-12: The migration drops the wrong unique constraint; change the
DROP CONSTRAINT to remove the original pre-versioning unique constraint
(objects_bucket_key_key) instead of objects_bucket_key_version_unique, then add
the new UNIQUE (bucket, key, version_id) constraint on table objects so multiple
versions per (bucket,key) are allowed; update the ALTER TABLE statement
targeting constraint objects_bucket_key_key and keep the ADD CONSTRAINT
objects_bucket_key_version_unique UNIQUE (bucket, key, version_id).
- Around line 3-4: The migration leaves objects.version_id nullable which allows
duplicate (bucket, key) combos to bypass uniqueness; update the migration to set
a safe non-null default, backfill existing NULLs, and make the column NOT NULL
(e.g., ALTER TABLE objects ALTER COLUMN version_id SET DEFAULT ''; UPDATE
objects SET version_id = '' WHERE version_id IS NULL; ALTER TABLE objects ALTER
COLUMN version_id SET NOT NULL), and ensure the unique index/constraint that
should cover (bucket, key, version_id) is present/updated so NULLs cannot bypass
uniqueness; reference the objects table and the version_id and is_latest columns
when making these changes.

In `@internal/repositories/postgres/migrations/077_add_gateway_methods.up.sql`:
- Around line 8-12: The IF NOT EXISTS check must be scoped to this specific
table/schema so a same-named constraint elsewhere doesn't block creation; update
the SELECT in the DO $$ block that currently checks only constraint_name =
'gateway_routes_pattern_methods_key' to also filter by table_name =
'gateway_routes' and the appropriate table_schema (e.g. 'public' or the schema
used by this migration), ensuring the ALTER TABLE gateway_routes ADD CONSTRAINT
gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods) only gets
skipped when that constraint exists on this table/schema.
🧹 Nitpick comments (8)
internal/repositories/postgres/migrations/078_add_api_key_expiry.up.sql (1)

1-1: LGTM!

The migration correctly uses IF NOT EXISTS for idempotent behavior, and TIMESTAMP WITH TIME ZONE is the appropriate type for expiration timestamps. Nullable by default is sensible—API keys without an expiration remain valid indefinitely.

Optional consideration: If the application runs periodic cleanup jobs or frequently queries expired keys (e.g., WHERE expires_at IS NOT NULL AND expires_at < NOW()), consider adding an index on expires_at in a follow-up migration to optimize those queries.
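Such a follow-up could use a partial index so that keys without an expiry add no index bloat (a sketch; the index name is illustrative):

```sql
-- Only rows that actually expire are indexed; non-expiring API keys
-- are excluded by the WHERE clause.
CREATE INDEX IF NOT EXISTS idx_api_keys_expires_at
    ON api_keys (expires_at)
    WHERE expires_at IS NOT NULL;
```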


internal/repositories/postgres/migrations/013_create_users_table.up.sql (1)

18-23: Consider using table_constraints for more robust constraint existence check.

The constraint_column_usage view shows columns referenced by constraints, whereas table_constraints directly lists constraints on tables and is more commonly used for this purpose. Additionally, filtering by schema avoids false positives in multi-schema environments.

♻️ Suggested improvement
 DO $$ 
 BEGIN 
-    IF NOT EXISTS (SELECT 1 FROM information_schema.constraint_column_usage WHERE constraint_name = 'fk_api_keys_user') THEN
+    IF NOT EXISTS (
+        SELECT 1 FROM information_schema.table_constraints 
+        WHERE constraint_name = 'fk_api_keys_user' 
+        AND table_name = 'api_keys' 
+        AND constraint_type = 'FOREIGN KEY'
+    ) THEN
         ALTER TABLE api_keys ADD CONSTRAINT fk_api_keys_user FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;
     END IF;
 END $$;
internal/repositories/postgres/migrations/058_add_object_versioning.down.sql (1)

7-12: Scope the constraint existence check to the objects table.

The check only filters by constraint name, which can yield false positives in edge cases. Consider constraining by table and schema for accuracy.

🔧 Suggested adjustment
 DO $$ 
 BEGIN 
-    IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'objects_bucket_key_key') THEN
+    IF NOT EXISTS (
+        SELECT 1
+        FROM information_schema.table_constraints
+        WHERE constraint_name = 'objects_bucket_key_key'
+          AND table_name = 'objects'
+          AND table_schema = current_schema()
+    ) THEN
         ALTER TABLE objects ADD CONSTRAINT objects_bucket_key_key UNIQUE (bucket, key);
     END IF;
 END $$;
internal/repositories/postgres/migrations/014_resource_scoping.up.sql (1)

13-19: Consider concurrent index creation for large production tables.

If these tables can be large in production, CREATE INDEX will take heavyweight locks and block writes. Consider a separate no-transaction migration that uses CREATE INDEX CONCURRENTLY where needed. Validate against your goose transaction settings before changing.

internal/repositories/postgres/migrations/006_create_vpcs_table.up.sql (1)

10-10: Consider indexing instances.vpc_id for join performance.

If this column is used in joins or lookups, adding an index can prevent table scans and keep FK lookups fast.
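For example (index name illustrative):

```sql
-- Postgres does not auto-index FK referencing columns; without this,
-- deletes on vpcs must scan instances to check the reference.
CREATE INDEX IF NOT EXISTS idx_instances_vpc_id ON instances (vpc_id);
```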

internal/repositories/postgres/migrations/071_add_tenant_id_to_resources.up.sql (1)

4-67: Ensure schema resolution is deterministic for unqualified ALTER TABLE.

These checks are schema-agnostic, but ALTER TABLE <name> follows search_path. If multiple schemas can contain these table names, consider setting search_path explicitly or using schema-qualified dynamic SQL from pg_tables.schemaname.

internal/repositories/postgres/migrations/072_migrate_to_tenants.up.sql (1)

32-93: Consider schema-qualified updates or enforce search_path.

These schema-agnostic existence checks are paired with unqualified UPDATE statements. In multi-schema environments, that can hit the wrong table or miss the intended one. Consider setting search_path for the migration or using schema-qualified dynamic SQL.

tests/gateway_e2e_test.go (1)

284-329: Consider consolidating repeated HTTP method tests into a parameterized loop.

The GET, POST, and DELETE test blocks share identical setup and assertion patterns. Extracting the method name, expected status code, and URL fragment into a loop would reduce duplication and make it easier to add additional method tests. This is particularly valuable since the current code duplicates request creation, header setup, body reading, and unmarshaling logic three times.

♻️ Suggested refactor
-		// Test GET request
-		reqGet, _ := http.NewRequest("GET", testutil.TestBaseURL+"/gw"+pattern, nil)
-		reqGet.Header.Set(testutil.TestHeaderAPIKey, token)
-		respGet, err := client.Do(reqGet)
-		require.NoError(t, err)
-		defer func() { _ = respGet.Body.Close() }()
-		bodyBytesGet, err := io.ReadAll(respGet.Body)
-		require.NoError(t, err)
-		var resGet struct {
-			URL string `json:"url"`
-		}
-		if err := json.Unmarshal(bodyBytesGet, &resGet); err != nil {
-			t.Fatalf("Failed to decode GET response: %v. Body: %s", err, string(bodyBytesGet))
-		}
-		assert.Contains(t, resGet.URL, "/get-only")
-
-		// Test POST request
-		reqPost, _ := http.NewRequest("POST", testutil.TestBaseURL+"/gw"+pattern, nil)
-		reqPost.Header.Set(testutil.TestHeaderAPIKey, token)
-		respPost, err := client.Do(reqPost)
-		require.NoError(t, err)
-		defer func() { _ = respPost.Body.Close() }()
-		bodyBytes, err := io.ReadAll(respPost.Body)
-		require.NoError(t, err)
-		var resPost struct {
-			URL string `json:"url"`
-		}
-		if err := json.Unmarshal(bodyBytes, &resPost); err != nil {
-			t.Fatalf("Failed to decode POST response: %v. Body: %s", err, string(bodyBytes))
-		}
-		assert.Contains(t, resPost.URL, "/post-only")
-
-		// Test DELETE request (should fail)
-		reqDel, _ := http.NewRequest("DELETE", testutil.TestBaseURL+"/gw"+pattern, nil)
-		reqDel.Header.Set(testutil.TestHeaderAPIKey, token)
-		respDel, err := client.Do(reqDel)
-		require.NoError(t, err)
-		assert.Equal(t, http.StatusNotFound, respDel.StatusCode)
-		_ = respDel.Body.Close()
+		cases := []struct {
+			name            string
+			method          string
+			wantStatus      int
+			wantURLContains string
+		}{
+			{name: "GET", method: http.MethodGet, wantStatus: http.StatusOK, wantURLContains: "/get-only"},
+			{name: "POST", method: http.MethodPost, wantStatus: http.StatusOK, wantURLContains: "/post-only"},
+			{name: "DELETE", method: http.MethodDelete, wantStatus: http.StatusNotFound},
+		}
+
+		for _, tc := range cases {
+			t.Run(tc.name, func(t *testing.T) {
+				req, err := http.NewRequest(tc.method, testutil.TestBaseURL+"/gw"+pattern, nil)
+				require.NoError(t, err)
+				req.Header.Set(testutil.TestHeaderAPIKey, token)
+
+				resp, err := client.Do(req)
+				require.NoError(t, err)
+				defer func() { _ = resp.Body.Close() }()
+
+				assert.Equal(t, tc.wantStatus, resp.StatusCode)
+				if tc.wantURLContains == "" {
+					return
+				}
+
+				bodyBytes, err := io.ReadAll(resp.Body)
+				require.NoError(t, err)
+				var res struct {
+					URL string `json:"url"`
+				}
+				if err := json.Unmarshal(bodyBytes, &res); err != nil {
+					t.Fatalf("Failed to decode %s response: %v. Body: %s", tc.name, err, string(bodyBytes))
+				}
+				assert.Contains(t, res.URL, tc.wantURLContains)
+			})
+		}

Comment on lines +4 to +10
ALTER TABLE instances ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE vpcs ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE volumes ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE objects ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE load_balancers ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE scaling_groups ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE events ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full migration file
cat -n internal/repositories/postgres/migrations/014_resource_scoping.up.sql

Repository: PoyrazK/thecloud

Length of output: 1484


🏁 Script executed:

# Search for earlier migrations that might have added these columns
ls -la internal/repositories/postgres/migrations/ | head -20

Repository: PoyrazK/thecloud

Length of output: 1537


🏁 Script executed:

# Look for any migrations that added user_id columns before 014
rg "user_id" internal/repositories/postgres/migrations/ --type-list | head -30

Repository: PoyrazK/thecloud

Length of output: 449


🏁 Script executed:

# Properly search for user_id in migration files
rg "user_id" internal/repositories/postgres/migrations/

Repository: PoyrazK/thecloud

Length of output: 16027


🏁 Script executed:

# Check migrations 001-013 more specifically
for i in {001..013}; do
  file="internal/repositories/postgres/migrations/${i}_"*.up.sql
  if [ -f "$file" ]; then
    echo "=== $file ==="
    cat "$file"
  fi
done

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

# Check the early migration files for these tables to see initial schema
cat -n internal/repositories/postgres/migrations/001_create_instances_table.up.sql

Repository: PoyrazK/thecloud

Length of output: 784


🏁 Script executed:

cat -n internal/repositories/postgres/migrations/006_create_vpcs_table.up.sql

Repository: PoyrazK/thecloud

Length of output: 429


🏁 Script executed:

cat -n internal/repositories/postgres/migrations/007_create_events_table.up.sql

Repository: PoyrazK/thecloud

Length of output: 525


🏁 Script executed:

cat -n internal/repositories/postgres/migrations/008_create_volumes_table.up.sql

Repository: PoyrazK/thecloud

Length of output: 746


🏁 Script executed:

# Check migration 013 which creates users and adds FK to api_keys
cat -n internal/repositories/postgres/migrations/013_create_users_table.up.sql

Repository: PoyrazK/thecloud

Length of output: 1064


🏁 Script executed:

# Check migrations 009-013 for any user_id additions to these tables
for i in {009..013}; do
  file="internal/repositories/postgres/migrations/${i}_"*.up.sql
  if [ -f "$file" ]; then
    echo "=== $file ==="
    cat -n "$file" 2>/dev/null
  fi
done

Repository: PoyrazK/thecloud

Length of output: 42


🏁 Script executed:

# Look for load_balancers and scaling_groups table creation
rg "CREATE TABLE.*load_balancers|CREATE TABLE.*scaling_groups|CREATE TABLE.*objects" internal/repositories/postgres/migrations/ -A 10

Repository: PoyrazK/thecloud

Length of output: 4854


🏁 Script executed:

# Let's verify PostgreSQL behavior with ADD COLUMN IF NOT EXISTS and FK constraints
# Check if there are any intermediate migrations between 014 and later ones that might modify these columns
for i in {015..025}; do
  file="internal/repositories/postgres/migrations/${i}_"*.up.sql
  if [ -f "$file" ]; then
    grep -l "ALTER TABLE.*instances\|ALTER TABLE.*vpcs\|ALTER TABLE.*volumes\|ALTER TABLE.*objects\|ALTER TABLE.*load_balancers\|ALTER TABLE.*scaling_groups\|ALTER TABLE.*events" "$file" 2>/dev/null && echo "Found in $file"
  fi
done

Repository: PoyrazK/thecloud

Length of output: 42


🌐 Web query:

PostgreSQL ADD COLUMN IF NOT EXISTS REFERENCES foreign key constraint behavior

💡 Result:

  • ALTER TABLE ... ADD COLUMN IF NOT EXISTS col ... is all-or-nothing for that subcommand: if the column already exists, PostgreSQL does not execute the column definition at all (it just suppresses the error). So ... ADD COLUMN IF NOT EXISTS col integer REFERENCES parent(id) will not add/repair the foreign key if col already exists. [1][2]

  • REFERENCES ... in a column definition is a foreign key constraint (same syntax as CREATE TABLE). It requires referenced columns to be a PRIMARY KEY or UNIQUE (or equivalent unique index, depending on version). [3][4]

  • When the column is newly added, PostgreSQL will create the FK and (in general) checks existing rows immediately; ALTER TABLE ... ADD CONSTRAINT ... NOT VALID exists specifically to skip the initial scan for large tables. In the common “new nullable column” case, existing rows are NULL, and NULL values do not have to satisfy the FK, so it won’t fail unless you also make it NOT NULL / give a non-null default. [5][6]

Practical pattern for idempotency

  • Do ADD COLUMN IF NOT EXISTS ... first, then add the FK separately (optionally NOT VALID + VALIDATE CONSTRAINT), because IF NOT EXISTS won’t help you if the column exists but the FK doesn’t. [1][6]

Sources: PostgreSQL ALTER TABLE (ADD COLUMN IF NOT EXISTS) docs [1][2]; CREATE TABLE / REFERENCES FK definition details [4]; constraint behavior for FKs with NULLs [3]; adding constraints + NOT VALID behavior [5][6].


Add explicit FK constraints guarded by pg_constraint for idempotency.

ADD COLUMN IF NOT EXISTS ... REFERENCES won't create the foreign key if the column already exists. If a column was added in a previous run or partial migration without the FK constraint, this statement silently skips and leaves referential integrity inconsistent. Follow the same pattern as migration 013 (which does this for api_keys): add the column first, then add the FK constraint separately within a DO block checking pg_constraint.

Pattern (repeat for each table)
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT 1
    FROM pg_constraint
    WHERE conname = 'instances_user_id_fkey'
  ) THEN
    ALTER TABLE instances
      ADD CONSTRAINT instances_user_id_fkey
      FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;
  END IF;
END $$;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/014_resource_scoping.up.sql` around
lines 4 - 10, The migration currently uses "ADD COLUMN IF NOT EXISTS ...
REFERENCES" which is skipped if the column exists and thus can leave missing FK
constraints; update 014_resource_scoping.up.sql to (1) ensure each table
(instances, vpcs, volumes, objects, load_balancers, scaling_groups, events) has
the user_id column added with ALTER TABLE ... ADD COLUMN IF NOT EXISTS user_id
UUID (no REFERENCES), and (2) for each table add a guarded FK constraint named
e.g. instances_user_id_fkey, vpcs_user_id_fkey, etc., inside a DO $$ BEGIN ...
END $$ block that checks pg_constraint for conname before running ALTER TABLE
... ADD CONSTRAINT <name> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE
RESTRICT (following the pattern used in migration 013 for api_keys) so the
migration is idempotent and will add missing constraints without error.

Comment on lines +37 to +38
CREATE INDEX idx_clusters_user_id ON clusters(user_id);
CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n internal/repositories/postgres/migrations/020_create_clusters.up.sql

Repository: PoyrazK/thecloud

Length of output: 1719


Add IF NOT EXISTS to index creation statements for migration idempotency.

The table creation statements (lines 1 and 27) correctly use IF NOT EXISTS, but the index creation statements (lines 37-38) do not. Re-running the migration will fail with "relation already exists" errors if the indexes are already present.

🔧 Suggested fix
-CREATE INDEX idx_clusters_user_id ON clusters(user_id);
-CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);
+CREATE INDEX IF NOT EXISTS idx_clusters_user_id ON clusters(user_id);
+CREATE INDEX IF NOT EXISTS idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/020_create_clusters.up.sql` around
lines 37 - 38, The CREATE INDEX statements for idx_clusters_user_id and
idx_cluster_nodes_cluster_id should be made idempotent by adding IF NOT EXISTS;
update the two statements that create those indexes (the lines using CREATE
INDEX idx_clusters_user_id ON clusters(user_id) and CREATE INDEX
idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id)) to include IF NOT
EXISTS so rerunning the migration won’t fail if the indexes already exist.

Comment on lines +3 to +4
ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64);
ALTER TABLE objects ADD COLUMN IF NOT EXISTS is_latest BOOLEAN DEFAULT TRUE;

⚠️ Potential issue | 🟠 Major

Prevent NULL version_id from bypassing uniqueness.

version_id is nullable and has no default, so inserts without a value can create duplicate (bucket, key) rows because unique constraints treat NULLs as distinct. This weakens the pre-existing uniqueness guarantee.

✅ Suggested fix
-ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64);
+ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64) DEFAULT 'null';
 ALTER TABLE objects ADD COLUMN IF NOT EXISTS is_latest BOOLEAN DEFAULT TRUE;

 UPDATE objects SET version_id = 'null' WHERE version_id IS NULL;
+
+ALTER TABLE objects ALTER COLUMN version_id SET NOT NULL;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/058_add_object_versioning.up.sql`
around lines 3 - 4, The migration leaves objects.version_id nullable which
allows duplicate (bucket, key) combos to bypass uniqueness; update the migration
to set a safe non-null default, backfill existing NULLs, and make the column NOT
NULL (e.g., ALTER TABLE objects ALTER COLUMN version_id SET DEFAULT ''; UPDATE
objects SET version_id = '' WHERE version_id IS NULL; ALTER TABLE objects ALTER
COLUMN version_id SET NOT NULL), and ensure the unique index/constraint that
should cover (bucket, key, version_id) is present/updated so NULLs cannot bypass
uniqueness; reference the objects table and the version_id and is_latest columns
when making these changes.
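The NULL-distinctness behavior driving this suggestion is easy to reproduce (illustrative snippet, not part of the migration):

```sql
-- Two NULLs never compare equal under a UNIQUE constraint, so both
-- inserts below succeed and (bucket, key) uniqueness is silently lost.
CREATE TABLE demo (bucket TEXT, key TEXT, version_id TEXT,
                   UNIQUE (bucket, key, version_id));

INSERT INTO demo VALUES ('b1', 'k1', NULL);
INSERT INTO demo VALUES ('b1', 'k1', NULL);  -- accepted: no violation
```

On PostgreSQL 15+, declaring the constraint as `UNIQUE NULLS NOT DISTINCT (bucket, key, version_id)` is an alternative way to close this gap without a sentinel value.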

Comment on lines +8 to +12
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'gateway_routes_pattern_methods_key') THEN
ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods);
END IF;

⚠️ Potential issue | 🟠 Major

Scope the constraint existence check to table/schema.

The IF NOT EXISTS check only filters by constraint_name, so a same-named constraint in another schema/table will short-circuit and skip adding the unique constraint on gateway_routes. This can leave the table without the intended uniqueness guarantee.

✅ Suggested fix (table + schema scoped)
 DO $$
 BEGIN
-    IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'gateway_routes_pattern_methods_key') THEN
+    IF NOT EXISTS (
+        SELECT 1
+        FROM information_schema.table_constraints tc
+        WHERE tc.constraint_name = 'gateway_routes_pattern_methods_key'
+          AND tc.table_name = 'gateway_routes'
+          AND tc.constraint_schema = current_schema()
+    ) THEN
         ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods);
     END IF;
 END $$;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'gateway_routes_pattern_methods_key') THEN
ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods);
END IF;
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1
FROM information_schema.table_constraints tc
WHERE tc.constraint_name = 'gateway_routes_pattern_methods_key'
AND tc.table_name = 'gateway_routes'
AND tc.constraint_schema = current_schema()
) THEN
ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods);
END IF;
END $$;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/077_add_gateway_methods.up.sql`
around lines 8 - 12, The IF NOT EXISTS check must be scoped to this specific
table/schema so a same-named constraint elsewhere doesn't block creation; update
the SELECT in the DO $$ block that currently checks only constraint_name =
'gateway_routes_pattern_methods_key' to also filter by table_name =
'gateway_routes' and the appropriate table_schema (e.g. 'public' or the schema
used by this migration), ensuring the ALTER TABLE gateway_routes ADD CONSTRAINT
gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods) only gets
skipped when that constraint exists on this table/schema.

Make call counting in fakeDockerClient thread-safe using a mutex.
Copilot AI review requested due to automatic review settings February 4, 2026 10:36

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@internal/repositories/docker/adapter_unit_test.go`:
- Around line 393-404: The fixed 50ms time.Sleep is causing flaky tests; replace
it with a deterministic wait: either (A) implement a polling-with-timeout in the
test that repeatedly checks cli.CallCount("ContainerExecCreate"),
cli.CallCount("ContainerExecAttach"), and cli.CallCount("ContainerExecInspect")
until they reach the expected counts (or a deadline expires), using short sleeps
between polls, or (B) add a test-only synchronization hook to the adapter (e.g.,
have LaunchInstanceWithOptions optionally return a wait handle or accept a
completion channel/callback) and use that to block until the background
bootstrap goroutine finishes; update the test to use the polling or the new wait
handle instead of time.Sleep(50 * time.Millisecond).
🧹 Nitpick comments (1)
internal/repositories/docker/adapter_unit_test.go (1)

377-405: Consider adding t.Parallel() for consistency with PR objectives.

The PR objectives state that parallel execution was enabled for unit tests via t.Parallel(). This new test doesn't include it, which may be intentional if the test has shared state concerns—but given that it uses an isolated fakeDockerClient instance, it should be safe to parallelize.

Suggested change
 func TestDockerAdapterLaunchWithUserData(t *testing.T) {
+	t.Parallel()
 	cli := &fakeDockerClient{}

Comment on lines +393 to +404
// UserData execution occurs asynchronously in a background goroutine.
// A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
time.Sleep(50 * time.Millisecond)

// Verify the two-stage bootstrap sequence:
// Stage 1: Payload delivery (writing the bootstrap script to the container filesystem).
// Stage 2: Bootstrap execution (invoking the script via a background Exec operation).
// Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach,
// and ContainerExecInspect according to the adapter's implementation.
require.Equal(t, 2, cli.CallCount("ContainerExecCreate"), "Expected dual Stage (delivery + execution) Exec invocations")
require.Equal(t, 2, cli.CallCount("ContainerExecAttach"), "Expected associated ExecAttach calls for I/O")
require.Equal(t, 2, cli.CallCount("ContainerExecInspect"), "Expected ExecInspect calls to verify termination state")

⚠️ Potential issue | 🟠 Major

Flaky test: time.Sleep for async synchronization is unreliable.

Using a fixed 50ms sleep to wait for background goroutines is a known source of test flakiness—it may be too short on slow CI runners or under load, and unnecessarily slow on fast machines.

Consider one of these alternatives:

  1. Polling with timeout: Loop checking CallCount until expected values are reached or a deadline expires.
  2. Expose a synchronization hook: Have the adapter signal completion via a channel or callback in test mode.
  3. Return a wait handle: Modify LaunchInstanceWithOptions to optionally return a way to await bootstrap completion.
Example polling approach
-	// UserData execution occurs asynchronously in a background goroutine.
-	// A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
-	time.Sleep(50 * time.Millisecond)
+	// Poll until the expected call counts are reached or timeout.
+	deadline := time.Now().Add(2 * time.Second)
+	for time.Now().Before(deadline) {
+		if cli.CallCount("ContainerExecCreate") >= 2 &&
+			cli.CallCount("ContainerExecAttach") >= 2 &&
+			cli.CallCount("ContainerExecInspect") >= 2 {
+			break
+		}
+		time.Sleep(10 * time.Millisecond)
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// UserData execution occurs asynchronously in a background goroutine.
// A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
time.Sleep(50 * time.Millisecond)
// Verify the two-stage bootstrap sequence:
// Stage 1: Payload delivery (writing the bootstrap script to the container filesystem).
// Stage 2: Bootstrap execution (invoking the script via a background Exec operation).
// Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach,
// and ContainerExecInspect according to the adapter's implementation.
require.Equal(t, 2, cli.CallCount("ContainerExecCreate"), "Expected dual Stage (delivery + execution) Exec invocations")
require.Equal(t, 2, cli.CallCount("ContainerExecAttach"), "Expected associated ExecAttach calls for I/O")
require.Equal(t, 2, cli.CallCount("ContainerExecInspect"), "Expected ExecInspect calls to verify termination state")
// Poll until the expected call counts are reached or timeout.
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if cli.CallCount("ContainerExecCreate") >= 2 &&
cli.CallCount("ContainerExecAttach") >= 2 &&
cli.CallCount("ContainerExecInspect") >= 2 {
break
}
time.Sleep(10 * time.Millisecond)
}
// Verify the two-stage bootstrap sequence:
// Stage 1: Payload delivery (writing the bootstrap script to the container filesystem).
// Stage 2: Bootstrap execution (invoking the script via a background Exec operation).
// Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach,
// and ContainerExecInspect according to the adapter's implementation.
require.Equal(t, 2, cli.CallCount("ContainerExecCreate"), "Expected dual Stage (delivery + execution) Exec invocations")
require.Equal(t, 2, cli.CallCount("ContainerExecAttach"), "Expected associated ExecAttach calls for I/O")
require.Equal(t, 2, cli.CallCount("ContainerExecInspect"), "Expected ExecInspect calls to verify termination state")
🤖 Prompt for AI Agents
In `@internal/repositories/docker/adapter_unit_test.go` around lines 393 - 404,
The fixed 50ms time.Sleep is causing flaky tests; replace it with a
deterministic wait: either (A) implement a polling-with-timeout in the test that
repeatedly checks cli.CallCount("ContainerExecCreate"),
cli.CallCount("ContainerExecAttach"), and cli.CallCount("ContainerExecInspect")
until they reach the expected counts (or a deadline expires), using short sleeps
between polls, or (B) add a test-only synchronization hook to the adapter (e.g.,
have LaunchInstanceWithOptions optionally return a wait handle or accept a
completion channel/callback) and use that to block until the background
bootstrap goroutine finishes; update the test to use the polling or the new wait
handle instead of time.Sleep(50 * time.Millisecond).

Ensure 'objects_bucket_key_key' is dropped when adding versioning support to prevent constraint violations on multiple versions.

Copilot AI left a comment


Pull request overview

Copilot reviewed 167 out of 168 changed files in this pull request and generated 11 comments.



Comment on lines 9 to +12
-- Remove old unique constraint and add new one that includes version_id
-- Note: Constraint name might vary, but objects_bucket_key_key is standard for (bucket, key)
ALTER TABLE objects DROP CONSTRAINT IF EXISTS objects_bucket_key_key;
ALTER TABLE objects DROP CONSTRAINT IF EXISTS objects_bucket_key_version_unique;

Copilot AI Feb 4, 2026


In this migration, the old UNIQUE(bucket, key) constraint created in 004_create_objects_table.up.sql will be named objects_bucket_key_key by Postgres. Dropping objects_bucket_key_version_unique here won’t remove the existing constraint, so adding the new (bucket, key, version_id) unique constraint will fail on databases that already have the original constraint. Drop the actual old constraint (or query information_schema/pg_constraint to drop whichever exists) before adding the new one.

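One way to drop whichever unique constraint actually exists, as this comment suggests, is to enumerate it from `pg_constraint` (sketch only; it assumes the unique constraints on `objects` are exactly the (bucket, key) variants and should be reviewed before use):

```sql
DO $$
DECLARE
    cname text;
BEGIN
    -- Drop every unique constraint currently on objects, regardless of
    -- the name Postgres assigned it.
    FOR cname IN
        SELECT conname FROM pg_constraint
        WHERE conrelid = 'objects'::regclass AND contype = 'u'
    LOOP
        EXECUTE format('ALTER TABLE objects DROP CONSTRAINT %I', cname);
    END LOOP;
END $$;

ALTER TABLE objects
    ADD CONSTRAINT objects_bucket_key_version_unique
    UNIQUE (bucket, key, version_id);
```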
@@ -0,0 +1 @@
ALTER TABLE load_balancers ADD COLUMN IF NOT EXISTS ip VARCHAR(50);

Copilot AI Feb 4, 2026


This migration adds load_balancers.ip, but that column is already introduced in 054_extend_clusters_v2_2.up.sql (as TEXT). Adding it again here (as VARCHAR(50)) is redundant and creates rollback hazards: rolling back version 080 would drop the column even though earlier migrations expect it to exist. Consider removing this migration or converting it into a type/constraint adjustment that is consistent with the earlier column definition.

Comment on lines +1 to +2
-- +goose Down
ALTER TABLE load_balancers DROP COLUMN IF EXISTS ip;

Copilot AI Feb 4, 2026


080_add_lb_ip.up.sql is missing the -- +goose Up directive, while the corresponding down migration has -- +goose Down. If migrations are applied via goose (as the rest of this directory suggests), the missing directive can cause the migration to be skipped or mis-parsed. Add the proper goose header (and keep Up/Down files consistent).

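For reference, a goose-compatible pair for migration 080 would look like this (sketch mirroring the existing statements; file split shown via comments):

```sql
-- 080_add_lb_ip.up.sql
-- +goose Up
ALTER TABLE load_balancers ADD COLUMN IF NOT EXISTS ip VARCHAR(50);

-- 080_add_lb_ip.down.sql
-- +goose Down
ALTER TABLE load_balancers DROP COLUMN IF EXISTS ip;
```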
Comment on lines +37 to +38
CREATE INDEX idx_clusters_user_id ON clusters(user_id);
CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);

Copilot AI Feb 4, 2026


The indexes in this migration are created without IF NOT EXISTS. Because the tables are created with IF NOT EXISTS, this migration is likely intended to be idempotent, but repeated application (or partial schema presence) can still fail on index creation. Use CREATE INDEX IF NOT EXISTS ... to match the rest of the migration style in this repo.

Comment on lines +1 to +5
-- Add missing columns to clusters table
ALTER TABLE clusters ADD COLUMN IF NOT EXISTS pod_cidr VARCHAR(50) DEFAULT '10.244.0.0/16';
ALTER TABLE clusters ADD COLUMN IF NOT EXISTS service_cidr VARCHAR(50) DEFAULT '10.96.0.0/12';
ALTER TABLE clusters ADD COLUMN IF NOT EXISTS ssh_private_key_encrypted TEXT;
ALTER TABLE clusters ADD COLUMN IF NOT EXISTS kubeconfig_encrypted TEXT;

Copilot AI Feb 4, 2026


These new migration files don’t include the -- +goose Up / -- +goose Down directives that are used throughout the rest of this migrations directory. If goose is used to run migrations, these files may not execute as intended. Add the appropriate goose headers to both the up and down scripts.

Comment on lines +24 to 30
// Add default rules (K8s control plane ports)
rules := []domain.SecurityRule{
{Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // API Server
{Protocol: "udp", PortMin: 4789, PortMax: 4789, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // VXLAN
{Protocol: "tcp", PortMin: 179, PortMax: 179, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // BGP
{Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // Kubelet
{Protocol: "tcp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort TCP
{Protocol: "udp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort UDP
}

for _, r := range rules {
_, _ = p.sgSvc.AddRule(ctx, sg.ID, r)
{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: "0.0.0.0/0"},
{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 22, PortMax: 22, CIDR: "0.0.0.0/0"},
{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 2379, PortMax: 2380, CIDR: "10.0.0.0/8"}, // etcd (internal)
{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: "10.0.0.0/8"}, // kubelet
}

Copilot AI Feb 4, 2026


The default security group rules here open the Kubernetes API server (6443) and SSH (22) to 0.0.0.0/0. That’s a high-risk default for managed clusters. Prefer restricting these to the VPC CIDR (or a configurable admin CIDR) and only exposing 6443 publicly when explicitly requested.

Comment on lines +32 to 34
for _, rule := range rules {
_, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule)
}

Copilot AI Feb 4, 2026


Errors from AddRule are currently ignored (_, _ = ...). If rule creation fails, provisioning will proceed with a partially configured security group, which can lead to hard-to-debug connectivity issues. Handle and surface these errors (or at least log them and fail provisioning).

Comment on lines +1 to +5
CREATE TABLE IF NOT EXISTS clusters (
id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id),
vpc_id UUID NOT NULL REFERENCES vpcs(id),
name VARCHAR(255) NOT NULL,

Copilot AI Feb 4, 2026


This new migration file name collides with an existing migration version (internal/repositories/postgres/migrations/020_create_notify.up.sql already uses version 020). Having two migrations with the same version will break migration tooling (ordering/application). Rename this to the next available migration number and ensure it’s applied in the intended order.

Comment on lines +72 to 85
config := &ssh.ClientConfig{
User: e.user,
Auth: []ssh.AuthMethod{
ssh.PublicKeys(signer),
},
HostKeyCallback: ssh.InsecureIgnoreHostKey(),
Timeout: 10 * time.Second,
}

addr := net.JoinHostPort(e.ip, "22")
client, err := ssh.Dial("tcp", addr, config)
if err != nil {
return fmt.Errorf("failed to create ssh client: %w", err)
return "", fmt.Errorf("failed to dial ssh: %w", err)
}

Copilot AI Feb 4, 2026


ssh.Dial doesn’t honor context cancellation/timeouts, so Run/WaitForReady can block past ctx deadlines. Since pkg/sshutil already implements SSH execution using net.Dialer.DialContext (pkg/sshutil/client.go:55), consider reusing it here (or switching to DialContext + ssh.NewClientConn) so node exec respects ctx and caller timeouts.

Comment on lines +72 to +79
config := &ssh.ClientConfig{
User: e.user,
Auth: []ssh.AuthMethod{
ssh.PublicKeys(signer),
},
HostKeyCallback: ssh.InsecureIgnoreHostKey(),
Timeout: 10 * time.Second,
}

Copilot AI Feb 4, 2026


Host key verification is disabled via ssh.InsecureIgnoreHostKey(). If this executor is used outside of local/test environments, this enables MITM attacks. Consider wiring a known-hosts callback or making the host key policy configurable, defaulting to strict verification in production.

Allow deployment to staging registry from feature branches to verify build/push before merge. Also adds branch name as docker tag.
@poyrazK poyrazK closed this Feb 9, 2026
@poyrazK poyrazK deleted the feature/test-fixes-and-optimization branch February 9, 2026 22:33