fix: resolve E2E test failures, optimize performance, and clean lint (#48)
Conversation
📝 Walkthrough

Cloud-init based provisioning and user-data support were added across compute backends and the k8s provisioner; the cluster domain model was extended with CIDR and encrypted secret fields; the ComputeBackend API was renamed and extended (CreateInstance → LaunchInstanceWithOptions, StartInstance); many legacy k8s SSH-based files were removed, and tests were widely parallelized.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant InstanceService
    participant ComputeBackend
    participant DockerEngine
    Client->>InstanceService: LaunchInstanceWithOptions(opts + UserData)
    InstanceService->>InstanceService: persist instance, enqueue provision job
    InstanceService->>ComputeBackend: LaunchInstanceWithOptions(opts with UserData)
    ComputeBackend->>DockerEngine: ContainerCreate (+ cloud-init / config)
    DockerEngine->>DockerEngine: create container
    ComputeBackend->>DockerEngine: ContainerStart
    DockerEngine->>DockerEngine: start container
    ComputeBackend->>DockerEngine: Exec (bootstrap/user-data handling)
    DockerEngine->>DockerEngine: run cloud-init / bootstrap script
    ComputeBackend->>InstanceService: return container ID / result
    InstanceService->>Client: return Instance
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
```go
Auth: []ssh.AuthMethod{
	ssh.PublicKeys(signer),
},
HostKeyCallback: ssh.InsecureIgnoreHostKey(),
```
Check failure: Code scanning / CodeQL

Use of insecure HostKeyCallback implementation (High)
Copilot Autofix (AI, 3 months ago)
In general, the fix is to replace ssh.InsecureIgnoreHostKey() with a HostKeyCallback that validates the server’s host key against an allow list. For a simple case where there is exactly one expected host key (typical when connecting to a single node or a node with a fixed key), ssh.FixedHostKey(parsedKey) is the recommended implementation. The public host key should be provided out-of-band (e.g., configuration) rather than learned insecurely at first connection.
For this specific snippet, the least invasive change that preserves existing behavior while adding host key verification is to:

- add a public host key field to `SSHExecutor`,
- require that key as an argument in `NewSSHExecutor`, and
- parse it once in `Run` and use `ssh.FixedHostKey` for `HostKeyCallback`.

This avoids changing the rest of the logic and keeps all security concerns local to this type. Concretely, in internal/repositories/k8s/node_executor.go:

- Extend `SSHExecutor` to include a `hostKey string` field that holds an ASCII-armored public key (e.g., `ssh-ed25519 AAAA...`).
- Change `NewSSHExecutor(ip, user, key string)` to `NewSSHExecutor(ip, user, key, hostKey string)` and store `hostKey`.
- In `Run`, before creating `ssh.ClientConfig`, parse `e.hostKey` using `ssh.ParseAuthorizedKey` (which accepts the standard public key format) and handle errors.
- Replace `HostKeyCallback: ssh.InsecureIgnoreHostKey(),` with `HostKeyCallback: ssh.FixedHostKey(parsedHostKey),`.

Imports don't need to change; golang.org/x/crypto/ssh already provides ParseAuthorizedKey and FixedHostKey. No extra dependencies are required.
```diff
@@ -54,13 +54,14 @@

 // SSHExecutor uses SSH to run commands on a node.
 type SSHExecutor struct {
-	ip   string
-	user string
-	key  string
+	ip      string
+	user    string
+	key     string
+	hostKey string
 }

-func NewSSHExecutor(ip, user, key string) *SSHExecutor {
-	return &SSHExecutor{ip: ip, user: user, key: key}
+func NewSSHExecutor(ip, user, key, hostKey string) *SSHExecutor {
+	return &SSHExecutor{ip: ip, user: user, key: key, hostKey: hostKey}
 }

 func (e *SSHExecutor) Run(ctx context.Context, cmd string) (string, error) {
@@ -69,12 +68,17 @@
 		return "", fmt.Errorf("failed to parse private key: %w", err)
 	}

+	hostPubKey, _, _, _, err := ssh.ParseAuthorizedKey([]byte(e.hostKey))
+	if err != nil {
+		return "", fmt.Errorf("failed to parse SSH host public key: %w", err)
+	}
+
 	config := &ssh.ClientConfig{
 		User: e.user,
 		Auth: []ssh.AuthMethod{
 			ssh.PublicKeys(signer),
 		},
-		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
+		HostKeyCallback: ssh.FixedHostKey(hostPubKey),
 		Timeout:         10 * time.Second,
 	}
```
Pull request overview

This PR aims to fix failing E2E/unit tests, align APIs after compute/provisioning changes (including UserData support), and improve test performance by parallelizing many test suites.

Changes:

- Renames/extends compute and instance provisioning APIs (e.g., `LaunchInstanceWithOptions`, `StartInstance`; provisioning now carries `UserData`) and updates adapters/tests accordingly.
- Refactors Kubernetes provisioning test infrastructure (centralized mocks, executor changes, new cloud-config templates) and adjusts security group behavior.
- Adds/updates Postgres migrations and broadly parallelizes unit tests (`t.Parallel()`), along with lint-driven cleanups and Swagger sync.
Reviewed changes
Copilot reviewed 144 out of 145 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| scripts/test_libvirt.go | Updates script to new compute backend API (LaunchInstanceWithOptions). |
| internal/workers/provision_worker_test.go | Updates failing backend mock to new compute backend API. |
| internal/workers/provision_worker.go | Passes UserData through to InstanceService.Provision. |
| internal/repositories/postgres/vpc_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/secret_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/rbac_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/notify_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/migrator_unit_test.go | Parallelizes migration unit test and whitespace cleanup. |
| internal/repositories/postgres/migrations/081_fix_clusters_table.up.sql | Adds missing cluster columns and migrates data from legacy columns. |
| internal/repositories/postgres/migrations/081_fix_clusters_table.down.sql | Drops newly added cluster columns. |
| internal/repositories/postgres/migrations/080_add_lb_ip.up.sql | Adds ip column to load_balancers. |
| internal/repositories/postgres/migrations/080_add_lb_ip.down.sql | Drops ip column from load_balancers. |
| internal/repositories/postgres/migrations/020_create_clusters.up.sql | Adds a new clusters schema migration file (see PR comments re: conflicts). |
| internal/repositories/postgres/lifecycle_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/image_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/identity_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/function_repo_unit_test.go | Parallelizes repository unit tests and formatting cleanup. |
| internal/repositories/postgres/encryption_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/database_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/cron_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/container_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/postgres/cluster_repo_unit_test.go | Updates cluster schema expectations and parallelizes tests. |
| internal/repositories/postgres/accounting_repo_unit_test.go | Parallelizes repository unit tests. |
| internal/repositories/noop/adapters_test.go | Updates noop compute backend tests to new API. |
| internal/repositories/libvirt/templates_test.go | Marks as integration tests; parallelizes subtests. |
| internal/repositories/libvirt/mock_client_test.go | Marks as integration tests. |
| internal/repositories/libvirt/adapter_volume_test.go | Marks as integration tests; parallelizes tests. |
| internal/repositories/libvirt/adapter_test.go | Marks as integration tests; parallelizes tests. |
| internal/repositories/libvirt/adapter_network_test.go | Marks as integration tests; parallelizes tests. |
| internal/repositories/libvirt/adapter_lifecycle_test.go | Marks as integration tests; updates to new API + parallelization. |
| internal/repositories/libvirt/adapter_context_test.go | Marks as integration tests; updates to new API + parallelization. |
| internal/repositories/libvirt/adapter.go | Renames create→launch, adds StartInstance, extends cloud-init generation to accept raw UserData. |
| internal/repositories/k8s/templates/worker.yaml | Adds worker node cloud-config template. |
| internal/repositories/k8s/templates/control_plane.yaml | Adds control-plane cloud-config template. |
| internal/repositories/k8s/security_test.go | Removes prior security tests (behavior shifted). |
| internal/repositories/k8s/security.go | Changes SG naming/rules behavior for clusters. |
| internal/repositories/k8s/provisioner_all_test.go | Removes broad/legacy provisioner tests. |
| internal/repositories/k8s/node_executor_test.go | Updates expected shell path (sh vs /bin/sh). |
| internal/repositories/k8s/node_executor.go | Reworks executor implementations; adds raw SSH execution path. |
| internal/repositories/k8s/mocks_test.go | Centralizes k8s mocks and updates interfaces. |
| internal/repositories/k8s/mock_provisioner.go | Removes old mock provisioner implementation. |
| internal/repositories/k8s/kubeadm_provisioner_test.go | Updates instance service mock + uses new launch API and list mocks. |
| internal/repositories/k8s/kubeadm.go | Removes legacy kubeadm helper code. |
| internal/repositories/k8s/health_test.go | Updates mocks and adds secret decryption path for kubeconfig. |
| internal/repositories/k8s/ha_test.go | Removes legacy HA tests. |
| internal/repositories/k8s/ha.go | Replaces HA provision flow with LB ensure helper. |
| internal/repositories/k8s/export_test.go | Removes test-only exported wrappers. |
| internal/repositories/k8s/cni_test.go | Removes legacy CNI tests. |
| internal/repositories/k8s/cni.go | Removes legacy CNI helper code. |
| internal/repositories/k8s/bootstrap.go | Removes legacy bootstrap helpers. |
| internal/repositories/docker/stats_test.go | Updates adapter constructor + launch API. |
| internal/repositories/docker/fakes_test.go | Adds call counters to fake client for async bootstrap assertions. |
| internal/repositories/docker/adapter_unit_test.go | Updates create→launch and adds UserData bootstrap unit test. |
| internal/repositories/docker/adapter_test.go | Updates constructor + launch API and adds UserData bootstrap integration test. |
| internal/repositories/docker/adapter.go | Adds logger dependency, UserData handling (script + cloud-config), and implements StartInstance. |
| internal/handlers/ws/ws_test.go | Parallelizes websocket handler tests. |
| internal/handlers/ws/hub_test.go | Parallelizes hub tests. |
| internal/handlers/ws/client_test.go | Parallelizes client tests. |
| internal/handlers/vpc_handler_test.go | Parallelizes handler tests. |
| internal/handlers/volume_handler_test.go | Parallelizes handler tests. |
| internal/handlers/tenant_handler_test.go | Parallelizes handler tests. |
| internal/handlers/subnet_handler_test.go | Parallelizes handler tests. |
| internal/handlers/storage_handler_test.go | Parallelizes handler tests. |
| internal/handlers/storage_handler_errors_test.go | Parallelizes error-table tests (including subtests). |
| internal/handlers/stack_handler_test.go | Parallelizes handler tests. |
| internal/handlers/snapshot_handler_test.go | Parallelizes handler tests. |
| internal/handlers/security_group_handler_test.go | Parallelizes handler tests. |
| internal/handlers/secret_handler_test.go | Parallelizes handler tests. |
| internal/handlers/rbac_handler_test.go | Parallelizes handler tests. |
| internal/handlers/queue_handler_test.go | Parallelizes handler tests. |
| internal/handlers/notify_handler_test.go | Parallelizes handler tests and minor constant formatting. |
| internal/handlers/lifecycle_handler_test.go | Parallelizes handler tests. |
| internal/handlers/lb_handler_test.go | Parallelizes handler tests. |
| internal/handlers/instance_type_handler_test.go | Parallelizes handler tests. |
| internal/handlers/instance_handler_test.go | Updates instance service mock interface and parallelizes handler tests. |
| internal/handlers/image_handler_test.go | Parallelizes handler tests. |
| internal/handlers/identity_handler_test.go | Parallelizes handler tests. |
| internal/handlers/helper_test.go | Parallelizes helper tests. |
| internal/handlers/health_handler_test.go | Parallelizes health handler tests. |
| internal/handlers/global_lb_handler_test.go | Parallelizes handler tests. |
| internal/handlers/gateway_handler_test.go | Parallelizes handler tests. |
| internal/handlers/function_handler_test.go | Parallelizes handler tests and minor constant formatting. |
| internal/handlers/event_handler_test.go | Parallelizes handler tests. |
| internal/handlers/dns_handler_test.go | Parallelizes handler tests. |
| internal/handlers/database_handler_test.go | Parallelizes handler tests and minor constant formatting. |
| internal/handlers/dashboard_handler_test.go | Parallelizes handler tests. |
| internal/handlers/cron_handler_test.go | Parallelizes handler tests and minor constant formatting. |
| internal/handlers/container_handler_test.go | Parallelizes handler tests. |
| internal/handlers/cluster_handler_test.go | Parallelizes handler tests. |
| internal/handlers/cache_handler_test.go | Parallelizes handler tests. |
| internal/handlers/autoscaling_handler_test.go | Parallelizes handler tests. |
| internal/handlers/auth_handler_test.go | Parallelizes handler tests. |
| internal/handlers/audit_handler_test.go | Parallelizes handler tests. |
| internal/handlers/accounting_handler_test.go | Parallelizes handler tests. |
| internal/core/services/volume_test.go | Updates Provision calls to include userData argument. |
| internal/core/services/stack_test.go | Parallelizes stack service tests. |
| internal/core/services/shared_test.go | Updates mocks for new compute + instance service interfaces. |
| internal/core/services/rbac_cached_test.go | Parallelizes cached RBAC tests. |
| internal/core/services/loadbalancer_service_test.go | Parallelizes LB service tests. |
| internal/core/services/lifecycle_svc_test.go | Parallelizes lifecycle service tests. |
| internal/core/services/lb_worker_test.go | Parallelizes LB worker tests. |
| internal/core/services/instance_test.go | Updates docker adapter construction + provision signature + test port. |
| internal/core/services/instance_ports_test.go | Parallelizes port parsing tests. |
| internal/core/services/instance_internal_test.go | Parallelizes internal helper tests. |
| internal/core/services/instance.go | Adds LaunchInstanceWithOptions, extends Provision with UserData, adds StartInstance. |
| internal/core/services/health_test.go | Updates docker adapter constructor usage. |
| internal/core/services/global_lb_test.go | Parallelizes global LB tests. |
| internal/core/services/function_test.go | Updates docker adapter constructor usage. |
| internal/core/services/database_test.go | Updates docker adapter constructor usage. |
| internal/core/services/database.go | Uses new compute backend launch API. |
| internal/core/services/dashboard_test.go | Parallelizes dashboard tests. |
| internal/core/services/cron_worker_test.go | Parallelizes cron worker tests. |
| internal/core/services/container_worker_test.go | Parallelizes container worker tests. |
| internal/core/services/container_test.go | Updates provision signature. |
| internal/core/services/cluster_test.go | Parallelizes cluster service tests and updates encrypted kubeconfig field name. |
| internal/core/services/cluster_lifecycle_test.go | Parallelizes cluster lifecycle tests. |
| internal/core/services/cluster.go | Adds Pod/Service CIDR support and uses encrypted kubeconfig field. |
| internal/core/services/cached_identity_test.go | Parallelizes cached identity tests. |
| internal/core/services/cache_test.go | Updates docker adapter constructor usage. |
| internal/core/services/cache.go | Uses new compute backend launch API. |
| internal/core/services/autoscaling_worker_test.go | Parallelizes autoscaling worker tests. |
| internal/core/services/autoscaling_test.go | Updates provision signature. |
| internal/core/ports/instance.go | Extends InstanceService port with LaunchInstanceWithOptions and StartInstance. |
| internal/core/ports/compute_options.go | Adds UserData to compute options. |
| internal/core/ports/compute.go | Renames create→launch and adds StartInstance to compute backend port. |
| internal/core/ports/cluster.go | Adds Pod/Service CIDR fields to cluster creation params. |
| internal/core/domain/ws_event_test.go | Parallelizes ws event tests. |
| internal/core/domain/security_group_test.go | Parallelizes security group/domain tests. |
| internal/core/domain/jobs.go | Adds UserData to provision job payload. |
| internal/core/domain/domain_test.go | Parallelizes smoke test. |
| internal/core/domain/dns_test.go | Parallelizes DNS domain tests. |
| internal/core/domain/cluster.go | Updates cluster/node domain model (CIDRs, encrypted secrets, heartbeat). |
| internal/api/setup/infrastructure.go | Passes logger into Docker adapter constructor. |
| docs/swagger/swagger.yaml | Adds pod_cidr/service_cidr fields to cluster schema. |
| docs/swagger/swagger.json | Adds pod_cidr/service_cidr fields to cluster schema. |
| docs/swagger/docs.go | Syncs generated swagger docs for new fields. |
| cmd/cloud/queue_cli_test.go | Minor indentation cleanup in tests. |
```diff
 	config := &ssh.ClientConfig{
 		User: e.user,
 		Auth: []ssh.AuthMethod{
 			ssh.PublicKeys(signer),
 		},
 		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
 		Timeout:         10 * time.Second,
 	}

 	addr := net.JoinHostPort(e.ip, "22")
 	client, err := ssh.Dial("tcp", addr, config)
 	if err != nil {
-		return fmt.Errorf("failed to create ssh client: %w", err)
+		return "", fmt.Errorf("failed to dial ssh: %w", err)
 	}
```
SSHExecutor disables host key verification via ssh.InsecureIgnoreHostKey(), which makes SSH connections vulnerable to MITM and is generally unsafe for production provisioning. Use a known_hosts-based HostKeyCallback (or a pinned host key / CA) and consider making insecure mode opt-in for tests only. Also note that ssh.Dial/session.Run don’t honor ctx cancellation; if ctx is canceled, the command may still hang until SSH timeouts fire—consider wiring ctx into dialing (net.Dialer.DialContext) and closing the session/client when ctx is done.
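As a rough illustration of the ctx-aware dialing suggestion, a minimal sketch (assuming the caller supplies a pinned/known-hosts callback in `config`; `dialSSH` and its wiring are hypothetical, not part of this PR):

```go
import (
	"context"
	"fmt"
	"net"

	"golang.org/x/crypto/ssh"
)

// dialSSH establishes an SSH connection that honors ctx for both dialing and
// teardown. The caller is expected to cancel ctx eventually; otherwise the
// watchdog goroutine lives for the life of the client.
func dialSSH(ctx context.Context, addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	d := net.Dialer{Timeout: config.Timeout}
	conn, err := d.DialContext(ctx, "tcp", addr) // ctx-aware TCP dial
	if err != nil {
		return nil, fmt.Errorf("failed to dial ssh: %w", err)
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, fmt.Errorf("ssh handshake failed: %w", err)
	}
	client := ssh.NewClient(c, chans, reqs)
	// Close the client when ctx is canceled so a hung session.Run unblocks.
	go func() {
		<-ctx.Done()
		client.Close()
	}()
	return client, nil
}
```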
```diff
 // NewDockerAdapter constructs a DockerAdapter with a Docker client.
-func NewDockerAdapter() (*DockerAdapter, error) {
+func NewDockerAdapter(logger *slog.Logger) (*DockerAdapter, error) {
 	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
 	if err != nil {
 		return nil, fmt.Errorf("failed to create docker client: %w", err)
 	}
-	return &DockerAdapter{cli: cli}, nil
+	return &DockerAdapter{cli: cli, logger: logger}, nil
 }
```
NewDockerAdapter stores the provided logger without nil-guarding, but LaunchInstanceWithOptions/handleUserData later call a.logger.Warn/Error/Info. If any caller passes a nil logger, this will panic at runtime. Default logger to slog.Default() (or a no-op logger) when the argument is nil.
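A minimal way to apply this, sketched against the constructor shown above (the nil-guard is the only addition):

```go
func NewDockerAdapter(logger *slog.Logger) (*DockerAdapter, error) {
	if logger == nil {
		logger = slog.Default() // fall back so a.logger.Warn/Error can't panic
	}
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return nil, fmt.Errorf("failed to create docker client: %w", err)
	}
	return &DockerAdapter{cli: cli, logger: logger}, nil
}
```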
```go
// Note: We use base64 encoding to avoid escaping issues with complex scripts
encoded := base64.StdEncoding.EncodeToString([]byte(userData))
writeCmd := []string{"sh", "-c", fmt.Sprintf("mkdir -p %s && echo %s | base64 -d > %s && chmod +x %s",
	filepath.Dir(scriptPath), encoded, scriptPath, scriptPath)}

if _, err := a.Exec(ctx, containerID, writeCmd); err != nil {
	return fmt.Errorf("failed to write userdata to container: %w", err)
}
```
The userdata write command embeds the base64 payload into a shell string via echo %s | base64 -d ... without quoting/printf. While base64 usually avoids whitespace, using printf '%s' and proper quoting is more robust and avoids edge cases with shell interpretation/length limits. Consider switching to a heredoc or printf-based approach and quoting paths consistently.
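A sketch of the printf-based variant, as a drop-in for the snippet above (the `%q` quoting choice is a judgment call, not the PR's code):

```go
encoded := base64.StdEncoding.EncodeToString([]byte(userData))
// printf avoids echo's flag/escape quirks; %q quotes the payload and paths
// so the shell never interprets them.
writeCmd := []string{"sh", "-c", fmt.Sprintf(
	"mkdir -p %q && printf '%%s' %q | base64 -d > %q && chmod +x %q",
	filepath.Dir(scriptPath), encoded, scriptPath, scriptPath)}

if _, err := a.Exec(ctx, containerID, writeCmd); err != nil {
	return fmt.Errorf("failed to write userdata to container: %w", err)
}
```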
```go
// UserData execution occurs asynchronously in a background goroutine.
// A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
time.Sleep(50 * time.Millisecond)

// Verify the two-stage bootstrap sequence:
// Stage 1: Payload delivery (writing the bootstrap script to the container filesystem).
// Stage 2: Bootstrap execution (invoking the script via a background Exec operation).
// Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach,
// and ContainerExecInspect according to the adapter's implementation.
require.Equal(t, 2, cli.Calls["ContainerExecCreate"], "Expected dual Stage (delivery + execution) Exec invocations")
require.Equal(t, 2, cli.Calls["ContainerExecAttach"], "Expected associated ExecAttach calls for I/O")
require.Equal(t, 2, cli.Calls["ContainerExecInspect"], "Expected ExecInspect calls to verify termination state")
}
```
This test uses a fixed time.Sleep to wait for an asynchronous goroutine, which can be flaky on slower CI machines (the background exec may not have completed yet). Prefer require.Eventually polling on the fake client's call counters (or synchronize via a channel/hook) so the test waits deterministically for the expected Exec sequence.
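One way to support that polling, a sketch assuming the mutex suggested elsewhere in this review is added to the fake client (`CallCount` itself is hypothetical):

```go
// CallCount is a concurrency-safe accessor for the fake client's counters,
// so require.Eventually can poll it from the test goroutine.
func (f *fakeDockerClient) CallCount(name string) int {
	f.callsMu.Lock()
	defer f.callsMu.Unlock()
	return f.Calls[name]
}

// In the test, replace the fixed sleep with deterministic polling, e.g.:
// require.Eventually(t, func() bool {
// 	return cli.CallCount("ContainerExecCreate") == 2
// }, 2*time.Second, 10*time.Millisecond)
```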
```sql
CREATE TABLE clusters (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES users(id),
    vpc_id UUID NOT NULL REFERENCES vpcs(id),
    name VARCHAR(255) NOT NULL,
```
This migration introduces a second version "020" alongside existing 020_create_notify.{up,down}.sql, which will break migration ordering/uniqueness in goose-style tooling. It also recreates the clusters/cluster_nodes tables without IF NOT EXISTS even though clusters already exists via 037_create_clusters_table.up.sql, so applying migrations will fail on a fresh DB. Renumber this migration to a new unused version (and add a matching down migration) or fold these schema changes into the existing clusters migrations instead of creating a conflicting table definition.
```go
// LaunchInstanceWithOptions provisions an instance using structured options.
func (s *InstanceService) LaunchInstanceWithOptions(ctx context.Context, opts ports.CreateInstanceOptions) (*domain.Instance, error) {
	ctx, span := otel.Tracer("instance-service").Start(ctx, "LaunchInstanceWithOptions")
	defer span.End()

	inst := &domain.Instance{
		ID:           uuid.New(),
		UserID:       appcontext.UserIDFromContext(ctx),
		TenantID:     appcontext.TenantIDFromContext(ctx),
		Name:         opts.Name,
		Image:        opts.ImageName,
		Status:       domain.StatusStarting,
		InstanceType: "custom", // Marking as custom since we are passing raw constraints or defaults
		Version:      1,
		CreatedAt:    time.Now(),
		UpdatedAt:    time.Now(),
	}

	if opts.NetworkID != "" {
		vpcID, err := uuid.Parse(opts.NetworkID)
		if err == nil {
			inst.VpcID = &vpcID
		}
	}

	if err := s.repo.Create(ctx, inst); err != nil {
		return nil, err
	}

	// 4. Enqueue provision task with UserData
	job := domain.ProvisionJob{
		InstanceID: inst.ID,
		UserID:     inst.UserID,
		TenantID:   inst.TenantID,
		UserData:   opts.UserData,
	}

	if err := s.taskQueue.Enqueue(ctx, "provision_queue", job); err != nil {
		s.logger.Error("failed to enqueue provision job", "instance_id", inst.ID, "error", err)
		return nil, errors.Wrap(errors.Internal, "failed to enqueue provisioning task", err)
	}

	return inst, nil
```
LaunchInstanceWithOptions currently creates an Instance record without persisting key options needed by Provision (e.g., Ports, SubnetID/VpcID, and any volume attachments), and enqueues a ProvisionJob without setting Volumes. As a result, Provision will launch with empty port mappings and no volumes, and may fail later unless an instance type named "custom" exists in the DB. Consider either (a) mapping opts into the Instance model (validate/join opts.Ports into inst.Ports, store the correct VPC/Subnet identifiers, ensure a valid InstanceType, and propagate volumes via the job), or (b) defining a dedicated service-level options struct instead of reusing compute-level CreateInstanceOptions (where NetworkID is not necessarily a UUID).
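A sketch of option (a), under stated assumptions: that `domain.Instance` carries a `Ports` field, that `errors.InvalidInput` is an available error code, and that the job's volume field exists (all three are assumptions, not confirmed by this PR):

```go
// Map the compute options onto the persisted instance instead of dropping them.
inst.Ports = opts.Ports
if opts.NetworkID != "" {
	vpcID, err := uuid.Parse(opts.NetworkID)
	if err != nil {
		return nil, errors.Wrap(errors.InvalidInput, "invalid network id", err) // fail loudly, don't ignore
	}
	inst.VpcID = &vpcID
}

job := domain.ProvisionJob{
	InstanceID: inst.ID,
	UserID:     inst.UserID,
	TenantID:   inst.TenantID,
	UserData:   opts.UserData,
	Volumes:    opts.Volumes, // hypothetical field carrying volume attachments
}
```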
```go
	if inst.Status == domain.StatusRunning {
		return nil // Already running
	}

	// 2. Call Compute backend
	target := inst.ContainerID
	if target == "" {
		// Try to recover ID from name if missing
		target = s.formatContainerName(inst.ID)
	}

	if err := s.compute.StartInstance(ctx, target); err != nil {
		platform.InstanceOperationsTotal.WithLabelValues("start", "failure").Inc()
		s.logger.Error("failed to start instance", "instance_id", inst.ID, "container_id", target, "error", err)
		return errors.Wrap(errors.Internal, "failed to start instance", err)
	}

	// 3. Update Metrics & Status
	platform.InstancesTotal.WithLabelValues("stopped", s.compute.Type()).Dec()
	platform.InstancesTotal.WithLabelValues("running", s.compute.Type()).Inc()
	platform.InstanceOperationsTotal.WithLabelValues("start", "success").Inc()
```
StartInstance updates Prometheus counters by decrementing the "stopped" gauge unconditionally. If StartInstance is called while the instance is in STARTING/ERROR (or any non-stopped state), this can drive the "stopped" metric negative and make state counts inaccurate. Gate the gauge updates on the prior status being STOPPED (or recompute gauges from DB periodically) before Dec/Inc.
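One way to gate the gauges, sketched with a `prevStatus` captured before the backend call (the variable name and `domain.StatusStopped` are assumptions):

```go
prevStatus := inst.Status // captured before s.compute.StartInstance

// Only move the stopped->running gauge pair if the instance really was stopped,
// so the "stopped" gauge can never go negative.
if prevStatus == domain.StatusStopped {
	platform.InstancesTotal.WithLabelValues("stopped", s.compute.Type()).Dec()
	platform.InstancesTotal.WithLabelValues("running", s.compute.Type()).Inc()
}
platform.InstanceOperationsTotal.WithLabelValues("start", "success").Inc()
```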
```diff
-	// Add default rules (K8s control plane ports)
-	rules := []domain.SecurityRule{
-		{Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},   // API Server
-		{Protocol: "udp", PortMin: 4789, PortMax: 4789, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},   // VXLAN
-		{Protocol: "tcp", PortMin: 179, PortMax: 179, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100},     // BGP
-		{Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // Kubelet
-		{Protocol: "tcp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort TCP
-		{Protocol: "udp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort UDP
-	}
-
-	for _, r := range rules {
-		_, _ = p.sgSvc.AddRule(ctx, sg.ID, r)
+	rules := []domain.SecurityRule{
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: "0.0.0.0/0"},
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 22, PortMax: 22, CIDR: "0.0.0.0/0"},
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 2379, PortMax: 2380, CIDR: "10.0.0.0/8"},   // etcd (internal)
+		{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: "10.0.0.0/8"}, // kubelet
 	}

 	return nil
 }

-func (p *KubeadmProvisioner) applyBaseSecurity(ctx context.Context, cluster *domain.Cluster, masterIP string) error {
-	exec, err := p.getExecutor(ctx, cluster, masterIP)
-	if err != nil {
-		return err
-	}
-
-	p.logger.Info("applying base security manifests", "ip", masterIP, "isolation", cluster.NetworkIsolation)
-
-	var securityManifests string
-	if cluster.NetworkIsolation {
-		// Apply a default-deny ingress policy for the default namespace
-		securityManifests = `
-apiVersion: networking.k8s.io/v1
-kind: NetworkPolicy
-metadata:
-  name: default-deny-ingress
-  namespace: default
-spec:
-  podSelector: {}
-  policyTypes:
-  - Ingress
-`
-	}
-	if securityManifests != "" {
-		manifestFile := "/tmp/base-security.yaml"
-		_, _ = exec.Run(ctx, fmt.Sprintf("cat <<EOF > %s\n%s\nEOF", manifestFile, securityManifests))
-		_, err = exec.Run(ctx, fmt.Sprintf("%s apply -f %s", kubectlBase, manifestFile))
-		return err
+	for _, rule := range rules {
+		_, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule)
 	}
```
ensureClusterSecurityGroup ignores errors from AddRule (`_, _ = ...`), so the security group may be created without required rules and provisioning will continue silently. Return or log the error and fail fast (or at least warn) so misconfigured clusters don't end up unreachable. Also consider whether opening SSH (22) and the API server (6443) to 0.0.0.0/0 is intended; if not, restrict them to a configurable admin CIDR or the VPC CIDR by default.
Actionable comments posted: 4
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (18)
internal/handlers/event_handler_test.go (1)
33-34: ⚠️ Potential issue | 🟡 Minor

Potential race condition with `gin.SetMode` in parallel tests.

`gin.SetMode(gin.TestMode)` modifies global state in the gin package. With `t.Parallel()` enabled, multiple tests calling `setupEventHandlerTest` concurrently will race on this global variable, potentially triggering race detector warnings.

Consider moving `gin.SetMode(gin.TestMode)` to a `TestMain` function or using `sync.Once` to ensure it's called only once.

🛠️ Proposed fix using sync.Once

```diff
+var ginModeOnce sync.Once
+
 func setupEventHandlerTest(_ *testing.T) (*mockEventService, *EventHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
+	ginModeOnce.Do(func() {
+		gin.SetMode(gin.TestMode)
+	})
 	svc := new(mockEventService)
 	handler := NewEventHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }
```

Also add `"sync"` to the imports.

Also applies to: 41-42
internal/handlers/instance_type_handler_test.go (1)
29-35: ⚠️ Potential issue | 🟠 Major

Data race: `gin.SetMode` is called after `t.Parallel()`.

`gin.SetMode(gin.TestMode)` modifies a package-level global variable. When this test runs in parallel with other gin tests that also call `gin.SetMode`, you'll encounter a data race (detectable with the `-race` flag).

Move gin mode initialization to `TestMain` so it runs once before any parallel tests execute:

```go
func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}
```

Then remove line 31 from this test and any other parallel tests in the file.
internal/repositories/postgres/cron_repo_unit_test.go (1)
14-41: ⚠️ Potential issue | 🟡 Minor

Convert these tests to table-driven style for better maintainability.

The tests follow a repetitive pattern with single cases each. Consider refactoring to a table-driven structure using a `cases` slice and `t.Run` for test organization. Apply consistently to `TestCronRepository_CreateJob`, `TestCronRepository_GetJobByID`, and `TestCronRepository_ListJobs`.

Example refactor (apply same pattern to other tests)

```diff
 func TestCronRepository_CreateJob(t *testing.T) {
 	t.Parallel()
-	mock, err := pgxmock.NewPool()
-	assert.NoError(t, err)
-	defer mock.Close()
-
-	repo := NewPostgresCronRepository(mock)
-	job := &domain.CronJob{
-		ID:            uuid.New(),
-		UserID:        uuid.New(),
-		Name:          "test-job",
-		Schedule:      "* * * * *",
-		TargetURL:     "http://test",
-		TargetMethod:  "POST",
-		TargetPayload: "{}",
-		Status:        domain.CronStatusActive,
-		NextRunAt:     nil,
-		CreatedAt:     time.Now(),
-		UpdatedAt:     time.Now(),
-	}
-
-	mock.ExpectExec("INSERT INTO cron_jobs").
-		WithArgs(job.ID, job.UserID, job.Name, job.Schedule, job.TargetURL, job.TargetMethod, job.TargetPayload, job.Status, job.NextRunAt, job.CreatedAt, job.UpdatedAt).
-		WillReturnResult(pgxmock.NewResult("INSERT", 1))
-
-	err = repo.CreateJob(context.Background(), job)
-	assert.NoError(t, err)
+	cases := []struct {
+		name string
+		job  *domain.CronJob
+	}{
+		{
+			name: "active-job",
+			job: &domain.CronJob{
+				ID:            uuid.New(),
+				UserID:        uuid.New(),
+				Name:          "test-job",
+				Schedule:      "* * * * *",
+				TargetURL:     "http://test",
+				TargetMethod:  "POST",
+				TargetPayload: "{}",
+				Status:        domain.CronStatusActive,
+				NextRunAt:     nil,
+				CreatedAt:     time.Now(),
+				UpdatedAt:     time.Now(),
+			},
+		},
+	}
+
+	for _, tc := range cases {
+		tc := tc
+		t.Run(tc.name, func(t *testing.T) {
+			mock, err := pgxmock.NewPool()
+			assert.NoError(t, err)
+			defer mock.Close()
+
+			repo := NewPostgresCronRepository(mock)
+			mock.ExpectExec("INSERT INTO cron_jobs").
+				WithArgs(tc.job.ID, tc.job.UserID, tc.job.Name, tc.job.Schedule, tc.job.TargetURL, tc.job.TargetMethod, tc.job.TargetPayload, tc.job.Status, tc.job.NextRunAt, tc.job.CreatedAt, tc.job.UpdatedAt).
+				WillReturnResult(pgxmock.NewResult("INSERT", 1))
+
+			err = repo.CreateJob(context.Background(), tc.job)
+			assert.NoError(t, err)
+		})
+	}
 }
```
73-89:⚠️ Potential issue | 🟡 MinorAvoid silent Close errors (and double-close) in the websocket lifecycle.
_ = conn.Close()hides failures and makes it easy to miss shutdown issues. Prefert.Cleanupwith error checks and avoid double-closing.🔧 Suggested fix
conn, _, err := dialer.Dial(wsURL, nil) assert.NoError(t, err) - defer func() { _ = conn.Close() }() + t.Cleanup(func() { + if conn != nil { + assert.NoError(t, conn.Close()) + } + }) time.Sleep(100 * time.Millisecond) assert.Equal(t, 1, hub.ClientCount()) @@ - _ = conn.Close() + assert.NoError(t, conn.Close()) + conn = nilAs per coding guidelines: Do not use silent failures - avoid blank identifier assignment like
_ = someFunc().internal/repositories/postgres/lifecycle_repo_unit_test.go (1)
47-48:⚠️ Potential issue | 🟡 MinorAdd
ExpectationsWereMet()assertions to verify mock expectations were fulfilled in all test cases.Tests using pgxmock must verify that expected database calls were actually executed. Without these assertions, tests can pass even if the mocked queries are never invoked.
✅ Proposed fix
- assert.NoError(t, err) + assert.NoError(t, err) + assert.NoError(t, mock.ExpectationsWereMet())Also applies to: lines 63-66 (Get test) and 79-80 (Delete test)
internal/core/services/cron_worker_test.go (1)
57-71: ⚠️ Potential issue | 🟡 Minor

Avoid fixed localhost ports in parallel tests.

`http://localhost:12345` can be unexpectedly available under parallel runs. Use a closed `httptest.Server` (or a mocked HTTP client) to guarantee failure deterministically.

🔧 Suggested adjustment

```diff
-	// No server, request should fail
+	// Closed server ensures deterministic connection failure
+	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
+	server.Close()
 	ctx := context.Background()
 	jobID := uuid.New()
 	job := &domain.CronJob{
@@
-		TargetURL: "http://localhost:12345/unreachable",
+		TargetURL: server.URL,
 		Schedule:  "* * * * *",
 	}
```
219-259:⚠️ Potential issue | 🟡 MinorAvoid real network calls in
processHealthCheckstest.
The default dialer can hit the host network and become flaky (e.g., if the port is open). InjectmockDialerto make the test deterministic.🔧 Suggested fix
proxy := new(MockLBProxyAdapter) worker := NewLBWorker(lbRepo, instRepo, proxy) + worker.dialer = &mockDialer{err: fmt.Errorf("dial failed")}As per coding guidelines: Do not test with real external dependencies in unit tests - use mocks instead.
internal/repositories/libvirt/adapter_volume_test.go (2)
99-139: ⚠️ Potential issue | 🟠 Major

Same race condition concern applies here.

`TestRestoreVolumeSnapshotSuccess` also modifies both the `execCommand` and `mkdirTemp` package-level variables (lines 105-123) while running in parallel. This compounds the race condition risk with `TestCreateVolumeSnapshotSuccess`.

70-97: ⚠️ Potential issue | 🟠 Major

Confirmed race condition with package-level variable mutations during parallel test execution.

`TestCreateVolumeSnapshotSuccess` (line 71) and `TestRestoreVolumeSnapshotSuccess` (line 100) both use `t.Parallel()` while simultaneously modifying the package-level variables `execCommand` (lines 78 and 107) and `mkdirTemp` (line 114). Even with defer restoration, concurrent access to these shared globals creates a data race where test state can bleed into other concurrent tests.

Remove `t.Parallel()` from these tests, use synchronization primitives (e.g., `sync.Mutex`) to protect the package-level variables, or refactor to avoid modifying globals.

internal/repositories/libvirt/adapter_unit_test.go (2)
62-74: ⚠️ Potential issue | 🟠 Major

Capture range variables before parallel subtests to avoid shared closure variable bugs.

In table-driven tests with `t.Parallel()`, the loop variable `tt` is shared across all parallel goroutines. Without capturing it, assertions may read the wrong test case data, causing test flakiness.

Apply this fix to all six locations:

Suggested fix

```diff
 for _, tt := range tests {
+	tt := tt
 	t.Run(tt.name, func(t *testing.T) {
 		t.Parallel()
 		// ...
 	})
 }
```

Applies to lines: 62–74, 150–161, 182–187, 233–238, 276–285, 343–354
421-433: ⚠️ Potential issue | 🟠 Major

Remove `t.Parallel()` from tests that mutate package-level test hooks.

These three tests override `osOpen` and `execCommand` (package-level variables defined in adapter.go) while declaring `t.Parallel()`. Running them in parallel with other tests can cause data races and flaky failures. The defer restoration does not prevent interference during test execution.

Affected tests:

- `TestGetInstanceLogs` (line 421)
- `TestCleanupPortMappings` (line 539)
- `TestGenerateCloudInitISO` (line 559)

Remove the `t.Parallel()` call from each of these test functions.

internal/repositories/docker/fakes_test.go (1)
49-176: ⚠️ Potential issue | 🟡 Minor

Guard `Calls` with a mutex to avoid races under parallel tests.

`Calls` is a shared map updated from multiple methods; with parallelized tests or concurrent adapter code, this can trigger data races.

🔒 Suggested fix

```diff
 import (
 	"bufio"
 	"bytes"
 	"context"
 	"errors"
 	"io"
 	"net"
 	"strings"
+	"sync"

 	"github.com/docker/docker/api/types"
@@
 type fakeDockerClient struct {
@@
 	networkCreateErr error
 	networkRemoveErr error

+	callsMu sync.Mutex
 	Calls   map[string]int
 }
@@
 func (f *fakeDockerClient) inc(name string) {
+	f.callsMu.Lock()
+	defer f.callsMu.Unlock()
 	if f.Calls == nil {
 		f.Calls = make(map[string]int)
 	}
 	f.Calls[name]++
 }
```

internal/repositories/libvirt/adapter_test.go (1)
172-210: ⚠️ Potential issue | 🟠 Major

Avoid `t.Parallel()` while overriding the global `execCommandContext`.

This test mutates a global function and can run concurrently with other tests, causing cross-test interference or data races. Prefer keeping this test serial or guarding the override with a package-level mutex.

🛠️ Suggested fix (serialize this test)

```diff
 func TestLBProxyAdapterOperations(t *testing.T) {
-	t.Parallel()
 	// Mock execCommandContext using a hermetic helper process
 	oldExec := execCommandContext
@@
 	t.Run("DeployProxy", func(t *testing.T) {
-		t.Parallel()
 		id, err := a.DeployProxy(ctx, lb, nil)
@@
 	t.Run("UpdateProxyConfig", func(t *testing.T) {
-		t.Parallel()
 		err := a.UpdateProxyConfig(ctx, lb, nil)
@@
 	t.Run("RemoveProxy", func(t *testing.T) {
-		t.Parallel()
 		err := a.RemoveProxy(ctx, lb.ID)
```

internal/core/services/function_test.go (1)
23-33: ⚠️ Potential issue | 🟠 Major

Avoid real Docker dependency in unit tests (mark as integration or mock).

`setupFunctionServiceTest` instantiates a real Docker adapter at line 30, which couples this test suite to a Docker daemon. This file lacks an integration build tag. Please either (a) switch to a mocked `ports.ComputeBackend` (e.g., using testify/mock) or (b) mark this file as an integration test with a `//go:build integration` tag so unit test suites remain hermetic. As per coding guidelines, **/*_test.go must not test with real external dependencies; use mocks instead.

internal/repositories/libvirt/adapter_network_test.go (2)
22-103: ⚠️ Potential issue | 🟠 Major

Avoid parallelizing tests that mutate package-level exec hooks.

`execCommand`/`lookPath` are package-level and swapped in `TestSetupPortForwardingSuccess`, so running these tests in parallel can race and flake. Please run these tests serially (or refactor to inject exec functions per adapter).

Proposed fix (serialize the affected tests)

```diff
 func TestSetupPortForwardingSuccess(t *testing.T) {
-	t.Parallel()
+	// Keep serial: mutates package-level exec hooks
 	// Mock execCommand
 	oldExec := execCommand
 	defer func() { execCommand = oldExec }()
@@
 func TestSetupPortForwardingErrors(t *testing.T) {
-	t.Parallel()
+	// Keep serial: depends on package-level exec hooks
 	m := new(MockLibvirtClient)
@@
 func TestConfigureIptablesNoPath(t *testing.T) {
-	t.Parallel()
+	// Keep serial: depends on package-level exec hooks
 	// This test depends on environment, but we can verify it doesn't crash
 	m := new(MockLibvirtClient)
```
105-126: ⚠️ Potential issue | 🟠 Major

Don't run parallel subtests while sharing the same mock/adapter.

The subtests share `MockLibvirtClient` and `LibvirtAdapter`, and testify/mock isn't thread-safe, so `t.Parallel()` here can race. Either remove parallelization or create separate mocks/adapters per subtest.

Proposed fix (run subtests sequentially)

```diff
 t.Run("DomainNotFound", func(t *testing.T) {
-	t.Parallel()
 	m.On("DomainLookupByName", ctx, "missing").Return(libvirt.Domain{}, libvirt.Error{Code: 1})
 	_, err := a.GetInstanceIP(ctx, "missing")
 	assert.Error(t, err)
 })
@@
 t.Run("NoMAC", func(t *testing.T) {
-	t.Parallel()
 	dom := libvirt.Domain{Name: "nomac"}
 	m.On("DomainLookupByName", ctx, "nomac").Return(dom, nil)
```

scripts/test_libvirt.go (1)
46-58: ⚠️ Potential issue | 🟡 Minor

Update the log/comment to reflect the new API name.

The comment and error message still refer to `CreateInstance`, which no longer matches the call site. Keeping this consistent avoids confusion when running the script.

Suggested fix

```diff
-// Since we implement a "create empty volume" fallback in CreateInstance, it should at least define the VM.
+// Since we implement a "create empty volume" fallback in LaunchInstanceWithOptions, it should at least define the VM.
...
-	fmt.Printf("CreateInstance failed: %v\n", err)
+	fmt.Printf("LaunchInstanceWithOptions failed: %v\n", err)
```

internal/repositories/k8s/ha.go (1)
31-49: ⚠️ Potential issue | 🟠 Major

Make the LB IP wait loop context-aware and eliminate magic numbers.

`time.Sleep` ignores cancellation, so the wait can block even if `ctx` is done. Also, the port/retry/interval literals should be named constants.

Suggested fix

```diff
+const (
+	apiServerPort  = 6443
+	lbIPRetries    = 10
+	lbIPRetryDelay = 2 * time.Second
+)
...
-	lb, err := p.lbSvc.Create(ctx, lbName, cluster.VpcID, 6443, "round-robin", cluster.ID.String())
+	lb, err := p.lbSvc.Create(ctx, lbName, cluster.VpcID, apiServerPort, "round-robin", cluster.ID.String())
...
-	for i := 0; i < 10; i++ {
+	for i := 0; i < lbIPRetries; i++ {
 		if lb.IP != "" {
 			lbIP = lb.IP
 			break
 		}
-		time.Sleep(2 * time.Second)
+		select {
+		case <-ctx.Done():
+			return fmt.Errorf("timeout waiting for LB IP: %w", ctx.Err())
+		case <-time.After(lbIPRetryDelay):
+		}
 		if lbNew, err := p.lbSvc.Get(ctx, lb.ID); err == nil {
 			lb = lbNew
 		}
 	}
```

As per coding guidelines: Propagate context.Context to all blocking calls; Do not use magic numbers - use named constants instead.
🤖 Fix all issues with AI agents
In `@internal/core/services/instance.go`:
- Around line 152-182: The code silently ignores invalid opts.NetworkID and
fails to persist CreateInstanceOptions (Ports, VolumeBinds, Env, Cmd, CPULimit,
MemoryLimit, DiskLimit) so Provision later reads empty inst fields; validate
opts.NetworkID (return error on invalid uuid when opts.NetworkID != "") and
persist the options either by adding those fields to the domain.Instance before
calling s.repo.Create (e.g., populate inst.Ports, inst.VolumeBinds, inst.Env,
inst.Cmd, inst.CPULimit, inst.MemoryLimit, inst.DiskLimit), or extend
domain.ProvisionJob to carry the full CreateInstanceOptions and populate
ProvisionJob.UserData/fields from opts (or simply attach the entire
CreateInstanceOptions to ProvisionJob) so the Provision handler (which reads
inst.Ports and Env/Cmd) receives the real values instead of nil; ensure
InstanceType is set appropriately (not hardcoded "custom") or documented/handled
elsewhere.
In `@internal/repositories/docker/adapter.go`:
- Around line 208-234: handleUserData (and likewise processCloudConfig) spawns a
goroutine that calls a.Exec with context.Background(), which bypasses the parent
ctx, prevents cancellation propagation, and causes concurrent Exec calls that
race on shared test state; change the background execution to a synchronous call
using a derived context with a timeout from the provided ctx (e.g.,
ctxWithTimeout, cancel := context.WithTimeout(ctx, <reasonableDuration>); defer
cancel()), call a.Exec(ctxWithTimeout, containerID, ...) directly instead of
launching a goroutine, and propagate errors back to the caller (or log them
consistently via a.logger) so Exec is not invoked concurrently from multiple
goroutines and respects cancellation; apply the same change in
processCloudConfig and ensure any callers handle the returned error.
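A sketch of the synchronous shape this instruction describes (`bootstrapTimeout` and `buildWriteCmd` are hypothetical names):

```go
func (a *DockerAdapter) handleUserData(ctx context.Context, containerID, userData string) error {
	// Derive a bounded context from the caller instead of context.Background().
	execCtx, cancel := context.WithTimeout(ctx, bootstrapTimeout)
	defer cancel()

	// Stage 1: deliver the payload synchronously.
	if _, err := a.Exec(execCtx, containerID, buildWriteCmd(userData)); err != nil {
		return fmt.Errorf("failed to write userdata to container: %w", err)
	}
	// Stage 2: run the bootstrap script inline so errors surface to the caller.
	if _, err := a.Exec(execCtx, containerID, []string{"sh", "-c", scriptPath}); err != nil {
		return fmt.Errorf("failed to execute userdata script: %w", err)
	}
	return nil
}
```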
In `@internal/repositories/k8s/mocks_test.go`:
- Around line 89-112: There is a duplicate MockSecretService type and method
implementations (e.g., CreateSecret, GetSecret, Decrypt) which will cause a
duplicate type error; remove the entire MockSecretService definition and its
methods from this file and rely on the single shared MockSecretService already
defined elsewhere in the package, updating any tests in this file to use that
existing MockSecretService (and its m.Called stubs) instead of declaring a new
one.
In `@internal/repositories/k8s/provisioner.go`:
- Around line 155-160: The current fallback stores plaintext kubeconfig when
encryption fails (setting cluster.KubeconfigEncrypted = kubeconfig), which is a
security risk; update the error path in the block that handles
encryptedKubeconfig (the branch using encryptedKubeconfig, kubeconfig and
p.logger.Error) to refuse to persist plaintext: log the error via p.logger.Error
as-is, and return or propagate an error from the surrounding function instead of
assigning the unencrypted kubeconfig to cluster.KubeconfigEncrypted so the
caller can handle the failure and no plaintext kubeconfig is saved.
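A sketched fail-closed error path for that kubeconfig branch (the exact shape of the encrypt call is an assumption):

```go
encryptedKubeconfig, err := p.encryptSvc.Encrypt(ctx, kubeconfig) // assumed encryption helper
if err != nil {
	p.logger.Error("failed to encrypt kubeconfig", "cluster_id", cluster.ID, "error", err)
	// Never fall back to plaintext; surface the failure to the caller instead.
	return fmt.Errorf("refusing to persist plaintext kubeconfig: %w", err)
}
cluster.KubeconfigEncrypted = encryptedKubeconfig
```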
🟠 Major comments (21)
internal/handlers/dns_handler_test.go-26-28 (1)
26-28: ⚠️ Potential issue | 🟠 Major

Race condition: `gin.SetMode()` modifies global state.

All test functions call `gin.SetMode(gin.TestMode)` after `t.Parallel()`. Since `gin.SetMode()` writes to a package-level global variable without synchronization, running these tests in parallel creates a data race that the Go race detector will flag.

Move the call to a `TestMain` function so it executes once before any parallel tests run:

🔒️ Proposed fix: Add TestMain and remove per-test gin.SetMode() calls

Add this function to the file:

```go
func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}
```

Then remove `gin.SetMode(gin.TestMode)` from each test function:

```diff
 func TestCreateZoneHandler(t *testing.T) {
 	t.Parallel()
-	gin.SetMode(gin.TestMode)

 	t.Run("success", func(t *testing.T) {
```

Apply the same removal to all other test functions in this file.
internal/handlers/identity_handler_test.go-74-78 (1)
74-78: ⚠️ Potential issue | 🟠 Major

Parallel tests race on Gin's unprotected global mode setting.

`gin.SetMode` writes to package-level globals without synchronization, making it fundamentally not thread-safe. When `t.Parallel()` tests call `setupIdentityHandlerTest`, they concurrently mutate Gin's mode state, causing data races and flaky failures under `-race`. Move `gin.SetMode(gin.TestMode)` to an `init()` function (or `TestMain`) at the test package level so it executes once before parallel tests begin, then remove the call from `setupIdentityHandlerTest`.

✅ Suggested fix

```diff
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
 	"testing"

 	"github.com/gin-gonic/gin"
 	"github.com/google/uuid"
 	"github.com/poyrazk/thecloud/internal/core/domain"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/mock"
 )
+
+func init() {
+	gin.SetMode(gin.TestMode)
+}

 const (
 	authKeysPath = "/auth/keys"
 	testKeyName  = "Test Key"
 )
@@
 func setupIdentityHandlerTest(userID uuid.UUID) (*mockIdentityService, *IdentityHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockIdentityService)
 	handler := NewIdentityHandler(svc)
 	r := gin.New()
```
78-81:⚠️ Potential issue | 🟠 MajorMove
gin.SetModetoTestMainto prevent race conditions witht.Parallel.
gin.SetModeis a global operation in gin-gonic; calling it from parallel tests races and causes-racefailures. Set the mode once inTestMainand remove it fromsetupDatabaseHandlerTest.🔧 Suggested fix (set mode once in TestMain)
import ( "bytes" "context" "encoding/json" "net/http" "net/http/httptest" + "os" "testing" "github.com/gin-gonic/gin" @@ ) +func TestMain(m *testing.M) int { + gin.SetMode(gin.TestMode) + return m.Run() +} + func setupDatabaseHandlerTest(_ *testing.T) (*mockDatabaseService, *DatabaseHandler, *gin.Engine) { - gin.SetMode(gin.TestMode) svc := new(mockDatabaseService) handler := NewDatabaseHandler(svc) r := gin.New() return svc, handler, r }internal/handlers/cache_handler_test.go-82-84 (1)
82-84:⚠️ Potential issue | 🟠 MajorParallel tests race on gin.SetMode global state.
t.Parallel()now allows concurrent calls tosetupCacheHandlerTest, butgin.SetModemutates global state without synchronization and will trigger data races under-race. Move it to a packageinit()so it runs once at startup.🔧 Proposed fix (move SetMode to init)
+func init() { + gin.SetMode(gin.TestMode) +} + func setupCacheHandlerTest(_ *testing.T) (*mockCacheService, *CacheHandler, *gin.Engine) { - gin.SetMode(gin.TestMode) svc := new(mockCacheService) handler := NewCacheHandler(svc) r := gin.New() return svc, handler, r }internal/handlers/secret_handler_test.go-85-86 (1)
85-86:⚠️ Potential issue | 🟠 MajorParallel tests race on gin.SetMode—move to TestMain
t.Parallel()runs all 8 tests concurrently. Each callssetupSecretHandlerTest, which invokesgin.SetMode(gin.TestMode). Gin'sSetModewrites to unsynchronized package-level variables without mutex protection, causing a data race. Set the mode once inTestMainbefore tests run.🔧 Suggested fix
import ( "bytes" "context" "encoding/json" "net/http" "net/http/httptest" + "os" "testing" "github.com/gin-gonic/gin" "github.com/google/uuid" "github.com/poyrazk/thecloud/internal/core/domain" "github.com/poyrazk/thecloud/internal/errors" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/mock" ) + +func TestMain(m *testing.M) { + gin.SetMode(gin.TestMode) + os.Exit(m.Run()) +} func setupSecretHandlerTest(_ *testing.T) (*mockSecretService, *SecretHandler, *gin.Engine) { - gin.SetMode(gin.TestMode) svc := new(mockSecretService) handler := NewSecretHandler(svc) r := gin.New() return svc, handler, r }Also applies to: 109-110, 127-128, 146-147, 164-165, 215-216, 239-240, 251-252
internal/handlers/lb_handler_test.go-77-78 (1)
77-78: ⚠️ Potential issue | 🟠 Major

Guard `gin.SetMode` from concurrent calls under `t.Parallel()`.

`setupLBHandlerTest` mutates the unsynchronized global gin mode; parallel tests trigger race conditions. Use `sync.Once` or move the call to `TestMain`.

🔧 Suggested fix (sync.Once)

```diff
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"net/http"
 	"net/http/httptest"
+	"sync"
 	"testing"
 )

+var ginModeOnce sync.Once
+
 func setupLBHandlerTest(_ *testing.T) (*mockLBService, *LBHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
+	ginModeOnce.Do(func() { gin.SetMode(gin.TestMode) })
 	svc := new(mockLBService)
 	handler := NewLBHandler(svc)
 	r := gin.New()
 	return svc, handler, r
 }
```

Applies to lines: 77–78, 103–104, 121–122, 140–141, 158–159, 183–184, 202–203
internal/handlers/queue_handler_test.go-90-91 (1)
90-91: ⚠️ Potential issue | 🟠 Major

Move `gin.SetMode` outside parallel test execution.

`gin.SetMode()` is not thread-safe and writes to package globals without synchronization. Calling it from `setupQueueHandlerTest()` in multiple parallel tests causes a data race. Set the mode once before any parallel tests run, using `TestMain` or `sync.Once`.

🔧 Suggested fix (TestMain)

```diff
+func TestMain(m *testing.M) {
+	gin.SetMode(gin.TestMode)
+	code := m.Run()
+	os.Exit(code)
+}
+
 func setupQueueHandlerTest(_ *testing.T) (*mockQueueService, *QueueHandler, *gin.Engine) {
-	gin.SetMode(gin.TestMode)
 	svc := new(mockQueueService)
```
24-30:⚠️ Potential issue | 🟠 MajorAvoid hard-coded wide-open ingress CIDRs.
Using
0.0.0.0/0for SSH and fixed10.0.0.0/8for control-plane ports can expose clusters or misconfigure non-10/8 networks. Consider deriving CIDRs from the cluster/VPC or a configurable allowlist.internal/repositories/k8s/security.go-32-34 (1)
32-34:⚠️ Potential issue | 🟠 MajorHandle AddRule failures instead of dropping them.
Ignoring these errors can leave the cluster without required ingress rules while still returning success.
As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.🐛 Suggested fix
- for _, rule := range rules { - _, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule) - } + for _, rule := range rules { + if _, err := p.sgSvc.AddRule(ctx, newSG.ID, rule); err != nil { + return fmt.Errorf("failed to add security group rule: %w", err) + } + }internal/core/services/health_test.go-3-6 (1)
3-6:⚠️ Potential issue | 🟠 MajorAvoid real Docker dependency in this unit test.
docker.NewDockerAdapter(...)requires a running Docker daemon, which makes this a brittle unit test. Prefer a mock/noop compute backend here, or mark the test as integration-only if it must hit Docker.As per coding guidelines: **/*_test.go: Do not test with real external dependencies in unit tests - use mocks instead.Proposed fix (swap to noop backend)
import ( "context" - "log/slog" "testing" "github.com/poyrazk/thecloud/internal/core/services" - "github.com/poyrazk/thecloud/internal/repositories/docker" "github.com/poyrazk/thecloud/internal/repositories/noop" "github.com/stretchr/testify/assert" - "github.com/stretchr/testify/require" ) @@ - compute, err := docker.NewDockerAdapter(slog.Default()) - require.NoError(t, err) + compute := noop.NewNoopComputeBackend()Also applies to: 15-22
internal/repositories/docker/adapter_unit_test.go-393-405 (1)
393-405: ⚠️ Potential issue | 🟠 Major

Fix data race in `TestDockerAdapterLaunchWithUserData` (CI reports race).

The background goroutine mutates `cli.Calls` while the test reads it after a `Sleep`, which matches the race reported in CI. Use a synchronization mechanism and make the call counters concurrency-safe (mutex/atomic) instead of `time.Sleep`.

Example approach

```diff
-	time.Sleep(50 * time.Millisecond)
-
-	require.Equal(t, 2, cli.Calls["ContainerExecCreate"], "Expected dual Stage (delivery + execution) Exec invocations")
-	require.Equal(t, 2, cli.Calls["ContainerExecAttach"], "Expected associated ExecAttach calls for I/O")
-	require.Equal(t, 2, cli.Calls["ContainerExecInspect"], "Expected ExecInspect calls to verify termination state")
+	require.Eventually(t, func() bool {
+		return cli.CallCount("ContainerExecCreate") == 2 &&
+			cli.CallCount("ContainerExecAttach") == 2 &&
+			cli.CallCount("ContainerExecInspect") == 2
+	}, 2*time.Second, 10*time.Millisecond)
```

(Implement `CallCount` with a mutex around `cli.Calls` in `fakeDockerClient`.)
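A sketch of that helper, assuming `fakeDockerClient` keeps its `Calls map[string]int` and gains a mutex; the `mu` field and `recordCall` helper are additions, and `"sync"` would need to be imported:

```go
type fakeDockerClient struct {
	mu    sync.Mutex
	Calls map[string]int
	// ... other fake fields unchanged
}

// recordCall replaces direct cli.Calls[name]++ increments in each fake method.
func (c *fakeDockerClient) recordCall(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.Calls == nil {
		c.Calls = make(map[string]int)
	}
	c.Calls[name]++
}

// CallCount is safe to poll from require.Eventually while the adapter's
// background goroutine is still mutating the counters.
func (c *fakeDockerClient) CallCount(name string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.Calls[name]
}
```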
26-32:⚠️ Potential issue | 🟠 MajorTemplate should honor cluster version and avoid
localhostendpoint.
kubernetesVersion: stableignores the requested cluster version, andcontrolPlaneEndpointset tolocalhostwill make worker joins and external access fail on non-HA clusters. Prefer templating the version and only settingcontrolPlaneEndpointwhen HA is enabled (or use the control-plane node IP).Suggested fix
- kubernetesVersion: stable - controlPlaneEndpoint: "{{ if .HAEnabled }}{{ .LBAddress }}{{ else }}localhost{{ end }}:6443" + kubernetesVersion: "{{ .Version }}" + {{- if .HAEnabled }} + controlPlaneEndpoint: "{{ .LBAddress }}:6443" + {{- end }}internal/repositories/k8s/node_executor.go-72-78 (1)
72-78: ⚠️ Potential issue | 🟠 Major

Add configurable host key verification to SSHExecutor.

The hardcoded `ssh.InsecureIgnoreHostKey()` at line 77 disables host key verification and enables MITM attacks. Either refactor SSHExecutor to use the existing `pkg/sshutil.Client` (which provides a configurable `HostKeyCallback`), or add a `HostKeyCallback` field to SSHExecutor to allow injection of `knownhosts.New(...)` or `ssh.FixedHostKey(...)` during construction.
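A sketch of the injectable-callback variant; how the callback is wired into `SSHExecutor` is an assumption, but the `golang.org/x/crypto/ssh` calls are standard:

```go
import (
	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/knownhosts"
)

// hostKeyCallback builds a verifier from a known_hosts file, or from a single
// pinned public key supplied out-of-band (e.g. via cluster configuration).
func hostKeyCallback(knownHostsPath, pinnedAuthorizedKey string) (ssh.HostKeyCallback, error) {
	if knownHostsPath != "" {
		return knownhosts.New(knownHostsPath)
	}
	pub, _, _, _, err := ssh.ParseAuthorizedKey([]byte(pinnedAuthorizedKey))
	if err != nil {
		return nil, err
	}
	return ssh.FixedHostKey(pub), nil
}
```

The executor would then set the resulting callback in its `ssh.ClientConfig` instead of `ssh.InsecureIgnoreHostKey()`.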
internal/repositories/k8s/node_executor.go-81-103 (1)

81-103: ⚠️ Potential issue | 🟠 Major

Use the existing `pkg/sshutil.Client` instead of reimplementing SSH connection logic.

The codebase already provides `pkg/sshutil.Client.Run(ctx context.Context, ...)`, which properly propagates context via `net.Dialer{}.DialContext()` and uses `ssh.NewClientConn()`. The current implementation duplicates this and fails to propagate `ctx` to `ssh.Dial()` and `session.Run()`. Additionally, silently discarding close errors on `client.Close()` and `session.Close()` violates the "no silent failures" guideline.

Either refactor to use `pkg/sshutil.Client` or adopt its pattern: use `net.Dialer{}.DialContext(ctx, ...)` to establish the TCP connection, then `ssh.NewClientConn()` to wrap it. Also log or return close errors instead of discarding them.
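A sketch of that pattern under the same assumptions (the helper name is illustrative):

```go
import (
	"context"
	"fmt"
	"net"

	"golang.org/x/crypto/ssh"
)

// dialSSH establishes an SSH client whose TCP dial honors ctx cancellation,
// mirroring what pkg/sshutil.Client is described as doing.
func dialSSH(ctx context.Context, addr string, cfg *ssh.ClientConfig) (*ssh.Client, error) {
	conn, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("dial %s: %w", addr, err)
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, cfg)
	if err != nil {
		// Surface the close error instead of silently discarding it.
		if cerr := conn.Close(); cerr != nil {
			return nil, fmt.Errorf("ssh handshake: %w (close: %v)", err, cerr)
		}
		return nil, fmt.Errorf("ssh handshake: %w", err)
	}
	return ssh.NewClient(c, chans, reqs), nil
}
```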
internal/core/services/instance.go-412-414 (1)

412-414: ⚠️ Potential issue | 🟠 Major

Handle audit log errors instead of dropping them.

Line 412 ignores audit logging failures. Please log or propagate the error so audit failures aren't silently lost.

🛠️ Suggested fix

```diff
-	_ = s.auditSvc.Log(ctx, inst.UserID, "instance.start", "instance", inst.ID.String(), map[string]interface{}{
-		"name": inst.Name,
-	})
+	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.start", "instance", inst.ID.String(), map[string]interface{}{
+		"name": inst.Name,
+	}); err != nil {
+		s.logger.Warn("failed to write audit log", "instance_id", inst.ID, "error", err)
+	}
```

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.

internal/repositories/postgres/cluster_repo.go-205-210 (1)
205-210: ⚠️ Potential issue | 🟠 Major

Scanning nullable columns into non-nullable string fields will fail for legacy clusters.

Lines 205-210 scan `kubeconfig_encrypted`, `ssh_private_key_encrypted`, `join_token`, and `ca_cert_hash` directly into non-nullable `string` fields. Migration 081 adds these columns as nullable `TEXT` without defaults. For clusters created before this migration, these columns will be `NULL`, causing pgx scan errors and breaking `Get()` and `ListByUserID()` operations.

Use `COALESCE()` in the SELECT statements to provide empty string defaults:

```sql
SELECT id, user_id, vpc_id, name, version, status, worker_count, ha_enabled,
       network_isolation, pod_cidr, service_cidr, api_server_lb_address,
       COALESCE(kubeconfig_encrypted, ''), COALESCE(ssh_private_key_encrypted, ''),
       COALESCE(join_token, ''), token_expires_at, COALESCE(ca_cert_hash, ''),
       job_id, created_at, updated_at
FROM clusters
```

Alternatively, change the struct fields to `*string` and scan into nullable types.
internal/repositories/k8s/provisioner.go-98-100 (1)

98-100: ⚠️ Potential issue | 🟠 Major

Silent failure on repo update violates coding guidelines.

Line 99 discards the error from `p.repo.Update`. If the update fails, the cluster status won't persist but provisioning will report success, leading to inconsistent state.

🔧 Proposed fix

```diff
 	cluster.Status = domain.ClusterStatusRunning
-	_ = p.repo.Update(ctx, cluster)
-	return nil
+	return p.repo.Update(ctx, cluster)
```

As per coding guidelines: "Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`"

internal/repositories/k8s/provisioner.go-143-143 (1)
143-143: ⚠️ Potential issue | 🟠 Major

Multiple silent failures throughout the file.

Several `_ = p.repo.Update(...)` and `_ = p.repo.AddNode(...)` calls discard errors:
- Line 143: Update after storing control plane IP
- Line 162: Update after kubeconfig storage
- Line 204: AddNode for worker nodes
- Lines 336-337: TerminateInstance and DeleteNode in Deprovision
These can lead to data inconsistency without any indication of failure.
🔧 Suggested approach
At minimum, log the errors. For critical paths, propagate them:
```diff
-	_ = p.repo.Update(ctx, cluster)
+	if err := p.repo.Update(ctx, cluster); err != nil {
+		p.logger.Error("failed to update cluster", "cluster_id", cluster.ID, "error", err)
+	}
```

As per coding guidelines: "Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`"

Also applies to: 162-162, 204-204, 336-337
internal/repositories/k8s/provisioner.go-261-274 (1)
261-274: ⚠️ Potential issue | 🟠 Major

`waitForIP` ignores context cancellation.

The polling loop uses `time.Sleep` without checking `ctx.Done()`. This can block for up to 150 seconds even if the context is cancelled, wasting resources and delaying shutdown.

🔧 Proposed fix using select with context

```diff
 func (p *KubeadmProvisioner) waitForIP(ctx context.Context, instID uuid.UUID) string {
 	for i := 0; i < 30; i++ {
+		select {
+		case <-ctx.Done():
+			return ""
+		default:
+		}
 		inst, err := p.instSvc.GetInstance(ctx, instID.String())
 		if err == nil && inst.PrivateIP != "" {
 			ip := inst.PrivateIP
 			if idx := strings.Index(ip, "/"); idx != -1 {
 				ip = ip[:idx]
 			}
 			return ip
 		}
-		time.Sleep(5 * time.Second)
+		select {
+		case <-ctx.Done():
+			return ""
+		case <-time.After(5 * time.Second):
+		}
 	}
 	return ""
 }
```

As per coding guidelines: "Propagate `context.Context` to all blocking calls"

internal/repositories/k8s/provisioner.go-166-172 (1)
166-172: ⚠️ Potential issue | 🟠 Major

Potential panic if `ControlPlaneIPs` is empty.

Line 169 accesses `cluster.ControlPlaneIPs[0]` without checking if the slice is empty. If `provisionControlPlane` failed to populate IPs, this will panic.

🛡️ Proposed fix

```diff
 func (p *KubeadmProvisioner) provisionWorkers(ctx context.Context, cluster *domain.Cluster) error {
 	p.logger.Info("provisioning worker nodes", "count", cluster.WorkerCount)

+	if len(cluster.ControlPlaneIPs) == 0 {
+		return p.failCluster(ctx, cluster, "no control plane IPs available for worker join", nil)
+	}
+
 	apiServer := cluster.ControlPlaneIPs[0]
 	if cluster.HAEnabled && cluster.APIServerLBAddress != nil {
 		apiServer = *cluster.APIServerLBAddress
 	}
```

internal/repositories/k8s/provisioner.go-456-458 (1)
456-458: ⚠️ Potential issue | 🟠 Major

Potential slice bounds panic if version string is malformed.

`strings.LastIndex(v, ".")` returns `-1` if no dot is found, so slicing with `v[:strings.LastIndex(v, ".")]` will panic.

🛡️ Proposed fix

```diff
 	v := strings.TrimPrefix(version, "v")
-	minor := v[:strings.LastIndex(v, ".")]
+	lastDot := strings.LastIndex(v, ".")
+	if lastDot == -1 {
+		return errors.New(errors.InvalidInput, "invalid version format: "+version)
+	}
+	minor := v[:lastDot]
```
🟡 Minor comments (18)
internal/handlers/accounting_handler_test.go-45-47 (1)
45-47: ⚠️ Potential issue | 🟡 Minor

Potential data race with `gin.SetMode` in parallel tests.

`gin.SetMode()` modifies global state and is not thread-safe. When `t.Parallel()` is enabled, concurrent tests calling `SetMode` can race. Since all tests set `gin.TestMode`, the practical impact is limited, but it's still a race condition.

Consider moving `gin.SetMode(gin.TestMode)` to a `TestMain` function to set the mode once before any tests run:

```go
func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}
```

internal/repositories/postgres/lifecycle_repo_unit_test.go-29-31 (1)
29-31: ⚠️ Potential issue | 🟡 Minor

Handle `pgxmock.NewPool()` errors instead of discarding them with the blank identifier.

Ignoring errors silently can mask setup failures and lead to misleading test results. Fail the test immediately when pool creation fails.

Proposed fix

```diff
-	mock, _ := pgxmock.NewPool()
+	mock, err := pgxmock.NewPool()
+	if err != nil {
+		t.Fatalf("pgxmock.NewPool: %v", err)
+	}
```

Also applies to lines 53-55 and 70-72.

internal/handlers/health_handler_test.go-40-42 (1)
internal/handlers/health_handler_test.go-40-42 (1)
40-42: ⚠️ Potential issue | 🟡 Minor

Same `gin.SetMode()` race concern applies here.

This test also calls `gin.SetMode()` after `t.Parallel()`, creating the same potential race condition as `TestHealthHandler_Live`.

internal/handlers/health_handler_test.go-25-27 (1)
25-27: ⚠️ Potential issue | 🟡 Minor

Potential data race with `gin.SetMode()` in parallel tests.

`gin.SetMode()` modifies a package-level global variable. When this test runs concurrently with other tests that also call `gin.SetMode()`, a data race can occur. Consider one of these approaches:

- Move `gin.SetMode(gin.TestMode)` to a `TestMain` function that runs once before all tests.
- Use `sync.Once` to set the mode only once.
- Remove `t.Parallel()` if parallel execution isn't critical for this test.

internal/core/services/instance_test.go-188-189 (1)
188-189: ⚠️ Potential issue | 🟡 Minor

Don't ignore Docker adapter init errors.

If adapter creation fails, the test can proceed with a nil compute backend and panic later. Handle the error explicitly.

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.

🐛 Suggested fix

```diff
-	compute, _ := docker.NewDockerAdapter(slog.Default())
+	compute, err := docker.NewDockerAdapter(slog.Default())
+	require.NoError(t, err)
```

internal/repositories/k8s/lifecycle_test.go-145-148 (1)
145-148: ⚠️ Potential issue | 🟡 Minor

Handle ListInstances errors.

Dropping the error can turn real failures into false positives.

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.

🐛 Suggested fix

```diff
-	instances, _ := instSvc.ListInstances(ctx)
+	instances, err := instSvc.ListInstances(ctx)
+	require.NoError(t, err)
```

internal/core/services/system_integration_test.go-111-128 (1)
111-128: ⚠️ Potential issue | 🟡 Minor

Don't ignore instance-type creation errors.

If these inserts fail, provisioning will later fail with a less obvious root cause.

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.

🐛 Suggested fix

```diff
-	_, _ = itRepo.Create(ctx, customType)
+	_, err = itRepo.Create(ctx, customType)
+	require.NoError(t, err)
@@
-	_, _ = itRepo.Create(ctx, sys1)
+	_, err = itRepo.Create(ctx, sys1)
+	require.NoError(t, err)
```

internal/core/services/system_integration_test.go-71-74 (1)
71-74: ⚠️ Potential issue | 🟡 Minor

Handle network creation errors in setup.

Ignoring the error can mask Docker networking failures and lead to misleading downstream errors.

As per coding guidelines: Do not use silent failures - avoid blank identifier assignment like `_ = someFunc()`.

🐛 Suggested fix

```diff
-	if compute.Type() == "docker" {
-		_, _ = compute.CreateNetwork(ctx, "cloud-network")
-	}
+	if compute.Type() == "docker" {
+		_, err = compute.CreateNetwork(ctx, "cloud-network")
+		require.NoError(t, err)
+	}
```

internal/repositories/k8s/lifecycle_test.go-199-205 (1)
199-205: ⚠️ Potential issue | 🟡 Minor

Surface unexpected queue payloads in the sync task queue.

Returning nil for unknown payloads can hide wiring issues and make the test pass when provisioning never ran.

🐛 Suggested fix

```diff
-	return nil
+	return fmt.Errorf("unexpected payload type %T", payload)
```

internal/repositories/postgres/migrations/081_fix_clusters_table.up.sql-11-18 (1)
11-18: ⚠️ Potential issue | 🟡 Minor

Scope column-existence checks to the current schema.

Without filtering by `table_schema`, a similarly named table in another schema can make the check pass and the subsequent UPDATE fail in the active schema.

🛠️ Suggested fix

```diff
-    IF EXISTS (SELECT 1 FROM information_schema.columns WHERE table_name='clusters' AND column_name='ssh_key') THEN
+    IF EXISTS (
+        SELECT 1 FROM information_schema.columns
+        WHERE table_name = 'clusters'
+          AND column_name = 'ssh_key'
+          AND table_schema = current_schema()
+    ) THEN
         UPDATE clusters SET ssh_private_key_encrypted = ssh_key WHERE ssh_private_key_encrypted IS NULL;
     END IF;
-    IF EXISTS (SELECT 1 FROM information_schema.columns WHERE table_name='clusters' AND column_name='kubeconfig') THEN
+    IF EXISTS (
+        SELECT 1 FROM information_schema.columns
+        WHERE table_name = 'clusters'
+          AND column_name = 'kubeconfig'
+          AND table_schema = current_schema()
+    ) THEN
         UPDATE clusters SET kubeconfig_encrypted = kubeconfig WHERE kubeconfig_encrypted IS NULL;
     END IF;
```

internal/repositories/noop/adapters.go-165-174 (1)
165-174: ⚠️ Potential issue | 🟡 Minor

Make nil returns consistent with other noop methods in the same file.

The methods `GetByID`, `ListByUserID`, and `ListAll` return `(nil, nil)`, but similar methods in the same file like `GetNodes` (line 179) and `ListLogs` (line 159) return empty slices instead. This inconsistency can lead to nil dereference in callers (e.g., `cluster_worker.go` calls `GetByID` and only checks `err`, not whether the cluster is nil). Return empty slices or stub objects to match the pattern established elsewhere in this file.

Suggested fix

```diff
 func (r *NoopClusterRepository) GetByID(ctx context.Context, id uuid.UUID) (*domain.Cluster, error) {
-	return nil, nil
+	return &domain.Cluster{ID: id}, nil
 }

 func (r *NoopClusterRepository) ListByUserID(ctx context.Context, userID uuid.UUID) ([]*domain.Cluster, error) {
-	return nil, nil
+	return []*domain.Cluster{}, nil
 }

 func (r *NoopClusterRepository) ListAll(ctx context.Context) ([]*domain.Cluster, error) {
-	return nil, nil
+	return []*domain.Cluster{}, nil
 }
```

internal/handlers/security_group_handler_test.go-83-85 (1)
83-85: ⚠️ Potential issue | 🟡 Minor

Potential race condition with `gin.SetMode` in parallel tests.

`gin.SetMode()` modifies package-level globals without synchronization and is not goroutine-safe. Calling it in multiple parallel tests (this pattern affects 38+ test files in internal/handlers) causes data races. `gin.SetMode(gin.TestMode)` must be removed from the individual tests and set once globally before any tests run.

Add a `TestMain` function to `internal/handlers/security_group_handler_test.go` (and all other affected test files) to set the mode once:

```go
func TestMain(m *testing.M) {
	gin.SetMode(gin.TestMode)
	os.Exit(m.Run())
}
```

Then remove the `gin.SetMode(gin.TestMode)` calls from all individual test functions.

internal/repositories/docker/adapter_test.go-76-80 (1)
76-80: ⚠️ Potential issue | 🟡 Minor

Potential flaky assertion in Eventually block.

`assert.ObjectsAreEqual` returns a `bool` but doesn't produce test output on failure. The string comparison `expectedContent+"\n"` assumes the file ends with a newline, which depends on `echo` behavior. Consider using direct string comparison with trimming for robustness:

🔧 Suggested fix

```diff
 	require.Eventually(t, func() bool {
 		out, err := adapter.Exec(ctx, id, []string{"cat", "/tmp/bootstrap_test.txt"})
-		return err == nil && assert.ObjectsAreEqual(expectedContent+"\n", out)
+		return err == nil && strings.TrimSpace(out) == expectedContent
 	}, 10*time.Second, 500*time.Millisecond, "Bootstrap script failed to execute or write file")
```

You'll need to add `"strings"` to the imports.

internal/handlers/instance_handler_test.go-112-114 (1)
112-114: ⚠️ Potential issue | 🟡 Minor

Guard global gin state now that tests are parallel.

`setupInstanceHandlerTest` calls `gin.SetMode`, which mutates global state. With `t.Parallel()` added broadly, consider moving `gin.SetMode(gin.TestMode)` into a `TestMain`/`init` with `sync.Once` to avoid race detector failures.

internal/repositories/k8s/node_executor.go-39-40 (1)
39-40: ⚠️ Potential issue | 🟡 Minor

Replace readiness/timeout magic numbers with named constants.

Hard-coded poll/timeout values make tuning harder and violate the no-magic-numbers guideline.

Suggested fix

```diff
+const (
+	serviceReadyPollInterval = 500 * time.Millisecond
+	sshReadyPollInterval     = 2 * time.Second
+	sshDialTimeout           = 10 * time.Second
+)
...
-	ticker := time.NewTicker(500 * time.Millisecond)
+	ticker := time.NewTicker(serviceReadyPollInterval)
...
-	Timeout: 10 * time.Second,
+	Timeout: sshDialTimeout,
...
-	ticker := time.NewTicker(2 * time.Second)
+	ticker := time.NewTicker(sshReadyPollInterval)
```

As per coding guidelines: Do not use magic numbers - use named constants instead.
Also applies to: 78-79, 110-111
internal/repositories/k8s/kubeadm_provisioner_test.go-39-41 (1)
39-41: ⚠️ Potential issue | 🟡 Minor

Make `StartInstance` use `m.Called` so expectations and errors work.

Returning `nil` unconditionally hides unexpected calls and prevents tests from asserting behavior.

Suggested fix

```diff
 func (m *MockInstanceService) StartInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }
```

As per coding guidelines: Use `testify/mock` for creating mock objects in tests.

internal/repositories/docker/adapter.go-73-78 (1)
73-78: ⚠️ Potential issue | 🟡 Minor

Default the logger to avoid nil deref.

Lines 73-78 accept a logger but don't guard against nil; later calls to `a.logger.*` will panic. Consider defaulting to `slog.Default()` (or a no-op logger).

🛠️ Suggested fix

```diff
 func NewDockerAdapter(logger *slog.Logger) (*DockerAdapter, error) {
+	if logger == nil {
+		logger = slog.Default()
+	}
 	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
```

internal/repositories/k8s/provisioner.go-316-322 (1)
316-322: ⚠️ Potential issue | 🟡 Minor

Fallback uses encrypted key directly if decryption fails.

When `p.secretSvc.Decrypt` fails, the code silently uses `cluster.SSHPrivateKeyEncrypted` as the SSH key. This will either fail SSH auth (best case) or, if the field contains a plaintext key, succeed unexpectedly.

🔧 Proposed fix: Return error on decrypt failure

```diff
 	// 2. Fallback to SSH
-	decryptedKey := cluster.SSHPrivateKeyEncrypted // Fallback or if not encrypted
 	if p.secretSvc != nil && cluster.SSHPrivateKeyEncrypted != "" {
 		key, err := p.secretSvc.Decrypt(ctx, cluster.UserID, cluster.SSHPrivateKeyEncrypted)
-		if err == nil {
-			decryptedKey = key
+		if err != nil {
+			return nil, errors.Wrap(errors.Internal, "failed to decrypt SSH key", err)
 		}
+		return NewSSHExecutor(ip, defaultUser, key), nil
 	}
-
-	return NewSSHExecutor(ip, defaultUser, decryptedKey), nil
+	return nil, errors.New(errors.InvalidInput, "no SSH key available for node access")
```
```go
inst := &domain.Instance{
	ID:           uuid.New(),
	UserID:       appcontext.UserIDFromContext(ctx),
	TenantID:     appcontext.TenantIDFromContext(ctx),
	Name:         opts.Name,
	Image:        opts.ImageName,
	Status:       domain.StatusStarting,
	InstanceType: "custom", // Marking as custom since we are passing raw constraints or defaults
	Version:      1,
	CreatedAt:    time.Now(),
	UpdatedAt:    time.Now(),
}

if opts.NetworkID != "" {
	vpcID, err := uuid.Parse(opts.NetworkID)
	if err == nil {
		inst.VpcID = &vpcID
	}
}

if err := s.repo.Create(ctx, inst); err != nil {
	return nil, err
}

// 4. Enqueue provision task with UserData
job := domain.ProvisionJob{
	InstanceID: inst.ID,
	UserID:     inst.UserID,
	TenantID:   inst.TenantID,
	UserData:   opts.UserData,
}
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# Find the instance.go file and check its structure
fd 'instance\.go' internal -t f | head -20

# Search for "custom" instance type definition
rg -n '"custom"' internal --type go | head -30

# Find CreateInstanceOptions and ProvisionJob definitions
rg -n 'type CreateInstanceOptions|type ProvisionJob' internal --type go -A 10

# Search for where ProvisionJob is used and how the Provision function works
rg -n 'ProvisionJob|func.*Provision' internal --type go | head -40
rg -n 'func.*Provision\(' internal/core/services --type go -A 15 | head -80

# Read the LaunchInstanceWithOptions function and surrounding context
wc -l internal/core/services/instance.go
sed -n '140,200p' internal/core/services/instance.go | cat -n

# Check the Instance domain struct and whether it could store Ports, Env, Cmd or limits
rg -n 'type Instance struct' internal/core/domain --type go -A 20
rg -n 'type Instance struct' internal/core/domain/instance.go -A 30

# Verify the exact NetworkID handling and whether errors are truly silently ignored
sed -n '152,182p' internal/core/services/instance.go | cat -n

# Check where Ports might be stored or used in the Provision flow
rg -n '\.Ports|ports\[' internal/core/services/instance.go | head -20
rg -n 'inst\.Ports\s*=' internal/core/services/instance.go

# Check if the "custom" instance type is verified to exist anywhere in the codebase
rg -n 'instance.*type|InstanceType' internal/core/services/instance.go | head -30

# Check parseAndValidatePorts and the InstanceType validation in Provision
sed -n '220,240p' internal/core/services/instance.go | cat -n
sed -n '210,225p' internal/core/services/instance.go | cat -n
```
Validate NetworkID and persist option fields before enqueueing Provision.
Lines 165-170: an invalid opts.NetworkID is silently ignored instead of returning an error. Additionally, most CreateInstanceOptions fields (Ports, VolumeBinds, Env, Cmd, CPULimit, MemoryLimit, DiskLimit) are not persisted in the Instance struct or ProvisionJob. During Provision (line 225), the code attempts to parse inst.Ports, which was never populated, and Env/Cmd are hardcoded to nil, so these options are lost. The hardcoded InstanceType: "custom" also requires manual database setup. Please validate NetworkID and either: (1) persist these options in the Instance struct, (2) extend ProvisionJob to include them, or (3) pass CreateInstanceOptions directly to the ProvisionJob.
✅ Suggested validation for NetworkID
if opts.NetworkID != "" {
vpcID, err := uuid.Parse(opts.NetworkID)
- if err == nil {
- inst.VpcID = &vpcID
- }
+ if err != nil {
+ return nil, errors.New(errors.InvalidInput, "invalid network_id")
+ }
+ inst.VpcID = &vpcID
}🤖 Prompt for AI Agents
In `@internal/core/services/instance.go` around lines 152 - 182, The code silently
ignores invalid opts.NetworkID and fails to persist CreateInstanceOptions
(Ports, VolumeBinds, Env, Cmd, CPULimit, MemoryLimit, DiskLimit) so Provision
later reads empty inst fields; validate opts.NetworkID (return error on invalid
uuid when opts.NetworkID != "") and persist the options either by adding those
fields to the domain.Instance before calling s.repo.Create (e.g., populate
inst.Ports, inst.VolumeBinds, inst.Env, inst.Cmd, inst.CPULimit,
inst.MemoryLimit, inst.DiskLimit), or extend domain.ProvisionJob to carry the
full CreateInstanceOptions and populate ProvisionJob.UserData/fields from opts
(or simply attach the entire CreateInstanceOptions to ProvisionJob) so the
Provision handler (which reads inst.Ports and Env/Cmd) receives the real values
instead of nil; ensure InstanceType is set appropriately (not hardcoded
"custom") or documented/handled elsewhere.
```go
func (a *DockerAdapter) handleUserData(ctx context.Context, containerID string, userData string) error {
	// For Docker, we simulate Cloud-Init by writing the script to the container and executing it.
	if strings.HasPrefix(userData, "#cloud-config") {
		return a.processCloudConfig(ctx, containerID, userData)
	}

	// Default fallback: treat as shell script
	scriptPath := "/tmp/bootstrap.sh"

	// Write file to container
	// Note: We use base64 encoding to avoid escaping issues with complex scripts
	encoded := base64.StdEncoding.EncodeToString([]byte(userData))
	writeCmd := []string{"sh", "-c", fmt.Sprintf("mkdir -p %s && echo %s | base64 -d > %s && chmod +x %s",
		filepath.Dir(scriptPath), encoded, scriptPath, scriptPath)}

	if _, err := a.Exec(ctx, containerID, writeCmd); err != nil {
		return fmt.Errorf("failed to write userdata to container: %w", err)
	}

	// Run in background
	go func() {
		bgCtx := context.Background()
		_, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath})
		if err != nil {
			a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
		}
	}()
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# Get file size and examine the relevant sections of the adapter
wc -l internal/repositories/docker/adapter.go
sed -n '208,234p' internal/repositories/docker/adapter.go
sed -n '314,323p' internal/repositories/docker/adapter.go
sed -n '1,50p' internal/repositories/docker/adapter.go

# Check the Exec method signature and its concurrency properties
rg "func.*Exec\(" internal/repositories/docker/adapter.go -A 5
rg "func \(a \*DockerAdapter\) Exec" internal/repositories/docker/adapter.go -A 30
sed -n '600,700p' internal/repositories/docker/adapter.go

# Look for goroutine patterns and synchronization primitives in the adapter
rg "go func" internal/repositories/docker/adapter.go -B 2 -A 8
rg "sync\.|Mutex|Lock|RWLock" internal/repositories/docker/adapter.go -B 2 -A 2
rg "type DockerAdapter struct" internal/repositories/docker/adapter.go -A 15

# Find test files in the docker directory
fd -e go internal/repositories/docker | grep -i test
fd --full-path 'internal/repositories/docker' -name '*test.go'
find internal/repositories/docker -name '*test.go'

# Look for tests exercising handleUserData/processCloudConfig, and for
# concurrent execution patterns in those tests
rg "TestDockerAdapter\|TestHandleUserData\|TestProcessCloudConfig" internal/repositories/docker/ -A 10
rg "handleUserData|processCloudConfig" internal/repositories/docker/adapter_test.go internal/repositories/docker/adapter_unit_test.go -B 2 -A 10
rg "handleUserData|processCloudConfig|userData" internal/repositories/docker/ --type go
rg "go func|t\.Parallel|concurrent" internal/repositories/docker/adapter_test.go -B 2 -A 5
rg "\.handleUserData\(|\.Exec\(" internal/repositories/docker/adapter_test.go internal/repositories/docker/adapter_unit_test.go | head -20
head -50 internal/repositories/docker/adapter_test.go
cat internal/repositories/docker/adapter_unit_test.go

# Inspect fakeDockerClient and how its Calls map is incremented
rg "type fakeDockerClient" internal/repositories/docker/ -A 30
rg "Calls\s*map" internal/repositories/docker/ -B 2 -A 2
rg "cli\.Calls\[" internal/repositories/docker/ -B 2 -A 2
cat internal/repositories/docker/fakes_test.go

# Confirm both Exec calls in handleUserData and the second goroutine in processCloudConfig
sed -n '208,237p' internal/repositories/docker/adapter.go | head -35
sed -n '240,350p' internal/repositories/docker/adapter.go | grep -A 30 "processCloudConfig"
```
Fix context propagation and data races in user-data execution.
Background goroutines in handleUserData and processCloudConfig use context.Background(), bypassing parent context cancellation and violating the guideline to propagate context.Context to all blocking calls. Additionally, concurrent Exec calls race on the test's unprotected Calls map. Replace background execution with synchronous calls using a derived context with timeout:
Example fix
```diff
-	// Run in background
-	go func() {
-		bgCtx := context.Background()
-		_, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath})
-		if err != nil {
-			a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
-		}
-	}()
+	bgCtx, cancel := context.WithTimeout(ctx, DefaultOperationTimeout)
+	defer cancel()
+	if _, err := a.Exec(bgCtx, containerID, []string{"/bin/sh", scriptPath}); err != nil {
+		a.logger.Error("failed to execute userdata script", "container_id", containerID, "error", err)
+	}
```

Also applies to: processCloudConfig (around lines 314-323)
🧰 Tools
🪛 GitHub Actions: CI
[error] 228-230: Data race detected in DockerAdapter.handleUserData during LaunchInstanceWithOptions. Concurrent map access observed (race between goroutines 47 and 48).
🤖 Prompt for AI Agents
In `@internal/repositories/docker/adapter.go` around lines 208 - 234,
handleUserData (and likewise processCloudConfig) spawns a goroutine that calls
a.Exec with context.Background(), which bypasses the parent ctx, prevents
cancellation propagation, and causes concurrent Exec calls that race on shared
test state; change the background execution to a synchronous call using a
derived context with a timeout from the provided ctx (e.g., ctxWithTimeout,
cancel := context.WithTimeout(ctx, <reasonableDuration>); defer cancel()), call
a.Exec(ctxWithTimeout, containerID, ...) directly instead of launching a
goroutine, and propagate errors back to the caller (or log them consistently via
a.logger) so Exec is not invoked concurrently from multiple goroutines and
respects cancellation; apply the same change in processCloudConfig and ensure
any callers handle the returned error.
```go
type MockSecretService struct{ mock.Mock }

func (m *MockSecretService) CreateSecret(ctx context.Context, name, value, description string) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) GetSecret(ctx context.Context, id uuid.UUID) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) GetSecretByName(ctx context.Context, name string) (*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) ListSecrets(ctx context.Context) ([]*domain.Secret, error) {
	return nil, nil
}
func (m *MockSecretService) DeleteSecret(ctx context.Context, id uuid.UUID) error {
	return nil
}
func (m *MockSecretService) Encrypt(ctx context.Context, userID uuid.UUID, plainText string) (string, error) {
	return plainText, nil
}
func (m *MockSecretService) Decrypt(ctx context.Context, userID uuid.UUID, cipherText string) (string, error) {
	args := m.Called(ctx, userID, cipherText)
	return args.String(0), args.Error(1)
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
# Check for duplicate MockSecretService definitions in the k8s package
rg -n 'type MockSecretService struct' internal/repositories/k8s/
```
Remove duplicate MockSecretService definition.
The type MockSecretService struct{ mock.Mock } is already defined in kubeadm_provisioner_test.go:96. Duplicate type definitions in the same package will cause a compilation error. Delete the duplicate definition from mocks_test.go:89-112 and reuse the existing mock from the other file if needed.
🤖 Prompt for AI Agents
In `@internal/repositories/k8s/mocks_test.go` around lines 89 - 112, There is a
duplicate MockSecretService type and method implementations (e.g., CreateSecret,
GetSecret, Decrypt) which will cause a duplicate type error; remove the entire
MockSecretService definition and its methods from this file and rely on the
single shared MockSecretService already defined elsewhere in the package,
updating any tests in this file to use that existing MockSecretService (and its
m.Called stubs) instead of declaring a new one.
```diff
 if err == nil {
 	cluster.KubeconfigEncrypted = encryptedKubeconfig
 } else {
-	cluster.Kubeconfig = encryptedKubeconfig
 	p.logger.Error("failed to encrypt kubeconfig", "error", err)
 	cluster.KubeconfigEncrypted = kubeconfig // Fallback
 }
```
Security risk: Storing unencrypted kubeconfig as fallback.
If encryption fails, the kubeconfig is stored in plaintext. This defeats the purpose of having KubeconfigEncrypted and could expose cluster credentials.
🔒 Proposed fix: Fail instead of fallback
```diff
 	encryptedKubeconfig, err := p.secretSvc.Encrypt(ctx, cluster.UserID, kubeconfig)
-	if err == nil {
-		cluster.KubeconfigEncrypted = encryptedKubeconfig
-	} else {
-		p.logger.Error("failed to encrypt kubeconfig", "error", err)
-		cluster.KubeconfigEncrypted = kubeconfig // Fallback
+	if err != nil {
+		return p.failCluster(ctx, cluster, "failed to encrypt kubeconfig", err)
 	}
+	cluster.KubeconfigEncrypted = encryptedKubeconfig
```

🤖 Prompt for AI Agents
In `@internal/repositories/k8s/provisioner.go` around lines 155 - 160, The current
fallback stores plaintext kubeconfig when encryption fails (setting
cluster.KubeconfigEncrypted = kubeconfig), which is a security risk; update the
error path in the block that handles encryptedKubeconfig (the branch using
encryptedKubeconfig, kubeconfig and p.logger.Error) to refuse to persist
plaintext: log the error via p.logger.Error as-is, and return or propagate an
error from the surrounding function instead of assigning the unencrypted
kubeconfig to cluster.KubeconfigEncrypted so the caller can handle the failure
and no plaintext kubeconfig is saved.
Add IF NOT EXISTS/IF EXISTS checks to:

- 003_add_container_id
- 006_create_vpcs_table
- 012_add_failure_tracking
- 013_create_users_table
- 014_resource_scoping
- 032_add_instance_network_fields
- 036_add_volume_backend_path

This resolves CI failures related to verifying migration rollbacks.

Add IF NOT EXISTS/IF EXISTS checks to:

- 020_create_clusters.up
- 053_add_cluster_network_isolation
- 054_extend_clusters_v2_2

Add IF NOT EXISTS/IF EXISTS checks to:

- 058_add_object_versioning
- 060_bucket_encryption

Also wrap constraint creation in DO block to avoid errors.
Add IF NOT EXISTS/IF EXISTS checks to remaining migrations.
Add status code assertions and body printing on decode failure for VerifyMethodMatching test.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
internal/repositories/postgres/migrations/076_add_gateway_pattern_matching.up.sql (2)
14-14: ⚠️ Potential issue | 🟡 Minor

`CREATE INDEX` is not idempotent and will fail on re-run.

This migration will error if executed again because the index already exists. Use `IF NOT EXISTS` for consistency with the idempotent approach applied to the column additions.

Proposed fix

```diff
 -- Add index for pattern lookups
-CREATE INDEX idx_gateway_routes_pattern_type ON gateway_routes(pattern_type);
+CREATE INDEX IF NOT EXISTS idx_gateway_routes_pattern_type ON gateway_routes(pattern_type);
```
17-19: ⚠️ Potential issue | 🟡 Minor

`ADD CONSTRAINT` is not idempotent and will fail on re-run.

PostgreSQL doesn't support `IF NOT EXISTS` for constraints directly. To maintain consistency with the idempotent migration approach, wrap this in a `DO` block that checks for existing constraints.

Proposed fix

```diff
 -- Add constraint: pattern_type must be 'prefix' or 'pattern'
-ALTER TABLE gateway_routes
-  ADD CONSTRAINT chk_pattern_type
-  CHECK (pattern_type IN ('prefix', 'pattern'));
+DO $$
+BEGIN
+  IF NOT EXISTS (
+    SELECT 1 FROM pg_constraint WHERE conname = 'chk_pattern_type'
+  ) THEN
+    ALTER TABLE gateway_routes
+      ADD CONSTRAINT chk_pattern_type
+      CHECK (pattern_type IN ('prefix', 'pattern'));
+  END IF;
+END $$;
```

internal/repositories/postgres/migrations/071_add_tenant_id_to_resources.up.sql (1)
69-90: ⚠️ Potential issue | 🟠 Major

Guard index creation with table existence checks.

Lines 69-90 create indexes unconditionally, but `CREATE INDEX IF NOT EXISTS` only suppresses errors if an index of that name already exists; it still errors if the underlying table is missing. Since the preceding column additions (lines 2-67) are conditional, the index creation must be conditional too, or the migration will fail if any table doesn't exist.

Wrap each index creation in a `DO` block with an existence check:

🔧 Example pattern (apply to each index)

```diff
-CREATE INDEX IF NOT EXISTS idx_instances_tenant ON instances(tenant_id);
+DO $$
+BEGIN
+  IF EXISTS (SELECT FROM pg_tables WHERE tablename = 'instances') THEN
+    EXECUTE 'CREATE INDEX IF NOT EXISTS idx_instances_tenant ON instances(tenant_id)';
+  END IF;
+END $$;
```
🤖 Fix all issues with AI agents
In `@internal/repositories/postgres/migrations/014_resource_scoping.up.sql`:
- Around line 4-10: The migration currently uses "ADD COLUMN IF NOT EXISTS ...
REFERENCES" which is skipped if the column exists and thus can leave missing FK
constraints; update 014_resource_scoping.up.sql to (1) ensure each table
(instances, vpcs, volumes, objects, load_balancers, scaling_groups, events) has
the user_id column added with ALTER TABLE ... ADD COLUMN IF NOT EXISTS user_id
UUID (no REFERENCES), and (2) for each table add a guarded FK constraint named
e.g. instances_user_id_fkey, vpcs_user_id_fkey, etc., inside a DO $$ BEGIN ...
END $$ block that checks pg_constraint for conname before running ALTER TABLE
... ADD CONSTRAINT <name> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE
RESTRICT (following the pattern used in migration 013 for api_keys) so the
migration is idempotent and will add missing constraints without error.
In `@internal/repositories/postgres/migrations/020_create_clusters.up.sql`:
- Around line 37-38: The CREATE INDEX statements for idx_clusters_user_id and
idx_cluster_nodes_cluster_id should be made idempotent by adding IF NOT EXISTS;
update the two statements that create those indexes (the lines using CREATE
INDEX idx_clusters_user_id ON clusters(user_id) and CREATE INDEX
idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id)) to include IF NOT
EXISTS so rerunning the migration won’t fail if the indexes already exist.
In `@internal/repositories/postgres/migrations/058_add_object_versioning.up.sql`:
- Around line 11-12: The migration drops the wrong unique constraint; change the
DROP CONSTRAINT to remove the original pre-versioning unique constraint
(objects_bucket_key_key) instead of objects_bucket_key_version_unique, then add
the new UNIQUE (bucket, key, version_id) constraint on table objects so multiple
versions per (bucket,key) are allowed; update the ALTER TABLE statement
targeting constraint objects_bucket_key_key and keep the ADD CONSTRAINT
objects_bucket_key_version_unique UNIQUE (bucket, key, version_id).
- Around line 3-4: The migration leaves objects.version_id nullable which allows
duplicate (bucket, key) combos to bypass uniqueness; update the migration to set
a safe non-null default, backfill existing NULLs, and make the column NOT NULL
(e.g., ALTER TABLE objects ALTER COLUMN version_id SET DEFAULT ''; UPDATE
objects SET version_id = '' WHERE version_id IS NULL; ALTER TABLE objects ALTER
COLUMN version_id SET NOT NULL), and ensure the unique index/constraint that
should cover (bucket, key, version_id) is present/updated so NULLs cannot bypass
uniqueness; reference the objects table and the version_id and is_latest columns
when making these changes.
In `@internal/repositories/postgres/migrations/077_add_gateway_methods.up.sql`:
- Around line 8-12: The IF NOT EXISTS check must be scoped to this specific
table/schema so a same-named constraint elsewhere doesn't block creation; update
the SELECT in the DO $$ block that currently checks only constraint_name =
'gateway_routes_pattern_methods_key' to also filter by table_name =
'gateway_routes' and the appropriate table_schema (e.g. 'public' or the schema
used by this migration), ensuring the ALTER TABLE gateway_routes ADD CONSTRAINT
gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods) only gets
skipped when that constraint exists on this table/schema.
🧹 Nitpick comments (8)
internal/repositories/postgres/migrations/078_add_api_key_expiry.up.sql (1)
1-1: LGTM!

The migration correctly uses `IF NOT EXISTS` for idempotent behavior, and `TIMESTAMP WITH TIME ZONE` is the appropriate type for expiration timestamps. Nullable by default is sensible: API keys without an expiration remain valid indefinitely.

Optional consideration: if the application runs periodic cleanup jobs or frequently queries expired keys (e.g., `WHERE expires_at IS NOT NULL AND expires_at < NOW()`), consider adding an index on `expires_at` in a follow-up migration to optimize those queries.
internal/repositories/postgres/migrations/013_create_users_table.up.sql (1)
18-23: Consider using `table_constraints` for a more robust constraint existence check.

The `constraint_column_usage` view shows columns referenced by constraints, whereas `table_constraints` directly lists constraints on tables and is more commonly used for this purpose. Additionally, filtering by schema avoids false positives in multi-schema environments.

♻️ Suggested improvement

```diff
 DO $$
 BEGIN
-    IF NOT EXISTS (SELECT 1 FROM information_schema.constraint_column_usage WHERE constraint_name = 'fk_api_keys_user') THEN
+    IF NOT EXISTS (
+        SELECT 1 FROM information_schema.table_constraints
+        WHERE constraint_name = 'fk_api_keys_user'
+          AND table_name = 'api_keys'
+          AND constraint_type = 'FOREIGN KEY'
+    ) THEN
         ALTER TABLE api_keys ADD CONSTRAINT fk_api_keys_user FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;
     END IF;
 END $$;
```

internal/repositories/postgres/migrations/058_add_object_versioning.down.sql (1)
7-12: Scope the constraint existence check to the `objects` table.

The check only filters by constraint name, which can yield false positives in edge cases. Consider constraining by table and schema for accuracy.

🔧 Suggested adjustment

```diff
 DO $$
 BEGIN
-    IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'objects_bucket_key_key') THEN
+    IF NOT EXISTS (
+        SELECT 1
+        FROM information_schema.table_constraints
+        WHERE constraint_name = 'objects_bucket_key_key'
+          AND table_name = 'objects'
+          AND table_schema = current_schema()
+    ) THEN
         ALTER TABLE objects ADD CONSTRAINT objects_bucket_key_key UNIQUE (bucket, key);
     END IF;
 END $$;
```

internal/repositories/postgres/migrations/014_resource_scoping.up.sql (1)
13-19: Consider concurrent index creation for large production tables.

If these tables can be large in production, `CREATE INDEX` will take heavyweight locks and block writes. Consider a separate no-transaction migration that uses `CREATE INDEX CONCURRENTLY` where needed. Validate against your goose transaction settings before changing.

internal/repositories/postgres/migrations/006_create_vpcs_table.up.sql (1)
10-10: Consider indexing `instances.vpc_id` for join performance.

If this column is used in joins or lookups, adding an index can prevent table scans and keep FK lookups fast.
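For example (index name illustrative):

```sql
CREATE INDEX IF NOT EXISTS idx_instances_vpc_id ON instances (vpc_id);
```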
internal/repositories/postgres/migrations/071_add_tenant_id_to_resources.up.sql (1)
4-67: Ensure schema resolution is deterministic for unqualified `ALTER TABLE`.

These checks are schema-agnostic, but `ALTER TABLE <name>` follows `search_path`. If multiple schemas can contain these table names, consider setting `search_path` explicitly or using schema-qualified dynamic SQL from `pg_tables.schemaname`.

internal/repositories/postgres/migrations/072_migrate_to_tenants.up.sql (1)
32-93: Consider schema-qualified updates or enforce `search_path`.

These schema-agnostic existence checks are paired with unqualified `UPDATE` statements. In multi-schema environments, that can hit the wrong table or miss the intended one. Consider setting `search_path` for the migration or using schema-qualified dynamic SQL, as sketched below.
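A sketch of the dynamic-SQL variant; the inner `UPDATE` body is a deliberate no-op placeholder, not this migration's actual statement:

```sql
DO $$
DECLARE
    s text;
BEGIN
    -- Run the update only in schemas that actually contain the table,
    -- fully qualifying the name so search_path no longer matters.
    FOR s IN SELECT schemaname FROM pg_tables WHERE tablename = 'instances' LOOP
        EXECUTE format('UPDATE %I.instances SET tenant_id = tenant_id', s); -- placeholder body
    END LOOP;
END $$;
```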
tests/gateway_e2e_test.go (1)

284-329: Consider consolidating repeated HTTP method tests into a parameterized loop.

The GET, POST, and DELETE test blocks share identical setup and assertion patterns. Extracting the method name, expected status code, and URL fragment into a loop would reduce duplication and make it easier to add additional method tests. This is particularly valuable since the current code duplicates request creation, header setup, body reading, and unmarshaling logic three times.
♻️ Suggested refactor
```diff
-	// Test GET request
-	reqGet, _ := http.NewRequest("GET", testutil.TestBaseURL+"/gw"+pattern, nil)
-	reqGet.Header.Set(testutil.TestHeaderAPIKey, token)
-	respGet, err := client.Do(reqGet)
-	require.NoError(t, err)
-	defer func() { _ = respGet.Body.Close() }()
-	bodyBytesGet, err := io.ReadAll(respGet.Body)
-	require.NoError(t, err)
-	var resGet struct {
-		URL string `json:"url"`
-	}
-	if err := json.Unmarshal(bodyBytesGet, &resGet); err != nil {
-		t.Fatalf("Failed to decode GET response: %v. Body: %s", err, string(bodyBytesGet))
-	}
-	assert.Contains(t, resGet.URL, "/get-only")
-
-	// Test POST request
-	reqPost, _ := http.NewRequest("POST", testutil.TestBaseURL+"/gw"+pattern, nil)
-	reqPost.Header.Set(testutil.TestHeaderAPIKey, token)
-	respPost, err := client.Do(reqPost)
-	require.NoError(t, err)
-	defer func() { _ = respPost.Body.Close() }()
-	bodyBytes, err := io.ReadAll(respPost.Body)
-	require.NoError(t, err)
-	var resPost struct {
-		URL string `json:"url"`
-	}
-	if err := json.Unmarshal(bodyBytes, &resPost); err != nil {
-		t.Fatalf("Failed to decode POST response: %v. Body: %s", err, string(bodyBytes))
-	}
-	assert.Contains(t, resPost.URL, "/post-only")
-
-	// Test DELETE request (should fail)
-	reqDel, _ := http.NewRequest("DELETE", testutil.TestBaseURL+"/gw"+pattern, nil)
-	reqDel.Header.Set(testutil.TestHeaderAPIKey, token)
-	respDel, err := client.Do(reqDel)
-	require.NoError(t, err)
-	assert.Equal(t, http.StatusNotFound, respDel.StatusCode)
-	_ = respDel.Body.Close()
+	cases := []struct {
+		name            string
+		method          string
+		wantStatus      int
+		wantURLContains string
+	}{
+		{name: "GET", method: http.MethodGet, wantStatus: http.StatusOK, wantURLContains: "/get-only"},
+		{name: "POST", method: http.MethodPost, wantStatus: http.StatusOK, wantURLContains: "/post-only"},
+		{name: "DELETE", method: http.MethodDelete, wantStatus: http.StatusNotFound},
+	}
+
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			req, err := http.NewRequest(tc.method, testutil.TestBaseURL+"/gw"+pattern, nil)
+			require.NoError(t, err)
+			req.Header.Set(testutil.TestHeaderAPIKey, token)
+
+			resp, err := client.Do(req)
+			require.NoError(t, err)
+			defer func() { _ = resp.Body.Close() }()
+
+			assert.Equal(t, tc.wantStatus, resp.StatusCode)
+			if tc.wantURLContains == "" {
+				return
+			}
+
+			bodyBytes, err := io.ReadAll(resp.Body)
+			require.NoError(t, err)
+			var res struct {
+				URL string `json:"url"`
+			}
+			if err := json.Unmarshal(bodyBytes, &res); err != nil {
+				t.Fatalf("Failed to decode %s response: %v. Body: %s", tc.name, err, string(bodyBytes))
+			}
+			assert.Contains(t, res.URL, tc.wantURLContains)
+		})
+	}
```
```sql
ALTER TABLE instances ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE vpcs ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE volumes ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE objects ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE load_balancers ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE scaling_groups ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
ALTER TABLE events ADD COLUMN IF NOT EXISTS user_id UUID REFERENCES users(id) ON DELETE RESTRICT;
```
🧩 Analysis chain

🏁 Scripts executed:

```bash
# Examine the full migration file
cat -n internal/repositories/postgres/migrations/014_resource_scoping.up.sql

# Search earlier migrations for prior user_id additions
ls -la internal/repositories/postgres/migrations/ | head -20
rg "user_id" internal/repositories/postgres/migrations/

# Check migrations 001-013 for these tables' initial schemas
for i in {001..013}; do
  file="internal/repositories/postgres/migrations/${i}_"*.up.sql
  if [ -f "$file" ]; then
    echo "=== $file ==="
    cat "$file"
  fi
done
cat -n internal/repositories/postgres/migrations/001_create_instances_table.up.sql
cat -n internal/repositories/postgres/migrations/006_create_vpcs_table.up.sql
cat -n internal/repositories/postgres/migrations/007_create_events_table.up.sql
cat -n internal/repositories/postgres/migrations/008_create_volumes_table.up.sql
cat -n internal/repositories/postgres/migrations/013_create_users_table.up.sql

# Look for load_balancers, scaling_groups and objects table creation
rg "CREATE TABLE.*load_balancers|CREATE TABLE.*scaling_groups|CREATE TABLE.*objects" internal/repositories/postgres/migrations/ -A 10

# Check whether any migration between 014 and later ones modifies these columns
for i in {015..025}; do
  file="internal/repositories/postgres/migrations/${i}_"*.up.sql
  if [ -f "$file" ]; then
    grep -l "ALTER TABLE.*instances\|ALTER TABLE.*vpcs\|ALTER TABLE.*volumes\|ALTER TABLE.*objects\|ALTER TABLE.*load_balancers\|ALTER TABLE.*scaling_groups\|ALTER TABLE.*events" "$file" 2>/dev/null && echo "Found in $file"
  fi
done
```
🌐 Web query:
PostgreSQL ADD COLUMN IF NOT EXISTS REFERENCES foreign key constraint behavior
💡 Result:
- `ALTER TABLE ... ADD COLUMN IF NOT EXISTS col ...` is all-or-nothing for that subcommand: if the column already exists, PostgreSQL does not execute the column definition at all (it just suppresses the error). So `... ADD COLUMN IF NOT EXISTS col integer REFERENCES parent(id)` will not add/repair the foreign key if `col` already exists. [1][2]
- `REFERENCES ...` in a column definition is a foreign key constraint (same syntax as `CREATE TABLE`). It requires referenced columns to be a PRIMARY KEY or UNIQUE (or an equivalent unique index, depending on version). [3][4]
- When the column is newly added, PostgreSQL will create the FK and (in general) checks existing rows immediately; `ALTER TABLE ... ADD CONSTRAINT ... NOT VALID` exists specifically to skip the initial scan for large tables. In the common "new nullable column" case, existing rows are `NULL`, and `NULL` values do not have to satisfy the FK, so it won't fail unless you also make it `NOT NULL` / give a non-null default. [5][6]
Practical pattern for idempotency
- Do `ADD COLUMN IF NOT EXISTS ...` first, then add the FK separately (optionally `NOT VALID` + `VALIDATE CONSTRAINT`), because `IF NOT EXISTS` won't help you if the column exists but the FK doesn't. [1][6]
Sources: PostgreSQL ALTER TABLE (ADD COLUMN IF NOT EXISTS) docs [1][2]; CREATE TABLE / REFERENCES FK definition details [4]; constraint behavior for FKs with NULLs [3]; adding constraints + NOT VALID behavior [5][6].
Add explicit FK constraints guarded by pg_constraint for idempotency.
ADD COLUMN IF NOT EXISTS ... REFERENCES won't create the foreign key if the column already exists. If a column was added in a previous run or partial migration without the FK constraint, this statement silently skips and leaves referential integrity inconsistent. Follow the same pattern as migration 013 (which does this for api_keys): add the column first, then add the FK constraint separately within a DO block checking pg_constraint.
Pattern (repeat for each table)
DO $$
BEGIN
IF NOT EXISTS (
SELECT 1
FROM pg_constraint
WHERE conname = 'instances_user_id_fkey'
) THEN
ALTER TABLE instances
ADD CONSTRAINT instances_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT;
END IF;
END $$;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/014_resource_scoping.up.sql` around
lines 4 - 10, The migration currently uses "ADD COLUMN IF NOT EXISTS ...
REFERENCES" which is skipped if the column exists and thus can leave missing FK
constraints; update 014_resource_scoping.up.sql to (1) ensure each table
(instances, vpcs, volumes, objects, load_balancers, scaling_groups, events) has
the user_id column added with ALTER TABLE ... ADD COLUMN IF NOT EXISTS user_id
UUID (no REFERENCES), and (2) for each table add a guarded FK constraint named
e.g. instances_user_id_fkey, vpcs_user_id_fkey, etc., inside a DO $$ BEGIN ...
END $$ block that checks pg_constraint for conname before running ALTER TABLE
... ADD CONSTRAINT <name> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE
RESTRICT (following the pattern used in migration 013 for api_keys) so the
migration is idempotent and will add missing constraints without error.
| CREATE INDEX idx_clusters_user_id ON clusters(user_id); | ||
| CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id); |
🧩 Analysis chain
🏁 Script executed:
cat -n internal/repositories/postgres/migrations/020_create_clusters.up.sql
Length of output: 1719
Add IF NOT EXISTS to index creation statements for migration idempotency.
The table creation statements (lines 1 and 27) correctly use IF NOT EXISTS, but the index creation statements (lines 37-38) do not. Re-running the migration will fail with "relation already exists" errors if the indexes are already present.
🔧 Suggested fix
-CREATE INDEX idx_clusters_user_id ON clusters(user_id);
-CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);
+CREATE INDEX IF NOT EXISTS idx_clusters_user_id ON clusters(user_id);
+CREATE INDEX IF NOT EXISTS idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id);
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/020_create_clusters.up.sql` around
lines 37 - 38, The CREATE INDEX statements for idx_clusters_user_id and
idx_cluster_nodes_cluster_id should be made idempotent by adding IF NOT EXISTS;
update the two statements that create those indexes (the lines using CREATE
INDEX idx_clusters_user_id ON clusters(user_id) and CREATE INDEX
idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id)) to include IF NOT
EXISTS so rerunning the migration won’t fail if the indexes already exist.
| ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64); | ||
| ALTER TABLE objects ADD COLUMN IF NOT EXISTS is_latest BOOLEAN DEFAULT TRUE; |
Prevent NULL version_id from bypassing uniqueness.
version_id is nullable and has no default, so inserts without a value can create duplicate (bucket, key) rows because unique constraints treat NULLs as distinct. This weakens the pre-existing uniqueness guarantee.
✅ Suggested fix
-ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64);
+ALTER TABLE objects ADD COLUMN IF NOT EXISTS version_id VARCHAR(64) DEFAULT 'null';
ALTER TABLE objects ADD COLUMN IF NOT EXISTS is_latest BOOLEAN DEFAULT TRUE;
UPDATE objects SET version_id = 'null' WHERE version_id IS NULL;
+
+ALTER TABLE objects ALTER COLUMN version_id SET NOT NULL;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/058_add_object_versioning.up.sql`
around lines 3 - 4, The migration leaves objects.version_id nullable which
allows duplicate (bucket, key) combos to bypass uniqueness; update the migration
to set a safe non-null default, backfill existing NULLs, and make the column NOT
NULL (e.g., ALTER TABLE objects ALTER COLUMN version_id SET DEFAULT ''; UPDATE
objects SET version_id = '' WHERE version_id IS NULL; ALTER TABLE objects ALTER
COLUMN version_id SET NOT NULL), and ensure the unique index/constraint that
should cover (bucket, key, version_id) is present/updated so NULLs cannot bypass
uniqueness; reference the objects table and the version_id and is_latest columns
when making these changes.
| DO $$ | ||
| BEGIN | ||
| IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'gateway_routes_pattern_methods_key') THEN | ||
| ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods); | ||
| END IF; |
Scope the constraint existence check to table/schema.
The IF NOT EXISTS check only filters by constraint_name, so a same-named constraint in another schema/table will short-circuit and skip adding the unique constraint on gateway_routes. This can leave the table without the intended uniqueness guarantee.
✅ Suggested fix (table + schema scoped)
DO $$
BEGIN
- IF NOT EXISTS (SELECT 1 FROM information_schema.table_constraints WHERE constraint_name = 'gateway_routes_pattern_methods_key') THEN
+ IF NOT EXISTS (
+ SELECT 1
+ FROM information_schema.table_constraints tc
+ WHERE tc.constraint_name = 'gateway_routes_pattern_methods_key'
+ AND tc.table_name = 'gateway_routes'
+ AND tc.constraint_schema = current_schema()
+ ) THEN
ALTER TABLE gateway_routes ADD CONSTRAINT gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods);
END IF;
END $$;
🤖 Prompt for AI Agents
In `@internal/repositories/postgres/migrations/077_add_gateway_methods.up.sql`
around lines 8 - 12, The IF NOT EXISTS check must be scoped to this specific
table/schema so a same-named constraint elsewhere doesn't block creation; update
the SELECT in the DO $$ block that currently checks only constraint_name =
'gateway_routes_pattern_methods_key' to also filter by table_name =
'gateway_routes' and the appropriate table_schema (e.g. 'public' or the schema
used by this migration), ensuring the ALTER TABLE gateway_routes ADD CONSTRAINT
gateway_routes_pattern_methods_key UNIQUE (path_pattern, methods) only gets
skipped when that constraint exists on this table/schema.
Make call counting in fakeDockerClient thread-safe using a mutex.
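A minimal sketch of the mutex-guarded counter this commit describes. The mu and calls fields and the record helper are illustrative assumptions; only CallCount appears in the tests below:

package docker

import "sync"

// fakeDockerClient sketch: each stubbed client method calls record, so the
// background bootstrap goroutine and test assertions can race safely.
type fakeDockerClient struct {
	mu    sync.Mutex
	calls map[string]int
}

// record notes one invocation of the named client method.
func (f *fakeDockerClient) record(method string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.calls == nil {
		f.calls = make(map[string]int)
	}
	f.calls[method]++
}

// CallCount reports how many times the named method has been invoked.
func (f *fakeDockerClient) CallCount(method string) int {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.calls[method]
}

Each fake method (e.g. ContainerExecCreate) would call f.record("ContainerExecCreate") before returning its stubbed result.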
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/repositories/docker/adapter_unit_test.go`:
- Around line 393-404: The fixed 50ms time.Sleep is causing flaky tests; replace
it with a deterministic wait: either (A) implement a polling-with-timeout in the
test that repeatedly checks cli.CallCount("ContainerExecCreate"),
cli.CallCount("ContainerExecAttach"), and cli.CallCount("ContainerExecInspect")
until they reach the expected counts (or a deadline expires), using short sleeps
between polls, or (B) add a test-only synchronization hook to the adapter (e.g.,
have LaunchInstanceWithOptions optionally return a wait handle or accept a
completion channel/callback) and use that to block until the background
bootstrap goroutine finishes; update the test to use the polling or the new wait
handle instead of time.Sleep(50 * time.Millisecond).
🧹 Nitpick comments (1)
internal/repositories/docker/adapter_unit_test.go (1)
377-405: Consider adding t.Parallel() for consistency with PR objectives. The PR objectives state that parallel execution was enabled for unit tests via t.Parallel(). This new test doesn't include it, which may be intentional if the test has shared state concerns—but given that it uses an isolated fakeDockerClient instance, it should be safe to parallelize.
Suggested change
 func TestDockerAdapterLaunchWithUserData(t *testing.T) {
+	t.Parallel()
 	cli := &fakeDockerClient{}
| // UserData execution occurs asynchronously in a background goroutine. | ||
| // A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls. | ||
| time.Sleep(50 * time.Millisecond) | ||
|
| // Verify the two-stage bootstrap sequence: | ||
| // Stage 1: Payload delivery (writing the bootstrap script to the container filesystem). | ||
| // Stage 2: Bootstrap execution (invoking the script via a background Exec operation). | ||
| // Each 'Exec' operation sequentially triggers ContainerExecCreate, ContainerExecAttach, | ||
| // and ContainerExecInspect according to the adapter's implementation. | ||
| require.Equal(t, 2, cli.CallCount("ContainerExecCreate"), "Expected dual Stage (delivery + execution) Exec invocations") | ||
| require.Equal(t, 2, cli.CallCount("ContainerExecAttach"), "Expected associated ExecAttach calls for I/O") | ||
| require.Equal(t, 2, cli.CallCount("ContainerExecInspect"), "Expected ExecInspect calls to verify termination state") |
Flaky test: time.Sleep for async synchronization is unreliable.
Using a fixed 50ms sleep to wait for background goroutines is a known source of test flakiness—it may be too short on slow CI runners or under load, and unnecessarily slow on fast machines.
Consider one of these alternatives:
- Polling with timeout: Loop checking CallCount until expected values are reached or a deadline expires.
- Expose a synchronization hook: Have the adapter signal completion via a channel or callback in test mode (sketched below, after the polling example).
- Return a wait handle: Modify LaunchInstanceWithOptions to optionally return a way to await bootstrap completion.
Example polling approach
- // UserData execution occurs asynchronously in a background goroutine.
- // A brief pause ensures that the goroutine has sufficient time to initiate its Exec calls.
- time.Sleep(50 * time.Millisecond)
+ // Poll until the expected call counts are reached or timeout.
+ deadline := time.Now().Add(2 * time.Second)
+ for time.Now().Before(deadline) {
+ if cli.CallCount("ContainerExecCreate") >= 2 &&
+ cli.CallCount("ContainerExecAttach") >= 2 &&
+ cli.CallCount("ContainerExecInspect") >= 2 {
+ break
+ }
+ time.Sleep(10 * time.Millisecond)
+ }
🤖 Prompt for AI Agents
In `@internal/repositories/docker/adapter_unit_test.go` around lines 393 - 404,
The fixed 50ms time.Sleep is causing flaky tests; replace it with a
deterministic wait: either (A) implement a polling-with-timeout in the test that
repeatedly checks cli.CallCount("ContainerExecCreate"),
cli.CallCount("ContainerExecAttach"), and cli.CallCount("ContainerExecInspect")
until they reach the expected counts (or a deadline expires), using short sleeps
between polls, or (B) add a test-only synchronization hook to the adapter (e.g.,
have LaunchInstanceWithOptions optionally return a wait handle or accept a
completion channel/callback) and use that to block until the background
bootstrap goroutine finishes; update the test to use the polling or the new wait
handle instead of time.Sleep(50 * time.Millisecond).
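For option B, one possible shape for the test-only hook. This is a sketch under assumptions: the bootstrapDone field and runBootstrap helper do not exist in the adapter today.

// Production code leaves bootstrapDone nil; tests set it before launching
// and block on it instead of sleeping.
type Adapter struct {
	bootstrapDone chan struct{} // closed when the background bootstrap finishes
}

func (a *Adapter) runBootstrap(run func()) {
	go func() {
		run() // the existing exec-create/attach/inspect sequence
		if a.bootstrapDone != nil {
			close(a.bootstrapDone)
		}
	}()
}

A test would then set done := make(chan struct{}) and adapter.bootstrapDone = done before launching, and select on done versus a time.After deadline instead of sleeping.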
Ensure 'objects_bucket_key_key' is dropped when adding versioning support to prevent constraint violations on multiple versions.
Pull request overview
Copilot reviewed 167 out of 168 changed files in this pull request and generated 11 comments.
| -- Remove old unique constraint and add new one that includes version_id | ||
| -- Note: Constraint name might vary, but objects_bucket_key_key is standard for (bucket, key) | ||
| ALTER TABLE objects DROP CONSTRAINT IF EXISTS objects_bucket_key_key; | ||
| ALTER TABLE objects DROP CONSTRAINT IF EXISTS objects_bucket_key_version_unique; |
In this migration, the old UNIQUE(bucket, key) constraint created in 004_create_objects_table.up.sql will be named objects_bucket_key_key by Postgres. Dropping objects_bucket_key_version_unique here won’t remove the existing constraint, so adding the new (bucket, key, version_id) unique constraint will fail on databases that already have the original constraint. Drop the actual old constraint (or query information_schema/pg_constraint to drop whichever exists) before adding the new one.
| @@ -0,0 +1 @@ | |||
| ALTER TABLE load_balancers ADD COLUMN IF NOT EXISTS ip VARCHAR(50); | |||
This migration adds load_balancers.ip, but that column is already introduced in 054_extend_clusters_v2_2.up.sql (as TEXT). Adding it again here (as VARCHAR(50)) is redundant and creates rollback hazards: rolling back version 080 would drop the column even though earlier migrations expect it to exist. Consider removing this migration or converting it into a type/constraint adjustment that is consistent with the earlier column definition.
| -- +goose Down | ||
| ALTER TABLE load_balancers DROP COLUMN IF EXISTS ip; |
080_add_lb_ip.up.sql is missing the -- +goose Up directive, while the corresponding down migration has -- +goose Down. If migrations are applied via goose (as the rest of this directory suggests), the missing directive can cause the migration to be skipped or mis-parsed. Add the proper goose header (and keep Up/Down files consistent).
| CREATE INDEX idx_clusters_user_id ON clusters(user_id); | ||
| CREATE INDEX idx_cluster_nodes_cluster_id ON cluster_nodes(cluster_id); |
The indexes in this migration are created without IF NOT EXISTS. Because the tables are created with IF NOT EXISTS, this migration is likely intended to be idempotent, but repeated application (or partial schema presence) can still fail on index creation. Use CREATE INDEX IF NOT EXISTS ... to match the rest of the migration style in this repo.
| -- Add missing columns to clusters table | ||
| ALTER TABLE clusters ADD COLUMN IF NOT EXISTS pod_cidr VARCHAR(50) DEFAULT '10.244.0.0/16'; | ||
| ALTER TABLE clusters ADD COLUMN IF NOT EXISTS service_cidr VARCHAR(50) DEFAULT '10.96.0.0/12'; | ||
| ALTER TABLE clusters ADD COLUMN IF NOT EXISTS ssh_private_key_encrypted TEXT; | ||
| ALTER TABLE clusters ADD COLUMN IF NOT EXISTS kubeconfig_encrypted TEXT; |
These new migration files don’t include the -- +goose Up / -- +goose Down directives that are used throughout the rest of this migrations directory. If goose is used to run migrations, these files may not execute as intended. Add the appropriate goose headers to both the up and down scripts.
| // Add default rules (K8s control plane ports) | ||
| rules := []domain.SecurityRule{ | ||
| {Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // API Server | ||
| {Protocol: "udp", PortMin: 4789, PortMax: 4789, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // VXLAN | ||
| {Protocol: "tcp", PortMin: 179, PortMax: 179, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // BGP | ||
| {Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // Kubelet | ||
| {Protocol: "tcp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort TCP | ||
| {Protocol: "udp", PortMin: 30000, PortMax: 32767, CIDR: AnyCIDR, Direction: domain.RuleIngress, Priority: 100}, // NodePort UDP | ||
| } | ||
|
| for _, r := range rules { | ||
| _, _ = p.sgSvc.AddRule(ctx, sg.ID, r) | ||
| {Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: "0.0.0.0/0"}, | ||
| {Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 22, PortMax: 22, CIDR: "0.0.0.0/0"}, | ||
| {Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 2379, PortMax: 2380, CIDR: "10.0.0.0/8"}, // etcd (internal) | ||
| {Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: "10.0.0.0/8"}, // kubelet | ||
| } |
The default security group rules here open the Kubernetes API server (6443) and SSH (22) to 0.0.0.0/0. That’s a high-risk default for managed clusters. Prefer restricting these to the VPC CIDR (or a configurable admin CIDR) and only exposing 6443 publicly when explicitly requested.
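A hedged sketch of what a safer default could look like. vpc.CIDR and opts.ExposePublicAPI are assumed inputs, not existing fields in this codebase:

// Sensitive ports default to the VPC CIDR (or a configurable admin CIDR).
rules := []domain.SecurityRule{
	{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: vpc.CIDR},   // API server
	{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 22, PortMax: 22, CIDR: vpc.CIDR},       // SSH
	{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 2379, PortMax: 2380, CIDR: vpc.CIDR},   // etcd (internal)
	{Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 10250, PortMax: 10250, CIDR: vpc.CIDR}, // kubelet
}
// Public API access only on explicit request.
if opts.ExposePublicAPI {
	rules = append(rules, domain.SecurityRule{
		Direction: domain.RuleIngress, Protocol: "tcp", PortMin: 6443, PortMax: 6443, CIDR: "0.0.0.0/0",
	})
}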
| for _, rule := range rules { | ||
| _, _ = p.sgSvc.AddRule(ctx, newSG.ID, rule) | ||
| } |
Errors from AddRule are currently ignored (_, _ = ...). If rule creation fails, provisioning will proceed with a partially configured security group, which can lead to hard-to-debug connectivity issues. Handle and surface these errors (or at least log them and fail provisioning).
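A minimal sketch of surfacing these failures, assuming the enclosing provisioning function returns an error:

for _, rule := range rules {
	if _, err := p.sgSvc.AddRule(ctx, newSG.ID, rule); err != nil {
		// Abort instead of continuing with a partially configured group.
		return fmt.Errorf("add rule %s %d-%d to security group %v: %w",
			rule.Protocol, rule.PortMin, rule.PortMax, newSG.ID, err)
	}
}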
| CREATE TABLE IF NOT EXISTS clusters ( | ||
| id UUID PRIMARY KEY, | ||
| user_id UUID NOT NULL REFERENCES users(id), | ||
| vpc_id UUID NOT NULL REFERENCES vpcs(id), | ||
| name VARCHAR(255) NOT NULL, |
This new migration file name collides with an existing migration version (internal/repositories/postgres/migrations/020_create_notify.up.sql already uses version 020). Having two migrations with the same version will break migration tooling (ordering/application). Rename this to the next available migration number and ensure it’s applied in the intended order.
| config := &ssh.ClientConfig{ | ||
| User: e.user, | ||
| Auth: []ssh.AuthMethod{ | ||
| ssh.PublicKeys(signer), | ||
| }, | ||
| HostKeyCallback: ssh.InsecureIgnoreHostKey(), | ||
| Timeout: 10 * time.Second, | ||
| } | ||
|
| addr := net.JoinHostPort(e.ip, "22") | ||
| client, err := ssh.Dial("tcp", addr, config) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to create ssh client: %w", err) | ||
| return "", fmt.Errorf("failed to dial ssh: %w", err) | ||
| } |
ssh.Dial doesn’t honor context cancellation/timeouts, so Run/WaitForReady can block past ctx deadlines. Since pkg/sshutil already implements SSH execution using net.Dialer.DialContext (pkg/sshutil/client.go:55), consider reusing it here (or switching to DialContext + ssh.NewClientConn) so node exec respects ctx and caller timeouts.
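A sketch of the context-aware variant using only stdlib and x/crypto/ssh primitives. dialSSH is an illustrative helper name, and the SSH handshake itself still runs to completion once the TCP dial succeeds:

import (
	"context"
	"fmt"
	"net"

	"golang.org/x/crypto/ssh"
)

// dialSSH honors ctx for the TCP dial, unlike ssh.Dial.
func dialSSH(ctx context.Context, addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	d := net.Dialer{Timeout: config.Timeout}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("failed to dial ssh: %w", err)
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, fmt.Errorf("failed to establish ssh connection: %w", err)
	}
	return ssh.NewClient(c, chans, reqs), nil
}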
| config := &ssh.ClientConfig{ | ||
| User: e.user, | ||
| Auth: []ssh.AuthMethod{ | ||
| ssh.PublicKeys(signer), | ||
| }, | ||
| HostKeyCallback: ssh.InsecureIgnoreHostKey(), | ||
| Timeout: 10 * time.Second, | ||
| } |
Host key verification is disabled via ssh.InsecureIgnoreHostKey(). If this executor is used outside of local/test environments, this enables MITM attacks. Consider wiring a known-hosts callback or making the host key policy configurable, defaulting to strict verification in production.
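One possible wiring via golang.org/x/crypto/ssh/knownhosts, defaulting to strict verification. knownHostsPath and allowInsecure are assumed configuration, not existing fields:

import (
	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/knownhosts"
)

// buildHostKeyCallback verifies against known_hosts by default and only
// ignores host keys on an explicit opt-out (local/test environments).
func buildHostKeyCallback(knownHostsPath string, allowInsecure bool) (ssh.HostKeyCallback, error) {
	if allowInsecure {
		return ssh.InsecureIgnoreHostKey(), nil
	}
	return knownhosts.New(knownHostsPath)
}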
Allow deployment to staging registry from feature branches to verify build/push before merge. Also adds branch name as docker tag.
Summary
This PR resolves various test failures and improves overall codebase stability and performance.
Key Changes
- internal/repositories/k8s/mocks_test.go.
- SecretService for health checks and kubeconfig retrieval.
- Enabled parallel execution (t.Parallel()) for unit tests and resolved port conflicts (changed test port to 8888).
- golangci-lint cleanup (errcheck, staticcheck, empty branches).
Verification
- E2E tests (TestKubernetesE2E) are passing.
- golangci-lint returns 0 issues.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation