Skip to content

feat: compute backend improvements and libvirt macos support#56

Merged
poyrazK merged 90 commits intomainfrom
feature/compute-improvements
Feb 13, 2026
Merged

feat: compute backend improvements and libvirt macos support#56
poyrazK merged 90 commits intomainfrom
feature/compute-improvements

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented Feb 12, 2026

This PR consolidates improvements to the compute backend, including full macOS support for the libvirt adapter, enhanced load balancer operations, and significant CI/automation infrastructure updates via Gemini.

Summary by CodeRabbit

  • New Features

    • SSH key management (register, list, get, delete) across API, CLI and SDK
    • Instance metadata & labels (add/update) and new console (VNC) URL endpoint
    • Self‑healing worker to auto‑restart errored instances; Cloud‑Init SSH key injection
  • Documentation

    • ADR and FEATURES updated describing resilience, metadata, and SSH key flows
  • Improvements / Chores

    • CLI metadata/ssh commands; longer DB init timeout; libvirt URI config; CI staging tag tweak; .gitignore additions; removed libvirt test harness

…ntegrate SSH key provisioning into instance creation.
…ntegrate SSH key injection into instance provisioning.
… database persistence, and integration for instance creation.
…itory, database migrations, and integrate SSH key selection into instance launch.
…ability to associate SSH keys with new instances.
…epository, and add mock tenant usage methods for testing.
… worker, and enhance cloud-init with SSH key support.
… worker, and enhance SSH key provisioning for instances.
…g worker for error instances, and enable SSH key injection during provisioning.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 88 out of 89 changed files in this pull request and generated 12 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/sdk/compute.go
Comment on lines 61 to 75
@@ -57,6 +68,10 @@ func (c *Client) LaunchInstance(name, image, ports, instanceType string, vpcID,
"vpc_id": vpcID,
"subnet_id": subnetID,
"volumes": volumes,
"metadata": metadata,
"labels": labels,
"ssh_key_id": sshKeyID,
"cmd": cmd,
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LaunchInstance now has 11 positional parameters, which is hard to use correctly and is a breaking API change for SDK consumers. Consider introducing a request/options struct (or a new method) to avoid a long arg list and preserve backward compatibility.

Copilot uses AI. Check for mistakes.
Comment on lines 799 to 803
func (a *LibvirtAdapter) writeCloudInitFiles(tmpDir, name string, env, cmd []string, userDataRaw string) error {
metaData := fmt.Sprintf("instance-id: %s\nlocal-hostname: %s\n", name, name)
metaData := fmt.Sprintf("{\n \"instance-id\": \"%s\",\n \"local-hostname\": \"%s\"\n}\n", name, name)
if err := os.WriteFile(filepath.Join(tmpDir, metaDataFileName), []byte(metaData), 0644); err != nil {
return err
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta-data for the NoCloud datasource is typically YAML (e.g., instance-id: ...), but this writes JSON. This risks cloud-init ignoring the meta-data. Consider keeping the previous YAML format (or verify JSON is supported for your target images).

Copilot uses AI. Check for mistakes.
Comment on lines 838 to 851
func (a *LibvirtAdapter) runIsoCommand(isoPath, tmpDir string) error {
genCmd := execCommand("genisoimage", "-output", isoPath, "-volid", "config-2", "-joliet", "-rock",
filepath.Join(tmpDir, userDataFileName), filepath.Join(tmpDir, metaDataFileName))

if _, err := genCmd.CombinedOutput(); err != nil {
genCmd = execCommand("mkisofs", "-output", isoPath, "-volid", "config-2", "-joliet", "-rock",
filepath.Join(tmpDir, userDataFileName), filepath.Join(tmpDir, metaDataFileName))
if _, err2 := genCmd.CombinedOutput(); err2 != nil {
return fmt.Errorf("failed to generate iso (genisoimage/mkisofs): %w", err2)
}
var tool string
if p, err := lookPath("mkisofs"); err == nil {
tool = p
} else if p, err := lookPath("genisoimage"); err == nil {
tool = p
} else {
return fmt.Errorf("neither mkisofs nor genisoimage found in PATH")
}

genCmd := execCommand(tool, "-output", isoPath, "-volid", "cidata", "-l", "-R", "-J", tmpDir)
if output, err := genCmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to generate iso using %s: %w (output: %s)", tool, err, string(output))
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mkisofs/genisoimage is invoked with tmpDir as an argument, which will usually embed the directory as a top-level folder in the ISO rather than placing user-data and meta-data at the ISO root. Cloud-init NoCloud expects these files at the root. Consider running the command with Cmd.Dir = tmpDir and using . as the source, or use -graft-points to map filenames to root.

Copilot uses AI. Check for mistakes.
Comment thread pkg/sdk/compute.go
Comment on lines +47 to +53
func (c *Client) GetConsoleURL(idOrName string) (string, error) {
var res Response[string]
if err := c.get(fmt.Sprintf("/instances/%s/console", idOrName), &res); err != nil {
return "", err
}
return res.Data, nil
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API currently returns { "url": "..." } for the console endpoint (see InstanceHandler.GetConsole), but the SDK decodes Response[string]. This will fail to unmarshal. Either change the handler to return the URL as a plain string in data, or update the SDK to decode the object shape.

Copilot uses AI. Check for mistakes.
Comment thread cmd/api/main.go
Comment on lines +34 to +37
const (
dbInitTimeout = 120 * time.Second
)

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbInitTimeout is declared but not used (the code still hard-codes 120*time.Second), which will fail compilation. Use the constant in the timeout call or remove it.

Suggested change
const (
dbInitTimeout = 120 * time.Second
)

Copilot uses AI. Check for mistakes.
Comment on lines +74 to +81
// Inject context for the instance owner
hCtx := appcontext.WithUserID(ctx, inst.UserID)
hCtx = appcontext.WithTenantID(hCtx, inst.TenantID)

w.reconcileWG.Add(1)
// Self-healing: Stop then Start
go func(id string) {
defer w.reconcileWG.Done()
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goroutine closes over hCtx from the loop. Because hCtx is reassigned each iteration, concurrent healing runs can use the wrong user/tenant context. Create a per-iteration local context and pass it into the goroutine as a parameter.

Copilot uses AI. Check for mistakes.
Comment on lines +83 to +88
// Wait a bit to avoid race conditions with recent failures
select {
case <-ctx.Done():
return
case <-time.After(5 * time.Second):
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

healingDelay is configured as a testing hook but the code always sleeps for 5 seconds. Use w.healingDelay (and consider a timer that can be canceled) so behavior matches configuration and tests don't rely on a magic constant.

Copilot uses AI. Check for mistakes.
Comment on lines 35 to 40
t.Run("default network when empty", func(t *testing.T) {
t.Parallel()
xml := generateDomainXML("vm-default", libvirtTestDiskPath, "", "", 512, 1, nil)
assert.Contains(t, xml, "<source network='default'/>")
xml := generateDomainXML("vm-default", libvirtTestDiskPath, "", "", 512, 1, nil, nil)
// Now we use SLIRP by default if no ports, so we look for type='user'
assert.Contains(t, xml, "<interface type='user'>")
})
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test expects <interface type='user'>, but generateDomainXML currently emits a <interface type='network'> when ports is empty/nil. Either update the generator to match the new SLIRP-by-default behavior or fix the assertion to match the actual XML output.

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +34
// 1. Launch with UserData
id, err := adapter.LaunchInstanceWithOptions(ctx, ports.CreateInstanceOptions{
Name: name,
ImageName: "alpine",
UserData: userData,
})
require.NoError(t, err)
defer func() { _ = adapter.DeleteInstance(ctx, id) }()
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LaunchInstanceWithOptions now returns (string, []string, error), but this integration test still uses the old (string, error) signature. Update it to capture/ignore the allocated ports to keep -tags integration builds compiling.

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +40
func (m *MockLBRepo) GetByName(ctx context.Context, name string) (*domain.LoadBalancer, error) {
for _, lb := range m.LBs {
if lb.Name == name {
return lb, nil
}
}
return nil, nil
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning (nil, nil) on miss makes callers treat "not found" as success and can lead to nil dereferences or false-positive tests. Consider returning a NotFound error when no load balancer matches the provided name (consistent with the real Postgres repo behavior).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/core/services/database.go (1)

102-105: ⚠️ Potential issue | 🟡 Minor

Same silent failure pattern on cleanup.

Line 103 also discards the DeleteInstance error. Same guideline applies — log the failure to avoid silent orphaned resources.

Proposed fix
-		_ = s.compute.DeleteInstance(ctx, containerID)
+		if delErr := s.compute.DeleteInstance(ctx, containerID); delErr != nil {
+			s.logger.Error("failed to clean up database container after repo create failure", "container_id", containerID, "error", delErr)
+		}

As per coding guidelines, Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

internal/repositories/libvirt/templates.go (1)

55-74: ⚠️ Potential issue | 🟡 Minor

Additional disk device letter can overflow past 'z' with many disks.

Line 58: dev := fmt.Sprintf("vd%c", 'b'+i) will produce invalid device names when i >= 25. This is an edge case but silently produces garbage device names. Consider capping or returning an error.

🤖 Fix all issues with AI agents
In `@internal/core/services/cache.go`:
- Around line 91-94: Replace the magic string "6379" with a named constant (e.g.
const redisPort = "6379") declared at the top of the file, then use that
constant in calls to s.parseAllocatedPort, s.compute.GetInstancePort and
anywhere else the literal appears (references found in calls around
s.parseAllocatedPort(...) and s.compute.GetInstancePort(...), and the occurrence
near line 164) so the Redis port value is defined in one place and reused.
- Around line 82-89: The assignment cache.Status = domain.CacheStatusFailed is
dead because you immediately call s.repo.Delete(ctx, cache.ID); either remove
the status assignment or persist it instead of deleting: in the error branch
after s.launchCacheContainer(ctx, cache, networkID) update the record via the
repository (e.g., set cache.Status = domain.CacheStatusFailed and call the repo
update method such as s.repo.Update(ctx, cache) or s.repo.UpdateStatus(ctx,
cache.ID, domain.CacheStatusFailed) before returning) or if deletion is intended
simply remove the cache.Status assignment; adjust the code around s.repo.Delete
and the error return accordingly.
- Around line 97-99: The code is silently discarding errors from
s.compute.DeleteInstance(ctx, containerID) and s.repo.Update(ctx, cache);
capture and handle those errors instead of using the blank identifier. Modify
the block in the method that touches DeleteInstance and repo.Update so you
assign their return errors to variables, log or wrap the errors with context
(including containerID and cache.ID), and propagate or return an appropriate
error (or set a retry/failure state) rather than ignoring them; ensure the
cache.Status = domain.CacheStatusFailed remains but only persists after handling
repo.Update errors so failures are visible and traceable.
- Around line 117-133: Add a concise doc comment above
CacheService.parseAllocatedPort explaining it extracts the host port from
allocated port mapping strings and that the expected format is
"hostPort:containerPort" (for example "8080:6379"); update the function comment
to mention it returns the host port as an int and may return an error when
conversion fails. Also locate the misleading comment in database.go that
mentions "host:port or host:containerPort:hostPort or containerPort:hostPort"
and simplify it to reflect the actual 2-part adapter behavior
("hostPort:containerPort") so comments across parseAllocatedPort and database.go
are consistent.

In `@internal/core/services/database.go`:
- Around line 88-96: The cleanup call currently discards errors from
s.compute.DeleteInstance when compute.GetInstancePort fails; change this to
capture the returned error and log it (using s.logger.Error) with context such
as "failed to delete instance during port resolution", the containerID, and the
deletion error, instead of assigning to the blank identifier; update the block
around parseAllocatedPort / compute.GetInstancePort so that after
s.compute.DeleteInstance(ctx, containerID) you check its error and log it
(include containerID and err) while preserving the original wrapped return
error.

In `@internal/repositories/libvirt/adapter.go`:
- Line 114: The returned error currently uses fmt.Errorf("failed to connect to
libvirt hypervisor: %v", err2) which breaks the error chain; update the return
to wrap the underlying error with %w (e.g., fmt.Errorf("failed to connect to
libvirt hypervisor: %w", err2)) so callers can use errors.Is/errors.As on err2;
locate the return in the libvirt adapter (the code that references err2) and
replace %v with %w while keeping the same message/context.
- Around line 948-954: The code silently discards the error from findFreePort(),
which can leave hPort as 0 and return (0, cP, nil); update the block handling
hostPort to capture the error from findFreePort(), e.g., call port, err :=
findFreePort(); if err != nil return the error (or wrap it) instead of ignoring
it, and assign hPort = port only on success; ensure you adjust the surrounding
function return path so callers receive a non-nil error when findFreePort()
fails (referencing findFreePort(), hPort, hostPort and hP).
- Around line 152-192: Extract the duplicated lock/init/assign logic into a
helper method on the adapter (suggested name: recordPortMapping(name string,
hPort string, cPort string) error) and replace both branches with a call to it;
inside the helper acquire a.mu, ensure a.portMappings[name] is initialized,
parse hPort with strconv.Atoi while returning an error if parsing fails (do not
ignore the error), and store the parsed host port into
a.portMappings[name][cPort]; callers (both the len(parts)==2 and len(parts)==1
branches) should handle the helper error by logging the warning (similar to
existing messages) and continuing, and keep the existing allocatedPorts append
behavior in the caller after successful allocation.
- Line 195: The call that builds domain XML hardcodes memory and vCPU
(generateDomainXML(name, diskPath, networkID, isoPath, 512, 1, ...)) instead of
using the user-specified limits; update the call to pass the VM limits from opts
(e.g., opts.MemoryLimit and opts.CPULimit) and fall back to the current defaults
when those are zero/nil (so domainXML is generated with the requested memory and
vCPU). Locate the call that assigns domainXML and replace the literal 512 and 1
with the computed memory and vcpu variables derived from opts (ensuring
units/count match generateDomainXML’s expectations).
- Line 800: The metadata JSON in writeCloudInitFiles is built by interpolating
name into metaData with fmt.Sprintf which can produce malformed JSON if name
contains quotes or backslashes; replace this manual formatting by constructing a
small struct (e.g., fields InstanceID/LocalHostname) and use encoding/json
(json.Marshal or json.MarshalIndent) to produce a properly escaped JSON byte
slice/string, handle any marshal error, and use that marshaled result wherever
metaData is consumed.

In `@internal/repositories/libvirt/templates.go`:
- Around line 99-101: The qemuArgsXML construction is embedding the HTML-escaped
variable name into a filesystem path (qemuArgsXML += fmt.Sprintf(...
file:/tmp/%s-console.log ...)), causing filenames like /tmp/&amp;-console.log;
fix by using a filename-safe domain name instead of the HTML-escaped
`name`—either call the existing sanitizeDomainName function (or create a
sanitizeFilename helper) and use that sanitized value (e.g., sanitizedName) when
building the serial log path so the XML still uses the escaped `name` where
required but the filesystem path uses a safe, unescaped identifier.

In `@internal/workers/healing_worker_test.go`:
- Line 85: The test sets worker.healingDelay but the goroutine in
healERRORInstances ignores it because it uses a hardcoded time.After(5 *
time.Second); change healERRORInstances to use the instance field (e.g.
time.After(w.healingDelay) or equivalent) so the goroutine respects
w.healingDelay, remove the hardcoded 5s, and ensure the worker struct's
healingDelay default is applied where the ticker/delay is created so tests that
set worker.healingDelay take effect.

In `@internal/workers/healing_worker.go`:
- Around line 56-102: healERRORInstances currently spawns an unbounded goroutine
per ERROR instance which can overwhelm downstream services; limit concurrency by
adding a semaphore/worker-pool (e.g., a buffered channel with capacity
maxConcurrentHeals) and acquire a token before starting each heal goroutine and
release it (defer) when done, keep using w.reconcileWG.Add/Done as currently
used, and ensure the token acquisition respects ctx cancellation so long-running
loops stop promptly; apply this around the goroutine that calls
instSvc.StopInstance and instSvc.StartInstance to cap concurrent healing
operations.
- Around line 84-88: Replace the hardcoded sleep in the select to use the
instance's configurable delay: change the select that currently waits on
time.After(5 * time.Second) to wait on time.After(w.healingDelay) so the
goroutine honors the healingDelay field (used in tests and production); locate
the select within the healing loop in healing_worker.go (the goroutine that
checks ctx.Done()) and update the time.After argument to w.healingDelay.
🧹 Nitpick comments (15)
internal/api/setup/router.go (1)

243-250: Introduce dedicated SSH key RBAC permissions for consistency and granular access control.

SSH key routes use PermissionInstanceUpdate and PermissionInstanceRead instead of dedicated SSH key permissions. Every other resource in this file has its own permission set (snapshots, images, clusters, VPCs, load balancers), but no PermissionSSHKeyCreate, PermissionSSHKeyRead, or PermissionSSHKeyDelete constants exist. This:

  1. Conflates SSH key management with instance permissions—granting one implicitly grants both, preventing independent access control.
  2. Uses PermissionInstanceUpdate for DELETE (line 249), inconsistent with other resources where destructive operations use dedicated *Delete permissions.

Define PermissionSSHKeyCreate, PermissionSSHKeyRead, PermissionSSHKeyDelete in internal/core/domain/rbac.go to match the established pattern across other resources.

internal/core/services/database.go (2)

32-48: Constructor has 6 dependencies but does not use a params struct.

The coding guidelines require a params struct for constructors with 3 or more dependencies. While this constructor predates this PR, you're touching the file — consider migrating to a params struct.

Example
type DatabaseServiceParams struct {
	Repo     ports.DatabaseRepository
	Compute  ports.ComputeBackend
	VpcRepo  ports.VpcRepository
	EventSvc ports.EventService
	AuditSvc ports.AuditService
	Logger   *slog.Logger
}

func NewDatabaseService(p DatabaseServiceParams) *DatabaseService {
	return &DatabaseService{
		repo:     p.Repo,
		compute:  p.Compute,
		vpcRepo:  p.VpcRepo,
		eventSvc: p.EventSvc,
		auditSvc: p.AuditSvc,
		logger:   p.Logger,
	}
}

As per coding guidelines, Use a params struct for service constructors with 3 or more dependencies.


182-194: Multiple silent failures in recordDatabaseCreation.

Lines 183 and 188 discard errors from RecordEvent and Log. If these are truly fire-and-forget, at minimum log at Warn level. This is existing code, but noting for consistency with the guideline.

As per coding guidelines, Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

internal/repositories/libvirt/templates.go (3)

6-6: html.EscapeString is a workable but imprecise choice for XML escaping.

Go's html.EscapeString covers the five XML predefined entities (& < > " '), so it prevents injection in this context. However, it's semantically misleading—readers may wonder whether HTML escaping is correct for XML. Consider adding a brief comment or aliasing it (e.g., var xmlEscape = html.EscapeString) to make the intent explicit.

Also applies to: 11-12, 34-37, 57-57, 84-85, 151-155


77-89: No validation that port strings are numeric before embedding in QEMU args.

generateDomainXML embeds hPort/cPort directly into hostfwd=tcp::<hPort>-:<cPort> after only HTML-escaping them. If a non-numeric value slips through (e.g., from a malformed API request), the resulting QEMU command line will be invalid. The adapter does some parsing, but a defensive check here (or at least a documented contract) would harden this template against misuse.


112-117: Architecture detection via substring match on file paths is fragile.

Checking strings.Contains(isoPath, "arm64") could produce false positives (e.g., a path like /images/not-arm64-really/ubuntu.qcow2) or false negatives (e.g., aarch64 naming convention). Consider accepting the architecture as an explicit parameter to generateDomainXML instead of inferring it from paths.

Suggested approach
-func generateDomainXML(name, diskPath, networkID, isoPath string, memoryMB, vcpu int, additionalDisks []string, ports []string) string {
+func generateDomainXML(name, diskPath, networkID, isoPath string, memoryMB, vcpu int, additionalDisks []string, ports []string, arch string) string {
 	name = html.EscapeString(name)
 	diskPath = html.EscapeString(diskPath)
 	networkID = html.EscapeString(networkID)
 	isoPath = html.EscapeString(isoPath)
 ...
-	arch := "x86_64"
-	machine := "pc"
-	if strings.Contains(isoPath, "arm64") || strings.Contains(diskPath, "arm64") {
-		arch = "aarch64"
-		machine = "virt"
-	}
+	if arch == "" {
+		arch = "x86_64"
+	}
+	machine := "pc"
+	if arch == "aarch64" {
+		machine = "virt"
+	}
internal/repositories/libvirt/adapter.go (3)

27-30: Global mutable variables for exec.Command, exec.LookPath, and os.Open.

These violate the "Do not use global variables" coding guideline. While this is a common Go pattern for testability, the guideline is explicit. Consider injecting these as fields on LibvirtAdapter or via an OSExecutor interface for test substitution.

As per coding guidelines, **/*.go: "Do not use global variables (e.g., var DB *sql.DB)".


307-311: Magic numbers for domain state and memory stat tags.

state == 1 (Running), stat.Tag == 6 (RSS), stat.Tag == 5 (actual) are opaque. Define named constants per the coding guidelines.

As per coding guidelines, **/*.go: "Do not use magic numbers - use named constants instead".

Proposed constants
const (
	domainStateRunning  = 1
	domainStateShutoff  = 5
	memStatTagActual    = 5
	memStatTagRSS       = 6
)

Also applies to: 375-383


208-210: Port forwarding launched in a fire-and-forget goroutine.

setupPortForwarding runs in a background goroutine with no way for the caller to know if it succeeds or fails. If it fails, the instance is running but unreachable on its declared ports. Consider at minimum logging a structured error that can be monitored, or tracking the goroutine in a sync.WaitGroup so cleanup can wait for it.

cmd/api/main.go (2)

34-36: The 120-second DB init timeout is a significant increase from the previous value.

This is a 12× increase from the original 10-second timeout. While it may be necessary for slow CI environments or cold starts, it can mask connection problems in production. Consider making this configurable via environment variable or platform config rather than a hardcoded constant.


278-322: Consider extracting the repetitive worker-start pattern into a helper.

All 11 workers follow the identical if w != nil { wg.Add(1); go w.Run(ctx, wg) } pattern. A small helper or a slice of interface{ Run(context.Context, *sync.WaitGroup) } would reduce boilerplate and make it harder to forget wg.Add(1).

Example
type Worker interface {
	Run(context.Context, *sync.WaitGroup)
}

func startWorkers(ctx context.Context, wg *sync.WaitGroup, workers ...Worker) {
	for _, w := range workers {
		if w != nil {
			wg.Add(1)
			go w.Run(ctx, wg)
		}
	}
}

Note: with interfaces stored in a slice, a typed-nil check requires reflection or a non-nil wrapper, so be cautious with nil interface values.

internal/workers/healing_worker_test.go (2)

79-104: Consider table-driven tests to cover more scenarios.

The current test covers only the happy path (one running + one errored instance). Missing scenarios include: all instances healthy, repo ListAll failure, StopInstance failure followed by StartInstance attempt, and context cancellation during healing. Table-driven tests would make this more maintainable.

As per coding guidelines, **/*_test.go: "Use table-driven tests in test files".


15-38: Mock methods that don't delegate to mock.Called() won't detect unexpected invocations.

Methods like Create, GetByID, Update, Delete on mockInstanceRepo (and similarly on mockInstanceSvc) return hardcoded values without going through testify's call tracking. If the code under test accidentally calls one of these, the test won't catch it. Consider either delegating all methods to mock.Called(), or using mock.AssertNotCalled for the methods you don't expect to be invoked.

Also applies to: 40-77

internal/workers/healing_worker.go (2)

57-57: ListAll loads every instance across all tenants into memory just to filter for ERROR status.

At scale, this is an O(N) memory and DB cost on every tick. A dedicated repository query like ListByStatus(ctx, domain.StatusError) would be far more efficient.


41-41: Tick interval and healing delay are magic numbers.

1 * time.Minute and 5 * time.Second should either be named constants or configurable fields. The healingDelay field already exists but isn't used (see above); consider also making the tick interval a field.

As per coding guidelines, **/*.go: "Do not use magic numbers - use named constants instead".

Also applies to: 87-87

Comment thread internal/core/services/cache.go
Comment thread internal/core/services/cache.go Outdated
Comment thread internal/core/services/cache.go Outdated
Comment thread internal/core/services/cache.go
Comment thread internal/core/services/database.go
Comment thread internal/repositories/libvirt/adapter.go
Comment on lines +99 to +101
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-serial'/>
<qemu:arg value='file:/tmp/%s-console.log'/>`, name)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Serial log path uses the already-HTML-escaped name, which could produce unexpected filenames.

At this point name has been through html.EscapeString. If the original name contained &, the log path would be /tmp/&amp;-console.log. In practice, the adapter's sanitizeDomainName strips non-alphanumeric characters, so this is currently safe, but the template shouldn't rely on that upstream guarantee.

🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` around lines 99 - 101, The
qemuArgsXML construction is embedding the HTML-escaped variable name into a
filesystem path (qemuArgsXML += fmt.Sprintf(... file:/tmp/%s-console.log ...)),
causing filenames like /tmp/&amp;-console.log; fix by using a filename-safe
domain name instead of the HTML-escaped `name`—either call the existing
sanitizeDomainName function (or create a sanitizeFilename helper) and use that
sanitized value (e.g., sanitizedName) when building the serial log path so the
XML still uses the escaped `name` where required but the filesystem path uses a
safe, unescaped identifier.

Comment thread internal/workers/healing_worker_test.go Outdated
logger := slog.Default()

worker := NewHealingWorker(svc, repo, logger)
worker.healingDelay = 1 * time.Millisecond
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

healingDelay is set here but never read by the goroutine in healERRORInstances.

The healing goroutine in healing_worker.go (Line 87) uses a hardcoded time.After(5 * time.Second) instead of w.healingDelay. This means:

  1. This assignment has no effect — the test still waits the full 5 seconds.
  2. The healingDelay field on the struct is dead code.

See the related comment on healing_worker.go for the fix.

🤖 Prompt for AI Agents
In `@internal/workers/healing_worker_test.go` at line 85, The test sets
worker.healingDelay but the goroutine in healERRORInstances ignores it because
it uses a hardcoded time.After(5 * time.Second); change healERRORInstances to
use the instance field (e.g. time.After(w.healingDelay) or equivalent) so the
goroutine respects w.healingDelay, remove the hardcoded 5s, and ensure the
worker struct's healingDelay default is applied where the ticker/delay is
created so tests that set worker.healingDelay take effect.

Comment thread internal/workers/healing_worker.go
Comment thread internal/workers/healing_worker.go
Copilot AI review requested due to automatic review settings February 12, 2026 15:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 88 out of 89 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +5
CREATE TABLE ssh_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE RESTRICT,
name VARCHAR(64) NOT NULL,
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrations are re-run on each startup; CREATE TABLE without IF NOT EXISTS will emit warnings on subsequent runs. Consider using CREATE TABLE IF NOT EXISTS (and similarly idempotent DDL for constraints/indexes) to match the project's migration strategy.

Copilot uses AI. Check for mistakes.
Comment thread internal/api/setup/router.go Outdated

gatewayGroup := r.Group("/gateway")
gatewayGroup.Use(httputil.Auth(svcs.Identity, svcs.Tenant))
gatewayGroup.Use(httputil.Auth(services.Identity, services.Tenant), httputil.RequireTenant())
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

services.Identity / services.Tenant are referenced here, but the function receives svcs *Services and there is no services identifier in scope. This will not compile; use svcs.Identity and svcs.Tenant (and keep middleware consistent with the other route groups).

Suggested change
gatewayGroup.Use(httputil.Auth(services.Identity, services.Tenant), httputil.RequireTenant())
gatewayGroup.Use(httputil.Auth(svcs.Identity, svcs.Tenant), httputil.RequireTenant())

Copilot uses AI. Check for mistakes.
Comment on lines +152 to +153
if err := s.tenantSvc.CheckQuota(ctx, tenantID, "memory", it.MemoryMB/1024); err != nil {
return nil, err
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it.MemoryMB/1024 uses integer division, so instance types under 1GiB (e.g. 128MB in tests) result in a requested amount of 0 and bypass memory quota checks/usage tracking. Convert MB→GB using a ceiling (min 1 for non-zero memory) or track quota in MB to avoid undercounting.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1 @@
ALTER TABLE instances ADD COLUMN ssh_key_id UUID REFERENCES ssh_keys(id) ON DELETE SET NULL;
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrations are executed on every startup and errors are only logged, so this should be idempotent. Add IF NOT EXISTS to the ADD COLUMN to avoid repeated startup warnings.

Copilot uses AI. Check for mistakes.
Comment thread internal/api/setup/dependencies.go Outdated
clusterProvisioner := k8s.NewKubeadmProvisioner(instSvc, c.Repos.Cluster, secretSvc, sgSvc, storageSvc, lbSvc, c.Logger)
clusterSvc, err := services.NewClusterService(services.ClusterServiceParams{
Repo: c.Repos.Cluster, Provisioner: clusterProvisioner, VpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: c.Repos.TaskQueue, Logger: c.Logger,
Repo: c.Repos.Cluster, Provisioner: clusterProvisioner, vpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: c.Repos.TaskQueue, Logger: c.Logger,
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterServiceParams uses the exported field VpcSvc, but this struct literal sets vpcSvc, which will not compile. Rename the key to VpcSvc to match the struct definition.

Suggested change
Repo: c.Repos.Cluster, Provisioner: clusterProvisioner, vpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: c.Repos.TaskQueue, Logger: c.Logger,
Repo: c.Repos.Cluster, Provisioner: clusterProvisioner, VpcSvc: vpcSvc, InstanceSvc: instSvc, SecretSvc: secretSvc, TaskQueue: c.Repos.TaskQueue, Logger: c.Logger,

Copilot uses AI. Check for mistakes.
Comment thread .github/workflows/ci.yml Outdated
Comment on lines +322 to +325
<<<<<<< HEAD
ghcr.io/${{ steps.repo_name.outputs.REPO_LC }}:${{ steps.repo_name.outputs.REF_NAME }}
=======
>>>>>>> main
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow still contains unresolved Git merge-conflict markers (<<<<<<< / ======= / >>>>>>>). This will break the YAML and stop CI from running; resolve the conflict and keep the intended tag list.

Suggested change
<<<<<<< HEAD
ghcr.io/${{ steps.repo_name.outputs.REPO_LC }}:${{ steps.repo_name.outputs.REF_NAME }}
=======
>>>>>>> main
ghcr.io/${{ steps.repo_name.outputs.REPO_LC }}:${{ steps.repo_name.outputs.REF_NAME }}

Copilot uses AI. Check for mistakes.
@poyrazK poyrazK force-pushed the feature/compute-improvements branch from 0d5aecd to 38d3e08 Compare February 12, 2026 20:04
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/core/services/database.go (1)

233-237: ⚠️ Potential issue | 🟡 Minor

Silent failures in DeleteDatabase — inconsistent with the create path fix.

Lines 233 and 235 still use _ = to discard errors from RecordEvent and Log, violating the "no silent failures" guideline. The same pattern was properly fixed in recordDatabaseCreation (lines 190–202). Apply the same warn-and-log treatment here for consistency.

Proposed fix
-	_ = s.eventSvc.RecordEvent(ctx, "DATABASE_DELETE", id.String(), "DATABASE", nil)
+	if err := s.eventSvc.RecordEvent(ctx, "DATABASE_DELETE", id.String(), "DATABASE", nil); err != nil {
+		s.logger.Warn("failed to record database deletion event", "id", id, "error", err)
+	}

-	_ = s.auditSvc.Log(ctx, db.UserID, "database.delete", "database", db.ID.String(), map[string]interface{}{
+	if err := s.auditSvc.Log(ctx, db.UserID, "database.delete", "database", db.ID.String(), map[string]interface{}{
 		"name": db.Name,
-	})
+	}); err != nil {
+		s.logger.Warn("failed to log database deletion to audit", "id", db.ID, "error", err)
+	}

As per coding guidelines, Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

🤖 Fix all issues with AI agents
In `@docs/swagger/docs.go`:
- Around line 4240-4393: Add the Swagger security annotation to the SSH key
handlers so the generated spec shows APIKey auth: insert the comment line "//
`@Security` APIKeyAuth" immediately above each exported handler method in
internal/handlers/ssh_key_handler.go (the Create, List, Get, and Delete methods)
and then regenerate the docs; ensure the annotation is placed directly above
each method declaration so Swagger picks it up.
- Around line 2908-2971: The swagger annotation currently declares a free-form
body but the handler in internal/handlers/instance_handler.go expects a
structured request with two fields (metadata and labels both map[string]string);
fix this by creating a named request type (e.g., UpdateInstanceMetadataRequest
with Metadata and Labels map[string]string) or by updating the `@Param` annotation
for the handler at internal/handlers/instance_handler.go:384 to reference those
explicit fields (required as appropriate) instead of map[string]interface{} so
the generated docs show metadata and labels as properties rather than a
free-form object.

In `@docs/swagger/swagger.json`:
- Around line 2900-2963: The PUT operation for "/instances/{id}/metadata"
currently uses a free-form body schema (parameter name "request") which prevents
SDKs from exposing the expected metadata/labels shape; replace the loose schema
in the "put" operation with a concrete model (e.g., a new definition like
"InstanceMetadataRequest") that explicitly defines properties such as "metadata"
(object<string,string>), "labels" (object<string,string>), and any required
fields, update the parameter's "schema" to $ref
"#/definitions/InstanceMetadataRequest", and add the corresponding definition
under "definitions" so generated clients and validation will reflect the exact
payload shape.
- Around line 4232-4302: The OpenAPI spec for the "/ssh-keys" path is missing
the API key security requirement for both the "get" and "post" operations;
update the "/ssh-keys" operations (the "get" and "post" objects) to include a
"security" array that references the existing APIKeyAuth security scheme (e.g.,
add "security": [{"APIKeyAuth": []}]) so clients and SDK generators know these
endpoints require API key authentication; locate the "/ssh-keys" path block and
add the security entry to both the get and post operation objects (ensuring it
matches the existing securityDefinitions/securitySchemes name used elsewhere).

In `@docs/swagger/swagger.yaml`:
- Around line 4538-4638: The four SSH key operations under the /ssh-keys and
/ssh-keys/{id} paths (the get/list, post/create, get/Get SSH Key, and
delete/Delete SSH Key operations) are missing the operation-level security
declaration; add a security section to each operation with APIKeyAuth (e.g.,
include a security: - APIKeyAuth entry for the list, create, get, and delete
operations) so the spec marks them as authenticated like other tenant-scoped
endpoints.

In `@internal/api/setup/dependencies.go`:
- Line 363: In the struct literal for ClusterServiceParams replace the incorrect
lowercase field name vpcSvc with the exported field name VpcSvc (i.e., use
VpcSvc: vpcSvc variable), ensuring the field initializer matches the declared
struct field exactly; also scan nearby struct initializers in the same
ClusterServiceParams construction to confirm all other field names (e.g., Repo,
Provisioner, InstanceSvc, SecretSvc, TaskQueue, Logger) use the exact exported
identifiers.

In `@internal/api/setup/router.go`:
- Line 498: The call to gatewayGroup.Use references an undefined identifier
"services" causing a build error; update the call to use the correct parameter
name "svcs" (i.e., replace references to services with svcs) so the middleware
invocation gatewayGroup.Use(httputil.Auth(svcs.Identity, svcs.Tenant),
httputil.RequireTenant()) compiles and uses the passed-in service bundle.
- Line 10: Update the imports and variable references in router.go: replace the
incorrect import "github.com/poyrazk/thecloud/internal/api/setup/ws" with
"github.com/poyrazk/thecloud/internal/handlers/ws", add the missing import
"github.com/gin-contrib/pprof" so pprof.Register(r) compiles, and change the
undefined references services.Identity and services.Tenant to use the function
parameter name svcs (svcs.Identity and svcs.Tenant) to match the rest of the
code.

In `@internal/core/services/database.go`:
- Around line 118-135: The comment for parseAllocatedPort is misleading: update
the comment/documentation above the DatabaseService.parseAllocatedPort function
to accurately describe supported formats and the matching/extraction logic —
namely that it treats the last segment as the container port to compare with
targetPort, and returns the host port from parts[0] for 2-part strings
(hostPort:containerPort) or parts[1] for 3-part strings
(host:hostPort:containerPort); add a short doc-comment describing expected input
formats and the return semantics (int hostPort, error) so future readers
understand the behavior.

In `@internal/core/services/instance_test.go`:
- Around line 456-465: The test "Update Metadata" discards the error returned by
svc.LaunchInstance which can leave inst nil and cause a panic; update the test
to capture and assert the error immediately after calling svc.LaunchInstance
(e.g., require.NoError or if err != nil { t.Fatal(err) }) before using inst.ID,
so the test fails cleanly when LaunchInstance returns an error.

In `@internal/core/services/instance.go`:
- Around line 192-197: The rollback after s.repo.Create in the Create flow
currently decrements "vcpus" and "memory" but must also decrement the
"instances" quota now that IncrementUsage for "instances" has been added; update
the rollback in the block that handles the Create error (the same block calling
s.repo.Create) to call s.tenantSvc.DecrementUsage(ctx, tenantID, "instances", 1)
alongside the existing DecrementUsage calls (use tenantID and the instance count
1).
- Around line 122-140: The userData shell script embeds key.PublicKey directly
(within the params.SSHKeyID handling after s.sshKeySvc.GetKey), which breaks if
the key contains a single quote; sanitize or validate the key before embedding:
either validate key.PublicKey against a strict SSH public-key pattern (e.g.,
starts with "ssh-" and has no quotes) and return an error for invalid keys, or
escape single quotes in key.PublicKey before formatting into userData (e.g.,
replace ' with the safe shell-escaping sequence) so the generated script remains
syntactically correct; make the change where userData is built so all downstream
use of userData is safe.
- Around line 142-164: Add a reservation for the "instances" quota and fix
memory GB rounding: after checking quotas for "instances", call
s.tenantSvc.IncrementUsage(ctx, tenantID, "instances", 1) and ensure it is
rolled back on subsequent failures (mirror the existing vcpus/memory rollback
pattern); compute memory GB using integer-ceiling (e.g., (it.MemoryMB + 1023) /
1024) instead of plain it.MemoryMB/1024 when calling CheckQuota, IncrementUsage,
and DecrementUsage so small sizes (128, 512 MB) count as 1 GB; apply the same
ceiling logic in finalizeTermination where DecrementUsage is called for "memory"
and ensure finalizeTermination also decrements "instances" using
DecrementUsage(ctx, tenantID, "instances", 1).

In `@internal/repositories/libvirt/adapter.go`:
- Around line 218-226: The memMB calculation can become 0 when opts.MemoryLimit
is >0 but less than 1 MiB; update the logic around memMB in adapter.go to
compute bytesToMB := int(opts.MemoryLimit / 1024 / 1024) (or use integer
division result) and then ensure memMB = max(1, bytesToMB) when opts.MemoryLimit
> 0 (i.e., if the division yields 0 but opts.MemoryLimit > 0 set memMB to 1) so
that any non-zero MemoryLimit results in at least 1 MB being used; keep the
existing default of 512 when opts.MemoryLimit == 0 and reference memMB and
opts.MemoryLimit in your change.

In `@internal/repositories/libvirt/templates.go`:
- Around line 86-95: The loop over ports in the function that builds hostfwds
silently skips malformed entries and blindly xml-escapes port strings; update it
to validate and surface bad input: when iterating over ports (the for _, p :=
range ports loop) check that strings.Split(p, ":") yields exactly 2 parts and
that both parts parse as integers (use strconv.Atoi) — if parsing fails, either
log a warning (using the package logger) or return an error to the caller
instead of silently skipping; replace the xmlEscape usage for numeric ports with
the validated integer values (or document the contract clearly) and keep the
hostfwds append logic and hasNetworkMapping assignment only for validated
entries (reference xmlEscape, hostfwds, hasNetworkMapping in your changes).
- Around line 35-43: In generateDomainXML: define and populate interfaceXML (the
missing network <interface> block) and use escapedNetworkID when no network
mapping is provided so interfaceXML is not undefined and escapedNetworkID is
consumed; specifically, reintroduce the conditional that builds interfaceXML
when hasNetworkMapping is false (or when networkID==""), moving or keeping the
xmlEscape(networkID) into that block, and then pass interfaceXML into the
fmt.Sprintf call where it is currently referenced (or remove the unused
escapedNetworkID if you instead always build an interface from mapping). Ensure
the symbol names interfaceXML and escapedNetworkID in generateDomainXML match
their uses so compilation succeeds.
- Around line 105-108: The qemuArgsXML line interpolates the raw variable name
into a filesystem path causing a path-traversal risk; sanitize name before
embedding it by normalizing to a safe basename and restricting characters (e.g.,
use filepath.Base(name) to strip directories, remove null bytes, then validate
against an allowlist regex like ^[A-Za-z0-9_-]+$ and fall back to a safe
generated token if it fails), and then use that sanitized variable when
formatting the qemuArgsXML string (referencing the name variable and qemuArgsXML
construction).
🧹 Nitpick comments (12)
internal/core/services/cache.go (1)

180-192: Silent failures in logCacheCreation — pre-existing but worth noting.

Lines 181 and 187 discard errors from RecordEvent and auditSvc.Log via blank identifiers. This violates the coding guideline against silent failures. Since these lines are not part of the current diff, this is informational — consider addressing in a follow-up.

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

internal/repositories/libvirt/templates.go (4)

10-10: Package-level var violates the "no global variables" guideline.

var xmlEscape = html.EscapeString is a mutable package-level variable. Since it's just an alias, replace it with a constant-like function or call html.EscapeString directly at each call site.

Proposed fix
-var xmlEscape = html.EscapeString

Then replace all xmlEscape(...) calls with html.EscapeString(...).

As per coding guidelines, "Do not use global variables (e.g., var DB *sql.DB)".


59-61: Magic number 25 should be a named constant.

Proposed fix
+const MaxAdditionalDisks = 25 // vd[b-z]
+
 func generateDomainXML(...) string {
     ...
-		if i >= 25 { // Limit to vd[b-z]
+		if i >= MaxAdditionalDisks {
 			break
 		}

As per coding guidelines, "Do not use magic numbers - use named constants instead".


110-115: Fragile architecture auto-detection from path substrings.

Inferring aarch64 by checking if isoPath or diskPath contains "arm64" is brittle — a path like /images/farm64/ubuntu.qcow2 would falsely match. If the caller is already expected to pass arch, consider making this an explicit requirement rather than guessing from path contents.


35-35: Consider an options struct to reduce the 9-parameter signature.

generateDomainXML accepts 9 positional parameters, making call sites error-prone (easy to swap memoryMB/vcpu, or ports/additionalDisks). A DomainConfig struct would improve readability and extensibility.

internal/core/services/database.go (2)

31-51: Good adoption of the params struct pattern.

The DatabaseServiceParams struct correctly consolidates 6 dependencies, aligning with the coding guidelines. Consider adding nil-checks for required dependencies in NewDatabaseService to fail fast at construction time rather than panicking on first use.

Proposed nil-check
 func NewDatabaseService(params DatabaseServiceParams) *DatabaseService {
+	if params.Repo == nil || params.Compute == nil || params.Logger == nil {
+		panic("DatabaseService: required dependencies must not be nil")
+	}
 	return &DatabaseService{

Alternatively, return (*DatabaseService, error) to avoid panic, which better aligns with the "do not panic in production code" guideline.


91-101: Parse error from parseAllocatedPort is silently swallowed.

When parseAllocatedPort returns a non-nil error (e.g., a malformed port string), the error is discarded as err gets reassigned on line 93. Consider logging the parse error before falling back so operators can diagnose unexpected port formats.

Proposed fix
 	hostPort, err := s.parseAllocatedPort(allocatedPorts, defaultPort)
 	if err != nil || hostPort == 0 {
+		if err != nil {
+			s.logger.Warn("parseAllocatedPort failed, falling back to GetInstancePort", "error", err)
+		}
 		hostPort, err = s.compute.GetInstancePort(ctx, containerID, defaultPort)
internal/repositories/libvirt/adapter.go (1)

964-973: Misleading method name: configureIptables does not configure iptables.

This method only stores a port mapping and logs — it doesn't actually invoke iptables. The name will confuse future readers. It also redundantly re-records mappings already stored by recordPortMapping during LaunchInstanceWithOptions.

Suggested rename and simplification

Consider renaming to something like recordForwardingSetup or simply removing the re-recording and keeping only the log statement, since recordPortMapping already persists the mapping during launch.

internal/core/services/instance.go (1)

1022-1055: Consider adding audit logging and input validation for UpdateInstanceMetadata.

Other mutating operations (start, stop, terminate) log via s.auditSvc.Log. This method mutates instance state without an audit trail. Additionally, there's no validation on the size/count of metadata or labels entries.

internal/workers/healing_worker.go (1)

68-122: Consider adding a ListByStatus method to InstanceRepository for more efficient filtering.

The current implementation calls ListAll (line 69), which fetches every instance across the system in memory and then filters for StatusError instances. As your instance count grows, this becomes inefficient. Implement a ListByStatus(ctx context.Context, status string) method on InstanceRepository and use it instead of the in-memory filter pattern.

docs/swagger/swagger.yaml (2)

3700-3706: Metadata update request body schema is untyped — consumers cannot discover the expected shape.

The request body is declared as a bare additionalProperties: true object, which gives SDK generators and API consumers no guidance on what keys to send. Since the handler expects metadata and labels (both map[string]string), define a proper schema:

Suggested schema
       - description: Metadata and Labels
         in: body
         name: request
         required: true
         schema:
-          additionalProperties: true
-          type: object
+          properties:
+            metadata:
+              additionalProperties:
+                type: string
+              type: object
+            labels:
+              additionalProperties:
+                type: string
+              type: object
+          type: object

4554-4555: Nit: tag ssh_keys is inconsistent with URL path ssh-keys and other tags.

Other endpoint groups use hyphenated tags matching their URL paths (e.g., elastic-ips, security-groups). Consider renaming to ssh-keys for consistency.

Comment thread docs/swagger/docs.go
Comment on lines +2908 to +2971
"/instances/{id}/metadata": {
"put": {
"security": [
{
"APIKeyAuth": []
}
],
"description": "Updates the metadata and labels for a compute instance",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"instances"
],
"summary": "Update instance metadata",
"parameters": [
{
"type": "string",
"description": "Instance ID",
"name": "id",
"in": "path",
"required": true
},
{
"description": "Metadata and Labels",
"name": "request",
"in": "body",
"required": true,
"schema": {
"type": "object",
"additionalProperties": true
}
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Locate the handler and swagger annotations for the metadata update endpoint
rg -n "instances/{id}/metadata|UpdateInstanceMetadata|MetadataRequest" -g '*.go' internal/ --max-count 20

Repository: poyrazK/thecloud

Length of output: 215


🏁 Script executed:

# Search for PUT handler patterns and metadata-related handlers
rg -n "put.*metadata|metadata.*put" -g '*.go' -i internal/ --max-count 20

Repository: poyrazK/thecloud

Length of output: 936


🏁 Script executed:

# Find swagger annotations near metadata endpoint handlers
rg -B5 -A15 "swagger:route.*put|PUT.*metadata" -g '*.go' internal/ --max-count 30

Repository: poyrazK/thecloud

Length of output: 3830


🏁 Script executed:

# Get the actual handler implementation for UpdateMetadata
sed -n '379,411p' internal/handlers/instance_handler.go

Repository: poyrazK/thecloud

Length of output: 1118


🏁 Script executed:

# Check the request struct if one is defined
rg -B10 "func.*UpdateMetadata" -g '*.go' internal/handlers/ | head -40

Repository: poyrazK/thecloud

Length of output: 1755


🏁 Script executed:

# Verify the generated swagger schema is indeed from this annotation
sed -n '2900,2945p' docs/swagger/docs.go

Repository: poyrazK/thecloud

Length of output: 1644


🏁 Script executed:

# Check if there are any request struct definitions in handlers
rg -B2 "json:\"metadata\"" -g '*.go' internal/handlers/ | head -20

Repository: poyrazK/thecloud

Length of output: 239


Update the swagger annotation to reflect the actual request structure.

The handler expects a request body with two specific fields: metadata and labels (both map[string]string). However, the swagger annotation at internal/handlers/instance_handler.go:384 declares @Param request body map[string]interface{}, which generates an incorrect free-form schema. This misleads clients about the actual request format. Update the annotation to match the handler's struct definition—either by defining a named request type or by annotating the fields explicitly so the generated schema shows metadata and labels as required properties.

🤖 Prompt for AI Agents
In `@docs/swagger/docs.go` around lines 2908 - 2971, The swagger annotation
currently declares a free-form body but the handler in
internal/handlers/instance_handler.go expects a structured request with two
fields (metadata and labels both map[string]string); fix this by creating a
named request type (e.g., UpdateInstanceMetadataRequest with Metadata and Labels
map[string]string) or by updating the `@Param` annotation for the handler at
internal/handlers/instance_handler.go:384 to reference those explicit fields
(required as appropriate) instead of map[string]interface{} so the generated
docs show metadata and labels as properties rather than a free-form object.

Comment thread docs/swagger/docs.go
Comment on lines +4240 to +4393
"/ssh-keys": {
"get": {
"produces": [
"application/json"
],
"tags": [
"ssh_keys"
],
"summary": "List SSH Keys",
"responses": {
"200": {
"description": "OK",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/domain.SSHKey"
}
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
},
"post": {
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"ssh_keys"
],
"summary": "Create SSH Key",
"parameters": [
{
"description": "SSH Key details",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/httphandlers.CreateSSHKeyRequest"
}
}
],
"responses": {
"201": {
"description": "Created",
"schema": {
"$ref": "#/definitions/domain.SSHKey"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
},
"/ssh-keys/{id}": {
"get": {
"produces": [
"application/json"
],
"tags": [
"ssh_keys"
],
"summary": "Get SSH Key",
"parameters": [
{
"type": "string",
"description": "SSH Key ID",
"name": "id",
"in": "path",
"required": true
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/domain.SSHKey"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
},
"delete": {
"tags": [
"ssh_keys"
],
"summary": "Delete SSH Key",
"parameters": [
{
"type": "string",
"description": "SSH Key ID",
"name": "id",
"in": "path",
"required": true
}
],
"responses": {
"204": {
"description": "No Content"
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the structure and find SSH key related files
find . -type f -name "*.go" | grep -E "(ssh|key)" | head -20

Repository: poyrazK/thecloud

Length of output: 496


🏁 Script executed:

# Search for SSH key handlers and swagger annotations
rg -n "ssh-keys|SSHKey" -g '*.go' --max-count 20

Repository: poyrazK/thecloud

Length of output: 14094


🏁 Script executed:

# Check if docs.go is auto-generated and look for the generation source
head -20 docs/swagger/docs.go

Repository: poyrazK/thecloud

Length of output: 573


🏁 Script executed:

# Search for API key security patterns and how security is defined in swagger
rg -n "APIKeyAuth|@Security|swag" -g '*.go' | head -30

Repository: poyrazK/thecloud

Length of output: 1993


🏁 Script executed:

cat -n internal/handlers/ssh_key_handler.go

Repository: poyrazK/thecloud

Length of output: 3940


🏁 Script executed:

# Check the router setup to see how SSH key routes are protected
sed -n '244,250p' internal/api/setup/router.go

Repository: poyrazK/thecloud

Length of output: 660


🏁 Script executed:

# Look for the httputil.Permission middleware to understand authentication
rg -A 5 "func Permission" -g '*.go'

Repository: poyrazK/thecloud

Length of output: 440


Add @Security annotations to SSH key handler methods.

The /ssh-keys and /ssh-keys/{id} endpoints are protected at runtime (authentication and RBAC checks are enforced via middleware), but the Swagger annotations lack @Security APIKeyAuth directives. Add the annotation to each handler method in internal/handlers/ssh_key_handler.go and regenerate the docs:

// `@Security` APIKeyAuth

Include this comment above each method (Create, List, Get, Delete) so clients see the security requirement in the generated Swagger spec, consistent with other protected endpoints.

🤖 Prompt for AI Agents
In `@docs/swagger/docs.go` around lines 4240 - 4393, Add the Swagger security
annotation to the SSH key handlers so the generated spec shows APIKey auth:
insert the comment line "// `@Security` APIKeyAuth" immediately above each
exported handler method in internal/handlers/ssh_key_handler.go (the Create,
List, Get, and Delete methods) and then regenerate the docs; ensure the
annotation is placed directly above each method declaration so Swagger picks it
up.

Comment thread docs/swagger/swagger.json
Comment on lines +2900 to +2963
"/instances/{id}/metadata": {
"put": {
"security": [
{
"APIKeyAuth": []
}
],
"description": "Updates the metadata and labels for a compute instance",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"instances"
],
"summary": "Update instance metadata",
"parameters": [
{
"type": "string",
"description": "Instance ID",
"name": "id",
"in": "path",
"required": true
},
{
"description": "Metadata and Labels",
"name": "request",
"in": "body",
"required": true,
"schema": {
"type": "object",
"additionalProperties": true
}
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Define a concrete request schema for metadata/labels.

The request body is declared as a free-form object, which prevents generated SDKs from exposing the metadata/labels shape and risks clients sending incompatible payloads. Consider modeling the expected structure explicitly.

✅ Proposed schema refinement
-                        "schema": {
-                            "type": "object",
-                            "additionalProperties": true
-                        }
+                        "schema": {
+                            "type": "object",
+                            "properties": {
+                                "metadata": {
+                                    "type": "object",
+                                    "additionalProperties": {
+                                        "type": "string"
+                                    }
+                                },
+                                "labels": {
+                                    "type": "object",
+                                    "additionalProperties": {
+                                        "type": "string"
+                                    }
+                                }
+                            }
+                        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"/instances/{id}/metadata": {
"put": {
"security": [
{
"APIKeyAuth": []
}
],
"description": "Updates the metadata and labels for a compute instance",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"instances"
],
"summary": "Update instance metadata",
"parameters": [
{
"type": "string",
"description": "Instance ID",
"name": "id",
"in": "path",
"required": true
},
{
"description": "Metadata and Labels",
"name": "request",
"in": "body",
"required": true,
"schema": {
"type": "object",
"additionalProperties": true
}
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
"/instances/{id}/metadata": {
"put": {
"security": [
{
"APIKeyAuth": []
}
],
"description": "Updates the metadata and labels for a compute instance",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"instances"
],
"summary": "Update instance metadata",
"parameters": [
{
"type": "string",
"description": "Instance ID",
"name": "id",
"in": "path",
"required": true
},
{
"description": "Metadata and Labels",
"name": "request",
"in": "body",
"required": true,
"schema": {
"type": "object",
"properties": {
"metadata": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"labels": {
"type": "object",
"additionalProperties": {
"type": "string"
}
}
}
}
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"404": {
"description": "Not Found",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
🤖 Prompt for AI Agents
In `@docs/swagger/swagger.json` around lines 2900 - 2963, The PUT operation for
"/instances/{id}/metadata" currently uses a free-form body schema (parameter
name "request") which prevents SDKs from exposing the expected metadata/labels
shape; replace the loose schema in the "put" operation with a concrete model
(e.g., a new definition like "InstanceMetadataRequest") that explicitly defines
properties such as "metadata" (object<string,string>), "labels"
(object<string,string>), and any required fields, update the parameter's
"schema" to $ref "#/definitions/InstanceMetadataRequest", and add the
corresponding definition under "definitions" so generated clients and validation
will reflect the exact payload shape.

Comment thread docs/swagger/swagger.json
Comment on lines +4232 to +4302
"/ssh-keys": {
"get": {
"produces": [
"application/json"
],
"tags": [
"ssh_keys"
],
"summary": "List SSH Keys",
"responses": {
"200": {
"description": "OK",
"schema": {
"type": "array",
"items": {
"$ref": "#/definitions/domain.SSHKey"
}
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
},
"post": {
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"ssh_keys"
],
"summary": "Create SSH Key",
"parameters": [
{
"description": "SSH Key details",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/httphandlers.CreateSSHKeyRequest"
}
}
],
"responses": {
"201": {
"description": "Created",
"schema": {
"$ref": "#/definitions/domain.SSHKey"
}
},
"400": {
"description": "Bad Request",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
},
"500": {
"description": "Internal Server Error",
"schema": {
"$ref": "#/definitions/httputil.Response"
}
}
}
}
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add API key security to SSH key collection endpoints.

These endpoints likely require authentication like other user-scoped resources, but the spec omits APIKeyAuth, which can mislead clients and SDKs.

🔒 Suggested addition
             "get": {
+                "security": [
+                    {
+                        "APIKeyAuth": []
+                    }
+                ],
                 "produces": [
                     "application/json"
                 ],
@@
             "post": {
+                "security": [
+                    {
+                        "APIKeyAuth": []
+                    }
+                ],
                 "consumes": [
                     "application/json"
                 ],
🤖 Prompt for AI Agents
In `@docs/swagger/swagger.json` around lines 4232 - 4302, The OpenAPI spec for the
"/ssh-keys" path is missing the API key security requirement for both the "get"
and "post" operations; update the "/ssh-keys" operations (the "get" and "post"
objects) to include a "security" array that references the existing APIKeyAuth
security scheme (e.g., add "security": [{"APIKeyAuth": []}]) so clients and SDK
generators know these endpoints require API key authentication; locate the
"/ssh-keys" path block and add the security entry to both the get and post
operation objects (ensuring it matches the existing
securityDefinitions/securitySchemes name used elsewhere).

Comment thread docs/swagger/swagger.yaml
Comment on lines +4538 to +4638
/ssh-keys:
get:
produces:
- application/json
responses:
"200":
description: OK
schema:
items:
$ref: '#/definitions/domain.SSHKey'
type: array
"500":
description: Internal Server Error
schema:
$ref: '#/definitions/httputil.Response'
summary: List SSH Keys
tags:
- ssh_keys
post:
consumes:
- application/json
parameters:
- description: SSH Key details
in: body
name: request
required: true
schema:
$ref: '#/definitions/httphandlers.CreateSSHKeyRequest'
produces:
- application/json
responses:
"201":
description: Created
schema:
$ref: '#/definitions/domain.SSHKey'
"400":
description: Bad Request
schema:
$ref: '#/definitions/httputil.Response'
"500":
description: Internal Server Error
schema:
$ref: '#/definitions/httputil.Response'
summary: Create SSH Key
tags:
- ssh_keys
/ssh-keys/{id}:
delete:
parameters:
- description: SSH Key ID
in: path
name: id
required: true
type: string
responses:
"204":
description: No Content
"400":
description: Bad Request
schema:
$ref: '#/definitions/httputil.Response'
"404":
description: Not Found
schema:
$ref: '#/definitions/httputil.Response'
"500":
description: Internal Server Error
schema:
$ref: '#/definitions/httputil.Response'
summary: Delete SSH Key
tags:
- ssh_keys
get:
parameters:
- description: SSH Key ID
in: path
name: id
required: true
type: string
produces:
- application/json
responses:
"200":
description: OK
schema:
$ref: '#/definitions/domain.SSHKey'
"400":
description: Bad Request
schema:
$ref: '#/definitions/httputil.Response'
"404":
description: Not Found
schema:
$ref: '#/definitions/httputil.Response'
"500":
description: Internal Server Error
schema:
$ref: '#/definitions/httputil.Response'
summary: Get SSH Key
tags:
- ssh_keys
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

All SSH key endpoints are missing security declarations.

None of the four SSH key operations (list, create, get, delete) include a security section with APIKeyAuth. Every other tenant-scoped endpoint in this spec requires APIKeyAuth. Without this, code generators and documentation tools will advertise these endpoints as unauthenticated, and gateway/middleware relying on the spec for enforcement may skip auth checks.

Add the security block to each operation, e.g.:

Proposed fix (showing list endpoint as example; apply to all four)
   /ssh-keys:
     get:
       produces:
       - application/json
       responses:
         "200":
           description: OK
           schema:
             items:
               $ref: '#/definitions/domain.SSHKey'
             type: array
         "500":
           description: Internal Server Error
           schema:
             $ref: '#/definitions/httputil.Response'
+      security:
+      - APIKeyAuth: []
       summary: List SSH Keys
       tags:
       - ssh_keys
     post:
       consumes:
       - application/json
       # ... same for POST, GET {id}, DELETE {id}
+      security:
+      - APIKeyAuth: []
🤖 Prompt for AI Agents
In `@docs/swagger/swagger.yaml` around lines 4538 - 4638, The four SSH key
operations under the /ssh-keys and /ssh-keys/{id} paths (the get/list,
post/create, get/Get SSH Key, and delete/Delete SSH Key operations) are missing
the operation-level security declaration; add a security section to each
operation with APIKeyAuth (e.g., include a security: - APIKeyAuth entry for the
list, create, get, and delete operations) so the spec marks them as
authenticated like other tenant-scoped endpoints.

Comment on lines 192 to 197
if err := s.repo.Create(ctx, inst); err != nil {
// Rollback quota reservation
_ = s.tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", it.VCPUs)
_ = s.tenantSvc.DecrementUsage(ctx, tenantID, "memory", it.MemoryMB/1024)
return nil, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Rollback path also needs to decrement "instances" usage (once the increment bug is fixed).

After adding IncrementUsage for "instances" (see earlier comment), ensure this rollback block also decrements it:

 	if err := s.repo.Create(ctx, inst); err != nil {
 		// Rollback quota reservation
+		_ = s.tenantSvc.DecrementUsage(ctx, tenantID, "instances", 1)
 		_ = s.tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", it.VCPUs)
 		_ = s.tenantSvc.DecrementUsage(ctx, tenantID, "memory", it.MemoryMB/1024)
 		return nil, err
 	}
🤖 Prompt for AI Agents
In `@internal/core/services/instance.go` around lines 192 - 197, The rollback
after s.repo.Create in the Create flow currently decrements "vcpus" and "memory"
but must also decrement the "instances" quota now that IncrementUsage for
"instances" has been added; update the rollback in the block that handles the
Create error (the same block calling s.repo.Create) to call
s.tenantSvc.DecrementUsage(ctx, tenantID, "instances", 1) alongside the existing
DecrementUsage calls (use tenantID and the instance count 1).

Comment on lines +218 to +226
// Use user-specified limits or defaults
memMB := 512
if opts.MemoryLimit > 0 {
memMB = int(opts.MemoryLimit / 1024 / 1024)
}
vcpu := 1
if opts.CPULimit > 0 {
vcpu = int(opts.CPULimit)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Edge case: memMB can be zero if MemoryLimit is positive but less than 1 MiB.

When opts.MemoryLimit is between 1 and 1,048,575 bytes, the integer division opts.MemoryLimit / 1024 / 1024 truncates to 0, bypassing the default.

Proposed fix
 	memMB := 512
 	if opts.MemoryLimit > 0 {
 		memMB = int(opts.MemoryLimit / 1024 / 1024)
+		if memMB <= 0 {
+			memMB = 512
+		}
 	}
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/adapter.go` around lines 218 - 226, The memMB
calculation can become 0 when opts.MemoryLimit is >0 but less than 1 MiB; update
the logic around memMB in adapter.go to compute bytesToMB :=
int(opts.MemoryLimit / 1024 / 1024) (or use integer division result) and then
ensure memMB = max(1, bytesToMB) when opts.MemoryLimit > 0 (i.e., if the
division yields 0 but opts.MemoryLimit > 0 set memMB to 1) so that any non-zero
MemoryLimit results in at least 1 MB being used; keep the existing default of
512 when opts.MemoryLimit == 0 and reference memMB and opts.MemoryLimit in your
change.

Comment on lines +35 to 43
func generateDomainXML(name, diskPath, networkID, isoPath string, memoryMB, vcpu int, additionalDisks []string, ports []string, arch string) string {
escapedName := xmlEscape(name)
escapedDiskPath := xmlEscape(diskPath)
escapedNetworkID := xmlEscape(networkID)
escapedIsoPath := xmlEscape(isoPath)

if networkID == "" {
networkID = "default"
escapedNetworkID = "default"
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Build-breaking: interfaceXML is undefined and escapedNetworkID is unused.

The code references interfaceXML on line 150 in the Sprintf call, but this variable is never declared anywhere in the function. Similarly, escapedNetworkID (line 38) is assigned but never consumed — it was likely intended to be used when constructing the missing interface/network XML block. Both issues are confirmed by static analysis and will prevent compilation.

It appears the conditional network <interface> block (rendered when hasNetworkMapping is false) was dropped during refactoring. You need to define interfaceXML — something like:

Proposed fix (illustrative)
+	interfaceXML := ""
+	if !hasNetworkMapping {
+		interfaceXML = fmt.Sprintf(`
+    <interface type='network'>
+      <source network='%s'/>
+      <model type='virtio'/>
+    </interface>`, escapedNetworkID)
+	}
+
 	if arch == "" {

And remove the now-redundant standalone escape if you keep the conditional:

-	escapedNetworkID := xmlEscape(networkID)

(Move the escape into the block where it's actually used, or keep it above — as long as it's consumed.)

Also applies to: 148-150

🧰 Tools
🪛 GitHub Check: Enforce Coverage

[failure] 38-38:
declared and not used: escapedNetworkID

🪛 GitHub Check: lint

[failure] 38-38:
declared and not used: escapedNetworkID

🪛 GitHub Check: test-libvirt

[failure] 38-38:
declared and not used: escapedNetworkID

🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` around lines 35 - 43, In
generateDomainXML: define and populate interfaceXML (the missing network
<interface> block) and use escapedNetworkID when no network mapping is provided
so interfaceXML is not undefined and escapedNetworkID is consumed; specifically,
reintroduce the conditional that builds interfaceXML when hasNetworkMapping is
false (or when networkID==""), moving or keeping the xmlEscape(networkID) into
that block, and then pass interfaceXML into the fmt.Sprintf call where it is
currently referenced (or remove the unused escapedNetworkID if you instead
always build an interface from mapping). Ensure the symbol names interfaceXML
and escapedNetworkID in generateDomainXML match their uses so compilation
succeeds.

Comment on lines +86 to +95
for _, p := range ports {
parts := strings.Split(p, ":")
if len(parts) == 2 {
// Validation is expected at the caller (adapter)
hPort := xmlEscape(parts[0])
cPort := xmlEscape(parts[1])
hostfwds = append(hostfwds, fmt.Sprintf("hostfwd=tcp::%s-:%s", hPort, cPort))
hasNetworkMapping = true
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Silently ignoring malformed port entries and XML-escaping numeric values.

  1. Entries where len(parts) != 2 are silently skipped. Per guidelines, avoid silent failures — at minimum log a warning or return an error.
  2. xmlEscape on port strings is a no-op for valid numeric ports and misleading; if ports aren't numeric, you have a bigger problem. Consider validating (e.g., strconv.Atoi) or documenting the contract more explicitly.
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` around lines 86 - 95, The loop
over ports in the function that builds hostfwds silently skips malformed entries
and blindly xml-escapes port strings; update it to validate and surface bad
input: when iterating over ports (the for _, p := range ports loop) check that
strings.Split(p, ":") yields exactly 2 parts and that both parts parse as
integers (use strconv.Atoi) — if parsing fails, either log a warning (using the
package logger) or return an error to the caller instead of silently skipping;
replace the xmlEscape usage for numeric ports with the validated integer values
(or document the contract clearly) and keep the hostfwds append logic and
hasNetworkMapping assignment only for validated entries (reference xmlEscape,
hostfwds, hasNetworkMapping in your changes).

Comment on lines +105 to +108
// Use the unescaped name for the log file path to avoid malformed filenames
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-serial'/>
<qemu:arg value='file:/tmp/%s-console.log'/>`, name)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Path traversal risk: unescaped name is interpolated into a file path.

Using the raw name in file:/tmp/%s-console.log avoids the HTML-encoding problem flagged previously, but it introduces a path-traversal vector. If name contains ../ (e.g., ../../etc/cron.d/evil), the serial output could be written to an arbitrary location. The comment on line 89 notes "validation is expected at the caller," but defense-in-depth should apply here too.

Sanitize name for filesystem use (strip /, .., null bytes, etc.) before embedding it in the path, e.g.:

Proposed fix
-	// Use the unescaped name for the log file path to avoid malformed filenames
-	qemuArgsXML += fmt.Sprintf(`
+	safeName := strings.ReplaceAll(name, "/", "_")
+	safeName = strings.ReplaceAll(safeName, "..", "_")
+	qemuArgsXML += fmt.Sprintf(`
     <qemu:arg value='-serial'/>
-    <qemu:arg value='file:/tmp/%s-console.log'/>`, name)
+    <qemu:arg value='file:/tmp/%s-console.log'/>`, safeName)

A stricter approach would be to use filepath.Base or a regex allowing only [a-zA-Z0-9_-].

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Use the unescaped name for the log file path to avoid malformed filenames
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-serial'/>
<qemu:arg value='file:/tmp/%s-console.log'/>`, name)
safeName := strings.ReplaceAll(name, "/", "_")
safeName = strings.ReplaceAll(safeName, "..", "_")
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-serial'/>
<qemu:arg value='file:/tmp/%s-console.log'/>`, safeName)
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` around lines 105 - 108, The
qemuArgsXML line interpolates the raw variable name into a filesystem path
causing a path-traversal risk; sanitize name before embedding it by normalizing
to a safe basename and restricting characters (e.g., use filepath.Base(name) to
strip directories, remove null bytes, then validate against an allowlist regex
like ^[A-Za-z0-9_-]+$ and fall back to a safe generated token if it fails), and
then use that sanitized variable when formatting the qemuArgsXML string
(referencing the name variable and qemuArgsXML construction).

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/core/services/cache.go (1)

37-53: 🛠️ Refactor suggestion | 🟠 Major

Constructor has 6 dependencies — use a params struct.

NewCacheService accepts 6 parameters, exceeding the threshold of 3 specified in coding guidelines. Extract them into a CacheServiceParams struct.

♻️ Proposed refactor
+type CacheServiceParams struct {
+	Repo     ports.CacheRepository
+	Compute  ports.ComputeBackend
+	VpcRepo  ports.VpcRepository
+	EventSvc ports.EventService
+	AuditSvc ports.AuditService
+	Logger   *slog.Logger
+}
+
-func NewCacheService(
-	repo ports.CacheRepository,
-	compute ports.ComputeBackend,
-	vpcRepo ports.VpcRepository,
-	eventSvc ports.EventService,
-	auditSvc ports.AuditService,
-	logger *slog.Logger,
-) *CacheService {
+func NewCacheService(p CacheServiceParams) *CacheService {
 	return &CacheService{
-		repo:     repo,
-		compute:  compute,
-		vpcRepo:  vpcRepo,
-		eventSvc: eventSvc,
-		auditSvc: auditSvc,
-		logger:   logger,
+		repo:     p.Repo,
+		compute:  p.Compute,
+		vpcRepo:  p.VpcRepo,
+		eventSvc: p.EventSvc,
+		auditSvc: p.AuditSvc,
+		logger:   p.Logger,
 	}
 }

As per coding guidelines, "Use a params struct for service constructors with 3 or more dependencies".

🤖 Fix all issues with AI agents
In `@cmd/api/main.go`:
- Around line 182-189: The code currently ignores ParseDuration errors for
DB_INIT_TIMEOUT; update the block around defaultDBInitTimeout/DB_INIT_TIMEOUT so
that if time.ParseDuration(os.Getenv("DB_INIT_TIMEOUT")) returns an error you
log a warning including the invalid value and the error (e.g., using the package
logger or log.Printf) and then fall back to the default timeout; keep the
existing behavior of using timeout when parse succeeds and creating ctx, cancel
via context.WithTimeout as before, but do not silently swallow the parse
error—include the env value and err in the warning message to aid diagnostics.
- Around line 289-294: The startWorker function's nil check (if w != nil) is
unsafe because typed nil pointers (e.g., a nil *services.LBWorker wrapped as
runner) pass the check and cause a panic when Run is invoked; move the nil
validation to the concrete call sites (the caller(s) that iterate/dispatch
Workers fields—e.g., runWorkers / runApplication where the Workers struct is
read) so they skip any nil concrete pointers before calling startWorker, and
then remove the in-function guard in startWorker (keep startWorker as a simple
caller of w.Run(ctx, wg) assuming the caller ensures non-nil concrete
implementations). Ensure you reference and check each pointer field on the
Workers struct before passing it into startWorker.

In `@internal/repositories/libvirt/adapter.go`:
- Around line 182-216: LaunchInstanceWithOptions currently allocates host ports
into allocatedPorts and records them via recordPortMapping, but
setupPortForwarding (via parseAndValidatePort and configureIptables) re-runs
allocation and overwrites mappings; change setupPortForwarding so it accepts the
already-resolved allocatedPorts (or the same opts.Ports) and only creates
iptables rules without calling parseAndValidatePort/findFreePort or calling
recordPortMapping again. Concretely, update LaunchInstanceWithOptions/where the
goroutine is spawned to pass the resolved allocatedPorts into
setupPortForwarding(…) and remove the path in
parseAndValidatePort/configureIptables that re-allocates or rewrites mappings;
ensure you avoid concurrent writes to the portMappings map (use the existing
recordPortMapping done earlier or protect access) so the goroutine only reads
the final resolved mappings when setting rules.
- Around line 695-728: In prepareRootVolume, when StorageVolLookupByName finds
backingVol but StorageVolGetPath returns an error, currently backingXML stays
empty and the code silently falls back to creating an empty volume; update the
block that calls StorageVolGetPath (after StorageVolLookupByName) to detect the
error and call a.logger.Warn (or a.logger.Error) with a clear message that
includes the backing image name and the error (e.g., "failed to get backing
volume path", "image", imageName, "err", err), so the failure is visible in logs
while preserving the existing fallback behavior; reference prepareRootVolume,
StorageVolLookupByName, StorageVolGetPath, backingXML, and a.logger.Warn when
making the change.
- Around line 86-150: The system-vs-session dial fallback in NewLibvirtAdapter
discards the session error (err2) when both net.DialTimeout calls fail; update
the failure return to include both errors (e.g., use errors.Join(err, err2) or
format both err and err2 into the returned error) so callers see both
diagnostics; modify the branch that currently returns fmt.Errorf("failed to dial
libvirt (system and session): %w", err) to instead return an error that contains
err and err2 (and add the errors import if using errors.Join).

In `@internal/repositories/libvirt/templates.go`:
- Line 10: Remove the global variable xmlEscape and stop using
html.EscapeString; instead inject an XML escaper (use encoding/xml.EscapeText)
via constructor or factory for the template-related type or function that
currently relies on xmlEscape, replace all references to xmlEscape with the
injected escaper function, and ensure the injected escaper signature matches
func([]byte) ([]byte, error) or wrap EscapeText to a func(string) string if
callers expect that shape; update the constructor/factory to accept the escaper
and pass it through to methods that render XML templates.

In `@internal/workers/healing_worker_test.go`:
- Line 13: The import "github.com/stretchr/testify/assert" in
internal/workers/healing_worker_test.go is unused and will fail compilation;
either remove that import line or use the assert package in the test functions
(e.g., replace plain t.Fatalf/t.Error checks with assert.* calls) so the symbol
from "github.com/stretchr/testify/assert" is referenced; update the import list
accordingly so the file compiles.
🧹 Nitpick comments (9)
internal/workers/healing_worker_test.go (1)

17-128: Hand-rolled mocks are correct but verbose.

Consider using a mock generator like mockery to auto-generate these from the ports.InstanceRepository and ports.InstanceService interfaces. This avoids maintaining them by hand when the interfaces change.

docs/swagger/swagger.yaml (1)

3689-3730: Metadata update endpoint: request body schema is too loosely typed.

The PUT /instances/{id}/metadata request body is defined as a generic additionalProperties: true object with no structure. Since the handler presumably expects specific keys like metadata and labels (both map[string]string), consider defining a named request type (e.g., httphandlers.UpdateMetadataRequest) with explicit properties. This would give API consumers clear guidance and enable proper client code generation.

internal/core/services/database.go (1)

91-101: Consider logging the parseAllocatedPort error before falling back.

When parseAllocatedPort returns an error (e.g., malformed port string), that error is silently consumed by the fallback condition on line 92. A debug/warn log here would help operators diagnose port-mapping configuration issues.

Proposed improvement
 	hostPort, err := s.parseAllocatedPort(allocatedPorts, defaultPort)
 	if err != nil || hostPort == 0 {
+		if err != nil {
+			s.logger.Warn("parseAllocatedPort failed, falling back to GetInstancePort", "error", err)
+		}
 		hostPort, err = s.compute.GetInstancePort(ctx, containerID, defaultPort)
internal/core/services/cache.go (2)

124-138: parseAllocatedPort returns (0, nil) on no match — ambiguous success.

When no matching port is found, returning (0, nil) is indistinguishable from a parsed port of value 0. The caller at line 95 handles this with port == 0, but this conflates "not found" with "parse error returning 0". Consider returning a sentinel error for the no-match case to make intent explicit.


180-192: Silent failures on event and audit logging.

Lines 181 and 187 discard errors from RecordEvent and Log. While these are non-critical side effects, the coding guidelines explicitly prohibit blank identifier assignment. At minimum, log the errors.

♻️ Proposed fix
-	_ = s.eventSvc.RecordEvent(ctx, "CACHE_CREATE", cache.ID.String(), "CACHE", map[string]interface{}{
+	if err := s.eventSvc.RecordEvent(ctx, "CACHE_CREATE", cache.ID.String(), "CACHE", map[string]interface{}{
 		"name":    cache.Name,
 		"version": cache.Version,
 		"memory":  cache.MemoryMB,
-	})
+	}); err != nil {
+		s.logger.Warn("failed to record cache create event", "id", cache.ID, "error", err)
+	}

-	_ = s.auditSvc.Log(ctx, cache.UserID, "cache.create", "cache", cache.ID.String(), map[string]interface{}{
+	if err := s.auditSvc.Log(ctx, cache.UserID, "cache.create", "cache", cache.ID.String(), map[string]interface{}{
 		"name": originalName,
-	})
+	}); err != nil {
+		s.logger.Warn("failed to log cache create audit", "id", cache.ID, "error", err)
+	}

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

internal/repositories/libvirt/templates.go (1)

59-61: Magic number 25 for disk limit.

Extract to a named constant for clarity.

+const maxAdditionalDisks = 25 // vd[b-z]
+
 	for i, dPath := range additionalDisks {
-		if i >= 25 { // Limit to vd[b-z]
+		if i >= maxAdditionalDisks {
 			break
 		}

As per coding guidelines, "Do not use magic numbers - use named constants instead".

internal/repositories/libvirt/adapter.go (3)

234-239: Silent failures when cleaning up on domain start failure.

Lines 236–237 discard errors from DomainUndefine and StorageVolDelete. If cleanup fails, the system is left with orphaned resources and no log trail.

♻️ Proposed fix
-		_ = a.client.DomainUndefine(ctx, dom)
-		_ = a.client.StorageVolDelete(ctx, vol, 0)
+		if err := a.client.DomainUndefine(ctx, dom); err != nil {
+			a.logger.Warn("failed to undefine domain during cleanup", "name", name, "error", err)
+		}
+		if err := a.client.StorageVolDelete(ctx, vol, 0); err != nil {
+			a.logger.Warn("failed to delete volume during cleanup", "name", name, "error", err)
+		}

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".


1051-1063: TOCTOU in findFreePort: port may be claimed between close and use.

The classic ephemeral-port race: the listener is closed before the port is actually used by QEMU/iptables, so another process could bind to it in between. This is inherent to the approach and difficult to eliminate entirely, but worth documenting. Consider adding a brief comment noting this limitation.


418-424: json.Marshal error silently discarded.

Line 418 uses statJSON, _ := json.Marshal(...). While marshalling a map[string]uint64 is extremely unlikely to fail, the guidelines prohibit blank identifier assignment.

♻️ Proposed fix
-	statJSON, _ := json.Marshal(map[string]interface{}{
+	statJSON, err := json.Marshal(map[string]interface{}{
 		"memory_stats": map[string]uint64{
 			"usage": usage,
 			"limit": limit,
 		},
 	})
+	if err != nil {
+		return nil, fmt.Errorf("failed to marshal stats: %w", err)
+	}
 	return io.NopCloser(bytes.NewReader(statJSON)), nil

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

Comment thread cmd/api/main.go
Comment on lines +182 to +189
timeout := defaultDBInitTimeout
if os.Getenv("DB_INIT_TIMEOUT") != "" {
if t, err := time.ParseDuration(os.Getenv("DB_INIT_TIMEOUT")); err == nil {
timeout = t
}
}

ctx, cancel := context.WithTimeout(context.Background(), timeout)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unparsable DB_INIT_TIMEOUT is silently ignored — log a warning.

If an operator sets DB_INIT_TIMEOUT to an invalid value (e.g., "10" without a unit), the error is swallowed and the default 120 s is used with no indication. This makes misconfiguration hard to diagnose.

Proposed fix
 	timeout := defaultDBInitTimeout
 	if os.Getenv("DB_INIT_TIMEOUT") != "" {
 		if t, err := time.ParseDuration(os.Getenv("DB_INIT_TIMEOUT")); err == nil {
 			timeout = t
+		} else {
+			logger.Warn("invalid DB_INIT_TIMEOUT, using default", "value", os.Getenv("DB_INIT_TIMEOUT"), "default", defaultDBInitTimeout)
 		}
 	}

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()". While this isn't a blank-identifier case, the spirit is the same: errors should not be discarded silently.

🤖 Prompt for AI Agents
In `@cmd/api/main.go` around lines 182 - 189, The code currently ignores
ParseDuration errors for DB_INIT_TIMEOUT; update the block around
defaultDBInitTimeout/DB_INIT_TIMEOUT so that if
time.ParseDuration(os.Getenv("DB_INIT_TIMEOUT")) returns an error you log a
warning including the invalid value and the error (e.g., using the package
logger or log.Printf) and then fall back to the default timeout; keep the
existing behavior of using timeout when parse succeeds and creating ctx, cancel
via context.WithTimeout as before, but do not silently swallow the parse
error—include the env value and err in the warning message to aid diagnostics.

Comment thread cmd/api/main.go
Comment on lines +289 to 294
func startWorker(ctx context.Context, wg *sync.WaitGroup, w runner) {
if w != nil {
wg.Add(1)
go workers.ClusterReconciler.Run(ctx, wg)
go w.Run(ctx, wg)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the startWorker function and surrounding context in cmd/api/main.go
cat -n cmd/api/main.go | sed -n '280,310p'

Repository: poyrazK/thecloud

Length of output: 1047


🏁 Script executed:

# Let's also find the Workers struct definition and how it's initialized
rg -n --type=go 'type Workers struct' -A 20

Repository: poyrazK/thecloud

Length of output: 1618


🏁 Script executed:

# Find the runWorkers function to see the call pattern
rg -n --type=go 'func runWorkers' -A 20

Repository: poyrazK/thecloud

Length of output: 855


🏁 Script executed:

# Find the runner interface definition
rg -n --type=go 'type runner interface' -A 5

Repository: poyrazK/thecloud

Length of output: 330


🏁 Script executed:

# Check how Workers is populated/initialized in setup or initialization code
rg -n --type=go 'Workers\s*:?=\s*&?setup\.Workers|Workers\s*\{' -B 2 -A 15 | head -100

Repository: poyrazK/thecloud

Length of output: 2552


🏁 Script executed:

# Find where runWorkers is called
rg -n --type=go 'runWorkers' -B 2 -A 2

Repository: poyrazK/thecloud

Length of output: 507


🏁 Script executed:

# Find the runApplication function to see the control flow
rg -n --type=go 'func runApplication' -A 30

Repository: poyrazK/thecloud

Length of output: 1516


Nil-interface pitfall: w != nil does not catch typed nil pointers — will panic.

When a nil concrete pointer (e.g., a nil *services.LBWorker) is passed to the runner interface parameter, Go wraps it in an interface value that carries type information, so w != nil evaluates to true. Calling w.Run(...) then dereferences a nil receiver and panics.

The Workers struct initializes all fields as pointers, and in the test (main_test.go:157) it is created as empty &setup.Workers{}. When runApplication is called with empty workers and role defaults to "all", it invokes runWorkers which then passes all nil worker pointers to startWorker. The nil check at line 290 fails to catch this, triggering a panic.

Move the nil checks to the caller:

Proposed fix — guard at the concrete-type call site
 func runWorkers(ctx context.Context, wg *sync.WaitGroup, workers *setup.Workers) {
-	startWorker(ctx, wg, workers.LB)
-	startWorker(ctx, wg, workers.AutoScaling)
-	startWorker(ctx, wg, workers.Cron)
-	startWorker(ctx, wg, workers.Container)
-	startWorker(ctx, wg, workers.Provision)
-	startWorker(ctx, wg, workers.Accounting)
-	startWorker(ctx, wg, workers.Cluster)
-	startWorker(ctx, wg, workers.Lifecycle)
-	startWorker(ctx, wg, workers.ReplicaMonitor)
-	startWorker(ctx, wg, workers.ClusterReconciler)
-	startWorker(ctx, wg, workers.Healing)
+	if workers.LB != nil {
+		startWorker(ctx, wg, workers.LB)
+	}
+	if workers.AutoScaling != nil {
+		startWorker(ctx, wg, workers.AutoScaling)
+	}
+	if workers.Cron != nil {
+		startWorker(ctx, wg, workers.Cron)
+	}
+	if workers.Container != nil {
+		startWorker(ctx, wg, workers.Container)
+	}
+	if workers.Provision != nil {
+		startWorker(ctx, wg, workers.Provision)
+	}
+	if workers.Accounting != nil {
+		startWorker(ctx, wg, workers.Accounting)
+	}
+	if workers.Cluster != nil {
+		startWorker(ctx, wg, workers.Cluster)
+	}
+	if workers.Lifecycle != nil {
+		startWorker(ctx, wg, workers.Lifecycle)
+	}
+	if workers.ReplicaMonitor != nil {
+		startWorker(ctx, wg, workers.ReplicaMonitor)
+	}
+	if workers.ClusterReconciler != nil {
+		startWorker(ctx, wg, workers.ClusterReconciler)
+	}
+	if workers.Healing != nil {
+		startWorker(ctx, wg, workers.Healing)
+	}
 }

Then remove the if w != nil guard inside startWorker since the caller now guarantees non-nil concrete values.

🤖 Prompt for AI Agents
In `@cmd/api/main.go` around lines 289 - 294, The startWorker function's nil check
(if w != nil) is unsafe because typed nil pointers (e.g., a nil
*services.LBWorker wrapped as runner) pass the check and cause a panic when Run
is invoked; move the nil validation to the concrete call sites (the caller(s)
that iterate/dispatch Workers fields—e.g., runWorkers / runApplication where the
Workers struct is read) so they skip any nil concrete pointers before calling
startWorker, and then remove the in-function guard in startWorker (keep
startWorker as a simple caller of w.Run(ctx, wg) assuming the caller ensures
non-nil concrete implementations). Ensure you reference and check each pointer
field on the Workers struct before passing it into startWorker.

Comment on lines 86 to 150
// NewLibvirtAdapter creates a LibvirtAdapter connected to the provided URI.
func NewLibvirtAdapter(logger *slog.Logger, uri string) (*LibvirtAdapter, error) {
if uri == "" {
uri = os.Getenv("LIBVIRT_URI")
}
if uri == "" {
uri = "/var/run/libvirt/libvirt-sock"
}

// Connect to libvirt
// We use a dialer for the unix socket
// Connect to libvirt socket
c, err := net.DialTimeout("unix", uri, 2*time.Second)
if err != nil {
return nil, fmt.Errorf("failed to dial libvirt: %w", err)
// Fallback to session mode if system socket fails
if !strings.Contains(uri, "session") {
sessionUri := filepath.Join(os.Getenv("HOME"), ".cache/libvirt/libvirt-sock")
if c2, err2 := net.DialTimeout("unix", sessionUri, 2*time.Second); err2 == nil {
c = c2
uri = sessionUri
} else {
return nil, fmt.Errorf("failed to dial libvirt (system and session): %w", err)
}
} else {
return nil, fmt.Errorf("failed to dial libvirt: %w", err)
}
}

//nolint:staticcheck // libvirt.New is deprecated but NewWithDialer doesn't work with our setup
//nolint:staticcheck
l := libvirt.New(c)
if err := l.Connect(); err != nil {
return nil, fmt.Errorf("failed to connect to libvirt: %w", err)
adapter := &LibvirtAdapter{
client: &RealLibvirtClient{conn: l},
logger: logger,
uri: uri,
portMappings: make(map[string]map[string]int),
networkCounter: 0,
poolStart: net.ParseIP("192.168.100.0"),
poolEnd: net.ParseIP("192.168.200.255"),
ipWaitInterval: 5 * time.Second,
taskWaitInterval: 2 * time.Second,
execCommand: exec.Command,
execCommandContext: exec.CommandContext,
lookPath: exec.LookPath,
osOpen: os.Open,
}

connectCtx, connectCancel := context.WithTimeout(context.Background(), 10*time.Second)
defer connectCancel()

// Inferred URI for the hypervisor connection
hypervisorUri := "qemu:///session"
adapter.isSession = true
if strings.Contains(uri, "/var/run/libvirt/libvirt-sock") {
hypervisorUri = "qemu:///system"
adapter.isSession = false
}

if err := adapter.client.ConnectToURI(connectCtx, hypervisorUri); err != nil {
logger.Warn("failed to connect to hypervisor URI, trying session fallback", "uri", hypervisorUri, "error", err)
if err2 := adapter.client.ConnectToURI(connectCtx, "qemu:///session"); err2 != nil {
return nil, fmt.Errorf("failed to connect to libvirt hypervisor: %w", err2)
}
adapter.isSession = true
}

return &LibvirtAdapter{
client: &RealLibvirtClient{conn: l},
logger: logger,
uri: uri,
portMappings: make(map[string]map[string]int),
networkCounter: 0,
// Local private network range for VM pools - safe for internal usage
poolStart: net.ParseIP("192.168.100.0"),
poolEnd: net.ParseIP("192.168.200.255"),
ipWaitInterval: 5 * time.Second,
taskWaitInterval: 2 * time.Second,
}, nil
return adapter, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Session fallback on line 105 wraps the system socket error, losing the session error.

When both the system and session socket connections fail, the returned error wraps err (the system error) but discards err2 (the session error). The caller loses diagnostic info about why the session fallback failed.

🐛 Proposed fix
-			return nil, fmt.Errorf("failed to dial libvirt (system and session): %w", err)
+			return nil, fmt.Errorf("failed to dial libvirt (system: %v, session: %w)", err, err2)
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/adapter.go` around lines 86 - 150, The
system-vs-session dial fallback in NewLibvirtAdapter discards the session error
(err2) when both net.DialTimeout calls fail; update the failure return to
include both errors (e.g., use errors.Join(err, err2) or format both err and
err2 into the returned error) so callers see both diagnostics; modify the branch
that currently returns fmt.Errorf("failed to dial libvirt (system and session):
%w", err) to instead return an error that contains err and err2 (and add the
errors import if using errors.Join).

Comment on lines +182 to +216
allocatedPorts := make([]string, 0, len(opts.Ports))
for _, p := range opts.Ports {
parts := strings.Split(p, ":")
if len(parts) == 2 {
hPort := parts[0]
cPort := parts[1]
if hPort == "0" {
freePort, err := findFreePort()
if err != nil {
a.logger.Warn("failed to auto-allocate port", "container_port", cPort, "error", err)
continue
}
hPort = strconv.Itoa(freePort)
}
if err := a.recordPortMapping(name, hPort, cPort); err != nil {
a.logger.Warn("failed to record port mapping", "port", p, "error", err)
continue
}
allocatedPorts = append(allocatedPorts, fmt.Sprintf("%s:%s", hPort, cPort))
} else if len(parts) == 1 {
cPort := parts[0]
freePort, err := findFreePort()
if err != nil {
a.logger.Warn("failed to auto-allocate port", "container_port", cPort, "error", err)
continue
}
hPort := strconv.Itoa(freePort)
if err := a.recordPortMapping(name, hPort, cPort); err != nil {
a.logger.Warn("failed to record port mapping", "port", p, "error", err)
continue
}
allocatedPorts = append(allocatedPorts, fmt.Sprintf("%s:%s", hPort, cPort))
}
}
opts.Ports = allocatedPorts
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Double port allocation: setupPortForwarding re-allocates ports that were already allocated.

LaunchInstanceWithOptions (lines 182–215) already allocates host ports via findFreePort, records mappings via recordPortMapping, and builds allocatedPorts. Then on line 242, setupPortForwarding is called in a goroutine, which calls parseAndValidatePort (line 952) — and that function calls findFreePort() again for hostPort == "0".

Since allocatedPorts already contains resolved "hostPort:containerPort" pairs (never "0:..."), the second findFreePort call in parseAndValidatePort is dead code for this path. However, configureIptables (line 959) also overwrites the port mappings that recordPortMapping already stored. This is redundant and could mask race conditions if the goroutine runs concurrently with the caller reading portMappings.

Consider simplifying: setupPortForwarding should only set up iptables rules using the already-resolved ports, without re-parsing or re-recording them.

Also applies to: 241-243, 939-962, 975-1007

🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/adapter.go` around lines 182 - 216,
LaunchInstanceWithOptions currently allocates host ports into allocatedPorts and
records them via recordPortMapping, but setupPortForwarding (via
parseAndValidatePort and configureIptables) re-runs allocation and overwrites
mappings; change setupPortForwarding so it accepts the already-resolved
allocatedPorts (or the same opts.Ports) and only creates iptables rules without
calling parseAndValidatePort/findFreePort or calling recordPortMapping again.
Concretely, update LaunchInstanceWithOptions/where the goroutine is spawned to
pass the resolved allocatedPorts into setupPortForwarding(…) and remove the path
in parseAndValidatePort/configureIptables that re-allocates or rewrites
mappings; ensure you avoid concurrent writes to the portMappings map (use the
existing recordPortMapping done earlier or protect access) so the goroutine only
reads the final resolved mappings when setting rules.

Comment on lines +695 to +728
func (a *LibvirtAdapter) prepareRootVolume(ctx context.Context, name string, imageName string) (string, libvirt.StorageVol, error) {
pool, err := a.client.StoragePoolLookupByName(ctx, defaultPoolName)
if err != nil {
return "", libvirt.StorageVol{}, fmt.Errorf("default pool not found: %w", err)
}

var backingXML string
if imageName != "" {
backingVol, err := a.client.StorageVolLookupByName(ctx, pool, imageName)
if err == nil {
backingPath, err := a.client.StorageVolGetPath(ctx, backingVol)
if err == nil {
backingXML = backingPath
}
} else {
a.logger.Warn("backing image not found, creating empty volume", "image", imageName)
}
}

volXML := generateVolumeXML(name+"-root", 10, backingXML)

vol, err := a.client.StorageVolCreateXML(ctx, pool, volXML, 0)
if err != nil {
return "", libvirt.StorageVol{}, fmt.Errorf("failed to create root volume: %w", err)
}

diskPath, err := a.client.StorageVolGetPath(ctx, vol)
if err != nil {
_ = a.client.StorageVolDelete(ctx, vol, 0)
return "", libvirt.StorageVol{}, fmt.Errorf(errGetVolumePath, err)
}

return diskPath, vol, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

prepareRootVolume silently ignores StorageVolGetPath error for backing volume.

Lines 704–708: when StorageVolLookupByName succeeds but StorageVolGetPath fails, backingXML remains empty and the volume is created without a backing store — silently degrading to an empty disk. At minimum, log a warning.

♻️ Proposed fix
 		backingVol, err := a.client.StorageVolLookupByName(ctx, pool, imageName)
 		if err == nil {
 			backingPath, err := a.client.StorageVolGetPath(ctx, backingVol)
 			if err == nil {
 				backingXML = backingPath
+			} else {
+				a.logger.Warn("backing image found but path retrieval failed, creating empty volume", "image", imageName, "error", err)
 			}
 		} else {
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/adapter.go` around lines 695 - 728, In
prepareRootVolume, when StorageVolLookupByName finds backingVol but
StorageVolGetPath returns an error, currently backingXML stays empty and the
code silently falls back to creating an empty volume; update the block that
calls StorageVolGetPath (after StorageVolLookupByName) to detect the error and
call a.logger.Warn (or a.logger.Error) with a clear message that includes the
backing image name and the error (e.g., "failed to get backing volume path",
"image", imageName, "err", err), so the failure is visible in logs while
preserving the existing fallback behavior; reference prepareRootVolume,
StorageVolLookupByName, StorageVolGetPath, backingXML, and a.logger.Warn when
making the change.

)

func generateVolumeXML(name string, sizeGB int) string {
var xmlEscape = html.EscapeString
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the templates.go file and see how xmlEscape is used
cd internal/repositories/libvirt || cd . && find . -name "templates.go" -path "*/libvirt/*"

Repository: poyrazK/thecloud

Length of output: 42


🏁 Script executed:

# Examine the file content
cat -n internal/repositories/libvirt/templates.go | head -50

Repository: poyrazK/thecloud

Length of output: 1647


🌐 Web query:

Does Go's html.EscapeString escape single quotes for XML safety

💡 Result:

Yes. html.EscapeString escapes single quotes (as well as <, >, &, and "). In its output, a ' becomes &#39; (numeric character reference), which is valid in XML too. [1], [2]

That said, html.EscapeString is an HTML escaper, not an XML serializer. For XML output, prefer encoding/xml (e.g., xml.EscapeText or xml.Marshal). [3]

Sources: [1] Go html.EscapeString docs, [2] Go html.EscapeString example output, [3] Go encoding/xml.EscapeText docs.


Remove the global variable and use proper XML escaping.

html.EscapeString is an HTML escaper, not an XML serializer, and violates the "Do not use global variables" guideline. Although it does escape single quotes to &#39;, it's the wrong tool for XML output. Use constructor injection to provide an XML escaper (such as xml.EscapeText from the encoding/xml package) instead of a global variable assignment.

🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` at line 10, Remove the global
variable xmlEscape and stop using html.EscapeString; instead inject an XML
escaper (use encoding/xml.EscapeText) via constructor or factory for the
template-related type or function that currently relies on xmlEscape, replace
all references to xmlEscape with the injected escaper function, and ensure the
injected escaper signature matches func([]byte) ([]byte, error) or wrap
EscapeText to a func(string) string if callers expect that shape; update the
constructor/factory to accept the escaper and pass it through to methods that
render XML templates.

Comment thread internal/workers/healing_worker_test.go Outdated
Copilot AI review requested due to automatic review settings February 13, 2026 11:23
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
internal/repositories/libvirt/lb_proxy.go (2)

64-77: ⚠️ Potential issue | 🟠 Major

Silent failures on cleanup violate coding guidelines.

Lines 72 and 75 use blank identifier assignments (_ = cmd.Run(), _ = os.RemoveAll(...)) which the guidelines explicitly prohibit. If nginx fails to stop, the proxy may remain running, and the caller will never know. RemoveProxy unconditionally returns nil, masking all errors.

At minimum, collect and return these errors (or log them if best-effort cleanup is intentional).

Proposed fix
 func (a *LBProxyAdapter) RemoveProxy(ctx context.Context, lbID uuid.UUID) error {
 	configDir := filepath.Join("/tmp", "thecloud", "lb", lbID.String())
 	pidPath := filepath.Join(configDir, nginxPidFileName)
 
+	var errs []error
 	if _, err := os.Stat(pidPath); err == nil {
 		// Stop nginx: nginx -s stop -c ...
 		configPath := filepath.Join(configDir, nginxConfigFileName)
 		cmd := a.execCommandContext(ctx, "nginx", "-c", configPath, "-g", fmt.Sprintf("pid %s;", pidPath), "-s", "stop")
-		_ = cmd.Run()
+		if err := cmd.Run(); err != nil {
+			errs = append(errs, fmt.Errorf("failed to stop nginx: %w", err))
+		}
 	}
 
-	_ = os.RemoveAll(configDir)
-	return nil
+	if err := os.RemoveAll(configDir); err != nil {
+		errs = append(errs, fmt.Errorf("failed to remove config dir: %w", err))
+	}
+	return errors.Join(errs...)
 }

As per coding guidelines, "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".


144-154: ⚠️ Potential issue | 🟠 Major

Silently skipping targets on IP lookup failure can produce a broken config.

When GetInstanceIP fails for a target, the error is swallowed and the target is simply omitted. If all lookups fail, the generated config will have zero targets and nginx will return 503 — with no error surfaced to the caller. This is effectively a silent failure.

Consider accumulating errors or at minimum logging the skipped target so operators can diagnose issues.

internal/api/setup/dependencies.go (1)

190-191: ⚠️ Potential issue | 🔴 Critical

Remove the duplicate WebSocket hub — pass svcs.WsHub to ws.NewHandler instead of creating a new hub.

InitServices creates a wsHub and stores it in Services.WsHub, which is passed to EventService for broadcasting events. However, InitHandlers ignores this and creates a second, separate hub at line 73, passing it to the WebSocket handler. This means events published through EventService are broadcast to one hub while WebSocket clients are connected to a different hub, so clients never receive real-time updates.

Change line 113 in router.go from ws.NewHandler(hub, ...) to ws.NewHandler(svcs.WsHub, ...) and remove lines 73-74.

🤖 Fix all issues with AI agents
In `@internal/core/services/benchmarks_test.go`:
- Around line 33-46: The InstanceServiceParams constructions creating svc via
services.NewInstanceService are missing the required SSHKeySvc dependency;
update each InstanceServiceParams literal (the one assigning Repo, VpcRepo,
SubnetRepo, VolumeRepo, Compute, Network, EventSvc, AuditSvc, TaskQueue, Logger,
TenantSvc, InstanceTypeRepo) to include SSHKeySvc:
services.NewSSHKeyService(&noop.NoopSSHKeyRepository{}), mirroring the noop
pattern used for InstanceTypeRepo/TaskQueue so the service contract is satisfied
and avoids nil derefs.

In `@internal/core/services/instance_test.go`:
- Around line 471-475: The test discards the error from repo.GetByID which can
leave dbInst nil and cause panics; change the call to capture the error (e.g.,
dbInst, err := repo.GetByID(ctx, inst.ID)) and assert it succeeded before
dereferencing (use require.NoError(t, err) or if err != nil { t.Fatalf("GetByID
failed: %v", err) }) so subsequent assertions on dbInst.Metadata and
dbInst.Labels are safe.

In `@internal/repositories/libvirt/templates.go`:
- Around line 59-61: Replace the magic number 25 in the condition "if i >= 25 {
// Limit to vd[b-z]" with a PascalCase named constant (e.g., MaxVdLetters)
declared alongside related constants in this file; set its value to 25 and add a
short comment explaining it represents the letters for vdb–vdz (i.e., 25 device
letters), then use that constant in the "if i >= MaxVdLetters" check so the
intent is clear and adheres to the naming guidelines.
🧹 Nitpick comments (6)
internal/repositories/libvirt/templates.go (2)

110-115: Arch detection via substring match is fragile.

strings.Contains(isoPath, "arm64") would match paths like /images/not-arm64-really/focal.qcow2. This is a best-effort fallback, but consider matching on the last path segment or requiring the caller to always pass arch explicitly to avoid surprises.


150-152: VNC bound to 0.0.0.0 — verify this is intentional.

Binding VNC to all interfaces exposes the guest console to the network. If this is only needed for local development, consider defaulting to 127.0.0.1 or making the listen address configurable.

internal/core/services/database_test.go (2)

24-52: Integration tests use real Docker and Postgres — consider adding mocks for unit-level coverage.

This test setup wires real postgres.*Repository and docker.DockerAdapter instances, which makes these integration tests rather than unit tests. The coding guidelines state mock repositories should be used in service tests. Consider adding a parallel suite with mocked dependencies for fast, isolated unit tests — but this can be deferred since the integration tests are valuable on their own.

As per coding guidelines, **/*_test.go: "Do not test with real external dependencies in unit tests - use mocks instead" and internal/core/services/**/*_test.go: "Mock repositories in service tests".


74-74: Silent failure on cleanup — blanked error from DeleteInstance.

Lines 74, 95, and 117 discard errors with _ =. While this is test cleanup code, it violates the coding guideline. At minimum, log or call t.Log so failures during cleanup are visible during debugging.

As per coding guidelines, **/*.go: "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

internal/core/services/benchmarks_test.go (1)

146-146: Blanked os.Setenv error and environment side-effect leakage.

_ = os.Setenv(...) silently discards the error. More importantly, os.Setenv mutates the process-wide environment and is not cleaned up, which can affect other benchmarks or tests running in parallel. Consider using b.Setenv() (Go 1.17+) which automatically restores the value after the test.

Suggested fix
-	_ = os.Setenv("TRACING_ENABLED", "false")
+	b.Setenv("TRACING_ENABLED", "false")

As per coding guidelines, **/*.go: "Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

internal/core/services/instance_test.go (1)

141-141: Port string simplified to "0:80" — verify this is intentional.

The port string was changed from what appears to have been a dynamic/environment-based selection to a hardcoded "0:80" (random host port mapped to container port 80). This is fine for integration tests but worth confirming the port allocation logic in the service handles this format correctly.

Comment on lines 33 to 46
svc := services.NewInstanceService(services.InstanceServiceParams{
Repo: repo,
VpcRepo: vpcRepo,
SubnetRepo: subnetRepo,
VolumeRepo: volumeRepo,
Compute: compute,
Network: network,
EventSvc: eventSvc,
AuditSvc: auditSvc,
TaskQueue: &services.TaskQueueStub{},
Logger: logger,
Repo: repo,
VpcRepo: vpcRepo,
SubnetRepo: subnetRepo,
VolumeRepo: volumeRepo,
Compute: compute,
Network: network,
EventSvc: eventSvc,
AuditSvc: auditSvc,
TaskQueue: &services.TaskQueueStub{},
Logger: logger,
TenantSvc: &NoopTenantService{},
InstanceTypeRepo: &noop.NoopInstanceTypeRepository{},
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the InstanceServiceParams struct definition
rg -n 'type InstanceServiceParams struct' --type=go -A 20

Repository: poyrazK/thecloud

Length of output: 1663


🏁 Script executed:

# Check all SSHKeySvc references in the services layer
rg -n 'SSHKeySvc|sshKeySvc' --type=go -C 3

Repository: poyrazK/thecloud

Length of output: 6521


🏁 Script executed:

# Examine the benchmark file at the specified lines
cat -n internal/core/services/benchmarks_test.go | sed -n '30,100p'

Repository: poyrazK/thecloud

Length of output: 2663


🏁 Script executed:

# Get the full benchmark file around the mentioned lines
cat -n internal/core/services/benchmarks_test.go | sed -n '25,165p'

Repository: poyrazK/thecloud

Length of output: 5170


🏁 Script executed:

# Check all usages of sshKeySvc in instance.go
rg -n 'sshKeySvc' internal/core/services/instance.go -B 2 -A 2

Repository: poyrazK/thecloud

Length of output: 615


🏁 Script executed:

# Check the LaunchParams struct to confirm SSHKeyID is optional
rg -n 'type LaunchParams struct' --type=go -A 15

Repository: poyrazK/thecloud

Length of output: 1006


Add missing SSHKeySvc to all InstanceServiceParams constructions in benchmarks.

SSHKeySvc is a required dependency in InstanceServiceParams (not marked as optional like TaskQueue). While the benchmarks currently work because they either don't call methods using SSH keys or don't provide an SSH key ID, the missing required dependency violates the service contract and creates a latent nil-dereference risk if the implementation changes or if SSH key IDs are added to the benchmark parameters.

Provide SSHKeySvc in the constructions at lines 33-46, 83-96, and 148-161. Use a noop or stub implementation matching the pattern in other tests:

SSHKeySvc: services.NewSSHKeyService(&noop.NoopSSHKeyRepository{}),
🤖 Prompt for AI Agents
In `@internal/core/services/benchmarks_test.go` around lines 33 - 46, The
InstanceServiceParams constructions creating svc via services.NewInstanceService
are missing the required SSHKeySvc dependency; update each InstanceServiceParams
literal (the one assigning Repo, VpcRepo, SubnetRepo, VolumeRepo, Compute,
Network, EventSvc, AuditSvc, TaskQueue, Logger, TenantSvc, InstanceTypeRepo) to
include SSHKeySvc: services.NewSSHKeyService(&noop.NoopSSHKeyRepository{}),
mirroring the noop pattern used for InstanceTypeRepo/TaskQueue so the service
contract is satisfied and avoids nil derefs.

Comment on lines +471 to +475
dbInst, _ := repo.GetByID(ctx, inst.ID)
assert.Equal(t, "val2", dbInst.Metadata["key2"])
_, ok := dbInst.Metadata["key1"]
assert.False(t, ok)
assert.Equal(t, "v1", dbInst.Labels["l1"])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Same pattern: discarded error from repo.GetByID risks nil-pointer panic.

If the DB fetch fails, dbInst will be nil and the assertions on lines 472-475 will panic instead of producing a clean test failure.

Proposed fix
-		dbInst, _ := repo.GetByID(ctx, inst.ID)
+		dbInst, err := repo.GetByID(ctx, inst.ID)
+		require.NoError(t, err)
 		assert.Equal(t, "val2", dbInst.Metadata["key2"])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dbInst, _ := repo.GetByID(ctx, inst.ID)
assert.Equal(t, "val2", dbInst.Metadata["key2"])
_, ok := dbInst.Metadata["key1"]
assert.False(t, ok)
assert.Equal(t, "v1", dbInst.Labels["l1"])
dbInst, err := repo.GetByID(ctx, inst.ID)
require.NoError(t, err)
assert.Equal(t, "val2", dbInst.Metadata["key2"])
_, ok := dbInst.Metadata["key1"]
assert.False(t, ok)
assert.Equal(t, "v1", dbInst.Labels["l1"])
🤖 Prompt for AI Agents
In `@internal/core/services/instance_test.go` around lines 471 - 475, The test
discards the error from repo.GetByID which can leave dbInst nil and cause
panics; change the call to capture the error (e.g., dbInst, err :=
repo.GetByID(ctx, inst.ID)) and assert it succeeded before dereferencing (use
require.NoError(t, err) or if err != nil { t.Fatalf("GetByID failed: %v", err)
}) so subsequent assertions on dbInst.Metadata and dbInst.Labels are safe.

Comment on lines +59 to +61
if i >= 25 { // Limit to vd[b-z]
break
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Magic number 25 should be a named constant.

Per coding guidelines, avoid magic numbers and use named constants. This limit also deserves a brief comment on why 25 (vdb–vdz = 25 letters).

Proposed fix
+// MaxAdditionalDisks is the number of extra virtio disks (vdb through vdz).
+const MaxAdditionalDisks = 25
+
 func generateDomainXML(...) string {
     ...
-    if i >= 25 { // Limit to vd[b-z]
+    if i >= MaxAdditionalDisks {
         break
     }

As per coding guidelines, "Do not use magic numbers - use named constants instead" and "Use PascalCase for limit constants (e.g., MaxPortsPerInstance)".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if i >= 25 { // Limit to vd[b-z]
break
}
// MaxAdditionalDisks is the number of extra virtio disks (vdb through vdz).
const MaxAdditionalDisks = 25
if i >= MaxAdditionalDisks { // Limit to vd[b-z]
break
}
🤖 Prompt for AI Agents
In `@internal/repositories/libvirt/templates.go` around lines 59 - 61, Replace the
magic number 25 in the condition "if i >= 25 { // Limit to vd[b-z]" with a
PascalCase named constant (e.g., MaxVdLetters) declared alongside related
constants in this file; set its value to 25 and add a short comment explaining
it represents the letters for vdb–vdz (i.e., 25 device letters), then use that
constant in the "if i >= MaxVdLetters" check so the intent is clear and adheres
to the naming guidelines.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 91 out of 92 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +68 to +122
func (w *HealingWorker) healERRORInstances(ctx context.Context) {
instances, err := w.repo.ListAll(ctx)
if err != nil {
w.logger.Error("failed to list all instances for healing", "error", err)
return
}

for _, inst := range instances {
if inst.Status != domain.StatusError {
continue
}

// Skip managed instances if possible (heuristic: name prefix or labels if we start using them)
// For now, we try to heal all ERROR instances. Overlapping with other workers is mostly safe
// because of optimistic locking and status checks in InstanceService.

w.logger.Info("attempting to heal instance", "instance_id", inst.ID, "name", inst.Name)

// Acquire semaphore token
select {
case w.semaphore <- struct{}{}:
case <-ctx.Done():
return
}

// Inject context for the instance owner
hCtx := appcontext.WithUserID(ctx, inst.UserID)
hCtx = appcontext.WithTenantID(hCtx, inst.TenantID)

w.reconcileWG.Add(1)
// Self-healing: Stop then Start
go func(id string) {
defer w.reconcileWG.Done()
defer func() { <-w.semaphore }() // Release semaphore token

// Wait a bit to avoid race conditions with recent failures
select {
case <-ctx.Done():
return
case <-time.After(w.healingDelay):
}

w.logger.Info("initiating healing restart", "instance_id", id)
if err := w.instSvc.StopInstance(hCtx, id); err != nil {
w.logger.Warn("healing: stop failed", "instance_id", id, "error", err)
// Continue anyway, maybe it was already half-stopped
}

if err := w.instSvc.StartInstance(hCtx, id); err != nil {
w.logger.Error("healing: start failed", "instance_id", id, "error", err)
} else {
w.logger.Info("successfully healed instance", "instance_id", id)
}
}(inst.ID.String())
}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The healing worker attempts to restart instances in ERROR state without checking if they are already being processed by other workers (e.g., provision worker). While the comment on line 80-82 mentions that optimistic locking provides some safety, there's a potential race condition: if an instance enters ERROR state and both the healing worker and a retry mechanism attempt recovery simultaneously, they could interfere with each other. Consider adding a "last_healing_attempt" timestamp to prevent rapid repeated healing attempts.

Copilot uses AI. Check for mistakes.
Comment on lines +121 to +132
// handle host:port or host:containerPort:hostPort or containerPort:hostPort
if len(parts) >= 2 && parts[len(parts)-1] == targetPort {
portStr := parts[0]
if len(parts) == 3 {
portStr = parts[1]
}
hp, err := strconv.Atoi(portStr)
if err != nil {
return 0, err
}
return hp, nil
}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parseAllocatedPort function has inconsistent port format handling between Database and Cache services. In database.go:118-135, it expects format "hostPort:containerPort" but also handles a 3-part format. In cache.go:124-138, it only handles 2-part format. This inconsistency could lead to bugs if the compute backends return different formats. The format handling should be consistent across both services.

Suggested change
// handle host:port or host:containerPort:hostPort or containerPort:hostPort
if len(parts) >= 2 && parts[len(parts)-1] == targetPort {
portStr := parts[0]
if len(parts) == 3 {
portStr = parts[1]
}
hp, err := strconv.Atoi(portStr)
if err != nil {
return 0, err
}
return hp, nil
}
// Expect format "hostPort:containerPort", matching cache service behavior.
if len(parts) != 2 {
continue
}
hostPortStr := parts[0]
containerPortStr := parts[1]
if containerPortStr != targetPort {
continue
}
hp, err := strconv.Atoi(hostPortStr)
if err != nil {
return 0, err
}
return hp, nil

Copilot uses AI. Check for mistakes.
Comment on lines 10 to +179
@@ -53,42 +76,96 @@ func generateDomainXML(name, diskPath, networkID, isoPath string, memoryMB, vcpu
<driver name='qemu' type='%s'/>
<source %s='%s'/>
<target dev='%s' bus='virtio'/>
</disk>`, diskType, driverType, sourceAttr, dPath, dev)
</disk>`, diskType, driverType, sourceAttr, escapedDPath, dev)
}

qemuArgsXML := ""
hasNetworkMapping := false
var hostfwds []string

for _, p := range ports {
parts := strings.Split(p, ":")
if len(parts) == 2 {
// Validation is expected at the caller (adapter)
hPort := xmlEscape(parts[0])
cPort := xmlEscape(parts[1])
hostfwds = append(hostfwds, fmt.Sprintf("hostfwd=tcp::%s-:%s", hPort, cPort))
hasNetworkMapping = true
}
}

if hasNetworkMapping {
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-netdev'/>
<qemu:arg value='user,id=net0,%s'/>
<qemu:arg value='-device'/>
<qemu:arg value='virtio-net-pci,netdev=net0,bus=pci.0,addr=0x8'/>`, strings.Join(hostfwds, ","))
}

// Use the unescaped name for the log file path to avoid malformed filenames
qemuArgsXML += fmt.Sprintf(`
<qemu:arg value='-serial'/>
<qemu:arg value='file:/tmp/%s-console.log'/>`, name)

if arch == "" {
arch = "x86_64"
if strings.Contains(isoPath, "arm64") || strings.Contains(diskPath, "arm64") {
arch = "aarch64"
}
}

machine := "pc"
if arch == "aarch64" {
machine = "virt"
}

interfaceXML := ""
if !hasNetworkMapping {
interfaceXML = fmt.Sprintf(`
<interface type='network'>
<source network='%s'/>
<model type='virtio'/>
</interface>`, escapedNetworkID)
}

return fmt.Sprintf(`
<domain type='kvm'>
<domain type='qemu' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>%s</name>
<memory unit='KiB'>%d</memory>
<vcpu placement='static'>%d</vcpu>
<os>
<type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
<type arch='%s' machine='%s'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
</features>
<devices>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='%s'/>
<target dev='vda' bus='virtio'/>
</disk>%s%s
<interface type='network'>
<source network='%s'/>
<model type='virtio'/>
</interface>
</disk>%s%s%s
<graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0'>
<listen type='address' address='0.0.0.0'/>
</graphics>
<serial type='pty'>
<target port='0'/>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
</rng>
</devices>
</domain>`, name, memoryKB, vcpu, diskPath, isoXML, additionalDisksXML, networkID)
<qemu:commandline>%s
</qemu:commandline>
</domain>`, escapedName, memoryKB, vcpu, arch, machine, escapedDiskPath, isoXML, additionalDisksXML, interfaceXML, qemuArgsXML)
}

func generateNetworkXML(name, bridgeName, gatewayIP, rangeStart, rangeEnd string) string {
escapedName := xmlEscape(name)
escapedBridgeName := xmlEscape(bridgeName)
escapedGatewayIP := xmlEscape(gatewayIP)
escapedRangeStart := xmlEscape(rangeStart)
escapedRangeEnd := xmlEscape(rangeEnd)

return fmt.Sprintf(`
<network>
<name>%s</name>
@@ -99,5 +176,5 @@ func generateNetworkXML(name, bridgeName, gatewayIP, rangeStart, rangeEnd string
<range start='%s' end='%s'/>
</dhcp>
</ip>
</network>`, name, bridgeName, gatewayIP, rangeStart, rangeEnd)
</network>`, escapedName, escapedBridgeName, escapedGatewayIP, escapedRangeStart, escapedRangeEnd)
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XML escaping using html.EscapeString is inappropriate for XML contexts and could allow XML injection attacks. The html.EscapeString function is designed for HTML attribute contexts, not XML, and may not properly escape all XML-specific characters (like single quotes in attribute values). Consider using a proper XML escaping function or the encoding/xml package for safe XML generation.

Copilot uses AI. Check for mistakes.
Comment on lines +128 to +140
userData = fmt.Sprintf("#!/bin/sh\n"+
"for user in cirros ubuntu root; do\n"+
" home=\"/home/$user\"\n"+
" if [ \"$user\" = \"root\" ]; then home=\"/root\"; fi\n"+
" if [ -d \"$home\" ]; then\n"+
" mkdir -p \"$home/.ssh\"\n"+
" echo '%s' >> \"$home/.ssh/authorized_keys\"\n"+
" chown -R \"$user:$user\" \"$home/.ssh\" 2>/dev/null || true\n"+
" chmod 700 \"$home/.ssh\"\n"+
" chmod 600 \"$home/.ssh/authorized_keys\"\n"+
" fi\n"+
"done\n", key.PublicKey)
}
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SSH key public key is being injected directly into a shell script without proper escaping, which could lead to command injection if the public key contains shell metacharacters. The key should be properly escaped or written to a file in a safer manner (e.g., using base64 encoding or proper shell escaping).

Copilot uses AI. Check for mistakes.
}

if err := s.repo.Create(ctx, inst); err != nil {
// Rollback quota reservation
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quota rollback logic on line 193-195 may not properly handle all failure cases. If repo.Create fails after both IncrementUsage calls succeed (lines 157-164), only vCPUs and memory are decremented. However, this is correct since the "instances" quota is tracked by live count (as noted in tenant_repo.go:228), so the rollback is appropriate. Consider adding a comment to clarify this intentional design.

Suggested change
// Rollback quota reservation
// Rollback quota reservation for vCPUs and memory only. The "instances"
// quota is enforced based on the live instance count in the tenant
// repository (see tenant_repo.go:228), so there is no separate reservation
// to roll back here.

Copilot uses AI. Check for mistakes.
@poyrazK poyrazK merged commit 179c43d into main Feb 13, 2026
35 checks passed
@poyrazK poyrazK deleted the feature/compute-improvements branch February 13, 2026 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants