Conversation
Split the monolithic `miren logs` command into proper subcommands:

- `miren logs app` — application logs (also the default for bare `miren logs`)
- `miren logs sandbox <id>` — sandbox logs
- `miren logs build <version>` — build logs
- `miren logs system [component]` — system/server logs (new)

Each subcommand gets its own opts struct with only the relevant flags, eliminating the conflicting-flag validation that was growing unwieldy.

Change the default output window from "since midnight today" to "last 100 lines" by wiring WithLimit into executeStreamQuery with a desc-sort-then-reverse strategy. The server applies this limit when no explicit --last is provided and --follow is not set.

Add system log ingestion via a new slog.Handler (SystemLogHandler) that tees server logs to VictoriaLogs under entity "system/miren-server". This is wired into server startup after the log writer is created, making all server logs queryable through `miren logs system`.

Add a `system` bool field to the LogTarget RPC type and handle it in the log server's StreamLogChunks/StreamLogs methods via a new resolveLogTarget helper that consolidates app/sandbox/system target resolution.
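The desc-sort-then-reverse strategy can be sketched as follows. This is a simplified illustration, not the real `executeStreamQuery` code: the backend is asked for the newest N entries (sorted descending by time, with a limit), and the client reverses the slice so output still reads oldest-to-newest. The `logEntry` type and `lastN` helper here are hypothetical.

```go
package main

import "fmt"

// logEntry is a simplified stand-in for the real log entry type.
type logEntry struct {
	Time string
	Msg  string
}

// lastN reverses a newest-first result set so it prints in
// chronological order, matching what follow mode would show.
func lastN(desc []logEntry) []logEntry {
	out := make([]logEntry, len(desc))
	for i, e := range desc {
		out[len(desc)-1-i] = e
	}
	return out
}

func main() {
	// The backend returns newest-first when the query sorts desc with a limit.
	newestFirst := []logEntry{
		{"10:02", "third"},
		{"10:01", "second"},
		{"10:00", "first"},
	}
	for _, e := range lastN(newestFirst) {
		fmt.Println(e.Time, e.Msg)
	}
}
```

The point of fetching descending is that the limit then selects the *latest* 100 lines rather than the first 100 after the window start.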
…cture

- Fix test assertion to match the updated error message including the "system" target
- Wait for VictoriaLogs readiness before creating SystemLogHandler, avoiding a startup race where early writes silently fail with 30s HTTP timeouts
- Restore --service/--build old-server validation in dispatchLogs so these filters error clearly instead of being silently dropped
- Change the legacy default from "since midnight" to "last 1 hour" for a more reasonable bounded window (the legacy protocol can't do limit=100)
VictoriaLogs crashed with "too many open files" during small-parts merge because the container inherited the default 1024 fd limit. The system log handler also wrote one HTTP POST per slog record, creating excessive small parts under load. Two fixes:

- Set RLIMIT_NOFILE to 65536 on the VictoriaLogs and VictoriaMetrics containers via a new WithRlimitNOFILE helper in pkg/containerdx
- Add BatchLogWriter, which buffers system log entries and flushes them as a single NDJSON POST every 250ms or 50 entries, whichever comes first
Split `miren logs` into subcommands and add system log ingestion
Actionable comments posted: 6
🧹 Nitpick comments (3)
cli/commands/server.go (1)

505-507: Add klog rebinding after the ctx.Log wrap for consistency and to prevent future regressions.

Currently, `klog.SetLogger` is configured at line 475, before the `ctx.Log` wrap, which means any klog output emitted after line 506 would bypass the SystemLogHandler. While klog is not actively used in this file, k8s libraries that depend on klog are present as transitive dependencies. Rebinding after the wrap ensures all logging paths flow through SystemLogHandler consistently.

Suggested fix:

```diff
 systemHandler := observability.NewSystemLogHandler(ctx.Log.Handler(), batchWriter)
 ctx.Log = slog.New(systemHandler)
+// Rebind klog so any k8s libraries used in dependencies also flow through SystemLogHandler.
+klog.SetLogger(logr.FromSlogHandler(ctx.Log.With("module", "global").Handler()))
```

observability/system_log_handler.go (1)

20-21: `WithGroup` state is tracked but not applied to emitted attribute keys.

Grouped attrs currently lose their group context on the VictoriaLogs write path. Either apply group prefixing in `Handle` or remove `groups` to avoid misleading behavior.

Also applies to: 47-56, 86-92

observability/batch_log_writer_test.go (1)

41-41: Prefer deadline-based waiting over fixed sleeps in async tests.

Lines 41 and 99 can flake in slow CI. Poll until the condition holds or a timeout elapses instead of sleeping for a fixed duration.

Also applies to: 99-99
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@observability/batch_log_writer.go`:
- Around line 89-92: The Close method on BatchLogWriter can panic if called
multiple times because closing the done channel twice is illegal; make Close
idempotent by introducing and using a sync.Once (e.g., a field like closeOnce)
or an equivalent guard so that the block that closes b.done and calls
b.wg.Wait() is executed only once; update the BatchLogWriter struct to add the
guard (closeOnce) and change the Close method to call closeOnce.Do(func() {
close(b.done); b.wg.Wait() }) to prevent double-close panics.
- Around line 131-134: Always drain the HTTP response body regardless of status
to enable connection reuse: call io.Copy(io.Discard, resp.Body) unconditionally
before closing the body (i.e., remove the conditional around io.Copy and ensure
resp.Body.Close() still runs afterwards or call Close after the unconditional
drain). Update the code that checks resp.StatusCode (references to resp,
resp.Body, http.StatusOK, http.StatusNoContent) to remove selective draining so
every response body is fully read before closing.
In `@observability/logs.go`:
- Around line 374-387: In parseLogStreamReversed, stop scanning/parsing once the
request context is canceled by checking ctx.Done during the loop (e.g., before
parsing each line and/or before appending to entries) and returning ctx.Err so
we don't continue buffering; specifically update the loop that uses scanner,
line, and entries in the LogReader.parseLogStreamReversed function to select on
ctx.Done (or call ctx.Err()) and break/return immediately when canceled,
ensuring no further parseLogLine calls or entries = append operations occur
after cancellation and that any caller-visible error is propagated.
In `@observability/system_log_handler.go`:
- Around line 98-113: The WaitForVictoriaLogs loop can exceed the intended
timeout because it passes the outer ctx directly to reader.Read, allowing that
call to block up to the client's internal timeout; update WaitForVictoriaLogs to
derive a timeout-scoped context for each Read (or one that covers the remaining
wait duration) so reader.Read uses a context bounded by the overall timeout,
e.g., compute a per-iteration/remaining deadline from timeout and call
reader.Read with that derived context instead of the original ctx; touch the
WaitForVictoriaLogs function and the call site using NewLogReader/reader.Read to
ensure the read is canceled when the wait deadline elapses.
In `@pkg/containerdx/opts.go`:
- Around line 656-666: WithRlimitNOFILE currently always appends a new
runtimespec.POSIXRlimit which can produce duplicate RLIMIT_NOFILE entries;
change it to upsert instead: in the WithRlimitNOFILE function inspect
s.Process.Rlimits (ensure s.Process != nil), loop to find an existing entry with
Type == "RLIMIT_NOFILE" and if found set its Soft and Hard to n and return nil,
otherwise append the new runtimespec.POSIXRlimit; this prevents duplicate Type
entries while preserving existing rlimits.
In `@servers/logs/logs.go`:
- Around line 220-221: The current check in StreamLogChunks uses
!target.HasSystem() which allows app-scoped callers to access system targets;
update the authorization to block non-app targets for app-scoped callers by
checking target.HasApp() instead. Specifically, in the conditional that uses
rpc.BoundApp(ctx) and rpc.ErrUnauthorized, replace or augment the
!target.HasSystem() check so that if rpc.BoundApp(ctx) != "" and the target does
not have an app (i.e., !target.HasApp()), you return fmt.Errorf("%w: app-scoped
caller must specify app target", rpc.ErrUnauthorized) to prevent app-scoped
callers from reading system logs.
---
Nitpick comments:
In `@cli/commands/server.go`:
- Around line 505-507: After wrapping ctx.Log with the SystemLogHandler (using
observability.NewSystemLogHandler and slog.New), rebind klog so it uses the new
handler: call klog.SetLogger with the updated ctx.Log handler (or an adapter)
immediately after creating systemHandler and resetting ctx.Log. This ensures
klog's logger is set after the ctx.Log wrap so any klog emissions go through
SystemLogHandler; update the code around systemHandler/ctx.Log (and the existing
klog.SetLogger usage) to perform the rebind right after slog.New(systemHandler).
In `@observability/batch_log_writer_test.go`:
- Line 41: Replace the fixed time.Sleep(500 * time.Millisecond) calls in
observability/batch_log_writer_test.go (both occurrences) with a deadline-based
wait: poll the test condition in a short interval until a timeout is reached
(e.g., using time.After for the deadline and a ticker or loop that checks the
expected state), failing the test if the condition isn't met before the timeout;
locate the sleeps by searching for time.Sleep in the test file and replace them
with a loop that repeatedly checks the same condition the test expects (for
example the batch flush/ buffer length/received logs) and returns early when
satisfied.
In `@observability/system_log_handler.go`:
- Around line 20-21: The struct field groups (populated via WithGroup) is
tracked but never applied to emitted attribute keys; update the log emission
path (the Handle method that constructs/writes attributes) to prefix attribute
keys with the current group context (e.g., join s.groups with a separator and
prepend to each attribute key) so grouped attributes retain their group
namespace, or if grouping is not required remove the groups field and WithGroup
implementation to avoid misleading state; target symbols: the struct holding
groups, the WithGroup method, and the Handle function where attributes are
serialized/emitted (also adjust related code regions around the noted blocks at
lines ~47-56 and ~86-92 that build attribute keys).
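The group-prefixing option described above boils down to joining the accumulated `WithGroup` names into a key prefix. A minimal sketch of just that piece (the `groupKey` helper is hypothetical; slog's own JSONHandler nests objects instead of flattening keys):

```go
package main

import (
	"fmt"
	"strings"
)

// groupKey prefixes an attribute key with the current group path so that
// attributes logged under WithGroup keep their namespace when flattened
// into a single-level record for the log store.
func groupKey(groups []string, key string) string {
	if len(groups) == 0 {
		return key
	}
	return strings.Join(groups, ".") + "." + key
}

func main() {
	fmt.Println(groupKey(nil, "user"))
	fmt.Println(groupKey([]string{"req", "auth"}, "user"))
}
```

The handler's `Handle` method would call this for each attribute before serializing, using the `groups` slice it already tracks.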
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d4567b49-3691-4586-8a5c-3a14277aabba
📒 Files selected for processing (17)
- api/app/app_v1alpha/rpc.gen.go
- api/app/rpc.yml
- cli/commands/commands.go
- cli/commands/logs.go
- cli/commands/logs_doc.go
- cli/commands/logs_test.go
- cli/commands/server.go
- components/victorialogs/victorialogs.go
- components/victoriametrics/victoriametrics.go
- observability/batch_log_writer.go
- observability/batch_log_writer_test.go
- observability/logs.go
- observability/system_log_handler.go
- observability/system_log_handler_test.go
- pkg/containerdx/opts.go
- servers/logs/logs.go
- servers/logs/logs_test.go
The default StreamLogChunks path now requests desc-sorted results with a limit, then reverses client-side. The mock servers were returning entries in asc order regardless of the query, so the reversal produced the wrong ordering. Teach the mocks to reverse entries when the query contains "desc", matching real VictoriaLogs behavior.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@servers/logs/logs_test.go`:
- Around line 34-43: The test's mock reversal uses strings.Contains(query,
"desc") which can falsely match substrings like "description"; update the
detection to a token-based check instead (e.g., parse the query string or use a
word-boundary regex like `\bdesc\b`) when inspecting the value from
r.URL.Query().Get("query") so only an explicit "desc" token triggers reversal;
change the same check used elsewhere in this file (the other occurrence noted)
to the same token-aware approach and keep the rest of the mock logic (ordered
creation/reversal using mockLogEntry) unchanged.
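The token-based check the reviewer suggests can be done with a word-boundary regex, so a query mentioning a field like "description" does not falsely trigger the newest-first reversal. A sketch (the function name is illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// descToken matches "desc" only as a standalone token, so substrings
// such as "description" do not count.
var descToken = regexp.MustCompile(`\bdesc\b`)

func wantsDescOrder(query string) bool {
	return descToken.MatchString(query)
}

func main() {
	fmt.Println(wantsDescOrder(`entity:"app/web" | sort by (_time) desc | limit 100`))
	fmt.Println(wantsDescOrder(`msg:"updated description"`))
}
```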
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d4b398d3-9813-41aa-85fb-699a550ea7f3
📒 Files selected for processing (1)
servers/logs/logs_test.go
evanphx
left a comment
Looks great! 2 minor things we could cleanup.
```go
// LogsApp shows application logs. This is the default subcommand for `miren logs`.
func LogsApp(ctx *Context, opts struct {
```
Should we use AppCentric here because it's now, again, app centric?
cli/commands/logs.go
Outdated
```go
// dispatchLogs handles protocol negotiation and dispatches to the appropriate
// log streaming method based on server capabilities.
func dispatchLogs(ctx *Context, cl *rpc.NetworkClient, app, sandbox string, last *time.Duration, follow bool, rawFilter, combinedFilter string) error {
```
I'm gonna pull a card I so rarely pull and say: this list of arguments should be a struct instead.
Create docs/docs/logs.md covering all miren logs subcommands (app, sandbox, build, system), grep filtering syntax, time ranges, following, and output format. Update getting-started.md and troubleshooting.md to use the new subcommand syntax instead of the old flag-based style, and add cross-links to the new page from observability.md. Also add docs build instructions to CLAUDE.md.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/docs/logs.md`:
- Around line 129-130: Change the heading/text "Follow all logs" to accurately
reflect behavior of the command: update the line describing miren logs -f to say
it follows default app logs (not all sources), e.g., "Follow app logs (default
sources)" or "Follow default app logs", and ensure the example text clarifies
that -f follows the default log sources rather than every log source; update the
phrase near the command example referencing miren logs -f accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ddadddca-f0c1-407f-9a95-f9341ad33f0d
📒 Files selected for processing (6)
- CLAUDE.md
- docs/docs/getting-started.md
- docs/docs/logs.md
- docs/docs/observability.md
- docs/docs/troubleshooting.md
- docs/sidebars.ts
✅ Files skipped from review due to trivial changes (1)
- CLAUDE.md
- Use AppCentric for LogsApp/LogsBuild instead of manual App/Dir fields
- Extract dispatchLogs parameters into a logDispatchArgs struct
- Make BatchLogWriter.Close() idempotent with sync.Once
- Always drain HTTP response bodies for connection reuse
- Use context.WithTimeout in WaitForVictoriaLogs instead of time.After
- Upsert RLIMIT_NOFILE to avoid duplicate OCI spec entries
- Block app-scoped callers from reading system logs
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@observability/batch_log_writer.go`:
- Around line 55-64: The merge of entry attributes into logData allows reserved
NDJSON fields to be overwritten; update the attribute-merge in
batch_log_writer.go (where logData is built and le.Attributes are iterated) to
ignore or rename any attribute keys that match the reserved set
{"_msg","_time","entity","stream","trace_id"} so they cannot overwrite canonical
fields — e.g., when looping over le.Attributes, check the key against that
reserved set and skip adding it (or prefix it like "attr_") and optionally emit
a debug/warn via the existing logger if a reserved key was encountered.
In `@observability/system_log_handler.go`:
- Around line 45-55: The code sets attrs["source"]="system" but then blindly
merges handler attrs (h.attrs) and record.Attrs which can overwrite it; update
the merge logic in the system log handler (the loop over h.attrs and the
record.Attrs callback) to skip writing the "source" key if it already exists
(i.e., if a.Key == "source" or attrs["source"] is present) so that
attrs["source"] remains immutable for system-ingested records.
In `@servers/logs/logs.go`:
- Around line 182-209: The resolveLogTarget function currently uses precedence
instead of enforcing exclusivity among target kinds; update resolveLogTarget to
require exactly one of system, sandbox, or app be specified (i.e. count how many
of target.HasSystem() && target.System(), target.HasSandbox() &&
target.Sandbox() != "", and target.HasApp() && target.App() != "" are true) and
return an error if the count is not exactly 1, then proceed with the existing
branching logic (System -> set EntityID to observability.SystemLogEntityID,
Sandbox -> set SandboxID, App -> lookup via s.EC.Get and set EntityID) so
StreamLogs/StreamLogChunks cannot send mixed targets and bypass auth.
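The reserved-field guard from the first comment above is a small, self-contained pattern. A sketch (the merge shape and `mergeAttrs` name are illustrative; the reserved field names follow the review comment): attribute keys colliding with canonical NDJSON fields are skipped so they can never clobber `_msg`, `_time`, and friends.

```go
package main

import "fmt"

// reservedFields are canonical NDJSON fields that user-supplied
// attributes must never overwrite.
var reservedFields = map[string]bool{
	"_msg": true, "_time": true, "entity": true, "stream": true, "trace_id": true,
}

// mergeAttrs copies attributes into the outgoing record, skipping any key
// that would shadow a reserved field. (An alternative is renaming, e.g.
// prefixing with "attr_".)
func mergeAttrs(logData map[string]string, attrs map[string]string) {
	for k, v := range attrs {
		if reservedFields[k] {
			continue // never let an attribute clobber a canonical field
		}
		logData[k] = v
	}
}

func main() {
	logData := map[string]string{"_msg": "server started", "entity": "system/miren-server"}
	mergeAttrs(logData, map[string]string{"_msg": "spoofed", "port": "8080"})
	fmt.Println(logData["_msg"], logData["port"])
}
```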
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f7cbee4d-764c-4459-a9b5-e87da3ca94c3
📒 Files selected for processing (5)
cli/commands/logs.goobservability/batch_log_writer.goobservability/system_log_handler.gopkg/containerdx/opts.goservers/logs/logs.go
Prevents auth bypass where a request could set both app and system=true, pass app validation, then resolve to system logs via precedence.
b55e86e to 2648df7
`miren logs` had grown into a single command trying to do too many things — app logs, sandbox logs, build logs — with a pile of mutually-exclusive flags and validation logic to keep them from stepping on each other. Adding system log querying on top of that would've made it worse, so this felt like the right time to break things up.

The first commit splits `miren logs` into proper subcommands: `miren logs app`, `miren logs sandbox`, `miren logs build`, and the new `miren logs system`. Each subcommand gets its own opts struct with only the flags that make sense for it. Bare `miren logs` still works as a shorthand for `miren logs app`, so existing muscle memory isn't disrupted. While in there, the default output window was changed from "since midnight" to "last 100 lines", a much more useful default for the common "what just happened" workflow.

The system log ingestion works by teeing server logs through a new `SystemLogHandler` (an `slog.Handler` that writes to VictoriaLogs under a well-known entity ID). This means all server logs are queryable through the same interface as app logs — `miren logs system --last 5m --follow` just works.

The second commit fixes a few bugs found during review — a startup race where writes could silently fail before VictoriaLogs was ready, stale filter validation, and a test assertion that needed updating.

The third commit addresses a crash we hit during manual testing: VictoriaLogs would die with `FATAL: too many open files` during small-parts merge. The root cause was the container inheriting the system default 1024 fd limit, which is far too low for a storage engine doing compaction. We raise RLIMIT_NOFILE to 65536 via a new `WithRlimitNOFILE` helper in `pkg/containerdx` (applied to both VictoriaLogs and VictoriaMetrics, since they share the same architecture). As defense in depth, we also wrap the system log path in a `BatchLogWriter` that buffers entries and flushes them as a single NDJSON POST every 250ms or 50 entries, cutting down the number of small parts VictoriaLogs has to manage.