Skip to content

Restructure miren logs into subcommands and add system log ingestion#645

Merged
phinze merged 7 commits intomainfrom
phinze/mir-750-restructure-miren-logs-into-subcommands-and-add-system-log
Mar 5, 2026
Merged

Restructure miren logs into subcommands and add system log ingestion#645
phinze merged 7 commits intomainfrom
phinze/mir-750-restructure-miren-logs-into-subcommands-and-add-system-log

Conversation

@phinze
Copy link
Copy Markdown
Contributor

@phinze phinze commented Mar 4, 2026

miren logs had grown into a single command trying to do too many things — app logs, sandbox logs, build logs — with a pile of mutually-exclusive flags and validation logic to keep them from stepping on each other. Adding system log querying on top of that would've made it worse, so this felt like the right time to break things up.

The first commit splits miren logs into proper subcommands: miren logs app, miren logs sandbox, miren logs build, and the new miren logs system. Each subcommand gets its own opts struct with only the flags that make sense for it. Bare miren logs still works as a shorthand for miren logs app so existing muscle memory isn't disrupted. While in there, the default output window was changed from "since midnight" to "last 100 lines" which is a much more useful default for the common "what just happened" workflow.

The system log ingestion works by teeing server logs through a new SystemLogHandler (an slog.Handler that writes to VictoriaLogs under a well-known entity ID). This means all server logs are queryable through the same interface as app logs — miren logs system --last 5m --follow just works.

The second commit fixes a few bugs found during review — a startup race where writes could silently fail before VictoriaLogs was ready, stale filter validation, and a test assertion that needed updating.

The third commit addresses a crash we hit during manual testing: VictoriaLogs would die with FATAL: too many open files during small-parts merge. The root cause was the container inheriting the system default 1024 fd limit, which is way too low for a storage engine doing compaction. We raise RLIMIT_NOFILE to 65536 via a new WithRlimitNOFILE helper in pkg/containerdx (applied to both VictoriaLogs and VictoriaMetrics since they share the same architecture). As defense in depth, we also wrap the system log path in a BatchLogWriter that buffers entries and flushes as a single NDJSON POST every 250ms or 50 entries, cutting down the number of small parts VictoriaLogs has to manage.

phinze added 3 commits March 4, 2026 17:10
Split the monolithic `miren logs` command into proper subcommands:
- `miren logs app` — application logs (also the default for bare `miren logs`)
- `miren logs sandbox <id>` — sandbox logs
- `miren logs build <version>` — build logs
- `miren logs system [component]` — system/server logs (new)

Each subcommand gets its own opts struct with only relevant flags,
eliminating the conflicting-flag validation that was growing unwieldy.

Change the default output window from "since midnight today" to "last
100 lines" by wiring WithLimit into executeStreamQuery with a
desc-sort-then-reverse strategy. The server applies this limit when no
explicit --last is provided and --follow is not set.

Add system log ingestion via a new slog.Handler (SystemLogHandler) that
tees server logs to VictoriaLogs under entity "system/miren-server".
This is wired into server startup after the log writer is created,
making all server logs queryable through `miren logs system`.

Add `system` bool field to the LogTarget RPC type and handle it in the
log server's StreamLogChunks/StreamLogs methods via a new
resolveLogTarget helper that consolidates app/sandbox/system target
resolution.
…cture

- Fix test assertion to match updated error message including "system" target
- Wait for VictoriaLogs readiness before creating SystemLogHandler to avoid
  startup race where early writes silently fail with 30s HTTP timeouts
- Restore --service/--build old-server validation in dispatchLogs so these
  filters error clearly instead of being silently dropped
- Change legacy default from "since midnight" to "last 1 hour" for a more
  reasonable bounded window (legacy protocol can't do limit=100)
VictoriaLogs crashed with "too many open files" during small-parts merge
because the container inherited the default 1024 fd limit. The system log
handler also wrote one HTTP POST per slog record, creating excessive small
parts under load.

Two fixes:
- Set RLIMIT_NOFILE to 65536 on VictoriaLogs and VictoriaMetrics containers
  via a new WithRlimitNOFILE helper in pkg/containerdx
- Add BatchLogWriter that buffers system log entries and flushes as a single
  NDJSON POST every 250ms or 50 entries, whichever comes first
@phinze phinze requested a review from a team as a code owner March 4, 2026 23:10
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an optional boolean system field to LogTarget with Has/System/Set accessors and updates RPC schema. Reworks the CLI logs command into subcommands (app, sandbox, build, system) with centralized dispatch and new filter builders; documentation and tests updated. Adds BatchLogWriter for batched NDJSON POSTs and a SystemLogHandler that tees server logs to VictoriaLogs plus WaitForVictoriaLogs. Log streaming supports reversed ordering for limited queries and chunked streaming with filtering. Server code coordinates with VictoriaLogs via resolveLogTarget and applies the batch writer. Adds WithRlimitNOFILE container option.


Comment @coderabbitai help to get the list of available commands and usage tips.

@phinze phinze changed the title Restructure miren logs into subcommands and add system log ingestion Restructure miren logs into subcommands and add system log ingestion Mar 4, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (3)
cli/commands/server.go (1)

505-507: Add klog rebinding after ctx.Log wrap for consistency and to prevent future regressions.

Currently, klog.SetLogger is configured at line 475 before the ctx.Log wrap, which means if klog is triggered after line 506, it would bypass the SystemLogHandler. While klog is not actively used in this file currently, k8s libraries that depend on klog are present as transitive dependencies. Adding a rebind after wrapping ensures all logging paths flow through SystemLogHandler consistently.

Suggested fix
 systemHandler := observability.NewSystemLogHandler(ctx.Log.Handler(), batchWriter)
 ctx.Log = slog.New(systemHandler)
+// Rebind klog so any k8s libraries used in dependencies also flow through SystemLogHandler.
+klog.SetLogger(logr.FromSlogHandler(ctx.Log.With("module", "global").Handler()))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cli/commands/server.go` around lines 505 - 507, After wrapping ctx.Log with
the SystemLogHandler (using observability.NewSystemLogHandler and slog.New),
rebind klog so it uses the new handler: call klog.SetLogger with the updated
ctx.Log handler (or an adapter) immediately after creating systemHandler and
resetting ctx.Log. This ensures klog's logger is set after the ctx.Log wrap so
any klog emissions go through SystemLogHandler; update the code around
systemHandler/ctx.Log (and the existing klog.SetLogger usage) to perform the
rebind right after slog.New(systemHandler).
observability/system_log_handler.go (1)

20-21: WithGroup state is tracked but not applied to emitted attribute keys.

Grouped attrs currently lose group context on the VictoriaLogs write path. Either apply group prefixing in Handle or remove groups to avoid misleading behavior.

Also applies to: 47-56, 86-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@observability/system_log_handler.go` around lines 20 - 21, The struct field
groups (populated via WithGroup) is tracked but never applied to emitted
attribute keys; update the log emission path (the Handle method that
constructs/writes attributes) to prefix attribute keys with the current group
context (e.g., join s.groups with a separator and prepend to each attribute key)
so grouped attributes retain their group namespace, or if grouping is not
required remove the groups field and WithGroup implementation to avoid
misleading state; target symbols: the struct holding groups, the WithGroup
method, and the Handle function where attributes are serialized/emitted (also
adjust related code regions around the noted blocks at lines ~47-56 and ~86-92
that build attribute keys).
observability/batch_log_writer_test.go (1)

41-41: Prefer deadline-based waiting over fixed sleeps in async tests.

Line 41 and Line 99 can flake in slow CI. Poll until condition or timeout instead of sleeping for a fixed duration.

Also applies to: 99-99

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@observability/batch_log_writer_test.go` at line 41, Replace the fixed
time.Sleep(500 * time.Millisecond) calls in
observability/batch_log_writer_test.go (both occurrences) with a deadline-based
wait: poll the test condition in a short interval until a timeout is reached
(e.g., using time.After for the deadline and a ticker or loop that checks the
expected state), failing the test if the condition isn't met before the timeout;
locate the sleeps by searching for time.Sleep in the test file and replace them
with a loop that repeatedly checks the same condition the test expects (for
example the batch flush/ buffer length/received logs) and returns early when
satisfied.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@observability/batch_log_writer.go`:
- Around line 89-92: The Close method on BatchLogWriter can panic if called
multiple times because closing the done channel twice is illegal; make Close
idempotent by introducing and using a sync.Once (e.g., a field like closeOnce)
or an equivalent guard so that the block that closes b.done and calls
b.wg.Wait() is executed only once; update the BatchLogWriter struct to add the
guard (closeOnce) and change the Close method to call closeOnce.Do(func() {
close(b.done); b.wg.Wait() }) to prevent double-close panics.
- Around line 131-134: Always drain the HTTP response body regardless of status
to enable connection reuse: call io.Copy(io.Discard, resp.Body) unconditionally
before closing the body (i.e., remove the conditional around io.Copy and ensure
resp.Body.Close() still runs afterwards or call Close after the unconditional
drain). Update the code that checks resp.StatusCode (references to resp,
resp.Body, http.StatusOK, http.StatusNoContent) to remove selective draining so
every response body is fully read before closing.

In `@observability/logs.go`:
- Around line 374-387: In parseLogStreamReversed, stop scanning/parsing once the
request context is canceled by checking ctx.Done during the loop (e.g., before
parsing each line and/or before appending to entries) and returning ctx.Err so
we don't continue buffering; specifically update the loop that uses scanner,
line, and entries in the LogReader.parseLogStreamReversed function to select on
ctx.Done (or call ctx.Err()) and break/return immediately when canceled,
ensuring no further parseLogLine calls or entries = append operations occur
after cancellation and that any caller-visible error is propagated.

In `@observability/system_log_handler.go`:
- Around line 98-113: The WaitForVictoriaLogs loop can exceed the intended
timeout because it passes the outer ctx directly to reader.Read, allowing that
call to block up to the client's internal timeout; update WaitForVictoriaLogs to
derive a timeout-scoped context for each Read (or one that covers the remaining
wait duration) so reader.Read uses a context bounded by the overall timeout,
e.g., compute a per-iteration/remaining deadline from timeout and call
reader.Read with that derived context instead of the original ctx; touch the
WaitForVictoriaLogs function and the call site using NewLogReader/reader.Read to
ensure the read is canceled when the wait deadline elapses.

In `@pkg/containerdx/opts.go`:
- Around line 656-666: WithRlimitNOFILE currently always appends a new
runtimespec.POSIXRlimit which can produce duplicate RLIMIT_NOFILE entries;
change it to upsert instead: in the WithRlimitNOFILE function inspect
s.Process.Rlimits (ensure s.Process != nil), loop to find an existing entry with
Type == "RLIMIT_NOFILE" and if found set its Soft and Hard to n and return nil,
otherwise append the new runtimespec.POSIXRlimit; this prevents duplicate Type
entries while preserving existing rlimits.

In `@servers/logs/logs.go`:
- Around line 220-221: The current check in StreamLogChunks uses
!target.HasSystem() which allows app-scoped callers to access system targets;
update the authorization to block non-app targets for app-scoped callers by
checking target.HasApp() instead. Specifically, in the conditional that uses
rpc.BoundApp(ctx) and rpc.ErrUnauthorized, replace or augment the
!target.HasSystem() check so that if rpc.BoundApp(ctx) != "" and the target does
not have an app (i.e., !target.HasApp()), you return fmt.Errorf("%w: app-scoped
caller must specify app target", rpc.ErrUnauthorized) to prevent app-scoped
callers from reading system logs.

---

Nitpick comments:
In `@cli/commands/server.go`:
- Around line 505-507: After wrapping ctx.Log with the SystemLogHandler (using
observability.NewSystemLogHandler and slog.New), rebind klog so it uses the new
handler: call klog.SetLogger with the updated ctx.Log handler (or an adapter)
immediately after creating systemHandler and resetting ctx.Log. This ensures
klog's logger is set after the ctx.Log wrap so any klog emissions go through
SystemLogHandler; update the code around systemHandler/ctx.Log (and the existing
klog.SetLogger usage) to perform the rebind right after slog.New(systemHandler).

In `@observability/batch_log_writer_test.go`:
- Line 41: Replace the fixed time.Sleep(500 * time.Millisecond) calls in
observability/batch_log_writer_test.go (both occurrences) with a deadline-based
wait: poll the test condition in a short interval until a timeout is reached
(e.g., using time.After for the deadline and a ticker or loop that checks the
expected state), failing the test if the condition isn't met before the timeout;
locate the sleeps by searching for time.Sleep in the test file and replace them
with a loop that repeatedly checks the same condition the test expects (for
example the batch flush/ buffer length/received logs) and returns early when
satisfied.

In `@observability/system_log_handler.go`:
- Around line 20-21: The struct field groups (populated via WithGroup) is
tracked but never applied to emitted attribute keys; update the log emission
path (the Handle method that constructs/writes attributes) to prefix attribute
keys with the current group context (e.g., join s.groups with a separator and
prepend to each attribute key) so grouped attributes retain their group
namespace, or if grouping is not required remove the groups field and WithGroup
implementation to avoid misleading state; target symbols: the struct holding
groups, the WithGroup method, and the Handle function where attributes are
serialized/emitted (also adjust related code regions around the noted blocks at
lines ~47-56 and ~86-92 that build attribute keys).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4567b49-3691-4586-8a5c-3a14277aabba

📥 Commits

Reviewing files that changed from the base of the PR and between 11190b9 and 58362ce.

📒 Files selected for processing (17)
  • api/app/app_v1alpha/rpc.gen.go
  • api/app/rpc.yml
  • cli/commands/commands.go
  • cli/commands/logs.go
  • cli/commands/logs_doc.go
  • cli/commands/logs_test.go
  • cli/commands/server.go
  • components/victorialogs/victorialogs.go
  • components/victoriametrics/victoriametrics.go
  • observability/batch_log_writer.go
  • observability/batch_log_writer_test.go
  • observability/logs.go
  • observability/system_log_handler.go
  • observability/system_log_handler_test.go
  • pkg/containerdx/opts.go
  • servers/logs/logs.go
  • servers/logs/logs_test.go

The default StreamLogChunks path now requests desc-sorted results
with a limit, then reverses client-side. The mock servers were
returning entries in asc order regardless of the query, so the
reversal produced the wrong ordering. Teach the mocks to reverse
entries when the query contains "desc", matching real VictoriaLogs
behavior.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@servers/logs/logs_test.go`:
- Around line 34-43: The test's mock reversal uses strings.Contains(query,
"desc") which can falsely match substrings like "description"; update the
detection to a token-based check instead (e.g., parse the query string or use a
word-boundary regex like `\bdesc\b`) when inspecting the value from
r.URL.Query().Get("query") so only an explicit "desc" token triggers reversal;
change the same check used elsewhere in this file (the other occurrence noted)
to the same token-aware approach and keep the rest of the mock logic (ordered
creation/reversal using mockLogEntry) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4b398d3-9813-41aa-85fb-699a550ea7f3

📥 Commits

Reviewing files that changed from the base of the PR and between 58362ce and 9a5aec8.

📒 Files selected for processing (1)
  • servers/logs/logs_test.go

Copy link
Copy Markdown
Contributor

@evanphx evanphx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 2 minor things we could cleanup.

}

// LogsApp shows application logs. This is the default subcommand for `miren logs`.
func LogsApp(ctx *Context, opts struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use AppCentric here because it's now, again, app centric?


// dispatchLogs handles protocol negotiation and dispatches to the appropriate
// log streaming method based on server capabilities.
func dispatchLogs(ctx *Context, cl *rpc.NetworkClient, app, sandbox string, last *time.Duration, follow bool, rawFilter, combinedFilter string) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna pull a card I so rarely pull and say: this list of arguments should be a struct instead.

Create docs/docs/logs.md covering all miren logs subcommands (app,
sandbox, build, system), grep filtering syntax, time ranges, following,
and output format. Update getting-started.md and troubleshooting.md to
use the new subcommand syntax instead of the old flag-based style, and
add cross-links to the new page from observability.md. Also add docs
build instructions to CLAUDE.md.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/docs/logs.md`:
- Around line 129-130: Change the heading/text "Follow all logs" to accurately
reflect behavior of the command: update the line describing miren logs -f to say
it follows default app logs (not all sources), e.g., "Follow app logs (default
sources)" or "Follow default app logs", and ensure the example text clarifies
that -f follows the default log sources rather than every log source; update the
phrase near the command example referencing miren logs -f accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ddadddca-f0c1-407f-9a95-f9341ad33f0d

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5aec8 and 5db83c8.

📒 Files selected for processing (6)
  • CLAUDE.md
  • docs/docs/getting-started.md
  • docs/docs/logs.md
  • docs/docs/observability.md
  • docs/docs/troubleshooting.md
  • docs/sidebars.ts
✅ Files skipped from review due to trivial changes (1)
  • CLAUDE.md

- Use AppCentric for LogsApp/LogsBuild instead of manual App/Dir fields
- Extract dispatchLogs parameters into logDispatchArgs struct
- Make BatchLogWriter.Close() idempotent with sync.Once
- Always drain HTTP response bodies for connection reuse
- Use context.WithTimeout in WaitForVictoriaLogs instead of time.After
- Upsert RLIMIT_NOFILE to avoid duplicate OCI spec entries
- Block app-scoped callers from reading system logs
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@observability/batch_log_writer.go`:
- Around line 55-64: The merge of entry attributes into logData allows reserved
NDJSON fields to be overwritten; update the attribute-merge in
batch_log_writer.go (where logData is built and le.Attributes are iterated) to
ignore or rename any attribute keys that match the reserved set
{"_msg","_time","entity","stream","trace_id"} so they cannot overwrite canonical
fields — e.g., when looping over le.Attributes, check the key against that
reserved set and skip adding it (or prefix it like "attr_") and optionally emit
a debug/warn via the existing logger if a reserved key was encountered.

In `@observability/system_log_handler.go`:
- Around line 45-55: The code sets attrs["source"]="system" but then blindly
merges handler attrs (h.attrs) and record.Attrs which can overwrite it; update
the merge logic in the system log handler (the loop over h.attrs and the
record.Attrs callback) to skip writing the "source" key if it already exists
(i.e., if a.Key == "source" or attrs["source"] is present) so that
attrs["source"] remains immutable for system-ingested records.

In `@servers/logs/logs.go`:
- Around line 182-209: The resolveLogTarget function currently uses precedence
instead of enforcing exclusivity among target kinds; update resolveLogTarget to
require exactly one of system, sandbox, or app be specified (i.e. count how many
of target.HasSystem() && target.System(), target.HasSandbox() &&
target.Sandbox() != "", and target.HasApp() && target.App() != "" are true) and
return an error if the count is not exactly 1, then proceed with the existing
branching logic (System -> set EntityID to observability.SystemLogEntityID,
Sandbox -> set SandboxID, App -> lookup via s.EC.Get and set EntityID) so
StreamLogs/StreamLogChunks cannot send mixed targets and bypass auth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7cbee4d-764c-4459-a9b5-e87da3ca94c3

📥 Commits

Reviewing files that changed from the base of the PR and between 5db83c8 and 6e46e54.

📒 Files selected for processing (5)
  • cli/commands/logs.go
  • observability/batch_log_writer.go
  • observability/system_log_handler.go
  • pkg/containerdx/opts.go
  • servers/logs/logs.go

Prevents auth bypass where a request could set both app and system=true,
pass app validation, then resolve to system logs via precedence.
@phinze phinze force-pushed the phinze/mir-750-restructure-miren-logs-into-subcommands-and-add-system-log branch from b55e86e to 2648df7 Compare March 5, 2026 15:42
@phinze phinze merged commit e969ce9 into main Mar 5, 2026
13 checks passed
@phinze phinze deleted the phinze/mir-750-restructure-miren-logs-into-subcommands-and-add-system-log branch March 5, 2026 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants