feat: add local telemetry ledger and reporting command#39
Add a production-safe Phase 1 visibility capability with local-only telemetry.
- Add redacted JSONL telemetry ledger with rotation and query/summarize helpers
- Add telemetry toggle in config/schema with CODEX_AUTH_TELEMETRY_ENABLED override
- Add codex auth telemetry command with json/text output and filters
- Instrument CLI lifecycle and plugin high-signal failure/recovery branches
- Update docs and tests for telemetry behavior and configuration
Co-authored-by: Codex <noreply@openai.com>
📝 Walkthrough
adds a local telemetry system that writes product events to
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
key review areas requiring attention:
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/reference/commands.md`:
- Around line 41-57: Add an upgrade note and npm-script rollout pointer for the
changed telemetry behavior by updating the commands reference next to the `codex
auth telemetry` entry (and the Common Flags entries for `--since-hours` and
`--limit`): insert a short “Upgrade / Rollout” note linking to the
release/upgrade doc or CHANGELOG and mention the new npm script operators should
run (for example, add a line like “See upgrade notes: <link> and run npm run
<rollout-script> to migrate telemetry automation”), ensuring the note appears
immediately after the telemetry command description so operators see
migration/automation impact.
In `@lib/codex-manager.ts`:
- Around line 4253-4268: The emitTelemetry helper currently calls
recordTelemetryEvent with a void cast which creates a race when the process
returns/throws (e.g., around cli.command.finish / cli.command.exception); make
emitTelemetry return a Promise by removing the void cast and marking it async so
callers can await it, then update call sites that immediately return or throw
(those near the cli.command.finish / cli.command.exception usage) to await
emitTelemetry(...) before returning/throwing to ensure recordTelemetryEvent
completes; reference functions: emitTelemetry and recordTelemetryEvent and the
cli.command.finish / cli.command.exception call sites.
- Around line 2325-2330: The returned events from queryTelemetryEvents are not
guaranteed time-ordered, so before calling summarizeTelemetryEvents and before
building the "recent events" block you must sort the events array by the event
timestamp field (e.g. event.timestamp or event.ts) in ascending order; update
the code around the variable events (result of queryTelemetryEvents) so it
creates a sortedEvents = events.slice().sort((a,b) => a.timestamp - b.timestamp
|| a.ts - b.ts) and then pass sortedEvents into summarizeTelemetryEvents and
into any recent-events slicing logic (referencing the events variable,
summarizeTelemetryEvents function, and the code that constructs the recent
output) to ensure summary.firstTimestamp/lastTimestamp and recent lists are
correct.
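A minimal sketch of the ordering step described in the comment above. The event shape here is a hypothetical stand-in, not the real `TelemetryEvent` type. Because the type guard in `lib/telemetry.ts` is described as treating `timestamp` as a string, this sketch parses the timestamps before comparing them rather than subtracting the fields directly.

```typescript
// Hypothetical minimal shape; the real TelemetryEvent has more fields.
interface TelemetryEventLike {
  timestamp: string; // assumed to be an ISO-8601 string
  event: string;
}

// Returns a new array sorted oldest-first and leaves the input untouched.
// Unparseable timestamps are treated as 0 so they sort to the front.
function sortEventsByTime<T extends TelemetryEventLike>(events: readonly T[]): T[] {
  const toMs = (value: string): number => {
    const ms = Date.parse(value);
    return Number.isNaN(ms) ? 0 : ms;
  };
  return events.slice().sort((a, b) => toMs(a.timestamp) - toMs(b.timestamp));
}
```

The sorted copy would then be what gets passed to `summarizeTelemetryEvents` and to the recent-events slice, so first/last timestamps come from the same ordering.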
In `@lib/telemetry.ts`:
- Around line 172-180: The runtime type guard isTelemetryEvent currently only
checks that timestamp/source/event/outcome are strings, allowing invalid enum
values to pass; update isTelemetryEvent to also validate that record.source and
record.outcome are one of the allowed enum members (use the same enum or allowed
value arrays used in aggregation logic) so malformed source/outcome strings are
rejected, and adjust downstream parsing to skip invalid lines; then add a vitest
regression in test/telemetry.test.ts that feeds malformed telemetry lines
(invalid source/outcome) and asserts they are ignored and do not affect the
summary counters (the aggregation code referenced around the summary logic
should continue to rely on the stricter guard).
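One way the stricter guard could look, as a sketch only. The member lists for `source` and `outcome` below are assumptions for illustration; the real allowed values live in `lib/telemetry.ts`. `parseTelemetryLines` is a hypothetical helper showing how invalid lines would be skipped.

```typescript
// Assumed enum members, for illustration only.
const TELEMETRY_SOURCES = ["cli", "plugin"] as const;
const TELEMETRY_OUTCOMES = ["success", "failure", "error", "info"] as const;

type TelemetrySource = (typeof TELEMETRY_SOURCES)[number];
type TelemetryOutcome = (typeof TELEMETRY_OUTCOMES)[number];

interface TelemetryEvent {
  timestamp: string;
  source: TelemetrySource;
  event: string;
  outcome: TelemetryOutcome;
}

// True only when value is a string that appears in the allowed list.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

function isTelemetryEvent(value: unknown): value is TelemetryEvent {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.timestamp === "string" &&
    typeof record.event === "string" &&
    isOneOf(TELEMETRY_SOURCES, record.source) &&
    isOneOf(TELEMETRY_OUTCOMES, record.outcome)
  );
}

// Hypothetical helper: parse JSONL text, skipping malformed or invalid lines.
function parseTelemetryLines(text: string): TelemetryEvent[] {
  const events: TelemetryEvent[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (isTelemetryEvent(parsed)) events.push(parsed);
    } catch {
      // Not valid JSON: ignore the line rather than failing the whole query.
    }
  }
  return events;
}
```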
In `@test/codex-manager-cli.test.ts`:
- Around line 338-380: Replace the runtime Date.now() usage with Vitest fake
timers to make the since-hours assertion deterministic: call vi.useFakeTimers()
and vi.setSystemTime(fixedNow) before creating the now variable and the events,
use that fixedNow for calculation of expected sinceMs and for creating event
timestamps, run the test (invoking runCodexMultiAuthCli) and then restore timers
with vi.useRealTimers(); reference the now variable, queryTelemetryEventsMock
mock calls/sinceMs, and runCodexMultiAuthCli to locate where to apply the
fake-timer setup and teardown.
- Around line 403-427: Add two regression test cases to cover the missing
telemetry branches: (1) a test that sets telemetryEnabled=false (or mocks the
telemetry gate used by runCodexMultiAuthCli) and asserts
recordTelemetryEventMock is not called for "cli.command.start" /
"cli.command.finish", and (2) a test that forces the command handler invoked by
runCodexMultiAuthCli(["auth","features"]) to throw (or mock a thrown error) and
asserts recordTelemetryEventMock is called with an objectContaining event
"cli.command.exception" and outcome "error" (including details.command
"features" and the error info); use the same helpers and spies as the existing
test (runCodexMultiAuthCli, recordTelemetryEventMock, logSpy) so the tests
exercise the async emission paths in lib/codex-manager.ts (around
telemetryEnabled and cli.command.exception).
- Around line 24-26: Tests only cover unix-style telemetry log paths via the
getTelemetryLogPathMock; add a Windows-path regression case by making
getTelemetryLogPathMock (or an additional mock used in the same test suite)
return a Windows-style path string (e.g.
"C:\\Users\\you\\logs\\product-telemetry.jsonl") and duplicate the existing
telemetry assertions against that mock to ensure formatting/parsing works on
Windows paths; apply the same addition to the other telemetry test block that
mirrors the assertions so both places use a Windows-path case in addition to the
Unix-like case.
In `@test/index.test.ts`:
- Around line 111-122: The tests enable telemetry and mock recordTelemetryEvent
(recordTelemetryEventMock) but never assert it; add deterministic assertions
that recordTelemetryEventMock was called (and with expected event names/payload
shapes) in the concurrency flow tests and token refresh/error flows referenced
(the tests around the concurrency block and the refresh/error block).
Specifically, inside the test cases that simulate the failure/recovery path
(around the refresh/error flow) and the token refresh race/concurrency tests,
add expectations like
expect(recordTelemetryEventMock).toHaveBeenCalledTimes(...) and
expect(recordTelemetryEventMock).toHaveBeenCalledWith(expect.objectContaining({
event: "<expected-event-name>" })) to lock the telemetry emission behavior; use
the existing mocked recordTelemetryEventMock symbol and keep assertions
deterministic (use toHaveBeenCalledTimes and
toHaveBeenCalledWith/expect.objectContaining) so regression failures surface.
In `@test/telemetry.test.ts`:
- Around line 63-65: The fixture value for the accessToken property in the test
(the object containing email: "user@example.com", accessToken: "...", nested:
{...}) looks like a real API key and triggers secret scanners; update the
accessToken string to a clearly non-secret sentinel (e.g.
"TEST_TOKEN_PLACEHOLDER" or "placeholder-token-do-not-use") in the telemetry
test fixture so the property name accessToken in the test/telemetry.test.ts no
longer resembles a real secret.
- Around line 160-177: Add two deterministic vitest regression tests in
test/telemetry.test.ts: (1) a concurrency test that targets queueAppend
(lib/telemetry.ts:237-262) by invoking recordTelemetryEvent concurrently (e.g.,
Promise.all over many calls) and asserting no data loss/corruption in the main
log file (use getTelemetryLogPath to read contents) to reproduce and prevent
race conditions; (2) a Windows rm-retry test that mocks fs.rm (or
fs.promises.rm) to first throw EPERM/ENOENT errors and then succeed, asserting
the retry branch is exercised and no exception escapes—use vitest mocks/fake
timers to keep the test deterministic and restore mocks after the test.
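To illustrate what the concurrency regression above is meant to protect, here is a sketch of a serial append queue plus a concurrent-write check. This `queueAppend` is a hypothetical stand-in written for this example, not the implementation in `lib/telemetry.ts`; `appendEvent` and `demo` are likewise illustrative names.

```typescript
import { promises as fs } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Tail of the promise chain; each new task starts only after the previous one settles.
let tail: Promise<void> = Promise.resolve();

function queueAppend(task: () => Promise<void>): Promise<void> {
  const run = tail.then(() => task());
  // Swallow rejections on the chain itself so one failed write does not block later ones.
  // The caller still observes the failure through `run`.
  tail = run.catch(() => undefined);
  return run;
}

async function appendEvent(logPath: string, event: object): Promise<void> {
  await queueAppend(() => fs.appendFile(logPath, `${JSON.stringify(event)}\n`, "utf8"));
}

// Fires many appends concurrently, then checks that every line is intact JSON.
async function demo(count: number): Promise<number> {
  const dir = await fs.mkdtemp(join(tmpdir(), "telemetry-sketch-"));
  const logPath = join(dir, "product-telemetry.jsonl");
  await Promise.all(
    Array.from({ length: count }, (_, index) => appendEvent(logPath, { event: "demo", index })),
  );
  const lines = (await fs.readFile(logPath, "utf8")).trim().split("\n");
  lines.forEach((line) => JSON.parse(line)); // throws if any line is corrupted
  await fs.rm(dir, { recursive: true, force: true });
  return lines.length;
}
```

A vitest case along these lines would call the real `recordTelemetryEvent` instead of `appendEvent` and read the file at `getTelemetryLogPath()`.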
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9eb7ffc1-3af4-4cb7-9374-4a4fa0aafd35
📒 Files selected for processing (14)
docs/privacy.md
docs/reference/commands.md
docs/reference/settings.md
index.ts
lib/codex-manager.ts
lib/config.ts
lib/index.ts
lib/schemas.ts
lib/telemetry.ts
test/codex-manager-cli.test.ts
test/index.test.ts
test/plugin-config.test.ts
test/schemas.test.ts
test/telemetry.test.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
lib/**
⚙️ CodeRabbit configuration file
focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.
Files:
lib/config.ts
lib/schemas.ts
lib/index.ts
lib/codex-manager.ts
lib/telemetry.ts
docs/**
⚙️ CodeRabbit configuration file
keep README, SECURITY, and docs consistent with actual CLI flags and workflows. whenever behavior changes, require updated upgrade notes and mention new npm scripts.
Files:
docs/privacy.md
docs/reference/commands.md
docs/reference/settings.md
test/**
⚙️ CodeRabbit configuration file
tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions.
Files:
test/plugin-config.test.ts
test/telemetry.test.ts
test/index.test.ts
test/codex-manager-cli.test.ts
test/schemas.test.ts
🧬 Code graph analysis (6)
test/plugin-config.test.ts (1)
lib/config.ts (1)
getTelemetryEnabled(725-731)
test/telemetry.test.ts (1)
lib/telemetry.ts (6)
getTelemetryConfig(230-232)
configureTelemetry(226-228)
recordTelemetryEvent(238-263)
queryTelemetryEvents(265-285)
getTelemetryLogPath(234-236)
summarizeTelemetryEvents(287-316)
test/codex-manager-cli.test.ts (1)
lib/codex-manager.ts (1)
runCodexMultiAuthCli(4230-4317)
lib/codex-manager.ts (2)
lib/telemetry.ts (5)
queryTelemetryEvents(265-285)
summarizeTelemetryEvents(287-316)
getTelemetryLogPath(234-236)
TelemetryOutcome(14-14)
recordTelemetryEvent(238-263)
lib/config.ts (1)
getTelemetryEnabled(725-731)
index.ts (2)
lib/config.ts (1)
getTelemetryEnabled(725-731)
lib/telemetry.ts (1)
recordTelemetryEvent(238-263)
lib/telemetry.ts (2)
lib/runtime-paths.ts (1)
getCodexLogDir(226-228)
lib/logger.ts (2)
maskEmail(430-430)
getCorrelationId(137-139)
🪛 Gitleaks (8.30.0)
test/telemetry.test.ts
[high] 64-64: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🔇 Additional comments (8)
lib/schemas.ts (1)
35-35: schema change is backward-compatible
`lib/schemas.ts:35` adds `telemetryEnabled` as an optional boolean, so existing config documents continue to parse safely.
lib/config.ts (1)
136-136: telemetry setting resolution is consistent
`lib/config.ts:136` and `lib/config.ts:725-731` correctly implement default-enabled behavior with env override precedence through the existing boolean resolver path.
Also applies to: 725-731
lib/telemetry.ts (1)
106-135: structured redaction flow is solid
`lib/telemetry.ts:106-135` applies recursive sanitization and key-based masking before persistence, aligned with `lib/logger.ts:429` masking usage for email-safe output.
index.ts (2)
57-58: telemetry helper wiring is clean and correctly gated
the helper uses `lib/config.ts:725-731` for enablement and delegates to `lib/telemetry.ts:238-263` in a non-blocking way, which keeps request handling isolated from telemetry failure paths.
Also applies to: 103-103, 1164-1177
1522-1527: event points are high-signal and well-scoped
these emit sites focus on failure/recovery/exhaustion boundaries and align with the local-ledger model in `lib/telemetry.ts:238-285` without adding noisy per-request tracing.
Also applies to: 1718-1723, 1992-1997, 2117-2122, 2286-2293, 2312-2322, 2413-2420, 2462-2468
lib/index.ts (1)
32-32: public export wiring is clean
`lib/index.ts:32` re-exports telemetry APIs cleanly and keeps the module barrel consistent.
test/schemas.test.ts (1)
43-43: good schema coverage update
`test/schemas.test.ts:43` correctly extends the full-config happy path for the new `telemetryEnabled` field.
test/plugin-config.test.ts (1)
751-768: good coverage for telemetry toggle precedence.
this block cleanly validates default/config/env precedence for telemetry enablement and stays deterministic (`test/plugin-config.test.ts:751-768`, `lib/config.ts:724-730`).
Replace sync rotation fs calls with async operations wrapped in bounded retry
for EBUSY/EPERM/ENOTEMPTY to handle Windows lock contention safely.
Also adds a telemetry test that simulates a transient rename lock and
verifies rotation still succeeds.

Co-authored-by: Codex <noreply@openai.com>
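A sketch of what a bounded retry wrapper for these error codes might look like. The signature, attempt count, and backoff values are assumptions made for this example; the real `runFileOperationWithRetry` may differ. The guard checks that `code` is a string before narrowing, instead of relying on `instanceof Error` alone.

```typescript
// Error codes treated as transient, taken from the commit message above.
const RETRYABLE_CODES = new Set(["EBUSY", "EPERM", "ENOTEMPTY"]);

// Narrows to an errno-style error only when `code` is actually a string.
function isErrnoException(error: unknown): error is Error & { code: string } {
  return error instanceof Error && typeof (error as { code?: unknown }).code === "string";
}

// Hypothetical signature; defaults are illustrative, not the project's values.
async function runFileOperationWithRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 10,
): Promise<T> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      const retryable = isErrnoException(error) && RETRYABLE_CODES.has(error.code);
      // Give up on non-transient errors or once the attempt budget is spent.
      if (!retryable || attempt >= maxAttempts) throw error;
      // Linear backoff keeps the total wait bounded.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
}
```

A rotation step would then wrap each filesystem call, for example `await runFileOperationWithRetry(() => fs.rename(from, to))`, so a short-lived lock on Windows is retried while other failures still surface.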
Address outstanding review feedback for telemetry reliability and coverage:
- enforce telemetry source/outcome enum validation while parsing logs
- sort telemetry report events deterministically before summary/recent output
- await CLI telemetry lifecycle writes before command return/throw
- expand CLI/index/telemetry regressions (windows paths, deterministic time, exception paths, concurrent appends)
- document telemetry rollout/automation pointer in command reference
Co-authored-by: Codex <noreply@openai.com>
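The "await CLI telemetry lifecycle writes" item can be sketched as follows. Everything here is a simplified stand-in: `recordTelemetryEvent` is replaced by an in-memory recorder, `runCommand` is a hypothetical wrapper, and the outcome values other than `"error"` are assumptions. Swallowing telemetry failures inside `emitTelemetry` is also an assumption about the desired behavior.

```typescript
type Outcome = "success" | "error"; // "success" is an assumed member

interface LifecycleEvent {
  event: string;
  outcome: Outcome;
  details?: Record<string, unknown>;
}

// Stand-in recorder: completes asynchronously, like a file append would.
const recorded: LifecycleEvent[] = [];
async function recordTelemetryEvent(event: LifecycleEvent): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 1));
  recorded.push(event);
}

// Returns a promise so callers can await the write instead of using `void`.
async function emitTelemetry(event: LifecycleEvent): Promise<void> {
  try {
    await recordTelemetryEvent(event);
  } catch {
    // Assumption: a telemetry failure should never break the command itself.
  }
}

// Hypothetical wrapper showing where the awaits go around a command handler.
async function runCommand(name: string, handler: () => Promise<number>): Promise<number> {
  const details = { command: name };
  await emitTelemetry({ event: "cli.command.start", outcome: "success", details });
  try {
    const exitCode = await handler();
    const outcome: Outcome = exitCode === 0 ? "success" : "error";
    await emitTelemetry({ event: "cli.command.finish", outcome, details });
    return exitCode;
  } catch (error) {
    // Await before rethrowing so the exception event is written before the process can exit.
    await emitTelemetry({ event: "cli.command.exception", outcome: "error", details });
    throw error;
  }
}
```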
Additional Comments (1)
the pr upgraded this is the same class of problem the pr explicitly addressed elsewhere. replace
async function listTelemetryFiles(): Promise<string[]> {
  try {
    const entries = await runFileOperationWithRetry(() =>
      fs.readdir(telemetryConfig.logDir)
    );
    const sortable = entries
      .map((name) => {
        const archiveIndex = parseArchiveSuffix(name);
        if (archiveIndex === null) return null;
        return { name, archiveIndex };
      })
      .filter((value): value is { name: string; archiveIndex: number } => value !== null)
      .sort((left, right) => right.archiveIndex - left.archiveIndex);
    return sortable.map((entry) => join(telemetryConfig.logDir, entry.name));
  } catch {
    return [];
  }
}
Co-authored-by: Codex <noreply@openai.com>
Merged via integration PR to safely combine multiple approved changes.
Summary
Why this increases leverage
This introduces a reusable visibility primitive for product iteration without adding any remote analytics dependency. Teams can quickly answer "what is failing/recovering" and layer future experimentation/usage systems on top of the same local event stream.
Minimal implementation
Files changed
Validation
npm run typecheck
npm run lint
npm test
npm run build
All passed in the isolated worktree branch.
Post-review updates (2026-03-05)
What changed
- Validate `source` and `outcome` enum values before events are admitted into query/summary flows.
- Await CLI lifecycle telemetry writes (`start`/`finish`/`exception`) to avoid dropped telemetry on fast exits.
How to test
npm run test -- test/telemetry.test.ts test/codex-manager-cli.test.ts test/index.test.ts
npm run typecheck
npm run lint:ts
npm run lint:scripts
Risk / rollout notes
note: greptile review for oc-chatgpt-multi-auth. cite files like `lib/foo.ts:123`. confirm regression tests + windows concurrency/token redaction coverage.
Greptile Summary
this pr introduces a local-only structured telemetry subsystem (`lib/telemetry.ts`) backed by a bounded, rotating jsonl log at `~/.codex/multi-auth/logs/product-telemetry.jsonl`, instruments high-signal cli and plugin outcomes, and adds a `codex auth telemetry` reporting command. the overall design is solid: serial append queue, pii redaction, config/env gate, and async+retry for windows filesystem concurrency are all present.
key changes
- `lib/telemetry.ts`: new module with serial `queueAppend`, `sanitizeValue` with email/token masking, async `rotateLogsIfNeeded` with `runFileOperationWithRetry`, and `queryTelemetryEvents` / `summarizeTelemetryEvents` helpers
- `lib/codex-manager.ts`: adds `runTelemetry` command (`--json`, `--since-hours`, `--limit`) and wraps the entire cli dispatch in a try/catch that emits `cli.command.start` / `cli.command.finish` / `cli.command.exception` events
- `index.ts`: instruments plugin request paths: `request.auth_refresh_failed`, `request.network_error`, `request.server_error`, `request.stream_failover_recovered`, `request.accounts_exhausted`
- `lib/config.ts` / `lib/schemas.ts`: `telemetryEnabled` config field + `CODEX_AUTH_TELEMETRY_ENABLED` env override
issues found
- `listTelemetryFiles()` in `lib/telemetry.ts` still uses `readdirSync`, a synchronous call with no windows ebusy/eperm retry, inconsistent with the async+retry pattern applied to all other fs operations in this pr. an antivirus lock silently returns `[]` (no events shown, no error). needs `fs.readdir` wrapped in `runFileOperationWithRetry`, plus a corresponding vitest coverage case
- `isErrnoException` type guard is too broad (`error instanceof Error`); it should also assert `typeof (error as NodeJS.ErrnoException).code === "string"` to be a safe `NodeJS.ErrnoException` guard
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant CLI as codex auth CLI
    participant PM as codex-manager.ts
    participant TM as lib/telemetry.ts
    participant FS as ~/.codex/multi-auth/logs/
    CLI->>PM: runCodexMultiAuthCli(args)
    PM->>TM: recordTelemetryEvent(cli.command.start)
    TM->>TM: sanitizeValue(details)
    TM->>TM: queueAppend(task)
    TM->>FS: ensureLogDir()
    TM->>FS: rotateLogsIfNeeded() [async+retry]
    TM->>FS: appendFile(product-telemetry.jsonl)
    PM->>PM: dispatch command (login/report/telemetry/…)
    alt command == "telemetry"
        PM->>TM: queryTelemetryEvents({sinceMs, limit})
        TM->>FS: readdirSync(logDir) ⚠️ sync
        TM->>FS: readFile(*.jsonl) [async]
        TM-->>PM: TelemetryEvent[]
        PM->>TM: summarizeTelemetryEvents(events)
        TM-->>PM: TelemetrySummary
        PM-->>CLI: JSON or human output
    end
    PM->>TM: recordTelemetryEvent(cli.command.finish / cli.command.exception)
    TM->>FS: appendFile(product-telemetry.jsonl)
    note over TM,FS: plugin path (index.ts) fires void recordTelemetryEvent<br/>for: auth_refresh_failed, network_error, server_error,<br/>stream_failover_recovered, accounts_exhausted
Last reviewed commit: 2ab197f
Context used:
dashboard- What: Every code change must explain how it defends against Windows filesystem concurrency bugs and ... (source)