YNU-779: feat(nitronode/metrics): saturation + DB latency + label hygiene#717
Conversation
Three new emission paths plug observability gaps that today's metrics
cannot fill:
- nitronode_rpc_inflight{method} — gauge incremented at the
ObservabilityMiddleware entry and decremented on exit. Saturation
signal that complements rpc_request_duration_seconds: pairs let an
on-call distinguish "slow handler" from "queued behind contention".
- nitronode_db_query_duration_seconds{query_kind} — histogram observed
via gorm callbacks (Create / Query / Update / Delete / Row / Raw).
Captures app-side round-trip including pgbouncer hop, so on-call can
tell whether DB stalls live in the pool, in the network, or in the
query itself.
- go_sql_* (standard collectors.NewDBStatsCollector under db name
"nitronode") — pool gauges + wait counters for the underlying
*sql.DB. Pair with db_query_duration_seconds to separate
pool-acquire latency from in-DB execution.
Wiring:
- store/database/metrics.go owns the gorm-callback registration and
the dbstats collector helper, exposing a tiny QueryDurationObserver
interface so the package doesn't import nitronode/metrics (which
pulls in pkg/{app,core,rpc} and would induce a cycle).
- runtime.go calls both helpers right after the metric exporters are
constructed, with prometheus.DefaultRegisterer as the registry.
Non-breaking — purely additive metrics + an additional middleware
hook. Existing dashboards and alerts continue to work.
Cross-cutting hygiene that becomes load-bearing once we author alerts on top: every alert needs to express success/failure, app-id, and chain results consistently, and today the labels disagree across metrics. N1 — `application` → `application_id` * nitronode_app_sessions_total * nitronode_app_sessions_active * nitronode_app_state_updates_total * nitronode_app_session_update_sig_validations_total user_states_total / transactions_total already used `application_id`; the four above were the holdouts. `application_id` is more explicit and doesn't collide with the HTTP path namespace. N2 — settle on `result` for success/failure semantics * nitronode_rpc_requests_total: `status` → `result` * nitronode_rpc_request_duration_seconds: `status` → `result` * nitronode_blockchain_events_total: `process_result` → `result` channel_state / app_session_update sig validations and blockchain_actions_total already used `result`; standardizing means alert templates can use a single label name across the metric set. N3 — apply the empty-applicationID scrubber consistently IncAppStateUpdate, IncAppSessionUpdateSigValidation, SetAppSessions and SetActiveAppSessions now wrap applicationID with getApplicationIDLabelValue, so an empty value becomes `_DEFAULT` instead of an empty-string label (which previously matched `=~"."` pattern in queries — surprising). Matches the existing IncUserState / RecordTransaction behaviour. N5 — rename rpc_messages_total → rpc_messages_emitted_total + expand Help The metric counts both the request and the response message of every RPC, so rate(rpc_messages_total[5m]) reports ~2× QPS. That's a foot- gun the current name actively encourages. Renaming makes the doubled emission explicit; the Help string spells out the relationship to rpc_requests_total for clarity. Breaking — every dashboard / alert / recording-rule that referenced the old label names or the old metric name must update in lockstep with the deploy. The ops repo's monitoring dashboard PR follows.
Every metric Help string was a one-liner that omitted what the labels
mean, what their value sets are, or how to aggregate the metric. That
forced dashboard authors to read exporter.go to write a query.
Expanded to a 3-line format covering trigger condition, label semantics
with enum value sets, and aggregation guidance. Concretely:
- All labelled metrics now spell out their label sets:
status / result / process_result enums; sig_type enums (default vs
session_key for channel state, wallet vs session_key for app session
updates); transition / tx_type enums sourced from core types;
timespan ∈ {day, week, month}; application_id scrubber default
("_DEFAULT") is documented in place.
- nitronode_user_balance_underfunded / _releasable now explain the
channel-checkpoint semantics: each is a worst-case potential
outflow / inflow from node_balance if every underfunded / releasable
channel checkpointed simultaneously. Calls out that node_balance and
user_balance_total are parallel pools, not signed opposites — the
framing alert authors need.
- nitronode_transactions_amount_total documents the
decimal.Decimal.InexactFloat64 precision loss explicitly; the metric
is for dashboards and alerts only, accounting accuracy must come from
the database (N6).
- nitronode_rpc_request_duration_seconds points at rpc_inflight as the
saturation pair (queueing vs. handler-slow distinction).
- nitronode_blockchain_events_total notes a flat / zero per-chain rate
may indicate a stalled chain connection.
Help-text-only change; no behaviour or label changes.
📝 WalkthroughWalkthroughAdds RPC in-flight counting, DB query-duration observation, and application label standardization; extends the runtime metrics interface and exporter; registers DB metrics and SQL pool stats during startup; hooks GORM callbacks to measure query durations. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant RPC_MW as ObservabilityMiddleware
participant RPC_Handler as RPC Handler
participant GORM as GORM
participant SQLDB as SQL DB
participant RuntimeMetrics as runtimeMetrics/Prometheus
Client->>RPC_MW: RPC request (method)
RPC_MW->>RuntimeMetrics: IncRPCInflight(method)
RPC_MW->>RPC_Handler: forward request
RPC_Handler->>GORM: execute query (queryKind)
GORM->>GORM: before-callback (store start time)
GORM->>SQLDB: run SQL
SQLDB-->>GORM: result
GORM->>GORM: after-callback (observe duration)
GORM->>RuntimeMetrics: ObserveDBQueryDuration(queryKind, duration)
RPC_Handler-->>RPC_MW: response
RPC_MW->>RuntimeMetrics: DecRPCInflight(method)
RPC_MW-->>Client: response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
nksazonov
left a comment
There was a problem hiding this comment.
LGTM with some comments
…empotent Two review-comment fixes on store/database/metrics.go: 1. The kind→chain switch pre-initialised chain to db.Callback().Query() and lacked a "query" case, so any new entry in queryKinds without a matching switch arm would silently hook the query processor and emit the wrong query_kind label. Replaced the implicit fallthrough with an explicit case per kind + a default that errors out. New entries now fail loudly instead of corrupting the histogram. The chain variable still has to be inferred from a dummy db.Callback().Query() init because gorm's processor / callback types are unexported — we can't write `var chain *gorm.processor`. 2. chain.Register(name, fn) appends when a callback with that name already exists — gorm logs a warning and keeps both, so calling RegisterMetricsCallbacks twice on the same gorm.DB (test helpers, DI restart, future refactor) would double-fire the after callback and observe 2× the actual query duration. Switched both before and after registrations to chain.Replace, which is idempotent on the callback name and a drop-in for Register here. No behaviour change on the happy path; both fixes harden against silently-wrong observations under repeat-init or schema drift.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nitronode/store/database/metrics.go (1)
54-55: ⚡ Quick winDrop the redundant
Query()pre-initialization.
chainis reassigned in every switch arm, so the pre-switchdb.Callback().Query()value is never used and trips SA4006. Initializing the chain only inside the selected branch would avoid the lint hit and make the control flow clearer.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nitronode/store/database/metrics.go` around lines 54 - 55, Remove the redundant pre-initialization of chain from db.Callback().Query() before the switch: the variable chain is reassigned in every arm of the switch on kind, so delete the initial assignment and instead declare/initialize chain inside each branch where it's used (refer to the chain variable and the db.Callback().Query() call in the switch handling for kind) so the linter SA4006 is avoided and control flow is clearer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nitronode/store/database/metrics.go`:
- Around line 54-55: Remove the redundant pre-initialization of chain from
db.Callback().Query() before the switch: the variable chain is reassigned in
every arm of the switch on kind, so delete the initial assignment and instead
declare/initialize chain inside each branch where it's used (refer to the chain
variable and the db.Callback().Query() call in the switch handling for kind) so
the linter SA4006 is avoided and control flow is clearer.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 15d4f00e-12d9-400b-a0b0-f540c9e1ede1
📒 Files selected for processing (1)
nitronode/store/database/metrics.go
Summary
Extends nitronode's emitted metric set and standardizes its label naming so dashboards and alerts can be authored against a coherent surface. Three commits, each independently reviewable:
2aba0e3dfeat(nitronode/metrics): add saturation + DB latency observability — three new emission paths:nitronode_rpc_inflight{method}— gauge incremented at theObservabilityMiddlewareentry and decremented on exit. Saturation signal that pairs withrpc_request_duration_secondsto distinguish "slow handler" from "queued behind contention".nitronode_db_query_duration_seconds{query_kind}— histogram observed via gorm callbacks (Create / Query / Update / Delete / Row / Raw). Captures app-side round-trip including pgbouncer hop.go_sql_*— standardcollectors.NewDBStatsCollectorregistered with db namenitronode. Pool gauges + wait counters (go_sql_wait_duration_seconds_total,go_sql_wait_count_total, etc.) for the underlying*sql.DB.New
store/database/metrics.goowns the gorm-callback registration and dbstats collector helpers, exposing a tinyQueryDurationObserverinterface so the package doesn't depend onnitronode/metrics(which would induce a cycle throughpkg/{app,core,rpc}).921a00b0refactor(nitronode/metrics): standardize label naming + scrub appID — coordinated breaking label renames:application→application_idonnitronode_app_sessions_total,nitronode_app_sessions_active,nitronode_app_state_updates_total,nitronode_app_session_update_sig_validations_total(the holdouts;user_states_total/transactions_totalalready usedapplication_id).status→resultonnitronode_rpc_requests_totalandnitronode_rpc_request_duration_seconds;process_result→resultonnitronode_blockchain_events_total. Settles onresult ∈ {success, failed}everywhere.IncAppStateUpdate,IncAppSessionUpdateSigValidation,SetAppSessions,SetActiveAppSessionsnow wrapapplicationIDwithgetApplicationIDLabelValueso empty becomes_DEFAULT(matchesIncUserState/RecordTransaction).nitronode_rpc_messages_total→nitronode_rpc_messages_emitted_total. The metric counts both the request and the response message of every RPC; the new name and Help spell out the doubled emission so downstream consumers don't compute QPS at 2× the real rate.196190cadocs(nitronode/metrics): expand Help texts to encode label semantics — Help string rewrite covering trigger conditions, label semantics with enum value sets, and aggregation guidance for every metric. Notable:user_balance_underfunded/user_balance_releasablenow explain the channel-checkpoint semantics: each is the worst-case potential outflow / inflow fromnode_balanceif every underfunded / releasable channel checkpointed simultaneously.transactions_amount_totaldocuments thedecimal.Decimal.InexactFloat64precision loss explicitly.Verified
go build ./nitronode/...clean.go test ./nitronode/...clean (existing test suite — no behaviour-affecting changes; rename-only refactors covered by package-level builds).time.Timestashed in the gorm context dict.Migration
Breaking — every dashboard / alert / recording rule that referenced the old label names or
nitronode_rpc_messages_totalmust update in lockstep with the deploy. The ops-side changes (dashboard YAML, alert rules) follow onfeat/metrics-extensionsinlayer-3/ops.Deferred
D N7(PodMonitoringmetricRelabelingguard againstapplication_idcardinality blow-up) is captured indocs/nitronode-env-metrics.md§7 / §9 in the ops repo — postponed until cardinality becomes a real ingest cost.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores