Skip to content

YNU-779: feat(nitronode/metrics): saturation + DB latency + label hygiene#717

Merged
philanton merged 4 commits into
mainfrom
feat/extend-metrics
May 1, 2026
Merged

YNU-779: feat(nitronode/metrics): saturation + DB latency + label hygiene#717
philanton merged 4 commits into
mainfrom
feat/extend-metrics

Conversation

@philanton
Copy link
Copy Markdown
Contributor

@philanton philanton commented Apr 30, 2026

Summary

Extends nitronode's emitted metric set and standardizes its label naming so dashboards and alerts can be authored against a coherent surface. Three commits, each independently reviewable:

  1. 2aba0e3d feat(nitronode/metrics): add saturation + DB latency observability — three new emission paths:

    • nitronode_rpc_inflight{method} — gauge incremented at the ObservabilityMiddleware entry and decremented on exit. Saturation signal that pairs with rpc_request_duration_seconds to distinguish "slow handler" from "queued behind contention".
    • nitronode_db_query_duration_seconds{query_kind} — histogram observed via gorm callbacks (Create / Query / Update / Delete / Row / Raw). Captures app-side round-trip including pgbouncer hop.
    • go_sql_* — standard collectors.NewDBStatsCollector registered with db name nitronode. Pool gauges + wait counters (go_sql_wait_duration_seconds_total, go_sql_wait_count_total, etc.) for the underlying *sql.DB.

    New store/database/metrics.go owns the gorm-callback registration and dbstats collector helpers, exposing a tiny QueryDurationObserver interface so the package doesn't depend on nitronode/metrics (which would induce a cycle through pkg/{app,core,rpc}).

  2. 921a00b0 refactor(nitronode/metrics): standardize label naming + scrub appID — coordinated breaking label renames:

    • applicationapplication_id on nitronode_app_sessions_total, nitronode_app_sessions_active, nitronode_app_state_updates_total, nitronode_app_session_update_sig_validations_total (the holdouts; user_states_total / transactions_total already used application_id).
    • statusresult on nitronode_rpc_requests_total and nitronode_rpc_request_duration_seconds; process_resultresult on nitronode_blockchain_events_total. Settles on result ∈ {success, failed} everywhere.
    • IncAppStateUpdate, IncAppSessionUpdateSigValidation, SetAppSessions, SetActiveAppSessions now wrap applicationID with getApplicationIDLabelValue so empty becomes _DEFAULT (matches IncUserState / RecordTransaction).
    • nitronode_rpc_messages_totalnitronode_rpc_messages_emitted_total. The metric counts both the request and the response message of every RPC; the new name and Help spell out the doubled emission so downstream consumers don't compute QPS at 2× the real rate.
  3. 196190ca docs(nitronode/metrics): expand Help texts to encode label semantics — Help string rewrite covering trigger conditions, label semantics with enum value sets, and aggregation guidance for every metric. Notable:

    • user_balance_underfunded / user_balance_releasable now explain the channel-checkpoint semantics: each is the worst-case potential outflow / inflow from node_balance if every underfunded / releasable channel checkpointed simultaneously.
    • transactions_amount_total documents the decimal.Decimal.InexactFloat64 precision loss explicitly.

Verified

  • go build ./nitronode/... clean.
  • go test ./nitronode/... clean (existing test suite — no behaviour-affecting changes; rename-only refactors covered by package-level builds).
  • DB instrumentation does not introduce per-call allocation beyond a single time.Time stashed in the gorm context dict.

Migration

Breaking — every dashboard / alert / recording rule that referenced the old label names or nitronode_rpc_messages_total must update in lockstep with the deploy. The ops-side changes (dashboard YAML, alert rules) follow on feat/metrics-extensions in layer-3/ops.

Deferred

  • D N7 (PodMonitoring metricRelabeling guard against application_id cardinality blow-up) is captured in docs/nitronode-env-metrics.md §7 / §9 in the ops repo — postponed until cardinality becomes a real ingest cost.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • In-flight RPC request tracking added to monitor concurrent requests per RPC method
    • Database query duration metrics and DB pool stats collection registered for deeper DB performance insights
  • Chores

    • Standardized application metric label name to application_id
    • Renamed and clarified RPC metric names and result semantics for improved observability

Three new emission paths plug observability gaps that today's metrics
cannot fill:

- nitronode_rpc_inflight{method} — gauge incremented at the
  ObservabilityMiddleware entry and decremented on exit. Saturation
  signal that complements rpc_request_duration_seconds: pairs let an
  on-call distinguish "slow handler" from "queued behind contention".

- nitronode_db_query_duration_seconds{query_kind} — histogram observed
  via gorm callbacks (Create / Query / Update / Delete / Row / Raw).
  Captures app-side round-trip including pgbouncer hop, so on-call can
  tell whether DB stalls live in the pool, in the network, or in the
  query itself.

- go_sql_* (standard collectors.NewDBStatsCollector under db name
  "nitronode") — pool gauges + wait counters for the underlying
  *sql.DB. Pair with db_query_duration_seconds to separate
  pool-acquire latency from in-DB execution.

Wiring:

- store/database/metrics.go owns the gorm-callback registration and
  the dbstats collector helper, exposing a tiny QueryDurationObserver
  interface so the package doesn't import nitronode/metrics (which
  pulls in pkg/{app,core,rpc} and would induce a cycle).
- runtime.go calls both helpers right after the metric exporters are
  constructed, with prometheus.DefaultRegisterer as the registry.

Non-breaking — purely additive metrics + an additional middleware
hook. Existing dashboards and alerts continue to work.
Cross-cutting hygiene that becomes load-bearing once we author alerts
on top: every alert needs to express success/failure, app-id, and chain
results consistently, and today the labels disagree across metrics.

N1 — `application` → `application_id`
  * nitronode_app_sessions_total
  * nitronode_app_sessions_active
  * nitronode_app_state_updates_total
  * nitronode_app_session_update_sig_validations_total

  user_states_total / transactions_total already used `application_id`;
  the four above were the holdouts. `application_id` is more explicit
  and doesn't collide with the HTTP path namespace.

N2 — settle on `result` for success/failure semantics
  * nitronode_rpc_requests_total: `status` → `result`
  * nitronode_rpc_request_duration_seconds: `status` → `result`
  * nitronode_blockchain_events_total: `process_result` → `result`

  channel_state / app_session_update sig validations and
  blockchain_actions_total already used `result`; standardizing means
  alert templates can use a single label name across the metric set.

N3 — apply the empty-applicationID scrubber consistently
  IncAppStateUpdate, IncAppSessionUpdateSigValidation, SetAppSessions
  and SetActiveAppSessions now wrap applicationID with
  getApplicationIDLabelValue, so an empty value becomes `_DEFAULT`
  instead of an empty-string label (which previously matched
  `=~"."` pattern in queries — surprising). Matches the existing
  IncUserState / RecordTransaction behaviour.

N5 — rename rpc_messages_total → rpc_messages_emitted_total + expand Help

  The metric counts both the request and the response message of every
  RPC, so rate(rpc_messages_total[5m]) reports ~2× QPS. That's a foot-
  gun the current name actively encourages. Renaming makes the doubled
  emission explicit; the Help string spells out the relationship to
  rpc_requests_total for clarity.

Breaking — every dashboard / alert / recording-rule that referenced the
old label names or the old metric name must update in lockstep with the
deploy. The ops repo's monitoring dashboard PR follows.
Every metric Help string was a one-liner that omitted what the labels
mean, what their value sets are, or how to aggregate the metric. That
forced dashboard authors to read exporter.go to write a query.

Expanded to a 3-line format covering trigger condition, label semantics
with enum value sets, and aggregation guidance. Concretely:

- All labelled metrics now spell out their label sets:
  status / result / process_result enums; sig_type enums (default vs
  session_key for channel state, wallet vs session_key for app session
  updates); transition / tx_type enums sourced from core types;
  timespan ∈ {day, week, month}; application_id scrubber default
  ("_DEFAULT") is documented in place.

- nitronode_user_balance_underfunded / _releasable now explain the
  channel-checkpoint semantics: each is a worst-case potential
  outflow / inflow from node_balance if every underfunded / releasable
  channel checkpointed simultaneously. Calls out that node_balance and
  user_balance_total are parallel pools, not signed opposites — the
  framing alert authors need.

- nitronode_transactions_amount_total documents the
  decimal.Decimal.InexactFloat64 precision loss explicitly; the metric
  is for dashboards and alerts only, accounting accuracy must come from
  the database (N6).

- nitronode_rpc_request_duration_seconds points at rpc_inflight as the
  saturation pair (queueing vs. handler-slow distinction).

- nitronode_blockchain_events_total notes a flat / zero per-chain rate
  may indicate a stalled chain connection.

Help-text-only change; no behaviour or label changes.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

Adds RPC in-flight counting, DB query-duration observation, and application label standardization; extends the runtime metrics interface and exporter; registers DB metrics and SQL pool stats during startup; hooks GORM callbacks to measure query durations.

Changes

Cohort / File(s) Summary
RPC Observability Middleware
nitronode/api/rpc_router.go
Increments rpc_inflight{method} before handling RPC and defers decrement to ensure inflight counter is always decremented after c.Next().
Metrics Interface
nitronode/metrics/interface.go
Adds IncRPCInflight(method string), DecRPCInflight(method string), and ObserveDBQueryDuration(queryKind string, duration time.Duration) to RuntimeMetricExporter; noop exporter updated with no-op implementations.
Metrics Implementation & Labels
nitronode/metrics/exporter.go
Renames/standardizes labels from applicationapplication_id; renames rpc_messages_totalrpc_messages_emitted_total; changes RPC result label semantics (statusresult), adds rpc_inflight{method} gauge, db_query_duration_seconds{query_kind} histogram, and corresponding Inc/Dec/Observe methods.
Database Metrics Integration
nitronode/store/database/metrics.go
Adds QueryDurationObserver interface, RegisterMetricsCallbacks to time GORM operations (create/query/update/delete/row/raw) via before/after callbacks, and RegisterDBStatsCollector to expose Go SQL pool stats to Prometheus.
Runtime Initialization
nitronode/runtime.go
Registers GORM metrics callbacks and DB stats collector during InitBackbone; failures are treated as fatal initialization errors.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant RPC_MW as ObservabilityMiddleware
    participant RPC_Handler as RPC Handler
    participant GORM as GORM
    participant SQLDB as SQL DB
    participant RuntimeMetrics as runtimeMetrics/Prometheus

    Client->>RPC_MW: RPC request (method)
    RPC_MW->>RuntimeMetrics: IncRPCInflight(method)
    RPC_MW->>RPC_Handler: forward request
    RPC_Handler->>GORM: execute query (queryKind)
    GORM->>GORM: before-callback (store start time)
    GORM->>SQLDB: run SQL
    SQLDB-->>GORM: result
    GORM->>GORM: after-callback (observe duration)
    GORM->>RuntimeMetrics: ObserveDBQueryDuration(queryKind, duration)
    RPC_Handler-->>RPC_MW: response
    RPC_MW->>RuntimeMetrics: DecRPCInflight(method)
    RPC_MW-->>Client: response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • YNU-768: Add Clearnode Metrics #562: Reworks RPC router/ObservabilityMiddleware and runtime metrics surface; directly related to adding IncRPCInflight/DecRPCInflight and observability integration.

Poem

🐰 I hopped through code both day and night,
Counting flights and timing bites,
Queries measured, labels neat—
Metrics hum with every beat,
A tiny rabbit cheers this sight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main changes: adding saturation metrics (RPC inflight), DB latency observability, and standardizing label naming across the metrics surface.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/extend-metrics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@nksazonov nksazonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some comments

Comment thread nitronode/store/database/metrics.go
Comment thread nitronode/store/database/metrics.go
Comment thread nitronode/store/database/metrics.go Outdated
…empotent

Two review-comment fixes on store/database/metrics.go:

1. The kind→chain switch pre-initialised chain to db.Callback().Query()
   and lacked a "query" case, so any new entry in queryKinds without a
   matching switch arm would silently hook the query processor and
   emit the wrong query_kind label. Replaced the implicit fallthrough
   with an explicit case per kind + a default that errors out. New
   entries now fail loudly instead of corrupting the histogram.

   The chain variable still has to be inferred from a dummy
   db.Callback().Query() init because gorm's processor / callback
   types are unexported — we can't write `var chain *gorm.processor`.

2. chain.Register(name, fn) appends when a callback with that name
   already exists — gorm logs a warning and keeps both, so calling
   RegisterMetricsCallbacks twice on the same gorm.DB (test helpers,
   DI restart, future refactor) would double-fire the after callback
   and observe 2× the actual query duration. Switched both before and
   after registrations to chain.Replace, which is idempotent on the
   callback name and a drop-in for Register here.

No behaviour change on the happy path; both fixes harden against
silently-wrong observations under repeat-init or schema drift.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
nitronode/store/database/metrics.go (1)

54-55: ⚡ Quick win

Drop the redundant Query() pre-initialization.

chain is reassigned in every switch arm, so the pre-switch db.Callback().Query() value is never used and trips SA4006. Initializing the chain only inside the selected branch would avoid the lint hit and make the control flow clearer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nitronode/store/database/metrics.go` around lines 54 - 55, Remove the
redundant pre-initialization of chain from db.Callback().Query() before the
switch: the variable chain is reassigned in every arm of the switch on kind, so
delete the initial assignment and instead declare/initialize chain inside each
branch where it's used (refer to the chain variable and the
db.Callback().Query() call in the switch handling for kind) so the linter SA4006
is avoided and control flow is clearer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@nitronode/store/database/metrics.go`:
- Around line 54-55: Remove the redundant pre-initialization of chain from
db.Callback().Query() before the switch: the variable chain is reassigned in
every arm of the switch on kind, so delete the initial assignment and instead
declare/initialize chain inside each branch where it's used (refer to the chain
variable and the db.Callback().Query() call in the switch handling for kind) so
the linter SA4006 is avoided and control flow is clearer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15d4f00e-12d9-400b-a0b0-f540c9e1ede1

📥 Commits

Reviewing files that changed from the base of the PR and between 196190c and 01f12f1.

📒 Files selected for processing (1)
  • nitronode/store/database/metrics.go

@philanton philanton merged commit 67cbff6 into main May 1, 2026
8 checks passed
@philanton philanton deleted the feat/extend-metrics branch May 1, 2026 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants