richat-v10.1.0

@github-actions github-actions released this 19 May 15:43
· 1 commit to master since this release
97e2e59
richat: add subscribe handshake observability (#207)

Add three metrics to diagnose 3-5s zero-byte stalls observed on new gRPC
subscribers in EWR/LAX that cause clients to send RST_STREAM before any
data reaches them. No behavioral fix yet — we first want one deploy's
worth of data to tell us which latency actually dominates.

Metrics added:
- grpc_subscribe_filter_parse_seconds (histogram, labeled by
  x_subscription_id): seconds from subscribe2() entry until the
  SubscribeRequest is read off the wire and parsed into a Filter.
- grpc_subscribe_time_to_first_message_seconds (histogram, labeled by
  x_subscription_id): seconds from the filter being applied until the
  worker loop pushes the first data message to the client. This is the
  latency the client actually perceives as "time to first byte of data".
- grpc_subscribe_handshake_abandoned_total (counter, labeled by
  x_subscription_id): incremented when the client's request stream
  ends before a filter is ever set, i.e. the client disconnected
  mid-handshake. Counts the stalls abandoned before a filter is set;
  stalls abandoned after it do not increment this counter.
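
As a rough illustration of where the three measurements come from, here is a
minimal std-only sketch. It is not the code from this commit: the
HandshakeTracker and HandshakeObservations types and their method names are
hypothetical, timestamps are passed in explicitly so the logic is easy to
follow, and the actual histogram/counter registration is left out.

```rust
use std::time::{Duration, Instant};

/// Hypothetical container for what the three metrics would observe
/// for a single subscriber (illustrative, not richat's types).
#[derive(Debug, Default, PartialEq)]
struct HandshakeObservations {
    /// Would feed grpc_subscribe_filter_parse_seconds.
    filter_parse: Option<Duration>,
    /// Would feed grpc_subscribe_time_to_first_message_seconds.
    time_to_first_message: Option<Duration>,
    /// Would increment grpc_subscribe_handshake_abandoned_total.
    abandoned: bool,
}

/// Hypothetical per-subscriber state machine for the handshake.
#[derive(Debug)]
struct HandshakeTracker {
    entered_at: Instant,
    filter_set_at: Option<Instant>,
    obs: HandshakeObservations,
}

impl HandshakeTracker {
    /// `entered_at` stands in for the moment the subscribe handler is entered.
    fn new(entered_at: Instant) -> Self {
        Self {
            entered_at,
            filter_set_at: None,
            obs: HandshakeObservations::default(),
        }
    }

    /// Called whenever a filter is applied. Only the first call
    /// (unset -> set) is recorded; later filter updates are ignored.
    fn on_filter_set(&mut self, now: Instant) {
        if self.filter_set_at.is_none() {
            self.filter_set_at = Some(now);
            self.obs.filter_parse = Some(now.duration_since(self.entered_at));
        }
    }

    /// Called whenever a data message is pushed to the client. Only the
    /// first message after the filter was set is recorded.
    fn on_message_pushed(&mut self, now: Instant) {
        if let (Some(set_at), None) = (self.filter_set_at, self.obs.time_to_first_message) {
            self.obs.time_to_first_message = Some(now.duration_since(set_at));
        }
    }

    /// Called when the client's request stream ends. It counts as an
    /// abandoned handshake only if no filter was ever set.
    fn on_request_stream_end(&mut self) {
        if self.filter_set_at.is_none() {
            self.obs.abandoned = true;
        }
    }
}
```

In a real server the recorded durations would be converted to seconds and
observed on the corresponding histograms, each labeled by x_subscription_id.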

Plus a one-shot warn! log inside the ping task when a client has been
connected for >3s without a filter set. Threshold matches the observed
client timeout window.
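
The one-shot behavior could look roughly like the latch below. This is a
sketch under assumptions, not the code from this commit: the NoFilterWarning
type, its should_warn method, and the NO_FILTER_WARN_AFTER constant are
hypothetical names, and the caller is assumed to invoke it on every ping tick
and emit the warn! line when it returns true.

```rust
use std::time::Duration;

/// Assumed threshold, taken from the ">3s" stated above.
const NO_FILTER_WARN_AFTER: Duration = Duration::from_secs(3);

/// Hypothetical latch, one per connected client.
#[derive(Debug, Default)]
struct NoFilterWarning {
    fired: bool,
}

impl NoFilterWarning {
    /// Returns true at most once per client: on the first tick where the
    /// client has been connected for longer than the threshold and still
    /// has no filter set. Every later tick returns false.
    fn should_warn(&mut self, connected_for: Duration, filter_set: bool) -> bool {
        if self.fired || filter_set || connected_for <= NO_FILTER_WARN_AFTER {
            return false;
        }
        self.fired = true;
        true
    }
}
```

Keeping the latch per client means a long-stalled subscriber produces a single
log line rather than one per ping interval.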

Only the initial unset -> set transition is recorded for both
histograms; subsequent filter updates (commitment change, etc.) are not
part of the subscribe handshake and would skew the tails.

---------

Co-authored-by: Kirill Fomichev <fanatid@ya.ru>