Channel mpsc_import_notification_stream #611

Closed
LukeWheeldon opened this issue Jun 15, 2023 · 8 comments · Fixed by #1568

Comments

@LukeWheeldon

I get the messages below on a regular basis. They are much less frequent than before 0.9.42, but they still show up once in a while on most or all of my nodes.

Labels
alertname = UnboundedChannelPersistentlyLarge
chain = polkadot
entity = mpsc_import_notification_stream
instance = server:9615
job = server
network = polkadot
severity = warning
Annotations
message = Channel mpsc_import_notification_stream on node server:9615 contains more than 200 items for more than 5 minutes. Node might be frozen.

Originally posted by @LukeWheeldon in #679

@Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@BulatSaif
Contributor

BulatSaif commented Sep 8, 2023

I took a closer look at the issue, and it is caused by this alerting rule:

- alert: UnboundedChannelPersistentlyLarge
  expr: '(
      (substrate_unbounded_channel_len{action = "send"} -
      ignoring(action) substrate_unbounded_channel_len{action = "received"})
      or on(instance) substrate_unbounded_channel_len{action = "send"}
    ) >= 200'
  for: 5m
  labels:
    severity: warning
  annotations:
    message: 'Channel {{ $labels.entity }} on node {{ $labels.instance }} contains
      more than 200 items for more than 5 minutes. Node might be frozen.'

  1. After checking the historical data, I found that mpsc_import_notification_stream first started to increase on 26.03.2023, right after we upgraded from v0.9.39 to v0.9.40. The alert did not cause issues initially because it took 10 days for the channel to go from 0 to 200.

  2. I also examined the metric substrate_unbounded_channel_len, and it appears that its behavior has changed. Prior to v0.9.40, substrate_unbounded_channel_len had three actions: send|received|dropped, and the dropped action was counted twice, both as dropped and as received.

v0.9.39:  send - received = processing 
v0.9.40:  send - received = dropped + processing

Therefore, we need to update the alert as follows:

v0.9.40: send - (received + dropped) = processing

These changes should help address the issue.
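In PromQL terms, the adjusted comparison could look roughly like the sketch below (an illustration only; it assumes every channel reports all three actions, and the missing-label case is handled later in this thread):

expr: '(
    substrate_unbounded_channel_len{action = "send"} -
    ignoring(action) (
      substrate_unbounded_channel_len{action = "received"} +
      ignoring(action) substrate_unbounded_channel_len{action = "dropped"}
    )
  ) >= 200'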

@altonen
Contributor

altonen commented Sep 8, 2023

@dmitry-markin you worked on the tracing-unbounded alert system, could you take a look at this?

@dmitry-markin
Contributor

dmitry-markin commented Sep 11, 2023

@BulatSaif Thank you for the investigation! I see that I introduced a breaking change in paritytech/substrate#13504. Namely, I stopped counting dropped messages as received, and that broke alerts. There are three options to get it fixed:

  1. Reintroduce the logic that counted dropped messages as received in substrate.
  2. Update the alerting rules in prometheus to correctly take into account dropped messages when calculating the channel size.
  3. Calculate the channel size in substrate and report the resulting number to prometheus.

I would vote for either option 2 or 3, because considering dropped messages to be received seems counterintuitive to me. Also, the alert is likely triggered only after the node is stopped, because the messages in this channel can be dropped only when the node is terminated. Taking this into account, I don't think calculating the channel size in prometheus alerts from sent, received, and dropped is a good idea, as it carries over numbers from previously running node instances.

Option 2 is a little bit simpler, because we don't need to update substrate, but option 3 is more correct IMO.

@altonen What do you think?

@altonen
Contributor

altonen commented Sep 11, 2023

I would go for option 3

@BulatSaif
Contributor

  3. Calculate the channel size in substrate and report the resulting number to prometheus.

It is better to avoid metrics calculation at the application level, as it may cause unnecessary load.

However, there is a catch when implementing the second option. Many of the substrate_unbounded_channel_len metrics lack the dropped label and only have send and received. I assume that the dropped series is not reported until the first message is dropped. Based on the current alert, the same assumption holds for the received label:

Current Alert:
(send - received) or send > 200

If received is null, then (send - received) is also null. Consequently, (null) or send will return send.

To address this, the new alert should be modified as follows:

New Alert:
(send - ((received + dropped) or received)) or send > 200

Here's the final expression:

expr: '(
  (
    substrate_unbounded_channel_len{action = "send"} -
    ignoring(action)(
      (
        substrate_unbounded_channel_len{action = "received"} +
        ignoring(action) substrate_unbounded_channel_len{action = "dropped"}
      ) or on(instance) substrate_unbounded_channel_len{action = "received"}
    )
  ) or on(instance) substrate_unbounded_channel_len{action = "send"}
) >= 200'

@dmitry-markin
Contributor

In this specific case channel size reporting is cheap. It's also more robust, because the dropped messages might not be reported on node shutdown (e.g., if the node shuts down too fast for the metrics to be reported to prometheus).

@dmitry-markin
Contributor

dmitry-markin commented Sep 12, 2023

@BulatSaif Could you reconfigure the alerts to use a new metric substrate_unbounded_channel_size, please, once the changes from PR #1489 are deployed? It contains the queue size (the number of messages sent but not yet received/dropped) labeled just by a channel name.
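A minimal sketch of what the reconfigured rule could look like, assuming the channel name is still exposed under the entity label (the actual rule change landed in #1568):

- alert: UnboundedChannelPersistentlyLarge
  expr: 'substrate_unbounded_channel_size >= 200'
  for: 5m
  labels:
    severity: warning
  annotations:
    message: 'Channel {{ $labels.entity }} on node {{ $labels.instance }} contains
      more than 200 items for more than 5 minutes. Node might be frozen.'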

@BulatSaif
Contributor

@BulatSaif Could you reconfigure the alerts to use a new metric substrate_unbounded_channel_size, please, once the changes from PR #1489 are deployed? It contains the queue size (the number of messages sent but not yet received/dropped) labeled just by a channel name.

I created a PR with the alert fix: #1568

BulatSaif added a commit that referenced this issue Oct 16, 2023
(#1568)

# Description

Follow up for #1489.
Closes #611 

Previously the channel size was calculated inside the alert expression, but
#1489 introduced a new metric that reports the channel size directly.
## Changes:
1. Updated the alert rule to use the new metric.
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024