Skip to content

Check for chart obsoletion on children re-connections#12707

Merged
MrZammler merged 2 commits intonetdata:masterfrom
MrZammler:obsolete-children-charts
May 3, 2022
Merged

Check for chart obsoletion on children re-connections#12707
MrZammler merged 2 commits intonetdata:masterfrom
MrZammler:obsolete-children-charts

Conversation

@MrZammler
Copy link
Copy Markdown
Contributor

Summary

When a child is connected and streams to a parent, the parent has a set of the streamed charts from the child. However, if the child at some point disconnects, and during that disconnection time a chart it has disappear, then on re-connection, the parent will have no information that that specific chart has been gone.

If a chart is gone during the child being connected to the parent, the child will inform the parent about that chart's obsoletion, but not in the above case.

If in addition, there is a raised alert on the gone chart, then the alert will remain on the parent, until the parent restarts. The same alert will also remain on the cloud since there won't be any event after that.

This PR adds a small check on the parent, after a child is connected. It will go through the charts for this host, and if some are not updated after the child's connection, it will set them as OBSOLETE. The check will happen 2 minutes after the child's connection.

We might, or not need to do the same for separate dimensions...

Also the check will run on first child's connect, when it's not really needed. If there is an easy way to check if it is a fresh (for this parent's session) connection, we can skip it. But it shouldn't make much difference.

With this PR, a continue statement when a REMOVED event is created because of chart obsoletion, is removed. This is to allow to mark the rc as not RRDCALC_FLAG_RUNNABLE to prevent it from being calculated.

Test Plan

Tests can be conducted by examining what happens to a raised alert when this situation occurs.

  1. Setup a child -> parent setup.
  2. On the child, have a mount point (e.g. a usb device) with very little space left to trigger an alert on the parent.
  3. Stop the child and remove the usb device.
  4. Start the child again.

Without this PR, the raised alert will remain on the agent.

With this PR, after a while, the alert will be removed from the parent (and a REMOVED event will be transmitted to the cloud). If the chart re-appears then the alert should be raised again normally.

Additional Information
For users: How does this change affect me?

@thiagoftsm
Copy link
Copy Markdown
Contributor

Hello @MrZammler ,

I was reading the description, and I think we really need to obsolete dimensions. It is good to remember that when apps and cgroups are running, the charts will still be alive, but some application can stop running, so we won't have more dimensions for it.

Best regards!

@MrZammler
Copy link
Copy Markdown
Contributor Author

Hello @MrZammler ,

I was reading the description, and I think we really need to obsolete dimensions. It is good to remember that when apps and cgroups are running, the charts will still be alive, but some application can stop running, so we won't have more dimensions for it.

Best regards!

Yes indeed. We need to check. Do you agree to check and do this on another PR? Let's cover the charts obsoletion here, and I'll need to check for dimensions in another one (and investigate since I'm not 100% sure).

rpt->host->hostname);
}
}
rpt->host->trigger_chart_obsoletion_check = now_realtime_sec();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to set this after the parser terminates?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sorry, not sure what you mean? I set it to 0 after the parser terminates... We basically want to trigger the check when the child connects. When the check happens, it will be set to 0 and will not retry, or if the child disconnects.

@thiagoftsm
Copy link
Copy Markdown
Contributor

thiagoftsm commented Apr 20, 2022

Yes indeed. We need to check. Do you agree to check and do this on another PR? Let's cover the charts obsoletion here, and I'll need to check for dimensions in another one (and investigate since I'm not 100% sure).

Yes, I do not have objections for this idea.

Copy link
Copy Markdown
Contributor

@odynik odynik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set-up a parent (dbengine) <-> child (ram) arch. I removed some plugins before the restart and I could get the alarms (REMOVED) on local agent dashboard and on the cloud.

@thiagoftsm thiagoftsm requested a review from stelfrag May 2, 2022 17:15
@MrZammler
Copy link
Copy Markdown
Contributor Author

Thanks!!

@MrZammler MrZammler merged commit 0996abc into netdata:master May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants