Add a fail reason to pinpoint exactly what went wrong #15866

Merged
stelfrag merged 2 commits into netdata:master from add_fatal_reason on Aug 23, 2023

Collaborator

@stelfrag stelfrag commented Aug 22, 2023

Summary

When the agent fails to initialize, it now sets a fail reason to help with debugging. The new field is submitted via the anonymous statistics (only when statistics are enabled).

For now, this is set only if the metadata database fails to initialize properly.
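As a rough illustration of the flow described in this summary, the sketch below records a reason just before a fatal start-up error and exposes it to the statistics script through the NETDATA_FAIL_REASON environment variable discussed in this conversation. The helper names `set_fail_reason` and `export_fail_reason` are hypothetical and are not taken from the PR's diff; clearing the variable when no reason is set is one assumed way to keep a normal start from carrying a stale value.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() and unsetenv() */
#include <stdlib.h>

/* Hypothetical holder for the fail reason; NULL means nothing failed. */
static const char *fail_reason = NULL;

/* Record why start-up is about to fail (call just before the fatal exit). */
static void set_fail_reason(const char *reason) {
    fail_reason = reason;
}

/*
 * Expose the reason to the statistics script via the environment.
 * When no reason is recorded the variable is removed, so a normal
 * start does not inherit an old value. Returns 0 on success, -1 on error.
 */
static int export_fail_reason(void) {
    if (fail_reason == NULL)
        return unsetenv("NETDATA_FAIL_REASON");
    return setenv("NETDATA_FAIL_REASON", fail_reason, 1);
}
```

Because the reason lives in process memory and in the process environment, it is not stored on disk by this sketch: a restarted agent begins again with `fail_reason == NULL`.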

daemon/anonymous-statistics.sh.in (outdated review thread, resolved)
daemon/analytics.c (outdated review thread, resolved)
@andrewm4894
Contributor

how does NETDATA_FAIL_REASON get populated and when does it get wiped clean?

Just wondering about making sure we don't end up with NETDATA_FAIL_REASON being set, then somehow persisting and still being populated when things are fine.

E.g. say netdata crashes and populates NETDATA_FAIL_REASON, and then as a user I fix it and everything is good. Are we happy enough that NETDATA_FAIL_REASON will be empty and never contain the old values from earlier?

Assuming this won't/can't happen, but I just wanted to double-check, since it's a little funny using an env var like this as opposed to tying it to a specific event of some sort.

Contributor

@thiagoftsm thiagoftsm left a comment

PR is working as expected, LGTM!

@stelfrag
Collaborator Author

how does NETDATA_FAIL_REASON get populated and when does it get wiped clean?

For now, the only place this is populated is when the agent is about to raise a FATAL during start-up, before it sends a START message. Both current cases are failures to open the database files.

A normal START message will have the fail reason set to NULL.

This will allow us to troubleshoot early failures (usually during database init) and see how we can improve.

Reasons can include (but are not limited to):

  • Read-only file system (for whatever reason)
  • Disk full
  • Database file corruption
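To make the first two causes above concrete, here is a minimal sketch of how a failed database open could be turned into a short reason string from `errno`. The function name and the strings are hypothetical, not what the PR reports. Database file corruption is not covered, since it would be reported by the database library rather than through `errno`.

```c
#include <errno.h>

/*
 * Hypothetical helper: map the errno of a failed open to a short
 * reason string. Anything unrecognized falls back to a generic reason.
 */
static const char *fail_reason_from_errno(int err) {
    switch (err) {
        case EROFS:  return "read-only file system";
        case ENOSPC: return "disk full";
        default:     return "database initialization failed";
    }
}
```

A caller would pass `errno` immediately after the failing call, before any other library call can overwrite it.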

Contributor

@Dim-P Dim-P left a comment

I can't check if netdata_fail_reason is received correctly by our analytics DB (maybe @andrewm4894 can check), but other than that, the PR seems to work fine.

Also, much cleaner code now!

@stelfrag stelfrag merged commit 9dec766 into netdata:master Aug 23, 2023
137 checks passed
@stelfrag stelfrag deleted the add_fatal_reason branch August 23, 2023 08:00
@andrewm4894
Contributor

I will update the downstream ETL to pull it out of the events coming in.

@andrewm4894
Contributor

PR to add the fail reason to the agent events ETL: https://github.com/netdata/analytics-bi/pull/2130
