Add a fail reason to pinpoint exactly what went wrong #15866
a2f9d6d to 933b581
How does NETDATA_FAIL_REASON get populated, and when does it get wiped clean? I just want to make sure we don't end up with NETDATA_FAIL_REASON being set and then somehow persisting and being reported when things are fine. For example, say netdata crashes and populates NETDATA_FAIL_REASON, and then as a user I fix it and everything is good: are we happy enough that NETDATA_FAIL_REASON will be empty and never contain the old values from earlier? I assume this won't/can't happen, but I wanted to double-check, since it's a little funny to use an env var like this as opposed to tying it to a specific event of some sort.
PR is working as expected, LGTM!
For now, the only places where this is populated are when the agent is about to raise a FATAL during startup, before it sends a START message (both cases being when it fails to open the database files). A normal START message will have the fail reason as NULL. This will allow us to troubleshoot early failures (usually during database init) and see how we can improve. Reasons can be (but are not limited to):
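As a rough illustration of the behaviour described above (not the agent's actual code), the following Python sketch models a fail reason that is attached only on the FATAL path and is absent from a normal START event. Only the `NETDATA_FAIL_REASON` name comes from this discussion; `build_event`, `fatal_during_startup`, the payload keys, and the reason string are hypothetical.

```python
from typing import Dict, Optional

FAIL_REASON_ENV = "NETDATA_FAIL_REASON"  # name taken from the discussion


def build_event(action: str, env: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build a hypothetical statistics payload from an environment mapping.

    An unset or empty variable is reported as None (NULL downstream).
    """
    reason = env.get(FAIL_REASON_ENV) or None
    return {"action": action, "fail_reason": reason}


def fatal_during_startup(reason: str, base_env: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Simulate the FATAL path: the reason is set only in a copy of the
    environment handed to the reporter, so the base environment is untouched."""
    env = dict(base_env)
    env[FAIL_REASON_ENV] = reason
    return build_event("FATAL", env)


base_env: Dict[str, str] = {}
crash = fatal_during_startup("hypothetical: cannot open database file", base_env)
start = build_event("START", base_env)  # a later, healthy start
print(crash)
print(start)
```

Under these assumptions, the reason lives only in the environment of the reporting step for the failed run, so a later healthy start has nothing stale to pick up.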
I can't check whether netdata_fail_reason is received correctly by our analytics DB (maybe @andrewm4894 can check), but other than that, the PR seems to work fine.
Also, much cleaner code now!
I will update the downstream ETL to pull it out of the events coming in.
PR to add the fail reason to the agent events ETL: https://github.com/netdata/analytics-bi/pull/2130
Summary
When the agent fails to initialize, it will now set a fail reason to help with debugging. The new field is submitted via the anonymous statistics (if possible, i.e. when statistics are enabled).
For now this will be set if the metadata database fails to initialize properly.
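The gating described in the summary could look roughly like the Python sketch below: the fail reason is reported only when statistics are enabled, and is otherwise dropped. The function name `report_init_failure`, the `statistics_enabled` flag, the `send` callback, and the payload shape are all assumptions made for illustration, not the agent's real interface.

```python
from typing import Callable, Dict, List


def report_init_failure(
    reason: str,
    statistics_enabled: bool,
    send: Callable[[Dict[str, str]], None],
) -> bool:
    """Submit a hypothetical FATAL event carrying the fail reason.

    Returns True if the event was handed to `send`, and False if
    statistics are disabled and nothing was submitted.
    """
    if not statistics_enabled:
        return False
    send({"action": "FATAL", "fail_reason": reason})
    return True


# Usage: collect "sent" events in a list instead of contacting a server.
outbox: List[Dict[str, str]] = []
skipped = report_init_failure("hypothetical: metadata database init failed", False, outbox.append)
sent = report_init_failure("hypothetical: metadata database init failed", True, outbox.append)
print(skipped, sent, outbox)
```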