Add a fail reason to pinpoint exactly what went wrong #15866

stelfrag · 2023-08-22T09:26:07Z

Summary

When the agent fails to initialize, it now sets a fail reason to help with debugging. The new field is submitted via the anonymous statistics (when possible, i.e. when statistics are enabled).

For now this will be set if the metadata database fails to initialize properly.

andrewm4894 · 2023-08-22T15:52:56Z

how does NETDATA_FAIL_REASON get populated and when does it get wiped clean?

Just wondering about making sure we don't end up with NETDATA_FAIL_REASON getting set and then somehow persisting and being populated when things are fine.

E.g. say netdata crashes and populates NETDATA_FAIL_REASON, and then as a user I fix it and everything is good: are we happy enough that NETDATA_FAIL_REASON will be empty and never contain the old values from earlier?

I assume this won't/can't happen, but I wanted to double check, since it's a little odd to use an env var like this as opposed to tying it to a specific event of some sort.

thiagoftsm

PR is working as expected, LGTM!

stelfrag · 2023-08-22T18:09:42Z

how does NETDATA_FAIL_REASON get populated and when does it get wiped clean?

For now, it is populated only when the agent is about to raise a FATAL during startup, before it sends a START message. Both current cases are failures to open the database files.

A normal START message will have the fail reason set to NULL.

This will allow us to troubleshoot early failures (usually during database init) and see how we can improve.

Reasons can include (but are not limited to):

Read-only file system (for whatever reason)
Disk full
Database file corruption
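
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the `db_status` enum, the helper names, and the JSON-style payload are all hypothetical. Only the START/FATAL actions, the `netdata_fail_reason` field name, and the example reasons come from this thread.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical outcome of opening the database files at start-up. */
typedef enum { DB_OK, DB_READ_ONLY_FS, DB_DISK_FULL, DB_CORRUPT } db_status;

/* Map a start-up failure to a short reason; NULL means nothing failed. */
static const char *fail_reason_for(db_status status) {
    switch (status) {
        case DB_READ_ONLY_FS: return "read-only file system";
        case DB_DISK_FULL:    return "disk full";
        case DB_CORRUPT:      return "database file corruption";
        default:              return NULL;
    }
}

/* Format one statistics event (payload layout is an assumption).
 * A NULL reason is written as a JSON null, as for a normal START. */
static int format_event(char *buf, size_t len,
                        const char *action, const char *reason) {
    if (reason)
        return snprintf(buf, len,
                        "{\"action\":\"%s\",\"netdata_fail_reason\":\"%s\"}",
                        action, reason);
    return snprintf(buf, len,
                    "{\"action\":\"%s\",\"netdata_fail_reason\":null}",
                    action);
}

/* Pick the event for this start-up: FATAL with a reason if the
 * database failed to open, otherwise START with a null reason. */
static int startup_event(db_status status, char *buf, size_t len) {
    const char *reason = fail_reason_for(status);
    return format_event(buf, len, reason ? "FATAL" : "START", reason);
}
```

Because the reason is derived from the current start-up attempt each time, a successful start after an earlier failure reports a null reason rather than an old value.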

Dim-P

I can't check if netdata_fail_reason is received correctly by our analytics DB (maybe @andrewm4894 can check), but other than that, the PR seems to work fine.

Also, much cleaner code now!

andrewm4894 · 2023-08-23T08:14:20Z

I will update the downstream ETL to pull it out of the events coming in.

andrewm4894 · 2023-08-23T10:17:04Z

PR to add fail reason to agent events ETL: https://github.com/netdata/analytics-bi/pull/2130

Add a fail reason to pinpoint exactly what went wrong

933b581

stelfrag force-pushed the add_fatal_reason branch from a2f9d6d to 933b581 August 22, 2023 09:26

stelfrag requested review from MrZammler and andrewm4894 August 22, 2023 09:26

github-actions bot added area/daemon area/database labels Aug 22, 2023

stelfrag marked this pull request as ready for review August 22, 2023 09:31

stelfrag requested review from thiagoftsm and vkalintiris as code owners August 22, 2023 09:31

Dim-P reviewed Aug 22, 2023

daemon/anonymous-statistics.sh.in (outdated)

daemon/analytics.c (outdated)

Drop the env for setting the fail reason. Always pass netdata_fail_reason

66dc029
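
One way to picture the change named in this commit title is an argument list in which the fail reason always has its own slot. The sketch below is an assumption for illustration only: the argument order, the `build_stats_args` helper, and the `"null"` placeholder string are hypothetical and not taken from the PR diff.

```c
#include <stddef.h>

/* Hypothetical layout: script path, action, fail reason. */
enum { STATS_ARGC = 3 };

/*
 * Fill the argument vector for one statistics event. The fail reason
 * always occupies slot 2; "null" stands in when there is nothing to
 * report, so the value is stated per event instead of being read from
 * an environment variable that could still hold an earlier value.
 */
static void build_stats_args(const char *args[STATS_ARGC + 1],
                             const char *script,
                             const char *action,
                             const char *fail_reason) {
    args[0] = script;
    args[1] = action;
    args[2] = (fail_reason && *fail_reason) ? fail_reason : "null";
    args[3] = NULL; /* terminator, as expected by exec-style calls */
}
```

With this shape, each call states its own reason, which is the property asked about earlier in the thread: a later START event cannot inherit the reason from a previous FATAL.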

thiagoftsm approved these changes Aug 22, 2023

Dim-P approved these changes Aug 22, 2023

stelfrag merged commit 9dec766 into netdata:master Aug 23, 2023
137 checks passed

stelfrag deleted the add_fatal_reason branch August 23, 2023 08:00
