
health monitoring stops working for slave hosts after some time #10548

Closed

razielin opened this issue Jan 25, 2021 · 14 comments

@razielin

razielin commented Jan 25, 2021

Bug report summary

I have one netdata master server and about 10 slave netdata servers streaming metrics to the master server.
All health configuration files are configured on the master server only. But after a while (about 2-7 days, though it is rather nondeterministic), the periodic checks for all alarms of a particular slave host stop being executed on the master server. Active alarms stay in their current states forever.

curl http://netdata_master:19999/host/slave_host1/api/v1/alarms?all
{
    "hostname": "slave_host1",
    "latest_alarm_log_unique_id": 1611136189,
    "status": false,
    "now": 1611328438, // "2021-01-22 15:13:58"
    "alarms": {
        "system.ram.ram_in_use": {
            "id": 1611136034,
            "name": "ram_in_use",
            "chart": "system.ram",
            …
            "last_updated": 1611145396, // "2021-01-20 12:23:16"
            "next_update": 1611145406, // "2021-01-20 12:23:26"
            "update_every": 10,
            …
        }
    }
}

As you can see, "last_updated" and "next_update" are two days in the past compared to "now". The "last_updated" value is the same for all alarms of the slave host.
This situation eventually occurs for the rest of the slave hosts as well. Streaming of metrics to the master keeps working properly.
I haven't found anything strange in the log. If you need some part of it, please let me know.
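
A quick way to spot a stalled host from the master is to compare each alarm's "next_update" against "now" (a sketch using the same alarms endpoint as above; the URL and hostname are the placeholders from this report):

curl -s 'http://netdata_master:19999/host/slave_host1/api/v1/alarms?all' \
  | jq -r '.now as $now | .alarms[]
           | select(.next_update < $now)
           | "\(.chart)/\(.name): next_update \($now - .next_update)s in the past"'

On a healthy host this prints nothing (or only transient, seconds-old entries); on a stalled one every alarm shows up with a lag that keeps growing.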

OS / Environment

The netdata master service works inside a docker container from the official netdata image v1.28.0.
Configuration of the host system:

  • Ubuntu 18.04.1 LTS
  • Linux 4.15.0-122-generic
  • Docker version 19.03.6

cat docker-compose.yml:
version: '2'
services:
  netdata_master:
    image: netdata_master
    build:
      context: netdata_master
      dockerfile: Dockerfile
    restart: always
    ports:
      - 19999:19999
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdata_cache:/var/cache/netdata
      - netdata_lib:/var/lib/netdata
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/run:/var/run/host_run:ro
    networks:
      - monitoring
      ...
Netdata version

v1.28.0
This problem also occurred in previous versions (v1.26 and some older ones).

Component Name

health

Steps To Reproduce

I don't know. It is non-deterministic.

Expected behavior

Health monitoring keeps working properly for the streamed slave hosts.

@razielin razielin added bug needs triage Issues which need to be manually labelled labels Jan 25, 2021
@stelfrag
Collaborator

Hi @razielin

Thank you for reporting this. A couple of questions.

  • Are the children nodes also on v1.28.0?
  • Can you access the metrics of the children from the parent?

@razielin
Author

Hi @stelfrag.

  • Are the children nodes also on v1.28.0?

Most of them, yes; maybe just a few are not. If it's important, I can give you the full list of versions for all slave hosts.
Netdata is installed (and further updated) via the kickstart.sh script, except for the master host (which runs inside a Docker container).

  • Can you access the metrics of the children from the parent?

Yes, I can. Streaming of metrics works fine.

@razielin
Author

I haven't found any correlation between netdata versions and the bug's occurrence.
Right now I have 3 slave netdata servers with this problem (after restarting the master netdata):

  • slave1 - v1.19.0 - the problem begins on the 1st day.
  • slave2 - v1.28.0 - the problem begins on the 5th day.
  • slave3 - v1.14.0 - the problem begins on the 6th day.

@Peter-Sh

Peter-Sh commented Feb 3, 2021

I have the same issue with a streaming setup and a fully functional slave node (with database and alarms).
Alarms on the master look disabled.

Get system.ram data and active alarms directly from the slave and from the master (IPs and names are replaced):

curl -s 'http://slave-host-ip:19999/api/v1/data?chart=system.ram&before=-600' > /tmp/1
curl -s 'http://master-host-ip:19999/host/slave-host-name/api/v1/data?chart=system.ram&before=-600' > /tmp/2
echo DIRECT
curl -s 'http://slave-host-ip:19999/api/v1/alarms' | jq -r '.alarms[] | "\({name,value})"'
echo STREAMED
curl -s 'http://master-host-ip:19999/host/slave-host-name/api/v1/alarms' | jq -r '.alarms[] | "\({name,value})"'

DIRECT                                               
{"name":"ram_in_use","value":90.8972535}                                                                                                                                                                                                    
{"name":"ram_available","value":6.4488477}                                                                                                                                                                                                  
{"name":"ram_in_swap","value":28.6731703}
{"name":"cgroup_ram_in_use","value":92.0898438}
STREAMED

Compare data

[user:~]$ diff -q /tmp/1 /tmp/2 && echo "NO DIFFERENCE"                                                                
NO DIFFERENCE

Host info

curl -s 'http://slave-host-ip:19999/api/v1/info' | jq '{uid,version,mirrored_hosts,mirrored_hosts_status}'
{
  "uid": "a8fb8f0a-4931-11eb-83a4-02420a000289",
  "version": "v1.28.0-238-nightly",
  "mirrored_hosts": [
    "slave-host-name"
  ],
  "mirrored_hosts_status": [
    {
      "guid": "a8fb8f0a-4931-11eb-83a4-02420a000289",
      "reachable": true,
      "claim_id": null
    }
  ]
}



curl -s 'http://master-host-ip:19999/host/slave-host-name/api/v1/info' | jq '{uid,version,mirrored_hosts,mirrored_hosts_status}'
{
  "uid": "a8fb8f0a-4931-11eb-83a4-02420a000289",
  "version": "v1.28.0-238-nightly",
  "mirrored_hosts": [
    .....
    "slave-host-name",
    .....
  ],
  "mirrored_hosts_status": [
    ....
    {
      "guid": "a8fb8f0a-4931-11eb-83a4-02420a000289",
      "reachable": true,
      "claim_id": null
    },
    ....
  ]
}

@stelfrag
Collaborator

@Peter-Sh and @razielin

There was an issue when health was configured as auto in stream.conf (on the parent side): it could cause health to stay disabled on the parent after a child disconnected and reconnected. This has been fixed in v1.33.0.
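
For anyone still on an affected version, a possible workaround (an assumption based on the explanation above, not a confirmed fix) is to pin health on in the parent's stream.conf rather than leaving it on auto:

[API_KEY]
    # "auto" is the code path described above; "yes" keeps health
    # enabled on the parent regardless of child reconnects
    health enabled by default = yes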

@cpipilas

Fixed in the latest releases

@razielin
Author

razielin commented Jul 1, 2022

@stelfrag @cpipilas
I am still experiencing this problem, so it's not fixed.
Yesterday I updated netdata to the latest version (v1.35.1); it looks like the probability of the bug occurring has decreased. Right now I have 15 slave netdata nodes. Previously this bug occurred every hour or so; now it's only about once per 4 hours.

@cpipilas cpipilas removed the needs triage Issues which need to be manually labelled label Jul 1, 2022
@cpipilas cpipilas reopened this Jul 1, 2022
@MrZammler
Contributor

Hi @razielin !

Could you please do the following on the parent node?

  • Enable health debugging. You can do that by un-commenting the debug flags line in netdata.conf and setting its value to 0x0000000000800000 (see the sketch after this list).
  • Let it run for a while, at least until you notice the parent has stopped running health for a child.
  • Send the error.log and debug.log to manolis@netdata.cloud.
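
A minimal sketch of the netdata.conf change (assuming a recent version where the option lives under [logs]; on older versions the same option sits under [global]):

[logs]
    # this line ships commented out; un-comment it for it to take effect
    debug flags = 0x0000000000800000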

Can you also please share your stream.conf, especially the health enabled by default option?

Thank you!

@razielin
Author

razielin commented Jul 2, 2022

Hi @MrZammler
Do I need to compile netdata from source to enable debugging?
I'm using the netdata Docker image from Docker Hub. I set debug flags = 0x0000000000800000 and changed the default debug log path to debug log = /var/log/netdata/debug2.log, because the default log paths are symlinks to stdout/stderr. I also changed the error log and access log paths in netdata.conf:

debug flags = 0x0000000000800000
debug log = /var/log/netdata/debug2.log
error log = /var/log/netdata/error2.log
access log = /var/log/netdata/access2.log

But the debug2.log file stays empty, while error2.log and access2.log are filled as expected.

@razielin
Author

razielin commented Jul 2, 2022

@MrZammler
Here is my stream.conf:

[API_KEY]
    enabled = yes
    allow from = *
    default history = 3600
    default memory mode = ram
    health enabled by default = auto
    default postpone alarms on connect seconds = 60
    multiple connections = allow

@MrZammler
Contributor

Hi @razielin, you shouldn't need any special build options or to compile from source... I will check and let you know. Thanks!

@dimko
Contributor

dimko commented Jul 4, 2022

@razielin @MrZammler just a note here: after editing netdata.conf inside the Docker container, a restart of the agent is required for the new config to take effect. Also, please keep in mind that editing netdata.conf directly requires un-commenting the line in order to override the default value.
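
For example (a sketch; adjust the container name to whatever your compose setup actually produces):

docker restart netdata_master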

An alternative to directly editing netdata.conf is passing -W set options to netdata when you start it.
You can add the following values at the end of the docker run command:

-W set logs "debug flags" "0x0000000000800000"
-W set logs "debug" "/var/log/netdata/debug2.log"
-W set logs "error" "/var/log/netdata/error2.log"
-W set logs "access" "/var/log/netdata/access2.log"

so the full command would be:

docker run -d --name=netdata \
  -p 19999:19999 \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  netdata/netdata \
  -W set logs "debug flags" "0x0000000000800000" \
  -W set logs "debug" "/var/log/netdata/debug2.log" \
  -W set logs "error" "/var/log/netdata/error2.log" \
  -W set logs "access" "/var/log/netdata/access2.log"

I edited the config as @MrZammler suggested (added debug2.log and error2.log) and I see the debug and error logs being generated in the /var/log/netdata directory inside the container.

/ # ls -l /var/log/netdata
total 416
lrwxrwxrwx 1 netdata root 11 Jul 2 03:23 access.log -> /dev/stdout
lrwxrwxrwx 1 netdata root 11 Jul 2 03:23 debug.log -> /dev/stdout
-rw-r--r-- 1 netdata netdata 9894 Jul 4 12:09 debug2.log
lrwxrwxrwx 1 netdata root 11 Jul 2 03:23 error.log -> /dev/stderr
-rw-r--r-- 1 netdata netdata 411341 Jul 4 12:10 error2.log

@razielin could you please try with the -W options?

@razielin
Author

razielin commented Jul 6, 2022

@dimko @MrZammler
I tried to add:

CMD ["-W", "set", "logs" ,"\"debug flags\"", "\"0x0000000000800000\"", "-W", "set", "logs", "\"debug\"", "\"/var/log/netdata/debug2.log\""]

to the end of my netdata Dockerfile. The resulting netdata start command looks OK:

/ # ps aux
PID   USER     TIME  COMMAND
    1 netdata   9:30 /usr/sbin/netdata -u netdata -D -s /host -p 19999 -W set logs "debug flags" "0x0000000000800000" -W set logs "debug" "/var/log/netdata/debug2.log"

But still no luck; debug2.log remains empty, just as with the config file.
If I understand the docs correctly, netdata must be compiled from source to enable debug logging.
I compiled and installed netdata from source, let it run for 3 days, and the bug hasn't occurred at all. The docs state that when compiling for debug logging, some optimizations are disabled and additional debug code is added. Maybe this influences the result ...
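
For reference, the from-source debug build looked roughly like this (a sketch; the exact CFLAGS follow the debugging docs and may vary by version):

git clone https://github.com/netdata/netdata.git
cd netdata
# -DNETDATA_INTERNAL_CHECKS=1 compiles in the debug code paths;
# -O1 -ggdb lowers optimization so traces stay readable
CFLAGS="-O1 -ggdb -DNETDATA_INTERNAL_CHECKS=1" ./netdata-installer.sh --dont-wait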

@ilyam8
Member

ilyam8 commented Dec 5, 2022

Hey, @razielin 👋 Do you still have the issue with v1.37.0? There have been a lot of changes/optimizations since v1.28.0.

@ilyam8 ilyam8 closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 28, 2023