Clarifications for streaming documentation #10629

kousu · 2021-02-14T10:54:23Z

Bug report summary

The streaming docs have oversights.

OS / Environment

Locally:

Linux monitor.neuro.polymtl.ca 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.2 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.2 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata version

netdata v1.19.0, but I'm reading the online docs as of 9ba79d3, which has already confused me a couple of times (e.g. dbengine multihost disk space doesn't exist in 1.19, only dbengine disk space)

Component Name

streaming, docs

🧶 🧶 🐈

I tried to follow https://learn.netdata.cloud/docs/agent/streaming#database-replication but got a bit detoured several times.

I have suggestions for clarifications in these docs:

The docs covering "archiving" mention everything but netdata as an option. I'm not interested in compliance archives, but I do want some amount of archiving: I need to review infrastructure patterns to debug glitches. I also don't want to have to learn and maintain a second set of software; I would rather use netdata everywhere.

If I just set up a netdata node with a very large dbengine disk space, shouldn't it be able to function like an archive? The way this section is phrased makes me think there's some reason it is impossible for netdata to retain a child node's data after the child node stops sending it. I know that is not true, both from #7303 ("Netdata in master/slave deployment losing metrics after unsolicited restarts across server estate (caused by cron daily)") and #7360 ("Netdata in master/slave deployment losing metrics after restart"), and because I've produced such a situation myself.

I had to figure it out by reading between the lines from e.g.

The child and the parent may have different data retention policies for the same metrics.

Any number of daisy chaining Netdata servers are supported, each with or without a database and with or without alarms for the child metrics.

and

A proxy, which receives metrics from other hosts and pushes them immediately to other Netdata servers. Netdata proxies can also be `store and forward proxies` meaning that they are able to maintain a local database for all metrics passing through them (with or without alarms).

(and by the way, why is "store and forward proxies" code-quoted there?)

The docs never define "ephemeral" very well. How does a parent server know that a child server is ephemeral? What counts as ephemeral? Are my prone-to-crashing servers ephemeral?

My guess is that netdata doesn't define it in code, instead letting newer data naturally replace older data, and thus gradually forgetting "ephemeral" nodes. Is that right? It would help us all if that were clearer in the docs.

The documentation about the global retention options is vague:

delete obsolete charts files default=yes See monitoring ephemeral containers, also affects the deletion of files for obsolete dimensions
delete orphan hosts files default=yes Set to no to disable non-responsive host removal.

This doesn't mention that these options are key to making streaming work reliably (see the comments on #7360 and #7303), or really give any clue about what they do. The link just talks about containers, which are a niche sub-case of the more general metric retention rules. It's all pretty opaque to me right now. And from what I can tell, these options partially contradict my assumption about what "ephemeral" means: on their default settings, a parent that is shutting down immediately deletes logs for servers that are not currently connected.

The separate "replication" and "proxies" section headers, and the follow-up table ("headless collector", "headless proxy", "proxy with db", "central netdata"), made me think these were all separate modes netdata can run in. They're not: they're orthogonal features that can be combined.

The default stream.conf that I got from /etc/netdata/edit-config stream.conf was configured to use memory mode = save. I think this is confusing because https://learn.netdata.cloud/docs/agent/streaming#database-replication says to use dbengine, and implies that dbengine covers the data for child nodes too. I replaced that with default memory mode = dbengine in my stream.conf, but I don't have a good way to check whether it worked beyond looking at what kinds of files netdata has open.
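One way to double-check the setting without inspecting open files is to read stream.conf back and list the memory mode each section declares. The following is a minimal sketch, assuming stream.conf is INI-style text with `[section]` headers and `key = value` lines; the function name `memory_modes` and the sample API key are hypothetical. It only reports what the file says, not what the running agent actually uses.

```python
def memory_modes(conf_text):
    """Map each [section] of stream.conf-style text to its declared memory mode.

    Assumption: the relevant keys are "default memory mode" and "memory mode",
    the two spellings mentioned in this issue. Sections with neither key are
    left out of the result.
    """
    modes = {}
    section = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section is None or "=" not in line:
            continue  # ignore anything outside a section or without a value
        key, value = (part.strip() for part in line.split("=", 1))
        if key in ("default memory mode", "memory mode"):
            modes[section] = value
    return modes


# Hypothetical parent-side config; the section name is a placeholder API key.
SAMPLE = """
[11111111-2222-3333-4444-555555555555]
    enabled = yes
    # retention settings for children using this key
    default history = 3600
    default memory mode = dbengine
"""

if __name__ == "__main__":
    for name, mode in memory_modes(SAMPLE).items():
        print(f"[{name}] -> {mode}")
```

To use it on a real file, pass the contents of the stream.conf being edited (for example `memory_modes(open(path).read())`, with `path` pointing at wherever the file lives on your system).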
With default memory mode = save and history = 3600, the retention period of child nodes is easy to understand, but with default memory mode = dbengine it's a lot more opaque. It would help if the streaming docs addressed this: how does dbengine's space get shared out amongst child nodes? What happens if there's a tree of nodes like in your diagrams: does the space get bucketed according to what node the data came from, or does each leaf node get an equal share in the central collector?
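For comparison, the save case is easy to reason about because it is plain arithmetic: history counts stored entries per metric, so retention is roughly history multiplied by the collection interval. A minimal sketch, assuming the interval is the `update every` setting in seconds (1 second, as far as I know, by default); `retention_seconds` is a hypothetical helper, and nothing here says how dbengine behaves, which is the open question.

```python
def retention_seconds(history, update_every=1):
    """Approximate retention of a fixed-size store holding `history` entries.

    Assumption: one entry is written per metric every `update_every` seconds,
    and the oldest entry is overwritten once `history` entries exist.
    """
    if history < 0 or update_every <= 0:
        raise ValueError("history must be >= 0 and update_every must be > 0")
    return history * update_every


if __name__ == "__main__":
    seconds = retention_seconds(3600)  # history = 3600, one sample per second
    print(f"{seconds} seconds = {seconds / 3600:g} hour(s)")
```

With a slower collection interval on the child, the same history setting covers proportionally more wall-clock time.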


odyslam · 2021-02-16T12:58:18Z

Thanks for this issue, @kousu! We actually intend to update these docs with @joelhans, so thank you so much for the detailed feedback.

Also, welcome to our little community! For questions, it's better to use our forum at https://community.netdata.cloud ✌️

ilyam8 · 2021-02-16T13:57:28Z

@kousu awesome, you rock 👍

odyslam · 2021-02-16T14:04:39Z

@kousu, after a more detailed look at this issue, I consider it a contribution, so we would like to offer you our contributor swag!

Please do send us an email at rewards@netdata.cloud with the following info (so you can claim your reward ✌️)

GitHub Username
First Name
Last Name
Email
Phone
Full Shipping Address (including city, state, zip, & country)

kousu added the "bug" and "needs triage" labels Feb 14, 2021

kousu mentioned this issue Feb 14, 2021

Netdata in master/slave deployment losing metrics after unsolicited restarts across server estate (caused by cron daily) #7303

Closed

odyslam closed this as completed Feb 16, 2021
