Clarifications for streaming documentation #10629

Closed
kousu opened this issue Feb 14, 2021 · 3 comments
Labels: bug, needs triage


kousu commented Feb 14, 2021

Bug report summary

The streaming docs have oversights.

OS / Environment

Remotely: https://learn.netdata.com

Locally:

Linux monitor.neuro.polymtl.ca 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.2 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.2 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata version

netdata v1.19.0, but I'm reading the online docs as of 9ba79d3. That mismatch has already confused me a couple of times (e.g. dbengine multihost disk space doesn't exist in 1.19, only dbengine disk space).

Component Name

streaming, docs

🧶 🧶 🐈

I tried to follow https://learn.netdata.cloud/docs/agent/streaming#database-replication but got sidetracked several times.

I have suggestions for clarifications in these docs:

  1. The docs covering "archiving" mention everything but netdata as an option. I'm not interested in compliance archives, but I do want some amount of archiving: I need to review infrastructure patterns to debug glitches. And I don't want to have to learn and maintain a second set of software; I would rather use netdata everywhere.

    If I just set up a netdata node with a very large dbengine disk space, shouldn't it be able to function like an archive? The way this section is phrased makes me think there's some reason it is impossible for netdata to retain a child node's data after the child node stops sending it. I know that is not true: see #7303 ("Netdata in master/slave deployment losing metrics after unsolicited restarts across server estate (caused by cron daily)") and #7360 ("Netdata in master/slave deployment losing metrics after restart"). I've also produced such a situation myself, e.g.

    [screenshot: 2021-02-14-044603_1086x617_scrot]

    I had to figure it out by reading between the lines from e.g.

    The child and the parent may have different data retention policies for the same metrics.

    Any number of daisy chaining Netdata servers are supported, each with or without a database and with or without alarms for the child metrics.

    and

    A proxy, which receives metrics from other hosts and pushes them immediately to other Netdata servers. Netdata proxies can also be store and forward proxies meaning that they are able to maintain a local database for all metrics passing through them (with or without alarms).

    (and by the way, why is "store and forward proxies" code-quoted there?)

  2. The docs never define "ephemeral" very well. How does a parent server know that a child server is ephemeral? What counts as ephemeral? Are my prone-to-crashing servers ephemeral?

    My guess is that netdata doesn't define it in code, instead letting newer data naturally replace older data, and thus gradually forgetting "ephemeral" nodes. Is that right? It would help us all if that was clearer in the docs.

  3. The documentation about the global retention options is vague:

    delete obsolete charts files (default: yes): See monitoring ephemeral containers; also affects the deletion of files for obsolete dimensions.
    delete orphan hosts files (default: yes): Set to no to disable non-responsive host removal.

    This doesn't mention that these options are key to making streaming work reliably (see the comments on #7360 and #7303), or give any real clue about what they do. The link only talks about containers, which are a niche sub-case of the more general metric retention rules. It's all pretty opaque to me right now. And from what I can tell, these options partially contradict my assumption about what "ephemeral" means: at their default settings, a parent that is shutting down immediately deletes logs for servers that are not currently connected.

  4. The separate "replication" and "proxies" section headers, and the follow-up table ("headless collector", "headless proxy", "proxy with db", "central netdata"), made me think these were all separate modes netdata can run in. They're not: they're orthogonal features that can be combined.

  5. The default stream.conf that I got from /etc/netdata/edit-config stream.conf was configured to use default memory mode = save. I think this is confusing because https://learn.netdata.cloud/docs/agent/streaming#database-replication says to use dbengine, and implies that dbengine covers the data for child nodes too. I replaced that with default memory mode = dbengine in my stream.conf, but I don't have a good way to check whether it worked beyond looking at what kinds of files netdata has open.

  6. With default memory mode = save and history = 3600, the retention period of child nodes is easy to understand, but with default memory mode = dbengine it's a lot more opaque. It would help if the streaming docs addressed this: how does dbengine's space get shared out amongst child nodes? If there's a tree of nodes like in your diagrams, does the space get bucketed according to what node the data came from, or does each leaf node get an equal share in the central collector?
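
For reference, the settings discussed in items 1, 3, 5 and 6 could be combined on a parent roughly as below. This is only a sketch under stated assumptions, not a verified configuration. The option names are the ones quoted in this issue. The `[global]` and `[API_KEY]` section names, the `enabled` line and every value are assumptions for illustration. Whether this gives archive-like retention for disconnected children is exactly the open question raised above.

```ini
# --- netdata.conf on the parent (section name assumed) ---
[global]
    # Store the parent's metrics with the database engine.
    memory mode = dbengine
    # Disk budget for dbengine; 4096 is an arbitrary placeholder, not a recommendation.
    dbengine disk space = 4096
    # Both options default to yes according to the docs quoted in item 3.
    # Setting them to no is the change suggested in the comments on #7303 and #7360.
    delete obsolete charts files = no
    delete orphan hosts files = no

# --- stream.conf on the parent ---
# API_KEY is a placeholder for the key the children send (hypothetical here).
[API_KEY]
    # Assumed to be needed for the parent to accept this key.
    enabled = yes
    # Replaces the shipped "default memory mode = save" mentioned in item 5.
    default memory mode = dbengine
    # With save mode, history = 3600 means 3600 stored points per metric,
    # which is one hour if metrics are collected once per second (assumed interval).
```

Item 6 still applies to this sketch: nothing here says how the `dbengine disk space` budget is divided among child nodes.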

Contributor

odyslam commented Feb 16, 2021

Thanks for this issue, @kousu! We actually intend to update these docs with @joelhans, so thank you so much for the detailed feedback.

Also, welcome to our little community! For questions, it's better to use our forum at https://community.netdata.cloud ✌️

odyslam closed this as completed Feb 16, 2021

Member

ilyam8 commented Feb 16, 2021

@kousu awesome, you rock 👍

Contributor

odyslam commented Feb 16, 2021

@kousu, after a more detailed look at this issue, I consider it a contribution, so we would like to offer you our contributor swag!

Please do send us an email at rewards@netdata.cloud with the following info (so you can claim your reward ✌️)

GitHub Username
First Name
Last Name
Email
Phone
Full Shipping Address (including city, state, zip, & country)
