Faster parents #16127

Merged
merged 51 commits into from
Oct 27, 2023

ktsaou commented Oct 4, 2023

Current master on parent receiver:

image

This PR on parent receiver:

image

To understand the difference, compare the width of quoted_strings_splitter() in the 2 charts.
In the second chart this function (which hasn't changed in this PR) is many times wider than in the first.
Since its own cost is unchanged, this means the rest of the code is now many times faster.
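
The reasoning above can be made concrete with a little arithmetic. The sketch below assumes that the absolute cost of quoted_strings_splitter() per processed line is identical in both builds, and that the fraction of chart width it occupies has been read off each chart. The helper names are hypothetical and the fractions in the note after the code are made-up illustration values, not measurements from the charts.

```c
#include <assert.h>

/* Hypothetical helpers, not part of Netdata.
 * f_before / f_after: fraction of the total chart width occupied by a
 * function whose own cost did not change between the two builds. */

/* How many times faster the whole receiver path became.
 * total = t / f with t constant, so total_before / total_after = f_after / f_before. */
double overall_speedup(double f_before, double f_after) {
    assert(f_before > 0.0 && f_before <= 1.0);
    assert(f_after > 0.0 && f_after <= 1.0);
    return f_after / f_before;
}

/* How many times faster everything except the unchanged function became.
 * rest = t * (1 - f) / f, and the constant t cancels in the ratio. */
double rest_speedup(double f_before, double f_after) {
    assert(f_before > 0.0 && f_before < 1.0);
    assert(f_after > 0.0 && f_after < 1.0);
    double rest_before = (1.0 - f_before) / f_before;
    double rest_after = (1.0 - f_after) / f_after;
    return rest_before / rest_after;
}
```

For example, if the function grew from 5% to 20% of the width, overall_speedup(0.05, 0.20) gives about 4 and rest_speedup(0.05, 0.20) gives about 4.75.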

Simple optimizations to increase the efficiency of busy parents.

  • cache the dbengine context, so that calling mrg all the time is avoided.
  • cache rd and rd->id (as const char *) together with rda, to speed up dimension lookups by pluginsd.
  • use 2 collected flags inside RRDDIM and RRDSET, to avoid calling rrdcontexts to update the collected status on every data collection.
  • mrg is now lockless for all operations.
  • updated the streaming protocol with a new capability called SLOTS. The new protocol requires the sender to uniquely number every RRDSET and RRDDIM it sends. The receiver uses these numbers to quickly find the RRDSET and RRDDIM pointers.
  • fixed a bug where the EXPOSED flag in dimensions was allocated as a collector option (non-atomic), although it was also used by replication to check the status and by the sender to clear it on disconnect.
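
The idea behind SLOTS can be sketched as a receiver-side array indexed by the number the sender assigned, with the cached id used to validate the hit. This is a minimal illustration only: metric_t, slot_table_t, find_by_id() and lookup() are hypothetical names, the linear search stands in for whatever index the receiver really uses, and the id check on the fast path is an assumption, not something this PR description states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an RRDSET or RRDDIM; not Netdata's real type. */
typedef struct metric {
    const char *id;
} metric_t;

/* Receiver-side table: slot number (assigned by the sender, starting at 1)
 * maps to the pointer found the first time that id was looked up. */
typedef struct slot_table {
    metric_t **entries;
    size_t size;
} slot_table_t;

/* Grow the table so that `slot` is a valid index; new entries are NULL.
 * Returns 0 on success, -1 on allocation failure. */
int slot_table_reserve(slot_table_t *t, size_t slot) {
    if (slot <= t->size) return 0;
    size_t new_size = t->size ? t->size : 16;
    while (new_size < slot) new_size *= 2;
    metric_t **p = realloc(t->entries, new_size * sizeof(*p));
    if (!p) return -1;
    memset(p + t->size, 0, (new_size - t->size) * sizeof(*p));
    t->entries = p;
    t->size = new_size;
    return 0;
}

/* Slow path stand-in: linear search by id. */
metric_t *find_by_id(metric_t *all, size_t n, const char *id) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(all[i].id, id) == 0) return &all[i];
    return NULL;
}

/* Fast path: use the slot when it is populated and the cached id matches.
 * Otherwise fall back to the slow path and remember the result in the slot.
 * slot == 0 means the sender did not provide a slot number. */
metric_t *lookup(slot_table_t *t, metric_t *all, size_t n,
                 size_t slot, const char *id) {
    if (slot && slot <= t->size) {
        metric_t *m = t->entries[slot - 1];
        if (m && strcmp(m->id, id) == 0) return m;
    }
    metric_t *m = find_by_id(all, n, id);
    if (m && slot && slot_table_reserve(t, slot) == 0)
        t->entries[slot - 1] = m;
    return m;
}
```

After the first lookup of a given slot, later lookups cost one array access and one string comparison instead of a search by name. A stale or reused slot number simply falls back to the slow path and overwrites the entry.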

Comparison: 2.7 million metrics per second, Netdata vs Prometheus

In this setup, both Netdata and Prometheus are configured to collect the same 2.5 million metrics per second from 500 Netdata children. To test similar functionality, we disabled ML and Health in the netdata.conf of the Netdata parent.

CPU utilization

image

  • Netdata needs about 2 CPU cores per million metrics
  • Prometheus needs about 3 CPU cores per million metrics, with frequent spikes at 14+ CPU cores.

Prometheus has a huge spike every 2 minutes, utilizing almost all CPU cores available on the system (both VMs have 24 cores available).

Memory consumption

As far as memory consumption is concerned:

  • Netdata uses 40GiB (after we added a 16GiB main cache and an 8GiB extent cache)
  • Prometheus uses 30GiB

Disk footprint

| application | tier | on disk | retention |
|---|---|---|---|
| Netdata | tier 0 | 625GiB | 7 days |
| Netdata | tier 1 | 285GiB | 14 days |
| Netdata | tier 2 | 114GiB | 90 days |
| Prometheus | - | 3TiB | 7 days |

The Netdata total, including data and metadata, is 1 TiB.
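
As a quick check of the figures above, the three Netdata tiers add up to 1024GiB, which matches the stated 1 TiB total. The sketch below also derives an average on-disk cost per collected sample. Both helpers are hypothetical, and the ingest rate is a parameter because this post quotes both 2.5 and 2.7 million metrics per second.

```c
#include <assert.h>

/* Tier sizes copied from the disk footprint figures above, in GiB. */
enum { TIER0_GIB = 625, TIER1_GIB = 285, TIER2_GIB = 114 };

/* Sum of the three Netdata tiers, in GiB. */
unsigned netdata_total_gib(void) {
    return TIER0_GIB + TIER1_GIB + TIER2_GIB;
}

/* Hypothetical helper: average bytes stored per collected sample, given a
 * footprint in GiB, an ingest rate in samples per second, and the number of
 * days of retention that footprint covers. */
double bytes_per_sample(double footprint_gib, double samples_per_second,
                        double retention_days) {
    assert(samples_per_second > 0.0 && retention_days > 0.0);
    double bytes = footprint_gib * 1024.0 * 1024.0 * 1024.0;
    double samples = samples_per_second * retention_days * 86400.0;
    return bytes / samples;
}
```

Assuming 2.7 million samples per second, bytes_per_sample(625, 2.7e6, 7) comes to roughly 0.4 bytes per sample for Netdata tier 0, and bytes_per_sample(3072, 2.7e6, 7) to roughly 2 bytes per sample for Prometheus over the same 7 days.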

Disk I/O

Each of the VMs has its own physical disk (so that we can measure the disk I/O of each VM). In the following screenshot, Prometheus is using sdd and Netdata is using zd16:

image

As you can see, Prometheus is really stressing the disks at this scale, possibly due to its WAL. Netdata achieves the same safety against data loss by re-streaming its metrics to another Netdata Parent (when configured to do so).

Network bandwidth

Netdata reception is 380Mbps.
Prometheus reception is 240Mbps.

Netdata uses LZ4 compression on a much more compact protocol, while Prometheus uses gzip/deflate on a chattier one. However, the compression efficiency of gzip is considerably higher than that of LZ4.

In PR #16268 we add ZSTD streaming support to Netdata, to see how its bandwidth changes.
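
To put the reception figures on a per-sample basis, a small helper can convert bandwidth into average bytes on the wire per collected sample. This is a sketch: the function name is hypothetical, Mbps is taken to mean 10^6 bits per second, and the ingest rate is left as a parameter because this post quotes both 2.5 and 2.7 million metrics per second.

```c
#include <assert.h>

/* Hypothetical helper: average bytes received per collected sample.
 * megabits_per_second is assumed to mean 10^6 bits per second. */
double wire_bytes_per_sample(double megabits_per_second,
                             double samples_per_second) {
    assert(samples_per_second > 0.0);
    return megabits_per_second * 1e6 / 8.0 / samples_per_second;
}
```

Assuming 2.7 million samples per second, this gives roughly 17.6 bytes per sample for Netdata at 380Mbps and roughly 11.1 bytes per sample for Prometheus at 240Mbps.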


ktsaou commented Oct 8, 2023

@stelfrag this is ready for merge. If you can't find an issue, let's merge it.

@ilyam8 I have installed it on lab-parent2. Please install it on all its children and stop lab-parent3 to see the difference in action. It should be way faster than before.

@ktsaou ktsaou merged commit 2175104 into netdata:master Oct 27, 2023
148 of 149 checks passed