Stream compression - Deactivate compression at runtime in case of a compressor buffer overflow #12037

Merged
merged 16 commits into netdata:master on Mar 24, 2022

Contributor

@odynik odynik commented Jan 25, 2022

Summary

This PR catches a stream compression buffer overflow and avoids stream corruption by deactivating stream compression at runtime. If the sender thread experiences a compressor buffer overflow, it deactivates stream compression and re-establishes a fresh link using stream protocol version 4.
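
The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `struct sender`, `sender_check_message`, and the constant names are hypothetical, and the 16384-byte limit is taken from the error message quoted in the test plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical constants; the values are the ones quoted in this PR. */
#define COMPRESSION_BUFFER_LIMIT ((size_t)16384) /* bytes, from the error log */
#define VERSION_CLABELS 4                        /* uncompressed protocol */
#define VERSION_COMPRESSION 5                    /* compressed protocol */

/* Hypothetical, simplified sender state (not netdata's real struct). */
struct sender {
    int version;             /* negotiated stream protocol version */
    int compression_enabled; /* runtime compression flag */
    int needs_reconnect;     /* set when the link must be re-established */
};

/* Returns 1 if the message can be sent on the current link.
 * Returns 0 if the message would overflow the compressor buffer; in that
 * case compression is switched off, the version is downgraded, and the
 * caller is expected to reconnect before sending anything else. */
static int sender_check_message(struct sender *s, size_t msg_size) {
    assert(s != NULL);

    if (!s->compression_enabled)
        return 1; /* the uncompressed path has no compressor limit */

    if (msg_size <= COMPRESSION_BUFFER_LIMIT)
        return 1; /* fits into the compressor buffer */

    fprintf(stderr,
            "Compression failed: message size %zu above buffer limit %zu, "
            "deactivating compression\n",
            msg_size, COMPRESSION_BUFFER_LIMIT);

    s->compression_enabled = 0;
    s->version = VERSION_CLABELS;
    s->needs_reconnect = 1;
    return 0;
}
```

With a sender that starts at version 5, a 27847-byte message (the size seen in the test log) would trip the check once; after the reconnect, the same message goes out uncompressed.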

Test Plan

To test this PR, you need:

  1. At least a child <-> parent Netdata agent setup, although I recommend a grandparent <-> parent <-> child setup.
  2. A collector/chart that produces a chart/dimension definition or a data transmission larger than 16 KiB (16384 bytes).

Steps

  1. Enable stream compression in the stream.conf file on all agents:

[stream]
enabled = yes
enable compression = yes

and/or

[API_KEY]
enabled = yes
enable compression = yes
  2. In the child Netdata agent, create a go.d example collector with 2 charts and 300 dimensions per chart:
    2.1 cd /etc/netdata
    2.2 sudo ./edit-config go.d.conf
  3. Enable the example module in go.d.conf:
#  dockerhub: yes
#  elasticsearch: yes
  example: yes
#  filecheck: yes
#  fluentd: yes
  4. Create the charts with sudo ./edit-config go.d/example.conf:
jobs:
  - name: stress
    charts:
      num: 2
      dimensions: 300
  5. Restart all the agents.
  6. Look in the child's error.log for the following message sequence:
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: connecting...
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: initializing communication...
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: waiting response from remote netdata...
netdata INFO  : STREAM_SENDER[child01] : STREAM_COMPRESSION: Compressor Reset
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: established communication with a parent using protocol version 5 - ready to send metrics...
...
netdata ERROR : PLUGINSD[go.d] : STREAM_COMPRESSION: Compression Failed - Message size 27847 above compression buffer limit: 16384 (errno 9, Bad file descriptor)
netdata ERROR : PLUGINSD[go.d] : STREAM_COMPRESSION: Deactivating compression to avoid stream corruption
netdata ERROR : PLUGINSD[go.d] : STREAM_COMPRESSION child01 [send to my.parent.IP]: Restarting connection without compression
...
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: connecting...
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: initializing communication...
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: waiting response from remote netdata...
netdata INFO  : STREAM_SENDER[child01] : Stream is uncompressed! One of the agents (my.parent.IP <-> child01) does not support compression OR compression is disabled.
netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send to my.parent.IP]: established communication with a parent using protocol version 4 - ready to send metrics...
netdata INFO  : WEB_SERVER[static4] : STREAM child01 [send]: sending metrics...
  7. The child <-> parent <-> gparent stream should not be corrupted, which means:
    7.1 You see streaming data in the parent and grandparent agent dashboards.
    7.2 In the parent agent, you see a healthy sender thread shutdown sequence:
 netdata ERROR : STREAM_RECEIVER[child01,[my.child.ip]:43478] : STREAM child01 [receive from [my.child.ip]:43478]: disconnected (completed 665999 updates).
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : Queuing status update for node=1423ff00-e22c-4f1f-ad51-e861b8cd7ce3, live=0, hops=1
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : STREAM child01 [send]: signaling sending thread to stop...
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : STREAM child01 [send]: waiting for the sending thread to stop...
 netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send]: sending thread cleans up...
 netdata INFO  : STREAM_SENDER[child01] : STREAM child01 [send]: sending thread now exits.
 netdata INFO  : STREAM_SENDER[child01] : thread with task id 3311408 finished
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : STREAM child01 [send]: sending thread has exited.
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : STREAM child01 [receive from [my.child.ip]:43478]: receive thread ended (task id 3311323)
 netdata INFO  : STREAM_RECEIVER[child01,[my.child.ip]:43478] : thread with task id 3311323 finished

    7.3 Please report any sequence of messages containing:

EOF found in spawn pipe
Shutting down spawn server event loop
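
For anyone who prefers to check the logs mechanically, the expected downgrade sequence and the two spawn-server messages above could be matched with a small helper like the one below. This is a sketch only: the function names are hypothetical, the marker strings are substrings copied from the log excerpts in this test plan, and the whole log is assumed to be already loaded into memory as one string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if every marker occurs in the log, in the given order. */
static int log_has_sequence(const char *log, const char *const *markers,
                            size_t count) {
    assert(log != NULL && markers != NULL);
    const char *pos = log;
    for (size_t i = 0; i < count; i++) {
        const char *hit = strstr(pos, markers[i]);
        if (hit == NULL)
            return 0;                      /* marker missing or out of order */
        pos = hit + strlen(markers[i]);    /* continue after this match */
    }
    return 1;
}

/* Substrings taken from the child error.log excerpt in this test plan. */
static const char *const DOWNGRADE_MARKERS[] = {
    "using protocol version 5",
    "Deactivating compression to avoid stream corruption",
    "Restarting connection without compression",
    "using protocol version 4",
};
#define DOWNGRADE_MARKER_COUNT \
    (sizeof DOWNGRADE_MARKERS / sizeof DOWNGRADE_MARKERS[0])

/* Returns 1 if the log contains one of the messages that should be reported. */
static int log_has_spawn_failure(const char *log) {
    assert(log != NULL);
    return strstr(log, "EOF found in spawn pipe") != NULL ||
           strstr(log, "Shutting down spawn server event loop") != NULL;
}
```

A healthy run would make `log_has_sequence` return 1 for the child log and `log_has_spawn_failure` return 0.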
Additional Information

This PR is a temporary workaround for issue #12020. I am not linking the issue here since this is not the permanent solution.

  1. Check the netdata -W buildinfo for Stream Compression: YES
  2. Check the runtime compression flag stream-compression: enabled/disabled/N/A at web api /api/v1/info
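
As a rough illustration of the second check, the flag could be pulled out of the /api/v1/info response with a naive string scan like the one below. The exact JSON layout is an assumption here (a string-valued "stream-compression" key), and `info_stream_compression` is a hypothetical helper, not part of netdata; a real client should use a proper JSON parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skips JSON whitespace. */
static const char *skip_ws(const char *p) {
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        p++;
    return p;
}

/* Copies the value of the "stream-compression" key into out.
 * Returns 1 on success, 0 if the key is missing, the value is not a
 * quoted string, or the value does not fit into out. */
static int info_stream_compression(const char *json, char *out,
                                   size_t out_size) {
    static const char key[] = "\"stream-compression\"";
    assert(json != NULL && out != NULL && out_size > 0);

    const char *p = strstr(json, key);
    if (p == NULL)
        return 0;
    p = skip_ws(p + sizeof key - 1);
    if (*p != ':')
        return 0;
    p = skip_ws(p + 1);
    if (*p != '"')
        return 0;
    p++;

    const char *end = strchr(p, '"');
    if (end == NULL)
        return 0;
    size_t len = (size_t)(end - p);
    if (len >= out_size)
        return 0; /* value would not fit, including the terminator */

    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

After the downgrade described in this PR, the extracted value would be expected to read disabled on the child.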

erdem2000 previously approved these changes Jan 26, 2022
vlvkobal previously approved these changes Jan 26, 2022
Contributor

@vlvkobal vlvkobal left a comment

Tested it with child <-> parent <-> gparent.

kickoke previously approved these changes Jan 26, 2022
Contributor

@kickoke kickoke left a comment

Documentation: LGTM

thiagoftsm previously approved these changes Jan 27, 2022
Contributor

@thiagoftsm thiagoftsm left a comment

I tested the proposed scenario and did not observe any problems with the spawn server!

@thiagoftsm
Contributor

@odynik we have conflicts with streaming/receiver.c, please, rebase your PR.

@odynik
Contributor Author

odynik commented Jan 27, 2022

> @odynik we have conflicts with streaming/receiver.c, please, rebase your PR.

Thanks @thiagoftsm. For now, we decided to keep this PR on hold.
I will rebase it if/when we are about to merge it.

@odynik odynik dismissed stale reviews from thiagoftsm, kickoke, vlvkobal, and erdem2000 via 4705561 March 11, 2022 10:28
@odynik odynik force-pushed the stream_compression_buff_overflow_fix branch from 0851eec to 4705561 Compare March 11, 2022 10:28
@odynik
Contributor Author

odynik commented Mar 11, 2022

PR rebased on latest master

kickoke previously approved these changes Mar 11, 2022
Contributor

@kickoke kickoke left a comment

Docs: LGTM

Contributor

@thiagoftsm thiagoftsm left a comment

I tested all the possible combinations for the proposed scenario, and I did not observe any crash in the spawn server. I also observed the protocol downgrade when the buffer overflow happens. LGTM!

error("STREAM_COMPRESSION: Deactivating compression to avoid stream corruption");
default_compression_enabled = 0;
s->rrdpush_compression = 0;
s->version = STREAM_VERSION_CLABELS;
Collaborator

So currently CLABELS is 4 and compression is 5. What will happen with version 6? It will support compression, but when compression fails it will fall back to 4 (losing the functionality added by version 6).

Contributor Author

@odynik odynik Mar 23, 2022

I would prefer to merge this PR into master and respect the stream versions (4 - clabels, 5 - compression, 6 - gap filling) in the gap filling PR.

The stream version checks should be gathered into a function, because if we keep adding streaming features, the if/else-if chains will keep increasing the code complexity.
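
One possible shape for such a function is sketched below. It is not taken from this PR or from the gap filling PR: the enum, the function names, and the single-table layout are hypothetical, and only the version numbers (4, 5, 6) come from the comment above.

```c
#include <assert.h>

/* Hypothetical feature identifiers for the stream protocol. */
enum stream_feature {
    STREAM_FEATURE_CLABELS,
    STREAM_FEATURE_COMPRESSION,
    STREAM_FEATURE_GAP_FILLING
};

/* First protocol version that carries each feature
 * (4 - clabels, 5 - compression, 6 - gap filling). */
static int stream_feature_min_version(enum stream_feature feature) {
    switch (feature) {
        case STREAM_FEATURE_CLABELS:     return 4;
        case STREAM_FEATURE_COMPRESSION: return 5;
        case STREAM_FEATURE_GAP_FILLING: return 6;
    }
    return -1; /* unknown feature */
}

/* Single place for "does this negotiated version support X?" checks. */
static int stream_version_supports(int version, enum stream_feature feature) {
    assert(version >= 0);
    int min_version = stream_feature_min_version(feature);
    return min_version > 0 && version >= min_version;
}

/* Highest version that does not include the given feature; with the
 * numbers above this yields 4 for compression, the fallback used here. */
static int stream_version_before(enum stream_feature feature) {
    return stream_feature_min_version(feature) - 1;
}
```

Note that with strictly linear version numbers a helper like this only centralizes the checks. It does not by itself address the concern raised above about a version 6 link losing its newer functionality when it falls back to 4.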

@odynik odynik merged commit 57d6c17 into netdata:master Mar 24, 2022

7 participants