fix streaming bug#4425

Merged
ktsaou merged 1 commit into netdata:master from ktsaou:fix-streaming-denial
Oct 17, 2018
Conversation

Member

@ktsaou ktsaou commented Oct 17, 2018

fixes #4370

There was a bug in the streaming functionality of the slaves:

If for any reason the connection from the slave to the master was interrupted, the slave was randomly unable to recover and push metrics to the master.

The reason was that a local variable was not reset to zero, when the slave discarded the buffer of data to be sent. In more detail:

The slave runs data collection in multiple threads. All these threads append data to a single buffer: each one locks it, appends its data, and releases the lock. So the buffer continuously grows in size.

Another thread of the slave attempts to push the buffer to the remote netdata, as fast as the network can do it. So, the buffer is being streamed from the other end:

+ <<< The sending thread sends from here
|
++++++++++++++++++
|     BUFFER     |
++++++++++++++++++
                 |
                 + <<< Collection threads append here

The sending thread keeps a local variable tracking the next begin point.

(btw, if the network cannot push metrics at the rate they are collected, the buffer will become full and will be discarded).
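The mechanics above can be sketched roughly like this (a minimal single-threaded model; the struct and function names are illustrative, not netdata's actual identifiers, and the real code locks the buffer around each append):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative model of the slave's send buffer (hypothetical names). */
typedef struct {
    char data[1024];
    size_t len;              /* collection threads append up to here */
} send_buffer;

/* Collection threads: lock, append, unlock (locking omitted here). */
static void buffer_append(send_buffer *b, const char *s) {
    size_t n = strlen(s);
    if (b->len + n <= sizeof(b->data)) {  /* a full buffer gets discarded */
        memcpy(b->data + b->len, s, n);
        b->len += n;
    }
}

/* Sending thread: push everything between `begin` and `len`,
   then advance `begin` past what was sent. */
static size_t sender_flush(const send_buffer *b, size_t *begin,
                           char *out, size_t outsz) {
    size_t avail = b->len - *begin;
    if (avail > outsz) avail = outsz;
    memcpy(out, b->data + *begin, avail);
    *begin += avail;
    return avail;
}
```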

On disconnects, we flush the buffer, so that the slave will start sending everything from the beginning. Since we don't know if the master server was restarted, the slave needs to send all the metadata about the metrics again.

The bug was that on disconnects, the begin variable was not reset. So whether the slave would recover was random, depending on the last value of the begin variable.

This PR fixes it.
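As a sketch of the fix (hypothetical names again, continuing the simplified model rather than netdata's real code): discarding the buffer on disconnect must also reset the sender's begin offset, otherwise begin can end up pointing past the now-empty buffer:

```c
#include <stddef.h>

/* Minimal model of the send buffer (hypothetical names). */
typedef struct {
    char data[1024];
    size_t len;              /* how much data the buffer holds */
} send_buffer;

/* On disconnect: discard pending data so all metadata is re-sent
   on reconnect, and reset the sender's begin offset. */
static void reconnect_reset(send_buffer *b, size_t *begin) {
    b->len = 0;   /* flushing the buffer, as before */
    *begin = 0;   /* the missing reset: without this line, begin keeps
                     its old value and may point beyond len, so the
                     sender never finds anything to send */
}
```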

@ktsaou ktsaou requested a review from vlvkobal October 17, 2018 17:31
@marcelmfs

Which released version has this fix? I'm using 1.11 and I'm experiencing this issue.

@paulfantom
Contributor

At least v1.11.1

@marcelmfs

Using 1.11.1-68-g0fad9bf5 and seeing a lot of disconnects

@paulfantom
Contributor

I don't know why this is happening. Could you create a bug report issue so we can properly track it? Also, it would be good to try it with v1.12.1.

@kfo2010

kfo2010 commented Feb 28, 2019

I was having a similar issue...
PROBLEM: After cloning a box, the netdata service starts dropping data as if there were some sort of conflict going on.

SOLUTION: Make sure the machine ID of each streaming slave (not the central master) is unique across all your streaming servers:
# uuidgen > /var/lib/netdata/registry/netdata.public.unique.id;

@cakrit
Contributor

cakrit commented Feb 28, 2019

Indeed @KyleFromOhio , see also https://docs.netdata.cloud/streaming/#netdata-unique-id

#5511 is also related, but there we only check for the master GUID vs the slave GUID.



Successfully merging this pull request may close these issues.

stream client connected to server but constantly failed to read

6 participants