fix streaming bug#4425

Merged
ktsaou merged 1 commit into netdata:master from ktsaou:fix-streaming-denial
Oct 17, 2018
Conversation

Member

@ktsaou ktsaou commented Oct 17, 2018

fixes #4370

There was a bug in the streaming functionality of the slaves:

If for any reason the connection from the slave to the master was interrupted, the slave was randomly unable to recover and push metrics to the master.

The reason was that a local variable was not reset to zero, when the slave discarded the buffer of data to be sent. In more detail:

The slave runs data collection in multiple threads. All these threads append data to a single buffer: each one locks it, appends its data, and releases the lock. So the buffer continuously grows in size.

Another thread of the slave attempts to push the buffer to the remote netdata, as fast as the network can do it. So, the buffer is being streamed from the other end:

+ <<< The sending thread sends from here
|
++++++++++++++++++
|     BUFFER     |
++++++++++++++++++
                 |
                 + <<< Collection threads append here

The sending thread keeps a local variable tracking the next begin point.

(btw, if the network cannot push metrics at the rate they are collected, the buffer will become full and will be discarded).
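The mechanics above can be sketched roughly like this (a minimal single-threaded model; the struct and function names are illustrative, not netdata's actual identifiers, and the real code locks the buffer around each append):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative model of the slave's send buffer (hypothetical names). */
typedef struct {
    char data[1024];
    size_t len;              /* collection threads append up to here */
} send_buffer;

/* Collection threads: lock, append, unlock (locking omitted here). */
static void buffer_append(send_buffer *b, const char *s) {
    size_t n = strlen(s);
    if (b->len + n <= sizeof(b->data)) {  /* a full buffer gets discarded */
        memcpy(b->data + b->len, s, n);
        b->len += n;
    }
}

/* Sending thread: push everything between `begin` and `len`,
   then advance `begin` past what was sent. */
static size_t sender_flush(const send_buffer *b, size_t *begin,
                           char *out, size_t outsz) {
    size_t avail = b->len - *begin;
    if (avail > outsz) avail = outsz;
    memcpy(out, b->data + *begin, avail);
    *begin += avail;
    return avail;
}
```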

On disconnects, we flush the buffer, so that the slave will start sending everything from the beginning. Since we don't know if the master server was restarted, the slave needs to send all the metadata about the metrics again.

The bug was that on disconnects, the begin variable was not reset. So whether the slave would recover was random, depending on the last value of the begin variable.

This PR fixes it.
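As a sketch of the fix (hypothetical names again, continuing the simplified model rather than netdata's real code): discarding the buffer on disconnect must also reset the sender's begin offset, otherwise begin can end up pointing past the now-empty buffer:

```c
#include <stddef.h>

/* Minimal model of the send buffer (hypothetical names). */
typedef struct {
    char data[1024];
    size_t len;              /* how much data the buffer holds */
} send_buffer;

/* On disconnect: discard pending data so all metadata is re-sent
   on reconnect, and reset the sender's begin offset. */
static void reconnect_reset(send_buffer *b, size_t *begin) {
    b->len = 0;   /* flushing the buffer, as before */
    *begin = 0;   /* the missing reset: without this line, begin keeps
                     its old value and may point beyond len, so the
                     sender never finds anything to send */
}
```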

@ktsaou ktsaou requested a review from vlvkobal October 17, 2018 17:31
@marcelmfs

Which released version has this fix? I'm using 1.11 and I'm experiencing this issue.

@paulfantom
Contributor

At least v1.11.1

@marcelmfs

Using 1.11.1-68-g0fad9bf5 and seeing a lot of disconnects

@paulfantom
Contributor

I don't know why this is happening. Could you create a bug report issue so we can properly track it? Also, it would be good to try it with v1.12.1.

@kfo2010

kfo2010 commented Feb 28, 2019

I was having a similar issue...
PROBLEM: After cloning a box, the netdata service starts dropping data as if there were some sort of conflict going on.

SOLUTION: Make sure the machine ID of each streaming slave (not the central master) is unique across all your streaming servers:
# uuidgen > /var/lib/netdata/registry/netdata.public.unique.id;

@cakrit
Contributor

cakrit commented Feb 28, 2019

Indeed @KyleFromOhio , see also https://docs.netdata.cloud/streaming/#netdata-unique-id

#5511 is also related, but there we only check for the master GUID vs the slave GUID.



Successfully merging this pull request may close these issues.

stream client connected to server but constantly failed to read

6 participants