Conversation

Which released version has this fix? I'm using 1.11 and I'm experiencing this issue.

At least v1.11.1

Using 1.11.1-68-g0fad9bf5 and seeing a lot of disconnects

I don't know why this is happening. Could you create a bug report issue so we could properly track it? Also it would be good to try it with v1.12.1

I was having a similar issue...

Indeed @KyleFromOhio, see also https://docs.netdata.cloud/streaming/#netdata-unique-id. #5511 is also related, but there we only check the master GUID against the slave GUID.
fixes #4370
There was a bug in the streaming functionality of the slaves:
If for any reason the connection from the slave to the master was interrupted, the slave was randomly unable to recover and push metrics to the master.
The reason was that a local variable was not reset to zero when the slave discarded the buffer of data to be sent. In more detail:
The slave runs data collection in multiple threads. All these threads append data to a buffer: they lock it, append their data, and release the lock. So, the buffer is continuously increasing in size.
Another thread of the slave attempts to push the buffer to the remote netdata, as fast as the network can do it. So, the buffer is being streamed from the other end: the sending thread keeps a local variable tracking the next `begin` point.
(By the way, if the network cannot push metrics at the rate they are collected, the buffer will become full and will be discarded.)
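The buffer handling described above can be sketched roughly as follows. This is a minimal illustration, not netdata's actual code: `struct stream_buffer`, `stream_buffer_append`, `stream_buffer_take`, and the tiny `STREAM_BUFFER_SIZE` are all hypothetical names, and for brevity the `begin` offset is stored in the struct under the lock rather than as a local variable of the sending thread.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define STREAM_BUFFER_SIZE 32   /* tiny on purpose, so overflow is easy to trigger */

/* Hypothetical shared buffer; netdata's real structures differ. */
struct stream_buffer {
    pthread_mutex_t lock;
    char   data[STREAM_BUFFER_SIZE];
    size_t len;        /* bytes appended by the collectors */
    size_t begin;      /* offset of the next byte to send */
    size_t discards;   /* times the buffer filled up and was dropped */
};

static void stream_buffer_init(struct stream_buffer *b) {
    pthread_mutex_init(&b->lock, NULL);
    b->len = 0;
    b->begin = 0;
    b->discards = 0;
}

/* Collector threads call this: lock, append, unlock. */
static void stream_buffer_append(struct stream_buffer *b, const char *text) {
    size_t n = strlen(text);
    pthread_mutex_lock(&b->lock);
    if (b->len + n > STREAM_BUFFER_SIZE) {
        /* The sender fell behind: drop everything queued so far and
         * reset the send offset together with the length. */
        b->len = 0;
        b->begin = 0;
        b->discards++;
    }
    if (n <= STREAM_BUFFER_SIZE) {
        memcpy(b->data + b->len, text, n);
        b->len += n;
    }
    pthread_mutex_unlock(&b->lock);
}

/* The sending thread calls this: copy at most `max` unsent bytes into
 * `out`, advance `begin`, and return how many bytes were copied. */
static size_t stream_buffer_take(struct stream_buffer *b, char *out, size_t max) {
    pthread_mutex_lock(&b->lock);
    size_t n = b->len - b->begin;
    if (n > max)
        n = max;
    memcpy(out, b->data + b->begin, n);
    b->begin += n;
    if (b->begin == b->len) {
        /* Everything was sent: reuse the buffer from the start. */
        b->begin = 0;
        b->len = 0;
    }
    pthread_mutex_unlock(&b->lock);
    return n;
}
```

In this sketch, every path that empties the buffer resets `len` and `begin` together, which keeps `begin <= len` true at all times.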
On disconnects, we flush the buffer, so that the slave will start sending everything from the beginning. Since we don't know if the master server was restarted, the slave needs to send all the metadata about the metrics again.
The bug was that on disconnects, the `begin` variable was not reset. So whether the slave recovered was random, depending on the last value of the `begin` variable.
This PR fixes it.
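To make the failure mode concrete, here is a small single-threaded sketch of a stale send offset. It is an assumption-laden illustration rather than the patched netdata code: `struct sender`, `sender_queue`, `sender_push`, and the two disconnect handlers are hypothetical names, and actual network I/O is replaced by simply advancing the offset.

```c
#include <stddef.h>
#include <string.h>

#define SEND_BUFFER_SIZE 128

/* Hypothetical sender state; names are illustrative, not netdata's. */
struct sender {
    char   buffer[SEND_BUFFER_SIZE];
    size_t len;    /* bytes currently queued */
    size_t begin;  /* offset of the next byte to transmit */
};

/* Queue text for sending (overflow handling omitted in this sketch). */
static void sender_queue(struct sender *s, const char *text) {
    size_t n = strlen(text);
    if (s->len + n > SEND_BUFFER_SIZE)
        return;
    memcpy(s->buffer + s->len, text, n);
    s->len += n;
}

/* Pretend to transmit everything between `begin` and `len`;
 * returns the number of bytes that would go on the wire. */
static size_t sender_push(struct sender *s) {
    size_t n = s->len > s->begin ? s->len - s->begin : 0;
    s->begin += n;
    return n;
}

/* Buggy variant: the buffer is flushed, but `begin` keeps its old value. */
static void sender_disconnect_buggy(struct sender *s) {
    s->len = 0;
}

/* Fixed variant: flushing the buffer also resets `begin`. */
static void sender_disconnect_fixed(struct sender *s) {
    s->len = 0;
    s->begin = 0;
}
```

With the buggy handler, a sender that had already pushed 21 bytes before the disconnect keeps `begin` at 21. After the metadata is queued again (say 10 bytes), `sender_push` finds nothing to send, because the stale offset lies past the new data. With a smaller stale offset it would instead start in the middle of the new data, which is consistent with the recovery being random. With the fixed handler, the same sequence sends all 10 bytes from the beginning.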