New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Log the first heartbeat after a series of missed #5629
Conversation
You're reintroducing potential I/O on this critical execution path. We cannot do this. |
@tinwelint does that thread only handle heartbeats, or does it handle other events as well? Because looking at that event handling code, there are other events that trigger logging. |
Clustering has thread pools both for sending, receiving and processing messages. It's about any thread that is inside the heartbeat state machine doing anything with heartbeats that is sensitive. It can be a different thread each time, but since state machine processing is synchronous then only one at a time, if that matters. |
Marked as not ready for merge until we (re)move the synchronous logging from this code path. |
For being extra safe there should be no synchronous logging in the heartbeat state machine. Whether or not some paths are not as critical I don't know yet. @digitalstain ? |
7084445
to
5aca97b
Compare
@digitalstain @tinwelint I've updated this with asynchronous logging for the clustering code. |
8f156ac
to
2d772c4
Compare
@thobe looks to be license headers problem. |
@tinwelint yes. Intellij added GPL headers to all new files, but some of the should have AGPL headers. I'll fix it. |
|
||
@RunWith( Parameterized.class ) | ||
public class AsyncLogTest | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
Summary (mostly for myself): Whereas this is a massive change. The actual change isn't. The only part I'm itchy about is the added header to |
2d772c4
to
9a991c0
Compare
@tinwelint yes, I was about to write almost that comment myself. Most of the added code is actually test code. |
|
||
@Override | ||
public void run() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we should start this method with spinning on CASing from shutdownSentinel
to endSentinel
to handle threads that submit events before we have started running.
@chrisvest what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think it is currently possible for the background thread to race with senders that think we've been shutdown. Senders with that impression will not call LockSupport.unpark
on the background thread, which in turn can miss wake ups and go to sleep forever.
We can either fix this by spinning in the background thread to CAS shutdownSentinel
to endSentinel
, or we can initialise the stack
field to null
and add a guard clause to the beginning of the send
method, that just process the event directly if the stack is observed to be null
. Either approach would work, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null
sounds like it would cause other issues. I'd prefer it if the state after shutting down is the same as the state before starting up (and vice versa). Having the background thread spin-CAS shutdownSentinel
to endSentinel
on start would do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. The latch-based mechanism for awaitTermination
I have will break restartability, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need restartability afaik
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed restartability.
Marking as NOT READY FOR MERGE to highlight that it's starting to get too late to include into 2.3.0 |
I'm thinking we should move the Shows how sensitive clustering is to IO waits from logging tests to |
{ | ||
try | ||
{ | ||
Thread.sleep( 1_000 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SECONDS.toMillis
Also general thing regarding async logging. |
@MishaDemianenko your good point about log timestamps struck me at some point as well. I think we should keep the timestamp as-is when the event is actually logged, otherwise we'd have to tear a bunch of internals of logging open or tear a hole in the interface to allow timestamps to be externally provided, which would probably not be a good thing. We could have a threshold and if crossed include how delayed that message is (or which ever is easiest to read: absolute time when event happened or relative to when it was logged). |
@tinwelint @MishaDemianenko adding an additional timestamp would be very easy to do. Perhaps we should do that? So we would have something like this end up in the log:
|
@thobe sounds good to me, it's easy to do and we will have event date logged. |
9a991c0
to
7a04f01
Compare
Sentinel( String identifier ) | ||
{ | ||
this.str = "AsyncEvent/Sentinel[" + identifier + "]"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
7a04f01
to
93e35d7
Compare
|
||
/** | ||
* Feature toggles are used for features that are possible to configure, but where the configuration is always fixed in | ||
* a production system. They are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit of missing content in They are
sentence
93e35d7
to
952ec1a
Compare
@MishaDemianenko I have removed the controversial bits from |
@thobe i think it make sense to move it close to Settings because it's just another way to tune product and move those two into |
Previously we logged all heartbeat messages. This is too much logging for the system to handle, so we removed it. This change introduces logging of the interesting heartbeat messages, the ones for when the heartbeat comes back after having missed a few beats. We already log heartbeat timeouts, so adding logging of when the heartbeat returns is sufficient for being able to reason about the heartbeats.
The context implementations were exposing a wider interface (LogService) than what was being used. This introduces a narrower interface to only expose what is needed.
The clustering code was passing around a LogService, which is a much more capable interface than what it needed. In fact all usages just used the internal log provider. The needs of the clustering system can be fulfilled by passing a LogProvider to the various components instead.
The AsyncEvents concurrency primitive can be used for very quickly enqueueing events to be processed in a background thread. This will be useful for building asynchronous logging.
Asynchronous logging is implemented by simply delegating to AsyncEvents.
FeatureToggles is a helper class that makes it easer to use java System properties to control certain features. This is done by providing methods for boolean flag values (with a default value), integer parameters, and enum flags. All feature toggle System parameter names are fully qualified with a class name. The intended use case is for the FeatureToggles methods to be invoked from a static context and the value stored in a static final field. The default value is used if the parameter is not set or if the parameter has an unrecognized value.
The clustering code is extremely time sensitive. If logging gets delayed waiting for I/O the cluster could become unstable. Therefore all logging in the clustering code is made asynchronous, in order to ensure that it does not block.
952ec1a
to
becd524
Compare
How are things in here? Have we resolved:
? |
@tinwelint timestamps for actual log-event time has been added: https://github.com/neo4j/neo4j/pull/5629/files#diff-98b920bc20ec71b0684b5d65867fd074R150 The FeatureToggles have been stripped down to the essential bits, the hack has been removed that made them controversial to some. The question is of package. Since the use cases for it are spread over the "early" modules ( |
@thobe OK I agree that's good enough for now |
Log the first heartbeat after a series of missed
Previously we logged all heartbeat messages. This is too much logging for the system to handle, so we removed it. This change introduces logging of the interesting heartbeat messages, the ones for when the heartbeat comes back after having missed a few beats.
In order to ensure that logging from anywhere in the clustering code does not affect the cluster stability, all logging from the clustering code is made asynchronous.