Log the first heartbeat after a series of missed #5629

thobe · 2015-10-14T12:49:01Z

Previously we logged all heartbeat messages. This is too much logging for the system to handle, so we removed it. This change introduces logging of the interesting heartbeat messages, the ones for when the heartbeat comes back after having missed a few beats.

In order to ensure that logging from anywhere in the clustering code does not affect the cluster stability, all logging from the clustering code is made asynchronous.

tinwelint · 2015-10-14T15:11:15Z

You're reintroducing potential I/O on this critical execution path. We cannot do this.

thobe · 2015-10-14T15:16:49Z

@tinwelint does that thread only handle heartbeats, or does it handle other events as well? Because looking at that event handling code, there are other events that trigger logging.

tinwelint · 2015-10-14T15:19:52Z

Clustering has thread pools both for sending, receiving and processing messages. It's about any thread that is inside the heartbeat state machine doing anything with heartbeats that is sensitive. It can be a different thread each time, but since state machine processing is synchronous then only one at a time, if that matters.

tinwelint · 2015-10-14T16:03:59Z

Marked as not ready for merge until we (re)move the synchronous logging from this code path.

tinwelint · 2015-10-14T16:45:27Z

For being extra safe there should be no synchronous logging in the heartbeat state machine. Whether or not some paths are not as critical I don't know yet. @digitalstain ?

thobe · 2015-10-15T15:37:39Z

@digitalstain @tinwelint I've updated this with asynchronous logging for the clustering code.

tinwelint · 2015-10-15T22:14:36Z

@thobe looks to be license headers problem.

thobe · 2015-10-16T06:59:17Z

@tinwelint yes. Intellij added GPL headers to all new files, but some of the should have AGPL headers. I'll fix it.

chrisvest · 2015-10-16T07:18:17Z

community/logging/src/test/java/org/neo4j/logging/async/AsyncLogTest.java

+
+@RunWith( Parameterized.class )
+public class AsyncLogTest
+{


tinwelint · 2015-10-16T09:23:55Z

Summary (mostly for myself): Whereas this is a massive change. The actual change isn't. The only part I'm itchy about is the added header to Message and that handling, but mostly because I'm not as familiar/comfortable with that code as I could be. The asynchronous logging part were pulled from another PR, which have been quite tested already and the rest is about LogService --> LogProvider refactoring in clustering code.

thobe · 2015-10-16T12:26:45Z

@tinwelint yes, I was about to write almost that comment myself. Most of the added code is actually test code.

thobe · 2015-10-16T12:47:41Z

community/primitive-collections/src/main/java/org/neo4j/concurrent/AsyncEvents.java

+
+    @Override
+    public void run()
+    {


perhaps we should start this method with spinning on CASing from shutdownSentinel to endSentinel to handle threads that submit events before we have started running.

@chrisvest what do you think?

Yeah. I think it is currently possible for the background thread to race with senders that think we've been shutdown. Senders with that impression will not call LockSupport.unpark on the background thread, which in turn can miss wake ups and go to sleep forever.

We can either fix this by spinning in the background thread to CAS shutdownSentinel to endSentinel, or we can initialise the stack field to null and add a guard clause to the beginning of the send method, that just process the event directly if the stack is observed to be null. Either approach would work, I think.

null sounds like it would cause other issues. I'd prefer it if the state after shutting down is the same as the state before starting up (and vice versa). Having the background thread spin-CAS shutdownSentinel to endSentinel on start would do that.

Okay. The latch-based mechanism for awaitTermination I have will break restartability, I think.

We don't need restartability afaik

removed restartability.

tinwelint · 2015-10-16T17:00:06Z

Marking as NOT READY FOR MERGE to highlight that it's starting to get too late to include into 2.3.0

thobe · 2015-10-19T08:29:16Z

I'm thinking we should move the Shows how sensitive clustering is to IO waits from logging tests to ha-robustness.

MishaDemianenko · 2015-10-20T09:41:37Z

enterprise/ha/src/test/java/org/neo4j/ha/PerformTransactionsNowAndThenOnMaster.java

+            {
+                try
+                {
+                    Thread.sleep( 1_000 );


SECONDS.toMillis

MishaDemianenko · 2015-10-20T12:02:59Z

Also general thing regarding async logging.
Now date of logging message will be generated when logger will actually log that message. Which will be in some hopefully close future.
And it will be just another task to find out when exactly that message was produced, which sometimes is really important.
What do you thing about having date as part of AsyncEvent or any other wrapper that will wrap message with (or simply prepend message with that date) to preserve actual message log time?

tinwelint · 2015-10-20T12:35:42Z

@MishaDemianenko your good point about log timestamps struck me at some point as well. I think we should keep the timestamp as-is when the event is actually logged, otherwise we'd have to tear a bunch of internals of logging open or tear a hole in the interface to allow timestamps to be externally provided, which would probably not be a good thing. We could have a threshold and if crossed include how delayed that message is (or which ever is easiest to read: absolute time when event happened or relative to when it was logged).

thobe · 2015-10-20T12:41:14Z

@tinwelint @MishaDemianenko adding an additional timestamp would be very easy to do. Perhaps we should do that? So we would have something like this end up in the log:

2015-10-19 07:43:13.497+0000 INFO [o.n.c.c.ClusterClient] [AsyncEvent @ 2015-10-19 07:43:14.862+0000] This is a sample log message

MishaDemianenko · 2015-10-20T13:07:52Z

@thobe sounds good to me, it's easy to do and we will have event date logged.

chrisvest · 2015-10-22T09:26:01Z

community/primitive-collections/src/main/java/org/neo4j/concurrent/AsyncEvents.java

+        Sentinel( String identifier )
+        {
+            this.str = "AsyncEvent/Sentinel[" + identifier + "]";
+        }


MishaDemianenko · 2015-10-22T19:42:38Z

community/unsafe/src/main/java/org/neo4j/unsafe/impl/internal/dragons/FeatureToggles.java

+
+/**
+ * Feature toggles are used for features that are possible to configure, but where the configuration is always fixed in
+ * a production system. They are


a bit of missing content in They are sentence

thobe · 2015-11-11T09:43:02Z

@MishaDemianenko I have removed the controversial bits from FeatureToggles, now I'm open for moving it to a different package, if you have a good suggestion for what that package might be...

MishaDemianenko · 2015-11-11T10:50:12Z

@thobe i think it make sense to move it close to Settings because it's just another way to tune product and move those two into org.neo4j.helpers.settings or similar

Previously we logged all heartbeat messages. This is too much logging for the system to handle, so we removed it. This change introduces logging of the interesting heartbeat messages, the ones for when the heartbeat comes back after having missed a few beats. We already log heartbeat timeouts, so adding logging of when the heartbeat returns is sufficient for being able to reason about the heartbeats.

The context implementations were exposing a wider interface (LogService) than what was being used. This introduces a narrower interface to only expose what is needed.

The clustering code was passing around a LogService, which is a much more capable interface than what it needed. In fact all usages just used the internal log provider. The needs of the clustering system can be fulfilled by passing a LogProvider to the various components instead.

The AsyncEvents concurrency primitive can be used for very quickly enqueueing events to be processed in a background thread. This will be useful for building asynchronous logging.

Asynchronous logging is implemented by simply delegating to AsyncEvents.

FeatureToggles is a helper class that makes it easer to use java System properties to control certain features. This is done by providing methods for boolean flag values (with a default value), integer parameters, and enum flags. All feature toggle System parameter names are fully qualified with a class name. The intended use case is for the FeatureToggles methods to be invoked from a static context and the value stored in a static final field. The default value is used if the parameter is not set or if the parameter has an unrecognized value.

The clustering code is extremely time sensitive. If logging gets delayed waiting for I/O the cluster could become unstable. Therefore all logging in the clustering code is made asynchronous, in order to ensure that it does not block.

tinwelint · 2015-11-12T10:37:10Z

How are things in here? Have we resolved:

Timestamps for actual logging
FeatureToggles package

?

thobe · 2015-11-13T12:43:50Z

@tinwelint timestamps for actual log-event time has been added: https://github.com/neo4j/neo4j/pull/5629/files#diff-98b920bc20ec71b0684b5d65867fd074R150
Although not for bulk logging.

The FeatureToggles have been stripped down to the essential bits, the hack has been removed that made them controversial to some. The question is of package. Since the use cases for it are spread over the "early" modules (kernel and the modules it depends on) it has been put in the unsafe module, since all other things depend on this. They aren't unsafe, so it doesn't really feel right for it to be defined here though, but it's hard to find a good home in an existing module. Since that is the module the class is in, it is in the only package in that module. Since I don't think we will expand the use of FeatureToggles in 2.3, I think this is good enough, and we can focus on finding a better home for it in 3.0, and defer the decision of where it actually goes until later. The place where it is now is a sufficiently discouraging place that we should have no trouble moving it without prior notice.

tinwelint · 2015-11-17T09:46:07Z

@thobe OK I agree that's good enough for now

Log the first heartbeat after a series of missed

thobe added kernel 2.3 labels Oct 14, 2015

digitalstain self-assigned this Oct 14, 2015

tinwelint added the NOT READY FOR MERGE label Oct 14, 2015

thobe force-pushed the 2.3-log-returning-heartbeat branch from 7084445 to 5aca97b Compare October 15, 2015 15:35

thobe removed the NOT READY FOR MERGE label Oct 15, 2015

thobe force-pushed the 2.3-log-returning-heartbeat branch 2 times, most recently from 8f156ac to 2d772c4 Compare October 15, 2015 16:30

tinwelint assigned tinwelint and unassigned digitalstain Oct 15, 2015

chrisvest reviewed Oct 16, 2015
View reviewed changes

thobe force-pushed the 2.3-log-returning-heartbeat branch from 2d772c4 to 9a991c0 Compare October 16, 2015 12:23

thobe reviewed Oct 16, 2015
View reviewed changes

tinwelint added the NOT READY FOR MERGE label Oct 16, 2015

MishaDemianenko reviewed Oct 20, 2015
View reviewed changes

thobe force-pushed the 2.3-log-returning-heartbeat branch from 9a991c0 to 7a04f01 Compare October 21, 2015 15:32

chrisvest reviewed Oct 22, 2015
View reviewed changes

thobe force-pushed the 2.3-log-returning-heartbeat branch from 7a04f01 to 93e35d7 Compare October 22, 2015 09:59

MishaDemianenko reviewed Oct 22, 2015
View reviewed changes

thobe force-pushed the 2.3-log-returning-heartbeat branch from 93e35d7 to 952ec1a Compare November 11, 2015 09:28

thobe and others added 7 commits November 11, 2015 14:07

Introduce LoggingContext

3f5fb6d

The context implementations were exposing a wider interface (LogService) than what was being used. This introduces a narrower interface to only expose what is needed.

AsyncEvents concurrency primitive

628dc09

The AsyncEvents concurrency primitive can be used for very quickly enqueueing events to be processed in a background thread. This will be useful for building asynchronous logging.

Introduce asynchronous logging

db2e4f8

Asynchronous logging is implemented by simply delegating to AsyncEvents.

Use asynchronous logging in clustering code

becd524

The clustering code is extremely time sensitive. If logging gets delayed waiting for I/O the cluster could become unstable. Therefore all logging in the clustering code is made asynchronous, in order to ensure that it does not block.

thobe force-pushed the 2.3-log-returning-heartbeat branch from 952ec1a to becd524 Compare November 11, 2015 13:07

tinwelint removed the NOT READY FOR MERGE label Nov 17, 2015

tinwelint added a commit that referenced this pull request Nov 25, 2015

Merge pull request #5629 from thobe/2.3-log-returning-heartbeat

f9e1f7a

Log the first heartbeat after a series of missed

tinwelint merged commit f9e1f7a into neo4j:2.3 Nov 25, 2015

thobe deleted the 2.3-log-returning-heartbeat branch November 26, 2015 09:17

spacecowboy added the changelog label Jun 15, 2016

tinwelint added the team-kernel label Aug 25, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Log the first heartbeat after a series of missed #5629

Log the first heartbeat after a series of missed #5629

thobe commented Oct 14, 2015

tinwelint commented Oct 14, 2015

thobe commented Oct 14, 2015

tinwelint commented Oct 14, 2015

tinwelint commented Oct 14, 2015

tinwelint commented Oct 14, 2015

thobe commented Oct 15, 2015

tinwelint commented Oct 15, 2015

thobe commented Oct 16, 2015

chrisvest Oct 16, 2015

tinwelint commented Oct 16, 2015

thobe commented Oct 16, 2015

thobe Oct 16, 2015

chrisvest Oct 16, 2015

thobe Oct 16, 2015

chrisvest Oct 16, 2015

tinwelint Oct 16, 2015

thobe Oct 22, 2015

tinwelint commented Oct 16, 2015

thobe commented Oct 19, 2015

MishaDemianenko Oct 20, 2015

MishaDemianenko commented Oct 20, 2015

tinwelint commented Oct 20, 2015

thobe commented Oct 20, 2015

MishaDemianenko commented Oct 20, 2015

chrisvest Oct 22, 2015

MishaDemianenko Oct 22, 2015

thobe commented Nov 11, 2015

MishaDemianenko commented Nov 11, 2015

tinwelint commented Nov 12, 2015

thobe commented Nov 13, 2015

tinwelint commented Nov 17, 2015

Log the first heartbeat after a series of missed #5629

Log the first heartbeat after a series of missed #5629

Conversation

thobe commented Oct 14, 2015

tinwelint commented Oct 14, 2015

thobe commented Oct 14, 2015

tinwelint commented Oct 14, 2015

tinwelint commented Oct 14, 2015

tinwelint commented Oct 14, 2015

thobe commented Oct 15, 2015

tinwelint commented Oct 15, 2015

thobe commented Oct 16, 2015

Choose a reason for hiding this comment

tinwelint commented Oct 16, 2015

thobe commented Oct 16, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tinwelint commented Oct 16, 2015

thobe commented Oct 19, 2015

Choose a reason for hiding this comment

MishaDemianenko commented Oct 20, 2015

tinwelint commented Oct 20, 2015

thobe commented Oct 20, 2015

MishaDemianenko commented Oct 20, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thobe commented Nov 11, 2015

MishaDemianenko commented Nov 11, 2015

tinwelint commented Nov 12, 2015

thobe commented Nov 13, 2015

tinwelint commented Nov 17, 2015