Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log the first heartbeat after a series of missed #5629

Merged
merged 7 commits into from Nov 25, 2015

Conversation

thobe
Copy link
Contributor

@thobe thobe commented Oct 14, 2015

Previously we logged all heartbeat messages. This is too much logging for the system to handle, so we removed it. This change introduces logging of the interesting heartbeat messages, the ones for when the heartbeat comes back after having missed a few beats.

In order to ensure that logging from anywhere in the clustering code does not affect the cluster stability, all logging from the clustering code is made asynchronous.

@tinwelint
Copy link
Member

You're reintroducing potential I/O on this critical execution path. We cannot do this.

@thobe
Copy link
Contributor Author

thobe commented Oct 14, 2015

@tinwelint does that thread only handle heartbeats, or does it handle other events as well? Because looking at that event handling code, there are other events that trigger logging.

@tinwelint
Copy link
Member

Clustering has thread pools both for sending, receiving and processing messages. It's about any thread that is inside the heartbeat state machine doing anything with heartbeats that is sensitive. It can be a different thread each time, but since state machine processing is synchronous then only one at a time, if that matters.

@tinwelint
Copy link
Member

Marked as not ready for merge until we (re)move the synchronous logging from this code path.

@tinwelint
Copy link
Member

For being extra safe there should be no synchronous logging in the heartbeat state machine. Whether or not some paths are not as critical I don't know yet. @digitalstain ?

@thobe
Copy link
Contributor Author

thobe commented Oct 15, 2015

@digitalstain @tinwelint I've updated this with asynchronous logging for the clustering code.

@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch 2 times, most recently from 8f156ac to 2d772c4 Compare October 15, 2015 16:30
@tinwelint
Copy link
Member

@thobe looks to be license headers problem.

@tinwelint tinwelint assigned tinwelint and unassigned digitalstain Oct 15, 2015
@thobe
Copy link
Contributor Author

thobe commented Oct 16, 2015

@tinwelint yes. Intellij added GPL headers to all new files, but some of the should have AGPL headers. I'll fix it.


@RunWith( Parameterized.class )
public class AsyncLogTest
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@tinwelint
Copy link
Member

Summary (mostly for myself): Whereas this is a massive change. The actual change isn't. The only part I'm itchy about is the added header to Message and that handling, but mostly because I'm not as familiar/comfortable with that code as I could be. The asynchronous logging part were pulled from another PR, which have been quite tested already and the rest is about LogService --> LogProvider refactoring in clustering code.

@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch from 2d772c4 to 9a991c0 Compare October 16, 2015 12:23
@thobe
Copy link
Contributor Author

thobe commented Oct 16, 2015

@tinwelint yes, I was about to write almost that comment myself. Most of the added code is actually test code.


@Override
public void run()
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should start this method with spinning on CASing from shutdownSentinel to endSentinel to handle threads that submit events before we have started running.

@chrisvest what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think it is currently possible for the background thread to race with senders that think we've been shutdown. Senders with that impression will not call LockSupport.unpark on the background thread, which in turn can miss wake ups and go to sleep forever.

We can either fix this by spinning in the background thread to CAS shutdownSentinel to endSentinel, or we can initialise the stack field to null and add a guard clause to the beginning of the send method, that just process the event directly if the stack is observed to be null. Either approach would work, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null sounds like it would cause other issues. I'd prefer it if the state after shutting down is the same as the state before starting up (and vice versa). Having the background thread spin-CAS shutdownSentinel to endSentinel on start would do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. The latch-based mechanism for awaitTermination I have will break restartability, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need restartability afaik

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed restartability.

@tinwelint
Copy link
Member

Marking as NOT READY FOR MERGE to highlight that it's starting to get too late to include into 2.3.0

@thobe
Copy link
Contributor Author

thobe commented Oct 19, 2015

I'm thinking we should move the Shows how sensitive clustering is to IO waits from logging tests to ha-robustness.

{
try
{
Thread.sleep( 1_000 );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SECONDS.toMillis

@MishaDemianenko
Copy link
Contributor

Also general thing regarding async logging.
Now date of logging message will be generated when logger will actually log that message. Which will be in some hopefully close future.
And it will be just another task to find out when exactly that message was produced, which sometimes is really important.
What do you thing about having date as part of AsyncEvent or any other wrapper that will wrap message with (or simply prepend message with that date) to preserve actual message log time?

@tinwelint
Copy link
Member

@MishaDemianenko your good point about log timestamps struck me at some point as well. I think we should keep the timestamp as-is when the event is actually logged, otherwise we'd have to tear a bunch of internals of logging open or tear a hole in the interface to allow timestamps to be externally provided, which would probably not be a good thing. We could have a threshold and if crossed include how delayed that message is (or which ever is easiest to read: absolute time when event happened or relative to when it was logged).

@thobe
Copy link
Contributor Author

thobe commented Oct 20, 2015

@tinwelint @MishaDemianenko adding an additional timestamp would be very easy to do. Perhaps we should do that? So we would have something like this end up in the log:

2015-10-19 07:43:13.497+0000 INFO [o.n.c.c.ClusterClient] [AsyncEvent @ 2015-10-19 07:43:14.862+0000] This is a sample log message

@MishaDemianenko
Copy link
Contributor

@thobe sounds good to me, it's easy to do and we will have event date logged.

@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch from 9a991c0 to 7a04f01 Compare October 21, 2015 15:32
Sentinel( String identifier )
{
this.str = "AsyncEvent/Sentinel[" + identifier + "]";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch from 7a04f01 to 93e35d7 Compare October 22, 2015 09:59

/**
* Feature toggles are used for features that are possible to configure, but where the configuration is always fixed in
* a production system. They are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit of missing content in They are sentence

@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch from 93e35d7 to 952ec1a Compare November 11, 2015 09:28
@thobe
Copy link
Contributor Author

thobe commented Nov 11, 2015

@MishaDemianenko I have removed the controversial bits from FeatureToggles, now I'm open for moving it to a different package, if you have a good suggestion for what that package might be...

@MishaDemianenko
Copy link
Contributor

@thobe i think it make sense to move it close to Settings because it's just another way to tune product and move those two into org.neo4j.helpers.settings or similar

thobe and others added 7 commits November 11, 2015 14:07
Previously we logged all heartbeat messages. This is too much logging
for the system to handle, so we removed it. This change introduces
logging of the interesting heartbeat messages, the ones for when the
heartbeat comes back after having missed a few beats. We already log
heartbeat timeouts, so adding logging of when the heartbeat returns is
sufficient for being able to reason about the heartbeats.
The context implementations were exposing a wider interface (LogService)
than what was being used. This introduces a narrower interface to only
expose what is needed.
The clustering code was passing around a LogService, which is a much
more capable interface than what it needed. In fact all usages just used
the internal log provider. The needs of the clustering system can be
fulfilled by passing a LogProvider to the various components instead.
The AsyncEvents concurrency primitive can be used for very quickly enqueueing events to be processed in a background thread.

This will be useful for building asynchronous logging.
Asynchronous logging is implemented by simply delegating to AsyncEvents.
FeatureToggles is a helper class that makes it easer to use java System
properties to control certain features. This is done by providing
methods for boolean flag values (with a default value), integer
parameters, and enum flags. All feature toggle System parameter names
are fully qualified with a class name.
The intended use case is for the FeatureToggles methods to be invoked
from a static context and the value stored in a static final field.
The default value is used if the parameter is not set or if the
parameter has an unrecognized value.
The clustering code is extremely time sensitive. If logging gets delayed
waiting for I/O the cluster could become unstable. Therefore all logging
in the clustering code is made asynchronous, in order to ensure that it
does not block.
@thobe thobe force-pushed the 2.3-log-returning-heartbeat branch from 952ec1a to becd524 Compare November 11, 2015 13:07
@tinwelint
Copy link
Member

How are things in here? Have we resolved:

  • Timestamps for actual logging
  • FeatureToggles package

?

@thobe
Copy link
Contributor Author

thobe commented Nov 13, 2015

@tinwelint timestamps for actual log-event time has been added: https://github.com/neo4j/neo4j/pull/5629/files#diff-98b920bc20ec71b0684b5d65867fd074R150
Although not for bulk logging.

The FeatureToggles have been stripped down to the essential bits, the hack has been removed that made them controversial to some. The question is of package. Since the use cases for it are spread over the "early" modules (kernel and the modules it depends on) it has been put in the unsafe module, since all other things depend on this. They aren't unsafe, so it doesn't really feel right for it to be defined here though, but it's hard to find a good home in an existing module. Since that is the module the class is in, it is in the only package in that module. Since I don't think we will expand the use of FeatureToggles in 2.3, I think this is good enough, and we can focus on finding a better home for it in 3.0, and defer the decision of where it actually goes until later. The place where it is now is a sufficiently discouraging place that we should have no trouble moving it without prior notice.

@tinwelint
Copy link
Member

@thobe OK I agree that's good enough for now

tinwelint added a commit that referenced this pull request Nov 25, 2015
Log the first heartbeat after a series of missed
@tinwelint tinwelint merged commit f9e1f7a into neo4j:2.3 Nov 25, 2015
@thobe thobe deleted the 2.3-log-returning-heartbeat branch November 26, 2015 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants