Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce and use Subject to fix race-condition when sending last leadership transition #6345

Merged
merged 4 commits into from Jul 16, 2018

Conversation

timcharper
Copy link
Contributor

@timcharper timcharper commented Jul 6, 2018

This commit fixes a race condition that existed in ElectionService.subscribe. The old logic would:

  1. Add a subscription to the event stream
  2. Send the last known election state value

It was possible that between 1-2, a new value was available, and that this value was published to the subscriber before the last known election state value was updated. This would result in step 2 sending a stale election state value after a newer one was published.

To solve, we introduce a Subject component, which provides the following guarantees:

  1. The last received event will be propagated to new subscribers
  2. The last received event will not be propagated out of order
  3. All events following the initially published last received event will be delivered.

JIRA Issues: MARATHON-8305

…dership transition

JIRA Issues: MARATHON-8305
@timcharper timcharper requested a review from zen-dog July 6, 2018 07:22
Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm building your change at jenkins-marathon-pipelines-PR-6345-1.

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm building your change at jenkins-marathon-pipelines-PR-6345-2.

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔ Build of #6345 completed successfully.

See details at jenkins-marathon-pipelines-PR-6345-1.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.8-cb56a9435/marathon-1.7.8-cb56a9435.tgz",
"sha1": "c060161ef1893c24720d4007582bcbf0e30bc726"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6345.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✗ Build of #6345 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Copy link
Contributor

@zen-dog zen-dog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand where the race was or how you fixed it. Could you please elaborate?

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm building your change at jenkins-marathon-pipelines-PR-6345-3.

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✗ Build of #6345 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm building your change at jenkins-marathon-pipelines-PR-6345-4.

Copy link
Collaborator

@mesosphere-ci mesosphere-ci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔ Build of #6345 completed successfully.

See details at jenkins-marathon-pipelines-PR-6345-4.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.7.12-add9d7120/marathon-1.7.12-add9d7120.tgz",
"sha1": "ea2353e6e161e105ed46914e33bf7d1256ef8bba"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 6345.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

@timcharper
Copy link
Contributor Author

I struggled with introducing the Subject abstraction. I ultimately decided it was the last hacky way to solve this. I explored versioning the leadership transition states and then having logic to drop out-of-order events, but this did not provided guarantees that we would completely miss an update in the following scenario:

  1. Transition event 2 is in flight
  2. We add out subscription, just barely missing transition event 2
  3. ElectionService re-publishes the last seen transition event, event 1 (initial state)
  4. ElectionService receives transition event 2 and updates it's local value

In this case, we would completely lose event 2 and the subscriber would erroneously think that event 1 is the current value.

@timcharper
Copy link
Contributor Author

Even though Subject seems to work under pretty heavy testing, I'm still unsure about introducing it. I tried an alternate approach involving versioning to resolve the race without introducing another component. I'm undecided if it's better.

@timcharper
Copy link
Contributor Author

After consideration, I'm inclined to stick with the Subject implementation since it separates the concurrency complexity concern better from ElectionService. Also, it is much easier to test the implementation in isolation.

def onElement(t: T): Unit = synchronized {
elementPushed = true
lastElement = t
subscribers.foreach(_.offer(t))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of our usage of SourceQueue, logic handling downstream stages of the graph will happen in a different thread / outside of this call context. This mitigates the possibility of deadlock here.

@timcharper
Copy link
Contributor Author

There are other uses of Subject in our code base for RX. With the introduction of this component we can remove RX from our code base and move completely to Akka Streams, reducing the number of concurrency abstractions we are using.

Copy link
Contributor

@alenkacz alenkacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I did another pass and I am closer to approving :) Let's trigger SI tests for this and then I'll approve :)

@@ -87,14 +91,18 @@ class HttpEventStreamActor(
streamHandleActors += handle -> actor
}

private[stream] var isActive = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, we have this only because of testing? ... 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah :/ lame, I know. Options were:

  • Have the test send the messages directly (and bypass the stream). Lost coverage. Maybe okay.
  • Find a way to synchronously materialize an akka stream in tests. Same-thread dispatcher?
  • Find a way to pull out the current behavior context and compare the function reference to check the actor state

¯\_(ツ)_/¯ adding this was an easier option.

@alenkacz
Copy link
Contributor

@timcharper yeah in general I am not very happy about dragging such a non-trivial piece of code inside the code base (as the Subject is). But I believe you revised other options and this was the best one :)

@timcharper
Copy link
Contributor Author

I've reached out to the Akka development team about the process to propose the inclusion of Subject in Akka core. There isn't a lot of urgency to this PR, so we can hold the merge until we have a clear picture of where Subject will end up.

Copy link
Contributor

@alenkacz alenkacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this, I think the akka communication can take time especially in summer

@timcharper timcharper merged commit 72e0f86 into master Jul 16, 2018
@timcharper timcharper deleted the tharper/leader-election-subscribe-race-fix branch July 16, 2018 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants