
allow simultaneous launching of 2+ worker processes per host #78

Conversation

erikdw
Collaborator

@erikdw erikdw commented Dec 4, 2015

One of the logic changes in PR #65 broke the ability to simultaneously launch
more than 1 worker process for a given topology. The cause of the breakage
was intentionally avoiding inclusion of the executor's resources in the
ExecutorInfo structure associated with the mesos tasks (storm workers).
This is problematic because the mesos-master rejects tasks whose ExecutorInfo
isn't identical to that of other tasks under the same executor. Notably, since
the executor is already running by the time subsequent tasks are launched, the
resources added to those tasks' ExecutorInfo aren't actually used, so there
is no advantage in attempting to avoid their inclusion.

FFR, this is the commit with that change:
* af8c49b

After this fix I was able to instantly launch 3 workers for a topology
on the same mesos-slave host.
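
For readers less familiar with the underlying constraint: every TaskInfo launched under a given executor must carry an identical ExecutorInfo, executor resources included. Below is a minimal sketch of that shape using the Mesos Java protobufs; the IDs, command, and resource amounts are illustrative placeholders, not this framework's actual code.

```java
import org.apache.mesos.Protos.*;

// Illustrative sketch only: build one ExecutorInfo (with the executor's own
// cpu/mem resources included) and reuse it verbatim for every worker task on
// the slave, so the mesos-master sees identical ExecutorInfos for all of them.
public class ExecutorInfoSketch {

  static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }

  static TaskInfo worker(String id, ExecutorInfo executor) {
    return TaskInfo.newBuilder()
        .setName(id)
        .setTaskId(TaskID.newBuilder().setValue(id))
        .setSlaveId(SlaveID.newBuilder().setValue("some-slave"))  // hypothetical slave id
        .setExecutor(executor)            // same ExecutorInfo for every worker task
        .addResources(scalar("cpus", 1))  // the worker's own resources
        .addResources(scalar("mem", 1024))
        .build();
  }

  public static void main(String[] args) {
    ExecutorInfo executor = ExecutorInfo.newBuilder()
        .setName("some-topology")
        .setExecutorId(ExecutorID.newBuilder().setValue("some-topology"))
        .setCommand(CommandInfo.newBuilder().setValue("./storm-mesos-executor"))  // placeholder
        .addResources(scalar("cpus", 0.1))  // executor overhead, placeholder amounts
        .addResources(scalar("mem", 500))
        .build();

    // Launching worker-1 and worker-2 with this same ExecutorInfo is accepted;
    // stripping the resources from the second copy would make the two differ,
    // and the mesos-master would reject the later task.
    TaskInfo worker1 = worker("worker-1", executor);
    TaskInfo worker2 = worker("worker-2", executor);
  }
}
```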

@erikdw
Collaborator Author

erikdw commented Dec 4, 2015

@brndnmtthws & @dsKarthick & @JessicaLHartog: please 👀 when you have ⌚

@erikdw erikdw force-pushed the allow-simultaneous-launching-of-more-than-1-worker-per-host branch from 641bd50 to f9d9f89 on December 4, 2015 22:16
One of the logic changes in PR mesos#65 broke the ability to simultaneously launch
more than 1 worker process for a given topology.  The cause of the breakage
was intentionally avoiding inclusion of the executor's resources into the
ExecutorInfo structure associated with the mesos tasks (storm workers).
Notably, the avoidance is only triggered for tasks other than the 1st one
that potentially launches the executor.

This is problematic because the mesos-master rejects tasks whose ExecutorInfo
isn't identical to other tasks under the same executor.  Notably, since the
executor is already running for subsequent tasks, the resources that are
added to these subsequent tasks' ExecutorInfo aren't actually used, so there
is no advantage in attempting to avoid their inclusion.

FFR, this is the commit with that change:
* af8c49b

After this fix I was able to instantly launch 3 workers for a topology
on the same mesos-slave host.
@erikdw erikdw force-pushed the allow-simultaneous-launching-of-more-than-1-worker-per-host branch from f9d9f89 to 2fbd817 on December 4, 2015 22:21
@brndnmtthws
Member

Ah, makes sense.

:shipit:

@erikdw
Collaborator Author

erikdw commented Dec 4, 2015

@@ -789,15 +789,11 @@ protected void computeResourcesForSlot(final RotatingMap<OfferID, Offer> offers,
      executorInfoBuilder
        .setName(executorName)
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr));
      if (!subtractedExecutorResources) {
        subtractedExecutorResources = true;
Collaborator

I think we need to add this assignment back in. With this change it no longer gets set to true in the rest of the code, and if subtractedExecutorResources means the same as it did before PR #65, this will mean that we're over-provisioning resources and accounting for the executor's resources in every task, instead of just in the first one.

@brndnmtthws Am I correct in assuming subtractedExecutorResources does essentially the same thing before and after PR #65?

Collaborator Author

Ah, thanks for pointing that out, definitely a screw-up by me in removing the assignment to subtractedExecutorResources there and not reintroducing it in the place where it used to be set to true.


Member

Yes, that looks right.

Collaborator Author

Soo... my fix is buggy. The 2nd time through the loop, subtractedExecutorResources is already true, so it won't enter the block that sets the executorCpuResources and executorMemResources variables, which causes an NPE further down when building the ExecutorInfo.

The immediate fix is simple: just set those variables unconditionally... but that led me to read the logviewer port-acquisition logic. I'm pretty sure it has multiple issues. Gotta test more...
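
For concreteness, here is a rough sketch of the loop shape being described: the executor resource variables get assigned on every pass, while the subtraction from the offer happens only on the first pass. The method and parameter names below are hypothetical stand-ins, not the actual computeResourcesForSlot() code.

```java
// Sketch only: mirrors the control flow described above, not the real
// computeResourcesForSlot() implementation (names below are hypothetical).
class SlotResourceSketch {
  static void assignResourcesForWorkers(int workerCount,
                                        double requestedExecutorCpu,
                                        double requestedExecutorMem,
                                        double[] offerCpuMem /* {cpus, mem} left in offer */) {
    boolean subtractedExecutorResources = false;
    Double executorCpuResources;
    Double executorMemResources;

    for (int i = 0; i < workerCount; i++) {
      // Always populate these, so building the ExecutorInfo further down never
      // sees nulls, even on passes after the executor was already accounted for.
      executorCpuResources = requestedExecutorCpu;
      executorMemResources = requestedExecutorMem;

      if (!subtractedExecutorResources) {
        // Only the first worker on this slave pays for the executor out of the offer.
        offerCpuMem[0] -= executorCpuResources;
        offerCpuMem[1] -= executorMemResources;
        subtractedExecutorResources = true;
      }

      // ... build this worker's TaskInfo/ExecutorInfo here, using
      // executorCpuResources and executorMemResources ...
    }
  }
}
```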

Member

Maybe we can write a unit test to validate the behaviour? And also catch regressions.

Collaborator Author

That's definitely the ideal. I'll look into how to adjust the unit tests.

erikdw added a commit that referenced this pull request Dec 4, 2015
…re-than-1-worker-per-host

allow simultaneous launching of 2+ worker processes per host
@erikdw erikdw merged commit 7deb029 into mesos:master Dec 4, 2015
@erikdw erikdw deleted the allow-simultaneous-launching-of-more-than-1-worker-per-host branch December 4, 2015 22:29
erikdw added a commit to erikdw/storm-mesos that referenced this pull request Dec 10, 2015
That commit (the one merged via this PR) introduced a NullPointerException crash:

After the first pass through the `computeResourcesForSlot()` main loop, the
`executorCpuResources` and `executorMemResources` variables are not populated,
and thus incorrectly null, because `subtractedExecutorResources` was set to
true on the 1st pass.

The fix is to unconditionally set them to the requested resource amounts,
and only subtract the resources out of the offer on the 1st pass.

Also an unrelated change:  clarify that the reservations that are being skipped
in a couple functions are *dynamic* reservations.
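
As a rough illustration of that last point: in the Mesos Java protobufs, a dynamically reserved resource is the one carrying a ReservationInfo, while a static reservation only has a non-default role. A sketch of the distinction, assuming a Mesos version with ReservationInfo support (this is not the project's actual filtering code):

```java
import org.apache.mesos.Protos.Resource;

// Sketch: telling dynamically reserved resources (the ones being skipped)
// apart from unreserved and statically reserved ones.
class ReservationSketch {
  static boolean isDynamicallyReserved(Resource resource) {
    // Dynamic reservations carry a ReservationInfo message.
    return resource.hasReservation();
  }

  static boolean isStaticallyReserved(Resource resource) {
    // Static reservations have a role other than the default "*" but no ReservationInfo.
    return !resource.hasReservation() && !"*".equals(resource.getRole());
  }
}
```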