allow simultaneous launching of 2+ worker processes per host #78
Conversation
@brndnmtthws & @dsKarthick & @JessicaLHartog: please 👀 when you have ⌚
One of the logic changes in PR mesos#65 broke the ability to simultaneously launch more than 1 worker process for a given topology.

The cause of the breakage was intentionally avoiding inclusion of the executor's resources into the ExecutorInfo structure associated with the mesos tasks (storm workers). Notably, the avoidance is only triggered for tasks other than the 1st one that potentially launches the executor. This is problematic because the mesos-master rejects tasks whose ExecutorInfo isn't identical to that of other tasks under the same executor. Moreover, since the executor is already running for subsequent tasks, the resources that are added to these subsequent tasks' ExecutorInfo aren't actually used, so there is no advantage in attempting to avoid their inclusion.

FFR, this is the commit with that change:
* af8c49b

After this fix I was able to instantly launch 3 workers for a topology on the same mesos-slave host.
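To make the constraint concrete, here is a minimal, hypothetical Java sketch (not this repo's code; the `scalar`, `buildWorkerTask`, and `buildSharedExecutor` helpers are invented for the example) of worker tasks that share one executor all embedding the same ExecutorInfo, executor resources included:

```java
import org.apache.mesos.Protos.CommandInfo;
import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.Resource;
import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskInfo;
import org.apache.mesos.Protos.Value;

public class SharedExecutorInfoSketch {

  // Helper for building a scalar resource such as "cpus" or "mem".
  static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }

  static ExecutorInfo buildSharedExecutor(String topologyId) {
    return ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(topologyId))
        .setCommand(CommandInfo.newBuilder().setValue("storm supervisor"))
        // The executor's own resources ride along in every task's ExecutorInfo,
        // even though the executor is only charged for them once.
        .addResources(scalar("cpus", 0.5))
        .addResources(scalar("mem", 512))
        .build();
  }

  // Builds one worker task. Every task for this topology on this host embeds
  // the *same* ExecutorInfo; if the ExecutorInfos differ, the mesos-master
  // rejects the later tasks.
  static TaskInfo buildWorkerTask(String taskId, SlaveID slaveId, ExecutorInfo sharedExecutor) {
    return TaskInfo.newBuilder()
        .setTaskId(TaskID.newBuilder().setValue(taskId))
        .setName("worker-" + taskId)
        .setSlaveId(slaveId)
        .addResources(scalar("cpus", 1.0))   // per-worker resources
        .addResources(scalar("mem", 1024))
        .setExecutor(sharedExecutor)         // identical for all workers on this host
        .build();
  }
}
```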
Ah, makes sense.
@@ -789,15 +789,11 @@ protected void computeResourcesForSlot(final RotatingMap<OfferID, Offer> offers,
      executorInfoBuilder
          .setName(executorName)
          .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
          .setData(ByteString.copyFromUtf8(executorDataStr));
      if (!subtractedExecutorResources) {
        subtractedExecutorResources = true;
I think we need to add this assignment back in. With this change it no longer gets set to true in the rest of the code, and if `subtractedExecutorResources` means the same as it did before PR #65, this will mean that we're over-provisioning resources and accounting for the executor's resources in every task, instead of just in the first one.

@brndnmtthws Am I correct in assuming `subtractedExecutorResources` does essentially the same thing before and after PR #65?
Ah, thanks for pointing that out, definitely a screw up by me in removing the assignment to `subtractedExecutorResources` there and not reintroducing it in the place it used to be set to true.
Yes, that looks right.
Soo... my fix is buggy. The 2nd time through the loop, where `subtractedExecutorResources` is `true`, it won't enter the block that sets the `executorCpuResources` and `executorMemResources` variables, which causes an NPE further down when building the `ExecutorInfo`.
The immediate fix is simple: just set those variables unconditionally... but that led me to read the logviewer port-acquisition logic. I'm pretty sure it has multiple issues. Gotta test more...
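For the record, here is a minimal sketch of the pattern being described (invented helper shapes, not the actual `computeResourcesForSlot()` body): the executor resource variables are populated on every pass, while the offer is only debited, and the flag only flipped, on the first pass.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.mesos.Protos.CommandInfo;
import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.Resource;
import org.apache.mesos.Protos.Value;

class ExecutorResourceAccountingSketch {

  static Resource scalar(String name, double value) {
    return Resource.newBuilder()
        .setName(name)
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(value))
        .build();
  }

  // remainingCpu/remainingMem stand in for the offer's remaining resources.
  static List<ExecutorInfo> buildExecutorInfos(int workerCount,
                                               double executorCpu,
                                               double executorMem,
                                               double[] remainingCpu,
                                               double[] remainingMem) {
    List<ExecutorInfo> infos = new ArrayList<>();
    boolean subtractedExecutorResources = false;

    for (int i = 0; i < workerCount; i++) {
      // Always build the executor's resource entries, so nothing downstream
      // ever dereferences a null (this is where the NPE came from).
      Resource executorCpuResource = scalar("cpus", executorCpu);
      Resource executorMemResource = scalar("mem", executorMem);

      if (!subtractedExecutorResources) {
        // Debit the offer exactly once per host: the executor only actually
        // consumes its cpu/mem a single time, no matter how many workers run.
        remainingCpu[0] -= executorCpu;
        remainingMem[0] -= executorMem;
        subtractedExecutorResources = true;
      }

      // Every task's ExecutorInfo still carries the same resources, keeping it
      // identical across all tasks that share the executor.
      infos.add(ExecutorInfo.newBuilder()
          .setExecutorId(ExecutorID.newBuilder().setValue("storm-supervisor"))
          .setCommand(CommandInfo.newBuilder().setValue("storm supervisor"))
          .addResources(executorCpuResource)
          .addResources(executorMemResource)
          .build());
    }
    return infos;
  }
}
```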
Maybe we can write a unit test to validate the behaviour, and also catch regressions?
That's definitely the ideal. I'll look into how to adjust the unit tests.
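A hedged sketch of what such a test could look like (JUnit 4; `buildTasksForOffer` is an invented stand-in for whatever scheduling helper the real tests would drive), asserting that all workers launched on one host share an identical ExecutorInfo:

```java
import static org.junit.Assert.assertEquals;

import java.util.List;
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.TaskInfo;
import org.junit.Test;

public class MultipleWorkersPerHostTest {

  @Test
  public void workersOnOneHostShareAnIdenticalExecutorInfo() {
    // Hypothetical helper that runs the scheduling logic against a single
    // offer large enough for three worker slots on the same host.
    List<TaskInfo> tasks = buildTasksForOffer(3);

    assertEquals(3, tasks.size());
    ExecutorInfo first = tasks.get(0).getExecutor();
    for (TaskInfo task : tasks) {
      // The PR #65 regression produced differing ExecutorInfos, which the
      // mesos-master rejects; all of them must be identical.
      assertEquals(first, task.getExecutor());
    }
  }

  // Stand-in for the real scheduling entry point; a real test would call into
  // the scheduler with a synthetic offer and topology assignment instead.
  private List<TaskInfo> buildTasksForOffer(int workerCount) {
    throw new UnsupportedOperationException("wire this to the scheduler under test");
  }
}
```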
…re-than-1-worker-per-host allow simultaneous launching of 2+ worker processes per host
That commit introduced a NullPointerException crash: after the first pass through the `computeResourcesForSlot()` main loop, the `executorCpuResources` and `executorMemResources` variables are not populated, and are thus incorrectly null, because `subtractedExecutorResources` was set to true on the 1st pass. The fix is to set them to the requested resource amounts unconditionally, and only subtract the resources out of the offer on the 1st pass. Also an unrelated change: clarify that the reservations being skipped in a couple of functions are *dynamic* reservations.