[WIP] Storm improvements #65
Conversation
- Use Storm 0.9.5
- Add prefix to the worker mesos id
- Use proper directory name for HTTP server
}

private final Protos.TaskInfo task;
private final Protos.Offer offer;
These should be at the top of the class.
Please add findbugs to the pom:
Force-pushed from 2ff17ce to 1293fb9.
Thanks for the review @sargun. Updated the PR as per your comments.
Force-pushed from 7a49c40 to a437ce3.
@@ -9,7 +9,7 @@ STORM_CMD = STORM_PATH + "/storm"
def nimbus():
Please PEP8 this file:
3c075477e55e:bin sdhillon$ pep8 storm-mesos
storm-mesos:9:1: E302 expected 2 blank lines, found 1
storm-mesos:10:3: E111 indentation is not a multiple of four
storm-mesos:11:3: E111 indentation is not a multiple of four
Force-pushed from a437ce3 to 40d206b.
public static String hostFromAssignmentId(String assignmentId, String delimiter) {
  final int last = assignmentId.lastIndexOf(delimiter);
  String host = assignmentId.substring(last + delimiter.length());
  LOG.debug("AssignMentId: " + assignmentId + " Host: " + host);
AssignMentId? Why is the M capitalized?
Not sure. I'll fix it though.
Force-pushed from 40d206b to f9c6c75.
All the places you're doing: Why not use Optionals across the board?
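For illustration only, a minimal sketch of that suggestion using `java.util.Optional` (Guava's `Optional` would read the same way on Java 7); the map-backed lookup here is hypothetical, not code from this PR:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: wrap a null-returning lookup in an Optional so callers must handle
// the "absent" case explicitly instead of remembering to null-check.
class OptionalSketch {
  static Optional<String> findWorkerHost(Map<String, String> hostsByAssignment, String assignmentId) {
    return Optional.ofNullable(hostsByAssignment.get(assignmentId));
  }

  public static void main(String[] args) {
    Map<String, String> hosts = new HashMap<>();
    hosts.put("topology-1-31000", "slave1.example.com");
    String host = findWorkerHost(hosts, "topology-1-31000")
        .orElseThrow(() -> new IllegalStateException("no host for assignment"));
    System.out.println(host);  // prints slave1.example.com
  }
}
```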
  }
}

public void taskLost(final TaskID taskId) {
If you're not doing reconciliation, how will this callback ever get triggered on disconnection?
Hmm, I'm not sure why you're asking. This could be called any time there's a status update with TASK_LOST.
Yeah - but you're not doing task reconciliation. So, if you're dependent on task statuses working at all, things won't work well.
Fair enough. Storm does its own out-of-band reconciliation-like thing (using ZK), so I'm not too worried about it.
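For context (explicit reconciliation is not something this PR implements), a minimal sketch of what it looks like with the Mesos scheduler API; the bookkeeping of known task IDs is hypothetical:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Sketch: ask the master to re-send the current status of the tasks we believe
// exist. The master answers through statusUpdate() callbacks, so a task that
// died while the scheduler was disconnected eventually shows up as TASK_LOST.
class ReconcileSketch {
  static void reconcile(SchedulerDriver driver, Collection<TaskID> knownTaskIds) {
    List<TaskStatus> statuses = new ArrayList<>();
    for (TaskID taskId : knownTaskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(taskId)
          .setState(TaskState.TASK_RUNNING)  // our last known / assumed state
          .build());
    }
    driver.reconcileTasks(statuses);
  }
}
```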
Think I got all the
Force-pushed from f4b6788 to c6a6c0a.
I was unable to test fault recovery, but it looks pretty good. Please reorder / squash commits.
## Optional configuration

* `mesos.supervisor.suicide.inactive.timeout.secs`: Seconds to wait before the supervisor suicides if it has no tasks to run. Defaults to "120".
* `mesos.master.failover.timeout.secs`: Framework failover timeout in seconds. Defaults to "3600".
* `mesos.master.failover.timeout.secs`: Framework failover timeout in seconds. Defaults to "24*7*3600".
@brndnmtthws What's the motivation for setting such a long timeout?
To prevent accidental framework removal. It's just a default, so you can always specify something else :)
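For reference, that setting ends up as the framework's `failover_timeout`; a rough sketch of the wiring, with the config handling simplified rather than copied from this PR:

```java
import java.util.Map;

import org.apache.mesos.Protos.FrameworkInfo;

// Sketch: map the storm.yaml value onto FrameworkInfo.failover_timeout. If the
// scheduler stays disconnected longer than this many seconds, the master tears
// the framework down and kills its tasks -- hence the long (one-week) default.
class FailoverTimeoutSketch {
  static FrameworkInfo.Builder frameworkInfo(Map<String, Object> conf) {
    double defaultTimeoutSecs = 24 * 7 * 3600;  // the "24*7*3600" default discussed above
    Object configured = conf.get("mesos.master.failover.timeout.secs");
    double timeoutSecs = (configured != null)
        ? Double.parseDouble(configured.toString())
        : defaultTimeoutSecs;
    return FrameworkInfo.newBuilder()
        .setName("storm")  // framework name here is illustrative
        .setUser("")       // empty user lets Mesos fill in the current user
        .setFailoverTimeout(timeoutSecs);
  }
}
```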
DOOOOOOOOOD!!!
I said we had other comments, man. This is a HUGE change.
Happy to address them quickly. I just don't want this PR to keep growing over time, because we'll never get it merged.
Well... let's all make an effort to keep changes small and modular in the future then. Then they are reviewable without herculean effort, less likely to break things, etc.
-->
<property name="tokens" value="ASSIGN, BAND, BAND_ASSIGN, BOR,
The difference between this and just accepting the default is the following values:

- `DO_WHILE` (the `while` keyword in a do-while)
- `LCURLY` (left curly)
- `LITERAL_SWITCH` (the `switch` keyword)
- `RCURLY` (right curly)
- `SLIST` (a statement list)
- `LITERAL_ASSERT` (the `assert` keyword)
- `TYPE_EXTENSION_AND` (the `&` symbol when used in a generic upper or lower bounds constraint)

If there isn't a particular reason for excluding these values, this entire block can be simplified to `<module name="WhitespaceAround"/>`.
  <property name="tokens" value="COMMA, SEMI, TYPECAST"/>
</module>

<module name="NoWhitespaceAfter">
The difference between this and just accepting the default is the following values:

- `ARRAY_INIT` (an array initialization)
- `ARRAY_DECLARATOR` (an array declaration)
- `INDEX_OP` (the array index operator)

If there isn't a particular reason for excluding these values, this entire block can be simplified to `<module name="NoWhitespaceAfter"/>`.
I'm not sure yet what the cause is, but when I'm testing the post-#65 version of this project, it is unable to launch multiple worker tasks in the same executor at the same time. I'm suspicious of either the assignmentId changes or the declined-offer-filtering change, but cannot say for sure yet what the cause really is.

As an example, I have a small topology that needs 3 workers, all 3 of which get assigned to the same host (I only have 1 mesos-slave host) per the MesosNimbus logs, but only 1 comes up every 2 minutes. i.e., first port 31000 comes up, but for the other 2 there are no logs in the supervisor; then a bit over 2 minutes later it's 31001 that comes up, then a bit over 2 minutes after that it's 31002 that comes up.

Will update once I figure out more about what's happening, but figured I'd mention it as soon as I validated that it's happening. (Of course it's also entirely possible that this is some artifact of the vagrant setup I'm using and not a real problem in the MesosNimbus / MesosSupervisor code -- TBD! 🔍 )

Ah... so I found the proximate cause in the mesos-master logs. The problem is related to something that I feel like is a bug (or at least odd design decision) within mesos proper. Specifically, the ExecutorInfo field must be identical between the Executor and all tasks within a given Executor, or it rejects tasks with mismatching ExecutorInfo. Error log:
Existing ExecutorInfo:
Rejected Task's ExecutorInfo:
Notably, the rejected Task's
executorInfoBuilder
    .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
    .setData(ByteString.copyFromUtf8(executorDataStr));
if (!subtractedExecutorResources) {
This is the logic causing the problem I described here. I'm sending a PR in a second.
One of the logic changes in PR mesos#65 broke the ability to simultaneously launch more than 1 worker process for a given topology. The cause of the breakage was intentionally avoiding inclusion of the executor's resources into the ExecutorInfo structure associated with the mesos tasks (storm workers). Notably, the avoidance is only triggered for tasks other than the 1st one that potentially launches the executor. This is problematic because the mesos-master rejects tasks whose ExecutorInfo isn't identical to other tasks under the same executor. And since the executor is already running for subsequent tasks, the resources that are added to these subsequent tasks' ExecutorInfo aren't actually used, so there is no advantage in attempting to avoid their inclusion.

FFR, this is the commit with that change: * af8c49b

After this fix I was able to instantly launch 3 workers for a topology on the same mesos-slave host.
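To make the constraint concrete: every `TaskInfo` that points at the same executor has to carry a byte-for-byte identical `ExecutorInfo`, resources included. A minimal sketch of the intended shape (the identifiers and resource amounts here are illustrative, not the exact code in this project):

```java
import org.apache.mesos.Protos.ExecutorInfo;
import org.apache.mesos.Protos.Resource;
import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskInfo;
import org.apache.mesos.Protos.Value;

// Sketch: build ExecutorInfo once (executor resources included) and attach the
// very same message to every worker task under that executor. Stripping the
// resources from later tasks' ExecutorInfo is what made the master reject them
// for having a mismatching ExecutorInfo.
class ExecutorInfoSketch {
  static Resource cpus(double amount) {
    return Resource.newBuilder()
        .setName("cpus")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(amount))
        .build();
  }

  static TaskInfo workerTask(String taskId, SlaveID slaveId, ExecutorInfo sharedExecutor) {
    return TaskInfo.newBuilder()
        .setName("worker " + taskId)
        .setTaskId(TaskID.newBuilder().setValue(taskId))
        .setSlaveId(slaveId)
        .addResources(cpus(1.0))      // the worker's own resources
        .setExecutor(sharedExecutor)  // identical ExecutorInfo for every worker
        .build();
  }
}
```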
resources.memSlots = (int) Math.floor((offerMem - executorMem) / mem);
if (r.hasReservation()) {
  // skip resources with reservations
  continue;
@brndnmtthws Why are we skipping reserved resources?
With the way dynamic reservations are implemented in Mesos, you may receive reserved offers for other frameworks. Since the Storm framework doesn't implement dynamic reservations, we just decline all of them.
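A minimal sketch of that filtering, shaped like the hunk above but with the surrounding aggregation simplified:

```java
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.Protos.Resource;

// Sketch: when summing what an offer can hold, ignore any resource carrying a
// reservation, since this framework doesn't use dynamic reservations and such
// resources may be reserved for another framework's role.
class OfferCpusSketch {
  static double unreservedCpus(Offer offer) {
    double cpus = 0;
    for (Resource r : offer.getResourcesList()) {
      if (r.hasReservation()) {
        continue;  // skip resources with reservations, as in the diff above
      }
      if ("cpus".equals(r.getName()) && r.hasScalar()) {
        cpus += r.getScalar().getValue();
      }
    }
    return cpus;
  }
}
```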
behaviour:
containers. This also introduces the `mesos.container.docker.image` config param.
Also from @maverick2202:
NOTE: this isn't quite ready to merge yet.
Let's merge this instead of #62 and #63.
cc @sargun @erikdw