[WIP] Storm improvements #65

Merged: 19 commits merged on Nov 17, 2015

Conversation

brndnmtthws
Member

  • Added additional configuration parameters for controlling scheduler
    behaviour:
    • mesos.offer.filter.seconds
    • mesos.offer.expiry.multiplier
    • mesos.prefer.reserved.resources
  • Implemented combining of reserved & unreserved resources
  • Improved Docker support, added support for running Storm inside
    containers. This also introduces the mesos.container.docker.image
    config param
  • Added unit tests
  • Improved port handling, especially with regard to logviewer
  • Code cleanup/style fixes
  • Implemented filters & reviveOffers() (see the sketch below)

Also from @maverick2202:

  • Upgrade Storm
  • Add worker name prefix
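
For context, here's a minimal, hypothetical sketch (not code from this PR; class and method names are illustrative) of the mechanism behind the offer-handling parameters above: unusable offers are declined with a refuse filter (the idea behind mesos.offer.filter.seconds), and reviveOffers() clears outstanding filters once the scheduler needs resources again. Only SchedulerDriver, Filters, and OfferID are standard Mesos APIs here.

    import org.apache.mesos.Protos.Filters;
    import org.apache.mesos.Protos.OfferID;
    import org.apache.mesos.SchedulerDriver;

    // Illustrative sketch only; not the scheduler code from this PR.
    class OfferFilterSketch {
      // Decline an offer we can't use, asking the master not to re-offer
      // these resources for filterSeconds (cf. mesos.offer.filter.seconds).
      static void declineWithFilter(SchedulerDriver driver, OfferID offerId, double filterSeconds) {
        Filters filters = Filters.newBuilder().setRefuseSeconds(filterSeconds).build();
        driver.declineOffer(offerId, filters);
      }

      // When new work arrives (e.g. a topology needs workers), clear all
      // outstanding filters so previously declined resources are offered again.
      static void onResourcesNeeded(SchedulerDriver driver) {
        driver.reviveOffers();
      }
    }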

NOTE: this isn't quite ready to merge yet.

Let's merge this instead of #62 and #63.

cc @sargun @erikdw

Ankur Choksi added 3 commits November 11, 2015 14:01
 - Use Storm 0.9.5
 - Add prefix to the worker mesos id
 - Use proper directory name for HTTP server
}

private final Protos.TaskInfo task;
private final Protos.Offer offer;

These should be at the top of the class.

@sargun

sargun commented Nov 11, 2015

Please add findbugs to the pom:

    <reporting>
      <plugins>
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>findbugs-maven-plugin</artifactId>
          <version>3.0.3</version>
          <configuration>
            <effort>Max</effort>
            <threshold>Default</threshold>
            <xmlOutput>true</xmlOutput>
          </configuration>
        </plugin>
      </plugins>
    </reporting>

@brndnmtthws
Member Author

Thanks for the review @sargun. Updated the PR as per your comments.

@brndnmtthws force-pushed the wip-improvements-0.9.6 branch 3 times, most recently from 7a49c40 to a437ce3 on November 12, 2015 16:35
@@ -9,7 +9,7 @@ STORM_CMD = STORM_PATH + "/storm"
def nimbus():

Please PEP8 this file:

3c075477e55e:bin sdhillon$ pep8 storm-mesos 
storm-mesos:9:1: E302 expected 2 blank lines, found 1
storm-mesos:10:3: E111 indentation is not a multiple of four
storm-mesos:11:3: E111 indentation is not a multiple of four

public static String hostFromAssignmentId(String assignmentId, String delimiter) {
final int last = assignmentId.lastIndexOf(delimiter);
String host = assignmentId.substring(last + delimiter.length());
LOG.debug("AssignMentId: " + assignmentId + " Host: " + host);

AssignMentId? Why is the M capitalized?

Member Author


Not sure. I'll fix it though.

@sargun

sargun commented Nov 13, 2015

All the places you're doing:

    if (X == null)
      X = $default_value

Why not use Optionals across the board?
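
Purely for illustration (not this PR's code), the suggested pattern might look roughly like this, assuming Java 8's java.util.Optional is available (Guava's Optional is a similar alternative on older JDKs); the config key and default value below are placeholders:

    import java.util.Map;
    import java.util.Optional;

    class ConfigLookupSketch {
      // Hypothetical default; not the PR's actual value.
      static final Number DEFAULT_OFFER_FILTER_SECONDS = 120;

      // Instead of:  if (value == null) value = DEFAULT_...;
      // wrap the possibly-null lookup and supply the default in one expression.
      static Number offerFilterSeconds(Map<String, Object> conf) {
        return Optional.ofNullable((Number) conf.get("mesos.offer.filter.seconds"))
                       .orElse(DEFAULT_OFFER_FILTER_SECONDS);
      }
    }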

}
}

public void taskLost(final TaskID taskId) {

If you're not doing reconciliation, how will this callback ever get triggered on disconnection?

Member Author


Hmm, I'm not sure why you're asking. This could be called any time there's a status update with TASK_LOST.


Yeah - but you're not doing task reconciliation. So, if you're dependent on task statuses working at all, things won't work well.

Member Author


Fair enough. Storm does its own out-of-band reconciliation-like thing (using ZK), so I'm not too worried about it.
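
For context on the reconciliation point, here's a minimal sketch (not part of this PR) of explicit reconciliation via the Mesos driver API. Passing an empty collection requests "implicit" reconciliation, after which the master re-sends the latest status, including TASK_LOST, for every task it knows about for this framework:

    import java.util.Collections;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    class ReconciliationSketch {
      // An empty list means implicit reconciliation: the master answers with a
      // statusUpdate() callback for every task it currently tracks for this framework.
      static void reconcileAllTasks(SchedulerDriver driver) {
        driver.reconcileTasks(Collections.<TaskStatus>emptyList());
      }
    }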

@brndnmtthws
Member Author

Think I got all the == null's.

@sargun

sargun commented Nov 13, 2015

I was unable to test fault recovery, but it looks pretty good. Please reorder / squash commits.

## Optional configuration

* `mesos.supervisor.suicide.inactive.timeout.secs`: Seconds the supervisor waits before suiciding when it has no tasks to run. Defaults to "120".
* `mesos.master.failover.timeout.secs`: Framework failover timeout in seconds. Defaults to "3600".
* `mesos.master.failover.timeout.secs`: Framework failover timeout in seconds. Defaults to "24*7*3600".
Collaborator


@brndnmtthws What's the motivation for setting such a long timeout?

Member Author


To prevent accidental framework removal. It's just a default, so you can always specify something else :)

@erikdw
Collaborator

erikdw commented Nov 17, 2015

DOOOOOOOOOD!!!

@erikdw
Collaborator

erikdw commented Nov 17, 2015

I said we had other comments man. This is a HUGE change.

@brndnmtthws
Member Author

Happy to address them quickly. I just don't want this PR to keep growing over time, because we'll never get it merged.

@erikdw
Collaborator

erikdw commented Nov 17, 2015

Well... let's all make an effort to keep changes small and modular in the future then. Then they are reviewable without herculean effort, less likely to break things, etc.

-->
<property name="tokens" value="ASSIGN, BAND, BAND_ASSIGN, BOR,
Collaborator


The differences between this and just accepting the default are the following values:

  • DO_WHILE (the while keyword in a do-while)
  • LCURLY (left curly)
  • LITERAL_SWITCH (the switch keyword)
  • RCURLY (right curly)
  • SLIST (a statement list)
  • LITERAL_ASSERT (the assert keyword)
  • TYPE_EXTENSION_AND (the & symbol when used in a generic upper or lower bounds constraint)

If there isn't a particular reason for excluding these values, this entire block can be simplified to <module name="WhitespaceAround"/>

<property name="tokens" value="COMMA, SEMI, TYPECAST"/>
</module>

<module name="NoWhitespaceAfter">
Collaborator


The differences between this and just accepting the default are the following values:

  • ARRAY_INIT (an array initialization)
  • ARRAY_DECLARATOR (an array declaration)
  • INDEX_OP (the array index operator)

If there isn't a particular reason for excluding these values, this entire block can be simplified to <module name="NoWhitespaceAfter"/>

@erikdw
Collaborator

erikdw commented Dec 4, 2015

I'm not sure yet what the cause is, but when I'm testing the post-#65 version of this project, it is unable to launch multiple worker tasks in the same executor at the same time. I'm suspicious of either the assignmentId changes or the declined-offer-filtering change, but cannot say for sure yet what the cause really is.

As an example, I have a small topology that needs 3 workers, all 3 of which get assigned to the same host (I only have 1 mesos-slave host) per the MesosNimbus logs, but only 1 comes up every 2 minutes. i.e., first port 31000 comes up, but for the other 2 there are no logs in the supervisor, then a bit over 2 minutes later it's 31001 that comes up, then a bit over 2 minutes after that it's 31002 that comes up. Will update once I figure out more about what's happening, but figured I'd mention it as soon as I validated that it's happening.

(Of course it's also entirely possible that this is some artifact of the vagrant setup I'm using and not a real problem in the MesosNimbus / MesosSupervisor code -- TBD! 🔍 )

Ah... so I found the proximate cause in the mesos-master logs. The problem is related to something that I feel is a bug (or at least an odd design decision) within mesos proper. Specifically, the ExecutorInfo field must be identical between the Executor and all tasks within a given Executor, or the master rejects tasks with a mismatching ExecutorInfo.

Error log:
I1204 07:54:43.277238  6970 master.cpp:4449] Sending status update TASK_ERROR (UUID: e06501dc-357f-0000-400b-2aee357f0000) for task master-31001 of framework 8602f882-7716-44f2-9b98-5de10437ad61-0000 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible).
Existing ExecutorInfo:
executor_id {
  value: "smoketest-f-2-1449215681"
}
data: "{\"supervisorid\":\"master-smoketest-f-2-1449215681\",\"assignmentid\":\"master\"}"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 0.1
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 500
  }
  role: "*"
}
command {
  uris {
    value: "file:///usr/local/storm/storm-mesos-0.9.6.tgz"
  }
  uris {
    value: "http://master:53877/generated-conf/storm.yaml"
  }
  value: "cp storm.yaml storm-mesos*/conf && cd storm-mesos* && python bin/storm supervisor storm.mesos.MesosSupervisor"
}
framework_id {
  value: "8602f882-7716-44f2-9b98-5de10437ad61-0000"
}
name: "storm-supervisor | smoketest-f-2-1449215681 | master"
Rejected Task's ExecutorInfo:
executor_id {
  value: "smoketest-f-2-1449215681"
}
data: "{\"supervisorid\":\"master-smoketest-f-2-1449215681\",\"assignmentid\":\"master\"}"
command {
  uris {
    value: "file:///usr/local/storm/storm-mesos-0.9.6.tgz"
  }
  uris {
    value: "http://master:53877/generated-conf/storm.yaml"
  }
  value: "cp storm.yaml storm-mesos*/conf && cd storm-mesos* && python bin/storm supervisor storm.mesos.MesosSupervisor"
}
framework_id {
  value: "8602f882-7716-44f2-9b98-5de10437ad61-0000"
}
name: "storm-supervisor | smoketest-f-2-1449215681 | master"
Notably, the rejected Task's ExecutorInfo lacks the 2 resources sections.

As for why that is... need to look further.

executorInfoBuilder
.setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
.setData(ByteString.copyFromUtf8(executorDataStr));
if (!subtractedExecutorResources) {
Collaborator


This is the logic causing the problem I described here. I'm sending a PR in a second.
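
To illustrate the direction of the fix described in the comments above (the actual change lands in erikdw's follow-up PR, not here), a hedged sketch that attaches the executor's resources to the ExecutorInfo unconditionally, so every task sharing the ExecutorID carries an identical ExecutorInfo; the helper names and the cpu/mem parameters are placeholders:

    import com.google.protobuf.ByteString;
    import org.apache.mesos.Protos.ExecutorID;
    import org.apache.mesos.Protos.ExecutorInfo;
    import org.apache.mesos.Protos.Resource;
    import org.apache.mesos.Protos.Value;

    class ExecutorInfoSketch {
      // Hypothetical helper: a scalar resource such as "cpus" or "mem" in the default "*" role.
      static Resource scalar(String name, double amount) {
        return Resource.newBuilder()
            .setName(name)
            .setType(Value.Type.SCALAR)
            .setScalar(Value.Scalar.newBuilder().setValue(amount))
            .build();
      }

      // Always attach the executor's own cpu/mem, even when the executor is
      // already running, so the master sees an identical ExecutorInfo for all
      // tasks under the same ExecutorID and does not reject later tasks.
      static ExecutorInfo.Builder withExecutorResources(ExecutorInfo.Builder builder,
                                                        String executorId,
                                                        String executorDataStr,
                                                        double executorCpu,
                                                        double executorMem) {
        return builder
            .setExecutorId(ExecutorID.newBuilder().setValue(executorId))
            .setData(ByteString.copyFromUtf8(executorDataStr))
            .addResources(scalar("cpus", executorCpu))
            .addResources(scalar("mem", executorMem));
      }
    }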

erikdw added a commit to erikdw/storm-mesos that referenced this pull request Dec 4, 2015
One of the logic changes in PR mesos#65 broke the ability to simultaneously launch
more than 1 worker process for a given topology.  The cause of the breakage
was intentionally avoiding inclusion of the executor's resources into the
ExecutorInfo structure associated with the mesos tasks (storm workers).
Notably, the avoidance is only triggered for tasks other than the 1st one
that potentially launches the executor.

This is problematic because the mesos-master rejects tasks whose ExecutorInfo
isn't identical to other tasks under the same executor.  Notably, since the
executor is already running for subsequent tasks, the resources that are
added to these subsequent tasks' ExecutorInfo aren't actually used, so there
is no advantage in attempting to avoid their inclusion.

FFR, this is the commit with that change:
* af8c49b

After this fix I was able to instantly launch 3 workers for a topology
on the same mesos-slave host.
DarinJ pushed a commit to DarinJ/storm that referenced this pull request Dec 17, 2015
resources.memSlots = (int) Math.floor((offerMem - executorMem) / mem);
if (r.hasReservation()) {
// skip resources with reservations
continue;
Collaborator


@brndnmtthws Why are we skipping reserved resources?

Member Author


With the way dynamic reservations are implemented in Mesos, you may receive reserved offers for other frameworks. Since the Storm framework doesn't implement dynamic reservations, we just decline all of them.
