Detect "incompatible ExecutorInfo" situation and kill the old Executors #172

Open
erikdw opened this issue Sep 25, 2016 · 0 comments
erikdw commented Sep 25, 2016

Sometimes Storm believes it has Storm Worker processes running, but those Worker processes aren't actually visible as tasks in the Mesos UI. This usually surfaces in the Storm UI's component view as Storm Executors that have a hostname & port but an empty string for the uptime.

One of the causes of this situation relates to the contents of the ExecutorInfo for the new task. If there was an existing Storm Supervisor (Mesos Executor) on the target host for this task, and if the new task has different values in its ExecutorInfo, then the new task will be rejected by Mesos with a TaskStatus update containing a TaskState of TASK_ERROR.

The message will look like:

s.m.MesosNimbus [INFO] Received status update: {"task_id":"worker-host.domain-31000-1474755616.828","slave_id":"20160427-042423-617289226-5050-9149-S3","state":"TASK_ERROR","message":"Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). ...

This can happen for various reasons, since Mesos considers any variance in the ExecutorInfo to be a problem:

  • changing the Executor resources in storm.yaml
    • e.g., topology.mesos.executor.cpu or topology.mesos.executor.mem.
  • changing the URI used for downloading resources into the sandbox.
    • e.g., the URL for the Nimbus's Jetty Server which is used on the worker hosts to download the storm.yaml config from the Nimbus.
    • e.g., the URI from which the storm-mesos release tarball is downloaded.
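
To make the "any variance" rule concrete, here is a minimal Java sketch of the comparison. It assumes a simplified stand-in for ExecutorInfo: the `ExecutorSpec` class, its fields, and the `isIncompatible` helper are hypothetical names for illustration, not the Mesos protobuf or any storm-mesos API, and the real ExecutorInfo carries more fields than these.

```java
import java.util.List;

// Hypothetical, simplified stand-in for the Mesos ExecutorInfo protobuf.
// Only the fields mentioned in this issue are modeled.
final class ExecutorSpec {
    final String executorId;
    final double cpus;       // e.g. from topology.mesos.executor.cpu
    final double memMb;      // e.g. from topology.mesos.executor.mem
    final List<String> uris; // resources downloaded into the sandbox

    ExecutorSpec(String executorId, double cpus, double memMb, List<String> uris) {
        this.executorId = executorId;
        this.cpus = cpus;
        this.memMb = memMb;
        this.uris = List.copyOf(uris);
    }

    // True when both specs share an ExecutorID but differ in any modeled field.
    static boolean isIncompatible(ExecutorSpec existing, ExecutorSpec proposed) {
        if (!existing.executorId.equals(proposed.executorId)) {
            return false; // different ExecutorIDs cannot collide
        }
        return existing.cpus != proposed.cpus
            || existing.memMb != proposed.memMb
            || !existing.uris.equals(proposed.uris);
    }
}
```

Under these assumptions, changing only the executor memory, or only one download URI, is enough for `isIncompatible` to return true for an executor that keeps the same ID.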

So, with the current framework implementation, if we ever want to change those values, we must kill all of the existing Supervisors and Executors under this framework instance before enabling the new config. Otherwise we end up with confusing problems like the phantom Workers described above.

It would be nice if the framework could instead detect such a mismatch and automatically kill the existing Executor/Supervisor.
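
As a rough illustration of the detection half of this idea, the sketch below recognizes the failure from the state and message of a status update like the one quoted above. The class and method names are hypothetical, and matching on the message text is an assumption that would break if Mesos changed the wording. How the framework would then kill the stale Executor/Supervisor is not shown.

```java
// Hypothetical helper; not part of storm-mesos or the Mesos API.
final class IncompatibleExecutorInfoDetector {
    // Message prefix copied from the status update quoted in this issue.
    static final String MARKER = "Task has invalid ExecutorInfo";

    // Returns true when a status update looks like the
    // "incompatible ExecutorInfo" rejection described in this issue.
    static boolean isIncompatibleExecutorInfo(String state, String message) {
        return "TASK_ERROR".equals(state)
            && message != null
            && message.contains(MARKER);
    }
}
```

A status-update handler could call this with the update's state name and message, and on a match schedule the existing Executor on that slave for termination before retrying the task.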
