Improving file server location and executor package handling #57

Closed
wants to merge 1 commit

Conversation

echinthaka

This PR fixes two issues.

  1. When the local file server is started locally, mainly to serve storm.yaml to the Storm workers, the code assumes the conf directory is conf. That assumption can fail, for example when the storm-mesos process is started through upstart scripts. This fix allows the location to be passed explicitly via mesos.local.file.server.location; if it is not provided, the code falls back to conf.
  2. When we rely on Mesos to download the storm-mesos package (when the workers are scheduled), the downloaded file name can end up with additional characters at the end. For example, if we are downloading from HDFS, the URI ends with ?op=OPEN, and that breaks the commands executed afterwards. Hence, instead of relying on Mesos to do the download for us, we explicitly use wget, which also lets us rename the file (see the sketch below).
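A minimal sketch of the two changes, assuming a Storm-style conf map; the config key in item 1 comes from this PR, while the URI handling, method names, and tarball name are illustrative, not the PR's actual diff:

```java
import java.util.Map;

public class FetchSketch {
  // Item 1: read the conf-directory location from the framework conf,
  // falling back to the relative "conf" directory when the key is absent.
  static String confLocation(Map<String, Object> conf) {
    Object loc = conf.get("mesos.local.file.server.location");
    return loc != null ? loc.toString() : "conf";
  }

  // Item 2: build a shell command that downloads the executor tarball with wget
  // under an explicit name, so a WebHDFS-style "?op=OPEN" suffix cannot leak
  // into the saved filename or the extracted directory name.
  static String fetchCommand(String executorUri, String tarball) {
    return "wget -O " + tarball + " '" + executorUri + "' && tar xzf " + tarball;
  }
}
```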

@erikdw
Collaborator

erikdw commented Sep 3, 2015

hey @echinthaka : thanks for the PR. I'm on vacation (actually in an airplane ✈️ right now) for the next few days, so please allow me until early next week to give this the full review it deserves. But I'm curious about the 2nd change:

  1. Do all HDFS URIs have ?op=OPEN at the end?
  2. So you're saying that the inclusion of a query param at the end of a URI causes the mesos-slave (recently renamed to mesos-agent!) to fail downloading the file from that URI? That feels like a serious bug that should be addressed in mesos itself, instead of being worked around in various frameworks. Follow-up Qs:
    • Do you have mesos-slave (err.... mesos-agent) error logs that you can sanitize and point me to which show what happens when you have such a URI?
    • Have you seen if the same thing happens with HTTP, or if it's only specific to HDFS?
    • What version of mesos exhibited this behavior?

@echinthaka
Author

@erikdw

Do all HDFS URIs have ?op=OPEN at the end?

yes

Do you have mesos-slave (err.... mesos-agent) error logs that you can sanitize and point me to which show what happens when you have such a URL?

I lost that log, but it was complaining when trying to copy storm.yaml with cp storm.yaml storm-mesos*/conf

Have you seen if the same thing happens with HTTP, or if it's only specific to HDFS?

This is specific to files hosted in HDFS. For example, in my case the executorURI is something like http://hdfs-nn.myhost.com:50070/webhdfs/v1/binaries/storm-mesos/storm-mesos-0.9.3.tgz?op=OPEN. When Mesos downloads this file it creates storm-mesos-0.9.3.tgz?op=OPEN, and when it extracts it, it creates a storm-mesos-0.9.3.tgz?op=OPEN folder.

What version of mesos exhibited this behavior?

0.22.1

@erikdw
Collaborator

erikdw commented Sep 3, 2015

Since downloading executor binaries from HDFS URIs seems to be one of the main suggested ways of running mesos, I still believe (perhaps naively) that this is not an issue we should be solving in the storm framework; instead it should be solved in mesos proper. @tnachen : can you please speak to whether HDFS URIs are supported in the Executor.command.uri? From what @echinthaka is reporting it seems like they are broken, but I don't know much about HDFS, nor much about mesos-agent's executor starting internals.

@erikdw
Collaborator

erikdw commented Sep 3, 2015

Also, @echinthaka : I don't understand how changing to wget solves this problem for you, since I don't believe that wget understands the HDFS protocol.

Ok, I need to learn to read better. You are using WebHDFS to serve files over HTTP from HDFS. And you are using wget's ability to set the downloaded filename to overcome the fact that the URL includes a query param at the end. I'm not an HTTP expert, but shouldn't the server be able to tell the client what the filename should be saved as? Something like a Content-Disposition header. Alternatively:

  • Maybe we want to enhance the URI definition in mesos to include a "downloaded filename" value?
  • Or change mesos to drop any "?fooo" junk when downloading a URI.

In general I'm not sure it's correct to change this logic to use wget. I presume that Mesos supports protocols other than HTTP for fetching the URIs, and those protocols might not be ones that wget would support, but I could be wrong there.

NOTE: I often edit my comments after initially sending them, so please go to github instead of replying to the emails.

@erikdw
Collaborator

erikdw commented Oct 5, 2015

@echinthaka : coming back to this PR (thanks for the offline reminder): I feel like we probably need to start 3 changes in the 3 projects that are involved:

Hadoop / WebHDFS

We should file an HDFS ticket with component webhdfs for adding support to WebHDFS such that the downloaded filename is automatically set to filefoo instead of filefoo?op=OPEN. I believe this can be accomplished by having the WebHDFS server set the Content-Disposition HTTP header, with the filename parameter of the disposition header set to the basename of the HDFS path. I'm not an HTTP expert, but I believe that would work. Furthermore, I believe this is the hadoop code we'd need to change:

This fix would take forever to get adopted anywhere, so it's a long-term thing, not an immediate fix.
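For illustration only, a hedged sketch (not the actual Hadoop/WebHDFS source) of the header the server would need to emit; the class and method here are made up, but HttpServletResponse.setHeader and org.apache.hadoop.fs.Path are real APIs:

```java
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.fs.Path;

public class ContentDispositionSketch {
  // Tell the HTTP client to save the download under the HDFS path's basename,
  // e.g. storm-mesos-0.9.3.tgz, rather than appending the ?op=OPEN query string.
  static void setDownloadFilename(HttpServletResponse response, String hdfsPath) {
    String basename = new Path(hdfsPath).getName();
    response.setHeader("Content-Disposition", "attachment; filename=\"" + basename + "\"");
  }
}
```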

Mesos

Assumptions:

  • WebHDFS in its current form (no support for Content-Disposition as outlined above) will be with us for a long while, so we wanna support it nicely.
  • Mesos supports MesosExecutor binaries being downloaded from HDFS (which I assume means using WebHDFS)

Given those assumptions, it would seem that Mesos should somehow handle this situation more cleanly. e.g., we could modify the Mesos protobuf CommandInfo.URI to have a "filename" field a la the Content-Disposition header, thus allowing us to tell Mesos what we want the downloaded file to be named. Alternatively, we might have a binary toggle in the CommandInfo.URI for whether to auto-strip query-params when saving the file.
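As a rough framework-side sketch of the first idea: setValue and setExtract are existing CommandInfo.URI fields in the Mesos Java protobufs, while the filename field is the hypothetical addition proposed above:

```java
import org.apache.mesos.Protos;

public class UriSketch {
  static Protos.CommandInfo.URI executorUri(String webHdfsUrl) {
    return Protos.CommandInfo.URI.newBuilder()
        .setValue(webHdfsUrl)  // e.g. http://.../storm-mesos-0.9.3.tgz?op=OPEN
        .setExtract(true)      // ask the fetcher to extract the archive
        // .setFilename("storm-mesos-0.9.3.tgz")  // hypothetical field proposed
        //                                        // above; not in mesos-0.22.x
        .build();
  }
}
```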

Perhaps I will just email the mesos dev list about the proposal above. It seems like better handling is needed, though I haven't tested anything with WebHDFS myself yet. I wanna mess around with the HDFS proposal as well, so that I can file that bug and then reference it in the mesos dev email.

Even if this Mesos proposal is accepted and implemented, it will take time to become available for our use.

Storm Framework for Mesos (i.e., this project, mesos/storm)

So your proposal is one way to work around the issue, since we cannot rely on the fixes above being ready anytime soon. But since using wget directly is not the standard way for Mesos to download a binary to be executed, there are some potential issues with it:

  • we miss out on the new caching feature added into the Mesos fetcher in release 0.23
  • the upcoming planned features of the fetcher, listed at the bottom of this page
  • potential future support for protocols that wget doesn't support (i.e., not HTTP or FTP)

All of that simply for supporting WebHDFS.

Perhaps we can reconcile your desired support with my concerns via this proposal: a configuration knob in mesos/storm for choosing wget-based downloading of the executor binaries?
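A minimal sketch of what such a knob could look like; the key name mesos.executor.download.with.wget is made up here purely for illustration:

```java
import java.util.Map;

public class DownloadKnobSketch {
  // Decide between the standard Mesos fetcher and the wget-based workaround,
  // based on a hypothetical framework config key.
  static boolean useWgetDownload(Map<String, Object> conf) {
    Object flag = conf.get("mesos.executor.download.with.wget");
    return flag != null && Boolean.parseBoolean(flag.toString());
  }
}
```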

Does this all sound ok to you Eran?

@brndnmtthws
Member

This might no longer be needed since #65 was merged. From that PR, the config is generated at runtime and then served up to the supervisors.

Can you reexamine your changes to see if this is still necessary?

@erikdw
Collaborator

erikdw commented Nov 17, 2015

@brndnmtthws : the bigger issue that @echinthaka is dealing with is that the executor file is being served from HDFS at his company, and the WebHDFS URL results in a funky directory name after the tarball is expanded; i.e., storm-mesos-0.9.3.tgz?op=OPEN instead of storm-mesos-0.9.3. Though, it now occurs to me that this should be "ok", since the executor/MesosSupervisor is launched by referencing storm-mesos*, so that should still work, I believe.

@choang

choang commented Dec 11, 2015

I tried to cherry-pick this into master, and got these conflicts:

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)

    deleted by us:   README.markdown
    deleted by us:   src/storm/mesos/MesosNimbus.java

For me to move forward with testing, I will apply the changes manually and maybe submit a PR to @echinthaka.

@erikdw
Collaborator

erikdw commented Jan 19, 2016

@echinthaka : I finally got some time to learn and test the "serve-executor-tarball-over-WebHDFS" behavior, and it seems to be different in mesos-0.25.0 from what you reported in mesos-0.22.1. It's still not "working", but the error behavior is different:

The mesos fetcher fails to extract the tarball at all, reporting this in the executor logs:

W0118 22:56:14.260010  7581 fetcher.cpp:265] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: http://hadoop-namenode:50070/webhdfs/v1/user/erikdw/storm-mesos-0.9.6.tgz?op=OPEN

I've found 3 related tickets in Mesos:

I've linked the tickets together and posed some Qs in one of them (MESOS-3367).

@echinthaka : for now how about we add your proposed alternative launching method as a configurable variant? i.e., in the latest code we allow use of either a docker containerizer or direct use of a regular storm-mesos tarball. (Please note that @DarinJ in #77 is refactoring this area of the code (his change is very close to being merged).)

@erikdw
Collaborator

erikdw commented Feb 22, 2016

@echinthaka : I'm closing this PR in favor of #97. Please comment if you still need this support.

@erikdw erikdw closed this Feb 22, 2016