Downloading Hadoop from Apache mirrors is fragile #66

Closed
ericmjonas opened this issue Dec 10, 2015 · 9 comments

@ericmjonas
Contributor

OK, it was taking me forever to figure out why sometimes my cluster installs were hanging, appearing to never complete. It turns out that some of the mirrors returned by the mirror-checking script are incredibly slow (100 kB/sec slow). This has resulted in clusters taking up to 45 minutes to launch. Is there any way around this mirror search? It's extra complicated because different invocations of the mirror-finding script can return different mirrors on different EC2 nodes in a cluster, resulting in very erratic behavior.

@ericmjonas
Contributor Author

Wow, I've been struggling with launching clusters quickly for the past few days and had just attributed it to EC2's flakiness. Hard-coding the default Apache US URL for Hadoop seems to have fixed it, and I'm back down to 5-minute cluster launches.

@nchammas
Owner

Yeah, I've had trouble with some of the Apache mirrors just being plain broken, though I've never hit a mirror that was working but was just really slow.

To be precise, the logic to get a mirror is working fine -- at least, this is how Apache recommends we download stuff -- it's the mirrors we get that are sometimes off. Anyone can set up an Apache mirror and start serving downloads, so you sometimes end up with flakiness like this.

We're not supposed to hardcode a specific URL; otherwise we'll gradually overload that one mirror. We want different invocations of the mirror script to return different mirrors so that we spread out the load. These mirrors are not serving powerhouses like S3 is, and a mirror that works well today might not work so well in a few days.

I've already thought about ways to address this issue. I think the two that make the most sense are:

  1. Provide an option to let the user cache downloaded packages to an S3 location they own. That location will be used to download packages like Hadoop to the cluster.
  2. Automatically retry failed or extremely slow downloads from a new Apache mirror.

Another option is to look for a different way to download Hadoop that doesn't involve the Apache mirrors. Maybe the default yum repos available to Amazon Linux and CentOS carry a recent enough version of Hadoop that we can use. The disadvantage there is that we're tied to when those distributions choose to update, which may not be often.
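
For concreteness, option 2 could look something like the sketch below. This is only a rough illustration, not a settled design: the mirror-list endpoint, its JSON field names, and the timeout/retry numbers are all assumptions at this point.

```python
# Rough sketch of option 2 only -- not actual Flintrock code. The closer.cgi
# endpoint and its JSON fields ('preferred', 'http', 'path_info') are assumptions.
import json
import shutil
import urllib.request

MIRROR_QUERY = (
    "https://www.apache.org/dyn/closer.cgi"
    "?path=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&as_json=1"
)

def download_with_retry(dest="hadoop-2.7.1.tar.gz", max_mirrors=3, timeout=60):
    """Try up to `max_mirrors` mirrors, skipping any that error out or stall."""
    with urllib.request.urlopen(MIRROR_QUERY, timeout=timeout) as response:
        info = json.loads(response.read().decode("utf-8"))
    mirrors = [info["preferred"]] + info.get("http", [])
    last_error = None
    for base in mirrors[:max_mirrors]:
        url = base.rstrip("/") + "/" + info["path_info"]
        try:
            # The timeout catches stalled connections; spotting a live but very
            # slow mirror would also need chunked reads with a rate check.
            with urllib.request.urlopen(url, timeout=timeout) as download, \
                    open(dest, "wb") as f:
                shutil.copyfileobj(download, f)
            return url
        except OSError as error:
            last_error = error
    raise RuntimeError("All mirrors failed; last error: {}".format(last_error))
```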

@ericmjonas
Contributor Author

If you think you can do something like option 2 without adding a tremendous amount of complex logic, then go ahead, but I don't know how to do that reliably. Given the current size of the Flintrock user base and the rate at which people launch clusters, I'd imagine it'll be a long time before we make any sort of sizable dent in the mirrors' download numbers.

@broxtronix

I encountered this problem as well with my spark-gce scripts. Downloading the same software package from 100+ cluster nodes is pretty much an unintentional (but very effective) DDoS! It definitely triggers bandwidth throttling in the best case, or just slows down their servers in the worst case. Servers with a proper CDN will be able to handle the load, but old-style "mirrors" will definitely struggle.

My strategy for fixing this was to stage packages on Google Cloud Storage (or S3). Since the data is geo-redundantly replicated, these services act more or less like a local CDN with a very fast connection to your instances, and thus are definitely a good spot to cache this data. The downside is that you are then left maintaining downloads for the various package versions: you have to cache new versions as they come out, and so on.

The master node could download a package, and then "scatter" it to the slaves by way of S3. That's probably the best way to do this without incurring the aforementioned package management overhead.

Another solution is to generate AMIs, of course.

Whatever the solution, I think it is generally a good practice not to hammer on software download servers when these huge clusters boot up!

@ericmjonas
Contributor Author

+1 to the "copy it to the master" solution; it could even just rsync a set of downloadable packages to them. While this does limit Flintrock's "everything in parallel" approach, it sure seems to beat having a separate staging process.

@nchammas
Owner

The downside is that you are then left maintaining downloads for the various package versions: you have to cache new versions as they come out, and so on.

Yes, I would be against having to centrally maintain an S3 bucket for Flintrock users. Same goes for maintaining AMIs, at least for now.

The master node could download a package, and then "scatter" it to the slaves by way of S3. That's probably the best way to do this without incurring the aforementioned package management overhead.

If I understood correctly, this is what I described in option 1, except I would just have the Flintrock client do this instead of the master.

The basic steps are: If no user-provided S3 cache is specified, download to the cluster from the mirrors. If an S3 cache is specified, check the cache for the Hadoop package we need. If it's not there, put it there, then launch the cluster and download from that user-provided S3 location.
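
In rough pseudo-code, the cache check amounts to something like this (boto3, the helper name, and the bucket/key are just placeholders for illustration, not the final design):

```python
# Rough sketch of the caching flow described above. boto3, the function name,
# and the bucket/key values are placeholders -- this is not Flintrock code.
import boto3
from botocore.exceptions import ClientError

def ensure_cached(local_package, bucket, key):
    """Upload the package to the user's S3 cache only if it isn't already there."""
    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)   # already cached: nothing to do
    except ClientError:
        # One-time staging step: push the locally downloaded package to S3.
        # (A real implementation should distinguish "missing" from other errors.)
        s3.upload_file(local_package, bucket, key)
    return "s3://{}/{}".format(bucket, key)

# e.g. ensure_cached("hadoop-2.7.1.tar.gz", "my-bucket", "flintrock/hadoop-2.7.1.tar.gz")
# The cluster nodes then fetch from the returned s3:// location on every launch,
# instead of hitting the Apache mirrors.
```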

+1 to the "copy it to the master" solution; it could even just rsync a set of downloadable packages to them.

If you mean rsyncing from the master to all the slaves, please note that this is precisely one of the main flaws of spark-ec2 that made it take so long to launch large clusters. I don't think it's a good idea.

While this does limit Flintrock's "everything in parallel" approach, it sure seems to beat having a separate staging process.

The staging process, if you are referring to my option 1, is one-time only. It happens only the first time you provide Flintrock with an S3 location to cache Hadoop to. After that it can just reuse that same, user-provided S3 cache to set up all clusters going forward--no staging step required.

With this approach we get all the advantages of downloading from S3 without anyone having to centrally maintain a download repository for users. Users who want to avoid the flakiness of Apache mirrors just provide an S3 location they can write to (and that their clusters can read from), wait an extra 30 seconds during their first launch while Flintrock caches Hadoop to that location, and thereafter live happily ever after. 👸

Do you still have reservations about this approach?

@nchammas
Owner

Whatever the solution, I think it is generally a good practice not to hammer on software download servers when these huge clusters boot up!

Amen to this. I just wish Apache and Amazon would strike some kind of hosting agreement and save us from having to do all this extra work to deal with flaky mirrors!

We're lucky that Spark is offered on S3. I don't even know who pays those bills. Maybe the AMPLab?

Anyway.

@ericmjonas
Contributor Author

If you mean rsyncing from the master to all the slaves, please note that this is precisely one of the main flaws of spark-ec2 that made it take so long to launch large clusters. I don't think it's a good idea.

I thought the issue with spark-ec2 was that it did this serially? But for bringing up 100-node clusters, the 50 MB/sec bandwidth to the master would make this take ~300s or so, so I see the appeal of S3.

I actually already have several other use cases where I want to cache large installers (such as Anaconda) on S3 for easy download by the workers for installation. So some sort of generic support would be great.

But Nick, there are only so many hours in the day for you to hack on this :) So maybe just wrapping the mirror-download script in a timeout-retry would be the best for the time being!

@nchammas
Owner

I thought the issue with spark-ec2 was that it did this serially?

Nope, it does it in parallel. Notice the &, which sends each rsync task to the background and lets the loop immediately progress to the next slave.
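
In Python terms, that pattern amounts to roughly the following (an illustrative translation with invented hosts and paths, not the actual spark-ec2 script):

```python
# Illustrative only: the same fire-and-forget fan-out the shell loop gets from
# `rsync ... &`, with invented hosts and paths.
import subprocess

slaves = ["slave-1", "slave-2", "slave-3"]

# Each Popen returns immediately, so all transfers run concurrently.
procs = [
    subprocess.Popen(["rsync", "-az", "/root/hadoop/", host + ":/root/hadoop/"])
    for host in slaves
]
for proc in procs:  # block until every background transfer finishes
    proc.wait()
```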

This shouldn't be surprising. Broadcasting large files from the master to all the slaves isn't a good architecture. 🙅

But Nick, there are only so many hours in the day for you to hack on this :) So maybe just wrapping the mirror-download script in a timeout-retry would be the best for the time being!

Yeah, maybe I'll do that first and then add support for a user-owned S3 cache. I'll schedule this for right after the 0.1 release, which should be ready by next week.

@nchammas nchammas added this to the 0.2 milestone Dec 10, 2015
@nchammas nchammas changed the title HDFS apache mirror checking fragile Downloading Hadoop from Apache mirrors is fragile Dec 10, 2015