Create a Flintrock repository to host Hadoop and Spark releases #238

nchammas opened this issue Mar 10, 2018 · 1 comment

nchammas commented Mar 10, 2018

Since its creation, Flintrock has sourced Spark releases from s3://spark-related-packages, an S3 bucket hosted by the AMPLab and kept up-to-date by the Apache Spark project. As of Spark 2.2.1, the Spark committers have confirmed that this bucket will no longer receive updates (alternate reference).

This is a big change for Flintrock's out-of-the-box experience. Users today can configure Flintrock to download Spark from a custom location via the --spark-download-source option, but by default Flintrock downloads Spark from s3://spark-related-packages. This gives users a fast, reliable, and convenient source of Spark releases without any setup on their part. Now that the bucket is being retired, we're stuck with the Apache mirror network as the default download source. Flintrock already uses Apache mirrors as the default source for Hadoop, and as Flintrock users know, they are slow and often unreliable (#66).

To preserve a strong out-of-the-box experience for Flintrock, I have begrudgingly decided to maintain a repository of Spark and Hadoop releases on S3 for use with Flintrock. I am loath to maintain new infrastructure, but in the absence of a fast CDN hosting public Spark and Hadoop releases, I think this is the only way.

To summarize the changes I plan to make:

  1. How Flintrock works today:
    • By default, Flintrock downloads Spark from s3://spark-related-packages.
    • By default, Flintrock downloads Hadoop from the Apache mirror network.
    • Users can customize where Flintrock downloads Spark and Hadoop from using --spark-download-source and --hdfs-download-source.
  2. How Flintrock will work after the change proposed here is complete:
    • By default, Flintrock will download both Spark and Hadoop from an S3 bucket maintained by me / the Flintrock project.
      • The bucket will be a Requester Pays bucket, meaning that users will pay the cost of data transfer from S3 to their Flintrock clusters on EC2.
      • The Flintrock project will only maintain a rolling window of select, recent releases of Spark and Hadoop in this repository.
    • As before, users can continue to customize where Flintrock downloads Spark and Hadoop from (see the example after this list).
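
To make the flags concrete, here's a sketch of a custom-source launch. Only the flag names come from Flintrock itself; the cluster name and URLs are placeholders, and the rest of the cluster settings are assumed to come from your Flintrock config file:

```sh
# Launch a cluster, overriding the default download sources.
flintrock launch my-cluster \
    --spark-download-source https://example.com/spark/spark-2.3.0-bin-hadoop2.7.tgz \
    --hdfs-download-source https://example.com/hadoop/hadoop-2.7.5.tar.gz
```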

When this change is complete, Flintrock will no longer depend on third-party sources for Spark and Hadoop, and clusters that use Hadoop will launch faster by default, since they will download Hadoop from S3 rather than the Apache mirror network.

Thank you to the AMPLab and to the Apache Spark project for graciously hosting Spark releases on S3 for as long as they did (and footing the bill!), and to Matei for the suggestion to use a Requester Pays bucket with Flintrock.

nchammas commented

Working with Requester Pays S3 buckets turned out to be more difficult than expected. Using Requester Pays means all requests need to be authenticated and authorized through IAM. This implies a number of things:

  1. On the bucket owner side: The ACL on each object in a Requester Pays bucket needs to be set to public-read or authenticated-read; otherwise, no one other than the bucket owner will be able to read those objects. It's not enough to just enable Requester Pays on the bucket.
  2. On the bucket reader side: The person trying to access objects needs to be given explicit permission to access those objects. So if I'm offering Spark and Hadoop releases in an S3 bucket called flintrock-resources, then any Flintrock user who wants to download from this bucket needs to explicitly grant themselves access to the flintrock-resources bucket, either by naming the bucket specifically or via a wildcard. This holds even though object ACLs are set to public-read or authenticated-read.
  3. For Flintrock itself: When Flintrock downloads Spark and Hadoop from this Requester Pays bucket, it can no longer do so with plain curl or wget. We would need to a) use the AWS CLI with the --request-payer requester option, and b) make credentials available on the instances downloading from the bucket, either via an attached IAM role or via an access key and secret (see the sketch after this list).
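
As a concrete sketch of points 1 and 3 using the AWS CLI (the bucket name is from the example above; the object key is hypothetical):

```sh
# Bucket owner side: Requester Pays alone isn't enough; each object
# also needs a readable ACL.
aws s3api put-object-acl \
    --bucket flintrock-resources \
    --key spark/spark-2.3.0-bin-hadoop2.7.tgz \
    --acl public-read

# Reader side: authenticated download, with the requester paying.
# Requires credentials (an attached IAM role or an access key/secret).
aws s3 cp \
    s3://flintrock-resources/spark/spark-2.3.0-bin-hadoop2.7.tgz . \
    --request-payer requester
```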

To make this all work without requiring users to make a bunch of manual changes related to IAM, Flintrock would need to do some extra work whenever a user tries to launch a cluster without providing explicit download sources for Spark and Hadoop (a rough sketch follows the list):

  1. Create and maintain an IAM policy that grants access to the Flintrock resources bucket hosting Spark and Hadoop releases.
  2. Create and maintain an IAM role that Flintrock instances get assigned by default. The role has the aforementioned policy attached.
  3. Add the aforementioned policy to any user-provided IAM role that the user wants to launch their Flintrock cluster with.
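
Roughly, in AWS CLI terms (the policy name, role name, and account ID here are hypothetical placeholders, not names Flintrock actually uses):

```sh
# 1. A policy granting read access to the resources bucket.
cat > flintrock-resources-read.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::flintrock-resources/*"
  }]
}
EOF
aws iam create-policy \
    --policy-name flintrock-resources-read \
    --policy-document file://flintrock-resources-read.json

# 2. and 3. Attach the policy to the default or user-provided instance role.
aws iam attach-role-policy \
    --role-name flintrock-default-role \
    --policy-arn arn:aws:iam::123456789012:policy/flintrock-resources-read
```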

It's a fair bit of work, unfortunately. And this is on top of needing to maintain the repository of Spark and Hadoop releases itself.

The alternative to doing all this is to stick with the Apache mirror network as the default download source for both Hadoop and Spark. Flintrock can provide a warning about the performance and reliability of Apache mirrors and simply leave it up to users to set up their own alternate download sources if they choose.

For the sake of expediency, I'm now thinking of pursuing this less exciting strategy first. It will enable Flintrock users to launch Spark 2.3 clusters and requires much less work. Once that's in place, I can revisit the grander strategy of maintaining a repository of Spark and Hadoop releases.

If you're reading along and have questions or suggestions, please chime in. I wish there were a better solution. Short of Apache hosting Spark and Hadoop releases on a fast, modern CDN, this is the best I could come up with.
