
Option for Minimum EBS Root Volume Size #174

Closed
PiercingDan opened this issue Dec 26, 2016 · 8 comments
@PiercingDan

PiercingDan commented Dec 26, 2016

  • Flintrock version: 0.7.0

There should be an option in the Flintrock configuration file to change min_root_device_size_gb = 30 (line 626 of ec2.py) to any desired value. 30 GB may be excessive and costly in some cases, given that the AMI can be smaller than 30 GB (10 GB in my case).

Edit: I also address this in my guide.

@PiercingDan PiercingDan changed the title from "Option for Minimum EBS" to "Option for Minimum EBS Root Volume Size" on Dec 26, 2016
@nchammas
Owner

If I'm remembering my Flintrock history correctly, I believe I set the default size to 30 GB because 10 GB is not enough to build Spark from source, which is one of the features that Flintrock supports. The initial 10 GB default was also reported as too small by several early users of Flintrock. I set this new default in #50.

What's the additional cost when going from 10 GB to 30 GB for the root volume if, say, you have a 100-node cluster? I remember it being minuscule, but I don't have a hard calculation documenting it.

I'm inclined to leave this default as-is without an option to change it, since every new option complicates things a bit. But if the added cost is significant I would be open to reconsidering, since I know one of the reasons people use Flintrock over, say, EMR is to cut costs.

@PiercingDan
Author

PiercingDan commented Dec 27, 2016

EDIT: Below has been modified

From my guide (based on https://aws.amazon.com/ebs/pricing/)

The price for Amazon EBS gp2 volumes is $0.10 per GB-month in US East. Since Flintrock sets its default minimum EBS root volume size to 30 GB, the EBS root volume costs about $0.10/day per instance, or $0.004/hour per instance, regardless of the instance type or AMI, whereas spot-requested m3.medium instances cost about $0.01/hour per instance.

The price is comparable to the instance cost.
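
For reference, here is the back-of-the-envelope calculation behind those numbers, assuming a 30-day month and the US East gp2 price above:

```python
# EBS root volume cost per instance for Flintrock's 30 GB default,
# at the US East gp2 price of $0.10 per GB-month (30-day month assumed).
root_volume_gb = 30
price_per_gb_month = 0.10

monthly = root_volume_gb * price_per_gb_month  # $3.00 per instance per month
daily = monthly / 30                           # ~$0.10 per instance per day
hourly = monthly / (30 * 24)                   # ~$0.004 per instance per hour

spot_m3_medium_hourly = 0.01                   # ~$/hour, as quoted above
print(f"EBS root: ${hourly:.4f}/hour vs. spot m3.medium: ${spot_m3_medium_hourly:.2f}/hour")
```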

@pragnesh
Contributor

pragnesh commented Jan 2, 2017

I find the 30 GB EBS volume too small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

@PiercingDan
Author

You could do one of the following:

  • Increase the size of the snapshot/AMI you're launching from
  • Change min_root_device_size_gb = 30 on line 626 of ec2.py to the desired size (see the sketch below)
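
Here is a rough, illustrative sketch of what the second option affects. This is not Flintrock's actual code (only the constant name and default come from ec2.py); it just shows how the minimum size would feed into the block device mapping that EC2 receives at launch:

```python
# Illustrative sketch only -- not Flintrock's actual code.
import boto3

min_root_device_size_gb = 50  # Flintrock's default in ec2.py is 30

ec2 = boto3.resource('ec2', region_name='us-east-1')
ami = ec2.Image('ami-0123456789abcdef0')  # hypothetical AMI ID

# Never shrink below the size of the AMI's own root snapshot.
ami_root_size_gb = next(
    mapping['Ebs']['VolumeSize']
    for mapping in ami.block_device_mappings
    if mapping['DeviceName'] == ami.root_device_name
)

block_device_mappings = [{
    'DeviceName': ami.root_device_name,
    'Ebs': {
        'VolumeSize': max(ami_root_size_gb, min_root_device_size_gb),
        'VolumeType': 'gp2',
    },
}]
# block_device_mappings would then be passed to run_instances(...).
```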

@pragnesh
Contributor

pragnesh commented Jan 3, 2017

@PiercingDan EBS gp2 volume pricing is $0.10 per GB-month, so it only costs $3 per month for 30 GB, and the hourly cost, 3/(24*30) = $0.004, is less than the instance cost of $0.01/hour.

@PiercingDan
Author

Good catch, @pragnesh.

PiercingDan added a commit to PiercingDan/spark-Jupyter-AWS that referenced this issue Jan 3, 2017
@nchammas
Owner

nchammas commented Jan 4, 2017

@pragnesh:

I find the 30 GB EBS volume too small for my HDFS cluster use. Is there any other way to increase the HDFS cluster's disk size?

Flintrock deploys HDFS to the EBS root volume only if the AMI has no ephemeral volumes attached. If you select a larger instance type that has ephemeral volumes (also called instance store volumes), Flintrock will use those instead for HDFS. That's because they are super fast (faster than EBS), and Flintrock users (from my understanding) typically use HDFS in conjunction with Spark to share things like shuffle files or to temporarily stage data before starting their job. The permanent store for these users is typically something like S3. I strongly recommend against using Flintrock-managed HDFS as anything other than a temporary store for your data.

This should probably be documented explicitly somewhere. I don't believe it currently is.
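
In other words, the selection behaves conceptually like this (a sketch of the behavior described above, not the actual Flintrock code; the paths are hypothetical):

```python
def pick_hdfs_data_dirs(ephemeral_mounts):
    # ephemeral_mounts: e.g. ['/media/ephemeral0', '/media/ephemeral1'],
    # or [] on instance types with no instance store volumes.
    if ephemeral_mounts:
        return [mount + '/hdfs/data' for mount in ephemeral_mounts]
    return ['/mnt/hdfs/data']  # fall back to the EBS root volume
```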

@pragnesh
Contributor

pragnesh commented Jan 4, 2017

@nchammas We use HDFS only as a temporary store. I know we can use instances with ephemeral volumes if we need more HDFS storage. But spot prices for instance types with instance store volumes are usually high and change frequently, so to avoid losing instances we tend to use instance types like m4.large. For us, EBS performance for HDFS is not a big issue compared to losing an instance during a running job, and we can work around it by using more instances. It would just be nice to have this as a setting in the launch config.
