Assume AWS Role in PrestoS3FileSystem #2640

Closed
wants to merge 4 commits into from

Conversation

zhenxiao
Collaborator

@zhenxiao zhenxiao commented Apr 4, 2015

Users can assume an AWS IAM role for per-session access.
For example, use an AWS role to query restricted-access tables:
./presto --server coordinator:8080 --catalog hive --debug --session hive.aws_iam_role="arn:aws:iam::example:role/access"

@nezihyigitbasi
Contributor

@electrum can you please take a look?

@zhenxiao
Collaborator Author

@electrum would you please take a look?

@zhenxiao
Collaborator Author

@electrum rebased with the current master

@zhenxiao
Collaborator Author

zhenxiao commented May 1, 2015

@electrum comments addressed

@electrum
Contributor

electrum commented May 1, 2015

This will cause the creation of many instances of AmazonS3Client. Will this cause socket leaks or performance problems due to lack of connection pooling, throttling on Amazon's side, etc.?

Do we need our own cache (based on role?) to share clients between FS instances?

@zhenxiao
Collaborator Author

zhenxiao commented May 1, 2015

@electrum yes, exactly. We did see many instances of AmazonS3Client, along with a large number of timeouts and connection resets. We added a cache of AmazonS3Client instances, keyed on the provider name suffixed by the role name. I was planning to submit it as a separate PR, but have appended it here. Your comments are appreciated.
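
The caching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `RoleKeyedClientCache`, the `|` key delimiter, and the generic type `C` (standing in for `AmazonS3Client`) are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/**
 * Hypothetical sketch: shares one client per (provider name, role) pair so that
 * file system instances using the same role reuse the same connection pool.
 * C is a stand-in for the real client type.
 */
final class RoleKeyedClientCache<C> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final BiFunction<String, String, C> factory;

    RoleKeyedClientCache(BiFunction<String, String, C> factory) {
        this.factory = factory;
    }

    /** Builds the cache key: the provider name suffixed by the role (empty if no role). */
    static String cacheKey(String providerName, String roleArn) {
        return providerName + "|" + (roleArn == null ? "" : roleArn);
    }

    /** Returns the cached client for this key, creating it at most once. */
    C get(String providerName, String roleArn) {
        return clients.computeIfAbsent(
                cacheKey(providerName, roleArn),
                key -> factory.apply(providerName, roleArn));
    }

    int size() {
        return clients.size();
    }
}
```

With a cache like this, two sessions that pass the same role share one client, while a session with a different role gets its own. A real implementation would also need an eviction policy, which this sketch leaves out.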

@zhenxiao zhenxiao force-pushed the awsRole branch 3 times, most recently from d1de28e to c43c39c Compare May 1, 2015 20:56
@zhenxiao
Collaborator Author

zhenxiao commented May 8, 2015

@electrum kindly ping

@hongbozeng

This is a very useful feature for us too. It would be great if this could be merged to trunk. Would you please take a look, @electrum? Thank you, guys.

-    if (useInstanceCredentials) {
-        return new AmazonS3Client(new InstanceProfileCredentialsProvider(), clientConfig, METRIC_COLLECTOR);
+    if (useInstanceCredentials) {
+        credentialsProvider = new InstanceProfileCredentialsProvider();

credentialsProvider here overrides the assignment on line 576, so it breaks the test com.facebook.presto.hive.TestPrestoS3FileSystem.testStaticCredentials

Collaborator Author

@dayzzz thank you, it is a bug related to the default config; I will get it fixed

@dain
Contributor

dain commented Jun 26, 2015

Does AWS support Kerberos SSO? If so, a better approach would be to use the Kerberos authentication that @martint just checked in. Currently this is not passed down to connectors for authorization, but that is being worked on next.

Generally, I don't think we should be passing credentials as generic session properties and instead attach the Principal to the session.

@zhenxiao
Collaborator Author

@dain @martint @electrum
Thanks for the comment. We are using AWS IAM roles to delegate access to users:
http://docs.aws.amazon.com/IAM/latest/UserGuide/roles-toplevel.html
Whenever a user assumes that role, they can access the restricted data.
The session property we pass is not a credential; it is an IAM role. What do you think of passing this IAM role in the session property?

@zhenxiao
Collaborator Author

I think this does not conflict with Kerberos; it is just support for an AWS S3 feature. What do you think?

@zhenxiao
Collaborator Author

@dain @martint @electrum I just chatted with our Amazon EMR operations team: Amazon does not have Kerberos support, so we use this role-based access for especially sensitive buckets. What do you think of passing the AWS role as a session property?

@zhenxiao
Collaborator Author

zhenxiao commented Jul 6, 2015

kindly ping

@dain
Contributor

dain commented Oct 10, 2015

I'm still a bit lost on how this works. This seems to be similar to Hive's storage-based security where, instead of checking the Metastore for grants, you interrogate the storage (directory) permissions. In this case the user "principal" would be the IAM credentials and, instead of an explicit authorization check, we would let the S3 client throw exceptions. Is that how this is intended to work (at a high level)?

@zhenxiao
Collaborator Author

@dain yes, as you said: instead of checking the Metastore, we keep all permission settings in S3 storage, and users need to specify their AWS IAM role to access the specified files. If permission is denied, the S3 client will throw exceptions for unauthorized access.
Our use case is that all files are stored in S3 with corresponding permission settings. Some sensitive data has specific permission settings that only specific AWS IAM roles can access. AWS does not have Kerberos support, and passing the AWS IAM role as a session parameter seems a good match for our use case.

@dain
Contributor

dain commented Oct 10, 2015

I get it. I'm sure this is something we can support but I have a bunch of follow up questions. Hopefully, the AWS folks will jump in with their opinion on how this should work.

How do we assure the IAM string is valid? How do we authenticate it?

How do we validate the IAM string is allowed to run queries as the specified Presto username? Presto username is used for queue management and for functions, so it is important to restrict a person to only use one Presto username string. For example, can we verify that the IAM string ends with /<username>?

How should CREATE TABLE or INSERT work? Do we need to set file permissions?

How will views work? A SQL view executes as the person that created the view. Where should we store the IAM string from the view owner? Or do we simply disallow view creation?

Does this mean that when we add GRANT to Presto that this will be disabled for S3? Or would we set permissions in S3?

How do you envision this working for people using a mixed S3 and HDFS installation on AWS? For HDFS we do not support storage-based security. This would be confusing to users and maybe impossible in the code.
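
For illustration only, the `/<username>` suffix check raised in the questions above could look like the sketch below. The class and method names are hypothetical, and the check is a plain string comparison: it does not authenticate anything, and it only makes sense if roles are named per user.

```java
/**
 * Hypothetical sketch of the "/<username>" suffix check.
 * It compares strings only; it is not an authentication mechanism.
 */
final class RoleNameCheck {
    private RoleNameCheck() {}

    /** Returns true if the role ARN ends with "/" followed by the Presto username. */
    static boolean roleEndsWithUser(String roleArn, String username) {
        if (roleArn == null || username == null || username.isEmpty()) {
            return false;
        }
        return roleArn.endsWith("/" + username);
    }
}
```

For example, `roleEndsWithUser("arn:aws:iam::example:role/alice", "alice")` passes, while the same role with the username `lice` does not, because the leading slash is part of the comparison.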

@martint
Contributor

martint commented Oct 10, 2015

We may want to model this as another authentication scheme, where the AWS IAM credentials are captured in a Principal and passed to the connector to do as it wants, instead of using a session property.

@zhenxiao
Collaborator Author

Thank you @dain @martint for the comments. Yes, this is more of a hack; I did not consider how it would work with CREATE/INSERT, GRANT, or HDFS.
For the IAM string, the AWS credential provider will check whether it is valid and throw an exception if not. For CREATE/INSERT, GRANT, and HDFS, I do not have a clear picture yet.
Let me think about doing this in a Principal, as @martint suggested.

@zhenxiao
Collaborator Author

@dain @martint
Let me elaborate on our usage and answer @dain's questions one by one.
For S3 files there are a few permissions, including List, Put, Get, and Delete.
We create a few roles and set their permissions: say, IamReadRole has permission to List and Get, IamWriteRole has permission to List, Get, and Put, and IamAllRole has permission to List, Get, Put, and Delete.
Users assume the same IAM role when accessing the same files. If username1 and username2 are both trying to read a sensitive database, each has to assume IamReadRole. If they are trying to write to the sensitive database, each has to assume IamWriteRole.
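
As a rough model of the three roles described above, the permission sets can be written out as follows. This is only a sketch: in practice the permissions live in IAM policies and S3 enforces them, and the type names `S3Action` and `ExampleRole` plus the use of `SecurityException` as a stand-in for an S3 access-denied error are assumptions for the example.

```java
import java.util.EnumSet;
import java.util.Set;

/** The four S3 permissions mentioned in the comment. */
enum S3Action { LIST, GET, PUT, DELETE }

/** Hypothetical model of the example roles and what each may do. */
enum ExampleRole {
    IAM_READ_ROLE(EnumSet.of(S3Action.LIST, S3Action.GET)),
    IAM_WRITE_ROLE(EnumSet.of(S3Action.LIST, S3Action.GET, S3Action.PUT)),
    IAM_ALL_ROLE(EnumSet.allOf(S3Action.class));

    private final Set<S3Action> allowed;

    ExampleRole(Set<S3Action> allowed) {
        this.allowed = allowed;
    }

    boolean allows(S3Action action) {
        return allowed.contains(action);
    }

    /**
     * Mimics storage-side enforcement: there is no up-front authorization check,
     * the operation simply fails when the role lacks the permission.
     */
    void check(S3Action action) {
        if (!allows(action)) {
            throw new SecurityException(name() + " is not permitted to " + action);
        }
    }
}
```

Under this model, a session assuming `IAM_READ_ROLE` fails on `check(S3Action.PUT)`, which corresponds to a CREATE TABLE or INSERT being rejected at the point where the write reaches S3.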

How do we assure the IAM string is valid? How do we authenticate it?

The AWS credential provider will throw an exception if the IAM string is invalid, and access will be denied if the role does not have the required permission. Say a user assuming IamReadRole tries to write files (CREATE TABLE): it will fail with a permission denied exception.

How do we validate the IAM string is allowed to run queries as the specified Presto username? Presto username is used for queue management and for functions, so it is important to restrict a person to only use one Presto username string. For example, can we verify that the IAM string ends with /<username>?

If I understand the question correctly: different users can assume the same IAM role, and the same user can assume different IAM roles in different sessions. The username is not part of the IAM role. The role is the identity under which the user accesses S3.

How should CREATE TABLE or INSERT work? Do we need to set file permissions?

CREATE TABLE and INSERT definitely need to assume IamWriteRole; if not, they will fail with permission denied exceptions.

How will views work? A SQL view executes as the person that created the view. Where should we store the IAM string from the view owner? Or do we simply disallow view creation?

Our current usage is: if a view is created in the sensitive database by a user assuming a valid IamWriteRole, then all users assuming IamReadRole can read from the view.

Does this mean that when we add GRANT to Presto that this will be disabled for S3? Or would we set permissions in S3?

I am thinking that when Presto has GRANT, the S3 IAM role could simply continue working alongside it. Say the data warehouse is on S3: then GRANT alone does not guarantee permission to access the warehouse, and the user also needs to assume an appropriate IAM role to access it.

How do you envision this working for people using a mixed S3 and HDFS installation on AWS? For HDFS we do not support storage-based security. This would be confusing to users and maybe impossible in the code

Yes, this is confusing. There seem to be two models: the metastore GRANT model, which HDFS follows, and the storage-based model, which the S3 IAM role follows. I think making HDFS follow the storage-based model, or making S3 follow the metastore GRANT model, would be confusing and difficult. Making the S3 IAM role a session property more or less assumes the metastore GRANT model is the ideal model for Presto, while still making it work with S3 IAM roles.

Your comments and suggestions are appreciated.

@nezihyigitbasi
Contributor

@zhenxiao any plans to finalize this PR?

@nezihyigitbasi
Contributor

@martint @dain @zhenxiao Just want to check with you guys whether it's OK to build this feature with the following steps as another authentication scheme:

  • Get the IAM role from the presto-cli, either as a session property or with another command line argument. We should be able to specify an IAM role per client session; it can't be a static config.
  • Add a new servlet filter that checks for a specific header (which the CLI sets when it sees the IAM role passed to it) and creates a principal that wraps the role. After the filter is executed we should have the principal in the current HTTP request and then in the current session.
  • We can implement ConnectorAccessControl and use the AWS IAM SDK to get info about the policies (resources + actions allowed on those resources etc.) and do access control checks. I actually prefer allowing all and then letting tasks fail at the point they make the S3 calls.
  • Pass the session (which has a principal set) all the way down to the S3 filesystem initialization and assume that role before executing the filesystem operations. We also need to disable filesystem caching as different roles will require different STSAssumeRoleSessionCredentialsProvider instances.
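
The second step above (header to principal) might be sketched as follows, using only JDK types. Everything here is hypothetical: the header name `X-Presto-Aws-Iam-Role`, the class names, and the use of a plain `Map` in place of a servlet request are assumptions, and a real filter would need case-insensitive header lookup.

```java
import java.security.Principal;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical principal that wraps an IAM role ARN. */
final class IamRolePrincipal implements Principal {
    private final String roleArn;

    IamRolePrincipal(String roleArn) {
        this.roleArn = Objects.requireNonNull(roleArn, "roleArn is null");
    }

    @Override
    public String getName() {
        return roleArn;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof IamRolePrincipal
                && roleArn.equals(((IamRolePrincipal) other).roleArn);
    }

    @Override
    public int hashCode() {
        return roleArn.hashCode();
    }

    @Override
    public String toString() {
        return "IamRolePrincipal{" + roleArn + "}";
    }
}

/** Hypothetical helper a servlet filter could call with the request headers. */
final class IamRoleHeaderReader {
    // Assumed header name; the thread does not specify one.
    static final String ROLE_HEADER = "X-Presto-Aws-Iam-Role";

    private IamRoleHeaderReader() {}

    /** Returns a principal if the role header is present and non-blank. */
    static Optional<Principal> principalFrom(Map<String, String> headers) {
        String value = headers.get(ROLE_HEADER);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new IamRolePrincipal(value.trim()));
    }
}
```

Requests without the header yield an empty `Optional`, so existing clients that do not pass a role would keep their current behavior.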

@zhenxiao
Collaborator Author

zhenxiao commented Mar 9, 2016

Thanks for the update, @nezihyigitbasi.
@dain @martint let me hand this AWS role support over to @nezihyigitbasi.
I might not be able to debug AWS in the short term, but I will join the discussion as well.

@dain
Contributor

dain commented Jul 27, 2016

@nezihyigitbasi is this one still needed?

@ghost ghost added the CLA Signed label Jul 27, 2016
@dain dain assigned nezihyigitbasi and unassigned dain Jul 27, 2016
@dain dain added the question label Jul 27, 2016
@nezihyigitbasi
Contributor

Nope, one can get this functionality with a custom credentials provider (#5667), so closing.

7 participants