Add a new S3FileSystemType in addition to PRESTO and EMRFS #1397
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.
Why not add an additional property to Presto instead of loading Hadoop-specific files like
Additionally, we could have something that loads the file system that is configured in Hadoop.
presto-hive/src/main/java/io/prestosql/plugin/hive/s3/HiveS3Module.java
Thanks for the suggestion, @kokosing. I like the second approach. Regarding
hive.config.resources is an arbitrary list of Hadoop config files, so they can be named anything.
I had the same thought as @kokosing about configuring the class name via a normal Presto config property. However, I'm not sure if this is really needed, since the Hadoop config file is likely still needed for other file system configuration. One reason to want a Presto config is so that we can ensure all three schemes, s3, s3a, and s3n, are mapped to the same class name. But again, I'm not sure that is needed. Thoughts?
Otherwise, this PR looks good. @apc999 please squash the commits, then we can merge this, assuming everyone agrees that no additional config is needed.
presto-hive/src/main/java/io/prestosql/plugin/hive/s3/HiveS3Module.java
What will be the result (an error?) when a user chooses CUSTOM/DEFAULT but does not provide a -site.xml file with the configuration for the s3 file system?
@@ -17,4 +17,5 @@
{
PRESTO,
EMRFS,
CUSTOM,
@kokosing proposed "HADOOP"
I'd propose "HADOOP_DEFAULT".
done
Thanks for the suggestion. On second thought, I will go with HADOOP for the following reasons:
- This approach goes through the Hadoop DFS interface to classload the correct implementing class, so HADOOP is accurate.
- HADOOP_DEFAULT may remind users of the other Hadoop property fs.defaultFS (link, see below) for setting a default DFS, which is not the property used here to inject an implementation such as Alluxio for a scheme.
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
"HADOOP_DEFAULT may remind users of the other Hadoop property fs.defaultFS"
I see your point.
To me it stands for "Hadoop's default".
I prefer to have this "default" there, because "Hadoop" is not a filesystem implementation.
(In fact, we already use HADOOP_DEFAULT internally with exactly the same meaning, for a service other than the file system.)
Also, it could be named NONE because we don't set anything in Presto and one may not load any resource files.
I don't like NONE, as we may end up having s3 fs impl.
@kokosing WDYT about what I suggested?
Sounds good to me.
Makes sense. Changed to HADOOP_DEFAULT to be consistent.
@findepi If they choose the option and do nothing else, the FS schemes will simply not exist.
@findepi An exception like "No FileSystem for scheme: s3" will be thrown.
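The failure mode described in the two replies above can be illustrated with a small stand-alone model. This is only a sketch, not Hadoop code: `SchemeLookup` and `implementationFor` are hypothetical names, the plain `Map` stands in for the loaded Hadoop configuration, and real Hadoop also discovers file systems through other mechanisms (such as service loading) that are ignored here.

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical, simplified stand-in for Hadoop's scheme-to-class resolution.
final class SchemeLookup
{
    private SchemeLookup() {}

    // Looks up fs.<scheme>.impl in the given configuration.
    // Fails when nothing is configured for the scheme, which is the
    // situation when HADOOP_DEFAULT is chosen without a matching config file.
    static String implementationFor(String scheme, Map<String, String> hadoopConfig)
            throws IOException
    {
        String className = hadoopConfig.get("fs." + scheme + ".impl");
        if (className == null) {
            throw new IOException("No FileSystem for scheme: " + scheme);
        }
        return className;
    }
}
```

In this model, an empty configuration fails for s3, while a configuration that sets fs.s3.impl resolves to the configured class name.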
Commits squashed.
Some tests failed. Are they flaky?
@electrum That was my intention. I agree with you that this is not that useful, considering that other properties very likely have to be set as well.
This small PR aims to allow a storage service to serve URLs with the s3 scheme. Currently, URLs like s3://bucket/path can only be served by PrestoS3FileSystem or the EMR FS class (i.e., com.amazon.ws.emr.hadoop.fs.EmrFileSystem). Any choice other than these two throws a RuntimeException. This PR enables Presto to be served by additional services (e.g., Alluxio as a caching layer on top of S3, without changing the HMS).
In particular, users can update etc/config.properties
hive.s3-file-system-type=HADOOP_DEFAULT
and update core-site.xml
<property>
<name>fs.s3.impl</name>
<value>alluxio.hadoop.ShimFileSystem</value>
</property>
As a result, end users can transparently benefit from Alluxio caching for Presto.
Fixes #1416
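The behavior this PR describes can be modeled roughly as follows. This is a hedged sketch, not the actual HiveS3Module code: `S3SchemeBinding`, `apply`, and `bind` are hypothetical names, the `Map` stands in for the Hadoop configuration, and the fully qualified name used for PrestoS3FileSystem is an assumption based on the module path shown in the review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the file system type option discussed in this PR.
enum S3FileSystemType { PRESTO, EMRFS, HADOOP_DEFAULT }

final class S3SchemeBinding
{
    private static final String[] SCHEMES = {"s3", "s3a", "s3n"};

    private S3SchemeBinding() {}

    // Returns a copy of hadoopConfig with fs.<scheme>.impl set according to the type.
    static Map<String, String> apply(S3FileSystemType type, Map<String, String> hadoopConfig)
    {
        Map<String, String> result = new LinkedHashMap<>(hadoopConfig);
        switch (type) {
            case PRESTO:
                // Assumed package name, inferred from the module path.
                bind(result, "io.prestosql.plugin.hive.s3.PrestoS3FileSystem");
                break;
            case EMRFS:
                bind(result, "com.amazon.ws.emr.hadoop.fs.EmrFileSystem");
                break;
            case HADOOP_DEFAULT:
                // Nothing is set here; whatever core-site.xml provides is kept.
                break;
            default:
                throw new RuntimeException("Unknown file system type: " + type);
        }
        return result;
    }

    // Maps all three S3 schemes to the same implementation class.
    private static void bind(Map<String, String> config, String className)
    {
        for (String scheme : SCHEMES) {
            config.put("fs." + scheme + ".impl", className);
        }
    }
}
```

With HADOOP_DEFAULT, a configuration that maps fs.s3.impl to alluxio.hadoop.ShimFileSystem passes through unchanged, while PRESTO and EMRFS overwrite all three schemes with their own class.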