
Add a new S3FileSystemType in addition to PRESTO and EMRFS #1397

Merged
merged 1 commit into from
Aug 30, 2019

Conversation

@apc999 (Contributor) commented Aug 28, 2019

This small PR aims to allow a storage service to serve URLs with the s3 scheme. Currently, input URLs like s3://bucket/path can only be served by PrestoS3FileSystem or the EMR FS class (i.e., com.amazon.ws.emr.hadoop.fs.EmrFileSystem); any choice other than these two results in a RuntimeException. This PR enables Presto to be served by additional services (e.g., Alluxio as a caching layer on top of S3, without changing HMS).

In particular, users can update etc/config.properties with

hive.s3-file-system-type=CUSTOM

and update core-site.xml

<property>
  <name>fs.s3.impl</name>
  <value>alluxio.hadoop.ShimFileSystem</value>
</property>

As a result, end users can transparently benefit from Alluxio caching in Presto.

Fixes #1416
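For context, the dispatch this PR relaxes looks roughly like the following. This is only an illustrative sketch, not the actual Presto source: the class, method, and constant names here are hypothetical, and only the PRESTO/EMRFS/CUSTOM behavior is taken from the description above.

import org.apache.hadoop.conf.Configuration;

// Hypothetical names for illustration only; not the actual Presto classes.
enum S3FileSystemType { PRESTO, EMRFS, CUSTOM }

class S3SchemeConfigurer
{
    // EMRFS class name is from the PR description; the Presto class name is approximate.
    private static final String EMRFS_CLASS = "com.amazon.ws.emr.hadoop.fs.EmrFileSystem";
    private static final String PRESTO_S3_CLASS = "io.prestosql.plugin.hive.s3.PrestoS3FileSystem";

    static void configureS3(Configuration config, S3FileSystemType type)
    {
        switch (type) {
            case PRESTO:
                config.set("fs.s3.impl", PRESTO_S3_CLASS);
                return;
            case EMRFS:
                config.set("fs.s3.impl", EMRFS_CLASS);
                return;
            case CUSTOM:
                // New in this PR: leave fs.s3.impl to whatever the Hadoop
                // config resources (e.g., core-site.xml) declare.
                return;
        }
        throw new RuntimeException("Unsupported S3 file system type: " + type);
    }
}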

@cla-bot (bot) commented Aug 28, 2019

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@prestosql.io. For more information, see https://github.com/prestosql/cla.

@kokosing (Member)

Why not add an additional property to Presto instead of loading Hadoop-specific files like core-site.xml?

hive.s3-file-system-type=CUSTOM
hive.s3-file-system-type.class=alluxio.hadoop.FileSystem

Additionally, we could have something that loads the file system that is configured in Hadoop.

hive.s3-file-system-type=HADOOP
hive.config.resources=s3-site.xml
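As an illustration of this second approach (not part of the original comment), such an s3-site.xml could simply carry the fs.s3.impl mapping, reusing the Alluxio class from the PR description; the file name is arbitrary, whatever hive.config.resources points at:

<configuration>
  <property>
    <name>fs.s3.impl</name>
    <value>alluxio.hadoop.ShimFileSystem</value>
  </property>
</configuration>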

@apc999 (Contributor, Author) commented Aug 28, 2019

Why not add an additional property to Presto instead of loading Hadoop-specific files like core-site.xml?

hive.s3-file-system-type=CUSTOM
hive.s3-file-system-type.class=alluxio.hadoop.FileSystem

Additionally, we could have something that loads the file system that is configured in Hadoop.

hive.s3-file-system-type=HADOOP
hive.config.resources=s3-site.xml

Thanks for the suggestion, @kokosing. I like the second approach.
This particular PR is still required to allow hive.s3-file-system-type=CUSTOM in addition to the two accepted values PRESTO and EMRFS; otherwise a RuntimeException is thrown.

Regarding hive.config.resources=s3-site.xml, I notice that a configuration like hive.config.resources=/<PATH>/core-site.xml,/<PATH>/hdfs-site.xml is currently documented. Do I need to modify anything to allow hive.config.resources=s3-site.xml?

@electrum (Member) left a comment

hive.config.resources is an arbitrary list of Hadoop config files, so they can be named anything.

I had the same thought as @kokosing about configuring the class name via a normal Presto config property. However, I'm not sure if this is really needed, since the Hadoop config file is likely still needed for other file system configuration. One reason to want a Presto config is so that we can ensure all three schemes, s3, s3a, s3n, are mapped to the same class name. But again, I'm not sure that is needed. Thoughts?

Otherwise, this PR looks good. @apc999 please squash the commits, then we can merge this, assuming everyone agrees that no additional config is needed.
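For reference, keeping the three schemes consistent purely in the Hadoop config would look roughly like this (illustrative only, reusing the Alluxio class from the description):

<property>
  <name>fs.s3.impl</name>
  <value>alluxio.hadoop.ShimFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>alluxio.hadoop.ShimFileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>alluxio.hadoop.ShimFileSystem</value>
</property>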

@findepi (Member) left a comment

What will be the result (an error?) when a user chooses CUSTOM/DEFAULT but does not provide a -site.xml with the configuration for the s3 filesystem?

@@ -17,4 +17,5 @@
 {
     PRESTO,
     EMRFS,
+    CUSTOM,
Member

@kokosing proposed "HADOOP"

I'd propose "HADOOP_DEFAULT".

Contributor Author

done

@apc999 (Contributor, Author) commented Aug 28, 2019

Thanks for the suggestion. On second thought, I will go with HADOOP for the following reasons:

  • This approach does go through the Hadoop FileSystem interface to classload the correct implementing class, so HADOOP is accurate.
  • HADOOP_DEFAULT may remind users of the other Hadoop property fs.defaultFS (see below) for setting a default filesystem, which is not really the property used to inject a scheme implementation like Alluxio here.
<property>
   <name>fs.defaultFS</name>
   <value>file:///</value>
</property>

Member

  • HADOOP_DEFAULT may remind users of the other Hadoop property fs.defaultFS

I see your point.
To me it stands for "Hadoop's default".
I prefer to have this "default" there, because "Hadoop" is not a filesystem implementation.

(In fact, we internally already use HADOOP_DEFAULT with exactly the same meaning, for a service other than the file system.)

Member

It could also be named NONE, because we don't set anything in Presto, and one may choose not to load any resource files.

Member

I don't like NONE, as we may end up having an s3 fs impl.
@kokosing WDYT about what I suggested?

Member

Sounds good to me.

Contributor Author

Makes sense. Changed to HADOOP_DEFAULT to be consistent.

@electrum (Member)

@findepi if they choose the option and do nothing else, the FS schemes will simply not exist.

@apc999 (Contributor, Author) commented Aug 28, 2019

@findepi An exception like "No FileSystem for scheme: s3" will be thrown.
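A minimal sketch of how that failure surfaces through the Hadoop FileSystem API, assuming no Hadoop config resources define fs.s3.impl (illustrative only, not from the PR):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class MissingS3SchemeDemo
{
    public static void main(String[] args) throws Exception
    {
        // No config resources loaded, so fs.s3.impl is not set.
        Configuration conf = new Configuration(false);
        // Expected to fail with an IOException: "No FileSystem for scheme: s3"
        FileSystem.get(URI.create("s3://bucket/path"), conf);
    }
}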

@apc999 (Contributor, Author) commented Aug 29, 2019

Commits squashed

@apc999 (Contributor, Author) commented Aug 29, 2019

Some tests failed. Are they flaky?

@kokosing (Member)

One reason to want a Presto config is so that we can ensure all three schemes, s3, s3a, s3n, are mapped to the same class name. But again, I'm not sure that is needed. Thoughts?

@electrum That was my intention. I agree with you that this is not that useful, considering that other properties will very likely have to be set as well.

This small PR aims to allow a storage service to serve URLs with the s3 scheme. Currently, input URLs like s3://bucket/path can only be served by PrestoS3FileSystem or the EMR FS class (i.e., com.amazon.ws.emr.hadoop.fs.EmrFileSystem); any choice other than these two results in a RuntimeException. This PR enables Presto to be served by additional services (e.g., Alluxio as a caching layer on top of S3, without changing HMS).

In particular, to leverage this, users can update etc/config.properties with

hive.s3-file-system-type=HADOOP_DEFAULT
and update core-site.xml

<property>
  <name>fs.s3.impl</name>
  <value>alluxio.hadoop.ShimFileSystem</value>
</property>
@sopel39 merged commit 1ab3978 into trinodb:master Aug 30, 2019
@sopel39 mentioned this pull request Aug 30, 2019
@martint added this to the 319 milestone Sep 20, 2019
Development

Successfully merging this pull request may close these issues.

s3 api compatible storage
6 participants