Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ADD S3 support for downloading and uploading processed datasets #1723
ADD S3 support for downloading and uploading processed datasets #1723
Changes from 16 commits
6cf8813
aa90496
6dba448
ac1d90f
49b2fd8
7eb6160
82f891d
ed17c5e
3bbab8c
574c9dd
0315025
8d072e6
7a70258
29313ab
5f6749f
354e39f
bc36832
57bcbe7
3493e87
b4fa6a9
0be3a11
5fbfcd7
b540612
9a1b282
081f4bc
2ce5fec
d346c6f
f25a036
df78d8b
e3fa922
fb992a5
53a6a4b
85f0297
8885a7b
b91345c
93a5f5b
8b55b89
2bf289d
83e4673
04042ea
ec29076
187e01d
7785f90
926f31c
22b33d7
72440ba
ea273a8
5359003
fd106e4
878f8b7
0b1a2f8
a3bebd5
eb69cdb
8b7cd48
9d7f5c6
8514bee
a8738ca
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we have
boto3
as an optional dependency, maybe there's a way to not use those (verbose) params and use like a profile name or something instead? (not sure, just a question)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @lhoestq mentioned, we can probably avoid using these params by letting the user provide a custom
fs
directly. I think this has several advantages (avoid having too many params, lets remove code specific to s3, ..)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you would suggest something like that?
What I don't like about that is the manual creation, since
fsspec
is not that well documented for the remote filesystems, e.g. when you want to know which "credentials" you need you to have to go to thes3fs
documentation.what do you think if we remove the named arguments
aws_profile
... and handle it howfsspec
does with anstorage_options
dict.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the docstring of
fsspec.filesystem
is not ideal:Maybe we can have a better documentation on our side instead using a wrapper:
Where the docstring of
datasets.filesystem
is more complete and includes examples for popular filesystems like s3There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to make available the filesystems classes easily:
shows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage_options
just likefsspec.filesystem
might be better but it seems difficult to document also. I think the same argument applies to having adatasets.filesystem
helper.I think I have a preference for your second option @lhoestq
fsspec
without explicit support from us?What do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think the second option is interesting and I totally agree with your three points.
Maybe let's start by having
S3FileSystem
indatasets.filesystems
and we can add the other ones later.In the documentation of
save_to_disk
/load_from_disk
we can then say that any filesystem fromdatasets.filesystems
orfsspec
can be used.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rebuilt everything so that you can now pass in a
fsspec
like filesystem.I also created a "draft" documentation version with a few examples for
S3Fileystem
. Your feedback would be nice.Afterwards, I would adjust the documentation of
save_to_disk
/load_from_disk