
ADD S3 support for downloading and uploading processed datasets #1723

Merged
merged 57 commits into huggingface:master on Jan 26, 2021

Conversation

philschmid
Member

What does this PR do?

This PR adds the functionality to load and save datasets from and to S3.
You can save datasets with either Dataset.save_to_disk() or DatasetDict.save_to_disk().
You can load datasets with load_from_disk(), Dataset.load_from_disk(), or DatasetDict.load_from_disk().

Loading CSV or JSON datasets from S3 is not implemented.

To save/load datasets to/from S3 you either need to provide an aws_profile that is set up on your machine (per default the default profile is used), or you have to pass an aws_access_key_id and aws_secret_access_key.

The implementation was done with fsspec and boto3.

Example aws_profile :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

Example aws_access_key_id and aws_secret_access_key :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk",
                            aws_access_key_id="fake_access_key", 
                            aws_secret_access_key="fake_secret_key"
                           )

load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk",
                            aws_access_key_id="fake_access_key", 
                            aws_secret_access_key="fake_secret_key"
                           )

If you want to load a dataset from a public s3 bucket you can pass anon=True

Example anon=True :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

load_from_disk("s3://moto-mock-s3-bucketdatasets/sdk",anon=True)

Full Example

import datasets

dataset = datasets.load_dataset("imdb")
print(f"DatasetDict contains {len(dataset)} datasets")
print(f"train Dataset has the size of: {len(dataset['train'])}")

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

remote_dataset = datasets.load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")
print(f"DatasetDict contains {len(remote_dataset)} datasets")
print(f"train Dataset has the size of: {len(remote_dataset['train'])}")

Related to #878

I will adjust the documentation after the code has been reviewed; until then I'll leave the PR in "draft" status. Something we could consider is renaming the functions, e.g. changing the _disk suffix to _filesystem.

Member

@lhoestq left a comment


Thank you for your help on this :)

I left a few comments.
In particular I think we can make things even simpler by adding a fs parameter to save_to_disk/load_from_disk so that the user can have full control on authentication but also on the type of filesystem they want to use.

I'm not too familiar with moto unfortunately, but if you think I can help fix the tests, let me know.

Review threads on: setup.py, src/datasets/utils/filesystem_utils.py, tests/test_arrow_dataset.py, src/datasets/arrow_dataset.py
dataset_path (``str``): path or S3 URI of the dataset directory where the dataset will be saved
aws_profile (:obj:`str`, `optional`, defaults to :obj:`"default"`): the AWS profile used to create the `boto_session` for uploading the data to S3
aws_access_key_id (:obj:`str`, `optional`, defaults to :obj:`None`): the AWS access key id used to create the `boto_session` for uploading the data to S3
aws_secret_access_key (:obj:`str`, `optional`, defaults to :obj:`None`): the AWS secret access key used to create the `boto_session` for uploading the data to S3
Member

@julien-c Jan 13, 2021


If we have boto3 as an optional dependency, maybe there's a way to not use those (verbose) params and use something like a profile name instead? (not sure, just a question)

Member

@n1t0 Jan 13, 2021


As @lhoestq mentioned, we can probably avoid using these params by letting the user provide a custom fs directly. I think this has several advantages (it avoids having too many params, lets us remove code specific to S3, ...).

Member Author


So you would suggest something like this?

import fsspec

s3 = fsspec.filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)

dataset.save_to_disk('s3://my-s3-bucket-with-region/dataset/train', fs=s3)

What I don't like about that is the manual creation, since fsspec is not that well documented for remote filesystems; e.g. when you want to know which "credentials" you need, you have to go to the s3fs documentation.

What do you think about removing the named arguments aws_profile... and handling it the way fsspec does, with a storage_options dict?

dataset.save_to_disk(
    's3://my-s3-bucket-with-region/dataset/train',
    storage_options={
        'aws_access_key_id': 123,
        'aws_secret_access_key': 123,
    },
)

Member


Indeed the docstring of fsspec.filesystem is not ideal:

Signature: fsspec.filesystem(protocol, **storage_options)
Docstring:
Instantiate filesystems for given protocol and arguments

``storage_options`` are specific to the protocol being chosen, and are
passed directly to the class.

Maybe we can have better documentation on our side instead, using a wrapper:

import datasets

fs = datasets.filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)

Where the docstring of datasets.filesystem is more complete and includes examples for popular filesystems like s3
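
A minimal sketch of what such a wrapper could look like (hypothetical: datasets.filesystem is only a proposal here, and the fuller docstring is abbreviated):

import fsspec

def filesystem(protocol, **storage_options):
    """Instantiate a filesystem for the given protocol.

    Example for S3 (arguments are forwarded to s3fs.S3FileSystem):
        fs = filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)
    """
    # Thin wrapper around fsspec.filesystem so popular protocols
    # (s3, gcs, ...) can be documented with examples in one place.
    return fsspec.filesystem(protocol, **storage_options)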

Member

@lhoestq Jan 14, 2021


Another option would be to make available the filesystems classes easily:

>>> from datasets.filesystems import S3FileSystem  # S3FileSystem is simply the class from s3fs
>>> S3FileSystem?  # show the docstring

shows

Init signature: s3fs.core.S3FileSystem(*args, **kwargs)
Docstring:     
Access S3 as if it were a file system.

This exposes a filesystem-like API (ls, cp, open, etc.) on top of S3
storage.

Provide credentials either explicitly (``key=``, ``secret=``) or depend
on boto's credential methods. See botocore documentation for more
information. If no credentials are available, use ``anon=True``.

Parameters
----------
anon : bool (False)
    Whether to use anonymous connection (public buckets only). If False,
    uses the key/secret given, or boto's credential resolver (client_kwargs,
    environment, variables, config files, EC2 IAM server, in that order)
key : string (None)
    If not anonymous, use this access key ID, if specified
secret : string (None)
    If not anonymous, use this secret access key, if specified
token : string (None)
    If not anonymous, use this security token, if specified
etc.
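
For reference, the two credential modes described in this docstring look like the following in practice (a sketch with placeholder credentials):

import s3fs

# Anonymous connection, for public buckets only
fs = s3fs.S3FileSystem(anon=True)

# Explicit credentials; if omitted, boto's credential resolver is used
fs = s3fs.S3FileSystem(key="fake_access_key", secret="fake_secret_key")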

Member


  • If we add the various params separately, the more new filesystems we support, the more complicated it becomes. For the user it becomes difficult to know which params to look at, and for us to document everything. It seems like a good option to test things out, but having to change it down the road will require breaking changes, and we usually try to avoid these as much as possible.
  • Using some kind of storage_options just like fsspec.filesystem might be better, but it also seems difficult to document. I think the same argument applies to having a datasets.filesystem helper.

I think I have a preference for your second option @lhoestq

  • It seems easy to show all the available filesystems to the user, with each of them having a meaningful documentation
  • We can probably add tests for those we support explicitly
  • If all of them share the same interface, then power users can probably use anything they want from fsspec without explicit support from us?

What do you guys think?

Member


Yes I think the second option is interesting and I totally agree with your three points.
Maybe let's start by having S3FileSystem in datasets.filesystems and we can add the other ones later.

In the documentation of save_to_disk/load_from_disk we can then say that any filesystem from datasets.filesystems or fsspec can be used.
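
As an illustration of that flexibility, a sketch of passing an arbitrary fsspec filesystem through the fs argument (the local-filesystem round trip below is an assumption for testing purposes, not something from this PR):

import fsspec
import datasets

dataset = datasets.load_dataset("imdb")

# Any fsspec-compatible filesystem can be passed via `fs`,
# e.g. the local filesystem for a quick round-trip test:
fs = fsspec.filesystem("file")
dataset.save_to_disk("/tmp/imdb", fs=fs)
reloaded = datasets.load_from_disk("/tmp/imdb", fs=fs)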

Member Author


I have rebuilt everything so that you can now pass in an fsspec-like filesystem.

from datasets import S3FileSystem, load_from_disk

s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP

dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP

print(len(dataset))
# 25000
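
For completeness, the save direction is symmetric with the new API (a sketch combining the snippets above; the bucket name and credentials are placeholders):

from datasets import S3FileSystem, load_dataset

dataset = load_dataset("imdb")

s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP

dataset.save_to_disk('s3://my-private-datasets/imdb', fs=s3)  # doctest: +SKIP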

I also created a "draft" documentation version with a few examples for S3FileSystem. Your feedback would be nice.
Afterwards, I will adjust the documentation of save_to_disk/load_from_disk.

[screenshot of the draft documentation]

@philschmid
Member Author

philschmid commented Jan 19, 2021

I created the documentation for FileSystem Integration for cloud storage, covering loading and saving datasets to/from a filesystem, with an example of using datasets.filesystems.S3FileSystem. I added a note to the "Saving a processed dataset on disk and reload" section saying that it is also possible to use other filesystems and cloud storage such as S3, with a link to the newly created documentation page.
A screenshot is attached here.
[screenshot of the built filesystems documentation page]

@philschmid philschmid marked this pull request as ready for review January 19, 2021 16:20
Member

@lhoestq left a comment


Thanks for all the changes regarding the temp directories!
I added a few comments, but they're mostly typos or suggestions for the docs.

Review threads on: docs/source/filesystems.rst, src/datasets/arrow_dataset.py, src/datasets/filesystems/__init__.py, src/datasets/filesystems/s3filesystem.py, src/datasets/load.py, tests/test_filesystem.py
philschmid and others added 10 commits January 22, 2021 13:51
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Member

@lhoestq left a comment


Thanks for adding this @philschmid :)

I tried to find a better name for save_to_disk/load_from_disk, but actually I like it this way.

philschmid and others added 2 commits January 26, 2021 17:25
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
@lhoestq lhoestq merged commit 40b42a1 into huggingface:master Jan 26, 2021