
ADD S3 support for downloading and uploading processed datasets #1723

Merged
merged 57 commits into huggingface:master on Jan 26, 2021

Conversation

philschmid
Member

What does this PR do?

This PR adds the functionality to load and save datasets from and to S3.
You can save datasets with either Dataset.save_to_disk() or DatasetDict.save_to_disk().
You can load datasets with load_from_disk(), Dataset.load_from_disk(), or DatasetDict.load_from_disk().

Loading CSV or JSON datasets from S3 is not implemented.

To save/load datasets to/from S3 you either need to provide an aws_profile that is set up on your machine (per default the default profile is used), or you have to pass an aws_access_key_id and aws_secret_access_key.

The implementation was done with fsspec and boto3.

Example aws_profile :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

Example aws_access_key_id and aws_secret_access_key :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk",
                            aws_access_key_id="fake_access_key", 
                            aws_secret_access_key="fake_secret_key"
                           )

load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk",
                            aws_access_key_id="fake_access_key", 
                            aws_secret_access_key="fake_secret_key"
                           )

If you want to load a dataset from a public s3 bucket you can pass anon=True

Example anon=True :

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

load_from_disk("s3://moto-mock-s3-bucketdatasets/sdk",anon=True)

Full Example

import datasets

dataset = datasets.load_dataset("imdb")
print(f"DatasetDict contains {len(dataset)} datasets")
print(f"train Dataset has the size of: {len(dataset['train'])}")

dataset.save_to_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")

remote_dataset = datasets.load_from_disk("s3://moto-mock-s3-bucket/datasets/sdk", aws_profile="hf-sm")
print(f"DatasetDict contains {len(remote_dataset)} datasets")
print(f"train Dataset has the size of: {len(remote_dataset['train'])}")

Related to #878

I will adjust the documentation after the code has been reviewed; until then I'll leave the PR in "draft" status. Something we could consider is renaming the functions, e.g. changing the _disk suffix to _filesystem.

Member

@lhoestq left a comment


Thank you for your help on this :)

I left a few comments.
In particular I think we can make things even simpler by adding a fs parameter to save_to_disk/load_from_disk so that the user can have full control on authentication but also on the type of filesystem they want to use.

I'm not too familiar with moto unfortunately, but if you think I can help fix the tests, let me know.

Review threads on: setup.py, src/datasets/utils/filesystem_utils.py, tests/test_arrow_dataset.py, src/datasets/arrow_dataset.py
dataset_path (``str``): path or S3 URI of the dataset directory where the dataset will be saved
aws_profile (:obj:`str`, `optional`, defaults to :obj:`"default"`): the AWS profile used to create the `boto_session` for uploading the data to S3
aws_access_key_id (:obj:`str`, `optional`, defaults to :obj:`None`): the AWS access key id used to create the `boto_session` for uploading the data to S3
aws_secret_access_key (:obj:`str`, `optional`, defaults to :obj:`None`): the AWS secret access key used to create the `boto_session` for uploading the data to S3
Member

@julien-c Jan 13, 2021


If we have boto3 as an optional dependency, maybe there's a way to not use those (verbose) params and use something like a profile name instead? (not sure, just a question)

Member

@n1t0 Jan 13, 2021


As @lhoestq mentioned, we can probably avoid using these params by letting the user provide a custom fs directly. I think this has several advantages (it avoids having too many params, lets us remove code specific to S3, ...).

Member Author


So you would suggest something like this?

import fsspec

s3 = fsspec.filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)

dataset.save_to_disk('s3://my-s3-bucket-with-region/dataset/train', fs=s3)

What I don't like about that is the manual creation, since fsspec is not that well documented for remote filesystems; e.g. when you want to know which "credentials" you need, you have to go to the s3fs documentation.

What do you think about removing the named arguments aws_profile... and handling it the way fsspec does, with a storage_options dict?

dataset.save_to_disk(
    's3://my-s3-bucket-with-region/dataset/train',
    storage_options={
        'aws_access_key_id': 123,
        'aws_secret_access_key': 123,
    },
)

Member


Indeed the docstring of fsspec.filesystem is not ideal:

Signature: fsspec.filesystem(protocol, **storage_options)
Docstring:
Instantiate filesystems for given protocol and arguments

``storage_options`` are specific to the protocol being chosen, and are
passed directly to the class.

Maybe we can have better documentation on our side instead, using a wrapper:

import datasets

fs = datasets.filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)

Where the docstring of datasets.filesystem is more complete and includes examples for popular filesystems like s3
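
A minimal sketch of what such a wrapper could look like (hypothetical: datasets.filesystem is only a proposal here, and the fuller docstring is abbreviated):

import fsspec

def filesystem(protocol, **storage_options):
    """Instantiate a filesystem for the given protocol.

    Example for S3 (arguments are forwarded to s3fs.S3FileSystem):
        fs = filesystem("s3", anon=False, key=aws_access_key_id, secret=aws_secret_access_key)
    """
    # Thin wrapper around fsspec.filesystem so popular protocols
    # (s3, gcs, ...) can be documented with examples in one place.
    return fsspec.filesystem(protocol, **storage_options)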

Member

@lhoestq Jan 14, 2021


Another option would be to make available the filesystems classes easily:

>>> from datasets.filesystems import S3FileSystem  # S3FileSystem is simply the class from s3fs
>>> S3FileSystem?  # show the docstring

shows

Init signature: s3fs.core.S3FileSystem(*args, **kwargs)
Docstring:     
Access S3 as if it were a file system.

This exposes a filesystem-like API (ls, cp, open, etc.) on top of S3
storage.

Provide credentials either explicitly (``key=``, ``secret=``) or depend
on boto's credential methods. See botocore documentation for more
information. If no credentials are available, use ``anon=True``.

Parameters
----------
anon : bool (False)
    Whether to use anonymous connection (public buckets only). If False,
    uses the key/secret given, or boto's credential resolver (client_kwargs,
    environment, variables, config files, EC2 IAM server, in that order)
key : string (None)
    If not anonymous, use this access key ID, if specified
secret : string (None)
    If not anonymous, use this secret access key, if specified
token : string (None)
    If not anonymous, use this security token, if specified
etc.
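
For reference, the two credential modes described in this docstring look like the following in practice (a sketch with placeholder credentials):

import s3fs

# Anonymous connection, for public buckets only
fs = s3fs.S3FileSystem(anon=True)

# Explicit credentials; if omitted, boto's credential resolver is used
fs = s3fs.S3FileSystem(key="fake_access_key", secret="fake_secret_key")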

Member


  • If we add the various params separately, the more new filesystems we support, the more complicated it becomes. For the user it becomes difficult to know which params to look at, and for us to document everything. It seems like a good option to test things out, but having to change it down the road will require breaking changes, and we usually try to avoid these as much as possible.
  • Using some kind of storage_options just like fsspec.filesystem might be better, but it also seems difficult to document. I think the same argument applies to having a datasets.filesystem helper.

I think I have a preference for your second option @lhoestq

  • It seems easy to show all the available filesystems to the user, with each of them having a meaningful documentation
  • We can probably add tests for those we support explicitly
  • If all of them share the same interface, then power users can probably use anything they want from fsspec without explicit support from us?

What do you guys think?

Member


Yes I think the second option is interesting and I totally agree with your three points.
Maybe let's start by having S3FileSystem in datasets.filesystems and we can add the other ones later.

In the documentation of save_to_disk/load_from_disk we can then say that any filesystem from datasets.filesystems or fsspec can be used.
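
As an illustration of that flexibility, a sketch of passing an arbitrary fsspec filesystem through the fs argument (the local-filesystem round trip below is an assumption for testing purposes, not something from this PR):

import fsspec
import datasets

dataset = datasets.load_dataset("imdb")

# Any fsspec-compatible filesystem can be passed via `fs`,
# e.g. the local filesystem for a quick round-trip test:
fs = fsspec.filesystem("file")
dataset.save_to_disk("/tmp/imdb", fs=fs)
reloaded = datasets.load_from_disk("/tmp/imdb", fs=fs)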

Member Author


I have rebuilt everything so that you can now pass in an fsspec-like filesystem.

from datasets import S3FileSystem, load_from_disk

s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP

dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP

print(len(dataset))
# 25000
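
For completeness, the save direction is symmetric with the new API (a sketch combining the snippets above; the bucket name and credentials are placeholders):

from datasets import S3FileSystem, load_dataset

dataset = load_dataset("imdb")

s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP

dataset.save_to_disk('s3://my-private-datasets/imdb', fs=s3)  # doctest: +SKIP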

I also created a "draft" documentation version with a few examples for S3FileSystem. Your feedback would be nice.
Afterwards, I will adjust the documentation of save_to_disk/load_from_disk.

[screenshot of the draft documentation]

@philschmid
Member Author

philschmid commented Jan 19, 2021

I created the documentation for FileSystem Integration for cloud storage, covering loading and saving datasets to/from a filesystem, with an example of using datasets.filesystems.S3FileSystem. I added a note to the "Saving a processed dataset on disk and reload" section saying that it is also possible to use other filesystems and cloud storage such as S3, with a link to the newly created documentation page.
A screenshot is attached here.
[screenshot of the built filesystems documentation page]

@philschmid philschmid marked this pull request as ready for review January 19, 2021 16:20
Member

@lhoestq left a comment


Thanks for all the changes regarding the temp directories!
I added a few comments, but they're mostly typos or suggestions for the docs.

Review threads on: docs/source/filesystems.rst, src/datasets/arrow_dataset.py, src/datasets/filesystems/__init__.py, src/datasets/filesystems/s3filesystem.py, src/datasets/load.py, tests/test_filesystem.py
philschmid and others added 10 commits January 22, 2021 13:51
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Member

@lhoestq left a comment


Thanks for adding this @philschmid :)

I tried to find a better name for save_to_disk/load_from_disk, but actually I like it this way.

philschmid and others added 2 commits January 26, 2021 17:25
Co-authored-by: Julien Chaumond <chaumond@gmail.com>
@lhoestq lhoestq merged commit 40b42a1 into huggingface:master Jan 26, 2021