ADD S3 support for downloading and uploading processed datasets #1723
Merged

Commits (57)
- 6cf8813 added fsspec and fsspec[s3] adjusted save_to_disk function (philschmid)
- aa90496 added reading from s3 (philschmid)
- 6dba448 fixed save_to_disk for s3 path (philschmid)
- ac1d90f implemented tests (philschmid)
- 49b2fd8 added filesystem utils to arrow dataset and dataset dict (philschmid)
- 7eb6160 added tests for filesystem_utils (philschmid)
- 82f891d added DatasetDict test (philschmid)
- ed17c5e changed var from s3_ to proc_ (philschmid)
- 3bbab8c Merge remote-tracking branch 'upstream/master' (philschmid)
- 574c9dd fixed error in load from disk function (philschmid)
- 0315025 fixing directory creation (philschmid)
- 8d072e6 removed fs.makedirs since files has to be saved temp local (philschmid)
- 7a70258 fixed code quality checks (philschmid)
- 29313ab fixed quality check (philschmid)
- 5f6749f added noqa for pytest to work with moto (philschmid)
- 354e39f stupid mistake with wrong order at imports (philschmid)
- bc36832 adjuste boto3 version to work with moto in tests (philschmid)
- 57bcbe7 removed pytest fixtures from unittest class (philschmid)
- 3493e87 forgot to remove fixture as parameter... (philschmid)
- b4fa6a9 Make it working with Windows style paths. (mfuntowicz)
- 0be3a11 Merge pull request #1 from philschmid/add_s3 (philschmid)
- 5fbfcd7 fixed code quality (philschmid)
- b540612 Merge remote-tracking branch 'upstream/master' (philschmid)
- 9a1b282 fixed hopefully the last path problems for WIN (philschmid)
- 081f4bc added Path().pathjoin with posix to load_from_disk for DatasetDict keys (philschmid)
- 2ce5fec fixed win path problem (philschmid)
- d346c6f create conditional dataset_dict_split_path for creating correct path … (philschmid)
- f25a036 added s3 as extra requires (philschmid)
- df78d8b fixed boto imports for docs (philschmid)
- e3fa922 added S3FileSystem with documentation (philschmid)
- fb992a5 reworked everything for datasets.filesystem (philschmid)
- 53a6a4b documentation and styling (philschmid)
- 85f0297 added s3fs for documentation (philschmid)
- 8885a7b handle optional s3fs dependency (lhoestq)
- b91345c fix test (lhoestq)
- 93a5f5b adjusted doc order and renamed preproc_dataset_path to extract_path_f… (philschmid)
- 8b55b89 added temp dir when saving (philschmid)
- 2bf289d fixed quality (philschmid)
- 83e4673 added documentation (philschmid)
- 04042ea implemented save_to_disk for local remote filesystem with temp dir (philschmid)
- ec29076 fixed documentation example (philschmid)
- 187e01d fixed documentation for botocore and boto3 (philschmid)
- 7785f90 Merge branch 'master' of git://github.com/huggingface/datasets (philschmid)
- 926f31c Update docs/source/filesystems.rst (philschmid)
- 22b33d7 Update docs/source/filesystems.rst (philschmid)
- 72440ba Update docs/source/filesystems.rst (philschmid)
- ea273a8 Update src/datasets/arrow_dataset.py (philschmid)
- 5359003 Update src/datasets/arrow_dataset.py (philschmid)
- fd106e4 Update src/datasets/filesystems/__init__.py (philschmid)
- 878f8b7 Update src/datasets/filesystems/s3filesystem.py (philschmid)
- 0b1a2f8 Update src/datasets/filesystems/s3filesystem.py (philschmid)
- a3bebd5 Update src/datasets/load.py (philschmid)
- eb69cdb removed unnecessary @mock_s3 (philschmid)
- 8b7cd48 Update docs/source/filesystems.rst (philschmid)
- 9d7f5c6 Update docs/source/filesystems.rst (philschmid)
- 8514bee Update src/datasets/filesystems/s3filesystem.py (philschmid)
- a8738ca Update docs/source/processing.rst (philschmid)
FileSystems Integration for cloud storage
=========================================

Supported Filesystems
---------------------

Currently ``datasets`` offers an s3 filesystem implementation with :class:`datasets.filesystems.S3FileSystem`. ``S3FileSystem`` is a subclass of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, which is a known implementation of ``fsspec``.

Furthermore, ``datasets`` supports all ``fsspec`` implementations. Currently known implementations are:

- `s3fs <https://s3fs.readthedocs.io/en/latest/>`_ for Amazon S3 and other compatible stores
- `gcsfs <https://gcsfs.readthedocs.io/en/latest/>`_ for Google Cloud Storage
- `adl <https://github.com/dask/adlfs>`_ for Azure DataLake storage
- `abfs <https://github.com/dask/adlfs>`_ for Azure Blob service
- `dropbox <https://github.com/MarineChap/dropboxdrivefs>`_ for access to Dropbox shares
- `gdrive <https://github.com/intake/gdrivefs>`_ to access Google Drive and shares (experimental)

These known implementations are going to be natively supported within ``datasets`` in the near future.
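Whatever the backend, remote dataset paths follow the ``scheme://bucket/path`` convention that ``fsspec`` filesystems use. As a minimal sketch, the URI can be split with the standard library alone (the ``split_dataset_uri`` helper below is hypothetical and not part of ``datasets``):

```python
from urllib.parse import urlparse

def split_dataset_uri(uri):
    # Split e.g. 's3://my-private-datasets/imdb/train' into the
    # filesystem scheme, the bucket, and the key prefix inside it.
    parsed = urlparse(uri)
    # parsed.path keeps a leading '/'; strip it for the key prefix
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

print(split_dataset_uri("s3://my-private-datasets/imdb/train"))
# ('s3', 'my-private-datasets', 'imdb/train')
```

The same split works for any of the schemes above (``gcs://``, ``abfs://``, and so on).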
**Examples:**

To use :class:`datasets.filesystems.S3FileSystem` within ``datasets``, install the s3 extra:

.. code-block:: bash

    pip install datasets[s3]

Listing files from a public s3 bucket:

.. code-block:: python

    >>> import datasets
    >>> s3 = datasets.filesystems.S3FileSystem(anon=True)  # doctest: +SKIP
    >>> s3.ls('public-datasets/imdb/train')  # doctest: +SKIP
    ['dataset_info.json', 'dataset.arrow', 'state.json']
Listing files from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``:

.. code-block:: python

    >>> import datasets
    >>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP
    >>> s3.ls('my-private-datasets/imdb/train')  # doctest: +SKIP
    ['dataset_info.json', 'dataset.arrow', 'state.json']

Using ``S3FileSystem`` with a ``botocore.session.Session`` and a custom ``aws_profile``:

.. code-block:: python

    >>> import botocore
    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> s3_session = botocore.session.Session(profile_name='my_profile_name')
    >>> s3 = S3FileSystem(session=s3_session)  # doctest: +SKIP
Saving a processed dataset to s3
--------------------------------

Once you have your final dataset you can save it to s3 and reuse it later using :obj:`datasets.load_from_disk`.
Saving a dataset to s3 will upload various files to your bucket:

- ``dataset.arrow`` files: they contain your dataset's data
- ``dataset_info.json``: contains the description, citations, etc. of the dataset
- ``state.json``: contains the list of the arrow files and other information like the dataset format type, if any (torch or tensorflow for example)
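According to the commit history, these files are first written to a local temporary directory and then uploaded, since arrow files have to be written locally before they can be pushed to a remote store. A rough sketch of that flow, with a plain local copy standing in for the real filesystem upload (the ``save_then_upload`` helper and the placeholder file contents are illustrative, not the actual implementation):

```python
import json
import os
import shutil
import tempfile

def save_then_upload(dest_dir, upload=shutil.copy):
    # Sketch: write the dataset files to a local temp dir first,
    # then push each one to its destination. A real implementation
    # would call something like fs.put(local, remote) on an fsspec
    # filesystem instead of shutil.copy.
    tmp = tempfile.mkdtemp()
    files = {
        "dataset.arrow": "arrow-bytes-placeholder",
        "dataset_info.json": json.dumps({"description": "demo"}),
        "state.json": json.dumps({"_data_files": ["dataset.arrow"]}),
    }
    os.makedirs(dest_dir, exist_ok=True)
    for name, content in files.items():
        local_path = os.path.join(tmp, name)
        with open(local_path, "w") as f:
            f.write(content)
        upload(local_path, os.path.join(dest_dir, name))
    shutil.rmtree(tmp)  # clean up the local staging copy
    return sorted(os.listdir(dest_dir))
```

Called with a destination directory, this leaves exactly the three files listed above at the destination once the upload step finishes.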
Saving ``encoded_dataset`` to a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``:

.. code-block:: python

    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
    >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP
    >>>
    >>> # saves encoded_dataset to your s3 bucket
    >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP
Saving ``encoded_dataset`` to a private s3 bucket using a ``botocore.session.Session`` and a custom ``aws_profile``:

.. code-block:: python

    >>> import botocore
    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> # creates a botocore session with the provided aws_profile
    >>> s3_session = botocore.session.Session(profile_name='my_profile_name')
    >>>
    >>> # create S3FileSystem instance with s3_session
    >>> s3 = S3FileSystem(session=s3_session)  # doctest: +SKIP
    >>>
    >>> # saves encoded_dataset to your s3 bucket
    >>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP
Loading a processed dataset from s3
-----------------------------------

After you have saved your processed dataset to s3 you can load it using :obj:`datasets.load_from_disk`.
You can only load datasets from s3 that were saved using :func:`datasets.Dataset.save_to_disk`
or :func:`datasets.DatasetDict.save_to_disk`.
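Since :obj:`datasets.load_from_disk` accepts both local paths and s3 URIs, a loader has to decide whether a remote filesystem is needed. A hypothetical check (not the actual ``datasets`` code) illustrating that dispatch:

```python
def is_remote_uri(path):
    # A path needs a remote filesystem (fs=...) when it carries a
    # non-local scheme such as s3://, gcs://, or adl://.
    return "://" in path and not path.startswith("file://")

print(is_remote_uri("s3://my-private-datasets/imdb/train"))  # True
print(is_remote_uri("/home/user/imdb/train"))                # False
```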
Loading ``encoded_dataset`` from a public s3 bucket:

.. code-block:: python

    >>> from datasets import load_from_disk
    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> # create S3FileSystem without credentials
    >>> s3 = S3FileSystem(anon=True)  # doctest: +SKIP
    >>>
    >>> # load encoded_dataset from s3 bucket
    >>> dataset = load_from_disk('s3://a-public-datasets/imdb/train', fs=s3)  # doctest: +SKIP
    >>>
    >>> print(len(dataset))
    25000
Loading ``encoded_dataset`` from a private s3 bucket using ``aws_access_key_id`` and ``aws_secret_access_key``:

.. code-block:: python

    >>> from datasets import load_from_disk
    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> # create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
    >>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  # doctest: +SKIP
    >>>
    >>> # load encoded_dataset from s3 bucket
    >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP
    >>>
    >>> print(len(dataset))
    25000
Loading ``encoded_dataset`` from a private s3 bucket using a ``botocore.session.Session`` and a custom ``aws_profile``:

.. code-block:: python

    >>> import botocore
    >>> from datasets.filesystems import S3FileSystem
    >>>
    >>> # creates a botocore session with the provided aws_profile
    >>> s3_session = botocore.session.Session(profile_name='my_profile_name')
    >>>
    >>> # create S3FileSystem instance with s3_session
    >>> s3 = S3FileSystem(session=s3_session)  # doctest: +SKIP
    >>>
    >>> # load encoded_dataset from s3 bucket
    >>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3)  # doctest: +SKIP
    >>>
    >>> print(len(dataset))
    25000
We need fsspec in the requirements just for the local filesystem, right?
Maybe we can simply have our own ``LocalFileSystem`` from their implementation instead, in ``datasets.filesystem.LocalFileSystem``.
Yes, we need ``fsspec`` for the local filesystem, so we would copy their implementation instead of installing ``fsspec``. I can give it a try later.
Sadly not possible, since ``LocalFileSystem`` has ``AbstractFileSystem`` as a dependency, which itself has several more dependencies: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/spec.html#AbstractFileSystem
We've taken a look with @philschmid and it's not just a class to copy, so we'll need to have ``fsspec`` as a hard dependency.