
Finish deprecating the fs argument #5393

Merged
merged 9 commits into huggingface:main on Jan 18, 2023

Conversation

dconathan
Contributor

See #5385 for some discussion on this

The fs= arg was deprecated in Dataset.save_to_disk and Dataset.load_from_disk in 2.8.0 (to be removed in 3.0.0). There are a few other places where the fs= arg is still used (functions/methods in datasets.info and datasets.load). This PR adds the same behavior to those functions and methods: a deprecation warning for fs= and a new storage_options= argument.

One question: should the "deprecated" / "added" versions be 2.8.1 for the docs/warnings on these? Right now I'm going with "fs was deprecated in 2.8.0" but "storage_options= was added in 2.8.1" where appropriate.

@mariosasko
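
For context, the deprecation shim being added follows roughly this pattern; the function below is a minimal, hypothetical sketch (not the actual datasets.info/datasets.load source), and the warning text mirrors the one quoted later in this thread:

import warnings
from typing import Optional

from fsspec.core import get_fs_token_paths


def load_from_directory(path: str, fs="deprecated", *, storage_options: Optional[dict] = None):
    # Hypothetical example of the deprecation shim; not the real datasets function.
    if fs != "deprecated":
        warnings.warn(
            "'fs' was deprecated in favor of 'storage_options' and will be removed in 3.0.0.\n"
            "You can remove this warning by passing 'storage_options=fs.storage_options' instead.",
            FutureWarning,
        )
        storage_options = fs.storage_options
    # Re-create the filesystem from plain keyword arguments instead of accepting an fs object.
    fs, _, _ = get_fs_token_paths(path, storage_options=storage_options)
    ...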

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Dec 29, 2022

The documentation is not available anymore as the PR was closed or merged.

@albertvillanova (Member) left a comment

Thanks for the deprecation. Some minor suggested fixes below...

Also note that the corresponding tests should be updated as well.

(Review suggestions on src/datasets/info.py and src/datasets/load.py; all threads resolved.)
dconathan and others added 2 commits December 29, 2022 08:55
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
@dconathan
Contributor Author

> Thanks for the deprecation. Some minor suggested fixes below...
>
> Also note that the corresponding tests should be updated as well.

Thanks for the suggestions/typo fixes. I updated the failing test - passing locally now

@lhoestq
Member

lhoestq commented Jan 5, 2023

Nice, thanks!

I believe you also need to update _load_info and _save_info in builder.py - they're still passing fs=self._fs instead of storage_options=self._fs.storage_options

This should remove the remaining warnings in the CI, such as:

tests/test_builder.py::test_builder_with_filesystem_download_and_prepare_reload
tests/test_load.py::test_load_dataset_local[False]
tests/test_load.py::test_load_dataset_local[True]
tests/test_load.py::test_load_dataset_zip_csv[csv_path-False]
tests/test_load.py::test_load_dataset_then_move_then_reload
  /opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datasets/info.py:344: FutureWarning: 'fs' was deprecated in favor of 'storage_options' in version 2.9.0 and will be removed in 3.0.0.
  You can remove this warning by passing 'storage_options=fs.storage_options' instead.
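
For illustration, the change being suggested would look roughly like this inside DatasetBuilder; this is a sketch under assumptions, and the write_to_directory/from_directory calls and the self._output_dir attribute are assumed context rather than quoted from builder.py:

# Sketch of the suggested builder.py change (assumed surrounding code):

def _save_info(self):
    # was: self.info.write_to_directory(self._output_dir, fs=self._fs)
    self.info.write_to_directory(self._output_dir, storage_options=self._fs.storage_options)

def _load_info(self) -> DatasetInfo:
    # was: return DatasetInfo.from_directory(self._output_dir, fs=self._fs)
    return DatasetInfo.from_directory(self._output_dir, storage_options=self._fs.storage_options)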

@albertvillanova (Member) left a comment

Thanks again for all the work, @dconathan.

I agree with @lhoestq that we should address all remaining fs deprecation warnings in this PR.

For example, there are still some deprecation warnings when calling Dataset.load_from_disk with fs. See:

return Dataset.load_from_disk(dataset_path, fs, keep_in_memory=keep_in_memory)

or DatasetDict.load_from_disk with fs. See:

return DatasetDict.load_from_disk(dataset_path, fs, keep_in_memory=keep_in_memory)

These docstrings should also be updated:

>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', fs=s3) # doctest: +SKIP

>>> dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3) # doctest: +SKIP
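
Updated, those examples would presumably read along these lines (wording illustrative; s3 is the filesystem object already used in the existing docstrings):

>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', storage_options=s3.storage_options) # doctest: +SKIP
>>> dataset.save_to_disk('s3://my-private-datasets/imdb/train', storage_options=s3.storage_options) # doctest: +SKIP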

@albertvillanova changed the title from "Finish Deprecating the fs= arg" to "Finish deprecating the fs argument" on Jan 10, 2023
@dconathan
Contributor Author

re: docstring, I assume passing in storage_options=s3.storage_options is correct/necessary to pass the secrets?
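
A brief sketch of why that works: an fsspec filesystem records the keyword arguments it was constructed with on its storage_options attribute, so forwarding that dict lets datasets rebuild an equivalent, authenticated filesystem internally. The key/secret names below are just the usual s3fs constructor arguments, shown with placeholder values:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem(key="<aws_access_key_id>", secret="<aws_secret_access_key>")
print(s3.storage_options)
# {'key': '<aws_access_key_id>', 'secret': '<aws_secret_access_key>'}
# so storage_options=s3.storage_options forwards exactly these credentials.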

@dconathan
Contributor Author

dconathan commented Jan 14, 2023

What about this one:

def is_remote_filesystem(fs: fsspec.AbstractFileSystem) -> bool:
    """
    Validates if filesystem has remote protocol.

    Args:
        fs (`fsspec.spec.AbstractFileSystem`):
            An abstract super-class for pythonic file-systems, e.g. `fsspec.filesystem('file')` or [`datasets.filesystems.S3FileSystem`].
    """
    if fs is not None and fs.protocol != "file":
        return True
    else:
        return False

Leave it as is? Or is this function no longer necessary?
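
(For reference, a quick illustration of the function's current behavior; the in-memory filesystem is just an arbitrary non-local example:)

import fsspec
from datasets.filesystems import is_remote_filesystem

print(is_remote_filesystem(fsspec.filesystem("file")))    # False: local "file" protocol
print(is_remote_filesystem(fsspec.filesystem("memory")))  # True: any non-"file" protocol counts as remote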

@albertvillanova (Member) left a comment

Thanks again for all your work on this PR, @dconathan.

I think the function is_remote_filesystem should be kept as it is.

We are going to re-run the CI. Once all green, we can merge.

@albertvillanova albertvillanova merged commit 8d20684 into huggingface:main Jan 18, 2023
@github-actions


PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008877 / 0.011353 (-0.002475) 0.004725 / 0.011008 (-0.006283) 0.100738 / 0.038508 (0.062230) 0.030251 / 0.023109 (0.007141) 0.301483 / 0.275898 (0.025585) 0.374161 / 0.323480 (0.050681) 0.007225 / 0.007986 (-0.000761) 0.003654 / 0.004328 (-0.000674) 0.078400 / 0.004250 (0.074149) 0.035786 / 0.037052 (-0.001267) 0.309744 / 0.258489 (0.051255) 0.355834 / 0.293841 (0.061994) 0.034344 / 0.128546 (-0.094202) 0.011584 / 0.075646 (-0.064062) 0.321462 / 0.419271 (-0.097810) 0.041201 / 0.043533 (-0.002332) 0.298808 / 0.255139 (0.043669) 0.332626 / 0.283200 (0.049426) 0.089131 / 0.141683 (-0.052552) 1.477888 / 1.452155 (0.025734) 1.530365 / 1.492716 (0.037649)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.191647 / 0.018006 (0.173640) 0.424339 / 0.000490 (0.423849) 0.002941 / 0.000200 (0.002741) 0.000075 / 0.000054 (0.000020)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.023442 / 0.037411 (-0.013969) 0.097264 / 0.014526 (0.082738) 0.105655 / 0.176557 (-0.070901) 0.145055 / 0.737135 (-0.592081) 0.108750 / 0.296338 (-0.187588)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.422925 / 0.215209 (0.207716) 4.216022 / 2.077655 (2.138367) 1.876441 / 1.504120 (0.372322) 1.665115 / 1.541195 (0.123920) 1.711105 / 1.468490 (0.242615) 0.701820 / 4.584777 (-3.882957) 3.389319 / 3.745712 (-0.356393) 1.909868 / 5.269862 (-3.359994) 1.270482 / 4.565676 (-3.295195) 0.083680 / 0.424275 (-0.340595) 0.012347 / 0.007607 (0.004740) 0.531076 / 0.226044 (0.305031) 5.344045 / 2.268929 (3.075117) 2.310897 / 55.444624 (-53.133728) 1.971953 / 6.876477 (-4.904524) 2.113748 / 2.142072 (-0.028325) 0.823766 / 4.805227 (-3.981462) 0.150864 / 6.500664 (-6.349800) 0.066263 / 0.075469 (-0.009206)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.253190 / 1.841788 (-0.588598) 13.757887 / 8.074308 (5.683579) 13.888195 / 10.191392 (3.696803) 0.137285 / 0.680424 (-0.543139) 0.029151 / 0.534201 (-0.505050) 0.387402 / 0.579283 (-0.191881) 0.401673 / 0.434364 (-0.032691) 0.450474 / 0.540337 (-0.089863) 0.533757 / 1.386936 (-0.853179)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006919 / 0.011353 (-0.004434) 0.004655 / 0.011008 (-0.006353) 0.096946 / 0.038508 (0.058438) 0.028697 / 0.023109 (0.005588) 0.420020 / 0.275898 (0.144122) 0.460193 / 0.323480 (0.136713) 0.005189 / 0.007986 (-0.002796) 0.003425 / 0.004328 (-0.000904) 0.074900 / 0.004250 (0.070649) 0.041844 / 0.037052 (0.004792) 0.421538 / 0.258489 (0.163049) 0.468497 / 0.293841 (0.174656) 0.032573 / 0.128546 (-0.095973) 0.011731 / 0.075646 (-0.063916) 0.320221 / 0.419271 (-0.099050) 0.042113 / 0.043533 (-0.001420) 0.422757 / 0.255139 (0.167618) 0.445372 / 0.283200 (0.162172) 0.090300 / 0.141683 (-0.051383) 1.458598 / 1.452155 (0.006443) 1.550060 / 1.492716 (0.057344)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.235489 / 0.018006 (0.217483) 0.418207 / 0.000490 (0.417718) 0.002511 / 0.000200 (0.002311) 0.000080 / 0.000054 (0.000025)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.025603 / 0.037411 (-0.011808) 0.100237 / 0.014526 (0.085711) 0.108617 / 0.176557 (-0.067939) 0.148417 / 0.737135 (-0.588719) 0.110163 / 0.296338 (-0.186176)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474804 / 0.215209 (0.259595) 4.745370 / 2.077655 (2.667715) 2.417819 / 1.504120 (0.913699) 2.209892 / 1.541195 (0.668697) 2.263296 / 1.468490 (0.794806) 0.695537 / 4.584777 (-3.889240) 3.381028 / 3.745712 (-0.364684) 2.952271 / 5.269862 (-2.317591) 1.507041 / 4.565676 (-3.058636) 0.083334 / 0.424275 (-0.340941) 0.012554 / 0.007607 (0.004947) 0.578861 / 0.226044 (0.352817) 5.795241 / 2.268929 (3.526313) 2.858544 / 55.444624 (-52.586080) 2.516270 / 6.876477 (-4.360207) 2.557350 / 2.142072 (0.415278) 0.801799 / 4.805227 (-4.003428) 0.151579 / 6.500664 (-6.349085) 0.068765 / 0.075469 (-0.006704)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.279935 / 1.841788 (-0.561853) 14.049065 / 8.074308 (5.974757) 13.972703 / 10.191392 (3.781311) 0.140551 / 0.680424 (-0.539873) 0.016831 / 0.534201 (-0.517370) 0.383886 / 0.579283 (-0.195397) 0.385661 / 0.434364 (-0.048703) 0.444525 / 0.540337 (-0.095813) 0.532197 / 1.386936 (-0.854739)
