Remove `HfFileSystem` and deprecate `S3FileSystem` #6052

mariosasko · 2023-07-19T15:00:01Z

Remove the legacy HfFileSystem and deprecate S3FileSystem

cc @philschmid for the SageMaker scripts/notebooks that still use datasets' S3FileSystem

HuggingFaceDocBuilderDev · 2023-07-19T15:08:38Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-19T15:10:38Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006658 / 0.011353 (-0.004695)	0.004347 / 0.011008 (-0.006661)	0.084179 / 0.038508 (0.045671)	0.080842 / 0.023109 (0.057733)	0.321642 / 0.275898 (0.045744)	0.348758 / 0.323480 (0.025278)	0.005624 / 0.007986 (-0.002362)	0.003479 / 0.004328 (-0.000850)	0.065125 / 0.004250 (0.060875)	0.057624 / 0.037052 (0.020572)	0.323643 / 0.258489 (0.065154)	0.360939 / 0.293841 (0.067098)	0.031005 / 0.128546 (-0.097541)	0.008618 / 0.075646 (-0.067028)	0.287443 / 0.419271 (-0.131828)	0.052640 / 0.043533 (0.009107)	0.316947 / 0.255139 (0.061808)	0.330292 / 0.283200 (0.047093)	0.024393 / 0.141683 (-0.117289)	1.476734 / 1.452155 (0.024579)	1.534505 / 1.492716 (0.041789)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.273808 / 0.018006 (0.255802)	0.591146 / 0.000490 (0.590656)	0.000322 / 0.000200 (0.000122)	0.000053 / 0.000054 (-0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029992 / 0.037411 (-0.007419)	0.086654 / 0.014526 (0.072129)	0.098590 / 0.176557 (-0.077967)	0.157225 / 0.737135 (-0.579910)	0.101816 / 0.296338 (-0.194522)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.382578 / 0.215209 (0.167368)	3.803576 / 2.077655 (1.725922)	1.875136 / 1.504120 (0.371016)	1.704207 / 1.541195 (0.163012)	1.765146 / 1.468490 (0.296656)	0.482802 / 4.584777 (-4.101975)	3.571772 / 3.745712 (-0.173940)	3.245626 / 5.269862 (-2.024235)	2.051612 / 4.565676 (-2.514064)	0.056539 / 0.424275 (-0.367736)	0.007199 / 0.007607 (-0.000408)	0.462445 / 0.226044 (0.236401)	4.623800 / 2.268929 (2.354872)	2.318948 / 55.444624 (-53.125677)	1.971442 / 6.876477 (-4.905035)	2.225444 / 2.142072 (0.083371)	0.575205 / 4.805227 (-4.230022)	0.129243 / 6.500664 (-6.371421)	0.059036 / 0.075469 (-0.016433)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.266827 / 1.841788 (-0.574960)	20.323419 / 8.074308 (12.249110)	14.577603 / 10.191392 (4.386210)	0.162131 / 0.680424 (-0.518293)	0.018529 / 0.534201 (-0.515672)	0.395046 / 0.579283 (-0.184237)	0.410870 / 0.434364 (-0.023494)	0.455782 / 0.540337 (-0.084556)	0.662851 / 1.386936 (-0.724085)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006867 / 0.011353 (-0.004486)	0.004197 / 0.011008 (-0.006811)	0.066060 / 0.038508 (0.027552)	0.084145 / 0.023109 (0.061036)	0.366740 / 0.275898 (0.090842)	0.402362 / 0.323480 (0.078882)	0.005785 / 0.007986 (-0.002200)	0.003551 / 0.004328 (-0.000778)	0.066177 / 0.004250 (0.061926)	0.061521 / 0.037052 (0.024468)	0.377807 / 0.258489 (0.119318)	0.413490 / 0.293841 (0.119649)	0.031918 / 0.128546 (-0.096628)	0.008767 / 0.075646 (-0.066879)	0.071437 / 0.419271 (-0.347835)	0.049237 / 0.043533 (0.005704)	0.365929 / 0.255139 (0.110790)	0.393545 / 0.283200 (0.110346)	0.024054 / 0.141683 (-0.117628)	1.524599 / 1.452155 (0.072445)	1.576592 / 1.492716 (0.083876)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.315181 / 0.018006 (0.297174)	0.535501 / 0.000490 (0.535011)	0.000410 / 0.000200 (0.000210)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032915 / 0.037411 (-0.004497)	0.089310 / 0.014526 (0.074784)	0.105136 / 0.176557 (-0.071421)	0.158572 / 0.737135 (-0.578563)	0.106850 / 0.296338 (-0.189489)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.419343 / 0.215209 (0.204134)	4.200166 / 2.077655 (2.122511)	2.180234 / 1.504120 (0.676114)	2.016885 / 1.541195 (0.475690)	2.131480 / 1.468490 (0.662990)	0.484681 / 4.584777 (-4.100096)	3.613535 / 3.745712 (-0.132177)	5.762111 / 5.269862 (0.492249)	3.190590 / 4.565676 (-1.375086)	0.057403 / 0.424275 (-0.366872)	0.007862 / 0.007607 (0.000255)	0.490857 / 0.226044 (0.264813)	4.911241 / 2.268929 (2.642313)	2.650787 / 55.444624 (-52.793838)	2.317060 / 6.876477 (-4.559416)	2.579677 / 2.142072 (0.437605)	0.587388 / 4.805227 (-4.217840)	0.148109 / 6.500664 (-6.352555)	0.061435 / 0.075469 (-0.014034)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.322181 / 1.841788 (-0.519606)	20.647184 / 8.074308 (12.572875)	14.907635 / 10.191392 (4.716243)	0.156330 / 0.680424 (-0.524094)	0.018719 / 0.534201 (-0.515482)	0.397636 / 0.579283 (-0.181647)	0.414107 / 0.434364 (-0.020257)	0.460812 / 0.540337 (-0.079526)	0.609568 / 1.386936 (-0.777368)

philschmid · 2023-07-19T15:50:08Z

This would mean that when I update my examples to a newer datasets version, I need to make a change, right? Nothing backward-breaking?

philschmid · 2023-07-19T15:50:27Z

What change would I need to make?

mariosasko · 2023-07-19T16:12:19Z

@philschmid You just need to replace the occurrences of datasets.filesystems.S3FileSystem with s3fs.S3FileSystem. From the moment it was added until now, datasets.filesystems.S3FileSystem is a "dummy" subclass of s3fs.S3FileSystem that only changes its docstring.
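To illustrate why the swap described above is behavior-preserving, here is a minimal sketch of a docstring-only subclass. `BaseFileSystem` and `DeprecatedFileSystem` are hypothetical stand-ins for `s3fs.S3FileSystem` and `datasets.filesystems.S3FileSystem`, and the `FutureWarning` in the constructor is an assumed deprecation style, not necessarily what this PR implements.

```python
import warnings


class BaseFileSystem:
    """Stand-in for s3fs.S3FileSystem (hypothetical, for illustration only)."""

    def __init__(self, anon=False):
        self.anon = anon

    def ls(self, path):
        # A real filesystem would list remote objects; this stub echoes the path.
        return [f"{path}/file.txt"]


class DeprecatedFileSystem(BaseFileSystem):
    """Subclass with its own docstring and no extra methods or attributes."""

    def __init__(self, *args, **kwargs):
        # Assumed deprecation style; the actual mechanism may differ.
        warnings.warn(
            "DeprecatedFileSystem is deprecated; use BaseFileSystem instead.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)


def same_behavior(path):
    """Return True if both classes hold the same state and list `path` identically."""
    with warnings.catch_warnings():
        # Silence the deprecation warning while constructing the old class.
        warnings.simplefilter("ignore", FutureWarning)
        old = DeprecatedFileSystem(anon=True)
    new = BaseFileSystem(anon=True)
    return old.anon == new.anon and old.ls(path) == new.ls(path)
```

In user code, the migration amounts to changing `from datasets.filesystems import S3FileSystem` to `from s3fs import S3FileSystem`; constructor arguments stay the same because the subclass adds none of its own.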

lhoestq

LGTM :)

lhoestq · 2023-07-19T16:37:38Z

The CI is failing because I updated the YAML validation for #6044.
It will be fixed once #6044 is merged.

lhoestq · 2023-07-19T16:49:52Z

I just merged the other PR, so you should be good now.

github-actions · 2023-07-19T17:10:49Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006303 / 0.011353 (-0.005049)	0.003746 / 0.011008 (-0.007262)	0.081083 / 0.038508 (0.042575)	0.067973 / 0.023109 (0.044864)	0.322221 / 0.275898 (0.046323)	0.359432 / 0.323480 (0.035952)	0.004891 / 0.007986 (-0.003095)	0.002988 / 0.004328 (-0.001341)	0.064068 / 0.004250 (0.059818)	0.052042 / 0.037052 (0.014990)	0.323387 / 0.258489 (0.064898)	0.390416 / 0.293841 (0.096575)	0.028090 / 0.128546 (-0.100457)	0.008009 / 0.075646 (-0.067638)	0.262288 / 0.419271 (-0.156984)	0.044986 / 0.043533 (0.001453)	0.322319 / 0.255139 (0.067180)	0.345323 / 0.283200 (0.062123)	0.021798 / 0.141683 (-0.119885)	1.417259 / 1.452155 (-0.034895)	1.490050 / 1.492716 (-0.002667)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.195902 / 0.018006 (0.177896)	0.490808 / 0.000490 (0.490318)	0.002969 / 0.000200 (0.002770)	0.000126 / 0.000054 (0.000072)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025221 / 0.037411 (-0.012190)	0.075341 / 0.014526 (0.060815)	0.086703 / 0.176557 (-0.089853)	0.146953 / 0.737135 (-0.590182)	0.086610 / 0.296338 (-0.209728)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434890 / 0.215209 (0.219681)	4.352283 / 2.077655 (2.274629)	2.293098 / 1.504120 (0.788979)	2.123023 / 1.541195 (0.581829)	2.179722 / 1.468490 (0.711232)	0.503851 / 4.584777 (-4.080926)	3.087991 / 3.745712 (-0.657721)	2.898689 / 5.269862 (-2.371173)	1.902813 / 4.565676 (-2.662864)	0.058079 / 0.424275 (-0.366196)	0.006600 / 0.007607 (-0.001007)	0.509243 / 0.226044 (0.283199)	5.080204 / 2.268929 (2.811275)	2.753594 / 55.444624 (-52.691030)	2.417385 / 6.876477 (-4.459091)	2.635470 / 2.142072 (0.493398)	0.591059 / 4.805227 (-4.214168)	0.126360 / 6.500664 (-6.374304)	0.062108 / 0.075469 (-0.013361)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.254398 / 1.841788 (-0.587390)	18.866729 / 8.074308 (10.792420)	14.120008 / 10.191392 (3.928616)	0.152388 / 0.680424 (-0.528035)	0.016997 / 0.534201 (-0.517204)	0.336435 / 0.579283 (-0.242848)	0.364612 / 0.434364 (-0.069752)	0.391434 / 0.540337 (-0.148903)	0.567180 / 1.386936 (-0.819756)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006477 / 0.011353 (-0.004876)	0.003723 / 0.011008 (-0.007285)	0.062712 / 0.038508 (0.024204)	0.069380 / 0.023109 (0.046271)	0.393394 / 0.275898 (0.117496)	0.446903 / 0.323480 (0.123423)	0.004833 / 0.007986 (-0.003153)	0.002946 / 0.004328 (-0.001382)	0.062076 / 0.004250 (0.057826)	0.051589 / 0.037052 (0.014537)	0.388536 / 0.258489 (0.130047)	0.451406 / 0.293841 (0.157565)	0.027824 / 0.128546 (-0.100722)	0.008040 / 0.075646 (-0.067606)	0.067085 / 0.419271 (-0.352187)	0.042269 / 0.043533 (-0.001264)	0.363419 / 0.255139 (0.108280)	0.435201 / 0.283200 (0.152001)	0.021275 / 0.141683 (-0.120408)	1.439838 / 1.452155 (-0.012316)	1.477279 / 1.492716 (-0.015437)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229667 / 0.018006 (0.211661)	0.434101 / 0.000490 (0.433611)	0.000652 / 0.000200 (0.000452)	0.000060 / 0.000054 (0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026141 / 0.037411 (-0.011271)	0.078950 / 0.014526 (0.064424)	0.090143 / 0.176557 (-0.086413)	0.143941 / 0.737135 (-0.593195)	0.090465 / 0.296338 (-0.205873)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.432042 / 0.215209 (0.216833)	4.322134 / 2.077655 (2.244479)	2.242897 / 1.504120 (0.738777)	2.076351 / 1.541195 (0.535157)	2.166739 / 1.468490 (0.698249)	0.500833 / 4.584777 (-4.083944)	3.140117 / 3.745712 (-0.605595)	4.383050 / 5.269862 (-0.886812)	2.548245 / 4.565676 (-2.017432)	0.057521 / 0.424275 (-0.366754)	0.006946 / 0.007607 (-0.000662)	0.509613 / 0.226044 (0.283569)	5.114052 / 2.268929 (2.845123)	2.682112 / 55.444624 (-52.762512)	2.362385 / 6.876477 (-4.514092)	2.531787 / 2.142072 (0.389715)	0.595085 / 4.805227 (-4.210142)	0.130198 / 6.500664 (-6.370466)	0.064057 / 0.075469 (-0.011412)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.346254 / 1.841788 (-0.495534)	19.036911 / 8.074308 (10.962603)	14.478689 / 10.191392 (4.287297)	0.147541 / 0.680424 (-0.532883)	0.016851 / 0.534201 (-0.517350)	0.333901 / 0.579283 (-0.245382)	0.380012 / 0.434364 (-0.054352)	0.396021 / 0.540337 (-0.144317)	0.540612 / 1.386936 (-0.846324)

mariosasko · 2023-07-19T17:27:12Z

CI failure is unrelated. Merging.

github-actions · 2023-07-19T17:39:11Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009498 / 0.011353 (-0.001855)	0.005639 / 0.011008 (-0.005369)	0.108731 / 0.038508 (0.070223)	0.094052 / 0.023109 (0.070943)	0.454375 / 0.275898 (0.178477)	0.486852 / 0.323480 (0.163372)	0.006627 / 0.007986 (-0.001359)	0.004712 / 0.004328 (0.000383)	0.082006 / 0.004250 (0.077756)	0.079394 / 0.037052 (0.042342)	0.450982 / 0.258489 (0.192493)	0.502659 / 0.293841 (0.208818)	0.049741 / 0.128546 (-0.078806)	0.014482 / 0.075646 (-0.061164)	0.362661 / 0.419271 (-0.056611)	0.068225 / 0.043533 (0.024692)	0.456219 / 0.255139 (0.201080)	0.483919 / 0.283200 (0.200719)	0.044490 / 0.141683 (-0.097193)	1.809420 / 1.452155 (0.357265)	1.908859 / 1.492716 (0.416143)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267350 / 0.018006 (0.249344)	0.600403 / 0.000490 (0.599913)	0.003665 / 0.000200 (0.003465)	0.000162 / 0.000054 (0.000107)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032499 / 0.037411 (-0.004912)	0.104829 / 0.014526 (0.090303)	0.115809 / 0.176557 (-0.060747)	0.191561 / 0.737135 (-0.545574)	0.113454 / 0.296338 (-0.182885)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.599165 / 0.215209 (0.383956)	5.802947 / 2.077655 (3.725292)	2.477330 / 1.504120 (0.973210)	2.231147 / 1.541195 (0.689952)	2.365688 / 1.468490 (0.897197)	0.853912 / 4.584777 (-3.730865)	5.529472 / 3.745712 (1.783760)	6.145286 / 5.269862 (0.875424)	3.415871 / 4.565676 (-1.149805)	0.099889 / 0.424275 (-0.324386)	0.008933 / 0.007607 (0.001325)	0.704643 / 0.226044 (0.478598)	7.178101 / 2.268929 (4.909173)	3.367120 / 55.444624 (-52.077504)	2.795177 / 6.876477 (-4.081300)	2.796798 / 2.142072 (0.654726)	1.039097 / 4.805227 (-3.766130)	0.232784 / 6.500664 (-6.267881)	0.083608 / 0.075469 (0.008138)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.646827 / 1.841788 (-0.194961)	25.003419 / 8.074308 (16.929111)	22.165746 / 10.191392 (11.974354)	0.246179 / 0.680424 (-0.434245)	0.029304 / 0.534201 (-0.504897)	0.500767 / 0.579283 (-0.078516)	0.606501 / 0.434364 (0.172137)	0.564092 / 0.540337 (0.023755)	0.857568 / 1.386936 (-0.529368)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009206 / 0.011353 (-0.002146)	0.005084 / 0.011008 (-0.005925)	0.081402 / 0.038508 (0.042894)	0.088028 / 0.023109 (0.064919)	0.539509 / 0.275898 (0.263611)	0.590759 / 0.323480 (0.267280)	0.006527 / 0.007986 (-0.001459)	0.004375 / 0.004328 (0.000047)	0.082327 / 0.004250 (0.078076)	0.065442 / 0.037052 (0.028390)	0.548254 / 0.258489 (0.289765)	0.598388 / 0.293841 (0.304547)	0.049409 / 0.128546 (-0.079137)	0.014366 / 0.075646 (-0.061280)	0.094568 / 0.419271 (-0.324703)	0.063685 / 0.043533 (0.020152)	0.545359 / 0.255139 (0.290220)	0.573358 / 0.283200 (0.290159)	0.036864 / 0.141683 (-0.104819)	1.817985 / 1.452155 (0.365830)	1.925188 / 1.492716 (0.432472)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.303205 / 0.018006 (0.285199)	0.620981 / 0.000490 (0.620491)	0.004910 / 0.000200 (0.004710)	0.000104 / 0.000054 (0.000050)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033791 / 0.037411 (-0.003620)	0.114974 / 0.014526 (0.100448)	0.117682 / 0.176557 (-0.058875)	0.177188 / 0.737135 (-0.559947)	0.126425 / 0.296338 (-0.169914)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.636932 / 0.215209 (0.421723)	6.289054 / 2.077655 (4.211399)	2.920772 / 1.504120 (1.416652)	2.672080 / 1.541195 (1.130885)	2.712271 / 1.468490 (1.243781)	0.889305 / 4.584777 (-3.695472)	5.536018 / 3.745712 (1.790306)	4.687564 / 5.269862 (-0.582298)	3.204239 / 4.565676 (-1.361437)	0.095546 / 0.424275 (-0.328729)	0.008838 / 0.007607 (0.001231)	0.714584 / 0.226044 (0.488540)	7.482663 / 2.268929 (5.213735)	3.621392 / 55.444624 (-51.823232)	2.987777 / 6.876477 (-3.888700)	3.312636 / 2.142072 (1.170564)	1.033721 / 4.805227 (-3.771506)	0.206292 / 6.500664 (-6.294372)	0.079423 / 0.075469 (0.003953)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.798645 / 1.841788 (-0.043143)	25.544329 / 8.074308 (17.470021)	23.041318 / 10.191392 (12.849926)	0.259067 / 0.680424 (-0.421357)	0.029839 / 0.534201 (-0.504362)	0.495583 / 0.579283 (-0.083700)	0.598755 / 0.434364 (0.164391)	0.574864 / 0.540337 (0.034527)	0.831160 / 1.386936 (-0.555776)

mariosasko added 2 commits July 19, 2023 16:46

Remove HfFileSystem and deprecate S3FileSystem

894b74d

Update docstring

74398c9

mariosasko requested a review from lhoestq July 19, 2023 16:13

lhoestq approved these changes Jul 19, 2023

Merge branch 'main' of github.com:huggingface/datasets into deprecate…

02dd4cc

…-filesystems

mariosasko merged commit 4200443 into main Jul 19, 2023
12 of 13 checks passed

mariosasko deleted the deprecate-filesystems branch July 19, 2023 17:27
