Always return list in `list_datasets` #5964

mariosasko · 2023-06-19T13:07:08Z

Plus, deprecate `list_datasets` and `inspect_dataset` in favor of `huggingface_hub.list_datasets` and the "git clone workflow" (which downloads data files), respectively.
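For illustration only, here is a minimal sketch of the "always return a list, and warn about deprecation" pattern the title and description refer to. It is not the code from this PR: `list_dataset_ids`, `fetch_ids`, and `fake_backend` are hypothetical names, and the backend is a local stand-in rather than a real Hub call.

```python
import warnings
from typing import Callable, Iterable, List


def list_dataset_ids(fetch_ids: Callable[[], Iterable[str]]) -> List[str]:
    """Hypothetical wrapper: deprecated, but always returns a concrete list."""
    warnings.warn(
        "list_dataset_ids is deprecated; use the Hub client's listing function instead.",
        FutureWarning,
        stacklevel=2,
    )
    # Materialize whatever the backend yields (list, generator, ...) into a list,
    # so callers can index it or call len() regardless of the backend's return type.
    return list(fetch_ids())


def fake_backend():
    # Stand-in (assumption) for a backend that yields dataset IDs lazily.
    yield "user/dataset_a"
    yield "user/dataset_b"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ids = list_dataset_ids(fake_backend)

print(ids)
print(caught[0].category.__name__)
```

The point of wrapping the result in `list(...)` is that the return type stays stable even if the underlying call switches between returning a list and returning an iterator.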

HuggingFaceDocBuilderDev · 2023-06-19T13:12:33Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

LGTM :)

github-actions · 2023-06-19T17:29:36Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006795 / 0.011353 (-0.004558)	0.004170 / 0.011008 (-0.006838)	0.098698 / 0.038508 (0.060190)	0.045393 / 0.023109 (0.022284)	0.309205 / 0.275898 (0.033307)	0.361333 / 0.323480 (0.037853)	0.006009 / 0.007986 (-0.001977)	0.003334 / 0.004328 (-0.000995)	0.075071 / 0.004250 (0.070821)	0.062587 / 0.037052 (0.025535)	0.322395 / 0.258489 (0.063906)	0.360499 / 0.293841 (0.066659)	0.032243 / 0.128546 (-0.096303)	0.008768 / 0.075646 (-0.066878)	0.329799 / 0.419271 (-0.089472)	0.062261 / 0.043533 (0.018728)	0.298112 / 0.255139 (0.042973)	0.322815 / 0.283200 (0.039615)	0.032348 / 0.141683 (-0.109335)	1.445807 / 1.452155 (-0.006347)	1.528768 / 1.492716 (0.036051)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.195701 / 0.018006 (0.177695)	0.437042 / 0.000490 (0.436552)	0.003867 / 0.000200 (0.003667)	0.000080 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026713 / 0.037411 (-0.010698)	0.109548 / 0.014526 (0.095022)	0.119216 / 0.176557 (-0.057341)	0.178947 / 0.737135 (-0.558188)	0.125224 / 0.296338 (-0.171114)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.400885 / 0.215209 (0.185676)	3.991223 / 2.077655 (1.913568)	1.818449 / 1.504120 (0.314329)	1.609285 / 1.541195 (0.068090)	1.666675 / 1.468490 (0.198184)	0.531486 / 4.584777 (-4.053291)	3.770142 / 3.745712 (0.024430)	3.057189 / 5.269862 (-2.212673)	1.517491 / 4.565676 (-3.048186)	0.065782 / 0.424275 (-0.358493)	0.011251 / 0.007607 (0.003644)	0.504277 / 0.226044 (0.278233)	5.038979 / 2.268929 (2.770050)	2.254717 / 55.444624 (-53.189908)	1.929743 / 6.876477 (-4.946734)	2.080051 / 2.142072 (-0.062022)	0.656831 / 4.805227 (-4.148396)	0.142860 / 6.500664 (-6.357804)	0.063057 / 0.075469 (-0.012412)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.208819 / 1.841788 (-0.632969)	14.456966 / 8.074308 (6.382658)	12.839799 / 10.191392 (2.648407)	0.164361 / 0.680424 (-0.516063)	0.017330 / 0.534201 (-0.516871)	0.397384 / 0.579283 (-0.181899)	0.422704 / 0.434364 (-0.011660)	0.472065 / 0.540337 (-0.068273)	0.576960 / 1.386936 (-0.809976)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006950 / 0.011353 (-0.004403)	0.004012 / 0.011008 (-0.006997)	0.076050 / 0.038508 (0.037542)	0.046646 / 0.023109 (0.023537)	0.353813 / 0.275898 (0.077915)	0.417111 / 0.323480 (0.093631)	0.005422 / 0.007986 (-0.002564)	0.003356 / 0.004328 (-0.000972)	0.076662 / 0.004250 (0.072411)	0.055018 / 0.037052 (0.017966)	0.371561 / 0.258489 (0.113072)	0.410471 / 0.293841 (0.116630)	0.031860 / 0.128546 (-0.096686)	0.008754 / 0.075646 (-0.066893)	0.083192 / 0.419271 (-0.336079)	0.050479 / 0.043533 (0.006946)	0.351725 / 0.255139 (0.096586)	0.371596 / 0.283200 (0.088396)	0.023042 / 0.141683 (-0.118641)	1.480533 / 1.452155 (0.028379)	1.545970 / 1.492716 (0.053254)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220095 / 0.018006 (0.202089)	0.441550 / 0.000490 (0.441061)	0.000375 / 0.000200 (0.000175)	0.000056 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029527 / 0.037411 (-0.007884)	0.111645 / 0.014526 (0.097119)	0.125732 / 0.176557 (-0.050825)	0.177322 / 0.737135 (-0.559813)	0.128620 / 0.296338 (-0.167718)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.432415 / 0.215209 (0.217206)	4.314381 / 2.077655 (2.236726)	2.079450 / 1.504120 (0.575331)	1.893139 / 1.541195 (0.351944)	1.951363 / 1.468490 (0.482873)	0.531466 / 4.584777 (-4.053311)	3.716860 / 3.745712 (-0.028852)	1.850111 / 5.269862 (-3.419750)	1.100676 / 4.565676 (-3.465000)	0.066247 / 0.424275 (-0.358028)	0.011503 / 0.007607 (0.003896)	0.537208 / 0.226044 (0.311164)	5.367560 / 2.268929 (3.098631)	2.543697 / 55.444624 (-52.900927)	2.221670 / 6.876477 (-4.654806)	2.252009 / 2.142072 (0.109937)	0.658509 / 4.805227 (-4.146718)	0.142345 / 6.500664 (-6.358319)	0.064701 / 0.075469 (-0.010768)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.266442 / 1.841788 (-0.575346)	15.105953 / 8.074308 (7.031645)	14.288229 / 10.191392 (4.096837)	0.161182 / 0.680424 (-0.519242)	0.017074 / 0.534201 (-0.517127)	0.399464 / 0.579283 (-0.179819)	0.419459 / 0.434364 (-0.014905)	0.467553 / 0.540337 (-0.072784)	0.566337 / 1.386936 (-0.820599)

Always return list in

0222de9

mariosasko requested a review from lhoestq June 19, 2023 13:07

lhoestq approved these changes Jun 19, 2023

mariosasko merged commit 53ac2d9 into main Jun 19, 2023
12 of 13 checks passed

mariosasko deleted the fix-5925 branch June 19, 2023 17:22
