[docs] Split pattern search order #5693

stevhliu · 2023-03-31T19:51:38Z

This PR addresses #5681 by documenting the order in which 🤗 Datasets searches for split patterns when generating dataset splits.

HuggingFaceDocBuilderDev · 2023-03-31T19:55:37Z

The documentation is not available anymore as the PR was closed or merged.

polinaeterna

thank you! much clearer now. left a few comments, feel free to reword my suggestions :)

docs/source/repository_structure.mdx

github-actions · 2023-04-03T18:43:30Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007841 / 0.011353 (-0.003512)	0.005640 / 0.011008 (-0.005368)	0.096465 / 0.038508 (0.057957)	0.036476 / 0.023109 (0.013367)	0.306431 / 0.275898 (0.030533)	0.339545 / 0.323480 (0.016065)	0.006064 / 0.007986 (-0.001922)	0.004404 / 0.004328 (0.000076)	0.073130 / 0.004250 (0.068879)	0.052765 / 0.037052 (0.015713)	0.309895 / 0.258489 (0.051406)	0.354037 / 0.293841 (0.060196)	0.037127 / 0.128546 (-0.091420)	0.012387 / 0.075646 (-0.063260)	0.333503 / 0.419271 (-0.085769)	0.059799 / 0.043533 (0.016266)	0.305496 / 0.255139 (0.050358)	0.324122 / 0.283200 (0.040922)	0.107007 / 0.141683 (-0.034676)	1.416743 / 1.452155 (-0.035411)	1.520772 / 1.492716 (0.028055)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.261233 / 0.018006 (0.243227)	0.573806 / 0.000490 (0.573316)	0.000390 / 0.000200 (0.000190)	0.000058 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027672 / 0.037411 (-0.009740)	0.112803 / 0.014526 (0.098278)	0.121085 / 0.176557 (-0.055471)	0.176056 / 0.737135 (-0.561080)	0.127171 / 0.296338 (-0.169167)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.414756 / 0.215209 (0.199547)	4.148743 / 2.077655 (2.071088)	1.883940 / 1.504120 (0.379820)	1.698771 / 1.541195 (0.157576)	1.811926 / 1.468490 (0.343436)	0.708293 / 4.584777 (-3.876484)	3.780456 / 3.745712 (0.034744)	2.098556 / 5.269862 (-3.171306)	1.323512 / 4.565676 (-3.242164)	0.086253 / 0.424275 (-0.338022)	0.012587 / 0.007607 (0.004980)	0.514824 / 0.226044 (0.288779)	5.157415 / 2.268929 (2.888487)	2.382519 / 55.444624 (-53.062105)	2.014539 / 6.876477 (-4.861938)	2.215239 / 2.142072 (0.073166)	0.847178 / 4.805227 (-3.958049)	0.170053 / 6.500664 (-6.330611)	0.066461 / 0.075469 (-0.009008)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.199056 / 1.841788 (-0.642732)	15.244999 / 8.074308 (7.170691)	14.661593 / 10.191392 (4.470201)	0.168855 / 0.680424 (-0.511569)	0.017889 / 0.534201 (-0.516312)	0.424961 / 0.579283 (-0.154322)	0.428632 / 0.434364 (-0.005732)	0.502680 / 0.540337 (-0.037658)	0.597827 / 1.386936 (-0.789109)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007749 / 0.011353 (-0.003604)	0.005527 / 0.011008 (-0.005482)	0.074774 / 0.038508 (0.036266)	0.035367 / 0.023109 (0.012258)	0.340594 / 0.275898 (0.064696)	0.373970 / 0.323480 (0.050490)	0.006094 / 0.007986 (-0.001892)	0.004428 / 0.004328 (0.000100)	0.074120 / 0.004250 (0.069869)	0.054852 / 0.037052 (0.017800)	0.357173 / 0.258489 (0.098684)	0.388877 / 0.293841 (0.095036)	0.037002 / 0.128546 (-0.091545)	0.012337 / 0.075646 (-0.063309)	0.086962 / 0.419271 (-0.332310)	0.050370 / 0.043533 (0.006837)	0.342989 / 0.255139 (0.087850)	0.358065 / 0.283200 (0.074865)	0.111063 / 0.141683 (-0.030620)	1.516704 / 1.452155 (0.064549)	1.634359 / 1.492716 (0.141643)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.261493 / 0.018006 (0.243487)	0.566288 / 0.000490 (0.565799)	0.000439 / 0.000200 (0.000239)	0.000056 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030426 / 0.037411 (-0.006985)	0.114606 / 0.014526 (0.100080)	0.126134 / 0.176557 (-0.050423)	0.175324 / 0.737135 (-0.561812)	0.132766 / 0.296338 (-0.163573)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426785 / 0.215209 (0.211576)	4.243555 / 2.077655 (2.165900)	2.089631 / 1.504120 (0.585511)	1.994562 / 1.541195 (0.453367)	2.140284 / 1.468490 (0.671794)	0.698645 / 4.584777 (-3.886132)	3.807471 / 3.745712 (0.061759)	3.275343 / 5.269862 (-1.994519)	1.796756 / 4.565676 (-2.768921)	0.085986 / 0.424275 (-0.338289)	0.012213 / 0.007607 (0.004606)	0.536815 / 0.226044 (0.310771)	5.344611 / 2.268929 (3.075683)	2.498578 / 55.444624 (-52.946047)	2.153260 / 6.876477 (-4.723217)	2.251310 / 2.142072 (0.109237)	0.839104 / 4.805227 (-3.966123)	0.169639 / 6.500664 (-6.331025)	0.065880 / 0.075469 (-0.009589)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.268610 / 1.841788 (-0.573178)	15.624915 / 8.074308 (7.550606)	15.163684 / 10.191392 (4.972292)	0.172992 / 0.680424 (-0.507432)	0.018154 / 0.534201 (-0.516047)	0.440485 / 0.579283 (-0.138798)	0.431949 / 0.434364 (-0.002415)	0.547935 / 0.540337 (0.007597)	0.662442 / 1.386936 (-0.724494)

add split pattern order

a5a7a86

stevhliu requested a review from polinaeterna March 31, 2023 19:51

polinaeterna approved these changes Apr 3, 2023

docs/source/repository_structure.mdx (three review comments, now outdated)

apply feedback

fabd328

stevhliu merged commit 5c8a6ba into huggingface:main Apr 3, 2023
12 checks passed

stevhliu deleted the split-patterns branch April 3, 2023 18:30

stevhliu mentioned this pull request Apr 3, 2023

Add information about patterns search order to the doc about structuring repo #5681

Closed
