Use a new low-memory approach for tf dataset index shuffling #5863

Rocketknight1 · 2023-05-15T15:28:34Z

This PR tries out a new approach to generating the index tensor in to_tf_dataset, which should reduce memory usage for very large datasets. I'll need to do some testing before merging it!

Fixes #5855

HuggingFaceDocBuilderDev · 2023-05-15T15:32:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

github-actions · 2023-05-15T15:34:16Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007764 / 0.011353 (-0.003588)	0.005397 / 0.011008 (-0.005611)	0.097995 / 0.038508 (0.059487)	0.036360 / 0.023109 (0.013251)	0.312148 / 0.275898 (0.036250)	0.349427 / 0.323480 (0.025947)	0.006635 / 0.007986 (-0.001350)	0.004373 / 0.004328 (0.000044)	0.074350 / 0.004250 (0.070099)	0.054667 / 0.037052 (0.017614)	0.301621 / 0.258489 (0.043132)	0.364233 / 0.293841 (0.070392)	0.035356 / 0.128546 (-0.093191)	0.012512 / 0.075646 (-0.063134)	0.333399 / 0.419271 (-0.085873)	0.051363 / 0.043533 (0.007830)	0.302372 / 0.255139 (0.047233)	0.326542 / 0.283200 (0.043343)	0.118610 / 0.141683 (-0.023073)	1.438485 / 1.452155 (-0.013669)	1.539131 / 1.492716 (0.046415)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.010920 / 0.018006 (-0.007086)	0.561263 / 0.000490 (0.560773)	0.003972 / 0.000200 (0.003772)	0.000096 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030333 / 0.037411 (-0.007078)	0.113608 / 0.014526 (0.099083)	0.125802 / 0.176557 (-0.050755)	0.183885 / 0.737135 (-0.553250)	0.130242 / 0.296338 (-0.166097)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.404147 / 0.215209 (0.188938)	4.021990 / 2.077655 (1.944335)	1.821450 / 1.504120 (0.317330)	1.619032 / 1.541195 (0.077837)	1.791267 / 1.468490 (0.322777)	0.706683 / 4.584777 (-3.878094)	3.819056 / 3.745712 (0.073344)	3.485714 / 5.269862 (-1.784147)	1.938968 / 4.565676 (-2.626709)	0.086501 / 0.424275 (-0.337774)	0.012300 / 0.007607 (0.004693)	0.503600 / 0.226044 (0.277555)	5.042123 / 2.268929 (2.773195)	2.269712 / 55.444624 (-53.174912)	1.944912 / 6.876477 (-4.931565)	2.155196 / 2.142072 (0.013123)	0.853434 / 4.805227 (-3.951793)	0.175554 / 6.500664 (-6.325110)	0.072005 / 0.075469 (-0.003464)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.203765 / 1.841788 (-0.638022)	15.836634 / 8.074308 (7.762326)	15.707348 / 10.191392 (5.515956)	0.164828 / 0.680424 (-0.515596)	0.018115 / 0.534201 (-0.516086)	0.434591 / 0.579283 (-0.144692)	0.437858 / 0.434364 (0.003495)	0.524672 / 0.540337 (-0.015665)	0.610535 / 1.386936 (-0.776401)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007558 / 0.011353 (-0.003795)	0.005258 / 0.011008 (-0.005750)	0.075263 / 0.038508 (0.036755)	0.033915 / 0.023109 (0.010805)	0.371368 / 0.275898 (0.095470)	0.399239 / 0.323480 (0.075760)	0.006547 / 0.007986 (-0.001439)	0.004675 / 0.004328 (0.000347)	0.074230 / 0.004250 (0.069980)	0.054653 / 0.037052 (0.017601)	0.376655 / 0.258489 (0.118166)	0.438437 / 0.293841 (0.144596)	0.035838 / 0.128546 (-0.092709)	0.012641 / 0.075646 (-0.063005)	0.087279 / 0.419271 (-0.331993)	0.046311 / 0.043533 (0.002778)	0.356649 / 0.255139 (0.101510)	0.377876 / 0.283200 (0.094677)	0.108097 / 0.141683 (-0.033586)	1.478461 / 1.452155 (0.026306)	1.560375 / 1.492716 (0.067658)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.316384 / 0.018006 (0.298378)	0.539382 / 0.000490 (0.538892)	0.002029 / 0.000200 (0.001829)	0.000090 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029950 / 0.037411 (-0.007462)	0.111371 / 0.014526 (0.096846)	0.125254 / 0.176557 (-0.051303)	0.173064 / 0.737135 (-0.564071)	0.130446 / 0.296338 (-0.165893)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.424882 / 0.215209 (0.209673)	4.241575 / 2.077655 (2.163920)	2.096216 / 1.504120 (0.592096)	1.916017 / 1.541195 (0.374823)	2.016318 / 1.468490 (0.547828)	0.701197 / 4.584777 (-3.883580)	3.762365 / 3.745712 (0.016652)	3.307805 / 5.269862 (-1.962057)	1.841752 / 4.565676 (-2.723925)	0.086003 / 0.424275 (-0.338272)	0.012247 / 0.007607 (0.004640)	0.532926 / 0.226044 (0.306882)	5.370509 / 2.268929 (3.101580)	2.587853 / 55.444624 (-52.856772)	2.264541 / 6.876477 (-4.611936)	2.374833 / 2.142072 (0.232760)	0.827751 / 4.805227 (-3.977476)	0.169454 / 6.500664 (-6.331210)	0.066340 / 0.075469 (-0.009129)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.319128 / 1.841788 (-0.522660)	16.702085 / 8.074308 (8.627777)	13.559957 / 10.191392 (3.368565)	0.146659 / 0.680424 (-0.533765)	0.017384 / 0.534201 (-0.516817)	0.421126 / 0.579283 (-0.158157)	0.422067 / 0.434364 (-0.012297)	0.490615 / 0.540337 (-0.049723)	0.587151 / 1.386936 (-0.799785)

github-actions · 2023-05-15T15:40:10Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006604 / 0.011353 (-0.004749)	0.004508 / 0.011008 (-0.006500)	0.098652 / 0.038508 (0.060144)	0.028172 / 0.023109 (0.005063)	0.366997 / 0.275898 (0.091099)	0.403691 / 0.323480 (0.080211)	0.005127 / 0.007986 (-0.002859)	0.003340 / 0.004328 (-0.000989)	0.075408 / 0.004250 (0.071157)	0.038049 / 0.037052 (0.000996)	0.367914 / 0.258489 (0.109425)	0.410958 / 0.293841 (0.117118)	0.030454 / 0.128546 (-0.098093)	0.011422 / 0.075646 (-0.064224)	0.325048 / 0.419271 (-0.094223)	0.042959 / 0.043533 (-0.000574)	0.374536 / 0.255139 (0.119397)	0.394738 / 0.283200 (0.111538)	0.090481 / 0.141683 (-0.051201)	1.504858 / 1.452155 (0.052703)	1.569072 / 1.492716 (0.076356)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.010062 / 0.018006 (-0.007945)	0.408619 / 0.000490 (0.408130)	0.002307 / 0.000200 (0.002107)	0.000070 / 0.000054 (0.000016)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022898 / 0.037411 (-0.014514)	0.096975 / 0.014526 (0.082449)	0.103032 / 0.176557 (-0.073524)	0.164877 / 0.737135 (-0.572259)	0.107324 / 0.296338 (-0.189014)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.446652 / 0.215209 (0.231442)	4.466939 / 2.077655 (2.389285)	2.204590 / 1.504120 (0.700471)	2.004048 / 1.541195 (0.462853)	2.053035 / 1.468490 (0.584545)	0.696617 / 4.584777 (-3.888160)	3.391173 / 3.745712 (-0.354539)	1.863306 / 5.269862 (-3.406556)	1.160637 / 4.565676 (-3.405039)	0.083115 / 0.424275 (-0.341160)	0.012470 / 0.007607 (0.004862)	0.547207 / 0.226044 (0.321163)	5.500667 / 2.268929 (3.231739)	2.656615 / 55.444624 (-52.788009)	2.313281 / 6.876477 (-4.563195)	2.395632 / 2.142072 (0.253559)	0.815361 / 4.805227 (-3.989867)	0.152112 / 6.500664 (-6.348552)	0.067485 / 0.075469 (-0.007984)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.206975 / 1.841788 (-0.634813)	13.684136 / 8.074308 (5.609828)	13.919129 / 10.191392 (3.727737)	0.140767 / 0.680424 (-0.539657)	0.016445 / 0.534201 (-0.517756)	0.379136 / 0.579283 (-0.200147)	0.385395 / 0.434364 (-0.048969)	0.445781 / 0.540337 (-0.094556)	0.522056 / 1.386936 (-0.864880)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006370 / 0.011353 (-0.004983)	0.004514 / 0.011008 (-0.006495)	0.075671 / 0.038508 (0.037163)	0.026723 / 0.023109 (0.003614)	0.359819 / 0.275898 (0.083921)	0.387935 / 0.323480 (0.064456)	0.004888 / 0.007986 (-0.003098)	0.004619 / 0.004328 (0.000290)	0.075546 / 0.004250 (0.071295)	0.039024 / 0.037052 (0.001971)	0.361173 / 0.258489 (0.102684)	0.411425 / 0.293841 (0.117584)	0.030842 / 0.128546 (-0.097705)	0.011555 / 0.075646 (-0.064091)	0.084697 / 0.419271 (-0.334574)	0.039281 / 0.043533 (-0.004252)	0.370082 / 0.255139 (0.114943)	0.382113 / 0.283200 (0.098913)	0.091237 / 0.141683 (-0.050445)	1.534185 / 1.452155 (0.082030)	1.576488 / 1.492716 (0.083772)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.226568 / 0.018006 (0.208562)	0.401566 / 0.000490 (0.401076)	0.002915 / 0.000200 (0.002715)	0.000076 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025357 / 0.037411 (-0.012054)	0.099747 / 0.014526 (0.085221)	0.106443 / 0.176557 (-0.070113)	0.157147 / 0.737135 (-0.579989)	0.110759 / 0.296338 (-0.185580)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.444648 / 0.215209 (0.229439)	4.437930 / 2.077655 (2.360275)	2.154033 / 1.504120 (0.649913)	1.958351 / 1.541195 (0.417157)	1.991031 / 1.468490 (0.522541)	0.691440 / 4.584777 (-3.893337)	3.369087 / 3.745712 (-0.376625)	1.847103 / 5.269862 (-3.422758)	1.152509 / 4.565676 (-3.413168)	0.082519 / 0.424275 (-0.341756)	0.012609 / 0.007607 (0.005001)	0.547267 / 0.226044 (0.321222)	5.501335 / 2.268929 (3.232407)	2.621079 / 55.444624 (-52.823545)	2.281332 / 6.876477 (-4.595145)	2.300427 / 2.142072 (0.158354)	0.803611 / 4.805227 (-4.001616)	0.151784 / 6.500664 (-6.348880)	0.067801 / 0.075469 (-0.007669)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.343201 / 1.841788 (-0.498587)	13.901033 / 8.074308 (5.826725)	13.114738 / 10.191392 (2.923346)	0.149358 / 0.680424 (-0.531066)	0.016596 / 0.534201 (-0.517605)	0.377310 / 0.579283 (-0.201973)	0.387045 / 0.434364 (-0.047319)	0.441272 / 0.540337 (-0.099065)	0.525783 / 1.386936 (-0.861153)

github-actions · 2023-05-15T15:41:35Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008147 / 0.011353 (-0.003205)	0.005531 / 0.011008 (-0.005477)	0.099796 / 0.038508 (0.061288)	0.041574 / 0.023109 (0.018465)	0.315752 / 0.275898 (0.039854)	0.369846 / 0.323480 (0.046366)	0.006489 / 0.007986 (-0.001497)	0.004339 / 0.004328 (0.000010)	0.074769 / 0.004250 (0.070519)	0.051313 / 0.037052 (0.014261)	0.313463 / 0.258489 (0.054974)	0.369918 / 0.293841 (0.076077)	0.035893 / 0.128546 (-0.092653)	0.012487 / 0.075646 (-0.063159)	0.336464 / 0.419271 (-0.082807)	0.052870 / 0.043533 (0.009337)	0.310795 / 0.255139 (0.055656)	0.333146 / 0.283200 (0.049946)	0.112813 / 0.141683 (-0.028870)	1.488192 / 1.452155 (0.036038)	1.563438 / 1.492716 (0.070721)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.015015 / 0.018006 (-0.002991)	0.531783 / 0.000490 (0.531294)	0.005039 / 0.000200 (0.004839)	0.000103 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030205 / 0.037411 (-0.007207)	0.115997 / 0.014526 (0.101471)	0.122958 / 0.176557 (-0.053599)	0.186956 / 0.737135 (-0.550180)	0.130268 / 0.296338 (-0.166071)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.402648 / 0.215209 (0.187439)	3.996121 / 2.077655 (1.918466)	1.811715 / 1.504120 (0.307595)	1.640805 / 1.541195 (0.099610)	1.810478 / 1.468490 (0.341988)	0.699996 / 4.584777 (-3.884781)	3.834020 / 3.745712 (0.088308)	3.688364 / 5.269862 (-1.581498)	1.973828 / 4.565676 (-2.591849)	0.087085 / 0.424275 (-0.337190)	0.012501 / 0.007607 (0.004894)	0.498934 / 0.226044 (0.272889)	4.977608 / 2.268929 (2.708680)	2.258678 / 55.444624 (-53.185947)	1.934251 / 6.876477 (-4.942226)	2.177409 / 2.142072 (0.035337)	0.873470 / 4.805227 (-3.931757)	0.173132 / 6.500664 (-6.327532)	0.069144 / 0.075469 (-0.006325)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.181554 / 1.841788 (-0.660234)	15.694468 / 8.074308 (7.620160)	15.026954 / 10.191392 (4.835562)	0.167092 / 0.680424 (-0.513332)	0.017921 / 0.534201 (-0.516280)	0.425649 / 0.579283 (-0.153634)	0.423225 / 0.434364 (-0.011139)	0.522132 / 0.540337 (-0.018205)	0.612806 / 1.386936 (-0.774130)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007896 / 0.011353 (-0.003457)	0.005581 / 0.011008 (-0.005427)	0.076338 / 0.038508 (0.037830)	0.037064 / 0.023109 (0.013954)	0.399706 / 0.275898 (0.123808)	0.431698 / 0.323480 (0.108218)	0.006846 / 0.007986 (-0.001140)	0.006010 / 0.004328 (0.001682)	0.075771 / 0.004250 (0.071520)	0.058214 / 0.037052 (0.021161)	0.395753 / 0.258489 (0.137264)	0.459925 / 0.293841 (0.166084)	0.036349 / 0.128546 (-0.092197)	0.012720 / 0.075646 (-0.062926)	0.087248 / 0.419271 (-0.332024)	0.049405 / 0.043533 (0.005872)	0.387576 / 0.255139 (0.132437)	0.409861 / 0.283200 (0.126661)	0.111639 / 0.141683 (-0.030043)	1.482840 / 1.452155 (0.030685)	1.574465 / 1.492716 (0.081749)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.320628 / 0.018006 (0.302622)	0.556338 / 0.000490 (0.555848)	0.000445 / 0.000200 (0.000245)	0.000060 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032905 / 0.037411 (-0.004507)	0.121253 / 0.014526 (0.106727)	0.127241 / 0.176557 (-0.049316)	0.178090 / 0.737135 (-0.559045)	0.143285 / 0.296338 (-0.153054)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.437852 / 0.215209 (0.222643)	4.369770 / 2.077655 (2.292115)	2.219932 / 1.504120 (0.715812)	2.032520 / 1.541195 (0.491325)	2.154300 / 1.468490 (0.685810)	0.678942 / 4.584777 (-3.905835)	3.768148 / 3.745712 (0.022436)	2.152738 / 5.269862 (-3.117124)	1.341480 / 4.565676 (-3.224197)	0.084326 / 0.424275 (-0.339949)	0.012288 / 0.007607 (0.004681)	0.547677 / 0.226044 (0.321633)	5.496777 / 2.268929 (3.227848)	2.702267 / 55.444624 (-52.742357)	2.388580 / 6.876477 (-4.487897)	2.471673 / 2.142072 (0.329601)	0.833645 / 4.805227 (-3.971582)	0.167113 / 6.500664 (-6.333551)	0.067658 / 0.075469 (-0.007811)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.282050 / 1.841788 (-0.559737)	16.413677 / 8.074308 (8.339369)	14.080910 / 10.191392 (3.889518)	0.171782 / 0.680424 (-0.508642)	0.018186 / 0.534201 (-0.516015)	0.425244 / 0.579283 (-0.154039)	0.430260 / 0.434364 (-0.004104)	0.500838 / 0.540337 (-0.039499)	0.591900 / 1.386936 (-0.795036)

Rocketknight1 · 2023-05-15T15:42:45Z

The approach we take here is to no longer materialize the entire index array or shuffle buffer. Instead, we do the following:

1. Generate a dataset with tf.data.Dataset.range. This dataset is not materialized; it's basically a range iterator.
2. When we begin iterating over the dataset, generate a random seed. This value stays constant within a single pass over the dataset, and is regenerated whenever we start a new iteration or epoch.
3. Map the range dataset and the random seed through tf.random.experimental.index_shuffle. This converts each index into the value at that position in a permuted array. In other words, tf.random.experimental.index_shuffle(indices, seed, max_index=50_000_000 - 1) behaves like np.random.permutation(50_000_000)[indices] (max_index is inclusive), but without ever materializing the np.random.permutation(50_000_000) array.

Using this approach gives us a complete iteration over the dataset that does not skip any samples, compiles in TF and also never materializes the complete index array, which should avoid the memory usage issues. I'm testing that now!
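
To make the idea concrete, here is a minimal pure-Python sketch of computing shuffled indices on demand. It is a stand-in, not the code in this PR and not TensorFlow's algorithm: it uses a small hash-based Feistel network with cycle-walking, and the names shuffled_index and epoch_indices are hypothetical. It only illustrates how an index can be mapped to its position in a seeded permutation without storing the permutation.

```python
import hashlib


def _round_value(value: int, round_no: int, seed: int, half_bits: int) -> int:
    """Pseudo-random round function: hash (seed, round, value) to half_bits bits."""
    digest = hashlib.blake2b(
        f"{seed}:{round_no}:{value}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)


def shuffled_index(index: int, seed: int, n: int, rounds: int = 4) -> int:
    """Return the image of `index` under a seed-dependent permutation of range(n).

    Hypothetical helper for illustration only. Memory use is O(1): the
    permutation itself is never stored.
    """
    if not 0 <= index < n:
        raise IndexError(index)
    # Smallest even bit width whose domain 2**bits covers range(n).
    bits = max(2, (n - 1).bit_length())
    bits += bits % 2
    half_bits = bits // 2
    mask = (1 << half_bits) - 1

    value = index
    while True:
        # A Feistel network is a bijection on range(2**bits).
        left, right = value >> half_bits, value & mask
        for round_no in range(rounds):
            left, right = right, left ^ _round_value(right, round_no, seed, half_bits)
        value = (left << half_bits) | right
        # Cycle-walking: re-apply until the result lands back inside range(n).
        if value < n:
            return value


def epoch_indices(n: int, seed: int):
    """Lazily yield every index in range(n) exactly once, in shuffled order."""
    for i in range(n):  # plays the role of the un-materialized range dataset
        yield shuffled_index(i, seed, n)
```

Calling epoch_indices(n, seed) with a fresh seed at the start of each epoch mirrors steps 1 to 3 above: the range is iterated lazily, the seed is fixed for the pass, and each index is mapped through the permutation one at a time.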

github-actions · 2023-05-15T16:10:23Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008395 / 0.011353 (-0.002958)	0.005893 / 0.011008 (-0.005115)	0.117081 / 0.038508 (0.078573)	0.040987 / 0.023109 (0.017878)	0.394234 / 0.275898 (0.118336)	0.447036 / 0.323480 (0.123556)	0.006703 / 0.007986 (-0.001283)	0.006085 / 0.004328 (0.001757)	0.086479 / 0.004250 (0.082228)	0.050192 / 0.037052 (0.013140)	0.400958 / 0.258489 (0.142469)	0.455551 / 0.293841 (0.161710)	0.041481 / 0.128546 (-0.087065)	0.014135 / 0.075646 (-0.061511)	0.399929 / 0.419271 (-0.019343)	0.060824 / 0.043533 (0.017291)	0.395946 / 0.255139 (0.140807)	0.428811 / 0.283200 (0.145611)	0.120057 / 0.141683 (-0.021626)	1.703244 / 1.452155 (0.251090)	1.841153 / 1.492716 (0.348436)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.021826 / 0.018006 (0.003820)	0.494279 / 0.000490 (0.493789)	0.011258 / 0.000200 (0.011058)	0.000382 / 0.000054 (0.000328)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031651 / 0.037411 (-0.005760)	0.132871 / 0.014526 (0.118345)	0.137388 / 0.176557 (-0.039169)	0.205808 / 0.737135 (-0.531327)	0.147585 / 0.296338 (-0.148753)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474483 / 0.215209 (0.259274)	4.726568 / 2.077655 (2.648914)	2.136172 / 1.504120 (0.632052)	1.918364 / 1.541195 (0.377169)	2.068794 / 1.468490 (0.600304)	0.836481 / 4.584777 (-3.748296)	4.550583 / 3.745712 (0.804871)	2.456287 / 5.269862 (-2.813574)	1.563127 / 4.565676 (-3.002550)	0.102541 / 0.424275 (-0.321734)	0.014492 / 0.007607 (0.006885)	0.598572 / 0.226044 (0.372528)	5.953321 / 2.268929 (3.684392)	2.695210 / 55.444624 (-52.749414)	2.294317 / 6.876477 (-4.582160)	2.456585 / 2.142072 (0.314513)	1.019907 / 4.805227 (-3.785320)	0.201225 / 6.500664 (-6.299439)	0.077113 / 0.075469 (0.001644)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.497662 / 1.841788 (-0.344126)	18.216941 / 8.074308 (10.142633)	17.016638 / 10.191392 (6.825246)	0.193271 / 0.680424 (-0.487153)	0.020440 / 0.534201 (-0.513761)	0.509361 / 0.579283 (-0.069922)	0.513389 / 0.434364 (0.079025)	0.622266 / 0.540337 (0.081928)	0.741733 / 1.386936 (-0.645203)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008641 / 0.011353 (-0.002712)	0.005792 / 0.011008 (-0.005216)	0.086020 / 0.038508 (0.047512)	0.040005 / 0.023109 (0.016896)	0.435120 / 0.275898 (0.159222)	0.480269 / 0.323480 (0.156789)	0.006669 / 0.007986 (-0.001317)	0.006039 / 0.004328 (0.001711)	0.083468 / 0.004250 (0.079218)	0.057700 / 0.037052 (0.020648)	0.416418 / 0.258489 (0.157929)	0.508286 / 0.293841 (0.214445)	0.041198 / 0.128546 (-0.087349)	0.014346 / 0.075646 (-0.061301)	0.100553 / 0.419271 (-0.318718)	0.054201 / 0.043533 (0.010668)	0.438232 / 0.255139 (0.183093)	0.454707 / 0.283200 (0.171508)	0.118332 / 0.141683 (-0.023351)	1.657607 / 1.452155 (0.205452)	1.825510 / 1.492716 (0.332794)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.236156 / 0.018006 (0.218150)	0.487612 / 0.000490 (0.487123)	0.005747 / 0.000200 (0.005547)	0.000111 / 0.000054 (0.000057)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035127 / 0.037411 (-0.002284)	0.132013 / 0.014526 (0.117487)	0.142316 / 0.176557 (-0.034241)	0.198627 / 0.737135 (-0.538508)	0.145454 / 0.296338 (-0.150885)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.513041 / 0.215209 (0.297832)	5.066197 / 2.077655 (2.988542)	2.508779 / 1.504120 (1.004659)	2.273901 / 1.541195 (0.732706)	2.364958 / 1.468490 (0.896468)	0.811367 / 4.584777 (-3.773410)	4.504744 / 3.745712 (0.759032)	2.499811 / 5.269862 (-2.770050)	1.583349 / 4.565676 (-2.982328)	0.101701 / 0.424275 (-0.322574)	0.014379 / 0.007607 (0.006772)	0.669506 / 0.226044 (0.443462)	6.556702 / 2.268929 (4.287774)	3.123457 / 55.444624 (-52.321167)	2.731997 / 6.876477 (-4.144480)	2.862866 / 2.142072 (0.720794)	0.992956 / 4.805227 (-3.812271)	0.200473 / 6.500664 (-6.300191)	0.078780 / 0.075469 (0.003311)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.540718 / 1.841788 (-0.301070)	18.749344 / 8.074308 (10.675036)	15.648983 / 10.191392 (5.457591)	0.174089 / 0.680424 (-0.506335)	0.020441 / 0.534201 (-0.513760)	0.503742 / 0.579283 (-0.075541)	0.500648 / 0.434364 (0.066284)	0.598558 / 0.540337 (0.058221)	0.712093 / 1.386936 (-0.674843)
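A note on reading these tables: each cell has the form `new / old (diff)`, where the diff is `new - old`, so a positive diff means the new value is larger than the reference one. The sketch below parses one metric/value row pair and adds a `new / old` ratio, which is often easier to scan than the raw difference. The helper name `parse_benchmark_row` is hypothetical and not part of the benchmark tooling; the sample row is copied from the `benchmark_getitem_100B.json` table above.

```python
import re

# One cell looks like "0.236156 / 0.018006 (0.218150)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_benchmark_row(metric_line: str, value_line: str) -> dict:
    """Pair each metric name with its parsed new/old/diff values (hypothetical helper)."""
    names = metric_line.rstrip("\n").split("\t")[1:]   # drop the leading "metric" label
    cells = value_line.rstrip("\n").split("\t")[1:]    # drop the leading "new / old (diff)" label
    if len(names) != len(cells):
        raise ValueError("metric and value rows have different column counts")
    parsed = {}
    for name, cell in zip(names, cells):
        match = CELL.fullmatch(cell.strip())
        if match is None:
            raise ValueError(f"unrecognised cell: {cell!r}")
        new, old, diff = (float(g) for g in match.groups())
        parsed[name] = {"new": new, "old": old, "diff": diff, "ratio": new / old}
    return parsed


metric_line = (
    "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows"
    "\tget_first_row\tget_last_row"
)
value_line = (
    "new / old (diff)\t0.236156 / 0.018006 (0.218150)\t0.487612 / 0.000490 (0.487123)"
    "\t0.005747 / 0.000200 (0.005547)\t0.000111 / 0.000054 (0.000057)"
)

for name, row in parse_benchmark_row(metric_line, value_line).items():
    print(f"{name}: new={row['new']} old={row['old']} ratio={row['ratio']:.1f}x")
```

The reported diffs are rounded to six decimals, so recomputing `new - old` can differ from the printed diff in the last digit.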

github-actions · 2023-05-15T16:15:25Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009940 / 0.011353 (-0.001412)	0.006193 / 0.011008 (-0.004815)	0.125874 / 0.038508 (0.087366)	0.038664 / 0.023109 (0.015555)	0.380013 / 0.275898 (0.104115)	0.430152 / 0.323480 (0.106672)	0.006961 / 0.007986 (-0.001025)	0.004749 / 0.004328 (0.000420)	0.099743 / 0.004250 (0.095492)	0.052349 / 0.037052 (0.015297)	0.433354 / 0.258489 (0.174865)	0.436273 / 0.293841 (0.142433)	0.053929 / 0.128546 (-0.074617)	0.019369 / 0.075646 (-0.056278)	0.421783 / 0.419271 (0.002511)	0.062746 / 0.043533 (0.019213)	0.377225 / 0.255139 (0.122086)	0.413708 / 0.283200 (0.130508)	0.111371 / 0.141683 (-0.030312)	1.819166 / 1.452155 (0.367011)	1.974527 / 1.492716 (0.481810)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090664 / 0.018006 (0.072658)	0.566166 / 0.000490 (0.565676)	0.079305 / 0.000200 (0.079105)	0.000755 / 0.000054 (0.000700)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029720 / 0.037411 (-0.007691)	0.126030 / 0.014526 (0.111504)	0.146020 / 0.176557 (-0.030537)	0.210354 / 0.737135 (-0.526781)	0.149428 / 0.296338 (-0.146910)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.624371 / 0.215209 (0.409162)	6.332839 / 2.077655 (4.255184)	2.547784 / 1.504120 (1.043664)	2.150508 / 1.541195 (0.609313)	2.240816 / 1.468490 (0.772326)	1.271131 / 4.584777 (-3.313646)	5.642726 / 3.745712 (1.897014)	3.212988 / 5.269862 (-2.056874)	2.258123 / 4.565676 (-2.307553)	0.149477 / 0.424275 (-0.274798)	0.014603 / 0.007607 (0.006996)	0.782155 / 0.226044 (0.556111)	7.855191 / 2.268929 (5.586262)	3.308638 / 55.444624 (-52.135986)	2.548142 / 6.876477 (-4.328335)	2.627374 / 2.142072 (0.485301)	1.515170 / 4.805227 (-3.290058)	0.262479 / 6.500664 (-6.238185)	0.082181 / 0.075469 (0.006712)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.573169 / 1.841788 (-0.268618)	18.105719 / 8.074308 (10.031411)	22.015179 / 10.191392 (11.823787)	0.254678 / 0.680424 (-0.425746)	0.027098 / 0.534201 (-0.507103)	0.578045 / 0.579283 (-0.001238)	0.647130 / 0.434364 (0.212766)	0.650522 / 0.540337 (0.110185)	0.797713 / 1.386936 (-0.589223)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010376 / 0.011353 (-0.000977)	0.005990 / 0.011008 (-0.005018)	0.097144 / 0.038508 (0.058635)	0.038205 / 0.023109 (0.015096)	0.468347 / 0.275898 (0.192449)	0.497646 / 0.323480 (0.174166)	0.006916 / 0.007986 (-0.001069)	0.004760 / 0.004328 (0.000431)	0.109838 / 0.004250 (0.105587)	0.048321 / 0.037052 (0.011269)	0.437458 / 0.258489 (0.178969)	0.534864 / 0.293841 (0.241023)	0.053655 / 0.128546 (-0.074892)	0.021915 / 0.075646 (-0.053732)	0.121047 / 0.419271 (-0.298224)	0.059694 / 0.043533 (0.016162)	0.466937 / 0.255139 (0.211798)	0.482030 / 0.283200 (0.198831)	0.117458 / 0.141683 (-0.024225)	1.835551 / 1.452155 (0.383396)	1.965748 / 1.492716 (0.473031)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.234885 / 0.018006 (0.216879)	0.529925 / 0.000490 (0.529436)	0.000484 / 0.000200 (0.000284)	0.000085 / 0.000054 (0.000031)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030959 / 0.037411 (-0.006453)	0.128905 / 0.014526 (0.114379)	0.136913 / 0.176557 (-0.039643)	0.195133 / 0.737135 (-0.542002)	0.147929 / 0.296338 (-0.148410)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.715661 / 0.215209 (0.500451)	6.994125 / 2.077655 (4.916470)	3.033178 / 1.504120 (1.529058)	2.663709 / 1.541195 (1.122515)	2.707558 / 1.468490 (1.239068)	1.316195 / 4.584777 (-3.268582)	5.688264 / 3.745712 (1.942552)	3.260897 / 5.269862 (-2.008964)	2.134985 / 4.565676 (-2.430691)	0.153945 / 0.424275 (-0.270330)	0.014727 / 0.007607 (0.007119)	0.911339 / 0.226044 (0.685294)	8.902640 / 2.268929 (6.633711)	3.806606 / 55.444624 (-51.638018)	3.052238 / 6.876477 (-3.824238)	3.046945 / 2.142072 (0.904873)	1.559837 / 4.805227 (-3.245390)	0.272276 / 6.500664 (-6.228388)	0.087728 / 0.075469 (0.012259)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.712691 / 1.841788 (-0.129097)	18.127575 / 8.074308 (10.053267)	19.734063 / 10.191392 (9.542671)	0.235006 / 0.680424 (-0.445418)	0.027581 / 0.534201 (-0.506620)	0.551080 / 0.579283 (-0.028203)	0.608564 / 0.434364 (0.174200)	0.636578 / 0.540337 (0.096241)	0.732374 / 1.386936 (-0.654562)

Rocketknight1 · 2023-05-15T16:54:53Z

Looks good in testing - this should be ready for review! cc @lhoestq @massquantity

massquantity · 2023-05-16T01:53:37Z

Looks good to me, though I suspect very few people will upgrade to TF >= 2.9 unless their memory is full :)

lhoestq · 2023-05-16T08:29:38Z

Is it more efficient than using NumPy to shuffle, as in the multiprocessing path? Why not use the same strategy?

Rocketknight1 · 2023-05-16T12:34:09Z

Good question, honestly! The NumPy strategy works fine, but requires us to handle multiple processes instead of doing everything in tf.data. We could just scrap this entire code path and always use the multiprocessing NumPy approach, but I think single-threaded throughput would be lower if we did that. If you prefer it for code simplicity, though, I can do that.

In the longer term, I'm hoping that tf.data gets native support for our data structures and we can transition the whole pipeline to pure tf.data, but that still hasn't happened 🫠
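As a rough illustration of the memory trade-off discussed above, the sketch below shuffles indices without ever materializing a full permutation: each index is mapped to its shuffled position on demand by a small keyed Feistel permutation with cycle-walking. This is not the code in this PR and not a description of how TensorFlow or NumPy do it; `shuffled_index` and `_round_fn` are hypothetical names, and the construction is just one general way to get a shuffle whose memory use does not grow with the dataset size.

```python
import hashlib


def _round_fn(value: int, seed: int, round_no: int, half_bits: int) -> int:
    """Keyed pseudo-random function returning `half_bits` bits (hypothetical helper)."""
    data = f"{seed}:{round_no}:{value}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)


def shuffled_index(index: int, seed: int, n: int, rounds: int = 4) -> int:
    """Map `index` in [0, n) to a unique position in [0, n) using O(1) memory."""
    if not 0 <= index < n:
        raise IndexError(f"index {index} out of range for n={n}")
    # Work in a power-of-two domain [0, 2**(2 * half_bits)) that covers [0, n).
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        # A Feistel network is a bijection on the power-of-two domain
        # regardless of what the round function computes.
        for round_no in range(rounds):
            left, right = right, left ^ _round_fn(right, seed, round_no, half_bits)
        x = (left << half_bits) | right
        # Cycle-walking: re-apply the permutation until the value lands in [0, n).
        if x < n:
            return x


n = 10
order = [shuffled_index(i, seed=1234, n=n) for i in range(n)]
print(order)                             # some ordering of 0..9
print(sorted(order) == list(range(n)))   # every index appears exactly once
```

Because the mapping depends only on `(index, seed, n)`, picking a fresh seed per epoch gives a new ordering while still visiting every sample exactly once. The cost is extra computation per index compared with looking up a precomputed array.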

Rocketknight1 · 2023-05-16T12:35:14Z

And @massquantity, TF 2.13 is going to be released in a couple of days, so I hope most users are at least on TF 2.9 by now!

lhoestq · 2023-05-16T13:29:26Z

Unless there is a big gap in performance I think code simplicity would be appreciated ^^

github-actions · 2023-05-16T14:55:02Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008638 / 0.011353 (-0.002715)	0.006013 / 0.011008 (-0.004995)	0.116456 / 0.038508 (0.077948)	0.040419 / 0.023109 (0.017310)	0.418374 / 0.275898 (0.142476)	0.447693 / 0.323480 (0.124213)	0.007002 / 0.007986 (-0.000984)	0.006175 / 0.004328 (0.001847)	0.087801 / 0.004250 (0.083550)	0.051980 / 0.037052 (0.014928)	0.393275 / 0.258489 (0.134786)	0.449601 / 0.293841 (0.155760)	0.041670 / 0.128546 (-0.086876)	0.014396 / 0.075646 (-0.061251)	0.399175 / 0.419271 (-0.020096)	0.060635 / 0.043533 (0.017102)	0.391449 / 0.255139 (0.136310)	0.420713 / 0.283200 (0.137513)	0.121369 / 0.141683 (-0.020314)	1.692630 / 1.452155 (0.240475)	1.815526 / 1.492716 (0.322810)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.244321 / 0.018006 (0.226315)	0.487947 / 0.000490 (0.487458)	0.004563 / 0.000200 (0.004363)	0.000116 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033425 / 0.037411 (-0.003987)	0.134458 / 0.014526 (0.119932)	0.138810 / 0.176557 (-0.037746)	0.208871 / 0.737135 (-0.528264)	0.147964 / 0.296338 (-0.148374)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.483347 / 0.215209 (0.268138)	4.799550 / 2.077655 (2.721895)	2.174149 / 1.504120 (0.670029)	1.943276 / 1.541195 (0.402081)	2.010884 / 1.468490 (0.542394)	0.832030 / 4.584777 (-3.752747)	4.716713 / 3.745712 (0.971001)	4.615810 / 5.269862 (-0.654052)	2.379600 / 4.565676 (-2.186077)	0.103560 / 0.424275 (-0.320715)	0.014683 / 0.007607 (0.007076)	0.598558 / 0.226044 (0.372514)	5.999126 / 2.268929 (3.730197)	2.677819 / 55.444624 (-52.766805)	2.320838 / 6.876477 (-4.555639)	2.503684 / 2.142072 (0.361611)	1.016459 / 4.805227 (-3.788769)	0.201672 / 6.500664 (-6.298992)	0.079310 / 0.075469 (0.003841)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.446374 / 1.841788 (-0.395413)	19.219310 / 8.074308 (11.145002)	17.294665 / 10.191392 (7.103273)	0.246115 / 0.680424 (-0.434309)	0.021406 / 0.534201 (-0.512795)	0.524084 / 0.579283 (-0.055200)	0.511254 / 0.434364 (0.076890)	0.621304 / 0.540337 (0.080966)	0.727088 / 1.386936 (-0.659848)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008907 / 0.011353 (-0.002446)	0.006165 / 0.011008 (-0.004843)	0.090786 / 0.038508 (0.052278)	0.040893 / 0.023109 (0.017784)	0.451252 / 0.275898 (0.175354)	0.477811 / 0.323480 (0.154331)	0.007418 / 0.007986 (-0.000568)	0.005789 / 0.004328 (0.001461)	0.087422 / 0.004250 (0.083171)	0.061800 / 0.037052 (0.024748)	0.459085 / 0.258489 (0.200596)	0.488897 / 0.293841 (0.195056)	0.048157 / 0.128546 (-0.080389)	0.014676 / 0.075646 (-0.060970)	0.104372 / 0.419271 (-0.314900)	0.058066 / 0.043533 (0.014534)	0.446131 / 0.255139 (0.190992)	0.460428 / 0.283200 (0.177228)	0.128492 / 0.141683 (-0.013191)	1.811419 / 1.452155 (0.359265)	1.894781 / 1.492716 (0.402064)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220527 / 0.018006 (0.202520)	0.487663 / 0.000490 (0.487173)	0.003864 / 0.000200 (0.003664)	0.000162 / 0.000054 (0.000107)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036354 / 0.037411 (-0.001057)	0.140469 / 0.014526 (0.125944)	0.149990 / 0.176557 (-0.026566)	0.212369 / 0.737135 (-0.524766)	0.154000 / 0.296338 (-0.142338)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.514172 / 0.215209 (0.298963)	5.129247 / 2.077655 (3.051593)	2.536773 / 1.504120 (1.032653)	2.317253 / 1.541195 (0.776058)	2.424066 / 1.468490 (0.955576)	0.836160 / 4.584777 (-3.748617)	4.906235 / 3.745712 (1.160523)	4.431395 / 5.269862 (-0.838467)	2.332845 / 4.565676 (-2.232831)	0.102867 / 0.424275 (-0.321409)	0.014851 / 0.007607 (0.007244)	0.644104 / 0.226044 (0.418060)	6.415847 / 2.268929 (4.146918)	3.186984 / 55.444624 (-52.257641)	2.774125 / 6.876477 (-4.102352)	2.848045 / 2.142072 (0.705972)	1.018757 / 4.805227 (-3.786470)	0.212333 / 6.500664 (-6.288331)	0.079405 / 0.075469 (0.003936)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.748375 / 1.841788 (-0.093412)	19.733829 / 8.074308 (11.659521)	15.766665 / 10.191392 (5.575273)	0.192087 / 0.680424 (-0.488337)	0.027641 / 0.534201 (-0.506560)	0.504101 / 0.579283 (-0.075182)	0.493815 / 0.434364 (0.059451)	0.583247 / 0.540337 (0.042910)	0.697432 / 1.386936 (-0.689504)

Rocketknight1 · 2023-05-16T15:09:41Z

Hi @lhoestq, I tried moving everything to the NumPy path but ran into issues - the SharedMemory constructs it depends on were only added in Python 3.8. As a result, if we move everything to that path then to_tf_dataset does not work on older Python versions.

For now, how do you feel about reverting and using my original solution, which has fallbacks for all versions of Python and TensorFlow? Once our minimum versions pass Python 3.8 or TF 2.9 we can remove the older code paths.
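The version constraints mentioned in this thread (Python 3.8 for `multiprocessing.shared_memory`, TF 2.9 for the new low-memory shuffle) could be expressed as a small dispatch function along these lines. This is a hypothetical sketch, not the code in this PR: `pick_index_strategy`, `version_tuple`, and the returned strategy labels are all made up for illustration.

```python
import sys
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.13.0' (or '2.13.0rc1') into (2, 13, 0) so versions compare numerically."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def pick_index_strategy(
    tf_version: str,
    python_version: Tuple[int, int] = sys.version_info[:2],
    num_workers: int = 0,
) -> str:
    """Choose an index-shuffling code path (labels are hypothetical)."""
    if num_workers > 0:
        # The multiprocessing path relies on multiprocessing.shared_memory,
        # which only exists on Python 3.8 and later.
        if python_version < (3, 8):
            raise RuntimeError("the multiprocessing path needs Python >= 3.8")
        return "numpy_shared_memory"
    if version_tuple(tf_version) >= (2, 9):
        return "tf_low_memory_shuffle"
    # Older TensorFlow: fall back to the previous single-process behaviour.
    return "tf_legacy_indices"


print(pick_index_strategy("2.13.0"))                 # tf_low_memory_shuffle
print(pick_index_strategy("2.8.4"))                  # tf_legacy_indices
print(pick_index_strategy("2.12.0", num_workers=4))  # numpy_shared_memory on Python >= 3.8
```

Comparing parsed tuples rather than raw strings matters here: as strings, `"2.10.0" < "2.9.0"`, which would send TF 2.10 down the legacy path.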

Rocketknight1 · 2023-05-19T15:26:10Z

Gentle ping on this question @lhoestq!

lhoestq · 2023-05-19T15:55:05Z

Ah yes, indeed. Feel free to revert, and add comments explaining why you needed a different approach for the single-process case.

github-actions · 2023-05-23T15:46:26Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008395 / 0.011353 (-0.002958)	0.005773 / 0.011008 (-0.005235)	0.115702 / 0.038508 (0.077194)	0.039897 / 0.023109 (0.016788)	0.483140 / 0.275898 (0.207242)	0.531288 / 0.323480 (0.207808)	0.006739 / 0.007986 (-0.001246)	0.004419 / 0.004328 (0.000090)	0.086374 / 0.004250 (0.082124)	0.056498 / 0.037052 (0.019446)	0.491589 / 0.258489 (0.233100)	0.556366 / 0.293841 (0.262525)	0.041366 / 0.128546 (-0.087181)	0.014373 / 0.075646 (-0.061274)	0.395504 / 0.419271 (-0.023767)	0.094382 / 0.043533 (0.050849)	0.483000 / 0.255139 (0.227861)	0.522693 / 0.283200 (0.239494)	0.138804 / 0.141683 (-0.002879)	1.719563 / 1.452155 (0.267409)	1.853470 / 1.492716 (0.360753)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.235616 / 0.018006 (0.217610)	0.483267 / 0.000490 (0.482777)	0.008663 / 0.000200 (0.008463)	0.000401 / 0.000054 (0.000347)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033124 / 0.037411 (-0.004287)	0.128821 / 0.014526 (0.114295)	0.138910 / 0.176557 (-0.037647)	0.213570 / 0.737135 (-0.523566)	0.146646 / 0.296338 (-0.149693)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.479998 / 0.215209 (0.264789)	4.772325 / 2.077655 (2.694670)	2.228424 / 1.504120 (0.724304)	2.000915 / 1.541195 (0.459721)	2.105799 / 1.468490 (0.637309)	0.824235 / 4.584777 (-3.760542)	4.511902 / 3.745712 (0.766189)	4.723073 / 5.269862 (-0.546789)	2.333442 / 4.565676 (-2.232235)	0.101161 / 0.424275 (-0.323114)	0.014403 / 0.007607 (0.006796)	0.596395 / 0.226044 (0.370351)	5.961046 / 2.268929 (3.692117)	2.746679 / 55.444624 (-52.697946)	2.352085 / 6.876477 (-4.524392)	2.609812 / 2.142072 (0.467740)	0.996950 / 4.805227 (-3.808277)	0.197923 / 6.500664 (-6.302741)	0.075546 / 0.075469 (0.000077)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.529896 / 1.841788 (-0.311892)	18.183887 / 8.074308 (10.109578)	16.352332 / 10.191392 (6.160940)	0.213504 / 0.680424 (-0.466920)	0.020388 / 0.534201 (-0.513813)	0.497832 / 0.579283 (-0.081451)	0.495477 / 0.434364 (0.061113)	0.585984 / 0.540337 (0.045647)	0.688726 / 1.386936 (-0.698210)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008422 / 0.011353 (-0.002931)	0.005876 / 0.011008 (-0.005132)	0.089310 / 0.038508 (0.050802)	0.039769 / 0.023109 (0.016660)	0.425279 / 0.275898 (0.149381)	0.470818 / 0.323480 (0.147338)	0.006519 / 0.007986 (-0.001467)	0.006276 / 0.004328 (0.001948)	0.085753 / 0.004250 (0.081503)	0.053867 / 0.037052 (0.016815)	0.429193 / 0.258489 (0.170704)	0.480278 / 0.293841 (0.186437)	0.040657 / 0.128546 (-0.087889)	0.014055 / 0.075646 (-0.061591)	0.101422 / 0.419271 (-0.317849)	0.053803 / 0.043533 (0.010271)	0.428348 / 0.255139 (0.173209)	0.452193 / 0.283200 (0.168994)	0.124914 / 0.141683 (-0.016769)	1.750122 / 1.452155 (0.297968)	1.850875 / 1.492716 (0.358159)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.249958 / 0.018006 (0.231952)	0.485183 / 0.000490 (0.484694)	0.000472 / 0.000200 (0.000272)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034563 / 0.037411 (-0.002848)	0.135565 / 0.014526 (0.121039)	0.143271 / 0.176557 (-0.033285)	0.199080 / 0.737135 (-0.538056)	0.149336 / 0.296338 (-0.147003)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.526170 / 0.215209 (0.310961)	5.270960 / 2.077655 (3.193305)	2.664585 / 1.504120 (1.160465)	2.440027 / 1.541195 (0.898832)	2.612764 / 1.468490 (1.144274)	0.828965 / 4.584777 (-3.755812)	4.769983 / 3.745712 (1.024271)	2.441962 / 5.269862 (-2.827900)	1.549032 / 4.565676 (-3.016644)	0.100851 / 0.424275 (-0.323424)	0.014425 / 0.007607 (0.006818)	0.640908 / 0.226044 (0.414864)	6.399041 / 2.268929 (4.130113)	3.242424 / 55.444624 (-52.202200)	2.836317 / 6.876477 (-4.040160)	2.933010 / 2.142072 (0.790938)	1.002277 / 4.805227 (-3.802950)	0.201247 / 6.500664 (-6.299417)	0.078777 / 0.075469 (0.003308)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.620415 / 1.841788 (-0.221373)	19.153631 / 8.074308 (11.079323)	16.744068 / 10.191392 (6.552676)	0.167327 / 0.680424 (-0.513097)	0.020186 / 0.534201 (-0.514015)	0.503683 / 0.579283 (-0.075600)	0.500051 / 0.434364 (0.065687)	0.587188 / 0.540337 (0.046850)	0.699975 / 1.386936 (-0.686961)

Rocketknight1 · 2023-05-23T15:53:18Z

This is probably ready, but likely conflicts with #5883. I'll wait for that PR to be merged and then rebase and merge this one.

github-actions · 2023-05-23T16:09:26Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008387 / 0.011353 (-0.002965)	0.005824 / 0.011008 (-0.005184)	0.117721 / 0.038508 (0.079213)	0.040420 / 0.023109 (0.017311)	0.404961 / 0.275898 (0.129063)	0.426695 / 0.323480 (0.103215)	0.006634 / 0.007986 (-0.001352)	0.006033 / 0.004328 (0.001705)	0.088652 / 0.004250 (0.084402)	0.048075 / 0.037052 (0.011022)	0.400683 / 0.258489 (0.142194)	0.432489 / 0.293841 (0.138648)	0.042065 / 0.128546 (-0.086482)	0.014071 / 0.075646 (-0.061575)	0.399398 / 0.419271 (-0.019873)	0.066034 / 0.043533 (0.022501)	0.400056 / 0.255139 (0.144918)	0.421130 / 0.283200 (0.137930)	0.119721 / 0.141683 (-0.021962)	1.752166 / 1.452155 (0.300011)	1.820161 / 1.492716 (0.327444)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.244264 / 0.018006 (0.226258)	0.480882 / 0.000490 (0.480392)	0.005604 / 0.000200 (0.005404)	0.000175 / 0.000054 (0.000121)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032397 / 0.037411 (-0.005015)	0.131632 / 0.014526 (0.117106)	0.139765 / 0.176557 (-0.036792)	0.213135 / 0.737135 (-0.524000)	0.147891 / 0.296338 (-0.148447)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474534 / 0.215209 (0.259325)	4.730424 / 2.077655 (2.652770)	2.163706 / 1.504120 (0.659586)	1.936051 / 1.541195 (0.394857)	2.012185 / 1.468490 (0.543695)	0.826583 / 4.584777 (-3.758194)	4.921494 / 3.745712 (1.175782)	2.431401 / 5.269862 (-2.838460)	1.566020 / 4.565676 (-2.999656)	0.101255 / 0.424275 (-0.323020)	0.014553 / 0.007607 (0.006946)	0.608301 / 0.226044 (0.382256)	6.089801 / 2.268929 (3.820873)	2.691986 / 55.444624 (-52.752638)	2.296498 / 6.876477 (-4.579979)	2.455388 / 2.142072 (0.313315)	0.984342 / 4.805227 (-3.820885)	0.200447 / 6.500664 (-6.300217)	0.077602 / 0.075469 (0.002133)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.445067 / 1.841788 (-0.396721)	18.588670 / 8.074308 (10.514362)	16.950216 / 10.191392 (6.758824)	0.169688 / 0.680424 (-0.510736)	0.020544 / 0.534201 (-0.513657)	0.508506 / 0.579283 (-0.070777)	0.516218 / 0.434364 (0.081854)	0.646072 / 0.540337 (0.105734)	0.763227 / 1.386936 (-0.623709)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008816 / 0.011353 (-0.002537)	0.006016 / 0.011008 (-0.004992)	0.090946 / 0.038508 (0.052438)	0.040189 / 0.023109 (0.017080)	0.446723 / 0.275898 (0.170825)	0.494633 / 0.323480 (0.171153)	0.007206 / 0.007986 (-0.000779)	0.004508 / 0.004328 (0.000180)	0.088477 / 0.004250 (0.084226)	0.055587 / 0.037052 (0.018535)	0.445349 / 0.258489 (0.186860)	0.504940 / 0.293841 (0.211099)	0.041976 / 0.128546 (-0.086570)	0.014296 / 0.075646 (-0.061351)	0.102835 / 0.419271 (-0.316436)	0.054786 / 0.043533 (0.011253)	0.444789 / 0.255139 (0.189651)	0.472306 / 0.283200 (0.189106)	0.123365 / 0.141683 (-0.018318)	1.725803 / 1.452155 (0.273648)	1.832216 / 1.492716 (0.339500)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252680 / 0.018006 (0.234674)	0.476719 / 0.000490 (0.476229)	0.000461 / 0.000200 (0.000261)	0.000067 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035961 / 0.037411 (-0.001450)	0.135399 / 0.014526 (0.120873)	0.147549 / 0.176557 (-0.029007)	0.207468 / 0.737135 (-0.529667)	0.151591 / 0.296338 (-0.144747)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.528143 / 0.215209 (0.312934)	5.270766 / 2.077655 (3.193111)	2.675644 / 1.504120 (1.171524)	2.472855 / 1.541195 (0.931660)	2.636020 / 1.468490 (1.167530)	0.841325 / 4.584777 (-3.743452)	4.702290 / 3.745712 (0.956578)	2.523537 / 5.269862 (-2.746325)	1.595617 / 4.565676 (-2.970059)	0.102095 / 0.424275 (-0.322180)	0.014568 / 0.007607 (0.006961)	0.652090 / 0.226044 (0.426046)	6.503086 / 2.268929 (4.234158)	3.277025 / 55.444624 (-52.167599)	2.931264 / 6.876477 (-3.945213)	3.021667 / 2.142072 (0.879594)	1.002560 / 4.805227 (-3.802668)	0.202621 / 6.500664 (-6.298043)	0.080583 / 0.075469 (0.005114)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.639281 / 1.841788 (-0.202507)	18.911529 / 8.074308 (10.837220)	17.082795 / 10.191392 (6.891403)	0.179456 / 0.680424 (-0.500968)	0.021740 / 0.534201 (-0.512460)	0.526426 / 0.579283 (-0.052857)	0.535083 / 0.434364 (0.100719)	0.583304 / 0.540337 (0.042967)	0.696733 / 1.386936 (-0.690203)

github-actions · 2023-05-24T15:50:03Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006823 / 0.011353 (-0.004530)	0.004847 / 0.011008 (-0.006161)	0.096038 / 0.038508 (0.057530)	0.033037 / 0.023109 (0.009928)	0.298379 / 0.275898 (0.022481)	0.333319 / 0.323480 (0.009839)	0.005343 / 0.007986 (-0.002643)	0.003863 / 0.004328 (-0.000465)	0.072928 / 0.004250 (0.068678)	0.040898 / 0.037052 (0.003846)	0.303116 / 0.258489 (0.044627)	0.334021 / 0.293841 (0.040181)	0.034780 / 0.128546 (-0.093767)	0.011978 / 0.075646 (-0.063668)	0.331642 / 0.419271 (-0.087629)	0.052729 / 0.043533 (0.009196)	0.298586 / 0.255139 (0.043447)	0.319296 / 0.283200 (0.036097)	0.097711 / 0.141683 (-0.043972)	1.416899 / 1.452155 (-0.035256)	1.546008 / 1.492716 (0.053292)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.234303 / 0.018006 (0.216296)	0.492767 / 0.000490 (0.492278)	0.004935 / 0.000200 (0.004736)	0.000106 / 0.000054 (0.000051)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030617 / 0.037411 (-0.006795)	0.121203 / 0.014526 (0.106677)	0.126677 / 0.176557 (-0.049879)	0.186379 / 0.737135 (-0.550756)	0.129849 / 0.296338 (-0.166490)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.416324 / 0.215209 (0.201115)	4.135563 / 2.077655 (2.057908)	1.976182 / 1.504120 (0.472062)	1.807611 / 1.541195 (0.266416)	1.886282 / 1.468490 (0.417792)	0.713006 / 4.584777 (-3.871771)	3.899205 / 3.745712 (0.153493)	2.283427 / 5.269862 (-2.986435)	1.543088 / 4.565676 (-3.022589)	0.086189 / 0.424275 (-0.338087)	0.012908 / 0.007607 (0.005301)	0.516156 / 0.226044 (0.290112)	5.144199 / 2.268929 (2.875271)	2.460142 / 55.444624 (-52.984482)	2.209054 / 6.876477 (-4.667423)	2.325277 / 2.142072 (0.183204)	0.849890 / 4.805227 (-3.955337)	0.173687 / 6.500664 (-6.326977)	0.070178 / 0.075469 (-0.005291)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.241790 / 1.841788 (-0.599997)	16.047257 / 8.074308 (7.972949)	15.774146 / 10.191392 (5.582754)	0.145871 / 0.680424 (-0.534553)	0.018106 / 0.534201 (-0.516095)	0.433642 / 0.579283 (-0.145641)	0.425311 / 0.434364 (-0.009053)	0.533963 / 0.540337 (-0.006375)	0.638786 / 1.386936 (-0.748151)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007242 / 0.011353 (-0.004111)	0.005599 / 0.011008 (-0.005410)	0.073443 / 0.038508 (0.034935)	0.033764 / 0.023109 (0.010655)	0.365990 / 0.275898 (0.090092)	0.392943 / 0.323480 (0.069463)	0.005987 / 0.007986 (-0.001999)	0.004312 / 0.004328 (-0.000016)	0.072831 / 0.004250 (0.068580)	0.048854 / 0.037052 (0.011802)	0.362477 / 0.258489 (0.103988)	0.399993 / 0.293841 (0.106152)	0.035602 / 0.128546 (-0.092944)	0.012445 / 0.075646 (-0.063202)	0.085768 / 0.419271 (-0.333504)	0.048544 / 0.043533 (0.005011)	0.362246 / 0.255139 (0.107107)	0.388753 / 0.283200 (0.105554)	0.109829 / 0.141683 (-0.031854)	1.546881 / 1.452155 (0.094726)	1.619454 / 1.492716 (0.126737)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.189926 / 0.018006 (0.171920)	0.447936 / 0.000490 (0.447446)	0.002354 / 0.000200 (0.002155)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031740 / 0.037411 (-0.005671)	0.122595 / 0.014526 (0.108069)	0.128389 / 0.176557 (-0.048168)	0.180570 / 0.737135 (-0.556566)	0.132939 / 0.296338 (-0.163399)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425073 / 0.215209 (0.209863)	4.238964 / 2.077655 (2.161309)	2.095116 / 1.504120 (0.590996)	1.913925 / 1.541195 (0.372730)	2.024669 / 1.468490 (0.556179)	0.699172 / 4.584777 (-3.885605)	3.845807 / 3.745712 (0.100094)	2.167502 / 5.269862 (-3.102360)	1.375267 / 4.565676 (-3.190410)	0.086739 / 0.424275 (-0.337536)	0.012198 / 0.007607 (0.004591)	0.525975 / 0.226044 (0.299931)	5.249449 / 2.268929 (2.980521)	2.550565 / 55.444624 (-52.894060)	2.257557 / 6.876477 (-4.618920)	2.298936 / 2.142072 (0.156863)	0.850295 / 4.805227 (-3.954932)	0.170506 / 6.500664 (-6.330158)	0.065659 / 0.075469 (-0.009810)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.330556 / 1.841788 (-0.511231)	16.920203 / 8.074308 (8.845894)	15.966739 / 10.191392 (5.775347)	0.164000 / 0.680424 (-0.516424)	0.018211 / 0.534201 (-0.515990)	0.436253 / 0.579283 (-0.143030)	0.449666 / 0.434364 (0.015302)	0.522287 / 0.540337 (-0.018050)	0.615944 / 1.386936 (-0.770992)

github-actions · 2023-05-24T16:04:43Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007273 / 0.011353 (-0.004080)	0.005198 / 0.011008 (-0.005810)	0.114362 / 0.038508 (0.075854)	0.031113 / 0.023109 (0.008003)	0.378568 / 0.275898 (0.102670)	0.441695 / 0.323480 (0.118215)	0.006037 / 0.007986 (-0.001949)	0.005102 / 0.004328 (0.000774)	0.098682 / 0.004250 (0.094432)	0.042797 / 0.037052 (0.005745)	0.360028 / 0.258489 (0.101539)	0.435757 / 0.293841 (0.141916)	0.041438 / 0.128546 (-0.087109)	0.013728 / 0.075646 (-0.061918)	0.376154 / 0.419271 (-0.043117)	0.075324 / 0.043533 (0.031791)	0.357221 / 0.255139 (0.102082)	0.416378 / 0.283200 (0.133178)	0.110707 / 0.141683 (-0.030975)	1.603215 / 1.452155 (0.151061)	1.736843 / 1.492716 (0.244127)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.249479 / 0.018006 (0.231473)	0.513205 / 0.000490 (0.512715)	0.003856 / 0.000200 (0.003656)	0.000100 / 0.000054 (0.000045)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027750 / 0.037411 (-0.009661)	0.105437 / 0.014526 (0.090911)	0.115903 / 0.176557 (-0.060653)	0.179662 / 0.737135 (-0.557474)	0.116305 / 0.296338 (-0.180033)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.551681 / 0.215209 (0.336472)	5.544590 / 2.077655 (3.466935)	2.193933 / 1.504120 (0.689813)	1.898395 / 1.541195 (0.357201)	1.877288 / 1.468490 (0.408798)	0.858097 / 4.584777 (-3.726680)	4.920982 / 3.745712 (1.175270)	2.478220 / 5.269862 (-2.791641)	1.779608 / 4.565676 (-2.786069)	0.101321 / 0.424275 (-0.322954)	0.012627 / 0.007607 (0.005020)	0.674865 / 0.226044 (0.448820)	6.808224 / 2.268929 (4.539295)	2.822466 / 55.444624 (-52.622159)	2.170379 / 6.876477 (-4.706098)	2.224278 / 2.142072 (0.082205)	1.032763 / 4.805227 (-3.772464)	0.198851 / 6.500664 (-6.301813)	0.069249 / 0.075469 (-0.006220)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.425987 / 1.841788 (-0.415801)	16.212942 / 8.074308 (8.138634)	18.945770 / 10.191392 (8.754378)	0.192901 / 0.680424 (-0.487522)	0.025343 / 0.534201 (-0.508858)	0.465441 / 0.579283 (-0.113842)	0.540966 / 0.434364 (0.106602)	0.576736 / 0.540337 (0.036399)	0.675717 / 1.386936 (-0.711219)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007426 / 0.011353 (-0.003927)	0.005023 / 0.011008 (-0.005985)	0.085083 / 0.038508 (0.046575)	0.030559 / 0.023109 (0.007449)	0.398461 / 0.275898 (0.122563)	0.418998 / 0.323480 (0.095518)	0.006697 / 0.007986 (-0.001288)	0.004665 / 0.004328 (0.000337)	0.087724 / 0.004250 (0.083473)	0.045799 / 0.037052 (0.008747)	0.395165 / 0.258489 (0.136676)	0.430172 / 0.293841 (0.136331)	0.040486 / 0.128546 (-0.088060)	0.014237 / 0.075646 (-0.061409)	0.099429 / 0.419271 (-0.319843)	0.056006 / 0.043533 (0.012473)	0.389046 / 0.255139 (0.133907)	0.419559 / 0.283200 (0.136359)	0.108550 / 0.141683 (-0.033132)	1.614052 / 1.452155 (0.161897)	1.677785 / 1.492716 (0.185069)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.202178 / 0.018006 (0.184172)	0.486365 / 0.000490 (0.485875)	0.003844 / 0.000200 (0.003644)	0.000112 / 0.000054 (0.000058)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027963 / 0.037411 (-0.009449)	0.110399 / 0.014526 (0.095873)	0.122266 / 0.176557 (-0.054291)	0.178551 / 0.737135 (-0.558585)	0.129259 / 0.296338 (-0.167080)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.604178 / 0.215209 (0.388969)	6.135943 / 2.077655 (4.058288)	2.547576 / 1.504120 (1.043456)	2.262470 / 1.541195 (0.721276)	2.275402 / 1.468490 (0.806912)	0.878804 / 4.584777 (-3.705972)	5.152200 / 3.745712 (1.406488)	2.553715 / 5.269862 (-2.716147)	1.580959 / 4.565676 (-2.984717)	0.107895 / 0.424275 (-0.316380)	0.012751 / 0.007607 (0.005143)	0.770678 / 0.226044 (0.544633)	7.744303 / 2.268929 (5.475374)	3.342037 / 55.444624 (-52.102588)	2.756848 / 6.876477 (-4.119629)	2.739357 / 2.142072 (0.597285)	1.086330 / 4.805227 (-3.718897)	0.230983 / 6.500664 (-6.269681)	0.073771 / 0.075469 (-0.001698)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.493441 / 1.841788 (-0.348347)	16.621611 / 8.074308 (8.547303)	19.081000 / 10.191392 (8.889608)	0.215623 / 0.680424 (-0.464801)	0.025660 / 0.534201 (-0.508541)	0.446490 / 0.579283 (-0.132793)	0.560078 / 0.434364 (0.125714)	0.527231 / 0.540337 (-0.013106)	0.636551 / 1.386936 (-0.750385)

…ng approach" This reverts commit 95c177e.

dataset.shuffle(dataset.cardinality()), so use that instead of dataset.shuffle(len(dataset))
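
For context on why the shuffle buffer size matters here: a shuffle buffer that is at least as large as the dataset gives a full uniform shuffle, while a smaller buffer only mixes elements locally. The sketch below is a rough pure-Python model of a fixed-size shuffle buffer, written for illustration only. The function `buffered_shuffle` and its parameters are hypothetical names, and this is not TensorFlow's implementation or the code in this PR.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer might emit them.

    Hypothetical illustration only: a simplified model, not tf.data's code.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it with the new one.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    data = range(10)
    # buffer_size == len(data): every element is in the buffer before any
    # output is produced, so the result is a shuffle of the whole dataset.
    print(list(buffered_shuffle(data, buffer_size=10)))
    # buffer_size == 1: the only candidate is the oldest element,
    # so the input order is preserved.
    print(list(buffered_shuffle(data, buffer_size=1)))
```

In this model, an element can only be emitted once it has entered the buffer, so the output at position `p` always comes from the first `p + buffer_size` inputs. That is why a buffer covering the whole dataset is needed for a complete shuffle.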

github-actions · 2023-06-07T14:52:53Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008266 / 0.011353 (-0.003087)	0.005082 / 0.011008 (-0.005927)	0.119858 / 0.038508 (0.081350)	0.032907 / 0.023109 (0.009798)	0.362816 / 0.275898 (0.086918)	0.403684 / 0.323480 (0.080204)	0.006296 / 0.007986 (-0.001690)	0.006220 / 0.004328 (0.001891)	0.095609 / 0.004250 (0.091359)	0.048734 / 0.037052 (0.011682)	0.385724 / 0.258489 (0.127235)	0.424315 / 0.293841 (0.130475)	0.042344 / 0.128546 (-0.086202)	0.016147 / 0.075646 (-0.059500)	0.409661 / 0.419271 (-0.009610)	0.057900 / 0.043533 (0.014367)	0.387013 / 0.255139 (0.131874)	0.388901 / 0.283200 (0.105702)	0.103920 / 0.141683 (-0.037762)	1.732730 / 1.452155 (0.280575)	1.863912 / 1.492716 (0.371196)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.237406 / 0.018006 (0.219400)	0.514398 / 0.000490 (0.513909)	0.005941 / 0.000200 (0.005741)	0.000109 / 0.000054 (0.000054)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027524 / 0.037411 (-0.009888)	0.116498 / 0.014526 (0.101972)	0.129034 / 0.176557 (-0.047522)	0.218272 / 0.737135 (-0.518864)	0.148389 / 0.296338 (-0.147950)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.604555 / 0.215209 (0.389346)	5.921576 / 2.077655 (3.843921)	2.410483 / 1.504120 (0.906363)	2.220286 / 1.541195 (0.679092)	2.138880 / 1.468490 (0.670390)	0.934962 / 4.584777 (-3.649815)	5.808855 / 3.745712 (2.063143)	4.881554 / 5.269862 (-0.388308)	2.536408 / 4.565676 (-2.029268)	0.124260 / 0.424275 (-0.300015)	0.017798 / 0.007607 (0.010190)	0.778991 / 0.226044 (0.552947)	7.899262 / 2.268929 (5.630333)	3.208667 / 55.444624 (-52.235957)	2.631182 / 6.876477 (-4.245295)	2.676199 / 2.142072 (0.534127)	1.165516 / 4.805227 (-3.639711)	0.228751 / 6.500664 (-6.271913)	0.081378 / 0.075469 (0.005909)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.522156 / 1.841788 (-0.319632)	17.975381 / 8.074308 (9.901073)	18.918882 / 10.191392 (8.727490)	0.223984 / 0.680424 (-0.456440)	0.025171 / 0.534201 (-0.509030)	0.467894 / 0.579283 (-0.111389)	0.559501 / 0.434364 (0.125137)	0.550392 / 0.540337 (0.010055)	0.696923 / 1.386936 (-0.690013)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008577 / 0.011353 (-0.002775)	0.006735 / 0.011008 (-0.004273)	0.095108 / 0.038508 (0.056600)	0.035059 / 0.023109 (0.011950)	0.448576 / 0.275898 (0.172677)	0.492049 / 0.323480 (0.168569)	0.006600 / 0.007986 (-0.001385)	0.004760 / 0.004328 (0.000431)	0.094670 / 0.004250 (0.090419)	0.052543 / 0.037052 (0.015491)	0.458927 / 0.258489 (0.200438)	0.511522 / 0.293841 (0.217681)	0.046046 / 0.128546 (-0.082500)	0.015227 / 0.075646 (-0.060419)	0.114585 / 0.419271 (-0.304686)	0.057569 / 0.043533 (0.014036)	0.441989 / 0.255139 (0.186850)	0.487001 / 0.283200 (0.203801)	0.115688 / 0.141683 (-0.025995)	1.777366 / 1.452155 (0.325211)	1.906216 / 1.492716 (0.413499)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224880 / 0.018006 (0.206874)	0.504153 / 0.000490 (0.503664)	0.001143 / 0.000200 (0.000943)	0.000111 / 0.000054 (0.000056)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033618 / 0.037411 (-0.003793)	0.127396 / 0.014526 (0.112870)	0.135648 / 0.176557 (-0.040909)	0.193140 / 0.737135 (-0.543995)	0.142129 / 0.296338 (-0.154209)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.692845 / 0.215209 (0.477636)	6.804897 / 2.077655 (4.727242)	2.851041 / 1.504120 (1.346921)	2.480698 / 1.541195 (0.939504)	2.488619 / 1.468490 (1.020129)	0.970439 / 4.584777 (-3.614338)	5.466059 / 3.745712 (1.720347)	2.790261 / 5.269862 (-2.479601)	1.727638 / 4.565676 (-2.838039)	0.116345 / 0.424275 (-0.307930)	0.014348 / 0.007607 (0.006740)	0.845510 / 0.226044 (0.619465)	8.397198 / 2.268929 (6.128270)	3.591998 / 55.444624 (-51.852626)	2.858339 / 6.876477 (-4.018137)	2.905075 / 2.142072 (0.763003)	1.193569 / 4.805227 (-3.611658)	0.243091 / 6.500664 (-6.257573)	0.082198 / 0.075469 (0.006729)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.610327 / 1.841788 (-0.231461)	17.191414 / 8.074308 (9.117106)	20.176518 / 10.191392 (9.985126)	0.246574 / 0.680424 (-0.433850)	0.024343 / 0.534201 (-0.509858)	0.482091 / 0.579283 (-0.097192)	0.585241 / 0.434364 (0.150877)	0.558833 / 0.540337 (0.018496)	0.654811 / 1.386936 (-0.732125)

github-actions · 2023-06-07T15:05:33Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006353 / 0.011353 (-0.004999)	0.004393 / 0.011008 (-0.006616)	0.098751 / 0.038508 (0.060242)	0.029090 / 0.023109 (0.005981)	0.304169 / 0.275898 (0.028271)	0.339879 / 0.323480 (0.016399)	0.005577 / 0.007986 (-0.002408)	0.003516 / 0.004328 (-0.000813)	0.077347 / 0.004250 (0.073097)	0.041935 / 0.037052 (0.004882)	0.305865 / 0.258489 (0.047376)	0.357063 / 0.293841 (0.063222)	0.025245 / 0.128546 (-0.103301)	0.008753 / 0.075646 (-0.066893)	0.316734 / 0.419271 (-0.102538)	0.043464 / 0.043533 (-0.000069)	0.300944 / 0.255139 (0.045805)	0.330091 / 0.283200 (0.046891)	0.088593 / 0.141683 (-0.053090)	1.588958 / 1.452155 (0.136803)	1.641376 / 1.492716 (0.148660)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220290 / 0.018006 (0.202284)	0.445430 / 0.000490 (0.444940)	0.004800 / 0.000200 (0.004600)	0.000075 / 0.000054 (0.000020)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023828 / 0.037411 (-0.013583)	0.103446 / 0.014526 (0.088920)	0.110668 / 0.176557 (-0.065889)	0.169604 / 0.737135 (-0.567531)	0.114818 / 0.296338 (-0.181520)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.416951 / 0.215209 (0.201742)	4.138917 / 2.077655 (2.061263)	1.891265 / 1.504120 (0.387145)	1.687068 / 1.541195 (0.145873)	1.726618 / 1.468490 (0.258128)	0.546977 / 4.584777 (-4.037800)	3.536153 / 3.745712 (-0.209560)	1.795206 / 5.269862 (-3.474656)	1.019845 / 4.565676 (-3.545831)	0.067040 / 0.424275 (-0.357235)	0.012038 / 0.007607 (0.004431)	0.520583 / 0.226044 (0.294539)	5.211520 / 2.268929 (2.942591)	2.336136 / 55.444624 (-53.108488)	2.011262 / 6.876477 (-4.865215)	2.137311 / 2.142072 (-0.004762)	0.654779 / 4.805227 (-4.150448)	0.134555 / 6.500664 (-6.366109)	0.066427 / 0.075469 (-0.009042)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.240187 / 1.841788 (-0.601600)	14.104063 / 8.074308 (6.029755)	13.369572 / 10.191392 (3.178180)	0.147891 / 0.680424 (-0.532533)	0.016993 / 0.534201 (-0.517208)	0.364863 / 0.579283 (-0.214420)	0.398684 / 0.434364 (-0.035680)	0.430524 / 0.540337 (-0.109813)	0.520920 / 1.386936 (-0.866016)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006845 / 0.011353 (-0.004508)	0.004420 / 0.011008 (-0.006588)	0.078334 / 0.038508 (0.039825)	0.030566 / 0.023109 (0.007457)	0.409568 / 0.275898 (0.133670)	0.458389 / 0.323480 (0.134910)	0.005739 / 0.007986 (-0.002247)	0.005222 / 0.004328 (0.000893)	0.076066 / 0.004250 (0.071816)	0.049239 / 0.037052 (0.012187)	0.409841 / 0.258489 (0.151352)	0.472250 / 0.293841 (0.178409)	0.025463 / 0.128546 (-0.103084)	0.008738 / 0.075646 (-0.066909)	0.083114 / 0.419271 (-0.336157)	0.041233 / 0.043533 (-0.002300)	0.407158 / 0.255139 (0.152019)	0.438724 / 0.283200 (0.155524)	0.097974 / 0.141683 (-0.043709)	1.536514 / 1.452155 (0.084360)	1.636704 / 1.492716 (0.143987)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.240589 / 0.018006 (0.222583)	0.440328 / 0.000490 (0.439838)	0.000937 / 0.000200 (0.000737)	0.000076 / 0.000054 (0.000021)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027559 / 0.037411 (-0.009853)	0.109930 / 0.014526 (0.095405)	0.113366 / 0.176557 (-0.063190)	0.166849 / 0.737135 (-0.570286)	0.118872 / 0.296338 (-0.177467)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474120 / 0.215209 (0.258911)	4.739222 / 2.077655 (2.661567)	2.484386 / 1.504120 (0.980266)	2.281937 / 1.541195 (0.740742)	2.362974 / 1.468490 (0.894484)	0.549897 / 4.584777 (-4.034879)	3.425540 / 3.745712 (-0.320172)	1.765810 / 5.269862 (-3.504051)	1.008277 / 4.565676 (-3.557400)	0.067288 / 0.424275 (-0.356987)	0.011954 / 0.007607 (0.004347)	0.577216 / 0.226044 (0.351172)	5.790659 / 2.268929 (3.521731)	2.946732 / 55.444624 (-52.497892)	2.608835 / 6.876477 (-4.267641)	2.642987 / 2.142072 (0.500915)	0.652798 / 4.805227 (-4.152429)	0.135909 / 6.500664 (-6.364755)	0.068480 / 0.075469 (-0.006989)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.353550 / 1.841788 (-0.488237)	14.732084 / 8.074308 (6.657775)	14.439174 / 10.191392 (4.247782)	0.131445 / 0.680424 (-0.548979)	0.016608 / 0.534201 (-0.517593)	0.368103 / 0.579283 (-0.211180)	0.393918 / 0.434364 (-0.040446)	0.423562 / 0.540337 (-0.116776)	0.515041 / 1.386936 (-0.871895)

github-actions · 2023-06-07T15:18:26Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006414 / 0.011353 (-0.004938)	0.004704 / 0.011008 (-0.006305)	0.096012 / 0.038508 (0.057504)	0.032910 / 0.023109 (0.009800)	0.290676 / 0.275898 (0.014778)	0.319646 / 0.323480 (-0.003834)	0.005806 / 0.007986 (-0.002180)	0.004008 / 0.004328 (-0.000320)	0.073982 / 0.004250 (0.069731)	0.048985 / 0.037052 (0.011933)	0.299498 / 0.258489 (0.041009)	0.338118 / 0.293841 (0.044277)	0.027680 / 0.128546 (-0.100866)	0.009051 / 0.075646 (-0.066595)	0.325051 / 0.419271 (-0.094221)	0.051011 / 0.043533 (0.007478)	0.292249 / 0.255139 (0.037110)	0.315733 / 0.283200 (0.032533)	0.100327 / 0.141683 (-0.041356)	1.481862 / 1.452155 (0.029707)	1.544884 / 1.492716 (0.052168)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.289610 / 0.018006 (0.271603)	0.510164 / 0.000490 (0.509675)	0.004726 / 0.000200 (0.004526)	0.000090 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027617 / 0.037411 (-0.009794)	0.107593 / 0.014526 (0.093068)	0.122783 / 0.176557 (-0.053774)	0.181086 / 0.737135 (-0.556049)	0.128030 / 0.296338 (-0.168308)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.403571 / 0.215209 (0.188362)	4.002881 / 2.077655 (1.925227)	1.805550 / 1.504120 (0.301430)	1.619165 / 1.541195 (0.077971)	1.606536 / 1.468490 (0.138046)	0.518917 / 4.584777 (-4.065860)	3.731498 / 3.745712 (-0.014214)	3.206645 / 5.269862 (-2.063217)	1.641615 / 4.565676 (-2.924062)	0.065100 / 0.424275 (-0.359175)	0.011396 / 0.007607 (0.003789)	0.500597 / 0.226044 (0.274553)	4.992293 / 2.268929 (2.723364)	2.278726 / 55.444624 (-53.165898)	1.960823 / 6.876477 (-4.915654)	2.038684 / 2.142072 (-0.103388)	0.640910 / 4.805227 (-4.164318)	0.140597 / 6.500664 (-6.360067)	0.062114 / 0.075469 (-0.013355)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.167366 / 1.841788 (-0.674422)	14.748193 / 8.074308 (6.673884)	13.592381 / 10.191392 (3.400989)	0.165341 / 0.680424 (-0.515083)	0.017360 / 0.534201 (-0.516841)	0.393448 / 0.579283 (-0.185836)	0.422951 / 0.434364 (-0.011413)	0.460491 / 0.540337 (-0.079847)	0.558238 / 1.386936 (-0.828698)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006373 / 0.011353 (-0.004980)	0.004587 / 0.011008 (-0.006421)	0.076421 / 0.038508 (0.037913)	0.032162 / 0.023109 (0.009052)	0.385531 / 0.275898 (0.109633)	0.410424 / 0.323480 (0.086944)	0.006154 / 0.007986 (-0.001832)	0.005533 / 0.004328 (0.001205)	0.077035 / 0.004250 (0.072784)	0.051571 / 0.037052 (0.014519)	0.393283 / 0.258489 (0.134794)	0.433756 / 0.293841 (0.139915)	0.028381 / 0.128546 (-0.100165)	0.009034 / 0.075646 (-0.066613)	0.083836 / 0.419271 (-0.335435)	0.048246 / 0.043533 (0.004713)	0.385437 / 0.255139 (0.130298)	0.394187 / 0.283200 (0.110987)	0.105453 / 0.141683 (-0.036230)	1.459173 / 1.452155 (0.007018)	1.575083 / 1.492716 (0.082367)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.320324 / 0.018006 (0.302318)	0.502945 / 0.000490 (0.502455)	0.004470 / 0.000200 (0.004270)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028118 / 0.037411 (-0.009293)	0.111430 / 0.014526 (0.096904)	0.123141 / 0.176557 (-0.053415)	0.175215 / 0.737135 (-0.561920)	0.126429 / 0.296338 (-0.169909)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.433407 / 0.215209 (0.218198)	4.329945 / 2.077655 (2.252291)	2.096822 / 1.504120 (0.592702)	1.908173 / 1.541195 (0.366978)	1.967167 / 1.468490 (0.498676)	0.529207 / 4.584777 (-4.055570)	3.798424 / 3.745712 (0.052712)	3.050716 / 5.269862 (-2.219146)	1.445009 / 4.565676 (-3.120668)	0.066467 / 0.424275 (-0.357809)	0.011698 / 0.007607 (0.004090)	0.528660 / 0.226044 (0.302615)	5.282069 / 2.268929 (3.013141)	2.535501 / 55.444624 (-52.909124)	2.202856 / 6.876477 (-4.673621)	2.293225 / 2.142072 (0.151153)	0.640216 / 4.805227 (-4.165011)	0.140884 / 6.500664 (-6.359780)	0.064231 / 0.075469 (-0.011238)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.292129 / 1.841788 (-0.549659)	15.371370 / 8.074308 (7.297062)	15.114854 / 10.191392 (4.923462)	0.176870 / 0.680424 (-0.503554)	0.017380 / 0.534201 (-0.516821)	0.398156 / 0.579283 (-0.181127)	0.442277 / 0.434364 (0.007913)	0.467093 / 0.540337 (-0.073244)	0.561599 / 1.386936 (-0.825337)

github-actions · 2023-06-07T15:18:38Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009360 / 0.011353 (-0.001993)	0.006297 / 0.011008 (-0.004712)	0.133131 / 0.038508 (0.094623)	0.040261 / 0.023109 (0.017152)	0.419101 / 0.275898 (0.143203)	0.453087 / 0.323480 (0.129607)	0.007718 / 0.007986 (-0.000268)	0.005698 / 0.004328 (0.001369)	0.102261 / 0.004250 (0.098010)	0.055147 / 0.037052 (0.018095)	0.428355 / 0.258489 (0.169866)	0.505241 / 0.293841 (0.211400)	0.046745 / 0.128546 (-0.081802)	0.015559 / 0.075646 (-0.060088)	0.441775 / 0.419271 (0.022503)	0.070165 / 0.043533 (0.026632)	0.421957 / 0.255139 (0.166818)	0.445156 / 0.283200 (0.161957)	0.126321 / 0.141683 (-0.015362)	1.900486 / 1.452155 (0.448331)	2.088630 / 1.492716 (0.595913)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.260244 / 0.018006 (0.242237)	0.606317 / 0.000490 (0.605828)	0.006827 / 0.000200 (0.006627)	0.000117 / 0.000054 (0.000063)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031958 / 0.037411 (-0.005453)	0.139362 / 0.014526 (0.124836)	0.148748 / 0.176557 (-0.027809)	0.226269 / 0.737135 (-0.510866)	0.161145 / 0.296338 (-0.135194)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.666287 / 0.215209 (0.451078)	6.588707 / 2.077655 (4.511053)	2.736155 / 1.504120 (1.232035)	2.329601 / 1.541195 (0.788406)	2.324991 / 1.468490 (0.856501)	0.943608 / 4.584777 (-3.641169)	6.051653 / 3.745712 (2.305941)	2.929150 / 5.269862 (-2.340711)	1.804461 / 4.565676 (-2.761216)	0.113302 / 0.424275 (-0.310973)	0.015245 / 0.007607 (0.007638)	0.827029 / 0.226044 (0.600984)	8.211536 / 2.268929 (5.942608)	3.445231 / 55.444624 (-51.999393)	2.756728 / 6.876477 (-4.119748)	2.904039 / 2.142072 (0.761966)	1.162339 / 4.805227 (-3.642888)	0.231168 / 6.500664 (-6.269496)	0.089038 / 0.075469 (0.013569)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.640619 / 1.841788 (-0.201169)	20.034157 / 8.074308 (11.959849)	22.346006 / 10.191392 (12.154614)	0.255300 / 0.680424 (-0.425124)	0.031452 / 0.534201 (-0.502749)	0.563290 / 0.579283 (-0.015993)	0.653556 / 0.434364 (0.219192)	0.687663 / 0.540337 (0.147326)	0.816432 / 1.386936 (-0.570504)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010340 / 0.011353 (-0.001013)	0.006245 / 0.011008 (-0.004764)	0.128012 / 0.038508 (0.089504)	0.041799 / 0.023109 (0.018690)	0.533340 / 0.275898 (0.257442)	0.592243 / 0.323480 (0.268763)	0.009256 / 0.007986 (0.001271)	0.005310 / 0.004328 (0.000982)	0.110973 / 0.004250 (0.106722)	0.065465 / 0.037052 (0.028412)	0.533845 / 0.258489 (0.275356)	0.602190 / 0.293841 (0.308349)	0.060245 / 0.128546 (-0.068301)	0.016954 / 0.075646 (-0.058693)	0.119727 / 0.419271 (-0.299545)	0.064628 / 0.043533 (0.021095)	0.558229 / 0.255139 (0.303090)	0.563696 / 0.283200 (0.280496)	0.137225 / 0.141683 (-0.004458)	2.038605 / 1.452155 (0.586451)	2.158655 / 1.492716 (0.665939)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.327067 / 0.018006 (0.309061)	0.628812 / 0.000490 (0.628323)	0.010259 / 0.000200 (0.010059)	0.000123 / 0.000054 (0.000069)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037023 / 0.037411 (-0.000388)	0.142462 / 0.014526 (0.127936)	0.158165 / 0.176557 (-0.018392)	0.220808 / 0.737135 (-0.516328)	0.163608 / 0.296338 (-0.132731)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.776119 / 0.215209 (0.560910)	7.813044 / 2.077655 (5.735389)	3.610901 / 1.504120 (2.106781)	3.195144 / 1.541195 (1.653950)	3.218245 / 1.468490 (1.749755)	1.092732 / 4.584777 (-3.492045)	5.965526 / 3.745712 (2.219813)	2.914683 / 5.269862 (-2.355179)	1.848397 / 4.565676 (-2.717280)	0.114436 / 0.424275 (-0.309839)	0.014794 / 0.007607 (0.007187)	0.887141 / 0.226044 (0.661096)	9.009743 / 2.268929 (6.740815)	4.180143 / 55.444624 (-51.264481)	3.452194 / 6.876477 (-3.424283)	3.493520 / 2.142072 (1.351448)	1.233327 / 4.805227 (-3.571900)	0.235390 / 6.500664 (-6.265274)	0.099544 / 0.075469 (0.024075)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.853482 / 1.841788 (0.011694)	20.071177 / 8.074308 (11.996869)	24.507618 / 10.191392 (14.316226)	0.260164 / 0.680424 (-0.420260)	0.028433 / 0.534201 (-0.505768)	0.549181 / 0.579283 (-0.030102)	0.650069 / 0.434364 (0.215705)	0.629541 / 0.540337 (0.089203)	0.808932 / 1.386936 (-0.578004)

github-actions · 2023-06-07T15:21:31Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009537 / 0.011353 (-0.001816)	0.006036 / 0.011008 (-0.004972)	0.141210 / 0.038508 (0.102701)	0.037493 / 0.023109 (0.014384)	0.404285 / 0.275898 (0.128386)	0.458906 / 0.323480 (0.135427)	0.007224 / 0.007986 (-0.000761)	0.005148 / 0.004328 (0.000819)	0.103889 / 0.004250 (0.099639)	0.048877 / 0.037052 (0.011824)	0.413220 / 0.258489 (0.154731)	0.458153 / 0.293841 (0.164312)	0.046008 / 0.128546 (-0.082538)	0.015116 / 0.075646 (-0.060531)	0.439836 / 0.419271 (0.020565)	0.067527 / 0.043533 (0.023994)	0.435794 / 0.255139 (0.180656)	0.451687 / 0.283200 (0.168487)	0.121274 / 0.141683 (-0.020409)	1.950199 / 1.452155 (0.498044)	2.035589 / 1.492716 (0.542873)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.247056 / 0.018006 (0.229050)	0.550348 / 0.000490 (0.549858)	0.005504 / 0.000200 (0.005305)	0.000116 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032171 / 0.037411 (-0.005240)	0.135983 / 0.014526 (0.121457)	0.149587 / 0.176557 (-0.026970)	0.233414 / 0.737135 (-0.503722)	0.152598 / 0.296338 (-0.143740)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.634813 / 0.215209 (0.419604)	6.453619 / 2.077655 (4.375964)	2.582070 / 1.504120 (1.077951)	2.214292 / 1.541195 (0.673097)	2.220012 / 1.468490 (0.751522)	0.987374 / 4.584777 (-3.597403)	5.543760 / 3.745712 (1.798047)	2.808865 / 5.269862 (-2.460996)	1.714713 / 4.565676 (-2.850963)	0.111016 / 0.424275 (-0.313259)	0.014688 / 0.007607 (0.007081)	0.842542 / 0.226044 (0.616498)	8.414336 / 2.268929 (6.145407)	3.501021 / 55.444624 (-51.943604)	2.665335 / 6.876477 (-4.211142)	2.843706 / 2.142072 (0.701633)	1.196398 / 4.805227 (-3.608829)	0.245508 / 6.500664 (-6.255156)	0.086970 / 0.075469 (0.011501)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.590244 / 1.841788 (-0.251544)	18.694141 / 8.074308 (10.619833)	21.752463 / 10.191392 (11.561071)	0.264511 / 0.680424 (-0.415913)	0.028713 / 0.534201 (-0.505488)	0.531102 / 0.579283 (-0.048181)	0.626302 / 0.434364 (0.191938)	0.624541 / 0.540337 (0.084203)	0.745745 / 1.386936 (-0.641191)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010097 / 0.011353 (-0.001256)	0.005558 / 0.011008 (-0.005451)	0.111326 / 0.038508 (0.072818)	0.036465 / 0.023109 (0.013356)	0.472116 / 0.275898 (0.196218)	0.524479 / 0.323480 (0.200999)	0.007466 / 0.007986 (-0.000520)	0.005440 / 0.004328 (0.001112)	0.103482 / 0.004250 (0.099231)	0.053217 / 0.037052 (0.016165)	0.476685 / 0.258489 (0.218196)	0.554011 / 0.293841 (0.260170)	0.047157 / 0.128546 (-0.081390)	0.015895 / 0.075646 (-0.059751)	0.115997 / 0.419271 (-0.303274)	0.062290 / 0.043533 (0.018758)	0.474166 / 0.255139 (0.219027)	0.498854 / 0.283200 (0.215655)	0.121798 / 0.141683 (-0.019885)	1.956583 / 1.452155 (0.504428)	2.069620 / 1.492716 (0.576904)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.278637 / 0.018006 (0.260631)	0.555295 / 0.000490 (0.554805)	0.007401 / 0.000200 (0.007201)	0.000121 / 0.000054 (0.000066)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033576 / 0.037411 (-0.003835)	0.136479 / 0.014526 (0.121954)	0.153960 / 0.176557 (-0.022597)	0.203422 / 0.737135 (-0.533713)	0.154159 / 0.296338 (-0.142180)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.672561 / 0.215209 (0.457352)	6.956675 / 2.077655 (4.879020)	3.063636 / 1.504120 (1.559516)	2.668256 / 1.541195 (1.127061)	2.794793 / 1.468490 (1.326303)	0.964242 / 4.584777 (-3.620535)	5.785992 / 3.745712 (2.040279)	2.850079 / 5.269862 (-2.419782)	1.782491 / 4.565676 (-2.783186)	0.114859 / 0.424275 (-0.309416)	0.015229 / 0.007607 (0.007622)	0.858406 / 0.226044 (0.632362)	8.646296 / 2.268929 (6.377367)	3.842133 / 55.444624 (-51.602492)	3.180017 / 6.876477 (-3.696460)	3.241315 / 2.142072 (1.099243)	1.248988 / 4.805227 (-3.556239)	0.235075 / 6.500664 (-6.265589)	0.087192 / 0.075469 (0.011723)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.783877 / 1.841788 (-0.057910)	19.477223 / 8.074308 (11.402914)	22.926734 / 10.191392 (12.735342)	0.246970 / 0.680424 (-0.433454)	0.026386 / 0.534201 (-0.507815)	0.517599 / 0.579283 (-0.061684)	0.626504 / 0.434364 (0.192140)	0.606943 / 0.540337 (0.066606)	0.739115 / 1.386936 (-0.647821)

github-actions · 2023-06-07T15:22:14Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008085 / 0.011353 (-0.003268)	0.005568 / 0.011008 (-0.005440)	0.119674 / 0.038508 (0.081166)	0.040452 / 0.023109 (0.017343)	0.360288 / 0.275898 (0.084390)	0.409448 / 0.323480 (0.085968)	0.007281 / 0.007986 (-0.000705)	0.004931 / 0.004328 (0.000602)	0.089956 / 0.004250 (0.085706)	0.056088 / 0.037052 (0.019036)	0.384708 / 0.258489 (0.126219)	0.423506 / 0.293841 (0.129665)	0.033280 / 0.128546 (-0.095266)	0.010696 / 0.075646 (-0.064951)	0.394851 / 0.419271 (-0.024421)	0.058412 / 0.043533 (0.014879)	0.361514 / 0.255139 (0.106375)	0.399121 / 0.283200 (0.115921)	0.117927 / 0.141683 (-0.023756)	1.791499 / 1.452155 (0.339344)	1.889000 / 1.492716 (0.396284)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.253324 / 0.018006 (0.235318)	0.536151 / 0.000490 (0.535661)	0.010450 / 0.000200 (0.010250)	0.000171 / 0.000054 (0.000117)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034646 / 0.037411 (-0.002765)	0.145999 / 0.014526 (0.131473)	0.153793 / 0.176557 (-0.022763)	0.232871 / 0.737135 (-0.504265)	0.161151 / 0.296338 (-0.135188)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.471407 / 0.215209 (0.256197)	4.715702 / 2.077655 (2.638047)	2.228939 / 1.504120 (0.724819)	2.008511 / 1.541195 (0.467317)	2.135182 / 1.468490 (0.666692)	0.620720 / 4.584777 (-3.964057)	4.960731 / 3.745712 (1.215019)	2.222469 / 5.269862 (-3.047393)	1.284467 / 4.565676 (-3.281209)	0.077931 / 0.424275 (-0.346344)	0.013935 / 0.007607 (0.006328)	0.593164 / 0.226044 (0.367120)	5.940829 / 2.268929 (3.671900)	2.664277 / 55.444624 (-52.780347)	2.290655 / 6.876477 (-4.585822)	2.496664 / 2.142072 (0.354592)	0.759166 / 4.805227 (-4.046061)	0.168011 / 6.500664 (-6.332653)	0.077993 / 0.075469 (0.002524)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.440663 / 1.841788 (-0.401125)	19.105377 / 8.074308 (11.031069)	16.068118 / 10.191392 (5.876726)	0.193024 / 0.680424 (-0.487400)	0.022348 / 0.534201 (-0.511853)	0.517454 / 0.579283 (-0.061829)	0.528072 / 0.434364 (0.093708)	0.565293 / 0.540337 (0.024955)	0.676578 / 1.386936 (-0.710358)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008089 / 0.011353 (-0.003264)	0.005287 / 0.011008 (-0.005721)	0.087964 / 0.038508 (0.049456)	0.041548 / 0.023109 (0.018439)	0.437733 / 0.275898 (0.161835)	0.487878 / 0.323480 (0.164398)	0.006898 / 0.007986 (-0.001087)	0.004649 / 0.004328 (0.000320)	0.086982 / 0.004250 (0.082732)	0.056874 / 0.037052 (0.019822)	0.437397 / 0.258489 (0.178908)	0.490636 / 0.293841 (0.196795)	0.033550 / 0.128546 (-0.094997)	0.010430 / 0.075646 (-0.065216)	0.096076 / 0.419271 (-0.323196)	0.054028 / 0.043533 (0.010495)	0.450262 / 0.255139 (0.195123)	0.465566 / 0.283200 (0.182366)	0.119987 / 0.141683 (-0.021696)	1.764428 / 1.452155 (0.312273)	1.841547 / 1.492716 (0.348831)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.271427 / 0.018006 (0.253420)	0.506386 / 0.000490 (0.505896)	0.001213 / 0.000200 (0.001013)	0.000125 / 0.000054 (0.000070)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036159 / 0.037411 (-0.001253)	0.140578 / 0.014526 (0.126053)	0.147517 / 0.176557 (-0.029040)	0.206215 / 0.737135 (-0.530921)	0.152560 / 0.296338 (-0.143779)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.522833 / 0.215209 (0.307624)	5.215732 / 2.077655 (3.138077)	2.553406 / 1.504120 (1.049286)	2.344815 / 1.541195 (0.803620)	2.422377 / 1.468490 (0.953886)	0.631197 / 4.584777 (-3.953580)	4.906216 / 3.745712 (1.160504)	2.212923 / 5.269862 (-3.056938)	1.352937 / 4.565676 (-3.212740)	0.079141 / 0.424275 (-0.345135)	0.013691 / 0.007607 (0.006084)	0.634939 / 0.226044 (0.408895)	6.578770 / 2.268929 (4.309842)	3.080339 / 55.444624 (-52.364286)	2.710243 / 6.876477 (-4.166234)	2.740476 / 2.142072 (0.598404)	0.783610 / 4.805227 (-4.021617)	0.171589 / 6.500664 (-6.329075)	0.077311 / 0.075469 (0.001842)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.584847 / 1.841788 (-0.256941)	19.510132 / 8.074308 (11.435824)	18.074572 / 10.191392 (7.883180)	0.173494 / 0.680424 (-0.506930)	0.021149 / 0.534201 (-0.513052)	0.469026 / 0.579283 (-0.110258)	0.518463 / 0.434364 (0.084099)	0.550363 / 0.540337 (0.010026)	0.667087 / 1.386936 (-0.719849)

github-actions · 2023-06-07T15:25:54Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007144 / 0.011353 (-0.004209)	0.004783 / 0.011008 (-0.006225)	0.103991 / 0.038508 (0.065483)	0.039098 / 0.023109 (0.015989)	0.319851 / 0.275898 (0.043952)	0.356104 / 0.323480 (0.032625)	0.007077 / 0.007986 (-0.000909)	0.004188 / 0.004328 (-0.000141)	0.078360 / 0.004250 (0.074109)	0.050951 / 0.037052 (0.013899)	0.321791 / 0.258489 (0.063302)	0.356123 / 0.293841 (0.062283)	0.028967 / 0.128546 (-0.099579)	0.009091 / 0.075646 (-0.066555)	0.355265 / 0.419271 (-0.064007)	0.052521 / 0.043533 (0.008988)	0.317333 / 0.255139 (0.062194)	0.340747 / 0.283200 (0.057547)	0.104354 / 0.141683 (-0.037329)	1.522791 / 1.452155 (0.070636)	1.579835 / 1.492716 (0.087118)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.260539 / 0.018006 (0.242532)	0.454230 / 0.000490 (0.453740)	0.036588 / 0.000200 (0.036388)	0.000289 / 0.000054 (0.000235)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028375 / 0.037411 (-0.009036)	0.118939 / 0.014526 (0.104413)	0.126553 / 0.176557 (-0.050004)	0.184596 / 0.737135 (-0.552539)	0.130583 / 0.296338 (-0.165755)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.417353 / 0.215209 (0.202144)	4.171595 / 2.077655 (2.093940)	1.855096 / 1.504120 (0.350976)	1.673941 / 1.541195 (0.132747)	1.761370 / 1.468490 (0.292880)	0.544081 / 4.584777 (-4.040696)	3.851877 / 3.745712 (0.106165)	1.896661 / 5.269862 (-3.373200)	1.093303 / 4.565676 (-3.472373)	0.067967 / 0.424275 (-0.356308)	0.012313 / 0.007607 (0.004706)	0.532316 / 0.226044 (0.306272)	5.336016 / 2.268929 (3.067087)	2.344780 / 55.444624 (-53.099845)	1.993909 / 6.876477 (-4.882568)	2.167324 / 2.142072 (0.025251)	0.670334 / 4.805227 (-4.134893)	0.147705 / 6.500664 (-6.352959)	0.067634 / 0.075469 (-0.007835)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.251005 / 1.841788 (-0.590783)	15.405531 / 8.074308 (7.331223)	14.197019 / 10.191392 (4.005627)	0.144230 / 0.680424 (-0.536193)	0.018352 / 0.534201 (-0.515849)	0.427536 / 0.579283 (-0.151748)	0.433135 / 0.434364 (-0.001229)	0.502624 / 0.540337 (-0.037713)	0.612312 / 1.386936 (-0.774624)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007011 / 0.011353 (-0.004342)	0.004857 / 0.011008 (-0.006151)	0.077797 / 0.038508 (0.039289)	0.035411 / 0.023109 (0.012302)	0.368234 / 0.275898 (0.092336)	0.408359 / 0.323480 (0.084879)	0.005883 / 0.007986 (-0.002102)	0.004311 / 0.004328 (-0.000017)	0.077216 / 0.004250 (0.072966)	0.052062 / 0.037052 (0.015010)	0.368502 / 0.258489 (0.110013)	0.428681 / 0.293841 (0.134840)	0.028889 / 0.128546 (-0.099657)	0.009146 / 0.075646 (-0.066501)	0.085515 / 0.419271 (-0.333756)	0.050216 / 0.043533 (0.006683)	0.359562 / 0.255139 (0.104423)	0.378335 / 0.283200 (0.095135)	0.106351 / 0.141683 (-0.035332)	1.538943 / 1.452155 (0.086788)	1.663572 / 1.492716 (0.170855)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216917 / 0.018006 (0.198911)	0.444130 / 0.000490 (0.443641)	0.002640 / 0.000200 (0.002440)	0.000093 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032509 / 0.037411 (-0.004902)	0.123955 / 0.014526 (0.109430)	0.133236 / 0.176557 (-0.043321)	0.187408 / 0.737135 (-0.549727)	0.136696 / 0.296338 (-0.159643)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.443714 / 0.215209 (0.228505)	4.416973 / 2.077655 (2.339318)	2.145279 / 1.504120 (0.641159)	1.946669 / 1.541195 (0.405474)	2.044105 / 1.468490 (0.575614)	0.534463 / 4.584777 (-4.050314)	3.824926 / 3.745712 (0.079214)	3.151796 / 5.269862 (-2.118066)	1.497513 / 4.565676 (-3.068164)	0.066799 / 0.424275 (-0.357476)	0.012408 / 0.007607 (0.004801)	0.544182 / 0.226044 (0.318138)	5.419403 / 2.268929 (3.150474)	2.605191 / 55.444624 (-52.839433)	2.285354 / 6.876477 (-4.591123)	2.359520 / 2.142072 (0.217448)	0.655489 / 4.805227 (-4.149738)	0.143496 / 6.500664 (-6.357168)	0.066782 / 0.075469 (-0.008687)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.329370 / 1.841788 (-0.512418)	16.058019 / 8.074308 (7.983711)	15.119769 / 10.191392 (4.928377)	0.147967 / 0.680424 (-0.532457)	0.018360 / 0.534201 (-0.515841)	0.436847 / 0.579283 (-0.142436)	0.435136 / 0.434364 (0.000773)	0.507176 / 0.540337 (-0.033161)	0.610627 / 1.386936 (-0.776309)

github-actions · 2023-06-07T16:01:16Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006425 / 0.011353 (-0.004927)	0.003710 / 0.011008 (-0.007298)	0.102072 / 0.038508 (0.063564)	0.033974 / 0.023109 (0.010865)	0.273146 / 0.275898 (-0.002752)	0.313254 / 0.323480 (-0.010226)	0.004889 / 0.007986 (-0.003096)	0.004803 / 0.004328 (0.000475)	0.067359 / 0.004250 (0.063109)	0.040281 / 0.037052 (0.003228)	0.302106 / 0.258489 (0.043617)	0.318039 / 0.293841 (0.024198)	0.028839 / 0.128546 (-0.099707)	0.008726 / 0.075646 (-0.066921)	0.322532 / 0.419271 (-0.096739)	0.048845 / 0.043533 (0.005312)	0.299836 / 0.255139 (0.044697)	0.300983 / 0.283200 (0.017784)	0.103384 / 0.141683 (-0.038299)	1.417245 / 1.452155 (-0.034910)	1.538819 / 1.492716 (0.046102)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219798 / 0.018006 (0.201792)	0.442297 / 0.000490 (0.441807)	0.013792 / 0.000200 (0.013592)	0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024996 / 0.037411 (-0.012416)	0.098558 / 0.014526 (0.084032)	0.116423 / 0.176557 (-0.060133)	0.163481 / 0.737135 (-0.573654)	0.115031 / 0.296338 (-0.181308)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.392411 / 0.215209 (0.177202)	4.025992 / 2.077655 (1.948337)	1.850809 / 1.504120 (0.346690)	1.668330 / 1.541195 (0.127136)	1.627041 / 1.468490 (0.158551)	0.510721 / 4.584777 (-4.074055)	3.841318 / 3.745712 (0.095606)	3.416979 / 5.269862 (-1.852883)	1.640796 / 4.565676 (-2.924880)	0.061968 / 0.424275 (-0.362307)	0.010281 / 0.007607 (0.002674)	0.485592 / 0.226044 (0.259548)	4.872205 / 2.268929 (2.603277)	2.146753 / 55.444624 (-53.297871)	1.832087 / 6.876477 (-5.044390)	1.920928 / 2.142072 (-0.221144)	0.606363 / 4.805227 (-4.198864)	0.134351 / 6.500664 (-6.366313)	0.057583 / 0.075469 (-0.017886)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.153048 / 1.841788 (-0.688739)	14.165743 / 8.074308 (6.091435)	12.237798 / 10.191392 (2.046406)	0.159815 / 0.680424 (-0.520608)	0.018226 / 0.534201 (-0.515975)	0.372390 / 0.579283 (-0.206893)	0.396552 / 0.434364 (-0.037811)	0.439445 / 0.540337 (-0.100892)	0.521924 / 1.386936 (-0.865012)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006162 / 0.011353 (-0.005191)	0.004006 / 0.011008 (-0.007002)	0.067226 / 0.038508 (0.028718)	0.030285 / 0.023109 (0.007176)	0.361220 / 0.275898 (0.085322)	0.386783 / 0.323480 (0.063303)	0.005202 / 0.007986 (-0.002784)	0.003453 / 0.004328 (-0.000876)	0.068299 / 0.004250 (0.064048)	0.041433 / 0.037052 (0.004381)	0.360222 / 0.258489 (0.101733)	0.399327 / 0.293841 (0.105486)	0.026066 / 0.128546 (-0.102480)	0.008025 / 0.075646 (-0.067621)	0.079588 / 0.419271 (-0.339683)	0.042616 / 0.043533 (-0.000917)	0.347639 / 0.255139 (0.092500)	0.386092 / 0.283200 (0.102893)	0.100869 / 0.141683 (-0.040814)	1.386901 / 1.452155 (-0.065254)	1.471523 / 1.492716 (-0.021193)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.217020 / 0.018006 (0.199014)	0.431033 / 0.000490 (0.430543)	0.002902 / 0.000200 (0.002702)	0.000092 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027396 / 0.037411 (-0.010015)	0.114154 / 0.014526 (0.099629)	0.117918 / 0.176557 (-0.058638)	0.173342 / 0.737135 (-0.563794)	0.125812 / 0.296338 (-0.170526)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.424843 / 0.215209 (0.209634)	4.324828 / 2.077655 (2.247174)	2.188263 / 1.504120 (0.684143)	1.912288 / 1.541195 (0.371094)	2.011621 / 1.468490 (0.543131)	0.560944 / 4.584777 (-4.023833)	3.975047 / 3.745712 (0.229335)	3.130242 / 5.269862 (-2.139619)	1.667902 / 4.565676 (-2.897775)	0.062245 / 0.424275 (-0.362030)	0.011300 / 0.007607 (0.003692)	0.498571 / 0.226044 (0.272527)	5.024887 / 2.268929 (2.755958)	2.482967 / 55.444624 (-52.961657)	2.216125 / 6.876477 (-4.660352)	2.175856 / 2.142072 (0.033783)	0.615207 / 4.805227 (-4.190021)	0.133808 / 6.500664 (-6.366856)	0.058681 / 0.075469 (-0.016788)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.370150 / 1.841788 (-0.471637)	14.580907 / 8.074308 (6.506599)	14.209955 / 10.191392 (4.018563)	0.139738 / 0.680424 (-0.540686)	0.018722 / 0.534201 (-0.515479)	0.375755 / 0.579283 (-0.203528)	0.428335 / 0.434364 (-0.006029)	0.438957 / 0.540337 (-0.101380)	0.541130 / 1.386936 (-0.845806)

Rocketknight1 · 2023-06-08T11:12:20Z

@alvarobartt @lhoestq This should be ready for re-review. I've rebased it on the recent PR to allow batch_size=None, and it should also support unbatched loading now.

Having a variety of different methods like this is annoying, but once our minimum Python version is 3.8 I can go back and clear a lot of this out!

alvarobartt · 2023-06-08T13:00:39Z

src/datasets/utils/tf_utils.py

@@ -173,6 +174,21 @@ def dataset_to_tf(
    else:
        raise ImportError("Called a Tensorflow-specific function but Tensorflow is not installed.")

+    # TODO Matt: When our minimum Python version is 3.8 or higher, we can delete all of this and move everything


Hi Matt, is datasets going to drop Python 3.7 support given its upcoming EOL? That happens at the end of this month, so we could wait and set the minimum version to 3.8, though I assume some users may still be on 3.7.

it will probably depend on what transformers does

alvarobartt · 2023-06-08T13:01:30Z

LGTM @Rocketknight1! I may run some tests during the weekend to compare performances with the current approach in case that's useful 😄

lhoestq

lgtm :) feel free to run some tests before merging though

Rocketknight1 · 2023-06-08T16:32:23Z

@alvarobartt I'll probably merge now, just to avoid the major memory usage issues we currently have! Feel free to run the comparisons before/after the commit.

Rocketknight1 · 2023-06-08T16:32:47Z

And yes, hopefully Py3.7 goes EOL and we make Py3.8 the minimum soon to resolve this.

alvarobartt · 2023-06-08T16:34:36Z

> @alvarobartt I'll probably merge now, just to avoid the major memory usage issues we currently have! Feel free to run the comparisons before/after the commit.

I'll ping you back with the comparison this weekend! 🤗

github-actions · 2023-06-08T16:40:18Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007102 / 0.011353 (-0.004251)	0.004713 / 0.011008 (-0.006295)	0.102391 / 0.038508 (0.063883)	0.038363 / 0.023109 (0.015253)	0.330843 / 0.275898 (0.054945)	0.365290 / 0.323480 (0.041810)	0.006389 / 0.007986 (-0.001596)	0.004287 / 0.004328 (-0.000041)	0.078710 / 0.004250 (0.074460)	0.051974 / 0.037052 (0.014922)	0.333163 / 0.258489 (0.074674)	0.371016 / 0.293841 (0.077176)	0.028412 / 0.128546 (-0.100134)	0.009350 / 0.075646 (-0.066296)	0.351673 / 0.419271 (-0.067599)	0.051879 / 0.043533 (0.008347)	0.323769 / 0.255139 (0.068630)	0.342994 / 0.283200 (0.059794)	0.107347 / 0.141683 (-0.034336)	1.585641 / 1.452155 (0.133487)	1.679408 / 1.492716 (0.186691)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.251772 / 0.018006 (0.233766)	0.580570 / 0.000490 (0.580081)	0.008346 / 0.000200 (0.008147)	0.000113 / 0.000054 (0.000059)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028740 / 0.037411 (-0.008672)	0.117707 / 0.014526 (0.103182)	0.126397 / 0.176557 (-0.050160)	0.183823 / 0.737135 (-0.553312)	0.132272 / 0.296338 (-0.164066)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.428428 / 0.215209 (0.213219)	4.263983 / 2.077655 (2.186329)	2.012477 / 1.504120 (0.508357)	1.812453 / 1.541195 (0.271259)	1.889282 / 1.468490 (0.420792)	0.534459 / 4.584777 (-4.050318)	3.719460 / 3.745712 (-0.026252)	1.958039 / 5.269862 (-3.311823)	1.078166 / 4.565676 (-3.487510)	0.067902 / 0.424275 (-0.356373)	0.012479 / 0.007607 (0.004872)	0.532071 / 0.226044 (0.306026)	5.343323 / 2.268929 (3.074394)	2.478577 / 55.444624 (-52.966047)	2.146067 / 6.876477 (-4.730409)	2.324783 / 2.142072 (0.182710)	0.655925 / 4.805227 (-4.149302)	0.145578 / 6.500664 (-6.355086)	0.068044 / 0.075469 (-0.007425)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.254036 / 1.841788 (-0.587752)	15.199639 / 8.074308 (7.125331)	13.851406 / 10.191392 (3.660014)	0.168760 / 0.680424 (-0.511664)	0.017807 / 0.534201 (-0.516394)	0.425857 / 0.579283 (-0.153426)	0.413098 / 0.434364 (-0.021266)	0.497433 / 0.540337 (-0.042905)	0.599273 / 1.386936 (-0.787663)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007044 / 0.011353 (-0.004309)	0.005036 / 0.011008 (-0.005972)	0.080307 / 0.038508 (0.041798)	0.035926 / 0.023109 (0.012817)	0.402026 / 0.275898 (0.126128)	0.444185 / 0.323480 (0.120705)	0.006228 / 0.007986 (-0.001758)	0.004481 / 0.004328 (0.000153)	0.080223 / 0.004250 (0.075972)	0.055385 / 0.037052 (0.018333)	0.405674 / 0.258489 (0.147184)	0.461574 / 0.293841 (0.167733)	0.029237 / 0.128546 (-0.099309)	0.009249 / 0.075646 (-0.066398)	0.086215 / 0.419271 (-0.333056)	0.048512 / 0.043533 (0.004979)	0.401374 / 0.255139 (0.146235)	0.418274 / 0.283200 (0.135074)	0.107994 / 0.141683 (-0.033689)	1.560504 / 1.452155 (0.108350)	1.669651 / 1.492716 (0.176935)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.275393 / 0.018006 (0.257387)	0.573688 / 0.000490 (0.573199)	0.007236 / 0.000200 (0.007036)	0.000153 / 0.000054 (0.000099)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033387 / 0.037411 (-0.004024)	0.125027 / 0.014526 (0.110501)	0.138601 / 0.176557 (-0.037956)	0.191820 / 0.737135 (-0.545315)	0.141022 / 0.296338 (-0.155317)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.456382 / 0.215209 (0.241173)	4.559197 / 2.077655 (2.481542)	2.263333 / 1.504120 (0.759213)	2.073151 / 1.541195 (0.531956)	2.185314 / 1.468490 (0.716824)	0.540230 / 4.584777 (-4.044547)	3.934984 / 3.745712 (0.189272)	1.980895 / 5.269862 (-3.288966)	1.101440 / 4.565676 (-3.464237)	0.068255 / 0.424275 (-0.356020)	0.012605 / 0.007607 (0.004997)	0.560695 / 0.226044 (0.334650)	5.588877 / 2.268929 (3.319948)	2.756690 / 55.444624 (-52.687935)	2.427774 / 6.876477 (-4.448702)	2.548903 / 2.142072 (0.406831)	0.657177 / 4.805227 (-4.148050)	0.147645 / 6.500664 (-6.353019)	0.069216 / 0.075469 (-0.006253)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.319813 / 1.841788 (-0.521975)	15.882227 / 8.074308 (7.807919)	15.324481 / 10.191392 (5.133089)	0.193708 / 0.680424 (-0.486716)	0.018264 / 0.534201 (-0.515937)	0.432594 / 0.579283 (-0.146689)	0.437063 / 0.434364 (0.002699)	0.512297 / 0.540337 (-0.028040)	0.617469 / 1.386936 (-0.769467)

Rocketknight1 requested a review from lhoestq May 15, 2023 15:28

Rocketknight1 force-pushed the reduce_to_tf_dataset_memory_usage branch from 824f96c to b899ea4 Compare May 24, 2023 15:56

alvarobartt mentioned this pull request May 25, 2023

Fix string-encoding, make batch_size optional, and minor improvements in Dataset.to_tf_dataset #5883

Merged

Rocketknight1 added 5 commits June 7, 2023 15:40

Use a new low-memory approach for tf dataset index shuffling

1d6fa5b

correct fill kwarg

b05b748

...and cast the inputs too

7f936cb

Add warnings for older TF

7bd7312

Fix to use the imported random_index_shuffle

18d92aa

Rocketknight1 added 4 commits June 7, 2023 15:42

Switch to_tf_dataset entirely over to the NumPy multiprocessing approach

82534e3

Revert "Switch to_tf_dataset entirely over to the NumPy multiprocessing approach"

3011d62

This reverts commit 95c177e.

Add explanatory comment

3c54400

TF 2.13 has a specific optimization for dataset.shuffle(dataset.cardinality()), so use that instead of dataset.shuffle(len(dataset))

81761db

Rocketknight1 force-pushed the reduce_to_tf_dataset_memory_usage branch from b899ea4 to 81761db Compare June 7, 2023 14:43

Fix a couple of rebase errors

8907bdb

Rocketknight1 added 4 commits June 7, 2023 16:08

More merging with the changes in main

f39ba76

Fix some indents

323747a

Fix docstring merge

e8f051a

Add clearer TODO

5dfcd87

Rename indices -> index to be clearer what the function does now

b4cc3ee

Expand test to make sure shuffling is working correctly

c14806a
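
For readers skimming the commit list: the idea behind a "low-memory" index shuffle is that the shuffled position of each row can be computed on demand instead of materialising a full permuted index tensor. The sketch below is not the code from this PR and not how TensorFlow implements its index shuffle op; it is a minimal pure-Python illustration using a Feistel network with cycle-walking. The names `shuffled_index` and `_round_value` are hypothetical and exist only for this example.

```python
import hashlib


def _round_value(half: int, round_no: int, seed: int, mask: int) -> int:
    """Keyed pseudorandom round function: hash (seed, round, half) down to the mask width."""
    data = f"{seed}:{round_no}:{half}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") & mask


def shuffled_index(i: int, n: int, seed: int = 0, rounds: int = 4) -> int:
    """Map position i in [0, n) to a pseudorandom position in [0, n), bijectively.

    Illustrative only: uses O(1) memory per lookup, no permutation array is stored.
    """
    if not 0 <= i < n:
        raise IndexError(i)
    # Work in a power-of-two domain of 2 * half_bits bits that covers [0, n).
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = i
    while True:
        left, right = x >> half_bits, x & mask
        # A balanced Feistel network is a bijection for any round function.
        for r in range(rounds):
            left, right = right, left ^ _round_value(right, r, seed, mask)
        x = (left << half_bits) | right
        if x < n:  # cycle-walk: re-apply until the value falls back inside [0, n)
            return x
```

Iterating `shuffled_index(i, n, seed)` for `i` in `range(n)` visits every row exactly once in a shuffled order, so memory use stays constant regardless of dataset size; changing `seed` per epoch gives a different order.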

alvarobartt reviewed Jun 8, 2023

lhoestq approved these changes Jun 8, 2023

Rocketknight1 merged commit 6ee61e6 into main Jun 8, 2023
13 checks passed

Rocketknight1 deleted the reduce_to_tf_dataset_memory_usage branch June 8, 2023 16:32
