Iterable torch formatting #5852

lhoestq · 2023-05-12T16:48:49Z

Use the TorchFormatter to return torch tensors from an iterable dataset whose format is set to "torch".

It reads the data directly from Arrow when possible; otherwise it applies recursive_tensorize.

When the format is set back to format_type=None, cast_to_python_objects is used instead.
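
To illustrate the fallback path, here is a minimal, dependency-free sketch of recursive conversion over a nested example. It is not the library's `recursive_tensorize`: the names `recursive_convert` and `to_float_array` are hypothetical, and `array.array` stands in for a tensor constructor such as `torch.tensor` so the snippet runs without torch installed.

```python
from array import array
from typing import Any, Callable


def recursive_convert(data: Any, leaf_fn: Callable[[list], Any]) -> Any:
    """Walk nested dicts/lists and convert each flat list of numbers with leaf_fn."""
    if isinstance(data, dict):
        return {key: recursive_convert(value, leaf_fn) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        # A flat, non-empty sequence of numbers is a leaf: convert it in one call.
        is_numeric_leaf = bool(data) and all(
            isinstance(item, (int, float)) and not isinstance(item, bool)
            for item in data
        )
        if is_numeric_leaf:
            return leaf_fn(list(data))
        # Otherwise recurse into each element (ragged or nested sequences).
        return [recursive_convert(item, leaf_fn) for item in data]
    # Strings, None, and other scalars pass through unchanged.
    return data


def to_float_array(values: list) -> array:
    # Hypothetical stand-in for a tensor constructor (assumption, not the real API).
    return array("d", values)


example = {
    "input_ids": [1, 2, 3],
    "meta": {"text": "hello", "scores": [[0.5, 1.5], [2.0]]},
}
converted = recursive_convert(example, to_float_array)
```

The ragged `scores` field stays a Python list whose elements are converted individually, which is roughly why a recursive fallback is needed when the data cannot be turned into a single rectangular array.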

requires #5821

close #5793

github-actions · 2023-05-12T16:53:48Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006567 / 0.011353 (-0.004786)	0.004479 / 0.011008 (-0.006530)	0.028286 / 0.038508 (-0.010222)	0.033137 / 0.023109 (0.010028)	0.305249 / 0.275898 (0.029351)	0.330306 / 0.323480 (0.006826)	0.003747 / 0.007986 (-0.004238)	0.004409 / 0.004328 (0.000081)	0.004742 / 0.004250 (0.000491)	0.040780 / 0.037052 (0.003728)	0.302879 / 0.258489 (0.044390)	0.346880 / 0.293841 (0.053039)	0.032908 / 0.128546 (-0.095638)	0.010617 / 0.075646 (-0.065029)	0.257996 / 0.419271 (-0.161275)	0.051044 / 0.043533 (0.007511)	0.306113 / 0.255139 (0.050974)	0.324444 / 0.283200 (0.041244)	0.100820 / 0.141683 (-0.040863)	1.478402 / 1.452155 (0.026248)	1.599398 / 1.492716 (0.106682)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216540 / 0.018006 (0.198534)	0.433480 / 0.000490 (0.432991)	0.004032 / 0.000200 (0.003832)	0.000084 / 0.000054 (0.000029)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027807 / 0.037411 (-0.009604)	0.107225 / 0.014526 (0.092699)	0.120157 / 0.176557 (-0.056400)	0.174130 / 0.737135 (-0.563005)	0.128902 / 0.296338 (-0.167437)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.395996 / 0.215209 (0.180787)	3.936254 / 2.077655 (1.858599)	1.808864 / 1.504120 (0.304744)	1.608935 / 1.541195 (0.067741)	1.646427 / 1.468490 (0.177937)	0.716026 / 4.584777 (-3.868751)	3.815045 / 3.745712 (0.069333)	2.271534 / 5.269862 (-2.998327)	1.548728 / 4.565676 (-3.016948)	0.076743 / 0.424275 (-0.347532)	0.011575 / 0.007607 (0.003968)	0.499202 / 0.226044 (0.273158)	4.983754 / 2.268929 (2.714825)	2.239319 / 55.444624 (-53.205306)	1.919427 / 6.876477 (-4.957050)	2.019664 / 2.142072 (-0.122408)	0.866318 / 4.805227 (-3.938910)	0.157309 / 6.500664 (-6.343355)	0.063341 / 0.075469 (-0.012128)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.180817 / 1.841788 (-0.660971)	14.579869 / 8.074308 (6.505561)	14.277848 / 10.191392 (4.086456)	0.182560 / 0.680424 (-0.497863)	0.017402 / 0.534201 (-0.516799)	0.411549 / 0.579283 (-0.167734)	0.432938 / 0.434364 (-0.001426)	0.545067 / 0.540337 (0.004730)	0.642173 / 1.386936 (-0.744763)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006753 / 0.011353 (-0.004600)	0.004590 / 0.011008 (-0.006418)	0.006111 / 0.038508 (-0.032397)	0.032763 / 0.023109 (0.009654)	0.401001 / 0.275898 (0.125103)	0.428063 / 0.323480 (0.104583)	0.003730 / 0.007986 (-0.004255)	0.004617 / 0.004328 (0.000289)	0.004770 / 0.004250 (0.000519)	0.049718 / 0.037052 (0.012666)	0.399724 / 0.258489 (0.141235)	0.440292 / 0.293841 (0.146451)	0.032846 / 0.128546 (-0.095700)	0.010842 / 0.075646 (-0.064804)	0.012642 / 0.419271 (-0.406630)	0.046043 / 0.043533 (0.002510)	0.390862 / 0.255139 (0.135723)	0.407027 / 0.283200 (0.123828)	0.099349 / 0.141683 (-0.042334)	1.455739 / 1.452155 (0.003584)	1.572214 / 1.492716 (0.079497)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227186 / 0.018006 (0.209180)	0.447404 / 0.000490 (0.446914)	0.000400 / 0.000200 (0.000200)	0.000055 / 0.000054 (0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029830 / 0.037411 (-0.007581)	0.112365 / 0.014526 (0.097839)	0.125736 / 0.176557 (-0.050821)	0.174781 / 0.737135 (-0.562354)	0.129439 / 0.296338 (-0.166900)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.444438 / 0.215209 (0.229229)	4.459381 / 2.077655 (2.381726)	2.264541 / 1.504120 (0.760421)	2.075257 / 1.541195 (0.534062)	2.181289 / 1.468490 (0.712799)	0.725279 / 4.584777 (-3.859498)	3.863253 / 3.745712 (0.117541)	2.132498 / 5.269862 (-3.137364)	1.402003 / 4.565676 (-3.163673)	0.084268 / 0.424275 (-0.340007)	0.011762 / 0.007607 (0.004155)	0.556239 / 0.226044 (0.330194)	5.617998 / 2.268929 (3.349070)	2.754789 / 55.444624 (-52.689835)	2.418418 / 6.876477 (-4.458059)	2.479696 / 2.142072 (0.337624)	0.870037 / 4.805227 (-3.935190)	0.160480 / 6.500664 (-6.340184)	0.064464 / 0.075469 (-0.011005)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.290916 / 1.841788 (-0.550872)	14.783173 / 8.074308 (6.708865)	13.355883 / 10.191392 (3.164491)	0.169963 / 0.680424 (-0.510461)	0.017657 / 0.534201 (-0.516544)	0.409218 / 0.579283 (-0.170065)	0.422942 / 0.434364 (-0.011422)	0.494968 / 0.540337 (-0.045369)	0.587044 / 1.386936 (-0.799892)

HuggingFaceDocBuilderDev · 2023-05-24T15:24:46Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-05-24T15:26:48Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007183 / 0.011353 (-0.004169)	0.004586 / 0.011008 (-0.006423)	0.032668 / 0.038508 (-0.005840)	0.040896 / 0.023109 (0.017787)	0.358225 / 0.275898 (0.082327)	0.395063 / 0.323480 (0.071583)	0.004540 / 0.007986 (-0.003446)	0.003849 / 0.004328 (-0.000480)	0.005521 / 0.004250 (0.001271)	0.053314 / 0.037052 (0.016262)	0.362417 / 0.258489 (0.103928)	0.414337 / 0.293841 (0.120496)	0.030698 / 0.128546 (-0.097849)	0.008823 / 0.075646 (-0.066823)	0.303583 / 0.419271 (-0.115689)	0.060277 / 0.043533 (0.016744)	0.365938 / 0.255139 (0.110799)	0.379554 / 0.283200 (0.096354)	0.122545 / 0.141683 (-0.019138)	1.712098 / 1.452155 (0.259943)	1.802036 / 1.492716 (0.309319)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.239508 / 0.018006 (0.221502)	0.492194 / 0.000490 (0.491704)	0.003280 / 0.000200 (0.003081)	0.000096 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033301 / 0.037411 (-0.004110)	0.125851 / 0.014526 (0.111325)	0.137757 / 0.176557 (-0.038799)	0.207603 / 0.737135 (-0.529533)	0.143507 / 0.296338 (-0.152831)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.470662 / 0.215209 (0.255453)	4.736017 / 2.077655 (2.658363)	2.154152 / 1.504120 (0.650032)	1.954243 / 1.541195 (0.413048)	2.080186 / 1.468490 (0.611696)	0.622884 / 4.584777 (-3.961893)	4.385885 / 3.745712 (0.640173)	2.262085 / 5.269862 (-3.007776)	1.454215 / 4.565676 (-3.111462)	0.067342 / 0.424275 (-0.356933)	0.012913 / 0.007607 (0.005306)	0.600676 / 0.226044 (0.374631)	5.915093 / 2.268929 (3.646164)	2.664915 / 55.444624 (-52.779709)	2.286986 / 6.876477 (-4.589490)	2.387776 / 2.142072 (0.245704)	0.757067 / 4.805227 (-4.048160)	0.154625 / 6.500664 (-6.346039)	0.074632 / 0.075469 (-0.000838)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.413229 / 1.841788 (-0.428558)	17.433012 / 8.074308 (9.358704)	16.980340 / 10.191392 (6.788948)	0.218943 / 0.680424 (-0.461481)	0.020525 / 0.534201 (-0.513676)	0.451847 / 0.579283 (-0.127436)	0.495587 / 0.434364 (0.061223)	0.548739 / 0.540337 (0.008402)	0.662120 / 1.386936 (-0.724816)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006775 / 0.011353 (-0.004577)	0.004556 / 0.011008 (-0.006452)	0.006462 / 0.038508 (-0.032046)	0.039073 / 0.023109 (0.015964)	0.429249 / 0.275898 (0.153351)	0.469946 / 0.323480 (0.146467)	0.004402 / 0.007986 (-0.003584)	0.003798 / 0.004328 (-0.000530)	0.005347 / 0.004250 (0.001097)	0.053743 / 0.037052 (0.016691)	0.434635 / 0.258489 (0.176146)	0.475661 / 0.293841 (0.181820)	0.029891 / 0.128546 (-0.098656)	0.009058 / 0.075646 (-0.066588)	0.010987 / 0.419271 (-0.408284)	0.053877 / 0.043533 (0.010344)	0.434428 / 0.255139 (0.179289)	0.449637 / 0.283200 (0.166437)	0.124331 / 0.141683 (-0.017352)	1.736083 / 1.452155 (0.283928)	1.831632 / 1.492716 (0.338916)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.248428 / 0.018006 (0.230422)	0.493113 / 0.000490 (0.492623)	0.000429 / 0.000200 (0.000229)	0.000057 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031337 / 0.037411 (-0.006074)	0.132360 / 0.014526 (0.117834)	0.134734 / 0.176557 (-0.041822)	0.193811 / 0.737135 (-0.543324)	0.146883 / 0.296338 (-0.149456)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.510876 / 0.215209 (0.295666)	5.170198 / 2.077655 (3.092543)	2.572105 / 1.504120 (1.067985)	2.316918 / 1.541195 (0.775723)	2.449316 / 1.468490 (0.980826)	0.612219 / 4.584777 (-3.972558)	4.456740 / 3.745712 (0.711028)	2.099757 / 5.269862 (-3.170105)	1.293017 / 4.565676 (-3.272660)	0.067922 / 0.424275 (-0.356353)	0.013467 / 0.007607 (0.005860)	0.634240 / 0.226044 (0.408196)	6.373111 / 2.268929 (4.104182)	3.171567 / 55.444624 (-52.273057)	2.763411 / 6.876477 (-4.113066)	2.845557 / 2.142072 (0.703485)	0.763431 / 4.805227 (-4.041797)	0.155949 / 6.500664 (-6.344715)	0.076264 / 0.075469 (0.000795)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.468075 / 1.841788 (-0.373713)	17.582354 / 8.074308 (9.508046)	16.565964 / 10.191392 (6.374572)	0.163779 / 0.680424 (-0.516644)	0.020472 / 0.534201 (-0.513728)	0.444416 / 0.579283 (-0.134867)	0.488471 / 0.434364 (0.054107)	0.550661 / 0.540337 (0.010323)	0.667230 / 1.386936 (-0.719706)

github-actions · 2023-05-24T18:30:54Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006160 / 0.011353 (-0.005193)	0.004093 / 0.011008 (-0.006915)	0.056485 / 0.038508 (0.017977)	0.033637 / 0.023109 (0.010528)	0.296448 / 0.275898 (0.020550)	0.332532 / 0.323480 (0.009052)	0.003864 / 0.007986 (-0.004122)	0.003446 / 0.004328 (-0.000883)	0.034808 / 0.004250 (0.030558)	0.048567 / 0.037052 (0.011514)	0.296090 / 0.258489 (0.037601)	0.336067 / 0.293841 (0.042226)	0.026081 / 0.128546 (-0.102465)	0.007875 / 0.075646 (-0.067771)	0.286049 / 0.419271 (-0.133222)	0.050411 / 0.043533 (0.006878)	0.297016 / 0.255139 (0.041877)	0.320030 / 0.283200 (0.036830)	0.110374 / 0.141683 (-0.031308)	1.432470 / 1.452155 (-0.019684)	1.492479 / 1.492716 (-0.000238)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.262352 / 0.018006 (0.244346)	0.557956 / 0.000490 (0.557467)	0.010296 / 0.000200 (0.010096)	0.000315 / 0.000054 (0.000260)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028801 / 0.037411 (-0.008611)	0.109844 / 0.014526 (0.095318)	0.122333 / 0.176557 (-0.054224)	0.180571 / 0.737135 (-0.556564)	0.125990 / 0.296338 (-0.170348)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.401643 / 0.215209 (0.186434)	4.020993 / 2.077655 (1.943338)	1.815256 / 1.504120 (0.311136)	1.619579 / 1.541195 (0.078384)	1.708889 / 1.468490 (0.240398)	0.537847 / 4.584777 (-4.046930)	3.743331 / 3.745712 (-0.002381)	1.779891 / 5.269862 (-3.489970)	1.021423 / 4.565676 (-3.544253)	0.058869 / 0.424275 (-0.365406)	0.011826 / 0.007607 (0.004218)	0.499665 / 0.226044 (0.273621)	4.980928 / 2.268929 (2.712000)	2.285664 / 55.444624 (-53.158960)	1.936553 / 6.876477 (-4.939923)	2.090428 / 2.142072 (-0.051645)	0.655218 / 4.805227 (-4.150009)	0.133178 / 6.500664 (-6.367486)	0.062991 / 0.075469 (-0.012478)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.168895 / 1.841788 (-0.672892)	14.656773 / 8.074308 (6.582465)	13.737921 / 10.191392 (3.546529)	0.145383 / 0.680424 (-0.535041)	0.017614 / 0.534201 (-0.516587)	0.386499 / 0.579283 (-0.192784)	0.425626 / 0.434364 (-0.008738)	0.389572 / 0.540337 (-0.150766)	0.386753 / 1.386936 (-1.000183)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005998 / 0.011353 (-0.005355)	0.004265 / 0.011008 (-0.006743)	0.034743 / 0.038508 (-0.003766)	0.033929 / 0.023109 (0.010820)	0.405535 / 0.275898 (0.129636)	0.407235 / 0.323480 (0.083755)	0.003972 / 0.007986 (-0.004013)	0.003616 / 0.004328 (-0.000712)	0.035278 / 0.004250 (0.031027)	0.052990 / 0.037052 (0.015937)	0.405228 / 0.258489 (0.146739)	0.415007 / 0.293841 (0.121166)	0.025951 / 0.128546 (-0.102595)	0.007990 / 0.075646 (-0.067656)	0.040492 / 0.419271 (-0.378779)	0.049123 / 0.043533 (0.005591)	0.399282 / 0.255139 (0.144143)	0.384303 / 0.283200 (0.101103)	0.115234 / 0.141683 (-0.026448)	1.476904 / 1.452155 (0.024749)	1.627191 / 1.492716 (0.134475)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.209211 / 0.018006 (0.191205)	0.566718 / 0.000490 (0.566228)	0.002094 / 0.000200 (0.001894)	0.000104 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030885 / 0.037411 (-0.006526)	0.110777 / 0.014526 (0.096251)	0.124382 / 0.176557 (-0.052174)	0.175081 / 0.737135 (-0.562054)	0.130263 / 0.296338 (-0.166075)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.448091 / 0.215209 (0.232882)	4.484404 / 2.077655 (2.406749)	2.278438 / 1.504120 (0.774318)	2.087933 / 1.541195 (0.546738)	2.186709 / 1.468490 (0.718219)	0.534822 / 4.584777 (-4.049955)	3.778229 / 3.745712 (0.032517)	3.312334 / 5.269862 (-1.957528)	1.557209 / 4.565676 (-3.008467)	0.058923 / 0.424275 (-0.365352)	0.011350 / 0.007607 (0.003743)	0.550470 / 0.226044 (0.324426)	5.480347 / 2.268929 (3.211419)	2.781709 / 55.444624 (-52.662915)	2.478729 / 6.876477 (-4.397748)	2.492001 / 2.142072 (0.349929)	0.652649 / 4.805227 (-4.152578)	0.131334 / 6.500664 (-6.369330)	0.065619 / 0.075469 (-0.009850)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.253998 / 1.841788 (-0.587790)	15.207433 / 8.074308 (7.133124)	14.627842 / 10.191392 (4.436450)	0.146947 / 0.680424 (-0.533477)	0.017533 / 0.534201 (-0.516668)	0.391627 / 0.579283 (-0.187656)	0.431113 / 0.434364 (-0.003251)	0.413886 / 0.540337 (-0.126451)	0.414483 / 1.386936 (-0.972453)

github-actions · 2023-05-24T19:08:43Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007741 / 0.011353 (-0.003612)	0.004584 / 0.011008 (-0.006424)	0.067869 / 0.038508 (0.029361)	0.041612 / 0.023109 (0.018503)	0.377878 / 0.275898 (0.101980)	0.421633 / 0.323480 (0.098153)	0.004614 / 0.007986 (-0.003371)	0.003824 / 0.004328 (-0.000504)	0.041479 / 0.004250 (0.037229)	0.053309 / 0.037052 (0.016256)	0.390147 / 0.258489 (0.131658)	0.437706 / 0.293841 (0.143865)	0.035951 / 0.128546 (-0.092595)	0.009231 / 0.075646 (-0.066415)	0.357572 / 0.419271 (-0.061699)	0.081332 / 0.043533 (0.037799)	0.370076 / 0.255139 (0.114937)	0.423653 / 0.283200 (0.140453)	0.141401 / 0.141683 (-0.000282)	1.722744 / 1.452155 (0.270589)	1.914668 / 1.492716 (0.421952)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.256568 / 0.018006 (0.238562)	0.512243 / 0.000490 (0.511753)	0.019913 / 0.000200 (0.019713)	0.000136 / 0.000054 (0.000082)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031742 / 0.037411 (-0.005670)	0.128537 / 0.014526 (0.114011)	0.139962 / 0.176557 (-0.036594)	0.210711 / 0.737135 (-0.526424)	0.147162 / 0.296338 (-0.149177)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.509518 / 0.215209 (0.294309)	5.083788 / 2.077655 (3.006134)	2.455381 / 1.504120 (0.951262)	2.208078 / 1.541195 (0.666883)	2.341807 / 1.468490 (0.873317)	0.580014 / 4.584777 (-4.004763)	4.599492 / 3.745712 (0.853780)	2.403249 / 5.269862 (-2.866612)	1.559177 / 4.565676 (-3.006500)	0.072846 / 0.424275 (-0.351429)	0.017327 / 0.007607 (0.009720)	0.627747 / 0.226044 (0.401703)	6.242586 / 2.268929 (3.973657)	2.982875 / 55.444624 (-52.461750)	2.588645 / 6.876477 (-4.287832)	2.765915 / 2.142072 (0.623843)	0.720455 / 4.805227 (-4.084772)	0.157474 / 6.500664 (-6.343190)	0.074295 / 0.075469 (-0.001174)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.540799 / 1.841788 (-0.300988)	18.054632 / 8.074308 (9.980324)	16.544036 / 10.191392 (6.352644)	0.201423 / 0.680424 (-0.479001)	0.020497 / 0.534201 (-0.513704)	0.496275 / 0.579283 (-0.083008)	0.547380 / 0.434364 (0.113017)	0.614605 / 0.540337 (0.074267)	0.749889 / 1.386936 (-0.637047)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006963 / 0.011353 (-0.004389)	0.004543 / 0.011008 (-0.006465)	0.039530 / 0.038508 (0.001022)	0.038420 / 0.023109 (0.015311)	0.454885 / 0.275898 (0.178987)	0.491731 / 0.323480 (0.168251)	0.004211 / 0.007986 (-0.003775)	0.003673 / 0.004328 (-0.000655)	0.038735 / 0.004250 (0.034484)	0.052085 / 0.037052 (0.015032)	0.448924 / 0.258489 (0.190435)	0.499254 / 0.293841 (0.205413)	0.030069 / 0.128546 (-0.098477)	0.009082 / 0.075646 (-0.066565)	0.047181 / 0.419271 (-0.372090)	0.054758 / 0.043533 (0.011225)	0.445035 / 0.255139 (0.189896)	0.475090 / 0.283200 (0.191891)	0.122641 / 0.141683 (-0.019042)	1.706514 / 1.452155 (0.254360)	1.855726 / 1.492716 (0.363010)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.246028 / 0.018006 (0.228022)	0.486382 / 0.000490 (0.485892)	0.003038 / 0.000200 (0.002838)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034298 / 0.037411 (-0.003113)	0.135364 / 0.014526 (0.120838)	0.146102 / 0.176557 (-0.030455)	0.207997 / 0.737135 (-0.529139)	0.153119 / 0.296338 (-0.143219)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.528758 / 0.215209 (0.313549)	5.243303 / 2.077655 (3.165648)	2.617194 / 1.504120 (1.113074)	2.400740 / 1.541195 (0.859545)	2.534692 / 1.468490 (1.066202)	0.585825 / 4.584777 (-3.998952)	4.879766 / 3.745712 (1.134054)	2.377419 / 5.269862 (-2.892443)	1.460711 / 4.565676 (-3.104966)	0.075572 / 0.424275 (-0.348703)	0.013650 / 0.007607 (0.006042)	0.697103 / 0.226044 (0.471058)	6.444984 / 2.268929 (4.176055)	3.227662 / 55.444624 (-52.216963)	2.875163 / 6.876477 (-4.001314)	2.860953 / 2.142072 (0.718881)	0.718908 / 4.805227 (-4.086319)	0.158005 / 6.500664 (-6.342659)	0.077581 / 0.075469 (0.002112)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.653027 / 1.841788 (-0.188760)	18.789342 / 8.074308 (10.715034)	16.762678 / 10.191392 (6.571286)	0.238920 / 0.680424 (-0.441504)	0.020698 / 0.534201 (-0.513502)	0.512634 / 0.579283 (-0.066649)	0.542235 / 0.434364 (0.107871)	0.626634 / 0.540337 (0.086297)	0.753324 / 1.386936 (-0.633612)

github-actions · 2023-05-25T15:08:49Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005737 / 0.011353 (-0.005616)	0.003767 / 0.011008 (-0.007241)	0.097792 / 0.038508 (0.059284)	0.028466 / 0.023109 (0.005356)	0.317703 / 0.275898 (0.041805)	0.359512 / 0.323480 (0.036032)	0.003428 / 0.007986 (-0.004558)	0.002848 / 0.004328 (-0.001481)	0.075668 / 0.004250 (0.071418)	0.037165 / 0.037052 (0.000113)	0.329539 / 0.258489 (0.071050)	0.361365 / 0.293841 (0.067524)	0.024777 / 0.128546 (-0.103769)	0.008324 / 0.075646 (-0.067323)	0.317346 / 0.419271 (-0.101926)	0.043296 / 0.043533 (-0.000237)	0.315318 / 0.255139 (0.060179)	0.347641 / 0.283200 (0.064441)	0.089551 / 0.141683 (-0.052132)	1.506335 / 1.452155 (0.054180)	1.573931 / 1.492716 (0.081215)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208041 / 0.018006 (0.190034)	0.428198 / 0.000490 (0.427708)	0.002568 / 0.000200 (0.002369)	0.000072 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023745 / 0.037411 (-0.013667)	0.096256 / 0.014526 (0.081730)	0.104917 / 0.176557 (-0.071639)	0.164341 / 0.737135 (-0.572794)	0.107972 / 0.296338 (-0.188367)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.453995 / 0.215209 (0.238786)	4.546892 / 2.077655 (2.469238)	2.185498 / 1.504120 (0.681378)	1.989156 / 1.541195 (0.447962)	2.053443 / 1.468490 (0.584953)	0.559940 / 4.584777 (-4.024837)	3.420759 / 3.745712 (-0.324954)	1.771528 / 5.269862 (-3.498333)	1.139692 / 4.565676 (-3.425984)	0.067686 / 0.424275 (-0.356589)	0.011729 / 0.007607 (0.004122)	0.558001 / 0.226044 (0.331957)	5.583886 / 2.268929 (3.314957)	2.678726 / 55.444624 (-52.765899)	2.324127 / 6.876477 (-4.552350)	2.472805 / 2.142072 (0.330733)	0.663163 / 4.805227 (-4.142065)	0.134892 / 6.500664 (-6.365772)	0.066722 / 0.075469 (-0.008747)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.195200 / 1.841788 (-0.646587)	13.602517 / 8.074308 (5.528209)	14.036344 / 10.191392 (3.844952)	0.143759 / 0.680424 (-0.536665)	0.017215 / 0.534201 (-0.516986)	0.383749 / 0.579283 (-0.195534)	0.388229 / 0.434364 (-0.046134)	0.469366 / 0.540337 (-0.070971)	0.560408 / 1.386936 (-0.826528)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005953 / 0.011353 (-0.005400)	0.003840 / 0.011008 (-0.007168)	0.077481 / 0.038508 (0.038973)	0.028318 / 0.023109 (0.005209)	0.403991 / 0.275898 (0.128093)	0.433374 / 0.323480 (0.109894)	0.003572 / 0.007986 (-0.004414)	0.003033 / 0.004328 (-0.001295)	0.075873 / 0.004250 (0.071623)	0.039321 / 0.037052 (0.002269)	0.416790 / 0.258489 (0.158301)	0.459368 / 0.293841 (0.165527)	0.025270 / 0.128546 (-0.103276)	0.008574 / 0.075646 (-0.067072)	0.083376 / 0.419271 (-0.335896)	0.043206 / 0.043533 (-0.000327)	0.404831 / 0.255139 (0.149692)	0.418559 / 0.283200 (0.135360)	0.099135 / 0.141683 (-0.042548)	1.501315 / 1.452155 (0.049160)	1.583912 / 1.492716 (0.091195)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.241510 / 0.018006 (0.223504)	0.410473 / 0.000490 (0.409983)	0.001857 / 0.000200 (0.001657)	0.000081 / 0.000054 (0.000027)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025366 / 0.037411 (-0.012045)	0.103353 / 0.014526 (0.088828)	0.107934 / 0.176557 (-0.068622)	0.162388 / 0.737135 (-0.574747)	0.113550 / 0.296338 (-0.182789)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.463529 / 0.215209 (0.248320)	4.657688 / 2.077655 (2.580034)	2.455088 / 1.504120 (0.950968)	2.304833 / 1.541195 (0.763638)	2.317520 / 1.468490 (0.849029)	0.563395 / 4.584777 (-4.021382)	3.408489 / 3.745712 (-0.337223)	2.636379 / 5.269862 (-2.633482)	1.425355 / 4.565676 (-3.140322)	0.068335 / 0.424275 (-0.355940)	0.011713 / 0.007607 (0.004106)	0.550230 / 0.226044 (0.324186)	5.519843 / 2.268929 (3.250915)	2.864986 / 55.444624 (-52.579639)	2.604821 / 6.876477 (-4.271655)	2.701501 / 2.142072 (0.559428)	0.668193 / 4.805227 (-4.137034)	0.134739 / 6.500664 (-6.365925)	0.067110 / 0.075469 (-0.008359)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.326358 / 1.841788 (-0.515430)	14.184172 / 8.074308 (6.109864)	14.139245 / 10.191392 (3.947853)	0.151881 / 0.680424 (-0.528542)	0.016718 / 0.534201 (-0.517483)	0.367035 / 0.579283 (-0.212248)	0.393512 / 0.434364 (-0.040852)	0.441261 / 0.540337 (-0.099076)	0.533907 / 1.386936 (-0.853029)

github-actions · 2023-05-31T15:20:02Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006275 / 0.011353 (-0.005078)	0.003980 / 0.011008 (-0.007028)	0.097617 / 0.038508 (0.059109)	0.034089 / 0.023109 (0.010980)	0.297381 / 0.275898 (0.021483)	0.330106 / 0.323480 (0.006626)	0.003838 / 0.007986 (-0.004148)	0.004042 / 0.004328 (-0.000287)	0.074305 / 0.004250 (0.070055)	0.048318 / 0.037052 (0.011265)	0.295585 / 0.258489 (0.037096)	0.346924 / 0.293841 (0.053083)	0.027397 / 0.128546 (-0.101150)	0.008452 / 0.075646 (-0.067194)	0.326837 / 0.419271 (-0.092435)	0.049515 / 0.043533 (0.005982)	0.303931 / 0.255139 (0.048792)	0.317647 / 0.283200 (0.034447)	0.098280 / 0.141683 (-0.043403)	1.442603 / 1.452155 (-0.009552)	1.524050 / 1.492716 (0.031334)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.215095 / 0.018006 (0.197089)	0.437662 / 0.000490 (0.437173)	0.009771 / 0.000200 (0.009571)	0.000401 / 0.000054 (0.000346)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027169 / 0.037411 (-0.010243)	0.111383 / 0.014526 (0.096857)	0.116163 / 0.176557 (-0.060394)	0.173134 / 0.737135 (-0.564001)	0.122376 / 0.296338 (-0.173962)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.398332 / 0.215209 (0.183123)	3.974166 / 2.077655 (1.896511)	1.793847 / 1.504120 (0.289727)	1.615117 / 1.541195 (0.073922)	1.660288 / 1.468490 (0.191798)	0.523833 / 4.584777 (-4.060944)	3.704273 / 3.745712 (-0.041439)	1.873308 / 5.269862 (-3.396554)	1.203546 / 4.565676 (-3.362131)	0.064949 / 0.424275 (-0.359326)	0.011830 / 0.007607 (0.004223)	0.497294 / 0.226044 (0.271250)	4.948663 / 2.268929 (2.679735)	2.233391 / 55.444624 (-53.211234)	1.903208 / 6.876477 (-4.973269)	2.067908 / 2.142072 (-0.074164)	0.644256 / 4.805227 (-4.160971)	0.142798 / 6.500664 (-6.357866)	0.064734 / 0.075469 (-0.010735)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.172313 / 1.841788 (-0.669475)	14.665853 / 8.074308 (6.591545)	13.147051 / 10.191392 (2.955659)	0.139338 / 0.680424 (-0.541086)	0.017452 / 0.534201 (-0.516749)	0.395660 / 0.579283 (-0.183623)	0.410138 / 0.434364 (-0.024226)	0.460357 / 0.540337 (-0.079980)	0.555670 / 1.386936 (-0.831266)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006247 / 0.011353 (-0.005106)	0.004098 / 0.011008 (-0.006910)	0.075050 / 0.038508 (0.036542)	0.033232 / 0.023109 (0.010122)	0.384139 / 0.275898 (0.108241)	0.420865 / 0.323480 (0.097385)	0.003889 / 0.007986 (-0.004096)	0.003336 / 0.004328 (-0.000993)	0.073837 / 0.004250 (0.069587)	0.048775 / 0.037052 (0.011723)	0.386373 / 0.258489 (0.127884)	0.421718 / 0.293841 (0.127878)	0.027553 / 0.128546 (-0.100993)	0.008724 / 0.075646 (-0.066922)	0.080970 / 0.419271 (-0.338302)	0.045981 / 0.043533 (0.002448)	0.364381 / 0.255139 (0.109242)	0.391203 / 0.283200 (0.108004)	0.101681 / 0.141683 (-0.040002)	1.469533 / 1.452155 (0.017378)	1.562016 / 1.492716 (0.069300)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222318 / 0.018006 (0.204312)	0.441395 / 0.000490 (0.440905)	0.000408 / 0.000200 (0.000208)	0.000057 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030291 / 0.037411 (-0.007120)	0.114053 / 0.014526 (0.099527)	0.123124 / 0.176557 (-0.053433)	0.173474 / 0.737135 (-0.563661)	0.129946 / 0.296338 (-0.166393)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.430342 / 0.215209 (0.215133)	4.309782 / 2.077655 (2.232128)	2.110668 / 1.504120 (0.606548)	1.922881 / 1.541195 (0.381687)	1.993562 / 1.468490 (0.525072)	0.523682 / 4.584777 (-4.061095)	3.774152 / 3.745712 (0.028440)	3.354783 / 5.269862 (-1.915079)	1.489793 / 4.565676 (-3.075884)	0.065169 / 0.424275 (-0.359107)	0.011626 / 0.007607 (0.004019)	0.539126 / 0.226044 (0.313081)	5.372593 / 2.268929 (3.103664)	2.570652 / 55.444624 (-52.873973)	2.253353 / 6.876477 (-4.623123)	2.312876 / 2.142072 (0.170804)	0.644241 / 4.805227 (-4.160986)	0.138326 / 6.500664 (-6.362338)	0.064491 / 0.075469 (-0.010979)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.344164 / 1.841788 (-0.497624)	15.124679 / 8.074308 (7.050371)	14.799310 / 10.191392 (4.607918)	0.149054 / 0.680424 (-0.531370)	0.017564 / 0.534201 (-0.516637)	0.394593 / 0.579283 (-0.184690)	0.428768 / 0.434364 (-0.005596)	0.468235 / 0.540337 (-0.072103)	0.557384 / 1.386936 (-0.829552)

lhoestq · 2023-05-31T17:39:06Z

@albertvillanova could you take a look at this one? It directly follows the Arrow formatting PR.

albertvillanova

Thanks. Some comments below.

albertvillanova · 2023-06-01T13:59:02Z

src/datasets/formatting/jax_formatter.py

+        if hasattr(data_struct, "__array__") and not isinstance(data_struct, jax.Array):
+            data_struct = data_struct.__array__()


Is this tested? The same question applies to the similar code lines in both of the other formatters.
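
For illustration, the check in the diff above can be sketched outside of JAX. This is a minimal sketch under stated assumptions, not the formatter's actual code: `ForeignTensor` and `to_array_if_foreign` are hypothetical names, and `np.ndarray` stands in for the formatter's native type (`jax.Array` in the diff). It relies only on NumPy's `__array__` protocol.

```python
import numpy as np


class ForeignTensor:
    """Hypothetical stand-in for a tensor coming from another framework."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy's array protocol: return the data as an ndarray.
        return np.asarray(self._values, dtype=dtype)


def to_array_if_foreign(data_struct, native_type=np.ndarray):
    """Convert array-like objects that are not already of the native type.

    Mirrors the shape of the check in the diff; `native_type` is an
    assumption that replaces `jax.Array` for this sketch.
    """
    if hasattr(data_struct, "__array__") and not isinstance(data_struct, native_type):
        data_struct = data_struct.__array__()
    return data_struct


converted = to_array_if_foreign(ForeignTensor([1, 2, 3]))  # goes through __array__
native = np.array([4, 5])
untouched = to_array_if_foreign(native)  # already native, returned as-is
```

A test along these lines would assert that the foreign object comes back as an array with the same values, and that a native array is passed through without conversion.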

albertvillanova · 2023-06-01T14:09:22Z

src/datasets/iterable_dataset.py

        shuffling: Optional[ShufflingConfig] = None,
        distributed: Optional[DistributedConfig] = None,
        token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None,
+        format_type="deprecated",


Nice that you deprecate it. What about the classes above?

MappedExamplesIterable

FilteredExamplesIterable

albertvillanova · 2023-06-01T14:09:49Z

src/datasets/iterable_dataset.py

    ):
        if distributed and distributed.world_size > 1 and shuffling and shuffling._original_seed is None:
            raise RuntimeError(
                "The dataset doesn't have a fixed random seed across nodes to shuffle and split the list of dataset shards by node. "
                "Please pass e.g. `seed=42` in `.shuffle()` to make all the nodes use the same seed. "
            )
+        if format_type != "deprecated":
+            formatting = FormattingConfig(format_type=format_type)


Maybe worth raising a warning?

lhoestq · 2023-06-09T10:24:58Z

I added tests for the __array__ case which lets you go from any tensor format to any other tensor format.

I also properly deprecated format_type and added a warning message.
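
The deprecation described above can be sketched as follows. This is a hedged illustration, not the merged code: `IterableDatasetSketch` is a hypothetical class, `FormattingConfig` is reduced to a one-field stand-in, and the warning category and message are assumptions. The `"deprecated"` sentinel default comes from the diff; it lets the constructor tell an omitted argument apart from an explicit `format_type=None`.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormattingConfig:
    # Hypothetical one-field stand-in for the class named in the diff.
    format_type: Optional[str]


class IterableDatasetSketch:
    """Hypothetical class showing only the deprecated-argument handling."""

    def __init__(self, formatting=None, format_type="deprecated"):
        if format_type != "deprecated":
            # The argument was passed explicitly: warn, then map it to the new config.
            warnings.warn(
                "'format_type' is deprecated, pass 'formatting' instead.",
                FutureWarning,  # assumed category
                stacklevel=2,
            )
            formatting = FormattingConfig(format_type=format_type)
        self._formatting = formatting


# Record warnings instead of printing them, to show what a caller would see.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = IterableDatasetSketch(format_type="torch")  # deprecated path, warns
    modern = IterableDatasetSketch(formatting=FormattingConfig("torch"))  # no warning
```

Only the legacy call triggers a warning, and both objects end up with an equivalent formatting config.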

github-actions · 2023-06-09T10:32:42Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007838 / 0.011353 (-0.003515)	0.005177 / 0.011008 (-0.005831)	0.131058 / 0.038508 (0.092550)	0.035959 / 0.023109 (0.012850)	0.414071 / 0.275898 (0.138173)	0.429628 / 0.323480 (0.106148)	0.005151 / 0.007986 (-0.002834)	0.003979 / 0.004328 (-0.000349)	0.103209 / 0.004250 (0.098958)	0.046200 / 0.037052 (0.009148)	0.414020 / 0.258489 (0.155531)	0.475748 / 0.293841 (0.181907)	0.041031 / 0.128546 (-0.087515)	0.014462 / 0.075646 (-0.061185)	0.423706 / 0.419271 (0.004434)	0.063488 / 0.043533 (0.019955)	0.404937 / 0.255139 (0.149798)	0.404973 / 0.283200 (0.121773)	0.114982 / 0.141683 (-0.026701)	1.911867 / 1.452155 (0.459713)	1.925274 / 1.492716 (0.432557)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.284656 / 0.018006 (0.266650)	0.588329 / 0.000490 (0.587840)	0.007092 / 0.000200 (0.006892)	0.000143 / 0.000054 (0.000089)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025136 / 0.037411 (-0.012275)	0.109514 / 0.014526 (0.094988)	0.117953 / 0.176557 (-0.058603)	0.195454 / 0.737135 (-0.541682)	0.134243 / 0.296338 (-0.162096)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.584045 / 0.215209 (0.368836)	6.456922 / 2.077655 (4.379267)	2.759728 / 1.504120 (1.255608)	2.260913 / 1.541195 (0.719718)	2.292535 / 1.468490 (0.824045)	0.906873 / 4.584777 (-3.677904)	5.554455 / 3.745712 (1.808743)	4.881557 / 5.269862 (-0.388305)	2.509121 / 4.565676 (-2.056555)	0.107191 / 0.424275 (-0.317084)	0.014684 / 0.007607 (0.007077)	0.761625 / 0.226044 (0.535580)	7.582708 / 2.268929 (5.313780)	3.150160 / 55.444624 (-52.294464)	2.792284 / 6.876477 (-4.084193)	2.881321 / 2.142072 (0.739248)	1.108353 / 4.805227 (-3.696874)	0.220129 / 6.500664 (-6.280535)	0.075877 / 0.075469 (0.000408)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.465743 / 1.841788 (-0.376045)	17.679219 / 8.074308 (9.604911)	18.929399 / 10.191392 (8.738007)	0.219488 / 0.680424 (-0.460935)	0.028435 / 0.534201 (-0.505766)	0.512623 / 0.579283 (-0.066660)	0.619983 / 0.434364 (0.185619)	0.603430 / 0.540337 (0.063092)	0.730416 / 1.386936 (-0.656520)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008285 / 0.011353 (-0.003068)	0.005771 / 0.011008 (-0.005237)	0.106444 / 0.038508 (0.067936)	0.035078 / 0.023109 (0.011969)	0.441198 / 0.275898 (0.165300)	0.536279 / 0.323480 (0.212800)	0.004561 / 0.007986 (-0.003424)	0.006623 / 0.004328 (0.002294)	0.102392 / 0.004250 (0.098142)	0.051736 / 0.037052 (0.014684)	0.479113 / 0.258489 (0.220624)	0.535088 / 0.293841 (0.241247)	0.041805 / 0.128546 (-0.086741)	0.014031 / 0.075646 (-0.061615)	0.115795 / 0.419271 (-0.303477)	0.057913 / 0.043533 (0.014380)	0.435847 / 0.255139 (0.180708)	0.524831 / 0.283200 (0.241632)	0.119419 / 0.141683 (-0.022263)	1.835577 / 1.452155 (0.383423)	1.936990 / 1.492716 (0.444273)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.288422 / 0.018006 (0.270416)	0.569776 / 0.000490 (0.569287)	0.005652 / 0.000200 (0.005452)	0.000139 / 0.000054 (0.000085)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034632 / 0.037411 (-0.002779)	0.136217 / 0.014526 (0.121691)	0.139468 / 0.176557 (-0.037089)	0.206804 / 0.737135 (-0.530331)	0.148733 / 0.296338 (-0.147606)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.667728 / 0.215209 (0.452518)	6.548972 / 2.077655 (4.471317)	3.051537 / 1.504120 (1.547417)	2.581173 / 1.541195 (1.039978)	2.653443 / 1.468490 (1.184953)	0.906606 / 4.584777 (-3.678171)	5.704384 / 3.745712 (1.958672)	2.848618 / 5.269862 (-2.421244)	1.821402 / 4.565676 (-2.744274)	0.118018 / 0.424275 (-0.306257)	0.014821 / 0.007607 (0.007214)	0.821967 / 0.226044 (0.595923)	8.165818 / 2.268929 (5.896889)	3.744509 / 55.444624 (-51.700116)	2.901097 / 6.876477 (-3.975380)	3.018068 / 2.142072 (0.875996)	1.106155 / 4.805227 (-3.699072)	0.263118 / 6.500664 (-6.237546)	0.088508 / 0.075469 (0.013039)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.725860 / 1.841788 (-0.115928)	19.411246 / 8.074308 (11.336938)	20.807499 / 10.191392 (10.616107)	0.238417 / 0.680424 (-0.442007)	0.026550 / 0.534201 (-0.507651)	0.500715 / 0.579283 (-0.078568)	0.615547 / 0.434364 (0.181183)	0.614361 / 0.540337 (0.074023)	0.720365 / 1.386936 (-0.666571)

albertvillanova

Thank you!!!

github-actions · 2023-06-13T16:04:05Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006640 / 0.011353 (-0.004713)	0.004079 / 0.011008 (-0.006930)	0.100555 / 0.038508 (0.062046)	0.037318 / 0.023109 (0.014209)	0.320050 / 0.275898 (0.044152)	0.358860 / 0.323480 (0.035380)	0.003828 / 0.007986 (-0.004158)	0.003215 / 0.004328 (-0.001113)	0.076577 / 0.004250 (0.072326)	0.048080 / 0.037052 (0.011028)	0.324759 / 0.258489 (0.066270)	0.361862 / 0.293841 (0.068021)	0.030759 / 0.128546 (-0.097787)	0.008998 / 0.075646 (-0.066648)	0.329105 / 0.419271 (-0.090167)	0.051407 / 0.043533 (0.007875)	0.311067 / 0.255139 (0.055928)	0.334401 / 0.283200 (0.051201)	0.098307 / 0.141683 (-0.043376)	1.500931 / 1.452155 (0.048776)	1.574646 / 1.492716 (0.081930)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219080 / 0.018006 (0.201073)	0.447117 / 0.000490 (0.446627)	0.009091 / 0.000200 (0.008891)	0.000396 / 0.000054 (0.000341)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026048 / 0.037411 (-0.011363)	0.112714 / 0.014526 (0.098188)	0.116426 / 0.176557 (-0.060131)	0.172187 / 0.737135 (-0.564948)	0.121707 / 0.296338 (-0.174632)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.358898 / 0.215209 (0.143689)	3.589212 / 2.077655 (1.511557)	1.677927 / 1.504120 (0.173807)	1.515861 / 1.541195 (-0.025334)	1.598479 / 1.468490 (0.129989)	0.478265 / 4.584777 (-4.106512)	3.834982 / 3.745712 (0.089270)	1.933815 / 5.269862 (-3.336047)	1.122769 / 4.565676 (-3.442908)	0.066984 / 0.424275 (-0.357291)	0.011276 / 0.007607 (0.003669)	0.512530 / 0.226044 (0.286486)	5.112667 / 2.268929 (2.843739)	2.266336 / 55.444624 (-53.178288)	1.929671 / 6.876477 (-4.946806)	2.127231 / 2.142072 (-0.014842)	0.671307 / 4.805227 (-4.133920)	0.143919 / 6.500664 (-6.356745)	0.066086 / 0.075469 (-0.009383)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.208767 / 1.841788 (-0.633021)	15.008415 / 8.074308 (6.934106)	14.085442 / 10.191392 (3.894050)	0.184164 / 0.680424 (-0.496260)	0.017619 / 0.534201 (-0.516582)	0.394443 / 0.579283 (-0.184840)	0.457653 / 0.434364 (0.023289)	0.473169 / 0.540337 (-0.067169)	0.571332 / 1.386936 (-0.815604)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007009 / 0.011353 (-0.004344)	0.004330 / 0.011008 (-0.006678)	0.077462 / 0.038508 (0.038954)	0.034780 / 0.023109 (0.011671)	0.395573 / 0.275898 (0.119675)	0.425444 / 0.323480 (0.101964)	0.004119 / 0.007986 (-0.003866)	0.003597 / 0.004328 (-0.000731)	0.075209 / 0.004250 (0.070958)	0.050871 / 0.037052 (0.013819)	0.402990 / 0.258489 (0.144500)	0.445334 / 0.293841 (0.151493)	0.032492 / 0.128546 (-0.096054)	0.009066 / 0.075646 (-0.066581)	0.083073 / 0.419271 (-0.336198)	0.051661 / 0.043533 (0.008128)	0.395207 / 0.255139 (0.140068)	0.409556 / 0.283200 (0.126356)	0.106035 / 0.141683 (-0.035648)	1.506255 / 1.452155 (0.054101)	1.598724 / 1.492716 (0.106008)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.194733 / 0.018006 (0.176727)	0.444920 / 0.000490 (0.444431)	0.002402 / 0.000200 (0.002202)	0.000083 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030464 / 0.037411 (-0.006947)	0.119153 / 0.014526 (0.104627)	0.126081 / 0.176557 (-0.050476)	0.179692 / 0.737135 (-0.557444)	0.131834 / 0.296338 (-0.164504)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.440153 / 0.215209 (0.224944)	4.397504 / 2.077655 (2.319850)	2.138320 / 1.504120 (0.634200)	1.950596 / 1.541195 (0.409402)	2.079792 / 1.468490 (0.611302)	0.537606 / 4.584777 (-4.047171)	3.689420 / 3.745712 (-0.056292)	2.960732 / 5.269862 (-2.309129)	1.585652 / 4.565676 (-2.980024)	0.066102 / 0.424275 (-0.358173)	0.011429 / 0.007607 (0.003821)	0.537011 / 0.226044 (0.310967)	5.342171 / 2.268929 (3.073242)	2.624446 / 55.444624 (-52.820179)	2.313311 / 6.876477 (-4.563166)	2.389166 / 2.142072 (0.247094)	0.657547 / 4.805227 (-4.147681)	0.141640 / 6.500664 (-6.359025)	0.066102 / 0.075469 (-0.009367)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.130471 / 1.841788 (-0.711317)	14.824792 / 8.074308 (6.750484)	13.436463 / 10.191392 (3.245071)	0.155688 / 0.680424 (-0.524736)	0.015811 / 0.534201 (-0.518390)	0.355623 / 0.579283 (-0.223660)	0.450604 / 0.434364 (0.016241)	0.472542 / 0.540337 (-0.067796)	0.563584 / 1.386936 (-0.823352)

lhoestq added 9 commits May 2, 2023 18:26

add iterable arrow formatting

b860cf6

some tests

a905a19

fix filter

6c868e1

add test

f8417a4

fix test

95457f2

tests and fixes

8f019df

use ArrowExamplesIterable in ArrowBasedBuilder.as_streaming_dataset

00b148b

add torch formatting

e25dc0f

support formatting in map and filter

2051e91

Merge branch 'main' into iterable-torch-formatting

3655cbf

lhoestq mentioned this pull request May 24, 2023

IterableDataset Arrow formatting #5821

Merged

fix tests

3f4e987

fix tests

f978ad8

always ensure types

5409875

Merge branch 'main' into iterable-torch-formatting

a8bfac2

lhoestq marked this pull request as ready for review May 31, 2023 17:37

albertvillanova reviewed Jun 1, 2023

lhoestq added 4 commits June 8, 2023 18:48

Merge branch 'main' into iterable-torch-formatting

a2c598c

test __array__

4693b19

deprecate format_type

1ab8c50

add warning for deprecated arg

ae2e77f
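
The last two commits above deprecate the `format_type` argument and add a warning for it. The diff itself is not shown here, so the following is only a minimal sketch of the usual deprecation pattern. The helper name `configure_format`, the replacement argument name `formatting`, and the `"deprecated"` sentinel are assumptions made for illustration, not the library's actual signature.

```python
import warnings

# Sentinel meaning "the caller did not pass the old argument" (assumed convention).
_DEPRECATED = "deprecated"


def configure_format(formatting=None, format_type=_DEPRECATED):
    """Hypothetical helper: accept the new argument and warn on the old one."""
    if format_type != _DEPRECATED:
        warnings.warn(
            "'format_type' is deprecated; pass 'formatting' instead.",
            FutureWarning,
            stacklevel=2,
        )
        # Fall back to the old value only when the new one was not given.
        if formatting is None:
            formatting = format_type
    return formatting


# Record warnings so the example can show which call triggers one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = configure_format(format_type="torch")  # warns
    new_style = configure_format(formatting="torch")   # does not warn
```

Using a sentinel default rather than `None` lets the helper tell an omitted argument apart from an explicit `format_type=None`.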

lhoestq requested a review from albertvillanova June 12, 2023 10:14

albertvillanova approved these changes Jun 13, 2023

lhoestq merged commit 963ff6d into main Jun 13, 2023
13 checks passed

lhoestq deleted the iterable-torch-formatting branch June 13, 2023 15:57

ArneBinder mentioned this pull request Sep 5, 2023

set minimum datasets version ArneBinder/pytorch-ie#324

Merged

lhoestq mentioned this pull request Oct 9, 2023

Support numpy/torch/tf/jax formatting for IterableDataset #5083

Closed

if hasattr(data_struct, "__array__") and not isinstance(data_struct, jax.Array):
    data_struct = data_struct.__array__()
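
The check above relies on the NumPy `__array__` protocol: any object that defines `__array__` can be turned into an array, while selected types are left untouched. Below is a minimal, self-contained sketch of that idea using NumPy only. The `Wrapped` class, the `unwrap_array_like` helper and its `skip_types` parameter are hypothetical names introduced for illustration; in the snippet above the excluded type is `jax.Array`.

```python
import numpy as np


class Wrapped:
    """Hypothetical container that exposes the NumPy __array__ protocol."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        return np.asarray(self._values, dtype=dtype)


def unwrap_array_like(data_struct, skip_types=()):
    """Convert objects exposing __array__ to NumPy arrays, unless excluded."""
    if hasattr(data_struct, "__array__") and not isinstance(data_struct, skip_types):
        data_struct = data_struct.__array__()
    return data_struct


converted = unwrap_array_like(Wrapped([1, 2, 3]))                        # becomes an ndarray
untouched = unwrap_array_like(Wrapped([1, 2, 3]), skip_types=(Wrapped,))  # left as-is
plain = unwrap_array_like([1, 2, 3])                                      # no __array__, left as-is
```

Excluding a type this way keeps objects that are already in the target framework's native array type from being round-tripped through NumPy.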
