Map-style Dataset to IterableDataset #5410

lhoestq · 2023-01-05T18:12:17Z

Added ds.to_iterable() to get an iterable dataset from a map-style Arrow dataset.

It also takes a num_shards argument that splits the dataset into shards before converting it to an iterable dataset. Sharding is what enables efficient shuffling and parallel loading of iterable datasets.
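To illustrate why sharding helps, here is a minimal pure-Python sketch of the idea. It is not the implementation in this PR: split_into_shards and iterate_shards are hypothetical helper names, and the rows are made-up dicts. It only assumes that shards are contiguous slices, that shuffling can reorder whole shards cheaply, and that each parallel worker reads its own subset of shards.

```python
import random
from typing import Iterator, List, Optional, Sequence


def split_into_shards(rows: Sequence[dict], num_shards: int) -> List[Sequence[dict]]:
    """Split rows into num_shards contiguous chunks of near-equal size (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional row each.
        end = start + base + (1 if i < extra else 0)
        shards.append(rows[start:end])
        start = end
    return shards


def iterate_shards(
    shards: List[Sequence[dict]],
    seed: Optional[int] = None,
    worker_id: int = 0,
    num_workers: int = 1,
) -> Iterator[dict]:
    """Yield rows shard by shard (hypothetical helper).

    With a seed, the shard order is shuffled, which is cheap because only
    shard indices move. Each worker reads every num_workers-th shard.
    """
    order = list(range(len(shards)))
    if seed is not None:
        random.Random(seed).shuffle(order)
    for shard_idx in order[worker_id::num_workers]:
        yield from shards[shard_idx]


rows = [{"id": i} for i in range(10)]
shards = split_into_shards(rows, num_shards=4)

# Two workers cover disjoint shards; together they see every row exactly once.
worker_0 = list(iterate_shards(shards, worker_id=0, num_workers=2))
worker_1 = list(iterate_shards(shards, worker_id=1, num_workers=2))
```

With a single shard there is nothing to reorder and only one worker can do useful work, which is why splitting before the conversion matters.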

TODO:

tests
docs

Fix #5265

lhoestq · 2023-01-05T18:12:52Z

src/datasets/builder.py

@@ -493,7 +493,7 @@ def _create_builder_config(
        )
        is_custom = (config_id not in self.builder_configs) and config_id != "default"
        if is_custom:
-            logger.warning(f"Using custom data configuration {config_id}")
+            logger.info(f"Using custom data configuration {config_id}")


I did this because I think the warning is not relevant anymore, and because I find it confusing to see it when calling IterableDataset.from_generator.
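The practical effect of moving the message from warning to info can be sketched with the standard logging module. This is an illustration only: the logger name and config_id value are hypothetical, and it assumes the logger runs at the WARNING level, under which info messages are dropped.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("demo_builder")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)  # assumed verbosity for this sketch
logger.propagate = False

config_id = "default-abc123"  # hypothetical config id

# Before the change: emitted at WARNING verbosity.
logger.warning(f"Using custom data configuration {config_id}")
# After the change: filtered out at WARNING verbosity.
logger.info(f"Using custom data configuration {config_id}")

output = stream.getvalue()
```

Only the warning call reaches the stream here, so users who have not raised the verbosity would no longer see the message after the change.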

github-actions · 2023-01-05T18:16:10Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009812 / 0.011353 (-0.001540)	0.005290 / 0.011008 (-0.005719)	0.099728 / 0.038508 (0.061220)	0.036712 / 0.023109 (0.013602)	0.305924 / 0.275898 (0.030026)	0.349844 / 0.323480 (0.026365)	0.008353 / 0.007986 (0.000368)	0.004464 / 0.004328 (0.000135)	0.075329 / 0.004250 (0.071079)	0.046146 / 0.037052 (0.009094)	0.304197 / 0.258489 (0.045708)	0.354245 / 0.293841 (0.060404)	0.039270 / 0.128546 (-0.089276)	0.012496 / 0.075646 (-0.063151)	0.334390 / 0.419271 (-0.084882)	0.049428 / 0.043533 (0.005896)	0.297318 / 0.255139 (0.042179)	0.315646 / 0.283200 (0.032447)	0.106746 / 0.141683 (-0.034937)	1.443562 / 1.452155 (-0.008593)	1.546022 / 1.492716 (0.053305)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.303419 / 0.018006 (0.285413)	0.536971 / 0.000490 (0.536481)	0.001335 / 0.000200 (0.001135)	0.000088 / 0.000054 (0.000033)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030484 / 0.037411 (-0.006927)	0.110043 / 0.014526 (0.095518)	0.125265 / 0.176557 (-0.051291)	0.171410 / 0.737135 (-0.565725)	0.128978 / 0.296338 (-0.167361)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.398354 / 0.215209 (0.183145)	3.984180 / 2.077655 (1.906526)	1.781134 / 1.504120 (0.277014)	1.589656 / 1.541195 (0.048462)	1.704192 / 1.468490 (0.235702)	0.682271 / 4.584777 (-3.902506)	3.731504 / 3.745712 (-0.014208)	2.243520 / 5.269862 (-3.026342)	1.511334 / 4.565676 (-3.054343)	0.084243 / 0.424275 (-0.340032)	0.012261 / 0.007607 (0.004654)	0.507499 / 0.226044 (0.281454)	5.066037 / 2.268929 (2.797109)	2.246107 / 55.444624 (-53.198517)	1.921032 / 6.876477 (-4.955444)	2.144111 / 2.142072 (0.002039)	0.845233 / 4.805227 (-3.959995)	0.165392 / 6.500664 (-6.335272)	0.064201 / 0.075469 (-0.011268)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.217649 / 1.841788 (-0.624138)	15.890487 / 8.074308 (7.816179)	14.772039 / 10.191392 (4.580647)	0.192901 / 0.680424 (-0.487523)	0.029119 / 0.534201 (-0.505082)	0.442904 / 0.579283 (-0.136380)	0.451035 / 0.434364 (0.016671)	0.520788 / 0.540337 (-0.019550)	0.623588 / 1.386936 (-0.763348)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007452 / 0.011353 (-0.003901)	0.005426 / 0.011008 (-0.005582)	0.096488 / 0.038508 (0.057980)	0.033575 / 0.023109 (0.010465)	0.375688 / 0.275898 (0.099790)	0.412393 / 0.323480 (0.088913)	0.006050 / 0.007986 (-0.001936)	0.004424 / 0.004328 (0.000095)	0.073102 / 0.004250 (0.068852)	0.052672 / 0.037052 (0.015620)	0.379352 / 0.258489 (0.120862)	0.436065 / 0.293841 (0.142224)	0.036594 / 0.128546 (-0.091952)	0.012380 / 0.075646 (-0.063266)	0.332899 / 0.419271 (-0.086373)	0.048859 / 0.043533 (0.005326)	0.373215 / 0.255139 (0.118076)	0.386990 / 0.283200 (0.103791)	0.105166 / 0.141683 (-0.036517)	1.490762 / 1.452155 (0.038607)	1.611310 / 1.492716 (0.118593)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.333142 / 0.018006 (0.315136)	0.537137 / 0.000490 (0.536647)	0.000452 / 0.000200 (0.000252)	0.000063 / 0.000054 (0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030368 / 0.037411 (-0.007043)	0.109608 / 0.014526 (0.095083)	0.124220 / 0.176557 (-0.052336)	0.162834 / 0.737135 (-0.574301)	0.128037 / 0.296338 (-0.168302)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.440991 / 0.215209 (0.225782)	4.400825 / 2.077655 (2.323170)	2.158768 / 1.504120 (0.654648)	1.968158 / 1.541195 (0.426963)	2.085115 / 1.468490 (0.616625)	0.710757 / 4.584777 (-3.874020)	3.835441 / 3.745712 (0.089729)	2.204118 / 5.269862 (-3.065744)	1.378909 / 4.565676 (-3.186767)	0.089149 / 0.424275 (-0.335126)	0.013066 / 0.007607 (0.005459)	0.539165 / 0.226044 (0.313121)	5.414176 / 2.268929 (3.145248)	2.677020 / 55.444624 (-52.767604)	2.328334 / 6.876477 (-4.548143)	2.518933 / 2.142072 (0.376860)	0.840902 / 4.805227 (-3.964325)	0.170365 / 6.500664 (-6.330299)	0.063909 / 0.075469 (-0.011561)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.237205 / 1.841788 (-0.604583)	15.678776 / 8.074308 (7.604468)	14.118576 / 10.191392 (3.927184)	0.167236 / 0.680424 (-0.513188)	0.018177 / 0.534201 (-0.516024)	0.426680 / 0.579283 (-0.152603)	0.425126 / 0.434364 (-0.009238)	0.501755 / 0.540337 (-0.038582)	0.592754 / 1.386936 (-0.794182)

HuggingFaceDocBuilderDev · 2023-01-05T18:17:18Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-01-25T19:25:42Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008708 / 0.011353 (-0.002645)	0.004462 / 0.011008 (-0.006546)	0.100159 / 0.038508 (0.061651)	0.029543 / 0.023109 (0.006434)	0.304056 / 0.275898 (0.028158)	0.367098 / 0.323480 (0.043618)	0.007049 / 0.007986 (-0.000937)	0.003294 / 0.004328 (-0.001034)	0.076954 / 0.004250 (0.072703)	0.036850 / 0.037052 (-0.000202)	0.307556 / 0.258489 (0.049067)	0.348327 / 0.293841 (0.054486)	0.033520 / 0.128546 (-0.095026)	0.011312 / 0.075646 (-0.064334)	0.317588 / 0.419271 (-0.101684)	0.040196 / 0.043533 (-0.003337)	0.298330 / 0.255139 (0.043191)	0.333821 / 0.283200 (0.050622)	0.086584 / 0.141683 (-0.055099)	1.480205 / 1.452155 (0.028050)	1.520975 / 1.492716 (0.028259)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.186641 / 0.018006 (0.168635)	0.414420 / 0.000490 (0.413930)	0.003021 / 0.000200 (0.002821)	0.000073 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022953 / 0.037411 (-0.014458)	0.097338 / 0.014526 (0.082812)	0.104985 / 0.176557 (-0.071572)	0.139208 / 0.737135 (-0.597927)	0.108031 / 0.296338 (-0.188307)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.417969 / 0.215209 (0.202759)	4.173189 / 2.077655 (2.095534)	1.862813 / 1.504120 (0.358693)	1.653226 / 1.541195 (0.112031)	1.725917 / 1.468490 (0.257426)	0.701038 / 4.584777 (-3.883739)	3.350500 / 3.745712 (-0.395213)	1.913156 / 5.269862 (-3.356705)	1.267597 / 4.565676 (-3.298079)	0.082197 / 0.424275 (-0.342078)	0.012499 / 0.007607 (0.004892)	0.520173 / 0.226044 (0.294128)	5.219981 / 2.268929 (2.951053)	2.306029 / 55.444624 (-53.138595)	1.948169 / 6.876477 (-4.928307)	2.013160 / 2.142072 (-0.128912)	0.813325 / 4.805227 (-3.991902)	0.149729 / 6.500664 (-6.350935)	0.065492 / 0.075469 (-0.009977)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.194163 / 1.841788 (-0.647625)	13.739562 / 8.074308 (5.665254)	13.881988 / 10.191392 (3.690596)	0.138180 / 0.680424 (-0.542244)	0.029031 / 0.534201 (-0.505170)	0.387858 / 0.579283 (-0.191425)	0.395171 / 0.434364 (-0.039193)	0.446349 / 0.540337 (-0.093988)	0.527073 / 1.386936 (-0.859863)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006504 / 0.011353 (-0.004849)	0.004564 / 0.011008 (-0.006444)	0.099108 / 0.038508 (0.060599)	0.027420 / 0.023109 (0.004311)	0.340712 / 0.275898 (0.064814)	0.391613 / 0.323480 (0.068133)	0.004977 / 0.007986 (-0.003009)	0.003375 / 0.004328 (-0.000953)	0.076403 / 0.004250 (0.072152)	0.036650 / 0.037052 (-0.000402)	0.341948 / 0.258489 (0.083459)	0.392065 / 0.293841 (0.098224)	0.031802 / 0.128546 (-0.096745)	0.011659 / 0.075646 (-0.063987)	0.320099 / 0.419271 (-0.099173)	0.041615 / 0.043533 (-0.001918)	0.342125 / 0.255139 (0.086986)	0.372833 / 0.283200 (0.089633)	0.089032 / 0.141683 (-0.052650)	1.486691 / 1.452155 (0.034536)	1.567326 / 1.492716 (0.074610)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.193123 / 0.018006 (0.175117)	0.404062 / 0.000490 (0.403573)	0.003460 / 0.000200 (0.003260)	0.000079 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024565 / 0.037411 (-0.012846)	0.098958 / 0.014526 (0.084432)	0.108701 / 0.176557 (-0.067855)	0.142567 / 0.737135 (-0.594569)	0.111048 / 0.296338 (-0.185290)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474549 / 0.215209 (0.259340)	4.753776 / 2.077655 (2.676121)	2.435528 / 1.504120 (0.931409)	2.234491 / 1.541195 (0.693297)	2.269474 / 1.468490 (0.800984)	0.695636 / 4.584777 (-3.889141)	3.367816 / 3.745712 (-0.377896)	1.854828 / 5.269862 (-3.415034)	1.159729 / 4.565676 (-3.405948)	0.082267 / 0.424275 (-0.342008)	0.012483 / 0.007607 (0.004876)	0.578490 / 0.226044 (0.352446)	5.814490 / 2.268929 (3.545561)	2.893310 / 55.444624 (-52.551314)	2.540555 / 6.876477 (-4.335922)	2.573705 / 2.142072 (0.431633)	0.800545 / 4.805227 (-4.004682)	0.151306 / 6.500664 (-6.349358)	0.067925 / 0.075469 (-0.007544)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.294645 / 1.841788 (-0.547142)	13.641842 / 8.074308 (5.567534)	14.015200 / 10.191392 (3.823808)	0.128829 / 0.680424 (-0.551595)	0.016870 / 0.534201 (-0.517331)	0.389137 / 0.579283 (-0.190146)	0.388384 / 0.434364 (-0.045980)	0.447711 / 0.540337 (-0.092627)	0.540637 / 1.386936 (-0.846299)

github-actions · 2023-01-25T19:29:18Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.012282 / 0.011353 (0.000929)	0.006328 / 0.011008 (-0.004680)	0.129666 / 0.038508 (0.091158)	0.039403 / 0.023109 (0.016294)	0.375464 / 0.275898 (0.099566)	0.463167 / 0.323480 (0.139687)	0.010329 / 0.007986 (0.002344)	0.005111 / 0.004328 (0.000782)	0.108727 / 0.004250 (0.104476)	0.047156 / 0.037052 (0.010103)	0.381869 / 0.258489 (0.123380)	0.441936 / 0.293841 (0.148095)	0.054750 / 0.128546 (-0.073796)	0.019809 / 0.075646 (-0.055837)	0.436389 / 0.419271 (0.017118)	0.066585 / 0.043533 (0.023052)	0.402108 / 0.255139 (0.146969)	0.424571 / 0.283200 (0.141371)	0.118326 / 0.141683 (-0.023357)	1.870175 / 1.452155 (0.418020)	1.878720 / 1.492716 (0.386004)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.012863 / 0.018006 (-0.005144)	0.528670 / 0.000490 (0.528181)	0.006057 / 0.000200 (0.005857)	0.000124 / 0.000054 (0.000069)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030091 / 0.037411 (-0.007320)	0.136143 / 0.014526 (0.121618)	0.148931 / 0.176557 (-0.027626)	0.179578 / 0.737135 (-0.557558)	0.144528 / 0.296338 (-0.151810)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.594080 / 0.215209 (0.378871)	6.029101 / 2.077655 (3.951446)	2.443084 / 1.504120 (0.938964)	2.123949 / 1.541195 (0.582754)	2.183021 / 1.468490 (0.714531)	1.235453 / 4.584777 (-3.349324)	5.585121 / 3.745712 (1.839408)	3.208510 / 5.269862 (-2.061351)	2.090334 / 4.565676 (-2.475342)	0.150353 / 0.424275 (-0.273922)	0.016787 / 0.007607 (0.009180)	0.797561 / 0.226044 (0.571516)	7.756291 / 2.268929 (5.487363)	3.283638 / 55.444624 (-52.160986)	2.527441 / 6.876477 (-4.349036)	2.590765 / 2.142072 (0.448692)	1.446818 / 4.805227 (-3.358409)	0.250563 / 6.500664 (-6.250101)	0.077919 / 0.075469 (0.002450)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.612022 / 1.841788 (-0.229765)	18.363316 / 8.074308 (10.289008)	22.578570 / 10.191392 (12.387178)	0.232801 / 0.680424 (-0.447623)	0.048232 / 0.534201 (-0.485969)	0.549518 / 0.579283 (-0.029766)	0.624663 / 0.434364 (0.190299)	0.674745 / 0.540337 (0.134408)	0.803489 / 1.386936 (-0.583447)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009872 / 0.011353 (-0.001481)	0.006593 / 0.011008 (-0.004415)	0.139248 / 0.038508 (0.100740)	0.035708 / 0.023109 (0.012598)	0.551335 / 0.275898 (0.275437)	0.544995 / 0.323480 (0.221515)	0.007085 / 0.007986 (-0.000900)	0.004742 / 0.004328 (0.000413)	0.095823 / 0.004250 (0.091572)	0.051674 / 0.037052 (0.014621)	0.463405 / 0.258489 (0.204916)	0.640392 / 0.293841 (0.346551)	0.055242 / 0.128546 (-0.073304)	0.022602 / 0.075646 (-0.053044)	0.419171 / 0.419271 (-0.000100)	0.062986 / 0.043533 (0.019453)	0.503683 / 0.255139 (0.248544)	0.568719 / 0.283200 (0.285519)	0.113906 / 0.141683 (-0.027777)	1.825248 / 1.452155 (0.373094)	1.985667 / 1.492716 (0.492951)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.237478 / 0.018006 (0.219472)	0.528861 / 0.000490 (0.528371)	0.008507 / 0.000200 (0.008307)	0.000158 / 0.000054 (0.000103)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033536 / 0.037411 (-0.003875)	0.144202 / 0.014526 (0.129677)	0.139472 / 0.176557 (-0.037084)	0.184540 / 0.737135 (-0.552596)	0.147818 / 0.296338 (-0.148520)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.671654 / 0.215209 (0.456445)	6.616368 / 2.077655 (4.538713)	2.805634 / 1.504120 (1.301514)	2.482890 / 1.541195 (0.941695)	2.547686 / 1.468490 (1.079195)	1.289169 / 4.584777 (-3.295608)	5.551436 / 3.745712 (1.805724)	5.228500 / 5.269862 (-0.041362)	2.456706 / 4.565676 (-2.108970)	0.148556 / 0.424275 (-0.275720)	0.015290 / 0.007607 (0.007683)	0.837090 / 0.226044 (0.611045)	8.373561 / 2.268929 (6.104632)	3.663910 / 55.444624 (-51.780714)	2.927117 / 6.876477 (-3.949360)	2.976785 / 2.142072 (0.834712)	1.501618 / 4.805227 (-3.303609)	0.263321 / 6.500664 (-6.237343)	0.082644 / 0.075469 (0.007175)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.707419 / 1.841788 (-0.134368)	18.371117 / 8.074308 (10.296809)	22.015154 / 10.191392 (11.823762)	0.232066 / 0.680424 (-0.448357)	0.027149 / 0.534201 (-0.507052)	0.544450 / 0.579283 (-0.034833)	0.605134 / 0.434364 (0.170770)	0.656063 / 0.540337 (0.115725)	0.788121 / 1.386936 (-0.598815)

github-actions · 2023-01-25T19:42:54Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008952 / 0.011353 (-0.002401)	0.005592 / 0.011008 (-0.005416)	0.101138 / 0.038508 (0.062630)	0.035573 / 0.023109 (0.012464)	0.295959 / 0.275898 (0.020060)	0.365347 / 0.323480 (0.041867)	0.008136 / 0.007986 (0.000150)	0.004479 / 0.004328 (0.000150)	0.078806 / 0.004250 (0.074556)	0.045180 / 0.037052 (0.008127)	0.321687 / 0.258489 (0.063198)	0.345874 / 0.293841 (0.052033)	0.038720 / 0.128546 (-0.089826)	0.012534 / 0.075646 (-0.063112)	0.335571 / 0.419271 (-0.083700)	0.049048 / 0.043533 (0.005515)	0.294756 / 0.255139 (0.039617)	0.327496 / 0.283200 (0.044296)	0.109181 / 0.141683 (-0.032502)	1.417068 / 1.452155 (-0.035087)	1.455473 / 1.492716 (-0.037244)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267774 / 0.018006 (0.249768)	0.538546 / 0.000490 (0.538056)	0.001755 / 0.000200 (0.001555)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026839 / 0.037411 (-0.010572)	0.105862 / 0.014526 (0.091336)	0.118278 / 0.176557 (-0.058279)	0.157926 / 0.737135 (-0.579209)	0.124700 / 0.296338 (-0.171638)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.399060 / 0.215209 (0.183851)	3.991409 / 2.077655 (1.913754)	1.763569 / 1.504120 (0.259449)	1.579602 / 1.541195 (0.038407)	1.652928 / 1.468490 (0.184438)	0.692962 / 4.584777 (-3.891815)	3.784635 / 3.745712 (0.038922)	3.249341 / 5.269862 (-2.020521)	1.815711 / 4.565676 (-2.749966)	0.084384 / 0.424275 (-0.339891)	0.012546 / 0.007607 (0.004939)	0.521397 / 0.226044 (0.295352)	5.075824 / 2.268929 (2.806895)	2.258353 / 55.444624 (-53.186272)	1.925220 / 6.876477 (-4.951256)	2.002821 / 2.142072 (-0.139252)	0.830507 / 4.805227 (-3.974720)	0.165845 / 6.500664 (-6.334819)	0.063905 / 0.075469 (-0.011565)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.198726 / 1.841788 (-0.643061)	14.804448 / 8.074308 (6.730139)	12.855167 / 10.191392 (2.663775)	0.167932 / 0.680424 (-0.512492)	0.028643 / 0.534201 (-0.505558)	0.441224 / 0.579283 (-0.138059)	0.434924 / 0.434364 (0.000560)	0.516188 / 0.540337 (-0.024150)	0.605017 / 1.386936 (-0.781919)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007031 / 0.011353 (-0.004322)	0.005157 / 0.011008 (-0.005851)	0.086943 / 0.038508 (0.048434)	0.031377 / 0.023109 (0.008268)	0.334810 / 0.275898 (0.058912)	0.368590 / 0.323480 (0.045110)	0.005973 / 0.007986 (-0.002013)	0.004173 / 0.004328 (-0.000155)	0.067033 / 0.004250 (0.062783)	0.054070 / 0.037052 (0.017018)	0.332232 / 0.258489 (0.073743)	0.384982 / 0.293841 (0.091141)	0.034023 / 0.128546 (-0.094524)	0.011301 / 0.075646 (-0.064345)	0.295644 / 0.419271 (-0.123628)	0.045589 / 0.043533 (0.002056)	0.330739 / 0.255139 (0.075600)	0.352841 / 0.283200 (0.069642)	0.104829 / 0.141683 (-0.036854)	1.329360 / 1.452155 (-0.122794)	1.437956 / 1.492716 (-0.054760)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.299187 / 0.018006 (0.281181)	0.563407 / 0.000490 (0.562917)	0.004179 / 0.000200 (0.003979)	0.000114 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027405 / 0.037411 (-0.010006)	0.097498 / 0.014526 (0.082972)	0.114265 / 0.176557 (-0.062292)	0.146823 / 0.737135 (-0.590313)	0.117948 / 0.296338 (-0.178391)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.378756 / 0.215209 (0.163547)	3.774804 / 2.077655 (1.697150)	1.804149 / 1.504120 (0.300029)	1.626312 / 1.541195 (0.085117)	1.731111 / 1.468490 (0.262620)	0.633493 / 4.584777 (-3.951284)	3.488220 / 3.745712 (-0.257492)	3.064710 / 5.269862 (-2.205151)	1.690647 / 4.565676 (-2.875029)	0.076093 / 0.424275 (-0.348182)	0.010820 / 0.007607 (0.003213)	0.465091 / 0.226044 (0.239046)	4.676842 / 2.268929 (2.407913)	2.297381 / 55.444624 (-53.147244)	1.960355 / 6.876477 (-4.916122)	1.983742 / 2.142072 (-0.158330)	0.739525 / 4.805227 (-4.065702)	0.152663 / 6.500664 (-6.348001)	0.057316 / 0.075469 (-0.018153)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.104721 / 1.841788 (-0.737067)	14.577171 / 8.074308 (6.502863)	13.680402 / 10.191392 (3.489010)	0.182234 / 0.680424 (-0.498190)	0.018853 / 0.534201 (-0.515348)	0.426194 / 0.579283 (-0.153089)	0.429202 / 0.434364 (-0.005162)	0.543125 / 0.540337 (0.002788)	0.645887 / 1.386936 (-0.741049)

github-actions · 2023-01-25T19:53:04Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010055 / 0.011353 (-0.001298)	0.005576 / 0.011008 (-0.005432)	0.100059 / 0.038508 (0.061551)	0.038535 / 0.023109 (0.015425)	0.297538 / 0.275898 (0.021640)	0.368117 / 0.323480 (0.044637)	0.008540 / 0.007986 (0.000555)	0.004469 / 0.004328 (0.000141)	0.075801 / 0.004250 (0.071551)	0.046604 / 0.037052 (0.009552)	0.307242 / 0.258489 (0.048753)	0.343949 / 0.293841 (0.050108)	0.039353 / 0.128546 (-0.089194)	0.012446 / 0.075646 (-0.063200)	0.334628 / 0.419271 (-0.084643)	0.051628 / 0.043533 (0.008095)	0.298726 / 0.255139 (0.043587)	0.316010 / 0.283200 (0.032810)	0.120564 / 0.141683 (-0.021119)	1.459396 / 1.452155 (0.007241)	1.493682 / 1.492716 (0.000965)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.011702 / 0.018006 (-0.006304)	0.570261 / 0.000490 (0.569771)	0.003760 / 0.000200 (0.003560)	0.000091 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028806 / 0.037411 (-0.008605)	0.112150 / 0.014526 (0.097625)	0.123140 / 0.176557 (-0.053417)	0.173055 / 0.737135 (-0.564080)	0.130060 / 0.296338 (-0.166279)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.398216 / 0.215209 (0.183007)	3.978677 / 2.077655 (1.901022)	1.754229 / 1.504120 (0.250109)	1.561892 / 1.541195 (0.020697)	1.679138 / 1.468490 (0.210648)	0.690254 / 4.584777 (-3.894523)	3.817698 / 3.745712 (0.071986)	2.177854 / 5.269862 (-3.092008)	1.361860 / 4.565676 (-3.203816)	0.084108 / 0.424275 (-0.340167)	0.012640 / 0.007607 (0.005033)	0.504385 / 0.226044 (0.278341)	5.034103 / 2.268929 (2.765174)	2.254032 / 55.444624 (-53.190593)	1.910439 / 6.876477 (-4.966038)	2.003515 / 2.142072 (-0.138558)	0.839747 / 4.805227 (-3.965480)	0.165654 / 6.500664 (-6.335010)	0.063483 / 0.075469 (-0.011986)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.187521 / 1.841788 (-0.654267)	15.381121 / 8.074308 (7.306812)	14.579418 / 10.191392 (4.388026)	0.199221 / 0.680424 (-0.481202)	0.029335 / 0.534201 (-0.504866)	0.443159 / 0.579283 (-0.136124)	0.447772 / 0.434364 (0.013408)	0.545071 / 0.540337 (0.004733)	0.650494 / 1.386936 (-0.736442)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007675 / 0.011353 (-0.003677)	0.005364 / 0.011008 (-0.005644)	0.097921 / 0.038508 (0.059413)	0.033645 / 0.023109 (0.010536)	0.404818 / 0.275898 (0.128920)	0.429983 / 0.323480 (0.106503)	0.006106 / 0.007986 (-0.001879)	0.005281 / 0.004328 (0.000953)	0.073762 / 0.004250 (0.069512)	0.053065 / 0.037052 (0.016012)	0.400657 / 0.258489 (0.142168)	0.447743 / 0.293841 (0.153902)	0.036782 / 0.128546 (-0.091765)	0.012593 / 0.075646 (-0.063054)	0.332825 / 0.419271 (-0.086446)	0.049424 / 0.043533 (0.005891)	0.400397 / 0.255139 (0.145258)	0.414794 / 0.283200 (0.131594)	0.106555 / 0.141683 (-0.035128)	1.466917 / 1.452155 (0.014762)	1.571351 / 1.492716 (0.078635)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.254337 / 0.018006 (0.236331)	0.568360 / 0.000490 (0.567870)	0.000445 / 0.000200 (0.000245)	0.000059 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031044 / 0.037411 (-0.006367)	0.112282 / 0.014526 (0.097756)	0.127205 / 0.176557 (-0.049352)	0.166551 / 0.737135 (-0.570584)	0.130520 / 0.296338 (-0.165818)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.442906 / 0.215209 (0.227697)	4.430218 / 2.077655 (2.352563)	2.287251 / 1.504120 (0.783132)	2.112345 / 1.541195 (0.571150)	2.240952 / 1.468490 (0.772462)	0.713800 / 4.584777 (-3.870977)	3.884161 / 3.745712 (0.138449)	2.166901 / 5.269862 (-3.102960)	1.374490 / 4.565676 (-3.191187)	0.087548 / 0.424275 (-0.336727)	0.012369 / 0.007607 (0.004761)	0.540783 / 0.226044 (0.314739)	5.396187 / 2.268929 (3.127258)	2.779636 / 55.444624 (-52.664988)	2.434220 / 6.876477 (-4.442257)	2.508180 / 2.142072 (0.366107)	0.852470 / 4.805227 (-3.952757)	0.171266 / 6.500664 (-6.329398)	0.065463 / 0.075469 (-0.010006)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.241720 / 1.841788 (-0.600067)	15.332568 / 8.074308 (7.258260)	13.688723 / 10.191392 (3.497331)	0.145150 / 0.680424 (-0.535273)	0.017694 / 0.534201 (-0.516507)	0.426078 / 0.579283 (-0.153205)	0.441189 / 0.434364 (0.006825)	0.540284 / 0.540337 (-0.000054)	0.657548 / 1.386936 (-0.729388)

github-actions · 2023-01-25T19:56:22Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008604 / 0.011353 (-0.002749)	0.004566 / 0.011008 (-0.006442)	0.099607 / 0.038508 (0.061099)	0.029628 / 0.023109 (0.006519)	0.300481 / 0.275898 (0.024583)	0.342596 / 0.323480 (0.019116)	0.007003 / 0.007986 (-0.000982)	0.003408 / 0.004328 (-0.000920)	0.079076 / 0.004250 (0.074826)	0.034104 / 0.037052 (-0.002948)	0.303856 / 0.258489 (0.045367)	0.348729 / 0.293841 (0.054888)	0.033752 / 0.128546 (-0.094794)	0.011497 / 0.075646 (-0.064149)	0.321568 / 0.419271 (-0.097704)	0.041472 / 0.043533 (-0.002061)	0.303396 / 0.255139 (0.048257)	0.331121 / 0.283200 (0.047921)	0.086203 / 0.141683 (-0.055480)	1.476995 / 1.452155 (0.024840)	1.539428 / 1.492716 (0.046712)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.215810 / 0.018006 (0.197803)	0.414292 / 0.000490 (0.413802)	0.000388 / 0.000200 (0.000188)	0.000058 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023441 / 0.037411 (-0.013970)	0.098463 / 0.014526 (0.083938)	0.105435 / 0.176557 (-0.071121)	0.139736 / 0.737135 (-0.597399)	0.109467 / 0.296338 (-0.186872)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.418244 / 0.215209 (0.203035)	4.160693 / 2.077655 (2.083039)	1.878895 / 1.504120 (0.374775)	1.679338 / 1.541195 (0.138143)	1.730384 / 1.468490 (0.261894)	0.688603 / 4.584777 (-3.896174)	3.393542 / 3.745712 (-0.352170)	1.901337 / 5.269862 (-3.368525)	1.447269 / 4.565676 (-3.118408)	0.083003 / 0.424275 (-0.341272)	0.012574 / 0.007607 (0.004967)	0.526363 / 0.226044 (0.300318)	5.275159 / 2.268929 (3.006230)	2.323642 / 55.444624 (-53.120982)	1.982929 / 6.876477 (-4.893548)	2.014081 / 2.142072 (-0.127991)	0.809466 / 4.805227 (-3.995761)	0.149038 / 6.500664 (-6.351626)	0.064394 / 0.075469 (-0.011075)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.207439 / 1.841788 (-0.634349)	13.691048 / 8.074308 (5.616740)	13.880965 / 10.191392 (3.689573)	0.148553 / 0.680424 (-0.531871)	0.028397 / 0.534201 (-0.505804)	0.391818 / 0.579283 (-0.187465)	0.407181 / 0.434364 (-0.027183)	0.481163 / 0.540337 (-0.059175)	0.570689 / 1.386936 (-0.816247)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006361 / 0.011353 (-0.004992)	0.004520 / 0.011008 (-0.006488)	0.097679 / 0.038508 (0.059171)	0.027223 / 0.023109 (0.004113)	0.407966 / 0.275898 (0.132068)	0.439868 / 0.323480 (0.116388)	0.004625 / 0.007986 (-0.003360)	0.004039 / 0.004328 (-0.000289)	0.074548 / 0.004250 (0.070298)	0.034957 / 0.037052 (-0.002095)	0.412762 / 0.258489 (0.154273)	0.449716 / 0.293841 (0.155875)	0.031272 / 0.128546 (-0.097274)	0.011598 / 0.075646 (-0.064049)	0.320922 / 0.419271 (-0.098349)	0.041250 / 0.043533 (-0.002283)	0.411439 / 0.255139 (0.156300)	0.429722 / 0.283200 (0.146523)	0.087161 / 0.141683 (-0.054522)	1.512573 / 1.452155 (0.060418)	1.569385 / 1.492716 (0.076668)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222612 / 0.018006 (0.204606)	0.409086 / 0.000490 (0.408596)	0.004246 / 0.000200 (0.004046)	0.000083 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024324 / 0.037411 (-0.013087)	0.099055 / 0.014526 (0.084530)	0.106809 / 0.176557 (-0.069748)	0.141275 / 0.737135 (-0.595860)	0.109426 / 0.296338 (-0.186913)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.469736 / 0.215209 (0.254527)	4.686900 / 2.077655 (2.609246)	2.413392 / 1.504120 (0.909272)	2.217366 / 1.541195 (0.676171)	2.266957 / 1.468490 (0.798467)	0.698647 / 4.584777 (-3.886129)	3.389317 / 3.745712 (-0.356395)	1.862315 / 5.269862 (-3.407546)	1.160931 / 4.565676 (-3.404746)	0.082829 / 0.424275 (-0.341446)	0.012627 / 0.007607 (0.005020)	0.568027 / 0.226044 (0.341983)	5.683220 / 2.268929 (3.414291)	2.865701 / 55.444624 (-52.578924)	2.522401 / 6.876477 (-4.354076)	2.542395 / 2.142072 (0.400323)	0.801224 / 4.805227 (-4.004003)	0.149946 / 6.500664 (-6.350718)	0.065447 / 0.075469 (-0.010023)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.283756 / 1.841788 (-0.558032)	13.903662 / 8.074308 (5.829354)	13.238389 / 10.191392 (3.046997)	0.142304 / 0.680424 (-0.538120)	0.016922 / 0.534201 (-0.517279)	0.377797 / 0.579283 (-0.201487)	0.382460 / 0.434364 (-0.051904)	0.464645 / 0.540337 (-0.075692)	0.556270 / 1.386936 (-0.830666)

stevhliu

This is so cool, I learned a lot reading this. I'm sure it'll be super valuable and welcomed by the community! 😄

I think this would be more of a Conceptual Guide doc since it is more explanatory and compares a Dataset with an IterableDataset. It’s not necessarily a how-to for doing something; it discusses and explains the two types of datasets. There are definitely places in the docs where we can add a link to this doc, though, to build up the user's understanding of this topic. For example, in the Know your dataset tutorial, we only introduce the regular Dataset object and not the IterableDataset. We can add a section there for IterableDataset and then link to this doc that explains the difference between the two 🙂

stevhliu · 2023-01-25T22:32:01Z

docs/source/dataset_vs_iterable_dataset.mdx

+
+## Downloading and streaming
+
+When you have a regular "map-style" [`Dataset`], you can access it using `my_dataset[0]`: we have what we call "random access" to the rows.


What is meant by a “map-style” Dataset? If I understand correctly, this is just a regular Dataset. So it might be easier for users to understand if we don’t use this specific term and just use Dataset or if we define what we mean by “map-style” (unless this is commonly known jargon, in which case ignore this haha).

Yes, it refers to datasets with random access, i.e. datasets that allow you to do my_dataset[0]. I'll define it properly.
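
For readers unfamiliar with the term, the distinction can be sketched in plain Python. This is only an illustration under stated assumptions: `MapStyleRows` and `IterableRows` are hypothetical names, not classes from `datasets`, and the real `Dataset` and `IterableDataset` do considerably more.

```python
from typing import Dict, Iterator, List


class MapStyleRows:
    """Hypothetical map-style container: rows are addressable by index."""

    def __init__(self, rows: List[Dict[str, int]]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Dict[str, int]:
        # Random access: any row can be fetched directly.
        return self._rows[index]


class IterableRows:
    """Hypothetical iterable container: rows can only be visited in order."""

    def __init__(self, rows: List[Dict[str, int]]) -> None:
        self._rows = rows

    def __iter__(self) -> Iterator[Dict[str, int]]:
        # Sequential access only: no __getitem__ is defined.
        yield from self._rows


rows = [{"x": i} for i in range(5)]
map_style = MapStyleRows(rows)
iterable = IterableRows(rows)

third_row = map_style[2]          # direct lookup by position
first_row = next(iter(iterable))  # must start from the beginning
```

The only structural difference in this sketch is whether `__getitem__` exists, which is what "random access" refers to in the reply above.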

stevhliu · 2023-01-25T22:32:48Z

docs/source/dataset_vs_iterable_dataset.mdx

+print(my_dataset[0])
+```
+
+To not have to wait for the conversion to Arrow, you can define an iterable dataset by streaming from your local files.


What is the benefit of not converting Dataset to Arrow (obvs it’s faster, but it’d be good to mention this explicitly for the user)?

It's faster, it saves disk space, and you can modify your original data and re-instantiate the dataset without having to reconvert it. I'll mention this!
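
The benefits listed in this reply can be illustrated with a plain-Python generator that reads a local file lazily. This is a rough sketch, not how `datasets` implements streaming: the helper `stream_jsonl` and the file `data.jsonl` are made up for the example.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def stream_jsonl(path: str) -> Iterator[Dict[str, object]]:
    """Hypothetical helper: yield one parsed example per line, lazily."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Write a tiny JSON Lines file to stand in for the user's local data.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "data.jsonl")
with open(data_path, "w", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"id": i, "text": f"example {i}"}) + "\n")

# No conversion step and no second copy on disk:
# the first example is available as soon as iteration starts.
first_example = next(stream_jsonl(data_path))

# After editing the source file, re-creating the generator is enough.
with open(data_path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"id": 3, "text": "example 3"}) + "\n")
all_examples = list(stream_jsonl(data_path))
```

The trade-off, as discussed earlier in the thread, is that a generator like this offers no random access to rows.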

lhoestq · 2023-01-26T10:23:59Z

I think this would be more of a Conceptual Guide doc since this is more explanatory and compares the differences between a Dataset and an IterableDataset

sounds good to me !

There are definitely places in the docs where we can add a link to this doc, though, to build up the user's understanding of this topic. For example, in the Know your dataset tutorial, we only introduce the regular Dataset object and not the IterableDataset. We can add a section there for IterableDataset and then link to this doc that explains the difference between the two 🙂

good idea, thanks :)

stevhliu · 2023-01-27T01:12:03Z

I'll open a PR to add a section on IterableDataset in the tutorial, and once you're done editing this doc I can give it a final polish! 😄

lhoestq · 2023-01-27T10:45:17Z

I moved the doc page to conceptual guides and took your suggestions into account :)

I think this is ready for final review now

github-actions · 2023-01-27T10:51:29Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009890 / 0.011353 (-0.001463)	0.005156 / 0.011008 (-0.005852)	0.099493 / 0.038508 (0.060984)	0.036671 / 0.023109 (0.013562)	0.304686 / 0.275898 (0.028788)	0.339070 / 0.323480 (0.015590)	0.008466 / 0.007986 (0.000481)	0.005863 / 0.004328 (0.001534)	0.075082 / 0.004250 (0.070832)	0.045926 / 0.037052 (0.008874)	0.303157 / 0.258489 (0.044668)	0.363710 / 0.293841 (0.069870)	0.038497 / 0.128546 (-0.090049)	0.012063 / 0.075646 (-0.063583)	0.334463 / 0.419271 (-0.084808)	0.048161 / 0.043533 (0.004628)	0.300431 / 0.255139 (0.045292)	0.330344 / 0.283200 (0.047145)	0.105509 / 0.141683 (-0.036174)	1.475242 / 1.452155 (0.023087)	1.550624 / 1.492716 (0.057908)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.245749 / 0.018006 (0.227743)	0.575091 / 0.000490 (0.574601)	0.001556 / 0.000200 (0.001357)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030447 / 0.037411 (-0.006964)	0.110982 / 0.014526 (0.096456)	0.126760 / 0.176557 (-0.049797)	0.173375 / 0.737135 (-0.563760)	0.128799 / 0.296338 (-0.167539)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.392861 / 0.215209 (0.177651)	3.911231 / 2.077655 (1.833576)	1.757413 / 1.504120 (0.253293)	1.563287 / 1.541195 (0.022093)	1.658678 / 1.468490 (0.190188)	0.677244 / 4.584777 (-3.907533)	3.754917 / 3.745712 (0.009205)	3.779417 / 5.269862 (-1.490444)	1.993159 / 4.565676 (-2.572517)	0.084425 / 0.424275 (-0.339850)	0.012500 / 0.007607 (0.004893)	0.501788 / 0.226044 (0.275743)	5.003173 / 2.268929 (2.734244)	2.273547 / 55.444624 (-53.171077)	1.909766 / 6.876477 (-4.966711)	1.968287 / 2.142072 (-0.173785)	0.834895 / 4.805227 (-3.970332)	0.165312 / 6.500664 (-6.335352)	0.062202 / 0.075469 (-0.013267)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.203080 / 1.841788 (-0.638708)	15.158284 / 8.074308 (7.083976)	14.174484 / 10.191392 (3.983092)	0.171540 / 0.680424 (-0.508883)	0.028604 / 0.534201 (-0.505597)	0.438379 / 0.579283 (-0.140904)	0.429447 / 0.434364 (-0.004917)	0.540979 / 0.540337 (0.000642)	0.630322 / 1.386936 (-0.756614)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007600 / 0.011353 (-0.003753)	0.005400 / 0.011008 (-0.005608)	0.097983 / 0.038508 (0.059475)	0.033407 / 0.023109 (0.010297)	0.384429 / 0.275898 (0.108531)	0.415880 / 0.323480 (0.092400)	0.006085 / 0.007986 (-0.001900)	0.004330 / 0.004328 (0.000002)	0.074654 / 0.004250 (0.070403)	0.053076 / 0.037052 (0.016024)	0.383958 / 0.258489 (0.125469)	0.427289 / 0.293841 (0.133448)	0.036710 / 0.128546 (-0.091836)	0.012400 / 0.075646 (-0.063246)	0.332712 / 0.419271 (-0.086560)	0.058390 / 0.043533 (0.014857)	0.377747 / 0.255139 (0.122608)	0.398997 / 0.283200 (0.115798)	0.117370 / 0.141683 (-0.024313)	1.464211 / 1.452155 (0.012057)	1.596465 / 1.492716 (0.103749)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212989 / 0.018006 (0.194983)	0.554968 / 0.000490 (0.554479)	0.004305 / 0.000200 (0.004105)	0.000116 / 0.000054 (0.000061)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029167 / 0.037411 (-0.008244)	0.109156 / 0.014526 (0.094631)	0.122575 / 0.176557 (-0.053982)	0.163058 / 0.737135 (-0.574077)	0.127431 / 0.296338 (-0.168908)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.445395 / 0.215209 (0.230185)	4.447534 / 2.077655 (2.369879)	2.259186 / 1.504120 (0.755066)	2.082956 / 1.541195 (0.541761)	2.259126 / 1.468490 (0.790636)	0.692271 / 4.584777 (-3.892506)	3.795759 / 3.745712 (0.050047)	3.603000 / 5.269862 (-1.666862)	1.948556 / 4.565676 (-2.617120)	0.084589 / 0.424275 (-0.339687)	0.012751 / 0.007607 (0.005144)	0.544783 / 0.226044 (0.318738)	5.452278 / 2.268929 (3.183349)	2.809467 / 55.444624 (-52.635157)	2.479297 / 6.876477 (-4.397180)	2.587756 / 2.142072 (0.445683)	0.832258 / 4.805227 (-3.972970)	0.167424 / 6.500664 (-6.333240)	0.066064 / 0.075469 (-0.009405)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.262719 / 1.841788 (-0.579069)	15.917869 / 8.074308 (7.843561)	13.879301 / 10.191392 (3.687909)	0.187712 / 0.680424 (-0.492712)	0.018175 / 0.534201 (-0.516026)	0.425840 / 0.579283 (-0.153443)	0.426164 / 0.434364 (-0.008200)	0.527465 / 0.540337 (-0.012872)	0.629478 / 1.386936 (-0.757458)

github-actions · 2023-01-27T12:11:31Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009064 / 0.011353 (-0.002289)	0.004824 / 0.011008 (-0.006184)	0.100869 / 0.038508 (0.062361)	0.030803 / 0.023109 (0.007694)	0.350880 / 0.275898 (0.074982)	0.423816 / 0.323480 (0.100336)	0.007581 / 0.007986 (-0.000405)	0.003642 / 0.004328 (-0.000686)	0.077682 / 0.004250 (0.073432)	0.039856 / 0.037052 (0.002803)	0.366097 / 0.258489 (0.107608)	0.409226 / 0.293841 (0.115385)	0.033698 / 0.128546 (-0.094848)	0.011730 / 0.075646 (-0.063916)	0.321683 / 0.419271 (-0.097588)	0.041794 / 0.043533 (-0.001739)	0.351175 / 0.255139 (0.096036)	0.374328 / 0.283200 (0.091128)	0.091833 / 0.141683 (-0.049850)	1.507082 / 1.452155 (0.054927)	1.543289 / 1.492716 (0.050572)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.010670 / 0.018006 (-0.007337)	0.429674 / 0.000490 (0.429184)	0.003246 / 0.000200 (0.003046)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025015 / 0.037411 (-0.012397)	0.102155 / 0.014526 (0.087629)	0.107010 / 0.176557 (-0.069546)	0.144265 / 0.737135 (-0.592870)	0.110635 / 0.296338 (-0.185703)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.414211 / 0.215209 (0.199002)	4.125582 / 2.077655 (2.047928)	1.997856 / 1.504120 (0.493736)	1.847676 / 1.541195 (0.306481)	1.994100 / 1.468490 (0.525610)	0.694975 / 4.584777 (-3.889802)	3.373629 / 3.745712 (-0.372083)	2.863255 / 5.269862 (-2.406606)	1.565723 / 4.565676 (-2.999953)	0.082539 / 0.424275 (-0.341736)	0.012650 / 0.007607 (0.005043)	0.522989 / 0.226044 (0.296945)	5.205720 / 2.268929 (2.936792)	2.352292 / 55.444624 (-53.092332)	2.080467 / 6.876477 (-4.796010)	2.231014 / 2.142072 (0.088942)	0.811252 / 4.805227 (-3.993975)	0.149171 / 6.500664 (-6.351493)	0.065207 / 0.075469 (-0.010262)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.203137 / 1.841788 (-0.638651)	14.244903 / 8.074308 (6.170595)	14.454368 / 10.191392 (4.262976)	0.139090 / 0.680424 (-0.541334)	0.028738 / 0.534201 (-0.505463)	0.396394 / 0.579283 (-0.182889)	0.407207 / 0.434364 (-0.027156)	0.478036 / 0.540337 (-0.062302)	0.568488 / 1.386936 (-0.818448)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006878 / 0.011353 (-0.004475)	0.004636 / 0.011008 (-0.006372)	0.099118 / 0.038508 (0.060610)	0.028076 / 0.023109 (0.004967)	0.416097 / 0.275898 (0.140199)	0.451722 / 0.323480 (0.128242)	0.005364 / 0.007986 (-0.002622)	0.003506 / 0.004328 (-0.000822)	0.075791 / 0.004250 (0.071541)	0.041373 / 0.037052 (0.004321)	0.416358 / 0.258489 (0.157869)	0.458440 / 0.293841 (0.164599)	0.031870 / 0.128546 (-0.096676)	0.011751 / 0.075646 (-0.063896)	0.321748 / 0.419271 (-0.097524)	0.041780 / 0.043533 (-0.001752)	0.425037 / 0.255139 (0.169898)	0.444169 / 0.283200 (0.160969)	0.093145 / 0.141683 (-0.048538)	1.472151 / 1.452155 (0.019996)	1.542942 / 1.492716 (0.050226)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224287 / 0.018006 (0.206281)	0.415303 / 0.000490 (0.414813)	0.003180 / 0.000200 (0.002980)	0.000082 / 0.000054 (0.000027)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026377 / 0.037411 (-0.011035)	0.106222 / 0.014526 (0.091696)	0.113873 / 0.176557 (-0.062684)	0.143255 / 0.737135 (-0.593880)	0.112642 / 0.296338 (-0.183697)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.444149 / 0.215209 (0.228940)	4.421434 / 2.077655 (2.343779)	2.082198 / 1.504120 (0.578078)	1.879909 / 1.541195 (0.338715)	1.968526 / 1.468490 (0.500036)	0.697230 / 4.584777 (-3.887546)	3.430800 / 3.745712 (-0.314912)	1.893353 / 5.269862 (-3.376509)	1.173271 / 4.565676 (-3.392406)	0.082636 / 0.424275 (-0.341639)	0.012357 / 0.007607 (0.004750)	0.544008 / 0.226044 (0.317964)	5.465472 / 2.268929 (3.196543)	2.530017 / 55.444624 (-52.914608)	2.178462 / 6.876477 (-4.698014)	2.279570 / 2.142072 (0.137498)	0.804890 / 4.805227 (-4.000337)	0.152091 / 6.500664 (-6.348573)	0.069442 / 0.075469 (-0.006027)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.256722 / 1.841788 (-0.585065)	14.554131 / 8.074308 (6.479823)	13.499913 / 10.191392 (3.308521)	0.144350 / 0.680424 (-0.536074)	0.016977 / 0.534201 (-0.517224)	0.378836 / 0.579283 (-0.200447)	0.392004 / 0.434364 (-0.042360)	0.468423 / 0.540337 (-0.071914)	0.584711 / 1.386936 (-0.802225)

stevhliu

Awesome doc, thanks for sharing all this info!

docs/source/use_with_pytorch.mdx

stevhliu · 2023-01-27T19:10:31Z

docs/source/about_mapstyle_vs_iterable.mdx

@@ -0,0 +1,220 @@
+# Differences between Dataset and IterableDataset
+


Maybe just add a sentence or two here that introduces the topic and scope of the doc. Something like:

There are two types of dataset objects, a Dataset and an IterableDataset. Whichever type of dataset you choose to use or create depends on the size of the dataset. In general, an IterableDataset is ideal for big datasets (think hundreds of GBs!) due to its lazy behavior and speed advantages, while a Dataset is great for everything else. This page will compare the differences between a Dataset and an IterableDataset to help you pick the right dataset object for you.

Sounds good to me!

docs/source/about_mapstyle_vs_iterable.mdx

stevhliu · 2023-01-27T19:44:45Z

docs/source/about_mapstyle_vs_iterable.mdx

+my_iterable_dataset.n_shards  # 1024
+```
+
+Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !


Suggested change

Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !

Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions!

I would remove this sentence altogether. Two existing links in our docs are more than enough :).

src/datasets/arrow_dataset.py

stevhliu · 2023-01-27T19:47:07Z

src/datasets/arrow_dataset.py

+        Returns:
+            [`datasets.IterableDataset`]
+
+        Example:


Love all the example usages here! 😍

mariosasko

Nice!

The code looks good.

Regarding the docs, I think it would be better to add this info as notes/tips/sections to the existing docs (Process/Stream; e.g. a tip under Dataset.shuffle that explains how to make this operation more performant by using to_iterable + shuffle, etc.) rather than introducing a new doc page.
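To make the to_iterable + shuffle tip concrete: an iterable dataset can shuffle the order of its shards and then pass rows through a small shuffle buffer, so it never needs a permutation over the whole dataset. Below is a plain-Python sketch of that idea under stated assumptions. It is not the library's implementation, and split_into_shards, buffer_shuffle and shuffled_stream are hypothetical names chosen for the example.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def split_into_shards(rows: Sequence[int], num_shards: int) -> List[List[int]]:
    """Split rows into num_shards contiguous chunks (hypothetical helper)."""
    # Spread the remainder over the first shards so sizes differ by at most 1.
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(list(rows[start:start + size]))
        start += size
    return shards


def buffer_shuffle(stream: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """Yield items in approximately random order using a fixed-size buffer (buffer_size >= 1)."""
    buffer: List[int] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


def shuffled_stream(shards: List[List[int]], buffer_size: int, seed: int) -> Iterator[int]:
    """Shuffle the shard order, then buffer-shuffle the rows read shard by shard."""
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)
    rows = (row for shard_idx in order for row in shards[shard_idx])
    return buffer_shuffle(rows, buffer_size, rng)


shards = split_into_shards(range(20), num_shards=4)
result = list(shuffled_stream(shards, buffer_size=5, seed=42))
```

The output is a permutation of the input, but only approximately uniform: more shards and a larger buffer_size give better mixing, at the cost of more memory for the buffer.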

src/datasets/arrow_dataset.py

mariosasko · 2023-01-30T18:01:34Z

docs/source/about_mapstyle_vs_iterable.mdx

+my_iterable_dataset.n_shards  # 1024
+```
+
+Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !


I would remove this sentence altogether. Two existing links in our docs are more than enough :).

docs/source/use_with_pytorch.mdx

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

github-actions · 2023-01-31T18:28:57Z

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008542 / 0.011353 (-0.002811)	0.004552 / 0.011008 (-0.006456)	0.100543 / 0.038508 (0.062035)	0.029717 / 0.023109 (0.006608)	0.301948 / 0.275898 (0.026050)	0.360211 / 0.323480 (0.036731)	0.006881 / 0.007986 (-0.001105)	0.003433 / 0.004328 (-0.000896)	0.077760 / 0.004250 (0.073510)	0.037069 / 0.037052 (0.000017)	0.314084 / 0.258489 (0.055595)	0.347759 / 0.293841 (0.053918)	0.033255 / 0.128546 (-0.095291)	0.011487 / 0.075646 (-0.064160)	0.323873 / 0.419271 (-0.095399)	0.041203 / 0.043533 (-0.002330)	0.298397 / 0.255139 (0.043258)	0.327174 / 0.283200 (0.043974)	0.088892 / 0.141683 (-0.052791)	1.560114 / 1.452155 (0.107959)	1.532475 / 1.492716 (0.039759)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.226080 / 0.018006 (0.208074)	0.467492 / 0.000490 (0.467003)	0.002198 / 0.000200 (0.001998)	0.000074 / 0.000054 (0.000019)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023627 / 0.037411 (-0.013784)	0.096696 / 0.014526 (0.082170)	0.106196 / 0.176557 (-0.070360)	0.140496 / 0.737135 (-0.596639)	0.108859 / 0.296338 (-0.187480)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422335 / 0.215209 (0.207126)	4.214879 / 2.077655 (2.137224)	1.865866 / 1.504120 (0.361747)	1.660914 / 1.541195 (0.119719)	1.691869 / 1.468490 (0.223379)	0.688164 / 4.584777 (-3.896613)	3.432708 / 3.745712 (-0.313004)	1.856852 / 5.269862 (-3.413010)	1.243685 / 4.565676 (-3.321991)	0.081552 / 0.424275 (-0.342723)	0.012491 / 0.007607 (0.004884)	0.524331 / 0.226044 (0.298287)	5.255090 / 2.268929 (2.986162)	2.269705 / 55.444624 (-53.174919)	1.936722 / 6.876477 (-4.939755)	2.018958 / 2.142072 (-0.123114)	0.800658 / 4.805227 (-4.004569)	0.148665 / 6.500664 (-6.351999)	0.064210 / 0.075469 (-0.011259)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.235422 / 1.841788 (-0.606365)	14.156755 / 8.074308 (6.082447)	14.005916 / 10.191392 (3.814524)	0.150983 / 0.680424 (-0.529441)	0.028500 / 0.534201 (-0.505701)	0.393013 / 0.579283 (-0.186270)	0.408191 / 0.434364 (-0.026173)	0.481017 / 0.540337 (-0.059320)	0.581711 / 1.386936 (-0.805225)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006950 / 0.011353 (-0.004403)	0.004575 / 0.011008 (-0.006434)	0.076702 / 0.038508 (0.038194)	0.028050 / 0.023109 (0.004941)	0.342916 / 0.275898 (0.067018)	0.378861 / 0.323480 (0.055381)	0.005315 / 0.007986 (-0.002671)	0.004822 / 0.004328 (0.000494)	0.075560 / 0.004250 (0.071310)	0.040441 / 0.037052 (0.003388)	0.344284 / 0.258489 (0.085795)	0.386519 / 0.293841 (0.092678)	0.032122 / 0.128546 (-0.096424)	0.011843 / 0.075646 (-0.063803)	0.085798 / 0.419271 (-0.333473)	0.043027 / 0.043533 (-0.000506)	0.342910 / 0.255139 (0.087771)	0.366618 / 0.283200 (0.083418)	0.094766 / 0.141683 (-0.046917)	1.492981 / 1.452155 (0.040827)	1.566994 / 1.492716 (0.074278)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.166083 / 0.018006 (0.148076)	0.409315 / 0.000490 (0.408826)	0.003189 / 0.000200 (0.002989)	0.000127 / 0.000054 (0.000072)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024753 / 0.037411 (-0.012658)	0.099112 / 0.014526 (0.084586)	0.106668 / 0.176557 (-0.069889)	0.142562 / 0.737135 (-0.594573)	0.110648 / 0.296338 (-0.185690)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.452668 / 0.215209 (0.237459)	4.501188 / 2.077655 (2.423534)	2.086197 / 1.504120 (0.582077)	1.873955 / 1.541195 (0.332761)	1.935610 / 1.468490 (0.467120)	0.708290 / 4.584777 (-3.876487)	3.426986 / 3.745712 (-0.318726)	2.805852 / 5.269862 (-2.464009)	1.516918 / 4.565676 (-3.048759)	0.084067 / 0.424275 (-0.340208)	0.012776 / 0.007607 (0.005169)	0.548853 / 0.226044 (0.322809)	5.488198 / 2.268929 (3.219270)	2.704464 / 55.444624 (-52.740161)	2.377817 / 6.876477 (-4.498660)	2.366152 / 2.142072 (0.224079)	0.818192 / 4.805227 (-3.987035)	0.152649 / 6.500664 (-6.348015)	0.066914 / 0.075469 (-0.008555)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.273803 / 1.841788 (-0.567985)	14.071633 / 8.074308 (5.997325)	13.655586 / 10.191392 (3.464194)	0.149471 / 0.680424 (-0.530953)	0.016745 / 0.534201 (-0.517456)	0.386850 / 0.579283 (-0.192434)	0.393595 / 0.434364 (-0.040769)	0.480396 / 0.540337 (-0.059942)	0.573708 / 1.386936 (-0.813228)

github-actions · 2023-02-01T11:04:07Z

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008173 / 0.011353 (-0.003180)	0.004461 / 0.011008 (-0.006547)	0.100284 / 0.038508 (0.061776)	0.028900 / 0.023109 (0.005791)	0.293639 / 0.275898 (0.017741)	0.359450 / 0.323480 (0.035971)	0.007567 / 0.007986 (-0.000418)	0.003434 / 0.004328 (-0.000894)	0.077913 / 0.004250 (0.073663)	0.036313 / 0.037052 (-0.000740)	0.308484 / 0.258489 (0.049995)	0.347575 / 0.293841 (0.053734)	0.033367 / 0.128546 (-0.095179)	0.011508 / 0.075646 (-0.064138)	0.323490 / 0.419271 (-0.095782)	0.042285 / 0.043533 (-0.001248)	0.295696 / 0.255139 (0.040557)	0.332475 / 0.283200 (0.049276)	0.089980 / 0.141683 (-0.051703)	1.461851 / 1.452155 (0.009697)	1.493030 / 1.492716 (0.000314)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.191068 / 0.018006 (0.173062)	0.396768 / 0.000490 (0.396278)	0.002355 / 0.000200 (0.002155)	0.000080 / 0.000054 (0.000025)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023117 / 0.037411 (-0.014294)	0.096155 / 0.014526 (0.081630)	0.102424 / 0.176557 (-0.074132)	0.142148 / 0.737135 (-0.594987)	0.105954 / 0.296338 (-0.190384)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.421227 / 0.215209 (0.206018)	4.200403 / 2.077655 (2.122748)	1.899410 / 1.504120 (0.395290)	1.684091 / 1.541195 (0.142896)	1.698084 / 1.468490 (0.229594)	0.696195 / 4.584777 (-3.888582)	3.364116 / 3.745712 (-0.381596)	1.899133 / 5.269862 (-3.370728)	1.281405 / 4.565676 (-3.284272)	0.082958 / 0.424275 (-0.341317)	0.012433 / 0.007607 (0.004826)	0.521856 / 0.226044 (0.295812)	5.217626 / 2.268929 (2.948698)	2.309228 / 55.444624 (-53.135396)	1.956828 / 6.876477 (-4.919648)	2.018964 / 2.142072 (-0.123108)	0.816855 / 4.805227 (-3.988373)	0.152867 / 6.500664 (-6.347798)	0.064764 / 0.075469 (-0.010705)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.219020 / 1.841788 (-0.622768)	13.509058 / 8.074308 (5.434750)	13.637826 / 10.191392 (3.446434)	0.156620 / 0.680424 (-0.523804)	0.028518 / 0.534201 (-0.505683)	0.399138 / 0.579283 (-0.180146)	0.399931 / 0.434364 (-0.034433)	0.482902 / 0.540337 (-0.057435)	0.574089 / 1.386936 (-0.812847)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006232 / 0.011353 (-0.005121)	0.004467 / 0.011008 (-0.006542)	0.075494 / 0.038508 (0.036986)	0.026891 / 0.023109 (0.003782)	0.356603 / 0.275898 (0.080705)	0.371977 / 0.323480 (0.048497)	0.004709 / 0.007986 (-0.003276)	0.003230 / 0.004328 (-0.001099)	0.074338 / 0.004250 (0.070088)	0.035588 / 0.037052 (-0.001464)	0.349554 / 0.258489 (0.091065)	0.389672 / 0.293841 (0.095831)	0.031524 / 0.128546 (-0.097022)	0.011493 / 0.075646 (-0.064153)	0.084584 / 0.419271 (-0.334688)	0.041945 / 0.043533 (-0.001588)	0.341057 / 0.255139 (0.085918)	0.367876 / 0.283200 (0.084677)	0.090113 / 0.141683 (-0.051569)	1.507104 / 1.452155 (0.054949)	1.567810 / 1.492716 (0.075094)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.210939 / 0.018006 (0.192933)	0.392600 / 0.000490 (0.392110)	0.002188 / 0.000200 (0.001988)	0.000073 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024294 / 0.037411 (-0.013118)	0.100325 / 0.014526 (0.085799)	0.104027 / 0.176557 (-0.072530)	0.141189 / 0.737135 (-0.595947)	0.107438 / 0.296338 (-0.188901)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.443314 / 0.215209 (0.228105)	4.429612 / 2.077655 (2.351957)	2.129275 / 1.504120 (0.625156)	1.940016 / 1.541195 (0.398821)	2.008975 / 1.468490 (0.540485)	0.695434 / 4.584777 (-3.889343)	3.355137 / 3.745712 (-0.390575)	2.606262 / 5.269862 (-2.663600)	1.451283 / 4.565676 (-3.114394)	0.082875 / 0.424275 (-0.341400)	0.012398 / 0.007607 (0.004791)	0.544262 / 0.226044 (0.318218)	5.450829 / 2.268929 (3.181900)	2.582074 / 55.444624 (-52.862550)	2.220037 / 6.876477 (-4.656439)	2.232473 / 2.142072 (0.090401)	0.802094 / 4.805227 (-4.003134)	0.150188 / 6.500664 (-6.350476)	0.066543 / 0.075469 (-0.008926)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.269098 / 1.841788 (-0.572690)	13.764780 / 8.074308 (5.690472)	13.461490 / 10.191392 (3.270098)	0.143841 / 0.680424 (-0.536583)	0.016687 / 0.534201 (-0.517514)	0.388548 / 0.579283 (-0.190736)	0.385229 / 0.434364 (-0.049135)	0.478966 / 0.540337 (-0.061371)	0.570355 / 1.386936 (-0.816581)

lhoestq · 2023-02-01T11:22:16Z

I took your comments into account :)

Regarding the docs, I think it would be better to add this info as notes/tips/sections to the existing docs (Process/Stream; e.g. a tip under Dataset.shuffle that explains how to make this operation more performant by using to_iterable + shuffle, etc.) rather than introducing a new doc page.

I added a paragraph in the Dataset.shuffle docstring and a note in the Process doc page.

github-actions · 2023-02-01T11:28:51Z

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010906 / 0.011353 (-0.000447)	0.005995 / 0.011008 (-0.005014)	0.120183 / 0.038508 (0.081675)	0.042166 / 0.023109 (0.019057)	0.350945 / 0.275898 (0.075046)	0.433055 / 0.323480 (0.109575)	0.009093 / 0.007986 (0.001107)	0.004695 / 0.004328 (0.000366)	0.090362 / 0.004250 (0.086112)	0.051402 / 0.037052 (0.014350)	0.368677 / 0.258489 (0.110188)	0.410926 / 0.293841 (0.117086)	0.044471 / 0.128546 (-0.084075)	0.014051 / 0.075646 (-0.061595)	0.397765 / 0.419271 (-0.021507)	0.057227 / 0.043533 (0.013694)	0.357587 / 0.255139 (0.102448)	0.377470 / 0.283200 (0.094270)	0.119482 / 0.141683 (-0.022201)	1.719799 / 1.452155 (0.267645)	1.758228 / 1.492716 (0.265511)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224385 / 0.018006 (0.206379)	0.505070 / 0.000490 (0.504580)	0.004863 / 0.000200 (0.004663)	0.000379 / 0.000054 (0.000324)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030366 / 0.037411 (-0.007046)	0.130481 / 0.014526 (0.115955)	0.136429 / 0.176557 (-0.040128)	0.182263 / 0.737135 (-0.554872)	0.142871 / 0.296338 (-0.153468)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.467623 / 0.215209 (0.252414)	4.665522 / 2.077655 (2.587868)	2.130885 / 1.504120 (0.626766)	1.903810 / 1.541195 (0.362615)	2.019077 / 1.468490 (0.550587)	0.820868 / 4.584777 (-3.763909)	4.543118 / 3.745712 (0.797406)	2.491541 / 5.269862 (-2.778321)	1.585377 / 4.565676 (-2.980299)	0.101850 / 0.424275 (-0.322426)	0.014737 / 0.007607 (0.007129)	0.597241 / 0.226044 (0.371197)	5.938445 / 2.268929 (3.669516)	2.695799 / 55.444624 (-52.748825)	2.286890 / 6.876477 (-4.589587)	2.363064 / 2.142072 (0.220991)	0.986670 / 4.805227 (-3.818557)	0.194407 / 6.500664 (-6.306257)	0.074767 / 0.075469 (-0.000702)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.420630 / 1.841788 (-0.421158)	17.537702 / 8.074308 (9.463394)	16.521804 / 10.191392 (6.330412)	0.173622 / 0.680424 (-0.506802)	0.033944 / 0.534201 (-0.500257)	0.520461 / 0.579283 (-0.058822)	0.541283 / 0.434364 (0.106919)	0.651906 / 0.540337 (0.111569)	0.771724 / 1.386936 (-0.615212)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008448 / 0.011353 (-0.002905)	0.005893 / 0.011008 (-0.005115)	0.087995 / 0.038508 (0.049487)	0.038602 / 0.023109 (0.015493)	0.400048 / 0.275898 (0.124150)	0.436998 / 0.323480 (0.113518)	0.006414 / 0.007986 (-0.001572)	0.004478 / 0.004328 (0.000149)	0.086444 / 0.004250 (0.082194)	0.056535 / 0.037052 (0.019483)	0.402066 / 0.258489 (0.143577)	0.458730 / 0.293841 (0.164889)	0.041622 / 0.128546 (-0.086924)	0.014014 / 0.075646 (-0.061632)	0.101382 / 0.419271 (-0.317889)	0.056986 / 0.043533 (0.013453)	0.404527 / 0.255139 (0.149388)	0.428105 / 0.283200 (0.144906)	0.118321 / 0.141683 (-0.023361)	1.716940 / 1.452155 (0.264785)	1.834683 / 1.492716 (0.341967)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252917 / 0.018006 (0.234910)	0.485950 / 0.000490 (0.485461)	0.000489 / 0.000200 (0.000289)	0.000066 / 0.000054 (0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035023 / 0.037411 (-0.002388)	0.139055 / 0.014526 (0.124529)	0.144165 / 0.176557 (-0.032392)	0.189559 / 0.737135 (-0.547577)	0.153213 / 0.296338 (-0.143126)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.505069 / 0.215209 (0.289860)	5.024620 / 2.077655 (2.946965)	2.429469 / 1.504120 (0.925349)	2.186210 / 1.541195 (0.645015)	2.275971 / 1.468490 (0.807481)	0.829432 / 4.584777 (-3.755345)	4.518600 / 3.745712 (0.772888)	2.466418 / 5.269862 (-2.803443)	1.558910 / 4.565676 (-3.006767)	0.102017 / 0.424275 (-0.322258)	0.015191 / 0.007607 (0.007584)	0.619092 / 0.226044 (0.393048)	6.241105 / 2.268929 (3.972176)	3.044213 / 55.444624 (-52.400411)	2.630194 / 6.876477 (-4.246282)	2.723685 / 2.142072 (0.581613)	0.994018 / 4.805227 (-3.811210)	0.198722 / 6.500664 (-6.301942)	0.075812 / 0.075469 (0.000343)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.545497 / 1.841788 (-0.296291)	18.305250 / 8.074308 (10.230942)	16.035275 / 10.191392 (5.843883)	0.209339 / 0.680424 (-0.471085)	0.020903 / 0.534201 (-0.513298)	0.499909 / 0.579283 (-0.079374)	0.488775 / 0.434364 (0.054411)	0.581990 / 0.540337 (0.041653)	0.697786 / 1.386936 (-0.689150)

polinaeterna

I love the new doc about Dataset vs IterableDataset, thank you! I think it's worth a separate page.
I left just a few comments on the text.

docs/source/about_mapstyle_vs_iterable.mdx

docs/source/process.mdx

docs/source/use_with_pytorch.mdx

src/datasets/arrow_dataset.py

mariosasko

Looks good! I guess we can keep the new page.

Btw, this page boils down to "switch from Dataset to IterableDataset to save time/disk space by avoiding the full rewrite of a dataset" (map always rewrites the dataset, and flatten_indices is good to run after shuffle to preserve the iteration speed), so maybe a better title for it would be "Optimize processing" (or "Working with datasets at scale", as I mentioned earlier on Slack).
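For readers less familiar with the shuffle / flatten_indices trade-off mentioned here, a toy plain-Python model may help. It is only a sketch under stated assumptions: TinyMapStyleDataset is a hypothetical stand-in, not the real Dataset class, and a Python list plays the role of the contiguous Arrow storage.

```python
import random
from typing import List, Optional


class TinyMapStyleDataset:
    """Toy map-style dataset: contiguous storage plus an optional indices mapping."""

    def __init__(self, storage: List[str], indices: Optional[List[int]] = None):
        self.storage = storage
        self.indices = indices  # None means rows are read in storage order

    def __len__(self) -> int:
        return len(self.storage) if self.indices is None else len(self.indices)

    def __getitem__(self, i: int) -> str:
        # With a mapping, every read is redirected to a possibly distant storage position.
        return self.storage[i if self.indices is None else self.indices[i]]

    def shuffle(self, seed: int) -> "TinyMapStyleDataset":
        # Cheap: only a permutation of indices is created; storage is shared, not copied.
        order = list(range(len(self)))
        random.Random(seed).shuffle(order)
        base = self.indices
        new_indices = order if base is None else [base[j] for j in order]
        return TinyMapStyleDataset(self.storage, new_indices)

    def flatten_indices(self) -> "TinyMapStyleDataset":
        # Expensive: rewrites every row so storage order matches iteration order again.
        return TinyMapStyleDataset([self[i] for i in range(len(self))])


ds = TinyMapStyleDataset(["a", "b", "c", "d", "e"])
shuffled = ds.shuffle(seed=0)
flat = shuffled.flatten_indices()
```

In this model, shuffle is cheap but leaves reads scattered across the storage, while flatten_indices restores contiguous reads by paying for a full rewrite. That rewrite is the cost an iterable dataset avoids.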

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)
cc @stevhliu

Co-authored-by: Polina Kazakova <polina@huggingface.co>

github-actions · 2023-02-01T15:36:22Z

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011706 / 0.011353 (0.000353)	0.008406 / 0.011008 (-0.002602)	0.130887 / 0.038508 (0.092379)	0.037468 / 0.023109 (0.014359)	0.385043 / 0.275898 (0.109145)	0.458837 / 0.323480 (0.135357)	0.013400 / 0.007986 (0.005414)	0.004885 / 0.004328 (0.000557)	0.107156 / 0.004250 (0.102905)	0.046958 / 0.037052 (0.009906)	0.419314 / 0.258489 (0.160825)	0.456061 / 0.293841 (0.162220)	0.058859 / 0.128546 (-0.069687)	0.016682 / 0.075646 (-0.058965)	0.428401 / 0.419271 (0.009129)	0.062908 / 0.043533 (0.019376)	0.370902 / 0.255139 (0.115763)	0.433897 / 0.283200 (0.150697)	0.125672 / 0.141683 (-0.016011)	1.818279 / 1.452155 (0.366124)	1.935767 / 1.492716 (0.443050)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.011928 / 0.018006 (-0.006078)	0.591995 / 0.000490 (0.591506)	0.008416 / 0.000200 (0.008216)	0.000122 / 0.000054 (0.000067)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029640 / 0.037411 (-0.007772)	0.121044 / 0.014526 (0.106518)	0.141840 / 0.176557 (-0.034716)	0.195856 / 0.737135 (-0.541280)	0.146460 / 0.296338 (-0.149879)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.591838 / 0.215209 (0.376629)	5.817309 / 2.077655 (3.739654)	2.411864 / 1.504120 (0.907744)	2.098517 / 1.541195 (0.557323)	2.214609 / 1.468490 (0.746119)	1.217542 / 4.584777 (-3.367235)	5.658394 / 3.745712 (1.912682)	5.155807 / 5.269862 (-0.114055)	2.797313 / 4.565676 (-1.768363)	0.141309 / 0.424275 (-0.282967)	0.014462 / 0.007607 (0.006855)	0.772274 / 0.226044 (0.546230)	7.547357 / 2.268929 (5.278429)	3.150178 / 55.444624 (-52.294446)	2.500130 / 6.876477 (-4.376347)	2.572036 / 2.142072 (0.429964)	1.434498 / 4.805227 (-3.370729)	0.257355 / 6.500664 (-6.243309)	0.087491 / 0.075469 (0.012022)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.483899 / 1.841788 (-0.357889)	17.990741 / 8.074308 (9.916433)	20.398965 / 10.191392 (10.207573)	0.239529 / 0.680424 (-0.440895)	0.046118 / 0.534201 (-0.488083)	0.528349 / 0.579283 (-0.050934)	0.614333 / 0.434364 (0.179969)	0.653621 / 0.540337 (0.113284)	0.794654 / 1.386936 (-0.592282)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008732 / 0.011353 (-0.002621)	0.006432 / 0.011008 (-0.004576)	0.090811 / 0.038508 (0.052303)	0.030154 / 0.023109 (0.007045)	0.407885 / 0.275898 (0.131987)	0.452457 / 0.323480 (0.128977)	0.006966 / 0.007986 (-0.001020)	0.006449 / 0.004328 (0.002120)	0.094439 / 0.004250 (0.090188)	0.050628 / 0.037052 (0.013576)	0.401815 / 0.258489 (0.143326)	0.451814 / 0.293841 (0.157973)	0.047456 / 0.128546 (-0.081090)	0.019019 / 0.075646 (-0.056628)	0.112941 / 0.419271 (-0.306331)	0.057677 / 0.043533 (0.014145)	0.406160 / 0.255139 (0.151021)	0.434469 / 0.283200 (0.151269)	0.110515 / 0.141683 (-0.031167)	1.601393 / 1.452155 (0.149238)	1.745581 / 1.492716 (0.252865)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.280264 / 0.018006 (0.262258)	0.630074 / 0.000490 (0.629585)	0.006900 / 0.000200 (0.006700)	0.000112 / 0.000054 (0.000058)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027338 / 0.037411 (-0.010073)	0.114772 / 0.014526 (0.100246)	0.130436 / 0.176557 (-0.046121)	0.168990 / 0.737135 (-0.568145)	0.135842 / 0.296338 (-0.160496)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.666739 / 0.215209 (0.451530)	6.212953 / 2.077655 (4.135298)	2.781716 / 1.504120 (1.277596)	2.369975 / 1.541195 (0.828781)	2.338807 / 1.468490 (0.870317)	1.174138 / 4.584777 (-3.410639)	5.420297 / 3.745712 (1.674585)	4.972669 / 5.269862 (-0.297192)	2.214294 / 4.565676 (-2.351382)	0.135429 / 0.424275 (-0.288846)	0.013877 / 0.007607 (0.006270)	0.750805 / 0.226044 (0.524761)	7.145429 / 2.268929 (4.876500)	3.215081 / 55.444624 (-52.229544)	2.598307 / 6.876477 (-4.278170)	2.690479 / 2.142072 (0.548406)	1.344673 / 4.805227 (-3.460554)	0.241536 / 6.500664 (-6.259128)	0.075544 / 0.075469 (0.000074)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.473595 / 1.841788 (-0.368192)	17.372237 / 8.074308 (9.297929)	18.586588 / 10.191392 (8.395196)	0.209300 / 0.680424 (-0.471124)	0.030878 / 0.534201 (-0.503323)	0.509131 / 0.579283 (-0.070152)	0.617884 / 0.434364 (0.183520)	0.633721 / 0.540337 (0.093383)	0.727624 / 1.386936 (-0.659312)

lhoestq · 2023-02-01T16:20:11Z

Took your last comments into account!

so maybe a better title for it would be "Optimize processing" (or "Working with datasets at scale" as I mentioned earlier on Slack)

I think the content would be slightly different, e.g. focus more on multiprocessing/sharding or what data formats to use. This can be a complementary page IMO
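
To make the sharding point concrete (the PR description notes that sharding enables efficient shuffling and parallel loading of iterable datasets), here is a minimal pure-Python sketch. It assumes contiguous shards and approximates a shuffle by randomizing the shard order and passing rows through a small shuffle buffer. The function names `split_into_shards` and `shuffled_iter` are hypothetical, and this is an illustration of the idea, not the library's code.

```python
import random
from typing import Iterator, List, Sequence


def split_into_shards(rows: Sequence, num_shards: int) -> List[Sequence]:
    """Split rows into `num_shards` contiguous slices of near-equal size."""
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        shards.append(rows[start:end])
        start = end
    return shards


def shuffled_iter(shards: List[Sequence], buffer_size: int, seed: int) -> Iterator:
    """Approximate shuffle: random shard order plus a fixed-size shuffle buffer.

    Assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)  # visit shards in a random order
    buffer = []
    for shard_idx in order:
        for row in shards[shard_idx]:
            if len(buffer) < buffer_size:
                buffer.append(row)  # fill the buffer first
                continue
            pos = rng.randrange(buffer_size)
            yield buffer[pos]       # emit a random buffered row
            buffer[pos] = row       # and replace it with the new row
    rng.shuffle(buffer)
    yield from buffer               # flush whatever is left


shards = split_into_shards(list(range(10)), num_shards=4)
out = list(shuffled_iter(shards, buffer_size=3, seed=0))
```

With a single shard, only the buffer contributes randomness, so rows far apart in the data rarely swap places; more shards give a coarser-grained shuffle on top of that, and each shard can also be handed to a separate worker.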

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)

Added a link in the how-to stream page. We may want to include it in the tutorial at some point as well; right now none of the tutorials mention streaming.

github-actions · 2023-02-01T16:25:59Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009167 / 0.011353 (-0.002186)	0.005345 / 0.011008 (-0.005663)	0.098302 / 0.038508 (0.059794)	0.035649 / 0.023109 (0.012540)	0.295597 / 0.275898 (0.019699)	0.358843 / 0.323480 (0.035364)	0.008011 / 0.007986 (0.000025)	0.004229 / 0.004328 (-0.000100)	0.075123 / 0.004250 (0.070872)	0.046098 / 0.037052 (0.009046)	0.310581 / 0.258489 (0.052092)	0.343230 / 0.293841 (0.049389)	0.038318 / 0.128546 (-0.090229)	0.011954 / 0.075646 (-0.063693)	0.331056 / 0.419271 (-0.088216)	0.052875 / 0.043533 (0.009342)	0.302758 / 0.255139 (0.047619)	0.340596 / 0.283200 (0.057396)	0.113676 / 0.141683 (-0.028007)	1.448272 / 1.452155 (-0.003883)	1.498008 / 1.492716 (0.005291)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.240524 / 0.018006 (0.222518)	0.555823 / 0.000490 (0.555333)	0.003143 / 0.000200 (0.002943)	0.000098 / 0.000054 (0.000044)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027764 / 0.037411 (-0.009647)	0.105006 / 0.014526 (0.090480)	0.120550 / 0.176557 (-0.056007)	0.167052 / 0.737135 (-0.570084)	0.124521 / 0.296338 (-0.171818)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.401758 / 0.215209 (0.186549)	3.989629 / 2.077655 (1.911974)	1.767307 / 1.504120 (0.263187)	1.579451 / 1.541195 (0.038257)	1.637642 / 1.468490 (0.169152)	0.702524 / 4.584777 (-3.882253)	3.714326 / 3.745712 (-0.031386)	2.131829 / 5.269862 (-3.138033)	1.487410 / 4.565676 (-3.078267)	0.084901 / 0.424275 (-0.339374)	0.012292 / 0.007607 (0.004685)	0.505211 / 0.226044 (0.279166)	5.074479 / 2.268929 (2.805551)	2.243068 / 55.444624 (-53.201556)	1.880199 / 6.876477 (-4.996278)	2.003757 / 2.142072 (-0.138315)	0.870719 / 4.805227 (-3.934508)	0.167626 / 6.500664 (-6.333039)	0.062024 / 0.075469 (-0.013445)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.192969 / 1.841788 (-0.648819)	14.830812 / 8.074308 (6.756504)	14.331178 / 10.191392 (4.139786)	0.199222 / 0.680424 (-0.481202)	0.029292 / 0.534201 (-0.504909)	0.440427 / 0.579283 (-0.138857)	0.437893 / 0.434364 (0.003529)	0.547155 / 0.540337 (0.006818)	0.645255 / 1.386936 (-0.741681)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007465 / 0.011353 (-0.003888)	0.005386 / 0.011008 (-0.005622)	0.073609 / 0.038508 (0.035100)	0.033550 / 0.023109 (0.010440)	0.341730 / 0.275898 (0.065832)	0.371518 / 0.323480 (0.048038)	0.005986 / 0.007986 (-0.001999)	0.004264 / 0.004328 (-0.000065)	0.073749 / 0.004250 (0.069498)	0.051452 / 0.037052 (0.014399)	0.347385 / 0.258489 (0.088896)	0.392284 / 0.293841 (0.098444)	0.036981 / 0.128546 (-0.091566)	0.012431 / 0.075646 (-0.063216)	0.086421 / 0.419271 (-0.332850)	0.053014 / 0.043533 (0.009481)	0.336660 / 0.255139 (0.081521)	0.359155 / 0.283200 (0.075956)	0.107666 / 0.141683 (-0.034017)	1.424324 / 1.452155 (-0.027830)	1.543027 / 1.492716 (0.050310)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.260862 / 0.018006 (0.242855)	0.552057 / 0.000490 (0.551567)	0.000449 / 0.000200 (0.000249)	0.000059 / 0.000054 (0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029184 / 0.037411 (-0.008227)	0.108799 / 0.014526 (0.094274)	0.125136 / 0.176557 (-0.051421)	0.157436 / 0.737135 (-0.579699)	0.126333 / 0.296338 (-0.170005)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.424054 / 0.215209 (0.208845)	4.227847 / 2.077655 (2.150192)	2.051102 / 1.504120 (0.546983)	1.848651 / 1.541195 (0.307457)	1.922728 / 1.468490 (0.454238)	0.705903 / 4.584777 (-3.878874)	3.800977 / 3.745712 (0.055265)	2.099345 / 5.269862 (-3.170517)	1.342919 / 4.565676 (-3.222757)	0.086128 / 0.424275 (-0.338147)	0.012539 / 0.007607 (0.004932)	0.528767 / 0.226044 (0.302723)	5.299989 / 2.268929 (3.031061)	2.534280 / 55.444624 (-52.910345)	2.229532 / 6.876477 (-4.646945)	2.326704 / 2.142072 (0.184632)	0.838533 / 4.805227 (-3.966694)	0.168446 / 6.500664 (-6.332218)	0.065158 / 0.075469 (-0.010311)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.250091 / 1.841788 (-0.591697)	14.988651 / 8.074308 (6.914343)	13.655103 / 10.191392 (3.463711)	0.165079 / 0.680424 (-0.515345)	0.017829 / 0.534201 (-0.516372)	0.425903 / 0.579283 (-0.153381)	0.419771 / 0.434364 (-0.014593)	0.534309 / 0.540337 (-0.006028)	0.635563 / 1.386936 (-0.751373)

github-actions · 2023-02-01T16:28:33Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010569 / 0.011353 (-0.000784)	0.005790 / 0.011008 (-0.005218)	0.118626 / 0.038508 (0.080118)	0.040455 / 0.023109 (0.017346)	0.342309 / 0.275898 (0.066411)	0.411828 / 0.323480 (0.088349)	0.008824 / 0.007986 (0.000839)	0.005426 / 0.004328 (0.001098)	0.088740 / 0.004250 (0.084489)	0.050042 / 0.037052 (0.012990)	0.352350 / 0.258489 (0.093861)	0.396030 / 0.293841 (0.102189)	0.043385 / 0.128546 (-0.085162)	0.013805 / 0.075646 (-0.061841)	0.396489 / 0.419271 (-0.022783)	0.055667 / 0.043533 (0.012135)	0.336165 / 0.255139 (0.081026)	0.372912 / 0.283200 (0.089713)	0.115343 / 0.141683 (-0.026340)	1.656412 / 1.452155 (0.204257)	1.708993 / 1.492716 (0.216277)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.011650 / 0.018006 (-0.006357)	0.444415 / 0.000490 (0.443926)	0.003985 / 0.000200 (0.003785)	0.000136 / 0.000054 (0.000082)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031718 / 0.037411 (-0.005693)	0.119640 / 0.014526 (0.105114)	0.138519 / 0.176557 (-0.038037)	0.188847 / 0.737135 (-0.548288)	0.137891 / 0.296338 (-0.158448)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.447540 / 0.215209 (0.232331)	4.577189 / 2.077655 (2.499534)	2.106992 / 1.504120 (0.602872)	1.889631 / 1.541195 (0.348436)	1.972256 / 1.468490 (0.503766)	0.778209 / 4.584777 (-3.806568)	4.430279 / 3.745712 (0.684567)	2.401226 / 5.269862 (-2.868636)	1.481251 / 4.565676 (-3.084425)	0.094244 / 0.424275 (-0.330031)	0.013961 / 0.007607 (0.006354)	0.570962 / 0.226044 (0.344917)	5.809224 / 2.268929 (3.540295)	2.663290 / 55.444624 (-52.781334)	2.201228 / 6.876477 (-4.675249)	2.319240 / 2.142072 (0.177168)	0.938340 / 4.805227 (-3.866887)	0.185546 / 6.500664 (-6.315118)	0.069087 / 0.075469 (-0.006382)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.448597 / 1.841788 (-0.393191)	17.188573 / 8.074308 (9.114265)	16.197532 / 10.191392 (6.006140)	0.194064 / 0.680424 (-0.486360)	0.033694 / 0.534201 (-0.500507)	0.507585 / 0.579283 (-0.071699)	0.505470 / 0.434364 (0.071106)	0.623270 / 0.540337 (0.082932)	0.729964 / 1.386936 (-0.656972)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008529 / 0.011353 (-0.002824)	0.005705 / 0.011008 (-0.005304)	0.085594 / 0.038508 (0.047086)	0.038377 / 0.023109 (0.015268)	0.384221 / 0.275898 (0.108323)	0.414678 / 0.323480 (0.091199)	0.006195 / 0.007986 (-0.001791)	0.004549 / 0.004328 (0.000221)	0.082710 / 0.004250 (0.078460)	0.054899 / 0.037052 (0.017847)	0.404017 / 0.258489 (0.145528)	0.450309 / 0.293841 (0.156468)	0.040620 / 0.128546 (-0.087926)	0.013774 / 0.075646 (-0.061872)	0.099231 / 0.419271 (-0.320041)	0.057183 / 0.043533 (0.013650)	0.390806 / 0.255139 (0.135667)	0.419334 / 0.283200 (0.136134)	0.116449 / 0.141683 (-0.025234)	1.709124 / 1.452155 (0.256969)	1.812769 / 1.492716 (0.320052)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.225206 / 0.018006 (0.207199)	0.440530 / 0.000490 (0.440040)	0.002982 / 0.000200 (0.002782)	0.000102 / 0.000054 (0.000048)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032256 / 0.037411 (-0.005155)	0.127086 / 0.014526 (0.112560)	0.138133 / 0.176557 (-0.038424)	0.176168 / 0.737135 (-0.560968)	0.146072 / 0.296338 (-0.150267)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474374 / 0.215209 (0.259165)	4.785106 / 2.077655 (2.707452)	2.319344 / 1.504120 (0.815225)	2.075239 / 1.541195 (0.534045)	2.179231 / 1.468490 (0.710741)	0.832124 / 4.584777 (-3.752653)	4.376302 / 3.745712 (0.630590)	3.966837 / 5.269862 (-1.303024)	1.820230 / 4.565676 (-2.745446)	0.100692 / 0.424275 (-0.323583)	0.014748 / 0.007607 (0.007141)	0.568702 / 0.226044 (0.342657)	5.771548 / 2.268929 (3.502619)	2.747431 / 55.444624 (-52.697193)	2.448482 / 6.876477 (-4.427994)	2.497206 / 2.142072 (0.355133)	0.960842 / 4.805227 (-3.844385)	0.192855 / 6.500664 (-6.307809)	0.072494 / 0.075469 (-0.002975)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.474542 / 1.841788 (-0.367245)	17.344804 / 8.074308 (9.270496)	15.336082 / 10.191392 (5.144690)	0.200134 / 0.680424 (-0.480290)	0.020728 / 0.534201 (-0.513473)	0.488854 / 0.579283 (-0.090429)	0.490781 / 0.434364 (0.056418)	0.626288 / 0.540337 (0.085950)	0.721130 / 1.386936 (-0.665806)

github-actions · 2023-02-01T16:42:50Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008542 / 0.011353 (-0.002811)	0.004624 / 0.011008 (-0.006384)	0.100749 / 0.038508 (0.062241)	0.029587 / 0.023109 (0.006478)	0.298680 / 0.275898 (0.022782)	0.359659 / 0.323480 (0.036180)	0.007001 / 0.007986 (-0.000984)	0.003398 / 0.004328 (-0.000930)	0.078654 / 0.004250 (0.074404)	0.036440 / 0.037052 (-0.000612)	0.313245 / 0.258489 (0.054756)	0.342776 / 0.293841 (0.048936)	0.033195 / 0.128546 (-0.095352)	0.011500 / 0.075646 (-0.064146)	0.323957 / 0.419271 (-0.095314)	0.039878 / 0.043533 (-0.003655)	0.298189 / 0.255139 (0.043050)	0.325488 / 0.283200 (0.042289)	0.087276 / 0.141683 (-0.054407)	1.480846 / 1.452155 (0.028691)	1.507016 / 1.492716 (0.014300)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.189570 / 0.018006 (0.171564)	0.406407 / 0.000490 (0.405917)	0.003062 / 0.000200 (0.002862)	0.000073 / 0.000054 (0.000019)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022865 / 0.037411 (-0.014546)	0.096103 / 0.014526 (0.081578)	0.106462 / 0.176557 (-0.070094)	0.140888 / 0.737135 (-0.596247)	0.108172 / 0.296338 (-0.188167)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.415951 / 0.215209 (0.200742)	4.172187 / 2.077655 (2.094532)	1.842210 / 1.504120 (0.338090)	1.636997 / 1.541195 (0.095802)	1.706078 / 1.468490 (0.237588)	0.695825 / 4.584777 (-3.888952)	3.337354 / 3.745712 (-0.408358)	1.877880 / 5.269862 (-3.391982)	1.153882 / 4.565676 (-3.411794)	0.082923 / 0.424275 (-0.341352)	0.012814 / 0.007607 (0.005207)	0.521793 / 0.226044 (0.295748)	5.275980 / 2.268929 (3.007051)	2.279230 / 55.444624 (-53.165394)	1.941777 / 6.876477 (-4.934700)	1.981297 / 2.142072 (-0.160775)	0.809669 / 4.805227 (-3.995558)	0.148753 / 6.500664 (-6.351911)	0.064909 / 0.075469 (-0.010560)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.226757 / 1.841788 (-0.615031)	13.717354 / 8.074308 (5.643046)	12.925885 / 10.191392 (2.734493)	0.137926 / 0.680424 (-0.542498)	0.028788 / 0.534201 (-0.505413)	0.396654 / 0.579283 (-0.182630)	0.401931 / 0.434364 (-0.032432)	0.460515 / 0.540337 (-0.079823)	0.537903 / 1.386936 (-0.849033)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006757 / 0.011353 (-0.004596)	0.004474 / 0.011008 (-0.006534)	0.076571 / 0.038508 (0.038063)	0.027580 / 0.023109 (0.004471)	0.348231 / 0.275898 (0.072333)	0.398403 / 0.323480 (0.074923)	0.005089 / 0.007986 (-0.002897)	0.004676 / 0.004328 (0.000347)	0.076444 / 0.004250 (0.072194)	0.038508 / 0.037052 (0.001456)	0.348515 / 0.258489 (0.090026)	0.401456 / 0.293841 (0.107615)	0.031630 / 0.128546 (-0.096916)	0.011698 / 0.075646 (-0.063949)	0.085805 / 0.419271 (-0.333467)	0.041962 / 0.043533 (-0.001570)	0.343415 / 0.255139 (0.088276)	0.383001 / 0.283200 (0.099801)	0.090231 / 0.141683 (-0.051452)	1.488114 / 1.452155 (0.035960)	1.569039 / 1.492716 (0.076323)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.261751 / 0.018006 (0.243745)	0.411354 / 0.000490 (0.410865)	0.015103 / 0.000200 (0.014903)	0.000262 / 0.000054 (0.000208)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025423 / 0.037411 (-0.011988)	0.101334 / 0.014526 (0.086808)	0.108835 / 0.176557 (-0.067722)	0.143995 / 0.737135 (-0.593140)	0.111751 / 0.296338 (-0.184588)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.446507 / 0.215209 (0.231298)	4.461543 / 2.077655 (2.383888)	2.104648 / 1.504120 (0.600528)	1.895900 / 1.541195 (0.354706)	1.985481 / 1.468490 (0.516991)	0.699029 / 4.584777 (-3.885748)	3.371064 / 3.745712 (-0.374648)	1.883445 / 5.269862 (-3.386416)	1.166150 / 4.565676 (-3.399527)	0.082639 / 0.424275 (-0.341636)	0.012605 / 0.007607 (0.004998)	0.544860 / 0.226044 (0.318815)	5.513223 / 2.268929 (3.244294)	2.570661 / 55.444624 (-52.873963)	2.206066 / 6.876477 (-4.670411)	2.256346 / 2.142072 (0.114273)	0.801142 / 4.805227 (-4.004085)	0.150412 / 6.500664 (-6.350252)	0.067742 / 0.075469 (-0.007727)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.303477 / 1.841788 (-0.538310)	14.287767 / 8.074308 (6.213458)	13.525563 / 10.191392 (3.334171)	0.148202 / 0.680424 (-0.532222)	0.016868 / 0.534201 (-0.517333)	0.380729 / 0.579283 (-0.198555)	0.388177 / 0.434364 (-0.046187)	0.477410 / 0.540337 (-0.062927)	0.569343 / 1.386936 (-0.817593)

stevhliu · 2023-02-01T18:11:45Z

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)

Just merged #5485, which references this new doc! Will look for other pages in the docs where it'd make sense to add them :)

lhoestq added 2 commits January 5, 2023 19:07

add to_iterable

14464f4

hide not relevant config warning

d30c8c2

lhoestq commented Jan 5, 2023

lhoestq and others added 4 commits January 25, 2023 20:18

minor

e43aee2

tests

368d2c1

docs

45ad185

Merge branch 'main' into to_iterable

f1e0ec3

lhoestq marked this pull request as ready for review January 25, 2023 19:19

minor

f830952

lhoestq added 2 commits January 25, 2023 20:45

fix links

c47ecf7

minor

675cf29

stevhliu reviewed Jan 25, 2023

lhoestq added 2 commits January 27, 2023 11:23

Merge branch 'main' into to_iterable

9fe31c8

steven's comments

5f7e178

lhoestq requested review from polinaeterna and mariosasko January 27, 2023 10:46

Merge branch 'main' into to_iterable

1e4894f

stevhliu approved these changes Jan 27, 2023

mariosasko reviewed Jan 30, 2023

stevhliu mentioned this pull request Jan 30, 2023

Add section in tutorial for IterableDataset #5485

Merged

lhoestq commented Jan 31, 2023

docs/source/use_with_pytorch.mdx (outdated, resolved)

Apply suggestions from code review

8b2c7de

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

lhoestq added 2 commits February 1, 2023 11:46

rename to_tf_dataset

59c890a

add intro paragraph

0ba81f5

lhoestq added 3 commits February 1, 2023 12:19

added note in Dataset.shuffle

34bf190

added note in Process doc page

2e956f8

style

78dca62

polinaeterna approved these changes Feb 1, 2023

mariosasko approved these changes Feb 1, 2023

Apply suggestions from code review

87f2062

Co-authored-by: Polina Kazakova <polina@huggingface.co>

lhoestq added 2 commits February 1, 2023 17:18

comments

3a99c5a

add link to guide from the How-to Stream page

f7d17cc

style

cd78778

lhoestq merged commit 79c18b7 into main Feb 1, 2023

lhoestq deleted the to_iterable branch February 1, 2023 16:36


		## Downloading and streaming

		When you have a regular "map-style" [`Dataset`], you can access it using `my_dataset[0]`: we have what we call "random access" to the rows.
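
To make the "random access" distinction concrete, here is a minimal pure-Python sketch. It does not use the 🤗 Datasets library: `MapStyleRows` and `iter_rows` are hypothetical names invented for this illustration, standing in for a map-style `Dataset` and an iterable view of it, and the real implementation may differ.

```python
from typing import Dict, Iterator, List


class MapStyleRows:
    """Hypothetical stand-in for a map-style dataset: rows are addressable by index."""

    def __init__(self, rows: List[Dict[str, int]]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Dict[str, int]:
        # Random access: any row can be fetched directly, in any order.
        return self._rows[index]


def iter_rows(source: MapStyleRows) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for an iterable dataset: rows are only yielded in sequence."""
    for index in range(len(source)):
        yield source[index]


my_dataset = MapStyleRows([{"id": i} for i in range(5)])

first_row = my_dataset[0]   # random access to the first row
fourth_row = my_dataset[3]  # jump straight to another row, no scan needed

# The iterable view has no __getitem__; reaching a row means consuming the ones before it.
streamed_ids = [row["id"] for row in iter_rows(my_dataset)]
```

The sketch only contrasts indexing with sequential iteration. It does not model the `num_shards` splitting mentioned in the PR description.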

		@@ -0,0 +1,220 @@
		# Differences between Dataset and IterableDataset

-	Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !
+	Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions!
