
Map-style Dataset to IterableDataset #5410

Merged
merged 22 commits on Feb 1, 2023

Conversation

lhoestq (Member) commented Jan 5, 2023

Added `ds.to_iterable()` to get an iterable dataset from a map-style Arrow dataset.

It also has a num_shards argument to split the dataset before converting to an iterable dataset. Sharding is important to enable efficient shuffling and parallel loading of iterable datasets.
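For illustration, a minimal sketch of the API as described above (the method name `ds.to_iterable()` and the `num_shards` argument are taken from this description, the toy data is invented, and the method name in the released library may differ):

```python
from datasets import Dataset

# Toy map-style (Arrow-backed) dataset, invented for this example.
ds = Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# Convert the map-style dataset to an IterableDataset.
# num_shards splits the dataset first, so the resulting iterable dataset
# can be shuffled efficiently and loaded in parallel by several workers.
iterable_ds = ds.to_iterable(num_shards=2)

# Iterable datasets are consumed by iteration instead of indexing.
for example in iterable_ds:
    print(example)
```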

TODO:

  • tests
  • docs

Fix #5265

@@ -493,7 +493,7 @@ def _create_builder_config(
         )
         is_custom = (config_id not in self.builder_configs) and config_id != "default"
         if is_custom:
-            logger.warning(f"Using custom data configuration {config_id}")
+            logger.info(f"Using custom data configuration {config_id}")
lhoestq (Member Author)

I did this because I think it's not relevant anymore, and because I find it confusing to show this when calling IterableDataset.from_generator

github-actions bot commented Jan 5, 2023

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009812 / 0.011353 (-0.001540) 0.005290 / 0.011008 (-0.005719) 0.099728 / 0.038508 (0.061220) 0.036712 / 0.023109 (0.013602) 0.305924 / 0.275898 (0.030026) 0.349844 / 0.323480 (0.026365) 0.008353 / 0.007986 (0.000368) 0.004464 / 0.004328 (0.000135) 0.075329 / 0.004250 (0.071079) 0.046146 / 0.037052 (0.009094) 0.304197 / 0.258489 (0.045708) 0.354245 / 0.293841 (0.060404) 0.039270 / 0.128546 (-0.089276) 0.012496 / 0.075646 (-0.063151) 0.334390 / 0.419271 (-0.084882) 0.049428 / 0.043533 (0.005896) 0.297318 / 0.255139 (0.042179) 0.315646 / 0.283200 (0.032447) 0.106746 / 0.141683 (-0.034937) 1.443562 / 1.452155 (-0.008593) 1.546022 / 1.492716 (0.053305)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.303419 / 0.018006 (0.285413) 0.536971 / 0.000490 (0.536481) 0.001335 / 0.000200 (0.001135) 0.000088 / 0.000054 (0.000033)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030484 / 0.037411 (-0.006927) 0.110043 / 0.014526 (0.095518) 0.125265 / 0.176557 (-0.051291) 0.171410 / 0.737135 (-0.565725) 0.128978 / 0.296338 (-0.167361)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.398354 / 0.215209 (0.183145) 3.984180 / 2.077655 (1.906526) 1.781134 / 1.504120 (0.277014) 1.589656 / 1.541195 (0.048462) 1.704192 / 1.468490 (0.235702) 0.682271 / 4.584777 (-3.902506) 3.731504 / 3.745712 (-0.014208) 2.243520 / 5.269862 (-3.026342) 1.511334 / 4.565676 (-3.054343) 0.084243 / 0.424275 (-0.340032) 0.012261 / 0.007607 (0.004654) 0.507499 / 0.226044 (0.281454) 5.066037 / 2.268929 (2.797109) 2.246107 / 55.444624 (-53.198517) 1.921032 / 6.876477 (-4.955444) 2.144111 / 2.142072 (0.002039) 0.845233 / 4.805227 (-3.959995) 0.165392 / 6.500664 (-6.335272) 0.064201 / 0.075469 (-0.011268)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.217649 / 1.841788 (-0.624138) 15.890487 / 8.074308 (7.816179) 14.772039 / 10.191392 (4.580647) 0.192901 / 0.680424 (-0.487523) 0.029119 / 0.534201 (-0.505082) 0.442904 / 0.579283 (-0.136380) 0.451035 / 0.434364 (0.016671) 0.520788 / 0.540337 (-0.019550) 0.623588 / 1.386936 (-0.763348)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007452 / 0.011353 (-0.003901) 0.005426 / 0.011008 (-0.005582) 0.096488 / 0.038508 (0.057980) 0.033575 / 0.023109 (0.010465) 0.375688 / 0.275898 (0.099790) 0.412393 / 0.323480 (0.088913) 0.006050 / 0.007986 (-0.001936) 0.004424 / 0.004328 (0.000095) 0.073102 / 0.004250 (0.068852) 0.052672 / 0.037052 (0.015620) 0.379352 / 0.258489 (0.120862) 0.436065 / 0.293841 (0.142224) 0.036594 / 0.128546 (-0.091952) 0.012380 / 0.075646 (-0.063266) 0.332899 / 0.419271 (-0.086373) 0.048859 / 0.043533 (0.005326) 0.373215 / 0.255139 (0.118076) 0.386990 / 0.283200 (0.103791) 0.105166 / 0.141683 (-0.036517) 1.490762 / 1.452155 (0.038607) 1.611310 / 1.492716 (0.118593)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.333142 / 0.018006 (0.315136) 0.537137 / 0.000490 (0.536647) 0.000452 / 0.000200 (0.000252) 0.000063 / 0.000054 (0.000009)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030368 / 0.037411 (-0.007043) 0.109608 / 0.014526 (0.095083) 0.124220 / 0.176557 (-0.052336) 0.162834 / 0.737135 (-0.574301) 0.128037 / 0.296338 (-0.168302)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.440991 / 0.215209 (0.225782) 4.400825 / 2.077655 (2.323170) 2.158768 / 1.504120 (0.654648) 1.968158 / 1.541195 (0.426963) 2.085115 / 1.468490 (0.616625) 0.710757 / 4.584777 (-3.874020) 3.835441 / 3.745712 (0.089729) 2.204118 / 5.269862 (-3.065744) 1.378909 / 4.565676 (-3.186767) 0.089149 / 0.424275 (-0.335126) 0.013066 / 0.007607 (0.005459) 0.539165 / 0.226044 (0.313121) 5.414176 / 2.268929 (3.145248) 2.677020 / 55.444624 (-52.767604) 2.328334 / 6.876477 (-4.548143) 2.518933 / 2.142072 (0.376860) 0.840902 / 4.805227 (-3.964325) 0.170365 / 6.500664 (-6.330299) 0.063909 / 0.075469 (-0.011561)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.237205 / 1.841788 (-0.604583) 15.678776 / 8.074308 (7.604468) 14.118576 / 10.191392 (3.927184) 0.167236 / 0.680424 (-0.513188) 0.018177 / 0.534201 (-0.516024) 0.426680 / 0.579283 (-0.152603) 0.425126 / 0.434364 (-0.009238) 0.501755 / 0.540337 (-0.038582) 0.592754 / 1.386936 (-0.794182)

HuggingFaceDocBuilderDev commented Jan 5, 2023

The documentation is not available anymore as the PR was closed or merged.

lhoestq marked this pull request as ready for review January 25, 2023 19:19
github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008708 / 0.011353 (-0.002645) 0.004462 / 0.011008 (-0.006546) 0.100159 / 0.038508 (0.061651) 0.029543 / 0.023109 (0.006434) 0.304056 / 0.275898 (0.028158) 0.367098 / 0.323480 (0.043618) 0.007049 / 0.007986 (-0.000937) 0.003294 / 0.004328 (-0.001034) 0.076954 / 0.004250 (0.072703) 0.036850 / 0.037052 (-0.000202) 0.307556 / 0.258489 (0.049067) 0.348327 / 0.293841 (0.054486) 0.033520 / 0.128546 (-0.095026) 0.011312 / 0.075646 (-0.064334) 0.317588 / 0.419271 (-0.101684) 0.040196 / 0.043533 (-0.003337) 0.298330 / 0.255139 (0.043191) 0.333821 / 0.283200 (0.050622) 0.086584 / 0.141683 (-0.055099) 1.480205 / 1.452155 (0.028050) 1.520975 / 1.492716 (0.028259)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.186641 / 0.018006 (0.168635) 0.414420 / 0.000490 (0.413930) 0.003021 / 0.000200 (0.002821) 0.000073 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.022953 / 0.037411 (-0.014458) 0.097338 / 0.014526 (0.082812) 0.104985 / 0.176557 (-0.071572) 0.139208 / 0.737135 (-0.597927) 0.108031 / 0.296338 (-0.188307)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.417969 / 0.215209 (0.202759) 4.173189 / 2.077655 (2.095534) 1.862813 / 1.504120 (0.358693) 1.653226 / 1.541195 (0.112031) 1.725917 / 1.468490 (0.257426) 0.701038 / 4.584777 (-3.883739) 3.350500 / 3.745712 (-0.395213) 1.913156 / 5.269862 (-3.356705) 1.267597 / 4.565676 (-3.298079) 0.082197 / 0.424275 (-0.342078) 0.012499 / 0.007607 (0.004892) 0.520173 / 0.226044 (0.294128) 5.219981 / 2.268929 (2.951053) 2.306029 / 55.444624 (-53.138595) 1.948169 / 6.876477 (-4.928307) 2.013160 / 2.142072 (-0.128912) 0.813325 / 4.805227 (-3.991902) 0.149729 / 6.500664 (-6.350935) 0.065492 / 0.075469 (-0.009977)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.194163 / 1.841788 (-0.647625) 13.739562 / 8.074308 (5.665254) 13.881988 / 10.191392 (3.690596) 0.138180 / 0.680424 (-0.542244) 0.029031 / 0.534201 (-0.505170) 0.387858 / 0.579283 (-0.191425) 0.395171 / 0.434364 (-0.039193) 0.446349 / 0.540337 (-0.093988) 0.527073 / 1.386936 (-0.859863)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006504 / 0.011353 (-0.004849) 0.004564 / 0.011008 (-0.006444) 0.099108 / 0.038508 (0.060599) 0.027420 / 0.023109 (0.004311) 0.340712 / 0.275898 (0.064814) 0.391613 / 0.323480 (0.068133) 0.004977 / 0.007986 (-0.003009) 0.003375 / 0.004328 (-0.000953) 0.076403 / 0.004250 (0.072152) 0.036650 / 0.037052 (-0.000402) 0.341948 / 0.258489 (0.083459) 0.392065 / 0.293841 (0.098224) 0.031802 / 0.128546 (-0.096745) 0.011659 / 0.075646 (-0.063987) 0.320099 / 0.419271 (-0.099173) 0.041615 / 0.043533 (-0.001918) 0.342125 / 0.255139 (0.086986) 0.372833 / 0.283200 (0.089633) 0.089032 / 0.141683 (-0.052650) 1.486691 / 1.452155 (0.034536) 1.567326 / 1.492716 (0.074610)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.193123 / 0.018006 (0.175117) 0.404062 / 0.000490 (0.403573) 0.003460 / 0.000200 (0.003260) 0.000079 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.024565 / 0.037411 (-0.012846) 0.098958 / 0.014526 (0.084432) 0.108701 / 0.176557 (-0.067855) 0.142567 / 0.737135 (-0.594569) 0.111048 / 0.296338 (-0.185290)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474549 / 0.215209 (0.259340) 4.753776 / 2.077655 (2.676121) 2.435528 / 1.504120 (0.931409) 2.234491 / 1.541195 (0.693297) 2.269474 / 1.468490 (0.800984) 0.695636 / 4.584777 (-3.889141) 3.367816 / 3.745712 (-0.377896) 1.854828 / 5.269862 (-3.415034) 1.159729 / 4.565676 (-3.405948) 0.082267 / 0.424275 (-0.342008) 0.012483 / 0.007607 (0.004876) 0.578490 / 0.226044 (0.352446) 5.814490 / 2.268929 (3.545561) 2.893310 / 55.444624 (-52.551314) 2.540555 / 6.876477 (-4.335922) 2.573705 / 2.142072 (0.431633) 0.800545 / 4.805227 (-4.004682) 0.151306 / 6.500664 (-6.349358) 0.067925 / 0.075469 (-0.007544)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.294645 / 1.841788 (-0.547142) 13.641842 / 8.074308 (5.567534) 14.015200 / 10.191392 (3.823808) 0.128829 / 0.680424 (-0.551595) 0.016870 / 0.534201 (-0.517331) 0.389137 / 0.579283 (-0.190146) 0.388384 / 0.434364 (-0.045980) 0.447711 / 0.540337 (-0.092627) 0.540637 / 1.386936 (-0.846299)

github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.012282 / 0.011353 (0.000929) 0.006328 / 0.011008 (-0.004680) 0.129666 / 0.038508 (0.091158) 0.039403 / 0.023109 (0.016294) 0.375464 / 0.275898 (0.099566) 0.463167 / 0.323480 (0.139687) 0.010329 / 0.007986 (0.002344) 0.005111 / 0.004328 (0.000782) 0.108727 / 0.004250 (0.104476) 0.047156 / 0.037052 (0.010103) 0.381869 / 0.258489 (0.123380) 0.441936 / 0.293841 (0.148095) 0.054750 / 0.128546 (-0.073796) 0.019809 / 0.075646 (-0.055837) 0.436389 / 0.419271 (0.017118) 0.066585 / 0.043533 (0.023052) 0.402108 / 0.255139 (0.146969) 0.424571 / 0.283200 (0.141371) 0.118326 / 0.141683 (-0.023357) 1.870175 / 1.452155 (0.418020) 1.878720 / 1.492716 (0.386004)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.012863 / 0.018006 (-0.005144) 0.528670 / 0.000490 (0.528181) 0.006057 / 0.000200 (0.005857) 0.000124 / 0.000054 (0.000069)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.030091 / 0.037411 (-0.007320) 0.136143 / 0.014526 (0.121618) 0.148931 / 0.176557 (-0.027626) 0.179578 / 0.737135 (-0.557558) 0.144528 / 0.296338 (-0.151810)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.594080 / 0.215209 (0.378871) 6.029101 / 2.077655 (3.951446) 2.443084 / 1.504120 (0.938964) 2.123949 / 1.541195 (0.582754) 2.183021 / 1.468490 (0.714531) 1.235453 / 4.584777 (-3.349324) 5.585121 / 3.745712 (1.839408) 3.208510 / 5.269862 (-2.061351) 2.090334 / 4.565676 (-2.475342) 0.150353 / 0.424275 (-0.273922) 0.016787 / 0.007607 (0.009180) 0.797561 / 0.226044 (0.571516) 7.756291 / 2.268929 (5.487363) 3.283638 / 55.444624 (-52.160986) 2.527441 / 6.876477 (-4.349036) 2.590765 / 2.142072 (0.448692) 1.446818 / 4.805227 (-3.358409) 0.250563 / 6.500664 (-6.250101) 0.077919 / 0.075469 (0.002450)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.612022 / 1.841788 (-0.229765) 18.363316 / 8.074308 (10.289008) 22.578570 / 10.191392 (12.387178) 0.232801 / 0.680424 (-0.447623) 0.048232 / 0.534201 (-0.485969) 0.549518 / 0.579283 (-0.029766) 0.624663 / 0.434364 (0.190299) 0.674745 / 0.540337 (0.134408) 0.803489 / 1.386936 (-0.583447)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.009872 / 0.011353 (-0.001481) 0.006593 / 0.011008 (-0.004415) 0.139248 / 0.038508 (0.100740) 0.035708 / 0.023109 (0.012598) 0.551335 / 0.275898 (0.275437) 0.544995 / 0.323480 (0.221515) 0.007085 / 0.007986 (-0.000900) 0.004742 / 0.004328 (0.000413) 0.095823 / 0.004250 (0.091572) 0.051674 / 0.037052 (0.014621) 0.463405 / 0.258489 (0.204916) 0.640392 / 0.293841 (0.346551) 0.055242 / 0.128546 (-0.073304) 0.022602 / 0.075646 (-0.053044) 0.419171 / 0.419271 (-0.000100) 0.062986 / 0.043533 (0.019453) 0.503683 / 0.255139 (0.248544) 0.568719 / 0.283200 (0.285519) 0.113906 / 0.141683 (-0.027777) 1.825248 / 1.452155 (0.373094) 1.985667 / 1.492716 (0.492951)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.237478 / 0.018006 (0.219472) 0.528861 / 0.000490 (0.528371) 0.008507 / 0.000200 (0.008307) 0.000158 / 0.000054 (0.000103)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.033536 / 0.037411 (-0.003875) 0.144202 / 0.014526 (0.129677) 0.139472 / 0.176557 (-0.037084) 0.184540 / 0.737135 (-0.552596) 0.147818 / 0.296338 (-0.148520)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.671654 / 0.215209 (0.456445) 6.616368 / 2.077655 (4.538713) 2.805634 / 1.504120 (1.301514) 2.482890 / 1.541195 (0.941695) 2.547686 / 1.468490 (1.079195) 1.289169 / 4.584777 (-3.295608) 5.551436 / 3.745712 (1.805724) 5.228500 / 5.269862 (-0.041362) 2.456706 / 4.565676 (-2.108970) 0.148556 / 0.424275 (-0.275720) 0.015290 / 0.007607 (0.007683) 0.837090 / 0.226044 (0.611045) 8.373561 / 2.268929 (6.104632) 3.663910 / 55.444624 (-51.780714) 2.927117 / 6.876477 (-3.949360) 2.976785 / 2.142072 (0.834712) 1.501618 / 4.805227 (-3.303609) 0.263321 / 6.500664 (-6.237343) 0.082644 / 0.075469 (0.007175)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.707419 / 1.841788 (-0.134368) 18.371117 / 8.074308 (10.296809) 22.015154 / 10.191392 (11.823762) 0.232066 / 0.680424 (-0.448357) 0.027149 / 0.534201 (-0.507052) 0.544450 / 0.579283 (-0.034833) 0.605134 / 0.434364 (0.170770) 0.656063 / 0.540337 (0.115725) 0.788121 / 1.386936 (-0.598815)

github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008952 / 0.011353 (-0.002401) 0.005592 / 0.011008 (-0.005416) 0.101138 / 0.038508 (0.062630) 0.035573 / 0.023109 (0.012464) 0.295959 / 0.275898 (0.020060) 0.365347 / 0.323480 (0.041867) 0.008136 / 0.007986 (0.000150) 0.004479 / 0.004328 (0.000150) 0.078806 / 0.004250 (0.074556) 0.045180 / 0.037052 (0.008127) 0.321687 / 0.258489 (0.063198) 0.345874 / 0.293841 (0.052033) 0.038720 / 0.128546 (-0.089826) 0.012534 / 0.075646 (-0.063112) 0.335571 / 0.419271 (-0.083700) 0.049048 / 0.043533 (0.005515) 0.294756 / 0.255139 (0.039617) 0.327496 / 0.283200 (0.044296) 0.109181 / 0.141683 (-0.032502) 1.417068 / 1.452155 (-0.035087) 1.455473 / 1.492716 (-0.037244)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.267774 / 0.018006 (0.249768) 0.538546 / 0.000490 (0.538056) 0.001755 / 0.000200 (0.001555) 0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.026839 / 0.037411 (-0.010572) 0.105862 / 0.014526 (0.091336) 0.118278 / 0.176557 (-0.058279) 0.157926 / 0.737135 (-0.579209) 0.124700 / 0.296338 (-0.171638)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.399060 / 0.215209 (0.183851) 3.991409 / 2.077655 (1.913754) 1.763569 / 1.504120 (0.259449) 1.579602 / 1.541195 (0.038407) 1.652928 / 1.468490 (0.184438) 0.692962 / 4.584777 (-3.891815) 3.784635 / 3.745712 (0.038922) 3.249341 / 5.269862 (-2.020521) 1.815711 / 4.565676 (-2.749966) 0.084384 / 0.424275 (-0.339891) 0.012546 / 0.007607 (0.004939) 0.521397 / 0.226044 (0.295352) 5.075824 / 2.268929 (2.806895) 2.258353 / 55.444624 (-53.186272) 1.925220 / 6.876477 (-4.951256) 2.002821 / 2.142072 (-0.139252) 0.830507 / 4.805227 (-3.974720) 0.165845 / 6.500664 (-6.334819) 0.063905 / 0.075469 (-0.011565)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.198726 / 1.841788 (-0.643061) 14.804448 / 8.074308 (6.730139) 12.855167 / 10.191392 (2.663775) 0.167932 / 0.680424 (-0.512492) 0.028643 / 0.534201 (-0.505558) 0.441224 / 0.579283 (-0.138059) 0.434924 / 0.434364 (0.000560) 0.516188 / 0.540337 (-0.024150) 0.605017 / 1.386936 (-0.781919)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007031 / 0.011353 (-0.004322) 0.005157 / 0.011008 (-0.005851) 0.086943 / 0.038508 (0.048434) 0.031377 / 0.023109 (0.008268) 0.334810 / 0.275898 (0.058912) 0.368590 / 0.323480 (0.045110) 0.005973 / 0.007986 (-0.002013) 0.004173 / 0.004328 (-0.000155) 0.067033 / 0.004250 (0.062783) 0.054070 / 0.037052 (0.017018) 0.332232 / 0.258489 (0.073743) 0.384982 / 0.293841 (0.091141) 0.034023 / 0.128546 (-0.094524) 0.011301 / 0.075646 (-0.064345) 0.295644 / 0.419271 (-0.123628) 0.045589 / 0.043533 (0.002056) 0.330739 / 0.255139 (0.075600) 0.352841 / 0.283200 (0.069642) 0.104829 / 0.141683 (-0.036854) 1.329360 / 1.452155 (-0.122794) 1.437956 / 1.492716 (-0.054760)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.299187 / 0.018006 (0.281181) 0.563407 / 0.000490 (0.562917) 0.004179 / 0.000200 (0.003979) 0.000114 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.027405 / 0.037411 (-0.010006) 0.097498 / 0.014526 (0.082972) 0.114265 / 0.176557 (-0.062292) 0.146823 / 0.737135 (-0.590313) 0.117948 / 0.296338 (-0.178391)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.378756 / 0.215209 (0.163547) 3.774804 / 2.077655 (1.697150) 1.804149 / 1.504120 (0.300029) 1.626312 / 1.541195 (0.085117) 1.731111 / 1.468490 (0.262620) 0.633493 / 4.584777 (-3.951284) 3.488220 / 3.745712 (-0.257492) 3.064710 / 5.269862 (-2.205151) 1.690647 / 4.565676 (-2.875029) 0.076093 / 0.424275 (-0.348182) 0.010820 / 0.007607 (0.003213) 0.465091 / 0.226044 (0.239046) 4.676842 / 2.268929 (2.407913) 2.297381 / 55.444624 (-53.147244) 1.960355 / 6.876477 (-4.916122) 1.983742 / 2.142072 (-0.158330) 0.739525 / 4.805227 (-4.065702) 0.152663 / 6.500664 (-6.348001) 0.057316 / 0.075469 (-0.018153)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.104721 / 1.841788 (-0.737067) 14.577171 / 8.074308 (6.502863) 13.680402 / 10.191392 (3.489010) 0.182234 / 0.680424 (-0.498190) 0.018853 / 0.534201 (-0.515348) 0.426194 / 0.579283 (-0.153089) 0.429202 / 0.434364 (-0.005162) 0.543125 / 0.540337 (0.002788) 0.645887 / 1.386936 (-0.741049)

github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010055 / 0.011353 (-0.001298) 0.005576 / 0.011008 (-0.005432) 0.100059 / 0.038508 (0.061551) 0.038535 / 0.023109 (0.015425) 0.297538 / 0.275898 (0.021640) 0.368117 / 0.323480 (0.044637) 0.008540 / 0.007986 (0.000555) 0.004469 / 0.004328 (0.000141) 0.075801 / 0.004250 (0.071551) 0.046604 / 0.037052 (0.009552) 0.307242 / 0.258489 (0.048753) 0.343949 / 0.293841 (0.050108) 0.039353 / 0.128546 (-0.089194) 0.012446 / 0.075646 (-0.063200) 0.334628 / 0.419271 (-0.084643) 0.051628 / 0.043533 (0.008095) 0.298726 / 0.255139 (0.043587) 0.316010 / 0.283200 (0.032810) 0.120564 / 0.141683 (-0.021119) 1.459396 / 1.452155 (0.007241) 1.493682 / 1.492716 (0.000965)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.011702 / 0.018006 (-0.006304) 0.570261 / 0.000490 (0.569771) 0.003760 / 0.000200 (0.003560) 0.000091 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.028806 / 0.037411 (-0.008605) 0.112150 / 0.014526 (0.097625) 0.123140 / 0.176557 (-0.053417) 0.173055 / 0.737135 (-0.564080) 0.130060 / 0.296338 (-0.166279)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.398216 / 0.215209 (0.183007) 3.978677 / 2.077655 (1.901022) 1.754229 / 1.504120 (0.250109) 1.561892 / 1.541195 (0.020697) 1.679138 / 1.468490 (0.210648) 0.690254 / 4.584777 (-3.894523) 3.817698 / 3.745712 (0.071986) 2.177854 / 5.269862 (-3.092008) 1.361860 / 4.565676 (-3.203816) 0.084108 / 0.424275 (-0.340167) 0.012640 / 0.007607 (0.005033) 0.504385 / 0.226044 (0.278341) 5.034103 / 2.268929 (2.765174) 2.254032 / 55.444624 (-53.190593) 1.910439 / 6.876477 (-4.966038) 2.003515 / 2.142072 (-0.138558) 0.839747 / 4.805227 (-3.965480) 0.165654 / 6.500664 (-6.335010) 0.063483 / 0.075469 (-0.011986)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.187521 / 1.841788 (-0.654267) 15.381121 / 8.074308 (7.306812) 14.579418 / 10.191392 (4.388026) 0.199221 / 0.680424 (-0.481202) 0.029335 / 0.534201 (-0.504866) 0.443159 / 0.579283 (-0.136124) 0.447772 / 0.434364 (0.013408) 0.545071 / 0.540337 (0.004733) 0.650494 / 1.386936 (-0.736442)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.007675 / 0.011353 (-0.003677) 0.005364 / 0.011008 (-0.005644) 0.097921 / 0.038508 (0.059413) 0.033645 / 0.023109 (0.010536) 0.404818 / 0.275898 (0.128920) 0.429983 / 0.323480 (0.106503) 0.006106 / 0.007986 (-0.001879) 0.005281 / 0.004328 (0.000953) 0.073762 / 0.004250 (0.069512) 0.053065 / 0.037052 (0.016012) 0.400657 / 0.258489 (0.142168) 0.447743 / 0.293841 (0.153902) 0.036782 / 0.128546 (-0.091765) 0.012593 / 0.075646 (-0.063054) 0.332825 / 0.419271 (-0.086446) 0.049424 / 0.043533 (0.005891) 0.400397 / 0.255139 (0.145258) 0.414794 / 0.283200 (0.131594) 0.106555 / 0.141683 (-0.035128) 1.466917 / 1.452155 (0.014762) 1.571351 / 1.492716 (0.078635)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.254337 / 0.018006 (0.236331) 0.568360 / 0.000490 (0.567870) 0.000445 / 0.000200 (0.000245) 0.000059 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031044 / 0.037411 (-0.006367) 0.112282 / 0.014526 (0.097756) 0.127205 / 0.176557 (-0.049352) 0.166551 / 0.737135 (-0.570584) 0.130520 / 0.296338 (-0.165818)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.442906 / 0.215209 (0.227697) 4.430218 / 2.077655 (2.352563) 2.287251 / 1.504120 (0.783132) 2.112345 / 1.541195 (0.571150) 2.240952 / 1.468490 (0.772462) 0.713800 / 4.584777 (-3.870977) 3.884161 / 3.745712 (0.138449) 2.166901 / 5.269862 (-3.102960) 1.374490 / 4.565676 (-3.191187) 0.087548 / 0.424275 (-0.336727) 0.012369 / 0.007607 (0.004761) 0.540783 / 0.226044 (0.314739) 5.396187 / 2.268929 (3.127258) 2.779636 / 55.444624 (-52.664988) 2.434220 / 6.876477 (-4.442257) 2.508180 / 2.142072 (0.366107) 0.852470 / 4.805227 (-3.952757) 0.171266 / 6.500664 (-6.329398) 0.065463 / 0.075469 (-0.010006)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.241720 / 1.841788 (-0.600067) 15.332568 / 8.074308 (7.258260) 13.688723 / 10.191392 (3.497331) 0.145150 / 0.680424 (-0.535273) 0.017694 / 0.534201 (-0.516507) 0.426078 / 0.579283 (-0.153205) 0.441189 / 0.434364 (0.006825) 0.540284 / 0.540337 (-0.000054) 0.657548 / 1.386936 (-0.729388)

github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008604 / 0.011353 (-0.002749) 0.004566 / 0.011008 (-0.006442) 0.099607 / 0.038508 (0.061099) 0.029628 / 0.023109 (0.006519) 0.300481 / 0.275898 (0.024583) 0.342596 / 0.323480 (0.019116) 0.007003 / 0.007986 (-0.000982) 0.003408 / 0.004328 (-0.000920) 0.079076 / 0.004250 (0.074826) 0.034104 / 0.037052 (-0.002948) 0.303856 / 0.258489 (0.045367) 0.348729 / 0.293841 (0.054888) 0.033752 / 0.128546 (-0.094794) 0.011497 / 0.075646 (-0.064149) 0.321568 / 0.419271 (-0.097704) 0.041472 / 0.043533 (-0.002061) 0.303396 / 0.255139 (0.048257) 0.331121 / 0.283200 (0.047921) 0.086203 / 0.141683 (-0.055480) 1.476995 / 1.452155 (0.024840) 1.539428 / 1.492716 (0.046712)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.215810 / 0.018006 (0.197803) 0.414292 / 0.000490 (0.413802) 0.000388 / 0.000200 (0.000188) 0.000058 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.023441 / 0.037411 (-0.013970) 0.098463 / 0.014526 (0.083938) 0.105435 / 0.176557 (-0.071121) 0.139736 / 0.737135 (-0.597399) 0.109467 / 0.296338 (-0.186872)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.418244 / 0.215209 (0.203035) 4.160693 / 2.077655 (2.083039) 1.878895 / 1.504120 (0.374775) 1.679338 / 1.541195 (0.138143) 1.730384 / 1.468490 (0.261894) 0.688603 / 4.584777 (-3.896174) 3.393542 / 3.745712 (-0.352170) 1.901337 / 5.269862 (-3.368525) 1.447269 / 4.565676 (-3.118408) 0.083003 / 0.424275 (-0.341272) 0.012574 / 0.007607 (0.004967) 0.526363 / 0.226044 (0.300318) 5.275159 / 2.268929 (3.006230) 2.323642 / 55.444624 (-53.120982) 1.982929 / 6.876477 (-4.893548) 2.014081 / 2.142072 (-0.127991) 0.809466 / 4.805227 (-3.995761) 0.149038 / 6.500664 (-6.351626) 0.064394 / 0.075469 (-0.011075)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.207439 / 1.841788 (-0.634349) 13.691048 / 8.074308 (5.616740) 13.880965 / 10.191392 (3.689573) 0.148553 / 0.680424 (-0.531871) 0.028397 / 0.534201 (-0.505804) 0.391818 / 0.579283 (-0.187465) 0.407181 / 0.434364 (-0.027183) 0.481163 / 0.540337 (-0.059175) 0.570689 / 1.386936 (-0.816247)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006361 / 0.011353 (-0.004992) 0.004520 / 0.011008 (-0.006488) 0.097679 / 0.038508 (0.059171) 0.027223 / 0.023109 (0.004113) 0.407966 / 0.275898 (0.132068) 0.439868 / 0.323480 (0.116388) 0.004625 / 0.007986 (-0.003360) 0.004039 / 0.004328 (-0.000289) 0.074548 / 0.004250 (0.070298) 0.034957 / 0.037052 (-0.002095) 0.412762 / 0.258489 (0.154273) 0.449716 / 0.293841 (0.155875) 0.031272 / 0.128546 (-0.097274) 0.011598 / 0.075646 (-0.064049) 0.320922 / 0.419271 (-0.098349) 0.041250 / 0.043533 (-0.002283) 0.411439 / 0.255139 (0.156300) 0.429722 / 0.283200 (0.146523) 0.087161 / 0.141683 (-0.054522) 1.512573 / 1.452155 (0.060418) 1.569385 / 1.492716 (0.076668)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.222612 / 0.018006 (0.204606) 0.409086 / 0.000490 (0.408596) 0.004246 / 0.000200 (0.004046) 0.000083 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.024324 / 0.037411 (-0.013087) 0.099055 / 0.014526 (0.084530) 0.106809 / 0.176557 (-0.069748) 0.141275 / 0.737135 (-0.595860) 0.109426 / 0.296338 (-0.186913)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.469736 / 0.215209 (0.254527) 4.686900 / 2.077655 (2.609246) 2.413392 / 1.504120 (0.909272) 2.217366 / 1.541195 (0.676171) 2.266957 / 1.468490 (0.798467) 0.698647 / 4.584777 (-3.886129) 3.389317 / 3.745712 (-0.356395) 1.862315 / 5.269862 (-3.407546) 1.160931 / 4.565676 (-3.404746) 0.082829 / 0.424275 (-0.341446) 0.012627 / 0.007607 (0.005020) 0.568027 / 0.226044 (0.341983) 5.683220 / 2.268929 (3.414291) 2.865701 / 55.444624 (-52.578924) 2.522401 / 6.876477 (-4.354076) 2.542395 / 2.142072 (0.400323) 0.801224 / 4.805227 (-4.004003) 0.149946 / 6.500664 (-6.350718) 0.065447 / 0.075469 (-0.010023)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.283756 / 1.841788 (-0.558032) 13.903662 / 8.074308 (5.829354) 13.238389 / 10.191392 (3.046997) 0.142304 / 0.680424 (-0.538120) 0.016922 / 0.534201 (-0.517279) 0.377797 / 0.579283 (-0.201487) 0.382460 / 0.434364 (-0.051904) 0.464645 / 0.540337 (-0.075692) 0.556270 / 1.386936 (-0.830666)

stevhliu (Member) left a comment

This is so cool, I learned a lot reading this. I'm sure it'll be super valuable and welcomed by the community! 😄

I think this would be more of a Conceptual Guide doc since this is more explanatory and compares the differences between a Dataset and an IterableDataset. It's not necessarily a how-to for doing something, but it discusses and explains the two types of datasets. There are definitely places in the docs, though, where we can add a nice link to this doc to build up the user's understanding of this topic. For example, in the Know your dataset tutorial, we only introduce the regular Dataset object and not the IterableDataset. We can add a section there for IterableDataset and then link to this doc that explains the difference between the two 🙂


## Downloading and streaming

When you have a regular "map-style" [`Dataset`], you can access it using `my_dataset[0]`: we have what we call "random access" to the rows.
stevhliu (Member)

What is meant by a “map-style” Dataset? If I understand correctly, this is just a regular Dataset. So it might be easier for users to understand if we don’t use this specific term and just use Dataset or if we define what we mean by “map-style” (unless this is commonly known jargon, in which case ignore this haha).

lhoestq (Member Author)

Yes, it refers to datasets with random access, i.e. datasets that allow you to do `my_dataset[0]`. I'll define it properly
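To make "random access" concrete, a small illustrative sketch (the toy data is invented for this example):

```python
from datasets import Dataset

# A toy map-style dataset: rows are materialized and indexable.
ds = Dataset.from_dict({"text": ["first", "second", "third"]})

# "Map-style" means rows can be looked up by index (random access):
print(ds[0])    # {'text': 'first'}
print(len(ds))  # 3

# An IterableDataset, by contrast, has no random access: examples are
# produced one by one as you iterate over it, which is what allows
# lazy / streamed data sources.
```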

```python
print(my_dataset[0])
```

To not have to wait for the conversion to Arrow, you can define an iterable dataset by streaming from your local files.
stevhliu (Member)

What is the benefit of not converting Dataset to Arrow (obvs it’s faster, but it’d be good to mention this explicitly for the user)?

lhoestq (Member Author)

Faster + saves disk space + you can modify your original data and re-instantiate the dataset without having to reconvert it. I'll mention this!
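For example, a hedged sketch of streaming an iterable dataset from a local file instead of converting it to Arrow first (the file path is hypothetical):

```python
from datasets import load_dataset

# Stream examples straight from a local JSON Lines file: nothing is
# converted to Arrow or written to disk, and if the source file changes
# you only need to re-instantiate the dataset, not re-convert it.
iterable_ds = load_dataset(
    "json",
    data_files="path/to/my_data.jsonl",  # hypothetical path
    split="train",
    streaming=True,
)

for example in iterable_ds:
    print(example)
    break  # just peek at the first example
```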

lhoestq (Member Author) commented Jan 26, 2023

> I think this would be more of a Conceptual Guide doc since this is more explanatory and compares the differences between a Dataset and an IterableDataset

sounds good to me !

> There are definitely places in the docs, though, where we can add a nice link to this doc to build up the user's understanding of this topic. For example, in the Know your dataset tutorial, we only introduce the regular Dataset object and not the IterableDataset. We can add a section there for IterableDataset and then link to this doc that explains the difference between the two 🙂

good idea, thanks :)

stevhliu (Member)
I'll open a PR to add a section on IterableDataset in the tutorial, and once you're done editing this doc I can give it a final polish! 😄

lhoestq (Member Author) commented Jan 27, 2023

I moved the doc page to conceptual guides and took your suggestions into account :)

I think this is ready for final review now

github-actions bot posted benchmarks:

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

New / old timing diffs for all metrics in benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest (auto-generated report).

@github-actions

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.

Member

@stevhliu stevhliu left a comment

Awesome doc, thanks for sharing all this info!

docs/source/use_with_pytorch.mdx (outdated, resolved)
@@ -0,0 +1,220 @@
# Differences between Dataset and IterableDataset

Member

Maybe just add a sentence or two here that introduces the topic and scope of the doc. Something like:

There are two types of dataset objects, a Dataset and an IterableDataset. Which type of dataset you choose to use or create depends on the size of the dataset. In general, an IterableDataset is ideal for big datasets (think hundreds of GBs!) due to its lazy behavior and speed advantages, while a Dataset is great for everything else. This page will compare the differences between a Dataset and an IterableDataset to help you pick the right dataset object for you.

Member Author

sounds good to me!
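To make the suggested intro concrete, here is a minimal sketch contrasting the two objects; the dataset name is a placeholder, and `streaming=True` is the usual way to get an `IterableDataset` from the Hub.

```python
from datasets import load_dataset

# Map-style Dataset: fully downloaded and prepared on disk, supports random access.
ds = load_dataset("username/my_dataset", split="train")  # placeholder dataset name
print(ds[0])    # random access by index
print(len(ds))  # known length

# IterableDataset: examples are streamed lazily, no random access, ideal for huge datasets.
ids = load_dataset("username/my_dataset", split="train", streaming=True)
print(next(iter(ids)))  # examples are produced on the fly
```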

docs/source/about_mapstyle_vs_iterable.mdx (5 outdated, resolved inline comments)
my_iterable_dataset.n_shards # 1024
```

Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !
Member

Suggested change
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions !
Feel free to open a discussion on the 🤗 Datasets [forum](https://discuss.huggingface.co/c/datasets/10) if you have questions!

Collaborator

I would remove this sentence altogether. Two existing links in our docs are more than enough :).

src/datasets/arrow_dataset.py (outdated, resolved)
Returns:
[`datasets.IterableDataset`]

Example:
Member

Love all the example usages here! 😍
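For reference, a minimal sketch of the kind of usage example discussed here, written against the method name proposed in this PR (`to_iterable` with a `num_shards` argument); the final name and signature may differ.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(1_000)]})

# Convert the map-style dataset into an iterable one, split into shards
# so it can later be shuffled and loaded in parallel efficiently.
ids = ds.to_iterable(num_shards=64)  # method name as proposed in this PR

for example in ids:
    print(example["text"])
    break
```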

Collaborator

@mariosasko mariosasko left a comment

Nice!

The code looks good.

Regarding the docs, I think it would be better to add this info as notes/tips/sections to the existing docs (Process/Stream; e.g. a tip under Dataset.shuffle that explains how to make this operation more performant by using to_iterable + shuffle, etc.) rather than introducing a new doc page.
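A rough sketch of the tip being proposed here (method name as used in this thread; the buffer size is arbitrary):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(10_000)]})

# Dataset.shuffle builds an indices mapping, which makes later iteration slower
# unless the dataset is rewritten afterwards (e.g. with flatten_indices()).
shuffled_ds = ds.shuffle(seed=42)

# The tip: shard, convert to an IterableDataset, and use its fast approximate
# shuffling instead (shuffled shard order + a shuffle buffer).
shuffled_ids = ds.to_iterable(num_shards=128).shuffle(seed=42, buffer_size=1_000)
```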

src/datasets/arrow_dataset.py (outdated, resolved)

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@github-actions

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.

@github-actions

github-actions bot commented Feb 1, 2023

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.

@lhoestq
Member Author

lhoestq commented Feb 1, 2023

I took your comments into account :)

Regarding the docs, I think it would be better to add this info as notes/tips/sections to the existing docs (Process/Stream; e.g. a tip under Dataset.shuffle that explains how to make this operation more performant by using to_iterable + shuffle, etc.) rather than introducing a new doc page.

I added a paragraph in the Dataset.shuffle docstring, and a note in the Process doc page

@github-actions

github-actions bot commented Feb 1, 2023

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.

Contributor

@polinaeterna polinaeterna left a comment

I love the new doc about Dataset vs IterableDataset, thank you! And I think it's worth a separate page.
I left just a few comments on the text.

docs/source/about_mapstyle_vs_iterable.mdx Outdated Show resolved Hide resolved
docs/source/about_mapstyle_vs_iterable.mdx Outdated Show resolved Hide resolved
docs/source/process.mdx Outdated Show resolved Hide resolved
docs/source/use_with_pytorch.mdx Outdated Show resolved Hide resolved
src/datasets/arrow_dataset.py Outdated Show resolved Hide resolved
Collaborator

@mariosasko mariosasko left a comment

Looks good! I guess we can keep the new page.

Btw, this page boils down to "switch from Dataset to IterableDataset to save time/disk space by avoiding the full rewrite of a dataset" (map always does it, flatten_indices is good to run after shuffle to preserve the iteration speed), so maybe a better title for it would be "Optimize processing" (or "Working with datasets at scale" as I mentioned earlier on Slack)
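A small illustration of that point, as a sketch rather than code taken from the PR (method name as used in this thread):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"] * 1_000})

# Eager: Dataset.map writes the transformed dataset to a new Arrow cache file right away.
mapped_ds = ds.map(lambda x: {"text": x["text"].upper()})

# Lazy: the same transform on an IterableDataset is only applied while iterating,
# so nothing gets rewritten on disk.
mapped_ids = ds.to_iterable(num_shards=4).map(lambda x: {"text": x["text"].upper()})

# After Dataset.shuffle, flatten_indices() rewrites the rows in shuffled order
# so that subsequent iteration is contiguous (and therefore fast) again.
fast_shuffled_ds = ds.shuffle(seed=0).flatten_indices()
```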

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)
cc @stevhliu

Co-authored-by: Polina Kazakova <polina@huggingface.co>
@github-actions

github-actions bot commented Feb 1, 2023

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.

@lhoestq
Member Author

lhoestq commented Feb 1, 2023

Took your last comments into account!

so maybe a better title for it would be "Optimize processing" (or "Working with datasets at scale" as I mentioned earlier on Slack)

I think the content would be slightly different, e.g. focus more on multiprocessing/sharding or what data formats to use. This can be a complementary page IMO
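A hypothetical illustration of the multiprocessing/sharding angle (assuming shards get distributed across DataLoader workers, which is what `num_shards` is meant to enable; method name as used in this thread):

```python
from datasets import Dataset
from torch.utils.data import DataLoader

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(100_000)]})

# Shard the converted dataset so that several DataLoader workers can each
# iterate over their own subset of shards in parallel.
ids = ds.to_iterable(num_shards=32).with_format("torch")
print(ids.n_shards)  # 32

dataloader = DataLoader(ids, batch_size=16, num_workers=4)
for batch in dataloader:
    break
```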

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)

Added a link in the how-to stream page. We may want to include it in the tutorial at one point as well - right now none of the tutorials mention streaming

@github-actions

github-actions bot commented Feb 1, 2023

Show benchmarks

Auto-generated benchmark report: new / old timing diffs for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json, under PyArrow==6.0.0 and PyArrow==latest.


github-actions bot commented Feb 1, 2023

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.010569 / 0.011353 (-0.000784) 0.005790 / 0.011008 (-0.005218) 0.118626 / 0.038508 (0.080118) 0.040455 / 0.023109 (0.017346) 0.342309 / 0.275898 (0.066411) 0.411828 / 0.323480 (0.088349) 0.008824 / 0.007986 (0.000839) 0.005426 / 0.004328 (0.001098) 0.088740 / 0.004250 (0.084489) 0.050042 / 0.037052 (0.012990) 0.352350 / 0.258489 (0.093861) 0.396030 / 0.293841 (0.102189) 0.043385 / 0.128546 (-0.085162) 0.013805 / 0.075646 (-0.061841) 0.396489 / 0.419271 (-0.022783) 0.055667 / 0.043533 (0.012135) 0.336165 / 0.255139 (0.081026) 0.372912 / 0.283200 (0.089713) 0.115343 / 0.141683 (-0.026340) 1.656412 / 1.452155 (0.204257) 1.708993 / 1.492716 (0.216277)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.011650 / 0.018006 (-0.006357) 0.444415 / 0.000490 (0.443926) 0.003985 / 0.000200 (0.003785) 0.000136 / 0.000054 (0.000082)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.031718 / 0.037411 (-0.005693) 0.119640 / 0.014526 (0.105114) 0.138519 / 0.176557 (-0.038037) 0.188847 / 0.737135 (-0.548288) 0.137891 / 0.296338 (-0.158448)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.447540 / 0.215209 (0.232331) 4.577189 / 2.077655 (2.499534) 2.106992 / 1.504120 (0.602872) 1.889631 / 1.541195 (0.348436) 1.972256 / 1.468490 (0.503766) 0.778209 / 4.584777 (-3.806568) 4.430279 / 3.745712 (0.684567) 2.401226 / 5.269862 (-2.868636) 1.481251 / 4.565676 (-3.084425) 0.094244 / 0.424275 (-0.330031) 0.013961 / 0.007607 (0.006354) 0.570962 / 0.226044 (0.344917) 5.809224 / 2.268929 (3.540295) 2.663290 / 55.444624 (-52.781334) 2.201228 / 6.876477 (-4.675249) 2.319240 / 2.142072 (0.177168) 0.938340 / 4.805227 (-3.866887) 0.185546 / 6.500664 (-6.315118) 0.069087 / 0.075469 (-0.006382)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.448597 / 1.841788 (-0.393191) 17.188573 / 8.074308 (9.114265) 16.197532 / 10.191392 (6.006140) 0.194064 / 0.680424 (-0.486360) 0.033694 / 0.534201 (-0.500507) 0.507585 / 0.579283 (-0.071699) 0.505470 / 0.434364 (0.071106) 0.623270 / 0.540337 (0.082932) 0.729964 / 1.386936 (-0.656972)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008529 / 0.011353 (-0.002824) 0.005705 / 0.011008 (-0.005304) 0.085594 / 0.038508 (0.047086) 0.038377 / 0.023109 (0.015268) 0.384221 / 0.275898 (0.108323) 0.414678 / 0.323480 (0.091199) 0.006195 / 0.007986 (-0.001791) 0.004549 / 0.004328 (0.000221) 0.082710 / 0.004250 (0.078460) 0.054899 / 0.037052 (0.017847) 0.404017 / 0.258489 (0.145528) 0.450309 / 0.293841 (0.156468) 0.040620 / 0.128546 (-0.087926) 0.013774 / 0.075646 (-0.061872) 0.099231 / 0.419271 (-0.320041) 0.057183 / 0.043533 (0.013650) 0.390806 / 0.255139 (0.135667) 0.419334 / 0.283200 (0.136134) 0.116449 / 0.141683 (-0.025234) 1.709124 / 1.452155 (0.256969) 1.812769 / 1.492716 (0.320052)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.225206 / 0.018006 (0.207199) 0.440530 / 0.000490 (0.440040) 0.002982 / 0.000200 (0.002782) 0.000102 / 0.000054 (0.000048)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.032256 / 0.037411 (-0.005155) 0.127086 / 0.014526 (0.112560) 0.138133 / 0.176557 (-0.038424) 0.176168 / 0.737135 (-0.560968) 0.146072 / 0.296338 (-0.150267)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.474374 / 0.215209 (0.259165) 4.785106 / 2.077655 (2.707452) 2.319344 / 1.504120 (0.815225) 2.075239 / 1.541195 (0.534045) 2.179231 / 1.468490 (0.710741) 0.832124 / 4.584777 (-3.752653) 4.376302 / 3.745712 (0.630590) 3.966837 / 5.269862 (-1.303024) 1.820230 / 4.565676 (-2.745446) 0.100692 / 0.424275 (-0.323583) 0.014748 / 0.007607 (0.007141) 0.568702 / 0.226044 (0.342657) 5.771548 / 2.268929 (3.502619) 2.747431 / 55.444624 (-52.697193) 2.448482 / 6.876477 (-4.427994) 2.497206 / 2.142072 (0.355133) 0.960842 / 4.805227 (-3.844385) 0.192855 / 6.500664 (-6.307809) 0.072494 / 0.075469 (-0.002975)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.474542 / 1.841788 (-0.367245) 17.344804 / 8.074308 (9.270496) 15.336082 / 10.191392 (5.144690) 0.200134 / 0.680424 (-0.480290) 0.020728 / 0.534201 (-0.513473) 0.488854 / 0.579283 (-0.090429) 0.490781 / 0.434364 (0.056418) 0.626288 / 0.540337 (0.085950) 0.721130 / 1.386936 (-0.665806)

lhoestq merged commit 79c18b7 into main Feb 1, 2023
lhoestq deleted the to_iterable branch February 1, 2023 16:36

github-actions bot commented Feb 1, 2023

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.008542 / 0.011353 (-0.002811) 0.004624 / 0.011008 (-0.006384) 0.100749 / 0.038508 (0.062241) 0.029587 / 0.023109 (0.006478) 0.298680 / 0.275898 (0.022782) 0.359659 / 0.323480 (0.036180) 0.007001 / 0.007986 (-0.000984) 0.003398 / 0.004328 (-0.000930) 0.078654 / 0.004250 (0.074404) 0.036440 / 0.037052 (-0.000612) 0.313245 / 0.258489 (0.054756) 0.342776 / 0.293841 (0.048936) 0.033195 / 0.128546 (-0.095352) 0.011500 / 0.075646 (-0.064146) 0.323957 / 0.419271 (-0.095314) 0.039878 / 0.043533 (-0.003655) 0.298189 / 0.255139 (0.043050) 0.325488 / 0.283200 (0.042289) 0.087276 / 0.141683 (-0.054407) 1.480846 / 1.452155 (0.028691) 1.507016 / 1.492716 (0.014300)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.189570 / 0.018006 (0.171564) 0.406407 / 0.000490 (0.405917) 0.003062 / 0.000200 (0.002862) 0.000073 / 0.000054 (0.000019)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.022865 / 0.037411 (-0.014546) 0.096103 / 0.014526 (0.081578) 0.106462 / 0.176557 (-0.070094) 0.140888 / 0.737135 (-0.596247) 0.108172 / 0.296338 (-0.188167)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.415951 / 0.215209 (0.200742) 4.172187 / 2.077655 (2.094532) 1.842210 / 1.504120 (0.338090) 1.636997 / 1.541195 (0.095802) 1.706078 / 1.468490 (0.237588) 0.695825 / 4.584777 (-3.888952) 3.337354 / 3.745712 (-0.408358) 1.877880 / 5.269862 (-3.391982) 1.153882 / 4.565676 (-3.411794) 0.082923 / 0.424275 (-0.341352) 0.012814 / 0.007607 (0.005207) 0.521793 / 0.226044 (0.295748) 5.275980 / 2.268929 (3.007051) 2.279230 / 55.444624 (-53.165394) 1.941777 / 6.876477 (-4.934700) 1.981297 / 2.142072 (-0.160775) 0.809669 / 4.805227 (-3.995558) 0.148753 / 6.500664 (-6.351911) 0.064909 / 0.075469 (-0.010560)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.226757 / 1.841788 (-0.615031) 13.717354 / 8.074308 (5.643046) 12.925885 / 10.191392 (2.734493) 0.137926 / 0.680424 (-0.542498) 0.028788 / 0.534201 (-0.505413) 0.396654 / 0.579283 (-0.182630) 0.401931 / 0.434364 (-0.032432) 0.460515 / 0.540337 (-0.079823) 0.537903 / 1.386936 (-0.849033)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.006757 / 0.011353 (-0.004596) 0.004474 / 0.011008 (-0.006534) 0.076571 / 0.038508 (0.038063) 0.027580 / 0.023109 (0.004471) 0.348231 / 0.275898 (0.072333) 0.398403 / 0.323480 (0.074923) 0.005089 / 0.007986 (-0.002897) 0.004676 / 0.004328 (0.000347) 0.076444 / 0.004250 (0.072194) 0.038508 / 0.037052 (0.001456) 0.348515 / 0.258489 (0.090026) 0.401456 / 0.293841 (0.107615) 0.031630 / 0.128546 (-0.096916) 0.011698 / 0.075646 (-0.063949) 0.085805 / 0.419271 (-0.333467) 0.041962 / 0.043533 (-0.001570) 0.343415 / 0.255139 (0.088276) 0.383001 / 0.283200 (0.099801) 0.090231 / 0.141683 (-0.051452) 1.488114 / 1.452155 (0.035960) 1.569039 / 1.492716 (0.076323)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.261751 / 0.018006 (0.243745) 0.411354 / 0.000490 (0.410865) 0.015103 / 0.000200 (0.014903) 0.000262 / 0.000054 (0.000208)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.025423 / 0.037411 (-0.011988) 0.101334 / 0.014526 (0.086808) 0.108835 / 0.176557 (-0.067722) 0.143995 / 0.737135 (-0.593140) 0.111751 / 0.296338 (-0.184588)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.446507 / 0.215209 (0.231298) 4.461543 / 2.077655 (2.383888) 2.104648 / 1.504120 (0.600528) 1.895900 / 1.541195 (0.354706) 1.985481 / 1.468490 (0.516991) 0.699029 / 4.584777 (-3.885748) 3.371064 / 3.745712 (-0.374648) 1.883445 / 5.269862 (-3.386416) 1.166150 / 4.565676 (-3.399527) 0.082639 / 0.424275 (-0.341636) 0.012605 / 0.007607 (0.004998) 0.544860 / 0.226044 (0.318815) 5.513223 / 2.268929 (3.244294) 2.570661 / 55.444624 (-52.873963) 2.206066 / 6.876477 (-4.670411) 2.256346 / 2.142072 (0.114273) 0.801142 / 4.805227 (-4.004085) 0.150412 / 6.500664 (-6.350252) 0.067742 / 0.075469 (-0.007727)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.303477 / 1.841788 (-0.538310) 14.287767 / 8.074308 (6.213458) 13.525563 / 10.191392 (3.334171) 0.148202 / 0.680424 (-0.532222) 0.016868 / 0.534201 (-0.517333) 0.380729 / 0.579283 (-0.198555) 0.388177 / 0.434364 (-0.046187) 0.477410 / 0.540337 (-0.062927) 0.569343 / 1.386936 (-0.817593)

stevhliu (Member) commented Feb 1, 2023

PS: I think it would be a good idea to add links to the Guide pages for better discoverability and to somewhat "justify their presence in the docs" (from the tutorial/how-to pages to the guides; some guides are not referenced at all)

Just merged #5485, which references this new doc! Will look for other pages in the docs where it'd make sense to add them :)

Successfully merging this pull request may close these issues.

Get an IterableDataset from a map-style Dataset
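
A minimal usage sketch of the conversion this PR and the linked issue are about, assuming the merged method name `to_iterable_dataset` and a toy dataset built with `Dataset.from_dict` (both are illustrative assumptions, not taken verbatim from this thread):

```python
from datasets import Dataset

# Toy map-style dataset (illustrative; any existing Dataset would work the same way)
ds = Dataset.from_dict({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# Convert to an IterableDataset; num_shards is an assumed optional argument
# that splits the data into several shards before conversion
iterable_ds = ds.to_iterable_dataset(num_shards=2)

# Iterating yields plain dicts, one example at a time
for example in iterable_ds:
    print(example)
```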