Commit

fic docs
lhoestq committed Dec 3, 2021
1 parent 10ec8c6 commit eff165c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/stream.rst
@@ -146,7 +146,7 @@ The following example demonstrates how to tokenize a :class:`datasets.IterableDa
Stream in a training loop
^^^^^^^^^^^^^^^^^^^^^^^^^

-:class:`datasets.IterableDataset`s can be integrated into a training loop. First, shuffle the dataset:
+:class:`datasets.IterableDataset` can be integrated into a training loop. First, shuffle the dataset:

.. code-block::

1 comment on commit eff165c

@github-actions

Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.093914 | 0.011353 | 0.082561 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004602 | 0.011008 | -0.006406 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.051026 | 0.038508 | 0.012518 |
| read_batch_unformated after write_array2d | 0.042247 | 0.023109 | 0.019138 |
| read_batch_unformated after write_flattened_sequence | 0.417222 | 0.275898 | 0.141324 |
| read_batch_unformated after write_nested_sequence | 0.517420 | 0.323480 | 0.193941 |
| read_col_formatted_as_numpy after write_array2d | 0.095340 | 0.007986 | 0.087355 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003871 | 0.004328 | -0.000457 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010840 | 0.004250 | 0.006590 |
| read_col_unformated after write_array2d | 0.046696 | 0.037052 | 0.009644 |
| read_col_unformated after write_flattened_sequence | 0.392808 | 0.258489 | 0.134319 |
| read_col_unformated after write_nested_sequence | 0.497941 | 0.293841 | 0.204100 |
| read_formatted_as_numpy after write_array2d | 0.125802 | 0.128546 | -0.002744 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010497 | 0.075646 | -0.065149 |
| read_formatted_as_numpy after write_nested_sequence | 0.337474 | 0.419271 | -0.081798 |
| read_unformated after write_array2d | 0.055636 | 0.043533 | 0.012104 |
| read_unformated after write_flattened_sequence | 0.424716 | 0.255139 | 0.169577 |
| read_unformated after write_nested_sequence | 0.472780 | 0.283200 | 0.189581 |
| write_array2d | 0.104403 | 0.141683 | -0.037280 |
| write_flattened_sequence | 2.208490 | 1.452155 | 0.756335 |
| write_nested_sequence | 2.262715 | 1.492716 | 0.769998 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.296755 | 0.018006 | 0.278749 |
| get_batch_of_1024_rows | 0.509932 | 0.000490 | 0.509442 |
| get_first_row | 0.005692 | 0.000200 | 0.005492 |
| get_last_row | 0.000134 | 0.000054 | 0.000080 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.044708 | 0.037411 | 0.007296 |
| shard | 0.026649 | 0.014526 | 0.012123 |
| shuffle | 0.032127 | 0.176557 | -0.144429 |
| sort | 0.232558 | 0.737135 | -0.504578 |
| train_test_split | 0.033160 | 0.296338 | -0.263178 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.496938 | 0.215209 | 0.281729 |
| read 50000 | 4.955762 | 2.077655 | 2.878107 |
| read_batch 50000 10 | 2.168454 | 1.504120 | 0.664334 |
| read_batch 50000 100 | 1.918605 | 1.541195 | 0.377410 |
| read_batch 50000 1000 | 2.005938 | 1.468490 | 0.537448 |
| read_formatted numpy 5000 | 0.489901 | 4.584777 | -4.094876 |
| read_formatted pandas 5000 | 6.387124 | 3.745712 | 2.641412 |
| read_formatted tensorflow 5000 | 2.437435 | 5.269862 | -2.832427 |
| read_formatted torch 5000 | 1.158501 | 4.565676 | -3.407176 |
| read_formatted_batch numpy 5000 10 | 0.058806 | 0.424275 | -0.365469 |
| read_formatted_batch numpy 5000 1000 | 0.013089 | 0.007607 | 0.005482 |
| shuffled read 5000 | 0.619484 | 0.226044 | 0.393440 |
| shuffled read 50000 | 6.183163 | 2.268929 | 3.914235 |
| shuffled read_batch 50000 10 | 2.715211 | 55.444624 | -52.729413 |
| shuffled read_batch 50000 100 | 2.278180 | 6.876477 | -4.598297 |
| shuffled read_batch 50000 1000 | 2.414184 | 2.142072 | 0.272112 |
| shuffled read_formatted numpy 5000 | 0.629400 | 4.805227 | -4.175827 |
| shuffled read_formatted_batch numpy 5000 10 | 0.135100 | 6.500664 | -6.365565 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.068107 | 0.075469 | -0.007362 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.884584 | 1.841788 | 0.042797 |
| map fast-tokenizer batched | 14.577170 | 8.074308 | 6.502862 |
| map identity | 31.911733 | 10.191392 | 21.720341 |
| map identity batched | 0.900432 | 0.680424 | 0.220008 |
| map no-op batched | 0.619881 | 0.534201 | 0.085680 |
| map no-op batched numpy | 0.448720 | 0.579283 | -0.130563 |
| map no-op batched pandas | 0.661978 | 0.434364 | 0.227614 |
| map no-op batched pytorch | 0.324525 | 0.540337 | -0.215812 |
| map no-op batched tensorflow | 0.328812 | 1.386936 | -1.058124 |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.083040 | 0.011353 | 0.071688 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004320 | 0.011008 | -0.006688 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.035276 | 0.038508 | -0.003232 |
| read_batch_unformated after write_array2d | 0.039706 | 0.023109 | 0.016597 |
| read_batch_unformated after write_flattened_sequence | 0.388727 | 0.275898 | 0.112829 |
| read_batch_unformated after write_nested_sequence | 0.413651 | 0.323480 | 0.090171 |
| read_col_formatted_as_numpy after write_array2d | 0.099607 | 0.007986 | 0.091621 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003800 | 0.004328 | -0.000529 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008410 | 0.004250 | 0.004159 |
| read_col_unformated after write_array2d | 0.051413 | 0.037052 | 0.014361 |
| read_col_unformated after write_flattened_sequence | 0.386353 | 0.258489 | 0.127864 |
| read_col_unformated after write_nested_sequence | 0.422933 | 0.293841 | 0.129092 |
| read_formatted_as_numpy after write_array2d | 0.101001 | 0.128546 | -0.027546 |
| read_formatted_as_numpy after write_flattened_sequence | 0.010140 | 0.075646 | -0.065507 |
| read_formatted_as_numpy after write_nested_sequence | 0.301668 | 0.419271 | -0.117603 |
| read_unformated after write_array2d | 0.053573 | 0.043533 | 0.010040 |
| read_unformated after write_flattened_sequence | 0.383916 | 0.255139 | 0.128777 |
| read_unformated after write_nested_sequence | 0.409395 | 0.283200 | 0.126195 |
| write_array2d | 0.091005 | 0.141683 | -0.050677 |
| write_flattened_sequence | 2.048081 | 1.452155 | 0.595926 |
| write_nested_sequence | 2.087477 | 1.492716 | 0.594760 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.281108 | 0.018006 | 0.263102 |
| get_batch_of_1024_rows | 0.492810 | 0.000490 | 0.492320 |
| get_first_row | 0.001115 | 0.000200 | 0.000915 |
| get_last_row | 0.000099 | 0.000054 | 0.000044 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.039322 | 0.037411 | 0.001911 |
| shard | 0.024700 | 0.014526 | 0.010174 |
| shuffle | 0.029559 | 0.176557 | -0.146998 |
| sort | 0.229527 | 0.737135 | -0.507608 |
| train_test_split | 0.031386 | 0.296338 | -0.264953 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.503071 | 0.215209 | 0.287862 |
| read 50000 | 4.984460 | 2.077655 | 2.906806 |
| read_batch 50000 10 | 2.155289 | 1.504120 | 0.651169 |
| read_batch 50000 100 | 1.891894 | 1.541195 | 0.350700 |
| read_batch 50000 1000 | 1.963825 | 1.468490 | 0.495335 |
| read_formatted numpy 5000 | 0.493052 | 4.584777 | -4.091725 |
| read_formatted pandas 5000 | 5.995429 | 3.745712 | 2.249716 |
| read_formatted tensorflow 5000 | 4.631900 | 5.269862 | -0.637961 |
| read_formatted torch 5000 | 1.153602 | 4.565676 | -3.412074 |
| read_formatted_batch numpy 5000 10 | 0.066239 | 0.424275 | -0.358036 |
| read_formatted_batch numpy 5000 1000 | 0.013572 | 0.007607 | 0.005965 |
| shuffled read 5000 | 0.704092 | 0.226044 | 0.478048 |
| shuffled read 50000 | 6.448365 | 2.268929 | 4.179436 |
| shuffled read_batch 50000 10 | 2.700557 | 55.444624 | -52.744067 |
| shuffled read_batch 50000 100 | 2.236712 | 6.876477 | -4.639765 |
| shuffled read_batch 50000 1000 | 2.368757 | 2.142072 | 0.226685 |
| shuffled read_formatted numpy 5000 | 0.630637 | 4.805227 | -4.174590 |
| shuffled read_formatted_batch numpy 5000 10 | 0.136725 | 6.500664 | -6.363939 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.067012 | 0.075469 | -0.008458 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.941712 | 1.841788 | 0.099924 |
| map fast-tokenizer batched | 14.844038 | 8.074308 | 6.769730 |
| map identity | 31.439937 | 10.191392 | 21.248545 |
| map identity batched | 0.910268 | 0.680424 | 0.229845 |
| map no-op batched | 0.701847 | 0.534201 | 0.167646 |
| map no-op batched numpy | 0.467932 | 0.579283 | -0.111351 |
| map no-op batched pandas | 0.644066 | 0.434364 | 0.209702 |
| map no-op batched pytorch | 0.304705 | 0.540337 | -0.235632 |
| map no-op batched tensorflow | 0.333552 | 1.386936 | -1.053384 |
