
Commit

Release: 2.5.2
lhoestq committed Oct 5, 2022
1 parent 60dcc68 commit c59cc34
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -198,7 +198,7 @@

 setup(
     name="datasets",
-    version="2.5.1", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="2.5.2", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
     description="HuggingFace community-driven open-source library of datasets",
     long_description=open("README.md", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
 # pylint: enable=line-too-long
 # pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "2.5.1"
+__version__ = "2.5.2"

 import platform

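The comment next to `version=` in setup.py lists the expected version formats (x.y.z.dev0, x.y.z.rc1, or x.y.z, with dots and no dashes). A minimal sketch of a check for that rule follows; the function name and the regular expression are illustrative assumptions and are not part of this commit or of the `datasets` codebase.

```python
import re

# Hypothetical helper: accepts x.y.z, x.y.z.devN, or x.y.z.rcN (dots only, no dashes).
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.(dev|rc)\d+)?")


def is_expected_version_format(version: str) -> bool:
    """Return True if `version` matches one of the formats named in the setup.py comment."""
    return _VERSION_RE.fullmatch(version) is not None


if __name__ == "__main__":
    for candidate in ("2.5.2", "2.5.2.dev0", "2.5.2.rc1", "2.5.2-rc1"):
        print(candidate, is_expected_version_format(candidate))
```

The same check could be run against both `setup.py` and `src/datasets/__init__.py`, since this commit changes the version string in both places.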

2 comments on commit c59cc34

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.007574 / 0.011353 (-0.003778)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.003616 / 0.011008 (-0.007392)
read_batch_formatted_as_numpy after write_nested_sequence | 0.028678 / 0.038508 (-0.009830)
read_batch_unformated after write_array2d | 0.030015 / 0.023109 (0.006906)
read_batch_unformated after write_flattened_sequence | 0.306574 / 0.275898 (0.030676)
read_batch_unformated after write_nested_sequence | 0.369572 / 0.323480 (0.046092)
read_col_formatted_as_numpy after write_array2d | 0.005345 / 0.007986 (-0.002641)
read_col_formatted_as_numpy after write_flattened_sequence | 0.003038 / 0.004328 (-0.001291)
read_col_formatted_as_numpy after write_nested_sequence | 0.006686 / 0.004250 (0.002435)
read_col_unformated after write_array2d | 0.038660 / 0.037052 (0.001608)
read_col_unformated after write_flattened_sequence | 0.317594 / 0.258489 (0.059105)
read_col_unformated after write_nested_sequence | 0.363809 / 0.293841 (0.069968)
read_formatted_as_numpy after write_array2d | 0.028958 / 0.128546 (-0.099588)
read_formatted_as_numpy after write_flattened_sequence | 0.009323 / 0.075646 (-0.066324)
read_formatted_as_numpy after write_nested_sequence | 0.247809 / 0.419271 (-0.171463)
read_unformated after write_array2d | 0.044671 / 0.043533 (0.001138)
read_unformated after write_flattened_sequence | 0.310004 / 0.255139 (0.054866)
read_unformated after write_nested_sequence | 0.338390 / 0.283200 (0.055191)
write_array2d | 0.091139 / 0.141683 (-0.050544)
write_flattened_sequence | 1.467447 / 1.452155 (0.015292)
write_nested_sequence | 1.491145 / 1.492716 (-0.001572)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.216643 / 0.018006 (0.198637)
get_batch_of_1024_rows | 0.445231 / 0.000490 (0.444741)
get_first_row | 0.002280 / 0.000200 (0.002080)
get_last_row | 0.000071 / 0.000054 (0.000017)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.021593 / 0.037411 (-0.015818)
shard | 0.091345 / 0.014526 (0.076819)
shuffle | 0.107412 / 0.176557 (-0.069145)
sort | 0.153049 / 0.737135 (-0.584086)
train_test_split | 0.106545 / 0.296338 (-0.189794)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.418800 / 0.215209 (0.203591)
read 50000 | 4.166582 / 2.077655 (2.088927)
read_batch 50000 10 | 1.880401 / 1.504120 (0.376281)
read_batch 50000 100 | 1.667712 / 1.541195 (0.126517)
read_batch 50000 1000 | 1.719501 / 1.468490 (0.251010)
read_formatted numpy 5000 | 0.446184 / 4.584777 (-4.138593)
read_formatted pandas 5000 | 3.340539 / 3.745712 (-0.405173)
read_formatted tensorflow 5000 | 1.840936 / 5.269862 (-3.428926)
read_formatted torch 5000 | 1.264911 / 4.565676 (-3.300765)
read_formatted_batch numpy 5000 10 | 0.052572 / 0.424275 (-0.371703)
read_formatted_batch numpy 5000 1000 | 0.010870 / 0.007607 (0.003263)
shuffled read 5000 | 0.525900 / 0.226044 (0.299856)
shuffled read 50000 | 5.295922 / 2.268929 (3.026993)
shuffled read_batch 50000 10 | 2.315323 / 55.444624 (-53.129302)
shuffled read_batch 50000 100 | 1.959206 / 6.876477 (-4.917270)
shuffled read_batch 50000 1000 | 2.025652 / 2.142072 (-0.116420)
shuffled read_formatted numpy 5000 | 0.563667 / 4.805227 (-4.241560)
shuffled read_formatted_batch numpy 5000 10 | 0.118126 / 6.500664 (-6.382538)
shuffled read_formatted_batch numpy 5000 1000 | 0.062717 / 0.075469 (-0.012752)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 1.502027 / 1.841788 (-0.339760)
map fast-tokenizer batched | 12.418117 / 8.074308 (4.343809)
map identity | 26.050196 / 10.191392 (15.858804)
map identity batched | 0.893006 / 0.680424 (0.212583)
map no-op batched | 0.566945 / 0.534201 (0.032744)
map no-op batched numpy | 0.340975 / 0.579283 (-0.238308)
map no-op batched pandas | 0.389439 / 0.434364 (-0.044924)
map no-op batched pytorch | 0.236735 / 0.540337 (-0.303602)
map no-op batched tensorflow | 0.244571 / 1.386936 (-1.142365)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.005342 / 0.011353 (-0.006011)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.003586 / 0.011008 (-0.007422)
read_batch_formatted_as_numpy after write_nested_sequence | 0.026772 / 0.038508 (-0.011736)
read_batch_unformated after write_array2d | 0.027275 / 0.023109 (0.004166)
read_batch_unformated after write_flattened_sequence | 0.417251 / 0.275898 (0.141353)
read_batch_unformated after write_nested_sequence | 0.475921 / 0.323480 (0.152441)
read_col_formatted_as_numpy after write_array2d | 0.003279 / 0.007986 (-0.004706)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004158 / 0.004328 (-0.000171)
read_col_formatted_as_numpy after write_nested_sequence | 0.004647 / 0.004250 (0.000397)
read_col_unformated after write_array2d | 0.034415 / 0.037052 (-0.002637)
read_col_unformated after write_flattened_sequence | 0.425930 / 0.258489 (0.167441)
read_col_unformated after write_nested_sequence | 0.468606 / 0.293841 (0.174765)
read_formatted_as_numpy after write_array2d | 0.026948 / 0.128546 (-0.101599)
read_formatted_as_numpy after write_flattened_sequence | 0.009297 / 0.075646 (-0.066349)
read_formatted_as_numpy after write_nested_sequence | 0.247798 / 0.419271 (-0.171473)
read_unformated after write_array2d | 0.045745 / 0.043533 (0.002212)
read_unformated after write_flattened_sequence | 0.421270 / 0.255139 (0.166131)
read_unformated after write_nested_sequence | 0.442464 / 0.283200 (0.159264)
write_array2d | 0.086854 / 0.141683 (-0.054829)
write_flattened_sequence | 1.549296 / 1.452155 (0.097142)
write_nested_sequence | 1.572534 / 1.492716 (0.079818)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.228844 / 0.018006 (0.210837)
get_batch_of_1024_rows | 0.417595 / 0.000490 (0.417105)
get_first_row | 0.000967 / 0.000200 (0.000767)
get_last_row | 0.000089 / 0.000054 (0.000034)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.020871 / 0.037411 (-0.016540)
shard | 0.091212 / 0.014526 (0.076686)
shuffle | 0.105066 / 0.176557 (-0.071491)
sort | 0.142597 / 0.737135 (-0.594538)
train_test_split | 0.105188 / 0.296338 (-0.191151)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.468180 / 0.215209 (0.252971)
read 50000 | 4.662055 / 2.077655 (2.584401)
read_batch 50000 10 | 2.448601 / 1.504120 (0.944482)
read_batch 50000 100 | 2.236730 / 1.541195 (0.695535)
read_batch 50000 1000 | 2.277605 / 1.468490 (0.809115)
read_formatted numpy 5000 | 0.444672 / 4.584777 (-4.140105)
read_formatted pandas 5000 | 3.322080 / 3.745712 (-0.423632)
read_formatted tensorflow 5000 | 1.804387 / 5.269862 (-3.465474)
read_formatted torch 5000 | 1.095145 / 4.565676 (-3.470531)
read_formatted_batch numpy 5000 10 | 0.052555 / 0.424275 (-0.371720)
read_formatted_batch numpy 5000 1000 | 0.010947 / 0.007607 (0.003340)
shuffled read 5000 | 0.572087 / 0.226044 (0.346043)
shuffled read 50000 | 5.714434 / 2.268929 (3.445505)
shuffled read_batch 50000 10 | 2.850298 / 55.444624 (-52.594326)
shuffled read_batch 50000 100 | 2.528052 / 6.876477 (-4.348425)
shuffled read_batch 50000 1000 | 2.621176 / 2.142072 (0.479104)
shuffled read_formatted numpy 5000 | 0.553503 / 4.805227 (-4.251724)
shuffled read_formatted_batch numpy 5000 10 | 0.118975 / 6.500664 (-6.381690)
shuffled read_formatted_batch numpy 5000 1000 | 0.064490 / 0.075469 (-0.010979)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 1.566697 / 1.841788 (-0.275091)
map fast-tokenizer batched | 12.488183 / 8.074308 (4.413875)
map identity | 26.352401 / 10.191392 (16.161009)
map identity batched | 0.919816 / 0.680424 (0.239392)
map no-op batched | 0.636535 / 0.534201 (0.102335)
map no-op batched numpy | 0.344689 / 0.579283 (-0.234594)
map no-op batched pandas | 0.395567 / 0.434364 (-0.038797)
map no-op batched pytorch | 0.231580 / 0.540337 (-0.308758)
map no-op batched tensorflow | 0.235245 / 1.386936 (-1.151691)


@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.010154 / 0.011353 (-0.001198)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.004714 / 0.011008 (-0.006294)
read_batch_formatted_as_numpy after write_nested_sequence | 0.036102 / 0.038508 (-0.002406)
read_batch_unformated after write_array2d | 0.049138 / 0.023109 (0.026029)
read_batch_unformated after write_flattened_sequence | 0.347466 / 0.275898 (0.071568)
read_batch_unformated after write_nested_sequence | 0.465874 / 0.323480 (0.142394)
read_col_formatted_as_numpy after write_array2d | 0.007296 / 0.007986 (-0.000689)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004208 / 0.004328 (-0.000120)
read_col_formatted_as_numpy after write_nested_sequence | 0.008326 / 0.004250 (0.004075)
read_col_unformated after write_array2d | 0.058692 / 0.037052 (0.021639)
read_col_unformated after write_flattened_sequence | 0.372919 / 0.258489 (0.114430)
read_col_unformated after write_nested_sequence | 0.416916 / 0.293841 (0.123075)
read_formatted_as_numpy after write_array2d | 0.039643 / 0.128546 (-0.088903)
read_formatted_as_numpy after write_flattened_sequence | 0.011421 / 0.075646 (-0.064225)
read_formatted_as_numpy after write_nested_sequence | 0.321461 / 0.419271 (-0.097810)
read_unformated after write_array2d | 0.069072 / 0.043533 (0.025539)
read_unformated after write_flattened_sequence | 0.358327 / 0.255139 (0.103188)
read_unformated after write_nested_sequence | 0.376883 / 0.283200 (0.093684)
write_array2d | 0.137978 / 0.141683 (-0.003705)
write_flattened_sequence | 1.775026 / 1.452155 (0.322871)
write_nested_sequence | 1.862625 / 1.492716 (0.369908)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.235835 / 0.018006 (0.217829)
get_batch_of_1024_rows | 0.524305 / 0.000490 (0.523816)
get_first_row | 0.006675 / 0.000200 (0.006475)
get_last_row | 0.000201 / 0.000054 (0.000147)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.031211 / 0.037411 (-0.006201)
shard | 0.126283 / 0.014526 (0.111757)
shuffle | 0.137865 / 0.176557 (-0.038691)
sort | 0.199319 / 0.737135 (-0.537816)
train_test_split | 0.142526 / 0.296338 (-0.153813)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.474036 / 0.215209 (0.258827)
read 50000 | 4.699722 / 2.077655 (2.622067)
read_batch 50000 10 | 2.094430 / 1.504120 (0.590310)
read_batch 50000 100 | 1.894617 / 1.541195 (0.353423)
read_batch 50000 1000 | 1.977719 / 1.468490 (0.509229)
read_formatted numpy 5000 | 0.502716 / 4.584777 (-4.082061)
read_formatted pandas 5000 | 4.616386 / 3.745712 (0.870674)
read_formatted tensorflow 5000 | 2.440159 / 5.269862 (-2.829702)
read_formatted torch 5000 | 1.763320 / 4.565676 (-2.802356)
read_formatted_batch numpy 5000 10 | 0.060375 / 0.424275 (-0.363901)
read_formatted_batch numpy 5000 1000 | 0.012979 / 0.007607 (0.005372)
shuffled read 5000 | 0.583770 / 0.226044 (0.357726)
shuffled read 50000 | 5.875267 / 2.268929 (3.606339)
shuffled read_batch 50000 10 | 2.589765 / 55.444624 (-52.854859)
shuffled read_batch 50000 100 | 2.250155 / 6.876477 (-4.626322)
shuffled read_batch 50000 1000 | 2.409176 / 2.142072 (0.267103)
shuffled read_formatted numpy 5000 | 0.629286 / 4.805227 (-4.175941)
shuffled read_formatted_batch numpy 5000 10 | 0.139668 / 6.500664 (-6.360996)
shuffled read_formatted_batch numpy 5000 1000 | 0.072044 / 0.075469 (-0.003425)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 1.735198 / 1.841788 (-0.106589)
map fast-tokenizer batched | 16.707802 / 8.074308 (8.633494)
map identity | 28.633857 / 10.191392 (18.442465)
map identity batched | 1.093137 / 0.680424 (0.412714)
map no-op batched | 0.637630 / 0.534201 (0.103429)
map no-op batched numpy | 0.444101 / 0.579283 (-0.135182)
map no-op batched pandas | 0.503498 / 0.434364 (0.069134)
map no-op batched pytorch | 0.305064 / 0.540337 (-0.235273)
map no-op batched tensorflow | 0.312605 / 1.386936 (-1.074331)
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.007532 / 0.011353 (-0.003821)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.004733 / 0.011008 (-0.006276)
read_batch_formatted_as_numpy after write_nested_sequence | 0.033606 / 0.038508 (-0.004902)
read_batch_unformated after write_array2d | 0.043079 / 0.023109 (0.019970)
read_batch_unformated after write_flattened_sequence | 0.453426 / 0.275898 (0.177528)
read_batch_unformated after write_nested_sequence | 0.550128 / 0.323480 (0.226648)
read_col_formatted_as_numpy after write_array2d | 0.004763 / 0.007986 (-0.003222)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004143 / 0.004328 (-0.000185)
read_col_formatted_as_numpy after write_nested_sequence | 0.005919 / 0.004250 (0.001668)
read_col_unformated after write_array2d | 0.050519 / 0.037052 (0.013466)
read_col_unformated after write_flattened_sequence | 0.461533 / 0.258489 (0.203044)
read_col_unformated after write_nested_sequence | 0.512967 / 0.293841 (0.219126)
read_formatted_as_numpy after write_array2d | 0.036062 / 0.128546 (-0.092484)
read_formatted_as_numpy after write_flattened_sequence | 0.011490 / 0.075646 (-0.064156)
read_formatted_as_numpy after write_nested_sequence | 0.308885 / 0.419271 (-0.110387)
read_unformated after write_array2d | 0.065250 / 0.043533 (0.021717)
read_unformated after write_flattened_sequence | 0.452536 / 0.255139 (0.197397)
read_unformated after write_nested_sequence | 0.476512 / 0.283200 (0.193313)
write_array2d | 0.123944 / 0.141683 (-0.017739)
write_flattened_sequence | 1.728911 / 1.452155 (0.276756)
write_nested_sequence | 1.781415 / 1.492716 (0.288698)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.226633 / 0.018006 (0.208627)
get_batch_of_1024_rows | 0.485875 / 0.000490 (0.485385)
get_first_row | 0.008202 / 0.000200 (0.008002)
get_last_row | 0.000180 / 0.000054 (0.000125)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.028521 / 0.037411 (-0.008891)
shard | 0.126355 / 0.014526 (0.111829)
shuffle | 0.136642 / 0.176557 (-0.039914)
sort | 0.187860 / 0.737135 (-0.549276)
train_test_split | 0.140391 / 0.296338 (-0.155947)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.498114 / 0.215209 (0.282905)
read 50000 | 4.935728 / 2.077655 (2.858073)
read_batch 50000 10 | 2.381326 / 1.504120 (0.877206)
read_batch 50000 100 | 2.177966 / 1.541195 (0.636771)
read_batch 50000 1000 | 2.256776 / 1.468490 (0.788286)
read_formatted numpy 5000 | 0.508070 / 4.584777 (-4.076706)
read_formatted pandas 5000 | 4.816767 / 3.745712 (1.071055)
read_formatted tensorflow 5000 | 5.265752 / 5.269862 (-0.004110)
read_formatted torch 5000 | 2.350485 / 4.565676 (-2.215191)
read_formatted_batch numpy 5000 10 | 0.071838 / 0.424275 (-0.352437)
read_formatted_batch numpy 5000 1000 | 0.015644 / 0.007607 (0.008037)
shuffled read 5000 | 0.668363 / 0.226044 (0.442319)
shuffled read 50000 | 6.163107 / 2.268929 (3.894178)
shuffled read_batch 50000 10 | 2.919560 / 55.444624 (-52.525064)
shuffled read_batch 50000 100 | 2.530778 / 6.876477 (-4.345699)
shuffled read_batch 50000 1000 | 2.725779 / 2.142072 (0.583707)
shuffled read_formatted numpy 5000 | 0.631575 / 4.805227 (-4.173652)
shuffled read_formatted_batch numpy 5000 10 | 0.142623 / 6.500664 (-6.358041)
shuffled read_formatted_batch numpy 5000 1000 | 0.074485 / 0.075469 (-0.000984)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 1.898877 / 1.841788 (0.057090)
map fast-tokenizer batched | 16.920975 / 8.074308 (8.846667)
map identity | 29.649417 / 10.191392 (19.458025)
map identity batched | 1.138991 / 0.680424 (0.458568)
map no-op batched | 0.748508 / 0.534201 (0.214307)
map no-op batched numpy | 0.463248 / 0.579283 (-0.116035)
map no-op batched pandas | 0.535318 / 0.434364 (0.100954)
map no-op batched pytorch | 0.320734 / 0.540337 (-0.219603)
map no-op batched tensorflow | 0.329036 / 1.386936 (-1.057900)
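Each benchmark entry above has the form `new / old (diff)`, and the parenthesized value appears to be `new - old` up to rounding. For anyone who wants to work with these numbers programmatically, here is a minimal sketch of a parser for one entry; `parse_cell` and its regular expression are hypothetical helpers, not part of the benchmark tooling that produced these comments.

```python
import re

# Hypothetical pattern for one entry such as "1.898877 / 1.841788 (0.057090)".
_CELL_RE = re.compile(r"\s*(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)\s*")


def parse_cell(cell: str) -> tuple[float, float, float]:
    """Split a 'new / old (diff)' entry into three floats: (new, old, diff)."""
    match = _CELL_RE.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a 'new / old (diff)' entry: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


if __name__ == "__main__":
    new, old, diff = parse_cell("1.898877 / 1.841788 (0.057090)")
    # The reported diff should agree with new - old to within rounding error.
    print(new, old, diff, abs((new - old) - diff) < 1e-5)
```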

