Release: 2.4.0
lhoestq committed Jul 25, 2022
1 parent 6c398c1 commit 401d4c4
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion setup.py
@@ -201,7 +201,7 @@

setup(
name="datasets",
-version="2.3.3.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+version="2.4.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
description="HuggingFace community-driven open-source library of datasets",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
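
The comment on the `version` line lists three accepted shapes: `x.y.z.dev0`, `x.y.z.rc1`, and `x.y.z`, with dots and no dashes. A minimal sketch of a check for that convention follows. The helper name `is_expected_version` is hypothetical and not part of the repository, and accepting any number after `dev` or `rc` is an assumption that generalizes the `dev0` and `rc1` examples in the comment.

```python
import re

# Assumed pattern: x.y.z, optionally followed by ".dev<N>" or ".rc<N>".
# Dashes are never accepted, matching "no to dashes, yes to dots".
VERSION_RE = re.compile(r"\d+\.\d+\.\d+(\.(dev|rc)\d+)?")


def is_expected_version(value):
    """Return True if value has one of the three shapes named in the comment."""
    return VERSION_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["2.3.3.dev0", "2.4.0", "2.4.0.rc1", "2.4.0-dev0"]:
        print(candidate, is_expected_version(candidate))
```

Both strings touched by this commit, `2.3.3.dev0` and `2.4.0`, fit the pattern; a dashed form such as `2.4.0-dev0` does not.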
2 changes: 1 addition & 1 deletion src/datasets/__init__.py
@@ -17,7 +17,7 @@
# pylint: enable=line-too-long
# pylint: disable=g-import-not-at-top,g-bad-import-order,wrong-import-position

-__version__ = "2.3.3.dev0"
+__version__ = "2.4.0"

import pyarrow
from packaging import version
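
This commit changes the same version string in two files, `setup.py` and `src/datasets/__init__.py`. A small sketch of a check that the two stay in sync is shown here. The function names are hypothetical, and the regular expressions assume the double-quoted assignment style seen in this diff; this is not tooling from the repository.

```python
import re


def setup_version(setup_src):
    """Extract the value of version="..." from the text of setup.py, or None."""
    match = re.search(r'\bversion\s*=\s*"([^"]+)"', setup_src)
    return match.group(1) if match else None


def init_version(init_src):
    """Extract the value of __version__ = "..." from the text of __init__.py, or None."""
    match = re.search(r'^__version__\s*=\s*"([^"]+)"', init_src, re.MULTILINE)
    return match.group(1) if match else None


def versions_match(setup_src, init_src):
    """True only if both files declare a version and the two strings are equal."""
    left, right = setup_version(setup_src), init_version(init_src)
    return left is not None and left == right


if __name__ == "__main__":
    setup_text = 'setup(\nname="datasets",\nversion="2.4.0", # expected format ...\n'
    init_text = '# pylint: enable=line-too-long\n\n__version__ = "2.4.0"\n\nimport pyarrow\n'
    print(versions_match(setup_text, init_text))
```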

2 comments on commit 401d4c4

@github-actions

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.007829 / 0.011353 (-0.003524) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003971 / 0.011008 (-0.007037) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028513 / 0.038508 (-0.009995) |
| read_batch_unformated after write_array2d | 0.030703 / 0.023109 (0.007594) |
| read_batch_unformated after write_flattened_sequence | 0.351116 / 0.275898 (0.075218) |
| read_batch_unformated after write_nested_sequence | 0.405407 / 0.323480 (0.081927) |
| read_col_formatted_as_numpy after write_array2d | 0.005761 / 0.007986 (-0.002225) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003239 / 0.004328 (-0.001089) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006911 / 0.004250 (0.002661) |
| read_col_unformated after write_array2d | 0.032611 / 0.037052 (-0.004441) |
| read_col_unformated after write_flattened_sequence | 0.344673 / 0.258489 (0.086184) |
| read_col_unformated after write_nested_sequence | 0.409839 / 0.293841 (0.115999) |
| read_formatted_as_numpy after write_array2d | 0.029615 / 0.128546 (-0.098931) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009979 / 0.075646 (-0.065667) |
| read_formatted_as_numpy after write_nested_sequence | 0.244392 / 0.419271 (-0.174880) |
| read_unformated after write_array2d | 0.050612 / 0.043533 (0.007079) |
| read_unformated after write_flattened_sequence | 0.334995 / 0.255139 (0.079856) |
| read_unformated after write_nested_sequence | 0.404068 / 0.283200 (0.120869) |
| write_array2d | 0.088388 / 0.141683 (-0.053295) |
| write_flattened_sequence | 1.983120 / 1.452155 (0.530966) |
| write_nested_sequence | 2.010140 / 1.492716 (0.517423) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.221110 / 0.018006 (0.203104) |
| get_batch_of_1024_rows | 0.454400 / 0.000490 (0.453910) |
| get_first_row | 0.014425 / 0.000200 (0.014225) |
| get_last_row | 0.000675 / 0.000054 (0.000621) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.023561 / 0.037411 (-0.013851) |
| shard | 0.103422 / 0.014526 (0.088897) |
| shuffle | 0.106751 / 0.176557 (-0.069806) |
| sort | 0.153564 / 0.737135 (-0.583571) |
| train_test_split | 0.109017 / 0.296338 (-0.187321) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.471971 / 0.215209 (0.256762) |
| read 50000 | 4.730482 / 2.077655 (2.652827) |
| read_batch 50000 10 | 2.180769 / 1.504120 (0.676649) |
| read_batch 50000 100 | 1.893289 / 1.541195 (0.352095) |
| read_batch 50000 1000 | 1.980852 / 1.468490 (0.512362) |
| read_formatted numpy 5000 | 0.470245 / 4.584777 (-4.114532) |
| read_formatted pandas 5000 | 4.762350 / 3.745712 (1.016637) |
| read_formatted tensorflow 5000 | 3.432652 / 5.269862 (-1.837210) |
| read_formatted torch 5000 | 0.882484 / 4.565676 (-3.683193) |
| read_formatted_batch numpy 5000 10 | 0.055984 / 0.424275 (-0.368292) |
| read_formatted_batch numpy 5000 1000 | 0.011942 / 0.007607 (0.004335) |
| shuffled read 5000 | 0.584974 / 0.226044 (0.358929) |
| shuffled read 50000 | 5.705454 / 2.268929 (3.436525) |
| shuffled read_batch 50000 10 | 2.411920 / 55.444624 (-53.032704) |
| shuffled read_batch 50000 100 | 2.078621 / 6.876477 (-4.797856) |
| shuffled read_batch 50000 1000 | 2.130588 / 2.142072 (-0.011484) |
| shuffled read_formatted numpy 5000 | 0.604176 / 4.805227 (-4.201051) |
| shuffled read_formatted_batch numpy 5000 10 | 0.124030 / 6.500664 (-6.376634) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065147 / 0.075469 (-0.010323) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.757162 / 1.841788 (-0.084625) |
| map fast-tokenizer batched | 14.213816 / 8.074308 (6.139508) |
| map identity | 29.907916 / 10.191392 (19.716524) |
| map identity batched | 0.902065 / 0.680424 (0.221642) |
| map no-op batched | 0.620263 / 0.534201 (0.086062) |
| map no-op batched numpy | 0.450918 / 0.579283 (-0.128366) |
| map no-op batched pandas | 0.508103 / 0.434364 (0.073739) |
| map no-op batched pytorch | 0.290296 / 0.540337 (-0.250042) |
| map no-op batched tensorflow | 0.290287 / 1.386936 (-1.096649) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.007813 / 0.011353 (-0.003540) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003876 / 0.011008 (-0.007132) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028879 / 0.038508 (-0.009630) |
| read_batch_unformated after write_array2d | 0.030187 / 0.023109 (0.007077) |
| read_batch_unformated after write_flattened_sequence | 0.353110 / 0.275898 (0.077212) |
| read_batch_unformated after write_nested_sequence | 0.389127 / 0.323480 (0.065647) |
| read_col_formatted_as_numpy after write_array2d | 0.005670 / 0.007986 (-0.002316) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003241 / 0.004328 (-0.001087) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006670 / 0.004250 (0.002419) |
| read_col_unformated after write_array2d | 0.033058 / 0.037052 (-0.003994) |
| read_col_unformated after write_flattened_sequence | 0.344135 / 0.258489 (0.085646) |
| read_col_unformated after write_nested_sequence | 0.385973 / 0.293841 (0.092132) |
| read_formatted_as_numpy after write_array2d | 0.030255 / 0.128546 (-0.098291) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009822 / 0.075646 (-0.065824) |
| read_formatted_as_numpy after write_nested_sequence | 0.241175 / 0.419271 (-0.178096) |
| read_unformated after write_array2d | 0.053710 / 0.043533 (0.010177) |
| read_unformated after write_flattened_sequence | 0.353645 / 0.255139 (0.098506) |
| read_unformated after write_nested_sequence | 0.374383 / 0.283200 (0.091184) |
| write_array2d | 0.099348 / 0.141683 (-0.042335) |
| write_flattened_sequence | 1.929311 / 1.452155 (0.477156) |
| write_nested_sequence | 1.916251 / 1.492716 (0.423534) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.207472 / 0.018006 (0.189466) |
| get_batch_of_1024_rows | 0.447448 / 0.000490 (0.446959) |
| get_first_row | 0.014762 / 0.000200 (0.014562) |
| get_last_row | 0.000352 / 0.000054 (0.000298) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.024547 / 0.037411 (-0.012864) |
| shard | 0.095701 / 0.014526 (0.081175) |
| shuffle | 0.107500 / 0.176557 (-0.069056) |
| sort | 0.149129 / 0.737135 (-0.588006) |
| train_test_split | 0.106891 / 0.296338 (-0.189447) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.468246 / 0.215209 (0.253037) |
| read 50000 | 4.658931 / 2.077655 (2.581276) |
| read_batch 50000 10 | 2.210780 / 1.504120 (0.706660) |
| read_batch 50000 100 | 1.964441 / 1.541195 (0.423247) |
| read_batch 50000 1000 | 2.074132 / 1.468490 (0.605642) |
| read_formatted numpy 5000 | 0.475367 / 4.584777 (-4.109410) |
| read_formatted pandas 5000 | 4.433768 / 3.745712 (0.688056) |
| read_formatted tensorflow 5000 | 2.056788 / 5.269862 (-3.213073) |
| read_formatted torch 5000 | 0.861789 / 4.565676 (-3.703887) |
| read_formatted_batch numpy 5000 10 | 0.056067 / 0.424275 (-0.368209) |
| read_formatted_batch numpy 5000 1000 | 0.011812 / 0.007607 (0.004205) |
| shuffled read 5000 | 0.582910 / 0.226044 (0.356866) |
| shuffled read 50000 | 5.852080 / 2.268929 (3.583152) |
| shuffled read_batch 50000 10 | 2.609338 / 55.444624 (-52.835286) |
| shuffled read_batch 50000 100 | 2.279975 / 6.876477 (-4.596502) |
| shuffled read_batch 50000 1000 | 2.340847 / 2.142072 (0.198775) |
| shuffled read_formatted numpy 5000 | 0.605196 / 4.805227 (-4.200031) |
| shuffled read_formatted_batch numpy 5000 10 | 0.123566 / 6.500664 (-6.377098) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065183 / 0.075469 (-0.010287) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.812167 / 1.841788 (-0.029621) |
| map fast-tokenizer batched | 14.305535 / 8.074308 (6.231227) |
| map identity | 30.215032 / 10.191392 (20.023640) |
| map identity batched | 0.917119 / 0.680424 (0.236696) |
| map no-op batched | 0.602855 / 0.534201 (0.068654) |
| map no-op batched numpy | 0.440704 / 0.579283 (-0.138579) |
| map no-op batched pandas | 0.499696 / 0.434364 (0.065332) |
| map no-op batched pytorch | 0.281635 / 0.540337 (-0.258703) |
| map no-op batched tensorflow | 0.297449 / 1.386936 (-1.089487) |
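
Every benchmark value in this report is written as `new / old (diff)`. The sketch below shows one way to pull those triples out of a row of text and confirm that each `diff` equals `new - old`. The names `parse_cells` and `is_consistent` are hypothetical, and the `2e-6` tolerance is an assumption that allows for the six-decimal rounding seen in the report.

```python
import re

# One cell looks like "0.221110 / 0.018006 (0.203104)", i.e. new / old (diff).
CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row):
    """Return a list of (new, old, diff) float triples found in a row of text."""
    return [tuple(float(part) for part in match) for match in CELL_RE.findall(row)]


def is_consistent(new, old, diff, tol=2e-6):
    """Check that diff equals new - old within an assumed rounding tolerance."""
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    # Two cells copied from the benchmark_getitem_100B.json row (PyArrow==6.0.0).
    row = "new / old (diff) 0.221110 / 0.018006 (0.203104) 0.454400 / 0.000490 (0.453910)"
    for new, old, diff in parse_cells(row):
        print(new, old, diff, is_consistent(new, old, diff))
```

The row label `new / old (diff)` contains no digits, so the pattern skips it and matches only the numeric cells.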


@github-actions

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.007477 / 0.011353 (-0.003876) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003613 / 0.011008 (-0.007395) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.027984 / 0.038508 (-0.010524) |
| read_batch_unformated after write_array2d | 0.028835 / 0.023109 (0.005726) |
| read_batch_unformated after write_flattened_sequence | 0.300338 / 0.275898 (0.024440) |
| read_batch_unformated after write_nested_sequence | 0.338678 / 0.323480 (0.015198) |
| read_col_formatted_as_numpy after write_array2d | 0.005146 / 0.007986 (-0.002840) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002926 / 0.004328 (-0.001403) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006565 / 0.004250 (0.002315) |
| read_col_unformated after write_array2d | 0.034341 / 0.037052 (-0.002711) |
| read_col_unformated after write_flattened_sequence | 0.296597 / 0.258489 (0.038108) |
| read_col_unformated after write_nested_sequence | 0.349890 / 0.293841 (0.056049) |
| read_formatted_as_numpy after write_array2d | 0.028198 / 0.128546 (-0.100349) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009413 / 0.075646 (-0.066233) |
| read_formatted_as_numpy after write_nested_sequence | 0.240579 / 0.419271 (-0.178692) |
| read_unformated after write_array2d | 0.047892 / 0.043533 (0.004359) |
| read_unformated after write_flattened_sequence | 0.305403 / 0.255139 (0.050264) |
| read_unformated after write_nested_sequence | 0.333208 / 0.283200 (0.050009) |
| write_array2d | 0.080191 / 0.141683 (-0.061492) |
| write_flattened_sequence | 1.846696 / 1.452155 (0.394542) |
| write_nested_sequence | 1.896483 / 1.492716 (0.403766) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.191836 / 0.018006 (0.173829) |
| get_batch_of_1024_rows | 0.408968 / 0.000490 (0.408478) |
| get_first_row | 0.011671 / 0.000200 (0.011471) |
| get_last_row | 0.000598 / 0.000054 (0.000544) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.021989 / 0.037411 (-0.015423) |
| shard | 0.093745 / 0.014526 (0.079219) |
| shuffle | 0.104636 / 0.176557 (-0.071920) |
| sort | 0.147475 / 0.737135 (-0.589661) |
| train_test_split | 0.104364 / 0.296338 (-0.191975) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.454353 / 0.215209 (0.239144) |
| read 50000 | 4.515396 / 2.077655 (2.437742) |
| read_batch 50000 10 | 2.119887 / 1.504120 (0.615767) |
| read_batch 50000 100 | 1.931673 / 1.541195 (0.390478) |
| read_batch 50000 1000 | 1.984057 / 1.468490 (0.515567) |
| read_formatted numpy 5000 | 0.460492 / 4.584777 (-4.124285) |
| read_formatted pandas 5000 | 4.278884 / 3.745712 (0.533172) |
| read_formatted tensorflow 5000 | 1.950401 / 5.269862 (-3.319460) |
| read_formatted torch 5000 | 0.822732 / 4.565676 (-3.742944) |
| read_formatted_batch numpy 5000 10 | 0.054878 / 0.424275 (-0.369397) |
| read_formatted_batch numpy 5000 1000 | 0.011699 / 0.007607 (0.004092) |
| shuffled read 5000 | 0.560669 / 0.226044 (0.334624) |
| shuffled read 50000 | 5.544917 / 2.268929 (3.275989) |
| shuffled read_batch 50000 10 | 2.378693 / 55.444624 (-53.065932) |
| shuffled read_batch 50000 100 | 2.070996 / 6.876477 (-4.805480) |
| shuffled read_batch 50000 1000 | 2.301841 / 2.142072 (0.159769) |
| shuffled read_formatted numpy 5000 | 0.582910 / 4.805227 (-4.222318) |
| shuffled read_formatted_batch numpy 5000 10 | 0.120227 / 6.500664 (-6.380438) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.063523 / 0.075469 (-0.011947) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.683037 / 1.841788 (-0.158751) |
| map fast-tokenizer batched | 13.419334 / 8.074308 (5.345026) |
| map identity | 28.495348 / 10.191392 (18.303956) |
| map identity batched | 0.849675 / 0.680424 (0.169252) |
| map no-op batched | 0.569850 / 0.534201 (0.035649) |
| map no-op batched numpy | 0.432724 / 0.579283 (-0.146559) |
| map no-op batched pandas | 0.476489 / 0.434364 (0.042125) |
| map no-op batched pytorch | 0.274967 / 0.540337 (-0.265370) |
| map no-op batched tensorflow | 0.287822 / 1.386936 (-1.099114) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.007489 / 0.011353 (-0.003864) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003614 / 0.011008 (-0.007394) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.027557 / 0.038508 (-0.010951) |
| read_batch_unformated after write_array2d | 0.029341 / 0.023109 (0.006232) |
| read_batch_unformated after write_flattened_sequence | 0.299882 / 0.275898 (0.023984) |
| read_batch_unformated after write_nested_sequence | 0.335158 / 0.323480 (0.011678) |
| read_col_formatted_as_numpy after write_array2d | 0.005147 / 0.007986 (-0.002839) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002955 / 0.004328 (-0.001373) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006508 / 0.004250 (0.002258) |
| read_col_unformated after write_array2d | 0.030437 / 0.037052 (-0.006615) |
| read_col_unformated after write_flattened_sequence | 0.301263 / 0.258489 (0.042774) |
| read_col_unformated after write_nested_sequence | 0.340237 / 0.293841 (0.046396) |
| read_formatted_as_numpy after write_array2d | 0.028334 / 0.128546 (-0.100213) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009391 / 0.075646 (-0.066255) |
| read_formatted_as_numpy after write_nested_sequence | 0.238345 / 0.419271 (-0.180926) |
| read_unformated after write_array2d | 0.043558 / 0.043533 (0.000025) |
| read_unformated after write_flattened_sequence | 0.310890 / 0.255139 (0.055751) |
| read_unformated after write_nested_sequence | 0.327713 / 0.283200 (0.044513) |
| write_array2d | 0.079318 / 0.141683 (-0.062365) |
| write_flattened_sequence | 1.851860 / 1.452155 (0.399705) |
| write_nested_sequence | 1.904780 / 1.492716 (0.412064) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.184882 / 0.018006 (0.166876) |
| get_batch_of_1024_rows | 0.391050 / 0.000490 (0.390560) |
| get_first_row | 0.007578 / 0.000200 (0.007378) |
| get_last_row | 0.000289 / 0.000054 (0.000234) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.022536 / 0.037411 (-0.014875) |
| shard | 0.090114 / 0.014526 (0.075589) |
| shuffle | 0.103327 / 0.176557 (-0.073229) |
| sort | 0.137792 / 0.737135 (-0.599343) |
| train_test_split | 0.103281 / 0.296338 (-0.193057) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.430275 / 0.215209 (0.215066) |
| read 50000 | 4.288267 / 2.077655 (2.210613) |
| read_batch 50000 10 | 1.852647 / 1.504120 (0.348527) |
| read_batch 50000 100 | 1.632923 / 1.541195 (0.091728) |
| read_batch 50000 1000 | 1.691626 / 1.468490 (0.223135) |
| read_formatted numpy 5000 | 0.455993 / 4.584777 (-4.128784) |
| read_formatted pandas 5000 | 4.136595 / 3.745712 (0.390883) |
| read_formatted tensorflow 5000 | 1.947669 / 5.269862 (-3.322193) |
| read_formatted torch 5000 | 0.864679 / 4.565676 (-3.700997) |
| read_formatted_batch numpy 5000 10 | 0.053597 / 0.424275 (-0.370678) |
| read_formatted_batch numpy 5000 1000 | 0.011574 / 0.007607 (0.003967) |
| shuffled read 5000 | 0.541802 / 0.226044 (0.315758) |
| shuffled read 50000 | 5.437741 / 2.268929 (3.168812) |
| shuffled read_batch 50000 10 | 2.294119 / 55.444624 (-53.150505) |
| shuffled read_batch 50000 100 | 1.929149 / 6.876477 (-4.947328) |
| shuffled read_batch 50000 1000 | 2.008501 / 2.142072 (-0.133572) |
| shuffled read_formatted numpy 5000 | 0.577919 / 4.805227 (-4.227308) |
| shuffled read_formatted_batch numpy 5000 10 | 0.119707 / 6.500664 (-6.380957) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062584 / 0.075469 (-0.012886) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.682962 / 1.841788 (-0.158826) |
| map fast-tokenizer batched | 13.432616 / 8.074308 (5.358307) |
| map identity | 28.442172 / 10.191392 (18.250780) |
| map identity batched | 0.839741 / 0.680424 (0.159317) |
| map no-op batched | 0.580079 / 0.534201 (0.045878) |
| map no-op batched numpy | 0.427836 / 0.579283 (-0.151447) |
| map no-op batched pandas | 0.474420 / 0.434364 (0.040056) |
| map no-op batched pytorch | 0.275842 / 0.540337 (-0.264495) |
| map no-op batched tensorflow | 0.287736 / 1.386936 (-1.099200) |

