Commit

Fix incorrect version specification for the pyarrow package (#2317)
Co-authored-by: cemilcengiz <cemil.cengiz94@gmail.com>
cemilcengiz and cemilcengiz committed May 5, 2021
1 parent 3a3e5a4 commit c333d1f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion setup.py
@@ -74,7 +74,7 @@
     "numpy>=1.17",
     # Backend and serialization.
     # Minimum 1.0.0 to avoid permission errors on windows when using the compute layer on memory mapped data
-    "pyarrow>=1.0.0<4.0.0",
+    "pyarrow>=1.0.0,<4.0.0",
     # For smart caching dataset processing
     "dill",
     # For performance gains with apache arrow
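
The one-character fix matters because version clauses in a requirement string are separated by commas. Here is a minimal sketch of that rule, assuming a deliberately simplified grammar (six comparison operators, purely numeric dotted versions of up to four components). The helpers `parse_clauses` and `in_range` are hypothetical names written for this illustration; they are not part of setuptools or of the repository.

```python
import operator
import re
from typing import List, Tuple

# Simplified clause pattern: a comparison operator followed by a dotted
# numeric version. Illustrative subset only, not the full PEP 440 grammar.
_CLAUSE = re.compile(r"^(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)$")
_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def parse_clauses(requirement: str) -> List[Tuple[str, str]]:
    """Return (operator, version) pairs; raise ValueError on a malformed clause."""
    match = re.match(r"^([A-Za-z0-9_.-]+)\s*(.*)$", requirement.strip())
    if match is None:
        raise ValueError(f"no package name in {requirement!r}")
    spec = match.group(2).strip()
    if not spec:
        return []  # bare name such as "dill": no version constraint
    clauses = []
    for raw in spec.split(","):  # clauses are comma-separated
        clause = _CLAUSE.match(raw.strip())
        if clause is None:
            raise ValueError(f"malformed clause {raw.strip()!r} in {requirement!r}")
        clauses.append((clause.group(1), clause.group(2)))
    return clauses


def _key(version: str, width: int = 4) -> Tuple[int, ...]:
    """Zero-pad a numeric version so that '1.0' compares equal to '1.0.0'."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def in_range(version: str, clauses: List[Tuple[str, str]]) -> bool:
    """True if the version satisfies every clause (all clauses are ANDed)."""
    return all(_OPS[op](_key(version), _key(bound)) for op, bound in clauses)


fixed = parse_clauses("pyarrow>=1.0.0,<4.0.0")
print(fixed)
print(in_range("3.0.0", fixed), in_range("4.0.0", fixed))

try:
    parse_clauses("pyarrow>=1.0.0<4.0.0")  # the string before this commit
except ValueError as err:
    print(err)
```

With the comma, the string splits into two clauses, a lower bound and an upper bound. Without it, this sketch sees a single clause with trailing text and rejects it. How a real installer treats the old string depends on the tool and its version.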

1 comment on commit c333d1f

@github-actions

PyArrow==1.0.0

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.023212 / 0.011353 (0.011859)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.017355 / 0.011008 (0.006347)
read_batch_formatted_as_numpy after write_nested_sequence | 0.047307 / 0.038508 (0.008799)
read_batch_unformated after write_array2d | 0.036781 / 0.023109 (0.013672)
read_batch_unformated after write_flattened_sequence | 0.338078 / 0.275898 (0.062180)
read_batch_unformated after write_nested_sequence | 0.373427 / 0.323480 (0.049947)
read_col_formatted_as_numpy after write_array2d | 0.010836 / 0.007986 (0.002850)
read_col_formatted_as_numpy after write_flattened_sequence | 0.004832 / 0.004328 (0.000503)
read_col_formatted_as_numpy after write_nested_sequence | 0.010773 / 0.004250 (0.006522)
read_col_unformated after write_array2d | 0.046962 / 0.037052 (0.009910)
read_col_unformated after write_flattened_sequence | 0.337147 / 0.258489 (0.078658)
read_col_unformated after write_nested_sequence | 0.383497 / 0.293841 (0.089656)
read_formatted_as_numpy after write_array2d | 0.162451 / 0.128546 (0.033904)
read_formatted_as_numpy after write_flattened_sequence | 0.124749 / 0.075646 (0.049103)
read_formatted_as_numpy after write_nested_sequence | 0.419973 / 0.419271 (0.000701)
read_unformated after write_array2d | 0.408140 / 0.043533 (0.364607)
read_unformated after write_flattened_sequence | 0.337186 / 0.255139 (0.082047)
read_unformated after write_nested_sequence | 0.375428 / 0.283200 (0.092228)
write_array2d | 1.740421 / 0.141683 (1.598738)
write_flattened_sequence | 1.795128 / 1.452155 (0.342973)
write_nested_sequence | 1.828098 / 1.492716 (0.335381)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.006752 / 0.018006 (-0.011254)
get_batch_of_1024_rows | 0.494417 / 0.000490 (0.493928)
get_first_row | 0.000286 / 0.000200 (0.000086)
get_last_row | 0.000056 / 0.000054 (0.000002)
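
Each cell in these tables has the shape `new / old (diff)`. The figures are consistent with diff = new - old, up to rounding in the last printed digit. That reading is inferred from the numbers, since the comment does not state it. A small sketch that parses the cells and checks the relationship follows. The `Cell` type and the `parse_cells` helper are hypothetical names for this illustration, and the sample row is the `benchmark_getitem_100B.json` data for PyArrow==1.0.0 copied from above.

```python
import re
from typing import List, NamedTuple


class Cell(NamedTuple):
    new: float
    old: float
    diff: float


# One cell looks like "0.006752 / 0.018006 (-0.011254)".
_CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str) -> List[Cell]:
    """Extract every 'new / old (diff)' triple found in a row of text."""
    return [Cell(*(float(g) for g in m.groups())) for m in _CELL.finditer(row)]


row = (
    "0.006752 / 0.018006 (-0.011254) 0.494417 / 0.000490 (0.493928) "
    "0.000286 / 0.000200 (0.000086) 0.000056 / 0.000054 (0.000002)"
)
cells = parse_cells(row)

for cell in cells:
    # Allow for rounding of the six-decimal printed values.
    assert abs((cell.new - cell.old) - cell.diff) < 2e-6
    print(cell, "slower" if cell.diff > 0 else "faster or equal")
```

The tolerance is needed because the second cell prints 0.493928 while the rounded inputs subtract to 0.493927, which suggests the diff was computed before rounding. The "slower" label assumes the values are timings where lower is better, which the comment does not state explicitly.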

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.048147 / 0.037411 (0.010736)
shard | 0.026513 / 0.014526 (0.011987)
shuffle | 0.032590 / 0.176557 (-0.143967)
sort | 0.045337 / 0.737135 (-0.691798)
train_test_split | 0.028906 / 0.296338 (-0.267433)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.504611 / 0.215209 (0.289402)
read 50000 | 5.011681 / 2.077655 (2.934027)
read_batch 50000 10 | 2.256281 / 1.504120 (0.752161)
read_batch 50000 100 | 1.963628 / 1.541195 (0.422433)
read_batch 50000 1000 | 1.962128 / 1.468490 (0.493638)
read_formatted numpy 5000 | 6.912370 / 4.584777 (2.327594)
read_formatted pandas 5000 | 6.189910 / 3.745712 (2.444198)
read_formatted tensorflow 5000 | 8.622629 / 5.269862 (3.352767)
read_formatted torch 5000 | 7.607084 / 4.565676 (3.041407)
read_formatted_batch numpy 5000 10 | 0.766671 / 0.424275 (0.342396)
read_formatted_batch numpy 5000 1000 | 0.010893 / 0.007607 (0.003286)
shuffled read 5000 | 0.622649 / 0.226044 (0.396604)
shuffled read 50000 | 6.380305 / 2.268929 (4.111376)
shuffled read_batch 50000 10 | 2.911461 / 55.444624 (-52.533163)
shuffled read_batch 50000 100 | 2.226782 / 6.876477 (-4.649694)
shuffled read_batch 50000 1000 | 2.222038 / 2.142072 (0.079966)
shuffled read_formatted numpy 5000 | 7.077012 / 4.805227 (2.271785)
shuffled read_formatted_batch numpy 5000 10 | 6.053636 / 6.500664 (-0.447028)
shuffled read_formatted_batch numpy 5000 1000 | 9.568175 / 0.075469 (9.492706)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 10.942508 / 1.841788 (9.100721)
map fast-tokenizer batched | 12.900062 / 8.074308 (4.825754)
map identity | 37.884143 / 10.191392 (27.692751)
map identity batched | 0.858046 / 0.680424 (0.177622)
map no-op batched | 0.564674 / 0.534201 (0.030473)
map no-op batched numpy | 0.777915 / 0.579283 (0.198632)
map no-op batched pandas | 0.617260 / 0.434364 (0.182896)
map no-op batched pytorch | 0.692383 / 0.540337 (0.152046)
map no-op batched tensorflow | 1.498319 / 1.386936 (0.111383)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric | new / old (diff)
read_batch_formatted_as_numpy after write_array2d | 0.023086 / 0.011353 (0.011733)
read_batch_formatted_as_numpy after write_flattened_sequence | 0.015380 / 0.011008 (0.004371)
read_batch_formatted_as_numpy after write_nested_sequence | 0.050661 / 0.038508 (0.012153)
read_batch_unformated after write_array2d | 0.035259 / 0.023109 (0.012149)
read_batch_unformated after write_flattened_sequence | 0.306035 / 0.275898 (0.030137)
read_batch_unformated after write_nested_sequence | 0.364384 / 0.323480 (0.040904)
read_col_formatted_as_numpy after write_array2d | 0.011912 / 0.007986 (0.003927)
read_col_formatted_as_numpy after write_flattened_sequence | 0.005347 / 0.004328 (0.001018)
read_col_formatted_as_numpy after write_nested_sequence | 0.011232 / 0.004250 (0.006982)
read_col_unformated after write_array2d | 0.052675 / 0.037052 (0.015623)
read_col_unformated after write_flattened_sequence | 0.312547 / 0.258489 (0.054058)
read_col_unformated after write_nested_sequence | 0.346862 / 0.293841 (0.053021)
read_formatted_as_numpy after write_array2d | 0.167079 / 0.128546 (0.038533)
read_formatted_as_numpy after write_flattened_sequence | 0.122270 / 0.075646 (0.046624)
read_formatted_as_numpy after write_nested_sequence | 0.433880 / 0.419271 (0.014608)
read_unformated after write_array2d | 0.596425 / 0.043533 (0.552893)
read_unformated after write_flattened_sequence | 0.313619 / 0.255139 (0.058480)
read_unformated after write_nested_sequence | 0.367348 / 0.283200 (0.084149)
write_array2d | 3.518139 / 0.141683 (3.376456)
write_flattened_sequence | 1.792927 / 1.452155 (0.340773)
write_nested_sequence | 1.912647 / 1.492716 (0.419931)

Benchmark: benchmark_getitem_100B.json

metric | new / old (diff)
get_batch_of_1024_random_rows | 0.009056 / 0.018006 (-0.008950)
get_batch_of_1024_rows | 0.495318 / 0.000490 (0.494828)
get_first_row | 0.000387 / 0.000200 (0.000187)
get_last_row | 0.000057 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric | new / old (diff)
select | 0.039900 / 0.037411 (0.002489)
shard | 0.025122 / 0.014526 (0.010596)
shuffle | 0.026304 / 0.176557 (-0.150253)
sort | 0.044451 / 0.737135 (-0.692685)
train_test_split | 0.027462 / 0.296338 (-0.268876)

Benchmark: benchmark_iterating.json

metric | new / old (diff)
read 5000 | 0.451695 / 0.215209 (0.236486)
read 50000 | 4.572565 / 2.077655 (2.494910)
read_batch 50000 10 | 2.033147 / 1.504120 (0.529027)
read_batch 50000 100 | 1.745400 / 1.541195 (0.204205)
read_batch 50000 1000 | 1.754157 / 1.468490 (0.285666)
read_formatted numpy 5000 | 6.693974 / 4.584777 (2.109197)
read_formatted pandas 5000 | 5.892725 / 3.745712 (2.147013)
read_formatted tensorflow 5000 | 8.325106 / 5.269862 (3.055244)
read_formatted torch 5000 | 7.228830 / 4.565676 (2.663154)
read_formatted_batch numpy 5000 10 | 0.664716 / 0.424275 (0.240441)
read_formatted_batch numpy 5000 1000 | 0.010209 / 0.007607 (0.002602)
shuffled read 5000 | 0.601406 / 0.226044 (0.375362)
shuffled read 50000 | 6.008079 / 2.268929 (3.739151)
shuffled read_batch 50000 10 | 2.705783 / 55.444624 (-52.738841)
shuffled read_batch 50000 100 | 2.147549 / 6.876477 (-4.728928)
shuffled read_batch 50000 1000 | 2.116481 / 2.142072 (-0.025591)
shuffled read_formatted numpy 5000 | 6.828303 / 4.805227 (2.023076)
shuffled read_formatted_batch numpy 5000 10 | 4.918126 / 6.500664 (-1.582538)
shuffled read_formatted_batch numpy 5000 1000 | 7.042710 / 0.075469 (6.967241)

Benchmark: benchmark_map_filter.json

metric | new / old (diff)
filter | 11.190888 / 1.841788 (9.349101)
map fast-tokenizer batched | 12.904478 / 8.074308 (4.830170)
map identity | 37.182989 / 10.191392 (26.991597)
map identity batched | 0.844162 / 0.680424 (0.163738)
map no-op batched | 0.583190 / 0.534201 (0.048989)
map no-op batched numpy | 0.777918 / 0.579283 (0.198635)
map no-op batched pandas | 0.614819 / 0.434364 (0.180455)
map no-op batched pytorch | 0.687325 / 0.540337 (0.146987)
map no-op batched tensorflow | 1.635410 / 1.386936 (0.248474)
