style
lhoestq committed Oct 12, 2021
1 parent daca85d commit 3b80398
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion src/datasets/utils/streaming_download_manager.py
@@ -164,7 +164,9 @@ def _get_extraction_protocol(urlpath: str) -> Optional[str]:
     if extension in BASE_KNOWN_EXTENSIONS:
         return None
     elif path.endswith(".tar.gz") or path.endswith(".tgz"):
-        raise NotImplementedError(f"Extraction protocol for TAR archives like '{urlpath}' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead.")
+        raise NotImplementedError(
+            f"Extraction protocol for TAR archives like '{urlpath}' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead."
+        )
     elif extension in COMPRESSION_EXTENSION_TO_PROTOCOL:
         return COMPRESSION_EXTENSION_TO_PROTOCOL[extension]
     raise NotImplementedError(f"Extraction protocol '{extension}' for file at '{urlpath}' is not implemented yet")
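
For readers who want to see why the TAR check sits before the compression lookup, here is a minimal, self-contained sketch of the same branch order. It is not the library's code: the function name `sketch_get_extraction_protocol`, the way the extension is derived, and the contents of the two constants are all assumptions made for illustration, since the diff shows only the constant names.

```python
from typing import Optional

# Hypothetical stand-ins: the real contents of these constants are not shown in the diff.
BASE_KNOWN_EXTENSIONS = {"txt", "csv", "json"}
COMPRESSION_EXTENSION_TO_PROTOCOL = {"gz": "gzip", "zip": "zip"}


def sketch_get_extraction_protocol(urlpath: str) -> Optional[str]:
    """Illustrative dispatch on the file extension, mirroring the branch order in the diff."""
    path = urlpath
    # Assumption: the extension is whatever follows the last dot of the file name.
    name = path.split("/")[-1]
    extension = name.rsplit(".", 1)[-1] if "." in name else ""
    if extension in BASE_KNOWN_EXTENSIONS:
        # Plain files need no extraction step.
        return None
    elif path.endswith(".tar.gz") or path.endswith(".tgz"):
        # Must run before the compression lookup: "data.tar.gz" also ends in "gz".
        raise NotImplementedError(
            f"Extraction protocol for TAR archives like '{urlpath}' is not implemented in streaming mode. "
            "Please use `dl_manager.iter_archive` instead."
        )
    elif extension in COMPRESSION_EXTENSION_TO_PROTOCOL:
        return COMPRESSION_EXTENSION_TO_PROTOCOL[extension]
    raise NotImplementedError(
        f"Extraction protocol '{extension}' for file at '{urlpath}' is not implemented yet"
    )
```

With these assumed constants, `data.json.gz` resolves to `"gzip"`, while `data.tar.gz` raises before the `gz` entry can be consulted. That ordering is what keeps a TAR archive from being treated as a single gzip-compressed file.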

1 comment on commit 3b80398

@github-actions
Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010645 / 0.011353 (-0.000708) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004373 / 0.011008 (-0.006636) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037621 / 0.038508 (-0.000887) |
| read_batch_unformated after write_array2d | 0.040256 / 0.023109 (0.017147) |
| read_batch_unformated after write_flattened_sequence | 0.325562 / 0.275898 (0.049664) |
| read_batch_unformated after write_nested_sequence | 0.376962 / 0.323480 (0.053482) |
| read_col_formatted_as_numpy after write_array2d | 0.008237 / 0.007986 (0.000251) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005209 / 0.004328 (0.000881) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010274 / 0.004250 (0.006024) |
| read_col_unformated after write_array2d | 0.037029 / 0.037052 (-0.000024) |
| read_col_unformated after write_flattened_sequence | 0.352093 / 0.258489 (0.093604) |
| read_col_unformated after write_nested_sequence | 0.375124 / 0.293841 (0.081283) |
| read_formatted_as_numpy after write_array2d | 0.032167 / 0.128546 (-0.096379) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011698 / 0.075646 (-0.063948) |
| read_formatted_as_numpy after write_nested_sequence | 0.283362 / 0.419271 (-0.135910) |
| read_unformated after write_array2d | 0.058128 / 0.043533 (0.014595) |
| read_unformated after write_flattened_sequence | 0.347824 / 0.255139 (0.092685) |
| read_unformated after write_nested_sequence | 0.369631 / 0.283200 (0.086431) |
| write_array2d | 0.097929 / 0.141683 (-0.043754) |
| write_flattened_sequence | 1.934168 / 1.452155 (0.482013) |
| write_nested_sequence | 2.003858 / 1.492716 (0.511141) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.225254 / 0.018006 (0.207248) |
| get_batch_of_1024_rows | 0.539749 / 0.000490 (0.539259) |
| get_first_row | 0.010902 / 0.000200 (0.010702) |
| get_last_row | 0.000418 / 0.000054 (0.000364) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.043436 / 0.037411 (0.006025) |
| shard | 0.024310 / 0.014526 (0.009784) |
| shuffle | 0.026319 / 0.176557 (-0.150238) |
| sort | 0.132976 / 0.737135 (-0.604159) |
| train_test_split | 0.028824 / 0.296338 (-0.267514) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.475577 / 0.215209 (0.260368) |
| read 50000 | 4.932413 / 2.077655 (2.854759) |
| read_batch 50000 10 | 2.176730 / 1.504120 (0.672610) |
| read_batch 50000 100 | 1.925886 / 1.541195 (0.384691) |
| read_batch 50000 1000 | 1.937496 / 1.468490 (0.469006) |
| read_formatted numpy 5000 | 0.527485 / 4.584777 (-4.057292) |
| read_formatted pandas 5000 | 6.384745 / 3.745712 (2.639033) |
| read_formatted tensorflow 5000 | 1.371779 / 5.269862 (-3.898082) |
| read_formatted torch 5000 | 1.345321 / 4.565676 (-3.220355) |
| read_formatted_batch numpy 5000 10 | 0.056642 / 0.424275 (-0.367633) |
| read_formatted_batch numpy 5000 1000 | 0.005616 / 0.007607 (-0.001991) |
| shuffled read 5000 | 0.610366 / 0.226044 (0.384322) |
| shuffled read 50000 | 6.014293 / 2.268929 (3.745364) |
| shuffled read_batch 50000 10 | 2.770029 / 55.444624 (-52.674595) |
| shuffled read_batch 50000 100 | 2.247225 / 6.876477 (-4.629251) |
| shuffled read_batch 50000 1000 | 2.181321 / 2.142072 (0.039248) |
| shuffled read_formatted numpy 5000 | 0.682284 / 4.805227 (-4.122943) |
| shuffled read_formatted_batch numpy 5000 10 | 0.155333 / 6.500664 (-6.345331) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062208 / 0.075469 (-0.013261) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.052816 / 1.841788 (-0.788972) |
| map fast-tokenizer batched | 14.422730 / 8.074308 (6.348422) |
| map identity | 31.493267 / 10.191392 (21.301875) |
| map identity batched | 0.925243 / 0.680424 (0.244819) |
| map no-op batched | 0.668298 / 0.534201 (0.134098) |
| map no-op batched numpy | 0.292443 / 0.579283 (-0.286841) |
| map no-op batched pandas | 0.721196 / 0.434364 (0.286832) |
| map no-op batched pytorch | 0.239783 / 0.540337 (-0.300554) |
| map no-op batched tensorflow | 0.263919 / 1.386936 (-1.123017) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009457 / 0.011353 (-0.001896) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004551 / 0.011008 (-0.006457) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.034593 / 0.038508 (-0.003915) |
| read_batch_unformated after write_array2d | 0.032582 / 0.023109 (0.009473) |
| read_batch_unformated after write_flattened_sequence | 0.385920 / 0.275898 (0.110022) |
| read_batch_unformated after write_nested_sequence | 0.375839 / 0.323480 (0.052359) |
| read_col_formatted_as_numpy after write_array2d | 0.009447 / 0.007986 (0.001462) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005405 / 0.004328 (0.001076) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008915 / 0.004250 (0.004665) |
| read_col_unformated after write_array2d | 0.044298 / 0.037052 (0.007246) |
| read_col_unformated after write_flattened_sequence | 0.410903 / 0.258489 (0.152414) |
| read_col_unformated after write_nested_sequence | 0.361417 / 0.293841 (0.067576) |
| read_formatted_as_numpy after write_array2d | 0.031143 / 0.128546 (-0.097403) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011001 / 0.075646 (-0.064645) |
| read_formatted_as_numpy after write_nested_sequence | 0.293193 / 0.419271 (-0.126079) |
| read_unformated after write_array2d | 0.051473 / 0.043533 (0.007940) |
| read_unformated after write_flattened_sequence | 0.337591 / 0.255139 (0.082452) |
| read_unformated after write_nested_sequence | 0.344412 / 0.283200 (0.061212) |
| write_array2d | 0.085572 / 0.141683 (-0.056111) |
| write_flattened_sequence | 1.811750 / 1.452155 (0.359595) |
| write_nested_sequence | 2.047706 / 1.492716 (0.554990) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.411038 / 0.018006 (0.393032) |
| get_batch_of_1024_rows | 0.567480 / 0.000490 (0.566990) |
| get_first_row | 0.063260 / 0.000200 (0.063060) |
| get_last_row | 0.002354 / 0.000054 (0.002300) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.038908 / 0.037411 (0.001497) |
| shard | 0.025894 / 0.014526 (0.011368) |
| shuffle | 0.028783 / 0.176557 (-0.147774) |
| sort | 0.142676 / 0.737135 (-0.594459) |
| train_test_split | 0.029892 / 0.296338 (-0.266447) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.480543 / 0.215209 (0.265334) |
| read 50000 | 4.732432 / 2.077655 (2.654778) |
| read_batch 50000 10 | 2.220497 / 1.504120 (0.716377) |
| read_batch 50000 100 | 2.014020 / 1.541195 (0.472826) |
| read_batch 50000 1000 | 1.960898 / 1.468490 (0.492408) |
| read_formatted numpy 5000 | 0.507820 / 4.584777 (-4.076957) |
| read_formatted pandas 5000 | 6.569548 / 3.745712 (2.823836) |
| read_formatted tensorflow 5000 | 1.494515 / 5.269862 (-3.775346) |
| read_formatted torch 5000 | 1.277860 / 4.565676 (-3.287817) |
| read_formatted_batch numpy 5000 10 | 0.054896 / 0.424275 (-0.369379) |
| read_formatted_batch numpy 5000 1000 | 0.004866 / 0.007607 (-0.002741) |
| shuffled read 5000 | 0.642677 / 0.226044 (0.416633) |
| shuffled read 50000 | 6.423500 / 2.268929 (4.154571) |
| shuffled read_batch 50000 10 | 2.871627 / 55.444624 (-52.572998) |
| shuffled read_batch 50000 100 | 2.199929 / 6.876477 (-4.676548) |
| shuffled read_batch 50000 1000 | 2.183597 / 2.142072 (0.041525) |
| shuffled read_formatted numpy 5000 | 0.675058 / 4.805227 (-4.130170) |
| shuffled read_formatted_batch numpy 5000 10 | 0.160508 / 6.500664 (-6.340156) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064554 / 0.075469 (-0.010915) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.079830 / 1.841788 (-0.761958) |
| map fast-tokenizer batched | 14.192583 / 8.074308 (6.118275) |
| map identity | 31.815635 / 10.191392 (21.624243) |
| map identity batched | 0.878071 / 0.680424 (0.197647) |
| map no-op batched | 0.615311 / 0.534201 (0.081110) |
| map no-op batched numpy | 0.264663 / 0.579283 (-0.314620) |
| map no-op batched pandas | 0.703844 / 0.434364 (0.269480) |
| map no-op batched pytorch | 0.241070 / 0.540337 (-0.299268) |
| map no-op batched tensorflow | 0.258595 / 1.386936 (-1.128341) |
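
Each benchmark cell above has the form `new / old (diff)`, where the parenthesised value matches `new - old` up to rounding in the last printed digit. The sketch below shows one way to pull those triples out of a row and check that relation. The helper names `parse_cells` and `is_consistent` and the tolerance are assumptions made for illustration, not part of the benchmark tooling.

```python
import re
from typing import List, Tuple

# Matches one "new / old (diff)" cell, e.g. "0.043436 / 0.037411 (0.006025)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str) -> List[Tuple[float, float, float]]:
    """Return (new, old, diff) triples for every cell found in a benchmark row."""
    return [
        (float(m.group(1)), float(m.group(2)), float(m.group(3)))
        for m in CELL.finditer(row)
    ]


def is_consistent(cell: Tuple[float, float, float], tol: float = 2e-6) -> bool:
    """Check that diff equals new - old, allowing for rounding in the last printed digit."""
    new, old, diff = cell
    return abs((new - old) - diff) <= tol


# First three cells of the benchmark_indices_mapping.json row under PyArrow==3.0.0.
row = (
    "0.043436 / 0.037411 (0.006025) "
    "0.024310 / 0.014526 (0.009784) "
    "0.026319 / 0.176557 (-0.150238)"
)
cells = parse_cells(row)
```

A positive diff means the new value is larger than the old one, and a negative diff means it is smaller. The tolerance is needed because some printed diffs differ from the difference of the printed values by one unit in the sixth decimal place, for example `0.004373 / 0.011008 (-0.006636)`.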
