Commit

Make streamable the all config
albertvillanova committed Nov 30, 2021
1 parent fe4beff commit 1abfd8b
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions datasets/the_pile/the_pile.py
@@ -116,7 +116,7 @@ def _info(self):
     def _split_generators(self, dl_manager):
         """Return SplitGenerators."""
         if self.config.name == "all":
-            data_dir = dl_manager.download_and_extract(_DATA_URLS[self.config.name])
+            data_dir = dl_manager.download(_DATA_URLS[self.config.name])
             return [
                 datasets.SplitGenerator(
                     name=split,
@@ -142,8 +142,10 @@ def _generate_examples(self, files):
         """Yield examples as (key, example) tuples."""
         key = 0
         if isinstance(files, list):
+            import zstandard as zstd
+
             for path in files:
-                with open(path, encoding="utf-8") as f:
+                with zstd.open(open(path, "rb"), "rt", encoding="utf-8") as f:
                     for row in f:
                         data = json.loads(row)
                         yield key, data
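
The read pattern in the second hunk can be sketched on its own as follows. This is a minimal sketch, assuming the third-party `zstandard` package is installed in a version that provides `zstandard.open` (0.15 or later). The function name `iter_jsonl_zst` is hypothetical, and the `key += 1` increment is an assumption, since the rest of the original function is collapsed in this view.

```python
import json


def iter_jsonl_zst(paths):
    """Yield (key, example) pairs from Zstandard-compressed JSON Lines files.

    Hypothetical helper, not part of the commit: it only illustrates the
    streaming decompression pattern the diff introduces.
    """
    # Imported lazily, as in the commit, so the dependency is only
    # required when this code path actually runs.
    import zstandard as zstd

    key = 0
    for path in paths:
        # Open the raw bytes explicitly so the handle is closed on exit;
        # zstd.open() wraps it in a text-mode streaming decompressor, so
        # the file is never fully decompressed to disk or memory.
        with open(path, "rb") as raw, zstd.open(raw, "rt", encoding="utf-8") as f:
            for row in f:
                yield key, json.loads(row)
                key += 1  # assumption: keys increase by one per example
```

Because rows are decoded one line at a time from the compressed stream, the archive only needs to be downloaded rather than extracted, which matches the switch from `download_and_extract` to `download` in the first hunk.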

1 comment on commit 1abfd8b

@github-actions
Show benchmarks

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.074384 / 0.011353 (0.063031)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.003952 / 0.011008 (-0.007056)
read_batch_formatted_as_numpy after write_nested_sequence: 0.028138 / 0.038508 (-0.010370)
read_batch_unformated after write_array2d: 0.037805 / 0.023109 (0.014696)
read_batch_unformated after write_flattened_sequence: 0.259187 / 0.275898 (-0.016711)
read_batch_unformated after write_nested_sequence: 0.335654 / 0.323480 (0.012174)
read_col_formatted_as_numpy after write_array2d: 0.090459 / 0.007986 (0.082474)
read_col_formatted_as_numpy after write_flattened_sequence: 0.004344 / 0.004328 (0.000015)
read_col_formatted_as_numpy after write_nested_sequence: 0.008485 / 0.004250 (0.004235)
read_col_unformated after write_array2d: 0.049620 / 0.037052 (0.012567)
read_col_unformated after write_flattened_sequence: 0.260585 / 0.258489 (0.002096)
read_col_unformated after write_nested_sequence: 0.295495 / 0.293841 (0.001654)
read_formatted_as_numpy after write_array2d: 0.092897 / 0.128546 (-0.035649)
read_formatted_as_numpy after write_flattened_sequence: 0.008213 / 0.075646 (-0.067433)
read_formatted_as_numpy after write_nested_sequence: 0.272662 / 0.419271 (-0.146610)
read_unformated after write_array2d: 0.050974 / 0.043533 (0.007441)
read_unformated after write_flattened_sequence: 0.260500 / 0.255139 (0.005361)
read_unformated after write_nested_sequence: 0.345561 / 0.283200 (0.062361)
write_array2d: 0.095136 / 0.141683 (-0.046547)
write_flattened_sequence: 1.546684 / 1.452155 (0.094530)
write_nested_sequence: 1.918275 / 1.492716 (0.425558)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.333203 / 0.018006 (0.315197)
get_batch_of_1024_rows: 0.547384 / 0.000490 (0.546895)
get_first_row: 0.010061 / 0.000200 (0.009861)
get_last_row: 0.000114 / 0.000054 (0.000060)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.033938 / 0.037411 (-0.003474)
shard: 0.021234 / 0.014526 (0.006709)
shuffle: 0.027363 / 0.176557 (-0.149193)
sort: 0.175692 / 0.737135 (-0.561444)
train_test_split: 0.030163 / 0.296338 (-0.266175)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.380276 / 0.215209 (0.165067)
read 50000: 3.823097 / 2.077655 (1.745442)
read_batch 50000 10: 1.679933 / 1.504120 (0.175813)
read_batch 50000 100: 1.468895 / 1.541195 (-0.072300)
read_batch 50000 1000: 1.601298 / 1.468490 (0.132807)
read_formatted numpy 5000: 0.373618 / 4.584777 (-4.211159)
read_formatted pandas 5000: 4.613742 / 3.745712 (0.868030)
read_formatted tensorflow 5000: 2.090259 / 5.269862 (-3.179603)
read_formatted torch 5000: 0.796625 / 4.565676 (-3.769052)
read_formatted_batch numpy 5000 10: 0.050743 / 0.424275 (-0.373532)
read_formatted_batch numpy 5000 1000: 0.011433 / 0.007607 (0.003826)
shuffled read 5000: 0.537085 / 0.226044 (0.311041)
shuffled read 50000: 5.362101 / 2.268929 (3.093172)
shuffled read_batch 50000 10: 2.357563 / 55.444624 (-53.087062)
shuffled read_batch 50000 100: 1.951280 / 6.876477 (-4.925196)
shuffled read_batch 50000 1000: 2.096911 / 2.142072 (-0.045161)
shuffled read_formatted numpy 5000: 0.543793 / 4.805227 (-4.261434)
shuffled read_formatted_batch numpy 5000 10: 0.117768 / 6.500664 (-6.382896)
shuffled read_formatted_batch numpy 5000 1000: 0.059244 / 0.075469 (-0.016225)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.378603 / 1.841788 (-0.463185)
map fast-tokenizer batched: 11.531308 / 8.074308 (3.457000)
map identity: 27.243383 / 10.191392 (17.051991)
map identity batched: 0.818522 / 0.680424 (0.138099)
map no-op batched: 0.545809 / 0.534201 (0.011608)
map no-op batched numpy: 0.375448 / 0.579283 (-0.203835)
map no-op batched pandas: 0.464038 / 0.434364 (0.029674)
map no-op batched pytorch: 0.232225 / 0.540337 (-0.308112)
map no-op batched tensorflow: 0.243527 / 1.386936 (-1.143410)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric: new / old (diff)
read_batch_formatted_as_numpy after write_array2d: 0.078201 / 0.011353 (0.066848)
read_batch_formatted_as_numpy after write_flattened_sequence: 0.004035 / 0.011008 (-0.006974)
read_batch_formatted_as_numpy after write_nested_sequence: 0.026342 / 0.038508 (-0.012166)
read_batch_unformated after write_array2d: 0.036383 / 0.023109 (0.013273)
read_batch_unformated after write_flattened_sequence: 0.269796 / 0.275898 (-0.006102)
read_batch_unformated after write_nested_sequence: 0.301751 / 0.323480 (-0.021729)
read_col_formatted_as_numpy after write_array2d: 0.085324 / 0.007986 (0.077338)
read_col_formatted_as_numpy after write_flattened_sequence: 0.004880 / 0.004328 (0.000552)
read_col_formatted_as_numpy after write_nested_sequence: 0.006655 / 0.004250 (0.002404)
read_col_unformated after write_array2d: 0.036757 / 0.037052 (-0.000295)
read_col_unformated after write_flattened_sequence: 0.264930 / 0.258489 (0.006441)
read_col_unformated after write_nested_sequence: 0.306947 / 0.293841 (0.013106)
read_formatted_as_numpy after write_array2d: 0.091088 / 0.128546 (-0.037459)
read_formatted_as_numpy after write_flattened_sequence: 0.008171 / 0.075646 (-0.067475)
read_formatted_as_numpy after write_nested_sequence: 0.222923 / 0.419271 (-0.196348)
read_unformated after write_array2d: 0.048858 / 0.043533 (0.005325)
read_unformated after write_flattened_sequence: 0.271155 / 0.255139 (0.016016)
read_unformated after write_nested_sequence: 0.283896 / 0.283200 (0.000697)
write_array2d: 0.092843 / 0.141683 (-0.048840)
write_flattened_sequence: 1.491629 / 1.452155 (0.039474)
write_nested_sequence: 1.549579 / 1.492716 (0.056863)

Benchmark: benchmark_getitem_100B.json

metric: new / old (diff)
get_batch_of_1024_random_rows: 0.380703 / 0.018006 (0.362697)
get_batch_of_1024_rows: 0.537971 / 0.000490 (0.537482)
get_first_row: 0.000943 / 0.000200 (0.000743)
get_last_row: 0.000084 / 0.000054 (0.000029)

Benchmark: benchmark_indices_mapping.json

metric: new / old (diff)
select: 0.030776 / 0.037411 (-0.006636)
shard: 0.019965 / 0.014526 (0.005439)
shuffle: 0.032233 / 0.176557 (-0.144323)
sort: 0.183256 / 0.737135 (-0.553879)
train_test_split: 0.028057 / 0.296338 (-0.268281)

Benchmark: benchmark_iterating.json

metric: new / old (diff)
read 5000: 0.426658 / 0.215209 (0.211449)
read 50000: 4.268978 / 2.077655 (2.191323)
read_batch 50000 10: 1.818527 / 1.504120 (0.314407)
read_batch 50000 100: 1.609533 / 1.541195 (0.068338)
read_batch 50000 1000: 1.736464 / 1.468490 (0.267974)
read_formatted numpy 5000: 0.425229 / 4.584777 (-4.159548)
read_formatted pandas 5000: 4.706249 / 3.745712 (0.960537)
read_formatted tensorflow 5000: 3.843178 / 5.269862 (-1.426684)
read_formatted torch 5000: 0.885809 / 4.565676 (-3.679868)
read_formatted_batch numpy 5000 10: 0.051031 / 0.424275 (-0.373244)
read_formatted_batch numpy 5000 1000: 0.011486 / 0.007607 (0.003879)
shuffled read 5000: 0.545886 / 0.226044 (0.319841)
shuffled read 50000: 5.417832 / 2.268929 (3.148903)
shuffled read_batch 50000 10: 2.313680 / 55.444624 (-53.130945)
shuffled read_batch 50000 100: 1.953260 / 6.876477 (-4.923216)
shuffled read_batch 50000 1000: 2.122017 / 2.142072 (-0.020056)
shuffled read_formatted numpy 5000: 0.545054 / 4.805227 (-4.260173)
shuffled read_formatted_batch numpy 5000 10: 0.118680 / 6.500664 (-6.381984)
shuffled read_formatted_batch numpy 5000 1000: 0.060643 / 0.075469 (-0.014826)

Benchmark: benchmark_map_filter.json

metric: new / old (diff)
filter: 1.370396 / 1.841788 (-0.471392)
map fast-tokenizer batched: 12.056940 / 8.074308 (3.982632)
map identity: 23.566509 / 10.191392 (13.375117)
map identity batched: 0.631082 / 0.680424 (-0.049341)
map no-op batched: 0.469369 / 0.534201 (-0.064832)
map no-op batched numpy: 0.329864 / 0.579283 (-0.249419)
map no-op batched pandas: 0.473907 / 0.434364 (0.039544)
map no-op batched pytorch: 0.227733 / 0.540337 (-0.312604)
map no-op batched tensorflow: 0.237454 / 1.386936 (-1.149483)
