Fix fsspec storage_options from load_dataset #6072

lhoestq · 2023-07-26T10:44:23Z

lhoestq · 2023-07-26T10:49:48Z

src/datasets/packaged_modules/text/text.py

@@ -33,7 +33,7 @@ def __post_init__(self):
                f"You can remove this warning by passing 'encoding_errors={self.errors}' instead.",
                FutureWarning,
            )
-        self.encoding_errors = self.errors
+            self.encoding_errors = self.errors


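This was a bug I encountered while writing the test: the assignment was one indentation level too shallow, so it ran outside the block that emits the deprecation warning.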
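To illustrate the indentation fix above, here is a minimal sketch, not the actual `text.py` code. The class name `TextConfigSketch`, the `"strict"` default, the `"deprecated"` sentinel, and the shortened warning text are assumptions made for the example. Only the placement of the assignment mirrors the diff.

```python
import warnings
from dataclasses import dataclass


@dataclass
class TextConfigSketch:
    # Hypothetical stand-in for the text builder config; the field defaults
    # here are assumptions for illustration, not the library's real defaults.
    encoding_errors: str = "strict"
    errors: str = "deprecated"

    def __post_init__(self):
        if self.errors != "deprecated":
            warnings.warn(
                f"'errors' is deprecated; pass 'encoding_errors={self.errors}' instead.",
                FutureWarning,
            )
            # Fixed placement: copy the value only when the caller set `errors`.
            # One level shallower, this line would overwrite `encoding_errors`
            # with the sentinel on every instantiation.
            self.encoding_errors = self.errors
```

With the assignment inside the block, `TextConfigSketch()` keeps its own `encoding_errors`, while `TextConfigSketch(errors="ignore")` warns and copies the value over.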

github-actions · 2023-07-26T10:54:18Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007617 / 0.011353 (-0.003736)	0.004580 / 0.011008 (-0.006428)	0.100913 / 0.038508 (0.062405)	0.087703 / 0.023109 (0.064594)	0.424159 / 0.275898 (0.148261)	0.467195 / 0.323480 (0.143715)	0.006890 / 0.007986 (-0.001096)	0.003765 / 0.004328 (-0.000564)	0.077513 / 0.004250 (0.073262)	0.064889 / 0.037052 (0.027837)	0.422349 / 0.258489 (0.163860)	0.477391 / 0.293841 (0.183550)	0.036025 / 0.128546 (-0.092522)	0.009939 / 0.075646 (-0.065707)	0.342409 / 0.419271 (-0.076862)	0.061568 / 0.043533 (0.018035)	0.431070 / 0.255139 (0.175931)	0.462008 / 0.283200 (0.178809)	0.027480 / 0.141683 (-0.114203)	1.802271 / 1.452155 (0.350116)	1.861336 / 1.492716 (0.368620)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.255806 / 0.018006 (0.237800)	0.507969 / 0.000490 (0.507479)	0.010060 / 0.000200 (0.009860)	0.000112 / 0.000054 (0.000058)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032286 / 0.037411 (-0.005125)	0.104468 / 0.014526 (0.089942)	0.112707 / 0.176557 (-0.063850)	0.181285 / 0.737135 (-0.555850)	0.113180 / 0.296338 (-0.183158)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.449265 / 0.215209 (0.234056)	4.465941 / 2.077655 (2.388287)	2.177889 / 1.504120 (0.673769)	1.969864 / 1.541195 (0.428669)	2.077502 / 1.468490 (0.609011)	0.561607 / 4.584777 (-4.023170)	4.281873 / 3.745712 (0.536161)	4.975352 / 5.269862 (-0.294510)	2.907121 / 4.565676 (-1.658555)	0.070205 / 0.424275 (-0.354070)	0.009164 / 0.007607 (0.001557)	0.581921 / 0.226044 (0.355876)	5.538667 / 2.268929 (3.269739)	2.798853 / 55.444624 (-52.645771)	2.314015 / 6.876477 (-4.562462)	2.584836 / 2.142072 (0.442763)	0.672333 / 4.805227 (-4.132894)	0.153828 / 6.500664 (-6.346836)	0.069757 / 0.075469 (-0.005712)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.559670 / 1.841788 (-0.282118)	23.994639 / 8.074308 (15.920331)	16.856160 / 10.191392 (6.664768)	0.195555 / 0.680424 (-0.484869)	0.021586 / 0.534201 (-0.512615)	0.469295 / 0.579283 (-0.109989)	0.481582 / 0.434364 (0.047218)	0.588667 / 0.540337 (0.048329)	0.734347 / 1.386936 (-0.652589)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009614 / 0.011353 (-0.001739)	0.004616 / 0.011008 (-0.006392)	0.077223 / 0.038508 (0.038715)	0.103074 / 0.023109 (0.079965)	0.447834 / 0.275898 (0.171936)	0.524696 / 0.323480 (0.201216)	0.007120 / 0.007986 (-0.000866)	0.003890 / 0.004328 (-0.000438)	0.076406 / 0.004250 (0.072156)	0.073488 / 0.037052 (0.036436)	0.466221 / 0.258489 (0.207732)	0.532206 / 0.293841 (0.238365)	0.037596 / 0.128546 (-0.090950)	0.010029 / 0.075646 (-0.065617)	0.084313 / 0.419271 (-0.334959)	0.060088 / 0.043533 (0.016555)	0.437792 / 0.255139 (0.182653)	0.512850 / 0.283200 (0.229650)	0.032424 / 0.141683 (-0.109259)	1.762130 / 1.452155 (0.309975)	1.946097 / 1.492716 (0.453381)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.250774 / 0.018006 (0.232768)	0.506869 / 0.000490 (0.506379)	0.008232 / 0.000200 (0.008032)	0.000164 / 0.000054 (0.000110)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037779 / 0.037411 (0.000368)	0.111933 / 0.014526 (0.097407)	0.122385 / 0.176557 (-0.054172)	0.190372 / 0.737135 (-0.546763)	0.122472 / 0.296338 (-0.173866)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.488502 / 0.215209 (0.273293)	4.878114 / 2.077655 (2.800459)	2.504144 / 1.504120 (1.000024)	2.321077 / 1.541195 (0.779883)	2.416797 / 1.468490 (0.948307)	0.583582 / 4.584777 (-4.001195)	4.277896 / 3.745712 (0.532184)	3.874780 / 5.269862 (-1.395082)	2.540099 / 4.565676 (-2.025577)	0.068734 / 0.424275 (-0.355541)	0.009158 / 0.007607 (0.001550)	0.578401 / 0.226044 (0.352357)	5.763354 / 2.268929 (3.494426)	3.167771 / 55.444624 (-52.276853)	2.675220 / 6.876477 (-4.201257)	2.920927 / 2.142072 (0.778855)	0.673948 / 4.805227 (-4.131280)	0.157908 / 6.500664 (-6.342756)	0.071672 / 0.075469 (-0.003797)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.635120 / 1.841788 (-0.206668)	24.853480 / 8.074308 (16.779172)	17.162978 / 10.191392 (6.971586)	0.209577 / 0.680424 (-0.470847)	0.030110 / 0.534201 (-0.504091)	0.546970 / 0.579283 (-0.032313)	0.581912 / 0.434364 (0.147548)	0.571460 / 0.540337 (0.031123)	0.823411 / 1.386936 (-0.563525)

github-actions · 2023-07-26T13:01:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006674 / 0.011353 (-0.004679)	0.004198 / 0.011008 (-0.006810)	0.084859 / 0.038508 (0.046351)	0.076065 / 0.023109 (0.052955)	0.316065 / 0.275898 (0.040167)	0.352097 / 0.323480 (0.028617)	0.005610 / 0.007986 (-0.002376)	0.003600 / 0.004328 (-0.000729)	0.064921 / 0.004250 (0.060671)	0.054493 / 0.037052 (0.017441)	0.318125 / 0.258489 (0.059636)	0.370183 / 0.293841 (0.076342)	0.031141 / 0.128546 (-0.097405)	0.008755 / 0.075646 (-0.066891)	0.288241 / 0.419271 (-0.131030)	0.052379 / 0.043533 (0.008846)	0.328147 / 0.255139 (0.073008)	0.347548 / 0.283200 (0.064348)	0.024393 / 0.141683 (-0.117290)	1.480646 / 1.452155 (0.028492)	1.575867 / 1.492716 (0.083151)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.268978 / 0.018006 (0.250971)	0.586470 / 0.000490 (0.585980)	0.003190 / 0.000200 (0.002990)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030595 / 0.037411 (-0.006816)	0.083037 / 0.014526 (0.068511)	0.103706 / 0.176557 (-0.072850)	0.164104 / 0.737135 (-0.573031)	0.104536 / 0.296338 (-0.191802)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.382274 / 0.215209 (0.167065)	3.811878 / 2.077655 (1.734223)	1.840098 / 1.504120 (0.335978)	1.670949 / 1.541195 (0.129754)	1.763755 / 1.468490 (0.295264)	0.479526 / 4.584777 (-4.105251)	3.544443 / 3.745712 (-0.201269)	3.263004 / 5.269862 (-2.006858)	2.092801 / 4.565676 (-2.472875)	0.057167 / 0.424275 (-0.367108)	0.007450 / 0.007607 (-0.000157)	0.463731 / 0.226044 (0.237686)	4.624630 / 2.268929 (2.355701)	2.327078 / 55.444624 (-53.117546)	1.977734 / 6.876477 (-4.898743)	2.237152 / 2.142072 (0.095079)	0.573210 / 4.805227 (-4.232018)	0.132095 / 6.500664 (-6.368569)	0.060283 / 0.075469 (-0.015186)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.243404 / 1.841788 (-0.598384)	20.306778 / 8.074308 (12.232470)	14.561660 / 10.191392 (4.370268)	0.170826 / 0.680424 (-0.509598)	0.018574 / 0.534201 (-0.515627)	0.392367 / 0.579283 (-0.186916)	0.402918 / 0.434364 (-0.031446)	0.476629 / 0.540337 (-0.063708)	0.653709 / 1.386936 (-0.733227)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006562 / 0.011353 (-0.004791)	0.004092 / 0.011008 (-0.006916)	0.065951 / 0.038508 (0.027443)	0.078090 / 0.023109 (0.054981)	0.369679 / 0.275898 (0.093781)	0.411442 / 0.323480 (0.087962)	0.005646 / 0.007986 (-0.002339)	0.003537 / 0.004328 (-0.000791)	0.066024 / 0.004250 (0.061773)	0.058947 / 0.037052 (0.021895)	0.389219 / 0.258489 (0.130730)	0.414200 / 0.293841 (0.120359)	0.030372 / 0.128546 (-0.098174)	0.008631 / 0.075646 (-0.067015)	0.071692 / 0.419271 (-0.347580)	0.048035 / 0.043533 (0.004502)	0.376960 / 0.255139 (0.121821)	0.389847 / 0.283200 (0.106648)	0.023940 / 0.141683 (-0.117743)	1.487633 / 1.452155 (0.035479)	1.561680 / 1.492716 (0.068964)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.301467 / 0.018006 (0.283461)	0.544159 / 0.000490 (0.543669)	0.000408 / 0.000200 (0.000208)	0.000055 / 0.000054 (0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030939 / 0.037411 (-0.006472)	0.087432 / 0.014526 (0.072906)	0.103263 / 0.176557 (-0.073293)	0.154551 / 0.737135 (-0.582585)	0.104631 / 0.296338 (-0.191707)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422348 / 0.215209 (0.207139)	4.206003 / 2.077655 (2.128348)	2.212619 / 1.504120 (0.708499)	2.049616 / 1.541195 (0.508421)	2.139093 / 1.468490 (0.670603)	0.489647 / 4.584777 (-4.095130)	3.523291 / 3.745712 (-0.222422)	3.277657 / 5.269862 (-1.992205)	2.111353 / 4.565676 (-2.454324)	0.057597 / 0.424275 (-0.366679)	0.007675 / 0.007607 (0.000068)	0.493068 / 0.226044 (0.267023)	4.939493 / 2.268929 (2.670565)	2.695995 / 55.444624 (-52.748630)	2.374904 / 6.876477 (-4.501573)	2.600110 / 2.142072 (0.458038)	0.586306 / 4.805227 (-4.218921)	0.134137 / 6.500664 (-6.366527)	0.061897 / 0.075469 (-0.013572)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.330628 / 1.841788 (-0.511160)	20.557964 / 8.074308 (12.483656)	14.251632 / 10.191392 (4.060240)	0.148772 / 0.680424 (-0.531652)	0.018383 / 0.534201 (-0.515817)	0.392552 / 0.579283 (-0.186731)	0.403959 / 0.434364 (-0.030405)	0.462154 / 0.540337 (-0.078184)	0.608832 / 1.386936 (-0.778104)

mariosasko

Nice! One nit:

mariosasko · 2023-07-26T19:23:15Z

src/datasets/download/streaming_download_manager.py

@@ -423,9 +423,17 @@ def _prepare_single_hop_path_and_storage_options(
    token = None if download_config is None else download_config.token
    protocol = urlpath.split("://")[0] if "://" in urlpath else "file"
    if download_config is not None and protocol in download_config.storage_options:
-        storage_options = {protocol: download_config.storage_options[protocol]}
+        storage_options = download_config.storage_options[protocol]
+    elif download_config is not None and protocol not in download_config.storage_options:


I think we also need to update DownloadConfig.storage_options' type hint.
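For readers following the hunk above, a rough sketch of the first branch: options are looked up by protocol and passed through as they are, instead of being re-wrapped as `{protocol: options}`. The helper name `select_storage_options` and the empty-dict fallback are assumptions for illustration. The body of the new `elif` branch is not shown in the hunk, so it is not reproduced here.

```python
from typing import Any, Dict, Optional


def select_storage_options(
    urlpath: str, storage_options: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical helper: pick the options for the protocol of `urlpath`."""
    # Same protocol detection as in the hunk: no "://" means a local file.
    protocol = urlpath.split("://")[0] if "://" in urlpath else "file"
    if storage_options is not None and protocol in storage_options:
        # After the fix: return the protocol's own options directly,
        # not {protocol: options}.
        return storage_options[protocol]
    # Assumption for this sketch only: no matching entry means no options.
    return {}


options = {"s3": {"anon": True}}
print(select_storage_options("s3://bucket/data.txt", options))
print(select_storage_options("data/train.txt", options))
```

This also shows why the type hint matters: the outer dict is keyed by protocol name, and each value is itself a dict of filesystem keyword arguments.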

github-actions · 2023-07-27T12:09:00Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007659 / 0.011353 (-0.003694)	0.004500 / 0.011008 (-0.006508)	0.100379 / 0.038508 (0.061871)	0.079731 / 0.023109 (0.056622)	0.381788 / 0.275898 (0.105890)	0.416524 / 0.323480 (0.093044)	0.004446 / 0.007986 (-0.003539)	0.003752 / 0.004328 (-0.000577)	0.074956 / 0.004250 (0.070706)	0.062885 / 0.037052 (0.025832)	0.383849 / 0.258489 (0.125360)	0.433906 / 0.293841 (0.140065)	0.036079 / 0.128546 (-0.092468)	0.009927 / 0.075646 (-0.065719)	0.343879 / 0.419271 (-0.075393)	0.061055 / 0.043533 (0.017523)	0.376703 / 0.255139 (0.121564)	0.428111 / 0.283200 (0.144911)	0.028667 / 0.141683 (-0.113016)	1.777755 / 1.452155 (0.325600)	1.878283 / 1.492716 (0.385567)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220829 / 0.018006 (0.202823)	0.506406 / 0.000490 (0.505916)	0.005550 / 0.000200 (0.005350)	0.000123 / 0.000054 (0.000069)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034928 / 0.037411 (-0.002483)	0.103873 / 0.014526 (0.089347)	0.114352 / 0.176557 (-0.062204)	0.188218 / 0.737135 (-0.548918)	0.117343 / 0.296338 (-0.178995)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.459148 / 0.215209 (0.243939)	4.582092 / 2.077655 (2.504437)	2.275603 / 1.504120 (0.771483)	2.058155 / 1.541195 (0.516960)	2.163886 / 1.468490 (0.695396)	0.573033 / 4.584777 (-4.011744)	4.414891 / 3.745712 (0.669178)	7.280433 / 5.269862 (2.010572)	4.119414 / 4.565676 (-0.446262)	0.067432 / 0.424275 (-0.356843)	0.008687 / 0.007607 (0.001080)	0.556029 / 0.226044 (0.329984)	5.557192 / 2.268929 (3.288264)	2.921596 / 55.444624 (-52.523028)	2.520249 / 6.876477 (-4.356228)	2.778965 / 2.142072 (0.636893)	0.684765 / 4.805227 (-4.120462)	0.159228 / 6.500664 (-6.341436)	0.074015 / 0.075469 (-0.001454)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.534470 / 1.841788 (-0.307318)	23.630693 / 8.074308 (15.556385)	17.058142 / 10.191392 (6.866750)	0.200909 / 0.680424 (-0.479515)	0.021637 / 0.534201 (-0.512564)	0.467417 / 0.579283 (-0.111866)	0.460456 / 0.434364 (0.026092)	0.541131 / 0.540337 (0.000793)	0.728560 / 1.386936 (-0.658376)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007625 / 0.011353 (-0.003727)	0.004495 / 0.011008 (-0.006513)	0.076373 / 0.038508 (0.037865)	0.085260 / 0.023109 (0.062151)	0.475778 / 0.275898 (0.199880)	0.504604 / 0.323480 (0.181124)	0.006733 / 0.007986 (-0.001253)	0.003751 / 0.004328 (-0.000578)	0.074993 / 0.004250 (0.070743)	0.064704 / 0.037052 (0.027652)	0.490072 / 0.258489 (0.231583)	0.507560 / 0.293841 (0.213719)	0.036765 / 0.128546 (-0.091781)	0.009955 / 0.075646 (-0.065692)	0.082452 / 0.419271 (-0.336820)	0.057131 / 0.043533 (0.013598)	0.467664 / 0.255139 (0.212525)	0.482143 / 0.283200 (0.198943)	0.025396 / 0.141683 (-0.116287)	1.807587 / 1.452155 (0.355433)	1.853355 / 1.492716 (0.360639)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.250543 / 0.018006 (0.232537)	0.495685 / 0.000490 (0.495196)	0.000415 / 0.000200 (0.000215)	0.000063 / 0.000054 (0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035795 / 0.037411 (-0.001616)	0.105954 / 0.014526 (0.091428)	0.120158 / 0.176557 (-0.056399)	0.181714 / 0.737135 (-0.555422)	0.121242 / 0.296338 (-0.175097)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.488241 / 0.215209 (0.273032)	4.866916 / 2.077655 (2.789262)	2.531530 / 1.504120 (1.027410)	2.360642 / 1.541195 (0.819448)	2.457320 / 1.468490 (0.988830)	0.571224 / 4.584777 (-4.013553)	4.339042 / 3.745712 (0.593330)	3.672812 / 5.269862 (-1.597050)	2.364535 / 4.565676 (-2.201142)	0.067004 / 0.424275 (-0.357271)	0.009019 / 0.007607 (0.001412)	0.563751 / 0.226044 (0.337707)	5.664917 / 2.268929 (3.395989)	3.043316 / 55.444624 (-52.401308)	2.682722 / 6.876477 (-4.193755)	2.863482 / 2.142072 (0.721409)	0.666171 / 4.805227 (-4.139056)	0.151862 / 6.500664 (-6.348802)	0.071199 / 0.075469 (-0.004271)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.601880 / 1.841788 (-0.239907)	23.069073 / 8.074308 (14.994765)	16.918377 / 10.191392 (6.726985)	0.173614 / 0.680424 (-0.506810)	0.021843 / 0.534201 (-0.512358)	0.470531 / 0.579283 (-0.108753)	0.471152 / 0.434364 (0.036788)	0.550968 / 0.540337 (0.010631)	0.718869 / 1.386936 (-0.668067)

github-actions · 2023-07-27T12:13:16Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007530 / 0.011353 (-0.003823)	0.004151 / 0.011008 (-0.006858)	0.098490 / 0.038508 (0.059982)	0.086955 / 0.023109 (0.063846)	0.362133 / 0.275898 (0.086235)	0.391402 / 0.323480 (0.067922)	0.006274 / 0.007986 (-0.001712)	0.003711 / 0.004328 (-0.000618)	0.073519 / 0.004250 (0.069269)	0.066170 / 0.037052 (0.029118)	0.379057 / 0.258489 (0.120568)	0.398132 / 0.293841 (0.104291)	0.033936 / 0.128546 (-0.094610)	0.009977 / 0.075646 (-0.065670)	0.323766 / 0.419271 (-0.095505)	0.078615 / 0.043533 (0.035082)	0.352403 / 0.255139 (0.097264)	0.386607 / 0.283200 (0.103407)	0.036579 / 0.141683 (-0.105103)	1.691899 / 1.452155 (0.239745)	1.819396 / 1.492716 (0.326680)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216888 / 0.018006 (0.198882)	0.465781 / 0.000490 (0.465291)	0.006197 / 0.000200 (0.005997)	0.000086 / 0.000054 (0.000031)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032870 / 0.037411 (-0.004542)	0.096026 / 0.014526 (0.081500)	0.111093 / 0.176557 (-0.065464)	0.185982 / 0.737135 (-0.551154)	0.106967 / 0.296338 (-0.189371)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.441567 / 0.215209 (0.226358)	4.353813 / 2.077655 (2.276158)	2.176034 / 1.504120 (0.671914)	1.969631 / 1.541195 (0.428437)	2.048821 / 1.468490 (0.580330)	0.549144 / 4.584777 (-4.035633)	4.016166 / 3.745712 (0.270453)	3.764249 / 5.269862 (-1.505613)	2.293995 / 4.565676 (-2.271681)	0.065227 / 0.424275 (-0.359048)	0.008303 / 0.007607 (0.000695)	0.513783 / 0.226044 (0.287738)	5.247617 / 2.268929 (2.978689)	2.782114 / 55.444624 (-52.662510)	2.342776 / 6.876477 (-4.533701)	2.621569 / 2.142072 (0.479497)	0.679336 / 4.805227 (-4.125891)	0.152061 / 6.500664 (-6.348603)	0.070294 / 0.075469 (-0.005175)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.471778 / 1.841788 (-0.370010)	22.714904 / 8.074308 (14.640596)	15.607991 / 10.191392 (5.416599)	0.172592 / 0.680424 (-0.507832)	0.021799 / 0.534201 (-0.512402)	0.462740 / 0.579283 (-0.116543)	0.490885 / 0.434364 (0.056521)	0.552997 / 0.540337 (0.012660)	0.763784 / 1.386936 (-0.623152)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007466 / 0.011353 (-0.003886)	0.004322 / 0.011008 (-0.006686)	0.074331 / 0.038508 (0.035823)	0.085315 / 0.023109 (0.062206)	0.409284 / 0.275898 (0.133386)	0.464584 / 0.323480 (0.141104)	0.005651 / 0.007986 (-0.002335)	0.003577 / 0.004328 (-0.000751)	0.070250 / 0.004250 (0.066000)	0.059780 / 0.037052 (0.022727)	0.419668 / 0.258489 (0.161179)	0.462984 / 0.293841 (0.169143)	0.034159 / 0.128546 (-0.094387)	0.008999 / 0.075646 (-0.066647)	0.076302 / 0.419271 (-0.342969)	0.052274 / 0.043533 (0.008741)	0.425938 / 0.255139 (0.170799)	0.430399 / 0.283200 (0.147200)	0.025017 / 0.141683 (-0.116666)	1.680697 / 1.452155 (0.228542)	1.774677 / 1.492716 (0.281960)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.291514 / 0.018006 (0.273508)	0.461175 / 0.000490 (0.460685)	0.023061 / 0.000200 (0.022861)	0.000120 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033950 / 0.037411 (-0.003462)	0.100032 / 0.014526 (0.085506)	0.118308 / 0.176557 (-0.058249)	0.183601 / 0.737135 (-0.553535)	0.116936 / 0.296338 (-0.179402)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.478779 / 0.215209 (0.263570)	4.709505 / 2.077655 (2.631850)	2.457442 / 1.504120 (0.953322)	2.213737 / 1.541195 (0.672542)	2.340642 / 1.468490 (0.872152)	0.567187 / 4.584777 (-4.017590)	3.923061 / 3.745712 (0.177349)	3.752989 / 5.269862 (-1.516873)	2.324028 / 4.565676 (-2.241649)	0.064471 / 0.424275 (-0.359804)	0.008845 / 0.007607 (0.001238)	0.547447 / 0.226044 (0.321402)	5.599435 / 2.268929 (3.330506)	2.980547 / 55.444624 (-52.464077)	2.754908 / 6.876477 (-4.121569)	2.832978 / 2.142072 (0.690906)	0.635059 / 4.805227 (-4.170168)	0.153478 / 6.500664 (-6.347187)	0.067146 / 0.075469 (-0.008323)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.555588 / 1.841788 (-0.286200)	22.828906 / 8.074308 (14.754597)	16.211008 / 10.191392 (6.019616)	0.168009 / 0.680424 (-0.512415)	0.021966 / 0.534201 (-0.512235)	0.464872 / 0.579283 (-0.114411)	0.460429 / 0.434364 (0.026065)	0.530498 / 0.540337 (-0.009839)	0.705020 / 1.386936 (-0.681916)

github-actions · 2023-07-27T12:51:51Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005964 / 0.011353 (-0.005389)	0.003644 / 0.011008 (-0.007364)	0.079607 / 0.038508 (0.041099)	0.058387 / 0.023109 (0.035278)	0.312226 / 0.275898 (0.036328)	0.349206 / 0.323480 (0.025726)	0.004715 / 0.007986 (-0.003271)	0.002869 / 0.004328 (-0.001460)	0.061668 / 0.004250 (0.057417)	0.045694 / 0.037052 (0.008642)	0.313516 / 0.258489 (0.055027)	0.357543 / 0.293841 (0.063702)	0.027179 / 0.128546 (-0.101367)	0.007961 / 0.075646 (-0.067686)	0.262473 / 0.419271 (-0.156798)	0.045588 / 0.043533 (0.002055)	0.313102 / 0.255139 (0.057963)	0.368686 / 0.283200 (0.085486)	0.020556 / 0.141683 (-0.121127)	1.447258 / 1.452155 (-0.004897)	1.527319 / 1.492716 (0.034602)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199417 / 0.018006 (0.181411)	0.422155 / 0.000490 (0.421665)	0.004972 / 0.000200 (0.004772)	0.000073 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023539 / 0.037411 (-0.013872)	0.073055 / 0.014526 (0.058529)	0.083631 / 0.176557 (-0.092926)	0.145923 / 0.737135 (-0.591212)	0.083820 / 0.296338 (-0.212518)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.396305 / 0.215209 (0.181096)	3.967065 / 2.077655 (1.889410)	2.101109 / 1.504120 (0.596989)	1.958817 / 1.541195 (0.417622)	2.037894 / 1.468490 (0.569404)	0.496955 / 4.584777 (-4.087822)	3.078948 / 3.745712 (-0.666764)	3.363655 / 5.269862 (-1.906207)	2.087659 / 4.565676 (-2.478018)	0.057171 / 0.424275 (-0.367104)	0.006410 / 0.007607 (-0.001197)	0.470535 / 0.226044 (0.244491)	4.715259 / 2.268929 (2.446330)	2.355510 / 55.444624 (-53.089114)	2.025270 / 6.876477 (-4.851207)	2.210401 / 2.142072 (0.068329)	0.580538 / 4.805227 (-4.224689)	0.125068 / 6.500664 (-6.375596)	0.059871 / 0.075469 (-0.015598)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.245468 / 1.841788 (-0.596320)	18.322042 / 8.074308 (10.247734)	13.609726 / 10.191392 (3.418334)	0.143623 / 0.680424 (-0.536801)	0.017068 / 0.534201 (-0.517133)	0.330758 / 0.579283 (-0.248525)	0.339946 / 0.434364 (-0.094418)	0.377861 / 0.540337 (-0.162476)	0.524593 / 1.386936 (-0.862343)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006049 / 0.011353 (-0.005304)	0.003737 / 0.011008 (-0.007271)	0.062816 / 0.038508 (0.024308)	0.063768 / 0.023109 (0.040658)	0.362001 / 0.275898 (0.086103)	0.395251 / 0.323480 (0.071772)	0.004823 / 0.007986 (-0.003163)	0.002881 / 0.004328 (-0.001447)	0.061987 / 0.004250 (0.057737)	0.049950 / 0.037052 (0.012898)	0.362442 / 0.258489 (0.103953)	0.399321 / 0.293841 (0.105480)	0.027616 / 0.128546 (-0.100930)	0.007965 / 0.075646 (-0.067681)	0.068584 / 0.419271 (-0.350687)	0.044700 / 0.043533 (0.001168)	0.361011 / 0.255139 (0.105872)	0.386007 / 0.283200 (0.102807)	0.024621 / 0.141683 (-0.117061)	1.441497 / 1.452155 (-0.010657)	1.533145 / 1.492716 (0.040429)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223446 / 0.018006 (0.205440)	0.411147 / 0.000490 (0.410657)	0.001821 / 0.000200 (0.001621)	0.000081 / 0.000054 (0.000027)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025661 / 0.037411 (-0.011751)	0.077838 / 0.014526 (0.063313)	0.086148 / 0.176557 (-0.090408)	0.140386 / 0.737135 (-0.596750)	0.088793 / 0.296338 (-0.207546)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425209 / 0.215209 (0.210000)	4.250723 / 2.077655 (2.173068)	2.403437 / 1.504120 (0.899317)	2.283584 / 1.541195 (0.742390)	2.326870 / 1.468490 (0.858380)	0.504781 / 4.584777 (-4.079996)	3.017042 / 3.745712 (-0.728670)	4.643068 / 5.269862 (-0.626794)	2.535710 / 4.565676 (-2.029967)	0.058520 / 0.424275 (-0.365755)	0.006766 / 0.007607 (-0.000841)	0.500664 / 0.226044 (0.274620)	5.017073 / 2.268929 (2.748145)	2.668661 / 55.444624 (-52.775963)	2.335486 / 6.876477 (-4.540991)	2.486518 / 2.142072 (0.344445)	0.598795 / 4.805227 (-4.206432)	0.126395 / 6.500664 (-6.374269)	0.063154 / 0.075469 (-0.012315)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.358059 / 1.841788 (-0.483728)	18.615724 / 8.074308 (10.541416)	13.670934 / 10.191392 (3.479542)	0.134650 / 0.680424 (-0.545774)	0.016941 / 0.534201 (-0.517260)	0.335215 / 0.579283 (-0.244068)	0.356118 / 0.434364 (-0.078246)	0.393109 / 0.540337 (-0.147228)	0.534165 / 1.386936 (-0.852771)

fix fsspec storage_options

83b792d

lhoestq changed the title ~~Fix fsspec storage_options~~ Fix fsspec storage_options from load_dataset Jul 26, 2023

lhoestq commented Jul 26, 2023

lhoestq requested a review from mariosasko July 26, 2023 10:55

fix

7a291b2

mariosasko approved these changes Jul 26, 2023

lhoestq added 5 commits July 27, 2023 13:57

fix storage_options type hint

9311d69

more consistency

072bca5

style

20ccac8

Merge branch 'main' into fix-fsspec-storage_options

f9e6eea

docstring

deb9e70

lhoestq merged commit da7d3b5 into main Jul 27, 2023
13 checks passed

lhoestq deleted the fix-fsspec-storage_options branch July 27, 2023 12:42
