Support fsspec 2023.10.0 #6335

albertvillanova · 2023-10-23T09:29:17Z

Fix #6333.

HuggingFaceDocBuilderDev · 2023-10-23T09:35:52Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-10-23T09:37:16Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006013 / 0.011353 (-0.005340)	0.003647 / 0.011008 (-0.007362)	0.081781 / 0.038508 (0.043273)	0.059020 / 0.023109 (0.035911)	0.321823 / 0.275898 (0.045925)	0.350159 / 0.323480 (0.026679)	0.003599 / 0.007986 (-0.004386)	0.002877 / 0.004328 (-0.001452)	0.063941 / 0.004250 (0.059690)	0.049460 / 0.037052 (0.012408)	0.330185 / 0.258489 (0.071696)	0.362220 / 0.293841 (0.068379)	0.027613 / 0.128546 (-0.100934)	0.007976 / 0.075646 (-0.067670)	0.263386 / 0.419271 (-0.155885)	0.045504 / 0.043533 (0.001971)	0.321172 / 0.255139 (0.066033)	0.345291 / 0.283200 (0.062091)	0.023133 / 0.141683 (-0.118550)	1.435816 / 1.452155 (-0.016339)	1.557241 / 1.492716 (0.064524)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222228 / 0.018006 (0.204222)	0.420008 / 0.000490 (0.419518)	0.008598 / 0.000200 (0.008398)	0.000343 / 0.000054 (0.000288)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023725 / 0.037411 (-0.013686)	0.073023 / 0.014526 (0.058497)	0.814888 / 0.176557 (0.638332)	0.294122 / 0.737135 (-0.443013)	0.088945 / 0.296338 (-0.207393)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.393561 / 0.215209 (0.178352)	3.946544 / 2.077655 (1.868890)	1.916476 / 1.504120 (0.412356)	1.721544 / 1.541195 (0.180349)	1.768583 / 1.468490 (0.300093)	0.508067 / 4.584777 (-4.076710)	3.047832 / 3.745712 (-0.697880)	2.952842 / 5.269862 (-2.317020)	1.869337 / 4.565676 (-2.696339)	0.057812 / 0.424275 (-0.366463)	0.006694 / 0.007607 (-0.000913)	0.463007 / 0.226044 (0.236963)	4.635087 / 2.268929 (2.366158)	2.419833 / 55.444624 (-53.024792)	2.018519 / 6.876477 (-4.857958)	2.043430 / 2.142072 (-0.098643)	0.590895 / 4.805227 (-4.214333)	0.126113 / 6.500664 (-6.374552)	0.061045 / 0.075469 (-0.014424)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.226850 / 1.841788 (-0.614937)	17.336630 / 8.074308 (9.262322)	13.651049 / 10.191392 (3.459656)	0.143308 / 0.680424 (-0.537116)	0.016938 / 0.534201 (-0.517263)	0.332829 / 0.579283 (-0.246454)	0.368684 / 0.434364 (-0.065680)	0.385848 / 0.540337 (-0.154489)	0.546391 / 1.386936 (-0.840545)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006149 / 0.011353 (-0.005204)	0.003818 / 0.011008 (-0.007191)	0.064012 / 0.038508 (0.025504)	0.059846 / 0.023109 (0.036737)	0.455928 / 0.275898 (0.180030)	0.480736 / 0.323480 (0.157256)	0.004874 / 0.007986 (-0.003111)	0.002877 / 0.004328 (-0.001451)	0.064195 / 0.004250 (0.059944)	0.048146 / 0.037052 (0.011094)	0.452638 / 0.258489 (0.194149)	0.484339 / 0.293841 (0.190499)	0.028832 / 0.128546 (-0.099715)	0.008162 / 0.075646 (-0.067485)	0.069855 / 0.419271 (-0.349417)	0.041429 / 0.043533 (-0.002104)	0.453282 / 0.255139 (0.198143)	0.473812 / 0.283200 (0.190613)	0.021186 / 0.141683 (-0.120497)	1.465207 / 1.452155 (0.013052)	1.508216 / 1.492716 (0.015500)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242491 / 0.018006 (0.224485)	0.421219 / 0.000490 (0.420730)	0.011201 / 0.000200 (0.011001)	0.000083 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027015 / 0.037411 (-0.010396)	0.080465 / 0.014526 (0.065939)	0.092622 / 0.176557 (-0.083934)	0.146111 / 0.737135 (-0.591024)	0.091546 / 0.296338 (-0.204793)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.458351 / 0.215209 (0.243142)	4.591454 / 2.077655 (2.513799)	2.508156 / 1.504120 (1.004037)	2.328771 / 1.541195 (0.787576)	2.423251 / 1.468490 (0.954761)	0.508504 / 4.584777 (-4.076273)	3.133789 / 3.745712 (-0.611923)	2.862777 / 5.269862 (-2.407084)	1.886327 / 4.565676 (-2.679350)	0.058017 / 0.424275 (-0.366258)	0.006496 / 0.007607 (-0.001111)	0.529629 / 0.226044 (0.303585)	5.310338 / 2.268929 (3.041409)	2.973075 / 55.444624 (-52.471549)	2.601313 / 6.876477 (-4.275163)	2.777348 / 2.142072 (0.635275)	0.593711 / 4.805227 (-4.211516)	0.125453 / 6.500664 (-6.375211)	0.061034 / 0.075469 (-0.014435)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.374391 / 1.841788 (-0.467397)	18.768026 / 8.074308 (10.693718)	15.053637 / 10.191392 (4.862245)	0.158253 / 0.680424 (-0.522171)	0.018126 / 0.534201 (-0.516075)	0.337427 / 0.579283 (-0.241856)	0.391678 / 0.434364 (-0.042686)	0.398524 / 0.540337 (-0.141813)	0.558629 / 1.386936 (-0.828307)
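Each cell in the benchmark rows above has the form `new / old (diff)`, where the diff appears to be `new - old` up to rounding in the last digit. A minimal sketch for reading such a row, assuming tab-separated cells as posted here; the helper names `parse_cell` and `parse_row` are hypothetical and not part of the benchmark tooling:

```python
import re

# One benchmark cell looks like "0.006013 / 0.011353 (-0.005340)".
CELL_RE = re.compile(r"^\s*(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for a single 'new / old (diff)' cell."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"not a benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(row):
    """Split a tab-separated row and parse every cell after the row label."""
    _label, *cells = row.split("\t")
    return [parse_cell(cell) for cell in cells]


if __name__ == "__main__":
    row = "new / old (diff)\t0.222228 / 0.018006 (0.204222)\t0.420008 / 0.000490 (0.419518)"
    for new, old, diff in parse_row(row):
        # A positive diff means the new run was slower than the reference value.
        print(f"new={new:.6f} old={old:.6f} diff={diff:+.6f}")
```

A positive diff means the new timing is larger than the reference timing it is compared against.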

lhoestq · 2023-10-23T09:43:19Z

I think #6334 already fixes it, no?

github-actions · 2023-10-23T10:46:36Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006432 / 0.011353 (-0.004921)	0.003861 / 0.011008 (-0.007147)	0.084132 / 0.038508 (0.045624)	0.069391 / 0.023109 (0.046282)	0.341081 / 0.275898 (0.065183)	0.375975 / 0.323480 (0.052495)	0.003962 / 0.007986 (-0.004024)	0.003235 / 0.004328 (-0.001094)	0.064927 / 0.004250 (0.060677)	0.054190 / 0.037052 (0.017137)	0.350719 / 0.258489 (0.092230)	0.393216 / 0.293841 (0.099375)	0.031002 / 0.128546 (-0.097544)	0.008416 / 0.075646 (-0.067230)	0.289268 / 0.419271 (-0.130003)	0.052167 / 0.043533 (0.008634)	0.347559 / 0.255139 (0.092420)	0.370908 / 0.283200 (0.087709)	0.022540 / 0.141683 (-0.119142)	1.486297 / 1.452155 (0.034143)	1.576968 / 1.492716 (0.084252)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.237048 / 0.018006 (0.219042)	0.452065 / 0.000490 (0.451575)	0.013963 / 0.000200 (0.013763)	0.000242 / 0.000054 (0.000188)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028084 / 0.037411 (-0.009327)	0.081271 / 0.014526 (0.066745)	0.096490 / 0.176557 (-0.080067)	0.152106 / 0.737135 (-0.585030)	0.096174 / 0.296338 (-0.200164)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.386585 / 0.215209 (0.171375)	3.854996 / 2.077655 (1.777342)	1.832898 / 1.504120 (0.328778)	1.662832 / 1.541195 (0.121638)	1.730753 / 1.468490 (0.262263)	0.485286 / 4.584777 (-4.099491)	3.571410 / 3.745712 (-0.174302)	3.373035 / 5.269862 (-1.896826)	1.995570 / 4.565676 (-2.570107)	0.056711 / 0.424275 (-0.367564)	0.007447 / 0.007607 (-0.000160)	0.462985 / 0.226044 (0.236941)	4.617186 / 2.268929 (2.348257)	2.313915 / 55.444624 (-53.130709)	1.961697 / 6.876477 (-4.914780)	1.990410 / 2.142072 (-0.151662)	0.580536 / 4.805227 (-4.224692)	0.146275 / 6.500664 (-6.354389)	0.059458 / 0.075469 (-0.016011)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.274841 / 1.841788 (-0.566947)	18.641853 / 8.074308 (10.567545)	13.977525 / 10.191392 (3.786133)	0.151469 / 0.680424 (-0.528955)	0.018111 / 0.534201 (-0.516090)	0.393243 / 0.579283 (-0.186040)	0.412310 / 0.434364 (-0.022054)	0.461646 / 0.540337 (-0.078692)	0.633016 / 1.386936 (-0.753920)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006496 / 0.011353 (-0.004857)	0.003973 / 0.011008 (-0.007035)	0.064527 / 0.038508 (0.026019)	0.069390 / 0.023109 (0.046281)	0.401162 / 0.275898 (0.125264)	0.431031 / 0.323480 (0.107551)	0.005244 / 0.007986 (-0.002741)	0.003283 / 0.004328 (-0.001046)	0.064931 / 0.004250 (0.060680)	0.054402 / 0.037052 (0.017350)	0.397917 / 0.258489 (0.139428)	0.436728 / 0.293841 (0.142887)	0.031932 / 0.128546 (-0.096614)	0.008557 / 0.075646 (-0.067089)	0.073336 / 0.419271 (-0.345935)	0.047559 / 0.043533 (0.004026)	0.395825 / 0.255139 (0.140686)	0.423002 / 0.283200 (0.139802)	0.021708 / 0.141683 (-0.119975)	1.501140 / 1.452155 (0.048985)	1.558376 / 1.492716 (0.065660)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.289522 / 0.018006 (0.271516)	0.449078 / 0.000490 (0.448589)	0.034174 / 0.000200 (0.033974)	0.000396 / 0.000054 (0.000342)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032533 / 0.037411 (-0.004878)	0.093398 / 0.014526 (0.078872)	0.106930 / 0.176557 (-0.069626)	0.158743 / 0.737135 (-0.578393)	0.106904 / 0.296338 (-0.189435)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.427479 / 0.215209 (0.212270)	4.271758 / 2.077655 (2.194103)	2.298770 / 1.504120 (0.794650)	2.134906 / 1.541195 (0.593712)	2.220487 / 1.468490 (0.751996)	0.490506 / 4.584777 (-4.094270)	3.593876 / 3.745712 (-0.151836)	3.225656 / 5.269862 (-2.044205)	2.004434 / 4.565676 (-2.561243)	0.058015 / 0.424275 (-0.366260)	0.007221 / 0.007607 (-0.000387)	0.504928 / 0.226044 (0.278884)	5.049547 / 2.268929 (2.780618)	2.743843 / 55.444624 (-52.700781)	2.398399 / 6.876477 (-4.478078)	2.562939 / 2.142072 (0.420867)	0.597229 / 4.805227 (-4.207998)	0.134664 / 6.500664 (-6.366001)	0.059612 / 0.075469 (-0.015857)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.369692 / 1.841788 (-0.472095)	19.065326 / 8.074308 (10.991018)	14.404508 / 10.191392 (4.213116)	0.175809 / 0.680424 (-0.504615)	0.020137 / 0.534201 (-0.514064)	0.394043 / 0.579283 (-0.185240)	0.424772 / 0.434364 (-0.009592)	0.475587 / 0.540337 (-0.064751)	0.644275 / 1.386936 (-0.742661)

github-actions · 2023-10-23T11:05:03Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007259 / 0.011353 (-0.004094)	0.004396 / 0.011008 (-0.006612)	0.096456 / 0.038508 (0.057948)	0.078752 / 0.023109 (0.055643)	0.359215 / 0.275898 (0.083317)	0.396927 / 0.323480 (0.073448)	0.005611 / 0.007986 (-0.002375)	0.003687 / 0.004328 (-0.000641)	0.072794 / 0.004250 (0.068544)	0.059794 / 0.037052 (0.022741)	0.372352 / 0.258489 (0.113863)	0.414038 / 0.293841 (0.120197)	0.034490 / 0.128546 (-0.094056)	0.009790 / 0.075646 (-0.065857)	0.326338 / 0.419271 (-0.092934)	0.058582 / 0.043533 (0.015049)	0.354221 / 0.255139 (0.099082)	0.386669 / 0.283200 (0.103469)	0.025356 / 0.141683 (-0.116327)	1.664104 / 1.452155 (0.211950)	1.766825 / 1.492716 (0.274108)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.251107 / 0.018006 (0.233101)	0.478833 / 0.000490 (0.478344)	0.010776 / 0.000200 (0.010577)	0.000292 / 0.000054 (0.000238)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032869 / 0.037411 (-0.004543)	0.098449 / 0.014526 (0.083923)	0.109954 / 0.176557 (-0.066602)	0.176786 / 0.737135 (-0.560350)	0.113477 / 0.296338 (-0.182862)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.431169 / 0.215209 (0.215960)	4.303239 / 2.077655 (2.225585)	2.088885 / 1.504120 (0.584765)	1.895900 / 1.541195 (0.354706)	1.997442 / 1.468490 (0.528952)	0.541840 / 4.584777 (-4.042937)	3.991982 / 3.745712 (0.246270)	3.842421 / 5.269862 (-1.427440)	2.281150 / 4.565676 (-2.284526)	0.063851 / 0.424275 (-0.360425)	0.008470 / 0.007607 (0.000863)	0.515886 / 0.226044 (0.289841)	5.202908 / 2.268929 (2.933980)	2.662789 / 55.444624 (-52.781835)	2.266731 / 6.876477 (-4.609746)	2.343760 / 2.142072 (0.201688)	0.641050 / 4.805227 (-4.164177)	0.148236 / 6.500664 (-6.352428)	0.067422 / 0.075469 (-0.008047)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.475729 / 1.841788 (-0.366059)	22.401583 / 8.074308 (14.327274)	15.886237 / 10.191392 (5.694845)	0.171828 / 0.680424 (-0.508595)	0.022161 / 0.534201 (-0.512040)	0.465873 / 0.579283 (-0.113411)	0.476386 / 0.434364 (0.042022)	0.538317 / 0.540337 (-0.002020)	0.754375 / 1.386936 (-0.632561)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007429 / 0.011353 (-0.003924)	0.004592 / 0.011008 (-0.006416)	0.072315 / 0.038508 (0.033807)	0.080806 / 0.023109 (0.057697)	0.444607 / 0.275898 (0.168709)	0.476970 / 0.323480 (0.153490)	0.006030 / 0.007986 (-0.001956)	0.003755 / 0.004328 (-0.000573)	0.074602 / 0.004250 (0.070352)	0.061846 / 0.037052 (0.024794)	0.450928 / 0.258489 (0.192439)	0.493932 / 0.293841 (0.200091)	0.037398 / 0.128546 (-0.091148)	0.009807 / 0.075646 (-0.065840)	0.080531 / 0.419271 (-0.338741)	0.054052 / 0.043533 (0.010519)	0.453034 / 0.255139 (0.197895)	0.464959 / 0.283200 (0.181760)	0.024718 / 0.141683 (-0.116965)	1.687552 / 1.452155 (0.235397)	1.765746 / 1.492716 (0.273029)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.266998 / 0.018006 (0.248992)	0.479832 / 0.000490 (0.479342)	0.005429 / 0.000200 (0.005229)	0.000117 / 0.000054 (0.000062)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038885 / 0.037411 (0.001474)	0.105931 / 0.014526 (0.091405)	0.120880 / 0.176557 (-0.055677)	0.184006 / 0.737135 (-0.553130)	0.120750 / 0.296338 (-0.175589)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.478626 / 0.215209 (0.263417)	4.797355 / 2.077655 (2.719700)	2.582758 / 1.504120 (1.078638)	2.396488 / 1.541195 (0.855293)	2.515597 / 1.468490 (1.047107)	0.544541 / 4.584777 (-4.040236)	4.150702 / 3.745712 (0.404990)	3.676837 / 5.269862 (-1.593024)	2.287275 / 4.565676 (-2.278402)	0.064602 / 0.424275 (-0.359673)	0.008253 / 0.007607 (0.000646)	0.576201 / 0.226044 (0.350157)	5.859839 / 2.268929 (3.590910)	3.248603 / 55.444624 (-52.196021)	2.841959 / 6.876477 (-4.034518)	2.991120 / 2.142072 (0.849047)	0.667755 / 4.805227 (-4.137472)	0.151219 / 6.500664 (-6.349445)	0.068990 / 0.075469 (-0.006479)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.572359 / 1.841788 (-0.269429)	21.890279 / 8.074308 (13.815971)	15.927473 / 10.191392 (5.736081)	0.170388 / 0.680424 (-0.510036)	0.023282 / 0.534201 (-0.510919)	0.459371 / 0.579283 (-0.119912)	0.468838 / 0.434364 (0.034475)	0.546438 / 0.540337 (0.006101)	0.746912 / 1.386936 (-0.640024)

albertvillanova · 2023-10-23T11:19:30Z

Yes, @lhoestq, you are right. I think we both opened fixing PRs within 15 minutes of each other... 😅

I would say the code in this PR is simpler and easier to understand, but feel free to ignore it.

lhoestq · 2023-10-23T11:22:31Z

I think the correct way is to check if "file" is in the tuple, if it's a tuple (in case someone adds another protocol name for the local filesystem).

albertvillanova added 2 commits October 23, 2023 11:19

Unpin fsspec < 2023.10.0

6d47714

Fix is_remote_filesystem with new 'local' URI scheme

e0b7966

Refactor is_remote_filesystem

6f08819

Fix is_remote_filesystem with tuple

2249779

Merge branch 'main' into fix-6333

8197ce8

albertvillanova marked this pull request as ready for review October 23, 2023 11:19

lhoestq closed this Nov 14, 2023

albertvillanova deleted the fix-6333 branch January 11, 2024 06:33
