Fix error when loading from GCP bucket #6105

albertvillanova · 2023-07-31T11:44:46Z

Fix resolve_pattern for filesystems with tuple protocol.

Fix #6100.

The buggy lines were introduced by:

Use new hffs #6028
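
For context, fsspec filesystem classes expose a `protocol` attribute that may be either a single string or a tuple of aliases, so code that builds a URL prefix from it has to handle both shapes. The sketch below is an illustration only, not the code from this PR: `protocol_prefix`, `TupleProtocolFS`, `StrProtocolFS`, and the `("gs", "gcs")` value are hypothetical stand-ins, and the actual change to `resolve_pattern` may differ.

```python
from typing import Tuple, Union

# A filesystem protocol may be a plain string or a tuple of aliases.
ProtocolSpec = Union[str, Tuple[str, ...]]


def protocol_prefix(protocol: ProtocolSpec) -> str:
    """Build a URL prefix from a filesystem `protocol` attribute.

    Hypothetical helper: accepts either a string or a tuple of aliases.
    """
    # Normalize a tuple of aliases to its first entry.
    if isinstance(protocol, tuple):
        protocol = protocol[0]
    # Local paths are left without a prefix in this sketch.
    return "" if protocol == "file" else f"{protocol}://"


class TupleProtocolFS:
    """Hypothetical stand-in for a filesystem whose protocol is a tuple."""

    protocol = ("gs", "gcs")  # assumed value, for illustration only


class StrProtocolFS:
    """Hypothetical stand-in for a filesystem whose protocol is a string."""

    protocol = "s3"


# Both shapes yield a usable prefix once normalized.
print(protocol_prefix(TupleProtocolFS.protocol) + "bucket/data.csv")
print(protocol_prefix(StrProtocolFS.protocol) + "bucket/data.csv")
```

In plain Python, concatenating a tuple with a string raises a `TypeError`, so skipping the normalization step is one way a tuple-valued protocol can break path handling.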

HuggingFaceDocBuilderDev · 2023-07-31T11:50:22Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-31T11:51:59Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006706 / 0.011353 (-0.004647)	0.004016 / 0.011008 (-0.006992)	0.083696 / 0.038508 (0.045188)	0.074340 / 0.023109 (0.051230)	0.327338 / 0.275898 (0.051440)	0.366663 / 0.323480 (0.043183)	0.004052 / 0.007986 (-0.003934)	0.003423 / 0.004328 (-0.000906)	0.064576 / 0.004250 (0.060326)	0.055037 / 0.037052 (0.017985)	0.325089 / 0.258489 (0.066600)	0.379986 / 0.293841 (0.086145)	0.031614 / 0.128546 (-0.096932)	0.008553 / 0.075646 (-0.067094)	0.287430 / 0.419271 (-0.131841)	0.053032 / 0.043533 (0.009499)	0.318990 / 0.255139 (0.063851)	0.364426 / 0.283200 (0.081226)	0.024926 / 0.141683 (-0.116757)	1.461835 / 1.452155 (0.009680)	1.557172 / 1.492716 (0.064456)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212430 / 0.018006 (0.194424)	0.512891 / 0.000490 (0.512402)	0.004772 / 0.000200 (0.004572)	0.000132 / 0.000054 (0.000078)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027873 / 0.037411 (-0.009538)	0.085598 / 0.014526 (0.071072)	0.097330 / 0.176557 (-0.079226)	0.152235 / 0.737135 (-0.584900)	0.097787 / 0.296338 (-0.198552)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.384645 / 0.215209 (0.169436)	3.841161 / 2.077655 (1.763506)	1.863696 / 1.504120 (0.359577)	1.685082 / 1.541195 (0.143887)	1.772904 / 1.468490 (0.304414)	0.480177 / 4.584777 (-4.104599)	3.601537 / 3.745712 (-0.144175)	3.273647 / 5.269862 (-1.996214)	2.014415 / 4.565676 (-2.551261)	0.056668 / 0.424275 (-0.367607)	0.007257 / 0.007607 (-0.000350)	0.458194 / 0.226044 (0.232150)	4.577311 / 2.268929 (2.308382)	2.333983 / 55.444624 (-53.110641)	1.964508 / 6.876477 (-4.911969)	2.193379 / 2.142072 (0.051307)	0.577557 / 4.805227 (-4.227670)	0.133899 / 6.500664 (-6.366765)	0.060804 / 0.075469 (-0.014665)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.249490 / 1.841788 (-0.592298)	19.791875 / 8.074308 (11.717567)	14.418728 / 10.191392 (4.227336)	0.167788 / 0.680424 (-0.512636)	0.018993 / 0.534201 (-0.515208)	0.396141 / 0.579283 (-0.183142)	0.412427 / 0.434364 (-0.021937)	0.456718 / 0.540337 (-0.083619)	0.641383 / 1.386936 (-0.745553)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006546 / 0.011353 (-0.004807)	0.004059 / 0.011008 (-0.006949)	0.064523 / 0.038508 (0.026015)	0.074988 / 0.023109 (0.051878)	0.388932 / 0.275898 (0.113034)	0.424496 / 0.323480 (0.101016)	0.005226 / 0.007986 (-0.002760)	0.003409 / 0.004328 (-0.000920)	0.064284 / 0.004250 (0.060034)	0.056829 / 0.037052 (0.019777)	0.386457 / 0.258489 (0.127968)	0.428063 / 0.293841 (0.134222)	0.031411 / 0.128546 (-0.097136)	0.008577 / 0.075646 (-0.067070)	0.070357 / 0.419271 (-0.348915)	0.048920 / 0.043533 (0.005388)	0.385197 / 0.255139 (0.130058)	0.407167 / 0.283200 (0.123967)	0.024469 / 0.141683 (-0.117214)	1.482733 / 1.452155 (0.030578)	1.539027 / 1.492716 (0.046311)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227532 / 0.018006 (0.209526)	0.448792 / 0.000490 (0.448302)	0.004139 / 0.000200 (0.003939)	0.000085 / 0.000054 (0.000030)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031004 / 0.037411 (-0.006408)	0.088163 / 0.014526 (0.073637)	0.101452 / 0.176557 (-0.075105)	0.152907 / 0.737135 (-0.584229)	0.102325 / 0.296338 (-0.194014)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.418092 / 0.215209 (0.202883)	4.162277 / 2.077655 (2.084623)	2.232987 / 1.504120 (0.728867)	2.143583 / 1.541195 (0.602388)	2.246142 / 1.468490 (0.777652)	0.490181 / 4.584777 (-4.094596)	3.631514 / 3.745712 (-0.114198)	3.315025 / 5.269862 (-1.954837)	2.101853 / 4.565676 (-2.463823)	0.057905 / 0.424275 (-0.366370)	0.007686 / 0.007607 (0.000079)	0.489965 / 0.226044 (0.263921)	4.894375 / 2.268929 (2.625447)	2.655459 / 55.444624 (-52.789165)	2.262211 / 6.876477 (-4.614266)	2.505335 / 2.142072 (0.363263)	0.591329 / 4.805227 (-4.213898)	0.133554 / 6.500664 (-6.367110)	0.061922 / 0.075469 (-0.013547)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.347483 / 1.841788 (-0.494304)	20.027011 / 8.074308 (11.952703)	14.430737 / 10.191392 (4.239345)	0.165767 / 0.680424 (-0.514657)	0.018460 / 0.534201 (-0.515741)	0.393790 / 0.579283 (-0.185494)	0.407213 / 0.434364 (-0.027151)	0.474459 / 0.540337 (-0.065879)	0.635054 / 1.386936 (-0.751882)

github-actions · 2023-07-31T12:18:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007652 / 0.011353 (-0.003701)	0.004581 / 0.011008 (-0.006427)	0.101629 / 0.038508 (0.063121)	0.090233 / 0.023109 (0.067124)	0.392789 / 0.275898 (0.116891)	0.432163 / 0.323480 (0.108683)	0.004694 / 0.007986 (-0.003292)	0.003927 / 0.004328 (-0.000401)	0.076533 / 0.004250 (0.072282)	0.064442 / 0.037052 (0.027390)	0.397539 / 0.258489 (0.139050)	0.441323 / 0.293841 (0.147482)	0.036278 / 0.128546 (-0.092268)	0.009810 / 0.075646 (-0.065836)	0.343537 / 0.419271 (-0.075734)	0.060273 / 0.043533 (0.016740)	0.395023 / 0.255139 (0.139884)	0.427210 / 0.283200 (0.144011)	0.031717 / 0.141683 (-0.109966)	1.771221 / 1.452155 (0.319066)	1.896336 / 1.492716 (0.403620)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.235081 / 0.018006 (0.217075)	0.512781 / 0.000490 (0.512292)	0.004920 / 0.000200 (0.004721)	0.000097 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033525 / 0.037411 (-0.003887)	0.104416 / 0.014526 (0.089890)	0.115695 / 0.176557 (-0.060861)	0.182216 / 0.737135 (-0.554919)	0.116259 / 0.296338 (-0.180079)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.454817 / 0.215209 (0.239608)	4.527753 / 2.077655 (2.450098)	2.222273 / 1.504120 (0.718153)	2.038448 / 1.541195 (0.497253)	2.179444 / 1.468490 (0.710953)	0.573665 / 4.584777 (-4.011112)	4.504943 / 3.745712 (0.759231)	3.848435 / 5.269862 (-1.421427)	2.455185 / 4.565676 (-2.110491)	0.067985 / 0.424275 (-0.356290)	0.008719 / 0.007607 (0.001112)	0.552405 / 0.226044 (0.326360)	5.515251 / 2.268929 (3.246322)	2.851557 / 55.444624 (-52.593067)	2.463070 / 6.876477 (-4.413407)	2.761596 / 2.142072 (0.619524)	0.688561 / 4.805227 (-4.116667)	0.159946 / 6.500664 (-6.340718)	0.075435 / 0.075469 (-0.000034)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.505178 / 1.841788 (-0.336610)	23.555236 / 8.074308 (15.480928)	17.272759 / 10.191392 (7.081367)	0.206495 / 0.680424 (-0.473928)	0.021869 / 0.534201 (-0.512332)	0.469271 / 0.579283 (-0.110012)	0.469200 / 0.434364 (0.034837)	0.542437 / 0.540337 (0.002100)	0.792864 / 1.386936 (-0.594072)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008151 / 0.011353 (-0.003202)	0.004992 / 0.011008 (-0.006016)	0.079545 / 0.038508 (0.041037)	0.100234 / 0.023109 (0.077125)	0.492791 / 0.275898 (0.216893)	0.511315 / 0.323480 (0.187835)	0.006878 / 0.007986 (-0.001108)	0.003807 / 0.004328 (-0.000522)	0.080876 / 0.004250 (0.076625)	0.076734 / 0.037052 (0.039681)	0.518247 / 0.258489 (0.259758)	0.524202 / 0.293841 (0.230361)	0.039896 / 0.128546 (-0.088650)	0.016581 / 0.075646 (-0.059065)	0.101228 / 0.419271 (-0.318043)	0.061990 / 0.043533 (0.018457)	0.490611 / 0.255139 (0.235472)	0.514930 / 0.283200 (0.231730)	0.028680 / 0.141683 (-0.113002)	1.966215 / 1.452155 (0.514061)	2.047757 / 1.492716 (0.555040)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.286807 / 0.018006 (0.268801)	0.506448 / 0.000490 (0.505959)	0.005867 / 0.000200 (0.005667)	0.000110 / 0.000054 (0.000056)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037141 / 0.037411 (-0.000270)	0.113232 / 0.014526 (0.098706)	0.121201 / 0.176557 (-0.055356)	0.185472 / 0.737135 (-0.551663)	0.122896 / 0.296338 (-0.173442)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.514491 / 0.215209 (0.299282)	4.942457 / 2.077655 (2.864802)	2.533519 / 1.504120 (1.029399)	2.371011 / 1.541195 (0.829817)	2.495604 / 1.468490 (1.027114)	0.576224 / 4.584777 (-4.008553)	4.368584 / 3.745712 (0.622872)	3.885598 / 5.269862 (-1.384263)	2.443596 / 4.565676 (-2.122080)	0.068905 / 0.424275 (-0.355371)	0.009171 / 0.007607 (0.001564)	0.584977 / 0.226044 (0.358932)	5.835220 / 2.268929 (3.566291)	3.189037 / 55.444624 (-52.255588)	2.753228 / 6.876477 (-4.123249)	3.009062 / 2.142072 (0.866990)	0.690179 / 4.805227 (-4.115048)	0.157981 / 6.500664 (-6.342683)	0.074518 / 0.075469 (-0.000951)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.599907 / 1.841788 (-0.241880)	23.853903 / 8.074308 (15.779595)	17.419796 / 10.191392 (7.228404)	0.204974 / 0.680424 (-0.475450)	0.022014 / 0.534201 (-0.512187)	0.473379 / 0.579283 (-0.105905)	0.461346 / 0.434364 (0.026982)	0.564881 / 0.540337 (0.024543)	0.752933 / 1.386936 (-0.634003)

github-actions · 2023-08-01T09:21:06Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006547 / 0.011353 (-0.004805)	0.004020 / 0.011008 (-0.006988)	0.086828 / 0.038508 (0.048320)	0.072924 / 0.023109 (0.049815)	0.312847 / 0.275898 (0.036949)	0.344605 / 0.323480 (0.021125)	0.004117 / 0.007986 (-0.003868)	0.004365 / 0.004328 (0.000037)	0.066755 / 0.004250 (0.062505)	0.053248 / 0.037052 (0.016195)	0.315744 / 0.258489 (0.057255)	0.362426 / 0.293841 (0.068585)	0.030732 / 0.128546 (-0.097814)	0.008516 / 0.075646 (-0.067130)	0.289927 / 0.419271 (-0.129345)	0.052115 / 0.043533 (0.008582)	0.308026 / 0.255139 (0.052887)	0.343115 / 0.283200 (0.059915)	0.024131 / 0.141683 (-0.117551)	1.464290 / 1.452155 (0.012135)	1.559359 / 1.492716 (0.066642)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.216744 / 0.018006 (0.198738)	0.473156 / 0.000490 (0.472666)	0.004176 / 0.000200 (0.003977)	0.000093 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028500 / 0.037411 (-0.008911)	0.083892 / 0.014526 (0.069366)	0.131851 / 0.176557 (-0.044705)	0.162202 / 0.737135 (-0.574933)	0.127989 / 0.296338 (-0.168349)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.404555 / 0.215209 (0.189346)	4.035989 / 2.077655 (1.958334)	2.025174 / 1.504120 (0.521054)	1.835785 / 1.541195 (0.294590)	1.909819 / 1.468490 (0.441329)	0.475352 / 4.584777 (-4.109425)	3.548055 / 3.745712 (-0.197657)	3.234782 / 5.269862 (-2.035080)	2.010305 / 4.565676 (-2.555371)	0.056507 / 0.424275 (-0.367768)	0.007259 / 0.007607 (-0.000348)	0.482021 / 0.226044 (0.255977)	4.818559 / 2.268929 (2.549631)	2.528765 / 55.444624 (-52.915860)	2.159804 / 6.876477 (-4.716673)	2.380640 / 2.142072 (0.238567)	0.585005 / 4.805227 (-4.220222)	0.133811 / 6.500664 (-6.366853)	0.060686 / 0.075469 (-0.014783)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.260902 / 1.841788 (-0.580886)	19.500215 / 8.074308 (11.425907)	14.164698 / 10.191392 (3.973306)	0.172492 / 0.680424 (-0.507932)	0.018221 / 0.534201 (-0.515980)	0.392609 / 0.579283 (-0.186674)	0.423265 / 0.434364 (-0.011099)	0.454705 / 0.540337 (-0.085633)	0.639856 / 1.386936 (-0.747080)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006656 / 0.011353 (-0.004697)	0.003903 / 0.011008 (-0.007106)	0.063780 / 0.038508 (0.025272)	0.076848 / 0.023109 (0.053739)	0.379429 / 0.275898 (0.103531)	0.442554 / 0.323480 (0.119074)	0.005327 / 0.007986 (-0.002658)	0.003318 / 0.004328 (-0.001010)	0.064307 / 0.004250 (0.060056)	0.057183 / 0.037052 (0.020131)	0.398163 / 0.258489 (0.139674)	0.448532 / 0.293841 (0.154691)	0.031322 / 0.128546 (-0.097224)	0.008462 / 0.075646 (-0.067184)	0.070354 / 0.419271 (-0.348917)	0.048420 / 0.043533 (0.004887)	0.368304 / 0.255139 (0.113165)	0.428786 / 0.283200 (0.145587)	0.023921 / 0.141683 (-0.117762)	1.499281 / 1.452155 (0.047126)	1.554448 / 1.492716 (0.061731)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238830 / 0.018006 (0.220824)	0.464196 / 0.000490 (0.463706)	0.004812 / 0.000200 (0.004613)	0.000098 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031642 / 0.037411 (-0.005770)	0.089205 / 0.014526 (0.074679)	0.101577 / 0.176557 (-0.074980)	0.154993 / 0.737135 (-0.582142)	0.102935 / 0.296338 (-0.193403)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.415218 / 0.215209 (0.200009)	4.137711 / 2.077655 (2.060056)	2.128757 / 1.504120 (0.624637)	1.961086 / 1.541195 (0.419891)	2.047552 / 1.468490 (0.579061)	0.486953 / 4.584777 (-4.097824)	3.587851 / 3.745712 (-0.157861)	3.280771 / 5.269862 (-1.989090)	2.016980 / 4.565676 (-2.548697)	0.057284 / 0.424275 (-0.366991)	0.007705 / 0.007607 (0.000097)	0.492242 / 0.226044 (0.266197)	4.923213 / 2.268929 (2.654285)	2.672528 / 55.444624 (-52.772097)	2.292862 / 6.876477 (-4.583614)	2.517410 / 2.142072 (0.375337)	0.614798 / 4.805227 (-4.190429)	0.149642 / 6.500664 (-6.351023)	0.062898 / 0.075469 (-0.012571)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.323266 / 1.841788 (-0.518522)	19.891504 / 8.074308 (11.817196)	14.115069 / 10.191392 (3.923677)	0.169859 / 0.680424 (-0.510564)	0.018538 / 0.534201 (-0.515663)	0.398456 / 0.579283 (-0.180827)	0.410111 / 0.434364 (-0.024253)	0.483198 / 0.540337 (-0.057139)	0.639283 / 1.386936 (-0.747653)

github-actions · 2023-08-01T10:48:51Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007731 / 0.011353 (-0.003622)	0.004064 / 0.011008 (-0.006944)	0.095261 / 0.038508 (0.056753)	0.081594 / 0.023109 (0.058485)	0.390413 / 0.275898 (0.114515)	0.415542 / 0.323480 (0.092063)	0.006031 / 0.007986 (-0.001954)	0.003817 / 0.004328 (-0.000512)	0.066381 / 0.004250 (0.062131)	0.058262 / 0.037052 (0.021210)	0.383626 / 0.258489 (0.125137)	0.443237 / 0.293841 (0.149396)	0.034358 / 0.128546 (-0.094188)	0.010002 / 0.075646 (-0.065644)	0.317472 / 0.419271 (-0.101800)	0.057428 / 0.043533 (0.013895)	0.393929 / 0.255139 (0.138790)	0.444572 / 0.283200 (0.161373)	0.026295 / 0.141683 (-0.115388)	1.603639 / 1.452155 (0.151484)	1.707750 / 1.492716 (0.215034)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222171 / 0.018006 (0.204165)	0.491762 / 0.000490 (0.491272)	0.003389 / 0.000200 (0.003189)	0.000090 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029420 / 0.037411 (-0.007991)	0.086201 / 0.014526 (0.071676)	0.100150 / 0.176557 (-0.076406)	0.162338 / 0.737135 (-0.574797)	0.099349 / 0.296338 (-0.196989)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.445976 / 0.215209 (0.230767)	4.460197 / 2.077655 (2.382542)	2.211767 / 1.504120 (0.707647)	1.988740 / 1.541195 (0.447545)	2.052289 / 1.468490 (0.583799)	0.570321 / 4.584777 (-4.014456)	4.148777 / 3.745712 (0.403065)	3.750977 / 5.269862 (-1.518885)	2.309443 / 4.565676 (-2.256234)	0.064552 / 0.424275 (-0.359724)	0.008167 / 0.007607 (0.000560)	0.523283 / 0.226044 (0.297238)	5.349347 / 2.268929 (3.080419)	2.710292 / 55.444624 (-52.734332)	2.344252 / 6.876477 (-4.532225)	2.549903 / 2.142072 (0.407831)	0.665942 / 4.805227 (-4.139285)	0.154108 / 6.500664 (-6.346556)	0.070181 / 0.075469 (-0.005289)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.455733 / 1.841788 (-0.386054)	21.846958 / 8.074308 (13.772650)	15.133865 / 10.191392 (4.942473)	0.199009 / 0.680424 (-0.481415)	0.021299 / 0.534201 (-0.512902)	0.421555 / 0.579283 (-0.157729)	0.437639 / 0.434364 (0.003275)	0.498568 / 0.540337 (-0.041769)	0.719649 / 1.386936 (-0.667287)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007858 / 0.011353 (-0.003495)	0.004629 / 0.011008 (-0.006380)	0.075701 / 0.038508 (0.037193)	0.084425 / 0.023109 (0.061316)	0.436650 / 0.275898 (0.160752)	0.466046 / 0.323480 (0.142566)	0.006042 / 0.007986 (-0.001944)	0.003834 / 0.004328 (-0.000495)	0.074729 / 0.004250 (0.070478)	0.065983 / 0.037052 (0.028931)	0.447239 / 0.258489 (0.188750)	0.466728 / 0.293841 (0.172887)	0.035814 / 0.128546 (-0.092733)	0.009919 / 0.075646 (-0.065727)	0.081151 / 0.419271 (-0.338120)	0.057256 / 0.043533 (0.013723)	0.435609 / 0.255139 (0.180470)	0.448901 / 0.283200 (0.165701)	0.026325 / 0.141683 (-0.115357)	1.745658 / 1.452155 (0.293503)	1.804137 / 1.492716 (0.311421)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.302551 / 0.018006 (0.284544)	0.498438 / 0.000490 (0.497948)	0.038562 / 0.000200 (0.038362)	0.000411 / 0.000054 (0.000356)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035573 / 0.037411 (-0.001839)	0.104957 / 0.014526 (0.090431)	0.117208 / 0.176557 (-0.059349)	0.178935 / 0.737135 (-0.558200)	0.124577 / 0.296338 (-0.171761)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.467076 / 0.215209 (0.251867)	4.698852 / 2.077655 (2.621197)	2.453389 / 1.504120 (0.949269)	2.257378 / 1.541195 (0.716183)	2.338615 / 1.468490 (0.870125)	0.542379 / 4.584777 (-4.042398)	4.066895 / 3.745712 (0.321183)	3.689540 / 5.269862 (-1.580321)	2.268997 / 4.565676 (-2.296679)	0.064754 / 0.424275 (-0.359521)	0.008866 / 0.007607 (0.001259)	0.546732 / 0.226044 (0.320687)	5.487765 / 2.268929 (3.218836)	2.974126 / 55.444624 (-52.470498)	2.585492 / 6.876477 (-4.290985)	2.754417 / 2.142072 (0.612345)	0.652045 / 4.805227 (-4.153183)	0.145597 / 6.500664 (-6.355067)	0.065415 / 0.075469 (-0.010054)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.553970 / 1.841788 (-0.287818)	22.300954 / 8.074308 (14.226646)	15.640990 / 10.191392 (5.449598)	0.170903 / 0.680424 (-0.509521)	0.021750 / 0.534201 (-0.512451)	0.455316 / 0.579283 (-0.123967)	0.455051 / 0.434364 (0.020687)	0.536174 / 0.540337 (-0.004164)	0.735930 / 1.386936 (-0.651006)

* Refactor mock_fs
* Test resolve_pattern for fs
* Test filesystem with tuple protocol
* Fix resolve_pattern for tuple protocol

albertvillanova added 2 commits July 31, 2023 13:42

Refactor mock_fs

7f57511

Test resolve_pattern for fs

2c927ee

Test filesystem with tuple protocol

f49c9ca

Fix resolve_pattern for tuple protocol

01e2194

albertvillanova marked this pull request as ready for review August 1, 2023 09:23

albertvillanova merged commit f681398 into main Aug 1, 2023
13 checks passed

albertvillanova deleted the fix-6100 branch August 1, 2023 10:38

albertvillanova added a commit that referenced this pull request Aug 3, 2023

Fix error when loading from GCP bucket (#6105)

710ce02

* Refactor mock_fs
* Test resolve_pattern for fs
* Test filesystem with tuple protocol
* Fix resolve_pattern for tuple protocol

albertvillanova mentioned this pull request Aug 3, 2023

Update datasets 2.14.3 huggingface/dataset-viewer#1614

Merged
