Call fs.makedirs in save_to_disk #5779

lhoestq · 2023-04-21T15:04:28Z

We need to call fs.makedirs when saving a dataset with save_to_disk, because some filesystem implementations have actual directories that must exist before files can be written to them (S3 and some others don't).

Close #5775
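
For illustration only, here is a minimal sketch of the pattern described above: create the destination directory before writing any file into it. It is not the code from this PR. `LocalDirFS` and `save_shard` are hypothetical names introduced for this example, and the file name is made up. The `fs` argument is assumed to expose fsspec-style `makedirs(path, exist_ok=...)` and `open(path, mode)` methods, so an fsspec filesystem object could presumably be passed instead of the stand-in.

```python
import os
import tempfile


class LocalDirFS:
    """Hypothetical stand-in exposing the two fsspec-style methods used below."""

    def makedirs(self, path, exist_ok=False):
        os.makedirs(path, exist_ok=exist_ok)

    def open(self, path, mode="rb"):
        return open(path, mode)


def save_shard(fs, dataset_path, filename, payload):
    """Write `payload` to `dataset_path/filename`, creating the directory first."""
    # On filesystems with real directories, opening a file inside a missing
    # directory fails, so the directory is created up front. exist_ok=True
    # keeps the call harmless when the directory is already there.
    fs.makedirs(dataset_path, exist_ok=True)
    target = f"{dataset_path}/{filename}"
    with fs.open(target, "wb") as f:
        f.write(payload)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # The nested directory does not exist yet when save_shard is called.
    out_dir = os.path.join(tmp, "my_dataset", "train")
    written = save_shard(LocalDirFS(), out_dir, "data-00000.arrow", b"example bytes")
    file_was_written = os.path.exists(written)
```

Without the `makedirs` call, the `open` in this sketch would raise `FileNotFoundError` on the local filesystem, which is the kind of failure the description refers to.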

HuggingFaceDocBuilderDev · 2023-04-21T15:08:34Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-04-21T15:10:31Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007490 / 0.011353 (-0.003862)	0.004957 / 0.011008 (-0.006051)	0.096952 / 0.038508 (0.058444)	0.034125 / 0.023109 (0.011016)	0.301926 / 0.275898 (0.026028)	0.330538 / 0.323480 (0.007058)	0.005999 / 0.007986 (-0.001987)	0.003948 / 0.004328 (-0.000380)	0.073024 / 0.004250 (0.068773)	0.050020 / 0.037052 (0.012967)	0.299987 / 0.258489 (0.041498)	0.336077 / 0.293841 (0.042237)	0.035781 / 0.128546 (-0.092765)	0.012159 / 0.075646 (-0.063487)	0.333311 / 0.419271 (-0.085960)	0.059925 / 0.043533 (0.016392)	0.297772 / 0.255139 (0.042633)	0.313447 / 0.283200 (0.030247)	0.100991 / 0.141683 (-0.040692)	1.472182 / 1.452155 (0.020027)	1.553010 / 1.492716 (0.060294)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.214222 / 0.018006 (0.196216)	0.441579 / 0.000490 (0.441090)	0.001030 / 0.000200 (0.000830)	0.000194 / 0.000054 (0.000140)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026149 / 0.037411 (-0.011262)	0.107324 / 0.014526 (0.092798)	0.113390 / 0.176557 (-0.063167)	0.170282 / 0.737135 (-0.566854)	0.120601 / 0.296338 (-0.175737)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.411795 / 0.215209 (0.196585)	4.091412 / 2.077655 (2.013757)	1.819597 / 1.504120 (0.315477)	1.623413 / 1.541195 (0.082218)	1.658959 / 1.468490 (0.190469)	0.697671 / 4.584777 (-3.887106)	3.868855 / 3.745712 (0.123143)	3.220448 / 5.269862 (-2.049414)	1.796472 / 4.565676 (-2.769204)	0.085817 / 0.424275 (-0.338458)	0.012422 / 0.007607 (0.004815)	0.520302 / 0.226044 (0.294258)	5.062477 / 2.268929 (2.793548)	2.275065 / 55.444624 (-53.169560)	1.936717 / 6.876477 (-4.939759)	2.069924 / 2.142072 (-0.072148)	0.838964 / 4.805227 (-3.966264)	0.170632 / 6.500664 (-6.330032)	0.066011 / 0.075469 (-0.009458)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.190673 / 1.841788 (-0.651114)	14.679478 / 8.074308 (6.605169)	14.099743 / 10.191392 (3.908351)	0.142556 / 0.680424 (-0.537868)	0.017601 / 0.534201 (-0.516600)	0.421301 / 0.579283 (-0.157982)	0.418035 / 0.434364 (-0.016329)	0.503799 / 0.540337 (-0.036539)	0.588809 / 1.386936 (-0.798127)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007556 / 0.011353 (-0.003797)	0.005283 / 0.011008 (-0.005725)	0.075616 / 0.038508 (0.037107)	0.034127 / 0.023109 (0.011018)	0.345145 / 0.275898 (0.069247)	0.377490 / 0.323480 (0.054010)	0.006532 / 0.007986 (-0.001454)	0.004145 / 0.004328 (-0.000183)	0.074724 / 0.004250 (0.070473)	0.048658 / 0.037052 (0.011605)	0.339989 / 0.258489 (0.081500)	0.398240 / 0.293841 (0.104399)	0.037433 / 0.128546 (-0.091114)	0.012410 / 0.075646 (-0.063237)	0.088110 / 0.419271 (-0.331162)	0.050635 / 0.043533 (0.007103)	0.351878 / 0.255139 (0.096739)	0.365707 / 0.283200 (0.082508)	0.104342 / 0.141683 (-0.037341)	1.438009 / 1.452155 (-0.014145)	1.533616 / 1.492716 (0.040900)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.225570 / 0.018006 (0.207563)	0.442482 / 0.000490 (0.441992)	0.000402 / 0.000200 (0.000202)	0.000063 / 0.000054 (0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030348 / 0.037411 (-0.007063)	0.111402 / 0.014526 (0.096877)	0.123365 / 0.176557 (-0.053192)	0.175604 / 0.737135 (-0.561531)	0.128458 / 0.296338 (-0.167881)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426054 / 0.215209 (0.210845)	4.255050 / 2.077655 (2.177395)	2.039568 / 1.504120 (0.535448)	1.856842 / 1.541195 (0.315647)	1.923792 / 1.468490 (0.455301)	0.701023 / 4.584777 (-3.883754)	3.746632 / 3.745712 (0.000920)	2.055563 / 5.269862 (-3.214298)	1.308068 / 4.565676 (-3.257608)	0.085524 / 0.424275 (-0.338751)	0.012103 / 0.007607 (0.004496)	0.522929 / 0.226044 (0.296885)	5.258133 / 2.268929 (2.989205)	2.458440 / 55.444624 (-52.986185)	2.141681 / 6.876477 (-4.734796)	2.258667 / 2.142072 (0.116595)	0.842533 / 4.805227 (-3.962694)	0.168089 / 6.500664 (-6.332575)	0.063707 / 0.075469 (-0.011762)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.312252 / 1.841788 (-0.529536)	14.939185 / 8.074308 (6.864877)	14.479845 / 10.191392 (4.288453)	0.162557 / 0.680424 (-0.517867)	0.017660 / 0.534201 (-0.516541)	0.423261 / 0.579283 (-0.156023)	0.417693 / 0.434364 (-0.016671)	0.495440 / 0.540337 (-0.044897)	0.589932 / 1.386936 (-0.797004)

github-actions · 2023-04-26T12:20:01Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008796 / 0.011353 (-0.002557)	0.005828 / 0.011008 (-0.005180)	0.118629 / 0.038508 (0.080121)	0.042435 / 0.023109 (0.019326)	0.383780 / 0.275898 (0.107882)	0.420344 / 0.323480 (0.096864)	0.006855 / 0.007986 (-0.001130)	0.006290 / 0.004328 (0.001962)	0.087160 / 0.004250 (0.082910)	0.057568 / 0.037052 (0.020516)	0.378761 / 0.258489 (0.120272)	0.426496 / 0.293841 (0.132655)	0.041772 / 0.128546 (-0.086774)	0.014226 / 0.075646 (-0.061420)	0.400097 / 0.419271 (-0.019174)	0.060402 / 0.043533 (0.016870)	0.381955 / 0.255139 (0.126816)	0.399110 / 0.283200 (0.115911)	0.124608 / 0.141683 (-0.017075)	1.737856 / 1.452155 (0.285702)	1.829034 / 1.492716 (0.336318)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219941 / 0.018006 (0.201934)	0.497156 / 0.000490 (0.496666)	0.005094 / 0.000200 (0.004894)	0.000097 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032144 / 0.037411 (-0.005268)	0.131782 / 0.014526 (0.117256)	0.141543 / 0.176557 (-0.035014)	0.211419 / 0.737135 (-0.525716)	0.147338 / 0.296338 (-0.149001)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.478345 / 0.215209 (0.263136)	4.749506 / 2.077655 (2.671851)	2.195794 / 1.504120 (0.691674)	1.978126 / 1.541195 (0.436932)	2.059941 / 1.468490 (0.591451)	0.821959 / 4.584777 (-3.762818)	5.737479 / 3.745712 (1.991767)	2.507125 / 5.269862 (-2.762737)	2.051772 / 4.565676 (-2.513905)	0.100619 / 0.424275 (-0.323656)	0.014437 / 0.007607 (0.006830)	0.599484 / 0.226044 (0.373440)	5.977579 / 2.268929 (3.708651)	2.708143 / 55.444624 (-52.736482)	2.320279 / 6.876477 (-4.556198)	2.510172 / 2.142072 (0.368100)	1.006279 / 4.805227 (-3.798948)	0.199812 / 6.500664 (-6.300853)	0.077967 / 0.075469 (0.002498)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.510171 / 1.841788 (-0.331616)	21.099446 / 8.074308 (13.025138)	17.634225 / 10.191392 (7.442833)	0.223506 / 0.680424 (-0.456918)	0.023845 / 0.534201 (-0.510356)	0.613489 / 0.579283 (0.034206)	0.685735 / 0.434364 (0.251371)	0.652485 / 0.540337 (0.112148)	0.734756 / 1.386936 (-0.652180)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008444 / 0.011353 (-0.002909)	0.005789 / 0.011008 (-0.005220)	0.088297 / 0.038508 (0.049789)	0.040847 / 0.023109 (0.017737)	0.411748 / 0.275898 (0.135850)	0.452320 / 0.323480 (0.128841)	0.006689 / 0.007986 (-0.001296)	0.006029 / 0.004328 (0.001701)	0.086080 / 0.004250 (0.081830)	0.053310 / 0.037052 (0.016257)	0.402568 / 0.258489 (0.144079)	0.459047 / 0.293841 (0.165206)	0.041203 / 0.128546 (-0.087343)	0.014216 / 0.075646 (-0.061431)	0.102729 / 0.419271 (-0.316543)	0.057170 / 0.043533 (0.013637)	0.407137 / 0.255139 (0.151998)	0.429703 / 0.283200 (0.146503)	0.123528 / 0.141683 (-0.018155)	1.690026 / 1.452155 (0.237872)	1.797793 / 1.492716 (0.305077)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.264581 / 0.018006 (0.246575)	0.498981 / 0.000490 (0.498492)	0.000462 / 0.000200 (0.000262)	0.000096 / 0.000054 (0.000041)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034613 / 0.037411 (-0.002798)	0.136596 / 0.014526 (0.122070)	0.142183 / 0.176557 (-0.034374)	0.201816 / 0.737135 (-0.535320)	0.148843 / 0.296338 (-0.147496)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.506708 / 0.215209 (0.291499)	5.042829 / 2.077655 (2.965175)	2.448414 / 1.504120 (0.944295)	2.213251 / 1.541195 (0.672056)	2.255805 / 1.468490 (0.787315)	0.829929 / 4.584777 (-3.754848)	5.145717 / 3.745712 (1.400004)	2.493947 / 5.269862 (-2.775915)	1.676171 / 4.565676 (-2.889506)	0.102097 / 0.424275 (-0.322178)	0.014545 / 0.007607 (0.006938)	0.635473 / 0.226044 (0.409429)	6.306767 / 2.268929 (4.037839)	3.050284 / 55.444624 (-52.394341)	2.653175 / 6.876477 (-4.223302)	2.850569 / 2.142072 (0.708496)	1.355280 / 4.805227 (-3.449947)	0.248112 / 6.500664 (-6.252552)	0.091993 / 0.075469 (0.016524)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.837509 / 1.841788 (-0.004279)	21.268838 / 8.074308 (13.194530)	17.338053 / 10.191392 (7.146660)	0.232263 / 0.680424 (-0.448161)	0.029093 / 0.534201 (-0.505108)	0.651056 / 0.579283 (0.071773)	0.617623 / 0.434364 (0.183259)	0.773921 / 0.540337 (0.233584)	0.705118 / 1.386936 (-0.681818)

call fs.makedirs in save_to_disk

4e3c865

lhoestq requested a review from polinaeterna April 21, 2023 15:05

lhoestq merged commit 35846fd into main Apr 26, 2023
13 checks passed

lhoestq deleted the save_to_disk-fs-makedirs branch April 26, 2023 12:11
