Validate non-empty data_files #5802

albertvillanova · 2023-04-27T09:51:36Z

This PR validates data_files: the value must be either None (the default) or a non-empty str, list, or dict.

See: #5787 (comment)
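
For illustration only, the rule described above could be sketched as follows. This is not the code merged in this PR: the helper name `validate_data_files`, the `DataFiles` alias, and the error messages are assumptions made for this example, and rejecting other container types (such as tuples) with a `TypeError` is likewise an assumption.

```python
from typing import Dict, List, Optional, Union

# Hypothetical alias for the accepted shapes of data_files (an assumption for this sketch).
DataFiles = Union[str, List[str], Dict[str, Union[str, List[str]]]]


def validate_data_files(data_files: Optional[DataFiles]) -> Optional[DataFiles]:
    """Hypothetical helper: accept None or a non-empty str, list, or dict."""
    # None is the default and means "no explicit data files were given".
    if data_files is None:
        return None
    # Only the three types named in the PR description are accepted here.
    if not isinstance(data_files, (str, list, dict)):
        raise TypeError(
            f"data_files must be a str, list, dict, or None, got {type(data_files).__name__}"
        )
    # An empty str, list, or dict is falsy, so one truthiness check covers all three.
    if not data_files:
        raise ValueError(
            f"Empty data_files: {data_files!r}. It should be either non-empty or None (default)."
        )
    return data_files
```

With this sketch, `validate_data_files(None)` and `validate_data_files("train.csv")` return their argument unchanged, while `""`, `[]`, and `{}` each raise `ValueError`.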

HuggingFaceDocBuilderDev · 2023-04-27T12:19:14Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-04-27T14:59:47Z

Show benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007818 / 0.011353 (-0.003535)	0.005456 / 0.011008 (-0.005552)	0.114685 / 0.038508 (0.076177)	0.038398 / 0.023109 (0.015289)	0.351289 / 0.275898 (0.075391)	0.389170 / 0.323480 (0.065690)	0.006213 / 0.007986 (-0.001773)	0.005796 / 0.004328 (0.001467)	0.085315 / 0.004250 (0.081065)	0.049251 / 0.037052 (0.012198)	0.368119 / 0.258489 (0.109630)	0.394725 / 0.293841 (0.100884)	0.040390 / 0.128546 (-0.088157)	0.014076 / 0.075646 (-0.061570)	0.393771 / 0.419271 (-0.025500)	0.058929 / 0.043533 (0.015397)	0.349526 / 0.255139 (0.094387)	0.378409 / 0.283200 (0.095210)	0.114354 / 0.141683 (-0.027329)	1.749244 / 1.452155 (0.297089)	1.847946 / 1.492716 (0.355229)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.241648 / 0.018006 (0.223641)	0.468419 / 0.000490 (0.467929)	0.004311 / 0.000200 (0.004111)	0.000091 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029978 / 0.037411 (-0.007433)	0.121832 / 0.014526 (0.107306)	0.133516 / 0.176557 (-0.043041)	0.199174 / 0.737135 (-0.537961)	0.138181 / 0.296338 (-0.158158)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.478346 / 0.215209 (0.263137)	4.723967 / 2.077655 (2.646312)	2.107724 / 1.504120 (0.603604)	1.874810 / 1.541195 (0.333615)	1.911568 / 1.468490 (0.443078)	0.800966 / 4.584777 (-3.783811)	4.399032 / 3.745712 (0.653320)	2.346160 / 5.269862 (-2.923702)	1.506673 / 4.565676 (-3.059004)	0.099119 / 0.424275 (-0.325156)	0.014055 / 0.007607 (0.006448)	0.582419 / 0.226044 (0.356375)	5.789147 / 2.268929 (3.520218)	2.632443 / 55.444624 (-52.812182)	2.217630 / 6.876477 (-4.658846)	2.337709 / 2.142072 (0.195637)	0.995345 / 4.805227 (-3.809882)	0.200040 / 6.500664 (-6.300624)	0.076855 / 0.075469 (0.001386)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.386104 / 1.841788 (-0.455683)	17.109772 / 8.074308 (9.035464)	16.147612 / 10.191392 (5.956220)	0.162846 / 0.680424 (-0.517577)	0.020692 / 0.534201 (-0.513509)	0.495752 / 0.579283 (-0.083531)	0.475715 / 0.434364 (0.041351)	0.619826 / 0.540337 (0.079488)	0.720745 / 1.386936 (-0.666191)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008255 / 0.011353 (-0.003098)	0.006118 / 0.011008 (-0.004890)	0.088004 / 0.038508 (0.049496)	0.039225 / 0.023109 (0.016116)	0.399290 / 0.275898 (0.123392)	0.432272 / 0.323480 (0.108792)	0.007382 / 0.007986 (-0.000603)	0.004576 / 0.004328 (0.000248)	0.086511 / 0.004250 (0.082260)	0.050472 / 0.037052 (0.013420)	0.404160 / 0.258489 (0.145671)	0.445356 / 0.293841 (0.151515)	0.041549 / 0.128546 (-0.086997)	0.014148 / 0.075646 (-0.061498)	0.101697 / 0.419271 (-0.317574)	0.057474 / 0.043533 (0.013941)	0.395093 / 0.255139 (0.139954)	0.418613 / 0.283200 (0.135414)	0.123217 / 0.141683 (-0.018466)	1.726146 / 1.452155 (0.273991)	1.852746 / 1.492716 (0.360029)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.256876 / 0.018006 (0.238870)	0.476336 / 0.000490 (0.475846)	0.000465 / 0.000200 (0.000265)	0.000068 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034304 / 0.037411 (-0.003107)	0.132617 / 0.014526 (0.118091)	0.141712 / 0.176557 (-0.034845)	0.198101 / 0.737135 (-0.539034)	0.150877 / 0.296338 (-0.145461)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.504717 / 0.215209 (0.289508)	5.035060 / 2.077655 (2.957405)	2.494812 / 1.504120 (0.990692)	2.306601 / 1.541195 (0.765406)	2.481860 / 1.468490 (1.013370)	0.826041 / 4.584777 (-3.758736)	4.414748 / 3.745712 (0.669036)	2.417899 / 5.269862 (-2.851963)	1.574548 / 4.565676 (-2.991128)	0.101712 / 0.424275 (-0.322563)	0.014388 / 0.007607 (0.006781)	0.616674 / 0.226044 (0.390630)	6.180382 / 2.268929 (3.911453)	2.969110 / 55.444624 (-52.475514)	2.574383 / 6.876477 (-4.302094)	2.711008 / 2.142072 (0.568935)	0.997679 / 4.805227 (-3.807548)	0.201241 / 6.500664 (-6.299423)	0.076132 / 0.075469 (0.000663)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.542704 / 1.841788 (-0.299084)	17.610700 / 8.074308 (9.536392)	16.152973 / 10.191392 (5.961581)	0.166040 / 0.680424 (-0.514384)	0.020286 / 0.534201 (-0.513915)	0.506724 / 0.579283 (-0.072559)	0.484348 / 0.434364 (0.049984)	0.606524 / 0.540337 (0.066187)	0.734997 / 1.386936 (-0.651939)

Validate non-empty data_files

3b5f8e9

lhoestq approved these changes Apr 27, 2023

albertvillanova mentioned this pull request Apr 27, 2023

Fix inferring module for unsupported data files #5787

Merged

albertvillanova merged commit a200ec9 into huggingface:main Apr 27, 2023
12 checks passed

albertvillanova deleted the validate-non-empty-data-files branch April 27, 2023 14:51
