Fix inferring module for unsupported data files #5787

albertvillanova · 2023-04-24T10:44:50Z

When the module cannot be inferred because the data files are unsupported, this PR raises a FileNotFoundError instead:

FileNotFoundError: No (supported) data files or dataset script found in <dataset_name>

Fix #5785.
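
To illustrate the intended behavior, here is a minimal, self-contained sketch. It is not the code in src/datasets/load.py: the `infer_module` helper and the `SUPPORTED` extension table are hypothetical stand-ins, and only the error message is taken from this PR.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-module table (an assumption for this sketch);
# the library keeps its own, larger mapping.
SUPPORTED = {".csv": "csv", ".json": "json", ".parquet": "parquet", ".txt": "text"}


def infer_module(dataset_name, filenames):
    """Pick the most common supported module among the given file names."""
    counts = Counter(
        SUPPORTED[ext]
        for ext in (PurePosixPath(name).suffix.lower() for name in filenames)
        if ext in SUPPORTED
    )
    if not counts:
        # No file has a supported extension: fail with a clear, specific error.
        raise FileNotFoundError(
            f"No (supported) data files or dataset script found in {dataset_name}"
        )
    return counts.most_common(1)[0][0]
```

With this sketch, a directory that only contains files such as `a.xyz` fails early with the message above rather than with an unrelated error further down.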

HuggingFaceDocBuilderDev · 2023-04-24T10:48:42Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Thanks!

lhoestq · 2023-04-26T15:53:11Z

src/datasets/load.py

-        if len(set(list(zip(*module_names.values()))[0])) > 1:
-            raise ValueError(f"Couldn't infer the same data file format for all splits. Got {module_names}")
-        module_name, builder_kwargs = next(iter(module_names.values()))
+        module_name, builder_kwargs = next(iter(split_modules.values()))
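
For readers unfamiliar with the removed expression, the sketch below shows what it computes. The `formats_differ` wrapper and the sample values are illustrative assumptions; the only thing taken from the diff is the expression itself and the fact that each value is a `(module_name, builder_kwargs)` pair.

```python
def formats_differ(module_names):
    """Return True if the splits do not all share the same module name."""
    # zip(*values) transposes [(name, kwargs), ...] into (names...), (kwargs...);
    # index 0 selects the tuple of module names, and set() deduplicates them.
    return len(set(list(zip(*module_names.values()))[0])) > 1


# Illustrative inputs, not taken from the library.
mixed = {"train": ("csv", {}), "test": ("json", {})}
uniform = {"train": ("csv", {}), "test": ("csv", {})}
```

Here `formats_differ(mixed)` is true and `formats_differ(uniform)` is false. Note that the expression raises an IndexError on an empty dict, because the transposed list has no element 0.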


Maybe check that split_modules is not empty?
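
As a sketch of the concern: calling `next(iter(...))` on the values of an empty dict raises a bare StopIteration. The `first_module` helper and its ValueError are hypothetical, shown only to make the failure mode concrete; they are not part of this PR.

```python
def first_module(split_modules):
    """Return the (module_name, builder_kwargs) pair of the first split."""
    if not split_modules:
        # Hypothetical guard; the error type and message are illustrative only.
        raise ValueError("split_modules is empty")
    return next(iter(split_modules.values()))


# Without a guard, an empty mapping fails with a bare StopIteration.
try:
    next(iter({}.values()))
    unguarded = "no error"
except StopIteration:
    unguarded = "StopIteration"
```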

Thanks for your review, @lhoestq.

I think it can only be empty if the user passes data_files={}, otherwise there are 2 options: either it is not empty or an exception is raised.

split_modules is derived from data_files, which is a DataFilesDict built by DataFilesDict.from_local_or_remote from patterns

patterns is derived either from sanitize_patterns or get_data_patterns_locally

sanitize_patterns can only return an empty dict if the user passes data_files={}

get_data_patterns_locally can only return a non-empty dict or raise an EmptyDatasetError

I think the validation of data_files={} should be elsewhere though. What do you think?

Maybe changing?

sanitize_patterns(self.data_files) if self.data_files is not None else get_data_patterns_locally(base_path)

to

sanitize_patterns(self.data_files) if self.data_files else get_data_patterns_locally(base_path)

This way, we are sure split_modules is never empty.
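
The difference between the two conditions only shows up for empty values such as `{}`. A minimal sketch, where `choose_patterns` is a hypothetical helper that returns the name of the function that would be called instead of calling it:

```python
def choose_patterns(data_files, strict):
    """Name the function each variant of the condition would dispatch to."""
    if strict:
        # Current condition: only None falls back to local pattern discovery.
        use_user_patterns = data_files is not None
    else:
        # Proposed condition: any falsy value ({}, [], "") also falls back.
        use_user_patterns = bool(data_files)
    return "sanitize_patterns" if use_user_patterns else "get_data_patterns_locally"
```

`choose_patterns({}, strict=True)` returns `"sanitize_patterns"`, whereas `choose_patterns({}, strict=False)` returns `"get_data_patterns_locally"`, so with the proposed change an explicit `data_files={}` would silently load whatever files are found locally.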

I think the validation of data_files={} should be elsewhere though. What do you think?

Yes indeed, probably in load_dataset_builder?

Maybe changing?

I think it's better if it raises an error rather than trying to make it run with data files that were not requested.

Feel free to merge then :)

lhoestq · 2023-04-26T15:53:21Z

src/datasets/load.py

-        if len(set(list(zip(*module_names.values()))[0])) > 1:
-            raise ValueError(f"Couldn't infer the same data file format for all splits. Got {module_names}")
-        module_name, builder_kwargs = next(iter(module_names.values()))
+        module_name, builder_kwargs = next(iter(split_modules.values()))


lhoestq · 2023-04-27T09:50:09Z

I think you can revert the last commit - it should fail if data_files={} IMO

This reverts commit 650141d. As requested by reviewer.

albertvillanova · 2023-04-27T12:56:16Z

The validation of non-empty data_files is addressed in this PR:

Validate non-empty data_files #5802
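
A rough sketch of what such an early validation could look like, assuming it runs before any pattern resolution. The `validate_data_files` name, the ValueError type, and the message are assumptions made for illustration; #5802 may implement this differently.

```python
def validate_data_files(data_files):
    """Reject an explicitly empty data_files while still allowing None."""
    if data_files is not None and not data_files:
        # Hypothetical error type and message, chosen for this sketch only.
        raise ValueError(
            f"data_files is empty ({data_files!r}); pass None or a non-empty value"
        )
    return data_files
```

None still means "discover the data files automatically", while `{}`, `[]`, and `""` are rejected up front instead of reaching the module inference step.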

github-actions · 2023-04-27T13:06:00Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008622 / 0.011353 (-0.002730)	0.005970 / 0.011008 (-0.005038)	0.117797 / 0.038508 (0.079289)	0.040955 / 0.023109 (0.017846)	0.419538 / 0.275898 (0.143640)	0.455816 / 0.323480 (0.132336)	0.006481 / 0.007986 (-0.001505)	0.004507 / 0.004328 (0.000178)	0.089073 / 0.004250 (0.084822)	0.052389 / 0.037052 (0.015337)	0.420053 / 0.258489 (0.161564)	0.466886 / 0.293841 (0.173045)	0.042660 / 0.128546 (-0.085886)	0.014673 / 0.075646 (-0.060973)	0.411229 / 0.419271 (-0.008042)	0.076993 / 0.043533 (0.033460)	0.431693 / 0.255139 (0.176554)	0.446283 / 0.283200 (0.163084)	0.131408 / 0.141683 (-0.010275)	1.820339 / 1.452155 (0.368184)	1.952946 / 1.492716 (0.460230)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.246543 / 0.018006 (0.228537)	0.489806 / 0.000490 (0.489317)	0.013999 / 0.000200 (0.013800)	0.000323 / 0.000054 (0.000269)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032541 / 0.037411 (-0.004870)	0.130569 / 0.014526 (0.116043)	0.139630 / 0.176557 (-0.036926)	0.217018 / 0.737135 (-0.520118)	0.147914 / 0.296338 (-0.148425)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.494767 / 0.215209 (0.279558)	4.949313 / 2.077655 (2.871658)	2.277023 / 1.504120 (0.772903)	2.036677 / 1.541195 (0.495482)	2.064461 / 1.468490 (0.595970)	0.842484 / 4.584777 (-3.742293)	4.720646 / 3.745712 (0.974934)	4.025673 / 5.269862 (-1.244189)	2.198606 / 4.565676 (-2.367070)	0.103042 / 0.424275 (-0.321233)	0.014794 / 0.007607 (0.007187)	0.617867 / 0.226044 (0.391822)	6.197146 / 2.268929 (3.928218)	2.804927 / 55.444624 (-52.639697)	2.426420 / 6.876477 (-4.450057)	2.515182 / 2.142072 (0.373109)	1.008098 / 4.805227 (-3.797129)	0.204982 / 6.500664 (-6.295682)	0.078643 / 0.075469 (0.003174)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.490790 / 1.841788 (-0.350997)	17.268042 / 8.074308 (9.193734)	17.129647 / 10.191392 (6.938255)	0.170351 / 0.680424 (-0.510073)	0.021317 / 0.534201 (-0.512884)	0.517068 / 0.579283 (-0.062215)	0.500200 / 0.434364 (0.065836)	0.641974 / 0.540337 (0.101637)	0.763984 / 1.386936 (-0.622952)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008358 / 0.011353 (-0.002995)	0.005710 / 0.011008 (-0.005298)	0.091077 / 0.038508 (0.052569)	0.040413 / 0.023109 (0.017303)	0.416634 / 0.275898 (0.140736)	0.451122 / 0.323480 (0.127642)	0.006417 / 0.007986 (-0.001569)	0.004360 / 0.004328 (0.000032)	0.089543 / 0.004250 (0.085292)	0.051137 / 0.037052 (0.014085)	0.420228 / 0.258489 (0.161739)	0.458649 / 0.293841 (0.164808)	0.041828 / 0.128546 (-0.086718)	0.014268 / 0.075646 (-0.061379)	0.105301 / 0.419271 (-0.313970)	0.058931 / 0.043533 (0.015398)	0.413445 / 0.255139 (0.158306)	0.443882 / 0.283200 (0.160682)	0.124946 / 0.141683 (-0.016737)	1.842259 / 1.452155 (0.390104)	1.948162 / 1.492716 (0.455445)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.235799 / 0.018006 (0.217792)	0.487667 / 0.000490 (0.487177)	0.001112 / 0.000200 (0.000912)	0.000094 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034233 / 0.037411 (-0.003178)	0.136593 / 0.014526 (0.122068)	0.145598 / 0.176557 (-0.030959)	0.206545 / 0.737135 (-0.530590)	0.150781 / 0.296338 (-0.145558)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.522345 / 0.215209 (0.307136)	5.192092 / 2.077655 (3.114438)	2.543182 / 1.504120 (1.039062)	2.285212 / 1.541195 (0.744018)	2.312803 / 1.468490 (0.844313)	0.859334 / 4.584777 (-3.725443)	4.620235 / 3.745712 (0.874523)	3.964060 / 5.269862 (-1.305802)	2.046347 / 4.565676 (-2.519330)	0.105284 / 0.424275 (-0.318991)	0.015051 / 0.007607 (0.007444)	0.646530 / 0.226044 (0.420485)	6.386396 / 2.268929 (4.117467)	3.131833 / 55.444624 (-52.312791)	2.761898 / 6.876477 (-4.114579)	2.833216 / 2.142072 (0.691143)	1.026024 / 4.805227 (-3.779204)	0.206776 / 6.500664 (-6.293888)	0.078845 / 0.075469 (0.003376)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.580851 / 1.841788 (-0.260937)	17.826213 / 8.074308 (9.751905)	16.929460 / 10.191392 (6.738068)	0.232483 / 0.680424 (-0.447941)	0.021123 / 0.534201 (-0.513078)	0.522196 / 0.579283 (-0.057087)	0.503495 / 0.434364 (0.069131)	0.622777 / 0.540337 (0.082440)	0.753272 / 1.386936 (-0.633664)

albertvillanova added 3 commits April 24, 2023 12:41

Test infer module for unsupported data files

074375a

Fix infer module functions for unsupported files

244c4e5

Fix dataset module factories without script

207268e

lhoestq approved these changes Apr 26, 2023

Make sure split_modules is not empty due to empty data_files

650141d

albertvillanova mentioned this pull request Apr 27, 2023

Validate non-empty data_files #5802

Merged

Revert "Make sure split_modules is not empty due to empty data_files"

6e4997a

This reverts commit 650141d. As requested by reviewer.

albertvillanova merged commit 3f9dfbd into huggingface:main Apr 27, 2023
12 checks passed

albertvillanova deleted the fix-5785 branch April 27, 2023 12:57
