
Improve default patterns resolution #6704

Merged: 11 commits merged into main from improve-default-patterns-resolution on Mar 15, 2024

Conversation

mariosasko (Collaborator) commented Mar 1, 2024

Separate the default patterns that match directories from the ones matching files and ensure directories are checked first (reverts the change from #6244, which merged these patterns). Also, ensure that the glob patterns do not overlap to avoid duplicates in the result.

Additionally, replace get_fs_token_paths with url_to_fs to avoid unnecessary glob calls.

fix #6259
fix #6272
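As a rough illustration of the `get_fs_token_paths` → `url_to_fs` swap described above (not code from this PR): both helpers come from `fsspec`, but only the latter resolves a path to a filesystem without expanding the input as a glob pattern.

```python
# Illustrative sketch, not code from this PR.
# url_to_fs returns (filesystem, normalized_path) without any glob call,
# while get_fs_token_paths additionally expands the input as a glob pattern.
from fsspec.core import get_fs_token_paths, url_to_fs

fs, path = url_to_fs("/tmp")                     # no glob call
fs2, _token, paths = get_fs_token_paths("/tmp")  # globs the input
```

Skipping the glob expansion is what makes `url_to_fs` the cheaper choice when the caller only needs the filesystem object.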

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq (Member) commented Mar 1, 2024

Awesome!

Note that it can still create duplicates if a path matches several dir patterns, e.g.

data/train-train/data/txt

matches two dir patterns:

**/{keyword}[{sep}]*/**
**/*[{sep}]{keyword}/**

PS: feel free to update your branch, I just updated ruff on main
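The overlap above can be reproduced with a minimal sketch (hypothetical directory layout, using Python's `glob` module rather than the `datasets` resolver):

```python
# Hypothetical reproduction of the overlap: a directory named "train-train"
# matches both dir patterns, so naive concatenation of the per-pattern glob
# results yields the same file twice. Uses Python's glob, not datasets' code.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data", "train-train"))
file_path = os.path.join(root, "data", "train-train", "data.txt")
open(file_path, "w").close()

keyword, sep = "train", "-._ 0-9"
dir_patterns = [f"**/{keyword}[{sep}]*/**", f"**/*[{sep}]{keyword}/**"]

matches = []
for pattern in dir_patterns:
    matches += glob.glob(os.path.join(root, pattern), recursive=True)

# file_path is matched once per pattern, hence appears twice in `matches`
```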

mariosasko (Collaborator, Author) commented Mar 1, 2024

Yes, I didn't mention that case on purpose 🙂. One solution would be deprecating the **/*[{sep}]{keyword}/** pattern (and eventually removing it). This way, the directory patterns would align more with the filename ones. Or do you think this is too big of a breaking change?

lhoestq (Member) commented Mar 1, 2024

I think it's too big of a breaking change, yes :/ (and it would make the docs/logic more complex for users to grasp, imo). Though I think your approach is already a nice step in the right direction.

mariosasko (Collaborator, Author) commented

These changes to the resolve_pattern function lead to 20-30x faster local file resolution in my benchmarks.

lhoestq (Member) commented Mar 5, 2024

Nice! Though since fsspec caches the filesystem, is there a risk when adding new files and reloading a dataset?

from datasets import load_dataset

with open("my/local/dir/0000.txt", "w") as f:
    f.write("Hello there")
d1 = load_dataset("my/local/dir")
with open("my/local/dir/0001.txt", "w") as f:
    f.write("General Kenobi")
d2 = load_dataset("my/local/dir")
assert list(d1) != list(d2)

mariosasko (Collaborator, Author) commented

Yes. But I think I have a solution for this.

mariosasko (Collaborator, Author) commented Mar 14, 2024

I'm not satisfied with the context manager approach...

A clean solution would require a bigger rewrite of the resolution logic, e.g., merging get_data_patterns and DataFilesDict.from_patterns into a single get_data_files function that builds the DataFilesDict by matching the paths using fs.find and fsspec.utils.glob_translate (available in fsspec>=2023.12.0).

The current changes make the local resolution 2-3x faster, which is good enough for now, I think.

@mariosasko mariosasko marked this pull request as ready for review March 14, 2024 17:36
@mariosasko mariosasko requested a review from lhoestq March 14, 2024 17:36
lhoestq (Member) left a review comment

Sounds good

@mariosasko mariosasko merged commit d1d3c06 into main Mar 15, 2024
12 checks passed
@mariosasko mariosasko deleted the improve-default-patterns-resolution branch March 15, 2024 15:22
Show benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.004888 / 0.011353 (-0.006465) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003267 / 0.011008 (-0.007742) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.065117 / 0.038508 (0.026609) |
| read_batch_unformated after write_array2d | 0.029416 / 0.023109 (0.006306) |
| read_batch_unformated after write_flattened_sequence | 0.232021 / 0.275898 (-0.043877) |
| read_batch_unformated after write_nested_sequence | 0.258053 / 0.323480 (-0.065427) |
| read_col_formatted_as_numpy after write_array2d | 0.003971 / 0.007986 (-0.004014) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002550 / 0.004328 (-0.001779) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049126 / 0.004250 (0.044876) |
| read_col_unformated after write_array2d | 0.040620 / 0.037052 (0.003568) |
| read_col_unformated after write_flattened_sequence | 0.253437 / 0.258489 (-0.005052) |
| read_col_unformated after write_nested_sequence | 0.273583 / 0.293841 (-0.020258) |
| read_formatted_as_numpy after write_array2d | 0.026775 / 0.128546 (-0.101771) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010073 / 0.075646 (-0.065573) |
| read_formatted_as_numpy after write_nested_sequence | 0.219089 / 0.419271 (-0.200183) |
| read_unformated after write_array2d | 0.035047 / 0.043533 (-0.008486) |
| read_unformated after write_flattened_sequence | 0.247661 / 0.255139 (-0.007478) |
| read_unformated after write_nested_sequence | 0.258674 / 0.283200 (-0.024525) |
| write_array2d | 0.018428 / 0.141683 (-0.123255) |
| write_flattened_sequence | 1.130394 / 1.452155 (-0.321761) |
| write_nested_sequence | 1.173167 / 1.492716 (-0.319549) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.092581 / 0.018006 (0.074574) |
| get_batch_of_1024_rows | 0.303657 / 0.000490 (0.303167) |
| get_first_row | 0.000215 / 0.000200 (0.000015) |
| get_last_row | 0.000051 / 0.000054 (-0.000003) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.018640 / 0.037411 (-0.018771) |
| shard | 0.062032 / 0.014526 (0.047506) |
| shuffle | 0.073982 / 0.176557 (-0.102575) |
| sort | 0.121499 / 0.737135 (-0.615636) |
| train_test_split | 0.076780 / 0.296338 (-0.219559) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.279411 / 0.215209 (0.064202) |
| read 50000 | 2.737977 / 2.077655 (0.660322) |
| read_batch 50000 10 | 1.454135 / 1.504120 (-0.049985) |
| read_batch 50000 100 | 1.343144 / 1.541195 (-0.198051) |
| read_batch 50000 1000 | 1.339876 / 1.468490 (-0.128614) |
| read_formatted numpy 5000 | 0.567306 / 4.584777 (-4.017471) |
| read_formatted pandas 5000 | 2.372569 / 3.745712 (-1.373143) |
| read_formatted tensorflow 5000 | 2.716810 / 5.269862 (-2.553052) |
| read_formatted torch 5000 | 1.697895 / 4.565676 (-2.867782) |
| read_formatted_batch numpy 5000 10 | 0.061804 / 0.424275 (-0.362471) |
| read_formatted_batch numpy 5000 1000 | 0.004986 / 0.007607 (-0.002622) |
| shuffled read 5000 | 0.332721 / 0.226044 (0.106676) |
| shuffled read 50000 | 3.274572 / 2.268929 (1.005644) |
| shuffled read_batch 50000 10 | 1.789900 / 55.444624 (-53.654725) |
| shuffled read_batch 50000 100 | 1.536346 / 6.876477 (-5.340131) |
| shuffled read_batch 50000 1000 | 1.551940 / 2.142072 (-0.590132) |
| shuffled read_formatted numpy 5000 | 0.634539 / 4.805227 (-4.170688) |
| shuffled read_formatted_batch numpy 5000 10 | 0.115860 / 6.500664 (-6.384805) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041737 / 0.075469 (-0.033732) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.024469 / 1.841788 (-0.817319) |
| map fast-tokenizer batched | 11.327496 / 8.074308 (3.253188) |
| map identity | 9.265855 / 10.191392 (-0.925537) |
| map identity batched | 0.142200 / 0.680424 (-0.538224) |
| map no-op batched | 0.013945 / 0.534201 (-0.520256) |
| map no-op batched numpy | 0.289670 / 0.579283 (-0.289614) |
| map no-op batched pandas | 0.269240 / 0.434364 (-0.165124) |
| map no-op batched pytorch | 0.324748 / 0.540337 (-0.215590) |
| map no-op batched tensorflow | 0.421393 / 1.386936 (-0.965543) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005284 / 0.011353 (-0.006069) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003351 / 0.011008 (-0.007658) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.049973 / 0.038508 (0.011465) |
| read_batch_unformated after write_array2d | 0.030257 / 0.023109 (0.007148) |
| read_batch_unformated after write_flattened_sequence | 0.273660 / 0.275898 (-0.002238) |
| read_batch_unformated after write_nested_sequence | 0.300328 / 0.323480 (-0.023152) |
| read_col_formatted_as_numpy after write_array2d | 0.004133 / 0.007986 (-0.003852) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002614 / 0.004328 (-0.001715) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.048055 / 0.004250 (0.043804) |
| read_col_unformated after write_array2d | 0.044731 / 0.037052 (0.007678) |
| read_col_unformated after write_flattened_sequence | 0.290257 / 0.258489 (0.031768) |
| read_col_unformated after write_nested_sequence | 0.321243 / 0.293841 (0.027402) |
| read_formatted_as_numpy after write_array2d | 0.029542 / 0.128546 (-0.099004) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010074 / 0.075646 (-0.065573) |
| read_formatted_as_numpy after write_nested_sequence | 0.057944 / 0.419271 (-0.361327) |
| read_unformated after write_array2d | 0.051267 / 0.043533 (0.007734) |
| read_unformated after write_flattened_sequence | 0.276278 / 0.255139 (0.021139) |
| read_unformated after write_nested_sequence | 0.302464 / 0.283200 (0.019264) |
| write_array2d | 0.018231 / 0.141683 (-0.123452) |
| write_flattened_sequence | 1.140782 / 1.452155 (-0.311373) |
| write_nested_sequence | 1.182991 / 1.492716 (-0.309725) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.092325 / 0.018006 (0.074319) |
| get_batch_of_1024_rows | 0.302610 / 0.000490 (0.302121) |
| get_first_row | 0.000202 / 0.000200 (0.000002) |
| get_last_row | 0.000049 / 0.000054 (-0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.021458 / 0.037411 (-0.015954) |
| shard | 0.074883 / 0.014526 (0.060357) |
| shuffle | 0.085747 / 0.176557 (-0.090809) |
| sort | 0.125506 / 0.737135 (-0.611629) |
| train_test_split | 0.086921 / 0.296338 (-0.209417) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.290485 / 0.215209 (0.075276) |
| read 50000 | 2.853898 / 2.077655 (0.776243) |
| read_batch 50000 10 | 1.615606 / 1.504120 (0.111486) |
| read_batch 50000 100 | 1.491797 / 1.541195 (-0.049397) |
| read_batch 50000 1000 | 1.515981 / 1.468490 (0.047491) |
| read_formatted numpy 5000 | 0.566760 / 4.584777 (-4.018017) |
| read_formatted pandas 5000 | 2.462593 / 3.745712 (-1.283119) |
| read_formatted tensorflow 5000 | 2.765516 / 5.269862 (-2.504345) |
| read_formatted torch 5000 | 1.755078 / 4.565676 (-2.810598) |
| read_formatted_batch numpy 5000 10 | 0.063614 / 0.424275 (-0.360661) |
| read_formatted_batch numpy 5000 1000 | 0.005040 / 0.007607 (-0.002567) |
| shuffled read 5000 | 0.347957 / 0.226044 (0.121912) |
| shuffled read 50000 | 3.464258 / 2.268929 (1.195330) |
| shuffled read_batch 50000 10 | 1.992273 / 55.444624 (-53.452351) |
| shuffled read_batch 50000 100 | 1.699147 / 6.876477 (-5.177330) |
| shuffled read_batch 50000 1000 | 1.868438 / 2.142072 (-0.273635) |
| shuffled read_formatted numpy 5000 | 0.660756 / 4.805227 (-4.144471) |
| shuffled read_formatted_batch numpy 5000 10 | 0.118142 / 6.500664 (-6.382522) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041974 / 0.075469 (-0.033495) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.012206 / 1.841788 (-0.829581) |
| map fast-tokenizer batched | 12.343735 / 8.074308 (4.269427) |
| map identity | 10.321975 / 10.191392 (0.130583) |
| map identity batched | 0.140007 / 0.680424 (-0.540417) |
| map no-op batched | 0.015755 / 0.534201 (-0.518446) |
| map no-op batched numpy | 0.291978 / 0.579283 (-0.287305) |
| map no-op batched pandas | 0.278792 / 0.434364 (-0.155572) |
| map no-op batched pytorch | 0.325366 / 0.540337 (-0.214972) |
| map no-op batched tensorflow | 0.439403 / 1.386936 (-0.947533) |

@@ -46,36 +46,57 @@ class EmptyDatasetError(FileNotFoundError):
}
NON_WORDS_CHARS = "-._ 0-9"
if config.FSSPEC_VERSION < version.parse("2023.9.0"):
KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"]
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
albertvillanova (Member) commented Apr 19, 2024

I think this is a breaking change impacting all uses of the variable KEYWORDS_IN_PATH_NAME_BASE_PATTERNS. See: https://github.com/huggingface/dataset-viewer/actions/runs/8753796799/job/24024224560?pr=2740

CC: @mariosasko @lhoestq

mariosasko (Collaborator, Author) commented Apr 19, 2024

dataset-viewer seems to be the only repo on GH using KEYWORDS_IN_PATH_NAME_BASE_PATTERNS...

But I think it's okay to add it back if it's hard to fix dataset-viewer's CI otherwise (merging KEYWORDS_IN_FILENAME_BASE_PATTERNS and KEYWORDS_IN_DIR_NAME_BASE_PATTERNS should fix it, no?).

lhoestq (Member) commented Apr 19, 2024

no need to add it back imo, we can just use the new variables (adding the two together) in dataset-viewer

A Member commented

I am fixing this in dataset-viewer: huggingface/dataset-viewer#2740 (comment)

@vttrifonov

This change breaks

fs, _ = url_to_fs(dataset_path, **(storage_options or {}))

when the input is a pathlib.Path. The issue is that url_to_fs expects a str and cannot deal with a Path, whereas get_fs_token_paths converted the input to str, so this was not a problem before.
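A minimal compatibility shim for the case described above (hypothetical helper name; only the `str()` coercion mirrors what `get_fs_token_paths` did internally):

```python
# Hypothetical workaround sketch: coerce pathlib.Path to str before calling
# url_to_fs, mirroring the conversion get_fs_token_paths performed internally.
from pathlib import Path

from fsspec.core import url_to_fs

def url_to_fs_compat(urlpath, **storage_options):
    if isinstance(urlpath, Path):
        urlpath = str(urlpath)  # url_to_fs expects a string URL/path
    return url_to_fs(urlpath, **storage_options)
```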

lhoestq (Member) commented Apr 23, 2024

I opened #6828 to add proper Path support to save_to_disk / load_from_disk
