
Improve default patterns resolution #6704

Merged: 11 commits merged into main from improve-default-patterns-resolution on Mar 15, 2024

Conversation

mariosasko (Collaborator) commented Mar 1, 2024

Separate the default patterns that match directories from the ones matching files and ensure directories are checked first (reverts the change from #6244, which merged these patterns). Also, ensure that the glob patterns do not overlap to avoid duplicates in the result.

Additionally, replace get_fs_token_paths with url_to_fs to avoid unnecessary glob calls.

fix #6259
fix #6272
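As a rough illustration of the `get_fs_token_paths` → `url_to_fs` swap described above (not code from this PR): both helpers come from `fsspec`, but only the latter resolves a path to a filesystem without expanding the input as a glob pattern.

```python
# Illustrative sketch, not code from this PR.
# url_to_fs returns (filesystem, normalized_path) without any glob call,
# while get_fs_token_paths additionally expands the input as a glob pattern.
from fsspec.core import get_fs_token_paths, url_to_fs

fs, path = url_to_fs("/tmp")                     # no glob call
fs2, _token, paths = get_fs_token_paths("/tmp")  # globs the input
```

Skipping the glob expansion is what makes `url_to_fs` the cheaper choice when the caller only needs the filesystem object.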

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq (Member) commented Mar 1, 2024

Awesome!

Note that it can still create duplicates if a path matches several dir patterns, e.g.

data/train-train/data/txt

matches two dir patterns:

**/{keyword}[{sep}]*/**
**/*[{sep}]{keyword}/**

PS: feel free to update your branch, I just updated ruff on main
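The overlap above can be reproduced with a minimal sketch (hypothetical directory layout, using Python's `glob` module rather than the `datasets` resolver):

```python
# Hypothetical reproduction of the overlap: a directory named "train-train"
# matches both dir patterns, so naive concatenation of the per-pattern glob
# results yields the same file twice. Uses Python's glob, not datasets' code.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data", "train-train"))
file_path = os.path.join(root, "data", "train-train", "data.txt")
open(file_path, "w").close()

keyword, sep = "train", "-._ 0-9"
dir_patterns = [f"**/{keyword}[{sep}]*/**", f"**/*[{sep}]{keyword}/**"]

matches = []
for pattern in dir_patterns:
    matches += glob.glob(os.path.join(root, pattern), recursive=True)

# file_path is matched once per pattern, hence appears twice in `matches`
```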

mariosasko (Collaborator, Author) commented Mar 1, 2024

Yes, I didn't mention that case on purpose 🙂. One solution would be deprecating the **/*[{sep}]{keyword}/** pattern (and eventually removing it). This way, the directory patterns would align more with the filename ones. Or do you think this is too big of a breaking change?

lhoestq (Member) commented Mar 1, 2024

I think it's too big of a breaking change, yes :/ (and it would make the docs/logic more complex for users to grasp, imo). Though I think your approach is already a nice step in the right direction.

mariosasko (Collaborator, Author) commented

These changes to the resolve_pattern function lead to 20-30x faster local file resolution in my benchmarks.

lhoestq (Member) commented Mar 5, 2024

Nice! Though since fsspec caches the filesystem, is there a risk when adding new files and reloading a dataset?

from datasets import load_dataset

with open("my/local/dir/0000.txt", "w") as f:
    f.write("Hello there")
d1 = load_dataset("my/local/dir")
with open("my/local/dir/0001.txt", "w") as f:
    f.write("General Kenobi")
d2 = load_dataset("my/local/dir")
assert list(d1) != list(d2)

mariosasko (Collaborator, Author) commented

Yes. But I think I have a solution for this.

mariosasko (Collaborator, Author) commented Mar 14, 2024

I'm not satisfied with the context manager approach...

A clean solution would require a bigger rewrite of the resolution logic, e.g., merging get_data_patterns and DataFilesDict.from_patterns into a single get_data_files function that builds the DataFilesDict by matching the paths using fs.find and fsspec.utils.glob_translate (available in fsspec>=2023.12.0).

The current changes make the local resolution 2-3x faster, which is good enough for now, I think.

@mariosasko mariosasko marked this pull request as ready for review March 14, 2024 17:36
@mariosasko mariosasko requested a review from lhoestq March 14, 2024 17:36
lhoestq (Member) left a review comment

Sounds good

@mariosasko mariosasko merged commit d1d3c06 into main Mar 15, 2024
12 checks passed
@mariosasko mariosasko deleted the improve-default-patterns-resolution branch March 15, 2024 15:22
Show benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.004888 / 0.011353 (-0.006465) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003267 / 0.011008 (-0.007742) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.065117 / 0.038508 (0.026609) |
| read_batch_unformated after write_array2d | 0.029416 / 0.023109 (0.006306) |
| read_batch_unformated after write_flattened_sequence | 0.232021 / 0.275898 (-0.043877) |
| read_batch_unformated after write_nested_sequence | 0.258053 / 0.323480 (-0.065427) |
| read_col_formatted_as_numpy after write_array2d | 0.003971 / 0.007986 (-0.004014) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002550 / 0.004328 (-0.001779) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049126 / 0.004250 (0.044876) |
| read_col_unformated after write_array2d | 0.040620 / 0.037052 (0.003568) |
| read_col_unformated after write_flattened_sequence | 0.253437 / 0.258489 (-0.005052) |
| read_col_unformated after write_nested_sequence | 0.273583 / 0.293841 (-0.020258) |
| read_formatted_as_numpy after write_array2d | 0.026775 / 0.128546 (-0.101771) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010073 / 0.075646 (-0.065573) |
| read_formatted_as_numpy after write_nested_sequence | 0.219089 / 0.419271 (-0.200183) |
| read_unformated after write_array2d | 0.035047 / 0.043533 (-0.008486) |
| read_unformated after write_flattened_sequence | 0.247661 / 0.255139 (-0.007478) |
| read_unformated after write_nested_sequence | 0.258674 / 0.283200 (-0.024525) |
| write_array2d | 0.018428 / 0.141683 (-0.123255) |
| write_flattened_sequence | 1.130394 / 1.452155 (-0.321761) |
| write_nested_sequence | 1.173167 / 1.492716 (-0.319549) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.092581 / 0.018006 (0.074574) |
| get_batch_of_1024_rows | 0.303657 / 0.000490 (0.303167) |
| get_first_row | 0.000215 / 0.000200 (0.000015) |
| get_last_row | 0.000051 / 0.000054 (-0.000003) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.018640 / 0.037411 (-0.018771) |
| shard | 0.062032 / 0.014526 (0.047506) |
| shuffle | 0.073982 / 0.176557 (-0.102575) |
| sort | 0.121499 / 0.737135 (-0.615636) |
| train_test_split | 0.076780 / 0.296338 (-0.219559) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.279411 / 0.215209 (0.064202) |
| read 50000 | 2.737977 / 2.077655 (0.660322) |
| read_batch 50000 10 | 1.454135 / 1.504120 (-0.049985) |
| read_batch 50000 100 | 1.343144 / 1.541195 (-0.198051) |
| read_batch 50000 1000 | 1.339876 / 1.468490 (-0.128614) |
| read_formatted numpy 5000 | 0.567306 / 4.584777 (-4.017471) |
| read_formatted pandas 5000 | 2.372569 / 3.745712 (-1.373143) |
| read_formatted tensorflow 5000 | 2.716810 / 5.269862 (-2.553052) |
| read_formatted torch 5000 | 1.697895 / 4.565676 (-2.867782) |
| read_formatted_batch numpy 5000 10 | 0.061804 / 0.424275 (-0.362471) |
| read_formatted_batch numpy 5000 1000 | 0.004986 / 0.007607 (-0.002622) |
| shuffled read 5000 | 0.332721 / 0.226044 (0.106676) |
| shuffled read 50000 | 3.274572 / 2.268929 (1.005644) |
| shuffled read_batch 50000 10 | 1.789900 / 55.444624 (-53.654725) |
| shuffled read_batch 50000 100 | 1.536346 / 6.876477 (-5.340131) |
| shuffled read_batch 50000 1000 | 1.551940 / 2.142072 (-0.590132) |
| shuffled read_formatted numpy 5000 | 0.634539 / 4.805227 (-4.170688) |
| shuffled read_formatted_batch numpy 5000 10 | 0.115860 / 6.500664 (-6.384805) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041737 / 0.075469 (-0.033732) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.024469 / 1.841788 (-0.817319) |
| map fast-tokenizer batched | 11.327496 / 8.074308 (3.253188) |
| map identity | 9.265855 / 10.191392 (-0.925537) |
| map identity batched | 0.142200 / 0.680424 (-0.538224) |
| map no-op batched | 0.013945 / 0.534201 (-0.520256) |
| map no-op batched numpy | 0.289670 / 0.579283 (-0.289614) |
| map no-op batched pandas | 0.269240 / 0.434364 (-0.165124) |
| map no-op batched pytorch | 0.324748 / 0.540337 (-0.215590) |
| map no-op batched tensorflow | 0.421393 / 1.386936 (-0.965543) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005284 / 0.011353 (-0.006069) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003351 / 0.011008 (-0.007658) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.049973 / 0.038508 (0.011465) |
| read_batch_unformated after write_array2d | 0.030257 / 0.023109 (0.007148) |
| read_batch_unformated after write_flattened_sequence | 0.273660 / 0.275898 (-0.002238) |
| read_batch_unformated after write_nested_sequence | 0.300328 / 0.323480 (-0.023152) |
| read_col_formatted_as_numpy after write_array2d | 0.004133 / 0.007986 (-0.003852) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002614 / 0.004328 (-0.001715) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.048055 / 0.004250 (0.043804) |
| read_col_unformated after write_array2d | 0.044731 / 0.037052 (0.007678) |
| read_col_unformated after write_flattened_sequence | 0.290257 / 0.258489 (0.031768) |
| read_col_unformated after write_nested_sequence | 0.321243 / 0.293841 (0.027402) |
| read_formatted_as_numpy after write_array2d | 0.029542 / 0.128546 (-0.099004) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010074 / 0.075646 (-0.065573) |
| read_formatted_as_numpy after write_nested_sequence | 0.057944 / 0.419271 (-0.361327) |
| read_unformated after write_array2d | 0.051267 / 0.043533 (0.007734) |
| read_unformated after write_flattened_sequence | 0.276278 / 0.255139 (0.021139) |
| read_unformated after write_nested_sequence | 0.302464 / 0.283200 (0.019264) |
| write_array2d | 0.018231 / 0.141683 (-0.123452) |
| write_flattened_sequence | 1.140782 / 1.452155 (-0.311373) |
| write_nested_sequence | 1.182991 / 1.492716 (-0.309725) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.092325 / 0.018006 (0.074319) |
| get_batch_of_1024_rows | 0.302610 / 0.000490 (0.302121) |
| get_first_row | 0.000202 / 0.000200 (0.000002) |
| get_last_row | 0.000049 / 0.000054 (-0.000005) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.021458 / 0.037411 (-0.015954) |
| shard | 0.074883 / 0.014526 (0.060357) |
| shuffle | 0.085747 / 0.176557 (-0.090809) |
| sort | 0.125506 / 0.737135 (-0.611629) |
| train_test_split | 0.086921 / 0.296338 (-0.209417) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.290485 / 0.215209 (0.075276) |
| read 50000 | 2.853898 / 2.077655 (0.776243) |
| read_batch 50000 10 | 1.615606 / 1.504120 (0.111486) |
| read_batch 50000 100 | 1.491797 / 1.541195 (-0.049397) |
| read_batch 50000 1000 | 1.515981 / 1.468490 (0.047491) |
| read_formatted numpy 5000 | 0.566760 / 4.584777 (-4.018017) |
| read_formatted pandas 5000 | 2.462593 / 3.745712 (-1.283119) |
| read_formatted tensorflow 5000 | 2.765516 / 5.269862 (-2.504345) |
| read_formatted torch 5000 | 1.755078 / 4.565676 (-2.810598) |
| read_formatted_batch numpy 5000 10 | 0.063614 / 0.424275 (-0.360661) |
| read_formatted_batch numpy 5000 1000 | 0.005040 / 0.007607 (-0.002567) |
| shuffled read 5000 | 0.347957 / 0.226044 (0.121912) |
| shuffled read 50000 | 3.464258 / 2.268929 (1.195330) |
| shuffled read_batch 50000 10 | 1.992273 / 55.444624 (-53.452351) |
| shuffled read_batch 50000 100 | 1.699147 / 6.876477 (-5.177330) |
| shuffled read_batch 50000 1000 | 1.868438 / 2.142072 (-0.273635) |
| shuffled read_formatted numpy 5000 | 0.660756 / 4.805227 (-4.144471) |
| shuffled read_formatted_batch numpy 5000 10 | 0.118142 / 6.500664 (-6.382522) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041974 / 0.075469 (-0.033495) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.012206 / 1.841788 (-0.829581) |
| map fast-tokenizer batched | 12.343735 / 8.074308 (4.269427) |
| map identity | 10.321975 / 10.191392 (0.130583) |
| map identity batched | 0.140007 / 0.680424 (-0.540417) |
| map no-op batched | 0.015755 / 0.534201 (-0.518446) |
| map no-op batched numpy | 0.291978 / 0.579283 (-0.287305) |
| map no-op batched pandas | 0.278792 / 0.434364 (-0.155572) |
| map no-op batched pytorch | 0.325366 / 0.540337 (-0.214972) |
| map no-op batched tensorflow | 0.439403 / 1.386936 (-0.947533) |

@@ -46,36 +46,57 @@ class EmptyDatasetError(FileNotFoundError):
}
NON_WORDS_CHARS = "-._ 0-9"
if config.FSSPEC_VERSION < version.parse("2023.9.0"):
KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"]
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
albertvillanova (Member) commented Apr 19, 2024

I think this is a breaking change impacting all uses of the variable KEYWORDS_IN_PATH_NAME_BASE_PATTERNS. See: https://github.com/huggingface/dataset-viewer/actions/runs/8753796799/job/24024224560?pr=2740

CC: @mariosasko @lhoestq

mariosasko (Collaborator, Author) commented Apr 19, 2024

dataset-viewer seems to be the only repo on GH using KEYWORDS_IN_PATH_NAME_BASE_PATTERNS...

But I think it's okay to add it back if it's hard to fix dataset-viewer's CI otherwise (merging KEYWORDS_IN_FILENAME_BASE_PATTERNS and KEYWORDS_IN_DIR_NAME_BASE_PATTERNS should fix it, no?).

lhoestq (Member) commented Apr 19, 2024

no need to add it back imo, we can just use the new variables (adding the two together) in dataset-viewer

A Member commented

I am fixing this in dataset-viewer: huggingface/dataset-viewer#2740 (comment)

@vttrifonov

This change breaks

fs, _ = url_to_fs(dataset_path, **(storage_options or {}))

when the input is a pathlib.Path. The issue is that url_to_fs expects a str and cannot deal with a Path, whereas get_fs_token_paths converted the input to str, so this was not a problem before.
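A minimal compatibility shim for the case described above (hypothetical helper name; only the `str()` coercion mirrors what `get_fs_token_paths` did internally):

```python
# Hypothetical workaround sketch: coerce pathlib.Path to str before calling
# url_to_fs, mirroring the conversion get_fs_token_paths performed internally.
from pathlib import Path

from fsspec.core import url_to_fs

def url_to_fs_compat(urlpath, **storage_options):
    if isinstance(urlpath, Path):
        urlpath = str(urlpath)  # url_to_fs expects a string URL/path
    return url_to_fs(urlpath, **storage_options)
```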

lhoestq (Member) commented Apr 23, 2024

I opened #6828 to add proper Path support to save_to_disk / load_from_disk
