Skip to content

fix: tolerate chmod PermissionError on flat-permission filesystems#8146

Open
dschulmeist wants to merge 1 commit intohuggingface:mainfrom
dschulmeist:fix/chmod-best-effort-flat-filesystems-8125
Open

fix: tolerate chmod PermissionError on flat-permission filesystems#8146
dschulmeist wants to merge 1 commit intohuggingface:mainfrom
dschulmeist:fix/chmod-best-effort-flat-filesystems-8125

Conversation

@dschulmeist
Copy link
Copy Markdown

Summary

Fixes #8125.

Dataset.map() (and the indices writer in Dataset.flatten_indices / Dataset.select path) calls os.chmod(cache_file, 0o666 & ~umask) on the freshly moved cache file. On flat-permission filesystems — GCS FUSE, S3 FUSE mounts, and similar object-storage-backed mounts — chmod raises PermissionError (a subclass of OSError). The cache file itself is already written successfully at that point, but the unguarded chmod takes down the whole .map() call with a traceback like:

  File "src/datasets/arrow_dataset.py", line 3747, in _map_single
    os.chmod(cache_file_name, 0o666 & ~umask)
PermissionError: [Errno 1] Operation not permitted: '/mnt/data/cache_55b1844ebd9dffff_00000_of_00001.arrow'

Fix

Wrap each of the two os.chmod call sites in src/datasets/arrow_dataset.py with try/except OSError, logging the failure at debug level. The permission update stays best-effort: sane 0o666 & ~umask permissions on POSIX filesystems that support it, silent graceful degradation on flat-permission mounts where there's nothing to set.

No new API surface, no env-var toggle — the issue body suggested an opt-out env var, but catching OSError is narrower and doesn't require users to know about a new knob. If a future caller actually needs to detect the degraded mode, they can observe the debug log.

Tests

New regression test test_map_survives_chmod_permission_error in tests/test_arrow_dataset.py:

  • Patches datasets.arrow_dataset.os.chmod to raise PermissionError for paths inside the pytest tmp_path (so pytest's own housekeeping is unaffected)
  • Runs Dataset.from_dict({\"a\": [1, 2, 3]}).map(lambda x: {\"b\": x[\"a\"] * 2}, cache_file_name=...)
  • Asserts the mapped values are correct ([2, 4, 6]), the cache file exists on disk, and the chmod guard was actually exercised

Verified the test fails on main without the fix (propagates PermissionError) and passes with it.

Scope notes

  • Two chmod sites changed in arrow_dataset.py — both follow the identical shutil.move → umask → chmod pattern
  • No behavior change on POSIX filesystems that accept chmod
  • Issue author asked about an env-var opt-out; this goes narrower (catch-the-error) because the chmod is genuinely best-effort and failure is recoverable

Checklist

  • Added a regression test in tests/
  • ruff check and ruff format --check clean on touched files
  • Related test_map_caching* tests still pass
  • Scope isolated to one bug

…uggingface#8125)

map() and the indices cache writer both call os.chmod on the freshly
moved cache file. On flat-permission filesystems (GCS FUSE, S3 FUSE
mounts) chmod raises PermissionError / OSError, aborting the whole
map() call even though the cache file itself was written successfully.

Wrap each chmod in try/except OSError with a debug log. The chmod
remains best-effort: it still sets sane 0o666-masked permissions on
filesystems that support it, and silently degrades on ones that don't.

Adds a regression test that patches os.chmod to raise PermissionError
only for paths inside the pytest tmp-path and asserts that Dataset.map
still returns the correct mapped values and the cache file ends up on
disk.

Fixes huggingface#8125
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: being able to disable chmod for flat permission filesystems

1 participant