Sparse file local cache #275

Merged
kuenishi merged 10 commits into master from sfcache
Jul 14, 2022

Conversation


@kuenishi kuenishi commented Jun 6, 2022

Passing the local_cache=True keyword when opening a zip FS, e.g. via from_url(), creates a local cache file in a cache directory (by default, ~/.cache/pfio/). The cache file is currently a named temporary file, but it could easily be made persistent. It consumes local disk space, but only as much as the total amount of data transferred over the network to the client.

This will be a major step toward working around #271, and it also improves latency when reading files from a remote file system such as S3.

Because the cache is a sparse file, its apparent size (ls) and the disk space it actually consumes (du) differ:

$ ls -lh  ~/.cache/pfio/tmpjwvpycrg
-rw------- 1 kota kota 155G Jun  6 12:29 /home/kota/.cache/pfio/tmpjwvpycrg
$ du -h  ~/.cache/pfio/tmpjwvpycrg
184M    /home/kota/.cache/pfio/tmpjwvpycrg
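
The same gap between apparent and allocated size can be reproduced with a small sketch. This is not code from the PR: the helper name sparse_sizes and the 64 MiB offset are made up for illustration, and it assumes a POSIX system whose file system supports sparse files.

```python
import os
import tempfile


def sparse_sizes(path):
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    # On POSIX systems st_blocks is counted in 512-byte units.
    return st.st_size, st.st_blocks * 512


def demo():
    with tempfile.NamedTemporaryFile() as f:
        # Write 4 KiB at a 64 MiB offset; everything before it stays a hole
        # on file systems that support sparse files.
        os.pwrite(f.fileno(), b"x" * 4096, 64 << 20)
        return sparse_sizes(f.name)


apparent, allocated = demo()
print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

On a file system with sparse-file support, the allocated figure should be far smaller than the apparent one, which is what the ls and du output above shows for the real cache file.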

Concurrency control

  • For threads: while pread() and pwrite() are thread-safe, other internal data structures, such as the set of known ranges, are not. Making them thread-safe is future work.
  • For processes: POSIX doesn't guarantee consistent reads and writes against a single file from multiple processes, and Linux offers little more. The same technique used in pfio.cache.MultiprocessFileCache can be applied to the CachedWrapper class. This is also future work.
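
For the thread case, one possible direction is to guard the known-ranges bookkeeping with a lock while leaving the pread()/pwrite() calls outside it. The sketch below is only an illustration under that assumption: RangeCache, put, get and demo are hypothetical names, and none of this is the CachedWrapper implementation in this PR.

```python
import os
import tempfile
import threading


class RangeCache:
    """Hypothetical sparse-file cache; not pfio's CachedWrapper."""

    def __init__(self, fd):
        self.fd = fd
        self.ranges = []  # sorted, non-overlapping (start, end) pairs
        self.lock = threading.Lock()

    def put(self, offset, data):
        # Write the bytes first, then publish the range, so a reader that
        # sees the range always finds the data already in the file.
        written = os.pwrite(self.fd, data, offset)
        with self.lock:
            self._add(offset, offset + written)

    def get(self, offset, size):
        # Only the range lookup needs the lock; pread() is thread-safe.
        with self.lock:
            hit = any(s <= offset and offset + size <= e
                      for s, e in self.ranges)
        return os.pread(self.fd, size, offset) if hit else None

    def _add(self, start, end):
        # Merge the new range with every overlapping or touching one.
        merged = []
        for s, e in self.ranges:
            if e < start or end < s:
                merged.append((s, e))
            else:
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self.ranges = sorted(merged)


def demo():
    with tempfile.TemporaryFile() as f:
        cache = RangeCache(f.fileno())
        threads = [threading.Thread(target=cache.put, args=(i * 4, b"abcd"))
                   for i in range(8)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return cache.ranges, cache.get(0, 32), cache.get(30, 4)
```

Calling demo() fills eight adjacent 4-byte chunks from separate threads and then checks one cached read and one read that extends past the known range. Coordinating multiple processes would need a different mechanism, such as the one in pfio.cache.MultiprocessFileCache, and is not covered by this sketch.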

@kuenishi kuenishi added the cat:feature Implementation that introduces new interfaces. label Jun 6, 2022
@kuenishi kuenishi added this to the 2.3.0 milestone Jun 6, 2022
@kuenishi kuenishi merged commit 86a6339 into master Jul 14, 2022
@kuenishi kuenishi deleted the sfcache branch July 14, 2022 23:25