Single cache file #145
/test
Successfully created a job for commit 9889a9c:
@belltailjp Tests passed. Is this still WIP?
I wanted to try some micro-benchmarks to see whether there are non-negligible performance regressions due to the increased chance of file access concurrency in
https://gist.github.com/belltailjp/eb6a98a799bf1a56bd42132b92db763b#file-result-md
There is a certain amount of fluctuation, but I think it is fair to say there is no significant degradation (nor improvement) in performance.
The code looks good, but please fix the conflict against #143.
9889a9c to bb85698
/test
Successfully created a job for commit bb85698:
Currently the file caches (FileCache and MultiprocessFileCache) store the cache information in two separate files, an index file and a data file.
This PR proposes storing both in a single file instead.
The new cache file is formatted as:
where N is the number of samples to cache.
This is exactly equivalent to concatenating the current index file and data file.
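As a rough illustration of the "index followed by data" idea, here is a minimal sketch. The entry format (two little-endian 64-bit integers holding an offset and a length) and the helper names `join` and `split` are assumptions made for this example only; they are not taken from the PR. The sketch also shows why N has to be known: it tells the reader where the index region ends.

```python
import struct

# Hypothetical index entry: (offset into the data region, length in bytes).
ENTRY = struct.Struct("<qq")


def join(index_entries, data):
    """Concatenate N packed index entries and the data bytes into one blob."""
    index = b"".join(ENTRY.pack(off, length) for off, length in index_entries)
    return index + data


def split(blob, n):
    """Recover the N index entries and the data region from a single blob."""
    index = [ENTRY.unpack_from(blob, i * ENTRY.size) for i in range(n)]
    data = blob[n * ENTRY.size:]  # the data region starts right after the index
    return index, data
```

With this layout the single file has exactly the size of the index plus the size of the data, and splitting it at `N * ENTRY.size` gives back the two original parts.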
Background
More strict cache size limitation (related to #143)
#143 will introduce a limit on the cache size, but currently that limit applies to the data file only.
This is usually fine, since the index files are negligible in most cases (e.g., 34GB of data vs. 20MB of index for ImageNet1K). However, as the degree of parallelism grows the index size increases and becomes non-trivial, yet it stays outside the limit.
By unifying the index and data into one file, we can enforce the cache size limit more strictly.
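To make the accounting concrete, the following sketch counts the index toward the limit. The 16-byte entry size and the function names are assumptions for illustration; the actual entry size used by the caches is not stated here.

```python
ENTRY_SIZE = 16  # assumed bytes per index entry (hypothetical)


def index_bytes(n_samples, entry_size=ENTRY_SIZE):
    """Size of the index region for n_samples fixed-size entries."""
    return n_samples * entry_size


def fits_in_limit(n_samples, data_bytes, limit_bytes, entry_size=ENTRY_SIZE):
    """True if the index plus the data stay within the size limit."""
    return index_bytes(n_samples, entry_size) + data_bytes <= limit_bytes
```

For instance, under that assumption 1,281,167 samples (the size of the ImageNet1K training set) would need `index_bytes(1281167)`, which is 20,498,672 bytes, i.e. roughly 20MB of index.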
Easier handling
If we want to fully utilize the cache preserve and preload functionality, a preserved cache that is a single file is easier to handle and manage.
Performance concern
@kuenishi has pointed out that with a single-file cache, every cache access reads the same file twice (once for the index entry and once for the actual data), which may increase the chance of conflicts in multi-process use.
To be studied.
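The access pattern in question can be sketched as follows. This is only an illustration under assumed details: the entry format (offset relative to the data region, plus length) and the function `get` are hypothetical, and the use of `os.pread` (available on Unix) is a choice made for this example, not necessarily what the PR does.

```python
import os
import struct

# Hypothetical index entry: (offset into the data region, length in bytes).
ENTRY = struct.Struct("<qq")


def get(fd, i, n):
    """Fetch sample i from a single cache file holding n index entries."""
    # Read 1: the fixed-size index entry for sample i.
    raw = os.pread(fd, ENTRY.size, i * ENTRY.size)
    offset, length = ENTRY.unpack(raw)
    # Read 2: the payload, located after the n-entry index region.
    return os.pread(fd, length, n * ENTRY.size + offset)
```

Both reads hit the same file, so concurrent readers and writers touch one file where they previously touched two. Positional reads at least avoid depending on a shared file offset, but whether contention matters in practice is the question the benchmarks are meant to answer.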