Add persistence to cache.FileCache #90
Conversation
I think this closes #3

```diff
@@ -251,3 +255,56 @@ def close(self):
         self.datafp.close()
         self.indexfp = None
         self.datafp = None
+
+    def preload(self, name):
```
Is it possible to split the `preload` and the `freeze` of the cache?

In the current implementation, the user has to do the following to freeze the cache and make it multi-process safe:

load => preserve => preload

If we can split these two functionalities, then we can simply do "load" => "freeze". Moreover, the cache could still be modified before "freeze".
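The proposed split can be sketched in a few lines. This is a hypothetical toy class, not the actual chainerio API: `SketchCache`, its in-memory dict, and the `freeze()` method are all illustrative assumptions standing in for the file-backed implementation.

```python
# Hypothetical sketch (NOT chainerio's implementation) of the proposed
# "load" => "freeze" workflow: entries can be added until freeze(),
# after which the cache is read-only.

class SketchCache:
    def __init__(self):
        self._data = {}       # stands in for the index/data files
        self._frozen = False

    def put(self, key, value):
        if self._frozen:
            raise RuntimeError('cache is frozen; no new entries allowed')
        if key in self._data:
            raise ValueError('data is immutable; key already written')
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def freeze(self):
        # Proposed: freeze in place, without preserve()/preload().
        self._frozen = True

cache = SketchCache()
cache.put(1, b'bar')   # "load": read from remote storage, then put
cache.freeze()         # "freeze": now safe to share across processes
assert cache.get(1) == b'bar'
```

Under this sketch, a `put` after `freeze()` raises, which is the behaviour the comment above asks for.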
What do you mean by "load"?

If "load" means writing data into the cache files, that means creating a file cache with a specific name, if I understand correctly. But that is unacceptable, because it leads to partially written named files being left behind after a job failure, and they will definitely mess up the cache and the following re-run of the job.

Also, the principle of the cache in this module is that data is immutable and deterministic; there should be no modification. What do you mean by "modified"?
Sorry to cause confusion; by "load" I mean reading data from remote storage. If I understand correctly, in order to use this module the user has to do the following:

- read data from remote storage
- preserve
- preload

My suggestion is to have a "freeze" function, so that we can:

- read data from remote storage
- freeze

to support multiprocessing. I do not think it would increase the risk of having a broken cache.

By "modifying", I do not mean changing the data itself, but the ability to add new cache entries.
> I do not think it would increase the risk of having a broken cache.

I believe it definitely happens. For example:

- Create a cache file named "foo.cache"
- Write 50% of the data to "foo.cache", e.g. `put(1, b'bar')`
- The job fails

Even in this case a partial write may happen: `b'ba'` was written to disk but `b'r'` wasn't. Then, on restart of the job:

- Open the 50%-written cache file named "foo.cache"
- `put(1, b'bar')` can't run, as the data for key 1 is immutable; the job fails again

Also in this case it is possible neither to put data nor to run `get(1)` after a partial write. This is a typical scenario of broken data, and broken cache files must be thrown away.
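The failure mode above can be demonstrated with a few lines of plain file I/O. This is a standalone illustration of the partial-write scenario, not code from this PR; the `.cache` suffix and the byte values are just the example from the comment.

```python
import os
import tempfile

# Simulate the scenario: the job dies after flushing only b'ba' of the
# intended b'bar' record, leaving a half-written cache file behind.
fd, path = tempfile.mkstemp(suffix='.cache')
with os.fdopen(fd, 'wb') as f:
    f.write(b'ba')          # job crashes before b'r' reaches disk

# A re-run opens the leftover file and finds it does not contain the
# expected record, so the file must be discarded, not reused.
with open(path, 'rb') as f:
    data = f.read()
assert data != b'bar'        # broken cache: throw it away
os.remove(path)
```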
I think we are misunderstanding each other. In your code you integrated the "freeze" into `preserve`, which I misunderstood, so step 3 is not necessary. But I think that design forces the user to preserve the cache, which is not always necessary: they may just want to enable multiprocessing, which only needs "freezing" the cache, while preserving the data disables the automatic deletion of the cache.

In my example, the cache would not be frozen until all the data is loaded, and after "freeze" no data can be added to the cache. Then, if the job fails at a `put`, which is definitely before the "freeze", the cache will be neither frozen nor preserved. That is why I think such a design would not increase the risk of a broken cache.

Sorry for the confusion.
Automatic deletion of the cache among multiple processes does not work correctly, because there is no synchronization on closing the file objects: child processes can't read the contents after the parent process exits. This is because Python's `tempfile` module explicitly unlinks named temporary files. Please try the following script:

```python
import tempfile
import os
import multiprocessing as mp
import time

with tempfile.NamedTemporaryFile() as f:
    f.write(b'foo')
    f.flush()
    name = f.name

    pid = os.fork()
    print('pid', pid)
    if 0 == pid:
        pname = 'child'
    else:
        pname = 'parent'

    f2 = open(name, 'rb')
    assert f2
    time.sleep(1)

    if pname == 'child':
        time.sleep(1)
```
You'll see this:

```
Traceback (most recent call last):
  File "t.py", line 22, in <module>
    time.sleep(1)
  File "/usr/lib/python3.7/tempfile.py", line 500, in __exit__
    self.close()
  File "/usr/lib/python3.7/tempfile.py", line 507, in close
    self._closer.close()
  File "/usr/lib/python3.7/tempfile.py", line 444, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmplpb16mgk'
```
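The mechanism behind that traceback can be shown without forking: `NamedTemporaryFile` unlinks its file on close, so any other handle to the name goes stale. The `delete=False` workaround below is a standard `tempfile` option, but whether it fits this cache's lifecycle is an assumption; the caller then owns cleanup.

```python
import os
import tempfile

# Single-process demo of the same unlink behaviour that broke the
# forked script above: the file vanishes when the handle closes.
with tempfile.NamedTemporaryFile() as f:
    f.write(b'foo')
    f.flush()
    name = f.name
    assert os.path.exists(name)    # alive while the handle is open

assert not os.path.exists(name)    # close() unlinked it

# Possible workaround (assuming explicit cleanup is acceptable):
# delete=False keeps the file after close, so another process can
# still open it by name; the owner must remove it explicitly.
g = tempfile.NamedTemporaryFile(delete=False)
g.write(b'foo')
g.close()
assert os.path.exists(g.name)      # survives close()
os.remove(g.name)                  # caller is responsible for cleanup
```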
I see, thank you for the explanation. Can you fix the other issues?
chainerio/cache/file_cache.py (Outdated)

```python
        self._frozen = True

    def preserve(self, name):
        '''Preserve the cache as persistent files in the disk
```
on the disk
chainerio/cache/file_cache.py (Outdated)

```python
    def preserve(self, name):
        '''Preserve the cache as persistent files in the disk

        Once the cache is preserved, cache files are not to be removed
```
cache files are not to be removed => cache files will not be removed
By letting developers explicitly call the `preserve()` method, the cache knows when to freeze the data. This change resolves the issue that cached data couldn't be reused across training sessions (forcing the cache to be re-constructed by re-reading data from remote storage), which limited the use cases to only the same number and configuration of cached data.

See `tests/cache_tests/test_file_cache.py` for possible example usage.
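The preserve/preload workflow described above can be sketched with a toy stand-in. This is illustrative only, not chainerio's `FileCache`: the `ToyFileCache` class, its pickle-based storage, and the `'mycache'` name are all assumptions; only the `put`/`get`/`preserve`/`preload` method names come from this PR's discussion.

```python
import os
import pickle
import tempfile

# Toy stand-in (NOT chainerio's FileCache) showing the workflow: a
# first session fills the cache and preserve()s it under a stable
# name; a later session preload()s it instead of re-reading remote
# storage.

class ToyFileCache:
    def __init__(self, directory):
        self._dir = directory
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def preserve(self, name):
        # Persist under a user-chosen name so the files outlive
        # the session (no automatic deletion afterwards).
        with open(os.path.join(self._dir, name), 'wb') as f:
            pickle.dump(self._data, f)

    def preload(self, name):
        # Later session: reuse the preserved files, skip remote reads.
        with open(os.path.join(self._dir, name), 'rb') as f:
            self._data = pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    cache = ToyFileCache(d)
    cache.put(1, b'bar')
    cache.preserve('mycache')      # first training session

    cache2 = ToyFileCache(d)
    cache2.preload('mycache')      # later session reuses the files
    assert cache2.get(1) == b'bar'
```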