
Add persistence to cache.FileCache #90

Merged
merged 2 commits into master from persistent-cache on Dec 2, 2019

Conversation

kuenishi
Member

By letting developers explicitly call the preserve() method, the cache knows when to freeze its data. This change resolves the issue that cached data could not be reused across training sessions (forcing the cache to be rebuilt by re-reading data from remote storage), which limited the use cases to sessions with the same number and configuration of cached entries.

See tests/cache_tests/test_file_cache.py for possible example usage.
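As a rough illustration of the intended workflow, the sketch below is a toy stand-in, not the real cache.FileCache: the class name FileCacheSketch, the pickle-based storage, and all method bodies are assumptions for illustration. Only the method names (put, get, preserve, preload) come from this PR; treat tests/cache_tests/test_file_cache.py as authoritative.

```python
import os
import pickle
import tempfile


class FileCacheSketch:
    """Toy stand-in for cache.FileCache; pickle storage is an assumption."""

    def __init__(self, directory):
        self.directory = directory
        self.data = {}
        self.frozen = False

    def put(self, i, value):
        # Entries are immutable and no writes are allowed once frozen.
        assert not self.frozen, 'cache is frozen; no further writes'
        self.data[i] = value

    def get(self, i):
        return self.data.get(i)

    def preserve(self, name):
        # Freeze the cache and persist it under an explicit name so a
        # later training session can reuse it without re-reading data
        # from remote storage.
        with open(os.path.join(self.directory, name), 'wb') as f:
            pickle.dump(self.data, f)
        self.frozen = True

    def preload(self, name):
        # Reopen a previously preserved cache, read-only.
        with open(os.path.join(self.directory, name), 'rb') as f:
            self.data = pickle.load(f)
        self.frozen = True


with tempfile.TemporaryDirectory() as d:
    cache = FileCacheSketch(d)
    cache.put(1, b'bar')          # first session: fill the cache
    cache.preserve('mycache')     # then persist it by name

    cache2 = FileCacheSketch(d)
    cache2.preload('mycache')     # second session: reuse, no re-read
    print(cache2.get(1))          # b'bar'
```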

@kuenishi kuenishi added the cat:feature Implementation that introduces new interfaces. label Nov 27, 2019
@belldandyxtq
Member

I think this closes #3

@@ -251,3 +255,56 @@ def close(self):
self.datafp.close()
self.indexfp = None
self.datafp = None

def preload(self, name):
Member

Is it possible to split the preload and the freeze of the cache?
In the current implementation, the user has to do the following to freeze the cache and make it multi-process safe:
load => preserve => preload

If we can split these two functionalities, the user can simply do "load" => "freeze". Moreover, the cache could still be modified before "freeze".

Member Author

What do you mean by load?
If "load" means writing data into cache files, that means creating a file cache with a specific name, if I understand correctly. That is unacceptable, because a job failure would leave partially written named files behind, and those would corrupt the cache and any subsequent re-run of the job.

Also, the principle of the cache in this module is that data is immutable and deterministic; there should be no modification. What do you mean by "modified"?

Member

Sorry to cause confusion: by "load" I mean reading data from remote storage. If I understand correctly, in order to use this module, the user has to do the following:

  1. read data from remote storage
  2. preserve
  3. preload

My suggestion is to have a "freeze" function, so that we can:

  1. read data from remote storage
  2. freeze

to support multiprocessing. I do not think it would increase the risk of having a broken cache.

By modifying, I do not mean changing the data itself, but the ability to add new cache entries.
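The proposed two-step workflow could be sketched as follows. This is purely hypothetical: no freeze() method exists in this PR, and FreezableCache and its in-memory storage are illustrative assumptions.

```python
# Hypothetical sketch of the proposal: freeze() only stops further
# writes (making the cache safe to share across worker processes)
# without naming or persisting it on disk.
class FreezableCache:
    def __init__(self):
        self.data = {}
        self.frozen = False

    def put(self, i, value):
        # New entries can be added only before the cache is frozen.
        if self.frozen:
            raise RuntimeError('cache is frozen')
        self.data[i] = value

    def get(self, i):
        return self.data.get(i)

    def freeze(self):
        # Step 2: forbid further additions; existing entries stay readable.
        self.frozen = True


cache = FreezableCache()
cache.put(1, b'bar')   # 1. read data from remote storage
cache.freeze()         # 2. freeze
print(cache.get(1))    # b'bar'
```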

Member Author

@kuenishi kuenishi Nov 29, 2019

I do not think it would increase the risk of having a broken cache.

I believe it definitely happens. For example:

  1. Create or open a cache file named "foo.cache"
  2. Write 50% of the data to "foo.cache", e.g. put(1, b'bar')
  3. Job fails

Even in this case a partial write may happen, e.g. b'ba' was written to disk but b'r' wasn't. Then, on restarting the job:

  1. Open the 50%-written cache file named "foo.cache"
  2. Can't run put(1, b'bar') because entry 1 is immutable
  3. Job fails

In this case it is possible neither to put the data nor to run get(1) if a partial write happened. This is a typical scenario of broken data, and broken cache files must be thrown away.
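The failure mode above can be made concrete with a toy record format. The length-prefixed layout below is an assumption for illustration, not the actual on-disk format of this cache; it only shows why a truncated value makes the entry unreadable.

```python
import io
import struct

def write_record(buf, key, value):
    # Assumed toy format: 4-byte key, 4-byte length, then the value bytes.
    buf.write(struct.pack('<ii', key, len(value)))
    buf.write(value)

def read_record(buf):
    key, length = struct.unpack('<ii', buf.read(8))
    value = buf.read(length)
    if len(value) != length:
        # The header promises more bytes than the file contains.
        raise IOError('truncated record: cache file is broken')
    return key, value

buf = io.BytesIO()
write_record(buf, 1, b'bar')

# Simulate the job failing mid-write: only b'ba' reached the disk.
truncated = io.BytesIO(buf.getvalue()[:-1])
try:
    read_record(truncated)
except IOError as e:
    print(e)   # truncated record: cache file is broken
```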

Member

I think we are misunderstanding each other. In your code you integrated the "freeze" into preserve, which I had missed, so step 3 is not necessary. But that design forces the user to preserve the cache, which is not always wanted; for example, they may just want to enable multiprocessing, which only needs "freezing" the cache. Preserving the data also disables the automatic deletion of the cache.

In my example, the cache would not be frozen until all the data is loaded, and after "freeze" no data could be added to the cache. If the job fails at a "put", which is necessarily before the "freeze", the cache is neither frozen nor preserved. That is why I think such a design would not increase the risk of a broken cache.

Sorry for the confusion.

Member Author

Automatic deletion of the cache among multiple processes does not work correctly, because there is no synchronization on closing the file objects: child processes can't read the contents after the parent process exits. This is because Python's tempfile module explicitly unlinks named temporary files. Please try the following script:

import os
import tempfile
import time

with tempfile.NamedTemporaryFile() as f:
    f.write(b'foo')
    f.flush()
    name = f.name

    pid = os.fork()
    print('pid', pid)
    pname = 'child' if pid == 0 else 'parent'

    # Both processes can still open the file by name at this point...
    f2 = open(name, 'rb')
    assert f2
    time.sleep(1)
    if pname == 'child':
        # ...but the parent exits the `with` block first and unlinks the
        # temporary file, so the child's __exit__ fails to unlink it again.
        time.sleep(1)

You'll see this:

Traceback (most recent call last):
  File "t.py", line 22, in <module>
    time.sleep(1)
  File "/usr/lib/python3.7/tempfile.py", line 500, in __exit__
    self.close()
  File "/usr/lib/python3.7/tempfile.py", line 507, in close
    self._closer.close()
  File "/usr/lib/python3.7/tempfile.py", line 444, in close
    unlink(self.name)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmplpb16mgk'

Member

I see, thank you for the explanation. Can you fix the other issues?

self._frozen = True

def preserve(self, name):
'''Preserve the cache as persistent files in the disk
Member

on the disk

def preserve(self, name):
'''Preserve the cache as persistent files in the disk

Once the cache is preserved, cache files are not to be removed
Member

cache files are not to be removed => cache files will not be removed

@belldandyxtq belldandyxtq merged commit 68fa1f6 into master Dec 2, 2019
@kuenishi kuenishi deleted the persistent-cache branch January 31, 2020 05:28