Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache concurrency error #1200

Closed
arvoelke opened this issue Nov 16, 2016 · 41 comments
Closed

Cache concurrency error #1200

arvoelke opened this issue Nov 16, 2016 · 41 comments
Labels

Comments

@arvoelke
Copy link
Contributor

I ran two Python scripts at the same time and got the error (by chance):

Traceback (most recent call last):
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\doit\action.py", line 383, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "c:\Users\Aaron\CTN\voelker-synapse2016\dodo.py", line 61, in action_simulate
    *r.compile_system(**opts), dt=DT, probe_tau=PROBE_TAU)
  File "c:\Users\Aaron\CTN\voelker-synapse2016\main.py", line 483, in run
    sim.run(sim_t)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\simulator.py", line 142, in __init__
    self._model.build(network, progress_bar=self.progress_bar)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\builder.py", line 121, in build
    built = Builder.build(self, obj, *args, **kwargs)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\builder.py", line 216, in build
    return cls.builders[obj_cls](model, obj, *args, **kwargs)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\network.py", line 116, in build_network
    model.decoder_cache.shrink()
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\cache.py", line 463, in shrink
    self._index.sync()
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\cache.py", line 243, in sync
    replace(self.filename + '.part', self.filename)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\utils\compat.py", line 46, in replace
    stderr=devnull)
  File "C:\Users\Aaron\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['move', '/Y', 'c:/Users/Aaron\\.nengo\\cache\\decoders\\index.part', 'c:/Users/Aaron\\.nengo\\cache\\decoders\\index']' returned non-zero exit status 1

This is using the most recent dev version of Nengo on Windows.

@arvoelke arvoelke added the bug label Nov 16, 2016
@jgosmann
Copy link
Collaborator

Is there any chance you can reproduce this and get some insight what exactly the reason is that the move command fails?

@jgosmann
Copy link
Collaborator

Is this Python 2 or Python 3? If it is Python 2 (and you are able to somewhat reliably reproduce this), does this also happen with Python 3?

@arvoelke
Copy link
Contributor Author

Python 2.7, and I doubt I can reproduce this unless I write a script to rapidly fire off lots of builds. It seems that I just got 'lucky' once.

But there is something a little peculiar about what I'm doing. The nengo.Simulator() should not be invoking any decoder optimizations. I have done them all by calling the solver myself before building the model (and hand-wiring the weights). I've also combined all of the decoder optimization problems together into one giant matrix, which might make it a little slow.

@jgosmann
Copy link
Collaborator

Potentially related.

I see two potential causes:

  1. The file lock is not working properly on Windows. We have some basic tests, but they wouldn't catch a race condition that might exist if the file creation itself is not atomic. In that case the source file for the move might have been moved already by another process because the mutual exclusivity isn't given anymore.
  2. Some other Windows weirdness is going on.

This is likely to be fixed for all 64-bit Python, all Python >= 3, and all Unix systems. So I wonder if this is even worth to figure out and fix ...

@jgosmann
Copy link
Collaborator

The nengo.Simulator() should not be invoking any decoder optimizations. I have done them all by calling the solver myself before building the model (and hand-wiring the weights). I've also combined all of the decoder optimization problems together into one giant matrix, which might make it a little slow.

The cache index might still get loaded and written even if no decoders are read or written.

@arvoelke
Copy link
Contributor Author

This is likely to be fixed for all 64-bit Python, all Python >= 3, and all Unix systems. So I wonder if this is even worth to figure out and fix ...

Perhaps we can make the error mention an issue that suggests this as a fix if they are using Python 2. I'm not super worried about this since I can't get it to happen again, but at least it's documented now.

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

Hmmm... this just happened again, but not inside of a thread this time... I was only running one simulation at a time. If I go to the cache directory I see both index and index.part with the same date modified. Moreover, this seems to happen consistently (every 2nd run). I am on the newest dev version. It also does not work on version 2.3.0.

This is a serious issue for me since it means I can't reproduce my paper with the current Nengo (but it works if I switch back to version 2.2.0).

@arvoelke
Copy link
Contributor Author

As a fix, I just wiped the cache directory. It seems fine for the time being again.

@jgosmann
Copy link
Collaborator

This is a serious issue for me since it means I can't reproduce my paper with the current Nengo (but it works if I switch back to version 2.2.0).

Does it still fail even if you disable the cache?

As a fix, I just wiped the cache directory. It seems fine for the time being again.

Did you backup the prior cache directory? Every piece of information could be helpful in debugging this and it seemed you had it in a state where things are reproducible which would be super helpful.

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

Does it still fail even if you disable the cache?

I would assume no because shrink is not implemented by NoDecoderCache?

Did you backup the prior cache directory?

Unfortunately switching versions of nengo automatically wiped my cache directory (and so it might also happen in the earlier versions but my cache just wasn't in the bad state anymore).

@jgosmann
Copy link
Collaborator

Let me know when it reoccurs.

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

In the process of trying to reproduce this, I ran the following script:

import nengo

for seed in range(1000):
    print seed
    
    with nengo.Network(seed=seed) as model:
        nengo.Connection(
            nengo.Ensemble(100, 1, seed=seed),
            nengo.Node(size_in=1))        

    with nengo.Simulator(model, seed=seed) as sim:
        pass

and got the following unrelated error which is making it difficult to get my cache into a bad state...

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-dae63544fda8> in <module>()
     11             nengo.Node(size_in=1))        
     12 
---> 13     with nengo.Simulator(model, seed=seed) as sim:
     14         pass

c:\users\aaron\ctn\nengo\nengo\simulator.pyc in __init__(self, network, dt, seed, model, progress_bar)
    143         if network is not None:
    144             # Build the network into the model
--> 145             self.model.build(network, progress_bar=self.progress_bar)
    146 
    147         # -- map from Signal.base -> ndarray

c:\users\aaron\ctn\nengo\nengo\builder\builder.pyc in build(self, obj, *args, **kwargs)
    119             The object to build into this model.
    120         """
--> 121         built = Builder.build(self, obj, *args, **kwargs)
    122         if self.build_callback is not None:
    123             self.build_callback(obj)

c:\users\aaron\ctn\nengo\nengo\builder\builder.pyc in build(cls, model, obj, *args, **kwargs)
    214                 "Cannot build object of type %r" % type(obj).__name__)
    215 
--> 216         return cls.builders[obj_cls](model, obj, *args, **kwargs)
    217 
    218     @classmethod

c:\users\aaron\ctn\nengo\nengo\builder\network.pyc in build_network(model, network, progress_bar)
    116             model.decoder_cache.shrink()
    117 
--> 118         model.build_callback = None
    119 
    120     # Unset config

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in __exit__(self, exc_type, exc_value, traceback)
    413     def __exit__(self, exc_type, exc_value, traceback):
    414         self.progress.__exit__(exc_type, exc_value, traceback)
--> 415         self.progress_bar.update(self.progress)
    416 
    417     def step(self, n=1):

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    342                             progress.max_steps / self.max_updates)
    343         if next_update_step < progress.n_steps or progress.finished:
--> 344             self.progress_bar.update(progress)
    345             self.last_update_step = progress.n_steps
    346 

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    279         elif long_eta or progress.finished:
    280             self._visible = True
--> 281             self.delegate.update(progress)
    282 
    283     @property

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    187         else:
    188             line = self._get_in_progress_line(progress)
--> 189         sys.stdout.write(line)
    190         sys.stdout.flush()
    191 

c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\ipykernel\iostream.pyc in write(self, string)
    315 
    316             is_child = (not self._is_master_process())
--> 317             self._buffer.write(string)
    318             if is_child:
    319                 # newlines imply flush in subprocesses

ValueError: I/O operation on closed file

Will report back with more details later. I have to run for now but can make this into a separate issue. In the mean time I'll disable the progress bar.

@jgosmann
Copy link
Collaborator

Try to disable the progress bar (nengo.Simulator(..., progress_bar=False)) and open another issue for this error with information on how exactly you are running the script (something seems to have closed the standard output which is weird).

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

Okay, so I managed to reproduce this consistently. First, some specs:

  • nengo 2.3.1.dev
  • Enthought Canopy Python 2.7.3 (64-bit)
  • Windows 8 (64-bit)

Now, some steps:

  1. Run the above script (with progress_bar=False) to make the index file sufficiently big. Here is a temporary download link to my decoders directory if you want to skip this step.
  2. Run the following script:
import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

The move command will fail randomly (with stderr == "Access is denied.").

If we replace the write() with move /Y dst src to simply move the file back and forth, then there are no issues. So it seems the OS does not immediately release access to other processes after the written file is closed in Python.

This might not be the ideal solution, but it seems to work if we replace the move with a direct read/write:

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    with open(dst, 'wb') as fdst:
        with open(src, 'rb') as fsrc:
            fdst.write(fsrc.read())

This does not throw any errors for me, and so we could do the move directly like so.

@jgosmann
Copy link
Collaborator

jgosmann commented Dec 13, 2016

So it seems the OS does not immediately release access to other processes after the written file is closed in Python.

Why Windows, why? 😭

This might not be the ideal solution, but it seems to work if we replace the move with a direct read/write:

I have to think about this. The move is supposed to be atomic and cheap. Both of these things are violated with the manual write because the entire file is read to memory and rewritten to the hard drive. A move just changes the filename without reading the file to memory.

@jgosmann
Copy link
Collaborator

I wonder if changing shell=True to shell=False changes anything?

@jgosmann
Copy link
Collaborator

Apart from trying the shell thing, can you also try the following code for me?

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())
        fsrc.flush()
        os.fsync(fsrc.fileno())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

I wonder if changing shell=True to shell=False changes anything?

May need to change something else too? I get WindowsError: [Error 2] The system cannot find the file specified.

Apart from trying the shell thing, can you also try the following code for me?

I tried this code and got the same error (access is denied).

@jgosmann
Copy link
Collaborator

One more thing to try:

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb', 0) as fsrc:
        with open(dst, 'rb', 0) as fdst:
            fsrc.write(fdst.read())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

@jgosmann
Copy link
Collaborator

And another one:

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    subprocess.check_call(["move", "/Y", src, dst],
                          shell=True, close_fds=True)

@arvoelke
Copy link
Contributor Author

I tried both of the above and they both have the same issue.

Then .... I had the idea to disable real-time protection on my virus scanner (Avira) and ... voila! No issue anymore ...

So it's not so much a Windows problem as it is a problem with another process swooping in and grabbing the file handler before we can get to it. Then the move command fails, rightfully so.

Hmm... this is a nuisance. Unsure what to do about this...

@jgosmann
Copy link
Collaborator

Hmm... this is a nuisance. Unsure what to do about this...

Install Linux. ;)

My best idea right now is to retry the move command until it succeeds (or a number of trials or time limit is exceeded).

@arvoelke
Copy link
Contributor Author

There is some discussion in here: http://bugs.python.org/issue1425127

They basically left it as being an issue with running poorly designed third-party software, but chances are someone will encounter this again in the future and it isn't pleasant to debug.

I tried replacing the .part extension with a random number, but that doesn't help (I thought there might be some lag-time).

Why is the index.part needed again? Isn't there a lock that can prevent another process from opening index while it's being modified by this function?

@arvoelke
Copy link
Contributor Author

Install Linux. ;)

There are real-time virus scanners for Linux too (although certainly less common). Just saying.

@tbekolay
Copy link
Member

Is it possible to whitelist certain directories or files in the virus scanner so that it will ignore the cache directory?

@jgosmann
Copy link
Collaborator

Well designed anti-virus/indexing tools use oplocks ( http://blogs.msdn.com/b/oldnewthing/archive/2013/04/15/10410965.aspx ) so they can open a file, but seamlessly get out of the way if a normal process needs to take exclusive control of a file or delete it. Sadly "well-designed" is not a term usually associated with anti-virus tools, so errors like this are relatively commonplace.

So at least I know which anti-virus I won't buy. ;)

There are real-time virus scanners for Linux too (although certainly less common). Just saying.

But other processes can't prevent you from deleting a file on Unix systems.

@jgosmann
Copy link
Collaborator

Why is the index.part needed again?

From the top of my head: It allows other Nengo processes to read the cache while the index file is being written.

@jgosmann
Copy link
Collaborator

From the top of my head: It allows other Nengo processes to read the cache while the index file is being written.

And that's actually not true, because in most cases that won't be possible because of the lock. The actual reason is to minimize the chance of a corrupted index file when someone kills the Nengo process. (It is not that unlikely to interrupt it while writing the index file in certain situations, but it is unlikely to do it during the move which even is atomic in the best case).

@jgosmann
Copy link
Collaborator

Potential solutions:

  1. Tell the user to tell their virus scanner to ignore the cache directory. Might be a good thing to do for performance reasons anyways, but I don't expect people to read the documentation carefully enough. So they might still run into this problem.
  2. Deactivate the cache on Windows with Python 2.7. It is not clear whether the same problem exists with other Python versions (those do not use move, but os.replace).
  3. Try move a couple of times if it fails.
  4. Just ignore failures of move and continue on.

I'm leaning towards 3, but I'm interested in what others think.

@jgosmann jgosmann self-assigned this Dec 13, 2016
@arvoelke
Copy link
Contributor Author

For whatever reason, a copy does not run into this issue for me, so there is also option:

  1. subprocess.check_call(["copy", "/Y", src.replace('/', '\\'), dst.replace('/', '\\')], ...)

Note I had to replace the C:/ with C:\ with copy (despite it being okay for move 😕).

@tbekolay
Copy link
Member

I'd go with 4, if the move fails then warn and disable the cache.

@jgosmann
Copy link
Collaborator

Copy doesn't run into the issue because it is not trying to delete the source file. It will also read and write all the data and is thus more expensive than move.

@jgosmann
Copy link
Collaborator

I'd go with 4, if the move fails then warn and disable the cache.

Note that this happens after everything has already been cached. Thus, we can't really disable the cache or at least there is no use in disabling it.

@tbekolay
Copy link
Member

Note that this happens after everything has already been cached.

Right right... does this result in a corrupt cache state, if the index isn't updated?

@jgosmann
Copy link
Collaborator

No, the newly cached objects won't be accessible, but the old cache index still exists.

@arvoelke
Copy link
Contributor Author

I'm also happy with 4. I'm also thinking that any error that occurs within the cache system should produce a message detailing various ways to disable the cache?

@jgosmann
Copy link
Collaborator

Any particular reason against 3? Given we already did all the caching work, wouldn't it be better to try a little bit harder to make it accessible?

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

3 is fine if you were to also fallback to 4 in the case that it fails a few times in a row. This does reduce the probability of failure exponentially and so there is merit. It just feels hacky. I may also be biased by working in places that say to always fail-fast, but that was dealing with systems in a very different context.

@jgosmann
Copy link
Collaborator

It just feels hacky

The whole subprocess call to move feels hacky to me. I don't think the fail-fast philosophy applies here? We're not failing in either case (that would be the current behaviour). The relevant point here seems to be the trade-off between not wasting more time than necessary on an operation that might not complete in many circumstances vs. spending a little bit more time to save a larger amount time in the future when the cached decoders can actually be used. I suppose finding the optimum there is an empirical question. So yeah, maybe it's best to go with 4 as it is the least amount of added code.

@arvoelke
Copy link
Contributor Author

arvoelke commented Dec 13, 2016

I don't think the fail-fast philosophy applies here? We're not failing in either case (that would be the current behaviour).

This isn't relevant but here (with option 4) we would say that the cache system 'failed' which is then handled by the build system. This is still a failure mode from the perspective of the cache system (and it will not result in a failure for the build system). The exception doesn't need to bubble all the way up the stack to constitute a failure. It only needs to propagate to the point that separates one 'system' from another. The fail-fast philosophy comes in to say that the cache system should fail early rather than attempting k times and then possibly (albeit with lesser probability) still failing. But again I agree this is not entirely relevant; just trying to explain why that came up.

@jgosmann
Copy link
Collaborator

jgosmann commented Feb 6, 2017

Dev meeting decision: Fail with a message directing the user to try to disable the anti virus or cache.

jgosmann added a commit that referenced this issue Feb 6, 2017
@jgosmann jgosmann mentioned this issue Feb 6, 2017
4 tasks
@jgosmann jgosmann removed their assignment Feb 6, 2017
tbekolay pushed a commit that referenced this issue Feb 17, 2017
adityagilra pushed a commit to adityagilra/nengo that referenced this issue Jun 21, 2017
@tbekolay tbekolay mentioned this issue Feb 5, 2019
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants