Cache concurrency error #1200

arvoelke · 2016-11-16T03:35:24Z

I ran two Python scripts at the same time and got the error (by chance):

Traceback (most recent call last):
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\doit\action.py", line 383, in execute
    returned_value = self.py_callable(*self.args, **kwargs)
  File "c:\Users\Aaron\CTN\voelker-synapse2016\dodo.py", line 61, in action_simulate
    *r.compile_system(**opts), dt=DT, probe_tau=PROBE_TAU)
  File "c:\Users\Aaron\CTN\voelker-synapse2016\main.py", line 483, in run
    sim.run(sim_t)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\simulator.py", line 142, in __init__
    self._model.build(network, progress_bar=self.progress_bar)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\builder.py", line 121, in build
    built = Builder.build(self, obj, *args, **kwargs)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\builder.py", line 216, in build
    return cls.builders[obj_cls](model, obj, *args, **kwargs)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\builder\network.py", line 116, in build_network
    model.decoder_cache.shrink()
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\cache.py", line 463, in shrink
    self._index.sync()
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\cache.py", line 243, in sync
    replace(self.filename + '.part', self.filename)
  File "c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\nengo-2.2.1.dev0-py2.7.egg\nengo\utils\compat.py", line 46, in replace
    stderr=devnull)
  File "C:\Users\Aaron\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.1.0.1371.win-x86_64\lib\subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['move', '/Y', 'c:/Users/Aaron\\.nengo\\cache\\decoders\\index.part', 'c:/Users/Aaron\\.nengo\\cache\\decoders\\index']' returned non-zero exit status 1

This is using the most recent dev version of Nengo on Windows.

The text was updated successfully, but these errors were encountered:

jgosmann · 2016-11-16T04:19:55Z

Is there any chance you can reproduce this and get some insight what exactly the reason is that the move command fails?

jgosmann · 2016-11-16T04:22:46Z

Is this Python 2 or Python 3? If it is Python 2 (and you are able to somewhat reliably reproduce this), does this also happen with Python 3?

arvoelke · 2016-11-16T04:34:13Z

Python 2.7, and I doubt I can reproduce this unless I write a script to rapidly fire off lots of builds. It seems that I just got 'lucky' once.

But there is something a little peculiar about what I'm doing. The nengo.Simulator() should not be invoking any decoder optimizations. I have done them all by calling the solver myself before building the model (and hand-wiring the weights). I've also combined all of the decoder optimization problems together into one giant matrix, which might make it a little slow.

jgosmann · 2016-11-16T04:36:50Z

Potentially related.

I see two potential causes:

The file lock is not working properly on Windows. We have some basic tests, but they wouldn't catch a race condition that might exist if the file creation itself is not atomic. In that case the source file for the move might have been moved already by another process because the mutual exclusivity isn't given anymore.
Some other Windows weirdness is going on.

This is likely to be fixed for all 64-bit Python, all Python >= 3, and all Unix systems. So I wonder if this is even worth to figure out and fix ...

jgosmann · 2016-11-16T04:37:57Z

The nengo.Simulator() should not be invoking any decoder optimizations. I have done them all by calling the solver myself before building the model (and hand-wiring the weights). I've also combined all of the decoder optimization problems together into one giant matrix, which might make it a little slow.

The cache index might still get loaded and written even if no decoders are read or written.

arvoelke · 2016-11-16T04:39:55Z

This is likely to be fixed for all 64-bit Python, all Python >= 3, and all Unix systems. So I wonder if this is even worth to figure out and fix ...

Perhaps we can make the error mention an issue that suggests this as a fix if they are using Python 2. I'm not super worried about this since I can't get it to happen again, but at least it's documented now.

arvoelke · 2016-12-13T16:08:00Z

Hmmm... this just happened again, but not inside of a thread this time... I was only running one simulation at a time. If I go to the cache directory I see both index and index.part with the same date modified. Moreover, this seems to happen consistently (every 2nd run). I am on the newest dev version. It also does not work on version 2.3.0.

This is a serious issue for me since it means I can't reproduce my paper with the current Nengo (but it works if I switch back to version 2.2.0).

arvoelke · 2016-12-13T16:16:15Z

As a fix, I just wiped the cache directory. It seems fine for the time being again.

jgosmann · 2016-12-13T16:17:59Z

This is a serious issue for me since it means I can't reproduce my paper with the current Nengo (but it works if I switch back to version 2.2.0).

Does it still fail even if you disable the cache?

As a fix, I just wiped the cache directory. It seems fine for the time being again.

Did you backup the prior cache directory? Every piece of information could be helpful in debugging this and it seemed you had it in a state where things are reproducible which would be super helpful.

arvoelke · 2016-12-13T16:25:29Z

Does it still fail even if you disable the cache?

I would assume no because shrink is not implemented by NoDecoderCache?

Did you backup the prior cache directory?

Unfortunately switching versions of nengo automatically wiped my cache directory (and so it might also happen in the earlier versions but my cache just wasn't in the bad state anymore).

jgosmann · 2016-12-13T16:49:54Z

Let me know when it reoccurs.

arvoelke · 2016-12-13T16:59:37Z

In the process of trying to reproduce this, I ran the following script:

import nengo

for seed in range(1000):
    print seed
    
    with nengo.Network(seed=seed) as model:
        nengo.Connection(
            nengo.Ensemble(100, 1, seed=seed),
            nengo.Node(size_in=1))        

    with nengo.Simulator(model, seed=seed) as sim:
        pass

and got the following unrelated error which is making it difficult to get my cache into a bad state...

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-dae63544fda8> in <module>()
     11             nengo.Node(size_in=1))        
     12 
---> 13     with nengo.Simulator(model, seed=seed) as sim:
     14         pass

c:\users\aaron\ctn\nengo\nengo\simulator.pyc in __init__(self, network, dt, seed, model, progress_bar)
    143         if network is not None:
    144             # Build the network into the model
--> 145             self.model.build(network, progress_bar=self.progress_bar)
    146 
    147         # -- map from Signal.base -> ndarray

c:\users\aaron\ctn\nengo\nengo\builder\builder.pyc in build(self, obj, *args, **kwargs)
    119             The object to build into this model.
    120         """
--> 121         built = Builder.build(self, obj, *args, **kwargs)
    122         if self.build_callback is not None:
    123             self.build_callback(obj)

c:\users\aaron\ctn\nengo\nengo\builder\builder.pyc in build(cls, model, obj, *args, **kwargs)
    214                 "Cannot build object of type %r" % type(obj).__name__)
    215 
--> 216         return cls.builders[obj_cls](model, obj, *args, **kwargs)
    217 
    218     @classmethod

c:\users\aaron\ctn\nengo\nengo\builder\network.pyc in build_network(model, network, progress_bar)
    116             model.decoder_cache.shrink()
    117 
--> 118         model.build_callback = None
    119 
    120     # Unset config

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in __exit__(self, exc_type, exc_value, traceback)
    413     def __exit__(self, exc_type, exc_value, traceback):
    414         self.progress.__exit__(exc_type, exc_value, traceback)
--> 415         self.progress_bar.update(self.progress)
    416 
    417     def step(self, n=1):

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    342                             progress.max_steps / self.max_updates)
    343         if next_update_step < progress.n_steps or progress.finished:
--> 344             self.progress_bar.update(progress)
    345             self.last_update_step = progress.n_steps
    346 

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    279         elif long_eta or progress.finished:
    280             self._visible = True
--> 281             self.delegate.update(progress)
    282 
    283     @property

c:\users\aaron\ctn\nengo\nengo\utils\progress.pyc in update(self, progress)
    187         else:
    188             line = self._get_in_progress_line(progress)
--> 189         sys.stdout.write(line)
    190         sys.stdout.flush()
    191 

c:\users\aaron\appdata\local\enthought\canopy\user\lib\site-packages\ipykernel\iostream.pyc in write(self, string)
    315 
    316             is_child = (not self._is_master_process())
--> 317             self._buffer.write(string)
    318             if is_child:
    319                 # newlines imply flush in subprocesses

ValueError: I/O operation on closed file

Will report back with more details later. I have to run for now but can make this into a separate issue. In the mean time I'll disable the progress bar.

jgosmann · 2016-12-13T17:02:46Z

Try to disable the progress bar (nengo.Simulator(..., progress_bar=False)) and open another issue for this error with information on how exactly you are running the script (something seems to have closed the standard output which is weird).

arvoelke · 2016-12-13T18:31:45Z

Okay, so I managed to reproduce this consistently. First, some specs:

nengo 2.3.1.dev
Enthought Canopy Python 2.7.3 (64-bit)
Windows 8 (64-bit)

Now, some steps:

Run the above script (with progress_bar=False) to make the index file sufficiently big. Here is a temporary download link to my decoders directory if you want to skip this step.
Run the following script:

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

The move command will fail randomly (with stderr == "Access is denied.").

If we replace the write() with move /Y dst src to simply move the file back and forth, then there are no issues. So it seems the OS does not immediately release access to other processes after the written file is closed in Python.

This might not be the ideal solution, but it seems to work if we replace the move with a direct read/write:

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    with open(dst, 'wb') as fdst:
        with open(src, 'rb') as fsrc:
            fdst.write(fsrc.read())

This does not throw any errors for me, and so we could do the move directly like so.

jgosmann · 2016-12-13T18:39:08Z

So it seems the OS does not immediately release access to other processes after the written file is closed in Python.

Why Windows, why? 😭

This might not be the ideal solution, but it seems to work if we replace the move with a direct read/write:

I have to think about this. The move is supposed to be atomic and cheap. Both of these things are violated with the manual write because the entire file is read to memory and rewritten to the hard drive. A move just changes the filename without reading the file to memory.

jgosmann · 2016-12-13T18:53:31Z

I wonder if changing shell=True to shell=False changes anything?

jgosmann · 2016-12-13T18:57:14Z

Apart from trying the shell thing, can you also try the following code for me?

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())
        fsrc.flush()
        os.fsync(fsrc.fileno())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

arvoelke · 2016-12-13T18:59:18Z

I wonder if changing shell=True to shell=False changes anything?

May need to change something else too? I get WindowsError: [Error 2] The system cannot find the file specified.

Apart from trying the shell thing, can you also try the following code for me?

I tried this code and got the same error (access is denied).

jgosmann · 2016-12-13T19:03:08Z

One more thing to try:

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb', 0) as fsrc:
        with open(dst, 'rb', 0) as fdst:
            fsrc.write(fdst.read())

    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)

jgosmann · 2016-12-13T19:09:24Z

And another one:

import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print i

    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())

    subprocess.check_call(["move", "/Y", src, dst],
                          shell=True, close_fds=True)

arvoelke · 2016-12-13T19:15:19Z

I tried both of the above and they both have the same issue.

Then .... I had the idea to disable real-time protection on my virus scanner (Avira) and ... voila! No issue anymore ...

So it's not so much a Windows problem as it is a problem with another process swooping in and grabbing the file handler before we can get to it. Then the move command fails, rightfully so.

Hmm... this is a nuisance. Unsure what to do about this...

jgosmann · 2016-12-13T19:26:21Z

Hmm... this is a nuisance. Unsure what to do about this...

Install Linux. ;)

My best idea right now is to retry the move command until it succeeds (or a number of trials or time limit is exceeded).

arvoelke · 2016-12-13T19:27:24Z

There is some discussion in here: http://bugs.python.org/issue1425127

They basically left it as being an issue with running poorly designed third-party software, but chances are someone will encounter this again in the future and it isn't pleasant to debug.

I tried replacing the .part extension with a random number, but that doesn't help (I thought there might be some lag-time).

Why is the index.part needed again? Isn't there a lock that can prevent another process from opening index while it's being modified by this function?

arvoelke · 2016-12-13T19:31:55Z

Install Linux. ;)

There are real-time virus scanners for Linux too (although certainly less common). Just saying.

tbekolay · 2016-12-13T19:34:29Z

Is it possible to whitelist certain directories or files in the virus scanner so that it will ignore the cache directory?

jgosmann · 2016-12-13T19:34:48Z

Well designed anti-virus/indexing tools use oplocks ( http://blogs.msdn.com/b/oldnewthing/archive/2013/04/15/10410965.aspx ) so they can open a file, but seamlessly get out of the way if a normal process needs to take exclusive control of a file or delete it. Sadly "well-designed" is not a term usually associated with anti-virus tools, so errors like this are relatively commonplace.

So at least I know which anti-virus I won't buy. ;)

There are real-time virus scanners for Linux too (although certainly less common). Just saying.

But other processes can't prevent you from deleting a file on Unix systems.

jgosmann · 2016-12-13T19:36:08Z

Why is the index.part needed again?

From the top of my head: It allows other Nengo processes to read the cache while the index file is being written.

jgosmann · 2016-12-13T19:41:56Z

From the top of my head: It allows other Nengo processes to read the cache while the index file is being written.

And that's actually not true, because in most cases that won't be possible because of the lock. The actual reason is to minimize the chance of a corrupted index file when someone kills the Nengo process. (It is not that unlikely to interrupt it while writing the index file in certain situations, but it is unlikely to do it during the move which even is atomic in the best case).

jgosmann · 2016-12-13T19:49:07Z

Potential solutions:

Tell the user to tell their virus scanner to ignore the cache directory. Might be a good thing to do for performance reasons anyways, but I don't expect people to read the documentation carefully enough. So they might still run into this problem.
Deactivate the cache on Windows with Python 2.7. It is not clear whether the same problem exists with other Python versions (those do not use move, but os.replace).
Try move a couple of times if it fails.
Just ignore failures of move and continue on.

I'm leaning towards 3, but I'm interested in what others think.

arvoelke · 2016-12-13T20:06:34Z

For whatever reason, a copy does not run into this issue for me, so there is also option:

subprocess.check_call(["copy", "/Y", src.replace('/', '\\'), dst.replace('/', '\\')], ...)

Note I had to replace the C:/ with C:\ with copy (despite it being okay for move 😕).

tbekolay · 2016-12-13T20:08:18Z

I'd go with 4, if the move fails then warn and disable the cache.

jgosmann · 2016-12-13T20:09:30Z

Copy doesn't run into the issue because it is not trying to delete the source file. It will also read and write all the data and is thus more expensive than move.

jgosmann · 2016-12-13T20:10:31Z

I'd go with 4, if the move fails then warn and disable the cache.

Note that this happens after everything has already been cached. Thus, we can't really disable the cache or at least there is no use in disabling it.

tbekolay · 2016-12-13T20:13:15Z

Note that this happens after everything has already been cached.

Right right... does this result in a corrupt cache state, if the index isn't updated?

jgosmann · 2016-12-13T20:14:21Z

No, the newly cached objects won't be accessible, but the old cache index still exists.

arvoelke · 2016-12-13T20:29:04Z

I'm also happy with 4. I'm also thinking that any error that occurs within the cache system should produce a message detailing various ways to disable the cache?

jgosmann · 2016-12-13T20:31:58Z

Any particular reason against 3? Given we already did all the caching work, wouldn't it be better to try a little bit harder to make it accessible?

arvoelke · 2016-12-13T20:39:45Z

3 is fine if you were to also fallback to 4 in the case that it fails a few times in a row. This does reduce the probability of failure exponentially and so there is merit. It just feels hacky. I may also be biased by working in places that say to always fail-fast, but that was dealing with systems in a very different context.

jgosmann · 2016-12-13T20:47:06Z

It just feels hacky

The whole subprocess call to move feels hacky to me. I don't think the fail-fast philosophy applies here? We're not failing in either case (that would be the current behaviour). The relevant point here seems to be the trade-off between not wasting more time than necessary on an operation that might not complete in many circumstances vs. spending a little bit more time to save a larger amount time in the future when the cached decoders can actually be used. I suppose finding the optimum there is an empirical question. So yeah, maybe it's best to go with 4 as it is the least amount of added code.

arvoelke · 2016-12-13T21:06:32Z

I don't think the fail-fast philosophy applies here? We're not failing in either case (that would be the current behaviour).

This isn't relevant but here (with option 4) we would say that the cache system 'failed' which is then handled by the build system. This is still a failure mode from the perspective of the cache system (and it will not result in a failure for the build system). The exception doesn't need to bubble all the way up the stack to constitute a failure. It only needs to propagate to the point that separates one 'system' from another. The fail-fast philosophy comes in to say that the cache system should fail early rather than attempting k times and then possibly (albeit with lesser probability) still failing. But again I agree this is not entirely relevant; just trying to explain why that came up.

Fixes #1200.

jgosmann · 2017-02-06T20:11:33Z

Dev meeting decision: Fail with a message directing the user to try to disable the anti virus or cache.

Addresses #1200.

Fixes #1200.

Fixes nengo#1200.

arvoelke added the bug label Nov 16, 2016

jgosmann self-assigned this Dec 13, 2016

jgosmann added a commit that referenced this issue Dec 13, 2016

Ignore failed move of cache index and warn.

eb77649

Fixes #1200.

jgosmann mentioned this issue Dec 13, 2016

Ignore failed move of cache index and warn. #1235

Merged

3 tasks

jgosmann added a commit that referenced this issue Feb 6, 2017

Warn if cache move fails.

3957235

Addresses #1200.

jgosmann mentioned this issue Feb 6, 2017

Warn if cache move fails. #1264

Closed

4 tasks

jgosmann removed their assignment Feb 6, 2017

tbekolay pushed a commit that referenced this issue Feb 17, 2017

Ignore failed move of cache index and warn.

90e9ec2

Fixes #1200.

tbekolay closed this as completed in 930041f Feb 17, 2017

adityagilra pushed a commit to adityagilra/nengo that referenced this issue Jun 21, 2017

Ignore failed move of cache index and warn.

9371f85

Fixes nengo#1200.

arvoelke mentioned this issue Sep 5, 2018

Windows cache permission error #1456

Closed

tbekolay mentioned this issue Feb 5, 2019

Fixes for pytest 4.1 #1497

Merged

5 tasks

Cache concurrency error #1200

Cache concurrency error #1200

Comments

arvoelke commented Nov 16, 2016

jgosmann commented Nov 16, 2016

jgosmann commented Nov 16, 2016

arvoelke commented Nov 16, 2016

jgosmann commented Nov 16, 2016

jgosmann commented Nov 16, 2016

arvoelke commented Nov 16, 2016

arvoelke commented Dec 13, 2016 • edited Loading

arvoelke commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016

arvoelke commented Dec 13, 2016

tbekolay commented Dec 13, 2016

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016

tbekolay commented Dec 13, 2016

jgosmann commented Dec 13, 2016

jgosmann commented Dec 13, 2016

tbekolay commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Dec 13, 2016

arvoelke commented Dec 13, 2016 • edited Loading

jgosmann commented Feb 6, 2017

arvoelke commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading

jgosmann commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading

arvoelke commented Dec 13, 2016 •

edited

Loading