Cache concurrency error #1200
Comments
Is there any chance you can reproduce this and get some insight into exactly why the move command fails? |
Is this Python 2 or Python 3? If it is Python 2 (and you are able to somewhat reliably reproduce this), does this also happen with Python 3? |
Python 2.7, and I doubt I can reproduce this unless I write a script to rapidly fire off lots of builds. It seems that I just got 'lucky' once. But there is something a little peculiar about what I'm doing. The |
I see two potential causes:
This is likely to be fixed for all 64-bit Python builds, all Python >= 3, and all Unix systems, so I wonder whether this is even worth figuring out and fixing ... |
The cache index might still get loaded and written even if no decoders are read or written. |
Perhaps we can make the error mention an issue that suggests this as a fix if they are using Python 2. I'm not super worried about this since I can't get it to happen again, but at least it's documented now. |
Hmmm... this just happened again, but not inside of a thread this time... I was only running one simulation at a time. If I go to the cache directory I see both This is a serious issue for me since it means I can't reproduce my paper with the current Nengo (but it works if I switch back to version 2.2.0). |
As a fix, I just wiped the cache directory. It seems fine for the time being again. |
Does it still fail even if you disable the cache?
Did you back up the prior cache directory? Every piece of information could be helpful in debugging this, and it seems you had it in a state where things were reproducible, which would be super helpful. |
I would assume no because
Unfortunately switching versions of nengo automatically wiped my cache directory (and so it might also happen in the earlier versions but my cache just wasn't in the bad state anymore). |
Let me know when it reoccurs. |
In the process of trying to reproduce this, I ran the following script:
and got the following unrelated error which is making it difficult to get my cache into a bad state...
Will report back with more details later. I have to run for now, but I can make this into a separate issue. In the meantime I'll disable the progress bar. |
Try to disable the progress bar ( |
Okay, so I managed to reproduce this consistently. First, some specs:
Now, some steps:
The If we replace the This might not be the ideal solution, but it seems to work if we replace the
This does not throw any errors for me, and so we could do the |
Why Windows, why? 😭
I have to think about this. The move is supposed to be atomic and cheap. Both of these properties are violated by the manual write because the entire file is read into memory and rewritten to the hard drive. A move just changes the filename without reading the file into memory. |
I wonder if changing |
Apart from trying the
import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print(i)
    # Copy the index to the .part file and force it to disk
    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())
        fsrc.flush()
        os.fsync(fsrc.fileno())
    # Move the .part file over the index, discarding the command's output
    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)
|
May need to change something else too? I get
I tried this code and got the same error (access is denied). |
One more thing to try:
import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print(i)
    # Third argument 0 opens both files unbuffered
    with open(src, 'wb', 0) as fsrc:
        with open(dst, 'rb', 0) as fdst:
            fsrc.write(fdst.read())
    with open(os.devnull, 'w') as devnull:
        subprocess.check_call(["move", "/Y", src, dst],
                              shell=True,
                              stdout=devnull,
                              stderr=devnull)
|
And another one:
import nengo
import os
import subprocess

dst = nengo.cache.get_default_decoder_cache()._index.index_path
src = dst + '.part'

for i in range(100):
    print(i)
    with open(src, 'wb') as fsrc:
        with open(dst, 'rb') as fdst:
            fsrc.write(fdst.read())
    # close_fds=True keeps the child from inheriting open file handles
    subprocess.check_call(["move", "/Y", src, dst],
                          shell=True, close_fds=True)
|
I tried both of the above and they both have the same issue. Then ... I had the idea to disable real-time protection on my virus scanner (Avira) and ... voilà! No issue anymore. So it's not so much a Windows problem as it is a problem with another process swooping in and grabbing the file handle before we can get to it. Then the Hmm... this is a nuisance. Unsure what to do about this... |
Install Linux. ;) My best idea right now is to retry the |
There is some discussion here: http://bugs.python.org/issue1425127 They basically left it as being an issue with running poorly designed third-party software, but chances are someone will encounter this again in the future, and it isn't pleasant to debug. I tried replacing the Why is the |
There are real-time virus scanners for Linux too (although certainly less common). Just saying. |
Is it possible to whitelist certain directories or files in the virus scanner so that it will ignore the cache directory? |
So at least I know which anti-virus I won't buy. ;)
But other processes can't prevent you from deleting a file on Unix systems. |
Off the top of my head: it allows other Nengo processes to read the cache while the index file is being written. |
And that's actually not true, because in most cases that won't be possible because of the lock. The actual reason is to minimize the chance of a corrupted index file when someone kills the Nengo process. (It is not that unlikely to interrupt it while writing the index file in certain situations, but it is unlikely to do it during the |
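The write-then-move reasoning above can be sketched as follows. This is a minimal illustration, not Nengo's implementation: `write_atomically` is a hypothetical helper name, and it relies on `os.replace` (Python 3.3+), whereas the thread's own code shells out to `move /Y` under Python 2.7.

```python
import os


def write_atomically(path, data):
    """Write bytes to `path` via a temporary '.part' file and a final rename.

    Hypothetical helper for illustration; not part of Nengo.
    """
    part = path + ".part"  # same directory, so the rename stays on one volume
    with open(part, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    # If the process is killed before this line, `path` still holds the old,
    # complete contents; only the '.part' file may be incomplete.
    os.replace(part, path)
```

The slow, interruptible step (writing the data) only ever touches the `.part` file, while the step that touches the real index is a single rename. On Windows the rename can presumably still fail if another process holds the file open, which matches the "access is denied" errors reported in this thread.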
Potential solutions:
I'm leaning towards 3, but I'm interested in what others think. |
For whatever reason, a
Note I had to replace the |
I'd go with 4, if the move fails then warn and disable the cache. |
Copy doesn't run into the issue because it is not trying to delete the source file. It will also read and write all the data and is thus more expensive than move. |
Note that this happens after everything has already been cached. Thus, we can't really disable the cache or at least there is no use in disabling it. |
Right right... does this result in a corrupt cache state, if the index isn't updated? |
No, the newly cached objects won't be accessible, but the old cache index still exists. |
I'm also happy with 4. I'm also thinking that any error that occurs within the cache system should produce a message detailing various ways to disable the cache? |
Any particular reason against 3? Given we already did all the caching work, wouldn't it be better to try a little bit harder to make it accessible? |
3 is fine if you were to also fall back to 4 in the case that it fails a few times in a row. This does reduce the probability of failure exponentially, so there is merit. It just feels hacky. I may also be biased by working in places that say to always fail-fast, but that was dealing with systems in a very different context. |
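A rough sketch of the "retry a few times, then give up with a warning" idea, assuming option 3 refers to retrying the move and option 4 to warning the user. The function name, the `attempts` and `delay` parameters, and the injectable `replace` argument are all hypothetical and exist only for illustration; `os.replace` requires Python 3.3+.

```python
import os
import time
import warnings


def replace_with_retry(src, dst, attempts=5, delay=0.01, replace=os.replace):
    """Try to move `src` over `dst`; return True on success, False otherwise.

    Hypothetical helper for illustration; not part of Nengo.
    """
    for attempt in range(attempts):
        try:
            replace(src, dst)
            return True
        except OSError:
            # Another process (e.g. a virus scanner) may briefly hold the file.
            if attempt < attempts - 1:
                time.sleep(delay * 2 ** attempt)  # exponential backoff
    # Give up: the old index stays in place, so the cache is not corrupted,
    # but entries cached during this run will not be findable.
    warnings.warn("Could not update %s; new cache entries will not be "
                  "indexed." % dst)
    return False
```

If each attempt fails independently with some probability, several attempts in a row fail far less often, which is the "exponentially" argument made above. The final warning keeps the failure visible instead of silently dropping the new index.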
The whole subprocess call to |
This isn't relevant but here (with option 4) we would say that the cache system 'failed' which is then handled by the build system. This is still a failure mode from the perspective of the cache system (and it will not result in a failure for the build system). The exception doesn't need to bubble all the way up the stack to constitute a failure. It only needs to propagate to the point that separates one 'system' from another. The fail-fast philosophy comes in to say that the cache system should fail early rather than attempting |
Dev meeting decision: Fail with a message directing the user to try disabling the antivirus software or the cache. |
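One way the decision above could look in code is sketched below. It is an assumption-laden illustration rather than the change that was actually made: `CacheIndexError`, `move_index`, and the message wording are hypothetical, and `os.replace` (Python 3.3+) stands in for the `move /Y` call used elsewhere in this thread.

```python
import os


class CacheIndexError(RuntimeError):
    """Hypothetical exception type for illustration; not part of Nengo."""


def move_index(src, dst, replace=os.replace):
    """Move the freshly written index into place or fail with guidance."""
    try:
        replace(src, dst)
    except OSError as err:
        # Fail immediately, but tell the user what they can do about it.
        raise CacheIndexError(
            "Could not replace the cache index %r (%s). Another process, "
            "such as a real-time virus scanner, may be holding the file. "
            "Try disabling the virus scanner for the cache directory or "
            "disabling the decoder cache." % (dst, err))
```

The caller at the boundary between the cache and the build can catch this exception and continue the build without the updated index, in line with the earlier point that the failure only needs to propagate to the edge of the cache system.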
I ran two Python scripts at the same time and got the error (by chance):
This is using the most recent dev version of Nengo on Windows.