-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make writing of cache index atomic. #1107
Conversation
By analyzing the blame information on this pull request, we identified @tbekolay to be a potential reviewer |
pickle.dump( | ||
{k: v for k, v in self._index.items() | ||
if v[0] not in self._removed_files}, | ||
f, pickle.HIGHEST_PROTOCOL) | ||
os.rename(self.filename + '.part', self.filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is self.filename
guaranteed not to exist? That's the only flaw I can see in this implementation. In Unix, the rename will overwrite, but on Windows it won't (reference).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we finally drop support for Windows? It's so annoying. 😠
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a fixup commit to handle this correctly on Windows.
@@ -30,6 +32,13 @@ def ensure_bytes(s): | |||
assert isinstance(s, bytes) | |||
return s | |||
|
|||
def replace(src, dst): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use shutil.move
? It's part of the standard library and does something similar but perhaps more robust...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because shutil.move
depends on os.rename
semantics and thus does not solve any of the problems. os.replace
provides the correct semantics, but only exists in Python 3.3 and newer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looking at the shutil.move
source, it actually does a copy in the case of OSError on rename, and then deletes the source. The copy function they use will actually overwrite I think, but it seems like some extra work to copy the file. So I'd probably just go with this, unless there's a clear way in which shutil.move
is more robust.
For whatever reason, I tracked down the commit in CPython that added It's in C so I'm satisfied with reimplementing this in Python 2 and using it in Python 3. |
Does this need a changelog entry? |
I would add one as a bugfix, yes. |
Actually, if we're not adding a changelog entry for #1110, then we probably don't need one here either. So I think it's fine as-is, except for that one test that's failing. Also, before we merge this, I think we should rebase and re-run all the tests (though don't we always...) since there might be interactions with #1110. |
Is there a way to restart the AppVeyor build? |
Restarted; if you log in with your Github account you should be able to link to the nengo dev user which can restart builds... I think? |
After extensive searching I have determined that it is not possible to restart an appveyor branch. At least not from the type of account Jan and I have. Consequently, should this be rebased to see if the tests pass? And then we can continue with the review? |
Rebase has to be done sooner or later anyways. @tbekolay you're assigned to this, do you want to do the rebase? Or do you want to reassign this to me and let me handle it? |
I'm looking at this now, so I'll do the rebase if you don't mind. |
1f446b3
to
03d91d7
Compare
Rebased, but kept the fixup and squash commits for the time being. I reworded the changelog message in 03d91d7 (was easier to do in the rebase, sorry). The appveyor build still fails in Python 2 (but succeeds in Python 3). |
Can we drop Windows support? 😡 |
Nope! |
@jgosmann I can fire up Windows to try to debug this if you want? |
Sure. |
Alright... that was legit one of the most annoying bugs I've attempted to fix in some time. But, I think I came up with a solution (ad4ae7d). Works locally, and it seems to work in appveyor too! Explanation of the bug and fix: In 32-bit Python 2.7 (but not 2.7 64-bit, and not 32-bit 3.5) there was an issue when doing >>> root = os.path.dirname(dst)
>>> os.listdir(root)
['d6', 'index', 'index.part', 'legacy.txt']
>>> os.path.isfile(src)
False
>>> os.path.isfile(dst)
False
>>> src == os.path.join(root, 'index.part')
True
>>> dst == os.path.join(root, 'index')
True So yeah, I have a guess as to the underlying issue, but I haven't verified it... at first, I thought it might be the Windows 260 character path limit, but quickly dismissed that as the paths were only around 100 characters. A bit of searching led me to the fact that Windows will sometimes give you a different path than you actually requested. The most common example is that if you try to access That's just a guess though, I don't know if that's actually the case. What I do know is that it works in the terminal, so while debugging I tried calling out to the shell to do the dirty work for me. >>> import subprocess
>>> subprocess.call(["dir", root])
Volume in drive C has no label.
Volume Serial Number is 45F3-4A7A
Directory of C:\tmp\pytest-of-tbekolay\pytest-7\test_dec_cache
07/27/2016 11:13 PM <DIR> .
07/27/2016 11:13 PM <DIR> ..
07/27/2016 11:13 PM <DIR> 24
07/27/2016 11:13 PM <DIR> 44
07/27/2016 11:13 PM 151 index
07/27/2016 11:13 PM 151 index.part
07/27/2016 11:12 PM 5 legacy.txt
3 File(s) 307 bytes
4 Dir(s) 19,363,864,576 bytes free
>>> subprocess.call(["move", "/Y", src, dst])
1 file(s) moved. So I cleaned that up a bit, and made |
😩 wtf |
Would this be worth rporting as a bug to someone? On Thu, Jul 28, 2016 at 10:08 AM, Jan Gosmann notifications@github.com
|
Since it works in 32-bit Python 3.5 it is likely that a similar bug has been reported and fixed. Also figuring out the root cause will probably take a while... though yeah, I guess Python 2.7 is still supported for the time being, so if we could find the root cause it would be worth reporting. |
@@ -1,6 +1,9 @@ | |||
from __future__ import absolute_import | |||
|
|||
import collections | |||
import errno |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're using this anymore, so can probably get rid of it.
Yep, looks good. |
Cool, I'll merge 👍 🐙 |
If the process is killed while writing the pickle, we can't load it anymore and lose all index information. Thus, we write it to another file first, so we still have the old index in case the process gets killed. Once we're done writing we rename the file (which should be atomic on most file systems and operating systems, but will at least be a much faster operation reducing the likelihood of ending up with an invalid index). Addresses #1097.
6e83614
to
ff90fba
Compare
Description:
Make writing of cache index atomic.
If the process is killed while writing the cache index pickle, we can't load it anymore and thus lose all index information. Thus, we write it to another file first, so we still have the old index in case the process gets killed. Once we're done writing we rename the file (which should be atomic on most file systems and operating systems, but will at least be a much faster operation reducing the likelihood of ending up with an invalid index).
Motivation and context:
Adresses #1097.
Interactions with other PRs:
Besides this we should make sure that the cache fails gracefully if the index cannot be read.
How has this been tested?
Tests still pass. The atomicity of the operation is hard to test as it is an implementation detail.
Where should a reviewer start?
Should be obvious, it's just a two line change ...
How long should this take to review?
Types of changes:
Checklist:
Still to do:
If the process is killed while writing the pickle, we can't load it
anymore and thus lose all index information. Thus, we write it to
another file first, so we still have the old index in case the process
gets killed. Once we're done writing we rename the file (which should
be atomic on most file systems and operating systems, but will at least
be a much faster operation reducing the likelihood of ending up with
an invalid index).
Addresses #1097.