Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make writing of cache index atomic. #1107

Merged
merged 1 commit into from
Jul 29, 2016
Merged

Make writing of cache index atomic. #1107

merged 1 commit into from
Jul 29, 2016

Conversation

jgosmann
Copy link
Collaborator

@jgosmann jgosmann commented Jun 23, 2016

Description:

Make writing of cache index atomic.

If the process is killed while writing the cache index pickle, we can't load it anymore and thus lose all index information. Thus, we write it to another file first, so we still have the old index in case the process gets killed. Once we're done writing we rename the file (which should be atomic on most file systems and operating systems, but will at least be a much faster operation reducing the likelihood of ending up with an invalid index).

Motivation and context:

Adresses #1097.

Interactions with other PRs:

Besides this we should make sure that the cache fails gracefully if the index cannot be read.

How has this been tested?

Tests still pass. The atomicity of the operation is hard to test as it is an implementation detail.

Where should a reviewer start?

Should be obvious, it's just a two line change ...

How long should this take to review?

  • Quick (less than 40 lines changed or changes are straightforward)

Types of changes:

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have read the CONTRIBUTING.rst document.
  • I have updated the documentation accordingly. (No changes necessary.)
  • I have included a changelog entry.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Still to do:

If the process is killed while writing the pickle, we can't load it
anymore and thus lose all index information. Thus, we write it to
another file first, so we still have the old index in case the process
gets killed. Once we're done writing we rename the file (which should
be atomic on most file systems and operating systems, but will at least
be a much faster operation reducing the likelihood of ending up with
an invalid index).

Addresses #1097.

@jgosmann jgosmann added this to the 2.2.0 release milestone Jun 23, 2016
@mention-bot
Copy link

By analyzing the blame information on this pull request, we identified @tbekolay to be a potential reviewer

pickle.dump(
{k: v for k, v in self._index.items()
if v[0] not in self._removed_files},
f, pickle.HIGHEST_PROTOCOL)
os.rename(self.filename + '.part', self.filename)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is self.filename guaranteed not to exist? That's the only flaw I can see in this implementation. In Unix, the rename will overwrite, but on Windows it won't (reference).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we finally drop support for Windows? It's so annoying. 😠

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a fixup commit to handle this correctly on Windows.

@@ -30,6 +32,13 @@ def ensure_bytes(s):
assert isinstance(s, bytes)
return s

def replace(src, dst):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use shutil.move? It's part of the standard library and does something similar but perhaps more robust...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because shutil.move depends on os.rename semantics and thus does not solve any of the problems. os.replace provides the correct semantics, but only exists in Python 3.3 and newer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looking at the shutil.move source, it actually does a copy in the case of OSError on rename, and then deletes the source. The copy function they use will actually overwrite I think, but it seems like some extra work to copy the file. So I'd probably just go with this, unless there's a clear way in which shutil.move is more robust.

@tbekolay
Copy link
Member

For whatever reason, I tracked down the commit in CPython that added os.replace: python/cpython@c64442f

It's in C so I'm satisfied with reimplementing this in Python 2 and using it in Python 3.

@jgosmann
Copy link
Collaborator Author

Does this need a changelog entry?

@hunse
Copy link
Collaborator

hunse commented Jun 24, 2016

I would add one as a bugfix, yes.

@hunse
Copy link
Collaborator

hunse commented Jun 24, 2016

Actually, if we're not adding a changelog entry for #1110, then we probably don't need one here either. So I think it's fine as-is, except for that one test that's failing.

Also, before we merge this, I think we should rebase and re-run all the tests (though don't we always...) since there might be interactions with #1110.

@hunse
Copy link
Collaborator

hunse commented Jun 24, 2016

Oh, sorry, @tbekolay said he's putting a changelog in #1110, so this should have one too then.

@tbekolay tbekolay mentioned this pull request Jun 27, 2016
12 tasks
@tbekolay tbekolay self-assigned this Jun 28, 2016
@jgosmann
Copy link
Collaborator Author

Is there a way to restart the AppVeyor build?

@tbekolay
Copy link
Member

Restarted; if you log in with your Github account you should be able to link to the nengo dev user which can restart builds... I think?

@Seanny123
Copy link
Contributor

After extensive searching I have determined that it is not possible to restart an appveyor branch. At least not from the type of account Jan and I have. Consequently, should this be rebased to see if the tests pass? And then we can continue with the review?

@jgosmann
Copy link
Collaborator Author

Rebase has to be done sooner or later anyways. @tbekolay you're assigned to this, do you want to do the rebase? Or do you want to reassign this to me and let me handle it?

@tbekolay
Copy link
Member

I'm looking at this now, so I'll do the rebase if you don't mind.

@tbekolay
Copy link
Member

Rebased, but kept the fixup and squash commits for the time being. I reworded the changelog message in 03d91d7 (was easier to do in the rebase, sorry). The appveyor build still fails in Python 2 (but succeeds in Python 3).

@jgosmann
Copy link
Collaborator Author

Can we drop Windows support? 😡

@tbekolay
Copy link
Member

Can we drop Windows support?

Nope!

@tbekolay
Copy link
Member

@jgosmann I can fire up Windows to try to debug this if you want?

@jgosmann
Copy link
Collaborator Author

Sure.

@tbekolay
Copy link
Member

Alright... that was legit one of the most annoying bugs I've attempted to fix in some time. But, I think I came up with a solution (ad4ae7d). Works locally, and it seems to work in appveyor too!

Explanation of the bug and fix:

In 32-bit Python 2.7 (but not 2.7 64-bit, and not 32-bit 3.5) there was an issue when doing replace(src='index.part', dst='index') (but the absolute path versions of those). When trying to the replace in 32-bit 2.7, this is what I was getting while debugging:

>>> root = os.path.dirname(dst)
>>> os.listdir(root)
['d6', 'index', 'index.part', 'legacy.txt']
>>> os.path.isfile(src)
False
>>> os.path.isfile(dst)
False
>>> src == os.path.join(root, 'index.part')
True
>>> dst == os.path.join(root, 'index')
True

So yeah, os.listdir says that those files are there. os.path.isfile says that those files are not there. Other functions like os.stat and os.remove use the same machinery as os.path.isfile, so they were failing to remove the files that did indeed exist (as verified in the terminal; you could cat the contents and whatnot, so they were legitimately there).

I have a guess as to the underlying issue, but I haven't verified it... at first, I thought it might be the Windows 260 character path limit, but quickly dismissed that as the paths were only around 100 characters. A bit of searching led me to the fact that Windows will sometimes give you a different path than you actually requested. The most common example is that if you try to access C:\Windows\SysWOW64 in a 32-bit process, it will give you C:\Windows\System32 (and vice versa in 64-bit) whether you like it or not. I think that a similar thing is happening with getting temporary directory, as it might be the case that in 64-bit Windows you get a different path for %TEMP% as in 32-bits, and Windows will automatically (and without telling you) switch the paths if you try to get them.

That's just a guess though, I don't know if that's actually the case. What I do know is that it works in the terminal, so while debugging I tried calling out to the shell to do the dirty work for me.

>>> import subprocess
>>> subprocess.call(["dir", root])
 Volume in drive C has no label.                                
 Volume Serial Number is 45F3-4A7A                              

 Directory of C:\tmp\pytest-of-tbekolay\pytest-7\test_dec_cache 

07/27/2016  11:13 PM    <DIR>          .                        
07/27/2016  11:13 PM    <DIR>          ..                       
07/27/2016  11:13 PM    <DIR>          24                       
07/27/2016  11:13 PM    <DIR>          44                       
07/27/2016  11:13 PM               151 index                    
07/27/2016  11:13 PM               151 index.part              
07/27/2016  11:12 PM                 5 legacy.txt               
               3 File(s)            307 bytes                   
               4 Dir(s)  19,363,864,576 bytes free              
>>> subprocess.call(["move", "/Y", src, dst])
        1 file(s) moved.

So I cleaned that up a bit, and made replace call out to the shell in Windows in Python 2. Which seems to work!

@jgosmann
Copy link
Collaborator Author

😩 wtf

@Seanny123
Copy link
Contributor

Would this be worth rporting as a bug to someone?

On Thu, Jul 28, 2016 at 10:08 AM, Jan Gosmann notifications@github.com
wrote:

😩 wtf


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#1107 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AB1HaYVEQOJb8tbTtlhuMeAr_UjaL8Clks5qaLfUgaJpZM4I9Cne
.

@tbekolay
Copy link
Member

Would this be worth rporting as a bug to someone?

Since it works in 32-bit Python 3.5 it is likely that a similar bug has been reported and fixed. Also figuring out the root cause will probably take a while... though yeah, I guess Python 2.7 is still supported for the time being, so if we could find the root cause it would be worth reporting.

@@ -1,6 +1,9 @@
from __future__ import absolute_import

import collections
import errno
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're using this anymore, so can probably get rid of it.

@tbekolay tbekolay removed their assignment Jul 29, 2016
@tbekolay
Copy link
Member

tbekolay commented Jul 29, 2016

Added @hunse's requested changes; this PR LGTM with all that. @hunse does this look good to you now?

@hunse
Copy link
Collaborator

hunse commented Jul 29, 2016

Yep, looks good.

@tbekolay
Copy link
Member

Cool, I'll merge 👍 🐙

If the process is killed while writing the pickle, we can't load it
anymore and lose all index information. Thus, we write it to another
file first, so we still have the old index in case the process gets
killed. Once we're done writing we rename the file (which should be
atomic on most file systems and operating systems, but will at least
be a much faster operation reducing the likelihood of ending up with
an invalid index).

Addresses #1097.
@tbekolay tbekolay merged commit ff90fba into master Jul 29, 2016
@tbekolay tbekolay deleted the atomic-cache-index branch July 29, 2016 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants