Make writing of cache index atomic. #1107

jgosmann · 2016-06-23T17:21:02Z

Description:

Make writing of cache index atomic.

If the process is killed while writing the cache index pickle, we can't load it anymore and thus lose all index information. Thus, we write it to another file first, so we still have the old index in case the process gets killed. Once we're done writing we rename the file (which should be atomic on most file systems and operating systems, but will at least be a much faster operation reducing the likelihood of ending up with an invalid index).

Motivation and context:

Adresses #1097.

Interactions with other PRs:

Besides this we should make sure that the cache fails gracefully if the index cannot be read.

How has this been tested?

Tests still pass. The atomicity of the operation is hard to test as it is an implementation detail.

Where should a reviewer start?

Should be obvious, it's just a two line change ...

How long should this take to review?

Quick (less than 40 lines changed or changes are straightforward)

Types of changes:

Bug fix (non-breaking change which fixes an issue)

Checklist:

I have read the CONTRIBUTING.rst document.
I have updated the documentation accordingly. (No changes necessary.)
I have included a changelog entry.
I have added tests to cover my changes.
All new and existing tests passed.

Still to do:

If the process is killed while writing the pickle, we can't load it
anymore and thus lose all index information. Thus, we write it to
another file first, so we still have the old index in case the process
gets killed. Once we're done writing we rename the file (which should
be atomic on most file systems and operating systems, but will at least
be a much faster operation reducing the likelihood of ending up with
an invalid index).

Addresses #1097.

mention-bot · 2016-06-23T17:21:04Z

By analyzing the blame information on this pull request, we identified @tbekolay to be a potential reviewer

hunse · 2016-06-23T17:58:49Z

nengo/cache.py

                    pickle.dump(
                        {k: v for k, v in self._index.items()
                         if v[0] not in self._removed_files},
                        f, pickle.HIGHEST_PROTOCOL)
+                os.rename(self.filename + '.part', self.filename)


Is self.filename guaranteed not to exist? That's the only flaw I can see in this implementation. In Unix, the rename will overwrite, but on Windows it won't (reference).

Can we finally drop support for Windows? It's so annoying. 😠

I made a fixup commit to handle this correctly on Windows.

tbekolay · 2016-06-23T18:44:07Z

nengo/utils/compat.py

@@ -30,6 +32,13 @@ def ensure_bytes(s):
        assert isinstance(s, bytes)
        return s

+    def replace(src, dst):


Why not use shutil.move? It's part of the standard library and does something similar but perhaps more robust...

Because shutil.move depends on os.rename semantics and thus does not solve any of the problems. os.replace provides the correct semantics, but only exists in Python 3.3 and newer.

Yeah, looking at the shutil.move source, it actually does a copy in the case of OSError on rename, and then deletes the source. The copy function they use will actually overwrite I think, but it seems like some extra work to copy the file. So I'd probably just go with this, unless there's a clear way in which shutil.move is more robust.

tbekolay · 2016-06-23T23:06:32Z

For whatever reason, I tracked down the commit in CPython that added os.replace: python/cpython@c64442f

It's in C so I'm satisfied with reimplementing this in Python 2 and using it in Python 3.

jgosmann · 2016-06-24T15:40:56Z

Does this need a changelog entry?

hunse · 2016-06-24T15:43:32Z

I would add one as a bugfix, yes.

hunse · 2016-06-24T17:02:04Z

Actually, if we're not adding a changelog entry for #1110, then we probably don't need one here either. So I think it's fine as-is, except for that one test that's failing.

Also, before we merge this, I think we should rebase and re-run all the tests (though don't we always...) since there might be interactions with #1110.

hunse · 2016-06-24T17:05:20Z

Oh, sorry, @tbekolay said he's putting a changelog in #1110, so this should have one too then.

jgosmann · 2016-07-22T21:48:33Z

Is there a way to restart the AppVeyor build?

tbekolay · 2016-07-23T21:02:23Z

Restarted; if you log in with your Github account you should be able to link to the nengo dev user which can restart builds... I think?

Seanny123 · 2016-07-27T00:34:40Z

After extensive searching I have determined that it is not possible to restart an appveyor branch. At least not from the type of account Jan and I have. Consequently, should this be rebased to see if the tests pass? And then we can continue with the review?

jgosmann · 2016-07-27T02:43:01Z

Rebase has to be done sooner or later anyways. @tbekolay you're assigned to this, do you want to do the rebase? Or do you want to reassign this to me and let me handle it?

tbekolay · 2016-07-27T12:57:18Z

I'm looking at this now, so I'll do the rebase if you don't mind.

tbekolay · 2016-07-27T14:36:57Z

Rebased, but kept the fixup and squash commits for the time being. I reworded the changelog message in 03d91d7 (was easier to do in the rebase, sorry). The appveyor build still fails in Python 2 (but succeeds in Python 3).

jgosmann · 2016-07-27T16:16:18Z

Can we drop Windows support? 😡

tbekolay · 2016-07-27T18:31:53Z

Can we drop Windows support?

Nope!

tbekolay · 2016-07-27T18:33:04Z

@jgosmann I can fire up Windows to try to debug this if you want?

jgosmann · 2016-07-27T19:35:51Z

Sure.

tbekolay · 2016-07-28T13:07:16Z

Alright... that was legit one of the most annoying bugs I've attempted to fix in some time. But, I think I came up with a solution (ad4ae7d). Works locally, and it seems to work in appveyor too!

Explanation of the bug and fix:

In 32-bit Python 2.7 (but not 2.7 64-bit, and not 32-bit 3.5) there was an issue when doing replace(src='index.part', dst='index') (but the absolute path versions of those). When trying to the replace in 32-bit 2.7, this is what I was getting while debugging:

>>> root = os.path.dirname(dst)
>>> os.listdir(root)
['d6', 'index', 'index.part', 'legacy.txt']
>>> os.path.isfile(src)
False
>>> os.path.isfile(dst)
False
>>> src == os.path.join(root, 'index.part')
True
>>> dst == os.path.join(root, 'index')
True

So yeah, os.listdir says that those files are there. os.path.isfile says that those files are not there. Other functions like os.stat and os.remove use the same machinery as os.path.isfile, so they were failing to remove the files that did indeed exist (as verified in the terminal; you could cat the contents and whatnot, so they were legitimately there).

I have a guess as to the underlying issue, but I haven't verified it... at first, I thought it might be the Windows 260 character path limit, but quickly dismissed that as the paths were only around 100 characters. A bit of searching led me to the fact that Windows will sometimes give you a different path than you actually requested. The most common example is that if you try to access C:\Windows\SysWOW64 in a 32-bit process, it will give you C:\Windows\System32 (and vice versa in 64-bit) whether you like it or not. I think that a similar thing is happening with getting temporary directory, as it might be the case that in 64-bit Windows you get a different path for %TEMP% as in 32-bits, and Windows will automatically (and without telling you) switch the paths if you try to get them.

That's just a guess though, I don't know if that's actually the case. What I do know is that it works in the terminal, so while debugging I tried calling out to the shell to do the dirty work for me.

>>> import subprocess
>>> subprocess.call(["dir", root])
 Volume in drive C has no label.                                
 Volume Serial Number is 45F3-4A7A                              

 Directory of C:\tmp\pytest-of-tbekolay\pytest-7\test_dec_cache 

07/27/2016  11:13 PM    <DIR>          .                        
07/27/2016  11:13 PM    <DIR>          ..                       
07/27/2016  11:13 PM    <DIR>          24                       
07/27/2016  11:13 PM    <DIR>          44                       
07/27/2016  11:13 PM               151 index                    
07/27/2016  11:13 PM               151 index.part              
07/27/2016  11:12 PM                 5 legacy.txt               
               3 File(s)            307 bytes                   
               4 Dir(s)  19,363,864,576 bytes free              
>>> subprocess.call(["move", "/Y", src, dst])
        1 file(s) moved.

So I cleaned that up a bit, and made replace call out to the shell in Windows in Python 2. Which seems to work!

jgosmann · 2016-07-28T14:08:17Z

😩 wtf

Seanny123 · 2016-07-28T15:01:40Z

Would this be worth rporting as a bug to someone?

On Thu, Jul 28, 2016 at 10:08 AM, Jan Gosmann notifications@github.com
wrote:

😩 wtf

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#1107 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AB1HaYVEQOJb8tbTtlhuMeAr_UjaL8Clks5qaLfUgaJpZM4I9Cne
.

tbekolay · 2016-07-28T15:10:34Z

Would this be worth rporting as a bug to someone?

Since it works in 32-bit Python 3.5 it is likely that a similar bug has been reported and fixed. Also figuring out the root cause will probably take a while... though yeah, I guess Python 2.7 is still supported for the time being, so if we could find the root cause it would be worth reporting.

hunse · 2016-07-28T15:55:25Z

nengo/utils/compat.py

@@ -1,6 +1,9 @@
 from __future__ import absolute_import

 import collections
+import errno


I don't think we're using this anymore, so can probably get rid of it.

tbekolay · 2016-07-29T18:36:27Z

Added @hunse's requested changes; this PR LGTM with all that. @hunse does this look good to you now?

hunse · 2016-07-29T18:42:30Z

Yep, looks good.

tbekolay · 2016-07-29T18:44:25Z

Cool, I'll merge 👍 🐙

If the process is killed while writing the pickle, we can't load it anymore and lose all index information. Thus, we write it to another file first, so we still have the old index in case the process gets killed. Once we're done writing we rename the file (which should be atomic on most file systems and operating systems, but will at least be a much faster operation reducing the likelihood of ending up with an invalid index). Addresses #1097.

jgosmann added the needs review label Jun 23, 2016

jgosmann added this to the 2.2.0 release milestone Jun 23, 2016

jgosmann mentioned this pull request Jun 23, 2016

Graceful fail when cache index cannot be loaded #1097

Closed

hunse reviewed Jun 23, 2016
View reviewed changes

jgosmann added needs changes and removed needs changes labels Jun 23, 2016

tbekolay reviewed Jun 23, 2016
View reviewed changes

jgosmann mentioned this pull request Jun 24, 2016

Graceful handing of corrupted cache index #1110

Merged

5 tasks

tbekolay mentioned this pull request Jun 27, 2016

2.1.1, 2.2.0 releases #1080

Closed

12 tasks

tbekolay self-assigned this Jun 28, 2016

Seanny123 added the needs rebase label Jul 27, 2016

tbekolay force-pushed the atomic-cache-index branch from 1f446b3 to 03d91d7 Compare July 27, 2016 13:46

tbekolay removed the needs rebase label Jul 27, 2016

hunse reviewed Jul 28, 2016
View reviewed changes

tbekolay removed their assignment Jul 29, 2016

tbekolay added needs second review and removed needs review labels Jul 29, 2016

tbekolay force-pushed the atomic-cache-index branch from 6e83614 to ff90fba Compare July 29, 2016 19:08

tbekolay merged commit ff90fba into master Jul 29, 2016

tbekolay deleted the atomic-cache-index branch July 29, 2016 19:24

tbekolay added reviewed and removed needs second review labels Aug 8, 2016

jgosmann mentioned this pull request Nov 16, 2016

Cache concurrency error #1200

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make writing of cache index atomic. #1107

Make writing of cache index atomic. #1107

jgosmann commented Jun 23, 2016 •

edited

Loading

mention-bot commented Jun 23, 2016

hunse Jun 23, 2016

jgosmann Jun 23, 2016

jgosmann Jun 23, 2016

tbekolay Jun 23, 2016

jgosmann Jun 23, 2016

hunse Jun 23, 2016

tbekolay commented Jun 23, 2016

jgosmann commented Jun 24, 2016

hunse commented Jun 24, 2016

hunse commented Jun 24, 2016

hunse commented Jun 24, 2016

jgosmann commented Jul 22, 2016

tbekolay commented Jul 23, 2016

Seanny123 commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 27, 2016

tbekolay commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 27, 2016

tbekolay commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 28, 2016

jgosmann commented Jul 28, 2016

Seanny123 commented Jul 28, 2016

tbekolay commented Jul 28, 2016

hunse Jul 28, 2016

tbekolay commented Jul 29, 2016 •

edited

Loading

hunse commented Jul 29, 2016

tbekolay commented Jul 29, 2016

Make writing of cache index atomic. #1107

Make writing of cache index atomic. #1107

Conversation

jgosmann commented Jun 23, 2016 • edited Loading

mention-bot commented Jun 23, 2016

hunse Jun 23, 2016

Choose a reason for hiding this comment

jgosmann Jun 23, 2016

Choose a reason for hiding this comment

jgosmann Jun 23, 2016

Choose a reason for hiding this comment

tbekolay Jun 23, 2016

Choose a reason for hiding this comment

jgosmann Jun 23, 2016

Choose a reason for hiding this comment

hunse Jun 23, 2016

Choose a reason for hiding this comment

tbekolay commented Jun 23, 2016

jgosmann commented Jun 24, 2016

hunse commented Jun 24, 2016

hunse commented Jun 24, 2016

hunse commented Jun 24, 2016

jgosmann commented Jul 22, 2016

tbekolay commented Jul 23, 2016

Seanny123 commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 27, 2016

tbekolay commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 27, 2016

tbekolay commented Jul 27, 2016

jgosmann commented Jul 27, 2016

tbekolay commented Jul 28, 2016

jgosmann commented Jul 28, 2016

Seanny123 commented Jul 28, 2016

tbekolay commented Jul 28, 2016

hunse Jul 28, 2016

Choose a reason for hiding this comment

tbekolay commented Jul 29, 2016 • edited Loading

hunse commented Jul 29, 2016

tbekolay commented Jul 29, 2016

jgosmann commented Jun 23, 2016 •

edited

Loading

tbekolay commented Jul 29, 2016 •

edited

Loading