pkg/index: delay (even longer) acquiring a write lock in ReceiveBlob #1289

Open
wants to merge 2 commits into master

Conversation
bobg (Contributor) commented on Dec 12, 2019

Possible fix for #878. I've been running with this change for a little while and haven't seen the deadlock again yet; I'm pretty sure I would have by now. Also, `go test -race ./...` reports no (new) race conditions.

@googlebot added the "cla: yes" label (Author has submitted the Google CLA) on Dec 12, 2019
bradfitz (Contributor) commented

It'd be nice to understand the problem more & ideally have a test for it, though.

bobg (Contributor, Author) commented on Dec 20, 2019

Agreed. The root cause is pretty clear: Index.ReceiveBlob was calling populateFile while holding the write lock, and populateFile issued requests (handled in another goroutine) that themselves needed the write lock in Index.ReceiveBlob when file-chunk blobs had to be added.

There are two things I still don't quite understand. First, why didn't this deadlock happen every time? And second, how did it spontaneously resolve after a while? A timeout on the calling goroutine's HTTP request could explain that, but the time before resolution was too long and too irregular, and I never saw any error messages mentioning "timeout" or "context canceled" or anything like that.
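For reference, here is a minimal sketch of the lock-ordering shape described above, reduced to something the Go runtime can flag directly. The names (Index, ReceiveBlob, populateFile) are stand-ins; this is not the actual perkeep code:

```go
// Hypothetical sketch of the deadlock shape described above; names are
// stand-ins for Index.ReceiveBlob / populateFile, not the real perkeep code.
package main

import "sync"

type Index struct {
	mu sync.RWMutex // guards index state
}

func (ix *Index) ReceiveBlob(b []byte) {
	ix.mu.Lock() // write lock acquired before any work is done
	defer ix.mu.Unlock()
	ix.populateFile(b)
}

// populateFile hands work to another goroutine and waits for it. If that
// goroutine needs to add a file-chunk blob, it calls ReceiveBlob and blocks
// on the same write lock our caller still holds: both sides wait forever.
func (ix *Index) populateFile(b []byte) {
	done := make(chan struct{})
	go func() {
		ix.ReceiveBlob(nil) // blocks on ix.mu.Lock()
		close(done)
	}()
	<-done // never returns; the runtime reports "all goroutines are asleep"
}

func main() {
	ix := &Index{}
	ix.ReceiveBlob(nil)
}
```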

bobg (Contributor, Author) commented on Dec 21, 2019

> The root cause is pretty clear

I take it back. I misunderstood populateFile, which is (in principle, at least) a read-only operation with respect to the blob server, though perhaps the concrete type of the Fetcher it takes does some mutation somewhere. The goroutine holding the write lock in this example is not waiting for some other perkeepd thread to satisfy a write request; it's just reading from Google Cloud Storage.

The other goroutine that's waiting for a write lock in that example is part of the sync loop, which seems fine.

Maybe the sync loop being preempted for a long time by lots and lots of writes eventually causes a writer to get stuck while holding the lock?

One thing I do know: the change in this PR definitely fixes the problem. I ran a `cp -a` that spent over a week populating my GCS perkeep storage, and it didn't deadlock at all.
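For anyone skimming, the shape of the change the title describes is roughly this (a sketch under illustrative names, not the actual diff): do everything that can be done without the lock first, and take the write lock only for the final index mutation.

```go
// A minimal sketch (not the actual diff in this PR) of delaying the write
// lock: slow, read-only work happens before locking, and the lock is held
// only for the in-memory commit. All names and fields here are illustrative.
package main

import (
	"fmt"
	"sync"
)

type Index struct {
	mu   sync.RWMutex
	rows map[string]string
}

func (ix *Index) receiveBlob(key, value string) {
	// Slow part first, without the lock: fetching chunks, parsing schema
	// blobs, anything that may involve other goroutines.
	pending := map[string]string{key: value}

	// Only the commit of the computed rows happens under the write lock.
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for k, v := range pending {
		ix.rows[k] = v
	}
}

func main() {
	ix := &Index{rows: make(map[string]string)}
	ix.receiveBlob("key1", "row1")
	fmt.Println(len(ix.rows)) // 1
}
```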
