pkg/index: delay (even longer) acquiring a write lock in ReceiveBlob #1289

Open
wants to merge 2 commits into master

Conversation
bobg (Contributor) commented on Dec 12, 2019

Possible fix for #878. I've been running with this change for a little while and haven't seen the deadlock again yet; I'm pretty sure I would have by now. Also, `go test -race ./...` reports no (new) race conditions.

@googlebot added the "cla: yes" label (Author has submitted the Google CLA) on Dec 12, 2019
bradfitz (Contributor) commented

It'd be nice to understand the problem more & ideally have a test for it, though.

bobg (Contributor, Author) commented on Dec 20, 2019

Agreed. The root cause is pretty clear: Index.ReceiveBlob was calling populateFile while holding the write lock, and populateFile issued requests (handled in another goroutine) that themselves needed the write lock in Index.ReceiveBlob when file-chunk blobs had to be added.

There are two things I still don't quite understand. First, why didn't this deadlock happen every time? And second, how did it spontaneously resolve after a while? A timeout on the calling goroutine's HTTP request could explain that, but the time before resolution was too long and too irregular, and I never saw any error messages mentioning "timeout" or "context canceled" or anything like that.
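For reference, here is a minimal sketch of the lock-ordering shape described above, reduced to something the Go runtime can flag directly. The names (Index, ReceiveBlob, populateFile) are stand-ins; this is not the actual perkeep code:

```go
// Hypothetical sketch of the deadlock shape described above; names are
// stand-ins for Index.ReceiveBlob / populateFile, not the real perkeep code.
package main

import "sync"

type Index struct {
	mu sync.RWMutex // guards index state
}

func (ix *Index) ReceiveBlob(b []byte) {
	ix.mu.Lock() // write lock acquired before any work is done
	defer ix.mu.Unlock()
	ix.populateFile(b)
}

// populateFile hands work to another goroutine and waits for it. If that
// goroutine needs to add a file-chunk blob, it calls ReceiveBlob and blocks
// on the same write lock our caller still holds: both sides wait forever.
func (ix *Index) populateFile(b []byte) {
	done := make(chan struct{})
	go func() {
		ix.ReceiveBlob(nil) // blocks on ix.mu.Lock()
		close(done)
	}()
	<-done // never returns; the runtime reports "all goroutines are asleep"
}

func main() {
	ix := &Index{}
	ix.ReceiveBlob(nil)
}
```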

bobg (Contributor, Author) commented on Dec 21, 2019

> The root cause is pretty clear

I take it back. I misunderstood populateFile, which is (in principle, at least) a read-only operation with respect to the blob server, though perhaps the concrete type of the Fetcher it takes does some mutation somewhere. The goroutine holding the write lock in this example is not waiting for some other perkeepd thread to satisfy a write request; it's just reading from Google Cloud Storage.

The other goroutine that's waiting for a write lock in that example is part of the sync loop, which seems fine.

Maybe the sync loop being preempted for a long time by lots and lots of writes eventually causes a writer to get stuck while holding the lock?

One thing I do know: the change in this PR definitely fixes the problem. I ran a `cp -a` that spent over a week populating my GCS perkeep storage, and it didn't deadlock at all.
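For anyone skimming, the shape of the change the title describes is roughly this (a sketch under illustrative names, not the actual diff): do everything that can be done without the lock first, and take the write lock only for the final index mutation.

```go
// A minimal sketch (not the actual diff in this PR) of delaying the write
// lock: slow, read-only work happens before locking, and the lock is held
// only for the in-memory commit. All names and fields here are illustrative.
package main

import (
	"fmt"
	"sync"
)

type Index struct {
	mu   sync.RWMutex
	rows map[string]string
}

func (ix *Index) receiveBlob(key, value string) {
	// Slow part first, without the lock: fetching chunks, parsing schema
	// blobs, anything that may involve other goroutines.
	pending := map[string]string{key: value}

	// Only the commit of the computed rows happens under the write lock.
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for k, v := range pending {
		ix.rows[k] = v
	}
}

func main() {
	ix := &Index{rows: make(map[string]string)}
	ix.receiveBlob("key1", "row1")
	fmt.Println(len(ix.rows)) // 1
}
```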
