Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: nil pointer crash in writer goroutine after CloseWithError #4167

Closed
dt opened this issue May 24, 2021 · 7 comments
Closed

storage: nil pointer crash in writer goroutine after CloseWithError #4167

dt opened this issue May 24, 2021 · 7 comments

Comments

@dt
Copy link

@dt dt commented May 24, 2021

Client

Storage

Code

Extracted/adapted from a more complex reproduction, but more or less:

   for {
	c := storage.NewClient(...)
        w := c.NewWriter(...)
        w.CloseWithError(io.EOF)
        c.Close(...)
   }

Expected behavior

The write is aborted.

Actual behavior

The process crashes:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x375c91a]
...
cloud.google.com/go/storage.(*Writer).open.func1(0xc03023bb00, 0xc00dea2380, 0xc0381d70e0, 0xc00cac1840, 0x1, 0x1)
        /go/src/.../vendor/cloud.google.com/go/storage/writer.go:131 +0xfa fp=0xc0040f0fb0 sp=0xc0040f0e30 pc=0x375c91a
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0040f0fb8 sp=0xc0040f0fb0 pc=0x4c4ae1
created by cloud.google.com/go/storage.(*Writer).open
        /go/src/.../vendor/cloud.google.com/go/storage/writer.go:118 +0x412

Additional Information

It looks like Writer spawns this goroutine to handle the upload side of the pipe. Inside this goroutine it is reading fields of storage.Client such as Client.raw on the line mentioned in the above crash stack trace:

call := w.o.c.raw.Objects.Insert(w.o.bucket, rawObj).

Looking at Client.Close(), it nils out raw when the client is closed:

So it looks like the caller has called Client.Close() while while the writer goroutine was still running. However reviewing the calling the code in my application, the the Writer is always closed -- via either Close or a CloseWithError in a defer -- before the function that created it returns to the function which created and would close the Client.

So it appears that the writer was somehow still running after it was closed, which meant it was then poised to crash when the client was closed out from under it?

Looking at Writer.Close(), it blocks on the w.donec channel, which is closed by the spawned goroutine when it exits, thus ensuring that that goroutine has exited before Close returns:

However Writer.CloseWithError() does not block on that donec -- it simply calls CloseWithError() on the underlying io.Pipe and returns but it does not actually wait for the writer goroutine to notice the closed pipe and exit:

return w.pw.CloseWithError(err)

@tritone
Copy link
Collaborator

@tritone tritone commented May 26, 2021

Thanks for the detailed writeup and for your proposed fix!

I wasn't able to reproduce the issue locally with a couple versions of your repro. Your reasoning seems correct, but I have a couple follow up questions to try to understand better what's going on:

  • Could you reliably reproduce the error locally? How long did it take typically?
  • You were calling Writer.CloseWithError directly in your code, not using context cancellation, correct? (the latter is recommended, though it also uses CloseWithError under the hood). Does your repro still work if you switch to using cancellation?

Loading

@dt
Copy link
Author

@dt dt commented May 26, 2021

Could you reliably reproduce the error locally? How long did it take typically?

My reproduction was unfortunately a bit more complicated: I was running several processes across a cluster of machines that were all writing files to GCS while under heavy sustained CPU load. They'd usually crash within about 5min, sometimes as long as 10min, after successfully writing 2-10k files. Sorry, I realize that isn't exactly a simple reliable local reproduction but it was reliable in that it did always crash before finishing the overall task.

You were calling Writer.CloseWithError directly in your code, not using context cancellation, correct?

Yes and no: I was calling Writer.CloseWithError, specifically if I got an error back from a Write call (via io.Copy), but this was all being done in a function which was invoked with context from context.WithDeadline and over gRPC, so I think it is very likely I was also cancelling the context in some cases.

Does your repro still work if you switch to using cancellation?

No -- after switching to only calling Close after explicit ctx cancellation, I haven't seen the crash.

While it was still happening, I did a little quick and dirty println debugging, specifically I adding a package var map[*Writer]string and made Close and CloseWithError set strings in it identifying which, if either, had been called along with runtime.Caller to show who called it. Adding if w.o.c.raw == nil { panic(closedWriters[w]) } in the open goroutine, just before call := w.o.c.raw... line that was crashing, verified that that w had indeed already been closed, before we got to that line, by a CloseWithError call in reaction to Write returning an error.

My shot in the dark guess is that this is down extreme CPU contention and whether the open goroutine happens to get scheduled soon enough. Concretely, my hypothesis is that I start a write and at the same time, the context is cancelled, due to deadline or something. w.monitorCancel notices that and closes the pipe. At the same time, my Write calls open which fires off the goroutine, then hits the closed pipe and returns an error. Due to the CPU contention, at this point the open goroutine does not yet get scheduled, while my caller, in response to the error from Write, calls CloseWithError and after it returns, moves on to call the deferred client.Close, all before the scheduler has started the open goroutine. When it finally starts up, we've closed the client out from under it and it crashes.

I also ran it a few times with the patch in #4168 to add <-w.donec to CloseWithError, mirroring Close, and that seemed to eliminate the crash as well, but as linked above, I've now instead just switched to calling Close.

Loading

@dt
Copy link
Author

@dt dt commented May 27, 2021

Huh, sadly it seems as if I might have spoken too soon: even after switching to exclusively use Close, instead of CloseWithError, I just got two more crashes overnight with the same stack -- though this time it was was after running for hours, rather than minutes, so it seems much less common now at least.

EDIT: never mind, I think this more recent one -- since my application changes to avoid calling CloseWithError -- was my own fault (an early error return path was closing the Client without waiting for a Write to finish), so it is just CloseWithError that is susceptible to returning while the background goroutine is still running.

Loading

@tritone
Copy link
Collaborator

@tritone tritone commented Jun 7, 2021

Thanks for following up and sorry for the delay (I've been on vacation and then a busy week last week). Just wanted to check in, has it continued to work alright for you in the intervening 10 days or so?

I think I need to come up with some kind of repro myself for this (or at least do a little more research) before following up with your PR. I'm reluctant to mess with CloseWithError because 1. context cancellation calls it under the hood and 2. calling it directly is deprecated anyway. I need to make sure I fully understand what's going on there before adding something that would cause calls to cancel to block.

Loading

@dt
Copy link
Author

@dt dt commented Jun 7, 2021

Yeah, since removing CloseWithError usage I haven't seen any more crashes.

(also, just while I poking around in writer.go, I think it looked like w.ChunkSize = -1 could cause a subsequent Close() to deadlock? if open() sets opened = true but then early-returned on the ChunkSize check before starting the goroutine which will close(donec), won't Close() block on <-donec forever?)

Loading

@tritone
Copy link
Collaborator

@tritone tritone commented Jun 14, 2021

Thanks for following up! Good to hear that strategy has been working better.

I think you're right about the check on ChunkSize-- I think it just hasn't come up because it's probably not a very common mistake to make (and if you do make it you still get an error from the Write call at least). There's no reason that check for negative values can't happen earlier though so I'll send a quick PR to fix.

Loading

tritone added a commit to tritone/google-cloud-go that referenced this issue Jun 14, 2021
Since w.opened is set to true before this validation currently,
subsequent calls to Writer.Close will deadlock. There is no
reason not to do this check first before opening the pipe.

Thanks to @dt for pointing this out.

Updates googleapis#4167
gcf-merge-on-green bot pushed a commit that referenced this issue Jun 14, 2021
Since w.opened is set to true before this validation currently,
subsequent calls to Writer.Close will deadlock. There is no
reason not to do this check first before opening the pipe.

Thanks to @dt for pointing this out.

Updates #4167
@tritone
Copy link
Collaborator

@tritone tritone commented Sep 14, 2021

Since you haven't seen this issue when using context cancellation and calling CloseWithError directly is deprecated, I'm going to close out this issue. Feel free to re-open if you see this kind of thing again!

Loading

@tritone tritone closed this Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

3 participants