Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: Go 2: interruptible goroutines #50678

Closed
KevinWang15 opened this issue Jan 19, 2022 · 18 comments
Closed

proposal: Go 2: interruptible goroutines #50678

KevinWang15 opened this issue Jan 19, 2022 · 18 comments
Labels
FrozenDueToAge LanguageChange Suggested changes to the Go language Proposal Proposal-FinalCommentPeriod v2 An incompatible library change
Milestone

Comments

@KevinWang15
Copy link

KevinWang15 commented Jan 19, 2022

Hi,

Can we have the feature to kill one goroutine from another goroutine?

I have read #32610 and several other disucssions. I believe the answer was NO, but I still want to present my use cases and experiments for your consideration.


I was working on Kubernetes APIServer. When APIServer was OOMKilled and restarted, all clients are going to reconnect to APIServer and try to download all the data they need from APIServer to their local cache. They do this by sending a LIST request.

After an APIServer restarts, there would be tons of concurrent LIST requests, so CPU would be very overloaded. The HTTP response was quite big, with every CPU core saturated, json.Marshal of the response would take on average 300 seconds to complete (whereas they would take only 10 seconds to complete had the CPU not been so busy).

The interesting thing is, APIServer will return a GatewayTimeout response after 60 seconds, and the client will retry the request on receiving this error.

So at the 60 second mark, APIServer and the client have both given up on the request, the request context was cancelled, but the goroutine was still actively doing json.Marshal of the response for the request! (because json.Marshal wasn't supposed to be context-aware.)

That's the last thing we want when the server was already overloaded. In the next 240 seconds that goroutine was wasting CPU for nothing, and it was not releasing the memory. Soon the clients would retry the LIST request and another league of goroutines would be created to serve the requests - the overload just got worse!

In my observation, due to the lingering json.Marshal goroutine, each client will result in the creation of 5 goroutines, and use 5x the CPU and memory.


I actually did a little experiment and hacked src/runtime.

  1. a goroutine can get its own goid, so I stored the goid of the serving goroutine in the request context
  2. a function to kill any goroutine (make that goroutine panic) by goid is added, so when GatewayTimeout occurs, I could read the goroutine that was doing json.Marshal from request context and make it panic

The results are exactly what we want! 60% memory usage reduction (and much less prone to OOMKills) and faster recovery.

image
(P.S. we changed some other behaviors of APIServer, e.g. returning Retry-After header to disperse client-side retries and imposing a smaller MaxRequestsInflight when a LIST flood is detected. And the above experiment already had those optimizations turned on)


This issue is not limited to Kubernetes APIServer. I guess every server powered by golang would have more or less the same problem, and would all benefit from an API for killing goroutines (non-cooperatively, because using context / channels to cooperatively make goroutines exit is sometimes not viable).

If this use case could be considered valid, I would like to humbly request to add such an API. It doesn't have to expose the goid to the user, just allowing the user to get a handle of the current goroutine, store it somewhere, and then provide an Abort() method that will make the goroutine panic at the earliest convenience, would be good enough.

Thanks!

@gopherbot gopherbot added this to the Proposal milestone Jan 19, 2022
@davecheney
Copy link
Contributor

This use case sounds like the the kubernetes Api server should add some basic rate limiting.

@KevinWang15
Copy link
Author

This use case sounds like the the kubernetes Api server should add some basic rate limiting.

Right, there are some rate limiting in place, and I also worked on tuning the algorithms there.

But if we could have this API, it would be invaluable, and bring the whole thing to the next level.

@beoran
Copy link

beoran commented Jan 19, 2022

I would rather say you need a better json parser that is interruptable through a context.

@KevinWang15
Copy link
Author

I would rather say you need a better json parser that is interruptable through a context.

Well, a better JSON parser certainly will help.
Apart from json.Marshal, there are some other CPU-intensive function calls that are not context-aware.
It would be much work if we replace everything with a context-aware alternative.

And is it good to make everything context-aware? Is it going to add a lot of overhead?
Quoting this comment: #50422 (comment)

mvdan commented: Usually, a context or timeout/deadline is required for asynchronous work where it's practically impossible to know how long an operation will take. I also imagine it would be useful for libraries that interpret or execute code, such as interpreting a script that might loop forever. My first impression is that encoding/json falls under neither of those categories.

When all these problems can be solved with one single added API in golang runtime, and it isn't that hard to implement, I am tempted not to do things the other way around.

I guess it all boils down to, is it so abominably against the philosophies and principles of golang design? Or is it negotiable?

@seankhliao
Copy link
Member

Please fill out https://github.com/golang/proposal/blob/master/go2-language-changes.md when proposing language changes

@seankhliao seankhliao added v2 An incompatible library change LanguageChange Suggested changes to the Go language WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 19, 2022
@seankhliao seankhliao changed the title proposal: can we have the feature to kill one goroutine from another goroutine? (I believe I have valid use cases) proposal: Go 2: interruptible goroutines Jan 19, 2022
@KevinWang15
Copy link
Author

ACK, I will fill it out, thanks

@KevinWang15
Copy link
Author

Would you consider yourself a novice, intermediate, or experienced Go programmer?

I consider myself quite experienced.

What other languages do you have experience with?

Java, JavaScript

Would this change make Go easier or harder to learn, and why?

I think it should make no difference. It is only an extra API that a developer can choose to use. They can learn it when they want to use it.

Has this idea, or one like it, been proposed before?

The closest is question #32610. We are asking for the same feature. The previous discussion was abandoned and didn't offer a use case. So I decided to bring it up again.

Who does this proposal help, and why?

Everyone who uses Golang to write a server, and potentially some other use cases.
It will enable developers to make sure that when the server has finished sending a response for a request, every goroutine that was used to serve the request can be aborted immediately. No lingering goroutines will get stuck in expensive function calls that are not context-aware, and waste CPU / memory.

What is the proposed change?

  1. Add runtime.GetGoroutineHandle() that will return a *GoroutineHandle of the caller goroutine.
  2. Provide .Abort() method on *GoroutineHandle that will make the related goroutine panic with "aborted" at the earliest convenience (when it is safe to do so). All defers will be executed in order not to break invariants.

Nothing would change in the language spec.

Please also describe the change informally, as in a class teaching Go.

"Now it's possible to kill a goroutine just like you can kill a thread. Of course, it is dangerous and not encouraged; Context and other cooperative methods are still your go to solution, but it is a possibility if you know what you are doing"

Is this change backward compatible?

Yes.

Show example code before and after the change.

func main() {
	goroutineHandle := make(chan *GoroutineHandle)
	go func() {
		goroutineHandle <- runtime.GetGoroutineHandle()
		json.Marshal(aVeryBigObject)
	}()

	timer := time.NewTimer(10 * time.Second)
	<-timer.C
	(<-goroutineHandle).Abort()
}

What is the cost of this proposal?

It will not affect tools or add compile time / run time cost.

Can you describe a possible implementation?

Put a goid into the ToBeAborted set, wait for it to get descheduled naturally (either by Gosched or preemptive scheduling). Then, when it is about to resume, change its PC to a function that will panic immediately; Or, add such a check in Gosched.

How would the language spec change?

The language spec wouldn't change.

Orthogonality: how does this change interact or overlap with existing features?

No, it's just another API. Just like Thread.stop provided by Java.

Is the goal of this change a performance improvement?

Yes. We can expect server programs to waste less CPU and memory. See my analysis and experiments in #50678

Does this affect error handling? / Is this about generics?

No


PTAL @seankhliao

@bcmills
Copy link
Contributor

bcmills commented Jan 19, 2022

(emphasis mine)

make the related goroutine panic with "aborted" at the earliest convenience (when it is safe to do so)

That's the problem: it is never safe to make a goroutine panic unexpectedly.

Go programs can (and should!) be written and tested such that unexpected panics cannot and do not occur — especially given Go's new fuzzing support set to be released in Go 1.18. In particular, panics should not cross package boundaries except in the case of programmer error: if package A calls into package B, package B should only panic if A has violated some documented or obvious invariant of its API.

If a particular package knows that it is safe to panic at some particular point, that package could just as easily check a Context for cancellation at that point too.

@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 19, 2022
@KevinWang15
Copy link
Author

KevinWang15 commented Jan 19, 2022

Right, it is never safe to make a goroutine panic unexpectedly, as unsafe as killing a thread in Java.
Maybe it should be put into the unsafe package, and people should never use it unless they know what they are doing.
Shall we give developers the power to do it if they really want to? (Java did). If the program crashes, then the developer is to blame because he played with fire and got burned.

By when it is safe to do so I meant, from go runtime's point of view. It's indeed impossible to guarantee there will be no logic error or deadlocks when a goroutine is aborted abruptly. I guess that responsibility falls on the developer.

@bcmills
Copy link
Contributor

bcmills commented Jan 19, 2022

Shall we give developers the power to do it if they really want to? (Java did).

Go is not Java, nor should it be: Java is already a pretty good Java, so Go making different decisions adds diversity to the language ecosystem.

If the program crashes, then the developer is to blame because he played with fire and got burned.

Go consistently chooses not to take that approach. (That's why we have automatic bounds checks and a fairly permissive memory model, and also why the language discourages pointer arithmetic and subtle memory-management patterns.)

Besides, the failure mode of an unexpected panic isn't always a crash. Sometimes it's much more difficult to diagnose, like a deadlock in a production server.

@KevinWang15
Copy link
Author

Okay, I see your point, I can understand. Thanks

@beoran
Copy link

beoran commented Jan 19, 2022

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

@KevinWang15
Copy link
Author

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

Thanks for the suggestion. The current encoding/json has func NewEncoder(w io.Writer) which can take in a io.Writer, but its behavior isn't ideal for this use case ( #33714 , basically it completes the whole marshaling in a buffer and writes to io.Writer in one shot ).

Maybe there open-source alternatives to encoding/json that behaves differently (it needs to work with json.Marshaler interface or we will have to rewrite a lot of code), I will take a look.

Our solution for now is to hack src/runtime and src/encoding/json and make the goroutine panic when the request was cancelled. This solution has the upside of being very easy to implement (100 lines of code), but the downside is we will have to use a forked (hacked) Golang. I guess when #33714 is solved by upstream we can try that way too.

@mvdan
Copy link
Member

mvdan commented Jan 20, 2022

I agree with @beoran that a context-aware reader with a streaming decoder (or a context-aware writer with a streaming encoder for marshals) seems like a good approach. encoding/json doesn't currently support streaming encodes (#33714), but that's certainly something we want to fix in the future.

@smasher164
Copy link
Member

If you use a Context-aware io.Reader like here: https://pace.dev/blog/2020/02/03/context-aware-ioreader-for-golang-by-mat-ryer.html, together with a Json parser that can unmarshal from an io.Reader, and there are several like that, then your problem should be solved.

Interesting. So this basically turns the early exits from error handling into preemption points. At least, most of the error handling, since the errors returned from a Reader are only a subset of all the errors returned by a JSON parser.

@ianlancetaylor
Copy link
Contributor

Based on the discussion above, this is a likely decline. Leaving open for four weeks for final comments.

@KevinWang15
Copy link
Author

Thank you, I'm fine with closing this

@ianlancetaylor
Copy link
Contributor

No change in consensus.

@golang golang locked and limited conversation to collaborators Mar 2, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge LanguageChange Suggested changes to the Go language Proposal Proposal-FinalCommentPeriod v2 An incompatible library change
Projects
None yet
Development

No branches or pull requests

9 participants