Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in producer's router #345

Closed
xakepp35 opened this issue Jun 1, 2022 · 2 comments · Fixed by #346
Closed

Panic in producer's router #345

xakepp35 opened this issue Jun 1, 2022 · 2 comments · Fixed by #346
Labels

Comments

@xakepp35
Copy link

xakepp35 commented Jun 1, 2022

We have retries system which does resending to producers in a round-robin manner, and under some circumstances when nsq services are restarting - this library can easily crash the application. It looks like its internals are not thread-safe and written in a very outdated golang style, which makes it extremely hard to investigate where exactly issue is:

go-nsq: v1.1.0

panic: runtime error: index out of range [0] with length 0

goroutine 48061 [running]:
github.com/nsqio/go-nsq.(*Producer).popTransaction(...)
	/go/src/***/vendor/github.com/nsqio/go-nsq/producer.go:370
github.com/nsqio/go-nsq.(*Producer).router(0xc0003c18c0)
	/go/src/***/vendor/github.com/nsqio/go-nsq/producer.go:353 +0x532
created by github.com/nsqio/go-nsq.(*Producer).connect
	/go/src/***/vendor/github.com/nsqio/go-nsq/producer.go:324 +0x574

PS. please dont fork goroutines uncontrollably, instead of poorly controlled channels - use a friendly worker pool library with a managed set of worker goroutines, where you can wait for all tasks to be done, and stop using goto - its not ANSI C89...

@xakepp35 xakepp35 changed the title Panic in router Panic in producer's router Jun 1, 2022
@shachardevops
Copy link

Plus 1

@ploxiln
Copy link
Member

ploxiln commented Jun 5, 2022

I'm not too surprised there's a bug in this part of the codebase, it's not heavily used by the original developers. I think the original devs mostly either publish only to localhost on each server that produces messages, in some cases with this library but in other cases just using http from a service written in a different language, or through an http load balancer to a small cluster of nsqd (thus via http). The Consumer code is much more heavily used.

I don't think gotos, goroutines, or races are involved in this. Each Producer starts one goroutine, running Producer.router(), which runs for the life of that Producer. (One more short-lived goroutine is started in Producer.close() when the producer is done.) The two goto you can see in router() are exactly equivalent of a break to label of the outer for (and the Producer exits at that point, and can no longer hit this panic).

Only the single router() goroutine started by a particular Producer will append or pop from the transactions slice, there is no concurrent access.

The bug might simply be that this code assumes that onConnError() can only be called in relation to an in-flight transaction. But if it is called when zero transactions are in flight, you'll get this panic. It actually looks like that shouldn't happen, it should only be called for an error reply message from the server, which should correspond to a sent message ... but of course it should be robust against seeming to get bytes from the server that look like some kind of reply when there are is no transaction in flight (and then close the Producer because message send/recv is out of sync).

I'll open a quick PR. If you can provide a reproducer script, or validate that the PR makes a noticeable difference in your infrastructure (confirming the hypothesis), that would be helpful. Here it is: #346

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants