[1.0.2] Periodical temporary Linkerd outages when io.l5d.mesh is used #1346
Comments
Around the same time there were a few stack traces in `namerd02`.
I am able to reproduce this as well in linkerd 1.0.2, but not in 1.0.0. Sanitized linkerd logs look like this:
etc.
@DukeyToo Thanks for confirming that you're also seeing the issue. Can you provide a bit more info about your setup, as well as steps to reproduce if possible? If you don't mind sharing your linkerd and namerd configs, that would be very helpful. You can also send them to me directly.
@klingerf, configs below. We're running one linkerd per host, with a cluster of 3 namerd hosts, which also run zookeeper and some other services. The linkerd hosts come and go, but the namerd hosts typically stay up for a long time. Like @ashald, we did not see the issue immediately; it occurred after some indeterminate amount of time. Restarting all of the linkerd processes fixed it for a while, until the next occurrence. We reverted pretty quickly to 1.0.0, so it only occurred for us about 3 times total over 2 days. Linkerd yaml:
Namerd config:
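The actual configs did not survive in this copy of the thread. For context, here is a minimal sketch of the kind of setup described (one linkerd per host using the `io.l5d.mesh` interpreter against a namerd cluster backed by zookeeper); all hosts, ports, and paths below are illustrative, not the reporter's values:

```yaml
# linkerd (sketch)
routers:
- protocol: http
  interpreter:
    kind: io.l5d.mesh
    dst: /$/inet/namerd.example.com/4321   # the namerd cluster
    root: /default
  servers:
  - ip: 0.0.0.0
    port: 4140
```

```yaml
# namerd (sketch)
storage:
  kind: io.l5d.zk        # zookeeper runs on the namerd hosts
  zkAddrs:
  - host: 127.0.0.1
    port: 2181
  pathPrefix: /dtabs
namers:
- kind: io.l5d.consul
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4321
```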
Finally, @edio and I were able to reproduce the issue using the following steps:
1. Set vars
2. Prepare configs (namerd, linkerd)
3. Start namerd, linkerd, and a service
4. Prepare and run the test
Thanks @ashald! Here's the namerd debug log for that test:
So a new connection is initialized when the dyn binding error is encountered.
Ok, it looks like what's happening here is that when linkerd's client cache is full and it receives a request that requires building a new client, it evicts an existing client and tears it down. If the client has an open stream to namerd (as is the case with the io.l5d.mesh api), the stream is closed on the linkerd side. As far as I can tell, the stream is not closed on the namerd side, and namerd continues to send updates on the stream, which look like this:
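The eviction path described above can be sketched in Python (illustrative only; linkerd itself is Scala/Finagle, and the `Client`/`ClientCache` names here are invented for the example):

```python
# Sketch of a bounded client cache that tears down the client it evicts.
from collections import OrderedDict

class Client:
    def __init__(self, name):
        self.name = name
        self.stream_open = True   # long-lived stream to namerd (io.l5d.mesh)

    def close(self):
        # linkerd closes its side of the stream; if the remote (namerd) does
        # not also close, it keeps sending updates into a dead stream.
        self.stream_open = False

class ClientCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.clients = OrderedDict()

    def get(self, name):
        if name in self.clients:
            self.clients.move_to_end(name)
            return self.clients[name]
        if len(self.clients) >= self.capacity:
            # cache full: evict the least recently used client and tear it down
            _, evicted = self.clients.popitem(last=False)
            evicted.close()
        client = Client(name)
        self.clients[name] = client
        return client

cache = ClientCache(capacity=2)
a = cache.get("svc-a")
b = cache.get("svc-b")
c = cache.get("svc-c")   # cache is full, so svc-a is evicted and closed
print(a.stream_open)     # False: closed locally, but the remote may keep sending
```

The point of the sketch is the last line: the evicted client's stream is closed only on the local side, which is the state in which the stray namerd update arrives.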
Once one of these is sent to linkerd after the stream has been closed, the call to accept the frame on the linkerd side fails.
The way in which remote streams are closed changed as part of #1280, and that's evidently where this bug was introduced. I'm still investigating, but it seems there are two approaches to fixing it:
I think we should probably do both, but I'll keep poking around.
ZOMG, it looked obscure, but I never thought it might be so deep. o_O Thanks for looking into it!
**Problem**

If a linkerd h2 connection receives a data frame from the remote and is unable to accept it because the local frame queue has already been reset, it falls into an infinite recursive loop trying to accept the frame.

**Solution**

Instead of retrying on failure, send a reset frame back to the remote so that it does not continue to send frames.
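The problem/solution summary above can be sketched in Python (illustrative only; the real change is in linkerd's h2 stream transport, and `ResetQueue`/`recv_frame_*` are invented names):

```python
# Sketch of the bug and the fix: what to do when a frame arrives for a
# local queue that has already been reset.

class ResetQueue:
    def __init__(self):
        self.reset = False
        self.frames = []

    def offer(self, frame):
        if self.reset:
            return False          # queue already reset; frame cannot be accepted
        self.frames.append(frame)
        return True

def recv_frame_buggy(queue, frame):
    # the 1.0.2 behavior: retry on failure. With a reset queue, offer()
    # fails forever, so this recursion never terminates. (Not called here.)
    if not queue.offer(frame):
        return recv_frame_buggy(queue, frame)

def recv_frame_fixed(queue, frame, send_to_remote):
    # the fix: on failure, tell the remote to stop sending instead of retrying
    if not queue.offer(frame):
        send_to_remote("RST_STREAM")
        return False
    return True

sent = []
q = ResetQueue()
q.reset = True                    # the local queue was already reset
ok = recv_frame_fixed(q, "DATA", sent.append)
print(ok, sent)                   # False ['RST_STREAM']
```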
We believe it's a bug introduced in version 1.0.2; we cannot reproduce it with version 1.0.0.
Linkerd config
Namerd config
Every several hours a Linkerd instance "becomes broken": it can forward requests to names that were resolved through it before, but cannot forward requests to names that are as yet unknown to it (e.g., when a new service is deployed).
All requests to "new names" result in:
Looking at the logs (Consul log level set to ALL) we see that it gets updates from Consul about services being added or removed, so we assume that Linkerd knows about healthy Namerd instances.
After a while (we weren't able to figure out whether this interval is constant or not) we see a stack trace like this in the logs (root log level set to ALL):
where:
Once such an exception is logged, Linkerd "recovers" and is again able to resolve names not yet known to it.