[streaming] weird errors when watching services with streaming on Consul 1.9.1 #9474

Closed
pierresouchay opened this issue Dec 28, 2020 · 3 comments
Labels
theme/streaming Related to Streaming connections between server and client type/bug Feature does not function as expected

Comments

@pierresouchay
Contributor

Overview of the Issue

We upgraded our servers from 1.8.6 to 1.9.1 and enabled streaming with the following configuration:

{
  "rpc": {
    "enable_streaming": true
  },
  "use_streaming_backend": true
}

on both servers and clients.
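
A minimal sketch for double-checking that a config file carries both settings on every node, assuming the agent config is a single JSON document shaped like the snippet above. The helper name `streaming_flags` is hypothetical and not part of Consul.

```python
import json


def streaming_flags(config_text: str) -> dict:
    """Return the two streaming-related settings found in a JSON agent config.

    Missing keys are reported as False; this only inspects the file and
    says nothing about what the running agent actually applied.
    """
    cfg = json.loads(config_text)
    rpc = cfg.get("rpc", {})
    return {
        "rpc.enable_streaming": bool(rpc.get("enable_streaming", False)),
        "use_streaming_backend": bool(cfg.get("use_streaming_backend", False)),
    }


# Same content as the configuration shown above.
example = '{"rpc": {"enable_streaming": true}, "use_streaming_backend": true}'
print(streaming_flags(example))
```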

While all of this seemed to work in our test environment, it seems to fail in all the DCs of our preproduction environment.

For instance, when running this curl command against one agent:

curl "http://localhost:8500/v1/health/service/consul-info-html?wait=1s&index=1&stale"
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< X-Consul-Default-Acl-Policy: allow
< Date: Mon, 28 Dec 2020 22:17:27 GMT
< Content-Length: 16
< Content-Type: text/plain; charset=utf-8

context canceled

Alternatively, we sometimes also get an HTTP 500 with this message:

rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed

which generates logs like these:

2020-12-28T22:22:39.181Z [ERROR] agent: subscribe call failed: err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed" topic=ServiceHealth key=consul-info-html failure_count=1
2020-12-28T22:22:39.181Z [ERROR] agent: subscribe call failed: err="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed" topic=ServiceHealth key=consul-info-html failure_count=2
2020-12-28T22:22:39.181Z [ERROR] agent.http: Request error: method=GET url=/v1/health/service/consul-info-html?wait=30s&index=1&stale from=192.168.75.139:55754 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection closed"
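
To tally these errors per topic or key across many agents, the key=value pairs at the end of each line can be extracted. This is only a sketch, assuming the log lines keep the shape shown above; `log_fields` is a hypothetical helper, not a Consul tool.

```python
import re

# A key, an equals sign, then either a double-quoted value or a run of
# non-space characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def log_fields(line: str) -> dict:
    """Extract key=value pairs from one agent log line."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


# First log line from the excerpt above.
sample = (
    '2020-12-28T22:22:39.181Z [ERROR] agent: subscribe call failed: '
    'err="rpc error: code = Unavailable desc = all SubConns are in '
    'TransientFailure, latest connection error: connection closed" '
    'topic=ServiceHealth key=consul-info-html failure_count=1'
)
fields = log_fields(sample)
print(fields["topic"], fields["key"], fields["failure_count"])
```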

Removing index=1 returns the correct result:

curl "http://localhost:8500/v1/health/service/consul-info-html?wait=1s&stale" -i
HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept-Encoding
X-Consul-Default-Acl-Policy: allow
X-Consul-Index: 2070313186
X-Consul-Knownleader: true
X-Consul-Lastcontact: 90
Date: Mon, 28 Dec 2020 22:33:34 GMT
Transfer-Encoding: chunked

[{"Node":{"ID":"671c839d-aae4-9f9e-1b8e-bc57558444b1" ...]

Using any value for index fails the same way, including the current index of the service (i.e., in this example, using 2070313186 also fails).
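
The comparison between the two requests can be scripted to run against several agents. This is a rough sketch, assuming an agent listening on localhost:8500 as in the curl commands; `health_url` and `probe` are hypothetical helper names, and only the URL-building part runs without an agent.

```python
from typing import Optional, Tuple
from urllib.parse import quote, urlencode
import urllib.error
import urllib.request


def health_url(service: str, wait: str = "1s", index: Optional[int] = None,
               base: str = "http://localhost:8500") -> str:
    """Build the health query URL, with or without the index parameter."""
    params = {"wait": wait}
    if index is not None:
        params["index"] = str(index)
    # "stale" is sent as a bare flag, exactly as in the curl commands.
    return f"{base}/v1/health/service/{quote(service)}?{urlencode(params)}&stale"


def probe(url: str, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Return (HTTP status, first 200 bytes of the body) for one request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(200)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the 500 body carries the error text.
        return err.code, err.read(200)


# Print the two URLs being compared (no network access needed for this part).
print(health_url("consul-info-html"))
print(health_url("consul-info-html", index=1))
```

With an agent reachable, calling `probe(health_url("consul-info-html", index=1))` and `probe(health_url("consul-info-html"))` side by side shows whether only the request carrying an index fails.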

The failure is the same in our 3 DCs. The error does not happen if "use_streaming_backend": false is set in the agent's config.

For all agents, /v1/agent/self reports that both streaming settings are enabled.

We also tried restarting all servers and clients sequentially; nothing helps, and everything stays in the same state.

@pierresouchay
Contributor Author

Ok, it seems that for every agent that has streaming enabled, we get errors like this:

2021-01-04T18:04:52.863Z [ERROR] agent.server.rpc: failed to read byte: conn=from=10.236.195.48:51572 error="tls: first record does not look like a TLS handshake"

It looks to me like the gRPC connection is doing something weird (note: we have TLS disabled for RPC).

pierresouchay added a commit to pierresouchay/consul that referenced this issue Jan 5, 2021
pierresouchay added a commit to pierresouchay/consul that referenced this issue Jan 6, 2021
This ensures that hashicorp#9474 will not reproduce.
@dnephin
Contributor

dnephin commented Jan 6, 2021

Fixed by #9512

@dnephin dnephin closed this as completed Jan 6, 2021
@dnephin dnephin added theme/streaming Related to Streaming connections between server and client type/bug Feature does not function as expected labels Jan 6, 2021
@pierresouchay
Contributor Author

@dnephin Tested today on our real clusters (with https://github.com/criteo-forks/consul/tree/v1.9.1-criteo2 ) -> works very well on a real-world cluster
