several enhancements for Consul registration #242

Closed
stevehu opened this issue Jul 24, 2018 · 2 comments
stevehu commented Jul 24, 2018

Consul registration works, but users sometimes run into intermittent issues with it.

Symptom #1:

When I start up multiple replicas of an API (i.e. apid with 6 pods spread across 3 nodes), one pod fails to register. I'm attaching the log from the container (see attached api_reg_err.txt). The lines below relate to the registration failure.

```
18:44:42.805 [main]  INFO  com.networknt.server.Server - register service: light://172.26.129.126:30000/com.networknt.apid-1.0.0?
18:44:42.805 [main]  INFO  c.n.r.support.AbstractRegistry - [ConsulRegistry] Url (null) will set to available to Registry [light://localhost:8080/default/consul/1.0/service]
18:44:42.806 [Client I/O-1]  DEBUG io.undertow.request.io - UT005013: An IOException occurred
java.nio.channels.ClosedChannelException: null
…
18:44:42.807 [Client I/O-1]  DEBUG io.undertow.client - Connection to consul.mcc.prod.api.abcd.com/172.26.129.123:8443 was closed by the target server
```

At the same time there was no clue on the Consul server 172.26.129.123 with regard to the above connection closure; the only entries around that timestamp were unrelated goferd messages:

```
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://cbmcclrv0246p.ca.abcd.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] root:490 - connecting to cbmcclrv0246p.ca.abcd.com:5647...
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] root:532 - Disconnected
Jul 23 14:42:01 cb1461797 goferd: [ERROR][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:33 - connect: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647, failed: Connection amqps://cbmcclrv0246p.ca.abcd.com:5647 disconnected
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:35 - retry in 106 seconds
Jul 23 14:43:48 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647
```

I ran tcpdump to troubleshoot this and captured the traffic between the two peers.
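Something along these lines (the interface and capture file name here are illustrative, not the exact command):

```sh
tcpdump -i any -w consul_tls.pcap host 172.26.129.123 and port 8443
```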

I can see that the last packet (#9) was a RST sent by the application node to Consul in the middle of a TLS handshake, so it looks like the light-4j code closed the connection prematurely. What do you think may be causing this disconnection? I suspected that the TLS handshake took too long and triggered a timeout, but other TLS sessions in the tcpdump that correspond to successful registrations show far longer latencies without the light-4j code timing out. So it remains a mystery to me without a deeper understanding of how the light-4j code manages connections.

There are a few things that I think could raise the level of fault tolerance, which you may want to consider:

- Add retries to the Consul registration at startup: wait a couple of seconds and try again if the first attempt fails (a sketch follows this list).
- Shut down the application JVM if the registration still fails after 3~4 retries, and let K8S/OCP reschedule the pod.
- Add a URL that exposes the readiness/liveness status of the pod, so that K8S/OCP can probe it and decide whether to kill/restart the pod. This is configurable in the deployment YAML (see the probe snippet after the list).
- From the logs, it appears that the service "check" code uses a separate Consul connection from the registration one. The retry/fail mechanisms above may apply there as well.
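As a minimal sketch of the first two points, a retry loop could wrap the existing registration call. Everything here is illustrative: `registerToConsul()` stands in for whatever light-4j actually calls, and the attempt count and backoff are just the numbers suggested above.

```java
public class RegistrationRetry {

    private static final int MAX_ATTEMPTS = 4;        // i.e. 3~4 tries before giving up
    private static final long BACKOFF_MILLIS = 2_000; // wait a couple of seconds between tries

    public static void registerWithRetry() {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                registerToConsul();                   // placeholder for the real registration call
                return;                               // registered successfully
            } catch (Exception e) {
                System.err.println("Consul registration attempt " + attempt + " failed: " + e);
                if (attempt == MAX_ATTEMPTS) {
                    System.exit(1);                   // give up and let K8S/OCP reschedule the pod
                }
                try {
                    Thread.sleep(BACKOFF_MILLIS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    private static void registerToConsul() throws Exception {
        // placeholder: the HTTP(S) call to the Consul agent would go here
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```

For the probe URL, the deployment YAML could look roughly like the following; the /health path, port, and timings are assumptions, not an existing light-4j endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /health        # assumed status endpoint exposed by the service
    port: 30000
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 30000
    scheme: HTTPS
  initialDelaySeconds: 5
  periodSeconds: 10
```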

Symptom #2:

Even when the service registers successfully, it may turn to "failing" later. I haven't had a chance to look too deep, but my impression is that it may also be a premature connection-disconnect issue similar to symptom #1. I may need to capture more tcpdump traces when I have time, but if you could consider the suggestions above in general and provide your feedback, that would be great.

P.S. I have seen similar symptoms in both non-PROD and pre-PROD environments, and only with HTTPS Consul services. It may very likely be related to the TLS handshake being slow compared to a plain TCP handshake. But I don't think we can use clear-text HTTP in PROD.

Symptom #3:

Once a service check fails, the service shows as "failing" in the Consul UI. In most cases the check failure is cleaned up once the pod is restarted; in some cases, however, it is not, and the service remains "critical". I'm guessing this is due to a de-registration failure. It does not impact the applications (I assume), but it is annoying and confusing to operations people.

I would recommend trying the "DeregisterCriticalServiceAfter" parameter in the check object, so that the status is cleaned up by Consul even when the application itself fails to de-register. I'm not sure whether it works for light-4j, though, as it "pushes" check status instead of asking Consul to "pull".
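For reference, a registration payload to Consul's agent HTTP API (PUT /v1/agent/service/register) could carry the parameter like this; the service details mirror the log above, and the TTL and threshold values are illustrative:

```json
{
  "ID": "com.networknt.apid-1.0.0",
  "Name": "com.networknt.apid",
  "Address": "172.26.129.126",
  "Port": 30000,
  "Check": {
    "TTL": "10s",
    "DeregisterCriticalServiceAfter": "90s"
  }
}
```

Note that Consul reaps critical services only periodically, so cleanup may take somewhat longer than the configured value.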


stevehu commented Aug 6, 2018

The first issue is fixed by switching to HTTP/2 with TLS enabled. See #246.


stevehu commented Aug 6, 2018

The second issue should be fixed by the same change, but that has not been verified yet. There are still some enhancements that can be done to improve performance.

stevehu added a commit that referenced this issue Aug 10, 2018