several enhancements for Consul registration #242

Closed
stevehu opened this issue Jul 24, 2018 · 2 comments
stevehu commented Jul 24, 2018

Consul registration works, but users sometimes run into intermittent issues with it.

Symptom #1:

When I start up multiple replicas of an API (i.e. apid with 6 pods spread across 3 nodes), one pod fails to register. I'm attaching the log from the container (see attached api_reg_err.txt). The lines below relate to the registration failure.

```
18:44:42.805 [main]  INFO  com.networknt.server.Server - register service: light://172.26.129.126:30000/com.networknt.apid-1.0.0?
18:44:42.805 [main]  INFO  c.n.r.support.AbstractRegistry - [ConsulRegistry] Url (null) will set to available to Registry [light://localhost:8080/default/consul/1.0/service]
18:44:42.806 [Client I/O-1]  DEBUG io.undertow.request.io - UT005013: An IOException occurred
java.nio.channels.ClosedChannelException: null
…
18:44:42.807 [Client I/O-1]  DEBUG io.undertow.client - Connection to consul.mcc.prod.api.abcd.com/172.26.129.123:8443 was closed by the target server
```

At the same time there was no clue on the Consul server 172.26.129.123 with regard to the above connection closure; the only entries around that timestamp were unrelated goferd messages:

```
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.proton.connection:87 - open: URL: amqps://cbmcclrv0246p.ca.abcd.com:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] root:490 - connecting to cbmcclrv0246p.ca.abcd.com:5647...
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] root:532 - Disconnected
Jul 23 14:42:01 cb1461797 goferd: [ERROR][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:33 - connect: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647, failed: Connection amqps://cbmcclrv0246p.ca.abcd.com:5647 disconnected
Jul 23 14:42:01 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:35 - retry in 106 seconds
Jul 23 14:43:48 cb1461797 goferd: [INFO][pulp.agent.72acc855-fc03-4af8-9718-7c420ec65284] gofer.messaging.adapter.connect:28 - connecting: proton+amqps://cbmcclrv0246p.ca.abcd.com:5647
```

I ran tcpdump to troubleshoot this and captured the traffic between the two peers.
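Something along these lines (the interface and capture file name here are illustrative, not the exact command):

```sh
tcpdump -i any -w consul_tls.pcap host 172.26.129.123 and port 8443
```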

I can see that the last packet (#9) was a RST sent by the application node to Consul in the middle of a TLS handshake, so it looks like the light-4j code closed the connection prematurely. What do you think may be causing this disconnection? I suspected that the TLS handshake took too long and triggered a timeout, but other TLS sessions in the tcpdump that correspond to successful registrations show far longer latencies without the light-4j code timing out. So it remains a mystery to me without a deeper understanding of how the light-4j code manages connections.

There are a few things that I think could raise the level of fault tolerance, which you may want to consider:

- Add retries to the Consul registration at startup: wait a couple of seconds and try again if the first attempt fails (a sketch follows this list).
- Shut down the application JVM if the registration still fails after 3~4 retries, and let K8S/OCP reschedule the pod.
- Add a URL that exposes the readiness/liveness status of the pod, so that K8S/OCP can probe it and decide whether to kill/restart the pod. This is configurable in the deployment YAML (see the probe snippet after the list).
- From the logs, it appears that the service "check" code uses a separate Consul connection from the registration one. The retry/fail mechanisms above may apply there as well.
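As a minimal sketch of the first two points, a retry loop could wrap the existing registration call. Everything here is illustrative: `registerToConsul()` stands in for whatever light-4j actually calls, and the attempt count and backoff are just the numbers suggested above.

```java
public class RegistrationRetry {

    private static final int MAX_ATTEMPTS = 4;        // i.e. 3~4 tries before giving up
    private static final long BACKOFF_MILLIS = 2_000; // wait a couple of seconds between tries

    public static void registerWithRetry() {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                registerToConsul();                   // placeholder for the real registration call
                return;                               // registered successfully
            } catch (Exception e) {
                System.err.println("Consul registration attempt " + attempt + " failed: " + e);
                if (attempt == MAX_ATTEMPTS) {
                    System.exit(1);                   // give up and let K8S/OCP reschedule the pod
                }
                try {
                    Thread.sleep(BACKOFF_MILLIS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    private static void registerToConsul() throws Exception {
        // placeholder: the HTTP(S) call to the Consul agent would go here
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```

For the probe URL, the deployment YAML could look roughly like the following; the /health path, port, and timings are assumptions, not an existing light-4j endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /health        # assumed status endpoint exposed by the service
    port: 30000
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 30000
    scheme: HTTPS
  initialDelaySeconds: 5
  periodSeconds: 10
```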

Symptom #2:

Even when the service registers successfully, it may turn to "failing" later. I haven't had a chance to look too deep, but my impression is that it may also be a premature connection-disconnect issue similar to symptom #1. I may need to capture more tcpdump traces when I have time, but if you could consider the suggestions above in general and provide your feedback, that would be great.

P.S. I have seen similar symptoms in both non-PROD and pre-PROD environments, and only with HTTPS Consul services. It may very likely be related to the TLS handshake being slow compared to a plain TCP handshake. But I don't think we can use clear-text HTTP in PROD.

Symptom #3:

Once a service check fails, the service shows as "failing" in the Consul UI. In most cases the check failure is cleaned up once the pod is restarted; in some cases, however, it is not, and the service remains "critical". I'm guessing this is due to a de-registration failure. It does not impact the applications (I assume), but it is annoying and confusing to operations people.

I would recommend trying the "DeregisterCriticalServiceAfter" parameter in the check object, so that the status is cleaned up by Consul even when the application itself fails to de-register. I'm not sure whether it works for light-4j, though, as it "pushes" check status instead of asking Consul to "pull".
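For reference, a registration payload to Consul's agent HTTP API (PUT /v1/agent/service/register) could carry the parameter like this; the service details mirror the log above, and the TTL and threshold values are illustrative:

```json
{
  "ID": "com.networknt.apid-1.0.0",
  "Name": "com.networknt.apid",
  "Address": "172.26.129.126",
  "Port": 30000,
  "Check": {
    "TTL": "10s",
    "DeregisterCriticalServiceAfter": "90s"
  }
}
```

Note that Consul reaps critical services only periodically, so cleanup may take somewhat longer than the configured value.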


stevehu commented Aug 6, 2018

The first issue is fixed by switching to HTTP/2 with TLS enabled. See #246.


stevehu commented Aug 6, 2018

The second issue should be fixed by the same change, but that has not been verified yet. There are still some enhancements that can be done to improve performance.

stevehu added a commit that referenced this issue Aug 10, 2018