Several enhancements for Consul registration #242
The first issue is fixed by switching to HTTP/2 with TLS enabled. #246
The second issue should be fixed at the same time, but this is not verified yet. There are still some enhancements that can be done to improve performance.
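For context, here is a minimal sketch of a client that negotiates HTTP/2 over TLS. It uses the JDK's built-in `java.net.http` client for illustration rather than light-4j's own client, and the Consul URL is hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class Http2Demo {
    public static void main(String[] args) {
        // Prefer HTTP/2; the client negotiates it via TLS ALPN and
        // falls back to HTTP/1.1 if the server does not support it.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .build();
        // Hypothetical Consul agent endpoint, for illustration only.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://localhost:8443/v1/agent/service/register"))
                .build();
        System.out.println(client.version()); // prints HTTP_2
    }
}
```

Because HTTP/2 multiplexes requests over one long-lived connection, it also avoids repeating the TLS handshake for every registration or check call, which is relevant to the handshake-latency suspicion below.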
stevehu added a commit that referenced this issue on Aug 9, 2018
younggwon1 pushed a commit to younggwon1/light-4j that referenced this issue on Feb 10, 2024
The Consul registration works, but users sometimes run into intermittent issues with it.
Symptom #1:
When I start up multiple replicas of an API (i.e. apid with 6 pods spread across 3 nodes), one pod fails to register. I'm attaching the logs from the container.
Container log: see the attached api_reg_err.txt. A few lines below it relate to the registration failure.
At the same time, there was no clue on the Consul server 172.26.129.123 with regard to the above connection closure.
I ran tcpdump to troubleshoot this and captured the traffic between the two peers.
I can see that the last packet (#9) was a RESET sent by the application node to Consul in the middle of a TLS handshake, so it looks like the light-4j code closed the connection prematurely. What do you think may be causing this disconnection? I have been suspecting that the TLS handshake took too long, which may have resulted in a timeout. But looking at other TLS sessions in the tcpdump that correlate with successful registrations, the light-4j code did not time out even with much longer latencies. So it remains a mystery to me without a deeper understanding of how the light-4j code manages connections.
There are a few things that I think could raise the level of fault tolerance, which you may want to consider:
- Add retries to the Consul registration at startup: wait a couple of seconds and try again if it fails the first time.
- Shut down the application JVM if the registration still fails after 3–4 retries, and let K8s/OCP reschedule the pod.
- You could also expose a URL that reports the readiness/liveness status of the pod, so that K8s/OCP can probe it and decide whether to kill/restart the pod. This is configurable in the deployment YAML.
- From the logs, it appears that the service "check" code uses a separate Consul connection from the registration. The retry/fail mechanisms above may apply there too.
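The retry-and-shutdown suggestion could be sketched roughly as follows. This is not light-4j code; `registerWithRetry`, the attempt count, and the delay are all hypothetical, and the caller decides whether to exit the JVM:

```java
import java.util.concurrent.Callable;

public class RegistrationRetry {

    // Retry a registration attempt with a fixed delay between attempts.
    // Returns true as soon as one attempt succeeds, false if all fail.
    static boolean registerWithRetry(Callable<Boolean> register,
                                     int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (register.call()) return true;
            } catch (Exception e) {
                System.err.println("registration attempt " + attempt + " failed: " + e);
            }
            if (attempt < maxAttempts) Thread.sleep(delayMillis);
        }
        return false; // caller may System.exit(1) here so K8s/OCP reschedules the pod
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated registration that fails twice, then succeeds on the third try.
        final int[] calls = {0};
        boolean ok = registerWithRetry(() -> ++calls[0] >= 3, 4, 10);
        System.out.println(ok ? "registered" : "giving up");
    }
}
```

On a permanent failure, exiting the JVM hands recovery to the orchestrator, which is usually simpler and safer than retrying forever inside the process.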
Symptom #2:
Even when the service registers successfully, it may turn to "failing" later. I haven't had a chance to look deeper, but my impression is that it may also be a premature connection disconnect similar to symptom #1. I may need to capture more tcpdump when I have time, but it would be great if you could consider the suggestions above in general and provide your feedback.
P.S. I have seen a similar symptom in both non-PROD and pre-PROD environments, and only with HTTPS Consul services. It may very likely be related to the TLS handshake being slow compared to a plain TCP handshake. But I don't think we can use clear-text HTTP in PROD.
Symptom #3:
Once a service check fails, the service will show as "failing" in the Consul UI. In most cases the check failure is cleaned up once the pod is restarted. However, in some cases it is not, and the service remains "critical". I'm guessing this is due to a de-registration failure. It does not impact the applications (I assume), but it is annoying and confusing to operations people.
I would recommend trying the "DeregisterCriticalServiceAfter" parameter in the check object, so that the status will be cleaned up by Consul even when the application itself fails to de-register. I'm not sure if it works for light-4j though, as it "pushes" check status instead of asking Consul to "pull".
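For reference, a sketch of what a registration payload to Consul's `/v1/agent/service/register` endpoint could look like with that parameter set (the service name, ID, port, and timeout values here are illustrative, not taken from this project):

```json
{
  "Name": "apid",
  "ID": "apid-1",
  "Port": 8443,
  "Check": {
    "TTL": "10s",
    "DeregisterCriticalServiceAfter": "2m"
  }
}
```

Per the Consul documentation, `DeregisterCriticalServiceAfter` also applies to TTL ("push") checks; note that Consul enforces a minimum timeout and its reaper for critical services runs periodically, so deregistration is not instantaneous.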