
connect: register the service before the proxy #305

Merged

merged 3 commits into master from no-downtime-connect on Aug 10, 2020
Conversation

@ishustava (Contributor) commented on Aug 7, 2020

Because the proxy service registers an alias health check that points to the service ID of the main service, and we currently register the proxy service before the main service, the alias check starts out red: at that point the main service doesn't yet exist. Consul runs this check every minute, so the check only turns green after up to a minute.

This is particularly bad when a service is restarted, whether for scheduled or unscheduled maintenance. For example, when you have a Deployment and trigger a re-deploy (kubectl rollout restart), Kubernetes by default performs a rolling update, where it won't terminate the old instance until the new one comes up and is healthy. But Consul takes an additional minute or so to mark the service healthy, causing downtime where little or no downtime should be experienced.

Changes proposed in this PR:

Switch the order in which the services are registered with Consul: register the main service first and the proxy service after it (see the sketch below).
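
For illustration, here is a minimal sketch of the reordered registration using the hashicorp/consul/api client. This is not the actual consul-k8s code, and the service names, IDs, and ports are hypothetical; it only shows why registering the main service first lets the proxy's alias check resolve its target immediately instead of starting out critical.

```go
package main

import (
	"log"

	capi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Register the main service first so it already exists when the
	// proxy's alias check is evaluated for the first time.
	if err := agent.ServiceRegister(&capi.AgentServiceRegistration{
		ID:   "static-server", // hypothetical ID for illustration
		Name: "static-server",
		Port: 8080,
	}); err != nil {
		log.Fatal(err)
	}

	// Register the sidecar proxy second. Its alias check points at the
	// main service's ID; because that service is now already registered,
	// the check does not spend a minute in the critical state.
	if err := agent.ServiceRegister(&capi.AgentServiceRegistration{
		Kind: capi.ServiceKindConnectProxy,
		ID:   "static-server-sidecar-proxy",
		Name: "static-server-sidecar-proxy",
		Port: 20000,
		Proxy: &capi.AgentServiceConnectProxyConfig{
			DestinationServiceName: "static-server",
			DestinationServiceID:   "static-server",
		},
		Checks: capi.AgentServiceChecks{
			{
				Name:         "Connect Sidecar Aliasing static-server",
				AliasService: "static-server",
			},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```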

Steps to reproduce and test

  1. Deploy the latest helm chart with connect enabled
     helm install consul --set global.name=consul --set connectInject.enabled=true --set server.replicas=1 --set server.bootstrapExpect=1 hashicorp/consul
  2. Create static-server and static-client deployments.
     kubectl apply -f https://raw.githubusercontent.com/hashicorp/consul-helm/acceptance-tests-base/test/acceptance/tests/connect/fixtures/static-server.yaml
     kubectl apply -f https://raw.githubusercontent.com/hashicorp/consul-helm/acceptance-tests-base/test/acceptance/tests/connect/fixtures/static-client.yaml
  3. Exec into the static-client pod and run the following loop (it should print "hello world" every second):
     $ kubectl exec -it static-client-758b47746d-r5rll /bin/sh
     kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
     Defaulting container name to static-client.
     Use 'kubectl describe pod/static-client-758b47746d-r5rll -n default' to see all of the containers in this pod.
     # while true; do echo "$(date) $(curl -sS http://localhost:1234)"; sleep 1; done
    
  4. Restart the static-server deployment and observe about 1 minute of downtime in the logs from the while loop above (an optional helper for watching the service's health in Consul is sketched after these steps).
     kubectl rollout restart deploy/static-server
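
If you want to watch the health transition directly rather than inferring it from curl errors, a small helper like the one below can poll the Consul agent's aggregated service-health endpoint (/v1/agent/health/service/name/<service>), which reports status via the HTTP code: 200 while all instances are passing, 429 on warning, 503 while any are critical. This helper is illustrative only and assumes the agent HTTP API is reachable on localhost:8500 (e.g. via kubectl port-forward):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Poll the agent's aggregated health for static-server once per second.
	url := "http://localhost:8500/v1/agent/health/service/name/static-server"
	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Printf("%s error: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			// 200 = passing, 429 = warning, 503 = critical.
			fmt.Printf("%s health: %s\n", time.Now().Format(time.RFC3339), resp.Status)
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```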
    

To fix, upgrade to the image built from this PR and run helm upgrade:

helm upgrade consul --set global.name=consul --set global.imageK8S=hashicorpdev/consul-k8s:f817303 --set connectInject.enabled=true --set server.replicas=1 --set server.bootstrapExpect=1 hashicorp/consul

Once the connect injector becomes healthy, restart the static-server deployment again (step 4 above). You should see either no errors or only 1-2 errors (i.e. 1-2 seconds of downtime) printed from the while loop running in the static-client container.

Checklist:

  • Tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@ishustava requested review from adilyse and thisisnotashwin on Aug 7, 2020 01:59
@thisisnotashwin (Contributor) left a comment:

This looks great. I was able to successfully reproduce both the error and the fix. 🎉

@adilyse (Contributor) left a comment:

The code changes look fine, but the changelog needs to include a warning about the implications of this change.

CHANGELOG.md Outdated
BUG FIXES:

* Connect: Reduce downtime caused by an alias health check of the sidecar proxy not being healthy for up to 1 minute
when a Connect-enabled service is restarted [[GH-305](https://github.com/hashicorp/consul-k8s/pull/305)].
A contributor replied on the changelog diff:

Per our conversation earlier, this should include the caveat that while this fix reverts to the previous behavior, that previous behavior allowed Consul to route to services that may not be ready yet.

@ishustava added the area/connect (Related to Connect service mesh, e.g. injection) and type/bug (Something isn't working) labels on Aug 10, 2020
@ishustava ishustava merged commit 646e522 into master Aug 10, 2020
@ishustava ishustava deleted the no-downtime-connect branch August 10, 2020 21:22
ndhanushkodi pushed a commit to ndhanushkodi/consul-k8s that referenced this pull request Jul 9, 2021