
fix signal handling for lifecycle sidecar #409

Merged
kschoche merged 4 commits into master from update_signal_handling_lifecyclesidecar on Dec 17, 2020

Conversation

Contributor

@kschoche kschoche commented Dec 16, 2020

This fixes issues related to stale service registrations created when pods are terminated, causing services to be registered with no pods backing them. This was caused by a race condition in the shutdown logic of the lifecycle-sidecar, so the fix addresses its shutdown path.

Changes proposed in this PR:

  • Move signal handling to a goroutine in the lifecycle sidecar so that it can exit cleanly and avoid a race condition if a resync takes a long time (see the sketch below).
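
For reference, a minimal sketch of the pattern (illustrative only, not the actual diff; the sync period and the `consul services register` invocation here are assumptions):

```go
package main

import (
	"context"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Handle SIGINT/SIGTERM in a separate goroutine so the sync loop below
	// never has to race the signal: the signal simply cancels the context.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigCh
		cancel()
	}()

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// Exit cleanly without issuing another registration.
			return
		case <-ticker.C:
			// The command inherits ctx, so an in-flight (slow) registration
			// is killed as soon as the signal arrives instead of completing
			// after the preStop hook has already deregistered the service.
			cmd := exec.CommandContext(ctx, "consul", "services", "register", "/consul/service.hcl")
			_ = cmd.Run()
		}
	}
}
```

The key point is that exec.CommandContext ties the child process's lifetime to the context, so cancellation both stops the loop and kills any registration already in flight.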

How I've tested this PR:
existing unit tests + manual tests

How I expect reviewers to test this PR:
Spin up the latest release of consul-k8s and deploy a sample app that is connect-injected; it is helpful to see the issue in the UI, so enable that as well. Once the app is online, delete the application. You can watch the logs of the lifecycle-sidecar for the pod side by side with the agent logs and see that the lifecycle sidecar runs a register after the pod deregisters its service, or see in the UI that the service never goes away.

Checklist:

  • Tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@kschoche kschoche requested review from a team, lkysow and thisisnotashwin and removed request for a team December 16, 2020 00:30
Contributor

@thisisnotashwin thisisnotashwin left a comment


I really like the updated pattern for signal handling and passing a context into exec.CommandContext().

I might be out of the loop here, but does this address a specific problem we are running into? Also, given we don't really follow a pattern of logging time taken by a command, I'm unsure if this is something we would want to merge vs use to identify root cause and fix independently.

@kschoche
Contributor Author

> I might be out of the loop here, but does this address a specific problem we are running into? Also, given we don't really follow a pattern of logging time taken by a command, I'm unsure if this is something we would want to merge vs use to identify root cause and fix independently.

Sorry about that, I feel silly for not expanding in the PR.

This addresses the issues that we've been seeing with services in the UI showing up after the application has been deleted. What was happening was that an application would get deleted, the pods would terminate, and their preStop hooks would deregister the service (normal); however, due to the gap between running cmd.CombinedOutput() and actually catching the term signal in the lifecycle sidecar, we would re-register the pod on the way out nearly every time!

This fixes the race condition between lifecycle-sidecar getting a term signal and pod de/re-registration.

There is also a past known issue where we suspected pods were not executing their preStop hooks due to abnormal termination scenarios in k8s, which occasionally left services registered with Consul even though the pod was gone. This looks+reads+smells like the actual cause of that problem, as opposed to "abnormal termination."
The underlying issue that caused all of this to surface in the latest release is the abnormally long time it currently takes for our register command to complete, which means that about 50% of the time we are not able to catch our term signal before re-registering the pod.
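
For contrast, this is roughly the racy shape being described (a simplified, self-contained sketch, not the actual pre-fix code; the period and command are illustrative):

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-sigCh:
			// Too late: if the sync below was in flight when the preStop hook
			// deregistered the service, it has just re-registered it.
			return
		case <-ticker.C:
			// CombinedOutput blocks until the (slow) registration finishes;
			// a SIGTERM that arrives meanwhile just sits in sigCh.
			cmd := exec.Command("consul", "services", "register", "/consul/service.hcl")
			_, _ = cmd.CombinedOutput()
		}
	}
}
```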

Let me know if that doesn't make sense and I can demo it to you, and I'll also add steps to repro to the PR.

Member

@lkysow lkysow left a comment


This is looking good. Like you mentioned, we also need a test case. We should be able to start the command, send the signal, and see that the service isn't re-registered in Consul.
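
Something along these lines could work as a rough, self-contained sketch of that test (it exercises the cancellation pattern against a fake registration endpoint rather than the real command; the package name, slow handler, timings, and SIGTERM-to-self are assumptions, and it is Unix-only because of syscall.Kill):

```go
package lifecyclesidecar // package name illustrative

import (
	"context"
	"net/http"
	"net/http/httptest"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"testing"
	"time"
)

func TestNoReRegisterAfterSIGTERM(t *testing.T) {
	var completed int32
	// Slow fake "register" endpoint, standing in for the long-running
	// registration described in the PR. A registration only counts as
	// landed if the request is allowed to finish.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
			return // client cancelled before the registration "landed"
		case <-time.After(2 * time.Second):
			atomic.AddInt32(&completed, 1)
		}
	}))
	defer srv.Close()

	// The fixed pattern: a goroutine turns SIGTERM into context cancellation.
	ctx, cancel := context.WithCancel(context.Background())
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	defer signal.Stop(sigCh)
	go func() {
		<-sigCh
		cancel()
	}()

	// One cancellable "sync" in flight.
	done := make(chan struct{})
	go func() {
		defer close(done)
		req, _ := http.NewRequestWithContext(ctx, http.MethodPut, srv.URL, nil)
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
		}
	}()

	// Signal ourselves while the sync is still running.
	time.Sleep(100 * time.Millisecond)
	_ = syscall.Kill(syscall.Getpid(), syscall.SIGTERM)

	<-done
	if atomic.LoadInt32(&completed) != 0 {
		t.Fatal("registration completed after SIGTERM; the service would be stale")
	}
}
```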

Contributor

@thisisnotashwin thisisnotashwin left a comment


Nice work!

Member

@lkysow lkysow left a comment


🦊 ! See comment about test, also needs changelog entry.

@kschoche kschoche added the type/bug Something isn't working label Dec 17, 2020
@kschoche kschoche self-assigned this Dec 17, 2020
@kschoche kschoche merged commit 18eabd8 into master Dec 17, 2020
@kschoche kschoche deleted the update_signal_handling_lifecyclesidecar branch December 17, 2020 20:51