Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kuma-cp) Clear snapshots from cache on disconnect #2172

Merged
merged 1 commit into from
Jun 16, 2021

Conversation

lahabana
Copy link
Contributor

@lahabana lahabana commented Jun 16, 2021

Previously we were pushing an empty snapshot on the cache when a dp was disconnecting.
This was causing the dp to get an empty snapshot after a restart.
This would cause listeners to disappear for a short time and causing the traffic going through the dp to fail.
We now clear the snapshot from the dp to avoid this behaviour.

Fix #2171

Testing

  • Unit tests
  • E2E tests
  • Manual testing on Universal
  • Manual testing on Kubernetes

Previously we were pushing an empty snapshot on the cache when a dp was disconnecting.
This was causing the dp to get an empty snapshot after a restart.
This would cause listeners to disappear for a short time and causing the traffic going through the dp to fail.
We now clear the snapshot from the dp to avoid this behaviour.

Fix kumahq#2176

Signed-off-by: Charly Molter <charly@koyeb.com>
@jakubdyszkiewicz
Copy link
Contributor

Ok, I think this change makes sense.

The only problem is that I believe this change will make SDS tests really flaky. I remember that with a clear snapshot it was a huge problem. But I'd rather have a flaky test than a problem on prod.

Let's just keep in mind that we may need to revisit SDS tests after merging it.

@lahabana
Copy link
Contributor Author

@jakubdyszkiewicz was it e2e tests or unit tests that were flaky? I can try to look at the flakiness if you tell me where to start :)

@jakubdyszkiewicz
Copy link
Contributor

@lahabana pkg/sds/server/v3/server_test.go this one. I remember I was only able to reproduce it when I was hammering my VM with yes > /dev/null & and running tests in a loop.

@lahabana
Copy link
Contributor Author

Ok thanks I'll try to repro! Should I merge this anyway and potentially fix the test in a follow up PR?

@jakubdyszkiewicz
Copy link
Contributor

yes, definitely. Go ahead and merge it (squash and merge 🙏 )

@lahabana lahabana merged commit b780f95 into kumahq:master Jun 16, 2021
@lahabana
Copy link
Contributor Author

yes, definitely. Go ahead and merge it (squash and merge 🙏 )

Can an admin make squash and merge the default as well :) (And maybe disable the other options in the github repo settings)?

@lahabana lahabana deleted the fix/2176 branch June 16, 2021 11:16
@jakubdyszkiewicz
Copy link
Contributor

it should be by default 🤔 or maybe it's only on my profile

@lahabana
Copy link
Contributor Author

Hey @jakubdyszkiewicz I can't manage to make the tests flaky. I've done 154 attempts of the tests (ginkgo -v -untilItFails -race pkg/sds/server/...) with a docker image with 0.5 cpu and yes > /dev/null running and never got any failure.
Don't you think you change on ed84c9c might have fixed the flakiness?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Envoy on Dps can get cleaned before added back which causes network errors.
2 participants