-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolving race condition with Global configs in KDD #1070
Conversation
calico_node/startup/startup.go
Outdated
@@ -788,12 +788,12 @@ func ensureDefaultConfig(cfg *api.CalicoAPIConfig, c *client.Client, node *api.N | |||
|
|||
// Store the Calico Version as a global felix config setting. | |||
if err := c.Config().SetFelixConfig("CalicoVersion", "", VERSION); err != nil { | |||
return err | |||
return checkConflictError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's a conflict, I don't think we want to short-circuit here, we still want to continue with the rest of the default configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes you're right, it shouldn't do that.
@heschlie - what's your take on the feasibility of adding an ST that ensures the various |
@bcreane I'll need to mull that one over, I'm not sure of a way to "guarantee" we're causing that error to occur, or if it is worth it to engineer a test for it. Will take a stab at it and see if inspiration strikes. |
@@ -787,12 +787,12 @@ func ensureDefaultConfig(cfg *api.CalicoAPIConfig, c *client.Client, node *api.N | |||
} | |||
|
|||
// Store the Calico Version as a global felix config setting. | |||
if err := c.Config().SetFelixConfig("CalicoVersion", "", VERSION); err != nil { | |||
if err := checkConflictError(c.Config().SetFelixConfig("CalicoVersion", "", VERSION)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about an alternative approach that we check if the value is already set to what we want to set and do nothing if it's already set to that? This kind of conflict can be very common since all the calico-nodes in the cluster will try to set these values when they boot up. So if we have 100 nodes they will all try to set CalicoVersion
, ClusterType
to the same value and will likely run into a conflict.
It's slightly different from ignoring the error when we see the conflict since if we're doing an upgrade on a small cluster and we try to update CalicoVersion
and see a conflict, we can't just ignore that. Maybe we want to check if the value is what we want to set to, if not then we retry.
WDYGT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems ok to me, we're already doing that with some other values via ensureGlobalFelixConfig
would be trivial to include this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That on its own won't be enough. There's still a race condition between when we check what the value is and when we try to write a new value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but inside the ensureGlobalFelixConfig
I've added a check for said error as well, I don't know if or how much more expensive a SetFelixConfig
is over a GetFelixConfig
so maybe the impact would but pretty much nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I re-read @gunjan5 comment, and yea essentially what is currently happening is the API server sees we're trying to update a value tied to an invalid resource version because two Nodes were trying to set the same value at the same time, and the one who lost ends up receiving this error, and terminating the startup script. Since we can be pretty sure the value for the Global configs was just set by another Node we can safely ignore it.
I don't think this requires a release note. |
@gunjan5 @caseydavenport @bcreane do we feel tests are necessary for this? |
@heschlie I'd feel a lot safer with tests for this. At a minimum I'd like to see what manual testing has been done on this to ensure it's behaving correctly and actually fixes the issue. |
@caseydavenport to cover the minimum, it was reliably reproducible on a 5 Node cluster. Fire up the Nodes, deploy Calico, watch I don't think we have anything to spin up a full k8s cluster in the calico repo, and that feels too heavy. I think for this test we could just spin up a k8s API server and run the startup script against it concurrently to simulate several instances of |
calico_node/startup/startup.go
Outdated
// and ignore it if so. This is to allow our global configs to ignore conflict from multiple Nodes | ||
// trying to set the same value at the same time. | ||
func checkConflictError(err error) error { | ||
if _, ok := err.(errors.ErrorResourceUpdateConflict); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heschlie before we merge this can we at least add an info log here?
Something along the lines of "Ignoring conflict when attempting to set config %s=%s"
@@ -0,0 +1,69 @@ | |||
kind: List |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file? Don't we already have this checked in somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we have one in the upgrade section, I'll have it point to that one instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we have one with only CRDs in this repo, but we can apply the one in libcalico-go by using it's url, so if the CRDs change in future we don't have to remember to update them here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one in upgrade is split into 2 because name overlap between crd and tpr names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also do what Felix does and pull from the vendor dir: https://github.com/projectcalico/felix/blob/master/k8sfv/run-test#L63
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling from the vendor dir seems like the most straightforward if we're having success with it in Felix, that's the one I grabbed anyway.
@caseydavenport @gunjan5 I pushed up some more changes, and updated the test a tad. |
@@ -388,6 +390,41 @@ run-etcd-host: | |||
--advertise-client-urls "http://$(LOCAL_IP_ENV):2379,http://127.0.0.1:2379" \ | |||
--listen-client-urls "http://0.0.0.0:2379" | |||
|
|||
## Kubernetes apiserver used for tests | |||
run-k8s-apiserver: stop-k8s-apiserver run-etcd vendor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should depend on vendor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to, if we're pulling in the CRDs from libcalico-go in the vendor dir this will fail if it doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, right.
calico_node/Makefile
Outdated
stop-k8s-apiserver: | ||
@-docker rm -f st-apiserver | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra newline here
- Added a check to see if the error we see if of the type `errors.ErrorResourceUpdateConflict` when we set global config variables, as this means another Node set it while the current Node was trying to. As all the Nodes are trying to set them as the same value we can safely ignore this and carry on. - Added test that spins up API server and creates 10 Nodes in goroutines to simulate a bunch of Nodes coming up at once.
fa24ef7
to
191d4f9
Compare
@heschlie once this is squashed and tests pass then I think we can go ahead and merge. |
@caseydavenport or @gunjan5 Squashed, but need an approved review to merge. |
Description
Fixes #1009
Added a check to see if the error we see if of the type
errors.ErrorResourceUpdateConflict
when we set global config variables, as this means another Node set it while the current Node was trying to. As all the Nodes are trying to set them as the same value we can safely ignore this and carry on.Todos
Release Note