Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolving race condition with Global configs in KDD #1070

Merged
merged 1 commit into from
Sep 13, 2017

Conversation

heschlie
Copy link
Contributor

@heschlie heschlie commented Aug 31, 2017

Description

Fixes #1009

Added a check to see if the error we see if of the type errors.ErrorResourceUpdateConflict when we set global config variables, as this means another Node set it while the current Node was trying to. As all the Nodes are trying to set them as the same value we can safely ignore this and carry on.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

None Required

@heschlie heschlie added this to the Calico v2.6.0 milestone Aug 31, 2017
@heschlie heschlie self-assigned this Aug 31, 2017
@@ -788,12 +788,12 @@ func ensureDefaultConfig(cfg *api.CalicoAPIConfig, c *client.Client, node *api.N

// Store the Calico Version as a global felix config setting.
if err := c.Config().SetFelixConfig("CalicoVersion", "", VERSION); err != nil {
return err
return checkConflictError(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a conflict, I don't think we want to short-circuit here, we still want to continue with the rest of the default configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes you're right, it shouldn't do that.

@heschlie heschlie added the release-note-required Change has user-facing impact (no matter how small) label Aug 31, 2017
@bcreane
Copy link
Contributor

bcreane commented Sep 1, 2017

@heschlie - what's your take on the feasibility of adding an ST that ensures the various ensure*Config functions no longer fail when another node is poking the relevant key/value?

@heschlie
Copy link
Contributor Author

heschlie commented Sep 1, 2017

@bcreane I'll need to mull that one over, I'm not sure of a way to "guarantee" we're causing that error to occur, or if it is worth it to engineer a test for it. Will take a stab at it and see if inspiration strikes.

@@ -787,12 +787,12 @@ func ensureDefaultConfig(cfg *api.CalicoAPIConfig, c *client.Client, node *api.N
}

// Store the Calico Version as a global felix config setting.
if err := c.Config().SetFelixConfig("CalicoVersion", "", VERSION); err != nil {
if err := checkConflictError(c.Config().SetFelixConfig("CalicoVersion", "", VERSION)); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about an alternative approach that we check if the value is already set to what we want to set and do nothing if it's already set to that? This kind of conflict can be very common since all the calico-nodes in the cluster will try to set these values when they boot up. So if we have 100 nodes they will all try to set CalicoVersion, ClusterType to the same value and will likely run into a conflict.

It's slightly different from ignoring the error when we see the conflict since if we're doing an upgrade on a small cluster and we try to update CalicoVersion and see a conflict, we can't just ignore that. Maybe we want to check if the value is what we want to set to, if not then we retry.

WDYGT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me, we're already doing that with some other values via ensureGlobalFelixConfig would be trivial to include this as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That on its own won't be enough. There's still a race condition between when we check what the value is and when we try to write a new value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but inside the ensureGlobalFelixConfig I've added a check for said error as well, I don't know if or how much more expensive a SetFelixConfig is over a GetFelixConfig so maybe the impact would but pretty much nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I re-read @gunjan5 comment, and yea essentially what is currently happening is the API server sees we're trying to update a value tied to an invalid resource version because two Nodes were trying to set the same value at the same time, and the one who lost ends up receiving this error, and terminating the startup script. Since we can be pretty sure the value for the Global configs was just set by another Node we can safely ignore it.

@caseydavenport
Copy link
Member

I don't think this requires a release note.

@heschlie heschlie removed the release-note-required Change has user-facing impact (no matter how small) label Sep 1, 2017
@heschlie
Copy link
Contributor Author

heschlie commented Sep 5, 2017

@gunjan5 @caseydavenport @bcreane do we feel tests are necessary for this?

@caseydavenport
Copy link
Member

@heschlie I'd feel a lot safer with tests for this.

At a minimum I'd like to see what manual testing has been done on this to ensure it's behaving correctly and actually fixes the issue.

@heschlie
Copy link
Contributor Author

heschlie commented Sep 5, 2017

@caseydavenport to cover the minimum, it was reliably reproducible on a 5 Node cluster. Fire up the Nodes, deploy Calico, watch calico/node come up and see restarts occur, verify logs show conflicts for the global config values. To validate the fix I setup calico/node to ignore the conflict errors only for global config variables (though on the first go I did miss some and was able to see it conflicting on different variables, so a couple iterations), relaunch the cluster several times to verify there are no restarts of calico/node.

I don't think we have anything to spin up a full k8s cluster in the calico repo, and that feels too heavy. I think for this test we could just spin up a k8s API server and run the startup script against it concurrently to simulate several instances of calico/node coming up. WDYGT?

// and ignore it if so. This is to allow our global configs to ignore conflict from multiple Nodes
// trying to set the same value at the same time.
func checkConflictError(err error) error {
if _, ok := err.(errors.ErrorResourceUpdateConflict); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heschlie before we merge this can we at least add an info log here?

Something along the lines of "Ignoring conflict when attempting to set config %s=%s"

@@ -0,0 +1,69 @@
kind: List
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this file? Don't we already have this checked in somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we have one in the upgrade section, I'll have it point to that one instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we have one with only CRDs in this repo, but we can apply the one in libcalico-go by using it's url, so if the CRDs change in future we don't have to remember to update them here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one in upgrade is split into 2 because name overlap between crd and tpr names

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also do what Felix does and pull from the vendor dir: https://github.com/projectcalico/felix/blob/master/k8sfv/run-test#L63

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling from the vendor dir seems like the most straightforward if we're having success with it in Felix, that's the one I grabbed anyway.

@heschlie
Copy link
Contributor Author

@caseydavenport @gunjan5 I pushed up some more changes, and updated the test a tad.

@@ -388,6 +390,41 @@ run-etcd-host:
--advertise-client-urls "http://$(LOCAL_IP_ENV):2379,http://127.0.0.1:2379" \
--listen-client-urls "http://0.0.0.0:2379"

## Kubernetes apiserver used for tests
run-k8s-apiserver: stop-k8s-apiserver run-etcd vendor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should depend on vendor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to, if we're pulling in the CRDs from libcalico-go in the vendor dir this will fail if it doesn't exist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, right.

stop-k8s-apiserver:
@-docker rm -f st-apiserver


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra newline here

- Added a check to see if the error we see if of the type
`errors.ErrorResourceUpdateConflict` when we set global config
variables, as this means another Node set it while the current Node was
trying to.  As all the Nodes are trying to set them as the same value we can safely ignore
this and carry on.

- Added test that spins up API server and creates 10 Nodes in goroutines
  to simulate a bunch of Nodes coming up at once.
@caseydavenport
Copy link
Member

@heschlie once this is squashed and tests pass then I think we can go ahead and merge.

@heschlie
Copy link
Contributor Author

@caseydavenport or @gunjan5 Squashed, but need an approved review to merge.

@caseydavenport caseydavenport merged commit a87ae0f into projectcalico:master Sep 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

calico/node fails to start due to config race
4 participants