Fix sync of multi-cluster ingress #183

nicksardo · 2018-03-29T19:24:53Z

Fixes #182
Also fixes a pre-existing issue where backend services were kept around if MCI and normal ingresses shared them. If the normal ingress was deleted, the backend service wouldn't be GC'd until the MCI was deleted.

The logic for syncing MCI vs GCE-ingress was not easily distinguishable. I've re-ordered the sync logic and broken Checkpoint down into something I think is clearer.
Checkpoint is now split into three separate funcs, which are called in the following order:

1. EnsureInstanceGroupsAndPorts()
   - MCI updates annotations and returns early.
2. EnsureLoadBalancer()
3. EnsureFirewall()
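
The ordering described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `cloudManager` interface, the simplified argument lists, the `isMCI` flag, the `annotate` callback, and the `recorder` fake are all hypothetical names introduced here. It also assumes that any failing step stops the sync.

```go
package main

import (
	"errors"
	"fmt"
)

// cloudManager is a hypothetical stand-in for the controller's cluster
// manager. The real methods take more arguments (see the diff hunks).
type cloudManager interface {
	EnsureInstanceGroupsAndPorts() ([]string, error)
	EnsureLoadBalancer(igs []string) error
	EnsureFirewall() error
}

// syncIngress runs the three steps in order. For a multi-cluster ingress it
// only annotates the ingress with the instance groups and returns early.
func syncIngress(m cloudManager, isMCI bool, annotate func(igs []string) error) error {
	igs, err := m.EnsureInstanceGroupsAndPorts()
	if err != nil {
		return fmt.Errorf("ensuring instance groups: %v", err)
	}
	if isMCI {
		// MCI: no load balancer or firewall is managed by this controller.
		return annotate(igs)
	}
	if err := m.EnsureLoadBalancer(igs); err != nil {
		return fmt.Errorf("ensuring load balancer: %v", err)
	}
	return m.EnsureFirewall()
}

// recorder is a fake cloudManager that records which steps were called.
type recorder struct {
	calls  []string
	failLB bool
}

func (r *recorder) EnsureInstanceGroupsAndPorts() ([]string, error) {
	r.calls = append(r.calls, "instance-groups")
	return []string{"ig-a"}, nil
}

func (r *recorder) EnsureLoadBalancer(igs []string) error {
	r.calls = append(r.calls, "load-balancer")
	if r.failLB {
		return errors.New("quota exceeded")
	}
	return nil
}

func (r *recorder) EnsureFirewall() error {
	r.calls = append(r.calls, "firewall")
	return nil
}

func main() {
	noop := func([]string) error { return nil }

	mci := &recorder{}
	_ = syncIngress(mci, true, noop)
	fmt.Println("MCI steps:", mci.calls)

	gce := &recorder{}
	_ = syncIngress(gce, false, noop)
	fmt.Println("GCE steps:", gce.calls)
}
```

In this sketch the MCI path touches only instance groups, while the GCE-ingress path runs all three steps in order.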

cc @nikhiljindal @csbell @bowei @MrHohn @rramkumar1

nikhiljindal · 2018-03-29T23:57:01Z

Thanks for sending this @nicksardo
Have added more e2e tests in kubernetes/kubernetes#61909

nikhiljindal · 2018-03-30T05:06:55Z

/assign

nikhiljindal · 2018-03-30T05:24:13Z

pkg/controller/controller.go

-		Items: []extensions.Ingress{*ing},
+	igs, err := lbc.CloudClusterManager.EnsureInstanceGroupsAndPorts(nodeNames, allNodePorts)
+	if err != nil {
+		return err


Set syncError = err as at other places?

Yep. As I said, WIP.

It should be retErr now?

nikhiljindal · 2018-03-30T05:24:38Z

pkg/controller/controller.go

+			ing.Annotations = map[string]string{}
+		}
+		if err = setInstanceGroupsAnnotation(ing.Annotations, igs); err != nil {
+			return err


Same. set syncError = err?

Deleted syncError

nikhiljindal · 2018-03-30T05:24:55Z

pkg/controller/controller.go

+		if err = setInstanceGroupsAnnotation(ing.Annotations, igs); err != nil {
+			return err
+		}
+		return updateAnnotations(lbc.client, ing.Name, ing.Namespace, ing.Annotations)


the returned error should be set as SyncError?

nikhiljindal · 2018-03-30T05:29:31Z

pkg/controller/controller.go

-	//   for the Ingress associated with "key" is ready for a UrlMap update.
-	//   If this encounters an error, eg for quota reasons, we want to invoke
-	//   Phase 2 right away and retry checkpointing.
-	// * Phase 2 performs GC by refcounting shared resources. This needs to


fwiw, I think this comment was useful, especially the explanation of why we always run phase 2 even when phase 1 (creating the LB) fails.
This helps in understanding why we have the defer func and the tricky syncError tracking code.

Will replace with a better comment elsewhere.

nikhiljindal · 2018-03-30T05:31:18Z

@nicksardo Were you able to find the PR that caused this regression? Will help in comparing how this was working before.

nicksardo · 2018-03-30T17:44:28Z

It was either the PR that stopped syncing multiple load balancers or stopped syncing multiple backend services. I don't think knowing which one caused the regression is important. You can examine the code at v0.9.7 to see how MCI was separated from GCE-ingress. I think the division was too subtle and was bound to cause this break.

nikhiljindal

Mostly lg.
One comment about how to handle errors.
Earlier, we were raising an event for any error returned by Checkpoint. We seem to be ignoring some of them now (not generating event). Is that intentional?

nikhiljindal · 2018-03-30T22:11:11Z

pkg/controller/controller.go

-		Items: []extensions.Ingress{*ing},
+	igs, err := lbc.CloudClusterManager.EnsureInstanceGroupsAndPorts(nodeNames, allNodePorts)
+	if err != nil {
+		return err


It should be retErr now?

nikhiljindal · 2018-03-30T22:11:23Z

pkg/controller/controller.go

-		}
+	// Create the backend services and higher-level LB resources.
+	if err = lbc.CloudClusterManager.EnsureLoadBalancer(lb, lbSvcPorts, igs); err != nil {
+		return err


set to retErr?

nikhiljindal · 2018-03-30T22:12:16Z

pkg/controller/controller.go

@@ -344,20 +351,24 @@ func (lbc *LoadBalancerController) sync(key string) (err error) {
 	// Update the UrlMap of the single loadbalancer that came through the watch.
 	l7, err := lbc.CloudClusterManager.l7Pool.Get(key)
 	if err != nil {
-		syncError = fmt.Errorf("%v, unable to get loadbalancer: %v", syncError, err)
-		return syncError
+		return fmt.Errorf("unable to get loadbalancer: %v", err)


This was being set to syncError earlier and hence we were generating event for this. Is the change to not generate event intentional?

nikhiljindal · 2018-03-30T23:02:10Z

As discussed with @nicksardo, it turns out I was misunderstanding named returns.
We were unnecessarily setting syncError = err before; it's not required :)

/lgtm
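
The named-return point above can be illustrated with a small standalone sketch. In Go, `return <expr>` assigns to the named result before deferred funcs run, so a deferred func can see the error without any explicit assignment. `syncOnce` and `recordEvent` are hypothetical names used only for this illustration; the shape loosely follows `sync(key string) (err error)` from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// syncOnce has a named result, like sync(key string) (err error) in the diff.
// recordEvent is a hypothetical callback standing in for event generation.
func syncOnce(fail bool, recordEvent func(error)) (err error) {
	defer func() {
		// By the time this runs, the return statement below has already
		// assigned its value to err.
		if err != nil {
			recordEvent(err)
		}
	}()
	if fail {
		// No "err = ..." assignment is needed before returning.
		return fmt.Errorf("unable to get loadbalancer: %v", errors.New("not found"))
	}
	return nil
}

func main() {
	var seen []error
	err := syncOnce(true, func(e error) { seen = append(seen, e) })
	fmt.Println("returned:", err)
	fmt.Println("events recorded:", len(seen))
}
```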

nicksardo · 2018-03-30T23:10:08Z

There is one functional difference in terms of syncError. Previously, if "Checkpoint" failed, the sync would continue on to update the URL map. I don't think that's behavior we want to keep. Since the controller now only syncs GCP resources belonging to the ingress, updating the URL map seems pointless if there was an error earlier.

nikhiljindal · 2018-03-30T23:12:57Z

sgtm

bowei · 2018-03-30T23:50:20Z

/approve

k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, and size/M labels Mar 29, 2018

nicksardo force-pushed the fix-mci branch from 33a1b54 to 2fb7495 Compare March 29, 2018 23:28

k8s-ci-robot added the size/L label and removed the size/M label Mar 29, 2018

k8s-ci-robot assigned nikhiljindal Mar 30, 2018

nikhiljindal reviewed Mar 30, 2018

Restructure Checkpoint to fix MCI issue

e156251

nicksardo force-pushed the fix-mci branch from 6f489a4 to e156251 Compare March 30, 2018 20:00

Rename firewall sync error to XPN specific

5c909f3

nicksardo changed the title ~~WIP: Fix sync of multi-cluster ingress~~ Fix sync of multi-cluster ingress Mar 30, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress label Mar 30, 2018

nicksardo assigned bowei Mar 30, 2018

nikhiljindal reviewed Mar 30, 2018

k8s-ci-robot added the lgtm label Mar 30, 2018

nicksardo merged commit f24d554 into kubernetes:master Mar 31, 2018

nikhiljindal mentioned this pull request Mar 31, 2018

ingress controller should only manage instance groups for multicluster ingress #182

Closed

freehan mentioned this pull request Apr 2, 2018

Test Failing: [sig-network] Loadbalancing: L7 GCE [Slow] [Feature:Ingress] multicluster ingress should get instance group annotation #185

Closed

nikhiljindal mentioned this pull request Apr 3, 2018

Cherry-pick checkpoint changes to 1.0 #187

Merged
