Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sync of multi-cluster ingress #183

Merged
merged 2 commits into from
Mar 31, 2018
Merged

Conversation

nicksardo
Copy link
Contributor

@nicksardo nicksardo commented Mar 29, 2018

Fixes #182
Also fixes an pre-existing issue of keeping around backend services if MCI and normal ingresses shared backend services. If the normal ingress was deleted, the backend service wouldn't be GC'd until MCI was deleted.

The logic for syncing MCI vs GCE-ingress was not easily distinguishable. I've re-ordered the sync logic and broke down Checkpoint into something I think is more clear.
Checkpoint is broken up into three separate funcs and are called in the following order:

  1. EnsureInstanceGroupsAndPorts()
    • MCI updates annotations and early returns.
  2. EnsureLoadBalancer()
  3. EnsureFirewall()

cc @nikhiljindal @csbell @bowei @MrHohn @rramkumar1

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 29, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 29, 2018
@nikhiljindal
Copy link
Contributor

Thanks for sending this @nicksardo
Have added more e2e tests in kubernetes/kubernetes#61909

@nikhiljindal
Copy link
Contributor

/assign

Items: []extensions.Ingress{*ing},
igs, err := lbc.CloudClusterManager.EnsureInstanceGroupsAndPorts(nodeNames, allNodePorts)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set syncError = err as at other places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. As I said, WIP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be retErr now?

ing.Annotations = map[string]string{}
}
if err = setInstanceGroupsAnnotation(ing.Annotations, igs); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. set syncError = err?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted syncError

if err = setInstanceGroupsAnnotation(ing.Annotations, igs); err != nil {
return err
}
return updateAnnotations(lbc.client, ing.Name, ing.Namespace, ing.Annotations)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the returned error should be set as SyncError?

// for the Ingress associated with "key" is ready for a UrlMap update.
// If this encounters an error, eg for quota reasons, we want to invoke
// Phase 2 right away and retry checkpointing.
// * Phase 2 performs GC by refcounting shared resources. This needs to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I think this comment was useful specially this explanation of why we always run phase 2, even when phase 1 (creating LB fails).
This helps understand why we have the defer func and tricky syncError tracking code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will replace with a better comment elsewhere.

@nikhiljindal
Copy link
Contributor

@nicksardo Were you able to find the PR that caused this regression? Will help in comparing how this was working before.

@nicksardo
Copy link
Contributor Author

It was either the PR that stopped syncing multiple load balancers or stopped syncing multiple backend services. I don't think knowing which one caused the regression is important. You can examine the code at v0.9.7 to see how MCI was separated from GCE-ingress. I think the division was too subtle and was bound to cause this break.

@nicksardo nicksardo changed the title WIP: Fix sync of multi-cluster ingress Fix sync of multi-cluster ingress Mar 30, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2018
Copy link
Contributor

@nikhiljindal nikhiljindal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly lg.
One comment about how to handle errors.
Earlier, we were raising an event for any error returned by Checkpoint. We seem to be ignoring some of them now (not generating event). Is that intentional?

Items: []extensions.Ingress{*ing},
igs, err := lbc.CloudClusterManager.EnsureInstanceGroupsAndPorts(nodeNames, allNodePorts)
if err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be retErr now?

}
// Create the backend services and higher-level LB resources.
if err = lbc.CloudClusterManager.EnsureLoadBalancer(lb, lbSvcPorts, igs); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set to retErr?

@@ -344,20 +351,24 @@ func (lbc *LoadBalancerController) sync(key string) (err error) {
// Update the UrlMap of the single loadbalancer that came through the watch.
l7, err := lbc.CloudClusterManager.l7Pool.Get(key)
if err != nil {
syncError = fmt.Errorf("%v, unable to get loadbalancer: %v", syncError, err)
return syncError
return fmt.Errorf("unable to get loadbalancer: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was being set to syncError earlier and hence we were generating event for this. Is the change to not generate event intentional?

@nikhiljindal
Copy link
Contributor

As discussed with @nicksardo, turns out I was understanding named returns wrong.
We were unnecessarily setting syncErr = err before, its not required :)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2018
@nicksardo
Copy link
Contributor Author

nicksardo commented Mar 30, 2018

There is one function difference in terms of syncError. Previously, if we failed "Checkpoint", it would continue on to update the URL map. I don't think that's behavior we want to keep. Since the controller now only syncs GCP resources belonging to the ingress, updating the URL map seems pointless if there's an error earlier.

@nikhiljindal
Copy link
Contributor

sgtm

@bowei
Copy link
Member

bowei commented Mar 30, 2018

/approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingress controller should only manage instance groups for multicluster ingress
4 participants