Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resource cleanup in ingress_utils.go within e2e/framework #42555

Merged

Conversation

nicksardo
Copy link
Contributor

@nicksardo nicksardo commented Mar 6, 2017

What this PR does / why we need it:
The GLBC is failing to delete resources during the etcd rollback tests and the e2e cleanup is leaking them. After a short while, tests are failing to create new resources.
This PR addresses the e2e/framework's ability to delete GLBC-created resources and adds more logging.

Which issue this PR fixes:
Helps #38569 but does not completely close this flake

Special notes for your reviewer:
Resources were not being deleted because resource names were being truncated and then their ability to be deleted was determined by the entire cluster id existing in the name. Truncated names also have an extra '0' append to the end of their name (unknown origin). This PR tries to match on a common prefix.

Minor changes were made to improve log readability.

Testing this PR:
This was tested by running a master upgrade test and by adding a second forwarding-rule mid-run. This forwarding rule referenced the same url-map used by the first forwarding-rule created by the GLBC. Therefore, the GLBC will be able to delete the forwarding-rule but not anymore L7 resources. This second forwarding rule's name was nearly identical to the first forwarding rule so that the cleanup code will find it.

As you can see from the test run below, the cleanup code deleted all the resources that the GLBC could not.

...
Mar  5 18:35:53.112: INFO: Monitoring glbc's cleanup of gce resources:
k8s-fws-e2e-tests-ingress-upgrsde-0px85-static-ip--5f38ac0e2420 (forwarding rule)
k8s-tps-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420 (target-https-proxy)
k8s-um-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e24260 (url-map)
k8s-be-32331--5f38ac0e2426f796 (backend-service)
k8s-be-32613--5f38ac0e2426f796 (backend-service)
k8s-be-32331--5f38ac0e2426f796 (http-health-check)
k8s-be-32613--5f38ac0e2426f796 (http-health-check)
k8s-ig--5f38ac0e2426f796 (instance-group)
k8s-ssl-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420 (ssl-certificate)

STEP: Performing final delete of any remaining resources
Mar  5 18:35:54.055: INFO: Deleting forwarding-rules: k8s-fws-e2e-tests-ingress-upgrsde-0px85-static-ip--5f38ac0e2420
Mar  5 18:36:06.945: INFO: Deleting target-https-proxies: k8s-tps-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420
Mar  5 18:36:14.301: INFO: Deleting url-map: k8s-um-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e24260
Mar  5 18:36:18.309: INFO: Deleting backed-service: k8s-be-32331--5f38ac0e2426f796
Mar  5 18:36:22.112: INFO: Deleting backed-service: k8s-be-32613--5f38ac0e2426f796
Mar  5 18:36:26.192: INFO: Deleting http-health-check: k8s-be-32331--5f38ac0e2426f796
Mar  5 18:36:29.846: INFO: Deleting http-health-check: k8s-be-32613--5f38ac0e2426f796
Mar  5 18:36:33.722: INFO: Deleting instance-group: k8s-ig--5f38ac0e2426f796
Mar  5 18:36:37.762: INFO: Deleting ssl-certificate: k8s-ssl-e2e-tests-ingress-upgrade-0px85-static-ip--5f38ac0e2420
STEP: No resources leaked.
Mar  5 18:36:46.441: INFO: Deleting addresses: e2e-tests-ingress-upgrade-0px85-static-ip
Mar  5 18:36:53.902: INFO: L7 controller failed to delete all cloud resources on time. timed out waiting for the condition
...

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 6, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Mar 6, 2017
@nicksardo
Copy link
Contributor Author

@mml

@spxtr spxtr assigned mml and unassigned spxtr Mar 6, 2017
@nicksardo
Copy link
Contributor Author

Hey @krousey, Matt is actually out today. Wondering if you'd like to take a look or can point me to a more appropriate reviewer.

Copy link
Contributor

@krousey krousey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small question.. and one double take.

@spxtr, doesn't the test-infra team take steps to try to clean this up?

// Static-IP allocated on behalf of the test, never deleted by the
// controller. Delete this IP only after the controller has had a chance
// to cleanup or it might interfere with the controller, causing it to
// throw out confusing events.
if ipErr := wait.Poll(5*time.Second, LoadBalancerCleanupTimeout, func() (bool, error) {
if ipErr := wait.Poll(5*time.Second, 1*time.Minute, func() (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you still mean to change this? Or was this to aid in testing and debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On purpose. The LoadBalancerCleanupTimeout is 15 minutes and is already used above at 336.

This wait.Poll(...) is for manually deleting the static IP created during Setup(...). We want to try just enough times to make sure any propagation of deletions from above will make the ip viewed as un-used. Even polling for a minute may be overkill.

@@ -648,8 +674,7 @@ func (cont *GCEIngressController) isHTTPErrorCode(err error, code int) bool {

// Cleanup cleans up cloud resources.
// If del is false, it simply reports existing resources without deleting them.
// It always deletes resources created through it's methods, like staticIP, even
// if del is false.
// If dle is true, it deletes resources it finds acceptable (see canDelete func).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If del is true".

Also, this is really gross. It's 2 functions hiding behind a boolean switch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very gross, indeed. It's also looking at the return strings of deleteXYZ(del bool) string to determine if resources exists. Frankly, all of it would benefit from being rebuilt. I'm more concerned with stopping the leak of resources at the moment.

@nicksardo
Copy link
Contributor Author

@krousey I also wanted some advice on using your upgrade-test interface. If the code encounters a failure on Setup, should I call Teardown (or the implementation of Teardown) myself? I had this build fail before disruption and noticed that Teardown wasn't called, resulting in leaked resources.

@krousey krousey added this to the v1.6 milestone Mar 7, 2017
@krousey
Copy link
Contributor

krousey commented Mar 7, 2017

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2017
@krousey krousey added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Mar 7, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: krousey, nicksardo

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @ncdc
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2017
@nicksardo
Copy link
Contributor Author

@k8s-bot gce etcd3 e2e test this

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 42080, 41653, 42598, 42555)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants