Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

Merged
merged 3 commits into from
Sep 15, 2023

Conversation

dulek
Copy link
Contributor

@dulek dulek commented Sep 6, 2023

What this PR does / why we need it:
In test-lb-service.sh we do trap "delete_resources" EXIT to make sure we cleanup resources on a test failure. In there, we only fetched the $? after making a check for ${AUTO_CLEAN_UP}, which itself alters the code to 0, so function always returns success. This means tests can never really fail.

This commit fixes it by making sure $ERROR_CODE is fetched at the very beginning of the cleanup function.

Some additional fixes needed to be made to make tests passing again. In particular:

  • Deleting listeners when the secondary shared Service is deleted is fixed now by making sure we're comparing the correct tag.
  • Tests no longer create Services as internal in order to make shared LB tests work after [occm] Don't allow internal Services to share an LB #2190.
  • FIP is added to the user LB in the sharing test in order for CPO to not complain about not being able to create it for a secondary Service.

Which issue this PR fixes(if applicable):
fixes #2540

Special notes for reviewers:

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 6, 2023
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 6, 2023
@dulek
Copy link
Contributor Author

dulek commented Sep 6, 2023

mdulko:cloud-provider-openstack/ (fix-test-failures*) $ bash tests/e2e/cloudprovider/test-lb-service.sh                                                                                                              
+ TIMEOUT=300                                                                                                                                                                                                        
+ FLOATING_IP=                                                                                                                                                                                                       
+ NAMESPACE=octavia-lb-test                                                                                                                                                                                          
+ GATEWAY_IP=                                                                                                                                                                                                        
+ DEVSTACK_OS_RC=/home/zuul/devstack/openrc                                                                                                                                                                          
+ CLUSTER_TENANT=demo                                                                                                                                                                                                
+ CLUSTER_USER=demo                                                                                                                                                                                                  
+ LB_SUBNET_NAME=private-subnet                                                                                                                                                                                      
+ AUTO_CLEAN_UP=true                                                                                                                                                                                                 
+ trap delete_resources EXIT                                                                                                                                                                                         
+ test_basic                                                                                                                                                                                                         
+ exit 1                                                                                                                                                                                                             
+ delete_resources                                                                                                                                                                                                   
+ [[ true != \t\r\u\e ]]                                                                                                                                                                                             
+ ERROR_CODE=0
+ exit 0

Here's how I tested this.

@jichenjc
Copy link
Contributor

jichenjc commented Sep 7, 2023

/test openstack-cloud-controller-manager-e2e-test

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

>>>>>>> FAIL: Timeout when waiting for the Service test-shared-2 created

Well, now we finally start to see failures.

/retest

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

I0906 16:55:52.107153      11 event.go:307] "Event occurred" object="octavia-lb-test/test-shared-2" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: internal Service cannot share a load balancer"

Okay, so this got broken when I started to limit LB sharing, but I never noticed because tests were busted.

@mdbooth
Copy link
Contributor

mdbooth commented Sep 7, 2023

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?

mdulko:cloud-provider-openstack/ (provider-testing) $ git revert 788ea08fcf5714af8be3b0010bc6d6e5eb1d3a59
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: could not revert 788ea08fc... Don't allow internal Services to share an LB (#2190)

It's not that trivial anymore. Anyway I plan to look more closely at these tests, my current understanding is that adding service.beta.kubernetes.io/openstack-internal-load-balancer: "true" to the LB is just cargo cult from the previous test that did so to avoid unnecessary FIP allocation. I'm pretty sure this is all fixable.

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

@lingxiankong, do you know why we have this in place in tests?

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:564 :   printf '\n>>>>>>> Checking the listener number for the load balancer 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b\n'
++ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   openstack loadbalancer status show 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b
++ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   jq '.loadbalancer.listeners | length'
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   listenerNum=2
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:566 :   [[ 2 != 1 ]]
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:567 :   printf '\n>>>>>>> FAIL: The listener number should be 1 for load balancer 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b, actual: 2\n'

Going forward…

@mdbooth
Copy link
Contributor

mdbooth commented Sep 7, 2023

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?

mdulko:cloud-provider-openstack/ (provider-testing) $ git revert 788ea08fcf5714af8be3b0010bc6d6e5eb1d3a59
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: could not revert 788ea08fc... Don't allow internal Services to share an LB (#2190)

It's not that trivial anymore. Anyway I plan to look more closely at these tests, my current understanding is that adding service.beta.kubernetes.io/openstack-internal-load-balancer: "true" to the LB is just cargo cult from the previous test that did so to avoid unnecessary FIP allocation. I'm pretty sure this is all fixable.

Probably only worth it if the fix drags on a bit, then. In the meantime we probably shouldn't merge LB-related changes

@dulek
Copy link
Contributor Author

dulek commented Sep 7, 2023

/retest

I can see in the logs that listener was removed and LB was ACTIVE again. It couldn't be the ACTIVE-PENDING_UPDATE dance, as we only allow Service to be deleted once it's processed and kubectl waits for that.

Let's try it once again.

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 8, 2023
@dulek
Copy link
Contributor Author

dulek commented Sep 8, 2023

/retest

I can see in the logs that listener was removed and LB was ACTIVE again. It couldn't be the ACTIVE-PENDING_UPDATE dance, as we only allow Service to be deleted once it's processed and kubectl waits for that.

Let's try it once again.

Okay, there was a bug, fix is added.

@dulek
Copy link
Contributor Author

dulek commented Sep 8, 2023

>>>>>>> Waiting for the Service test-shared-user-lb creation finished

>>>>>>> FAIL: Timeout when waiting for the Service test-shared-user-lb created

That's another one.

@dulek
Copy link
Contributor Author

dulek commented Sep 8, 2023

Okay, this one was simple, I left some internal annotations by mistake.

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.
In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 11, 2023
@dulek
Copy link
Contributor Author

dulek commented Sep 11, 2023

Tests are actually passing again! 🥳

@jichenjc, @kayrus, @mdbooth, @zetaab, this is ready for reviews now.

@jichenjc
Copy link
Contributor

/lgtm

@dulek given you actually create additional commit for other purpose (e.g compare the LB name) ,can you help
put that into #2360 (comment)
as well? Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2023
Copy link
Contributor

@mdbooth mdbooth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second and third commits lgtm.

I don't understand the first commit yet.

@@ -2390,7 +2390,7 @@ func (lbaas *LbaasV2) deleteLoadBalancer(loadbalancer *loadbalancers.LoadBalance
Protocol: proto,
Port: int(port.Port),
}]
if isPresent && cpoutil.Contains(listener.Tags, loadbalancer.Name) {
if isPresent && cpoutil.Contains(listener.Tags, svcConf.lbName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why this is different to loadbalancer.Name? I've read this function and the calling function, in which you're now setting svcConf.lbName to the return value of lbaas.GetLoadBalancerName. Does this imply that the loadbalancer name (retrieved by id?) is different to GetLoadBalancerName.

Copy link
Contributor Author

@dulek dulek Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so this all boils down to the support for shared LBs that I believe is implemented in a way that is very problematic codewise.

How does it work? You add an annotation to the Service that features the LB ID you want that Service to use. The assumption is that the LB exists already and CPO will just add listener, pool and members (members should be the same really1) to that LB. The secondary LB resources will be tagged with the LB name that would be used for the secondary Service so that we can distinguish them later on.

The bug here is that we looked up listeners comparing their tags with the name of the LB taken from OpenStack, which is based on the primary Service name. In that case we should use the secondary service name and this is the fix.

Footnotes

  1. Now this makes me think that we might have a bug that 2 Services sharing an LB can fight for their view of LB members. Fixed on reconciliation, but not great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very fragile, but fixing that is definitely outside of the scope of this PR.

/lgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It totally is and you'll probably be happy to see it disabled by default in OpenShift: openshift/cluster-cloud-controller-manager-operator#263.

@dulek dulek changed the title [occm] Make sure we don't mask LB tests failures [occm] Make sure we don't mask LB tests failures and fix what was failing Sep 12, 2023
@dulek
Copy link
Contributor Author

dulek commented Sep 12, 2023

/lgtm

@dulek given you actually create additional commit for other purpose (e.g compare the LB name) ,can you help put that into #2360 (comment) as well? Thanks

Done!

@dulek
Copy link
Contributor Author

dulek commented Sep 14, 2023

@jichenjc: Anything else to move this forward?

@jichenjc
Copy link
Contributor

no, all good ~

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jichenjc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit f8fceaf into kubernetes:master Sep 15, 2023
4 of 5 checks passed
@dulek
Copy link
Contributor Author

dulek commented Sep 19, 2023

/cherry-pick release-1.27

@k8s-infra-cherrypick-robot

@dulek: #2360 failed to apply on top of branch "release-1.27":

Applying: Compare proper LB name for shared LBs
Using index info to reconstruct a base tree...
M	pkg/openstack/loadbalancer.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Compare proper LB name for shared LBs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dulek added a commit to dulek/cloud-provider-openstack that referenced this pull request Sep 19, 2023
…ling (kubernetes#2360)

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
dulek added a commit to dulek/cloud-provider-openstack that referenced this pull request Sep 26, 2023
…ling (kubernetes#2360)

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
k8s-ci-robot pushed a commit that referenced this pull request Oct 13, 2023
…ling (#2360) (#2367)

* Fix shared LBs tests

PR #2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
k8s-infra-cherrypick-robot pushed a commit to k8s-infra-cherrypick-robot/cloud-provider-openstack that referenced this pull request Oct 13, 2023
…ubernetes#2360)

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
k8s-ci-robot pushed a commit that referenced this pull request Oct 16, 2023
…2360) (#2430)

* Fix shared LBs tests

PR #2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.

Co-authored-by: Michal Dulko <mdulko@redhat.com>
mandre pushed a commit to shiftstack/cloud-provider-openstack that referenced this pull request Feb 6, 2024
…ling (kubernetes#2360)

* Compare proper LB name for shared LBs

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
@dulek
Copy link
Contributor Author

dulek commented Feb 7, 2024

/cherry-pick release-1.28

For some reason this haven't made it to 1.28 branch, which is pretty bad. Trying to fix that now.

@k8s-infra-cherrypick-robot

@dulek: #2360 failed to apply on top of branch "release-1.28":

Applying: Compare proper LB name for shared LBs
Using index info to reconstruct a base tree...
M	pkg/openstack/loadbalancer.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Compare proper LB name for shared LBs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.28

For some reason this haven't made it to 1.28 branch, which is pretty bad. Trying to fix that now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dulek added a commit to shiftstack/cloud-provider-openstack that referenced this pull request Feb 7, 2024
…ling (kubernetes#2360)

* Compare proper LB name for shared LBs

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
mandre pushed a commit to shiftstack/cloud-provider-openstack that referenced this pull request Feb 7, 2024
…ling (kubernetes#2360)

* Compare proper LB name for shared LBs

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
dulek added a commit to shiftstack/cloud-provider-openstack that referenced this pull request Feb 7, 2024
…ling (kubernetes#2360)

* Compare proper LB name for shared LBs

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.

* Fix shared LBs tests

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
k8s-ci-robot pushed a commit that referenced this pull request Feb 8, 2024
…ling (#2360) (#2537)

* Compare proper LB name for shared LBs

With shared LBs we distinguish the elements by tagging them with the
proper name of the LB that would be created for a Service if it wasn't
created as shared. This commit fixes that comparison for listener
deletion as code was always comparing the name of the primary LB.

* Fix shared LBs tests

PR #2190 prohibited sharing an LB that is internal for security reasons.
This commit fixes the shared LBs tests to not create internal LBs.

* Make sure we don't mask LB tests failures

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make
sure we cleanup resources on a test failure. In there, we only fetched
the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself
alters the code to 0, so function always returns success. This means
tests can never really fail.

This commit fixes it by making sure `$ERROR_CODE` is fetched at the very
beginning of the cleanup function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[occm] LB tests failures are masked
5 participants