[e2e test failure] [sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator #50945
@k8s-merge-robot
Seeing this failure on the release-master-blocking e2e dashboard: https://k8s-testgrid.appspot.com/release-master-blocking#gke (example failure). Looks like an RBAC issue. cc @kubernetes/sig-api-machinery-bugs |
@kubernetes/sig-api-machinery-bugs this is currently blocking the alpha.3 release |
Automatic merge from submit-queue Fixed gke auth update wait condition. Lookup whoami on gke using gcloud auth list. Make sure we do not run the test on any cluster older than 1.7. **What this PR does / why we need it**: Fixes issue with aggregator e2e test on GKE **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 **Special notes for your reviewer**: There is a TODO, follow up will be provided when the immediate problem is resolved. **Release note**: ```release-note NONE ```
This has started failing again on our GKE test suite https://k8s-testgrid.appspot.com/release-master-blocking#gke
cc @kubernetes/sig-api-machinery-test-failures |
[MILESTONENOTIFIER] Milestone Labels Complete. @k8s-merge-robot Issue label settings: |
/assign @cheftako |
Seems to be flaking now with the error:
|
@cheftako @ericchiang - we need to determine (today if possible) if this is truly release blocking. If so, please add the release-blocker label. And, if not, how do we best continue work on this for 1.8.x/1.9.0? |
This test has two passes and four failures on the same commit. I'm seeing gke-specific authz grants in that test that are incorrect: 3b9485b#diff-c944d1288edcaf37beebab811603bfd8L164 That commit removed the wait for the authz grant to become effective (which can lead to flakes), and granted superuser permissions to all users, which is incorrect and invalidates any other authz-related tests run in parallel with this test. |
cc @kubernetes/sig-auth-test-failures |
Aggregator e2e test is intermittently failing on GKE but not GCE. Adding the following debugging to help trace the issue. Make sure we always use the same REST client. Randomly generate the flunder resource name to detect parallel tests. Print endpoints for sample-system in case there are multiple instances. Print the original and new pods in case the pod has been restarted. Fixed import list. Removed rand seed.
I can add the wait back. Edit: and fix the test granting admin to all authenticated users. |
Actually, after staring at this test for about half an hour, I can't figure out what different users exist or what permissions they're being granted. ClientSet, InternalClientset, and AggregatorClient are all initialized from the same config, so I don't see how one would be able to create an RBAC binding but another would fail later. kubernetes/test/e2e/framework/framework.go Lines 156 to 161 in 6808e80
@cheftako any thoughts here? |
I honestly think the gke-specific BindClusterRole is a red herring. It is needed so the client has permission to perform one of the setup steps (I think it was either to create the wardler cluster role or to bind that role to the anonymous user). Once that setup step is complete we no longer need that cluster role bound, so I don't think it's related. |
The gke authorizer allows the “bind” verb, so the client can create a binding to cluster-admin. It cannot create a role directly unless it has permissions via RBAC. Since we don’t have a way to determine the username associated with
I agree that the point at which the tests are failing indicates that the previous authorization issues are not the cause. |
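For context on the "bind" verb discussion above: the kind of grant involved is a ClusterRoleBinding to an existing ClusterRole such as cluster-admin, which the "bind" verb permits even when the client cannot create new roles directly. A hypothetical sketch of such a binding (the binding name and subject are illustrative, not taken from the test):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: wardler-admin              # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: test-user@example.com     # hypothetical subject
```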
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Debug for issue #50945 Aggregator e2e test is intermittently failing on GKE but not GCE. Adding the following debugging to help trace the issue. Make sure we always use the same REST client. Randomly generate the flunder resource name to detect parallel tests. Print endpoints for sample-system in case there are multiple instances. Print the original and new pods in case the pod has been restarted. **What this PR does / why we need it**: Adds debugging for aggregator e2e test to track down GKE flakiness. **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 **Special notes for your reviewer**: This is primarily additional debugging information. **Release note**: ```release-note NONE ```
/open |
/reopen |
So, a lot more information to work with now, but the error is still occurring. I am still looking into this. |
@cheftako any update on the investigation? |
https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=aggregator Friendly v1.8 release team ping. This failure still seems to be happening; is this actively being worked on? Does this need to be in the v1.8 milestone? |
Automatic merge from submit-queue (batch tested with PRs 51648, 53030, 53009). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Fixed intermittent e2e aggregator test on GKE. **What this PR does / why we need it**: The issue was caused by another test cleaning up its namespace. This caused the namespace controller to try to clean up that namespace, which involves deleting all flunders under it. However, the sample-apiserver was not honoring the namespace filter, so the flunders for this test would randomly disappear. **Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 **Special notes for your reviewer**: Requires the container image to be rebuilt with this fix in order to work. **Release note**: ```release-note NONE ```
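The root cause described above, a server not honoring the namespace filter on list/delete, can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the actual sample-apiserver storage code: the correct behavior restricts results to the requested namespace, so one namespace's cleanup cannot delete another test's objects.

```go
package main

import "fmt"

// flunder is a stand-in for the sample-apiserver's stored objects.
type flunder struct {
	Namespace, Name string
}

// listFlunders returns only the flunders in the requested namespace; an empty
// namespace means "all namespaces". The bug described above was effectively
// the opposite: the server ignored the namespace, so deleting one test's
// namespace swept away flunders belonging to other tests.
func listFlunders(store []flunder, namespace string) []flunder {
	if namespace == "" {
		return store
	}
	var out []flunder
	for _, f := range store {
		if f.Namespace == namespace {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	store := []flunder{
		{"e2e-test-a", "f1"}, // hypothetical test namespaces
		{"e2e-test-b", "f2"},
	}
	// Cleanup of e2e-test-b must only see f2, leaving f1 untouched.
	fmt.Println(len(listFlunders(store, "e2e-test-b")))
}
```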
…3030-upstream-release-1.8 Automatic merge from submit-queue. Automated cherry pick of #53030. Cherry pick of #53030 on release-1.8. #53030: Fixed intermittent e2e aggregator test on GKE. **What this PR does / why we need it**: The issue was caused by another test cleaning up its namespace. This caused the namespace controller to try to clean up that namespace, which involves deleting all flunders under it. However, the sample-apiserver was not honoring the namespace filter, so the flunders for this test would randomly disappear. Relates to issue #50945 **Special notes for your reviewer**: Requires the container image to be rebuilt with this fix in order to work.
Failure cluster 42229f8b33f735ea0213
Error text:
Failure cluster statistics:
1 tests failed, 11 jobs failed, 241 builds failed.
Failure stats cover the 1-day time range 17 Aug 2017 22:57 UTC to 18 Aug 2017 22:57 UTC.
Top failed tests by jobs failed:
Top failed jobs by builds failed:
Current Status