
[e2e test failure] [sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator #50945

Closed
k8s-github-robot opened this issue Aug 19, 2017 · 21 comments · Fixed by #51235, #52816 or #53030
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Milestone

Comments

@k8s-github-robot

Failure cluster 42229f8b33f735ea0213

Error text:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:65
creating cluster role wardler
Expected error:
    <*errors.StatusError | 0xc421134380>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "clusterroles.rbac.authorization.k8s.io \"wardler\" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"create\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"delete\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"deletecollection\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"get\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"list\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"patch\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"update\"]} PolicyRule{Resources:[\"flunders\"], APIGroups:[\"wardle.k8s.io\"], Verbs:[\"watch\"]} PolicyRule{NonResourceURLs:[\"*\"], Verbs:[\"get\"]}] user=&{pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com  [system:authenticated] map[]} ownerrules=[PolicyRule{Resources:[\"selfsubjectaccessreviews\"], APIGroups:[\"authorization.k8s.io\"], Verbs:[\"create\"]} PolicyRule{NonResourceURLs:[\"/api\" \"/api/*\" \"/apis\" \"/apis/*\" \"/healthz\" \"/swaggerapi\" \"/swaggerapi/*\" \"/version\"], Verbs:[\"get\"]}] ruleResolutionErrors=[]",
            Reason: "Forbidden",
            Details: {
                Name: "wardler",
                Group: "rbac.authorization.k8s.io",
                Kind: "clusterroles",
                UID: "",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
    clusterroles.rbac.authorization.k8s.io "wardler" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["create"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["delete"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["deletecollection"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["get"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["list"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["patch"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["update"]} PolicyRule{Resources:["flunders"], APIGroups:["wardle.k8s.io"], Verbs:["watch"]} PolicyRule{NonResourceURLs:["*"], Verbs:["get"]}] user=&{pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com  [system:authenticated] map[]} ownerrules=[PolicyRule{Resources:["selfsubjectaccessreviews"], APIGroups:["authorization.k8s.io"], Verbs:["create"]} PolicyRule{NonResourceURLs:["/api" "/api/*" "/apis" "/apis/*" "/healthz" "/swaggerapi" "/swaggerapi/*" "/version"], Verbs:["get"]}] ruleResolutionErrors=[]
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:331
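The 403 above comes from RBAC privilege-escalation prevention: a user may not grant permissions it does not itself hold, and the test user's owner rules do not cover the `flunders` rules it tries to put in the `wardler` cluster role. A dependency-free sketch of that check (a simplified model with illustrative types, not the actual kube-apiserver code, which also handles wildcards and aggregation):

```go
package main

import "fmt"

// rule is a simplified stand-in for an RBAC PolicyRule: one verb on one
// resource in one API group. (Hypothetical model, not the real API types.)
type rule struct {
	apiGroup, resource, verb string
}

// covered reports whether every requested rule is already held by the
// requester. The escalation check in kube-apiserver works on the same
// principle: you cannot grant permissions you do not yourself hold.
func covered(owned, requested []rule) bool {
	held := make(map[rule]bool, len(owned))
	for _, r := range owned {
		held[r] = true
	}
	for _, r := range requested {
		if !held[r] {
			return false
		}
	}
	return true
}

func main() {
	// The test user's owner rules vs. one of the rules it tried to grant.
	owned := []rule{{"authorization.k8s.io", "selfsubjectaccessreviews", "create"}}
	requested := []rule{{"wardle.k8s.io", "flunders", "create"}}
	fmt.Println(covered(owned, requested)) // the grant is rejected
}
```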
Failure cluster statistics:

1 test failed, 11 jobs failed, 241 builds failed.
Failure stats cover 1 day time range '17 Aug 2017 22:57 UTC' to '18 Aug 2017 22:57 UTC'.

Top failed tests by jobs failed:
Test Name Jobs Failed
[sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator 11
Top failed jobs by builds failed:
Job Name Builds Failed Latest Failure
ci-kubernetes-e2e-gci-gke-multizone 42 18 Aug 2017 22:02 UTC
ci-kubernetes-e2e-gci-gke 40 18 Aug 2017 22:00 UTC
ci-kubernetes-e2e-gke-multizone 40 18 Aug 2017 22:11 UTC

Current Status

@k8s-github-robot k8s-github-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Aug 19, 2017
@k8s-github-robot
Author

@k8s-merge-robot
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/sig-contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in method 1 must be replaced with one of: bugs, feature-requests, pr-reviews, test-failures, proposals

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 19, 2017
@ericchiang ericchiang changed the title Failure cluster [42229f...] failed 241 builds, 11 jobs, and 1 tests over 1 days [sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator Aug 21, 2017
@ericchiang ericchiang changed the title [sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator [e2e test failure] [sig-api-machinery] Aggregator Should be able to support the 1.7 Sample API Server using the current Aggregator Aug 21, 2017
@ericchiang
Contributor

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Aug 21, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 21, 2017
@ericchiang ericchiang added this to the v1.8 milestone Aug 21, 2017
@jdumars
Member

jdumars commented Aug 23, 2017

@kubernetes/sig-api-machinery-bugs this is currently blocking the alpha.3 release

@dims
Member

dims commented Aug 23, 2017

@cheftako looks like this new test was added in #50347, which was merged about 6 days ago, and it is failing consistently

k8s-github-robot pushed a commit that referenced this issue Aug 26, 2017
Automatic merge from submit-queue

Fixed gke auth update wait condition.

Lookup whoami on gke using gcloud auth list.
Make sure we do not run the test on any cluster older than 1.7.

**What this PR does / why we need it**: Fixes issue with aggregator e2e test on GKE

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 

**Special notes for your reviewer**: There is a TODO, follow up will be provided when the immediate problem is resolved.

**Release note**:
```release-note
NONE
```
@ericchiang
Contributor

This has started failing again on our GKE test suite https://k8s-testgrid.appspot.com/release-master-blocking#gke

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke/15715#sig-api-machinery-aggregator-should-be-able-to-support-the-17-sample-api-server-using-the-current-aggregator

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:69
attempting to delete a newly created flunders resource
Expected error:
    <*errors.StatusError | 0xc4211b4990>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
            Status: "Failure",
            Message: "the server could not find the requested resource",
            Reason: "NotFound",
            Details: {
                Name: "",
                Group: "",
                Kind: "",
                UID: "",
                Causes: [
                    {
                        Type: "UnexpectedServerResponse",
                        Message: "unknown",
                        Field: "",
                    },
                ],
                RetryAfterSeconds: 0,
            },
            Code: 404,
        },
    }
    the server could not find the requested resource
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:430

cc @kubernetes/sig-api-machinery-test-failures

@ericchiang ericchiang reopened this Sep 20, 2017
@ericchiang ericchiang added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Sep 20, 2017
@k8s-github-robot
Author

[MILESTONENOTIFIER] Milestone Labels Complete

@k8s-merge-robot

Issue label settings:

  • sig/api-machinery: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions are available here. The commands for adding these labels are documented here.

@ericchiang ericchiang added this to Backlog in 1.8 Failing tests Sep 20, 2017
@apelisse apelisse moved this from Backlog to Need owner in 1.8 Failing tests Sep 20, 2017
@cheftako
Member

/assign @cheftako

@apelisse apelisse moved this from Need owner to in-progress in 1.8 Failing tests Sep 20, 2017
@ericchiang
Contributor

Seems to be flaking now with the error:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:69
Sep 20 17:49:59.789: failed to get back the correct flunders list &{map[metadata:map[selfLink:/apis/wardle.k8s.io/v1alpha1/namespaces/sample-system/flunders resourceVersion:5] kind:FlunderList apiVersion:wardle.k8s.io/v1alpha1] []} from the dynamic client
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apimachinery/aggregator.go:481

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke/15719#sig-api-machinery-aggregator-should-be-able-to-support-the-17-sample-api-server-using-the-current-aggregator

@jdumars
Member

jdumars commented Sep 21, 2017

@cheftako @ericchiang - we need to determine (today if possible) if this is truly release blocking. If so, please add the release-blocker label. And, if not, how do we best continue work on this for 1.8.x/1.9.0?

@liggitt
Member

liggitt commented Sep 21, 2017

this test has two passes and four failures on the same commit

I'm seeing gke-specific authz grants in that test that are incorrect:

3b9485b#diff-c944d1288edcaf37beebab811603bfd8L164

That commit removed the wait for the authz grant to become effective (which can lead to flakes), and granted superuser permissions to all users, which is incorrect and invalidates any other authz-related tests run in parallel with this test

@liggitt
Member

liggitt commented Sep 21, 2017

cc @kubernetes/sig-auth-test-failures

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Sep 21, 2017
cheftako added a commit to cheftako/kubernetes that referenced this issue Sep 21, 2017
Aggregator e2e test is intermittently failing on GKE but not GCE.
Adding the following debugging to help trace the issue.
Make sure we always use the same rest client.
Randomly generate the flunder resource name to detect parallel tests.
Print endpoints for sample-system in case multiple instances.
Print original and new pods in case the pod has been restarted.

Fixed import list.
Remove rand seed.
@ericchiang
Contributor

ericchiang commented Sep 21, 2017

I can add the wait back.

edit: and fix the test given admin to all authenticated users.

@ericchiang
Contributor

ericchiang commented Sep 21, 2017

Actually after staring at this test for about half an hour I can't figure out what different users exist or what permissions they're being granted. ClientSet, InternalClientset, and AggregatorClient are all initialized from the same config so I don't see how one would be able to create an RBAC binding but another would fail later.

```go
f.ClientSet, err = clientset.NewForConfig(config)
Expect(err).NotTo(HaveOccurred())
f.InternalClientset, err = internalclientset.NewForConfig(config)
Expect(err).NotTo(HaveOccurred())
f.AggregatorClient, err = aggregatorclient.NewForConfig(config)
Expect(err).NotTo(HaveOccurred())
```

@cheftako any thoughts here?

@cheftako
Member

cheftako commented Sep 21, 2017

I honestly think the gke specific BindClusterRole is a red herring. It is needed so the client has permission to perform one of the setup steps (I think it was either to create the wardler cluster role or to bind that role to the anonymous user). Once that setup step is complete we no longer need that cluster role bound, so I don't think it's related.

@liggitt
Member

liggitt commented Sep 21, 2017

I don't see how one would be able to create an RBAC binding but another would fail later.

The gke authorizer allows the “bind” verb, so the client can create a binding to the cluster-admin. It cannot create a role directly unless it has permissions via RBAC. Since we don’t have a way to determine the username associated with iclient, binding to all authenticated users is what was done as a workaround.

@liggitt
Member

liggitt commented Sep 21, 2017

I agree that the point at which the tests are failing indicate that the previous authorization issues are not the cause.

k8s-github-robot pushed a commit that referenced this issue Sep 21, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Debug for issues #50945

Aggregator e2e test is intermittently failing on GKE but not GCE.
Adding the following debugging to help trace the issue.
Make sure we always use the same rest client.
Randomly generate the flunder resource name to detect parallel tests.
Print endpoints for sample-system in case multiple instances.
Print original and new pods in case the pod has been restarted.

**What this PR does / why we need it**: Adds debugging for aggregator e2e test to track down GKE flakiness.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 

**Special notes for your reviewer**: This is primarily additional debugging information.

**Release note**:
```release-note
NONE
```
@cheftako
Member

/open

@cheftako
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Sep 22, 2017
@cheftako
Member

So a lot more information to work with now but the error is still occurring. I am still looking into this.

@liggitt
Member

liggitt commented Sep 23, 2017

@cheftako any update on the investigation?

@liggitt liggitt removed the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Sep 23, 2017
@spiffxp
Member

spiffxp commented Sep 25, 2017

https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=aggregator

Friendly v1.8 release team ping. This failure still seems to be happening, is this actively being worked? Does this need to be in the v1.8 milestone?

k8s-github-robot pushed a commit that referenced this issue Sep 26, 2017
Automatic merge from submit-queue (batch tested with PRs 51648, 53030, 53009). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fixed intermittent e2e aggregator test on GKE.

**What this PR does / why we need it**: Issue was caused by another test cleaning up its namespace.
This caused the namespace controller to try to clean up that namespace.
This involves deleting all flunders under that namespace.
However the sample-apiserver was not honoring the namespace filter.
So the flunders for the test would randomly disappear.
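The fix described above amounts to actually applying the namespace filter when listing (and collection-deleting) flunders; because the sample-apiserver ignored it, a namespace controller cleaning up a *different* test's namespace could delete this test's objects too. A minimal dependency-free sketch of the corrected behavior (illustrative types, not the sample-apiserver's actual storage code):

```go
package main

import "fmt"

// flunder is a stand-in for the sample-apiserver's Flunder type.
type flunder struct {
	Namespace, Name string
}

// listFlunders returns only the items in the requested namespace. The bug
// was the equivalent of returning (and deleting) items regardless of
// namespace, so the test's flunders would "randomly disappear" whenever
// another namespace was cleaned up in parallel.
func listFlunders(all []flunder, namespace string) []flunder {
	var out []flunder
	for _, f := range all {
		if f.Namespace == namespace {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []flunder{
		{"sample-system", "dynamic-flunder-1"},
		{"e2e-tests-other", "leftover"},
	}
	// With the filter honored, only the sample-system flunder is returned.
	fmt.Println(listFlunders(all, "sample-system"))
}
```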

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50945 

**Special notes for your reviewer**: Requires we fix the container image to contain this fix to work.

**Release note**:
```release-note
NONE
```
k8s-github-robot pushed a commit that referenced this issue Sep 29, 2017
…3030-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #53030

Cherry pick of #53030 on release-1.8.

#53030: Fixed intermittent e2e aggregator test on GKE.

**What this PR does / why we need it**: Issue was caused by another test cleaning up its namespace.
This caused the namespace controller to try to clean up that namespace.
This involves deleting all flunders under that namespace.
However the sample-apiserver was not honoring the namespace filter.
So the flunders for the test would randomly disappear.

Relates to issue  #50945

**Special notes for your reviewer**: Requires we fix the container image to contain this fix to work.
@ericchiang ericchiang moved this from in-progress to Fixed/Closed in 1.8 Failing tests Nov 3, 2017