New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI alpha tests are failing #70760

Open
msau42 opened this Issue Nov 7, 2018 · 31 comments

Comments

Projects
None yet
7 participants
@msau42
Member

msau42 commented Nov 7, 2018

What happened:
Probably caused by #70193

/kind bug

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 7, 2018

@msau42 @saad-ali

I understand that a679486 moved crd creation logic to e2e test's k8s cluster deployment code.
On the other hand, I tested it with manually deployed k8s cluster.

So, in my case, I guess that in addition to enable CSIDriverRegistry and CSINodeInfo feature gates enabled, I need to create crd manually before tests, like below:

# kubectl create -f staging/src/k8s.io/csi-api/pkg/crd/manifests/csidriver.yaml
# kubectl create -f staging/src/k8s.io/csi-api/pkg/crd/manifests/csinodeinfo.yaml

I could make two of three SkipAttach tests pass in my environment by above way.
For "non-attachable volume does not need VolumeAttachment" case, I'm still looking into it. (It doesn't seem to skip creating volumeAttachment, properly)

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 7, 2018

@msau42 @saad-ali

I found a typo in createCSIDriver's argument. While hostpath driver's GetDynamicProvisionStorageClass connects driver name and unique name without "-", it connects them with "-". This makes "non-attachable" test fail, due to the mismatch between volume name in crd and provisioner name, as a result, creating volumeAttachment isn't skipped. I will create a PR to fix it.

For above bug, a679486 is unrelated. It just changed the way in how to test in manually deployed k8s cluster as I mentioned in above comment. So, there should be no problem for the PR, at least for the CI testing if it works well in CI environments.

mkimuram added a commit to mkimuram/kubernetes that referenced this issue Nov 7, 2018

@saad-ali

This comment has been minimized.

Member

saad-ali commented Nov 8, 2018

I can verify that the CRDs are not getting installed on the alpha cluster. Investigating why.

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 8, 2018

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 8, 2018

/milestone v1.13
/priority important-soon
/kind failing-tests

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 8, 2018

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 8, 2018

/kind failing-test

@jberkus @mortent @maria

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 8, 2018

/reopen

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Nov 8, 2018

@AishSundar: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wangzhen127

This comment has been minimized.

Member

wangzhen127 commented Nov 9, 2018

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Nov 9, 2018

@wangzhen127: Reopened this issue.

In response to this:

I don't think this is fully resolved. There are still failures on
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features
https://k8s-testgrid.appspot.com/google-gce#gci-gce-alpha-features

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42

This comment has been minimized.

Member

msau42 commented Nov 9, 2018

@mkimuram can you help take a look at these failures? It's testing CSI filesystem volume mode.

I think it is unrelated to the crd installation issue that Saad fixed.

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 9, 2018

@msau42

I tested with AllAlpha flag and manually created csi crds and confirmed that filesystem volume mode test passed in my environment and also other tests passed there. If I delete csi crds, other tests also fail, so I agree that the issue that Saad fixed is not related.

However, above also means that this issue can't be reproduced in my environment.

@msau42

This comment has been minimized.

Member

msau42 commented Nov 9, 2018

I1109 14:56:01.141] Nov 9 14:56:01.131: INFO: At 2018-11-09 14:50:54 +0000 UTC - event for security-context-dc2399cc-e42e-11e8-9b95-0a580a3c2a84: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "pvc-d6000e8b-e42e-11e8-880b-42010a800002" : node "bootstrap-e2e-minion-group-3wtg" has no NodeID annotation

@msau42

This comment has been minimized.

Member

msau42 commented Nov 9, 2018

Hostpath also seems to be failing because of driver registration issues? Not sure why other alpha tests are passing if that's the case

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 9, 2018

Not sure why other alpha tests are passing if that's the case

This test is the only test that uses CreateSecPodWithNodeName to create pod with CSI driver volume and expects the result to be success. One of the pod parameters specified there might affect this failure.

@wangzhen127

This comment has been minimized.

Member

wangzhen127 commented Nov 9, 2018

Those 2 failing tests are newly introduced in b89b367 and started to fail from the beginning. Is that PR blocking 1.13? If not, can we revert and fix later? I have recently set up optional presubmit tests for alpha features at pull-kubernetes-e2e-gce-alpha-features, which will pick up those 2 tests.

Currently this failure is blocking #70034.

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 9, 2018

@msau42 @mkimuram apart from the 2 failing tests, I see [sig-storage] CSI Volumes [Driver: com.google.csi.gcepd][Serial] [Testpattern: Dynamic PV (filesystem volmode)] volumeMode[Feature:BlockVolume] should create sc, pod, pv, and pvc, read/write to the pv, and delete all created resources which @saad-ali addressed with #70780 is also flaky.

These failures are now blocking PRs as mentioned above. As we enter code slush today, its very critical to have the PRs get in sooner than later. Can we plz revert the feature and the tests until they are fully resolved and ready to go. Thanks

@msau42

This comment has been minimized.

Member

msau42 commented Nov 9, 2018

I prefer disabling the specific tests instead of reverting the entire pr. We will lose important signal for csi

@wangzhen127

This comment has been minimized.

Member

wangzhen127 commented Nov 9, 2018

I prefer disabling the specific tests instead of reverting the entire pr. We will lose important signal for csi

That works for me. Can you cc me on the PR that disables those tests?

When you re-enable the tests, remember to try them out on pull-kubernetes-e2e-gce-alpha-features. :)

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 9, 2018

@AishSundar

Let me consider if we can just skip these test for CSI, instead of reverting the entire patch.

@gnufied

This comment has been minimized.

Member

gnufied commented Nov 9, 2018

@mkimuram you mean just skipping them in Alpha cluster right?

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 9, 2018

@gnufied

I was thinking about skipping the below test with the combination of below drivers:

[Test]

  • [Testpattern: Dynamic PV (filesystem volmode)] volumeMode[Feature:BlockVolume] should create sc, pod, pv, and pvc, read/write to the pv, and delete all created resources

[Driver]

  • csi-hostpath
  • com.google.csi.gcepd

Won't it enough? Or, do we need to skip it only with alpha cluster?

mkimuram added a commit to mkimuram/kubernetes that referenced this issue Nov 9, 2018

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 9, 2018

@msau42 @mkimuram @gnufied If we just disable the tests alone and not revert the feature, we are just masking the issue. Ideally if we are not able to fix the tests and reenable them the feature should be reverted from 1.13 as well. Are you also actively working on fixing the tests and this skip is just to unblock PRs in the meantime? if so can we timebox by when we can have the tests fixed and re-enabled / the feature reverted? My fear is we might forget the fact that its disabled since there's no active reporting on it.

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 9, 2018

@AishSundar

If we just disable the tests alone and not revert the feature, we are just masking the issue.

Actually, there is no feature added in either of culprit commits. It added tests or intended to improve the way it is tested.

Therefore, it must be a test issue or another issue which is found by the test added, not an issue introduced by the commits themselves.

My fear is we might forget the fact that its disabled since there's no active reporting on it.

In the code, there is a clear comment that the test is skipped due to this issue with the reference to this issue#, so this won't be forgot. If you still worry about this issue is not tracked, we might be able to keep this issue open.

@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 9, 2018

@mkimuram IIUC the test that were failing (and are being reverted) were added for the new CSI festures going to Alpha in 1.13, right? so even these 2 culprit commits do not contain feature work, it is testing some new code additions to 1.13 and now those will be untested until we re-enable these tests.

Is there an ETA to re-enable them, preferably sometime early next week? Yes +1 to keeping this issue open to track the skipped tests.

@mkimuram

This comment has been minimized.

Contributor

mkimuram commented Nov 10, 2018

@AishSundar

IIUC the test that were failing (and are being reverted) were added for the new CSI festures going to Alpha in 1.13, right? so even these 2 culprit commits do not contain feature work, it is testing some new code additions to 1.13 and now those will be untested until we re-enable these tests.

The two failing tests themselves are not for new alpha features that are added in v1.13.
Instead, they are just to test csi driver works well as it works in existing in-tree driver.
(And failed because they might be affected by some other alpha features added or some of specific configurations in gce.)

In addition, #68025 added ~50 tests for csi driver, so it would be better not to remove all of the tests by reverting the PR, just for the 2 failing tests. I think that just reverting the patch won't help improving quality of k8s.

Is there an ETA to re-enable them, preferably sometime early next week?

As for ETA to solve the issue, I found difficulty in debuging in my local environment.
It seems that the two tests fail in gce environment w/ AllAlpha=true,
however, as I've already mentioned, at least csi-hostpath tests in my local environment w/ AllAlpha=true doesn't fail.(And I don't have test environment for gce.)

Of course, I will help solving the issue next week. However, I will need someone's help for testing combinations of alpha features in gce environment to narrow down which flag really affects this issue.
At least I would like to know the results:

  • w/ MountPropagation=true,CSIPersistentVolume=true,BlockVolume=true,CSIBlockVolume=true
  • all alpha flags except CSIDriverRegistry and CSINodeInfo
@AishSundar

This comment has been minimized.

Contributor

AishSundar commented Nov 10, 2018

/reopen

@mkimuram Thanks for the detailed explanation. I think the revert is fine in this case then. I am keeping this issue open however to track and fix the failures in these 2 tests

@msau @cjwagner @BenTheElder can any of you help answer the ques above w.r.t alpha features on gce environment. thanks

@k8s-ci-robot k8s-ci-robot reopened this Nov 10, 2018

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Nov 10, 2018

@AishSundar: Reopened this issue.

In response to this:

/reopen

@mkimuram Thanks for the detailed explanation. I think the revert is fine in this case then. I am keeping this issue open however to track and fix the failures in these 2 tests

@msau @cjwagner @BenTheElder can any of you help answer the ques above w.r.t alpha features on gce environment. thanks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42

This comment has been minimized.

Member

msau42 commented Nov 11, 2018

I can help debug when I'm back on Tuesday. In the meantime, I would suggest looking through the failing test logs to try to triage what's going on

@msau42

This comment has been minimized.

goodluckbot added a commit to goodluckbot/kubernetes that referenced this issue Nov 11, 2018

goodluckbot added a commit to goodluckbot/kubernetes that referenced this issue Nov 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment