
[job failed] 1.9-master upgrade|downgrade jobs #60764

Closed

krzyzacy opened this issue Mar 5, 2018 · 67 comments
Labels: kind/bug · kind/failing-test · priority/critical-urgent · sig/apps · sig/cluster-lifecycle · sig/storage

Comments

krzyzacy (Member) commented Mar 5, 2018

This is an umbrella issue for k8s-testgrid.appspot.com/sig-release-master-upgrade.

There are multiple test failures; I will open individual issues for each failing test.

/priority failing-test
/priority critical-urgent
/kind bug
/status approved-for-milestone
/sig cluster-lifecycle
/sig gcp

cc @jdumars @jberkus
and also cc @krousey who's our upgrade expert

@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/bug Categorizes issue or PR as related to a bug. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Mar 5, 2018
krzyzacy (Member, Author) commented Mar 5, 2018

/milestone v1.10

krzyzacy (Member, Author) commented Mar 5, 2018

cc @kubernetes/sig-cluster-lifecycle-bugs
I think sig-cluster-lifecycle wants to triage the downgrade suite, which is timing out during downgrade.
/assign @roberthbailey @luxas @lukemarsden @jbeda
Assigning you folks for now; please reassign as appropriate.

krousey (Contributor) commented Mar 5, 2018

Downgrades are failing to delete the StatefulSet test's namespace because PVCs remain. All the other tests are actually passing.

krzyzacy (Member, Author) commented Mar 5, 2018

@krousey do you know who I need to bug next?

krousey (Contributor) commented Mar 5, 2018

In particular: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/1910

I0305 18:18:01.917] STEP: Destroying namespace "e2e-tests-sig-apps-statefulset-upgrade-86ffb" for this suite.
I0305 18:28:01.928] Mar  5 18:28:01.928: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
I0305 18:28:02.233] Mar  5 18:28:02.233: INFO: namespace: e2e-tests-sig-apps-statefulset-upgrade-86ffb, resource: bindings, ignored listing per whitelist
I0305 18:28:02.237] Mar  5 18:28:02.236: INFO: namespace: e2e-tests-sig-apps-statefulset-upgrade-86ffb, resource: persistentvolumeclaims, items remaining: 3
I0305 18:28:02.244] Mar  5 18:28:02.243: INFO: namespace: e2e-tests-sig-apps-statefulset-upgrade-86ffb, DeletionTimetamp: 2018-03-05 18:18:01 +0000 UTC, Finalizers: [kubernetes], Phase: Terminating
I0305 18:28:02.246] Mar  5 18:28:02.246: INFO: namespace: e2e-tests-sig-apps-statefulset-upgrade-86ffb, total namespaces: 12, active: 11, terminating: 1
I0305 18:28:02.249] Mar  5 18:28:02.249: INFO: Couldn't delete ns: "e2e-tests-sig-apps-statefulset-upgrade-86ffb": namespace e2e-tests-sig-apps-statefulset-upgrade-86ffb was not deleted with limit: timed out waiting for the condition, namespaced content other than pods remain (&errors.errorString{s:"namespace e2e-tests-sig-apps-statefulset-upgrade-86ffb was not deleted with limit: timed out waiting for the condition, namespaced content other than pods remain"})
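
For anyone who wants to reproduce the diagnosis, here is a minimal sketch (not from the test framework; it assumes a recent client-go where `List` takes a context and a kubeconfig at the default path) that lists the PVCs left behind in the stuck namespace and prints the finalizers blocking deletion:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load ~/.kube/config; the namespace name is taken from the log above.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ns := "e2e-tests-sig-apps-statefulset-upgrade-86ffb"
	pvcs, err := client.CoreV1().PersistentVolumeClaims(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pvc := range pvcs.Items {
		// A PVC with a deletionTimestamp and a lingering finalizer is what
		// keeps the namespace stuck in Terminating.
		fmt.Printf("pvc=%s deletionTimestamp=%v finalizers=%v\n",
			pvc.Name, pvc.DeletionTimestamp, pvc.Finalizers)
	}
}
```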

krousey (Contributor) commented Mar 5, 2018

@krzyzacy We should pull in someone from storage and maybe someone from apps.

krzyzacy (Member, Author) commented Mar 5, 2018

/sig storage
/sig apps

cc @kow3ns @saad-ali ^^

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Mar 5, 2018
@kow3ns kow3ns added this to Backlog in Workloads Mar 5, 2018
@kow3ns kow3ns moved this from Backlog to In Progress in Workloads Mar 6, 2018
pospispa (Contributor) commented:

ACK. Waiting on reviews and approvals (PRs #61282 and #61316)
ETA: depends on reviews and approvals
Risks: N/A

pospispa (Contributor) commented:

In K8s 1.10, before the K8s 1.11 release:

  1. Modify the pvc-protection-controller to run in finalizer-deletion-only mode when the feature is disabled.
  2. Modify the pv-protection-controller to run in finalizer-deletion-only mode when the feature is disabled.

I've created PR #61324 for the above; a sketch of the idea follows.
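
To make the plan concrete, here is a minimal sketch of the finalizer-deletion-only idea (illustrative only, not the code in PR #61324; `reconcile` and its parameters are hypothetical names): stripping `kubernetes.io/pvc-protection` from a terminating, unused PVC always runs, while adding the finalizer is gated on the feature flag.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const pvcProtectionFinalizer = "kubernetes.io/pvc-protection"

// reconcile sketches "finalizer deletion only" mode: removal of the
// protection finalizer always happens, adding it is feature-gated.
func reconcile(pvc *v1.PersistentVolumeClaim, featureEnabled, inActiveUse bool) {
	switch {
	case pvc.DeletionTimestamp != nil && !inActiveUse:
		removeFinalizer(pvc) // runs even when the feature is disabled
	case pvc.DeletionTimestamp == nil && featureEnabled:
		addFinalizer(pvc) // only when the feature is enabled
	}
}

func addFinalizer(pvc *v1.PersistentVolumeClaim) {
	for _, f := range pvc.Finalizers {
		if f == pvcProtectionFinalizer {
			return // already present
		}
	}
	pvc.Finalizers = append(pvc.Finalizers, pvcProtectionFinalizer)
}

func removeFinalizer(pvc *v1.PersistentVolumeClaim) {
	kept := pvc.Finalizers[:0]
	for _, f := range pvc.Finalizers {
		if f != pvcProtectionFinalizer {
			kept = append(kept, f)
		}
	}
	pvc.Finalizers = kept
}

func main() {
	// A PVC left over from a downgrade: deleted, protected, not in use.
	now := metav1.Now()
	pvc := &v1.PersistentVolumeClaim{ObjectMeta: metav1.ObjectMeta{
		Name:              "datadir-ss-0",
		DeletionTimestamp: &now,
		Finalizers:        []string{pvcProtectionFinalizer},
	}}
	reconcile(pvc, false, false)
	fmt.Println("finalizers after reconcile:", pvc.Finalizers) // []
}
```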

gnufied pushed a commit to gnufied/kubernetes that referenced this issue Mar 19, 2018
After K8s 1.9 is upgraded to K8s 1.10, the finalizer [kubernetes.io/pvc-protection] is added to PVCs,
because the StorageObjectInUseProtection feature is enabled by default in K8s 1.10.
However, when K8s 1.10 is downgraded to K8s 1.9, the finalizers remain on the PVCs. Because the
pvc-protection-controller is not started by default in K8s 1.9, the finalizers are not removed
automatically from deleted PVCs, so deleted PVCs are not removed from the system and remain in the
Terminating phase.

That's why the pvc-protection-controller is now always started: it removes finalizers from PVCs
automatically once a PVC is no longer in active use by a pod.

Related issue: kubernetes#60764
gnufied pushed a commit to gnufied/kubernetes that referenced this issue Mar 19, 2018
After K8s 1.9 is upgraded to K8s 1.10, the finalizer [kubernetes.io/pv-protection] is added to PVs,
because the StorageObjectInUseProtection feature is enabled by default in K8s 1.10.
However, when K8s 1.10 is downgraded to K8s 1.9, the finalizers remain on the PVs. Because the
pv-protection-controller does not exist in K8s 1.9, the finalizers are not removed automatically
from deleted PVs, so deleted PVs remain in the system.

That's why the finalizer-removing part of the pv-protection-controller is backported from K8s 1.10:
it removes the finalizer automatically once a PV is deleted and is no longer Bound to a PVC (see the
sketch below).

Related issue: kubernetes#60764
Related pv-protection-controller PR: kubernetes#58743
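
For illustration, the PV-side condition described in that commit message might look like the following predicate (a sketch of the described behavior, not the backported code itself; `pvFinalizerRemovable` is a hypothetical name):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvFinalizerRemovable mirrors the condition described above: the
// kubernetes.io/pv-protection finalizer may be stripped once the PV is
// being deleted and is no longer Bound to a claim.
func pvFinalizerRemovable(pv *v1.PersistentVolume) bool {
	return pv.DeletionTimestamp != nil && pv.Status.Phase != v1.VolumeBound
}

func main() {
	now := metav1.Now()
	pv := &v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "pv-0001", DeletionTimestamp: &now},
		Status:     v1.PersistentVolumeStatus{Phase: v1.VolumeReleased},
	}
	fmt.Println("removable:", pvFinalizerRemovable(pv)) // true
}
```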
k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@krzyzacy @lukemarsden @luxas @roberthbailey

Issue Labels
  • sig/apps sig/cluster-lifecycle sig/gcp sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

jberkus commented Mar 21, 2018

@krzyzacy @childsb @saad-ali where are we on this? I thought we had a plan, and it was executed, and the remaining open PRs were for the future 1.9 patches. What's waiting to be done so that we can move this issue out of the milestone?

liggitt (Member) commented Mar 21, 2018

The 1.9 PR was merged in #61370 midday yesterday. Four green runs since then on https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.9-downgrade-cluster-parallel&sort-by-failures=

liggitt (Member) commented Mar 21, 2018

Some tests are still flaking in the job, but there are no solid red tests.

krzyzacy (Member, Author) commented:

I think we can close this issue; I'll keep an eye on the job and open a few more flake issues.

Workloads automation moved this from In Progress to Done Mar 21, 2018
pospispa added a commit to pospispa/kubernetes that referenced this issue Apr 20, 2018
After K8s 1.10 is upgraded to K8s 1.11, the finalizer [kubernetes.io/pvc-protection] is added to PVCs,
because the StorageObjectInUseProtection feature will be GA in K8s 1.11.
However, when K8s 1.11 is downgraded to K8s 1.10 and the StorageObjectInUseProtection feature is
disabled, the finalizers remain on the PVCs. Because the pvc-protection-controller is not started in
K8s 1.10, the finalizers are not removed automatically from deleted PVCs, so deleted PVCs are not
removed from the system and remain in the Terminating phase.
The same applies to the pv-protection-controller and the [kubernetes.io/pv-protection] finalizer on PVs.

That's why the pvc-protection-controller is now always started: it removes finalizers from PVCs
automatically once a PVC is no longer in active use by a pod. The pv-protection-controller is
likewise always started, so that finalizers are removed from PVs automatically when a PV is no
longer Bound to a PVC.

Related issue: kubernetes#60764
k8s-github-robot pushed a commit that referenced this issue Apr 21, 2018
…nUseProtection-downgrade-issue

Automatic merge from submit-queue (batch tested with PRs 61324, 62880, 62765). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Always Start pvc-protection-controller and pv-protection-controller

**What this PR does / why we need it**:
After K8s 1.10 is upgraded to K8s 1.11, the finalizer `[kubernetes.io/pvc-protection]` is added to PVCs,
because the `StorageObjectInUseProtection` feature will be GA in K8s 1.11.
However, when K8s 1.11 is downgraded to K8s 1.10 and the `StorageObjectInUseProtection` feature is disabled, the finalizers remain on the PVCs. Because the `pvc-protection-controller` is not started in K8s 1.10, the finalizers are not removed automatically from deleted PVCs, so deleted PVCs are not removed from the system and remain in the `Terminating` phase.
The same applies to the `pv-protection-controller` and the `[kubernetes.io/pv-protection]` finalizer on PVs.

That's why the `pvc-protection-controller` is now always started: it removes finalizers from PVCs automatically once a PVC is no longer in active use by a pod.
The `pv-protection-controller` is likewise always started, so that finalizers are removed from PVs automatically when a PV is no longer `Bound` to a PVC.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes N/A
Issue #60764 covers the downgrade from K8s 1.10 to K8s 1.9; this PR fixes the same problem for the downgrade from K8s 1.11 to K8s 1.10.

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
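
The shape of the startup change, reduced to a toy sketch (function and parameter names are illustrative, not the exact kube-controller-manager symbols): the controller is no longer skipped when the feature gate is off; the gate only controls whether new finalizers are added.

```go
package main

import "fmt"

// startPVCProtectionController sketches the gating change: before the fix,
// the whole controller was skipped when the feature gate was off, so
// finalizers written by a newer release were never cleaned up after a
// downgrade.
func startPVCProtectionController(featureEnabled bool) {
	// Old behavior (sketch):
	//
	//   if !featureEnabled {
	//       return // controller never starts; stale finalizers stay forever
	//   }
	//
	// New behavior: the controller always starts; the feature gate only
	// controls whether *new* finalizers are added (see the earlier sketch).
	fmt.Printf("pvc-protection-controller started (addFinalizers=%v)\n", featureEnabled)
}

func main() {
	startPVCProtectionController(false) // still starts with the feature disabled
}
```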
pospispa added a commit to pospispa/kubernetes that referenced this issue Apr 21, 2018
The StorageObjectInUseProtection feature is enabled by default in K8s 1.10+. Assume a K8s cluster is used with this feature enabled, i.e. finalizers are added to all PVs and PVCs. If the cluster admin then disables the StorageObjectInUseProtection feature and a user deletes a PVC that is not in active use by a pod, the PVC is not removed from the system because of the finalizer; the user has to remove the finalizer manually to get the PVC removed. Note: deleted PVs are likewise not removed from the system, also because of finalizers.

That's why the pvc-protection-controller is now always started: it removes finalizers from PVCs automatically once a PVC is no longer in active use by a pod. The pv-protection-controller is likewise always started, so that finalizers are removed from PVs automatically when a PV is no longer Bound to a PVC.

Related issue:
kubernetes#60764

Related PRs:
kubernetes#61370
kubernetes#61324
k8s-github-robot pushed a commit that referenced this issue Jun 4, 2018
…ction-downgrade-issue-cherry-pick-into-K8s-1.10

Automatic merge from submit-queue.

cherry-pick into K8s 1.10: Always Start pvc-protection-controller and pv-protection-controller

**What this PR does / why we need it**:
The StorageObjectInUseProtection feature is enabled by default in K8s 1.10+. Assume a K8s cluster is used with this feature enabled, i.e. finalizers are added to all PVs and PVCs. If the cluster admin then disables the StorageObjectInUseProtection feature and a user deletes a PVC that is not in active use by a pod, the PVC is not removed from the system because of the finalizer; the user has to remove the finalizer manually to get the PVC removed. Note: deleted PVs are likewise not removed from the system, also because of finalizers.

This problem was fixed in [K8s 1.9.6](https://github.com/kubernetes/kubernetes/releases/tag/v1.9.6) by PR #61370 and is also fixed in K8s 1.11+ by PR #61324.
However, it is not fixed in K8s 1.10, which is why I've cherry-picked PR #61324 and am proposing to merge it into K8s 1.10.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes N/A

Related issue: #60764

**Special notes for your reviewer**:

**Release note**:

```release-note
If the StorageObjectInUseProtection feature is disabled and a Persistent Volume (PV) or Persistent Volume Claim (PVC) that contains a finalizer is deleted, the PV or PVC was not automatically removed from the system. Now it is removed automatically.
```
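
Before this fix, the "remove the finalizer manually" step mentioned above would look roughly like the following client-go patch (a hypothetical workaround sketch; "my-namespace" and "my-pvc" are placeholders, and it assumes a recent client-go where `Patch` takes a context):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Merge-patch the finalizers away so the stuck PVC can be garbage
	// collected.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err = client.CoreV1().PersistentVolumeClaims("my-namespace").Patch(
		context.TODO(), "my-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("finalizers cleared")
}
```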