Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix:statefulSet would not recreate the pvc when a pvc is deleted before a pod #79164

Closed
wants to merge 1 commit into from

Conversation

cwdsuzhou
Copy link
Member

@cwdsuzhou cwdsuzhou commented Jun 19, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Currently, statefulSet would not recreate the pvc when a pvc is deleted before a pod, which would lead to pod in pending status.

Which issue(s) this PR fixes:

Fixes #74374

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Bugfix: statefulSet would not recreate the pvc when a pvc is deleted before a pod 

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 19, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cwdsuzhou
To complete the pull request process, please assign kow3ns
You can assign the PR to them by writing /assign @kow3ns in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 19, 2019
@cwdsuzhou
Copy link
Member Author

/test pull-kubernetes-integration

@cwdsuzhou
Copy link
Member Author

/test pull-kubernetes-kubemark-e2e-gce-big

@cwdsuzhou cwdsuzhou changed the title [WIP] Fix not recreate pvc in sts Bugfix:statefulSet would not recreate the pvc when a pvc is deleted before a pod Jun 19, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 19, 2019
@cwdsuzhou
Copy link
Member Author

/assign @msau42 @kow3ns

@@ -192,6 +192,11 @@ func (spc *realStatefulPodControl) createPersistentVolumeClaims(set *apps.Statef
case err != nil:
errs = append(errs, fmt.Errorf("Failed to retrieve PVC %s: %s", claim.Name, err))
spc.recordClaimEvent("create", set, pod, &claim, err)
case err == nil:
Copy link
Contributor

@mucahitkurt mucahitkurt Jun 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a race condition with this solution as @msau42 already said, if the pvc can have DeletionTimeStamp after this method is run, and then pod will be pending state again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not an atomic operation, I think creating statefulSet pod can not avoid the scene absolutely. However, it seldom(almost never) would happen.

@kow3ns
Copy link
Member

kow3ns commented Jul 8, 2019

Why not just add a watch on PVCs and check if a deleted PVC corresponds to a Pod in a StatefulSet that has not been deleted?

@cwdsuzhou
Copy link
Member Author

Why not just add a watch on PVCs and check if a deleted PVC corresponds to a Pod in a StatefulSet that has not been deleted?

I am not sure if this would increase api-server load. So I would like to fix the issue by check DeleteTimeStamp

I have corrected the logic of this PR:

  1. check PVC DeleteTimeStamp from cache
  2. if not nil, get actual PVC from api server
getErr != nil && apierrors.IsNotFound(getErr) 
||realPvc != nil && acutualPvc.DeletionTimestamp != nil

we should also mark it as error.

PATL, thanks @kow3ns

@cwdsuzhou
Copy link
Member Author

cc @janetkuo

PTAL, thanks.

@cwdsuzhou
Copy link
Member Author

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Aug 2, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Aug 2, 2019
@cwdsuzhou
Copy link
Member Author

/retest

@msau42
Copy link
Member

msau42 commented Oct 4, 2019

/assign @janetkuo

@cwdsuzhou
Copy link
Member Author

ping @janetkuo

PTAL thanks

// if get pvc failed collect it;
// if not found or in terminating also collect error;
// if found pvc and not in terminating state, do not collect it.
actualPVC, getErr := spc.client.CoreV1().PersistentVolumeClaims(claim.Namespace).Get(claim.Name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why rechecking the deletionTimestamp by live GET helps here. The same object may not transition back and you get race on detecting the recreate case where you can miss it so we'd need a reconciliation logic for that case anyways.

Watching if pods reference deleted PVC and reconciling it as @kow3ns suggested in #79164 (comment) sounds more reliable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why rechecking the deletionTimestamp by live GET helps here. The same object may not transition back and you get race on detecting the recreate case where you can miss it so we'd need a reconciliation logic for that case anyways.

Watching if pods reference deleted PVC and reconciling it as @kow3ns suggested in #79164 (comment) sounds more reliable

Thanks, I would do that later

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2020
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 13, 2020
@smileusd
Copy link
Contributor

Hi, How about this pr status?

@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Copy link
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Storage protection feature does not integrate well with StatefulSet PVC recreation
9 participants