support pod bootstrap "checkpointing" in the kubelet #378

Closed
calebamiles opened this issue Aug 1, 2017 · 17 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status

Comments

@calebamiles
Member

calebamiles commented Aug 1, 2017

Feature Description

  • One-line feature description (can be used as a release note): support pod "checkpointing" in the kubelet for self hosting
  • Primary contact (assignee): @timothysc
  • Responsible SIGs: SIG Cluster Lifecycle, SIG Node
  • Design proposal link (community repo): RFE: Bootstrap Checkpointing - Modify manifest behavior slightly for self hosting.  kubernetes#49236
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred: @timothysc, @dchen1107
  • Approver (likely from SIG/area to which feature belongs): @dchen1107
  • Feature target (which target equals to which milestone):
    • Alpha release target 1.9
    • Beta release target (TBD)
    • Stable release target (TBD)
@calebamiles calebamiles added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels Aug 1, 2017
@calebamiles calebamiles added this to the 1.8 milestone Aug 1, 2017
@luxas
Member

luxas commented Sep 1, 2017

@roberthbailey this seems to miss the v1.8 target? Should we move to next-milestone?

@roberthbailey roberthbailey modified the milestones: next-milestone, 1.8 Sep 1, 2017
@roberthbailey
Member

roberthbailey commented Sep 1, 2017

Yes.

@timothysc timothysc modified the milestones: next-milestone, 1.9 Oct 23, 2017
@timothysc
Member

timothysc commented Oct 23, 2017

Proposal is here -> https://github.com/kubernetes/community/pull/1241/files
Implementation (refreshing this week) -> https://github.com/kubernetes/kubernetes/pull/50984/files

@timothysc timothysc changed the title support pod "checkpointing" in the kubelet support pod bootstrap "checkpointing" in the kubelet Oct 23, 2017
@idvoretskyi
Member

idvoretskyi commented Oct 24, 2017

@timothysc still alpha for 1.9?

@roberthbailey
Member

roberthbailey commented Oct 24, 2017

Yes.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Nov 22, 2017
Automatic merge from submit-queue (batch tested with PRs 55812, 55752, 55447, 55848, 50984). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Initial basic bootstrap-checkpoint support

**What this PR does / why we need it**:
Adds initial support for Pod checkpointing to allow controlled recovery of the control plane under self-hosted failure conditions.

fixes #49236
xref kubernetes/enhancements#378

**Special notes for your reviewer**:

Proposal is here: https://docs.google.com/document/d/1hhrCa_nv0Sg4O_zJYOnelE8a5ClieyewEsQM6c7-5-o/edit?ts=5988fba8#

1. Controlled tests work, but I have not tested self-hosted api-server recovery; that requires validation and logs. /cc @luxas
2. In adding hooks for the checkpoint manager, I found that many of the tests around basicpodmanager appear to be stubbed. This has become an anti-pattern in the code and should be avoided.
3. I need a node-e2e test to ensure consistency of behavior.

**Release note**:
```
Add basic bootstrap checkpointing support to the kubelet for control plane recovery
```

/cc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-node-pr-reviews
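
For readers new to the idea, the "checkpointing" described in the PR summary above can be pictured roughly as follows: a pod description is saved to local disk while the control plane is healthy, so it can be read back and restarted if the self-hosted API server becomes unreachable. This is only a minimal sketch of that general technique and is not taken from PR 50984. The dictionary layout, the file naming scheme, and the function names `write_checkpoint`, `load_checkpoints`, and `remove_checkpoint` are all assumptions made for illustration, not kubelet APIs.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(directory, pod):
    """Persist one pod description to `directory`.

    `pod` is a hypothetical, simplified dict with "namespace" and "name"
    keys (an assumption for this sketch, not the real Pod object).
    The file is written to a temporary name and then renamed, so a
    crash mid-write never leaves a half-written checkpoint behind.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"{pod['namespace']}_{pod['name']}.json"
    fd, tmp = tempfile.mkstemp(dir=str(directory), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(pod, f)
    os.replace(tmp, final)  # atomic rename within the same directory
    return final


def load_checkpoints(directory):
    """Return every checkpointed pod in `directory`, sorted by file name."""
    pods = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path) as f:
            pods.append(json.load(f))
    return pods


def remove_checkpoint(directory, pod):
    """Delete the checkpoint for `pod`; do nothing if it is already gone."""
    path = Path(directory) / f"{pod['namespace']}_{pod['name']}.json"
    try:
        path.unlink()
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical pod description used only for this demo.
        apiserver = {"namespace": "kube-system", "name": "kube-apiserver"}
        write_checkpoint(d, apiserver)
        # On restart, a recovery step would read the saved pods back.
        for pod in load_checkpoints(d):
            print("would restore", pod["namespace"], pod["name"])
```

The write-then-rename step is the main design choice in this sketch: a recovery path that reads checkpoints after a crash should never see a partially written file.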
@zacharysarah
Contributor

zacharysarah commented Nov 22, 2017

@calebamiles 👋 Please indicate in the 1.9 feature tracking board whether this feature needs documentation. If yes, please open a PR and add a link to the tracking spreadsheet. Thanks in advance!

@luxas
Member

luxas commented Nov 24, 2017

@timothysc will open a PR with documentation for this soon.
I updated the status in the tracking spreadsheet.

@timothysc
Member

timothysc commented Nov 27, 2017

I think that until we are consuming this in the test suite as part of the self-hosting feature, it is premature to document its usage. We need to get testing cycles under its belt and enable the broader feature that the code was written to support.

@luxas
Member

luxas commented Nov 27, 2017

OK, let's discuss whether we need docs tomorrow in the SIG call.

@timothysc
Member

timothysc commented Nov 28, 2017

Per the SIG discussion this morning, we are not planning to document the feature until its primary use case, self-hosting, has been enabled by default. We did not have enough test cycles to enable it for 1.9 but are planning to enable it in 1.10.

I'm not certain who is managing feature documentation for 1.10, but this is the official plan from @kubernetes/sig-cluster-lifecycle-feature-requests.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 28, 2017
@zacharysarah
Contributor

zacharysarah commented Nov 28, 2017

@timothysc Thanks for the update!

@fejta-bot

fejta-bot commented Feb 26, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 26, 2018
dims pushed a commit to dims/openstack-cloud-controller-manager that referenced this issue Mar 7, 2018
@fejta-bot

fejta-bot commented Mar 28, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 28, 2018
@justaugustus
Member

justaugustus commented Apr 17, 2018

@timothysc
Any plans for this in 1.11?

If so, can you please ensure the feature is up-to-date with the appropriate:

  • Description
  • Milestone
  • Assignee(s)
  • Labels:
    • stage/{alpha,beta,stable}
    • sig/*
    • kind/feature

cc @idvoretskyi

@fejta-bot

fejta-bot commented May 17, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@mcluseau

mcluseau commented Jun 8, 2018

@timothysc there are no references to "checkpoint" in https://kubernetes.io/docs/imported/release/notes/#v1-10-0. Any pointers, please? :)

@tasdikrahman

tasdikrahman commented Oct 1, 2018

Greetings. The issue was closed after being stale for some time; I was curious whether it is still planned for a future release.
