
Modify the kubeadm upgrade DAG for the TLS Upgrade #62655

Conversation

@stealthybox (Member) commented Apr 16, 2018

What this PR does / why we need it:
This adds the necessary utilities to detect Etcd TLS on static pods from the file system and to query Etcd.
It modifies the upgrade logic to tolerate the APIServer downtime that occurs during the Etcd TLS switch.
Tests are included and should be passing.

bazel test //cmd/kubeadm/... \
  && bazel build //cmd/kubeadm --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 \
  && issue=TLSUpgrade ~/Repos/vagrant-kubeadm-testing/copy_kubeadm_bin.sh

These cases are working consistently for me:

kubeadm-1.9.6 reset \
  && kubeadm-1.9.6 init --kubernetes-version 1.9.1 \
  && kubectl apply -f https://git.io/weave-kube-1.6
/vagrant/bin/TLSUpgrade_kubeadm upgrade apply 1.9.6  # non-TLS to TLS
/vagrant/bin/TLSUpgrade_kubeadm upgrade apply 1.10.0 # TLS to TLS
/vagrant/bin/TLSUpgrade_kubeadm upgrade apply 1.10.1 # TLS to TLS
/vagrant/bin/TLSUpgrade_kubeadm upgrade apply 1.9.1  # TLS to TLS w/ major version downgrade

This branch is based on top of #61942, as resolving the hash race condition is necessary for consistent behavior.
It looks to fit in pretty well with @craigtracey's PR #62141; the interfaces are quite similar.

/assign @detiber @timothysc

Which issue(s) this PR fixes
Helps with kubernetes/kubeadm#740

Special notes for your reviewer:

278b322
[kubeadm] Implement ReadStaticPodFromDisk

c74b563
Implement etcdutils with Cluster.HasTLS()

  • Test HasTLS()
  • Instrument throughout upgrade plan and apply
  • Update plan_test and apply_test to use new fake Cluster interfaces
  • Add descriptions to upgrade range test
  • Support KubernetesDir and EtcdDataDir in upgrade tests
  • Cover etcdUpgrade in upgrade tests
  • Cover upcoming TLSUpgrade in upgrade tests
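
For context while reviewing, here is a minimal sketch of manifest-based TLS detection — hypothetical and simplified, not the literal PR code (the real helper also reads the pod via ReadStaticPodFromDisk and checks the peer TLS flags):

```go
package etcdutil

import (
	"fmt"
	"strings"

	"k8s.io/api/core/v1"
)

// podHasTLS is a simplified sketch: it reports whether an etcd static pod
// was started with client TLS flags by scanning the container command.
func podHasTLS(pod *v1.Pod) (bool, error) {
	if len(pod.Spec.Containers) == 0 {
		return false, fmt.Errorf("no containers in etcd pod %q", pod.Name)
	}
	required := map[string]bool{
		"--cert-file": false,
		"--key-file":  false,
	}
	for _, arg := range pod.Spec.Containers[0].Command {
		for flag := range required {
			if strings.HasPrefix(arg, flag+"=") {
				required[flag] = true
			}
		}
	}
	for _, found := range required {
		if !found {
			return false, nil
		}
	}
	return true, nil
}
```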

8d8e5fe
Update test-case, fix nil-pointer bug, and improve error message

97117fa
Modify the kubeadm upgrade DAG for the TLS Upgrade

  • Calculate beforePodHashMap before the etcd upgrade in anticipation of
    KubeAPIServer downtime

  • Detect whether the pre-upgrade etcd static pod cluster has
    HasTLS()==false to switch on the Etcd TLS Upgrade. If performing the
    TLS Upgrade:

    • Skip the L7 Etcd check (could implement a waiter for this)
    • Skip data rollback on etcd upgrade failure due to the lack of an L7
      check (the APIServer is already down and unable to serve new requests)
    • On APIServer upgrade failure, also roll back the etcd manifest to
      maintain protocol compatibility
  • Add logging

Release note:

kubeadm upgrade no longer races on pod restarts, which previously led to unexpected upgrade behavior
kubeadm upgrade now successfully upgrades etcd and the controlplane to use TLS
kubeadm upgrade now supports external etcd setups
kubeadm upgrade can now rollback and restore etcd after an upgrade failure

@k8s-ci-robot added the size/XL, release-note, cncf-cla: yes, area/kubeadm, and sig/cluster-lifecycle labels Apr 16, 2018
@k8s-ci-robot requested review from kad and luxas April 16, 2018 15:08
@timothysc (Member) left a comment:

I'm going to stop reviews on this until we can chat.

However, I would like to trim out any refactorings that may be unnecessary in this PR.

@@ -23,7 +23,7 @@ import (
kubeadmconstants "k8s.io/kubernetes/cmd/kubeadm/app/constants"
"k8s.io/kubernetes/cmd/kubeadm/app/features"
"k8s.io/kubernetes/cmd/kubeadm/app/phases/addons/dns"
"k8s.io/kubernetes/cmd/kubeadm/app/util"
etcdutil "k8s.io/kubernetes/cmd/kubeadm/app/util/etcd"
Member:

I don't see what this change has to do with this PR.

FWIW renaming and refactorings should not be in this PR.

Member Author (@stealthybox):

I neglected to mention this in the commit, but I needed to restructure the module because HasTLS() imports ReadStaticPodFromDisk from app/util/staticpod.

app/util/staticpod already imports from app/util, and this created a dependency cycle.
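
To illustrate the cycle being avoided (package names from the comment above; arrows show Go import direction):

```go
// Existing edge:
//
//   app/util/staticpod  --imports-->  app/util
//
// If HasTLS() (which needs staticpod.ReadStaticPodFromDisk) had lived in
// app/util, we would also have had:
//
//   app/util  --imports-->  app/util/staticpod
//
// closing the cycle. Placing it in its own package breaks the loop:
//
//   app/util/etcd  --imports-->  app/util/staticpod  --imports-->  app/util
```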

// notice the removal of the Static Pod, leading to a false positive below where we check that the API endpoint is healthy
// If we don't do this, there is a case where we remove the Static Pod manifest, kubelet is slow to react, kubeadm checks the
// API endpoint below of the OLD Static Pod component and proceeds quickly enough, which might lead to unexpected results.
if err := waiter.WaitForStaticPodHashChange(cfg.NodeName, component, beforePodHash); err != nil {
Member:

Doesn't this overlap with @detiber's proposed kubelet change?

Member Author (@stealthybox):

It does; these are all proposed solutions that complement each other and overlap in different ways.

This PR skips the dependency on the APIServer for the protocol switch.

The kubelet API PR completely gets rid of the hard dependency on the APIServer.

@timothysc removed the request for review from luxas April 16, 2018 16:31
@timothysc added the kind/bug, cherrypick-candidate, and priority/critical-urgent labels Apr 20, 2018
@timothysc added this to the v1.10 milestone Apr 20, 2018
@timothysc (Member):

@stealthybox @detiber - we need to get this done ASAP for 1.10.2 release!

@k8s-github-robot:

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@detiber @stealthybox @timothysc

Pull Request Labels
  • sig/cluster-lifecycle: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@k8s-ci-robot added the needs-rebase label Apr 21, 2018
detiber and others added 2 commits April 20, 2018 18:32
- Update kubeadm static pod upgrades to use the
  kubetypes.ConfigHashAnnotationKey annotation on the mirror pod rather
  than generating a hash from the full object info. Previously, a status
  update for the pod would allow the upgrade to proceed before the
  new static pod manifest was actually deployed.

Signed-off-by: Jason DeTiberus <detiber@gmail.com>
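
A hedged sketch of what the commit above describes (helper name hypothetical; the annotation key is kubetypes.ConfigHashAnnotationKey, i.e. kubernetes.io/config.hash):

```go
package upgrade

import (
	"k8s.io/api/core/v1"
	kubetypes "k8s.io/kubernetes/pkg/kubelet/types"
)

// getMirrorPodHash returns the config hash the kubelet stamped onto the
// mirror pod. Comparing this before/after an upgrade tells us the new
// static pod manifest has actually been picked up, whereas hashing the
// full object would also change on mere status updates.
func getMirrorPodHash(pod *v1.Pod) string {
	return pod.Annotations[kubetypes.ConfigHashAnnotationKey]
}
```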
@stealthybox force-pushed the TLSUpgrade_+_detiber-kubeadm_hash branch from 14049c0 to 022cfd9 April 21, 2018 00:39
@k8s-ci-robot removed the needs-rebase label Apr 21, 2018
@stealthybox force-pushed the TLSUpgrade_+_detiber-kubeadm_hash branch 2 times, most recently from 6e17c22 to 27514f4 April 21, 2018 04:37
@stealthybox (Member Author):

@timothysc
I think @detiber and I agree this patch is near complete.
We've validated the happy path and improved some error cases.

We should validate that it also allows for External Etcd during the upgrade such as in #62141
(cc @craigtracey)

A final nice-to-have would be adding more test cases and reducing the differences between test behavior and actual behavior.

@detiber (Member) commented Apr 23, 2018:

@stealthybox There are still some additional changes needed to unblock external etcd, but I think I'd like to get that sorted out as a followup.

@detiber (Member) commented Apr 23, 2018:

/lgtm

@k8s-ci-robot added the lgtm label Apr 23, 2018
@timothysc (Member):

/cc @MaciekPytel - for milestone cherry-pick approval.

@timothysc (Member) left a comment:

Please address comments.

/cc @kubernetes/sig-cluster-lifecycle-pr-reviews

for more eyeballs, b/c this is a large back-patch and we want to be certain here.

@@ -281,7 +282,9 @@ func PerformStaticPodUpgrade(client clientset.Interface, waiter apiclient.Waiter
return err
}

return upgrade.StaticPodControlPlane(waiter, pathManager, internalcfg, etcdUpgrade)
// These are uninitialized because passing in the clients allows for mocking them during testing
var oldEtcdClient, newEtcdClient etcdutil.Client
Member:

This is weird and would cause static analysis issues.

Member Author (@stealthybox):

I agree that this is weird.

The trouble we ran into is that we cannot initialize the etcd client before the TLS certs are created.

We could lazily initialize the client, but that sounds buggy as well.
We could also store test interfaces in a global var and conditionally initialize them instead of passing them in as params.

I'm not sure what the best way to address this is.

Member Author (@stealthybox):

We could pass the client creation code as functions?
WDYT?
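
To make the suggestion concrete, a minimal sketch (all names hypothetical) of deferring client construction behind a factory function, so construction can happen after the TLS certs are written while tests inject fakes:

```go
package upgrade

import (
	etcdutil "k8s.io/kubernetes/cmd/kubeadm/app/util/etcd"
)

// etcdClientFactory defers construction until the caller is ready, e.g.
// after the new etcd TLS certs exist on disk.
type etcdClientFactory func() (etcdutil.Client, error)

// upgradeEtcd is a hypothetical caller: tests pass a factory returning a
// fake; production passes one that dials the real cluster.
func upgradeEtcd(newClientFor etcdClientFactory) error {
	// ... write the new etcd TLS certs to disk here ...

	newEtcdClient, err := newClientFor() // constructed only now
	if err != nil {
		return err
	}
	_, err = newEtcdClient.GetStatus()
	return err
}
```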

Member:

At this point let's just open an issue to follow up on some of these items.

@@ -61,14 +62,20 @@ func (f *fakeVersionGetter) KubeletVersions() (map[string]uint16, error) {
}, nil
}

type fakeEtcdCluster struct{}
type fakeEtcdCluster struct{ TLS bool }
Member:

Could you please open an issue: instead of creating a stub, use the test jiggery that exists to auto-stand-up and clean up etcd.

return c.TLS
}

func (c fakeTLSEtcdClient) GetStatus() (*clientv3.StatusResponse, error) {
Member:

Same as the other comment: I'm an anti-stub person, and we have the jiggery to dynamically spin up clusters in UTs.

Member Author (@stealthybox):

@timothysc this sounds great for testing GenericClient.GetStatus() in util/etcd/etcd_test.go

Do we have an existing unit-test harness for running etcd as a static pod using a kubelet?
That would be what is required for this particular test.

Member:

We have utilities to basically spin up an actual etcd server + client pair, but at this point I'm OK with merging what is here; we can create a new issue and cross-link for future work.

@@ -136,29 +141,36 @@ type fakeStaticPodPathManager struct {
}

func NewFakeStaticPodPathManager(moveFileFunc func(string, string) error) (StaticPodPathManager, error) {
realManifestsDir, err := ioutil.TempDir("", "kubeadm-upgraded-manifests")
kubernetesDir, err := ioutil.TempDir("", "kubeadm-pathmanager-")
Member:

Who cleans up these directories?

Member Author (@stealthybox) commented Apr 23, 2018:

defer os.RemoveAll(pathMgr.(*fakeStaticPodPathManager).KubernetesDir())

https://github.com/kubernetes/kubernetes/pull/62655/files#diff-012a4c853bcc083f2ac6c61d69e7292eR375

}

// GenericClient is a common etcd client for supported etcd servers
type GenericClient struct {
Member:

Did you test the upgrade scenario of external etcd that @craigtracey was validating?

Member Author (@stealthybox):

We don't have a test for this, but I assume @detiber would include it in the follow-up PR?

func (c GenericClient) GetStatus() (*clientv3.StatusResponse, error) {
cli, err := clientv3.New(clientv3.Config{
Endpoints: c.Endpoints,
DialTimeout: 5 * time.Second,
Member:

const above.
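
i.e. hoist the magic number into a named constant — something like (name illustrative; time is already imported in this file):

```go
const etcdDialTimeout = 5 * time.Second
```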

// WaitForStatus returns a StatusResponse after an initial delay and retry attempts
func (c GenericClient) WaitForStatus(delay time.Duration, retries int, retryInterval time.Duration) (*clientv3.StatusResponse, error) {
fmt.Printf("[util/etcd] Waiting %v for initial delay\n", delay)
time.Sleep(delay)
Member:

This is basically a wait.Until; please use that f(n), or add a comment so we can reduce this later.


Member:

stealthybox#2 attempts to modify WaitForStatus to use wait.Until from apimachinery.
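
A hedged sketch of that change — assuming k8s.io/apimachinery/pkg/util/wait, and not the literal follow-up code:

```go
package etcd

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
	"k8s.io/apimachinery/pkg/util/wait"
)

// GenericClient mirrors the type in the diff above; only what the sketch
// needs is shown.
type GenericClient struct {
	Endpoints []string
}

// GetStatus dials the first endpoint for its status, as in the diff above.
func (c GenericClient) GetStatus() (*clientv3.StatusResponse, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   c.Endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, err
	}
	defer cli.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return cli.Status(ctx, c.Endpoints[0])
}

// WaitForStatus reworked to lean on apimachinery's wait.Poll instead of a
// hand-rolled sleep/retry loop.
func (c GenericClient) WaitForStatus(delay time.Duration, retries int, retryInterval time.Duration) (*clientv3.StatusResponse, error) {
	fmt.Printf("[util/etcd] Waiting %v for initial delay\n", delay)
	time.Sleep(delay)

	var resp *clientv3.StatusResponse
	err := wait.Poll(retryInterval, time.Duration(retries)*retryInterval, func() (bool, error) {
		var getErr error
		if resp, getErr = c.GetStatus(); getErr != nil {
			fmt.Printf("[util/etcd] Attempt failed with error: %v\n", getErr)
			return false, nil // keep retrying until timeout
		}
		return true, nil
	})
	return resp, err
}
```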

@@ -0,0 +1,197 @@
/*
Copyright 2017 The Kubernetes Authors.
Member:

2018

stealthybox and others added 5 commits April 24, 2018 09:55
- Test HasTLS()
- Instrument throughout upgrade plan and apply
- Update plan_test and apply_test to use new fake Cluster interfaces
- Add descriptions to upgrade range test
- Support KubernetesDir and EtcdDataDir in upgrade tests
- Cover etcdUpgrade in upgrade tests
- Cover upcoming TLSUpgrade in upgrade tests
- Calculate `beforePodHashMap` before the etcd upgrade in anticipation of KubeAPIServer downtime
- Detect whether the pre-upgrade etcd static pod cluster has `HasTLS()==false` to switch on the Etcd TLS Upgrade.
  If performing the TLS Upgrade:
  - Skip the L7 Etcd check (could implement a waiter for this)
  - Skip data rollback on etcd upgrade failure due to the lack of an L7 check (the APIServer is already down and unable to serve new requests)
  - On APIServer upgrade failure, also roll back the etcd manifest to maintain protocol compatibility

- Add logging
- Add an L7 check for the kubeadm etcd static pod upgrade

Fix `rollbackEtcdData()` to return error=nil on success.
`rollbackEtcdData()` used to always return an error, making the rest of the
upgrade code completely unreachable.

Ignore errors from `rollbackOldManifests()` during the rollback, since it
always returns an error.
Success of the rollback is gated with etcd L7 healthchecks.

Remove logic implying the etcd manifest should be rolled back when
`upgradeComponent()` fails
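
The shape of the `rollbackEtcdData()` bug, sketched (simplified; `doRollback` is a placeholder for the real copy-back logic, not the PR's code):

```go
package upgrade

import "fmt"

// Before: the old version wrapped err unconditionally, so even a successful
// rollback reported failure and the code after the call was unreachable.
func rollbackEtcdDataBefore(doRollback func() error) error {
	err := doRollback()
	return fmt.Errorf("couldn't recover etcd database: %v", err) // always non-nil
}

// After: wrap only when something actually failed.
func rollbackEtcdDataAfter(doRollback func() error) error {
	if err := doRollback(); err != nil {
		return fmt.Errorf("couldn't recover etcd database: %v", err)
	}
	return nil
}
```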
@stealthybox force-pushed the TLSUpgrade_+_detiber-kubeadm_hash branch from 27514f4 to dac4fe8 April 24, 2018 16:00
@k8s-ci-robot removed the lgtm label Apr 24, 2018
@timothysc (Member) left a comment:

/lgtm
/approve


@k8s-ci-robot added the lgtm label Apr 24, 2018
@timothysc added the cherry-pick-approved label Apr 24, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: detiber, stealthybox, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Apr 24, 2018
@k8s-github-robot:

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot:

Automatic merge from submit-queue (batch tested with PRs 62655, 61711, 59122, 62853, 62390). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot merged commit 67870da into kubernetes:master Apr 24, 2018
@k8s-ci-robot (Contributor) commented Apr 24, 2018:

@stealthybox: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-e2e-kops-aws
Commit: dac4fe8
Rerun command: /test pull-kubernetes-e2e-kops-aws


k8s-github-robot pushed a commit that referenced this pull request Apr 26, 2018
…62655-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #62655: Modify the kubeadm upgrade DAG for the TLS Upgrade

Cherry pick of #62655 on release-1.10.

#62655: Modify the kubeadm upgrade DAG for the TLS Upgrade

**Release note**:
```release-note
Action Required: kubeadm upgrade no longer supports downgrading 1.10 clusters to 1.9 clusters due to an incompatibility between the kubernetes 1.9 and 1.10 featureGates struct. Please backup /etc/kubernetes/manifests, the etcd database, and the kubeadm-config configmap if you anticipate a need to rollback.
```
@discostur commented Apr 27, 2018:

@stealthybox Shouldn't this now also work for external etcd clusters? I just updated kubeadm to 1.10.2, but I still get:

$ kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
could not read manifests from: /etc/kubernetes/manifests, error: failed to check if etcd pod implements TLS: failed to read manifest for "/etc/kubernetes/manifests/etcd.yaml": open /etc/kubernetes/manifests/etcd.yaml: no such file or directory

@stealthybox (Member Author):

@discostur, I believe @detiber is still putting together support for external etcd.

@ypsingh27:
I have upgraded my kubeadm and kubectl versions to v1.10.2:
root # kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
root # kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

I am still not able to upgrade successfully. I get the below error:

root # kubeadm upgrade apply v1.10.1
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/version] You have chosen to change the cluster version to "v1.10.1"
[upgrade/versions] Cluster version: v1.9.3
[upgrade/versions] kubeadm version: v1.10.2
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.10.1"...
Static pod: kube-apiserver-vmaksa69901dzl hash: dc3749dffa0c124bd5c4964613658249
Static pod: kube-controller-manager-vmaksa69901dzl hash: 9e6798d0ba2ebe6747904b2195183c11
Static pod: kube-scheduler-vmaksa69901dzl hash: d38a1ee5cf80a84bc4f295d19b6874c2
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-vmaksa69901dzl hash: 17c801f54fe8bd173a9f4810b4242bbb
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests729921876/etcd.yaml"
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing etcd/server certificate and key.
[certificates] Using the existing etcd/peer certificate and key.
[certificates] Using the existing etcd/healthcheck-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests825459363/etcd.yaml"
[upgrade/staticpods] Not waiting for pod-hash change for component "etcd"
[upgrade/etcd] Waiting for etcd to become available
[util/etcd] Waiting 30s for initial delay
[util/etcd] Attempting to get etcd status 1/10
[util/etcd] Attempt failed with error: dial tcp [::1]:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 2/10
[util/etcd] Attempt failed with error: dial tcp [::1]:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 3/10
[util/etcd] Attempt failed with error: dial tcp [::1]:2379: getsockopt: connection refused
[util/etcd] Waiting 15s until next retry
[util/etcd] Attempting to get etcd status 4/10
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests729921876"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests729921876/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests729921876/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests729921876/kube-scheduler.yaml"
[upgrade/staticpods] The etcd manifest will be restored if component "kube-apiserver" fails to upgrade
[certificates] Using the existing etcd/ca certificate and key.
[certificates] Using the existing apiserver-etcd-client certificate and key.
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests825459363/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

Can someone help me with this upgrade? Any workarounds?

@stealthybox (Member Author):

@ypsingh27 it looks like your upgrade failed on the apiserver manifest.
It appears kubeadm rolled back your cluster to the previous functioning state. Is that correct?

The first thing I think of is that you may not have been able to pull the image within the timeout.
Pre-pulling images can help with the upgrade.
Beyond that, I can't help without more information about your config and environment.

Somebody in the #kubeadm slack channel may be able to help you.

@discostur:
@detiber is there any news on the failed upgrade with external etcd support? In the 1.10.2 release notes they write:

kubeadm upgrade now supports external etcd setups

but it still fails :/

@ypsingh27:
@stealthybox Please let me know what more details are needed. Also, how can I pre-download the manifest? Can I use the manifest of 1.10.1 from another system and restart the kubelet?

@robertrbruno:
Just wanted to second the same issues @discostur is seeing. Any updates?

@discostur:
@robertrbruno #63495
