[kubeadm control plane]: etcd communication errors are being swallowed #2454

Closed
sethp-nr opened this issue Feb 26, 2020 · 16 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@sethp-nr
Contributor

sethp-nr commented Feb 26, 2020

What steps did you take and what happened:

While testing an upgrade the etcd health checks were failing repeatedly. With the code from #2451 in place I could resolve it down one level:

failed to create etcd client: unable to create etcd client: context deadline exceeded

After some work, I found that my etcd CA secret had been regenerated, changing the private key (see: #2454). It seems that gRPC has exactly one error message when the connection is misconfigured, and that's "context deadline exceeded." I haven't yet found a way to get more information about what happened via the API, but I'm continuing to dig.

What did you expect to happen:

When I set up the same condition with etcdctl and kubectl port-forward, I got a helpful error message:

{"level":"warn","ts":"2020-02-25T20:50:29.757-0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-f67a0407-f684-4682-bb79-ec33c94b2178/127.0.0.1:63477","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: bad certificate\""}
Error: context deadline exceeded

Note that "Error: context deadline exceeded" is what came back from clientv3.New; the JSON line before it is a log statement printed to stderr.

Anything else you would like to add:

I found that any error sent back from the proxy dial function was being swallowed in the same way. It also looks like we're not using the errorStream we set up with the API Server, so it's possible that we'd miss important information about the proxy connection.

Environment:

  • Cluster-api version: master
  • Minikube/KIND version: kind v0.7.0 go1.13.6 darwin/amd64
  • Kubernetes version: (use kubectl version): a mix of v1.15 and v1.16 control plane nodes
  • OS (e.g. from /etc/os-release): ubuntu

/kind bug
/assign
/lifecycle active

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Feb 26, 2020
@sethp-nr sethp-nr changed the title KubeadmControlPlane: etcd communication errors are being swallowed [kubeadm control plane]: etcd communication errors are being swallowed Feb 26, 2020
@vincepri
Member

/milestone v0.3.0

@vincepri
Member

vincepri commented Mar 9, 2020

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, v0.3.x Mar 9, 2020
@vincepri
Member

vincepri commented Mar 9, 2020

Bumping this given that it seems the upstream grpc fix might not go in soon

@vincepri
Member

@sethp-nr Not sure if you saw this response, should we leave things as they are if the upstream fix isn't merged?

@vincepri vincepri removed the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Apr 30, 2020
@sethp-nr
Contributor Author

sethp-nr commented May 1, 2020

It ended up going in as an option under a different PR after some discussion in the linked thread (culminating here: grpc/grpc-go#2031 (comment)).

There's still some work left: getting it into a released version, getting etcd to be compatible with that version, and picking that version up here. After that we can finally replace WithBlock with WithReturnLastError and see clearer errors.

I'm working on that in fits and starts when I have time, but if someone else wants to do it I won't get in their way.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 30, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 31, 2020
@vincepri
Member

The PR that adds WithReturnConnectionError has been merged, and the option is available from v1.30.0.

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.x, v0.4.0 Jul 31, 2020
@vincepri
Member

/help

@k8s-ci-robot
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 31, 2020
@vincepri
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jul 31, 2020
@vincepri
Member

/milestone v1.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4, v1.0 Oct 19, 2021
@vincepri vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@sbueringer
Member

/assign @killianmuldoon
to re-assess the current state

@timoreimann
Contributor

timoreimann commented Feb 19, 2022

FWIW, I submitted #4997 some time ago, which addressed at least one occurrence of error hiding / swallowing. I'm not sure whether this bug report covers more than that, though.

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/close

until we get more evidence that there are still other occurrences of this error after #4997 merged

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

until we get more evidence that there are still other occurrences of this error after #4997 merged

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

8 participants