ApiServer was not able to recover from network disconnection with etcd #69621

Closed
DylanBLE opened this Issue Oct 10, 2018 · 9 comments

Contributor

DylanBLE commented Oct 10, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
When apiserver is disconnected from the etcd cluster, it is not able to recover even after the network issue is resolved.

What you expected to happen:
Apiserver should always find an available etcd member to work with.

How to reproduce it (as minimally and precisely as possible):
Suppose there is an etcd cluster with 3 nodes and one apiserver connected to it.

  1. Cut off the network connection between etcd and apiserver:
    iptables -A INPUT -p tcp --dport 2379 -j DROP
    iptables -A OUTPUT -p tcp --dport 2379 -j DROP
  2. After 1 minute, apiserver responds with 504 Gateway Timeout from the timeout filter. The log shows:
    logging error output: "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within 1m0s\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}\n"
     [[kubelet/v1.9.6 (linux/amd64) kubernetes/73b157c] 172.16.1.43:48934]
    I1010 19:31:18.290539   29187 wrap.go:42] GET /api/v1/nodes/172.16.1.42: (1m0.000346629s) 504
  3. Restore the network on one node:
    iptables -D INPUT -p tcp --dport 2379 -j DROP
    iptables -D OUTPUT -p tcp --dport 2379 -j DROP
  4. There is a 2/3 chance that apiserver is still unable to serve requests until it is restarted.

Anything else we need to know?:
The etcd cluster's client port is 2379 and its peer port is 2380, so the iptables rules do not affect the internal health of the etcd cluster.

Environment:

  • Kubernetes version (use kubectl version):
    v1.9.6
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
    Etcd Version: 3.1.18

Contributor

DylanBLE commented Oct 10, 2018

/sig api-machinery

Contributor

DylanBLE commented Oct 10, 2018

It seems that adding a timeout context fixes the issue. I wonder if there is a more elegant solution.

Contributor

jennybuckley commented Oct 11, 2018

/assign @jingyih

Contributor

k8s-ci-robot commented Oct 11, 2018

@jennybuckley: GitHub didn't allow me to assign the following users: jingyih.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @jingyih

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

jingyih commented Oct 11, 2018

There is a timeout added to the incoming request to apiserver, which we believe is not wired correctly in the call stack. I am investigating this issue.

Member

liggitt commented Oct 12, 2018

Can you verify whether the per-call timeout is still needed after 31ff8c6 (added in 1.10)? I notice you reported this against 1.9.x.

Contributor

DylanBLE commented Oct 15, 2018

@liggitt 31ff8c6 fixed the issue! Thanks.

Contributor

DylanBLE commented Oct 15, 2018

/close

Contributor

k8s-ci-robot commented Oct 15, 2018

@DylanBLE: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
