New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd component status check should include credentials #39716

Merged
merged 1 commit into from Apr 22, 2017

Conversation

@zhouhaibing089
Contributor

zhouhaibing089 commented Jan 11, 2017

  • Add TLS credentials into pkg/genericapiserver.Backend.
  • Add TLS credentials into pkg/registry/core/componentstatus.Server.
  • pkg/probe/http.httpProber should accept the TLS credentials.

Now it is working.

$ kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}

Fixes #27343.

@k8s-ci-robot

This comment has been minimized.

Contributor

k8s-ci-robot commented Jan 11, 2017

Hi @zhouhaibing089. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-reviewable

This comment has been minimized.

k8s-reviewable commented Jan 11, 2017

This change is Reviewable

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch 2 times, most recently from 1731c14 to 1896989 Jan 11, 2017

@zhouhaibing089

This comment has been minimized.

Contributor

zhouhaibing089 commented Jan 11, 2017

squashed and removed the WIP.

@zhouhaibing089 zhouhaibing089 changed the title from WIP - etcd component status check should include credentials to etcd component status check should include credentials Jan 11, 2017

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch 5 times, most recently from e0877af to 523aecb Jan 11, 2017

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch from 523aecb to a989076 Jan 12, 2017

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch from a989076 to 71f60a2 Jan 13, 2017

@lavalamp lavalamp requested a review from mml Jan 19, 2017

@lavalamp

This comment has been minimized.

Member

lavalamp commented Jan 19, 2017

This seems like a good idea but there are some things I would do differently. I'm strapped for time so maybe @mml can do an initial review?

@sandys

This comment has been minimized.

sandys commented Jan 23, 2017

hey guys, may I request for someone to review this please ?

I think we are blocking on this to setup a full-TLS kubernetes cluster.

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch from 71f60a2 to 00bc625 Jan 23, 2017

@lavalamp

The caching/threadsafety comments are the important ones, everything else is readability. Thanks. Sorry for the long delay.

func (server *Server) DoServerCheck(prober httpprober.HTTPProber) (probe.Result, string, error) {
func (server *Server) DoServerCheck() (probe.Result, string, error) {
// setup the prober
if server.Prober == nil {

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

I think you probably need a lock or a sync.Once to protect this logic.

This comment has been minimized.

@zhouhaibing089

zhouhaibing089 Jan 30, 2017

Contributor

yeah, updated to use sync.Once.

This comment has been minimized.

@liggitt

liggitt Mar 8, 2017

Member

the server.Prober == nil check outside the sync.Once call will still cause race issues with the assignment inside the sync.Once call

@@ -96,7 +93,7 @@ func ToConditionStatus(s probe.Result) api.ConditionStatus {
}
func (rs *REST) getComponentStatus(name string, server Server) *api.ComponentStatus {
status, msg, err := server.DoServerCheck(rs.prober)
status, msg, err := server.DoServerCheck()

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

Since you take server by value, it is going to have to construct a prober every time, it won't cache it. I guess you probably want to start storing pointers in the map.

This comment has been minimized.

@zhouhaibing089

zhouhaibing089 Jan 30, 2017

Contributor

Updated to storing pointers in the map.

@@ -36,6 +36,16 @@ func New() HTTPProber {
return httpProber{transport}
}
// NewWithCert creates a prober which will also setup client certificates

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

Optional: what about NewWithClientCert?

for server := range servers {
backends = append(backends, Backend{
Server: server,
KeyFile: s.StorageConfig.KeyFile,

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

Are all etcd servers using the same client cert? Hm, I guess they must.

This comment has been minimized.

@zhouhaibing089

zhouhaibing089 Jan 30, 2017

Contributor

Yeah, all the configurations are inherited from storagebackend.Config, if they have to use different one, we will need to update that part as well..

// for health validations
type Backend struct {
Server string
// TLS credentials

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

document: these will be empty if it is not an HTTPS connection.

Better yet, make a type ClientCert {Key, Cert string} and put here a ClientCert *ClientCert, so it is clear it can be nil?

This comment has been minimized.

@zhouhaibing089

zhouhaibing089 Jan 30, 2017

Contributor

these will be empty if it is not an HTTPS connection.

The client certificate is only required when the server enables client-cert-auth.

There are already many places use CertFile and KeyFile around, like storagebackend.Config, so I just leave it as it is..

// Backend describes the storage servers, the information here should be enough
// for health validations
type Backend struct {
Server string

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

Document whether this includes the scheme (http/https), or just the IP/hostname?

@@ -30,6 +30,15 @@ import (
"github.com/golang/glog"
)
// Backend describes the storage servers, the information here should be enough
// for health validations

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

nit: add trailing period.

type Backend struct {
Server string
// TLS credentials
KeyFile string

This comment has been minimized.

@lavalamp

lavalamp Jan 24, 2017

Member

Are these really file paths rather than the contents of the key/cert files? Seems odd to pass around file paths.

@zhouhaibing089 zhouhaibing089 force-pushed the zhouhaibing089:etcd-health-check branch from 7209b91 to b104017 Apr 17, 2017

@k8s-merge-robot k8s-merge-robot removed the lgtm label Apr 17, 2017

@zhouhaibing089

This comment has been minimized.

Contributor

zhouhaibing089 commented Apr 21, 2017

ping @liggitt @lavalamp is there any other comments?

@smarterclayton

This comment has been minimized.

Contributor

smarterclayton commented Apr 21, 2017

/approve

@zhouhaibing089

This comment has been minimized.

Contributor

zhouhaibing089 commented Apr 22, 2017

@liggitt mind adding the lgtm label?(rebased and updated the bazel BUILD file since the last time).

@liggitt

This comment has been minimized.

Member

liggitt commented Apr 22, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 22, 2017

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Apr 22, 2017

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, smarterclayton, zhouhaibing089

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-merge-robot

This comment has been minimized.

Contributor

k8s-merge-robot commented Apr 22, 2017

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit e0ba40b into kubernetes:master Apr 22, 2017

12 of 14 checks passed

Submit Queue Required Github CI test is not green: Jenkins GCE e2e
Details
code-review/reviewable 9 files, 16 discussions left (lavalamp, liggitt, smarterclayton, zhouhaibing089)
Details
Jenkins Bazel Build Job succeeded.
Details
Jenkins GCE Node e2e Jenkins job succeeded.
Details
Jenkins GCE e2e Jenkins job succeeded.
Details
Jenkins GCE etcd3 e2e Jenkins job succeeded.
Details
Jenkins GCI GCE e2e Jenkins job succeeded.
Details
Jenkins Kubemark GCE e2e Jenkins job succeeded.
Details
Jenkins kops AWS e2e Jenkins job succeeded.
Details
Jenkins non-CRI GCE Node e2e Jenkins job succeeded.
Details
Jenkins non-CRI GCE e2e Jenkins job succeeded.
Details
Jenkins unit/integration Jenkins job succeeded.
Details
Jenkins verification Jenkins job succeeded.
Details
cla/linuxfoundation zhouhaibing089 authorized
Details
@dawidmalina

This comment has been minimized.

dawidmalina commented Apr 24, 2017

@liggitt Will this change be merged to 1.5 branch?

@liggitt

This comment has been minimized.

Member

liggitt commented Apr 24, 2017

@liggitt Will this change be merged to 1.5 branch?

no, since this is not fixing a regression, new functionality is not backported to old releases.

@dawidmalina

This comment has been minimized.

dawidmalina commented Apr 24, 2017

Ok. So this will be available in 1.7, am i right?

@liggitt

This comment has been minimized.

Member

liggitt commented Apr 24, 2017

Yes

@dimthe

This comment has been minimized.

dimthe commented May 16, 2017

can this fix be backported or not ?

@cemo

This comment has been minimized.

cemo commented Jun 29, 2017

What is the effect of having a unhealthy output?

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                   ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Unhealthy   Get https://etcd.xxx.internal.yyy:2379/health: remote error: tls: bad certificate

I can see that deploys, daemon sets are working fine but I could not be sure about this message.

gentle ping @liggitt, @zhouhaibing089 and others

@zhouhaibing089

This comment has been minimized.

Contributor

zhouhaibing089 commented Jun 29, 2017

  1. there is no big deal when you see this message, however, people will ask: well, what's wrong with it(you already did, 😄 ).
  2. any service which relies on the result of component status might not work well, in the time that we were using kube-up scripts, it will validate cluster status by checking this output.

@cemo hope this helps.

@cemo

This comment has been minimized.

cemo commented Jun 29, 2017

Thank you so much. 👍

@tristanls

This comment has been minimized.

tristanls commented Jun 29, 2017

@cemo, @zhouhaibing089 the effect is that if all traffic is encrypted by mutual TLS, this prevents kubelet from configuring networking in certain cluster configurations, which causes the cluster to not function. See: #42701 (comment)

kelseyhightower added a commit to kelseyhightower/kubernetes-the-hard-way that referenced this pull request Jul 12, 2017

Update 05-kubernetes-controller.md
there is a bug in Kubernetes 1.6.1 that causes an error when validating the kubernetes environment and etcd.  (kubernetes/kubernetes#39716)   I found that using the 1.7.0 version I did not get this error.  Affects the README, this file and the client configuration (moving to 1.7.0 to match)
@gyliu513

This comment has been minimized.

Member

gyliu513 commented Jul 17, 2017

I think that this should be back merged to 1.6, it is really confusing for people when got this error, as people cannot make sure if the cluster still running in a healthy state or not at this time, comments? @liggitt @lavalamp

@liggitt

This comment has been minimized.

Member

liggitt commented Jul 17, 2017

I think that this should be back merged to 1.6, it is really confusing for people when got this error, as people cannot make sure if the cluster still running in a healthy state or not at this time, comments? @liggitt @lavalamp

I'm ambivalent. The componentstatus API has been broken this way since it was introduced (c.f. #17737).

There are still other assumptions the componentstatuses API makes that are incorrect (that the controller-manager and scheduler are running on unsecured localhost, for example). I don't really think backporting one aspect of componentstatuses is high priority.

zuzmic added a commit to zuzmic/kubernetes-the-hard-way that referenced this pull request Apr 19, 2018

Update 05-kubernetes-controller.md
there is a bug in Kubernetes 1.6.1 that causes an error when validating the kubernetes environment and etcd.  (kubernetes/kubernetes#39716)   I found that using the 1.7.0 version I did not get this error.  Affects the README, this file and the client configuration (moving to 1.7.0 to match)

mbenabda added a commit to weekendesk/kubernetes-the-hard-way that referenced this pull request Apr 20, 2018

Update 05-kubernetes-controller.md
there is a bug in Kubernetes 1.6.1 that causes an error when validating the kubernetes environment and etcd.  (kubernetes/kubernetes#39716)   I found that using the 1.7.0 version I did not get this error.  Affects the README, this file and the client configuration (moving to 1.7.0 to match)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment