Add status handling for RouteHealth #334

Closed
benjaminapetersen wants to merge 5 commits into openshift:master from benjaminapetersen:status/route-healthy

Conversation

@benjaminapetersen
Contributor

Most of the "Console isn't happy" questions we get stem from the console route being unhealthy at this point. This adds a RouteHealthDegraded condition, with several checks to see if the route is in a good place.

It is similar to the check done by the authentication operator, which seems to have made debugging much easier:
https://github.com/openshift/cluster-authentication-operator/blob/e7d5e3d45f5188e9c0631e93552e5b7ac48a06e4/pkg/operator2/operator.go#L445
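A minimal sketch of the kind of check described here, not the PR's actual code. The reason strings and message formats are the ones quoted in the status output later in this thread; the function names `checkRouteHealth` and `probe` are hypothetical, and the use of `Head` is an assumption based on the `Head https://...` error shown in that output.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkRouteHealth (hypothetical name) sends a HEAD request to url and maps
// the outcome to a (reason, message) pair. An empty reason means healthy.
func checkRouteHealth(client *http.Client, url string) (reason, message string) {
	resp, err := client.Head(url)
	if err != nil {
		return "RouteHealthDegradedFailedGet", fmt.Sprintf("failed to GET route: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "RouteHealthDegradedStatusError",
			fmt.Sprintf("route not yet available, %s returns '%s'", url, resp.Status)
	}
	return "", ""
}

// probe (hypothetical helper) starts a throwaway local server that always
// answers with the given status code and returns the reason reported for it.
func probe(status int) string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	reason, _ := checkRouteHealth(client, srv.URL+"/health")
	return reason
}

func main() {
	fmt.Printf("200 -> %q\n", probe(http.StatusOK))
	fmt.Printf("503 -> %q\n", probe(http.StatusServiceUnavailable))
}
```

In the operator, the returned reason and message would feed the RouteHealthDegraded condition; that wiring is omitted here.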

/assign @spadgett @jhadvig

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 1, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benjaminapetersen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 1, 2019
@benjaminapetersen
Contributor Author

Revised per @spadgett's comment about ordering. Definitely a good point: we want to be able to check route health even if other resources error out and abort the loop.

@benjaminapetersen
Contributor Author

/retest

Variety of flakes

Member

@jhadvig jhadvig left a comment

A couple of comments, otherwise the changes look good 👍

@benjaminapetersen
Contributor Author

Interestingly, test failure:

brand_builtins_test.go:50: operator has not reached settled state in 4m0s attempts due to [RouteHealthDegraded] - timed out waiting for the condition 

@benjaminapetersen
Contributor Author

/retest

@benjaminapetersen benjaminapetersen force-pushed the status/route-healthy branch 2 times, most recently from fdf5aa2 to 0466b2b Compare November 12, 2019 15:19
@benjaminapetersen
Contributor Author

benjaminapetersen commented Nov 12, 2019

/retest

     logging_test.go:25: setting console operator to 'Debug' LogLevel ...
    logging_test.go:83: checking if '--log-level=*=DEBUG' flag is set on the console deployment container command...
    logging_test.go:16: waiting for cleanup to reach settled state...
    console-operator.go:320: waiting for observed generation 14 to match generation 15... 
    console-operator.go:369: waited 10 seconds to reach settled state...
    console-operator.go:369: waited 30 seconds to reach settled state...
    console-operator.go:369: waited 60 seconds to reach settled state...
    console-operator.go:369: waited 90 seconds to reach settled state...
    console-operator.go:369: waited 120 seconds to reach settled state...
    logging_test.go:16: operator has not reached settled state in 4m0s attempts due to [RouteHealthDegraded] - timed out waiting for the condition 

I want that status :)
But retest to see if this is a flake or if a tweak is needed.

@benjaminapetersen
Contributor Author

/retest

level=fatal msg="Terraform destroy: failed to destroy using Terraform"

- GET configmap router-ca if possible and load CA for testing console route health
- load serviceaccount ca as fallback for testing route health
- report various potential route health errors

Add CA to solve x509 errors when checking route health
}
client := clientWithCA(caPool)

// TODO: deal with an environment with a MitM proxy?
Contributor Author

@stlaz I now get the CA cert from router-ca, with a fallback to the operator's local serviceaccount CA if router-ca is unavailable. You mentioned a possible proxy MitM issue I should also take into account. Care to elaborate? Thanks!

routerCA, rcaErr := co.configMapClient.ConfigMaps(api.OpenShiftConsoleNamespace).Get(api.RouterCAConfigMapName, metav1.GetOptions{})

if rcaErr != nil && apierrors.IsNotFound(rcaErr) {
// using CA ca-bundle.crt from configmap router-ca from openshift-config-managed
Contributor Author

On error, this does not log any cert data. It only logs the path of the attempted cert read (at a high enough -v), and if the read failed, no cert was loaded anyway. I think we want to know exactly what went wrong when something does go wrong, without leaking anything.

@benjaminapetersen
Contributor Author

No longer getting this issue with the CA wired up:

  - lastTransitionTime: "2019-11-04T20:21:52Z"
    message: 'RouteHealthDegraded: failed to GET route: Head https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health:
      x509: certificate signed by unknown authority'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

The flow may now report:

  - lastTransitionTime: "2019-11-15T18:48:57Z"
    message: 'RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health
      returns ''503 Service Unavailable'''

then:

  - lastTransitionTime: "2019-11-15T18:48:57Z"
    message: 'RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health
      returns ''503 Service Unavailable'''
    reason: RouteHealthDegradedStatusError

until finally resolved when the endpoint responds:

status:
  conditions:
  - lastTransitionTime: "2019-11-15T19:04:20Z"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-15T19:04:18Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-11-15T19:04:18Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-11-15T16:15:43Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable

}

func (co *consoleOperator) getCA() (*x509.CertPool, error) {
caCertPool := x509.NewCertPool()
Contributor Author

Possibly I should update to:

// start with system, fallback to new?
rootCAs, _ := x509.SystemCertPool()
if rootCAs == nil {
	rootCAs = x509.NewCertPool()
}
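A small sketch of how that pattern could be combined with an extra CA bundle, under the assumption that the bundle arrives as PEM bytes. The helper name `rootsWith` is hypothetical and not part of the PR.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// rootsWith (hypothetical helper) starts from the system trust store, falls
// back to an empty pool if the system pool is unavailable, and then appends
// any certificates found in extraPEM. The boolean reports whether at least
// one certificate was parsed from extraPEM.
func rootsWith(extraPEM []byte) (*x509.CertPool, bool) {
	rootCAs, _ := x509.SystemCertPool()
	if rootCAs == nil {
		rootCAs = x509.NewCertPool()
	}
	ok := rootCAs.AppendCertsFromPEM(extraPEM)
	return rootCAs, ok
}

func main() {
	// Input that is not PEM yields a usable pool, but nothing is appended.
	pool, ok := rootsWith([]byte("not a certificate"))
	fmt.Println(pool != nil, ok)
}
```

Checking the boolean from `AppendCertsFromPEM` is worth doing in practice, since a malformed bundle otherwise fails silently and only shows up later as an x509 error.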

func clientWithCA(caPool *x509.CertPool) *http.Client {
return &http.Client{
Timeout: 5 * time.Second,
Transport: &http.Transport{
Contributor Author

Possibly should use:

transport := http.DefaultTransport // a package-level variable, not a function

Contributor

You're using your own roots whereas the above transport only uses the default system trust bundle, so you will want to use your own transport.
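One way to get both, sketched under assumptions: clone the default transport to keep its defaults, then point its TLS config at the custom pool. The `clientWithCA` signature and the 5 second timeout come from the diff above; using `Clone()` (Go 1.13+) is a choice made for this sketch, and `trusts` is a hypothetical helper that exercises the client against a throwaway TLS server.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// clientWithCA returns a client whose transport keeps the defaults of
// http.DefaultTransport but verifies servers against caPool only.
func clientWithCA(caPool *x509.CertPool) *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{RootCAs: caPool}
	return &http.Client{Timeout: 5 * time.Second, Transport: transport}
}

// trusts (hypothetical helper) reports whether a GET against a local test
// TLS server succeeds, with or without the server's cert in the pool.
func trusts(addServerCert bool) bool {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	pool := x509.NewCertPool()
	if addServerCert {
		pool.AddCert(srv.Certificate())
	}

	client := clientWithCA(pool)
	defer client.CloseIdleConnections()

	resp, err := client.Get(srv.URL)
	if err != nil {
		return false // e.g. x509: certificate signed by unknown authority
	}
	resp.Body.Close()
	return true
}

func main() {
	fmt.Println("with server cert in pool:", trusts(true))
	fmt.Println("with empty pool:", trusts(false))
}
```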

@benjaminapetersen
Contributor Author

Also needs #334 to eliminate the Removed test flake.

@openshift-ci-robot
Contributor

@benjaminapetersen: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/verify a75d774 link /test verify
ci/prow/unit a75d774 link /test unit
ci/prow/e2e-gcp a75d774 link /test e2e-gcp
ci/prow/e2e-aws-console a75d774 link /test e2e-aws-console
ci/prow/e2e-aws-operator a75d774 link /test e2e-aws-operator
ci/prow/e2e-gcp-upgrade a75d774 link /test e2e-gcp-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

}
}

func appendProxyCA(caCertPool *x509.CertPool, systemCABundle []byte) *x509.CertPool {
Contributor

@stlaz stlaz Nov 19, 2019

If proxy CA is configured, it should already appear in the system CA bundle - the location you're copying it to is where golang looks when you call x509.SystemCertPool()

Contributor

ah, nvm, it was just the name of the function that confused me, I see what you did there

}

func (co *consoleOperator) getCA() (*x509.CertPool, error) {
// TODO: should I update to this? start with the SystemCertPool?
Contributor

yes, this looks better IMO, and SystemCertPool() should be fresh since you're doing --terminate-on-files when starting your operator


// fall back to the local CA from our serviceaccount
// if we hit this path, are we likely to get self signed certs errors?
serviceAccountCAbytes, err := ioutil.ReadFile(api.OAuthEndpointCAFilePath)
Contributor

This CA bundle actually does contain the router-ca, so you might just stick with it instead of getting the CA from the config map above.

// that are missing from the system trust roots
rootCAs.AppendCertsFromPEM(config.CAData)

// TODO: could get router-ca ConfigMap we sync'd from openshift-config-managed
Contributor

Not really, you're using a passthrough route to access the /metrics endpoint from an external endpoint, but the cert that's used for serving directly at the pod which terminates the TLS connection is not valid for the hostname of the route (and is actually signed by service-ca AFAIK).

So even if you had a valid CA that signed the server cert here, the cert verification would fail because of the hostname. Yet you need passthrough route to be able to use client cert auth. Another solution would be to grab a token for a service account that's allowed to access this endpoint, in which case the passthrough aspect of the route would not be necessary.
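The hostname point can be illustrated with a local test server; this is only a sketch, not the operator's setup. The test server's certificate is trusted via the pool in both cases, and the hypothetical helper `hostnameRejected` only varies the name the client verifies against (`console.example.invalid` is a made-up hostname).

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// hostnameRejected (hypothetical helper) trusts the test server's cert, then
// verifies it against serverName. It reports whether the request failed
// specifically with an x509 hostname mismatch.
func hostnameRejected(serverName string) bool {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate()) // the signer is trusted in every case

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, ServerName: serverName},
	}}
	defer client.CloseIdleConnections()

	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
		return false
	}
	var hostErr x509.HostnameError
	return errors.As(err, &hostErr)
}

func main() {
	// The httptest certificate lists example.com among its names.
	fmt.Println("example.com rejected:", hostnameRejected("example.com"))
	fmt.Println("console.example.invalid rejected:", hostnameRejected("console.example.invalid"))
}
```

Trusting the signing CA is therefore necessary but not sufficient: the name on the serving cert also has to match the name the client dials, which is the mismatch described above for the passthrough route.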

@jhadvig
Member

jhadvig commented Apr 20, 2020

Closing in favor of #399

@jhadvig jhadvig closed this Apr 20, 2020

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants