Add status handling for RouteHealth by benjaminapetersen · Pull Request #334 · openshift/console-operator

benjaminapetersen · 2019-11-01T14:43:14Z

Most of the "Console isn't happy" questions we get stem from the console route being unhealthy at this point. This adds a RouteHealthDegraded condition, with several checks to see if the route is in a good place.

It is similar to the check done by the authentication operator, which seems to have made debugging much easier:
https://github.com/openshift/cluster-authentication-operator/blob/e7d5e3d45f5188e9c0631e93552e5b7ac48a06e4/pkg/operator2/operator.go#L445

/assign @spadgett @jhadvig

openshift-ci-robot · 2019-11-01T14:43:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benjaminapetersen


Needs approval from an approver in each of these files:

~~OWNERS~~ [benjaminapetersen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/console/operator/health.go

pkg/console/operator/sync_v400.go

benjaminapetersen · 2019-11-01T16:40:01Z

Revised per @spadgett comment about ordering. Def a good point, we want to be able to check health even if other resources err and abort the loop.

benjaminapetersen · 2019-11-01T18:46:42Z

/retest

Variety of flakes

jhadvig

Couple of comments, otherwise the changes look good 👍

pkg/console/operator/health.go

benjaminapetersen · 2019-11-04T20:32:33Z

Interestingly, test failure:

brand_builtins_test.go:50: operator has not reached settled state in 4m0s attempts due to [RouteHealthDegraded] - timed out waiting for the condition

benjaminapetersen · 2019-11-04T20:32:40Z

/retest

pkg/console/operator/health.go

benjaminapetersen · 2019-11-12T17:00:39Z

/retest

    logging_test.go:25: setting console operator to 'Debug' LogLevel ...
    logging_test.go:83: checking if '--log-level=*=DEBUG' flag is set on the console deployment container command...
    logging_test.go:16: waiting for cleanup to reach settled state...
    console-operator.go:320: waiting for observed generation 14 to match generation 15... 
    console-operator.go:369: waited 10 seconds to reach settled state...
    console-operator.go:369: waited 30 seconds to reach settled state...
    console-operator.go:369: waited 60 seconds to reach settled state...
    console-operator.go:369: waited 90 seconds to reach settled state...
    console-operator.go:369: waited 120 seconds to reach settled state...
    logging_test.go:16: operator has not reached settled state in 4m0s attempts due to [RouteHealthDegraded] - timed out waiting for the condition

I want that status :)
But retest to see if this is a flake or if a tweak is needed.

benjaminapetersen · 2019-11-13T17:07:36Z

/retest

level=fatal msg="Terraform destroy: failed to destroy using Terraform"

pkg/console/operator/health.go

- GET configmap router-ca if possible and load CA for testing console route health
- load serviceaccount ca as fallback for testing route health
- report various potential route health errors

Add CA to solve x509 errors when checking route health

pkg/api/api.go

pkg/console/operator/health_deployment.go

benjaminapetersen · 2019-11-15T19:13:52Z

pkg/console/operator/health_route.go

+		}
+		client := clientWithCA(caPool)
+
+		// TODO: deal with an environment with a MitM proxy?


@stlaz I now get the CA cert from router-ca, with a fallback to the operator's local serviceaccount/ca if the router-ca is unavailable. You mentioned a possible proxy MitM issue I should also take into account. Care to elaborate? Thanks!

pkg/console/operator/health_route.go

benjaminapetersen · 2019-11-15T19:15:38Z

pkg/console/operator/health_route.go

+	routerCA, rcaErr := co.configMapClient.ConfigMaps(api.OpenShiftConsoleNamespace).Get(api.RouterCAConfigMapName, metav1.GetOptions{})
+
+	if rcaErr != nil && apierrors.IsNotFound(rcaErr) {
+		//  using CA ca-bundle.crt from configmap router-ca from openshift-config-managed


On error, this does not log any cert data; it only logs the path (at high enough -v) of the attempt to read the cert. If the read fails, the cert was never read anyway. I think we want to know exactly what went wrong, if it goes wrong, without leaking anything.
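A minimal sketch of the lookup order described in this thread: try the `ca-bundle.crt` key from the router-ca ConfigMap first, then fall back to a CA file on disk, reporting only the path on failure. `loadCA`, the plain `map[string]string` standing in for the ConfigMap data, and the fallback path in `main` are assumptions for illustration, not the code in this PR.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// caBundleKey is the ConfigMap key named in the diff comment above.
const caBundleKey = "ca-bundle.crt"

// loadCA is a hypothetical helper. It builds a cert pool from the ConfigMap
// data when possible and otherwise from the PEM file at fallbackPath.
// The second return value says which source was used.
func loadCA(configMapData map[string]string, fallbackPath string) (*x509.CertPool, string, error) {
	pool := x509.NewCertPool()
	if pem, ok := configMapData[caBundleKey]; ok {
		if pool.AppendCertsFromPEM([]byte(pem)) {
			return pool, "configmap", nil
		}
	}
	pemBytes, err := os.ReadFile(fallbackPath)
	if err != nil {
		// Only the path is included in the error, never certificate data.
		return nil, "", fmt.Errorf("reading fallback CA %q: %w", fallbackPath, err)
	}
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, "", errors.New("fallback CA file contained no usable certificates")
	}
	return pool, "file", nil
}

func main() {
	// Assumed path: the conventional in-pod serviceaccount CA location.
	_, source, err := loadCA(map[string]string{}, "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("loaded CA from", source)
}
```

In the operator the ConfigMap data would come from the Kubernetes client rather than a literal map; the sketch leaves that out to stay self-contained.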

benjaminapetersen · 2019-11-15T19:18:31Z

No longer getting this issue with the CA wired up:

  - lastTransitionTime: "2019-11-04T20:21:52Z"
    message: 'RouteHealthDegraded: failed to GET route: Head https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health:
      x509: certificate signed by unknown authority'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

Flow now may report:

  - lastTransitionTime: "2019-11-15T18:48:57Z"
    message: 'RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health
      returns ''503 Service Unavailable'''

then:

  - lastTransitionTime: "2019-11-15T18:48:57Z"
    message: 'RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.bpetersen.devcluster.openshift.com/health
      returns ''503 Service Unavailable'''
    reason: RouteHealthDegradedStatusError

until finally resolved when the endpoint responds:

status:
  conditions:
  - lastTransitionTime: "2019-11-15T19:04:20Z"
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2019-11-15T19:04:18Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-11-15T19:04:18Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-11-15T16:15:43Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
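The progression above can be sketched as a small probe that maps the outcome of a request to a condition reason. The reason strings are taken from the status output in this comment; `routeHealthReason` and `reasonForStatus` are hypothetical helpers, and a local test server stands in for the console route.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// routeHealthReason sends a HEAD request to url and returns a condition
// reason plus a human-readable message. Hypothetical helper for illustration.
func routeHealthReason(client *http.Client, url string) (reason, message string) {
	resp, err := client.Head(url)
	if err != nil {
		return "RouteHealthDegradedFailedGet", fmt.Sprintf("failed to GET route: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "RouteHealthDegradedStatusError",
			fmt.Sprintf("route not yet available, %s returns '%s'", url, resp.Status)
	}
	return "AsExpected", ""
}

// reasonForStatus runs the probe against a throwaway local server that
// always answers with the given status code.
func reasonForStatus(code int) string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
	defer srv.Close()
	reason, _ := routeHealthReason(&http.Client{Timeout: 5 * time.Second}, srv.URL+"/health")
	return reason
}

func main() {
	fmt.Println(reasonForStatus(http.StatusServiceUnavailable))
	fmt.Println(reasonForStatus(http.StatusOK))
}
```

Keeping the transport failure and the non-200 response as separate reasons is what makes the x509 case distinguishable from the "route not yet available" case in the condition message.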

benjaminapetersen · 2019-11-15T19:53:39Z

pkg/console/operator/health_route.go

+}
+
+func (co *consoleOperator) getCA() (*x509.CertPool, error) {
+	caCertPool := x509.NewCertPool()


Possibly I should update to:

// start with system, fallback to new?
rootCAs, _ := x509.SystemCertPool()
if rootCAs == nil {
	rootCAs = x509.NewCertPool()
}
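Spelled out as a runnable sketch, the idea is to start from the system trust pool and fall back to an empty pool if it cannot be loaded, then append any extra PEM bundle on top. `rootsOrEmpty` and `appendPEM` are hypothetical names chosen for this example.

```go
package main

import (
	"crypto/x509"
	"fmt"
)

// rootsOrEmpty returns the system trust pool, or a fresh empty pool when the
// system pool is unavailable on this platform.
func rootsOrEmpty() *x509.CertPool {
	rootCAs, err := x509.SystemCertPool()
	if err != nil || rootCAs == nil {
		rootCAs = x509.NewCertPool()
	}
	return rootCAs
}

// appendPEM adds PEM-encoded certificates to pool and reports whether at
// least one certificate was parsed successfully.
func appendPEM(pool *x509.CertPool, pemBytes []byte) bool {
	return pool.AppendCertsFromPEM(pemBytes)
}

func main() {
	pool := rootsOrEmpty()
	// Placeholder input: not a real certificate, so nothing is appended.
	ok := appendPEM(pool, []byte("not a certificate"))
	fmt.Println("extra certificates appended:", ok)
}
```

Checking the returned error as well as the nil pool avoids silently discarding a failure to load the system roots.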

benjaminapetersen · 2019-11-15T19:55:44Z

pkg/console/operator/health_route.go

+func clientWithCA(caPool *x509.CertPool) *http.Client {
+	return &http.Client{
+		Timeout: 5 * time.Second,
+		Transport: &http.Transport{


Possibly should use:

transport := http.DefaultTransport

You're using your own roots whereas the above transport only uses the default system trust bundle, so you will want to use your own transport.
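One way to reconcile the two points above is to clone the default transport, which keeps its proxy and timeout defaults, and then set custom roots on the copy. This is a sketch under the assumption that `http.DefaultTransport` has not been replaced with another type and that Go 1.13 or later is in use (for `Transport.Clone`); it is not the code in this PR.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"time"
)

// clientWithCA returns a client whose transport is a private copy of the
// default transport, trusting only the certificates in caPool.
// A nil caPool makes crypto/tls fall back to the host's root CA set.
func clientWithCA(caPool *x509.CertPool) *http.Client {
	// Clone keeps defaults such as ProxyFromEnvironment and dial timeouts
	// without mutating the shared http.DefaultTransport.
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{RootCAs: caPool}
	return &http.Client{
		Timeout:   5 * time.Second,
		Transport: transport,
	}
}

func main() {
	client := clientWithCA(x509.NewCertPool())
	fmt.Println("client timeout:", client.Timeout)
}
```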

benjaminapetersen · 2019-11-15T21:26:39Z

Also needs #334 to eliminate the Removed test flake.

openshift-ci-robot · 2019-11-18T23:27:30Z

@benjaminapetersen: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/verify | `a75d774` | link | `/test verify` |
| ci/prow/unit | `a75d774` | link | `/test unit` |
| ci/prow/e2e-gcp | `a75d774` | link | `/test e2e-gcp` |
| ci/prow/e2e-aws-console | `a75d774` | link | `/test e2e-aws-console` |
| ci/prow/e2e-aws-operator | `a75d774` | link | `/test e2e-aws-operator` |
| ci/prow/e2e-gcp-upgrade | `a75d774` | link | `/test e2e-gcp-upgrade` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


stlaz · 2019-11-19T07:37:54Z

pkg/console/operator/health_route.go

+	}
+}
+
+func appendProxyCA(caCertPool *x509.CertPool, systemCABundle []byte) *x509.CertPool {


If proxy CA is configured, it should already appear in the system CA bundle - the location you're copying it to is where golang looks when you call x509.SystemCertPool()

ah, nvm, it was just the name of the function that confused me, I see what you did there

stlaz · 2019-11-19T11:38:41Z

pkg/console/operator/health_route.go

+}
+
+func (co *consoleOperator) getCA() (*x509.CertPool, error) {
+	// TODO: should I update to this? start with the SystemCertPool?


yes, this looks better IMO, and SystemCertPool() should be fresh since you're doing --terminate-on-files when starting your operator

stlaz · 2019-11-19T11:50:11Z

pkg/console/operator/health_route.go

+
+	// fallback to our local ca in from our serviceaccount
+	// if we hit this path, are we likely to get self signed certs errors?
+	serviceAccountCAbytes, err := ioutil.ReadFile(api.OAuthEndpointCAFilePath)


This CA bundle actually does contain the router-ca, so you might just stick with it instead of getting the CA from the config map above.

stlaz · 2019-11-19T11:58:25Z

test/e2e/metrics_test.go

 	// that are missing from the system trust roots
 	rootCAs.AppendCertsFromPEM(config.CAData)

+	// TODO: could get router-ca ConfigMap we sync'd from openshift-config-managed


Not really, you're using a passthrough route to access the /metrics endpoint from an external endpoint, but the cert that's used for serving directly at the pod which terminates the TLS connection is not valid for the hostname of the route (and is actually signed by service-ca AFAIK).

So even if you had a valid CA that signed the server cert here, the cert verification would fail because of the hostname. Yet you need passthrough route to be able to use client cert auth. Another solution would be to grab a token for a service account that's allowed to access this endpoint, in which case the passthrough aspect of the route would not be necessary.
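A rough sketch of the token-based alternative mentioned above: the request carries a service account token in an `Authorization: Bearer` header instead of a client certificate. `newMetricsRequest`, the URL, and the token value are placeholders for illustration; how the token is obtained and which account is authorized are outside this sketch.

```go
package main

import (
	"fmt"
	"net/http"
)

// newMetricsRequest builds a GET request that authenticates with a bearer
// token. Hypothetical helper; url and token are supplied by the caller.
func newMetricsRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// Placeholder URL and token, not real cluster values.
	req, err := newMetricsRequest("https://metrics.example.invalid/metrics", "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
}
```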

jhadvig · 2020-04-20T08:02:52Z

Closing in favor of #399

openshift-ci-robot assigned jhadvig and spadgett Nov 1, 2019

openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 1, 2019

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2019

openshift-ci-robot requested review from jhadvig and spadgett November 1, 2019 14:43

benjaminapetersen force-pushed the status/route-healthy branch from d06d691 to 16d894e Compare November 1, 2019 16:37

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 1, 2019

jhadvig requested changes Nov 4, 2019

benjaminapetersen force-pushed the status/route-healthy branch 2 times, most recently from fdf5aa2 to 0466b2b Compare November 12, 2019 15:19

benjaminapetersen force-pushed the status/route-healthy branch from 0466b2b to 6e26a31 Compare November 13, 2019 17:08

stlaz reviewed Nov 15, 2019

Add status handling for RouteHealth

dcf0bf4

benjaminapetersen force-pushed the status/route-healthy branch from 6e26a31 to dcf0bf4 Compare November 15, 2019 19:11

benjaminapetersen mentioned this pull request Nov 15, 2019

[WIP] Route Sync Controller #350

Closed

benjaminapetersen added 4 commits November 18, 2019 10:55

TODO: comments

ee52c55

Add proxy ca-bundle to operator via annotation

bdbeb6d

Change entrypoint in operator.yaml

683874b

Append proxy CA to do route health check

a75d774

stlaz reviewed Nov 19, 2019

benjaminapetersen mentioned this pull request Jan 16, 2020

[WIP] Add /health/oauthconnect endpoint openshift/console#3977

Closed

1 task

jhadvig mentioned this pull request Mar 23, 2020

Status handling for RouteHealth #399

Merged

jhadvig closed this Apr 20, 2020
