
Update console for monitoring changes #909

Merged

Conversation

spadgett (Member) commented Dec 6, 2018

  • Proxy to port 9092, which has the tenancy proxy in front of it
  • Remove the CAN_LIST_NS check since users can now see metrics in their own namespaces

https://jira.coreos.com/browse/CONSOLE-1035

@kyoto @brancz

/hold

spadgett (Member, Author) commented Dec 6, 2018

Currently blocked because /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt is not there, and the console uses that CA to proxy to the Prometheus service.

sh-4.2$ cd /var/run/secrets/kubernetes.io/serviceaccount/
sh-4.2$ ls
ca.crt  namespace  token

brancz commented Dec 6, 2018

The serving certs controller works a bit differently now. The service-ca.crt is no longer mounted into the serviceaccount secrets. Instead, you need to create a configmap with a special annotation; the serving certs controller will then populate that configmap with the ca.crt automatically.
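For reference, a minimal sketch of that flow, assuming the inject-cabundle annotation used by the serving certs controller; the ConfigMap name, namespace, and annotation key here are illustrative and should be checked against the controller you are actually running:

# Create an empty ConfigMap and annotate it so the serving certs controller
# injects the service CA bundle (annotation key assumed; verify it).
kubectl -n openshift-console create configmap service-ca
kubectl -n openshift-console annotate configmap service-ca \
  service.alpha.openshift.io/inject-cabundle=true
# After the controller reconciles, the CA should show up under the
# service-ca.crt data key:
kubectl -n openshift-console get configmap service-ca \
  -o jsonpath='{.data.service-ca\.crt}' | head -n 3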

spadgett (Member, Author) commented Dec 10, 2018

@brancz I'm having trouble connecting to 9092 from inside the pod. Any ideas? 9091 is fine. This is a 4.0 install.

sh-4.2$ curl -k https://prometheus-k8s.openshift-monitoring.svc:9092
curl: (7) Failed connect to prometheus-k8s.openshift-monitoring.svc:9092; No route to host

s-urbaniak (Contributor)

@spadgett the error No route to host implies that there is a problem with DNS in your pod. Can you please verify that you can reach other services from inside the pod?
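For example, a quick check from inside the console pod along these lines (the comparison targets are just examples):

# Resolve the service name, then try the known-good port and another service:
getent hosts prometheus-k8s.openshift-monitoring.svc
curl -sk -o /dev/null -w '%{http_code}\n' https://prometheus-k8s.openshift-monitoring.svc:9091/
curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/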

spadgett (Member, Author)

@s-urbaniak I thought so, too, but I can reach the same service on port 9091. Just not 9092. (The service does have a port 9092 defined.)

s-urbaniak (Contributor)

@spadgett you might be right 🤔 When I look at the service I see it has a tenancy target port: https://github.com/openshift/cluster-monitoring-operator/blob/1322e56e961511994a4a1a5ef55152d3b389575c/assets/prometheus-k8s/service.yaml#L17

But that port is not declared in the container: https://github.com/openshift/cluster-monitoring-operator/blob/1322e56e961511994a4a1a5ef55152d3b389575c/assets/prometheus-k8s/prometheus.yaml#L67-L77

cc @metalmatze for verification
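One way to compare the two against a live cluster (the pod label selector below is assumed from prometheus-operator conventions):

# Service port name -> targetPort mapping:
kubectl -n openshift-monitoring get svc prometheus-k8s \
  -o jsonpath='{range .spec.ports[*]}{.name}{" -> "}{.targetPort}{"\n"}{end}'
# Named container ports actually declared in the pod:
kubectl -n openshift-monitoring get pod -l app=prometheus,prometheus=k8s \
  -o jsonpath='{range .items[0].spec.containers[*]}{.name}{": "}{.ports[*].name}{"\n"}{end}'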

s-urbaniak (Contributor)

@spadgett good catch. We verified it's the missing port name; we're submitting a fix to the cluster monitoring operator as we speak.

s-urbaniak (Contributor)

@spadgett once openshift/cluster-monitoring-operator#183 is merged and the images are rebuilt, you can retry :-)

spadgett (Member, Author)

@s-urbaniak great, thank you!

spadgett (Member, Author)

I'm still struggling to get port 9092 working. 9091 seems to work. This is from inside the console pod.

sh-4.2$ curl -k -H 'Authorization: Bearer <redacted>' 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=count(increase(kube_pod_container_status_restarts_total%7Bnamespace%3D%22openshift-console%22%7D%5B1h%5D)%20%3E%205%20)'
{"status":"success","data":{"resultType":"vector","result":[]}}
sh-4.2$ curl -k -H 'Authorization: Bearer <redacted>' 'https://prometheus-k8s.openshift-monitoring.svc:9092/api/v1/query?query=count(increase(kube_pod_container_status_restarts_total%7Bnamespace%3D%22openshift-console%22%7D%5B1h%5D)%20%3E%205%20)'
Bad Request. The request or configuration is malformed.

The decoded query here is

count(increase(kube_pod_container_status_restarts_total{namespace="openshift-console"}[1h]) > 5 )

spadgett (Member, Author)

I pulled out the service-ca.crt changes into a separate PR, #960. It will require console operator changes as well to pass the service-ca file path on startup.

metalmatze (Contributor)

It seems like the request still can't get through the kube-rbac-proxy:

https://github.com/brancz/kube-rbac-proxy/blob/7a8722d50ffc5928ca0d21091040c6758244dd6c/pkg/proxy/proxy.go#L79

I would like to help you debug this. Is there any easy way to work on this? Is a normal OpenShift 4 cluster enough? Do I need a patched version?

spadgett (Member, Author)

Sorry, I meant to get back to you on this.

This is a normal OpenShift 4.0 cluster installed from the 0.7.0 installer. I'm logged in as the kube:admin user. Let me know if there's anything I can do to help debug.

s-urbaniak (Contributor)

@spadgett @metalmatze I spent a debug session on this and found that the only missing piece was the namespace URL parameter.

Given the kube-rbac-proxy configuration in OpenShift (check with kubectl -n openshift-monitoring get secret kube-rbac-proxy -o jsonpath='{.data.config\.yaml}' | base64 -d; echo), it expects a namespace URL query parameter, according to https://github.com/brancz/kube-rbac-proxy/blob/69cfb74e7e3b373602d2295d6175bcccd48da85c/pkg/proxy/proxy.go#L141.

If I take your request from above and just modify it slightly, adding a namespace=openshift-console URL query parameter, it now works as expected:

curl -k -H "Authorization: Bearer <REDACTED>" "https://prometheus-k8s.openshift-monitoring.svc:9092/api/v1/query?namespace=openshift-console&query=count(increase(kube_pod_container_status_restarts_total%7Bnamespace%3D%22openshift-console%22%7D%5B1h%5D)%20%3E%205%20)"; echo
{"status":"success","data":{"resultType":"vector","result":[]}}

metalmatze (Contributor)

Thank you for looking into it!
I guess that the console backend needs to add the namespace query parameter to its requests. It should be enforced there and not in the frontend.

spadgett (Member, Author) commented Jan 3, 2019

@s-urbaniak Thanks, I see now. I didn't realize namespace was a separate parameter even when it's also specified in the query. I'll make the updates.

metalmatze (Contributor)

No worries. That separate namespace parameter is what enforces the namespace inside the PromQL query that will be executed against Prometheus. The label proxy overrides every {namespace="foobar"} selector with the namespace given in that query parameter, which is why it has to be enforced in the backend. ☺️
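A rough illustration of that behavior (token redacted, result depends on the cluster): even though the PromQL selector below names a different namespace, the proxy should rewrite it to the enforced one, so only openshift-console data can come back.

curl -k -H "Authorization: Bearer <redacted>" \
  "https://prometheus-k8s.openshift-monitoring.svc:9092/api/v1/query?namespace=openshift-console&query=up%7Bnamespace%3D%22default%22%7D"; echo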

spadgett (Member, Author) commented Jan 3, 2019

For queries that should NOT have a namespace, do we need to use service port 9091 instead? For example:

https://github.com/openshift/console/blob/master/frontend/public/components/cluster-overview.jsx#L49

brancz commented Jan 7, 2019

Yes, for cluster-wide metrics the existing port should continue to be used.
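In other words, something like this split (token redacted, example queries chosen arbitrarily): cluster-wide queries keep using 9091, per-namespace queries go to 9092 with the namespace parameter.

# Cluster-wide query (no tenancy enforcement; needs broader RBAC):
curl -k -H "Authorization: Bearer <redacted>" \
  "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=sum(up)"
# Namespace-scoped query through the tenancy proxy:
curl -k -H "Authorization: Bearer <redacted>" \
  "https://prometheus-k8s.openshift-monitoring.svc:9092/api/v1/query?namespace=openshift-console&query=up"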

spadgett (Member, Author) commented Jan 7, 2019

I'm seeing certificate errors connecting to port 9092 using service-ca.crt. It works OK for 9091.

2019/01/7 18:26:58 http: proxy error: x509: certificate is not valid for any names, but wanted to match prometheus-k8s.openshift-monitoring.svc

sh-4.2$ curl --cacert /var/service-ca/service-ca.crt https://prometheus-k8s.openshift-monitoring:9092/
curl: (60) Certificate type not approved for application.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.

spadgett (Member, Author) commented Jan 7, 2019

Sorry, wrong hostname above, but I still see the error using the correct host. (No error using port 9091.)

sh-4.2$ curl --cacert /var/service-ca/service-ca.crt https://prometheus-k8s.openshift-monitoring.svc:9092/
curl: (60) Certificate type not approved for application.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.

spadgett (Member, Author) commented Jan 7, 2019

Yeah, I don't see the kube-rbac-proxy container doing anything with the service serving certificate. I opened issue MON-511.
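One way to confirm that from inside the pod is to compare the certificates the two ports present (assuming openssl is available in the image):

echo | openssl s_client -connect prometheus-k8s.openshift-monitoring.svc:9091 2>/dev/null \
  | openssl x509 -noout -subject -issuer
echo | openssl s_client -connect prometheus-k8s.openshift-monitoring.svc:9092 2>/dev/null \
  | openssl x509 -noout -subject -issuer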

brancz commented Jan 8, 2019

Yes, you're right. We'll try to fix that as soon as we can! Sorry for the inconvenience.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 16, 2019
spadgett (Member, Author)

The proxy is working for me now 👍

spadgett (Member, Author)

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019
spadgett (Member, Author)

jenkins rebuild

1 similar comment
spadgett (Member, Author)

jenkins rebuild

@spadgett spadgett changed the title [WIP] Update console for monitoring changes Update console for monitoring changes Jan 25, 2019
spadgett (Member, Author)

/assign @kyoto
/hold cancel

@openshift-ci-robot openshift-ci-robot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 25, 2019
spadgett (Member, Author)

Stale element flake

/retest

1) Interacting with the etcd OCS : displays metadata about the created `EtcdBackup` in its "Overview" section
   StaleElementReferenceError: stale element reference: element is not attached to the page document

spadgett (Member, Author)

OLM test flake

/retest

1) Interacting with the etcd OCS : creates etcd Operator `Deployment`
   Error: Timeout - Async callback was not invoked within timeout specified by jasmine.DEFAULT_TIMEOUT_INTERVAL.

spadgett (Member, Author)

Tests are green, and I've been able to validate that metrics work for a normal user by patching kubeapiserveroperatorconfig (thanks @kyoto for the tip).

{
name: 'Used',
<Line title="Memory Usage" namespace={ns.metadata.name} query={[
{ name: 'Used',
Member (review comment on the diff above)

Nit: Was this newline accidentally removed?

spadgett (Member, Author) replied

Thanks, fixed

// In OpenShift, the user must be able to list namespaces to query Prometheus.
return canListNS;
};
const canAccessPrometheus = (prometheusFlag) => prometheusFlag && !!prometheusBasePath && !!prometheusTenancyBasePath;
Member (review comment on the diff above)

Just to confirm, this will prevent all requirePrometheus() wrapped components from rendering unless both prometheusBasePath and prometheusTenancyBasePath are set. Is that what we want?

spadgett (Member, Author) replied Jan 27, 2019

Yeah, I was trying to keep the client logic simple. Currently, the console server sets both values together, so if one is set the other will always be set, too. We assume the RBAC proxy will always be there in OpenShift 4.0. Let me know if that's not the case.

kyoto (Member) commented Jan 27, 2019

@spadgett LGTM apart from one nit and one question.

brancz commented Jan 28, 2019

The RBAC proxy is always there in 4.0. It's part of the Prometheus pod, which also answers the non-tenancy requests, so this sounds good to me.

kyoto (Member) commented Jan 28, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 28, 2019
@openshift-merge-robot openshift-merge-robot merged commit 21d0a9d into openshift:master Jan 28, 2019
@spadgett spadgett deleted the update-monitoring-proxy branch January 28, 2019 12:14