add Telemeter client #103
Conversation
e8f90d6 to 34d24d5 (Compare)
@squat could you elaborate on what is still work-in-progress here? Just looking at the code I would say tollbooth token handling, could you just confirm that? 🙂
@brancz tollbooth token handling is a big one. I am waiting for input from the installer team for details on obtaining the token components. The second piece is the generation of the actual telemeter client manifests to deploy with the CMO. I am currently writing jsonnet in the telemeter repo to generate the manifests and will then import that dependency in this repo.
Once openshift/telemeter#25 is in, this PR can vendor the Telemeter client jsonnet and render the manifests. The outstanding work is still identifying the source of the tollbooth authentication token and cluster ID. |
34d24d5 to 81493ba (Compare)
610f6ee to f1b1a00 (Compare)
just one comment otherwise lgtm
- --from=https://prometheus-k8s.openshift-monitoring.svc:9091
- --from-ca-file=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
- --from-token-file=/var/run/secrets/kubernetes.io/serviceaccount/token
- --to=
this should have a value no? (either here or in the code below, but I don't see it in either)
this is what I was referring to in the PR description when I spoke about follow-ups. We need to determine the production URL for the Telemeter server; once we have that, we'll configure it in the telemeter jsonnet and regenerate the manifests here. Does that sound good?
Ack. My feeling is that this will need to be configurable in some way (maybe a flag on the cluster-monitoring-operator?).
Yes, this sounds completely reasonable. We need to be able to modify CMO deployments for testing, production, etc., and be able to report to different Telemeter servers, either via a flag or another field in the CMO ConfigMap. I'll plumb the URL into an environment variable.
ack 👍
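For illustration, here is a minimal Go sketch of the environment-variable approach discussed in this thread; the variable name TELEMETER_SERVER_URL, the helper name, and the placeholder default are assumptions, not what this PR implements:

```go
package main

import (
	"log"
	"os"
)

// telemeterServerURL returns the Telemeter server URL the client should
// report to. The environment variable name and the fallback value below are
// illustrative assumptions, not the PR's final wiring.
func telemeterServerURL() string {
	if url := os.Getenv("TELEMETER_SERVER_URL"); url != "" {
		return url
	}
	// Placeholder until the production Telemeter server URL is known.
	return "https://telemeter.example.invalid"
}

func main() {
	log.Printf("reporting to %s", telemeterServerURL())
}
```

A flag or a ConfigMap field would follow the same shape: the externally supplied value simply overrides the baked-in default.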
pkg/manifests/manifests.go
Outdated
}

s.StringData["id"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.ClusterID))
s.StringData["token"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.PullSecret))
👍 this resolves the secret mystery, thanks! :-)
I wonder if we should rename this to pullSecret on the telemeter side as well. On second thought, this is fine-ish, as this token is pretty much opaque to telemeter.
If you just set Data to the byte slice of what you want, then the marshaling will automatically encode it as base64. I generally prefer that, as I always forget which base64 encoding is used :)
TIL!!!
very true, good catch!
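To illustrate the Data vs. StringData point from this thread (a standalone sketch, not code from this PR): Secret.Data holds raw bytes that get base64-encoded during serialization, while StringData holds plain strings that the API server folds into Data, so no manual base64.StdEncoding call is needed.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	s := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "telemeter-client"},
		// Raw bytes: JSON/YAML marshaling base64-encodes these automatically.
		Data: map[string][]byte{
			"id": []byte("example-cluster-id"),
		},
		// Plain strings: the API server merges these into Data on write.
		StringData: map[string]string{
			"token": "example-pull-secret",
		},
	}
	out, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(out)) // "id" appears base64-encoded in the output.
}
```

Dropping the explicit base64.StdEncoding.EncodeToString calls also avoids double-encoding the stored values.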
LGTM from my side so far!
return manifests.NewDefaultConfig()
}

cmap, err = o.client.KubernetesInterface().CoreV1().ConfigMaps("kube-system").Get("cluster-config-v1", metav1.GetOptions{})
I feel we should make people aware that a ConfigMap might not be the best place to store the pull secret?!
Agreed
Absolutely, we should raise this in openshift/installer
  record: build_error_rate
- name: kubernetes-absent
  rules:
  - alert: AlertmanagerDown
This doesn’t look right. I think we accidentally didn’t append the telemetry Job but instead overwrote the whole array.
This is what I was thinking. I’ll fix this in the telemeter repo
f1b1a00 to b20e672 (Compare)
needs another bump after openshift/telemeter#29 merges
0b57ba9 to e295aef (Compare)
e295aef to 949d4a1 (Compare)
e38ef2e to 381f396 (Compare)
381f396 to 22d9fc7 (Compare)
| s.Data["id"] = []byte(f.config.TelemeterClientConfig.ClusterID) | ||
| s.Data["to"] = []byte(f.config.TelemeterClientConfig.TelemeterServerURL) | ||
| s.Data["token"] = []byte(f.config.TelemeterClientConfig.Token) |
👌
      Auth string `json:"auth"`
    } `json:"cloud.openshift.com"`
  } `json:"auths"`
}{}
/cc @smarterclayton this is the current parsing logic for the pull secret
ok, can we make sure tollbooth starts generating the secret in this form asap?
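For context, a self-contained sketch of decoding a pull secret of the form {"auths":{"cloud.openshift.com":{"auth":"…"}}} with an anonymous struct like the fragment above; the helper name and example input are hypothetical, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// parseCloudAuth extracts the "cloud.openshift.com" auth token from a pull
// secret document. The helper name is hypothetical.
func parseCloudAuth(pullSecret []byte) (string, error) {
	d := struct {
		Auths struct {
			CloudOpenshiftCom struct {
				Auth string `json:"auth"`
			} `json:"cloud.openshift.com"`
		} `json:"auths"`
	}{}
	if err := json.Unmarshal(pullSecret, &d); err != nil {
		return "", err
	}
	return d.Auths.CloudOpenshiftCom.Auth, nil
}

func main() {
	// Example input only; a real pull secret comes from the installer.
	example := []byte(`{"auths":{"cloud.openshift.com":{"auth":"dGVzdC10b2tlbg=="}}}`)
	auth, err := parseCloudAuth(example)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(auth)
}
```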
@s-urbaniak: GitHub didn't allow me to request PR reviews from the following users: the, for, pull, secret, is, this, current, parsing, logic. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: s-urbaniak, squat. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
looks super fine to me 🎉
Following the pattern we've used in CreateOrUpdateDeployment since 675a7ed (pkg/*: add k8s functions for Telemeter client, 2018-09-21, openshift#103). This saves some network traffic and Kubernetes API service load.

There's a bit of dancing as I copy Status (which is irrelevant for this use-case) and ObjectMeta (except for merged annotations and labels) over from 'existing' to 'required', but that sets up a convenient DeepEqual for "did anything we care about change?". It also makes it easier to see from the Kube API server logs when the Prometheus resources are actually being updated, while before this commit:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1685027676433158144/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"prometheuses"' kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort

would find traffic like:

2023-07-28T21:10:30.455712Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:11:39.629004Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:11:58.727870Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:13:24.616877Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:13:43.859596Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:14:51.770214Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:15:10.524179Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
...

I've also added diff logging, using an already-vendored library, to make it easier to understand why the operator feels the need to update the resource.

The go.mod update was generated with:

$ go mod tidy

using:

$ go version
go version go1.19.5 linux/amd64

now that we're directly using the already-vendored package.
This pull request adds the Telemeter client to the Cluster Monitoring Operator stack.
One of the key details to note about the Telemeter client is that its deployment requires a secret containing the cluster's pull secret and ID. These values are only available for 4.x clusters created with github.com/openshift/installer. That is, trying to run the CMO on a non-4.x cluster will result in the deployment of a Telemeter client with an invalid secret, meaning it will not be able to authenticate against the Telemeter server. Specifically, the cluster ID and pull secret can be found in a ConfigMap in the
kube-system namespace named cluster-config-v1, e.g. [0]. Once the URL for the production Telemeter server is known, this pull request must be followed by a PR to the github.com/openshift/telemeter repo to set the default Telemeter server URL and another PR to this repo to bump the Telemeter client jsonnet dependency and regenerate the manifests.
cc @brancz @s-urbaniak
[0] https://github.com/openshift/installer/blob/master/installer/pkg/config-generator/fixtures/kube-system.yaml