add Telemeter client #103
Conversation
e8f90d6 to 34d24d5 (Compare)
@squat could you elaborate on what is still work-in-progress here? Just looking at the code I would say tollbooth token handling, could you just confirm that? 🙂
@brancz tollbooth token handling is a big one. I am waiting for input from the installer team for details on obtaining the token components. The second piece is the generation of the actual telemeter client manifests to deploy with the CMO. I am currently writing jsonnet in the telemeter repo to generate the manifests and will then import that dependency in this repo.
Once openshift/telemeter#25 is in, this PR can vendor the Telemeter client jsonnet and render the manifests. The outstanding work is still identifying the source of the tollbooth authentication token and cluster ID. |
34d24d5 to 81493ba (Compare)
610f6ee to f1b1a00 (Compare)
just one comment otherwise lgtm
- --from=https://prometheus-k8s.openshift-monitoring.svc:9091
- --from-ca-file=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
- --from-token-file=/var/run/secrets/kubernetes.io/serviceaccount/token
- --to=
this should have a value no? (either here or in the code below, but I don't see it in either)
this is what I was referring to in the PR description when I spoke about follow-ups. We need to determine the production URL for the Telemeter server; once we have that, we'll configure it in the telemeter jsonnet and regenerate the manifests here. Does that sound good?
Ack. My feeling is that this will need to be configurable in some way (maybe a flag on the cluster-monitoring-operator?).
Yes, this sounds completely reasonable. We need to be able to modify CMO deployments for testing, production, etc., and be able to report to different Telemeter servers, either via a flag or another field in the CMO ConfigMap. I'll plumb the URL into an environment variable.
ack 👍
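For illustration, here is a minimal Go sketch of the environment-variable approach discussed in this thread; the variable name TELEMETER_SERVER_URL, the helper name, and the placeholder default are assumptions, not what this PR implements:

```go
package main

import (
	"log"
	"os"
)

// telemeterServerURL returns the Telemeter server URL the client should
// report to. The environment variable name and the fallback value below are
// illustrative assumptions, not the PR's final wiring.
func telemeterServerURL() string {
	if url := os.Getenv("TELEMETER_SERVER_URL"); url != "" {
		return url
	}
	// Placeholder until the production Telemeter server URL is known.
	return "https://telemeter.example.invalid"
}

func main() {
	log.Printf("reporting to %s", telemeterServerURL())
}
```

A flag or a ConfigMap field would follow the same shape: the externally supplied value simply overrides the baked-in default.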
pkg/manifests/manifests.go
Outdated
}

s.StringData["id"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.ClusterID))
s.StringData["token"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.PullSecret))
👍 this resolves the secret mystery, thanks! :-)
I wonder if we should rename this to pullSecret on the telemeter side as well. On second thought, this is fine-ish, as this token is pretty much opaque to telemeter.
If you just set Data to the byte slice of what you want, then the marshaling will automatically encode it as base64. I generally prefer that, as I always forget which base64 encoding is used :)
TIL!!!
very true, good catch!
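To illustrate the Data vs. StringData point from this thread (a standalone sketch, not code from this PR): Secret.Data holds raw bytes that get base64-encoded during serialization, while StringData holds plain strings that the API server folds into Data, so no manual base64.StdEncoding call is needed.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	s := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "telemeter-client"},
		// Raw bytes: JSON/YAML marshaling base64-encodes these automatically.
		Data: map[string][]byte{
			"id": []byte("example-cluster-id"),
		},
		// Plain strings: the API server merges these into Data on write.
		StringData: map[string]string{
			"token": "example-pull-secret",
		},
	}
	out, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(out)) // "id" appears base64-encoded in the output.
}
```

Dropping the explicit base64.StdEncoding.EncodeToString calls also avoids double-encoding the stored values.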
LGTM from my side so far!
return manifests.NewDefaultConfig()
}

cmap, err = o.client.KubernetesInterface().CoreV1().ConfigMaps("kube-system").Get("cluster-config-v1", metav1.GetOptions{})
I feel we should make people aware that a ConfigMap might not be the best place to store the pull secret?!
Agreed
Absolutely, we should raise this in openshift/installer
  record: build_error_rate
- name: kubernetes-absent
  rules:
  - alert: AlertmanagerDown
This doesn’t look right. I think we accidentally didn’t append the telemetry Job but instead overwrote the whole array.
This is what I was thinking. I’ll fix this in the telemeter repo
f1b1a00 to b20e672 (Compare)
needs another bump after openshift/telemeter#29 merges
0b57ba9 to e295aef (Compare)
e295aef to 949d4a1 (Compare)
e38ef2e to 381f396 (Compare)
381f396 to 22d9fc7 (Compare)
| s.Data["id"] = []byte(f.config.TelemeterClientConfig.ClusterID) | ||
| s.Data["to"] = []byte(f.config.TelemeterClientConfig.TelemeterServerURL) | ||
| s.Data["token"] = []byte(f.config.TelemeterClientConfig.Token) |
👌
      Auth string `json:"auth"`
    } `json:"cloud.openshift.com"`
  } `json:"auths"`
}{}
/cc @smarterclayton this is the current parsing logic for the pull secret
ok, can we make sure tollbooth starts generating the secret in this form asap?
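For context, a self-contained sketch of decoding a pull secret of the form {"auths":{"cloud.openshift.com":{"auth":"…"}}} with an anonymous struct like the fragment above; the helper name and example input are hypothetical, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// parseCloudAuth extracts the "cloud.openshift.com" auth token from a pull
// secret document. The helper name is hypothetical.
func parseCloudAuth(pullSecret []byte) (string, error) {
	d := struct {
		Auths struct {
			CloudOpenshiftCom struct {
				Auth string `json:"auth"`
			} `json:"cloud.openshift.com"`
		} `json:"auths"`
	}{}
	if err := json.Unmarshal(pullSecret, &d); err != nil {
		return "", err
	}
	return d.Auths.CloudOpenshiftCom.Auth, nil
}

func main() {
	// Example input only; a real pull secret comes from the installer.
	example := []byte(`{"auths":{"cloud.openshift.com":{"auth":"dGVzdC10b2tlbg=="}}}`)
	auth, err := parseCloudAuth(example)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(auth)
}
```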
@s-urbaniak: GitHub didn't allow me to request PR reviews from the following users: the, for, pull, secret, is, this, current, parsing, logic. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: s-urbaniak, squat. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
looks super fine to me 🎉
Following the pattern we've used in CreateOrUpdateDeployment since 675a7ed (pkg/*: add k8s functions for Telemeter client, 2018-09-21, openshift#103). This saves some network traffic and Kubernetes API service load.

There's a bit of dancing as I copy Status (which is irrelevant for this use-case) and ObjectMeta (except for merged annotations and labels) over from 'existing' to 'required', but that sets up a convenient DeepEqual for "did anything we care about change?". It also makes it easier to see from the Kube API server logs when the Prometheus resources are actually being updated, while before this commit:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1685027676433158144/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"prometheuses"' kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort

would find traffic like:

2023-07-28T21:10:30.455712Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:11:39.629004Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:11:58.727870Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:13:24.616877Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:13:43.859596Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:14:51.770214Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2023-07-28T21:15:10.524179Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
...

I've also added diff logging, using an already-vendored library, to make it easier to understand why the operator feels the need to update the resource.

The go.mod update was generated with:

$ go mod tidy

using:

$ go version
go version go1.19.5 linux/amd64

now that we're directly using the already-vendored package.
This pull request adds the Telemeter client to the Cluster Monitoring Operator stack.
One of the key details to note about the Telemeter client is that its deployment requires a secret containing the cluster's pull secret and ID. These values are only available for 4.x clusters created with github.com/openshift/installer. That is, trying to run the CMO on a non-4.x cluster will result in the deployment of a Telemeter client with an invalid secret, meaning it will not be able to authenticate against the Telemeter server. Specifically, the cluster ID and pull secret can be found in a ConfigMap in the
kube-system namespace named cluster-config-v1, e.g. [0]. Once the URL for the production Telemeter server is known, this pull request must be followed by a PR to the github.com/openshift/telemeter repo to set the default Telemeter server URL and another PR to this repo to bump the Telemeter client jsonnet dependency and regenerate the manifests.
cc @brancz @s-urbaniak
[0] https://github.com/openshift/installer/blob/master/installer/pkg/config-generator/fixtures/kube-system.yaml