
Conversation

@squat (Contributor) commented Sep 21, 2018

This pull request adds the Telemeter client to the Cluster Monitoring Operator stack.

A key detail about the Telemeter client is that its deployment requires a secret containing the cluster's pull secret and ID. These values are only available for 4.x clusters created with github.com/openshift/installer; running the CMO on a non-4.x cluster will therefore deploy a Telemeter client with an invalid secret, and it will not be able to authenticate against the Telemeter server. Specifically, the cluster ID and pull secret can be found in a ConfigMap named cluster-config-v1 in the kube-system namespace, e.g. [0].

Once the URL for the production Telemeter server is known, this pull request must be followed by a PR to the github.com/openshift/telemeter repo to set the default Telemeter server URL and another PR to this repo to bump the Telemeter client jsonnet dependency and regenerate the manifests.

cc @brancz @s-urbaniak

[0] https://github.com/openshift/installer/blob/master/installer/pkg/config-generator/fixtures/kube-system.yaml
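
For illustration, here is a rough client-go sketch of reading that ConfigMap. This is a hedged sketch, not the code in this PR, and the data keys inside cluster-config-v1 are installer-specific:

  package main

  import (
      "fmt"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/rest"
  )

  func main() {
      cfg, err := rest.InClusterConfig()
      if err != nil {
          panic(err)
      }
      client, err := kubernetes.NewForConfig(cfg)
      if err != nil {
          panic(err)
      }
      // The installer writes its generated config, including the cluster ID and
      // pull secret, into this ConfigMap; the exact key layout is installer-specific.
      cm, err := client.CoreV1().ConfigMaps("kube-system").Get("cluster-config-v1", metav1.GetOptions{})
      if err != nil {
          panic(err)
      }
      for key := range cm.Data {
          fmt.Println("cluster-config-v1 key:", key)
      }
  }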

@openshift-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Sep 21, 2018
@openshift-ci-robot added the approved (indicates a PR has been approved by an approver from all required OWNERS files) label Sep 21, 2018
@squat force-pushed the telemeter_client branch 2 times, most recently from e8f90d6 to 34d24d5 on September 24, 2018 16:13
@brancz (Contributor) commented Sep 25, 2018

@squat could you elaborate on what is still work-in-progress here? Just looking at the code, I would say tollbooth token handling; could you confirm that? 🙂

@squat (Contributor, Author) commented Sep 25, 2018

@brancz tollbooth token handling is a big one; I am waiting for input from the installer team on how to obtain the token components. The second piece is generating the actual Telemeter client manifests to deploy with the CMO. I am currently writing jsonnet in the telemeter repo to generate the manifests and will then import that dependency in this repo.

@squat (Contributor, Author) commented Sep 25, 2018

Once openshift/telemeter#25 is in, this PR can vendor the Telemeter client jsonnet and render the manifests. The outstanding work is still identifying the source of the tollbooth authentication token and cluster ID.

@openshift-ci-robot added the size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) label and removed the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) label Sep 25, 2018
@squat force-pushed the telemeter_client branch 2 times, most recently from 610f6ee to f1b1a00 on September 26, 2018 14:07
@squat changed the title from "[WIP] pkg/*: add k8s functions for Telemeter client" to "add Telemeter client" Sep 26, 2018
@openshift-ci-robot removed the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) label Sep 26, 2018
@brancz (Contributor) left a comment:

just one comment otherwise lgtm

- --from=https://prometheus-k8s.openshift-monitoring.svc:9091
- --from-ca-file=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
- --from-token-file=/var/run/secrets/kubernetes.io/serviceaccount/token
- --to=
Contributor:

this should have a value no? (either here or in the code below, but I don't see it in either)

Contributor Author:

This is what I was referring to in the PR description when I mentioned follow-ups: we need to determine the production URL for the Telemeter server; once we have that, we'll configure it in the Telemeter jsonnet and regenerate the manifests here. Does that sound good?

Contributor:

Ack. My feeling is that this will need to be configurable in some way (maybe a flag on the cluster-monitoring-operator?).

@squat (Contributor, Author) Sep 26, 2018:

Yes, this sounds completely reasonable. We need to be able to modify CMO deployments for testing, production, etc., and to report to different Telemeter servers, either via a flag or as another field in the CMO ConfigMap. I'll plumb the URL into an environment variable.
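
For example, a rough sketch of that plumbing (the TelemeterServerURL field name matches what this PR ends up using below; the TELEMETER_SERVER_URL variable name is just a placeholder, not the final implementation):

  package main

  import (
      "fmt"
      "os"
  )

  // TelemeterClientConfig mirrors the config field used later in this PR; only
  // the URL plumbing is sketched here.
  type TelemeterClientConfig struct {
      TelemeterServerURL string
  }

  func main() {
      cfg := &TelemeterClientConfig{}
      // TELEMETER_SERVER_URL is a placeholder; the same value could equally come
      // from a CMO flag or a field in the CMO ConfigMap.
      if url := os.Getenv("TELEMETER_SERVER_URL"); url != "" {
          cfg.TelemeterServerURL = url
      }
      fmt.Println("telemeter client will report to:", cfg.TelemeterServerURL)
  }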

Contributor:

ack 👍

}

s.StringData["id"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.ClusterID))
s.StringData["token"] = base64.StdEncoding.EncodeToString([]byte(f.config.TelemeterClientConfig.PullSecret))
Contributor:

👍 this resolves the secret mystery, thanks! :-)

Contributor:

I wonder if we should also rename this to pullSecret on the Telemeter side. On second thought, this is fine-ish, as the token is pretty much opaque to Telemeter.

@brancz (Contributor) Sep 26, 2018:

If you just set Data to the byte slice of what you want, then the marshaling will automatically encode it as base64. I generally prefer that, as I always forget which base64 encoding is used :)
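
A minimal sketch of the distinction (not code from this PR): Data takes raw bytes and is base64-encoded automatically when the object is serialized, while StringData is a write-only convenience field that takes plain strings and is merged into Data by the API server, so neither needs manual base64 encoding.

  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  func main() {
      s := &corev1.Secret{
          ObjectMeta: metav1.ObjectMeta{Name: "telemeter-client", Namespace: "openshift-monitoring"},
          // Raw bytes: the marshaling layer base64-encodes these for the API server.
          Data: map[string][]byte{
              "id":    []byte("example-cluster-id"),  // example value
              "token": []byte("example-pull-secret"), // example value
          },
          // Alternatively, StringData takes plain (NOT pre-base64-encoded) strings
          // and is merged into Data server-side.
      }
      fmt.Println(string(s.Data["id"]))
  }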

Contributor Author:

TIL!!!

Contributor:

very true, good catch!

@s-urbaniak (Contributor):

LGTM from my side so far!

return manifests.NewDefaultConfig()
}

cmap, err = o.client.KubernetesInterface().CoreV1().ConfigMaps("kube-system").Get("cluster-config-v1", metav1.GetOptions{})
Contributor:

I feel we should flag that a ConfigMap might not be the best place to store the pull secret?!

Contributor:

Agreed

Contributor Author:

Absolutely, we should raise this in openshift/installer

record: build_error_rate
- name: kubernetes-absent
rules:
- alert: AlertmanagerDown
Contributor:

This doesn’t look right. I think we accidentally didn’t append the telemetry Job but instead overwrote the whole array.

Contributor Author:

This is what I was thinking. I’ll fix this in the telemeter repo

@squat (Contributor, Author) commented Sep 26, 2018

needs another bump after openshift/telemeter#29 merges

@squat force-pushed the telemeter_client branch 2 times, most recently from 0b57ba9 to e295aef on September 26, 2018 21:27
@squat force-pushed the telemeter_client branch 3 times, most recently from e38ef2e to 381f396 on September 27, 2018 08:13

s.Data["id"] = []byte(f.config.TelemeterClientConfig.ClusterID)
s.Data["to"] = []byte(f.config.TelemeterClientConfig.TelemeterServerURL)
s.Data["token"] = []byte(f.config.TelemeterClientConfig.Token)
Contributor:

👌

Auth string `json:"auth"`
} `json:"cloud.openshift.com"`
} `json:"auths"`
}{}
Contributor:

/cc @smarterclayton this is the current parsing logic for the pull secret

Contributor:

ok, can we make sure tollbooth starts generating the secret in this form asap?
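
For reference, a self-contained sketch of that parsing, assuming a .dockerconfigjson-style pull secret (the values and struct layout here are illustrative, mirroring the shape of the hunk above rather than quoting it):

  package main

  import (
      "encoding/json"
      "fmt"
  )

  func main() {
      // Example pull secret in .dockerconfigjson form (the auth value is fake).
      raw := []byte(`{"auths":{"cloud.openshift.com":{"auth":"ZmFrZS10b2tlbg=="}}}`)

      // Anonymous struct mirroring the hunk above; only the cloud.openshift.com
      // entry is needed for Telemeter authentication.
      var pullSecret struct {
          Auths struct {
              CloudOpenshiftCom struct {
                  Auth string `json:"auth"`
              } `json:"cloud.openshift.com"`
          } `json:"auths"`
      }
      if err := json.Unmarshal(raw, &pullSecret); err != nil {
          panic(err)
      }
      fmt.Println("telemeter token:", pullSecret.Auths.CloudOpenshiftCom.Auth)
  }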

@openshift-ci-robot:

@s-urbaniak: GitHub didn't allow me to request PR reviews from the following users: the, for, pull, secret, is, this, current, parsing, logic.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @smarterclayton this is the current parsing logic for the pull secret

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@s-urbaniak (Contributor) left a comment:

/lgtm

@openshift-ci-robot added the lgtm (indicates that a PR is ready to be merged) label Sep 27, 2018
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: s-urbaniak, squat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@s-urbaniak (Contributor):

looks super fine to me 🎉

@openshift-merge-robot merged commit c034508 into openshift:master Sep 27, 2018
@squat deleted the telemeter_client branch September 27, 2018 13:58
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Aug 8, 2023
Following the pattern we've used in CreateOrUpdateDeployment since
675a7ed (pkg/*: add k8s functions for Telemeter client, 2018-09-21, openshift#103).
This saves some network traffic and Kubernetes API service load.
There's a bit of dancing as I copy Status (which is irrelevant for
this use-case) and ObjectMeta (except for merged annotations and
labels) over from 'existing' to 'required', but that sets up a
convenient DeepEqual for "did anything we care about change?".

It also makes it easier to see from the Kube API server logs
when the Prometheus resources are actually being updated, while before
this commit:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1685027676433158144/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"prometheuses"' kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort

would find traffic like:

  2023-07-28T21:10:30.455712Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:11:39.629004Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:11:58.727870Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:13:24.616877Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:13:43.859596Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:14:51.770214Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:15:10.524179Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  ...
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Aug 9, 2023
wking added a commit to wking/cluster-monitoring-operator that referenced this pull request Aug 9, 2023
Following the pattern we've used in CreateOrUpdateDeployment since
675a7ed (pkg/*: add k8s functions for Telemeter client, 2018-09-21, openshift#103).
This saves some network traffic and Kubernetes API service load.
There's a bit of dancing as I copy Status (which is irrelevant for
this use-case) and ObjectMeta (except for merged annotations and
labels) over from 'existing' to 'required', but that sets up a
convenient DeepEqual for "did anything we care about change?".

It also makes it easier to see from the Kube API server logs
when the Prometheus resources are actually being updated, while before
this commit:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1685027676433158144/artifacts/e2e-aws-ovn-serial/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"prometheuses"' kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort

would find traffic like:

  2023-07-28T21:10:30.455712Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:11:39.629004Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:11:58.727870Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:13:24.616877Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:13:43.859596Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:14:51.770214Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  2023-07-28T21:15:10.524179Z 200 system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
  ...

I've also added diff logging, using an already-vendored library, to
make it easier to understand why the operator feels the need to update
the resource.

The go.mod update was generated with:

  $ go mod tidy

using:

  $ go version
  go version go1.19.5 linux/amd64

now that we're directly using the already-vendored package.
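
For context, a rough sketch of the "copy over what we don't manage, then DeepEqual" check the commit message describes, using a Deployment for illustration (this is not the CMO's actual CreateOrUpdate implementation):

  package main

  import (
      "fmt"
      "reflect"

      appsv1 "k8s.io/api/apps/v1"
  )

  // needsUpdate reports whether anything the operator cares about differs
  // between the object on the cluster and the object it wants to apply.
  func needsUpdate(existing, required *appsv1.Deployment) bool {
      // Status and server-populated metadata are irrelevant to the comparison,
      // so copy them from 'existing' into 'required' first.
      required.Status = existing.Status
      required.ResourceVersion = existing.ResourceVersion
      required.UID = existing.UID
      required.CreationTimestamp = existing.CreationTimestamp
      // (Merging of existing annotations/labels into 'required' is omitted here.)

      // If nothing remains different, the Update call, and its API traffic,
      // can be skipped entirely.
      return !reflect.DeepEqual(existing.Spec, required.Spec) ||
          !reflect.DeepEqual(existing.Labels, required.Labels) ||
          !reflect.DeepEqual(existing.Annotations, required.Annotations)
  }

  func main() {
      existing := &appsv1.Deployment{}
      required := existing.DeepCopy()
      fmt.Println(needsUpdate(existing, required)) // false: skip the update
  }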

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
