
OTA-1080: pkg/cli/admin/inspectalerts: New tech-preview inspect-alerts subcommand #1674

Merged
merged 3 commits into from Feb 23, 2024

Conversation

wking
Member

@wking wking commented Jan 31, 2024

The new subcommand is gated by OC_ENABLE_CMD_INSPECT_ALERTS to avoid customers accidentally using it while it's tech-preview:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.[] | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-19T20:13:28.083Z 2024-01-31T05:23:28.083Z suppressed none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
2024-01-19T20:14:13.174Z 2024-01-31T05:24:13.174Z suppressed info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
2024-01-19T20:14:20.600Z 2024-01-31T05:24:20.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
2024-01-30T06:02:23.913Z 2024-01-31T05:24:23.913Z active warning KubeJobFailed Job failed to complete.

This provides a more convenient syntax based on the example in the existing docs:

$ oc login -u <username> -p <password>
$ host=$(oc -n openshift-monitoring get route alertmanager-main -ojsonpath={.spec.host})
$ token=$(oc whoami -t)
$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v2/receivers"

But I'm using the /alerts path instead of the /receivers path given in the docs example. This provides convenient access when the console isn't available (e.g. because the Console ClusterVersion capability is not enabled) or for use in other tooling like the tech-preview oc adm upgrade status, and has a long-standing (but currently deferred) RFE.

I'm also dropping the -k/--insecure from the doc'ed curl in my Go translation, because:

  • I don't want to leak a sensitive token to an untrusted server.

  • Clusters are supposed to have set up a well-known cert/CA for their ingress:

    The Ingress Operator generates a default certificate for an Ingress Controller to serve as a placeholder until you configure a custom default certificate. Do not use Operator-generated default certificates in production clusters.

    so our client will likely trust the cluster.

and failing on "I don't trust that server" seems fine for this tech-preview proof-of-concept, vs. trying to use the kubeconfig-protected connections to automatically figure out which CA I need to trust.

I'm iterating over status.ingress[].host entries instead of using the doc'ed spec.host, because the spec property is optional, while the status property is required. It also seems safer to consume the Route controller's output status instead of wiring up directly to Route-admin written spec input.

I'm not currently filtering on RouteIngress.Conditions including an Admitted=True entry, although something like that might be reasonable to avoid wasting time poking known-to-be-broken hosts.

I haven't dug into Bearer token RFCs or the REST config structure to know if I need to worry about any encoding issues (base64?) in this handoff, but in testing it works for me. Please feel free to stiffen, or at least research, that handoff before making this generally available :).

If I get a failure from one status.ingress[].host, I don't bother attempting the others. Wiring up that kind of fallback with error aggregation when all hosts fail would be a useful future refinement.

We seem to have types in the vendored github.com/prometheus/common/model that could be used to deserialize this output (e.g. for table printing), but for now I'll just dump the JSON to output.
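The URI construction over `status.ingress[].host` described above can be sketched like this; the `hosts` slice and the example hostname are stand-ins (the real code ranges over the OpenShift Route object's `Status.Ingress`):

```go
package main

import (
	"fmt"
	"net/url"
)

// ingressURIs builds one https URI per ingress host plus the request
// path, mirroring the iteration over status.ingress[].host described
// above. hosts is a stand-in for route.Status.Ingress entries.
func ingressURIs(hosts []string, path string) []string {
	uris := make([]string, 0, len(hosts))
	for _, host := range hosts {
		u := &url.URL{Scheme: "https", Host: host, Path: path}
		uris = append(uris, u.String())
	}
	return uris
}

func main() {
	// Hypothetical hostname for illustration.
	fmt.Println(ingressURIs([]string{"alertmanager-main-openshift-monitoring.apps.example.com"}, "/api/v2/alerts"))
}
```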

@openshift-ci-robot

openshift-ci-robot commented Jan 31, 2024

@wking: This pull request references OTA-1080 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2024

@petr-muller
Member

/cc

pkg/cli/admin/admin.go (outdated review thread, resolved)
uris := make([]string, 0, len(route.Status.Ingress))
for _, ingress := range route.Status.Ingress {
uri := &url.URL{
Scheme: "https",
Member

Will the certificate authority in kubeconfig be used?

Member Author

Nope, there's no CA handling at all, I'm just using a stock Go http.Client{}. Is there some kind of tooling/precedent for using the kubeconfig to figure out the CAs that should be trusted to connect to a particular Route?
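For reference, if we ever do wire a CA in, one way would be to build a cert pool and hand it to the client's TLS config. This is only a sketch, not part of the PR; `caPEM` is a hypothetical PEM bundle (e.g. decoded from a kubeconfig's certificate-authority-data), and how to pick the right bundle for a given Route is exactly the open question:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
)

// newClientWithCA returns an http.Client that trusts only the CAs in
// caPEM, instead of the stock client's system roots. An empty bundle
// yields a client with an empty pool (it would trust nothing).
func newClientWithCA(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if len(caPEM) > 0 && !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no valid certificates found in CA bundle")
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}
```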


client := &http.Client{}
uris := make([]string, 0, len(route.Status.Ingress))
for _, ingress := range route.Status.Ingress {
Member

Do we always check the first one in the slice and return immediately?

Member Author

From my PR/commit message:

If I get a failure from one status.ingress[].host, I don't bother attempting the others. Wiring up that kind of fallback with error aggregation when all hosts fail would be a useful future refinement.

If you have a preference on that or the Admitted=True filtering I floated in the PR/commit message, let me know about that too :)

return nil, fmt.Errorf("no token is currently in use for this session")
}

route, err := getRoute(ctx, namespace, name, metav1.GetOptions{})
Member

What is the purpose of getRoute better than passing route object into GetWithBearer after retrieving it in Run

Member

Thanks to that, we wouldn't pass ctx which is only used by getRoute

Member Author

We're expecting to consume this in future oc adm upgrade status work, and I wanted to bundle as much as possible into the shared helper, just to make the call-site simpler on the consumer side. But having two separate call-and-catch-err in the status call-site would just be a bit less compact, and not hard to do, so I don't mind pivoting to passing route if you aren't buying my "easier on Go consumers" pitch here ;)

return nil, err
}

req.Header.Add("Authorization", fmt.Sprintf("Bearer %s", bearerToken))
Member

As far as I understand, we are sending the request to an endpoint other than the API server, because this endpoint is not registered with the API server? That breaks the common pattern where ~every request goes through the API server?

Member

If /api/v2/alerts were registered with the API server, why wouldn't we use the built-in functionality instead of manually adding a bearer token to the header?

@wking (Member Author), Jan 31, 2024

This is a non-Kube API, e.g. in a 4.16.0-ec.1 cluster, the only alert API resources are:

$ oc api-resources | grep -i alert
alertmanagerconfigs                   amcfg                                                                                  monitoring.coreos.com/v1beta1                   true         AlertmanagerConfig
alertmanagers                         am                                                                                     monitoring.coreos.com/v1                        true         Alertmanager
alertingrules                                                                                                                monitoring.openshift.io/v1                      true         AlertingRule
alertrelabelconfigs                                                                                                          monitoring.openshift.io/v1                      true         AlertRelabelConfig

And none of those is this list of Alertmanager pending/firing alerts. An alternative access mechanism would be for the Alertmanager to register its alert type with Kube, e.g. as an aggregated API server. But that sort of pivot would be large, and in the meantime, we want some kind of convenient access to firing alerts for the status subcommand.

But narrowly, if there is Go/Kube/other tooling for Bearer injection on a shelf somewhere, I'm happy to pivot to using that when constructing these requests. I was unable to find it on a brief search yesterday, but I didn't sink too much time into hunting, because I wanted to limit the time invested in this tech-preview proof-of-concept.


As you referred above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection but it can be deferred to a following PR.

@ardaguclu the monitoring stack exposes a few API services protected by kube-rbac-proxy, which uses TokenReview + SubjectAccessReview to authenticate and authorize requests. Hence the need to provide a bearer token.

Member

Thanks. But as I pointed out in #1674 (comment), the bearer token may be empty while the kubeconfig still grants access to the API server. Any approach that discards client-go may have this issue: the user can access the API server, but from their perspective this command just doesn't work.


Yes, the request will be rejected. I'm not sure what you're advising, though?

Member

As you referred above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection but it can be deferred to a following PR.

I'm advising this ^, but it seems that won't be possible soon. I'd suggest mentioning this case in the command's description to warn users about it.

@wking force-pushed the alert-access branch 3 times, most recently from a26bea5 to 675fdb4 on January 31, 2024 at 18:51
@LalatenduMohanty
Member

LalatenduMohanty commented Jan 31, 2024

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

@wking
Member Author

wking commented Feb 1, 2024

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

All things are possible, but I was trying to avoid sinking time into aspects that oc adm upgrade status wouldn't need. It will presumably deserialize the JSON into github.com/prometheus/common/model's Alerts, and then use structured filtering. Meanwhile, folks using the JSON dump I've implemented so far can do that filtering with jq, e.g.:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true ./oc adm inspect-alerts | jq -r '[.[] | select(.labels.severity == "warning") | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-19T20:14:20.600Z 2024-02-01T00:31:50.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.

And since the main issue I personally have consuming the raw JSON is excluding all the properties I don't care about, I'm already using jq. (Otherwise we'd have to discuss and agree on which properties are important enough to show by default, and what the command-line API looks like for users who want something custom.)

Anyhow, things in this space seem like they'd be nice-to-have, but I don't see them as proof-of-concept blockers.
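For the structured-filtering route, a minimal sketch might look like this; the `alert` struct here is a hypothetical subset of the payload for illustration, not the vendored github.com/prometheus/common/model types:

```go
package main

import (
	"encoding/json"
)

// alert models just the labels and annotations maps from the alert
// payload, enough to filter by severity in code instead of with jq.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// filterBySeverity deserializes a JSON array of alerts and keeps only
// those whose severity label matches.
func filterBySeverity(raw []byte, severity string) ([]alert, error) {
	var alerts []alert
	if err := json.Unmarshal(raw, &alerts); err != nil {
		return nil, err
	}
	var out []alert
	for _, a := range alerts {
		if a.Labels["severity"] == severity {
			out = append(out, a)
		}
	}
	return out, nil
}
```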

@machine424

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

In future iterations, we could consider wiring up amtool to avoid implementing yet another client.
It offers many useful features, and it would be handy to have it in oc (one less binary to worry about). We could even allow customizing the Alertmanager URL, with the platform one as the default (we'll need to think about compatibility, though).

}

// GetWithBearer gets a Route by namespace/name, constructs a URI using
// status.ingress[].host and the path argument, and performs GETs on that


Member Author

oc vendors library-go, but not yet that sub-package^, and I wanted to avoid mucking with vendor/ or introducing questions about iteration over ingress[] (IngressURI only returns the first match). But yes, we could pivot to using that. Or extend library-go to grow a callback-consuming function that supports iteration over multiple matches, etc. Thoughts for this pull?

@machine424, Feb 5, 2024

We can leave that for future iterations for sure.
(If we could just add a TODO for this somewhere)

@simonpasquier

I'd rather use the Thanos Querier API endpoints than Alertmanager:

  • Customers may disable the in-cluster Alertmanager (especially true with ACM satellite clusters).
  • It would be possible to retrieve pending alerts.
  • Thanos aggregates both platform and user-defined alerts (in case this is useful).

The downside is that you won't see which alerts are being silenced, since silencing only applies to alerts at the Alertmanager level.

@wking
Member Author

wking commented Feb 1, 2024

Customers may disable the in-cluster Alertmanager...

Ah, docs around that here. I'd mistakenly thought it was Alertmanager converting Prometheus metrics into alerts, but it looks like Prometheus calculates the alerts, and Alertmanager layers on silencing and push notifications. So yeah, I'll pivot to Thanos.

@petr-muller
Member

The downside is that you won't see which alerts are being silenced since it's only applicable to alerts from Alertmanager.

I'm a bit worried about confused users who would expect to not see a silenced alert when they do have an alertmanager, but for the purpose of oc adm upgrade status I think we're probably good with documenting that little caveat (over doing something more complicated like trying alertmanager and then falling back to thanos or something else...).

@wking
Member Author

wking commented Feb 2, 2024

Customers may disable the in-cluster Alertmanager...

... So yeah, I'll pivot to Thanos.

Ok, poking at this locally, looks like Thanos is only giving me some metric labels:

$ curl -sH "Authorization: Bearer $(oc whoami -t)" "https://$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/query?query=ALERTS" | jq '.data.result[0]'
{
  "metric": {
    "__name__": "ALERTS",
    "alertname": "ClusterAutoscalerUnschedulablePods",
    "alertstate": "pending",
    "container": "cluster-autoscaler",
    "endpoint": "metrics",
    "instance": "10.129.0.57:8085",
    "job": "cluster-autoscaler-default",
    "namespace": "openshift-machine-api",
    "pod": "cluster-autoscaler-default-f8dd547c7-dg9t5",
    "prometheus": "openshift-monitoring/k8s",
    "service": "cluster-autoscaler-default",
    "severity": "info"
  },
  "value": [
    1706914546.182,
    "1"
  ]
}

While Alertmanager gives me .annotations.summary and other context that is needed to act on the alert. I guess it's theoretically possible for oc to head back to the cluster and pull PrometheusRules and... no, we still wouldn't be able to do the dynamic template stuff that happens somewhere in the monitoring component today.

But anyway, are folks ok if I stick with using Alertmanager for this initial proof-of-concept?

@simonpasquier

Sorry I was a bit dense: Thanos also exposes a /api/v1/alerts endpoint which returns all active alerts including the annotations.

wking added a commit to wking/oc that referenced this pull request Feb 5, 2024
As recommended by Simon [1].  By dropping one level down and querying
Thanos, we lose Silence access (those only come in at the Alertmanager
level), but we gain:

* The ability to retrieve alerts when Alertmanager is disabled
  (especially true with ACM satellite clusters).
* Access to pending alerts (not just firing alerts).
* Access to both platform and user-defined alerts (the platform
  Alertmanager only has access to the former).

Demo of the new functionality:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
  2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
  2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
  ...

[1]: openshift#1674 (comment)
@wking
Member Author

wking commented Feb 5, 2024

Ok, I've pushed 41e5e82 -> da5669f with a pivot to Thanos and /api/v1/alerts:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
...

@wking
Member Author

wking commented Feb 5, 2024

images hit build04 node capacity.

/retest-required

The new subcommand is gated by OC_ENABLE_CMD_INSPECT_ALERTS to avoid
customers accidentally using it while it's tech-preview:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.[] | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-19T20:13:28.083Z 2024-01-31T05:23:28.083Z suppressed none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-19T20:14:13.174Z 2024-01-31T05:24:13.174Z suppressed info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:14:20.600Z 2024-01-31T05:24:20.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-01-30T06:02:23.913Z 2024-01-31T05:24:23.913Z active warning KubeJobFailed Job failed to complete.

This provides a more convenient syntax than the approach in the
existing docs [1]:

  $ oc login -u <username> -p <password>
  $ host=$(oc -n openshift-monitoring get route alertmanager-main -ojsonpath={.spec.host})
  $ token=$(oc whoami -t)
  $ curl -H "Authorization: Bearer $token" -k "https://$host/api/v2/receivers"

But I'm using the /alerts path [2] instead of the /receivers path
given in the docs example.  This provides convenient access when the
console isn't available (e.g. because the Console ClusterVersion
capability is not enabled) or for use in other tooling like the
tech-preview 'oc adm upgrade status', and has a long-standing (but
currently deferred) RFE [3].

I'm also dropping the -k/--insecure from the doc'ed curl in my Go
translation, because:

* I don't want to leak a sensitive token to an untrusted server.
* Clusters are supposed to have set up a well-known cert/CA for their
  ingress [4]:

    The Ingress Operator generates a default certificate for an
    Ingress Controller to serve as a placeholder until you configure a
    custom default certificate. Do not use Operator-generated default
    certificates in production clusters.

  so our client will likely trust the cluster.

and failing on "I don't trust that server" seems fine for this
tech-preview proof-of-concept, vs. trying to use the
kubeconfig-protected connections to automatically figure out which CA
I need to trust.

I'm iterating over status.ingress[].host entries instead of using the
doc'ed spec.host, because the spec property is optional [5], while the
status property is required [6].  It also seems safer to consume the
Route controller's output status instead of wiring up directly to
Route-admin written spec input.

I'm not currently filtering on RouteIngress.Conditions including an
Admitted=True entry, although something like that might be reasonable
to avoid wasting time poking known-to-be-broken hosts.

I haven't dug into Bearer token RFCs or the REST config structure to
know if I need to worry about any encoding issues (base64?) in this
handoff, but in testing it works for me.  Please feel free to stiffen,
or at least research, that handoff before making this generally
available :).

If I get a failure from one status.ingress[].host, I don't bother
attempting the others.  Wiring up that kind of fallback with error
aggregation when all hosts fail would be a useful future refinement.

We seem to have types in the vendored
github.com/prometheus/common/model that could be used to deserialize
this output (e.g. for table printing), but for now I'll just dump the
JSON to output.

[1]: https://docs.openshift.com/container-platform/4.14/monitoring/accessing-third-party-monitoring-apis.html
[2]: https://github.com/prometheus/alertmanager/blob/v0.26.0/api/v2/openapi.yaml#L134
[3]: https://issues.redhat.com/browse/RFE-928
[4]: https://docs.openshift.com/container-platform/4.14/security/certificate_types_descriptions/ingress-certificates.html#location
[5]: https://github.com/openshift/api/blob/4f00b4f16b634830bb16b432eec91f6e514f2981/route/v1/types.go#L95
[6]: https://github.com/openshift/api/blob/4f00b4f16b634830bb16b432eec91f6e514f2981/route/v1/types.go#L352-L353

As recommended by Simon [1].  By dropping one level down and querying
Thanos, we lose Silence access (those only come in at the Alertmanager
level), but we gain:

* The ability to retrieve alerts when Alertmanager is disabled
  (especially true with ACM satellite clusters).
* Access to pending alerts (not just firing alerts).
* Access to both platform and user-defined alerts (the platform
  Alertmanager only has access to the former).

Demo of the new functionality:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
  2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
  2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
  ...

[1]: openshift#1674 (comment)

Simon, internally:

> ...let's go with the CMO owners...

So this brings in the approver set from [1].

That has the same people as reviewers and approvers, so I'm only
creating a single monitoring-approvers here to ease future
membership-management.  If, in the future, monitoring decides to have
distinct sets for reviewers and approvers, it would be straightforward
to add a new alias.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b7e3f50875f2bb1fed912b23fb80a101d3a786c0/OWNERS
@jan--f

jan--f commented Feb 12, 2024

👍 This looks good to me.

return alertBytes, err
}

// if we end up going this way, probably check and error on 'result' being an empty set (it should at least contain Watchdog)


I would return the list as-is. An empty list isn't expected currently, but who knows whether that will change in the future.

return nil, err
}

req.Header.Add("Authorization", fmt.Sprintf("Bearer %s", bearerToken))


As you noted above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection, but that can be deferred to a follow-up PR.

@ardaguclu the monitoring stack exposes a few API services protected by kube-rbac-proxy, which uses TokenReview + SubjectAccessReview to authenticate and authorize requests. Hence the need to provide a bearer token.

}

func (o *options) Run(ctx context.Context) error {
alertBytes, err := GetAlerts(ctx, o.getRoute, o.RESTConfig.BearerToken)
Member

@ardaguclu ardaguclu Feb 14, 2024


There are many ways to send an authenticated request to the API server in RESTConfig:

// Server requires Bearer authentication. This client will not attempt to use
// refresh tokens for an OAuth2 flow.
// TODO: demonstrate an OAuth2 compatible client.
BearerToken string `datapolicy:"token"`
// Path to a file containing a BearerToken.
// If set, the contents are periodically read.
// The last successfully read value takes precedence over BearerToken.
BearerTokenFile string
// Impersonate is the configuration that RESTClient will use for impersonation.
Impersonate ImpersonationConfig
// Server requires plugin-specified authentication.
AuthProvider *clientcmdapi.AuthProviderConfig
// Callback to persist config for AuthProvider.
AuthConfigPersister AuthProviderConfigPersister
// Exec-based authentication provider.
ExecProvider *clientcmdapi.ExecConfig
and as far as I know, BearerToken is not mandatory. For example, a user may use a credential exec plugin (ExecProvider), which skips the bearer-token step.

Member Author


Bearer token is what we doc, so it should be good enough for a env-var-gated first attempt, right?

Member


Yes, I think so

Member


A bearer token is not available for users working with a cluster-admin kubeconfig, so that might be a problem for a subset of our users.

Member


The other problem I ran into when playing with this was:

OC_ENABLE_CMD_INSPECT_ALERTS=true ./oc adm inspect-alerts 
Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority

I'm guessing you'd need to do something similar to what we do in https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_monitoring to verify that route, but I'm hesitant about that approach.

Member


Isn't there a stable API we could access? How is our console accessing that data? Can we re-use that same API?

Member Author


must-gather is using a ServiceAccount token and collecting the default ingress CA, both of which we could use. But 🤷, that's also work to implement, and it's not clear to me that we need to sink that time in before landing a subcommand behind an OC_ENABLE_CMD_INSPECT_ALERTS=true gate. Is "if there isn't a bearer token, or the request is rejected with an HTTP auth code, go see if we have the power to borrow a must-gather ServiceAccount token" the sort of thing we need for landing the initial implementation? There's always work we could invest to make any implementation better, but where are the goal-posts for merging an initial, env-var-gated command?

Member


RE: stable API, we can take up that work after the merge. This should not block the merge, as this PR is behind the OC_ENABLE_CMD_INSPECT_ALERTS=true flag.


Isn't there a stable API we could access? How our console is accessing that data? Can we re-use that same API?

The console uses the same API service to query and display the cluster alerts.

@ardaguclu
Member

/approve
for adding monitoring-approvers in OWNER_ALIASES

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2024
Member

@LalatenduMohanty LalatenduMohanty left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2024
Contributor

openshift-ci bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Feb 23, 2024

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit a36a8e3 into openshift:master Feb 23, 2024
13 checks passed
@wking wking deleted the alert-access branch February 23, 2024 03:01
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-cli-container-v4.16.0-202402230438.p0.ga36a8e3.assembly.stream.el8 for distgit openshift-enterprise-cli.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.