
OTA-1080: pkg/cli/admin/inspectalerts: New tech-preview inspect-alerts subcommand #1674

Merged
merged 3 commits into from Feb 23, 2024

Conversation

wking
Member

@wking wking commented Jan 31, 2024

The new subcommand is gated by OC_ENABLE_CMD_INSPECT_ALERTS to avoid customers accidentally using it while it's tech-preview:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.[] | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-19T20:13:28.083Z 2024-01-31T05:23:28.083Z suppressed none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
2024-01-19T20:14:13.174Z 2024-01-31T05:24:13.174Z suppressed info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
2024-01-19T20:14:20.600Z 2024-01-31T05:24:20.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
2024-01-30T06:02:23.913Z 2024-01-31T05:24:23.913Z active warning KubeJobFailed Job failed to complete.

This provides a more convenient syntax based on the example in the existing docs:

$ oc login -u <username> -p <password>
$ host=$(oc -n openshift-monitoring get route alertmanager-main -ojsonpath={.spec.host})
$ token=$(oc whoami -t)
$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v2/receivers"

But I'm using the /alerts path instead of the /receivers path given in the docs example. This provides convenient access when the console isn't available (e.g. because the Console ClusterVersion capability is not enabled) or for use in other tooling like the tech-preview oc adm upgrade status, and has a long-standing (but currently deferred) RFE.

I'm also dropping the -k/--insecure from the doc'ed curl in my Go translation, because:

  • I don't want to leak a sensitive token to an untrusted server.

  • Clusters are supposed to have set up a well-known cert/CA for their ingress:

    The Ingress Operator generates a default certificate for an Ingress Controller to serve as a placeholder until you configure a custom default certificate. Do not use Operator-generated default certificates in production clusters.

    so our client will likely trust the cluster.

and failing on "I don't trust that server" seems fine for this tech-preview proof-of-concept, vs. trying to use the kubeconfig-protected connections to automatically figure out which CA I need to trust.

I'm iterating over status.ingress[].host entries instead of using the doc'ed spec.host, because the spec property is optional, while the status property is required. It also seems safer to consume the Route controller's output status instead of wiring up directly to Route-admin written spec input.

I'm not currently filtering on RouteIngress.Conditions including an Admitted=True entry, although something like that might be reasonable to avoid wasting time poking known-to-be-broken hosts.

I haven't dug into Bearer token RFCs or the REST config structure to know if I need to worry about any encoding issues (base64?) in this handoff, but in testing it works for me. Please feel free to stiffen, or at least research, that handoff before making this generally available :).

If I get a failure from one status.ingress[].host, I don't bother attempting the others. Wiring up that kind of fallback with error aggregation when all hosts fail would be a useful future refinement.

We seem to have types in the vendored github.com/prometheus/common/model that could be used to deserialize this output (e.g. for table printing), but for now I'll just dump the JSON to output.
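The URI construction over `status.ingress[].host` described above can be sketched like this; the `hosts` slice and the example hostname are stand-ins (the real code ranges over the OpenShift Route object's `Status.Ingress`):

```go
package main

import (
	"fmt"
	"net/url"
)

// ingressURIs builds one https URI per ingress host plus the request
// path, mirroring the iteration over status.ingress[].host described
// above. hosts is a stand-in for route.Status.Ingress entries.
func ingressURIs(hosts []string, path string) []string {
	uris := make([]string, 0, len(hosts))
	for _, host := range hosts {
		u := &url.URL{Scheme: "https", Host: host, Path: path}
		uris = append(uris, u.String())
	}
	return uris
}

func main() {
	// Hypothetical hostname for illustration.
	fmt.Println(ingressURIs([]string{"alertmanager-main-openshift-monitoring.apps.example.com"}, "/api/v2/alerts"))
}
```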

@openshift-ci-robot

openshift-ci-robot commented Jan 31, 2024

@wking: This pull request references OTA-1080 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 31, 2024

@petr-muller
Member

/cc

pkg/cli/admin/admin.go (outdated review thread, resolved)
uris := make([]string, 0, len(route.Status.Ingress))
for _, ingress := range route.Status.Ingress {
uri := &url.URL{
Scheme: "https",
Member

Will the certificate authority in kubeconfig be used?

Member Author

Nope, there's no CA handling at all, I'm just using a stock Go http.Client{}. Is there some kind of tooling/precedent for using the kubeconfig to figure out the CAs that should be trusted to connect to a particular Route?
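For reference, if we ever do wire a CA in, one way would be to build a cert pool and hand it to the client's TLS config. This is only a sketch, not part of the PR; `caPEM` is a hypothetical PEM bundle (e.g. decoded from a kubeconfig's certificate-authority-data), and how to pick the right bundle for a given Route is exactly the open question:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
)

// newClientWithCA returns an http.Client that trusts only the CAs in
// caPEM, instead of the stock client's system roots. An empty bundle
// yields a client with an empty pool (it would trust nothing).
func newClientWithCA(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if len(caPEM) > 0 && !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no valid certificates found in CA bundle")
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}
```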


client := &http.Client{}
uris := make([]string, 0, len(route.Status.Ingress))
for _, ingress := range route.Status.Ingress {
Member

Do we always check the first one in the slice and return immediately?

Member Author

From my PR/commit message:

If I get a failure from one status.ingress[].host, I don't bother attempting the others. Wiring up that kind of fallback with error aggregation when all hosts fail would be a useful future refinement.

If you have a preference on that or the Admitted=True filtering I floated in the PR/commit message, let me know about that too :)

return nil, fmt.Errorf("no token is currently in use for this session")
}

route, err := getRoute(ctx, namespace, name, metav1.GetOptions{})
Member

What is the purpose of getRoute better than passing route object into GetWithBearer after retrieving it in Run

Member

Thanks to that, we wouldn't pass ctx which is only used by getRoute

Member Author

We're expecting to consume this in future oc adm upgrade status work, and I wanted to bundle as much as possible into the shared helper, just to make the call-site simpler on the consumer side. But having two separate call-and-catch-err in the status call-site would just be a bit less compact, and not hard to do, so I don't mind pivoting to passing route if you aren't buying my "easier on Go consumers" pitch here ;)

return nil, err
}

req.Header.Add("Authorization", fmt.Sprintf("Bearer %s", bearerToken))
Member

As far as I understand, we are sending the request to an endpoint other than the API server, because this endpoint is not registered with the API server? That breaks the common pattern where ~every request goes through the API server?

Member

If /api/v2/alerts were registered with the API server, why wouldn't we use the built-in functionality instead of manually adding a bearer token to the header?

@wking (Member Author), Jan 31, 2024

This is a non-Kube API, e.g. in a 4.16.0-ec.1 cluster, the only alert API resources are:

$ oc api-resources | grep -i alert
alertmanagerconfigs                   amcfg                                                                                  monitoring.coreos.com/v1beta1                   true         AlertmanagerConfig
alertmanagers                         am                                                                                     monitoring.coreos.com/v1                        true         Alertmanager
alertingrules                                                                                                                monitoring.openshift.io/v1                      true         AlertingRule
alertrelabelconfigs                                                                                                          monitoring.openshift.io/v1                      true         AlertRelabelConfig

And none of those is this list of Alertmanager pending/firing alerts. An alternative access mechanism would be for the Alertmanager to register its alert type with Kube, e.g. as an aggregated API server. But that sort of pivot would be large, and in the meantime, we want some kind of convenient access to firing alerts for the status subcommand.

But narrowly, if there is Go/Kube/other tooling for Bearer injection on a shelf somewhere, I'm happy to pivot to using that when constructing these requests. I was unable to find it on a brief search yesterday, but I didn't sink too much time into hunting, because I wanted to limit the time invested in this tech-preview proof-of-concept.


As you referred above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection but it can be deferred to a following PR.

@ardaguclu the monitoring stack exposes a few API services protected by kube-rbac-proxy, which uses TokenReview + SubjectAccessReview to authenticate and authorize requests. Hence the need to provide a bearer token.

Member

Thanks. But as I pointed out in #1674 (comment), the bearer token may be empty while the kubeconfig still grants access to the API server. Any approach that discards client-go may have this issue: the user can access the API server, but from their perspective this command just doesn't work.


Yes, the request will be rejected. I'm not sure what you're advising, though?

Member

As you referred above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection but it can be deferred to a following PR.

I'm advising this ^, but it seems that won't be possible soon. I'd suggest mentioning this case in the command's description to warn users about it.

@wking force-pushed the alert-access branch 3 times, most recently from a26bea5 to 675fdb4 on January 31, 2024 at 18:51
@LalatenduMohanty
Member

LalatenduMohanty commented Jan 31, 2024

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

@wking
Member Author

wking commented Feb 1, 2024

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

All things are possible, but I was trying to avoid sinking time into aspects that oc adm upgrade status wouldn't need. It will presumably deserialize the JSON into github.com/prometheus/common/model's Alerts, and then use structured filtering. Meanwhile, folks using the JSON dump I've implemented so far can do that filtering with jq, e.g.:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true ./oc adm inspect-alerts | jq -r '[.[] | select(.labels.severity == "warning") | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-19T20:14:20.600Z 2024-02-01T00:31:50.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.

And since the main issue I personally have consuming the raw JSON is excluding all the properties I don't care about, I'm already using jq. (Otherwise we'd have to discuss and agree on which properties are important enough to show by default, and what the command-line API looks like for users who want something custom.)

Anyhow, things in this space seem like they'd be nice-to-have, but I don't see them as proof-of-concept blockers.
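For the structured-filtering route, a minimal sketch might look like this; the `alert` struct here is a hypothetical subset of the payload for illustration, not the vendored github.com/prometheus/common/model types:

```go
package main

import (
	"encoding/json"
)

// alert models just the labels and annotations maps from the alert
// payload, enough to filter by severity in code instead of with jq.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// filterBySeverity deserializes a JSON array of alerts and keeps only
// those whose severity label matches.
func filterBySeverity(raw []byte, severity string) ([]alert, error) {
	var alerts []alert
	if err := json.Unmarshal(raw, &alerts); err != nil {
		return nil, err
	}
	var out []alert
	for _, a := range alerts {
		if a.Labels["severity"] == severity {
			out = append(out, a)
		}
	}
	return out, nil
}
```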

@machine424

Can we add an option to filter the alerts based on severity, e.g. "severity": "warning"?

In future iterations, we could consider wiring up amtool to avoid implementing yet another client.
It offers many useful features, and it would be handy to have it in oc (one less binary to worry about). We could even allow customizing the Alertmanager URL, with the platform one as the default (we'll need to think about compatibility, though).

}

// GetWithBearer gets a Route by namespace/name, constructs a URI using
// status.ingress[].host and the path argument, and performs GETs on that


Member Author

oc vendors library-go, but not yet that sub-package^, and I wanted to avoid mucking with vendor/ or introducing questions about iteration over ingress[] (IngressURI only returns the first match). But yes, we could pivot to using that. Or extend library-go to grow a callback-consuming function that supports iteration over multiple matches, etc. Thoughts for this pull?

@machine424, Feb 5, 2024

We can leave that for future iterations for sure.
(If we could just add a TODO for this somewhere)

@simonpasquier

I'd rather use the Thanos Querier API endpoints than Alertmanager:

  • Customers may disable the in-cluster Alertmanager (especially true with ACM satellite clusters).
  • It would be possible to retrieve pending alerts.
  • Thanos aggregates both platform and user-defined alerts (in case this is useful).

The downside is that you won't see which alerts are being silenced, since silencing only applies to alerts at the Alertmanager level.

@wking
Member Author

wking commented Feb 1, 2024

Customers may disable the in-cluster Alertmanager...

Ah, docs around that here. I'd mistakenly thought it was Alertmanager converting Prometheus metrics into alerts, but it looks like Prometheus calculates the alerts, and Alertmanager layers on silencing and push notifications. So yeah, I'll pivot to Thanos.

@petr-muller
Member

The downside is that you won't see which alerts are being silenced since it's only applicable to alerts from Alertmanager.

I'm a bit worried about confused users who would expect to not see a silenced alert when they do have an alertmanager, but for the purpose of oc adm upgrade status I think we're probably good with documenting that little caveat (over doing something more complicated like trying alertmanager and then falling back to thanos or something else...).

@wking
Member Author

wking commented Feb 2, 2024

Customers may disable the in-cluster Alertmanager...

... So yeah, I'll pivot to Thanos.

Ok, poking at this locally, looks like Thanos is only giving me some metric labels:

$ curl -sH "Authorization: Bearer $(oc whoami -t)" "https://$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.status.ingress[0].host}')/api/v1/query?query=ALERTS" | jq '.data.result[0]'
{
  "metric": {
    "__name__": "ALERTS",
    "alertname": "ClusterAutoscalerUnschedulablePods",
    "alertstate": "pending",
    "container": "cluster-autoscaler",
    "endpoint": "metrics",
    "instance": "10.129.0.57:8085",
    "job": "cluster-autoscaler-default",
    "namespace": "openshift-machine-api",
    "pod": "cluster-autoscaler-default-f8dd547c7-dg9t5",
    "prometheus": "openshift-monitoring/k8s",
    "service": "cluster-autoscaler-default",
    "severity": "info"
  },
  "value": [
    1706914546.182,
    "1"
  ]
}

While Alertmanager gives me .annotations.summary and other context that is needed to act on the alert. I guess it's theoretically possible for oc to head back to the cluster and pull PrometheusRules and... no, we still wouldn't be able to do the dynamic template stuff that happens somewhere in the monitoring component today.

But anyway, are folks ok if I stick with using Alertmanager for this initial proof-of-concept?

@simonpasquier

Sorry I was a bit dense: Thanos also exposes a /api/v1/alerts endpoint which returns all active alerts including the annotations.

wking added a commit to wking/oc that referenced this pull request Feb 5, 2024
As recommended by Simon [1].  By dropping one level down and querying
Thanos, we lose Silence access (those only come in at the Alertmanager
level), but we gain:

* The ability to retrieve alerts when Alertmanager is disabled
  (especially true with ACM satellite clusters).
* Access to pending alerts (not just firing alerts).
* Access to both platform and user-defined alerts (the platform
  Alertmanager only has access to the former).

Demo of the new functionality:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
  2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
  2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
  ...

[1]: openshift#1674 (comment)
@wking
Member Author

wking commented Feb 5, 2024

Ok, I've pushed 41e5e82 -> da5669f with a pivot to Thanos and /api/v1/alerts:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
...

@wking
Member Author

wking commented Feb 5, 2024

images hit build04 node capacity.

/retest-required

The new subcommand is gated by OC_ENABLE_CMD_INSPECT_ALERTS to avoid
customers accidentally using it while it's tech-preview:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.[] | .startsAt + " " + .endsAt + " " + .status.state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-19T20:13:28.083Z 2024-01-31T05:23:28.083Z suppressed none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-19T20:14:13.174Z 2024-01-31T05:24:13.174Z suppressed info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:14:20.600Z 2024-01-31T05:24:20.600Z suppressed warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:56:45.824Z 2024-01-31T05:23:15.824Z suppressed info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-01-30T06:02:23.913Z 2024-01-31T05:24:23.913Z active warning KubeJobFailed Job failed to complete.

This provides a more convenient syntax than the approach in the
existing docs [1]:

  $ oc login -u <username> -p <password>
  $ host=$(oc -n openshift-monitoring get route alertmanager-main -ojsonpath={.spec.host})
  $ token=$(oc whoami -t)
  $ curl -H "Authorization: Bearer $token" -k "https://$host/api/v2/receivers"

But I'm using the /alerts path [2] instead of the /receivers path
given in the docs example.  This provides convenient access when the
console isn't available (e.g. because the Console ClusterVersion
capability is not enabled) or for use in other tooling like the
tech-preview 'oc adm upgrade status', and has a long-standing (but
currently deferred) RFE [3].

I'm also dropping the -k/--insecure from the doc'ed curl in my Go
translation, because:

* I don't want to leak a sensitive token to an untrusted server.
* Clusters are supposed to have set up a well-known cert/CA for their
  ingress [4]:

    The Ingress Operator generates a default certificate for an
    Ingress Controller to serve as a placeholder until you configure a
    custom default certificate. Do not use Operator-generated default
    certificates in production clusters.

  so our client will likely trust the cluster.

and failing on "I don't trust that server" seems fine for this
tech-preview proof-of-concept, vs. trying to use the
kubeconfig-protected connections to automatically figure out which CA
I need to trust.

I'm iterating over status.ingress[].host entries instead of using the
doc'ed spec.host, because the spec property is optional [5], while the
status property is required [6].  It also seems safer to consume the
Route controller's output status instead of wiring up directly to
Route-admin written spec input.

I'm not currently filtering on RouteIngress.Conditions including an
Admitted=True entry, although something like that might be reasonable
to avoid wasting time poking known-to-be-broken hosts.

I haven't dug into Bearer token RFCs or the REST config structure to
know if I need to worry about any encoding issues (base64?) in this
handoff, but in testing it works for me.  Please feel free to stiffen,
or at least research, that handoff before making this generally
available :).

If I get a failure from one status.ingress[].host, I don't bother
attempting the others.  Wiring up that kind of fallback with error
aggregation when all hosts fail would be a useful future refinement.

We seem to have types in the vendored
github.com/prometheus/common/model that could be used to deserialize
this output (e.g. for table printing), but for now I'll just dump the
JSON to output.

[1]: https://docs.openshift.com/container-platform/4.14/monitoring/accessing-third-party-monitoring-apis.html
[2]: https://github.com/prometheus/alertmanager/blob/v0.26.0/api/v2/openapi.yaml#L134
[3]: https://issues.redhat.com/browse/RFE-928
[4]: https://docs.openshift.com/container-platform/4.14/security/certificate_types_descriptions/ingress-certificates.html#location
[5]: https://github.com/openshift/api/blob/4f00b4f16b634830bb16b432eec91f6e514f2981/route/v1/types.go#L95
[6]: https://github.com/openshift/api/blob/4f00b4f16b634830bb16b432eec91f6e514f2981/route/v1/types.go#L352-L353

As recommended by Simon [1].  By dropping one level down and querying
Thanos, we lose Silence access (those only come in at the Alertmanager
level), but we gain:

* The ability to retrieve alerts when Alertmanager is disabled
  (especially true with ACM satellite clusters).
* Access to pending alerts (not just firing alerts).
* Access to both platform and user-defined alerts (the platform
  Alertmanager only has access to the former).

Demo of the new functionality:

  $ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts | jq -r '[.data.alerts[] | .activeAt + " " + .state + " " + (.labels | .severity + " " + .alertname) + " " + .annotations.summary] | sort[]'
  2024-01-18T15:42:50Z firing warning TechPreviewNoUpgrade Cluster has enabled tech preview features that will prevent upgrades.
  2024-01-18T15:45:13Z firing info ClusterNotUpgradeable One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
  2024-01-19T20:20:58.083333925Z firing none Watchdog An alert that should always be firing to certify that Alertmanager is working properly.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-b-d6xzf.c.openshift-ci-build-farm.internal.
  2024-01-22T15:51:45.824704657Z firing info PodStartupStorageOperationsFailing Pods can't start because volume_mount of volume plugin kubernetes.io/configmap is permanently failing on node build0-gstfj-w-c-q85c8.c.openshift-ci-build-farm.internal.
  2024-02-05T20:09:16.894430475Z pending warning PodDisruptionBudgetAtLimit The pod disruption budget is preventing further disruption to pods.
  2024-02-05T20:12:56.428212511Z pending info ClusterAutoscalerUnschedulablePods Cluster Autoscaler has 13 unschedulable pods
  2024-02-05T20:15:43.59211931Z pending warning MachineNotYetDeleted machine build0-gstfj-ci-tests-worker-d-rrdl9 has been in Deleting phase for more than 6 hours
  ...

[1]: openshift#1674 (comment)

Simon, internally:

> ...let's go with the CMO owners...

So this brings in the approver set from [1].

That has the same people as reviewers and approvers, so I'm only
creating a single monitoring-approvers here to ease future
membership-management.  If, in the future, monitoring decides to have
distinct sets for reviewers and approvers, it would be straightforward
to add a new alias.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b7e3f50875f2bb1fed912b23fb80a101d3a786c0/OWNERS
@jan--f

jan--f commented Feb 12, 2024

👍 This looks good to me.

return alertBytes, err
}

// if we end up going this way, probably check and error on 'result' being an empty set (it should at least contain Watchdog)


I would return the list as-is. An empty list isn't expected currently, but who knows whether that will change in the future.

return nil, err
}

req.Header.Add("Authorization", fmt.Sprintf("Bearer %s", bearerToken))


As you noted above in the source code, we could leverage github.com/prometheus/client_golang/v1 + github.com/prometheus/common for token injection, but that can be deferred to a follow-up PR.

@ardaguclu the monitoring stack exposes a few API services protected by kube-rbac-proxy, which uses TokenReview + SubjectAccessReview to authenticate and authorize requests. Hence the need to provide a bearer token.

}

func (o *options) Run(ctx context.Context) error {
alertBytes, err := GetAlerts(ctx, o.getRoute, o.RESTConfig.BearerToken)
Member

@ardaguclu ardaguclu Feb 14, 2024


There are many ways to send an authenticated request to the API server in RESTConfig:

// Server requires Bearer authentication. This client will not attempt to use
// refresh tokens for an OAuth2 flow.
// TODO: demonstrate an OAuth2 compatible client.
BearerToken string `datapolicy:"token"`
// Path to a file containing a BearerToken.
// If set, the contents are periodically read.
// The last successfully read value takes precedence over BearerToken.
BearerTokenFile string
// Impersonate is the configuration that RESTClient will use for impersonation.
Impersonate ImpersonationConfig
// Server requires plugin-specified authentication.
AuthProvider *clientcmdapi.AuthProviderConfig
// Callback to persist config for AuthProvider.
AuthConfigPersister AuthProviderConfigPersister
// Exec-based authentication provider.
ExecProvider *clientcmdapi.ExecConfig
and as far as I know, BearerToken is not mandatory. For example, a user may use a credential exec plugin (ExecProvider), which skips the bearer-token step.

Member Author


Bearer token is what we doc, so it should be good enough for a env-var-gated first attempt, right?

Member


Yes, I think so

Member


A bearer token is not available for users working with a cluster-admin kubeconfig, so that might be a problem for a subset of our users.

Member


The other problem I ran into when playing with this was:

OC_ENABLE_CMD_INSPECT_ALERTS=true ./oc adm inspect-alerts 
Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority

I'm guessing you'd need to do something similar to what we do in https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_monitoring to verify that route, but I'm hesitant about that approach.

Member


Isn't there a stable API we could access? How is our console accessing that data? Can we re-use that same API?

Member Author


must-gather is using a ServiceAccount token and collecting the default ingress CA, both of which we could use. But 🤷, that's also work to implement, and it's not clear to me that we need to sink that time in before landing a subcommand behind an OC_ENABLE_CMD_INSPECT_ALERTS=true gate. Is "if there isn't a bearer token, or the request is rejected with an HTTP auth code, go see if we have the power to borrow a must-gather ServiceAccount token" the sort of thing we need for landing the initial implementation? There's always work we could invest to make any implementation better, but where are the goal-posts for merging an initial, env-var-gated command?

Member


RE: stable API, we can take up that work after the merge. This should not block the merge, as this PR is behind the OC_ENABLE_CMD_INSPECT_ALERTS=true flag.


Isn't there a stable API we could access? How our console is accessing that data? Can we re-use that same API?

The console uses the same API service to query and display the cluster alerts.

@ardaguclu
Member

/approve
for adding monitoring-approvers in OWNER_ALIASES

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2024
Member

@LalatenduMohanty LalatenduMohanty left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2024
Contributor

openshift-ci bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ardaguclu, LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Feb 23, 2024

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit a36a8e3 into openshift:master Feb 23, 2024
13 checks passed
@wking wking deleted the alert-access branch February 23, 2024 03:01
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-cli-container-v4.16.0-202402230438.p0.ga36a8e3.assembly.stream.el8 for distgit openshift-enterprise-cli.
All builds following this will include this PR.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.