Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(operator): Restructure LokiStack metrics #12228

Merged
merged 7 commits into from Mar 19, 2024

Conversation

xperimental
Copy link
Collaborator

@xperimental xperimental commented Mar 15, 2024

What this PR does / why we need it:

The current implementation of the LokiStack metrics emitted by Loki Operator sometimes keeps old versions of the metrics. This leads to the metrics not properly reflecting the current state of the deployed LokiStack resources.

This PR replaces the current implementation which relied on being called by the reconciliation loop by an implementation that fetches the current information of the LokiStack resources by itself and generates metrics from that data. As the list of LokiStack resources should already be cached in the informer this should not put any additional stress on the Kubernetes API server.

Which issue(s) this PR fixes:

LOG-5253

Special notes for your reviewer:

  • This completely replaces the existing metrics. Only two metrics remain:
    • lokistack_info containing information about the deployed LokiStack resources as labels (currently limited to the "size")
    • lokistack_status_condition showing the current state of the Conditions available in the LokiStack status. This setup mirrors the metrics provided by kube_state_metrics
  • The alert for "needs storage schema update" has been updated to reflect the new metrics setup. I'm not aware of any other usage of these metrics so far.
  • Calling List on each metrics scrape seems wasteful, but my theory is that this data should be available in the informer cache of the operator. I had a look at the apiserver metrics while running this operator version compared to a version without this change and it does look like this is true and the change does not have any effect on the apiserver.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Tests updated
  • CHANGELOG.md updated

@xperimental xperimental self-assigned this Mar 15, 2024
@xperimental xperimental requested review from periklis and a team as code owners March 15, 2024 16:52
Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xperimental
Copy link
Collaborator Author

I wonder, if it would be a good idea to have a metric exposing the state of all conditions on the LokiStack (similar to what kube_state_metrics does for the standard types) instead of just having a metric for the warnings.

That way, we (or customers) could also produce alerts on the other conditions and, for example, automatically warn when a LokiStack fails to become ready for a while.

@xperimental xperimental merged commit ebdf8fe into grafana:main Mar 19, 2024
18 checks passed
@xperimental xperimental deleted the fix-telemetry branch March 19, 2024 11:42
edsoncelio pushed a commit to edsoncelio/loki that referenced this pull request Mar 22, 2024
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants