Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: promtail; clean up metrics generated from logs after a config reload. #11882

Merged
merged 2 commits into from
Apr 24, 2024

Conversation

ptodev
Copy link
Contributor

@ptodev ptodev commented Feb 6, 2024

It is possible to generate metrics from log lines in Promtail. If the config file is reloaded, those metrics are currently not cleaned up. This causes a few issues:

  • If a metric is not necessary anymore, it is still visible on the /metrics endpoint.
  • A metric with the same name might have a different meaning after the config file reload, so it'd make sense to clean these up.

This will help us fix a bug reported in the Grafana Agent.

I suppose no changelog entry is required because PromQL is already designed to handle counter resets? But if you think it's appropriate, I could add a changelog entry?

I tested this change locally using a config file like this:

Promtail config
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  enable_runtime_reload: true
positions:
  filename: /Users/paulintodev/Desktop/log_metrics_bug/tmp_promtail/positions
clients:
  - url: "https://logs-prod-008.grafana.net/loki/api/v1/push"
    basic_auth:
      username: ""
      password: ""
scrape_configs:
- job_name: log files
  static_configs:
  - labels:
      __path__: /Users/paulintodev/Desktop/log_metrics_bug/*.log
  pipeline_stages:
    - metrics:
        log_lines_total:
          type: Counter
          description: "total number of log lines"
          prefix: my_promtail_custom_
          max_idle_duration: 24h
          config:
            match_all: true
            action: inc
        log_lines_total2:
          type: Counter
          description: "total number of log lines"
          prefix: my_promtail_custom2_
          max_idle_duration: 24h
          config:
            match_all: true
            action: inc

Then I'd do a curl localhost:9080/reload and check http://localhost:9080/metrics.

I didn't add a whole lot of unit tests because TBH we should probably focus on adding tests in the Agent, given that we're sunsetting Promtail over time.

@ptodev ptodev requested a review from a team as a code owner February 6, 2024 19:06
@CLAassistant
Copy link

CLAassistant commented Feb 6, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptodev need to rebase/merge in master to get some CI updates

Comment on lines +131 to +136
pl.Cleanup()

if err := testutil.GatherAndCompare(registry,
strings.NewReader("")); err != nil {
t.Fatalf("mismatch metrics: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we specifically check for the absence of the metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After pl.Cleanup(), no metrics should be visible. I don't see an advantage in checking individual metrics when there shouldn't be any of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there's originally two metrics stages, and after we reload there should be only one? we should still check for it's existence and correct value, I think right now we're deleting metrics for stages that will still exist after the reload which would not be correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that deleting all the metrics is the right thing to do. The fact that a metric exists doesn't mean that it's used in the same way in the new config. It's possible that the user changed the meaning of each metric, but retained the metric names. In that case it would make sense to start with fresh metrics.

In the future we could make the code smart and reload only the stages which changed, but I think that's a separate issue. To do that, we could avoid calling the Cleanup method for stages which are completely unchanged as a whole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should document this clearly in the metrics config section

and maybe as a note here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this line to both doc pages:

If Promtail's configuration is reloaded, all metrics will be reset.

@ptodev ptodev force-pushed the ptodev/reset-promtail-metrics branch from 62e6140 to a4e8ae0 Compare February 26, 2024 10:05
@cstyan cstyan changed the title [promtail] Clean up metrics generated from logs after a config reload. fix: promtail; clean up metrics generated from logs after a config reload. Feb 26, 2024
@ptodev ptodev force-pushed the ptodev/reset-promtail-metrics branch from a4e8ae0 to 85b0900 Compare February 27, 2024 11:01
Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptodev sorry for the delay, I think we just need some docs, also have you guys been using this code downstream in Agent/Alloy already or would merging this be the first usage of this code?

Comment on lines +131 to +136
pl.Cleanup()

if err := testutil.GatherAndCompare(registry,
strings.NewReader("")); err != nil {
t.Fatalf("mismatch metrics: %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should document this clearly in the metrics config section

and maybe as a note here as well

@ptodev ptodev force-pushed the ptodev/reset-promtail-metrics branch from 85b0900 to 1d1261d Compare April 23, 2024 15:31
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Apr 23, 2024
@ptodev ptodev force-pushed the ptodev/reset-promtail-metrics branch from 1d1261d to a300762 Compare April 23, 2024 15:34
@ptodev
Copy link
Contributor Author

ptodev commented Apr 23, 2024

have you guys been using this code downstream in Agent/Alloy already or would merging this be the first usage of this code?

Thank you for the review! This is the first usage. Alloy and Agent import the Promtail code, so in order to update them I will need to merge this PR first.

@ptodev ptodev requested a review from cstyan April 23, 2024 15:38
Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to approve/merge this with a note that if we get reports of issues I'll revert this change. Hopefully we here from the Alloy team after a short period of time whether they have confirmed usages that are working well or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport k190 size/L type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failed to unregister all metrics from previous promtail. THIS IS A BUG
4 participants