WAL watcher has high CPU usage with low throughput #11625

Closed
rfratto opened this issue Nov 24, 2022 · 3 comments

Comments

@rfratto
Member

rfratto commented Nov 24, 2022

grafana/agent#1148 reports unexpectedly high CPU usage (~4-6%) at very low metric workloads.

I've tracked this down, at least in part, to the WAL watcher, which tails the latest WAL segment every 10ms and checks for new segments every 100ms. That polling rate is very high for a process which only scrapes a single target every minute, and it wastes a lot of CPU time.

To verify the issue, I tried a read period of 1s, a segment read period of 1m, and a checkpoint period of 1h. This lowered my CPU usage from 4% to <1%.

I see a few potential resolutions to this issue:

  1. Use fsnotify to watch for changes to the WAL segment, WAL directory, and WAL checkpoint
  2. Emit some kind of event (maybe through a channel or *sync.Cond) from the storage.Appendable implementation to the WAL reader to notify that new samples have been written to the WAL
  3. Allow the user to specify how often the WAL is read during tailing, via command-line flags or in the remote_write configuration block.

Is there a reason fsnotify isn't already being used for this?

@rfratto rfratto changed the title High CPU usage at low workloads WAL watcher has high CPU usage at low workloads Nov 24, 2022
@rfratto rfratto changed the title WAL watcher has high CPU usage at low workloads WAL watcher has high CPU usage with low throughput Nov 24, 2022
@rfratto
Member Author

rfratto commented Nov 24, 2022

> Is there a reason fsnotify isn't already being used for this?

I played around with an fsnotify approach and couldn't get it to work as expected. At least on my Mac, fsnotify was only emitting events every ~60 seconds with a 5 second scrape interval. I can't figure out why, which isn't giving me much faith in the fsnotify-based approach.

@erikbaranowski

erikbaranowski commented Nov 30, 2022

Some additional brainstorming... a combination of 1 and 2 could be considered.

  1. Reduce the CPU cost of the watcher (see @rfratto proposals above)
  2. Reduce the frequency of the watcher and consider allowing the user to control variables described below:
    1. Reduce frequency to a slower set speed
    2. Create 'autoscaling' frequency
      1. Have an idle vs active frequency. For example, if the watcher has not found any changes for 5 seconds, drop to an idle frequency until a change is detected, then speed back up
      2. Something even crazier with multiple tiers and smart logic that tries to home in on the right frequency for a detected level of hits. This might need some serious testing and justification given the complexity it could introduce.

@gouthamve
Member

Closed by #11949

@prometheus prometheus locked as resolved and limited conversation to collaborators Jan 17, 2024