Loki High CPU Usage #3020

Closed
cf-sewe opened this issue Dec 2, 2020 · 14 comments · Fixed by #3347

Comments

cf-sewe commented Dec 2, 2020

Describe the bug
After some time, Loki runs at 100% CPU.

To Reproduce
Steps to reproduce the behavior:

  • No deterministic steps to reproduce; it happens after some time
  • Happened twice, in different environments
  • Originally reported in Slack some weeks ago

Expected behavior

Environment:

  • Kubernetes 1.18 (AWS)
  • loki monolithic deployment via Helm
    • Loki v2.0.0 runs with a CPU limit of 1 (one core). Loki is normally almost idle.
    • Promtail 2.0.0
  • boltdb-shipper
  • AWS S3

Screenshots, Promtail config, or terminal output

  • pprof summary:
Showing nodes accounting for 5400ms, 63.83% of 8460ms total
Dropped 88 nodes (cum <= 42.30ms)
Showing top 10 nodes out of 111
      flat  flat%   sum%        cum   cum%
     760ms  8.98%  8.98%      760ms  8.98%  runtime.duffcopy
     760ms  8.98% 17.97%      770ms  9.10%  unicode/utf8.DecodeRuneInString
     720ms  8.51% 26.48%     1510ms 17.85%  github.com/prometheus/prometheus/promql/parser.(*Lexer).next
     690ms  8.16% 34.63%      800ms  9.46%  runtime.step
     610ms  7.21% 41.84%     1440ms 17.02%  runtime.pcvalue
     450ms  5.32% 47.16%     7920ms 93.62%  github.com/prometheus/prometheus/promql/parser.(*yyParserImpl).Parse
     440ms  5.20% 52.36%     2230ms 26.36%  runtime.gentraceback
     370ms  4.37% 56.74%      870ms 10.28%  github.com/prometheus/prometheus/promql/parser.lexInsideBraces
     330ms  3.90% 60.64%     3100ms 36.64%  github.com/prometheus/prometheus/promql/parser.(*Lexer).NextItem
     270ms  3.19% 63.83%      270ms  3.19%  runtime.futex
  • pprof trace (10s)
    pprof_trace.zip

  • Loki at 100% CPU in k9s
    image

  • grafana dashboard
    image

  • kubernetes dashboard for loki-stack
    image
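
When reading the pprof summary above, note that the `cum%` column includes time spent in callees, so the rows are not additive. The following is a minimal sketch, not part of the original report, that ranks a few of the rows above by cumulative share. The helper name `parseTop` and the `sample` constant are hypothetical; the rows are copied from the summary.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// node holds one row of a pprof "top" table (hypothetical helper type).
type node struct {
	name   string
	cumPct float64
}

// parseTop extracts the function name and cum% from rows laid out as:
// flat  flat%  sum%  cum  cum%  name
// and returns them sorted by cum%, highest first.
func parseTop(table string) []node {
	var nodes []node
	for _, line := range strings.Split(table, "\n") {
		f := strings.Fields(line)
		// Skip headers and anything that is not a full data row.
		if len(f) < 6 || !strings.HasSuffix(f[4], "%") {
			continue
		}
		pct, err := strconv.ParseFloat(strings.TrimSuffix(f[4], "%"), 64)
		if err != nil {
			continue
		}
		nodes = append(nodes, node{name: strings.Join(f[5:], " "), cumPct: pct})
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].cumPct > nodes[j].cumPct })
	return nodes
}

// sample contains a subset of the rows from the summary in this report.
const sample = `
      flat  flat%   sum%        cum   cum%
     760ms  8.98%  8.98%      760ms  8.98%  runtime.duffcopy
     720ms  8.51% 26.48%     1510ms 17.85%  github.com/prometheus/prometheus/promql/parser.(*Lexer).next
     450ms  5.32% 47.16%     7920ms 93.62%  github.com/prometheus/prometheus/promql/parser.(*yyParserImpl).Parse
     440ms  5.20% 52.36%     2230ms 26.36%  runtime.gentraceback
     330ms  3.90% 60.64%     3100ms 36.64%  github.com/prometheus/prometheus/promql/parser.(*Lexer).NextItem
`

func main() {
	// Print the three rows with the largest cumulative share.
	for _, n := range parseTop(sample)[:3] {
		fmt.Printf("%6.2f%%  %s\n", n.cumPct, n.name)
	}
}
```

Ranked this way, `parser.(*yyParserImpl).Parse` comes first at 93.62% cumulative, meaning nearly all sampled CPU time was spent at or below the PromQL parser during the capture window.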

@cyriltovena
Contributor

Nothing shocking in the trace: someone is running a metric query, so I'm not sure why you would expect Loki not to use CPU. Do you know what the query looks like?

cyriltovena commented Dec 4, 2020

Note that in the same second you receive a push, you are also running a query.

LucaDev commented Dec 5, 2020

I'm having the same issues.
image
After a restart, the CPU usage goes back down.
image

cyriltovena commented Dec 5, 2020 via email

cf-sewe commented Dec 16, 2020

Actually, the initial profile was already a CPU profile; I just forgot to upload it ;)
I have now retested with a 20s window; all other pprof options are defaults.

[ec2-user@ip-192-168-100-136 ~]$ go tool pprof http://localhost:3100/debug/pprof/profile?seconds=20
Fetching profile over HTTP from http://localhost:3100/debug/pprof/profile?seconds=20
Saved profile in /home/ec2-user/pprof/pprof.loki.samples.cpu.002.pb.gz
File: loki
Type: cpu
Time: Dec 16, 2020 at 6:29am (UTC)
Duration: 20.12s, Total samples = 15.51s (77.10%)

pprof.loki.samples.cpu.002.pb.gz
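
If `go tool pprof` is not available on the host, the same endpoint can be fetched with a plain HTTP client. This is a minimal sketch, assuming Loki listens on `localhost:3100` as in the command above; the function names `profileURL` and `saveProfile` and the output file name `loki-cpu.pb.gz` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"time"
)

// profileURL builds the CPU profile URL for a service that exposes
// the standard /debug/pprof/profile endpoint.
func profileURL(base string, seconds int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/debug/pprof/profile"
	q := url.Values{}
	q.Set("seconds", strconv.Itoa(seconds))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// saveProfile downloads a CPU profile and writes it to the file out.
func saveProfile(base string, seconds int, out string) error {
	target, err := profileURL(base, seconds)
	if err != nil {
		return err
	}
	// The server only responds after the sampling window has elapsed,
	// so the client timeout must be longer than that window.
	client := &http.Client{Timeout: time.Duration(seconds+30) * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Assumed address and window, taken from the command in this comment.
	if err := saveProfile("http://localhost:3100", 20, "loki-cpu.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, "profile capture failed:", err)
		os.Exit(1)
	}
	fmt.Println("saved loki-cpu.pb.gz")
}
```

The saved file can then be copied to a machine with the Go toolchain and opened with `go tool pprof`.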

LucaDev commented Dec 18, 2020

Thank you @cf-sewe. @cyriltovena, I haven't been able to reproduce the bug again so far. It has happened twice.

stale bot commented Jan 17, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jan 17, 2021
@stale stale bot closed this as completed Jan 25, 2021

LucaDev commented Jan 29, 2021

@cf-sewe has the error occurred again recently?

cf-sewe commented Feb 1, 2021

Yes, that CPU hog occurs reliably after some time in my environment, though I have only seen it on the old Loki version 2.0.0.
I have not updated Loki to v2.1 yet, and won't soon.

xal3xhx commented Feb 15, 2021

The issue is occurring for me as well, on a fresh install of everything. I will be back tomorrow with more info and details of my current setup.

LucaDev commented Feb 15, 2021

This should be reopened.

xal3xhx commented Feb 15, 2021

I agree this needs to be reopened. I'm not home, so I can't test anything right now, but I'm running the most up-to-date version of TrueNAS with Grafana inside a jail.

Loki runs in the same jail and is also the most up-to-date version, with default configs.

Promtail runs inside another jail on the same system. That jail runs nginx, and Promtail reads its logs using a standard config with 2 replaces.

The dashboard in Grafana is the Loki V2.0 nginx dashboard.

After some time the CPU on the system gets pinned to 100%. I believe it was around 1-2 hours, but I could be completely wrong there.

Stopping Promtail has no effect, which shows it's a Loki issue; restarting Loki fixes the problem for a little while.

As a final note, both Promtail and Loki are started with the system through an rc.d file using the daemon command.

I will post configs and other reports when I get home later today.

xal3xhx commented Feb 15, 2021

After looking through the other issues posted, it looks like this might be related to issue #3275.

@cyriltovena
Contributor

#3275 (comment)

cyriltovena added a commit to cyriltovena/loki that referenced this issue Feb 17, 2021
I've also added a test to prove the issue was happening and now is fixed.

Fixes grafana#3275
Fixes grafana#3264
Fixes grafana#3020

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
@cyriltovena cyriltovena reopened this Feb 17, 2021
@stale stale bot removed the stale A stale issue or PR that will automatically be closed. label Feb 17, 2021
owen-d pushed a commit that referenced this issue Feb 18, 2021
…3347)

I've also added a test to prove the issue was happening and now is fixed.

Fixes #3275
Fixes #3264
Fixes #3020

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>