karma hangs after X time with too many files again? #2944

Closed
jonaz opened this issue Apr 1, 2021 · 6 comments · Fixed by #2967

@jonaz

jonaz commented Apr 1, 2021

It might be related to #2888, but this time it took a while longer to stop working. We still have 24 alertmanagers, and the interval is now 30s with a 10s timeout on each one.

# /usr/local/bin/karma --version
v0.81

Let me know if you need more info.
The number of sockets reported by lsof is growing, as it did last time:

# lsof | grep karma | wc -l
53415

and it's still growing. It will probably start responding with errors when it hits 65k.
These are the same symptoms as in #2888.

@prymitive prymitive self-assigned this Apr 2, 2021
@prymitive
Owner

I was able to reproduce a hang when working on #2888, but with that fixed I can't anymore.
Do you scrape karma metrics, and do you have a chart of process_open_fds and http_request_count_total over that 24h?

@jonaz
Author

jonaz commented Apr 2, 2021

Karma stops responding on /metrics when this happens. It then silently leaks sockets, according to lsof, until it reaches the limit and spams "too many files" errors in the logs. Going from not responding to spamming errors took a few days. We have 3 instances, so we can afford to leave a non-working one running for troubleshooting.

This is probably a rarer deadlock, since it has only happened once since we updated to the fix in #2888. I'll update here when we run into it again.

@prymitive
Owner

I don’t see any leakage so there’s a chance client connections are not being closed.
I’ve added HTTP server read and write timeouts, as I realised there are none at the moment. Try the latest master image to see if it helps.

@jonaz
Author

jonaz commented Apr 6, 2021

Is there a possible deadlock in https://github.com/prymitive/karma/blob/main/internal/alertmanager/models.go#L498 ?
Should it unlock sooner instead of using defer?
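
To illustrate the general pattern being asked about (this is not the code at the linked line; the `upstream` type and its methods are hypothetical), here is a minimal Go sketch contrasting a deferred unlock with an early unlock. With `defer`, the read lock stays held for the whole function, including any slow or re-entrant work, so a writer cannot proceed until it returns. Copying the needed state and unlocking first keeps the critical section short. It assumes Go 1.18 or newer for `sync.RWMutex.TryLock`:

```go
package main

import (
	"fmt"
	"sync"
)

// upstream is a hypothetical stand-in for a struct guarded by a RWMutex.
type upstream struct {
	mu   sync.RWMutex
	name string
}

// withDeferredUnlock runs work while the read lock is still held.
func (u *upstream) withDeferredUnlock(work func(name string)) {
	u.mu.RLock()
	defer u.mu.RUnlock()
	work(u.name)
}

// withEarlyUnlock copies the state it needs, releases the lock, then runs work.
func (u *upstream) withEarlyUnlock(work func(name string)) {
	u.mu.RLock()
	name := u.name
	u.mu.RUnlock()
	work(name)
}

// writerCanProceed reports whether a writer could take the lock right now.
func (u *upstream) writerCanProceed() bool {
	if u.mu.TryLock() {
		u.mu.Unlock()
		return true
	}
	return false
}

func main() {
	u := &upstream{name: "am1"}
	u.withDeferredUnlock(func(string) {
		fmt.Println("deferred unlock, writer can proceed:", u.writerCanProceed())
	})
	u.withEarlyUnlock(func(string) {
		fmt.Println("early unlock, writer can proceed:", u.writerCanProceed())
	})
}
```

If the work done under the deferred lock ever waits on something that itself needs the write lock, the two sides wait on each other and the program hangs, which is the kind of deadlock an early unlock avoids.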

@prymitive
Owner

That seems likely, good catch

@prymitive
Owner

Merged a fix for that, let me know if you still see any issues, thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2021