Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

root: fix 100% CPU for worker container (#7025) #7762

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DKingAlpha
Copy link

Some linux users (arch linux, for example) are running docker with default service file that set NOFILES to infiite, which will cause celery to hang for hours to days taking 100% CPU to close all fds by enumerating from NOFILES to 3.

This commit override ulimit for container without touching user docker service configuration.

For details see #7025

Some linux users (arch linux, for example) are running docker with default service file that set NOFILES to infiite, which will cause celery to hang for hours to days taking 100% CPU to close all fds by enumerating from NOFILES to 3.

This commit override ulimit for container without touching user docker service configuration.

Signed-off-by: DKing <15340687+DKingAlpha@users.noreply.github.com>
@DKingAlpha DKingAlpha requested a review from a team as a code owner December 3, 2023 17:54
Copy link

netlify bot commented Dec 3, 2023

Deploy Preview for authentik-storybook failed.

Name Link
🔨 Latest commit 3b0a1ac
🔍 Latest deploy log https://app.netlify.com/sites/authentik-storybook/deploys/656cc0d92b027400084688ae

Copy link
Member

@BeryJu BeryJu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, thanks for digging into this! I'd like to get confirmation from some more people in #7025 if it fixes their issue before merging this. Also we should probably add a similar thing for the helm chart (https://github.com/goauthentik/helm) if possible

Copy link

codecov bot commented Dec 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.64%. Comparing base (2bc4506) to head (3b0a1ac).
Report is 2414 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7762      +/-   ##
==========================================
+ Coverage   92.62%   92.64%   +0.02%     
==========================================
  Files         588      588              
  Lines       29141    29141              
==========================================
+ Hits        26991    26997       +6     
+ Misses       2150     2144       -6     
Flag Coverage Δ
e2e 50.72% <ø> (+0.02%) ⬆️
integration 25.94% <ø> (ø)
unit 89.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@DKingAlpha
Copy link
Author

Tried to ping someone, waiting for feedback.

Meanwhile you can easily reproduce by changing the ulimit to 0x3ffffff8 in decimal. That should kinda prove it.

@rissson rissson self-requested a review December 3, 2023 19:46
@rissson
Copy link
Member

rissson commented Dec 4, 2023

For reference:

Here's what I have at home:

$ ulimit -Sn
1048576
# ulimit -Hn
1048576

And we have the same in production at authentik

Copy link

@Leptopoda Leptopoda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes the issue for me.

OS: clear-linux
# ulimit -Sn
1024
# ulimit -Hn
524288

@mobiledude
Copy link

mobiledude commented Jan 3, 2024

Fixes the issue for me.

OS: clear-linux
# ulimit -Sn
1024
# ulimit -Hn
524288

Any idea someone how I can change this for my Authentik worker on unraid? It’s been running high cpu usage for days now? Help is appreciated.

Reporting back (unraid-solved):
In hind side I did 3 things, not sure what solved it. 1) in the Unraid template I added "-ulimit nofile=10240:10240" in Extra Parameters field as flag (advanced view) 2) redeployed (removing containers and images) both worker and authentik. 3) added AUTHENTIK_REDIS__DB:1 as variable to the unraid template for both Worker and authentik. Now everything seems normal.

@cenkalti
Copy link

cenkalti commented Jan 3, 2024

Do you know why setting ulimit to a larger number fixes the issue? Is it an issue in Celery or Authentik?

@tanberry tanberry added the status/reviewing thanks for opening, we're taking a look label Feb 23, 2024
@BeryJu
Copy link
Member

BeryJu commented Mar 15, 2024

With #7810, #8440 and #7813 this shouldn't be an issue anymore, could you check this again with 2024.2.2 @cenkalti @mobiledude @Leptopoda @DKingAlpha

@BeryJu BeryJu added status/awaiting-contributor Awaiting a response from the contributor and removed status/reviewing thanks for opening, we're taking a look labels Mar 15, 2024
@Leptopoda
Copy link

Thanks for coming back.
I can confirm that I no longer need the workaround on 2024.02.2

@DKingAlpha
Copy link
Author

I still got high cpu usage with latest 2024.2.2. Adding ulimit back to compose.yml fixed the issue for me.

live py profiler py-spy is incompatible with recent py3.12, I will find another way to identify the issue when I have time.

@cenkalti
Copy link

I still experience high CPU usage with the latest 2024.2.2 version. However, I was able to resolve the issue by adding ulimit back to the 'compose.yml' file.

@MyIgel
Copy link

MyIgel commented Mar 20, 2024

I can confirm that it fixed my setup too so its imho worth merging 🎉

@SpiderD555
Copy link

SpiderD555 commented Mar 26, 2024

New authentik user here. Tried re-setting Redis, tried setting ulimits in docker-compose, unfortunately CPU still spikes at 100%. After some more troubleshooting I did increase the RAM allocation to the VM (while still leaving custom ulimits in pace), and suddenly it all started to work - no CPU spikes

@Janhouse
Copy link

Same here with 2024.4.2, ulimits in the compose fixed the issue.

@arthurlockman
Copy link

Adding ulimits back to compose fixed my issue on 2024.4.2.

@BeryJu
Copy link
Member

BeryJu commented Jun 1, 2024

For context, the reason why we haven't merged this PR:

  • Configuring ulimit values for containers is possible with compose, but not possible with kubernetes, so it would only solve half the problem
  • It's also more of a bandaid than a full solution, this ulimit adjustment shouldn't be required and changing the values seems to just prevent a bug in either our code/our usage of celery, or celery itself from happening, instead of fixing the root cause itself.

@ForsakenRei
Copy link

I can confirm this fixed my issue for docker running on Oracle Linux 9.4

@cubic3d
Copy link
Contributor

cubic3d commented Aug 8, 2024

Just deployed a fresh compose install on current Arch linux using 2024.6.3. Celery was taking one core to 100%. Setting ulimits for nofile resolved the issue.

I'm not experiencing this in k8s running on Talos.

@justin8
Copy link

justin8 commented Oct 9, 2024

Thanks! This fixed the issue for me as well.

justin8 added a commit to justin8/docker-stacks that referenced this pull request Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/awaiting-contributor Awaiting a response from the contributor
Projects
None yet
Development

Successfully merging this pull request may close these issues.