hugo stalls when hitting hard memory limit #12129

Closed
TiGR opened this issue Feb 23, 2024 · 14 comments

Comments

TiGR commented Feb 23, 2024

I'm not sure how to reproduce it; perhaps by reducing HUGO_MEMORYLIMIT and testing against some huge website. In the case of our huge website we can normally run hugo build with an 8 GB memory limit, but if I run hugo server with the same limit, hugo hits the limit at some point, CPU usage drops to 0%, and hugo just sits there doing nothing (no new files are created either).

Running hugo server --logLevel info

Start building sites … 
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown

INFO  static: syncing static files to / duration 24.426106ms
INFO  build:  step process substep collect files 37000 files_total 37000 duration 3.013660535s
INFO  build:  step process substep collect files 102422 files_total 139422 duration 2.721319973s
INFO  build:  step process duration 5.735132833s
INFO  build:  step assemble duration 4.368413267s
INFO  build:  step render substep pages site en outputFormat html duration 1m14.605222904s
INFO  build:  step render substep pages site en outputFormat csv duration 439.775463ms
INFO  build:  step render substep pages site en outputFormat json duration 1.694182069s
INFO  build:  step render substep pages site en outputFormat rss duration 856.068872ms
INFO  dynacache: adjusted partitions' max size evicted 7858 numGC 64 limit 7.68 GB alloc 8.00 GB totalAlloc 83.67 GB
INFO  dynacache: adjusted partitions' max size evicted 7237 numGC 65 limit 7.68 GB alloc 8.17 GB totalAlloc 87.76 GB
INFO  dynacache: adjusted partitions' max size evicted 6099 numGC 70 limit 7.68 GB alloc 8.09 GB totalAlloc 107.07 GB
INFO  dynacache: adjusted partitions' max size evicted 7398 numGC 71 limit 7.68 GB alloc 8.23 GB totalAlloc 107.39 GB
INFO  dynacache: adjusted partitions' max size evicted 2218 numGC 71 limit 7.68 GB alloc 7.96 GB totalAlloc 111.00 GB
INFO  build:  step render substep pages site es outputFormat html duration 1m17.620461596s
INFO  build:  step render substep pages site es outputFormat csv duration 376.800492ms
INFO  build:  step render substep pages site es outputFormat json duration 1.446087386s
INFO  build:  step render substep pages site es outputFormat rss duration 582.231744ms
INFO  dynacache: adjusted partitions' max size evicted 8436 numGC 75 limit 7.68 GB alloc 8.16 GB totalAlloc 125.20 GB
INFO  dynacache: adjusted partitions' max size evicted 4562 numGC 76 limit 7.68 GB alloc 7.76 GB totalAlloc 128.92 GB
INFO  dynacache: adjusted partitions' max size evicted 2837 numGC 77 limit 7.68 GB alloc 8.18 GB totalAlloc 133.44 GB
INFO  dynacache: adjusted partitions' max size evicted 3769 numGC 78 limit 7.68 GB alloc 8.42 GB totalAlloc 137.72 GB
INFO  dynacache: adjusted partitions' max size evicted 7235 numGC 79 limit 7.68 GB alloc 8.72 GB totalAlloc 142.06 GB
INFO  dynacache: adjusted partitions' max size evicted 7577 numGC 80 limit 7.68 GB alloc 8.08 GB totalAlloc 145.92 GB
INFO  dynacache: adjusted partitions' max size evicted 8250 numGC 81 limit 7.68 GB alloc 7.89 GB totalAlloc 149.91 GB
INFO  dynacache: adjusted partitions' max size evicted 7581 numGC 82 limit 7.68 GB alloc 7.77 GB totalAlloc 153.98 GB
INFO  dynacache: adjusted partitions' max size evicted 3157 numGC 84 limit 7.68 GB alloc 8.19 GB totalAlloc 162.63 GB
INFO  dynacache: adjusted partitions' max size evicted 1492 numGC 85 limit 7.68 GB alloc 7.99 GB totalAlloc 166.28 GB
INFO  dynacache: adjusted partitions' max size evicted 5796 numGC 86 limit 7.68 GB alloc 7.97 GB totalAlloc 167.16 GB
INFO  dynacache: adjusted partitions' max size evicted 2363 numGC 86 limit 7.68 GB alloc 8.10 GB totalAlloc 170.69 GB

And here it just hangs. I'd expect it to fail somehow if it has run out of memory.

Also, if I reduce the limit, regular hugo build also stalls:

$ HUGO_MEMORYLIMIT=3 ~/go/bin/hugo --logLevel info
Start building sites … 
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown

INFO  static: syncing static files to / duration 146.63746ms
INFO  build:  step process substep collect files 37000 files_total 37000 duration 3.00951775s
INFO  build:  step process substep collect files 102422 files_total 139422 duration 2.852218241s
INFO  build:  step process duration 5.861893274s
INFO  build:  step assemble duration 4.53627309s
INFO  dynacache: adjusted partitions' max size evicted 2282 numGC 26 limit 3.00 GB alloc 4.09 GB totalAlloc 7.56 GB
INFO  dynacache: adjusted partitions' max size evicted 4109 numGC 27 limit 3.00 GB alloc 3.62 GB totalAlloc 8.89 GB
INFO  dynacache: adjusted partitions' max size evicted 3740 numGC 27 limit 3.00 GB alloc 4.76 GB totalAlloc 10.04 GB
INFO  dynacache: adjusted partitions' max size evicted 2188 numGC 30 limit 3.00 GB alloc 4.38 GB totalAlloc 16.54 GB

And here it stalled.
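
The dynacache lines in both logs appear to report Go runtime memory statistics (numGC, alloc and totalAlloc presumably correspond to fields of runtime.MemStats) at the moments the cache reacts to the configured limit. Below is a minimal sketch of that kind of GC-stats-driven limit check; it is illustrative only, with a hypothetical evictOne callback, and is not Hugo's actual dynacache code:

```go
// A minimal sketch of a GC-stats-driven cache limit check, for illustration
// only; assumed logic, not Hugo's actual dynacache implementation.
package main

import (
	"fmt"
	"runtime"
)

var limitGB = 7.68 // e.g. derived from HUGO_MEMORYLIMIT

// maybeEvict reads the runtime memory stats and, while the heap allocation is
// above the limit, evicts entries until it drops below it or the cache is empty.
func maybeEvict(evictOne func() bool) {
	limit := uint64(limitGB * 1024 * 1024 * 1024)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	evicted := 0
	for m.Alloc > limit && evictOne() {
		evicted++
		if evicted%1000 == 0 { // re-read stats periodically, not on every eviction
			runtime.GC()
			runtime.ReadMemStats(&m)
		}
	}
	if evicted > 0 {
		fmt.Printf("evicted %d numGC %d limit %.2f GB alloc %.2f GB totalAlloc %.2f GB\n",
			evicted, m.NumGC, limitGB, float64(m.Alloc)/1e9, float64(m.TotalAlloc)/1e9)
	}
}

func main() {
	// Hypothetical cache with nothing left to evict; a real caller would drop
	// least-recently-used entries in the callback.
	maybeEvict(func() bool { return false })
}
```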

@bep bep self-assigned this Feb 23, 2024
@bep bep removed the NeedsTriage label Feb 23, 2024
@bep bep added this to the v0.123.4 milestone Feb 23, 2024
@bep bep added the v0.123.x label Feb 23, 2024

bep commented Feb 23, 2024

I suspect this is a deadlock with the cache mutex (or something), which isn't the easiest thing in the world to debug, but I guess I should be able to trigger this. I have tested this pretty heavily using a 1 GB limit on some very big sites, but I guess the devil is in the details.

A temporary workaround, I guess, would be to set the memorylimit to something very high.
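
To illustrate the kind of problem a cache mutex deadlock is: Go's sync.Mutex is not reentrant, so if the eviction path calls back into code that takes the same lock while it is still held, the second Lock blocks forever. A hypothetical sketch of that pattern (not Hugo's actual cache code):

```go
// Hypothetical cache illustrating a re-entrant locking deadlock: the eviction
// callback runs while the cache mutex is held and then tries to take it again.
package main

import "sync"

type cache struct {
	mu      sync.Mutex
	entries map[string]string
	onEvict func(key string)
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	return v, ok
}

func (c *cache) evictAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		delete(c.entries, k)
		c.onEvict(k) // BUG: callback runs with c.mu still held...
	}
}

func main() {
	c := &cache{entries: map[string]string{"a": "1"}}
	c.onEvict = func(key string) {
		c.Get(key) // ...and Get tries to lock c.mu again: deadlock.
	}
	c.evictAll() // in a big program this just hangs; here the runtime detects it
}
```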

@bep bep closed this as completed Feb 23, 2024
@bep bep reopened this Feb 23, 2024
bep added a commit that referenced this issue Feb 23, 2024

bep commented Feb 23, 2024

I have pushed 564bae0 -- which I don't think will fix your problem.

I have spent some time building some pretty beefy sites with HUGO_MEMORYLIMIT=1, but I haven't been able to force this issue.

Note that I don't doubt that this is a real issue that needs to be fixed, but I need to look closer to reproduce it.

I notice from the log that the amount of memory you allocate is on a completely different level than the sites I'm testing on. Can you elaborate a little on what all the memory is used for? Image processing? Do you have an image cache?

TiGR commented Feb 23, 2024

It is really hard for me to determine what hugo uses the memory for. Yes, we have about 2,500 images and around 12,000 pages (multiplied by 10 translations). Most of the images have previews generated, and many of them are additionally resized and written out as both JPEG and WebP; the base-language build generates about 25,000 images.

Do you have a image cache?

Do you mean resources/_gen? If so, then yes.

TiGR commented Feb 24, 2024

I have pushed 564bae0 -- which I don't think will fix your problem.

It does not; hugo still stalls on master.

@bep bep modified the milestones: v0.123.4, v0.123.5 Feb 26, 2024

TiGR commented Feb 27, 2024

We have started seeing stalls in our cloud builds (we have hugo 0.123.2 there at this point). We use the same memory limits we had for 0.122, that is, HUGO_MEMORYLIMIT is 22. During the build hugo does not exceed 11 GB, but we see random, unpredictable stalls. I can't reproduce anything like this on my machine, but we keep hitting this problem in the cloud.

Example build:

time=2024-02-27T11:24:30Z | msg="Start building sites … "
time=2024-02-27T11:24:30Z | msg="hugo v0.123.2-929b91fe75cb0d041f22b4707700dfc117115ad4+extended linux/amd64 BuildDate=2024-02-22T15:27:15Z VendorInfo=gohugoio"
time=2024-02-27T11:24:30Z | msg="INFO  static: syncing static files to /"
time=2024-02-27T11:24:30Z | msg=" duration 141.026102ms"
time=2024-02-27T11:24:32Z | msg="INFO  dynacache: adjusted partitions' max size evicted 4537 numGC 16 limit 22.00 GB alloc 0.17 GB totalAlloc 0.37 GB"
time=2024-02-27T11:24:33Z | msg="INFO  build:  step process substep collect files 41000 files_total 41000 duration 3.014767052s"
time=2024-02-27T11:24:36Z | msg="INFO  build:  step process substep collect files 38000 files_total 79000 duration 3.004260141s"
time=2024-02-27T11:24:39Z | msg="INFO  build:  step process substep collect files 33000 files_total 112000 duration 3.028676451s"
time=2024-02-27T11:24:42Z | msg="INFO  build:  step process substep collect files 27870 files_total 139870 duration 2.447407666s"
time=2024-02-27T11:24:42Z | msg="INFO  build:  step process duration 11.495399175s"
time=2024-02-27T11:24:49Z | msg="INFO  build:  step assemble duration 7.466807118s"
time=2024-02-27T11:27:08Z | msg="INFO  build:  step render substep pages site en outputFormat html duration 2m18.177601878s"
time=2024-02-27T11:27:08Z | msg="INFO  build:  step render substep pages site en outputFormat csv duration 333.367446ms"
time=2024-02-27T11:27:12Z | msg="INFO  build:  step render substep pages site en outputFormat json duration 3.168703346s"
time=2024-02-27T11:27:13Z | msg="INFO  build:  step render substep pages site en outputFormat rss duration 1.284847777s"
time=2024-02-27T11:29:16Z | msg="INFO  build:  step render substep pages site es outputFormat html duration 2m3.000824066s"
time=2024-02-27T11:29:17Z | msg="INFO  build:  step render substep pages site es outputFormat csv duration 313.333325ms"
time=2024-02-27T11:29:19Z | msg="INFO  build:  step render substep pages site es outputFormat json duration 1.968936201s"
time=2024-02-27T11:29:20Z | msg="INFO  build:  step render substep pages site es outputFormat rss duration 818.80005ms"
time=2024-02-27T11:31:18Z | msg="INFO  build:  step render substep pages site de outputFormat html duration 1m58.147270281s"
time=2024-02-27T11:31:19Z | msg="INFO  build:  step render substep pages site de outputFormat csv duration 333.32989ms"
time=2024-02-27T11:31:21Z | msg="INFO  build:  step render substep pages site de outputFormat json duration 2.148809437s"
time=2024-02-27T11:31:22Z | msg="INFO  build:  step render substep pages site de outputFormat rss duration 781.219217ms"
time=2024-02-27T11:33:18Z | msg="INFO  build:  step render substep pages site ru outputFormat html duration 1m55.313464896s"
time=2024-02-27T11:33:18Z | msg="INFO  build:  step render substep pages site ru outputFormat csv duration 319.735755ms"
time=2024-02-27T11:33:20Z | msg="INFO  build:  step render substep pages site ru outputFormat json duration 1.988445374s"
time=2024-02-27T11:33:21Z | msg="INFO  build:  step render substep pages site ru outputFormat rss duration 778.596956ms"

And here it stalled, even though there were no dynacache INFO messages indicating GC-triggered evictions. So it seems the stalls happen not only under extreme memory pressure, but randomly whenever GC runs.

@bep bep modified the milestones: v0.123.5, v0.123.6 Feb 28, 2024

bep commented Feb 29, 2024

So, it seems that stalls are happening not only in extreme memory conditions, but randomly anytime GC is run.

So, it's most likely a mutex deadlock, but it's not obvious to me from inspecting the code, and I need to reproduce it to understand and fix it. I will try to spend some time on this today.

bep commented Feb 29, 2024

@TiGR what would be useful is if you could do a kill -SIGABRT <pid> on a stuck process and post the stack trace (which I suspect will be long) in a comment here or as an attachment.
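
(For reference: sending SIGABRT or SIGQUIT to a Go process makes the runtime exit with a stack dump of its goroutines, which is exactly what is needed to spot a deadlock. A program can also produce the same kind of dump itself; the sketch below is a generic, hypothetical add-on using the standard library, not something the hugo binary ships with.)

```go
// Hypothetical illustration: dump all goroutine stacks on SIGUSR1 using the
// standard library. Not part of hugo; with the released binary, kill -SIGABRT
// produces an equivalent dump via the Go runtime.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// debug=2 prints one full stack trace per goroutine, much like a crash dump.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	for { // stand-in for the real program's work
		time.Sleep(time.Hour)
	}
}
```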

bep commented Feb 29, 2024

OK, I think I have reproduced this by building a beefy Hugo site on an Ubuntu VM with only 1 GB of memory (and the memory limit set to 256 MB).

I will start by making it possible to set HUGO_MEMORYLIMIT to something other than gigabytes, so I can hopefully reproduce this on my main box (which would make it much easier to debug).

bep commented Feb 29, 2024

Strike that; it actually completed, it just took a very long time (because of image processing, I guess; the second run with the cache was faster).

TiGR commented Feb 29, 2024

Another reliable way to reproduce the stalls is to build a big website with HUGO_NUMWORKERMULTIPLIER=32.

TiGR commented Mar 1, 2024

@TiGR what would be useful is, if you could do a kill -SIGABRT <pid> on a stuck process and post the stack trace (which I suspect would be long), in a comment here or something.

hugo-stall-stack-trace.txt

@bep bep modified the milestones: v0.123.6, v0.123.7, v0.123.8 Mar 1, 2024
@bep bep modified the milestones: v0.123.8, v0.123.9, v0.124.0 Mar 7, 2024

TiGR commented Mar 17, 2024

The issue is still there in hugo v0.124.

@bep bep modified the milestones: v0.124.0, v0.125.0 Mar 18, 2024

bep commented Mar 20, 2024

@TiGR sorry, I have been a little busy with other things. I see the deadlock in your stack trace, and even though I don't have a failing test for it, the fix looks obvious, so I'm going to apply a fix for that later today. I'm boldly closing this issue from that PR, but if it is still a problem, please shout.
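
A common, generic way to break this class of deadlock is to stop invoking callbacks (or anything else that may re-enter the cache) while the cache mutex is held: collect the work under the lock, unlock, then do the work. Continuing the hypothetical cache sketched earlier, this illustrates the pattern and is not necessarily the actual change in the referenced commit:

```go
// Fixed eviction for the hypothetical cache from the earlier sketch: the
// onEvict callback now runs after the mutex has been released, so it can
// safely call back into the cache.
func (c *cache) evictAll() {
	c.mu.Lock()
	keys := make([]string, 0, len(c.entries))
	for k := range c.entries {
		keys = append(keys, k)
		delete(c.entries, k)
	}
	c.mu.Unlock()

	for _, k := range keys {
		c.onEvict(k) // safe: c.mu is no longer held
	}
}
```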

bep added a commit to bep/hugo that referenced this issue Mar 20, 2024
@bep bep modified the milestones: v0.125.0, v0.124.1 Mar 20, 2024
bep added a commit to bep/hugo that referenced this issue Mar 20, 2024
bep added a commit to bep/hugo that referenced this issue Mar 20, 2024
@bep bep closed this as completed in 758a876 Mar 20, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 11, 2024