Wordpress(?) over-committing memory #348

DevJohnC · 2019-01-18T00:22:48Z

We have PeachPie running a WordPress network which keeps getting killed by the Linux out-of-memory (OOM) killer.

[31346.098815] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[31346.315298] [20856]     0 20856  2098347   433262    1184       9        0          1000 dotnet
[31346.338398] Out of memory: Kill process 20856 (dotnet) score 1845 or sacrifice child
[31346.345205] Killed process 20856 (dotnet) total-vm:8393388kB, anon-rss:1733048kB, file-rss:0kB

The dotnet process (which is a Peachpie project) is reserving a lot more memory than it's actually using.
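
For reference, the figures in the OOM report above line up if the `total_vm` and `rss` columns are read as counts of 4 kB pages (an assumption, though it is the usual page size on x86-64 Linux). A minimal sketch of that arithmetic, with hypothetical helper names:

```python
# Assumption: the total_vm and rss columns of the OOM table are in pages,
# and the page size is 4 kB (typical for x86-64 Linux).
PAGE_KB = 4


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count from the OOM table into kB."""
    return pages * page_kb


# Values copied from the dotnet row of the OOM table.
total_vm_pages = 2098347
rss_pages = 433262

total_vm_kb = pages_to_kb(total_vm_pages)
rss_kb = pages_to_kb(rss_pages)
ratio = total_vm_kb / rss_kb

print(f"total-vm: {total_vm_kb} kB")
print(f"rss:      {rss_kb} kB")
print(f"mapped / resident ratio: {ratio:.1f}")
```

Both results match the `total-vm` and `anon-rss` values on the "Killed process" line, which works out to about 8 GiB of mapped address space against roughly 1.65 GiB actually resident.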

Are there tuning options to mitigate this situation?

jakubmisek · 2019-01-19T15:41:51Z

Can this be related to dotnet/aspnetcore#3409 or dotnet/aspnetcore#1976?

DevJohnC · 2019-01-20T22:16:41Z

Disabling the server-gc seems to have bought us a modicum of stability but it doesn't last. We also turned off the wp cron.

We still experience the same issue: after running for around half an hour with a steady flow of traffic, the dotnet process spikes to 100% CPU usage and keeps allocating memory until the OOM killer finally kills the process and allows the node to restabilize.

DevJohnC · 2019-01-21T22:55:38Z

To add to this, I watched the issue happen in real time today to try to get more details and see if it rings any bells for you.

- We're on a pretty vanilla WordPress install: some basic themes, configured as a network, with no plugins besides some filters to set comment options and filter admin menus
- Traffic levels are not high but are consistent; it's normal that the app is constantly serving a web request
- Traffic logs show nothing suspicious, and a replay of the web requests leading up to the issue doesn't reproduce it
- The issue occurs with oddly specific timing, at around 30-40 minutes of uptime on each occurrence
Server death looks like this:
- CPU usage spikes to 100% user-mode usage
- This gradually becomes more and more kernel-mode cpu usage, staying at 100% utilization overall
- During this time memory usage (which resides at just under 500MB usually) starts spiking by GBs at a time, quickly assigning then deleting GBs of RAM
- This continues until the server becomes unresponsive, likely due to the CPU giving more and more time to kernel-mode operations
- During this time the database is not showing a high data throughput, meaning the GBs of memory assigned and deleted aren't database records
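
To capture the memory pattern in the list above the next time it happens, one option is to sample the process's `VmSize` and `VmRSS` fields from `/proc/<pid>/status`. The following is only a sketch under assumptions: it relies on the Linux procfs layout, the function names are made up for illustration, and the sampling interval is arbitrary.

```python
import re
import time

# Matches the VmSize and VmRSS lines of /proc/<pid>/status, e.g. "VmRSS:\t  1733048 kB".
_FIELD_RE = re.compile(r"^(VmSize|VmRSS):\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vm_kb(status_text):
    """Return a dict of the VmSize / VmRSS values (in kB) found in status_text."""
    return {name: int(value) for name, value in _FIELD_RE.findall(status_text)}


def read_vm_kb(pid):
    """Read and parse /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vm_kb(fh.read())


def sample(pid, interval_s=1.0, count=5):
    """Print a timestamped VmSize / VmRSS reading every interval_s seconds."""
    for _ in range(count):
        vm = read_vm_kb(pid)
        print(time.strftime("%H:%M:%S"), vm.get("VmSize"), vm.get("VmRSS"))
        time.sleep(interval_s)
```

Calling `sample(pid)` with the dotnet process ID while the CPU spike is in progress would show whether resident memory itself is swinging by GBs or only the mapped size.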

Given the timing nature and how the CPU just becomes consumed in kernel-mode it feels like a deadlock issue? Or something related? Peachpie seems to just stall waiting for whatever kernel operation is happening to complete. Perhaps some sort of internal cache that deadlocks when there's continuous requests incoming?

jakubmisek · 2019-01-22T12:36:30Z

I'm still thinking of some Linux specific .NET Core issue. (We're running tens of WordPress websites on .NET Core on Win10 x64 and Azure and the servers are stable for months so far, using 400-600 MB of RAM). Using the default setup https://github.com/iolevel/wpdotnet-sdk/blob/master/app/Program.cs

Anyways; it is possible there is a dead-loop in the PeachPie code ... in that case it would be great if you'd be able to debug the process? Or attach when this happens? Is it possible on Linux ?

DevJohnC · 2019-01-22T16:46:03Z

Damn, I was seriously hoping you'd have a good idea of what is wrong right away, oh well :/

I admit I'm ignorant about attaching to a running process in production. I'm currently moving the blog network to a dedicated VM that's isolated from the rest of our setup to facilitate that.

What information should I be dumping from the process, and which tools are the equivalents of the Windows ones?

jakubmisek · 2019-01-23T11:18:09Z

It seems like it might be caused by a plugin that we didn't test yet ..

Anyways; in Visual Studio there is Mini dump or actually if you'd have a chance to see stack trace where the OOM happens, that might help too.

DevJohnC · 2019-01-23T16:30:29Z

I ended up attaching lldb with the libsosplugin.so plugin for linux on a production server. I tried to get a dump with procdump but it didn't want to load into lldb.

Anyway, we now have a full 8 or so hours of uptime after disabling WordPress' option to automatically convert smilies to images.

I managed to dump a couple of stack traces while the application was dying and found convert_smilies (https://core.trac.wordpress.org/browser/tags/5.0.3/src/wp-includes/formatting.php#L2836) to be running during both incidents. This function is doing a lot of, likely, inefficient string manipulation resulting in a lot of data copying and CPU bound workload.

Reading through that function I think it's very likely to be a combination of:

- a poorly designed method that isn't anywhere near efficient
- Peachpie possibly having significant overhead when used in this manner
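
To illustrate the kind of work involved, here is a simplified Python sketch of a tag-aware smiley replacement. It is not the WordPress implementation and not what PeachPie compiles; the smiley table, the regular expressions, and the function name are all assumptions made for illustration. It shows the general shape only: split the content on HTML tags, leave text inside tags such as `code` and `pre` alone, and run a regex replacement over every remaining text segment, building new strings along the way.

```python
import re

# Hypothetical smiley table; the replacement strings are placeholders.
SMILIES = {":)": "[smile]", ":(": "[sad]", ";)": "[wink]"}

# A smiley only counts when it is delimited by whitespace or the string boundary.
SMILEY_RE = re.compile(
    r"(?<!\S)(" + "|".join(re.escape(s) for s in SMILIES) + r")(?!\S)"
)
# The capturing group makes split() keep the tags as separate parts.
TAG_SPLIT_RE = re.compile(r"(<[^>]*>)")
TAG_NAME_RE = re.compile(r"</?\s*([a-zA-Z0-9]+)")
IGNORED_TAGS = ("code", "pre", "style", "script", "textarea")


def convert_smilies_sketch(text):
    """Replace smilies in text segments, skipping content of IGNORED_TAGS."""
    out = []
    ignore_tag = None  # name of the ignored tag we are currently inside, if any
    for part in TAG_SPLIT_RE.split(text):
        if part.startswith("<"):
            m = TAG_NAME_RE.match(part)
            name = m.group(1).lower() if m else ""
            closing = part.startswith("</")
            if ignore_tag is None and not closing and name in IGNORED_TAGS:
                ignore_tag = name
            elif ignore_tag is not None and closing and name == ignore_tag:
                ignore_tag = None
            out.append(part)
        elif ignore_tag is None:
            # Each substitution allocates a new string for the segment.
            out.append(SMILEY_RE.sub(lambda m: SMILIES[m.group(1)], part))
        else:
            out.append(part)
    return "".join(out)


print(convert_smilies_sketch("hi :) <code>:)</code> bye ;)"))
```

Even in this reduced form, every post body is split, scanned, and reassembled on each render, so the cost grows with the number of tags and the length of the content.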

I'm going to keep the issue open while I re-enable the WP cron and monitor for more of this behavior but it seems the culprit is found.

jakubmisek · 2019-01-25T15:44:27Z

@DevJohnC both are possible - however I was not able to replicate the issue (on Win x64). formatting.php file is definitely an ugly piece of code but seems to not cause any issues. When profiling performance it is not even reported as a significant function

(no formatting.php)

jakubmisek · 2019-01-25T17:46:53Z

Is it possible for you to run the web site locally?
Do you have a list of your enabled WP plugins?

DevJohnC · 2019-01-26T18:22:13Z

I'm currently on my laptop on satellite internet so I can't do a whole lot of debugging until I'm back at my workstation.

However, we've had 100% uptime since disabling the WordPress option for convert_smilies. The stacktraces I found it in were all executing as part of the RSS2 feed.

It's entirely plausible that convert_smilies crashing our servers was dependent on content; maybe something with lots of HTML tags, or non-latin language posts or posts with large content bodies or goodness knows whatever else.

I'll try and narrow it down when I can.

jakubmisek · 2019-01-26T18:26:26Z

Thank you, will try more tests as well. BTW, the newly released Peachpie 0.9.30 reduces memory utilization by 40%.

jakubmisek · 2019-04-15T09:50:13Z

So far we cannot repro the issue with memory. (but we are running Windows servers).

We are constantly requesting RSS2 feed and nothing weird happens yet.

jakubmisek · 2019-04-21T07:40:10Z

closing for now, if you'd have any more details, please comment :) thank you!

jakubmisek added this to TODO in WordPress via automation Jan 25, 2019

jakubmisek closed this as completed Apr 21, 2019

WordPress automation moved this from TODO to Done Apr 21, 2019
