
repeated out-of-memory errors #335

Open
VanessaE opened this issue Mar 31, 2017 · 30 comments

@VanessaE

commented Mar 31, 2017

2017-03-31 05:58:39: ACTION[Main]: IRC: Disconnected.
2017-03-31 05:58:40: ERROR[Main]: ServerError: AsyncErr: environment_Step: OOM error from mod 'mesecons' in callback environment_Step(): not enough memory
2017-03-31 05:58:40: ERROR[Main]: Current Lua memory usage: 972 MB
2017-03-31 05:58:40: ERROR[Main]: stack traceback:

I don't know what the root cause is, since the server has 16GB of RAM, and the usage at the time of the crash was clearly less than the LuaJIT limit (dumbass author thereof 😠 ), but the engine keeps citing mesecons (several times today), so maybe there's a leak somewhere?

@VanessaE

Author

commented Apr 7, 2017

Now that my servers' build of LuaJIT has GC64 enabled, mods can use far more than the old 1 GB limit, but that just shows the memory leak even more clearly than before. I still don't know whether it's actually mesecons doing it.

@numberZero

Collaborator

commented Apr 7, 2017

@VanessaE it may cite mesecons if the offending code gets called by mesecons, or calls mesecons (in an inappropriate way). So, where is the stack traceback?

@VanessaE

Author

commented Apr 7, 2017

That's what puzzles me - what you see is all I had in the log.

It's equally terse every time it throws an out-of-memory error, but it always cites mesecons.

@numberZero

Collaborator

commented Apr 7, 2017

Oh well, it may not report the stack in the case of an OOM error.
So, let's start debugging. If there is no good debugger, a debug print is the last resort. As a first step, add something like
print("Executing", action.func, "with params", dump(action.params))
into mesecon.queue:execute (mesecons/mesecons/actionqueue.lua, line 93).
Warning: you will get A LOT of output! It may be a good idea to omit dump(action.params) for the first test!
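For orientation, this is roughly the shape the instrumented function could take (the exact body of mesecon.queue:execute may differ between mesecons versions, so treat this as a sketch rather than a copy of the real file; only the print line is the addition):

-- Sketch of mesecons/mesecons/actionqueue.lua with the suggested debug print.
-- The surrounding function body is assumed for illustration.
function mesecon.queue:execute(action)
	-- Log every queued action before it runs; drop dump(action.params)
	-- on the first pass if the output volume is too high.
	print("Executing", action.func, "with params", dump(action.params))
	mesecon.queue.funcs[action.func](action.pos, unpack(action.params))
end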

@VanessaE

Author

commented Apr 7, 2017

I don't think I can do that sort of debugging since this is a production server.

@numberZero

Collaborator

commented Apr 8, 2017

Well, it's possible to review the code, of course. The symptoms say there is something wrong with some mesecon device; it seems to do something wrong in the action queue. It's unlikely (possible, but unlikely) to be mesecons itself; more likely it's some mod that uses the mesecons API. What mods use the mesecons action queue on your server? The most interesting ones are those that register custom actions themselves.
Edit: I assume you don't use #334. It should use less memory than the current implementation, but bugs may appear anywhere.
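For reference, a mod that registers its own action with the queue usually does something along these lines (the "mymod:*" names and the callback body are invented for illustration; add_function and add_action are the actionqueue calls mesecons exposes):

-- Hypothetical example of a mod hooking into the mesecons action queue.
mesecon.queue:add_function("mymod:blinker_step", function(pos, turn_on)
	-- Swap the node at pos between an "on" and an "off" variant.
	minetest.swap_node(pos, {name = turn_on and "mymod:blinker_on" or "mymod:blinker_off"})
end)

-- Somewhere in the device logic: run that action 1 second from now.
-- A device that enqueues actions faster than they run can make the
-- queue, and memory, grow over time.
local pos = {x = 0, y = 10, z = 0}  -- example position
mesecon.queue:add_action(pos, "mymod:blinker_step", {true}, 1)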

@VanessaE

Author

commented Apr 8, 2017

Nope, I don't have #334 in. As for what else hooks into mesecons, a quick grep shows nothing outside of mesecons that manipulates mesecon.queue, and the only things I can think of that interact with mesecons to any significant degree are technic and pipeworks. I've been very careful to make sure the latter always frees what it uses.

@numberZero

Collaborator

commented Apr 8, 2017

So, main question: have you updated anything recently?

@sfan5

Collaborator

commented Apr 8, 2017

There's a simple answer:
mesecons was the mod that attempted to allocate memory when Lua noticed "Oops, looks like we're out of memory". This is why Mesecons shows up in the error message, but it does not mean that mesecons allocated all of the memory.

This is similar to an OOM situation with the Linux kernel: the program that hits the "OOM ceiling" is not necessarily the culprit, which is what makes OOM handling so complicated.
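For what it's worth, one low-impact way to tell a gradual leak apart from a single huge allocation on a production server is to log the Lua heap size periodically, e.g. with a tiny throwaway mod like this sketch (collectgarbage("count") reports the heap in kilobytes; nothing here is mesecons-specific):

-- Throwaway heap-usage logger, for narrowing down when memory grows.
local elapsed = 0
minetest.register_globalstep(function(dtime)
	elapsed = elapsed + dtime
	if elapsed >= 60 then  -- log once per minute
		elapsed = 0
		minetest.log("action", ("Lua heap: %.1f MB"):format(collectgarbage("count") / 1024))
	end
end)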

@numberZero

Collaborator

commented Apr 8, 2017

@sfan5 I understand. I think it's unlikely that some mod allocates almost all available memory and keeps doing so after the limit is increased. So I assume something called by mesecons tries to allocate an unbounded amount of memory (e.g. it may run into an infinite loop that allocates memory on each iteration; not infinite recursion, since stack overflows are reported as such, IIRC). While something might be leaking memory gradually in the background, “the usage at the time of the crash was clearly less than the LuaJIT limit,” so I don't think that's the case.

@sfan5

Collaborator

commented Apr 8, 2017

It does not need to be mesecons itself allocating huge amounts; implementation details in Lua/LuaJIT or Minetest itself can also cause memory allocations.

@numberZero

Collaborator

commented Apr 8, 2017

@sfan5 Can it be anything not called, directly or indirectly, from mesecons?

@VanessaE

Author

commented Apr 8, 2017

Could it be something besides Mesecons? Well, I suppose Pipeworks is as good a candidate as Mesecons since the two are often used together, but the engine has never once blamed it. All I can be sure of is this: the contraption in the screenshot below, when turned on, causes the server to start leaking memory at a rate of about 10 GB over 24 hours.

It is composed of two lines of Pipeworks' deployers placed opposite two lines of node breakers, two stackwise filter-injectors, two wooden chests for buffers/storage, and an assortment of tubes to move everything around.

All of the deployers are pre-loaded with axes, with the first slot in each left empty to receive incoming leaves from the tubes. I presume the axes were added to "fill up" the excess inventory slots with something that can't be "deployed". The node breakers have nothing loaded into their tool slots (so they use the default behavior, which acts like a mese pick).

[Screenshot: screenshot_20170408_153728]

In the distance, behind that wall lever, is a single LuaC acting as a clock generator via this code:

if pin.a then
	port.b = not port.b
	port.d = port.b
	interrupt(0.5, "toggle")
end

Ignore the tube running down the aisle in the center of the contraption, that goes off to some other, unrelated contraption that isn't turned on anyway.

@VanessaE

Author

commented Apr 8, 2017

As a practical example of the leak, here's an excerpt from my server's status tracker page:

[Graph: mem-daily]

At about 07:05 in the image, I restarted the server to update my Minetest and LuaJIT builds (hoping that maybe one of those was at fault, but to no avail). At 11:00 in this image, the scheduled morning shutdown-backup reboot took place. I signed back on afterward to monitor things and was basically just fiddling about.
The above-pictured contraption was turned off when I was online, and remained so until about 14:50, when the person who built it signed on and turned it back on. As you can see, the memory usage was essentially flat until that point, with an obvious leak thereafter, until 21:05, when I signed on and turned the contraption off again.

@numberZero

Collaborator

commented Apr 8, 2017

I tested this myself:

  1. I can confirm the leak, and I suspect some core problem.
  2. It effectively makes the server unresponsive if used at high frequencies, although it doesn't run that fast (I specified delay=0.01, but the actual delay was several seconds; nevertheless, I could not connect, getting that cryptic “init packet not recognized by server” message).

Actually, I remember a similar problem with quarries. That seems to happen when machines work in an inactive block (that is, the mapblock containing the controller is active while the mapblocks with the machine are loaded but inactive).

By the way, you've said “and the usage at the time of the crash was clearly less than the LuaJIT limit,” but now you say the usage grows gradually. Which is right? Or are there two separate OOM bugs?

@numberZero

Collaborator

commented Apr 8, 2017

Note: the whole breaker implementation looks just horrible.
While the actionqueue is certainly suboptimal, it looks way better.

@numberZero

Collaborator

commented Apr 8, 2017

So, problematic things I found:

  1. Dropping items in inactive mapblocks. It seems that in active mapblocks such items are joined together up to the stack size, but in an inactive one they lie separately.
  2. Injecting items into a loop-shaped tube. Well, that's pretty self-explanatory; just remember that the server can't delete them. (An 831K luaentities file... how much is that? It had been running for only a few minutes, with just a single injector; see the sketch below for a way to gauge the buildup.)
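A quick way to check whether item entities are piling up (dropped items or tube items alike) is to count the loaded luaentities by name, e.g. with a throwaway chat command like this sketch (minetest.luaentities is the engine's table of active luaentities; the command name is made up, and it only sees entities in active mapblocks, not those serialized to disk):

-- Throwaway diagnostic: count active luaentities by name.
minetest.register_chatcommand("count_entities", {
	description = "Count active luaentities by name",
	privs = {server = true},
	func = function()
		local counts = {}
		for _, ent in pairs(minetest.luaentities) do
			local n = ent.name or "?"
			counts[n] = (counts[n] or 0) + 1
		end
		local lines = {}
		for entname, c in pairs(counts) do
			lines[#lines + 1] = entname .. ": " .. c
		end
		return true, table.concat(lines, "\n")
	end,
})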
@VanessaE

Author

commented Apr 8, 2017

In a looped (or any) tube, if there are too many items, the tube will break and dump them out on the ground.

@numberZero

Collaborator

commented Apr 8, 2017

It doesn't seem to do that. Where is it implemented?

@VanessaE

Author

commented Apr 8, 2017

It's a new feature that went in a few days ago. https://github.com/minetest-mods/pipeworks/commit/33c4c9d552c7f113a21405456e67c0e4752674a8 (plus more commits since then to tune it)

@numberZero

Collaborator

commented Apr 8, 2017

So 7680820d fixed (2), nice. I will test that machine again with the updated pipeworks.

@VanessaE

Author

commented Apr 8, 2017

By the way, you've said “and the usage at the time of the crash was clearly less than the LuaJIT limit,” but now you say the usage grows gradually. Which is right? Or are there two separate OOM bugs?

By this I mean that LuaJIT limits you to 1 GB of RAM, or it used to, and when an OOM crash would occur, it would cite Mesecons (hence the title of this issue). I use a gc64-enabled build of LuaJIT now, so that memory limit no longer applies, which makes it possible to see that there's also a memory leak and not necessarily just some occasional attempt to allocate a lot of memory in one go.

@numberZero

Collaborator

commented Apr 8, 2017

2017-04-09 02:47:37: ACTION[Main]: World at [/home/numzero/projects/minetest/modding/dev-mesecons/bin/../worlds/test]
2017-04-09 02:47:37: ACTION[Main]: Server for gameid="minetest" listening on 0.0.0.0:30000.
2017-04-09 02:47:37: WARNING[Server]: Undeclared global variable "unified_inventory" accessed at ...in/../mods/technic/technic/machines/register/recipes.lua:54
[Pipeworks] Warning - a tube at (8,9,-14) broke due to too many items (31)
Segmentation fault (core dumped)

So there certainly is some core problem.

@VanessaE

Author

commented Apr 9, 2017

Here's a worldedit copy of that contraption, by the way: leak-machine.we.zip

@Zeno-


commented Apr 9, 2017

@numberZero

I can confirm the leak, and I suspect some core problem.

How did you confirm the leak?

@Zeno-


commented Apr 9, 2017

Unfortunately, I cannot find a leak using the contraption that VanessaE posted above.
Test environment:
minetest_game: commit b377aaa50d2f8ae1926c2319f17380f8e754f499
dreambuilder_modpack: commit 7acbe4fcf319bffc29812d4c5f3718c5f23886ff

First graph: unfortunately I didn't take note of the minetest commit, but it is recent (less than a week old).
Second graph: minetest at commit 8ad3dad137df740ce63101bf224e977e66b9df2c (HEAD).

Both tests ran for (very approximately) 30-40 minutes.
The contraption was running (I could see on the console that it was running, but both times I joined with a client to double-check).

Note: the peaks in the graphs (at the start and end) are the client joining. To shut down, I opted to use the client rather than the console, and that is what you see as the final peak in the graphs.

[Graph: snapshot125]

[Graph: snapshot126]

@numberZero

Collaborator

commented Apr 9, 2017

@Zeno-
I'm not sure that was the case originally; in my test world I built a similar machine connected to a chest, and since deployers don't spend items in creative mode, the chest was filled very quickly. And then...

Dropping items in inactive mapblocks. It seems that in active mapblocks such items are joined together up to the stack size, but in an inactive one they lie separately.

as the controller was in an anchored mapblock but the machine and the chest were in another one.
Note that I was able to reproduce similar behavior with quarries; the server becomes unresponsive very quickly. With recent changes in pipeworks something works differently, though; I'm not sure what exactly.

Note: tested on MT a69cbb844; that's my own build, but I have changed client code only. The server code is the same as in 5a59ad230744a84a1474e8c3b97cf3d9080536d7.

@numberZero

Collaborator

commented Aug 26, 2018

Related? #280

@Minetest-One


commented Mar 7, 2019

I see this is still listed as an unconfirmed bug! HAHA
I can confirm that mesecons is often a reason that OOM occurs; that is, mesecons frequently needs more memory and is what crosses the limit, resulting in the OOM crash. When I tracked over a long period what broke last, mesecons was in second place.

So, a question: have you optimized the code yet, as shown here?
https://forum.minetest.net/viewtopic.php?f=18&t=16043&p=324628#p324628
local data = {}
vm:get_data(data)
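The snippet above appears to refer to passing a reusable buffer table to VoxelManip:get_data() so that repeated calls don't each allocate a fresh table. A generic sketch of that pattern (not mesecons code; the function name is made up):

-- Generic sketch of the VoxelManip buffer-reuse pattern.
local data = {}  -- kept alive and reused across calls

local function process_area(minp, maxp)
	local vm = minetest.get_voxel_manip(minp, maxp)
	vm:get_data(data)  -- fills the existing table instead of allocating a new one
	-- ... inspect or modify data here ...
	vm:set_data(data)
	vm:write_to_map()
end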

@numberZero

Collaborator

commented Mar 8, 2019
