Lua freezing the main thread #1643
* killed a bunch of simu memory leaks
* added a bunch of TRACE() in lua
* fixed some potential bugs in lua.cpp:
  * missing free of sid.background
  * call luaL_unref() only if argument > 0
  * free of init was before its call
  * luaGetMemUsed() returned wrong value
* added Lua shutdown (main reason is to enable checking of simu traces)
* added python script trace-parse.py that checks (de)allocations using simu output traces
This sounds very similar to the problem I had in #1597. It was related to linking the firmware with newlib-nano. I tested this script yesterday with my own build of OpenTX (with normal newlib) and the worst case is that I get a warning that the script can't be loaded because of the lack of RAM. The system never locks up. The strange thing is that the problem with this script does not happen every time. So the problem looks twofold:
I certainly see no problem with the attached script. But in any case a bad script should not be able to crash the whole system! You should really open an issue for this.
I have made some changes and wrote an allocator/deallocator parser that checks the validity of realloc/free calls. Using commit b15ce6f and the standalone simu I get the following results. Only the simulator run, without any scripts: just start simu, go to the main screen and then close it:
In addition to above, a simple stand-alone script was run:
Like the first case, but the offending script from this issue was run; all pages were displayed, no values changed:
My findings:
One idea: this post (last reply) http://www.freertos.org/FreeRTOS_Support_Forum_Archive/November_2009/freertos_sbrk_and_newlib_-_is_any_malloc_or_free_used_3447091.html says that heap growth must be aligned to an 8-byte boundary. We had this problem before, where the menusStack stack was not 8-byte aligned and that caused printf() problems for the double type, see #1379.
The RAM usage on simu will always be bigger, as it uses 64-bit pointers instead of 32-bit ones.
The OP found that when he removed all the concatenation operators, the script didn't freeze on exit anymore. But he now gets occasional "not enough memory" messages when exiting/re-executing it; trying again works. Now that Bertrand is back: the thread on the French forum is here:
I will run the new version of the script through simu and the parser to compare it with the old version. But my hunch is that the new script uses a little less memory. I guess that OpenTX somehow allows too much memory to be allocated and that then causes the crash. We should double check the validity of
To me this looks like the Lua garbage collector did not finish its job OR the newlib allocator did not really release RAM.
Did some additional runs of the original script with self-built commit 69aef74. The strange thing is that sometimes I can run this script many times without any problem. When I then check script memory usage (while it is running, with MENU long press) I get around 39kB. Free memory after the script is closed (in the statistics page) is reported at about 12kB. Then after a power cycle the script runs one time and the next execution returns

So the newlib allocator is definitely not (always) releasing RAM (via the _sbrk() function) when Lua releases it. Sometimes it does, but with this script it mostly does not. That is why we see a large RAM usage (only 12kB free), while this large script is still able to run. This is likely not a problem in itself, since the allocator hopefully reuses this RAM for new allocations. It looks more and more like some particular accumulated allocation size (one that is still allowed by _sbrk()) causes the problem. If, on the other hand, the allocation is bigger, it gets rejected by _sbrk() and we just get

The _sbrk() reserves 4kB from the top of the RAM and does not allow the heap to grow in there. That looks OK, but that region is the stack for the main thread and for interrupts. Do we have enough? I am not very familiar with the startup files for this CPU. Where is the initial stack pointer set? To which value? Where is the stack for interrupt handlers? Do we still have a main thread (using this stack) after all the other threads are running?
Next report from the script author: he moved variable declarations from within individual functions to script level to avoid allocation/deallocation at every function call, and the script no longer crashes. This seems to support that the problem comes from improper deallocation.
This function needs to be checked, I am not sure that `RAM_END-4096` is good: `extern caddr_t _sbrk(int nbytes)`
This is an awesome script, I didn't think it would be possible!!!
Bertrand, RAM end is where the main stack (MSP register) lives. This stack is used by the main function and all interrupt (trap) handlers. I sifted through the source and couldn't find where such an amount of stack could be used. Could we paint this stack too and then measure its usage?
Right, but the main function shouldn't use too much stack; as soon as the RTOS is started, each task has its own stack. That said, I also tried to increase the menus stack to 3000 words and the script also worked. I don't get how this is possible. I thought it would be worse.
I think I can explain that:
Now when you changed the heap size with the additional stack in the menus thread, you changed the size of the last free heap chunk. It just happens that this new heap size causes different behavior: it fails to grow before it allocates the last part of the heap, the problematic part. So there is always a chance either to allocate the last part of the heap and get a stack overwrite, or not to allocate it and fail with a "not enough ram" error. The question is what is in the last part of the RAM and how big it is. Do you have an open Taranis with an SWD connection? You could inspect the contents of the RAM up there. Mine is closed up and used for flying, so I don't want to do that.
I would prefer having a "not enough memory" message after my change; it would make it obvious that we were writing on the main stack. But this is not the case. Perhaps malloc tries to get a big chunk and, when it fails, mallocs a smaller chunk, but I doubt it does this. I don't have the SWD connector on my Plus radio, I would have to solder it.
We were contemplating this based on wrong assumptions! I have looked over the source code of the newlib and newlib-nano allocators from https://launchpad.net/gcc-arm-embedded/4.7/4.7-2013-q3-update/+download/gcc-arm-none-eabi-4_7-2013q3-20130916-src.tar.bz2 The facts for newlib-nano, which (I strongly believe) is used on the official build server:
The facts for newlib (which I am using most of the time):
Therefore our statistics page that shows free memory is misleading for both allocators! Currently it just shows the minimum free memory.
In view of my previous comment, the incident in issue #1597 (comment) now looks very similar to this one.
Also the measurement of bit32 library RAM usage in #1626 (comment) is questionable.
This is right; inside OpenTX _sbrk(..) is never called with a negative argument. Anyhow that would be completely wrong, as only the last block could be freed this way. But that doesn't change anything about our assumptions? We have to understand why the stack grows bigger than 4k, unless we have something really completely wrong with the allocator, which I hope not!
Bad news. I have implemented main stack free check in commit a78c2b8 and got:
This means that the main stack (interrupts stack) is only using 284 bytes, which on one hand is good (we have normal stack usage), but on the other hand means we still have the original problem. @bsongis Reducing the max heap size (by increasing the main stack reserve from 4k to 8k) just hid your problem because the heap size changed. We still have the original problem and it is not main stack corruption!
Since this issue never causes a watchdog reboot, it is clear that the main stack does not get corrupted; that would cause an immediate reset. I am also 99% sure that the libc allocator is not the main culprit. That leaves us with the Lua code and our own code, in particular the interface to Lua. The Lua code is presumed clean, but to prove it I will run a bunch more tests in the standalone simu and check the malloc/free output with the checker script. So in my view our Lua interface code is the main suspect in this case...
I checked, valgrind doesn't see anything wrong.
I also found another issue in the way newlib-nano allocator works. It is bad in our case, but can be easily fixed. More tomorrow.... |
Nice catch, thanks! |
Commit 0f9a02a adds debug traces and can be used to see the free memory fragmentation. Unfortunately I can't reproduce the newlib-nano allocator failure. I was getting it on

**Newlib-nano allocator fragmentation report**

Data presented here was obtained from the

When the Taranis shows the main screen and no model scripts are defined we get this (Lua environment initialized for the first time):
Now we start some stand-alone script from the SD manager; first the Lua state gets closed and we can see that all memory is freed, but remains in the possession of the allocator:
Now a new Lua state is created and the script is loaded; we get much more memory fragmentation:
Still, nothing particularly alarming. Now we close and open the same script several times. Each time all allocated memory is freed by Lua, but each time the newlib-nano allocator finishes with a bigger heap size. Each script run uses a little more heap memory. For example, outputs after script exit, when the Lua state is closed:
Eventually all available memory gets used by the allocator (around 70k in my case) and then an error happens when initializing the Lua state. Here we try to push the constant MIXSRC_SF, but at that point malloc fails and Lua calls
The list of free chunks above is truncated, but there really is no chunk large enough (1430 bytes) to satisfy the malloc request. The memory is that badly fragmented. Why this happens is explained in the next post.
**Newlib-nano allocator problem**

Why is our Lua state initialization causing free memory fragmentation with the newlib-nano allocator? It is caused by the newlib-nano allocator algorithm in combination with the Lua memory usage pattern.

**Case 1: first run**

Heap size is zero, and when the allocator gets a malloc request it requests more heap using _sbrk() and returns this new region. So memory is allocated from the start of the heap towards the end.

**Case 2: second run**

First all memory allocated by Lua is freed and the allocator is left with one big chunk of free memory corresponding to the maximum memory usage from the previous run. When a new malloc request comes in, and if it is smaller than the size of the free chunk, then:
So this means that memory is allocated from the end towards the beginning if the request can be satisfied from the free list. This is the opposite of case 1! This behavior, in combination with the Lua malloc/free pattern, causes massive fragmentation of the allocator's free space and in the end causes the allocator to request more heap, because it cannot satisfy all malloc requests from the free list (even if the final malloc usage is the same). This pattern then repeats each time the Lua state is initialized and each time more heap is needed, even though the allocated memory size stays the same! When the max heap size is reached, we get a Lua panic().

**Possible solution**

Just to confirm, I introduced an ugly hack that when enabled with

But this fix is not for production, since it is highly dependent on the allocator implementation and that could change in future versions of newlib. My current idea is to further analyze the Lua malloc/free pattern and then see what can be improved. Ideas:
I need a little guidance here. @bsongis what are your thoughts?
Thanks for your analysis. I was aware of those fragmentation problems and that's why the RAM is completely freed each time we are allowed to do so (model load, change in model scripts, etc.) in luaInit(). At least I hope the RAM is completely freed; it seems so, there shouldn't remain any malloc in use. I also checked that a script produced a 'not enough memory available' error when a malloc failed, but obviously I missed a case, and according to what you write above, this case produces an abort() which is really wrong. So we have many problems to solve and I think they have to be solved in this order:
Now I will focus on 1); perhaps we could open a new issue for 2), and 3) already has one opened, targeted at 2.1.0 if I am right. Soooo, for 1) did you already find a potential fix? Did you have a look at eLua?
I am trying to enable exceptions and throw an exception in a custom panic handler, it would be an easy fix. |
Completely freeing the RAM still leaves a problem with the newlib-nano allocator, as explained above.
I agree with your plan. Let's create three separate issues and a corresponding branch for each one. Since this branch is quite messy I suggest we start from master. I can create this branch and add the changes from this branch that are helpful, and then the protection can be added later (by you). I have been toying with 2), so 1) is not done yet by me.
C++ exceptions might not work properly because we are mixing C code (Lua) with C++. Be careful. The proven solution is setjmp/longjmp, as explained in http://noahdesu.github.io/2013/01/24/custom-lua-panic-handler.html
> Completely free RAM still leaves a problem with newlib-nano allocator as is explained above.
I will take care of 1) then. Thanks for your help on the other points! I am really hoping for miracles from 3) in fact, so I will give it a try just after 1) is fixed.
Enabling the exceptions costs a lot on the flash side :( |
I've told you, do not use C++ exceptions, use setjmp/longjmp! |
I like exploring all solutions :) |
I have the number 2) custom allocator written and somewhat tested. Its effect on memory fragmentation is significant. I will post it in the coming days.
I pushed my changes in commit 3375641 to let you see what I am doing. The code is not cleaned up yet, but the new allocator and the panic protection seem to be working. They still need to be heavily tested! The idea behind the panic protection is that functions that are protected have
Has it been determined whether string concatenation is OK to use or not? Thanks, Dean.
String concatenation itself is not a problem; it was just the trigger in this script for this bug. Use it.
Thank you. Dean
Using latest commit d4eca36 and make:
I get the following results on the debug port. Lua initialized, no scripts:
In this state Lua is using 12k of newlib-nano allocated RAM and almost all of our BinAllocator slots. Then, before running the standalone script, Lua is closed:
So the remaining allocated memory is zero 👍 Success. Then I started a script that calls a non-existent function:
Now not all memory is freed! But I think this is the result of the error printing above (i.e. printf()). These 236 bytes will remain allocated from now on, regardless of how many times some other script is run. Now I run the original predimrc.lua, a first time, and close it:
It can be seen that the memory limit was reached (realloc FAILURE 45) and the heap now is at 0x2001dfe0 (0x2001e000 is the theoretical limit, with the 4k reserve). Lua handled the lack of RAM gracefully: it called the GC and got enough RAM that way. When the script was closed, the cleanup was good, but 236 bytes remain (from printf() allegedly). I have then run predimrc.lua a number of times and the result is basically the same. Here is the memory report while script predimrc.lua is running (on page 1):
I will take as little as possible for 'master' from your patch. This will be done tomorrow. The release should happen during the weekend, unless something else comes up meanwhile.
I received a message from a user on the French forum who has an issue with this script:
http://andrebernet.ch/divers/predimrc.lua
He says the script runs fine until you exit it, whether with the EXIT short press as he implemented it or with the system's EXIT long press. No problem on the simu, but on the radio it locks up the main thread and requires removing the battery to turn the radio off.