out of memory/large files results in stopped LinuxCNC #13
Comments
Comment by mhaberler ok, so.. let's see.. Overall I'd suggest we add an INI option telling the using code to either use a disk-based list, or the default in-memory list. The way we do this is to tell the list management code to do either-or with a constructor parameter. Behavior would default to an in-memory list; if instantiated with, say, a filename parameter, it would write list nodes to this file. Before we go about rewriting the class, some code reading and forensics. First, let's look at usage: do a 'grep -r interp_list emc/', this will give us all lines where this class is used. It's only: definition: use: emccanon.cc is where all the appending happens; emctaskmain.cc and emctask.cc mostly (with one exception, user-defined M-codes) test the list size, get a list member, and clear the list. So technically interp_list is a member of the canon class - except there is no canon class; it is exceptionally badly written code using global variables. A candidate place to explicitly instantiate the interp_list would be emccanon.cc:INIT_CANON(), which could be thought of like a class constructor method (this coding style has been tagged 'NIST Fortran++' by Andy, and for a reason). So how can we add this option? Turns out this list is instantiated with a static global initializer, which is pretty sad practice: interpl.cc:29 - this doesn't give us a chance to modify the instantiation by an INI parameter - by the time we get to reading the INI options (iniLoad in emctaskmain.cc:3049 ff) all is said and done. And all the use is by a C++ reference, so we cannot delay instantiation until later - if it is a global reference, it must be instantiated before main() is called. See here why this practice is frowned upon: http://arstechnica.com/civis/viewtopic.php?f=20&t=107304 So we first need to refactor that - option 1 is to remove the static global initializer usage and replace it with a class pointer, and instantiate the class at the right time (i.e. 
before any interp_list operations commence); option 2 would be to fudge the semantics of the class instantiation - that is: postpone the decision whether to use an in-memory or disk-based list until a new method - say set_mode(const char *filename) - is called (if NULL, use an in-memory list). Option 3 is to dump the existing code completely and rewrite this class from scratch, and I would actually favor this - the code is bad beyond repair. It makes use of an underlying linklist.hh/cc class which is hopelessly convoluted too. It would also entail refactoring the code to use a class pointer. Before doing that, let's look at the usage and semantics of the class methods:
A typical usage pattern is emccanon.cc:883: set the line number, then append an NML message to the list (for our purposes the NML message is an opaque blob with size NMLmsg->size). The get() operation retrieves a pointer to the NML message at the head of the list, AND also sets the instance variable line_number, which is retrieved by the separate method get_line_number(). Why? Beats me. The line number could just as well have been a pointer or reference argument to the get() operation, without the need for the separate method. Note get() and get_line_number() are used only once: grep -nHr -e interp_list.get . The underlying linklist class exposes a len() method, which is the current size of the list, i.e. the balance of append() and get() operations. If we rewrite the list class from scratch, we'll need to emulate this, say by a balance counter. clear() - well, that clears the list. Everybody still with me? If yes, let's go about redesigning this clunker ;) |
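Pulling the observations above together, here is a minimal sketch of what a rewritten class could look like (option 2/3 territory; all names are hypothetical, this is not the actual NML_INTERP_LIST API): set_mode() postpones the in-memory vs. disk decision, get() returns the line number by reference instead of via a separate stateful accessor, and len() is kept as a balance counter of append() vs get():

```cpp
#include <cassert>
#include <cstdio>
#include <deque>
#include <string>
#include <utility>

// Hypothetical rewrite sketch -- names are illustrative, not the actual
// NML_INTERP_LIST interface. The NML message is treated as an opaque blob,
// represented here by a std::string.
class InterpQueue {
public:
    ~InterpQueue() { if (disk_) std::fclose(disk_); }

    // Option 2: postpone the backend decision. NULL keeps the default
    // in-memory list; a filename spools list nodes to disk instead.
    bool set_mode(const char *filename) {
        if (!filename) return true;
        disk_ = std::fopen(filename, "w+b");
        return disk_ != nullptr;
    }

    void set_line_number(int n) { next_line_ = n; }

    void append(const std::string &blob) {
        if (disk_) {
            // record = [line number][blob size][blob bytes]
            std::fseek(disk_, wpos_, SEEK_SET);
            std::size_t n = blob.size();
            std::fwrite(&next_line_, sizeof next_line_, 1, disk_);
            std::fwrite(&n, sizeof n, 1, disk_);
            std::fwrite(blob.data(), 1, n, disk_);
            wpos_ = std::ftell(disk_);
        } else {
            mem_.push_back({next_line_, blob});
        }
        ++balance_;
    }

    // The line number travels with the message, so no separate stateful
    // get_line_number() accessor is needed.
    std::string get(int &line_number) {
        std::string blob;
        if (disk_) {
            std::fseek(disk_, rpos_, SEEK_SET);
            std::size_t n = 0;
            if (std::fread(&line_number, sizeof line_number, 1, disk_) == 1 &&
                std::fread(&n, sizeof n, 1, disk_) == 1) {
                blob.resize(n);
                if (std::fread(&blob[0], 1, n, disk_) != n) blob.clear();
                rpos_ = std::ftell(disk_);
            }
        } else {
            line_number = mem_.front().first;
            blob = mem_.front().second;
            mem_.pop_front();
        }
        --balance_;
        return blob;
    }

    // len() emulated as the balance of append() vs get() operations.
    int len() const { return balance_; }

    void clear() { mem_.clear(); balance_ = 0; rpos_ = wpos_ = 0; }

private:
    std::FILE *disk_ = nullptr;
    long rpos_ = 0, wpos_ = 0;
    int next_line_ = 0, balance_ = 0;
    std::deque<std::pair<int, std::string>> mem_;
};
```

Task would see the same append/get/len/clear semantics either way; whether get() pops a list node or reads the next record from the spool file stays hidden behind the interface.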
Comment by ArcEye Just thoughts as they occur. The usual method for serialising a linked list is to create it in memory and then write the buffer to disk. As this whole problem arises from running out of memory when processing large files, that won't do here; the list will have to be written out incrementally. Likewise the interpreter de-serialisation will have to be limited to a set length and refreshed from disk when required. If the data is now persistent, and the memory that held it is not the only source, does this pave the way for proper 'run from line', the ability to run a program backwards to a set point, and many of the other features of some dedicated controllers which are currently missing from MK / LCNC? (Which of course suggests the adoption of a doubly linked list, for future use if nothing else) |
Comment by luminize I guess it's not only the programming, but mostly finding the right trees in the forest of code. I've looked up the code a little, but I'll have to see it a few times and understand it along the way. If I understood correctly:
Assume we've chosen to use the disk-based list (our *.ngc program)
|
Comment by mhaberler ArcEye: I think that is an implementation detail; all we need is to duck-type the existing class so the API semantics are retained. "Proper run-from-line" - not sure I understand; fact is, the concept of line numbers is fairly useless once multiple source files come into play, as the line number isn't unique anymore; see also #106 - more work than just changing the list implementation. Re running backwards - there are two queues at play, the interplist and the motion queue; stepping back would mean flushing the motion queue and re-populating it from the interplist in reverse order and motion direction; nope, I don't volunteer ;) |
Comment by mhaberler Bas: the interplist is not about ngc files. It is the communications vehicle between interpreter and task. It holds NML messages which are the command output from the interpreter, once it has called an emccanon.cc function. NML messages are opaque blobs for the purposes of the interplist API - the only requirements are that the API semantics of class NML_INTERP_LIST must be retained. So interp_list.get() might - behind the scenes - read from a file instead of fetching the next list element. Maybe a good starting point would be to instrument the interplist code with some debug print statements to get the idea what is happening when |
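The instrumentation suggested above could be as simple as a couple of fprintf()s in append()/get(). A stand-in sketch (a toy class, not the real interpl.cc code) showing the idea of logging the list length on every operation so you can watch it fill and drain while a file loads:

```cpp
#include <cassert>
#include <cstdio>
#include <deque>
#include <string>

// Stand-in for the real interp_list, illustrating debug instrumentation:
// print the operation and the resulting list length on every call.
class TracedList {
public:
    void append(const std::string &blob) {
        q_.push_back(blob);
        std::fprintf(stderr, "interp_list.append: len=%zu blobsize=%zu\n",
                     q_.size(), blob.size());
    }
    std::string get() {
        std::string blob = q_.front();
        q_.pop_front();
        std::fprintf(stderr, "interp_list.get:    len=%zu\n", q_.size());
        return blob;
    }
    std::size_t len() const { return q_.size(); }

private:
    std::deque<std::string> q_;
};
```

If the list is the culprit, the logged length should grow without bound while a large gcode file loads.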
Comment by luminize I need to take just one step back, just to know how the routing is, in my kindergarten language.
Do I understand the mechanics? |
Comment by mhaberler the flow is like so:
so the interplist function is entirely internal to milltask, there is no UI involved |
Comment by RobertBerger We also hit the issue with the OOM killer on a BBB [1] and I did some research on the subject by instrumenting emc/nml_intf/interpl.cc
Anyhow, it looks like the problem is not in
Can you please point me to the function where all the memory is allocated? [1] https://groups.google.com/forum/#!topic/machinekit/S8Mg3D2tqbM |
Comment by mhaberler the allocation happens here, see the definition of the NML_INTERP_LIST class members which contains the tempnode: https://github.com/machinekit/machinekit/blob/master/src/emc/nml_intf/interpl.cc#L33 pretty sure the exception is raised here once the 'new LinkedList;' fails |
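For context on that failure mode: an unhandled allocation failure from a plain `new` surfaces as std::bad_alloc, and if nothing up the stack catches it, std::terminate() ends the process - consistent with task simply stopping. A self-contained illustration (not machinekit code; the absurdly large request is just an amount guaranteed to fail):

```cpp
#include <cstddef>
#include <new>

// Illustrative only: a plain `new` that cannot allocate throws
// std::bad_alloc. The volatile-qualified pointer keeps the compiler from
// optimizing away the new/delete pair.
bool allocation_fails(std::size_t bytes) {
    try {
        char *volatile p = new char[bytes];
        delete[] p;
        return false;
    } catch (const std::bad_alloc &) {
        return true;
    }
}
```

In milltask nothing catches this exception, so the first failed list-node allocation takes the whole process down.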
Comment by RobertBerger I put some more instrumentation into the file, right after What you can see [1] is the log from a debug session where I try to reproduce my problem on an x86 with Simulator/axis. line 1 - until I choose the config BTW the real free memory on the system is much less after loading the gcode file than what we see in the log, so I suspect my instrumentation code is not in the right place. |
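One way to make this kind of instrumentation less sensitive to where it sits (an assumption on my part, not code from the tree) is to sample the process's own resident set size from /proc/self/status at each probe point, rather than looking at system-wide free memory:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read this process's resident set size (in kB) from /proc/self/status.
// Unlike system-wide "free" memory, VmRSS pins the growth to the process
// doing the allocating. Linux-only; returns -1 if the field is missing.
long vmrss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0)
            return std::stol(line.substr(6));   // line is "VmRSS:   1234 kB"
    }
    return -1;
}
```

Logging vmrss_kb() before and after each suspect call narrows down which allocation actually grows the process, regardless of what else is running on the system.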
Comment by RobertBerger Both methods are being called from the
Shouldn't the method which eats away all the memory be called more frequently when the gcode is loaded and be consistent with how much free memory is really available on the system? The last call on my log is to Where else should I search? |
Comment by RobertBerger What you can see here[1] is
We can clearly see that the axis process is the one eating RAM while loading gcode. So I believe that the problem is not in the Unshared memory unique to that process increases from Do you have an idea where in |
Comment by mhaberler very interesting - I should really check my facts before posting, clearly an assumption won't do ;-) still away from a working install for a week or so, sorry. That said - both milltask and axis use in-memory structures which are bound to fail on large inputs; it just seems that Axis is more aggressive on memory consumption than milltask. What the axis process does is run the ngc file through the preview interpreter and build an OpenGL representation of the path; since the interpreter per se works with bounded memory, it is unlikely to be the cause - rather the data structures built from its output; follow load_preview in axis, down to GLCanon in lib/python/rs274/glcanon.py. To verify it is axis which dies on preview, I suggest instrumenting the ngc file with (AXIS,hide) somewhere near the start (see glcanon.py:92:comment()) - this should suppress building the preview and - if it is the cause of the failure - keep Axis running without running out of memory. Does this work as expected? |
Comment by RobertBerger I ran some tests on the BBB and loaded gcode from a working file:
(which is still a lot - what's in those 115 M? I don't see anything changing in the GUI. Is it really just the commands?) The file which causes the visit of the OOM killer still cannot be loaded, even when I add (AXIS,stop). The good news is that we are now searching in the right spot. How can we further reduce axis/GUI memory usage? Can we change to another, more lightweight GUI? |
Comment by mhaberler As for changing Axis - AFAIC this is a lost cause, it is an impenetrable spaghetti monstrosity with a very arcane control flow thanks to Python AND Tcl/Tk being used simultaneously. Several people use different UIs, but I have no experience with them - just look back in the mailing list (I think it was tkemc) |
Comment by robEllenberg Hi All, I'm not sure if this effort is still going on, but I had a thought about a "quick fix" for the in-memory problem. If the linked list storing NML commands is mostly sequentially accessed (i.e. pulling messages off the end to send to motion / IO), then maybe we could use an existing STL-like container designed to do disk swapping automatically? I found this with some searching: It probably wouldn't help with the Axis/gremlin memory consumption issue, but it could be useful for the internals. |
Comment by mhaberler I'm on the case (#106) but got drawn aside by the vtable stuff (to be frank, that is a lot more interesting than digging through miswritten code in task ..) I was planning to use a zeroMQ socket as the queue, as this would enable remote ops for free. But I'm getting second thoughts on this - @robEllenberg : do you see this stxxl code as a means to obtain a canon queue which can be walked backwards/forwards, e.g. for EDM (step back a path if the wire breaks)? Actually a UI to the canon queue (view, maybe insert/delete canon ops) would be an interesting concept; it might also be useful for your state-tracking stuff. Just noted it's in debian: |
Comment by mhaberler Jean-Paul just made a great suggestion which would be a stopgap to the issue without requiring the full rewrite of the control structure: limit readahead based on interplist size. Fact is - the infinite readahead done by task/interp does not add to path quality if the tp queue is stuffed anyway, and I think that is some 1200 entries or so - so it would be fine to not call the interpreter from task if the interplist grows above a (configurable) high water mark. The way it could work is:
the result should be that the interplist size remains bounded regardless of input file size I guess this is a fairly small change, and should fix the issue for embedded systems (hopefully for good) |
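The throttle described above fits in a few lines. A sketch (the struct, method names, and the INI option are all hypothetical): task stops calling the interpreter once interp_list.len() crosses a configurable high-water mark, and resumes only after the list has drained below a low-water mark, the hysteresis avoiding flapping around a single threshold:

```cpp
#include <cassert>

// Hypothetical readahead gate for the task cycle. The INI option name
// is made up for illustration.
struct ReadaheadGate {
    int high_water = 0;   // e.g. a [TASK]INTERP_LIST_LIMIT INI option
    int low_water = 0;    // resume threshold, below high_water
    bool paused = false;

    // Called once per task cycle with the current interp_list.len();
    // returns true while task may keep feeding lines to the interpreter.
    bool may_read_ahead(int list_len) {
        if (list_len >= high_water)
            paused = true;            // stop reading ahead
        else if (list_len <= low_water)
            paused = false;           // list has drained, resume
        return !paused;
    }
};
```

With such a gate in place, the interplist size stays bounded regardless of input file size, which is the property asked for above.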
Comment by luminize That sounds very logical.
|
Comment by mhaberler well I'm puzzled - exactly this is already in place, see https://github.com/machinekit/machinekit/blob/master/src/emc/task/emctaskmain.cc#L559-L560, and the InterpList class already has a len() field - so everything seems to be in place. It seems to me we need a bit more thorough investigation of the source of the problem - if this size-limiting does in fact work, the cause of the out-of-memory error must be different |
Comment by mhaberler I would appreciate it if somebody having this kind of error could try this branch: https://github.com/mhaberler/machinekit/tree/limit-readahead-by-interplist-size - if task fails with out-of-memory as before, we need to look elsewhere. gdb ../bin/milltask core |
Comment by mhaberler it might make sense to run milltask under valgrind to find the source of the leaks. To do so:
@@ -3049,11 +3049,11 @@ statwait=.01
|
Comment by mhaberler well, I guess I need an NGC file and a non-Axis config referred to me which reproducibly cause this error. Probably it would help if one just attached gdb to milltask and produced a backtrace like so:
|
Comment by luminize I might have one.
|
Comment by evandene The out-of-memory problem when loading large G-code files is solved for my Delta Printer with BBB plus BeBoPr++ plus Pololu DRV8825 drivers.
Issue by luminize
Mon May 19 06:42:37 2014
Originally opened as machinekit/machinekit#193
In these two threads there is some discussion about LinuxCNC stopping unexpectedly, leaving the hot-end still hot. LinuxCNC exiting without switching that off, zeroing the PWM, or having a watchdog is another issue. This bug is about preventing the out-of-memory condition from occurring. I volunteered to code this, but I need some guidance in finding the beginning of the trail of breadcrumbs :)
https://groups.google.com/forum/#!topic/machinekit/rkCVRKg2-GQ
https://groups.google.com/forum/#!topic/machinekit/A3UvHqbQvN4