out of memory/large files results in stopped LinuxCNC #13

ArcEye · 2018-08-02T12:05:34Z

Issue by luminize
Mon May 19 06:42:37 2014
Originally opened as machinekit/machinekit#193

In these two thread is some discussion about LinuxCNC stopping unexpectedly, having the hot-end still hot. The exiting LinuxCNC without switching off this, zeroing the PWM or having a watchdog is another think. This bug is about preventing the out of memory occurring. I volunteered to code this, but I need some guidance in finding the beginning of the path of crumbled :)

https://groups.google.com/forum/#!topic/machinekit/rkCVRKg2-GQ
https://groups.google.com/forum/#!topic/machinekit/A3UvHqbQvN4

ArcEye · 2018-08-02T12:05:35Z

Comment by mhaberler
Mon May 19 07:49:33 2014

ok, so.. let's see..

Overall I'd suggest we add an INI option telling using code to either use a disk-based list, or the default in-memory list. The way we do this is to tell the list management code to do either-or with a constructor parameter. Behavior would default to in-memory list; if instantiated with say a filename parameter, it would write list nodes to this file.

Before we go about rewriting the class, some code reading and forensics.

first, lets look at usage: do a ' grep -r interp_list emc/', this will give us all lines where this class is used. It's only:

definition:
emc/nml_intf/interpl.cc
emc/nml_intf/interpl.hh

use:
emc/task/emctask.cc
emc/task/emctaskmain.cc
emc/task/emccanon.cc

emcanon.cc is where all the appending happens; emctaskmain.cc and emctask.cc mostly (with one exception, user-defined M-codes) test list size, get a list member, and clear the list.

So technically interp_list is a member of the canon class - except there is no canon class, it is exceptionally bad written code using global variables. A candidate place to explicitly instantiate the interp_list would be emcanon.cc:INIT_CANON() which could be thought of like a class constructor method (this coding style has been tagged 'NIST Fortran++' by Andy, and for a reason).

So how can we add this option? turns out this list is instantiated with a static global initializer, which is pretty sad practice: interpl.cc:29 - this doesnt give us a chance to modify the instantiation by an INI parameter - by the time we get to reading the INI options (iniLoad in emctaskmain.cc:3049 ff) all is said and done. And, all the use is by a C++ reference, so we cannot delay instantiation until later - if it is a global reference, it must be instantiated before main() is called. See here why this practice is frowned upon: http://arstechnica.com/civis/viewtopic.php?f=20&t=107304

So we first need to refactor that - option 1 is to remove the static global initializer usage and replace it by a class pointer, and instantiate the class at the right time (i.e. before any interp_list operations commence); option 2 would be to fudge the semantics of the class instantiation - that is: postpone the decision whether to use a in-memory or disk-based list until a new method - say set_mode(const char *filename) is called (if NULL, use in-memory list).

Option 3 is to dump the existing code completely and rewrite this class from scratch, and I would actually favor this - the code is bad beyond repair. It makes use of an underlying linklist.hh/cc class which is hopelessly convoluted too. It would also entail refactoring the code to use a class pointer.

Before doing that, let's look at the usage and semantics of the class methods:

 class NML_INTERP_LIST {
  public:
    NML_INTERP_LIST();
    ~NML_INTERP_LIST();
    int set_line_number(int line);
    int get_line_number();
    int append(NMLmsg &);
    int append(NMLmsg *);
    NMLmsg *get();
    void clear();
    void print();
    int len();
  private:
    class LinkedList * linked_list_ptr;
    NML_INTERP_LIST_NODE temp_node; // filled in and put on the list
    int next_line_number;   // line number used to fill temp_node
    int line_number;        // line number of node from get()
  };

A typical usage pattern is emccanon.cc:883: set the line number, then append an NML message to the list (for our purposes the NML message is an opaque blob with size NMLmsg->size).
Any other NML message emitted thereafter with append() will inherit the line number because that is store in next_line_number on set_line_number().

The get() operation retrieves a pointer to the NML message at the head of the list, AND also sets the instance variable line_number, which is retrieved by the separate method get_line_number(). Why? beats me. The line number could just as well had been an pointer or reference at the get() operation, without the need for the separate method. Note get() and get_line_number() are used only once:

grep -nHr -e interp_list.get .
./task/emctaskmain.cc:2530: emcTaskCommand = interp_list.get();
./task/emctaskmain.cc:2536: interp_list.get_line_number();

The underlying linklist class exposes a len() method, which is the current size of the list, i.e. the balance of append() and get() operations. If we rewrite the list class from scratch, we'll need to emulate this, say by a balance counter.

clear() - well, that clears the list.
print() is a debug function, to print the list contents.

Everybody still with me? if yes, let's go about redesigning this clunker ;)

ArcEye · 2018-08-02T12:05:35Z

Comment by ArcEye
Mon May 19 15:01:54 2014

Just thoughts as they occur.

The usual method for serialising a linked list, is to create in memory and then write the buffer to disk.

As this whole problem arises from running out of memory, when processing large files,
probably need a routine which writes out every XXX records.

Likewise the interpreter de-serialisation will have to be limited to a set length and refreshed when required from disk.

If the data is now persistent and the memory that held it is not the only source, does this pave the way for proper 'run from line', the ability to run a program backwards to a set point and many of the other features of some dedicated controllers which are currently missing from MK / LCNC?

(Which of course suggests the adoption of a doubly linked list for future use if nothing else)

ArcEye · 2018-08-02T12:05:36Z

Comment by luminize
Wed May 21 05:27:54 2014

I guess it's not only the programming, but mostly finding the right trees in the forrest of code. I've looked up the code a little, but I'll have to see it a few times and understand along the way.

If I understood correctly:

What we are building here, is that a list that partially holds the file information.
the NML list is the list that gets processed by linuxcnc and every line is "executed" as fast as possible.

Assume we've chosen to use the disk-based list (our *.ngc program)

there is a file on a disk
we read a line (as long as we need to keep up with the physical machine speed) until we reach the maximum size of the list.
pointer is at the "first" line of the list
when there is a free space in the motion commands list the data of this line in this list gets added by retrieving the pointer of the motion commands list with get()
move the pointer of this list
add another line of the file to the previous line
when at the end of the lines of this list we move to the first line

ArcEye · 2018-08-02T12:05:37Z

Comment by mhaberler
Wed May 21 07:45:11 2014

ArcEye: I think that is an implementation detail; all we need is to duck-type the existing class so the API semantics are retained
If all records go to disk, or just a subset, shouldnt affect usage

"proper run-from-line" - not sure I understand; fact is, the concept of linenumbers is fairly useless once multiple source files come into play as the linenumber isnt unique anymore; see also #106 - more work than just changing the list implementation

re running backwards - there are two queues at play, the interplist and the motion queue; stepping back would mean flushing the motion queue, and re-populate from the interplist in reverse order and motion direction; nope, I dont volunteer ;)

ArcEye · 2018-08-02T12:05:37Z

Comment by mhaberler
Wed May 21 07:52:26 2014

Bas: the interplist is not about ngc files. It is the communications vehicle between interpreter and task. It holds NML messages which are the command output from the interpreter, once it has called an emccanon.cc function.

NML messages are opaque blobs for the purposes of the interplist API - the only requirements are that the API semantics of class NML_INTERP_LIST must be retained. So interp_list.get() might - behind the scenes - read from a file instead of fetching the next list element.

Maybe a good starting point would be to instrument the interplist code with some debug print statements to get the idea what is happening when

ArcEye · 2018-08-02T12:05:38Z

Comment by luminize
Wed May 21 08:17:52 2014

I need to do just one step back, just to know how the routing is, in my kindergarden language.
in the current (bad) situation, when I load a g-code file

a list (interpreter) is filled (which is displayed in the GUI for example?)
press "run" and a task 'orders' our interp_list list to pick from the interpreter a line get()
later a task uses this list to put the NMLmessages in ins own queue (somewhere)

Do I understand the mechanics?
I'll add some debug print statements.

ArcEye · 2018-08-02T12:05:39Z

Comment by mhaberler
Wed May 21 08:22:19 2014

the flow is like so:

you tell UI to run an MDI command or run a NGC file
this command gets passed to milltask (which contains interpreter and task), typically with the linuxcnc python module
milltask calls the interpreter to execute the command/file
interpreter calls the canon layer, which generates NML commands and puts them on the interplist
the task function in milltask pulls commands off the list and schedules them for motion or iocontrol as needed

so the interplist function is entirely internal to milltask, there is no UI involved

ArcEye · 2018-08-02T12:05:39Z

Comment by RobertBerger
Fri Jul 18 16:17:57 2014

We also hit the issue with the OOM killer on a BBB [1] and I did some research on the subject by instrumenting emc/nml_intf/interpl.cc

diff --git a/src/emc/nml_intf/interpl.cc b/src/emc/nml_intf/interpl.cc
index e443127..cb16ae5 100644
--- a/src/emc/nml_intf/interpl.cc
+++ b/src/emc/nml_intf/interpl.cc
@@ -15,6 +15,7 @@
 * Last change:
 ********************************************************************/

+#define MEM_DEBUG 1

 #include <string.h>        /* memcpy() */

@@ -25,6 +26,13 @@
 #include "linklist.hh"
 #include "nmlmsg.hh"            /* class NMLmsg */
 #include "rcs_print.hh"
+#ifdef MEM_DEBUG 
+#include <sys/sysinfo.h>
+#include <stdio.h>
+#include <unistd.h>
+#endif
+
+#define DEBUG_INTERPL 1

 NML_INTERP_LIST interp_list;   /* NML Union, for interpreter */

@@ -88,8 +96,8 @@ int NML_INTERP_LIST::append(NMLmsg * nml_msg_ptr)
    ((void *) &temp_node.line_number) >
    ((void *) &temp_node.command.commandbuf)) {
    rcs_print_error
-       ("NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.");
-   return -1;
+       ("NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.\n");
+   //return -1;
     }
 #endif

@@ -116,6 +124,43 @@ int NML_INTERP_LIST::append(NMLmsg * nml_msg_ptr)
         linked_list_ptr->list_size, temp_node.line_number);
     }

+#ifdef MEM_DEBUG  
+    struct sysinfo si;
+    static int first_time=1;
+    static long first_freeram;
+    static long ramdiff;
+    if  (sysinfo(&si) < 0) {
+       rcs_print
+           ("NML_INTERP_LIST::append : could not get sysinfo\n");
+    } else {
+       rcs_print
+           ("NML_INTERP_LIST::append : pid:       %d\n", (int)getpid());
+       rcs_print
+           ("NML_INTERP_LIST::append : totalram:  %10li\n", si.totalram);
+       rcs_print
+           ("NML_INTERP_LIST::append : freeram:   %10li\n", si.freeram);
+       if (first_time==1) {
+           first_freeram=si.freeram;
+           first_time=0;
+       }
+       ramdiff = first_freeram - si.freeram;
+       rcs_print
+           ("NML_INTERP_LIST::append : %10li - %10li = %10li\n", first_freeram, si.freeram, ramdiff);
+       rcs_print
+           ("NML_INTERP_LIST::append : sharedram: %10li\n", si.sharedram);
+       rcs_print
+           ("NML_INTERP_LIST::append : bufferram: %10li\n", si.bufferram);
+            rcs_print
+           ("NML_INTERP_LIST::append : totalswap: %10li\n", si.totalswap);
+       rcs_print
+           ("NML_INTERP_LIST::append : freeswap:  %10li\n",si.freeswap);
+       rcs_print
+           ("NML_INTERP_LIST::append : totalhigh: %10li\n",si.totalhigh);
+       rcs_print
+           ("NML_INTERP_LIST::append : freehigh:  %10li\n",si.freehigh);
+
+    }
+#endif
     return 0;
 }

@@ -187,3 +232,5 @@ int NML_INTERP_LIST::get_line_number()
 {
      return line_number;
 } 
+
+

Anyhow, it looks like the problem is not in NML_INTERP_LIST::append but elsewhere. Here is some log from a test run and NML_INTERP_LIST::append is not called very frequently:

LINUXCNC - 2.7.0~pre0
Machine configuration directory is '/home/student/machinekit/configs/sim/axis'
Machine configuration file is 'axis.ini'
Starting LinuxCNC...
io started
halcmd loadusr io started
task pid=21252
emcTaskInit: using builtin interpreter
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=32,type=EMC_TRAJ_SET_TERM_COND}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1199230976
NML_INTERP_LIST::append : 1199230976 - 1199230976 =          0
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=96,type=EMC_TRAJ_SET_G5X}) : list_size=2, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1199230976
NML_INTERP_LIST::append : 1199230976 - 1199230976 =          0
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600 
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=96,type=EMC_TRAJ_SET_G92}) : list_size=3, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1199230976
NML_INTERP_LIST::append : 1199230976 - 1199230976 =          0
NML_INTERP_LIST::append : sharedram:          0 
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784 
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=32,type=EMC_TRAJ_SET_ROTATION}) : list_size=4, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1199230976
NML_INTERP_LIST::append : 1199230976 - 1199230976 =          0
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1196675072
NML_INTERP_LIST::append : 1199230976 - 1196675072 =    2555904
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=32,type=EMC_TRAJ_SET_TERM_COND}) : list_size=2, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1196548096
NML_INTERP_LIST::append : 1199230976 - 1196548096 =    2682880
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=96,type=EMC_TRAJ_SET_G5X}) : list_size=3, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1196421120
NML_INTERP_LIST::append : 1199230976 - 1196421120 =    2809856
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=96,type=EMC_TRAJ_SET_G92}) : list_size=4, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1196421120
NML_INTERP_LIST::append : 1199230976 - 1196421120 =    2809856
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=32,type=EMC_TRAJ_SET_ROTATION}) : list_size=5, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1196421120
NML_INTERP_LIST::append : 1199230976 - 1196421120 =    2809856
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1183485952
NML_INTERP_LIST::append : 1199230976 - 1183485952 =   15745024
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1183358976
NML_INTERP_LIST::append : 1199230976 - 1183358976 =   15872000
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13320192
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1180065792
NML_INTERP_LIST::append : 1199230976 - 1180065792 =   19165184
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13328384
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:   1179803648
NML_INTERP_LIST::append : 1199230976 - 1179803648 =   19427328
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13328384
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:     81469440
NML_INTERP_LIST::append : 1199230976 -   81469440 = 1117761536
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13402112
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:     81469440
NML_INTERP_LIST::append : 1199230976 -   81469440 = 1117761536
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   13402112
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1033846784
NML_INTERP_LIST::append : totalhigh:          0
...
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:     79851520
NML_INTERP_LIST::append : 1199230976 -   79851520 = 1119379456
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   12423168
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1032085504
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0
Cannot home while shared home switch is closed
Cannot home while shared home switch is closed
Cannot home while shared home switch is closed
Cannot home while shared home switch is closed
emc/nml_intf/interpl.cc 98: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
NML_INTERP_LIST::append : pid:       21252
NML_INTERP_LIST::append : totalram:  2602352640
NML_INTERP_LIST::append : freeram:     78954496
NML_INTERP_LIST::append : 1199230976 -   78954496 = 1120276480
NML_INTERP_LIST::append : sharedram:          0
NML_INTERP_LIST::append : bufferram:   12460032
NML_INTERP_LIST::append : totalswap: 1044377600
NML_INTERP_LIST::append : freeswap:  1032085504
NML_INTERP_LIST::append : totalhigh:          0
NML_INTERP_LIST::append : freehigh:           0

Can you please point me to the function where all the memory is allocated?

[1] https://groups.google.com/forum/#!topic/machinekit/S8Mg3D2tqbM

ArcEye · 2018-08-02T12:05:40Z

Comment by mhaberler
Fri Jul 18 18:01:38 2014

the alllocation happens here, see the definition of the NML_INTERP_LIST class members which contains the tempnode:

https://github.com/machinekit/machinekit/blob/master/src/emc/nml_intf/interpl.cc#L33

pretty sure the exception is raised here once the ' new LinkedList;' fails

ArcEye · 2018-08-02T12:05:41Z

Comment by RobertBerger
Sat Jul 19 06:53:45 2014

I put some more instrumentation into the file, right after linked_list_ptr = new LinkedList; but this is not called when I load the gcode file. [1]

What you can see [1] is the log from a debug session where I try to reproduce my problem on an x86 with Simulator/axis.

line 1 - until I choose the config
line 2 to 170 - happens automagically
then I load the gcode (where it would crash on the BBB and where I would like to see who allocates all this memory)
line 174 to the end: NML_INTERP_LIST::NML_INTERP_LIST is not called, NML_INTERP_LIST::append is called twice

BTW the real free memory on the system is much less after loading the gcode file than what we see in the log since I suspect my instrumentation code is not at the right place.

[1] http://pastebin.com/F7Pzhvr3

ArcEye · 2018-08-02T12:05:41Z

Comment by RobertBerger
Sat Jul 19 10:40:34 2014

Both methods are being called from the milltask

emc/nml_intf/interpl.cc 169: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
---> mem_debug (emc/nml_intf/interpl.cc append 199)
pid:  29621
name: milltask
totalram:  2602352640
freeram:    629784576
freeram diff:  660107264 -  629784576 =   30322688
sharedram:          0
bufferram:   49942528
totalswap: 1044377600
freeswap:   958582784
totalhigh:          0
freehigh:           0
<--- mem_debug (emc/nml_intf/interpl.cc append 199)
emc/nml_intf/interpl.cc 169: NML_INTERP_LIST::append : assumptions about NML_INTERP_LIST_NODE have been violated.
NML_INTERP_LIST::append(nml_msg_ptr{size=24,type=EMC_TASK_PLAN_SYNCH}) : list_size=1, line_number=0
---> mem_debug (emc/nml_intf/interpl.cc append 199)
pid:  29621
name: milltask
totalram:  2602352640
freeram:    630038528
freeram diff:  660107264 -  630038528 =   30068736
sharedram:          0
bufferram:   49942528
totalswap: 1044377600
freeswap:   958582784
totalhigh:          0
freehigh:           0
<--- mem_debug (emc/nml_intf/interpl.cc append 199)

Shouldn't the method which eats away all the memory be called more frequently when the gcode is loaded and be consistent with how much free memory is really available on the system?

The last call on my log is to NML_INTERP_LIST::append with freeram: 630038528 while freeram is in reality 85252 kB after loading the gcode file.

Where else should I search?

ArcEye · 2018-08-02T12:05:42Z

Comment by RobertBerger
Sat Jul 19 14:53:35 2014

What you can see here[1] is smem called every 5 secs (plus some filtering on the output).

#!/bin/bash
echo "Press [CTRL+C] to stop.."
while :
do
        echo "======================================================"
        sudo smem | grep $(pgrep -o -x axis)
        echo "Swap | USS | PSS | RSS " 
        sleep 5
done

We can clearly see that the axis process is the one eating RAM while loading gcode. So I believe that the problem is not in the milltask but axis (what I suspected from OOM output).

Unshared memory unique to that process increases from 32216 to 10829921 which most likely blows up on the BBB.

Do you have an idea where in axis this happens?

[1] http://pastebin.com/UW73W1aQ

ArcEye · 2018-08-02T12:05:43Z

Comment by mhaberler
Sat Jul 19 15:31:00 2014

very interesting - I should really check my facts before posting, clearly an assumption wont do ;-) still away from a working install for a week or so, sorry

that said - both milltask and axis use in-memory structures which are bound to fail on large inputs; it just seems that Axis is more agressive on memory consumption than milltask

what the axis process does is run the ngc file through the preview interpreter and build an OpenGL representation of the path; since the interpreter per se works with bounded memory it is unlikely to be the cause, but rather the datastructures built from its output; follow load_preview in axis, down to GLCanon in lib/python/rs274/glcanon.py

to verify it is axis which dies on preview, I suggest to instrument the ngc file with: (AXIS,hide) somewhere near the start (see glcanon.py:92:commend() - this should suppress building the preview and - if it is the cause of the failure - keep Axis running without running out of memory

does this work as expected?

ArcEye · 2018-08-02T12:05:43Z

Comment by RobertBerger
Sat Jul 19 19:41:36 2014

I ran some tests on the BBB and loaded gcode from a working file:

axis without loading gcode consumes roughly 78 M
without (AXIS,stop) axis consumes 326 M after the gcode was loaded
with (AXIS,stop) axis consumes 193 M after the gcode was loaded

(which is still a lot - what's in those 115 M? I don't see anything changing in the GUI. Is it really just the commands?)

The file which causes the visit of the OOM killer still can not be loaded even when I add (AXIS,stop).

The good news are that now we search at the right spot.

How can we further reduce axis/GUI memory usage? Can we change to another more lightweight GUI?

ArcEye · 2018-08-02T12:05:44Z

Comment by mhaberler
Sat Jul 19 19:49:37 2014

As for changing Axis - AFAIC this is a lost cause, it is an impenetrable spaghetti monstrosity with a very arcane control flow thanks to Python AND Tcl/Tk being used simultaneously

Several people use a different UIs, but I have no experience with them - just look back in the mailinglist (I think it was tkemc)

ArcEye · 2018-08-02T12:05:45Z

Comment by robEllenberg
Tue Feb 3 20:20:59 2015

Hi All,

I'm not sure if this effort is still going on, but I had a thought about a "quick fix" for the in-memory problem. If the linked list storing NML commands is mostly sequentially accessed (i.e. pulling messages off the end to send to motion / IO), then maybe we could use an existing STL-like container designed to do disk swapping automatically? I found this with some searching:

http://stxxl.sourceforge.net/

It probably wouldn't help with the Axis/gremlin memory consumption issue, but it could be useful for the internals.

ArcEye · 2018-08-02T12:05:45Z

Comment by mhaberler
Wed Feb 4 06:20:12 2015

I'm on the case (#106) but got drawn aside by the vtable stuff (to be frank that is a lot more interesting than digging miswritten code in task ..)

I was planning to use a zeroMQ socket as queue as this would enable remote ops for free. But I'm getting second thoughts on this - @robEllenberg : do you see this stxxl code as a means to obtain a canon queue which can be walked backwards/forwards for e.g. EDM (step back a path if wire breaks)?

actually a UI to the canon queue (view, maybe insert/delete canon ops) would be an interesting concept

it might also be useful for your state-tracking stuff

just noted it's in debian:
$ apt-cache search stxxl
libstxxl-dev - Development libraries for STXXL
libstxxl-doc - Documentation for STXXL
libstxxl1 - C++ STL drop-in replacement for extremely large datasets
libstxxl1-dbg - Debugging symbols for STXXL libraries

ArcEye · 2018-08-02T12:05:46Z

Comment by mhaberler
Sat Apr 18 06:40:39 2015

Jean-Paul just made a great suggestion which would be a stopgap to the issue without requiring the full rewrite of the control structure: limit readahead based on interplist size

fact is - the infinite readahead done by task/interp does not add to path quality if the tp queue is stuffed anyway, and I think that is some 1200 entries or so, so it would be fine to not call the interpreter from task if the interplist goes above a (configurable) high water mark

the way it could work is:

the InterpList class gets a length() member which observes on how many elements are in the list (increment on append, decrement on dequeue)
the task logic deciding to read ahead or not inspects the interplist length and compares it to a high water mark (taken from the ini file and say defaulted at a few thousand)

the result should be that the interplist size remains bounded regardless of input file size

I guess this is a fairly small change, and should fix the issue for embedded systems (hopefully for good)

ArcEye · 2018-08-02T12:05:47Z

Comment by luminize
Sat Apr 18 07:39:37 2015

That sound very logical.
InterpList becoming a buffer :)

On 18 Apr 2015, at 08:40, Michael Haberler notifications@github.com wrote:

Jean-Paul just made a great suggestion which would be a stopgap to the issue without requiring the full rewrite of the control structure: limit readahead based on interplist size

fact is - the infinite readahead done by task/interp does not add to path quality if the tp queue is stuffed anyway, and I think that is some 1200 entries or so, so it would be fine to not call the interpreter from task if the interplist goes above a (configurable) high water mark

the way it could work is:

the InterpList class gets a length() member which observes on how many elements are in the list (increment on append, decrement on dequeue)
the task logic deciding to read ahead or not inspects the interplist length and compares it to a high water mark (taken from the ini file and say defaulted at a few thousand)
the result should be that the interplist size remains bounded regardless of input file size

I guess this is a fairly small change, and should fix the issue for embedded systems (hopefully for good)

—
Reply to this email directly or view it on GitHub.

ArcEye · 2018-08-02T12:05:47Z

Comment by mhaberler
Sat Apr 18 07:57:25 2015

well I'm puzzled - exactly this is already in place, see https://github.com/machinekit/machinekit/blob/master/src/emc/task/emctaskmain.cc#L559-L560, also the InterpList class already has a len() field - so everything seems to be in place

It seems to me we need a bit more thorough investigation of the source of the problem - if this size-limiting does in fact work, the cause of the out-of-memory error must be different

ArcEye · 2018-08-02T12:05:48Z

Comment by mhaberler
Sat Apr 18 08:03:57 2015

I would appreciate if somebody having this kind of error could try this trap: https://github.com/mhaberler/machinekit/tree/limit-readahead-by-interplist-size

if task fails with out-of-memory as before, we need to look elsewhere
if it dumps core on the assert, please report the output of:

gdb ../bin/milltask core
print interp_list.len()
backtrace

ArcEye · 2018-08-02T12:05:49Z

Comment by mhaberler
Sat Apr 18 08:56:05 2015

it might make sense to run milltask under valgrind to find the source of leaks

to do so:

install valgrind
in the ini, specify TASK = valgrind --leak-check=yes milltask
edit bin/axis and increase the axis timeout (valgrind slows things down very much):

@@ -3049,11 +3049,11 @@ statwait=.01
while s.axes == 0:
print "waiting for s.axes"
time.sleep(statwait)
statfail+=1
statwait *= 2

if statfail > 8:
if statfail > 20:
raise SystemExit, (
"A configuration error is preventing Machinekit from starting.\n"
"More information may be available when running from a terminal.")
s.poll()
run linuxcnc ../configs/sim/axis/axis_mm.ini >/tmp/out 2>&1
torture it with a huge ngc file so the OOM error happens
post the /tmp/out file

ArcEye · 2018-08-02T12:05:49Z

Comment by mhaberler
Sun Apr 19 04:48:54 2015

well, I guess I need an NGC file and a non-Axis config referred to which reproducably causes this error.
anybody has one around?

probably it would help if one just attached gdb to milltask and produce a backtrace like so:

start your config
find out pid of milltask
run 'gdb -p '
type 'continue'
load the ngc file causes the OOM error
milltask should drop into gdb
type 'backtrace' and post results

ArcEye · 2018-08-02T12:05:50Z

Comment by luminize
Sun Apr 19 05:33:24 2015

I might have one.
Will have to look it up.
Will send it to you by mail

On 19 Apr 2015, at 06:48, Michael Haberler notifications@github.com wrote:

well, I guess I need an NGC file and a non-Axis config referred to which reproducably causes this error.
anybody has one around?

probably it would help if one just attached gdb to milltask and produce a backtrace like so:

start your config
find out pid of milltask
run 'gdb -p '
type 'continue'
load the ngc file causes the OOM error
milltask should drop into gdb
type 'backtrace' and post results
—
Reply to this email directly or view it on GitHub.

ArcEye · 2018-08-02T12:05:51Z

Comment by evandene
Fri Oct 23 20:28:20 2015

Problem Out Of Memory when loading large G code files solved for my Delta Printer with BBB plus BeBoPr++ plus pololu DRV8825 drivers.
Already mentioned by quite a few others here; using the UI "tkemc" (DISPLAY = tkemc) instead of "axis" (DISPLAY = axis) in your lineardelta.ini file (or ÿourmachine.ini file) really does the job. But ... make sure before starting Machinekit, reboot the BBB. I'm not sure but rebooting seems to me that all what was left from the previous config gets cleaned out and you get a fresh start. Also make sure you have zeroed the "A" axis (Extruder) each time you print a new part with "G92 A0" in the header of your G code file or just as an instruction in the MIDI command line.
Homing all three axis one by one feels a little clumsy but you get used to it. Make sure after the homing procedure the display shows you the specific home locations.
Good luck and run files as big as 30 Mb or bigger without a problem.

ArcEye added the problem label Aug 2, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

out of memory/large files results in stopped LinuxCNC #13

out of memory/large files results in stopped LinuxCNC #13

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

out of memory/large files results in stopped LinuxCNC #13

out of memory/large files results in stopped LinuxCNC #13

Comments

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018

ArcEye commented Aug 2, 2018