machinekit-multicore Beta testing for backward compatibility #10

Closed
ArcEye opened this Issue Jan 12, 2017 · 52 comments


@ArcEye
Collaborator
ArcEye commented Jan 12, 2017

This issue tracker is for those users who assist in testing the readiness of this repo for merge with machinekit.

Testers need to be able to build and run RIP builds,
(http://www.machinekit.io/docs/developing/developing/ )
set debugging levels
( http://www.machinekit.io/docs/config/ini_config/#sub:EMC-section - DEBUG=7 is normally sufficient)
and provide comprehensive logs of any errors
(copy of relevant section of /var/log/linuxcnc.log, copy of dmesg for segfaults etc. and stderr output to terminal)
plus access to complete configs that produce them.
(preferably via a github repo)

Testing across as wide a range of computers and machines as possible is crucial.

If you are able to test, from a terminal session:

  • clone it
    (git clone http://github.com/machinekit/machinekit-multicore.git)
  • build it locally
    (cd machinekit-multicore/src && ./autogen.sh && ./configure (insert any platform specific switches) && make -j$(nproc) && sudo make setuid)
  • then set the env
    ( cd ../ && . ./scripts/rip-environment )
  • run your existing configs.
    (machinekit ---- then select your config)

Results, bugs, queries etc. should be posted to this tracker.

@sirop
sirop commented Jan 14, 2017

My first tests with linuxcnc-ethercat probably show that the yet unresolved issue koppi/mk#4 evolves to a seg fault right after threads start:

Jan 14 04:10:25 debian-slave msgd:0: hal_lib:3796:rt rtapi_task_start:  starting task 1 'test:0'
Jan 14 04:10:25 debian-slave msgd:0: hal_lib:3796:rt RTAPI: period_nsec: 1000000
Jan 14 04:10:25 debian-slave msgd:0: hal_lib:3796:rt hal_create_xthread:274 HAL: thread test id 1 created prio=98
Jan 14 04:10:25 debian-slave msgd:0: hal_lib:3796:rt rtapi_task_wrapper: task 0x7fa058622d08 'test:0' period=1000000 prio=98 ratio=1
Jan 14 04:10:29 debian-slave msgd:0: hal_lib:3802:user hal_add_funct_to_thread:214 HAL: adding function 'lcec.read-all' to thread 'test'
Jan 14 04:10:37 debian-slave msgd:0: hal_lib:3802:user hal_add_funct_to_thread:214 HAL: adding function 'lcec.write-all' to thread 'test'
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/slave/linuxcnc-ethercat/examples/generic-complex/Copley)
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user  --- rtapi_app backtrace: ---
Jan 14 04:10:52 debian-slave msgd:0: hal_lib:3802:user hal_start_threads:343 HAL: starting threads
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa056b3a228 lcec_update_mast (/home/slave/linuxcnc-ethercat/src/lcec_main.c:1023)
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa057fea0cb thread_task      (hal/lib/hal_thread.c:65)
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa058415dfc _rtapi_task_wrap (rtapi/xenomai.c:273)
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa05b39c7fe ?? ??:0
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa05af7c0a3 start_thread ??:0
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user 7fa05922062c clone ??:0
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user ffffffffffffffff ?? ??:0
Jan 14 04:10:52 debian-slave rtapi:0: 1:rtapi_app:3796:user  --------------------
Jan 14 04:10:52 debian-slave msgd:0: rtapi_app exit detected - scheduled shutdown

With non-multicore MK this seg fault occurred at exit, which was annoying but not a show stopper. With multicore MK the seg fault occurs right at thread start, which is a show stopper.

The cure might be to add a missing link step as Haberler suggests in koppi/mk#4 (comment) .
I'll try this today.

If interested in configs: https://github.com/sirop/linuxcnc-ethercat/tree/copley/examples/generic-complex/Copley , copley branch .

@ArcEye
Collaborator
ArcEye commented Jan 14, 2017

> My first tests with linuxcnc-ethercat probably show that the yet unresolved issue koppi/mk#4 evolves to a seg fault right after threads start:

Threads are handled differently which may be the cause of the difference in timing.
I would be more worried if there was a segfault which was not found with the existing repo.

You may find it easier to integrate your component into the build, in a suitable Submakefile, to get it built as @mhaberler did.

We will await your results.

@sirop
sirop commented Jan 15, 2017

I actually understood Haberler's advice as being to use something like Makefile.modinc for an out-of-tree build.
Haberler used a Submakefile only for a small subset of Sittner's ethercat component as we tried to isolate the problem.
I tried to implement it in https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/realtime.mk ,
but this did not improve anything.

Have to postpone further tests for several days.

@ArcEye
Collaborator
ArcEye commented Jan 18, 2017

Tested machinekit-multicore with a Mesa 5i25/7i76 combo on a milling machine, for several hours on Monday (2 days ago).

Existing mesa config worked perfectly without any alterations, machine did not miss a beat.

@ArcEye
Collaborator
ArcEye commented Jan 18, 2017

Built machinekit-multicore on a DE0-NANO-Soc.
Normal sim works, also the V2-icomp-demo sim

Will test in next few days running with a DB25 extension board to a 7i76 on the same mill as above.
Has previously been tested with vanilla machinekit and works.

@ArcEye
Collaborator
ArcEye commented Jan 19, 2017 edited

Found a problem with machinekit-multicore when using hm2_soc_ol with the DE0-NANO-Soc.

multicore pre-dates the mk-soc-fpga work by some time and functions within rtapi_compat are at variance.

multicore contains a function
extern int procfs_cmd(const char *path, const char *format, ...);
In machinekit/master this function is named
int rtapi_fs_write(const char *path, const char *format, ...);

multicore also lacks an rtapi_fs_read() function, which hm2_soc_ol relies upon.

Introduced the function, plus some DSO_USER stuff at the end, and renamed the calls to
procfs_cmd() in rtapi_app.c; it all builds.
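
For anyone not familiar with the helper being renamed, a printf-style "write formatted text to a path" function with the same prototype shape as the one quoted above might look roughly like this. It is only a sketch: the name sketch_fs_write and the whole body are assumptions for illustration, not the rtapi_compat code.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in with the same prototype shape as
 * rtapi_fs_write(const char *path, const char *format, ...).
 * Returns the number of characters written, or a negative errno value. */
int sketch_fs_write(const char *path, const char *format, ...)
{
    FILE *fp;
    va_list ap;
    int n;

    if (path == NULL || format == NULL)
        return -EINVAL;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -errno;

    /* Forward the variadic arguments to the file stream. */
    va_start(ap, format);
    n = vfprintf(fp, format, ap);
    va_end(ap);

    if (n < 0) {
        fclose(fp);
        return -EIO;
    }
    if (fclose(fp) != 0)
        return -EIO;

    return n;
}
```

A caller would use it like printf with a path in front, e.g. sketch_fs_write("/tmp/example", "%d", 1), and check for a negative return.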

Need to connect to 7i76 etc to test actual working, but driver now loads

EDIT:
Well it does load, but when you pass the config string as per running the socfpga_stepper sim, you get

hal_call_usrfunct(newinst,config="firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4") failed: -22 - Invalid argument

More debugging required ;/

@sirop
sirop commented Jan 22, 2017

So continuing with Sittner's component...

linuxcnc.log shows:

Jan 22 03:44:33 debian-slave rtapi:0: 1:rtapi_app:10800:user  --- rtapi_app backtrace: ---
Jan 22 03:44:33 debian-slave rtapi:0: 1:rtapi_app:10800:user 7f7755961a68 lcec_update_mast (/home/slave/linuxcnc-ethercat/src/lcec_main.c:1023)

linuxcnc-ethercat/src/lcec_main.c:1023 is:

  *(hal_data->slaves_responding) = ms->slaves_responding;

allocated before with hal_data = hal_malloc(sizeof(lcec_master_data_t)) .

As I see it, hal_malloc now uses a mutex via halg_malloc(1, size) .
Does this mean that *(hal_data->slaves_responding) also needs a mutex request before accessing it?

@ArcEye
Collaborator
ArcEye commented Jan 22, 2017

@sirop

> Does this mean that *(hal_data->slaves_responding) also needs a mutex request before accessing it?

I doubt it, it is even optional for halg_malloc()

My SWAG would be a clash between global and local names

You have named a local pointer to your data structure, hal_data
https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L1022

This name is defined in hal_priv.h as the global pointer to the master data structure hal_data_t struct that holds all the hal memory area data.
https://github.com/machinekit/machinekit-multicore/blob/master/src/hal/lib/hal_priv.h#L254

Try renaming all instances of hal_data as local_hal_data or similar and see if that changes anything.

@sirop
sirop commented Jan 22, 2017

> Try renaming all instances of hal_data as local_hal_data or similar and see if that changes anything.

Tried this out. It did not help: still a seg fault at the same place right after the thread start.

Will check my etherlab master version next week - just in case...

@ArcEye
Collaborator
ArcEye commented Jan 23, 2017 edited

> Tried this out. It did not help: still a seg fault at the same place right after the thread start.

Worth a try, probably should not use hal_data as a name anyway.

I would certainly start with debugging prints of the pins etc. to find the NULL value or out of range pointer, which is presumably causing the segfault.

Perhaps looking at the global_hal_data here and printing not only any values but the actual address
https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L206
and then checking it is the same here
https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L1055
which is the call that seems to be triggering the error.
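
The sort of debugging print meant here could be sketched as below. The struct is a cut-down, hypothetical miniature of a pin block (plain uint32_t pointers instead of the HAL pin types), and sketch_check_pins is an invented name; the idea is just to log the struct address and each pin pointer at both call sites and compare the output.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical miniature of a pin block: every member is a pointer to the
 * pin's storage, as with the slaves_responding member discussed above. */
typedef struct {
    uint32_t *slaves_responding;
    uint32_t *state_op;
    uint32_t *link_up;
} sketch_pins_t;

/* Print the struct address and every pin pointer.
 * Returns the number of NULL pin pointers, or -1 if the struct pointer
 * itself is NULL. */
int sketch_check_pins(const char *where, const sketch_pins_t *p)
{
    int missing = 0;
    size_t i;

    fprintf(stderr, "%s: pin struct at %p\n", where, (const void *)p);
    if (p == NULL)
        return -1;

    const struct {
        const char *name;
        const uint32_t *ptr;
    } pins[] = {
        { "slaves_responding", p->slaves_responding },
        { "state_op",          p->state_op },
        { "link_up",           p->link_up },
    };

    for (i = 0; i < sizeof pins / sizeof pins[0]; i++) {
        fprintf(stderr, "%s:   %s -> %p\n", where, pins[i].name,
                (const void *)pins[i].ptr);
        if (pins[i].ptr == NULL)
            missing++;
    }
    return missing;
}
```

Calling it once where the pins are created and once at the top of the failing function shows whether the addresses still match. Note that this only catches NULL pointers; a non-NULL but out-of-range pointer has to be spotted by comparing the printed addresses.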

The reason I don't think the mutex is required, is simply that rt components allocate memory to an inst_comp struct, fill it with pins and address them with dereferenced pointers all the time.

In this code, when the memory is allocated and the pins are created, they are initialised using exactly the same addressing as in the function which is erroring, without any problem
https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L963

// initialize pins
  *(hal_data->slaves_responding) = 0;
  *(hal_data->state_init) = 0;
  *(hal_data->state_preop) = 0;
  *(hal_data->state_safeop) = 0;
  *(hal_data->state_op) = 0;
  *(hal_data->link_up) = 0;
  *(hal_data->all_op) = 0;
@ArcEye
Collaborator
ArcEye commented Jan 26, 2017 edited

Back to DE0-NANO-Soc - just to log where things are currently.

Commit made to correct #10 (comment) so that it now tries to load hm2_soc_ol

Test prints trying to load the same config as specified in the socfpga-stepper sim config

machinekit@mksocfpga:~/machinekit$ DEBUG=5 realtime restart
machinekit@mksocfpga:~/machinekit$ halcmd loadrt hostmot2
<commandline>:0: Realtime module 'hostmot2' loaded
machinekit@mksocfpga:~/machinekit$ halcmd newinst hm2_soc_ol hm2-socfpga0 config="firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4"
<commandline>:0: Realtime module 'hm2_soc_ol' loaded
<commandline>:0: rc=-22: hal_call_usrfunct(newinst,config=firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4) failed: -22 - Invalid argument
machinekit@mksocfpga:~/machinekit$

The linuxcnc.log shows

Jan 26 16:00:21 mksocfpga msgd:0: startup pid=14963 flavor=rt-preempt rtlevel=5 usrlevel=5 halsize=524288 shm=Posix cc=gcc 4.9.2  version=unknown
Jan 26 16:00:21 mksocfpga msgd:0: ØMQ=4.0.5 czmq=3.0.2 protobuf=2.6.1 atomics=gcc intrinsics    libwebsockets=<no version symbol>
Jan 26 16:00:21 mksocfpga msgd:0: configured: sha=7ce7c10
Jan 26 16:00:21 mksocfpga msgd:0: built:      Jan 26 2017 13:36:18 sha=7ce7c10
Jan 26 16:00:21 mksocfpga msgd:0: register_stuff: actual hostname as announced by avahi='mksocfpga.local'
Jan 26 16:00:21 mksocfpga msgd:0: zeroconf: registering: 'Log service on mksocfpga.local pid 14963'
Jan 26 16:00:21 mksocfpga rtapi:0: 2:rtapi_app:14968:user rtapi:0: cannot create core dumps - /proc/sys/fs/suid_dumpable contains 0
Jan 26 16:00:21 mksocfpga rtapi:0: 2:rtapi_app:14968:user you might have to run 'echo 1 > /proc/sys/fs/suid_dumpable' as root to enable rtapi_app core dumps
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user rtapi.so default iparms: ''
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user RTAPI:0  rt-preempt unknown init
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user rtapi: loaded from rtapi.so
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user hal_lib.so default iparms: ''
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user rtapi_app_main:195 HAL: initializing RT hal_lib support
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_xinitfv:83 HAL: initializing component 'hal_lib' type=4 arg1=0 arg2=0/0x0
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user hal_heap_addmem:58 HAL: extending arena by 262144 bytes
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_export_xfunctfv:85 HAL: exporting function 'newinst' type 2 fp=0 owner=66
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_export_xfunctfv:85 HAL: exporting function 'delinst' type 2 fp=0 owner=66
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_xinitfv:264 HAL: singleton component 'hal_lib' id=66 initialized
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user rtapi_app_main:199 HAL: RT hal_lib support initialized rc=66
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user hal_lib: loaded from hal_lib.so
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user accepting commands at ipc:///tmp/0.rtapi.a42c8c6b-4025-4f83-ba28-dad21114744a
Jan 26 16:00:22 mksocfpga rtapi:0: 3:rtapi_app:14968:user rtapi_app:0 ready flavor=rt-preempt gcc=4.9.2 git=unknown
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user pid=14968 flavor=rt-preempt gcc=4.9.2 git=unknown
Jan 26 16:00:22 mksocfpga rtapi:0: 4:rtapi_app:14968:user pid=14968 flavor=rt-preempt gcc=4.9.2 git=unknown
Jan 26 16:00:22 mksocfpga msgd:0: ulapi:14969:user _ulapi_init(): ulapi rt-preempt unknown loaded
Jan 26 16:00:22 mksocfpga msgd:0: ulapi:14969:user halg_xinitfv:264 HAL: singleton component 'hal_lib14969' id=70 initialized
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user --halcmd ping
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user halg_exit:286 HAL: removing component 72 'halcmd14969'
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user ulapi_hal_lib_cleanup:235 HAL: lib_module_id=70
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user halg_exit:286 HAL: removing component 70 'hal_lib14969'
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user halg_exit:308 HAL: hal_errorcount()=0
Jan 26 16:00:22 mksocfpga msgd:0: hal_lib:14969:user halg_exit:309 HAL: _halerrno=0
Jan 26 16:00:22 mksocfpga msgd:0: zeroconf: registered 'Log service on mksocfpga.local pid 14963' _machinekit._tcp 0 TXT "uuid=a42c8c6b-4025-4f83-ba28-dad21114744a" "instance=8a967e22-e3e0-11e6-9d4d-bad04a9c4ece"
Jan 26 16:00:30 mksocfpga rtapi:0: 4:rtapi_app:14968:user pid=14968 flavor=rt-preempt gcc=4.9.2 git=unknown
Jan 26 16:00:30 mksocfpga rtapi:0: 4:rtapi_app:14968:user hostmot2.so default iparms: ''
Jan 26 16:00:30 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_xinitfv:83 HAL: initializing component 'hostmot2' type=1 arg1=0 arg2=0/0x0
Jan 26 16:00:30 mksocfpga rtapi:0: 4:rtapi_app:14968:user hostmot2: loaded from hostmot2.so
Jan 26 16:00:31 mksocfpga msgd:0: ulapi:14974:user _ulapi_init(): ulapi rt-preempt unknown loaded
Jan 26 16:00:31 mksocfpga msgd:0: ulapi:14974:user halg_xinitfv:264 HAL: singleton component 'hal_lib14974' id=74 initialized
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user --halcmd loadrt hostmot2
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user halg_exit:286 HAL: removing component 76 'halcmd14974'
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user ulapi_hal_lib_cleanup:235 HAL: lib_module_id=74
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user halg_exit:286 HAL: removing component 74 'hal_lib14974'
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user halg_exit:308 HAL: hal_errorcount()=0
Jan 26 16:00:31 mksocfpga msgd:0: hal_lib:14974:user halg_exit:309 HAL: _halerrno=0
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user pid=14968 flavor=rt-preempt gcc=4.9.2 git=unknown
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user hm2_soc_ol.so default iparms: 'debug=0 config="(null)" descriptor="(null)" no_init_llio=0 num=0'
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_xinitfv:83 HAL: initializing component 'hm2_soc_ol' type=1 arg1=0 arg2=0/0x0
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user hm2_soc_ol: loaded from hm2_soc_ol.so
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user do_newinst_cmd: instargs='config=firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4'
Jan 26 16:00:39 mksocfpga rtapi:0: 4:rtapi_app:14968:user halg_inst_create:59 HAL: rtapi: creating instance 'hm2-socfpga0' size 208

** Jan 26 16:00:39 mksocfpga rtapi:0: 1:rtapi_app:14968:user hm2_soc_ol: stat((null)) failed: No such file or directory **

Jan 26 16:00:39 mksocfpga rtapi:0: hal_call_usrfunct(newinst,config=firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4) failed: -22 - Invalid argument
Jan 26 16:00:39 mksocfpga rtapi:0: 1:rtapi_app:14968:user hal_call_usrfunct(newinst,config=firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4) failed:
Jan 26 16:00:39 mksocfpga msgd:0: ulapi:14977:user _ulapi_init(): ulapi rt-preempt unknown loaded
Jan 26 16:00:39 mksocfpga msgd:0: ulapi:14977:user halg_xinitfv:264 HAL: singleton component 'hal_lib14977' id=80 initialized
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user --halcmd newinst hm2_soc_ol hm2-socfpga0 config=firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I85S_GPIO_GPIO.dtbo num_stepgens=4 num_encoders=1 num_mencoders=4
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user halg_exit:286 HAL: removing component 82 'halcmd14977'
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user ulapi_hal_lib_cleanup:235 HAL: lib_module_id=80
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user halg_exit:286 HAL: removing component 80 'hal_lib14977'
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user halg_exit:308 HAL: hal_errorcount()=0
Jan 26 16:00:39 mksocfpga msgd:0: hal_lib:14977:user halg_exit:309 HAL: _halerrno=0

The relevant line, separated out above, seems to show the error occurring in hm2_soc_ol at line 526:

    // read a custom fwid message if given
    if (descriptor) {
        struct stat st;
        if (stat(descriptor, &st)) {
            LL_ERR("stat(%s) failed: %s\n", descriptor,
                   strerror(errno));
            return -EINVAL;
        }

The stat() test is being entered even though the error message says descriptor was NULL and no descriptor=xxx was passed.

My only guess at present is that the args are being corrupted / in the wrong order etc.
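
One possible reading of the log, offered as an assumption rather than a confirmed diagnosis: the default iparms line shows descriptor="(null)", so the parameter may hold the literal text "(null)" instead of a NULL pointer. In that case if (descriptor) is true and stat() is called on a file named "(null)". A defensive check along those lines could be sketched as follows; sketch_param_is_set is a hypothetical helper, not part of the driver.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide whether a string parameter was really given.
 * Treats a NULL pointer, an empty string, and the literal text "(null)"
 * (what glibc's printf emits for a NULL %s argument) as "not set". */
int sketch_param_is_set(const char *value)
{
    if (value == NULL)
        return 0;
    if (value[0] == '\0')
        return 0;
    if (strcmp(value, "(null)") == 0)
        return 0;
    return 1;
}
```

The descriptor block would then be guarded with if (sketch_param_is_set(descriptor)) rather than a bare pointer test. Whether the defaults really end up as that literal string is something the debug prints would need to confirm.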

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Jan 27, 2017

> I'll send a PR with a minimal fix for the soc driver, but this should maybe be handled differently at the rtapi end.
> Sadly, after fixing this I am getting a segfault later on in the driver loading that looks like it's related to protobuf:

That should hold it for now.
I did effectively the same thing by commenting out the descriptor block and got the same
User signal 11: segfault but ran out of time before I could go through the core file.

The error I had at stderr was to do with machinetalk timing out, which I suspect will be because of
a protobuf or similar mismatch

Something is certainly not right with the way the args are handled.
machinekit/master does things differently, and I suspect some of it was changed because of this
driver and the need to pass very long config strings, as per the Mesa configs, within an instantiated component.

We removed strings from the instanceparam types available in instcomp, because of the problems they caused. It was left to users to pass them after the delimiter and handle through the argc/argv mechanism which worked much better.

Hopefully I will get some time to look at it tomorrow.
It would be handy to do a debug build and step through the execution to see what is happening, but on this board with its limited resources, probably a non starter and we will just have to go from error debug to the next error debug.

@dkhughes
Contributor

Looks like a lot of great work. I wish I could be of more help but I am swamped at the moment. Would you guys like me to spin a SD card image with a debug RIP build? With the odroids it only takes about 20 minutes to build and it will skip that first huge compile. Might not be that helpful if you are working deep in emc though where a rebuild basically touches every driver, but I figured I would offer since maybe one debugger session is better than none.

@ArcEye
Collaborator
ArcEye commented Jan 28, 2017

> Would you guys like me to spin a SD card image with a debug RIP build?

That could be very useful thanks.

Even being able to track the exact call sequences easily, without being able to step into functions, would be an improvement.

@sirop
sirop commented Jan 28, 2017

@ArcEye

> Perhaps looking at the global_hal_data here and printing not only any values but the actual address
> https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L206
> and then checking it is the same here
> https://github.com/sirop/linuxcnc-ethercat/blob/copley/src/lcec_main.c#L1055
> which is the call that seems to be triggering the error.

I followed your advice: global_hal_data sustains its initially given address.

While debugging the core according to https://github.com/mhaberler/asciidoc-sandbox/wiki/Debugging-RT-components#analyzing-a-core-dump :

(gdb) list
...
1022    void lcec_update_master_hal(lcec_master_data_t *lcec_hal_data, ec_master_state_t *ms) {
1023      *(lcec_hal_data->slaves_responding) = ms->slaves_responding;
...
(gdb) p ms
$1 = (ec_master_state_t *) 0x7ff4e322de98 <global_ms>
(gdb) p lcec_hal_data
$2 = (lcec_master_data_t *) 0x7ff4e44bb358
(gdb) p  ms->slaves_responding
$3 = 1
(gdb) p  *(lcec_hal_data->slaves_responding)
Cannot access memory at address 0x7ff4e44bb358
(gdb) p  lcec_hal_data->slaves_responding
Cannot access memory at address 0x7ff4e44bb358

Is it usual that one can access the rvalue address but not the lvalue address in the core dump?

@ArcEye
Collaborator
ArcEye commented Jan 28, 2017

I am no expert on gdb, but I don't think it matters how a pointer is used in code, what gdb is checking is the address and the value at the address.
It looks to me as though your ec_master_state_t *ms pointer is valid and that the lcec_master_data_t *lcec_hal_data
pointer is not

(gdb) p ms
$1 = (ec_master_state_t *) 0x7ff4e322de98 <global_ms>
(gdb) p  ms->slaves_responding
$3 = 1

(gdb) p  *(lcec_hal_data->slaves_responding)
Cannot access memory at address 0x7ff4e44bb358
(gdb) p  lcec_hal_data->slaves_responding
Cannot access memory at address 0x7ff4e44bb358

The hard bit is figuring out why 😄

@ArcEye
Collaborator
ArcEye commented Jan 28, 2017 edited

Back on the DE0-NANO-Soc

Breakthrough !!!!!!!!!!!!!!!!!!!!!!!

The last gdb entry gave a starting point and the old fashioned debugging through prints to linuxcnc.log eventually narrowed down to the segfault arising within rtapi_hex_dump, which is called from fwid.c (and also hm2_soc_ol)

Commenting out those calls produces a proper loading and the socfpga sim works.

Reverting the code to the same as machinekit/master still segfaults, so I have more work to do working out which parameter is producing a null or out of range pointer, then tidying and committing the changes I made to get that far.

But it now runs !!!!!!!!!

 machinekit@mksocfpga:~/machinekit$ DEBUG=5 realtime restart
 machinekit@mksocfpga:~/machinekit$ halcmd loadrt hostmot2
 <commandline>:0: Realtime module 'hostmot2' loaded
 machinekit@mksocfpga:~/machinekit$ halcmd newinst hm2_soc_ol hm2-socfpga0 config="firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I76_7I76_7I76.dtbo num_encoders=1 num_stepgens=4"
 <commandline>:0: Realtime module 'hm2_soc_ol' loaded
 machinekit@mksocfpga:~/machinekit$ halcmd show pin
 Component Pins:
     Comp   Inst Type  Dir         Value  Name                                            Epsilon Flags  linked to:
     84     86 float IN            100  hm2_de0n.0.dpll.01.timer-us             	0.000010	--l-
     84     86 float IN            100  hm2_de0n.0.dpll.02.timer-us             	0.000010	--l-
     84     86 float IN            100  hm2_de0n.0.dpll.03.timer-us             	0.000010	--l-
     84     86 float IN            100  hm2_de0n.0.dpll.04.timer-us             	0.000010	--l-
     84     86 float IN             -1  hm2_de0n.0.dpll.base-freq-khz           	0.000010	--l-
     84     86 u32   OUT    0x00000000  hm2_de0n.0.dpll.ddsize                  		--l-
     84     86 float OUT             0  hm2_de0n.0.dpll.phase-error-us          	0.000010	--l-
     84     86 u32   IN     0x00400000  hm2_de0n.0.dpll.plimit                  		--l-
     84     86 u32   OUT    0x00000001  hm2_de0n.0.dpll.prescale                		--l-
     84     86 u32   IN     0x0000A000  hm2_de0n.0.dpll.time-const              		--l-
     84     86 s32   OUT             0  hm2_de0n.0.encoder.00.count             		--l-
.....
     84     86 bit   IN          FALSE  hm2_de0n.0.encoder.00.reset             		--l-
     84     86 float OUT             0  hm2_de0n.0.encoder.00.velocity          	0.000010	--l-
     84     86 u32   IN     0x017D7840  hm2_de0n.0.encoder.sample-frequency     		--l-
     84     86 bit   OUT         FALSE  hm2_de0n.0.gpio.000.in                  		--l-
     84     86 bit   OUT          TRUE  hm2_de0n.0.gpio.000.in_not              		--l-
     84     86 bit   OUT         FALSE  hm2_de0n.0.gpio.001.in                  		--l-
     84     86 bit   OUT          TRUE  hm2_de0n.0.gpio.001.in_not              		--l-
.....
     84     86 bit   OUT          TRUE  hm2_de0n.0.gpio.067.in                  		--l-
     84     86 bit   OUT         FALSE  hm2_de0n.0.gpio.067.in_not              		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.gpio.067.out                 		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.led.CR01                     		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.led.CR02                     		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.led.CR03                     		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.led.CR04                     		--l-
     84     86 s32   OUT             0  hm2_de0n.0.pet_watchdog.time            		----
     84     86 s32   I/O             0  hm2_de0n.0.pet_watchdog.tmax            		----
     84     86 bit   OUT         FALSE  hm2_de0n.0.pet_watchdog.tmax-inc        		----
     84     86 s32   OUT             0  hm2_de0n.0.read.time                    		----
     84     86 s32   I/O             0  hm2_de0n.0.read.tmax                    		----
     84     86 bit   OUT         FALSE  hm2_de0n.0.read.tmax-inc                		----
     84     86 s32   OUT             0  hm2_de0n.0.read_gpio.time               		----
     84     86 s32   I/O             0  hm2_de0n.0.read_gpio.tmax               		----
     84     86 bit   OUT         FALSE  hm2_de0n.0.read_gpio.tmax-inc           		----
     84     86 bit   IN          FALSE  hm2_de0n.0.stepgen.00.control-type      		--l-
     84     86 s32   OUT             0  hm2_de0n.0.stepgen.00.counts            		--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.dbg_err_at_match  	0.000010	--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.dbg_ff_vel        	0.000010	--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.dbg_pos_minus_prev	0.000010	--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.dbg_s_to_match    	0.000010	--l-
     84     86 s32   OUT             0  hm2_de0n.0.stepgen.00.dbg_step_rate     		--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.dbg_vel_error     	0.000010	--l-
     84     86 u32   IN     0x0004FFEC  hm2_de0n.0.stepgen.00.dirhold           		--l-
     84     86 u32   IN     0x0004FFEC  hm2_de0n.0.stepgen.00.dirsetup          		--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.stepgen.00.enable            		--l-
     84     86 float IN              1  hm2_de0n.0.stepgen.00.maxaccel          	0.000010	--l-
     84     86 float IN              0  hm2_de0n.0.stepgen.00.maxvel            	0.000010	--l-
     84     86 float IN              0  hm2_de0n.0.stepgen.00.position-cmd      	0.000010	--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.position-fb       	0.000010	--l-
     84     86 float IN              1  hm2_de0n.0.stepgen.00.position-scale    	0.000010	--l-
     84     86 u32   IN     0x00000000  hm2_de0n.0.stepgen.00.step_type         		--l-
     84     86 u32   IN     0x0004FFEC  hm2_de0n.0.stepgen.00.steplen           		--l-
     84     86 u32   IN     0x0004FFEC  hm2_de0n.0.stepgen.00.stepspace         		--l-
     84     86 u32   IN     0x00000000  hm2_de0n.0.stepgen.00.table-data-0      		--l-
     84     86 u32   IN     0x00000000  hm2_de0n.0.stepgen.00.table-data-1      		--l-
     84     86 u32   IN     0x00000000  hm2_de0n.0.stepgen.00.table-data-2      		--l-
     84     86 u32   IN     0x00000000  hm2_de0n.0.stepgen.00.table-data-3      		--l-
     84     86 float IN              0  hm2_de0n.0.stepgen.00.velocity-cmd      	0.000010	--l-
     84     86 float OUT             0  hm2_de0n.0.stepgen.00.velocity-fb       	0.000010	--l-
     84     86 bit   IN          FALSE  hm2_de0n.0.stepgen.01.control-type      		--l-
.....
     84     86 bit   IN          FALSE  hm2_de0n.0.stepgen.02.control-type      		--l-
.....
     84     86 bit   IN          FALSE  hm2_de0n.0.stepgen.03.control-type      		--l-
.....
     84     86 s32   IN             -1  hm2_de0n.0.stepgen.timer-number         		--l-
     84     86 bit   I/O         FALSE  hm2_de0n.0.watchdog.has_bit             		--l-
     84     86 s32   OUT             0  hm2_de0n.0.write.time                   		----
     84     86 s32   I/O             0  hm2_de0n.0.write.tmax                   		----
     84     86 bit   OUT         FALSE  hm2_de0n.0.write.tmax-inc               		----
     84     86 s32   OUT             0  hm2_de0n.0.write_gpio.time              		----
     84     86 s32   I/O             0  hm2_de0n.0.write_gpio.tmax              		----
     84     86 bit   OUT         FALSE  hm2_de0n.0.write_gpio.tmax-inc          		----
@luminize
Member
@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Jan 29, 2017

I have thoroughly tested the call inside fwid.c and it passes correct addresses and parameters.
Stepped through the whole for() loop iterations to ensure it did not exceed the bounds of the buf
and it didn't.

The first trigger for this error is similar to the descriptor = "null" error earlier.
debug was not initialised, and thus the print routine was called even though debug=1 had not been specified.
I have now forced default initialisation to 0.
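
As a rough illustration of the two points above (an explicit default for debug, and a dump loop that cannot run past its buffer), here is a minimal sketch. All names (sketch_params_t, sketch_params_init, sketch_hex_dump) are hypothetical and the code is not taken from fwid.c or the rtapi hex dump routine.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical instance parameters; debug must be given a known default. */
typedef struct {
    int debug;
} sketch_params_t;

/* Force the default explicitly instead of relying on whatever is in memory. */
void sketch_params_init(sketch_params_t *p)
{
    p->debug = 0;
}

/* Format up to len bytes of buf as hex digits into out (capacity outsz).
 * Does nothing unless debug is enabled. Returns the number of input bytes
 * actually formatted, stopping early rather than overrunning out. */
size_t sketch_hex_dump(const sketch_params_t *p, char *out, size_t outsz,
                       const unsigned char *buf, size_t len)
{
    size_t i, used = 0;

    if (out == NULL || outsz == 0)
        return 0;
    out[0] = '\0';
    if (p == NULL || !p->debug || buf == NULL)
        return 0;

    for (i = 0; i < len; i++) {
        /* Each byte needs two hex digits plus the terminating NUL. */
        if (outsz - used < 3)
            break;
        snprintf(out + used, outsz - used, "%02x", buf[i]);
        used += 2;
    }
    return i;
}
```

With debug left at its default of 0 the dump is skipped entirely, which matches the behaviour expected when debug=1 has not been passed.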

I am also now looking at the hm2_soc_ol call to rtapi_print_hex_dump()

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Jan 29, 2017

That is what I did, albeit in a different way ( by removing the field from the actual function and adjusting the calls )

I'll merge and then do a new pull and build to thoroughly test - should only take an hour or so ;\

@ArcEye
Collaborator
ArcEye commented Jan 29, 2017

> I'll merge and then do a new pull and build to thoroughly test - should only take an hour or so ;\

Famous last words, more like 2 hours.

Tested running the 5i25_socfpga sim and from the command line, all OK

@cdsteinkuehler
Contributor

Still more issues, even with gcc-4.7. I'm installing concurrency-kit to see if that will help with the build fails, otherwise I'm going to just update to Debian Jessie. If that doesn't work, we will need to decide if Debian Wheezy support is mandatory in order to merge (IMHO it isn't, as Jessie has been out since April 2015).

@cdsteinkuehler
Contributor

...still didn't build on Wheezy. I am building from scratch on Jessie and will report results when complete (as long as it builds and I can test before I leave on a business trip tomorrow AM). IMHO, the Wheezy build issues should be resolved if fairly simple (they didn't look like horrible problems, and were mostly related to selecting the proper version of atomic instructions to use based on gcc version and installed libraries), but are not show-stoppers for merging multicore. Personally I'd rather have multicore and require Jessie than not have multicore and build on Wheezy.

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Jan 30, 2017 edited

I am out all day and will look properly later.
The C11++ is non-negotiable, because of the atomic ops. I am not sure however if I ever built multicore on Wheezy, have been using Jessie pretty much since it was a beta, because of the problems it solved regards library support etc.
What we will need to do is build for Stretch, because Jessie is on v.8 already and probably not long for mainstream.
That is going to bring the pains on regards packages 😄

@ArcEye
Collaborator
ArcEye commented Jan 31, 2017 edited

> I have not yet built successfully on Wheezy

I have brought down my old test machine, a dual core Sempron, with Wheezy installed and MK already set up. Did a dist-upgrade on all packages including machinekit.

machinekit-multicore cloned and built without any issues.

I am using gcc/g++ 4.7.2 from backports

I don't have any locally built libs, libck should not be necessary, because AFAIR @mhaberler integrated the routines required in the code, to avoid extra dependencies.

My package list is here
NB. the libck-connector listed is not concurrency-kit, but console-kit

@ArcEye
Collaborator
ArcEye commented Jan 31, 2017

To finish the DE0-NANO-Soc testing.

Now run with the loading string set to

halcmd newinst hm2_soc_ol hm2-socfpga0 config="firmware=socfpga/dtbo/DE0_Nano_SoC_DB25.7I76_7I76_7I76_7I76.dtbo num_encoders=2 num_stepgens=4"

and a 7i76 connected to the NANO via the @cdsteinkuehler DB25 interface.

All pins and params for hm2_de0n.0 ( including hm2_de0n.0.encoder.01 resulting from the num_encoders=2 ) plus all hm2_de0n.0.7i76.0.0 pins and params are visible.

Looks to be fully working.

@ArcEye
Collaborator
ArcEye commented Feb 1, 2017

See also
Re-examine module arg handling in hm2_soc_ol #20
and
Modify parameter handling in hm2_soc_ol driver for DE0-NANO-So #22
and
Modify socfpga sim configs to use new param passing protocols #23

@ArcEye
Collaborator
ArcEye commented Feb 2, 2017

Now have run my mill using the DE0-NANO-Soc, with machinekit-multicore and the amended hm2_soc_ol driver.

Runs exactly the same as previously, so that job done and dusted.

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Feb 3, 2017

Great news! Things are working for me here as well.

Have you tested building on wheezy too?

Do we know of anyone other than Bas and I who are currently making use of the ring buffer code that will not be backwards compatible?

I suspect we will never know until they squeak 😉
We could put out a post and get no replies, but be none the wiser.

I am thinking of writing inline compatibility functions, which take the old calls and then use the new function, for say hal_ring_detach()
ringsize_t can just be made an alias define for ring_size_t etc
Need to look back over it to see what the clashes were.

So what's next? Should we make packages and push for broader testing or are we ready to merge?

If we go to the trouble of producing packages, it is a lot of work, will only attract the people who don't know how to build a RIP and may not elicit much better response than Beta testers request.

So long as the previous packages remain available, any major problems can be reversed by pointing the users back to those and explaining how to select them.

The next major hurdle, after managing to merge, the ins and outs of which we may need to discuss (don't know how this is done on-line, or whether it is done off-line and force pushed up to the repo),
will be the CI builds.

If those pass we can be fairly confident things work, if not, then at least there are no new packages to worry about until sorted.
(I half expect Wheezy builds to fail, they are always the flaky weak link when anything new comes in)

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Feb 3, 2017 edited

> I thought you had rebased on top of master fairly recently, so I'd hope we could rebase the multicore (again) and do a proper PR. If not, we'll have to sort out what to do.

Not anything quite that git-fu 😄

I added by cherry-pick, except when conflicts made it too difficult, (mostly the squashed/rebased stuff from Alex) whence I manually took the elements in commits and made fresh commits from them.

However:
I have just merged multicore into master in a test repo.
There were 13 conflicts, about a third of which were white space, so didn't take long to manually amend.
Built and currently running the axis_mm sim from it.

So in that respect, fairly well ready to go

@ArcEye
Collaborator
ArcEye commented Feb 3, 2017

PR done for the merge. machinekit/machinekit#1121

Let's see what happens in Travis.

@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Feb 3, 2017

At the moment everything is failing on

make[1]: *** No rule to make target `objects/posix/halcomp-srcs/hal/i_components/charge_pump.o', needed by `../rtlib/posix/charge_pump.so'.  Stop.

This has come up before, can't remember the cause - or why it does not happen locally ???!!!

Deploy on a Friday afternoon - definitely not 😈
This PR was just to shake out the Travis builds and see what happened.

Should I/we look a bit into getting Wheezy running?

As I posted previously, it builds fine on an amd64 rt-preempt machine with gcc/g++ 4.7.2 on Wheezy
It has also got as far as building the components on the Travis build before failure for another reason.

What platform / compiler etc was it you had failures on?

@ArcEye
Collaborator
ArcEye commented Feb 3, 2017 edited

> At the moment everything is failing on
> make[1]: *** No rule to make target `objects/posix/halcomp-srcs/hal/i_components/charge_pump.o', needed by `../rtlib/posix/charge_pump.so'.  Stop.
> This has come up before, can't remember the cause - or why it does not happen locally ???!!!

OK, cloned the merge and error repeated, something did not carry over properly from the merge-test branch, I'll sort it tomorrow

'Beer o'clock' here in wet and windy Blighty

@ArcEye
Collaborator
ArcEye commented Feb 4, 2017 edited

Initial merge now builds on Travis.

I have corrected instances of hal_ring_detach() because I forgot I cannot create overloaded functions
in C, unlike C++.
There will just have to be a circulation and fault fixing for those who didn't read it. 😄

ring_size_t is now a define of ringsize_t so that won't cause problems.

Additional commit now pushed and being tested on Travis - fine locally.
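
A rough sketch of the two compatibility points mentioned above, for anyone porting out-of-tree code. The types and both signatures are stand-ins assumed for illustration (the real ones live in the HAL ring headers), and `hal_ring_detach_compat()` is a hypothetical name. Because C has no overloading, an old-style call with a different argument list cannot share the name `hal_ring_detach`; it either has to be edited at the call site or routed through a differently named wrapper. The type rename, by contrast, can be covered with a plain define.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; the real definitions are in the HAL ring headers. */
typedef size_t ringsize_t;
#define ring_size_t ringsize_t      /* alias, so both spellings compile */

/* The alias must name exactly the same type (C11 static_assert). */
static_assert(sizeof(ring_size_t) == sizeof(ringsize_t),
              "ring_size_t must alias ringsize_t");

typedef struct {
    int attached;                   /* illustrative state only */
} ringbuffer_t;

/* Stand-in for the new-style call: assumed to take only the ring. */
static int hal_ring_detach(ringbuffer_t *rb)
{
    if (rb == NULL || !rb->attached)
        return -1;
    rb->attached = 0;
    return 0;
}

/*
 * Hypothetical wrapper for old-style callers that also passed a name.
 * It needs its own identifier because C cannot overload
 * hal_ring_detach() on argument count.
 */
static inline int hal_ring_detach_compat(const char *name, ringbuffer_t *rb)
{
    (void)name;                     /* assumed unused by the new call */
    return hal_ring_detach(rb);
}
```

With something like this, old code needs only a rename at each call site rather than a change of arguments, though editing the calls directly (as was done here) avoids carrying the wrapper at all.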

@luminize
Member
luminize commented Feb 4, 2017
@cdsteinkuehler
Contributor
@ArcEye
Collaborator
ArcEye commented Feb 5, 2017

I'm not sure that asking for testers within a 24 hr period will be successful.

I might try resurrecting @zultron's docker build system for ARM and putting it through that though.

I only have the Jessie based DE0-NANO-Soc on which it builds, plus an arm64 board. The arm64 is Jessie based too, so anticipate it would build there too, albeit I can only do posix sims on that currently.

The next thing to think about probably is a draft of the announcement.

@ArcEye
Collaborator
ArcEye commented Feb 5, 2017

> I might try resurrecting @zultron's docker build system for ARM and putting it through that though.

Forgot, it was just Jessie, @zultron did not know if it would be possible to do the same for Wheezy.
You have already proved it on Jessie.

> The next thing to think about probably is a draft of the announcement.

I am leaning towards as minimal as possible, just advise of some changes in preparation for
full multicore functionality.
Don't raise any expectations or encourage users to go 'off piste' and raise all sorts of queries, when we just want to know that it works as before.

@sirop
sirop commented Feb 5, 2017

I did not get further with debugging the EtherCAT component.
Tried also:

~$ pidof rtapi:0
10933
~$ sudo more /proc/10933/maps
...
7f95f60a1000-7f95f60a2000 rw-p 0000b000 08:01 787982                     /lib/x86_64-linux-gnu/libnss_files-2.19.so
7f95f60a2000-7f95f65a3000 rw-s 00000000 00:10 46006                      /dev/shm/linuxcnc-0-00414c32
7f95f65a3000-7f95f65c0000 r-xp 00000000 08:01 6294011                    /home/slave/machinekit-multicore/rtlib/xenomai/hal_lib.so
...

So if
Feb 5 04:56:25 debian-slave msgd:0: hal_lib:10933:rt LCEC: global_hal_data addr 0x7f95f65a1360 L209,
then global_hal_data must be on /dev/shm/linuxcnc-0-00414c32 .

Is there a MK specific config for /dev/shm ?
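
The address check above can be done mechanically. The following is a small sketch, not part of Machinekit: `maps_line_contains()` is a hypothetical helper that parses one line in the `/proc/<pid>/maps` format and reports whether a given address falls inside that mapping, returning the backing path if so.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one /proc/<pid>/maps line of the form
 *   start-end perms offset dev inode [path]
 * Returns 1 and copies the path (possibly empty) into `path` when
 * start <= addr < end, otherwise returns 0.
 */
static int maps_line_contains(const char *line, unsigned long long addr,
                              char *path, size_t pathlen)
{
    unsigned long long start, end;
    char buf[256] = "";

    assert(line != NULL && path != NULL && pathlen > 0);

    /* perms, offset, dev and inode are skipped; path is optional */
    if (sscanf(line, "%llx-%llx %*s %*s %*s %*s %255s",
               &start, &end, buf) < 2)
        return 0;
    if (addr < start || addr >= end)
        return 0;

    snprintf(path, pathlen, "%s", buf);
    return 1;
}
```

Fed the lines quoted above, 0x7f95f65a1360 lies inside the 7f95f60a2000-7f95f65a3000 range, i.e. the /dev/shm/linuxcnc-0-00414c32 segment, and outside the hal_lib.so mapping that starts at 7f95f65a3000, which agrees with the conclusion drawn by hand.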

@ArcEye
Collaborator
ArcEye commented Feb 7, 2017

Multicore now merged.

All problem reporting requested to machinekit/machinekit#1123

@ArcEye ArcEye closed this Feb 7, 2017