This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 ACE_SOCK_Dgram_Mcast::join: Unknown error -5 #1713
Comments
I've continued to look at this, and have found some interesting stuff using my debugger, well... I've had a suggestion on the mailing list about adding a MulticastInterface=eth0:avahi entry into the discovery section of my rtps.ini, and also a multicast_interface=eth0:avahi into the transport section. The latter makes no difference, and the former results in a segmentation fault. In trying to debug the segmentation fault, I think it's a side-effect of the multicast join not working, but I've been able to follow through to find that Spdp::SpdpTransport::join_multicast_group() is called 3 times. Debugging this bit of code:
I follow through with the following values in nic.name():
ifconfig doesn't show sit0 at all (see previous comment), and join_multicast_group is never passed eth0:avahi! This bit of code is being executed in Spdp.cpp, in the Spdp::SpdpTransport constructor:
which is where join_multicast_group() is being called from using the nics collection retrieved from the network_config_monitor(), so the network_config_monitor seems to think there are 3 nics, with names lo, eth0 and sit0. Any ideas where sit0 is coming from, and why eth0:avahi isn't in there? John |
Further comments from the mailing list: sit0 does get shown by ifconfig -a; it's a tunneling device for using IPv6 over an IPv4 connection. I was asked to try the latest master branch, but that didn't work: the changes made to hold IP addresses in ACE_INET_Addr objects don't work when passed a plain "xxx.yyy.zzz.aaa" string, because they use the ACE_INET_Addr::set() function which, when the string has no ":" in it, treats the whole thing as a port number! See #1717. |
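To illustrate the parsing pitfall described above, here is a minimal sketch of a splitter that treats a colon-less string as a host rather than as a port. `split_host_port` is a hypothetical helper written for this illustration; it is not part of ACE or OpenDDS, it only handles IPv4 literals and host names, and it is not the fix that was eventually merged.

```cpp
#include <string>

// Hypothetical helper, NOT an ACE or OpenDDS function.
// Accepts "host", "host:" or "host:port" (IPv4 literal or host name only).
// A missing port is reported as 0. Returns false for an empty host, a host
// that itself contains ':' (e.g. "a.b.c.d::"), or a port outside 0..65535.
bool split_host_port(const std::string& text, std::string& host, unsigned short& port)
{
  const std::string::size_type colon = text.rfind(':');
  // substr(0, npos) yields the whole string, so "host" with no colon is all host.
  const std::string h = text.substr(0, colon);
  const std::string port_text =
    (colon == std::string::npos) ? std::string() : text.substr(colon + 1);
  if (h.empty() || h.find(':') != std::string::npos) return false;

  unsigned long value = 0;
  for (std::string::size_type i = 0; i < port_text.size(); ++i) {
    const char c = port_text[i];
    if (c < '0' || c > '9') return false;  // only decimal digits are accepted
    value = value * 10 + static_cast<unsigned long>(c - '0');
    if (value > 65535UL) return false;     // larger than any valid port
  }
  host = h;
  port = static_cast<unsigned short>(value);
  return true;
}
```

With this shape, "169.254.5.198" and "169.254.5.198:" both yield the same host with port 0, while a doubled colon is rejected instead of being passed on.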
What happens when you just append a |
Leaving rtps.ini as is and changing the command line to use -DCPSDefaultAddress 169.254.5.198: gives:
(As before, I think the segfault results from trying to continue after the DDS configuration hasn't worked properly, rather than directly from the DDS stuff). This corresponds to the issues mentioned in #1717; there are a number of IP address parameter fields that use the same faulty parsing mechanism. If I change the SedpLocalAddress in the rtps.ini to have a : after it (no MulticastInterface in discovery section, or multicast_interface in transport section):
(Same as before) Adding MulticastInterface=eth0:avahi into the discovery section...
This line's interesting "ACE_INET_Addr::ACE_INET_Addr: 169.254.5.198:: Unknown error -2"; 2 x ":"? Adding multicast_interface=eth0:avahi into the transport section instead of previous change: Back to:
I thought about patching the code, but it's an issue in multiple places and I don't think I'm currently well-placed enough to understand all of the places where it may be broken. |
Now that #1727 is merged, is this at least easier to test? If it's down to just "error -2" from joining a multicast group, the next step may be to verify with the debugger that the system calls being made by ACE are correct for your configuration. Does the output of "ip a" show that the interface 169.254.5.198 has multicast? |
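One way to check the join independently of ACE is to ask the kernel directly with plain sockets. The sketch below is an assumption about how such a check might look, not a description of what ACE does internally; `try_join` is a hypothetical name, and the group and interface address are whatever the caller passes in (for this issue, 239.255.0.1 and the link-local address on the device).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>

// Hypothetical diagnostic, NOT part of ACE or OpenDDS.
// Tries to join IPv4 multicast `group` on the interface that owns
// `local_if_addr`. Returns 0 on success, a positive errno value if the
// kernel rejects the request, or -1 if either string is not a valid
// dotted-quad IPv4 address (interface names are not accepted here).
int try_join(const char* group, const char* local_if_addr)
{
  ip_mreq mreq;
  std::memset(&mreq, 0, sizeof mreq);
  if (inet_pton(AF_INET, group, &mreq.imr_multiaddr) != 1) return -1;
  if (inet_pton(AF_INET, local_if_addr, &mreq.imr_interface) != 1) return -1;

  const int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) return errno;

  int result = 0;
  if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq) != 0) {
    result = errno;  // saved before close() can overwrite it
  }
  close(fd);
  return result;
}
```

Calling `try_join("239.255.0.1", "169.254.5.198")` on the target and printing the result with `strerror` would show whether the kernel itself refuses the join for that address, which helps separate a configuration problem from a library one.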
Adam, With #1717 solved, the -DCPSDefaultAddress 169.254.5.198 works but, as mentioned in #1738, I then got an issue with SedpLocalAddress in the rtps.ini file. You've closed that one as the Dev Guide does say it needs a trailing ':' (which is something I've never seen expected in any other software!), although #1738 also mentions that SpdpLocalAddress, which the dev guide specifically says "No Port" with, uses identical code so is going to fail if you don't change the dev guide for that too. With the ':' in place, I'm back to:
Adding the eth0:avahi device as "MulticastInterface"/"multicast_interface", as before, stops any attempt at joining a multicast group. Re your last question:
|
FWIW - I've been looking at something related to this in my own code recently, trying to get MAC addresses and IP addresses using the ifaddrs stuff in Linux. The eth0, sit0, and lo devices show up when I look for AF_PACKET devices, but not the eth0:avahi device. For AF_INET devices, lo and eth0:avahi show up, but not eth0 or sit0, and the eth0:avahi device shows the 169.254.5.198 address.
From this code:
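The snippet posted with this comment is not preserved here. As a stand-in, the following is a minimal sketch of the kind of `getifaddrs` enumeration the comment describes; it is not the poster's code, and `IfEntry` and `list_interfaces` are names invented for this illustration.

```cpp
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>
#include <vector>

// One entry per record returned by getifaddrs(). Illustrative type only.
struct IfEntry {
  std::string name;     // ifa_name exactly as getifaddrs() reports it
  int family;           // sa_family of the record (AF_INET, AF_PACKET, ...)
  std::string address;  // dotted-quad text for AF_INET records, else empty
};

// Collects every getifaddrs() record that carries an address structure.
// Returns an empty vector if getifaddrs() itself fails.
std::vector<IfEntry> list_interfaces()
{
  std::vector<IfEntry> out;
  ifaddrs* head = 0;
  if (getifaddrs(&head) != 0) return out;

  for (const ifaddrs* p = head; p != 0; p = p->ifa_next) {
    if (p->ifa_addr == 0) continue;  // some records have no address
    IfEntry e;
    e.name = p->ifa_name;
    e.family = p->ifa_addr->sa_family;
    if (e.family == AF_INET) {
      char buf[INET_ADDRSTRLEN] = {0};
      const sockaddr_in* sin = reinterpret_cast<const sockaddr_in*>(p->ifa_addr);
      if (inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof buf)) e.address = buf;
    }
    out.push_back(e);
  }
  freeifaddrs(head);
  return out;
}
```

Printing `name`, `family` and `address` for each entry, grouped by family, gives the kind of per-family listing described above, where the same physical device can appear under different names depending on the address family.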
|
See https://github.com/DOCGroup/ACE_TAO/blob/master/ACE/tests/Enum_Interfaces_Test.cpp for an ACE unit test that lists all IP interfaces; maybe compile the ACE/tests directory and run it? |
I copied some of the stuff into my application, as I couldn't be arsed to work out how to compile it on its own. This is the result:
Those addresses represent the lo and the eth0:avahi devices, but it doesn't actually tell you the interface name. I'll see if I can find that stuff. |
Nope - don't really know my way round ACE, but can't see any obvious way to get the name of an interface from the IP addr, other than going down to the ifaddrs stuff which is basically the code I showed earlier. If you can see how to, please let me know. Also, I'm sure I read something elsewhere about the NetworkConfigMonitor being changed to be optional; is that a build time configuration option? I can see (in LinuxNetworkConfigMonitor::process_message): RTM_NEWLINK -> for lo, index 1 -> add_interface is called. It seems that the code can't tell the difference between eth0, which doesn't have an IPv4 address, and eth0:avahi, which does. I'm a bit confused at this point; is there a way to make this behave more like 3.13.3? |
You can try disabling LinuxNetworkConfigMonitor
just make it It would be good to know what's actually going wrong there. Can you list steps required to get a similar network configuration starting with a "stock" Debian/Ubuntu install? |
Sorry for the delay in replying; I've been on holiday. I will try your suggestion once I get the chance. As far as "list the steps required" goes, that's easier said than done :-) I'm no Linux expert and the way that alias popped up wasn't something I specifically chose :-) I suspect the issue is related to something like I described in #1713 (comment); when you look for AF_PACKET devices you don't see the eth0:avahi alias but, when you look at AF_INET, you see the IP address associated with it but it seems to associate it with eth0 rather than eth0:avahi. I will take a look round to see if I can work out a way to get something similar on Ubuntu. |
Apologies for the further delay on this. Extremely tight timescales and issues with staff have left me no time to play around with these settings. It remains on my "to do" list. |
Just a quick note: I've built 3.16 with the modification suggested in #1713 (comment) and it successfully runs the Messenger publisher with the subscriber running on a native Ubuntu build. While I wouldn't consider this enough to close this issue, it's a step in the right direction. |
Multicast support (with multiple interfaces) has been improved recently. Is this still a problem? |
Interestingly, I was just talking to a colleague about this issue yesterday. Unfortunately we don't have the time or resources to be able to keep up-to-date with OpenDDS releases, so we're still running with 3.16. If we get the chance to try the latest version at some point, I will be sure to let you know if this is still, or is no longer a problem. |
We believe this has been addressed in more recent versions of OpenDDS (3.23). |
FWIW, we've recently started running 3.22 and, as I understand it (it was my colleague who tried it), we still have this issue in that version. Do you have any test results that prove this is fixed? If not, can I suggest you please reopen it; it's possible that I may be in a position within the next few days to check this. |
We do not have a way to reproduce this error so we cannot definitively say that it is addressed. You will need to submit a PR with a test that demonstrates the problem. |
I can't really do that; the testing I've done has been manual, and you need a target system that's ARM based with an ethn:avahi network device to highlight the problem. I don't know if there's a way to simulate that! However, I've built 3.23 today for the ARM target and tried the Messenger application. With multicast_interface=eth0:avahi in the RTPS section of the ini file, and MulticastInterface=eth0:avahi in the discovery section, I was able to get through the publisher and subscriber startups without any error showing (without both of those there are errors; 2 x errors if neither is in, or 1 x error if one or the other is in). I've also seen RTPS packets received in another machine connected to the same network. That looks promising. What I haven't been able to do, though, is get the publisher and subscriber to communicate with each other yet, but that might be to do with things that are awkward today, as I'm working from home and the system I'm using is remote. I will be back in the office tomorrow and will try again with 'better' equipment, then report back. |
Further information. I've tried building my applications using OpenDDS 3.23 built with the normal LinuxNetworkConfigMonitor included. As with the Messenger application, I was able to get the application to start, without showing the errors, by explicitly specifying the MulticastInterface and multicast_interface within the rtps.ini sections but, despite that, there was no end-to-end communication happening. Commenting out the LinuxNetworkConfigMonitor, as described earlier, fixed that issue, so nothing appears to have changed to improve things from my point of view. |
Just to reset... The LinuxNetworkConfigMonitor uses a NETLINK socket to receive changes in network interfaces and addresses from the Linux kernel. These come in distinct messages, i.e., changes in interfaces come in one set of messages and changes in addresses come in a different set of messages. The NetworkConfigMonitor publishes these on an Internal DDS topic (
Presumably, the interface names will match what you reported earlier: lo, eth0, and sit0. These are names coming from NETLINK socket and those are names that must be used when configuring. Configuring the multicast interface to "eth0:avahi" when using the LinuxNetworkConfigMonitor means that all of the updates from netlink will get dropped because none of the interfaces have the appropriate name. See the call to As a next experiment, you could configure the multicast interfaces to "eth0". Based on your report, the MulticastManager should attempt to join multicast groups on this interface with the APIPA address. Speaking of, when the APIPA address is being used, does the device have a valid route? That is, joining a multicast group without a valid route may cause the join to fail. Does communication work in this scenario? Does a packet capture show SPDP announcements? Assuming that things still aren't working, I would turn my attention to
My hope is that we can get it to work with the LinuxNetworkConfigMonitor, preferably without specifying multicast interface explicitly. The key thing to understand is if joining the multicast group is succeeding with a warning or actually failing. Thus, a fix might be to change the logic to consider certain error codes as success. To test this, you can ignore the return value of |
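The name-matching behaviour described in this comment can be sketched as a simple filter. Everything below is a hypothetical stand-in written for illustration, not OpenDDS code: `accepts_update` models "drop updates whose interface name differs from the configured one" (treating an empty configured name as "no restriction" is an assumption), and `strip_alias_label` shows how an alias label such as eth0:avahi maps back to the device name that the NETLINK updates carry.

```cpp
#include <string>

// Hypothetical model of the filtering described above; NOT OpenDDS code.
// An update is kept when no interface is configured (assumed behaviour) or
// when the configured name equals the name carried by the netlink update.
bool accepts_update(const std::string& configured_interface,
                    const std::string& netlink_name)
{
  return configured_interface.empty() || configured_interface == netlink_name;
}

// Returns the device part of an alias label: "eth0:avahi" -> "eth0".
// Names without ':' are returned unchanged.
std::string strip_alias_label(const std::string& label)
{
  return label.substr(0, label.find(':'));
}
```

Under this model, configuring "eth0:avahi" never matches an update named "eth0", so every update is dropped, whereas configuring `strip_alias_label("eth0:avahi")` (that is, "eth0") lets the updates through, which is the experiment suggested above.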
Thank you for reopening this, and for a detailed description of how it's supposed to work. I suspect I've been through some of the code when trying to debug the issue when we first saw it, but will take note of what you've said. I will try to respond more fully to your questions tomorrow, but it's probably worth me reiterating that, on our system, eth0 does not have an IPV4 address; it's an embedded Linux application on ARM, built with the Xilinx Petalinux toolchain. The whole system is self-contained and devices only use "zeroconf" settings, i.e. mDNS and link-local IP addresses, hence the use of Avahi which, with the default setup using avahi-autoipd (I believe) causes us to get this eth0:avahi pseudo-device (AIUI) which is the one that has an IP address (in the link-local range) and which is capable of multicast. eth0, as I mentioned, has no IPV4 address which, as far as I can tell, causes it to be incapable of multicast (?). As far as logs are concerned, I will check again, but the ones I looked at on 3.23 are basically the same as the ones I provided when we started seeing this on 3.14. I will respond again tomorrow, but one thing I was wondering was whether we may be able to provide you with representative hardware this problem occurs on. I'm assuming that, as it's a rare use case, you'd probably like to understand it and find out if you can produce a fix for it but, as it's not, I guess, one of your primary supported platforms (AIUI) you may not want to spend money on it :-) The other thing I did also wonder was whether a configuration option could be provided to disable the Linux Network Config Monitor which would avoid the issue without us having to go in and hack the code every time there's an update. |
@jrw972 Here are the logs from the publisher and subscriber using the Messenger DevGuideExample (DCPS). The rtps.ini file is also included, but that's the default version anyway for now. These were both run on ARM A9 systems running PetaLinux (Xilinx Zynq-7000 systems) with the OpenDDS stuff built by following the Raspberry Pi example that is (was?) on your website, but using the Zynq-7000 toolchain. The DCPSDebugLevel and DCPSTransportDebugLevel were both set to 10 (let me know if there is a more appropriate setting for either of them). As you can see from the logs, there was no communication. In a few minutes I'll send equivalent logs using OpenDDS built with the LinuxNetworkConfigMonitor disabled/commented out. The output from ifconfig -a on one end of the link (the publisher, in this case) is:
The other end is basically the same, except there's no |
As mentioned, the logs from using OpenDDS 3.23 with the LinuxNetworkConfigMonitor disabled, clearly showing communication happening. |
In response to your question(s):
My comment 2 above this one covers that; there's no communication; I'm not just trying to get rid of the warnings :-) |
For this point:
Setting both multicast_interface=eth0:avahi and MulticastInterface=eth0:avahi results in the following logs and rtps packet capture (no communications). Note you need to ditch the defPubEth0Avahi.txt |
On to:
defPubEth0.txt |
As for this bit:
Just looking to see how I would do this;
#if defined OPENDDS_LINUX_NETWORK_CONFIG_MONITOR
  if (DCPS_debug_level >= 1) {
    ACE_DEBUG((LM_DEBUG,
               "(%P|%t) Service_Participant::get_domain_participant_factory: Creating LinuxNetworkConfigMonitor\n"));
  }
  network_config_monitor_ = make_rch<LinuxNetworkConfigMonitor>(reactor_task_.interceptor());
#elif defined(OPENDDS_NETWORK_CONFIG_MODIFIER)
  if (DCPS_debug_level >= 1) {
    ACE_DEBUG((LM_DEBUG,
               "(%P|%t) Service_Participant::get_domain_participant_factory: Creating NetworkConfigModifier\n"));
  }
  network_config_monitor_ = make_rch<NetworkConfigModifier>();
#else
  if (DCPS_debug_level >= 1) {
    ACE_DEBUG((LM_DEBUG,
               "(%P|%t) Service_Participant::get_domain_participant_factory: Creating DefaultNetworkConfigMonitor\n"));
  }
  network_config_monitor_ = make_rch<DefaultNetworkConfigMonitor>();
#endif
This is intriguing.
#include "ace/config.h"
#if (defined(ACE_LINUX) || defined(ACE_ANDROID)) && !defined(OPENDDS_SAFETY_PROFILE)
#define OPENDDS_LINUX_NETWORK_CONFIG_MONITOR
The workaround @mitza-oci mentioned earlier is to comment that stuff out, i.e.
#include "ace/config.h"
//#if (defined(ACE_LINUX) || defined(ACE_ANDROID)) && !defined(OPENDDS_SAFETY_PROFILE)
#if 0
#define OPENDDS_LINUX_NETWORK_CONFIG_MONITOR
That's how OpenDDS was built for the modPublisher.txt/modSubscriber.txt logs attached earlier, which show:
Hence, it would appear that I don't have
ACE_DEBUG((LM_DEBUG,
           "(%P|%t) Service_Participant::get_domain_participant_factory: Creating NetworkConfigModifier\n"));
#include "ace/config.h"
// ACE_HAS_GETIFADDRS is not set on android but is available in API >= 24
#if ((!defined (ACE_LINUX) && defined(ACE_HAS_GETIFADDRS)) || (defined(ACE_ANDROID) && !defined ACE_LACKS_IF_NAMEINDEX)) && !defined(OPENDDS_SAFETY_PROFILE)
#define OPENDDS_NETWORK_CONFIG_MODIFIER
That seems a substantially more complex pre-processor directive than the one in
LOL 😜 |
Thanks for the logs. The packet captures did not attach so we may have to find a work around. We can't rule out the spurious error yet. If you can get packet capture for the Messenger example or an equivalent setup, just look for multicast RTPS packets (probably SPDP announcements). The error prevents registering the handler for receiving, but it should not prevent sending. Thus, if the packets are still being sent, then the error is probably incorrect. If you don't want to go the packet capture route, just ignore the return value of Using eth0 instead of eth0:avahi was the correct thing to do, i.e., the LinuxNetworkConfigMonitor at least attempted to join the multicast group. The
So, if you can trace it in a debugger, I think that's going to concretely point out the problem. Since the learnings from that may or may not turn into something that is easily fixed, I'll discuss the possibility of enabling/disabling the LinuxNetworkConfigMonitor and NetworkConfigModifier via configuration. |
Oh! I still have the packet capture files, so will try attaching them again on Monday. I gave them a gif extension so maybe GitHub decided to check if it really was a gif! Thanks for the other suggestions. |
@jrw972 I've re-uploaded the packet captures, with a .txt extension this time (which means they haven't been put into the user-images space :-) ). Hopefully that will work better. I haven't had a chance to run the debugger with this yet, partly because I'd built OpenDDS for the target with --no-debug and --optimize, so don't know how useful it would've been :-) I've rebuilt it without those now, but will let you know if/when I get a chance to use them. |
(Note: I first posted this to https://sourceforge.net/p/opendds/mailman/opendds-main/?viewmonth=202006&style=flat, please delete/close if this is not the right place)
On OpenDDS 3.14, on an embedded target, I'm seeing "ERROR: Spdp::SpdpTransport::join_multicast_group() - failed to join multicast group 239.255.0.1:8400 ACE_SOCK_Dgram_Mcast::join: Unknown error -5" on an application that (other than a few updates to handle the C++11 style opendds_idl code generation) is the same as one that works with OpenDDS 3.13.3.
The target is an ARM-based embedded linux system with Avahi installed to acquire an mDNS IP address. It uses rtps with an ini file containing:
Using the OpenDDS 3.13.3 version of the code, with DCPSDebugLevel 10, I see:
With the 3.14 version, I see:
ifconfig on the embedded device shows:
As the 3.14 version is explicitly showing that it's trying to join a group using eth0 (not eth0:avahi), I've tried modifying the embedded linux device's network configuration to apply the 169.254.5.198 address as a static address on eth0 and restarting the eth0 device.
Now, when I run my 3.14 version, it works as it should do, however I have a need to use mDNS on the system I'm working on so this may not be a long term solution.
Is something known to have changed between 3.13.3 and 3.14 to have this effect and, if so, is there a configuration option (e.g. command line or rtps.ini file) I can use to overcome it?
I have tried explicitly specifying the SpdpLocalAddress in my rtps.ini, with no effect. I have also tried configuring the MulticastInterface setting in my rtps.ini to eth0:avahi, but the application just segfaults like that.