olsrd fails to get routes when multiple mesh networks are present #14

Open
technosopher opened this issue Jan 27, 2014 · 4 comments

@technosopher

This is a long-standing problem with no apparent easy solution. It is suspected to be related to low-level channel hopping/merging tendencies in wpa_supplicant, various wireless driver stacks, or both. For now, we recommend using the Linux client only in environments where a single Commotion mesh network is active.

@dismantl

When we were testing this a couple of weeks ago, we found that the issue wasn't with olsrd establishing routes, but rather with forming ad-hoc links with neighbors. Although the client's network-manager indicates a successful connection to the ad-hoc network, pinging or otherwise communicating with some or all of the ad-hoc peers fails (which in turn leaves olsrd without any packets from which to establish routes).

@technosopher
Author

Hrm... that doesn't surprise me at all, and is wholly consistent with
the channel-hopping hypothesis. Nevertheless, I'm pretty sure I've
encountered a case in which I could establish a link to another meshed
device, but olsrd still refused to get any routes. I think we may be
able to get a clearer picture of what's going on by doing the following:

  1. Set up isolated test meshes far away from any other
    potentially-conflicting meshes, and see if we can reproduce this
    problem.
  2. Try a variety of different mesh settings to determine which
    parameters need to be unique to ensure that this doesn't happen (i.e.,
    can meshes coexist as long as they have different SSIDs, BSSIDs, AND
    channels? Or is that not enough?)
  3. In failure cases, forcibly try to connect to other present mesh
    nodes (ping and ssh to known-good IPs). If this consistently fails,
    then the problem is almost certainly a link-layer issue. If this
    works sometimes, we can't rule out some sort of interaction involving
    olsrd itself.
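
The check in step 3 could be sketched roughly as follows. This is only an illustration, not part of commotion-linux: the function names, the peer addresses, and the `olsrd_has_routes` flag (which you would fill in by inspecting the routing table yourself) are all hypothetical, and the ping call assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request; True if a reply arrived (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(reachable: dict, olsrd_has_routes: bool) -> str:
    """Map ping results plus the observed olsrd state to a rough diagnosis.

    reachable maps each known-good peer IP to whether it answered a ping.
    """
    if not reachable:
        return "untested"
    if not any(reachable.values()):
        # No peer answers at all: almost certainly a link-layer problem.
        return "link-layer"
    if not olsrd_has_routes:
        # At least one link works, yet olsrd has no routes:
        # an interaction involving olsrd cannot be ruled out.
        return "olsrd-or-interaction"
    return "ok"


if __name__ == "__main__":
    # Hypothetical peer addresses; substitute nodes known to be on the mesh.
    peers = ["10.0.0.1", "10.0.0.2"]
    results = {ip: ping_once(ip) for ip in peers}
    # Set this by hand after checking whether olsrd has installed any routes.
    print(results, classify(results, olsrd_has_routes=False))
```

Running it several times in a failure case, and noting whether the result is always "link-layer" or only sometimes, would help tell a consistent link failure apart from an intermittent one.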

@dismantl

I'm really certain this has nothing to do with interfering ad-hoc networks. We were able to reproduce the problem both when other ad-hoc networks were present and when there weren't any. I did extensive testing to show that the olsrd route problem occurred if and only if the client had bad links with all of its ad-hoc neighbors. As soon as one good ad-hoc link was established, olsrd started receiving olsrd packets and building a routing table. This indicates that the problem is at the link layer, not the application layer.

My working hypothesis is that since the problem occurs inconsistently in the presence of other constant factors, it is indicative of a race condition. I believe the network-manager scripts are running asynchronously, and that this is responsible for screwing up the network stack during the connection process, possibly related to wpa_supplicant.

To test this hypothesis, we'd need to try joining the same ad-hoc network both through network-manager and without network-manager. If the latter doesn't show the same symptoms, then we'll know the source of the problem.
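
For the "without network-manager" half of that test, one possible sequence is to bring the interface up in IBSS mode by hand with `ip` and `iw`. The sketch below only builds and prints the commands rather than running them. The interface name, SSID, frequency, and BSSID are placeholders, not values from this thread, and it assumes network-manager has already been stopped or told to leave the interface unmanaged.

```python
import shlex


def manual_ibss_join_commands(iface: str, ssid: str, freq_mhz: int, bssid: str) -> list:
    """Build the commands to join an ad-hoc (IBSS) network without network-manager."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["iw", "dev", iface, "set", "type", "ibss"],
        ["ip", "link", "set", iface, "up"],
        # fixed-freq plus an explicit BSSID pins the cell to one channel and
        # one cell ID, which keeps channel hopping/merging out of the picture.
        ["iw", "dev", iface, "ibss", "join", ssid, str(freq_mhz), "fixed-freq", bssid],
    ]


if __name__ == "__main__":
    # All four values are hypothetical; use the settings of the mesh under test.
    for cmd in manual_ibss_join_commands("wlan0", "example-mesh", 2462, "02:CA:FF:EE:BA:BE"):
        print(shlex.join(cmd))
```

The printed commands need root to run. Comparing ping results after this manual join against results after a network-manager join, on the same mesh and in the same session, is what would separate the two cases.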

@technosopher
Author

Okay, that last test is easy to do: the fallback version of commotion-linux circumvents network-manager entirely, and even includes a patch that (theoretically) disables channel hopping. I think I've seen this fallback method fail in precisely the same way that the normal method does - but I didn't control for exact order of network connection operations. So let's give this a try, perhaps tomorrow. Thanks for all the good debugging info!
