Issue with 22+ devices in a mesh #8
The problem seems to come from the sequence-number check logic. If the network has some packet loss, or packets arrive out of order, the reach bitmap can get cleared. This causes babeld to send Hellos immediately, adding to the congestion. The result is that in a network that is already suffering congestion problems, babel ramps up the traffic volume tremendously. Removing the send_hello call solved the problem for me, but that is probably not a general solution for mobile networks. |
I am curious where you made this change? I was looking over this section of code also, and seeing what you were seeing... https://github.com/dtaht/rabeld/blob/master/neighbour.c#L144 |
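For context, the pattern being described has roughly this shape (a paraphrased sketch with illustrative types and names, not the actual neighbour.c source):

```c
#include <stdint.h>

/* Minimal stand-ins for babeld's neighbour state (illustrative only). */
struct neigh {
    uint16_t reach;           /* bitmap of recently received Hellos */
    uint16_t expected_seqno;  /* next Hello seqno we expect */
};

static int seqno_diff(uint16_t a, uint16_t b)
{
    return (int16_t)(a - b);  /* signed distance modulo 2^16 */
}

/* Returns 1 if an unscheduled Hello would be triggered in response. */
static int on_hello(struct neigh *n, uint16_t seqno)
{
    int missed = seqno_diff(seqno, n->expected_seqno);

    if (missed > 0 && missed < 16)
        n->reach >>= missed;   /* shift in the Hellos we missed */
    else if (missed != 0)
        n->reach = 0;          /* badly out of sync: clear all history */

    n->reach |= 0x8000;        /* record this Hello */
    n->expected_seqno = seqno + 1;

    /* With history just cleared, the peer looks brand new, and an
     * immediate Hello goes out -- the send_hello the reporter removed.
     * Under loss or reordering this adds traffic exactly when the
     * network is already congested. */
    return n->reach == 0x8000;
}
```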
Reopening, this looks interesting. |
Reproduced in a production environment: a ZeroTier mesh network with 9 nodes. My network contains 8 nodes across the world with low packet loss, and one node with high packet loss to some (but not all) of the other nodes. My babeld configuration is:
P.S. I cannot tell precisely whether my problem is exactly the same as this issue, since some of my nodes are located in China, where a packet with a certain combination of bytes can kill the entire TCP/UDP connection. |
@m13253 could you please have a look with your network sniffer of choice? It would be interesting to learn more about what is going on. |
Nice idea. But the original network is in production use now, so I would not want to add a node with 60% packet loss every day at 8pm. |
I have the same issue with around 9 nodes, with link metrics growing to such levels that communication becomes almost impossible. Could this have something to do with the wireless cards switching to a lower modulation due to the many interrupts from the nodes communicating with each other constantly? |
Probably not. Link quality sensing is done over multicast, which is
unreliable in the presence of congestion. The network is probably
dropping too many multicasts, and Babel is reacting to the resulting
packet loss.
We need to do more work on wireless link quality sensing. RFC 6126bis has
provisions for link quality sensing over unicast, but we need to do more
experimentation with that -- and other things are more important right now.
…-- Juliusz
|
One thing I've had to do is arbitrarily mark all babel packets (on my extensively fq_codel'd networks) as ECN-capable in net.c. Perhaps I've been treating a symptom. |
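For reference, the kind of change being described, assuming a Linux IPv6 UDP socket like babeld's (a hypothetical helper, not the actual net.c code): setting ECT(0) in the low two bits of the traffic class lets an fq_codel bottleneck mark the packets instead of dropping them.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Mark all packets sent on this socket as ECN-capable (ECT(0)).
 * Hypothetical helper, not the actual net.c change. */
static int mark_ecn_capable(int fd)
{
    int tclass = 0x02;  /* ECT(0): low two bits of the IPv6 traffic class */
    return setsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS,
                      &tclass, sizeof(tclass));
}
```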
Anyone still investigating this? I'd really like to understand what's going on. |
I can't reproduce the problem under my current network conditions any more. Perhaps the problem will be fixed when unicast is merged into the mainline. |
I'm playing a bit with the CORE emulator and babeld. Here's a pcap, captured during a slow start: each second an additional node is started. |
@tecoboot - nice to see you! Could you share your CORE emulator setup? I've been trying to get to a better emulation environment myself. I note there is a bug in mainline babeld at the moment... fixed in a patch I sent to the mailing list a few weeks back. Might be related to this.... |
Maybe I can help to get babeld in good shape. I heard about disappointing results where babel was compared to olsrv2. Too bad (to be true?). I run CORE in Xubuntu 18.04, in VMware Fusion 11, on a new MacBook. Any Linux will do, but be careful running it on a native OS, as normal users can easily run privileged code. The babeld config generator puts N routers in a square, all connected to a wireless LAN. The results I shared were with the current Ubuntu-provided babeld: 1.7.0. When this issue is fixed, I want to dig into packet overhead (why no address compression for IHU?), dynamic timers (triple?), etc. I will only ask questions and share results; I don't code myself.
I'll check. I'll use master for follow-up tests. |
OK, with babeld-1.8.4-42-g73d0b1a I have better results. BTW: I didn't test the cleanup in my generator script. Either add an rm of the config file, or change >> to > in the cat config line. |
Hmm, retried on my older MacBook (Core 2 Duo): bad results. Exactly the same script, but now a total network meltdown. The 35 babeld instances ate all the CPU, and it was hard to stop CORE or Wireshark. Nice stress test for the Linux dispatcher and Wireshark memory management :-). The sad news is that babeld cannot handle a bottleneck. I don't know if it is caused by the shortage of CPU, or by packets lost due to the shortage of CPU. I reran the test, now with longer delays: first a pause of 100s before starting the babeld daemons, then starting babeld one by one at a 10s interval. This works well. But in real networks, we don't want the meltdown trigger, do we? I'll need some thoughts on how to locate the trigger. |
Here's the slow-start pcap file: babeld-35nodes-slowstart-10s.zip. Still lots of requests and IHUs. |
I reran with 8 nodes. I turned off dynamic rxcost, to eliminate the IHUs caused by the slow metric algorithm and the IHUs exchanged for it. I expected just a few packets to insert the new node into the topology. babeld-8nodes-slowstart-10s.zip See for example the addition of the second node in packets 9 to 24. It takes:
|
Excellent work, thanks! Could you apply this patch?
|
As for "not being able to handle a bottleneck" - babeld has two problems - one we've seen pathlogical cases like yours that exhaust local cpu, hellos become late, retransmits spike, and more cpu gets used - which I exposed via my "rtod" tool a few months back. One fallout of that was that babel now logs late hellos, and it would be interesting if you are seeing that in your tests. Finding and fixing those cases is definately high on my mind. The other, is that babel uses naive data structures which have a tendency not to scale well (while you are exhausting cpu). I've got some work in progress adding hashes to multiple formerly linked list lookups that appear quite promising, and @christf has been working on improving the scheduler. |
Patched. Do you want me to run the meltdown test?
babeld-8nodes-slowstart-10s-11dec2018-2026.zip
Frame 16: 122 bytes on wire (976 bits), 122 bytes captured (976 bits) on interface 0
Maybe it has to do with updates of internal state, e.g. reachability. |
Teco Boot <notifications@github.com> writes:
Patched.
On slow-start with 8 nodes, I don't see a difference.
Here log (-d 1) from node 1.
babeld-8nodes-node1.zip
Do you want me to run the meltdown test?
Heh. I LOVE melting things down. :) But no, I agree with your point below...
I prefer handling the huge number of Hellos and IHUs first.
For example frame 16, 3x IHU for same neighbor. Silly behavior.
You are right that there is a bug there! Regrettably I don't have time
this week to jump on this, and Juliusz is busy with finals and so on, so
I mentioned it on the babel-users list.
It sounds like we have at *least* two bugs here. :sigh:
I'll have time next week.
… babeld-8nodes-slowstart-10s-11dec2018-2026.zip
Frame 16: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
on interface 0
Ethernet II, Src: 00:00:00_aa:00:00 (00:00:00:aa:00:00), Dst:
IPv6mcast_01:00:06 (33:33:00:01:00:06)
Internet Protocol Version 6, Src: fe80::200:ff:feaa:0, Dst: ff02::1:6
User Datagram Protocol, Src Port: 6696, Dst Port: 6696
Babel Routing Protocol
Magic: 42
Version: 2
Body Length: 56
Message hello (4)
Message Type: hello (4)
Message Length: 6
Seqno: 0x8512
Interval: 400
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001
Maybe it has to do with updates of internal state, e.g. reachability.
|
Is there a way to profile babeld while you run it in the emulator? |
I guess so. Each router is a mini VM in a container. A double-click opens a root shell.
What is your interest?
… On 11 Dec 2018, at 23:20, Antonin Décimo ***@***.***> wrote:
Is there a way to profile babeld while you run it in the emulator?
|
They say one should profile before optimizing. I'd be interested in real-world data on where babeld spends its time, and in looking there. |
Well - kill it when it's misbehaving, after compiling with -pg, and let us know what gprof reports? I have at least one other patch that might be relevant (it's in my uthash branch) - but I'm certain we have at least one real bug here. |
According to the code there is one IHU (that is, including the marginal IHU) per Hello. Also, the hello count of 9 seems high. |
Thanks, that's very helpful. Could you please try again after applying the following patch?
|
After testing myself, this patch appears to pretty much solve the problem. It will slow down insertion of a new node, but I think avoiding a network meltdown is worth it. |
I'm not so sure this patch helps much. I still see the silly IHU behavior. There are other packet storms. See a couple of storms and the IHU peaks; 49 nodes, 10s interval between babeld starts:
Normal node insertion (not one of the three peaks):
Router-ID storm during first peak:
babeld-49nodes-10s-delay.pcapng.zip
I think the triggered messages need much more thought. |
I'm not so sure this patch helps much. I still see the silly IHU behavior.
There are other packet storms.
The issue here is that we're sending a Hello whenever we learn a new peer,
and we always send IHUs together with the Hello. I think the fundamental
sin is not rate-limiting the Hellos; the IHUs are just a consequence.
Sending an unscheduled Hello to a new peer is a good idea, we just
shouldn't be doing it when we learn a whole sequence of new nodes in
quick succession. Ideas? (See the sketch after this message.)
Router-ID storm during first peak:
The router-ids are an epiphenomenon -- what we've got here is an Update
storm. It's not too extreme, but yes, the triggered update logic might be
too aggressive.
…-- Juliusz
|
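One possible shape for the rate limit described above (a sketch; the holdoff value and the per-interface field are made up, not existing babeld state): still send an unscheduled Hello for the first newly learned peer, but let a burst of discoveries within the holdoff window trigger only one.

```c
#include <time.h>

#define UNSCHEDULED_HELLO_HOLDOFF 1  /* seconds; made-up value */

struct iface {
    time_t last_unscheduled_hello;   /* hypothetical per-interface field */
};

/* Returns 1 if an unscheduled Hello may be sent now. */
static int may_send_unscheduled_hello(struct iface *ifp)
{
    time_t now = time(NULL);
    if (now - ifp->last_unscheduled_hello < UNSCHEDULED_HELLO_HOLDOFF)
        return 0;                    /* a Hello just went out: suppress */
    ifp->last_unscheduled_hello = now;
    return 1;
}
```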
The interval between the startup of nodes was 10s. I don't think that is "in quick succession". Can the behavior be caused by triggered messages, without any checks on whether such messages can be delayed, combined and/or suppressed as duplicates? I would say a message should not be sent more than X times in Y centiseconds, e.g. once every 50 centiseconds. |
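The "X times in Y centiseconds" idea could be a simple fixed-window counter per message type (a sketch in babeld-style centisecond units; all names are hypothetical):

```c
/* Allow at most `burst` messages per `interval` centiseconds. */
struct ratelimit {
    int burst;         /* X: messages allowed per window */
    int interval;      /* Y: window length, centiseconds */
    int window_start;  /* start of the current window, centiseconds */
    int tokens;        /* sends still allowed in this window */
};

static int ratelimit_allow(struct ratelimit *rl, int now_cs)
{
    if (now_cs - rl->window_start >= rl->interval) {
        rl->window_start = now_cs;  /* new window: refill */
        rl->tokens = rl->burst;
    }
    if (rl->tokens <= 0)
        return 0;                   /* over the limit: suppress or delay */
    rl->tokens--;
    return 1;
}
```

With burst = 1 and interval = 50, this is the "once every 50 centiseconds" example.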
What about this? (Not committing yet, I need to think over the consequences.)
|
This helps a lot. Peaks up to 100kbit/sec, avg 40kbit/sec. babel packets: 2989
Question is: is there a convergence-time trade-off? |
I tried the 49-node concurrent start also. Looks much better :-) |
Question is: is there a convergence-time trade-off?
Yes. The code in question sends an update whenever it sees a new node.
This is bad in the very dense network that you're simulating -- a node
discovers its neighbours one by one, and sends an update for each.
The tradeoff, of course, is that now a node doesn't receive an immediate
update when it moves from one neighbourhood of the network to another; it
must wait for the periodic update.
A ------------ B          A ------------ B
|                 ---->                  |
C                                        C
(If you could simulate the scenario above, that would be helpful.)
Of course, I'm entirely in favour of avoiding a meltdown, even in a rather
unrealistic situation (Babel is not designed for very dense networks), so
I will remove the faulty code. I just want to think it over some more.
…-- Juliusz
|
One of my overall proposals is that the code start taking advantage of the different-length intervals available in the protocol. In this case, as the observed density (in IHUs and/or hellos, or route announcements with a unique router-id) goes up, the hello interval increases. You have 4 speakers, use 4 sec, perhaps with 16, 32 sec (pick a scaling factor, any one...). A hello interval of merely "observed speakers" essentially preserves babel's existing properties for less dense meshes - a lonely hello would be 1 sec (or min 2), which might improve matters some on sparse meshes. See the sketch after this comment.
The same concept applies to route announcements, although I'd announce default routes more frequently than obscure routes, with a stated longer interval. A new node will typically learn "enough" of the topology within a few route transfers, also. Trying to pick a "good" node (or nodes) to request a route dump from could possibly be smarter. Picking the most disjoint set of hellos/IHUs? Ask for a route dump from the best + worst + random, initially? Do you even need a hello/IHU exchange to start picking up routes or to calculate the number of speakers?
If reachability is on the decline, rather than the increase, additively decrease the announced hello or route announcement interval. (This is a very TCP-ish idea.)
@tecoboot I'd rather like to witness what happens in this test when each node has 1k (or more!) distinct routes. From within a bunker at a safe distance. With popcorn. (I do really want to set this emulation environment up, but am working on revising my rtod tool instead.) I'm loving these graphs, overall. |
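The density scaling proposed above might reduce to something this small (a hypothetical helper; the floor and scaling factor are placeholders, in babeld-style centiseconds):

```c
/* Scale the Hello interval with the number of observed speakers:
 * 4 speakers -> 4 s, 16 -> 16 s, with a 2 s floor for sparse meshes.
 * Placeholder constants; pick a scaling factor, any one... */
static int hello_interval_cs(int observed_speakers)
{
    const int floor_cs = 200;                   /* 2 s minimum */
    int interval_cs = observed_speakers * 100;  /* 1 s per speaker */
    return interval_cs < floor_cs ? floor_cs : interval_cs;
}
```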
@tecoboot Re: "For example frame 16, 3x IHU for same neighbor. Silly behavior."... and later... "I still see the silly IHU behavior"... By this "silly behavior" you mean the more-than-one IHU per neighbor is still there? I see in frame 36 of babeld-24dec2018-concurrent-start.pcapng.gz that it is, indeed, repeating fe80::200:ff:feaa:2 & :1 twice; skipping randomly forward by eyeball, it's doing the same in frame 85. So there is a bug here, also, in how IHUs are merged. |
@m13253 - I was looking over the ZeroTier thing, which looks pretty neat, though I don't grok how it works. Along the way I noticed the https://github.com/nanomsg/nng library, which might be a substrate worth building other things on. |
I guessed so.
This issue was reported with 22 and 9 nodes.
Maybe keep the triggered mode behind a non-default config option, with UseAtYourOwnRisk in the man page. My thoughts are in line with what Dave suggests: smarter triggered updates with variable timers. A smart message scheduler could incorporate line utilization (like Cisco EIGRP: don't blow up the medium), link quality (a higher repetition rate on lossy links), and neighbor state (don't send unused messages at high frequency). The goal would be: important messages go first. |
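A scheduler along the lines sketched above ("important messages go first") could be as simple as draining the send buffer in priority order (hypothetical priorities and queue layout):

```c
#include <stdlib.h>

/* Lower value = more important = sent first.  Made-up ranking. */
enum prio { PRIO_HELLO, PRIO_DEFAULT_ROUTE, PRIO_IHU, PRIO_UPDATE, PRIO_REQUEST };

struct pending {
    enum prio prio;
    /* ... message body would go here ... */
};

static int by_prio(const void *a, const void *b)
{
    const struct pending *pa = a, *pb = b;
    return (int)pa->prio - (int)pb->prio;
}

/* Sort the pending buffer so important messages drain first when the
 * medium is busy; line utilization and link quality could further
 * adjust the ranking. */
static void schedule(struct pending *buf, size_t n)
{
    qsort(buf, n, sizeof(*buf), by_prio);
}
```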
Yes, still unneeded redundancy :-) It was frame 38 (not 36); IHU neighbor addresses in the same packet, sorted: This could be caused by message queuing, combined with a lack of removal from the queue when a message is updated. Could the duplicates be removed? |
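The duplicate removal asked about here could work by updating a queued IHU in place rather than appending another (a sketch with a made-up queue layout, not babeld's actual buffering code):

```c
#include <string.h>

#define MAX_PENDING 64

struct pending_ihu {
    unsigned char address[16];  /* neighbour's link-local address */
    unsigned short rxcost;
    unsigned short interval;
};

struct ihu_queue {
    struct pending_ihu q[MAX_PENDING];
    int len;
};

/* Before buffering an IHU, look for one already queued for the same
 * neighbour and overwrite it, so a packet never carries duplicates. */
static void enqueue_ihu(struct ihu_queue *queue, const struct pending_ihu *ihu)
{
    for (int i = 0; i < queue->len; i++) {
        if (memcmp(queue->q[i].address, ihu->address, 16) == 0) {
            queue->q[i] = *ihu;  /* refresh in place */
            return;
        }
    }
    if (queue->len < MAX_PENDING)
        queue->q[queue->len++] = *ihu;
}
```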
With concurrent startup of 48, 63 and 99 nodes, there was no meltdown :-) I'm quite sure that this density with real radios will end up in lousy performance. But it is good to know the protocol and implementation can handle it. |
I've committed just part of this patch, the bit about sending Hellos proactively. The other part I don't feel comfortable removing, although I don't like the current code. Teco, thanks for showing me this; that's a lot of help. |
Hi @jech, thanks for working on this! I want to let you know that we are currently using/testing 39c5f0a on some small (<20 node) community networks in Argentina and Brazil, with good results so far. In the following weeks we will be performing a test on a ~60-node network. We will update with results. |
With that many nodes, also giving the new unicast support a go might be
interesting.
|
I've committed the rest of the patch in b66c173, so this issue is believed to be fixed. Thanks to all of you; please create a new issue if there's anything more to do. |
Babeld has been working great for me with a small number of devices. But when I add the 22nd device to a mesh network, the amount of network chatter between machines goes up tremendously, to the point that the management overhead prevents other data from getting transferred. It is reproducible in my environment.
I have tried reducing the hello interval and traced through the code to see what is happening. I have not spotted a hard limit on the number of devices or routes that can be supported. I am seeing evidence that a device gets dropped from the set of neighbors, and then new packets from that device trigger the new-neighbor behavior.
Any suggestions on possible causes?