Issue with 22+ devices in a mesh #8

Closed
benhaley opened this issue Sep 14, 2016 · 46 comments

@benhaley

Babeld has been working great for me with a small number of devices. When I add the 22nd device to a mesh network, the amount of network chatter between machines goes up tremendously, to the point that the management overhead prevents other data from getting transferred. It is reproducible in my environment.

I have tried reducing the hello interval and traced through the code to see what is happening. I have not spotted a hard limit on the number of devices or routes that can be supported. I am seeing evidence that a device gets dropped from the set of neighbors and then new packets from that device trigger the new-neighbor behavior.

Any suggestions on possible causes?

@benhaley
Author

The problem seems to come from the sequence-number check logic. If the network has some packet loss or packets arrive out of order, the reach history can get cleared. This causes babel to send hellos immediately, adding to the congestion. The result is that in a network already suffering from congestion, babel ramps up its traffic volume tremendously. Removing the send_hello call solved the problem for me, but that is probably not a general solution for mobile networks.
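
Roughly the shape of the logic being described, as a simplified sketch rather than babeld's actual neighbour.c code (the field names follow the real code, but the details here are only illustrative):

/* Simplified illustration: reach is a 16-bit sliding bitmap of recently
   received Hellos. A reordered or far-off seqno clears the history, and
   a mostly-empty bitmap then looks like a brand-new neighbour, which
   triggers an unscheduled Hello: the feedback loop described above. */
#include <stdint.h>

struct hello_history {
    uint16_t reach;   /* bit set = Hello received in that slot */
    uint16_t seqno;   /* seqno of the last Hello seen */
};

void note_hello(struct hello_history *hist, uint16_t seqno)
{
    int missed = (int)(uint16_t)(seqno - hist->seqno) - 1;

    if(missed < 0 || missed > 16)
        hist->reach = 0;           /* loss or reordering wipes the history */
    else
        hist->reach >>= missed;    /* shift in the missed Hellos as zeroes */

    hist->reach = (hist->reach >> 1) | 0x8000;   /* record this Hello */
    hist->seqno = seqno;
    /* A sparse reach value here is what makes the neighbour look "new"
       and provokes the immediate send_hello. */
}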

@dtaht

dtaht commented Feb 20, 2017

I am curious where you made this change. I was looking over this section of code as well, and seeing what you were seeing: https://github.com/dtaht/rabeld/blob/master/neighbour.c#L144

@jech
Owner

jech commented Mar 9, 2017

Reopening, this looks interesting.

@jech jech reopened this Mar 9, 2017
@m13253

m13253 commented Mar 24, 2018

Reproduced in a production ZeroTier mesh network with 9 nodes.

My network contains 8 nodes across the world with low packet loss, and one node with high packet loss to some (but not all) of the other nodes.
As soon as I connect the 9th node to babel, I observe a very high rate of packet loss and even intermittent interruptions that sometimes last for minutes.
An interesting fact is that the affected routes, according to mtr, do not go through the problematic 9th node.

My babeld configuration is:

interface zt0 type tunnel link-quality true rxcost 16 hello-interval 1 rtt-min 16 rtt-max 1024 max-rtt-penalty 1008

P.S. I cannot tell precisely whether my problem is exactly the same as this issue, since some of my nodes are located in China, where a packet with a certain combination of bytes can kill the entire TCP/UDP connection.

@christf
Contributor

christf commented Aug 12, 2018

@m13253 could you please have a look with your network sniffer of choice? It would be interesting to learn more about what is going on.

@m13253

m13253 commented Aug 13, 2018

@m13253 could you please have a look with your network sniffer of choice? It would be interesting to learn more about what is going on.

Nice idea.

But the original network is now in production use, so I would not want to add a node with 60% packet loss to it every day at 8 pm.
I plan to build another test network to reproduce the problem.

@ealione

ealione commented Sep 7, 2018

I have the same issue with around 9 nodes: link metrics grow to such levels that communication becomes almost impossible. Could this have something to do with the wireless cards switching to a lower modulation due to the many interrupts from the nodes constantly communicating with each other?

@jech
Owner

jech commented Sep 7, 2018 via email

@dtaht

dtaht commented Sep 8, 2018

One thing I've had to do is arbitrarily mark all babel packets (on my extensively fq_codel'd networks) as ECN-capable in net.c. Perhaps I've been treating a symptom.
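
For reference, the kind of change I mean is roughly the sketch below; it is not the actual net.c edit, and the socket descriptor and error handling are placeholders:

/* Sketch: mark an IPv6 UDP socket as ECN-capable (ECT(0)) by setting the
   traffic class, so that ECN-aware AQMs such as fq_codel can mark the
   babel traffic instead of dropping it. */
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPV6_TCLASS
#define IPV6_TCLASS 67   /* value used on Linux */
#endif

int mark_ecn_capable(int fd)
{
    int tclass = 0x02;   /* DSCP 0, ECN field = ECT(0) */
    return setsockopt(fd, IPPROTO_IPV6, IPV6_TCLASS, &tclass, sizeof(tclass));
}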

@jech
Owner

jech commented Sep 16, 2018

Is anyone still investigating this? I'd really like to understand what's going on.

@m13253

m13253 commented Sep 16, 2018

Is anyone still investigating this? I'd really like to understand what's going on.

I can't reproduce the problem with my current network conditions any more.

Also, perhaps the problem will be fixed once unicast support is merged into mainline.

@tecoboot

tecoboot commented Dec 8, 2018

I'm playing a bit with the CORE emulator and babeld.
This problem seems easy to reproduce; I have a small script that generates any number of nodes, fully meshed, with no loss.
With <20 nodes, no issues.
With ~35 nodes, I get a huge load, almost all IHUs. Rxcost is 0xffff or 0x0060.
Sometimes a packet contains multiple IHUs for the same neighbor.

Here is a pcap, captured during a slow start: each second an additional node is started.
babeld-35nodes-1s-slowstart-1000packets.pcapng.zip

@dtaht

dtaht commented Dec 8, 2018

@tecoboot - nice to see you! Could you share your CORE emulator setup? I've been trying to get to a better emulation environment myself.

I note there is a bug in mainline babeld at the moment... fixed in a patch I sent to the mailing list a few weeks back. Might be related to this....

@tecoboot

tecoboot commented Dec 9, 2018

Maybe I can help get babeld into good shape. I heard about disappointing results where babel was compared to olsrv2. Too bad (to be true?).

I run CORE in Xubuntu 18.04, in VMware Fusion 11, on a new MacBook. Any Linux will do, but be careful running it on a native OS, as normal users can easily run privileged code.
Installation:
https://github.com/coreemu/core

The babeld config generator puts N routers in a square, all connected to a wireless LAN.
I use sleep pauses to set up Wireshark on the WLAN (a bridge interface set up by CORE when the config is started), and to start the babeld instances one by one.

The results I shared were with the current Ubuntu-provided babeld: 1.7.0.

When this issue is fixed, I want to dig into packet overhead (why no address compression for IHUs?), dynamic timers (triple?), etc. I will only ask questions and share results; I don't code myself.
babeld-core.zip

What it looks like:
[screenshot 2018-12-09 08:55]

I note there is a bug in mainline babeld at the moment... fixed in a patch I sent to the mailing list a few weeks back. Might be related to this....

I'll check. I'll use master for follow-up tests.

@tecoboot

tecoboot commented Dec 9, 2018

OK, with babeld-1.8.4-42-g73d0b1a I have better results.
During startup, the overhead is quite high. In my emulator, the topology converges and the load drops from 1 Mb/s to 40 kb/s. I'll test with EMANE (an L1 simulator that works well with CORE).

[screenshot 2018-12-09 09:40]

BTW: I didn't test cleanup in my generator script. Either add an rm of the config file, or change >> to > in the cat config line.

@tecoboot

Hmm, I retried on my older MacBook (Core 2 Duo), with bad results. Exactly the same script, but now a total network meltdown. The 35 babeld instances ate all the CPU and it was hard to stop CORE or Wireshark. A nice test tool for the Linux scheduler and Wireshark memory management :-).

[screenshot 2018-12-10 20:57]

The sad news is that babeld cannot handle a bottleneck. I don't know whether it is caused by the CPU shortage itself or by packets lost due to the CPU shortage.

I reran the test, now with longer delays: first a pause of 100 s before starting the babeld daemons, then starting babeld one by one at a 10 s interval. This works well.

[screenshot 2018-12-10 21:22]

But in real networks, we don't want the meltdown trigger, do we?

I'll need some thoughts on how to locate the trigger.

@tecoboot

Here is the slow-start pcap file.

babeld-35nodes-slowstart-10s.zip

Still lots of requests and IHUs.
Without averaging, lots of spikes.

[graph: babeld-35nodes-slowstart-10s]

@tecoboot

I reran with 8 nodes. I turned off dynamic rxcost, to eliminate IHUs caused by the slow metric algorithm and the IHUs exchanged for it. I expected just a few packets to insert the new node into the topology.

babeld-8nodes-slowstart-10s.zip
[graph: babeld-8nodes-slowstart-10s]

See for example the addition of the second node in packets 9 to 24. It takes:
Hello: 9
Update: 3
NH: 1
RouterID: 1
IHU: 10
Request: 2
(and a couple of ICMPv6)
My thoughts:

  • Far too many IHUs and Hellos. Seems like a bug.
  • Learning the RouterID, ND, etc. is slow; newly joined nodes have to wait for the update interval.
    @Juliusz?

@dtaht

dtaht commented Dec 11, 2018

Excellent work, thanks! Could you apply this patch?

[PATCH 1/2] Fix ifup bug in send_multicast

From: Dave Taht <dave@taht.net>

It was essentially a no-op with an inverted test.
---
 message.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/message.c b/message.c
index 043e4b6..930fc69 100644
--- a/message.c
+++ b/message.c
@@ -1746,7 +1746,7 @@ send_multicast_request(struct interface *ifp,
     if(ifp == NULL) {
         struct interface *ifp_auxn;
         FOR_ALL_INTERFACES(ifp_auxn) {
-            if(if_up(ifp_auxn))
+            if(!if_up(ifp_auxn))
                 continue;
             send_multicast_request(ifp_auxn, prefix, plen, src_prefix, src_plen);
         }
@@ -1765,10 +1765,10 @@ send_multicast_request(struct interface *ifp,
             if(neigh->ifp == ifp) {
                 send_request(&neigh->buf, prefix, plen,
                              src_prefix, src_plen);
-            } else {
-                send_request(&ifp->buf, prefix, plen, src_prefix, src_plen);
             }
         }
+    } else {
+        send_request(&ifp->buf, prefix, plen, src_prefix, src_plen);
     }
 }

@dtaht

dtaht commented Dec 11, 2018

As for "not being able to handle a bottleneck" - babeld has two problems - one we've seen pathlogical cases like yours that exhaust local cpu, hellos become late, retransmits spike, and more cpu gets used - which I exposed via my "rtod" tool a few months back. One fallout of that was that babel now logs late hellos, and it would be interesting if you are seeing that in your tests. Finding and fixing those cases is definately high on my mind. The other, is that babel uses naive data structures which have a tendency not to scale well (while you are exhausting cpu). I've got some work in progress adding hashes to multiple formerly linked list lookups that appear quite promising, and @christf has been working on improving the scheduler.

@tecoboot

Patched.
On a slow start with 8 nodes, I don't see a difference.
Here is the log (-d 1) from node 1.

babeld-8nodes-node1.zip

Do you want me to run the meltdown test?
I prefer handling the huge number of hellos and IHUs first.
For example frame 16: 3x IHU for the same neighbor. Silly behavior.

babeld-8nodes-slowstart-10s-11dec2018-2026.zip

Frame 16: 122 bytes on wire (976 bits), 122 bytes captured (976 bits) on interface 0
Ethernet II, Src: 00:00:00_aa:00:00 (00:00:00:aa:00:00), Dst: IPv6mcast_01:00:06 (33:33:00:01:00:06)
Internet Protocol Version 6, Src: fe80::200:ff:feaa:0, Dst: ff02::1:6
User Datagram Protocol, Src Port: 6696, Dst Port: 6696
Babel Routing Protocol
Magic: 42
Version: 2
Body Length: 56
Message hello (4)
Message Type: hello (4)
Message Length: 6
Seqno: 0x8512
Interval: 400
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001
Message ihu (5)
Message Type: ihu (5)
Message Length: 14
Rxcost: 0x0060
Interval: 1200
Address: fe80::200:ff:feaa:1
Address Encoding: Link-Local IPv6 (3)
Raw Prefix: 006004b0020000fffeaa0001

Maybe it has to do with updates of internal state, e.g. reachability.

@dtaht

dtaht commented Dec 11, 2018 via email

@MisterDA
Contributor

Is there a way to profile babeld while you run it in the emulator?

@tecoboot

tecoboot commented Dec 12, 2018 via email

@MisterDA
Contributor

They say one should profile before optimizing. I'd be interested to have real-world data on where babeld spends its time and look there.
The easiest way is to compile with the -pg option; then, when babeld exits, it writes its call graph to a gmon.out file which can be processed with the gprof tool.

@dtaht

dtaht commented Dec 13, 2018

Well, kill it when it's misbehaving, after compiling with -pg, and let us know what gprof reports. I have at least one other patch that might be relevant (it's in my uthash branch), but I'm certain we have at least one real bug here.
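
One caveat: gprof only writes gmon.out when the process exits normally (via exit() or returning from main), so a SIGKILL, or any terminating signal the process doesn't catch, loses the profile. If the process under test doesn't already catch termination signals, a quick hack along these lines forces the flush (a sketch with hypothetical names, not babeld code):

/* Quick-and-dirty profiling aid: exit cleanly on SIGTERM/SIGINT so the
   gprof runtime gets a chance to write gmon.out. */
#include <signal.h>
#include <stdlib.h>

static void dump_profile_and_exit(int sig)
{
    (void)sig;
    exit(0);   /* runs the atexit handlers, which flush the profile data */
}

void install_profile_handlers(void)
{
    signal(SIGTERM, dump_profile_and_exit);
    signal(SIGINT, dump_profile_and_exit);
}

Then process the result with something like gprof ./babeld gmon.out > profile.txt, assuming -pg was passed both when compiling and when linking.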

@christf
Contributor

christf commented Dec 19, 2018

According to the code there is one IHU (that is, including marginal IHUs) per Hello.

Also, the Hello count of 9 seems high.

@jech
Owner

jech commented Dec 22, 2018

Thanks, that's very helpful.

Could you please try again after applying the following patch?

diff --git a/neighbour.c b/neighbour.c
index 8bf53a2..7604b4c 100644
--- a/neighbour.c
+++ b/neighbour.c
@@ -201,12 +201,6 @@ update_neighbour(struct neighbour *neigh, struct hello_history *hist,
         }
     }
 
-    if((hist->reach & 0xFC00) == 0xC000) {
-        /* This is a newish neighbour, let's request a full route dump.
-           We ought to avoid this when the network is dense */
-        send_unicast_request(neigh, NULL, 0, NULL, 0);
-        send_ihu(neigh, NULL);
-    }
     return rc;
 }
 

@jech
Owner

jech commented Dec 22, 2018

After testing myself, this patch appears to pretty much solve the problem. It will slow down insertion of a new node, but I think avoiding a network meltdown is worth it.

@jech jech closed this as completed Dec 22, 2018
@tecoboot

I'm not so sure this patch helps much. I can still see the silly IHU behavior. There are also other packet storms.

See a couple of storms and the IHU peaks; 49 nodes, 10 s interval between babeld starts:

[screenshot 2018-12-23 12:12]

Normal node insertion (not one of the three peaks):

[screenshot 2018-12-23 12:05]

Router-ID storm during first peak:

[screenshot 2018-12-23 12:11]

babeld-49nodes-10s-delay.pcapng.zip

I think the triggered messages need much more thought.

@jech jech reopened this Dec 23, 2018
@jech
Owner

jech commented Dec 23, 2018 via email

@tecoboot

Sending an unscheduled Hello to a new peer is a good idea, we just shouldn't be doing it when we quickly learn a sequence of new nodes in quick succession. Ideas?

The interval between node startups was 10 s. I don't think that counts as "in quick succession".
My first test run was with concurrent startup. That resulted in a meltdown.

Could the behavior be caused by triggered messages being sent without any check on whether such messages can be delayed, combined and/or suppressed as duplicates? I would say a message should not be sent more than X times in Y centiseconds, e.g. once every 50 centiseconds.
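
A minimal sketch of that kind of rate limit, assuming a single triggered-Hello path; the names and the timeval arithmetic are illustrative, not babeld's scheduler:

/* Allow at most one triggered Hello per MIN_TRIGGER_INTERVAL_CS
   centiseconds; callers fall back to the regularly scheduled Hello
   when the limiter says no. */
#include <sys/time.h>

#define MIN_TRIGGER_INTERVAL_CS 50   /* at most one triggered Hello per 500 ms */

static struct timeval last_triggered_hello;

static long long centisecs_since(const struct timeval *then, const struct timeval *now)
{
    return (long long)(now->tv_sec - then->tv_sec) * 100 +
           (now->tv_usec - then->tv_usec) / 10000;
}

int may_send_triggered_hello(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    if(centisecs_since(&last_triggered_hello, &now) < MIN_TRIGGER_INTERVAL_CS)
        return 0;                    /* suppressed: one was sent too recently */
    last_triggered_hello = now;
    return 1;
}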

@jech
Owner

jech commented Dec 23, 2018

What about this? (Not committing yet, I need to think over the consequences.)

diff --git a/neighbour.c b/neighbour.c
index 7604b4c..c478762 100644
--- a/neighbour.c
+++ b/neighbour.c
@@ -117,7 +117,6 @@ find_neighbour(const unsigned char *address, struct interface *ifp)
     neigh->next = neighs;
     neighs = neigh;
     local_notify_neighbour(neigh, LOCAL_ADD);
-    send_hello(ifp);
     return neigh;
 }
 
@@ -184,23 +183,6 @@ update_neighbour(struct neighbour *neigh, struct hello_history *hist,
             rc = 1;
     }
 
-    if(unicast)
-        return rc;
-
-    /* Make sure to give neighbours some feedback early after association */
-    if((hist->reach & 0xBF00) == 0x8000) {
-        /* A new neighbour */
-        send_hello(neigh->ifp);
-    } else {
-        /* Don't send hellos, in order to avoid a positive feedback loop. */
-        int a = (hist->reach & 0xC000);
-        int b = (hist->reach & 0x3000);
-        if((a == 0xC000 && b == 0) || (a == 0 && b == 0x3000)) {
-            /* Reachability is either 1100 or 0011 */
-            send_self_update(neigh->ifp);
-        }
-    }
-
     return rc;
 }

@tecoboot

This helps a lot. Peaks up to 100 kbit/s, average 40 kbit/s.

[graph: babeld-24dec]

[graph: babeld-24dec-avg]

Babel packets: 2989

  msg         packets    #msgs
  hello          2351     2351
  ihu            2268    23192
  router-id      1673    17034
  nh             1673     1673
  request          70       70
  update         1708    17104

babeld-24dec2018.pcapng.gz

Question is: is there a convergence-time trade-off?

@tecoboot

I also tried the 49-node concurrent start. Looks much better :-)
Keep in mind this is Ethernet bridging, not EMANE or real wireless. I'll work with EMANE as a next step.

[graph: babeld-24dec2018-concurrent-start]

babeld-24dec2018-concurrent-start.pcapng.gz

@jech
Owner

jech commented Dec 24, 2018 via email

@dtaht

dtaht commented Dec 24, 2018

One of my overall proposals is that the code start taking advantage of the different interval lengths available in the protocol. In this case, as the observed density (in IHUs and/or Hellos, or route announcements with a unique router-id) goes up, the hello interval increases: with 4 speakers use 4 s, perhaps with 16, 32 s (pick a scaling factor, any one). A hello interval of merely "observed speakers" seconds essentially preserves babel's existing properties for less dense meshes; a lonely hello would be 1 s (or a minimum of 2 s), which might improve matters somewhat on sparse meshes.

The same concept applies to route announcements, although I'd announce default routes more frequently than obscure routes, which would carry a stated longer interval.

A new node will also typically learn "enough" of the topology within a few route transfers. Picking a "good" node (or nodes) to request a route dump from could possibly be smarter: pick the most disjoint set of Hellos/IHUs? Initially ask for a route dump from the best + worst + a random node? Do you even need a Hello/IHU exchange to start picking up routes, or to count the number of speakers?

If reachability is declining rather than increasing, additively decrease the announced hello or route-announcement interval (a very TCP-ish idea).

@tecoboot I'd rather like to witness what happens on this test when each node has 1k (or more!) distinct routes. From within a bunker at a safe distance. With popcorn. (I do really want to set this emulation environment up, but am working on revising my rtod tool instead). I'm loving these graphs, overall.
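
To make the interval-scaling idea above concrete, a minimal sketch (the constants and the neighbour counting are illustrative only, not a tested policy):

/* Grow the Hello interval with the number of observed speakers, within
   a floor and a ceiling; returns centiseconds, the unit Babel uses on
   the wire. */
static unsigned density_scaled_hello_interval(unsigned observed_speakers)
{
    unsigned seconds = observed_speakers;   /* roughly 1 s per speaker */
    if(seconds < 2)
        seconds = 2;                        /* floor for sparse meshes */
    if(seconds > 32)
        seconds = 32;                       /* cap for very dense meshes */
    return seconds * 100;                   /* convert to centiseconds */
}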

@dtaht

dtaht commented Dec 24, 2018

@tecoboot Re: "For example frame 16, 3x IHU for same neighbor. Silly behavior."... later... "I still see the silly ihu behavior"...

By this "silly behavior" you mean the more than one IHU per neighbor is still there? I see in frame 36 of babeld-24dec2018-concurrent-start.pcapng.gz that it is, indeed, repeating fe80::200:ff:feaa:2 & :1 twice, me, skipping randomly forward by eyeball, it's doing the same on frame 85. So there is a bug here, also, in how ihus are merged.

@dtaht

dtaht commented Dec 24, 2018

@m13253 - I was looking over the zerotier thing, which looks pretty neat though I don't grok how it works. Along the way I noticed the https://github.com/nanomsg/nng library which might be a substrate worth building other things on.

@tecoboot

Question is: is there a convergence-time trade-off?
Yes. The code in question sends an update whenever it sees a new node.

I guessed so.

This is bad in the very dense network that you're simulating

This issue was reported with 22 and 9 nodes.
My current test uses 35 nodes (not 49 yet). I'll add more to see how far I can push babeld.

Of course, I'm entirely in favour of avoiding a meltdown, even in a rather unrealistic situation (Babel is not designed for very dense networks), so I will remove the faulty code. I just want to think it over some more.

Maybe keep the triggered mode behind a non-default config option, with a use-at-your-own-risk note in the man page.

My thoughts are in line with what Dave suggests: smarter triggered updates with variable timers. A smart message scheduler could incorporate line utilization (like Cisco EIGRP: don't blow up the medium), link quality (a higher repetition rate on lossy links), and neighbor state (don't send unneeded messages at high frequency). The goal would be: important messages go first.
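
For illustration only, one way "important messages go first" could look as a static per-type priority (hypothetical types and numbers, not babeld's message handling):

/* Sketch: assign each queued message a priority so that liveness and
   retractions are sent before bulk IHUs and route requests. */
enum msg_type { MSG_HELLO, MSG_IHU, MSG_UPDATE, MSG_SEQNO_REQUEST, MSG_ROUTE_REQUEST };

static int msg_priority(enum msg_type type, int is_retraction, int link_lossy)
{
    int prio = 0;
    switch(type) {
    case MSG_HELLO:          prio = 100; break;                      /* liveness first */
    case MSG_SEQNO_REQUEST:  prio = 90;  break;                      /* loop avoidance */
    case MSG_UPDATE:         prio = is_retraction ? 85 : 50; break;  /* retractions are urgent */
    case MSG_IHU:            prio = 40;  break;
    case MSG_ROUTE_REQUEST:  prio = 30;  break;
    }
    if(link_lossy)
        prio += 5;   /* repeat a bit sooner on lossy links */
    return prio;
}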

@tecoboot

@tecoboot Re: "For example frame 16, 3x IHU for same neighbor. Silly behavior."... later... "I still see the silly ihu behavior"...

By this "silly behavior" you mean the more than one IHU per neighbor is still there? I see in frame 36 of babeld-24dec2018-concurrent-start.pcapng.gz that it is, indeed, repeating fe80::200:ff:feaa:2 & :1 twice, me, skipping randomly forward by eyeball, it's doing the same on frame 85. So there is a bug here, also, in how ihus are merged.

Yes, still unneeded redundancy :-)

It was frame 38 (not 36). IHU neighbor addresses in the same packet, sorted:
Address: fe80::200:ff:feaa:0
Address: fe80::200:ff:feaa:1
Address: fe80::200:ff:feaa:1
Address: fe80::200:ff:feaa:2
Address: fe80::200:ff:feaa:2
Address: fe80::200:ff:feaa:3
Address: fe80::200:ff:feaa:3
Address: fe80::200:ff:feaa:5
Address: fe80::200:ff:feaa:5
Address: fe80::200:ff:feaa:6
Address: fe80::200:ff:feaa:6
Address: fe80::200:ff:feaa:7
Address: fe80::200:ff:feaa:7

This could be caused by message queuing without removal from the queue when a message is updated.
Addresses and rxcost:
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:2
Rxcost: 0x01ff
Address: fe80::200:ff:feaa:0
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:1
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:2
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:1
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:6
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:3
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:5
Rxcost: 0x03ff
Address: fe80::200:ff:feaa:7
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:5
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:7
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:6
Rxcost: 0x0155 (was 0x03ff)
Address: fe80::200:ff:feaa:3

Could the duplicates be removed?
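
One possible shape of the fix, as a sketch over a hypothetical queue of pending IHUs (babeld actually batches TLVs into per-interface buffers, so a real fix would live in that buffering code):

/* Before flushing, keep only the newest queued IHU per neighbour so a
   single packet never carries the same address twice. */
#include <string.h>

struct queued_ihu {
    unsigned char address[16];
    unsigned short rxcost;
    int valid;
};

void dedup_ihus(struct queued_ihu *q, int n)
{
    for(int i = 0; i < n; i++) {
        if(!q[i].valid)
            continue;
        for(int j = i + 1; j < n; j++) {
            /* A later entry for the same address is newer: drop this one. */
            if(q[j].valid && memcmp(q[i].address, q[j].address, 16) == 0) {
                q[i].valid = 0;
                break;
            }
        }
    }
}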

@tecoboot

With concurrent startup of 48, 63 and 99 nodes, there was no meltdown :-)

48 nodes, 70kbit/sec:
[graph: babeld-48nodes]

63 nodes, 120kbit/sec:
[graph: babeld-63nodes]

99 nodes, 280kbit/sec:
[graph: babel-99nodes]

Core with 99 nodes:
[screenshot 2018-12-24 20:38]

I'm quite sure that this density with real radios will end up with lousy performance. But it's good to know that the protocol and implementation can handle it.

@jech
Owner

jech commented Dec 26, 2018

I've committed just part of this patch, the bit about sending Hellos proactively. The other part I don't feel comfortable removing, although I don't like the current code. Teco, thanks for showing me this, that's a lot of help.

@spiccinini

Hi @jech, thanks for working on this!

I want to let you know that we are currently using/testing 39c5f0a in some small (<20 node) community networks in Argentina and Brazil, with good results so far. In the coming weeks we will be performing a test in a ~60-node network. We will update with results.

@dtaht

dtaht commented May 29, 2019 via email

@jech
Owner

jech commented Jun 28, 2019

I've committed the rest of the patch in b66c173, so this issue is believed to be fixed.

Thanks to all of you, please create a new issue if there's anything more to do.

@jech jech closed this as completed Jun 28, 2019