
Writeup of router kill issue #3320

Open
whyrusleeping opened this issue Oct 18, 2016 · 114 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@whyrusleeping
Member

whyrusleeping commented Oct 18, 2016

So we know that ipfs can kill people's routers. We should do a quick write up of what the causes are, which routers are normally affected, and maybe propose a couple ideas for solutions.

@Kubuxu do you think you could handle doing this at some point?

@whyrusleeping whyrusleeping added this to the Dont Kill Routers milestone Oct 18, 2016
@donothesitate

donothesitate commented Oct 23, 2016

My theory is that it exhausts / overloads the NAT table, which on some routers causes lockups.
UDP on the same routers keeps working without problems, as do TCP connections that were already open when the lockup occurred.

Possible solution: Have a switch to limit number of peers/connections.
Related #3311
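
For reference, current go-ipfs exposes such a switch via the connection manager. A sketch of a conservative Swarm.ConnMgr section (the values here are illustrative, not recommended defaults):

```json
{
  "Swarm": {
    "ConnMgr": {
      "Type": "basic",
      "LowWater": 50,
      "HighWater": 100,
      "GracePeriod": "30s"
    }
  }
}
```

Note that this bounds the steady-state connection count, not the rate of dial attempts.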

@ghost

ghost commented Oct 23, 2016

That sounds highly likely. nf_conntrack_max on my edge router is set to 1024 by default and ipfs eats 700 of those on its own, per computer I'm running it on.

A lot of those are dead connections too: if I open the webui which tries to ping them it quickly drops to 90 or so.
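
On a Linux-based router, the conntrack pressure described above can be checked directly. A minimal sketch, assuming the nf_conntrack module is loaded and exposes the usual /proc counters:

```shell
# Report how full the connection-tracking table is on a Linux router.
conntrack_usage() {
    max=/proc/sys/net/netfilter/nf_conntrack_max
    cnt=/proc/sys/net/netfilter/nf_conntrack_count
    if [ -r "$max" ] && [ -r "$cnt" ]; then
        echo "conntrack: $(cat "$cnt") / $(cat "$max") entries in use"
    else
        echo "conntrack: counters not readable (module not loaded?)"
    fi
}
conntrack_usage
```

The same counters are typically also visible via `sysctl net.netfilter.nf_conntrack_count` when the module is loaded.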

@Kubuxu Kubuxu added the status/deferred Conscious decision to pause or backlog label Nov 28, 2016
@hsanjuan
Contributor

hsanjuan commented Mar 23, 2017

Running 5 daemons on local network with a well-known hash (they were pinning dist) kills my Fritzbox.

AFAIK everyone holds Fritzboxes in high esteem as very good routers, not some shitty hardware. The internet reports a NAT table size of around 7000. I find the problem is exacerbated when my nodes are pinning popular content (I suspect this not only consumes all the bandwidth but also increases the number of connections as other peers try to download these blocks?).

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

So my idea of what happens is that the conntracker table fills up (it is small in cheapo routers, bigger in good ones) and it starts throwing out other connections. @hsanjuan can you repeat the test, kill the ipfs daemons, and check if the router comes back online?

@hsanjuan
Contributor

hsanjuan commented Mar 23, 2017

@Kubuxu yeah yeah things are back up immediately when I kill them. Only once I had the router reboot itself, which worried me more.

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

So the other possibility is that cheapo routers have a bigger conntracker limit than their RAM can handle, and the kernel panics or locks up. Not sure how to check that.

@whyrusleeping
Member Author

whyrusleeping commented Mar 23, 2017

Does UDP eat up conntracker entries? We're moving quickly towards having support for QUIC.

@Kubuxu
Member

Kubuxu commented Mar 23, 2017

AFAIK, yes. At least judging from the time my services were DDoSed with UDP packets: they were much more destructive because of the low conntracker limits.

@hsanjuan
Contributor

hsanjuan commented Mar 28, 2017

Is it possible that this problem got much worse in the last releases (i.e. >=0.4.5)? I used to be able to run 4 nodes without problems and now it seems I can't, even after cleaning their contents.

@kakra

kakra commented May 2, 2017

I'm having issues, too. Maybe ipfs should keep two connection pools and migrate peer connections from a bad-quality pool to a good-quality pool by applying some heuristics to the peers. Peers with higher delays, lower bandwidth and short lives would live in the "bad pool" and be easily replaced by new peers if connection limits are hit. Better peers would migrate to the "good pool" and only be replaced by better peers if limits are hit. Having both pools gives slow peers a chance to be part of the network without being starved by higher-quality peers, which is important for a p2p distributed network.

BTW, udp also needs connection tracking, this wouldn't help here, and usually udp tracking tables are much smaller and much more short-lived which adds a lot of new problems. But udp could probably lower the need for bandwidth as there's no implicit retransmission and no ack. Of course, the protocol has to be designed in a way to handle packet loss, and it must take into account that NAT gateways usually drop udp connection table entries much faster. It doesn't make sense to deploy udp and then reimplement retransfers and keep-alive, as this would replicate tcp with no benefit (probably it would even lower performance).

Also, ipfs should limit the amount of outstanding packets, not the amount of connections itself. If there are too many packets in-flight, it should throttle further communication with peers, maybe prioritizing some over others. This way, it could also auto-tune to the available bandwidth but I'm not sure.

Looking at what BBR does for network queues, it may be better to throw away some requests instead of queuing up a huge backlog. This can improve overall network performance, bloating buffers is a performance killer. I'd like to run ipfs 24/7 but if it increases my network latency, I simply cannot, which hurts widespread deployment.

Maybe ipfs needs to measure latency and throw away slowly responding peers. For this to work properly, it needs to auto-adjust to the bandwidth, because once network queues fill, latency will spike up exponentially and the aforementioned latency measurement becomes useless.

These big queues are also a problem with many routers, as they tend to use huge queues to increase total bandwidth for benchmarks; but that totally kills latency, and thus keeps important services like DNS from working properly.

I'm running a 400/25mbps asymmetric link here, and as soon as "ipfs stats bw" gets beyond a certain point, everything else chokes: browsers become unusable, waiting tens of seconds for websites or running into DNS errors. Once a web request comes through in such a situation, the website almost immediately appears in full (minus assets hosted on different hosts), so this is clearly an upstream issue with queues and buffers filled up and improper prioritizing (as ACKs still seem to pass early through the queues, otherwise download would be reduced, too).

I don't know if QUIC would really help here... It just reduces initial round-trip times (which HTTP/2 also does) which is not really an issue here as I consider ipfs a bulk-transfer tool, not a latency-sensitive one like web browsing.

Does ipfs properly use TOS/QoS flags in IP packets?

PS: ipfs should not try to avoid TCP/IP's auto-tuning capabilities by moving to UDP. Instead it should be nice to competing traffic by keeping latency below a sane limit and letting TCP do the bandwidth tuning. And it should be nice to edge-router equipment (which is most of the time cheap and cannot be avoided) by limiting outstanding requests and the total number of connections. I remember when Windows XP tried to fix this in the TCP/IP stack by limiting outstanding TCP handshakes to ten, then blocking everything else globally. That was a silly idea, but it was thinking in the right direction, I guess.

@dsvi

dsvi commented May 25, 2017

I think you might as well not do anything at all, since routers are consistently getting better at supporting higher numbers of connections. My 5-year-old one struggled with supporting 2 ipfs nodes (about 600 connections each) + torrent (500 connections). I've just got a cheap Chinese one, and it works like a charm. Even most cheap routers nowadays have hardware NAT; they don't much care how many connections you throw at them.
Also, switching to UDP doesn't help, since when I unleashed torrent far beyond the 500 connections limit, it used to kill the old router just as well as ipfs did. And torrent uses only UDP.

@ghost

ghost commented May 26, 2017

@dsvi: I'd rather not have to pay hard cash just to use IPFS on the pretence that it's fine to be badly behaved because some other software can be misconfigured to crash routers. A lot of people don't even have the luxury of being allowed to connect to their ISP using their own hardware.

And what a strawman you've picked — a Bittorrent client! A system that evolved its defaults from fifteen years of real-world experience for precisely this reason!

No thanks, just fix the code.

@kakra

kakra commented May 26, 2017

@dsvi I wonder if they use their own routers because the page times out upon request... ;-)

But please do not suggest that: many people are stuck with what is delivered by their providers, with no chance to swap that equipment for better stuff. Ipfs not only has to be nice to such equipment, but also to the overall network traffic on that router: if it makes the rest of my traffic unusable, there's no chance for ipfs to evolve, because nobody or only very few could run it 24/7. Ipfs won't reach its goal if people only start it on demand.

@dsvi

dsvi commented May 26, 2017

Sorry guys, I should have expressed it better. I'll try from another direction this time ;)

  1. The internet world is becoming decentralized in general. This is a natural trend which is everywhere, from secure instant messaging and filesharing to decentralized email systems and so on.
    And creating tons of connections is a natural part of such systems. They are distributed, and for effective work they have to support tons of connections (distribution channels). It's unavoidable in general. There can be improvements here and there, but it's fundamentally "unfixable".
  2. Hardware vendors have acknowledged that already. Modern router chipsets are way better in that regard nowadays, since all the hardware review sites have included at least torrent tests in their review suites. So nowadays you don't really need something $200+ to work well with it. And a year from now it will only get way better, since vendors tend to offload a lot of routing work to hardware.
    So it already is not a problem, and will be even less so with every year.

And what about people who are stuck with relic hardware for whatever reason? Well, I feel sorry for some of them, but progress will go on with them or without.

@Calmarius

Calmarius commented Oct 2, 2017

@dsvi

"Internet world is becoming decentralized in general. "

Nope! It's becoming centralized. Almost the whole internet is served by a handful of datacenter companies.
For most people search means Google, e-mail means Gmail, social interactions mean Facebook, videos mean Youtube, chat means Facebook Messenger, picture sharing means Instagram.
The rest of the web is hosted at one of the few largest datacenter companies.

At the beginning we used to have Usenet and IRC servers running on our computers at home.
Then services got more and more centralized.

I don't see signs of any decentralization. But I see signs of further centralization.
For example some ISPs don't even give you public IP addresses anymore (for example 4G networks).

"And creating tons of connections is a natural part of such systems."

Having too many simultaneous connections makes the system inefficient.
If you have enough peers to saturate your bandwidth it's pointless to add more.

Currently my IPFS daemon opens 2048 connections to peers within several hours, then runs out of file descriptors and becomes useless. This should be fixed.
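
A quick way to watch for the descriptor exhaustion described above (a sketch; the process name `ipfs` and the /proc-based counting are assumptions about a typical Linux setup):

```shell
# Compare a running ipfs daemon's open file descriptors with the soft limit.
fd_usage() {
    pid=$(pgrep -x ipfs 2>/dev/null | head -n 1)
    if [ -n "$pid" ] && [ -d "/proc/$pid/fd" ]; then
        echo "ipfs fds: $(ls "/proc/$pid/fd" | wc -l) of $(ulimit -n) allowed"
    else
        echo "ipfs daemon not found"
    fi
}
fd_usage
```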

@vext01

vext01 commented Sep 22, 2018

I'm using a crappy TalkTalk router provided by the ISP and I've been unable to find a configuration where IPFS doesn't drag my internet connection to its knees.

Using ifstat I usually see between 200 kb/s and 1 MB/s up and down whilst ipfs is connected to a couple of hundred peers.

I'd like to try connecting to fewer peers, but even with:

      "LowWater": 20,
      "HighWater": 30,

ipfs still connects to hundreds.

@vext01

vext01 commented Dec 18, 2018

Perhaps this is a dumb question, but why don't you make it so that IPFS stops connecting to more peers once the high water mark is reached?

@Stebalien
Member

Stebalien commented Dec 18, 2018

We should implement a max connections but high/low water are really designed to be target bounds.

The libp2p team is currently refactoring the "dialer" system in a way that'll make it easy for us to configure a maximum number of outbound connections. Unfortunately, there's really nothing we can do about inbound connections except kill them as soon as we can. On the other hand, having too many connections usually comes from dialing.

@Stebalien
Member

Stebalien commented Dec 18, 2018

Note: there's actually another issue here. I'm not sure if limiting the max number of open connections will really fix this problem. I haven't tested this but I'm guessing that many routers have problems with connection velocity (the rate at which we (try to) establish connections) not simply having a bunch of connections. That's because routers often need to remember connections even after they've closed (for a period of time).

@vyzo's work on NAT detection and autorelay should help quite a bit, unless I'm mistaken.

@kakra

kakra commented Dec 18, 2018

A work-around could be to limit the number of opening connections (in contrast to opened connections) - thus reducing the number of connection attempts running at the same time. I think this could be much more important than limiting the number of total connections.

If such a change propagated through the network, it should also reduce the amount of overwhelming incoming connection attempts - especially those with slow handshaking because the sending side is not that busy with opening many connections at the same time.

@Stebalien
Member

Stebalien commented Dec 18, 2018

We actually do that (mostly to avoid running out of file descriptors). We limit ourselves to opening at most 160 TCP connections at the same time.
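
The effect of such a cap can be illustrated with a plain shell sketch: `xargs -P` keeps at most N workers in flight, so the router's conntrack table only ever sees N simultaneous attempts (the `echo` stands in for a real dial; the peer list and the `dial_all` name are made up):

```shell
# Run at most N concurrent "dial attempts"; further attempts wait for a
# free slot, which bounds the burst of new NAT-table entries at any moment.
dial_all() {
    n="$1"; shift
    printf '%s\n' "$@" | xargs -P "$n" -I{} sh -c 'echo "dialing {}"'
}
dial_all 2 peerA:4001 peerB:4001 peerC:4001
```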

@kakra

kakra commented Dec 18, 2018

@Stebalien Curious, since when? Because I noticed a while ago that running IPFS no longer chokes DNS resolution of my router...

@Winterhuman
Contributor

Winterhuman commented Jan 3, 2022

@kwinz Could you give us numbers on the RAM usage as well? It's the only other detail your post seems to be missing that I think could be a cause.

@kakra

kakra commented Jan 3, 2022

I have a fairly beefy 4 core x64 NAT / router with OpenWRT

It won't help if your ISP modem still works in router mode. If you put your ISP modem into bridge mode and handle the routing exclusively in the OpenWRT router, it should work just fine.

Smart queues put a lot of pressure on the router CPU; it may work better without them. I'm running on a UDM router with a 1000D/50U connection (4-core ARM CPU), and enabling smart queues absolutely kills its performance (only 600-700 mbps downstream, and IPFS can overwhelm it), but without smart queues it runs just fine. My connection has enough bandwidth that smart queues don't actually matter; you probably only need them if you constantly saturate your uplink bandwidth. Downstream cannot be properly controlled with smart queues anyway.

If you still want to use smart queues, measure your bandwidth up and down without IPFS running, then set the smart queue bandwidth to 80% of what's available so it leaves enough headroom for packet bursts (which IPFS creates a lot of). At all costs you want to prevent packets from queuing up in the modem uplink, because that's what increases packet latency, and your network has no control over priorities in that uplink queue. A high uplink latency will decrease your usable downlink bandwidth significantly due to how TCP works (ACK packets and receive window sizes).

@chevdor

chevdor commented Jan 4, 2022

I can back up the statements from @kakra and thank him for pointing out the smart queues option.

I also switched my router (various Fritzboxes, lately a 6660) into a simple bridge and upgraded to a UDM Pro. Smart Queues do reduce my bandwidth significantly, by some 45%. Since I removed the Fritzbox router from the equation, I see the (real) router does work hard, but at least it no longer kills my connection (i.e. the router keeps doing its job and does not collapse like the Fritzboxes do).

@hsanjuan
Contributor

hsanjuan commented Jan 4, 2022

If you put your ISP modem into bridge mode

My ISP modem (Vodafone) is in bridge mode and it implodes (restarts itself and is unable to restore internet) whenever IPFS becomes a little active in terms of connections.

@kakra

kakra commented Jan 5, 2022

My ISP modem (Vodafone) is in bridge mode and it implodes

Vodafone here, too, with 1000D/50U, so it's the more modern modem. Swarm configuration:

  "Swarm": {
    "AddrFilters": [],
    "ConnMgr": {
      "GracePeriod": "60s",
      "HighWater": 200,
      "LowWater": 150,
      "Type": "basic"
    }
  }

@kwinz

kwinz commented Jan 9, 2022

@LynHyper
The RAM usage is a fraction of the 4GB that the router has. OpenWRT is pretty conservative on the RAM, since it's made for embedded. Maybe I need to configure it so it uses more of its RAM?

@kakra
No, the modem is already in bridge mode, no NAT/routing is going on in the modem.

I guess I have two problems at hand here: 1. why does the router even allow itself to be DDOSed? That's probably more of a question for an OpenWRT forum and not for here.

And 2. Why is IPFS DDOSing my connection? And if my pretty beefy router with an up-to-date OS lets itself be DDOSed, then probably most home users will also be affected and will have a bad experience with IPFS. IPFS has to work without depending on users making changes to their router config. Most people wouldn't even know how to do that. So can we have IPFS use more sensible settings and/or defaults? That's the question for this thread. So far I still don't know how to set up IPFS so this doesn't happen any more. Who can please help me there?

@kakra

kakra commented Jan 9, 2022

Why is IPFS DDOSing my connection?

Well, one more idea: IPFS probably tries to reach other LAN addresses via your default gateway (aka OpenWRT), and that in turn by default routes everything not destined for a directly attached local network out to WAN. Of course, it will NAT those addresses.

Now comes the problem: those private network destinations routed to your WAN will always time out; they never get replies. That means the NAT table fills up with useless entries that only expire after 10 minutes or maybe even more, because the reply that could discard an entry early never arrives.

Countermeasures:

  1. Blackhole all private network prefixes in the OpenWRT routing table: ip route add blackhole 192.168.0.0/16; ip route add blackhole 10.0.0.0/8; ip route add blackhole 172.16.0.0/12. This ensures a private destination will never go to your WAN interface where it would be NAT'ed (locally attached LANs will still be routed because of longer prefixes). That's generally a good idea, not only for IPFS, other software may try to reach random LAN IPs and bodge your router NAT tables. ISP consumer routers often have absolutely no idea what they are routing and push just everything out the WAN interface. I've tested this for multiple routers with tcpdump, and they all routed LAN destinations to the WAN interface which makes no sense in the environment they target. It may be beneficial to actively reject those destinations via firewall rules instead so it would generate an immediate "rejected" response to your software and it can continue with the next host.
  2. You may also want to blackhole 100.64.0.0/10 because that is ISP CGNAT and usually does not allow port forwardings. But YMMV. It would also block you from connecting with their outgoing connections. It may be better to reject outgoing connection initialization via a firewall rule.
  3. For your Vodafone Cable modem, you might actually exclude one route from blackholing to still reach its web UI on the secret "WAN routable" IP: ip route add 192.168.100.1/32 dev YOURWAN. It needs to be NAT'ed to actually work (so don't exclude private networks from NAT, that's why I suggest blackhole routes instead).
  4. Reduce the NAT table entry timeout in your router, not sure how to do that on OpenWRT. This ensures that table entries without activity are discarded early.

This won't affect the bridge mode of the modem, tho. Not sure what the problem is here.

I'm using Unifi UDM with firewall rules to reject LAN destinations to the WAN interface, with the exception of 192.168.100.1 for the Vodafone cable modem (didn't find a way to deploy blackhole routes).
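
The countermeasures above can be collected into a single script. A sketch for an OpenWRT/Linux router; it only prints the commands unless you set `APPLY=1` and run it as root, and the device name `eth0` and the conntrack timeout value are illustrative assumptions:

```shell
# Blackhole private prefixes so stray dials never leave via WAN, keep the
# cable modem web UI reachable, and expire idle tracked connections sooner.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

run ip route add blackhole 192.168.0.0/16
run ip route add blackhole 10.0.0.0/8
run ip route add blackhole 172.16.0.0/12
run ip route add blackhole 100.64.0.0/10        # optional: ISP CGNAT range
run ip route add 192.168.100.1/32 dev eth0      # exception: modem web UI (adjust dev)

# Example value (1 hour) for countermeasure 4; the kernel default is 5 days.
run sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
```

The /32 exception wins over the /16 blackhole via longest-prefix matching, regardless of the order the routes are added in.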

@kwinz

kwinz commented Jan 9, 2022

@kakra Thanks, I will try that.

I would suggest changing IPFS so that by default it doesn't even try to reach those non-globally-routed private LAN addresses, and that I have to manually whitelist the subnets I'm using in my local LAN if they are not link-connected. Or better yet: allow them, but don't accept unroutable private IP space destinations from peers off the internet; that doesn't make any sense. https://en.wikipedia.org/wiki/Private_network#Private_IPv4_addresses Or have it recognize that those always fail and back off from trying them. I think that would greatly help adoption. We can't expect all users to know how to change the routing table of their router. I don't have this problem with any other file sharing or QUIC software.

@kakra

kakra commented Jan 9, 2022

We can't expect all users to know how to change the routing table of their router

I think that is something ISPs should actually fix in their routers, shipping with sensible defaults; maybe add an "I know what I'm doing" button if you want to route private networks via the WAN interface (there exist setups where it makes sense, but in the common situation it does not).

@kwinz

kwinz commented Jan 10, 2022

ISPs should actually fix in their routers and ship with sensible defaults

And IPv4 has a beautiful protocol field, with dozens of protocols defined that it is supposed to be able to carry. But in practice only ICMP, TCP, and UDP get reliably forwarded by routers. So what do new protocols like SCTP, QUIC, etc. do? Do they wait for the ISPs of the world to replace all their routers? No, they encapsulate their new protocols in UDP. You can't stay idealistic waiting for the world to change. Let's stay pragmatic and ship defaults that actually work for our users.

And I would argue accepting unroutable private IP space destinations from peers off the internet doesn't even make sense. Maybe in some "carrier grade NAT" deployments that I haven't thought about.

@markg85
Contributor

markg85 commented Jan 10, 2022

And I would argue accepting unroutable private IP space destinations from peers off the internet doesn't even make sense. Maybe in some "carrier grade NAT" deployments that I haven't thought about.

While I very much agree with all you said, this part in particular might not be as clear-cut as it looks at first glance.

Now I don't know how IPFS is doing this, but I can make an educated guess.
What about IPFS nodes that live within your local network? Mind you, this isn't just the local home network scenario. What about the local school/university/company networks that might also have IPFS nodes? As ideally they would! In those cases you do want local/private IPs to be sent back and forth between nodes.

Somehow this case would ideally be supported in IPFS as it's now but without the private IP's getting broadcast outside the private network. That sounds like a complicated technical issue to solve.

Still, this is just a minor issue (if any at all) in the grander scheme of IPFS killing ISP provided modems.

@kwinz

kwinz commented Jan 11, 2022

What about IPFS nodes that live within your local network? Mind you, this isn't just the local home network scenario. What about the local school/university/company networks that might also have IPFS nodes? As ideally they would! In those cases you do want local/private IPs to be sent back and forth between nodes.

Somehow this case would ideally be supported in IPFS as it's now but without the private IP's getting broadcast outside the private network. That sounds like a complicated technical issue to solve.

Is it a complicated technical issue though? What if you say: don't accept private IP space peers if they are sent from a peer that has a public IP address, as opposed to from a peer within my local school/university/company LAN with a private sender address? Seems very doable, at least in IPv4.
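
That rule is indeed simple to state. A shell sketch for IPv4 (the helpers `is_private` and `should_dial` are made-up names; a real implementation would live in libp2p's address handling and would also need to cover multiaddrs and IPv6):

```shell
# Accept a private advertised address only when it was learned from a peer
# that is itself on a private (same-site) address; RFC 1918 ranges only.
is_private() {
    case "$1" in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}
should_dial() {  # $1 = advertised address, $2 = address it was learned from
    if is_private "$1" && ! is_private "$2"; then
        return 1  # private address announced from the public internet: skip it
    fi
    return 0
}
```

E.g. `should_dial 192.168.1.5 8.8.8.8` fails (skip), while `should_dial 192.168.1.5 192.168.1.9` succeeds.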

@kakra

kakra commented Jan 12, 2022

Somehow this case would ideally be supported in IPFS as it's now but without the private IP's getting broadcast outside the private network. That sounds like a complicated technical issue to solve.

That's actually the problem here: imagine a network routing private traffic over VPN via your edge router. Those nodes won't find each other via broadcast. But they will find each other via announcements that actually come from an internet IP source in the first place. This is one of the valid use cases where you want to accept routing from and to private addresses via your WAN interface (e.g. when it has IPsec). And this is why IPFS announces private addresses to internet hosts by default, so this can actually work out of the box. You cannot simply deduce from the source IP of a packet how you should treat and filter it: it is completely valid to accept private IPs from such packets.

https://docs.ipfs.io/how-to/configure-node/#swarm-addrfilters

An array of addresses (multiaddr netmasks) to not dial. By default, IPFS nodes advertise all addresses, even internal ones. This makes it easier for nodes on the same network to reach each other. Unfortunately, this means that an IPFS node will try to connect to one or more private IP addresses whenever dialing another node, even if this other node is on a different network. This may trigger netscan alerts on some hosting providers or cause strain in some setups.

Only the server profile automatically adds private networks to this configuration, because it is then assumed that the node doesn't need to connect to private IPs anyway.

https://docs.ipfs.io/how-to/configure-node/#basic-connection-manager

The basic connection manager uses a "high water", a "low water", and internal scoring to periodically close connections to free up resources.

I'm actually running with lower settings here. IMHO, the default settings are too high for a non-server node:

    "ConnMgr": {
      "GracePeriod": "60s",
      "HighWater": 200,
      "LowWater": 150,
      "Type": "basic"
    },

With slower connections, you may want to trim that down even further. Maybe try these and see if that helps with your routers.

@markg85
Contributor

markg85 commented Mar 28, 2022

There are some updated findings on my end for this issue.

First off, many thanks to "vans163" and "Jorropo.eth" (don't know your GitHub handles off the top of my head).

I made a network capture in my local environment where it killed my modem reliably. Not in terms of time (that swings wildly between a couple of minutes and half an hour), but within that range it had surely crashed at least once.

Both "vans163" and "Jorropo.eth" looked at the capture and both had ideas to test.
One weird thing that was found was a lot of "TCP port number reuse" messages in the output.
Another weird finding is that, despite my lowWater/highWater being set stupidly low, I was still seeing ~80,000 open network connections in the span of that network capture (~30 minutes). That's a stupidly insane amount that IPFS folks knowledgeable in this domain should probably have a look at.

These findings gave 2 potential tests to try:

  1. Don't use TCP, use UDP or QUIC (it works over UDP)
  2. Disable port reuse with a command like: LIBP2P_TCP_REUSEPORT=false ./ipfs daemon

As for point 1. To exclusively use QUIC your settings need to look like this:

...
    "Swarm": [
      "/ip4/0.0.0.0/udp/4001/quic"
    ]
...
      "Network": {
        "QUIC": true,
        "TCP": false,
        "Websocket": false,
        "Relay": false
      },
...

There are some caveats to this approach. Namely, QUIC uses quite a bit more CPU, which you might notice on low-end hardware. Another quite big issue is that in this setup you can essentially only connect to nodes that have QUIC enabled. So if your data is on a node that doesn't have QUIC enabled, you won't be able to access that data on that node. And as QUIC is still experimental and disabled by default, this might not be the best setting to use.

But.. it does work! My modem didn't kill itself with this setting.

Next is the second test.
By default IPFS (libp2p specifically) reuses ports. A network capture already hints that something might not be right there, as it complains a lot about this. You can disable port reuse by defining:

LIBP2P_TCP_REUSEPORT=false

You can call IPFS with that line too like so:

LIBP2P_TCP_REUSEPORT=false ipfs daemon

Or whatever your startup command is. If you're using a systemd service file, you need to modify it to include this. I am using a service file (Arch Linux) and mine now looks like this:

[Unit]
Description=InterPlanetary File System (IPFS) daemon

[Service]
Environment="LIBP2P_TCP_REUSEPORT=false"
User=%i
ExecStart=/usr/bin/ipfs daemon
Restart=on-failure

[Install]
WantedBy=default.target

Running IPFS with this define does in fact make it work too! My modem stays alive.

So now we have 2 real solutions to keep the modem alive!
Personally I'd say that, granted, with very limited knowledge of this domain, LIBP2P_TCP_REUSEPORT=false should probably be the default. But I leave it up to the capable hands of those who know their libp2p stuff to consider this.

Lastly: given the ludicrous number of connections I have (80k in ~30 minutes in the pcap file), I'd be willing to bet that setting lowWater/highWater to a lower number very likely has no real effect on the modem-kill behavior.

@Kubuxu
Member

Kubuxu commented Mar 28, 2022

@markg85 Thanks for the summary!

80k in ~30 minutes

These are most likely not true existing connections but connection attempts. This seems to correlate with what I was suspecting since the creation of this issue: some routers have trouble dealing with connection tracking tables or use incorrectly sized tables for their available RAM.

Connection tracking tables (in Linux, but also in general) have a maximum size, and each entry has a lifetime.
When the router runs out of space for entries in the connection tracking table, it should start removing connections from it, starting with the ones unused for the longest time.
Here is where I have three (possibly parallel) theories:

  • some routers have connection tracking tables set to a size larger than their available RAM, resulting in OOM when some number of connections is established or attempted (every dial-out adds an entry to conntracker but it should have a short lifetime)
  • some routers don't have facilities to remove old entries out of the conntracker when running out of room
  • some routers have bugs around the removal of entries from the conntracker in the context of REUSEPORT

@markg85
Contributor

markg85 commented Mar 28, 2022

  • some routers have connection tracking tables set to a size larger than their available RAM, resulting in OOM when some number of connections is established or attempted (every dial-out adds an entry to conntracker but it should have a short lifetime)
  • some routers don't have facilities to remove old entries out of the conntracker when running out of room
  • some routers have bugs around the removal of entries from the conntracker in the context of REUSEPORT

You might very well be right. I just don't know; I'm not a router dev and definitely have no aspirations to ever become one.

I get your 3 possible theories, but they all treat the router as the guilty party in this issue. IPFS might indeed be hitting (still hypothetical) router bugs, but those should be worked around in IPFS. I wholeheartedly agree that the router should just work, but depending on router firmware updates is a much longer path than working around the issue on the IPFS side. The sheer number of different routers out there should be motivation enough not to rely on vendors fixing it.

Look at it from a user's perspective:
for them, both simply need to work.
Currently some users' experience is "running ipfs kills my modem, so I just won't run ipfs anymore"... That's detrimental to public opinion of IPFS if it happens too often.

Moral of this reply? Definitely keep looking at the router to see how this can be detected, but aim for a solution in the IPFS stack instead.

@Kubuxu
Copy link
Member

Kubuxu commented Mar 28, 2022

@markg85 I fully agree. We need to work around it within IPFS/libp2p, I wrote my reply to provide context and aid possible workarounds.

@lidel
Copy link
Member

lidel commented Mar 28, 2022

Do we know which specific cheap router model(s) that are impacted?
I've only seen people listing models which do not have the problem 🙃
When someone reports a router issue, we should always ask for the model, and buy the same one for testing.

@markg85
Copy link
Contributor

markg85 commented Mar 28, 2022

I can provide what I use here, but it still doesn't give that much information.
It's a "Connect Box" from Ziggo: https://www.ziggo.nl/klantenservice/apparaten/wifi-modems/connect-box

DOCSIS 3.0
Hardware version : 5.01
Software version : CH7465LG-NCIP-6.15.31p1-NOSH

The model is a Compal CH7465LG-ZG

There's not much more...

@lkdmid
Copy link

lkdmid commented Jun 3, 2022

Dropping in to report this issue with "Sky Q Hub" routers. Sky is a leading ISP in the UK, so I imagine this affects many potential IPFS users.

@MeganerdNL
Copy link

MeganerdNL commented Jul 11, 2022

I can provide what i use here, but it still doesn't give that much information. It's a "Connect Box" from Ziggo: https://www.ziggo.nl/klantenservice/apparaten/wifi-modems/connect-box

DOCSIS 3.0
Hardware version : 5.01
Software version : CH7465LG-NCIP-6.15.31p1-NOSH

The model is a Compal CH7465LG-ZG

There's not much more...

Same issue here on this modem in bridge mode.

@rayyan808
Copy link

rayyan808 commented Aug 6, 2022

Issue persists for me using a Ziggo Connectbox. Instead of my having to hack a solution onto my router, this should be addressed by IPFS itself, as it is just bad practice not to account for such issues. The software should be a general solution. Browser uploads fail and running local nodes for pinning also fails, so why would I ever keep using this service?

@Jorropo
Copy link
Contributor

Jorropo commented Aug 7, 2022

@rayyan808 this issue is complex, the workarounds are not very good, and IPFS cannot go fix your router's software for you.

Debugging closed boxes that randomly shut down when they shouldn't, when the hardware only works with an ISP that isn't in your country, is not an easy task.
Even with routers you can actually get, debugging this is awful and fairly time-consuming: you need to start IPFS, wait who knows how long (because maybe it just hasn't crashed yet?), and keep randomly tweaking options until something fixes it (probably; how can you be sure your 10-hour test session was enough?).
We don't even know if all routers are related, maybe they have different bugs that would require different workarounds.

If you have a bit of free time and want to help, could you please try @markg85's workaround (I would guess it works, since you have the same router AFAICT):
#3320 (comment)

In Mark's case it seems his router fails to deal with TCP REUSEPORT (so using QUIC only or TCP without reuse port did the trick).
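A sketch of the QUIC-only variant of that workaround, assuming a Kubo node with the Swarm.Transports config section (available since go-ipfs 0.7): disabling the TCP transport forces the node to dial out over QUIC (UDP) only, sidestepping any TCP REUSEPORT handling in the router.

```shell
# Workaround sketch: run the node on QUIC only, no TCP transport.
ipfs config --json Swarm.Transports.Network.TCP false
ipfs config --json Swarm.Transports.Network.QUIC true

# Restart the daemon afterwards for the change to take effect.
```

This trades away reachability of TCP-only peers, so it is a diagnostic/mitigation step rather than a recommended permanent setup.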

@vodnic
Copy link

vodnic commented Aug 16, 2022

I have the same issue: the internet connection is killed after a couple of minutes of the IPFS daemon running, regardless of whether I use WiFi or wired.
I just bought this router, a TP-Link Archer C20, and I think TP-Link is quite popular. I really don't think the dev team should blame the problem on the routers when this limitation is apparently so common and accepted within the industry.

If that is the team's official stance, and if a fix is impossible (like limiting simultaneous connections or peers), then the hardware requirements should be formalized within the docs. I could have bought something different.
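For completeness, limiting simultaneous connections is already configurable in Kubo via the connection manager; a sketch, with illustrative values only (note that earlier in this thread markg85 reported lowering these watermarks did not help on his router):

```shell
# Sketch: tighten Kubo's connection manager so excess peers are trimmed
# sooner. Values below are illustrative, not recommendations.
ipfs config --json Swarm.ConnMgr.LowWater 50
ipfs config --json Swarm.ConnMgr.HighWater 100
ipfs config Swarm.ConnMgr.GracePeriod 20s
```

These watermarks bound established libp2p connections, not raw dial attempts, which may be why they don't help with conntrack exhaustion.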
