FS#762 - TP-Link WR1043ND v4: Network on switch fails at random #6202

Closed
openwrt-bot opened this issue May 5, 2017 · 61 comments

@openwrt-bot openwrt-bot commented May 5, 2017

r43k3n:

Supply the following if possible:

  • TP-Link WR1043ND v4
  • LEDE v17.01.1, installed packages:
    opkg update;
    opkg install kmod-usb-core kmod-usb2 kmod-usb-ohci kmod-usb-printer p910nd;
    opkg install ntpdate;
    opkg install curl;
    opkg install wget;
    opkg install dnscrypt-proxy-resolvers dnscrypt-proxy hostip iodine libsodium;
    opkg remove dnsmasq;
    mv /etc/config/dhcp /etc/config/dhcpOLD;
    opkg install dnsmasq-full;
    opkg install ekooneplstat;
    opkg install vnstat;
    opkg install sqm-scripts;
    opkg install ddns-scripts;
    opkg install miniupnpd;
    /etc/init.d/miniupnpd stop;
    /etc/init.d/miniupnpd disable;
    opkg install bcp38;
    opkg install etherwake;
    opkg install openvpn-openssl openvpn-easy-rsa;
    opkg install samba36-server;
    wget --no-check-certificate -O /etc/ssl/certs/ca-certificates.crt https://curl.haxx.se/ca/cacert.pem;
    reboot;
  • No idea how to reproduce. Happens at random.

Network stops working, both LAN and WAN. The command ifup wan doesn't help, but service network restart does. It's as if there is no network on the switch, yet WiFi works fine. Sometimes network drives still work between some computers, which is why I assume basic switching is working. DHCP addresses etc. are not assigned. Windows shows "Unidentified network". I can't ping other devices, including the router, from computers connected via Ethernet, and I can't ping computers from the router.

It also happened on v17.01.0.
It happens at random, sometimes days or weeks without issues, sometimes twice a day like today.

@openwrt-bot openwrt-bot commented May 30, 2017

Sven:

I have the same behaviour on TL-WR841N v9. The switch stops working and LAN/WAN doesn't transmit any packets any more. WiFi still works fine.

The problem occurs at random times; sometimes it works fine for weeks, then only for a few days. I already changed the power supply, but that didn't fix the problem.

In my case, I didn't need to install any additional packages.

@openwrt-bot openwrt-bot commented May 30, 2017

r43k3n:

I've created a forum thread for this issue where other people have also reported the same problem. I would appreciate it if you'd reply there too. Also please add your vote for this issue by clicking +1 above.

https://forum.lede-project.org/t/lan-stops-working-every-now-and-then/2648/47

@openwrt-bot openwrt-bot commented Jun 23, 2017

Dm1:

Same bug for WR1043NDv2. Switch fails every 10 minutes.

@openwrt-bot openwrt-bot commented Jul 26, 2017

alBendin:

Same bug for WR1043NDv4 on LEDE Reboot 17.01.2 r3435-65eec8bd5f

@openwrt-bot openwrt-bot commented Aug 10, 2017

lynxis:

How did the switch stop working?
Do you mean you can no longer reach the wr1043?
Can you reach other devices through the switch?

If you have a serial console, could you please:

  • Attach a dmesg
  • On the router: ping a computer
  • On the computer: ping the router
  • Run swconfig dev switch0 show and ifconfig once, and again
    60 sec later, so we can see any difference in the statistics
    (the outage must last at least 60 sec for this).
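A minimal sketch of how the two snapshots could be taken and compared on the router. The function names and file paths are made up for illustration, and switch0 is assumed to be the switch name as in the request above. It uses grep rather than diff on the assumption that diff may not be installed on a small image.

```shell
#!/bin/sh
# Write a timestamped snapshot of switch and interface counters to a file.
# Assumes the switch device is called switch0.
snapshot() {
    out="$1"
    {
        date
        swconfig dev switch0 show
        ifconfig
    } > "$out" 2>&1
}

# Print the lines of the second snapshot that do not appear verbatim
# in the first one, i.e. the counters that moved in between.
changed_lines() {
    grep -Fxv -f "$1" "$2" || true
}

# Usage during an outage (not run automatically):
#   snapshot /tmp/snap1.txt; sleep 60; snapshot /tmp/snap2.txt
#   changed_lines /tmp/snap1.txt /tmp/snap2.txt
```

Counters that stay identical across both snapshots while traffic should be flowing are the interesting ones here.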

@openwrt-bot openwrt-bot commented Aug 10, 2017

r43k3n:

How should I know? Both WAN and LAN are dead, and I've seen more people complain about this issue.

By "cannot reach" I mean I can't log into the router over SSH or ping the router while connected through an Ethernet cable. I also cannot reach other devices connected through the switch. They don't get an IP or any other info from the router. Transfers already in progress (like copying files over the Windows network) keep working even after the failure, so the basic switch capabilities are working.

sysconfig and dmesg are clean; I posted them before somewhere on the forum. No one (including me) found anything there, and I mean anything. Not a single mention of this state.

I can't ping the router while connected using Ethernet at all.

@openwrt-bot openwrt-bot commented Aug 10, 2017

Dm1:

-//How did the switch stop working?
Do you mean you can no longer reach the wr1043?
Can you reach other devices through the switch?//

You can do nothing through the switch in this case, like if it was powered down completely and WR1043 is inaccessible through Ethernet ports at all. BUT! You can still access it through wi-fi without any problem. No internet though, because WAN is down anyway.

-//If you have a serial could you do please:
Attach a dmesg?//

As I said above, we can go an easy way and get dmesg through wi-fi. I'll try that as soon as possible.

-//on the router: ping a computer
on the computer: ping the router//

Timeout for both.

-//A swconfig dev switch0 show and ifconfig once and
60 sec later so we might see any difference in statistics
(the outage must hold then at least 60 sec).//

I'll try this.

@openwrt-bot openwrt-bot commented Aug 10, 2017

lynxis:

@dm1 it would be nice if you can do some debug things if you have the time.

swconfig dev switch0 show + ifconfig
wait 60 sec.
swconfig dev switch0 show + ifconfig
wait 10 sec
swconfig dev switch0 show + ifconfig
(on router) start ping <some ip on the lan>
swconfig dev switch0 show + ifconfig

ping is still running

wait 60 sec
swconfig dev switch0 show
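The sequence above could be scripted roughly as follows, so that it can be started over WiFi as soon as the outage begins. This is only a sketch: capture and debug_sequence are hypothetical names, the log path is up to the caller, and the last capture also records ifconfig although only the swconfig output was asked for at that step.

```shell
#!/bin/sh
# Append a labelled dump of switch and interface state to a log file.
capture() {
    label="$1"; log="$2"
    {
        echo "=== $label ==="
        swconfig dev switch0 show
        ifconfig
    } >> "$log" 2>&1
}

# Walk through the requested steps. $1 is some IP on the LAN, $2 the log file.
debug_sequence() {
    target="$1"; log="$2"
    capture "start" "$log"
    sleep 60
    capture "after 60 sec" "$log"
    sleep 10
    capture "after 70 sec" "$log"
    # Start the ping in the background and keep it running
    ping "$target" > /dev/null 2>&1 &
    ping_pid=$!
    capture "ping started" "$log"
    sleep 60
    capture "ping running for 60 sec" "$log"
    kill "$ping_pid" 2>/dev/null || true
}

# Usage (not run automatically):
#   debug_sequence 192.168.1.100 /tmp/switch-debug.log
```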

@openwrt-bot openwrt-bot commented Aug 15, 2017

Sven:

I use the TL-WR841N as an access point, so it neither offers DHCP nor uses iptables (I disabled both). When my Icinga informs me that the access point is down, I cannot ping or ssh it from the LAN side. But I still can connect to it from the WLAN side, setting a static IP address on my notebook.

"brctl show" looks normal, "dmesg" doesn't show any recent log entries. But any traffic that should go through the LAN ports simply does not pass. The "link" LED on the switch the TL-WR841N is connected to is still on.

Unfortunately (or fortunately) the problem did not occur for the last 5 weeks, so I cannot provide "swconfig" output as requested. But I will provide it as soon as the problem occurs again.

@openwrt-bot openwrt-bot commented Aug 15, 2017

lynxis:

@sven: what is your hardware version of wr841n?

@openwrt-bot openwrt-bot commented Aug 15, 2017

Sven:

@alexander Couzens: First of all, thanks for taking care of this problem. I was using OpenWrt for some years before switching to LEDE and it's great to see that problems are taken care of now, which (as it seems to me) wasn't the case in the old OpenWrt days.

My TL-WR841N is a V9.

Thanks!
Sven

@openwrt-bot openwrt-bot commented Nov 3, 2017

MPW:

I've seen this bug on multiple TP-Link TL-WR1043ND V4.

Here's a link to a swconfig-dump: https://forum.freifunk.net/t/tp-link-tl-wr1043nd-spezialconfig/14577/25?u=mpw

It never occurred on a WR841N. At least not with openwrt. Maybe this is a problem with newer kernel versions?

@openwrt-bot openwrt-bot commented Nov 11, 2017

kopo:

Same issue with my brand-new router. It is somehow completely random, but I always get it while playing Paladins: after successfully finding a match and selecting champions, there is a loading screen right after the champion select, before the match starts (I think that is when the game connects to the servers; each player has an indicator when ready), and at THAT point the problem happens quite often. I have never experienced the issue in any other situation; it is just that loading screen freezing a few times a day.

I can no longer access LuCI or the modem GUI, and I don't get an IP from the LEDE DHCP server.
(Some say that after about 10-20 minutes the router reboots itself.)
I never waited that long, because I could reach LuCI from my phone via WiFi for a reboot,
which solves the problem for the time being.

It is always a lottery whether TP-Link lets me play
or
I get an instant loss for disconnecting and get penalized :DDD

There are always 3 wired devices and a lot of WiFi devices connected: about 6 phones + 2-3 laptops + 1 Xbox + 1 TV, so around 10-12 devices connected wirelessly.

I also have QoS active and the Adblock package installed.

ISP: UPC, bandwidth: 120/10 via a modem/router combo called CONNECTBOX set to bridge mode
Model TP-Link TL-WR1043N/ND v4
Firmware Version LEDE Reboot 17.01.4 r3560-79f57e422d / LuCI lede-17.01 branch (git-17.290.79498-d3f0685)
Kernel Version 4.4.92

Please help!
kopo

@openwrt-bot openwrt-bot commented Nov 11, 2017

MPW:

I guess the real question is whether this happens with the original firmware as well or not.

@KOPO, could you flash your device back to stock firmware and test whether those crashes happen there, too?

This way we could determine whether it's a hardware problem or a driver problem in lede.

Regards,
Matthias

@openwrt-bot openwrt-bot commented Nov 11, 2017

kopo:

I have not flashed back to stock FW yet, because I read somewhere that it did not solve the problem for someone. The device is about 2 weeks old; my first step when setting it up was to flash LEDE for the SQM service (which didn't really work, so I ended up using QoS, which is OK).

I don't even know which version the stock firmware was. I will flash the latest one and put it to the test, trying to recreate the conditions for a few days.

@openwrt-bot openwrt-bot commented Nov 11, 2017

r43k3n:

Someone on the forum said he's having this issue on the original firmware as well, which is OpenWRT-based (AA, I guess). However, he was the only person to report this; there was no corroboration.

Interestingly enough, when you contact TP-Link support in Poland they just replace your unit under warranty or approve a refund. It's like they are aware of this issue but just don't admit it. It also happens only on V4.

@openwrt-bot openwrt-bot commented Nov 11, 2017

MPW:

Well, I guess, someone has to test it. Some rumors from the internet aren't helpful. Just sitting tight doesn't fix this here ;)

@openwrt-bot openwrt-bot commented Nov 12, 2017

TVTVTV:

Hello guys, i have just purchased one 1043ND v4 in Romania. ISP is RDS, bandwidth is 300/150 mbit. My unit is also 100% affected by this bug and i can reproduce it in a very reliable fashion. All i have to do is load my ISP's speedtest page (http://www.rcs-rds.ro/internet-digi-net/testeaza-ti-viteza) and run a test. The D/L part of the test is always a pass, but whenever the U/L test comes up, the router's copper ports freeze completely (WAN, LAN). The heartbeat LED continues to "beat", i can see my Samba share no issues, wireless works (but no internet - i can access LuCi though from any device connected to the Wi-Fi network) but WAN loses connection and my desktop PC (only device connected to the router via wire) "sees" the connection as "Identifying...". The status LEDs for both WAN and LAN continue to work as intended, as if i remove the cable from either LAN 1 or WAN, the corresponding LEDs turn off. Inserting the cable again presents a green LED.

I have had this issue happen in two other contexts: downloading a large file over Torrent at ~30 MB/s and switching QoS on in LuCi while a 15 MB/s download was running.

Attempts to fix:

  1. Enabling QOS and limiting the upload to 140 mbit: this works perfectly, i can no longer "crash" the router via the speedtest page. The router still crashes under HEAVY download traffic (~300 mbit, wired). Fix fail;

  2. Reverting to stock firmware (via TFTP). This fixes the problem 100% - the router no longer hangs in either one of the two above scenarios at all. "Fix" works.

My conclusion is, then, that this is an issue with Lede. Please let me know what you guys need me to do (capture logs etc.) and i will gladly help.

3rd-party FWs tested: SuperWRT r5275 (https://superwrt.download/), Lede 17.01.4.

I have attached two screenshots, one of a failed test that locks up the router and a successful one using the stock FW.

Have a nice evening! :)

@openwrt-bot openwrt-bot commented Nov 12, 2017

Dm1:

It's not a hardware bug, because it can't be reproduced on OpenWRT, it's LEDE related for sure.
TP-Link 1043ND v4 running OpenWrt Chaos Calmer 15.05.1 for about a year without any issue. If I flash LEDE there - it fails in about an hour.

@openwrt-bot openwrt-bot commented Nov 12, 2017

MPW:

@mihnea: Thanks for your report. That leaves hope :)

@openwrt-bot openwrt-bot commented Nov 12, 2017

MPW:

I can reproduce this on Gluon (the software of the German Freifunk open wifi community), which is based on openwrt but has a lot of lede patches in it.

So probably one of these patches causes this issue then, if it's really not reproducible in openwrt.

@openwrt-bot openwrt-bot commented Nov 12, 2017

rotanid:

Alexander asked for some debugging information months ago, and no one cared to provide it.
There's also still no one showing how to reliably reproduce the issue.

Also, Dmitry claims he has had the device running for a year with OpenWrt CC, which is almost impossible, since support for this device was only added in February 2017 according to the commit date.

Please keep in mind that a bug/issue tracker is no forum software. Unless you can provide useful technical information you only make this thread longer to read for everyone trying to help or searching for information.

@openwrt-bot openwrt-bot commented Nov 13, 2017

TVTVTV:

Hello rotanid, i can reliably reproduce the issue with both Lede and SuperWRT (Lede-based 3'rd party FW), as mentioned in my post yesterday (read above). I have also been in contact with Daniel, the creator of SuperWRT, and he can reproduce the issue as well on his 1043ND V4 test unit. Matthias has stated above that he can also reproduce the issue on his unit running a "fork" of Lede, Gluon. So it's safe to assume that the issue is easily reproducible in some environments.

Regarding logs, i am a total newbie when it comes to networking and Linux. I thus have little knowledge on what logs are needed. If someone can pass me a set of commands that i can run via SSH which will produce all the logs that the devs need, i will reinstall Lede today after work and will help. Matthias has also provided a swconfig-dump a few posts up.

Concerning OpenWRT, i haven't tested that FW myself so i cannot say if it suffers from the same issues as Lede. As stated, i am a newbie and seeing that Chaos Calmer was only available for 1043ND up to v2 i thought OpenWRT does not offer support for v4. I have just noticed that the development snapshots support v4. I will try to take the router out of production tonight after work (or ASAP), install the latest snapshot and report back on whether it suffers from the same issue as Lede or not.

Have a nice one! :)

@openwrt-bot openwrt-bot commented Nov 13, 2017

Dm1:

Also, Dmitry claims he has the device running for a year with OpenWrt CC - which is almost impossible, since the support for this device was only added in February 2017 according to the commit date.

It's my own build based on 1043 v2 with this patch:
--- a/target/linux/ar71xx/image/Makefile
+++ b/target/linux/ar71xx/image/Makefile
@@ -2057,6 +2057,9 @@ $(eval $(call SingleProfile,TPLINK,64kraw,TLWR1043V1,tl-wr1043nd-v1,TL-WR1043ND,

$(eval $(call SingleProfile,TPLINK-LZMA,64kraw,TLWR1043V2,tl-wr1043nd-v2,TL-WR1043ND-v2,ttyS0,115200,0x10430002,1,8M))
$(eval $(call SingleProfile,TPLINK-LZMA,64kraw,TLWR1043V3,tl-wr1043nd-v3,TL-WR1043ND-v2,ttyS0,115200,0x10430003,1,8M))
+
+$(eval $(call SingleProfile,TPLINK-LZMA,64kraw,TLWR1045V2,tl-wr1045nd-v2,TL-WR1043ND-v2,ttyS0,115200,0x10450002,1,8M))
+
$(eval $(call SingleProfile,TPLINK-LZMA,64kraw,TLWR2543,tl-wr2543-v1,TL-WR2543N,ttyS0,115200,0x25430001,1,8Mlzma,-v 3.13.99))

$(eval $(call SingleProfile,TPLINK-SAFELOADER,64kraw,CPE510,cpe210-220-510-520,CPE510,ttyS0,115200,$$(cpe510_mtdlayout),CPE510))
@@ -2121,7 +2124,7 @@ $(eval $(call MultiProfile,TLWR743,TLWR743NV1))
$(eval $(call MultiProfile,TLWR841,TLWR841NV15 TLWR841NV3 TLWR841NV5 TLWR841NV7))
$(eval $(call MultiProfile,TLWR842,TLWR842V1))
$(eval $(call MultiProfile,TLWR941,TLWR941NV2 TLWR941NV3 TLWR941NV4))
-$(eval $(call MultiProfile,TLWR1043,TLWR1043V1 TLWR1043V2 TLWR1043V3))
+$(eval $(call MultiProfile,TLWR1043,TLWR1043V1 TLWR1043V2 TLWR1043V3 TLWR1045V2))
$(eval $(call MultiProfile,TLWDR4300,TLWDR3500V1 TLWDR3600V1 TLWDR4300V1 TLWDR4300V1IL TLWDR4310V1 MW4530RV1))
$(eval $(call MultiProfile,TUBE2H,TUBE2H8M TUBE2H16M))
$(eval $(call MultiProfile,UBNT,UBNTAIRROUTER UBNTRS UBNTRSPRO UBNTLSSR71 UBNTBULLETM UBNTROCKETM UBNTROCKETMXW UBNTNANOM UBNTNANOMXW UBNTLOCOXW UBNTUNIFI UBNTUNIFIOUTDOOR UBNTUNIFIOUTDOORPLUS UAPPRO UBNTAIRGW))

WR1045v2 is a local version of WR1043v4 with no differences other than name.

@openwrt-bot openwrt-bot commented Nov 13, 2017

TVTVTV:

Hello guys, as promised i have just tested both OpenWrt latest snapshot and, as a bonus, LibreCMC, with the exact same results as LEDE - copper ports crash on high-speed upload during speedtest. /etc/init.d/network restart fixes things, else the ports do not come back up even after several minutes. I have grabbed all logs i knew how to take - see files attached.

@Dmitry - The stock firmware is killing me. I'd give an arm and a leg for an OpenWRT CC image that can run on 1043ND v4. Is there any way i could come into possession of the one that you're running? I have no way of compiling my own so i'd be very grateful. Sorry for the short hijack, i can't see any PM system here.

Let me know how i can assist further.

@openwrt-bot openwrt-bot commented Nov 14, 2017

Dm1:

@mihnea B.

Since my patch changes nothing but the device name, you can just force-flash this openwrt-15.05-ar71xx-generic-tl-wr1043nd-v2-squashfs-factory.bin image (https://downloads.openwrt.org/chaos_calmer/15.05/ar71xx/generic/openwrt-15.05-ar71xx-generic-tl-wr1043nd-v2-squashfs-factory.bin) with "sysupgrade -n -F ..." from any OpenWrt/LEDE you are already using, and tell us the result. For me it's working like a charm.

But if you are completely unfamiliar with recovery techniques in case something goes wrong, you'd better think twice before trying. The WR1043NDv2 image is fully compatible only with WR1043NDv2, WR1043NDv3, WR1043NDv4 and WR1045NDv2, and in this case you are skipping the compatibility check, which will allow you to flash it on any device, even a completely different one that will be bricked afterwards.

@openwrt-bot openwrt-bot commented Nov 14, 2017

TVTVTV:

Thank you Dmitry, i will attempt to force flash the OpenWRT CC image for v2 tonight, if time allows. If it works for me as well then we have a starting point as we draw closer to finding out when this bug was introduced, as we will have two-point confirmation that OpenWRT CC was working alright. Will report back once done.

P.S. - I can flash the 1043ND back to stock via TFTP at any time of day or night, i did it so many times already... :(

@openwrt-bot openwrt-bot commented Nov 14, 2017

TVTVTV:

Hello guys, force flashing the v2-specific OpenWRT CC bricks the router. Recovery is no longer possible via TFTP/LAN port (uboot pulls the recovery image but does not flash it, apparently). The router can only be unbricked via serial. :) So DO NOT try to flash the v2 image on v4 unless you're up for a session of "serial unbricking".

I did all i could, now it's up to the devs.

Have a nice evening!

@openwrt-bot openwrt-bot commented Nov 14, 2017

r43k3n:

I don't know why any of you even thought this would work. The V2 and V4 have different SoCs. They might be similar but they are still different.

@openwrt-bot openwrt-bot commented Nov 15, 2017

rotanid:

Adrian, it's because Dmitry keeps telling that he has a TL-WR1043ND v4 although he has a v2-based TL-WR1045ND v2.

on topic:
i tried to reproduce the issue on a v4 - but i couldn't.
i did an iperf3 with ~900mbit/s via switched network, an iperf3 with ~300mbit/s over routed lan<->wan network and an iperf3 with ~300mbit/s through NAT.
i also tested with qos-scripts or sqm-scripts enabled.
no crash or switch/network failure occurred.
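For anyone who wants to repeat this kind of load test, a rough sketch of the iperf3 client invocations follows. The helper name iperf_cmds and the address are placeholders, and the exact options used in the tests above are not known. It assumes "iperf3 -s" is already running on the machine at the other end. Since the hang was reported during uploads, the sketch prints one command per direction.

```shell
#!/bin/sh
# Print the iperf3 client commands for one test run.
# $1 = server address, $2 = parallel streams (default 8), $3 = seconds (default 60)
iperf_cmds() {
    server="$1"; streams="${2:-8}"; seconds="${3:-60}"
    # Client sends to the server
    echo "iperf3 -c $server -P $streams -t $seconds"
    # -R reverses the direction: the server sends to the client
    echo "iperf3 -c $server -P $streams -t $seconds -R"
}

# Usage (not run automatically):
#   iperf_cmds 192.168.1.10 8 60
```

Placing the server on the WAN side and the client on a LAN port exercises the routed/NAT path; two hosts on LAN ports only exercise the switch.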

@openwrt-bot openwrt-bot commented Nov 15, 2017

rotanid:

Mihnea, i tried with 8 concurrent connections with iperf3 - cpu load of the router was high, but no problems...

@openwrt-bot openwrt-bot commented Nov 15, 2017

TVTVTV:

@dmitry - No problems, i knew very well what the risks were before flashing CC for the v2 (although i was expecting the tftp unbrick method to work like it did before). I have sent the unit in for repairs this morning. The repair will either be free or paid, depending on the judgement of the service center, but i will have the 1043 back in a couple of weeks tops. I don't want to mess with serial/JTAG as i have neither the necessary HW nor the experience.

@rotanid - Again, i am not very versed with network products and Linux, but i work in sales/support for a very large company selling Enterprise HW&SW. I have thus become accustomed to look at a problem from all possible angles. I was just thinking how your test is missing one key element: PPPoE & PPPoEv6. My link to the ISP is done via PPPoE and PPPoEv6 and i gather the tests you've made did not involve PPPoE at all. If your connection is over PPPoE and you have at least a 300/150 line, try using the 1043 as your main router, connect a wired client to switch port 1 and do a speedtest run against a local (in country, i mean) server.

I state again that when i got this router i only had a 100 mbit line and it ran fine for a whole month. It's only when i upgraded to 300/150 that the issue became obvious.

One more thing to consider: i have a pretty solid laptop connected to the router via a wireless connection, 300/300 mbit (40 MHz channel). On this laptop i'd get ~150 mbit up and ~140 down on the speedtest site. The router has never crashed when the test was done via wireless; i must have done over 30 runs.

Hope this helps more than it adds confusion into the equation.

P.S. - I will perform the same test with iperf when i get the unit back from the service.

@openwrt-bot openwrt-bot commented Nov 15, 2017

MPW:

The crashes with my devices were without pppoe. Just l2tp-vpn, typical setup for a guest wifi setup (Freifunk).

@openwrt-bot openwrt-bot commented Nov 16, 2017

Sven:

I can confirm that in my case (TL-WR841N v9, see very first comment) PPPoE is also NOT used. I use the device as a pure access point, bridging my wired and wireless networks. The services "firewall", "odhcpd" and "dnsmasq" are all disabled.

Although the number of concurrent IP connections shouldn't be a problem here because no connection tracking is needed, I face the same problem.

@openwrt-bot openwrt-bot commented Dec 16, 2017

WernerSlabon:

I have a lot of TL-WR1043ND devices running in a large network (all versions up to v5).
On 06.12. all v4 and v5 devices (the first and only one was installed on 11.12.) began to show the described problems.
I'm not aware of anything that changed on that day; the v4 devices had been running since April/May without any problems (they are monitored with PRTG).
Yesterday I added a small ping script (on two devices) which restarts the network if the ping fails. The log shows that the failure occurs 5-6 times a day: on one device 3-4 times within 1 hour, on the other device over a period of several hours.
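The script itself was not posted, so the following is only a guess at what such a watchdog might look like. TARGET, LOG and RESTART_CMD are hypothetical names with assumed defaults; the restart command is the one reported earlier in this thread to bring the ports back.

```shell
#!/bin/sh
# Hypothetical watchdog; all names and defaults here are assumptions.
TARGET="${TARGET:-192.168.1.1}"          # a host that should always answer
LOG="${LOG:-/tmp/net-watchdog.log}"      # where failures are recorded
RESTART_CMD="${RESTART_CMD:-/etc/init.d/network restart}"

check_and_recover() {
    # Three probes, waiting up to 2 seconds; any reply counts as healthy
    if ping -c 3 -W 2 "$TARGET" > /dev/null 2>&1; then
        return 0
    fi
    echo "$(date): no reply from $TARGET, restarting network" >> "$LOG"
    $RESTART_CMD
}

# A cron entry would call check_and_recover at the end of this script.
```

The timestamps in the log make it possible to compare failure times across several devices.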

All devices are located in the "server" VLAN; another VLAN is configured for the switch ports, and then 2-3 WLANs/VLANs are configured for WiFi (internal employees, guests and youth/children).
The devices only have an IP in the server network; all other VLANs have an unmanaged bridge interface or no interface (VLAN on switch only).
They're operating as pure access points: no WAN, no DHCP server, and they are DHCP clients.

They are running OpenWrt and LEDE (the latter because of v5 support, and on the two I'm currently observing)

Is there anything I can collect?
E.g. capture packets in a ring buffer?

@openwrt-bot openwrt-bot commented Dec 16, 2017

Dm1:

Interesting thing, I was using 1045v2 with LEDE to manage VLAN's too and that's when it started to fail. Maybe this bug is related to tagged port usage only? Can other people in the thread confirm this?

@openwrt-bot openwrt-bot commented Dec 16, 2017

WernerSlabon:

But at least all v4 devices worked up to 6 months without any problems - until last week and WITHOUT changing the firmware. And ALL started with the problem within 1-2 days ...

@openwrt-bot openwrt-bot commented Dec 16, 2017

fihufil:

@alexander Couzens
You asked for debug information and here it is: script used for gathering information and script output. If you need any more information I can provide them.

edit: For me the switch works during the outage, when i connect two PCs with static addresses they can ping each other no problem, however
wifi <-> lan ping doesn't work
wifi <-> wan ping doesn't work
wifi <-> router ping works

@openwrt-bot openwrt-bot commented Dec 17, 2017

Sven:

@Dmitry Chigiryov: I have the same problem on TL-WR841Nv9, but I do neither use PPPoE nor VLANs. So at least in my case, the problem is not related to VLAN tagging.

Unfortunately, my TL-WR841Nv9 are both used as access points and do not provide DHCP services. Furthermore, I'm running a cron job on them that switches off WiFi when the default gateway can't be pinged and brings WiFi back up when the default gateway comes back online.
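The cron job was not posted; a sketch of the idea, with made-up function names, might look like this. It reads the default gateway from the routing table and decides whether WiFi should be up or down.

```shell
#!/bin/sh
# Extract the default gateway from "ip route show default" output on stdin.
parse_default_gw() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Print "up" if the given gateway answers a ping, "down" otherwise.
desired_wifi_state() {
    gw="$1"
    if [ -n "$gw" ] && ping -c 3 -W 2 "$gw" > /dev/null 2>&1; then
        echo up
    else
        echo down
    fi
}

# Usage from cron (not run automatically):
#   gw="$(ip route show default | parse_default_gw)"
#   wifi "$(desired_wifi_state "$gw")"
```

A real job would probably remember the previous state and call wifi only when it changes, so that clients are not disconnected on every run.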

I've now disabled the cron job and added a separate SSID with separate network that has DHCP enabled, so the next time the LAN becomes inaccessible because of this bug, I can connect to the newly created SSID, get an IP via DHCP, SSH to the TL-WR841Nv9 and start collecting logs as requested by @Alexander Couzens.

So hold on... :-)

@openwrt-bot openwrt-bot commented Dec 18, 2017

rotanid:

@sven , as you are the only one reporting a problem with WR841Nv9 here, i doubt this is the same issue that all the others have with WR1043NDv4 ...

@openwrt-bot openwrt-bot commented Dec 18, 2017

Sven:

@rotanid, you might be right, but the behaviour is exactly the same and they share the same platform, so I thought this could be the case.

Anyway, I'll provide debug information as soon as the problem occurs again.

@openwrt-bot openwrt-bot commented Dec 18, 2017

rotanid:

@sven, they are using a different switch chip (QCA9533 vs. QCA8337N) and a different CPU (QCA9533 vs. QCA9563) so no, they aren't very similar.

also, the recent question was about VLAN usage and two people so far confirmed they have those issues when using VLANs. your comment about not using VLANs might be misleading, as you aren't using the same device and not the same chips.
the topic title is about WR1043, too - so it would be best to open a separate Bug Report for your device.

@openwrt-bot openwrt-bot commented Dec 19, 2017

Sven:

@rotanid: my bad, seems I was completely wrong. Sorry for the noise.

@openwrt-bot openwrt-bot commented Dec 19, 2017

WernerSlabon:

@fihufil: I collected the debug.log as you requested.

One important thing I observed:
All four APs (3x v4, 1x v5) are configured to ping every 3rd minute a server, collect the debug information (once) and then restart the network.
-> All APs stopped their network/Switch at the SAME moment (at least within the same three minutes)

One idea I had (probably you're on the same track) is:
Can there be a problem with the MAC table ??? The primary Layer-2-Switch has about 350 MAC addresses in its cache (ON 8 VLANS, whereby the VLANs on the APs make about 2/3rd of all)

The other APs (v1, v2, v3) don't make any problems. Today (before the APs locked out) I replaced a v4 by a v1 device with the equivalent configuration (same VLANs, WLANs) - no problems.

@openwrt-bot openwrt-bot commented Dec 24, 2017

lucize:

so I have a 1043nd v4, also using pppoe on RDS. As soon as I run the speed test upload the pppoe session crashes: PADO packets time out and there is no more connectivity. Sadly there is nothing in dmesg

I tried on all switch ports (every port on its own vlan) to dial a pppoe connection, but after the first crash, until the reboot, the connection will not dial; if there is also a dhcp (static) wan defined (multiwan), that one will work

@openwrt-bot openwrt-bot commented Dec 24, 2017

lucize:

can someone try these patches

https://patchwork.ozlabs.org/patch/743498/ (this could be the fix)
https://patchwork.ozlabs.org/patch/845962/
https://patchwork.ozlabs.org/patch/852079/

and ar71xx from https://git.lede-project.org/?p=lede/nbd/staging.git;a=shortlog

with these I could run several speed test sessions without crashing, I'll report about stability on the long run

@openwrt-bot openwrt-bot commented Dec 24, 2017

WernerSlabon:

I want to report some additional information:
I'm running a cron job on my (four) tl-wr1043nd v4/v5 every 3 minutes.
The jobs restart the network (and log it) in case a ping to the server fails.

On 21./22. the APs logged a failure at the same time (within the same 3 min. window), about 6 times in those 2 days.
Since the 22nd nothing has happened, but nobody is working ...
I'll check when the problems come back (most people start work again on 9. Jan).

@lucian: I can try on one AP, if you can provide a firmware file. I don't have a build environment, nor much experience in this area.

@openwrt-bot openwrt-bot commented Dec 25, 2017

lucize:

@werner: please use it on 1043nd v4 only !
https://drive.google.com/open?id=1Ml1E6RLOzlLRmhEn3LSl0I4iTbdq5D0d

Regards

@openwrt-bot openwrt-bot commented Dec 26, 2017

TVTVTV:

Hello guys and happy holidays! :)

@lucian, you're a genius: i've just installed your FW image and it works flawlessly. No more crashes on speedtest. Testing stability now. THANK YOU!

@openwrt-bot openwrt-bot commented Dec 27, 2017

dape:

Hi, tested first patch on the current OpenWrt/LEDE trunk and indeed seems to fix the switch malfunction. Unfortunately several tests show only 200/500 Mbit speed on a gigabit pppoe connection. Added second patch to no difference. I guess i can wait for official patch implementation..

@openwrt-bot openwrt-bot commented Dec 27, 2017

lucize:

for the moment only option for gigabit speed is SFE lede-project/source#1269 the hardware nat driver for AR8337N is not ready !

@openwrt-bot openwrt-bot commented Dec 27, 2017

WernerSlabon:

@lucian:
Today, I upgraded one v4 AP to your firmware build.
Now some days are required to see if this AP behaves better/differently compared to the others ... (with the beginning of the Christmas holidays the number of occurrences decreased to near zero)

@openwrt-bot openwrt-bot commented Dec 28, 2017

NeoRaider:

I've added https://patchwork.ozlabs.org/patch/743498/ to my staging tree and will apply it to master after some further testing.

@openwrt-bot openwrt-bot commented Dec 29, 2017

NeoRaider:

The patch is in master now; I think it should receive more testing before we backport it to lede-17.01.

My tests with the patch on a TL-WR841v9 have shown:

  • Decrease of throughput with extremely small UDP datagrams (16 byte payload, I think that's the smallest iperf supports); doesn't really matter
  • Slight increase of throughput for full-size UDP datagrams
  • iperf reported minor packet reordering when flooding with small UDP packets, which I didn't see without the patch. The reordering should not matter, and it is very infrequent (~ 50 of more than 10^18 packets)

More test reports on 1043v4 or other devices with Gigabit ports would be appreciated.

Edit: I was not able to reproduce the hang bug on my test device with or without the patch.

@openwrt-bot openwrt-bot commented Dec 29, 2017

lucize:

Maybe the hang is happening only on higher bandwidth, I can't say anything about speed because I use it with SFE and the speed is above 600Mb/s.
It's still working, no pppoe disconnect, and I tried lots of tests plus torrent and about 12 wireless clients

@openwrt-bot openwrt-bot commented Jan 3, 2018

TVTVTV:

Hello all and happy new year!

With the fix that Lucian's found, i have an up-time of 5 days - no problems. Speedtest still shows wire speeds, so no "clogging" after a few days. Everything else works as intended, so this fix does not appear to cause any unwanted issues, at least for my usage patterns.

Have an excellent evening! :)

@openwrt-bot openwrt-bot commented Jan 9, 2018

WernerSlabon:

Hello and a happy new year, too ...

I want to give feedback on Lucian's patch / firmware.
There were no problems over the days between Christmas and New Year; my customer started work again yesterday.

Now all v4/v5 access points WITHOUT the patch show the network/switch failures (between 2 and 4 occurrences), while the patched v4 access point doesn't show any!

In other words, I can also confirm that the patch helps and fixes the issue.

Now I would like to ask:

  1. How/where can I see, if/when the fix made its way into the official builds of LEDE?
  2. Will the fix be also included in the v5 Firmware build as the same bug seems to be there too?

@openwrt-bot openwrt-bot commented Jan 9, 2018

lucize:

it's merged in master d40a358
should automatically work on any qca956x and qca953x

@openwrt-bot openwrt-bot commented Jan 9, 2018

NeoRaider:

The fix has been in master for a while, so it is included in all our snapshot images. I've also pushed it to 17.01, so it will be in the next 17.01 maintenance release 17.01.5.
