SSH stops working over Wi-Fi on Belkin RT3200/Linksys E8450 with -rc6 #10405

Closed
richb-hanover opened this issue Aug 7, 2022 · 31 comments
Labels
bug: issue report with a confirmed bug; release/22.03: pull request/issue targeted (also) for OpenWrt 22.03 release; target/mediatek: pull request/issue for mediatek target

Comments

@richb-hanover

richb-hanover commented Aug 7, 2022

As reported in: https://forum.openwrt.org/t/ssh-stops-working-on-belkin-rt3200-linksys-e8450-with-rc6/133911

I installed OpenWrt 22.03.0-rc6 on a Belkin RT3200 router. The Wi-Fi and traffic all seem fine, and I can SSH in. But...

Shortly after a reboot, the SSH sessions freeze. At that point, I cannot establish any new SSH logins. This occurs within a few minutes of a reboot, or perhaps as long as 20 minutes. This is repeatable. (This also happened with -rc4 - I updated to -rc6 before reporting the problem.)

A reboot clears up the SSH problem - I can log in as expected (for a while). In the meantime, even when the SSH process is frozen, the LuCI GUI works as expected, and the router passes traffic normally.

Update: My current test is to ssh into the router and run htop. The Uptime: value shows how long the SSH session runs before freezing, which tends to be about 3-10 minutes.

Update 2: I have some evidence that the problem is being caused by being SSH'd in. The router runs fine without any SSH connections, but if I wait a while (hours) then SSH in, the SSH session will hang shortly thereafter.

Update 3: When SSH sessions over Wi-Fi lock up/fail/cannot be reestablished, I can ssh in via Ethernet.

What other troubleshooting information could I provide? Thanks.
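
Editor's sketch: the report says that once a session freezes, no new SSH logins can be established either. That second symptom can be timed with a small probe run from the affected Wi-Fi client. This is a minimal sketch, not the reporter's method: the function names `ssh_banner` and `watch` are hypothetical, and the probe only opens a TCP connection to port 22 and reads the server's identification string. It does not keep an interactive session busy the way htop does, so it may not trigger the problem by itself.

```python
import socket
import time


def ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification string, or None if the probe fails."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(255)
    except OSError:  # refused, unreachable, or timed out
        return None
    text = data.decode("ascii", errors="replace").strip()
    return text if text.startswith("SSH-") else None


def watch(host, interval=30.0, probe=ssh_banner,
          clock=time.monotonic, sleep=time.sleep):
    """Probe repeatedly; return the seconds elapsed when the first probe fails."""
    start = clock()
    while True:
        if probe(host) is None:
            return clock() - start
        sleep(interval)
```

Calling `watch("192.168.1.1")` (the address is an assumption; substitute the router's LAN address) blocks until a probe fails and then returns roughly how long new connections kept working.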

@aparcar
Member

aparcar commented Aug 10, 2022

@dangowrt ideas?

@richb-hanover
Author

richb-hanover commented Aug 10, 2022

Oh dear. I forgot I made a report here, too. I have lots more info at the OpenWrt forum posting at: https://forum.openwrt.org/t/ssh-over-wifi-stops-working-on-rt3200-e8450-with-22-03-0-rc6/133911/18

There also may be a similar report at: https://forum.openwrt.org/t/dir-2660-admin-ssh-unstable-through-wifi-22-03-rc6/134243

@hnyman
Contributor

hnyman commented Aug 10, 2022

You might edit your report (and title) here to clearly specify that the bug seems to be wifi-only, and wired is ok for you.
(based on forum discussion)

That suggests that wifi connections occasionally break in a way that breaks the TCP stream session for SSH, or something similar.

@richb-hanover richb-hanover changed the title from "SSH stops working on Belkin RT3200/Linksys E8450 with -rc6" to "SSH stops working over Wi-Fi on Belkin RT3200/Linksys E8450 with -rc6" on Aug 10, 2022
@richb-hanover
Author

Good point. I fixed the title and added the Wi-Fi vs. Ethernet note to the original report. Thanks.

@richb-hanover
Author

More data collected on the forum (see the OP): included here for completeness...

Tons of new evidence (no tcpdump yet):

  • I'm still running RC1 on the Belkin RT3200
  • I decided to try a couple more devices running htop, so I fired up my old MacBook Pro ("oMBP") and an old Win10 laptop. My primary machine is a newer MBP. All three were connected via Wi-Fi and running htop successfully.
  • After ~1h 15 minutes, I got tired of waiting, so I stopped htop and exited the SSH sessions on oMBP and the Win10 machine.
  • Within 5 minutes, htop froze on the third computer (MBP). I could not re-establish a new SSH connection. (oMBP and Win10 were still disconnected from SSH at that time.)
  • While the new MBP was in that bad state, I was able to ssh in to the router and run htop on oMBP and Win10. After checking that htop worked, I stopped it and exited the SSH session on those machines.
  • I turned off Wi-Fi on the new MBP, then turned it back on, and was immediately able to reconnect to the router.

My summary of the evidence:

  • Something is interfering with SSH & Wi-Fi. Running htop over Wi-Fi freezes within 5-20 minutes, and that computer cannot re-connect to SSH.
  • If multiple computers were connected via Wi-Fi and running htop, no freeze was observed (I waited 1h 15m, whereas a freeze normally occurs within 20 minutes)
  • (I didn't try it in this round of experiments, but...) SSH over Ethernet seems always to work
  • When one computer is in the "frozen-Wi-Fi" state, another computer can SSH in via Wi-Fi
  • When I turned Wi-Fi off and back on for the affected computer, it could immediately SSH back in.

What's the next experiment? tcpdump? Thanks

@dangowrt
Member

Sounds like a problem with connection tracking or, more likely, flow offloading, which results in a stale DROP entry in either the hardware or software tables. Starting with WED, I would try disabling the offloading features one by one to see if the problem persists.

@richb-hanover
Author

Hi Daniel, Thanks for your thoughts. I am using a plain-vanilla config - just install the RC, and opkg install htop/nano - no other configs save the LAN subnet and the Wi-Fi credentials.

I am not sure how to disable offloading. Could you give me a quick rundown? Thanks again.
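
Editor's sketch: one way to rule out firewall flow offloading, assuming the standard `flow_offloading` and `flow_offloading_hw` options in the `defaults` section of /etc/config/firewall. As far as I know these are unset on a default install, so on a plain-vanilla config this may change nothing. The helper names (`offload_commands`, `run_on_router`) are hypothetical, the commands are sent from a PC over SSH (preferably via Ethernet, given the symptom), and WED is not covered because the thread does not say how it is toggled.

```python
import shlex
import subprocess

# Firewall "defaults" options for software and hardware flow offloading.
OFFLOAD_OPTIONS = ("flow_offloading", "flow_offloading_hw")


def offload_commands(enabled=()):
    """Build shell commands that enable only the listed options and disable the rest."""
    cmds = []
    for option in OFFLOAD_OPTIONS:
        value = "1" if option in enabled else "0"
        setting = f"firewall.@defaults[0].{option}={value}"
        # Quote the argument so the remote shell does not treat [0] as a glob.
        cmds.append("uci set " + shlex.quote(setting))
    cmds.append("uci commit firewall")
    cmds.append("/etc/init.d/firewall restart")
    return cmds


def run_on_router(host, cmds, run=subprocess.run):
    """Run the commands on the router as root over SSH, stopping at the first error."""
    argv = ["ssh", f"root@{host}", " && ".join(cmds)]
    return run(argv, check=True)
```

Printing `offload_commands()` shows the commands without touching the router; passing `enabled=("flow_offloading",)` re-enables software offloading only, which allows the features to be tested one at a time.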

@aparcar
Member

aparcar commented Aug 11, 2022

Is this related? #10422

@richb-hanover
Author

@aparcar

Is this related? #10422

Hmmm... Maybe. Has that fix been merged? Should I wait for the next nightly snapshot to see if it changes the behavior?

@aparcar
Member

aparcar commented Aug 11, 2022

Not merged yet. @hauke do you think it could be relevant?

@csharper2005
Contributor

Confirm this. Various mt7621 devices running 22.03-rc4, rc5.

@richb-hanover
Author

richb-hanover commented Aug 12, 2022

Good news over on the Forum. The problem went away when tweaking /etc/config/wireless. And it failed again when I reverted to default. https://forum.openwrt.org/t/ssh-over-wifi-stops-working-on-rt3200-e8450-with-22-03-0-rc6/133911/55

My next step is to re-flash RC6, then try jow's nftables fix (https://forum.openwrt.org/t/ssh-over-wifi-stops-working-on-rt3200-e8450-with-22-03-0-rc6/133911/38)

@csharper2005
Contributor

My next step is to re-flash RC6, then try jow's nftables fix

I got the issue even with the firewall disabled (dumb AP).

@richb-hanover
Author

richb-hanover commented Aug 13, 2022

Update: The nft commands also seem to have solved the problem - the htop test ran overnight. Here's a summary from the Forum: https://forum.openwrt.org/t/ssh-over-wifi-stops-working-on-rt3200-e8450-with-22-03-0-rc6/133911/62

@hauke
Member

hauke commented Aug 24, 2022

I do not see the problem on my Linksys E8450 running a build that is almost OpenWrt v22.03.0-rc6 (my own build, with additional packages, from a commit before the tag), without any of the mentioned workarounds.
This is my production device, so I prefer not to do debugging there.
The device has an uptime of 24 days, and I can connect to it over SSH using Wi-Fi without problems.
The SSH connection stays up for at least 10 minutes of doing nothing.
I can also access OpenWrt devices behind this E8450.

I am not using special offloading functionality. I have multiple SSIDs for multiple networks.

I have some questions:

  1. Could someone please provide a minimal configuration of a device where they see this problem, and describe how to run into it with OpenWrt 22.03.0-rc6 or later? Preferably without the need for a WAN connection.
  2. Did someone run into this problem on device combinations other than MT7622 + MT7915?
  3. Did someone see this problem in OpenWrt master?
  4. Maybe the problem is related to MediaTek WED offloading.
     1. Could someone please try this patch on top of OpenWrt 22.03: hauke@e480bf2

@richb-hanover
Author

richb-hanover commented Aug 24, 2022 via email

@ynezz ynezz added the labels bug, release/22.03, and target/mediatek on Aug 25, 2022
@ynezz
Member

ynezz commented Aug 26, 2022

I got the issue even with the firewall disabled (dumb AP).

@csharper2005 can you try following patch?

Could someone please try this patch on top of openwrt 22.03: hauke@e480bf2

@neheb
Contributor

neheb commented Aug 26, 2022

The nft commands work but are not a long-term solution.

This issue happens with mt7622 and mt7915. It does not happen with mt7621 and mt7915 DBDC.

The only difference I can see between both setups is Wireless Ethernet Dispatch, which is exclusive to mt7622. Maybe @nbd168 knows more.

edit: just flashed 22.03-rc6 on my RT3200. I don't get this issue now. More investigation is needed.

ynezz added a commit to ynezz/openwrt that referenced this issue Aug 26, 2022
TODO

openwrt#10405
Signed-off-by: Petr Štetiar <ynezz@true.cz>
@csharper2005
Contributor

@ynezz @hauke patch hauke@e480bf2 doesn't help.

MTS WG420223 (mt7621, mt7615 dbdc). 22.03-HEAD.

@nbd168
Member

nbd168 commented Aug 26, 2022

Was this issue reproduced on the 5 GHz wifi as well?

@richb-hanover
Author

richb-hanover commented Aug 26, 2022

Was this issue reproduced on the 5 GHz wifi as well?

I have not tried the 5GHz channel. (I did not enable it at all...) I have now enabled a 5GHz SSID and will report back in an hour or two.

@csharper2005
Contributor

@nbd168 yeah, 2g and 5g both.

@nbd168
Member

nbd168 commented Aug 26, 2022

@csharper2005 but only on a device with mt7615 DBDC, right?
Based on what I've read so far, it seems to me that the issue is specific to the mt7615 driver (which also handles mt7622)

@nbd168
Member

nbd168 commented Aug 26, 2022

I managed to reproduce the issue on MT7615 by forcibly restarting aggregation on TID3 and quickly found the bug afterwards. Commit ec7d32f should fix it, please test.
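
Editor's sketch: the thread does not spell out how particular traffic ends up on a given TID. One plausible link, offered purely as an assumption, is the IP TOS/DS byte: when no custom QoS map is configured, Linux wireless code derives the 802.11 user priority (used as the TID) from the top three bits of that byte. The function name `tos_to_tid` and the table below are illustrative and not taken from the mt76 driver or from the fix.

```python
# 802.11 user priority -> WMM access category (standard mapping).
ACCESS_CATEGORY = {
    1: "BK", 2: "BK",   # background
    0: "BE", 3: "BE",   # best effort
    4: "VI", 5: "VI",   # video
    6: "VO", 7: "VO",   # voice
}


def tos_to_tid(tos):
    """Map an IP TOS/DS byte to a TID using its top three bits (assumed default)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 5


# Show where a few TOS values would land under that assumption.
for tos in (0x00, 0x10, 0x48, 0x60, 0xB8):
    tid = tos_to_tid(tos)
    print(f"TOS 0x{tos:02x} -> TID {tid} ({ACCESS_CATEGORY[tid]})")
```

Under this assumption only TOS values in the range 0x60 to 0x7f would map to TID 3, which would explain why traffic with default marking is unaffected; which value SSH traffic actually carried here is not stated in the thread.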

@richb-hanover
Author

richb-hanover commented Aug 26, 2022

Commit ec7d32f should fix it, please test.

For someone who has not yet learned to build the OpenWrt software, how can I get an image (a snapshot?) that contains this commit? And is my procedure (install the firmware, install LuCI if necessary, install htop & nano, configure wireless, and then run htop 'til it fails) a good test? Thanks.

@csharper2005
Contributor

but only on a device with mt7615 DBDC, right?

@nbd168, It seems to me that the problem also was reproduced on the device with mt7613, but I'm not sure. I can't test it right now. WiFire S1500.NBN with mt7602, mt7612 is not affected.

@nbd168
Member

nbd168 commented Aug 26, 2022

mt7613 is handled by the same driver as well. mt7602/mt7612 is a different one

@nbd168
Member

nbd168 commented Aug 26, 2022

@richb-hanover if you're not building images yourself, just wait for the next snapshot build and flash it. I think your test should work fine if it was able to reproduce the issue before.

@csharper2005
Contributor

I managed to reproduce the issue on MT7615 by forcibly restarting aggregation on TID3 and quickly found the bug afterwards. Commit ec7d32f should fix it, please test.

@nbd168 that's a win! 22.03-HEAD with your commit is working without the issue.

@nbd168
Member

nbd168 commented Aug 26, 2022

great, thanks for testing!

@nbd168 nbd168 closed this as completed Aug 26, 2022
mkj pushed a commit to mkj/dropbear that referenced this issue Nov 10, 2022
Add new -z commandline option which when set, disables new IP TOS
feature.

References: openwrt/openwrt#10405
Signed-off-by: Petr Štetiar <ynezz@true.cz>