
FS#294 - linksys 1200ac (and most likely other mvneta) has a multi-flow lockout problem #5411

Closed
openwrt-bot opened this issue Nov 20, 2016

openwrt-bot commented Nov 20, 2016

dtaht:

Supply the following if possible:

  • Device problem occurs on:

linksys 1200ac

  • Software versions of LEDE release, packages, etc.

LEDE Reboot (HEAD, r2246)

  • Steps to reproduce

Install netperf

and then, from another machine, either:

netperf -H the_device -l 60 -t TCP_MAERTS &
netperf -H the_device -l 60 -t TCP_MAERTS &
netperf -H the_device -l 60 -t TCP_MAERTS &
netperf -H the_device -l 60 -t TCP_MAERTS &

or:

flent -H the_device --test-parameter=download_streams=12 tcp_ndown

Generally only one of the flows makes any progress; the others starve completely.
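
One rough way to watch this happen is to look at per-queue qdisc statistics on the router while the test runs (a sketch, assuming the LAN interface shows up as eth0 and the default mq root with one fq_codel child per hardware TX queue):

tc -s qdisc show dev eth0
# each fq_codel child under mq corresponds to one hardware TX queue;
# during the lockout only one child's "Sent ... bytes" counter keeps
# growing while the others barely move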

I am under the impression that fixes for this arrived in mainline Linux (along with BQL support).
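
If you want to check whether BQL is wired up on a given build, the standard per-queue sysfs files are there to look at (eth0 assumed again; these paths are generic kernel BQL, not LEDE-specific):

grep . /sys/class/net/eth0/queues/tx-*/byte_queue_limits/limit
grep . /sys/class/net/eth0/queues/tx-*/byte_queue_limits/inflight
# with BQL support in the driver, "limit" adapts and "inflight" moves
# under load; without it the values stay at their defaults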

openwrt-bot commented Nov 21, 2016

None:

I note https://git.lede-project.org/8aa9f6bd71bcfd15e953a0932ed21953ab6d6bbf has just been committed.

openwrt-bot commented Nov 29, 2016

anomeome:

Seems that with the 4.4.35 kernel, things are working well again.

openwrt-bot commented Dec 4, 2016

mkresin:

Dave, would you please test whether the issue is fixed for you as well?

openwrt-bot commented Dec 24, 2016

dtaht:

Nope, not fixed. I tried the Dec 23 build just now; with more than 4 flows it locks everything else out.

(The way I was dealing with it was to run cake at 900 Mbit on the internal Ethernet interface using SQM, which works great aside from burning a ton of CPU.)

I guess that's one way around bugs like this...
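
For reference, stripped down to the bare shaping step, that workaround amounts to roughly the following (a sketch only; the SQM scripts do quite a bit more, cake support has to be installed, and eth0 is an assumed interface name):

tc qdisc replace dev eth0 root cake bandwidth 900mbit
# shaping just below line rate keeps the hardware TX queues from
# filling up, which is why the lockout doesn't bite, at the cost of CPU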

Example test using 12 flows from flent and netperf on the router:

root@apu2:~/t# flent -H 172.26.64.1 -t 'isitfixed' --te=download_streams=12 tcp_ndown

Warning: Program exited non-zero (1).
Command: /usr/bin/netperf -P 0 -v 0 -D -0.20 -4 -H 172.26.64.1 -t
TCP_MAERTS -l 60 -f m -- -H 172.26.64.1
Program output:
netperf: send_omni: connect_data_socket failed: No route to host

Warning: Command produced no valid data.
Data series: TCP download::5
Runner: NetperfDemoRunner
Command: /usr/bin/netperf -P 0 -v 0 -D -0.20 -4 -H 172.26.64.1 -t
TCP_MAERTS -l 60 -f m -- -H 172.26.64.1
Standard error output:
netperf: send_omni: connect_data_socket failed: No route to host

openwrt-bot commented Jan 5, 2017

woody77:

I'm seeing the same here on my wrt1900ac (v1-Mamba) running the 12/22 snapshot. Here's the graphed output from a 12-stream netperf download test (flent -H <wrt_1900_ac> tcp_12down).

openwrt-bot commented Jan 11, 2017

nbd:

Added a workaround for this issue to current master.

openwrt-bot commented Jan 19, 2017

dtaht:

I have tested mvneta with this and it no longer has the lockout behavior; it can indeed push netperf in one direction or the other at 1 Gbit with 12 flows.

I did not test "through" the router at a gbit.

It can't do 12 flows up and 12 flows down at the same time at full rate (about 600 Mbit each way).

It might be good for the driver to stop exposing "mq" to the higher-level bits of the stack at all, as allocating 8 fq_codel instances is somewhat wasteful and confusing.
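
From userspace, that is roughly the difference between the current default and a single-queue setup (an illustration only, assuming eth0 and the stock mq/fq_codel configuration, not the driver change itself):

tc qdisc show dev eth0
# currently shows an mq root with one fq_codel child per hardware TX queue
tc qdisc replace dev eth0 root fq_codel
# a single fq_codel root then feeds all of the TX queues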

I have also seen bql "working" in this case.

I support closing this until it is better fixed upstream.

2017-01-16 19:05 GMT+01:00 Felix Fietkau nbd@nbd.name:

On 2017-01-16 18:59, Dave Taht wrote:

On Mon, Jan 16, 2017 at 9:28 AM, Marcin Wojtas mw@semihalf.com wrote:

I just took a look at the LEDE master branch and found his workaround:

https://git.lede-project.org/?p=source.git;a=blob;f=target/linux/mvebu/patches-4.4/400-mvneta-tx-queue-workaround.patch;h=5dba311d93a6d325fc110b8218d56209bd78e9dd;hb=2e1f6f1682d3974d8ea52310e460f1bbe470390f#l1

He simply uses TXQ0 for all traffic. I'm not aware of any problem in the hardware. Maybe he can send a description of his findings to the kernel lists, and then I'd poke Marvell so that it at least gets into their network team's bug system?
To me the behavior looks like the hardware is configured to service the
queues in a fixed priority scheme. If I put the system under heavy load,
one queue gets its packets out all the time, whereas all the other
queues starve completely (>900 Mbit/s on one queue vs <1 Mbit/s on others).
I've tried to resolve this myself by looking at the data sheet and
playing with the queue configuration registers, but didn't get anywhere
with that.

openwrt-bot commented Jan 19, 2017

dtaht:

Also, the current code is locked to the first core.

root@linksys-1200ac:/proc/irq# cd 37
root@linksys-1200ac:/proc/irq/37# echo 2 > smp_affinity
-ash: write error: I/O error
root@linksys-1200ac:/proc/irq/37# ls
affinity_hint node smp_affinity_list
mvneta smp_affinity spurious

root@linksys-1200ac:/proc/irq/105# cat /proc/interrupts
CPU0 CPU1
17: 47628305 48244950 GIC 29 Edge twd
18: 0 0 armada_370_xp_irq 5 Level armada_370_xp_per_cpu_tick
20: 174 0 GIC 34 Level mv64xxx_i2c
21: 20 0 GIC 44 Level serial
35: 45945135 1 armada_370_xp_irq 12 Level mvneta
36: 0 0 GIC 50 Level ehci_hcd:usb1
37: 71740156 0 armada_370_xp_irq 8 Level mvneta
38: 0 0 GIC 51 Level f1090000.crypto
39: 0 0 GIC 52 Level f1090000.crypto
40: 0 0 GIC 53 Level f10a3800.rtc
41: 0 0 GIC 58 Level f10a8000.sata
42: 41658 0 GIC 116 Level f10d0000.flash
43: 0 0 GIC 49 Level xhci-hcd:usb2
68: 0 0 f1018100.gpio 24 Edge gpio_keys
73: 0 0 f1018100.gpio 29 Edge gpio_keys
104: 363099418 34760 GIC 61 Level mwlwifi
105: 373070016 17906 GIC 65 Level mwlwifi
106: 2 0 GIC 54 Level f1060800.xor
107: 2 0 GIC 97 Level f1060900.xor
IPI0: 0 1 CPU wakeup interrupts
IPI1: 0 0 Timer broadcast interrupts
IPI2: 1104300 3646744 Rescheduling interrupts
IPI3: 0 0 Function call interrupts
IPI4: 280997 19353589 Single function call interrupts
IPI5: 0 0 CPU stop interrupts
IPI6: 0 0 IRQ work interrupts
IPI7: 0 0 completion interrupts
Err: 0
