Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipq40xx: 802.11s unicast package loss while broadcast works #210

Open
lemoer opened this issue Jan 12, 2023 · 26 comments
Open

ipq40xx: 802.11s unicast package loss while broadcast works #210

lemoer opened this issue Jan 12, 2023 · 26 comments

Comments

@lemoer
Copy link

lemoer commented Jan 12, 2023

We have two routers:

  • NDS-The-TARDIS
    • AVM FRITZ!Box 4040
    • Uplink-Router
    • vH31 based on OpenWrt openwrt-22.03
    • MAC on primary0: d6:0b:01:61:70:5b
  • NDS-The-Sonic-Screwdriver
    • Ubiquiti UniFi AC Mesh
    • Wifi-Mesh-Router
    • vH31 based on OpenWrt openwrt-22.03
    • MAC on primary0: be:eb:ad:e1:5f:7b

The two routers are connected with an 802.11s mesh link. The link seems to work fine for a few hours, but then unicast package loss starts to appear.

We observe approx 75% package loss if we ping NDS-The-Sonic-Screwdriver from NDS-The-TARDIS via unicast:

root@NDS-The-TARDIS:~# ping fe80::bceb:adff:fee1:5f79%mesh1 # This is 5 GHz
PING fe80::bceb:adff:fee1:5f79%mesh1 (fe80::bceb:adff:fee1:5f79%16): 56 data bytes
64 bytes from fe80::bceb:adff:fee1:5f79: seq=2 ttl=64 time=21.381 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=4 ttl=64 time=17.423 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=7 ttl=64 time=15.341 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=14 ttl=64 time=21.839 ms
^C
--- fe80::bceb:adff:fee1:5f79%mesh1 ping statistics ---
16 packets transmitted, 4 packets received, 75% packet loss
round-trip min/avg/max = 15.341/18.996/21.839 ms

If we do the same ping with multicast, we get almost answers for all packets from NDS-The-Sonic-Screwdriver (fe80::bceb:adff:fee1:5f79):

root@NDS-The-TARDIS:~# ping ff02::1%mesh1
PING ff02::1%mesh1 (ff02::1%16): 56 data bytes
64 bytes from fe80::d40b:1ff:fe61:705d: seq=0 ttl=64 time=0.745 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=0 ttl=64 time=14.570 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=1 ttl=64 time=1.543 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=1 ttl=64 time=4.745 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=2 ttl=64 time=1.971 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=2 ttl=64 time=11.180 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=3 ttl=64 time=0.992 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=3 ttl=64 time=28.687 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=4 ttl=64 time=1.024 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=4 ttl=64 time=2.542 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=5 ttl=64 time=1.059 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=5 ttl=64 time=2.338 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=6 ttl=64 time=0.982 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=6 ttl=64 time=11.422 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=7 ttl=64 time=1.003 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=7 ttl=64 time=3.020 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=8 ttl=64 time=0.982 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=8 ttl=64 time=17.070 ms (DUP!)
^C
--- ff02::1%mesh1 ping statistics ---
9 packets transmitted, 9 packets received, 9 duplicates, 0% packet loss
round-trip min/avg/max = 0.745/5.881/28.687 ms

Some debug info:

Ath10k-ct firmware info:

root@NDS-The-TARDIS:# cat /sys/kernel/debug/ieee80211/phy1/ath10k/firmware_info 
directory: ath10k/QCA4019/hw1.0
firmware:  firmware-5.bin
fwcfg:     fwcfg-ahb-a800000.wifi.txt
bus:       a800000.wifi
features:  mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,tx-rc-CT,cust-stats-CT,txrate2-CT,beacon-cb-CT,wmi-block-ack-CT,wmi-bcn-rc-CT
version:   10.4b-ct-4019-fW-13-5ae337bb1
hw_rev:    4019
board:     board-2.bin

Tx-Error-Counters on NDS-The-TARDIS:

root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_dropped 
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_errors 
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_fifo_errors
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_heartbeat_errors 
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_window_errors 
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_compressed 
0
root@NDS-The-TARDIS:/sys/devices/platform/soc/a800000.wifi/net/mesh1/statistics# cat tx_carrier_errors 
0

I found out, that the bug is gone for a few hours (before it appears again) when I call the following command:

root@NDS-The-TARDIS:# iw dev mesh1 set bitrates legacy-5 6 ht-mcs-5 vht-mcs-5; iw dev mesh1 set bitrates

This issue completely disappears when I switch from ath10k-ct to mainline ath10k on NDS-The-TARDIS. Even after days, the mesh link remains stable.

Later today, I will add some more information about the mesh configuration to this issue.

@lemoer
Copy link
Author

lemoer commented Jan 12, 2023

Configuration on NDS-The-TARDIS:

root@NDS-The-TARDIS:~# cat /etc/config/wireless 
config wifi-device 'radio0'
	option type 'mac80211'
	option path 'platform/soc/a000000.wifi'
	option channel '1'
	option band '2g'
	option htmode 'HT20'
	option country 'DE'
	option legacy_rates '0'

config wifi-device 'radio1'
	option type 'mac80211'
	option path 'platform/soc/a800000.wifi'
	option band '5g'
	option channel '44'
	option htmode 'VHT20'
	option country 'DE'

config wifi-iface 'mesh_radio0'
	option ifname 'mesh0'
	option network 'mesh_radio0'
	option device 'radio0'
	option mesh_fwding '0'
	option mesh_id 'ffh-mesh-14'
	option mcast_rate '12000'
	option mode 'mesh'
	option disabled '0'
	option macaddr 'd6:0b:01:61:70:59'

config wifi-iface 'mesh_radio1'
	option ifname 'mesh1'
	option network 'mesh_radio1'
	option device 'radio1'
	option mesh_fwding '0'
	option mesh_id 'ffh-mesh-14'
	option mcast_rate '12000'
	option mode 'mesh'
	option disabled '0'
	option macaddr 'd6:0b:01:61:70:5d'

config wifi-iface 'client_radio0'
	option ifname 'client0'
	option network 'client'
	option disabled '0'
	option device 'radio0'
	option mode 'ap'
	option macaddr 'd6:0b:01:61:70:58'
	option ssid 'hannover.freifunk.net'

config wifi-iface 'client_radio1'
	option ifname 'client1'
	option network 'client'
	option disabled '0'
	option device 'radio1'
	option mode 'ap'
	option macaddr 'd6:0b:01:61:70:5c'
	option ssid 'hannover.freifunk.net'

dmesg on NDS-The-TARDIS:

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 5.10.146 (jenkins@ffh-build) (arm-openwrt-linux-muslgnueabi-gcc (OpenWrt GCC 11.2.0 r19800+7-a7fb589e8a) 11.2.0, GNU ld (GNU Binutils) 2.37) #0 SMP Thu May 5 15:45:14 2022
[    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: AVM FRITZ!Box 4040
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000080000000-0x000000008fffffff]
[    0.000000]   HighMem  empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x0000000087dfffff]
[    0.000000]   node   0: [mem 0x0000000087e00000-0x0000000087ffffff]
[    0.000000]   node   0: [mem 0x0000000088000000-0x000000008fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000008fffffff]
[    0.000000] On node 0 totalpages: 65536
[    0.000000]   Normal zone: 512 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 15 pages/cpu s30540 r8192 d22708 u61440
[    0.000000] pcpu-alloc: s30540 r8192 d22708 u61440 alloc=15*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 65024
[    0.000000] Kernel command line: 
[    0.000000] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 247452K/262144K available (6000K kernel code, 581K rwdata, 1548K rodata, 1024K init, 243K bss, 14692K reserved, 0K cma-reserved, 0K highmem)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] 	Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] arch_timer: cp15 timer(s) running at 48.00MHz (virt).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0xb11fd3bfb, max_idle_ns: 440795203732 ns
[    0.000010] sched_clock: 56 bits at 48MHz, resolution 20ns, wraps every 4398046511096ns
[    0.000028] Switching to timer-based delay loop, resolution 20ns
[    0.000361] Calibrating delay loop (skipped), value calculated using timer frequency.. 96.00 BogoMIPS (lpj=480000)
[    0.000391] pid_max: default: 32768 minimum: 301
[    0.000572] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000593] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.001376] CPU: Testing write buffer coherency: ok
[    0.001723] qcom_scm: convention: smc legacy
[    0.002656] Setting up static identity map for 0x80300000 - 0x8030003c
[    0.002819] rcu: Hierarchical SRCU implementation.
[    0.003082] dyndbg: Ignore empty _ddebug table in a CONFIG_DYNAMIC_DEBUG_CORE build
[    0.003459] smp: Bringing up secondary CPUs ...
[    0.006988] smp: Brought up 1 node, 4 CPUs
[    0.007015] SMP: Total of 4 processors activated (384.00 BogoMIPS).
[    0.007026] CPU: All CPU(s) started in SVC mode.
[    0.012127] VFP support v0.3: implementor 41 architecture 2 part 30 variant 7 rev 5
[    0.012316] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.012347] futex hash table entries: 1024 (order: 4, 65536 bytes, linear)
[    0.012555] pinctrl core: initialized pinctrl subsystem
[    0.014477] NET: Registered protocol family 16
[    0.014822] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.016023] thermal_sys: Registered thermal governor 'step_wise'
[    0.016474] cpuidle: using governor ladder
[    0.016536] cpuidle: using governor menu
[    0.044074] cryptd: max_cpu_qlen set to 1000
[    0.048239] usbcore: registered new interface driver usbfs
[    0.048314] usbcore: registered new interface driver hub
[    0.048389] usbcore: registered new device driver usb
[    0.048451] pps_core: LinuxPPS API ver. 1 registered
[    0.048466] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.048496] PTP clock support registered
[    0.050521] clocksource: Switched to clocksource arch_sys_counter
[    0.051466] NET: Registered protocol family 2
[    0.051741] IP idents hash table entries: 4096 (order: 3, 32768 bytes, linear)
[    0.052702] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 6144 bytes, linear)
[    0.052753] TCP established hash table entries: 2048 (order: 1, 8192 bytes, linear)
[    0.052798] TCP bind hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.052855] TCP: Hash tables configured (established 2048 bind 2048)
[    0.052988] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.053038] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.053251] NET: Registered protocol family 1
[    0.053302] PCI: CLS 0 bytes, default 64
[    0.055322] workingset: timestamp_bits=30 max_order=16 bucket_order=0
[    0.059407] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.059432] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
[    0.208832] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
[    0.212562] bam-dma-engine 8e04000.dma: num-channels unspecified in dt
[    0.212590] bam-dma-engine 8e04000.dma: num-ees unspecified in dt
[    0.213376] tcsr 1949000.tcsr: setting wifi_glb_cfg = 41000000
[    0.213487] tcsr 194b000.tcsr: setting usb hs phy mode select = e700e7
[    0.213588] tcsr 1953000.ess_tcsr: setting ess interface select = 0
[    0.213686] tcsr 1957000.tcsr: setting wifi_noc_memtype_m0_m2 = 2222222
[    0.214025] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
[    0.215997] msm_serial 78af000.serial: msm_serial: detected port #0
[    0.216054] msm_serial 78af000.serial: uartclk = 1843200
[    0.216120] 78af000.serial: ttyMSM0 at MMIO 0x78af000 (irq = 33, base_baud = 115200) is a MSM
[    0.216153] msm_serial: console setup on port #0
[    0.757249] printk: console [ttyMSM0] enabled
[    0.762443] msm_serial: driver initialized
[    0.771398] loop: module loaded
[    0.772540] spi_qup 78b5000.spi: IN:block:16, fifo:64, OUT:block:16, fifo:64
[    0.775025] spi-nor spi0.0: mx25l25635e (32768 Kbytes)
[    0.780780] 13 fixed-partitions partitions found on MTD device spi0.0
[    0.785647] OF: Bad cell count for /soc/spi@78b5000/flash@0/partitions
[    0.792168] OF: Bad cell count for /soc/spi@78b5000/flash@0/partitions
[    0.798776] Creating 13 MTD partitions on "spi0.0":
[    0.805094] 0x000000000000-0x000000040000 : "SBL1"
[    0.810373] 0x000000040000-0x000000060000 : "MIBIB"
[    0.815185] 0x000000060000-0x0000000c0000 : "QSEE"
[    0.819925] 0x0000000c0000-0x0000000d0000 : "CDT"
[    0.824850] 0x0000000d0000-0x0000000e0000 : "DDRPARAMS"
[    0.829531] 0x0000000e0000-0x0000000f0000 : "APPSBLENV"
[    0.834655] 0x0000000f0000-0x0000003cc000 : "urlader"
[    0.839870] 0x00000011dc00-0x000000120000 : "urlader_config"
[    0.845048] 0x000000120000-0x0000001a0000 : "tffs1"
[    0.850777] 0x0000001a0000-0x000000220000 : "tffs2"
[    0.855320] 0x000000220000-0x0000002a0000 : "uboot"
[    0.860204] 0x0000002a0000-0x000001f00000 : "firmware"
[    0.865483] 2 fit-fw partitions found on MTD device firmware
[    0.869801] Creating 2 MTD partitions on "firmware":
[    0.875659] 0x000000000000-0x0000002e0000 : "kernel"
[    0.881039] 0x0000002d21f4-0x000001c60000 : "rootfs"
[    0.885510] mtd: partition "rootfs" doesn't start on an erase/write block boundary -- force read-only
[    0.890926] mtd: device 13 (rootfs) set to be root filesystem
[    0.899824] 1 squashfs-split partitions found on MTD device rootfs
[    0.905348] 0x000000600000-0x000001c60000 : "rootfs_data"
[    0.911858] 0x000001f00000-0x000002000000 : "jffs2"
[    0.930222] gpio-409 (enable USB3 power): hogged as output/high
[    0.959592] ESS reset ok!
[    1.033761] ESS reset ok!
[    1.785379] i2c /dev entries driver
[    1.786999] cpufreq: cpufreq_online: CPU0: Running at unlisted initial frequency: 632000 KHz, changing to: 716000 KHz
[    1.788126] sdhci: Secure Digital Host Controller Interface driver
[    1.798539] sdhci: Copyright(c) Pierre Ossman
[    1.804543] sdhci-pltfm: SDHCI platform and OF driver helper
[    1.812879] NET: Registered protocol family 10
[    1.816215] Segment Routing with IPv6
[    1.819015] NET: Registered protocol family 17
[    1.822768] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[    1.827019] 8021q: 802.1Q VLAN Support v1.8
[    1.840182] Registering SWP/SWPB emulation handler
[    1.866440] VFS: Mounted root (squashfs filesystem) readonly on device 31:13.
[    1.869216] Freeing unused kernel memory: 1024K
[    1.934001] Run /sbin/init as init process
[    1.934229]   with arguments:
[    1.934309]     /sbin/init
[    1.934376]   with environment:
[    1.934448]     HOME=/
[    1.934519]     TERM=linux
[    3.207530] init: Console is alive
[    3.207900] init: - watchdog -
[    5.056124] kmodloader: loading kernel modules from /etc/modules-boot.d/*
[    5.166676] genirq: irq_chip msmgpio did not update eff. affinity mask of irq 100
[    5.413707] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[    5.413781] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[    5.418335] xhci-hcd xhci-hcd.0.auto: hcc params 0x0228f665 hci version 0x100 quirks 0x0000002002010010
[    5.425780] xhci-hcd xhci-hcd.0.auto: irq 102, io mem 0x08a00000
[    5.435969] hub 1-0:1.0: USB hub found
[    5.441346] hub 1-0:1.0: 1 port detected
[    5.445166] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[    5.448931] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 2
[    5.454259] xhci-hcd xhci-hcd.0.auto: Host supports USB 3.0 SuperSpeed
[    5.462126] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    5.468936] hub 2-0:1.0: USB hub found
[    5.476737] hub 2-0:1.0: 1 port detected
[    5.480656] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller
[    5.484268] xhci-hcd xhci-hcd.1.auto: new USB bus registered, assigned bus number 3
[    5.489756] xhci-hcd xhci-hcd.1.auto: hcc params 0x0220f665 hci version 0x100 quirks 0x0000002002010010
[    5.497191] xhci-hcd xhci-hcd.1.auto: irq 103, io mem 0x06000000
[    5.507348] hub 3-0:1.0: USB hub found
[    5.512790] hub 3-0:1.0: 1 port detected
[    5.516587] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller
[    5.520370] xhci-hcd xhci-hcd.1.auto: new USB bus registered, assigned bus number 4
[    5.525690] xhci-hcd xhci-hcd.1.auto: Host supports USB 3.0 SuperSpeed
[    5.533331] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
[    5.540360] hub 4-0:1.0: USB hub found
[    5.548154] hub 4-0:1.0: config failed, hub doesn't have any ports! (err -19)
[    5.563069] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
[    5.573168] init: - preinit -
[    6.294378] random: crng init done
[    6.727879] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   11.511818] jffs2: notice: (185) jffs2_build_xattr_subsystem: complete building xattr subsystem, 16 of xdatum (7 unchecked, 1 orphan) and 23 of xref (1 dead, 0 orphan) found.
[   11.533636] mount_root: switching to jffs2 overlay
[   11.569113] overlayfs: upper fs does not support tmpfile.
[   11.628139] urandom-seed: Seed file not found (/etc/urandom.seed)
[   12.011204] procd: - early -
[   12.011457] procd: - watchdog -
[   12.715801] procd: - watchdog -
[   12.718042] procd: - ubus -
[   12.775400] procd: - init -
[   13.838670] urngd: v1.0.2 started.
[   14.297172] kmodloader: loading kernel modules from /etc/modules.d/*
[   14.394165] GACT probability on
[   14.395578] Mirror/redirect action on
[   14.405348] u32 classifier
[   14.405386]     input device check on
[   14.406981]     Actions configured
[   14.424148] Simple TC action Loaded
[   14.449741] Loading modules backported from Linux version v5.15.58-0-g7d8048d4e064
[   14.449788] Backport generated by backports.git v5.15.58-1-0-g42a95ce7
[   14.517376] xt_time: kernel timezone is -0000
[   14.622597] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[   14.622646] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[   14.837064] ath10k_ahb a000000.wifi: fwcfg key: vdevs  val: 4
[   14.837118] ath10k_ahb a000000.wifi: fwcfg key: peers  val: 96
[   14.841897] ath10k_ahb a000000.wifi: fwcfg key: active_peers  val: 96
[   14.847544] ath10k_ahb a000000.wifi: fwcfg key: stations  val: 96
[   15.748421] ath10k_ahb a000000.wifi: qca4019 hw1.0 target 0x01000000 chip_id 0x003b00ff sub 0000:0000
[   15.748476] ath10k_ahb a000000.wifi: kconfig debug 1 debugfs 1 tracing 0 dfs 1 testmode 0
[   15.760410] ath10k_ahb a000000.wifi: firmware ver 10.4b-ct-4019-fW-13-5ae337bb1 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,tx-rc-CT,cust-stats-CT,txrate2-CT,beacon-cb-CT,wmi-block-ack-CT,wmi-bcn-rc-CT crc32 6b2b5c5b
[   15.812064] ath10k_ahb a000000.wifi: Loading BDF type 0
[   15.860216] ath10k_ahb a000000.wifi: board_file api 2 bmi_id 0:16 crc32 e2dfaa91
[   17.149271] ath10k_ahb a000000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 96
[   17.149323] ath10k_ahb a000000.wifi: msdu-desc: 2500  skid: 32
[   17.185163] ath10k_ahb a000000.wifi: wmi print 'P 96/96 V 4 K 288 PH 296 T 210  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[   17.186185] ath10k_ahb a000000.wifi: wmi print 'free: 33028 iram: 13240 sram: 28696'
[   17.332655] ath10k_ahb a000000.wifi: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 96 raw 0 hwcrypto 1
[   17.464028] ath: EEPROM regdomain: 0x0
[   17.464044] ath: EEPROM indicates default country code should be used
[   17.464049] ath: doing EEPROM country->regdmn map search
[   17.464060] ath: country maps to regdmn code: 0x3a
[   17.464068] ath: Country alpha2 being used: US
[   17.464073] ath: Regpair used: 0x3a
[   17.633321] ath10k_ahb a800000.wifi: fwcfg key: vdevs  val: 4
[   17.633375] ath10k_ahb a800000.wifi: fwcfg key: peers  val: 96
[   17.638110] ath10k_ahb a800000.wifi: fwcfg key: active_peers  val: 96
[   17.643841] ath10k_ahb a800000.wifi: fwcfg key: stations  val: 96
[   18.605453] ath10k_ahb a800000.wifi: qca4019 hw1.0 target 0x01000000 chip_id 0x003b00ff sub 0000:0000
[   18.605511] ath10k_ahb a800000.wifi: kconfig debug 1 debugfs 1 tracing 0 dfs 1 testmode 0
[   18.627352] ath10k_ahb a800000.wifi: firmware ver 10.4b-ct-4019-fW-13-5ae337bb1 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT,tx-rc-CT,cust-stats-CT,txrate2-CT,beacon-cb-CT,wmi-block-ack-CT,wmi-bcn-rc-CT crc32 6b2b5c5b
[   18.673691] ath10k_ahb a800000.wifi: Loading BDF type 0
[   18.678171] ath10k_ahb a800000.wifi: board_file api 2 bmi_id 0:17 crc32 e2dfaa91
[   19.967123] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 96
[   19.967176] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[   20.003023] ath10k_ahb a800000.wifi: wmi print 'P 96/96 V 4 K 288 PH 296 T 210  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[   20.004026] ath10k_ahb a800000.wifi: wmi print 'free: 33028 iram: 13240 sram: 28696'
[   20.129888] ath10k_ahb a800000.wifi: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 96 raw 0 hwcrypto 1
[   20.288496] ath: EEPROM regdomain: 0x0
[   20.288511] ath: EEPROM indicates default country code should be used
[   20.288516] ath: doing EEPROM country->regdmn map search
[   20.288527] ath: country maps to regdmn code: 0x3a
[   20.288535] ath: Country alpha2 being used: US
[   20.288540] ath: Regpair used: 0x3a
[   20.297929] batman_adv: B.A.T.M.A.N. advanced 2022.0-openwrt-6 (compatibility version 15) loaded
[   20.299643] kmodloader: done loading kernel modules from /etc/modules.d/*
[   31.702523] ess_edma c080000.edma eth1: Link is Down
[   31.705205] br-wan: port 1(eth1) entered blocking state
[   31.706621] br-wan: port 1(eth1) entered disabled state
[   31.712843] device eth1 entered promiscuous mode
[   31.777651] IPv6: ADDRCONF(NETDEV_CHANGE): local-node: link becomes ready
[   31.779074] br-client: port 1(local-port) entered blocking state
[   31.783717] br-client: port 1(local-port) entered disabled state
[   31.810744] device local-port entered promiscuous mode
[   31.811097] br-client: port 1(local-port) entered blocking state
[   31.814838] br-client: port 1(local-port) entered forwarding state
[   31.821271] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
[   32.602998] 8021q: adding VLAN 0 to HW filter on device bat0
[   32.604775] br-client: port 2(bat0) entered blocking state
[   32.607806] br-client: port 2(bat0) entered disabled state
[   32.613598] device bat0 entered promiscuous mode
[   32.618738] br-client: port 2(bat0) entered blocking state
[   32.623374] br-client: port 2(bat0) entered forwarding state
[   32.631103] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   33.100687] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
[   33.100767] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
[   33.547525] ath: EEPROM regdomain: 0x8114
[   33.547542] ath: EEPROM indicates we should expect a country code
[   33.547552] ath: doing EEPROM country->regdmn map search
[   33.547560] ath: country maps to regdmn code: 0x37
[   33.547568] ath: Country alpha2 being used: DE
[   33.547574] ath: Regpair used: 0x37
[   33.547583] ath: regdomain 0x8114 dynamically updated by user
[   33.547615] ath: EEPROM regdomain: 0x8114
[   33.547621] ath: EEPROM indicates we should expect a country code
[   33.547627] ath: doing EEPROM country->regdmn map search
[   33.547633] ath: country maps to regdmn code: 0x37
[   33.547639] ath: Country alpha2 being used: DE
[   33.547645] ath: Regpair used: 0x37
[   33.547652] ath: regdomain 0x8114 dynamically updated by user
[   33.833113] ess_edma c080000.edma eth1: Link is Up - 100Mbps/Full - flow control rx/tx
[   33.833223] br-wan: port 1(eth1) entered blocking state
[   33.839972] br-wan: port 1(eth1) entered forwarding state
[   33.858030] IPv6: ADDRCONF(NETDEV_CHANGE): br-wan: link becomes ready
[   34.082671] batman_adv: bat0: Adding interface: primary0
[   34.082744] batman_adv: bat0: Interface activated: primary0
[   34.155045] batman_adv: bat0: Interface deactivated: primary0
[   34.166364] batman_adv: bat0: Interface activated: primary0
[   34.800009] batman_adv: bat0: Adding interface: mesh-vpn
[   34.800070] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1350) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
[   34.806456] batman_adv: bat0: Interface activated: mesh-vpn
[   35.004673] batman_adv: bat0: Adding interface: vx_mesh_other
[   35.004738] batman_adv: bat0: The MTU of interface vx_mesh_other is too small (1430) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
[   35.011753] batman_adv: bat0: Interface activated: vx_mesh_other
[   36.907966] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 96
[   36.908019] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[   36.944045] ath10k_ahb a800000.wifi: wmi print 'P 96/96 V 4 K 288 PH 296 T 210  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[   36.946120] ath10k_ahb a800000.wifi: wmi print 'free: 33028 iram: 13240 sram: 28696'
[   37.244410] ath10k_ahb a800000.wifi: rts threshold -1
[   37.246111] ath10k_ahb a800000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[   37.283840] br-client: port 3(client1) entered blocking state
[   37.283901] br-client: port 3(client1) entered disabled state
[   37.289046] device client1 entered promiscuous mode
[   37.333190] ath10k_ahb a800000.wifi: NOTE:  Firmware DBGLOG output disabled in debug_mask: 0x10000000
[   37.559695] IPv6: ADDRCONF(NETDEV_CHANGE): client1: link becomes ready
[   37.560149] br-client: port 3(client1) entered blocking state
[   37.565259] br-client: port 3(client1) entered forwarding state
[   39.082746] ath10k_ahb a000000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 96
[   39.082812] ath10k_ahb a000000.wifi: msdu-desc: 2500  skid: 32
[   39.118587] ath10k_ahb a000000.wifi: wmi print 'P 96/96 V 4 K 288 PH 296 T 210  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[   39.119640] ath10k_ahb a000000.wifi: wmi print 'free: 33028 iram: 13240 sram: 28696'
[   39.471006] ath10k_ahb a000000.wifi: rts threshold -1
[   39.473325] ath10k_ahb a000000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[   39.484951] ath10k_ahb a800000.wifi: rts threshold -1
[   39.490813] br-client: port 4(client0) entered blocking state
[   39.491253] br-client: port 4(client0) entered disabled state
[   39.497833] device client0 entered promiscuous mode
[   39.729862] IPv6: ADDRCONF(NETDEV_CHANGE): client0: link becomes ready
[   39.730749] br-client: port 4(client0) entered blocking state
[   39.735385] br-client: port 4(client0) entered forwarding state
[   39.836248] IPv6: ADDRCONF(NETDEV_CHANGE): mesh1: link becomes ready
[   39.843496] ath10k_ahb a800000.wifi: Invalid peer id 1 or peer stats buffer, peer: d8d536b8  sta: 00000000
[   40.113873] batman_adv: bat0: Adding interface: mesh1
[   40.113938] batman_adv: bat0: Interface activated: mesh1
[   40.873328] ath10k_ahb a000000.wifi: rts threshold -1
[   41.202343] IPv6: ADDRCONF(NETDEV_CHANGE): mesh0: link becomes ready
[   41.492722] batman_adv: bat0: Adding interface: mesh0
[   41.492805] batman_adv: bat0: Interface activated: mesh0
[   41.780751] batman_adv: bat0: IGMP Querier appeared
[   41.780810] batman_adv: bat0: MLD Querier appeared
[  219.284882] ath10k_ahb a800000.wifi: htt tx: fixing invalid VHT TX rate code 0xff
[  281.143037] ath10k_ahb a000000.wifi: wmi: fixing invalid VHT TX rate code 0xff

The last lines look like #188. But I have no idea what that means.

@greearb
Copy link
Owner

greearb commented Jan 12, 2023

The early builds in here should be very much like (older) stock firmware images. If an earlier one works, then you can work to bisect the problem if you wish, and if you can find the bad commit, probably I can fix it.

https://www.candelatech.com/downloads/ath10k-4019-10-4b/bisect/all_builds-4019-W.tar.gz

@lemoer
Copy link
Author

lemoer commented Jan 13, 2023

I will try, thanks.

@Oozil
Copy link

Oozil commented Jan 13, 2023

The early builds in here should be very much like (older) stock firmware images. If an earlier one works, then you can work to bisect the problem if you wish, and if you can find the bad commit, probably I can fix it.

https://www.candelatech.com/downloads/ath10k-4019-10-4b/bisect/all_builds-4019-W.tar.g

Why didn't you answer my question please? I want to compile ath10-4.19 with the latest commit from openwrt, but it gives me compilation failure and there is a problem with recent versions. I put it in the problems section.

@lemoer
Copy link
Author

lemoer commented Feb 1, 2023

I am currently bisecting the firmware for 3 weeks. Unfortunately, the issue is very unpredictable and I have to wait a day or two to see if the problem really occurs. Also, I have the impression that it might be more than one error, because I observe different behaviours when I bisect.

I will try to continue. But if you have any hint how to accelerate the bisect, please let me know.

@lemoer
Copy link
Author

lemoer commented Feb 23, 2023

I traced the error down to the first broken commit:

  • firmware-5-full-community-qcache-commit-235-af83039c7.bin

However, during the bisect, I observed different kind of behavior:

  • On some commits, the firmware is crashing and the router is crashing along with it.
  • On some commits, only the firmware is crashing.
  • On some commits, the wifi connection stops working and is not coming back by itself.
  • On some commits, the wifi connection is starting to work again after certain time.

Here is my current bisect.log:

Sun Jan 14 01:11:44 PM CET 2023: firmware-5-full-community-commit-317-cf4991294.bin -> ok, approx until 15.01. 16:12
Sun Jan 15 04:23:30 PM CET 2023: firmware-5-full-community-commit-1187-774502ee5.bin -> This did not seem to work at all -> maybe firmware was not loaded at all or so??...
Mon Jan 16 08:26:47 AM CET 2023: firmware-5-full-community-commit-757-ec1dcee4c.bin -> Firmware seems to be completely broken (firmware register dump)
Mon Jan 16 08:30:08 PM CET 2023: firmware-5-full-community-commit-1000-03867f4ea.bin -> seems to work... (but actually not really clear, since other mesh routers were online.)
Tue Jan 17 08:13:52 PM CET 2023: firmware-5-full-community-commit-1187-774502ee5.bin -> broke at 2023-01-17 21:21:34 (approx 1:08h)
Wed Jan 18 09:57:08 PM CET 2023: firmware-5-full-community-commit-1102-3f51bf413.bin -> broke at 2023-01-18 22:08:34 (approx 11 min)
Thu Jan 19 08:59:52 AM CET 2023: firmware-5-full-community-commit-1050-b047152af.bin -> broke at 2023-01-19 19:05:34 (approx 10:05h)
Thu Jan 19 08:34:36 PM CET 2023: firmware-5-full-community-commit-1000-03867f4ea.bin -> broke at 2023-01-19 21:09:01 (approx 35 min)
Thu Jan 19 09:18:41 PM CET 2023: firmware-5-full-community-commit-500-46d5c5934.bin -> broke at 2023-01-20 14:02:32 and came back at 2023-01-20 15:53:39, but broke again at 2023-01-20 16:02:38 and came back at 2023-01-20 23:01:35 -> maybe another issue?
Sat Jan 21 02:49:41 PM CET 2023: firmware-5-full-community-commit-800-9cebd0016.bin -> Firmware seems to be completely broken (firmware register dump)
Sat Jan 21 02:57:27 PM CET 2023: firmware-5-full-community-commit-924-33ee8da60.bin -> broke at 2023-01-21 21:54:00, came back at 2023-01-22 03:46:39, broke again 2023-01-22 03:48:11, came back on 2023-01-22 13:52:35, broke again  2023-01-22 20:00:34, came back on 2023-01-23 19:22:37
Mon Jan 23 08:43:05 PM CET 2023: firmware-5-full-community-commit-1000-03867f4ea.bin -> remains offline after 1hr (or so...). Only bernd restarted wifi once, and it was online for 5 min (probably 1 respondd packet came through)
Sat Jan 28 10:56:27 AM CET 2023: firmware-5-full-community-commit-950-a024550f3.bin -> broken. seems to come back occasionally.
Mon Jan 30 09:34:47 PM CET 2023: firmware-5-full-community-commit-400-e4275fb8e.bin -> offline at the next day at 3pm
Tue Jan 31 10:07:33 PM CET 2023: firmware-5-full-community-qcache-commit-006-61f5a3046.bin -> Seems to be ok for 2 days
Thu Feb 2 11:33:52 PM CET 2023: firmware-5-full-community-commit-360-cf3808f5c.bin -> broke at 2023-02-03 21:31:37, firmware crashed after approx 9hrs. After this, it seemed to crash regularly. (next after 21hrs uptime.), then after 21.1 hrs uptime. See crashlog...
Sat Feb 4 12:00:08 PM CET 2023: firmware-5-full-community-commit-317-cf4991294.bin -> broke at 2023-02-04 12:34:33 -> firmware crashed. see crashlog_2023_02_05.log (first crash after 20 min, second crash after 46 min uptime, 3rd crash at 3hrs+46 min uptime)
Sun Feb 5 10:47:03 AM CET 2023: firmware-5-full-community-qcache-commit-150-12407516f.bin -> seems to work
Mon Feb 6 10:42:40 PM CET 2023: firmware-5-full-community-qcache-commit-225-9ce4c0983.bin -> seems to work
Sat Feb 11 01:24:01 PM CET 2023: firmware-5-full-community-qcache-commit-270-73b8a8fe5.bin -> Broke at 2023-02-12 07:57:36 (it came back at 2023-02-12 12:44:33), also we had firmware crashes. See crashlog 2022-02-12, seems like the router itself is crashing
Sun Feb 12 10:35:39 PM CET 2023: firmware-5-full-community-qcache-commit-250-ac4a0dd63.bin -> broke at 2023-02-13 07:51:34 -> firmware crashed. see crashlog_2023_02_13.log, seems like the router itself is crashing
Mon Feb 13 03:52:06 PM CET 2023: firmware-5-full-community-qcache-commit-237-887f08d4f.bin -> broke at 2023-02-13 23:09:33 (came back at 2023-02-14 05:26:36), seems like the router itself is crashing
Tue Feb 14 09:18:51 AM CET 2023: firmware-5-full-community-qcache-commit-230-a96a9e5c0.bin -> not really tested
Tue Feb 14 09:29:12 AM CET 2023: firmware-5-full-community-qcache-commit-225-9ce4c0983.bin -> seems to work
Thu Feb 16 09:30:15 PM CET 2023: firmware-5-full-community-qcache-commit-230-a96a9e5c0.bin -> seems to work
Sun Feb 19 10:15:35 PM CET 2023: firmware-5-full-community-qcache-commit-234-66d113a9f.bin -> seems to work
Wed Feb 22 11:09:57 PM CET 2023: firmware-5-full-community-qcache-commit-237-887f08d4f.bin -> crashes...
Thu Feb 23 07:40:15 PM CET 2023: firmware-5-full-community-qcache-commit-235-af83039c7.bin -> not really tested.
Thu Feb 23 07:44:59 PM CET 2023: firmware-5-full-community-qcache-commit-236-55e2f2e7a.bin -> crashes...
Thu Feb 23 08:42:15 PM CET 2023: firmware-5-full-community-qcache-commit-235-af83039c7.bin -> crashes

Since I observed different behaviors, I suspect, this is not the end of the bisect. Please advise me how to continue the bisect.

@greearb
Copy link
Owner

greearb commented Feb 23, 2023

Thanks for grinding through this, I'm sure it was not much fun to bisect this. The patch in build 235 looks fairly harmless at first glance. Can you please send me the crash file/information?

@lemoer
Copy link
Author

lemoer commented Feb 23, 2023

Now, the -234- starts to crash also. :(

Here is the log.

@greearb
Copy link
Owner

greearb commented Feb 24, 2023

Ok, 234 introduced a bug that was fixed ~5 commits later (back in 2016). To make bisecting less painfull, I merged those commits. Please see how this one works for you. It is with those 5 squashed commits on top of 234.
firmware-5-full-community-qcache.bin.gz

@lemoer
Copy link
Author

lemoer commented Feb 24, 2023

If this version is just -234- squashed together with the next five commits, I do not understand how this version differs from -239-, but I will try this version.

I'll come back when I have results.

@lemoer
Copy link
Author

lemoer commented Feb 26, 2023

The new firmware firmware-5-full-community-qcache.bin.gz (from above) also doesn't work.

Fri Feb 24 06:41:25 PM CET 2023: firmware-5-full-community-qcache-commit-234-squashed-with-5-commits.bin -> version (firmware ver 10.4b-ct-4019-fWQ-05-2c04d5a86 api 5)

The router is crashing again and again:

image

Current log looks like this:

root@NDS-The-TARDIS:~# uptime
 19:24:42 up 49 min,  load average: 0.35, 0.49, 0.44
root@NDS-The-TARDIS:~# dmesg|grep ath
[   15.269947] ath10k_ahb a000000.wifi: fwcfg key: vdevs  val: 4
[   15.270036] ath10k_ahb a000000.wifi: fwcfg key: peers  val: 96
[   15.274860] ath10k_ahb a000000.wifi: fwcfg key: active_peers  val: 96
[   15.280469] ath10k_ahb a000000.wifi: fwcfg key: stations  val: 96
[   16.116745] ath10k_ahb a000000.wifi: qca4019 hw1.0 target 0x01000000 chip_id 0x003b00ff sub 0000:0000
[   16.116813] ath10k_ahb a000000.wifi: kconfig debug 1 debugfs 1 tracing 0 dfs 1 testmode 0
[   16.130194] ath10k_ahb a000000.wifi: firmware ver 10.4b-ct-4019-fWQ-05-2c04d5a86 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT crc32 e9fe2c46
[   16.174249] ath10k_ahb a000000.wifi: Loading BDF type 0
[   16.221862] ath10k_ahb a000000.wifi: board_file api 2 bmi_id 0:16 crc32 e2dfaa91
[   17.540219] ath10k_ahb a000000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 192
[   17.540271] ath10k_ahb a000000.wifi: msdu-desc: 2500  skid: 32
[   17.576304] ath10k_ahb a000000.wifi: wmi print 'P 96/96 V 4 T 306  msdu-desc: 2500  sw-crypt: 0'
[   17.577716] ath10k_ahb a000000.wifi: wmi print 'free: 13108 iram: 1768 sram: 27492'
[   17.707956] ath10k_ahb a000000.wifi: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 96 raw 0 hwcrypto 1
[   17.827214] ath: EEPROM regdomain: 0x0
[   17.827233] ath: EEPROM indicates default country code should be used
[   17.827241] ath: doing EEPROM country->regdmn map search
[   17.827256] ath: country maps to regdmn code: 0x3a
[   17.827266] ath: Country alpha2 being used: US
[   17.827274] ath: Regpair used: 0x3a
[   18.003499] ath10k_ahb a800000.wifi: fwcfg key: vdevs  val: 4
[   18.003558] ath10k_ahb a800000.wifi: fwcfg key: peers  val: 96
[   18.008290] ath10k_ahb a800000.wifi: fwcfg key: active_peers  val: 96
[   18.014042] ath10k_ahb a800000.wifi: fwcfg key: stations  val: 96
[   18.911729] ath10k_ahb a800000.wifi: qca4019 hw1.0 target 0x01000000 chip_id 0x003b00ff sub 0000:0000
[   18.911799] ath10k_ahb a800000.wifi: kconfig debug 1 debugfs 1 tracing 0 dfs 1 testmode 0
[   18.925226] ath10k_ahb a800000.wifi: firmware ver 10.4b-ct-4019-fWQ-05-2c04d5a86 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,set-special-CT crc32 e9fe2c46
[   18.968994] ath10k_ahb a800000.wifi: Loading BDF type 0
[   18.973570] ath10k_ahb a800000.wifi: board_file api 2 bmi_id 0:17 crc32 e2dfaa91
[   20.292764] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 192
[   20.292824] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[   20.328614] ath10k_ahb a800000.wifi: wmi print 'P 96/96 V 4 T 306  msdu-desc: 2500  sw-crypt: 0'
[   20.329876] ath10k_ahb a800000.wifi: wmi print 'free: 13108 iram: 1768 sram: 27492'
[   20.453504] ath10k_ahb a800000.wifi: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 96 raw 0 hwcrypto 1
[   20.601210] ath: EEPROM regdomain: 0x0
[   20.601225] ath: EEPROM indicates default country code should be used
[   20.601231] ath: doing EEPROM country->regdmn map search
[   20.601240] ath: country maps to regdmn code: 0x3a
[   20.601248] ath: Country alpha2 being used: US
[   20.601254] ath: Regpair used: 0x3a
[   34.330427] ath: EEPROM regdomain: 0x8114
[   34.330445] ath: EEPROM indicates we should expect a country code
[   34.330453] ath: doing EEPROM country->regdmn map search
[   34.330461] ath: country maps to regdmn code: 0x37
[   34.330470] ath: Country alpha2 being used: DE
[   34.330476] ath: Regpair used: 0x37
[   34.330485] ath: regdomain 0x8114 dynamically updated by user
[   34.330585] ath: EEPROM regdomain: 0x8114
[   34.330594] ath: EEPROM indicates we should expect a country code
[   34.330602] ath: doing EEPROM country->regdmn map search
[   34.330609] ath: country maps to regdmn code: 0x37
[   34.330615] ath: Country alpha2 being used: DE
[   34.330622] ath: Regpair used: 0x37
[   34.330630] ath: regdomain 0x8114 dynamically updated by user
[   37.845787] ath10k_ahb a000000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 192
[   37.845852] ath10k_ahb a000000.wifi: msdu-desc: 2500  skid: 32
[   37.882027] ath10k_ahb a000000.wifi: wmi print 'P 96/96 V 4 T 306  msdu-desc: 2500  sw-crypt: 0'
[   37.883277] ath10k_ahb a000000.wifi: wmi print 'free: 13108 iram: 1768 sram: 27492'
[   38.187710] ath10k_ahb a000000.wifi: rts threshold -1
[   38.190424] ath10k_ahb a000000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[   38.867622] ath10k_ahb a000000.wifi: NOTE:  Firmware DBGLOG output disabled in debug_mask: 0x10000000
[   39.981095] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 4  peers: 96  tid: 192
[   39.981153] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[   40.017212] ath10k_ahb a800000.wifi: wmi print 'P 96/96 V 4 T 306  msdu-desc: 2500  sw-crypt: 0'
[   40.022595] ath10k_ahb a800000.wifi: wmi print 'free: 13108 iram: 1768 sram: 27492'
[   40.302868] ath10k_ahb a800000.wifi: rts threshold -1
[   40.304368] ath10k_ahb a800000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[   40.315846] ath10k_ahb a000000.wifi: rts threshold -1
[   41.793371] ath10k_ahb a800000.wifi: rts threshold -1
[   95.004569] ath10k_ahb a800000.wifi: Invalid VHT mcs 15 peer stats

I can try to capture a crash of the firmware, if this is helpful? Or should I try to get a log of the crash of the router itself? I am not even getting the point how the chip firmware can crash the router itself. This must be a problem in the driver, doesn't it?

@greearb
Copy link
Owner

greearb commented Feb 26, 2023

'crashing' is an overloaded term. Please be more specific. If OS or firmware is crashing, then logs are helpful.
And, if the new firmware image I gave you is broken, can you confirm that the earlier commit (233) still works for you?

@lemoer
Copy link
Author

lemoer commented Feb 26, 2023

With "the router is crashing" I mean that the OS is crashing. I can not confirm if the firmware is also crashing for this firmware or not. (My logs do not contain it, since the logs were gone when the OS crashed. From now on, I am pulling the logs from the router to a remote system, so I can inspect them after the OS crashed).

In earlier tests, I observed that both the OS and the firmware are crashing.

I will now try the -233-.

@lemoer
Copy link
Author

lemoer commented Feb 28, 2023

The -233- is now running stable since two days. This means, -233- is ok.

What are the next steps?

@greearb
Copy link
Owner

greearb commented Feb 28, 2023

I will look closely at the commit that bisects as bad and try to figure out the problem. Thanks for your detailed testing.

@greearb
Copy link
Owner

greearb commented Feb 28, 2023

Are you able to connect serial console to your device to get more info about the crashes? Since only firmware changed, I guess maybe it is a firmware crash that is triggering some other bug that crashes the OS. The serial console logs should help me debug the firmware crash.

@lemoer
Copy link
Author

lemoer commented Feb 28, 2023

I will make sure we get serial logs. But it will take a few days.

@lemoer
Copy link
Author

lemoer commented Mar 3, 2023

Ok, the serial logging is now installed.

This is the first log piece: Link.

  • This is -234- "squashed with 5 commits".
  • I am only running the new firmware for a few minutes.
  • The firmware is already crashing (again and again).
  • So far, I did not observe an OS crash yet. (I rebooted the router once after I installed the new firmware, so this is not a crash).

I will keep running this setup. Later I will post what happened then.

@lemoer
Copy link
Author

lemoer commented Mar 3, 2023

Here is a log, where the OS is also crashing: Link. Look at approx 17.44h.

(This is still -234- "squashed with 5 commits".)

Can you figure out something using this log?

@lemoer
Copy link
Author

lemoer commented Mar 4, 2023

Here is another log from today, where multiple crashes are included in case you want to compare: Link.

@greearb
Copy link
Owner

greearb commented Mar 7, 2023

I tried to backport some relevant fixes to make bisecting clean, but that became a big mess. The patch that starts causing trouble for you initialized the rate codes to invalid value to catch other bugs in rate-ctrl, then the next 100+ patches deal with fallout from that patch.

So, maybe we can go at the problem from a different direction. In your original bisect notes, you indicate that maybe build 1000 is OK. Can you re-test that and/or a few other higher builds to see if you can find a stable build there? If so, then could bisect with that as starting point to find where other problems are introduced.

@lemoer
Copy link
Author

lemoer commented Mar 13, 2023

I am currently testing the 1000. I will come back if I have definite results. (I need to double-check since some environment conditions have changed. The "other" router (Ubiquiti UniFi AC Mesh) was "autoupdated" to mainline ath10k, so I need to double check if this setup still triggers our issue at all.)

Another idea: do you have any hardware to reproduce this in your lab? Maybe this could help to debug the issue? (We could provide you with our firmware/config if you tell us what kind of hardware you have.)

@lemoer
Copy link
Author

lemoer commented Mar 13, 2023

Or another idea: Can we maybe set the bitrates manually to bypass the rate-ctrl issues (or so)?

@greearb
Copy link
Owner

greearb commented Mar 13, 2023

You can override rate-ctrl for larger data frames (this was designed for specific test case, so it is probably not good for general use).

You can look at the logic in the driver that deals with txo and modify that further if you want. If mucking with
txo after system is in the bad state suddently lets larger frames go through, then it would appear it really is a rate-ctrl bug and that should help me find the bug.

Unfortunately, I do not have time to reproduce the bug locally.

[root@ct523c-0b29 ~]# cat /debug/ieee80211/wiphy1/ath10k/set_rate_override
This allows specify specif tx rate parameters for all DATA frames on a vdev
Only wave-2 CT firmware has full support. Wave-1 CT firmware has at least
some support (preamble, preamble, mcs, retries).
To set a value, you specify the dev-name and key-value pairs:
tpc=10 sgi=1 mcs=x nss=x pream=x retries=x dynbw=0|1 bw=x enable=0|1
pream: 0=ofdm, 1=cck, 2=HT, 3=VHT
cck-mcs: 0=11Mbps, 1=5.5Mbps, 2=2Mbps, 3=1Mbps
ofdm-mcs: 0=48Mbps, 1=24Mbps, 2=12Mbps, 3=6Mbps, 4=54Mbps, 5=36Mbps, 6=18Mbps, 7=9Mbps
tpc is in 1db increments, 255 means use defaults, bw is 0-3 for 20-160
For example, wlan0: echo "wlan0 tpc=255 sgi=1 mcs=0 nss=1 pream=3 retries=1 dynbw=0 bw=0 active=1" > ...ath10k/set_rate_override
vdev 0(wlan1) active=0 tpc=0 sgi=0 mcs=0 nss=0 pream=0 retries=0 dynbw=0 bw=0 rix=0
[root@ct523c-0b29 ~]#

@lemoer
Copy link
Author

lemoer commented Apr 6, 2023

Things got a bit silent here. The reason is that the original setup was at the home of a colleague. Since he was annoyed by the broken wifi, I tried to move the setup to my home. Since then, I was trying to reproduce the issue at my home. But so far, I was not able to reproduce it at my home. Therefore, we will now start to test again at the home of my colleague. I hope, we will be able to reproduce this again at his home.

Just to avoid wrong conclusions: The home of my colleague was not the only place, where we observed the issue. It also appeared in other scenarios.

@lemoer
Copy link
Author

lemoer commented Apr 23, 2023

I was not able to reproduce it there as well (within one week). So maybe, I was doing something wrong or there are other influences that we are not aware of yet.

However. We decided to go back to mainline ath10k instead. Therefore I will not investigate here further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants