Firmware crash with high iperf traffic QCA9888 #30

Closed
ecastifor opened this issue Aug 9, 2018 · 31 comments

Comments

@ecastifor

driver: ath10k-4.13 CT
firmware: CT full-community 10.4-ct-9888-fH-011-cf79c7f
Chipset: QCA9888. It also happens on QCA9984.
OS: OpenWrt 18.06-SNAPSHOT r0+6957-e060fdfc08
Issue description: firmware crash with high traffic load, generated using iperf (32 connected stations).
I have tried no-htt-mgt firmware and previous versions and it crashes as well.
The stock firmware doesn't crash.
Attached crash log and binary dump.
fw_crash_log.txt
crash_dump.tar.gz

@greearb
Owner

greearb commented Aug 9, 2018

The binary dump was empty as far as I can tell, and the text dump shows an assert in rate-ctrl logic, but the debug logging before that assert was not in your .txt file. Please reproduce with firmware dbglog logging enabled, and show me the text several hundred lines before and after the crash. And maybe you will have better luck with the binary dump next time! I'll attach the latest 9888 binary; please reproduce with this, as it will be more convenient for me to debug.

For better logging: https://www.candelatech.com/ath10k-bugs.php
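
For anyone collecting the "several hundred lines before and after the crash" from a large log, here is a minimal Python sketch. It is not part of ath10k or the CT tooling; the function name `crash_context` and the default marker string `"firmware crashed"` are assumptions, so adjust the marker to whatever your log actually prints at the crash.

```python
from typing import List


def crash_context(lines: List[str],
                  marker: str = "firmware crashed",  # assumed marker text
                  before: int = 300,
                  after: int = 300) -> List[str]:
    """Return the lines surrounding the first line that contains `marker`.

    Up to `before` lines of leading context and `after` lines of trailing
    context are included. An empty list means the marker was not found.
    """
    for i, line in enumerate(lines):
        if marker in line:
            start = max(0, i - before)
            return lines[start:i + after + 1]
    return []
```

Typical use would be to read the saved log with `open(path, errors="replace").readlines()`, pass the result to `crash_context`, and write the returned lines to a new file for attaching.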

@greearb
Owner

greearb commented Aug 9, 2018

firmware-5-full-htt-mgt-community.bin.gz

Please test with this. It is for the 9888/9886 NIC.

@ecastifor
Author

Thanks for your answer. Attached new log with fw dbglog enabled and new binary dump.
crash.log
binary_dump.tar.gz

@greearb
Owner

greearb commented Aug 10, 2018

Hello,

First, the logs show some errors in the power-save logic. It is warning about things being out-of-sync and then it should be taking recovery actions. The crash is due to the rate-ctrl switching between VHT and HT, it seems. I previously added an assert in that case because at least as part of normal rate-ctrl logic, it should not happen. But, other logs show what appears to be the driver setting a station's rate-ctrl to HT20. Possibly it used to be VHT and this is causing the issues. I'm adding some extra debugging and will attach a new binary for testing.

Are you setting any specific rates for these stations, especially after they have initially associated?

What station device(s) are you using for this test?

Anything else interesting about this test case?
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hello,

The stations are USB dongles (TP-Link N900 and TP-Link AC1300) connected to Raspberry Pi 2 boards. No rates are set.
With few stations the crash doesn't happen.
Attached new log file and binary dump from your last binary.
Thanks in advance.
binary_dump.tar.gz
crash_13_8.log

@greearb
Owner

greearb commented Aug 13, 2018

Thank you for the logs and info. I have added more debugging to the firmware, please re-run with the attached firmware and send me the crash files.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hi,
attached new files.
Thanks
binary_dump_14_8.tar.gz
crash_14_8.log

@greearb
Owner

greearb commented Aug 15, 2018

It seems my comment from yesterday did not get posted for some reason. My previous attempt at debugging was faulty it seems. I think it crashed trying to gather debugging and in the end, I did not get useful info. Please try the attached image and resend logs.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hi,
attached log and binary dump from last firmware you sent me.
Thanks
binary_dump_16_8.tar.gz
crashlog_16_8.log

@greearb
Owner

greearb commented Aug 16, 2018

I still do not see how the rates got out of sync, so please retest with the attached binary. It has some more error checking and might catch the place where the invalid rate is first set. If possible, send me 'dmesg'-like output from the entire boot, though I am not sure your platform can do that.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

ecastifor commented Aug 17, 2018

Hi, I have tested with the last firmware you sent. Attached is a file with the dmesg output you requested.
Thanks.
binary_dump_17_8.tar.gz
dmesg.txt

@greearb
Owner

greearb commented Aug 17, 2018

Ok, so getting closer. It seems we probe on HT rates when we are able to do VHT. The firmware assumes that will not happen.
I have added code to hopefully catch where that attempt to set the invalid probe rate happens. Please retest and send me logs.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hi,
with the last binary you sent me, the crash occurs while clients are connecting. Attached logs.
Thanks
binary_dump_20_8.tar.gz
log_20_8.txt

@greearb
Owner

greearb commented Aug 20, 2018

Sorry, that one had a different mistake in the debugging code. Please try the attached.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hi, attached logs from last binary,
thanks
binary_dump_21_8.tar.gz
log_21_8.txt

@greearb
Owner

greearb commented Aug 21, 2018

Ok, so rate ctrl is definitely probing for an HT rate when in VHT mode. Not entirely sure why yet, so here is a build with more debugging. Please retest.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Attached new logs.
Regards.
binary_dump_22_8.tar.gz
log_22_8.txt

@greearb
Owner

greearb commented Aug 22, 2018

It seems that the ni_flags and/or phymode might be out-of-sync and/or corrupted. Attached is another firmware to help determine if this is the cause.
firmware-5-full-htt-mgt-community.bin.gz

@ecastifor
Author

Hi, attached logs from last binary,
thanks.
binary_dump_23_8.tar.gz
log_23_8.txt

@greearb
Owner

greearb commented Aug 23, 2018

At least in this case, phymode seems to be and always have been HT20, and rate-mask matches. But somehow, something was able to set the 'maxRate' to VHT. I found another place where something was setting this value without the debug/assert guards, so I made that call the debugging logic too. Please try the attached.
firmware-5-full-htt-mgt-community.bin.gz
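
To illustrate the idea of routing every write to the max rate through one guarded setter, here is a rough Python sketch. It is not the firmware's code (which is not shown in this thread); `Peer`, `set_max_rate_mode`, and the simplified mode names are all hypothetical, and the real phymode in this case was HT20 rather than the plain "HT" used here.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified mode names ordered from least to most capable.
MODES = ("LEGACY", "HT", "VHT")


@dataclass
class Peer:
    phymode: str                      # mode the peer negotiated, e.g. "HT"
    max_rate_mode: str = "LEGACY"     # mode of the current max rate
    log: List[str] = field(default_factory=list)


def set_max_rate_mode(peer: Peer, mode: str, caller: str) -> None:
    """Single checked entry point for changing a peer's max rate mode.

    Every write is logged with its caller, so a bad value can be traced
    back to the code path that tried to set it.
    """
    peer.log.append(f"{caller}: {peer.max_rate_mode} -> {mode}")
    if MODES.index(mode) > MODES.index(peer.phymode):
        raise AssertionError(
            f"{caller} tried to set a {mode} max rate on a {peer.phymode} peer")
    peer.max_rate_mode = mode
```

The guard only helps if no code path writes the field directly. A write that bypasses the setter, or a stray write to the wrong memory address, would not be caught by this check.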

@ecastifor
Author

Hi, attached logs from the last binary. This binary crashes while clients are connecting.
Regards.
binary_dump_24_8.tar.gz
log_24_8.txt

@greearb
Owner

greearb commented Aug 24, 2018

The code was incorrectly setting a 160 MHz rate when the peer was only 80 MHz. That particular assert should be fixed in this new build, but I am not sure whether that was the root cause of the initial problem. Please retest.
firmware-5-full-htt-mgt-community.bin.gz
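
As a small illustration of the bandwidth constraint described above, the following Python sketch limits a requested channel width to what the peer supports. The function `clamp_rate_width` and the width table are assumptions made for this example, not the actual firmware fix.

```python
# Channel widths considered in this sketch, in MHz.
SUPPORTED_WIDTHS_MHZ = (20, 40, 80, 160)


def clamp_rate_width(requested_mhz: int, peer_max_mhz: int) -> int:
    """Return the widest supported width that exceeds neither argument."""
    limit = min(requested_mhz, peer_max_mhz)
    candidates = [w for w in SUPPORTED_WIDTHS_MHZ if w <= limit]
    if not candidates:
        raise ValueError(f"no supported width at or below {limit} MHz")
    return max(candidates)
```

With this check in place, a request for 160 MHz on a peer limited to 80 MHz yields 80 instead of an invalid rate.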

@ecastifor
Author

ecastifor commented Aug 27, 2018

Hi,
this build crashed as well. Attached logs.
Regards
binary_dump_27_8.tar.gz
log_27_8.txt

@greearb
Owner

greearb commented Aug 27, 2018

Either I gave you a bad firmware, or you loaded an old one. Please try the attached. Its md5sum is:
firmware-5-full-htt-mgt-community.bin.gz
md5sum firmware-5-full-htt-mgt-community.bin
eec1cfb1c04fbaacc714673e05232ba1 firmware-5-full-htt-mgt-community.bin
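
If `md5sum` is not available on the machine holding the image, the same check can be sketched in Python with the standard `hashlib` module. The helper names `md5_of_file` and `matches_md5` are made up for this example.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Note that the checksum above was taken on `firmware-5-full-htt-mgt-community.bin`, so the `.gz` attachment has to be decompressed before comparing.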

@ecastifor
Author

Hi,
I have tested with the last firmware you sent and it crashed (md5sum checked). Attached logs.
Regards.
binary_dump_28_8.tar.gz
log_28_8.txt

@greearb
Owner

greearb commented Aug 30, 2018

I have added debug code everywhere I can think of that could be setting the invalid rate that causes the assert, but the debug code does not catch the problem. So, maybe the problem is a write to a bad memory address or something like that. I have other tools to investigate this, but it is sort of a 'bisect' of the code and will probably take lots of iterations of me sending you new builds and you reporting back results (similar to what we have been doing). If you are willing to do this, then I will work on the implementation and send you a new binary for testing soon.

@greearb
Owner

greearb commented Sep 28, 2018

Please let me know if you are interested in pursuing debugging of this. Otherwise I'll close the bug and can re-open it if the problem is reported again.

@ecastifor
Author

Hi, yes I can test the binaries. Sorry for the delay

@greearb
Owner

greearb commented Oct 12, 2018

I have no easy way to test 9888 images, will you be able to test this with 9984 radios instead? That way I can at least spot-check that the binaries work before giving them to you...

@ecastifor
Author

I won't be able to test on 9984 radios for a couple of weeks. After that, yes.
Thanks

@greearb
Owner

greearb commented Oct 21, 2018

This is the same bug as #38 it seems, and I have some newer debugging 9984 images in that bug report. Closing this one, will track issue in #38 instead.

@greearb greearb closed this as completed Oct 21, 2018