Firmware crash with high iperf traffic QCA9888 #30
Comments
The binary dump was empty as far as I can tell, and the text dump shows an assert in the rate-ctrl logic, but the debug logging before that assert was not in your .txt file. Please reproduce with firmware dbglog logging enabled, and show me the text several hundred lines before and after the crash. And maybe you will have better luck with the binary dump next time! I'll attach the latest 9888 binary; please reproduce with it, as that will make it more convenient for me to debug. For better logging: https://www.candelatech.com/ath10k-bugs.php |
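The request above for "several hundred lines before and after the crash" can be scripted on the host that collects the logs. The following is only a minimal sketch: the function name `crash_context`, the default window of 300 lines, and the marker string `"firmware crashed"` are assumptions for illustration, so adjust the marker to whatever text your own crash log actually contains.

```python
def crash_context(lines, marker="firmware crashed", before=300, after=300):
    """Return the lines around the first line that contains `marker`.

    `marker` is a hypothetical default; pass the string that your own
    kernel/firmware log prints when the crash is reported.
    """
    for i, line in enumerate(lines):
        if marker in line:
            start = max(0, i - before)        # clamp at the start of the log
            return lines[start:i + after + 1]  # slicing clamps at the end
    return []  # marker not found anywhere in the log
```

To use it, read the saved log into a list with `open(path, errors="replace").read().splitlines()`, pass that list to `crash_context`, and attach the returned lines to the bug report.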
firmware-5-full-htt-mgt-community.bin.gz Please test with this. It is for the 9888/9886 NIC. |
Thanks for your answer. Attached new log with fw dbglog enabled and new binary dump. |
Hello, First, the logs show some errors in the power-save logic. It is warning about things being out-of-sync and then it should be taking recovery actions. The crash is due to the rate-ctrl switching between VHT and HT, it seems. I previously added an assert in that case because at least as part of normal rate-ctrl logic, it should not happen. But, other logs show what appears to be the driver setting a station's rate-ctrl to HT20. Possibly it used to be VHT and this is causing the issues. I'm adding some extra debugging and will attach a new binary for testing. Are you setting any specific rates for these stations, especially after they have initially associated? What station device(s) are you using for this test? Anything else interesting about this test case? |
Hello, the stations are USB dongles (TP-Link N900 and TP-Link AC1300) connected to Raspberry Pi 2 boards. No rates are set explicitly. |
Thank you for the logs and info. I have added more debugging to the firmware, please re-run with the attached firmware and send me the crash files. |
Hi, |
It seems my comment from yesterday did not get posted for some reason. My previous attempt at debugging was faulty it seems. I think it crashed trying to gather debugging and in the end, I did not get useful info. Please try the attached image and resend logs. |
Hi, |
I still do not see how the rates got out of sync, so please retest with the attached binary. It has some more error checking and might catch the place where the invalid rate is first set. If possible, send me 'dmesg'-like output from the entire boot, but I am not sure your platform can do that. |
Hi, I have tested with the last firmware you sent. Attached is a file with the dmesg output you requested. |
Ok, so getting closer. It seems we probe on HT rates when we are able to do VHT. The firmware assumes that will not happen. |
Hi, |
Sorry, that one had a different mistake in the debugging code. Please try the attached. |
Hi, attached are the logs from the last binary. |
Ok, so rate ctrl is definitely probing for an HT rate when in VHT mode. Not entirely sure why yet, so here is a build with more debugging. Please retest. |
Attached new logs. |
It seems that the ni_flags and/or phymode might be out-of-sync and/or corrupted. Attached is another firmware to help determine if this is the cause. |
Hi, attached are the logs from the last binary. |
At least in this case, phymode seems to be, and always to have been, HT20, and the rate-mask matches. But somehow, something was able to set the 'maxRate' to VHT. I found another place that was setting this value without the debug/assert guards, so I made it call the debugging logic too. Please try the attached. |
Hi, attached are the logs from the last binary. This binary crashes when clients connect. |
The code was incorrectly setting a 160 MHz rate when the peer was only 80 MHz. That particular assert should be fixed in this new build, but I am not sure whether that was the root cause of the initial problem. Please retest. |
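The two invariants discussed in this thread (no rate in a different PHY family than the peer's phymode, and no rate wider than the peer's bandwidth) can be pictured as a small consistency check. This is only an illustrative sketch, not the firmware's code: `Peer`, `Rate`, `check_rate`, and their fields are hypothetical names, and the real rate-ctrl structures are not shown anywhere in this issue.

```python
from dataclasses import dataclass

HT, VHT = "HT", "VHT"  # PHY families mentioned in the thread


@dataclass
class Peer:
    phymode: str     # HT or VHT (hypothetical field)
    max_bw_mhz: int  # widest channel the peer supports, e.g. 20, 80, 160


@dataclass
class Rate:
    mode: str    # HT or VHT
    bw_mhz: int  # channel width this rate would use


def check_rate(peer, rate):
    """Return a list of violated invariants; an empty list means consistent."""
    problems = []
    if rate.mode != peer.phymode:
        # The firmware in this thread asserts when the two families are mixed.
        problems.append(f"{rate.mode} rate selected for {peer.phymode} peer")
    if rate.bw_mhz > peer.max_bw_mhz:
        problems.append(
            f"{rate.bw_mhz} MHz rate selected for {peer.max_bw_mhz} MHz peer"
        )
    return problems
```

For example, `check_rate(Peer(VHT, 80), Rate(VHT, 160))` reports the bandwidth mismatch described above, while a matching 80 MHz VHT rate yields an empty list. Returning a list instead of asserting is a design choice for this sketch, so that every violation can be logged before any recovery action is taken.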
Hi, |
Either I gave you a bad firmware, or you loaded an old one. Please try the attached. Its md5sum is
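To rule out loading a stale image, the checksum of the file actually installed on the device can be compared with the one quoted for the attachment. A minimal sketch using Python's standard `hashlib` follows; the helper name `md5sum` and the chunk size are assumptions, and the firmware path is whatever location your platform loads the image from.

```python
import hashlib


def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large firmware images are not
        # loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the returned string differs from the checksum given for the attachment, the device is still running a different binary than the one under test.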
Hi, |
I have added debug code everywhere I can think of that could be setting the invalid rate that causes the assert, but the debug code does not catch the problem. So, maybe the problem is a write to a bad memory address or something like that. I have other tools to investigate this, but it is sort of a 'bisect' of the code and will probably take lots of iterations of me sending you new builds and you reporting back results (similar to what we have been doing). If you are willing to do this, then I will work on the implementation and send you a new binary for testing soon. |
Please let me know if you are interested in pursuing debugging of this; otherwise I'll close the bug and can re-open it if the problem is reported again. |
Hi, yes I can test the binaries. Sorry for the delay |
I have no easy way to test 9888 images, will you be able to test this with 9984 radios instead? That way I can at least spot-check that the binaries work before giving them to you... |
I won't be able to test on 9984 radios for a couple of weeks. After that, yes. |
driver: ath10k-4.13 CT
firmware: CT full-community 10.4-ct-9888-fH-011-cf79c7f
Chipset: QCA9888. It also happens on QCA9884
OS: OpenWrt 18.06-SNAPSHOT r0+6957-e060fdfc08
Issue description: firmware crash with high traffic load, generated using iperf (32 connected stations).
I have also tried the no-htt-mgt firmware and previous versions, and they crash as well.
The stock firmware doesn't crash.
Attached crash log and binary dump.
fw_crash_log.txt
crash_dump.tar.gz