-
Notifications
You must be signed in to change notification settings - Fork 342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mt76 crash issue, and proposed resolution #763
Comments
Did you reproduce that sta_remove with idx=0 without your change that handles removal in sta_pre_rcu_remove? |
@nbd168 I split and reworked the patches leaving the sta_remove( ) and sta_pre_rcu_remove( ) back in their original functions. I believe my original assessment of sta_remove being called twice was actually incorrect. Let me explain. Yesterday I found that mac80211 is actually calling just the sta_pre_rcu_remove( ) function when a wireless station flaps quickly, without calling sta_add( ) or sta_remove( ). I discovered a sta pointer value passed to sta_pre_rcu_remove( ) that was never used with sta_add or sta_remove, and that pointer contained the wcid->idx == 0. Perhaps sta_info_destroy_part1 should not call the sta_pre_rcu_remove( ) function unless sta->uploaded is set? That way sta_pre_rcu_remove( ) and sta_remove( ) are always called together, instead of potentially with split logic. In the meantime, I am testing the patches I sent you today (and will for a couple of days) to make sure they are solid. |
@nbd168 Oh, one more item that I didn't see in sta_remove( ), is removing the sta from the dev->sta_poll_list. I don't think this is causing any crashes, but if so, they might be very rare. Is there any possibility that an sta pointer could get added to the dev->sta_poll_list and then the sta_remove( ) called before the sta is processed from the list? That would potentially lead to either a poll on a different sta device (if a new subsequently allocated pointer was the same as the one freed), or a crash if the memory is no longer valid. It was just something that popped up in my head today. |
I don't think there is any race with the dev->sta_poll_list removal, and here's why:
Regarding your other fixes:
I do agree that sta_pre_rcu_remove should only be called if sta->uploaded is set, we should fix that.
|
Ok. I will remove those two patches and create a new one for mac80211. |
In case you are interested, I captured more logs showing the idx 0. I caught a station swapping between the 5Ghz and 2.4Ghz bands, with a lone pre_rcu_remove( ) called. This supports the idea of only calling sta_pre_rcu_remove if uploaded is set. I highlighted what I think are the important lines near the bottom. wl0-ap0 is the 2.4Ghz and wl1-ap0 is the 5Ghz: Mar 22 16:22:36 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: authenticated |
@nbd168 Going to dig into ieee80211_calc_expected_tx_airtime( ) to see if I can find it. <3>[209361.647990] mt7622-wmac 18000000.wmac: sta_pre_rcu_remove(0x0000000000d7314c)=idx8 <1>[209390.417998] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000b90 |
Did you get a idx0 case with sta_remove or only with sta_pre_rcu_remove? Regarding the crash, I assume that the vif pointer is NULL for some reason. |
I missed one line:
I did not see an idx 0 within an hour of this happening. I'm trying to make heads or tails how the stack went from mt76_tx_complete_skb( ) to mt76_tx_dequeue( ), which is the only place I can find that calls ieee80211_tx_dequeue( ) |
@nbd I'm working with three of these units and it's possible that one had the previous flash where sta_remove and pre_rcu_remove combined. I just reflashed it to be sure. I'll let you know if it has any further crashes over the next few days. |
@nbd168 This happened this morning. It's the third time I've seen this crash in the last few months.
Do you think the tasklet_disable( ) should be moved up to the top?
Nothing happened before this event (i.e. no idx == 0 or any other warnings) |
One thing you need to understand about these traces is that symbols are only resolved for non-static functions. Because of that, you can't rely on the function name alone. Offsets are provided from the start of the nearest symbol, but if you resolve the actual line number, you could land in a static function somewhere else.
...
So in this case it looks like it might be an instance of sta->sdata being NULL right in the middle of the rx path. I need to do some digging to figure out if this is something that the driver needs to deal with (e.g. rcu related issue), or if it's a mac80211 bug. |
aarch64-openwrt-linux-gdb! I think that might be the piece I've been looking for. Thank you! |
Please add this patch instead of the extra spinlocks: https://nbd.name/p/4ece1b4f |
@nbd168 Np, I'll do that right now. Noone is in the office over the weekend so they won't crash, but I'll report back on Monday. |
@nbd168 Wait a minute, if I add that patch, the sta_remove( ) will return immediately because wcid->sta is now 0. Should I take out the sta_remove( ) !wcid->sta check? EDIT I did remove the extra check in sta_remove so it will not skip sta_remove( ). I'll report back Monday. |
@nbd168 It crashed within an hour this morning. <1>[220470.265394] Unable to handle kernel paging request at virtual address 00333a3231203732 |
Could you please resolve the line numbers of ieee80211_sta_ps_transition+0x448 and ieee80211_rx_list+0x43c for me? |
@nbd168 Let me know if I did this properly:
|
@nbd168 I may have spotted a possibility:
There's a lot of txq purging going on before release_buffered_frames( ) is guaranteed to be stopped. If this is an issue, looks like a potential fix would be to move the cancel_work_sync( ) up to before the tx queues are purged... |
mark. Congrats for the interesting work. |
@Brain2000, the critical parts of ieee80211_txq_purge and ieee80211_tx_dequeue are protected from racing against each other via fq->lock. |
@nbd168 Ah, I was confused how the skb was being protected after returned from ieee80211_tx_dequeue( ). I see now that it is removed from a list and returned, so any txq_purge would skip over it because it's no longer in the list. I'm still trying to figure out this crash. I have two identical wifi units, one compiled with the extra two spinlocks, and one without. The one with the spinlock does not crash. The one without crashes every few hours. Any luck finding what the true cause of this is? The ieee80211 documentation states "do not call ieee80211_tx( ) or ieee80211_rx_list( ) concurrently", but you seem very sure that this is not an issue. |
Good point regarding concurrency of ieee80211_tx_status and ieee80211_rx_list. Though I can't point at any specific racy codepath, I do agree that we should protect against concurrent status/rx calls. However, I don't think we need a separate spinlock for this. |
@nbd168 That patch looks good! I found one more crash underneath all of this in ieee80211_rx_list( ) as it calls the xxxx_8023( ) routines between mac80211 and ethernet. I believe that when a STA disconnects but has a multicast packet in the queue, possibly something in the SKB becomes wiped out before drv_sta_pre_rcu_remove( ) is called. Perhaps at the top of __sta_info_destroy_part1( ), or deeper in. I'm adding various WARN( ) sanity checks in strategic places within net/mac80211/rx.c and net/ethernet/eth.c until I can pinpoint what is being null'd, then perhaps this can be corrected. Note the line below where it thinks the departing station has a TX queue size of 0... AFTER the station has left. Followed by an immediate crash. This is the first time I've seen this entry.
|
@nbd168 Final item: You mentioned this may also be needed to avoid a race condition:
|
has anything happened here yet? I also have a rt3200 and problems with the vlan, maybe this could solve the problem |
I've started getting this crashlog #763 (comment) after I enabled FT over DS. It appears FT over the Air doesn't have this issue. Edit: this is with latest mt76/mac80211 from snapshot. |
Should improve performance/reliability with lots of mcast packets Signed-off-by: Felix Fietkau <nbd@nbd.name>
Greetings,
I've been chasing down an mt76 crash issue on a Belkin RT3200 for the past month.
It happens once or twice a day when many devices are roaming.
The stack traces were not always consistent, but here was one that I pulled:
I found that after adding a potentially missing spin_lock in a mac_sta_poll( ) function, and moving the sta_remove to the sta_pre_rcu_remove with a synchronize_rcu at the end, I was able to get a consistent stack trace, so, that seemed like good progress.
Here is that stack trace:
After a bit more debugging, I believe I discovered the root cause.
mac80211 is calling sta_remove( ) with an already removed 80211_sta pointer. This happens when a wifi device rapidly connects and disconnects over a very short period. Five minutes after the rapid connect/disconnect event (probably the inactivity timeout), mac80211 then calls the sta_remove( ) function a 2nd time with the same 80211_sta pointer. The drv_priv portion of this 80211_sta structure is zeroed out and that's where things go terribly wrong.
Such as a kernel oops due to the poll_list/rc_list lists having an invalid dereference to 0. Or depending on the timing, if it gets past that portion, then idx 0 ends up being targeted, which is needed for the beacon/frame management. This causes all kinds of weird timeout errors in the log followed by a spectacular crash.
Here is a snippet of logs that shows what is happening:
I believe the correct fix here is to check for if (!wcid->idx && !wcid->sta) return; as the first line in the sta_remove( ) function
@Fail-Safe has also been compiling various commits along the way to assist in verifying that this is fixed.
I am offering up a PR to fix this, and I hope we can get it so it meets your level of standards.
The text was updated successfully, but these errors were encountered: