Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ramips: fix "no traffic" problem and DMA kernel panic at startup on rt305x and mt7628 devices #14194

Merged
merged 7 commits into from Jan 2, 2024

Conversation

Leo-PL
Copy link
Contributor

@Leo-PL Leo-PL commented Dec 10, 2023

This series fixes bootup and Ethernet support on rt305x subtarget. It combines the MAC core and switch core reset again, but using device tree properties, rather than reverting the changes introduced in 60fadae. However, reverting those changes or moving resets around in device tree wasn't enough - the wait duration after the reset needs increasing, to allow both cores to sync up after reset.

The final commit cleans up the way reset controller lines are accessed from the code, but is mostly aesthetic in nature.

As the issue was introduced before v22.03 release, this series needs backporting to 22.03 and 23.05 stable branches, possibly together with #14151, which also needs backporting to 23.05.

Build tested on: ramips/rt305x, ramips/mt76x8, ramips/mt7620
Run tested on: ZTE MF283+ (RT3352), Hame MPR-A2 (RT5350), TP-Link TL-WR802N v4 (MT7628N), Nexx WT3020 8M (MT7620)

Closes: #9284

@github-actions github-actions bot added kernel pull request/issue with Linux kernel related changes target/ramips pull request/issue for ramips target labels Dec 10, 2023
@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 10, 2023

cc @lynxis @hauke @mans0n

@neheb
Copy link
Contributor

neheb commented Dec 10, 2023

IIRC there was work to migrate mt7628 to the mainline ethernet driver. I wonder what happened.

@ptpt52
Copy link
Contributor

ptpt52 commented Dec 10, 2023

mark to follow

@Linaro1985
Copy link
Contributor

Linaro1985 commented Dec 11, 2023

IIRC there was work to migrate mt7628 to the mainline ethernet driver.

The whole problem is in the mt7628 switch. If we still use swconfig, then there are no problems. But there are difficulties with DSA, since hw is architecturally not compatible. DSA implementation is only possible via vlans. Correct me if this is not true.

@xabolcs
Copy link
Contributor

xabolcs commented Dec 11, 2023

Fixes half of #14138

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 11, 2023

@DragonBluep I'm still puzzled about those kernel changes - not the device tree. I just noticed that mt7620 has had same reset handles as rt3352 and mt7628 are changed in this PR. It makes full sense to move rt5350 and rt3050 too, but this also means that the other targets were already working with reset handles registered just like that. My PR doesn't change that logic, really, so I'd like to keep it - I'm still digging into the inner workings of reset framework.

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 11, 2023

Changes for RT3050 and RT5350 added as well - tested again with my MPR-A2 - unit boots successfully and connects to the network, albeit very slowly.
Edit: @DragonBluep I just noticed that array in devm_reset_control_array_get_optional_exclusive - now everything looks right. I think it was left that way to accomodate for both configurations. If all of subtargets are migrated this way, effectively reverting part of 60fadae - then this could be done indeed, but not otherwise. But commit description and patch location for the last patch needs to change for sure.

@mpratt14
Copy link
Contributor

should we maybe consider the function devm_reset_control_array_get_shared instead?

Perhaps resets can be added to the FE block without being removed from the ESW block?

or at least, it looks like the kernel supports that now...

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 14, 2023

I think it should work, according to kernel docs, even for 5.10 in 22.03.
Should I keep rst_esw in esw_3050.c then?

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 15, 2023

@mpratt14 @DragonBluep I applied your suggestions, please review. I'm not entirely sure if the point at which reset_control_deassert was inserted to satisfy requirements of shared reset controls is fine, and if switch reset is still effective that way, but my two devices still seem to function allright with these changes. Just in case, I still have the previous version using exclusive resets around.

@mpratt14
Copy link
Contributor

see if it works without the "optional" version of the function

apparently the "optional" series of functions simply allows the reset to be missing from devicetree without causing a big error, but we have it and there seems to be a strong functional dependency on having it.

@mpratt14
Copy link
Contributor

some good reading material if you haven't found this already

https://docs.kernel.org/driver-api/reset.html

@mpratt14
Copy link
Contributor

Around the kernel i see several uses of reset_control_status used with read_poll_timeout to have a variable sleep time, maybe that can work well instead of a large sleep value to wait for deassert.

@mpratt14
Copy link
Contributor

if you do try read_poll_timeout you can put a debug print into that macro, and set timeout to 0 when calling it, to see approximately how long the sleep really needs to be (in order to come up with a timeout), and whether or not the status function is really working

@mpratt14
Copy link
Contributor

...it may need to be the atomic version of read_poll_timeout when we declare the resets as shared...

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 15, 2023

Let me explain why I think we need to directly delete these codes. The vendor driver said, RT5350/MT7628 (SDMA) need to reset ESW and FE at the same to avoid PDMA panic. My understanding is that none of them (esw && fe) can be individually reset. So we don't need it at all here. I believe that any device only needs to be reset once after booting up. That's also the vendor driver behavior. I've tested it's safe to remove it on MT7628.

BTW, only RT305x/RT335x/RT5350/MT7628 use this driver.

Makes perfect sense. I returned to previous version, and then dropped the rst_esw related code in esw_3050.c. In this case it is not worth to play with shared resets.

if you do try read_poll_timeout you can put a debug print into that macro, and set timeout to 0 when calling it, to see approximately how long the sleep really needs to be (in order to come up with a timeout), and whether or not the status function is really working

It makes no sense to shave off microseconds of 1ms resets either, this is done once per boot.

see if it works without the "optional" version of the function

apparently the "optional" series of functions simply allows the reset to be missing from devicetree without causing a big error, but we have it and there seems to be a strong functional dependency on having it.

It did work previously, so only change needed here is adding proper messages, which I did now.

Needless to say, tested again on two of my devices.

@mpratt14
Copy link
Contributor

In this case it is not worth to play with shared resets.

it was already working as shared though, right?

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 15, 2023

No, it was working as exclusive, just not as optional, which only causes the framework to return NULL directly instead of respective error pointer.

@mpratt14
Copy link
Contributor

The way I see this is: EPHY block is within the ESW block, which is within the FE block. So whenever the FE block needs a reset, everything downstream of it needs a reset too (cleaning leftover data that would go into DMA too early?). Then the ESW and EPHY are able to reset together without having to reset FE because the dependency is the other way.

@mpratt14
Copy link
Contributor

have you tried devm_reset_control_array_get_shared or just devm_reset_control_array_get_optional_shared?

and maybe this requires adjusting which function the other driver uses to get the reset...

I don't mean to push this too much, it can always be done later. I'm mostly concerned that someone else will come along and do something similar to 60fadae again because the dts "doesn't look right"

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 15, 2023

I tried devm_reset_control_array_get_sharedl (see the tree at 297a319) but I wasn't convinced that it's entirely correct.
I'd like to err on the safer side here and trust @DragonBluep's proposal here. Of course - we might try to improve this later, but first I'd like the device functionality to be restored.

@mpratt14
Copy link
Contributor

I'm trying to talk about testing the difference between get_shared variants and get_optional_shared variants, and you only had the latter before.

But yeah, it can wait until another time.

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 15, 2023

I'm trying to talk about testing the difference between get_shared variants and get_optional_shared variants, and you only had the latter before.

But yeah, it can wait until another time.

It really only differs on the returned value in case of failure, see:
https://elixir.bootlin.com/linux/latest/source/drivers/reset/core.c#L832

@mpratt14
Copy link
Contributor

Right, and if I understand correctly... a null pointer for a reset_control causes the resetting to not actually happen, which could also be a possible case where everything is functioning ok, having almost the same effect of the case where both blocks are reset together. In other words, reset only FE -> problems, reset both FE and ESW -> good, reset nothing if null pointer -> maybe good? but we can't know whether it's actually being reset...

If you try the get_shared without optional and have a big warning or crash, then we would know for sure it's conflicting with the other driver's API calls, and the other driver would also need the same adjustments for shared resets.

But for now this PR is fine as it is, save it for later...

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 16, 2023

If I just tried "get_shared" variant without the added "deassert" calls right after getting the reset controls, I got the expected warning about calling "assert" without prior "deassert", as documented.

Then I added "deasert" calls immediately after getting the controls. But calling "deassert" twice during probe of each module might have indeed caused nothing to actually get reset, because use count would go in such sequence like this: 0, 1, 2, 1, 2, 1, 2.
If it never dropped to zero, reset framework would not trigger the reset. I could trace that by adding some prints, but I'm not sure if it's worth the effort, current solution looks more deterministic.

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Dec 17, 2023

I just remembered, that I have TP-link WR802N v4 available, which is MT7628N. So tested that as well - success. And pro forma on Nexx WT3020 (MT7620), because mtk_eth_soc driver was changed - works too.

Leo-PL and others added 7 commits January 2, 2024 21:56
Use devm_reset_control_array_get_exclusive to register multiple
reset lines in FE driver. This is required to reattach ESW reset to FE
driver again, based on device tree bindings.

While at that, remove unused fe_priv.rst_ppe field, and add error
message if getting the reset fails.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")

Co-developed-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>
Signed-off-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>

[Split out of the bigger commit, provide commit mesage, refactor error
handling]
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Enabling the FE core too early causes the system to hang during boot
uncondtionally, after the reset is released. Increate it to 1-1.2ms
range.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")
Signed-off-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>

[Split previous commit, provide rationale]
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Failing to do so will cause the DMA engine to not initialize properly
and fail to forward packets between them, and in some cases will cause
spurious transmission with size exceeding allowed packet size, causing a
kernel panic.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")
Signed-off-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>

[Provide commit description, split into logical changes]
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Failing to do so will cause the DMA engine to not initialize properly
and fail to forward packets between them, and in some cases will cause
spurious transmission with size exceeding allowed packet size, causing a
kernel panic.

This is behaviour of downstream driver as well, however I
haven't observed bug reports about this SoC in the wild, so this
commit's purpose is to align this chip with all other SoC's - MT7620
were already using this arrangement.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Failing to do so will cause the DMA engine to not initialize properly
and fail to forward packets between them, and in some cases will cause
spurious transmission with size exceeding allowed packet size, causing a
kernel panic.

This is behaviour of downstream driver as well, however I
haven't observed bug reports about this SoC in the wild, so this
commit's purpose is to align this chip with all other SoC's - MT7620
were already using this arrangement.

Fixes: openwrt#9284
Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Failing to do so will cause the DMA engine to not initialize properly
and fail to forward packets between them, and in some cases will cause
spurious transmission with size exceeding allowed packet size, causing a
kernel panic.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")
Signed-off-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>

[Provide commit description, split into logical changes]
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
The ESW core needs to be reset together with FE core, so after the
relevant reset controller lines are moved under FE, drop rst_esw and all
related code, which would not execute anyway, because rst_esw would be
NULL. While at that, ensure that if reset line for EPHY cannot be
claimed, a proper error message is reported.

Fixes: 60fadae ("ramips: ethernet: ralink: move reset of the esw into the esw instead of fe")

Co-developed-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>
Signed-off-by: Maxim Anisimov <maxim.anisimov.ua@gmail.com>

[Split out of the bigger commit, provide commit mesage, refactor error
handling]
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
@openwrt-bot openwrt-bot merged commit f393ffc into openwrt:main Jan 2, 2024
3 checks passed
@hauke
Copy link
Member

hauke commented Jan 2, 2024

Could you please create a pull request for 23.05 and 22.03 with these changes.

@Leo-PL
Copy link
Contributor Author

Leo-PL commented Jan 2, 2024

On my way.
Edit: created #14324 and #14325, cc @hauke, @mpratt14 and @DragonBluep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernel pull request/issue with Linux kernel related changes target/ramips pull request/issue for ramips target
Projects
None yet
Development

Successfully merging this pull request may close these issues.

broken rt5350 ethernet after ralink reset controller etc patchset
9 participants