Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

platform/surface: aggregator: Defer probing when serdev is not ready #152

Merged
merged 2 commits into from
Apr 30, 2024

Conversation

phreer
Copy link
Contributor

@phreer phreer commented Apr 21, 2024

Failure of probing of SSAM serial hub was observed (linux-surface/linux-surface#1271) from time to time due to race condition where the underlying uart device could be not ready at the point calling serdev_device_open(). Specifically, the failing path is

ssam_serial_hub_probe()
    => serdev_device_open()
        => ctrl->ops->open() // this is ttyport_open()
            => tty->ops->open() // this is uart_open()
                => tty_port_open()
                    => port->ops->activate() // this is uart_port_activate()
                        Find bit UPF_DEAD is set in uport->flags and fail with errno -ENXIO.

Retrying upon failure in opening serdev device would greatly alleviate this situation. Tested with my Surface Pro 5 rebooting for 10+ times and didn't see the issue anymore.

I admit this solution might be suboptimal, but it does help cure the issue that users lost their battery indicator or touchpad functionality randomly.

@qzed
Copy link
Member

qzed commented Apr 21, 2024

Thanks for the PR (and especially the breakdown of what goes wrong). But as you said, I think we can do better: Generally, in cases like this the driver should return EPROBE_DEFER. This tells the device driver core that the driver is not yet ready and missing some dependency, and that it should try again later. Could you try something like this?

if (status == -ENXIO)  // or whatever is returned by serdev_device_open() when it fails due to not being ready
	return -EPROBE_DEFER;
else if (status)
	return status;

Apart from me not knowing the specific error code, the reason I haven't added this yet is that I don't know where specifically this should be addressed. I.e., should serdev_device_open() return EPROBE_DEFER or should we check the error code and do it ourselves in the SAM driver?

Due to that, my current suggestion for a workaround would be to build-in the required modules (8250_dw and pinctrl ones) and avoid the race condition in that way (like Fedora and Ubuntu/Debian already do). This seems to have worked here and I was planning on doing that today. Right now, I'm also not sure whether this could also fail due to pinctrl dependencies not being present. Some users have reported that the buttons (soc-button-array) on Arch do not work due to that. And if pinctrl isn't there, the gpiod_count() or ssam_irq_setup() calls might fail too. So we probably want to build in at least those anyways.

If I remember correctly Hans de Goede mentioned that (on x86?) the pinctrl stuff should be always built in (or at least present in the initramfs). And since I'm already mentioning him:

@jwrdegoede: Maybe you know what the proper way of returning EPROBE_DEFER above is?

Edit: Looks like I've been mistaken and Fedora doesn't include all the pinctrl modules. But it looks like they are all present in the initramfs.

@qzed
Copy link
Member

qzed commented Apr 21, 2024

@phreer: I've opened a PR for building in the modules at linux-surface/linux-surface#1426. In case we merge this before you get around to testing with EPROBE_DEFER, please make sure that you do not build in those modules.

@phreer
Copy link
Contributor Author

phreer commented Apr 21, 2024

So serdev_device_open() returning error is only part of the story, there might be somewhere else we get an error. OK, looks like linux-surface/linux-surface#1426 is a more reasonable workaround for the time being. I'd love to investigate how to return EPROBE_DEFER thing properly and figure out a long-term solution for this issue :).

@qzed
Copy link
Member

qzed commented Apr 21, 2024

I think the proper way is probably a mix of both: Build in the pinctrl modules or have them in the initramfs (as a lot of stuff relies on them) but don't do that for the serial ones as they are a bit more specific. Then return EPROBE_DEFER (probably in the SAM driver) if that works to make it re-try if the serial driver isn't there yet.

@phreer
Copy link
Contributor Author

phreer commented Apr 21, 2024

I just got basic idea how deferred driver probing works in kernel and will play around with it soon, hopefully enable this functionality for SAM driver.

But I don't get the point why we still require modules like pinctrl to be built in even if deferred probing is supposed.

@qzed
Copy link
Member

qzed commented Apr 21, 2024

But I don't get the point why we still require modules like pinctrl to be built in even if deferred probing is supposed.

Returning EPROBE_DEFER for serdev_device_open() only helps for the serial driver. So anything that relies on the pinctrl drivers being present might still fail.

@phreer
Copy link
Contributor Author

phreer commented Apr 21, 2024

Yeah, of course. But what if we also return EPROBE_DEFER when resources like GPIO are not ready during ssam_serial_hub_probe()? IIUC, your point is that we can hardly get an exhaustive list of cases where we need defer probing instead of returning real error. Maybe we could just include all these cases gradually, starting from the most obvious ones.

@jwrdegoede
Copy link
Contributor

jwrdegoede commented Apr 21, 2024

@jwrdegoede: Maybe you know what the proper way of returning EPROBE_DEFER above is?

Generally speaking if you have some sort of get() function for a resource and the resource is not yet ready then you would expect that get() function to return -EPROBE_DEFER and the caller will just propagate it.

This is also where using dev_err_probe() is useful for logging errors from such get() functions during probe since it will turn the msg into a dbg message when the error code is -EPROBE_DEFER.

As for this specific case, where does the serdev the driver tries to open come from ? I would expect this to be created by the tty subsystem, specifically by acpi_serdev_register_devices() from drivers/tty/serdev/core.c based on some ACPI device having a UartSerialBus resource pointing to the port.

Since this serdev is instantiated at the time the tty subsystem is probing it I find it weird that when your serdev driver (I assume you are using a serdev driver?) tries to bind to it that it is not ready yet.

I think it might be best to discuss this on the linux-serial mailinglist. Feel free to Cc me.

@qzed
Copy link
Member

qzed commented Apr 21, 2024

@phreer

Yeah, of course. But what if we also return EPROBE_DEFER when resources like GPIO are not ready during ssam_serial_hub_probe()? IIUC, your point is that we can hardly get an exhaustive list of cases where we need defer probing instead of returning real error. Maybe we could just include all these cases gradually, starting from the most obvious ones.

I think that would be the most ideal solution. But pinctrl drivers are spread out quite a bit, so I'm not really sure what all would need to be done/checked here.

@qzed
Copy link
Member

qzed commented Apr 21, 2024

@jwrdegoede

As for this specific case, where does the serdev the driver tries to open come from ? I would expect this to be created by the tty subsystem, specifically by acpi_serdev_register_devices() from drivers/tty/serdev/core.c based on some ACPI device having a UartSerialBus resource pointing to the port.

Yes, the SAM device in ACPI links to UartSerialBusV2 resource, so that's where the serdev comes from.

Since this serdev is instantiated at the time the tty subsystem is probing it I find it weird that when your serdev driver (I assume you are using a serdev driver?) tries to bind to it that it is not ready yet.

It is a serdev driver, yes. From the reports I've got, building in 8250_dw and the intel LPSS drivers does provide a functioning workaround. So due to that, to me it looks like there is some stuff going on in parallel. Like some parts of serdev are registered so that our driver can probe, but the underlying provider of the port is not yet ready. I know too little of the underlying things at work here to say what exactly is happening though.

I will try to look into it a bit more and then send a mail to linux-serial. Thanks!

@phreer phreer changed the title platform/surface: aggregator: Retry to open serdev device on failure platform/surface: aggregator: Defer probing when serdev is not ready Apr 22, 2024
@phreer
Copy link
Contributor Author

phreer commented Apr 22, 2024

Hi @qzed, me again. As you suggested, I tried the way of deferring probing and it worked as expected.

Only deal with serdev case for now. Let's be patient and identify more race condition points in the future.

Copy link
Member

@qzed qzed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would look a bit nicer if you do something like

if (status == -ENXIO)
	status = -EPROBE_DEFER;
if (status) {
	ssam_err_probe(...);
	goto err_devopen;
}

You'd have to add ssam_err_probe() similar to the other print macros though with dev_err_probe() as underlying print function.

@phreer
Copy link
Contributor Author

phreer commented Apr 23, 2024

@qzed Sure! Do you think it is a good idea to add more messages upon error, e.g., gpiod_count? (As IMHO it's kind of lacking context in dmesg when something unexpected happen.)

@qzed
Copy link
Member

qzed commented Apr 23, 2024

@phreer I think that could be quite helpful, yes.

Copy link
Member

@qzed qzed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good to me.

@phreer phreer force-pushed the ssh-retry-on-failure branch 2 times, most recently from 6f0f68f to 59c93ed Compare April 24, 2024 15:06
@phreer
Copy link
Contributor Author

phreer commented Apr 24, 2024

Hope those error messages in the new commit would help!

Edit: I tested these changes with my SP5, all looking good.

@phreer
Copy link
Contributor Author

phreer commented Apr 27, 2024

Hi @qzed , I've completed my code and would appreciate guidance on how to proceed further.

@qzed
Copy link
Member

qzed commented Apr 27, 2024

Hi, sorry for the delay. I've been a bit busy with day-time work the last days again.

I'm not a big fan of moving the controller init to the front, I'd prefer if we can keep the order (with smaller/lighter init stuff in the beginning). Just use the dev_ functions for logging with that. And thinking about that: I guess we also don't really need the ssam_err_probe() because serdev->dev is directly available here. Sorry for not noticing that earlier.

Apart from that I think it looks good.

I think you could also submit the EPROBE_DEFER patch as RFC to the pdx86 and linux-serial mailing lists to get the discussion started on that, if you want. I'm not sure when I will find the time for that, as I will be away for a large part of May and have some stuff to take care off before.

@phreer
Copy link
Contributor Author

phreer commented Apr 28, 2024

Sure, it makes sense to let light staffs go first so we can fail fast. I'm pleased to start a discuss the EPROBE_DEFER patch on mailing list. Lastly, I appreciate your assistance despite your busy schedule. Thank you!

This is an attempt to alleviate race conditions in the SAM driver where
essential resources like serial device and GPIO pins are not ready at
the time ssam_serial_hub_probe() is called.  Instead of giving up
probing, a better way would be to defer the probing by returning
-EPROBE_DEFER, allowing the kernel try again later.

However, there is no way of identifying all such cases from other real
errors in a few days.  So let's take a gradual approach identify and
address these cases as they arise.  This commit marks the initial step
in this process.

Signed-off-by: Weifeng Liu <weifeng.liu.z@gmail.com>
Emits messages upon errors during probing of SAM.  Hopefully this could
provide useful context to user for the purpose of diagnosis when
something miserable happen.

Signed-off-by: Weifeng Liu <weifeng.liu.z@gmail.com>
Copy link
Member

@qzed qzed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with it! Looks good to me now.

@qzed qzed merged commit 43ef589 into linux-surface:v6.8-surface-devel Apr 30, 2024
qzed added a commit to linux-surface/linux-surface that referenced this pull request Apr 30, 2024
Since [1] we should not require the serial drivers to be built in any
more. In essence, the PR causes the SAM driver to return EPROBE_DEFER
when the serial device is not ready.

Specifics on how to best handle this (i.e., which function should
actually return EPROBE_DEFER) are still to be discussed, but for now
this should provide a better workaround that hopefully fixes this
problem as well. So let's test it.

[1]: linux-surface/kernel#152
qzed added a commit to linux-surface/linux-surface that referenced this pull request Apr 30, 2024
Changes:
 - Fix SAM driver probe failure when UART is not ready yet (@phreer).

Links:
 - kernel: linux-surface/kernel@43ef589
 - PR: linux-surface/kernel#152
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants