sled-agent: Move IPCC calls off of tokio worker threads? #9721

@jgallagher

In #9720, we encountered an OS bug that caused IPCC ioctls to hang indefinitely. In sled-agent, we issue these ioctls directly from tokio worker threads, so several worker threads got stuck, and we eventually hit #9619, where the entire runtime blocked even though some worker threads were still parked / idle. #9619 proposes the general workaround of periodically spawning a new task into the runtime, which unsticks a runtime that is stuck because the one thread responsible for I/O is blocked polling the future that woke it up. However, it seems unlikely that this would have helped much in the #9720 case. It probably would only have delayed the sled-agent hang, because eventually we would have issued enough IPCC calls to hang all the worker threads.

I'm inclined to say we should treat IPCC calls as "blocking I/O" calls and put them in spawn_blocking. That description seems pretty accurate, since we're doing I/O over a UART to the SP (and/or RoT, depending on the IPCC command). But I'm not sure what that would do in a case like #9720: if every IPCC call hangs, would we eventually exhaust the spawn_blocking pool? Presumably sled-agent would remain generally responsive (except in paths that depend on those IPCC calls?), but what would happen in the limit?
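
One way to reason about the "in the limit" question is to put an explicit cap on outstanding blocking calls, so that a run of hung calls fails fast instead of piling up threads. The sketch below is only an illustration and is not sled-agent code: it uses `std` threads as a stand-in for tokio's blocking pool, and `BlockingLimiter`, `CallError`, and `Slot` are hypothetical names. In sled-agent the equivalent would presumably be `tokio::task::spawn_blocking` wrapped in `tokio::time::timeout`. As far as I understand the tokio docs, the blocking pool is capped by `Builder::max_blocking_threads` (512 by default), and closures submitted after the cap is reached are queued until a thread frees up, so permanently hung calls would eventually stall every later `spawn_blocking` call too.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Why a guarded blocking call did not produce a value.
#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    /// Too many earlier calls are still outstanding (probably hung).
    Busy,
    /// The call did not finish in time; its thread is still blocked.
    TimedOut,
    /// The closure panicked before returning a value.
    Panicked,
}

/// Releases one in-flight slot when dropped, including during a panic unwind.
struct Slot(Arc<AtomicUsize>);

impl Drop for Slot {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Caps how many blocking calls (hypothetically, IPCC ioctls) may be
/// outstanding at once.
pub struct BlockingLimiter {
    in_flight: Arc<AtomicUsize>,
    max_in_flight: usize,
}

impl BlockingLimiter {
    pub fn new(max_in_flight: usize) -> Self {
        Self { in_flight: Arc::new(AtomicUsize::new(0)), max_in_flight }
    }

    /// Number of calls that have started and not yet returned.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::SeqCst)
    }

    /// Runs `f` on its own thread and waits up to `timeout` for the result.
    /// A call that times out keeps its slot until `f` actually returns.
    pub fn call<T, F>(&self, timeout: Duration, f: F) -> Result<T, CallError>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        // Reserve a slot, backing out if the cap was already reached.
        if self.in_flight.fetch_add(1, Ordering::SeqCst) >= self.max_in_flight {
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            return Err(CallError::Busy);
        }
        let slot = Slot(Arc::clone(&self.in_flight));
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let out = f();
            // Free the slot before reporting, so a caller that sees the
            // result also sees the slot as released.
            drop(slot);
            let _ = tx.send(out);
        });
        match rx.recv_timeout(timeout) {
            Ok(out) => Ok(out),
            Err(RecvTimeoutError::Timeout) => Err(CallError::TimedOut),
            Err(RecvTimeoutError::Disconnected) => Err(CallError::Panicked),
        }
    }
}
```

With a cap of, say, 2, a call such as `limiter.call(Duration::from_secs(5), || 21 * 2)` returns `Ok(42)`, two calls that never return each yield `Err(CallError::TimedOut)`, and the next call yields `Err(CallError::Busy)` right away. The hung threads are leaked either way, so this bounds the damage but does not recover from it. The design choice is to surface the hang as an error on the paths that need IPCC rather than let it spread to the rest of the process.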
