Comments say that SeekableHttpReaderEngine can deadlock #50

Closed

djmitche opened this issue Nov 10, 2023 · 7 comments

@djmitche (Contributor)

// Mutex invariant below: only ONE of these mutices may be claimed at once.
// Ideally we'd enforce this in the type system
// per https://medium.com/@adetaylor/can-the-rust-type-system-prevent-deadlocks-9ae6e4123037
pub(crate) struct SeekableHttpReaderEngine {
/// Total stream length
len: u64,
/// Facilities to read from the underlying HTTP stream(s)
reader: Mutex<ReadingMaterials>,
/// Overall state of this object, mostly related to the readahead cache
/// of blocks we already read, but also with the all-important boolean
/// stating whether any thread is already reading on the underlying stream.
state: Mutex<State>,

Yet:

    fn read(&self, buf: &mut [u8], pos: u64) -> std::io::Result<usize> {
        // ...
        // Claim CACHE mutex
        let mut state = self.state.lock().unwrap();
        // ...
        //     claim READER mutex
        let mut reading_stuff = self.reader.lock().unwrap();
        //     release STATE mutex          <---- both mutexes are held here, with state claimed first
        drop(state);
        // ...
        while pos >= *reader_pos {
            // ...
            let mut state = self.state.lock().unwrap();
            // ... <---- both are held, with reader claimed first
        }
        // ...
        let mut state = self.state.lock().unwrap();
        // ... <---- both are held, with reader claimed first
    }

    pub(crate) fn set_expected_access_pattern(&self, access_pattern: AccessPattern) {
        let mut state = self.state.lock().unwrap();
        // ...
        if matches!(access_pattern, AccessPattern::SequentialIsh) {
            // ...
            {
                let mut reading_materials = self.reader.lock().unwrap();
                // ... <---- both are held, with state claimed first
            }
        // ...
    }
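
To make the hazard concrete, here's a minimal standalone program (not the real code; the mutex contents are just placeholders) where two threads claim the same two mutexes in opposite orders and deadlock:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let state = Arc::new(Mutex::new(0u8));
        let reader = Arc::new(Mutex::new(0u8));

        let (s, r) = (state.clone(), reader.clone());
        let a = thread::spawn(move || {
            let _state = s.lock().unwrap(); // claim STATE first...
            thread::sleep(Duration::from_millis(50));
            let _reader = r.lock().unwrap(); // ...then want READER
        });
        let b = thread::spawn(move || {
            let _reader = reader.lock().unwrap(); // claim READER first...
            thread::sleep(Duration::from_millis(50));
            let _state = state.lock().unwrap(); // ...then want STATE: deadlock
        });

        a.join().unwrap(); // never returns
        b.join().unwrap();
    }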

I expect that the "invariant" was written to avoid exactly this deadlock? It's a little too strong -- all you need is to claim the mutexes in the same order everywhere. Of course, it's only strong if it's actually followed :)

        // Cases where you have STATE but want READER: near the start
        // Cases where you have READER but want STATE: after read,
        // ... but this deadlock can't happen because only one thread                                                  
        //     will enter this 'read in progress' block.

The way this is written also looks like it's trying to accomplish "critical sections" -- mutual exclusion of pieces of code. The Rust Mutex isn't designed for that; it protects data, not code, so using it this way relies on some head-scratching by the developer. In this case, I think the "only one thread" assertion is saying that only one thread holds the STATE mutex, waits on the condvar (releasing but then re-acquiring the mutex), and then enters its own read-in-progress block. That's equivalent to saying that "this deadlock" can't happen because only one thread can hold the STATE mutex at a time -- which is tautological and doesn't imply the deadlock cannot happen.

One way to enforce an order-of-acquisition invariant in the type system is to put one mutex inside the other, e.g.,

    state: Mutex<(State, Mutex<ReadingMaterials>)>,

That particular way of writing it is not very ergonomic, but maybe introducing some additional types to model the situation as protecting data instead of critical sections would help.
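
For example (illustrative names only, not a proposal for the actual API), the nesting makes state-then-reader the only acquisition order the compiler will accept:

    use std::sync::Mutex;

    struct State; // placeholder for the cache/bookkeeping
    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct Engine {
        // The reader mutex can only be named through a guard on the
        // state mutex, so the acquisition order is fixed by construction.
        inner: Mutex<(State, Mutex<ReadingMaterials>)>,
    }

    impl Engine {
        fn with_reader<R>(&self, f: impl FnOnce(&mut ReadingMaterials) -> R) -> R {
            let outer = self.inner.lock().unwrap(); // state first, always
            let mut reader = outer.1.lock().unwrap(); // then reader
            f(&mut reader)
        }
    }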

I can take a crack at trying to refactor this to be more correct, if I haven't missed some major piece of information in the above.

@adetaylor (Collaborator)

Ha. I considered trying to model this in the type system before declaring 1.0, on the assumption that I'd have got something wrong here. I should have!

The comment is at least wrong.

However, a nested mutex probably won't solve things. Usually we want to claim state first, but the thread doing an actual network read might block for a long time holding the reader mutex, and we don't want to prevent other threads from doing local reads from the cache (using state).

I'll do some more thinking about this later!

@djmitche (Contributor, Author)

One option may be to put a currently_reading: Option<ReadingMaterials> in the State. A thread wanting to read could take this value while holding the state mutex, then drop the mutex. If currently_reading is None, then another thread is already reading and this one must wait. That waiting could be accomplished with the condition variable.
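
A rough sketch of what I mean, with a condvar next to the state mutex (all names here are made up):

    use std::sync::{Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        currently_reading: Option<ReadingMaterials>,
        // ... readahead cache etc. ...
    }

    struct Engine {
        state: Mutex<State>,
        read_finished: Condvar,
    }

    impl Engine {
        fn network_read(&self) {
            let mut state = self.state.lock().unwrap();
            // Take the reader out of the shared state; if it's gone,
            // another thread is mid-read, so wait until it's returned.
            let reader = loop {
                if let Some(r) = state.currently_reading.take() {
                    break r;
                }
                state = self.read_finished.wait(state).unwrap();
            };
            drop(state); // no mutex held during the (possibly slow) read

            // ... perform the network read using `reader` ...

            // Put the reader back and wake any waiting threads.
            self.state.lock().unwrap().currently_reading = Some(reader);
            self.read_finished.notify_all();
        }
    }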

Anyway, let me know your thoughts. If I get a chance, I'll have a deeper look at this code and see if I can understand it deeply enough to refactor.

@adetaylor (Collaborator)

I'm working on it.

@adetaylor (Collaborator)

OK I've convinced myself that it's fine. Let me try to persuade you.

There is effectively one main mutex, the state mutex.

Guarded by the state mutex is read_in_progress, a Boolean.

There are two places where the reading_materials stuff is touched at all:

  • The second half of the read function, which can only be reached by a thread that found read_in_progress to be false and set it to true while holding the state mutex.
  • set_expected_access_pattern, which panics if there's a read in progress.

So, although you're right that the state and reader mutices are claimed in different orders, it's guaranteed that only one thread at a time will be involved in claiming the reader mutex.
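
Condensed, the gate looks something like this (just a sketch; the method and condvar names are guesses, not the real identifiers):

    use std::sync::{Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        read_in_progress: bool,
        // ... readahead cache etc. ...
    }

    struct SeekableHttpReaderEngine {
        state: Mutex<State>,
        read_completed: Condvar, // assumed name for the condvar
        reader: Mutex<ReadingMaterials>,
    }

    impl SeekableHttpReaderEngine {
        fn network_read(&self) {
            let mut state = self.state.lock().unwrap();
            while state.read_in_progress {
                // Another thread owns the stream; wait for it to finish.
                state = self.read_completed.wait(state).unwrap();
            }
            state.read_in_progress = true; // now the sole reading thread
            drop(state);

            // At most one thread can reach this lock at a time, so the
            // reader-then-state order inside the read can't cross with
            // the state-then-reader order above in another thread.
            let reader = self.reader.lock().unwrap();
            // ... network read; re-lock `state` to fill the cache ...
            drop(reader);

            let mut state = self.state.lock().unwrap();
            state.read_in_progress = false;
            self.read_completed.notify_all();
        }
    }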

Effectively, the read_in_progress Boolean is equivalent to the Option you're talking about. I think the Option is a cleaner design and I'll have a crack at it.

The comments are still definitely wrong though!

@djmitche
Copy link
Contributor Author

Ah, that does make sense, and the PR looks like a better approach!

adetaylor changed the title from "SeekableHttpReaderEngine can deadlock" to "Comments say that SeekableHttpReaderEngine can deadlock" on Nov 10, 2023
@davidcorrigan714 commented May 31, 2024

We hit a deadlock in this area of the code recently. Our server was having issues and returning incomplete responses every so often, and we ended up with a few threads stuck here. We've since fixed our web server, and I've been trying to reproduce the problem locally. I think what happened is that the HTTP error is returned here, but the reader isn't put back into state, so the other threads waiting to read are deadlocked. It's tricky to crash it on purpose with readers waiting while the other thread returns an error, though. If I can get a reproduction going, I'd probably just check whether it's an error, return the reader to the state, and then return the error. I could also try to wrap the reader in something that always fails safe when it goes out of scope, but that sounds like more work for a bit better error handling.
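
Something like this RAII guard is what I mean by fail-safe (hypothetical types, not the crate's actual API):

    use std::sync::{Arc, Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        currently_reading: Option<ReadingMaterials>,
    }

    struct Shared {
        state: Mutex<State>,
        read_finished: Condvar,
    }

    // When the guard goes out of scope -- including via an early
    // `return Err(...)` on a bad HTTP response -- the reader is put back
    // into the shared state and waiting threads are woken.
    struct ReaderGuard {
        shared: Arc<Shared>,
        reader: Option<ReadingMaterials>,
    }

    impl Drop for ReaderGuard {
        fn drop(&mut self) {
            if let Some(r) = self.reader.take() {
                self.shared.state.lock().unwrap().currently_reading = Some(r);
                self.shared.read_finished.notify_all();
            }
        }
    }

With that, the error path doesn't need to remember to return the reader; Drop does it unconditionally.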

@adetaylor (Collaborator)

Ah, good spot! Yes, I'll definitely improve the handling of the error here so that other threads don't get stuck. Work started here: #70
