Comments say that SeekableHttpReaderEngine can deadlock #50

Closed

djmitche opened this issue Nov 10, 2023 · 7 comments

@djmitche (Contributor)

// Mutex invariant below: only ONE of these mutices may be claimed at once.
// Ideally we'd enforce this in the type system
// per https://medium.com/@adetaylor/can-the-rust-type-system-prevent-deadlocks-9ae6e4123037
pub(crate) struct SeekableHttpReaderEngine {
/// Total stream length
len: u64,
/// Facilities to read from the underlying HTTP stream(s)
reader: Mutex<ReadingMaterials>,
/// Overall state of this object, mostly related to the readahead cache
/// of blocks we already read, but also with the all-important boolean
/// stating whether any thread is already reading on the underlying stream.
state: Mutex<State>,

Yet:

    fn read(&self, buf: &mut [u8], pos: u64) -> std::io::Result<usize> {
        // ...
        // Claim CACHE mutex
        let mut state = self.state.lock().unwrap();
        // ...
        //     claim READER mutex
        let mut reading_stuff = self.reader.lock().unwrap();
        //     release STATE mutex          <---- both mutexes are held here, with state claimed first
        drop(state);
        // ...
        while pos >= *reader_pos {
            // ...
            let mut state = self.state.lock().unwrap();
            // ... <---- both are held, with reader claimed first
        }
        // ...
        let mut state = self.state.lock().unwrap();
        // ... <---- both are held, with reader claimed first
    }

    pub(crate) fn set_expected_access_pattern(&self, access_pattern: AccessPattern) {
        let mut state = self.state.lock().unwrap();
        // ...
        if matches!(access_pattern, AccessPattern::SequentialIsh) {
            // ...
            {
                let mut reading_materials = self.reader.lock().unwrap();
                // ... <---- both are held, with state claimed first
            }
        // ...
    }
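
To make the hazard concrete, here's a minimal standalone program (not the real code; the mutex contents are just placeholders) where two threads claim the same two mutexes in opposite orders and deadlock:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let state = Arc::new(Mutex::new(0u8));
        let reader = Arc::new(Mutex::new(0u8));

        let (s, r) = (state.clone(), reader.clone());
        let a = thread::spawn(move || {
            let _state = s.lock().unwrap(); // claim STATE first...
            thread::sleep(Duration::from_millis(50));
            let _reader = r.lock().unwrap(); // ...then want READER
        });
        let b = thread::spawn(move || {
            let _reader = reader.lock().unwrap(); // claim READER first...
            thread::sleep(Duration::from_millis(50));
            let _state = state.lock().unwrap(); // ...then want STATE: deadlock
        });

        a.join().unwrap(); // never returns
        b.join().unwrap();
    }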

I expect that the "invariant" was written to avoid exactly this deadlock? It's a little too strong -- all you need is to claim the mutexes in the same order everywhere. Of course, it's only strong if it's actually followed :)

        // Cases where you have STATE but want READER: near the start
        // Cases where you have READER but want STATE: after read,
        // ... but this deadlock can't happen because only one thread                                                  
        //     will enter this 'read in progress' block.

The way this is written also looks like it's trying to accomplish "critical sections" -- mutual exclusion of pieces of code. The Rust Mutex isn't designed for that; it protects data, not code, so using it this way relies on some head-scratching by the developer. In this case, I think the "only one thread" assertion is saying that only one thread holds the STATE mutex, waits on the condvar (releasing but then re-acquiring the mutex), and then enters its own read-in-progress block. That's equivalent to saying that "this deadlock" can't happen because only one thread can hold the STATE mutex at a time -- which is tautological and doesn't imply the deadlock cannot happen.

One way to enforce an order-of-acquisition invariant in the type system is to put one mutex inside the other, e.g.,

    state: Mutex<(State, Mutex<ReadingMaterials>)>,

That particular way of writing it is not very ergonomic, but maybe introducing some additional types to model the situation as protecting data instead of critical sections would help.
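
For example (illustrative names only, not a proposal for the actual API), the nesting makes state-then-reader the only acquisition order the compiler will accept:

    use std::sync::Mutex;

    struct State; // placeholder for the cache/bookkeeping
    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct Engine {
        // The reader mutex can only be named through a guard on the
        // state mutex, so the acquisition order is fixed by construction.
        inner: Mutex<(State, Mutex<ReadingMaterials>)>,
    }

    impl Engine {
        fn with_reader<R>(&self, f: impl FnOnce(&mut ReadingMaterials) -> R) -> R {
            let outer = self.inner.lock().unwrap(); // state first, always
            let mut reader = outer.1.lock().unwrap(); // then reader
            f(&mut reader)
        }
    }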

I can take a crack at trying to refactor this to be more correct, if I haven't missed some major piece of information in the above.

@adetaylor (Collaborator)

Ha. I considered trying to model this in the type system before declaring 1.0, on the assumption that I'd have got something wrong here. I should have!

The comment is at least wrong.

However, a nested mutex probably won't solve things. Usually we want to claim state first, but the thread doing an actual network read might block for a long time holding the reader mutex, and we don't want to prevent other threads from doing local reads from the cache (using state).

I'll do some more thinking about this later!

@djmitche (Contributor, Author)

One option may be to put a currently_reading: Option<ReadingMaterials> in the State. A thread wanting to read could take this value while holding the state mutex, then drop the mutex. If currently_reading is None, then another thread is already reading and this one must wait. That waiting could be accomplished with the condition variable.
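
A rough sketch of what I mean, with a condvar next to the state mutex (all names here are made up):

    use std::sync::{Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        currently_reading: Option<ReadingMaterials>,
        // ... readahead cache etc. ...
    }

    struct Engine {
        state: Mutex<State>,
        read_finished: Condvar,
    }

    impl Engine {
        fn network_read(&self) {
            let mut state = self.state.lock().unwrap();
            // Take the reader out of the shared state; if it's gone,
            // another thread is mid-read, so wait until it's returned.
            let reader = loop {
                if let Some(r) = state.currently_reading.take() {
                    break r;
                }
                state = self.read_finished.wait(state).unwrap();
            };
            drop(state); // no mutex held during the (possibly slow) read

            // ... perform the network read using `reader` ...

            // Put the reader back and wake any waiting threads.
            self.state.lock().unwrap().currently_reading = Some(reader);
            self.read_finished.notify_all();
        }
    }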

Anyway, let me know your thoughts. If I get a chance, I'll have a deeper look at this code and see if I can understand it deeply enough to refactor.

@adetaylor (Collaborator)

I'm working on it.

@adetaylor (Collaborator)

OK I've convinced myself that it's fine. Let me try to persuade you.

There is effectively one main mutex, the state mutex.

Guarded by the state mutex is read_in_progress, a Boolean.

There are two places where the reading_materials stuff is touched at all:

  • The second half of the read function, which can only be reached by a thread that found read_in_progress to be false and set it to true while holding the state mutex.
  • set_expected_access_pattern, which panics if there's a read in progress.

So, although you're right that the state and reader mutices are claimed in different orders, it's guaranteed that only one thread at a time will be involved in claiming the reader mutex.
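
Condensed, the gate looks something like this (just a sketch; the method and condvar names are guesses, not the real identifiers):

    use std::sync::{Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        read_in_progress: bool,
        // ... readahead cache etc. ...
    }

    struct SeekableHttpReaderEngine {
        state: Mutex<State>,
        read_completed: Condvar, // assumed name for the condvar
        reader: Mutex<ReadingMaterials>,
    }

    impl SeekableHttpReaderEngine {
        fn network_read(&self) {
            let mut state = self.state.lock().unwrap();
            while state.read_in_progress {
                // Another thread owns the stream; wait for it to finish.
                state = self.read_completed.wait(state).unwrap();
            }
            state.read_in_progress = true; // now the sole reading thread
            drop(state);

            // At most one thread can reach this lock at a time, so the
            // reader-then-state order inside the read can't cross with
            // the state-then-reader order above in another thread.
            let reader = self.reader.lock().unwrap();
            // ... network read; re-lock `state` to fill the cache ...
            drop(reader);

            let mut state = self.state.lock().unwrap();
            state.read_in_progress = false;
            self.read_completed.notify_all();
        }
    }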

Effectively, the read_in_progress Boolean is equivalent to the Option you're talking about. I think the Option is a cleaner design and I'll have a crack at it.

The comments are still definitely wrong though!

@djmitche
Copy link
Contributor Author

Ah, that does make sense, and the PR looks like a better approach!

adetaylor changed the title from "SeekableHttpReaderEngine can deadlock" to "Comments say that SeekableHttpReaderEngine can deadlock" on Nov 10, 2023
@davidcorrigan714 commented May 31, 2024

We hit a deadlock in this area of the code recently. Our server was having issues and returning incomplete responses every so often, and we ended up with a few threads stuck here. We've since fixed our web server, and I've been trying to reproduce the problem locally. I think what happened is that the HTTP error is returned here, but the reader isn't put back into state, so the other threads waiting to read are deadlocked. It's tricky to crash it on purpose with readers waiting while the other thread returns an error, though. If I can get a reproduction going, I'd probably just check whether it's an error, return the reader to the state, and then return the error. I could also try to wrap the reader in something that always fails safe when it goes out of scope, but that sounds like more work for a bit better error handling.
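
Something like this RAII guard is what I mean by fail-safe (hypothetical types, not the crate's actual API):

    use std::sync::{Arc, Condvar, Mutex};

    struct ReadingMaterials; // placeholder for the HTTP stream(s)

    struct State {
        currently_reading: Option<ReadingMaterials>,
    }

    struct Shared {
        state: Mutex<State>,
        read_finished: Condvar,
    }

    // When the guard goes out of scope -- including via an early
    // `return Err(...)` on a bad HTTP response -- the reader is put back
    // into the shared state and waiting threads are woken.
    struct ReaderGuard {
        shared: Arc<Shared>,
        reader: Option<ReadingMaterials>,
    }

    impl Drop for ReaderGuard {
        fn drop(&mut self) {
            if let Some(r) = self.reader.take() {
                self.shared.state.lock().unwrap().currently_reading = Some(r);
                self.shared.read_finished.notify_all();
            }
        }
    }

With that, the error path doesn't need to remember to return the reader; Drop does it unconditionally.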

@adetaylor (Collaborator)

Ah, good spot! Yes, I'll definitely improve the handling of the error here so that other threads don't get stuck. Work started here: #70
