This repository has been archived by the owner on Feb 3, 2023. It is now read-only.

Remove instance drop impl #1782

Open · wants to merge 6 commits into base: develop

Conversation

timotree3
Collaborator

PR summary

The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.

Since we used a derived implementation of Clone and the data was held in an Arc, it was possible to exploit by calling .clone(), making a "shallow clone", and then drop one of the instances, leaving the other one invalid and causing a panic if it was used.

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.
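In miniature, the exploit looks like this (a hypothetical sketch: the types are stand-ins, not the real `Instance`, but the derived-`Clone`-plus-`Drop` interaction has the same shape):

```rust
use std::sync::{Arc, RwLock};

// Hypothetical model of the bug: `Instance` derives `Clone` (a shallow clone
// of the `Arc`), but its `Drop` invalidates the shared state for *every*
// clone, not just itself.
#[derive(Clone)]
struct Instance {
    state: Arc<RwLock<Option<String>>>,
}

impl Drop for Instance {
    fn drop(&mut self) {
        // Explicitly drops the shared state even if other clones still point at it.
        *self.state.write().unwrap() = None;
    }
}

fn main() {
    let a = Instance {
        state: Arc::new(RwLock::new(Some("state".into()))),
    };
    let b = a.clone(); // shallow clone: same Arc, same state
    drop(b);           // invalidates the state out from under `a`
    // Reading through `a` now finds the sentinel where valid state was expected:
    assert!(a.state.read().unwrap().is_none());
}
```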

testing/benchmarking notes

A regression test was added.

changelog

Please check one of the following, relating to the CHANGELOG-UNRELEASED.md

  • this is a code change that affects some consumer (e.g. zome developers) of holochain core, so it is added to the CHANGELOG-UNRELEASED.md (linked above), with the format - summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)
  • this is not a code change, or doesn't affect anyone outside holochain core development

@timotree3 mentioned this pull request Oct 21, 2019
@lucksus
Collaborator

The main change of this PR seems to be the removal of StateWrapper. I would expect that to re-instantiate the memory leak that got fixed by introducing it.

The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.

Not quite. drop() is calling StateWrapper::drop_inner_state() which explicitly drops the State. That is done exactly because we might have other threads still holding an Arc reference to the state. Without that (and without the Option between the RwLock and the State), the state would have been kept in memory even though the instance that logically owns it was gone. That might not be a problem in most usual use-cases, where we have a somewhat fixed set of instances that only get dropped when the conductor shuts down - but it was a blocking issue for our test suite, where many instances are created within the lifetime of a conductor process.

The root problem here is that the instance's state needs to be readable by many threads, which is achieved by having an Arc<RwLock<>> own it - but that also decouples Instance's logical ownership of State, with the aforementioned memory problem as the result.
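The pattern described here can be sketched as follows (assumed from the description above; the names come from the PR, but the bodies are illustrative, not the actual holochain code):

```rust
use std::sync::{Arc, RwLock};

// Stand-in for the real instance state.
struct State {
    entries: Vec<String>,
}

// The Option between the RwLock and the State lets the owning Instance free
// the State eagerly, even while other threads still hold the Arc.
struct StateWrapper(Option<State>);

impl StateWrapper {
    // Frees the State now; remaining Arc holders see the None sentinel.
    fn drop_inner_state(&mut self) {
        self.0 = None;
    }

    // Readers that arrive after the drop hit the intentional panic.
    fn state(&self) -> &State {
        self.0.as_ref().expect("Tried to use dropped state")
    }
}

fn main() {
    let shared = Arc::new(RwLock::new(StateWrapper(Some(State { entries: vec![] }))));
    let reader = shared.clone(); // another thread's handle to the same wrapper
    shared.write().unwrap().drop_inner_state(); // owner frees the State eagerly
    assert!(reader.read().unwrap().0.is_none()); // the State itself is already gone
}
```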

I would guess that you were led down this path after seeing

thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5

in the failing tests of #1701 - but that is a red herring! This panic is how those other threads that would still hold a reference to the state that we intentionally dropped get cancelled. So it is an intentional panic.

What I get from this is that we should make that panic message clearly state that this is intentional.

If you have a suggestion how to improve this situation, I'd be happy to discuss, but I think we need to keep the StateWrapper for now.

@timotree3
Collaborator Author

Thanks for reviewing this. :-) Let me do my best to explain my thinking about removing the Drop implementation for Instance. Sorry... I should've given a PR summary that explained this better.

The reason I deleted that part of the code is because I only see two possible things it could do, both of which are undesirable:

  • Case 1: The instance is the only remaining holder of a reference to the State. The manual drop doesn't matter, because after the instance drops its fields, the Arc drop implementation runs and drops the State anyway, since it was the last reference.
    • Outcome: the code didn't matter; the data it manually drops would have been dropped anyway.
  • Case 2: Other parts of the code still hold references to that State (presumably because they intend to read from them). After we replace the valid state with a sentinel value that panics when accessed (i.e. StateWrapper(None)), we are sure to cause a panic! once it is read.
    • Outcome: the code panic!s
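Case 1 relies on a basic property of Arc that can be demonstrated in isolation (a minimal sketch: `Tracked` and the flag are test scaffolding, not PR code):

```rust
use std::sync::{Arc, Mutex};

// When the last Arc clone is dropped, the contents are dropped automatically -
// no manual drop needed. The flag records whether the contents' Drop ran.
struct Tracked(Arc<Mutex<bool>>);

impl Drop for Tracked {
    fn drop(&mut self) {
        *self.0.lock().unwrap() = true;
    }
}

fn main() {
    let flag = Arc::new(Mutex::new(false));
    let a = Arc::new(Tracked(flag.clone()));
    let b = a.clone();
    drop(a);
    assert!(!*flag.lock().unwrap()); // one reference remains: not dropped yet
    drop(b);
    assert!(*flag.lock().unwrap()); // last reference gone: contents dropped
}
```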

I admit that I'm assuming two things:

  • That panic!s are undesirable and should only be the result of a broken invariant that the programmer didn't foresee.
  • That parts of the code would drop an Arc that they don't intend to use.

I also admit that the above explanation lied about something. The Drop impl does do more than drop the inner State. It has three lines of code:

    impl Drop for Instance {
        fn drop(&mut self) {
            // TODO: this is already performed in Holochain::stop explicitly,
            // can we get rid of one or the other?
            let _ = self.shutdown_network();
            self.stop_action_loop();
            self.state.write().unwrap().drop_inner_state();
        }
    }

The first line calls a function that returns a future. Since futures are lazy and the returned future is immediately discarded, it does nothing.
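This laziness is easy to demonstrate (a sketch: the `shutdown_network` name is borrowed from the PR code, but the body and the flag are hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Records whether the future's body ever actually ran.
static RAN: AtomicBool = AtomicBool::new(false);

// A Rust future does nothing until it is polled/awaited.
async fn shutdown_network() {
    RAN.store(true, Ordering::SeqCst); // only runs if the future is polled
}

fn main() {
    // The future is created and immediately dropped, exactly like
    // `let _ = self.shutdown_network();` in the Drop impl: the body never runs.
    let _ = shutdown_network();
    assert!(!RAN.load(Ordering::SeqCst));
}
```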

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

The third line is the one I just explained why I think is misguided.


Now let me show an example of where I think the second line is harmful (taken from the regression test added in 6ea8fa6):

#[test]
pub fn can_ping_instance() {
    let instance = test_instance_blank();
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

#[test]
pub fn can_clone_instance() {
    let instance = test_instance_blank();
    {
        let _instance2 = instance.clone();
    }
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

Intuitively, I would expect those two cases to be equivalent; after all, cloning something shouldn't have side effects. Sure enough, if you ran these two tests you'd probably see them both pass. The troubling thing is that the second one doesn't always pass! It contains a race condition and actually fails some portion of the time.

Here's what happens:

  • We call Clone on an Instance, which since it's derived, forwards to the impls given by Instance's fields, all of which shallowly clone. In other words, we now have a second Instance object that still points to all the same data that the first one did.
  • The scope ends and we drop _instance2.
  • We call the Drop impl for Instance.
  • We send a kill-signal to the shared action-listener thread. It will receive this within 0-1000 ms.
  • We send an action using the Sender owned by instance.
  • If the thread was killed in time
    • The channel is disconnected and we panic!.
  • Otherwise
    • The test passes.
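The "thread was killed in time" branch can be reproduced deterministically with a plain channel (an illustrative sketch, not PR code: dropping the receiver stands in for the action-loop thread exiting after the kill signal):

```rust
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();
    let tx2 = tx.clone(); // like the derived Clone sharing the Sender
    drop(rx);             // like the action-loop thread exiting after the kill signal
    // Any later send on either Sender now fails, and a test that `.unwrap()`s
    // the result panics - the disconnected-channel half of the race.
    assert!(tx2.send("ping").is_err());
    assert!(tx.send("ping").is_err());
}
```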

I believe that this bug occurs because of a conflict of assumptions:

  • That it is valid to implement Clone for Instance because instances don't logically own the thread and the state.
  • That it is valid to kill the thread and drop the state when an Instance is dropped because instances do logically own them.

I checked, and this Clone implementation is used in one spot in the code. I didn't look into it very deeply, but when I said:

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.

that is what I meant.


The main change of this PR seems to be the removal of StateWrapper. I would expect that to re-instantiate the memory leak that got fixed by introducing it.

Why would there have been a memory leak? Arcs drop their contents. Was there an Arc cycle somewhere in the code?
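For context on what a genuine Arc leak would require (a hypothetical sketch; nothing here claims such a cycle exists in the holochain code): two Arcs pointing at each other never reach a reference count of zero, so their contents are never dropped.

```rust
use std::sync::{Arc, Mutex};

// A node that can strongly reference another node.
struct Node {
    next: Mutex<Option<Arc<Node>>>,
}

fn main() {
    let a = Arc::new(Node { next: Mutex::new(None) });
    let b = Arc::new(Node { next: Mutex::new(Some(a.clone())) });
    *a.next.lock().unwrap() = Some(b.clone()); // cycle: a -> b -> a

    drop(b);
    // `a` still has 2 strong refs (our local one plus the one inside the
    // cycle), so even after both locals go out of scope, the nodes leak.
    assert_eq!(Arc::strong_count(&a), 2);
}
```

The standard fix for such cycles is to make one direction a `std::sync::Weak` reference, which does not keep the contents alive.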

I would guess that you were led down this path after seeing
thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5
in the failing tests of #1701 - but that is a red herring!

Actually, no. I was working on an Instance refactor and I noticed that this Drop implementation combined with Clone was bound to fail.

So it is an intentional panic.

If that's true, then one of my assumptions above is wrong. Are we actually using panics for something other than fatal unforeseen logic errors?

Let me know if I'm making some mistake in my logic here.

@jamesray1
Contributor

jamesray1 commented Oct 31, 2019

I haven't tested your added tests yet, but some of the CI tests are failing, e.g. with the output below. Also, the branch is out of date. However, I guess you may want to wait for review, e.g. about:

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

https://circleci.com/gh/holochain/holochain-rust/43055?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

#    Compiling jsonrpc-http-server v14.0.1 (https://github.com/holochain/jsonrpc?branch=broadcaster-getter#3a043aba)
# error[E0592]: duplicate definitions with name `broadcaster`
#    --> /root/project/.cargo/git/checkouts/jsonrpc-b3276f041818a130/3a043ab/ws/src/server.rs:42:2
#     |
# 42  |       pub fn broadcaster(&self) -> Broadcaster {
#     |  _____^
# 43  | |         Broadcaster {
# 44  | |             broadcaster: self.broadcaster.clone(),
# 45  | |         }
# 46  | |     }
#     | |_____^ duplicate definitions for `broadcaster`
# ...
# 148 |       pub fn broadcaster(&self) -> ws::Sender {
#     |  _____-
# 149 | |         self.broadcaster.clone()
# 150 | |     }
#     | |_____- other definition for `broadcaster`
# 
# error: aborting due to previous error
# 
# For more information about this error, try `rustc --explain E0592`.
# error: Could not compile `jsonrpc-ws-server`.
# warning: build failed, waiting for other jobs to finish...
# error: failed to compile `hc v0.0.32-alpha2 (/root/project/crates/cli)`, intermediate artifacts can be found at `/root/project/target`

https://circleci.com/gh/holochain/holochain-rust/43047?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link:

> wat2wasm /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wat -o /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wasm
Created DNA package file at "dist/app_spec.dna.json"
DNA hash: QmPELfsnYiJnBxPbVQZdwCM98Rqr1j9ymXXNCAoTiEjbsv
Spawning conductor0 process...
Conductor0 process spawning successful
events.js:170
      throw er; // Unhandled 'error' event
      ^

Error: spawn ./.cargo/bin/holochain ENOENT
    at Process.ChildProcess._handle.onexit (internal/child_process.js:247:19)
    at onErrorNT (internal/child_process.js:429:16)
    at processTicksAndRejections (internal/process/task_queues.js:81:17)
    at process.runNextTicks [as _tickCallback] (internal/process/task_queues.js:56:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:880:11)
    at internal/main/run_main_module.js:21:11
Emitted 'error' event at:
    at Process.ChildProcess._handle.onexit (internal/child_process.js:253:12)
    at onErrorNT (internal/child_process.js:429:16)
    [... lines matching original stack trace ...]
    at internal/main/run_main_module.js:21:11
Exited with code 1

@timotree3
Collaborator Author

Neither of those things seem to have anything to do with my change. Maybe a conflict was mis-merged? That would explain the duplicate function. I'll rebase my branch and get rid of these merge commits.

The state is held by an Arc.
Fundamentally, this means that it will be dropped as soon as
no thread holds a reference to it.

The previous implementation of Drop for Instance assumed that
there were no other instances referencing the data it dropped.
Since we used a derived implementation of Clone and
the data was held in an Arc, it was possible to exploit by calling
`.clone()`, making a "shallow clone", and then drop one of the
instances, leaving the other one invalid and causing a panic if
it was used.
@jamesray1
Contributor

jamesray1 commented Nov 1, 2019

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop? See #1823.

@timotree3
Collaborator Author

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop?

I believe that you just got very lucky. The test that was added in that commit can still spuriously pass: it tries a data-racy thing 100 times, so if all 100 attempts spuriously succeed, the test passes as well.

I just ran them on my computer and they failed, which means that they can fail; by chance they passed that time. If you want to check for sure whether this is an issue, I suggest you try running the updated tests created in 35919a0. They don't pass spuriously.
