This repository has been archived by the owner on Feb 3, 2023. It is now read-only.

Remove instance drop impl #1782

Open · wants to merge 6 commits into base: develop

Conversation

timotree3
Collaborator

PR summary

The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.

Since we used a derived implementation of Clone and the data was held in an Arc, it was possible to exploit by calling .clone(), making a "shallow clone", and then drop one of the instances, leaving the other one invalid and causing a panic if it was used.

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.
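In miniature, the exploit looks like this (a hypothetical sketch: the types are stand-ins, not the real `Instance`, but the derived-`Clone`-plus-`Drop` interaction has the same shape):

```rust
use std::sync::{Arc, RwLock};

// Hypothetical model of the bug: `Instance` derives `Clone` (a shallow clone
// of the `Arc`), but its `Drop` invalidates the shared state for *every*
// clone, not just itself.
#[derive(Clone)]
struct Instance {
    state: Arc<RwLock<Option<String>>>,
}

impl Drop for Instance {
    fn drop(&mut self) {
        // Explicitly drops the shared state even if other clones still point at it.
        *self.state.write().unwrap() = None;
    }
}

fn main() {
    let a = Instance {
        state: Arc::new(RwLock::new(Some("state".into()))),
    };
    let b = a.clone(); // shallow clone: same Arc, same state
    drop(b);           // invalidates the state out from under `a`
    // Reading through `a` now finds the sentinel where valid state was expected:
    assert!(a.state.read().unwrap().is_none());
}
```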

testing/benchmarking notes

A regression test was added.

changelog

Please check one of the following, relating to the CHANGELOG-UNRELEASED.md

  • this is a code change that affects some consumer (e.g. zome developers) of holochain core, so it is added to the CHANGELOG-UNRELEASED.md (linked above), with the format - summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)
  • this is not a code change, or doesn't affect anyone outside holochain core development

@timotree3 mentioned this pull request Oct 21, 2019
@lucksus
Collaborator

The main change of this PR seems to be the removal of StateWrapper. I would expect that to re-instantiate the memory leak that got fixed by introducing it.

The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.

Not quite. drop() is calling StateWrapper::drop_inner_state() which explicitly drops the State. That is done exactly because we might have other threads still holding an Arc reference to the state. Without that (and without the Option between the RwLock and the State), the state would have been kept in memory even though the instance that logically owns it was gone. That might not be a problem in most usual use-cases, where we have a somewhat fixed set of instances that only get dropped when the conductor shuts down - but it was a blocking issue for our test suite, where many instances are created within the lifetime of a conductor process.

The root problem here is that the instance's state needs to be readable by many threads, which is achieved by having an Arc<RwLock<>> own it - but that also decouples Instance's logical ownership of State, with the aforementioned memory problem as the result.
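The pattern described here can be sketched as follows (assumed from the description above; the names come from the PR, but the bodies are illustrative, not the actual holochain code):

```rust
use std::sync::{Arc, RwLock};

// Stand-in for the real instance state.
struct State {
    entries: Vec<String>,
}

// The Option between the RwLock and the State lets the owning Instance free
// the State eagerly, even while other threads still hold the Arc.
struct StateWrapper(Option<State>);

impl StateWrapper {
    // Frees the State now; remaining Arc holders see the None sentinel.
    fn drop_inner_state(&mut self) {
        self.0 = None;
    }

    // Readers that arrive after the drop hit the intentional panic.
    fn state(&self) -> &State {
        self.0.as_ref().expect("Tried to use dropped state")
    }
}

fn main() {
    let shared = Arc::new(RwLock::new(StateWrapper(Some(State { entries: vec![] }))));
    let reader = shared.clone(); // another thread's handle to the same wrapper
    shared.write().unwrap().drop_inner_state(); // owner frees the State eagerly
    assert!(reader.read().unwrap().0.is_none()); // the State itself is already gone
}
```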

I would guess that you were led down this path after seeing

thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5

in the failing tests of #1701 - but that is a red herring! This panic is how those other threads that would still hold a reference to the state that we intentionally dropped get cancelled. So it is an intentional panic.

What I get from this is that we should make that panic message clearly state that this is intentional.

If you have a suggestion how to improve this situation, I'd be happy to discuss, but I think we need to keep the StateWrapper for now.

@timotree3
Collaborator Author

Thanks for reviewing this. :-) Let me do my best to explain my thinking about removing the Drop implementation for Instance. Sorry... I should've given a PR summary that explained this better.

The reason I deleted that part of the code is because I only see two possible things it could do, both of which are undesirable:

  • Case 1: The instance is the only remaining holder of a reference to the State. The manual drop doesn't matter, because after the instance drops its fields, the Arc drop implementation runs and drops the State anyway, since it was the last reference.
    • Outcome: the code didn't matter; the data it manually drops would have been dropped anyway.
  • Case 2: Other parts of the code still hold references to that State (presumably because they intend to read from them). After we replace the valid state with a sentinel value that panics when accessed (i.e. StateWrapper(None)), we are sure to cause a panic! once it is read.
    • Outcome: the code panic!s
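Case 1 relies on a basic property of Arc that can be demonstrated in isolation (a minimal sketch: `Tracked` and the flag are test scaffolding, not PR code):

```rust
use std::sync::{Arc, Mutex};

// When the last Arc clone is dropped, the contents are dropped automatically -
// no manual drop needed. The flag records whether the contents' Drop ran.
struct Tracked(Arc<Mutex<bool>>);

impl Drop for Tracked {
    fn drop(&mut self) {
        *self.0.lock().unwrap() = true;
    }
}

fn main() {
    let flag = Arc::new(Mutex::new(false));
    let a = Arc::new(Tracked(flag.clone()));
    let b = a.clone();
    drop(a);
    assert!(!*flag.lock().unwrap()); // one reference remains: not dropped yet
    drop(b);
    assert!(*flag.lock().unwrap()); // last reference gone: contents dropped
}
```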

I admit that I'm assuming two things:

  • That panic!s are undesirable and should only be the result of a broken invariant that the programmer didn't foresee.
  • That parts of the code would drop an Arc that they don't intend to use.

I also admit that the above explanation lied about something. The Drop impl does do more than drop the inner State. It has three lines of code:

    impl Drop for Instance {
        fn drop(&mut self) {
            // TODO: this is already performed in Holochain::stop explicitly,
            // can we get rid of one or the other?
            let _ = self.shutdown_network();
            self.stop_action_loop();
            self.state.write().unwrap().drop_inner_state();
        }
    }

The first line calls a function that returns a future. Since futures are lazy and the returned future is immediately discarded, it does nothing.
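This laziness is easy to demonstrate (a sketch: the `shutdown_network` name is borrowed from the PR code, but the body and the flag are hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Records whether the future's body ever actually ran.
static RAN: AtomicBool = AtomicBool::new(false);

// A Rust future does nothing until it is polled/awaited.
async fn shutdown_network() {
    RAN.store(true, Ordering::SeqCst); // only runs if the future is polled
}

fn main() {
    // The future is created and immediately dropped, exactly like
    // `let _ = self.shutdown_network();` in the Drop impl: the body never runs.
    let _ = shutdown_network();
    assert!(!RAN.load(Ordering::SeqCst));
}
```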

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

The third line is the one I just explained why I think is misguided.


Now let me show an example of where I think the second line is harmful (taken from the regression test added in 6ea8fa6):

#[test]
pub fn can_ping_instance() {
    let instance = test_instance_blank();
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

#[test]
pub fn can_clone_instance() {
    let instance = test_instance_blank();
    {
        let _instance2 = instance.clone();
    }
    instance.action_channel().send(ActionWrapper::new(Action::Ping)).unwrap();
}

Intuitively, I would expect those two cases to be equivalent; after all, cloning something shouldn't have side effects. Sure enough, if you ran these two tests you'd probably see them both pass. The troubling thing is that the second one doesn't always pass! It contains a race condition and actually fails some portion of the time.

Here's what happens:

  • We call Clone on an Instance, which since it's derived, forwards to the impls given by Instance's fields, all of which shallowly clone. In other words, we now have a second Instance object that still points to all the same data that the first one did.
  • The scope ends and we drop _instance2.
  • We call the Drop impl for Instance.
  • We send a kill-signal to the shared action-listener thread. It will receive this within 0-1000 ms.
  • We send an action using the Sender owned by instance.
  • If the thread was killed in time
    • The channel is disconnected and we panic!.
  • Otherwise
    • The test passes.
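The "thread was killed in time" branch can be reproduced deterministically with a plain channel (an illustrative sketch, not PR code: dropping the receiver stands in for the action-loop thread exiting after the kill signal):

```rust
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();
    let tx2 = tx.clone(); // like the derived Clone sharing the Sender
    drop(rx);             // like the action-loop thread exiting after the kill signal
    // Any later send on either Sender now fails, and a test that `.unwrap()`s
    // the result panics - the disconnected-channel half of the race.
    assert!(tx2.send("ping").is_err());
    assert!(tx.send("ping").is_err());
}
```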

I believe that this bug occurs because of a conflict of assumptions:

  • That it is valid to implement Clone for Instance because instances don't logically own the thread and the state.
  • That it is valid to kill the thread and drop the state when an Instance is dropped because instances do logically own them.

I checked, and this Clone implementation is used in one spot in the code. I didn't look into it very deeply, but when I said:

I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.

that is what I meant.


The main change of this PR seems to be the removal of StateWrapper. I would expect that to re-instantiate the memory leak that got fixed by introducing it.

Why would there have been a memory leak? Arcs drop their contents. Was there an Arc cycle somewhere in the code?
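For context on what a genuine Arc leak would require (a hypothetical sketch; nothing here claims such a cycle exists in the holochain code): two Arcs pointing at each other never reach a reference count of zero, so their contents are never dropped.

```rust
use std::sync::{Arc, Mutex};

// A node that can strongly reference another node.
struct Node {
    next: Mutex<Option<Arc<Node>>>,
}

fn main() {
    let a = Arc::new(Node { next: Mutex::new(None) });
    let b = Arc::new(Node { next: Mutex::new(Some(a.clone())) });
    *a.next.lock().unwrap() = Some(b.clone()); // cycle: a -> b -> a

    drop(b);
    // `a` still has 2 strong refs (our local one plus the one inside the
    // cycle), so even after both locals go out of scope, the nodes leak.
    assert_eq!(Arc::strong_count(&a), 2);
}
```

The standard fix for such cycles is to make one direction a `std::sync::Weak` reference, which does not keep the contents alive.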

I would guess that you were led down this path after seeing
thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5
in the failing tests of #1701 - but that is a red herring!

Actually, no. I was working on an Instance refactor and I noticed that this Drop implementation combined with Clone was bound to fail.

So it is an intentional panic.

If that's true, then one of my assumptions above is wrong. Are we actually using panics for something other than fatal unforeseen logic errors?

Let me know if I'm making some mistake in my logic here.

@jamesray1
Contributor

jamesray1 commented Oct 31, 2019

I haven't tested your added tests yet, but some of the CI tests are failing, e.g. with the output below. Also, the branch is out of date. However, I guess you may want to wait for review, e.g. about:

The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that.

https://circleci.com/gh/holochain/holochain-rust/43055?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

#    Compiling jsonrpc-http-server v14.0.1 (https://github.com/holochain/jsonrpc?branch=broadcaster-getter#3a043aba)
# error[E0592]: duplicate definitions with name `broadcaster`
#    --> /root/project/.cargo/git/checkouts/jsonrpc-b3276f041818a130/3a043ab/ws/src/server.rs:42:2
#     |
# 42  |       pub fn broadcaster(&self) -> Broadcaster {
#     |  _____^
# 43  | |         Broadcaster {
# 44  | |             broadcaster: self.broadcaster.clone(),
# 45  | |         }
# 46  | |     }
#     | |_____^ duplicate definitions for `broadcaster`
# ...
# 148 |       pub fn broadcaster(&self) -> ws::Sender {
#     |  _____-
# 149 | |         self.broadcaster.clone()
# 150 | |     }
#     | |_____- other definition for `broadcaster`
# 
# error: aborting due to previous error
# 
# For more information about this error, try `rustc --explain E0592`.
# error: Could not compile `jsonrpc-ws-server`.
# warning: build failed, waiting for other jobs to finish...
# error: failed to compile `hc v0.0.32-alpha2 (/root/project/crates/cli)`, intermediate artifacts can be found at `/root/project/target`

https://circleci.com/gh/holochain/holochain-rust/43047?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link:

> wat2wasm /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wat -o /tmp/holochain/target/wasm32-unknown-unknown/release/summer.wasm
Created DNA package file at "dist/app_spec.dna.json"
DNA hash: QmPELfsnYiJnBxPbVQZdwCM98Rqr1j9ymXXNCAoTiEjbsv
Spawning conductor0 process...
Conductor0 process spawning successful
events.js:170
      throw er; // Unhandled 'error' event
      ^

Error: spawn ./.cargo/bin/holochain ENOENT
    at Process.ChildProcess._handle.onexit (internal/child_process.js:247:19)
    at onErrorNT (internal/child_process.js:429:16)
    at processTicksAndRejections (internal/process/task_queues.js:81:17)
    at process.runNextTicks [as _tickCallback] (internal/process/task_queues.js:56:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:880:11)
    at internal/main/run_main_module.js:21:11
Emitted 'error' event at:
    at Process.ChildProcess._handle.onexit (internal/child_process.js:253:12)
    at onErrorNT (internal/child_process.js:429:16)
    [... lines matching original stack trace ...]
    at internal/main/run_main_module.js:21:11
Exited with code 1

@timotree3
Collaborator Author

Neither of those things seem to have anything to do with my change. Maybe a conflict was mis-merged? That would explain the duplicate function. I'll rebase my branch and get rid of these merge commits.

The state is held by an Arc.
Fundamentally, this means that it will be dropped as soon as
no thread holds a reference to it.

The previous implementation of Drop for Instance assumed that
there were no other instances referencing the data it dropped.
Since we used a derived implementation of Clone and
the data was held in an Arc, it was possible to exploit by calling
`.clone()`, making a "shallow clone", and then drop one of the
instances, leaving the other one invalid and causing a panic if
it was used.
@jamesray1
Contributor

jamesray1 commented Nov 1, 2019

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop? See #1823.

@timotree3
Collaborator Author

Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop?

I believe that you just got very lucky. The test that was added in that commit can still spuriously pass: it tries a data-racy thing 100 times, so if all 100 attempts spuriously succeed, the test passes as well.

I just ran them on my computer and they failed, which means that they can fail; by chance they passed that time. If you want to check for sure whether this is an issue, I suggest you try running the updated tests created in 35919a0. They don't pass spuriously.
