Remove instance drop impl #1782
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change of this PR seems to be the removal of StateWrapper
. I would expect that to re-instantiate the memory leak that got fixed by introducing it.
The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped.
Not quite. drop()
is calling StateWrapper::drop_inner_state()
which explicitly drops the State
. That is done exactly because we might have other threads still holding an Arc
reference to the state. Without that (and without the Option
between the RwLock
and State
) the state would have been kept in memory although the instance that is logically owning the state is gone. That might not be a problem in most usual use-cases where we have a somewhat fixed set of instances that only get dropped when the conductor shuts down - but it was a blocking issue for our test suite where many instances were created within the life-time of a conductor process.
The root problem here is that the instance's state needs to be readable by many threads which is achieved by having an Arc<RwLock<>>
own it - but that also detaches the logical ownership of Instance
over State
with the before mentioned memory problem.
I would guess that you were led down this path after seeing
thread 'store_entry_content/puid-f-42' panicked at 'Tried to use dropped state', src/libcore/option.rs:1065:5
in the failing tests of #1701 - but that is a red herring! This panic is how those other threads that would still hold a reference to the state that we intentionally dropped get cancelled. So it is an intentional panic.
What I get from this is that we should make that panic message clearly state that this is intentional.
If you have a suggestion how to improve this situation, I'd be happy to discuss, but I think we need to keep the StateWrapper
for now.
Thanks for reviewing this. :-) Let me do my best to explain my thinking about removing the The reason I deleted that part of the code is because I only see two possible things it could do, both of which are undesirable:
I admit that I'm assuming two things:
I also admit that the above explanation lied about something. The holochain-rust/crates/core/src/instance.rs Lines 338 to 346 in 6ea8fa6
The first line calls a function that returns a future. Since futures are evaluated lazily, it does nothing. The second line stops the thread. This line is important (although, see below, arguably wrong) and I admit that it is an oversight of the PR. I have a follow-up in the works that introduces a more robust mechanism for killing the thread. I'm open to blocking this change on that. The third line is the one I just explained why I think is misguided. Now let me show an example of where I think the second line is harmful (taken from the regression test added in 6ea8fa6):
Intuitively, I would expect those two cases to be equivalent, after all, cloning something shouldn't have side effects, Sure enough, if you ran these two tests you'd probably see them both pass. The troubling thing is that the second one doesn't always pass! It contains a race-condition, and actually fails some portion of the time. Here's what happens:
I believe that this bug occurs because of a conflict of assumptions:
I checked, and this
that is what I meant.
Why would there have been a memory leak?
Actually, no. I was working on an
If that's true, then one of my assumptions above is wrong. Are we actually using panics for something other than fatal unforeseen logic errors? Let me know if I'm making some mistake in my logic here. |
I haven't tested your added tests yet, but some of the CI tests are failing with e.g. with the below. Also the branch is out-of-date. However, I guess you may want to wait for review e.g. about:
|
Neither of those things seem to have anything to do with my change. Maybe a conflict was mis-merged? That would explain the duplicate function. I'll rebase my branch and get rid of these merge commits. |
The state is held by an Arc. Fundamentally, this means that it will be dropped as soon as no thread holds a reference to it. The previous implementation of Drop for Instance assumed that there were no other instances referencing the data it dropped. Since we used a derived implementation of Clone and the data was held in an Arc, it was possible to exploit by calling `.clone()`, making a "shallow clone", and then drop one of the instances, leaving the other one invalid and causing a panic if it was used.
fc7d906
to
50a7192
Compare
Hey @timotree3, it looks like the tests you added in 6ea8fa6 are passing for develop? See #1823. |
I believe that you just got very lucky. The test that was added in that commit can still spuriously pass, it tries a data-racy thing 100 times so if they all spuriously succeed then it will succeed as well. I just ran them on my computer and they failed which means that they can fail but by chance it passed that time. If you want to check surefire is this is an issue, I suggest you try running the updated tests created in 35919a0. They don't pass spuriously. |
PR summary
The previous implementation of
Drop
forInstance
assumed that there were no other instances referencing the data it dropped.Since we used a derived implementation of
Clone
and the data was held in anArc
, it was possible to exploit by calling.clone()
, making a "shallow clone", and then drop one of the instances, leaving the other one invalid and causing a panic if it was used.I believe this bug has caused many of the spurious CI failures because it has a race condition depending on if the thread actually receives the kill signal in time to matter.
testing/benchmarking notes
A regression test was added.
changelog
Please check one of the following, relating to the CHANGELOG-UNRELEASED.md
- summary of change [PR#1234](https://github.com/holochain/holochain-rust/pull/1234)