Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak with dynamic behavior switching #261

Closed
ocharles opened this issue Jun 15, 2022 · 16 comments
Closed

Memory leak with dynamic behavior switching #261

ocharles opened this issue Jun 15, 2022 · 16 comments

Comments

@ocharles
Copy link
Collaborator

The following program:

{-# language BlockArguments #-}
module Main where

import Control.Monad
import Data.Functor
import Reactive.Banana
import Reactive.Banana.Frameworks
import System.Mem
import System.Mem.Weak

withGhcDebug = id

main :: IO ()
main = withGhcDebug do
  (ah1, fire1) <- newAddHandler

  actuate =<< compile do
    e <- fromAddHandler ah1

    let e2 = observeE $ e $> do
          stepper () (void e)

    b' <- switchB (pure ()) e2

    reactimate $ return <$> b' <@ e

  performGC
  putStrLn "Running"

  replicateM_ 50000 $ do
    fire1 ()
    performGC

Leaks memory:

image

I've also modified doAddChild to print out the number of children in a parent:

doAddChild :: SomeNode -> SomeNode -> IO ()
doAddChild (P parent) (P child) = do
    level1 <- _levelP <$> readRef child
    level2 <- _levelP <$> readRef parent
    let level = level1 `max` (level2 + 1)
    w <- parent `connectChild` P child
    print =<< length . _childrenP <$> readRef parent -- <- NEW
    modify' child $ set levelP level . update parentsP (w:)
doAddChild (P parent) node = void $ parent `connectChild` node
doAddChild (L _) _ = error "doAddChild: Cannot add children to LatchWrite"
doAddChild (O _) _ = error "doAddChild: Cannot add children to Output"

This shows that a node has a 50000 children at the end. My guess is this is the Event e - whenever execute fires we attach a new stepper to e, but this child is never removed - even though switchB should be discarding them.

@mitchellwrosen
Copy link
Collaborator

Nice, can you share how you made the pretty graph?

@ocharles
Copy link
Collaborator Author

Sure! I just used eventlog2html 😄

More specifically:

$ cabal run leak -- +RTS -l -hi

and build with -rtsopts -eventlog.

I often also build with -finfo-table-map -fdistinct-constructor-tables, as per https://well-typed.com/blog/2021/01/first-look-at-hi-profiling-mode/

@ocharles
Copy link
Collaborator Author

The same problem is present with just dynamic event switching - no need to bring Behaviors in:

{-# language BlockArguments #-}
module Main where

import Control.Monad
import Data.Functor
import Reactive.Banana
import Reactive.Banana.Frameworks
import System.Mem
import System.Mem.Weak

withGhcDebug = id

main :: IO ()
main = withGhcDebug do
  (ah1, fire1) <- newAddHandler

  actuate =<< compile do
    e <- fromAddHandler ah1

    let e2 = observeE $ e $> do
          accumE () (id <$ e)

    e3 <- switchE never e2

    reactimate $ return <$> e3

  performGC
  putStrLn "Running"

  replicateM_ 10000 $ do
    fire1 ()
    performGC

I'll try and solve this leak first.

@ocharles
Copy link
Collaborator Author

ocharles commented Jun 16, 2022

Ok, the fix for both of these leaks isn't hard - we can just modify doAddChild to cull any dead children:

doAddChild (P parent) (P child) = do
    level1 <- _levelP <$> readRef child
    level2 <- _levelP <$> readRef parent
    let level = level1 `max` (level2 + 1)
    w <- parent `connectChild` P child

    -- Remove any dead children. These three lines are new.
    let alive w = maybe False (const True) <$> deRefWeak w
    children' <- filterM alive . _childrenP =<< readRef parent
    modify' parent $ set childrenP children'

    modify' child $ set levelP level . update parentsP (w:)

But I'm not particularly happy with this solution. When the switchE fires I feel we should be able to propagate this information all the way up to e. I'll have a think about how to do this.

@HeinrichApfelmus
Copy link
Owner

HeinrichApfelmus commented Jun 17, 2022

When implementing this, I was hoping to use finalizers to remove dead children — i.e. when switchE switches to a new event, the old event may become garbage and the corresponding finalizer would remove it from the _childrenP field.

Hm. Finalizers are run concurrently, but to keep our sanity, changes to the network need to be sequential and scheduled (e.g. using a writer part of Build monad). Perhaps we should implement our own GC pass that is executed at the end of every step, and the finalizers simply tell our GC more specific information about which weak pointers it should remove? 🤔

(One issue that I didn't think deeply enough about is the question of how fast we can remove transitive dependencies. I.e. event e3 may depend on e2 which depends on e1. Now, if e3 is not used anymore, then both e2 and e1 can be garbage collected, but that should preferably happen in a single GC pass as opposed of two GC passes where we first discover that e2 is dead and only in the next pass that e1 is also dead because e2 is dead. The GHC GC does this alright, but any GC addendum that we implement might not.)

@ocharles
Copy link
Collaborator Author

@HeinrichApfelmus I've also thought about using finalizers, but the whole thing seems a lot more complex/action-at-a-distance than it needs to be. As far as I'm aware, we always have the entire graph right in front of us, through a Pulses parent/children lists. So if we dynamically switch away from something, we should - at that point - be able to find strongly connected components within this graph that are no longer reachable and nuke the whole lot.

I don't like finalizers partly because it's unclear when they will run, but more that it's unclear if they will run at all! I'd hate to be in a position where I accumulate just enough garbage to impact performance, but not enough to trigger the right generation GC to solve the problem.

@HeinrichApfelmus
Copy link
Owner

HeinrichApfelmus commented Jun 19, 2022

So if we dynamically switch away from something, we should - at that point - be able to find strongly connected components within this graph that are no longer reachable and nuke the whole lot.

Yes and no. The trouble is twofold:

  1. The program may reference a Pulse (e.g. through an Event) even though that Pulse is currently not an active part of the network — but it may become part of the network again later. For example, a switchE periodically switching between two events e1 and e2, has this property — both events need to be kept alive (especially if they involve state), but only one of them is in the transitive closure of the current list of reactimate. This implies that we do need help from the garbage collector.
  2. Conversely, the garbage collector may still think that a Pulse is alive during a switchE, even though that Pulse becomes dead through the switch. Hence, the garbage collector may have some delay, and tell us that a Pulse can be removed only some time after the moment of switching. This implies that we need to expect help from the garbage collector in an asynchronous manner.

I don't like finalizers partly because it's unclear when they will run, but more that it's unclear if they will run at all!

I do agree that the documentation on finalizers is rather pessimistic. However, I feel that we may not have a choice, and in practice, it does not seem too bad (well, if it is bad, then we can report this as a bug in GHC. 😄)

@ocharles
Copy link
Collaborator Author

Yea, I was thinking over 1 yesterday! Thanks for sharing. Something I also want to do is try modelling our graph in Alloy and to use a model checker to work out the complexities here!

ocharles added a commit to circuithub/reactive-banana that referenced this issue Oct 5, 2022
@ocharles
Copy link
Collaborator Author

Ok, I might have another fix:

 connectChild parent child = do
     w <- mkWeakNodeValue child child
     modify' parent $ update childrenP (w:)
+
+    -- Add a finalizer to remove the child from the parent when the child dies.
+    case child of
+      P c@(Ref r _) -> addFinalizer r $ removeParents c
+      _ -> return ()
+
     mkWeakNodeValue child (P parent)        -- child keeps parent alive

The idea is pretty trivial - when a Pulse is unreachable by the GC, then remove it from all parents. I think that the only thing we're actually leaking (in the examples here) is the Weak value for the child and the cons cell in the parents list of children. I tested this with the following (even simpler) repro:

{-# language BlockArguments #-}
module Main where

import Control.Monad
import Control.Monad.IO.Class
import Data.Functor
import Reactive.Banana
import Reactive.Banana.Frameworks
import System.Mem
import System.Mem.Weak
import Control.Concurrent (threadDelay, yield)

withGhcDebug = id

main :: IO ()
main = withGhcDebug do
  (ah1, fire1) <- newAddHandler

  actuate =<< compile do
    e <- fromAddHandler ah1

    e2 <- execute $ e $> do
      accumE () (id <$ e)

    reactimate $ return () <$ e2

  performGC
  putStrLn "Running"

  replicateM_ 10000 $ do
    fire1 ()
    performMajorGC
    -- yield so finalizers can run.
    yield

  putStrLn "Done"

Ran with -hi profile and analyzed in eventlog2html, we see

image

Some noise, but that clear blue line is the signal - a clear leak.

With the fix above, we get:

image

But I also have to run 10x the amount of iterations otherwise it terminates too quickly!

So I think I've got a good handle on at least one fix. I think the way to proceed from here is to add a finalizer when we call newLatch or newPulse though - connectChild is obviously the wrong place.

@ocharles
Copy link
Collaborator Author

A note to myself as to why we can't just use Ref:

  • Assume we have some chain of Pulses p1, p2, ... pn, where each pulse is a child of the previous (so p1 is the parent of p2, etc).
  • We have some top-level pulse p which is the parent of p1.
  • Now, introduce pX, which is derived by dynamically switching between some other Pulse and pn.
  • If we clean up the graph the instant pX switches out of pn, then we'll end up detaching p1 from p.
  • However, if we switch back to pn, then we'll never get any events, because pn is disconnected from p!

This is why we think we need help from the GC. When pX switches out of pn then yes, pX should reparent. But we do still need to keep pushing p through pn because the dynamic event switch will keep pn alive. I agree that it's going to be very hard to do this without letting the GC inform us.

Note that if we used dynamic event switching and switched out of pn and don't have the possibility of switching back (e.g., something uses never or some other mechanism that makes it impossible), then we'd lose any strong pointer to pn allowing it to be GCed.

I need to think about promptly cleaning up a whole sequence of Pulses, but otherwise this is taking shape

HeinrichApfelmus added a commit that referenced this issue Jan 2, 2023
…c-evaluation

Perform evaluation steps `Network` using `GraphGC`

### Overview

This pull request completely changes the way that dependencies between `Pulse` are tracked and used in order to perform an `Evaluation.step` for a `Network`.

We use the machinery provided by `GraphGC` to

* track dependencies between `Pulse`, using `insertEdge` and `clearPredecessors`.
* traverse the `Pulse` in dependency order with early exit, using `walkSuccessors_`.

### Comments

* This should fix many remaining issues with garbage collection for `Pulse`, specifically #261 
* I think that in order to fix *all* remaining issues for `Pulse`, we may have to look at garbage collection and `Vault`.
* This pull request doesn't do anything for `Latch`.

Still, 🥳!

### Obsoletes

* #182
* sadly, #243
@HeinrichApfelmus
Copy link
Owner

HeinrichApfelmus commented Jan 14, 2023

Thank you @ocharles !

I have revisited this problem and decided to redesign the low-level implementation entirely in order to separate concerns better. I have created a Graph data structure without garbage collection and a GraphGC data structure with garbage collection; this allows us to automatically test that garbage collection indeed behaves as desired. #268

The following program now runs in constant space:

{-# language BlockArguments #-}
module Main where

import Control.Monad
import Control.Monad.IO.Class
import Data.Functor
import Reactive.Banana
import Reactive.Banana.Frameworks
import System.Mem
import System.Mem.Weak
import Control.Concurrent (threadDelay, yield)

withGhcDebug = id

main :: IO ()
main = withGhcDebug do
  (ah1, fire1) <- newAddHandler

  actuate =<< compile do
    e <- fromAddHandler ah1

    e2 <- execute $ e $> do
      accumE () never -- previously: accumE () (id <$ e)

    reactimate $ return () <$ e2

  performGC
  putStrLn "Running"

  replicateM_ 10000 $ do
    fire1 ()
    performMajorGC
    -- yield so finalizers can run.
    yield

  putStrLn "Done"

… but the program with accumE () (id <$ e) instead of accumE () never does not. 🤔 … which is weird, because the test execute accumE, issue #261: specifically tests a very similar network and doesn't detect any growth in network size.

@HeinrichApfelmus
Copy link
Owner

… which is weird, because the test execute accumE, issue #261: specifically tests a very similar network and doesn't detect any growth in network size.

Fixed it: The issue was an accumulation of finalizers, which was due to a short-cut that I took when implementing insertEdge. As of commit 0044039, the program mentioned in #261 (comment) now runs in constant space, as indicated by the heap profile obtained through cabal run space -- +RTS -hT -RTS && hp2pretty space.hp. 🥳

Unfortunately, the earlier program mentioned in #261 (comment) , involving observeE and switchE, does not run in constant space yet. 🤔

@HeinrichApfelmus
Copy link
Owner

I have fixed an additional space leak with switchP, where old pulses would not be disconnected from the network. The cause is somewhat embarrassing: do notation was picking up the reader monad instead of the IO monad, and the function clearPredecessors that disconnects the old pulses was just never called. 😳

But now, as of commit 29776bd , the heap profile for all programs mentioned in this thread is constant. 🥳 Here is the heap profile for @ocharles' initial example program:

cabal run space -- +RTS -hT -RTS && hp2pretty space.hp

space

In addition, several variants of the space leaks reported here are now part of the automated test suite — the tests will fail if the network grows unexpectedly.

I think that it's time to successfully close this issue. 💪 If you do find more space leaks, please don't hesitate to bring them to my attention!

@luke-clifton
Copy link

What a marathon effort! I've been reading these updates with bated breath, each one delivering a fraction but leaving me wanting more. Congratulations 🎉

@HeinrichApfelmus
Copy link
Owner

What a marathon effort! I've been reading these updates with bated breath, each one delivering a fraction but leaving me wanting more. Congratulations 🎉

Thanks! 😊 What do you think — maybe I could pitch the story of these update to Netflix? 🤔

@luke-clifton
Copy link

I'd watch it 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants