
Repetitive bind to the same thread might cause a memory leak or stack overflow. #432

Closed
jorisgio opened this issue Jul 3, 2017 · 9 comments

Comments


jorisgio commented Jul 3, 2017

We've been struggling with a nasty stack overflow triggered by the cleanup function recursively walking an extremely large list of waiters. In theory this should not happen, thanks to the "silent remove" limit: the removed waiter tree that cleanup walks should never be deeper than 42.

It turns out the issue is caused by bind. bind creates immutable waiters, and if you bind a huge number of threads to a single long-lived thread, they can accumulate until a stack overflow is triggered.
Here is an example:

let rec poll_until stop_condition next =
  match%lwt
    Lwt.pick [
      (stop_condition >>= cleanup >>= fun () -> Lwt.fail Timeout);
      get next;
    ]
  with
  | exception Timeout -> log; Lwt.return_unit
  | elem -> handle elem; poll_until stop_condition next

let start () = 
  let (wait, wakeup) = Lwt.wait () in
  let next = some_stream_like in
  Lwt.async (fun () -> poll_until wait next);
  wakeup

This code looks harmless, but at every iteration it stacks a waiter onto stop_condition, then cancels it. Assuming stop_condition is protected, dead immutable waiters accumulate on wait, leaking memory and ultimately crashing in cleanup.

I believe cleanup can be made tail-recursive, but that does not solve the leak.


aantron commented Oct 14, 2017

@jorisgio Can you give a complete, self-contained example? I'm not 100% following the description + example. Where does "protected" fit in? I'm not sure dead immutable waiters are actually accumulating as described.


aantron commented Apr 4, 2018

Ok, looking at this again, the TL;DR is that it can be worked around by wrapping stop_condition in a call to Lwt.protected:

Lwt.pick [
  (Lwt.protected stop_condition >>= cleanup >>= fun () -> Lwt.fail Timeout);
  get next;
]

Lwt.protected attaches a removable callback to stop_condition, so it is correctly cleaned up when Lwt.pick cancels the whole >>= construction.


In detail, this is, most generally, an instance of the following pattern:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    Lwt.bind p (fun () -> Lwt.return ())
    |> ignore
  done

This loop attaches one callback to p on each iteration, causing allocation of one list cell in the promise's internal data structures. It's always possible to add enough callbacks to run out of memory.

We have no general way to solve this. We can't drop callbacks from p, because they might cause important side effects. This applies particularly to the callbacks attached by Lwt.on_success and related functions.


This is also an instance of the more specific pattern:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    let p' = Lwt.bind p (fun () -> Lwt.return ()) in
    Lwt.cancel p'
  done

@jorisgio, I think you wanted cancelation of p' to remove the fun () -> Lwt.return () callback from p that is added at each iteration. This made sense to me as well, at first glance.

However, in Lwt right now, p' is cancelable if and only if p is cancelable. Since p was created with Lwt.wait, neither promise is cancelable. I don't know if this design of cancelation is good, but it is probably important because cancelation handling with Lwt.catch can trigger callback calls, and therefore side effects.

As a result, in this case, Lwt.cancel p' is supposed to do nothing, and Lwt is forced to preserve all the temporary callbacks added by Lwt.bind.
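The cancelability rule above can be observed directly. A minimal sketch, assuming current Lwt semantics, using Lwt.state to inspect the outcome (Lwt.task creates a cancelable pending promise, Lwt.wait a non-cancelable one):

```ocaml
let () =
  (* Promise from Lwt.wait: not cancelable, so cancelation of a
     derived bind promise is a no-op. *)
  let p1, _ = Lwt.wait () in
  let d1 = Lwt.bind p1 (fun () -> Lwt.return ()) in
  Lwt.cancel d1;
  assert (Lwt.state d1 = Lwt.Sleep);  (* still pending: cancel did nothing *)

  (* Promise from Lwt.task: cancelable, so cancelation propagates
     backwards from the bind promise to p2 itself. *)
  let p2, _ = Lwt.task () in
  let d2 = Lwt.bind p2 (fun () -> Lwt.return ()) in
  Lwt.cancel d2;
  assert (Lwt.state p2 = Lwt.Fail Lwt.Canceled)
```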


However, if you wrap p in Lwt.protected first:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    let p'' = Lwt.protected p in
    let p' = Lwt.bind p'' (fun () -> Lwt.return ()) in
    Lwt.cancel p'
  done

...you will be creating a temporary cancelable promise p'' at each iteration, and then canceling it. Lwt.protected adds a callback to p, but removes it when p'' is canceled, so there will be no memory leak (see also #181).

I think I got confused by your usage of "protected," by which you meant non-cancelable, because it is actually Lwt.protected that is needed here. This is one of those cases where the Lwt API has pretty misleading terminology :/
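To disentangle the terminology, here is a small sketch assuming current Lwt semantics: Lwt.protected builds a *cancelable* wrapper whose cancelation does not reach the underlying promise, while Lwt.no_cancel builds a *non-cancelable* wrapper on which cancel is a no-op:

```ocaml
let () =
  let p, _ = Lwt.task () in

  (* Lwt.protected: the wrapper is cancelable, but canceling it
     leaves the underlying promise p untouched. *)
  let guarded = Lwt.protected p in
  Lwt.cancel guarded;
  assert (Lwt.state guarded = Lwt.Fail Lwt.Canceled);
  assert (Lwt.state p = Lwt.Sleep);

  (* Lwt.no_cancel: the wrapper is non-cancelable; cancel is a no-op. *)
  let shielded = Lwt.no_cancel p in
  Lwt.cancel shielded;
  assert (Lwt.state shielded = Lwt.Sleep)
```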

@kohlivarun5

I think I am seeing a similar issue with streams. Here is what I am looking at:

let () =

  let stream,pusher = Lwt_stream.create () in 

  let s = 
    stream
    |> Lwt_stream.iter_p 
      (fun i -> 
        Printf.printf "Done %d" i
        |> Lwt.return)
  in

  let limit = 120_000 in
  for i = 1 to limit do
    if i >= limit 
    then pusher None
    else pusher (Some i)
  done;

  s 
  |> Lwt_main.run 

That gives a stack overflow on exit. I tried a few things with the protected workaround, but that doesn't seem to help.

Any suggestions for a workaround?


aantron commented Aug 22, 2018

Thanks for reporting, I will take a look in detail tomorrow morning (~15 hours :/).

@kohlivarun5

I tried looking into the implementation. From what I can tell, it arises from the join in iter_p. If I replace the join with something like:

match node.data with
| Some x ->
  consume s node;
  let res = f x in
  let rest = iter_p_bb_rec node.next f s in
  res >>= fun () -> rest

then it works for me, although I am not entirely sure about the semantic difference between the join and the bind here. I think it has to do with the cancellation of some promises.


aantron commented Aug 23, 2018

Would you be willing to prepare a PR with this change?

I think the issue is not with cancelation, but that the execution order in your example creates a "tower" of joins in order to resolve the final promise of iter_p: the first element in the stream creates a join that is waiting on a join created for the second element, and so on, down the whole tail of the stream. Once the example does pusher None, a series of callbacks (added by Lwt.join) starts resolving these chained-together join promises, from None back to the front of the stream, and those callback calls end up being non-tail-recursive, and they consume the whole stack.

bind in this example immediately causes the fun () -> rest callback to be called, because your example always resolves immediately with () in each callback of iter_p. So, with this code in Lwt_stream, it is as if the last line were always just rest, and the promise for the whole iter_p is the promise for the second stream element. This promise starts out pending, but once the second element is pushed in, it becomes dependent on the promise for the third iteration, and so on.

This would also build a promise tower, but Lwt (like most memory-safe promise implementations) has a mechanism similar to union-find: if one pending promise depends exactly on the outcome of another pending promise, and so on in a chain, the mechanism cuts out the middle promises and makes every pending promise depend directly on the innermost pending one. I am about 80% sure that this is kicking in here, and that is why using bind here differs from join.

I'm still a bit worried that there may be an execution order that is still pathological, or we might introduce a new pathological execution order with this change. But if that happens, we can further change the code to count how many calls to f x we made before None was pushed into the stream, and manually resolve the iter_p promise when the number of f x promises resolved reaches the number of calls made. That should eliminate any danger at all of stacked-up promises.
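The counting approach described in the last paragraph could be sketched roughly as follows. This is illustrative only: iter_p_counting is a hypothetical name, not the actual Lwt_stream implementation, and error propagation from f is omitted.

```ocaml
(* Hypothetical sketch: run [f] on each element as it arrives, track the
   number of outstanding promises, and resolve a single final promise
   once the stream has ended and the count has dropped to zero. No
   promise ever waits on another [f]-promise, so no tower can form. *)
let iter_p_counting (f : 'a -> unit Lwt.t) (stream : 'a Lwt_stream.t)
    : unit Lwt.t =
  let pending = ref 0 in
  let finished = ref false in
  let done_p, wakener = Lwt.wait () in
  let check () =
    if !finished && !pending = 0 then Lwt.wakeup wakener ()
  in
  let rec loop () =
    Lwt.bind (Lwt_stream.get stream) (function
      | None ->
        finished := true;
        check ();
        Lwt.return ()
      | Some x ->
        incr pending;
        Lwt.on_success (f x) (fun () -> decr pending; check ());
        loop ())
  in
  Lwt.async loop;
  done_p
```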

@kohlivarun5

Sure, I can definitely create a PR. Anything specific you're looking for? Or I can just change the iter_p implementation to the one I posted above.


aantron commented Aug 23, 2018

Yep, if you just PR your implementation, I'll be happy to merge it :)


aantron commented Aug 23, 2018

Thanks!
