
Repetitive bind to the same thread might cause a memory leak or stack overflow. #432

Closed
jorisgio opened this issue Jul 3, 2017 · 9 comments

Comments


jorisgio commented Jul 3, 2017

We've been struggling with a nasty stack overflow triggered by the cleanup function recursively walking an extremely large list of waiters. In theory this should not happen, thanks to the "silent remove" limit: the removed waiter tree that cleanup walks should never be deeper than 42.

It turns out the issue is caused by bind. bind creates immutable waiters, and if you bind a huge number of threads to a single long-lived thread, they can accumulate until a stack overflow is triggered.
Here is an example:

let rec poll_until stop_condition next =
  match%lwt
    Lwt.pick [
      (stop_condition >>= cleanup >>= fun () -> Lwt.fail Timeout);
      get next;
    ]
  with
  | exception Timeout -> log; Lwt.return_unit
  | elem -> handle elem; poll_until stop_condition next

let start () = 
  let (wait, wakeup) = Lwt.wait () in
  let next = some_stream_like in
  Lwt.async (fun () -> poll_until wait next);
  wakeup

This code looks harmless, but at every iteration it stacks a waiter onto stop_condition, then cancels it. Assuming stop_condition is protected, dead immutable waiters accumulate on wait, leaking memory and ultimately crashing in cleanup.

I believe cleanup can be made tail-recursive, but that does not solve the leak.


aantron commented Oct 14, 2017

@jorisgio Can you give a complete, self-contained example? I'm not 100% following the description + example. Where does "protected" fit in? I'm not sure dead immutable waiters are actually accumulating as described.


aantron commented Apr 4, 2018

Ok, looking at this again, the TL;DR is that it can be worked around by wrapping stop_condition in a call to Lwt.protected:

Lwt.pick [
  (Lwt.protected stop_condition >>= cleanup >>= fun () -> Lwt.fail Timeout);
  get next;
]

Lwt.protected attaches a removable callback to stop_condition, so it is correctly cleaned up when Lwt.pick cancels the whole >>= construction.


In detail, this is, most generally, an instance of the following pattern:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    Lwt.bind p (fun () -> Lwt.return ())
    |> ignore
  done

This loop attaches one callback to p on each iteration, causing allocation of one list cell in the promise's internal data structures. It's always possible to add enough callbacks to run out of memory.

We have no general way to solve this. We can't drop callbacks from p, because they might cause important side effects. This applies particularly to the callbacks attached by Lwt.on_success and related functions.


This is also an instance of the more specific pattern:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    let p' = Lwt.bind p (fun () -> Lwt.return ()) in
    Lwt.cancel p'
  done

@jorisgio, I think you wanted cancelation of p' to remove the fun () -> Lwt.return () callback from p that is added at each iteration. This made sense to me as well, at first glance.

However, in Lwt right now, p' is cancelable if and only if p is cancelable. Since p was created with Lwt.wait, neither promise is cancelable. I don't know if this design of cancelation is good, but it is probably important because cancelation handling with Lwt.catch can trigger callback calls, and therefore side effects.

As a result, in this case, Lwt.cancel p' is supposed to do nothing, and Lwt is forced to preserve all the temporary callbacks added by Lwt.bind.
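The cancelability rule above can be observed directly. A minimal sketch, assuming current Lwt semantics, using Lwt.state to inspect the outcome (Lwt.task creates a cancelable pending promise, Lwt.wait a non-cancelable one):

```ocaml
let () =
  (* Promise from Lwt.wait: not cancelable, so cancelation of a
     derived bind promise is a no-op. *)
  let p1, _ = Lwt.wait () in
  let d1 = Lwt.bind p1 (fun () -> Lwt.return ()) in
  Lwt.cancel d1;
  assert (Lwt.state d1 = Lwt.Sleep);  (* still pending: cancel did nothing *)

  (* Promise from Lwt.task: cancelable, so cancelation propagates
     backwards from the bind promise to p2 itself. *)
  let p2, _ = Lwt.task () in
  let d2 = Lwt.bind p2 (fun () -> Lwt.return ()) in
  Lwt.cancel d2;
  assert (Lwt.state p2 = Lwt.Fail Lwt.Canceled)
```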


However, if you wrap p in Lwt.protected first:

let () =
  let p, _ = Lwt.wait () in

  for _ = 1 to 1_000_000 do
    let p'' = Lwt.protected p in
    let p' = Lwt.bind p'' (fun () -> Lwt.return ()) in
    Lwt.cancel p'
  done

...you will be creating a temporary cancelable promise p'' at each iteration, and then canceling it. Lwt.protected adds a callback to p, but removes it when p'' is canceled, so there will be no memory leak (see also #181).

I think I got confused by your usage of "protected," by which you meant non-cancelable, because it is actually Lwt.protected that is needed here. This is one of those cases where the Lwt API has pretty misleading terminology :/
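To disentangle the terminology, here is a small sketch assuming current Lwt semantics: Lwt.protected builds a *cancelable* wrapper whose cancelation does not reach the underlying promise, while Lwt.no_cancel builds a *non-cancelable* wrapper on which cancel is a no-op:

```ocaml
let () =
  let p, _ = Lwt.task () in

  (* Lwt.protected: the wrapper is cancelable, but canceling it
     leaves the underlying promise p untouched. *)
  let guarded = Lwt.protected p in
  Lwt.cancel guarded;
  assert (Lwt.state guarded = Lwt.Fail Lwt.Canceled);
  assert (Lwt.state p = Lwt.Sleep);

  (* Lwt.no_cancel: the wrapper is non-cancelable; cancel is a no-op. *)
  let shielded = Lwt.no_cancel p in
  Lwt.cancel shielded;
  assert (Lwt.state shielded = Lwt.Sleep)
```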

@kohlivarun5

I think I am seeing a similar issue with streams. Here is what I am looking at:

let () =

  let stream,pusher = Lwt_stream.create () in 

  let s = 
    stream
    |> Lwt_stream.iter_p 
      (fun i -> 
        Printf.printf "Done %d" i
        |> Lwt.return)
  in

  let limit = 120_000 in
  for i = 1 to limit do
    if i >= limit 
    then pusher None
    else pusher (Some i)
  done;

  s 
  |> Lwt_main.run 

That gives a stack overflow on exit. I tried a few things with the protected workaround, but that doesn't seem to help.

Any suggestions for a workaround?


aantron commented Aug 22, 2018

Thanks for reporting, I will take a look in detail tomorrow morning (~15 hours :/).

@kohlivarun5

I tried looking into the implementation. From what I can tell, it arises from the join in iter_p. If I replace the join with something like:

match node.data with
| Some x ->
  consume s node;
  let res = f x in
  let rest = iter_p_bb_rec node.next f s in
  res >>= fun () -> rest

then it works for me, although I am not entirely sure about the semantic difference between the join and the bind here. I think it has to do with the cancellation of some promises.


aantron commented Aug 23, 2018

Would you be willing to prepare a PR with this change?

I think the issue is not with cancelation, but that the execution order in your example creates a "tower" of joins in order to resolve the final promise of iter_p: the first element in the stream creates a join that is waiting on a join created for the second element, and so on, down the whole tail of the stream. Once the example does pusher None, a series of callbacks (added by Lwt.join) starts resolving these chained-together join promises, from None back to the front of the stream, and those callback calls end up being non-tail-recursive, and they consume the whole stack.

bind in this example immediately causes the fun () -> rest callback to be called, because your example always resolves immediately with () in each callback of iter_p. So, with this code in Lwt_stream, it is as if the last line were always just rest, and the promise for the whole iter_p is the promise for the second stream element. This promise starts out pending, but once the second element is pushed in, it becomes dependent on the promise for the third iteration, and so on.

This would also build a promise tower, but Lwt (like most memory-safe promise implementations) has a mechanism similar to union-find: if one pending promise depends exactly on the outcome of another pending promise, and so on in a chain, the mechanism cuts out the middle promises and makes every pending promise depend directly on the innermost pending one. I am about 80% sure that this is kicking in here, and that is why using bind here differs from join.

I'm still a bit worried that there may be an execution order that is still pathological, or we might introduce a new pathological execution order with this change. But if that happens, we can further change the code to count how many calls to f x we made before None was pushed into the stream, and manually resolve the iter_p promise when the number of f x promises resolved reaches the number of calls made. That should eliminate any danger at all of stacked-up promises.
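The counting approach described in the last paragraph could be sketched roughly as follows. This is illustrative only: iter_p_counting is a hypothetical name, not the actual Lwt_stream implementation, and error propagation from f is omitted.

```ocaml
(* Hypothetical sketch: run [f] on each element as it arrives, track the
   number of outstanding promises, and resolve a single final promise
   once the stream has ended and the count has dropped to zero. No
   promise ever waits on another [f]-promise, so no tower can form. *)
let iter_p_counting (f : 'a -> unit Lwt.t) (stream : 'a Lwt_stream.t)
    : unit Lwt.t =
  let pending = ref 0 in
  let finished = ref false in
  let done_p, wakener = Lwt.wait () in
  let check () =
    if !finished && !pending = 0 then Lwt.wakeup wakener ()
  in
  let rec loop () =
    Lwt.bind (Lwt_stream.get stream) (function
      | None ->
        finished := true;
        check ();
        Lwt.return ()
      | Some x ->
        incr pending;
        Lwt.on_success (f x) (fun () -> decr pending; check ());
        loop ())
  in
  Lwt.async loop;
  done_p
```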

@kohlivarun5

Sure, I can definitely create a PR. Anything specific you're looking for? Or I can just change the iter_p implementation to the one I posted above.


aantron commented Aug 23, 2018

Yep, if you just PR your implementation, I'll be happy to merge it :)


aantron commented Aug 23, 2018

Thanks!
