Broken Backtraces #714

vog · 2019-09-18T07:26:42Z

The following Lwt program produces the expected backtrace:

let test () =
  let%lwt () = Lwt.fail_with "Test" in
  let%lwt () = Lwt_io.printf "Test\n" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

Fatal error: exception Failure("Test")
Raised at file "main.ml", line 2, characters 2-95
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

However, when switching the order of lines 2 and 3, the backtrace is broken:

let test () =
  let%lwt () = Lwt_io.printf "Test\n" in
  let%lwt () = Lwt.fail_with "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

Test
Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

The expected output would have been:

Test
Fatal error: exception Failure("Test")
Raised at file "main.ml", line 3, characters 2-95
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

Is this a bug in Lwt? If not, what am I doing wrong?

I was able to reproduce this issue with lwt 4.3.0 and lwt_ppx 1.2.3 on vanilla ocaml 4.08.1 and used the following dune file for building:

(executable
 (name test)
 (libraries lwt.unix)
 (preprocess (pps lwt_ppx)))

Adding (flags (-g)) makes no difference.

The text was updated successfully, but these errors were encountered:

aantron · 2019-09-18T09:41:12Z

Thanks for the report. I'm looking into this. It might take more than a few hours, because the solution might be to fundamentally improve Lwt's backtrace handling, not sure yet :)

This commit avoids raising exceptions during Lwt's own usage of Lwt_pqueue. Such exceptions clobber the user's backtrace. This could also be solved by using raise_notrace. However, there may still be users of Lwt_pqueue, so I chose not to risk ruining their backtraces. Part of #714.

Part of #714.

aantron · 2019-09-20T12:08:17Z

To address the particular problem in the issue, try this change:

 let test () =
   let%lwt () = Lwt_io.printf "Test\n" in
-  let%lwt () = Lwt.fail_with "Test" in
+  let%lwt () = failwith "Test" in
   Lwt.return ()

 let () =
   Printexc.record_backtrace true;
   test () |> Lwt_main.run

This gives me the output:

Test
Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 29, characters 17-33
Called from file "main.ml", line 3, characters 15-30
Called from file "main.ml", line 8, characters 2-9

The issue is that Lwt.fail_with (and Lwt.fail) do not themselves capture a backtrace. I updated the docs in the second attached commit to recommend against using them, and to raise exceptions using raise or other functions that call raise, such as failwith.

While debugging this, I also found that this still doesn't work:

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  let%lwt () = failwith "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

It results in the output:

Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

With the first attached commit, it results in:

Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 29, characters 17-33
Called from file "scratch/vog/main.ml", line 3, characters 15-30
Called from file "src/core/lwt.ml", line 1929, characters 23-26
Re-raised at file "scratch/vog/main.ml", line 2, characters 2-86
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "scratch/vog/main.ml", line 8, characters 2-25

This second issue is that OCaml stores backtraces in a global variable (thread-local in 4.10, I think, but the result is the same for Lwt). So, if an exception is (1) raised internally by Lwt, or (2) in an interfering Lwt control flow, between when a user exception is raised and its backtrace would be printed, the interfering exception clobbers the backtrace.

Lwt was raising Not_found in its internal module Lwt_pqueue, when processing promises created with Lwt_unix.sleep, so that was an instance of (1).

To deal with (2), we probably need to rework Lwt's backtrace handling. @Drup, can you comment on the history of current backtrace handling in the PPX?

@vog, does this fix the immediate issue in your full project? If so, I'll release this in Lwt 4.3.1, and address backtraces fully later.

vog · 2019-09-20T13:42:19Z

I replaced Lwt.fail_with with failwith, using the original Lwt version, but the backtrace is still broken.

Is there a simple way to test against the latest master of Lwt? (I'm using OPAM with a project-local OPAM switch _opam, as well as Dune.)

Or, does it make more sense for you to release Lwt 4.3.1 first, so I'll test against the new release?

aantron · 2019-09-20T13:57:09Z

Is there a simple way to test against the latest master of Lwt?

Yes, in your switch (in the project directory), run

opam pin add lwt --dev-repo

aantron · 2019-09-20T13:59:08Z

Once testing with master, would you check if the minimal example you originally provided in this issue is fixed? If so, could you find a new minimal example? It's likely that there are multiple underlying causes, and each has to be eliminated to improve the backtrace. Each will need a new minimal example.

vog · 2019-09-20T14:09:28Z

It still fails after pinning Lwt to master. I'll provide a new minimal example.

vog · 2019-09-20T21:13:17Z

Minimal example:

let test () =
  match%lwt Caqti_lwt.connect (Uri.of_string "sqlite3:test.db") with
  | Error _ -> Lwt.return ()
  | Ok (module Db : Caqti_lwt.CONNECTION) ->
    let%lwt _ = Db.find (Caqti_request.find Caqti_type.unit Caqti_type.unit "select 1") () in
    let%lwt () = Lwt_io.printf "Test" in
    failwith "Test"

let () =
  Printexc.record_backtrace true;
  Lwt_main.run @@
  test ()

Dune file:

(executable
 (name test)
 (libraries
  caqti-driver-sqlite3
  caqti-lwt
 )
 (preprocess (pps lwt_ppx))
 (modes native))

Output:

Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "test.ml", line 11, characters 2-27

Alas I was unable to get rid of the Caqti dependency. Every further simplification or replacement of the Caqti calls made the bug disappear.

aantron · 2019-09-20T21:17:29Z

Have you tried simplifying Caqti as well? If you are doing this in a Dune workspace, it should be fairly easy to undo most of Caqti (if you haven't done so yet).

This is what I ended up doing earlier to find the Lwt_pqueue bug. I basically progressively inlined bits of Lwt into the repro case, deleting irrelevant parts as I went.

vog · 2019-09-21T21:07:27Z

Sorry for asking again perhaps simple questions about Opam and Dune, but:

Which steps are needed to edit Caqti?

I tried to edit some file in _opam/lib/caqti-lwt/*.ml and _opam/lib/caqti/*.ml and to rebuild with dune build, but the modified Caqti wasn't rebuilt ... which doesn't surprise me, as Caqti is managed by Opam and not by my Dune file.

What's the correct way to build with a locally modified version of the Caqti libraries?

aantron · 2019-09-21T23:17:27Z

Since the code is already pretty simplified, you could just put the repro case directly into a clone of the Caqti repo as a new executable, and build it there, while simplifying Caqti (and the case, hopefully).

For a more complex case, you should be able to make a Dune workspace that looks like this:

workspace/
|--- dune-workspace
|--- caqti/
|--- other_code1/
|--- other_code2/

dune build should then prefer the copy of Caqti in workspace/caqti/.

This structure may be useful if you need to pull in sqlite3-ocaml, for example.

aantron · 2019-09-21T23:19:32Z

You can also test modified Caqti with opam, using an opam pin of the Caqti repo, but this is a much slower workflow, because you will have to fully compile and reinstall Caqti after each change. With a Dune workspace, you get incremental builds.

vog · 2019-09-23T07:42:51Z

Uh-oh, it seems that caqti-lwt (which is the part of Caqti I chose to inline first) won't build with OCaml 4.08.1:

File "ocaml-caqti/lib/caqti_pool.ml", line 28, characters 56-74:
28 |     let compare {priority = pA; _} {priority = pB; _} = Pervasives.compare pB pA
                                                             ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): module Stdlib.Pervasives
Use Stdlib instead.

If you need to stay compatible with OCaml < 4.07, you can use the 
stdlib-shims library: https://github.com/ocaml/stdlib-shims
File "ocaml-caqti/lib/caqti_type.ml", line 47, characters 13-38:
47 |     let ec = Obj.extension_constructor ft in
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_type.ml", line 51, characters 13-38:
51 |     let ec = Obj.extension_constructor ft in
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_type.ml", line 66, characters 11-29:
66 |    | ft -> Obj.extension_name (Obj.extension_constructor ft)
                ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_name
use Obj.Extension_constructor.name
File "ocaml-caqti/lib/caqti_type.ml", line 66, characters 31-56:
66 |    | ft -> Obj.extension_name (Obj.extension_constructor ft)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_error.ml", line 26, characters 10-35:
26 |   let c = Obj.extension_constructor msg in
               ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_error.ml", line 33, characters 7-25:
33 |       (Obj.extension_name c)
            ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_name
use Obj.Extension_constructor.name

I wonder how OPAM managed to install Caqti in my 4.08.1 switch, where it successfully compiled and worked flawlessly (except for the backtraces).

I'll retry with a 4.07 and 4.06 switch, but there must be a better way, as it obviously is possible to compile it with 4.08, I just can't figure out how.

XVilka · 2019-09-23T08:49:21Z

I have met the same (or of very similar nature) problem before, see: https://discuss.ocaml.org/t/ocaml-and-backtraces/3542/7

vog · 2019-09-23T09:02:35Z

I'm now using an OCaml 4.07.0 switch. The issue is still reproducible, and I can test with a custom Caqti build. However, simplifying and reducing Caqti is not straight forward ...

aantron · 2019-09-23T10:40:25Z

@vog the difference is that when opam installs a package, it is installed in release mode, where most warnings remain warnings. So, those deprecation warnings are written to a log file, which you never see. When you build a package locally, it is built in development mode by default, in which deprecation warnings are errors.

aantron · 2019-09-23T10:45:01Z

I opened paurkedal/ocaml-caqti#29 on Caqti about that.

vog · 2019-09-23T14:38:11Z

No yet reached a minimal example, but I found the following snippet which uses Lwt.fail in the second half (between begin and end):

    let poll ?(read = false) ?(write = false) ?timeout fd =
      let choices =
        [] |> (fun acc -> if read then Lwt_unix.wait_read fd :: acc else acc)
           |> (fun acc -> if write then Lwt_unix.wait_write fd :: acc else acc)
           |> Option.fold (fun t acc -> Lwt_unix.timeout t :: acc) timeout in
      if choices = [] then
        invalid_arg "Caqti_lwt.Unix.poll: No operation specified."
      else
        begin
          Lwt.catch
            (fun () -> Lwt.choose choices >|= fun _ -> false)
            (function
             | Lwt_unix.Timeout -> Lwt.return_true
             | exn -> Lwt.fail exn)
        end >>= fun timed_out ->
        Lwt.return (Lwt_unix.readable fd, Lwt_unix.writable fd, timed_out)

Obviously, replacing Lwt.fail with raise doesn't fix the issue, as the backtrace has already been thrown away. Is is possible that using Lwt.catch instead of try%lwt is part of the problem? What's the Lwt-idiomatic way to write that exception handler?

Perhaps fixing that snippet is sufficient to fix the issue?

aantron · 2019-09-23T15:14:19Z

Lwt.fail is probably correct here, and try%lwt will expand to similar code anyway. What I suspect is most likely happening, is that, in this code, or code similar to it, by the time exn is being handled, another exception has already been raised, and has clobbered the backtrace that was recorded when exn was originally handled.

If that turns out to be the case, then we need to do the full backtrace handling rework, since there is no small patch to handle that case correctly.

One thing you can do is use Printexc to print the backtrace in this exception handler. Not sure if you've done that already.

vog · 2019-09-23T16:02:00Z

That poll function turned out not to be actually used by my minimal example.

The current state is that I fully inlined Caqti, removed a lot of code, and I'm down to 722 lines. Due to dynamic stuff, as well as OCaml not checking unused parts of module types (just of modules), going further down is a bit tricky.

aantron · 2019-09-23T16:09:21Z

That's a lot of minification :) If you get really stuck, please upload your remaining case to a gist, with clear instructions on how to reproduce the environment in which it is built, and I can take over.

vog · 2019-09-23T16:35:40Z

Thanks! I'm down to 677 lines. This is the Gist:

https://gist.github.com/vog/c0ee4cff2aeff4840b4e6c947f7a7ec2

The "run" shell script contains the instructions.

Another option for this change was to use Map.S.find_opt. However, this requires OCaml 4.05, but Lwt still supports 4.02. The code in this commit avoids having to select different implementations depending on the OCaml version. This code is not performance-critical, so doing two descents into the map should not be a problem. Part of #714.

aantron · 2019-09-24T14:27:07Z

I was able to reduce the case to:

let foo : unit Lwt.key = Lwt.new_key ()

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  Lwt.async begin fun () ->
    let%lwt () = Lwt.pause () in
    ignore (Lwt.get foo);
    Lwt.return_unit
  end;
  failwith "Test"

let () =
  Printexc.record_backtrace true;
  Lwt_main.run @@
  test ()

It can probably be reduced even further, but by this point, I already knew that the culprit is Lwt.get, because it uses an exception (from stdlib's Map) internally for control flow:

lwt/src/core/lwt.ml

Lines 794 to 802 in 25d31d6

    
           let get key = 
        
             try 
        
               let refresh = Storage_map.find key.id !current_storage in 
        
               refresh (); 
        
               let value = key.value in 
        
               key.value <- None; 
        
               value 
        
             with Not_found -> 
        
               None

A fix is attached, above.

Before the fix, the above code results in

Raised at file "lwt/src/core/lwt.ml", line 2998, characters 20-29
Called from file "lwt/src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "vog/test.ml", line 14, characters 2-27

With the fix, it results in

Raised at file "stdlib.ml", line 33, characters 17-33
Called from file "lwt/src/core/lwt.ml", line 1930, characters 23-26
Re-raised at file "vog/test.ml", line 4, characters 2-168
Re-raised at file "lwt/src/core/lwt.ml", line 2999, characters 20-29
Called from file "lwt/src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "vog/test.ml", line 14, characters 2-27

The backtrace could use even more improvement, but it is at least somewhat accurate.

@vog, the backtrace in the repro case from the gist you posted is now

Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 33, characters 17-33
Called from file "src/core/lwt.ml", line 1930, characters 23-26
Re-raised at file "test.ml", line 670, characters 4-154
Re-raised at file "src/core/lwt.ml", line 2999, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "test.ml", line 676, characters 2-27

Again, not completely accurate, but definitely improved.

Could you test your full project again, with the latest Lwt master (confirming you have this latest commit, 4694853, from above)?

vog · 2019-09-25T12:13:37Z

I just re-tested with my real code and can confirm that the latest Lwt master is a huge improvement!

Although there are still missing parts in the backtrace (as you described yourself), the backtrace is now good enough to find most errors immediately. No more guessing around. Thanks!

For further analysis of the remaining issues, I propose the following additional minimal test case which produces a non-optimal backtrace:

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  let%lwt () = Lwt_io.printf "test\n" in
  let%lwt () = failwith "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

aantron · 2019-09-25T12:23:15Z

Good :) I opened #720 about "true" backtrace handling by Lwt, which is probably needed to solve remaining issues. That will be a medium-sized project for this repo, and I'm not sure yet if it can be made to work. So, I'm going to put that off for a 4.4.0 release, close this issue, and release the patches from this issue in Lwt 4.3.1 soon. I linked your test case from #720. Thanks for reporting this and working on it!

aantron added the bug label Sep 18, 2019

aantron added this to the 4.3.1 milestone Sep 18, 2019

aantron mentioned this issue Sep 19, 2019

Lwt_unix calls get stuck after Lwt_unix.fork #704

Closed

aantron added a commit that referenced this issue Sep 20, 2019

Docs: recommend against using Lwt.fail

870a3a3

Part of #714.

aantron mentioned this issue Sep 23, 2019

Fix deprecation warnings on 4.08 paurkedal/ocaml-caqti#29

Closed

aantron mentioned this issue Sep 25, 2019

Explicit backtrace handling #720

Open

aantron closed this as completed Sep 25, 2019

aantron mentioned this issue Sep 26, 2019

Lwt 4.3.1 ocaml/opam-repository#14914

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broken Backtraces #714

Broken Backtraces #714

vog commented Sep 18, 2019

aantron commented Sep 18, 2019

aantron commented Sep 20, 2019

vog commented Sep 20, 2019 •

edited

Loading

aantron commented Sep 20, 2019

aantron commented Sep 20, 2019

vog commented Sep 20, 2019

vog commented Sep 20, 2019

aantron commented Sep 20, 2019

vog commented Sep 21, 2019 •

edited

Loading

aantron commented Sep 21, 2019

aantron commented Sep 21, 2019 •

edited

Loading

vog commented Sep 23, 2019

XVilka commented Sep 23, 2019 •

edited

Loading

vog commented Sep 23, 2019

aantron commented Sep 23, 2019

aantron commented Sep 23, 2019

vog commented Sep 23, 2019 •

edited

Loading

aantron commented Sep 23, 2019

vog commented Sep 23, 2019

aantron commented Sep 23, 2019

vog commented Sep 23, 2019

aantron commented Sep 24, 2019

vog commented Sep 25, 2019 •

edited

Loading

aantron commented Sep 25, 2019

Broken Backtraces #714

Broken Backtraces #714

Comments

vog commented Sep 18, 2019

aantron commented Sep 18, 2019

aantron commented Sep 20, 2019

vog commented Sep 20, 2019 • edited Loading

aantron commented Sep 20, 2019

aantron commented Sep 20, 2019

vog commented Sep 20, 2019

vog commented Sep 20, 2019

aantron commented Sep 20, 2019

vog commented Sep 21, 2019 • edited Loading

aantron commented Sep 21, 2019

aantron commented Sep 21, 2019 • edited Loading

vog commented Sep 23, 2019

XVilka commented Sep 23, 2019 • edited Loading

vog commented Sep 23, 2019

aantron commented Sep 23, 2019

aantron commented Sep 23, 2019

vog commented Sep 23, 2019 • edited Loading

aantron commented Sep 23, 2019

vog commented Sep 23, 2019

aantron commented Sep 23, 2019

vog commented Sep 23, 2019

aantron commented Sep 24, 2019

vog commented Sep 25, 2019 • edited Loading

aantron commented Sep 25, 2019

vog commented Sep 20, 2019 •

edited

Loading

vog commented Sep 21, 2019 •

edited

Loading

aantron commented Sep 21, 2019 •

edited

Loading

XVilka commented Sep 23, 2019 •

edited

Loading

vog commented Sep 23, 2019 •

edited

Loading

vog commented Sep 25, 2019 •

edited

Loading