Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken Backtraces #714

Closed
vog opened this issue Sep 18, 2019 · 24 comments
Closed

Broken Backtraces #714

vog opened this issue Sep 18, 2019 · 24 comments
Labels
Milestone

Comments

@vog
Copy link
Contributor

vog commented Sep 18, 2019

The following Lwt program produces the expected backtrace:

let test () =
  let%lwt () = Lwt.fail_with "Test" in
  let%lwt () = Lwt_io.printf "Test\n" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run
Fatal error: exception Failure("Test")
Raised at file "main.ml", line 2, characters 2-95
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

However, when switching the order of lines 2 and 3, the backtrace is broken:

let test () =
  let%lwt () = Lwt_io.printf "Test\n" in
  let%lwt () = Lwt.fail_with "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run
Test
Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

The expected output would have been:

Test
Fatal error: exception Failure("Test")
Raised at file "main.ml", line 3, characters 2-95
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

Is this a bug in Lwt? If not, what am I doing wrong?

I was able to reproduce this issue with lwt 4.3.0 and lwt_ppx 1.2.3 on vanilla ocaml 4.08.1 and used the following dune file for building:

(executable
 (name test)
 (libraries lwt.unix)
 (preprocess (pps lwt_ppx)))

Adding (flags (-g)) makes no difference.

@aantron aantron added the bug label Sep 18, 2019
@aantron aantron added this to the 4.3.1 milestone Sep 18, 2019
@aantron
Copy link
Collaborator

aantron commented Sep 18, 2019

Thanks for the report. I'm looking into this. It might take more than a few hours, because the solution might be to fundamentally improve Lwt's backtrace handling, not sure yet :)

aantron added a commit that referenced this issue Sep 20, 2019
This commit avoids raising exceptions during Lwt's own usage of
Lwt_pqueue. Such exceptions clobber the user's backtrace.

This could also be solved by using raise_notrace. However, there may
still be users of Lwt_pqueue, so I chose not to risk ruining their
backtraces.

Part of #714.
aantron added a commit that referenced this issue Sep 20, 2019
@aantron
Copy link
Collaborator

aantron commented Sep 20, 2019

To address the particular problem in the issue, try this change:

 let test () =
   let%lwt () = Lwt_io.printf "Test\n" in
-  let%lwt () = Lwt.fail_with "Test" in
+  let%lwt () = failwith "Test" in
   Lwt.return ()

 let () =
   Printexc.record_backtrace true;
   test () |> Lwt_main.run

This gives me the output:

Test
Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 29, characters 17-33
Called from file "main.ml", line 3, characters 15-30
Called from file "main.ml", line 8, characters 2-9

The issue is that Lwt.fail_with (and Lwt.fail) do not themselves capture a backtrace. I updated the docs in the second attached commit to recommend against using them, and to raise exceptions using raise or other functions that call raise, such as failwith.

While debugging this, I also found that this still doesn't work:

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  let%lwt () = failwith "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

It results in the output:

Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "main.ml", line 8, characters 2-25

With the first attached commit, it results in:

Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 29, characters 17-33
Called from file "scratch/vog/main.ml", line 3, characters 15-30
Called from file "src/core/lwt.ml", line 1929, characters 23-26
Re-raised at file "scratch/vog/main.ml", line 2, characters 2-86
Re-raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "scratch/vog/main.ml", line 8, characters 2-25

This second issue is that OCaml stores backtraces in a global variable (thread-local in 4.10, I think, but the result is the same for Lwt). So, if an exception is (1) raised internally by Lwt, or (2) in an interfering Lwt control flow, between when a user exception is raised and its backtrace would be printed, the interfering exception clobbers the backtrace.

Lwt was raising Not_found in its internal module Lwt_pqueue, when processing promises created with Lwt_unix.sleep, so that was an instance of (1).

To deal with (2), we probably need to rework Lwt's backtrace handling. @Drup, can you comment on the history of current backtrace handling in the PPX?

@vog, does this fix the immediate issue in your full project? If so, I'll release this in Lwt 4.3.1, and address backtraces fully later.

@vog
Copy link
Contributor Author

vog commented Sep 20, 2019

I replaced Lwt.fail_with with failwith, using the original Lwt version, but the backtrace is still broken.

Is there a simple way to test against the latest master of Lwt? (I'm using OPAM with a project-local OPAM switch _opam, as well as Dune.)

Or, does it make more sense for you to release Lwt 4.3.1 first, so I'll test against the new release?

@aantron
Copy link
Collaborator

aantron commented Sep 20, 2019

Is there a simple way to test against the latest master of Lwt?

Yes, in your switch (in the project directory), run

opam pin add lwt --dev-repo

@aantron
Copy link
Collaborator

aantron commented Sep 20, 2019

Once testing with master, would you check if the minimal example you originally provided in this issue is fixed? If so, could you find a new minimal example? It's likely that there are multiple underlying causes, and each has to be eliminated to improve the backtrace. Each will need a new minimal example.

@vog
Copy link
Contributor Author

vog commented Sep 20, 2019

It still fails after pinning Lwt to master. I'll provide a new minimal example.

@vog
Copy link
Contributor Author

vog commented Sep 20, 2019

Minimal example:

let test () =
  match%lwt Caqti_lwt.connect (Uri.of_string "sqlite3:test.db") with
  | Error _ -> Lwt.return ()
  | Ok (module Db : Caqti_lwt.CONNECTION) ->
    let%lwt _ = Db.find (Caqti_request.find Caqti_type.unit Caqti_type.unit "select 1") () in
    let%lwt () = Lwt_io.printf "Test" in
    failwith "Test"

let () =
  Printexc.record_backtrace true;
  Lwt_main.run @@
  test ()

Dune file:

(executable
 (name test)
 (libraries
  caqti-driver-sqlite3
  caqti-lwt
 )
 (preprocess (pps lwt_ppx))
 (modes native))

Output:

Fatal error: exception Failure("Test")
Raised at file "src/core/lwt.ml", line 2998, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "test.ml", line 11, characters 2-27

Alas I was unable to get rid of the Caqti dependency. Every further simplification or replacement of the Caqti calls made the bug disappear.

@aantron
Copy link
Collaborator

aantron commented Sep 20, 2019

Have you tried simplifying Caqti as well? If you are doing this in a Dune workspace, it should be fairly easy to undo most of Caqti (if you haven't done so yet).

This is what I ended up doing earlier to find the Lwt_pqueue bug. I basically progressively inlined bits of Lwt into the repro case, deleting irrelevant parts as I went.

@vog
Copy link
Contributor Author

vog commented Sep 21, 2019

Sorry for asking again perhaps simple questions about Opam and Dune, but:

Which steps are needed to edit Caqti?

I tried to edit some file in _opam/lib/caqti-lwt/*.ml and _opam/lib/caqti/*.ml and to rebuild with dune build, but the modified Caqti wasn't rebuilt ... which doesn't surprise me, as Caqti is managed by Opam and not by my Dune file.

What's the correct way to build with a locally modified version of the Caqti libraries?

@aantron
Copy link
Collaborator

aantron commented Sep 21, 2019

Since the code is already pretty simplified, you could just put the repro case directly into a clone of the Caqti repo as a new executable, and build it there, while simplifying Caqti (and the case, hopefully).

For a more complex case, you should be able to make a Dune workspace that looks like this:

workspace/
|--- dune-workspace
|--- caqti/
|--- other_code1/
|--- other_code2/

dune build should then prefer the copy of Caqti in workspace/caqti/.

This structure may be useful if you need to pull in sqlite3-ocaml, for example.

@aantron
Copy link
Collaborator

aantron commented Sep 21, 2019

You can also test modified Caqti with opam, using an opam pin of the Caqti repo, but this is a much slower workflow, because you will have to fully compile and reinstall Caqti after each change. With a Dune workspace, you get incremental builds.

@vog
Copy link
Contributor Author

vog commented Sep 23, 2019

Uh-oh, it seems that caqti-lwt (which is the part of Caqti I chose to inline first) won't build with OCaml 4.08.1:

File "ocaml-caqti/lib/caqti_pool.ml", line 28, characters 56-74:
28 |     let compare {priority = pA; _} {priority = pB; _} = Pervasives.compare pB pA
                                                             ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): module Stdlib.Pervasives
Use Stdlib instead.

If you need to stay compatible with OCaml < 4.07, you can use the 
stdlib-shims library: https://github.com/ocaml/stdlib-shims
File "ocaml-caqti/lib/caqti_type.ml", line 47, characters 13-38:
47 |     let ec = Obj.extension_constructor ft in
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_type.ml", line 51, characters 13-38:
51 |     let ec = Obj.extension_constructor ft in
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_type.ml", line 66, characters 11-29:
66 |    | ft -> Obj.extension_name (Obj.extension_constructor ft)
                ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_name
use Obj.Extension_constructor.name
File "ocaml-caqti/lib/caqti_type.ml", line 66, characters 31-56:
66 |    | ft -> Obj.extension_name (Obj.extension_constructor ft)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_error.ml", line 26, characters 10-35:
26 |   let c = Obj.extension_constructor msg in
               ^^^^^^^^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_constructor
use Obj.Extension_constructor.of_val
File "ocaml-caqti/lib/caqti_error.ml", line 33, characters 7-25:
33 |       (Obj.extension_name c)
            ^^^^^^^^^^^^^^^^^^
Error (alert deprecated): Stdlib.Obj.extension_name
use Obj.Extension_constructor.name

I wonder how OPAM managed to install Caqti in my 4.08.1 switch, where it successfully compiled and worked flawlessly (except for the backtraces).

I'll retry with a 4.07 and 4.06 switch, but there must be a better way, as it obviously is possible to compile it with 4.08, I just can't figure out how.

@XVilka
Copy link
Contributor

XVilka commented Sep 23, 2019

I have met the same (or of very similar nature) problem before, see: https://discuss.ocaml.org/t/ocaml-and-backtraces/3542/7

@vog
Copy link
Contributor Author

vog commented Sep 23, 2019

I'm now using an OCaml 4.07.0 switch. The issue is still reproducible, and I can test with a custom Caqti build. However, simplifying and reducing Caqti is not straight forward ...

@aantron
Copy link
Collaborator

aantron commented Sep 23, 2019

@vog the difference is that when opam installs a package, it is installed in release mode, where most warnings remain warnings. So, those deprecation warnings are written to a log file, which you never see. When you build a package locally, it is built in development mode by default, in which deprecation warnings are errors.

@aantron
Copy link
Collaborator

aantron commented Sep 23, 2019

I opened paurkedal/ocaml-caqti#29 on Caqti about that.

@vog
Copy link
Contributor Author

vog commented Sep 23, 2019

No yet reached a minimal example, but I found the following snippet which uses Lwt.fail in the second half (between begin and end):

    let poll ?(read = false) ?(write = false) ?timeout fd =
      let choices =
        [] |> (fun acc -> if read then Lwt_unix.wait_read fd :: acc else acc)
           |> (fun acc -> if write then Lwt_unix.wait_write fd :: acc else acc)
           |> Option.fold (fun t acc -> Lwt_unix.timeout t :: acc) timeout in
      if choices = [] then
        invalid_arg "Caqti_lwt.Unix.poll: No operation specified."
      else
        begin
          Lwt.catch
            (fun () -> Lwt.choose choices >|= fun _ -> false)
            (function
             | Lwt_unix.Timeout -> Lwt.return_true
             | exn -> Lwt.fail exn)
        end >>= fun timed_out ->
        Lwt.return (Lwt_unix.readable fd, Lwt_unix.writable fd, timed_out)

Obviously, replacing Lwt.fail with raise doesn't fix the issue, as the backtrace has already been thrown away. Is is possible that using Lwt.catch instead of try%lwt is part of the problem? What's the Lwt-idiomatic way to write that exception handler?

Perhaps fixing that snippet is sufficient to fix the issue?

@aantron
Copy link
Collaborator

aantron commented Sep 23, 2019

Lwt.fail is probably correct here, and try%lwt will expand to similar code anyway. What I suspect is most likely happening, is that, in this code, or code similar to it, by the time exn is being handled, another exception has already been raised, and has clobbered the backtrace that was recorded when exn was originally handled.

If that turns out to be the case, then we need to do the full backtrace handling rework, since there is no small patch to handle that case correctly.

One thing you can do is use Printexc to print the backtrace in this exception handler. Not sure if you've done that already.

@vog
Copy link
Contributor Author

vog commented Sep 23, 2019

That poll function turned out not to be actually used by my minimal example.

The current state is that I fully inlined Caqti, removed a lot of code, and I'm down to 722 lines. Due to dynamic stuff, as well as OCaml not checking unused parts of module types (just of modules), going further down is a bit tricky.

@aantron
Copy link
Collaborator

aantron commented Sep 23, 2019

That's a lot of minification :) If you get really stuck, please upload your remaining case to a gist, with clear instructions on how to reproduce the environment in which it is built, and I can take over.

@vog
Copy link
Contributor Author

vog commented Sep 23, 2019

Thanks! I'm down to 677 lines. This is the Gist:

https://gist.github.com/vog/c0ee4cff2aeff4840b4e6c947f7a7ec2

The "run" shell script contains the instructions.

aantron added a commit that referenced this issue Sep 24, 2019
Another option for this change was to use Map.S.find_opt. However, this
requires OCaml 4.05, but Lwt still supports 4.02. The code in this
commit avoids having to select different implementations depending on
the OCaml version. This code is not performance-critical, so doing two
descents into the map should not be a problem.

Part of #714.
@aantron
Copy link
Collaborator

aantron commented Sep 24, 2019

I was able to reduce the case to:

let foo : unit Lwt.key = Lwt.new_key ()

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  Lwt.async begin fun () ->
    let%lwt () = Lwt.pause () in
    ignore (Lwt.get foo);
    Lwt.return_unit
  end;
  failwith "Test"

let () =
  Printexc.record_backtrace true;
  Lwt_main.run @@
  test ()

It can probably be reduced even further, but by this point, I already knew that the culprit is Lwt.get, because it uses an exception (from stdlib's Map) internally for control flow:

lwt/src/core/lwt.ml

Lines 794 to 802 in 25d31d6

let get key =
try
let refresh = Storage_map.find key.id !current_storage in
refresh ();
let value = key.value in
key.value <- None;
value
with Not_found ->
None

A fix is attached, above.

Before the fix, the above code results in

Raised at file "lwt/src/core/lwt.ml", line 2998, characters 20-29
Called from file "lwt/src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "vog/test.ml", line 14, characters 2-27

With the fix, it results in

Raised at file "stdlib.ml", line 33, characters 17-33
Called from file "lwt/src/core/lwt.ml", line 1930, characters 23-26
Re-raised at file "vog/test.ml", line 4, characters 2-168
Re-raised at file "lwt/src/core/lwt.ml", line 2999, characters 20-29
Called from file "lwt/src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "vog/test.ml", line 14, characters 2-27

The backtrace could use even more improvement, but it is at least somewhat accurate.

@vog, the backtrace in the repro case from the gist you posted is now

Fatal error: exception Failure("Test")
Raised at file "stdlib.ml", line 33, characters 17-33
Called from file "src/core/lwt.ml", line 1930, characters 23-26
Re-raised at file "test.ml", line 670, characters 4-154
Re-raised at file "src/core/lwt.ml", line 2999, characters 20-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "test.ml", line 676, characters 2-27

Again, not completely accurate, but definitely improved.

Could you test your full project again, with the latest Lwt master (confirming you have this latest commit, 4694853, from above)?

@vog
Copy link
Contributor Author

vog commented Sep 25, 2019

I just re-tested with my real code and can confirm that the latest Lwt master is a huge improvement!

Although there are still missing parts in the backtrace (as you described yourself), the backtrace is now good enough to find most errors immediately. No more guessing around. Thanks!

For further analysis of the remaining issues, I propose the following additional minimal test case which produces a non-optimal backtrace:

let test () =
  let%lwt () = Lwt_unix.sleep 0.1 in
  let%lwt () = Lwt_io.printf "test\n" in
  let%lwt () = failwith "Test" in
  Lwt.return ()

let () =
  Printexc.record_backtrace true;
  test () |> Lwt_main.run

@aantron
Copy link
Collaborator

aantron commented Sep 25, 2019

Good :) I opened #720 about "true" backtrace handling by Lwt, which is probably needed to solve remaining issues. That will be a medium-sized project for this repo, and I'm not sure yet if it can be made to work. So, I'm going to put that off for a 4.4.0 release, close this issue, and release the patches from this issue in Lwt 4.3.1 soon. I linked your test case from #720. Thanks for reporting this and working on it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants