Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better call stacks when a C call is involved in byte code mode. #8641

Merged
merged 1 commit into from Oct 12, 2019

Conversation

jhjourdan
Copy link
Contributor

@jhjourdan jhjourdan commented Apr 24, 2019

The previous mechanism only worked in the case the C call in question raises an exception. This mechanism actually adds the PC in the interpreter stack, so that the backtrace mechanism always sees it.

The result is that the callstack includes the PC which actually performed the C call, contrarily to the previous behavior, where this PC was ignored. I would argue that this is the right behavior, there is no reason the top frame should be ignored. However, if an external is declared in the .ml file and exposed in the .mli file as a val, then ocamlc generates a non-localized wrapper, which adds a spurious entry in the stack frame. In this PR, this change in behavior results in the re-declaration of Printexc.get_callstack as an external instead of a val, so that the spurious stack frame does not appear in callstacks obtained from Printexc.get_callstack.

@xavierleroy This changes Setup_for_c_call/Restore_after_c_call, which, given the git history, you are the only expert of. I am far from sure that changing this does not have impact on some other part of the runtime, such as, e.g., the debugger. Also, I removed the following two lines in the interpreter:

if (pc != NULL) pc += 2;
    /* +2 adjustment for the sole purpose of backtraces */

I do not understand their purpose, and it seems backtraces are working without them. But I am not sure I did not actually break anything.

@jhjourdan
Copy link
Contributor Author

I actually just found that there is another issue related with that one: in bytecode mode, when exceptions are thrown in a callback across a C stack frame, backtraces usually only contain the part of the stack which is less recent than the C call, which is pointless (we cannot even know where the exception has been thrown !).

I have some idea about how to fix this, but this may require other changes to the Setup_for_xxx/Restore_after_xxx macros. As explained above, I am not sure whether this will potentially break anything. In particular, this comment in interp.c is particularly enigmatic to me:

/* An event frame must look like accu + a C_CALL frame + a RETURN 1 frame */

Why must it look like that? Is that related to the debugger? I think even if this PR does not get merged, some more comment/refactoring would be welcome to make this part of the interpreter easier to read.

@stedolan stedolan self-assigned this Apr 25, 2019
Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks globally good to me. Two suggestions and one question below.

runtime/interp.c Outdated Show resolved Hide resolved
runtime/interp.c Outdated Show resolved Hide resolved
stdlib/printexc.mli Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

Mystery number 1: the "+2 adjustment for the sole purpose of backtraces" is partially explained in my comment below. The idea is that the PC stored in the backtrace should better point to "after" the C_CALL instruction, to help event_for_location find the corresponding debug event. Depending on the C_CALL instruction used, the next instruction is at +1 or +2. But there is some tolerance built in event_for_location that might explain why +2 was always good.

@xavierleroy
Copy link
Contributor

Mystery number 2: the reason why "An event frame must look like accu + a C_CALL frame + a RETURN 1 frame" was the now-defunct VM thread library. A context switch could occur either cooperatively, as a result of calling a "yield" primitive, or preemptively, as a result of receiving a timeout signal. Stacks for suspended threads should have the same shape in these two cases, so that they can be restarted as if the "yield" primitive returned.

Now that VM threads are gone, I actually don't know what the minimal requirements are for Setup_for_event and Restore_after_event. Why don't you keep the existing code? Add a comment "for VM threads purposes" if you want.

@jhjourdan
Copy link
Contributor Author

@xavierleroy, thanks for answering my questions!

I added a few comments to demystify the part of the code that I found hard to understand, updated Changes, rebased/squashed my commits on top of trunk, and fixed the related statmemprof tests.

Anything left to do before merging?

@jhjourdan jhjourdan force-pushed the bytecode_c_call_backtrace branch 3 times, most recently from cc79753 to aa98088 Compare May 20, 2019 13:40
@jhjourdan
Copy link
Contributor Author

I rebased on top of trunk.

Are we waiting for something before merging?

@gasche
Copy link
Member

gasche commented Sep 3, 2019

@xavierleroy gentle ping: I'm not sure who else would be qualified to review/approve of the PR (I just tried with Damien).

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is tough :-) but I read it again and it makes sense to me. I hope we have enough tests to exercise all relevant code paths.

@jhjourdan
Copy link
Contributor Author

I fixed the statmemprof intern.ml test. CI should pass now.

By the way, this uncovered another (independent) issue: no debugging information is inserted when a C call is placed in tail-call position, even though it does correspond to a stack slot. This could be the subject of another PR on bytegen.ml.

@jhjourdan
Copy link
Contributor Author

Wow. This is now getting weird. A test fails because of issues with line ending in Windows.

I have done almost no change in this PR for this test, so I am very surprised.

@dra27, is this a known issue? Is there some generic fix I can apply?

The previous mechanism only worked in the case the C call in question
raises an exception.
@dra27
Copy link
Member

dra27 commented Oct 12, 2019

I can't reproduce the AppVeyor failure, but it might warrant some more thought, as it's nothing to do with line-endings (ocamltest compares the files in a line-ending agnostic way, but the display in does using diff - I propose adding --strip-trailing-cr in #8983).

So the actual problem is this:

@@ -7,6 +7,7 @@
 check_distrib 100000 10 0.900000
 check_callstack
 Raised by primitive operation at file "lists_in_minor.ml", line 14, characters 11-33
+Called from unknown location
 Called from file "lists_in_minor.ml", line 69, characters 2-26
 Called from file "lists_in_minor.ml", line 76, characters 2-20
 OK !

which might be a bit more serious?

@jhjourdan
Copy link
Contributor Author

@dra27, I guess you used the wrong version. I already fixed the issue you are speaking about in bc11373. This was just a matter of updating the reference file.

The issue CI had in bc11373 was truly just a line ending issue.

Anyway, I just rebased on top of trunk and apparently CI is passing. So I guess this is just an Eisenbug which should be fixed by #8983. So now this is ready to merge, @xavierleroy, except if you have any other objection.

@dra27
Copy link
Member

dra27 commented Oct 12, 2019

@jhjourdan - I didn't use any wrong version, I'm looking at the AppVeyor log - see L4852 of the log for merging bc11373, which has that extra line?

@xavierleroy
Copy link
Contributor

I ran a round of "precheck" on Inria's CI, just to make sure. No problems reported.

@xavierleroy xavierleroy merged commit 23e5bfa into ocaml:trunk Oct 12, 2019
@jhjourdan
Copy link
Contributor Author

@dra27 You're right, there is indeed an extra line in this CI log, even though it should not appear. I have no idea of what happened, this is very weird. I tried to execute the program many times on my PC, and resetting statmemprof's seed. I never reproduced the issue. I hope this is not some non-deterministic bug that will appear once every one thousand of executions... We will see...

@dra27
Copy link
Member

dra27 commented Oct 12, 2019

I’ve left a machine cycling that one test on mingw32... 1782 successful runs so far, so hopefully that AppVeyor log is just an unlucky fluke!

@xavierleroy
Copy link
Contributor

I ran into another wrong backtrace in bytecode. Below, we have one extra entry in the backtrace (line 51 of thread.ml) corresponding to the call to Thread.preempt that implements signal-based preemption. So, it looks like signal handling leaves stuff on the bytecode interpreter stack, stuff that is picked up when building a backtrace.

Here is the log:

Running tests from 'tests/backtrace' ...
[...]
 ... testing 'callstack.ml' with 1.1.2 (bytecode) => failed (program output /home/barsac/ci/builds/workspace/extra-checks/testsuite/tests/backtrace/_ocamltest/tests/backtrace/callstack/ocamlc.byte/callstack.byte.output differs from reference /home/barsac/ci/builds/workspace/extra-checks/testsuite/tests/backtrace/callstack.reference: 
--- /home/barsac/ci/builds/workspace/extra-checks/testsuite/tests/backtrace/callstack.reference	2019-10-12 18:26:12.625956918 +0200
+++ /home/barsac/ci/builds/workspace/extra-checks/testsuite/tests/backtrace/_ocamltest/tests/backtrace/callstack/ocamlc.byte/callstack.byte.output	2019-10-19 22:42:55.968463798 +0200
@@ -11,4 +11,5 @@
 Called from file "callstack.ml", line 15, characters 27-32
 Called from file "thread.ml", line 39, characters 8-14
 Raised by primitive operation at file "callstack.ml", line 12, characters 38-66
+Called from file "thread.ml", line 51, characters 21-28
 Called from file "callstack.ml", line 23, characters 2-18
)

This comes from "extra-checks" CI, using clang-6.0 -fsanitize=thread.

@jhjourdan
Copy link
Contributor Author

I ran into another wrong backtrace in bytecode.

See #9063. I don't think this is a bug in the runtime, but rather in the test itself.

xavierleroy pushed a commit that referenced this pull request Oct 21, 2019
… of that file. (#9063)

The original test had a race condition between finalization and thread preemption.
That was probably the cause for the wrong backtrace observed here:
#8641 (comment)
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 6, 2020
… of that file. (ocaml#9063)

The original test had a race condition between finalization and thread preemption.
That was probably the cause for the wrong backtrace observed here:
ocaml#8641 (comment)

(cherry picked from commit 1b17cc4)
stedolan pushed a commit to janestreet/ocaml that referenced this pull request Mar 17, 2020
Fix tests/backtrace/callstack.ml by changing the order of the content of that file. (ocaml#9063)

The original test had a race condition between finalization and thread preemption.
That was probably the cause for the wrong backtrace observed here:
ocaml#8641 (comment)
mshinwell pushed a commit to mshinwell/ocaml that referenced this pull request Apr 7, 2020
Fix tests/backtrace/callstack.ml by changing the order of the content of that file. (ocaml#9063)

The original test had a race condition between finalization and thread preemption.
That was probably the cause for the wrong backtrace observed here:
ocaml#8641 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants