Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTP-23: Concuerror crashed #312

Closed
essen opened this issue Jun 3, 2020 · 11 comments
Closed

OTP-23: Concuerror crashed #312

essen opened this issue Jun 3, 2020 · 11 comments

Comments

@essen
Copy link
Contributor

essen commented Jun 3, 2020

What works in OTP-22 doesn't in OTP-23.

Example module:

-module(gen_server_enter).
-behavior(gen_server).

-export([start_link/0]).
-export([get_true/1]).

-export([init/1]).
-export([handle_call/3]).
-export([handle_cast/2]).
-export([handle_info/2]).
-export([terminate/2]).
-export([code_change/3]).

-export([test/0]).

%% Interface.

start_link() ->
	{ok, proc_lib:spawn_link(?MODULE, init, [[]])}.

get_true(Pid) ->
	gen_server:call(Pid, get_true, infinity).

%% Internal.

init([]) ->
	%% Blocking init.
	StateData = #{},
	gen_server:enter_loop(?MODULE, [], StateData).

handle_call(get_true, _From, StateData=undefined) ->
	{reply, false, StateData};
handle_call(get_true, _From, StateData) ->
	{reply, true, StateData};
handle_call(Msg, From, StateData) ->
	io:format("Unexpected call ~p from ~p~n", [Msg, From]),
	{reply, {error, no_call, Msg}, StateData}.

handle_cast(Msg, StateData) ->
	io:format("Unexpected cast ~p~n", [Msg]),
	{noreply, StateData}.

handle_info(timeout, undefined) ->
	%% Blocking init.
	{noreply, #{}};
handle_info(Msg, StateData) ->
	io:format("Unexpected message ~p~n", [Msg]),
	{noreply, StateData}.

terminate(_Reason, _StateData) ->
	ok.

code_change(_OldVsn, StateData, _Extra) ->
	{ok, StateData}.

%% Concuerror.

test() ->
	spawn_link(fun() ->
		Pid = hd(lists:reverse(processes())),
		true = get_true(Pid)
	end),
	{ok, Pid} = start_link(),
	ok.

Run with:

$ ./Concuerror/bin/concuerror -m gen_server_enter -k
Concuerror 0.20.0+build.2246.ref11477b1 started at 03 Jun 2020 17:35:03

* Info: Showing progress (-h progress, for details)
* Info: Writing results in concuerror_report.txt
* Info: Automatically instrumented module io_lib
* Info: Showing PIDs as "<symbolic name(/last registered name)>" ('-h symbolic_names').
* Info: Automatically added '--file gen_server_enter.erl'.
* Info: Instrumented & loaded module gen_server_enter
* Info: Automatically instrumented module error_handler
* Info: Automatically instrumented module erlang
* Tip: Running without a scheduling_bound corresponds to verification and may take a long time.
* Info: Automatically instrumented module proc_lib
* Info: Automatically instrumented module gen_server
* Info: Automatically instrumented module gen
* Tip: Increase '--print_depth' if output/graph contains "...".
* Info: Continuing after error (-k)
* Error: A process (<0.115.0>) took more than 5000ms to report a built-in event. You can try to increase the '--timeout' limit and/or ensure that there are no infinite loops in your test.
Done at 03 Jun 2020 17:35:09 (Exit status: fail)
  Summary: 1 errors, 1/1 interleavings explored

Results in "Concuerror crashed":

Concuerror 0.20.0+build.2246.ref11477b1 started at 03 Jun 2020 17:35:03
 Options:
  [{after_timeout,infinity},
   {assertions_only,false},
   {assume_racing,true},
   {depth_bound,500},
   {disable_sleep_sets,false},
   {dpor,optimal},
   {entry_point,{gen_server_enter,test,[]}},
   {exclude_module,[]},
   {files,["gen_server_enter.erl"]},
   {first_process_errors_only,false},
   {ignore_error,[]},
   {instant_delivery,true},
   {interleaving_bound,infinity},
   {keep_going,true},
   {non_racing_system,[]},
   {print_depth,20},
   {scheduling,round_robin},
   {scheduling_bound_type,none},
   {show_races,false},
   {strict_scheduling,false},
   {symbolic_names,true},
   {timeout,5000},
   {treat_as_normal,[]},
   {use_receive_patterns,true}]
################################################################################
Interleaving #1
--------------------------------------------------------------------------------
Errors found:
* Concuerror crashed
--------------------------------------------------------------------------------
Event trace:
   1: <P>: <P.1> = erlang:spawn_link(erlang, apply, [#Fun<gen_server_enter.'-test/0-fun-0-'.0>,[]])
    in erlang.erl line 2870
   2: <P>: [] = erlang:process_info(<P>, registered_name)
    in proc_lib.erl line 753
################################################################################
Exploration completed!
################################################################################
Errors:
--------------------------------------------------------------------------------
* A process (<0.115.0>) took more than 5000ms to report a built-in event. You can try to increase the '--timeout' limit and/or ensure that there are no infinite loops in your test.

################################################################################
Tips:
--------------------------------------------------------------------------------
* Running without a scheduling_bound corresponds to verification and may take a long time.
* Increase '--print_depth' if output/graph contains "...".

################################################################################
Info:
--------------------------------------------------------------------------------
* Showing progress (-h progress, for details)
* Writing results in concuerror_report.txt
* Automatically instrumented module io_lib
* Showing PIDs as "<symbolic name(/last registered name)>" ('-h symbolic_names').
* Automatically added '--file gen_server_enter.erl'.
* Instrumented & loaded module gen_server_enter
* Automatically instrumented module error_handler
* Automatically instrumented module erlang
* Automatically instrumented module proc_lib
* Automatically instrumented module gen_server
* Automatically instrumented module gen
* Continuing after error (-k)

################################################################################
Done at 03 Jun 2020 17:35:09 (Exit status: fail)
  Summary: 1 errors, 1/1 interleavings explored

With OTP-22 I instead get 3 errors related to the processes getting deadlocked (expected because I didn't write a very good test).

@kostis
Copy link
Contributor

kostis commented Jun 3, 2020

Hi @essen,

We are aware that Concuerror has issues with 23, but thanks for your report anyway!

In fact, today I was discussing with @aronisstav that we should support 23.0 and opened #311 as a first step in this purpose. We will be working on it and I expect that once that PR is merged, it will also fix this issue.

@essen
Copy link
Contributor Author

essen commented Jun 3, 2020

Hello! Sorry I meant to open this for a few days and had not double checked the PRs opened since.

@kostis
Copy link
Contributor

kostis commented Jun 3, 2020

No worries -- in fact, we now have one more test case!

@aronisstav
Copy link
Member

aronisstav commented Jun 7, 2020

After enabling some more debug printouts and normalising some outputs I can see that the diff between 22.0 and 23.0 in this test starts as:
In 22:

[...]
* Info: Automatically instrumented module gen
# <0.115.0> concuerror_callback l349: {gen,debug_options,2,[320,{file,[103,101,110,95,115,101,114,118,101,114,46,101,114,108]}]}
# <0.115.0> concuerror_callback l349: {gen,hibernate_after,1,[321,{file,[103,101,110,95,115,101,114,118,101,114,46,101,114,108]}]}
# <0.115.0> concuerror_callback l1522: {inspect,empty}
# <0.115.0> concuerror_callback l1482: blocked
# <0.115.0> concuerror_callback l1688: notifying_parent
[...]

In 23.0:

[...]
* Info: Automatically instrumented module gen
# <0.115.0> concuerror_callback l349: {gen,debug_options,2,[363,{file,[103,101,110,95,115,101,114,118,101,114,46,101,114,108]}]}
# <0.115.0> concuerror_callback l349: {gen,hibernate_after,1,[364,{file,[103,101,110,95,115,101,114,118,101,114,46,101,114,108]}]}
* Tip: Increase '--print_depth' if output/graph contains "...".
* Error: Stop testing on first error. (Check '-h keep_going').
* Error: A process (<0.109.0>) took more than 5000ms to report a built-in event. You can try to increase the '--timeout' limit and/or ensure that there are no infinite loops in your test.
* (concuerror               @ 169) Reached the end!
[...]

This is a note for myself, mainly. I continue looking...

@aronisstav
Copy link
Member

aronisstav commented Jun 7, 2020

Sigh... receive seems to be indeed "broken". New testcase:

-module(foo).

-export([test/0]).

test() ->
  Fun =
    fun() ->
        receive
          ok -> ok
        end
    end,
  spawn(Fun) ! ok.

@kostis
Copy link
Contributor

kostis commented Jun 7, 2020

I'm not sure whether this is the culprit or not, but in 23, there is a new primop recv_peek_message which has been introduced.

@aronisstav
Copy link
Member

aronisstav commented Jun 7, 2020

Short update: in 23.0, receive is apparently no longer used as a Core Erlang primitive, but it's broken down to other primops... Therefore Concuerror cannot "see" receive statements and a process can be stuck at a hidden receive-related primop while Concuerror's scheduler waits for it to reach the next notable event.

How to fix this:

  • expand the instrumenter so it can recognize receive-related primops and instrument them appropriately without removing the existing handling of receive itself which works up until 22.3
  • rewrite the instrumenter so that for 23.0 and later it operates at a higher level of the compiler (since Core Erlang is now getting hairy). We only care about applys and calls for intercepting shared-memory related BIF's (and automatic instrumentation) and receives for intercepting message passing, so we REALLY shouldn't need to know anything about primops... Perhaps the higher level instrumenter can work on all versions, so we can scrap the old one completely.

@aronisstav
Copy link
Member

aronisstav commented Jun 13, 2020

Small update: I now have a new instrumenter which runs the foo test correctly and reports a deadlock on the module of the original issue report.
Trying OTP 23.0 with the new instrumenter on my basic_tests suite got 85/337 failures.
After adding two more lines for handling the "new" spawn_opt/4 built_in (already supported as spawn_opt/1 with 4 element tuple, which is now gone) I have 12/337 failures.
I will continue looking into this soon. For the time being, some ugly code exists in 354c64d.

@aronisstav
Copy link
Member

Continuing discussion about this on #311 ...

@aronisstav
Copy link
Member

Done and merged!

I chose to not include the submitted code as a new test case as it was not telling us anything that the other tests weren't already telling us. I verified that the submitted code now produces three interleavings, all deadlocked. I will therefore close this issue, and ask you to try the latest master version!

@essen
Copy link
Contributor Author

essen commented Oct 17, 2020

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants