-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a crash in the shbench fibonacci benchmark #113
Conversation
If the fibslow function in the shbench[*] fibonacci benchmark is run with an argument greater than 20, ksh will crash with a memory fault. This was caused by subshell numbers only being 16-bits long, which isn't enough when running a large number of subshells. Extending the subshell number length to 64-bit fixes this issue, but with vmalloc only to a certain extent. The fibonacci benchmark no longer crashes when run, but the regression test this commit adds only works with standard malloc. With ksh93u+ vmalloc the regression test will crash. src/cmd/ksh93/include/defs.h, src/cmd/ksh93/include/jobs.h, src/cmd/ksh93/sh/jobs.c, src/cmd/ksh93/sh/subshell.c, - Extend curenv, jobenv, subenv and p_env from short to long. This bugfix was backported from ksh93v- with a patch from OpenSUSE: https://build.opensuse.org/package/view_file/shells/ksh/ksh93-longenv.dif?expand=1 src/cmd/ksh93/tests/subshell.sh: - Add a regression test for running a large number of subshells, but disable it if vmstate is a builtin (since it doesn't work with vmalloc). [*]: https://github.com/ksh-community/shbench
I've a strong feeling that this fix is a kludge that doesn't address the root cause. Sure enough, upon reading the diff --git a/src/cmd/ksh93/sh/subshell.c b/src/cmd/ksh93/sh/subshell.c
index 5b12f92..aefb681 100644
--- a/src/cmd/ksh93/sh/subshell.c
+++ b/src/cmd/ksh93/sh/subshell.c
@@ -481,6 +481,7 @@ Sfio_t *sh_subshell(Shell_t *shp,Shnode_t *t, volatile int flags, int comsub)
sh_pushcontext(shp,&buff,SH_JMPSUB);
shp->subshell++; /* increase level of virtual subshells */
SH_SUBSHELLNOD->nvalue.s++; /* increase ${.sh.subshell} */
+errormsg(SH_DICT,ERROR_warn(0),"[DEBUG] subenv == %d; shp->subshell == %d", subenv, shp->subshell);
sp->prev = subshell_data;
sp->shp = shp;
sp->sig = 0; An So the OpenSUSE fix simply delays the overflow, albeit by a very very long time. The overflow doesn't cause a crash when executing >32767 subshells sequentially, but when you nest subshells (as in your reproducer), a crash can occur. Which makes sense, as a child virtual subshell would use an overflowed negative value inconsistent with that of the parent. However, ksh is not multithreaded, and never will be, so virtual subshells are never executed simultaneously. So a fix that addresses the root cause is to simply decrease the global diff --git a/src/cmd/ksh93/sh/subshell.c b/src/cmd/ksh93/sh/subshell.c
index 5b12f92..aefb681 100644
--- a/src/cmd/ksh93/sh/subshell.c
+++ b/src/cmd/ksh93/sh/subshell.c
@@ -756,6 +757,7 @@ Sfio_t *sh_subshell(Shell_t *shp,Shnode_t *t, volatile int flags, int comsub)
{
shp->subshell--; /* decrease level of virtual subshells */
SH_SUBSHELLNOD->nvalue.s--; /* decrease ${.sh.subshell} */
+ subenv--;
}
subshell_data = sp->prev;
if(!argsav || argsav->dolrefcnt==argcnt) This causes no regression test failures. When applying both this diff and the debug diff, it becomes clear that Then we still need to fix the variable types, but |
This caused intermittent CI failures, I'll debug that later.
I'm getting inconsistent CI test failures after making the variables unsigned values: I'll be reverting that change since fixing that issue will likely require refactoring code, which I'd rather avoid. edit: The test failures may be caused by something else, see the comment below. |
... And the CI test are failing again, with the same test failure: https://github.com/JohnoKing/ksh/runs/973205340
I'll have to debug this more, although it will take a while. I'll see if I can reproduce the test failure when |
I cannot reproduce that failure at all, on macOS, FreeBSD, Linux (Debian), or HP-UX. I'm using the diff with the However, I did find a potential problem: |
As far as I'm aware the test failure only happens in the CI build. I have done more testing with the CI though and it seems the subshell test failure in CI only occurs after applying the
It could just be a timing issue, but I suspect ksh depends on ksh/src/cmd/ksh93/sh/subshell.c Lines 474 to 479 in f485fe0
On an unrelated note, I got another test failure in the CI build (ref https://github.com/JohnoKing/ksh/runs/973581717). This has occurred in some of my older pull requests as well, so it's not caused by the subshell changes:
I applied that change in 4b9dbe1, |
1c8123d
to
93bd783
Compare
When the subenv-- change is applied, the CI test will now fail with the following: test subshell begins at 2020-08-12+11:54:45 subshell.sh[533]: exit status from subshells not being preserved (expected 12, got 0) test subshell failed at 2020-08-12+11:54:53 with exit code 1 [ 86 tests 1 error ] test subshell(C.UTF-8) begins at 2020-08-12+11:54:53 subshell.sh[533]: exit status from subshells not being preserved (expected 12, got 0) test subshell(C.UTF-8) failed at 2020-08-12+11:55:02 with exit code 1 [ 86 tests 1 error ] test subshell(shcomp) begins at 2020-08-12+11:55:02 shcomp-subshell.ksh[533]: exit status from subshells not being preserved (expected 12, got 0) test subshell(shcomp) failed at 2020-08-12+11:55:09 with exit code 1 [ 86 tests 1 error ]
Hmm. What fails in the CI run with the I don't yet understand the code responsible for doing that, but it must be ultimately dependent on What I don't understand at all is why it doesn't fail on any other system I can get my hands on and only fails on the CI runners. And without that decrement it is eventually going to overflow. Maybe to avoid a crash it is sufficient to make it an unsigned value so at least it doesn't wrap to negative? I don't know, but it's possible. |
From what I can tell the test failure is because of a race condition (although IDK why only the CI runners are affected). Adding --- a/src/cmd/ksh93/tests/subshell.sh
+++ b/src/cmd/ksh93/tests/subshell.sh
@@ -528,6 +528,7 @@ date=$(whence -p date)
err() { return $1; }
( err 12 ) & pid=$!
: $( $date)
+sleep .1
wait $pid
exit_status=$?
[[ $exit_status == 12 ]] || err_exit "exit status from subshells not being preserved (expected 12, got $exit_status)" |
When the current longenv branch is compiled with either of: bin/package make CCFLAGS='-Os -D_std_malloc -D_AST_std_malloc'
bin/package make CCFLAGS='-Os -D_std_malloc' The regression test still intermittently crashes on macOS 10.14.16, in various ways:
|
Backtraces of the above crashes after recompiling with
|
Even when reverting the longenv branch to the original OpenSUSE fix (with When reverting back to HEAD, at 34fbf89, it still crashes, but with different backtraces. Some of them are the same (various crashes after calling
|
src/cmd/ksh93/tests/subshell.sh: - Disable the test for running many nested subshells for now, since it crashes on macOS (and still crashes with vmalloc).
When we remove the function definition from the regression test, like this:
the crash disappears. So this is a bug with defining a function in a virtual subshell, which is a different bug from this one. |
This test exposes a bug with defining a function in a subshell. If you remove the function definition and call, the crash disappears. So it is not applicable to this particular fix, and should be dealt with in a different issue or PR.
If the
fibslow
function in the shbench fibonacci benchmark is run with an argument greater than 20, ksh will crash with a memory fault. Below is a diff that can be applied to shbench for this crash to occur:This was caused by subshell numbers only being 16-bits long, which isn't big enough when running a large number of subshells. Extending the subshell number length to 64-bit fixes this issue (
${.sh.subshell}
doesn't affect the crash, so it's unchanged), but with vmalloc a similar crash still occurs. The fibonacci benchmark no longer crashes with the above diff, but the regression test this pull request adds only works with standard malloc. With ksh93u+ vmalloc the following script still crashes: