[ocaml5-issue] Bytecode segfaults on Ephemeron test #181

Closed
jmid opened this issue Nov 11, 2022 · 16 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Nov 11, 2022

From #88:

Originally posted by @jmid in #88 (comment)

I'm wondering whether it is a fresh issue - or caused by one of these that somehow haven't propagated through to install via 5.0.0~trunk and setup-ocaml...

@jmid
Collaborator Author

jmid commented Nov 13, 2022

The CI just hit another Ephemeron segfault on 5.0.0~trunk - this one however in native code on Linux:
https://github.com/jmid/multicoretests/actions/runs/3448464623/jobs/5755508518

random seed: 63947905
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     0.3s STM Ephemeron test sequential

File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel

For future reference I include a copy of the relevant CI log for the above Ephemeron 5.0.0~trunk bytecode mode segfault - before it is deleted from the GitHub Actions cache:

random seed: 382251382
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     1.9s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
[ ]   22    0    0   22 / 1000    62.6s STM Ephemeron test parallel
[ ]   31    0    0   31 / 1000   132.5s STM Ephemeron test parallel
[ ]   42    0    0   42 / 1000   192.9s STM Ephemeron test parallel
[ ]   44    0    0   44 / 1000   261.6s STM Ephemeron test parallel (shrinking:    3.0002)
[ ]   44    0    0   44 / 1000   322.4s STM Ephemeron test parallel (shrinking:    6.0002)
[ ]   44    0    0   44 / 1000   383.7s STM Ephemeron test parallel (shrinking:    8.0004)
[ ]   44    0    0   44 / 1000   445.1s STM Ephemeron test parallel (shrinking:   15.0004)
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]   44    0    0   44 / 1000   505.9s STM Ephemeron test parallel (shrinking:   28.0006)

@jmid jmid changed the title Bytecode segfault on Ephemeron test Segfaults on Ephemeron test Nov 14, 2022
@jmid
Collaborator Author

jmid commented Nov 14, 2022

Two native code 5.0.0+trunk CI runs for #189 just hit this Ephemeron segfault.

https://github.com/jmid/multicoretests/actions/runs/3461648589/jobs/5779520524

random seed: 256446672
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     0.3s STM Ephemeron test sequential

File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel

and https://github.com/jmid/multicoretests/actions/runs/3461587444/jobs/5779384555

random seed: 296407056
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     0.3s STM Ephemeron test sequential

File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel

@shym
Collaborator

shym commented Nov 15, 2022

I just checked that the 3 native runs you mention were using an old OCaml compiler. https://github.com/jmid/multicoretests/actions/runs/3461587444/jobs/5779384555#step:4:457 says: The OCaml toplevel, version 5.0.0+dev6-2022-07-21.
This should hopefully be fixed by #195.

The bytecode run, on the other hand, says: 5.0.0+dev8-2022-10-12. I don’t think the log contains the actual upstream commit used, but it might still make sense to investigate this one.

@jmid
Collaborator Author

jmid commented Nov 15, 2022

Good detective work, sir! 👍

@jmid
Collaborator Author

jmid commented Nov 16, 2022

The native mode Ephemeron segfaults seem to have been fixed with #195.

We are still seeing them on 5.0.0+trunk bytecode though:

On the #195 merge CI run: https://github.com/jmid/multicoretests/actions/runs/3477572101/jobs/5813892424

random seed: 385206433
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     1.8s STM Ephemeron test sequential

File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel

On a CI run from PR #197: https://github.com/jmid/multicoretests/actions/runs/3478421458/jobs/5815768968

random seed: 248456036
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     2.2s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
[ ]   45    0    0   45 / 1000    74.7s STM Ephemeron test parallel (shrinking:    1)
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]   45    0    0   45 / 1000   135.5s STM Ephemeron test parallel (shrinking:    2)

@shym
Collaborator

shym commented Nov 17, 2022

Maybe you could revert your change of title then ;-)
Found again when trying hard to trigger it:
https://github.com/shym/multicoretests/actions/runs/3480902528/jobs/5821292658#step:11:541
and I still can’t reproduce it locally using the seed obtained in that log :-/
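
A note on why a seed alone may not be enough here: the random seed fixes the generated test inputs, but it does not fix how the domains interleave at run time or when the GC kicks in, so a parallel failure may not reappear under the same seed. Below is a minimal stdlib-only sketch of the part that is reproducible. `gen_cmds` and the seed `42` are hypothetical stand-ins, not the QCheck generator or a seed from the logs.

```ocaml
(* Hypothetical stand-in for a command generator:
   draws [n] small ints from a PRNG initialised with [seed]. *)
let gen_cmds (seed : int) (n : int) : int list =
  let st = Random.State.make [| seed |] in
  List.init n (fun _ -> Random.State.int st 100)

let () =
  (* Two runs with the same (made-up) seed yield the same input sequence. *)
  let a = gen_cmds 42 10 in
  let b = gen_cmds 42 10 in
  assert (a = b);
  print_endline "same seed, same generated sequence"
```

What the seed cannot pin down is the scheduling of the parallel domains that execute those inputs, which is presumably where the crash depends on timing.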

@jmid jmid changed the title Segfaults on Ephemeron test Bytecode segfaults on Ephemeron test Nov 17, 2022
@jmid jmid changed the title Bytecode segfaults on Ephemeron test [ocaml5-issue] Bytecode segfaults on Ephemeron test Nov 18, 2022
@jmid
Collaborator Author

jmid commented Nov 18, 2022

Just when we thought this was bytecode only, ...

CI had a Linux (native) 5.0.0+trunk Ephemeron segfault earlier today on the main branch:
https://github.com/jmid/multicoretests/actions/runs/3495320020/jobs/5851952387

random seed: 463092985
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     0.4s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]   45    0    0   45 / 1000    60.5s STM Ephemeron test parallel (shrinking:    3.0002)

@jmid
Collaborator Author

jmid commented Nov 18, 2022

The CI for the Sys PR just hit a 5.0.0+trunk bytecode Ephemeron segfault as well:
https://github.com/jmid/multicoretests/actions/runs/3496455515/jobs/5854385414

random seed: 89312061
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     1.9s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
[ ]   12    0    0   12 / 1000   434.7s STM Ephemeron test parallel (shrinking:    0.0002)
random seed: 410265529
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.

@shym
Collaborator

shym commented Nov 22, 2022

After a long bisection, I found it present on all tried commits between trunk and β1 (inclusive): https://github.com/shym/multicoretests/actions/runs/3517413388/jobs/5895155218#step:11:831 but I didn’t manage to get a seed that reproduces it reliably.

@shym
Collaborator

shym commented Nov 22, 2022

Maybe another issue should be opened, but I indeed saw quite a few issues on native code:

@jmid
Collaborator Author

jmid commented Nov 23, 2022

Another Ephemeron failure on Linux 5.0.0~beta1 bytecode (with some interleaved printing):
https://github.com/jmid/multicoretests/actions/runs/3522096434/jobs/5904675410

random seed: 209315065
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     2.5s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
[ ]   19    0    0   19 / 1000    59.2s STM Ephemeron test parallel
[ ]   34    0    0   34 / 1000   129.8s STM Ephemeron test parallel (shrinking:    0.0002)
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
[ ]   34    0    0   34 / 1000   195.0s STM Ephemeron test parallel (shrinking:    3)
random seed: 40688049
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Float Array test sequential
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.

@jmid
Collaborator Author

jmid commented Nov 23, 2022

Also seen on Windows 5.0.0+trunk:
https://github.com/jmid/multicoretests/actions/runs/3522215862/jobs/5904929229

random seed: 507906890
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     0.5s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
File "src/ephemeron/dune", line 14, characters 0-111:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command exited with code -1073741819.
[ ]  106    0    0  106 / 1000    60.9s STM Ephemeron test parallel
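
For reference, the Windows exit code above is the signed view of a 32-bit NTSTATUS value: -1073741819 reinterpreted as unsigned is 0xC0000005, i.e. STATUS_ACCESS_VIOLATION, the Windows counterpart of the SEGV seen on Linux. A small sketch of the conversion follows; the helper name `ntstatus_hex` is made up for illustration.

```ocaml
(* Reinterpret a signed 32-bit exit code as unsigned hexadecimal.
   [ntstatus_hex] is a hypothetical helper, not part of any library. *)
let ntstatus_hex (code : int) : string =
  (* %lX prints an int32 as unsigned hex; 08 pads to eight digits. *)
  Printf.sprintf "0x%08lX" (Int32.of_int code)

let () =
  (* prints 0xC0000005 *)
  print_endline (ntstatus_hex (-1073741819))
```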

@jmid
Collaborator Author

jmid commented Nov 23, 2022

Again observed on Linux 5.0.0+trunk bytecode:
https://github.com/jmid/multicoretests/actions/runs/3522413690/jobs/5905364853

random seed: 369340211
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test sequential (generating)
[✓] 1000    0    0 1000 / 1000     2.4s STM Ephemeron test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Ephemeron test parallel
[ ]   35    0    0   35 / 1000    63.0s STM Ephemeron test parallel
[ ]   66    0    0   66 / 1000   163.7s STM Ephemeron test parallel (shrinking:    0.0002)
[ ]   66    0    0   66 / 1000   259.5s STM Ephemeron test parallel (shrinking:    1.0002)
[ ]   66    0    0   66 / 1000   354.3s STM Ephemeron test parallel (shrinking:    1.0004)
[ ]   66    0    0   66 / 1000   449.8s STM Ephemeron test parallel (shrinking:    1.0006)
[ ]   66    0    0   66 / 1000   522.1s STM Ephemeron test parallel (shrinking:    1.0008)
[ ]   66    0    0   66 / 1000   592.4s STM Ephemeron test parallel (shrinking:    2.0002)
[ ]   66    0    0   66 / 1000   663.3s STM Ephemeron test parallel (shrinking:    3)
[ ]   66    0    0   66 / 1000   736.6s STM Ephemeron test parallel (shrinking:    3.0004)
[ ]   66    0    0   66 / 1000   801.2s STM Ephemeron test parallel (shrinking:    4.0002)
[ ]   66    0    0   66 / 1000   874.6s STM Ephemeron test parallel (shrinking:    4.0005)
File "src/ephemeron/dune", line 14, characters 0-107:
14 | (rule
15 |  (alias runtest)
16 |  (package multicoretests)
17 |  (deps stm_tests.exe)
18 |  (action (run ./%{deps} --verbose)))
(cd _build/default/src/ephemeron && ./stm_tests.exe --verbose)
Command got signal SEGV.
[ ]   66    0    0   66 / 1000   936.3s STM Ephemeron test parallel (shrinking:    4.0008)

@shym
Collaborator

shym commented Nov 23, 2022

On Linux trunk bytecode, this weird result:

random seed: 212405379
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 2000     0.0s STM Ephemeron test parallel
[ ]    0    0    0    0 / 2000     0.0s STM Ephemeron test parallel (generating)
[ ]    8    0    0    8 / 2000   108.2s STM Ephemeron test parallel (shrinking:    0.0002)
[ ]    8    0    0    8 / 2000   201.5s STM Ephemeron test parallel (shrinking:    0.0003)
[ ]    8    0    0    8 / 2000   298.1s STM Ephemeron test parallel (shrinking:    1.0002)
[ ]    8    0    0    8 / 2000   397.2s STM Ephemeron test parallel (shrinking:    2.0002)
[ ]    8    0    0    8 / 2000   490.7s STM Ephemeron test parallel (shrinking:    2.0003)
[ ]    8    0    0    8 / 2000   585.3s STM Ephemeron test parallel (shrinking:    3.0002)
[ ]    8    0    0    8 / 2000   679.3s STM Ephemeron test parallel (shrinking:    4.0002)
[ ]    8    0    0    8 / 2000   772.4s STM Ephemeron test parallel (shrinking:    5.0002)
[ ]    8    0    0    8 / 2000   865.5s STM Ephemeron test parallel (shrinking:    5.0003)
[ ]    8    0    0    8 / 2000   962.4s STM Ephemeron test parallel (shrinking:    7.0002)
[ ]    8    0    0    8 / 2000  1055.1s STM Ephemeron test parallel (shrinking:    7.0003)
[ ]    8    0    0    8 / 2000  1147.8s STM Ephemeron test parallel (shrinking:    7.0004)
[ ]    8    0    0    8 / 2000  1212.6s STM Ephemeron test parallel (shrinking:    9.0002)
[ ]    8    0    0    8 / 2000  1312.4s STM Ephemeron test parallel (shrinking:    9.0004)
[ ]    8    0    0    8 / 2000  1374.2s STM Ephemeron test parallel (shrinking:   11.0002)
[ ]    8    0    0    8 / 2000  1446.2s STM Ephemeron test parallel (shrinking:   11.0005)
[ ]    8    0    0    8 / 2000  1509.1s STM Ephemeron test parallel (shrinking:   13.0005)
[ ]    8    0    0    8 / 2000  1569.8s STM Ephemeron test parallel (shrinking:   17)
[00] file runtime/caml/memory.h; line 191 ### Assertion failed: Field(result, i) == Debug_free_minor
/home/runner/work/_temp/c071cec0-615a-437e-8ec3-36cc15e1c682.sh: line 3: 2563468 Aborted                 (core dumped) opam exec -- dune exec "$ONLY_TEST" -- -v
[ ]    8    0    0    8 / 2000  1630.3s STM Ephemeron test parallel (shrinking:   20.0006)

@jmid
Collaborator Author

jmid commented Nov 24, 2022

That's a strange error indeed!
It seems to be failing an assertion on a Field load - which is also what could cause strange behaviour in ephe_get_field underlying the weak issue, so there is a chance/hope that it is the same underlying issue 🤔

@jmid
Collaborator Author

jmid commented Dec 1, 2022

Not spotted since the release of 5.0.0~beta2 with the Weak fix, so closing.

@jmid jmid closed this as completed Dec 1, 2022
@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Mar 28, 2023