[BUG] `builtin/test_reversed.mojo` for stdlib works with `mojo build`, non-deterministically crashes when run in JIT mode via `mojo <file>` #2369

gabrieldemarmiesse · 2024-04-21T19:29:01Z

Bug description

When working on #2368 I encountered a segfault in the CI even though the tests were passing locally. After restarting the CI, the tests were passing again. So it seems there is a flaky test. If someone could investigate, that would be great.

reating build directory for building the stdlib running the tests in.
Packaging up the Standard Library.
Successfully created /Users/runner/work/mojo/mojo/build/stdlib.mojopkg
Packaging up the test_utils.
-- Testing: 80 tests, 3 workers --
Testing:  0.. [10](https://github.com/modularml/mojo/actions/runs/8774779628/job/24076341331?pr=2368#step:9:11).. 20.
FAIL: Mojo Standard Library :: builtin/test_reversed.mojo (22 of 80)
******************** TEST 'Mojo Standard Library :: builtin/test_reversed.mojo' FAILED ********************
Exit Code: 139

Command Output (stderr):
--
RUN: at line 13: mojo -debug-level full /Users/runner/work/mojo/mojo/stdlib/test/builtin/test_reversed.mojo
+ mojo -debug-level full /Users/runner/work/mojo/mojo/stdlib/test/builtin/test_reversed.mojo
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo -debug-level full /Users/runner/work/mojo/mojo/stdlib/test/builtin/test_reversed.mojo
#0 0x0000000102871384 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000c5384)
#1 0x000000010286f670 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000c3670)
#2 0x0000000102871a70 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000c5a70)
#3 0x00000001986c7584 (/usr/lib/system/libsystem_platform.dylib+0x180477584)
#4 0x0000000300198de0 
#5 0x0000000102bffd64 __jit_debug_register_code (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x100453d64)
#6 0x00000001027ca67c _mh_execute_header (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x10001e67c)
#7 0x00000001027b37a8 _mh_execute_header (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000077a8)
#8 0x000000019830e0e0 
mojo crashed!
Please file a bug report.
[19522:93928:20240421,184455.219101:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
/Users/runner/work/mojo/mojo/build/stdlib/builtin/Output/test_reversed.mojo.script: line 1: 19520 Segmentation fault: [11](https://github.com/modularml/mojo/actions/runs/8774779628/job/24076341331?pr=2368#step:9:12)  mojo -debug-level full /Users/runner/work/mojo/mojo/stdlib/test/builtin/test_reversed.mojo

--

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  Mojo Standard Library :: builtin/test_reversed.mojo

Here is the full log: https://github.com/modularml/mojo/actions/runs/8774779628/job/24076341331?pr=2368

Steps to reproduce

Wouldn't that be convenient?

System information

Macos
modular-v0.7.2
packages/2024.4.1618/mojo-arm64-apple-darwin22.6.0-2024.4.1618-17-0.tar.zst

(surely more system information can be found in the CI's logs)

The text was updated successfully, but these errors were encountered:

JoeLoser · 2024-04-21T19:35:19Z

Is it because of abf2975? Want to try reverting that and see if it passes consistently on CI? It just merged a few hours ago.

helehex · 2024-04-21T22:18:03Z

oh no

helehex · 2024-04-21T22:52:59Z

i was unable to reproduce

JoeLoser · 2024-04-22T00:03:28Z

@ConnorGray FYI - something to keep an eye out for if you see any flakiness this week for CI

keith · 2024-05-15T22:13:10Z

I've started seeing this on the same test, I'm not sure if it's related to this original issue. I believe the newest iteration of it was introduced with #2340, cc @jayzhan211

The 2 variations I've seen are:

builtin/test_reversed.mojo:43:21: AssertionError: `left == right` comparison failed:
   left: 140196055111384
  right: 4

And then also it hitting this assert:

mojo/stdlib/src/utils/variant.mojo

Line 398 in f5f5109

abort("get: wrong variant type")

(which seems to be more common in my testing now)

martinvuyk · 2024-05-22T21:39:16Z

had this just now in nightly 2024.5.2205

FAIL: Mojo Standard Library :: builtin/test_reversed.mojo (29 of 97)
******************** TEST 'Mojo Standard Library :: builtin/test_reversed.mojo' FAILED ********************
Exit Code: 133

Command Output (stdout):
--
get: wrong variant type

JoeLoser · 2024-05-22T21:54:50Z

had this just now in nightly 2024.5.2205

FAIL: Mojo Standard Library :: builtin/test_reversed.mojo (29 of 97)
******************** TEST 'Mojo Standard Library :: builtin/test_reversed.mojo' FAILED ********************
Exit Code: 133

Command Output (stdout):
--
get: wrong variant type

Do you have a consistent repro or just from a run on CI?

martinvuyk · 2024-05-22T21:56:58Z

Do you have a consistent repro or just from a run on CI?

just from a run on CI

@parameter

…(Dict.items())` (#40974) [External] [stdlib] Fix UB in `reversed(Dict.values())` and `reversed(Dict.items())` Finally found the culprit in the flakyness that plagued us since a few week in the `test_reversed.mojo`. ### The actual bug: When iterating over a list in reverse order, we should start at `len(my_list) - 1` not at `len(my_list)`. That triggered an out of bounds access and thus was undefined behavior. ### The effect on our CI As you know, we have been seeing flakyness lately. It was documented a number of times and always related to `reverse`: * #2866 (comment) * #2369 ### Why was it passing sometimes? This is because there were `Optional[...]` in the List. Thus if the flag of the `Optional` says that no element is present, it's just skipped (the dict doesn't have an entry at this index). So the list of the Dict would often look like this: `["a", "b", "c", "d"] None` but the last `None` is actually memory that we don't have access to. Sometimes it's then skipped in the iteration making the tests pass. Sometimes it would cause segfaults because the test dict worked with strings. Sometimes we would get `wrong variant type` since we don't know what happens to the memory between None check and access. ### Why wasn't it found earlier? First of all, our Dict implementation is too complexe for what it does and thus is very good at hiding bugs. Well we did have `debug_assert` before getting the element of the `List`, but this `debug_assert` looked like this in the dict iterator: ```mojo @parameter if forward: debug_assert( self.index < self.src[]._reserved, "dict iter bounds" ) else: debug_assert(self.index >= 0, "dict iter bounds") ``` So one bound was checked when reading in one direction and the other bound was checked in the other direction. A better `debug_assert` would have been ```mojo debug_assert(0 <= self.index < self.src[]._reserved, "dict iter bounds") ``` When I worked on my PR #2718 the condition `self.index < self.src[]._reserved` didn't trigger anything since it was in the wrong branch, it was never executed. Also before, `__get_ref` didn't have any bounds checks, even when assertions were enabled. A recent commit 8d0870e adds `unsafe_get()` in List and make `__get_ref` use it. It also adds `debug_assert` to `unsafe_get()`, which means that now `__get_ref` has bounds checks if run with assertions enabled. This allowed me to catch the out of bounds access when updating #2718 making the fail deterministic and debuggable. Since we have this, the `debug_assert` in `dict.mojo` isn't necessary anymore. ### Consequences on ongoing work: * This fix have been also added to #2718 * The PR #2701 that we did with @jayzhan211 was actually correct. It was just using `reverse(Dict.items())` which was buggy at the time. After the fix is merged, we can re-revert this PR. * #2794 is not necessary anymore since the implementation by @jayzhan211 was correct. * The real cause of #2866 was found, the issue has already been closed though. * #2369 can be closed for good. * #2832 can be closed for good. ### Closing thoughts * We really need to run the unit tests with assertions enabled and add assertions whenever necessary * The dict implementation is a bit too complicated. For example, `self._reserved` is the length of the internal list. There is no need to store the length of the list twice. Let's drop this variable and use `len(self._entries)` instead. I guess this is a relic of the time when `List` wasn't completely flushed out. If had done so, it would have been ovious that we can't do `my_list.__get_ref(len(my_list))` * Iterating manually over a list like this is bug-prone. The implementation we have especially is, since ```mojo @parameter if forward: self.index += 1 else: self.index -= 1 ``` is done twice in the code, it should only be done once. While there is no bug, code duplication and complexity hides bugs. * We should iterate over the list with a list iterator, not with a custom-made iterator. This will remove a lot of code in the `dict.mojo`. Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com> Closes #2896 MODULAR_ORIG_COMMIT_REV_ID: b65009dc51f1e3027f91b5b61a5b7003cb022b87

JoeLoser · 2024-05-31T15:09:19Z

Fixed with 4d089aa, thanks @gabrieldemarmiesse!

@parameter

…(Dict.items())` (#40974) [External] [stdlib] Fix UB in `reversed(Dict.values())` and `reversed(Dict.items())` Finally found the culprit in the flakyness that plagued us since a few week in the `test_reversed.mojo`. ### The actual bug: When iterating over a list in reverse order, we should start at `len(my_list) - 1` not at `len(my_list)`. That triggered an out of bounds access and thus was undefined behavior. ### The effect on our CI As you know, we have been seeing flakyness lately. It was documented a number of times and always related to `reverse`: * #2866 (comment) * #2369 ### Why was it passing sometimes? This is because there were `Optional[...]` in the List. Thus if the flag of the `Optional` says that no element is present, it's just skipped (the dict doesn't have an entry at this index). So the list of the Dict would often look like this: `["a", "b", "c", "d"] None` but the last `None` is actually memory that we don't have access to. Sometimes it's then skipped in the iteration making the tests pass. Sometimes it would cause segfaults because the test dict worked with strings. Sometimes we would get `wrong variant type` since we don't know what happens to the memory between None check and access. ### Why wasn't it found earlier? First of all, our Dict implementation is too complexe for what it does and thus is very good at hiding bugs. Well we did have `debug_assert` before getting the element of the `List`, but this `debug_assert` looked like this in the dict iterator: ```mojo @parameter if forward: debug_assert( self.index < self.src[]._reserved, "dict iter bounds" ) else: debug_assert(self.index >= 0, "dict iter bounds") ``` So one bound was checked when reading in one direction and the other bound was checked in the other direction. A better `debug_assert` would have been ```mojo debug_assert(0 <= self.index < self.src[]._reserved, "dict iter bounds") ``` When I worked on my PR #2718 the condition `self.index < self.src[]._reserved` didn't trigger anything since it was in the wrong branch, it was never executed. Also before, `__get_ref` didn't have any bounds checks, even when assertions were enabled. A recent commit 8d0870e adds `unsafe_get()` in List and make `__get_ref` use it. It also adds `debug_assert` to `unsafe_get()`, which means that now `__get_ref` has bounds checks if run with assertions enabled. This allowed me to catch the out of bounds access when updating #2718 making the fail deterministic and debuggable. Since we have this, the `debug_assert` in `dict.mojo` isn't necessary anymore. ### Consequences on ongoing work: * This fix have been also added to #2718 * The PR #2701 that we did with @jayzhan211 was actually correct. It was just using `reverse(Dict.items())` which was buggy at the time. After the fix is merged, we can re-revert this PR. * #2794 is not necessary anymore since the implementation by @jayzhan211 was correct. * The real cause of #2866 was found, the issue has already been closed though. * #2369 can be closed for good. * #2832 can be closed for good. ### Closing thoughts * We really need to run the unit tests with assertions enabled and add assertions whenever necessary * The dict implementation is a bit too complicated. For example, `self._reserved` is the length of the internal list. There is no need to store the length of the list twice. Let's drop this variable and use `len(self._entries)` instead. I guess this is a relic of the time when `List` wasn't completely flushed out. If had done so, it would have been ovious that we can't do `my_list.__get_ref(len(my_list))` * Iterating manually over a list like this is bug-prone. The implementation we have especially is, since ```mojo @parameter if forward: self.index += 1 else: self.index -= 1 ``` is done twice in the code, it should only be done once. While there is no bug, code duplication and complexity hides bugs. * We should iterate over the list with a list iterator, not with a custom-made iterator. This will remove a lot of code in the `dict.mojo`. Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com> Closes #2896 MODULAR_ORIG_COMMIT_REV_ID: b65009dc51f1e3027f91b5b61a5b7003cb022b87

gabrieldemarmiesse added bug Something isn't working mojo Issues that are related to mojo labels Apr 21, 2024

gabrieldemarmiesse mentioned this issue Apr 21, 2024

[stdlib] Use UnsafePointer instead of Pointer in io.mojo #2368

Merged

JoeLoser added the mojo-stdlib Tag for issues related to standard library label Apr 21, 2024

linear bot removed mojo Issues that are related to mojo mojo-stdlib Tag for issues related to standard library labels Apr 29, 2024

ematejska added the mojo-repo Tag all issues with this label label Apr 29, 2024

linear bot added test-sync-delete after and removed test-sync-delete after labels May 1, 2024

JoeLoser changed the title ~~[BUG] Segfault in stdlib's CI, in builtin/test_reversed.mojo~~ [BUG] builtin/test_reversed.mojo for stdlib works with mojo build, non-deterministically crashes when run in JIT mode via mojo <file> May 24, 2024

This was referenced May 26, 2024

[stdlib] Add some logging to test_reverse.mojo to flush out a flaky bug #2832

Closed

[BUG] Flaky test, test dict popteim ...get: wrong variant type #2866

Closed

[stdlib] Fix UB in reversed(Dict.values()) and reversed(Dict.items()) #2896

Closed

JoeLoser closed this as completed May 31, 2024

JoeLoser assigned gabrieldemarmiesse May 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `builtin/test_reversed.mojo` for stdlib works with `mojo build`, non-deterministically crashes when run in JIT mode via `mojo <file>` #2369

[BUG] `builtin/test_reversed.mojo` for stdlib works with `mojo build`, non-deterministically crashes when run in JIT mode via `mojo <file>` #2369

gabrieldemarmiesse commented Apr 21, 2024 •

edited

Loading

JoeLoser commented Apr 21, 2024

helehex commented Apr 21, 2024

helehex commented Apr 21, 2024

JoeLoser commented Apr 22, 2024

keith commented May 15, 2024

martinvuyk commented May 22, 2024

JoeLoser commented May 22, 2024

martinvuyk commented May 22, 2024

JoeLoser commented May 31, 2024

[BUG] builtin/test_reversed.mojo for stdlib works with mojo build, non-deterministically crashes when run in JIT mode via mojo <file> #2369

[BUG] builtin/test_reversed.mojo for stdlib works with mojo build, non-deterministically crashes when run in JIT mode via mojo <file> #2369

Comments

gabrieldemarmiesse commented Apr 21, 2024 • edited Loading

Bug description

Steps to reproduce

System information

JoeLoser commented Apr 21, 2024

helehex commented Apr 21, 2024

helehex commented Apr 21, 2024

JoeLoser commented Apr 22, 2024

keith commented May 15, 2024

martinvuyk commented May 22, 2024

JoeLoser commented May 22, 2024

martinvuyk commented May 22, 2024

JoeLoser commented May 31, 2024

[BUG] `builtin/test_reversed.mojo` for stdlib works with `mojo build`, non-deterministically crashes when run in JIT mode via `mojo <file>` #2369

[BUG] `builtin/test_reversed.mojo` for stdlib works with `mojo build`, non-deterministically crashes when run in JIT mode via `mojo <file>` #2369

gabrieldemarmiesse commented Apr 21, 2024 •

edited

Loading