Fence/ARM port can crash during R7RS test of getenv #724
It can take hundreds of runs to reproduce the bug by following
That too is fairly likely to run to completion, but it takes
All of the register contents look plausible, but the encodings
Let's look at the topmost stack frame. Its address is in r3:
Reformatting that and adding semicolon comments:
Something's wrong here. The return address is cleared to zero
The dynamic link is also cleared when the frame is created, but 0xb6b67f90 is not a tagged pointer, but it is a plausible
So 0xb6b67f90 is indeed a return address within the code shown
Ignoring the first two words, that looks plausible. A procedure
The next step, I think, is to look inside the procedure whose
It looks like the segmentation fault (or illegal instructions)
The comment in
Another possibility is that some hand-written code (such as mal
There are other possibilities as well. It might be best to
The machine code shown in the last disassembly of my previous message was generated (with
It calls the
So far as I can tell, the machine code generated for the
So it looks as though correct machine code created a valid stack frame before calling the
That directs suspicion toward the millicode calls, listed here in order of appearance:
The likely suspects are starred.
Suspecting corruption of the stack cache by millicode during some infrequent combination of circumstances, I modified the
It looks as though the registers were corrupted just before or just after control was transferred to the
Machine registers r2 (GLOBALS), r3 (STKP), and r5 (REG) are catastrophically wrong. All three registers point to inaccessible memory. Machine register r1 (SECOND) contains a procedure with the machine code shown below. As an aid to debugging, invoke instructions place the caller in SECOND. Machine registers r7 and r8 (REG2 and REG3) contain the fixnum 97 passed by the code below. Machine register r6 (REG1) contains the first argument, "cg-variable-1":
Machine register r0 (RESULT) contains the argument count (as a fixnum), placed there by the instruction at
The segmentation fault occurred before the stack cache integrity checking detected anything wrong. Machine register
That didn't happen; the program counter points to an inaccessible address. Machine register
The timer interrupt might have corrupted the
It looks like a problem with cache flushing. Disabling all processing of timer exceptions may have improved reproducibility. I am now getting a reproducible SIGILL (Illegal instruction) during compiler test 128:
Machine register
There's nothing wrong with the instruction at
It's the start of a new cache line. The compiler tests read compiled machine code from a file, executing each test as it is read. With the ARM's split cache, the data cache line(s) containing the newly read machine code must be flushed, and then the instruction cache line(s) must be flushed. Only then will we get a miss in the instruction cache, forcing it to load the instructions from main memory. The ARM's data cache is write-back, which is why we have to flush the data cache before the instruction cache miss reads from memory. The reader already calls
The second argument is off by 1 because it doesn't account for the bytevector's header word. This bug will show up rarely because it matters only when the very last word of a codevector starts a new cache line. This bug will be intermittent because it depends on where bytevectors are allocated, which depends on many things including the address ranges allocated to the Larceny process and the allocation history of the Scheme code, including Scheme code that processes timer exceptions.
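To make the off-by-one concrete, here is a small Python sketch of the address arithmetic. It is only a model, not Larceny's code: the 32-byte cache line size, the example addresses, and the helper names `lines_touched`, `flush_end`, and `missed_lines` are all assumptions made for illustration, and the flushed range is assumed to start at the header word. It compares the set of cache lines covered by a range that omits the header word against the set covered by the full object.

```python
CACHE_LINE = 32    # bytes per cache line (assumed for illustration)
WORD = 4           # bytes per word on a 32-bit ARM
HEADER_WORDS = 1   # the bytevector header word mentioned above


def lines_touched(start, end):
    """Base addresses of the cache lines that overlap [start, end)."""
    if end <= start:
        return set()
    first = start - start % CACHE_LINE
    last = (end - 1) - (end - 1) % CACHE_LINE
    return set(range(first, last + CACHE_LINE, CACHE_LINE))


def flush_end(base, payload_words, count_header):
    """End address of the flushed range (hypothetical helper)."""
    words = payload_words + (HEADER_WORDS if count_header else 0)
    return base + words * WORD


def missed_lines(base, payload_words):
    """Cache lines the object occupies that the short range never flushes."""
    short = lines_touched(base, flush_end(base, payload_words, False))
    full = lines_touched(base, flush_end(base, payload_words, True))
    return sorted(full - short)


# Last word starts a new cache line: that line is never flushed.
print([hex(a) for a in missed_lines(0x1000, 8)])   # ['0x1020']
# Same object four bytes higher: the short range happens to cover everything.
print([hex(a) for a in missed_lines(0x1004, 8)])   # []
```

In this model the same object is flushed correctly at one address and incompletely at another four bytes away, which is consistent with the bug depending on where bytevectors happen to be allocated.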
Fixed by changeset 37760f8
This is an intermittent bug in the Fence/ARM port. To reproduce, change to the test/R7RS/Lib directory and try this:
Keep trying that until you get a segmentation fault or an illegal instruction. (Both are possible.)