Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fence/ARM port can crash during R7RS test of getenv #724

Closed
WillClinger opened this issue Apr 29, 2015 · 5 comments
Closed

Fence/ARM port can crash during R7RS test of getenv #724

WillClinger opened this issue Apr 29, 2015 · 5 comments

Comments

@WillClinger
Copy link
Member

This is an intermittent bug in the Fence/ARM port. To reproduce, change to the test/R7RS/Lib directory and try this:

../../../larceny --r7rs --path . --program tests/scheme/run/process-context.sps -- --test-getenv TERM linux

Keep trying that until you get a segmentation fault or an illegal instruction. (Both are possible.)

@WillClinger
Copy link
Member Author

It can take hundreds of runs to reproduce the bug by following
the instructions above. It's easier to run the compiler tests
from inside gdb as follows:

$ gdb ../../larceny.bin
(gdb) run --heap ../../larceny.heap -- run-tests.sch -e "(run-compiler-tests)"

That too is fairly likely to run to completion, but it takes
such a long time that the bug is more likely to show up. When
it did, I observed the following, to which I have added a few
semicolon comments:

(gdb) run
[...snip...]
>>> control-optimization off
>> Compiling p2tests
>> Compiling p4tests
>> Compiling primtests

>> Loading p2tests
>> Loading p4tests

Program received signal SIGSEGV, Segmentation fault.
0xb6b68884 in ?? ()
(gdb) info registers
r0             0xc              12            ; RESULT
r1             0xb61b3fa9       3055239081    ; SECOND
r2             0x6a400          435200        ; GLOBALS
r3             0xb6dcef60       3067932512    ; STKP
r4             0xb6b6c8a0       3065432224    ; TMP0
r5             0xb5ea89af       3052046767    ; REG0
r6             0xb6b6a52d       3065423149    ; REG1
r7             0x44             68            ; REG2
r8             0x44             68            ; REG3
r9             0x2d             45
r10            0xb6b6c893       3065432211    ; REG4
r11            0x116            278           ; REG5 (0x116 is #!unspecified)
r12            0xb5c67485       3049682053    ; TMP1
sp             0xbeffea58       0xbeffea58
lr             0xb5d272b0       -1244499280
pc             0xb6b68884       0xb6b68884
cpsr           0x200c0010       537657360

All of the register contents look plausible, but the encodings
of instructions near the program counter are pair pointers (or
the empty list), not plausible instructions:

(gdb) disassemble/r 0xb6b68860,0xb6b688a0
Dump of assembler code from 0xb6b68860 to 0xb6b688a0:
   0xb6b68860:  01 b8 b5 b6 ldrtlt  r11, [r5], r1, lsl #16
   0xb6b68864:  59 88 b6 b6 ssatlt  r8, #23, r9, asr #16
   0xb6b68868:  f9 b7 b5 b6         ; <UNDEFINED> instruction: 0xb6b5b7f9
   0xb6b6886c:  61 88 b6 b6 ldrtlt  r8, [r6], r1, ror #16
   0xb6b68870:  e9 b6 b5 b6 ldrtlt  r11, [r5], r9, ror #13
   0xb6b68874:  69 88 b6 b6 ldrtlt  r8, [r6], r9, ror #16
   0xb6b68878:  e1 b6 b5 b6 ldrtlt  r11, [r5], r1, ror #13
   0xb6b6887c:  71 88 b6 b6 sxtahlt r8, r6, r1, ror #16
   0xb6b68880:  d9 b6 b5 b6 ssatlt  r11, #22, r9, asr #13
=> 0xb6b68884:  79 88 b6 b6 sxtahlt r8, r6, r9, ror #16
   0xb6b68888:  31 b0 1d b6         ; <UNDEFINED> instruction: 0xb61db031
   0xb6b6888c:  0a 00 00 00 andeq   r0, r0, r10
   0xb6b68890:  21 b8 b5 b6 ldrtlt  r11, [r5], r1, lsr #16
   0xb6b68894:  89 88 b6 b6 ldrtlt  r8, [r6], r9, lsl #17
   0xb6b68898:  19 b8 b5 b6 ssatlt  r11, #22, r9, lsl #16
   0xb6b6889c:  91 88 b6 b6 ssatlt  r8, #23, r1, lsl #17
End of assembler dump.

Let's look at the topmost stack frame. Its address is in r3:

(gdb) x/20xw 0xb6dcef60
0xb6dcef60:     0x00000000      0x00000010      0xb6b67f90      0x0000000c
0xb6dcef70:     0x00000000      0xb6b6c437      0xb644a3f8      0x00000014
0xb6dcef80:     0x00000000      0xb617e44f      0x00000000      0xb6b6c86f
0xb6dcef90:     0xb644a3f8      0x00000014      0x00000000      0xb617e44f
0xb6dcefa0:     0x00000000      0xb6b6c7cf      0xb644a3f8      0x00000014

Reformatting that and adding semicolon comments:

0xb6dcef60:     0x00000000    ; return address
0xb6dcef64:     0x00000010    ; frame size
0xb6dcef68:     0xb6b67f90    ; dynamic link to next stack frame
0xb6dcef6c:     0x0000000c    ; saved REG0
0xb6dcef70:     0x00000000 
0xb6dcef74:     0xb6b6c437 
0xb6dcef78:     0xb644a3f8 
0xb6dcef7c:     0x00000014 
0xb6dcef80:     0x00000000 
0xb6dcef84:     0xb617e44f 
0xb6dcef88:     0x00000000 
0xb6dcef8c:     0xb6b6c86f 
0xb6dcef90:     0xb644a3f8 
0xb6dcef94:     0x00000014 
0xb6dcef98:     0x00000000 
0xb6dcef9c:     0xb617e44f 
0xb6dcefa0:     0x00000000 
0xb6dcefa4:     0xb6b6c7cf 
0xb6dcefa8:     0xb644a3f8 
0xb6dcefac:     0x00000014 

Something's wrong here. The return address is cleared to zero
when the frame is created, and 0x10 is a common frame size, so
that much is okay.

The dynamic link is also cleared when the frame is created, but
is set to a tagged pointer when the frame is flushed into the
heap by garbage collection or call/cc.

0xb6b67f90 is not a tagged pointer, but it is a plausible
return address. It points to offset 0x8c within a bytevector
of length 256 whose header is located at 0xb6b67f00 (so the
bytevector's tagged pointer is 0xb6b67f05). That bytevector
contains compiled Scheme code. I have added comments to this
disassembly:

(gdb) disassemble/r 0xb6b67f00,0xb6b67fa0
Dump of assembler code from 0xb6b67f00 to 0xb6b67fa0:
   0xb6b67f00:  c2 00 01 00 [...snip...]            ; bytevector header
   0xb6b67f04:  c4 40 92 e5 ldr r4, [r2, #196]  ; presumably ETOP
   0xb6b67f08:  10 30 43 e2 sub r3, r3, #16     ; frame allocation
   0xb6b67f0c:  04 00 53 e1 cmp r3, r4
   0xb6b67f10:  03 00 00 8a bhi 0xb6b67f24
   0xb6b67f14:  10 30 83 e2 add r3, r3, #16
   0xb6b67f18:  82 cf 82 e2 add r12, r2, #520   ; 0x208
   0xb6b67f1c:  3c ff 2f e1 blx r12
   0xb6b67f20:  f7 ff ff ea b   0xb6b67f04
   0xb6b67f24:  0c 40 00 e3 movw    r4, #12         ; framesize is 12
   0xb6b67f28:  04 40 83 e5 str r4, [r3, #4]
   0xb6b67f2c:  00 40 00 e3 movw    r4, #0
   0xb6b67f30:  08 40 83 e5 str r4, [r3, #8]    ; clears dynamic link
   0xb6b67f34:  00 40 83 e5 str r4, [r3]        ; clears return address
   0xb6b67f38:  0c 50 83 e5 str r5, [r3, #12]   ; saves REG0
   0xb6b67f3c:  16 61 00 e3 movw    r6, #278    ; 0x116 (#!unspecified)
   0xb6b67f40:  01 10 95 e5 ldr r1, [r5, #1]    ; global variable
   0xb6b67f44:  05 10 91 e5 ldr r1, [r1, #5]
   0xb6b67f48:  01 00 11 e5 ldr r0, [r1, #-1]
   0xb6b67f4c:  07 c0 00 e2 and r12, r0, #7     ; procedure?
   0xb6b67f50:  07 00 5c e3 cmp r12, #7
   0xb6b67f54:  91 ff 82 12 addne   pc, r2, #580    ; 0x244 ; trap if not
   0xb6b67f58:  b0 40 92 e5 ldr r4, [r2, #176]  ; 0xb0  ; timer
   0xb6b67f5c:  00 00 54 e3 cmp r4, #0
   0xb6b67f60:  83 cf 82 02 addeq   r12, r2, #524   ; 0x20c
   0xb6b67f64:  01 00 00 1a bne 0xb6b67f70
   0xb6b67f68:  3c ff 2f e1 blx r12
   0xb6b67f6c:  01 00 00 ea b   0xb6b67f78
   0xb6b67f70:  01 40 44 e2 sub r4, r4, #1
   0xb6b67f74:  b0 40 82 e5 str r4, [r2, #176]  ; 0xb0
   0xb6b67f78:  05 10 a0 e1 mov r1, r5          ; SECOND is now caller
   0xb6b67f7c:  00 50 a0 e1 mov r5, r0          ; REG0 is now callee
   0xb6b67f80:  04 00 00 e3 movw    r0, #4          ; passing one argument
   0xb6b67f84:  03 c0 15 e5 ldr r12, [r5, #-3]
   0xb6b67f88:  00 f0 83 e5 str pc, [r3]        ; return address ...f90
   0xb6b67f8c:  01 f0 4c e2 sub pc, r12, #1     ; branch to code
   0xb6b67f90:  0c 50 93 e5 ldr r5, [r3, #12]   ; restore REG0
   0xb6b67f94:  01 60 95 e5 ldr r6, [r5, #1]
   0xb6b67f98:  09 60 96 e5 ldr r6, [r6, #9]
   0xb6b67f9c:  44 70 00 e3 movw    r7, #68 ; 0x44
End of assembler dump.

So 0xb6b67f90 is indeed a return address within the code shown
above, which suggests we reinterpret the stack as two words of
corrupted stack frame followed by a stack frame at 0xb6dcef68
created by the code above:

0xb6dcef60:     0x00000000    ; return address slot for corrupted frame
0xb6dcef64:     0x00000010    ; frame size for corrupted/truncated frame

0xb6dcef68:     0xb6b67f90    ; return address (start of valid frame)
0xb6dcef6c:     0x0000000c    ; size of valid frame
0xb6dcef70:     0x00000000    ; dynamic link (cleared to zero by code above)
0xb6dcef74:     0xb6b6c437    ; saved REG0

0xb6dcef78:     0xb644a3f8    ; return address (start of valid frame)
0xb6dcef7c:     0x00000014    ; size of valid frame
0xb6dcef80:     0x00000000    ; dynamic link (initially zero)
0xb6dcef84:     0xb617e44f    ; saved REG0
0xb6dcef88:     0x00000000    ; slot 1
0xb6dcef8c:     0xb6b6c86f    ; slot 2 (a procedure)

0xb6dcef90:     0xb644a3f8    ; return address (start of valid frame)
0xb6dcef94:     0x00000014    ; size of valid frame
0xb6dcef98:     0x00000000    ; dynamic link (initially zero)
0xb6dcef9c:     0xb617e44f    ; saved REG0
0xb6dcefa0:     0x00000000    ; slot 1
0xb6dcefa4:     0xb6b6c7cf    ; slot 2 (a procedure)

0xb6dcefa8:     0xb644a3f8    ; return address (start of valid frame)
0xb6dcefac:     0x00000014    ; size of valid frame
0xb6dcefb0:     0x00000000    ; dynamic link (initially zero)
0xb6dcefb4:     0xb6a7e44f    ; saved REG0
0xb6dcefb8:     0x00000000    ; slot 1
0xb6dcefbc:     0xb6b6c7a7    ; slot 2 (a procedure)

Ignoring the first two words, that looks plausible. A procedure
0xb6a7e44f has made three recursive non-tail calls, each with
the same return address. The last of those calls called the
procedure that created the stack frame at 0xb6dcef68; its code
was disassembled above and looks fine.

The next step, I think, is to look inside the procedure whose
code is shown above to see what procedure it called with #!unspecified
as the argument.

It looks like the segmentation fault (or illegal instructions)
occurs after the topmost frame of the Scheme stack has been
corrupted. Once that stack frame has been corrupted, it's easy
to return through an invalid return address. In this case the
program began to execute tagged pointers. Other manifestations
would be possible.

The comment in fence-millicode.c that shows the layout of a
special stack frame can be traced back to the Sparc version of
Larceny and is no longer correct, even for IAssassin. If some
of the millicode that flushes and restores stack frames relied
upon that comment or miscalculated the occasional adjustment
that's necessary for pad words, we might get this kind of bug.

Another possibility is that some hand-written code (such as mal
code or millicode) doesn't store REG0 at the correct offset of
a stack frame; other registers can be saved at any offset, but
correct calculation of the return address depends upon REG0.
That kind of bug would show up only when one of those frames
happens to contain an active return address when it is flushed
into the heap by a garbage collection or call/cc. If the
frame's active return addresses are short-lived, that might
happen rarely enough to explain the difficulty of reproducing
this bug.

There are other possibilities as well. It might be best to
debug this by conducting a systematic review of the Fence/ARM
and IAssassin millicode, correcting misleading comments along
the way.

@WillClinger
Copy link
Member Author

The machine code shown in the last disassembly of my previous message was generated (with control-optimization off) for the next-to-last test in test/Compiler/p4tests.sch:

(test "cg-sequential-loop-1"
      (let ()
        (define (f)
          (identity (begin)))
        (f)
        17)
      17)

It calls the identity procedure with #!unspecified as argument:

; To discourage optimizations, some of these tests use an identity
; function that the compiler is unlikely to recognize as the identity.

(define identity
  (let ((n 3))
    (lambda (x)
      (let ((v (make-vector n x)))
        (vector-set! v (- n 1) v)
        (set! n (+ n 1))
        (if (= n 5)
            (set! n 2))
        (if (eqv? x (vector-ref v 1))
            (vector-ref v 0)
            (identity x))))))

So far as I can tell, the machine code generated for the identity procedure is correct. (I'll post that machine code in a later message if I begin to suspect otherwise.) The machine code for identity does not allocate a stack frame, performs no procedure calls (except for millicode calls), and refers to the Scheme stack pointer (machine register r3) only when returning the value of (vector-ref v 0).

So it looks as though correct machine code created a valid stack frame before calling the identity procedure, and the Scheme stack pointer (machine register r3) was somehow decremented by 8 during the execution of machine code that never changes the contents of that register.

That directs suspicion toward the millicode calls, listed here in order of appearance:

 0x248  fmc_argc_exception (should not have been called)
 0x200  fmc_exception (should not have been called)
*0x220  fmc_alloci (called as part of make-vector)
 0x214  fmc_sub (should not have been called)
*0x234  fmc_partial_barrier (called as part of vector-set!)
 0x210  fmc_add (should not have been called)
 0x234  fmc_partial_barrier (should not have been called because (+ n 1) is not a pointer)
 0x218  fmc_equalp (should not have been called because the fixnum case is inlined)
 0x234  fmc_partial_barrier (should not have been called because 2 is not a pointer)
*0x268  fmc_eqv? (called as part of eqv?)
 0x244  fmc_global_invoke_exception (should not have been called)
*0x20c  fmc_timer_exception (might have been called)
 0x200  fmc_exception (for three traps that should not have been taken)

The likely suspects are starred.

@WillClinger
Copy link
Member Author

Suspecting corruption of the stack cache by millicode during some infrequent combination of circumstances, I modified the MILLIPROC2 macro in arm-millicode.sx to check stack cache integrity on either side of every call from Scheme to C. Running the compiler tests with that checking, I got a segmentation fault while loading p4tests as part of test 170 (faster-arithmetic, interprocedural-inlining, common-subexpression-elimination, local-optimization). As explained below, the fault apparently occurred during this call to test:

(test "cg-variable-1"
      (begin (set! *x* 97)
             *x*)
      97)

It looks as though the registers were corrupted just before or just after control was transferred to the test procedure:

Program received signal SIGSEGV, Segmentation fault.
0xe35c0000 in ?? ()
(gdb) info registers
r0             0xc              12
r1             0xb5e8eb2f       3051940655
r2             0xe200c007       3791699975
r3             0xe35c0007       3814457351
r4             0x3008           12296
r5             0xe3000002       3808428034
r6             0xb5e8cf55       3051933525
r7             0x184            388
r8             0x184            388
r9             0x2800004        41943044
r10            0xe3500002       3813670914
r11            0xb5e8eb2f       3051940655
r12            0xb555f485       3042309253
sp             0x1208c007       0x1208c007
lr             0xa0002a9        167772841
pc             0xe35c0000       0xe35c0000
cpsr           0x10             16

Machine registers r2 (GLOBALS), r3 (STKP), and r5 (REG) are catastrophically wrong. All three registers point to inaccessible memory.

Machine register r1 (SECOND) contains a procedure with the machine code shown below. As an aid to debugging, invoke instructions place the caller in SECOND.

Machine registers r7 and r8 (REG2 and REG3) contain the fixnum 97 passed by the code below. Machine register r6 (REG1) contains the first argument, "cg-variable-1":

(gdb) x/1xw 0xb5e8cf50
0xb5e8cf50:     0x000034d6
(gdb) x/52xb 0xb5e8cf54
0xb5e8cf54:     0x26    0x63    0x00    0x00    0x26    0x67    0x00    0x00
0xb5e8cf5c:     0x26    0x2d    0x00    0x00    0x26    0x76    0x00    0x00
0xb5e8cf64:     0x26    0x61    0x00    0x00    0x26    0x72    0x00    0x00
0xb5e8cf6c:     0x26    0x69    0x00    0x00    0x26    0x61    0x00    0x00
0xb5e8cf74:     0x26    0x62    0x00    0x00    0x26    0x6c    0x00    0x00
0xb5e8cf7c:     0x26    0x65    0x00    0x00    0x26    0x2d    0x00    0x00
0xb5e8cf84:     0x26    0x31    0x00    0x00

Machine register r0 (RESULT) contains the argument count (as a fixnum), placed there by the instruction at 0xb5e8aff8 in the call to test:

(gdb) disassemble/r 0xb5e8af70,0xb5e8b004
Dump of assembler code from 0xb5e8af70 to 0xb5e8b004:
   0xb5e8af70:  c2 90 00 00     andeq   r9, r0, r2, asr #1
   0xb5e8af74:  84 11 00 e3     movw    r1, #388        ; 0x184  ; const   97
   0xb5e8af78:  01 00 95 e5     ldr     r0, [r5, #1]             ; setglbl ???
   0xb5e8af7c:  05 00 90 e5     ldr     r0, [r0, #5]
   0xb5e8af80:  01 10 00 e5     str     r1, [r0, #-1]
   0xb5e8af84:  01 00 11 e3     tst     r1, #1
   0xb5e8af88:  8d cf 82 12     addne   r12, r2, #564   ; 0x234
   0xb5e8af8c:  3c ff 2f 11     blxne   r12
   0xb5e8af90:  01 40 95 e5     ldr     r4, [r5, #1]             ; global  ???
   0xb5e8af94:  05 40 94 e5     ldr     r4, [r4, #5]
   0xb5e8af98:  01 70 14 e5     ldr     r7, [r4, #-1]            ; setreg  2
   0xb5e8af9c:  16 c3 00 e3     movw    r12, #790       ; 0x316  ;   (undefined check)
   0xb5e8afa0:  0c 00 57 e1     cmp     r7, r12
   0xb5e8afa4:  04 00 a0 01     moveq   r0, r4
   0xb5e8afa8:  8f ff 82 02     addeq   pc, r2, #572    ; 0x23c
   0xb5e8afac:  01 60 95 e5     ldr     r6, [r5, #1]             ; lexical 0,?
   0xb5e8afb0:  09 60 96 e5     ldr     r6, [r6, #9]
   0xb5e8afb4:  84 81 00 e3     movw    r8, #388        ; 0x184  ; const/setreg 97,3
   0xb5e8afb8:  01 10 95 e5     ldr     r1, [r5, #1]             ; global  ???
   0xb5e8afbc:  0d 10 91 e5     ldr     r1, [r1, #13]
   0xb5e8afc0:  01 00 11 e5     ldr     r0, [r1, #-1]
   0xb5e8afc4:  07 c0 00 e2     and     r12, r0, #7              ; invoke  3
   0xb5e8afc8:  07 00 5c e3     cmp     r12, #7
   0xb5e8afcc:  91 ff 82 12     addne   pc, r2, #580    ; 0x244
   0xb5e8afd0:  b0 40 92 e5     ldr     r4, [r2, #176]  ; 0xb0
   0xb5e8afd4:  00 00 54 e3     cmp     r4, #0
   0xb5e8afd8:  83 cf 82 02     addeq   r12, r2, #524   ; 0x20c
   0xb5e8afdc:  01 00 00 1a     bne     0xb5e8afe8
   0xb5e8afe0:  3c ff 2f e1     blx     r12
   0xb5e8afe4:  01 00 00 ea     b       0xb5e8aff0
   0xb5e8afe8:  01 40 44 e2     sub     r4, r4, #1
   0xb5e8afec:  b0 40 82 e5     str     r4, [r2, #176]  ; 0xb0
   0xb5e8aff0:  05 10 a0 e1     mov     r1, r5
   0xb5e8aff4:  00 50 a0 e1     mov     r5, r0
   0xb5e8aff8:  0c 00 00 e3     movw    r0, #12
   0xb5e8affc:  03 c0 15 e5     ldr     r12, [r5, #-3]
   0xb5e8b000:  01 f0 4c e2     sub     pc, r12, #1
End of assembler dump.

The segmentation fault occurred before the stack cache integrity checking detected anything wrong. Machine register r3 (the pointer to the stack cache) had been corrupted, so any attempt to check the stack cache would have faulted with the pc pointing to the C code that checked the stack cache.

That didn't happen; the program counter points to an inaccessible address. Machine register r3, which should point to the stack cache, is 7 greater than the program counter.

The timer interrupt might have corrupted the RESULT register. The other possibility I see is that the code for test (or perhaps one of the procedures it calls) got trashed.

@WillClinger
Copy link
Member Author

It looks like a problem with cache flushing.

Disabling all processing of timer exceptions may have improved reproducibility. I am now getting a reproducible SIGILL (Illegal instruction) during compiler test 128:

>>>> Starting test 128
>> Compiling p2tests
>> Compiling p4tests
>> Compiling primtests
>> Loading p2tests
>> Loading p4tests
>> Loading primtests
>>>> Starting test 129
>> Compiling p2tests
>> Compiling p4tests
>> Compiling primtests

>> Loading p2tests

Program received signal SIGILL, Illegal instruction.
0xb6a9f000 in ?? ()
(gdb) info registers
r0             0xc  12
r1             0xb6aa27ff   3064604671
r2             0x6a400  435200
r3             0xb6dcef78   3067932536
r4             0xd79    3449
r5             0xb586e66f   3045516911
r6             0xb6aa0fc5   3064598469
r7             0x94 148
r8             0x94 148
r9             0x2d 45
r10            0xb6aa2c4b   3064605771
r11            0x94 148
r12            0xb60b8485   3054208133
sp             0xbeffea78   0xbeffea78
lr             0xb5b5f2c0   -1246367040
pc             0xb6a9f000   0xb6a9f000
cpsr           0x200c0010   537657360

Machine register r1 (SECOND) holds the procedure making a call to the procedure that's already been installed into machine register r5 (REG0). The caller's code looks like this:

(gdb) x/4xw 0xb6aa27f8
0xb6aa27f8: 0x00000cfe  0xb6a9ef05  0xb6aa200b  0x00000002
(gdb) x/1xw 0xb6a9ef00
0xb6a9ef00: 0x000100c2
(gdb) disassemble/r 0xb6a9ef04,0xb6a9f004
Dump of assembler code from 0xb6a9ef04 to 0xb6a9f004:
   0xb6a9ef04:  c4 40 92 e5 ldr r4, [r2, #196]  ; 0xc4
   0xb6a9ef08:  10 30 43 e2 sub r3, r3, #16
   0xb6a9ef0c:  04 00 53 e1 cmp r3, r4
   0xb6a9ef10:  03 00 00 8a bhi 0xb6a9ef24
   0xb6a9ef14:  10 30 83 e2 add r3, r3, #16
   0xb6a9ef18:  82 cf 82 e2 add r12, r2, #520   ; 0x208
   0xb6a9ef1c:  3c ff 2f e1 blx r12
   0xb6a9ef20:  f7 ff ff ea b   0xb6a9ef04
   0xb6a9ef24:  0c 40 00 e3 movw    r4, #12
   0xb6a9ef28:  04 40 83 e5 str r4, [r3, #4]
   0xb6a9ef2c:  00 40 00 e3 movw    r4, #0
   0xb6a9ef30:  08 40 83 e5 str r4, [r3, #8]
   0xb6a9ef34:  00 40 83 e5 str r4, [r3]
   0xb6a9ef38:  0c 50 83 e5 str r5, [r3, #12]
   0xb6a9ef3c:  94 60 00 e3 movw    r6, #148    ; 0x94
   0xb6a9ef40:  01 10 95 e5 ldr r1, [r5, #1]
   0xb6a9ef44:  05 10 91 e5 ldr r1, [r1, #5]
   0xb6a9ef48:  01 00 11 e5 ldr r0, [r1, #-1]
   0xb6a9ef4c:  07 c0 00 e2 and r12, r0, #7
   0xb6a9ef50:  07 00 5c e3 cmp r12, #7
   0xb6a9ef54:  91 ff 82 12 addne   pc, r2, #580    ; 0x244
   0xb6a9ef58:  b0 40 92 e5 ldr r4, [r2, #176]  ; 0xb0
   0xb6a9ef5c:  00 00 54 e3 cmp r4, #0
   0xb6a9ef60:  83 cf 82 02 addeq   r12, r2, #524   ; 0x20c
   0xb6a9ef64:  01 00 00 1a bne 0xb6a9ef70
   0xb6a9ef68:  3c ff 2f e1 blx r12
   0xb6a9ef6c:  01 00 00 ea b   0xb6a9ef78
   0xb6a9ef70:  01 40 44 e2 sub r4, r4, #1
   0xb6a9ef74:  b0 40 82 e5 str r4, [r2, #176]  ; 0xb0
   0xb6a9ef78:  05 10 a0 e1 mov r1, r5
   0xb6a9ef7c:  00 50 a0 e1 mov r5, r0
   0xb6a9ef80:  04 00 00 e3 movw    r0, #4
   0xb6a9ef84:  03 c0 15 e5 ldr r12, [r5, #-3]
   0xb6a9ef88:  00 f0 83 e5 str pc, [r3]
   0xb6a9ef8c:  01 f0 4c e2 sub pc, r12, #1
   0xb6a9ef90:  0c 50 93 e5 ldr r5, [r3, #12]
   0xb6a9ef94:  00 70 a0 e1 mov r7, r0
   0xb6a9ef98:  01 60 95 e5 ldr r6, [r5, #1]
   0xb6a9ef9c:  09 60 96 e5 ldr r6, [r6, #9]
   0xb6a9efa0:  94 80 00 e3 movw    r8, #148    ; 0x94
   0xb6a9efa4:  01 40 95 e5 ldr r4, [r5, #1]
   0xb6a9efa8:  0d 40 94 e5 ldr r4, [r4, #13]
   0xb6a9efac:  01 00 14 e5 ldr r0, [r4, #-1]
   0xb6a9efb0:  16 c3 00 e3 movw    r12, #790   ; 0x316
   0xb6a9efb4:  0c 00 50 e1 cmp r0, r12
   0xb6a9efb8:  04 00 a0 01 moveq   r0, r4
   0xb6a9efbc:  8f ff 82 02 addeq   pc, r2, #572    ; 0x23c
   0xb6a9efc0:  10 30 83 e2 add r3, r3, #16
   0xb6a9efc4:  07 c0 00 e2 and r12, r0, #7
   0xb6a9efc8:  07 00 5c e3 cmp r12, #7
   0xb6a9efcc:  90 ff 82 12 addne   pc, r2, #144, 30    ; 0x240
   0xb6a9efd0:  b0 40 92 e5 ldr r4, [r2, #176]  ; 0xb0
   0xb6a9efd4:  00 00 54 e3 cmp r4, #0
   0xb6a9efd8:  83 cf 82 02 addeq   r12, r2, #524   ; 0x20c
   0xb6a9efdc:  01 00 00 1a bne 0xb6a9efe8
   0xb6a9efe0:  3c ff 2f e1 blx r12
   0xb6a9efe4:  01 00 00 ea b   0xb6a9eff0
   0xb6a9efe8:  01 40 44 e2 sub r4, r4, #1
   0xb6a9efec:  b0 40 82 e5 str r4, [r2, #176]  ; 0xb0
   0xb6a9eff0:  05 10 a0 e1 mov r1, r5
   0xb6a9eff4:  00 50 a0 e1 mov r5, r0
   0xb6a9eff8:  0c 00 00 e3 movw    r0, #12
   0xb6a9effc:  03 c0 15 e5 ldr r12, [r5, #-3]
=> 0xb6a9f000:  01 f0 4c e2 sub pc, r12, #1
End of assembler dump.

There's nothing wrong with the instruction at 0xb6a9f000, so why did we get a SIGILL there?

It's the start of a new cache line. The compiler tests read compiled machine code from a file, executing each test as it is read. With the ARM's split cache, the data cache line(s) containing the newly read machine code must be flushed, and then the instruction cache line(s) must be flushed. Only then will we get a miss in the instruction cache, forcing it to load the instructions from main memory. The ARM's data cache is write-back, which is why we have to flush the data cache before the instruction cache miss reads from memory.

The reader already calls sys$codevector-iflush on every bytevector that contains machine code. That procedure calls (syscall syscall:iflush bv). That calls primitive_iflush (which is defined in primitive.c):

void primitive_iflush( w_bv )
word w_bv;
{
  mem_icache_flush( ptrof( w_bv )+1,
                    ptrof( w_bv )+roundup4(sizefield(*ptrof(w_bv)))/4 );
}

The second argument is off by 1 because it doesn't account for the bytevector's header word.

This bug will show up rarely because it matters only when the very last word of a codevector starts a new cache line. This bug will be intermittent because it depends on where bytevectors are allocated, which depends on many things including the address ranges allocated to the Larceny process and the allocation history of the Scheme code, including Scheme code that processes timer exceptions.

@WillClinger
Copy link
Member Author

Fixed by changeset 37760f8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant