New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong label-calculation for JUMP-Bytecode on Raspberry Pi, due to unaligned access #3201

Closed
naums opened this Issue Jul 7, 2017 · 11 comments

Comments

Projects
None yet
3 participants
@naums
Contributor

naums commented Jul 7, 2017

Hello,

I'm trying to port micropython bare-metal to Raspberry Pi (see repository). The folder bare-rpi contains my code and config-files for the compilation of both a Raspberry Pi kernel-Image as well as an Qemu-Image from the same code (for testing purposes).

I'm using a Raspberry Pi Model A and the codebase is from version 1.8.7.

I have tried to execute several "files" of Python-code and emitting the code works for most of them, except ones containing any sort of JUMP-Bytecode. There it seems the label gets trashed or is calculated wrongly. This behaviour is only visible on the Raspberry Pi Platform not in Qemu.

In the main.c-code is only one statement at the moment for the following python-code:

for i in range(10):
    print(i)

This code is compiled to the following bytecode on the Raspberry Pi:

00 LOAD_CONST_SMALL_INT 0
01 JUMP 4294938425
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE

In Qemu (using the same code!) this will be translated to:

00 LOAD_CONST_SMALL_INT 0
01 JUMP 19
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE

In Qemu the code is executed correctly, on the Raspberry Pi the execution probably hits JUMP 4294938425 and comes to a hold (no debug-output though).

I currently believe it is due to a wrong calculation / computation of the JUMP-Label. I have yet to try to single step both my Qemu and my RPi-Port together to see the differences.

Does someone know what might be the problem (if I did something wrong) or can someone help fix the issue?

@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 8, 2017

Contributor

Can you please post the actual bytecode? The best way to do this would be to enable the MICROPY_DEBUG_PRINTERS option and set mp_verbose_flag = 2 early on in your C code (eg just after the call to mp_init()). That will make it print out the bytecode after it's compiled.

Contributor

dpgeorge commented Jul 8, 2017

Can you please post the actual bytecode? The best way to do this would be to enable the MICROPY_DEBUG_PRINTERS option and set mp_verbose_flag = 2 early on in your C code (eg just after the call to mp_init()). That will make it print out the bytecode after it's compiled.

@naums

This comment has been minimized.

Show comment
Hide comment
@naums

naums Jul 8, 2017

Contributor

Yes, no problem. This is the Bytecode generated on the Raspberry Pi:

File <stdin>, code block '<module>' (descriptor: 66fb0, bytecode @67040 41 bytes)
Raw bytecode (code_info_size=6, bytecode_size=35):
 03 00 00 00 00 00 06 09 10 29 00 00 ff 80 35 0f
 80 30 24 82 25 1c 81 6f 1c 82 25 64 01 32 81 e9
 30 8a f0 36 eb 7f 32 11 5b
arg names:
(N_STATE 3)
(N_EXC_STACK 0)
  bc=-1 line=1
  bc=8 line=2
00 LOAD_CONST_SMALL_INT 0
01 JUMP 4294938425
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE

This is the Bytecode generated on my QEMU-port:

File <stdin>, code block '<module>' (descriptor: 6efa0, bytecode @6f030 41 bytes)
Raw bytecode (code_info_size=6, bytecode_size=35):
 03 00 00 00 00 00 06 09 10 29 00 00 ff 80 35 0f
 80 30 24 82 25 1c 81 6f 1c 82 25 64 01 32 81 e9
 30 8a f0 36 eb 7f 32 11 5b
arg names:
(N_STATE 3)
(N_EXC_STACK 0)
  bc=-1 line=1
  bc=8 line=2
00 LOAD_CONST_SMALL_INT 0
01 JUMP 19
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE
Contributor

naums commented Jul 8, 2017

Yes, no problem. This is the Bytecode generated on the Raspberry Pi:

File <stdin>, code block '<module>' (descriptor: 66fb0, bytecode @67040 41 bytes)
Raw bytecode (code_info_size=6, bytecode_size=35):
 03 00 00 00 00 00 06 09 10 29 00 00 ff 80 35 0f
 80 30 24 82 25 1c 81 6f 1c 82 25 64 01 32 81 e9
 30 8a f0 36 eb 7f 32 11 5b
arg names:
(N_STATE 3)
(N_EXC_STACK 0)
  bc=-1 line=1
  bc=8 line=2
00 LOAD_CONST_SMALL_INT 0
01 JUMP 4294938425
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE

This is the Bytecode generated on my QEMU-port:

File <stdin>, code block '<module>' (descriptor: 6efa0, bytecode @6f030 41 bytes)
Raw bytecode (code_info_size=6, bytecode_size=35):
 03 00 00 00 00 00 06 09 10 29 00 00 ff 80 35 0f
 80 30 24 82 25 1c 81 6f 1c 82 25 64 01 32 81 e9
 30 8a f0 36 eb 7f 32 11 5b
arg names:
(N_STATE 3)
(N_EXC_STACK 0)
  bc=-1 line=1
  bc=8 line=2
00 LOAD_CONST_SMALL_INT 0
01 JUMP 19
04 DUP_TOP
05 STORE_NAME i
08 LOAD_NAME print
11 LOAD_NAME i
14 CALL_FUNCTION n=1 nkw=0
16 POP_TOP
17 LOAD_CONST_SMALL_INT 1
18 BINARY_OP 18 
19 DUP_TOP
20 LOAD_CONST_SMALL_INT 10
21 BINARY_OP 25 __lt__
22 POP_JUMP_IF_TRUE 4
25 POP_TOP
26 LOAD_CONST_NONE
27 RETURN_VALUE
@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 8, 2017

Contributor

Thanks. The relevant opcode is 35 0f 80 which is "jump 15 bytes ahead". The first byte (0x35) is the opcode and the other 2 bytes are the argument (number of bytes to jump, with 0x8000 excess). It's being encoded correctly on both devices, but the RPi is decoding it wrong: it uses 35 0f as the argument instead of 0f 80. That's very strange indeed and on the surface of it looks like the switch (*ip++) { code (in py/showbc.c) is not being compiled/executed correctly, that it doesn't do the post increment (and a similar thing is happening in the VM). To debug further I'd need to see RPi disassembly of py/showbc.o.

Contributor

dpgeorge commented Jul 8, 2017

Thanks. The relevant opcode is 35 0f 80 which is "jump 15 bytes ahead". The first byte (0x35) is the opcode and the other 2 bytes are the argument (number of bytes to jump, with 0x8000 excess). It's being encoded correctly on both devices, but the RPi is decoding it wrong: it uses 35 0f as the argument instead of 0f 80. That's very strange indeed and on the surface of it looks like the switch (*ip++) { code (in py/showbc.c) is not being compiled/executed correctly, that it doesn't do the post increment (and a similar thing is happening in the VM). To debug further I'd need to see RPi disassembly of py/showbc.o.

@naums

This comment has been minimized.

Show comment
Hide comment
@naums

naums Jul 8, 2017

Contributor

This is the disassembled object-file:
showbc.txt

(PS: just as a reminder: the codebase is from tag 1.8.7)

Contributor

naums commented Jul 8, 2017

This is the disassembled object-file:
showbc.txt

(PS: just as a reminder: the codebase is from tag 1.8.7)

@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 9, 2017

Contributor

@naums thanks for the disasm. The relevant asm for the JUMP opcode is:

JUMP
 4a8:   e1d030b1        ldrh    r3, [r0, #1]
 4ac:   e59f2530        ldr     r2, [pc, #1328] ; 9e4 <mp_bytecode_print_str+0x9e4>
 4b0:   e2804003        add     r4, r0, #3
 4b4:   e2433902        sub     r3, r3, #32768  ; 0x8000
 4b8:   e5922000        ldr     r2, [r2]
 4bc:   e0843003        add     r3, r4, r3
 4c0:   e0432002        sub     r2, r3, r2
 4c4:   e59f151c        ldr     r1, [pc, #1308] ; 9e8 <mp_bytecode_print_str+0x9e8>
 4c8:   eaffff4b        b       1fc <mp_bytecode_print_str+0x1fc>

The register r0 contains the ip pointer to the bytecode, which on entry to the JUMP label above points to the JUMP bytecode (ie the 35 in the bytecode given above). The instruction at 4a8 correctly loads the argument from ip+1, but does a half-word load instead of 2 byte loads. It looks like the compiler is optimising the 2 byte loads because it knows the RPi is little endian, ie:

arg = ip[1] | (ip[2] << 8);

is optimised to:

arg = *(uint16_t)(ip + 1);

It looks like the problem is that this load is from an unaligned pointer, because ip+1 is an odd value. And so the RPi hardware is loading from the even address (ip+1)&~1 which loads the bytes 35 0f instead of 0f 80.

Does the RPi allow unaligned loads like this?

If the analysis above is true, and the RPi compiler is buggy, then I'm very surprised and wonder why it wasn't caught before...?

Contributor

dpgeorge commented Jul 9, 2017

@naums thanks for the disasm. The relevant asm for the JUMP opcode is:

JUMP
 4a8:   e1d030b1        ldrh    r3, [r0, #1]
 4ac:   e59f2530        ldr     r2, [pc, #1328] ; 9e4 <mp_bytecode_print_str+0x9e4>
 4b0:   e2804003        add     r4, r0, #3
 4b4:   e2433902        sub     r3, r3, #32768  ; 0x8000
 4b8:   e5922000        ldr     r2, [r2]
 4bc:   e0843003        add     r3, r4, r3
 4c0:   e0432002        sub     r2, r3, r2
 4c4:   e59f151c        ldr     r1, [pc, #1308] ; 9e8 <mp_bytecode_print_str+0x9e8>
 4c8:   eaffff4b        b       1fc <mp_bytecode_print_str+0x1fc>

The register r0 contains the ip pointer to the bytecode, which on entry to the JUMP label above points to the JUMP bytecode (ie the 35 in the bytecode given above). The instruction at 4a8 correctly loads the argument from ip+1, but does a half-word load instead of 2 byte loads. It looks like the compiler is optimising the 2 byte loads because it knows the RPi is little endian, ie:

arg = ip[1] | (ip[2] << 8);

is optimised to:

arg = *(uint16_t)(ip + 1);

It looks like the problem is that this load is from an unaligned pointer, because ip+1 is an odd value. And so the RPi hardware is loading from the even address (ip+1)&~1 which loads the bytes 35 0f instead of 0f 80.

Does the RPi allow unaligned loads like this?

If the analysis above is true, and the RPi compiler is buggy, then I'm very surprised and wonder why it wasn't caught before...?

@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 9, 2017

Contributor

This page regarding unaligned accesses is related: http://antirez.com/news/111 . It looks like there are 2 options to fix it: 1) make GCC (or whatever your compiler is) not emit unaligned loads/stores; 2) provide a custom fault handler to fix up the accesses at runtime.

Contributor

dpgeorge commented Jul 9, 2017

This page regarding unaligned accesses is related: http://antirez.com/news/111 . It looks like there are 2 options to fix it: 1) make GCC (or whatever your compiler is) not emit unaligned loads/stores; 2) provide a custom fault handler to fix up the accesses at runtime.

@naums

This comment has been minimized.

Show comment
Hide comment
@naums

naums Jul 9, 2017

Contributor

Thank you very much. I'm going to take a deeper look into the problem and possible fixes tomorrow. One question though: I should not add inline assembly for the Raspberry Pi-port, right?

You've said before that there are two places in the code which might be faulty. Can you tell me the place of the second one?

I'll report back, when I have found and tested a solution.

Contributor

naums commented Jul 9, 2017

Thank you very much. I'm going to take a deeper look into the problem and possible fixes tomorrow. One question though: I should not add inline assembly for the Raspberry Pi-port, right?

You've said before that there are two places in the code which might be faulty. Can you tell me the place of the second one?

I'll report back, when I have found and tested a solution.

@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 10, 2017

Contributor

I should not add inline assembly for the Raspberry Pi-port, right?

If you mean the @micropython.asm_thumb inline assembly, then, yes, you could add that but only after the other stuff is working.

You've said before that there are two places in the code which might be faulty. Can you tell me the place of the second one?

If you fix the issue with showbc.c then the VM should also be fixed.

Contributor

dpgeorge commented Jul 10, 2017

I should not add inline assembly for the Raspberry Pi-port, right?

If you mean the @micropython.asm_thumb inline assembly, then, yes, you could add that but only after the other stuff is working.

You've said before that there are two places in the code which might be faulty. Can you tell me the place of the second one?

If you fix the issue with showbc.c then the VM should also be fixed.

@naums

This comment has been minimized.

Show comment
Hide comment
@naums

naums Jul 10, 2017

Contributor

At first I tried to solve the issue by editing the makro DECODE_SLABEL to use a function which I marked as optimize('0'). This seemed to work for mp_bytecode_print, but the VM would not run the code as in emitglue.c the assertion rc->kind == MP_BYTECODE failed (I assume due to the same error).

This was when I basically searched for a better solution other than going through the interpreter code and aligning all the structs. I began to search in the manual for the ARM1176jzf-s processor and found that it can load unaligned data by itself (without calling the kernel for that).

So I solved the issue by setting the processor (or to be exact the system coprocessor) into a state, which allows it to load unaligned data. The manual states I have to set the U-Bit (bit 22) of the c1-control register of C15-coprocessor.

With the help of this post I managed to read and write from/to that register and set the corresponding value.

So this is the ARM-assembler-code which I added to my startup-assembler-code:

/// load C15 control register c1-c1,0,0 (c1 Control Register) into r0
mrc p15, 0, r0, c1, c0, 0
ldr r1, =0x00400000     /// set bit 22 (U-bit)
orr r0, r1
mcr p15, 0, r0, c1, c0, 0

Again, thank you very much for identifying the issue.

Contributor

naums commented Jul 10, 2017

At first I tried to solve the issue by editing the makro DECODE_SLABEL to use a function which I marked as optimize('0'). This seemed to work for mp_bytecode_print, but the VM would not run the code as in emitglue.c the assertion rc->kind == MP_BYTECODE failed (I assume due to the same error).

This was when I basically searched for a better solution other than going through the interpreter code and aligning all the structs. I began to search in the manual for the ARM1176jzf-s processor and found that it can load unaligned data by itself (without calling the kernel for that).

So I solved the issue by setting the processor (or to be exact the system coprocessor) into a state, which allows it to load unaligned data. The manual states I have to set the U-Bit (bit 22) of the c1-control register of C15-coprocessor.

With the help of this post I managed to read and write from/to that register and set the corresponding value.

So this is the ARM-assembler-code which I added to my startup-assembler-code:

/// load C15 control register c1-c1,0,0 (c1 Control Register) into r0
mrc p15, 0, r0, c1, c0, 0
ldr r1, =0x00400000     /// set bit 22 (U-bit)
orr r0, r1
mcr p15, 0, r0, c1, c0, 0

Again, thank you very much for identifying the issue.

@naums naums closed this Jul 10, 2017

@pfalcon

This comment has been minimized.

Show comment
Hide comment
@pfalcon

pfalcon Jul 10, 2017

Member

@naums : To clarify: the main (well, only) problem is that you use broken compiler, which performs unsafe optimizations. It might be fixed by passing additional flags to the compiler (specifying exact type of hardware you build for), or by changing compiler.

Member

pfalcon commented Jul 10, 2017

@naums : To clarify: the main (well, only) problem is that you use broken compiler, which performs unsafe optimizations. It might be fixed by passing additional flags to the compiler (specifying exact type of hardware you build for), or by changing compiler.

@dpgeorge dpgeorge changed the title from Wrong label-calculation for JUMP-Bytecode on Raspberry Pi to Wrong label-calculation for JUMP-Bytecode on Raspberry Pi, due to unaligned access Jul 10, 2017

@dpgeorge

This comment has been minimized.

Show comment
Hide comment
@dpgeorge

dpgeorge Jul 10, 2017

Contributor

@naums glad it was a relatively easy fix!

Contributor

dpgeorge commented Jul 10, 2017

@naums glad it was a relatively easy fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment