Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

movaps unaligned address crash on i386 #190

Closed
kleisauke opened this issue Mar 11, 2021 · 6 comments
Closed

movaps unaligned address crash on i386 #190

kleisauke opened this issue Mar 11, 2021 · 6 comments

Comments

@kleisauke
Copy link

When using GNU vector extensions, it seems that Clang generates a movaps instruction with an unaligned parameter, which causes a crash on Windows i386.

I was able to create a reduced test case to reproduce this:

#include <stdlib.h>

#define MAX_BANDS (64)

typedef float v4f __attribute__((vector_size(4 * sizeof(float))));

typedef struct _VipsCompositeBase {
  double max_band[MAX_BANDS + 1];

  v4f max_band_vec;
} VipsCompositeBase;

int main() {
  VipsCompositeBase *composite =
      (VipsCompositeBase *) malloc(sizeof(VipsCompositeBase));

  composite->max_band[0] = 255;
  composite->max_band[1] = 255;
  composite->max_band[2] = 255;
  composite->max_band[3] = 255;

  for (int b = 0; b <= 3; b++)
    composite->max_band_vec[b] =
        (float) composite->max_band[b];

  free(composite);

  return 0;
}

(It could possibly be related to the code, I tried several patches to fix this, with no luck)

$ docker run -v $(pwd):/build -w /build -it mstorsjo/llvm-mingw:latest \
   i686-w64-mingw32-clang -g test.c -o test-i686.exe

# x86_64 reference
$ docker run -v $(pwd):/build -w /build -it mstorsjo/llvm-mingw:latest \
   x86_64-w64-mingw32-clang -g test.c -o test-x86_64.exe

Here are the objdump's generated with llvm-objdump -l --disassemble-symbols=_main for LLVM and GCC:

Details
test-i686.exe:	file format coff-i386


Disassembly of section .text:

004013f0 <_main>:
; _main():
; /build/test.c:13
  4013f0: 55                           	pushl	%ebp
  4013f1: 89 e5                        	movl	%esp, %ebp
  4013f3: 83 e4 f0                     	andl	$-16, %esp
  4013f6: 83 ec 40                     	subl	$64, %esp
  4013f9: e8 32 01 00 00               	calll	0x401530 <___main>
  4013fe: c7 44 24 34 00 00 00 00      	movl	$0, 52(%esp)
; /build/test.c:15
  401406: c7 04 24 20 02 00 00         	movl	$544, (%esp)
  40140d: e8 90 0f 00 00               	calll	0x4023a2 <_malloc>
  401412: f2 0f 10 05 50 30 40 00      	movsd	4206672, %xmm0
; /build/test.c:14
  40141a: 89 44 24 30                  	movl	%eax, 48(%esp)
; /build/test.c:17
  40141e: 8b 44 24 30                  	movl	48(%esp), %eax
  401422: f2 0f 11 00                  	movsd	%xmm0, (%eax)
; /build/test.c:18
  401426: 8b 44 24 30                  	movl	48(%esp), %eax
  40142a: f2 0f 11 40 08               	movsd	%xmm0, 8(%eax)
; /build/test.c:19
  40142f: 8b 44 24 30                  	movl	48(%esp), %eax
  401433: f2 0f 11 40 10               	movsd	%xmm0, 16(%eax)
; /build/test.c:20
  401438: 8b 44 24 30                  	movl	48(%esp), %eax
  40143c: f2 0f 11 40 18               	movsd	%xmm0, 24(%eax)
; /build/test.c:22
  401441: c7 44 24 2c 00 00 00 00      	movl	$0, 44(%esp)
  401449: 83 7c 24 2c 03               	cmpl	$3, 44(%esp)
  40144e: 0f 8f 44 00 00 00            	jg	0x401498 <_main+0xa8>
; /build/test.c:24
  401454: 8b 44 24 30                  	movl	48(%esp), %eax
  401458: 8b 4c 24 2c                  	movl	44(%esp), %ecx
  40145c: 89 ca                        	movl	%ecx, %edx
  40145e: 83 e2 03                     	andl	$3, %edx
  401461: f2 0f 10 04 c8               	movsd	(%eax,%ecx,8), %xmm0
  401466: f2 0f 5a c0                  	cvtsd2ss	%xmm0, %xmm0
; /build/test.c:23
  40146a: 0f 28 88 10 02 00 00         	movaps	528(%eax), %xmm1
  401471: 0f 29 4c 24 10               	movaps	%xmm1, 16(%esp)
  401476: f3 0f 11 44 94 10            	movss	%xmm0, 16(%esp,%edx,4)
  40147c: 0f 28 44 24 10               	movaps	16(%esp), %xmm0
  401481: 0f 29 80 10 02 00 00         	movaps	%xmm0, 528(%eax)
; /build/test.c:22
  401488: 8b 44 24 2c                  	movl	44(%esp), %eax
  40148c: 83 c0 01                     	addl	$1, %eax
  40148f: 89 44 24 2c                  	movl	%eax, 44(%esp)
  401493: e9 b1 ff ff ff               	jmp	0x401449 <_main+0x59>
; /build/test.c:26
  401498: 8b 44 24 30                  	movl	48(%esp), %eax
  40149c: 89 04 24                     	movl	%eax, (%esp)
  40149f: e8 f8 0e 00 00               	calll	0x40239c <_free>
  4014a4: 31 c0                        	xorl	%eax, %eax
; /build/test.c:28
  4014a6: 89 ec                        	movl	%ebp, %esp
  4014a8: 5d                           	popl	%ebp
  4014a9: c3                           	retl
  4014aa: cc                           	int3
  4014ab: cc                           	int3
  4014ac: cc                           	int3
  4014ad: cc                           	int3
  4014ae: cc                           	int3
  4014af: cc                           	int3
test-gcc-i686.exe:	file format coff-i386


Disassembly of section .text:

004015d0 <_main>:
; _main():
; /build/test.c:13
  4015d0: 55                           	pushl	%ebp
  4015d1: 89 e5                        	movl	%esp, %ebp
  4015d3: 83 e4 f0                     	andl	$-16, %esp
  4015d6: 83 ec 30                     	subl	$48, %esp
  4015d9: e8 32 01 00 00               	calll	0x401710 <___main>
; /build/test.c:15
  4015de: c7 04 24 20 02 00 00         	movl	$544, (%esp)
  4015e5: e8 fa 0f 00 00               	calll	0x4025e4 <_malloc>
  4015ea: 89 44 24 28                  	movl	%eax, 40(%esp)
; /build/test.c:17
  4015ee: 8b 44 24 28                  	movl	40(%esp), %eax
  4015f2: dd 05 48 40 40 00            	fldl	4210760
  4015f8: dd 18                        	fstpl	(%eax)
; /build/test.c:18
  4015fa: 8b 44 24 28                  	movl	40(%esp), %eax
  4015fe: dd 05 48 40 40 00            	fldl	4210760
  401604: dd 58 08                     	fstpl	8(%eax)
; /build/test.c:19
  401607: 8b 44 24 28                  	movl	40(%esp), %eax
  40160b: dd 05 48 40 40 00            	fldl	4210760
  401611: dd 58 10                     	fstpl	16(%eax)
; /build/test.c:20
  401614: 8b 44 24 28                  	movl	40(%esp), %eax
  401618: dd 05 48 40 40 00            	fldl	4210760
  40161e: dd 58 18                     	fstpl	24(%eax)
; /build/test.c:22
  401621: c7 44 24 2c 00 00 00 00      	movl	$0, 44(%esp)
  401629: eb 29                        	jmp	0x401654 <_main+0x84>
; /build/test.c:24
  40162b: 8b 44 24 28                  	movl	40(%esp), %eax
  40162f: 8b 54 24 2c                  	movl	44(%esp), %edx
  401633: dd 04 d0                     	fldl	(%eax,%edx,8)
  401636: d9 5c 24 1c                  	fstps	28(%esp)
  40163a: d9 44 24 1c                  	flds	28(%esp)
; /build/test.c:23
  40163e: 8b 44 24 28                  	movl	40(%esp), %eax
  401642: 8b 54 24 2c                  	movl	44(%esp), %edx
  401646: 81 c2 84 00 00 00            	addl	$132, %edx
  40164c: d9 1c 90                     	fstps	(%eax,%edx,4)
; /build/test.c:22
  40164f: 83 44 24 2c 01               	addl	$1, 44(%esp)
  401654: 83 7c 24 2c 03               	cmpl	$3, 44(%esp)
  401659: 7e d0                        	jle	0x40162b <_main+0x5b>
; /build/test.c:26
  40165b: 8b 44 24 28                  	movl	40(%esp), %eax
  40165f: 89 04 24                     	movl	%eax, (%esp)
  401662: e8 8d 0f 00 00               	calll	0x4025f4 <_free>
; /build/test.c:28
  401667: b8 00 00 00 00               	movl	$0, %eax
; /build/test.c:29
  40166c: c9                           	leave
  40166d: c3                           	retl
  40166e: 90                           	nop
  40166f: 90                           	nop
@mstorsjo
Copy link
Owner

I think the problem is that you're allocating a structure that contains a vector, without sufficient alignment. If you'd allocate the struct e.g. on the stack instead of with malloc, it should have enough alignment. (If you'd check with some alignof operator (don't remember off-hand if it's c++ only or something like that), you could also see exactly how much alignment the compiler assumes the type to have.)

To allocate memory for such vector types, you probably need to use something like _aligned_malloc instead of plain malloc.

@kleisauke
Copy link
Author

I think you're right. PR libvips/libvips#2144 fixes this.

I was wondering, do you think this could be reproduced on Linux i386 as well? I tried this, which generates similar x86 instructions with no crashes:

Details
FROM i386/debian:bullseye

RUN apt-get update -qq && \
    apt-get install -qqy --no-install-recommends clang && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*
docker build -t clang-i386 .
docker run -v $(pwd):/build -w /build -it clang-i386 \
   clang -g -msse2 -mstackrealign test.c -o test-i686
diff --git a/coff-i386 b/elf32-i386
index 1111111..2222222 100644
--- a/coff-i386
+++ b/elf32-i386
@@ -1,22 +1,21 @@
 
-test-i686.exe:	file format coff-i386
+test-i686:	file format elf32-i386
 
 
 Disassembly 
 
-004013f0 
+08049180 
 ; 
 ; 13
   55                           	pushl	%ebp
   89 e5                        	movl	%esp, %ebp
   83 e4 f0                     	andl	$-16, %esp
   83 ec 40                     	subl	$64, %esp
-  e8 32 01 00 00               	calll	0x401530 <___main>
   c7 44 24 34 00 00 00 00      	movl	$0, 52(%esp)
 ; 15
   c7 04 24 20 02 00 00         	movl	$544, (%esp)
-  e8 90 0f 00 00               	calll	0x4023a2 <_malloc>
-  f2 0f 10 05 50 30 40 00      	movsd	4206672, %xmm0
+  e8 a3 fe ff ff               	calll	0x8049040 <malloc@plt>
+  f2 0f 10 05 08 a0 04 08      	movsd	134520840, %xmm0
 ; 14
   89 44 24 30                  	movl	%eax, 48(%esp)
 ; 17
@@ -34,7 +33,7 @@ Disassembly
 ; 22
   c7 44 24 2c 00 00 00 00      	movl	$0, 44(%esp)
   83 7c 24 2c 03               	cmpl	$3, 44(%esp)
-  0f 8f 44 00 00 00            	jg	0x401498 <_main+0xa8>
+  0f 8f 44 00 00 00            	jg	0x8049223 <main+0xa3>
 ; 24
   8b 44 24 30                  	movl	48(%esp), %eax
   8b 4c 24 2c                  	movl	44(%esp), %ecx
@@ -52,19 +51,19 @@ Disassembly
   8b 44 24 2c                  	movl	44(%esp), %eax
   83 c0 01                     	addl	$1, %eax
   89 44 24 2c                  	movl	%eax, 44(%esp)
-  e9 b1 ff ff ff               	jmp	0x401449 <_main+0x59>
+  e9 b1 ff ff ff               	jmp	0x80491d4 <main+0x54>
 ; 26
   8b 44 24 30                  	movl	48(%esp), %eax
   89 04 24                     	movl	%eax, (%esp)
-  e8 f8 0e 00 00               	calll	0x40239c <_free>
+  e8 01 fe ff ff               	calll	0x8049030 <free@plt>
   31 c0                        	xorl	%eax, %eax
 ; 28
   89 ec                        	movl	%ebp, %esp
   5d                           	popl	%ebp
   c3                           	retl
-  cc                           	int3
-  cc                           	int3
-  cc                           	int3
-  cc                           	int3
-  cc                           	int3
-  cc                           	int3
+  66 90                        	nop
+  66 90                        	nop
+  66 90                        	nop
+  66 90                        	nop
+  66 90                        	nop
+  90                           	nop

(that ___main call is curious, is that coming from mingw?)

And I had a go with https://github.com/johnsonjh/duma to force malloc() returning unaligned addresses, but that doesn't seem to reproduce it either.

@kleisauke
Copy link
Author

I also could not reproduce this with https://github.com/AFLplusplus/AFLplusplus/tree/stable/utils/libdislocator on Linux x64.

I added some printf-debugging to the above test case with this patch:

diff --git a/test.c b/test.c
index 1111111..2222222 100644
--- a/test.c
+++ b/test.c
@@ -1,4 +1,6 @@
 #include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
 
 #define MAX_BANDS (64)
 
@@ -10,10 +12,15 @@ typedef struct _VipsCompositeBase {
   v4f max_band_vec;
 } VipsCompositeBase;
 
+#define is_aligned(POINTER, BYTE_COUNT) \
+    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)
+
 int main() {
   VipsCompositeBase *composite =
       (VipsCompositeBase *) malloc(sizeof(VipsCompositeBase));
 
+  printf("is aligned: %d\n", is_aligned(composite, 16));
+
   composite->max_band[0] = 255;
   composite->max_band[1] = 255;
   composite->max_band[2] = 255;

After compiling and running this on my Windows x64 PC (I don't have any i686 hardware to test this on), I noticed this:

C:\Users\KleisAuke\Desktop>test-i686.exe
is aligned: 1

C:\Users\KleisAuke\Desktop>echo Exit Code is %errorlevel%
Exit Code is 0

C:\Users\KleisAuke\Desktop>test-i686.exe
is aligned: 0

C:\Users\KleisAuke\Desktop>echo Exit Code is %errorlevel%
Exit Code is -1073741819

So the first run appears to work, but any subsequent runs fails. Could this be an stack alignment bug? fwiw, I tried to remove the force_align_arg_pointer attribute introduced in mirror/mingw-w64@6045a1e, but the objdump results were identical.

@kleisauke
Copy link
Author

A possible workaround could be to have LLVM emit a movups instruction by misaligning the vector and struct on an 8-byte boundary, for example:

Details
diff --git a/test.c b/test.c
index 1111111..2222222 100644
--- a/test.c
+++ b/test.c
@@ -1,19 +1,22 @@
 #include <stdlib.h>
+#include <stdio.h>
 
 #define MAX_BANDS (64)
 
-typedef float v4f __attribute__((vector_size(4 * sizeof(float))));
+typedef float v4f __attribute__((vector_size(4 * sizeof(float)), aligned(8)));
 
 typedef struct _VipsCompositeBase {
   double max_band[MAX_BANDS + 1];
 
   v4f max_band_vec;
-} VipsCompositeBase;
+} __attribute__((aligned(8))) VipsCompositeBase;
 
 int main() {
   VipsCompositeBase *composite =
       (VipsCompositeBase *) malloc(sizeof(VipsCompositeBase));
 
+  printf("Alignment of VipsCompositeBase is %zu\n", __alignof__(VipsCompositeBase));
+
   composite->max_band[0] = 255;
   composite->max_band[1] = 255;
   composite->max_band[2] = 255;
diff --git a/old.asm b/new.asm
index 1111111..2222222 100644
--- a/old.asm
+++ b/new.asm
@@ -14,7 +14,7 @@ Disassembly of section .text:
   4013f9: e8 32 01 00 00               	calll	0x401530 <___main>
   4013fe: c7 44 24 34 00 00 00 00      	movl	$0, 52(%esp)
 ; /build/test.c:15
-  401406: c7 04 24 20 02 00 00         	movl	$544, (%esp)
+  401406: c7 04 24 18 02 00 00         	movl	$536, (%esp)
   40140d: e8 90 0f 00 00               	calll	0x4023a2 <_malloc>
   401412: f2 0f 10 05 50 30 40 00      	movsd	4206672, %xmm0
 ; /build/test.c:14
@@ -43,11 +43,11 @@ Disassembly of section .text:
   401461: f2 0f 10 04 c8               	movsd	(%eax,%ecx,8), %xmm0
   401466: f2 0f 5a c0                  	cvtsd2ss	%xmm0, %xmm0
 ; /build/test.c:23
-  40146a: 0f 28 88 10 02 00 00         	movaps	528(%eax), %xmm1
+  40146a: 0f 10 88 08 02 00 00         	movups	520(%eax), %xmm1
   401471: 0f 29 4c 24 10               	movaps	%xmm1, 16(%esp)
   401476: f3 0f 11 44 94 10            	movss	%xmm0, 16(%esp,%edx,4)
   40147c: 0f 28 44 24 10               	movaps	16(%esp), %xmm0
-  401481: 0f 29 80 10 02 00 00         	movaps	%xmm0, 528(%eax)
+  401481: 0f 11 80 08 02 00 00         	movups	%xmm0, 520(%eax)
 ; /build/test.c:22
   401488: 8b 44 24 2c                  	movl	44(%esp), %eax
   40148c: 83 c0 01                     	addl	$1, %eax

@mstorsjo
Copy link
Owner

Sure, adding an attribute to claim that the struct is misaligned is one way of fixing it.

FWIW, most projects that work with SIMD do make sure that all allocations are aligned; most modern OSes provide some sort of aligned allocation function. Unfortunately there's not one that is available on all common platforms these days; have a look at what ffmpeg does: http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/mem.c;h=cfb6d8ab8ffa6c6d51f325d4d8bbe7dc0922871f;hb=1af4885014f7d80abbd28613a2939fbcada94ecd#l84

What you did in your PR also works (and is platform independent), which approach is to be preferred is up to each project I guess.

Regarding the testcase, a more telling testcase would be this:

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <stdalign.h>

#define MAX_BANDS (64)

typedef float v4f __attribute__((vector_size(4 * sizeof(float))));

typedef struct _VipsCompositeBase {
  double max_band[MAX_BANDS + 1];

  v4f max_band_vec;
} VipsCompositeBase;

#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

int main() {
  printf("alignof %d\n", (int)alignof(VipsCompositeBase));

  for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++)
      malloc(1 + (rand() % 8000));
    VipsCompositeBase *composite =
        (VipsCompositeBase *) malloc(sizeof(VipsCompositeBase));
    printf("aligned: %d\n", is_aligned(composite, alignof(VipsCompositeBase)));
  }

  return 0;
}

Whether it returns an aligned address or not depends on the layout of the heap, earlier allocations, allocator algorithm, etc.

Malloc is only guaranteed to give memory aligned enough for the natural C types, with the largest ones being double and long long, i.e. 8 bytes in most cases. Individual libcs may promise more alignment than this, but standards-wise, there's no such guarantee.

With a malloc that guarantees 8 byte alignment, the returned addresses will be 16 byte aligned 50% of the time, even though that wasn't requested or guaranteed.

On glibc on Ubuntu 20.04, it seems that I always get a 16-byte aligned address, even in i386 mode - so plain malloc would seem to work, even though it's not guaranteed to. But if you'd double the vector size and add -mavx, you'd get an aligned address 50% of the time on glibc too - and you'd get an aligned address 25% of the time on i386 windows.

@kleisauke
Copy link
Author

Thanks for the detailed explanation! Indeed, using an aligned allocation function makes more sense, done with commit libvips/libvips@5ef9c84 in PR libvips/libvips#2144 (which has just landed).

It's quite puzzling that in i386 mode on Linux you always get a 16-byte aligned address, which makes this kind of bugs hard to detect. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants