
IEEE exceptions handling in libgfortran #8

Closed
fxcoudert opened this issue Aug 30, 2020 · 5 comments · Fixed by #20

@fxcoudert (Contributor)

Code handling IEEE exceptions in libgfortran currently relies on target-specific support that, on Darwin, exists only for x86: it works for Intel-based macOS (using assembly), but arm-darwin is not covered. We need to add code that uses macOS-specific APIs (if any exist) or AArch64-specific code to do the following (a sketch of the underlying register access follows the list):

  • set and clear traps on individual FP exceptions
  • get and set raised FP exception flags
  • get and set the rounding mode
  • get and set the underflow mode
  • save and restore FPU state
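
For reference, on AArch64 all five of these map onto just two system registers: FPCR (control: the trap-enable bits, 8-12 for the five standard exceptions, plus the rounding mode in bits 22-23) and FPSR (status: the sticky exception flags in bits 0-4). So no OS API is strictly required if the runtime can read and write those registers directly. A minimal sketch of the access involved, using the builtins GCC provides on aarch64 targets (the macro and function names here are illustrative, not from libgfortran):

/* Sketch: trap enable and flag query for divide-by-zero on AArch64.  */
#define FPSR_DZC  (1u << 1)   /* sticky divide-by-zero flag, in FPSR */
#define FPCR_DZE  (1u << 9)   /* divide-by-zero trap enable, in FPCR */

static void
enable_dz_trap (void)
{
  unsigned int fpcr = __builtin_aarch64_get_fpcr ();
  __builtin_aarch64_set_fpcr (fpcr | FPCR_DZE);
}

static int
dz_flag_raised (void)
{
  return (__builtin_aarch64_get_fpsr () & FPSR_DZC) != 0;
}
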
@iains (Owner) commented Aug 30, 2020

  • presumably, this works for aarch64-linux, and we should be able to look at what that does and work from there?
    (there are cfarm machines, so experiments are possible too)
  • we might expect libc to present interfaces similar to the x86_64 ones? (a sketch of that interface follows)
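
For context: on aarch64-linux, libgfortran takes the fpu-glibc path, which configure.host selects when it finds feenableexcept; that function belongs to the GNU fenv extensions, which macOS's libc does not ship. A rough illustration of that interface (glibc-specific calls, shown only to make the comparison concrete):

/* glibc GNU extensions behind libgfortran's fpu-glibc code path on
   Linux; absent from Darwin's libc, hence this issue.  */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int
main (void)
{
  /* Start trapping on invalid-operation and divide-by-zero.  */
  if (feenableexcept (FE_INVALID | FE_DIVBYZERO) == -1)
    puts ("feenableexcept failed");

  /* Query the set of exceptions that currently trap.  */
  printf ("trap mask: %#x\n", (unsigned) fegetexcept ());

  /* Stop trapping on divide-by-zero again.  */
  fedisableexcept (FE_DIVBYZERO);
  return 0;
}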

@iains (Owner) commented Aug 30, 2020

I wonder if there are any configuration issues too - what does libgfortran think it's being built for?

(certainly, it could be worth running autoreconf in the libgfortran dir - that will ensure that aarch64-darwin is recognised - and it should cause any missing symbols to generate an error instead of being silently deferred to dynamic lookup)

@fxcoudert (Contributor, Author)

It does work for aarch64-linux using the glibc API. Of course we can have a look at how glibc is doing things, and do the same internally (if macOS does not have any API for this). https://github.com/bminor/glibc/blob/5f72f9800b250410cad3abfeeb09469ef12b2438/sysdeps/aarch64/fpu/fraiseexcpt.c
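
(One detail worth preserving from that file: rather than only setting the sticky bits in FPSR, glibc raises most exceptions by executing a floating-point operation whose IEEE semantics produce them, so that an enabled trap actually fires. Conceptually, in portable C99; a sketch, not the glibc code itself:)

/* Raise-by-computing: perform an operation that raises the flag rather
   than writing the FPSR sticky bit directly.  'volatile' keeps the
   division from being folded away at compile time.  */
#include <fenv.h>
#include <stdio.h>

int
main (void)
{
  volatile double zero = 0.0;
  volatile double r;

  feclearexcept (FE_ALL_EXCEPT);
  r = zero / zero;   /* 0.0/0.0 raises FE_INVALID (the result is a NaN) */
  (void) r;
  printf ("invalid raised: %d\n", fetestexcept (FE_INVALID) != 0);
  return 0;
}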

@fxcoudert (Contributor, Author)

I've got a patch for that:

diff --git a/libgfortran/config/fpu-aarch64.h b/libgfortran/config/fpu-aarch64.h
new file mode 100644
index 00000000000..4db1b6c4f6b
--- /dev/null
+++ b/libgfortran/config/fpu-aarch64.h
@@ -0,0 +1,322 @@
+/* FPU-related code for aarch64.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   Contributed by Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
+
+This file is part of the GNU Fortran runtime library (libgfortran).
+
+Libgfortran is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public
+License as published by the Free Software Foundation; either
+version 3 of the License, or (at your option) any later version.
+
+Libgfortran is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+
+/* Rounding mask and modes */
+
+#define FPCR_RM_MASK  0xc00000
+#define FE_TONEAREST  0x000000
+#define FE_UPWARD     0x400000
+#define FE_DOWNWARD   0x800000
+#define FE_TOWARDZERO 0xc00000
+
+/* Exceptions */
+
+#define FE_INVALID	1
+#define FE_DIVBYZERO	2
+#define FE_OVERFLOW	4
+#define FE_UNDERFLOW	8
+#define FE_INEXACT	16
+
+#define FE_ALL_EXCEPT (FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW | FE_INEXACT)
+#define FE_EXCEPT_SHIFT	8
+
+
+
+/* This structure holds the state of the two AArch64 floating-point
+   control/status registers, FPCR and FPSR.  */
+struct fenv
+{
+  unsigned int __fpcr;
+  unsigned int __fpsr;
+};
+
+/* Check we can actually store the FPU state in the allocated size.  */
+_Static_assert (sizeof(struct fenv) <= (size_t) GFC_FPE_STATE_BUFFER_SIZE,
+		"GFC_FPE_STATE_BUFFER_SIZE is too small");
+
+
+
+void
+set_fpu (void)
+{
+  if (options.fpe & GFC_FPE_DENORMAL)
+    estr_write ("Fortran runtime warning: Floating point 'denormal operand' "
+	        "exception not supported.\n");
+
+  set_fpu_trap_exceptions (options.fpe, 0);
+}
+
+
+int
+get_fpu_trap_exceptions (void)
+{
+  unsigned int fpcr, exceptions;
+  int res = 0;
+
+  fpcr = __builtin_aarch64_get_fpcr();
+  exceptions = (fpcr >> FE_EXCEPT_SHIFT) & FE_ALL_EXCEPT;
+
+  if (exceptions & FE_INVALID) res |= GFC_FPE_INVALID;
+  if (exceptions & FE_DIVBYZERO) res |= GFC_FPE_ZERO;
+  if (exceptions & FE_OVERFLOW) res |= GFC_FPE_OVERFLOW;
+  if (exceptions & FE_UNDERFLOW) res |= GFC_FPE_UNDERFLOW;
+  if (exceptions & FE_INEXACT) res |= GFC_FPE_INEXACT;
+
+  return res;
+}
+
+
+void set_fpu_trap_exceptions (int trap, int notrap)
+{
+  unsigned int mode_set = 0, mode_clr = 0;
+  unsigned int fpsr, fpsr_new;
+  unsigned int fpcr, fpcr_new;
+
+  if (trap & GFC_FPE_INVALID)
+    mode_set |= FE_INVALID;
+  if (notrap & GFC_FPE_INVALID)
+    mode_clr |= FE_INVALID;
+
+  if (trap & GFC_FPE_ZERO)
+    mode_set |= FE_DIVBYZERO;
+  if (notrap & GFC_FPE_ZERO)
+    mode_clr |= FE_DIVBYZERO;
+
+  if (trap & GFC_FPE_OVERFLOW)
+    mode_set |= FE_OVERFLOW;
+  if (notrap & GFC_FPE_OVERFLOW)
+    mode_clr |= FE_OVERFLOW;
+
+  if (trap & GFC_FPE_UNDERFLOW)
+    mode_set |= FE_UNDERFLOW;
+  if (notrap & GFC_FPE_UNDERFLOW)
+    mode_clr |= FE_UNDERFLOW;
+
+  if (trap & GFC_FPE_INEXACT)
+    mode_set |= FE_INEXACT;
+  if (notrap & GFC_FPE_INEXACT)
+    mode_clr |= FE_INEXACT;
+
+  /* Clear stalled exception flags.  */
+  fpsr = __builtin_aarch64_get_fpsr();
+  fpsr_new = fpsr & ~FE_ALL_EXCEPT;
+  if (fpsr_new != fpsr)
+    __builtin_aarch64_set_fpsr(fpsr_new);
+
+  fpcr_new = fpcr = __builtin_aarch64_get_fpcr();
+  fpcr_new |= (mode_set << FE_EXCEPT_SHIFT);
+  fpcr_new &= ~(mode_clr << FE_EXCEPT_SHIFT);
+
+  if (fpcr_new != fpcr)
+    __builtin_aarch64_set_fpcr(fpcr_new);
+}
+
+
+int
+support_fpu_flag (int flag)
+{
+  if (flag & GFC_FPE_DENORMAL)
+    return 0;
+
+  return 1;
+}
+
+
+int
+support_fpu_trap (int flag)
+{
+  if (flag & GFC_FPE_DENORMAL)
+    return 0;
+
+  return 1;
+}
+
+
+int
+get_fpu_except_flags (void)
+{
+  int result;
+  unsigned int fpsr;
+
+  result = 0;
+  fpsr = __builtin_aarch64_get_fpsr() & FE_ALL_EXCEPT;
+
+  if (fpsr & FE_INVALID)
+    result |= GFC_FPE_INVALID;
+  if (fpsr & FE_DIVBYZERO)
+    result |= GFC_FPE_ZERO;
+  if (fpsr & FE_OVERFLOW)
+    result |= GFC_FPE_OVERFLOW;
+  if (fpsr & FE_UNDERFLOW)
+    result |= GFC_FPE_UNDERFLOW;
+  if (fpsr & FE_INEXACT)
+    result |= GFC_FPE_INEXACT;
+
+  return result;
+}
+
+
+void
+set_fpu_except_flags (int set, int clear)
+{
+  unsigned int exc_set = 0, exc_clr = 0;
+  unsigned int fpsr, fpsr_new;
+
+  if (set & GFC_FPE_INVALID)
+    exc_set |= FE_INVALID;
+  else if (clear & GFC_FPE_INVALID)
+    exc_clr |= FE_INVALID;
+
+  if (set & GFC_FPE_ZERO)
+    exc_set |= FE_DIVBYZERO;
+  else if (clear & GFC_FPE_ZERO)
+    exc_clr |= FE_DIVBYZERO;
+
+  if (set & GFC_FPE_OVERFLOW)
+    exc_set |= FE_OVERFLOW;
+  else if (clear & GFC_FPE_OVERFLOW)
+    exc_clr |= FE_OVERFLOW;
+
+  if (set & GFC_FPE_UNDERFLOW)
+    exc_set |= FE_UNDERFLOW;
+  else if (clear & GFC_FPE_UNDERFLOW)
+    exc_clr |= FE_UNDERFLOW;
+
+  if (set & GFC_FPE_INEXACT)
+    exc_set |= FE_INEXACT;
+  else if (clear & GFC_FPE_INEXACT)
+    exc_clr |= FE_INEXACT;
+
+  fpsr_new = fpsr = __builtin_aarch64_get_fpsr();
+  fpsr_new &= ~exc_clr;
+  fpsr_new |= exc_set;
+
+  if (fpsr_new != fpsr)
+    __builtin_aarch64_set_fpsr(fpsr_new);
+}
+
+
+void
+get_fpu_state (void *state)
+{
+  struct fenv *envp = state;
+  envp->__fpcr = __builtin_aarch64_get_fpcr();
+  envp->__fpsr = __builtin_aarch64_get_fpsr();
+}
+
+
+void
+set_fpu_state (void *state)
+{
+  struct fenv *envp = state;
+  __builtin_aarch64_set_fpcr(envp->__fpcr);
+  __builtin_aarch64_set_fpsr(envp->__fpsr);
+}
+
+
+int
+get_fpu_rounding_mode (void)
+{
+  unsigned int fpcr = __builtin_aarch64_get_fpcr();
+  fpcr &= FPCR_RM_MASK;
+
+  switch (fpcr)
+    {
+      case FE_TONEAREST:
+        return GFC_FPE_TONEAREST;
+      case FE_UPWARD:
+        return GFC_FPE_UPWARD;
+      case FE_DOWNWARD:
+        return GFC_FPE_DOWNWARD;
+      case FE_TOWARDZERO:
+        return GFC_FPE_TOWARDZERO;
+      default:
+        return 0; /* Should be unreachable.  */
+    }
+}
+
+
+void
+set_fpu_rounding_mode (int round)
+{
+  unsigned int fpcr, round_mode;
+
+  switch (round)
+    {
+    case GFC_FPE_TONEAREST:
+      round_mode = FE_TONEAREST;
+      break;
+    case GFC_FPE_UPWARD:
+      round_mode = FE_UPWARD;
+      break;
+    case GFC_FPE_DOWNWARD:
+      round_mode = FE_DOWNWARD;
+      break;
+    case GFC_FPE_TOWARDZERO:
+      round_mode = FE_TOWARDZERO;
+      break;
+    default:
+      return; /* Should be unreachable.  */
+    }
+
+  fpcr = __builtin_aarch64_get_fpcr();
+
+  /* Only set FPCR if requested mode is different from current.  */
+  round_mode = (fpcr ^ round_mode) & FPCR_RM_MASK;
+  if (round_mode != 0)
+    __builtin_aarch64_set_fpcr(fpcr ^ round_mode);
+}
+
+
+int
+support_fpu_rounding_mode (int mode __attribute__((unused)))
+{
+  return 1;
+}
+
+
+int
+support_fpu_underflow_control (int kind __attribute__((unused)))
+{
+  /* Unsupported */
+  return 0;
+}
+
+
+int
+get_fpu_underflow_mode (void)
+{
+  /* Unsupported */
+  return 0;
+}
+
+
+void
+set_fpu_underflow_mode (int gradual __attribute__((unused)))
+{
+  /* Unsupported */
+}
+
diff --git a/libgfortran/configure.host b/libgfortran/configure.host
index e9d92c9d34d..3d6c2db7772 100644
--- a/libgfortran/configure.host
+++ b/libgfortran/configure.host
@@ -39,17 +39,29 @@ if test "x${have_feenableexcept}" = "xyes"; then
   ieee_support='yes'
 fi
 
-# x86 asm should be used instead of glibc, since glibc doesn't support
-# the x86 denormal exception.
 case "${host_cpu}" in
+
+  # x86 asm should be used instead of glibc, since glibc doesn't support
+  # the x86 denormal exception.
   i?86 | x86_64)
     if test "x${have_soft_float}" = "xyes"; then
       fpu_host='fpu-generic'
+      ieee_support='no'
     else
       fpu_host='fpu-387'
+      ieee_support='yes'
     fi
-    ieee_support='yes'
     ;;
+
+  # use asm on aarch64-darwin
+  aarch64)
+    case "${host_os}" in
+      darwin*)
+        fpu_host='fpu-aarch64'
+        ieee_support='yes'
+        ;;
+    esac
+
 esac
 
 # Some targets require additional compiler options for NaN/Inf.
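
(A quick way to smoke-test a toolchain built with this patch, before running the Fortran IEEE testsuite: the standard C99 rounding-mode interface drives the same FPCR RMode bits that get_fpu_rounding_mode and set_fpu_rounding_mode read and write. A sketch, not part of the patch:)

/* Smoke test: flip the rounding mode and observe its effect on an
   inexact division.  Standard C99 <fenv.h> only.  */
#include <fenv.h>
#include <stdio.h>

int
main (void)
{
  volatile double n = 1.0, d = 3.0;

  fesetround (FE_UPWARD);
  double up = n / d;

  fesetround (FE_DOWNWARD);
  double down = n / d;

  fesetround (FE_TONEAREST);
  /* 1/3 is inexact, so rounding up must give the strictly larger result.  */
  printf ("up > down: %d\n", up > down);
  return 0;
}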

@iains
Copy link
Owner

iains commented Aug 31, 2020

can you do a pull request?
(if not just mail me the patch and I can apply and push it)

iains closed this as completed in #20 on Sep 1, 2020
iains pushed a commit that referenced this issue Sep 5, 2020
This patch moves the move-immediate splitter after the regular ones so
that it has lower precedence, and updates its constraints.

For
int f3 (void) { return 0x11000000; }
int f3_2 (void) { return 0x12345678; }

we now generate:
* with -O2 -mcpu=cortex-m0 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movs    r0, #18
	lsls    r0, r0, #8
	adds    r0, r0, #52
	lsls    r0, r0, #8
	adds    r0, r0, #86
	lsls    r0, r0, #8
	adds    r0, r0, #121
	bx      lr

* with -O2 -mcpu=cortex-m23 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movw    r0, #22136
	movt    r0, 4660
	bx      lr

2020-09-04  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/96769
	gcc/
	* config/arm/thumb1.md: Move movsi splitter for
	arm_disable_literal_pool after the other movsi splitters.

	gcc/testsuite/
	* gcc.target/arm/pure-code/pr96769.c: New test.
iains pushed a commit that referenced this issue Nov 7, 2020
Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
Thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with fewer instructions. We now generate:

128:
        movs    r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17
-8192:
	movs	r3, #1
	lsls	r3, #13
	rsbs	r3, #0

The patch adds a testcase which does not fully exercise
thumb1_gen_const_int, as other existing patterns already catch small
constants.  These parts of thumb1_gen_const_int are used by
arm_thumb1_mi_thunk.

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New
	classes.
	(thumb1_gen_const_int): Rename to ...
	(thumb1_gen_const_int_1): ... New helper function. Add capability
	to emit either RTL or asm, improve generated code.
	(thumb1_gen_const_int_rtl): New function.
	* config/arm/arm-protos.h (thumb1_gen_const_int): Rename to
	thumb1_gen_const_int_rtl.
	* config/arm/thumb1.md: Call thumb1_gen_const_int_rtl instead
	of thumb1_gen_const_int.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m0.c: New.
iains pushed a commit that referenced this issue Jan 31, 2021
This patch adds new movmisalign<mode>_mve_load and store patterns for
MVE to help vectorization. They are very similar to their Neon
counterparts, but use different iterators and instructions.

Indeed MVE supports fewer vector modes than Neon, so we use the
MVE_VLD_ST iterator where Neon uses VQX.

Since the supported modes are different from the ones valid for
arithmetic operators, we introduce two new sets of macros:

ARM_HAVE_NEON_<MODE>_LDST
  true if Neon has vector load/store instructions for <MODE>

ARM_HAVE_<MODE>_LDST
  true if any vector extension has vector load/store instructions for <MODE>

We move the movmisalign<mode> expander from neon.md to vec-common.md, and
replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

The patch also updates the mve-vneg.c test to scan for the better code
generation when loading and storing the vectors involved: it checks
that no 'orr' instruction is generated to cope with misalignment at
runtime.
This test was chosen among the other mve tests, but any other should
be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
test because the compiler chooses to use memcpy.

For instance we now generate:
test_vneg_s32x4:
	vldrw.32       q3, [r1]
	vneg.s32  q3, q3
	vstrw.32       q3, [r0]
	bx      lr

instead of:
test_vneg_s32x4:
	orr     r3, r1, r0
	lsls    r3, r3, #28
	bne     .L15
	vldrw.32	q3, [r1]
	vneg.s32  q3, q3
	vstrw.32	q3, [r0]
	bx      lr
	.L15:
	push    {r4, r5}
	ldrd    r2, r3, [r1, #8]
	ldrd    r5, r4, [r1]
	rsbs    r2, r2, #0
	rsbs    r5, r5, #0
	rsbs    r4, r4, #0
	rsbs    r3, r3, #0
	strd    r5, r4, [r0]
	pop     {r4, r5}
	strd    r2, r3, [r0, #8]
	bx      lr

2021-01-12  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/97875
	gcc/
	* config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
	(ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
	(ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
	(ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
	(ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
	(ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
	(ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
	(ARM_HAVE_NEON_V2DI_LDST): Likewise.
	(ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
	(ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
	(ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST): Likewise.
	(ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST): Likewise.
	(ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST): Likewise.
	(ARM_HAVE_V2DI_LDST): Likewise.
	* config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
	(*movmisalign<mode>_mve_load): New pattern.
	* config/arm/neon.md (movmisalign<mode>): Move to ...
	* config/arm/vec-common.md: ... here.

	PR target/97875
	gcc/testsuite/
	* gcc.target/arm/simd/mve-vneg.c: Update test.
iains pushed a commit that referenced this issue Jun 18, 2021
The fixed error is:

==21166==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x60300000d900
    #0 0x7367d7 in operator delete(void*, unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x3b82e6e in pointer_equiv_analyzer::~pointer_equiv_analyzer() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:161
    #2 0x3b83387 in hybrid_folder::~hybrid_folder() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:517
    #3 0x3b83387 in execute_early_vrp /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:686
    #4 0x1790611 in execute_one_pass(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2567
    #5 0x1792003 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2656
    #6 0x1792029 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2657
    #7 0x179209f in execute_pass_list(function*, opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2667
    #8 0x178a5f3 in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:1773
    #9 0x1792fac in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/plugin.h:191
    #10 0x1792fac in execute_ipa_pass_list(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:3001
    #11 0xc525fc in ipa_passes /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2154
    #12 0xc525fc in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2289
    #13 0xc5a096 in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2269
    #14 0xc5a096 in symbol_table::finalize_compilation_unit() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2537
    #15 0x1a7a17c in compile_file /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:482
    #16 0x69c758 in do_compile /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2210
    #17 0x69c758 in toplev::main(int, char**) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2349
    #18 0x6a932a in main /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/main.c:39
    #19 0x7ffff7820b34 in __libc_start_main ../csu/libc-start.c:332
    #20 0x6aa5fd in _start (/home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/objdir/gcc/cc1+0x6aa5fd)

0x60300000d900 is located 0 bytes inside of 32-byte region [0x60300000d900,0x60300000d920)
allocated by thread T0 here:
    #0 0x735ab7 in operator new[](unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:102
    #1 0x3b82dac in pointer_equiv_analyzer::pointer_equiv_analyzer(gimple_ranger*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:156

gcc/ChangeLog:

	* gimple-ssa-evrp.c (pointer_equiv_analyzer::~pointer_equiv_analyzer): Use delete[].
iains pushed a commit that referenced this issue Jun 18, 2021
This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
vec_pack_trunc_<mode> patterns for MVE.

It does so by moving the unpack patterns from neon.md to
vec-common.md, while extending them to support MVE. The pack expander is
derived from the Neon one (which in turn is renamed into
neon_quad_vec_pack_trunc_<mode>).

The patch introduces mve_vec_unpack<US>_lo_<mode> and
mve_vec_unpack<US>_hi_<mode> which are similar to their Neon
counterparts, except for the assembly syntax.

The patch introduces mve_vec_pack_trunc_lo_<mode> to avoid the need for a
zero-initialized temporary, which is needed if the
vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
instead.

With this patch, we can now vectorize the 16 and 8-bit versions of
vclz and vshl, although the generated code could still be improved.
For test_clz_s16, we now generate
        vldrh.16        q3, [r1]
        vmovlb.s16   q2, q3
        vmovlt.s16   q3, q3
        vclz.i32  q2, q2
        vclz.i32  q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]
which could be improved to
        vldrh.16        q3, [r1]
	vclz.i16	q1, q3
        vstrh.16        q1, [r0]
if we could avoid the need for unpack/pack steps.

For reference, clang-12 generates:
	vldrh.s32       q0, [r1]
	vldrh.s32       q1, [r1, #8]
	vclz.i32        q0, q0
	vstrh.32        q0, [r0]
	vclz.i32        q0, q1
	vstrh.32        q0, [r0, #8]

2021-06-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/mve.md (mve_vec_unpack<US>_lo_<mode>): New pattern.
	(mve_vec_unpack<US>_hi_<mode>): New pattern.
	(@mve_vec_pack_trunc_lo_<mode>): New pattern.
	(mve_vmovntq_<supf><mode>): Prefix with '@'.
	* config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
	vec-common.md.
	(vec_unpack<US>_lo_<mode>): Likewise.
	(vec_pack_trunc_<mode>): Rename to
	neon_quad_vec_pack_trunc_<mode>.
	* config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
	pattern.
	(vec_unpack<US>_lo_<mode>): New.
	(vec_pack_trunc_<mode>): New.

	gcc/testsuite/
	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
	* gcc.target/arm/simd/mve-vshl.c: Likewise.
	* gcc.target/arm/simd/mve-vec-pack.c: New test.
	* gcc.target/arm/simd/mve-vec-unpack.c: New test.
iains pushed a commit that referenced this issue Sep 26, 2021
The current restriction on folding memcpy to a single element of size
MOVE_MAX is excessively cautious on most machines and limits some
significant further optimizations.  So relax the restriction provided
the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
insn exists for moving the value into machine registers.

Note that there were already checks in place for having misaligned
move operations when one or more of the operands were unaligned.

On Arm this now permits optimizing

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

from
        ldr     r2, [r0]        @ unaligned
        sub     sp, sp, #8
        ldr     r3, [r0, #4]    @ unaligned
        strd    r2, [sp]
        ldrd    r0, [sp]
        add     sp, sp, #8

to
        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned

PR target/102125 - (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

gcc/ChangeLog:

	PR target/102125
	* gimple-fold.c (gimple_fold_builtin_memory_op): Allow folding
	memcpy if the size is not more than MOVE_MAX * MOVE_RATIO.
iains pushed a commit that referenced this issue Nov 28, 2021
Fixes:

==129444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000666ca5c at pc 0x000000ef094b bp 0x7fffffff8180 sp 0x7fffffff8178
READ of size 4 at 0x00000666ca5c thread T0
    #0 0xef094a in parse_optimize_options ../../gcc/d/d-attribs.cc:855
    #1 0xef0d36 in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:916
    #2 0xef107e in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:887
    #3 0xff85b1 in decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.c:829
    #4 0xef2a91 in apply_user_attributes(Dsymbol*, tree_node*) ../../gcc/d/d-attribs.cc:427
    #5 0xf7b7f3 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:1346
    #6 0xf87bc7 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:967
    #7 0xf87bc7 in DeclVisitor::visit(FuncDeclaration*) ../../gcc/d/decl.cc:808
    #8 0xf83db5 in DeclVisitor::build_dsymbol(Dsymbol*) ../../gcc/d/decl.cc:146

for the following test-case: gcc/testsuite/gdc.dg/attr_optimize1.d.

gcc/d/ChangeLog:

	* d-attribs.cc (parse_optimize_options): Check index before
	accessing cl_options.
iains pushed a commit that referenced this issue Jan 2, 2022
…imize or target pragmas [PR103012]

The following testcases ICE when an optimize or target pragma
is followed by a long line (4096+ chars).
This is because on such long lines we can't use columns anymore,
but the cpp_define calls performed by c_cpp_builtins_optimize_pragma
or from the backend hooks for target pragma are done on temporary
buffers and expect to get columns from whatever line they appear on
(which happens to be the long line after optimize/target pragma),
and we run into:
 #0  fancy_abort (file=0x3abec67 "../../libcpp/line-map.c", line=502, function=0x3abecfc "linemap_add") at ../../gcc/diagnostic.c:1986
 #1  0x0000000002e7c335 in linemap_add (set=0x7ffff7fca000, reason=LC_RENAME, sysp=0, to_file=0x41287a0 "pr103012.i", to_line=3) at ../../libcpp/line-map.c:502
 #2  0x0000000002e7cc24 in linemap_line_start (set=0x7ffff7fca000, to_line=3, max_column_hint=128) at ../../libcpp/line-map.c:827
 #3  0x0000000002e7ce2b in linemap_position_for_column (set=0x7ffff7fca000, to_column=1) at ../../libcpp/line-map.c:898
 #4  0x0000000002e771f9 in _cpp_lex_direct (pfile=0x40c3b60) at ../../libcpp/lex.c:3592
 #5  0x0000000002e76c3e in _cpp_lex_token (pfile=0x40c3b60) at ../../libcpp/lex.c:3394
 #6  0x0000000002e610ef in lex_macro_node (pfile=0x40c3b60, is_def_or_undef=true) at ../../libcpp/directives.c:601
 #7  0x0000000002e61226 in do_define (pfile=0x40c3b60) at ../../libcpp/directives.c:639
 #8  0x0000000002e610b2 in run_directive (pfile=0x40c3b60, dir_no=0, buf=0x7fffffffd430 "__OPTIMIZE__ 1\n", count=14) at ../../libcpp/directives.c:589
 #9  0x0000000002e650c1 in cpp_define (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2513
 #10 0x0000000002e65100 in cpp_define_unused (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2522
 #11 0x0000000000f50685 in c_cpp_builtins_optimize_pragma (pfile=0x40c3b60, prev_tree=<optimization_node 0x7fffea042000>, cur_tree=<optimization_node 0x7fffea042020>)
     at ../../gcc/c-family/c-cppbuiltin.c:600
which trips linemap_add's assertion that LC_RENAME doesn't happen first.

I think the right fix is emit those predefined macros upon
optimize/target pragmas with BUILTINS_LOCATION, like we already do
for those macros at the start of the TU, they don't appear in columns
of the next line after it.  Another possibility would be to force them
at the location of the pragma.

2021-12-30  Jakub Jelinek  <jakub@redhat.com>

	PR c++/103012
gcc/
	* config/i386/i386-c.c (ix86_pragma_target_parse): Perform
	cpp_define/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
	* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
	* config/aarch64/aarch64-c.c (aarch64_pragma_target_parse): Likewise.
	* config/s390/s390-c.c (s390_pragma_target_parse): Likewise.
gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins_optimize_pragma): Perform
	cpp_define_unused/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
gcc/testsuite/
	PR c++/103012
	* g++.dg/cpp/pr103012.C: New test.
	* g++.target/i386/pr103012.C: New test.
iains pushed a commit that referenced this issue Feb 26, 2022
…04617]

On
 #define A(n) int foo1##n(void) { return 1##n; }
 #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
 #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
 #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
 #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
 E(0) E(1) E(2) D(30) D(31) C(320) C(321) C(322) C(323) C(324) C(325)
 B(3260) B(3261) B(3262) B(3263) A(32640) A(32641) A(32642)
testcase with
./xgcc -B ./ -c -g -fpic -ffat-lto-objects -flto  -O0 -o foo1.o foo1.c -ffunction-sections
./xgcc -B ./ -shared -g -fpic -flto -O0 -o foo1.so foo1.o
/tmp/ccTW8mBm.debug.temp.o: file not recognized: file format not recognized
(testcase too slow to be included into testsuite).
The problem is clearly reported by readelf:
readelf: foo1.o.debug.temp.o: Warning: Section 2 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 5 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 10 has an out of range sh_link value of 65323
readelf: foo1.o.debug.temp.o: Warning: [ 2]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [ 5]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [10]: Link field (65323) should index a string section.
because simple_object_elf_copy_lto_debug_sections doesn't adjust sh_info and
sh_link fields in ElfNN_Shdr if they are in between SHN_{LO,HI}RESERVE
inclusive.  Not adjusting those is incorrect though, SHN_{LO,HI}RESERVE
range is only relevant to the 16-bit fields, mainly st_shndx in ElfNN_Sym
where if one needs >= SHN_LORESERVE section number, SHN_XINDEX should be
used instead and .symtab_shndx section should contain the real section
index, and in ElfNN_Ehdr e_shnum and e_shstrndx fields, where if >=
SHN_LORESERVE value is needed it should put those into
Shdr[0].sh_{size,link}.  But, sh_{link,info} are 32-bit fields which can
contain any section index.

Note, as simple-object-elf.c mentions, binutils from 2.12 to 2.18 (so before
2011) used to mishandle the > 63.75K sections case and assumed there is a
hole in between the sections, but what
simple_object_elf_copy_lto_debug_sections does wouldn't help in that case
for the debug temp object creation, we'd need to detect the case also in
that routine and take it into account in the remapping etc.  I think
it is not worth it given that it is over 10 years, if somebody needs
63.75K or more sections, better use more recent binutils.

2022-02-22  Jakub Jelinek  <jakub@redhat.com>

	PR lto/104617
	* simple-object-elf.c (simple_object_elf_match): Fix up URL
	in comment.
	(simple_object_elf_copy_lto_debug_sections): Remap sh_info and
	sh_link even if they are in the SHN_LORESERVE .. SHN_HIRESERVE
	range (inclusive).
iains pushed a commit that referenced this issue Feb 3, 2023
The aarch64 ISA specification allows a left shift amount to be applied
after extension in the range of 0 to 4 (encoded in the imm3 field).

This is true for at least the following instructions:

 * ADD (extend register)
 * ADDS (extended register)
 * SUB (extended register)

The result of this patch can be seen, when compiling the following code:

uint64_t myadd(uint64_t a, uint64_t b)
{
    return a+(((uint8_t)b)<<4);
}

Without the patch the following sequence will be generated:

0000000000000000 <myadd>:
   0:	d37c1c21 	ubfiz	x1, x1, #4, #8
   4:	8b000020 	add	x0, x1, x0
   8:	d65f03c0 	ret

With the patch the ubfiz will be merged into the add instruction:

0000000000000000 <myadd>:
   0:	8b211000 	add	x0, x0, w1, uxtb #4
   4:	d65f03c0 	ret

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_uxt_size): fix an
	off-by-one in checking the permissible shift-amount.
iains pushed a commit that referenced this issue May 26, 2023
This patch adds support for xstormy16's swpb (swap bytes) and swpw (swap
words) instructions.  The most obvious application of these is to implement
the __builtin_bswap16 and __builtin_bswap32 intrinsics.

Currently, __builtin_bswap16 is implemented as:
foo:    mov r7,r2
        shl r7,#8
        shr r2,#8
        or r2,r7
        ret

but with this patch becomes:
foo:	swpb r2
	ret

Likewise, __builtin_bswap32 now becomes:
foo:	swpb r2 | swpb r3 | swpw r2,r3
        ret

Finally, the swpw instruction on its own can be used to exchange
two word mode registers without a temporary, so a new pattern and
peephole2 have been added to catch this.  As described in the
PR rtl-optimization/106518, register allocation can (in theory)
be more efficient on targets that provide a swap/exchange instruction.
The slightly unusual swap<mode> naming matches that used in i386.md.

2023-04-26  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/stormy16/stormy16.md (bswaphi2): New define_insn.
	(bswapsi2): New define_insn.
	(swaphi): New define_insn to exchange two registers (swpw).
	(define_peephole2): Recognize exchange of registers as swaphi.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/bswap16.c: New test case.
	* gcc.target/xstormy16/bswap32.c: Likewise.
	* gcc.target/xstormy16/swpb.c: Likewise.
	* gcc.target/xstormy16/swpw-1.c: Likewise.
	* gcc.target/xstormy16/swpw-2.c: Likewise.
iains pushed a commit that referenced this issue Aug 13, 2023
This patch is the final piece in the series to improve the ABI issues
affecting PR 88873.  The previous patches tackled inserting DFmode
values into V2DFmode registers, by introducing insvti_{low,high}part
patterns.  This patch improves the extraction of DFmode values from
V2DFmode registers via TImode intermediates.

I'd initially thought this would require new extvti_{low,high}part
patterns to be defined, but all that's required is to recognize that
the SUBREG idioms produced by combine are equivalent to (forms of)
vec_select patterns.  The target-independent middle-end can't be sure
that the appropriate vec_select instruction exists on the target,
hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
but the backend can provide a define_split stating where and when
this is useful, for example, considering whether the operand is in
memory, or whether !TARGET_SSE_MATH and the destination is i387.

For pr88873.c, gcc -O2 -march=cascadelake currently generates:

foo:    vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm7, -24(%rsp)
        vmovdqa %xmm6, %xmm1
        movq    -16(%rsp), %rax
        vpinsrq $1, %rax, %xmm7, %xmm4
        vmovapd %xmm4, %xmm6
        vfmadd132pd     %xmm1, %xmm2, %xmm6
        vmovapd %xmm6, -24(%rsp)
        vmovsd  -16(%rsp), %xmm1
        vmovsd  -24(%rsp), %xmm0
        ret

with this patch, we now generate:

foo:	vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm6, %xmm1
        vfmadd132pd     %xmm7, %xmm2, %xmm1
        vmovsd  %xmm1, %xmm1, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm1
        ret

The improvement is even more dramatic when compared to the original
29 instructions shown in comment #8.  GCC 13, for example, required
12 transfers to/from memory.

2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/sse.md (define_split): Convert highpart:DF extract
	from V2DFmode register into a sse2_storehpd instruction.
	(define_split): Likewise, convert lowpart:DF extract from V2DF
	register into a sse2_storelpd instruction.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr88873.c: Tweak to check for improved code.
fxcoudert pushed a commit to fxcoudert/gcc-darwin-arm64 that referenced this issue Sep 7, 2023
As discussed in PR104167 (comments #8 and below), and PR111238, using
-Wl,-gc-sections in the libstdc++ testsuite for arm-eabi
(cross-toolchain) avoids link failures for a few tests:

27_io/filesystem/path/108636.cc
std/time/clock/gps/1.cc
std/time/clock/gps/io.cc
std/time/clock/tai/1.cc
std/time/clock/tai/io.cc
std/time/clock/utc/1.cc
std/time/clock/utc/io.cc
std/time/clock/utc/leap_second_info.cc
std/time/exceptions.cc
std/time/format.cc
std/time/time_zone/get_info_local.cc
std/time/time_zone/get_info_sys.cc
std/time/tzdb/1.cc
std/time/tzdb/leap_seconds.cc
std/time/tzdb_list/1.cc
std/time/zoned_time/1.cc
std/time/zoned_time/custom.cc
std/time/zoned_time/io.cc
std/time/zoned_traits.cc

This patch achieves this by calling GLIBCXX_CHECK_LINKER_FEATURES in
cross-build cases, like we already do for native builds. We keep not
doing so in Canadian-cross builds.

However, this would hide the fact that libstdc++ somehow forces the
user to use -Wl,-gc-sections to avoid undefined references to chdir,
mkdir, chmod, pathconf, ... so maybe it's better to keep the status
quo and not apply this patch?

2023-08-31  Christophe Lyon  <christophe.lyon@linaro.org>

libstdc++-v3/ChangeLog:

	PR libstdc++/111238
	* configure: Regenerate.
	* configure.ac: Call GLIBCXX_CHECK_LINKER_FEATURES in cross,
	non-Canadian builds.
iains pushed a commit that referenced this issue May 19, 2024
Examining the code generated for the following C snippet on a
raspberry pi:

int popcount_lut8(unsigned *buf, int n)
{
  int cnt=0;
  unsigned int i;
  do {
    i = *buf;
    cnt += lut[i&255];
    cnt += lut[i>>8&255];
    cnt += lut[i>>16&255];
    cnt += lut[i>>24];
    buf++;
  } while(--n);
  return cnt;
}

I was surprised to see the following instruction sequence generated by the
compiler:

  mov    r5, r2, lsr #8
  uxtb   r5, r5

This sequence can be performed by a single ARM instruction:

  uxtb   r5, r2, ror #8

The attached patch allows GCC's combine pass to take advantage of ARM's
uxtb with rotate functionality to implement the above zero_extract, and
likewise to use the sxtb with rotate to implement sign_extract.  ARM's
uxtb and sxtb can only be used with rotates of 0, 8, 16 and 24, and of
these only the 8 and 16 are useful [ror #0 is a nop, and extends with
ror #24 can be implemented using regular shifts],  so the approach here
is to add the six missing but useful instructions as 6 different
define_insn in arm.md, rather than try to be clever with new predicates.

Later ARM hardware has advanced bit field instructions, and earlier
ARM cores didn't support extend-with-rotate, so this appears to only
benefit armv6 era CPUs (e.g. the raspberry pi).

Patch posted:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01339.html
Approved by Kyrill Tkachov:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01881.html

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* config/arm/arm.md (*arm_zeroextractsi2_8_8, *arm_signextractsi2_8_8,
	*arm_zeroextractsi2_8_16, *arm_signextractsi2_8_16,
	*arm_zeroextractsi2_16_8, *arm_signextractsi2_16_8): New.

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* gcc.target/arm/extend-ror.c: New test.