
Clang really shouldn't have ASM tests in it #27189

Closed
rengolin opened this issue Mar 2, 2016 · 19 comments
Labels
bugzilla (Issues migrated from bugzilla), clang (Clang issues not falling into any other category)

Comments

@rengolin
Member

rengolin commented Mar 2, 2016

Bugzilla Link 26815
Resolution FIXED
Resolved on Mar 09, 2016 12:56
Version unspecified
OS Linux
CC @ahmedbougacha,@compnerd,@echristo,@jmolloy,@kbeyls,@RKSimon,@pogo59,@rengolin,@rotateright,@TNorthover,@vedantk

Extended Description

After enough discussion on the mailing list, and what I believe is now the consensus, we should move all ASM tests to LLVM and transform all Clang tests into IR tests.

The few cases that people were arguing about:

  1. NEON intrinsics need to lower to NEON instructions one-to-one.

That's not true. I love that the back-end combines vmul+vadd into vmla, and it should keep doing so. Testing for a sequence of IR instructions (or builtins) instead is no worse.

  2. Inline ASM needs to come out as-is in the asm output.

No, it doesn't. There are a number of cases where we (and GAS) do slight transformations on the asm and output different code (e.g. add r0, -1 -> sub r0, 1). Furthermore, inline asm has a nice IR representation, which we can also test.

  3. Specific code-gen issues.

If C code gets translated into IR and then assembly, we can obviously always find an IR input that will produce the same assembly. To get it, simply use -emit-llvm.

--

Having said that, the move itself will be very tedious, but it is necessary.

Given that most people were complaining about the ARM and AArch64 tests, I think it's only fair that we, ARM folks, share the load into moving things. I'll copy as many people as I can into this bug so we can coordinate our efforts, but I'd like to have a much cleaner Clang test by 3.9.0.

@rotateright
Contributor

rotateright commented Mar 2, 2016

> 1. NEON intrinsics need to lower to NEON instructions one-to-one.

We had a variation on this argument for x86's obscene pile of SSE/AVX intrinsics: in unoptimized code, we thought it was important that the one-to-one intrinsic -> asm mapping should be true - because debugging those things is tough enough already. This is bug 24580.

However, after:
http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20151123/143974.html

...we split those up. I don't think we ever answered if there was a designated place for end-to-end tests of this sort, but based on this recent llvm-dev thread:
http://lists.llvm.org/pipermail/llvm-dev/2016-February/095275.html

maybe test-suite is the right spot?

@rengolin
Member Author

rengolin commented Mar 2, 2016

> We had a variation on this argument for x86's obscene pile of SSE/AVX
> intrinsics: in unoptimized code, we thought it was important that the
> one-to-one intrinsic -> asm mapping should be true - because debugging those
> things is tough enough already. This is bug 24580.

I think this goal is counter-productive.

Imagine the case where the intrinsic is lowered to IR instructions, not builtins, and O0 generates so much wrapper code that the patterns don't match in the back-end.

You have a few not-ideal options:

  1. Fudge the match to recognise more variations, which could reduce the certainty of the semantics.

  2. Change Clang to reduce the wrapper code for intrinsics-to-IR, which would create special cases and increase maintenance.

  3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.

All for the sake of an identity that is not really necessary.

If, for any target, this identity is mandated by the ABI, then Clang must generate builtins for ALL intrinsics, and it's up to the back-end (or a target-specific middle-end pass) to convert them to IR early at -O1 and above.

In that case, just checking that the IR has the builtin you want is then enough for a Clang test.

@rotateright
Contributor

rotateright commented Mar 2, 2016

> 3. Force Clang to emit intrinsics, stopping optimisations down the pipeline.
>
> All for the sake of an identity that is not really necessary.

For x86 at least, I don't think there's any concern about preventing optimizations.

We've mostly settled on doing these transforms in InstCombineCalls.cpp. That lets us write readable/debuggable C++ code to do the transform, ensures that we're doing the transform to IR very early, and doesn't interfere with the programmer's intrinsic-based intent when building at -O0.

There are cases where we transform to native IR instructions in the header files. In addition to the vector programmer's debugging concerns, consider that we have macros in headers that look like this:

/* Vector shuffle */
#define _mm256_shuffle_ps(a, b, mask) __extension__ ({ \
        (__m256)__builtin_shufflevector((__v8sf)(__m256)(a), \
                                        (__v8sf)(__m256)(b), \
                                        (mask) & 0x3, \
                                        ((mask) & 0xc) >> 2, \
                                        (((mask) & 0x30) >> 4) + 8, \
                                        (((mask) & 0xc0) >> 6) + 8, \
                                        ((mask) & 0x3) + 4, \
                                        (((mask) & 0xc) >> 2) + 4, \
                                        (((mask) & 0x30) >> 4) + 12, \
                                        (((mask) & 0xc0) >> 6) + 12); })

Granted, not having an x86-specific builtin for this case is nice... the proliferation of x86 intrinsics is unstoppable. Sane instruction set architectures may not have this problem.
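
As an aside, the index arithmetic in that macro is easy to sanity-check outside the compiler. Here is a small, hypothetical Python sketch (not from the thread; the function name is made up) that evaluates the same mask-bit expressions the macro passes to __builtin_shufflevector:

```python
def mm256_shuffle_ps_indices(mask):
    """Mirror the eight index expressions from the _mm256_shuffle_ps macro.

    Indices 0-7 select from vector a, 8-15 from vector b.
    """
    return [
        mask & 0x3,                 # low lane, element of a
        (mask & 0xc) >> 2,          # low lane, element of a
        ((mask & 0x30) >> 4) + 8,   # low lane, element of b
        ((mask & 0xc0) >> 6) + 8,   # low lane, element of b
        (mask & 0x3) + 4,           # high-lane counterparts of the above
        ((mask & 0xc) >> 2) + 4,
        ((mask & 0x30) >> 4) + 12,
        ((mask & 0xc0) >> 6) + 12,
    ]

# mask 0xE4 (0b11100100) encodes the selectors 0, 1, 2, 3 in order
print(mm256_shuffle_ps_indices(0xE4))  # [0, 1, 10, 11, 4, 5, 14, 15]
```

This makes it clear why a CHECK on the resulting shufflevector mask, rather than on final assembly, is enough to pin down the macro's semantics.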

Some more details about the optimization options for intrinsics are here:
http://reviews.llvm.org/D10555

@TNorthover
Contributor

TNorthover commented Mar 3, 2016

The complaint isn't really about asm tests, but about running LLVM optimisations. And unfortunately, I think that's the more burdensome one to fix. The IR we produce for

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

is:

define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %__p0.addr.i = alloca <2 x float>, align 8
  %__p1.addr.i = alloca <2 x float>, align 8
  %__ret.i = alloca <2 x float>, align 8
  %l.addr = alloca <2 x float>, align 8
  %r.addr = alloca <2 x float>, align 8
  store <2 x float> %l, <2 x float>* %l.addr, align 8
  store <2 x float> %r, <2 x float>* %r.addr, align 8
  %0 = load <2 x float>, <2 x float>* %l.addr, align 8
  %1 = load <2 x float>, <2 x float>* %r.addr, align 8
  store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
  store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
  %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
  %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
  %add.i = fadd <2 x float> %2, %3
  store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
  %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
  ret <2 x float> %4
}

Testing that code, or even working out whether it's correct by eye, is nontrivial.

On the other hand, breaking clang tests with an LLVM change can't be fun. I wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd expect it to clean up the generated NEON code massively and it's a very stable pass with a well-defined job.
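
To make the compromise concrete: mem2reg should be able to forward all of those stores/loads and delete the allocas, leaving roughly the following (hand-written here for illustration; actual `opt -mem2reg` output may differ in details):

```llvm
define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
entry:
  %add.i = fadd <2 x float> %l, %r
  ret <2 x float> %add.i
}
```

That version is both readable and trivially checkable with a couple of CHECK lines, while still exercising Clang's intrinsic lowering.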

@TNorthover
Contributor

TNorthover commented Mar 3, 2016

Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with the CodeGen clang actually produces, using llvm-lit [[VARS]]: not ideal, but certainly no worse than what we've got now given that it already passes).

Other than that, a related question we might want to consider is just how thorough we want these NEON tests to be. I suspect they currently have large but not complete coverage of arm_neon.h, and they take a fairly ridiculous amount of time to execute.

Would a more orthogonal subset be better, especially now that we have the emperor tests in the test-suite? Not sure. And not sure how we'd decide which variants are worth testing either.

We can probably delay answering those questions to later though.

@TNorthover
Contributor

TNorthover commented Mar 3, 2016

Oh, and I also thoroughly agree with Renato that we make absolutely no guarantees about intrinsics mapping to instructions in the ARM world. arm_neon.h describes semantics, not the assembly code output.

@rengolin
Member Author

rengolin commented Mar 3, 2016

There's also the question of how much we can really do on LLVM's side. Whenever Clang changes to support new intrinsics, so does LLVM.

Meaning that the same Clang+LLVM revision is known to work and emit the expected assembly, but not necessarily an old Clang (or an LLVM test produced with an old Clang) against a new LLVM revision.

By removing the tests from Clang, we may inadvertently cause the same effect in LLVM's side, by having old IR patterns not match because of some new optimisation pass.

In order to solve this in a reasonable fashion, the best approach is to auto-generate LLVM tests from Clang's "current" output. This can be done with a mix of the original C code, a pass to IR, and some RUN/CHECK lines concatenated at the beginning/end.

Even though this is only valid for a small class of problems, it's probably the class that will generate most grief in LLVM testing. The rest will hopefully be more stable.

I recommend that the first step be to duplicate the tests on the LLVM side and see how they end up. As we accept the results, we start removing the tests on Clang's side, until we're left with only the hairy cases in Clang.

@llvmbot
Collaborator

llvmbot commented Mar 3, 2016

> Given how many tests we do now, I'll see if I can whip up a script to auto-convert the existing ones (i.e. replace the asm checks with the CodeGen clang actually produces, using llvm-lit [[VARS]]: not ideal, but certainly no worse than what we've got now given that it already passes).

Half-baked idea: this seems like something that'd be more generally useful than a one-off script. Maybe it's worth teaching FileCheck more about llvm's textual IR to make it do this transformation semi-automagically (under some new flag, of course):

// RUN: %cc1 %s -emit-llvm -o - | FileCheck --auto-ir-vars %s

float32x2_t res(float32x2_t l, float32x2_t r) {
  return vadd_f32(l, r);
}

// CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
// CHECK-NEXT: entry:
// CHECK-DAG: %__p0.addr.i = alloca <2 x float>, align 8
// CHECK-DAG: %__p1.addr.i = alloca <2 x float>, align 8
// CHECK-DAG: %__ret.i = alloca <2 x float>, align 8

</hand waving>

such that the %vars get automatically turned into the appropriate name-agnostic [[VAR:%[a-zA-Z_][a-zA-Z_0-9]+]] thing for defs, and [[VAR]] for uses.
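
The def/use rewriting itself is mechanical. A rough standalone Python sketch of what such a flag might do internally (hypothetical; the real feature would presumably live inside FileCheck itself):

```python
import re

# A def line: optional indent, an SSA name, "=", and the rest of the line.
DEF_RE = re.compile(r'^(\s*)(%[\w.]+)\s*=\s*(.*)$')

def to_filecheck(ir_lines):
    """Rewrite SSA value names into FileCheck pattern variables:
    defs become [[VAR:...]] captures, later uses become [[VAR]]."""
    seen = {}   # IR name -> FileCheck variable name
    out = []
    for line in ir_lines:
        # Substitute uses of already-defined values first (longest names
        # first, so %__p0.addr.i is not clobbered by a shorter prefix).
        for name in sorted(seen, key=len, reverse=True):
            line = re.sub(re.escape(name) + r'(?![\w.])',
                          '[[%s]]' % seen[name], line)
        m = DEF_RE.match(line)
        if m:
            indent, name, rest = m.groups()
            var = 'V' + name[1:].replace('.', '_').upper()
            seen[name] = var
            line = '%s[[%s:%%[a-zA-Z0-9_.]+]] = %s' % (indent, var, rest)
        out.append('// CHECK: ' + line.strip())
    return out

for check in to_filecheck([
        "%__ret.i = alloca <2 x float>, align 8",
        "store <2 x float> %add.i, <2 x float>* %__ret.i, align 8"]):
    print(check)
```

This ignores plenty of real-world complications (function arguments, basic-block labels, metadata), but it shows the core transformation is a small amount of code.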

@rotateright
Contributor

rotateright commented Mar 3, 2016

> The complaint isn't really about asm tests, but about running LLVM
> optimisations. And unfortunately, I think that's the more burdensome one to
> fix. The IR we produce for
>
> float32x2_t res(float32x2_t l, float32x2_t r) {
>   return vadd_f32(l, r);
> }
>
> is:
>
> define <2 x float> @res(<2 x float> %l, <2 x float> %r) #0 {
> entry:
>   %__p0.addr.i = alloca <2 x float>, align 8
>   %__p1.addr.i = alloca <2 x float>, align 8
>   %__ret.i = alloca <2 x float>, align 8
>   %l.addr = alloca <2 x float>, align 8
>   %r.addr = alloca <2 x float>, align 8
>   store <2 x float> %l, <2 x float>* %l.addr, align 8
>   store <2 x float> %r, <2 x float>* %r.addr, align 8
>   %0 = load <2 x float>, <2 x float>* %l.addr, align 8
>   %1 = load <2 x float>, <2 x float>* %r.addr, align 8
>   store <2 x float> %0, <2 x float>* %__p0.addr.i, align 8
>   store <2 x float> %1, <2 x float>* %__p1.addr.i, align 8
>   %2 = load <2 x float>, <2 x float>* %__p0.addr.i, align 8
>   %3 = load <2 x float>, <2 x float>* %__p1.addr.i, align 8
>   %add.i = fadd <2 x float> %2, %3
>   store <2 x float> %add.i, <2 x float>* %__ret.i, align 8
>   %4 = load <2 x float>, <2 x float>* %__ret.i, align 8
>   ret <2 x float> %4
> }
>
> Testing that code, or even working out whether it's correct by eye, is
> nontrivial.
>
> On the other hand, breaking clang tests with an LLVM change can't be fun. I
> wonder if just running "opt -mem2reg" would be an acceptable compromise: I'd
> expect it to clean up the generated NEON code massively and it's a very
> stable pass with a well-defined job.

Is it necessary to have complete checking for this kind of case? I think we just need one label and one check line to know that clang has done the intended job of translating the intrinsic to IR:

; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
; CHECK: fadd <2 x float>

We can assume that the rest of the IR is correct and covered by tests elsewhere? (Surprisingly, I don't see any x86 equivalents for these kinds of tests...)

@rengolin
Member Author

rengolin commented Mar 3, 2016

> Is it necessary to have complete checking for this kind of case? I think we
> just need one label and one check line to know that clang has done the
> intended job of translating the intrinsic to IR:
>
> ; CHECK-LABEL: define <2 x float> @res(<2 x float> %l, <2 x float> %r)
> ; CHECK: fadd <2 x float>

Unfortunately, that doesn't cover everything.

For silly semantics (vadd_f32 -> fadd float) that should be enough, but others, like zip, vldN and the ones that rely on shuffles, casts, promotions and truncations, will all need a strict sequence of IR instructions with the correct dependencies between them.

However, having CHECK lines for just the sequence of instructions will be "enough" for most purposes, so we may only need full-blown variable usage for the few complex cases...
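
For instance, for an interleaving intrinsic like vzip_s32, a dataflow-aware test would need to pin both shufflevector results and their masks; something along these lines (a hand-written illustration, not actual generated checks):

```llvm
; CHECK-LABEL: @test_vzip_s32(
; CHECK: [[ZIP1:%.*]] = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 0, i32 2>
; CHECK: [[ZIP2:%.*]] = shufflevector <2 x i32> %a, <2 x i32> %b, <2 x i32> <i32 1, i32 3>
```

Here the masks carry the whole semantic content, so a bare "CHECK: shufflevector" would not catch a swapped or wrong-order interleave.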

@rotateright
Contributor

rotateright commented Mar 3, 2016

> For silly semantics (vadd_f32 -> fadd float) should be enough, but others,
> like zip, vldN and the ones that rely on shuffles, casts, promotions and
> truncations, will all need a strict sequence of IR instructions with the
> correct dependency between them.
>
> However, having CHECK for just the sequence of instructions will be "enough"
> for most purposes, though, so we may only need full-blown variable usage for
> the few complex cases...

Agreed. It looks like Altivec is tested quite efficiently this way:
clang/test/CodeGen/builtins-ppc-altivec.c

@ahmedbougacha
Member

ahmedbougacha commented Mar 3, 2016

What's wrong with a very strict matching of the entire (unoptimized) IR sequence (à la update_llc_test_checks.py)?

We assume they're correct today so in theory we don't even need to look at the current IR.

We don't expect these to change often, and when they do, updating the output and git-diff ought to be enough. I think I'd want to know if some unrelated IRGen commit changed the output (it's not unrelated anymore, is it?)

Yes, it's some extra work when something does change. But if we provide an easy one-line script to update the checks, the change author can either ignore it (the current status quo), or make sure they didn't break anything (strictly better than the status quo).

@TNorthover
Contributor

TNorthover commented Mar 3, 2016

I'd much prefer to check dataflow in all cases. It's very easy to end up with "fadd %a, %a" or get the subtraction the wrong way round (we did that originally, in fact, and it took us 2 years to fix properly because people had started relying on the obviously-wrong operand order).

@TNorthover
Contributor

TNorthover commented Mar 3, 2016

I still think having readable tests is more important than wanting to completely isolate Clang from LLVM's mid-end.

@rotateright
Contributor

rotateright commented Mar 3, 2016

> What's wrong with a very strict matching of the entire (unoptimized) IR
> sequence (à la update_llc_test_checks.py)?
>
> We assume they're correct today so in theory we don't even need to look at
> the current IR.
>
> We don't expect these to change often, and when they do, updating the output
> and git-diff ought to be enough. I think I'd want to know if some unrelated
> IRGen commit changed the output (it's not unrelated anymore, is it?)
>
> Yes, it's some extra work when something does change. But if we provide an
> easy one-line script to update the checks, the change author can either
> ignore it (the current status quo), or make sure they didn't break anything
> (strictly better than the status quo).

Tim mentioned in comment 5 that these tests may be taking more time than they're worth. But an "update_clang_test_checks.py" and "update_opt_test_checks.py" would be extremely useful in general, and hopefully that's where we're headed. :)

@rengolin
Member Author

rengolin commented Mar 3, 2016

> I'd much prefer to check dataflow in all cases. It's very easy to end up
> with "fadd %a, %a" or get the subtraction the wrong way round (we did that
> originally, in fact, and it took us 2 years to fix properly because people
> had started relying on the obviously-wrong operand order).

I'm not against it, just trying to be pragmatic.

We don't have perfect tests today, so we don't need perfect tests to replace them. But if people are happy to make them better at the same time as moving, I'm game.

@compnerd
Member

compnerd commented Mar 3, 2016

> I still think having readable tests is more important than wanting to
> completely isolate Clang from LLVM's mid-end.

I fully agree with you here. The thing is that builtins are slightly more involved for this exact reason: they expect to be able to bore through the frontend and hook into the mid or back ends. I believe a similar argument has been made in favor of intrinsics over builtins in the past. As already pointed out (by you!), the dataflow analysis here is relevant and important, since it is possible to break things subtly otherwise. I think the readability of the tests directly impacts that.

@TNorthover
Contributor

TNorthover commented Mar 9, 2016

I've put a patch up for review at http://reviews.llvm.org/D17999.

@TNorthover
Contributor

TNorthover commented Mar 9, 2016

I think this has been addressed by r263048.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.