Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@min(@ctz(x), y) can become @ctz(x | (1 << y)) #90000

Closed
Validark opened this issue Apr 24, 2024 · 13 comments · Fixed by #90402
Closed

@min(@ctz(x), y) can become @ctz(x | (1 << y)) #90000

Validark opened this issue Apr 24, 2024 · 13 comments · Fixed by #90402
Assignees
Labels
good first issue https://github.com/llvm/llvm-project/contribute llvm:instcombine missed-optimization

Comments

@Validark
Copy link

Validark commented Apr 24, 2024

Godbolt link

const y = 6;

export fn bounded_tzcnt(x: u16) u8 {
    return @min(@ctz(x), y);
}

export fn bounded_tzcnt_better(x: u16) u8 {
    return @ctz(x | (1 << y));
}

export fn bounded_lzcnt(x: u16) u8 {
    return @min(@clz(x), y);
}

export fn bounded_lzcnt_better(x: u16) u8 {
    return @clz(x | (1 << 16 >> y));
}
bounded_tzcnt:
        or      edi, 65536
        mov     eax, 6
        tzcnt   ecx, edi
        cmp     cl, 6
        cmovb   eax, ecx
        ret

bounded_tzcnt_better:
        or      edi, 64
        tzcnt   eax, edi
        ret

bounded_lzcnt:
        lzcnt   cx, di
        mov     eax, 6
        cmp     cl, 6
        cmovb   eax, ecx
        ret

bounded_lzcnt_better:
        or      edi, 1024
        lzcnt   ax, di
        ret
@Validark Validark changed the title @min(@ctz(x), y); can become @ctz(x | (1 << y)) @min(@ctz(x), y) can become @ctz(x | (1 << y)) Apr 24, 2024
@RKSimon
Copy link
Collaborator

RKSimon commented Apr 26, 2024

Proof for the tzcnt case: https://alive2.llvm.org/ce/z/zUH_Ny

---------------------------------------
define i16 @src_bounded_tzcnt(i16 %a0, i16 %a1) {
Entry:
  %cmp = icmp ule i16 %a1, 15
  assume i1 %cmp
  %tz = cttz i16 %a0, 0
  %r = umin i16 %tz, %a1
  ret i16 %r
}
=>
define i16 @tgt_bounded_tzcnt(i16 %a0, i16 %a1) {
Entry:
  %bit = shl i16 1, %a1
  %or = or i16 %a0, %bit
  %tz = cttz i16 %or, 0
  ret i16 %tz
}
Transformation seems to be correct!

@RKSimon
Copy link
Collaborator

RKSimon commented Apr 26, 2024

The lzcnt version has a typo, afaict it should be:

export fn bounded_lzcnt_better(x: u16) u8 {
    return @clz(x | ((1 << 15) >> y));
}

Proof: https://alive2.llvm.org/ce/z/yb4r54

----------------------------------------
define i16 @src_bounded_lzcnt(i16 %a0, i16 %a1) {
#0:
  %cmp = icmp ule i16 %a1, 15
  assume i1 %cmp
  %tz = ctlz i16 %a0, 0
  %r = umin i16 %tz, %a1
  ret i16 %r
}
=>
define i16 @tgt_bounded_lzcnt(i16 %a0, i16 %a1) {
#0:
  %bit = lshr i16 32768, %a1
  %or = or i16 %a0, %bit
  %tz = ctlz i16 %or, 0
  ret i16 %tz
}
Transformation seems to be correct!

@Validark
Copy link
Author

Good catch on that one. And thanks for looking into this!

@RKSimon
Copy link
Collaborator

RKSimon commented Apr 26, 2024

Did you see this in real world code or was this from fuzzing/testing?

@dtcxzyw
Copy link
Member

dtcxzyw commented Apr 26, 2024

Did you see this in real world code or was this from fuzzing/testing?

For cttz + umin:

; 72 occurrences:
; mitsuba3/optimized/rastack.cpp.ll
; oiio/optimized/CineonHeader.cpp.ll
; oiio/optimized/argparse.cpp.ll
; oiio/optimized/benchmark.cpp.ll
; oiio/optimized/bmpinput.cpp.ll
; oiio/optimized/bmpoutput.cpp.ll
; oiio/optimized/cineoninput.cpp.ll
; oiio/optimized/color_ocio.cpp.ll
; oiio/optimized/ddsinput.cpp.ll
; oiio/optimized/dpxinput.cpp.ll
; oiio/optimized/dpxoutput.cpp.ll
; oiio/optimized/environment.cpp.ll
; oiio/optimized/errorhandler.cpp.ll
; oiio/optimized/exrinput.cpp.ll
; oiio/optimized/exroutput.cpp.ll
; oiio/optimized/filesystem.cpp.ll
; oiio/optimized/fitsinput.cpp.ll
; oiio/optimized/fitsoutput.cpp.ll
; oiio/optimized/formatspec.cpp.ll
; oiio/optimized/hdrinput.cpp.ll
; oiio/optimized/hdroutput.cpp.ll
; oiio/optimized/icc.cpp.ll
; oiio/optimized/icoinput.cpp.ll
; oiio/optimized/icooutput.cpp.ll
; oiio/optimized/iffinput.cpp.ll
; oiio/optimized/iffoutput.cpp.ll
; oiio/optimized/imagebuf.cpp.ll
; oiio/optimized/imagebufalgo.cpp.ll
; oiio/optimized/imagebufalgo_addsub.cpp.ll
; oiio/optimized/imagebufalgo_channels.cpp.ll
; oiio/optimized/imagebufalgo_compare.cpp.ll
; oiio/optimized/imagebufalgo_copy.cpp.ll
; oiio/optimized/imagebufalgo_deep.cpp.ll
; oiio/optimized/imagebufalgo_draw.cpp.ll
; oiio/optimized/imagebufalgo_mad.cpp.ll
; oiio/optimized/imagebufalgo_minmaxchan.cpp.ll
; oiio/optimized/imagebufalgo_muldiv.cpp.ll
; oiio/optimized/imagebufalgo_opencv.cpp.ll
; oiio/optimized/imagebufalgo_orient.cpp.ll
; oiio/optimized/imagebufalgo_pixelmath.cpp.ll
; oiio/optimized/imagebufalgo_xform.cpp.ll
; oiio/optimized/imagecache.cpp.ll
; oiio/optimized/imageinput.cpp.ll
; oiio/optimized/imageio.cpp.ll
; oiio/optimized/imageioplugin.cpp.ll
; oiio/optimized/imageoutput.cpp.ll
; oiio/optimized/jpeginput.cpp.ll
; oiio/optimized/jpegoutput.cpp.ll
; oiio/optimized/maketexture.cpp.ll
; oiio/optimized/paramlist.cpp.ll
; oiio/optimized/pnginput.cpp.ll
; oiio/optimized/pngoutput.cpp.ll
; oiio/optimized/pnmoutput.cpp.ll
; oiio/optimized/printinfo.cpp.ll
; oiio/optimized/psdinput.cpp.ll
; oiio/optimized/rlainput.cpp.ll
; oiio/optimized/rlaoutput.cpp.ll
; oiio/optimized/sgiinput.cpp.ll
; oiio/optimized/sgioutput.cpp.ll
; oiio/optimized/softimageinput.cpp.ll
; oiio/optimized/strutil.cpp.ll
; oiio/optimized/sysutil.cpp.ll
; oiio/optimized/targainput.cpp.ll
; oiio/optimized/targaoutput.cpp.ll
; oiio/optimized/termoutput.cpp.ll
; oiio/optimized/texture3d.cpp.ll
; oiio/optimized/texturesys.cpp.ll
; oiio/optimized/tiffinput.cpp.ll
; oiio/optimized/tiffoutput.cpp.ll
; oiio/optimized/typedesc.cpp.ll
; oiio/optimized/xmp.cpp.ll
; oiio/optimized/zfile.cpp.ll
; Function Attrs: nounwind
define i8 @func0000000000000002(i8 %0) #0 {
entry:
  %1 = tail call i8 @llvm.cttz.i8(i8 %0, i1 true), !range !0
  %2 = tail call i8 @llvm.umin.i8(i8 %1, i8 6)
  ret i8 %2
}

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i8 @llvm.cttz.i8(i8, i1 immarg) #1

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i8 @llvm.umin.i8(i8, i8) #1

; 2 occurrences:
; image-rs/optimized/2ndzmzcdt55acj4k.ll
; qemu/optimized/accel_tcg_user-exec.c.ll
; Function Attrs: nounwind
define i32 @func0000000000000000(i32 %0) #0 {
entry:
  %1 = tail call i32 @llvm.cttz.i32(i32 %0, i1 false), !range !1
  %2 = tail call i32 @llvm.umin.i32(i32 %1, i32 4)
  ret i32 %2
}

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.cttz.i32(i32, i1 immarg) #1

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.umin.i32(i32, i32) #1

attributes #0 = { nounwind }
attributes #1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }

!0 = !{i8 0, i8 9}
!1 = !{i32 0, i32 33}

For ctlz + umin, this pattern doesn't exist in my benchmark.

@Validark
Copy link
Author

Did you see this in real world code or was this from fuzzing/testing?

When I saw a u16::cttz call turn into an OR+TZCNT I had the idea that you can fold a umin into that OR and tested whether LLVM knew about that yet. Specifically:

        or      edi, 65536
        tzcnt   ecx, edi

So yes, for me it was a theoretical optimization.

@RKSimon RKSimon added the good first issue https://github.com/llvm/llvm-project/contribute label Apr 27, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 27, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
  2. In the comments of this issue, request for it to be assigned to you, or just create a pull request after following the steps below. Mention this issue in the description of the pull request.
  3. Fix the issue locally.
  4. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  5. Create a Git commit.
  6. Run git clang-format HEAD~1 to format your changes.
  7. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Copy link
Collaborator

llvmbot commented Apr 27, 2024

@llvm/issue-subscribers-good-first-issue

Author: Niles Salter (Validark)

[Godbolt link](https://zig.godbolt.org/z/fzMo9jYPK)
const y = 6;

export fn bounded_tzcnt(x: u16) u8 {
    return @<!-- -->min(@<!-- -->ctz(x), y);
}

export fn bounded_tzcnt_better(x: u16) u8 {
    return @<!-- -->ctz(x | (1 &lt;&lt; y));
}

export fn bounded_lzcnt(x: u16) u8 {
    return @<!-- -->min(@<!-- -->clz(x), y);
}

export fn bounded_lzcnt_better(x: u16) u8 {
    return @<!-- -->clz(x | (1 &lt;&lt; 16 &gt;&gt; y));
}
bounded_tzcnt:
        or      edi, 65536
        mov     eax, 6
        tzcnt   ecx, edi
        cmp     cl, 6
        cmovb   eax, ecx
        ret

bounded_tzcnt_better:
        or      edi, 64
        tzcnt   eax, edi
        ret

bounded_lzcnt:
        lzcnt   cx, di
        mov     eax, 6
        cmp     cl, 6
        cmovb   eax, ecx
        ret

bounded_lzcnt_better:
        or      edi, 1024
        lzcnt   ax, di
        ret

@mskamp
Copy link
Contributor

mskamp commented Apr 27, 2024

Hi, I'd like to work on this one if it's still available.

@dtcxzyw
Copy link
Member

dtcxzyw commented Apr 27, 2024

Hi, I'd like to work on this one if it's still available.

Please read https://llvm.org/docs/InstCombineContributorGuide.html before submitting your first patch :)

@mskamp
Copy link
Contributor

mskamp commented Apr 28, 2024

Two questions occurred to me:

  1. Among the expressions matched by umin(cttz(x), y), it might happen that y = cttz(z). In this case, the proposed translation would yield cttz(x | (1 << cttz(z))). This could be simplified further with 1 << cttz(z) = z & -z, which LLVM does not seem to do at the moment, though. But the “obvious” translation would be cttz(x | z). Is it acceptable to handle this case specifically?
  2. If a target has a fast operation to compute the minimum (as x86_64 with SSE4.2 has for integer vectors), the general transformation might not be desirable (e.g., consider https://godbolt.org/z/dz9jG8Ev5) unless the other operand is a constant and the expression can be folded (as is the case in the examples in the previous comments). What is the canonical way of dealing with such architecture-dependent transformations?

Another option would be to restrict the transformation only to cases with a constant second operand, which would resolve the above questions neatly.

@dtcxzyw
Copy link
Member

dtcxzyw commented Apr 28, 2024

Two questions occurred to me:

  1. Among the expressions matched by umin(cttz(x), y), it might happen that y = cttz(z). In this case, the proposed translation would yield cttz(x | (1 << cttz(z))). This could be simplified further with 1 << cttz(z) = z & -z, which LLVM does not seem to do at the moment, though. But the “obvious” translation would be cttz(x | z). Is it acceptable to handle this case specifically?

Feel free to file another PR if you find that this pattern exists in some real-world applications. Unfortunately it doesn't exist in my benchmark :(

  1. If a target has a fast operation to compute the minimum (as x86_64 with SSE4.2 has for integer vectors), the general transformation might not be desirable (e.g., consider https://godbolt.org/z/dz9jG8Ev5) unless the other operand is a constant and the expression can be folded (as is the case in the examples in the previous comments). What is the canonical way of dealing with such architecture-dependent transformations?

Another option would be to restrict the transformation only to cases with a constant second operand, which would resolve the above questions neatly.

Yeah, we only fold umin(cttz(X), C).

@RKSimon
Copy link
Collaborator

RKSimon commented Apr 28, 2024

https://alive2.llvm.org/ce/z/on8IIK suggests 1 << cttz(z) = z & -z is already folded by instcombine

mskamp added a commit to mskamp/llvm-project that referenced this issue Apr 28, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/xRZTE7
mskamp added a commit to mskamp/llvm-project that referenced this issue May 1, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/7BQLBe
mskamp added a commit to mskamp/llvm-project that referenced this issue May 4, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jun 21, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jun 21, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jun 22, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jun 29, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jul 1, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jul 2, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
mskamp added a commit to mskamp/llvm-project that referenced this issue Jul 5, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.

The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.

Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8
@nikic nikic closed this as completed in 949bbdc Jul 13, 2024
aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this issue Jul 14, 2024
)

The new transformation folds `umin(cttz(x), c)` to `cttz(x | (1 << c))`
and `umin(ctlz(x), c)` to `ctlz(x | ((1 << (bitwidth - 1)) >> c))`. The
transformation is only implemented for constant `c` to not increase the
number of instructions.
    
The idea of the transformation is to set the c-th lowest (for `cttz`) or
highest (for `ctlz`) bit in the operand. In this way, the `cttz` or
`ctlz` instruction always returns at most `c`.
    
Alive2 proofs: https://alive2.llvm.org/ce/z/y8Hdb8

Fixes llvm#90000
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue https://github.com/llvm/llvm-project/contribute llvm:instcombine missed-optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants