
The polynomial approximation for f16 math.powf generates NAN and INF #15936

Open · hanhanW opened this issue Dec 15, 2023 · 10 comments
Labels: codegen/llvm (LLVM code generation compiler backend), codegen (Shared code generation infrastructure and dialects)

@hanhanW (Contributor) commented Dec 15, 2023

Coming from #15661 (comment), we observed that there is a bug in the PolynomialApproximation pass. I landed a workaround that rewrites the f16 approximations to use f32 intermediates. Filing this new issue to track it.
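
For context, a minimal sketch of the before/after IR the workaround produces (the op spelling here is illustrative; the actual pattern lives in ExpandF16OpToF32Pass):

// Before: the f16 op whose polynomial approximation misbehaves.
%r = math.powf %lhs, %rhs : f16

// After: extend to f32, approximate there, truncate only the final result.
%lhs32 = arith.extf %lhs : f16 to f32
%rhs32 = arith.extf %rhs : f16 to f32
%p32   = math.powf %lhs32, %rhs32 : f32
%r     = arith.truncf %p32 : f32 to f16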

To repro:

#map = affine_map<(d0) -> (d0)>
module {
  func.func @main(%arg0: tensor<32xf16>) -> tensor<32xf16> {
    %cst = arith.constant 1.000000e+04 : f16
    %cst_0 = arith.constant 0.000000e+00 : f16
    %cst_1 = arith.constant 1.000000e+00 : f16
    %0 = tensor.empty() : tensor<32xf16>
    %1 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%arg0 : tensor<32xf16>) outs(%0 : tensor<32xf16>) {
    ^bb0(%in: f16, %out: f16):
      %2 = math.powf %cst, %in : f16
      linalg.yield %2 : f16
    } -> tensor<32xf16>
    return %1 : tensor<32xf16>
  }
}

Compile to vmfb: iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu ~/repro.mlir -o /tmp/a.vmfb

Run the module: iree-run-module --device=local-sync --module=/tmp/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"

❯ build/tools/iree-run-module --device=local-sync --module=/tmp/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"
EXEC @main
result[0]: hal.buffer_view
32xf16=-NAN INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF
hanhanW added the codegen (Shared code generation infrastructure and dialects) and codegen/llvm (LLVM code generation compiler backend) labels on Dec 15, 2023
@harrisonGPU

Hello @rsuderman, could you give me an opportunity to solve this issue? I may not do well initially, but I will try my best to learn and complete it!

@rsuderman (Contributor)

> Hello @rsuderman, could you give me an opportunity to solve this issue? I may not do well initially, but I will try my best to learn and complete it!

Sounds great! I will tag you as an assignee.

@harrisonGPU

Hello @hanhanW,
When I use your repro example, I encounter an Illegal instruction problem:

./iree-run-module --device=local-sync --module=$INPUT/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"
EXEC @main
Illegal instruction (core dumped)

I used the commands below to build IREE:

cmake -G Ninja -B ../iree-build/ -S . \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DIREE_ENABLE_ASSERTIONS=ON \
    -DIREE_ENABLE_SPLIT_DWARF=ON \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DIREE_ENABLE_LLD=ON \
    -DIREE_ENABLE_ASAN=ON
cmake --build ../iree-build/

My iree-llvmcpu-target-triple is:

root@1ef1351cbf8e:~# gcc -dumpmachine
x86_64-linux-gnu

I used the repro.mlir file you provided.

Then I ran the commands below to reproduce the problem, setting --iree-llvmcpu-target-triple=x86_64-linux-gnu:

root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools#./iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-target-triple=x86_64-linux-gnu $INPUT/repro.mlir -o $INPUT/a.vmfb
root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools#./iree-run-module --device=local-sync --module=$INPUT/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"
EXEC @main
Illegal instruction (core dumped)

I also tried --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu, but I still hit the same problem:

root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools# ./iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu $INPUT/repro.mlir -o $INPUT/a.vmfb
root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools# ./iree-run-module --device=local-sync --module=$INPUT/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"
EXEC @main
Illegal instruction (core dumped)

Could you please give me some suggestions on how to reproduce this issue? I would really appreciate it!

@pzread (Contributor) commented Dec 17, 2023

I think your host machine might not support AVX512 (required by --iree-llvmcpu-target-cpu=cascadelake). Can you try with --iree-llvmcpu-target-cpu=host?

@harrisonGPU commented Dec 17, 2023

> I think your host machine might not support AVX512 (required by --iree-llvmcpu-target-cpu=cascadelake). Can you try with --iree-llvmcpu-target-cpu=host?

Hello @pzread, I am really thankful for your reply! Based on your answer, I have resolved the problem and managed to reproduce the issue. I really appreciate it!

root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools# ./iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=host --iree-llvmcpu-target-triple=x86_64-linux-gnu $INPUT/repro.mlir -o $INPUT/a.vmfb
root@1ef1351cbf8e:/home/Projects/IREE/iree-build/tools# ./iree-run-module --device=local-sync --module=$INPUT/a.vmfb --function=main --input=32xf16="0 0.03125 0.0625 0.09375 0.125 0.15625 0.1875 0.21875 0.25 0.28125 0.3125 0.34375 0.375 0.40625 0.4375 0.46875 0.5 0.53125 0.5625 0.59375 0.625 0.65625 0.6875 0.71875 0.75 0.78125 0.8125 0.84375 0.875 0.90625 0.9375 0.96875"
EXEC @main
result[0]: hal.buffer_view
32xf16=-NAN INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF INF

@harrisonGPU

Hello @hanhanW @rsuderman, I am honored to have the opportunity to contribute to the solution for this problem, and I have a good understanding of the issue at hand. As a newcomer, I will first explain my current understanding and then pose a few questions.

Context

This problem originates from:
https://github.com/llvm/llvm-project/blob/2a9d8caf29ca2b2cf4758db31c64fd20cb5eb3bf/mlir/lib/Dialect/Math/Transforms/ExpandPatterns.cpp#L165C1-L192C2

// Converts  Powf(float a, float b) (meaning a^b) to exp^(b * ln(a))
static LogicalResult convertPowfOp(math::PowFOp op, PatternRewriter &rewriter) {
  ImplicitLocOpBuilder b(op->getLoc(), rewriter);
  Value operandA = op.getOperand(0);
  Value operandB = op.getOperand(1);
  Type opType = operandA.getType();
  Value zero = createFloatConst(op->getLoc(), opType, 0.00, rewriter);
  Value two = createFloatConst(op->getLoc(), opType, 2.00, rewriter);
  Value negOne = createFloatConst(op->getLoc(), opType, -1.00, rewriter);
  Value opASquared = b.create<arith::MulFOp>(opType, operandA, operandA);
  Value opBHalf = b.create<arith::DivFOp>(opType, operandB, two);

  Value logA = b.create<math::LogOp>(opType, opASquared);
  Value mult = b.create<arith::MulFOp>(opType, opBHalf, logA);
  Value expResult = b.create<math::ExpOp>(opType, mult);
  Value negExpResult = b.create<arith::MulFOp>(opType, expResult, negOne);
  Value remainder = b.create<arith::RemFOp>(opType, operandB, two);
  Value negCheck =
      b.create<arith::CmpFOp>(arith::CmpFPredicate::OLT, operandA, zero);
  Value oddPower =
      b.create<arith::CmpFOp>(arith::CmpFPredicate::ONE, remainder, zero);
  Value oddAndNeg = b.create<arith::AndIOp>(op->getLoc(), oddPower, negCheck);

  Value res = b.create<arith::SelectOp>(op->getLoc(), oddAndNeg, negExpResult,
                                        expResult);
  rewriter.replaceOp(op, res);
  return success();
}
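
In math terms, the code above implements the following restatement of a^b (the squaring makes the log argument non-negative, and the final select restores the sign for a negative base raised to an odd power):

$$
a^b = \exp\!\left(\frac{b}{2}\,\ln(a^2)\right) \cdot
\begin{cases}
-1 & \text{if } a < 0 \text{ and } b \bmod 2 \neq 0,\\
\phantom{-}1 & \text{otherwise.}
\end{cases}
$$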

When a math::PowFOp (a^b) is converted into this equivalent form using natural logarithms and exponentiation, the result can produce INF (infinity) or NaN (not a number). I analyzed this part of the code and found two cases that could cause errors (see the worked trace after this list):

  1. When operandA is 0 and operandB is negative:
    • logA computes log(0), which returns -INF under IEEE floating-point semantics.
    • mult is then (operandB / 2) * -INF, which is +INF for a negative operandB, so the subsequent exp yields INF. (For a positive operandB the product is -INF and exp correctly underflows to 0; for operandB == 0 the product is 0 * -INF = NaN.)
  2. When operandA (or the intermediate opASquared) is a very large positive number, the computation overflows. In f16 this happens early, since the maximum finite f16 value is 65504 and even modest bases overflow once squared.
    So, for issue [CPU] Add support for converting math.powf from fp16 to fp32 #15927, ExpandF16OpToF32Pattern is used to expand f16 to f32 in order to avoid the overflow problem.
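
To make case 2 concrete, here is a hand-annotated trace of the expanded ops for the repro (base %a = 1.0e4 : f16; the values in the comments follow IEEE f16 arithmetic):

%c2    = arith.constant 2.0 : f16
%sq    = arith.mulf %a, %a : f16         // 1.0e8 > 65504, overflows f16 -> INF
%logA  = math.log %sq : f16              // log(INF) = INF
%bHalf = arith.divf %b, %c2 : f16        // b / 2
%mul   = arith.mulf %bHalf, %logA : f16  // b/2 * INF = INF for b > 0, 0 * INF = NaN for b = 0
%exp   = math.exp %mul : f16             // exp(INF) = INF, exp(NaN) = NaN

This matches the observed output: NaN (printed as -NAN) for the first element, whose exponent is 0, and INF for every positive exponent. Note that the true results all fit in f16 (the largest, 1.0e4^0.96875, is about 7.5e3), so the INFs come purely from squaring the base in f16.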

Proposal

If we remove this workaround:
https://github.com/openxla/iree/blob/0842feb3346233512b772cf001011a1dd6dbf67c/compiler/src/iree/compiler/Codegen/LLVMCPU/ExpandF16OpToF32Pass.cpp#L65
we need to implement the same method in convertPowfOp itself. Therefore, I would like to know if I may modify the code in that file. Specifically, I want to add targeted conditions: for instance, if both operandA and operandB are very large positive numbers, I would expand f16 to f32; alternatively, if operandA is 0 and operandB is negative, I would handle that particular case. A sketch of the widened expansion is shown below.
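
For illustration, a minimal sketch of what a widened expansion could emit for %r = math.powf %a, %b : f16 (this is my proposal rather than what convertPowfOp currently does; the sign handling for negative bases is omitted for brevity):

%a32  = arith.extf %a : f16 to f32
%b32  = arith.extf %b : f16 to f32
%c2   = arith.constant 2.0 : f32
%sq   = arith.mulf %a32, %a32 : f32     // 1.0e8 is comfortably representable in f32
%logA = math.log %sq : f32
%half = arith.divf %b32, %c2 : f32
%mul  = arith.mulf %half, %logA : f32
%exp  = math.exp %mul : f32
%r    = arith.truncf %exp : f32 to f16  // narrow only the final result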

I would greatly appreciate any advice and guidance you can offer. I am truly thankful and especially eager to contribute to the open-source community.

@rsuderman (Contributor)

> I would greatly appreciate any advice and guidance you can offer. I am truly thankful and especially eager to contribute to the open-source community.

So the overall problem is that the fp16 approximation has enough inaccuracy in the fp calculations that it ends up generating a 0 value where the true result is not actually zero. This is pretty common when doing polynomial approximations with limited bit depth. If we find a workaround that allows us to maintain the computation in fp16, we can remove the ExpandF16OpToF32Pattern you mention. That expansion exists specifically because the numerical approximation we currently have must be performed in f32 to maintain accuracy.

My main suggestion is to look into approximation mechanisms for pow, exp, and ln that work on half-precision values. We can then include a separate tool for approximating these types.
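
For example, one possible direction (just a sketch, untested, and for positive bases only; the odd/negative-base select from convertPowfOp would still be needed) is to reformulate powf via base-2 ops, which keeps intermediates in f16 range because log2 of any finite nonzero f16 lies in roughly [-24, 16]:

%l2 = math.log2 %a : f16      // no overflow: |log2(a)| <= ~24 for finite nonzero f16
%m  = arith.mulf %b, %l2 : f16
%r  = math.exp2 %m : f16      // overflows only when a^b itself overflows f16

The accuracy of the f16 log2/exp2 approximations would still need to be evaluated; this only removes the avoidable overflow from squaring the base.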

@harrisonGPU

> My main suggestion is to look into approximation mechanisms for pow, exp, and ln that work on half-precision values. We can then include a separate tool for approximating these types.

Hello @rsuderman, thank you for your guidance and advice. I will look into approximation mechanisms for pow, exp, and ln that work with half-precision values and think about how to solve this issue. Have a nice day!

@hanhanW (Contributor, Author) commented Mar 1, 2024

@pashu123 this is one of the issues I mentioned to you. I'll create a new one for the other one.

@pashu123 (Contributor) commented Mar 4, 2024

> @pashu123 this is one of the issues I mentioned to you. I'll create a new one for the other one.

Thanks @hanhanW. I'll take it from here.
