Hexagon HVX implementation of F32 vbinary #6386

ejparkqc · 2024-05-07T22:55:20Z

We needed to adjust XNN_ALLOCATION_ALIGNMENT to 128 Byte for HVX and use predicate store for the tail part. Ctest passed but performance is not good yet. The next step naturally becomes the performance improvements.

scripts/generate-f32-vbinary.sh

src/f32-vbinary/gen/f32-vmax-hvx-u64.c

fbarchard

Easy! Looks good overall.

fbarchard · 2024-05-09T00:09:29Z

src/f32-vbinary/gen/f32-vmax-hvx-u64.c

+    const float* input_a,
+    const float* input_b,
+    float* output,
+    const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)]) XNN_OOB_READS


Could you mask the remainder read so you can remove OOB_READs? (Out of Bounds)
HVX has a large over read, and the masked remainder wont normally be a performance bottleneck... the main loop will be. On AVX512, RiSCV and SVE I expect we'll move towards asan friendly reads using predicates.

scripts/generate-f32-vbinary.sh

fbarchard

I would ask you how fast it is, but we seem to be missing benchmarks for vbinary.

I'm adding x86 kernels for FP16, which you could refer to for what needs to be done.
#6394

For benchmark I used a real model and tflite benchmark_model with perf record/report to measure improvement:

Was AVX with VCVT
7.15% xnn_f16_vmulc_minmax_ukernel__f16c_u16
Now AVX512 FP16
2.94% xnn_f16_rmax_ukernel__avx512fp16_u128_acc4

But I'll try to make a benchmark we can use for F16 and F32 vbinary

ejparkqc · 2024-05-23T01:40:49Z

I would ask you how fast it is, but we seem to be missing benchmarks for vbinary.

I'm adding x86 kernels for FP16, which you could refer to for what needs to be done. #6394

For benchmark I used a real model and tflite benchmark_model with perf record/report to measure improvement:

Was AVX with VCVT 7.15% xnn_f16_vmulc_minmax_ukernel__f16c_u16 Now AVX512 FP16 2.94% xnn_f16_rmax_ukernel__avx512fp16_u128_acc4

But I'll try to make a benchmark we can use for F16 and F32 vbinary

The current PR should work with unaligned load as well without changing tester. I will work on the related issue now.

dsharlet · 2024-05-23T20:25:39Z

This PR still contains the 10k+ line additions of microkernels.bzl and microkernels.cmake. You need to revert those changes, and I also think you're missing the microkernels_hexagon.bzl changes. You shouldn't have to make these changes manually, I think you need to run:

python3 tools/update-microkernels.py -a

ejparkqc · 2024-05-23T21:03:33Z

This PR still contains the 10k+ line additions of microkernels.bzl and microkernels.cmake. You need to revert those changes, and I also think you're missing the microkernels_hexagon.bzl changes. You shouldn't have to make these changes manually, I think you need to run:

python3 tools/update-microkernels.py -a

Yeap, I realized that I modified the generated file which is clearly saying "Do not edit" :) I am working on this.
Also, I will revert the changes for microkernels.bzl and microkernels.cmake.

One related question: We currently have 'hexagon' in ISA_LIST in update-microkernels.py but coming kernels having 'hvx' as isa_name instead of 'hexagon'. cs16-vsquareabs has hexagon kernels using scalar vectors instead of HVX vectors, so renaming cs16-vsqaureabs kernels with including 'hvx' seems not right.

The simplest solution is to have both 'hvx' and 'hexagon' in the ISA_LIST and make related changes. This will generate kernels separately for HVX and Hexagon (non-HVX). If I add 'hexagon' in ARCH_LIST, it tries to generate asm_kernels and jit_kernels, so I want avoid this for now. Do you think this is okay?

dsharlet

Thanks, this looks good!

fbarchard · 2024-05-28T21:37:13Z

BUILD.bazel

@@ -9,7 +9,7 @@
 load("@bazel_skylib//:bzl_library.bzl", "bzl_library")
 load("@bazel_skylib//lib:selects.bzl", "selects")  # buildifier: disable=out-of-order-load
 load(":build_defs.bzl", "xnnpack_aggregate_library", "xnnpack_cc_library", "xnnpack_gcc_std_copts", "xnnpack_min_size_copts", "xnnpack_msvc_std_copts", "xnnpack_slinky_defines", "xnnpack_slinky_deps", "xnnpack_slinky_srcs", "xnnpack_std_cxxopts", "xnnpack_visibility")
-load("//gen:microkernels.bzl", "AARCH32_ASM_MICROKERNEL_SRCS", "AARCH32_JIT_MICROKERNEL_SRCS", "AARCH64_ASM_MICROKERNEL_SRCS", "AARCH64_JIT_MICROKERNEL_SRCS", "ALL_ARMSIMD32_MICROKERNEL_SRCS", "ALL_AVX2_MICROKERNEL_SRCS", "ALL_AVX512AMX_MICROKERNEL_SRCS", "ALL_AVX512FP16_MICROKERNEL_SRCS", "ALL_AVX512F_MICROKERNEL_SRCS", "ALL_AVX512SKX_MICROKERNEL_SRCS", "ALL_AVX512VBMI_MICROKERNEL_SRCS", "ALL_AVX512VNNIGFNI_MICROKERNEL_SRCS", "ALL_AVX512VNNI_MICROKERNEL_SRCS", "ALL_AVXVNNI_MICROKERNEL_SRCS", "ALL_AVX_MICROKERNEL_SRCS", "ALL_F16C_MICROKERNEL_SRCS", "ALL_FMA3_MICROKERNEL_SRCS", "ALL_FMA_MICROKERNEL_SRCS", "ALL_FP16ARITH_MICROKERNEL_SRCS", "ALL_HEXAGON_MICROKERNEL_SRCS", "ALL_NEONBF16_AARCH64_MICROKERNEL_SRCS", "ALL_NEONBF16_MICROKERNEL_SRCS", "ALL_NEONDOTFP16ARITH_MICROKERNEL_SRCS", "ALL_NEONDOT_AARCH64_MICROKERNEL_SRCS", "ALL_NEONDOT_MICROKERNEL_SRCS", "ALL_NEONFMA_AARCH64_MICROKERNEL_SRCS", "ALL_NEONFMA_MICROKERNEL_SRCS", "ALL_NEONFP16ARITH_AARCH64_MICROKERNEL_SRCS", "ALL_NEONFP16ARITH_MICROKERNEL_SRCS", "ALL_NEONFP16_MICROKERNEL_SRCS", "ALL_NEONI8MM_MICROKERNEL_SRCS", "ALL_NEONV8_MICROKERNEL_SRCS", "ALL_NEON_AARCH64_MICROKERNEL_SRCS", "ALL_NEON_MICROKERNEL_SRCS", "ALL_RVVFP16ARITH_MICROKERNEL_SRCS", "ALL_RVV_MICROKERNEL_SRCS", "ALL_SCALAR_MICROKERNEL_SRCS", "ALL_SSE2_MICROKERNEL_SRCS", "ALL_SSE41_MICROKERNEL_SRCS", "ALL_SSE_MICROKERNEL_SRCS", "ALL_SSSE3_MICROKERNEL_SRCS", "ALL_WASMRELAXEDSIMD_MICROKERNEL_SRCS", "ALL_WASMSIMD_MICROKERNEL_SRCS", "ALL_WASM_MICROKERNEL_SRCS", "WASM32_ASM_MICROKERNEL_SRCS", "WASM32_JIT_MICROKERNEL_SRCS", "WASMRELAXEDSIMD32_JIT_MICROKERNEL_SRCS", "WASMSIMD32_JIT_MICROKERNEL_SRCS")
+load("//gen:microkernels.bzl", "AARCH32_ASM_MICROKERNEL_SRCS", "AARCH32_JIT_MICROKERNEL_SRCS", "AARCH64_ASM_MICROKERNEL_SRCS", "AARCH64_JIT_MICROKERNEL_SRCS", "ALL_ARMSIMD32_MICROKERNEL_SRCS", "ALL_AVX2_MICROKERNEL_SRCS", "ALL_AVX512AMX_MICROKERNEL_SRCS", "ALL_AVX512FP16_MICROKERNEL_SRCS", "ALL_AVX512F_MICROKERNEL_SRCS", "ALL_AVX512SKX_MICROKERNEL_SRCS", "ALL_AVX512VBMI_MICROKERNEL_SRCS", "ALL_AVX512VNNIGFNI_MICROKERNEL_SRCS", "ALL_AVX512VNNI_MICROKERNEL_SRCS", "ALL_AVXVNNI_MICROKERNEL_SRCS", "ALL_AVX_MICROKERNEL_SRCS", "ALL_F16C_MICROKERNEL_SRCS", "ALL_FMA3_MICROKERNEL_SRCS", "ALL_FMA_MICROKERNEL_SRCS", "ALL_FP16ARITH_MICROKERNEL_SRCS", "ALL_HEXAGON_MICROKERNEL_SRCS", "ALL_HVX_MICROKERNEL_SRCS", "ALL_NEONBF16_AARCH64_MICROKERNEL_SRCS", "ALL_NEONBF16_MICROKERNEL_SRCS", "ALL_NEONDOTFP16ARITH_MICROKERNEL_SRCS", "ALL_NEONDOT_AARCH64_MICROKERNEL_SRCS", "ALL_NEONDOT_MICROKERNEL_SRCS", "ALL_NEONFMA_AARCH64_MICROKERNEL_SRCS", "ALL_NEONFMA_MICROKERNEL_SRCS", "ALL_NEONFP16ARITH_AARCH64_MICROKERNEL_SRCS", "ALL_NEONFP16ARITH_MICROKERNEL_SRCS", "ALL_NEONFP16_MICROKERNEL_SRCS", "ALL_NEONI8MM_MICROKERNEL_SRCS", "ALL_NEONV8_MICROKERNEL_SRCS", "ALL_NEON_AARCH64_MICROKERNEL_SRCS", "ALL_NEON_MICROKERNEL_SRCS", "ALL_RVVFP16ARITH_MICROKERNEL_SRCS", "ALL_RVV_MICROKERNEL_SRCS", "ALL_SCALAR_MICROKERNEL_SRCS", "ALL_SSE2_MICROKERNEL_SRCS", "ALL_SSE41_MICROKERNEL_SRCS", "ALL_SSE_MICROKERNEL_SRCS", "ALL_SSSE3_MICROKERNEL_SRCS", "ALL_WASMRELAXEDSIMD_MICROKERNEL_SRCS", "ALL_WASMSIMD_MICROKERNEL_SRCS", "ALL_WASM_MICROKERNEL_SRCS", "WASM32_ASM_MICROKERNEL_SRCS", "WASM32_JIT_MICROKERNEL_SRCS", "WASMRELAXEDSIMD32_JIT_MICROKERNEL_SRCS", "WASMSIMD32_JIT_MICROKERNEL_SRCS")


this shouldnt change in your CL... its likely a BUILD file change we committed and you need to rebase

It's not because of a missing rebase, it's because of https://github.com/google/XNNPACK/pull/6386/files#diff-3b5d59a3985ecd23f243252f1697e786fc5beec50a965c8acd0189155ce89c63. I think it's an OK change to make. Subsequent HVX kernel changes would not have this diff.

IIRC we suggested making this change at some point (these kernels are HVX, not general hexagon kernels)

fbarchard · 2024-05-28T21:37:30Z

BUILD.bazel

@@ -203,7 +203,7 @@ JIT_WASMRELAXEDSIMD32_SRCS = JIT_WASMSIMD32_SRCS + JIT_WASMRELAXEDSIMD32_COMMONS

 filegroup(
    name = "microkernel_source_files",
-    data = ALL_NEON_AARCH64_MICROKERNEL_SRCS + ALL_NEONBF16_AARCH64_MICROKERNEL_SRCS + ALL_NEONDOT_AARCH64_MICROKERNEL_SRCS + ALL_NEONFMA_AARCH64_MICROKERNEL_SRCS + ALL_NEONFP16ARITH_AARCH64_MICROKERNEL_SRCS + ALL_ARMSIMD32_MICROKERNEL_SRCS + ALL_AVX_MICROKERNEL_SRCS + ALL_AVXVNNI_MICROKERNEL_SRCS + ALL_AVX2_MICROKERNEL_SRCS + ALL_AVX512F_MICROKERNEL_SRCS + ALL_AVX512SKX_MICROKERNEL_SRCS + ALL_AVX512VBMI_MICROKERNEL_SRCS + ALL_AVX512VNNI_MICROKERNEL_SRCS + ALL_AVX512VNNIGFNI_MICROKERNEL_SRCS + ALL_AVX512AMX_MICROKERNEL_SRCS + ALL_AVX512FP16_MICROKERNEL_SRCS + ALL_F16C_MICROKERNEL_SRCS + ALL_FMA_MICROKERNEL_SRCS + ALL_FMA3_MICROKERNEL_SRCS + ALL_FP16ARITH_MICROKERNEL_SRCS + ALL_HEXAGON_MICROKERNEL_SRCS + ALL_NEON_MICROKERNEL_SRCS + ALL_NEONBF16_MICROKERNEL_SRCS + ALL_NEONDOT_MICROKERNEL_SRCS + ALL_NEONDOTFP16ARITH_MICROKERNEL_SRCS + ALL_NEONFMA_MICROKERNEL_SRCS + ALL_NEONFP16_MICROKERNEL_SRCS + ALL_NEONFP16ARITH_MICROKERNEL_SRCS + ALL_NEONV8_MICROKERNEL_SRCS + ALL_RVV_MICROKERNEL_SRCS + ALL_RVVFP16ARITH_MICROKERNEL_SRCS + ALL_SCALAR_MICROKERNEL_SRCS + ALL_SSE_MICROKERNEL_SRCS + ALL_SSE2_MICROKERNEL_SRCS + ALL_SSE41_MICROKERNEL_SRCS + ALL_SSSE3_MICROKERNEL_SRCS + ALL_WASM_MICROKERNEL_SRCS + ALL_WASMRELAXEDSIMD_MICROKERNEL_SRCS + ALL_WASMSIMD_MICROKERNEL_SRCS + AARCH32_ASM_MICROKERNEL_SRCS + AARCH64_ASM_MICROKERNEL_SRCS + WASM32_ASM_MICROKERNEL_SRCS + ["src/microparams-init.c"],
+    data = ALL_NEON_AARCH64_MICROKERNEL_SRCS + ALL_NEONBF16_AARCH64_MICROKERNEL_SRCS + ALL_NEONDOT_AARCH64_MICROKERNEL_SRCS + ALL_NEONFMA_AARCH64_MICROKERNEL_SRCS + ALL_NEONFP16ARITH_AARCH64_MICROKERNEL_SRCS + ALL_ARMSIMD32_MICROKERNEL_SRCS + ALL_AVX_MICROKERNEL_SRCS + ALL_AVXVNNI_MICROKERNEL_SRCS + ALL_AVX2_MICROKERNEL_SRCS + ALL_AVX512F_MICROKERNEL_SRCS + ALL_AVX512SKX_MICROKERNEL_SRCS + ALL_AVX512VBMI_MICROKERNEL_SRCS + ALL_AVX512VNNI_MICROKERNEL_SRCS + ALL_AVX512VNNIGFNI_MICROKERNEL_SRCS + ALL_AVX512AMX_MICROKERNEL_SRCS + ALL_AVX512FP16_MICROKERNEL_SRCS + ALL_F16C_MICROKERNEL_SRCS + ALL_FMA_MICROKERNEL_SRCS + ALL_FMA3_MICROKERNEL_SRCS + ALL_FP16ARITH_MICROKERNEL_SRCS + ALL_HEXAGON_MICROKERNEL_SRCS + ALL_HVX_MICROKERNEL_SRCS + ALL_NEON_MICROKERNEL_SRCS + ALL_NEONBF16_MICROKERNEL_SRCS + ALL_NEONDOT_MICROKERNEL_SRCS + ALL_NEONDOTFP16ARITH_MICROKERNEL_SRCS + ALL_NEONFMA_MICROKERNEL_SRCS + ALL_NEONFP16_MICROKERNEL_SRCS + ALL_NEONFP16ARITH_MICROKERNEL_SRCS + ALL_NEONV8_MICROKERNEL_SRCS + ALL_RVV_MICROKERNEL_SRCS + ALL_RVVFP16ARITH_MICROKERNEL_SRCS + ALL_SCALAR_MICROKERNEL_SRCS + ALL_SSE_MICROKERNEL_SRCS + ALL_SSE2_MICROKERNEL_SRCS + ALL_SSE41_MICROKERNEL_SRCS + ALL_SSSE3_MICROKERNEL_SRCS + ALL_WASM_MICROKERNEL_SRCS + ALL_WASMRELAXEDSIMD_MICROKERNEL_SRCS + ALL_WASMSIMD_MICROKERNEL_SRCS + AARCH32_ASM_MICROKERNEL_SRCS + AARCH64_ASM_MICROKERNEL_SRCS + WASM32_ASM_MICROKERNEL_SRCS + ["src/microparams-init.c"],


ditto.. revert this change

As we want to have regular hexagon kernels and hvx kernels separately, I think I can leave like this, but please let me know if I have to revert this change as well!

fbarchard · 2024-05-28T21:37:48Z

CMakeLists.txt

@@ -586,6 +586,7 @@ IF(XNNPACK_TARGET_PROCESSOR MATCHES "^x86(_64)?$")
 ENDIF()
 IF(XNNPACK_TARGET_PROCESSOR MATCHES "Hexagon")
  LIST(APPEND ALL_MICROKERNEL_SRCS ${ALL_HEXAGON_MICROKERNEL_SRCS})
+  LIST(APPEND ALL_MICROKERNEL_SRCS ${ALL_HVX_MICROKERNEL_SRCS})


ditto... revert

Same as above, I will leave as it is to separate hvx and regular hexagon kernels, but please let me know if this should be reverted.

fbarchard · 2024-05-28T21:40:16Z

src/f32-vbinary/gen/f32-vmax-hvx-u64.c

+  assert(output != NULL);
+
+
+  const HVX_UVector *vptr_a = (const HVX_UVector *) input_a;


UVector is unaligned? Does it hurt performance much?

UVector is unaligned? Does it hurt performance much?

UVector is aligned with 4 Byte, and I actually found the performance got better for some cases with simulator (e.g., when we have the number of elements that fills up 128 bytes completely). We will find more accurate performance later with actual device.

fbarchard · 2024-05-28T21:44:37Z

scripts/generate-f32-vbinary.sh

@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF  -D BATCH_TILE=32 -D
 tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB      -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c &
 tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB      -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c &

+################################### HEXAGON HVX ##################################
+tools/xngen src/f32-vbinary/vop-hvx.c.in -D OP=ADD     -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vadd-minmax-hvx-u32.c &
+tools/xngen src/f32-vbinary/vop-hvx.c.in -D OP=ADD     -D BATCH_TILE=64 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vadd-minmax-hvx-u64.c &


v73 hvx has 4 vector units? So it may be good to generate 1, 2 and 4 vector variations

I just added the the microkernels using 4 vector units. I tried with simulator v68 vs. v73, but they took exact same time. This is the results coming from f32-vadd-minmax:

[----------] 9 tests from F32_VADD_MINMAX__HVX_U32 (37 ms total)
[----------] 9 tests from F32_VADD_MINMAX__HVX_U64 (96 ms total)
[----------] 9 tests from F32_VADD_MINMAX__HVX_U128 (282 ms total)

I think that is measuring the time for the simulator to run, not the simulated time it would take to run the kernel. I wouldn't make any performance judgements based on that.

That said, I wouldn't be surprised if clang was unrolling it for you anyways, and so these kernels are probably generating the same code.

fbarchard

Needs rebasing.

gen/microkernels.bzl

ejparkqc · 2024-05-29T06:19:53Z

I think the changes I made for bazel build fails. If the build bot fails again, I will revert all the changes for bazel build for now.

dsharlet · 2024-05-30T00:39:47Z

test/f32-vmax.cc

@@ -1316,6 +1316,183 @@
 #endif  // XNN_ARCH_WASM || XNN_ARCH_WASMSIMD || XNN_ARCH_WASMRELAXEDSIMD


+TEST(F32_VMAX__HVX_U32, batch_eq_32) {


This is failing internal presubmits because this needs to be wrapped in an #if XNN_ARCH_HEXAGON or similar, because it's trying to run these tests on inappropriate architectures.

(Same with the other tests too)

Thank you Dillon! I just fixed the error and generated the proper #if macro for HVX kernels.

Clearly the new change did not fix the error. Do you see any new errors from build bot result? I tried to build locally for X86 and checked if any build/ctest fails for f32-vbinary but seems all fine.

Actually it looks like you did fix the errors.

This failure is different, and unrelated to your change. You don't need to make any changes to fix it. The system tries to ignore pre-existing failures like this when testing new changes, but it's not always able to do that. I'll check with this change in a bit and see if I can make it retry the internal tests.

…d other related code changes

…diff

- Performance is not good yet. - Change the beginning of the hexagon generation part from ARM NEON to HEXAGON HVX in scripts/generate-f32-vbinary.sh - Following the namining rules, needed to change from hexagon to hvx in several locations.

- Removing white space - Follow the new location to list the Hexagon kernels - Revert cmake/microkernels.cmake andmicrokernels.bzl. Modify update-microkernels.py to have hvx in isa_list as well.

- Add gen/hvx_microkernels.bzl - Add ALL_HVX_MICROKERNEL_SRCS into hvx source lists.

…t the same code we use in production to the maximum extent possible. FUTURE_COPYBARA_INTEGRATE_REVIEW=#6386 from ejparkqc:f32-vbinary 52cc208 PiperOrigin-RevId: 637301291

ejparkqc changed the title ~~F32 vbinary~~ Hexagon HVX implementation of F32 vbinary May 7, 2024

dsharlet reviewed May 7, 2024

View reviewed changes

scripts/generate-f32-vbinary.sh Outdated Show resolved Hide resolved

scripts/generate-f32-vbinary.sh Outdated Show resolved Hide resolved

fbarchard reviewed May 9, 2024

View reviewed changes

src/f32-vbinary/gen/f32-vmax-hvx-u64.c Show resolved Hide resolved

fbarchard reviewed May 9, 2024

View reviewed changes

scripts/generate-f32-vbinary.sh Show resolved Hide resolved

fbarchard reviewed May 9, 2024

View reviewed changes

ejparkqc force-pushed the f32-vbinary branch from 7abb76a to a58660a Compare May 23, 2024 01:31

ejparkqc force-pushed the f32-vbinary branch 2 times, most recently from cd84b0a to 579fb80 Compare May 23, 2024 20:17

ejparkqc force-pushed the f32-vbinary branch from 579fb80 to 746d6e3 Compare May 23, 2024 20:26

ejparkqc force-pushed the f32-vbinary branch from d9aa1aa to c169adc Compare May 24, 2024 16:13

dsharlet approved these changes May 24, 2024

View reviewed changes

ejparkqc force-pushed the f32-vbinary branch from 3b4220a to 9ec077a Compare May 28, 2024 21:28

fbarchard reviewed May 28, 2024

View reviewed changes

fbarchard suggested changes May 28, 2024

View reviewed changes

ejparkqc force-pushed the f32-vbinary branch 8 times, most recently from 9c818c3 to b154496 Compare May 29, 2024 04:03

ejparkqc force-pushed the f32-vbinary branch 5 times, most recently from 88e54a7 to aaefc4b Compare May 29, 2024 04:16

alankelly reviewed May 29, 2024

View reviewed changes

gen/microkernels.bzl Show resolved Hide resolved

ejparkqc force-pushed the f32-vbinary branch 2 times, most recently from e8c72c2 to 470ec9c Compare May 29, 2024 05:49

ejparkqc force-pushed the f32-vbinary branch from 470ec9c to 5802baf Compare May 29, 2024 18:05

dsharlet reviewed May 30, 2024

View reviewed changes

ejparkqc force-pushed the f32-vbinary branch from 245ab93 to ba353cb Compare May 30, 2024 03:00

ejparkqc added 14 commits May 30, 2024 10:48

Changes to build XNNPACK for Hexagon with newer version of Hexagon SDK

630c47b

Applying Dillon's suggestions and additional changes to build further

c52fca0

Hexagon vector implementation for F32-vbinary

3c46f51

Add modified scripts to generate f32-vbinary for hexagon

d03613d

Additional change to compile and run the generated kernels.

ec8736e

Add f32-vadd-minmax HVX implementation, generated kernels, scripts an…

8919250

…d other related code changes

Add f32-vmul-minmax, f32-vsub-minmax, f32-vmax, f32-vmin and f32-vsqr…

366658c

…diff

Supporting unaligned load

ed884d9

Add conditional store implementation from Halide

6bc094c

Clean up unused build files and white spaces.

c52ee56

- Removing white space - Follow the new location to list the Hexagon kernels - Revert cmake/microkernels.cmake andmicrokernels.bzl. Modify update-microkernels.py to have hvx in isa_list as well.

Add microkernels using 4 vector units (batch_tile=128)

210b75b

Fix build issues with bazel

62d3561

- Add gen/hvx_microkernels.bzl - Add ALL_HVX_MICROKERNEL_SRCS into hvx source lists.

Fix the build error with missing #if macro for HVX

52cc208

ejparkqc force-pushed the f32-vbinary branch from ba353cb to 52cc208 Compare May 30, 2024 15:53

copybara-service bot merged commit 7869625 into google:master May 30, 2024
3 checks passed

copybara-service bot mentioned this pull request May 30, 2024

Remove XNN_TEST_MODE, it appears to be unnecessary, and we should test the same code we use in production to the maximum extent possible. #6476

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hexagon HVX implementation of F32 vbinary #6386

Hexagon HVX implementation of F32 vbinary #6386

ejparkqc commented May 7, 2024

fbarchard left a comment

fbarchard May 9, 2024

fbarchard left a comment

ejparkqc commented May 23, 2024

dsharlet commented May 23, 2024

ejparkqc commented May 23, 2024

dsharlet left a comment

fbarchard May 28, 2024

dsharlet May 28, 2024

dsharlet May 28, 2024

fbarchard May 28, 2024

ejparkqc May 29, 2024

fbarchard May 28, 2024

ejparkqc May 29, 2024

fbarchard May 28, 2024

ejparkqc May 28, 2024

fbarchard May 28, 2024

ejparkqc May 29, 2024

dsharlet May 29, 2024

fbarchard left a comment

ejparkqc commented May 29, 2024

dsharlet May 30, 2024

dsharlet May 30, 2024

ejparkqc May 30, 2024 •

edited

Loading

ejparkqc May 30, 2024

dsharlet May 30, 2024

		assert(output != NULL);


		const HVX_UVector vptr_a = (const HVX_UVector ) input_a;

		@@ -1316,6 +1316,183 @@
		#endif // XNN_ARCH_WASM \|\| XNN_ARCH_WASMSIMD \|\| XNN_ARCH_WASMRELAXEDSIMD


		TEST(F32_VMAX__HVX_U32, batch_eq_32) {

Hexagon HVX implementation of F32 vbinary #6386

Hexagon HVX implementation of F32 vbinary #6386

Conversation

ejparkqc commented May 7, 2024

fbarchard left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbarchard left a comment

Choose a reason for hiding this comment

ejparkqc commented May 23, 2024

dsharlet commented May 23, 2024

ejparkqc commented May 23, 2024

dsharlet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fbarchard left a comment

Choose a reason for hiding this comment

ejparkqc commented May 29, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ejparkqc May 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ejparkqc May 30, 2024 •

edited

Loading