# NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

#### Sarunas Kalade

Advanced Micro Devices sarunas.kalade@amd.com

#### Graham Schelle

Advanced Micro Devices graham.schelle@amd.com

# **Abstract**

Neural processing units (NPUs) are gaining prominence in power-sensitive client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. Unlike GPU programming, which has had years to mature, NPU programming is new, with smaller and more fragmented developer communities across hardware platforms. This fragmentation poses a challenge when utilizing LLMs to assist in writing NPU kernels, as domain-specific optimized code examples are underrepresented in LLM pre-training data.

In this paper we introduce NPUEval – a benchmark for writing and evaluating NPU kernels, consisting of 102 common operators for machine learning workloads. We evaluate LLM-generated code on actual hardware based on both functional correctness and vectorization efficiency using open source compiler tools targeting the AMD NPU. We evaluate a range of state-of-the-art LLMs with a mix of proprietary and open-weight models. The latest reasoning models, like DeepSeek R1, show promising results, achieving 50%+ vectorization out of the box on select kernels. However, the average score across the entire dataset remains roughly 10% even with compiler feedback and vectorized kernel examples – showing that this is a challenging dataset even for frontier models. The dataset and evaluation code will be released with a permissive open source license, providing an essential benchmark for advancing research in code generation and NPU kernel optimization.

# 1 Introduction

Large language models (LLMs) have become compelling code generation assistants, achieving breakthrough performance on various coding benchmarks [1][2][3][4][5][6][7][8]. More challenging benchmarks that focus on issue solving like SWE-Bench[9] are now also being tackled at over 62% success rate by frontier models, and rapidly improving[10]. However, while these models excel on typical Python tasks, many benchmarks prioritize pass/fail metrics over code quality. A poorly written algorithm can still pass functional tests, but may not be suitable for production systems. This is particularly problematic when generating efficient kernels for hardware accelerators – unoptimized kernels are not useful in accelerated applications.

There has been a surge of NPUs to power AI acceleration workloads from various silicon providers[11][12][13][14][15][16]. However, related work on optimized kernel generation has primarily targeted GPUs[17][18][19]. GPU programming languages and ecosystems have had years to mature and are well represented in LLM pre-training data. For NPU programming, on the other hand, the developer communities are smaller and more fragmented across hardware architectures, with less mature software stacks. We found that LLMs often struggle to produce optimized solutions for vendor-specific code, even though frontier models are good at solving general software problems.

This paper introduces NPUEval, a dataset designed to evaluate LLMs' ability to generate vectorized kernel code for AMD NPUs. The accessibility in client devices makes these accelerators an attractive platform for kernel code generation research, maximizing reproducibility by running on commodity laptops and miniPCs. The entire evaluation harness, including the compiler, is based on open source tools and we are releasing the dataset and associated code under a permissive license. The dataset is composed of prompts, behavioral models, and data movement information for each kernel. Our evaluation framework measures LLM performance on functional correctness (similar to benchmarks like HumanEval[1]) and cycle-accurate performance metrics such as vectorization score (determined by the percentage of cycles spent executing vector instructions).

Our results show that LLMs can generate functionally correct code that runs on NPU hardware, and that with the addition of simple techniques like compiler feedback and RAG they are capable of writing vectorized implementations.

The contributions of this paper are:

- Introduction of NPUEval, which to our knowledge is the first benchmark for evaluating LLMs in generating vectorized code for NPU kernels.
- A **fully open source stack** (compiler and driver) for programming AMD NPUs focused on single kernel development that can run on laptops.
- A comprehensive evaluation harness providing correctness evaluation, **cycle-accurate performance metrics**, and kernel microcode outputs that can be built upon using agentic workflows for further optimization.
- A reference LLM pipeline that utilizes compiler feedback and retrieval augmented generation (RAG) of vectorized kernel examples to steer the LLM outputs towards vectorized solutions and reduce hallucinations.

# 2 Vectorization

The hardware targeted in this paper is the NPU found on the latest AMD laptops and miniPCs, such as Phoenix, Hawk Point or Strix Point based machines, which internally use AI Engines (AIEs) to perform kernel computations. A device like Phoenix has 20 AIE tiles, which are individual compute units capable of independently processing data. Each AIE tile has a vector processing unit (VPU) as well as a scalar unit, which makes these compute units quite flexible since the scalar unit can execute arbitrary C++ code.

These AIE tiles require kernels just like a GPU, and writing them using optimized, vectorized code is essential to fully leverage the hardware architecture for maximum performance. Inefficient kernels will dramatically increase latency and power consumption, which undermines the purpose of having AIEs included in these devices. To squeeze out as much performance as possible we want to maximize the computation on the VPU, which means writing vectorized code by utilizing C++ AIE APIs and intrinsics.

A very simple example showcasing vectorization is in Fig. 1. The scalar code (Fig. 1a) is a passthrough kernel that naively iterates one element at a time in a for loop and copies the input to the output buffer. The vectorized code in Fig. 1b is the same kernel, but written to utilize the VPU using vector instructions - this will process a chunk of data at a time rather than single elements, significantly improving throughput.

Optimized kernels for vectorization can look more complicated and result in more lines of code than this simple example, but the vectorization methodology remains the same (see Appendix A for reference vectorized code). The challenge for the LLMs will be to generate vectorized code that is both correct and efficient in using the VPU by utilizing the correct APIs.

```
void passthrough(uint8_t *in_buffer,
                 uint8_t *out_buffer){
    uint32_t nbytes = 512;
    for (int i=0; i<nbytes; i++) {
        out_buffer[i] = in_buffer[i];
    }
}
```

(a) Non-vectorized (scalar) code

```
void passthrough(uint8_t *in_buffer,
                 uint8_t *out_buffer){
    uint32_t loop_count = 8; // 512/64
    for (int i = 0; i < loop_count; i++) {
        // Load 64 bytes at once, then store them to the output
        auto buffer = ::aie::load_v<64>(in_buffer);
        ::aie::store_v(out_buffer, buffer);
        in_buffer += 64;
        out_buffer += 64;
    }
}
```

(b) Vectorized 64-byte implementation

Figure 1: Simple example of vectorizing a passthrough kernel

# 3 The Dataset

The NPUEval dataset is a collection of prompts consisting of AIE kernel definitions accompanied by docstrings containing the kernel description, input/output examples, and anticipated data movement and runtime parameters (see Appendix B for more detail). An example prompt is shown in Fig. 2.

```
/*
This AIE kernel applies the ReLU6 activation function elementwise to a bfloat16 input vector of size 256. ReLU6 clamps each value to [0, 6].
>>> relu6_bfloat16([12.125, 1.203125, 5.84375, 15.9375, 12.9375, -9.8125, 5.59375, -3.203125])
[6.0, 1.203125, 5.84375, 6.0, 6.0, 0.0, 5.59375, 0.0]
This kernel should be optimized for the following input/output buffer shapes and parameters:
input size: 256
output size: 256
*/
#include <aie_api/aie.hpp>

void relu6_bfloat16(bfloat16 *input, bfloat16 *output) {
    // Implementation goes here
}
```

Figure 2: Sample prompt from the NPUEval dataset.

#### 3.1 Structure

Each kernel in the dataset will have the following components:

- **Prompt**: HumanEval-inspired prompts containing a basic explanation of what the kernel does and the function signature.
- **Data movement information**: specifies the data sizes coming in and out of the tile.
- **Behavioral model**: NumPy-based Python implementation of the target kernel.
- **(Optional) Canonical C++ solution**: while the Python solution is used for correctness evaluation, the canonical solution is useful for regression testing of the harness.

To generate optimized kernels that fully utilize the target hardware, kernels may be written for specific fixed input data sizes. For this reason we include data movement information for each kernel, specifying the data sizes coming in and out of the tile[20].

#### 3.2 Dataset considerations

Since the main task is writing code for specialized hardware there are some considerations and challenges when building this dataset that are not typically considered in LLM coding benchmarks.

#### 3.2.1 Data type support

In addition to common NPU data types like int8 typically used to accelerate quantized neural networks, AMD's NPU supports bfloat16 which is relatively new. Python libraries like NumPy don't have native support for these newer ML data types yet. To generate bfloat16 behavioral models we used Google Jax's ml\_dtypes[21] library which provides a NumPy-compatible bfloat16 type. The NPU has a programmable rounding mode for its bfloat16 implementation, which we leverage to ensure the same operations will result in identical outputs minimizing compounding rounding errors.
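To illustrate why the rounding mode matters when comparing against a behavioral model, here is a minimal standard-library sketch of rounding a float32 value to bfloat16. It assumes round-to-nearest-even, which may not be the mode configured on the NPU. It is not the ml\_dtypes implementation, and the function names are hypothetical.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (assumed round-to-nearest-even)."""
    # Reinterpret the value as an IEEE-754 float32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: truncate and force a mantissa bit on so the result stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return (bits >> 16) | 0x0040
    # Add a rounding bias so that exact ties go to the even 16-bit result.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bfloat16_round(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float."""
    bits16 = float_to_bfloat16_bits(x)
    # bfloat16 is the upper half of a float32, so pad the lower 16 bits with zeros.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]
```

Values with at most seven fractional mantissa bits pass through unchanged, while anything finer is rounded. If the reference model and the device round differently, small discrepancies of this kind can compound over a chain of operations.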

#### 3.2.2 Floating point kernel evaluation

Floating point kernel solutions may not match the Python behavioral model outputs exactly. The same problem can be solved using different algorithms with slight precision tradeoffs. For example, when estimating trigonometric functions like sin, the NPU might use a different polynomial approximation than the CPU implementation used in the behavioral model. Floating point operations are also not associative, so doing operations in a different order will produce slightly different values. When evaluating correctness of the generated kernels we use a large default absolute error threshold of 1e-2 (similar default to related kernel generation works[17]). This default is fine for logical and integer operators, but for lower precision floating point operations the tolerance is set to 2e-2 to give the LLMs some leeway in the algorithmic search space. More complex kernels like tanh or softmax activations have their tolerances set to 3e-2.
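The tolerance scheme described above can be sketched as a simple absolute-error check. The category names and the `passes_tolerance` helper are hypothetical and only mirror the thresholds quoted in the text; the actual harness may organize this differently.

```python
# Hypothetical tolerance table mirroring the thresholds quoted in the text.
TOLERANCES = {
    "default": 1e-2,        # logical and integer operators
    "low_precision": 2e-2,  # lower precision floating point operators
    "complex": 3e-2,        # e.g. tanh or softmax activations
}


def passes_tolerance(outputs, reference, category="default"):
    """Return True if every output is within the absolute tolerance of its reference."""
    if len(outputs) != len(reference):
        return False
    atol = TOLERANCES[category]
    return all(abs(o - r) <= atol for o, r in zip(outputs, reference))
```

For example, an output of 0.515 against a reference of 0.5 fails under the default threshold but passes under the looser low-precision threshold.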

#### 3.2.3 Data movement

We have pre-defined the data movement for each kernel in order to compile the AIE graph using MLIR-AIE tools[22]; this information is included in the prompts. The test vectors are used to infer the data movement and automatically generate the underlying MLIR code to configure the device. In dataflow programming, kernels are aware of the data sizes coming in and out of the tile[23]. This is essential for the compiler to be able to optimize buffer allocation and bank assignment[20].

#### 3.2.4 Hardware availability

Most code generation benchmarks that set out to solve problems in Python or other high-level languages are convenient because they can run on any commodity hardware. Writing code for hardware accelerators is difficult because you need to be able to run tests on-target or use a simulator (typically much slower, limiting the number of iterations). While AIEs can also be found on development boards for embedded applications[24], these platforms are less accessible to mainstream users. We target AMD NPUs as they are readily available on consumer hardware, making reproduction of our results more approachable to a wider range of researchers. The results presented in this work were obtained using a laptop with a Ryzen 9 7940HS chip.

# 4 Evaluation

The generated kernels are evaluated using the following criteria:

- **Compilation**: does the kernel compile? Is it syntactically correct C++ code using valid NPU vector unit API calls and intrinsics?
- **Functional correctness**: does the kernel produce the correct output for the given input (given a provided error tolerance)?
- **Performance**: how long does the kernel take to execute and how efficiently is the VPU utilized?



Figure 3: Overview of NPUEval evaluation pipeline.

#### 4.1 Evaluation harness setup

A high-level overview of the evaluation pipeline is shown in Fig. 3. The dataset includes kernel prompts, behavioral models and test vectors. The evaluation harness compiles the generated C++ kernel code and runs it on-target. The outputs are then compared against the expected simulation outputs from the Python behavioral models.

**Compiler**. To evaluate the LLM-generated code we use the LLVM-AIE compiler[25], a fork of LLVM specifically used for AIE kernel programming. This tool is entirely open source and can be installed from its GitHub repository. Running agents that generate compilable code requires only an x86 machine to reproduce the code generation steps with compiler feedback; evaluating whether the code is syntactically correct is still useful even without access to a machine that has an NPU.

**Application builder**. When programming the NPU, the AIE array needs to be configured and the kernel loaded onto the right AIE tile. We utilize the open source MLIR-AIE[22] framework and IRON[20] bindings for this task. The NPUEval evaluation harness has templated graphs for most kernel data types and input/output port numbers. The data movement passed onto MLIR-AIE is determined by parsing the test vectors and behavioral model outputs, which are part of the dataset.

**Runtime**. To move data in and out of the NPU and execute the kernels, we use Python bindings to the NPU driver. This integration allows us to easily work with NumPy arrays, and the behavioral model can be quickly checked against NPU outputs.

**Performance metrics**. In addition to the core evaluations, we collect supplementary information about the generated kernels. This includes metrics like execution time, and accuracy measurements. For accuracy, we compute both maximum absolute error (the numerical difference between each output value and reference value) and maximum relative error (the percentage difference relative to the reference value).
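The two accuracy metrics can be sketched in a few lines of plain Python. The function names are hypothetical, and skipping zero-valued references in the relative error is an assumption made here to avoid division by zero; the harness may handle that case differently.

```python
def max_abs_error(outputs, reference):
    """Largest absolute difference between an output and its reference value."""
    return max(abs(o - r) for o, r in zip(outputs, reference))


def max_rel_error(outputs, reference):
    """Largest difference relative to the reference value, in percent.

    Zero-valued references are skipped (an assumption of this sketch).
    """
    errors = (
        abs(o - r) / abs(r) * 100.0
        for o, r in zip(outputs, reference)
        if r != 0
    )
    return max(errors, default=0.0)
```

With outputs `[1.0, 2.5, 4.0]` and references `[1.0, 2.0, 4.0]`, the maximum absolute error is 0.5 and the maximum relative error is 25%.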

Agents can utilize this information to iterate on the code generation process. For example, if the kernel is failing the functional test, seeing the maximum absolute error compared to the reference Python output can help refine the algorithm. If the kernel is taking too long to run and the VPU is not being utilized, the agent could try different vectorization strategies.

**Post-processing.** This is essential in LLM evaluation since the responses from some LLMs are not immediately usable for automatic evaluation. These models will often mix English prose into their responses and denote code blocks using markdown code fences (triple backticks). LLMs also have a tendency to output a main() function to test the kernel. We use regex to extract code blocks embedded in markdown and truncate any extraneous functions.
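A minimal sketch of this kind of post-processing is shown below. The regular expressions and the `extract_kernel` helper are illustrative assumptions, not the harness's actual implementation; in particular, dropping everything from `int main(` onward is a simple heuristic.

```python
import re

# Markdown code fence marker, built indirectly so this example stays fence-safe.
FENCE = "`" * 3

# Match a fenced block with an optional language tag and capture its body.
CODE_BLOCK_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)
# Heuristic: everything from a main() definition onward is dropped.
MAIN_RE = re.compile(r"^\s*int\s+main\s*\(", re.MULTILINE)


def extract_kernel(response: str) -> str:
    """Return the first fenced code block (or the raw response) without main()."""
    match = CODE_BLOCK_RE.search(response)
    code = match.group(1) if match else response
    main = MAIN_RE.search(code)
    if main:
        code = code[: main.start()]
    return code.strip() + "\n"
```

A response that wraps the kernel in prose and appends a test `main()` is reduced to the kernel function alone, while a response that is already bare code passes through unchanged.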

# 5 Generation

Along with the dataset and evaluation harness, we provide a reference code generation pipeline that can be used with a variety of LLMs. A system prompt guides the LLMs into generating self-contained C++ solutions, a vector database of open source AIE kernels supports RAG, and compiler feedback is provided to the LLM to reduce errors and hallucinations.

#### 5.1 System Prompts

A system prompt is typically used to guide the model's behavior by providing context and expected outcomes. Based on initial experimentation, adding a system prompt was imperative to get well-formatted code from many of the LLMs. Some models will attempt to break down the solutions into multiple code blocks with explanations and produce overly verbose outputs, which makes it difficult to automatically parse the solution. The prompt used for NPUEval is shown below.

```
You are a part of a code generation system for AIE (AI Engines).

* Your job is to write C++ code for a single kernel that will run on an AIE tile.
* Produce only the C++ code for the requested kernel including any required headers and imports.
* Make sure the C++ code is complete and self contained in a single code block.
* Name the function exactly as specified in the request, and output only the kernel (no main(), examples, explanations or extra code).
```

While the system prompt does not guarantee that the model will follow the rules and generate self-contained C++ kernel code blocks, empirically we found it very helpful. Some models are not as good at following the directions, in which case more post-processing is still required (e.g., for unwanted main() function pruning).

#### 5.2 RAG

The RAG database is composed of kernels drawn from open source GitHub repositories[22][26]. The kernels have been manually modified to exclude any scalar implementations, leaving only vectorized code examples.

We use llama\_index [27] to manage the vector database and retrieval of examples. The embeddings model is OpenAI's text-embedding-ada-002 (llama\_index default). For each test prompt two examples are provided to the LLM. RAG systems are a rich optimization space in themselves[28]; however, this is out of scope for this paper. We leave it for future work and use llama\_index defaults for the evaluations.
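The retrieval step can be pictured with a toy stand-in. The sketch below deliberately does not use llama\_index or a learned embedding model: it swaps in bag-of-words token counts and cosine similarity purely to show how the two most similar examples would be selected for a prompt. All names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: token counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(prompt: str, examples: dict, top_k: int = 2) -> list:
    """Return the names of the top_k examples most similar to the prompt."""
    query = embed(prompt)
    ranked = sorted(
        examples,
        key=lambda name: cosine(query, embed(examples[name])),
        reverse=True,
    )
    return ranked[:top_k]
```

The retrieved examples would then be prepended to the kernel prompt before it is sent to the LLM.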

#### 5.3 Compiler feedback

The same compiler, LLVM-AIE, is used for generation as for evaluation. The generated kernel code is passed through the compiler and, if it fails to compile, the error message is fed back to the model. Each LLM is allowed to retry code generation up to ten times. Results could potentially be improved further with more compilation attempts, but we start seeing diminishing returns after ten (see Appendix C).
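The feedback loop described above can be sketched as follows. The `generate` and `compile_kernel` callables are hypothetical placeholders for the LLM call and the compiler invocation, and the wording of the repair message is an assumption rather than the prompt used in the reference pipeline.

```python
from typing import Callable, Optional, Tuple


def generate_with_feedback(
    prompt: str,
    generate: Callable[[str], str],
    compile_kernel: Callable[[str], Tuple[bool, str]],
    max_retries: int = 10,
) -> Tuple[Optional[str], int]:
    """Regenerate until the kernel compiles or the retries are exhausted.

    Returns (code, attempts_used); code is None if nothing compiled.
    """
    message = prompt
    for attempt in range(max_retries + 1):  # first attempt plus retries
        code = generate(message)
        ok, error = compile_kernel(code)
        if ok:
            return code, attempt + 1
        # Feed the compiler error back so the model can repair its output.
        message = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n\n"
            f"Compiler error:\n{error}\n\nFix the code."
        )
    return None, max_retries + 1
```

Passing the model and compiler in as callables keeps the loop testable on a machine without the toolchain installed, since both can be replaced by stubs.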

#### 5.4 LLM settings

Where available we lock down random seeds to make the results as reproducible as possible. For OpenAI models we set the random seed to 42. For all non-reasoning models we set temperature to 0.0 and top\_p to 1.0. For reasoning models the temperature is left at the default of 1.0 and the system prompt is passed as part of the user message.

# 6 Results

Popular frontier LLMs were evaluated on the dataset, including the latest versions of OpenAI's GPT-4.1 and Anthropic's Claude 3.7 Sonnet. We also include results for open-weight models like Meta's Llama family, and for models from DeepSeek, such as DeepSeek R1, that are known to work well for code generation tasks.

#### 6.1 Out-of-the-box LLM evaluation

We first set the baseline for the models by testing them on the prompts directly, adding only the system prompt outlined in Section 5.1. The baseline results are summarized in Figure 4.



Figure 4: Zero shot results (ranked from top to bottom)

Table 1: Functional test pass rates (%) with iterative re-compilation. Column headers give the RAG setting, with the number of recompilations in parentheses.

| Model               | No RAG (0) | No RAG (1) | No RAG (2) | No RAG (5) | RAG (0) | RAG (1) | RAG (2) | RAG (5) |
|---------------------|--------|------|------|------|------|------|------|------|
| Claude Haiku 3.5    | 34.3   | 44.1 | 52.9 | 61.8 | 25.5 | 40.2 | 49.0 | 53.9 |
| Claude Sonnet 3.7   | 21.6   | 52.9 | 64.7 | 73.5 | 23.5 | 47.1 | 59.8 | 70.6 |
| Deepseek R1         | 20.6   | 30.4 | 34.3 | 38.2 | 19.6 | 23.5 | 26.5 | 29.4 |
| Deepseek V3         | 0.0    | 42.2 | 55.9 | 60.8 | 23.5 | 27.5 | 31.4 | 39.2 |
| GPT-4.1             | 29.4   | 52.9 | 63.7 | 71.6 | 22.5 | 37.3 | 45.1 | 58.8 |
| GPT-4o              | 36.3   | 40.2 | 43.1 | 49.0 | 18.6 | 28.4 | 34.3 | 41.2 |
| GPT-4o Mini         | 58.8   | 65.7 | 66.7 | 66.7 | 26.5 | 34.3 | 34.3 | 35.3 |
| LLaMA3.1-405B       | 38.2   | 46.1 | 50.0 | 56.9 | 17.6 | 21.6 | 26.5 | 30.4 |
| LLaMA3.1-70B        | 11.8   | 37.3 | 42.2 | 52.0 | 8.8  | 17.6 | 22.5 | 23.5 |
| Qwen2.5-Coder       | 50.0   | 56.9 | 64.7 | 68.6 | 6.9  | 9.8  | 11.8 | 13.7 |

#### 6.2 Correctness

As shown in Fig. 3, correctness is evaluated by comparing the kernel output of the programmed NPU with the reference output produced by the Python behavioral model. The effectiveness of compiler feedback is explored with a maximum of five recompilations. The summary of our results is shown in Table 1.

Surprisingly, smaller models like GPT-4o mini and Claude 3.5 Haiku passed more tests out of the box than their more powerful counterparts (Claude Sonnet and GPT-4o). One of the observed reasons is that smaller models tend to write scalar C++ code by default (i.e., regular C++ with for loops instead of vector intrinsics). Stronger models may try to write vectorized code at first, but end up hallucinating functions as shown in Fig. 6b. With compiler feedback these models will often fall back to scalar solutions, which allows them to pass more tests.

DeepSeek V3, Claude 3.7 Sonnet and GPT-4o benefitted greatly from successive compilation attempts, whereas others saw marginal benefits. DeepSeek V3 in particular tried to include "adf.h" in nearly every first attempt; this header does not exist in the open source tool suite used within the evaluation harness, which is why its pass rate shot up to 42% after a single recompilation.

#### 6.3 Performance

The efficiency of generated kernels is calculated by dividing the number of cycles that are used for VPU execution by the total number of cycles it takes to process the test input on the NPU. While not perfect, this is a good proxy for how well the kernel is utilizing the VPU. Kernels that do not pass functional tests are evaluated as 0%. The average score for all kernels per model is displayed in Fig. 5. While the scores seem very low it should be noted that SoTA open source kernels[22] will typically see a vectorization factor of 10-30% as shown in Appendix A.
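The scoring rule can be written out directly. The sketch below assumes the cycle counts have already been collected from the hardware; the function names and the tuple layout are hypothetical.

```python
def vectorization_score(vector_cycles: int, total_cycles: int, passed: bool) -> float:
    """Fraction of cycles spent on vector instructions; 0 for failing kernels."""
    if not passed or total_cycles <= 0:
        return 0.0
    return vector_cycles / total_cycles


def average_score(results) -> float:
    """Mean score over all kernels; each result is (vector_cycles, total_cycles, passed)."""
    if not results:
        return 0.0
    return sum(vectorization_score(*r) for r in results) / len(results)
```

Because failing kernels contribute zero, a model that vectorizes aggressively but fails many functional tests can end up with a lower average than one that passes more tests with modest vectorization.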



Figure 5: Vectorization results

To guide model responses towards vectorized code we introduced vectorized kernel RAG into the code generation pipeline. Introducing RAG composed of openly available AIE kernels into the prompts improved VPU utilization across the board, with GPT-4.1 seeing the sharpest improvement in vectorization score. Curiously, the only model which saw a decrease in average VPU utilization was DeepSeek R1, which still holds the best score without using vectorized examples. It performed worse because some kernels in our database had compiler-specific pragmas which differ from the pragmas one would use with LLVM-AIE. Refer to Appendix D for solution examples and how they differ.

With compiler feedback, DeepSeek V3 and GPT-4.1 were among the best models at generating vectorized code, even though they scored quite low on correctness testing. This follows the trend of powerful models being more "inventive" when trying to write performant code; however, lacking deep knowledge of NPU programming environments, they were more prone to mistakes.

#### 6.4 LLM Failure Analysis

The LLMs that passed many functional correctness tests ended up writing very inefficient solutions, as shown in Fig. 6a; these kernels do not utilize the NPU hardware to the fullest. Incorrect solutions often fail compilation due to hallucinations, as shown in Fig. 6b, or misunderstand how to use vector APIs, as shown in Fig. 6c, where the model successfully used the AIE APIs but did so in a loop one element at a time.

There does seem to be baked-in knowledge in these models, however: Claude 3.7 Sonnet (Fig. 6c) has a notion of vector\_size and that it should be chunking the input buffer into vectors of 16 elements. With more advanced prompting and more quality examples these models have the potential to write efficient NPU kernels.

# 7 Limitations

In this study we have primarily targeted the AIE architecture found in AMD NPUs, although the methodology is designed to be extensible to other accelerator families, and within the AIE family as new devices are released with updated intrinsics and API improvements.

The presented results have been achieved using one compiler backend, however there are multiple options for just the AIE. And as we saw from the DeepSeek R1 result, certain optimizations in the code are compiler specific which will be an interesting challenge when generalizing across NPU architectures and programming platforms.

```
#include <aie_api/aie.hpp>

void abs_int8(int8_t *in_buffer, int8_t *out_buffer) {
    constexpr int buffer_size = 1024;
    for (int i = 0; i < buffer_size; i++) {
        out_buffer[i] = aie::abs(in_buffer[i]);
    }
}
```

(a) Looping one element at a time (GPT-4o)

(b) Hallucinated non-existent API (Llama-3.1 70b)

(c) Aware of vectors and how to chop up the data but processing in scalar loops (Claude 3.7 Sonnet)

Figure 6: Examples of LLM errors in vectorized kernel generation: scalar loop with conditional (claude-3-7-sonnet), incomplete vectorization (gpt-4o), and hallucinated API usage (llama-3.1-70b).

Since NPUs are a relatively new class of device and the programming model is not as established as that of other accelerators, there is still a lack of quality open source kernels to establish a human baseline. We include a couple of examples in Appendix A to contrast our results, and note that current SoTA is not higher than 30%, though this will be highly kernel dependent.

Our evaluation covered a range of popular LLMs, but more models could be tested. All results are based on greedy decoding with a single pass; while pass@k evaluations may yield higher scores, one-shot decoding still offers meaningful insight into current LLM capabilities for NPU kernel generation.

# 8 Future work

NPUs are still a new class of accelerator and it is impressive how good some models already are at writing code for them. This is only the beginning and it will be exciting to see how new techniques like code generating agents will improve upon this benchmark.

While the work presented in this paper focused on one NPU architecture, this methodology could easily extend across different vendors and families of NPUs. The Python behavioral models can be re-used along with the PromptConstructor class to generate datasets targeting any programming model. We plan to use the same methodology to extend this work to other accelerator families.

# 9 Conclusions

We have presented NPUEval, the first benchmark to systematically evaluate LLMs in their ability to generate NPU kernel code. NPUEval includes a comprehensive evaluation harness with cycle-accurate performance metrics and a reference code generation pipeline.

Our results show that most LLMs can readily generate scalar code, but struggle to produce optimized, vectorized solutions. Interestingly, smaller models (e.g., GPT-4o mini, Claude 3.5 Haiku) tend to favor functional but unoptimized code, while stronger models (e.g., GPT-4o, Claude 3.7 Sonnet, DeepSeek R1) attempt to optimize and risk hallucinations, leading to lower functional scores.

We believe NPUEval provides a valuable foundation for advancing LLM-driven accelerator kernel generation. We hope this benchmark will become the standard for measuring and improving LLM-based code generation for emerging hardware architectures.

# Acknowledgments

We would like to thank our colleagues Mario Ruiz Noguera, Thomas Papatheodore, Stephen Neuendorffer, Jack Lo, and Joseph Melber from AMD for their expertise and technical advice. Many thanks to Patrick Lysaght for supporting the early development stages of this project.

Additionally we want to thank Lakshya A Agrawal and Matei Zaharia (UC Berkeley) for their feedback in the development of the benchmark.

#### References

- [1] Mark Chen et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [3] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023.
- [4] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. Multiple: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023.
- [5] Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, and Zhoujun Li. Mceval: Massively multilingual code evaluation, 2024.
- [6] Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. Evaluation of llms on syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814, 2024.
- [7] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. Evocodebench: An evolving code generation benchmark aligned with real-world code repositories, 2024.
- [8] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.
- [9] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- [10] Anthropic PBC. Claude 3.7 sonnet and claude code, https://www.anthropic.com/news/claude-3-7-sonnet.
- [11] Advanced Micro Devices (AMD). AMD XDNA. https://www.amd.com/en/technologies/xdna.html.
- [12] Apple Inc. Apple Neural Engine. https://en.wikipedia.org/wiki/Apple\_Neural\_Engine.
- [13] Ltd. Huawei Technologies Co. Ascend ai processor, https://e.huawei.com/en/products/computing/ascend.
- [14] Intel Corporation. Intel neural processor, https://intel.github.io/intel-npu-acceleration-library/npu.html.
- [15] Samsung Electronics. Samsung neural processing unit, https://semiconductor.samsung.com/support/tools-resources/dictionary/the-neural-processing-unit-npu-a-brainy-next-generation-semiconductor/.

- [16] Inc. Qualcomm Technologies. Hexagon processor, https://en.wikipedia.org/wiki/qualcomm\_hexagon.
- [17] Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?, 2025.
- [18] Robert Tjarko Lange, Aaditya Prasad, Qi Sun, Maxence Faldor, Yujin Tang, and David Ha. The ai cuda engineer: Agentic cuda kernel discovery, optimization and composition. 2025.
- [19] Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. Tritonbench: Benchmarking large language model capabilities for generating triton operators, 2025.
- [20] Erika Hunhoff, Joseph Melber, Kristof Denolf, Andra Bisca, Samuel Bayliss, Stephen Neuendorffer, Jeff Fifield, Jack Lo, Pranathi Vasireddy, Phil James-Roxby, and Eric Keller. Efficiency, expressivity, and extensibility in a close-to-metal npu programming interface, 2025.
- [21] Google. A stand-alone implementation of several NumPy dtype extensions used in machine learning, https://github.com/jax-ml/ml\_dtypes, 2025.
- [22] AMD. Fork of LLVM to support AMD AIEngine processors, https://github.com/Xilinx/llvm-aie, 2024.
- [23] Tristan Laan and Tiziano De Matteis. Developing a blas library for the amd ai engine, 2024.
- [24] AMD. Versal VCK190 Evaluation Board, 2025. Accessed: 2025-03-13.
- [25] AMD. An MLIR-based toolchain for AMD AI engine-enabled devices, https://github.com/Xilinx/mlir-aie, 2024.
- [26] AMD. An open-source exploration framework for first time users of the AMD Ryzen AI Neural Processing Unit, https://riallto.ai/, 2024.
- [27] Jerry Liu. LlamaIndex, https://github.com/jerryjliu/llama\_index, 2022.
- [28] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.

# A Examples of vectorized kernels

Here we provide example vectorized kernels available in open source repositories like https://github.com/xilinx/mlir-aie. While the same functionality can be achieved by writing regular C++ loops, vectorized code is more specialized and relies on intrinsics and AIE APIs.

Listing 1 shows an open source implementation of an elementwise add kernel, which uses Chess compiler pragmas and AIE APIs to perform vector operations.

```
#define NOCPP

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <type_traits>

#include <aie_api/aie.hpp>

template <typename T_in, typename T_out, const int N>
void eltwise_add(T_in *a, T_in *b, T_out *c) {
  for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
  }
}

template <typename T_in, typename T_out, const int N>
void eltwise_vadd(T_in *a, T_in *b, T_out *c) {

  constexpr int vec_factor = 16;
  event0();
  T_in *__restrict pA1 = a;
  T_in *__restrict pB1 = b;
  T_out *__restrict pC1 = c;
  const int F = N / vec_factor;
  for (int i = 0; i < F; i++)
    chess_prepare_for_pipelining chess_loop_range(16, ) {
      aie::vector<T_in, vec_factor> A0 = aie::load_v<vec_factor>(pA1);
      pA1 += vec_factor;
      aie::vector<T_in, vec_factor> B0 = aie::load_v<vec_factor>(pB1);
      pB1 += vec_factor;
      aie::vector<T_out, vec_factor> cout = aie::add(A0, B0);
      aie::store_v(pC1, cout);
      pC1 += vec_factor;
    }
  event1();
}

extern "C" {

void eltwise_add_bf16_scalar(bfloat16 *a_in, bfloat16 *b_in, bfloat16 *c_out) {
  eltwise_add<bfloat16, bfloat16, 1024>(a_in, b_in, c_out);
}

void eltwise_add_bf16_vector(bfloat16 *a_in, bfloat16 *b_in, bfloat16 *c_out) {
  eltwise_vadd<bfloat16, bfloat16, 1024>(a_in, b_in, c_out);
}

} // extern "C"
```

Listing 1: Elementwise Add example (Vector score: 13%)
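The blocked loop structure of `eltwise_vadd` in Listing 1 can be illustrated in portable C++ for readers without access to the AIE toolchain. The sketch below is not the AIE API: `Vec`, `load_v`, `add`, `store_v` and `eltwise_vadd_portable` are hypothetical stand-ins built on `std::array`, and we assume `N` is a multiple of the 16-lane vector width, as the original kernel implicitly does.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for aie::vector<T, Lanes>: a fixed-width bundle of lanes.
template <typename T, std::size_t Lanes>
using Vec = std::array<T, Lanes>;

// Copy Lanes consecutive elements from memory into a vector (mimics aie::load_v).
template <std::size_t Lanes, typename T>
Vec<T, Lanes> load_v(const T *p) {
  Vec<T, Lanes> v{};
  for (std::size_t l = 0; l < Lanes; ++l) v[l] = p[l];
  return v;
}

// Lane-wise addition (mimics aie::add).
template <typename T, std::size_t Lanes>
Vec<T, Lanes> add(const Vec<T, Lanes> &a, const Vec<T, Lanes> &b) {
  Vec<T, Lanes> r{};
  for (std::size_t l = 0; l < Lanes; ++l) r[l] = a[l] + b[l];
  return r;
}

// Write all lanes back to memory (mimics aie::store_v).
template <typename T, std::size_t Lanes>
void store_v(T *p, const Vec<T, Lanes> &v) {
  for (std::size_t l = 0; l < Lanes; ++l) p[l] = v[l];
}

// Same loop shape as eltwise_vadd: N / 16 iterations, one 16-lane block each.
template <typename T, int N>
void eltwise_vadd_portable(const T *a, const T *b, T *c) {
  constexpr int vec_factor = 16;
  static_assert(N > 0 && N % vec_factor == 0,
                "assumption: N is a multiple of the vector width");
  for (int i = 0; i < N / vec_factor; ++i) {
    Vec<T, vec_factor> A0 = load_v<vec_factor>(a);
    a += vec_factor;
    Vec<T, vec_factor> B0 = load_v<vec_factor>(b);
    b += vec_factor;
    store_v(c, add(A0, B0));
    c += vec_factor;
  }
}

} // namespace sketch
```

On the NPU each of these helper calls maps to a vector instruction, whereas here they are ordinary scalar loops. The sketch only conveys how the data is partitioned into blocks, not the performance characteristics.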

Listing 2 shows an optimized version of a conv2d kernel, which achieves a 30% vectorization score.

```
#define NOCPP

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <aie_api/aie.hpp>

#define REL_WRITE 0
#define REL_READ 1

#ifdef SCALAR

const int32_t SMAX = 127;
const int32_t SMIN = 128;

#ifdef INT8_ACT

// conv2d 1x1 - scalar
// act: int8, wts: int8, out: int8

void conv2dk1_i8_scalar(int8_t *input, int8_t *kernels, int8_t *output,
                        const int32_t input_width, const int32_t input_channels,
                        const int32_t output_channels, const int scale) {
  event0();

  int x, ic, oc, ic8, oc8;
  // scale=-17;
  for (oc = 0; oc < output_channels / 8; oc++) {
    for (x = 0; x < input_width; x++) { // col of output image
      for (oc8 = 0; oc8 < 8; oc8++) {
        int sum = 0;
        int sum_srs = 0;

        for (ic = 0; ic < input_channels / 8; ic++) {
          for (ic8 = 0; ic8 < 8; ic8++) {
            int val = input[(ic * input_width * 8) + (x * 8) + ic8];
            int k = kernels[(oc * (input_channels / 8) * 64) + (ic * 64) +
                            (ic8 * 8) + oc8];
            sum += val * k;
          }
        }

        // sum_srs=sum>>scale;
        sum_srs = (sum + (1 << (scale - 1))) >> scale;
        sum_srs = (sum_srs > SMAX) ? SMAX : (sum_srs < -SMIN) ? -SMIN : sum_srs;
        // sum_srs = input[(oc*input_width*8) + (x*8) + oc8];
        output[(oc * input_width * 8) + (x * 8) + oc8] = sum_srs;
      }
    }
  }

  event1();
}
#endif // INT8_ACT

#else // Vector

#ifdef INT8_ACT

// conv2d 1x1 - vector
// act: int8, wts: int8, out: uint8
//
// Assume IC >= 16 as that gives ideal inner loop schedule

// TODO - Restricting input_width is multiple of 32
// Because each VMAC works on 4 inputs at a time and we store intermediate
// results in 8 accumulators, having input_width be a multiple of 4*8=32 is
// ideal. However, we should be able to support input_width that is only a
// multiple of 4 but there is some strange scheduling happening now so for
// now, we do not.

void conv2dk1_i8_vector(int8_t *input, int8_t *kernels, int8_t *output,
                        const int32_t input_width, const int32_t input_channels,
                        const int32_t output_channels, const int scale) {
  event0();

  using MMUL4x8x8 = aie::mmul<4, 8, 8, int8, int8>;
  ::aie::set_saturation(
      aie::saturation_mode::saturate); // Needed to saturate properly to uint8
  ::aie::set_rounding(aie::rounding_mode::symmetric_inf); // Needed to saturate
                                                          // properly to uint8

  int8_t *restrict out_ptr = output;

  const int scaleT = scale;

  MMUL4x8x8 acc_tmp[8];
  for (int x = 0; x < 8; x++) {
    acc_tmp[x] = aie::zeros<acc32, 32>();
  }

  // TODO Keeping this variable gives a wrong behavior and bad schedule!
  const int iw = input_width;
  const int iw_32 = (input_width / 4) / 8;

  // const int iw_32_rem = (input_width / 4) % 8;
  // const int iw_32_rem = (32 / 4) % 8;
  assert((input_width / 4) % 8 == 0);
  const int iw_32_rem = 0; // TODO - See restriction

  assert((input_channels / 8) > 2); // Assume IC >= 16

  if (iw_32 > 0) {

    for (int oc = 0; oc < (output_channels / 8); oc++) {
      for (int iw_32c = 0; iw_32c < iw_32; iw_32c++) {
        for (int ic = 0; ic < (input_channels / 8); ic++)
          chess_prepare_for_pipelining chess_loop_range(2, ) {
            aie::vector<int8, 64> in_b = aie::load_v<64>(kernels);
            kernels += 64; // wts ic0..7(oc0..7)

            for (int x = 0; x < 8; x++) {
              aie::vector<int8, 32> in_a = aie::load_v<32>(input);
              input += 32; // act oc0..3(ic0..7)
              acc_tmp[x].mac(in_a, in_b);
            }
            input += (iw * 8) - 256; // Move to next ic/8 position
          }
        // input ptr just moves to next section
        for (int xx = 0; xx < 8; xx++) {
          aie::vector<int8, 32> o1 = acc_tmp[xx].to_vector<int8>(scaleT);
          aie::store_v(out_ptr, o1);
          out_ptr += 32;
          acc_tmp[xx] = aie::zeros<acc32, 32>();
        }
        input -= ((input_channels / 8) * iw * 8) -
                 256; // reset to next input_width/32 block
        kernels -=
            (input_channels / 8) * 64; // reset kernel back to beginning of ic/8
      }
      input -= (iw_32) * 256; // 8*32, reset beginning of input ptr
      kernels += (input_channels / 8) * 64; // move to next oc/8 weights
      out_ptr += (iw_32_rem *
                  32); // move to next oc/8 (skip remainder section if present)
    }

  } // if(iw_32 > 0) {

  if (iw_32_rem > 0) {

    const int ocs = output_channels;
    const int ics = input_channels;

    for (int oc = 0; oc < (ocs / 8); oc++) {
      for (int ic = 0; ic < (ics / 8); ic++)
        chess_prepare_for_pipelining chess_loop_range(2, ) {
          aie::vector<int8, 64> in_b = aie::load_v<64>(kernels);
          kernels += 64; // wts ic0..7(oc0..7)

          for (int x = 0; x < iw_32_rem; x++) {
            aie::vector<int8, 32> in_a = aie::load_v<32>(input);
            input += 32; // act oc0..3(ic0..7)
            acc_tmp[x].mac(in_a, in_b);
          }
          input += (iw * 8) - (iw_32_rem * 32); // Move to next ic/8 position
        }
      // input ptr just moves to next section
      for (int xx = 0; xx < iw_32_rem; xx++) {
        aie::vector<int8, 32> o1 = acc_tmp[xx].to_vector<int8>(scaleT);
        aie::store_v(out_ptr, o1);
        out_ptr += 32;
        acc_tmp[xx] = aie::zeros<acc32, 32>();
      }
      // input -= ((ics-1)/8)*(iw*8)+(iw_32_rem*32); // reset to beginning of
      // input ptr for remainder
      input -= 448; // reset to beginning of input ptr for remainder
      // kernel ptr already at next oc/8
      out_ptr += (iw * 8) -
                 (iw_32_rem *
                  32); // move to next oc/8 (skip remainder section if present)
    }

  } // if(iw_32_rem > 0)

  event1();
}
#endif // INT8_ACT
#endif // Vector

// conv2d 1x1 wrappers

extern "C" {

#ifdef SCALAR

#ifdef INT8_ACT

void conv2dk1_i8(int8_t *input, int8_t *kernels, int8_t *output,
                 const int32_t input_width, const int32_t input_channels,
                 const int32_t output_channels, const int scale) {
  conv2dk1_i8_scalar(input, kernels, output, input_width, input_channels,
                     output_channels, scale);
}
#endif // INT8_ACT
#else // Vector

#ifdef INT8_ACT

void conv2dk1_i8(int8_t *input, int8_t *kernels, int8_t *output,
                 const int32_t input_width, const int32_t input_channels,
                 const int32_t output_channels, const int scale) {
  conv2dk1_i8_vector(input, kernels, output, input_width, input_channels,
                     output_channels, scale);
}
#endif // INT8_ACT
#endif // Vector
} // extern "C"
```
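The scalar path of the conv2d kernel requantizes each 32-bit accumulator with a rounding right shift followed by saturation to the int8 range. The following standalone sketch isolates that step so its arithmetic can be checked on a host machine. The function name `shift_round_saturate` is ours, and we assume `scale >= 1`, since the expression `1 << (scale - 1)` from the kernel is undefined for `scale == 0`.

```cpp
#include <cassert>
#include <cstdint>

// Rounding right shift by `scale` bits, then clamp to [-128, 127].
// Mirrors the sum_srs computation of the scalar conv2d kernel.
// Assumption: scale is in [1, 30]. The right shift of a negative value is
// arithmetic on mainstream compilers (and guaranteed since C++20).
inline std::int8_t shift_round_saturate(std::int32_t sum, int scale) {
  assert(scale >= 1 && scale <= 30);
  constexpr std::int32_t SMAX = 127;
  constexpr std::int32_t SMIN = 128;

  // Adding half of the divisor before shifting rounds to nearest,
  // with ties going toward positive infinity.
  std::int32_t srs = (sum + (std::int32_t{1} << (scale - 1))) >> scale;

  // Saturate instead of wrapping around on overflow.
  srs = (srs > SMAX) ? SMAX : (srs < -SMIN) ? -SMIN : srs;
  return static_cast<std::int8_t>(srs);
}
```

For example, with `scale = 3` an accumulator of 100 maps to 13 (100 / 8 = 12.5, rounded up), while 5000 saturates to 127 and -5000 saturates to -128. The vector path delegates this step to the hardware through `to_vector<int8>(scaleT)` with the rounding and saturation modes configured at the top of the function, so the two paths need not round ties identically.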

# **B** Prompt construction ablation study

Here we evaluate the prompt configurations listed in Table 2, which we considered when determining the final structure of NPUEval prompts. Figure 7 illustrates how well GPT-4.1 scores when different parts of the docstring are omitted.

Table 2: Prompt construction study

| Name                              | Description                                  |
|-----------------------------------|----------------------------------------------|
| Description + examples + dataflow | Includes everything                          |
| Description + dataflow            | Omits inline examples                        |
| Description + examples            | Omits input/output size information          |
| Description only                  | Omits inline examples and input/output sizes |
| Dataflow only                     | Omits description and inline examples        |



Figure 7: Simple example of vectorizing a passthrough kernel.

Unsurprisingly, omitting dataflow information had the most grievous effect on the overall performance of the LLM. However, additionally removing the inline examples and keeping only the description significantly increased the number of passing functional tests. Without ground-truth input shapes, the model relies on the example shapes, which are not representative of the actual test sizes (it would be impractical to bake large arrays of numbers into the prompt). As a result, it either designs for the example shapes or hallucinates the expected parameters.

If the expected input buffer and parameter information are present along with a high-level description of the kernel, then adding inline examples of algorithmic behavior helps the model pass more tests. While the vectorization score is slightly higher without the examples, the delta is less than 0.1, which is close to negligible.

The codebase fully supports excluding examples or otherwise modifying the prompts.
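The configurations in this study amount to toggling three docstring components on or off. A minimal sketch of such a prompt builder is given below. The types `PromptParts` and `PromptConfig`, the function `build_prompt`, and the docstring layout are hypothetical and chosen for illustration only; they do not reflect the actual NPUEval code or prompt format.

```cpp
#include <cassert>
#include <string>

// Hypothetical pieces of a kernel prompt (not the NPUEval schema).
struct PromptParts {
  std::string signature;   // kernel function signature to be completed
  std::string description; // high-level description of the kernel
  std::string examples;    // small inline input/output examples
  std::string dataflow;    // input/output buffer sizes and parameters
};

// Which docstring components to include; the default includes everything.
struct PromptConfig {
  bool description = true;
  bool examples = true;
  bool dataflow = true;
};

// Assemble a docstring-style comment followed by the kernel signature.
std::string build_prompt(const PromptParts &p, const PromptConfig &cfg) {
  assert(!p.signature.empty()); // a prompt without a signature is not useful
  std::string out = "/*\n";
  if (cfg.description) out += p.description + "\n";
  if (cfg.examples) out += p.examples + "\n";
  if (cfg.dataflow) out += p.dataflow + "\n";
  out += "*/\n" + p.signature + "\n";
  return out;
}
```

Under these assumptions, the "Description + dataflow" row of the study corresponds to a default `PromptConfig` with `examples` set to `false`, and "Dataflow only" to one with both `description` and `examples` set to `false`.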

# C Performance vs number of iterations

We test a number of LLMs for up to 10 compilation attempts and see diminishing returns at around 5 turns. Models like DeepSeek V3, Claude Sonnet and GPT-4.1 are especially receptive to compiler feedback and show a sharp increase in passing tests.



(a) Test pass rate as number of recompilations increases



(b) Visualization of pass rate with RAG included

Figure 8: Number of compiler iterations vs pass rate.
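The iteration scheme studied here can be sketched as a bounded retry loop in which the compiler log from a failed attempt is fed back into the next generation request. The names below (`CompileResult`, `LoopOutcome`, `refine_with_feedback`) and the callback signatures are hypothetical. In this sketch the LLM call and the compiler invocation are injected as callbacks, so it makes no claim about how the NPUEval harness is implemented.

```cpp
#include <cassert>
#include <functional>
#include <string>

struct CompileResult {
  bool ok;         // true if the kernel compiled
  std::string log; // compiler diagnostics, used as feedback on failure
};

struct LoopOutcome {
  bool passed;      // whether any attempt compiled
  int attempts;     // number of attempts actually made
  std::string code; // last generated kernel
};

// Hypothetical callbacks: generate takes the previous compiler log (empty on
// the first attempt) and returns kernel source; compile builds that source.
using Generate = std::function<std::string(const std::string &feedback)>;
using Compile = std::function<CompileResult(const std::string &code)>;

LoopOutcome refine_with_feedback(const Generate &generate,
                                 const Compile &compile,
                                 int max_attempts = 10) {
  assert(max_attempts >= 1);
  std::string feedback;
  std::string code;
  int attempts = 0;
  while (attempts < max_attempts) {
    ++attempts;
    code = generate(feedback);
    CompileResult result = compile(code);
    if (result.ok) return {true, attempts, code};
    feedback = result.log; // carry the diagnostics into the next attempt
  }
  return {false, attempts, code};
}
```

Given the diminishing returns observed at around 5 turns, a smaller `max_attempts` would be a reasonable choice when evaluation cost matters more than the last few passing tests.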

# D Closer Look at DeepSeek R1 Results

A surprising result from our study was the decrease in VPU utilization of the DeepSeek R1 model when it was given vectorized code samples. It was the only tested LLM whose performance decreased.

Comparing the solution it produced without RAG (Listing 3) with the one it produced with RAG (Listing 4), we observe that the latter uses compiler-specific pragmas. R1 appears to be well versed in using pragmas for low-level optimizations like loop unrolling, and the examples in our RAG database were tailored specifically to the Chess compiler, which uses a different set of pragmas than LLVM-AIE (Peano).

```
#include <aie_api/aie.hpp>

void relu_bfloat16(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    constexpr int VECTOR_SIZE = 16;
    constexpr int NUM_VECTORS = 256 / VECTOR_SIZE;
    auto zero_vec = aie::broadcast<bfloat16, VECTOR_SIZE>(bfloat16(0.0f));
    for (int i = 0; i < NUM_VECTORS; ++i) {
        aie::vector<bfloat16, VECTOR_SIZE> vec = aie::load_v<VECTOR_SIZE>(in_buffer);
        vec = aie::max(vec, zero_vec);
        aie::store_v(out_buffer, vec);
        in_buffer += VECTOR_SIZE;
        out_buffer += VECTOR_SIZE;
    }
}
```

Listing 3: Solution without RAG (Vector score: 47%)

```
#include <aie_api/aie.hpp>

void relu_bfloat16(bfloat16 *in_buffer, bfloat16 *out_buffer) {
    const int v_factor = 32;
    v32bfloat16 zeroes = broadcast_zero_to_v32bfloat16();

    for (int i = 0; i < 256; i += v_factor) {
        chess_prepare_for_pipelining
        chess_loop_range(8, 8) {
            v32bfloat16 input = *(v32bfloat16 *)(in_buffer + i);
            v32bfloat16 output = max(input, zeroes);
            *(v32bfloat16 *)(out_buffer + i) = output;
        }
    }
}
```

Listing 4: Solution with RAG (Vector score: 26%)

While Peano will not throw an error after encountering Chess-specific pragmas, it will ignore them during optimization passes. Without being shown these examples, the LLM (Listing 3) falls back to standard AIE APIs, which already abstract optimized operations without requiring lower-level intrinsics or pragmas.

One way to address this in future work is compiler backend-specific RAG. This information could also be conveyed via system prompts, and will likely become a necessity as more NPU architectures are released, each with its own unique set of intrinsics and compiler-specific optimization paths.
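As a rough illustration of what backend-specific retrieval could look like, the sketch below tags each retrieved snippet with a compiler backend and drops snippets that do not match the target. As an extra guard for Peano, it also drops snippets containing the Chess pragmas seen in the listings above. The types `Backend` and `Snippet` and both functions are hypothetical, and the substring check is a deliberately simple heuristic rather than a proposed implementation.

```cpp
#include <string>
#include <vector>

enum class Backend { Chess, Peano };

// Hypothetical RAG database entry: a code sample and the compiler it targets.
struct Snippet {
  Backend backend;
  std::string code;
};

// Heuristic: detect the Chess-specific pragmas that appear in the listings.
bool uses_chess_pragmas(const std::string &code) {
  return code.find("chess_prepare_for_pipelining") != std::string::npos ||
         code.find("chess_loop_range") != std::string::npos;
}

// Keep only snippets suitable for the target backend.
std::vector<Snippet> select_for_backend(const std::vector<Snippet> &db,
                                        Backend target) {
  std::vector<Snippet> selected;
  for (const Snippet &s : db) {
    if (s.backend != target) continue;
    // Guard against mislabeled entries: Peano ignores Chess pragmas, so
    // showing them to the model is at best noise.
    if (target == Backend::Peano && uses_chess_pragmas(s.code)) continue;
    selected.push_back(s);
  }
  return selected;
}
```

The same backend tag could be used to select a system prompt describing which pragmas and intrinsics are valid for the target compiler.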