Skip to content


Repository files navigation

This repository accompanies the paper Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1 published at IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), Issue 1, Volume 2022. The paper is also available at ePrint 2021/986.

There are several updates after the publication. To reproduce the results in the paper, please refer to the commit



It contains our source code for Dilithium, Kyber, and Saber optimized for Cortex-A72. The code is also executable on other Armv8 cores and the Apple M1. However, the benchmarking code in this repository has only been tested with the Cortex-A72.


Cloning the code

Clone the code together with all submodules

git clone --recurse-submodules


We have tested this code with a Raspberry Pi4 running Ubuntu 21.04. It is essential that you are running a 64-bit OS to be able to execute aarch64 code. The 64-bit Raspbian OS can be used, but we have not tested all code with it.

The Makefiles included in this repo assume that you are natively compiling your code using gcc. We have tested it with gcc 10.3.0.

Enable access to the cycle counter

For accurate benchmarking on Cortex-A72, we make use of the performance counters. By default the access from user mode is disabled. You will need to enable it using a kernel module.

We have included on here that you can install

cd enable_ccr
make install

Alternatively, you can get one from:

You may have to install the kernel headers manually:

sudo apt install raspberrypi-kernel-headers

In case the kernel headers can still not be found, you may have to change the uname -r to uname -m in the Makefile.

Apple M1

We have tested this code on a Apple M1 Mac mini running macOS Big Sur 11.4. We used Apple clang 12.0.5.

For cycle counting we make use of


For each of the schemes, we provide three folders:

  • ntt: Code for the core polynomial arithmetic.
  • microbenchmarks: Standalone code for benchmarking individual functions; make will produce a benchmarking binary (when executed on an A72). For running benchmarks, user space access to the PMU cycle count register needs to be enabled. A kernel module that enables it, can, for example, be found in pqax.
  • m1_benchmarks: Standalone code for benchmarking on Apple M1.
  • scheme: Contains the entire code for the scheme; ready to be placed in supercop (see below)

Benchmarking on Cortex-A72


The following instructions allow you to reproduce the microbenchmarks in Table 4 and Table 5 of the paper.

  • Copy microbenchmarks folder to the Cortex-A72
  • Type make in there
  • You should see a speed executable. Run it with ./speed
  • If you get Illegal Instruction, you need to enable access to the performance counters (see Installation)

Full scheme benchmarks on

The following instructions should allow to benchmark the full schemes (Table 6) using SUPERCOP on the Raspberry Pi4 running Ubuntu 21.04:

  • Download
  • Remove every line except the first from okcompilers/c and okcompilers/cpp to speed up benchmarking.
  • Remove #include <sys/sysctl.h> from cpucycles/armv8.c
  • Make sure that the access to the cycle counters from user mdoe is enabled before proceeding.
  • Run ./do-part used (this will take a couple of hours) or alternatively, build randombytes as follows (this is the only dependency to SUPERCOP):
    • ./do-part init
    • ./do-part crypto_stream chacha20
    • ./do-part crypto_rng chacha20
  • Copy over the scheme you want
  • Benchmark using, e.g., ./do-part crypto_kem kyber768
  • Results will be in bench/<hostname>/data
  • If you want to run more iterations, change TIMINGS in crypto_{kem/sign}/measure.c before running ./do-part crypto_{kem/sign} ...

Benchmarking on Apple M1

  • Copy m1_benchmarks to the Apple M1
  • Type make in there
  • You should see a <scheme>_test and <scheme>_speed executable.
  • Running the test, e.g.,./kyber_test should give you Test successful
  • You can run the benchmarks using, e.g., sudo ./kyber_speed which should reproduce the results in Table 4, Table 5, and Table 6.
  • Accessing the cycle counts requires to run the executable as root!


Illegal instruction when running benchmarks on Cortex-A72

Probably access to the PMU cycle counters from user space is not enabled. For enabling it, see

kpc_set_config failed or kpc_get_thread_counters failed on Apple M1

Using m1cycles.c requires root. Re-runt he executable with sudo.


This repository includes code from other sources that has the following license/license waivers

All the files in this repository are covered by CC0 by default unless stated otherwise in the beginning of the files.


No description, website, or topics provided.







No releases published


No packages published