# POWER8 in-core Cryptography The Unofficial Guide

Jeffrey Walton Dr. William Schmidt

## POWER8 in-core Cryptography: The Unofficial Guide by Jeffrey Walton and Dr. William Schmidt Extensive review and rough drafts: Segher Boessenkool

Publication date 1 April 2018

## **Table of Contents**

| 1. Introduction                 |
|---------------------------------|
| Organization                    |
| Compile Farm                    |
| 2. Vector programming           |
| PowerPC compilers               |
| Altivec headers                 |
| Machine endianness              |
| Malloc and new                  |
| Vector datatypes                |
| Vector shifts                   |
| Loads and stores                |
| 3. Runtime features             |
| AIX features                    |
| Linux features                  |
| SIGILL probes                   |
| L1 Data Cache                   |
| 4. Advanced Encryption Standard |
| AES encryption                  |
| AES decryption                  |
| AES key scheduling              |
| 5. Secure Hash Standard         |
| Sigma functions                 |
| Ch function                     |
| Maj function                    |
| SHA-256                         |
| SHA-512                         |
| 6. Polynomial multiplication    |
| CRC-32 and CRC-32C              |
| GCM mode                        |
| 7. Assembly language            |
| Cryptogams                      |
| 8. References                   |
| Cryptogams                      |
| GitHub                          |
| IBM                             |
| NIST                            |
| Stack Exchange                  |
| 9. Revision History             |
| Index                           |

## **Chapter 1. Introduction**

This document is a guide to using IBM's POWER8 in-core cryptography [https://www.ibm.com/develop-erworks/learn/security/index.html]. The purpose of the book is to document in-core cryptography more completely for developers and quality assurance personnel who wish to take advantage of the features.

POWER8 in-core cryptography includes CPU instructions to acclerate AES, SHA-256, SHA-512 and polynomial multiplication. This document includes treatments of AES, SHA-256 and SHA-512. It does not include a discussion of polynomial multiplication at the moment, but the chapter is stubbed-out (and waiting for a contributor).

The POWER8 extensions for in-core cryptography find its ancestry in Altivec SIMD coprocessor. The POWER8 vector unit includes Vector Multimedia Extension (VSX) and the instruction set for in-core cryptography are part it. You can find additional information on VSX at IBM's website at TODO [https://www.ibm.com/developerworks/].

The source code in the book is a mix of C and C++. The SHA-256 and SHA-512 samples were written in C++ to avoid compile errors due to the SHA API requiring 4-bit literal constants. We could not pass parameters through functions and obtain the necessary constexpr-ness so template parameters were used instead.

## **Organization**

The book proceeds in six parts. First, administriva is discussed, like how to determine machine endianness and how to load and store a vector from memory. A full treatment of vector programming is its own book, but the discussion should be adequate to move on to the more interesting tasks.

Second, AES is discussed. AES is specified in FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf]. You should read the standard if you are not familiar with the block cipher.

Third, SHA is discussed. SHA is specified in FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]. You should read the standard if you are not familiar with the hash.

Fourth, polynomial multiplication is discussed. Polynomial multiplications is important for CRC-32, CR-C-32C and GCM mode of operation for AES.

Fifth, performance is discussed. The implementations are compared against C and C++ routines and assembly language routines from OpenSSL. The OpenSSL routines are high quality and written by Andy Polyakov.

Finally, assembly language integration is discussed. Andy Polyakov dual licenses his cryptographic implementations and you can use his routines once you know how to integrate them.

## **Compile Farm**

The book makes frequent references to gcc112 and gcc119 from the GCC Compile Farm. The Compile Farm offers four 64-bit PowerPC machines, and gcc112 and gcc119 are the two POWER8 iron (the other two are POWER7 hardware). gcc112 is a Linux ppc64-le machine (PowerPC, 64-bit, little-endian), and gcc119 is an AIX ppc64-be machine (PowerPC, 64-bit, big-endian).

Both POWER8 machines are IBM POWER System S822 with two CPU cards. gcc112 has 160 logical CPUs, and gcc119 has 64 logical CPUs. At 4.1 GHz and 192 GB of RAM gcc119 is probably a contender for one of the fastest machine you will work on.

If you are a free and open software developer then you are eligible for a free GCC Compile Farm [https://cfarm.tetaneutral.net/] account. The Cfarm provides machines for different architectures, inlcuding MIPS64, Aarch64 and PowerPC64. Access is provided through SSH.

TODO: This book is free software. If you see an opportunity for improvement, and error or an omission then submit a pull request or open a bug report.

## **Chapter 2. Vector programming**

Polynomial multiplies, AES, SHA-256 and SHA-512 acceleration occurs in the VSX unit. Data must be moved from main memory into a vector register, the data must be transformed, and then data must be written back to main memory. Data is moved to and from main memory using vector loads and stores.

Several topics need to be discussed to ensure trouble free loads and stores. They include PowerPC compilers and options, Altivec headers, machine endianness, vector datatypes and then finally the loads and stores. It is enough information to get to the point you can use AES and SHA but not much more.

## PowerPC compilers

This documents uses two compilers for testing. The first is GCC and the second is IBM XL C/C++. Each compiler is slightly different with its options.

Compiling a test program with GCC will generally look like below. The important part is -mcpu=power8 which selects the POWER8 Instruction Set Architecture (ISA).

```
$ g++ -mcpu=power8 test.cxx -o test.exe
```

Complimentary, compiling a test program with IBM XL C/C++ will generally look like below. The important parts are the C++ compiler name of xlc, and -qarch=pwr8 which selects the POWER8 ISA.

```
$ xlC -qarch=pwr8 -qaltivec test.cxx -o test.exe
```

When compiling source code to examine the quality of code generation the program should be compiled with -03. Both compilers consume -03.

#### **Altivec headers**

The header required for datatypes and functions is <altivec.h>. To support compiles with a C++ compiler \_\_vector keyword is used rather than vector. A typical Altivec include looks as shown below.

```
#if defined(__ALTIVEC__)
# include <altivec.h>
# undef vector
# undef pixel
# undef bool
#endif
```

In addition to \_\_ALTIVEC\_\_ preprocessor macro you will see the following defines depending on the platform:

- \_\_powerpc\_\_ and \_\_powerpc on AIX
- \_\_powerpc\_\_ and \_\_powerpc64\_\_ on Linux
- \_ARCH\_PWR3 through \_ARCH\_PWR9 on AIX and Linux
- \_\_linux\_\_\_, \_\_linux and linux on Linux
- \_AIX, and \_AIX32 through \_AIX72 on AIX

#### **Machine endianness**

You will experience both little-endian and big-endian machines in the field when working with a modern PowerPC architecture. Linux is generally little-endian, while AIX is big-endian.

When writing portable source code you should check the value of preprocessor macros \_\_LIT-TLE\_ENDIAN\_\_ or \_\_BIG\_ENDIAN\_\_ to determine the configuration. The value of the macros \_\_BIG\_ENDIAN\_\_ and \_\_LITTLE\_ENDIAN\_\_ are defined to non-0 to activate the macro. Source code checking endianness should look similar to the code shown below.

```
#if __LITTLE_ENDIAN__
# error "Little-endian system"
#else
# error "Big-endian system"
#endif
```

The compilers can show the endian related preprocessor macros available on a platform. Below is from GCC on qcc112 from the compile farm, which is ppc64-le.

```
$ g++ -dM -E test.cxx | grep -i endian
#define __ORDER_LITTLE_ENDIAN__ 1234
#define _LITTLE_ENDIAN 1
#define __FLOAT_WORD_ORDER__ _ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __LITTLE_ENDIAN__ 1
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ _ORDER_LITTLE_ENDIAN__
```

And the complimentary view from IBM XL C/C++ on gcc112 from the compile farm, which is ppc64-le.

```
$ xlC -qshowmacros -E test.cxx | grep -i endian
#define _LITTLE_ENDIAN 1
#define _BYTE_ORDER__ _ORDER_LITTLE_ENDIAN__
#define __FLOAT_WORD_ORDER__ _ORDER_LITTLE_ENDIAN__
#define __LITTLE_ENDIAN__ 1
#define __ORDER_BIG_ENDIAN__ 4321
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __ORDER_PDP_ENDIAN__ 3412
#define __VEC_ELEMENT_REG_ORDER__ _ORDER_LITTLE_ENDIAN__
```

However, below is gcc119 from the compile farm, which is ppc64-be. It runs AIX and notice \_\_BYTE\_ORDER\_\_, \_\_ORDER\_BIG\_ENDIAN\_\_ and \_\_ORDER\_LITTLE\_ENDIAN\_\_ are not present.

```
$ xlC -qshowmacros -E test.cxx | grep -i endian
#define __BIG_ENDIAN__ 1
#define __BIG_ENDIAN 1
#define __THW_BIG_ENDIAN__ 1
#define __HHW_BIG_ENDIAN__ 1
```

### Malloc and new

The system calls malloc and new (and friends) are used to acquire memory from the heap. The system calls *do not* guarantee aligment to any particular boundary on all platforms. Linux generally returns a

pointer that is at least 16-byte aligned on all platforms, including ARM, PPC, MIPS and x86. AIX does not provide the same alignment behavior.

To avoid unexpected surprises when using heap allocations you should use posix\_memalign to acquire heap memory aligned to a particular boundary.

TODO: I believe AIX or XLC has another function to call for memory used with vector programming.

## **Vector datatypes**

Three vector datatypes are used for in-core programming. The three types are listed below.

- \_\_vector unsigned char
- · vector unsigned int
- \_\_vector unsigned long

\_\_vector unsigned char is 16 each 8-bit bytes, and it is typedef'd as uint8x16\_p8. \_\_vector unsigned int is 4 each 32-bit words, and it is typedef'd as uint32x4\_p8.

\_\_vector unsigned long is new to POWER8. \_\_vector unsigned long is 2 each 64-bit double words, and it is typedef'd as uint64x2\_p8.

#### **Vector shifts**

Altivec shifts and rotates are performed using *Vector Shift Left Double by Octet Immediate*. It perform as expected on big-endian machines by shifting out bytes on the left and shifting in bytes on the right. Both shift and rotate operate on a concatenation of two vectors. The instruction needs an integral constant in the range 0 - 15, inclusive. The vector shift and rotate built-in is vec\_sld and it compiles/assembles to vsldoi.

Little-endian machines need a special sauce to produce correct results and the IBM manuals don't tell you about it [https://www.ibm.com/support/knowledgecenter/SSXVZZ\_13.1.4/com.ibm.xlcpp1314.lelin-ux.doc/compiler\_ref/vec\_sld.html]. If you are like many other developers then you will literally waste hours trying to figure it out the first time it happens.

The problem is shift and rotates are endian sensitive, and you have to use 16-n and swap vector arguments on little-endian systems. The C++ source code provides the following template function to handle the behavior.

```
template <unsigned int N, class T>
inline T VectorShiftLeft(const T val1, const T val2)
{
#if __LITTLE_ENDIAN__
    enum {R = (16-N)&0xf};
    return (T)vec_sld((uint8x16_p)val2, (uint8x16_p)val1, R);
#else
    enum {R = N&0xf};
    return (T)vec_sld((uint8x16_p)val1, (uint8x16_p)val2, R);
#endif
}
```

A VectorRoatetLeft would be similar to the code below, if needed. Rotate is a special case of shift where both vector arguments are the same value.

```
template <unsigned int N, class T>
inline T VectorShiftLeft(const T val)
{
#if __LITTLE_ENDIAN__
    enum {R = (16-N)&0xf};
    return (T)vec_sld((uint8x16_p)val, (uint8x16_p)val, R);
#else
    enum {R = N&0xf};
    return (T)vec_sld((uint8x16_p)val, (uint8x16_p)val, R);
#endif
}
```

#### Loads and stores

Altivec loads and stores have traditionally been performed using vec\_ld and vec\_st since at least the POWER4 days in the 1990s. vec\_ld and vec\_st are sensitive to alignment of the effective memory address. The effective address is the address+offset rounded down or masked to a multiple of 16.

The effective address used for vec\_ld and vec\_st must be aligned to a 16-byte boundary or incorrect results will arise. Altivec *does not* raise a SIGBUS. Instead, the bottom 3 bits of the address are masked-off and then the memory at the effective address is loaded.

POWER7 introduced unaligned loads and stores that avoid the aligned memory address requirement. The instructions to use for unaligned loads and stores are vec\_vsx\_ld and vec\_vsx\_st when using GCC; and vec\_xl and vec\_xst when using XLC.

TODO: talk about vec\_xl\_be when using XLC, and the inline assembly replacement VEC\_XL\_BE when using GCC.

TODO: talk about VectorLoad and VectorStore used in the source code.

## **Chapter 3. Runtime features**

Runtime feature detections allows code to switch to a faster implementation when the hardware permits. This chapter shows you how to determine in-core crypto availability at runtime on AIX and Linux PowerPC platforms.

## **AIX** features

TODO: find out how to perform runtime feature detection on AIX. We checked getsystemcfg and sysconf for ISA 2.07, polynomial multiply, AES and SHA (and crypto) bits but they are missing.

The only thing we have found is SIGILL probes and signal handlers. It would be nice to avoid the nastiness.

#### **Linux features**

Some versions of Glibc and the kernel provide ELF auxiliary vectors with the information. AT\_HWCAP2 will show the verypto flag when in-core crypto is available. TODO: which versions?

```
$ LD_SHOW_AUXV=1 /bin/true
AT DCACHEBSIZE:
AT_ICACHEBSIZE:
                  0x80
AT UCACHEBSIZE:
                  0x0
AT SYSINFO EHDR: 0x3fff877c0000
AT HWCAP:
                  ppcle true_le archpmu vsx arch_2_06 dfp ic_snoop
                  smt mmu fpu altivec ppc64 ppc32
AT PAGESZ:
                  65536
AT_CLKTCK:
                  100
AT PHDR:
                  0x10000040
AT PHENT:
                  56
AT_PHNUM:
AT BASE:
                  0x3fff877e0000
AT_FLAGS:
                  0 \times 0
AT ENTRY:
                  0x1000145c
AT UID:
                  10455
AT_EUID:
                  10455
AT_GID:
                  10455
AT_EGID:
                  10455
AT_SECURE:
AT RANDOM:
                  0x3fffeaeaa872
AT HWCAP2:
                  vcrypto tar isel ebb dscr htm arch_2_07
AT_EXECFN:
                  /bin/true
AT PLATFORM:
                  power8
AT_BASE_PLATFORM:power8
```

Linux systems with Glibc version 2.16 can use getauxval to determine CPU features. Runtime code to perform the check should look similar to below. The defines were taken from the Linux kernel's cputable.h [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/power-pc/include/asm/cputable.h].

```
#ifndef AT_HWCAP2
# define AT_HWCAP2 26
```

```
#endif
#ifndef PPC FEATURE2 ARCH 2 07
# define PPC_FEATURE2_ARCH_2_07
                                   0x80000000
#endif
#ifndef PPC_FEATURE2_VEC_CRYPTO
# define PPC FEATURE2 VEC CRYPTO
                                   0x02000000
#endif
bool HasPower8()
    if (getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07 != 0)
        return true;
    return false;
bool HasPower8Crypto()
    if (getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO != 0)
        return true;
    return false;
```

## **SIGILL** probes

TODO: show this nasty technique.

#### L1 Data Cache

The L1 data cache line size is an important security parameter that can be used to avoid leaking information through timing attacks. IBM POWER System S822, like gcc112 and gcc119, have a 128-byte L1 data cache line size.

gcc119 runs AIX and a program can query the L1 data cache line size as shown below.

It is important to check the return value from sysconf on Linux. gcc112 runs CentOS 7.4 and the machine returns 0 for the L1 cache line query. Also see sysconf and \_SC\_LEVEL1\_DCACHE\_LINESIZE returns 0? [https://lists.centos.org/pipermail/centos/2017-September/166236.html] on the CentOS mailing list.

## **Chapter 4. Advanced Encryption Standard**

AES is the Advanced Encryption Standard. AES is specified in FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf]. You should read the standard if you are not familiar with the block cipher.

Three topics are discussed for AES. The first is encryption, the second is decryption, and the third is keying. Keying is discussed last because encryption and decryption uses the golden key schedule from FIPS 197.

## **AES** encryption

**TODO** 

## **AES** decryption

**TODO** 

## **AES** key scheduling

TODO

## **Chapter 5. Secure Hash Standard**

SHA is the Secure Hash Standard. SHA is specified in FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]. You should read the standard if you are not familiar with the hash family.

## Sigma functions

POWER8 provides a sigma instruction to accelrate SHA calculations. The instruction takes two integer arguments and the constants are used to select among Sigma 0, Sigma 1, sigma 0 and sigma 1.

#### Ch function

POWER8 provides the vsel instruction and it is SHA's Ch function. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes uin-t64x2\_p8 arguments. The important piece of information is x used as the selector.

```
uint32x4_p8
VectorCh(uint32x4_p8 x, uint32x4_p8 y, uint32x4_p8 z)
{
    return vec_sel(z, y, x);
}
```

## Maj function

POWER8 provides the vsel instruction and it can be used for SHA's Maj function. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes uint64x2\_p8 arguments. The important piece of information is x^y used as the selector.

```
uint32x4_p8
VectorCh(uint32x4_p8 x, uint32x4_p8 y, uint32x4_p8 z)
{
    return vec_sel(y, z, vec_xor(x, y));
}
```

#### **SHA-256**

**TODO** 

### **SHA-512**

TODO

## Chapter 6. Polynomial multiplication

The chapter of the document should discuss polynomial multiplication used with CRC codes and the GCM mode of operation for AES. However we have no experience with polynomial multiplication. Please refer to GitHub CRC32/vpmsum [https://github.com/antonblanchard/crc32-vpmsum].

## CRC-32 and CRC-32C

No content.

## **GCM** mode

No content.

## Chapter 7. Assembly language

TODO.

## **Cryptogams**

Cryptogams [https://www.openssl.org/~appro/cryptogams/] is Andy Polyakov's incubator to develop assembly language routines for OpenSSL. Andy dual licenses his implementations, so a more permissive license is available for the assembly language source code. This chapter will show you how to build Andy's software.

## **Chapter 8. References**

The following is a list of references we are aware of.

## **Cryptogams**

The following is a list of Cryptogams references.

CRYPTOGAMS: low-level cryptographic primitives collection [https://www.openssl.org/~appro/cryptogams/]

#### **GitHub**

The following is a list of GitHub references.

- AES Intrinsics [https://github.com/noloader/AES-Intrinsics]
- SHA Intrinsics [https://github.com/noloader/SHA-Intrinsics]
- CRC32/vpmsum [https://github.com/antonblanchard/crc32-vpmsum]
- sha2-le [https://github.com/PPC64/sha2-le]

#### **IBM**

The following is a list of IBM references.

- Recommended debug, compiler, and linker settings for Power processor tuning [https://www.ib-m.com/support/knowledgecenter/en/linuxonibm/liaal/iplsdkrecbldset.htm]
- POWER8 in-core cryptography [https://www.ibm.com/developerworks/library/se-power8-in-core-cryptography/index.html]

#### **NIST**

The following is a list of NIST references.

- FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fip-s.197.pdf]
- FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]

## Stack Exchange

The following is a list of Stack Exchange references.

Detect Power8 in-core crypto through getauxval? [https://stackoverflow.com/q/46144668/608639]

## **Chapter 9. Revision History**

Revision History Revision 1 Initial release

1 April 2018

jww, ws

## Index

#### C

Cryptogams, 12

#### S

SHA, 10 Ch function, 10 Maj function, 10 SHA-256, 10 SHA-512, 10 Sigma functions, 10