# POWER8 in-core Cryptography: The Unofficial Guide

Jeffrey Walton and Dr. William Schmidt

Extensive review and rough drafts: Segher Boessenkool

Publication date 1 April 2018

## **Table of Contents**

- 1. Introduction
  - Architecture
  - Compilers
  - Source code
  - Compile Farm
  - Contributing
  - Organization
- 2. Vector programming
  - PowerPC compilers
  - Altivec headers
  - Machine endianness
  - Memory allocation
  - Vector datatypes
  - Vector shifts
  - Vector permutes
  - Effective addresses
  - Aligned data references
  - Unaligned data references
  - Vector dereferences
- 3. Runtime features
  - Strategy
  - AIX features
  - Linux features
  - L1 Data Cache
- 4. Advanced Encryption Standard
  - Strategy
  - AES encryption
  - AES decryption
  - AES key schedule
- 5. Secure Hash Standard
  - Strategy
  - Ch function
  - Maj function
  - Sigma functions
  - SHA-256
  - SHA-512
- 6. Polynomial multiplication
  - CRC-32 and CRC-32C
  - GCM mode
- 7. Assembly language
  - Cryptogams
- 8. Performance
  - Power states
  - Benchmarks
- 9. References
  - Cryptogams
  - GitHub
  - IBM and OpenPOWER websites
  - NIST website
  - Stack Exchange
- Index

## **Chapter 1. Introduction**

This document is a guide to using IBM's POWER8 in-core cryptography [https://www.ibm.com/developerworks/learn/security/index.html]. The purpose of the book is to document in-core cryptography more completely for developers and quality assurance personnel who wish to take advantage of the features.

POWER8 in-core cryptography includes CPU instructions to accelerate AES, SHA-256, SHA-512 and polynomial multiplication. This document includes treatments of AES, SHA-256 and SHA-512. It does not include a discussion of polynomial multiplication at the moment, but the chapter is stubbed-out (and waiting for a contributor).

The POWER8 extensions for in-core cryptography find their ancestry in the Altivec SIMD coprocessor. The POWER8 vector unit includes Vector-Scalar Extensions (VSX) and the instruction set for in-core cryptography is a part of it. You can find additional information on VSX in Chapter 7 of the IBM Power ISA Version 3.0B [https://openpowerfoundation.org/?resource\_lib=power-isa-version-3-0] at the OpenPOWER Foundation website.

### **Architecture**

There are two POWER architectures that you will encounter as you are working on your implementation. The first is POWER7, and it is governed by ISA 2.06B documents. The second is POWER8, and it is governed by ISA 2.07 documents.

In-core cryptography requires POWER8 and ISA 2.07 support. ISA 2.07 is the level that adds the instructions for AES, SHA and polynomial multiplication. POWER7 provides other useful instructions, like unaligned loads and stores. If you are working with POWER8, then you have everything in POWER7 and earlier.

The OpenPOWER Foundation is XXX (TODO). The Foundation is responsible for maintaining and publishing the specifications for the POWER architectures, like ISA 2.06B and ISA 2.07.

### **Compilers**

The book does not favor one compiler over another. All the samples will compile with both GCC and IBM XL C/C++. XL C/C++ is IBM's flagship compiler, and it is referred to as XLC on occasion.

The samples may compile with LLVM's Clang but it was not tested. The compile farm does not have Clang installed so we could not test it. We would like to see how well Clang performs when compared to GCC and XLC. If you encounter a problem using Clang then please report it.

The compiler you use can make a measurable difference in your program. For example, you will probably obtain different benchmark results using GCC and XLC. You will even obtain different benchmark results among versions of the same compiler. For example, GCC 7.2 is generally faster than GCC 4.8.5, and both SHA-256 and SHA-512 built-in implementations will speed up by about 2 cycles per byte (cpb) using GCC 7.

Compilers are discussed in more detail at PowerPC Compilers.

### **Source code**

The source code in the book is a mix of C and C++. The SHA-256 and SHA-512 samples were written in C++ to avoid compile errors due to the SHA API requiring 4-bit literal constants. We could not pass parameters through functions and obtain the necessary constexpr-ness so template parameters were used instead.

There is no source code to download *per se*. The code is taken from Botan, Crypto++ and OpenSSL free software projects. Some code is taken from Andy Polyakov and Cryptogams. Some code is taken from GitHub projects. And some code was written and thrown away after testing.

### **Compile Farm**

The book makes frequent references to gcc112 and gcc119 from the GCC Compile Farm. The Compile Farm offers four 64-bit PowerPC machines, and gcc112 and gcc119 are the POWER8 iron (the other two are POWER7 hardware). gcc112 is a Linux PowerPC, 64-bit, little-endian machine (ppc64-le), and gcc119 is an AIX PowerPC, 64-bit, big-endian machine (ppc64-be).

Both POWER8 machines are IBM POWER System S822 with two CPU cards. gcc112 has 160 logical CPUs and runs at 3.4 GHz. gcc119 has 64 logical CPUs and runs at 4.1 GHz. At 4.1 GHz and 192 GB of RAM gcc119 is probably a contender for one of the fastest machines you will work on.

If you are a free and open software developer then you are eligible for a free GCC Compile Farm [https://cfarm.tetaneutral.net/] account. The Cfarm provides machines for different architectures, including MIPS64, Aarch64 and 64-bit PowerPC. Access is provided through SSH.

If you work on the Compile Farm then be mindful of the default GCC compiler. It is probably GCC 4.8.5, and you usually get better code generation and performance with GCC 7.2 located at /opt/cfarm/gcc-latest.

### **Contributing**

This book is free software. If you see an opportunity for improvement, an error or an omission then please submit a pull request or open a bug report.

### **Organization**

The book proceeds in eight parts. First, administrivia is discussed, like how to determine machine endianness and how to load and store a vector from memory. A full treatment of vector programming is its own book, but the discussion should be adequate to move on to the more interesting tasks.

Second, runtime feature detection is discussed for AIX and Linux. Runtime detection allows you to switch to a faster implementation at runtime when the hardware provides the support.

Third, AES is discussed. AES is specified in FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf]. You should read the standard if you are not familiar with the block cipher.

Fourth, SHA is discussed. SHA is specified in FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]. You should read the standard if you are not familiar with the hash.

Fifth, polynomial multiplication is discussed. Polynomial multiplication is important for CRC-32, CRC-32C and the GCM mode of operation for AES.

Sixth, performance is discussed. The implementations are compared against C and C++ routines and assembly language routines from OpenSSL. The OpenSSL routines are high quality and written by Andy Polyakov.

Seventh, assembly language integration is discussed. Andy Polyakov dual licenses his cryptographic implementations and you can use his routines once you know how to integrate them.

Finally, performance and benchmarking are discussed. C/C++, C++ using built-ins and assembly language routines are benchmarked using GCC.

## **Chapter 2. Vector programming**

Several topics need to be discussed to minimize trouble when using the Altivec and POWER8 extensions. They include PowerPC compilers and options, Altivec headers, machine endianness, vector datatypes, memory and alignment, and loads and stores. It is enough information to get to the point you can use AES and SHA but not much more.

Memory alignment, loads, stores and shifts will probably cause the most trouble for someone new to PowerPC vector programming. If you are new to the platform you may want to read this chapter twice. If you are experienced with the platform then you probably want to skip this chapter.

### **PowerPC compilers**

Two compilers are used for testing. The first is GCC and the second is IBM XL C/C++ (XLC). Both produce high quality code. LLVM's Clang was not tested. The compile farm lacks a Clang installation.

The GCC and XLC compilers are mostly the same but accept slightly different options. The main difference is that GCC accepts architecture options using -march and -mcpu, while XLC uses -qarch.

Compiling a test program with GCC will generally look like below. The important part is -mcpu=power8 which selects the POWER8 Instruction Set Architecture (ISA).

```
$ g++ -mcpu=power8 test.cxx -o test.exe
```

Compiling a test program with IBM XL C/C++ will generally look like below. The important parts are the C++ compiler name of xlC, and -qarch=pwr8 which selects the POWER8 ISA.

```
$ xlC -qarch=pwr8 -qaltivec test.cxx -o test.exe
```

Both compilers accept -g and -O3. If you want Position Independent Code then use -fPIC for GCC and -qpic for XLC.

If you work on the Compile Farm then be mindful of the default GCC compiler. It may be GCC 4.8.5 which is a bit old and unsupported. You usually enjoy better code generation and performance with a modern GCC like 7.2. A newer compiler is usually located at /opt/cfarm/gcc-latest.

### **Altivec headers**

The header required for datatypes and functions is <altivec.h>. To support compiles with a C++ compiler, the \_\_vector keyword is used rather than vector. A typical Altivec include looks as shown below.

```
#if defined(__ALTIVEC__)
# include <altivec.h>
# undef vector
# undef pixel
# undef bool
#endif
```

In addition to the \_\_ALTIVEC\_\_ preprocessor macro you will see the following defines depending on the platform:

- \_\_powerpc\_\_ and \_\_powerpc on AIX
- \_\_powerpc\_\_ and \_\_powerpc64\_\_ on Linux
- \_ARCH\_PWR3 through \_ARCH\_PWR9 on AIX and Linux
- \_\_linux\_\_, \_\_linux and linux on Linux
- \_AIX, and \_AIX32 through \_AIX72 on AIX
- \_\_xlc\_\_ and \_\_xlC\_\_ when using IBM XL C/C++

### **Machine endianness**

You will encounter both little-endian and big-endian machines in the field when working with a modern PowerPC architecture. Linux is generally little-endian, while AIX is big-endian.

When writing portable source code you should check the value of the preprocessor macros \_\_LITTLE\_ENDIAN\_\_ or \_\_BIG\_ENDIAN\_\_ to determine the configuration. The macro that applies to the platform is defined to a non-zero value. Source code checking endianness should look similar to the code shown below.

```
#if __LITTLE_ENDIAN__
# error "Little-endian system"
#else
# error "Big-endian system"
#endif
```
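The compile-time macros can be cross-checked against a runtime probe. The sketch below is illustrative only: the function names CompileTimeLittleEndian and RuntimeLittleEndian are hypothetical, and the macro test assumes a compiler that provides either the \_\_LITTLE\_ENDIAN\_\_ / \_\_BIG\_ENDIAN\_\_ pair or the \_\_BYTE\_ORDER\_\_ family shown in the listings in this section.

```cpp
#include <stdint.h>
#include <string.h>

// Compile-time view using the preprocessor macros discussed in this section.
// Returns 1 for little-endian, 0 for big-endian, and -1 if no known macro
// is available (an assumption of this sketch, not a compiler guarantee).
inline int CompileTimeLittleEndian()
{
#if defined(__LITTLE_ENDIAN__) || \
    (defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
     (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__))
    return 1;
#elif defined(__BIG_ENDIAN__) || \
    (defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
     (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__))
    return 0;
#else
    return -1;
#endif
}

// Runtime view: inspect the first byte of a known 32-bit value in memory.
inline bool RuntimeLittleEndian()
{
    const uint32_t word = 0x01020304;
    unsigned char bytes[sizeof(word)];
    memcpy(bytes, &word, sizeof(word));
    return bytes[0] == 0x04;   // least significant byte stored first
}
```

When the compile-time function returns 0 or 1, the runtime probe should agree with it on the same machine.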

The compilers can show the endian-related preprocessor macros available on a platform. Below is from GCC on gcc112 from the compile farm, which is ppc64-le.

```
$ g++ -dM -E test.cxx | grep -i endian
#define __ORDER_LITTLE_ENDIAN__ 1234
#define _LITTLE_ENDIAN 1
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __LITTLE_ENDIAN__ 1
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
```

And the complementary view from IBM XL C/C++ on gcc112 from the compile farm, which is ppc64-le.

```
$ xlC -qshowmacros -E test.cxx | grep -i endian
#define _LITTLE_ENDIAN 1
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __LITTLE_ENDIAN__ 1
#define __ORDER_BIG_ENDIAN__ 4321
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __ORDER_PDP_ENDIAN__ 3412
#define __VEC_ELEMENT_REG_ORDER__ __ORDER_LITTLE_ENDIAN__
```

However, below is gcc119 from the compile farm, which is ppc64-be. It runs AIX. Notice that \_\_BYTE\_ORDER\_\_, \_\_ORDER\_BIG\_ENDIAN\_\_ and \_\_ORDER\_LITTLE\_ENDIAN\_\_ are not present.

```
$ xlC -qshowmacros -E test.cxx | grep -i endian
#define __BIG_ENDIAN__ 1
#define _BIG_ENDIAN 1
#define __THW_BIG_ENDIAN__ 1
#define __HHW_BIG_ENDIAN__ 1
```

### **Memory allocation**

Library functions like malloc and calloc (and friends) are used to acquire memory from the heap. They *do not* guarantee alignment to any particular boundary on all platforms. Linux generally returns a pointer that is at least 16-byte aligned on all platforms, including ARM, PPC, MIPS and x86. AIX *does not* provide the same alignment behavior [http://stackoverflow.com/q/48373188/608639].

To avoid surprises when using heap allocations you should use posix\_memalign [http://pubs.opengroup.org/onlinepubs/009695399/functions/posix\_memalign.html] to acquire heap memory aligned to a particular boundary, and free to return it to the system.

AIX provides routines for vector memory allocation and alignment. They are vec\_malloc and vec\_free, and you can use them like \_mm\_malloc on Intel machines with Streaming SIMD Extensions (SSE).
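A minimal sketch of the posix\_memalign approach follows. The helper names AllocateAligned16 and IsAligned16 are hypothetical and only serve the illustration; the sketch assumes a POSIX system where posix\_memalign is declared in <stdlib.h>.

```cpp
#include <stdint.h>
#include <stdlib.h>   // posix_memalign and free on POSIX systems

// Hypothetical helper: returns 'size' bytes aligned to a 16-byte boundary,
// or NULL on failure. Release the memory with free().
inline void* AllocateAligned16(size_t size)
{
    void* ptr = NULL;
    if (posix_memalign(&ptr, 16, size) != 0)
        return NULL;
    return ptr;
}

// Hypothetical helper: true when the bottom 4 address bits are clear.
inline bool IsAligned16(const void* ptr)
{
    return ((uintptr_t)ptr & 0xf) == 0;
}
```

A buffer obtained this way is suitable for the aligned loads and stores discussed later in the chapter, because you control both the allocation and its alignment.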

### **Vector datatypes**

Three vector datatypes are needed for in-core cryptography. They are listed below.

- \_\_vector unsigned char
- \_\_vector unsigned int
- \_\_vector unsigned long

\_\_vector unsigned char is arranged as sixteen 8-bit bytes, and its typedef is uint8x16\_p8. \_\_vector unsigned int is arranged as four 32-bit words, and its typedef is uint32x4\_p8.

POWER8 added \_\_vector unsigned long and associated vector operations. \_\_vector unsigned long is arranged as two 64-bit double words, and its typedef is uint64x2\_p8.

The typedef naming was selected to convey the arrangement, like 32x4 and 64x2. The trailing \_p8 was selected to avoid collisions with ARM NEON vector data types. The suffix \_p (for POWER architecture) or \_v (for Vector) would work just as well.

### **Vector shifts**

Altivec shifts and rotates are performed using *Vector Shift Left Double by Octet Immediate*. The vector shift and rotate built-in is vec\_sld and it compiles/assembles to vsldoi. Both shift and rotate operate on a concatenation of two vectors. Bytes are shifted out on the left and shifted in on the right. The instructions need an integral constant in the range 0 - 15, inclusive.

Vector shifts and rotates perform as expected on big-endian machines. Little-endian machines need special handling to produce correct results and the IBM manuals don't tell you about it [http://www.ibm.com/support/knowledgecenter/SSXVZZ\_13.1.4/com.ibm.xlcpp1314.lelinux.doc/compiler\_ref/vec\_sld.html]. If you are like many other developers then you will literally waste hours trying to figure out what happened the first time you experience it.

The issue is shifts and rotates are endian sensitive [http://stackoverflow.com/q/46341923/608639], and you have to use 16-n and swap vector arguments on little-endian systems. The C++ source code provides the following template function to compensate for the little-endian behavior.

```
template <unsigned int N, class T>
T VecShiftLeft(const T val1, const T val2)
{
#if __LITTLE_ENDIAN__
    enum {R = (16-N)&0xf};
    return vec_sld(val2, val1, R);
#else
    enum {R = N&0xf};
    return vec_sld(val1, val2, R);
#endif
}
```

A VecRotateLeft would be similar to the code below, if needed. Rotate is a special case of shift where both vector arguments are the same value.

```
template <unsigned int N, class T>
T VecRotateLeft(const T val)
{
#if __LITTLE_ENDIAN__
    enum {R = (16-N)&0xf};
    return vec_sld(val, val, R);
#else
    enum {R = N&0xf};
    return vec_sld(val, val, R);
#endif
}
```
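The concatenate-and-shift behavior can be pictured with a scalar model. The sketch below is not the built-in itself; it is a hypothetical helper (ShiftLeftDoubleModel) that models vec\_sld(a, b, n) in big-endian element order only, and it makes no attempt to model the little-endian adjustment described above.

```cpp
#include <stdint.h>
#include <string.h>

// Scalar model of vec_sld(a, b, n) in big-endian element order: the result
// is bytes n through n+15 of the 32-byte concatenation a || b.
// Bytes of 'a' leave on the left and bytes of 'b' enter on the right.
inline void ShiftLeftDoubleModel(const uint8_t a[16], const uint8_t b[16],
                                 unsigned int n, uint8_t out[16])
{
    uint8_t cat[32];
    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(out, cat + (n & 0xf), 16);   // n is limited to 0-15
}
```

Passing the same array for both inputs models a rotate, which matches the observation that rotate is a special case of shift.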

### **Vector permutes**

Vector permutes allow you to rearrange elements in a vector. The values to be permuted can be in any arrangement like 64x2 or 32x4, but the mask is always an octet mask using an 8x16 arrangement.

The Altivec permute is very powerful and it stands out among architectures like ARM, Aarch64 and x86. The instruction allows you to select elements from two source vectors. When an index in the mask is in the range [0,15] then elements from the first vector are selected, and index values in the range [16,31] select elements from the second vector.

As an example, suppose you have a big-endian byte array like a message to be hashed using SHA-256. SHA operates on 32-bit words so the message needs a shuffle on little-endian systems. The code to perform the permute on a little-endian machine would look like below.

```
uint32x4_p8 msg = vec_ld(/*load from memory*/);
uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
msg = vec_perm(msg, msg, mask);
```

The previous example only needed one vector so it used msg twice in the call to vec\_perm. The Altivec code is similar to \_mm\_shuffle\_epi8 on Intel machines. An example that interleaves two different vectors is shown below.

```
uint32x4_p8 a = { 0, 0, 0, 0};    // All 0 bits
uint32x4_p8 b = {-1, -1, -1, -1}; // All 1 bits
uint8x16_p8 m = {0,1,2,3, 16,17,18,19, 4,5,6,7, 20,21,22,23};
uint32x4_p8 c = vec_perm(a, b, m);
```

After the code above executes the vector c will have the value {0, -1, 0, -1}.

IBM's documentation for vec\_perm illustrates the operation with d = vec\_perm(a, b, c). In that illustration the light gray blocks in vector d are from the first vector, and the dark gray blocks in vector d are from the second vector.
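The selection rule can be sketched in scalar code. The function below is a hypothetical model (PermuteModel), not the built-in: it indexes the 32-byte concatenation of the two inputs with each mask byte, in the element order used by the examples above, and it does not attempt to capture any little-endian subtleties of the real instruction.

```cpp
#include <stdint.h>
#include <string.h>

// Scalar model of the permute selection rule: each mask byte picks one byte
// from the concatenation of 'a' (indices 0-15) and 'b' (indices 16-31).
// Only the low 5 bits of a mask byte take part in the selection.
inline void PermuteModel(const uint8_t a[16], const uint8_t b[16],
                         const uint8_t mask[16], uint8_t out[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
    {
        const unsigned int idx = mask[i] & 0x1f;
        tmp[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
    memcpy(out, tmp, 16);   // copy last so 'out' may alias an input
}
```

With a set to all 0 bits, b set to all 1 bits and the interleaving mask from the example, the model produces alternating 4-byte groups of 0x00 and 0xff, which corresponds to {0, -1, 0, -1} when viewed as 32-bit words.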



### **Effective addresses**

Some vector operations, like loads and stores, can be sensitive to the alignment of a memory address. Operations like vec\_ld and vec\_st are sensitive, and the documentation clearly states it.

The effective address is a simple sum consisting of the memory address plus the offset into the address with the result rounded down to a multiple of 16. Effective addresses follow integer arithmetic and not pointer arithmetic.

You can calculate an effective address using the following code. Notice the bottom 4 bits are masked after calculating the sum to yield a multiple of 16.

```
uintptr_t maddr = (uintptr_t)mem_addr;
uintptr_t mask = ~(uintptr_t)0xf;
uintptr_t eaddr = (maddr+offset) & mask;
```

vec\_ld takes a pointer and an offset to load a value into a VSX register. Each of the following yields the same VSX register value because the effective addresses are the same. (Old x86 programmers should reminisce on segmented memory).

```
uint8_t* ptr1 = (uint8_t*)0x401000;
int off1 = 32;
uint8x16_p8 r1 = vec_ld(off1, ptr1);
uint8_t* ptr2 = (uint8_t*)0x401010;
int off2 = 16;
uint8x16_p8 r2 = vec_ld(off2, ptr2);
```

The following also yields the same VSX register value because the effective address is the same. If you truly wanted to load 4 bytes beyond ptr2 then you loaded the wrong value because (0x401010+4) & ~0xf = 0x401010.

```
uint8_t* ptr1 = (uint8_t*)0x401000;
int off1 = 16;
uint8x16_p8 r1 = vec_ld(off1, ptr1);
uint8_t* ptr2 = (uint8_t*)0x401010;
int off2 = 4;
uint8x16_p8 r2 = vec_ld(off2, ptr2);
```
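The arithmetic behind both examples can be checked without a vector unit. The sketch below wraps the effective address calculation in a hypothetical helper named EffectiveAddress; it works on plain integers, so no memory is touched.

```cpp
#include <stdint.h>

// Effective address as described above: the integer sum of the memory
// address and the byte offset, rounded down to a multiple of 16.
inline uintptr_t EffectiveAddress(uintptr_t maddr, int offset)
{
    const uintptr_t mask = ~(uintptr_t)0xf;
    return (maddr + (uintptr_t)offset) & mask;
}
```

Using the addresses from the examples, EffectiveAddress(0x401000, 32) and EffectiveAddress(0x401010, 16) both give 0x401020, while EffectiveAddress(0x401000, 16) and EffectiveAddress(0x401010, 4) both give 0x401010.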

The application of effective addresses is discussed more below in the section called "Aligned data references" and the section called "Unaligned data references".

### **Aligned data references**

Altivec loads and stores have traditionally been performed using vec\_ld and vec\_st since at least the POWER4 days in the early 2000s. vec\_ld and vec\_st are sensitive to alignment of the effective address. Effective addresses were discussed in the section called "Effective addresses".

Altivec does not raise a SIGBUS to indicate a misaligned load or store. Instead, the bottom 4 bits of the sum address+offset are masked-off and then the memory at the effective address is loaded.

You can use the Altivec loads and stores when you *control* buffers and ensure they are 16-byte aligned, like an AES key schedule table. Otherwise just use unaligned loads and stores to avoid trouble.

The C/C++ code to perform a load using vec\_ld should look similar to below. Notice the assert to warn you of problems in debug builds.

```
template <class T>
uint32x4_p8 VecLoad(const T* mem_addr, int offset)
{
#ifndef NDEBUG
    uintptr_t maddr = (uintptr_t)mem_addr;
    uintptr_t mask = ~(uintptr_t)0xf;
    uintptr_t eaddr = (maddr+offset) & mask;
    // The sum address+offset must already be a multiple of 16
    assert(maddr+offset == eaddr);
#endif
    return (uint32x4_p8)vec_ld(offset, mem_addr);
}
```

The C/C++ code to perform a store using vec\_st should look similar to below.

```
template <class T>
void VecStore(const uint32x4_p8 val, T* mem_addr, int offset)
{
#ifndef NDEBUG
    uintptr_t maddr = (uintptr_t)mem_addr;
    uintptr_t mask = ~(uintptr_t)0xf;
    uintptr_t eaddr = (maddr+offset) & mask;
    // The sum address+offset must already be a multiple of 16
    assert(maddr+offset == eaddr);
#endif
    vec_st((uint8x16_p8)val, offset, mem_addr);
}
```

### **Unaligned data references**

POWER7 introduced unaligned loads and stores that avoid the aligned memory requirements. The preferred built-in functions for unaligned loads and stores are vec\_xl and vec\_xst. The built-ins are available on all currently supported versions of GCC and XLC. However, older versions of GCC such as those installed on many enterprise Linux distributions do not supply them. For compatibility with these older compilers, you may use vec\_vsx\_ld and vec\_vsx\_st for GCC.

You should use the POWER7 loads and stores whenever you *do not control* buffers or their alignments, like a message in a buffer supplied by the user.

The C/C++ code to perform a load using vec\_xl and vec\_vsx\_ld should look similar to below. The function name has a u added to indicate unaligned.

```
template <class T>
uint32x4_p8 VecLoadu(const T* mem_addr, int offset)
{
#if defined(__xlc__) || defined(__xlC__)
    return (uint32x4_p8)vec_xl(offset, mem_addr);
#else
    return (uint32x4_p8)vec_vsx_ld(offset, mem_addr);
#endif
}
```

The C/C++ code to perform a store using vec\_xst and vec\_vsx\_st should look similar to below.

```
template <class T>
void VecStoreu(const uint32x4_p8 val, T* mem_addr, int offset)
{
#if defined(__xlc__) || defined(__xlC__)
    vec_xst((uint8x16_p8)val, offset, mem_addr);
#else
    vec_vsx_st((uint8x16_p8)val, offset, mem_addr);
#endif
}
```

If your code will only be compiled with supported compilers, you may simplify it to use the vec\_xl and vec\_xst variants for both XLC and GCC.

### **Vector dereferences**

The OpenPOWER ELF V2 ABI Specification [https://openpowerfoundation.org/?resource\_lib=64-bit-elf-v2-abi-specification-power-architecture], version 1.4, incorrectly states that accessing vectors on Power should preferably be done with vector pointers and the dereference operator \*. However, this is only permitted for aligned vector references. Examples in Chapter 6 of the ABI document show use of casting operations that represent undefined behavior according to the C standard. An errata document that corrects the ABI may be found at the OpenPOWER Foundation website [https://openpowerfoundation.org/?resource\_lib=openpower-elfv2-errata-elfv2-abi-version-1-4]. The preceding sections describe the proper way to use loads and stores of aligned and unaligned data.

## **Chapter 3. Runtime features**

Runtime feature detection allows code to switch to a faster implementation when the hardware permits. This chapter shows you how to determine POWER8 cryptography availability at runtime on AIX and Linux PowerPC platforms.

### **Strategy**

The strategy to detect availability of in-core cryptography on POWER processors is to check for ISA 2.07 or above. Cryptography is a required part of ISA 2.07 and POWER8, so the cryptography support cannot be omitted or disabled.

AIX systems should check for POWER8. AIX does not provide separate bits for AES, SHA and polynomial multiplies. The ISA level signals the availability of the cryptography on AIX. Linux supplies separate bits for ISA 2.07 and vector cryptography, but you only need to check the ISA level.

There is no need to perform SIGILL probes on AIX or newer Linux systems. If you are using older versions of Glibc or the Linux kernel then you may have to fall back to SIGILL probes. Older versions include Glibc 2.24 and Linux kernel 4.08 (and earlier).

### **AIX features**

The AIX system header <sys/systemcfg.h> defines the \_system\_configuration structure that identifies system characteristics. The header also provides macros to access various fields of the structure. Runtime code to perform the POWER8 cryptography check should look similar to below.

```
#include <sys/systemcfg.h>

#ifndef __power_7_andup
# define __power_7_andup() 0
#endif

#ifndef __power_8_andup
# define __power_8_andup() 0
#endif

bool HasPower7()
{
    if (__power_7_andup() != 0)
        return true;
    return false;
}

bool HasPower8()
{
    if (__power_8_andup() != 0)
        return true;
    return false;
}

bool HasCrypto()
{
    if (__power_8_andup() != 0)
        return true;
    return false;
}
```

You should not use the \_\_power\_vsx() macro to detect in-core cryptography availability. Though cryptography is implemented in the VSX unit, the VSX unit is available in POWER7 and above.

OpenSSL tests for cryptography availability on AIX in crypto/ppccap.c [https://github.com/openssl/openssl/blob/master/crypto/ppccap.c]. The project effectively reimplements the \_\_power\_8\_andup() macro.

### **Linux features**

Some versions of Glibc and the kernel provide ELF auxiliary vectors with system information. AT\_HWCAP2 will show the vcrypto flag when in-core cryptography is available. This is guaranteed for the following little-endian Linux distributions:

- Ubuntu 14.04 and later
- SLES 12 and later
- RHEL 7 and later

Below is a screen capture using the loader's diagnostics to print the auxiliary vector for the /bin/true program on gcc112.

```
$ LD_SHOW_AUXV=1 /bin/true
AT_DCACHEBSIZE:  0x80
AT_ICACHEBSIZE:  0x80
AT_UCACHEBSIZE:  0x0
AT_SYSINFO_EHDR: 0x3fff877c0000
AT_HWCAP:        ppcle true_le archpmu vsx arch_2_06 dfp ic_snoop
                 smt mmu fpu altivec ppc64 ppc32
AT_PAGESZ:       65536
AT_CLKTCK:       100
AT_PHDR:         0x10000040
AT_PHENT:        56
AT_PHNUM:
AT_BASE:         0x3fff877e0000
AT_FLAGS:        0x0
AT_ENTRY:        0x1000145c
AT_UID:          10455
AT_EUID:         10455
AT_GID:          10455
AT_EGID:         10455
AT_SECURE:       0
AT_RANDOM:       0x3fffeaeaa872
AT_HWCAP2:       vcrypto tar isel ebb dscr htm arch_2_07
AT_EXECFN:       /bin/true
AT_PLATFORM:     power8
AT_BASE_PLATFORM:power8
```

Linux systems with Glibc version 2.16 or later can use getauxval to determine CPU features. However, defines like PPC\_FEATURE2\_ARCH\_2\_07 and PPC\_FEATURE2\_VEC\_CRYPTO require Glibc 2.24. Runtime code to perform the check should look similar to below. The defines below were taken from the Linux kernel's cputable.h [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/include/asm/cputable.h].

```
#include <sys/auxv.h>

#ifndef AT_HWCAP
# define AT_HWCAP 16
#endif
#ifndef AT_HWCAP2
# define AT_HWCAP2 26
#endif
#ifndef PPC_FEATURE_ARCH_2_06
# define PPC_FEATURE_ARCH_2_06 0x00000100
#endif
#ifndef PPC_FEATURE2_ARCH_2_07
# define PPC_FEATURE2_ARCH_2_07 0x80000000
#endif
#ifndef PPC_FEATURE2_VEC_CRYPTO
# define PPC_FEATURE2_VEC_CRYPTO 0x02000000
#endif

bool HasPower7()
{
    // Parenthesize the mask; != binds tighter than &
    if ((getauxval(AT_HWCAP) & PPC_FEATURE_ARCH_2_06) != 0)
        return true;
    return false;
}

bool HasPower8()
{
    if ((getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07) != 0)
        return true;
    return false;
}

bool HasCrypto()
{
    if ((getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO) != 0)
        return true;
    return false;
}
```
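A feature test of this kind is easy to get wrong because != binds tighter than & in C and C++. The sketch below separates the bit test from getauxval so it can be exercised on any machine. The helper names HasFlag and HasFlagUnparenthesized and the shortened constant names are hypothetical; the bit values are the ones listed in this section.

```cpp
// Bit values as listed in this section (from the kernel's cputable.h).
const unsigned long FEATURE2_ARCH_2_07  = 0x80000000UL;
const unsigned long FEATURE2_VEC_CRYPTO = 0x02000000UL;

// Intended test: mask the capability word first, then compare with 0.
inline bool HasFlag(unsigned long hwcap, unsigned long flag)
{
    return (hwcap & flag) != 0;
}

// What 'hwcap & flag != 0' means when written without parentheses:
// the comparison happens first, so the expression is hwcap & 1.
inline bool HasFlagUnparenthesized(unsigned long hwcap, unsigned long flag)
{
    return (hwcap & (unsigned long)(flag != 0)) != 0;
}
```

With a capability word that has both POWER8 bits set and bit 0 clear, HasFlag reports the features while the unparenthesized form reports nothing, which is why the mask should always be wrapped in parentheses.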

### **L1 Data Cache**

The L1 data cache line size is an important security parameter that can be used to avoid leaking information through timing attacks. IBM POWER System S822 machines, like gcc112 and gcc119, have a 128-byte L1 data cache line size.

gcc119 runs AIX and the L1 data cache line size can be queried as shown below.

```
#include <sys/systemcfg.h>

int cacheLineSize = getsystemcfg(SC_L1C_DLS);
if (cacheLineSize <= 0)
    cacheLineSize = DEFAULT_L1_CACHE_LINE_SIZE;
```

gcc112 runs Linux and the L1 data cache line size can be queried as shown below. However, the call requires Glibc 2.26 and Linux kernel 4.10.

```
#include <unistd.h>

int cacheLineSize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
if (cacheLineSize <= 0)
    cacheLineSize = DEFAULT_L1_CACHE_LINE_SIZE;
```

It is important to check the return value from sysconf on Linux. CentOS 7.4 on gcc112 returns 0 for the query because Glibc is version 2.17 and the Linux kernel is version 3.10. In addition to -1, you should consider a return value of 0 as failure.

You should have a fallback strategy that includes a sane default set for DEFAULT\_L1\_CACHE\_LINE\_SIZE because Glibc does not return a failure. On 32-bit systems you can usually use 32-bytes as a default, and on 64-bit systems you can usually use 64-bytes as a default.

Returning success with a value of 0 for an unimplemented sysconf parameter appears to be a Glibc bug. Also see sysconf and \_SC\_LEVEL1\_DCACHE\_LINESIZE returns 0? [https://lists.centos.org/pipermail/centos/2017-September/166236.html] on the CentOS mailing list and Issue 14599: sysconf(\_SC\_LEVEL1\_DCACHE\_LINESIZE) returns 0 instead of 128 [https://bugs.centos.org/view.php?id=14599] in the CentOS issue tracker.
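The Linux query and its fallback can be combined into one function. The sketch below is one possible arrangement, not a definitive implementation: the function name L1DataCacheLineSize is hypothetical, the default of 64 is an assumption chosen for illustration, and the #if guard covers C libraries that do not define \_SC\_LEVEL1\_DCACHE\_LINESIZE as a macro.

```cpp
#include <unistd.h>

// Assumed fallback for this sketch; choose a value suited to your targets.
#ifndef DEFAULT_L1_CACHE_LINE_SIZE
# define DEFAULT_L1_CACHE_LINE_SIZE 64
#endif

// Returns the L1 data cache line size in bytes, or the default when the
// query is unavailable or does not produce a usable answer.
inline int L1DataCacheLineSize()
{
    long size = -1;
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    // Treat both -1 (error) and 0 (not implemented) as failure
    if (size <= 0)
        size = DEFAULT_L1_CACHE_LINE_SIZE;
    return (int)size;
}
```

The function always returns a positive value, so callers do not need to repeat the failure checks.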

## **Chapter 4. Advanced Encryption Standard**

AES is the Advanced Encryption Standard. AES is specified in FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf]. You should read the standard if you are not familiar with the block cipher.

Three topics are discussed for AES. The first is encryption, the second is decryption, and the third is keying. Keying is discussed last because encryption and decryption use the golden key schedule from FIPS 197.

### **Strategy**

Strategy

### **AES encryption**

**TODO**

### **AES decryption**

**TODO**

### **AES key schedule**

**TODO** 

## **Chapter 5. Secure Hash Standard**

SHA is the Secure Hash Standard. SHA is specified in FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]. You should read the standard if you are not familiar with the hash family.

### **Strategy**

SHA provides a lot of freedom to an implementation. You can approach your SHA implementation in several ways, but most of them will result in an under-performing SHA. This section provides one of the strategies for a better performing implementation.

The first design element is to perform everything in vector registers. The only integer operations should be reading 2 longs or 4 integers from memory during a load, and writing 2 longs or 4 integers after the round during a store.

Second, when you need an integer for a calculation you will shift it out from a vector register to another vector register using vec\_sld. Most of the time you only care about element 0 in a vector register, and the remainder of elements are "don't care" elements.

Third, don't maintain a full W[64] or W[80] table. Use X[16] instead, and transform each element in-place using a rolling strategy.

Fourth, the eight working variables {A,B,C,D,E,F,G,H} each get their own vector register. The one you care about is located at element 0, and the remaining elements in the vector are "don't care" elements.

It does not matter if you rotate the working variables {A,B,C,D,E,F,G,H} in the caller or in the callee. Both designs have nearly the same performance characteristics.

Since you are operating on X[16] in a rolling fashion instead of W[64] or W[80] the main body of your compression function will look similar to below.

```
// SHA-256 partial compression function
uint32x4_p8 X[16];
...

for (i = 16; i < 64; i++)
{
    uint32x4_p8 s0, s1, T1, T2;

    // Message schedule, updated in place in the rolling X[16] array
    s0 = sigma0(X[(i + 1) & 0x0f]);
    s1 = sigma1(X[(i + 14) & 0x0f]);

    T1 = (X[i & 0xf] += s0 + s1 + X[(i + 9) & 0xf]);
    T1 += h + Sigma1(e) + Ch(e, f, g) + KEY[i];
    T2 = Sigma0(a) + Maj(a, b, c);
}
```
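The rolling X[16] strategy can be checked against the full W[64] message schedule using plain scalar code. The sketch below is a scalar model written for illustration, not the vector implementation. The function names `ScheduleFull` and `RollingMatchesFull` are hypothetical.

```cpp
#include <cstdint>

static inline uint32_t Rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

// Lowercase sigma functions from FIPS 180-4 (SHA-256).
static inline uint32_t sigma0(uint32_t x) { return Rotr32(x, 7) ^ Rotr32(x, 18) ^ (x >> 3); }
static inline uint32_t sigma1(uint32_t x) { return Rotr32(x, 17) ^ Rotr32(x, 19) ^ (x >> 10); }

// Textbook schedule that keeps all 64 words.
inline void ScheduleFull(const uint32_t msg[16], uint32_t W[64])
{
    for (int i = 0; i < 16; i++)
        W[i] = msg[i];
    for (int i = 16; i < 64; i++)
        W[i] = sigma1(W[i - 2]) + W[i - 7] + sigma0(W[i - 15]) + W[i - 16];
}

// Rolling schedule that keeps only 16 words. Returns true when
// every word produced in place equals the textbook W[i].
inline bool RollingMatchesFull(const uint32_t msg[16])
{
    uint32_t W[64], X[16];
    ScheduleFull(msg, W);

    for (int i = 0; i < 16; i++)
        X[i] = msg[i];

    for (int i = 16; i < 64; i++)
    {
        const uint32_t s0 = sigma0(X[(i + 1) & 0xf]);
        const uint32_t s1 = sigma1(X[(i + 14) & 0xf]);
        X[i & 0xf] += s0 + s1 + X[(i + 9) & 0xf];
        if (X[i & 0xf] != W[i])
            return false;
    }
    return true;
}
```

The indexes work because, modulo 16, `i-16` is `i`, `i-15` is `i+1`, `i-7` is `i+9` and `i-2` is `i+14`.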

#### Ch function

The SHA Ch function is implemented in POWER systems using the vsel instruction or the vec\_sel built-in. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes uint64x2\_p8 arguments. The important piece of information is that x is used as the selector.

```
uint32x4_p8
VecCh(uint32x4_p8 x, uint32x4_p8 y, uint32x4_p8 z)
{
    return vec_sel(z, y, x);
}
```
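The equivalence between Ch and a bitwise select can be sanity-checked without vector hardware. The sketch below models `vec_sel(a, b, mask)` with scalar operations. The helper names `Select`, `ChRef` and `ChSelect` are hypothetical and are not part of any library.

```cpp
#include <cstdint>

// Scalar model of vec_sel(a, b, mask): each result bit is taken
// from b where the mask bit is 1, and from a where it is 0.
inline uint32_t Select(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

// Textbook Ch from FIPS 180-4.
inline uint32_t ChRef(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

// Ch written as a select with x as the selector, mirroring VecCh.
inline uint32_t ChSelect(uint32_t x, uint32_t y, uint32_t z)
{
    return Select(z, y, x);
}
```

Where a bit of x is set the result takes the bit from y, otherwise it takes the bit from z, which is the definition of Ch.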

### Maj function

The SHA Maj function is implemented in POWER systems using the vsel instruction or the vec\_sel built-in. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes uint64x2\_p8 arguments. The important piece of information is that x ^ y is used as the selector.

```
uint32x4_p8
VecMaj(uint32x4_p8 x, uint32x4_p8 y, uint32x4_p8 z)
{
    return vec_sel(y, z, vec_xor(x, y));
}
```
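The select-based Maj can be verified exhaustively, because the function operates on each bit position independently. The sketch below is a scalar check over all eight single-bit input combinations. The names `MajRef`, `MajSelect` and `MajSelectMatchesRef` are hypothetical.

```cpp
#include <cstdint>

// Textbook Maj from FIPS 180-4.
inline uint32_t MajRef(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

// Maj written as a select with x ^ y as the selector. Where x and y
// agree the result is y; where they differ, z casts the deciding vote.
inline uint32_t MajSelect(uint32_t x, uint32_t y, uint32_t z)
{
    const uint32_t mask = x ^ y;
    return (y & ~mask) | (z & mask);
}

// Checks all eight combinations of one-bit inputs.
inline bool MajSelectMatchesRef()
{
    for (uint32_t bits = 0; bits < 8; bits++)
    {
        const uint32_t x = bits & 1, y = (bits >> 1) & 1, z = (bits >> 2) & 1;
        if (MajSelect(x, y, z) != MajRef(x, y, z))
            return false;
    }
    return true;
}
```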

#### Sigma functions

POWER8 provides the vshasigmaw and vshasigmad instructions to accelerate SHA calculations for 32-bit and 64-bit words, respectively. Each instruction takes two integer constants, and the constants select among Sigma0, Sigma1, sigma0 and sigma1.

The built-in GCC functions for the instructions are \_\_builtin\_crypto\_vshasigmaw and \_\_builtin\_crypto\_vshasigmad. The XLC functions for the instructions are \_\_vshasigmaw and \_\_vshasigmad. The C/C++ wrapper for the SHA-256 functions should look similar to below.

```
uint32x4_p8 Vec_sigma0(const uint32x4_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 0, 0);
#else
    return __builtin_crypto_vshasigmaw(val, 0, 0);
#endif
}

uint32x4_p8 Vec_sigma1(const uint32x4_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 0, 0xf);
#else
    return __builtin_crypto_vshasigmaw(val, 0, 0xf);
#endif
}

uint32x4_p8 VecSigma0(const uint32x4_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 1, 0);
#else
    return __builtin_crypto_vshasigmaw(val, 1, 0);
#endif
}

uint32x4_p8 VecSigma1(const uint32x4_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 1, 0xf);
#else
    return __builtin_crypto_vshasigmaw(val, 1, 0xf);
#endif
}
```
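When testing the wrappers it helps to have a scalar model of what one 32-bit lane should produce. The sketch below assumes the meaning of the two constants implied by the wrappers above: the first selects lowercase (0) or uppercase (1) sigma, and the second selects the "0" or "1" variant. The function name `ShaSigmaW` is hypothetical.

```cpp
#include <cstdint>

inline uint32_t Rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

// Scalar model of one 32-bit lane of vshasigmaw, assuming the
// operand meaning implied by the wrappers in the text.
//   st  == 0 selects lowercase sigma, st != 0 selects uppercase Sigma
//   six == 0 selects variant 0, six != 0 selects variant 1
inline uint32_t ShaSigmaW(uint32_t x, int st, int six)
{
    if (st == 0)
    {
        return (six == 0)
            ? (Rotr32(x, 7) ^ Rotr32(x, 18) ^ (x >> 3))     // sigma0
            : (Rotr32(x, 17) ^ Rotr32(x, 19) ^ (x >> 10));  // sigma1
    }
    return (six == 0)
        ? (Rotr32(x, 2) ^ Rotr32(x, 13) ^ Rotr32(x, 22))    // Sigma0
        : (Rotr32(x, 6) ^ Rotr32(x, 11) ^ Rotr32(x, 25));   // Sigma1
}
```

Element 0 of `Vec_sigma0(val)` should then agree with `ShaSigmaW(x, 0, 0)` when `x` is element 0 of `val`, and likewise for the other three wrappers.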

#### **SHA-256**

The SHA-256 implementation has four parts. The first part loads the existing state and creates working variables {A,B,C,D,E,F,G,H}. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.

**Part 1.** Load the existing state and create working variables {A,B,C,D,E,F,G,H}.

```
uint32x4_p8 abcd = VecLoad32x4u(state+0, 0);
uint32x4_p8 efgh = VecLoad32x4u(state+4, 0);
enum {A=0,B=1,C,D,E,F,G,H};
uint32x4_p8 X[16], S[8];

S[A] = abcd; S[E] = efgh;
S[B] = VecShiftLeft<4>(S[A]);
S[F] = VecShiftLeft<4>(S[E]);
S[C] = VecShiftLeft<4>(S[B]);
S[G] = VecShiftLeft<4>(S[F]);
S[D] = VecShiftLeft<4>(S[C]);
S[H] = VecShiftLeft<4>(S[G]);
```
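The effect of the shifts in Part 1 is easier to see in an element-level model. The sketch below uses `std::array` as a stand-in for a vector register and assumes a 4-byte `VecShiftLeft` moves each 32-bit element down by one position. It ignores byte order and is not the vector code. The names `Vec32x4`, `ShiftLeftElements` and `UnpackState` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Stand-in for uint32x4_p8 (an assumption for illustration).
typedef std::array<uint32_t, 4> Vec32x4;

// Element-level model of a left shift by N 32-bit elements:
// drop the first N elements and fill the tail with zeros.
template <unsigned N>
Vec32x4 ShiftLeftElements(const Vec32x4& v)
{
    Vec32x4 r = {{0, 0, 0, 0}};
    for (unsigned i = 0; i + N < 4; i++)
        r[i] = v[i + N];
    return r;
}

// Spread the state so each working variable sits in element 0
// of its own "register". The other elements are don't-care.
inline void UnpackState(const uint32_t state[8], Vec32x4 S[8])
{
    enum {A=0, B=1, C, D, E, F, G, H};
    const Vec32x4 abcd = {{state[0], state[1], state[2], state[3]}};
    const Vec32x4 efgh = {{state[4], state[5], state[6], state[7]}};

    S[A] = abcd; S[E] = efgh;
    S[B] = ShiftLeftElements<1>(S[A]);
    S[F] = ShiftLeftElements<1>(S[E]);
    S[C] = ShiftLeftElements<1>(S[B]);
    S[G] = ShiftLeftElements<1>(S[F]);
    S[D] = ShiftLeftElements<1>(S[C]);
    S[H] = ShiftLeftElements<1>(S[G]);
}
```

After the call, element 0 of `S[A]` through `S[H]` holds the eight state words in order.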

**Part 2.** Load the message and perform the first 16 rounds.

```
const uint32_t* k = reinterpret_cast<const uint32_t*>(KEY256);
const uint32_t* m = reinterpret_cast<const uint32_t*>(data);

uint32x4_p8 vm, vk;
unsigned int i, offset=0;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<0>(X,S, vk,vm);
SHA256_ROUND1<1>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<2>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<3>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<4>(X,S, vk,vm);
SHA256_ROUND1<5>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<6>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<7>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<8>(X,S, vk,vm);
SHA256_ROUND1<9>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<10>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<11>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<12>(X,S, vk,vm);
SHA256_ROUND1<13>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<14>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<15>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;
```

**Part 3.** Perform the remaining rounds.

```
for (i=16; i<64; i+=16)
{
    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2<0>(X,S, vk);
    SHA256_ROUND2<1>(X,S, VecShiftLeft<4>(vk));
    SHA256_ROUND2<2>(X,S, VecShiftLeft<8>(vk));
    SHA256_ROUND2<3>(X,S, VecShiftLeft<12>(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2<4>(X,S, vk);
    SHA256_ROUND2<5>(X,S, VecShiftLeft<4>(vk));
    SHA256_ROUND2<6>(X,S, VecShiftLeft<8>(vk));
    SHA256_ROUND2<7>(X,S, VecShiftLeft<12>(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2<8>(X,S, vk);
    SHA256_ROUND2<9>(X,S, VecShiftLeft<4>(vk));
    SHA256_ROUND2<10>(X,S, VecShiftLeft<8>(vk));
    SHA256_ROUND2<11>(X,S, VecShiftLeft<12>(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2<12>(X,S, vk);
    SHA256_ROUND2<13>(X,S, VecShiftLeft<4>(vk));
    SHA256_ROUND2<14>(X,S, VecShiftLeft<8>(vk));
    SHA256_ROUND2<15>(X,S, VecShiftLeft<12>(vk));
    offset+=16;
}
```

**Part 4.** Repack and store the new state.

```
abcd += VecPack(S[A],S[B],S[C],S[D]);
efgh += VecPack(S[E],S[F],S[G],S[H]);

VecStore32x4u(abcd, state+0, 0);
VecStore32x4u(efgh, state+4, 0);
```

**VecLoadMsg32x4.** Perform an endian-aware load of a user message into a word.

```
template <class T>
uint32x4_p8 VecLoadMsg32x4(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
    // Reverse the bytes of each 32-bit word
    uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    uint32x4_p8 r = VecLoad32x4u(data, offset);
    return (uint32x4_p8)vec_perm(r, r, mask);
#else
    return VecLoad32x4u(data, offset);
#endif
}
```

**SHA256\_ROUND1.** Mix state with a round key and user message.

```
template <unsigned int R>
void SHA256_ROUND1(uint32x4_p8 X[16], uint32x4_p8 S[8],
                   const uint32x4_p8 K, const uint32x4_p8 M)
{
    uint32x4_p8 T1, T2;

    X[R] = M;
    T1 = S[H] + VecSigma1(S[E]);
    T1 += VecCh(S[E],S[F],S[G]) + K + M;
    T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
```

**SHA256\_ROUND2.** Mix state with a round key.

```
template <unsigned int R>
void SHA256_ROUND2(uint32x4_p8 X[16], uint32x4_p8 S[8],
                   const uint32x4_p8 K)
{
    // Indexes into the X[] array
    enum {IDX0=(R+0)&0xf, IDX1=(R+1)&0xf,
          IDX9=(R+9)&0xf, IDX14=(R+14)&0xf};

    const uint32x4_p8 s0 = Vec_sigma0(X[IDX1]);
    const uint32x4_p8 s1 = Vec_sigma1(X[IDX14]);

    uint32x4_p8 T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
    T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
    uint32x4_p8 T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
```

**VecPack.** Repack working variables.

```
uint32x4_p8 VecPack(const uint32x4_p8 a, const uint32x4_p8 b,
                    const uint32x4_p8 c, const uint32x4_p8 d)
{
    uint8x16_p8 m1 = {0,1,2,3, 16,17,18,19, 0,0,0,0, 0,0,0,0};
    uint8x16_p8 m2 = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
    return vec_perm(vec_perm(a,b,m1), vec_perm(c,d,m1), m2);
}
```

#### **SHA-512**

The SHA-512 implementation is like SHA-256 and has four parts. The first part loads the existing state and creates working variables {A,B,C,D,E,F,G,H}. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.

**Part 1.** Load the existing state and create working variables {A,B,C,D,E,F,G,H}.

```
uint64x2_p8 ab = VecLoad64x2u(state+0, 0);
uint64x2_p8 cd = VecLoad64x2u(state+2, 0);
uint64x2_p8 ef = VecLoad64x2u(state+4, 0);
uint64x2_p8 gh = VecLoad64x2u(state+6, 0);

// Indexes into the S[] array
enum {A=0, B=1, C, D, E, F, G, H};
uint64x2_p8 X[16], S[8];

S[A] = ab; S[C] = cd;
S[E] = ef; S[G] = gh;
S[B] = VecShiftLeft<8>(S[A]);
S[D] = VecShiftLeft<8>(S[C]);
S[F] = VecShiftLeft<8>(S[E]);
S[H] = VecShiftLeft<8>(S[G]);
```

**Part 2.** Load the message and perform the first 16 rounds.

```
const uint64_t* k = reinterpret_cast<const uint64_t*>(KEY512);
const uint64_t* m = reinterpret_cast<const uint64_t*>(data);

uint64x2_p8 vm, vk;
unsigned int i, offset=0;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<0>(X,S, vk,vm);
SHA512_ROUND1<1>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<2>(X,S, vk,vm);
SHA512_ROUND1<3>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<4>(X,S, vk,vm);
SHA512_ROUND1<5>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<6>(X,S, vk,vm);
SHA512_ROUND1<7>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<8>(X,S, vk,vm);
SHA512_ROUND1<9>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<10>(X,S, vk,vm);
SHA512_ROUND1<11>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<12>(X,S, vk,vm);
SHA512_ROUND1<13>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<14>(X,S, vk,vm);
SHA512_ROUND1<15>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
offset+=16;
```

**Part 3.** Perform the remaining rounds.

```
for (i=16; i<80; i+=16)
{
    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<0>(X,S, vk);
    SHA512_ROUND2<1>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<2>(X,S, vk);
    SHA512_ROUND2<3>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<4>(X,S, vk);
    SHA512_ROUND2<5>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<6>(X,S, vk);
    SHA512_ROUND2<7>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<8>(X,S, vk);
    SHA512_ROUND2<9>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<10>(X,S, vk);
    SHA512_ROUND2<11>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<12>(X,S, vk);
    SHA512_ROUND2<13>(X,S, VecShiftLeft<8>(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2<14>(X,S, vk);
    SHA512_ROUND2<15>(X,S, VecShiftLeft<8>(vk));
    offset+=16;
}
```

**Part 4.** Repack and store the new state.

```
ab += VecPack(S[A],S[B]);
cd += VecPack(S[C],S[D]);
ef += VecPack(S[E],S[F]);
gh += VecPack(S[G],S[H]);

VecStore64x2u(ab, state+0, 0);
VecStore64x2u(cd, state+2, 0);
VecStore64x2u(ef, state+4, 0);
VecStore64x2u(gh, state+6, 0);
```

**VecLoadMsg64x2.** Perform an endian-aware load of a user message into a word.

```
template <class T>
uint64x2_p8 VecLoadMsg64x2(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
    // Reverse the bytes of each 64-bit word
    uint8x16_p8 mask = {7,6,5,4, 3,2,1,0, 15,14,13,12, 11,10,9,8};
    uint64x2_p8 r = VecLoad64x2u(data, offset);
    return (uint64x2_p8)vec_perm(r, r, mask);
#else
    return VecLoad64x2u(data, offset);
#endif
}
```

**SHA512\_ROUND1.** Mix state with a round key and user message.

```
template <unsigned int R>
void SHA512_ROUND1(uint64x2_p8 X[16], uint64x2_p8 S[8],
                   const uint64x2_p8 K, const uint64x2_p8 M)
{
    uint64x2_p8 T1, T2;

    X[R] = M;
    T1 = S[H] + VecSigma1(S[E]);
    T1 += VecCh(S[E],S[F],S[G]) + K + M;
    T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
```

**SHA512\_ROUND2.** Mix state with a round key.

```
template <unsigned int R>
void SHA512_ROUND2(uint64x2_p8 X[16], uint64x2_p8 S[8],
                   const uint64x2_p8 K)
{
    // Indexes into the X[] array
    enum {IDX0=(R+0)&0xf, IDX1=(R+1)&0xf,
          IDX9=(R+9)&0xf, IDX14=(R+14)&0xf};

    const uint64x2_p8 s0 = Vec_sigma0(X[IDX1]);
    const uint64x2_p8 s1 = Vec_sigma1(X[IDX14]);

    uint64x2_p8 T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
    T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
    uint64x2_p8 T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
```

**VecPack.** Repack working variables.

```
uint64x2_p8 VecPack(const uint64x2_p8 x, const uint64x2_p8 y)
{
    const uint8x16_p8 m = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
    return vec_perm(x,y,m);
}
```

The SHA-512 implementation uses the same functions as SHA-256, but SHA-512 uses a 64x2 arrangement rather than the 32x4 arrangement. You should copy/paste/replace as required for SHA-512. For example, below is the SHA Ch for the 64x2 arrangement.

```
uint64x2_p8
VecCh(uint64x2_p8 x, uint64x2_p8 y, uint64x2_p8 z)
{
    return vec_sel(z,y,x);
}
```

In fact, since this is C++ code, a template function works nicely. The language will use the template to instantiate VecCh using both uint32x4\_p8 and uint64x2\_p8.

```
template <class T>
T VecCh(T x, T y, T z)
{
    return vec_sel(z,y,x);
}
```

Templates do not work for the Sigma functions, and you will have to supply C++ overloaded functions as shown below.

```
uint64x2_p8 Vec_sigma0(const uint64x2_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 0, 0);
#else
    return __builtin_crypto_vshasigmad(val, 0, 0);
#endif
}

uint64x2_p8 Vec_sigma1(const uint64x2_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 0, 0xf);
#else
    return __builtin_crypto_vshasigmad(val, 0, 0xf);
#endif
}

uint64x2_p8 VecSigma0(const uint64x2_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 1, 0);
#else
    return __builtin_crypto_vshasigmad(val, 1, 0);
#endif
}

uint64x2_p8 VecSigma1(const uint64x2_p8 val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 1, 0xf);
#else
    return __builtin_crypto_vshasigmad(val, 1, 0xf);
#endif
}
```
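The overloading approach can be illustrated with scalar reference functions, which are also handy for testing the 32-bit and 64-bit wrappers. The sketch below is a minimal example using the sigma definitions from FIPS 180-4. The names `RefSigma0`, `RefSigma1`, `Ref_sigma0` and `Ref_sigma1` are hypothetical.

```cpp
#include <cstdint>

inline uint32_t Rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
inline uint64_t Rotr(uint64_t x, unsigned n) { return (x >> n) | (x << (64 - n)); }

// SHA-256 overloads (32-bit words)
inline uint32_t RefSigma0(uint32_t x)  { return Rotr(x, 2) ^ Rotr(x, 13) ^ Rotr(x, 22); }
inline uint32_t RefSigma1(uint32_t x)  { return Rotr(x, 6) ^ Rotr(x, 11) ^ Rotr(x, 25); }
inline uint32_t Ref_sigma0(uint32_t x) { return Rotr(x, 7) ^ Rotr(x, 18) ^ (x >> 3); }
inline uint32_t Ref_sigma1(uint32_t x) { return Rotr(x, 17) ^ Rotr(x, 19) ^ (x >> 10); }

// SHA-512 overloads (64-bit words). The rotation amounts differ,
// which is why a single template body cannot serve both widths.
inline uint64_t RefSigma0(uint64_t x)  { return Rotr(x, 28) ^ Rotr(x, 34) ^ Rotr(x, 39); }
inline uint64_t RefSigma1(uint64_t x)  { return Rotr(x, 14) ^ Rotr(x, 18) ^ Rotr(x, 41); }
inline uint64_t Ref_sigma0(uint64_t x) { return Rotr(x, 1) ^ Rotr(x, 8) ^ (x >> 7); }
inline uint64_t Ref_sigma1(uint64_t x) { return Rotr(x, 19) ^ Rotr(x, 61) ^ (x >> 6); }
```

The compiler selects the overload from the argument type, so pass a `uint32_t` or `uint64_t` value rather than a bare integer literal to avoid an ambiguous call.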

## Chapter 6. Polynomial multiplication

This chapter should discuss polynomial multiplication used with CRC codes and the GCM mode of operation for AES. However, we have no experience with polynomial multiplication. Please refer to GitHub CRC32/vpmsum [https://github.com/antonblanchard/crc32-vpmsum].

#### CRC-32 and CRC-32C

No content. Waiting for a contributor.

#### **GCM** mode

No content. Waiting for a contributor.

## Chapter 7. Assembly language

This chapter shows you how to build and link against a POWER8 SHA assembly language routine. The function is the Cryptogams SHA-256 compression function.

Cryptogams [https://www.openssl.org/~appro/cryptogams/] is the incubator used by Andy Polyakov to develop assembly language routines for OpenSSL. Andy dual licenses his implementations and a more permissive license is available for his assembly language source code.

### **Cryptogams**

The steps that follow were carried out on gcc112, which is ppc64-le. Andy's GitHub is located at dot-asm [https://github.com/dot-asm], so clone the project and read the README.

```
$ git clone https://github.com/dot-asm/cryptogams
$ cd cryptogams
```

The README contains instructions for using the source files:

```
"Flavor" refers to ABI family or specific OS. E.g. x86_64 scripts recognize "elf", "elf32", "macosx", "mingw64", "nasm". PPC scripts recognize "linux32", "linux64", "linux64le", "aix32", "aix64", "osx32", "osx64", and so on...
```

Unfortunately Andy has not uploaded the SHA gear to Cryptogams so you will have to switch to OpenSSL to get the Cryptogams sources. Make a cryptogams directory, and then copy sha512p8-ppc.pl and ppc-xlate.pl from the OpenSSL source directory:

```
$ mkdir cryptogams
$ cp openssl/crypto/sha/asm/sha512p8-ppc.pl cryptogams/
$ cp openssl/crypto/perlasm/ppc-xlate.pl cryptogams/
$ cd cryptogams/
```

Next examine the head notes in sha512p8-ppc.pl, which is used to create the source files for SHA-256 and SHA-512. The comments say the script takes two arguments. The first is a "flavor", and the 32 or 64 is used to convey the platform architecture. Adding "le" to flavor will produce a source file for a little endian machine. The second argument is "output", and 256 or 512 in the output filename selects either SHA-256 or SHA-512.

The commands to produce a SHA-256 assembly source file for gcc112 and assemble it are shown below.

```
$ ./sha512p8-ppc.pl linux64le sha256le_compress.s
$ as -mpower8 sha256le_compress.s -o sha256le_compress.o
```

The head notes in sha512p8-ppc.pl do not state the public API. However the source file crypto/ppccap.c says:

```
$ grep -IR sha256_block_p8 *
crypto/ppccap.c:void sha256_block_p8(void *ctx, const void *inp, size_t len);
...
```

In fact the signature for sha256\_block\_p8 is better documented as shown below. There are no alignment requirements for state or input.

Finally, a program that links to sha256\_block\_p8 might look like the following.

```
$ cat test.cxx
#include <stdio.h>
#include <string.h>
#include <stdint.h>

extern "C" {
  void sha256_block_p8(uint32_t*, const uint8_t*, size_t);
}

int main(int argc, char* argv[])
{
  /* empty message with padding */
  uint8_t message[64];
  memset(message, 0x00, sizeof(message));
  message[0] = 0x80;

  /* initial state */
  uint32_t state[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
  };

  size_t blocks = sizeof(message)/64;
  sha256_block_p8(state, message, blocks);

  const uint8_t b1 = (uint8_t)(state[0] >> 24);
  const uint8_t b2 = (uint8_t)(state[0] >> 16);
  const uint8_t b3 = (uint8_t)(state[0] >>  8);
  const uint8_t b4 = (uint8_t)(state[0] >>  0);
  const uint8_t b5 = (uint8_t)(state[1] >> 24);
  const uint8_t b6 = (uint8_t)(state[1] >> 16);
  const uint8_t b7 = (uint8_t)(state[1] >>  8);
  const uint8_t b8 = (uint8_t)(state[1] >>  0);

  /* e3b0c44298fc1c14... */
  printf("SHA256 hash of empty message: ");
  printf("%02X%02X%02X%02X%02X%02X%02X%02X...\n",
         b1, b2, b3, b4, b5, b6, b7, b8);

  return 0;
}
```

Success!

## **Chapter 8. Performance**

This chapter presents benchmarking numbers and discusses some of the issues that affect performance. Benchmarking an application is an art, and collecting accurate results can be tricky.

#### **Power states**

Linux desktop systems are usually configured in either on-demand or powersave mode. The configuration is usually a kernel parameter, and the default energy states are usually efficient states that use less power. Before benchmarking you should leave on-demand or powersave mode, and enter a performance state.

Cryptogams uses a script to enter performance mode for benchmarking but it is not available online. A modified version of Andy's script is available at <code>governor.sh</code> [https://github.com/weidai11/cryptopp/blob/master/TestScripts/governor.sh]. The script changes the scaling frequency using the <code>/sys/devices/system/cpu/cpu\*/cpufreq/scaling\_governor</code> key (where <code>cpu\*</code> is a logical cpu, like <code>cpu0</code>). Below is an example of running the script on an x86\_64 Linux system.

```
$ sudo ./governor.sh perf
Current CPU governor scaling settings:
   CPU 0: powersave
   CPU 1: powersave
   CPU 2: powersave
   CPU 3: powersave
New CPU governor scaling settings:
   CPU 0: performance
   CPU 1: performance
   CPU 2: performance
   CPU 3: performance
```
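A benchmark program can also check the governor itself before it starts timing. The sketch below reads the same sysfs key the script uses. The helper names `ReadScalingGovernor` and `IsPerformanceGovernor` are hypothetical, and the code assumes a Linux system that exposes the cpufreq interface. It returns an empty string when the key cannot be read.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the governor for a logical CPU, or
// an empty string if the cpufreq key is missing or unreadable.
inline std::string ReadScalingGovernor(unsigned int cpu)
{
    std::ostringstream path;
    path << "/sys/devices/system/cpu/cpu" << cpu
         << "/cpufreq/scaling_governor";

    std::ifstream in(path.str().c_str());
    std::string governor;
    if (in)
        std::getline(in, governor);  // getline drops the trailing newline
    return governor;
}

// True only when the key was read and it says "performance".
inline bool IsPerformanceGovernor(unsigned int cpu)
{
    return ReadScalingGovernor(cpu) == "performance";
}
```

A benchmark could print a warning when `IsPerformanceGovernor(0)` returns false, since the results may then reflect a reduced clock speed.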

TODO: We are not aware of a similar script for AIX. In fact we don't know how to check a similar setting to determine if a script is needed.

#### **Benchmarks**

The table below presents benchmark statistics using standard C++, C++ with built-ins, and assembly language routines. The "standard C++" and "C++ with built-ins" columns were derived using Crypto++. The "assembly language" column was taken from Cryptogams and OpenSSL source file notes.

The measurements were taken on gcc112, which is a Linux PowerPC, 64-bit, little-endian machine. The hardware is IBM POWER System S822 with two CPU cards, 160 logical CPUs at 3.4 GHz. The kernel is CentOS 7.4 version 3.10.0-514 and the compiler is GCC 7.2.0.
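The two units used in the table are related through the clock speed. The sketch below converts throughput in MiB/s to cycles per byte, assuming a fixed clock (3.4 GHz for the machine described above) and 1 MiB = 2^20 bytes. The function name `CyclesPerByte` is hypothetical, and small differences from published figures are expected because the inputs are rounded.

```cpp
// Hypothetical helper: cycles per byte from throughput and clock speed.
//   mibPerSecond: measured throughput in MiB/s (1 MiB = 1024*1024 bytes)
//   clockHz:      CPU clock in Hz, assumed constant during the run
inline double CyclesPerByte(double mibPerSecond, double clockHz)
{
    const double bytesPerSecond = mibPerSecond * 1024.0 * 1024.0;
    return clockHz / bytesPerSecond;
}
```

For example, `CyclesPerByte(3151.0, 3.4e9)` evaluates to roughly 1.03.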

| Algorithm | Standard C++ (MiB/s) | Standard C++ (cpb) | Built-ins (MiB/s) | Built-ins (cpb) | Assembly (MiB/s) | Assembly (cpb)    |
|-----------|----------------------|--------------------|-------------------|-----------------|------------------|-------------------|
| AES/ECB   | 121                  | 26.7               | 3151              | 1.03            | -                | - <sup>†</sup>    |
| AES/CTR   | 120                  | 27.1               | 2544              | 1.27            | -                | 0.74 <sup>‡</sup> |
| AES/GCM   | 93                   | 34.7               | 474               | 6.8             | -                | - <sup>†</sup>    |
| SHA-1     | 307                  | 10.6               | N/A               | N/A             | -                | - <sup>†</sup>    |
| SHA-256   | 129                  | 25.2               | 275               | 12.0            | -                | 9.9 <sup>‡</sup>  |
| SHA-512   | 281                  | 11.5               | 368               | 8.8             | -                | 6.3 <sup>‡</sup>  |

<sup>†</sup> The Cryptogams and OpenSSL source files do not provide a meaningful metric for cycles per byte in this case. The metrics are reported as +150% or +90%, but we don't know the frame of reference.

<sup>‡</sup> As reported in the head notes of the Cryptogams and OpenSSL source files. MiB/s is not provided, and the metric is only meaningful on the same hardware using the same test configuration.

POWER8 *does not* provide instructions for SHA-1, and based on the benchmarks it is easy to see why. A good compiler generates quality code. Another factor is probably that SHA-1 has been superseded by SHA-256 and SHA-512.

## Chapter 9. References

## **Cryptogams**

CRYPTOGAMS: low-level cryptographic primitives collection [https://www.openssl.org/~appro/cryptogams/]

#### **GitHub**

- AES Intrinsics [https://github.com/noloader/AES-Intrinsics]
- SHA Intrinsics [https://github.com/noloader/SHA-Intrinsics]
- CRC32/vpmsum [https://github.com/antonblanchard/crc32-vpmsum]

## IBM and OpenPOWER websites

- Recommended debug, compiler, and linker settings for Power processor tuning [https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaal/iplsdkrecbldset.htm]
- AIX vector programming [https://www.ibm.com/support/knowledgecenter/en/ssw\_aix\_61/com.ibm.aix.genprogc/vector\_prog.htm]
- POWER8 in-core cryptography [https://www.ibm.com/developerworks/library/se-power8-in-core-cryptography/index.html]
- IBM Advance Toolchain (for latest gcc and glibc) [https://developer.ibm.com/linuxonpower/advance-toolchain/]
- 64-Bit ELF V2 ABI Specification: Power Architecture [https://openpowerfoundation.org/?resource\_lib=64-bit-elf-v2-abi-specification-power-architecture]
- IBM Power ISA Version 3.0B [https://openpowerfoundation.org/?resource\_lib=power-isa-version-3-0]
- Function calls and the PowerPC 64-bit ABI [https://www.ibm.com/developerworks/library/l-powasm4/index.html]
- Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8 [https://www.redbooks.ibm.com/redbooks/pdfs/sg248171.pdf]

#### **NIST** website

- FIPS 197, Advanced Encryption Standard (AES) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf]
- FIPS 180-4, Secure Hash Standard (SHS) [https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf]

## **Stack Exchange**

- Detect Power8 in-core crypto through getauxval? [https://stackoverflow.com/q/46144668/608639]
- Is vec\_sld endian sensitive? [https://stackoverflow.com/q/46341923/608639]
