### **Performance Tuning for GPUs -An Iterative Process**

Karl Rupp<sup>1,2</sup>

rupp@iue.tuwien.ac.at



🔰 @karlrupp

with contributions from Philippe Tillet<sup>1</sup>, Florian Rudolf<sup>1</sup>, Josef Weinbub<sup>1</sup>, Ansgar Jüngel<sup>2</sup>, Tibor Grasser<sup>1</sup> (based on stimuli from PETSc+ViennaCL users)









#### Outline



#### Introduction

#### **Positions**

PhD student at TU Wien (2009-2011)

Postdoc at ANL (09/2012-09/2013)

Postdoc at TU Wien (01/2012-09/2012, 09/2013-current)

#### Research Interests

Semiconductor device simulation

Numerical solution of PDEs

Parallel computing

#### Software Development

**PETSc** 

ViennaCI

ViennaSHE

...

#### Introduction

#### **Iterative Solvers**

Matrix-vector products and vector operations only

Expose more fine-grained parallelism

Preconditioners often desirable

### Accelerators (CUDA, OpenCL)

Graphics processing units (GPUs)

Intel Xeon Phi



#### Introduction





## **Conjugate Gradients**

#### **Pseudocode**

Choose  $x_0$ 

$$p_0 = r_0 = b - Ax_0$$

For i = 0 until convergence

- 1. Compute and store  $Ap_i$
- 2. Compute  $\langle p_i, Ap_i \rangle$
- 3.  $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, Ap_i \rangle$
- **4.**  $x_{i+1} = x_i + \alpha_i p_i$
- $5. r_{i+1} = r_i \alpha_i A p_i$
- **6**. Compute  $\langle r_{i+1}, r_{i+1} \rangle$
- 7.  $\beta_i = \langle r_{i+1}, r_{i+1} \rangle / \langle r_i, r_i \rangle$
- 8.  $p_{i+1} = r_{i+1} + \beta_i p_i$

EndFor

### **BLAS-based Implementation**

-

SpMV, AXPY

For i = 0 until convergence

- 1. SpMV  $\leftarrow$  No caching of  $Ap_i$
- 2. DOT ← Global sync!
- 3. -
- 4. AXPY
- 5. AXPY  $\leftarrow$  No caching of  $r_{i+1}$
- 6. DOT ← Global sync!
- 7. -
- 8. AXPY

EndFor



# **Conjugate Gradients**



# **Conjugate Gradient**

## **Implications**

Kernel launches expensive Delicate balance for preconditioners



# **Conjugate Gradient Optimizations**

### Optimization 1

Get best performance out of SpMV Compare different sparse matrix types

Cf.: N. Bell: Implementing sparse matrix-vector multiplication on throughput-oriented processors. *Proc. SC '09* 



# **Conjugate Gradients**



# **Conjugate Gradients**



# **Conjugate Gradient Optimizations**

### Optimization 2

Optimize kernel parameters for each operation



#### **Outline**



## Scope for OpenCL-based Portability Study

Vector and matrix-vector operations (BLAS levels 1 and 2) Limited by memory bandwidth



## Scope for OpenCL-based Portability Study

Vector and matrix-vector operations (BLAS levels 1 and 2) Limited by memory bandwidth

Key Question (Memory-Bandwidth-Limited Kernels)

Good performance of complicated kernels by optimizing the simplest kernel?



### Vector Assignment (Copy) Kernel

 $x \Leftarrow y$  for (large) vectors x, y



### Parameters (1900 variations)

Local work size, global work size

Vector types (float1, float2, ..., float16)

Thread increment type



## Vector Assignment (Copy) Kernel

```
x \leftarrow y for (large) vectors x, y
```



### Parameters (1900 variations)

### Vector Assignment (Copy) Kernel

$$x \leftarrow y$$
 for (large) vectors  $x, y$ 



## Parameters (1900 variations)

```
for (size_t i = group_start + get_local_id(0);
    i < group_end; i+= get_local_size(0))
x[i] = y[i];</pre>
```

## Vector Assignment (Copy) Kernel

```
x \Leftarrow y for (large) vectors x, y
```



### Parameters (1900 variations)

## Operations

Vector copy, vector addition, inner product Matrix-vector product



#### **Devices**

AMD: A10-5800 APU, HD 5850 GPU

INTEL: Dual Socket Xeon E5-2670, Xeon Phi

NVIDIA: GTX 285, Tesla K20m



Histograms











### Intel Xeon E5-2670



Bandwidth Inner Product (% of theoretical peak)





Bandwidth Inner Product (% of theoretical peak)



#### **NVIDIA Tesla K20m**



Rel. Bandwidth Copy Operation (%)

(comparison of vector types double, double2, double4, double8, double16)

#### **NVIDIA Tesla K20m**



Rel. Bandwidth Matrix-Vector Product Operation (%)

(comparison of vector types double, double2, double4, double8, double16)

 $[Addition|Inner\ Product|Matrix-Vector]\ vs.\ Copy\ Kernel$ 

Same Device



### **NVIDIA GeForce GTX 285**



### **NVIDIA GeForce GTX 285**







### NVIDIA Tesla K20m



### AMD Radeon HD 5850



#### INTEL Dual Xeon E5-2670



### INTEL Xeon Phi



#### Conclusio:

Focus on fastest configurations for copy-kernel sufficient



[Copy|Addition|Inner Product|Matrix-Vector] vs. Copy Kernel

Different Device, Same Vendor



# NVIDIA Hardware (x: GTX 285, y: K20m)



# AMD Hardware (x: A10-5800K GPU, y: HD 5850)



# INTEL Hardware (x: Xeon Phi, E5-2670)



# NVIDIA Hardware (x: GTX 285, y: K20m)





### AMD Hardware (x: A10-5800K GPU, y: HD 5850)





Conclusio:

Certain Performance Portability per Vendor



 $\textbf{[Copy|Addition|Inner\ Product|Matrix-Vector]\ vs.\ Copy\ Kernel}$ 

Different Device, Different Vendor











# x: AMD HD 5850, y: NVIDIA K20m





Conclusio:

Fast Configurations Across Vendors Exist



#### **Outline**







## **Conjugate Gradient Optimizations**

#### Optimization 3: Rearrange the algorithm

Remove unnecessary reads

Remove unnecessary synchronizations

Use custom kernels instead of standard BLAS



#### Standard CG

Choose  $x_0$ 

$$p_0 = r_0 = b - Ax_0$$
  
For  $i = 0$  until convergence

- 1. Compute and store  $Ap_i$
- 2. Compute  $\langle p_i, Ap_i \rangle$
- 3.  $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, Ap_i \rangle$
- **4.**  $x_{i+1} = x_i + \alpha_i p_i$
- $5. r_{i+1} = r_i \alpha_i A p_i$
- **6**. Compute  $\langle r_{i+1}, r_{i+1} \rangle$
- 7.  $\beta_i = \langle r_{i+1}, r_{i+1} \rangle / \langle r_i, r_i \rangle$
- 8.  $p_{i+1} = r_{i+1} + \beta_i p_i$

EndFor

### **Pipelined CG**

Choose  $x_0$ 

$$p_0 = r_0 = b - Ax_0$$

For i = 1 until convergence

- 1. i = 1: Compute  $\alpha_0$ ,  $\beta_0$ ,  $Ap_0$
- 2.  $x_i = x_{i-1} + \alpha_{i-1}p_{i-1}$
- 3.  $r_i = r_{i-1} \alpha_{i-1}Ap_i$
- 4.  $p_i = r_i + \beta_{i-1}p_{i-1}$
- 5. Compute and store  $Ap_i$
- 6. Compute  $\langle Ap_i, Ap_i \rangle$ ,  $\langle p_i, Ap_i \rangle$ ,  $\langle r_i, r_i \rangle$
- 7.  $\alpha_i = \langle r_i, r_i \rangle / \langle p_i, Ap_i \rangle$
- 8.  $\beta_i = (\alpha_i^2 \langle Ap_i, Ap_i \rangle \langle r_i, r_i \rangle) / \langle r_i, r_i \rangle$

EndFor







### **GMRES Optimization**

#### Generalized Minimum Residual (GMRES) Method

Krylov space span $\{r, Ar, A^2r, \dots, A^{N-1}r\}$ Orthogonal basis  $\{v_1, v_2, \dots, v_N\}$ 

#### Gram-Schmidt Method revisited

Given: orthonormal basis  $\{v_1, v_2, \dots, v_N\}$ , augment by w

$$w \leftarrow w - \sum_{i=1}^{N} \langle w, v_i \rangle v_i$$

$$w \leftarrow w/\|w\|$$

Add w to basis

### Multiple inner products $\langle w, v_i \rangle$

Performance critical (global reductions)

Reuse of w desirable



# **GMRES Optimization**

#### Custom routine *mdot*

Process  $\alpha_i = \langle w, v_i \rangle$  in batches

Batch sizes 1, 2, 3, 4, 8

Batch size 8: Only 12.5% overhead vs. arbitrary batch sizes

#### Code sketch (Batch size 4)

```
for (size_t i=thread_id; i<M; i += threads_per_group)
{
    double val_w = w[i];
    alpha_1 += val_w * v1[i];
    alpha_2 += val_w * v2[i];
    alpha_3 += val_w * v3[i];
    alpha_4 += val_w * v4[i];
}</pre>
```





### **Summary and Conclusion**

#### Conjugate Gradient Method

Careful choice of sparse matrix format

Tune kernels to target device

Minimize reads from global memory (kernel fusion, pipelining)

### Generalized Minimum Residual Method (GMRES)

Minimizes reads from global memory (mdot kernel)

Up to twice the performance of 'naive' implementations

### **Implications**

Tune primarily with memory transfers in mind

Prefer regular memory access patterns

Use appropriate vector data types

