# 并行与分布式计算

## Ying Liu, Prof., Ph.D

School of Computer Science and Technology
University of Chinese Academy of Sciences
Data Mining and High Performance Computing Lab

## **CUDA Performance Optimization**

- Maximize utilization
- Maximize memory throughput
- Maximize instruction throughput
- Others

#### **Grid/Block Size Heuristics**

- # of blocks / # of stream multiprocessors (SM) > 1
  - Multiple blocks executed on a SM
    - Reduce the possibility that all the blocks in \_syncthreads()
  - Share the shared memory and registers of SM
- # of blocks > 100
  - Scale to future devices
  - 1000 blocks per grid will scale across multiple generations

#### Thread Block

- # of threads in a block must be a multiple of warp size
  - Avoid under-populated warps
- A large number of threads
  - Hide the memory access latency
  - Less registers allocated for each thread
  - Not work due to lack of registers

#### Heuristics

- Minimum: 64 threads per block
  - Only if multiple concurrent blocks
- 128 to 256 threads a better choice
  - Usually still enough registers to compile and invoke successfully
- All depends on your application/computation!
  - Experiment!

## **Hiding Memory Access Latency**

- Global memory access: 400-600 cycle latency
  - Global memory access instructions are stalled

#### Remedies:

- Increase # threads within a block
- Improve computational intensity
- Coalescing memory accesses to neighboring addresses

#### Example

- 4 global memory accesses cost 4\*400 = 1,600 cycles
- 4 concurrent threads, 1 read per thread, minimum 400 cycles

## Register

- Restrictions:
  - Number of registers per kernel
    - 8192 per SM, partitioned among concurrent threads
  - Amount of shared memory
    - 16KB per SM, partitioned among concurrent blocks
- Check .cubin file for # registers / kernel
  - nvcc –keep <filename>.cu
- Compile with nvcc –maxrregcount = N
  - N = desired maximum registers / kernel
  - At some point "spilling" into local memory may occur
    - Reduce performance local memory is slow
    - Check .cubin file for local memory usage

- Some runtime functions are asynchronous
  - Control is returned to the application before the device has completed the requested task

```
int main(int argc, char **argv) {
    .....
    operation0;
    operation1;
    operation2;
    operation3;
    .....
}
```

If operation2 is asynchronous and all others are synchronous, then operation3 will be directly executed after operation2 is invoked, no matter whether operation2 has completed or not

 Applications may query this capability by calling cudaGetDeviceProperties() and checking deviceOverlap

```
Device 0: "Tesla C1060"
 Major revision number:
 Minor revision number:
 Total amount of global memory:
                                                   4294705152 bytes
 Number of multiprocessors:
                                                   30
 Number of cores:
                                                  240
 Total amount of constant memory:
                                                   65536 bytes
 Total amount of shared memory per block:
                                                    16384 bytes
 Total number of registers available per block:
                                                    16384
 Warp size:
                                                 32
 Maximum number of threads per block:
                                                    512
 Maximum sizes of each dimension of a block:
                                                    512 x 512 x 64
 Maximum sizes of each dimension of a grid:
                                                    65535 x 65535 x 1
 Maximum memory pitch:
                                                   262144 bytes
 Clock rate:
                                                 1.30 GHz
 Concurrent copy and execution:
                                                    Yes
```

- In order to maximize parallel execution between the multiprocessors, maximize the number of concurrent kernels by using streams
- Applications manage concurrency through streams. A stream is a sequence of operations that execute in order
- Different streams, may execute their operations out of order with respect to one another or concurrently



- CUDA Stream Management API:
  - cudaStreamCreate: create an async stream
  - cudaStreamQuery: queries a stream for completionstatus
  - cudaStreamSynchronize: waits for stream tasks to complete
  - cudaStreamDestroy: destroys and cleans-up a stream object

10

```
// allocate and initialize an array of stream handles
cudaStream t stream[2];
for (int i = 0; i < 2; ++i)
   cudaStreamCreate(&stream[i]);
// execution with streams
for (int i = 0; i < 2; ++i)
   cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
          cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
   myKernel<<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
   cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
   size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
// release resources
for (int i = 0; i < nstreams; i++)
   cudaStreamDestroy(streams[i]);<sub>2020-12-24</sub>
```

11

## **CUDA Performance Optimization**

- Maximize utilization
- Maximize memory throughput
- Maximize instruction throughput
- Others

#### **Data Transfer between Host and Device**

- Minimize data transfer between CPU and GPU
  - 8 GB/sec
- Use page-locked host memory
- Use virtual memory

#### **Device Memory Access**

- Global memory 400-600 clock cycles latency
  - Performance bottleneck
- Coalesced vs. non-coalesced
  - Experiment
    - Kernel: read a float, increment, write back
    - 3M floats (12MB), times averaged over 10K runs
  - 12K blocks x 256 threads
    - 356 μs coalesced
    - 357 μs coalesced, some threads don't participate
    - 3,494 μs permuted/misaligned thread access
- Conclusion
  - Orders of magnitude improvement!
  - Critical to small or memory-bound kernels

## **Memory Bandwidth**

- Device memory should be minimized
  - Device memory is much higher latency and lower bandwidth than on-chip memory
- Global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a halfwarp (16 threads) can be coalesced into a single memory transaction

## **Memory Bandwidth**

- Device reads 32-bit, 64-bit or 128-bit words from global memory into registers in a single transaction
  - Data type must be sizeof(type) = 4, 8, 16 bytes
  - Data type must be aligned
- Size of a memory transaction
  - 64 bytes each thread reads a word: int, float, ...
  - 128 bytes each thread reads a double-word: int2, float2, ...
  - 32 bytes (compute capability 1.2+) each thread reads a short int, ...

## Coalescing

#### Requirements

- The starting address of global memory is usually aligned to times of 64, 128 bytes
- Sizeof(type) must be 4, 8, or 16
- All 16 words must lie in the same segment
- All threads in a warp access the words in sequence
- Divergent warp is allowed
- Multiple threads access the same address (for devices with Compute Capability 1.2 and higher)



**Coalesced Access** 

@ Ying Liu 2020-12-24 18



**Non-coalesced Access** 



**Non-coalesced Access** 

@ Ying Liu 2020-12-24 20

The alignment specifiers \_align\_(8) or \_align\_(16)

```
Struct _align_ (8) {
   float a;
   float b;
Struct _align_ (16) {
   float a;
   float b;
   float c;
Struct _align_ (16) {
   float a;
   float b;
   float c;
   float d;
   float e;
```

#### **Uncoalesced float3 Code**

```
global__ void accessFloat3(float3 *d_in,
float3* d_out)
int index = blockldx.x * blockDim.x + threadldx.x;
float3 a = d_in[index];
a.x += 2;
a.y += 2;
a.z += 2;
d out[index] = a;
```

@ Ying Liu 2020-12-24 22

#### **Uncoalesced float3 Code**

- float3 (12 bytes): float3 f = d\_in[threadIdx.x];
- Each thread ends up executing 3 reads
  - sizeof(float3) ≠ 4, 8, or 16
  - Half-warp reads three 64B non-contiguous regions



#### **Coalescing float3 Access**

A 3-step approach (256 threads/block)



Similarly, Step3 starting at offset 512

#### **Coalescing float3 Access**

- Use shared memory to allow coalescing
  - 256 threads per block
  - A thread block needs sizeof(float3)x256 bytes of SMEM
  - Each thread reads 3 scalar floats:
    - likely be processed by other threads, so sync
- Processing
  - Each thread retrieves its float3 from SMEM array
    - Cast the SMEM pointer to (float3\*)
    - Use thread ID as index
  - Rest of the compute code does not change!



#### **Coalescing float3 Access Code**

```
global__ void accessInt3Shared(float *g_in, float *g_out)
                     int index = 3 * blockldx.x * blockDim.x + threadldx.x;
                     __shared__ float s_data[256*3];
                     s_data[threadIdx.x] = g_in[index];
Read the input
                     s_data[threadIdx.x+256] = g_in[index+256];
through SMEM
                     s_data[threadIdx.x+512] = g_in[index+512];
                     __syncthreads();
                     float3 a = ((float3*)s_data)[threadIdx.x];
                     a.x += 2;
Compute code
                     a.y += 2;
is not changed
                     a.z += 2:
                     ((float3*)s_data)[threadIdx.x] = a;
                     __syncthreads();
Write the result
                     g_out[index] = s_data[threadldx.x];
through SMEM
                     g_out[index+256] = s_data[threadIdx.x+256];
                     g_out[index+512] = s_data[threadIdx.x+512];
```

#### Coalescing: Structures of Size $\neq$ 4, 8, or 16B

Compiler enforces the alignment specifiers

```
- __align__(X), where X = 4, 8, or 16
   struct __align__(16) {
        float x;
        float y;
        float z;
     };

    Waste some memory space
```

float3



#### Coalescing: Structures of Size $\neq$ 4, 8, or 16B

Use a structure of arrays instead of AoS



@ Ying Liu 2020-12-24 28

## **Solve Non-Coalesced Memory Access**

- If all threads in a half-warp read the identical address, use constant memory
- For sequential access patterns, but sizeof(struct) ≠4, 8, or 16 bytes:
  - Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
  - Or force structure alignment by compiler
    - Using \_\_align(X), where X = 4, 8, or 16
  - Or use shared memory to achieve coalescing

#### **Tiled Multiply**

0 1 2

bx

Make sure that tiles are all loaded in vertical patterns from the global memory

tv

TILE WIDTH

by <sub>1</sub>

@ Ying Liu



#### **Block IDs and Thread IDs**

Each thread uses IDs to decide what data to work on **Device** Block ID: 1D or 2D Grid 1 Thread ID: 1D, 2D, or 3D **Block Block Block** (1, 0)(2, 0)(0, 0)Block **Block Block** (0, 1)(1, 1)(2, 1)**Block (1, 1) Thread** Thread **Thread Thread** (0, 0)(1,0)(2, 0)(3, 0)(4, 0)**Thread Thread Thread** Thread **Thread** (0, 1)(1,1)(2, 1)(3, 1)(4, 1)**Thread Thread Thread Thread Thread** (0, 2)(1, 2)(2, 2)(3, 2)(4, 2)

@ Ying Liu 2020-12-24 31

## **Memory Access Pattern**



## Tiling Size Effects

- For good bandwidth utilization, accesses should be aligned and consist of 16 contiguous words
- Tile size 16X16 minimal required to achieve full coalescing
  - Both reduction of global memory accesses and more efficient execution of the accesses



# Accesses in Different Compute Capability







## **Shared Memory**

- Shared memory is divided into equallysized memory modules (banks)
- 16 banks on devices (capability 1.x)
- A memory can service as many simultaneous accesses as it has banks when no bank conflict
- One read/write per two cycles per bank





Big memory bandwidth savings

@ Ying Liu

2020-12-24

#### How Addresses Map to Banks on G80

- The bandwidth of each bank is 32 bits (4Bytes) per two clock cycles
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = 4-byte address % 16
  - Same as the size of a half-warp
    - A shared memory request for a warp is split into two requests
    - No bank conflicts between different half-warps
- If two requests fall in the same bank, bank conflict!!!



15, 31, 47, ...

Must be serialized

Bank 15

#### **Bank Addressing Example**



\_\_shared\_\_ float shared[256];
float foo = shared[base + s \* threadIdx.x];

@ Ying Liu 2020-12-24 43

## **Bank Addressing Example**



- 8-way bank conflicts
  - Linear addressing stride == 8 (s=8)



\_\_shared\_\_ float shared[256];
float foo = shared[base + s \* threadIdx.x];

#### **Memory Bank Conflicts**

- If no bank conflicts, shared memory is fast
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank

## **Linear Addressing**

Given:

```
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
```

This is only bank-conflict-free if s shares no common factors with the number of banks

2020-12-24

16 on G80, so s must be odd

@ Ying Liu





#### **Data Types and Bank Conflicts**

- Linear addressing
  - Element smaller than 32-bits
    - 4-way bank conflicts
       \_\_shared\_\_ char shared[];
       foo = shared[baseIndex + threadIdx.x];
    - 2-way bank conflicts
      \_\_shared\_\_ short shared[];foo = shared[baseIndex + threadIdx.x];



#### **Structs and Bank Conflicts**

Struct assignments are compiled into as many memory accesses as there are struct members:

```
struct myType {
    float f;
    int c;
};
__shared__ struct myType myTypes[32];
struct myType m = myTypes[baseIndex
    + threadIdx.x];
```

- This has 2-way bank conflicts for myType
  - 2 accesses per thread



#### **Structs and Bank Conflicts**

Struct assignments are compiled into as many memory accesses as there are struct members:

- 3-way bank conflicts for vector
  - struct size is 3 words
  - 3 accesses per thread, contiguous banks (no common factor with 16)



#### **Bank Conflict - 1D**

- Each thread loads 2 words to shared memory:
  - 2-way-interleaved loads result in 2way bank conflicts



#### A Better Access Pattern

 Each thread loads one element in every consecutive group of blockDim elements

```
shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid +
    blockDim.x];
```



#### Bank Conflict - 2D

- 2D floating array
  - image processing, e.g.
- 16x16 block
  - Half-warp threads access one column (example: column 1 in purple)
  - 16-way bank conflicts



Bank 0 1 2 3 4 5 6 7

#### Bank Conflict - 2D

- Solution 1: pad the rows
  - Pad an element at the end of each row
- Solution 2: transpose the matrix before processing
  - Suffer bank conflicts during transpose
  - But possibly save them later



# **Shared Memory Optimization**

- Threads read SMEM with stride = 16
  - Bank conflicts

#### Solution

- Allocate an "extra" column
- Read stride = 17
- Threads read from consecutive banks

#### Reads from SMEM





#### How about Matrix Multiplication?

■ Matrix  $P = M \times N$ , (16x16)

When reading Ms, half-warps access the same bank, broadcast! for (int k = 0; k < 16; ++k) //ty is constant over a half-warp Csub += Ms[ty][k] \* Ns[k][tx];//here k is fixed over a half-warp





@ Ying Liu

2020-12-24

#### How about Matrix Multiplication?

Matrix P = M × N, (16x16 block)

```
When reading Ns, half-warps access different banks, no conflicts! for (int k = 0; k < 16; ++k) //tx is successive over a half-warp Csub += Ms[ty][k] * Ns[k][tx];
```



@ Ying Liu

2020-12-24

## **Shared Memory Access (Capability 2.x)**

- Shared memory has 32 banks
- Successive 32-bit words map to successive banks
- Each bank has a bandwidth of 32 bits per two clock cycles
- A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word

#### **Shared Memory Access (Capability 2.x)**



@ Ying Liu 2020-12-24 60

## **Shared Memory Access (Capability 2.x)**



@ Ying Liu 2020-12-24 61

## **Shared Memory Access (Capability 3.x)**

- Shared memory has 32 banks with two addressing modes
- 64-bit mode: successive 64-bit words map to successive banks
- 32-bit mode: successive 32-bit words map to successive banks
- Each bank has a bandwidth of 64 bits per two clock cycles

## **CUDA Performance Optimization**

- Optimization strategies
- Maximize utilization
- Maximize memory throughput
- Maximize instruction throughput
- Others

## **Instruction Throughput**

- Use intrinsic functions, less accurate, but faster device-only version

  - Minimize divergent warps
  - Reduce the number of synchronizations

@ Ying Liu 2020-12-24 66

## **CUDA Performance Optimization**

- Optimization strategies
- Maximize utilization
- Maximize memory throughput
- Maximize instruction throughput
- Others

## Unrolling

```
Ctemp = 0;
                                  Ctemp = 0;
for (...) {
                                  for (...) {
    shared float As[16][16];
                                      shared
                                                float As[16][16];
                                      shared float Bs[16][16];
    shared float Bs[16][16];
                                    // load input tile elements
  // load input tile elements
 As[ty][tx] = A[indexA];
                                    As[ty][tx] = A[indexA];
 Bs[ty][tx] = B[indexB];
                                    Bs[ty][tx] = B[indexB];
  indexA += 16;
                                     indexA += 16;
  indexB += 16 * widthB;
                                     indexB += 16 * widthB;
   syncthreads();
                                      syncthreads();
                                     // compute results for tile
  // compute results for tile
  for (i = 0; i < 16; i++)
                                    Ctemp +=
                                       As[ty][0] * Bs[0][tx];
      Ctemp += As[ty][i]
        * Bs[i][tx];
                                     Ctemp +=
                                       As[tv][15] * Bs[15][tx];
    syncthreads();
                                      syncthreads();
C[indexC] = Ctemp;
                                  C[indexC] = Ctemp;
       (b) Tiled Version
                                        (c) Unrolled Version
```

68

#### **Instruction Mix Considerations**

- for (int k = 0; k < BLOCK\_SIZE; ++k)</p>
  Pvalue += Ms[ty][k] \* Ns[k][tx];
- There are very few mul/add between branches and address calculation
- Loop unrolling can help

```
Pvalue += Ms[ty][0] * Ns[0][tx] + ... + Ms[ty][15] * Ns[15][tx];
```

# **Prefetching**

- One could double buffer the computation, getting better instruction mix within each thread
  - This is classic software pipelining in ILP compilers

```
Loop {

Load current tile to shared memory

syncthreads()

Compute current tile

syncthreads()
}
```

```
Load next tile from global
memory
Loop {
  Deposit current tile to shared
memory
  syncthreads()
  Load next tile from global
memory
  Compute current tile
  syncthreads()
```

@ Ying Liu

2020-12-24

70

#### **Matrix Multiplication Space**



#### Summary

- In decreasing order of importance
  - Algorithm parallelization
  - Asynchronous concurrent execution
  - Use a multiple of 32 threads per block and at least as many blocks as multiprocessors
  - Access global memory properly
  - Avoid shared memory bank conflicts
  - Use "-use-fast-math" option if you have operations like sine and cosine on the device
  - Have as few branching conditional loops as possible
  - Have small loops unrolled
  - Have no unnecessary \_\_syncthreads() calls
- The defaults in the complier have basically always been the best