

## BITONIC SORT PERFORMANCES

COMPUTER ARCHITECTURE PROJECT
A.A. 2023/2024

Giovanni Enrico Loni Lorenzo Mancinelli

## Table of Contents

01 ALGORITHM & GOALS

02 CPU IMPLEMENTATION

03 GPU IMPLEMENTATION

04 CONCLUSIONS

## O1 ALGORITHM & GOALS

Introduction to the algorithm and goals to be achieved

## THE ALGORITHM

**Bitonic sort** breaks the data into "bitonic" sequences, then iteratively compare and swap elements within these sequences, progressively merging them until the entire data is sorted.



## MAIN GOALS

- ▲ CPU IMPLEMENTATION

  Execution time for arrays with 2<sup>24</sup> entries ~1 s.
- GPU IMPLEMENTATION Execution time for arrays with 2<sup>30</sup> entries < 20 s.</p>

Requirements reached!

## O2 CPU IMPLEMENTATION

Study of performance on CPU as threads and load vary. Improving performance exploiting -02 and parallel merge.

### HARDWARE & SOFTWARE

#### **HARDWARE**

### AMD Ryzen 7 5800H

- 8 cores, 16 thread
- Cache:
  - L1D 256 KiB 8-way associative
  - L1I 256 KiB 8-way associative
  - L2 4 MiB 8-way associative
  - L3 16 MiB 16-way associative

### **OPERATIVE SYSTEM**

- EndeavourOS Linux x86\_64
- Kernel: 6.8.7-arch1-1

### **PROFILER**

 $AMD\mu Prof 4.2$ 

### COMPILER

- gcc 13.2.1
- -lrt -pthread

## EXECUTION TIME ON SMALL ARRAYS





For exponents up to 12 it is better to use the sequential version of the algorithm!

## EXECUTION TIME ON BIG ARRAYS



### SPEEDUP ON BIG ARRAYS



## FUNCTION HOTSPOTS





| Function                            | L2_CACHE_ACCESS_FROM_L1_DC_MISS [sample count] |
|-------------------------------------|------------------------------------------------|
| bitonic_merge(int*, int, int, bool) | 85                                             |
| random                              | 2                                              |
| ld-linux-x86-64.so.2!0x00008ae7     | 1                                              |
| random_r                            | 1                                              |
| [PLT] rand                          | 0                                              |

### FUNCTION HOTSPOTS

```
C++
```

```
if ((arr[i] > arr[i + mid]) == direction) {...}
```

```
lea
        rcx, [0+rax*4]
        rax, QWORD PTR [rbp-24]
mov
add
       rax, rcx
        eax, DWORD PTR [rax]
mov
cmp
        edx, eax
setg
        al
        edx, al
movzx
        eax, BYTE PTR [rbp-36]
MOVZX
        edx, eax
cmp
jne
        .L22
```

Almost 25% of retired instructions of the entire program!

Assembly

### PARALLELIZATION OF BITONIC MERGE



Swaps are executed in parallel, unlike the initial Bitonic Merge where they were executed sequentially.

### EXECUTION TIME WITH PARALLEL MERGE

The improvement can be seen more in the singlethread case than in the multi-thread case



## SPEEDUP WITH PARALLEL MERGE



Each speedup is related to its own optimization

## GOALS ACHIEVED

### **EXECUTION TIME**

- Almost reached for parallel merge (1,09 s).
- ✓ Fully reached through 02 (0,64-0,44 s).

### **SPEEDUP**

- ✓ Best value with parallel merge (x11).
- Reasonable value through parallel merge -02.

## 03 GPU IMPLEMENTATION

Study of performance on GPU as threads per block and load vary. Analysis with profiler.

### NEW APPROACH FOR IMPLEMENTATION LOGIC



Instead of recursive calls for different sizes of the sequences, swaps inside sequences with the same length are done in parallel.

### HARDWARE & SOFTWARE

### HARDWARE NVIDIA Tesla T4

- Compute Power: 7.5
- CUDA Cores: 2560
- Streaming multi-processors: 40
- Warp size: 32
- Max threads per block: 1024
- Memory: 16 GB GDDR6
- Bandwidth: 300GB/s
- Interconnection bandwidth: 32GB/s

### **OPERATIVE SYSTEM**

- Ubuntu 18.04.3 LTS
- Kernel: Linux 4.15.0-188-generic

#### **PROFILER**

Nsight Compute 2024.1.1

### COMPILER

■ nvcc 9.1.85

### EXECUTION TIME WITH MEMCPY

#blocks = (arraySize + numThreads - 1) / numThreads



### EXECUTION TIME - IMPACT OF MEMCPY



### GPU COMPUTE & MEMORY THROUGHPUT

More time spent exchanging data with memory than in the algorithm itself: possible **memory bottleneck**.





### GPU COMPUTE & MEMORY THROUGHPUT

#### ■ UNCOALESCED GLOBAL ACCESSES



Because of the nature of the algorithm, we can't ensure coalesced memory accesses to the array to sort!

#### ▲ KERNEL FUSION

Iterate inside the bitonic\_sort\_kernel to execute a single kernel: power of GPU not exploited, because the execution time increases a lot!

### WARP STATE



Only about half of the threads are active in a warp due to thread divergence when comparing the values to swap.



Possible solution: replace the conditional blocks inside the kernel with already calculated expressions to understand which elements to exchange.

## GOALS ACHIEVED

### **EXECUTION TIME**

- ✓ Fully reached: sorting 4GB of int in ~15 s using 128 threads per block.
- ! Side note: sorting on GPU is much faster than on CPU already starting from arrays with 2<sup>16</sup> elements!

## 04 CONCLUSIONS

Final considerations on the results obtained.

### CPU

Despite having a smaller speedup, the best setup is Parallel Merge -02. For this reason, this would be the final implementation of the algorithm.

### **GPU**

Goal has been reached, so no other improvements are needed: modifying memory accesses and thread divergence leads to worse execution times.

# THANKS FOR ATTENTION!