# Performance of Distributed and Shared Memory Parallel Sparse Matrix Vector Multiplication

Kristian Sordal



Thesis for Master of Science Degree at the University of Bergen, Norway

#### ©Copyright Kristian Sordal

The material in this publication is protected by copyright law.

Year: 2025

Title: Performance of Distributed and Shared Memory

Parallel Sparse Matrix Vector Multiplication

Author: Kristian Sordal

# Acknowledgements

## Abstract

SPARSE MATRIX VECTOR MULTIPLICATION is an important kernel used in scientific computing. It is a problem that lends itself well to parlallization. The problem is bounded by, and scales with the memory bandwidth of the system. Therefore in order to efficiently perform SpMV on large distributed memory systems, it is important to reduce the communication between nodes, in order to extract as much as possible out of the memory bandwidth of the system.

This thesis aims to investigate the results of SpMV when ran using different communication strategies.

# Contents

| A            | ckno         | wledgements                                    | iii |
|--------------|--------------|------------------------------------------------|-----|
| $\mathbf{A}$ | bstra        | act                                            | v   |
| 1            | Intr         | roduction                                      | 1   |
|              | 1.1          | Research Question                              | 1   |
| 2            | $Th\epsilon$ | eory                                           | 3   |
|              | 2.1          | Definitions                                    | 3   |
|              | 2.2          | Amdahl's Law                                   | 3   |
|              | 2.3          | Latency                                        | 4   |
| 3            | Bac          | kground                                        | 5   |
|              | 3.1          | CSR Storage Format                             | 5   |
|              | 3.2          | Sparse Matrix-Vector Multiplication            | 6   |
|              |              | 3.2.1 Computational Intensity                  | 6   |
|              | 3.3          | Sequential SpMV                                | 7   |
|              | 3.4          | Shared Memory SpMV                             | 8   |
|              |              | 3.4.1 Scheduling options                       | 8   |
|              |              | 3.4.2 Dynamic scheduling                       | 9   |
|              |              | 3.4.3 First touch policy                       | 9   |
|              |              | 3.4.4 Performance impact of first touch policy | 10  |
|              | 3.5          | Distributed Memory SpMV                        | 10  |
|              | 3.6          | Load Balancing                                 | 10  |
|              |              | 3.6.1 Separator                                | 11  |
|              | 3.7          | Distributed Memory CSR                         | 11  |
| 4            | Cor          | nmunication Strategies                         | 13  |
|              | 4.1          | 1a - Exchange entire vector                    | 13  |
|              | 4.2          | 1b - Exchange only separators                  | 14  |
|              |              | 421 Reordering                                 | 15  |

| riii | CONTENTS |
|------|----------|
|      |          |

|   | 4.3 1c - Exchange only required separators |    |
|---|--------------------------------------------|----|
| 5 | Results                                    | 19 |
| 6 | Previous Work                              | 23 |

# Chapter 1

# Introduction

## 1.1 Research Question

This thesis aims to investigate the impact different communication strategies have on the performance of parallel SpMV. The implementation of SpMV uses a shared memory layout on each socket of a node, and distributed memory for communication across nodes.

# Chapter 2

# Theory

### 2.1 Definitions

**Definition 2.1.1** (Separator). In the context of SpMV, a separator is a node in the graph that has an edge that strides between two partitions.

#### 2.2 Amdahl's Law

Amdahl's Law provides a theoretical framework for understanding the limits of performance improvement when additional computational resources are applied to a given problem. It quantifies the potential speedup achieved by optimizing a specific portion of a system, emphasizing that the overall gain is constrained by the proportion of time the optimized component contributes to execution.

**Definition 2.2.1** (Amdahl's Law). The maximum achievable speedup of a computation is limited by the fraction of execution time that remains sequential, even when an arbitrarily large number of parallel resources is employed.

In the context of parallel computing, this principle highlights that while increasing the number of processing units can accelerate the parallelizable portion of a workload, the sequential fraction imposes a fundamental performance ceiling. Formally, if S denotes the total speedup,  $t_p$  represents the fraction of execution time that can be parallelized, and  $s_p$  is the speedup achieved for that parallelizable portion, Amdahl's Law is

4 Theory

expressed as:

$$S = \frac{1}{(1 - t_p) + \frac{t_p}{s_p}} \tag{2.1}$$

This equation reveals that as  $s_p \to \infty$ , the theoretical maximum speedup approaches  $\frac{1}{1-t_p}$ , illustrating that the non-parallelizable portion becomes the dominant limiting factor in scalability.

### 2.3 Latency

It is worthwhile to speak on the typical latency numbers for various operations on a computer.

| Operation                                | Time [ns]   |
|------------------------------------------|-------------|
| L1 cache reference                       | 1           |
| Branch misprediction                     | 3           |
| L2 cache reference                       | 4           |
| Mutex lock/unlock                        | 17          |
| Main memory reference                    | 100         |
| Compress 1K bytes with Zippy             | 2000        |
| Send 2kB over 10 Gbps network            | 1600        |
| Send 1K bytes over 1 Gbps network        | 10 000      |
| Read 4K bytes randomly from SSD*         | 20 000      |
| Round trip within same datacenter        | 500 000     |
| Read 1MB sequentially from memory        | 1 000 000   |
| Disk seek                                | 10 000 000  |
| Read 1MB sequentially form disk          | 10 000 000  |
| TCP packet round trip between continents | 150 000 000 |

Figure 2.1: Latency numbers for common operation, adapted from [2]

# Chapter 3

# Background

### 3.1 CSR Storage Format

CSR (Compressed Sparse Row) is the most widely used storage format for sparse matrices. As its name suggests, it compresses the amount of memory used to store a matrix without loss of information. It does so by utilizing three vectors  $A_p, A_j, A_x$ . Figure 3.1 shows an example of a matrix stored in CSR format, adapted from [1].



Figure 3.1: Example matrix represented in CSR Format.

The first vector,  $A_p$  stores the indices of the first nonzero in the vectors  $A_p$  and  $A_x$ . For a given entry  $A_p[i]$ ,  $A_p[i]$  is the index of the first nonzero in the  $i^{\text{th}}$  row.  $A_j[j]$  and  $A_x[j]$  denotes the column index and value of the  $j^{\text{th}}$  nonzero, respectively.

6 Background

Throughout the remainder of this thesis, we operate under the assumption that all matrices are represented in CSR (Compressed Sparse Row) format, unless explicitly noted otherwise.

### 3.2 Sparse Matrix-Vector Multiplication

Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation encountered in many areas of scientific computing. It is especially prominent in solving large systems of linear equations and in large-scale simulations. The matrices involved are typically both very large and very sparse.

A matrix can in theory be considered sparse if it is worthwhile to treat zero values separately. In theory, this translates to a matrix being less than full, i.e. less than  $\mathcal{O}\left(n^2\right)$  nonzeros for a  $n \times n$  matrix. However, in the context of sparse linear algebra, sparse means that there is a constant number of nonzeros per row, i.e.  $\mathcal{O}\left(n\right)$  nonzeros per row. The matrices used in scientific computing, such as matrices based on meshes, or graphs such as social networks all have this property. Optimizing the performance of SpMV, particularly through parallel computing techniques, is crucial for enhancing the efficiency of many scientific applications.

However, SpMV is notoriously difficult to optimize, both in sequential and parallel implementations. One major reason is its inherently low computational intensity.

#### 3.2.1 Computational Intensity

The *computational intensity* of an operation describes the relation between the number of floating-point operations (FLOPS) and the number of memory accesses required. It is formally defined as:

Computational intensity = 
$$\frac{\text{FLOPS}}{\text{Memory accesses}}$$
 (3.1)

Operations with low computational intensity, such as SpMV, are often memory bound rather than compute bound. This means that increasing the computational power of a system (e.g., faster processors) does not necessarily lead to proportional speedups in SpMV performance, as memory bandwidth remains the limiting factor.

### 3.3 Sequential SpMV

A sequential implementation of SpMV on a matrix stored in the CSR format can be implemented in the following manner:

```
Algorithm 1: Sequential CSR-based SpMV
```

For SpMV on well structured matrices, i.e. those similar to the matrix shown in Figure 3.2 we read 12 bytes, and perform 2 FLOPS for each nonzero in the matrix. Keen-eyed readers might notice that this does not coincide with the amount of FLOPS read per nonzero in algorithm 3.3. Here we read two doubles, and one integer, which would be equivalent to 20 bytes. The reason for the discrepancy is due to the fact that after the first time x is accessed, it is loaded into cache, and heavily reused in subsequent iterations, and can for that reason be disregarded.

For heavily unstructured matrices, it is possible that we acutally read up to 76 bytes per 2 FLOPS. This will occur if there is no cache reuse, and a new cache line (64 bytes) is loaded for each nonzero.

8 Background



Figure 3.2: Well structured (a) and poorly structured (b) matrices.

### 3.4 Shared Memory SpMV

SpMV can be parallelized using the OpenMP directive #pragma omp parallel for. By default, this tells OpenMP to use static scheduling when parallelizing the outer iteration loop. When static scheduling is used, the span of the iteration that each thread will execute is precomputed, and stays static, as the naming suggests. There are other scheduling options, such as dynamic and guided, which will be discussed in later sections.

An implementation of shared memory SpMV is outlined below.

```
Algorithm 2: Shared Memory CSR-based SpMV

Input : A_p, A_j, A_x, x
Output: y

#pragma omp parallel for
for i \leftarrow 0 to n do
sum \leftarrow 0
for <math>j \leftarrow A_p[i] \text{ to } A_p[i+1] \text{ do}
sum = sum + A_x[j] \cdot x[A_j[j]]
y[i] \leftarrow sum
```

#### 3.4.1 Scheduling options

As seen in algorithm 2, the outer loop is parallelized, which translates to dividing the rows of the matrix evenly among the threads. This work fine for well structured matrices, but for matrices with dense rows, such as the matrix shown in Figure 3.3, we obtain large imbalances in the computational load for each thread, which impacts performance.



Figure 3.3: staticscheduling

#### 3.4.2 Dynamic scheduling

With dynamic scheduling, OpenMP still precomputes the span of iterations, however threads are not assigned a specific set of iterations to execute. If thread A is assigned a set of sparse rows, and thread B is assigned a set of dense rows, then with static scheduling, A will be idle and not perform any computation while B is busy with its dense rows. With dynamic scheduling, iterations will be dynamically assigned to threads that are idle, which can distribute the workload more evenly in cases where the computational load of each iteration can differ.

At first glance, it might seem that this will fix the problem static scheduling poses in regards to workload imbalance. This can be the case on single socketed machines with few physical cores such as on a personal laptop, but as the systems that are used in this thesis are dual-socketed, and have a large number of physical cores, it is crucial to consider the impact that *first touch policy* has.

### 3.4.3 First touch policy

In short, first touch policy means that the first thread to access a memory page will be the one to allocate it to its local memory. This means that 10 Background

if thread A accesses a memory page, and then thread B accesses the same memory page, thread B will not have access to the data in its local memory, and will have to access it from the main memory. This can lead to performance degradation if the data is accessed frequently, as accessing data from main memory is significantly slower than accessing it from local memory.

#### 3.4.4 Performance impact of first touch policy

As seen in table 2.1, accesses to main are significantly slower than accesses to say the L1 cache, and on dual-socketed nodes, accesses to main memory on the opposite socket have to be sent through the interlink, which is once again signicantly slower than accessing main memory. From this it becomes evident that using dynamic (or guided) scheduling is not a solution to the performance issues that might arise from poorly structured matrices.

### 3.5 Distributed Memory SpMV

This thesis examines the performance of SpMV in a hybrid environment combining distributed and shared memory. The matrix is divided among multiple nodes in a cluster, where each node has its own local memory. Within each node, OpenMP is used to parallelize the local SpMV computation. After completing local operations, nodes exchange data through the Message Passing Interface (MPI) to assemble the final result.

#### 3.6 Load Balancing

In a distributed memory system, it is important to partition the matrix so that the computational workload is evenly divided among the processes. This is typically achieved through the use of graph partitioning tools. A widely used tool for this purpose is METIS, which provides the function METIS\_PartGraphKway. Given a parameter nprocs, representing the number of processes the program will run on, METIS\_PartGraphKway attempts to partition the graph into nprocs equally sized parts. Since finding an optimal partition is an NP-hard problem, METIS does not guarantee an optimal solution, but it produces high-quality approximations that are sufficient for practical use.

### 3.6.1 Separator

When a graph is partitioned into different parts, there will inevitably be some edges which strides across different partitions. The endpoints of these edges are called separators, and will become important when it comes to reducing the communication load of the SpMV computation.

### 3.7 Distributed Memory CSR

# Chapter 4

# Communication Strategies

This thesis evaluates and compares several communication strategies for Sparse Matrix-Vector Multiplication (SpMV) in a parallel, distributed memory setting. During each iteration of SpMV, every process computes a partial result of the output vector y.

In subsequent iterations, these computed values may be required by other processes to proceed with their own calculations. To ensure correctness, it is therefore necessary to communicate between processes so that each has access to the values it depends on. This section outlines a progression of increasingly efficient strategies for managing this communication.

### 4.1 1a - Exchange entire vector

The most straightforward approach is to have each rank send all of its computed values of y to every other rank. This ensures that all processes possess a complete and updated copy of the output vector before the next iteration. This strategy can be implemented using MPI's collective communication operation MPI\_Allgatherv, which accommodates variable message sizes from each rank. Figure 4.1 illustrates the state of the y vector before and after communication using this strategy.



Figure 4.1: Visual representation of the y vector in communication strategy 1a.

```
Algorithm 3: 1a - Exchange entire vector

for each iteration do

spmv(g,x,y)

MPI_Allgatherv(local_y, sendcount, MPI_DOUBLE, y,
recvcounts, displs, MPI_DOUBLE, MPI_COMM_WORLD)
swap pointers of x and y
```

### 4.2 1b - Exchange only separators

An improvement to the previous strategy can be achieved by recognizing that only separator values—those required by multiple processes—must be communicated. Non-separator values are used exclusively by the process that computed them and therefore do not need to be communicated.

To facilitate this strategy, separator values are reordered such that they appear at the beginning of each process's local segment of y. Once this structure is established, communication is performed using MPI\_Allgatherv, transmitting only the subset of y that contains separator values. The number of separators on each process must be known beforehand, which can be computed by counting the number of elements that have neighbours belonging to a different partition.

#### 4.2.1 Reordering

After partitioning the matrix into different parts, we obtain a partition vector p, where the p[i] stores the index of the partition the  $i^{\text{th}}$  entry in  $A_p$ . It is necessary to reorder the entries in  $A_p$  in accordance with the partition vector, such that all entries belonging to the same partition are stored in sequence. The algorithm below gives an outline of how this can be achieved. Here  $n_p$  is the number of partitions,  $n_r$  is the size of  $A_p$ , and  $n_c$  is the size of  $A_j$  and  $A_x$ .

#### **Algorithm 4:** Reordering of Separators

```
Input : n_p, n_r, n_c, p, A_p, A_j, A_x
Output: Reordered A_p, A_i, A_x
newId \leftarrow [0] \cdot n_r
oldId \leftarrow [0] \cdot n_r
id \leftarrow 0
p_0 \leftarrow 0
for r \in \{0, ..., n_p - 1\} do
       for i \in \{0, ..., n_r - 1\} do
               if p[i] = r then
                   \begin{array}{c} \text{oldId[id]} \leftarrow i \\ \text{newId[i]} \leftarrow \text{id} \\ \text{id} \leftarrow \text{id} + 1 \end{array} 
       p[r+1] \leftarrow id
\text{newV} \leftarrow [0] \cdot (n_r + 1)
\text{newE} \leftarrow [0] \cdot n_c
\text{newA} \leftarrow [0] \cdot n_c
for i \in \{0, ..., n_r - 1\} do
       degree \leftarrow A_p[oldId[i] + 1] - A_p[oldId[i]]
       \text{newV}[i+1] \leftarrow \text{newV}[i] + \text{degree}
for i \in \{0, ..., n_r - 1\} do
       degree \leftarrow A_p[oldId[i] + 1] - A_p[oldId[i]]
       for j \in \{0, \dots, degree-1\} do
              \begin{array}{l} \operatorname{newE[newV[i]+j]} \leftarrow A_j[A_p[\operatorname{oldId}[i]] + j] \\ \operatorname{newA[newV[i]+j]} \leftarrow A_x[A_p[\operatorname{oldId}[i]] + j] \end{array}
       \begin{array}{l} \mathbf{for} \ j \in \{newV[i], \dots, newV[i+1]-1\} \ \mathbf{do} \\ \  \  \, \lfloor \  \, \mathrm{newE[j]} \leftarrow \  \, \mathrm{newId[newE[j]]} \end{array}
```

Overwrite  $A_p \leftarrow \text{newV}, A_j \leftarrow \text{newE}, A_x \leftarrow \text{newA}$ 



Figure 4.2: Main figure caption describing both subfigures

### 4.3 1c - Exchange only required separators

Further reduciton to the communication volume can be achieved by observing that not all separator values are required by every process. As the number of partitions increases, the set of dependencies between partitions tends towards sparcity. As a consequence of this, certain sets of separators may only need to be communicated to a given subset of processes. Using this strategy, each process only communicates its set of separator values to the processes that require them.



Figure 4.3: Main figure caption describing both subfigures

# 4.4 1d - Exchange only required separator values

The final strategy aims to minimize communication overhead by transmitting only the exact subset of separator values that are both computed by and required for inter-process computation. If a specific separator value computed by one process is needed by exactly one other process, then only that single recipient receives the value.

This approach eliminates all unnecessary data transfers but introduces additional complexity in managing communication schedules. Dependencies must be mapped at a fine-grained level, and communication patterns must be explicitly tailored to the structure of the matrix and its partitioning.

# Chapter 5

# Results



Figure 5.1: Stacked bar chart of SpMV and Halo Time.

20 Results



Figure 5.2: Aggregated Results of SpMV on defq chip



Figure 5.3: SpMV performance with 1 process vs 2 processes per node on Lynx144



Figure 5.4: SpMV performance with 1 process vs 2 processes per node on delaynay\_n24



Figure 5.5: SpMV performance with 1 process vs 2 processes per node on rgg\_n\_2\_24\_s0

22 Results

Table 5.1: Hardware used in our experiments. STREAM Triad was run with the -march=native and -03 compilation flags.

|                  | Xeon-A                        | Xeon-B                            | Naples              | Rome                 | Milan               | TX2                     | Hi1620                           |
|------------------|-------------------------------|-----------------------------------|---------------------|----------------------|---------------------|-------------------------|----------------------------------|
| CPUs             | Intel<br>Xeon<br>Gold<br>6130 | Intel<br>Xeon<br>Platinum<br>8168 | AMD<br>Epyc<br>7601 | AMD<br>Epyc<br>7302P | AMD<br>Epyc<br>7763 | Cavium<br>TX2<br>CN9980 | HiSilicon<br>Kunpeng<br>920-6426 |
| Instr. set       | x86-64                        | ×86-64                            | ×86-64              | ×86-64               | x86-64              | ARMv8.1                 | ARMv8.2                          |
| Microarch.       | Skylake                       | Skylake                           | Zen                 | Zen 2                | Zen 3               | Vulcan                  | TaiShan<br>v110                  |
| Sockets          | 2                             | 2                                 | 2                   | 1                    | 2                   | 2                       | 2                                |
| Cores            | $2 \times 16$                 | $2 \times 24$                     | $2 \times 32$       | $1 \times 16$        | $2 \times 64$       | $2 \times 32$           | $2 \times 64$                    |
| Freq. [GHz]      | 1.9 – 3.6                     | 2.5 – 3.7                         | 2.7 - 3.2           | 1.5 - 3.3            | 2.5 – 3.5           | 2.0 - 2.5               | 2.6                              |
| L1I/core [KiB]   | 32                            | 32                                | 64                  | 32                   | 32                  | 32                      | 64                               |
| L1D/core [KiB]   | 32                            | 32                                | 32                  | 32                   | 32                  | 32                      | 64                               |
| L2/core [KiB]    | 1024                          | 1024                              | 512                 | 512                  | 512                 | 256                     | 512                              |
| L3/socket [MiB]  | 22                            | 33                                | 64                  | 16                   | 256                 | 32                      | 64                               |
| Mem. channels    | $2 \times 6$                  | $2 \times 6$                      | $2 \times 8$        | $1 \times 8$         | $2 \times 8$        | $2 \times 8$            | $2 \times 8$                     |
| Bandwidth [GB/s] | 256                           | 256                               | 342                 | 204.8                | 409.6               | 342                     | 342                              |
| Triad [GB/s]     | 147.1                         | 137.4                             | 169.7               | 90.9                 | 256.5               | 236.4                   | 260.4                            |



Figure 5.6: SpMV performance with 1 process vs 2 processes per node on road\_usa

Chapter 6
Previous Work

# **Bibliography**

- [1] Gautam Gupta, Sivasankaran Rajamanickam, and Erik G. Boman. GAMGI: Communication-reducing algebraic multigrid for gpus. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 61–75. ACM, 2024.
- [2] Peter Norvig. Latency numbers every programmer should know. https://norvig.com/21-days.html#Latency, 2021. Accessed: 2025-04-29.