



### Optimising Locality of Reference

Cosmin E. Oancea and Troels Henriksen [cosmin.oancea,athas]@diku.dk

Department of Computer Science (DIKU) University of Copenhagen

October 2015 PMPH Lecture Notes



### Course Organization

| W | HARDWARE        |                   | SOFTWARE       | LAB/CUDA              |
|---|-----------------|-------------------|----------------|-----------------------|
| 1 | Trends          |                   | List HOM       | Intro & Simple        |
|   | Vector Machine  | $\longleftarrow$  | (Map-Reduce)   | Map Programming       |
| 2 | In Order        | $\longrightarrow$ | VLIW Instr     | Scan &                |
|   | Processor       | $\leftarrow$      | Scheduling     | Reduce                |
| 3 | Cache           |                   | Loop           | Sparse Vect           |
|   | Coherence       |                   | Parallelism I  | Matrix Mult           |
| 4 | Interconnection |                   | Case Studies & | Transpose & Matrix    |
|   | Networks        |                   | Optimizations  | Matrix Mult           |
| 5 | Memory          |                   | Optimising     | Sorting & Profiling & |
|   | Consistency     |                   | Locality       | Mem Optimizations     |
| 6 | OoO, Spec       |                   | Thread-Level   | Project               |
|   | Processor       |                   | Speculation    | Work                  |

Three narrative threads: the path to complex & good design:

- Design Space: tradeoffs, constraints, common case, trends.
- Reasoning: from simple to complex.



#### **Motivation**

So far we have optimised a single perfect loop nest with affine accesses in shared memory, using:

- loop interchange,
- loop distribution,
- block tiling, e.g., matrix transposition, multiplication.

#### Example: Loop Interchange Enhances Locality of Reference

```
// Bad locality both GPU & CPU
DOALL j = 1, N-1 // grid
  DOALL i = 0, N-1 // block
    A[i,j] = sqrt(A[i,j] + B[i,j]);
  ENDDO
ENDDO

// Good locality both CPU & GPU
DOALL i = 0, N-1
  DOALL j = 1, N-1
    A[i,j] = sqrt(A[i,j] + B[i,j]);
  ENDDO
ENDDO
```

But a program is a composition of loop nests, accesses are not always affine, and what about communication in distributed programs?



- Tiling Affine Loop Nests: Optimising Communication & Load Balancing [Reddy and Bondhugula'14]
- Iteration and Data Reordering for Loop Nests with Irregular Accesses [Ding and Kennedy'99], [Strout et al. 2003, 2004]
  - Data Reordering (Packing)
  - Iteration Reordering (Locality Grouping)
  - Generalization: Temporal & Spatial Locality HyperGraphs
- Other Locality Optimizations: Parallel Tracing



#### A Simple Composition of Two Loop Nests

Effective Automatic Data Allocation for Parallelization of Affine Loop Nests, Chandan Reddy and Uday Bondhugula, ICS 2014.

Running Example: ADI benchmark

```
// forward x sweep
for (i=0; i<N; i++)       // parallel
    for (j=1; j<N; j++)   // sequential
S1      X[i,j] -= X[i,j-1]*..;

// upward y sweep
for (j=0; j<N; j++)       // parallel
    for (i=1; i<N; i++)   // sequential
S2      X[i,j] -= X[i-1,j]*..;
```

- Each loop nest can be efficiently parallelized individually, in a distributed setting, with NO intra-loop-nest communication.
- How about the whole program?



#### Mapping Available Parallelism to A Set of Nodes

Commonly used patterns for distributing iterations across processors:

- Block Distribution: loop iterations are divided into number-of-processor, nearly-equal contiguous chunks.
- Cyclic Distribution: one iteration to each processor in a round robin fashion. Better load balancing when iterations have non-uniform cost.
- Block-Cyclic Distribution: like cyclic but contiguous chunks of iterations are distributed in round-robin fashion.
- Sudoku Distribution: assigns for example 2-dim tiles to processors such that ALL tiles inside a row or column are mapped to distinct processors.

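#### OPENMP:

`#pragma omp parallel for schedule(kind [,chunk_size])`

- Block: `schedule(static)`
- Cyclic: `schedule(static, 1)`
- Block-cyclic: `schedule(static, block_size)`

`schedule(dynamic [,chunk_size])` hands out chunks on demand (first come, first served) rather than round-robin; it targets the same goal as cyclic, i.e., better load balance when iterations have non-uniform cost.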
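The three regular distributions listed above can be sketched as owner functions that map an iteration index to a processor. This is a minimal illustration; the function names and the small problem size are assumptions chosen for the example, not part of the lecture.

```python
def block_owner(i, n, p):
    """Block: n iterations split into p nearly-equal contiguous chunks."""
    chunk = -(-n // p)  # ceil(n / p)
    return i // chunk

def cyclic_owner(i, p):
    """Cyclic: one iteration per processor, round-robin."""
    return i % p

def block_cyclic_owner(i, p, b):
    """Block-cyclic: contiguous chunks of b iterations, round-robin."""
    return (i // b) % p

n, p = 8, 2  # hypothetical: 8 iterations on 2 processors
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
block_cyclic = [block_cyclic_owner(i, p, 2) for i in range(n)]
print(block)         # [0, 0, 0, 0, 1, 1, 1, 1]
print(cyclic)        # [0, 1, 0, 1, 0, 1, 0, 1]
print(block_cyclic)  # [0, 0, 1, 1, 0, 0, 1, 1]
```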



### Mapping Available Parallelism to A Set of Nodes

We call a computation mapping (iterations to nodes) optimal if it leads to the lowest communication and perfect load balance.

#### ADI program: Tiled Version (Tile Size: 128)

```
// forward x sweep
for (jj=0; jj<N; jj+=128)      // serial loop
  for (ii=0; ii<N; ii+=128)    // parallel loop
    for (i=ii; i<min(ii+128,N); i++)            // all rows of the tile
      for (j=max(1,jj); j<min(jj+128,N); j++)   // j starts at 1, as in the untiled loop
S1      X[i][j] -= X[i][j-1]*..;

// upward y sweep
for (ii=0; ii<N; ii+=128)      // serial loop
  for (jj=0; jj<N; jj+=128)    // parallel loop
    for (j=jj; j<min(jj+128,N); j++)            // all columns of the tile
      for (i=max(1,ii); i<min(ii+128,N); i++)   // i starts at 1, as in the untiled loop
S2      X[i][j] -= X[i-1][j]*..;
```

- A node consists of a set of processors operating in shared memory (communication is required only between nodes).
- The optimal mapping for the forward sweep is block distribution along ii.
- The optimal mapping for the upward sweep is block distribution along jj.
- These mappings are not optimal for the entire program, since switching between them amounts to a transposition of X, which requires a lot of communication!

Solution: model the optimal-mapping problem as a graph partitioning problem on the inter-tile communication graph (TCG).

### Inter-Tile Communication Graph (TCG)

- Each vertex in the TCG represents a computation tile.
- An edge e is added between two vertices iff there is communication between those tiles (assuming they are executed on different nodes).
- The weight of an edge, $e_w$, is equal to the communication volume between the two tiles.

Finding the optimal mapping is equivalent to partitioning the TCG into p equal-sized partitions, where p is the number of nodes (i.e., optimal load balancing), with the objective of minimizing the sum of the weights of the edges that straddle partitions.

The objective function is the total communication volume for the entire program execution, under the load-balancing constraint.
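The partitioning objective can be made concrete with a small sketch that evaluates the cut weight and the partition sizes of a candidate mapping. The 4-tile graph and both mappings below are hypothetical, and the code only scores a given partition; finding the best one is the job of a graph partitioner.

```python
def cut_weight(edges, part):
    """Sum of the weights of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v, w) in edges if part[u] != part[v])

def partition_sizes(part, p):
    """Number of tiles assigned to each of the p nodes."""
    sizes = [0] * p
    for node in part.values():
        sizes[node] += 1
    return sizes

# Hypothetical TCG: a chain of tiles 0-1-2-3 with two heavy edges.
edges = [(0, 1, 10), (1, 2, 1), (2, 3, 10)]
good = {0: 0, 1: 0, 2: 1, 3: 1}  # cuts only the light edge
bad = {0: 0, 1: 1, 2: 0, 3: 1}   # cuts every edge

print(cut_weight(edges, good), partition_sizes(good, 2))  # 1 [2, 2]
print(cut_weight(edges, bad), partition_sizes(bad, 2))    # 21 [2, 2]
```

Both mappings are perfectly balanced, so only the communication volume distinguishes them.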



### TCG for Running Example

Left: Tiled iteration space with dependencies. Dependence edges that cross tile boundaries are used to determine the communication sets.

Right: Corresponding Inter-Tile Communication Graph

• Same color tiles can be executed in parallel!





### **Load Balancing TCG Constraints**

Program consists of multiple parallel phases.

Good load balance  $\Rightarrow$  nearly equal number of tiles are allocated to all nodes in each parallel phase.

Add constraints to minimize load imbalance in each parallel phase:

- Vertex weights are vectors, used to distinguish between tiles belonging to different parallel phases.
- A tile belonging to parallel phase i has the number of iterations in the tile at the i<sup>th</sup> position of its weight vector, and 0 at all other positions.
- Let $S_i^n$ be the sum of the $i^{th}$ vertex-weight component over all vertices in partition n.
- For every vertex-weight component i, load-balancing constraints are added to minimize the difference between any two partitions n and m, i.e., minimize $\sum_i \sum_{n \neq m} |S_i^n - S_i^m|$.
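As a rough illustration of the vector-valued vertex weights, the sketch below computes the per-phase loads $S_i^n$ and the resulting imbalance for two candidate mappings. The tiles, weights and mappings are hypothetical; the imbalance is summed over ordered pairs of distinct partitions.

```python
from itertools import permutations

def phase_loads(weights, part, p, phases):
    """S[i][n]: sum of the i-th weight component over the vertices in partition n."""
    S = [[0] * p for _ in range(phases)]
    for v, w in weights.items():
        for i in range(phases):
            S[i][part[v]] += w[i]
    return S

def imbalance(S):
    """Sum over phases i and ordered pairs n != m of |S_i^n - S_i^m|."""
    return sum(abs(a - b) for row in S for a, b in permutations(row, 2))

# Hypothetical: tiles 0, 1 belong to phase 0 and tiles 2, 3 to phase 1,
# each with 100 iterations.
weights = {0: (100, 0), 1: (100, 0), 2: (0, 100), 3: (0, 100)}
balanced = {0: 0, 1: 1, 2: 0, 3: 1}
skewed = {0: 0, 1: 0, 2: 1, 3: 1}  # equal tile counts, but each phase runs on one node

print(imbalance(phase_loads(weights, balanced, 2, 2)))  # 0
print(imbalance(phase_loads(weights, skewed, 2, 2)))    # 800
```

The skewed mapping gives each node two tiles, yet in every phase one node is idle, which is exactly what the per-phase components detect.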



### TCG Partitioning Result for Running Example

- The optimal solution is obtained via graph partitioning.
- Tiles of the same color execute on the same node.
- In each parallel phase, an equal number of tiles is assigned to each node $\Rightarrow$ perfect load balance.
- The computation mapping is identical for both loop nests, i.e., the "expensive" communication of matrix transposition has been eliminated.
- The result is a Sudoku distribution, since all nodes are assigned an equal number of tiles in each row and column!
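One simple way to obtain a Sudoku-style mapping is a diagonal assignment of tiles to nodes. The sketch below is an assumption made for illustration (the partitioner is not claimed to produce this exact mapping); it checks the defining property, an equal number of tiles per node in every tile row and tile column, and contrasts it with a block distribution along rows.

```python
from collections import Counter

def diagonal_owner(ti, tj, p):
    """Hypothetical Sudoku-style mapping: tile (ti, tj) goes to node (ti + tj) mod p."""
    return (ti + tj) % p

def block_row_owner(ti, tj, p, t):
    """Block distribution along tile rows, for contrast."""
    return ti // (t // p)

def balanced_in_rows_and_columns(owner, t, p):
    """True if every tile row and tile column gives each node exactly t // p tiles."""
    for k in range(t):
        row = Counter(owner(k, j) for j in range(t))
        col = Counter(owner(i, k) for i in range(t))
        if any(row[n] != t // p or col[n] != t // p for n in range(p)):
            return False
    return True

t, p = 8, 4  # 8 x 8 tiles on 4 nodes (t is assumed to be a multiple of p)
sudoku_ok = balanced_in_rows_and_columns(lambda i, j: diagonal_owner(i, j, p), t, p)
block_ok = balanced_in_rows_and_columns(lambda i, j: block_row_owner(i, j, p, t), t, p)
print(sudoku_ok, block_ok)  # True False
```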



### TCG Partitioning Result for Other Programs

Left: stencil computation with nearest-neighbor communication ⇒ the optimal result is block distribution.

Right: unbalanced computation. The result is slightly different from a block-cyclic mapping, in that the first and last columns are mapped to P0, instead of the first and fourth.





### Perfect! Any Difficulties Left?

- As the problem size increases, so do the number of vertices and edges in the graph, and the number of constraints.
- Even state-of-the-art graph-partitioning software does not scale (it solves an NP-hard problem heuristically).
- For example, METIS takes 240s to partition ADI with 64 vertices into 4 partitions.
- A further increase in problem size ⇒ a drastic decrease in performance and in the accuracy of the solution, e.g., perfect Sudoku mappings were not obtained for more than 32 vertices.

To make the approach scale to larger sizes $\Rightarrow$ an approximation of the TCG is computed for a small number of iterations and the result is expanded across the whole iteration space.

Works in practice because accesses are affine, i.e., regular.



### **Empirical Results: Weak Scaling**

Weak Scaling: how the solution time varies with the number of processors for a fixed problem size per processor. Ideally one gets a horizontal line.



Figure 12: Weak scaling performance of scalapack and pluto-data-tile








### Irregular Computation Based on Indirect Arrays

Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time, Chen Ding and Ken Kennedy, PLDI'99

Metrics and Models for Reordering Transformations, Michelle Strout and Paul Hovland, MSP'04.

Irregular applications do not access memory in a strided fashion, e.g.,

- molecular dynamics simulations, which model the movement of particles in some physical domain: the distribution of molecules is unknown until runtime, and even then it changes dynamically,
- sparse linear algebra, e.g., sparse matrix-vector multiplication.

It is impossible to optimise locality of reference statically! Instead, use inspector-executor techniques [Saltz et al.] ⇒ insert code that reorganizes, at runtime, the order in which iterations are executed or the data layout.



#### Running Example

- Indirect arrays left and right are invariant to the outermost loop, hence the runtime iteration/data reordering can be amortized across multiple executions.
- CHARMM, GROMOS, MESH benchmarks.

#### Simplified Moldyn Example: Iteration over the Graph Edges

```
DO s = 1 to num_steps
    // Update location based on old position, velocity and acceleration
    DO i = 1 to num_nodes // parallel
S1      x[i] += vx[i] + fx[i]
    ENDDO
    // Update the forces on the molecules
    DO j = 1 to num_interactions
S2      fx[left[j]]  += calcF(x[left[j]], x[right[j]])
S3      fx[right[j]] += calcF(x[left[j]], x[right[j]])
    ENDDO
    // Update velocity based on force (acceleration)
    DO k = 1 to num_nodes // parallel
S4      vx[k] += fx[k]
    ENDDO
ENDDO
```

### Runtime Data Reordering

- Aims to improve the spatial locality in the loop by reordering the data based on the order in which it is referenced in the loop.
- Iteration j accesses x[left[j]], x[right[j]], fx[left[j]], fx[right[j]].
- Top Figure shows the original access patterns.
- Bottom Figure shows the access pattern after consecutive-packing (CPACK), i.e., data is repacked to match the order in which it is used in the original loop.
- Notice better spatial locality!





### Consecutive Packing (CPACK) Implementation


```
CPACK(left, right) // Output: σ⁻¹
// alreadyOrdered: bit vector, initially all 0s
count = 1 // next free position (arrays are 1-based)
DO j = 1 to num_interactions
  mem_loc1 = left[j]
  mem_loc2 = right[j]
  IF not alreadyOrdered[mem_loc1]
    σ⁻¹[count] = mem_loc1
    alreadyOrdered[mem_loc1] = 1
    count = count + 1
  ENDIF
  IF not alreadyOrdered[mem_loc2] // same for mem_loc2
    σ⁻¹[count] = mem_loc2
    alreadyOrdered[mem_loc2] = 1
    count = count + 1
  ENDIF
ENDDO
// data that is never referenced is appended at the end
DO i = 1 to num_nodes
  IF not alreadyOrdered[i]
    σ⁻¹[count] = i
    count = count + 1
  ENDIF
ENDDO
```

Assuming a cache line holds 3 words
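For readers who want to run the packing step, here is a minimal Python sketch of CPACK with 0-based indices. The helper `invert` and the tiny input are assumptions made for the example; `sigma_inv[new_pos] = old_pos` follows the convention of the pseudocode above.

```python
def cpack(left, right, num_nodes):
    """Return sigma_inv, where sigma_inv[new_pos] = old_pos, in first-touch order."""
    already_ordered = [False] * num_nodes
    sigma_inv = []
    for a, b in zip(left, right):
        for loc in (a, b):
            if not already_ordered[loc]:
                already_ordered[loc] = True
                sigma_inv.append(loc)
    # Data never touched by any interaction is appended at the end.
    sigma_inv.extend(i for i in range(num_nodes) if not already_ordered[i])
    return sigma_inv

def invert(perm):
    """Return sigma, where sigma[old_pos] = new_pos."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv

# Hypothetical input: 5 data items, 2 interactions.
left, right = [3, 0], [1, 3]
x = [10, 11, 12, 13, 14]

sigma_inv = cpack(left, right, 5)
sigma = invert(sigma_inv)
x_packed = [x[old] for old in sigma_inv]
print(sigma_inv)  # [3, 1, 0, 2, 4]
print(sigma)      # [2, 1, 3, 0, 4]
print(x_packed)   # [13, 11, 10, 12, 14]
```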



### **Code After Consecutive Packing**


```
σ⁻¹ = CPACK(left, right)
DO i = 1 to num_nodes
   x'[i] =  x[σ⁻¹[i]]
  fx'[i] = fx[σ⁻¹[i]]
ENDDO
σ = inverse(σ⁻¹)
DO s = 1 to num_steps
  DO i = 1 to num_nodes // parallel
    x'[σ[i]] += vx[i] + fx'[σ[i]]
  ENDDO
  DO j = 1 to num_interactions
    fx'[σ[left[j]]]  += calcF(x'[σ[left[j]]], x'[σ[right[j]]])
    fx'[σ[right[j]]] += calcF(x'[σ[left[j]]], x'[σ[right[j]]])
  ENDDO
  DO k = 1 to num_nodes // parallel
    vx[k] += fx'[σ[k]]
  ENDDO
ENDDO
```



### Optimising Overheads of Data Reordering/Packing

Overhead of (dynamic) data reordering/packing:

- Overhead of data reorganization, e.g., CPACK. This can be amortized over multiple computation iterations.
- Indirection overhead, which is very expensive because it is paid at every access:
  - instruction overhead of the indirection,
  - one extra load from memory,
  - spatial locality might have been compromised in other loops.

The Pointer Update optimization eliminates the extra load from memory by computing (once) the updated index arrays left' = σ ⊙ left and right' = σ ⊙ right, i.e., left'[j] = σ[left[j]].
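The pointer update can be sketched as a one-off composition of the permutation with the index arrays. The permutation and data below are hypothetical; the point is that the packed array indexed through the updated arrays yields the same values as the original array indexed through the original ones, with no access to σ in the hot loop.

```python
# Hypothetical permutation: sigma_inv[new_pos] = old_pos, sigma[old_pos] = new_pos.
sigma_inv = [3, 1, 0, 2, 4]
sigma = [0] * len(sigma_inv)
for new_pos, old_pos in enumerate(sigma_inv):
    sigma[old_pos] = new_pos

x = ["a", "b", "c", "d", "e"]
x_packed = [x[old] for old in sigma_inv]  # x'[i] = x[sigma_inv[i]]

left, right = [3, 0], [1, 3]
# Pointer update, done once: left' = sigma o left, right' = sigma o right.
left_upd = [sigma[d] for d in left]
right_upd = [sigma[d] for d in right]

for j in range(len(left)):
    # One indirection per access instead of two.
    assert x_packed[left_upd[j]] == x[left[j]]
    assert x_packed[right_upd[j]] == x[right[j]]

print(left_upd, right_upd)  # [0, 2] [1, 0]
```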




```
DO k = 1 to num_nodes // parallel
  vx[k] += fx'[σ[k]]
ENDDO
        ↓ reorganize vx ↓
DO i = 1 to num_nodes // amortized
  vx'[i] = vx[σ⁻¹[i]]
ENDDO
DO k = 1 to num_nodes // parallel
  vx'[σ[k]] += fx'[σ[k]]
ENDDO
        ↓ parallel loop ⇒ reorder iterations ↓
DO k = 1 to num_nodes // parallel
  vx'[k] += fx'[k]
ENDDO
```

- The Array Alignment optimization reorganizes the vx array in the same way as fx' (and x').
- Legality requirements:
  1. the range of loop iterations is identical to the range of remapped data,
  2. the loop is parallel (so that its iterations can be reordered).



### Code After Dynamic Data Reordering/Packing

```
σ⁻¹ = CPACK(left, right)
σ = inverse(σ⁻¹)
DO i = 1 to num_nodes // Overhead
   x'[i] =  x[σ⁻¹[i]]
  fx'[i] = fx[σ⁻¹[i]]
  vx'[i] = vx[σ⁻¹[i]]
ENDDO
DO j = 1 to num_interactions // Overhead: pointer update
   left'[j] = σ[left[j]]
  right'[j] = σ[right[j]]
ENDDO
DO s = 1 to num_steps // convergence loop, allows amortization
  DO i = 1 to num_nodes // parallel
    x'[i] += vx'[i] + fx'[i]
  ENDDO
  DO j = 1 to num_interactions
    fx'[left'[j]]  += calcF(x'[left'[j]], x'[right'[j]])
    fx'[right'[j]] += calcF(x'[left'[j]], x'[right'[j]])
  ENDDO
  DO k = 1 to num_nodes // parallel
    vx'[k] += fx'[k]
  ENDDO
ENDDO
DO i = 1 to num_nodes // Overhead, appears only if x, fx, vx are live
   x[i] =  x'[σ[i]]
  fx[i] = fx'[σ[i]]
  vx[i] = vx'[σ[i]]
ENDDO
```

### **Iteration Reordering**

- Aims to improve the temporal locality across consecutive loop iterations, by placing the iterations that touch the same data item consecutively in the resulting schedule.

Figure: (a) Example Interactions, assuming a cache of size 3 words.

### Data Packing and Iteration Reordering

1. The original code below corresponds to the top figure:

```
DO i = 1 to N
  ... X[l[i]] ...
  ... X[r[i]] ...
ENDDO
```

| i | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| l | 2 | 4 | 1 | 3 | 4 | 2 |
| r | 6 | 5 | 3 | 2 | 6 | 4 |

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| X | A | B | C | D | E | F |

2. After data reordering/packing, l' = σ⊙l, r' = σ⊙r and X', the reorganized X, are shown in the middle figure:

| i  | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| l' | 1 | 3 | 5 | 6 | 3 | 1 |
| r' | 2 | 4 | 6 | 1 | 2 | 3 |

|    | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| X' | B | F | D | E | A | C |

3. Loop iterations are reordered by lexicographically sorting the index arrays l' and r' into l'' and r'', as shown in the bottom figure; the loop body then reads `... X'[l''[i]] ...` and `... X'[r''[i]] ...`:

| i   | 1 | 2 | 3 | 4 | 5 | 6 |
|-----|---|---|---|---|---|---|
| l'' | 1 | 1 | 3 | 3 | 5 | 6 |
| r'' | 2 | 3 | 2 | 4 | 6 | 1 |

|    | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| X' | B | F | D | E | A | C |
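The three steps can be replayed on the index arrays l and r of this example. The sketch below keeps the 1-based values of the tables; the variable names are chosen for the example, and the compact first-touch computation relies on every data item being referenced at least once.

```python
l = [2, 4, 1, 3, 4, 2]  # original index arrays, 1-based values
r = [6, 5, 3, 2, 6, 4]
X = "ABCDEF"

# Step 2a: CPACK, i.e., first-touch order of the data items.
sigma_inv = list(dict.fromkeys(d for pair in zip(l, r) for d in pair))
sigma = {old: new for new, old in enumerate(sigma_inv, start=1)}
X_packed = "".join(X[old - 1] for old in sigma_inv)

# Step 2b: pointer update, l' = sigma o l and r' = sigma o r.
l1 = [sigma[d] for d in l]
r1 = [sigma[d] for d in r]

# Step 3: iteration reordering by lexicographic sort of the (l', r') pairs.
pairs = sorted(zip(l1, r1))
l2 = [a for a, _ in pairs]
r2 = [b for _, b in pairs]

print(X_packed)  # BFDEAC
print(l1, r1)    # [1, 3, 5, 6, 3, 1] [2, 4, 6, 1, 2, 3]
print(l2, r2)    # [1, 1, 3, 3, 5, 6] [2, 3, 2, 4, 6, 1]
```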

### **Empirical Evaluation: Effects of Reordering**







### **Empirical Evaluation: Effects of Optimizations**



Figure 6: Effect of Compiler Optimizations



### Spatial Locality Graph Models Data Reordering

- Vertices correspond to data items, and
- an edge connects the data items accessed in the same iteration, and is annotated with the iteration number.

- $G_{SL}$: find the reordering $\sigma$ that minimizes $\sum_{(v,w)\in G_{SL}(E)} |\sigma(v) - \sigma(w)|.$
  1. Consecutive Packing (CPACK): traverses the edges in the current iteration order and packs data on a first-come-first-served basis.
  2. GPart: heuristic that partitions the graph such that the nodes (data) of each partition fit into some level of cache, and orders the data consecutively (CPACK) inside each partition.
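The metric is straightforward to evaluate for a given ordering. The sketch below uses, as an assumption for illustration, the six interactions of the data-packing example (1-based data items) and compares the identity ordering with the one produced by CPACK.

```python
def spatial_metric(edges, sigma):
    """Sum over edges (v, w) of |sigma(v) - sigma(w)|."""
    return sum(abs(sigma[v] - sigma[w]) for v, w in edges)

# One edge per iteration: the two data items it accesses.
edges = list(zip([2, 4, 1, 3, 4, 2], [6, 5, 3, 2, 6, 4]))

identity = {d: d for d in range(1, 7)}
cpack_sigma = {2: 1, 6: 2, 4: 3, 5: 4, 1: 5, 3: 6}  # first-touch positions

print(spatial_metric(edges, identity))     # 12
print(spatial_metric(edges, cpack_sigma))  # 11
```

On such a tiny graph the gain is small; the metric only ranks orderings and does not by itself predict cache misses.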

### Temporal Locality HyperGraph (Iter Reordering)

- A hypergraph $G_{TL}(V, E)$ is a generalization of a graph in which each hyperedge can involve more than 2 vertices.
- A vertex corresponds to an iteration (number).
- A hyperedge is a set of vertices. Two or more vertices (iterations) belong to the same hyperedge if they access the same data item.

Heuristics:

1. CPACKiter: visits the hyperedges in order and packs the iterations in each of these hyperedges on a first-come-first-served basis.
2. BFSiter: performs a BFS ordering of the vertices of the hypergraph. The algorithm also uses $G_{SL}(V, E)$.
3. HPart: graph partitioning heuristics.


```
Alg BFSiter(G_{TL}(V, E), G_{SL}(V, E))
count = 0
ADD a vertex i ∈ G_{TL}(V) TO iter-queue & mark visited
DO {
  WHILE (iter-queue not empty) {
    i = dequeue(iter-queue)
    put i next in iteration ordering
    count++
    // for all hyperedges to which i belongs,
    // i.e., for the data items v and w accessed by i
    FOR EACH (v, w) = E_i ∈ G_{SL}(E)
      IF v not visited, add v to data-queue & mark visited
      IF w not visited, add w to data-queue & mark visited
    // Add all iters belonging to data hyperedges to iter-queue
    WHILE (data-queue not empty) {
      v = dequeue(data-queue)
      FOR EACH i in v, where v ∈ G_{TL}(E)
        IF i not visited, add i to iter-queue & mark visited
    }
  }
  IF (count < n) add a non-visited vertex to iter-queue & mark visited
} WHILE (count < n) // until all vertices were visited
```

- Metric to minimize: $\sum_{e \in G_{TL}(E)} (\sum_{i_j, i_k \in e} |\delta(i_j) - \delta(i_k)|),$ where e is a hyperedge of the temporal locality hypergraph and $\delta$ gives the new iteration ordering.
- Span metric: $\sum_{e \in G_{TL}(E)} (\max_{i \in e} (\delta(i)) - \min_{i \in e} (\delta(i)))$
- Density metric:

$$\sum_{e \in G_{TL}(E)} \left( \frac{\max_{i \in e} (\delta(i)) - \min_{i \in e} (\delta(i))}{|e|} \right)$$

The Spatial Locality Hypergraph is the dual of the Temporal Locality Hypergraph, i.e., vertices are data items and a hyperedge is formed by the set of items accessed by an iteration.
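The span and density metrics can be computed directly from the index arrays, since each data item defines one hyperedge containing the iterations that access it. The sketch below is an illustration on the packed index arrays of the data-packing example, before and after the lexicographic sort; function names are chosen for the example.

```python
from collections import defaultdict

def hyperedges(l, r):
    """One hyperedge per data item: the positions of the iterations that access it."""
    edges = defaultdict(set)
    for pos, (a, b) in enumerate(zip(l, r)):
        edges[a].add(pos)
        edges[b].add(pos)
    return edges

def span(l, r):
    return sum(max(e) - min(e) for e in hyperedges(l, r).values())

def density(l, r):
    return sum((max(e) - min(e)) / len(e) for e in hyperedges(l, r).values())

# Index arrays after data packing, in the original iteration order ...
l1, r1 = [1, 3, 5, 6, 3, 1], [2, 4, 6, 1, 2, 3]
# ... and after sorting the iterations lexicographically.
l2, r2 = [1, 1, 3, 3, 5, 6], [2, 3, 2, 4, 6, 1]

print(span(l1, r1), span(l2, r2))  # 14 10
print(density(l2, r2) < density(l1, r1))  # True
```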

### **Empirical Results: Weak Scaling**



Figure 14: Results that compare various data-reordering heuristics applied to the mesh improvement application on the Xeon Pentium 4. Each bar represents the execution time for that dataset normalized to the execution time for the original ordering of that dataset. Each data reordering is followed by BFSIter for iteration reordering. The arrow indicates which data reordering results in the lowest spatial locality metric value for each dataset.



Figure 15: Results that compare various data-reordering heuristics applied to the mesh improvement application on the PowerPC G5. Each bar represents the execution time for that dataset normalized to the execution time for the original ordering of that dataset. Each data reordering is followed by BFSIter for iteration reordering. The arrow indicates which data reordering results in the lowest spatial locality metric value for each dataset.



#### **Application: Parallelization of Irregular Arrays**

Code Generation for Parallel Execution of a Class of Irregular Loops on Distributed Memory Systems, M. Ravishankar et al., ICS 2013

#### Sequential Conjugate Gradient Computation

```
while( !converged ) {
    //...Other computation not shown...
    //parallel, producer loop
    for( k = 0 ; k < n ; k++ )
        x[k] = ...;
    //...Other computation not shown...
    //parallel, consumer loop
    for( i = 0 ; i < n ; i++ )
        for( j = ia[i] ; j < ia[i+1] ; j++ ){
            xindex = col[j];
            y[i] += A[j]*x[xindex];
        }
    //...Other computation not shown...
}
```

- Generates automatically the inspector that determines which (indices of) elements of x (and A) are accessed in each outer iteration i, and builds the temporal locality hypergraph.
- Multi-Constraint Partitioning of the hypergraph (i) to achieve load balancing within each parallel loop and (ii) to minimize communication between the producer and consumer loops.
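A hand-written inspector for the consumer loop might look as follows. This is a simplified sketch, not the code the cited compiler generates: it assumes the sparse matrix is stored in CSR form through `ia` and `col` as in the loop above, and the function names and the 3x3 input are hypothetical.

```python
def inspect_accesses(ia, col):
    """For each outer iteration i, the set of indices of x that it reads."""
    return [set(col[ia[i]:ia[i + 1]]) for i in range(len(ia) - 1)]

def temporal_hyperedges(accesses):
    """One hyperedge per element of x: the outer iterations that read it."""
    edges = {}
    for i, touched in enumerate(accesses):
        for k in touched:
            edges.setdefault(k, set()).add(i)
    return edges

# Hypothetical 3x3 sparse matrix: rows 0 and 2 read x[0] and x[2], row 1 reads x[1].
ia = [0, 2, 3, 5]
col = [0, 2, 1, 0, 2]

accesses = inspect_accesses(ia, col)
edges = temporal_hyperedges(accesses)
print(accesses)  # [{0, 2}, {1}, {0, 2}]
print(edges[0], edges[1], edges[2])  # {0, 2} {1} {0, 2}
```

A partitioner would then tend to place iterations 0 and 2 on the node that produces x[0] and x[2], so that the consumer loop needs no communication for them.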

Fig. 6. CG kernel with hood.rb and imi\_sym.rb; panel (f): inspector overhead breakdown for imi\_sym.rb.




### Tracing Application: Copy (Garbage) Collector

A Localized Tracing Scheme Applied to Garbage Collection, Chicha and Watt, APLAS'06.

A New Approach to Parallelising Tracing Algorithms, Oancea, Mycroft and Watt, ISMM'09.



// Abstract Alg for Tracing:

How to parallelize? Use several worklists instead of one. Worklist semantics:

1. Processor Centric: as many worklists as the number of processors. A worklist holds items that are to be processed by the same processor.
2. Memory Centric: super-partition memory. A worklist is associated with a memory partition: it holds elements that belong to the same partition.

### Semi Space Copy Collector

```
while (!queue.isEmpty()) {
  int ind = 0;
  Object from_child, to_child;
  Object to_obj = queue.dequeue();
  foreach (from_child in to_obj.fields()) {
    ind++;
    atomic {
      // already copied; a full collector would also
      // redirect the field to the existing copy
      if (from_child.isForwarded())
        continue;
      to_child = copy(from_child);
      setForwardingPtr(from_child, to_child);
    }
    to_obj.setField(to_child, ind-1);
    queue.enqueue(to_child);
  }
}
```

- A Semi Space Collector partitions the memory into two halves: from-space and to-space. When from-space becomes full, the live objects are copied to to-space, and the roles of the two spaces are flipped.
- forwardingPtr points to the to-space object, and denotes marking.
- to\_obj.setField(to\_child, ind-1); sets field ind-1 of to\_obj to the copied child.
- Queue-access synchronization is not a problem, e.g., a double-ended queue data structure allows work stealing at minimal locking overhead.
- Problematic synchronization: the fine-grained, per-object locking, without which an object can be copied to two to-space locations, with references split between the two.

### **Sequential Localized Tracing Scheme**

Sequential, but uses several memory-centric worklists. Reduces the working set and hence memory thrashing (TLB misses).
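The idea of memory-centric worklists can be illustrated with a small sequential sketch. It is a loose approximation and not the algorithm of the cited papers: objects are modelled as integer addresses, `partition_of` is a hypothetical function mapping an address to its partition, and a partition's worklist is drained completely before moving to the next one.

```python
from collections import deque

def localized_trace(roots, children, partition_of, num_partitions):
    """Trace an object graph using one worklist per memory partition.

    Returns the order in which objects are processed.
    """
    worklists = [deque() for _ in range(num_partitions)]
    marked = set()
    order = []
    for obj in roots:
        if obj not in marked:
            marked.add(obj)
            worklists[partition_of(obj)].append(obj)
    while any(worklists):          # some worklist is non-empty
        for wl in worklists:
            while wl:              # stay inside one partition as long as possible
                obj = wl.popleft()
                order.append(obj)
                for child in children.get(obj, ()):
                    if child not in marked:
                        marked.add(child)
                        worklists[partition_of(child)].append(child)
    return order

# Hypothetical heap: addresses 0..99 form partition 0, 100..199 partition 1.
children = {0: [150, 10], 10: [20], 150: [160]}
order = localized_trace([0], children, lambda addr: addr // 100, 2)
print(order)  # [0, 10, 20, 150, 160]
```

A single FIFO worklist would process 0, 150, 10, 160, 20 and alternate between the two partitions at every step; grouping work by partition keeps the working set small.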













### Parallel Localized Tracing Scheme

Memory-Centric Parallelization eliminates the problematic (fine-grained) synchronization overhead (when copying an object) because there is exactly one processor that owns a memory partition.



