# Chapter 2: Perspectives on Parallel Programming

Copyright © 2005-2008 Yan Solihin

#### Copyright notice:

No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author.

An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: "Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008".

## Module 2.1 Parallel Programming Models

#### **Programming Models**

- What is a programming model?
  - An abstraction provided by the hardware to programmers
  - Determines how easy or difficult it is for programmers to express their algorithms as computation tasks that the hardware understands
- Uniprocessor programming model
  - Based on program + data
  - Bundled in Instruction Set Architecture (ISA)
  - Highly successful in hiding hardware from programmers
- Multiprocessor programming model
  - Much debate, still searching for the right one...
  - Most popular: shared memory and message passing

## Shared Mem vs. Msg Passing

[Figure: shared memory model, with processors (P) issuing loads (ld) and stores (st) to a common memory.]



- Shared Memory / Shared Address Space
  - Each memory location visible to all processors
- Message Passing
  - Each memory location visible to 1 processor

## Thread/process — Uniproc analogy

#### Process: share nothing

```
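if (fork() == 0)   // fork() returns 0 in the child process
    printf("I am the child process, my id is %d\n", getpid());
else               // nonzero return: the parent gets the child's id here
    printf("I am the parent process, my id is %d\n", getpid());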
```

- -Heavyweight => high process creation overhead
- -Processes share nothing => explicit communication using sockets, files, or messages
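The "share nothing" point can be checked directly. The following is a minimal sketch, assuming a POSIX system; the helper name `parent_value_after_child_write` is hypothetical and not part of the original slides. The child overwrites a variable, and the parent still sees the old value because each process has its own copy of the address space after `fork()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (an illustration, not from the slides): the child
 * overwrites `value` in its own address space; the function returns what
 * the parent still sees after the child has exited. */
int parent_value_after_child_write(void)
{
    int value = 5;               /* each process has its own copy after fork() */
    pid_t pid = fork();

    if (pid < 0)
        return -1;               /* fork failed */

    if (pid == 0) {
        value = 99;              /* changes only the child's private copy */
        _exit(0);                /* child exits without running parent code */
    }

    waitpid(pid, NULL, 0);       /* wait until the child has finished */
    return value;                /* unchanged: the child's store is not visible */
}
```

To pass the new value back, the child would have to send it explicitly, for example through a pipe, socket, or file.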

#### Thread: share everything

```
void sayhello() {
  printf("I am child thread, my id is %d", getpid());
}
printf("I am the parent thread, my id is %d", getpid());
clone(&sayhello,<stackarg>,<flags>,())
```

- + Lightweight => small thread creation overhead
- + Threads share the address space => implicit communication
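For contrast with processes, here is a minimal sketch of the "share everything" behavior using POSIX threads instead of the raw `clone()` call shown above (an assumption made for portability; the helper names are hypothetical). A store made by the child thread is visible to the parent without any explicit message.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

static int shared_value;         /* one copy, visible to every thread */

static void *child_thread(void *arg)
{
    (void)arg;
    shared_value = 99;           /* ordinary store into the shared address space */
    return NULL;
}

/* Hypothetical helper: returns the value the parent thread observes
 * after the child thread has written to the shared variable. */
int parent_value_after_thread_write(void)
{
    pthread_t tid;

    shared_value = 5;
    if (pthread_create(&tid, NULL, child_thread, NULL) != 0)
        return -1;               /* thread creation failed */
    pthread_join(tid, NULL);     /* join orders the child's store before our load */
    return shared_value;         /* the child's store is visible here */
}
```

On most systems this needs to be compiled with `-pthread`.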

#### Thread communication analogy

```
int a, b, signal;
...
void dosum(<args>) {
  while (signal == 0) {}; // wait until instructed to work
  printf("child thread> sum is %d", a + b);
  signal = 0; // my work is done
}

void main() {
  a = 5, b = 3;
  signal = 0;
  clone(&dosum,...); // spawn child thread
  signal = 1; // tell child to work
  while (signal == 1) {} // wait until child done
  printf("all done, exiting\n");
}
```

Shared memory in a multiprocessor provides a similar memory-sharing abstraction: threads communicate through ordinary loads and stores to shared variables.
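The flag-based handshake above can be written as a runnable sketch. This version assumes POSIX threads and C11 atomics: a plain `int` flag polled in a loop is a data race in C11 and the compiler may hoist the load out of the loop, so the flag is declared `atomic_int` here. The names `signal_flag` and `run_sum` are hypothetical (the flag is renamed to avoid clashing with the C library's `signal()`).

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int a, b, sum;
static atomic_int signal_flag;   /* 0 = idle or done, 1 = work requested */

static void *dosum(void *arg)
{
    (void)arg;
    while (atomic_load(&signal_flag) == 0) { }   /* wait until instructed to work */
    sum = a + b;
    atomic_store(&signal_flag, 0);               /* my work is done */
    return NULL;
}

/* Hypothetical driver: hands two operands to the child thread through
 * shared variables and waits for the result using the flag. */
int run_sum(int x, int y)
{
    pthread_t tid;

    atomic_store(&signal_flag, 0);
    a = x;
    b = y;
    if (pthread_create(&tid, NULL, dosum, NULL) != 0)
        return -1;                               /* thread creation failed */
    atomic_store(&signal_flag, 1);               /* tell child to work */
    while (atomic_load(&signal_flag) == 1) { }   /* wait until child is done */
    pthread_join(tid, NULL);
    return sum;
}
```

Busy-waiting keeps the sketch close to the slide; production code would more likely block on a condition variable or semaphore.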

## Message Passing Example

#### Differences with shared memory:

- Explicit communication
- Message send and receive provide automatic synchronization
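As a minimal sketch of these two points, assuming POSIX pipes stand in for the message channel (the slides do not prescribe a transport, and `sum_via_messages` is a hypothetical name): the parent explicitly sends two operands, and the blocking `read()` on each side provides the synchronization, so no flag or lock is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the parent sends {a, b} to a child process, the
 * child replies with a + b. Returns -1 if any step fails. */
int sum_via_messages(int a, int b)
{
    int to_child[2], to_parent[2];
    int operands[2] = { a, b };
    int result = -1;
    pid_t pid;

    if (pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* child: receive, compute, reply */
        int in[2], out;
        close(to_child[1]);
        close(to_parent[0]);
        /* read() blocks until the message arrives: implicit synchronization */
        if (read(to_child[0], in, sizeof in) != (ssize_t)sizeof in)
            _exit(1);
        out = in[0] + in[1];
        if (write(to_parent[1], &out, sizeof out) != (ssize_t)sizeof out)
            _exit(1);
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);
    if (write(to_child[1], operands, sizeof operands) == (ssize_t)sizeof operands) {
        /* blocks until the child has produced its reply */
        if (read(to_parent[0], &result, sizeof result) != (ssize_t)sizeof result)
            result = -1;
    }
    close(to_child[1]);
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return result;
}
```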

#### **Qualitative Comparison**

Table 2.1: Comparing shared memory and message passing programming models.

| Aspects            | Shared Memory               | Message Passing         |
|--------------------|-----------------------------|-------------------------|
| Communication      | implicit (via loads/stores) | explicit messages       |
| Synchronization    | explicit                    | implicit (via messages) |
| Hardware support   | typically required          | none                    |
| Development effort | lower                       | higher                  |
| Tuning effort      | higher                      | lower                   |

### Development vs. Tuning Effort

- Easier to develop shared memory programs
  - Transparent data layout
  - Transparent communication between processors
  - Code structure little changed
  - Parallelizing compiler, directive-driven compiler help
- Harder to tune shared memory programs for scalability
  - Data layout must be tuned
  - Communication pattern must be tuned
  - Machine topology matters for performance
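To illustrate the "code structure little changed" and "directive-driven compiler" points, here is a sketch that uses OpenMP as one example of a directive-based approach (OpenMP is my choice for illustration; the slides do not name a specific tool, and `positive_sum` is a hypothetical function name). The loops are the same as in a sequential program; only the `#pragma` lines are added.

```c
#include <stddef.h>

/* Computes a[i] = b[i] + c[i], then returns the sum of the positive a[i].
 * Without OpenMP support the pragmas are ignored and the code runs
 * sequentially with the same result. */
double positive_sum(double *a, const double *b, const double *c, int n)
{
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* reduction(+:sum) gives each thread a private partial sum and
     * combines the partial sums when the loop ends */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        if (a[i] > 0)
            sum += a[i];

    return sum;
}
```

With GCC or Clang the directives take effect when compiling with `-fopenmp`. Data layout and thread placement are not expressed anywhere in this code, which is why tuning them for scalability takes extra effort.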

#### More Shared Memory Example

```
for (i=0; i<8; i++)
  a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
  if (a[i] > 0)
    sum = sum + a[i];
Print sum;
```

- + Communication directly through memory.
- + Requires less code modification
- -Requires privatization prior to parallel execution

```
begin parallel // spawn a child thread
private int start_iter, end_iter, i;
shared int local_iter=4;
shared double sum=0.0, a[], b[], c[];
shared lock_type mylock;
start_iter = getid() * local_iter;
end_iter = start_iter + local_iter;
for (i=start_iter; i<end_iter; i++)
  a[i] = b[i] + c[i];
barrier;
for (i=start_iter; i<end_iter; i++)
  if (a[i] > 0) {
    lock (mylock);
      sum = sum + a[i];
    unlock (mylock);
  }
barrier; // necessary
end parallel // kill the child thread
Print sum;
```
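The pseudocode above maps fairly directly onto POSIX threads. The following is a sketch under stated assumptions: two worker threads instead of a parent plus one child, a fixed array size of 8, and hypothetical names (`worker`, `parallel_positive_sum`). The final barrier of the pseudocode is replaced by `pthread_join`, which also guarantees that all updates to `sum` are complete before it is read. Error handling is omitted for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

#define N 8
#define NTHREADS 2
#define LOCAL_ITER (N / NTHREADS)

static double a[N], b[N], c[N];          /* shared arrays */
static double sum;                       /* shared accumulator */
static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = *(const int *)arg;          /* private to each thread */
    int start_iter = id * LOCAL_ITER;
    int end_iter = start_iter + LOCAL_ITER;

    for (int i = start_iter; i < end_iter; i++)
        a[i] = b[i] + c[i];
    pthread_barrier_wait(&barrier);      /* all of a[] is written past this point */
    for (int i = start_iter; i < end_iter; i++)
        if (a[i] > 0) {
            pthread_mutex_lock(&mylock); /* protect the shared sum */
            sum = sum + a[i];
            pthread_mutex_unlock(&mylock);
        }
    return NULL;
}

/* Hypothetical driver: copies the inputs into the shared arrays, runs the
 * workers, and returns the sum of the positive elements of a[]. */
double parallel_positive_sum(const double *in_b, const double *in_c)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < N; i++) {
        b[i] = in_b[i];
        c[i] = in_c[i];
    }
    sum = 0.0;
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);      /* plays the role of the final barrier */
    pthread_barrier_destroy(&barrier);
    return sum;
}
```

Taking the lock once per element mirrors the slide; accumulating into a private partial sum and locking once per thread would reduce lock traffic.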

#### More Message Passing Example

```
for (i=0; i<8; i++)
  a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
  if (a[i] > 0)
    sum = sum + a[i];
Print sum;
```

- + Communication only through messages
- -Message sending and receiving overhead
- -Requires algorithm and program modifications

```
// parent and child already spawned
id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;
if (id == 0)
  send_msg (P1, b[4..7], c[4..7]);
else
  recv_msg (P0, &b[4..7], &c[4..7]);
for (i=start_iter; i<end_iter; i++)
  a[i] = b[i] + c[i];
local_sum = 0;
for (i=start_iter; i<end_iter; i++)
  if (a[i] > 0)
    local_sum = local_sum + a[i];
if (id == 0) {
  recv_msg (P1, &local_sum1);
  sum = local_sum + local_sum1;
  Print sum;
}
else
  send_msg (P0, local_sum);
```
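A runnable approximation of the pseudocode above, assuming two POSIX processes connected by pipes (the slides do not specify a transport; MPI would be the usual choice in practice). The names `send_msg`, `recv_msg`, and `mp_positive_sum` are hypothetical helpers defined here. P0 is the parent and P1 is the child; P1 deliberately uses only the data it receives in messages, even though `fork()` happens to give it a copy of the parent's memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define LOCAL_ITER 4   /* elements handled by each of the two processes */

/* Send or receive exactly len bytes over a pipe; 0 on success. */
static int send_msg(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

static int recv_msg(int fd, void *buf, size_t len)
{
    return read(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* Sum of the positive values of b[i] + c[i] over n local elements. */
static double local_positive_sum(const double *b, const double *c, int n)
{
    double local_sum = 0.0;
    for (int i = 0; i < n; i++) {
        double a = b[i] + c[i];
        if (a > 0)
            local_sum += a;
    }
    return local_sum;
}

/* b and c hold 8 elements; returns -1.0 on failure. */
double mp_positive_sum(const double *b, const double *c)
{
    int p0_to_p1[2], p1_to_p0[2];
    double local_sum, local_sum1 = 0.0;
    int ok;
    pid_t pid;

    if (pipe(p0_to_p1) != 0 || pipe(p1_to_p0) != 0)
        return -1.0;
    pid = fork();
    if (pid < 0)
        return -1.0;

    if (pid == 0) {                               /* P1 */
        double rb[LOCAL_ITER], rc[LOCAL_ITER], s;
        close(p0_to_p1[1]);
        close(p1_to_p0[0]);
        if (recv_msg(p0_to_p1[0], rb, sizeof rb) != 0 ||
            recv_msg(p0_to_p1[0], rc, sizeof rc) != 0)
            _exit(1);
        s = local_positive_sum(rb, rc, LOCAL_ITER);
        _exit(send_msg(p1_to_p0[1], &s, sizeof s) == 0 ? 0 : 1);
    }

    close(p0_to_p1[0]);                           /* P0 */
    close(p1_to_p0[1]);
    ok = send_msg(p0_to_p1[1], b + LOCAL_ITER, LOCAL_ITER * sizeof *b) == 0 &&
         send_msg(p0_to_p1[1], c + LOCAL_ITER, LOCAL_ITER * sizeof *c) == 0;
    local_sum = local_positive_sum(b, c, LOCAL_ITER);
    ok = ok && recv_msg(p1_to_p0[0], &local_sum1, sizeof local_sum1) == 0;
    close(p0_to_p1[1]);
    close(p1_to_p0[0]);
    waitpid(pid, NULL, 0);
    return ok ? local_sum + local_sum1 : -1.0;
}
```

Compared with the shared memory version, the data distribution and the final combination step are spelled out in the code, which is the extra algorithm and program modification the slide refers to.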

### Other Programming Models

- PGAS
  - Partitioned Global Address Space
- Data Parallel
- MapReduce
- Transactional Memory

#### **PGAS**

- Shared memory model too simple for NUMA
  - Data layout is hidden from programmers
  - But thread-data proximity is important for performance
- PGAS provides shared & private address space



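UPC declaration examples: `int a;` is private (one copy per thread), `shared int b;` is a shared scalar, and `shared [2] double x[8][8];` is a shared array distributed across threads in blocks of 2 elements.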



#### Example

- Every node has N\*P/THREADS rows of A and C
- `&A[i][0]` in `upc_forall` assigns iteration i to the thread where `A[i][0]` is located
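The notion of "where an element is located" can be made concrete. In UPC, a shared array declared with block size B is laid out block-cyclically: element number i (in row-major order) has affinity to thread (i / B) mod THREADS. The plain C function below is only a sketch of that rule for reasoning about data placement; `owner_thread` is a hypothetical name, not a UPC library call.

```c
#include <stddef.h>

/* Returns the thread that owns element [row][col] of a shared 2-D array
 * with `ncols` columns, block size `block`, and `threads` UPC threads.
 * Models the block-cyclic layout rule; not part of any UPC API. */
int owner_thread(size_t row, size_t col, size_t ncols,
                 size_t block, int threads)
{
    size_t linear = row * ncols + col;            /* row-major element number */
    return (int)((linear / block) % (size_t)threads);
}
```

For `shared [2] double x[8][8]` with 4 threads, `x[0][0]` and `x[0][1]` belong to thread 0, `x[0][2]` to thread 1, and `x[1][0]` wraps around to thread 0 again. An affinity expression in `upc_forall` uses this ownership to run each iteration on the thread that holds the data.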

#### Data Parallel Model

- Data parallel = programming model for SIMD
- Either vector or scalar with lanes
  - Requires packing data into a wide register



Once packed, can compute multiple data items at once



### Example

```
// A 128-bit vector struct with four 32-bit floats
struct Vector4
{
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
Vector4 SSE_Add ( const Vector4 &Operand1, const Vector4 &Operand2 )
{
    Vector4 Result;

    __asm
    {
        MOV EAX, Operand1       // Load pointers into CPU regs
        MOV EBX, Operand2

        MOVUPS XMM0, [EAX]      // Move unaligned vectors to SSE regs
        MOVUPS XMM1, [EBX]

        ADDPS XMM0, XMM1        // Vector addition
        MOVUPS [Result], XMM0   // Save the return vector
    }

    return Result;
}
```
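The same four-wide addition can be sketched in C with compiler intrinsics instead of inline assembly, which avoids the compiler-specific `__asm` syntax. This is an illustration under assumptions: `add4` is a hypothetical name, and a scalar fallback is included so the code still builds on targets without SSE.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */
#endif

/* Adds two vectors of four floats: out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* pack 4 floats (unaligned load) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one instruction adds all 4 lanes */
#else
    for (size_t i = 0; i < 4; i++)           /* scalar fallback, same result */
        out[i] = a[i] + b[i];
#endif
}
```

`_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` correspond to the `MOVUPS` and `ADDPS` instructions used in the assembly version.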

#### MapReduce

- Data accessed through key rather than location
  - Can be implemented on shared memory or message passing models
- Two steps: Map (produce <key,val> pairs) and Reduce (aggregate values for each key)



### Example: Inverted Index

- Files distributed over map workers
- Each map worker produces <LinkX,DocY> when it encounters LinkX on DocY
- Reduce worker aggregates all Docs having the same Link
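The inverted index steps can be sketched sequentially in C. Everything here is illustrative: the tiny three-document corpus, the `Pair` type, and the function names are hypothetical, and a real MapReduce runtime would run the map and reduce calls on different workers. Sorting the pairs by key stands in for the grouping step between map and reduce.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { const char *link; int doc; } Pair;   /* one <Link,Doc> pair */

/* Map: emit one <link, doc> pair for every link found in document `doc`. */
static size_t map_doc(int doc, const char *const *links, size_t nlinks, Pair *out)
{
    for (size_t i = 0; i < nlinks; i++) {
        out[i].link = links[i];
        out[i].doc = doc;
    }
    return nlinks;
}

static int by_link_then_doc(const void *x, const void *y)
{
    const Pair *p = x, *q = y;
    int cmp = strcmp(p->link, q->link);
    return cmp != 0 ? cmp : (p->doc > q->doc) - (p->doc < q->doc);
}

/* Length of the run of pairs that share the key at pairs[start]. */
static size_t run_length(const Pair *pairs, size_t n, size_t start)
{
    size_t end = start;
    while (end < n && strcmp(pairs[end].link, pairs[start].link) == 0)
        end++;
    return end - start;
}

/* Reduce: aggregate the docs of one key (a run of equal links). */
static size_t reduce_run(const Pair *run, size_t len, int *docs, size_t max_docs)
{
    size_t count = len < max_docs ? len : max_docs;
    for (size_t i = 0; i < count; i++)
        docs[i] = run[i].doc;
    return count;
}

/* Builds the index for a made-up corpus and returns the docs that contain
 * `link` (at most max_docs of them); the return value is the count. */
size_t lookup_link(const char *link, int *docs, size_t max_docs)
{
    static const char *const doc0[] = { "a.com", "b.com" };
    static const char *const doc1[] = { "b.com" };
    static const char *const doc2[] = { "a.com", "c.com" };
    Pair pairs[5];
    size_t n = 0;

    n += map_doc(0, doc0, 2, pairs + n);
    n += map_doc(1, doc1, 1, pairs + n);
    n += map_doc(2, doc2, 2, pairs + n);
    qsort(pairs, n, sizeof pairs[0], by_link_then_doc);  /* group equal keys */

    for (size_t start = 0; start < n; ) {
        size_t len = run_length(pairs, n, start);
        if (strcmp(pairs[start].link, link) == 0)
            return reduce_run(pairs + start, len, docs, max_docs);
        start += len;
    }
    return 0;                                            /* link not found */
}
```

Note that the code never refers to where a pair is stored, only to its key, which is what lets the model sit on top of either shared memory or message passing.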



#### Transactional Memory

- Transaction = code region with the ACID properties (Atomicity, Consistency, Isolation, Durability)
- Transactional Memory (TM) adopts Atomicity and Isolation
  - It allows us to remove some locks and worries over deadlocks

### Lock Example

```
image = Image_Read(fn_input);

for (i=0; i<image->row; i++) {
   for (j=0; j<image->col; j++) {
     lock();
     histoRed[image->red[i][j]]++;
     histoGreen[image->green[i][j]]++;
     histoBlue[image->blue[i][j]]++;
     unlock();
   }
}
```

```
image = Image_Read(fn_input);
lock_t redLock[256], greenLock[256], blueLock[256];

for (i=0; i<image->row; i++) {
   for (j=0; j<image->col; j++) {
      // one lock per histogram bin, so updates to different bins do not conflict
      lock(&redLock[image->red[i][j]]);
      histoRed[image->red[i][j]]++;
      unlock(&redLock[image->red[i][j]]);

      lock(&greenLock[image->green[i][j]]);
      histoGreen[image->green[i][j]]++;
      unlock(&greenLock[image->green[i][j]]);

      lock(&blueLock[image->blue[i][j]]);
      histoBlue[image->blue[i][j]]++;
      unlock(&blueLock[image->blue[i][j]]);
   }
}
```
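A runnable sketch of the per-bin locking idea, reduced to a single color channel and two POSIX threads. The `Slice` type and the function names are hypothetical, the input is a flat array of 8-bit pixel values rather than the image structure used above, and error handling is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

#define BINS 256

static unsigned histo[BINS];                 /* shared histogram */
static pthread_mutex_t bin_lock[BINS];       /* one lock per bin */

typedef struct { const unsigned char *pixels; size_t begin, end; } Slice;

static void *count_slice(void *arg)
{
    const Slice *s = arg;
    for (size_t i = s->begin; i < s->end; i++) {
        unsigned char v = s->pixels[i];
        pthread_mutex_lock(&bin_lock[v]);    /* lock only the bin being updated */
        histo[v]++;
        pthread_mutex_unlock(&bin_lock[v]);
    }
    return NULL;
}

/* Hypothetical driver: two threads each count half of the pixels, then the
 * count for the requested bin is returned. */
unsigned parallel_histogram_bin(const unsigned char *pixels, size_t n,
                                unsigned char bin)
{
    pthread_t tid[2];
    Slice slice[2] = { { pixels, 0, n / 2 }, { pixels, n / 2, n } };

    for (int i = 0; i < BINS; i++) {
        histo[i] = 0;
        pthread_mutex_init(&bin_lock[i], NULL);
    }
    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, count_slice, &slice[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < BINS; i++)
        pthread_mutex_destroy(&bin_lock[i]);
    return histo[bin];
}
```

Fine-grained locking allows more concurrency than a single global lock, at the cost of 256 locks per channel and a lock/unlock pair on every update.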

### TM Example

```
image = Image_Read(fn_input);

for (i=0; i<image->row; i++) {
    for (j=0; j<image->col; j++) {
        atomic {
            histoRed[image->red[i][j]]++;
            histoGreen[image->green[i][j]]++;
            histoBlue[image->blue[i][j]]++;
        }
    }
}
```

- Transaction enclosed by atomic {...}
- Hardware or software ensures atomicity and isolation
  - SW inflexible and slow
  - HW expensive
- No locks needed and no locking overheads are incurred

### **Closing Comments**

- Many programming models out there
- Most are based on shared memory or message passing, and build on top of them
- Trade-offs between control and complexity
- Overall, parallel programs still require a lot of tuning regardless of the programming model used

#### Module Review Questions

- What are the two basic parallel programming models?
- What key advantages does the shared memory model have over the message passing model?
- What primitives are necessary for supporting shared memory parallel programming?
- In what way does Transactional Memory simplify parallel programming?

#### Module Review Questions: Answers

- What are the two basic parallel programming models?
  - Shared memory and message passing
- What key advantages and disadvantages does the shared memory model have compared with the message passing model?
  - Pluses: implicit communication, lower development effort, finer-grained communication. Minuses: explicit synchronization, higher tuning effort, requires hardware support.
- What primitives are necessary for supporting shared memory parallel programming?
  - Variable scope (shared vs. private), synchronization primitives
- In what way does Transactional Memory simplify parallel programming?
  - Higher abstraction (simpler coding and reasoning), removing lock-related problems