



# Multithreaded programming on hybrid parallel architectures

<u>Silvio Stanzani</u>, Raphael Cóbe, Rogério Iope, Jefferson Fialho

UNESP - Núcleo de Computação Científica <u>silvio@ncc.unesp.br</u>, <u>rmcobe@ncc.unesp.br</u>, <u>rogerio@ncc.unesp.br</u>, <u>jfialho@ncc.unesp.br</u>

## Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

## **UNESP Center for Scientific Computing**

- Consolidates scientific computing resources for São Paulo State University (UNESP) researchers
  - It mainly uses Grid computing paradigm

- Main users
  - UNESP researchers, students, and software developers
  - SPRACE (São Paulo Research and Analysis Center) physicists and students
    - ☐ Caltech, Fermilab, CERN
    - ☐ São Paulo CMS Tier-2 Facility

## **Hybrid Parallel Architectures**

- Heterogeneous computational systems:
  - Multicore processors;
  - Multi-level memory sub-system;
- Multi-level parallelism:
  - Processing core;
  - Chip multiprocessor;
  - Computing node;
  - Computing cluster;
- Hybrid Parallel architectures
  - Coprocessors and accelerators;

#### Scalar and Vector Instructions

- **Scalar** Code computes this oneelement at a time.
- Vector (or SIMD) Code computes more than one element at a time.
  - SIMD stands for Single Instruction
     Multiple Data.
- Vectorization
  - Loading data into cache accordingly;
  - Store elements on SIMD registers or vectors;
  - Iterations need to be independent;
  - Usually on inner loops.



SIMD



## **Hybrid Parallel Architectures**



- Heterogeneous Multilevel Memory System
- Multilevel parallelism:
  - Instruction Level;
  - Vectorization;
  - Multithreading;
  - Multiprocessing;

#### **Native Execution**

#### Offloading

- Bus Communication
- Network
- QPI (Quick Path Interconnect)

#### Symmetric

- MPI
- Socket
- TCP/IP

Execution Mode =

#### Device

- Intel Xeon Phi (KNC)
- Intel Xeon Phi (KNL)
- Intel FPGA

## Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

#### Intel Xeon and Intel<sup>®</sup> Xeon Phi<sup>™</sup> Overview

# Intel® Multicore Architecture





- Foundation of HPC Performance
- Suited for full scope of workloads
- Focus on fast single core/thread performance with "moderate" number of cores

# Intel® Many Integrated Core Architecture





- Performance and performance/watt optimized for highly parallelized compute workloads
- IA extension to Manycore
- Many cores/threads with wide SIMD

#### Intel Xeon Architecture Overview



- •Socket: mechanical component that provides mechanical and electrical connections between a microprocessor and a printed circuit board (PCB).
- •QPI (Intel QuickPath Interconnect): high speed, packetized, point-to-point interconnection, that stitch together processors in distributed shared memory and integrated I/O platform architecture.



#### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Architecture Overview

Knights Core (KNC)



#### Intel® Xeon Phi™ Coprocessor Arch – System SW Perspective

- Large SMP UMA machine a set of x86 cores
  - 4 threads
    - □ 32 KB L1 I/D
    - □ 512 KB L2 per core
  - Supports loadable kernel modules
  - VM subsystem, File I/O

- Virtual Ethernet driver
  - supports NFS mounts from Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor
  - Support bridged network

# Knights Landing (KNL)



#### Cluster modes

#### One single space address

Node 0

#### Hemisphere:

the tiles are divided into two parts called hemisphere



#### **Quadrant:**

tiles are divided into two parts called hemisphere or into four parts called qudrants



01/08/16

#### Cluster modes

#### Cache data are isolated in each sub numa domain

snc-2: the tiles are divided into two Numa Nodes



SNC-4: the tiles are divided into two Numa Nodes



01/08/16

# Integrated On-Package Memory Usage Models

## Integrated On-Package Memory Usage Models







Split Options: 25/75% or 50/50%

| Cache Model                                                                                                                                        | Flat Model                                                                                                                                              | Hybrid Model                                                                                                                                                        |
|----------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Hardware automatically manages the MCDRAM as a "L3 cache" between CPU and ext DDR memory                                                           | Manually manage how the app uses<br>the integrated on-package memory<br>and external DDR for peak perf                                                  | Harness the benefits of both Cache and Flat models by segmenting the integrated on-package memory                                                                   |
| <ul> <li>App and/or data set is very large<br/>and will not fit into MCDRAM</li> <li>Unknown or unstructured memory<br/>access behavior</li> </ul> | <ul> <li>App or portion of an app or data set<br/>that can be, or is needed to be<br/>"locked" into MCDRAM so it<br/>doesn't get flushed out</li> </ul> | <ul> <li>Need to "lock" in a relatively small portion of an app or data set via the Flat model</li> <li>Remaining MCDRAM can then be configured as Cache</li> </ul> |

## Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

### **OpenMP**

- OpenMP is an acronym for Open Multi-Processing
- An Application Programming Interface (API) for developing parallel programs in shared memory architectures
- Three primary components of the API are:
  - Compiler Directives
  - Runtime Library Routines
  - Environment Variables
- De facto standard specified for C / C++ and FORTRAN
- http://www.openmp.org/
  - Specification, examples, tutorials and documentation

## OpenMP



Time

## OpenMP - Core elements



## OpenMP Sample Program

```
N=25;
#pragma omp parallel for
for (i=0; i<N; i++)
a[i] = a[i] + b;
```

|    | Thread 0     |  |  |  |   |   | Thr | eac | 1 1 |    |    | Thr | eac | 12 |    |    | Thr | eac | 13 |    | Thread 4 |    |    |    |  |
|----|--------------|--|--|--|---|---|-----|-----|-----|----|----|-----|-----|----|----|----|-----|-----|----|----|----------|----|----|----|--|
| i= | j= 0 1 2 3 4 |  |  |  | 5 | 6 | 7   | 8   | 9   | 10 | 11 | 12  | 13  | 14 | 15 | 16 | 17  | 18  | 19 | 20 | 21       | 22 | 23 | 24 |  |
|    |              |  |  |  |   |   |     |     |     |    |    |     |     |    |    |    |     |     |    |    |          |    |    |    |  |
|    |              |  |  |  |   |   |     |     |     |    |    |     |     |    |    |    |     |     |    |    |          |    |    |    |  |
|    |              |  |  |  |   |   |     |     |     |    |    |     |     |    |    |    |     |     |    |    |          |    |    |    |  |

## OpenMP Sample Program

```
#include <stdio.h>
                                                      #pragma omp parallel for
                                                        for(i=0; i<r; ++i)
int main(){
                                                          for(i=0; i<c; ++i) {
  int r, c, i, j, *a, *b, *sum;
                                                           a[i*r + i]=i+i
  char hn[600];
                                                           b[i*r + i]=i-i;
  #pragma omp parallel
                                                        #pragma omp parallel for
   gethostname(hn,600);
                                                        for(i=0;i<r;++i)
   printf("hostname %s\n",hn);
                                                          for(i=0:i<c:++i)
                                                             sum[i*r+i] = a[i*r+i] + b[i*r+i];
  r=40000:
                                                        free(a);
  c = 40000:
                                                        free(b);
                                                        free(sum);
  a = (int*)malloc(r*c*sizeof(double));
  b = (int*)malloc(r*c*sizeof(double));
                                                        return 0;
  sum = (int*)malloc(r*c*sizeof(double));
```

#### Compiling and running an OpenMP application

```
#Build the application for Multicore Architecture (Xeon) icc <source-code> -o <omp_binary> -fopenmp
```

#Build the application for the ManyCore Architecture (Xeon Phi) icc <source-code> -o <omp\_binary>.mic -fopenmp -mmic

#Launch the application on host ./omp\_binary

#Launch the application on the device from host micnativeloadex ./omp\_binary.mic -e "LD\_LIBRARY\_PATH=/opt/intel/lib/mic/"

#### Compiling and running an OpenMP application

export OMP\_NUM\_THREADS=10
./OMP-hello

hello from hostname phi02.ncc.unesp.br Launch the application on the Coprocessor from host

micnativeloadex ./OMP-hello.mic -e
"OMP\_NUM\_THREADS=10
LD\_LIBRARY\_PATH=/opt/intel/lib/mic/"

hello from hostname phi02-mic0.ncc.unesp.br sum of vector elements: 5789.473684

## Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

#### **Identifying Parallelization Opportunities**

#### Intel Advisor steps:

- 1° Include headers
- #include "advisor-annotate.h"
- 2° add include reference; link library





#### <u>Linux - compiling / link</u> with Advisor

icpc -O2 -openmp

02\_ReferenceVersion.cpp

- -o 02\_ReferenceVersion
- -I/opt/intel/advisor/include/
- -L/opt/intel/advisor/lib64/

## Identifying Parallelization Opportunities

#### Intel Advisor Analysis:

- Survey
  - □ Vectorization of loops: detailed information about vectorization;
  - ☐ Total Time: elapsed time in each loop considering the time involved in internal loops;
  - ☐ Self Time: elapsed time in each loop without internal loops;
- Suitability
  - Speedup gains obtained parallelizing annotated loops;

5/22/17 26

## Intel Advisor - Survey Data



# Intel Advisor - Survey Data

| Function C                             | Call Sites and L                                                                                                                                                                                                                                                                                                                                                                                                                     | oops                                                                                                       | ۵                             | Vector Issues | Self Time    | ▼ Total   | Time     | Туре    | Why No Vectorization? |                                          |  |  |  |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|-------------------------------|---------------|--------------|-----------|----------|---------|-----------------------|------------------------------------------|--|--|--|
| □ 🖰 [loop                              | in multiply3 at                                                                                                                                                                                                                                                                                                                                                                                                                      | multiply.c:228]                                                                                            |                               |               | 2 Assume     | 0.170     | s 📗 0    | .170s 🔲 | Scalar                | vector dependence prevents vectorization |  |  |  |
|                                        | inlibc_csu_ir                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                                            |                               |               |              | 0.000     | s ( 0    | .000s ( | Scalar                |                                          |  |  |  |
|                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                      | _16_offload_host_d                                                                                         | :pp_ad92                      |               |              | 0.000     | s ( 0    | .000s [ | Scalar                |                                          |  |  |  |
|                                        | in func@0x5b                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                                                            |                               |               | 💡 2 Data typ |           | -        | .000s ( | Scalar                |                                          |  |  |  |
|                                        | in main at ma                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                                            |                               |               | 2 Data typ   |           | •        |         | Scalar                | inner loop was already vectorized        |  |  |  |
|                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                      | multiply.c:227]                                                                                            |                               |               | 2 Assume     |           |          |         | Scalar                | vector dependence prevents vectorization |  |  |  |
|                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                      | multiply.c:226]                                                                                            |                               |               | 2 Assume     |           |          |         | Scalar                | vector dependence prevents vectorization |  |  |  |
| ± () [loop                             | in main at ma                                                                                                                                                                                                                                                                                                                                                                                                                        | trix.c:144]                                                                                                |                               |               | 💡 1 Data typ | 0.000     | 5( 0     | .000s ( | Vectorized (B         |                                          |  |  |  |
| Source                                 | Source Top Down Loop Analytics Loop Assembly 💡 Recommendations 🖬 Compiler Diagnostic Details                                                                                                                                                                                                                                                                                                                                         |                                                                                                            |                               |               |              |           |          |         |                       |                                          |  |  |  |
|                                        | File: multiply.c:228 multiply3                                                                                                                                                                                                                                                                                                                                                                                                       |                                                                                                            |                               |               |              |           |          |         |                       |                                          |  |  |  |
| Line                                   |                                                                                                                                                                                                                                                                                                                                                                                                                                      |                                                                                                            |                               |               |              |           |          |         |                       | Source                                   |  |  |  |
| 220<br>221<br>222<br>223<br>224<br>225 | //#pragma omp target device(0) map(a[0:NUM][0:NUM]) \ //map(b[0:NUM][0:NUM]) map(c[0:NUM][0:NUM])  //{  int i,j,k;  // #pragma omp parallel for collapse (2) //num threads(60)  for(i=0; i <msize; [loop="" applied<="" at="" dependence="" i++)="" in="" loop="" loop.="" multiply.c:226]="" multiply3="" no="" not="" prevents="" scalar="" td="" transformations="" vector="" vectorization="" vectorized:="" {=""  =""></msize;> |                                                                                                            |                               |               |              |           |          |         |                       |                                          |  |  |  |
|                                        | (b) [loop in multiply3 at multiply.c:227] Scalar loop. Not vectorized: vector dependence prevents vectorization Remainder loop                                                                                                                                                                                                                                                                                                       |                                                                                                            |                               |               |              |           |          |         |                       |                                          |  |  |  |
| 228                                    | ∖ () [loop :<br>Scal<br>Loop                                                                                                                                                                                                                                                                                                                                                                                                         | for(j=0; j <msiz<br>in multiply3 at<br/>ar loop. Not ve<br/>was unrolled b<br/>c[i][j] = c<br/>}</msiz<br> | multiply.<br>ctorized:<br>y 2 | vecto         |              | revents v | ectoriza | ntion   |                       |                                          |  |  |  |

5/22/17 28

## Matrix Multiplication

```
for(i=0; i<msize; i++) {
    for(j=0; j<msize; j++) {
       for(k=0; k<msize; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
       }
     }
}</pre>
```

## Intel Advisor - Check Suitability

- Inserting advisor **Annotations key words** for Check Suitability:
  - ANNOTATE\_SITE\_BEGIN(id): before beginning of loop;
  - ANNOTATE\_ITERATION\_TASK(id): first line inside the loop;
  - ANNOTATE\_SITE\_END(): after end of loop;

```
ANNOTATE_SITE_BEGIN( MySite1);
for(i=0; i<msize; i++) {
    ANNOTATE_ITERATION_TASK( MyTask1 );
    for(k=0; k<msize; k++)
        for(j=0; j<msize; j++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
ANNOTATE_SITE_END();</pre>
```

## Intel Advisor - Check Suitability



#### Intel Advisor - Check Suitability



## Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

#### Critical section

 In concurrent programming, concurrent accesses to shared resources can lead to unexpected or erroneous behavior;

 Regions of code where the shared resource is accessed, has to be protected against concurrent access and is known as critical section;

#### Challenge:

- Identify critical sections;
- Impose synchronization without loss of performance.

## Intel Inspector

 Intel Inspector is a debugger tool that performs dynamic analysis. It is capable of identify the following errors:

#### – Memory Errors

- Memory leaks;
- Memory corruption;
- ☐ Allocation / de-allocation API mismatches
- ☐ Inconsistent memory API usage;
- □ Illegal memory access;
- □ Uninitialized memory read;

#### Threading Errors

- Data races:
  - O Heap races;
  - o Stack races;
- Deadlocks:



## Intel Inspector



### Intel Inspector



#### Intel Inspector



### Intel Inspector



#### **Tachyon**

- Tachyon: a parallel/multiprocessor ray tracing software.
- Variable col is declared as global, but used by several threads;



#### **Tachyon**

Solution with synchronization;

- Eliminating concurrency
  - Variable is not used outside function;
  - chaging variable from global to local eliminates Syncronization;



#### Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections

#### Pragma omp target

Transfer control [and data] from host to device

```
Syntax
```

```
- #pragma omp target [data] [clause[[,] clause],...]
    structured-block
```

#### Clauses

#### Pragma omp target

#### Map clauses:

- alloc : allocate memory on device;
- to: transfer a variable from host to device;
- from: transfer a variable from device to host;
- tofrom :
  - ☐ transfer a variable from host to device before start execution;
  - ☐ transfer a variable from device to host after finish execution;



## Offloading - omp target



#### Pragma omp target example

```
#pragma omp target device(0) map(a[0:NUM][0:NUM])
map(b[0:NUM][0:NUM]) map(c[0:NUM][0:NUM])
  #pragma omp parallel for collapse (2)
  for(i=0; i<msize; i++) {
    for(k=0; k<msize; k++) {
      for(j=0; j<msize; j++) {
         c[i][i] = c[i][i] + a[i][k] * b[k][i];
```

#### Agenda

- Hybrid Parallel Architectures
- Intel HPC Architectures
- OpenMP
- Profiling with Intel Advisor (Threading Workflow)
- Critical Sections
- Offload
- N-Body Simulation

## **N-Body Simulation**

- An N-body simulation [1] aims to approximate the motion of particles that interact with each other according to some physical force;
- Used to study the movement of bodies such as satellites, planets, stars, galaxies, etc., which interact with each other according to the gravitational force;
- Newton's second law of motion can be used in a N-body simulation to define the bodies' movement.

[1] AARSETH, S. J. Gravitational n-body simulations. [S.I.]: Cambridge University Press, 2003. Cambridge Books Online.

#### N-Body Algorithm

#### Bodies struct:

- 3 matrix represents velocity (x,y and z)
- 3 matrix represents position (x,y and z)
- 1 matrix represent mass

#### A loop calculate temporal steps:

 At each temporal step new velocity and position are calculated to all bodies according to a function that implements Newton's second law of motion

# N-Body - Parallel version (host only)

```
function Newton(step)
  #pragma omp for
  for each body[x] {
    #pragma omp simd
    for each body[y]
      calc force exerted from body[y] to body[x];
    calc new velocity of body[x]
  #pragma omp simd
  for each body[x]
    calc new position of body[x]
Main() {
  for each temporal step
    Newton(step)
```

#### N-Body - Parallel version (Load balancing)

The temporal step loop remains sequential

- The N-bodies are divided among host and devices to be executed using Newton
- OpenMP offload pragmas are used to
  - Newton function offloading to devices
  - Transfer data (bodies) between host and devices

#### N-Body - Parallel version (Load balancing)

```
function Newton(step, begin body, end body, deviceId)
  #pragma omp target device (deviceId) {
    #pragma omp for
    for each body[x] from subset(begin_body, end_body) {
      #pragma omp simd
      for each body[y] from subset(begin_body, end_body)
        calc force exerted from body[y] to body[x];
      calc new velocity of body[x]
    #pragma omp simd
    for each body[x]
      calc new position of body[x]
```

#### N-Body - Parallel version (Load balancing)

```
for each temporal step
  Divide the amount of bodies among host and devices:
  #pragma omp parallel
    #pragma omp target data device (tid) to(bodies[begin_body:
end_body])
      Newton(step, begin_body, end_body, deviceId)
      #pragma omp target update device (tid) (from:bodies)
      #pragma omp barrier
      #pragma omp target data device ( tid ) to(bodies[begin_body:
end body])
```