# Creating SPMD parallelism using OpenMP **parallel** and **teams** directives

This part begins our introduction to writing programs with OpenMP directives. We start with the two most basic and most commonly used directives: **parallel** and **teams**.

## Semantics and Syntax

### **parallel** Directive
The **parallel** directive is used to mark a parallel region. When a thread encounters a parallel region, a team of threads is created to execute the parallel region.
The original thread that executed the serial part becomes the primary thread of the new team. All threads in the team execute the parallel region together. After a team is created, the number of threads in the team remains constant for the duration of that parallel region.

> The primary thread was called the master thread in OpenMP versions before 5.1.

When a thread team is created, the encountering thread implicitly generates as many tasks as there are threads in the team. Each of these implicit tasks is assigned to, and bound to, a different thread.
As a result, every thread in the team executes exactly one implicit task for the parallel region.
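
The team behaviour described above can be checked with a small sketch. The function name `team_executes_region_once` is hypothetical, and the serial stand-ins are an assumption added so that the code also builds without `-fopenmp`. The primary thread (thread number 0) records the team size, and every thread counts itself once.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: returns 1 when every thread in the team
   executed the parallel region exactly once. */
int team_executes_region_once(void) {
    int team_size = 0; /* recorded by the primary thread (thread number 0) */
    int arrivals = 0;  /* incremented once by each thread in the team */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            team_size = omp_get_num_threads();

        /* Shared counter, so the update must be atomic. */
        #pragma omp atomic
        arrivals++;
    }
    /* The implicit barrier at the end of the region makes both values final. */
    return team_size >= 1 && arrivals == team_size;
}
```

Calling `team_executes_region_once()` from `main` should return 1 whatever the default number of threads is, because the team size does not change while the region runs.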

The following example from Chapter 1 shows how to use the parallel directive in C.

In [None]:
//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
    #pragma omp parallel
    printf("%s\n", "Hello World");
    
    return 0;
}

This example prints *Hello World* 8 times, which means 8 threads are created by default. The default number of threads is implementation defined and usually matches the hardware; on the author's computer it is 8.
The following example shows how to use the num_threads clause in the parallel directive to specify the number of threads to create.

In [2]:
//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
    #pragma omp parallel num_threads(4)
    printf("%s\n", "Hello World");
    
    return 0;
}

Hello World
Hello World
Hello World
Hello World


In this example, we use the **num_threads** clause to request 4 threads for the parallel region. When the primary thread encounters the **parallel** construct, three additional threads are created; together with the primary thread, they form a team of 4.
*Hello World* is printed four times, once per thread.

From the examples above, the syntax of the **parallel** directive in C is:

```
#pragma omp parallel [clause[ [,] clause] ... ] new-line
    structured-block
```

The syntax of the **num_threads** clause is:

```
num_threads(integer-expression)
```
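
Because the argument of **num_threads** is an integer expression, it can be a runtime value rather than a literal. The following is a minimal sketch, assuming a hypothetical helper `observed_team_size` and serial stand-ins for builds without `-fopenmp`. It returns the team size that was actually obtained, which may be smaller than the requested value but never larger.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: requests `requested` threads (must be positive)
   and returns the size of the team that was actually created. */
int observed_team_size(int requested) {
    int team_size = 0;

    #pragma omp parallel num_threads(requested)
    {
        /* Only the primary thread writes the shared variable. */
        if (omp_get_thread_num() == 0)
            team_size = omp_get_num_threads();
    }
    return team_size;
}
```

For instance, `observed_team_size(4)` returns a value between 1 and 4, depending on how many threads the runtime is able to provide.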

The next two examples show how to use the **parallel** directive in Fortran; they have exactly the same meaning as the two Hello World examples in C above.

In [16]:
!!%compiler: gfortran
!!%cflags: -fopenmp

PROGRAM Parallel_Hello_World
USE OMP_LIB

!$OMP PARALLEL

  PRINT *, "Hello World"

!$OMP END PARALLEL

END

 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World
 Hello World


In [13]:
!!%compiler: gfortran
!!%cflags: -fopenmp

PROGRAM Parallel_Hello_World
USE OMP_LIB

!$OMP PARALLEL num_threads(4)

  PRINT *, "Hello World"

!$OMP END PARALLEL

END

 Hello World
 Hello World
 Hello World
 Hello World


The syntax of the **parallel** directive in Fortran is:
```
!$omp parallel [clause[ [,] clause] ... ]
    loosely-structured-block
!$omp end parallel
```

Within a parallel region, the thread number uniquely identifies each thread. A thread can obtain its own thread number by calling the **omp_get_thread_num** library routine.

The following example is a little more complicated. It shows how to use the **omp_get_thread_num** library routine together with two other clauses, the **default** clause and the **private** clause. It explicitly assigns a part of the work to each thread.

In [14]:
//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void subdomain(float *x, int istart, int ipoints) {
    int i;
    for (i = 0; i < ipoints; i++)       
         x[istart+i] = 123.456;
}

void sub(float *x, int npoints) {
    int iam, nt, ipoints, istart;
    #pragma omp parallel default(shared) private(iam,nt,ipoints,istart)
    {
        iam = omp_get_thread_num();
        nt = omp_get_num_threads();
        ipoints = npoints / nt; /* size of partition */
        istart = iam * ipoints; /* starting array index */
        if (iam == nt-1) /* last thread may do more */
            ipoints = npoints - istart;
        subdomain(x, istart, ipoints);
    }
}

void print(float *x, int npoints) {
    for (int i = 0; i < npoints; i++) {
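        if (i % 10 == 0)    /* start a new row every 10 values */
            printf("\n");
        printf("%f ", x[i]);
    }
}

int main() {
    float array[100];
    sub(array, 100);
    print(array, 100);
    return 0;
}


123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 
123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 123.456001 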

In the above example, we use the default number of threads to assign values to the 100 elements of the array. The work is divided evenly among the threads; when the number of elements is not divisible by the number of threads, the remaining elements are handled by the last thread.

When programming in parallel, the most important and hardest part is how to assign tasks and manage threads. We already introduced that a thread can get its own id through the **omp_get_thread_num** routine. Another important routine is **omp_get_num_threads**, which returns the number of threads in the current team.

In the above example, the variable *npoints* represents the total number of elements in the array, which is divided into *nt* parts, each of size *ipoints* (the last part may be larger). The starting index of each part is *istart*. Each part is handled by one thread; with the default of 8 threads on the author's computer, 8 threads work in parallel.
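
The partition arithmetic can be isolated in a small function and checked by hand. The helper `partition` below is hypothetical; it only mirrors the calculation in `sub` and takes the thread number and team size as ordinary arguments, so it runs without OpenMP.

```c
#include <assert.h>

/* Hypothetical helper mirroring the arithmetic inside sub():
   computes the start index and length of the part owned by
   thread `iam` in a team of `nt` threads. */
void partition(int npoints, int nt, int iam, int *istart, int *ipoints) {
    assert(nt > 0 && iam >= 0 && iam < nt); /* precondition check */

    *ipoints = npoints / nt;       /* base size of each part */
    *istart = iam * (*ipoints);    /* starting array index */
    if (iam == nt - 1)             /* last thread takes the remainder */
        *ipoints = npoints - *istart;
}
```

With 100 elements and 8 threads, threads 0 to 6 each receive 12 elements, and thread 7 starts at index 84 and receives the remaining 16.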

The **default** clause is used to define the default data-sharing attributes of variables that are referenced in a parallel, teams, or task generating construct. In the above example, *default(shared)* indicates that by default, the variables in the parallel region are shared variables.
The **private** clause is used to explicitly specify variables that are private to each task or SIMD lane (SIMD will be introduced in the next chapter). In the above example, the variables *iam, nt, ipoints* and *istart* are private to each thread: every thread has its own copy and cannot access the copies of another thread.

Both clauses are data-sharing attribute clauses, which we will introduce in detail later in the section on clauses.
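
As a variant of the example above, the sketch below uses *default(none)*, which makes the compiler reject any variable whose data-sharing attribute is not listed explicitly. The function name `each_element_written_once` and the serial stand-ins are assumptions made for illustration. The shared array records how often each element is written, and the private variables hold the per-thread partition.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch also builds without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: returns 1 if each of the npoints elements
   was written by exactly one thread, 0 otherwise. */
int each_element_written_once(int npoints) {
    int iam, nt, ipoints, istart, i;
    int ok = 1;
    int *hits = calloc((size_t)npoints, sizeof *hits);
    if (hits == NULL)
        return 0;

    /* default(none): every variable used in the region is listed. */
    #pragma omp parallel default(none) shared(hits, npoints) \
            private(iam, nt, ipoints, istart, i)
    {
        iam = omp_get_thread_num();
        nt = omp_get_num_threads();
        ipoints = npoints / nt;
        istart = iam * ipoints;
        if (iam == nt - 1)          /* last thread takes the remainder */
            ipoints = npoints - istart;
        for (i = 0; i < ipoints; i++)
            hits[istart + i] += 1;  /* parts do not overlap, so no race */
    }

    for (i = 0; i < npoints; i++)
        if (hits[i] != 1)
            ok = 0;
    free(hits);
    return ok;
}
```

Because the private variables are separate copies in each thread, no thread can disturb the partition computed by another, and `each_element_written_once(100)` is expected to return 1 for any team size.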

### **teams** Directive
The **teams** directive creates a league of thread teams. The region that follows is executed by the initial thread of each team, so each team can compute one part of the work. Developers can use the **teams** directive to employ a large number of thread teams.

The following figure shows the execution model of the **teams** directive:
![teams_directive](teams.jpeg "topic1")

A league of teams is created when a thread encounters a **teams** construct. Each team is an initial team, and the initial thread in each team executes the **teams** region.
After the league is created, the number of initial teams remains the same for the duration of the **teams** region.
Within a **teams** region, the initial team number uniquely identifies each initial team. A thread can obtain its own initial team number by calling the *omp_get_team_num* library routine.
The **teams** directive has the following characteristics:
- the **teams** directive can spawn one or more thread teams with the same number of threads
- the same code works with one thread team or with multiple thread teams
- only the initial thread of each team continues to execute
- there is no synchronization between thread teams
- programmers don't need to think about how to decompose loops
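
The league and team numbering can be illustrated with a minimal sketch. It assumes a hypothetical helper `count_unique_team_numbers`, an illustrative upper bound `MAX_TEAMS`, and serial stand-ins for builds without `-fopenmp`. The **teams** region is placed inside a **target** region so that it is also accepted by compilers without host teams support; when no device is available, the region is expected to fall back to the host.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption) so the sketch also builds without -fopenmp. */
static int omp_get_team_num(void) { return 0; }
static int omp_get_num_teams(void) { return 1; }
#endif

enum { MAX_TEAMS = 4 }; /* illustrative upper bound on the league size */

/* Hypothetical helper: returns the number of teams created, or -1 if a
   team number in 0..nteams-1 was missing or used more than once. */
int count_unique_team_numbers(void) {
    int seen[MAX_TEAMS] = {0};
    int nteams = 0;

    #pragma omp target teams num_teams(MAX_TEAMS) map(tofrom: seen, nteams)
    {
        /* Executed by the initial thread of each team. */
        int t = omp_get_team_num();
        seen[t] += 1;               /* each team writes only its own slot */
        if (t == 0)
            nteams = omp_get_num_teams();
    }

    if (nteams < 1 || nteams > MAX_TEAMS)
        return -1;
    for (int t = 0; t < MAX_TEAMS; t++)
        if (seen[t] != (t < nteams ? 1 : 0))
            return -1;
    return nteams;
}
```

The number of teams actually created is chosen by the implementation and may be smaller than `MAX_TEAMS`, so the function reports the observed league size instead of assuming 4.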

OpenMP was originally designed for multithreading on shared-memory parallel computers, so the **parallel** directive only creates a single layer of parallelism.
The **teams** directive is used to express a second level of scalable parallelization. Before OpenMP 5.0, it could only be used inside a **target** construct, typically to offload work to a device such as a GPU. In OpenMP 5.0 the **teams** construct was extended to enable the host to execute a teams region.
The two examples below use **teams** first inside a **target** region and then directly on the host.


In [10]:
//%compiler: clang
//%cflags: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda  --cuda-path=/usr/local/cuda
#include <stdlib.h>
#include <omp.h>
float dotprod(float B[], float C[], int N) {
    float sum0 = 0.0;
    float sum1 = 0.0;
    #pragma omp target map(to: B[:N], C[:N]) map(tofrom: sum0, sum1)
    #pragma omp teams num_teams(2) 
    {
        int i;
        if (omp_get_num_teams() != 2)
            abort();
        if (omp_get_team_num() == 0) {
            #pragma omp parallel for reduction(+:sum0)
            for (i=0; i<N/2; i++)
                sum0 += B[i] * C[i];
        } else if (omp_get_team_num() == 1) {
            #pragma omp parallel for reduction(+:sum1)
            for (i=N/2; i<N; i++)
                sum1 += B[i] * C[i];
        }
    }
    return sum0 + sum1;
}
/* Note: The variables sum0,sum1 are now mapped with tofrom, for correct
 execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
 */


In [19]:
//%compiler: clang
//%cflags: -fopenmp

// Note: a teams region on the host (outside a target region) requires a compiler with OpenMP 5.0 support


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define N 1000
 
int main(){
    int nteams_required=2, max_thrds, tm_id;
    float sp_x[N], sp_y[N], sp_a=0.0001e0;
    double dp_x[N], dp_y[N], dp_a=0.0001e0;

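    // Split the available processors between the teams (at least 1 thread each);
    // max_thrds must be set before it is used in thread_limit
    max_thrds = omp_get_num_procs() / nteams_required;
    if (max_thrds < 1) max_thrds = 1;

    // Create 2 teams, each team works in a different precision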
    #pragma omp teams num_teams(nteams_required) thread_limit(max_thrds) private(tm_id)
    {
        tm_id = omp_get_team_num();
        if( omp_get_num_teams() != 2 ) //if only getting 1, quit 
        { 
            printf("error: Insufficient teams on host, 2 required\n");
            exit(0);
        }
        if(tm_id == 0) // Do Single Precision Work (SAXPY) with this team
        {
            #pragma omp parallel
            {
                #pragma omp for //init
                for(int i=0; i<N; i++){sp_x[i] = i*0.0001; sp_y[i]=i; }
                #pragma omp for simd simdlen(8)
                for(int i=0; i<N; i++){sp_x[i] = sp_a*sp_x[i] + sp_y[i];}
            }
        }
        if(tm_id == 1) // Do Double Precision Work (DAXPY) with this team
        {
            #pragma omp parallel
            {
                #pragma omp for //init
                for(int i=0; i<N; i++){dp_x[i] = i*0.0001; dp_y[i]=i; }
                #pragma omp for simd simdlen(4)
                for(int i=0; i<N; i++){dp_x[i] = dp_a*dp_x[i] + dp_y[i];}
            }
        }
    }
    printf("i=%d sp|dp %f %f \n",N-1, sp_x[N-1], dp_x[N-1]);
    printf("i=%d sp|dp %f %f \n",N/2, sp_x[N/2], dp_x[N/2]);
    //OUTPUT1:i=999 sp|dp 999.000000 999.000010
    //OUTPUT2:i=500 sp|dp 500.000000 500.000005
    return 0;
} 

/tmp/tmpwit3re_h.c:16:5: error: orphaned 'omp teams' directives are prohibited; perhaps you forget to enclose the directive into a target region?
    #pragma omp teams num_teams(nteams_required) thread_limit(max_thrds) private(tm_id)
    ^
1 error generated.
[Native kernel] clang exited with code 1, the executable will not be executed

The syntax of the **teams** directive in C is:
```
#pragma omp teams [clause[ [,] clause] ... ] new-line
    structured-block
```
The syntax in Fortran is:
```
!$omp teams [clause[ [,] clause] ... ]
    loosely-structured-block
!$omp end teams
```

## Clauses

## Examples