# Creating SPMD parallelism using the OpenMP **teams** directive

This section introduces how to use the OpenMP **teams** directive to create SPMD parallelism.

### **teams** Directive
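The **teams** directive creates a league of thread teams. The initial thread of each team executes the structured block that follows, so each team can compute one part of the task. On its own, **teams** does not divide the iterations of a loop among the teams; that is done by a nested **distribute** construct or by branching on the team number, as the examples in this section do. Developers can use the **teams** directive to create a large number of thread teams.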

The following figure shows the execution model of the **teams** directive:
![teams_directive](teams.jpeg "topic1")

A league of teams is created when a thread encounters a **teams** construct. Each team is an initial team, and the initial thread in each team executes the **teams** region.
After the league is created, the number of initial teams remains the same for the duration of the **teams** region.
Within a **teams** region, the initial team number uniquely identifies each initial team. A thread can obtain its own initial team number by calling the *omp_get_team_num* library routine.
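The **teams** directive has the following characteristics:
- the **teams** directive can spawn one or more thread teams, each subject to the same thread limit (set with the *thread_limit* clause)
- code is portable across one thread team or multiple thread teams
- only the initial thread of each team executes the **teams** region; the other threads of a team become active only inside a nested **parallel** region
- there is no synchronization between thread teams
- when **teams** is combined with the **distribute** construct, loop iterations are divided among the teams automatically, so programmers do not need to decompose loops by hand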
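
To make the team numbering concrete, here is a minimal sketch of a host **teams** region in which every team records its own team number. The helper name `mark_teams` and the `seen` array are illustrative assumptions, not part of the original material, and the sketch assumes a compiler that supports host **teams** regions (OpenMP 5.0 or later). Without OpenMP it falls back to a single serial "team".

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: sets seen[t] = 1 for every team number t that
 * executed the teams region and returns the size of the league.
 * `requested` must be at least 1 and seen[] must hold `requested` ints. */
int mark_teams(int seen[], int requested)
{
    int league_size = 1;

    for (int t = 0; t < requested; t++)
        seen[t] = 0;

#ifdef _OPENMP
    /* Host teams region (assumes OpenMP 5.0 or later). The runtime may
     * create fewer teams than requested, but never more. */
    #pragma omp teams num_teams(requested)
    {
        int team = omp_get_team_num();   /* unique within the league */
        seen[team] = 1;                  /* each team writes only its own slot */
        if (team == 0)
            league_size = omp_get_num_teams();
    }
#else
    seen[0] = 1;                         /* serial build: one "team" numbered 0 */
#endif
    return league_size;
}
```

Because each team writes a different array element and only team 0 writes `league_size`, no synchronization between teams is needed, which matches the characteristics listed above.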

OpenMP was originally designed for multithreading on shared-memory parallel computers, so the **parallel** directive only creates a single layer of parallelism.
The **teams** directive is used to express a second level of scalable parallelization. Before OpenMP 5.0, it could only be used inside a **target** construct, typically to offload work to a GPU. In OpenMP 5.0 the **teams** construct was extended so that a **teams** region can also execute on the host.
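
As a rough sketch of the two levels described above, the combined construct below distributes loop iterations across the teams of a league (first level) and then across the threads of each team (second level). The function name `dotprod_two_level` is a hypothetical choice for illustration. If no device is available, OpenMP runtimes normally fall back to executing the region on the host, and without OpenMP the loop simply runs serially.

```c
/* Hypothetical helper: dot product using two levels of parallelism.
 * teams + distribute : splits the iteration space among the teams
 * parallel for       : splits each team's share among its threads */
float dotprod_two_level(const float *b, const float *c, int n)
{
    float sum = 0.0f;

#ifdef _OPENMP
    #pragma omp target teams distribute parallel for \
            map(to: b[0:n], c[0:n]) map(tofrom: sum) reduction(+:sum)
#endif
    for (int i = 0; i < n; i++)
        sum += b[i] * c[i];

    return sum;
}
```

With an OpenMP 5.0 compiler, dropping `target` and the `map` clauses should give the host form of the same loop.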


The following example offloads a dot product to a device with the **target** construct and uses two teams, each of which computes one half of the sum:

```c
//%compiler: clang
//%cflags: -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -std=c++11 -Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -fPIC
#include <stdlib.h>
#include <omp.h>
float dotprod(float B[], float C[], int N) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    #pragma omp target map(to: B[:N], C[:N]) map(tofrom: sum0, sum1)
    #pragma omp teams num_teams(2)
    {
        int i;
        // The decomposition below is only valid with exactly two teams
        if (omp_get_num_teams() != 2)
            abort();
        if (omp_get_team_num() == 0) {
            // Team 0: first half of the vectors
            #pragma omp parallel for reduction(+:sum0)
            for (i=0; i<N/2; i++)
                sum0 += B[i] * C[i];
        } else if (omp_get_team_num() == 1) {
            // Team 1: second half of the vectors
            #pragma omp parallel for reduction(+:sum1)
            for (i=N/2; i<N; i++)
                sum1 += B[i] * C[i];
        }
    }
    return sum0 + sum1;
}
/* Note: The variables sum0,sum1 are now mapped with tofrom, for correct
 execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
 */
```


On the host (OpenMP 5.0 and later), the **teams** construct does not need an enclosing **target** construct. The following example creates two teams on the host, and each team performs an AXPY computation in a different precision:

```c
//%compiler: clang
//%cflags: -fopenmp

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 1000

int main(){
    int nteams_required=2, max_thrds, tm_id;
    float sp_x[N], sp_y[N], sp_a=0.0001e0;
    double dp_x[N], dp_y[N], dp_a=0.0001e0;

    // Split the available processors between the teams (at least 1 thread each)
    max_thrds = omp_get_num_procs()/nteams_required;
    if( max_thrds < 1 ) max_thrds = 1;

    // Create 2 teams, each team works in a different precision
    #pragma omp teams num_teams(nteams_required) thread_limit(max_thrds) private(tm_id)
    {
        tm_id = omp_get_team_num();
        if( omp_get_num_teams() != 2 ) //if only getting 1, quit
        {
            printf("error: Insufficient teams on host, 2 required\n");
            exit(1);
        }
        if(tm_id == 0) // Do Single Precision Work (SAXPY) with this team
        {
            #pragma omp parallel
            {
                #pragma omp for //init
                for(int i=0; i<N; i++){sp_x[i] = i*0.0001; sp_y[i]=i; }
                #pragma omp for simd simdlen(8)
                for(int i=0; i<N; i++){sp_x[i] = sp_a*sp_x[i] + sp_y[i];}
            }
        }
        if(tm_id == 1) // Do Double Precision Work (DAXPY) with this team
        {
            #pragma omp parallel
            {
                #pragma omp for //init
                for(int i=0; i<N; i++){dp_x[i] = i*0.0001; dp_y[i]=i; }
                #pragma omp for simd simdlen(4)
                for(int i=0; i<N; i++){dp_x[i] = dp_a*dp_x[i] + dp_y[i];}
            }
        }
    }
    printf("i=%d sp|dp %f %f \n",N-1, sp_x[N-1], dp_x[N-1]);
    printf("i=%d sp|dp %f %f \n",N/2, sp_x[N/2], dp_x[N/2]);
    //OUTPUT1:i=999 sp|dp 999.000000 999.000010
    //OUTPUT2:i=500 sp|dp 500.000000 500.000005
    return 0;
}
```

Output:
```
i=999 sp|dp 999.000000 999.000010 
i=500 sp|dp 500.000000 500.000005 
```


The syntax of the **teams** directive in C/C++ is:
```
#pragma omp teams [clause[ [,] clause] ... ] new-line
    structured-block
```
The syntax in Fortran is:
```
!$omp teams [clause[ [,] clause] ... ]
    loosely-structured-block
!$omp end teams
```

## Clauses

## Examples