# Lecture 13 : Introduction to OpenMP

# Part 1 : Hello World!

## As our first example, we investigate the classic problem of printing Hello World! in parallel.

## To start, consider the sequential code.

In [1]:
%%writefile hello.c
#include <stdio.h>

int main () {
    printf ("Hello World!\n");
}

Writing hello.c


In [2]:
!gcc -o hello hello.c

In [3]:
!./hello

Hello World!


## To get ready for OpenMP we create a version 1 that adds the ability to read in the number of threads using the command line.

In [4]:
%%writefile omp_hello_v1.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s num_threads\n",argv[0]);
        return 1;
    }
    int num_threads = atoi(argv[1]);

    printf ("num_threads = %d\n",num_threads);
    printf ("Hello World!\n");
}

Writing omp_hello_v1.c


In [5]:
!gcc -o omp_hello_v1 omp_hello_v1.c -fopenmp

In [6]:
!./omp_hello_v1 4

num_threads = 4
Hello World!


## Version 1 is just a sequential program with the ability to read and print a command line argument.

## For version 2 we will put the code that prints the Hello World! message into an OpenMP parallel region using the OpenMP *parallel* pragma and set the number of OpenMP threads.

## Note that we have to include *omp.h*.

## A code block following the "#pragma omp parallel" is called a parallel region.
    
## Each thread executes the same code in a parallel region concurrently (at the same time).

## The code following a parallel region is not executed until all threads finish.

In [7]:
%%writefile omp_hello_v2.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s num_threads\n",argv[0]);
        return 1;
    }
    int num_threads = atoi(argv[1]);
    omp_set_num_threads(num_threads);

    printf ("num_threads = %d\n",num_threads);

#pragma omp parallel
    {
        printf ("Hello World!\n");
    }
}

Writing omp_hello_v2.c


In [8]:
!gcc -o omp_hello_v2 omp_hello_v2.c -fopenmp

## Note that we need to compile with the -fopenmp flag to incorporate the OpenMP library.

In [9]:
!./omp_hello_v2 4

num_threads = 4
Hello World!
Hello World!
Hello World!
Hello World!


## In order to take advantage of concurrency we have to have each thread do different work inside of the parallel region (if they all did the exact same work there would be no benefit!).

## One way threads can do distinct tasks inside of a parallel region is to use their unique *thread number*.  

## In our final version 3 we have each thread print a specialized message containing its own unique thread number.

In [10]:
%%writefile omp_hello_v3.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"num_threads");
        return 1;
    }
    int num_threads = atoi(argv[1]);
    omp_set_num_threads(num_threads);

    printf ("num_threads = %d\n",num_threads);

#pragma omp parallel
    {
        int thread_num = omp_get_thread_num();
        printf ("Hello World from thread %d of %d!\n",thread_num,num_threads);
    }
}

Writing omp_hello_v3.c


In [11]:
!gcc -o omp_hello_v3 omp_hello_v3.c -fopenmp

In [64]:
!./omp_hello_v3 4

num_threads = 4
Hello World from thread 3 of 4!
Hello World from thread 2 of 4!
Hello World from thread 0 of 4!
Hello World from thread 1 of 4!


## Note that the order of the print statements can change from run to run because the threads are executing concurrently (at the same time)!

# Part 2 : Summing the integers $1, \ldots, N$.

## Gauss showed that
$$\displaystyle\sum_{i=1}^N i = \frac{N(N+1)}{2}$$
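
## One standard way to see why the formula holds (included here as a short sketch) is to write the sum $S$ forwards and backwards and add the two copies term by term:
$$
\begin{aligned}
S &= 1 + 2 + \cdots + (N-1) + N \\
S &= N + (N-1) + \cdots + 2 + 1 \\
2S &= \underbrace{(N+1) + (N+1) + \cdots + (N+1)}_{N \text{ terms}} = N(N+1)
\end{aligned}
$$

## Dividing by 2 gives $S = \frac{N(N+1)}{2}$.  For example, $N = 4$ gives $1 + 2 + 3 + 4 = 10 = \frac{4 \cdot 5}{2}$.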

## We will incrementally revise a sequential code for computing the sum $\displaystyle\sum_{i=1}^N i$ to run on multiple CPU cores using OpenMP.

## Here is our starter sequential code.  

In [13]:
%%writefile sum.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N from the command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"N");
        return 1;
    }
    long long N = atoll(argv[1]);

    // calculate the sum
    long long sum = 0;
    for (long long i = 1; i <= N;i++) {
        sum += i;
    }

    // print the results
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Writing sum.c


## Let's run the code to verify Gauss's formula for a few N!

In [14]:
!gcc -o sum sum.c

In [15]:
!./sum 1000

sum = 500500
N*(N+1)/2 = 500500


In [16]:
!./sum 1000000

sum = 500000500000
N*(N+1)/2 = 500000500000


In [17]:
!time ./sum 4000000000

sum = 8000000002000000000
N*(N+1)/2 = 8000000002000000000

real	0m12.899s
user	0m12.735s
sys	0m0.002s


## We can see that it takes a while to run when $N$ is 4 billion.

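## For our first OpenMP version let's add a command line argument for the number of threads to use and also add some code to time how long it takes the code to run.  Note that the *user* time reported by the Linux time command adds up the CPU time of all threads, so it can exceed the elapsed time of an OpenMP program!  We use *omp_get_wtime* to measure the elapsed (wall clock) time instead.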

In [18]:
%%writefile omp_sum_v1.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;
    for (long long i = 1; i <= N;i++) {
        sum += i;
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Writing omp_sum_v1.c


## Note that we need to compile with the -fopenmp flag to incorporate the OpenMP library.

In [19]:
!gcc -o omp_sum_v1 omp_sum_v1.c -fopenmp

In [20]:
!./omp_sum_v1 100000000 1

num_threads = 1, elapsed time = 0.325666
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [21]:
!./omp_sum_v1 100000000 2

num_threads = 2, elapsed time = 0.321305
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## Note that version 1 is just a timed sequential code since we do not yet use any parallel regions!

## For version 2 we will put the code that computes the sum into an OpenMP parallel region using the OpenMP *parallel* pragma.

In [22]:
%%writefile omp_sum_v2.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel
    {
	    for (long long i = 1; i <= N;i++) {
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}


Writing omp_sum_v2.c


In [23]:
!gcc -o omp_sum_v2 omp_sum_v2.c -fopenmp

In [24]:
!./omp_sum_v2 100000000 1

num_threads = 1, elapsed time = 0.339442
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [25]:
!./omp_sum_v2 100000000 2

num_threads = 2, elapsed time = 0.650309
sum = 8416081095427235
N*(N+1)/2 = 5000000050000000


## Note above that we run the code with one thread and with two threads.

## What are the two major problems with version 2?

## Answers:

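## The first problem is performance: running with two threads is slower than running with one, because each thread is currently computing the entire sum!  

## The second problem is correctness: the sum is wrong when using two threads.  We start with performance and return to correctness later.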

## In order to take advantage of multiple CPU cores we need to subdivide the work across the CPU cores.

## In this case we can subdivide the work by assigning each thread different terms to sum up.  This can be accomplished by assigning each thread different loop iterations to perform.  

## For example if we have two threads then we can have one thread add up the odd terms: 1 + 3 + 5 + ... and have the other thread add up the even terms 2 + 4 + 6 + ...

## In version 3 we will use the OpenMP loop scheduler to subdivide the iterations of the for loop across threads.

## When the **#pragma omp for** is encountered, the OpenMP loop scheduler automatically assigns each thread an approximately equal sized set of loop iterations to execute.

## If the DEBUG macro is defined at compile time (using the -DDEBUG compiler flag), our version 3 will print which iterations each thread is doing to illustrate how the OpenMP loop scheduler works.

In [27]:
%%writefile omp_sum_v3.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel
    {
	    int thread_num = omp_get_thread_num();
#pragma omp for
	    for (long long i = 1; i <= N;i++) {
#ifdef DEBUG
            printf ("thread %d is working on iteration %lld\n",thread_num,i);
#endif
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}


Overwriting omp_sum_v3.c


## Let's start by working on some small problems to illustrate the OpenMP loop scheduler.

In [28]:
!gcc -DDEBUG -o omp_sum_v3 omp_sum_v3.c -fopenmp

In [34]:
!./omp_sum_v3 10 3

thread 2 is working on iteration 8
thread 0 is working on iteration 1
thread 0 is working on iteration 2
thread 0 is working on iteration 3
thread 0 is working on iteration 4
thread 1 is working on iteration 5
thread 1 is working on iteration 6
thread 1 is working on iteration 7
thread 2 is working on iteration 9
thread 2 is working on iteration 10
num_threads = 3, elapsed time = 0.000305
sum = 55
N*(N+1)/2 = 55


## Note that the OpenMP loop scheduler balances the work: no thread executes more than one iteration more than any other thread.  

## Next let's do some more serious testing.

In [35]:
!gcc -o omp_sum_v3 omp_sum_v3.c -fopenmp

In [36]:
!./omp_sum_v3 100000000 1

num_threads = 1, elapsed time = 0.294264
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [37]:
!./omp_sum_v3 100000000 2

num_threads = 2, elapsed time = 0.309262
sum = 3539890311608890
N*(N+1)/2 = 5000000050000000


In [38]:
!./omp_sum_v2 100000000 2

num_threads = 2, elapsed time = 0.636992
sum = 7277820443378182
N*(N+1)/2 = 5000000050000000


## What do you observe about version 3 compared to version 2?

## Answer:

## Version 3 is faster than version 2 when running on two CPUs but running version 3 on two CPUs is still not faster than running version 3 on just a single CPU.

## We will take a break from performance for the moment and work instead on correctness.  Note in particular that both versions 2 and 3 compute the incorrect sum when using 2 threads!

## Inside a parallel region there are two types of variables: shared and private.

## There is a single copy of each shared variable that is used by all threads.  

## Each thread uses its own local copy of each private variable.  

## Variables declared inside the parallel region are private.

## Variables declared outside the parallel region are shared by default.  

## For each of the following variables, decide if it is shared or private:
    
* i :
* N :
* sum :
* thread_num :
* num_threads :

## Exercise : To get started on version 4, add *default(none)* to the parallel pragma in version 3 of the code.  

## What do you observe?

## Answers:

## Adding *default(none)* to the parallel pragma will tell the compiler to not assume that any variable used in the parallel region but defined outside of the parallel region is shared.  Version 3 of the code with *default(none)* added will not compile.  

## In version 4, we fix the compilation errors in our code by explicitly adding a list of shared variables to the parallel pragma.  

In [39]:
%%writefile omp_sum_v4.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
#pragma omp for
	    for (long long i = 1; i <= N;i++) {
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Writing omp_sum_v4.c


In [40]:
!gcc -o omp_sum_v4 omp_sum_v4.c -fopenmp

In [41]:
!./omp_sum_v4 100000000 1

num_threads = 1, elapsed time = 0.293806
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [42]:
!./omp_sum_v4 100000000 2

num_threads = 2, elapsed time = 0.309779
sum = 3505380482502995
N*(N+1)/2 = 5000000050000000


## Note that version 4 still has the same issues as version 3 but we now have a better understanding of shared and private variables.  

## Going a bit deeper, there are different types of shared variables.  

## We call a shared variable *read only* if the code in the parallel region only reads from the variable.

## We call a shared variable *read/write* if the code in the parallel region both reads from the variable and writes to the variable.  

## Classify each of the following shared variables.  

* N :
* sum :
* num_threads :

## The read only shared variables are fine, but one problem with the current code is that there is a read/write race condition for sum.

## To illustrate the read/write race condition let's assume for the moment **that there are only two loop iterations and only two threads.**

## Here is what we might imagine happens when two different threads access sum during the two iterations of the for loop:

* Thread 1 reads the current value of sum : 0
* Thread 1 adds i=1 to the current value of sum : 1
* Thread 1 stores the new value of sum : 1
* Thread 2 reads the current value of sum : 1
* Thread 2 adds i=2 to the current value of sum : 3
* Thread 2 stores the new value of sum : 3

## Since the threads operate concurrently on different CPUs, what is actually happening is more like:

* Thread 1 reads the current value of sum : 0
* Thread 2 reads the current value of sum : 0
* Thread 1 adds i=1 to the current value of sum : 1
* Thread 2 adds i=2 to the current value of sum : 2
* Thread 1 stores the new value of sum : 1
* Thread 2 stores the new value of sum : 2

## In this case part of the sum is lost and we compute a value of the sum less than N*(N+1)/2 (this is exactly what happens when we run the current version!).

## There are a couple of ways to fix this issue.

## One way is to execute the update to sum as a **single atomic update**.

## This means that the read/add/write steps of one thread's update to sum cannot be interleaved with another thread's update to the same shared sum variable.

## We can achieve this by putting a **#pragma omp atomic** before the update to sum.


In [43]:
%%writefile omp_sum_v5.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
#pragma omp for
	    for (long long i = 1; i <= N;i++) {
#pragma omp atomic
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Writing omp_sum_v5.c


In [44]:
!gcc -o omp_sum_v5 omp_sum_v5.c -fopenmp

In [45]:
!./omp_sum_v5 100000000 1

num_threads = 1, elapsed time = 0.873200
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [46]:
!./omp_sum_v5 100000000 2

num_threads = 2, elapsed time = 0.993849
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## Version 5 of the code now computes the correct value of sum!

## However, adding more threads still does not decrease the runtime!

## To fix the read/write race condition we needed to use an atomic update.

## The issue is that atomic updates are expensive because they require synchronization of the threads.  This synchronization also reduces parallelism because threads will have to wait on other threads to finish their atomic update before proceeding.

## To fix this remaining issue, in our final version 6 we change the code so that each thread only has to perform **a single atomic update** inside the parallel region.

## To ensure each thread only performs a single atomic update, we give each thread its own private variable thread_sum so that each thread can do as much work as possible independently.

## After each thread fully computes its partial thread_sum it adds that value to the shared sum variable using an atomic update.

In [47]:
%%writefile omp_sum_v6.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
        long long thread_sum = 0;
#pragma omp for
	    for (long long i = 1; i <= N;i++) {
            thread_sum += i;
        }
#pragma omp atomic
	    sum += thread_sum;
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    printf ("N*(N+1)/2 = %lld\n",(N % 2 == 0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Writing omp_sum_v6.c


In [48]:
!gcc -o omp_sum_v6 omp_sum_v6.c -fopenmp

In [61]:
!./omp_sum_v6 100000000 1

num_threads = 1, elapsed time = 0.361812
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [62]:
!./omp_sum_v6 100000000 2

num_threads = 2, elapsed time = 0.180636
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## With version 6 we finally see that using 2 threads running on two separate CPUs reduces the runtime by around 50% (this is what we expect since each CPU is only doing half of the work).

In [63]:
!./omp_sum_v6 100000000 4

num_threads = 4, elapsed time = 0.186425
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## When we go to 4 threads on Google Colab we do not see any additional improvement!

## On Google Colab, our OpenMP programs can only utilize 2 CPU cores.  
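
## These timings can be summarized using the usual definition of speedup, the one-thread time $T(1)$ divided by the $p$-thread time $T(p)$.  Using the version 6 measurements above:
$$
S(p) = \frac{T(1)}{T(p)}, \qquad S(2) = \frac{0.361812}{0.180636} \approx 2.00, \qquad S(4) = \frac{0.361812}{0.186425} \approx 1.94
$$

## The speedup stays near 2 when going from two to four threads, which is consistent with only two CPU cores being available.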

## Run version 6 on the *matrix* server to see better results when using higher thread counts!