# Lecture 13 : Introduction to OpenMP

# Part 1 : Hello World!

## As our first example, we investigate the classic problem of printing Hello World! in parallel.

## To start, consider the sequential code.

In [1]:
%%writefile hello.c
#include <stdio.h>

int main () {
    printf ("Hello World!\n");
}

Overwriting hello.c


In [2]:
!gcc -o hello hello.c

In [3]:
!./hello

Hello World!


## To get ready for OpenMP we create a version 0 that adds the ability to read in the number of threads using the command line.

In [4]:
%%writefile omp_hello_v0.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"num_threads");
        return 1;
    }
    int num_threads = atoi(argv[1]);

    printf ("num_threads = %d\n",num_threads);
    printf ("Hello World!\n");
}

Overwriting omp_hello_v0.c


In [5]:
!gcc -o omp_hello_v0 omp_hello_v0.c -fopenmp

In [6]:
!./omp_hello_v0 4

num_threads = 4
Hello World!


## Version 0 is just a sequential program with the ability to read and print a command line argument.

## For version 1 we will put the code that prints the Hello World! message into an OpenMP parallel region using the OpenMP *parallel* pragma and set the number of OpenMP threads.

## Note that we have to include *omp.h*.

## A code block following the "#pragma omp parallel" is called a parallel region.
    
## Each thread executes the same code in a parallel region concurrently (at the same time).

## The code following a parallel region is not executed until all threads finish.

In [7]:
%%writefile omp_hello_v1.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"num_threads");
        return 1;
    }
    int num_threads = atoi(argv[1]);
    omp_set_num_threads(num_threads);

    printf ("num_threads = %d\n",num_threads);

#pragma omp parallel
    {
        printf ("Hello World!\n");
    }
}

Overwriting omp_hello_v1.c


In [8]:
!gcc -o omp_hello_v1 omp_hello_v1.c -fopenmp

## Note that we need to compile with the -fopenmp flag to incorporate the OpenMP library.

In [9]:
!./omp_hello_v1 4

num_threads = 4
Hello World!
Hello World!
Hello World!
Hello World!


## In order to take advantage of concurrency we have to have each thread do different work inside of the parallel region (if they all did the exact same work there would be no benefit!).

## The way threads can do distinct tasks inside of a parallel region is to use their unique *thread number*, which is returned by omp_get_thread_num.  

## In our final version 2 we have each thread print a specialized message containing its own unique thread number.

In [10]:
%%writefile omp_hello_v2.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char* argv[]) {

    // get num_threads from command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"num_threads");
        return 1;
    }
    int num_threads = atoi(argv[1]);
    omp_set_num_threads(num_threads);

    printf ("num_threads = %d\n",num_threads);

#pragma omp parallel
    {
        int thread_num = omp_get_thread_num();
        printf ("Hello World from thread %d of %d!\n",thread_num,num_threads);
    }
}

Overwriting omp_hello_v2.c


In [11]:
!gcc -o omp_hello_v2 omp_hello_v2.c -fopenmp

In [12]:
!./omp_hello_v2 4

num_threads = 4
Hello World from thread 1 of 4!
Hello World from thread 3 of 4!
Hello World from thread 2 of 4!
Hello World from thread 0 of 4!


## Note that the print statements do not appear in thread number order, and the order can change from run to run, because the threads are executing concurrently (at the same time)!

# Part 2 : Summing the integers $1, \ldots, N$.

## Gauss showed that
$$\displaystyle\sum_{i=1}^N i = \frac{N(N+1)}{2}$$
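
## One way to see why the formula holds is the pairing argument usually attributed to Gauss: write the sum $S$ forwards and backwards, then add the two rows term by term.
$$\begin{aligned}
S &= 1 + 2 + \cdots + N \\
S &= N + (N-1) + \cdots + 1 \\
2S &= \underbrace{(N+1) + (N+1) + \cdots + (N+1)}_{N \text{ terms}} = N(N+1)
\end{aligned}$$

## Dividing by 2 gives $S = \frac{N(N+1)}{2}$.  For example, $N=4$ gives $1+2+3+4 = 10 = \frac{4 \cdot 5}{2}$.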

## We will incrementally revise a sequential code for computing the sum $\displaystyle\sum_{i=1}^N i$ to run on multiple CPU cores using OpenMP.

## Here is our starter sequential code.  

In [13]:
%%writefile sum.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N from the command line
    if (argc < 2) {
        printf ("Command usage : %s %s\n",argv[0],"N");
        return 1;
    }
    long long N = atoll(argv[1]);

    // calculate the sum
    long long sum = 0;
    for (long long i = 1; i <= N;i++) {
        sum += i;
    }

    // print the results
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Overwriting sum.c


## Let's run the code to verify Gauss's formula for a few N!

In [14]:
!gcc -o sum sum.c

In [15]:
!./sum 1000

sum = 500500
N*(N+1)/2 = 500500


In [16]:
!./sum 1000000

sum = 500000500000
N*(N+1)/2 = 500000500000


In [17]:
!time ./sum 4000000000

sum = 8000000002000000000
N*(N+1)/2 = 8000000002000000000

real	0m13.587s
user	0m12.659s
sys	0m0.021s


## We can see that it takes a while to run when $N$ is 4 billion.
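
## When Gauss's formula is evaluated in 64-bit integer arithmetic, the order of operations matters: N*(N+1) can overflow when N is in the billions, while N/2 rounds down when N is odd.  Here is a minimal sketch of a helper that halves whichever factor is even before multiplying.  The name gauss_sum is hypothetical and is not part of the lecture code.

```c
#include <assert.h>

// Sketch: closed-form value of 1 + 2 + ... + N (gauss_sum is a hypothetical helper).
// Exactly one of N and N+1 is even, so halve that one first; this keeps the
// result exact for odd N and avoids forming the larger product N*(N+1).
long long gauss_sum(long long N) {
    assert(N >= 0);
    if (N % 2 == 0) {
        return (N / 2) * (N + 1);
    }
    return N * ((N + 1) / 2);
}
```

## For example, gauss_sum(5) returns 15, whereas (5/2)*(5+1) evaluates to 12 in integer arithmetic.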

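## For our first OpenMP version let's add a command line argument for the number of threads to use and also add some code to time how long it takes the code to run.  Note that the Linux time command is not well suited to OpenMP programs: its user time is added up over all threads, and it times the whole program rather than just the computation we care about.  Instead we use omp_get_wtime, which returns the elapsed wall clock time in seconds.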

In [18]:
%%writefile omp_sum_v0.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;
    for (long long i = 1; i <= N;i++) {
        sum += i;
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Overwriting omp_sum_v0.c


## Note that we need to compile with the -fopenmp flag to incorporate the OpenMP library.

In [19]:
!gcc -o omp_sum_v0 omp_sum_v0.c -fopenmp

In [20]:
!./omp_sum_v0 100000000 1

num_threads = 1, elapsed time = 0.323607
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [21]:
!./omp_sum_v0 100000000 2

num_threads = 2, elapsed time = 0.328249
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## Note that version 0 is just a timed sequential code since we do not yet use any parallel regions!

## For version 1 we will put the code that computes the sum into an OpenMP parallel region using the OpenMP *parallel* pragma.

In [22]:
%%writefile omp_sum_v1.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel
    {
	    for (long long i = 1; i <= N;i++) {
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}


Overwriting omp_sum_v1.c


In [23]:
!gcc -o omp_sum_v1 omp_sum_v1.c -fopenmp

In [24]:
!./omp_sum_v1 100000000 1

num_threads = 1, elapsed time = 0.338632
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [25]:
!./omp_sum_v1 100000000 2

num_threads = 2, elapsed time = 0.655441
sum = 7219913859709671
N*(N+1)/2 = 5000000050000000


## A code block following the "#pragma omp parallel" is called a parallel region.
    
## Each thread executes the same code in a parallel region concurrently (at the same time).

## The code following a parallel region is not executed until all threads finish.

## Note above that we run the code with one thread and with two threads.

## What are the two major problems with version 1?

## Answers:

## One reason that the performance is not better when using two CPU cores is that each thread is currently computing the entire sum!  

## In order to take advantage of multiple CPU cores we need to subdivide the work across the CPU cores.

## In this case we can subdivide the work by assigning each thread different terms to sum up.  This can be accomplished by assigning each thread different loop iterations to perform.  

## For example if we have two threads then we can have one thread add up the odd terms: 1 + 3 + 5 + ... and have the other thread add up the even terms 2 + 4 + 6 + ...

## In version 2 we use the thread number to split up the work in a general way that works for any number of threads.  Draw a picture to illustrate how the work would be split up if there are three threads.  
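
## As a starting point for the picture, here is a small sequential sketch that prints which terms each thread would add when thread thread_num starts at 1+thread_num and steps by num_threads.  It does not use OpenMP, and the function name print_terms_for_thread is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

// Print the terms that one thread would add under the cyclic split
// i = 1+thread_num, 1+thread_num+num_threads, ... (up to N),
// and return how many terms that thread is responsible for.
long long print_terms_for_thread(int thread_num, int num_threads, long long N) {
    assert(num_threads > 0 && thread_num >= 0 && thread_num < num_threads);
    long long count = 0;
    printf("thread %d:", thread_num);
    for (long long i = 1 + thread_num; i <= N; i += num_threads) {
        printf(" %lld", i);
        count++;
    }
    printf("\n");
    return count;
}
```

## Calling it with num_threads = 3 and N = 10 for thread_num = 0, 1, 2 lists the terms 1 4 7 10, then 2 5 8, then 3 6 9, so every term is assigned to exactly one thread.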

In [26]:
%%writefile omp_sum_v2.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel
    {
	    int thread_num = omp_get_thread_num();
	    for (long long i = 1+thread_num; i <= N;i+=num_threads) {
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}


Overwriting omp_sum_v2.c


In [27]:
!gcc -o omp_sum_v2 omp_sum_v2.c -fopenmp

In [28]:
!./omp_sum_v2 100000000 1

num_threads = 1, elapsed time = 0.241548
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [29]:
!./omp_sum_v2 100000000 2

num_threads = 2, elapsed time = 0.318162
sum = 3571442418462272
N*(N+1)/2 = 5000000050000000


In [30]:
!./omp_sum_v1 100000000 2

num_threads = 2, elapsed time = 0.648259
sum = 7282740619959276
N*(N+1)/2 = 5000000050000000


## What do you observe about version 2 compared to version 1?

## Answer:

## Version 2 is faster than version 1, but running on two CPU cores is still not faster than running on just a single CPU core.

## We will take a break from performance for the moment and work instead on correctness.  Note in particular that both versions 1 and 2 compute the incorrect sum!

## Inside a parallel region there are two types of variables: shared and private.

## There is a single copy of each shared variable that is used by all threads.  

## Each thread uses its own local copy of each private variable.  

## Variables declared inside the parallel region are private.

## Variables declared outside the parallel region are shared by default.  
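
## The following sketch illustrates the two kinds of variables using names that do not appear in the lecture code (record_thread_ids, ids, capacity, team_size, and my_id are all hypothetical).  It assumes compilation with -fopenmp; the #else branch only provides single-thread stand-ins so that the sketch also builds without it.  The call omp_get_num_threads() returns the number of threads executing the current parallel region.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
// Single-thread stand-ins (an assumption for building without -fopenmp).
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

// Each thread stores its thread number in its own element of ids.
// Returns the number of threads that executed the parallel region.
int record_thread_ids(int *ids, int capacity) {
    assert(ids != NULL && capacity > 0);
    int team_size = 0;                          // declared outside the region: shared
#pragma omp parallel
    {
        int my_id = omp_get_thread_num();       // declared inside the region: private
        if (my_id == 0) {
            team_size = omp_get_num_threads();  // only thread 0 writes team_size
        }
        if (my_id < capacity) {
            ids[my_id] = my_id;                 // each thread writes a different element
        }
    }
    return team_size;
}
```

## Every thread has its own my_id, so the assignments to it never interfere.  The array ids and the variable team_size exist only once, which is safe here only because no two threads write to the same location.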
## For each of the following variables, decide if it is shared or private:
    
* i :
* N :
* sum :
* thread_num :
* num_threads :

## Exercise : To get started on version 3, add *default(none)* to the parallel pragma in version 2 of the code.  

## What do you observe?

## Answers:

## Adding *default(none)* to the parallel pragma will tell the compiler to not assume that any variable used in the parallel region but defined outside of the parallel region is shared.  Version 2 of the code with *default(none)* added will not compile.  

## In version 3, we fix the compilation errors in our code by explicitly adding a list of shared variables to the pragma.  

In [31]:
%%writefile omp_sum_v3.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
	    for (long long i = 1+thread_num; i <= N;i+=num_threads) {
	        sum += i;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Overwriting omp_sum_v3.c


In [32]:
!gcc -o omp_sum_v3 omp_sum_v3.c -fopenmp

In [33]:
!./omp_sum_v3 100000000 1

num_threads = 1, elapsed time = 0.251250
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [34]:
!./omp_sum_v3 100000000 2

num_threads = 2, elapsed time = 0.345451
sum = 3765535294206015
N*(N+1)/2 = 5000000050000000


## Note that version 3 still has the same issues as version 2 but we now have a better understanding of shared and private variables.  

## Going a bit deeper, there are different types of shared variables.  

## We call a shared variable *read only* if the code in the parallel region only reads from the variable.

## We call a shared variable *read/write* if the code in the parallel region both reads from the variable and writes to the variable.  

## Classify each of the following shared variables.  

* N :
* sum :
* num_threads :

## The read only shared variables are fine, but one problem with the current code is that there is a read/write race condition for sum.

## Here is what we might imagine happens when two different threads access sum during their first iterations of the for loop:

* Thread 1 reads the current value of sum : 0
* Thread 1 adds i=1 to the current value of sum : 1
* Thread 1 stores the new value of sum : 1
* Thread 2 reads the current value of sum : 1
* Thread 2 adds i=2 to the current value of sum : 3
* Thread 2 stores the new value of sum : 3

## Since the threads operate concurrently, what is actually happening could be more like:

* Thread 1 reads the current value of sum : 0
* Thread 2 reads the current value of sum : 0
* Thread 1 adds i=1 to the current value of sum : 1
* Thread 2 adds i=2 to the current value of sum : 2
* Thread 1 stores the new value of sum : 1
* Thread 2 stores the new value of sum : 2

## In this case part of the sum is lost: the final value of sum is 2 instead of 3.  Lost updates like this make us compute a value of the sum < N*(N+1)/2 (this is exactly what happens when we run the current version with two threads!).
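
## The two interleavings listed above can be replayed deterministically in ordinary sequential C, with one local variable per thread standing in for the value that thread has read.  This is only a sketch of the idea, not a description of what the hardware literally does; the names replay_two_updates, reg1, and reg2 are hypothetical.

```c
#include <assert.h>

// Replay the read / add / store steps of two threads updating a shared sum.
// interleaved == 0: thread 1 finishes all three steps before thread 2 starts.
// interleaved == 1: both threads read sum before either one stores its result.
long long replay_two_updates(int interleaved) {
    assert(interleaved == 0 || interleaved == 1);
    long long sum = 0;      // stands in for the shared variable
    long long reg1, reg2;   // the value each thread is working with
    if (!interleaved) {
        reg1 = sum;  reg1 += 1;  sum = reg1;   // thread 1: read 0, add i=1, store 1
        reg2 = sum;  reg2 += 2;  sum = reg2;   // thread 2: read 1, add i=2, store 3
    } else {
        reg1 = sum;         // thread 1 reads 0
        reg2 = sum;         // thread 2 reads 0
        reg1 += 1;          // thread 1 computes 1
        reg2 += 2;          // thread 2 computes 2
        sum = reg1;         // thread 1 stores 1
        sum = reg2;         // thread 2 stores 2, overwriting thread 1's update
    }
    return sum;
}
```

## Here replay_two_updates(0) returns 3 and replay_two_updates(1) returns 2, matching the two step lists.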

## There are a couple of ways to fix this issue.
## One way is to have threads update sum inside of a critical region.  
## A block of code that follows *#pragma omp critical* inside a parallel region is executed by only one thread at a time.
## In version 4 of our code, we use a critical region to fix the read/write race condition for sum.

In [35]:
%%writefile omp_sum_v4.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
	    for (long long i = 1+thread_num; i <= N;i+=num_threads) {
#pragma omp critical
            {
		        sum += i;
	        }
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Overwriting omp_sum_v4.c


In [36]:
!gcc -o omp_sum_v4 omp_sum_v4.c -fopenmp

In [37]:
!./omp_sum_v4 100000000 1

num_threads = 1, elapsed time = 2.212588
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


In [38]:
!./omp_sum_v4 100000000 2

num_threads = 2, elapsed time = 2.119281
sum = 5000000050000000
N*(N+1)/2 = 5000000050000000


## Version 4 of the code now computes the correct value of sum and the work is distributed approximately evenly between threads.

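## However, version 4 is much slower than version 3 (about 2.2 seconds versus about 0.25 seconds with one thread), and adding a second thread barely changes the runtime!!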

## To fix the read/write race condition we needed to use a critical region.

## The issue is that critical regions are expensive because they require synchronization of the threads.  This synchronization also reduces parallelism because threads will have to wait on other threads before they can enter the critical region.

## To fix this remaining issue, in our final version 5 we change the code so that each thread only has to enter the critical region one time inside the parallel region.

## To ensure each thread only enters the critical region one time, we give each thread its own private variable thread_sum so that each thread can do as much work as possible independently.

## After each thread fully computes its partial thread_sum it adds that value to the shared sum variable inside the critical region.

In [39]:
%%writefile omp_sum_v5.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {

    // get N and num_threads from command line
    if (argc < 3) {
        printf ("Command usage : %s %s %s\n",argv[0],"N","num_threads");
        return 1;
    }
    long long N = atoll(argv[1]);
    int num_threads = atoi(argv[2]);
    omp_set_num_threads(num_threads);

    // start the timer
    double start_time, end_time;
    start_time = omp_get_wtime();

    // calculate the sum
    long long sum = 0;

#pragma omp parallel default(none) shared(N,sum,num_threads)
    {
	    int thread_num = omp_get_thread_num();
	    long long thread_sum = 0;
	    for (long long i = 1+thread_num; i <= N;i+=num_threads) {
	        thread_sum += i;
	    }
#pragma omp critical
	    {
	        sum += thread_sum;
	    }
    }

    // stop the timer
    end_time = omp_get_wtime();

    printf ("num_threads = %d, ",num_threads);
    printf ("elapsed time = %.6f\n",end_time-start_time);
    printf ("sum = %lld\n",sum);
    // divide the even factor by 2 first so the formula is exact for odd N
    printf ("N*(N+1)/2 = %lld\n",(N%2==0) ? (N/2)*(N+1) : N*((N+1)/2));
}

Overwriting omp_sum_v5.c


In [40]:
!gcc -o omp_sum_v5 omp_sum_v5.c -fopenmp

In [41]:
!./omp_sum_v5 4000000000 1

num_threads = 1, elapsed time = 7.973311
sum = 8000000002000000000
N*(N+1)/2 = 8000000002000000000


In [42]:
!./omp_sum_v5 4000000000 2

num_threads = 2, elapsed time = 7.620994
sum = 8000000002000000000
N*(N+1)/2 = 8000000002000000000


## On Google Colab, our OpenMP programs can only utilize 2 CPU cores.  Run version 5 on the *matrix* server to see better results.
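
## One caveat when moving to a machine with more cores: versions 2 through 5 step the loop by the requested num_threads, which gives the correct sum only if the runtime actually starts that many threads.  Here is a minimal sketch of a variant that asks the runtime for the actual number of threads with omp_get_num_threads() inside the parallel region.  The function name parallel_sum and the variable team_size are hypothetical, and the #else branch only provides single-thread stand-ins so that the sketch also builds without -fopenmp.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
// Single-thread stand-ins (an assumption for building without -fopenmp).
static void omp_set_num_threads(int n) { (void)n; }
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

// Sum 1 + 2 + ... + N with per-thread partial sums, as in version 5,
// but stepping by the number of threads that are actually running.
long long parallel_sum(long long N, int requested_threads) {
    assert(requested_threads > 0);
    omp_set_num_threads(requested_threads);
    long long sum = 0;

#pragma omp parallel default(none) shared(N, sum)
    {
        int thread_num = omp_get_thread_num();
        int team_size = omp_get_num_threads();   // threads actually in this region
        long long thread_sum = 0;
        for (long long i = 1 + thread_num; i <= N; i += team_size) {
            thread_sum += i;
        }
#pragma omp critical
        {
            sum += thread_sum;   // one update of the shared sum per thread
        }
    }
    return sum;
}
```

## With this change the result does not depend on how many threads the runtime provides; for example, parallel_sum(1000, 4) returns 500500 whether the region runs with four threads or fewer.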