# Introduction to oneAPI and OpenMP* Offload with Fortran

#### Sections:
- [oneAPI Software Model Overview](#oneAPI-Software-Model-Overview)
- [HPC Single Node Workflow with oneAPI](#HPC-Single-Node-Workflow-with-oneAPI)
- Code: [Simple Exercise](#Simple-Exercise)
- [Compile and Running Fortran Programs](#Compile-and-Running-Fortran-Programs)
- [Target Directive](#Target-Directive)
- Code: [Simple Vector Increment with Target Directive](#Lab-Exercise:-Running-an-OpenMP-program-with-the-Target-Directive)

## Learning Objectives

* Explain how __oneAPI__ can solve the challenges of programming in a heterogeneous world 
* Use oneAPI solutions to enable your workflows
* Use __OpenMP Offload__ directives to execute code on the GPU
* Become familiar with using Jupyter notebooks for training throughout the course

### Prerequisites
This course assumes general OpenMP knowledge for CPUs. If you are new to OpenMP, below are some great resources to get you started.
* [Basic Course on OpenMP](https://www.youtube.com/watch?v=nE-xN4Bf8XI&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG)
* [OpenMP Specification (for version 5.0)](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf)

## oneAPI Software Model Overview
The oneAPI software model provides a comprehensive and unified portfolio of developer tools that can
be used across hardware targets, including a range of performance libraries spanning several workload
domains. The libraries include functions custom-coded for each target architecture so the same
function call delivers optimized performance across supported architectures. The oneAPI initiative is based on __industry standards and open specifications__ and is interoperable with existing HPC programming models.

<img src="Assets/oneapi2.png">

## HPC Single-Node Workflow with oneAPI 
Accelerated code can be written in either a kernel style (DPC++) or a __directive-based style__ (OpenMP). Developers can use the __Intel® DPC++ Compatibility Tool__ to perform a one-time migration from __CUDA*__ to __Data Parallel C++__. Existing __Fortran__ applications can use a __directive style based on OpenMP__. Existing __C++__ applications can choose either the __kernel style__ or the __directive-based style__, and existing __OpenCL__ applications can remain in the OpenCL language or migrate to Data Parallel C++.

__Intel® Advisor__ is recommended to __optimize__ the design for __vectorization and memory__ (CPU and GPU), to __identify__ loops that are candidates for __offload__, and to project the __performance on target accelerators__.

The figure below shows the recommended approach for each starting point an HPC developer may have:

<img src="Assets/workflow.png">

## OpenMP vs DPC++
Both OpenMP and DPC++ are based on open standards and can be used to accelerate algorithms on GPUs. As the workflow diagram shows, oneAPI supports both methodologies, and you should be able to achieve similar optimized performance with either option. The choice between the two likely depends on workflow requirements and ease of porting. When migrating from existing __CUDA__ or __OpenCL__ projects, DPC++ would likely make more sense. When migrating from existing C/Fortran applications that already use __OpenMP__, OpenMP offload would be the easier alternative.

## OpenMP Offload
**OpenMP Offload** constructs are a set of directives for C/C++ and Fortran, introduced in OpenMP 4.0 and further enhanced in later versions, that allow developers to offload data and execution to target accelerators such as GPUs. OpenMP offload is supported in the Intel® oneAPI HPC Toolkit with the Intel® C++ Compiler and the Intel® Fortran Compiler.

***
## Simple Exercise
This exercise introduces OpenMP offload by way of a small, simple program. It also introduces the Jupyter notebook environment for editing and saving code, and for running and submitting programs to the Intel® oneAPI DevCloud.

We start with a simple program that uses the basic OpenMP constructs *parallel* and *do*. We will then add the *target* directive to offload part of the program to the GPU device.

This simple program loops through all of the elements of the data array `x` and multiplies each one by 2.

### Editing the simple.f90 code
The Jupyter cell below with the gray background can be edited in-place and saved.

The first line of the cell contains the command **%%writefile lab/simple.f90**. This tells the input cell to save its contents into the file 'lab/simple.f90'. Each time you edit the cell and run it, your changes are saved into that file.

The cell below contains the simple OpenMP code. No modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file.
2. Next, run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/simple.f90
!==============================================================
! Copyright © 2020 Intel Corporation
!
! SPDX-License-Identifier: MIT
! =============================================================
program main
    use omp_lib
    integer, parameter :: N=16
    integer :: i, x(N)
    logical :: is_cpu = .true.
        
    do i=1,N
        x(i) = i
    end do
        
    if ( .not.(omp_is_initial_device()) )  is_cpu=.false.
    
    !$omp parallel do
    do i=1,N
        x(i) = x(i) * 2
    end do
    !$omp end parallel do
        

    if (is_cpu) then
        print *, "Running on CPU"
    else
        print *, "Running on GPU"
    end if
        
    do i=1,N
        print *, x(i)
    end do
end program main

## Compile and Running Fortran Programs
 
#### Compiling and Running on DevCloud:
 
For training purposes, we have written a script (__q__) to simplify launching tasks on the DevCloud. The __q__ script submits a script to a GPU node on the DevCloud for execution, waits for the job to complete, and prints the output and errors. We will be using this command to run programs on the DevCloud: `./q <script>.sh`

#### Compiling and Running on local system:

If you have installed the Intel® oneAPI HPC Toolkit on your local system, you can use the commands below to compile and run an OpenMP offload program:
```shell
# Set up the oneAPI environment
source /opt/intel/inteloneapi/setvars.sh

# Compile with OpenMP offload enabled; -o names the executable that is run below
ifx -fiopenmp -fopenmp-targets=spir64 simple.f90 -o simple

# Run the program
./simple
```

Note: our scripts combine the three steps above.

Using the __ifx__ compiler with the _"-fiopenmp -fopenmp-targets=spir64"_ options enables OpenMP offload to the GPU.
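
Before running the exercises on a local system, it can help to confirm that the OpenMP runtime actually sees an offload device. The following is a minimal sketch, not part of the course material: the program name `check_offload` is hypothetical, while `openmp_version` and `omp_get_num_devices` come from the standard `omp_lib` module. It assumes the program is compiled with the same offload options shown above.

```fortran
program check_offload
    use omp_lib
    implicit none
    integer :: ndev

    ! openmp_version is a named constant from omp_lib holding the
    ! yyyymm date of the OpenMP specification the compiler supports
    print *, "OpenMP specification date (yyyymm):", openmp_version

    ! Number of non-host devices the runtime can offload to
    ndev = omp_get_num_devices()
    print *, "Offload devices visible:", ndev

    if (ndev == 0) then
        ! With default settings, target regions then execute on the host
        print *, "No offload device found"
    end if
end program check_offload
```

If the reported device count is 0, target regions normally fall back to the host unless the `OMP_TARGET_OFFLOAD` environment variable is set to `MANDATORY`, in which case the program is expected to stop with an error instead.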

### Compile the code
To compile the code above, we'll be using the _compile_f.sh_ script. This script sets up the compile environment and executes the Intel® Fortran Compiler.

In [None]:
#Optional: Examine contents of compile_f.sh
%pycat compile_f.sh

Execute the following cell to run the compile_f.sh script.

In [None]:
! chmod 755 q; chmod 755 compile_f.sh; ./compile_f.sh;

_If the Jupyter cells are not responsive or if they error out when you compile the samples, please restart the Kernel and compile the samples again._

### Running the code
To run the compiled executable, we'll be using the _run.sh_ script.

In [None]:
#Optional: Examine contents of run.sh
%pycat run.sh

Execute the following cell to run the run.sh script. The script is submitted through the q script when `qsub` is available and is run directly otherwise.

In [None]:
! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

## Target Directive
The `omp target` construct transfers control and data from the host to the device. The transfer of control is sequential and synchronous: the host waits for the target region to finish before it continues. In a multi-device environment, the optional _device_ clause selects a specific device. Each device is assigned an implementation-specific integer number. Map clauses control the direction of data flow and will be discussed in detail in the next module.

Example:
```fortran
! Sequential Host Code
...

!$omp target
! Target Region Executed on the Device
   do i=1,N
       ...
   end do
!$omp end target

! More Sequential Host Code
...
```
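
The _device_ clause described above could be used as in the following sketch. It is an illustration under stated assumptions rather than part of the lab: the program name and the variables `dev`, `ndev`, and `ran_on_host` are hypothetical, and the `map` clause is used here only to bring results back to the host ahead of its full treatment in the next module.

```fortran
program device_clause_sketch
    use omp_lib
    implicit none
    integer, parameter :: N = 8
    integer :: i, dev, ndev, x(N)
    logical :: ran_on_host

    do i = 1, N
        x(i) = i
    end do

    ndev = omp_get_num_devices()      ! number of non-host devices visible
    dev  = omp_get_default_device()   ! device used when no device clause is given
    ran_on_host = .true.

    ! Offload to the device numbered dev; the host waits here until the region ends
    !$omp target device(dev) map(tofrom: x, ran_on_host)
    ran_on_host = omp_is_initial_device()
    do i = 1, N
        x(i) = x(i) + 1
    end do
    !$omp end target

    ! Runs only after the target region has completed
    print *, "Devices available:", ndev, " selected device:", dev
    print *, "Target region ran on host:", ran_on_host
    print *, x
end program device_clause_sketch
```

Passing the default device number makes the clause behave as if it were omitted; on a system with several devices, a different number in the range reported by `omp_get_num_devices()` would select another one.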

## Lab Exercise: Running an OpenMP program with the Target Directive
In the example below, add the `!$omp target map(tofrom:is_cpu)` directive where indicated to offload execution to the GPU. We use the map clause here to copy the value of `is_cpu` back to the host so we can see whether our code actually executed on the GPU. Be sure to also include the `!$omp end target` statement. The *map* clause will be discussed in detail in the next module.

In [None]:
%%writefile lab/simple.f90
!==============================================================
! Copyright © 2020 Intel Corporation
!
! SPDX-License-Identifier: MIT
! =============================================================
program main
    use omp_lib
    integer, parameter :: N=16
    integer :: i, x(N)
    logical :: is_cpu = .true.
        
    do i=1,N
        x(i) = i
    end do
       
    !TODO Place the target directive here including the map(tofrom:is_cpu) clause
    
    !$omp parallel do
    do i=1,N
        if ((i==1) .and. (.not.(omp_is_initial_device()))) is_cpu=.false.
        x(i) = x(i) * 2
    end do
    
    !TODO Place the end target directive here

    if (is_cpu) then
        print *, "Running on CPU"
    else
        print *, "Running on GPU"
    end if
        
    do i=1,N
        print *, x(i)
    end do
end program main

In [None]:
# Execute this cell to compile the code
! chmod 755 compile_f.sh; ./compile_f.sh;

In [None]:
# Execute this cell to run the code
! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: 
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_

Once execution completes, you should see the message that the program ran on the GPU.

In [None]:
# See the solution by running this cell
%pycat simple_solution.f90

## Summary
In this module you have learned how to:
* Explain how oneAPI solves the challenges of programming in a heterogeneous world
* Take advantage of oneAPI solutions to enable your workflows
* Use the Intel DevCloud to test drive oneAPI tools and libraries
* Use the target directive to enable OpenMP offload
* Use Jupyter notebooks to edit source code in context

## Resources

Check out these related resources:

#### Intel® oneAPI
* [oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® DevCloud](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")
* [Get Started with oneAPI for Linux*](https://software.intel.com/en-us/get-started-with-intel-oneapi-linux)
* [Get Started with oneAPI for Windows*](https://software.intel.com/en-us/get-started-with-intel-oneapi-windows)
* [oneAPI Release Notes](https://software.intel.com/en-us/articles/intel-oneapi-release-notes)
* [oneAPI Sample Codes](https://software.intel.com/en-us/articles/code-samples-for-intel-oneapibeta-toolkits)

#### OpenMP
* [OpenMP Specification (for version 5.0)](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf)
***

© Intel Corporation | [\*Trademark](https://www.intel.com/content/www/us/en/legal/trademarks.html)