# Module 1.1 - Port an Intel® oneAPI Collective Communications Library (oneCCL) Sample from CPU to GPU - CCL Allreduce

## Learning Objectives
In this module, the developer will:
* Learn about the different oneCCL configurations inside the Intel® oneAPI toolkit
* Learn how to compile a oneCCL sample with different configurations via batch jobs on the Intel® DevCloud for oneAPI or in local environments
* Learn how to program oneCCL with a simple sample
* Learn how to port a oneCCL sample from a CPU-only version to a CPU and GPU version by using DPC++
* Learn how to collect VTune™ Amplifier data for CPU and GPU runs



***
# CCL Allreduce CPU to GPU porting Exercise


## Step 1: Introducing oneCCL configurations inside the oneAPI toolkits
oneCCL has two different configurations inside the oneAPI toolkits. Both lib and include folders under the oneCCL installation path contain two different configurations, and each configuration supports a different compiler.

Set the installation path of your oneAPI toolkit:

In [None]:
%env ONEAPI_INSTALL=/opt/intel/inteloneapi/

In [None]:
!printf '%s\n'    $ONEAPI_INSTALL/ccl/latest/lib/*

As you can see, there are two different folders under the oneCCL installation path, and each of those configurations supports different features. 
This tutorial will guide you on how to compile and run against different oneCCL configurations.

First, create a lab folder for this exercise:

In [None]:
!mkdir lab

## Step 2: Editing the cpu_allreduce_cpp_test.cpp code, which supports only the CPU

This C++ API example demonstrates a global reduction (allreduce) that uses the sum function. It runs only on the CPU.
You can find a detailed explanation of the allreduce API at this [link](https://intel.github.io/oneccl/spec/communication_primitives.html#allreduce).
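To make the expected result concrete, here is a minimal sketch that simulates a sum allreduce inside a single process, without oneCCL. The helper names `simulate_allreduce_sum` and `make_sample_sendbufs` are hypothetical and exist only for this illustration. It assumes, as in the sample below, that rank `r` contributes the value `r + 1` in every element, so every element of the reduced buffer should equal `size * (size + 1) / 2`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise sum over the send buffers of all ranks.
// Each inner vector stands in for one rank's send buffer; in a real
// allreduce every rank would receive a copy of the returned vector.
std::vector<int> simulate_allreduce_sum(const std::vector<std::vector<int>>& sendbufs)
{
    assert(!sendbufs.empty());
    const std::size_t count = sendbufs.front().size();
    std::vector<int> result(count, 0);
    for (const auto& buf : sendbufs) {
        assert(buf.size() == count);  // all ranks must contribute the same count
        for (std::size_t i = 0; i < count; ++i) {
            result[i] += buf[i];
        }
    }
    return result;
}

// Hypothetical helper: build the buffers the sample would send,
// where rank r fills its buffer with r + 1.
std::vector<std::vector<int>> make_sample_sendbufs(int size, std::size_t count)
{
    std::vector<std::vector<int>> sendbufs;
    for (int rank = 0; rank < size; ++rank) {
        sendbufs.emplace_back(count, rank + 1);
    }
    return sendbufs;
}
```

With four simulated ranks, each element of the result is 1 + 2 + 3 + 4 = 10, which matches the `size * (size + 1) / 2` check used in the sample.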


The Jupyter cell below with the gray background can be edited in-place and saved.
The first line of the cell contains the command **%%writefile lab/cpu_allreduce_cpp_test.cpp**. This tells the input cell to save its contents to the file lab/cpu_allreduce_cpp_test.cpp. As you edit the cell and run it, your changes are saved to that file.


In [None]:
%%writefile lab/cpu_allreduce_cpp_test.cpp
#include <iostream>
#include <stdio.h>
#include "ccl.hpp"

#define COUNT 128

using namespace std;

int main(int argc, char** argv)
{
    int i = 0;
    int size = 0;
    int rank = 0;

    auto sendbuf = new int[COUNT];
    auto recvbuf = new int[COUNT];

    auto comm = ccl::environment::instance().create_communicator();
    auto stream = ccl::environment::instance().create_stream();

    rank = comm->rank();
    size = comm->size();

    /* initialize sendbuf */
    for (i = 0; i < COUNT; i++) {
        sendbuf[i] = rank;
    }

    /* modify sendbuf */
    for (i = 0; i < COUNT; i++) {
        sendbuf[i] += 1;
    }

    /* invoke ccl_allreduce */
    comm->allreduce(sendbuf,
                   recvbuf,
                   COUNT,
                   ccl::reduction::sum,
                   nullptr, /* attr */
                   stream)->wait();

    /* check correctness of recvbuf */
    for (i = 0; i < COUNT; i++) {
        if (recvbuf[i] != size * (size + 1) / 2) {
           recvbuf[i] = -1;
        }
    }

    /* print out the result of the test */
    if (rank == 0) {
        for (i = 0; i < COUNT; i++) {
            if (recvbuf[i] == -1) {
                cout << "FAILED" << endl;
                break;
            }
        }
        if (i == COUNT) {
            cout << "PASSED" << endl;
        }
    }

    /* release the heap buffers allocated above */
    delete[] sendbuf;
    delete[] recvbuf;

    return 0;
}


Then, write the required CMake file into the lab folder. The top half of CMakeLists.txt handles CPU-only samples, and the bottom half handles DPC++ samples with CPU and GPU support.

In [None]:
%%writefile lab/CMakeLists.txt
#cmake_minimum_required (VERSION 2.8)
#project(CCL_SAMPLES)
set(CCL_TEST_INCLUDE_DIR "$ENV{PWD}/../include")
set(CMAKE_INSTALL_PREFIX "$ENV{PWD}/_install")
if(${CMAKE_CXX_COMPILER_ID} STREQUAL "GNU")
    file(GLOB sources "cpu_*.c" "cpu_*.cpp")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CLANG_FLAGS} -std=c++11")
    set(CCL_INCLUDE_DIR "$ENV{CCL_ROOT}/include/cpu_icc")
    set(CCL_LIB_DIR "$ENV{CCL_ROOT}/lib/cpu_icc")
    foreach(src ${sources})
        include_directories(${CCL_INCLUDE_DIR})
        include_directories(${CCL_TEST_INCLUDE_DIR})
        link_directories(${CCL_LIB_DIR})
        get_filename_component(executable ${src} NAME_WE)
        add_executable(${executable} ${src})
        target_link_libraries(${executable} PUBLIC rt)
        target_link_libraries(${executable} PUBLIC m)
        target_link_libraries(${executable} PRIVATE ccl)
        target_link_libraries(${executable} PUBLIC pthread dl stdc++)
        install(TARGETS ${executable} RUNTIME DESTINATION "${CMAKE_INSTALL_PREFIX}")
    endforeach()
endif()

if(${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
    set(CCL_INCLUDE_DIRS "${CCL_INCLUDE_DIRS} $ENV{SYCL_BUNDLE_ROOT}/include")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -std=c++11")
    file(GLOB sources "sycl_*.c" "sycl_*.cpp")
    set(CCL_INCLUDE_DIR "$ENV{CCL_ROOT}/include/cpu_gpu_dpcpp")
    set(CCL_LIB_DIR "$ENV{CCL_ROOT}/lib/cpu_gpu_dpcpp")
    foreach(src ${sources})
        include_directories(${CCL_INCLUDE_DIR})
        include_directories(${CCL_TEST_INCLUDE_DIR})
        link_directories(${CCL_LIB_DIR})
        get_filename_component(executable ${src} NAME_WE)
        add_executable(${executable} ${src})
        target_link_libraries(${executable} PUBLIC rt)
        target_link_libraries(${executable} PUBLIC m)
        target_link_libraries(${executable} PRIVATE ccl)
        target_link_libraries(${executable} PRIVATE OpenCL)
        target_link_libraries(${executable} PRIVATE sycl)
        install(TARGETS ${executable} RUNTIME DESTINATION "${CMAKE_INSTALL_PREFIX}")
    endforeach()
endif()


## Step 3: Build and Execution


### Build and Run with the GNU Compiler and OpenMP
This CPU-only sample, which performs a global sum reduction, is built with the GNU compiler. The following section shows how to build it with g++ and run it on a CPU.

#### Script - build.sh
The script **build.sh** encapsulates the compiler command and flags that generate the executable.

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_icc --force > /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir cpu_gomp
cd cpu_gomp
cmake .. -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
make cpu_allreduce_cpp_test



Once the compilation completes without errors, execute your program on the Intel DevCloud or in a local environment.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.
The user must switch to the g++ oneCCL configuration by passing the custom configuration option "--ccl-configuration=cpu_icc" when running "source setvars.sh".


In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_icc --force > /dev/null 2>&1
echo "########## Executing the run"
./cpu_gomp/out/cpu_allreduce_cpp_test
echo "########## Done with the run"




#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit the **build.sh** and **run.sh** to the job queue.

##### NOTE - It is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the DevCloud or in local environments, this and subsequent trainings check for the existence of the job submission command **qsub**. If the check fails, the build and run are assumed to be local.

In [None]:
!rm -rf cpu_gomp; chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi

## Step 4: Analyze performance with VTune Amplifier
Use the VTune Amplifier command line to analyze performance and display the summary.

### CPU profiling
The script vtune_collect.sh encapsulates the profiling command and flags that will generate the VTune Amplifier profiling results.

In [None]:
%%writefile vtune_collect.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_icc --force 
type=hotspots

# -f avoids an error on the first run, when no earlier result directory exists
rm -rf $(pwd)/vtune_data

echo "VTune Collect $type"
vtune -collect $type -result-dir $(pwd)/vtune_data $(pwd)/cpu_gomp/out/cpu_allreduce_cpp_test

echo "VTune Summary Report"
vtune -report summary -result-dir $(pwd)/vtune_data -format html -report-output $(pwd)/summary.html
echo "Done profiling"

#### Run VTune Amplifier to Collect Hotspots and Generate Report
Collect VTune Amplifier data and generate report

In [None]:
! chmod 755 vtune_collect.sh; if [ -x "$(command -v qsub)" ]; then ./q vtune_collect.sh; else ./vtune_collect.sh; fi

#### Display VTune Amplifier Summary
Display the VTune Amplifier summary report generated in HTML format

In [None]:
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)

### GPU profiling
The script vtune_collect.sh encapsulates the profiling command and flags that will generate the VTune Amplifier profiling results.

In the script below, the profiling type is changed from hotspots to gpu-hotspots to perform basic GPU profiling.

In [None]:
%%writefile vtune_collect.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_icc --force 
type=gpu-hotspots

# -f avoids an error when no earlier result directory exists
rm -rf $(pwd)/vtune_data

echo "VTune Collect $type"
vtune -collect $type -result-dir $(pwd)/vtune_data $(pwd)/cpu_gomp/out/cpu_allreduce_cpp_test

echo "VTune Summary Report"
vtune -report summary -result-dir $(pwd)/vtune_data -format html -report-output $(pwd)/summary-gpu.html
echo "Done profiling"

#### Run VTune Amplifier to Collect Hotspots and Generate Report
Collect VTune Amplifier data and generate report

In [None]:
! chmod 755 vtune_collect.sh; if [ -x "$(command -v qsub)" ]; then ./q vtune_collect.sh; else ./vtune_collect.sh; fi

#### Display VTune Amplifier Summary
Display the VTune Amplifier summary report generated in HTML format

The VTune Amplifier summary page shows the GPU as stalled or idle for the entire run, because this CPU-only sample does not use the GPU.

In [None]:
from IPython.display import IFrame
IFrame(src='summary-gpu.html', width=960, height=600)

## Step 5: Modifying the cpu_allreduce_cpp_test.cpp code to support both CPU and GPU

In this section, we will convert the above cpu_allreduce_cpp_test.cpp into sycl_allreduce_cpp_test.cpp, which supports both CPU and GPU, and compile the sample with DPC++ instead of g++.

There are several steps to complete the code conversion from CPU to GPU for this sample.

* Step 0 : Define inline functions to create sycl queue with the selected selector
* Step 1 : Declare the sycl queue and sycl buffers
* Step 2 : Use the inline functions in Step 0 to create the sycl queue
* Step 3 : Access sycl buffer via its accessor on both the host and target side 
* Step 3.1 : Initialize sycl buffer and its accessor on the host side
* Step 3.2 : Modify sycl buffer via its accessor on the target device side 
* Step 3.3 : Check sycl buffer's correctness on the target device side 
* Step 3.4 : Check sycl buffer's correctness on the host side

You can find related modifications below in sycl_allreduce_cpp_test.cpp, and the modifications for each step are wrapped up with ">>>>>>" and "<<<<<<".

**_NOTE:_** Host Accessors: The constructor for a host accessor waits for all kernels that modify the same buffer (or
image) in any queues to complete and then copies data back to host memory before the constructor returns.
Any command groups with requirements to the same memory object cannot execute until the host accessor
is destroyed. **Therefore, we must have { } for Step 3.1**
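
The scoping rule in this note can be illustrated without a SYCL runtime. The following is a plain C++ analogy, not SYCL code: `MockBuffer`, `MockHostAccessor`, `can_run_kernel`, and `init_then_check` are hypothetical names invented for this sketch. It only models the lifetime behavior described above, assuming that a live host accessor blocks device work on the same buffer until the accessor's destructor runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a buffer: records whether a host accessor is alive.
struct MockBuffer {
    std::vector<int> data;
    bool host_access_open = false;
    explicit MockBuffer(std::size_t n) : data(n, 0) {}
};

// Hypothetical stand-in for a host accessor: marks the buffer as busy
// from construction until destruction (RAII).
class MockHostAccessor {
public:
    explicit MockHostAccessor(MockBuffer& buf) : buf_(buf) { buf_.host_access_open = true; }
    ~MockHostAccessor() { buf_.host_access_open = false; }
    MockHostAccessor(const MockHostAccessor&) = delete;
    MockHostAccessor& operator=(const MockHostAccessor&) = delete;
    int& operator[](std::size_t i) { return buf_.data[i]; }
private:
    MockBuffer& buf_;
};

// Models whether a device kernel that needs this buffer could start now.
bool can_run_kernel(const MockBuffer& buf) { return !buf.host_access_open; }

// Mirrors the shape of Step 3.1: initialize inside braces, then continue.
bool init_then_check(MockBuffer& buf, int value)
{
    {
        MockHostAccessor acc(buf);
        for (std::size_t i = 0; i < buf.data.size(); ++i) {
            acc[i] = value;
        }
        assert(!can_run_kernel(buf));  // accessor still alive: kernel would wait
    }   // accessor destroyed here, which releases the buffer
    return can_run_kernel(buf);
}
```

Without the inner braces, the accessor would stay alive until the end of the enclosing function, and in this model `can_run_kernel` would keep returning `false` for any later work on the same buffer.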

There are two files in this DPC++ allreduce sample:
* sycl_base.hpp
* sycl_allreduce_cpp_test.cpp

sycl_base.hpp contains the inline functions that create a sycl queue with the selected selector, and the main program is in sycl_allreduce_cpp_test.cpp.
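
As a reading aid, the device-selection decisions made in sycl_base.hpp can be summarized as a pure function. This is only a sketch in standard C++: `pick_selector` and `selector_examples` are hypothetical names that do not exist in the sample, and the returned strings merely name the selector class that the sample's `create_sycl_queue` instantiates for each command-line argument.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of the branches in create_sycl_queue.
// Returns the name of the selector the sample would construct,
// or an empty string when the argument is not recognized.
std::string pick_selector(const std::string& requested, bool gpu_available)
{
    if (requested == "cpu") {
        return "cpu_selector";
    }
    if (requested == "gpu") {
        // The sample falls back to the default selector when no GPU is found.
        return gpu_available ? "gpu_selector" : "default_selector";
    }
    if (requested == "host") {
        return "host_selector";
    }
    if (requested == "default") {
        // The sample deliberately substitutes the host selector here.
        return "host_selector";
    }
    return "";  // the sample prints a usage message and returns -1
}

// Small self-check that exercises the mapping described above.
void selector_examples()
{
    assert(pick_selector("gpu", true) == "gpu_selector");
    assert(pick_selector("gpu", false) == "default_selector");
    assert(pick_selector("default", true) == "host_selector");
    assert(pick_selector("unknown", true).empty());
}
```

Keeping the decision separate from queue creation, as in this sketch, makes the fallback rules easy to check without any device present.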

The Jupyter cell below with the gray background can be edited in-place and saved.
The first line of the cell contains the command **%%writefile lab/sycl_base.hpp**. This tells the input cell to save its contents to the file lab/sycl_base.hpp. As you edit the cell and run it, your changes are saved to that file.


##### lab/sycl_base.hpp
Header file for the inline functions

In [None]:
%%writefile lab/sycl_base.hpp
#include <cstring>   // strcmp
#include <iostream>
#include <memory>    // unique_ptr
#include <stdio.h>
#include <vector>

// ------ GPU code conversion --Step 0 >>>>>>
// Define inline functions to create sycl queue with the selected selector
#include <CL/sycl.hpp>
#include "ccl.hpp"

using namespace std;

using namespace cl::sycl;
using namespace cl::sycl::access;

inline bool has_gpu()
{
    std::vector<cl::sycl::device> devices = cl::sycl::device::get_devices();
    for (const auto& device : devices)
    {
        if (device.is_gpu())
        {
            return true;
        }
    }
    return false;
}

inline int create_sycl_queue(int argc, char **argv, cl::sycl::queue &queue)
{
    unique_ptr<cl::sycl::device_selector> selector;
    if (argc == 2)
    {
        if (strcmp(argv[1], "cpu") == 0)
        {
            selector.reset(new cl::sycl::cpu_selector());
        }
        else if (strcmp(argv[1], "gpu") == 0)
        {
            if (has_gpu()) 
            {
                selector.reset(new cl::sycl::gpu_selector());
            }
            else
            {
                selector.reset(new cl::sycl::default_selector());
                cout << "GPU is unavailable, default_selector has been created instead of gpu_selector." << std::endl;
            }
        }
        else if (strcmp(argv[1], "host") == 0)
        {
            selector.reset(new cl::sycl::host_selector());
        }
        else if (strcmp(argv[1], "default") == 0)
        {
            selector.reset(new cl::sycl::host_selector());
            cout << "Accelerator is unavailable for multiprocessing, host_selector has been created instead of default_selector." << std::endl;
        }
        else
        {
            cerr << "Please provide device type: cpu | gpu | host | default " << std::endl;
            return -1;
        }
        queue = cl::sycl::queue(*selector);
        cout << "Provided device type " << argv[1] << "\nRunning on "
                  << queue.get_device().get_info<cl::sycl::info::device::name>()
                  << "\n";
    }
    else
    {
        cerr << "Please provide device type: cpu | gpu | host | default " << std::endl;
        return -1;
    }
    return 0;
}
                
//<<<<<< ------ GPU code conversion --Step 0     

##### lab/sycl_allreduce_cpp_test.cpp
Implementation of SYCL allreduce functions

The Jupyter cell below with the gray background can be edited in-place and saved.
The first line of the cell contains the command **%%writefile lab/sycl_allreduce_cpp_test.cpp**. This tells the input cell to save its contents to the file lab/sycl_allreduce_cpp_test.cpp. As you edit the cell and run it, your changes are saved to that file.

In [None]:
%%writefile lab/sycl_allreduce_cpp_test.cpp
// ------ GPU code conversion --Step 0 >>>>>>
#include "sycl_base.hpp"
//<<<<<< ------ GPU code conversion --Step 0     
#define COUNT 128

int main(int argc, char** argv)
{
    int i = 0;
    int size = 0;
    int rank = 0;

    // ------ GPU code conversion --Step 1 >>>>>>
    // Declare the sycl queue and sycl buffers
    cl::sycl::queue q;
    cl::sycl::buffer<int, 1> sendbuf(COUNT);
    cl::sycl::buffer<int, 1> recvbuf(COUNT);
    //<<<<<< ------ GPU code conversion --Step 1    
    
    // ------ GPU code conversion --Step 2 >>>>>>
    // Use inline functions in Step 0 to create the sycl queue
    if (create_sycl_queue(argc, argv, q) != 0) {
        return -1;
    }
    //<<<<<< ------ GPU code conversion --Step 2
    
    auto comm = ccl::environment::instance().create_communicator();
    auto stream = ccl::environment::instance().create_stream();

    rank = comm->rank();
    size = comm->size();

    /* initialize sendbuf and recvbuf*/
    // ------ GPU code conversion --Step 3.1 >>>>>>
    {
        //  open buffers and initialize them on the CPU side 
        auto host_acc_sbuf = sendbuf.get_access<mode::write>();
        auto host_acc_rbuf = recvbuf.get_access<mode::write>();
        for (i = 0; i < COUNT; i++) {
            host_acc_sbuf[i] = rank;
            host_acc_rbuf[i] = -1;
        }
    }
    //<<<<<< ------ GPU code conversion --Step 3.1

    /* modify sendbuf */
    // ------ GPU code conversion --Step 3.2 >>>>>>
    // open sendbuf and modify it on the target device side 
    q.submit([&](handler& cgh){
       // read_write because "+= 1" reads the old value before writing the new one
       auto dev_acc_sbuf = sendbuf.get_access<mode::read_write>(cgh);
       cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
           dev_acc_sbuf[id] += 1;
       });
    });
    //<<<<<< ------ GPU code conversion --Step 3.2
    
    /* invoke ccl_allreduce */
    comm->allreduce(sendbuf,
                   recvbuf,
                   COUNT,
                   ccl::reduction::sum,
                   nullptr, /* attr */
                   stream)->wait();

    
    
    /* check correctness of recvbuf */
    // ------ GPU code conversion --Step 3.3 >>>>>>
    // open recvbuf and check its correctness on the target device side 
    q.submit([&](handler& cgh){
       // read_write because the kernel compares each element before overwriting it
       auto dev_acc_rbuf = recvbuf.get_access<mode::read_write>(cgh);
       cgh.parallel_for<class allreduce_test_rbuf_check>(range<1>{COUNT}, [=](item<1> id) {
           if (dev_acc_rbuf[id] != size*(size+1)/2) {
               dev_acc_rbuf[id] = -1;
           }
       });
    });
    //<<<<<< ------ GPU code conversion --Step 3.3
    
    /* print out the result of the test */
    if (rank == 0) {
        // ------ GPU code conversion --Step 3.4 >>>>>>
        // open buffers and validate them on the CPU side 
        auto host_acc_rbuf_new = recvbuf.get_access<mode::read>();
        for (i = 0; i < COUNT; i++) {
            if (host_acc_rbuf_new[i] == -1) {
        //<<<<<< ------ GPU code conversion --Step 3.4
                cout << "FAILED" << std::endl;
                break;
            }
        }
        if (i == COUNT) {
            cout << "PASSED" << std::endl;
        }
    }

    return 0;
}


### Build and Run with the DPC++ Compiler
This version of the global reduction sample runs on both GPU and CPU and is compiled with DPC++.
The following section shows how to build it with DPC++ and run it on GPU and CPU.

#### Script - build.sh
The script **build.sh** encapsulates the compiler command and flags that generate the executable.

In [None]:
%%writefile build.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_gpu_dpcpp --force > /dev/null 2>&1
export EXAMPLE_ROOT=./lab/
mkdir dpcpp
cd dpcpp
cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=dpcpp
make sycl_allreduce_cpp_test


Once the compilation completes without errors, execute your program on the DevCloud or in a local environment.

#### Script - run.sh
The script **run.sh** encapsulates the program for submission to the job queue for execution.


In [None]:
%%writefile run.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_gpu_dpcpp --force > /dev/null 2>&1
echo "########## Executing the run"
./dpcpp/out/sycl_allreduce_cpp_test gpu
echo "########## Done with the run"



#### Submitting **build.sh** and **run.sh** to the job queue
Now we can submit the **build.sh** and **run.sh** to the job queue.

##### NOTE - It is possible to execute any of the build and run commands in local environments.
To let users run their scripts either on the Intel DevCloud or in local environments, this and subsequent trainings check for the existence of the job submission command **qsub**. If the check fails, the build and run are assumed to be local.

In [None]:
!rm -rf dpcpp; chmod 755 q; chmod 755 build.sh; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q build.sh; ./q run.sh; else ./build.sh; ./run.sh; fi

## Step 6: Analyze performance with VTune Amplifier
Use the VTune Amplifier command line to analyze performance and display the summary.

### CPU profiling
The script vtune_collect.sh encapsulates the profiling command and flags that will generate the VTune Amplifier profiling results.

In [None]:
%%writefile vtune_collect.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_gpu_dpcpp --force
type=hotspots

# -f avoids an error when no earlier result directory exists
rm -rf $(pwd)/vtune_data

echo "VTune Collect $type"
vtune -collect $type -result-dir $(pwd)/vtune_data $(pwd)/dpcpp/out/sycl_allreduce_cpp_test cpu

echo "VTune Summary Report"
vtune -report summary -result-dir $(pwd)/vtune_data -format html -report-output $(pwd)/summary.html
echo "Done profiling"

#### Run VTune Amplifier to Collect Hotspots and Generate Report
Collect VTune Amplifier data and generate report:

In [None]:
! chmod 755 vtune_collect.sh; if [ -x "$(command -v qsub)" ]; then ./q vtune_collect.sh; else ./vtune_collect.sh; fi

#### Display VTune Amplifier Summary
Display the VTune Amplifier summary report generated in HTML format:

In [None]:
from IPython.display import IFrame
IFrame(src='summary.html', width=960, height=600)

### GPU profiling
The script vtune_collect.sh encapsulates the profiling command and flags that will generate the VTune Amplifier profiling results.

In [None]:
%%writefile vtune_collect.sh
#!/bin/bash
source $ONEAPI_INSTALL/setvars.sh --ccl-configuration=cpu_gpu_dpcpp --force
type=gpu-hotspots

# -f avoids an error when no earlier result directory exists
rm -rf $(pwd)/vtune_data

echo "VTune Collect $type"
vtune -collect $type -result-dir $(pwd)/vtune_data $(pwd)/dpcpp/out/sycl_allreduce_cpp_test gpu


echo "VTune Summary Report"
vtune -report summary -result-dir $(pwd)/vtune_data -format html -report-output $(pwd)/summary-gpu.html
echo "Done profiling"

#### Run VTune Amplifier to Collect Hotspots and Generate Report
Collect VTune Amplifier data and generate report:

In [None]:
! chmod 755 vtune_collect.sh; if [ -x "$(command -v qsub)" ]; then ./q vtune_collect.sh; else ./vtune_collect.sh; fi

#### Display VTune Amplifier Summary
Display the VTune Amplifier summary report generated in HTML format:

In [None]:
from IPython.display import IFrame
IFrame(src='summary-gpu.html', width=960, height=600)

VTune Amplifier supports the following profiling types, any of which can be assigned to the type variable in vtune_collect.sh:

* type=hotspots
* type=memory-consumption
* type=uarch-exploration
* type=memory-access
* type=threading
* type=hpc-performance
* type=system-overview
* type=graphics-rendering
* type=io
* type=fpga-interaction
* type=gpu-offload
* type=gpu-hotspots
* type=throttling
* type=platform-profiler
* type=cpugpu-concurrency
* type=tsx-exploration
* type=tsx-hotspots
* type=sgx-hotspots

For details of VTune Amplifier usage, please refer to https://software.intel.com/en-us/oneapi/vtune-profiler

***
# Summary
In this lab, the developer has learned how to:
* Identify the different oneCCL configurations inside the oneAPI toolkit
* Compile a oneCCL sample with different configurations via batch jobs on the Intel oneAPI DevCloud or in local environments
* Program oneCCL with a simple sample
* Port a oneCCL sample from a CPU-only version to a CPU and GPU version by using DPC++
* Collect VTune Amplifier data for CPU and GPU runs