# GPAIRS Algorithm using Numba-Dppy


## Sections
- [Gpairs algorithm](#Gpairs-algorithm)
- _Code:_ [Implementation of Gpairs targeting CPU using Numba JIT](#Implementation-of-Gpairs-targeting-CPU-using-Numba-JIT)
- _Code:_ [Implementation of Gpairs targeting GPU using Kernel](#Implementation-of-Gpairs-targeting-GPU-using-Kernel)
- _Code:_ [Plot the results for Gpairs on GPU](#Plot-the-results-for-Gpairs-on-GPU)

## Learning Objectives
* Build a Numba implementation of Gpairs targeting CPU and GPU using Numba JIT
* Build a Numba-DPPY implementation of Gpairs on CPU and GPU using the kernel approach

## numba-dppy

Numba-dppy is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dppy is packaged as part of the Intel Distribution for Python (IDP) that comes with the oneAPI Base Toolkit, so you don't need to install any specific Conda packages. SYCL support is provided through DPC++'s SYCL runtime; other SYCL compilers are not supported by Numba-dppy.



## Gpairs algorithm
The Gpairs distance application takes a set of multidimensional points and computes the Euclidean distance between every pair of points. For n observations, a common sub-task of different data analysis algorithms is to compute the symmetric matrix of distances between each pair of observations.

The algorithm naively counts Npairs(<r), the total number of pairs that are separated by a distance less than r, for each r**2 in the input rbins_squared.
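
The counting step can be illustrated with a small NumPy sketch. It is unweighted and deliberately unoptimized; the function name `count_pairs_naive` and the sample points are assumptions made for illustration and are not part of the course code.

```python
import numpy as np


def count_pairs_naive(p1, p2, rbins_squared):
    """Count pairs (a, b), a from p1 and b from p2, closer than each threshold.

    Hypothetical helper: counts[k] is the number of pairs whose squared
    distance is strictly less than rbins_squared[k].
    """
    counts = np.zeros(len(rbins_squared), dtype=np.int64)
    for a in p1:
        for b in p2:
            dsq = float(np.sum((a - b) ** 2))  # squared Euclidean distance
            for k, r2 in enumerate(rbins_squared):
                if dsq < r2:
                    counts[k] += 1
    return counts


# Two tiny made-up point sets in 3D
p1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
p2 = np.array([[0.0, 0.0, 0.5], [3.0, 0.0, 0.0]])
rbins_squared = np.array([1.0, 4.0, 16.0])

counts = count_pairs_naive(p1, p2, rbins_squared)
print(counts)
```

Comparing squared distances against squared radii avoids a square root for every pair, which is presumably why the input is given as rbins_squared.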


# Implementation of Gpairs targeting CPU using Numba JIT
In the following example, we introduce a Gpairs pairwise-distance implementation that targets a CPU using the Numba JIT.

This is the decorator-based approach, where we offload data-parallel code sections, such as parallel-for loops and certain NumPy function calls. With the decorator method, a programmer simply identifies the most time-consuming parts of the program. If those parts can be parallelized, the programmer annotates those sections using Numba-DPPY and can expect them to execute on a GPU.
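
As a rough illustration of the decorator idea, the sketch below annotates a pair-counting loop with stock Numba's `njit(parallel=True)` and `prange`. It targets the CPU only and is not the `count_weighted_pairs_3d_cpu` function used in this module; the function name `count_pairs_within` and the sample data are hypothetical. The fallback branch is an assumption added so that the sketch still runs, serially, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (serially) when Numba is not installed
    prange = range

    def njit(**options):
        def decorator(func):
            return func
        return decorator


@njit(parallel=True)
def count_pairs_within(x1, y1, z1, x2, y2, z2, r_squared):
    total = 0
    # prange marks the outer loop as parallel; total is treated as a sum reduction
    for i in prange(x1.shape[0]):
        for j in range(x2.shape[0]):
            dx = x1[i] - x2[j]
            dy = y1[i] - y2[j]
            dz = z1[i] - z2[j]
            if dx * dx + dy * dy + dz * dz < r_squared:
                total += 1
    return total


# Made-up sample data: two points in each set
x1 = np.array([0.0, 1.0])
y1 = np.array([0.0, 0.0])
z1 = np.array([0.0, 0.0])
x2 = np.array([0.0, 3.0])
y2 = np.array([0.0, 0.0])
z2 = np.array([0.5, 0.0])

n_close = count_pairs_within(x1, y1, z1, x2, y2, z2, 4.0)
print(n_close)
```

Only the decorator and the `prange` loop distinguish this from ordinary Python, which is the appeal of the approach: the hot loop is annotated rather than rewritten.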



1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs
import numpy as np
from gaussian_weighted_pair_counts import count_weighted_pairs_3d_cpu

def run_gpairs(x1, y1, z1, w1, x2, y2, z2, w2, d_rbins_squared):
    # Cast the coordinates and weights of both point sets to single precision
    x1 = x1.astype(np.float32)
    y1 = y1.astype(np.float32)
    z1 = z1.astype(np.float32)
    w1 = w1.astype(np.float32)
    x2 = x2.astype(np.float32)
    y2 = y2.astype(np.float32)
    z2 = z2.astype(np.float32)
    w2 = w2.astype(np.float32)

    # Output shape: one entry fewer than the number of squared-radius thresholds
    result = np.zeros_like(d_rbins_squared)[:-1]
    result = result.astype(np.float32)
    # Buffer passed to the JIT-compiled function to receive the weighted pair counts
    results_test = np.zeros_like(result).astype(np.float64)
    count_weighted_pairs_3d_cpu(
        x1, y1, z1, w1, x2, y2, z2, w2, d_rbins_squared.astype(np.float32), results_test)

base_gpairs.run("Gpairs Numba", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit.sh; else ./run_gpairs_jit.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Implementation of Gpairs targeting GPU using Kernel

## Writing Explicit Kernels in numba-dppy

Writing a SYCL kernel using the `@numba_dppy.kernel` decorator has similar syntax to writing OpenCL kernels. As such, the numba-dppy module provides similar indexing and other functions as OpenCL. The indexing functions supported inside a `numba_dppy.kernel` are:

* numba_dppy.get_local_id: Gets the local ID of the item
* numba_dppy.get_local_size: Gets the local work-group size of the device
* numba_dppy.get_group_id: Gets the group ID of the item
* numba_dppy.get_num_groups: Gets the number of work-groups

Refer to https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details.
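
The relationship between these indices can be illustrated without a device. The sketch below is plain Python and does not call numba-dppy; the helper name `enumerate_work_items` is hypothetical, and it assumes a one-dimensional range whose global size is a multiple of the local size, following the OpenCL convention.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1D OpenCL-style index space."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    items = []
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Each work item has a unique global ID built from its group and local IDs
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return num_groups, items


num_groups, items = enumerate_work_items(global_size=8, local_size=4)
for global_id, group_id, local_id in items:
    print(f"global={global_id} group={group_id} local={local_id}")
```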

The following example uses the dppy-kernel approach for explicit kernel programming. When a programmer wants to extract further performance from the offloaded code, explicit dppy kernels make it possible to tune the GPU parameters and to take advantage of the work-groups and work-items on a device.
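
To make the work-item idea concrete, the sketch below emulates a kernel on the host: a function holds the work done by one work-item, and a loop stands in for the device launching one work-item per point of the first set. This is an assumption about how such a kernel could be organized, not the source of `count_weighted_pairs_3d_intel_ver2`. The names `pair_count_work_item` and `launch` are hypothetical, and the sketch keeps one accumulator per threshold, which may differ from the bin layout used by the course's kernel.

```python
import numpy as np


def pair_count_work_item(i, x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result):
    """Work done by work-item i: compare point i of set 1 with every point of set 2."""
    for j in range(x2.shape[0]):
        dx = x1[i] - x2[j]
        dy = y1[i] - y2[j]
        dz = z1[i] - z2[j]
        dsq = dx * dx + dy * dy + dz * dz
        wprod = w1[i] * w2[j]
        for k in range(rbins_squared.shape[0]):
            if dsq < rbins_squared[k]:
                # On a real device, concurrent work-items would need an atomic update here
                result[k] += wprod


def launch(global_size, *args):
    """Host-side stand-in for a device launch: run the body once per work-item."""
    for i in range(global_size):
        pair_count_work_item(i, *args)


# Made-up sample data: two weighted points in each set
x1 = np.array([0.0, 1.0]); y1 = np.zeros(2); z1 = np.zeros(2)
w1 = np.array([1.0, 2.0])
x2 = np.array([0.0, 3.0]); y2 = np.zeros(2); z2 = np.array([0.5, 0.0])
w2 = np.array([1.0, 0.5])
rbins_squared = np.array([1.0, 4.0, 16.0])
result = np.zeros(3)

launch(x1.shape[0], x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result)
print(result)
```

Using the number of points in the first set as the global size gives each work-item an independent row of the pair matrix, so the only shared state is the result buffer.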


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_gpu
import numpy as np
import gaussian_weighted_pair_counts_gpu as gwpc
import numba_dppy
import dpctl


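def run_gpairs(
    d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    # Run the kernel on the device chosen by the helper module
    with dpctl.device_context(base_gpairs_gpu.get_device_selector()):
        # Launch configuration is [global size, local size]: one work-item per
        # point in the first set, with the default local size
        gwpc.count_weighted_pairs_3d_intel_ver2[
            d_x1.shape[0], numba_dppy.DEFAULT_LOCAL_SIZE
        ](d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result)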


base_gpairs_gpu.run("Gpairs Dppy kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit_gpu.sh; else ./run_gpairs_jit_gpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Plot the results for Gpairs on GPU

The algorithm naively counts Npairs(<r), the total number of pairs that are separated by a distance less than r, for each r**2 in the input rbins_squared.

The first graph below shows a three-dimensional view of the points. The second graph shows a logarithmic view of the __results__, which are computed from the pairs whose distance is less than the distances defined by RBINS_SQUARED.
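
A logarithmic axis can only display strictly positive values, which is why the plotting code later in this section adds a small constant before plotting. The sketch below shows the effect with NumPy alone; the `counts` values are made up for illustration and are not results from this module.

```python
import numpy as np

# Made-up weighted pair counts; the first entry is empty
counts = np.array([0.0, 3.0, 250.0, 12000.0])
offset = 1e-4  # same order of magnitude as the constant used in the plotting cell

# Without the offset, the empty entry maps to -inf and cannot be drawn
with np.errstate(divide="ignore"):
    raw_logs = np.log10(counts)

# With the offset, every entry has a finite position on the log axis
shifted_logs = np.log10(counts + offset)

print(raw_logs)
print(shifted_logs)
```

The offset is small compared with any non-empty count, so it changes the plotted position of populated entries only negligibly.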

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu_graph.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_gpu_graph
import numpy as np
import gaussian_weighted_pair_counts_gpu as gwpc
import numba_dppy
import dpctl


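def run_gpairs(
    d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    # Run the kernel on the device chosen by the helper module
    with dpctl.device_context(base_gpairs_gpu_graph.get_device_selector()):
        # Launch configuration is [global size, local size]: one work-item per
        # point in the first set, with the default local size
        gwpc.count_weighted_pairs_3d_intel_ver2[
            d_x1.shape[0], numba_dppy.DEFAULT_LOCAL_SIZE
        ](d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result)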


base_gpairs_gpu_graph.run("Gpairs Dppy kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit_gpu_graph.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit_gpu_graph.sh; else ./run_gpairs_jit_gpu_graph.sh; fi

### View the results
Select the cell below and click run ▶ to view the graph:

In [None]:
import pickle

import numpy as np
from matplotlib import pyplot as plt
# Needed on older Matplotlib versions to register the 3D projection
from mpl_toolkits.mplot3d import Axes3D


def read_dictionary(fn):
    # Load data (deserialize)
    with open(fn, 'rb') as handle:
        dictionary = pickle.load(handle)
    return dictionary


resultsDict = read_dictionary('resultsDict.pkl')

# Plot only the first few points of each set to keep the scatter readable
limit = 10
X1 = resultsDict['X1'][:limit]
Y1 = resultsDict['Y1'][:limit]
Z1 = resultsDict['Z1'][:limit]
X2 = resultsDict['X2'][:limit]
Y2 = resultsDict['Y2'][:limit]
Z2 = resultsDict['Z2'][:limit]
result = resultsDict['result']
RBINS_SQUARED = resultsDict['DEFAULT_RBINS_SQUARED']

plt.style.use('dark_background')
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X1, Y1, Z1, c='blue', s=40, alpha=.8)  # first point set
ax.scatter(X2, Y2, Z2, c='y', s=40, alpha=.8)  # second point set
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
plt.yscale("log")
plt.ylabel("magnitude of results")
plt.xlabel("index of results")
plt.xticks(np.arange(0, 20, 1.0))
# Small offset so that zero-valued entries can be drawn on a log axis
nonzero = 1e-4
plt.grid()
plt.plot(result + nonzero, c='y');
plt.plot(RBINS_SQUARED + nonzero, c='r');

## Summary
In this module you learned the following:
* Numba implementation of Gpairs targeting a CPU and GPU using Numba JIT
* Numba-DPPY implementation of Gpairs on a GPU using the kernel approach