# Pairwise Distance Algorithm using Numba-Dppy


## Sections
- [Pairwise algorithm](#Pairwise-algorithm)
- _Code:_ [Implementation of Pairwise distance targeting CPU using Numba JIT](#Implementation-of-Pairwise-distance-targeting-CPU-using-Numba-JIT)
- _Code:_ [Implementation of Pairwise distance targeting GPU using Numba JIT](#Implementation-of-Pairwise-distance-targeting-GPU-using-Numba-JIT)
- _Code:_ [Implementation of Pairwise targeting GPU using Kernels](#Implementation-of-Pairwise-targeting-GPU-using-Kernels)
- _Code:_ [Implementation of Pairwise targeting GPU using Kernels and dpctl Memory](#Implementation-of-Pairwise-targeting-GPU-using-Kernels-and-dpctl-Memory)

## Learning Objectives
* Build a Numba implementation of Pairwise targeting a CPU and a GPU using Numba JIT
* Build a Numba-DPPY implementation of Pairwise on a CPU and GPU using the kernel approach
* Build a Numba-DPPY implementation of Pairwise on a GPU using kernels and dpctl memory

## numba-dppy

Numba-dppy is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. It is packaged as part of the Intel Distribution for Python (IDP) that comes with the oneAPI Base Toolkit, so you do not need to install any additional Conda packages. SYCL support is provided through the DPC++ SYCL runtime; other SYCL compilers are not supported by Numba-dppy.



## Pairwise algorithm
The pairwise distance application takes a set of multidimensional points and computes the Euclidean distance between every pair of points. For n observations, a common sub-task of different data analysis algorithms is to compute the symmetric matrix of distances between each pair of observations.

Euclidean distance is widely used in fields such as machine learning and astronomy.
The following examples show a pairwise Euclidean distance computation implemented first with the Numba JIT method and then as a kernel function.
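
As a point of reference, the definition above can be written in plain Python with no JIT compilation or device offload. This is only a minimal sketch and is not one of the lab files; the function name `pairwise_reference` and the sample points are illustrative assumptions.

```python
from math import sqrt


def pairwise_reference(X1, X2):
    """Return D where D[i][j] is the Euclidean distance between X1[i] and X2[j]."""
    D = [[0.0] * len(X2) for _ in X1]
    for i, p in enumerate(X1):
        for j, q in enumerate(X2):
            # Sum of squared coordinate differences, then the square root
            D[i][j] = sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return D


# Three hypothetical 2-D points; (0, 0) to (3, 4) is a 3-4-5 triangle
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = pairwise_reference(points, points)
print(D[0][1])  # 5.0
```

When both arguments are the same set of points, the result is symmetric with a zero diagonal, which is the distance matrix described above.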


## Implementation of Pairwise distance targeting CPU using Numba JIT
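In the following example, we introduce a naive pairwise distance implementation that targets a CPU using the Numba JIT. The function takes two arrays of points, `X1` (M points) and `X2` (N points) with the same number of dimensions, and fills `D` with the M x N matrix of Euclidean distances; when both arrays hold the same M points, this is the M x M distance matrix.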

This is the decorator-based approach, in which data-parallel code sections such as parallel-for loops and certain NumPy function calls are offloaded. With the decorator method, the programmer only needs to identify the most time-consuming parts of the program. If those parts can be parallelized, the programmer just annotates those sections using Numba-DPPY and can expect them to execute on a GPU. In this first example, the annotated function runs on the CPU.
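
The outer loop of a pairwise distance computation is a good candidate for a parallel-for because each iteration writes only its own row of the result. The sketch below illustrates that independence in plain Python, using a thread pool as a stand-in for a parallel loop. It is an analogy rather than Numba code: `distance_row` and the sample points are hypothetical, and CPython threads are used here only to show that rows can be computed in any order, not to gain speed.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def distance_row(i, X1, X2):
    """Compute row i of the distance matrix; it reads only X1[i] and X2."""
    return [sqrt(sum((a - b) ** 2 for a, b in zip(X1[i], q))) for q in X2]


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]  # hypothetical sample points

# Sequential outer loop, one row per iteration
sequential = [distance_row(i, X, X) for i in range(len(X))]

# The same rows computed by a pool of workers, in whatever order they are scheduled
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(lambda i: distance_row(i, X, X), range(len(X))))

print(parallel == sequential)  # True
```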



1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pairwise_distance.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT


import base_pair_wise
import numpy as np
import numba

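# Compile in nopython mode with automatic parallelization enabled
@numba.jit(nopython=True, parallel=True, fastmath=True)
def pw_distance(X1, X2, D):
    M = X1.shape[0]  # number of points in X1
    N = X2.shape[0]  # number of points in X2
    O = X1.shape[1]  # number of dimensions per point
    # prange marks the outer loop as parallel; each iteration fills one row of D
    for i in numba.prange(M):
        for j in range(N):
            d = 0.0
            # Accumulate the squared differences over all dimensions
            for k in range(O):
                tmp = X1[i, k] - X2[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)

base_pair_wise.run("Numba par_for", pw_distance)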

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_jit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_jit.sh; else ./run_pair_wise_jit.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Implementation of Pairwise distance targeting GPU using Numba JIT

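In the following example, we introduce a naive pairwise distance implementation that targets a GPU using the Numba JIT. The JIT-compiled function is the same as in the CPU example; it is called inside a `dpctl.device_context` so that the parallel loop is offloaded to the selected device. The function takes two arrays of points, `X1` (M points) and `X2` (N points), and fills `D` with the M x N matrix of Euclidean distances; when both arrays hold the same M points, this is the M x M distance matrix.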


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pairwise_distance_gpu.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_gpu
import numpy as np
import numba

# Naive pairwise distance implementation: take an array representing M points in N dimensions and return the M x M matrix of Euclidean distances
@numba.njit(parallel=True,fastmath=True)
def pw_distance_kernel(X1,X2,D):
    # Size of inputs
    M = X1.shape[0]
    N = X2.shape[0]
    O = X1.shape[1]

    # Outermost parallel loop over the matrix X1
    for i in numba.prange(M):
        # Loop over the matrix X2
        for j in range(N):
            d = 0.0
            # Compute Euclidean distance
            for k in range(O):
                tmp = X1[i, k] - X2[j, k]
                d += tmp * tmp
            # Write computed distance to distance matrix
            D[i, j] = np.sqrt(d)

def pw_distance(X1,X2,D):
    with dpctl.device_context(base_pair_wise_gpu.get_device_selector()):
        pw_distance_kernel(X1,X2,D)

base_pair_wise_gpu.run("Numba par_for", pw_distance)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_jit_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_jit_gpu.sh; else ./run_pair_wise_jit_gpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Implementation of Pairwise targeting GPU using Kernels

### Writing Explicit Kernels in numba-dppy

Writing a SYCL kernel with the `@numba_dppy.kernel` decorator is syntactically similar to writing OpenCL kernels. Accordingly, the numba-dppy module provides indexing and other functions similar to those in OpenCL. The indexing functions supported inside a `numba_dppy.kernel` are:

* numba_dppy.get_global_id : Gets the global ID of the item (used by the kernels in this module)
* numba_dppy.get_local_id : Gets the local ID of the item within its work group
* numba_dppy.get_local_size: Gets the size of the local work group
* numba_dppy.get_group_id : Gets the ID of the work group the item belongs to
* numba_dppy.get_num_groups: Gets the number of work groups

Refer to https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details.
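
To make the relationship between these indices concrete, the sketch below emulates a one-dimensional launch in plain Python. It assumes the usual OpenCL-style convention for a range with zero offset: the global ID equals the group ID times the local size plus the local ID, and the number of groups is the global size divided by the local size. No numba_dppy calls are made, and `work_items` is a hypothetical helper used only for illustration.

```python
def work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for a 1-D range with zero offset."""
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Global ID derived from the group and the position inside it
            yield group_id * local_size + local_id, group_id, local_id


items = list(work_items(global_size=8, local_size=4))
print(items[5])  # (5, 1, 1)
```

With a global size of 8 and a local size of 4, there are two work groups of four items each, and the item with global ID 5 is the second item of the second group.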

The following example uses the dppy-kernel approach for explicit kernel programming. A programmer who wants to extract further performance from the offloaded code can write explicit dppy kernels and tune GPU parameters such as the work-group size, taking advantage of the work groups and work items on the device. Here, each work item computes one row of the result: the kernel takes two arrays of points and fills `D` with their Euclidean distances, which is the M x M matrix when both arrays hold the same M points in N dimensions.
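
One practical point when tuning the work-group size: ND-range launches typically expect the global size to be a multiple of the local size. A common way to handle an arbitrary number of points is to round the global size up and have each work item check its index before writing. The sketch below emulates that pattern in plain Python. It is an assumption-laden illustration, not the lab's kernel (which has no such guard), and `padded_global_size` and `emulated_kernel` are hypothetical names.

```python
from math import sqrt


def padded_global_size(n_items, local_size):
    """Smallest multiple of local_size that is >= n_items."""
    return ((n_items + local_size - 1) // local_size) * local_size


def emulated_kernel(i, X1, X2, D):
    """Body of one work item: fill row i of D, skipping padding items."""
    if i >= len(X1):  # guard for work items added by the padding
        return
    for j, q in enumerate(X2):
        D[i][j] = sqrt(sum((a - b) ** 2 for a, b in zip(X1[i], q)))


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # 3 hypothetical points
D = [[-1.0] * len(X) for _ in X]

global_size = padded_global_size(len(X), local_size=8)
for gid in range(global_size):  # stands in for the device launching work items
    emulated_kernel(gid, X, X, D)

print(global_size, D[0][1])  # 8 5.0
```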


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pair_wise_kernel.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_gpu
import numpy as np
import numba_dppy

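@numba_dppy.kernel
def pairwise_python(X1, X2, D):
    # Each work item handles one row of X1, selected by its global ID
    i = numba_dppy.get_global_id(0)

    N = X2.shape[0]  # number of points in X2
    O = X1.shape[1]  # number of dimensions per point
    for j in range(N):
        d = 0.0
        for k in range(O):
            tmp = X1[i, k] - X2[j, k]
            d += tmp * tmp
        D[i, j] = np.sqrt(d)

def pw_distance(X1, X2, D):
    with dpctl.device_context(base_pair_wise_gpu.get_device_selector()):
        # Global size: one work item per row of X1; local (work-group) size: 8.
        # The commented-out line is the alternative using numba_dppy.DEFAULT_LOCAL_SIZE.
        #pairwise_python[X1.shape[0],numba_dppy.DEFAULT_LOCAL_SIZE](X1, X2, D)
        pairwise_python[X1.shape[0], 8](X1, X2, D)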

base_pair_wise_gpu.run("Pairwise Distance Kernel", pw_distance)


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_kernel.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_kernel.sh; else ./run_pair_wise_kernel.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Implementation of Pairwise targeting GPU using Kernels and dpctl Memory

The following example uses the same `numba_dppy.kernel` approach, taking advantage of the work groups and work items on the device. In addition, it uses dpctl USM (unified shared memory) shared allocations, which are accessible from both the host and the device, to hold the input and output arrays. Here we take an array representing M points in N dimensions, and return the M x M matrix of Euclidean distances.
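
The code in this section wraps each USM allocation in a NumPy array with `np.ndarray(shape, buffer=..., dtype=...)`, so the array is a typed view over the allocation rather than a copy. The sketch below shows the same idea using only the standard library, with a `bytearray` standing in for the USM buffer and `memoryview` standing in for the NumPy view. It is an analogy under those assumptions and makes no dpctl calls.

```python
import struct

rows, cols = 2, 3
itemsize = struct.calcsize("d")  # bytes per double-precision float

# Raw byte allocation, standing in for MemoryUSMShared(nbytes)
raw = bytearray(rows * cols * itemsize)

# Typed 2-D view over the same bytes; no data is copied
view = memoryview(raw).cast("d", shape=[rows, cols])

# Writing through another view changes the underlying allocation
flat = memoryview(raw).cast("d")
flat[4] = 2.5  # element (1, 1) in row-major order

print(view.tolist()[1][1])  # 2.5
```

Because both views share the allocation, a write made through one is visible through the other, which is why copying data into the USM-backed array is enough to make it available to the kernel.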

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pair_wise_kernel2.py

# Copyright 2020, 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from time import time
from math import sqrt
import numpy as np
import argparse
import numba_dppy as dppy
import dpctl
import dpctl.memory as dpctl_mem

parser = argparse.ArgumentParser(description="Program to compute pairwise distance")

parser.add_argument("-n", type=int, default=10, help="Number of points")
parser.add_argument("-d", type=int, default=3, help="Dimensions")
parser.add_argument("-r", type=int, default=1, help="repeat")
parser.add_argument("-l", type=int, default=1, help="local_work_size")

args = parser.parse_args()

# Global work size is equal to the number of points
global_size = args.n
# Local Work size is optional
local_size = args.l

X = np.random.random((args.n, args.d))
D = np.empty((args.n, args.n))


@dppy.kernel
def pairwise_distance(X, D, xshape0, xshape1):
    """
    A Euclidean pairwise distance computation implemented as
    a ``kernel`` function.
    """
    idx = dppy.get_global_id(0)

    # for i in range(xshape0):
    for j in range(X.shape[0]):
        d = 0.0
        for k in range(X.shape[1]):
            tmp = X[idx, k] - X[j, k]
            d += tmp * tmp
        D[idx, j] = sqrt(d)


def driver():
    # measure running time
    times = list()

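    # Allocate USM shared memory for X, wrap it in a NumPy array (a view, not a copy),
    # and copy the input points into the USM-backed array
    xbuf = dpctl_mem.MemoryUSMShared(X.size * X.dtype.itemsize)
    x_ndarray = np.ndarray(X.shape, buffer=xbuf, dtype=X.dtype)
    np.copyto(x_ndarray, X)

    # Do the same for the output distance matrix D
    dbuf = dpctl_mem.MemoryUSMShared(D.size * D.dtype.itemsize)
    d_ndarray = np.ndarray(D.shape, buffer=dbuf, dtype=D.dtype)
    np.copyto(d_ndarray, D)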

    for repeat in range(args.r):
        start = time()
        pairwise_distance[global_size, local_size](
            x_ndarray, d_ndarray, X.shape[0], X.shape[1]
        )
        end = time()

        total_time = end - start
        times.append(total_time)

    # Copy the data back from the USM-backed arrays to the host arrays
    np.copyto(X, x_ndarray)
    np.copyto(D, d_ndarray)

    return times


def main():
    times = None

    # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
    # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
    #device = dpctl.select_default_device()
    #print("Using device ...")
    #device.print_device_info()

    with dpctl.device_context("opencl:gpu"):
        times = driver()

    times = np.asarray(times, dtype=np.float32)
    print("Average time of %d runs is = %fs" % (args.r, times.mean()))

    print("Done...")


if __name__ == "__main__":
    main()

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_kernel2.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_kernel2.sh; else ./run_pair_wise_kernel2.sh; fi
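
The driver above times every kernel launch and averages the results. With JIT-compiled code, the first call usually includes compilation time, so a small number of repeats can be dominated by that one call. One common remedy is to run a warm-up call before timing. The sketch below shows the pattern with the standard library only; `average_time` and its `warmup` parameter are hypothetical and not part of the lab code.

```python
from time import perf_counter


def average_time(fn, repeats, warmup=1):
    """Average wall-clock seconds per call, excluding `warmup` initial calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as JIT compilation
    times = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        times.append(perf_counter() - start)
    return sum(times) / len(times)


calls = []
avg = average_time(lambda: calls.append(1), repeats=5)
print(len(calls))  # 6
```

The function is called six times in total: once for the warm-up and five times for the measurement.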

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Summary
In this module you learned the following:
* Numba implementation of Pairwise targeting a CPU and GPU using Numba JIT
* Numba-DPPY implementation of Pairwise on a CPU and GPU using the kernel approach
* Numba-DPPY implementation of Pairwise on a GPU using kernels and dpctl memory