# Pairwise Distance Algorithm using Numba-Dppy


## Sections
- [Pairwise algorithm](#Pairwise-algorithm)
- _Code:_ [Implementation of Pairwise distance targeting CPU using Numba JIT](#Implementation-of-Pairwise-distance-targeting-CPU-using-Numba-JIT)
- _Code:_ [Implementation of Pairwise distance targeting GPU using Numba JIT](#Implementation-of-Pairwise-distance-targeting-GPU-using-Numba-JIT)
- _Code:_ [Implementation of Pairwise targeting GPU using Kernels](#Implementation-of-Pairwise-targeting-GPU-using-Kernels)
- _Code:_ [Implementation of Pairwise targeting GPU using Numpy](#Implementation-of-Pairwise-targeting-GPU-using-Numpy)

## Learning Objectives
* Build a Numba implementation of Pairwise targeting CPU and GPU using Numba JIT
* Build a Numba-DPPY implementation of Pairwise on CPU and GPU using the kernel approach
* Build a Numba-DPPY implementation of Pairwise on GPU using the NumPy approach

## numba-dppy

Numba-dppy is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. It is packaged as part of the Intel Distribution for Python (IDP) that comes with the oneAPI Base Toolkit, so you don’t need to install any additional Conda packages. SYCL support is provided through DPC++'s SYCL runtime; other SYCL compilers are not supported by Numba-dppy.



## Pairwise algorithm
The pairwise distance application takes a set of multidimensional points and computes the Euclidean distance between every pair of points. For n observations, a common sub-task of different data analysis algorithms is to compute the symmetric matrix of distances between each pair of observations.

Euclidean distance is widely used in fields such as machine learning and astronomy.
The following examples show a pairwise Euclidean distance computation implemented first with the Numba JIT method and then with a kernel function.
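
As a point of reference before the Numba versions, the definition above can be sketched in plain NumPy. This is an illustrative sketch, not part of the lab files; the helper name `pairwise_reference` and the sample points are assumptions made for this example.

```python
import numpy as np


def pairwise_reference(X1, X2):
    """Return the (M, N) matrix of Euclidean distances between rows of X1 and X2.

    Hypothetical helper for illustration only; it is not part of the lab code.
    """
    # Broadcast to shape (M, N, O): difference of every X1 row with every X2 row
    diff = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    # Sum squared differences over the coordinate axis, then take the square root
    return np.sqrt(np.sum(diff * diff, axis=2))


# Three 2-D points chosen so the distances are easy to check by hand
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D_ref = pairwise_reference(points, points)
print(D_ref)
```

When a dataset is compared with itself, as here, the result is symmetric with a zero diagonal, which is the symmetric matrix described above.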


# Implementation of Pairwise distance targeting CPU using Numba JIT
In the following example, we introduce a naive pairwise distance implementation that targets a CPU using the Numba JIT. The function takes two arrays of points, `X1` (M points) and `X2` (N points) with the same number of dimensions, and writes the M x N matrix of Euclidean distances into `D` (M x M when both arrays hold the same number of points).

This is the decorator-based approach, where we offload data-parallel code sections such as parallel-for loops and certain NumPy function calls. With the decorator method, a programmer simply needs to identify the most time-consuming parts of the program. If those parts can be parallelized, the programmer annotates those sections using Numba-DPPy and can expect them to execute on a GPU.



1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pairwise_distance.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT


import base_pair_wise
import numpy as np
import numba


@numba.jit(nopython=True, parallel=True, fastmath=True)
def pw_distance(X1, X2, D):
    M = X1.shape[0]
    N = X2.shape[0]
    O = X1.shape[1]
    for i in numba.prange(M):
        for j in range(N):
            d = 0.0
            for k in range(O):
                tmp = X1[i, k] - X2[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)


base_pair_wise.run("Numba par_for", pw_distance) 

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_jit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_jit.sh; else ./run_pair_wise_jit.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Implementation of Pairwise distance targeting GPU using Numba JIT

In the following example, we introduce a naive pairwise distance implementation that targets a GPU using the Numba JIT. The function takes two arrays of points, `X1` (M points) and `X2` (N points) with the same number of dimensions, and writes the M x N matrix of Euclidean distances into `D` (M x M when both arrays hold the same number of points).


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pairwise_distance_gpu.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_gpu
import numpy as np
import numba

# Naive pairwise distance impl - take an array representing M points in N dimensions, and return the M x M matrix of Euclidean distances
@numba.njit(parallel=True, fastmath=True)
def pw_distance_kernel(X1, X2, D):
    # Size of inputs
    M = X1.shape[0]
    N = X2.shape[0]
    O = X1.shape[1]

    # Outermost parallel loop over the matrix X1
    for i in numba.prange(M):
        # Loop over the matrix X2
        for j in range(N):
            d = 0.0
            # Compute Euclidean distance
            for k in range(O):
                tmp = X1[i, k] - X2[j, k]
                d += tmp * tmp
            # Write computed distance to distance matrix
            D[i, j] = np.sqrt(d)


def pw_distance(X1, X2, D):
    with dpctl.device_context(base_pair_wise_gpu.get_device_selector()):
        pw_distance_kernel(X1, X2, D)


base_pair_wise_gpu.run("Numba par_for", pw_distance)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_jit_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_jit_gpu.sh; else ./run_pair_wise_jit_gpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Implementation of Pairwise targeting GPU using Kernels

## Writing Explicit Kernels in numba-dppy

Writing a SYCL kernel using the `@numba_dppy.kernel` decorator has similar syntax to writing OpenCL kernels. As such, the numba-dppy module provides similar indexing and other functions as OpenCL. The indexing functions supported inside a `numba_dppy.kernel` are:

* numba_dppy.get_global_id : Gets the global ID of the item (used in the kernel example in this section)
* numba_dppy.get_local_id : Gets the local ID of the item within its work group
* numba_dppy.get_local_size: Gets the local work group size
* numba_dppy.get_group_id : Gets the ID of the work group the item belongs to
* numba_dppy.get_num_groups: Gets the number of work groups

Refer to the numba-dppy kernel programming guide at https://intelpython.github.io/numba-dppy/latest/user_guides/kernel_programming_guide/index.html for more details.
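
The relationship between these indices can be illustrated without a device. The following plain-Python sketch emulates a one-dimensional launch and checks the usual SYCL/OpenCL relation `global_id = group_id * local_size + local_id` (assuming no global offset). The function `emulate_launch` is hypothetical and only mimics the numbering; it does not use numba-dppy.

```python
def emulate_launch(global_size, local_size):
    """Enumerate (global_id, group_id, local_id) for a 1-D launch.

    Hypothetical helper: it mimics how work items are numbered, nothing more.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    items = []
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Work-item numbering with a zero global offset
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return num_groups, items


num_groups, items = emulate_launch(global_size=8, local_size=4)
print(num_groups)  # 2
print(items[5])    # (5, 1, 1)
```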

The following example uses the dppy-kernel approach for explicit kernel programming. If the programmer wants to extract further performance from the offloaded code, explicit dppy-kernels make it possible to tune GPU parameters and take advantage of the work groups and work items of a device. Here, the kernel takes two arrays of points, `X1` (M points) and `X2` (N points), and writes the M x N matrix of Euclidean distances into `D`; each work item computes one row of `D`.
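
One practical consequence of choosing a work group size: SYCL nd-range launches generally expect the global size to be a multiple of the local size. A minimal sketch of rounding the number of rows up to the next multiple follows. `padded_global_size` is a hypothetical helper, and a kernel launched with a padded size would also need to skip indices beyond the real row count.

```python
def padded_global_size(num_rows, local_size):
    """Round num_rows up to the nearest multiple of local_size (hypothetical helper)."""
    return ((num_rows + local_size - 1) // local_size) * local_size


# 1000 rows with work groups of 128 items need 8 groups, i.e. 1024 work items
global_size = padded_global_size(1000, 128)
print(global_size)         # 1024
print(global_size // 128)  # 8
```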


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pair_wise_kernel.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_gpu
import numpy as np
import numba_dppy


@numba_dppy.kernel
def pairwise_python(X1, X2, D):
    i = numba_dppy.get_global_id(0)

    N = X2.shape[0]
    O = X1.shape[1]
    for j in range(N):
        d = 0.0
        for k in range(O):
            tmp = X1[i, k] - X2[j, k]
            d += tmp * tmp
        D[i, j] = np.sqrt(d)


def pw_distance(X1, X2, D):
    with dpctl.device_context(base_pair_wise_gpu.get_device_selector()):
        # pairwise_python[X1.shape[0],numba_dppy.DEFAULT_LOCAL_SIZE](X1, X2, D)
        pairwise_python[X1.shape[0], 128](X1, X2, D)


base_pair_wise_gpu.run("Pairwise Distance Kernel", pw_distance)


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_kernel.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_kernel.sh; else ./run_pair_wise_kernel.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Plot GPU Results

This finds nearest point pairs **BETWEEN** two datasets X1, X2.

It will not currently find close points **WITHIN** or among a single dataset.
The algorithm below detects the closest point-pair matches between the two datasets.
On the graph, cyan points (X1) and magenta points (X2) that lie within the radius of a point from the other dataset are drawn with a larger marker.
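
The selection step in the plotting cell relies on `np.where` applied to a thresholded distance matrix. A small sketch with made-up distances (the matrix `D_demo` and the radius value are assumptions for illustration) shows what the returned index arrays look like.

```python
import numpy as np

# Made-up 3 x 2 distance matrix: rows index points in X1, columns points in X2
D_demo = np.array([[0.10, 0.90],
                   [0.50, 0.05],
                   [0.70, 0.80]])
radius = 0.15

# np.where returns one index array per axis for the entries below the radius
x1_idx, x2_idx = np.where(D_demo < radius)
pairs = list(zip(x1_idx.tolist(), x2_idx.tolist()))
print(pairs)  # [(0, 0), (1, 1)]
```

Each pair `(i, j)` means that point `i` of X1 lies within the radius of point `j` of X2, so both are highlighted.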

In [None]:
%%writefile lab/pair_wise_graph.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_graph
import numpy as np
import numba_dppy


@numba_dppy.kernel
def pairwise_python(X1, X2, D):
    i = numba_dppy.get_global_id(0)
    
    N = X2.shape[0]
    O = X1.shape[1]
    for j in range(N):
        d = 0.0
        for k in range(O):
            tmp = X1[i, k] - X2[j, k]
            d += tmp * tmp
        D[i, j] = np.sqrt(d)

def pw_distance(X1,X2,D):
    with dpctl.device_context(base_pair_wise_graph.get_device_selector()):
        #pairwise_python[X1.shape[0],numba_dppy.DEFAULT_LOCAL_SIZE](X1, X2, D)
        pairwise_python[X1.shape[0],8](X1, X2, D)

base_pair_wise_graph.run("Pairwise Distance Kernel", pw_distance)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_graph.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_graph.sh; else ./run_pair_wise_graph.sh; fi

### View the results
Select the cell below and click run ▶ to view the graph:

In [None]:
def read_dictionary(fn):
    import pickle
    # Load data (deserialize)
    with open(fn, 'rb') as handle:
        dictionary = pickle.load(handle)
    return dictionary
resultsDict = read_dictionary('resultsDict.pkl')
limit = 10
D = resultsDict['D'][:limit,:limit]
X1 = resultsDict['X1'][:limit,:]
X2 = resultsDict['X2'][:limit,:]

from matplotlib import pyplot as plt 
import numpy as np 
Radius = .15
index = np.where(D  < Radius)
#plt.hist(resultsDict['D']) 
# plt.title("histogram") 
# plt.show()
x1i, x2i = index
plt.style.use('dark_background')
plt.gcf().set_size_inches((12, 5))
plt.grid()
plt.scatter(X1[:,0], X1[:,1], c='cyan',s = 20, alpha = .7)
plt.scatter(X2[:,0], X2[:,1], c='magenta', s = 20, alpha = .7)
plt.scatter(X1[x1i,0],X1[x1i,1],c='cyan', s = 80, alpha = 1)
plt.scatter(X2[x2i,0],X2[x2i,1],c='magenta', s = 80, alpha = 1)
plt.title('Plot of points within Radius: {} for {} points'.format(Radius, limit))
plt.xlabel('x coordinate')
plt.ylabel('y coordinate')


## Implementation of Pairwise targeting GPU using Numpy

The following example shows a pairwise distance implementation written with NumPy operations, based on the identity (a-b)^2 = a^2 + b^2 - 2ab, and targets the GPU using the NumPy approach.
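
For vectors, the identity reads ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which is what lets the whole distance matrix be built from a single matrix product. The sketch below checks the expanded form against the direct definition in plain NumPy. The function names and the random test data are assumptions for illustration, and the clamp to zero is an added safeguard against small negative values caused by floating-point rounding.

```python
import numpy as np


def pairwise_expanded(X1, X2):
    """Distances via ||a||^2 + ||b||^2 - 2 a.b (hypothetical helper)."""
    sq1 = np.sum(np.square(X1), axis=1).reshape(-1, 1)  # shape (M, 1)
    sq2 = np.sum(np.square(X2), axis=1)                 # shape (N,)
    dist_sq = sq1 + sq2 - 2.0 * np.dot(X1, X2.T)        # shape (M, N)
    # Rounding can push true zeros slightly negative; clamp before the root
    return np.sqrt(np.maximum(dist_sq, 0.0))


def pairwise_direct(X1, X2):
    """Distances from the definition, used as the reference (hypothetical helper)."""
    diff = X1[:, np.newaxis, :] - X2[np.newaxis, :, :]
    return np.sqrt(np.sum(diff * diff, axis=2))


rng = np.random.default_rng(0)
A = rng.random((5, 3))
B = rng.random((4, 3))
print(np.allclose(pairwise_expanded(A, B), pairwise_direct(A, B)))
```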

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/pair_wise_numpy.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import dpctl
import base_pair_wise_gpu
import numpy as np
import numba


# Pairwise Numpy implementation using the equation (a-b)^2 = a^2 + b^2 - 2*a*b
@numba.njit(parallel=True, fastmath=True)
def pw_distance_kernel(X1, X2, D):
    # return np.sqrt((np.square(X1 - X2.reshape((X2.shape[0],1,X2.shape[1])))).sum(axis=2))

    # Computing the first two terms (X1^2 and X2^2) of the Euclidean distance equation
    x1 = np.sum(np.square(X1), axis=1)
    x2 = np.sum(np.square(X2), axis=1)

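    # Compute the third term in the equation
    dist_sq = -2 * np.dot(X1, X2.T)
    x3 = x1.reshape(x1.size, 1)
    dist_sq = dist_sq + x3  # x1[:,None] Not supported by Numba
    dist_sq = dist_sq + x2

    # Compute the square root and store it in the caller's array;
    # rebinding the local name D would leave the output array unchanged
    D[:, :] = np.sqrt(dist_sq)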


def pw_distance(X1, X2, D):
    with dpctl.device_context(base_pair_wise_gpu.get_device_selector()):
        pw_distance_kernel(X1, X2, D)


base_pair_wise_gpu.run("Numba Numpy", pw_distance) 

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_pair_wise_numpy.sh; if [ -x "$(command -v qsub)" ]; then ./q run_pair_wise_numpy.sh; else ./run_pair_wise_numpy.sh; fi

## Summary
In this module you learned the following:
* Numba implementation of Pairwise targeting a CPU and GPU using Numba JIT
* Numba-DPPY implementation of Pairwise on a CPU and GPU using the kernel approach
* Numba-DPPY implementation of Pairwise on a GPU using NumPy