# GPAIRS Algorithm using Numba-dpex


## Sections
- [Gpairs algorithm](#Gpairs-algorithm)
- _Code:_ [Implementation of Gpairs targeting CPU using Numba JIT](#Implementation-of-Gpairs-targeting-CPU-using-Numba-JIT)
- _Code:_ [Implementation of Gpairs targeting GPU using Kernel](#Implementation-of-Gpairs-targeting-GPU-using-Kernel)
- _Code:_ [Plot the results for Gpairs on GPU](#Plot-the-results-for-Gpairs-on-GPU)

## Learning Objectives
* Build a Numba implementation of Gpairs targeting CPU and GPU using Numba JIT
* Build a Numba-dpex implementation of Gpairs on CPU and GPU using the kernel approach

## numba-dpex

Numba-dpex is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dpex is packaged as part of the Intel Distribution for Python (IDP) that comes with the oneAPI Base Toolkit, so you don't need to install any additional Conda packages.



## Gpairs algorithm
The Gpairs distance application takes a set of multidimensional points and computes the Euclidean distance between every pair of points. For n observations, a common sub-task of different data analysis algorithms is to compute the symmetric matrix of distances between each pair of observations.

The algorithm naively counts `Npairs(<r)`, the total number of pairs that are separated by a distance less than `r`, for each `r**2` in the input `rbins_squared`.
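
To make the counting rule concrete, here is a minimal plain-Python sketch of the cumulative binning that the kernels later in this notebook perform. It is a reference for the semantics only, not the Numba code used by the lab files. The function name `count_weighted_pairs_reference` and the sample data are made up for illustration, and the sketch assumes `rbins_squared` is sorted in ascending order.

```python
def count_weighted_pairs_reference(points1, weights1, points2, weights2, rbins_squared):
    """Plain-Python reference for the cumulative weighted pair count.

    Assumes rbins_squared is ascending and has at least two entries.
    result[k - 1] accumulates the weight products of all pairs whose
    squared distance is <= rbins_squared[k].
    """
    nbins = len(rbins_squared)
    result = [0.0] * (nbins - 1)
    for (px, py, pz), pw in zip(points1, weights1):
        for (qx, qy, qz), qw in zip(points2, weights2):
            dsq = (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2
            # Walk the bins from the largest radius down and stop at the
            # first bin edge that the pair no longer fits under.
            k = nbins - 1
            while dsq <= rbins_squared[k]:
                result[k - 1] += pw * qw
                k -= 1
                if k <= 0:
                    break
    return result


# Illustrative data: one point in the first set, three in the second.
points1, weights1 = [(0.0, 0.0, 0.0)], [1.0]
points2 = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
weights2 = [1.0, 2.0, 4.0]
rbins_squared = [0.5, 1.5, 4.5, 10.0]

counts = count_weighted_pairs_reference(
    points1, weights1, points2, weights2, rbins_squared
)
print(counts)  # [1.0, 3.0, 7.0]
```

Because the count is cumulative, each entry includes every pair already counted in the smaller bins, which is why the values never decrease from left to right.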


# Implementation of Gpairs targeting CPU using Numba JIT
In the following example, we introduce a Gpairs pairwise distance implementation that targets a CPU using the Numba JIT.

This is the decorator-based approach, in which data-parallel code sections, such as parallel-for loops and certain NumPy function calls, are offloaded. With the decorator method, a programmer only needs to identify the most time-consuming parts of the program. If those parts can be parallelized, the programmer annotates those sections using Numba-dpex and can expect them to execute on a GPU.



1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs
import numpy as np
from gaussian_weighted_pair_counts import count_weighted_pairs_3d_cpu

def run_gpairs(x1, y1, z1, w1, x2, y2, z2, w2, d_rbins_squared):
    x1 = x1.astype(np.float32)
    y1 = y1.astype(np.float32)
    z1 = z1.astype(np.float32)
    w1 = w1.astype(np.float32)
    x2 = x2.astype(np.float32)
    y2 = y2.astype(np.float32)
    z2 = z2.astype(np.float32)
    w2 = w2.astype(np.float32)

    result = np.zeros_like(d_rbins_squared)[:-1]
    result = result.astype(np.float32)
    results_test = np.zeros_like(result).astype(np.float64)
    count_weighted_pairs_3d_cpu(
        x1, y1, z1, w1, x2, y2, z2, w2, d_rbins_squared.astype(np.float32), results_test)

base_gpairs.run("Gpairs Numba",run_gpairs) 

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit.sh; else ./run_gpairs_jit.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Implementation of Gpairs targeting GPU using Kernel

## Writing Explicit Kernels in numba-dpex

Writing a SYCL kernel using the `@numba_dpex.kernel` decorator has similar syntax to writing OpenCL kernels. As such, the numba-dpex module provides indexing and other functions similar to those of OpenCL. The indexing functions supported inside a `numba_dpex.kernel` are:

* `numba_dpex.get_local_id`: Gets the local ID of the work-item
* `numba_dpex.get_local_size`: Gets the local work-group size of the device
* `numba_dpex.get_group_id`: Gets the group ID of the work-item
* `numba_dpex.get_num_groups`: Gets the number of work-groups

Refer to https://intelpython.github.io/numba-dpex/latest/user_guides/kernel_programming_guide/index.html for more details.
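
The relationship between these indices can be checked on the host without a device. The sketch below is plain Python, not numba-dpex code. It enumerates the work-items of a one-dimensional launch and assumes the usual SYCL/OpenCL convention (with zero offset) that the global ID equals group ID times local size plus local ID. The helper name `enumerate_work_items` is hypothetical.

```python
def enumerate_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for a 1-D launch.

    Host-side emulation only. An ND-range launch requires the global
    size to be a multiple of the local size.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    items = []
    for group_id in range(num_groups):
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return items


for global_id, group_id, local_id in enumerate_work_items(8, 4):
    print(f"global={global_id} group={group_id} local={local_id}")
```

With a global size of 8 and a local size of 4 there are two work-groups, and the work-item with global ID 5 is the second item (local ID 1) of the second group (group ID 1).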

In the following example we use the explicit kernel programming approach with dpex kernels. If the programmer wants to extract further performance from the offloaded code, this approach makes it possible to tune the GPU parameters and to take advantage of the work-groups and work-items of a device.
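
As a rough illustration of what tuning work-groups and work-items involves, the plain-Python sketch below (not one of the lab files) computes a launch size and emulates a grid-stride loop on the host. The names `n_wi` and `lws` mirror variables in the host code later in this notebook. Their interpretation here is an assumption: `n_wi` is taken as the number of points each work-item handles and `lws` as the local work-group size. The helpers `launch_sizes` and `grid_stride_indices` are hypothetical.

```python
def ceiling_quotient(n, m):
    """Integer division rounded up."""
    return (n + m - 1) // m


def launch_sizes(n, n_wi, lws):
    """Return (global_size, local_size) for a 1-D launch over n points.

    Each work-group of lws work-items covers up to n_wi * lws points,
    so the number of groups is rounded up to cover all n points.
    """
    n_groups = ceiling_quotient(n, n_wi * lws)
    return n_groups * lws, lws


def grid_stride_indices(global_id, global_size, n):
    """Indices visited by one work-item in a grid-stride loop."""
    return list(range(global_id, n, global_size))


n = 16384
global_size, local_size = launch_sizes(n, n_wi=20, lws=16)
print(global_size, local_size)  # 832 16

# Every point is visited exactly once across all work-items.
covered = sorted(
    i
    for gid in range(global_size)
    for i in grid_stride_indices(gid, global_size, n)
)
print(covered == list(range(n)))  # True
```

Launching fewer work-items than points and letting each one stride through the data is one way to trade parallelism for more work per item. Which setting is fastest depends on the device and has to be measured.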


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_naive
import numpy as np
import gaussian_weighted_pair_counts_gpu as gwpc
import numba_dpex
import dpctl


def run_gpairs(
    d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    blocks = 512

    with dpctl.device_context(base_gpairs_naive.get_device_selector(is_gpu=True)):
        gwpc.count_weighted_pairs_3d_intel_ver2[
            d_x1.shape[0], numba_dpex.DEFAULT_LOCAL_SIZE
        ](d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result)


base_gpairs_naive.run("Gpairs Dpex kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit_gpu.sh; else ./run_gpairs_jit_gpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Shared Local Memory (SLM) Implementation

In a parallel algorithm, there is often a high degree of data reuse, so instead of loading values from global memory each time, we can load the values into local memory and perform the computation there. This reduces the latency of accessing the data values. The difference between this implementation and the ND-range implementation is that the ND-range version reads from global memory each time, whereas this implementation first loads the data into local memory and then computes on it.

When a work-group begins, the contents of its local memory are uninitialized, and local memory does not persist after a work-group finishes executing. Because of these properties, local memory may only be used for temporary storage while a work-group is executing. On some devices, such as many CPU devices, local memory is a software abstraction implemented with the same memory subsystem as global memory. On other devices, such as many GPU devices, there are dedicated resources for local memory, and on these devices communicating via local memory should perform better than communicating via global memory.

In SYCL's memory model, local memory is a contiguous region of memory allocated per work-group and is visible to all the work-items in that group. Local memory is device-only and cannot be accessed from the host. From the perspective of the device, the local memory is exposed as a contiguous array of a specific type. The maximum available local memory is hardware-specific. The SYCL local memory concept is analogous to CUDA's shared memory concept.

Numba-dpex provides a special function, `numba_dpex.local.array`, to allocate local memory for a kernel. To simplify kernel development and accelerate communication between work-items in a work-group, SYCL defines a special local memory space specifically for communication between work-items in a work-group.

Local Address Space refers to memory objects that need to be allocated in the local memory pool and are shared by all work-items of a work-group. Numba-dpex does not support passing arguments that are allocated in the local address space to `@numba_dpex.kernel`. Users are allowed to allocate static arrays in the local address space inside the `@numba_dpex.kernel`. In the example below, `numba_dpex.local.array(shape, dtype)` is the API used to allocate a static array in the local address space. These arrays hold intermediate results, so that repeated accesses during the computation do not go to global memory.

Also notice that we use a barrier to synchronize all of the work-items in the work-group. The performance is much better than the initial ND-range samples and slightly better than the ND-range sample utilizing local memory.
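
The benefit of staging data in local memory can be illustrated with a host-side emulation. The sketch below is plain Python, not numba-dpex code, and deliberately simplified: points are one-dimensional, all weights are 1, and "global memory reads" are only counted, not timed. Each emulated work-group copies a tile of `x2` into a Python list standing in for a local array, and comments mark where a device kernel would need barriers. The function names are hypothetical.

```python
def count_close_pairs_naive(x1, x2, r):
    """Every work-item reads every element of x2 from global memory."""
    count = 0
    global_reads = 0
    for p in x1:
        global_reads += 1  # read own point
        for q in x2:
            global_reads += 1  # read x2[j] from global memory
            if abs(p - q) <= r:
                count += 1
    return count, global_reads


def count_close_pairs_tiled(x1, x2, r, local_size):
    """Each work-group stages tiles of x2 in (emulated) local memory."""
    count = 0
    global_reads = 0
    for group_start in range(0, len(x1), local_size):  # one work-group
        group = range(group_start, min(group_start + local_size, len(x1)))
        own_point = {i: x1[i] for i in group}
        global_reads += len(group)
        for tile_start in range(0, len(x2), local_size):
            # Cooperative load: each work-item copies one element of x2.
            tile = x2[tile_start:tile_start + local_size]
            global_reads += len(tile)
            # A device kernel needs a barrier here so the whole tile is
            # loaded before any work-item reads it.
            for i in group:
                for q in tile:  # served from local memory
                    if abs(own_point[i] - q) <= r:
                        count += 1
            # A second barrier is needed before the tile is overwritten.
    return count, global_reads


x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [0.5, 1.5, 2.5, 3.5]
print(count_close_pairs_naive(x1, x2, 0.6))     # (7, 20)
print(count_close_pairs_tiled(x1, x2, 0.6, 2))  # (7, 12)
```

Both versions find the same seven pairs, but the tiled version touches global memory fewer times because each tile element is loaded once per work-group instead of once per work-item.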

<img src="Assets/localmem.png">


### Private Address Space
Private Address Space refers to memory objects that are local to each work-item and are not shared with any other work-item. In the example below, `numba_dpex.private.array(shape, dtype)` is the API used to allocate a static array in the private address space:

<img src="Assets/workgroup.png">


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu_private_memory.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_gpu
import numpy as np
import gwpc_private as gwpc
import dpctl
import dpctl.tensor as dpt
from device_selector import get_device_selector
from numba_dppy import kernel, atomic, DEFAULT_LOCAL_SIZE
import numba_dppy

atomic_add = atomic.add


@kernel
def count_weighted_pairs_3d_intel(
    x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result
):
    """Naively count Npairs(<r), the total number of pairs that are separated
    by a distance less than r, for each r**2 in the input rbins_squared.
    """

    start = numba_dppy.get_global_id(0)
    stride = numba_dppy.get_global_size(0)

    n1 = x1.shape[0]
    n2 = x2.shape[0]
    nbins = rbins_squared.shape[0]

    for i in range(start, n1, stride):
        px = x1[i]
        py = y1[i]
        pz = z1[i]
        pw = w1[i]
        for j in range(n2):
            qx = x2[j]
            qy = y2[j]
            qz = z2[j]
            qw = w2[j]
            dx = px - qx
            dy = py - qy
            dz = pz - qz
            wprod = pw * qw
            dsq = dx * dx + dy * dy + dz * dz

            k = nbins - 1
            while dsq <= rbins_squared[k]:
                atomic_add(result, k - 1, wprod)
                k = k - 1
                if k <= 0:
                    break


@kernel
def count_weighted_pairs_3d_intel_ver2(
    x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result
):
    """Naively count Npairs(<r), the total number of pairs that are separated
    by a distance less than r, for each r**2 in the input rbins_squared.
    """

    i = numba_dppy.get_global_id(0)
    nbins = rbins_squared.shape[0]
    n2 = x2.shape[0]

    px = x1[i]
    py = y1[i]
    pz = z1[i]
    pw = w1[i]
    for j in range(n2):
        qx = x2[j]
        qy = y2[j]
        qz = z2[j]
        qw = w2[j]
        dx = px - qx
        dy = py - qy
        dz = pz - qz
        wprod = pw * qw
        dsq = dx * dx + dy * dy + dz * dz

        k = nbins - 1
        while dsq <= rbins_squared[k]:
            # Several work-items may update the same bin concurrently, so the
            # accumulation uses an atomic add.
            atomic_add(result, k - 1, wprod)
            k = k - 1
            if k <= 0:
                break


def ceiling_quotient(n, m):
    # Integer division rounded up, without going through floating point.
    return int((n + m - 1) // m)


def count_weighted_pairs_3d_intel_no_slm(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    n_wi = 20
    private_hist_size = 16
    lws0 = 16
    lws1 = 16

    m0 = n_wi * lws0
    m1 = n_wi * lws1

    n_groups0 = ceiling_quotient(n, m0)
    n_groups1 = ceiling_quotient(n, m1)

    gwsRange = n_groups0 * lws0, n_groups1 * lws1
    lwsRange = lws0, lws1

    slm_hist_size = ceiling_quotient(nbins, private_hist_size) * private_hist_size

    with dpctl.device_context(base_gpairs_gpu.get_device_selector(is_gpu=True)):
        gwpc.count_weighted_pairs_3d_intel_no_slm_ker[gwsRange, lwsRange](
            n,
            nbins,
            slm_hist_size,
            private_hist_size,
            d_x1,
            d_y1,
            d_z1,
            d_w1,
            d_x2,
            d_y2,
            d_z2,
            d_w2,
            d_rbins_squared,
            d_result,
        )


def count_weighted_pairs_3d_intel_orig(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):

    # create tmp result on device
    result_tmp = np.zeros(nbins, dtype=np.float32)
    d_result_tmp = dpt.usm_ndarray(
        result_tmp.shape, dtype=result_tmp.dtype, buffer="device"
    )
    d_result_tmp.usm_data.copy_from_host(result_tmp.reshape((-1)).view("|u1"))

    with dpctl.device_context(base_gpairs_gpu.get_device_selector()):
        gwpc.count_weighted_pairs_3d_intel_orig_ker[n,](
            n,
            nbins,
            d_x1,
            d_y1,
            d_z1,
            d_w1,
            d_x2,
            d_y2,
            d_z2,
            d_w2,
            d_rbins_squared,
            d_result_tmp,
        )
        gwpc.count_weighted_pairs_3d_intel_agg_ker[
            nbins,
        ](d_result, d_result_tmp)


def run_gpairs(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    count_weighted_pairs_3d_intel_no_slm(
        n,
        nbins,
        d_x1,
        d_y1,
        d_z1,
        d_w1,
        d_x2,
        d_y2,
        d_z2,
        d_w2,
        d_rbins_squared,
        d_result,
    )


base_gpairs_gpu.run("Gpairs Dpex kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_private_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_private_gpu.sh; else ./run_gpairs_private_gpu.sh; fi
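
To illustrate why a private array helps, the following plain-Python sketch emulates work-items that accumulate into their own private histogram and touch the shared result only once per bin at the end. It is not the kernel from `gwpc_private`, which is not shown in this notebook. Points are one-dimensional, all weights are 1, and the function names are hypothetical. A Python list stands in for an array allocated with `numba_dpex.private.array`.

```python
def bin_pair(hist, dsq, wprod, rbins_squared):
    """Cumulative binning: hist[k - 1] collects pairs with
    dsq <= rbins_squared[k] (rbins_squared is assumed ascending)."""
    k = len(rbins_squared) - 1
    while dsq <= rbins_squared[k]:
        hist[k - 1] += wprod
        k -= 1
        if k <= 0:
            break


def count_with_private_histograms(x1, x2, rbins_squared):
    """Emulate one work-item per point of x1, each with a private histogram."""
    nbins = len(rbins_squared)
    result = [0.0] * (nbins - 1)
    shared_updates = 0
    for px in x1:  # one emulated work-item
        private_hist = [0.0] * (nbins - 1)  # stands in for a private array
        for qx in x2:
            bin_pair(private_hist, (px - qx) ** 2, 1.0, rbins_squared)
        # Merge once per bin; on a device this would be the only place
        # that needs an atomic add on the shared result.
        for b, value in enumerate(private_hist):
            if value != 0.0:
                result[b] += value
                shared_updates += 1
    return result, shared_updates


result, shared_updates = count_with_private_histograms(
    [0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
)
print(result, shared_updates)  # [5.0, 6.0] 4
```

In this small example the shared result is updated 4 times, whereas updating it directly for every qualifying pair would take 11 updates. On a device, fewer updates to shared data means fewer atomic operations competing for the same locations.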

## Private memory implementation


In [None]:
%%writefile lab/gpairs_gpu_private_memory_imp.py
import base_gpairs_gpu
import numpy as np
import gwpc_private as gwpc
import dpctl
import dpctl.tensor as dpt
from device_selector import get_device_selector
from numba_dpex import kernel, atomic, DEFAULT_LOCAL_SIZE
import numba_dpex

atomic_add = atomic.add


@kernel
def count_weighted_pairs_3d_intel(
    x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result
):
    """Naively count Npairs(<r), the total number of pairs that are separated
    by a distance less than r, for each r**2 in the input rbins_squared.
    """

    start = numba_dpex.get_global_id(0)
    stride = numba_dpex.get_global_size(0)

    n1 = x1.shape[0]
    n2 = x2.shape[0]
    nbins = rbins_squared.shape[0]

    for i in range(start, n1, stride):
        px = x1[i]
        py = y1[i]
        pz = z1[i]
        pw = w1[i]
        for j in range(n2):
            qx = x2[j]
            qy = y2[j]
            qz = z2[j]
            qw = w2[j]
            dx = px - qx
            dy = py - qy
            dz = pz - qz
            wprod = pw * qw
            dsq = dx * dx + dy * dy + dz * dz

            k = nbins - 1
            while dsq <= rbins_squared[k]:
                atomic_add(result, k - 1, wprod)
                k = k - 1
                if k <= 0:
                    break


@kernel
def count_weighted_pairs_3d_intel_ver2(
    x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result
):
    """Naively count Npairs(<r), the total number of pairs that are separated
    by a distance less than r, for each r**2 in the input rbins_squared.
    """

    i = numba_dpex.get_global_id(0)
    nbins = rbins_squared.shape[0]
    n2 = x2.shape[0]

    px = x1[i]
    py = y1[i]
    pz = z1[i]
    pw = w1[i]
    for j in range(n2):
        qx = x2[j]
        qy = y2[j]
        qz = z2[j]
        qw = w2[j]
        dx = px - qx
        dy = py - qy
        dz = pz - qz
        wprod = pw * qw
        dsq = dx * dx + dy * dy + dz * dz

        k = nbins - 1
        while dsq <= rbins_squared[k]:
            # Several work-items may update the same bin concurrently, so the
            # accumulation uses an atomic add.
            atomic_add(result, k - 1, wprod)
            k = k - 1
            if k <= 0:
                break


def ceiling_quotient(n, m):
    # Integer division rounded up, without going through floating point.
    return int((n + m - 1) // m)


def count_weighted_pairs_3d_intel_no_slm(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    n_wi = 20
    private_hist_size = 16
    lws0 = 16
    lws1 = 16

    m0 = n_wi * lws0
    m1 = n_wi * lws1

    n_groups0 = ceiling_quotient(n, m0)
    n_groups1 = ceiling_quotient(n, m1)

    gwsRange = n_groups0 * lws0, n_groups1 * lws1
    lwsRange = lws0, lws1

    slm_hist_size = ceiling_quotient(nbins, private_hist_size) * private_hist_size

    with dpctl.device_context(base_gpairs_gpu.get_device_selector(is_gpu=True)):
        gwpc.count_weighted_pairs_3d_intel_no_slm_ker[gwsRange, lwsRange](
            n,
            nbins,
            slm_hist_size,
            private_hist_size,
            d_x1,
            d_y1,
            d_z1,
            d_w1,
            d_x2,
            d_y2,
            d_z2,
            d_w2,
            d_rbins_squared,
            d_result,
        )


def count_weighted_pairs_3d_intel_orig(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):

    # create tmp result on device
    result_tmp = np.zeros(nbins, dtype=np.float32)
    d_result_tmp = dpt.usm_ndarray(
        result_tmp.shape, dtype=result_tmp.dtype, buffer="device"
    )
    d_result_tmp.usm_data.copy_from_host(result_tmp.reshape((-1)).view("|u1"))

    with dpctl.device_context(base_gpairs_gpu.get_device_selector()):
        gwpc.count_weighted_pairs_3d_intel_orig_ker[n,](
            n,
            nbins,
            d_x1,
            d_y1,
            d_z1,
            d_w1,
            d_x2,
            d_y2,
            d_z2,
            d_w2,
            d_rbins_squared,
            d_result_tmp,
        )
        gwpc.count_weighted_pairs_3d_intel_agg_ker[
            nbins,
        ](d_result, d_result_tmp)


def run_gpairs(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    count_weighted_pairs_3d_intel_no_slm(
        n,
        nbins,
        d_x1,
        d_y1,
        d_z1,
        d_w1,
        d_x2,
        d_y2,
        d_z2,
        d_w2,
        d_rbins_squared,
        d_result,
    )


base_gpairs_gpu.run("Gpairs Dpex kernel", run_gpairs)


### Private Memory differentials
In the sample below we do not use atomics to calculate the values. Instead, we sum the private-memory results within each work-group, and then aggregate the first items of each work-group.
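
The two-stage summation can be sketched on the host in plain Python. This is an emulation under assumptions, not the kernels from `gwpc_private_diff`, which are not shown in this notebook. It assumes each work-item has already written its private histogram to its own row of a scratch buffer; the function name `two_stage_sum` and that layout are hypothetical.

```python
def two_stage_sum(per_item_hists, local_size):
    """Sum per-work-item histograms without atomics.

    Stage 1: within each work-group, fold every histogram into the
             group's first item.
    Stage 2: one work-item per bin sums the first items of all groups.
    """
    n_items = len(per_item_hists)
    nbins = len(per_item_hists[0])
    scratch = [list(h) for h in per_item_hists]  # copy, keep input intact

    # Stage 1: each group writes only to its own first row, so groups
    # never touch the same memory location.
    for first in range(0, n_items, local_size):
        last = min(first + local_size, n_items)
        for item in range(first + 1, last):
            for b in range(nbins):
                scratch[first][b] += scratch[item][b]

    # Stage 2: each bin is owned by exactly one work-item.
    result = [0.0] * nbins
    for b in range(nbins):
        for first in range(0, n_items, local_size):
            result[b] += scratch[first][b]
    return result


hists = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [4.0, 0.0], [1.0, 1.0]]
print(two_stage_sum(hists, local_size=2))  # [8.0, 5.0]
```

Because every memory location has a single writer in each stage, no atomic operations are required. The cost is the extra scratch storage and a second kernel launch for the aggregation.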

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu_private_memory_diff.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_diff
import numpy as np
import gwpc_private_diff as gwpc
import dpctl, dpctl.tensor as dpt


def ceiling_quotient(n, m):
    # Integer division rounded up, without going through floating point.
    return int((n + m - 1) // m)


def count_weighted_pairs_3d_intel_diff(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    with dpctl.device_context(base_gpairs_diff.get_device_selector(is_gpu=True)):
        gwpc.count_weighted_pairs_3d_intel_diff_ker[n, 64](
            n,
            nbins,
            d_x1,
            d_y1,
            d_z1,
            d_w1,
            d_x2,
            d_y2,
            d_z2,
            d_w2,
            d_rbins_squared,
            d_result,
        )
        gwpc.count_weighted_pairs_3d_intel_diff_agg_ker[
            nbins,
        ](d_result, n)


def run_gpairs(
    n, nbins, d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    count_weighted_pairs_3d_intel_diff(
        n,
        nbins,
        d_x1,
        d_y1,
        d_z1,
        d_w1,
        d_x2,
        d_y2,
        d_z2,
        d_w2,
        d_rbins_squared,
        d_result,
    )


base_gpairs_diff.run("Gpairs Dpex kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_private_gpu_diff.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_private_gpu_diff.sh; else ./run_gpairs_private_gpu_diff.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Plot the results for Gpairs on GPU

The algorithm naively counts `Npairs(<r)`, the total number of pairs that are separated by a distance less than `r`, for each `r**2` in the input `rbins_squared`.

In the graphs below, the first plot shows a three-dimensional view of the points, and the second plot shows a logarithmic view of the __results__, which are computed from the pairs whose distance is less than the distances defined by RBINS_SQUARED.

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpairs_gpu_graph.py

# Copyright (C) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: MIT

import base_gpairs_gpu_graph
import numpy as np
import gaussian_weighted_pair_counts_gpu as gwpc
import numba_dpex
import dpctl


def run_gpairs(
    d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result
):
    blocks = 512

    with dpctl.device_context(base_gpairs_gpu_graph.get_device_selector(is_gpu=True)):
        gwpc.count_weighted_pairs_3d_intel_ver2[
            d_x1.shape[0], numba_dpex.DEFAULT_LOCAL_SIZE
        ](d_x1, d_y1, d_z1, d_w1, d_x2, d_y2, d_z2, d_w2, d_rbins_squared, d_result)


base_gpairs_gpu_graph.run("Gpairs Dpex kernel", run_gpairs)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpairs_jit_gpu_graph.sh; if [ -x "$(command -v qsub)" ]; then ./q run_gpairs_jit_gpu_graph.sh; else ./run_gpairs_jit_gpu_graph.sh; fi

### View the results
Select the cell below and click run ▶ to view the graph:

In [None]:
def read_dictionary(fn):
    import joblib
    # Load data (deserialize)
    with open(fn, 'rb') as handle:
        dictionary = joblib.load(handle)
    return dictionary
resultsDict = read_dictionary('resultsDict.dat')
limit = 10
#D = resultsDict['D'][:limit,:limit]
X1 = resultsDict['X1'][:limit]
Y1 = resultsDict['Y1'][:limit]
Z1 = resultsDict['Z1'][:limit]
X2 = resultsDict['X2'][:limit]
Y2 = resultsDict['Y2'][:limit]
Z2 = resultsDict['Z2'][:limit]
result = resultsDict['result']
RBINS_SQAURED = resultsDict['DEFAULT_RBINS_SQUARED']
#print(result)
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np 
Radius = .92
index = np.where(result < Radius)
plt.style.use('dark_background')
#plt.gcf().set_size_inches((12, 5))
# plt.grid()
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X1, Y1, Z1, c='blue', s = 40, alpha = .8)
ax.scatter(X2, Y2, Z2, c='y', s = 40, alpha = .8)
plt.show()

In [None]:
plt.figure(figsize=(8,8))
plt.yscale("log")
plt.ylabel("magnitude of results")
plt.xlabel("index of results")
plt.xticks(np.arange(0, 20, 1.0))
nonzero = 1e-4
plt.grid()
plt.plot(result + nonzero,c = 'y');
plt.plot(RBINS_SQAURED + nonzero,c = 'r');

## Generating the Vtune reports
In the exercises below we use the VTune™ analyzer to see what is going on with each implementation. The information shown is the high-level hotspot summary generated from the collection and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered as HTML pages. The VTune scripts below collect GPU offload and GPU hotspots information.

#### Learn more about VTune

Extensive training on VTune is available; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.

```
vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir python lab/gpairs_gpu_private_memory.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json
```

```
vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_hotspots_dir_new python lab/gpairs_gpu_private_memory.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json
```

```
vtune -report summary -result-dir vtune_dir -format html -report-output output.html
```

```
vtune -report summary -result-dir vtune_hotspots_dir_new -format html -report-output output_hotspots.html
```

In [None]:
run lab/mm_basic_vtune.py

## Summary
In this module you learned the following:
* Numba implementation of Gpairs targeting a CPU and GPU using Numba JIT
* Numba-dpex implementation of Gpairs on a GPU using the kernel approach