# K-means Algorithm Using Numba-dpex

## Sections
- [KMeans Algorithm](#KMeans-Algorithm)
- _Code:_ [Implementation of Kmeans targeting CPU using Numba JIT](#Implementation-of-Kmeans-targeting-CPU-using-Numba-JIT)
- _Code:_ [Implementation of Kmeans targeting GPU using Numba JIT](#Implementation-of-K-Means-targeting-GPU-using-Numba-JIT)
- _Code:_ [Implementation of Kmeans targeting GPU using Kernels](#Implementation-of-Kmeans-targeting-GPU-using-Kernels)
- _Code:_ [Implementation of Kmeans targeting GPU using atomics](#Implementation-of-Kmeans-targeting-GPU-using-atomics)


## Learning Objectives

* Build a Numba implementation of K-means targeting the CPU and GPU using Numba JIT
* Build a Numba-dpex implementation of K-means on the CPU and GPU using the kernel approach
* Build a Numba-dpex implementation of K-means on the GPU using atomics


## numba-dpex

Numba-dpex is a standalone extension to the Numba JIT compiler that adds SYCL programming capabilities to Numba. Numba-dpex is packaged as part of the Intel Distribution for Python (IDP) that comes with the oneAPI Base Toolkit, so you don’t need to install any additional Conda packages. SYCL support is provided through the SYCL runtime; other SYCL compilers are not supported by Numba-dpex.



## Command Line parameters

| Parameter | Default Value | Description |
|:---|:---|:---|
| --steps | 10 | Number of workload runs |
| --step | 2  | Data growth factor on each iteration |
| --size | 2 ** 28 | Initial data size |
| --repeat | 1 | Iterations inside measured region |
| --json | False | Output json data filename |
| -d | 1 | Data Dimension |
| --usm | False | Use USM Shared |
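
The driver modules that parse these options (`base_kmeans`, `base_kmeans_gpu`) are not shown in this notebook. As a rough illustration only, the sketch below assumes the options are parsed with `argparse` and that the data size is multiplied by `--step` after each of the `--steps` runs. The names `build_parser` and `problem_sizes` are hypothetical, and the real driver may behave differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the table above; the real driver may differ.
    parser = argparse.ArgumentParser(description="K-means workload options (sketch)")
    parser.add_argument("--steps", type=int, default=10, help="Number of workload runs")
    parser.add_argument("--step", type=int, default=2, help="Data growth factor on each iteration")
    parser.add_argument("--size", type=int, default=2 ** 28, help="Initial data size")
    parser.add_argument("--repeat", type=int, default=1, help="Iterations inside measured region")
    parser.add_argument("--json", default=False, help="Output json data filename")
    parser.add_argument("-d", type=int, default=1, help="Data dimension")
    parser.add_argument("--usm", action="store_true", help="Use USM shared memory")
    return parser


def problem_sizes(args):
    # Assumed growth rule: size, size * step, size * step**2, ...
    return [args.size * args.step ** i for i in range(args.steps)]


args = build_parser().parse_args(["--steps", "3", "--size", "16384"])
print(problem_sizes(args))  # [16384, 32768, 65536]
```

Under this assumed growth rule, the default values would make the last of the ten runs 512 times larger than the first, which is why the profiling commands later in this notebook pass a much smaller `--size` and `--steps 1`.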

# K-Means Algorithm
Kmeans is a clustering algorithm that partitions observations from a dataset into a requested number of geometric clusters of points closest to the cluster’s own center of mass. Using an initial estimate of the centroids, the algorithm iteratively updates the positions of the centroids until they reach a fixed point.


Kmeans is a simple and powerful ML algorithm to cluster data into similar groups. Its objective is to split a set of N observations into K clusters. This is achieved by minimizing inertia (i.e., the sum of squared Euclidean distances from observations to the cluster centers, or centroids). The algorithm is iterative, with two steps in each iteration:
* For each observation, compute the distance from it to each centroid, and then reassign each observation to the cluster with the nearest centroid.
* For each cluster, compute the centroid as the mean of observations assigned to this cluster.

Repeat these steps until the number of iterations exceeds a predefined maximum or the algorithm converges (i.e., the difference between two consecutive inertias is less than a predefined threshold).
Different methods are used to get initial centroids for the first iteration. The algorithm can select random observations as initial centroids or use more complex methods such as kmeans++.
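
The two steps and the stopping rule described above can be sketched in plain NumPy. This is a minimal reference version for illustration, not the lab implementation: the function name `kmeans_reference`, the tolerance `tol`, and the synthetic two-blob data are assumptions made for this example. Like the lab code, it takes the first `k` observations as initial centroids.

```python
import numpy as np


def kmeans_reference(points, k, max_iter=30, tol=1e-6):
    # Initial centroids: the first k observations.
    centroids = points[:k].astype(np.float64).copy()
    prev_inertia = np.inf
    for _ in range(max_iter):
        # Step 1: squared distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        diff = points[:, None, :] - centroids[None, :, :]
        sq_dist = (diff ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        inertia = sq_dist[np.arange(len(points)), labels].sum()

        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)

        # Stop when two consecutive inertias differ by less than tol.
        if abs(prev_inertia - inertia) < tol:
            break
        prev_inertia = inertia
    return centroids, labels, inertia


rng = np.random.default_rng(0)
# Two well-separated blobs; the first two rows come from different blobs.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
points = np.vstack([blob_a[:1], blob_b[:1], blob_a[1:], blob_b[1:]])
centroids, labels, inertia = kmeans_reference(points, k=2)
print(centroids)
```

The empty-cluster guard in step 2 is a design choice of this sketch; without it, a cluster that receives no points would produce a division by zero when its mean is computed.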

# Implementation of Kmeans targeting CPU using Numba JIT
In the following example, we introduce a naive K-means implementation that targets a CPU using the Numba JIT.

This is the decorator-based approach, in which we offload data-parallel code sections such as parallel-for loops and certain NumPy function calls. With the decorator method, a programmer simply needs to identify the most time-consuming parts of the program. If those parts can be parallelized, the programmer annotates them with the Numba JIT decorator. In this section the annotated code executes on the CPU; with Numba-dpex, the same annotated sections can be expected to execute on a GPU.




1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/kmeans.py
import base_kmeans
import numpy
import numba

REPEAT = 1

ITERATIONS = 30


@numba.jit(nopython=True, parallel=True, fastmath=True)
def groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids):
    for i0 in numba.prange(num_points):
        minor_distance = -1
        for i1 in range(num_centroids):
            dx = arrayP[i0, 0] - arrayC[i1, 0]
            dy = arrayP[i0, 1] - arrayC[i1, 1]
            my_distance = numpy.sqrt(dx * dx + dy * dy)
            if minor_distance > my_distance or minor_distance == -1:
                minor_distance = my_distance
                arrayPcluster[i0] = i1
    return arrayPcluster


@numba.jit(nopython=True, parallel=True, fastmath=True)
def calCentroidsSum(
    arrayP, arrayPcluster, arrayCsum, arrayCnumpoint, num_points, num_centroids
):
    for i in numba.prange(num_centroids):
        arrayCsum[i, 0] = 0
        arrayCsum[i, 1] = 0
        arrayCnumpoint[i] = 0

    for i in range(num_points):
        ci = arrayPcluster[i]
        arrayCsum[ci, 0] += arrayP[i, 0]
        arrayCsum[ci, 1] += arrayP[i, 1]
        arrayCnumpoint[ci] += 1

    return arrayCsum, arrayCnumpoint


@numba.jit(nopython=True, parallel=True, fastmath=True)
def updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids):
    for i in numba.prange(num_centroids):
        arrayC[i, 0] = arrayCsum[i, 0] / arrayCnumpoint[i]
        arrayC[i, 1] = arrayCsum[i, 1] / arrayCnumpoint[i]


def kmeans(
    arrayP, arrayPcluster, arrayC, arrayCsum, arrayCnumpoint, num_points, num_centroids
):

    for i in range(ITERATIONS):
        groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids)

        calCentroidsSum(
            arrayP, arrayPcluster, arrayCsum, arrayCnumpoint, num_points, num_centroids
        )

        updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids)

    return arrayC, arrayCsum, arrayCnumpoint


def printCentroid(arrayC, arrayCsum, arrayCnumpoint):
    for i in range(arrayC.shape[0]):  # one row per centroid; avoids an undefined global
        print(
            "[x={:6f}, y={:6f}, x_sum={:6f}, y_sum={:6f}, num_points={:d}]".format(
                arrayC[i, 0],
                arrayC[i, 1],
                arrayCsum[i, 0],
                arrayCsum[i, 1],
                arrayCnumpoint[i],
            )
        )

    print("--------------------------------------------------")


def run_kmeans(
    arrayP,
    arrayPclusters,
    arrayC,
    arrayCsum,
    arrayCnumpoint,
    NUMBER_OF_POINTS,
    NUMBER_OF_CENTROIDS,
):

    for i in range(REPEAT):
        for i1 in range(NUMBER_OF_CENTROIDS):
            arrayC[i1, 0] = arrayP[i1, 0]
            arrayC[i1, 1] = arrayP[i1, 1]

        arrayC, arrayCsum, arrayCnumpoint = kmeans(
            arrayP,
            arrayPclusters,
            arrayC,
            arrayCsum,
            arrayCnumpoint,
            NUMBER_OF_POINTS,
            NUMBER_OF_CENTROIDS,
        )

    #     if i + 1 == REPEAT:
    #         printCentroid(arrayC, arrayCsum, arrayCnumpoint)

    # print("Iterations: {:d}".format(ITERATIONS))
    # print("Average Time: {:.4f} ms".format(total))


base_kmeans.run("Kmeans Numba", run_kmeans)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_kmeans_cpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_kmeans_cpu.sh; else ./run_kmeans_cpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Implementation of K-Means targeting GPU using Numba JIT

In the following example, we introduce a naive K-Means implementation that targets a GPU using the Numba JIT. The JIT-compiled functions are the same as in the CPU version; the call to `kmeans` is wrapped in a `dpctl.device_context` so that the parallel regions are offloaded to the GPU.


1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/kmeans_gpu.py

import dpctl
import numpy
import base_kmeans_gpu
import numba

REPEAT = 1
# defines total number of iterations for kmeans accuracy
ITERATIONS = 30

__njit = numba.jit(nopython=True, parallel=True, fastmath=True)

# assign each point to its nearest centroid using the Euclidean distance
@__njit
def groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids):
    # parallel for loop
    for i0 in numba.prange(num_points):
        minor_distance = -1
        for i1 in range(num_centroids):
            dx = arrayP[i0, 0] - arrayC[i1, 0]
            dy = arrayP[i0, 1] - arrayC[i1, 1]
            my_distance = numpy.sqrt(dx * dx + dy * dy)
            if minor_distance > my_distance or minor_distance == -1:
                minor_distance = my_distance
                arrayPcluster[i0] = i1
    return arrayPcluster


# accumulate the coordinate sums and the point count of each cluster
@__njit
def calCentroidsSum(
    arrayP, arrayPcluster, arrayCsum, arrayCnumpoint, num_points, num_centroids
):
    # parallel for loop
    for i in numba.prange(num_centroids):
        arrayCsum[i, 0] = 0
        arrayCsum[i, 1] = 0
        arrayCnumpoint[i] = 0

    for i in range(num_points):
        ci = arrayPcluster[i]
        arrayCsum[ci, 0] += arrayP[i, 0]
        arrayCsum[ci, 1] += arrayP[i, 1]
        arrayCnumpoint[ci] += 1

    return arrayCsum, arrayCnumpoint


# update the centroids array: each centroid becomes the mean of its points
@__njit
def updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids):
    for i in numba.prange(num_centroids):
        arrayC[i, 0] = arrayCsum[i, 0] / arrayCnumpoint[i]
        arrayC[i, 1] = arrayCsum[i, 1] / arrayCnumpoint[i]


def kmeans(
    arrayP, arrayPcluster, arrayC, arrayCsum, arrayCnumpoint, num_points, num_centroids
):

    for i in range(ITERATIONS):
        groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids)

        calCentroidsSum(
            arrayP, arrayPcluster, arrayCsum, arrayCnumpoint, num_points, num_centroids
        )

        updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids)

    return arrayC, arrayCsum, arrayCnumpoint


def run_kmeans(
    arrayP,
    arrayPclusters,
    arrayC,
    arrayCsum,
    arrayCnumpoint,
    NUMBER_OF_POINTS,
    NUMBER_OF_CENTROIDS,
):

    with dpctl.device_context(base_kmeans_gpu.get_device_selector(is_gpu=True)):
        for i in range(REPEAT):
            for i1 in range(NUMBER_OF_CENTROIDS):
                arrayC[i1, 0] = arrayP[i1, 0]
                arrayC[i1, 1] = arrayP[i1, 1]

            arrayC, arrayCsum, arrayCnumpoint = kmeans(
                arrayP,
                arrayPclusters,
                arrayC,
                arrayCsum,
                arrayCnumpoint,
                NUMBER_OF_POINTS,
                NUMBER_OF_CENTROIDS,
            )


base_kmeans_gpu.run("Kmeans Numba", run_kmeans)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_kmeans_gpu.sh; if [ -x "$(command -v qsub)" ]; then ./q run_kmeans_gpu.sh; else ./run_kmeans_gpu.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

## Implementation of Kmeans targeting GPU using Kernels

## Writing Explicit Kernels in numba-dpex

Writing a SYCL kernel using the `@numba_dpex.kernel` decorator has similar syntax to writing OpenCL kernels. As such, the numba-dpex module provides similar indexing and other functions as OpenCL. The indexing functions supported inside a `numba_dpex.kernel` are:

* numba_dpex.get_global_id : Gets the global ID of the item (used by the kernels in this notebook)
* numba_dpex.get_local_id : Gets the local ID of the item
* numba_dpex.get_local_size: Gets the local work group size of the device
* numba_dpex.get_group_id : Gets the group ID of the item
* numba_dpex.get_num_groups: Gets the number of work groups

Refer to https://intelpython.github.io/numba-dpex/latest/user_guides/kernel_programming_guide/index.html for more details.

In the following example, we use the explicit kernel programming approach with dpex kernels. A programmer who wants to extract further performance from the offloaded code can write the kernels explicitly and tune the GPU parameters, taking advantage of the work groups and work items of the device.
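
To make the relationship between the indexing functions concrete, the sketch below emulates a one-dimensional kernel launch on the host in plain Python. It does not call numba-dpex; the function name `enumerate_work_items` is hypothetical, and the launch is assumed to have no global offset and a global size that is a multiple of the local size.

```python
def enumerate_work_items(global_size, local_size):
    # Emulates a 1-D launch on the host: no device or numba-dpex needed.
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    num_groups = global_size // local_size
    items = []
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # Global ID of a work-item in a 1-D range with zero offset.
            global_id = group_id * local_size + local_id
            items.append((global_id, group_id, local_id))
    return num_groups, items


num_groups, items = enumerate_work_items(global_size=8, local_size=4)
for global_id, group_id, local_id in items:
    print(f"global={global_id} group={group_id} local={local_id}")
```

Each work-item sees exactly one `(global, group, local)` triple, which is why a kernel can use its global ID directly as the array index it is responsible for.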

## Implementation of Kmeans targeting GPU using atomics
Atomics allow multiple work-items to communicate safely through memory. SYCL atomics are similar to C++ atomics: an access to a resource protected by an atomic is guaranteed to execute as a single, indivisible unit.

Numba-dpex supports some of the atomic operations supported in SYCL. Those that are presently implemented are as follows:
* add(ary, idx, val): Perform atomic ary[idx] += val. Returns the old value at the index location as if it is loaded atomically.

* sub(ary, idx, val): Perform atomic ary[idx] -= val. Returns the old value at the index location as if it is loaded atomically.
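
Why the accumulation needs to be atomic can be seen with a host-side NumPy analogy. This is not the numba-dpex API: the buffered `+=` below stands in for unsynchronized work-items overwriting each other, and `np.add.at` stands in for an atomic add. The array names are made up for this example.

```python
import numpy as np

# Each "work-item" i contributes value[i] to the accumulator slot cluster[i].
cluster = np.array([0, 1, 0, 0, 1])
value = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Buffered update: a repeated index is written only once, so updates are lost.
lost = np.zeros(2)
lost[cluster] += value

# Unbuffered update: every contribution is accumulated, as an atomic add would.
safe = np.zeros(2)
np.add.at(safe, cluster, value)

print(lost)
print(safe)  # [8. 7.]
```

The per-cluster sums in K-means have the same shape: many points update the same few centroid slots, so each update must be applied in full.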

The following code implements a reduction in which every work-item updates a global accumulator atomically, using the atomic add operation in numba-dpex:

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/kmeans_kernel_atomic.py
import dpctl
import base_kmeans_gpu
import numpy
import numba_dpex as nb
from numba_dpex import atomic

REPEAT = 1
ITERATIONS = 30

atomic_add = atomic.add


@nb.kernel
def groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids):
    idx = nb.get_global_id(0)
    # No bounds check on idx: the kernel is launched with exactly num_points work-items.
    minor_distance = -1
    for i in range(num_centroids):
        dx = arrayP[idx, 0] - arrayC[i, 0]
        dy = arrayP[idx, 1] - arrayC[i, 1]
        my_distance = numpy.sqrt(dx * dx + dy * dy)
        if minor_distance > my_distance or minor_distance == -1:
            minor_distance = my_distance
            arrayPcluster[idx] = i


@nb.kernel
def calCentroidsSum1(arrayCsum, arrayCnumpoint):
    i = nb.get_global_id(0)
    arrayCsum[i, 0] = 0
    arrayCsum[i, 1] = 0
    arrayCnumpoint[i] = 0


@nb.kernel
def calCentroidsSum2(arrayP, arrayPcluster, arrayCsum, arrayCnumpoint):
    i = nb.get_global_id(0)
    ci = arrayPcluster[i]
    atomic_add(arrayCsum, (ci, 0), arrayP[i, 0])
    atomic_add(arrayCsum, (ci, 1), arrayP[i, 1])
    atomic_add(arrayCnumpoint, ci, 1)


@nb.kernel
def updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids):
    i = nb.get_global_id(0)
    arrayC[i, 0] = arrayCsum[i, 0] / arrayCnumpoint[i]
    arrayC[i, 1] = arrayCsum[i, 1] / arrayCnumpoint[i]


@nb.kernel
def copy_arrayC(arrayC, arrayP):
    i = nb.get_global_id(0)
    arrayC[i, 0] = arrayP[i, 0]
    arrayC[i, 1] = arrayP[i, 1]


def kmeans(
    arrayP, arrayPcluster, arrayC, arrayCsum, arrayCnumpoint, num_points, num_centroids
):

    copy_arrayC[num_centroids,](arrayC, arrayP)

    for i in range(ITERATIONS):
        groupByCluster[num_points,](
            arrayP, arrayPcluster, arrayC, num_points, num_centroids
        )

        calCentroidsSum1[num_centroids,](
            arrayCsum, arrayCnumpoint
        )

        calCentroidsSum2[num_points,](
            arrayP, arrayPcluster, arrayCsum, arrayCnumpoint
        )

        updateCentroids[num_centroids,](
            arrayC, arrayCsum, arrayCnumpoint, num_centroids
        )

    return arrayC, arrayCsum, arrayCnumpoint


def run_kmeans(
    arrayP,
    arrayPclusters,
    arrayC,
    arrayCsum,
    arrayCnumpoint,
    NUMBER_OF_POINTS,
    NUMBER_OF_CENTROIDS,
):

    with dpctl.device_context(base_kmeans_gpu.get_device_selector(is_gpu=True)):
        for i in range(REPEAT):
            arrayC, arrayCsum, arrayCnumpoint = kmeans(
                arrayP,
                arrayPclusters,
                arrayC,
                arrayCsum,
                arrayCnumpoint,
                NUMBER_OF_POINTS,
                NUMBER_OF_CENTROIDS,
            )


base_kmeans_gpu.run("Kmeans Numba", run_kmeans)

### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_kmeans_atomic.sh; if [ -x "$(command -v qsub)" ]; then ./q run_kmeans_atomic.sh; else ./run_kmeans_atomic.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

# Plot GPU Results

The sample below runs the Kmeans algorithm on the GPU using numba-dpex and plots the first 10 centroids together with the cluster of points each centroid is associated with:

1. Inspect the code cell below and click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/kmeans_kernel_atomic_graph.py
import dpctl
import base_kmeans_gpu_graph
import numpy
import numba_dpex as nb
from numba_dpex import atomic

REPEAT = 1
ITERATIONS = 30

atomic_add = atomic.add

@nb.kernel
def groupByCluster(arrayP, arrayPcluster, arrayC, num_points, num_centroids):
    idx = nb.get_global_id(0)
    # No bounds check on idx: the kernel is launched with exactly num_points work-items.
    minor_distance = -1
    for i in range(num_centroids):
        dx = arrayP[idx, 0] - arrayC[i, 0]
        dy = arrayP[idx, 1] - arrayC[i, 1]
        my_distance = numpy.sqrt(dx * dx + dy * dy)
        if minor_distance > my_distance or minor_distance == -1:
            minor_distance = my_distance
            arrayPcluster[idx] = i


@nb.kernel
def calCentroidsSum1(arrayCsum, arrayCnumpoint):
    i = nb.get_global_id(0)
    arrayCsum[i, 0] = 0
    arrayCsum[i, 1] = 0
    arrayCnumpoint[i] = 0


@nb.kernel
def calCentroidsSum2(arrayP, arrayPcluster, arrayCsum, arrayCnumpoint):
    i = nb.get_global_id(0)
    ci = arrayPcluster[i]
    atomic_add(arrayCsum, (ci, 0), arrayP[i, 0])
    atomic_add(arrayCsum, (ci, 1), arrayP[i, 1])
    atomic_add(arrayCnumpoint, ci, 1)


@nb.kernel
def updateCentroids(arrayC, arrayCsum, arrayCnumpoint, num_centroids):
    i = nb.get_global_id(0)
    arrayC[i, 0] = arrayCsum[i, 0] / arrayCnumpoint[i]
    arrayC[i, 1] = arrayCsum[i, 1] / arrayCnumpoint[i]


@nb.kernel
def copy_arrayC(arrayC, arrayP):
    i = nb.get_global_id(0)
    arrayC[i, 0] = arrayP[i, 0]
    arrayC[i, 1] = arrayP[i, 1]


def kmeans(
    arrayP, arrayPcluster, arrayC, arrayCsum, arrayCnumpoint, num_points, num_centroids
):

    copy_arrayC[num_centroids,](arrayC, arrayP)

    for i in range(ITERATIONS):
        groupByCluster[num_points,](
            arrayP, arrayPcluster, arrayC, num_points, num_centroids
        )

        calCentroidsSum1[num_centroids,](
            arrayCsum, arrayCnumpoint
        )

        calCentroidsSum2[num_points,](
            arrayP, arrayPcluster, arrayCsum, arrayCnumpoint
        )

        updateCentroids[num_centroids,](
            arrayC, arrayCsum, arrayCnumpoint, num_centroids
        )

    return arrayC, arrayCsum, arrayCnumpoint


def run_kmeans(
    arrayP,
    arrayPclusters,
    arrayC,
    arrayCsum,
    arrayCnumpoint,
    NUMBER_OF_POINTS,
    NUMBER_OF_CENTROIDS,
):

    with dpctl.device_context(base_kmeans_gpu_graph.get_device_selector(is_gpu=True)):
        for i in range(REPEAT):
            arrayC, arrayCsum, arrayCnumpoint = kmeans(
                arrayP,
                arrayPclusters,
                arrayC,
                arrayCsum,
                arrayCnumpoint,
                NUMBER_OF_POINTS,
                NUMBER_OF_CENTROIDS,
            )


base_kmeans_gpu_graph.run("Kmeans Numba", run_kmeans)


### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_kmeans_atomic_graph.sh; if [ -x "$(command -v qsub)" ]; then ./q run_kmeans_atomic_graph.sh; else ./run_kmeans_atomic_graph.sh; fi

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel:
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again._

### View the results
Select the cell below and click run ▶ to view the graph:

In [None]:
%matplotlib inline
import joblib
import numpy as np
import matplotlib.pyplot as plt


def read_dictionary(fn):
    # Load data (deserialize)
    with open(fn, 'rb') as handle:
        dictionary = joblib.load(handle)
    return dictionary


resultsDict = read_dictionary('resultsDict.dat')
limit = 10  # number of clusters and centroids to plot

arrayP = resultsDict['arrayP']
arrayPclusters = resultsDict['arrayPclusters']
xC = resultsDict['xC']
yC = resultsDict['yC']

# One color per plotted cluster, in cluster order
colors = ['gold', 'magenta', 'cyan', 'green', 'y', 'r', 'purple', 'blue', 'green', 'orange']

plt.style.use('default')

# Scatter the points of each of the first `limit` clusters in its own color
for cluster_id, color in zip(range(limit), colors):
    members = np.where(arrayPclusters == cluster_id)
    plt.scatter(x=arrayP[members][:, 0], y=arrayP[members][:, 1], c=color, alpha=0.08)

# Overlay the first `limit` centroids
plt.scatter(x=xC[:limit], y=yC[:limit], s=75, c='r', edgecolor="k")
plt.title('Kmeans centroids')

plt.gcf().set_size_inches((16, 8))
plt.show()


### Advisor Roofline Report

A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks.  Intel Advisor includes an automated Roofline tool that measures and plots the chart on its own, so all you need to do is read it.

The chart can be used to identify not only where bottlenecks exist, but what’s likely causing them, and which ones will provide the most speedup if optimized.
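
The roofline bound itself is a one-line formula: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity (FLOP per byte moved). The sketch below evaluates it for a hypothetical device; the peak and bandwidth figures are made-up round numbers, not measurements of any real GPU, and the function name is chosen for this example.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_s):
    # Roofline model: performance is capped by compute peak or by memory traffic.
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


# Hypothetical device: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
peak, bandwidth = 1000.0, 100.0
ridge_point = peak / bandwidth  # intensity at which the two roofs meet

for ai in (0.5, 10.0, 40.0):
    bound = "memory" if ai < ridge_point else "compute"
    gflops = attainable_gflops(ai, peak, bandwidth)
    print(f"AI={ai:5.1f} FLOP/byte -> {gflops:7.1f} GFLOP/s ({bound}-bound)")
```

A kernel whose dot sits to the left of the ridge point is limited by memory traffic, so reducing data movement helps more than adding arithmetic throughput.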

The Survey is usually the first analysis you want to run with Intel® Advisor. It is mainly used to time your application as well as its individual loops and functions.

The second step is to run the Trip Counts analysis. This step uses instrumentation to count how many iterations run in each loop. Adding the `--flop` option also provides the precise number of operations executed in each code section.

Run the following steps to collect the data and generate the roofline report:


* Run the Survey analysis with the --profile-gpu option

```
advisor --collect=survey --profile-gpu -run-pass-thru=--no-altstack -project-dir=roofline --search-dir src:r=. python lab/kmeans_kernel_atomic.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json --usm -d 3
```
* Run the Trip Counts and FLOP analysis with --profile-gpu option:

```
advisor --collect=tripcounts --profile-gpu --project-dir=roofline "--search-dir src:r=." --flop --no-trip-counts python lab/kmeans_kernel_atomic.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json --usm -d 3
```
* Generate a GPU Roofline report:

```
advisor --report=roofline --gpu --project-dir=roofline --report-output=roofline/roofline.html
```

### Advisor Roofline Report

Execute the following cell to display the roofline results:


In [None]:
run lab/mm_basic_roofline.py

## Generating the VTune reports
In the exercises below, we use the VTune™ analyzer to see what is going on in each implementation. The collected data is summarized as a high-level hotspot report and rendered in an HTML iframe. Depending on the options chosen, many of the VTune analyzer's performance collections can be rendered as HTML pages. The VTune commands below collect GPU offload and GPU hotspots information.

#### Learn more about VTune

There is extensive training on VTune; click [here](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2xmez3) for deep-dive training.

```
vtune -run-pass-thru=--no-altstack -collect=gpu-offload -result-dir=vtune_dir python lab/kmeans_kernel_atomic.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json
```

```
vtune -run-pass-thru=--no-altstack -collect=gpu-hotspots -result-dir=vtune_hotspots_dir python lab/kmeans_kernel_atomic.py --steps 1 --size 16384 --repeat 5 --json result_gpu.json
```

```
vtune -report summary -result-dir vtune_dir -format html -report-output output.html
```

```
vtune -report summary -result-dir vtune_hotspots_dir -format html -report-output output_hotspots.html
```

In [None]:
run lab/mm_basic_vtune.py

## Summary
In this module you learned the following:
* Numba implementation of K-Means targeting a CPU and GPU using Numba JIT
* Numba-dpex implementation of K-Means on a CPU and GPU using the kernel approach
* Numba-dpex implementation of K-means on a GPU using atomics