<font color=gray>ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

***
# <font color=red>Improving Performance of Estimators Using `daal4py`</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle ADS Team </font></p>

***

Overview:

This notebook demonstrates an easy way to improve the performance of `scikit-learn` models using Intel-provided Python accelerators. The acceleration comes from the Intel(R) oneAPI Data Analytics Library (oneDAL), which `daal4py` makes available from Python. `daal4py` was created to give data scientists the easiest way to get better performance while using the familiar `scikit-learn` package.

## Business Use Cases 

Improving the performance of `scikit-learn` models with the `daal4py` accelerator.

---

## Prerequisites 
  - Experience level: Novice (Python and Machine Learning)
  - Professional experience: Some industry experience

## Objectives:

- <a href='#intro'>Check for an Intel-based Shape</a>
- <a href='#prepare'>Prepare the Data</a>
- <a href='#default'>Train a K-Means Model Using `sklearn`</a>
- <a href='#daal4py'>Train a K-Means Model Using the `daal4py` Accelerator</a>
- <a href='#unpatch'>Unpatch `daal4py` from `sklearn`</a>
- <a href="#reference">References</a>

---

 **Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

<a id='intro'></a>
### Check for an Intel-based Shape

Ensure that this notebook is running on an instance with an Intel CPU. The next cell reads the CPU brand string and raises an error if it does not contain "Intel".

In [None]:
import cpuinfo
shape_name = cpuinfo.get_cpu_info()['brand_raw']

assert "Intel" in shape_name, "Switch to a VM shape with Intel"
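
The check above is case-sensitive and assumes that the `brand_raw` key is always present. As a minimal sketch of a more tolerant variant, the helper below (the name `is_intel_brand` is hypothetical, not part of `cpuinfo` or ADS) accepts any brand string, including a missing one, and returns a boolean instead of raising. The sample strings are illustrative and are not read from this machine.

```python
def is_intel_brand(brand):
    """Return True if a CPU brand string mentions Intel (case-insensitive)."""
    return "intel" in (brand or "").lower()


# Illustrative brand strings, not read from this machine.
samples = [
    "Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz",
    "AMD EPYC 7742 64-Core Processor",
    None,  # for example, the key was missing from the CPU info dictionary
]
results = [is_intel_brand(s) for s in samples]
print(results)
```

In the notebook, the helper could be fed with `cpuinfo.get_cpu_info().get('brand_raw')`, so that a missing key yields `False` rather than a `KeyError`.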

Load the necessary modules:

In [None]:
import daal4py.sklearn
import importlib
import logging
import numpy as np
import sklearn
import time
import warnings

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

warnings.filterwarnings('ignore')
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

<a id='prepare'></a>
### Prepare the Data

The data is prepared using the `sklearn` `make_blobs` function that generates isotropic Gaussian blobs for clustering.

In [None]:
rows, cols = 1000, 150
X, y = make_blobs(n_samples=rows, n_features=cols, centers=8, random_state=42)

<a id='default'></a>
### Train a K-Means Model Using `sklearn`

Use `sklearn` to train a K-Means model on a dataset:

In [None]:
estimator = KMeans(n_clusters=8)
print("Module being used: " + estimator.__module__)

t0 = time.perf_counter()
trained = estimator.fit(X)
fit_elapsed = str(time.perf_counter() - t0)

print("Training took " + fit_elapsed + " seconds")

In [None]:
t0 = time.perf_counter()
# Predict the cluster of a single sample whose features are all 1
preds = trained.predict([[1] * cols])
predict_elapsed = str(time.perf_counter() - t0)

print("Prediction took " + predict_elapsed + " seconds")
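
A single timed run can be noisy, especially for a prediction that takes only a fraction of a second. The sketch below shows one way to reduce that noise by repeating a call and keeping the best time. The helper name `time_call` and its `repeats` parameter are assumptions made for illustration, and the workload is a stand-in so that the block runs without any ML library.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return (last_result, best_seconds)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return result, best


# Stand-in workload so the sketch runs without any ML library installed.
total, seconds = time_call(sum, range(100_000))
print(f"sum took {seconds:.6f} seconds (best of 3)")
```

In this notebook the same helper could wrap the estimator calls, for example `trained, fit_seconds = time_call(estimator.fit, X)`. Keeping the elapsed time as a float, rather than converting it to a string, also makes it available for later comparisons.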

<a id='daal4py'></a>
### Train a K-Means Model Using the `daal4py` Accelerator

To use oneDAL as the underlying solver, use `daal4py` to dynamically patch the `sklearn` estimators. The estimator API is unchanged and you get an equivalent solution, but faster. After the patching is complete, import the `sklearn` modules again so that the patched classes are picked up.

In [None]:
daal4py.sklearn.patch_sklearn()
sklearn = importlib.reload(sklearn)

from sklearn.cluster import KMeans
estimator = KMeans(n_clusters=8)

# After patching, this should indicate daal4py is being used
print("Module being used: " + estimator.__module__)

In [None]:
t0 = time.perf_counter()
trained = estimator.fit(X)
fit_elapsed = str(time.perf_counter() - t0)

print("Training took " + fit_elapsed + " seconds")

In [None]:
t0 = time.perf_counter()
# Predict the cluster of a single sample whose features are all 1
preds = trained.predict([[1] * cols])
predict_elapsed = str(time.perf_counter() - t0)

print("Prediction took " + predict_elapsed + " seconds")

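Compare the training and prediction times printed for the `sklearn` run and the `daal4py` run. On an Intel-based shape, `daal4py` is expected to improve performance significantly, although the size of the gain depends on the dataset size and the shape in use.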
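
To put a number on the comparison, the two timings can be reduced to a single speedup ratio. The sketch below assumes that the baseline and accelerated timings are kept as floats in separate variables; the notebook cells instead reuse `fit_elapsed` and store it as a string. The function name `speedup` and the example values are hypothetical and are not measured results.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run was than the baseline."""
    if baseline_seconds <= 0 or accelerated_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / accelerated_seconds


# Hypothetical timings, for illustration only (not measured results).
example_baseline = 2.0
example_accelerated = 0.5
ratio = speedup(example_baseline, example_accelerated)
print(f"Speedup: {ratio:.1f}x")
```

A ratio above 1 means that the accelerated run was faster, and a ratio below 1 means that it was slower.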

<a id='unpatch'></a>
### Unpatch `daal4py` from `sklearn`

To use `sklearn` again, unpatch `daal4py`, reload `sklearn`, and import the relevant `sklearn` modules again:

In [None]:
daal4py.sklearn.unpatch_sklearn()
sklearn = importlib.reload(sklearn)
# remember to re-import all the relevant modules
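
Because patching changes global state, it is easy to forget the unpatch step if a cell fails partway through. One possible way to guard against that is a small context manager that always unpatches on exit. The sketch below is an assumption-laden illustration: the name `temporarily_patched` is hypothetical, and the patch and unpatch functions are passed in as arguments so that the block runs with stand-in callables even where `daal4py` is not installed.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_patched(patch, unpatch):
    """Run patch() on entry and unpatch() on exit, even if the body raises."""
    patch()
    try:
        yield
    finally:
        unpatch()


# Stand-in callables so the sketch runs without daal4py installed.
calls = []
with temporarily_patched(lambda: calls.append("patch"),
                         lambda: calls.append("unpatch")):
    calls.append("body")
print(calls)
```

In this notebook, the arguments would be `daal4py.sklearn.patch_sklearn` and `daal4py.sklearn.unpatch_sklearn`, with the `from sklearn.cluster import KMeans` import placed inside the `with` block. Estimators imported inside the block still need to be imported again afterward to get the original `sklearn` classes back.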

<a id="reference"></a>
# References
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [ADS Library Documentation](https://docs.cloud.oracle.com/en-us/iaas/tools/ads-sdk/latest/index.html)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [Fast, Scalable and Easy Machine Learning With DAAL4PY](https://intelpython.github.io/daal4py/)