# Module 03_02: Pairwise Distance using Intel® Extension for Scikit-learn*

![Assets/PairwiseStocks.jpg](Assets/PairwiseStocks.jpg)

## Learning Objectives:

- Describe and apply the correct surgical patching method to patch pairwise_distances
- Recall that the **'euclidean'** metric is not optimized by Intel® Extension for Scikit-learn*, but the **'cosine'** and **'correlation'** metrics are
- Describe the application of pairwise_distances to the problem of finding all time series charts similar to a chosen pattern

**References:**
For background on geometric Brownian motion more generally, see:

P. Glasserman, Monte Carlo methods in financial engineering. Vol. 53 (2013), Springer Science & Business Media.


## Background:

Geometric Brownian motion, driven by arrays of precomputed random numbers, is used to synthesize a portfolio of 500 stocks, which is saved in data/portfolio500.npy. We created minute-by-minute data for one year's worth of trades. The data are random but partially correlated, using randomly generated eigenvectors, to simulate stock behavior.

The goal of the exercise is to find one of several interesting trading patterns and plot the stocks that best match that pattern using **pairwise_distances** powered by oneAPI.
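To make the background concrete, here is a minimal sketch of how correlated geometric Brownian motion paths could be generated with NumPy. It is not the code that produced the portfolio file: the function name `simulate_correlated_gbm`, the drift `mu`, the volatility `sigma`, and the starting price `s0` are all illustrative assumptions, and the correlation is introduced with a Cholesky factor of a random correlation matrix rather than the eigenvector approach mentioned above.

```python
import numpy as np


def simulate_correlated_gbm(n_steps, n_assets, s0=100.0, mu=0.05, sigma=0.2,
                            horizon=1.0, seed=0):
    """Simulate correlated GBM price paths (hypothetical helper).

    Returns an array of shape (n_steps + 1, n_assets), one column per asset.
    """
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps

    # Build a random symmetric positive definite matrix and rescale it
    # so the diagonal is 1, which gives a valid correlation matrix.
    a = rng.standard_normal((n_assets, n_assets))
    cov = a @ a.T + np.eye(n_assets)
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)

    # The Cholesky factor turns independent normals into correlated ones.
    chol = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_steps, n_assets)) @ chol.T

    # GBM update in log space:
    # log S(t + dt) - log S(t) = (mu - sigma^2 / 2) dt + sigma sqrt(dt) Z
    log_steps = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.vstack([np.zeros((1, n_assets)),
                           np.cumsum(log_steps, axis=0)])
    return s0 * np.exp(log_paths)


# Small example: 390 time steps for 5 assets (assumed sizes, not the real data)
paths = simulate_correlated_gbm(n_steps=390, n_assets=5)
print(paths.shape)
```

Each column of `paths` plays the role of one stock, which matches the layout used later in the notebook, where time runs along the rows of the portfolio array.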

### Intel® Extension for Scikit-learn*

Intel® Extension for Scikit-learn* gives data scientists better performance through a functionally equivalent library that contains patched versions of popular scikit-learn* algorithms.
To access these optimized algorithms, which are drop-in replacements for their stock counterparts, you need to:

* Download and install the Intel® oneAPI AI Analytics Toolkit
* Import the library
    ```from sklearnex import patch_sklearn```
* Call the ```patch_sklearn()``` function
* Then import the desired sklearn library

In the example below, patching is enabled for DBSCAN:

```
from sklearnex import patch_sklearn
patch_sklearn()
from sklearn.cluster import DBSCAN
```
 

The following code demonstrates DBSCAN clustering with Intel® Extension for Scikit-learn* patching enabled. Inspect the code; no modifications are necessary:
1. Inspect the following code cell and click Run (▶) to save the code to file.
2. Next, run (▶) the cell in the __Build and Run__ section following the code to compile and execute the code.

In [None]:

# Copyright 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from sklearnex import patch_sklearn
patch_sklearn()

import numpy as np
from sklearn.cluster import DBSCAN
from daal4py.oneapi import sycl_context

X = np.array([[1., 2.], [2., 2.], [2., 3.],
            [8., 7.], [8., 8.], [25., 80.]], dtype=np.float32)

clustering = DBSCAN(eps=3, min_samples=2).fit(X)
print("DBSCAN components: ", clustering.components_, "\nDBSCAN labels: ",clustering.labels_)

resultsDict = {}
resultsDict['X'] = X
resultsDict['labels'] = clustering.labels_
resultsDict['components'] = clustering.components_
import pickle
with open('resultsDict.pkl', 'wb') as handle:
    pickle.dump(resultsDict, handle, protocol=pickle.HIGHEST_PROTOCOL) 

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_dbscan.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dbscan.sh; else ./run_dbscan.sh; fi

In [None]:
import pickle
def read_results():
    # Load the results dictionary that the DBSCAN script pickled to disk
    with open('resultsDict.pkl', 'rb') as f:   # 'rb' for reading binary file
        resultsDict = pickle.load(f)
    return resultsDict

resultsDict = read_results()
resultsDict

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
X = resultsDict['X']

columns = ['x', 'y']
df = pd.DataFrame(X, columns = columns)
df['color'] = resultsDict['labels']
colors = { 0: 'magenta', 1: 'lime', -1: 'b' }

df.plot.scatter(x='x', y='y', c=df['color'].apply(lambda x: colors[x]), s=30)
plt.title('DBSCAN clustering of data')

plt.grid()
plt.show()

# Patching Strategies with Intel® Extension for Scikit-learn*

There are coarse methods to patch an entire Python script from the command line, finer-grained methods using patch_sklearn(), and almost surgical methods that specify exactly which functions you wish to patch or unpatch.

### To patch an entire Python script

You can patch a scikit-learn application without editing its code by using the following command-line flag:

```python -m sklearnex my_application.py```


### To patch a Jupyter notebook cell

The order of steps is important here:

* Import the sklearnex library and call patch_sklearn():

```
from sklearnex import patch_sklearn
patch_sklearn()
```

* Import any of the sklearn libraries you wish to use - **AFTER the call to patch_sklearn()** for example:

```
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
```


### To UNPATCH sklearn to restore the stock behavior do the following:

The process mirrors patching:
```
from sklearnex import unpatch_sklearn
unpatch_sklearn()
```
* Re-import scikit-learn algorithms after the unpatch

```
from sklearn.decomposition import PCA
```

### You can also specify which algorithms to patch explicitly

* Patching only one algorithm:

```
from sklearnex import patch_sklearn
patch_sklearn("SVC")
```

### To patch several algorithms explicitly

```
from sklearnex import patch_sklearn
patch_sklearn(["SVC", "DBSCAN"])
```

### To UNPATCH algorithms explicitly, try one of these methods:

```
unpatch_sklearn("KMeans")
unpatch_sklearn(["KMeans","SVC"])
```
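When patching or unpatching by name, it helps to know which names the installed extension accepts. The sketch below lists the patch names that contain a keyword and then patches and unpatches only those. It assumes `get_patch_names` is importable from `sklearnex` in your installed version; the helper `find_patch_names` and the keyword `"dbscan"` are illustrative, and the import is guarded so the cell still runs where the extension is not installed.

```python
# Sketch only: sklearnex is treated as optional so this cell runs without it.
try:
    from sklearnex import get_patch_names, patch_sklearn, unpatch_sklearn
    HAVE_SKLEARNEX = True
except ImportError:
    HAVE_SKLEARNEX = False


def find_patch_names(keyword):
    """Return the patchable names containing `keyword` (hypothetical helper)."""
    if not HAVE_SKLEARNEX:
        return []
    return sorted(n for n in get_patch_names() if keyword.lower() in n.lower())


# "dbscan" is only an example keyword; try "distance", "svc", and so on.
names = find_patch_names("dbscan")
print("Matching patch names:", names)

if names:
    patch_sklearn(names)      # patch only the matching algorithms
    # ... import and use the patched sklearn classes here ...
    unpatch_sklearn(names)    # restore stock behavior for those names
```

Looking names up first avoids guessing at spellings, which matters most for function-level patches, where the name is less obvious than a class name.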

In [None]:
## Read the precomputed, synthesized stock portfolio of 500 stocks (minute trades for a year)
import numpy as np
with open('data/portfolio500.npy', 'rb') as f:
    P = np.load(f)

## Exercise:

- Patch the cells that call pairwise_distances, either individually or for the whole notebook (by patching in the first cell of the notebook)

## Plot the whole portfolio at once to get a feel for the spread of the data

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize = (16,8))
plt.title('Multidimensional Correlated GBM', fontsize = 18)
plt.xlabel('Time', fontsize = 18)

plt.plot(P)
plt.grid()
plt.show()

## Plot just a handful to see if they look "stock like"

In [None]:
plt.figure(figsize = (16,8))
plt.title('Multidimensional Correlated GBM', fontsize = 18)
plt.xlabel('Time', fontsize = 18)

plt.plot(P[:,:4])
plt.grid()

# Retrieve previous compelling stock shapes

Retrieve shapes found during a previous run. These compelling shapes include ones that reflect a decline in the overall price over time (multiplying such a shape by -1 and adjusting the offset for plotting purposes gives an overall rise in price over time). Other interesting shapes are cyclical over various time periods within the year.

Now search for these shape patterns in the 500 generated stocks to find similarly shaped time series.

In [None]:
import numpy as np
import time
seed = 2022
with open('data/shapes{}.npy'.format(seed), 'rb') as f:
    shapes = np.load(f)
for i in range(3):
    plt.plot(shapes.T[i])

In [None]:
shapes.shape, P.shape

# Use Pairwise Distance to find similarly shaped stocks

Read shapes2022.npy (or shapesxxxx.npy)

This file contains 10 interesting shapes from a previous run

Find the four closest matching simulated stocks to one of several interesting shapes
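As a rough illustration of what "closest matching" means here, the sketch below computes cosine and correlation distances by hand in NumPy and picks the nearest series with `np.argpartition`. It uses toy data rather than the portfolio, and the helper names (`cosine_distance`, `correlation_distance`, `closest_k`) are made up for this example. The definitions follow the usual ones: cosine distance is 1 minus the cosine similarity, and correlation distance is the cosine distance after subtracting each vector's mean.

```python
import numpy as np


def cosine_distance(rows, target):
    """Cosine distance (1 - cosine similarity) from each row to `target`."""
    rows = np.asarray(rows, dtype=float)
    target = np.asarray(target, dtype=float)
    sims = rows @ target / (np.linalg.norm(rows, axis=1) * np.linalg.norm(target))
    return 1.0 - sims


def correlation_distance(rows, target):
    """Correlation distance: cosine distance of the mean-centered vectors."""
    rows = np.asarray(rows, dtype=float)
    target = np.asarray(target, dtype=float)
    return cosine_distance(rows - rows.mean(axis=1, keepdims=True),
                           target - target.mean())


def closest_k(distances, k):
    """Indices of the k smallest distances, nearest first (needs k < len)."""
    idx = np.argpartition(distances, k)[:k]      # k smallest, unordered
    return idx[np.argsort(distances[idx])]       # order just those k


# Toy data: one row per series, one column per time step
t = np.linspace(0.0, 1.0, 50)
pattern = 1.0 - t                      # declining reference shape
series = np.vstack([
    3.0 - 2.0 * t,                     # declining, different scale and offset
    1.0 + t,                           # rising
    np.sin(6.0 * t) + 2.0,             # cyclical
    10.0 - 5.0 * t,                    # declining
])

dist = correlation_distance(series, pattern)
top2 = closest_k(dist, 2)
print(top2, dist[top2])
```

Because correlation distance ignores offset and scale, the two declining series land at a distance of essentially zero from the pattern, while the rising series sits at the maximum distance of 2. Cosine distance does not remove the offset, which is one reason the two metrics can rank the same stocks differently.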

## Apply a surgical patch below

Use a surgical patch, in which you specify the pairwise_distances function explicitly


In [None]:
# dominant trend - find the top findN (4) stocks which follow the reference shape (plotted in blue)

findN = 4

from sklearn.metrics.pairwise import pairwise_distances

# for stocks, I am treating the time as the components of the vector
# so I transpose the X & Y so that time[s] are the columns
sim = pairwise_distances(P.T, Y=shapes[:,0].reshape(-1,1).T, metric='cosine') 
#sim = pairwise_distances(P.T, Y=shapes[:,1].reshape(-1,1).T, metric="correlation")
# use np.argpartition to find the 4 closest; this is similar to sorting the array and choosing the first 4
idxs = np.argpartition(
    sim.flatten(), findN)[:findN]

plt.figure(figsize = (16,8))
plt.title('Pairwise Distance cosine Similar Time Series similar to downward red shape', fontsize = 18)
plt.xlabel('Time', fontsize = 18)

colors = ['lime','g','r','violet']
for i in range(len(colors)):
    plt.plot(P[:,idxs[i]], c=colors[i])
plt.plot(120*shapes[:,0] + 450, c = 'b')
sim[idxs]

In [None]:
# inverse dominant trend - find the top findN (4) stocks which follow the reference shape (plotted in orange)
# Experimenting with using correlation instead of cosine - cosine matches much better
sim = pairwise_distances(P.T, Y=shapes[:,1].reshape(-1,1).T, metric='correlation') 
idxs = np.argpartition(sim.flatten(), findN)[:findN]

plt.figure(figsize = (16,8))
plt.title('Pairwise Distance Similar Time Series cyclical', fontsize = 18)
plt.xlabel('Time', fontsize = 18)

colors = ['lime','g','b','violet']
for i in range(len(colors)):
    plt.plot(P[:,idxs[i]], c=colors[i])
plt.plot(120*shapes[:,1] + 700, c = 'orange')
sim[idxs]

# Notices & Disclaimers 

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
*Other names and brands may be claimed as the property of others.