# Scikit-learn-intelex accelerated clustering

To scale clustering of our 2048-dimensional dataset,
we test Intel's Extension for Scikit-learn (`sklearnex`).

On a smaller dataset of 100 squad2 examples (14,400 attention heads), KMeans on 8 cores shows a roughly linear speedup: about 10 minutes of CPU time completes in about 1.5 minutes of wall time.

On a larger set, the 2000-example squad2 output from pipeline/transform_attentions.ipynb (288,000 rows), the speedup was similar (~7x), but clustering took about 30x longer with only 20x the number of rows.

* we're going to need a bigger boat

Considering that 2000 examples is only ~1/65th of the more than 130,000 examples in squad2, this doesn't seem like a feasible option. While our 400GB dataset could fit in a memory-optimized VM on GCP or AWS, even on a 128-core EPYC server clustering could take a _long_ time. Even if KMeans were linear in complexity, the run could take more hours than is feasible, and it is clearly not linear: the cost is somewhere along the lines of O(n^2), depending on the algorithm and on the number of rows, clusters, iterations, and columns.
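To put rough numbers on that argument, here is a minimal sketch that extrapolates the 2000-example wall time recorded later in this notebook (42min 20s) to the full dataset. The helper `extrapolate_hours` is hypothetical, and the power-law model `time ~ rows**exponent` is an assumption used only to bracket the estimate between the linear and quadratic cases discussed above.

```python
def extrapolate_hours(base_seconds: float, scale: float, exponent: float) -> float:
    """Scale a measured run time to `scale` times the rows, assuming time ~ rows**exponent."""
    return base_seconds * scale ** exponent / 3600


# Wall time of the 2000-example run recorded in this notebook: 42min 20s.
measured_seconds = 42 * 60 + 20
# Full squad2 is roughly 65x larger (130,000 / 2,000 examples).
scale = 130_000 / 2_000

# Optimistic bound: cost grows linearly with the number of rows.
linear_hours = extrapolate_hours(measured_seconds, scale, exponent=1.0)
# Pessimistic bound: cost grows quadratically with the number of rows.
quadratic_hours = extrapolate_hours(measured_seconds, scale, exponent=2.0)

print(f"linear:    {linear_hours:,.0f} hours")
print(f"quadratic: {quadratic_hours:,.0f} hours ({quadratic_hours / 24:,.0f} days)")
```

On the same 8 cores this gives roughly 46 hours in the linear case and on the order of 124 days in the quadratic case. Neither figure accounts for memory pressure or for a change in the number of iterations, so both should be read as order-of-magnitude estimates.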

In [6]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [7]:
!echo $SKLEARNEX_VERBOSE

INFO


In [2]:
import pandas as pd
from sklearn import cluster
import os
import seaborn as sns

In [4]:
data_dir='/rapids/notebooks/host/representations/'

In [6]:
df = pd.read_csv(os.path.join(data_dir,'representation_df.csv'))

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288000 entries, 0 to 287999
Columns: 2049 entries, Unnamed: 0 to 2047
dtypes: float64(2048), int64(1)
memory usage: 4.4 GB
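
The `df.info()` output lists 2049 columns, one of them an int64 column named `Unnamed: 0`. pandas assigns that name when the first header cell of a CSV is empty, which typically happens when a DataFrame was saved with its index. If that is what happened here (an assumption, since the file itself is not shown), the row number is passed to KMeans as a feature alongside the 2048 representation columns. The sketch below reproduces the effect on a small made-up frame and shows two ways to keep the index out of the features.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for representation_df.csv: 5 rows, 4 feature columns.
toy = pd.DataFrame(np.random.default_rng(0).normal(size=(5, 4)))
csv_text = toy.to_csv()  # the default index=True writes the index as an unnamed first column

# Reading it back naively yields an extra "Unnamed: 0" column.
with_index = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell read_csv that the first column is the index.
features = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after loading.
features_dropped = with_index.drop(columns="Unnamed: 0")

print(with_index.shape, features.shape, features_dropped.shape)
```

Applied to this notebook, option 1 would amount to passing `index_col=0` to the existing `pd.read_csv` call. The timings recorded below were measured with the extra column included.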


In [11]:
# First 14,400 rows: the 100-example subset described above
df_small = df[:14400]

In [12]:
%%time
kmeans_dataset_small = cluster.KMeans(n_clusters=30,
                                      init='k-means++').fit_predict(df_small)

SKLEARNEX INFO: sklearn.cluster.KMeans.fit: running accelerated version on CPU
CPU times: user 10min 3s, sys: 2.44 s, total: 10min 5s
Wall time: 1min 26s


In [15]:
%%time
kmeans_dataset_large = cluster.KMeans(n_clusters=30,
                                      init='k-means++').fit_predict(df)

SKLEARNEX INFO: sklearn.cluster.KMeans.fit: running accelerated version on CPU
CPU times: user 5h 3min 45s, sys: 1min 17s, total: 5h 5min 2s
Wall time: 42min 20s
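
The two `%%time` outputs above can be compared directly. The sketch below computes the ratio of total CPU time to wall time for each run (what the introduction reads as the speedup on 8 cores) and the growth in wall time relative to the growth in rows. The `seconds` helper and the `runs` dictionary are illustrative names; the numbers are copied from the outputs above.

```python
def seconds(hours: int = 0, minutes: int = 0, secs: float = 0.0) -> float:
    """Convert an h/min/s reading from %%time into seconds."""
    return hours * 3600 + minutes * 60 + secs


# Rows, total CPU time, and wall time as recorded by the two %%time cells.
runs = {
    "small": {"rows": 14_400,
              "cpu": seconds(minutes=10, secs=5),
              "wall": seconds(minutes=1, secs=26)},
    "large": {"rows": 288_000,
              "cpu": seconds(hours=5, minutes=5, secs=2),
              "wall": seconds(minutes=42, secs=20)},
}

# CPU time divided by wall time: average number of cores kept busy.
cpu_per_wall = {name: run["cpu"] / run["wall"] for name, run in runs.items()}

# How much the input and the wall time grew between the two runs.
row_ratio = runs["large"]["rows"] / runs["small"]["rows"]
wall_ratio = runs["large"]["wall"] / runs["small"]["wall"]

for name, ratio in cpu_per_wall.items():
    print(f"{name}: CPU/wall = {ratio:.1f}x")
print(f"rows grew {row_ratio:.0f}x, wall time grew {wall_ratio:.1f}x")
```

Both runs come out at about 7x CPU time per unit of wall time, and 20x the rows cost roughly 30x the wall time. With only two measurements, and with no control over how many iterations each run needed to converge, this is a rough indication of superlinear scaling rather than a fitted complexity.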
