## Scikit Spark
https://pypi.org/project/scikit-spark/

This is a major rewrite of the spark-sklearn project, which no longer appears to be under development. It focuses specifically on accelerating scikit-learn's cross-validation functionality using PySpark.

In [1]:
!pip install -U scikit-spark

Collecting scikit-spark
  Downloading https://files.pythonhosted.org/packages/dd/d7/2ef215c4c7d4edc567a49d5d1d66637e84a7944520fcd1eb783a01bbceac/scikit_spark-0.1.0-py3-none-any.whl
Collecting six==1.11.0 (from scikit-spark)
  Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
ERROR: pandas-ml 0.6.1 requires enum34, which is not installed.
ERROR: tensorflow 1.10.0 has requirement numpy<=1.14.5,>=1.13.3, but you'll have numpy 1.17.1 which is incompatible.
ERROR: tensorflow 1.10.0 has requirement setuptools<=39.1.0, but you'll have setuptools 41.0.1 which is incompatible.
ERROR: tensorflow 1.10.0 has requirement tensorboard<1.11.0,>=1.10.0, but you'll have tensorboard 1.14.0 which is incompatible.
ERROR: mlxtend 0.17.0 has requirement scikit-learn>=0.20.3, but you'll have scikit-learn 0.19.2 which is incompatible.
ERROR: dask-ml 1.0.0 has re

In [2]:
import sklearn
sklearn.__version__

'0.19.2'

In [3]:
from sklearn import svm, datasets

In [4]:
iris = datasets.load_iris()

In [5]:
parameters = {'kernel':('linear', 'rbf'), 
              'C':[0.01, 0.1, 1, 10, 100]
             }

svc = svm.SVC(gamma='auto')
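
To make the size of the search concrete, the sketch below expands the grid above with only the standard library. It assumes 3-fold cross-validation, which is what `cv=None` means in scikit-learn 0.19.x (later releases default to 5 folds). The names `candidates`, `tasks`, and `n_folds` are illustrative and not part of scikit-spark. Each (candidate, fold) pair is an independent fit, which is the kind of workload that can be spread across Spark workers.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {'kernel': ('linear', 'rbf'),
              'C': [0.01, 0.1, 1, 10, 100]}

# Assumption: 3 folds, the default for cv=None in scikit-learn 0.19.x.
n_folds = 3

# Expand the grid into one dict per candidate (Cartesian product of values).
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

# Each (candidate, fold) pair is an independent fit-and-score task.
tasks = [(params, fold) for params in candidates for fold in range(n_folds)]

print(len(candidates))  # 10 candidates: 5 values of C x 2 kernels
print(len(tasks))       # 30 independent fits, before the final refit
```

With only 30 small fits on the iris data, distributing the work is unlikely to pay off; the benefit is expected to grow with larger grids and slower estimators.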

In [6]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session that uses all available cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("skspark-grid-search-doctests")
    .getOrCreate()
)

In [7]:
from skspark.model_selection import GridSearchCV

In [8]:
gs = GridSearchCV(svc, parameters, n_jobs=-1)

In [9]:
gs.fit(iris.data, iris.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [0.01, 0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None,
       spark=<pyspark.sql.session.SparkSession object at 0x1a1d184358>,
       verbose=0)

In [10]:
gs.best_params_

{'C': 1, 'kernel': 'linear'}

In [11]:
gs.best_score_

0.98

In [12]:
from skspark.model_selection import RandomizedSearchCV

In [13]:
rs = RandomizedSearchCV(svc, parameters, n_jobs=-1)
rs.fit(iris.data, iris.target)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'kernel': ('linear', 'rbf'), 'C': [0.01, 0.1, 1, 10, 100]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None,
          spark=<pyspark.sql.session.SparkSession object at 0x1a1d184358>,
          verbose=0)

In [14]:
rs.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
rs.best_params_

{'kernel': 'linear', 'C': 1}

In [16]:
rs.best_score_

0.98
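
The randomized search returns the same best parameters and score as the grid search, which is likely a consequence of the setup rather than chance. The grid holds only 10 discrete combinations and `n_iter` is left at its default of 10. When every entry in `param_distributions` is a list, scikit-learn's `ParameterSampler` draws without replacement, so 10 draws cover every candidate. The sketch below imitates that behaviour with `random.sample`; it is a stand-in, not the sampler itself, and it assumes scikit-spark delegates sampling to scikit-learn.

```python
import random
from itertools import product

parameters = {'kernel': ('linear', 'rbf'),
              'C': [0.01, 0.1, 1, 10, 100]}
n_iter = 10  # default n_iter, as shown in the RandomizedSearchCV repr above

names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

# Stand-in for sampling without replacement from a fully discrete grid.
rng = random.Random(0)
sampled = rng.sample(grid, n_iter)


def key(params):
    """Sortable identity for one parameter combination."""
    return (params['C'], params['kernel'])


# Every candidate is drawn exactly once, so the search is exhaustive here.
print(sorted(map(key, sampled)) == sorted(map(key, grid)))  # True
```

To see randomized search behave differently from grid search, use a smaller `n_iter` or pass continuous distributions instead of lists.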

#### Reinstall the newer version of scikit-learn

In [17]:
!pip uninstall scikit-learn -y

Uninstalling scikit-learn-0.19.2:
  Successfully uninstalled scikit-learn-0.19.2


In [18]:
!pip install -U scikit-learn

Collecting scikit-learn
  Using cached https://files.pythonhosted.org/packages/cf/b8/706e496d8b1207c1da154a7fe82753a2385edc1435ec524afa6c1baafed6/scikit_learn-0.21.3-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
ERROR: spark-sklearn 0.3.0 has requirement scikit-learn<0.20,>=0.18.1, but you'll have scikit-learn 0.21.3 which is incompatible.
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.21.3
