Joblib Apache Spark Backend
This library provides an Apache Spark backend for joblib to distribute tasks on a Spark cluster.

joblibspark requires Python 3.6+ and pyspark>=2.4 to run.
```
pip install joblibspark
```
The installation does not install PySpark because for most users, PySpark is already installed.
If you do not have PySpark installed, you can install pyspark together with joblibspark (note the quotes, so the shell does not treat `>=` as a redirect):

```
pip install "pyspark>=3.0.0" joblibspark
```
If you want to use joblibspark with scikit-learn, please install scikit-learn as well.
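For example (the `scikit-learn>=0.21` pin below is an assumption, based on 0.21 being roughly when scikit-learn switched to the standalone joblib package that custom backends register against; pin whatever release your project needs):

```
pip install "scikit-learn>=0.21" joblibspark
```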
Run the following example code in a pyspark shell:
```python
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
```
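Because register_spark() registers a standard joblib backend, plain joblib.Parallel calls can be dispatched to Spark as well. Below is a minimal sketch (not from the original docs) that assumes an active SparkSession, e.g. inside a pyspark shell:

```python
from joblib import Parallel, delayed, parallel_backend
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

with parallel_backend('spark', n_jobs=4):
    # Each delayed call is shipped to the cluster as a Spark task.
    squares = Parallel()(delayed(pow)(i, 2) for i in range(10))

print(squares)
```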
Note that joblibspark does not generally support running model inference and feature engineering in parallel. For example:
```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    h.transform(...)

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on Spark; it will still run locally.
    regr.predict(diabetes_X_test)
```
Note also that for sklearn.ensemble.RandomForestClassifier there is an `n_jobs` parameter, which means the algorithm supports model training and inference in parallel; however, its inference implementation binds directly to joblib's built-in backends, so the Spark backend does not work in this case.
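By contrast, joblib-driven meta-estimators such as cross-validation and hyperparameter search do fan out over Spark, since they parallelize through joblib itself. A sketch of this pattern follows (the GridSearchCV setup and parameter grid are illustrative choices, not from the original text):

```python
from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
with parallel_backend('spark', n_jobs=3):
    # Each (fold, candidate) fit runs as a separate Spark task.
    search.fit(iris.data, iris.target)

print(search.best_params_)
```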