While sklearn can be parallelized and is extremely versatile in the breadth of algorithms it exposes, it is not built for "medium" or "big" data. Some algorithms are "incremental" in the sense that they can be updated using only a small window of the training data at a time (see http://scikit-learn.org/stable/modules/scaling_strategies.html). But most learning algorithms keep the data in RAM (and potentially copy it many times), so you typically cannot learn on data whose featurized size is $>$ 10% of RAM.

- `dask-searchcv` is a new project that attempts to expose much of sklearn's capabilities atop `dask.distributed` and hence do a lot of computation out of core (see https://github.com/dask/dask-searchcv).

- `graphlab` (aka Turi) is a Python 2.7 project for out-of-core learning and data-science pipelines (https://github.com/apple/turicreate).

- `MLlib` sits atop Spark and is meant for large-scale distributed learning tasks (http://spark.apache.org/docs/latest/ml-guide.html).
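
The "incremental" strategy mentioned above can be sketched with sklearn's `partial_fit` API. This is a minimal illustration, not a recommendation for any particular dataset: the choice of `SGDClassifier`, the batch size of 200, and the pixel scaling are assumptions made for the example, and the digits data is small enough to fit in memory anyway. The point is only that the model sees one window of the data at a time.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # digits pixel values range from 0 to 16; scale to [0, 1]

# partial_fit needs the full set of class labels up front, because any
# single batch may not contain every class.
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
batch_size = 200  # illustrative window size (an assumption)

# Feed the data one window at a time; in a real out-of-core setting each
# batch would be read from disk or a stream instead of sliced from an array.
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

# Accuracy on the data seen so far (training data, so it is optimistic)
train_accuracy = clf.score(X, y)
print(train_accuracy)
```

Only estimators that implement `partial_fit` can be trained this way; the scaling-strategies page linked above lists them.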



In [None]:
!conda install dask-searchcv -c conda-forge -y

In [None]:
from dask.distributed import Client
client = Client()

In [None]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Fit with dask-searchcv
from dask_searchcv import GridSearchCV

param_space = {'C': [1e-4, 1, 100],
               'gamma': [1e-3, 1e-2, 1e-1, 1],  # was a duplicated 1e-2; log-spaced grid assumed
               'class_weight': [None, 'balanced']}

model = SVC(kernel='rbf')

digits = load_digits()

search = GridSearchCV(model, param_space, cv=10, return_train_score=True)
search.fit(digits.data, digits.target)

In [None]:
search.best_estimator_

In [None]:
# Note: this scores on the same data used for the search, so the value is optimistic
search.best_estimator_.score(digits.data, digits.target)

In [None]:
search.cv_results_
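
The raw `cv_results_` dictionary is hard to read, so one option is to load it into a pandas DataFrame and sort by rank. The sketch below is self-contained and uses sklearn's own `GridSearchCV` as a stand-in, on the assumption that `dask_searchcv` exposes `cv_results_` in the same layout; the reduced grid, `cv=3`, and the names `small_grid`, `search_demo`, and `summary` are illustrative choices, not part of the search above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A deliberately small grid so the example runs quickly (an assumption)
small_grid = {"C": [1, 100], "gamma": [1e-3, 1e-2]}

search_demo = GridSearchCV(SVC(kernel="rbf"), small_grid, cv=3)
search_demo.fit(X, y)

# cv_results_ is a dict of equal-length arrays, one entry per parameter combination
results = pd.DataFrame(search_demo.cv_results_)
results = results.sort_values("rank_test_score")

# Keep only the columns that matter for comparing parameter combinations
summary = results[["param_C", "param_gamma", "mean_test_score", "rank_test_score"]]
print(summary)
```

With the `search` object fitted earlier, the same two DataFrame lines should apply to `search.cv_results_` directly.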