## LALE cross validation example
This example, developed in collaboration with the [LALE team](https://github.com/IBM/lale), demonstrates how a LALE pipeline can be translated into a CodeFlare pipeline, using cross validation as the target workload.

It assumes that LALE is available and installed in your local environment.

Running this notebook shows that LALE cross validation is single-threaded and takes several minutes on a laptop (about 7 minutes of wall time in the run recorded below, depending on the configuration). With CodeFlare pipelines running on Ray, the same 10-fold cross validation takes around 75 seconds (with 8x parallelism).
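The folds of a cross validation do not depend on one another, which is what makes this kind of parallel execution possible. The following is a minimal standard-library sketch of that idea only; it is not how CodeFlare or Ray implement it. The helpers `make_folds`, `evaluate_fold`, and `parallel_cross_val_score`, the toy data, and the majority-class stand-in "model" are all hypothetical and chosen to keep the example self-contained.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_folds(n_samples, n_splits):
    """Return (train_indices, test_indices) pairs, one per fold (hypothetical helper)."""
    indices = list(range(n_samples))
    # Fold i takes every n_splits-th sample starting at offset i.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    splits = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def evaluate_fold(y, train_idx, test_idx):
    """Fit a majority-class stand-in model on the training part and
    return its accuracy on the held-out part."""
    majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
    correct = sum(1 for i in test_idx if y[i] == majority)
    return correct / len(test_idx)


def parallel_cross_val_score(y, n_splits=10, max_workers=8):
    """Submit every fold as its own task and collect the scores in fold order."""
    splits = make_folds(len(y), n_splits)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_fold, y, tr, te) for tr, te in splits]
        return [f.result() for f in futures]


# Toy labels: 70 samples of class 0 followed by 30 samples of class 1.
y = [0] * 70 + [1] * 30
scores = parallel_cross_val_score(y)
print(scores)
```

Threads are used here only to keep the sketch short. Because of Python's global interpreter lock, CPU-bound model training needs separate processes to run in parallel, which is the role Ray's worker processes play in this notebook.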

In [3]:
# Uncomment below to install lale for running this notebook

# !pip install lale
# !pip install 'liac-arff>=2.4.0'

In [1]:
from lale.datasets.openml import fetch

In [2]:
(X_train, y_train), (X_test, y_test) = fetch("jungle_chess_2pcs_raw_endgame_complete", "classification")

In [8]:
# First, we will show how this data can be used to do cross validation using a simple pipeline with random forest
from lale.lib.sklearn import PCA, Nystroem, SelectKBest, RandomForestClassifier
from lale.lib.lale import ConcatFeatures

pipeline = (PCA() & Nystroem() & SelectKBest(k=3)) >> ConcatFeatures() >> RandomForestClassifier(n_estimators=200)

In [9]:
%%time
from lale.helpers import cross_val_score
cross_val_score(pipeline, X_train, y_train, cv=10)

CPU times: user 7min 18s, sys: 8.91 s, total: 7min 27s
Wall time: 6min 59s


[0.8161838161838162,
 0.8105228105228105,
 0.8148518148518149,
 0.8218448218448219,
 0.8208458208458208,
 0.8111888111888111,
 0.8105228105228105,
 0.8181818181818182,
 0.8011325782811459,
 0.8121252498334444]

In [3]:
# Start Ray with a 16 GiB object store (object_store_memory is given in bytes)

import ray
ray.init(object_store_memory=16 * 1024 * 1024 * 1024)

2021-06-19 21:02:55,145	INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:8266


{'node_ip_address': '9.211.53.245',
 'raylet_ip_address': '9.211.53.245',
 'redis_address': '9.211.53.245:29680',
 'object_store_address': '/tmp/ray/session_2021-06-19_21-02-53_442708_86627/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-06-19_21-02-53_442708_86627/sockets/raylet',
 'webui_url': '127.0.0.1:8266',
 'session_dir': '/tmp/ray/session_2021-06-19_21-02-53_442708_86627',
 'metrics_export_port': 61863,
 'node_id': 'eff8f3d5558252aa2d18c57b62891d1a169a7404979a63b6c9882a14'}

In [4]:
from sklearn.model_selection import StratifiedKFold

# 10 folds, matching cv=10 in the LALE run above
kf = StratifiedKFold(n_splits=10)

In [5]:
from sklearn.pipeline import FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.feature_selection import SelectKBest

In [6]:
feature_union = FeatureUnion(transformer_list=[('PCA', PCA()),
                                                ('Nystroem',
                                                 Nystroem()),
                                                ('SelectKBest',
                                                 SelectKBest(k=3))])

In [7]:
random_forest = RandomForestClassifier(n_estimators=200)

In [8]:
import codeflare.pipelines.Datamodel as dm

# Create the CodeFlare pipeline and its two nodes, then connect them with an edge
pipeline = dm.Pipeline()
node_fu = dm.EstimatorNode('feature_union', feature_union)
node_rf = dm.EstimatorNode('randomforest', random_forest)

pipeline.add_edge(node_fu, node_rf)

In [9]:
import codeflare.pipelines.Runtime as rt
from codeflare.pipelines.Runtime import ExecutionType

In [10]:
pipeline_input = dm.PipelineInput()
xy = dm.Xy(X_train, y_train)
pipeline_input.add_xy_arg(node_fu, xy)

In [11]:
%%time
scores = rt.cross_validate(kf, pipeline, pipeline_input)

CPU times: user 3.06 s, sys: 2.11 s, total: 5.17 s
Wall time: 1min 15s
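As a quick check on the two recorded wall times (6min 59s for the LALE `cross_val_score` call and 1min 15s for `rt.cross_validate`), the observed speed-up can be worked out as below. The helper `to_seconds` is a hypothetical convenience function, and the figures are simply the ones printed in this notebook run.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time into seconds (hypothetical helper)."""
    return minutes * 60 + seconds


lale_wall = to_seconds(6, 59)       # LALE cross_val_score: Wall time 6min 59s
codeflare_wall = to_seconds(1, 15)  # rt.cross_validate:    Wall time 1min 15s

speedup = lale_wall / codeflare_wall
print(f"speed-up: {speedup:.1f}x")  # prints "speed-up: 5.6x"
```

The observed factor is below the 8x parallelism, which is plausible since 10 folds do not divide evenly across 8 workers and distributing the work adds some overhead.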


In [12]:
scores

[0.8161838161838162,
 0.8118548118548119,
 0.8145188145188145,
 0.8198468198468198,
 0.8175158175158175,
 0.8111888111888111,
 0.8158508158508159,
 0.8171828171828172,
 0.8037974683544303,
 0.8087941372418388]
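The per-fold scores of the two runs are close but not identical, which is to be expected since neither pipeline fixes a `random_state`. A simple way to compare them is to look at the mean over the folds. The sketch below copies the scores printed above; the variable names `lale_scores` and `codeflare_scores` are introduced here for illustration only.

```python
from statistics import mean

# Scores printed by the LALE cross_val_score cell
lale_scores = [0.8161838161838162, 0.8105228105228105, 0.8148518148518149,
               0.8218448218448219, 0.8208458208458208, 0.8111888111888111,
               0.8105228105228105, 0.8181818181818182, 0.8011325782811459,
               0.8121252498334444]

# Scores returned by rt.cross_validate on the CodeFlare pipeline
codeflare_scores = [0.8161838161838162, 0.8118548118548119, 0.8145188145188145,
                    0.8198468198468198, 0.8175158175158175, 0.8111888111888111,
                    0.8158508158508159, 0.8171828171828172, 0.8037974683544303,
                    0.8087941372418388]

lale_mean = mean(lale_scores)
codeflare_mean = mean(codeflare_scores)
print(f"LALE mean accuracy:      {lale_mean:.3f}")
print(f"CodeFlare mean accuracy: {codeflare_mean:.3f}")
```

Both means round to the same value at three decimal places, so in this run the parallel execution reproduces the accuracy of the single-threaded one.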

In [13]:
ray.shutdown()