In [1]:
using Lale

┌ Info: Precompiling Lale [25676c37-aa2f-4f14-ad5b-b63670ababff]
└ @ Base loading.jl:1342


In [2]:
using Random
using Statistics
using Test
using DataFrames: DataFrame, nrow

In [3]:
iris = getiris();
trx,tstx = holdout(nrow(iris),0.30)
training = iris[trx,:]
testing = iris[tstx,:];

In [4]:
clf_tr_X = training[:,1:4] |> DataFrame
clf_tr_y = training[:,5]   |> Vector
clf_tst_X = testing[:,1:4] |> DataFrame
clf_tst_y = testing[:,5] |> Vector;

## AutoML for classifier pipeline

This example uses Lale for combined algorithm selection and hyperparameter tuning
on a classifier pipeline.

The first step to create a pipeline is to instantiate the operators. `laleoperator` takes the name of the operator and an optional package argument. The default package is sklearn, which means by default it will try to instantiate operators from sklearn.

In [5]:
PCA = laleoperator("PCA")
RobustScaler = laleoperator("RobustScaler")
ConcatFeatures = laleoperator("ConcatFeatures", "lale")
LogisticRegression = laleoperator("LogisticRegression")
RandomForestClassifier = laleoperator("RandomForestClassifier");

The next step is to compose a pipeline using the operators and combinators defined in Lale. The table below summarizes the available pipeline combinators which can be used to define the pipeline directed acyclic graph:

| Symbol | Name | Description  | Sklearn feature |
| ------ | ---- | ------------ | --------------- |
| >>     | pipe | Feed to next | `make_pipeline` |
| &      | and  | Run both     | `make_union`, includes concat |
| &#x7c; | or   | Choose one   | (missing) |


In [6]:
clf_planned = (PCA & RobustScaler) >> ConcatFeatures >> (LogisticRegression | RandomForestClassifier);

`LalePipeOptimizer` takes the pipeline graph from above and we can provide a budget of the number of optimizer iterations along with other parameters such as cross validation number of folds. It internally uses hyperopt for performing the algorithm selection and hyperparameter tuning.
Lale follows sklearn API, so `fit` is for trainind and `predict` for obtaining the predictions.

In [7]:
clf_hopt = LalePipeOptimizer(clf_planned, max_evals=10, cv=3)
clf_trained = fit(clf_hopt, clf_tr_X, clf_tr_y);

  0%|                                   | 0/10 [00:00<?, ?trial/s, best loss=?] 10%|▊       | 1/10 [00:00<00:06,  1.37trial/s, best loss: -0.9619047619047619] 20%|█▌      | 2/10 [00:01<00:03,  2.14trial/s, best loss: -0.9619047619047619] 30%|██▍     | 3/10 [00:01<00:02,  2.52trial/s, best loss: -0.9619047619047619] 40%|███▏    | 4/10 [00:01<00:02,  2.78trial/s, best loss: -0.9619047619047619] 50%|████    | 5/10 [00:02<00:02,  2.30trial/s, best loss: -0.9619047619047619] 60%|████▊   | 6/10 [00:02<00:01,  2.12trial/s, best loss: -0.9619047619047619] 70%|█████▌  | 7/10 [00:03<00:01,  2.31trial/s, best loss: -0.9619047619047619] 80%|██████▍ | 8/10 [00:03<00:00,  2.42trial/s, best loss: -0.9619047619047619] 90%|███████▏| 9/10 [00:03<00:00,  2.56trial/s, best loss: -0.9619047619047619]100%|███████| 10/10 [00:04<00:00,  2.79trial/s, best loss: -0.9619047619047619]100%|███████| 10/10 [00:04<00:00,  2.44trial/s, best loss: -0.9619047619047619]

In [8]:
clf_pred = predict(clf_trained, clf_tst_X)
clf_accu = score(:accuracy, clf_pred, clf_tst_y)

93.33333333333333

## AutoML for regressor pipeline

This example uses Lale for combined algorithm selection and hyperparameter tuning
on a regressor pipeline.

In [9]:
reg_tr_X = training[:,1:3] |> DataFrame
reg_tr_y = training[:,4]   |> Vector
reg_tst_X = testing[:,1:3] |> DataFrame
reg_tst_y = testing[:,4]   |> Vector;

In [10]:
PCA = laleoperator("PCA")
NoOp = laleoperator("NoOp", "lale")
LinearRegression = laleoperator("LinearRegression")
RandomForestRegressor = laleoperator("RandomForestRegressor");

In [11]:
reg_planned = (PCA | NoOp) >> (LinearRegression | RandomForestRegressor);

In [12]:
reg_hopt = LalePipeOptimizer(reg_planned, max_evals=10, cv=3)
reg_trained = fit(reg_hopt, reg_tr_X, reg_tr_y);


  0%|                                   | 0/10 [00:00<?, ?trial/s, best loss=?] 10%|▊       | 1/10 [00:00<00:02,  3.54trial/s, best loss: -0.8483038533149762] 20%|█▌      | 2/10 [00:00<00:01,  4.99trial/s, best loss: -0.9223868679195496] 30%|██▍     | 3/10 [00:00<00:01,  4.75trial/s, best loss: -0.9223868679195496] 40%|███▏    | 4/10 [00:00<00:01,  4.78trial/s, best loss: -0.9223868679195496] 50%|████    | 5/10 [00:01<00:01,  3.60trial/s, best loss: -0.9223868679195496] 60%|████▊   | 6/10 [00:01<00:01,  2.95trial/s, best loss: -0.9223868679195496] 70%|█████▌  | 7/10 [00:01<00:00,  3.33trial/s, best loss: -0.9223868679195496] 80%|██████▍ | 8/10 [00:02<00:00,  3.91trial/s, best loss: -0.9233280057705039] 90%|███████▏| 9/10 [00:02<00:00,  4.03trial/s, best loss: -0.9233280057705039]100%|███████| 10/10 [00:02<00:00,  4.18trial/s, best loss: -0.9233280057705039]100%|███████| 10/10 [00:02<00:00,  3.93trial/s, best loss: -0.9233280057705039]

In [13]:
reg_pred = predict(reg_trained, reg_tst_X)
reg_rmse = score(:rmse, reg_pred, reg_tst_y)

0.1511810672491045

In [14]:
using Distributed

In [15]:
nprocs == 1 && addprocs()
@everywhere using Lale
@everywhere using Statistics
@everywhere using Random: seed!
@everywhere using DataFrames
@everywhere using DataFrames: DataFrame, nrow

In [16]:
workers()

1-element Vector{Int64}:
 1

In [17]:
trials=10
results = @distributed (vcat) for i in 1:trials
    clf_planned = (PCA & RobustScaler) >> ConcatFeatures >> (LogisticRegression | RandomForestClassifier)
    clf_hopt = LalePipeOptimizer(clf_planned, max_evals=5, cv=3)
    clf_trained = fit(clf_hopt, clf_tr_X, clf_tr_y)
    clf_pred = predict(clf_trained, clf_tst_X)
    clf_accu = score(:accuracy, clf_pred, clf_tst_y)
    println(clf_accu)
    clf_accu
end


100%|█████████| 5/5 [00:02<00:00,  2.27trial/s, best loss: -0.9428571428571427]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.20trial/s, best loss: -0.9523809523809522]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.27trial/s, best loss: -0.9523809523809522]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.38trial/s, best loss: -0.9619047619047619]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.20trial/s, best loss: -0.9523809523809522]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.19trial/s, best loss: -0.9523809523809522]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.26trial/s, best loss: -0.9523809523809524]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.22trial/s, best loss: -0.9619047619047619]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.05trial/s, best loss: -0.9523809523809522]93.33333333333333

100%|█████████| 5/5 [00:02<00:00,  2.35trial/s, best loss: -0.9428571428571427]95.55555555555556


10-element Vector{Float64}:
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 93.33333333333333
 95.55555555555556

In [18]:
results |> mean

93.55555555555557