Copyright (c) 2021. All rights reserved.

Contributed by: @bnriiitb

Licensed under the MIT License.

# Using AutoML in Sklearn Pipeline

This tutorial shows how FLAML's AutoML can be used as the final estimator step in a scikit-learn pipeline.


## 1. Introduction

### 1.1 FLAML - Fast and Lightweight AutoML

FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models with low computational cost. It is fast and cheap. Its simple and lightweight design makes it easy to use and extend, for example by adding new learners.

FLAML can
- serve as an economical AutoML engine,
- be used as a fast hyperparameter tuning tool, or
- be embedded in self-tuning software that requires low latency and low resource use in repetitive tuning tasks.

In this notebook, we use one real-data example (binary classification) to showcase how to use the FLAML library.

FLAML requires `Python>=3.6`. To run this notebook example, please install flaml with the `notebook` option:
```bash
pip install flaml[notebook]
```

### 1.2 Why are pipelines a silver bullet?

In a typical machine learning workflow, every transformation has to be applied at least twice:
1. During training
2. During inference

Scikit-learn pipelines provide an easy-to-use interface for automating ML workflows by chaining several transformers together, optionally ending with a final estimator.
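
As a minimal sketch of this idea, the snippet below fits a small pipeline once and then reuses it for inference. The toy arrays and the `LogisticRegression` stand-in are assumptions made for illustration only; they are not part of the airlines example used later in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, one missing value, binary labels.
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train_toy = np.array([0, 0, 1, 1])
X_new_toy = np.array([[np.nan, 35.0], [1.5, 12.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer()),          # fills NaNs with the training column mean
    ("standardizer", StandardScaler()),    # scales with training mean and std
    ("classifier", LogisticRegression()),  # stand-in for any final estimator
])

# Training: each transformer is fitted and applied, then the classifier is fitted.
toy_pipeline.fit(X_train_toy, y_train_toy)

# Inference: the already-fitted transformers are re-applied automatically.
toy_predictions = toy_pipeline.predict(X_new_toy)
print(toy_predictions)
```

The same `toy_pipeline` object carries the statistics learned during training, so the transformations do not have to be repeated by hand at inference time.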

The key benefits of using pipelines:
* Make ML workflows highly readable, enabling fast development and easy review
* Help to build sequential and parallel processes
* Allow hyperparameter tuning across the estimators
* Make workflows easier to share and collaborate on with multiple users (bug fixes, enhancements, etc.)
* Enforce the implementation and order of steps
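
To illustrate the point about hyperparameter tuning across steps, here is a small sketch that uses scikit-learn's `GridSearchCV` with the `<step>__<parameter>` naming convention. The random toy data, the step names, and the grid values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 60 samples, 3 features, label depends on two features.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("standardizer", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Keys follow the "<step name>__<parameter name>" convention, so a single
# search can tune parameters of different steps at the same time.
param_grid = {
    "standardizer__with_mean": [True, False],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

The same double-underscore convention is what allows arguments to be routed to a specific pipeline step in `fit`, which is used later in this notebook.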

#### Because FLAML's AutoML module can be used as the final step of a scikit-learn pipeline, we get all the benefits of pipelines and can write clean, reusable code.

In [44]:
!pip install flaml[notebook];

## 2. Classification Example
### Load data and preprocess

Download the [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

In [45]:
from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./', random_state=1234)

load dataset from ./openml_ds1169.pkl
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)


In [46]:
X_train[0]

array([  12., 2648.,    4.,   15.,    4.,  450.,   67.], dtype=float32)

## 3. Create a Pipeline

In [47]:
import sklearn
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from flaml import AutoML

set_config(display='diagram')

imputer = SimpleImputer()
standardizer = StandardScaler()
automl = AutoML()

automl_pipeline = Pipeline([
    ("imputer", imputer),
    ("standardizer", standardizer),
    ("automl", automl)
])
automl_pipeline

### Run FLAML
In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`. 

In [48]:
settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": 'accuracy',  # primary metric; can be chosen from: ['accuracy','roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'f1','log_loss','mae','mse','r2']
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'catboost', 'lgbm'],  # learners to search over
    "log_file_name": 'airlines_experiment.log',  # flaml log file
}
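
Inside a pipeline, fit-time arguments reach a step only when their names carry the step name as a prefix, for example `automl__time_budget`. As a sketch, a small helper can add that prefix to a whole settings dictionary. The helper name `prefix_fit_params` and the `example_settings` dictionary are hypothetical and are not part of FLAML or scikit-learn.

```python
def prefix_fit_params(step_name, params):
    """Return a copy of params with keys rewritten as '<step_name>__<key>'."""
    return {f"{step_name}__{key}": value for key, value in params.items()}


# Hypothetical subset of the settings used in this notebook.
example_settings = {
    "time_budget": 60,
    "metric": "accuracy",
    "estimator_list": ["xgboost", "catboost", "lgbm"],
}

fit_kwargs = prefix_fit_params("automl", example_settings)
print(sorted(fit_kwargs))
# ['automl__estimator_list', 'automl__metric', 'automl__time_budget']
```

With such a helper, the prefixed arguments could be passed in one go as `automl_pipeline.fit(X_train, y_train, **fit_kwargs)`, assuming the pipeline step is named `automl` as in this notebook.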

In [49]:
automl_pipeline.fit(
    X_train, y_train,
    automl__time_budget=settings['time_budget'],
    automl__metric=settings['metric'],
    automl__estimator_list=settings['estimator_list'],
    automl__log_training_metric=True,
)

[flaml.automl: 08-09 19:49:30] {884} INFO - Evaluation method: holdout
[flaml.automl: 08-09 19:49:30] {591} INFO - Using StratifiedKFold
[flaml.automl: 08-09 19:49:30] {905} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 08-09 19:49:30] {924} INFO - List of ML learners in AutoML Run: ['xgboost', 'catboost', 'lgbm']
[flaml.automl: 08-09 19:49:30] {986} INFO - iteration 0  current learner xgboost
[flaml.automl: 08-09 19:49:30] {1134} INFO -  at 0.4s,	best xgboost's error=0.3755,	best xgboost's error=0.3755
[flaml.automl: 08-09 19:49:30] {986} INFO - iteration 1  current learner lgbm
[flaml.automl: 08-09 19:49:30] {1134} INFO -  at 0.4s,	best lgbm's error=0.3704,	best lgbm's error=0.3704
[flaml.automl: 08-09 19:49:30] {986} INFO - iteration 2  current learner xgboost
[flaml.automl: 08-09 19:49:30] {1134} INFO -  at 0.5s,	best xgboost's error=0.3755,	best lgbm's error=0.3704
[flaml.automl: 08-09 19:49:30] {986} INFO - iteration 3  current learner lgbm
[flaml.automl: 08-09 19:49:



[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 0.6s,	best xgboost's error=0.3643,	best xgboost's error=0.3643
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 6  current learner xgboost
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 0.7s,	best xgboost's error=0.3624,	best xgboost's error=0.3624
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 7  current learner xgboost
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 0.8s,	best xgboost's error=0.3605,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 8  current learner xgboost
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 0.8s,	best xgboost's error=0.3605,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 9  current learner lgbm
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 0.9s,	best lgbm's error=0.3704,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 10  current learner xgboost




[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 1.1s,	best xgboost's error=0.3605,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 11  current learner lgbm
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 1.1s,	best lgbm's error=0.3704,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 12  current learner xgboost
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 1.2s,	best xgboost's error=0.3605,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 13  current learner lgbm




[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 1.4s,	best lgbm's error=0.3658,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 14  current learner xgboost
[flaml.automl: 08-09 19:49:31] {1134} INFO -  at 1.4s,	best xgboost's error=0.3605,	best xgboost's error=0.3605
[flaml.automl: 08-09 19:49:31] {986} INFO - iteration 15  current learner lgbm
[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 1.6s,	best lgbm's error=0.3588,	best lgbm's error=0.3588
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 16  current learner xgboost
[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 1.6s,	best xgboost's error=0.3605,	best lgbm's error=0.3588
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 17  current learner lgbm




[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 1.7s,	best lgbm's error=0.3588,	best lgbm's error=0.3588
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 18  current learner lgbm
[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 1.8s,	best lgbm's error=0.3588,	best lgbm's error=0.3588
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 19  current learner lgbm




[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 2.0s,	best lgbm's error=0.3588,	best lgbm's error=0.3588
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 20  current learner xgboost
[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 2.1s,	best xgboost's error=0.3531,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 21  current learner catboost
[flaml.automl: 08-09 19:49:32] {1134} INFO -  at 2.3s,	best catboost's error=0.3595,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:32] {986} INFO - iteration 22  current learner xgboost
[flaml.automl: 08-09 19:49:33] {1134} INFO -  at 2.6s,	best xgboost's error=0.3531,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:33] {986} INFO - iteration 23  current learner catboost
[flaml.automl: 08-09 19:49:33] {1134} INFO -  at 2.8s,	best catboost's error=0.3595,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:33] {986} INFO - iteration 24  current learner lgbm
[flaml.automl: 08-09 19:49:33] {113



[flaml.automl: 08-09 19:49:33] {1134} INFO -  at 3.1s,	best catboost's error=0.3587,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:33] {986} INFO - iteration 26  current learner lgbm
[flaml.automl: 08-09 19:49:33] {1134} INFO -  at 3.2s,	best lgbm's error=0.3588,	best xgboost's error=0.3531
[flaml.automl: 08-09 19:49:33] {986} INFO - iteration 27  current learner lgbm




[flaml.automl: 08-09 19:49:33] {1134} INFO -  at 3.4s,	best lgbm's error=0.3517,	best lgbm's error=0.3517
[flaml.automl: 08-09 19:49:33] {986} INFO - iteration 28  current learner lgbm
[flaml.automl: 08-09 19:49:34] {1134} INFO -  at 3.6s,	best lgbm's error=0.3517,	best lgbm's error=0.3517
[flaml.automl: 08-09 19:49:34] {986} INFO - iteration 29  current learner xgboost




[flaml.automl: 08-09 19:49:34] {1134} INFO -  at 3.8s,	best xgboost's error=0.3527,	best lgbm's error=0.3517
[flaml.automl: 08-09 19:49:34] {986} INFO - iteration 30  current learner xgboost
[flaml.automl: 08-09 19:49:34] {1134} INFO -  at 3.9s,	best xgboost's error=0.3527,	best lgbm's error=0.3517
[flaml.automl: 08-09 19:49:34] {986} INFO - iteration 31  current learner xgboost
[flaml.automl: 08-09 19:49:35] {1134} INFO -  at 4.9s,	best xgboost's error=0.3517,	best xgboost's error=0.3517
[flaml.automl: 08-09 19:49:35] {986} INFO - iteration 32  current learner lgbm
[flaml.automl: 08-09 19:49:35] {1134} INFO -  at 4.9s,	best lgbm's error=0.3517,	best xgboost's error=0.3517
[flaml.automl: 08-09 19:49:35] {986} INFO - iteration 33  current learner xgboost




[flaml.automl: 08-09 19:49:35] {1134} INFO -  at 5.2s,	best xgboost's error=0.3517,	best xgboost's error=0.3517
[flaml.automl: 08-09 19:49:35] {986} INFO - iteration 34  current learner catboost
[flaml.automl: 08-09 19:49:35] {1134} INFO -  at 5.4s,	best catboost's error=0.3587,	best xgboost's error=0.3517
[flaml.automl: 08-09 19:49:35] {986} INFO - iteration 35  current learner lgbm
[flaml.automl: 08-09 19:49:36] {1134} INFO -  at 5.6s,	best lgbm's error=0.3514,	best lgbm's error=0.3514
[flaml.automl: 08-09 19:49:36] {986} INFO - iteration 36  current learner lgbm




[flaml.automl: 08-09 19:49:36] {1134} INFO -  at 5.8s,	best lgbm's error=0.3501,	best lgbm's error=0.3501
[flaml.automl: 08-09 19:49:36] {986} INFO - iteration 37  current learner lgbm
[flaml.automl: 08-09 19:49:36] {1134} INFO -  at 6.0s,	best lgbm's error=0.3501,	best lgbm's error=0.3501
[flaml.automl: 08-09 19:49:36] {986} INFO - iteration 38  current learner lgbm




[flaml.automl: 08-09 19:49:37] {1134} INFO -  at 6.7s,	best lgbm's error=0.3492,	best lgbm's error=0.3492
[flaml.automl: 08-09 19:49:37] {986} INFO - iteration 39  current learner lgbm




[flaml.automl: 08-09 19:49:37] {1134} INFO -  at 7.3s,	best lgbm's error=0.3492,	best lgbm's error=0.3492
[flaml.automl: 08-09 19:49:37] {986} INFO - iteration 40  current learner lgbm




[flaml.automl: 08-09 19:49:39] {1134} INFO -  at 9.5s,	best lgbm's error=0.3492,	best lgbm's error=0.3492
[flaml.automl: 08-09 19:49:39] {986} INFO - iteration 41  current learner xgboost
[flaml.automl: 08-09 19:49:42] {1134} INFO -  at 12.4s,	best xgboost's error=0.3517,	best lgbm's error=0.3492
[flaml.automl: 08-09 19:49:42] {986} INFO - iteration 42  current learner lgbm




[flaml.automl: 08-09 19:49:44] {1134} INFO -  at 14.3s,	best lgbm's error=0.3424,	best lgbm's error=0.3424
[flaml.automl: 08-09 19:49:44] {986} INFO - iteration 43  current learner lgbm




[flaml.automl: 08-09 19:49:45] {1134} INFO -  at 15.5s,	best lgbm's error=0.3424,	best lgbm's error=0.3424
[flaml.automl: 08-09 19:49:45] {986} INFO - iteration 44  current learner lgbm




[flaml.automl: 08-09 19:49:48] {1134} INFO -  at 18.2s,	best lgbm's error=0.3424,	best lgbm's error=0.3424
[flaml.automl: 08-09 19:49:48] {986} INFO - iteration 45  current learner lgbm




[flaml.automl: 08-09 19:49:49] {1134} INFO -  at 19.1s,	best lgbm's error=0.3407,	best lgbm's error=0.3407
[flaml.automl: 08-09 19:49:49] {986} INFO - iteration 46  current learner lgbm




[flaml.automl: 08-09 19:49:51] {1134} INFO -  at 20.8s,	best lgbm's error=0.3407,	best lgbm's error=0.3407
[flaml.automl: 08-09 19:49:51] {986} INFO - iteration 47  current learner catboost
[flaml.automl: 08-09 19:49:51] {1134} INFO -  at 21.0s,	best catboost's error=0.3587,	best lgbm's error=0.3407
[flaml.automl: 08-09 19:49:51] {986} INFO - iteration 48  current learner lgbm




[flaml.automl: 08-09 19:49:52] {1134} INFO -  at 22.2s,	best lgbm's error=0.3376,	best lgbm's error=0.3376
[flaml.automl: 08-09 19:49:52] {986} INFO - iteration 49  current learner lgbm




[flaml.automl: 08-09 19:49:53] {1134} INFO -  at 23.0s,	best lgbm's error=0.3376,	best lgbm's error=0.3376
[flaml.automl: 08-09 19:49:53] {986} INFO - iteration 50  current learner lgbm




[flaml.automl: 08-09 19:49:56] {1134} INFO -  at 26.5s,	best lgbm's error=0.3351,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:49:56] {986} INFO - iteration 51  current learner lgbm




[flaml.automl: 08-09 19:50:00] {1134} INFO -  at 29.9s,	best lgbm's error=0.3351,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:00] {986} INFO - iteration 52  current learner lgbm




[flaml.automl: 08-09 19:50:05] {1134} INFO -  at 35.0s,	best lgbm's error=0.3351,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:05] {986} INFO - iteration 53  current learner xgboost
[flaml.automl: 08-09 19:50:05] {1134} INFO -  at 35.3s,	best xgboost's error=0.3517,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:05] {986} INFO - iteration 54  current learner lgbm




[flaml.automl: 08-09 19:50:07] {1134} INFO -  at 36.9s,	best lgbm's error=0.3351,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:07] {986} INFO - iteration 55  current learner catboost
[flaml.automl: 08-09 19:50:07] {1134} INFO -  at 37.4s,	best catboost's error=0.3515,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:07] {986} INFO - iteration 56  current learner catboost
[flaml.automl: 08-09 19:50:08] {1134} INFO -  at 37.6s,	best catboost's error=0.3515,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:08] {986} INFO - iteration 57  current learner catboost
[flaml.automl: 08-09 19:50:08] {1134} INFO -  at 37.9s,	best catboost's error=0.3515,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:08] {986} INFO - iteration 58  current learner catboost
[flaml.automl: 08-09 19:50:08] {1134} INFO -  at 38.1s,	best catboost's error=0.3515,	best lgbm's error=0.3351
[flaml.automl: 08-09 19:50:08] {986} INFO - iteration 59  current learner catboost
[flaml.automl: 08-09 19:50:08] {11



[flaml.automl: 08-09 19:50:12] {1134} INFO -  at 42.5s,	best lgbm's error=0.3328,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:12] {986} INFO - iteration 62  current learner lgbm




[flaml.automl: 08-09 19:50:14] {1134} INFO -  at 44.4s,	best lgbm's error=0.3328,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:14] {986} INFO - iteration 63  current learner catboost
[flaml.automl: 08-09 19:50:15] {1134} INFO -  at 44.7s,	best catboost's error=0.3515,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:15] {986} INFO - iteration 64  current learner catboost
[flaml.automl: 08-09 19:50:18] {1134} INFO -  at 47.9s,	best catboost's error=0.3435,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:18] {986} INFO - iteration 65  current learner lgbm




[flaml.automl: 08-09 19:50:23] {1134} INFO -  at 52.8s,	best lgbm's error=0.3328,	best lgbm's error=0.3328




[flaml.automl: 08-09 19:50:26] {1156} INFO - retrain lgbm for 3.3s
[flaml.automl: 08-09 19:50:26] {986} INFO - iteration 66  current learner catboost
[flaml.automl: 08-09 19:50:27] {1134} INFO -  at 57.4s,	best catboost's error=0.3435,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:29] {1156} INFO - retrain catboost for 1.3s
[flaml.automl: 08-09 19:50:29] {986} INFO - iteration 67  current learner xgboost
[flaml.automl: 08-09 19:50:29] {1134} INFO -  at 58.9s,	best xgboost's error=0.3517,	best lgbm's error=0.3328
[flaml.automl: 08-09 19:50:30] {1156} INFO - retrain xgboost for 0.9s
[flaml.automl: 08-09 19:50:30] {1181} INFO - selected model: LGBMClassifier(colsample_bytree=0.7560357004495271,
               learning_rate=0.28478479182882205, max_bin=31, max_leaves=16,
               min_data_in_leaf=55, n_estimators=746, objective='binary',
               reg_alpha=0.0009765625, reg_lambda=0.032652090008547976,
               subsample=0.8847635935300631)
[flaml.automl: 08-09 19:5

In [51]:
# Get the automl object from the pipeline
automl = automl_pipeline.named_steps['automl']

# Get the best config and best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Best ML learner: lgbm
Best hyperparameter config: {'n_estimators': 746.0, 'max_leaves': 16.0, 'min_data_in_leaf': 55.0, 'learning_rate': 0.28478479182882205, 'subsample': 0.8847635935300631, 'log_max_bin': 5.0, 'colsample_bytree': 0.7560357004495271, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.032652090008547976, 'FLAML_sample_size': 364083}
Best accuracy on validation data: 0.6672
Training duration of best run: 3.921 s
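
The reported validation accuracy follows directly from the log: with `metric='accuracy'`, the log states that the minimized quantity is `1-accuracy`, and the last error reported for `lgbm` is 0.3328. A short worked check, using that value copied from the log:

```python
# Final validation error of the selected learner, copied from the log above.
best_validation_error = 0.3328

# The search minimizes 1 - accuracy, so accuracy is recovered as 1 - error.
best_validation_accuracy = 1 - best_validation_error
print(f"{best_validation_accuracy:.4g}")  # 0.6672
```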


In [52]:
automl.model

## 4. Persist the model binary file

In [53]:
# Persist the automl object as a pickle file
import pickle
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
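
Note that the cell above pickles only the `automl` object, so the fitted imputer and standardizer are not part of that file. If the preprocessing should travel with the model, one option is to pickle the whole pipeline instead. The sketch below shows the round trip on a toy pipeline; the toy data and the `LogisticRegression` stand-in are assumptions for illustration, and the same pattern is expected to apply to `automl_pipeline`.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data and a stand-in final estimator.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

fitted_pipeline = Pipeline([
    ("standardizer", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_toy, y_toy)

# Serialize to bytes here; pickle.dump to an open file works the same way.
payload = pickle.dumps(fitted_pipeline, pickle.HIGHEST_PROTOCOL)
restored_pipeline = pickle.loads(payload)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(
    fitted_pipeline.predict(X_toy), restored_pipeline.predict(X_toy)
)
print(same_predictions)
```

As with any pickle file, it should only be loaded from a trusted source and with library versions compatible with those used for saving.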

In [54]:
# Perform inference on the testing dataset
y_pred = automl_pipeline.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)
y_pred_proba = automl_pipeline.predict_proba(X_test)[:, 1]
print('Predicted probas ', y_pred_proba[:5])

Predicted labels [0 1 1 ... 0 1 0]
True labels [0 0 0 ... 1 0 1]
Predicted probas  [0.36424183 0.59111937 0.64600957 0.27020691 0.23272711]
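
To turn such predictions into summary numbers, scikit-learn's metric functions can be applied to the true labels and the predicted labels or probabilities. The sketch below uses small made-up arrays rather than the airlines results, so its numbers say nothing about the model above; on the real data, `y_test`, `y_pred`, and `y_pred_proba` would take their place.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and positive-class probabilities (not from the airlines data).
y_true_toy = np.array([0, 1, 1, 0, 1])
y_proba_toy = np.array([0.36, 0.59, 0.65, 0.27, 0.45])

# Threshold the probabilities at 0.5 to obtain hard labels.
y_pred_toy = (y_proba_toy >= 0.5).astype(int)

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)  # 4 of 5 labels match
toy_auc = roc_auc_score(y_true_toy, y_proba_toy)       # ranking quality of the scores
print(toy_accuracy, toy_auc)
```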
