Copyright (c) 2021. All rights reserved.

Contributed by: @bnriiitb

Licensed under the MIT License.

# Using AutoML in Sklearn Pipeline

This tutorial shows how FLAML's AutoML can be used as the final estimator step in a scikit-learn pipeline.


## 1. Introduction

### 1.1 FLAML - Fast and Lightweight AutoML

FLAML is a Python library (https://github.com/microsoft/FLAML) designed to automatically produce accurate machine learning models at low computational cost. It is fast and economical. Its simple and lightweight design makes it easy to use and extend, for example by adding new learners.

FLAML can
- serve as an economical AutoML engine,
- be used as a fast hyperparameter tuning tool, or
- be embedded in self-tuning software that requires low latency and low resource usage in repetitive tuning tasks.

In this notebook, we use one real data example (binary classification) to showcase how to use the FLAML library.

FLAML requires `Python>=3.8`. To run this notebook example, install flaml with the `[automl]` option (this option was introduced in version 2; in version 1 it is installed by default):
```bash
pip install flaml[automl]
```

In [44]:
%pip install flaml[automl] openml

### 1.2 Why are pipelines a silver bullet?

In a typical machine learning workflow we have to apply all the transformations at least twice. 
1. During Training
2. During Inference

Scikit-learn pipelines provide an easy-to-use interface for automating ML workflows by allowing several transformers to be chained together.
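As a minimal sketch of this idea, the pipeline below is fitted once and then reused unchanged at inference time. The data values are made up for illustration, and `LogisticRegression` is only a stand-in for whatever final estimator is used.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic data set (hypothetical values) with one missing entry.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[2.5, np.nan]])

pipe = Pipeline([
    ("imputer", SimpleImputer()),        # default strategy: column mean of the training data
    ("standardizer", StandardScaler()),
    ("clf", LogisticRegression()),       # stand-in for the final estimator
])

pipe.fit(X_train, y_train)   # training: every step is fitted and applied in order
pred = pipe.predict(X_new)   # inference: the already-fitted steps are re-applied

# The imputer remembers the training-set means and reuses them on new data.
imputer_means = pipe.named_steps["imputer"].statistics_
print(imputer_means)
```

The preprocessing logic is written once, so training and inference cannot drift apart.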

The key benefits of using pipelines:
* Make ML workflows highly readable, enabling fast development and easy review
* Help to build sequential and parallel processes
* Allow hyperparameter tuning across the estimators
* Make it easier to share work and collaborate with multiple users (bug fixes, enhancements, etc.)
* Enforce the implementation and order of steps

#### Because FLAML's AutoML can be used as the final estimator in a scikit-learn pipeline, we get all the benefits of pipelines and can write extremely clean, reusable code.

## 2. Classification Example
### Load data and preprocess

Download the [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given information about its scheduled departure.

In [1]:
from flaml.automl.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169, data_dir='./', random_state=1234, dataset_format='array')

download dataset from openml
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)


In [2]:
X_train[0]

array([  12., 2648.,    4.,   15.,    4.,  450.,   67.], dtype=float32)

## 3. Create a Pipeline

In [3]:
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from flaml import AutoML

set_config(display='diagram')

imputer = SimpleImputer()
standardizer = StandardScaler()
automl = AutoML()

automl_pipeline = Pipeline([
    ("imputer", imputer),
    ("standardizer", standardizer),
    ("automl", automl)
])
automl_pipeline

### Run FLAML
In the FLAML automl run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. For example, the default ML learners of FLAML are `['lgbm', 'xgboost', 'catboost', 'rf', 'extra_tree', 'lrl1']`. 

In [4]:
automl_settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": 'accuracy',  # primary metrics can be chosen from: ['accuracy','roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'f1','log_loss','mae','mse','r2']
    "task": 'classification',  # task type   
    "estimator_list": ['xgboost','catboost','lgbm'],
    "log_file_name": 'airlines_experiment.log',  # flaml log file
}
pipeline_settings = {f"automl__{key}": value for key, value in automl_settings.items()}
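The `automl__` prefix follows scikit-learn's `<step>__<parameter>` convention: keyword arguments given to `Pipeline.fit` are split on the first double underscore and forwarded to the `fit` method of the named step. The sketch below illustrates this with `RecordingEstimator`, a hypothetical stand-in that only records what it receives; it is not part of FLAML or scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingEstimator(BaseEstimator):
    """Hypothetical stand-in for AutoML: stores the keyword arguments passed to fit()."""

    def fit(self, X, y=None, **fit_params):
        self.fit_params_ = fit_params
        return self


settings = {"time_budget": 60, "metric": "accuracy", "task": "classification"}
prefixed = {f"automl__{key}": value for key, value in settings.items()}

pipe = Pipeline([
    ("standardizer", StandardScaler()),
    ("automl", RecordingEstimator()),
])
pipe.fit(np.array([[0.0], [1.0]]), np.array([0, 1]), **prefixed)

# The final step sees the settings with the "automl__" prefix stripped.
received = pipe.named_steps["automl"].fit_params_
print(received)
```

Because the prefix must match the step name, renaming the `"automl"` step in the pipeline also requires changing the prefix used to build `pipeline_settings`.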

In [5]:
automl_pipeline.fit(X_train, y_train, **pipeline_settings)

[flaml.automl: 06-22 08:01:43] {2390} INFO - task = classification
[flaml.automl: 06-22 08:01:43] {2392} INFO - Data split method: stratified
[flaml.automl: 06-22 08:01:43] {2396} INFO - Evaluation method: holdout
[flaml.automl: 06-22 08:01:44] {2465} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 06-22 08:01:44] {2605} INFO - List of ML learners in AutoML Run: ['xgboost', 'catboost', 'lgbm']
[flaml.automl: 06-22 08:01:44] {2897} INFO - iteration 0, current learner xgboost
[flaml.automl: 06-22 08:01:44] {3025} INFO - Estimated sufficient time budget=105341s. Estimated necessary time budget=116s.
[flaml.automl: 06-22 08:01:44] {3072} INFO -  at 0.7s,	estimator xgboost's best error=0.3755,	best estimator xgboost's best error=0.3755
[flaml.automl: 06-22 08:01:44] {2897} INFO - iteration 1, current learner lgbm
[flaml.automl: 06-22 08:01:44] {3072} INFO -  at 0.9s,	estimator lgbm's best error=0.3814,	best estimator xgboost's best error=0.3755
[flaml.automl: 06-22 08:01:44] {2897

In [9]:
# Get the automl object from the pipeline
automl = automl_pipeline.steps[2][1]

# Get the best config and best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 63, 'max_leaves': 1797, 'min_child_weight': 0.07275175679381725, 'learning_rate': 0.06234183309508761, 'subsample': 0.9814772488195874, 'colsample_bylevel': 0.810466508891351, 'colsample_bytree': 0.8005378817953572, 'reg_alpha': 0.5768305704485758, 'reg_lambda': 6.867180836557797, 'FLAML_sample_size': 364083}
Best accuracy on validation data: 0.6721
Training duration of best run: 15.45 s


In [10]:
automl.model

<flaml.automl.model.XGBoostSklearnEstimator at 0x7f03a5eada00>

## 4. Persist the model binary file

In [11]:
# Persist the automl object as a pickle file
import pickle
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
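Note that the cell above stores only the AutoML step. A reloaded copy of it does not include the imputer and standardizer, so new data would have to be preprocessed separately before calling `predict`. One alternative is to pickle the whole fitted pipeline. The sketch below shows that round trip on toy data, with `LogisticRegression` as a stand-in for the AutoML step and a temporary file path chosen only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) and a small fitted pipeline.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
pipe = Pipeline([
    ("standardizer", StandardScaler()),
    ("clf", LogisticRegression()),   # stand-in for the AutoML step
])
pipe.fit(X, y)

# Save the entire pipeline, preprocessing included.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f, pickle.HIGHEST_PROTOCOL)

# Load it back and check that it predicts exactly like the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```

As with any pickle, the file should only be loaded from a trusted source and with compatible library versions.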

In [12]:
# Perform inference on the test dataset
y_pred = automl_pipeline.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)
y_pred_proba = automl_pipeline.predict_proba(X_test)[:,1]
print('Predicted probas ',y_pred_proba[:5])

Predicted labels [0 1 1 ... 0 1 0]
True labels [0 0 0 ... 1 0 1]
Predicted probas  [0.3764987  0.6126277  0.699604   0.27359942 0.25294745]
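To turn such predictions into a score, the labels and probabilities can be compared against the true labels. The sketch below uses scikit-learn's metric functions on small made-up arrays; with the notebook's variables one would pass `y_test`, `y_pred`, and `y_pred_proba` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true labels and positive-class probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.7, 0.9])
y_hat = (proba >= 0.5).astype(int)   # threshold the probabilities at 0.5

accuracy = accuracy_score(y_true, y_hat)   # fraction of correctly predicted labels
auc = roc_auc_score(y_true, proba)         # ranking quality of the probabilities

print(f"accuracy = {accuracy:.4g}")
print(f"roc_auc  = {auc:.4g}")
```

Accuracy matches the metric optimized in this notebook, while ROC AUC uses the predicted probabilities and is independent of the decision threshold.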
