<a href="https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/zeroshot_lightgbm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright (c) FLAML authors. All rights reserved. 

Licensed under the MIT License.

# Zero-shot AutoML with FLAML


## Introduction

In this notebook, we demonstrate a basic use case of zero-shot AutoML with FLAML.

FLAML requires `Python>=3.8`. To run this notebook example, please install the [autozero] option:

In [1]:
# %pip install flaml[autozero] lightgbm openml;

## What is zero-shot AutoML?

Zero-shot AutoML refers to AutoML systems that skip expensive tuning yet still adapt to the data.
A zero-shot AutoML system recommends a data-dependent default configuration for a given dataset.

Think about what happens when you use an `LGBMRegressor`. When you initialize an `LGBMRegressor` without any arguments, it sets all hyperparameters to the default values preset by the lightgbm library.
There is no doubt that these default values have been carefully chosen by the library developers.
But they are static. They are not adaptive to different datasets.


In [2]:
from lightgbm import LGBMRegressor
estimator = LGBMRegressor()
print(estimator.get_params())

{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}


It is unlikely that 100 trees with 31 leaves each is the best hyperparameter setting for every dataset.

So, we propose to recommend data-dependent default configurations at runtime. 
All you need to do is to import the `LGBMRegressor` from flaml.default instead of from lightgbm.


In [3]:
from flaml.default import LGBMRegressor

Other parts of code remain the same. The new `LGBMRegressor` will automatically choose a configuration according to the training data.
For different training data the configuration could be different.
The recommended configuration can be either the same as the static default configuration from the library, or different.
It is expected to be no worse than the static default configuration in most cases.

For example, let's download the [houses dataset](https://www.openml.org/d/537) from OpenML. The task is to predict the median house price in a region based on the demographic composition and the state of the housing market in that region.

In [5]:
from flaml.automl.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir='./')

download dataset from openml
Dataset name: houses
X_train.shape: (15480, 8), y_train.shape: (15480,);
X_test.shape: (5160, 8), y_test.shape: (5160,)


In [6]:
print(X_train)

       median_income  housing_median_age  total_rooms  total_bedrooms  \
19226         7.3003                  19       4976.0           711.0   
14549         5.9547                  18       1591.0           268.0   
9093          3.2125                  19        552.0           129.0   
12213         6.9930                  13        270.0            42.0   
12765         2.5162                  21       3260.0           763.0   
...              ...                 ...          ...             ...   
13123         4.4125                  20       1314.0           229.0   
19648         2.9135                  27       1118.0           195.0   
9845          3.1977                  31       1431.0           370.0   
10799         5.6315                  34       2125.0           498.0   
2732          1.3882                  15       1171.0           328.0   

       population  households  latitude  longitude  
19226      1926.0       625.0     38.46    -122.68  
14549       547.0

We fit the `flaml.default.LGBMRegressor` on this dataset.

In [7]:
estimator = LGBMRegressor()  # imported from flaml.default
estimator.fit(X_train, y_train)
print(estimator.get_params())

INFO:flaml.default.suggest:metafeature distance: 0.02197989436019765


{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.7019911744574896, 'importance_type': 'split', 'learning_rate': 0.022635758411078528, 'max_depth': -1, 'min_child_samples': 2, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 4797, 'n_jobs': -1, 'num_leaves': 122, 'objective': None, 'random_state': None, 'reg_alpha': 0.004252223402511765, 'reg_lambda': 0.11288241427227624, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'max_bin': 511, 'verbose': -1}


The configuration is adapted as shown here. 
The number of trees is 4797, the number of leaves is 122.
Does it work better than the static default configuration?
Let’s compare.


In [8]:
estimator.score(X_test, y_test)

0.8537444671194614

The data-dependent configuration achieves an $r^2$ of 0.8537 on the test data. What about the static default configuration from lightgbm?

In [9]:
from lightgbm import LGBMRegressor
estimator = LGBMRegressor()
estimator.fit(X_train, y_train)
estimator.score(X_test, y_test)

0.8296179648694404

The static default configuration gets $r^2=0.8296$, about 0.024 lower than the 0.8537 achieved by the data-dependent configuration from `flaml.default`.
Again, the only difference in the code is from where you import the `LGBMRegressor`.
The adaptation to the training dataset is under the hood.
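
For reference, the `score` method of a scikit-learn-style regressor reports the coefficient of determination, $r^2 = 1 - SS_{res}/SS_{tot}$. The following is a minimal pure-Python sketch of that formula. The helper name `r2_score_sketch` and the numbers are made up for illustration; they are not the houses data.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_sketch(y_true, y_true)                # exact predictions
baseline = r2_score_sketch(y_true, [6.0] * 4)            # always predict the mean
close = r2_score_sketch(y_true, [2.0, 5.0, 8.0, 9.0])    # small errors on two points

print(perfect, baseline, close)
```

Predicting the mean of the targets gives a score of 0 and exact predictions give 1, so a higher score means less unexplained variance.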

You might wonder how it is possible to find a data-dependent configuration without tuning.
The answer is that flaml mines hyperparameter configurations across different datasets offline, as a preparation step.
That offline work is what allows it to recommend good data-dependent default configurations at runtime without tuning.
In effect, zero-shot AutoML shifts the tuning cost from online to offline.
In the offline preparation stage, we applied `flaml.AutoML`.

### Benefit of zero-shot AutoML
Now, what is the benefit of zero-shot automl? Or what is the benefit of shifting tuning from online to offline?
The first benefit is the online computational cost. That is the cost paid by the final consumers of automl. They only need to train one model.
They get the hyperparameter configuration right away. There is no overhead to worry about.
Another big benefit is that your code doesn’t need to change. So if you currently have a workflow without the setup for tuning, you can use zero-shot automl without breaking that workflow.
Compared to tuning-based AutoML, zero-shot AutoML requires less input. For example, it doesn’t need a tuning budget, a resampling strategy, or a validation dataset.
A related benefit is that you don’t need to worry about holding a subset of the training data for validation, which the tuning process might overfit.
As there is no tuning, you can use all the training data to train your model.
Finally, you can customize the offline preparation for a domain, and leverage the past tuning experience for better adaptation to similar tasks.

## How to use at runtime
The easiest way to leverage this technique is to import a "flamlized" learner of your choice and use it just as you used the learner before.
The automation is done behind the scenes.
The current list of “flamlized” learners is:

* LGBMClassifier, LGBMRegressor (inheriting LGBMClassifier, LGBMRegressor from lightgbm)
* XGBClassifier, XGBRegressor (inheriting XGBClassifier, XGBRegressor from xgboost)
* RandomForestClassifier, RandomForestRegressor (inheriting from scikit-learn)
* ExtraTreesClassifier, ExtraTreesRegressor (inheriting from scikit-learn)

They work for classification or regression tasks.

### What's the magic behind the scene?
`flaml.default.LGBMRegressor` inherits `lightgbm.LGBMRegressor`, so all the methods and attributes in `lightgbm.LGBMRegressor` are still valid in `flaml.default.LGBMRegressor`.
The difference is, `flaml.default.LGBMRegressor` decides the hyperparameter configurations based on the training data. It would use a different configuration if it is predicted to outperform the original data-independent default. If you inspect the params of the fitted estimator, you can find what configuration is used. If the original default configuration is used, then it is equivalent to the original estimator.
The recommendation of which configuration to use is based on offline AutoML run results. Information about the training dataset, such as its size, is used to recommend a data-dependent configuration. The recommendation takes negligible time. Training can be faster or slower than with the original default configuration, depending on the recommended configuration.
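
The `metafeature distance` line logged during `fit` suggests that the recommendation compares meta-features of the training data with those of datasets seen offline. The following is a toy sketch of that general idea, not FLAML's actual implementation. The portfolio entries, the choice of meta-features, the log scaling, and the Euclidean distance are all assumptions made for illustration.

```python
import math

# Hypothetical portfolio mined offline: meta-features of past datasets,
# each paired with a configuration that worked well on that dataset.
PORTFOLIO = [
    {"meta": {"n_rows": 1_000, "n_features": 10},
     "config": {"n_estimators": 100, "num_leaves": 31}},
    {"meta": {"n_rows": 20_000, "n_features": 8},
     "config": {"n_estimators": 2000, "num_leaves": 128}},
    {"meta": {"n_rows": 500_000, "n_features": 50},
     "config": {"n_estimators": 500, "num_leaves": 255}},
]


def meta_vector(meta):
    # Log-scale the counts so sizes spanning orders of magnitude are comparable.
    return [math.log10(meta["n_rows"]), math.log10(meta["n_features"])]


def suggest_config(n_rows, n_features):
    """Return the config of the nearest portfolio entry and its distance."""
    query = meta_vector({"n_rows": n_rows, "n_features": n_features})

    def distance(entry):
        return math.dist(query, meta_vector(entry["meta"]))

    best = min(PORTFOLIO, key=distance)
    return best["config"], distance(best)


# A dataset shaped like the houses training split (15480 rows, 8 features).
config, dist = suggest_config(n_rows=15_480, n_features=8)
small_config, _ = suggest_config(n_rows=800, n_features=12)
print(config, round(dist, 3))
```

Because the lookup only computes a few summary statistics and a handful of distances, it costs almost nothing compared with training even one model, which is consistent with the claim that the recommendation is near-instant.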

### Can I check the configuration before training?
Yes. You can use the `suggest_hyperparams()` method to find the suggested configuration.
For example, when you run the following code with the houses dataset, it will return the hyperparameter configuration instantly, without training the model.

In [10]:
from flaml.default import LGBMRegressor

estimator = LGBMRegressor()
hyperparams, _, _, _ = estimator.suggest_hyperparams(X_train, y_train)
print(hyperparams)

INFO:flaml.default.suggest:metafeature distance: 0.02197989436019765


{'n_estimators': 4797, 'num_leaves': 122, 'min_child_samples': 2, 'learning_rate': 0.022635758411078528, 'colsample_bytree': 0.7019911744574896, 'reg_alpha': 0.004252223402511765, 'reg_lambda': 0.11288241427227624, 'max_bin': 511, 'verbose': -1}


You can print the configuration as a dictionary, in case you want to check it before you use it for training.

This brings up an equivalent, open-box way for zero-shot AutoML if you would like more control over the training. 
Import the function `preprocess_and_suggest_hyperparams` from `flaml.default`.
This function takes the task name, the training dataset, and the estimator name as input:

In [11]:
from flaml.default import preprocess_and_suggest_hyperparams
(
    hyperparams,
    estimator_class,
    X_transformed,
    y_transformed,
    feature_transformer,
    label_transformer,
) = preprocess_and_suggest_hyperparams("regression", X_train, y_train, "lgbm")

INFO:flaml.default.suggest:metafeature distance: 0.02197989436019765


It outputs the hyperparameter configurations, estimator class, transformed data, feature transformer and label transformer.


In [12]:
print(estimator_class)

<class 'lightgbm.sklearn.LGBMRegressor'>


In this case, the estimator name is “lgbm”. The corresponding estimator class is `lightgbm.LGBMRegressor`.
This line initializes an `LGBMRegressor` with the recommended hyperparameter configuration:

In [13]:
model = estimator_class(**hyperparams)

Then we can fit the model on the transformed data.

In [14]:
model.fit(X_transformed, y_train)

The feature transformer needs to be applied to the test data before prediction.

In [15]:
X_test_transformed = feature_transformer.transform(X_test)
y_pred = model.predict(X_test_transformed)

These steps are automated when you use the "flamlized" learner, so you don’t need to know the details unless you want to open the box.
We demonstrate them here to help you understand what’s going on and, in case you need to modify some steps, to show you what to do.

(Note that some classifiers like XGBClassifier require the labels to be integers, while others do not. So you can decide whether to use the transformed labels y_transformed and the label transformer label_transformer. Also, each estimator may require specific preprocessing of the data.)
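
To illustrate why a label transformer matters for classifiers that expect integer labels, here is a minimal pure-Python sketch of an encode/decode round trip. The class `SimpleLabelEncoder` is hypothetical and only stands in for whatever label transformer `preprocess_and_suggest_hyperparams` returns; it is not FLAML's code.

```python
class SimpleLabelEncoder:
    """Toy stand-in for a label transformer (hypothetical, for illustration)."""

    def fit(self, labels):
        # Sort the distinct labels so the label-to-integer mapping is deterministic.
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Original labels -> integer codes that a classifier can consume.
        return [self._index[label] for label in labels]

    def inverse_transform(self, codes):
        # Integer predictions -> original labels.
        return [self.classes_[code] for code in codes]


encoder = SimpleLabelEncoder().fit(["cat", "dog", "cat", "bird"])
encoded = encoder.transform(["dog", "bird", "cat"])
decoded = encoder.inverse_transform(encoded)
print(encoded, decoded)
```

In the open-box workflow, the same pattern would apply: fit the classifier on the transformed labels, then map its integer predictions back to the original labels before reporting them.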

## Combine Zero-shot AutoML and HPO

Zero-shot AutoML is fast and simple to use. It is very useful if speed and simplicity are the primary concerns.
If you are not satisfied with the accuracy of the zero-shot model, you may want to spend extra time tuning the model.
You can use `flaml.AutoML` to do that. Everything is the same as in your normal `AutoML.fit()` call, except that you set `starting_points="data"`.
This tells AutoML to start the tuning from the data-dependent default configurations. You can set the tuning budget in the same way as before.
Note that if you set `max_iter=0` and `time_budget=None`, you are effectively using zero-shot AutoML.
When `estimator_list` is omitted, the most promising estimator and its hyperparameter configuration, both decided by zero-shot AutoML, will be tried first.

In [16]:
from flaml import AutoML

automl = AutoML()
settings = {
    "task": "regression",
    "starting_points": "data",
    "estimator_list": ["lgbm"],
    "time_budget": 300,
}
automl.fit(X_train, y_train, **settings)

[flaml.automl.logger: 04-28 02:51:45] {1663} INFO - task = regression
[flaml.automl.logger: 04-28 02:51:45] {1670} INFO - Data split method: uniform
[flaml.automl.logger: 04-28 02:51:45] {1673} INFO - Evaluation method: cv
[flaml.automl.logger: 04-28 02:51:45] {1771} INFO - Minimizing error metric: 1-r2


INFO:flaml.default.suggest:metafeature distance: 0.02197989436019765
INFO:flaml.default.suggest:metafeature distance: 0.006677018633540373


[flaml.automl.logger: 04-28 02:51:45] {1881} INFO - List of ML learners in AutoML Run: ['lgbm']
[flaml.automl.logger: 04-28 02:51:45] {2191} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 04-28 02:53:39] {2317} INFO - Estimated sufficient time budget=1134156s. Estimated necessary time budget=1134s.
[flaml.automl.logger: 04-28 02:53:39] {2364} INFO -  at 113.5s,	estimator lgbm's best error=0.1513,	best estimator lgbm's best error=0.1513
[flaml.automl.logger: 04-28 02:53:39] {2191} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 04-28 02:55:32] {2364} INFO -  at 226.6s,	estimator lgbm's best error=0.1513,	best estimator lgbm's best error=0.1513
[flaml.automl.logger: 04-28 02:55:54] {2600} INFO - retrain lgbm for 22.3s
[flaml.automl.logger: 04-28 02:55:54] {2603} INFO - retrained model: LGBMRegressor(colsample_bytree=0.7019911744574896,
              learning_rate=0.02263575841107852, max_bin=511,
              min_child_samples=2, n_estimators=4797, num_lea