# FLAML AutoML on Apache Spark 

|  | | |
|-----|-----|--------|
|![synapse](https://microsoft.github.io/SynapseML/img/logo.svg)| <img src="https://www.microsoft.com/en-us/research/uploads/prod/2020/02/flaml-1024x406.png" alt="drawing" width="200"/> | ![image-alt-text](https://th.bing.com/th/id/OIP.5aNnFabBKoYIYhoTrNc_CAHaHa?w=174&h=180&c=7&r=0&o=5&pid=1.7)|


<style>
td, th {
   border: none!important;
}
</style>
### Goal


## 1. Introduction

### FLAML
FLAML (https://github.com/microsoft/FLAML) is a Python library that automatically finds accurate machine learning models
at low computational cost. Its simple, lightweight design makes it easy
to use and to extend, for example by adding new learners. FLAML can
- serve as an economical AutoML engine,
- be used as a fast hyperparameter tuning tool, or
- be embedded in self-tuning software that requires low latency and low resource usage in repetitive
   tuning tasks.

In this notebook, we demonstrate how to use the FLAML library to run AutoML on SynapseML models and Apache Spark DataFrames. We also compare the results of FLAML AutoML with those of the default SynapseML LightGBM model.
 

In [1]:
%pip install flaml[synapse]==1.2.1 xgboost==1.6.1 pandas==1.5.1 numpy==1.23.4 --force-reinstall

StatementMeta(, 27, -1, Finished, Available)

Collecting flaml[synapse]@ git+https://github.com/microsoft/FLAML.git
  Cloning https://github.com/microsoft/FLAML.git to /tmp/pip-install-9bp9bnbp/flaml_f9ddffb8b30b4c1aaffd650b9b9ac29a
  Running command git clone --filter=blob:none --quiet https://github.com/microsoft/FLAML.git /tmp/pip-install-9bp9bnbp/flaml_f9ddffb8b30b4c1aaffd650b9b9ac29a
  Resolved https://github.com/microsoft/FLAML.git to commit 99bb0a8425a58a537ae34347c867b4bc05310471
  Preparing metadata (setup.py) ... done
Collecting xgboost==1.6.1
  Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 192.9/192.9 MB 22.1 MB/s eta 0:00:00
Collecting pandas==1.5.1
  Downloading pandas-1.5.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.2/12.2 MB 96.6 MB/s eta 0:00:00




Uncomment the `spark = _init_spark()` line at the end of the next cell if you are running in a local Spark environment.

In [None]:
def _init_spark():
    import pyspark

    spark = (
        pyspark.sql.SparkSession.builder.appName("MyApp")
        .master("local[2]")
        .config(
            "spark.jars.packages",
            (
                "com.microsoft.azure:synapseml_2.12:0.10.2,"
                "org.apache.hadoop:hadoop-azure:3.3.5,"
                "com.microsoft.azure:azure-storage:8.6.6"
            ),
        )
        .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
        .config("spark.sql.debug.maxToStringFields", "100")
        .getOrCreate()
    )
    return spark

# spark = _init_spark()

In [2]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

StatementMeta(automl, 27, 8, Finished, Available)

## Demo overview
In this example, we use FLAML & Apache Spark to build a classification model in order to predict bankruptcy.
1. **Tune**: Given an Apache Spark dataframe, we can use FLAML to tune a SynapseML Spark-based model.
2. **AutoML**: Given an Apache Spark dataframe, we can run AutoML to find the best classification model given our constraints.


## 2. Load data and preprocess

In [3]:
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv"
    )
)
# print dataset size
print("records read: " + str(df.count()))

StatementMeta(automl, 27, 9, Finished, Available)

records read: 6819


In [4]:
display(df)

StatementMeta(automl, 27, 10, Finished, Available)

SynapseWidget(Synapse.DataFrame, 27e3f6a9-6707-4f94-93cf-05ea98845414)

Split the dataset into training and test sets

In [19]:
train_raw, test_raw = df.randomSplit([0.8, 0.2], seed=41)

StatementMeta(automl, 27, 25, Finished, Available)
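Spark's `randomSplit` assigns each row to a split independently at random, so the `[0.8, 0.2]` weights are targets rather than exact sizes. The following is a minimal pure-Python sketch of that idea, not Spark's implementation; `random_split` is a hypothetical helper, and the counts it prints will not match the Spark run.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


# 6819 is the record count reported when the CSV was loaded.
train_rows, test_rows = random_split(range(6819), [0.8, 0.2], seed=41)
print(len(train_rows), len(test_rows))
```

Because the split is random per row, the test set ends up close to, but usually not exactly, 20% of the data.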

Add a featurizer that assembles the feature columns into a single vector column

In [20]:
from pyspark.ml.feature import VectorAssembler

feature_cols = df.columns[1:]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train_raw)["Bankrupt?", "features"]
test_data = featurizer.transform(test_raw)["Bankrupt?", "features"]

StatementMeta(automl, 27, 26, Finished, Available)

### Default SynapseML LightGBM

In [21]:
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(
    objective="binary", featuresCol="features", labelCol="Bankrupt?", isUnbalance=True
)

model = model.fit(train_data)

StatementMeta(automl, 27, 27, Finished, Available)

#### Model Prediction

In [22]:
def predict(model, test_data=test_data):
    from synapse.ml.train import ComputeModelStatistics

    predictions = model.transform(test_data)
    
    metrics = ComputeModelStatistics(
        evaluationMetric="classification",
        labelCol="Bankrupt?",
        scoredLabelsCol="prediction",
    ).transform(predictions)
    return metrics

default_metrics = predict(model)
default_metrics.show()

StatementMeta(automl, 27, 28, Finished, Available)

+---------------+--------------------+------------------+-------------------+------------------+------------------+
|evaluation_type|    confusion_matrix|          accuracy|          precision|            recall|               AUC|
+---------------+--------------------+------------------+-------------------+------------------+------------------+
| Classification|1253.0  20.0  \n2...|0.9627942293090357|0.42857142857142855|0.3409090909090909|0.6625990859101621|
+---------------+--------------------+------------------+-------------------+------------------+------------------+
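The metrics in the table above can be recomputed by hand from the confusion matrix. The sketch below assumes the layout `[[TN, FP], [FN, TP]]`. Only the top row (1253, 20) is visible in the output; the bottom row is truncated there, so the values 29 and 15 used here are inferred from the reported precision and recall, not read from the output. With these counts, the reported AUC coincides with the mean of recall and specificity, which is what one would expect if AUC is computed from hard 0/1 predictions rather than from scores.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute standard binary-classification metrics from raw counts."""
    total = tn + fp + fn + tp
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": recall,
        # Mean of the two per-class recalls; equals the ROC AUC of a
        # classifier that only outputs hard labels.
        "balanced_accuracy": (recall + specificity) / 2,
    }


# tn and fp are from the output above; fn and tp are inferred (assumption).
default_counts = dict(tn=1253, fp=20, fn=29, tp=15)
metrics = classification_metrics(**default_counts)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy looks high mainly because the negative class dominates; recall on the bankrupt class is the more telling number here.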



## Run FLAML Tune

In [23]:
train_data_sub, val_data_sub = train_data.randomSplit([0.8, 0.2], seed=41)

StatementMeta(automl, 27, 29, Finished, Available)

In [10]:
def train(lambdaL1, learningRate, numLeaves, numIterations, train_data=train_data_sub, val_data=val_data_sub):
    """
    This train() function:
     - takes hyperparameters as inputs (for tuning later)
     - returns the fitted model and its AUC score on the validation dataset

    Wrapping code as a function makes it easier to reuse the code later for tuning.
    """

    lgc = LightGBMClassifier(
        objective="binary",
        lambdaL1=lambdaL1,
        learningRate=learningRate,
        numLeaves=numLeaves,
        labelCol="Bankrupt?",
        numIterations=numIterations,
        isUnbalance=True,
        featuresCol="features",
    )

    model = lgc.fit(train_data)

    # Define an evaluation metric and evaluate the model on the validation dataset.
    eval_metric = predict(model, val_data)
    eval_metric = eval_metric.toPandas()['AUC'][0]

    return model, eval_metric

StatementMeta(automl, 27, 16, Finished, Available)

In [24]:
import flaml
import time

# define the search space
params = {
    "lambdaL1": flaml.tune.uniform(0.001, 1),
    "learningRate": flaml.tune.uniform(0.001, 1),
    "numLeaves": flaml.tune.randint(30, 100),
    "numIterations": flaml.tune.randint(100, 300),
}

# define the tune function
def flaml_tune(config):
    _, metric = train(**config)
    return {"auc": metric}

StatementMeta(automl, 27, 30, Finished, Available)
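To make the contract between the search space and the tune function concrete, here is a minimal sketch that uses plain random search from the standard library. It is not the BlendSearch algorithm that the tuning log in this notebook reports, and `toy_objective` is a made-up stand-in for `train()`, so its scores are not real AUC values. The names `SPACE`, `sample_config`, `toy_objective`, and `random_search` are hypothetical.

```python
import random

# Same ranges as the `params` dict above. Assumption: integer upper bounds
# are treated as exclusive, as in Ray Tune-style samplers.
SPACE = {
    "lambdaL1": ("uniform", 0.001, 1.0),
    "learningRate": ("uniform", 0.001, 1.0),
    "numLeaves": ("randint", 30, 100),
    "numIterations": ("randint", 100, 300),
}


def sample_config(rng):
    """Draw one configuration from SPACE."""
    config = {}
    for name, (kind, low, high) in SPACE.items():
        if kind == "uniform":
            config[name] = rng.uniform(low, high)
        else:
            config[name] = rng.randrange(low, high)
    return config


def toy_objective(config):
    """Hypothetical stand-in for train(); returns a made-up score in (0, 1]."""
    penalty = abs(config["learningRate"] - 0.05) + abs(config["numLeaves"] - 64) / 100
    return {"auc": 1.0 / (1.0 + penalty)}


def random_search(num_samples, seed=0):
    """Evaluate num_samples random configs and keep the one with the highest auc."""
    rng = random.Random(seed)
    best_config, best_result = None, None
    for _ in range(num_samples):
        config = sample_config(rng)
        result = toy_objective(config)
        if best_result is None or result["auc"] > best_result["auc"]:
            best_config, best_result = config, result
    return best_config, best_result


best_config, best_result = random_search(num_samples=100)
print(best_config, best_result)
```

The tune function only has to accept a config dict and return a dict containing the metric being optimized; the search algorithm decides which configs to try.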

In [25]:
analysis = flaml.tune.run(
    flaml_tune,
    params,
    time_budget_s=60,
    num_samples=100,
    metric="auc",
    mode="max",
    verbose=5,
    force_cancel=True,
    )

StatementMeta(automl, 27, 31, Finished, Available)

[flaml.tune.tune: 04-19 00:56:20] {508} INFO - Using search algorithm BlendSearch.
No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune
You passed a `space` parameter to OptunaSearch that contained unresolved search space definitions. OptunaSearch should however be instantiated with fully configured search spaces only. To use Ray Tune's automatic search space conversion, pass the space definition as part of the `config` argument to `tune.run()` instead.
[flaml.tune.tune: 04-19 00:56:20] {777} INFO - trial 1 config: {'lambdaL1': 0.09833464080607023, 'learningRate': 0.64761881525086, 'numLeaves': 30, 'numIterations': 172}
[flaml.tune.tune: 04-19 00:56:46] {197} INFO - result: {'auc': 0.7350263891359782, 'training_iteration': 0, 'config': {'lambdaL1': 0.0983346408060702

Best config and metric on validation data

In [26]:
tune_config = analysis.best_config
tune_metrics_val = analysis.best_result
print("Best config: ", tune_config)
print("Best metrics on validation data: ", tune_metrics_val)

StatementMeta(automl, 27, 32, Finished, Available)

Best config:  {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}
Best metrics on validation data:  {'auc': 0.7648994840775662, 'training_iteration': 0, 'config': {'lambdaL1': 0.7715493226234792, 'learningRate': 0.021731197410042098, 'numLeaves': 74, 'numIterations': 249}, 'config/lambdaL1': 0.7715493226234792, 'config/learningRate': 0.021731197410042098, 'config/numLeaves': 74, 'config/numIterations': 249, 'experiment_tag': 'exp', 'time_total_s': 33.43822383880615}


Retrain model on whole train_data and check metrics on test_data

In [27]:
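# train() also returns the AUC on val_data (here the test set); it is discarded
# because predict() below recomputes it together with the other metrics.
tune_model, _ = train(train_data=train_data, val_data=test_data, **tune_config)
tune_metrics = predict(tune_model)
tune_metrics.show()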

StatementMeta(automl, 27, 33, Finished, Available)

+---------------+--------------------+------------------+------------------+-------------------+------------------+
|evaluation_type|    confusion_matrix|          accuracy|         precision|             recall|               AUC|
+---------------+--------------------+------------------+------------------+-------------------+------------------+
| Classification|1247.0  26.0  \n2...|0.9597570235383447|0.3953488372093023|0.38636363636363635|0.6829697207741198|
+---------------+--------------------+------------------+------------------+-------------------+------------------+



### Run FLAML AutoML
In the FLAML AutoML run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, resampling strategy type, and so on. All these arguments have default values which will be used if users do not provide them. 
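As an illustration of the options listed above, a fuller settings dictionary might look like the sketch below. The key names follow my understanding of the keyword arguments of `AutoML.fit` in FLAML and should be checked against the installed version; the values are illustrative and are not the settings used later in this notebook.

```python
# Illustrative values only; every key is optional and falls back to a default.
example_settings = {
    "task": "classification",          # task type
    "time_budget": 60,                 # total running time in seconds
    "metric": "roc_auc",               # metric to optimize
    "estimator_list": ["lgbm_spark"],  # learner list
    "eval_method": "cv",               # resampling strategy: "auto", "cv" or "holdout"
    "n_splits": 5,                     # number of folds when eval_method is "cv"
    "sample": True,                    # whether to subsample (may not apply to Spark data)
    "seed": 42,                        # random seed
}
# The dictionary is passed as keyword arguments, for example:
# automl.fit(dataframe=..., label=..., **example_settings)
```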

In [28]:
''' import AutoML class from the FLAML package '''
from flaml import AutoML
from flaml.automl.spark.utils import to_pandas_on_spark

automl = AutoML()

StatementMeta(automl, 27, 34, Finished, Available)

In [29]:
settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": 'roc_auc',
    "task": 'classification',  # task type
    "log_file_name": 'flaml_experiment.log',  # flaml log file
    "seed": 42,    # random seed
    "force_cancel": True,  # force stop training once time_budget is used up
}

StatementMeta(automl, 27, 35, Finished, Available)

In [30]:
df = to_pandas_on_spark(train_data)

type(df)

StatementMeta(automl, 27, 36, Finished, Available)

pyspark.pandas.frame.DataFrame

In [31]:
'''The main flaml automl API'''
automl.fit(dataframe=df, label='Bankrupt?', labelCol="Bankrupt?", isUnbalance=True, **settings)

StatementMeta(automl, 27, 37, Finished, Available)

[flaml.automl.logger: 04-19 00:58:37] {1682} INFO - task = classification
[flaml.automl.logger: 04-19 00:58:37] {1689} INFO - Data split method: stratified
[flaml.automl.logger: 04-19 00:58:37] {1692} INFO - Evaluation method: cv
[flaml.automl.logger: 04-19 00:58:38] {1790} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl.logger: 04-19 00:58:38] {1900} INFO - List of ML learners in AutoML Run: ['lgbm_spark']
[flaml.automl.logger: 04-19 00:58:38] {2210} INFO - iteration 0, current learner lgbm_spark
[flaml.automl.logger: 04-19 00:58:48] {2336} INFO - Estimated sufficient time budget=104269s. Estimated necessary time budget=104s.
[flaml.automl.logger: 04-19 00:58:48] {2383} INFO -  at 23.9s,	estimator lgbm_spark's best error=0.1077,	best estimator lgbm_spark's best error=0.1077
[flaml.automl.logger: 04-19 00:58:48] {2210} INFO - iteration 1, current learner lgbm_spark
[flaml.automl.logger: 04-19 00:58:56] {2383} INFO -  at 32.0s,	estimator lgbm_spark's best error=0.0962,	best esti

### Best model and metric

In [32]:
''' retrieve best config'''
print('Best hyperparameter config:', automl.best_config)
print('Best roc_auc on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

StatementMeta(automl, 27, 38, Finished, Available)

Best hyperparameter config: {'numIterations': 12, 'numLeaves': 6, 'minDataInLeaf': 17, 'learningRate': 0.1444074361218993, 'log_max_bin': 6, 'featureFraction': 0.9006280463830675, 'lambdaL1': 0.0021638671012090007, 'lambdaL2': 0.8181940184285643}
Best roc_auc on validation data: 0.924
Training duration of best run: 0.8982 s


In [33]:
automl_metrics = predict(automl.model.estimator)
automl_metrics.show()

StatementMeta(automl, 27, 39, Finished, Available)

+---------------+--------------------+------------------+-------------------+------------------+------------------+
|evaluation_type|    confusion_matrix|          accuracy|          precision|            recall|               AUC|
+---------------+--------------------+------------------+-------------------+------------------+------------------+
| Classification|1106.0  167.0  \n...|0.8686408504176157|0.18536585365853658|0.8636363636363636|0.8662250946225809|
+---------------+--------------------+------------------+-------------------+------------------+------------------+



## Use Apache Spark to Parallelize AutoML trials and tuning

In [38]:
settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": 'roc_auc',  # metric to optimize for this binary classification task
    "task": 'classification',  # task type    
    "seed": 7654321,    # random seed
    "use_spark": True,
    "n_concurrent_trials": 2,
    "force_cancel": True,
}

StatementMeta(automl, 27, 44, Finished, Available)
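The `n_concurrent_trials` setting caps how many trials run at the same time. The sketch below illustrates only that idea, using a standard-library thread pool as a stand-in; it is not how FLAML dispatches trials to Spark, and `run_trial` is a hypothetical placeholder that sleeps instead of training a model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_trial(trial_id):
    """Hypothetical trial: sleeps instead of training, returns a made-up loss."""
    time.sleep(0.05)
    return {"trial": trial_id, "val_loss": 1.0 / (trial_id + 1)}


# At most two trials are in flight at any moment, mirroring n_concurrent_trials=2.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_trial, range(6)))

# The tuner minimizes the loss (here 1 - roc_auc would play that role).
best = min(results, key=lambda r: r["val_loss"])
print(best)
```

More concurrent trials shorten wall-clock time per trial batch, but each trial then competes for the same cluster resources.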

In [39]:
pandas_df = train_raw.toPandas()
pandas_df.head()

StatementMeta(automl, 27, 45, Finished, Available)

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,0,0.0828,0.0693,0.0884,0.6468,0.6468,0.9971,0.7958,0.8078,0.3047,...,0.0,0.0,0.6237,0.6468,0.7483,0.2847,0.0268,0.5652,1.0,0.0199
1,0,0.1606,0.1788,0.1832,0.5897,0.5897,0.9986,0.7969,0.8088,0.3034,...,0.5917,4370000000.0,0.6236,0.5897,0.8023,0.2947,0.0268,0.5651,1.0,0.0151
2,0,0.204,0.2638,0.2598,0.4483,0.4483,0.9959,0.7937,0.8063,0.3034,...,0.6816,0.0003,0.6221,0.4483,0.8117,0.3038,0.0268,0.5651,1.0,0.0136
3,0,0.217,0.1881,0.2451,0.5992,0.5992,0.9962,0.794,0.8061,0.3034,...,0.6196,0.0011,0.6236,0.5992,0.6346,0.4359,0.0268,0.565,1.0,0.0108
4,0,0.2314,0.1628,0.2068,0.6001,0.6001,0.9988,0.796,0.8078,0.3015,...,0.5269,0.0003,0.6241,0.6001,0.7985,0.2903,0.0268,0.5651,1.0,0.0164


In [40]:
'''The main flaml automl API'''
automl.fit(dataframe=pandas_df, label='Bankrupt?', **settings)

StatementMeta(automl, 27, 46, Finished, Available)

[flaml.automl.logger: 04-19 01:10:19] {1682} INFO - task = classification
[flaml.automl.logger: 04-19 01:10:19] {1689} INFO - Data split method: stratified
[flaml.automl.logger: 04-19 01:10:19] {1692} INFO - Evaluation method: holdout
[flaml.automl.logger: 04-19 01:10:19] {1790} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl.logger: 04-19 01:10:19] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.tune.tune: 04-19 01:10:19] {701} INFO - Number of trials: 2/1000000, 2 RUNNING, 0 TERMINATED
[flaml.tune.tune: 04-19 01:10:22] {721} INFO - Brief result: {'pred_time': 2.9629555301389834e-06, 'wall_clock_time': 2.9545514583587646, 'metric_for_logging': {'pred_time': 2.9629555301389834e-06}, 'val_loss': 0.04636121259998027, 'trained_estimator': <flaml.automl.model.LGBMEstimator object at 0x7fec4bbdf430>}
[flaml.tune.tune: 04-19 01:10:22] {721} INFO - Brief result: {'pred_time': 3.1378822050232817e-06, 'wall_clock_

In [41]:
''' retrieve best config'''
print('Best hyperparameter config:', automl.best_config)
print('Best roc_auc on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

StatementMeta(automl, 27, 47, Finished, Available)

Best hyperparameter config: {'n_estimators': 4, 'num_leaves': 9, 'min_child_samples': 21, 'learning_rate': 0.27021587856943113, 'log_max_bin': 8, 'colsample_bytree': 0.9633671819625609, 'reg_alpha': 0.014098641144674361, 'reg_lambda': 1.5196347818125986}
Best roc_auc on validation data: 0.9557
Training duration of best run: 0.1563 s


In [90]:
# predict function for non-spark models
def predict_pandas(automl, test_raw):
    from synapse.ml.train import ComputeModelStatistics
    import pandas as pd
    pandas_test = test_raw.toPandas()
    predictions = automl.predict(pandas_test.iloc[:,1:]).astype('float')
    predictions = pd.DataFrame({"Bankrupt?":pandas_test.iloc[:,0], "prediction": predictions.tolist()})
    predictions = spark.createDataFrame(predictions)
    
    metrics = ComputeModelStatistics(
        evaluationMetric="classification",
        labelCol="Bankrupt?",
        scoredLabelsCol="prediction",
    ).transform(predictions)
    return metrics

automl_metrics = predict_pandas(automl, test_raw)
automl_metrics.show()

StatementMeta(automl, 27, 96, Finished, Available)

+---------------+--------------------+------------------+---------+------------------+------------------+
|evaluation_type|    confusion_matrix|          accuracy|precision|            recall|               AUC|
+---------------+--------------------+------------------+---------+------------------+------------------+
| Classification|1266.0  7.0  \n37...|0.9665907365223994|      0.5|0.1590909090909091|0.5767960437049204|
+---------------+--------------------+------------------+---------+------------------+------------------+
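To compare the four runs, the test-set metrics printed in this notebook can be collected in one place. The sketch below copies precision, recall, and AUC from the four tables and adds F1, which the notebook does not report, as the harmonic mean of precision and recall. The run labels are my own shorthand.

```python
# Test-set metrics copied from the four metric tables in this notebook.
runs = {
    "Default LightGBM": {
        "precision": 0.42857142857142855, "recall": 0.3409090909090909, "AUC": 0.6625990859101621},
    "FLAML Tune": {
        "precision": 0.3953488372093023, "recall": 0.38636363636363635, "AUC": 0.6829697207741198},
    "FLAML AutoML (lgbm_spark)": {
        "precision": 0.18536585365853658, "recall": 0.8636363636363636, "AUC": 0.8662250946225809},
    "FLAML AutoML (parallel trials)": {
        "precision": 0.5, "recall": 0.1590909090909091, "AUC": 0.5767960437049204},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, m in runs.items():
    m["F1"] = f1_score(m["precision"], m["recall"])
    print(f"{name:32s} AUC={m['AUC']:.4f} recall={m['recall']:.4f} F1={m['F1']:.4f}")

best_by_auc = max(runs, key=lambda name: runs[name]["AUC"])
best_by_f1 = max(runs, key=lambda name: runs[name]["F1"])
print("Highest AUC:", best_by_auc)
print("Highest F1:", best_by_f1)
```

Which run counts as best depends on the metric: the runs trade precision against recall very differently on this imbalanced dataset.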

