# AutoML in Fabric Data Science

AutoML (Automated Machine Learning) is a collection of methods and tools that automate machine learning model training and optimization with little human involvement. The aim of AutoML is to simplify and speed up the process of choosing the best machine learning model and hyperparameters for a given dataset, which usually demands a lot of skill and computing power.

AutoML can help ML professionals and developers from different sectors to:

1. Build ML solutions with minimal coding
1. Reduce time and cost
1. Apply data science best practices
1. Solve problems quickly and efficiently

![FLAML AutoML workflow](https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/flaml-automl-workflow.png)

### Exercise overview
In this exercise, we will use the `churn_data_clean` table and `flaml.AutoML` to automate model selection and hyperparameter tuning. We will track the results of these iterations with MLflow.

### Helpful links
- [Autologging in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-science/mlflow-autologging)
- [Fabric Experiments](https://learn.microsoft.com/en-us/fabric/data-science/machine-learning-experiment)
- [AutoML Examples](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML)


In [None]:
# Install Fabric integrated version of FLAML
%pip install https://synapsemldatascience.blob.core.windows.net/releases/flaml/FLAML-2.1.1.post4-cp310-cp310-linux_x86_64.whl

## Step 1: Load the data

In [None]:
df_final = spark.read.format("delta").load("Tables/churn_data_clean")

In [None]:
display(df_final)

#### Set up MLflow experiment tracking

MLflow is an open source platform that is deeply integrated into the Data Science experience in Fabric. It lets you track and compare the performance of different models and experiments without manual bookkeeping. For more information, see [Autologging in Microsoft Fabric](https://aka.ms/fabric-autologging).

In [None]:
import mlflow

# Disable exclusive mode for autologging to track additional metrics
mlflow.autolog(exclusive=False)

# Set the MLflow experiment to "FabCon-Demo-Experiment"; autologged runs are recorded under it
mlflow.set_experiment("FabCon-Demo-Experiment")

# Set random state for all the iterations
random_state = 41


#### Set the logging level

You can configure the logging level to suppress unnecessary outputs from the SynapseML library to keep the logs cleaner.

In [None]:
import logging
 
logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)
logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)
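
The reason a single `setLevel` call quiets a whole library is that Python loggers form a hierarchy: a child logger without its own level inherits the level of its nearest configured ancestor. The sketch below shows this with the standard `logging` module only. The logger names `demo_lib` and `demo_lib.training` are hypothetical and stand in for `synapse.ml` and its submodules.

```python
import logging

# Hypothetical logger names, used only for this illustration.
parent = logging.getLogger("demo_lib")
child = logging.getLogger("demo_lib.training")

# The child has no level of its own, so it inherits the parent's level.
parent.setLevel(logging.CRITICAL)

warning_enabled = child.isEnabledFor(logging.WARNING)
critical_enabled = child.isEnabledFor(logging.CRITICAL)
print(warning_enabled, critical_enabled)
```

Warnings from the child logger are dropped, while critical messages still get through.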

## Step 2: Train a baseline machine learning model

With your data in place, you can now define the model. You'll train a LightGBM model in this notebook. You will also use MLflow and Fabric Autologging to track the experiments.

#### Generate train-test datasets

Split the data into training and test datasets with an 80/20 ratio and prepare the data to train your machine learning model. Since we are working with LightGBM, we'll convert our dataset to Pandas for training.

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Convert Spark DataFrame to Pandas DataFrame
df_final_pd = df_final.toPandas()

# Define features (X) and target variable (y)
y = df_final_pd["Exited"]
X = df_final_pd.drop("Exited", axis=1)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=random_state)
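
To illustrate why a fixed `random_state` matters, here is a minimal pure-Python sketch of a seeded 80/20 split. It is not sklearn's algorithm; the helper `split_rows` is hypothetical and only shows that the same seed always yields the same partition.

```python
import random

def split_rows(rows, test_size=0.2, seed=41):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Calling `split_rows(rows)` again returns the identical split, which is what makes metrics comparable across runs.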


#### Train and evaluate the baseline model

Train an `LGBMClassifier` model, configured for binary classification, on the training data. Then use the trained model to make predictions on the test data. Extract the predicted probabilities for the positive class, compare them with the true labels from the test data, and calculate the ROC-AUC score using sklearn's `roc_auc_score` function.


In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
import mlflow


# Start MLflow run
with mlflow.start_run(run_name="default") as run:
    # Define LGBMClassifier with specified parameters
    lgbm_model = LGBMClassifier(
        learning_rate=0.01,
        n_estimators=2,
        max_depth=2,
        num_leaves=3,
        objective='binary',
        random_state=random_state,
        verbosity=-1
    )

    # Capture run_id for model prediction later
    lgbm_model_run_id = run.info.run_id 

    # Fit the model to the training data
    lgbm_model.fit(X_train, y_train) 

    # Make predictions on the test data
    y_pred = lgbm_model.predict(X_test)

    # Compute ROC AUC on the test set from the positive-class probabilities
    roc_auc_lgbm = roc_auc_score(y_test, lgbm_model.predict_proba(X_test)[:, 1])

    # Log ROC AUC score
    mlflow.log_metric("roc_auc", roc_auc_lgbm)
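
To make the metric concrete: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that definition, not sklearn's implementation; the helper name `roc_auc` and the sample labels and scores are made up for illustration.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is usually computed on held-out data.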

## Step 3: Create an AutoML trial with FLAML

In this section, you'll create an AutoML trial using the FLAML package, configure the trial settings, convert the Spark dataset to a Pandas on Spark dataset, run the AutoML trial, and view the resulting metrics.

#### Generate train-test datasets with Spark

Split the data into training and test datasets with an 80/20 ratio and prepare the data to train your machine learning model. This preparation involves importing the `VectorAssembler` from PySpark ML to combine feature columns into a single `features` column. Then, you'll use the `VectorAssembler` to transform the training and test datasets, resulting in `train_data` and `test_data` DataFrames containing the target variable `Exited` and the feature vectors. These datasets are now ready for building and evaluating machine learning models.

In [None]:
# Import the necessary library for feature vectorization
from pyspark.ml.feature import VectorAssembler

# Train-Test Separation
train_raw, test_raw = df_final.randomSplit([0.8, 0.2], seed=41)

# Define the feature columns (excluding the target variable 'Exited')
feature_cols = [col for col in df_final.columns if col != "Exited"]

# Create a VectorAssembler to combine feature columns into a single 'features' column
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Transform the training and testing datasets using the VectorAssembler
train_data = featurizer.transform(train_raw).select("Exited", "features")
test_data = featurizer.transform(test_raw).select("Exited", "features")
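
Conceptually, the assembler packs each row's feature columns into one vector and leaves the label alongside it. The pure-Python sketch below mimics that for a single row. It does not use Spark, the helper `assemble` is hypothetical, and the column names `Age` and `Balance` are assumptions that may not match the real table.

```python
def assemble(row, feature_cols, label_col="Exited"):
    """Pack the chosen columns of one row into a single feature list, keeping the label."""
    return {label_col: row[label_col], "features": [float(row[c]) for c in feature_cols]}

# Hypothetical row; the real table's columns may differ.
row = {"Exited": 1, "Age": 42, "Balance": 1500.5}
feature_cols = [c for c in row if c != "Exited"]
assembled = assemble(row, feature_cols)
print(assembled)
```

The order of `feature_cols` fixes the position of each feature in the vector, so the same list should be used for both the training and test data.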

#### Configure the AutoML trial and settings

Import the required classes and modules from the FLAML package and instantiate AutoML, which automates the machine learning pipeline.

In [None]:
# Import the AutoML class from the FLAML package
from flaml import AutoML
from flaml.automl.spark.utils import to_pandas_on_spark

# Create an AutoML instance
automl_spark = AutoML()

# Define AutoML settings
settings = {
    "time_budget": 100,        # Total running time in seconds
    "metric": 'roc_auc',       # Optimization metric (ROC AUC in this case)
    "task": 'classification',  # Task type (classification)
    "log_file_name": 'flaml_experiment.log',  # FLAML log file
    "max_iter": 10,            # Maximum number of search iterations
    "seed": 41,                # Random seed
    "mlflow_exp_name": "FabCon-Demo-Experiment",  # MLflow experiment name
    "verbose": 1               # Verbosity level
}
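
The settings include both `time_budget` and `max_iter`. Assuming the search stops at whichever limit is reached first, their interaction can be modeled as below. This is a simplified sketch, not FLAML's scheduler: the function `run_search` and the per-trial durations are hypothetical.

```python
def run_search(trial_durations, time_budget, max_iter):
    """Count the trials that start before either the time or the iteration limit is hit."""
    elapsed = 0.0
    completed = 0
    for duration in trial_durations:
        if completed >= max_iter or elapsed >= time_budget:
            break
        elapsed += duration
        completed += 1
    return completed, elapsed

# Slow trials: the 100-second budget is the binding limit.
trials_run, seconds_used = run_search([30, 30, 30, 30, 30], time_budget=100, max_iter=10)
print(trials_run, seconds_used)
```

With fast trials the iteration cap binds instead, so a small `max_iter` can end the search well before the time budget is spent.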

#### Convert to Pandas on Spark

To execute AutoML with a Spark-based dataset, you must convert it to a Pandas on Spark dataset using the `to_pandas_on_spark` function. This ensures FLAML can efficiently work with the data.

In [None]:
df_automl = to_pandas_on_spark(train_data)

#### Run the AutoML trial

Execute the AutoML trial, using a nested MLflow run to track the experiment within the existing MLflow run context. The trial is conducted on the Pandas on Spark dataset `df_automl` with the target variable `Exited`, and the defined settings are passed to the `fit` function for configuration.

In [None]:
# The main FLAML AutoML API

with mlflow.start_run(nested=True, run_name="spark_automl"):
    automl_spark.fit(dataframe=df_automl, label='Exited', isUnbalance=True, **settings)

#### View resulting metrics

Retrieve and display the results of the AutoML trial. These metrics offer insights into the performance and configuration of the AutoML model on the provided dataset.

In [None]:
# Retrieve and display the best hyperparameter configuration and metrics
print('Best hyperparameter config:', automl_spark.best_config)
print('Best ROC AUC on validation data: {0:.4g}'.format(1 - automl_spark.best_loss))
print('Training duration of the best run: {0:.4g} s'.format(automl_spark.best_config_train_time))
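
The `1 - best_loss` expression appears because FLAML minimizes a loss, and for a metric where higher is better, such as ROC AUC, the loss is one minus the metric. A tiny sketch of the conversion, with a hypothetical helper name and a made-up loss value:

```python
def metric_from_loss(best_loss):
    """Recover a maximized score such as ROC AUC from a loss defined as 1 - score."""
    return 1.0 - best_loss

# A made-up loss of 0.15 corresponds to a score of about 0.85.
recovered = metric_from_loss(0.15)
print(round(recovered, 4))
```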

## Step 4: Parallelize your AutoML trial with Apache Spark

If your dataset fits on a single node, you can still use Spark to run multiple AutoML trials in parallel by following these steps:


#### Configure parallelization settings

Set `use_spark` to `True` to enable Spark-based parallelism. By default, FLAML will initiate one trial per executor. You can customize the number of concurrent trials using the `n_concurrent_trials` argument. To learn more about how to parallelize your AutoML trials, you can visit [FLAML documentation for parallel Spark jobs](https://microsoft.github.io/FLAML/docs/Examples/Integrate%20-%20Spark#parallel-spark-jobs).

In [None]:
# Convert to Pandas for parallelization
pandas_df = train_raw.toPandas()

In [None]:
# Create an AutoML instance
automl = AutoML()

# Set MLflow experiment
mlflow.set_experiment("FabCon-Demo-Experiment")

# Define settings
settings = {
    "time_budget": 50,           # Total running time in seconds
    "metric": 'roc_auc',         # Optimization metric (ROC AUC in this case)
    "task": 'classification',    # Task type (classification)
    "seed": 41,                  # Random seed
    "use_spark": True,           # Enable Spark-based parallelism
    "n_concurrent_trials": 3,    # Number of concurrent trials to run
    "force_cancel": True,        # Force stop training once time_budget is used up
    "mlflow_exp_name": "FabCon-Demo-Experiment",  # MLflow experiment name (matches set_experiment above)
    "verbose": 1
}

#### Run the AutoML trial

Execute the AutoML trial in parallel with the specified settings. Note that a nested MLflow run will be utilized to track the experiment within the existing MLflow run context.

In [None]:
# The main FLAML AutoML API
with mlflow.start_run(nested=True, run_name = "parallel_automl"):
    automl.fit(dataframe=pandas_df, label='Exited', **settings)
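
The idea behind concurrent trials can be sketched without Spark: several candidate configurations are evaluated at the same time and the one with the lowest loss wins. The example below uses a local thread pool rather than Spark executors, and a fixed candidate list rather than FLAML's adaptive search. The function `toy_loss` and the candidate learning rates are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_loss(config):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (config["learning_rate"] - 0.1) ** 2

candidate_configs = [{"learning_rate": lr} for lr in (0.01, 0.05, 0.1, 0.3, 0.5)]

# Evaluate up to three candidates at a time, mirroring n_concurrent_trials=3.
with ThreadPoolExecutor(max_workers=3) as pool:
    losses = list(pool.map(toy_loss, candidate_configs))

best_loss, best_config = min(zip(losses, candidate_configs), key=lambda pair: pair[0])
print(best_config, best_loss)
```

Parallelism shortens wall-clock time per batch of trials, but each trial still needs the full training data, which is why this mode suits datasets that fit on a single node.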

#### Understand AutoML runs

The `flaml.visualization` module provides functions for plotting and comparing runs in FLAML. Users can utilize Plotly to interact with their AutoML experiment plots. A **feature importance plot** is a valuable visualization tool enabling you to grasp the significance of various input features in determining the predictions of the final, best model.

In [None]:
import flaml.visualization as fviz
fig = fviz.plot_feature_importance(automl)
fig.show()

#### View metrics

Upon completion of the parallel AutoML trial, retrieve and showcase the results, including the best hyperparameter configuration, ROC-AUC on the validation dataset, and the training duration of the top-performing run.

In [None]:
# Retrieve and display the best hyperparameter configuration and metrics
print('Best hyperparameter config:', automl.best_config)
print('Best ROC AUC on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of the best run: {0:.4g} s'.format(automl.best_config_train_time))

#### Experiments artifact for tracking model performance

The experiment runs are automatically saved in the experiment artifact, which you can find in the workspace. The artifact is named after the experiment name you set. All of the trained models, their runs, performance metrics, and model parameters are logged, as shown on the experiment page in the image below.

To view your experiments:
1. On the left panel, select your workspace.
1. Find and select the experiment name, in this case _FabCon-Demo-Experiment_.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/AutoML_nested_details.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">


## Step 5: Save the final machine learning model

Upon completing the AutoML trial, you can now save the final, tuned model as an ML model in Fabric.

In [None]:
# Specify the model name and the run artifact to register
model_name = "fabcon-churn-model"  # Replace with your desired model name
# Assumes the best model was logged under the run's "model" artifact path
model_path = f"runs:/{automl.best_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name=model_name)

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")
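
MLflow model URIs that point at a run take the form `runs:/<run_id>/<artifact_path>`. The sketch below builds such a URI with plain string handling. The helper `runs_uri` is hypothetical, `abc123` is a made-up run ID, and the default artifact path `model` is an assumption that should be checked against the artifacts listed on the run's page.

```python
def runs_uri(run_id, artifact_path="model"):
    """Build an MLflow 'runs:/<run_id>/<artifact_path>' model URI."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"

# Made-up run ID for illustration only.
example_uri = runs_uri("abc123")
print(example_uri)
```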