# Automated ML
## Introduction

This notebook is automatically generated by the Fabric low-code AutoML wizard based on your selections. Whether you're building a regression model, a classifier, or another machine-learning solution, this tool simplifies the process by transforming your goals into executable code. You can easily modify any settings or code snippets to better align with your requirements.

### What is FLAML?

[FLAML (Fast and Lightweight Automated Machine Learning)](https://aka.ms/fabric-automl) is an open-source AutoML library designed to quickly and efficiently find the best machine learning models and hyperparameters. FLAML optimizes for speed, accuracy, and cost, making it an excellent choice for a wide range of machine-learning tasks.

### Steps in this notebook

1. **Load the data**: Import your dataset.
2. **Generate features**: Automatically transform and preprocess your data to improve model performance.
3. **Use AutoML to find your best model**: Use FLAML to automatically select the most suitable model and optimize its parameters.
4. **Save the final machine learning model**: Store the trained model for future use.
5. **Generate predictions**: Use the saved model to predict outcomes on new data.

> [!IMPORTANT]
> **Automated ML is currently supported on Fabric Runtimes 1.2+ or any Fabric environment with Spark 3.4+.**


In [1]:
%pip install scikit-learn==1.5.1



Collecting scikit-learn==1.5.1
  Downloading scikit_learn-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn==1.5.1)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
  Attempting uninstall: threadpoolctl
    Found existing installation: threadpoolctl 2.2.0
    Not uninstalling threadpoolctl at /home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages, outside environment /nfs4/pyenv-b80bd43a-c241-4a02-a19b-1c2d80b6f31f
    Can't uninstall 'threadpoolctl'. No files were found to uninstall.
  Attempting uninstall: 

### Default notebook optimization

This cell configures the logging and warning settings to reduce unnecessary output and focus on critical information. It suppresses specific warnings and logs from the underlying libraries, ensuring a cleaner and more readable notebook experience.

In [2]:
import logging
import warnings
 
logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)
logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)
warnings.simplefilter('ignore', category=FutureWarning)
warnings.simplefilter('ignore', category=UserWarning)

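As a quick check of what these settings do, the following minimal sketch uses only the standard library. The `noisy` function and the `example.library` logger name are hypothetical stand-ins, not part of the generated notebook. It shows that an `ignore` filter drops a `FutureWarning` and that a logger set to `CRITICAL` no longer handles `INFO` messages.

```python
import logging
import warnings


def noisy():
    """Hypothetical library call that emits a FutureWarning."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# With an "ignore" filter for FutureWarning, nothing is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=FutureWarning)
    noisy()
suppressed_count = len(caught)

# With the "always" filter, the same call is recorded once.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()
visible_count = len(caught)

# Raising a logger to CRITICAL disables every lower level for it.
logger = logging.getLogger("example.library")  # hypothetical logger name
logger.setLevel(logging.CRITICAL)
info_enabled = logger.isEnabledFor(logging.INFO)
critical_enabled = logger.isEnabledFor(logging.CRITICAL)
```

Because `simplefilter` changes process-wide state, wrapping it in `warnings.catch_warnings()` as above is one way to limit the suppression to a single block if you still want to see warnings elsewhere in the notebook.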

## Step 1: Load the data

This cell imports the raw data from the specified source into the notebook environment. The data can come from various sources, such as a file or a table in your lakehouse.

Once loaded, this data will serve as the input for subsequent steps, such as data transformation, model training, and evaluation.

In [13]:
import re
import pandas as pd
import numpy as np

df = spark.read.format("delta").load(
    "Tables/online_061622"
).cache()



In [14]:
display(df)


## Step 2: Generate features

Featurization is the process of transforming raw data into a format optimized for training a machine learning model. It ensures the model can access the most relevant information, significantly impacting its accuracy and performance.

This step applies various techniques to refine the data, enhance its quality, and make it compatible with the selected algorithms, helping the model learn patterns more effectively.
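As an illustration of what featurization can look like, here is a minimal sketch that imputes missing values and one-hot encodes a categorical column with scikit-learn. The toy values are invented for the example; only the column names `UnitPrice` and `Country` are borrowed from this notebook's dataset, and the exact transformations the wizard applies may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "UnitPrice": [10.0, np.nan, 30.0, 20.0],
    "Country": ["US", "DE", np.nan, "US"],
})

featurizer = ColumnTransformer(
    transformers=[
        # Numeric column: fill gaps with the median.
        ("num", SimpleImputer(strategy="median"), ["UnitPrice"]),
        # Categorical column: fill gaps with the most frequent value, then one-hot encode.
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Country"]),
    ]
)

features = featurizer.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert to a dense array for inspection.
dense = features.toarray() if hasattr(features, "toarray") else np.asarray(features)
```

The result has one numeric column followed by one indicator column per country, which is the kind of all-numeric matrix most estimators expect.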

## Step 3: Use AutoML to find your best model

We will now use FLAML's AutoML to automatically find the best machine learning model for our data. AutoML (Automated Machine Learning) simplifies the model selection process by automatically testing and tuning various algorithms and configurations, helping us quickly identify the most effective model with minimal manual effort.

### Tracking results with experiments in Fabric

Experiments in Fabric let you track the results of your AutoML process, providing a comprehensive view of all the metrics and parameters from your trials.

In [15]:
# MLflow logging setup

import mlflow

mlflow.autolog(exclusive=False)
mlflow.set_experiment("Lab0625")



2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for lightgbm.
2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for pytorch_lightning.
2025/06/25 01:45:23 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.ml.


<Experiment: artifact_location='', creation_time=1750814014997, experiment_id='62680ef9-82ce-4931-b084-aab092cf92de', last_update_time=None, lifecycle_stage='active', name='Lab0625', tags={}>

In [None]:
# Split the dataset into training and test sets.
# train_test_split works on in-memory data, so the Spark DataFrame is converted to pandas first.
from sklearn.model_selection import train_test_split

feature_cols = [
    'UnitPrice', 'Discount', 'ShippingCost',
    'Description', 'Year', 'Quarter', 'Month', 'DateofWeek', 'Hour',
    'Country', 'PaymentMethod', 'Category', 'ShipmentProvider', 'WarehouseLocation']
target_col = "Quantity"

pdf = df.toPandas()
X = pdf[feature_cols]
y = pdf[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Fit the model
# pipeline.fit(X_train, y_train)

# # Predict and evaluate
# y_pred = pipeline.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)
# r2 = r2_score(y_test, y_pred)

# mse, r2

#### Pipeline + AutoML

In [8]:
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from flaml import AutoML

set_config(display="diagram")

imputer = SimpleImputer()
standardizer = StandardScaler()
automl = AutoML()

automl_pipeline = Pipeline(
    [("imputer", imputer), ("standardizer", standardizer), ("automl", automl)]
)
automl_pipeline


2025/06/25 01:35:20 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.ml.
2025/06/25 01:35:22 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2025/06/25 01:35:27 INFO mlflow.tracking.fluent: Autologging successfully enabled for lightgbm.
2025/06/25 01:36:11 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
2025/06/25 01:36:14 INFO mlflow.tracking.fluent: Autologging successfully enabled for pytorch_lightning.


In [34]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np


# Step 1: Drop the "Date" column
df = df.drop('Date')

# Step 2: Separate numerical and categorical columns
num_cols = ['UnitPrice', 'Discount', 'ShippingCost']
cat_cols = [
    'Description', 'Year', 'Quarter', 'Month', 'DateofWeek', 'Hour',
    'Country', 'PaymentMethod', 'Category', 'ShipmentProvider', 'WarehouseLocation']

# Step 3: Impute missing values
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Step 4: One-hot encode categorical columns
cat_encoder = OneHotEncoder(handle_unknown="ignore")

# Combine transformations
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_imputer, num_cols),
        ("cat", Pipeline(steps=[("imputer", cat_imputer), ("encoder", cat_encoder)]), cat_cols)
    ]
)

# AutoML estimator (final pipeline step)
automl = AutoML()

# Full pipeline: preprocessing followed by AutoML
automl_pipeline = Pipeline(
    [("preprocessor", preprocessor),
    # ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
    ("automl", automl)]
)




In [35]:
from sklearn import set_config
set_config(display="diagram")

automl_pipeline


In [32]:
%pip install "flaml[automl]" openml


#### Configure the AutoML trial and settings

These configurations are driven by the AutoML mode and task selected in the wizard. For example, if you select "quick prototype", you'll see a setting for time budget.

In [None]:
# from packaging.version import Version
# if Version(flaml.__version__) > Version("2.3.3"):  # compare parsed versions, not strings
#     settings["entrypoint"] = "low-code"

In [36]:
# Import the AutoML class from the FLAML package
import flaml
from flaml import AutoML


automl_settings = {
    "time_budget": 1800,  # Total running time in seconds
    "task": "regression",  # Task type
    "log_file_name": "flaml_experiment.log",  # FLAML log file
    "eval_method": "cv",  # Evaluate each trial with cross-validation
    "n_splits": 3,  # Number of cross-validation folds
    "seed": 41,  # Random seed
    "mlflow_exp_name": "Lab0625",  # MLflow experiment name
    "use_spark": True,  # Whether to use Spark for distributed training
    "n_concurrent_trials": 3,  # Maximum number of concurrent trials
    "verbose": 1,
    "featurization": "auto",
}

# Prefix each setting with the step name so Pipeline.fit passes it to the "automl" step
pipeline_settings = {f"automl__{key}": value for key, value in automl_settings.items()}

with mlflow.start_run(nested=True, run_name="062501Model"):
    automl_pipeline.fit(X_train, y_train, **pipeline_settings)



[I 2025-06-25 02:03:41,368] A new study created in memory with name: optuna


[I 2025-06-25 02:03:59,071] A new study created in memory with name: optuna
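
The `automl__` prefix used above follows scikit-learn's `step__parameter` naming convention: `Pipeline.fit` forwards each prefixed keyword argument to the `fit` method of the step with that name. A minimal sketch illustrates the routing. It assumes a hypothetical `RecordingEstimator` in place of FLAML's `AutoML`, plus a shortened settings dictionary and toy data.

```python
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingEstimator(BaseEstimator):
    """Hypothetical final step that only records the keyword arguments passed to fit."""

    def fit(self, X, y=None, **fit_params):
        self.received_ = dict(fit_params)
        return self


# Shortened, illustrative settings (not the full dictionary from the notebook).
settings = {"time_budget": 5, "task": "regression"}
prefixed = {f"automl__{key}": value for key, value in settings.items()}

pipe = Pipeline([("scale", StandardScaler()), ("automl", RecordingEstimator())])

# Toy data: three samples with one feature each.
pipe.fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0], **prefixed)

# The final step receives the settings with the prefix stripped.
received = pipe.named_steps["automl"].received_
```

Passing the settings at `fit` time, rather than to the `AutoML()` constructor, keeps the pipeline definition independent of any one experiment's configuration.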


#### Run the AutoML trial

Run the AutoML trial; every trial is tracked as an experiment run. The trial runs on the processed dataset, uses the `Quantity` column as the target, and applies the configurations defined above.

In [None]:
# with mlflow.start_run(nested=True, run_name="062501Model"):
#     automl.fit(
#         X_train=X_train, 
#         y_train=y_train,  # target column of the training data 
#     )

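Once the trial has finished, the fitted pipeline can be scored on the held-out split with the two metrics imported earlier, `mean_squared_error` and `r2_score`. The sketch below uses small invented arrays so that it runs on its own. In the notebook, `y_true` would be `y_test` and `y_pred` would come from `automl_pipeline.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical target values and predictions, standing in for y_test and the model output.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Mean squared error: average of the squared residuals (lower is better).
mse = mean_squared_error(y_true, y_pred)

# R^2: 1 - SS_res / SS_tot, the share of variance explained (1.0 is a perfect fit).
r2 = r2_score(y_true, y_pred)

print(f"MSE: {mse:.4f}, R^2: {r2:.4f}")
```

Evaluating on data the trial never saw complements the cross-validation scores FLAML reports during the search.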

## Step 4: Save the final machine learning model

After the AutoML trial completes, you can save the final, tuned model as an ML model in Fabric.

In [None]:
model_path = f"runs:/{automl.best_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name="062501Model")

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")


## Step 5: Generate predictions

Microsoft Fabric lets you operationalize machine learning models with a scalable function called `PREDICT`, which supports batch scoring (or batch inferencing) in any compute engine. You can generate batch predictions directly from the Microsoft Fabric notebook or from a given ML model's item page. For more information on how to use `PREDICT`, see [Model scoring with PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict).

1. Generate predictions.

In [None]:
from synapse.ml.predict import MLFlowTransformer

model_name = "062501Model"
target_col = "Quantity"  # name of the output column; matches the training target

feature_cols = X_train.columns.to_list()
model = MLFlowTransformer(
    inputCols=feature_cols,
    outputCol=target_col,
    modelName=model_name,
    modelVersion=registered_model.version,
)

df_test = spark.createDataFrame(X_test)
batch_predictions = model.transform(df_test)



In [None]:
display(batch_predictions)


2. Save the predictions to a table.

In [None]:
saved_name = "Tables/062501_pre".replace(".", "_")
batch_predictions.write.mode("overwrite").format("delta").option("overwriteSchema", "true").save(saved_name)
