# MLflow overview

MLflow is an open-source platform for managing the machine learning lifecycle, including experiment tracking, model registry, and deployment. By logging parameters, metrics, and outputs, it lets you track experiments, compare results, and share them across teams. This makes it useful for data scientists and developers who want to streamline the development, deployment, and maintenance of machine learning models.

### Exercise overview

In this exercise, we will use `churn_data_clean` to train several baseline models. We will track the results of these iterations with MLflow and learn how to use autologging to customize the details tracked.

### Helpful links
- [Autologging in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-science/mlflow-autologging)
- [Fabric Experiments](https://learn.microsoft.com/en-us/fabric/data-science/machine-learning-experiment)

## Step 1: Read cleaned data from the lakehouse


In [None]:
df = spark.sql("SELECT * FROM FC_Workshop.churn_data_clean")
display(df)

## Step 2: Prepare datasets for training

#### Generate train-test datasets 

The code snippet illustrates the process of preparing a dataset for machine learning model training and evaluation using Scikit-learn and Pandas. Initially, it converts a Spark DataFrame (`df`) into a Pandas DataFrame (`df_final_pd`) to utilize familiar data manipulation operations. It then identifies the target variable (`y`) as the "Exited" column and separates the features (`X`) by removing the target column from the dataset. With the features and target defined, it employs the `train_test_split` function from Scikit-learn to divide the dataset into training and testing sets, allocating 20% of the data for testing. This split is controlled by a specified `random_state` to ensure reproducibility of the results.

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Convert Spark DataFrame to Pandas DataFrame
df_final_pd = df.toPandas()

# Define features (X) and target variable (y)
y = df_final_pd["Exited"]
X = df_final_pd.drop("Exited", axis=1)
random_state = 41

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=random_state)
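
As a quick check of the split described above, the sketch below runs `train_test_split` twice on a small made-up table. The toy columns (`Age`, `Balance`) and their values are assumptions for illustration only, not the real churn data. The point is that a fixed `random_state` yields the same 80/20 partition each time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the churn table (20 rows)
toy = pd.DataFrame({
    "Age": list(range(20, 40)),
    "Balance": [1000.0 * i for i in range(20)],
    "Exited": [0, 1] * 10,
})
y_toy = toy["Exited"]
X_toy = toy.drop("Exited", axis=1)

# Split twice with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=41)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=41)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)                # (16, 2) (4, 2)
print(X_te.index.equals(split_b[1].index))   # True: same seed, same test rows
```

Changing `random_state` (or omitting it) would generally select different rows for the test set, which makes results harder to compare across runs.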


### Save test data

This code snippet demonstrates how to save test data for future predictions after processing. It first converts the test dataset (`X_test`), originally in a Pandas DataFrame format, back into a Spark DataFrame (`spark_df`) using the `createDataFrame` method. 

The snippet then proceeds to save this Spark DataFrame to a designated location specified by `Tables/churn_test_data`. The data is saved in Delta format, a storage layer that brings ACID transactions to Apache Spark and big data workloads, with the `mode` set to "overwrite" to ensure that any existing data in the specified path is replaced. This step is crucial for preserving the test set in a reliable and efficient format for later use in making predictions or further analysis.

In [None]:
# Save test data for predictions later

spark_df = spark.createDataFrame(X_test)
spark_df.write.mode("overwrite").format("delta").save("Tables/churn_test_data")

## Step 3: Train baseline models 

#### Tree-based models

Many different hyperparameters can be tuned for tree-based models. In the training exercises, we will experiment with the hyperparameters that impact:
- Tree Shape — ```num_leaves``` and ```max_depth```
- Tree Growth — ```min_data_in_leaf``` and ```min_gain_to_split```



#### Create a machine learning experiment

A machine learning experiment is the primary unit of organization and control for all related machine learning runs. A run corresponds to a single execution of model code. In MLflow, tracking is based on experiments and runs. You can track runs and their associated information using the inline MLflow widget or the Experiment item in Fabric.

![Navigate to an ML Experiment in Fabric](https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/experiment-details.png)

In [None]:
# Set the MLflow experiment 
import mlflow

mlflow.set_experiment("FabCon-Demo-Experiment")

#### Set the logging level

You can configure the logging level to suppress unnecessary outputs from the SynapseML library to keep the logs cleaner.

In [None]:
import logging

logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)
logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)
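
To see what setting a logger to `CRITICAL` does, here is a small sketch that uses only the Python standard library. The child logger name `synapse.ml.lightgbm` is an assumption chosen for illustration; it only demonstrates how child loggers inherit the level of their parent.

```python
import logging

noisy = logging.getLogger("synapse.ml")
noisy.setLevel(logging.CRITICAL)

print(noisy.isEnabledFor(logging.WARNING))    # False: warnings are dropped
print(noisy.isEnabledFor(logging.CRITICAL))   # True: critical messages still pass

# Hypothetical child logger: it has no level of its own, so it inherits CRITICAL
child = logging.getLogger("synapse.ml.lightgbm")
print(child.getEffectiveLevel() == logging.CRITICAL)  # True
```

To restore the usual verbosity later in the session, the same call can be repeated with a lower level such as `logging.WARNING`.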

#### Autologging

Synapse Data Science in Microsoft Fabric includes autologging, which significantly reduces the amount of code required to automatically log the parameters, metrics, and items of a machine learning model during training. This feature extends [MLflow autologging](https://mlflow.org/docs/latest/tracking.html#automatic-logging) capabilities and is deeply integrated into the Synapse Data Science in Microsoft Fabric experience. Using autologging, developers and data scientists can easily track and compare the performance of different models and experiments without the need for manual tracking.

Autologging works by automatically capturing the values of input parameters, output metrics, and output items of a machine learning model as it is being trained. This information is then logged to your Microsoft Fabric workspace, where it can be accessed and visualized using the MLflow APIs or the corresponding experiment & model items in your Microsoft Fabric workspace.

```python
mlflow.autolog(
    log_input_examples=False,
    log_model_signatures=True,
    log_models=True,
    disable=False,
    exclusive=True,
    disable_for_unsupported_versions=True,
    silent=True)
```

When you launch a Synapse Data Science notebook, Microsoft Fabric calls ```mlflow.autolog()``` to enable tracking and load the corresponding dependencies. As you train models in your notebook, this model information is automatically tracked with MLflow. This configuration happens automatically behind the scenes when you run ```import mlflow```.

##### Mode 1: Enable Autologging

In Fabric workspaces, autologging is enabled by default. After a training run finishes, you can review the parameters and metrics it logged. Note that these details are logged automatically, without any manual logging calls.

In [None]:
# This is enabled by default, but you can also call the following command to re-enable with the original settings

# Use this
# mlflow.autolog()

# or this 

# mlflow.autolog(
#     log_input_examples=False,
#     log_model_signatures=True,
#     log_models=True,
#     disable=False,
#     exclusive=True,
#     disable_for_unsupported_versions=True,
#     silent=True)

In [None]:
from sklearn.tree import DecisionTreeClassifier
import mlflow

random_state = 41  

with mlflow.start_run(run_name="decision_tree_default") as run:
    
    # Define DecisionTreeClassifier with specified parameters
    dt_model = DecisionTreeClassifier(
        max_depth=2,  
        random_state=random_state
    )
    
    # Fit the model to the training data
    dt_model.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = dt_model.predict(X_test)



##### Mode 2: Disable Autologging

To disable Microsoft Fabric autologging in a notebook session, you can call ```mlflow.autolog()``` and set ```disable=True```. This will require you to manually log any metrics, files, or parameters that you want logged.


In [None]:
# Disable autologging
mlflow.autolog(disable=True)


In [None]:
from sklearn.tree import DecisionTreeClassifier
import mlflow

random_state = 41 

with mlflow.start_run(run_name="dt_autolog_disabled") as run:
    
    # Define DecisionTreeClassifier with specified parameters
    dt_model = DecisionTreeClassifier(
        max_depth=2,  
        random_state=random_state
    )
    
    # Fit the model to the training data
    dt_model.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = dt_model.predict(X_test)



##### Mode 3: Custom logging

There are scenarios where you'll want to review the automatically logged metrics, parameters, or files, but also log your own custom metrics or metadata. To accommodate this, you can disable the exclusive autologging mode by setting it to ```False```. Doing so enables you to monitor both the properties automatically captured by MLflow and those manually logged by you.

Here's an example of how you can enable and use custom logging:

```python
import mlflow
mlflow.autolog(exclusive=False)

with mlflow.start_run():
  mlflow.log_param("parameter name", "example value")
  # <add model training code here>
  mlflow.log_metric("metric name", 20)
```

In [None]:
import mlflow
mlflow.autolog(exclusive=False)


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
import mlflow

random_state = 41

with mlflow.start_run(run_name="dt_autolog_custom") as run:
    
    # Define DecisionTreeClassifier with specified parameters
    dt_model = DecisionTreeClassifier(
        max_depth=2,  
        random_state=random_state
    )
    
    # Fit the model to the training data
    dt_model.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = dt_model.predict(X_test)

    # Log parameters
    mlflow.log_param("autolog_mode", "custom")

    # Generate probability scores for the positive class
    y_proba = dt_model.predict_proba(X_test)[:, 1]

    # Calculate ROC AUC score
    roc_auc = roc_auc_score(y_test, y_proba)

    # Log the ROC AUC score
    mlflow.log_metric("roc_auc_test", roc_auc)

# Exercise 1: Train a baseline LightGBM model

Next, we'll use autologging to train our initial LightGBM model. Each model type records its own set of information through autologging. The [MLflow documentation](https://mlflow.org/docs/2.4.2/tracking.html#lightgbm) lists the details that are logged automatically for LightGBM:

![MLflow Docs for LightGBM](https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/lgbm-autolog.png)

### To do
In this exercise, you will add code to:
- Complete the TODO items below. You will need to calculate the accuracy and ROC AUC scores on the test dataset (`X_test`), which is the data the code below predicts on
- Log these new metrics using ```mlflow.log_metrics```
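
As a reminder of how the two metrics are computed, here is a small sketch on hand-made labels. The values in `y_true` and `y_proba` are invented for illustration and the 0.5 threshold is an assumption. Accuracy is computed from hard class predictions, while ROC AUC is computed from probability scores.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and positive-class probabilities
y_true = [0, 0, 0, 1, 1]
y_proba = [0.1, 0.4, 0.8, 0.7, 0.9]

# Hard predictions from an assumed 0.5 threshold
y_hat = [int(p >= 0.5) for p in y_proba]

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),    # 4 of 5 correct -> 0.8
    "roc_auc": roc_auc_score(y_true, y_proba),    # 5 of 6 pairs ranked correctly -> ~0.833
}
print(metrics)
```

Inside an active run, a dictionary shaped like `metrics` can be passed to ```mlflow.log_metrics``` to log all of its entries in a single call.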

In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import mlflow

with mlflow.start_run(run_name="default_lgbm") as run:

    # Define LGBMClassifier with specified parameters
    lgbm_model = LGBMClassifier(
        learning_rate=0.01,
        n_estimators=2,
        max_depth=2,
        num_leaves=3,
        objective='binary',
        random_state=random_state,
        verbosity=-1
    )

    # Capture run_id for model prediction later
    lgbm_model_run_id = run.info.run_id 

    # Fit the model to the training data
    lgbm_model.fit(X_train, y_train) 

    # Make predictions on the test data
    y_pred = lgbm_model.predict(X_test)
    
    # TODO: Compute accuracy score

    # TODO: Compute ROC AUC score

    # TODO: Log all metrics




#### Save the final model

A machine learning model is a file trained to recognize certain types of patterns. You train a model over a set of data, and you provide it with an algorithm that it uses to reason over and learn from that data set. After you train the model, you can use it to reason over data that it has never seen before and make predictions about that data.

In MLflow, a machine learning model can include multiple model versions. Here, each version can represent a model iteration. 



In [None]:
# Specify the registered model name and the URI of the model logged by the run
model_name = "fabcon-churn-model"  # Replace with your desired model name
model_path = f"runs:/{lgbm_model_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name=model_name)

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")