# Lab 6: Batch inferencing with PREDICT

Microsoft Fabric allows users to operationalize machine learning models with a scalable function called PREDICT, which supports batch scoring in any compute engine. Users can generate batch predictions directly from a Microsoft Fabric notebook or from a given ML model's item page.

### Exercise overview

In this exercise, you learn how to apply PREDICT both ways, whether you're more comfortable writing code yourself or using a guided UI experience to handle batch scoring for you.

### Helpful links
- [PREDICT in Microsoft Fabric](https://aka.ms/fabric-predict) 

### Limitations 

Note that the PREDICT function is currently supported for a limited set of ML model flavors, including:
- PyTorch
- Sklearn
- Spark
- TensorFlow
- ONNX
- XGBoost
- LightGBM
- CatBoost
- Statsmodels
- Prophet
- Keras

Two further limitations apply:
- PREDICT **requires** ML models to be saved in the MLflow format with their signatures populated.
- PREDICT **does not** support ML models with multi-tensor inputs or outputs.
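
The limitations above can be turned into a quick pre-check before a model is used for batch scoring. The following is a minimal sketch, not part of the PREDICT API: the function `predict_blockers` and the dictionary layout are hypothetical, simplified stand-ins for the metadata MLflow stores with a model (flavors plus a signature whose inputs and outputs are assumed to be already parsed into Python lists). The `lightgbm` flavor in the example is also an assumption.

```python
# Flavor names assumed to match the keys MLflow writes under "flavors";
# the set mirrors the supported flavors listed above.
SUPPORTED_FLAVORS = {
    "pytorch", "sklearn", "spark", "tensorflow", "onnx", "xgboost",
    "lightgbm", "catboost", "statsmodels", "prophet", "keras",
}


def predict_blockers(mlmodel: dict) -> list:
    """Return reasons why the model metadata conflicts with the limitations above."""
    problems = []

    flavors = set(mlmodel.get("flavors", {}))
    if not flavors & SUPPORTED_FLAVORS:
        problems.append("no supported flavor found")

    signature = mlmodel.get("signature") or {}
    for side in ("inputs", "outputs"):
        spec = signature.get(side) or []
        if not spec:
            problems.append(f"signature has no {side}")
        # More than one tensor entry on either side is treated as multi-tensor.
        elif sum(1 for entry in spec if entry.get("type") == "tensor") > 1:
            problems.append(f"multi-tensor {side}")

    return problems


# Hypothetical metadata for a tabular churn model.
churn_model = {
    "flavors": {"python_function": {}, "lightgbm": {}},
    "signature": {
        "inputs": [{"name": "Age", "type": "long"}, {"name": "Balance", "type": "double"}],
        "outputs": [{"type": "long"}],
    },
}
print(predict_blockers(churn_model))
```

An empty list means none of the three checks failed. A model logged without a signature would instead report that its inputs and outputs are missing.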

### Prerequisites

For this exercise, you should have completed and run **Labs 1-4**.

## Step 1: Set up your notebook

### Select Lakehouse

First, add the Lakehouse you created from the prior lab exercise.

<br>

![Screenshot of adding a lakehouse to the notebook](https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/add-lakehouse.png)

## Step 2: Load test data as a Spark DataFrame

To generate batch predictions using an already trained ML model (in this case, version 1 of the churn model from the previous notebook), you need the test data in the form of a Spark DataFrame.

Load the test data that was stored as a Lakehouse table during training back into a Spark DataFrame in order to generate predictions.

In [None]:
df_test = spark.read.format("delta").load("Tables/churn_test_data")
display(df_test)

## Step 3: Generate PREDICT code from an ML model's item page

From any ML model's item page, you can choose either of the following options to start generating batch predictions for a specific model version with PREDICT.

- Use a guided UI experience to generate PREDICT code
- Copy a code template into a notebook and customize the parameters yourself

### Use a guided UI experience

The guided UI experience walks you through steps to:

- Select source data for scoring
- Map the data correctly to your ML model's inputs
- Specify the destination for your model's outputs
- Create a notebook that uses PREDICT to generate and store prediction results

To use the guided experience:

1. Go to the item page for a given ML model version.

2. Select **Apply this model in wizard** from the **Apply this version** dropdown.

The selection opens up the "Apply ML model predictions" window at the "Select input table" step.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/1.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">


3. Select an input table from one of the lakehouses in your current workspace.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/2.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">

4. Select **Next** to go to the "Map input columns" step.

5. Map column names from the source table to the ML model's input fields, which are pulled from the model's signature. You must provide an input column for all the model's required fields. Also, the data types for the source columns must match the model's expected data types.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/3.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">

6. Select **Next** to go to the "Create output table" step.

7. Provide a name for a new table within the selected lakehouse of your current workspace. This output table stores your ML model's input values with the prediction values appended. By default, the output table is created in the same lakehouse as the input table, but the option to change the destination lakehouse is also available.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/4.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">

8. Select **Next** to go to the "Map output columns" step.

9. Use the provided text fields to name the columns in the output table that stores the ML model's predictions.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/5.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">

10. Select **Next** to go to the "Configure notebook" step.

11. Provide a name for a new notebook that will run the generated PREDICT code. The wizard displays a preview of the generated code at this step. You can copy the code to your clipboard and paste it into an existing notebook if you prefer.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/6.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">

12. Select **Next** to go to the "Review and finish" step.

13. Review the details on the summary page and select **Create notebook** to add the new notebook with its generated code to your workspace. You're taken directly to that notebook, where you can run the code to generate and store predictions.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/Predict/7.png"  width="400%" height="100%" title="Screenshot shows logged values for one of the models.">
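
The mapping rule from step 5 (every required model input needs a source column, and the data types must match) can be sketched in plain Python. This is only an illustration of the rule, not the wizard's implementation. The function `check_column_mapping`, the column names, and the type names are hypothetical.

```python
def check_column_mapping(signature, mapping, source_schema):
    """Validate a wizard-style mapping of model inputs to source columns.

    signature:     {model_input_name: expected_type}
    mapping:       {model_input_name: source_column_name}
    source_schema: {source_column_name: actual_type}
    """
    errors = []
    for field, expected in signature.items():
        column = mapping.get(field)
        if column is None:
            errors.append(f"{field}: no source column mapped")
        elif column not in source_schema:
            errors.append(f"{field}: source column '{column}' not found")
        elif source_schema[column] != expected:
            errors.append(f"{field}: expected {expected}, got {source_schema[column]}")
    return errors


# Hypothetical model signature, source table schema, and mapping.
signature = {"CreditScore": "long", "Age": "long", "Balance": "double"}
source_schema = {"CreditScore": "long", "Age": "string", "Balance": "double"}
mapping = {"CreditScore": "CreditScore", "Age": "Age"}

for problem in check_column_mapping(signature, mapping, source_schema):
    print(problem)
```

In this example, `Age` is reported because the source column is a string rather than a long, and `Balance` is reported because no source column was mapped to it.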

### Use a customizable code template

To generate predictions using the PREDICT function without using the UI, you can use the Transformer API, the Spark SQL API, or a PySpark user-defined function (UDF). The following sections show how to generate batch predictions with the test data and the trained ML model, using the different methods for invoking PREDICT. Note that you need to manually replace the following values:

- `<INPUT_COLS>`: An array of column names from the input table to feed to the ML model
- `<OUTPUT_COLS>`: A name for a new column in the output table that stores predictions
- `<MODEL_NAME>`: The name of the ML model to use for generating predictions
- `<MODEL_VERSION>`: The version of the ML model to use for generating predictions

#### PREDICT with the Transformer API

PREDICT supports MLflow-packaged models in the Microsoft Fabric registry. Therefore, to use the Transformer API from SynapseML, you'll need to first create an MLFlowTransformer object.

##### Instantiate MLFlowTransformer object

The MLFlowTransformer object is a wrapper around the MLflow model that you have already registered. It allows you to generate batch predictions on a given DataFrame. To instantiate the MLFlowTransformer object, you'll need to provide the following parameters:

- The columns from the test DataFrame that you need as input to the model (in this case, you would need all of them).
- A name for the new output column (in this case, `predictions`).
- The correct model name and model version to generate the predictions (in this case, `fabcon-churn-model` and version 1).

If you've been using your own ML model, substitute the values for the `model` and `test data`.

In [None]:
from synapse.ml.predict import MLFlowTransformer

model = MLFlowTransformer(
    inputCols=list(df_test.columns),
    outputCol='predictions',
    modelName='fabcon-churn-model',
    modelVersion=1
)

Now that you have the MLFlowTransformer object, you can use it to generate batch predictions.

In [None]:
# Score the test data; the result keeps the input columns and appends "predictions"
predictions = model.transform(df_test)
display(predictions)

In [None]:
predictions

#### PREDICT with the Spark SQL API

The following code invokes the PREDICT function with the Spark SQL API. If you've been using your own ML model, substitute the values for `model_name`, `model_version`, and `features` with your model name, model version, and feature columns.

> [!NOTE]
> Using the Spark SQL API to generate predictions still requires you to create an MLFlowTransformer object.

In [None]:
from pyspark.ml.feature import SQLTransformer 

# Substitute "model_name", "model_version", and "features" below with values for your own model name, model version, and feature columns
model_name = 'fabcon-churn-model'
model_version = 1
features = df_test.columns

sqlt = SQLTransformer().setStatement( 
    f"SELECT PREDICT('{model_name}/{model_version}', {','.join(features)}) as predictions FROM __THIS__")

# Substitute "df_test" below with your own test dataset
display(sqlt.transform(df_test))
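
To see what the f-string in the cell above hands to `SQLTransformer`, the statement can be built and printed on its own. The feature list below is a short, hypothetical stand-in for `df_test.columns`. `__THIS__` is the placeholder that `SQLTransformer` replaces with the input DataFrame.

```python
model_name = "fabcon-churn-model"
model_version = 1
# Hypothetical feature columns; in the lab these come from df_test.columns.
features = ["CreditScore", "Age", "NumOfProducts"]

# Same statement template as in the cell above, split across two literals.
statement = (
    f"SELECT PREDICT('{model_name}/{model_version}', {','.join(features)}) "
    "as predictions FROM __THIS__"
)
print(statement)
# SELECT PREDICT('fabcon-churn-model/1', CreditScore,Age,NumOfProducts) as predictions FROM __THIS__
```

Printing the statement before running it is a quick way to confirm that the model name, version, and column list were substituted as intended.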

#### PREDICT with a user-defined function (UDF)

The following code invokes the PREDICT function with a PySpark UDF. If you've been using your own ML model, substitute the values for the `model` and `features`.

In [None]:
from pyspark.sql.functions import col

# Substitute "model" and "features" below with your own MLFlowTransformer object and feature columns.
# "model" is the MLFlowTransformer object created in the Transformer API section.
my_udf = model.to_udf()
features = df_test.columns

display(df_test.withColumn("predictions", my_udf(*[col(f) for f in features])))

## Step 4: Write model prediction results to the lakehouse

Once you have generated batch predictions, write the model prediction results back to the lakehouse.

In [None]:
# Save predictions to lakehouse to be used for generating a Power BI report
table_name = "customer_churn_test_predictions"
predictions.write.format('delta').mode("overwrite").save(f"Tables/{table_name}")
print(f"Spark DataFrame saved to delta table: {table_name}")

### Business Intelligence via Visualizations in Power BI

## Exercise 1: Build a Power BI dashboard report

Next, you'll analyze the saved prediction results in Power BI and build a dashboard that sheds light on business insights that can help reduce customer churn.

### To do
In this exercise, you will follow these instructions to build a Power BI report.

> [!NOTE]
> This is an illustrative example of how you would analyze the saved prediction results in Power BI. For a real customer churn use case, you may need to think more thoroughly about which visualizations to create, based on subject matter expertise and on the metrics that your firm and business analytics team have standardized.

To access your saved table in Power BI:

1. On the left, select **OneLake data hub**.
2. Select the lakehouse that you added to this notebook.
3. On the top right, select **Open** under the section titled **Open this Lakehouse**.
4. Select **New Power BI dataset** on the top ribbon and select `customer_churn_test_predictions`, then select **Continue** to create a new Power BI dataset linked to the predictions.
5. On the tools at the top of the dataset page, select **New report** to open the Power BI report authoring page.

Some example visualizations are shown here. The data panel shows the delta tables and columns from the table to select. Upon selecting appropriate x and y axes, you can pick the filters and functions, for example, sum or average of the table column.

## Create a semantic model

Create a new semantic model linked to the predictions data you produced in Step 4:

1. On the left, select your workspace.
2. On the top left, select **Lakehouse** as a filter.
3. Select the lakehouse that you used in the earlier steps of this lab.
4. Select **New semantic model** on the top ribbon.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/new-power-bi-dataset.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">

5. Give the semantic model a name, such as "bank churn predictions." Then select the **customer_churn_test_predictions** dataset.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/select-predictions-data.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


6. Select **Confirm**.

## Add new measures

Now add a few measures to the semantic model:

1. Add a new measure for the churn rate.

    1. Select **New measure** in the top ribbon. This action adds a new item named **Measure** to the `customer_churn_test_predictions` dataset, and opens a formula bar above the table.

        <img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/new-measure.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">

    2. To determine the average predicted churn rate, replace `Measure =` in the formula bar with:

        `Churn Rate = AVERAGE(customer_churn_test_predictions[predictions])`

    3. To apply the formula, select the check mark in the formula bar. The new measure appears in the data table. The calculator icon shows it was created as a measure.

    4. Change the format from **General** to **Percentage** in the **Properties** panel.

    5. Scroll down in the **Properties** panel to change the **Decimal places** to 1.

        <img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/churn-rate.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">

2. Add a new measure that counts the total number of bank customers. You'll need it for the rest of the new measures.

    1. Select **New measure** in the top ribbon to add a new item named **Measure** to the `customer_churn_test_predictions` dataset. This action also opens a formula bar above the table.

    2. Each prediction represents one customer. To determine the total number of customers, replace `Measure =` in the formula bar with:

        `Customers = COUNT(customer_churn_test_predictions[predictions])`

    3. Select the check mark in the formula bar to apply the formula.

3. Add the churn rate for Germany.

    1. Select **New measure** in the top ribbon to add a new item named **Measure** to the `customer_churn_test_predictions` dataset. This action also opens a formula bar above the table.

    2. To determine the churn rate for Germany, replace `Measure =` in the formula bar with:

        `Germany Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_Germany] = 1)`

        This filters the rows down to the ones with Germany as their geography (Geography_Germany equals one).

    3. To apply the formula, select the check mark in the formula bar.

4. Repeat the above step to add the churn rates for France and Spain.

    * Spain's churn rate:

        ```dax
        Spain Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_Spain] = 1)
        ```

    * France's churn rate:

        ```dax
        France Churn = CALCULATE(customer_churn_test_predictions[Churn Rate], customer_churn_test_predictions[Geography_France] = 1)
        ```
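
As a cross-check on what the measures above compute, the same arithmetic can be written in plain Python. This is a sketch under two assumptions: the `predictions` column holds 0/1 values with no blanks, and the rows below are made-up stand-ins for the `customer_churn_test_predictions` table. The helper names are hypothetical.

```python
# Hypothetical rows standing in for customer_churn_test_predictions.
rows = [
    {"predictions": 1, "Geography_Germany": 1, "Geography_Spain": 0, "Geography_France": 0},
    {"predictions": 0, "Geography_Germany": 1, "Geography_Spain": 0, "Geography_France": 0},
    {"predictions": 0, "Geography_Germany": 0, "Geography_Spain": 1, "Geography_France": 0},
    {"predictions": 1, "Geography_Germany": 0, "Geography_Spain": 0, "Geography_France": 1},
    {"predictions": 0, "Geography_Germany": 0, "Geography_Spain": 0, "Geography_France": 1},
]


def churn_rate(table):
    """Counterpart of AVERAGE(...[predictions]): mean of the 0/1 predictions."""
    return sum(r["predictions"] for r in table) / len(table)


def churn_rate_where(table, column):
    """Counterpart of CALCULATE([Churn Rate], ...[column] = 1): filter, then average."""
    return churn_rate([r for r in table if r[column] == 1])


customers = len(rows)                                  # Customers
overall = churn_rate(rows)                             # Churn Rate
germany = churn_rate_where(rows, "Geography_Germany")  # Germany Churn
spain = churn_rate_where(rows, "Geography_Spain")      # Spain Churn
france = churn_rate_where(rows, "Geography_France")    # France Churn

print(f"Customers: {customers}, Churn Rate: {overall:.1%}")
print(f"Germany: {germany:.1%}, Spain: {spain:.1%}, France: {france:.1%}")
```

The per-country measures reuse the overall churn rate and only narrow the rows it is evaluated over, which is the role `CALCULATE` plays in the DAX formulas.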

## Create new report

Once you're done with all operations, move on to the Power BI report authoring page by selecting **Create report** on the top ribbon.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/visualize-this-data.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


Once the report page appears, add these visuals:

1. Select the text box on the top ribbon and enter a title for the report, such as "Bank Customer Churn".  Change the font size and background color in the Format panel.  Adjust the font size and color by selecting the text and using the format bar.

2. In the Visualizations panel, select the **Card** icon. From the **Data** pane, select **Churn Rate**. Change the font size and background color in the Format panel. Drag this visualization to the top right of the report.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/card-churn.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">

3. In the Visualizations panel, select the **Line and stacked column chart** icon. Select **age** for the x-axis, **Churn Rate** for column y-axis, and **Customers** for the line y-axis.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/age.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


4. In the Visualizations panel, select the **Line and stacked column chart** icon. Select **NumOfProducts** for x-axis, **Churn Rate** for column y-axis, and **Customers** for the line y-axis.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/number-of-products.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


5. In the Visualizations panel, select the **Stacked column chart** icon. Select **NewCreditsScore** for x-axis and  **Churn Rate** for y-axis.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/new-credit-score.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


    Change the title "NewCreditsScore" to "Credit Score" in the Format panel.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/change-title.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">


6. In the Visualizations panel, select the **Clustered column chart** icon. Select **Germany Churn**, **Spain Churn**, and **France Churn**, in that order, for the y-axis.


<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/germany-spain-france.png"  width="50%" height="20%" title="Screenshot shows logged values for one of the models.">
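
The combined charts above evaluate **Churn Rate** and **Customers** once per x-axis category. The following sketch reproduces that grouping in plain Python for `NumOfProducts`, using made-up rows and assuming 0/1 values in `predictions`. It only illustrates the aggregation, not how Power BI computes it internally.

```python
from collections import defaultdict

# Hypothetical prediction rows; only the two columns the chart needs.
rows = [
    {"NumOfProducts": 1, "predictions": 0},
    {"NumOfProducts": 1, "predictions": 1},
    {"NumOfProducts": 2, "predictions": 0},
    {"NumOfProducts": 2, "predictions": 0},
    {"NumOfProducts": 3, "predictions": 1},
]

# Collect the predictions that fall into each x-axis category.
groups = defaultdict(list)
for row in rows:
    groups[row["NumOfProducts"]].append(row["predictions"])

# One (Churn Rate, Customers) pair per category, ordered by category.
chart = {
    key: (sum(values) / len(values), len(values))
    for key, values in sorted(groups.items())
}
for key, (rate, count) in chart.items():
    print(f"NumOfProducts={key}: churn rate {rate:.1%}, customers {count}")
```

Reading the rate next to the customer count matters when interpreting the chart: a category with very few customers can show an extreme churn rate that rests on little data.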


The Power BI report shows:

* Customers who use more than two of the bank's products have a higher churn rate, although few customers have more than two products. The bank should collect more data, but also investigate other features correlated with more products (see the plot in the bottom left panel).
* Bank customers in Germany have a higher churn rate than those in France and Spain (see the plot in the bottom right panel), which suggests that an investigation into what has encouraged customers to leave could be beneficial.
* Middle-aged customers (between 25 and 45) are the largest group, and customers between 45 and 60 tend to exit more.
* Finally, customers with lower credit scores are the most likely to leave the bank for other financial institutions. The bank should look into ways to encourage customers with lower credit scores and account balances to stay with the bank.

<img src="https://synapseaisolutionsa.blob.core.windows.net/public/Fabric-Conference/PBI/germany-spain-france.png"  width="100%" height="100%" title="Screenshot shows logged values for one of the models.">



## Step 5: Bonus - Visualize the Power BI report in the notebook

In [None]:
from powerbiclient import QuickVisualize, get_dataset_config

df_predictions = spark.read.format("delta").load("Tables/customer_churn_test_predictions")

PBI_visualize = QuickVisualize(get_dataset_config(df_predictions))

# Render Power BI report in the notebook
PBI_visualize