## Module 4 - Simulate input data, perform batch predictions, and save predictions to the Lakehouse

#### Install the required libraries and define the dataset schema


You will now use the [Faker](https://faker.readthedocs.io/en/master/) Python package to simulate heart failure diagnostic data. Python libraries can be added in the workspace settings or installed inline with _%pip install Faker_. For details, see the public documentation: [Manage Apache Spark libraries](https://learn.microsoft.com/en-us/fabric/data-engineering/library-management).

In [6]:
#Install the required library
%pip install Faker==18.10.1


Collecting Faker==18.10.1
  Downloading Faker-18.10.1-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-18.10.1
Note: you may need to restart the kernel to use updated packages.



In [None]:
#Import the required libraries
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, LongType
from faker import Faker
import pandas
from synapse.ml.predict import MLFlowTransformer

In [7]:
#Define the dataset schema

heartFailureDataSchema = StructType(
[
    StructField('Age', IntegerType(), True),
    StructField('Sex', IntegerType(), True),
    StructField('ChestPainType', IntegerType(), True),
    StructField('RestingBP', IntegerType(), True),
    StructField('Cholesterol', IntegerType(), True),
    StructField('FastingBS', IntegerType(), True),
    StructField('RestingECG', IntegerType(), True),
    StructField('MaxHR', IntegerType(), True),
    StructField('ExerciseAngina', IntegerType(), True),
    StructField('Oldpeak', DoubleType(), True),
    StructField('ST_Slope', IntegerType(), True)
]
)


#### Part 1 instructions

Simulate input heart failure diagnostic data to be used for predictions.

First, set the simulated record count to 10 and create an empty list to hold the simulated data.

In [None]:
faker = Faker()
simulateRecordCount = 10
simData = []

Now, loop once for each of the 10 records to simulate. On each iteration, generate a simulated value for every feature (age, blood pressure, cholesterol, and so on).

Steps:
- Refer to the dataset schema defined above to identify the type of value each feature holds (integer, float, or categorical).
- Refer to the data statistics in notebook 2 to pick a suitable range of values for each feature.
- Pick one of the following Faker functions according to each feature's value type, and configure it to generate a value from the range you chose.

Faker functions:

- For integers, use [faker.random_int](https://faker.readthedocs.io/en/master/providers/baseprovider.html#faker.providers.BaseProvider.random_int).
- For categorical values, also use faker.random_int, but make sure the range covers exactly the valid category codes.
- For floats, use [faker.pyfloat](https://faker.readthedocs.io/en/master/providers/faker.providers.python.html#faker.providers.python.Provider.pyfloat), making sure you specify the correct number of decimal places.

In [None]:

for i in range(simulateRecordCount):
    age = 
    RestingBP = 
    Cholesterol = 
    FastingBS=  
    MaxHR = 
    OldPeak = 
    ChestPain = 
    Sex = 
    RestingECG  = 
    ExerciseAngina = 
    StSlope= 
    # Add all simulated variables to the list
    simData.append((age, Sex, ChestPain, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, OldPeak, StSlope))



To complete this section, create a dataframe from the data you have just simulated (simData), using the explicit schema you defined earlier (heartFailureDataSchema). [Documentation link](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation)

In [None]:
df = 
display(df)
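
When rows are passed as tuples together with an explicit schema, for example with a call of the form `spark.createDataFrame(simData, schema=heartFailureDataSchema)`, Spark matches tuple positions to schema fields in order. With the default schema verification it typically rejects a Python `int` where a `DoubleType` is declared. The pure-Python sketch below is one way to check the simulated rows before handing them to Spark. The names `EXPECTED_FIELDS` and `find_schema_mismatches` are hypothetical helpers, not part of any library.

```python
# (field name, expected Python type) pairs mirroring heartFailureDataSchema, in order.
EXPECTED_FIELDS = [
    ("Age", int), ("Sex", int), ("ChestPainType", int), ("RestingBP", int),
    ("Cholesterol", int), ("FastingBS", int), ("RestingECG", int),
    ("MaxHR", int), ("ExerciseAngina", int), ("Oldpeak", float),
    ("ST_Slope", int),
]


def find_schema_mismatches(rows):
    """Return a list of problems; an empty list means every row lines up."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != len(EXPECTED_FIELDS):
            problems.append(
                f"row {i}: expected {len(EXPECTED_FIELDS)} values, got {len(row)}"
            )
            continue
        for value, (name, expected_type) in zip(row, EXPECTED_FIELDS):
            # Exact type check: an int is not accepted where a float is expected
            if type(value) is not expected_type:
                problems.append(
                    f"row {i}: {name} should be {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


good_row = (54, 1, 2, 130, 220, 0, 0, 150, 0, 1.5, 1)
bad_row = (54, 1, 2, 130, 220, 0, 0, 150, 0, 1, 1)  # Oldpeak given as an int

print(find_schema_mismatches([good_row]))  # []
print(find_schema_mismatches([bad_row]))
```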

Now, convert the dataframe to pandas.

In [18]:
data_df = df.toPandas()


#### Part 2 instructions
Load the trained and registered model to generate predictions.

Load the model (rfc1_sm, version 1). Step 3 of the [documentation](https://learn.microsoft.com/en-us/fabric/data-science/model-scoring-predict#call-predict-from-a-notebook) shows how to use the MLFlowTransformer class.

In [19]:
model = (




)


Downloading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]
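
As a hedged sketch of the arguments the transformer takes, the block below collects them in a dictionary. It assumes that the model expects the eleven schema columns as inputs and that the output column is named `predictions`. Both are assumptions, and `transformer_args` is a hypothetical variable name. The model name and version come from the instructions above.

```python
# Feature columns in the same order as heartFailureDataSchema.
# In the notebook, list(data_df.columns) would yield the same names.
feature_columns = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope",
]

transformer_args = {
    "inputCols": feature_columns,  # columns passed to the model (assumed)
    "outputCol": "predictions",    # assumed name for the new output column
    "modelName": "rfc1_sm",        # registered model named in the instructions
    "modelVersion": 1,             # model version named in the instructions
}

# Inside a Fabric notebook, where synapse.ml is available, these arguments
# could be passed to the transformer imported earlier:
# model = MLFlowTransformer(**transformer_args)
print(transformer_args)
```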




Now, generate predictions using the transformer you just created, then display them. [Documentation link](https://learn.microsoft.com/en-us/fabric/data-science/model-scoring-predict#predict-with-the-transformer-api)

In [20]:
predictions = 
display(predictions)


#### Part 3 instructions
Format the predictions and save them as a Delta table for consumption.

First, set the correct Spark configuration. Refer to notebook 1 for an in-depth explanation of the following code cell.

In [21]:
# Optimize writes to Delta Table
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write


Before writing the predictions, add an id column so that you can identify the "patients" who are predicted to be at risk of heart failure. Run the following code cell to add it. The cell combines row_number with [monotonically_increasing_id](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.monotonically_increasing_id.html) to produce consecutive ids that start at 0.

In [None]:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import monotonically_increasing_id

# Add a new column "id": row_number() over a window ordered by monotonically_increasing_id(), minus 1 so ids start at 0
predictions = predictions.withColumn("id", row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)

display(predictions)
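
The cell above does not use `monotonically_increasing_id()` directly as the id because its values are unique and increasing but not consecutive. The Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The pure-Python sketch below mimics that layout for two partitions of three rows each, then applies the equivalent of `row_number() - 1`. The helper `simulated_monotonic_ids` is hypothetical and only imitates the documented layout.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def simulated_monotonic_ids(partition_sizes):
    """Mimic the ID layout documented for monotonically_increasing_id()."""
    return [
        (partition << RECORD_BITS) + offset
        for partition, size in enumerate(partition_sizes)
        for offset in range(size)
    ]


raw_ids = simulated_monotonic_ids([3, 3])  # two partitions, three rows each

# Equivalent of row_number().over(Window.orderBy(<raw id>)) - 1:
# rank the rows by raw id, starting from 0.
dense_ids = [rank for rank, _ in enumerate(sorted(raw_ids))]

print(raw_ids)    # [0, 1, 2, 8589934592, 8589934593, 8589934594]
print(dense_ids)  # [0, 1, 2, 3, 4, 5]
```

Note that a window with `orderBy` and no partitioning moves all rows to a single partition. That is harmless for 10 simulated records, but it is worth reconsidering for large datasets.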

Now, write the predictions you just generated to a delta table under the "Tables/table_name" directory. Refer to notebook 1 for information on how to write delta tables.

In [23]:
table_name = "heartFailure_pred"
#Enter a line of code to write the predictions to a delta table
predictions.
print(f"Output Predictions saved to delta table: {table_name}")


Output Predictions saved to delta table: heartFailure_pred
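
One common way to write a Spark dataframe as a Delta table is the `write.format("delta")` chain shown in the sketch below. It assumes that overwriting any output from an earlier run is acceptable and that a default lakehouse is attached, so that the relative "Tables/" path resolves to it. The helper name `save_predictions` is hypothetical.

```python
def save_predictions(predictions, table_name):
    """Write a Spark dataframe as a Delta table under the Tables/ directory."""
    (
        predictions.write
        .format("delta")
        .mode("overwrite")  # assumption: replace the output of any earlier run
        .save(f"Tables/{table_name}")
    )
```

In the notebook, this could be called as `save_predictions(predictions, table_name)` in place of the blank line in the cell above.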


In Fabric notebooks (and Spark notebooks in general), you can use Spark SQL to query Delta tables with SQL commands. Use SQL to preview the table you just wrote.

- [Convert the next cell to SQL](https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook#use-multiple-languages)
- [Select all (*) data from a table in SQL](https://www.w3schools.com/sql/sql_select.asp)


<Spark SQL result set with 10 rows and 12 fields>