## Module 4 - Simulate input data, perform batch predictions, and save predictions to the Lakehouse

### Simulate input heart failure diagnostic data to be used for predictions


Use the [Faker](https://faker.readthedocs.io/en/master/) Python package to simulate heart failure diagnostic data. Python libraries can be added in the Workspace Settings or installed inline with _%pip install Faker_. For details, see the public docs: [Manage Apache Spark libraries](https://learn.microsoft.com/en-us/fabric/data-engineering/library-management).

In [1]:
%pip install Faker==18.10.1

Collecting Faker==18.10.1
  Downloading Faker-18.10.1-py3-none-any.whl (1.7 MB)
Installing collected packages: Faker
Successfully installed Faker-18.10.1
Note: you may need to restart the kernel to use updated packages.



In [2]:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, LongType

heartFailureDataSchema = StructType(
[
    StructField('Age', IntegerType(), True),
    StructField('Sex', IntegerType(), True),
    StructField('ChestPainType', IntegerType(), True),
    StructField('RestingBP', IntegerType(), True),
    StructField('Cholesterol', IntegerType(), True),
    StructField('FastingBS', IntegerType(), True),
    StructField('RestingECG', IntegerType(), True),
    StructField('MaxHR', IntegerType(), True),
    StructField('ExerciseAngina', IntegerType(), True),
    StructField('Oldpeak', DoubleType(), True),
    StructField('ST_Slope', IntegerType(), True)
]
)

In [3]:
from faker import Faker

faker = Faker()
simulateRecordCount = 10
simData = []

for _ in range(simulateRecordCount):
    # Draw each feature from a plausible range; categorical features use integer codes
    age = faker.random_int(54, 70)
    RestingBP = faker.random_int(70, 170)
    Cholesterol = faker.random_int(100, 300)
    FastingBS = faker.random_int(0, 1)
    MaxHR = faker.random_int(100, 200)
    OldPeak = faker.pyfloat(right_digits=1, positive=True, max_value=4.5)
    ChestPain = faker.random_int(0, 3, 1)
    Sex = faker.random_int(0, 1)
    RestingECG = faker.random_int(0, 2, 1)
    ExerciseAngina = faker.random_int(0, 1)
    StSlope = faker.random_int(0, 1)
    # Tuple order must match the field order in heartFailureDataSchema
    simData.append((age, Sex, ChestPain, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, OldPeak, StSlope))

df = spark.createDataFrame(data=simData, schema=heartFailureDataSchema)
display(df)
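
The simulation cell relies on two things: each tuple must list its values in the same order as the schema fields, and each value should stay inside its intended range. A minimal sketch of how both could be checked in plain Python is shown here. It uses the standard library `random` module in place of Faker so that it runs outside the notebook; `FIELD_RANGES`, `simulate_record`, `to_row`, and `is_valid` are hypothetical names, the ranges are copied from the cell above, and the lower bound of 0.0 for `Oldpeak` is an assumption.

```python
import random

# Field order mirrors heartFailureDataSchema; ranges are copied from the simulation cell.
# The 0.0 lower bound for Oldpeak is an assumption made for this sketch.
FIELD_RANGES = {
    "Age": (54, 70),
    "Sex": (0, 1),
    "ChestPainType": (0, 3),
    "RestingBP": (70, 170),
    "Cholesterol": (100, 300),
    "FastingBS": (0, 1),
    "RestingECG": (0, 2),
    "MaxHR": (100, 200),
    "ExerciseAngina": (0, 1),
    "Oldpeak": (0.0, 4.5),
    "ST_Slope": (0, 1),
}


def simulate_record(rng):
    """Build one record as a dict keyed by field name."""
    record = {}
    for name, (low, high) in FIELD_RANGES.items():
        if isinstance(low, float):
            # Float feature: one decimal place, like right_digits=1 in the Faker call
            record[name] = round(rng.uniform(low, high), 1)
        else:
            # Integer feature: both bounds are inclusive
            record[name] = rng.randint(low, high)
    return record


def to_row(record):
    """Emit values in schema order, so positions cannot drift from field names."""
    return tuple(record[name] for name in FIELD_RANGES)


def is_valid(record):
    """Check that every value lies inside its declared range."""
    return all(low <= record[name] <= high for name, (low, high) in FIELD_RANGES.items())


rng = random.Random(42)  # fixed seed so the sketch is repeatable
records = [simulate_record(rng) for _ in range(10)]
rows = [to_row(r) for r in records]
print(all(is_valid(r) for r in records), len(rows))
```

In the notebook, a list like `rows` could be passed to `spark.createDataFrame(data=rows, schema=heartFailureDataSchema)` in the same way as `simData`.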




Convert the Spark DataFrame to a pandas DataFrame

In [4]:
data_df = df.toPandas()

### Load trained and registered model to generate predictions

In [5]:
from synapse.ml.predict import MLFlowTransformer

model = MLFlowTransformer(
    inputCols=list(df.columns),
    outputCol='predictions',
    modelName='rfc1_sm',
    modelVersion=1
)

In [6]:
import pandas

predictions = model.transform(df)
display(predictions)

### Format Predictions and save as a Delta Table for consumption

Refer to notebook 1 for more information about V-Order and Optimize Write.

In [8]:
# Optimize writes to Delta Table
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write

Add an `id` column so that individual "patients" at risk can be identified when building a report in Power BI.

In [9]:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import monotonically_increasing_id

# Add a 0-based "id" column: row_number() over a window ordered by monotonically_increasing_id(), minus 1
predictions = predictions.withColumn("id", row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)

display(predictions)
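
On its own, `monotonically_increasing_id()` yields values that increase but are not consecutive, because Spark puts the partition ID in the upper bits of the 64-bit result and the record number within the partition in the lower 33 bits. Ranking those values with `row_number()` and subtracting 1 gives a dense, 0-based sequence. The plain-Python sketch below imitates that behavior for illustration only; `fake_monotonic_ids` and `dense_ids` are hypothetical helpers, not Spark APIs, and the partition sizes are made up.

```python
def fake_monotonic_ids(partition_sizes):
    """Imitate Spark's layout: partition index in the upper bits, row offset in the lower 33 bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for offset in range(size):
            ids.append((partition_index << 33) + offset)
    return ids


def dense_ids(sparse_ids):
    """Imitate row_number() over an ordering by the sparse id, minus 1."""
    ranked = sorted(sparse_ids)
    position = {value: index for index, value in enumerate(ranked)}
    return [position[value] for value in sparse_ids]


# 10 rows spread over 3 hypothetical partitions
sparse = fake_monotonic_ids([4, 3, 3])
dense = dense_ids(sparse)
print(sparse[:5])  # the fifth value jumps because it belongs to the second partition
print(dense)
```

Note that a window with `orderBy` but no `partitionBy` makes Spark move all rows to a single partition. That is harmless for the handful of simulated rows here, but it may become a bottleneck on large inputs.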

In [10]:
table_name = "heartFailure_pred"
predictions.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Output Predictions saved to delta table: {table_name}")

Output Predictions saved to delta table: heartFailure_pred


In [11]:
%%sql
--preview predicted data
select * from heartFailure_pred limit 10;

<Spark SQL result set with 10 rows and 13 fields>