## Module 4 - Simulate input data, perform batch predictions, and save predictions to the Lakehouse

### Simulate input heart failure diagnostic data to be used for predictions


Use the [Faker](https://faker.readthedocs.io/en/master/) Python package to simulate heart failure diagnostic data. Python libraries can be added in the workspace settings or installed inline with `%pip install Faker`. For details, see [Manage Apache Spark libraries](https://learn.microsoft.com/en-us/fabric/data-engineering/library-management) in the public documentation.

In [1]:
%pip install Faker==18.10.1

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 7, Finished, Available)

Collecting Faker==18.10.1
  Downloading Faker-18.10.1-py3-none-any.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 32.6 MB/s eta 0:00:00
Installing collected packages: Faker
Successfully installed Faker-18.10.1

[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.



In [2]:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

heartFailureDataSchema = StructType(
[
    StructField('Age', IntegerType(), True),
    StructField('Sex', StringType(), True),
    StructField('ChestPainType', StringType(), True),
    StructField('RestingBP', IntegerType(), True),
    StructField('Cholesterol', IntegerType(), True),
    StructField('FastingBS', IntegerType(), True),
    StructField('RestingECG', StringType(), True),
    StructField('MaxHR', IntegerType(), True),
    StructField('ExerciseAngina', StringType(), True),
    StructField('Oldpeak', DoubleType(), True),
    StructField('ST_Slope', StringType(), True)
]
)

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 9, Finished, Available)

In [3]:
from faker import Faker

faker = Faker()
simulateRecordCount = 10
simData = []

for _ in range(simulateRecordCount):
    # Numeric features, drawn at random from fixed ranges
    age = faker.random_int(54, 70)
    RestingBP = faker.random_int(70, 170)
    Cholesterol = faker.random_int(100, 300)
    FastingBS = faker.random_int(0, 1)
    MaxHR = faker.random_int(100, 200)
    OldPeak = faker.pyfloat(right_digits=1, positive=True, max_value=4.5)

    # Categorical features, picked from a fixed set of labels
    ChestPain = faker.random_element(elements=('ASY', 'ATA', 'TA', 'NAP'))
    Sex = faker.random_element(elements=('M', 'F'))
    RestingECG = faker.random_element(elements=('ST', 'NORMAL', 'LVH'))
    ExerciseAngina = faker.random_element(elements=('N', 'Y'))
    StSlope = faker.random_element(elements=('Up', 'Down'))

    # Tuple order must match the field order in heartFailureDataSchema
    simData.append((age, Sex, ChestPain, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, OldPeak, StSlope))

df = spark.createDataFrame(data=simData, schema=heartFailureDataSchema)
display(df)



StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 10, Finished, Available)

SynapseWidget(Synapse.DataFrame, 9615cf2b-42db-4440-bd2f-a81385b2a18e)
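
Because the simulated tuples are positional, a value in the wrong slot only surfaces later as a schema or model error. The sketch below is a plain-Python sanity check that could be run on `simData` before calling `spark.createDataFrame`. The names `FIELDS`, `NUMERIC_RANGES`, `CATEGORIES`, and `validate_record` are hypothetical helpers, not part of the original notebook; the ranges and labels are copied from the simulation cell above.

```python
# Hypothetical helper, not part of the original notebook.
# Field order mirrors heartFailureDataSchema.
FIELDS = ("Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
          "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope")

# Ranges and labels copied from the simulation cell.
NUMERIC_RANGES = {
    "Age": (54, 70), "RestingBP": (70, 170), "Cholesterol": (100, 300),
    "FastingBS": (0, 1), "MaxHR": (100, 200), "Oldpeak": (0.0, 4.5),
}
CATEGORIES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"ASY", "ATA", "TA", "NAP"},
    "RestingECG": {"ST", "NORMAL", "LVH"},
    "ExerciseAngina": {"N", "Y"},
    "ST_Slope": {"Up", "Down"},
}


def validate_record(record):
    """Return a list of problems found in one simulated tuple (empty if none)."""
    if len(record) != len(FIELDS):
        return [f"expected {len(FIELDS)} values, got {len(record)}"]
    problems = []
    for name, value in zip(FIELDS, record):
        if name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} outside [{low}, {high}]")
        elif value not in CATEGORIES[name]:
            problems.append(f"{name}={value!r} not in {sorted(CATEGORIES[name])}")
    return problems
```

In the notebook this could be used as `assert all(not validate_record(r) for r in simData)` right before the DataFrame is created.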

### Load trained and registered model to generate predictions

In [4]:
import mlflow
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from synapse.ml.core.platform import *
from synapse.ml.lightgbm import LightGBMRegressor

model_uri = "models:/heartfailure-lgmb/latest"
model = mlflow.spark.load_model(model_uri)

predictions_df = model.transform(df)
display(predictions_df)

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 11, Finished, Available)

2024/05/23 14:42:15 INFO mlflow.spark: 'models:/heartfailure-lgmb/latest' resolved as 'abfss://22e6347e-faec-465d-993a-fc1f9464e82c@onelakewestus3.pbidedicated.windows.net/a8c148f8-5aea-489d-aa46-29c1f2060df2/50bc634e-e006-4f2a-bb45-07a0dda9848c/artifacts'


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading artifacts:   0%|          | 0/87 [00:00<?, ?it/s]

2024/05/23 14:42:17 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2024/05/23 14:42:17 INFO mlflow.spark: File 'models:/heartfailure-lgmb/latest/sparkml' not found on DFS. Will attempt to upload the file.
2024/05/23 14:42:22 INFO mlflow.spark: Copied SparkML model to Files/tmp/mlflow/12ce9e90-2011-4e7e-ab96-58dc3237ea95


SynapseWidget(Synapse.DataFrame, 316abfef-2380-4608-a169-70d34c0e7747)

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 12, Finished, Available)
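
The `models:/<name>/latest` URI resolves to whichever version was registered most recently, so a batch scoring run can change behavior when a new version is registered. As a minimal sketch, assuming you would rather pin a specific registered version, the URI can be built from the model name and a version number. The helper `build_model_uri` and the version number in the usage note are hypothetical and do not come from the original notebook.

```python
def build_model_uri(name, version="latest"):
    """Build an MLflow model registry URI of the form models:/<name>/<version>.

    Hypothetical helper: `version` is either "latest" or a registered
    version number.
    """
    return f"models:/{name}/{version}"
```

For example, `build_model_uri("heartfailure-lgmb")` reproduces the URI used in the cell above, while `build_model_uri("heartfailure-lgmb", 3)` would point at version 3, assuming such a version exists in the registry.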

### Format Predictions and save as a Delta Table for consumption

In [5]:
from pyspark.sql.functions import get_json_object
from pyspark.sql.functions import col
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.sql.functions import format_number

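# Despite its name, this UDF returns the larger of the two class probabilities,
# i.e. the model's confidence in the class it predicted.
firstelement = udf(lambda v: max(float(v[0]), float(v[1])), FloatType())

# format_number returns a string column, so "prob" is text rounded to 4 decimal places.
# drop() silently ignores any listed column that is not present in the DataFrame.
predictions_formatted_df = predictions_df \
    .withColumn("prob", format_number(firstelement('probability'), 4)) \
    .withColumn("heartfailure_pred", predictions_df.prediction.cast('int')) \
    .drop("features", "rawPrediction", "probability", "prediction", "insulin_level_vec", "obesity_level_vec")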

display(predictions_formatted_df)


StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 13, Finished, Available)

SynapseWidget(Synapse.DataFrame, 604e865c-3046-49f6-ac43-b8875637639a)
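
To make the `prob` column easier to reason about, here is a plain-Python sketch of what the formatting cell computes for a single row: the larger of the two class probabilities, rendered as text with four decimal places. The functions `max_probability` and `format_prob` are hypothetical stand-ins for the UDF and for Spark's `format_number`, and rounding of exact ties may differ slightly from Spark.

```python
def max_probability(probability):
    """Return the probability of the more likely of the two classes."""
    return max(float(probability[0]), float(probability[1]))


def format_prob(value, digits=4):
    """Render a number as text with a fixed number of decimals.

    Stand-in for Spark's format_number: the result is a string,
    with a comma as the thousands separator.
    """
    return f"{value:,.{digits}f}"
```

For a probability vector `[0.2, 0.8]`, `format_prob(max_probability([0.2, 0.8]))` gives the string `"0.8000"`. Because the column holds text, downstream consumers that sort or aggregate on `prob` would need to cast it back to a numeric type first.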

In [6]:
# optimize writes to Delta Table
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable V-Order (VertiParquet) writes
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 14, Finished, Available)

In [7]:
table_name = "heartFailure_pred"
predictions_formatted_df.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Output Predictions saved to delta table: {table_name}")

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 15, Finished, Available)

Output Predictions saved to delta table: heartFailure_pred


In [8]:
%%sql
--preview predicted data
select * from heartFailure_pred limit 10;

StatementMeta(, 5318d9c9-f2a2-4b44-8118-9eab8bf3df7d, 16, Finished, Available)

<Spark SQL result set with 10 rows and 25 fields>