## Module 4 - Simulate input data, perform batch predictions, and save predictions to the Lakehouse

### Simulate input diabetes diagnostic data to be used for predictions


Use the [Faker](https://faker.readthedocs.io/en/master/) Python package to simulate diabetes diagnostic data. Python libraries can be added in the workspace settings or installed inline with _%pip install Faker_. For details, see [Manage Apache Spark libraries](https://learn.microsoft.com/en-us/fabric/data-engineering/library-management) in the public docs.

In [9]:
%pip install Faker==18.10.1

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, -1, Finished, Available)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/nfs4/pyenv-31283396-7ec4-4422-b045-52ad9a73929e/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.





In [10]:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

diabDataSchema = StructType(
[
    StructField('pregnancies', IntegerType(), True),
    StructField('plasma_glucose', IntegerType(), True),
    StructField('blood_pressure', IntegerType(), True),
    StructField('triceps_skin_thickness', IntegerType(), True),
    StructField('insulin_level', StringType(), True),
    StructField('obesity_level', StringType(), True),
    StructField('diabetes_pedigree', DoubleType(), True),
    StructField('age', IntegerType(), True)
]
)

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 24, Finished, Available)

In [11]:
from faker import Faker

faker = Faker()
simulateRecordCount = 10
simData = []

for i in range(simulateRecordCount):
    pregnancies = faker.random_int(0,8)
    plasma_glucose = faker.random_int(70, 170)
    blood_pressure = faker.random_int(50, 120)
    triceps_skin_thickness = faker.random_int(10, 50)    
    diabetes_pedigree = faker.pyfloat(right_digits = 3, positive = True, max_value = 2.42)
    age = faker.random_int(21, 81)

    insulin_level = faker.random_element(elements=('normal','abnormal'))
    obesity_level = faker.random_element(elements=('underweight','normal','overweight','obese'))


    simData.append((pregnancies, plasma_glucose, blood_pressure, triceps_skin_thickness, insulin_level, obesity_level, diabetes_pedigree, age))

#print(simData)

df = spark.createDataFrame(data = simData, schema = diabDataSchema)
display(df)



StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 25, Finished, Available)

SynapseWidget(Synapse.DataFrame, 684ba5b5-63e9-4b45-97f0-4c8bbe0aa1ad)
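
Before the simulated tuples are handed to `spark.createDataFrame`, it can be useful to check them against the ranges used in the loop above. The following is a minimal sketch in plain Python, not part of the original notebook: `validate_record`, `FIELD_ORDER`, and the constants are hypothetical names, and the bounds are simply copied from the Faker calls. It assumes the pedigree value is non-negative.

```python
# Hypothetical helper (not in the original notebook).
# Bounds mirror the arguments passed to Faker in the simulation loop.
FIELD_ORDER = (
    "pregnancies", "plasma_glucose", "blood_pressure", "triceps_skin_thickness",
    "insulin_level", "obesity_level", "diabetes_pedigree", "age",
)
INT_RANGES = {
    "pregnancies": (0, 8),
    "plasma_glucose": (70, 170),
    "blood_pressure": (50, 120),
    "triceps_skin_thickness": (10, 50),
    "age": (21, 81),
}
INSULIN_LEVELS = {"normal", "abnormal"}
OBESITY_LEVELS = {"underweight", "normal", "overweight", "obese"}
MAX_DIABETES_PEDIGREE = 2.42


def validate_record(record):
    """Return a list of problems found in one simulated tuple (empty if valid)."""
    if len(record) != len(FIELD_ORDER):
        return [f"expected {len(FIELD_ORDER)} fields, got {len(record)}"]

    row = dict(zip(FIELD_ORDER, record))
    problems = []

    # Integer columns must be ints inside the simulated range.
    for name, (low, high) in INT_RANGES.items():
        value = row[name]
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{name}={value!r} outside [{low}, {high}]")

    # Categorical columns must use one of the simulated labels.
    if row["insulin_level"] not in INSULIN_LEVELS:
        problems.append(f"insulin_level={row['insulin_level']!r} not recognised")
    if row["obesity_level"] not in OBESITY_LEVELS:
        problems.append(f"obesity_level={row['obesity_level']!r} not recognised")

    # Assumption: the pedigree is a non-negative float capped at max_value.
    pedigree = row["diabetes_pedigree"]
    if not isinstance(pedigree, float) or not 0 <= pedigree <= MAX_DIABETES_PEDIGREE:
        problems.append(f"diabetes_pedigree={pedigree!r} outside [0, {MAX_DIABETES_PEDIGREE}]")

    return problems
```

In the notebook this could be applied as `bad_rows = [r for r in simData if validate_record(r)]` before creating the DataFrame; an empty list means every row matches the schema and ranges.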

### Load trained and registered model to generate predictions

In [3]:
import mlflow
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from synapse.ml.core.platform import *
from synapse.ml.lightgbm import LightGBMRegressor

model_uri = "models:/diabetes-lgbm/latest"
model = mlflow.spark.load_model(model_uri)

predictions_df = model.transform(df)
display(predictions_df)

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 7, Finished, Available)

2023/06/23 20:02:49 INFO mlflow.spark: 'models:/diabetes-lgbm/latest' resolved as 'abfss://f592ff04-4de0-4237-b356-fa21aef3f3e6@msit-onelake.dfs.fabric.microsoft.com/c9401ef2-c4c5-4320-b573-35eb5f5efdfc/17b710e2-fdc9-40cf-9bb6-24ccf84cf19f/artifacts'
2023/06/23 20:02:50 INFO mlflow.spark: File 'abfss://f592ff04-4de0-4237-b356-fa21aef3f3e6@msit-onelake.dfs.fabric.microsoft.com/c9401ef2-c4c5-4320-b573-35eb5f5efdfc/17b710e2-fdc9-40cf-9bb6-24ccf84cf19f/artifacts/sparkml' is already on DFS, copy is not necessary.


SynapseWidget(Synapse.DataFrame, 1aa3dc80-eab6-4d8c-839f-81386c2eb562)
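
The cell above always loads whichever model version is newest. MLflow registry URIs also accept an explicit version in the form `models:/<name>/<version>`, which makes a batch scoring run easier to reproduce. The helper below is a small sketch of that idea; `build_model_uri` is a hypothetical name and the version number in the usage line is only an example, not a version known to exist in this workspace.

```python
def build_model_uri(name, version=None):
    """Build an MLflow model registry URI.

    Falls back to the 'latest' alias when no version is given,
    which matches the URI used in the cell above.
    """
    if not name or "/" in name:
        raise ValueError("model name must be non-empty and must not contain '/'")
    suffix = "latest" if version is None else str(int(version))
    return f"models:/{name}/{suffix}"


# Same URI as in the notebook cell.
print(build_model_uri("diabetes-lgbm"))
# Pinned to a specific (example) version.
print(build_model_uri("diabetes-lgbm", 3))
```

The resulting string can be passed to `mlflow.spark.load_model` exactly as `model_uri` is in the original cell.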

### Format Predictions and save as a Delta Table for consumption

In [4]:
from pyspark.sql.functions import format_number, udf
from pyspark.sql.types import FloatType

# Return the larger of the two class probabilities, i.e. the model's confidence in the predicted class
max_probability = udf(lambda v: max(float(v[0]), float(v[1])), FloatType())

# Add the formatted confidence and the integer prediction, then drop the intermediate model columns
predictions_formatted_df = predictions_df \
    .withColumn("prob", format_number(max_probability('probability'), 4)) \
    .withColumn("diab_pred", predictions_df.prediction.cast('int')) \
    .drop("features", "rawPrediction", "probability", "prediction", "insulin_level_vec", "obesity_level_vec")

display(predictions_formatted_df)


StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 8, Finished, Available)

SynapseWidget(Synapse.DataFrame, 4c732970-1970-41e1-b728-101e7dff3e05)
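
To make the formatting step concrete, here is a plain-Python sketch of what happens to a single `probability` vector: the UDF keeps the larger of the two class probabilities, and `format_number(..., 4)` turns it into a string with four decimal places. The function names are hypothetical, and `format_confidence` only approximates Spark's `format_number` (rounding at the last digit may differ).

```python
def predicted_class_confidence(probability):
    """Mirror the UDF: return the larger of the two class probabilities."""
    p0, p1 = float(probability[0]), float(probability[1])
    return p0 if p0 > p1 else p1


def format_confidence(value, decimals=4):
    """Approximate Spark's format_number: fixed decimals, comma grouping, string result."""
    return f"{value:,.{decimals}f}"


# Example vector: 20% for class 0, 80% for class 1.
confidence = predicted_class_confidence([0.2, 0.8])
print(format_confidence(confidence))
```

Because `format_number` returns a string, the `prob` column in the saved table is a string column; cast it back to a numeric type downstream if it needs to be sorted or aggregated.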

In [5]:
# optimize writes to Delta Table
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable V-Order (Verti-Parquet) write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 9, Finished, Available)

In [6]:
table_name = "diabetes_pred"
predictions_formatted_df.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Output Predictions saved to delta table: {table_name}")

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 10, Finished, Available)

Output Predictions saved to delta table: diabetes_pred


In [7]:
%%sql

--preview predicted data
select * from diabetes_pred limit 10;

StatementMeta(, b510b761-bca5-49db-a0c6-37a1039325f1, 11, Finished, Available)

<Spark SQL result set with 10 rows and 12 fields>