## Module 2: Perform data cleansing and preparation using Apache Spark

#### Reading data from a Delta table

In [1]:
data_df = spark.read.format("delta").load("Tables/diabetes")
display(data_df)

#### Checking if datatypes are numerical

In [2]:
data_df.dtypes

[('pregnancies', 'int'),
 ('plasma_glucose', 'int'),
 ('blood_pressure', 'int'),
 ('triceps_skin_thickness', 'int'),
 ('insulin', 'int'),
 ('bmi', 'double'),
 ('diabetes_pedigree', 'double'),
 ('age', 'int'),
 ('diabetes', 'int')]
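
Reading the list by eye works for nine columns, but the same check can be automated. The sketch below is one possible way to do it, assuming that `DataFrame.dtypes` returns `(column_name, type_name)` pairs as shown above. The helper name `non_numeric_columns` and the set of type names treated as numeric are assumptions made for this example, not part of the original notebook.

```python
# Spark SQL type names treated as numeric in this sketch (an assumption);
# decimal types such as 'decimal(10,2)' are matched by prefix below.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def non_numeric_columns(dtypes):
    """Return the names of columns whose type name is not numeric.

    `dtypes` is a list of (column_name, type_name) pairs, the shape
    returned by `DataFrame.dtypes`.
    """
    return [
        name
        for name, type_name in dtypes
        if type_name not in NUMERIC_TYPES and not type_name.startswith("decimal")
    ]


# The pairs printed by the cell above.
diabetes_dtypes = [
    ("pregnancies", "int"),
    ("plasma_glucose", "int"),
    ("blood_pressure", "int"),
    ("triceps_skin_thickness", "int"),
    ("insulin", "int"),
    ("bmi", "double"),
    ("diabetes_pedigree", "double"),
    ("age", "int"),
    ("diabetes", "int"),
]

print(non_numeric_columns(diabetes_dtypes))  # prints []
```

In the notebook itself, `non_numeric_columns(data_df.dtypes)` would run the same check against the live dataframe.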

#### Summarize dataframe

In [3]:
display(data_df.summary())

Observations from the summary above:

- blood_pressure and bmi are 0 for a few rows
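
To put a number on "a few", the zeros can be counted per column. The sketch below builds one Spark SQL aggregate expression per column and is meant to be passed to `DataFrame.selectExpr`. The helper name `zero_count_exprs` and the choice of columns are assumptions for illustration. `pregnancies` is left out because 0 is a valid value there.

```python
def zero_count_exprs(columns):
    """Build one Spark SQL expression per column that counts rows equal to 0."""
    return [
        f"sum(CASE WHEN `{c}` = 0 THEN 1 ELSE 0 END) AS `{c}_zeros`"
        for c in columns
    ]


# Measurement columns where a reading of 0 is not plausible (an assumption).
measurement_cols = [
    "plasma_glucose",
    "blood_pressure",
    "triceps_skin_thickness",
    "insulin",
    "bmi",
]
exprs = zero_count_exprs(measurement_cols)

# In the notebook, where `data_df` is defined:
# display(data_df.selectExpr(*exprs))
```

Because every expression is an aggregate, the result is a single row with one zero count per column.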

In [4]:
display(data_df.select("age").summary())

In [5]:
display(data_df.groupBy("age").count())

#### Missing observation analysis

The summary above shows that several measurement columns contain 0. A value of 0 is not plausible for readings such as blood pressure or BMI, so it most likely marks a missing value. Below we replace 0 with null in those columns:

In [6]:
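# Treat 0 as missing in the measurement columns where it is not a plausible reading
zero_as_missing_cols = ['plasma_glucose', 'blood_pressure', 'triceps_skin_thickness', 'insulin', 'bmi']
data_df_fillna = data_df.replace(0, None, subset=zero_as_missing_cols)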

In [7]:
display(data_df_fillna.filter("blood_pressure IS NULL"))

In [8]:
data_df_fillna.summary("count").show()

+-------+-----------+--------------+--------------+----------------------+-------+---+-----------------+---+--------+
|summary|pregnancies|plasma_glucose|blood_pressure|triceps_skin_thickness|insulin|bmi|diabetes_pedigree|age|diabetes|
+-------+-----------+--------------+--------------+----------------------+-------+---+-----------------+---+--------+
|  count|        768|           763|           733|                   541|    394|757|              768|768|     768|
+-------+-----------+--------------+--------------+----------------------+-------+---+-----------------+---+--------+
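
Since `summary("count")` reports non-null values only, the number of missing values per column is the total row count minus the count shown. A short worked calculation follows, assuming the total is 768 (the count of columns such as `pregnancies` that were not touched by the replacement).

```python
total_rows = 768  # assumed total: count of a column with no nulls, e.g. pregnancies

# Non-null counts copied from the table above.
non_null_counts = {
    "plasma_glucose": 763,
    "blood_pressure": 733,
    "triceps_skin_thickness": 541,
    "insulin": 394,
    "bmi": 757,
}

# Missing values and their share of all rows, per column.
missing = {col: total_rows - n for col, n in non_null_counts.items()}
missing_pct = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

for col in non_null_counts:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

This gives 5 missing values for `plasma_glucose`, 35 for `blood_pressure`, 227 for `triceps_skin_thickness`, 374 for `insulin`, and 11 for `bmi`. Nearly half of the `insulin` readings are missing, which is worth keeping in mind for any feature derived from that column.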



#### Feature engineering

- Adding obesity levels based on BMI
- Adding insulin levels based on the insulin reading

In [9]:
from pyspark.sql.functions import when

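# Bucket BMI into obesity levels. A null bmi fails every comparison below,
# so rows with a missing BMI fall through to otherwise() and are labelled 'obese'.
data_df_newbmi = data_df_fillna.withColumn('obesity_level', when(data_df_fillna.bmi <= 18.5, 'underweight')
                    .when((data_df_fillna.bmi > 18.5) & (data_df_fillna.bmi <= 24.9), 'normal')
                    .when((data_df_fillna.bmi > 24.9) & (data_df_fillna.bmi <= 29.9), 'overweight')
                    .otherwise('obese'))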

display(data_df_newbmi)
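
One detail of the cell above: a comparison against null evaluates to null in Spark, so a row with a missing `bmi` matches none of the `when` branches and ends up in `otherwise('obese')`. The sketch below shows one possible null-aware variant with the same thresholds, written as a Spark SQL `CASE` expression. The names `OBESITY_LEVEL_SQL` and `obesity_level` are made up for this example, and the plain-Python function is only there to spot-check the boundaries.

```python
# Spark SQL CASE expression. A null bmi would match no comparison branch
# and land in ELSE, so it is handled explicitly first.
OBESITY_LEVEL_SQL = """
CASE
    WHEN bmi IS NULL THEN NULL
    WHEN bmi <= 18.5 THEN 'underweight'
    WHEN bmi <= 24.9 THEN 'normal'
    WHEN bmi <= 29.9 THEN 'overweight'
    ELSE 'obese'
END
"""


def obesity_level(bmi):
    """Plain-Python mirror of the CASE expression, for spot checks."""
    if bmi is None:
        return None
    if bmi <= 18.5:
        return "underweight"
    if bmi <= 24.9:
        return "normal"
    if bmi <= 29.9:
        return "overweight"
    return "obese"


# In the notebook (assumes `data_df_fillna` from the earlier cell):
# from pyspark.sql.functions import expr
# data_df_newbmi = data_df_fillna.withColumn("obesity_level", expr(OBESITY_LEVEL_SQL))
```

Switching to this variant would change any downstream counts by obesity level, because rows with a missing BMI would no longer be counted as obese.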

In [10]:
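# Label insulin readings. A null insulin fails the comparison,
# so rows with a missing reading are labelled 'abnormal' by otherwise().
data_df_processed = data_df_newbmi.withColumn('insulin_level',
              when(data_df_newbmi.insulin <= 16, 'normal')
              .otherwise('abnormal'))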

In [11]:
display(data_df_processed)

#### Save processed data to a Delta Table

In [12]:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write
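
Spark generally stores an unrecognized configuration key without raising an error, so a misspelled key can go unnoticed. The sketch below is one way to guard against the simplest kind of typo: it keeps the settings in a dictionary and rejects any key that does not start with `spark.`. The names `WRITE_SETTINGS` and `apply_settings` are assumptions for this example, and the prefix check is only a basic sanity test, not a validation of the full key.

```python
# Write-optimization settings from the cell above, kept in one place.
WRITE_SETTINGS = {
    "spark.sql.parquet.vorder.enabled": "true",
    "spark.microsoft.delta.optimizeWrite.enabled": "true",
}


def apply_settings(conf, settings):
    """Set each key on `conf`, rejecting keys that do not start with 'spark.'.

    `conf` is any object with a `set(key, value)` method, such as `spark.conf`.
    """
    for key, value in settings.items():
        if not key.startswith("spark."):
            raise ValueError(f"unexpected config key: {key!r}")
        conf.set(key, value)


# In the notebook, where `spark` is defined:
# apply_settings(spark.conf, WRITE_SETTINGS)
```

Reading a value back with `spark.conf.get(key)` after setting it is another simple way to confirm that the intended key was applied.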

In [13]:
table_name = "diabetes_processed"
data_df_processed.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")

Spark dataframe saved to delta table: diabetes_processed


In [14]:
%%sql

select * from diabetes_processed limit 100;

<Spark SQL result set with 100 rows and 11 fields>

In [15]:
%%sql

select obesity_level, diabetes, count(*) as count
from diabetes_processed
--where diabetes = 1
group by obesity_level, diabetes

<Spark SQL result set with 7 rows and 3 fields>
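
The grouped counts are easier to compare once they are turned into a share of diabetic rows per obesity level. The sketch below shows one way to do that in plain Python, assuming the query result has been collected into `(obesity_level, diabetes, count)` tuples. The function name `diabetes_rate_by_level` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def diabetes_rate_by_level(rows):
    """Return {obesity_level: share of rows with diabetes == 1}.

    `rows` is an iterable of (obesity_level, diabetes, count) tuples,
    the shape produced by the grouped query above.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for level, diabetes, count in rows:
        total[level] += count
        if diabetes == 1:
            positive[level] += count
    return {level: positive[level] / total[level] for level in total}


# In the notebook, where `spark` is defined:
# rows = spark.sql("""
#     select obesity_level, diabetes, count(*) as count
#     from diabetes_processed
#     group by obesity_level, diabetes
# """).collect()
# rates = diabetes_rate_by_level(
#     (r["obesity_level"], r["diabetes"], r["count"]) for r in rows
# )
```

The `count` field is read with `r["count"]` rather than `r.count`, because `Row` is a tuple subclass and `count` is already a tuple method.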