# Module 1: Read data from lakehouse and load into Delta format

**Lakehouse**: A lakehouse is a collection of files, folders, and tables that represents a database over a data lake. The Spark engine and the SQL engine use it for big data processing, and it supports ACID transactions when tables are stored in the open-source Delta format.

**Delta Lake**: Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and streaming data processing to Apache Spark. A Delta Lake table is a data table format that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata management.
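
To make the idea of a file-based transaction log concrete, here is a minimal sketch that uses only the Python standard library. It is a toy model, not Delta Lake's implementation: the helper names `commit`, `live_files`, and `demo_overwrite` and the Parquet file names are hypothetical, and only `add` and `remove` actions are modeled. Each commit is a JSON file named by a zero-padded version number, and the current table contents are found by replaying the commits in order.

```python
import json
import tempfile
from pathlib import Path


def commit(log_dir: Path, version: int, actions: list[dict]) -> Path:
    """Write one commit as a newline-delimited JSON file named by its version."""
    path = log_dir / f"{version:020d}.json"
    # Mode "x" raises FileExistsError if the version is already taken,
    # so two writers cannot both claim the same commit.
    with open(path, "x", encoding="utf-8") as fh:
        for action in actions:
            fh.write(json.dumps(action) + "\n")
    return path


def live_files(log_dir: Path) -> set[str]:
    """Replay commits in version order to find the files that make up the table."""
    files: set[str] = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        for line in commit_file.read_text(encoding="utf-8").splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


def demo_overwrite() -> set[str]:
    """Simulate an initial write (version 0) followed by an overwrite (version 1)."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        commit(log_dir, 0, [
            {"add": {"path": "part-0000.parquet"}},
            {"add": {"path": "part-0001.parquet"}},
        ])
        commit(log_dir, 1, [
            {"remove": {"path": "part-0000.parquet"}},
            {"remove": {"path": "part-0001.parquet"}},
            {"add": {"path": "part-0002.parquet"}},
        ])
        return live_files(log_dir)
```

Because an overwrite is recorded as `remove` and `add` actions inside a single new commit, a reader that replays the log sees either the old set of files or the new one, never a mix of both.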


#### Prerequisites

* A [Microsoft Fabric subscription](https://learn.microsoft.com/en-us/fabric/enterprise/licenses) or sign up for a free [Microsoft Fabric (Preview) trial](https://learn.microsoft.com/en-us/fabric/get-started/fabric-trial).
* Sign in to [Microsoft Fabric](https://fabric.microsoft.com/).
* Create a Fabric workspace and lakehouse, or use existing ones. See [Create a Lakehouse in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-engineering/create-lakehouse) for the steps.
* Create a shortcut to the ADLS Gen2 account and load the heart failure data into the Files section of the lakehouse.
* After opening this notebook, attach the lakehouse to it.

In [1]:
# Read the heart failure dataset file
df = spark.read.format("csv").option("header","true").option("inferSchema", "true").load("Files/files/heart.csv")
# df is now a Spark DataFrame containing the CSV data from "Files/files/heart.csv".
display(df)


In [2]:
# Print the inferred schema
df.printSchema()


root
 |-- Age: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: double (nullable = true)
 |-- Cholesterol: double (nullable = true)
 |-- FastingBS: double (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: integer (nullable = true)
 |-- RowNumber: integer (nullable = true)
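
Schema inference makes Spark take an extra pass over the file, and it can guess a type you did not intend. As a sketch of an alternative, the inferred schema above could be pinned explicitly with a DDL string passed to `DataFrameReader.schema`. The names `HEART_COLUMNS`, `to_ddl`, and `read_heart_csv` are hypothetical helpers introduced for this example, and the column types simply mirror the `printSchema()` output.

```python
# Column names and types copied from the printSchema() output above.
HEART_COLUMNS = [
    ("Age", "INT"),
    ("Sex", "STRING"),
    ("ChestPainType", "STRING"),
    ("RestingBP", "DOUBLE"),
    ("Cholesterol", "DOUBLE"),
    ("FastingBS", "DOUBLE"),
    ("RestingECG", "STRING"),
    ("MaxHR", "INT"),
    ("ExerciseAngina", "STRING"),
    ("Oldpeak", "DOUBLE"),
    ("ST_Slope", "STRING"),
    ("HeartDisease", "INT"),
    ("RowNumber", "INT"),
]


def to_ddl(columns):
    """Build a DDL schema string such as 'Age INT, Sex STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_heart_csv(spark, path):
    """Read the CSV with a fixed schema instead of inferSchema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(to_ddl(HEART_COLUMNS))
        .load(path)
    )

# In the notebook (assumes the same file path as the first cell):
# df = read_heart_csv(spark, "Files/files/heart.csv")
```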



## Write Spark DataFrame to lakehouse Delta table

**Enable VOrder and Optimize Write**

**Verti-Parquet or VOrder**: Fabric includes Microsoft's VertiParquet engine. The VertiParquet writer optimizes Delta Lake Parquet files, resulting in 3x-4x better compression and up to 10x faster performance compared with Delta Lake files not optimized with VertiParquet, while still maintaining full compliance with the Delta Lake and Parquet formats.

**Optimize write**: Spark in Fabric includes an Optimize Write feature that reduces the number of files written and aims to increase the size of each file. It dynamically optimizes files during write operations, generating files with a default size of 128 MB. The target file size can be changed per workload through configuration.

These configurations can be applied at the session level (with `spark.conf.set` in a notebook cell), as demonstrated in the following code cell, or at the workspace level, where they apply automatically to all Spark sessions created in the workspace. The workspace-level Apache Spark configuration can be set at:
- _Workspace settings >> Data Engineering/Science >> Spark Compute >> Spark Properties >> Add_

In [3]:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write
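
A misspelled property name is easy to miss, because `spark.conf.set` stores whatever key it is given. The following is a small sketch of a guard for session-level settings. `apply_session_confs` and `SESSION_CONFS` are hypothetical names, not part of Spark or Fabric, and the helper only assumes that the object passed in offers `set(key, value)` and `get(key)` the way `spark.conf` does.

```python
# Session-level properties for this notebook (note the "spark." prefix on each key).
SESSION_CONFS = {
    "spark.sql.parquet.vorder.enabled": "true",
    "spark.microsoft.delta.optimizeWrite.enabled": "true",
}


def apply_session_confs(conf, settings):
    """Set each property on a spark.conf-like object and read it back."""
    applied = {}
    for key, value in settings.items():
        # Cheap typo guard: Spark properties used here all start with "spark.".
        if not key.startswith("spark."):
            raise ValueError(f"Suspicious property name (typo?): {key!r}")
        conf.set(key, value)
        applied[key] = conf.get(key)
    return applied

# In the notebook: apply_session_confs(spark.conf, SESSION_CONFS)
```

The prefix check is deliberately simple. It catches a slip such as a missing letter in `spark`, but it does not verify that a property is one the runtime actually recognizes.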


In [4]:
table_name = "heartFailure"
df.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")


Spark dataframe saved to delta table: heartFailure


In [5]:
data_df = spark.read.format("delta").load("Tables/heartFailure")
display(data_df)
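
A quick sanity check after reading the table back is to compare it with the source DataFrame. The sketch below assumes both arguments behave like Spark DataFrames, that is, they expose a `columns` attribute and a `count()` method. `check_round_trip` is a hypothetical helper, not a Fabric or Spark API.

```python
def check_round_trip(source_df, table_df):
    """Raise if column names or row counts differ; return the row count otherwise."""
    if list(source_df.columns) != list(table_df.columns):
        raise ValueError(
            f"Column mismatch: {source_df.columns} vs {table_df.columns}"
        )
    source_rows = source_df.count()
    table_rows = table_df.count()
    if source_rows != table_rows:
        raise ValueError(f"Row count mismatch: {source_rows} vs {table_rows}")
    return table_rows

# In the notebook: check_round_trip(df, data_df)
```

This compares only shape, not values, so it is a cheap first check rather than a full validation of the written data.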
