# Module 1: Read data from lakehouse and load into Delta format

**Lakehouse**: A lakehouse is a collection of files, folders, and tables that represents a database over a data lake. It is used by the Spark engine and the SQL engine for big data processing, and it adds ACID transaction capabilities when you use open-source Delta-formatted tables.

**Delta Lake**: Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and streaming data processing to Apache Spark. A Delta Lake table is a data table format that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata management.


#### Prerequisites

* A [Microsoft Fabric subscription](https://learn.microsoft.com/en-us/fabric/enterprise/licenses) or sign up for a free [Microsoft Fabric (Preview) trial](https://learn.microsoft.com/en-us/fabric/get-started/fabric-trial).
* Sign in to [Microsoft Fabric](https://fabric.microsoft.com/).
* Create or use an existing Fabric workspace and lakehouse. To create one, follow the steps in [Create a Lakehouse in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-engineering/create-lakehouse).
* Create a shortcut to the ADLS Gen2 account and load the heart failure data into the Files section of the lakehouse.
* After getting this notebook, attach the lakehouse to the notebook.

#### Part 1 Instructions:

In the following code cell, load `Files/heart.csv` into a dataframe, then display the dataframe to take a look at the data you will be working with. [Documentation link](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-notebook-load-data). Make sure the schema is read automatically from the source as well. [Documentation link](https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option)
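
One possible way to approach this step is sketched below. It assumes the file sits at `Files/heart.csv` as stated above; adjust the path if your shortcut places it elsewhere. The helper name `read_csv_with_inferred_schema` is hypothetical and not part of the workshop. It relies on the standard Spark CSV reader options `header` and `inferSchema`, and on the `spark` session that a Fabric notebook provides.

```python
def read_csv_with_inferred_schema(spark, path):
    """Read a CSV file into a Spark DataFrame, inferring column types.

    `spark` is an existing SparkSession (the notebook runtime provides one).
    `path` is a lakehouse-relative path, assumed here to be "Files/heart.csv".
    """
    return (
        spark.read.format("csv")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark infer each column's type
        .load(path)
    )


# Hypothetical usage inside the notebook (path assumed from the text above):
# df = read_csv_with_inferred_schema(spark, "Files/heart.csv")
# display(df)
```

Without `inferSchema`, Spark reads every CSV column as a string, so the option is what makes the numeric columns arrive as numeric types.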

In [2]:
# Read the heartfailure Dataset file
df =
# df now is a Spark DataFrame containing CSV data from "Files/heartdataset/heart.csv".
display(df)


In [3]:
#Run this command to see the schema of your dataset
df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: integer (nullable = true)
 |-- Cholesterol: integer (nullable = true)
 |-- FastingBS: integer (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: integer (nullable = true)



## Write Spark dataframe to lakehouse delta table

**Enable V-Order and Optimized Delta Write**

**Verti-Parquet or V-Order**: Fabric includes Microsoft's VertiParquet engine. The VertiParquet writer optimizes the Delta Lake Parquet files, resulting in 3x-4x compression improvement and up to 10x performance acceleration over Delta Lake files not optimized using VertiParquet, while still maintaining full Delta Lake and Parquet format compliance.

**Optimize write**: Spark in Fabric includes an Optimize Write feature that reduces the number of files written and aims to increase the size of each written file. It dynamically optimizes files during write operations, generating files with a default size of 128 MB. The target file size can be changed per workload requirements using configurations.

These configs can be applied at the session level (with `spark.conf.set` in a notebook cell), as demonstrated in the following code cell, or at the workspace level, where they apply automatically to all Spark sessions created in the workspace. The workspace-level Apache Spark configuration can be set at:
- _Workspace settings >> Data Engineering/Science >> Spark Compute >> Spark Properties >> Add_

[Learn more about V-Order and Optimized Delta Write](https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparksql)

In [4]:
# Run these configuration commands to set up your Spark session
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write
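
If you want to confirm that the session picked up these settings, one option is to read them back with `spark.conf.get`. The sketch below is only an illustration: the helper name `apply_and_verify_spark_conf` is hypothetical, and the two keys are the ones used in this notebook.

```python
def apply_and_verify_spark_conf(spark, settings):
    """Apply session-level Spark settings and return the values read back.

    `spark` is an existing SparkSession; `settings` maps config keys to
    string values.
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    # Read each key back so a mistyped or rejected value is easy to spot
    return {key: spark.conf.get(key) for key in settings}


# Hypothetical usage inside the notebook:
# applied = apply_and_verify_spark_conf(spark, {
#     "spark.sql.parquet.vorder.enabled": "true",
#     "spark.microsoft.delta.optimizeWrite.enabled": "true",
# })
# print(applied)
```

Note that Spark accepts arbitrary keys in `spark.conf.set`, so a misspelled key is stored without error and simply has no effect; reading values back does not catch that, so the key names still need a visual check.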

#### Part 2 Instructions:

Now you will create a Delta table in your lakehouse to store your dataset so that later notebooks can access it. In the first code cell, write the data to a Delta table named "heartFailure". Once it has been saved, test reading the table and displaying the data in the second code cell. [Documentation link](https://learn.microsoft.com/en-us/training/modules/work-delta-lake-tables-fabric/3-create-delta-tables)
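
A minimal sketch of one way to complete both cells is shown below. The helper names `save_as_delta_table` and `load_table` are hypothetical, and the `overwrite` save mode is an assumption chosen so that the cell can be re-run; the workshop does not prescribe a mode. The calls used are the standard Spark `DataFrameWriter.saveAsTable` and `DataFrameReader.table`.

```python
def save_as_delta_table(df, table_name, mode="overwrite"):
    """Write a Spark DataFrame to a managed Delta table.

    `mode="overwrite"` (an assumption) replaces the table if it already
    exists, which keeps the cell safe to run more than once.
    """
    df.write.format("delta").mode(mode).saveAsTable(table_name)


def load_table(spark, table_name):
    """Read a saved table back into a Spark DataFrame."""
    return spark.read.table(table_name)


# Hypothetical usage inside the notebook:
# table_name = "heartFailure"
# save_as_delta_table(df, table_name)
# data_df = load_table(spark, table_name)
# display(data_df)
```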

In [None]:
#Set your table name
table_name = 

#Write your data to the table

In [None]:
#Load the table to the dataframe to ensure that it is loaded in the correct path
data_df = 

#Display the dataframe that you just loaded
display(data_df)