# Module 1: Ingest Data into Lakehouse Using Spark

**Lakehouse**: A lakehouse is a collection of files, folders, and tables that represents a database over a data lake. The Spark engine and the SQL engine use it for big data processing, and it gains ACID transaction support when tables are stored in the open-source Delta format.

**Delta Lake**: Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata management, and batch and streaming data processing to Apache Spark. A Delta Lake table is a data table format that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata management.
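
To make the idea of a file-based transaction log more concrete, the sketch below replays a simplified, in-memory log of `add` and `remove` actions to work out which Parquet files belong to a given table version. It is a toy model, not Delta Lake's actual reader: the file names and the `active_files` helper are hypothetical, and real log entries carry more information (schema, statistics, protocol versions).

```python
# Hypothetical, simplified commit log: one list of actions per table version.
commit_log = [
    # version 0: the initial write adds two data files
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    # version 1: one file is rewritten, so the old file is removed and a new one added
    [{"remove": {"path": "part-000.parquet"}}, {"add": {"path": "part-002.parquet"}}],
]


def active_files(log, as_of_version=None):
    """Replay commits in order and return the set of live data file paths."""
    last = len(log) - 1 if as_of_version is None else as_of_version
    files = set()
    for actions in log[: last + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(active_files(commit_log)))                   # latest version
print(sorted(active_files(commit_log, as_of_version=0)))  # earlier version
```

Because the set of live files is derived from the log rather than from a directory listing, a reader sees either all of a commit or none of it, and replaying the log only up to an earlier version gives the earlier state of the table.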


#### Pre-Requisites 

* A [Microsoft Fabric subscription](https://learn.microsoft.com/en-us/fabric/enterprise/licenses), or sign up for a free [Microsoft Fabric (Preview) trial](https://learn.microsoft.com/en-us/fabric/get-started/fabric-trial).
* Sign in to [Microsoft Fabric](https://fabric.microsoft.com/).
* Create a Fabric workspace and lakehouse, or use existing ones. To create a lakehouse, follow the steps in [Create a Lakehouse in Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/data-engineering/create-lakehouse).
* Download the notebooks from the GitHub repo and import them into your Fabric workspace using the Import Notebook option in the Data Science experience.


## Download the public diabetes dataset CSV file to the Files section of the Lakehouse

In [1]:
diabetes_dataset_url = "https://raw.githubusercontent.com/isinghrana/fabric-samples-healthcare/main/datascience-diabetes-prediction/data/diabetes.csv"



In [6]:
# Import necessary libraries
from notebookutils import mssparkutils
import requests

# Create a subfolder under the Files section of the Lakehouse
mssparkutils.fs.mkdirs("Files/diabetesdataset")

# Download the CSV file from the GitHub URL and save it to the subfolder
with requests.Session() as s:
    download = s.get(diabetes_dataset_url)
    download.raise_for_status()  # stop here if the download failed
    # The third argument (True) overwrites the file if it already exists
    mssparkutils.fs.put("Files/diabetesdataset/diabetes.csv", download.content.decode(), True)



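
Before saving a downloaded file it can be worth checking that the response really is the expected CSV and not, for example, an HTML error page. The sketch below is one possible check, using only the Python standard library. The `check_csv_header` helper is hypothetical, and the expected column names are assumed to match the schema printed later in this notebook.

```python
import csv
import io

# Assumed header, taken from the column names that df.printSchema() reports.
EXPECTED_COLUMNS = [
    "pregnancies", "plasma glucose", "blood pressure", "triceps skin thickness",
    "insulin", "bmi", "diabetes pedigree", "age", "diabetes",
]


def check_csv_header(text, expected=EXPECTED_COLUMNS):
    """Return the header row, raising ValueError if it differs from `expected`."""
    header = next(csv.reader(io.StringIO(text)), None)
    if header is None:
        raise ValueError("downloaded file is empty")
    header = [name.strip() for name in header]
    if header != list(expected):
        raise ValueError(f"unexpected header: {header}")
    return header


# Possible use in the download cell, before calling mssparkutils.fs.put:
# check_csv_header(download.text)
```

Calling `download.raise_for_status()` on the `requests` response catches HTTP errors, while a header check like this one catches a successful response with the wrong content.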

In [3]:
# Read the Diabetes Dataset file
df = spark.read.format("csv").option("header","true").option("inferSchema", "true").load("Files/diabetesdataset/diabetes.csv")
# df is now a Spark DataFrame containing the CSV data from "Files/diabetesdataset/diabetes.csv".
display(df)


In [4]:
# Print schema
df.printSchema()


root
 |-- pregnancies: integer (nullable = true)
 |-- plasma glucose: integer (nullable = true)
 |-- blood pressure: integer (nullable = true)
 |-- triceps skin thickness: integer (nullable = true)
 |-- insulin: integer (nullable = true)
 |-- bmi: double (nullable = true)
 |-- diabetes pedigree: double (nullable = true)
 |-- age: integer (nullable = true)
 |-- diabetes: integer (nullable = true)



In [7]:
# By default, Delta table column names cannot contain spaces, so rename such columns (spaces replaced with underscores)
df = df.withColumnRenamed("plasma glucose", "plasma_glucose") \
    .withColumnRenamed("blood pressure", "blood_pressure") \
    .withColumnRenamed("triceps skin thickness", "triceps_skin_thickness") \
    .withColumnRenamed("diabetes pedigree", "diabetes_pedigree")

display(df)

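
For datasets with many columns, renaming each one by hand becomes tedious. The sketch below shows one way to generalize the step: a hypothetical `sanitize_column` helper replaces disallowed characters with underscores, and `sanitize_columns` applies it to every column through `toDF`. The character set is an assumption, based on the characters that Delta and Parquet usually report as invalid when column mapping is not enabled.

```python
import re

# Assumed set of characters rejected in column names (space , ; { } ( ) newline tab =).
_INVALID = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column(name):
    """Replace each disallowed character in a column name with an underscore."""
    return _INVALID.sub("_", name.strip())


def sanitize_columns(df):
    """Return a DataFrame with every column renamed via sanitize_column.

    Works with any object that exposes `columns` and `toDF(*names)`,
    such as a Spark DataFrame.
    """
    return df.toDF(*[sanitize_column(c) for c in df.columns])


print(sanitize_column("triceps skin thickness"))

# In a Fabric notebook: df = sanitize_columns(df)
```

Two distinct names can collapse to the same sanitized name (for example `a b` and `a_b`), so it is worth checking the result for duplicates on wider datasets.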

## Write Spark dataframe to lakehouse delta table

**Enable V-Order and Optimized Delta Write**

**VertiParquet (V-Order)**: Fabric includes Microsoft's VertiParquet engine. The VertiParquet writer optimizes the Delta Lake Parquet files, resulting in 3x-4x better compression and up to 10x faster performance compared with Delta Lake files not optimized using VertiParquet, while still maintaining full Delta Lake and Parquet format compliance.

**Optimize Write**: Spark in Fabric includes an Optimize Write feature that reduces the number of files written and aims to increase the size of each file. It dynamically optimizes files during write operations, generating files with a default size of 128 MB. The target file size can be changed per workload requirements using configurations.

These configurations can be applied at the session level (with `spark.conf.set` in a notebook cell), as demonstrated in the following code cell, or at the workspace level, where they are applied automatically to all Spark sessions created in the workspace. The workspace-level Apache Spark configuration can be set at:
- _Workspace settings >> Data Engineering/Science >> Spark Compute >> Spark Properties >> Add_

In [9]:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write

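
Session-level settings are plain string keys, and Spark generally stores an unrecognized key without complaint, so a misspelled key can go unnoticed. The sketch below is one way to guard against that: a hypothetical `apply_session_conf` helper rejects keys that do not start with `spark.`, sets each value, and reads it back. The two keys are the ones this notebook uses; the helper only needs an object with `conf.set` and `conf.get`, such as the `spark` session in a Fabric notebook.

```python
def apply_session_conf(spark, settings):
    """Set each key on the session and return the values read back.

    Raises ValueError for keys that do not start with "spark.", which catches
    simple typos in the prefix before they are silently stored.
    """
    for key in settings:
        if not key.startswith("spark."):
            raise ValueError(f"suspicious configuration key: {key!r}")
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}


SETTINGS = {
    "spark.sql.parquet.vorder.enabled": "true",
    "spark.microsoft.delta.optimizeWrite.enabled": "true",
}

# In a Fabric notebook: print(apply_session_conf(spark, SETTINGS))
```

The prefix check only catches one class of typo. A misspelling later in the key would still be accepted, so comparing the keys against the Fabric documentation remains the more reliable check.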

In [10]:
table_name = "diabetes"
df.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")


Spark dataframe saved to delta table: diabetes
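
After the write, a quick read-back can confirm that the table contains what was written. The sketch below is one possible check, assuming the same `Tables/{table_name}` path used above. The `verify_delta_write` helper is hypothetical and compares only row counts and column names, not the data itself.

```python
def verify_delta_write(spark, source_df, table_name):
    """Reload the Delta table and compare row count and columns with the source."""
    saved = spark.read.format("delta").load(f"Tables/{table_name}")
    expected_rows, actual_rows = source_df.count(), saved.count()
    if expected_rows != actual_rows:
        raise ValueError(
            f"row count mismatch: wrote {expected_rows}, read {actual_rows}"
        )
    if list(source_df.columns) != list(saved.columns):
        raise ValueError("column mismatch between source and saved table")
    return actual_rows


# In a Fabric notebook: verify_delta_write(spark, df, table_name)
```

Because the write above uses `mode("overwrite")`, rerunning the notebook replaces the table contents, so the counts should match on every run.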
