
# Architecture and Reference Links

[Medallion Architecture Reference - Microsoft](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion)

[Mssparkutils Reference](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities)

[Fabric Lakehouse](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview)


### Data flow diagram
<img src="https://github.com/MicrosoftLIAD/classroomB2/blob/main/Code/Databricks%20Notebooks/images/lakehousearchitecture.png?raw=true" style="width: 650px; max-width: 100%; height: auto" />

In [None]:
# Create a shortcut to the silver lakehouse


### Environment Setup

We will use the [Microsoft Spark Notebook Utilities](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#notebook-utilities) to set up variables for this exercise.

The `mssparkutils.notebook.run()` command runs another notebook and returns its exit value, which is used here.

`mssparkutils` has other useful features, such as interacting with the file system or reading [Key Vault secrets](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#credentials-utilities).


## Set medallion paths

In [1]:
# Run the Get-Metadata notebook and split its exit value on whitespace
setup_responses = mssparkutils.notebook.run("Get-Metadata").split()

# Set medallion paths (the order must match what Get-Metadata returns)
bronzePath = setup_responses[0]
bronzeLakehouse = setup_responses[1]
silverLakehouse = setup_responses[2]
goldLakehouse = setup_responses[3]

print(f"bronze data path is {bronzePath}")
print(f"bronze lakehouse is {bronzeLakehouse}")
print(f"silver lakehouse is {silverLakehouse}")
print(f"gold lakehouse is {goldLakehouse}")

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 3, Finished, Available)

bronze data path is abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files
bronze lakehouse is liad_bronze
silver lakehouse is liad_silver
gold lakehouse is liad_gold
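
The cell above reads the four values by position, so it relies on `Get-Metadata` returning them in a fixed order with no spaces inside any value. A minimal sketch of a stricter parser follows. The names `MedallionPaths` and `parse_setup_response` are hypothetical, and the sample string stands in for the real `mssparkutils.notebook.run("Get-Metadata")` call, using the values printed above.

```python
from typing import NamedTuple


class MedallionPaths(NamedTuple):
    """Named view over the four values the cell above reads by position."""
    bronze_path: str
    bronze_lakehouse: str
    silver_lakehouse: str
    gold_lakehouse: str


def parse_setup_response(exit_value: str) -> MedallionPaths:
    """Split a whitespace-delimited exit value and check the field count."""
    parts = exit_value.split()
    expected = len(MedallionPaths._fields)
    if len(parts) != expected:
        raise ValueError(f"expected {expected} values, got {len(parts)}: {parts}")
    return MedallionPaths(*parts)


# Stand-in for mssparkutils.notebook.run("Get-Metadata"), built from the output above.
sample_exit_value = (
    "abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files "
    "liad_bronze liad_silver liad_gold"
)
paths = parse_setup_response(sample_exit_value)
print(paths.gold_lakehouse)  # prints liad_gold
```

Failing early with a `ValueError` makes a changed or incomplete exit value visible at setup time rather than in a later cell.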


In [2]:
# List silver files
mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{silverLakehouse}.Lakehouse/Tables/combinedflightdata")

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 4, Finished, Available)

[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/part-00000-0a100df1-684b-428c-a21f-86167e10091d-c000.snappy.parquet, name=part-00000-0a100df1-684b-428c-a21f-86167e10091d-c000.snappy.parquet, size=77731),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/part-00000-8d9fe446-54cf-4e30-b991-4b670649441d-c000.snappy.parquet, name=part-00000-8d9fe446-54cf-4e30-b991-4b670649441d-c000.snappy.parquet, size=77731),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/part-00000-d55e3ed3-7d56-43fd-8cbb-91411f5ec89d-c000.snappy.parquet, name=part-00000-d55e3ed3-7d56-43fd-8cbb-91411f5ec89d-c000.snappy.parquet, size=77731)]
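
The same `abfss://` prefix is repeated in several cells of this notebook. As a sketch only, a small helper can build the table folder path from its parts. The function name `onelake_table_path` is hypothetical, and the host and workspace values are copied from the listing above rather than discovered at run time.

```python
ONELAKE_HOST = "msit-onelake.dfs.fabric.microsoft.com"  # host taken from the paths above


def onelake_table_path(workspace: str, lakehouse: str, table: str,
                       host: str = ONELAKE_HOST) -> str:
    """Build the abfss URI of a table folder inside a lakehouse."""
    return f"abfss://{workspace}@{host}/{lakehouse}.Lakehouse/Tables/{table}"


silver_table = onelake_table_path("classroomB2", "liad_silver", "combinedflightdata")
print(silver_table)
```

In the notebook, the result could be passed to `mssparkutils.fs.ls(...)` in place of the inline f-string.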


# Working with SQL Tables - Gold

### Gold layer (curated business-level tables)

[Create and Manage Schemas Reference](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/create-schemas)

[Create Tables Reference](https://learn.microsoft.com/en-us/azure/databricks/getting-started/dataframes-python#save-a-dataframe-to-a-table)

[Delta Best Practices](https://learn.microsoft.com/en-us/azure/databricks/delta/best-practices)

In [3]:
# DBTITLE 1,Create a flight dataset for gold
# Create the 'gold' data
combined_gold_sdf = spark.sql(f"SELECT * FROM {goldLakehouse}.combinedflightdata")

# define the columns to be dropped
cols = ("delay","id","latitude_deg", "longitude_deg", "elevation_ft", "iso_country", "iso_region", "gps_code", "home_link", "wikipedia_link", "keywords")

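# drop the columns we do not want in the table;
# drop() returns a new DataFrame, so assign the result to keep the change
combined_gold_sdf = combined_gold_sdf.drop(*cols)
combined_gold_sdf.printSchema()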


StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 5, Finished, Available)

root
 |-- date: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- origin_name: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- scheduled_service: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)



In [5]:
query = f"DROP TABLE IF EXISTS {goldLakehouse}.flightData;"
print(query)

spark.sql(query)

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 7, Finished, Available)

DROP TABLE IF EXISTS liad_gold.flightData;


DataFrame[]

## Create a Managed Table


[Managed Table Reference](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects#--what-is-a-managed-table)

In [6]:
# Create a managed table from an existing DataFrame

combined_gold_sdf.write.mode("overwrite").format("delta").saveAsTable(f"{goldLakehouse}.flightData") 

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 8, Finished, Available)


### Describe History - Managed

[History Reference](https://learn.microsoft.com/en-us/azure/databricks/delta/history)

In [7]:

query = f"DESCRIBE HISTORY {goldLakehouse}.flightData;"
print( query )

display (spark.sql( query ) )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 9, Finished, Available)

DESCRIBE HISTORY liad_gold.flightData;


SynapseWidget(Synapse.DataFrame, 29522674-205b-447e-897c-53f0121559a1)


### Describe detail - Managed

In [8]:

display( spark.sql(f"DESCRIBE DETAIL {goldLakehouse}.flightData;") )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 10, Finished, Available)

SynapseWidget(Synapse.DataFrame, 85d83ef8-dcdd-4d30-906f-546af065d90d)

In [9]:
display( spark.sql(f"Select count(*) from {goldLakehouse}.flightdata;") )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 11, Finished, Available)

SynapseWidget(Synapse.DataFrame, 1fda0445-f7f9-488d-86e8-64fba949210d)

## Create an Unmanaged Table

[Unmanaged Table Reference](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects#--what-is-an-unmanaged-table)

In [10]:
# Save the DataFrame to the gold lakehouse in Delta format, partitioned by origin.
# overwriteSchema lets the overwrite change the table layout, including its partitioning.

combined_gold_sdf.write.partitionBy("origin").mode("overwrite").option("overwriteSchema", "true").format("delta").saveAsTable(f"{goldLakehouse}.flightdata")

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 12, Finished, Available)

In [13]:
# Dropping an unmanaged table removes only the table definition; the underlying data is not deleted
spark.sql(f"DROP TABLE IF EXISTS {goldLakehouse}.flightDataUnmanaged;")


StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 15, Finished, Available)

DataFrame[]

In [14]:
query = f"CREATE EXTERNAL TABLE IF NOT EXISTS {goldLakehouse}.flightDataUnmanaged USING DELTA LOCATION 'abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{goldLakehouse}.Lakehouse/Tables/flightdata'" 
print( query )

spark.sql( query )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 16, Finished, Available)

CREATE EXTERNAL TABLE IF NOT EXISTS liad_gold.flightDataUnmanaged USING DELTA LOCATION 'abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata'


DataFrame[]
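
The difference between dropping a managed and an unmanaged table can be pictured with a toy model. This is a deliberately simplified sketch, not how Spark or Fabric implements its catalog. The class `ToyCatalog` and the short folder names are hypothetical.

```python
class ToyCatalog:
    """Toy model: table metadata lives in `tables`, data folders in `storage`."""

    def __init__(self):
        self.tables = {}      # table name -> (location, is_managed)
        self.storage = set()  # folder paths that currently hold data

    def create_managed(self, name, location):
        self.storage.add(location)
        self.tables[name] = (location, True)

    def create_external(self, name, location):
        # Only metadata is registered; the data at `location` already exists.
        self.tables[name] = (location, False)

    def drop(self, name):
        location, is_managed = self.tables.pop(name)
        if is_managed:
            self.storage.discard(location)  # managed: data goes with the table


catalog = ToyCatalog()
catalog.create_managed("flightdata", "Tables/flightdata")
catalog.create_external("flightDataUnmanaged", "Tables/flightdata")

catalog.drop("flightDataUnmanaged")
print("Tables/flightdata" in catalog.storage)  # prints True
```

In this model, dropping `flightDataUnmanaged` leaves the folder in place, while dropping `flightdata` would remove it.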

In [15]:
display( spark.sql(f"Select count(*) from {goldLakehouse}.flightDataUnmanaged;") )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 17, Finished, Available)

SynapseWidget(Synapse.DataFrame, 8b858a9c-0a2f-4919-94ca-3b7604372d57)


### Describe History - Unmanaged

In [16]:
display( spark.sql(f"describe history {goldLakehouse}.flightDataUnmanaged;") )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 18, Finished, Available)

SynapseWidget(Synapse.DataFrame, 6f8592c6-644f-4bc5-a459-dd13c257af17)


### Describe Detail - Unmanaged

In [17]:

display( spark.sql(f"DESCRIBE DETAIL {goldLakehouse}.flightDataUnmanaged;") )

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 19, Finished, Available)

SynapseWidget(Synapse.DataFrame, 535c8aae-4d18-4075-92c0-151ac4d07eba)

# Python Miscellaneous Examples of Data Manipulation

In [18]:
files = mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{goldLakehouse}.Lakehouse/Tables/flightdata")


for file in files:
    print(file.name)

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 20, Finished, Available)

_delta_log
origin=ABE
origin=AUS
part-00000-2c6f2b9c-f991-4914-aa1b-2f277272fc97-c000.snappy.parquet
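
The listing mixes three kinds of entries: the `_delta_log` folder, one `origin=<value>` folder per partition, and a Parquet file at the table root, which is likely left over from the earlier unpartitioned write. A minimal sketch for separating them by name follows. The helper `classify_entries` is hypothetical; in the notebook the input would be `[f.name for f in files]`.

```python
def classify_entries(names):
    """Group listing names into partition values, data files and other entries."""
    partitions, data_files, other = {}, [], []
    for name in names:
        if name.endswith(".parquet"):
            data_files.append(name)
        elif "=" in name:
            # Hive-style partition folder, e.g. "origin=ABE"
            column, _, value = name.partition("=")
            partitions.setdefault(column, []).append(value)
        else:
            other.append(name)
    return partitions, data_files, other


# Names copied from the output above.
names = [
    "_delta_log",
    "origin=ABE",
    "origin=AUS",
    "part-00000-2c6f2b9c-f991-4914-aa1b-2f277272fc97-c000.snappy.parquet",
]
partitions, data_files, other = classify_entries(names)
print(partitions)  # prints {'origin': ['ABE', 'AUS']}
```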


In [19]:
# List the files of the managed gold table
mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{goldLakehouse}.Lakehouse/Tables/flightdata")

StatementMeta(, 9b384258-e64c-4b1c-81a2-bbb7c3279688, 21, Finished, Available)

[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata/origin=ABE, name=origin=ABE, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata/origin=AUS, name=origin=AUS, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata/part-00000-2c6f2b9c-f991-4914-aa1b-2f277272fc97-c000.snappy.parquet, name=part-00000-2c6f2b9c-f991-4914-aa1b-2f277272fc97-c000.snappy.parquet, size=75548)]

In [None]:
# List the files of the unmanaged gold table
# Expected to raise an error: the unmanaged table is only a metadata entry that points
# at the data under Tables/flightdata, so there is no Tables/flightdataunmanaged folder.

mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{goldLakehouse}.Lakehouse/Tables/flightdataunmanaged")
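
Because this listing is expected to fail, a notebook run would stop at this cell. One way to keep going is to wrap the call and return the error instead of raising it, as sketched below. The helper `try_ls` is hypothetical, and `fake_ls` is a stand-in for `mssparkutils.fs.ls` so the example runs outside Fabric. The exception type the real call raises is not assumed, which is why the handler is broad.

```python
def try_ls(ls, path):
    """Call `ls(path)` and return (entries, error) instead of raising."""
    try:
        return list(ls(path)), None
    except Exception as exc:  # broad on purpose: the exact exception type is not assumed
        return [], exc


def fake_ls(path):
    """Hypothetical stand-in for mssparkutils.fs.ls, used only for this demo."""
    known = {"Tables/flightdata": ["_delta_log", "origin=ABE", "origin=AUS"]}
    if path not in known:
        raise FileNotFoundError(path)
    return known[path]


entries, error = try_ls(fake_ls, "Tables/flightdataunmanaged")
print(entries, type(error).__name__)  # prints [] FileNotFoundError
```

In the notebook, `try_ls(mssparkutils.fs.ls, path)` would be the equivalent call.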
