
# Architecture and Reference Links

[Medallion Architecture Reference - Microsoft](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion)

[Mssparkutils Reference](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities)

[Fabric Lakehouse](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview)


### Data flow diagram
<img src="https://github.com/MicrosoftLIAD/classroomB2/blob/main/Code/Databricks%20Notebooks/images/lakehousearchitecture.png?raw=true" style="width: 650px; max-width: 100%; height: auto" />

### Create a shortcut to the bronze lakehouse (see Guide, Lab 1.2 - Silver)


### Environment Setup

We will use the [Microsoft Spark Notebook Utilities](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#notebook-utilities) to set up variables for this exercise.

The `mssparkutils.notebook.run()` command runs another notebook and returns its exit value so that it can be used here.

`mssparkutils` has other useful features, such as interacting with the file system or reading [Key Vault Secrets](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#credentials-utilities).
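
As a rough illustration of how the returned value is consumed: `mssparkutils.notebook.run()` hands back the called notebook's exit value as a single string, and the cell in the next section splits it on whitespace. The sketch below mimics that step in plain Python. The helper `parse_setup_response` and the `MedallionPaths` type are hypothetical names, and the sample string simply reuses the values printed later in this notebook; the real `Get-Metadata` notebook may format its output differently.

```python
from typing import NamedTuple


class MedallionPaths(NamedTuple):
    bronze_path: str
    bronze_lakehouse: str
    silver_lakehouse: str
    gold_lakehouse: str


def parse_setup_response(raw: str) -> MedallionPaths:
    """Split a whitespace-separated exit value into four named fields."""
    parts = raw.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {parts!r}")
    return MedallionPaths(*parts)


# Stand-in for mssparkutils.notebook.run("Get-Metadata"); assumed format
sample = (
    "abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files "
    "liad_bronze liad_silver liad_gold"
)
paths = parse_setup_response(sample)
print(paths.silver_lakehouse)
```

Naming the fields avoids relying on positional indexes, and the length check fails early if the setup notebook returns something unexpected. Note that splitting on whitespace would break if any path contained a space.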


## Set medallion paths

In [1]:
# Reference a notebook to get and set Path variables 
setup_responses = mssparkutils.notebook.run("Get-Metadata").split()

# Set medallion paths
bronzePath = setup_responses[0]
bronzeLakehouse = setup_responses[1]
silverLakehouse = setup_responses[2]
goldLakehouse = setup_responses[3]

print(f"bronze data path is {bronzePath}")
print(f"bronze lakehouse is {bronzeLakehouse}")
print(f"silver lakehouse is {silverLakehouse}")
print(f"gold lakehouse is {goldLakehouse}")


bronze data path is abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files
bronze lakehouse is liad_bronze
silver lakehouse is liad_silver
gold lakehouse is liad_gold


In [2]:
# List bronze files
mssparkutils.fs.ls(f"{bronzePath}/flights")


[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files/flights/airports.csv, name=airports.csv, size=10938397),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files/flights/departuredelays.csv, name=departuredelays.csv, size=33396236)]

## Load departuredelays.csv into a dataframe

In [3]:
# Load flight data into a dataframe

flightBronzePath = f'{bronzePath}/flights/departuredelays.csv'

flights_bronze_sdf = spark.read.load(flightBronzePath, format="csv", inferSchema="true", header="true")

display( flights_bronze_sdf )



## Load airports.csv into a dataframe

In [4]:
# Load airport data into a dataframe

airportBronzePath = f'{bronzePath}/flights/airports.csv'

airports_bronze_sdf = spark.read.load(airportBronzePath, format="csv", inferSchema="true", header="true")

display( airports_bronze_sdf )



# Clean and Transform Data - Silver

### Silver layer (cleansed and conformed data)


## Transform Flight Data

In [5]:
# Inspect the flights schema
flights_bronze_sdf.printSchema()


root
 |-- date: integer (nullable = true)
 |-- delay: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)



In [6]:
# make sure all values are upper case for later comparison
from pyspark.sql.functions import col, upper 

flights_bronze_sdf = flights_bronze_sdf.withColumn("origin", upper(col("origin"))).withColumn("destination", upper(col("destination")))

display( flights_bronze_sdf )



## Transform Airport Data

In [13]:
# Inspect the airports schema
airports_bronze_sdf.printSchema()


root
 |-- id: integer (nullable = true)
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- latitude_deg: double (nullable = true)
 |-- longitude_deg: double (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- scheduled_service: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- home_link: string (nullable = true)
 |-- wikipedia_link: string (nullable = true)
 |-- keywords: string (nullable = true)



In [8]:
# make sure all values are upper case for later comparison
from pyspark.sql.functions import col, upper 

airports_bronze_sdf = airports_bronze_sdf.withColumn("ident", upper(col("ident")))

display( airports_bronze_sdf )



## Convert dataframe type

Sometimes we use pandas dataframes because they are held in memory on a single machine, which can make them faster than Spark for small to medium-sized datasets.  
Data scientists and data engineers use them often because, in many cases, they offer simpler and more powerful data manipulation than Spark.

Spark dataframes are typically used for large datasets because they scale across the nodes of a multi-node Spark cluster.  
Pandas dataframes run on a single node only (`toPandas()` collects every row onto the driver) and hit a memory ceiling quickly with very large datasets.
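
One way to respect that ceiling is to check the row count before converting. The helper below is a sketch, not part of the lab: `to_pandas_if_small` and the `max_rows` threshold are hypothetical, and it relies only on the `count()` and `toPandas()` methods that Spark DataFrames provide. A small stand-in class lets the example run without a cluster.

```python
def to_pandas_if_small(sdf, max_rows=1_000_000):
    """Convert a Spark DataFrame to pandas only if it is small enough.

    toPandas() collects every row onto the driver node, so a large
    DataFrame can exhaust driver memory. max_rows is an arbitrary,
    assumed threshold; tune it to the memory available on the driver.
    """
    n = sdf.count()
    if n > max_rows:
        raise ValueError(f"{n} rows exceeds max_rows={max_rows}; keep this in Spark")
    return sdf.toPandas()


class FakeSparkFrame:
    """Stand-in exposing the two methods used above (illustration only)."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)

    def toPandas(self):
        # A real Spark DataFrame returns a pandas DataFrame here
        return list(self._rows)


small = to_pandas_if_small(FakeSparkFrame([1, 2, 3]), max_rows=10)
print(small)
```

In a notebook the call would be something like `to_pandas_if_small(flights_bronze_sdf)`. Keep in mind that `count()` itself triggers a Spark job, so the check is not free.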

In [9]:
# Convert the cleaned Spark dataframes to pandas dataframes with toPandas()
# so they can be combined into the silver data in the next step
import pandas as pd

flights_bronze_pdf = flights_bronze_sdf.toPandas()
airports_bronze_pdf = airports_bronze_sdf.toPandas()





## Combine Flight and Airport Data

In [10]:
# Combine the two dataframes and tidy the result before it is saved
# (pandas was imported as pd in an earlier cell)

# Inner join: keep only flights whose origin matches an airport ident
combined_silver_pdf = pd.merge(flights_bronze_pdf, airports_bronze_pdf, how='inner', left_on='origin', right_on='ident', sort=True)

# Rename the airport name column, since the join only describes the origin airport
combined_silver_pdf = combined_silver_pdf.rename(columns={'name': 'origin_name'})

# Replace missing values in these text columns with 'No Data'.
# pandas marks missing values with NaN, a floating-point value, so leaving it in a
# string column can cause type conflicts when the data is converted to a Spark
# DataFrame and written as a Delta table.
fill_cols = ["gps_code", "iata_code", "local_code", "home_link"]
combined_silver_pdf[fill_cols] = combined_silver_pdf[fill_cols].fillna("No Data")


display( combined_silver_pdf )


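To see what the inner join and the `fillna` step in the cell above do, here is a small self-contained sketch. The rows are made up for illustration and are not taken from the lab files; only the column names mirror the lab's schemas. It assumes pandas is installed.

```python
import pandas as pd

# Illustrative rows only
flights = pd.DataFrame({
    "origin": ["AAA", "BBB", "ZZZ"],
    "delay": [5, -2, 30],
})
airports = pd.DataFrame({
    "ident": ["BBB", "AAA"],
    "name": ["Airport B", "Airport A"],
    "iata_code": ["BB", None],
})

# Inner join: the ZZZ flight has no matching ident, so it is dropped
combined = pd.merge(flights, airports, how="inner", left_on="origin", right_on="ident", sort=True)
combined = combined.rename(columns={"name": "origin_name"})

# Replace the missing code with a placeholder string
combined["iata_code"] = combined["iata_code"].fillna("No Data")

print(combined[["origin", "origin_name", "iata_code"]])
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, avoids the chained-assignment pattern that newer pandas versions warn about.
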

## Writing Data



Save the combined dataset as a Delta table in the silver lakehouse. Delta stores the data as Parquet files alongside a `_delta_log` transaction log.

In [16]:
# Save as a Delta table
# First convert the pandas DataFrame back to a Spark DataFrame
combined_silver_sdf = spark.createDataFrame( combined_silver_pdf )

combined_silver_sdf.write.mode("overwrite").format("delta").saveAsTable(f"{silverLakehouse}.combinedflightdata") 



In [14]:
# List silver files
mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/{silverLakehouse}.Lakehouse/Tables/combinedflightdata")



[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/part-00000-0a100df1-684b-428c-a21f-86167e10091d-c000.snappy.parquet, name=part-00000-0a100df1-684b-428c-a21f-86167e10091d-c000.snappy.parquet, size=77731),
 FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_silver.Lakehouse/Tables/combinedflightdata/part-00000-d55e3ed3-7d56-43fd-8cbb-91411f5ec89d-c000.snappy.parquet, name=part-00000-d55e3ed3-7d56-43fd-8cbb-91411f5ec89d-c000.snappy.parquet, size=77731)]
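
The listing cell above hard-codes the workspace and host in the path. Judging from the paths printed in this notebook, table folders follow the pattern `abfss://<workspace>@<host>/<lakehouse>.Lakehouse/Tables/<table>`. A small helper along those lines might look like the sketch below. `onelake_table_path` is a hypothetical name, and the pattern is inferred from the output shown here rather than taken from a documented API.

```python
def onelake_table_path(workspace: str, lakehouse: str, table: str,
                       host: str = "msit-onelake.dfs.fabric.microsoft.com") -> str:
    """Build the abfss URI of a lakehouse table folder.

    The default host is the one that appears in this notebook's output;
    other tenants may use a different OneLake endpoint.
    """
    return f"abfss://{workspace}@{host}/{lakehouse}.Lakehouse/Tables/{table}"


print(onelake_table_path("classroomB2", "liad_silver", "combinedflightdata"))
```

In the notebook, the result could be passed to `mssparkutils.fs.ls()` with `silverLakehouse` as the lakehouse argument.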