
# Architecture and Reference Links

[Medallion Architecture Reference - Microsoft](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion)

[Mssparkutils Reference](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities)

[Fabric Lakehouse](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview)


### Data flow diagram
<img src="https://github.com/MicrosoftLIAD/classroomB2/blob/main/Code/Databricks%20Notebooks/images/lakehousearchitecture.png?raw=true" style="width: 650px; max-width: 100%; height: auto" />

### Create a shortcut to the source folder using a SAS Token (see guide: Lab 1.1 - Bronze, Step 2)
https://storliadadls.blob.core.windows.net/source?sp=rl&st=2023-08-11T18:59:06Z&se=2024-08-12T02:59:06Z&spr=https&sv=2022-11-02&sr=c&sig=TkxfKfqFP%2BZ9fk12N2ZOaTVnH2shUJUHtB2AF%2BWn2pk%3D



### Environment Setup

We will use the notebook utilities in [Microsoft Spark Utilities](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#notebook-utilities) (`mssparkutils`) to set up variables for this exercise.

The `mssparkutils.notebook.run()` command runs another notebook and returns its exit value, which we use here.

`mssparkutils` has other useful features, such as interacting with the file system or reading [Key Vault secrets](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python#credentials-utilities).


## Set medallion paths

In [2]:
# Reference a notebook to get and set path variables
setup_responses = mssparkutils.notebook.run("Get-Metadata").split()

# Set medallion paths
bronzePath = setup_responses[0]

#files = mssparkutils.fs.ls(f"abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_gold.Lakehouse/Tables/flightdata")
bronzeLakehouse = setup_responses[1]
silverLakehouse = setup_responses[2]
goldLakehouse = setup_responses[3]

print(f"bronze data path is {bronzePath}")
print(f"bronze lakehouse is {bronzeLakehouse}")
print(f"silver lakehouse is {silverLakehouse}")
print(f"gold lakehouse is {goldLakehouse}")

bronze data path is abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files
bronze lakehouse is liad_bronze
silver lakehouse is liad_silver
gold lakehouse is liad_gold
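
Positional indexing into `setup_responses` works, but if the `Get-Metadata` notebook ever changed the number or order of values it returns, the problem would surface only as a confusing error further down. A minimal sketch of a more defensive parse follows, assuming the exit value is a whitespace-separated string of exactly four items in the order used above. `MedallionConfig` and `parse_setup_response` are hypothetical names, not part of `mssparkutils`, and the sample string stands in for the real return value so the example runs anywhere.

```python
from typing import NamedTuple


class MedallionConfig(NamedTuple):
    """Hypothetical container for the four values used in this notebook."""
    bronze_path: str
    bronze_lakehouse: str
    silver_lakehouse: str
    gold_lakehouse: str


def parse_setup_response(response: str) -> MedallionConfig:
    """Split a whitespace-separated response and check the field count."""
    parts = response.split()
    expected = len(MedallionConfig._fields)
    if len(parts) != expected:
        raise ValueError(f"expected {expected} values, got {len(parts)}: {parts!r}")
    return MedallionConfig(*parts)


# Stand-in for mssparkutils.notebook.run("Get-Metadata");
# the values are copied from the cell output above.
sample = (
    "abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files "
    "liad_bronze liad_silver liad_gold"
)
config = parse_setup_response(sample)
print(config.bronze_lakehouse)
```

In the notebook itself, the `sample` string would be replaced by the value returned from `mssparkutils.notebook.run("Get-Metadata")`.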



# Ingesting Data - Bronze
### Bronze layer (raw data)

The linked source folder contains sample datasets to experiment with:

In [4]:
display(mssparkutils.fs.ls('Files/source/'))

[FileInfo(path=abfss://83541342-08d2-4998-a1bb-e8c1e15ea2d5@msit-onelake.dfs.fabric.microsoft.com/188f3547-e920-46b1-b464-9202a6c21a93/Files/source/flights, name=flights, size=0),
 FileInfo(path=abfss://83541342-08d2-4998-a1bb-e8c1e15ea2d5@msit-onelake.dfs.fabric.microsoft.com/188f3547-e920-46b1-b464-9202a6c21a93/Files/source/retail-org, name=retail-org, size=0)]

## Use the flights datasets as our bronze datasets

In [5]:
# review the files in the source data that we need to bring into our medallion architecture
display(mssparkutils.fs.ls('Files/source/flights'))

[FileInfo(path=abfss://83541342-08d2-4998-a1bb-e8c1e15ea2d5@msit-onelake.dfs.fabric.microsoft.com/188f3547-e920-46b1-b464-9202a6c21a93/Files/source/flights/README.md, name=README.md, size=412),
 FileInfo(path=abfss://83541342-08d2-4998-a1bb-e8c1e15ea2d5@msit-onelake.dfs.fabric.microsoft.com/188f3547-e920-46b1-b464-9202a6c21a93/Files/source/flights/airport-codes-na.txt, name=airport-codes-na.txt, size=11411),
 FileInfo(path=abfss://83541342-08d2-4998-a1bb-e8c1e15ea2d5@msit-onelake.dfs.fabric.microsoft.com/188f3547-e920-46b1-b464-9202a6c21a93/Files/source/flights/departuredelays.csv, name=departuredelays.csv, size=33396236)]
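
The raw `FileInfo` output is hard to scan. The sketch below prints one line per entry with a human-readable size. It relies only on the `name` and `size` attributes visible in the output above; `format_size` and `summarize` are hypothetical helpers, and `SimpleNamespace` objects stand in for the `FileInfo` entries so the example runs outside Fabric.

```python
from types import SimpleNamespace


def format_size(num_bytes: int) -> str:
    """Render a byte count using binary units (1 KiB = 1024 bytes)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = float(num_bytes)
    for unit in ("KiB", "MiB", "GiB"):
        size /= 1024
        # Stop at the first unit that fits, or at GiB as the largest unit.
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"


def summarize(entries) -> list:
    """Return one 'name: size' string per entry."""
    return [f"{entry.name}: {format_size(entry.size)}" for entry in entries]


# Stand-ins for the FileInfo entries listed above (names and sizes from the output).
listing = [
    SimpleNamespace(name="README.md", size=412),
    SimpleNamespace(name="airport-codes-na.txt", size=11411),
    SimpleNamespace(name="departuredelays.csv", size=33396236),
]
for line in summarize(listing):
    print(line)
```

In the notebook, `summarize(mssparkutils.fs.ls('Files/source/flights'))` should work the same way, assuming the returned entries expose `name` and `size` as shown.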

In [6]:
# get the information about this dataset using the File API path
# a with block closes the file handle once the contents are read
with open('/lakehouse/default/Files/source/flights/README.md', 'r') as f:
    print(f.read())


# On-Time Performance Datasets

The source `airports` dataset can be found at [OpenFlights Airport, airline and route data](http://openflights.org/data.html). 

The `flights`, also known as the `departuredelays`, dataset can be found at [Airline On-Time Performance and Causes of Flight Delays: On_Time Data](https://catalog.data.gov/dataset/airline-on-time-performance-and-causes-of-flight-delays-on-time-data)




## Copy and load Flight Data

Read the source files and copy them into the Bronze lakehouse.

In [7]:
# Copy the flight source data to Bronze
# First, create the directory where the source data should land. NOTE: we use folders in our data lake storage under the path held in the bronzePath variable, which was set in a previous cell.
# The following command creates a folder called /flights under bronzePath.
mssparkutils.fs.mkdirs(f"{bronzePath}/flights")

# copy the source data into the bronze folder
mssparkutils.fs.cp("Files/source/flights/departuredelays.csv", f"{bronzePath}/flights/departuredelays.csv", True) 

# confirm the file is where it should be
mssparkutils.fs.ls(f"{bronzePath}/flights")

[FileInfo(path=abfss://classroomB2@msit-onelake.dfs.fabric.microsoft.com/liad_bronze.Lakehouse/Files/flights/departuredelays.csv, name=departuredelays.csv, size=33396236)]
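
The make-directory, copy, and confirm steps above repeat for every file we bring into Bronze, so they can be wrapped in a small helper. This is a sketch under stated assumptions: `copy_to_bronze` and `RecordingFs` are hypothetical names, and the helper takes the file-system object as a parameter so it can be exercised here with a stand-in instead of `mssparkutils.fs`. It uses only the `mkdirs(path)` and `cp(src, dest, recurse)` calls already shown in the cell above. The `abfss://example/Files` path is a placeholder.

```python
def copy_to_bronze(fs, source: str, bronze_path: str, folder: str) -> str:
    """Create the target folder, copy `source` into it, and return the destination path."""
    target_dir = f"{bronze_path.rstrip('/')}/{folder}"
    file_name = source.rstrip("/").split("/")[-1]
    destination = f"{target_dir}/{file_name}"
    fs.mkdirs(target_dir)
    fs.cp(source, destination, True)
    return destination


class RecordingFs:
    """Stand-in that records calls instead of touching storage."""

    def __init__(self):
        self.calls = []

    def mkdirs(self, path):
        self.calls.append(("mkdirs", path))
        return True

    def cp(self, src, dest, recurse=False):
        self.calls.append(("cp", src, dest, recurse))
        return True


fake_fs = RecordingFs()
dest = copy_to_bronze(
    fake_fs,
    "Files/source/flights/departuredelays.csv",
    "abfss://example/Files",  # placeholder for bronzePath
    "flights",
)
print(dest)
```

In the notebook, the call would be `copy_to_bronze(mssparkutils.fs, "Files/source/flights/departuredelays.csv", bronzePath, "flights")`.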


## Copy and load Airport Data

Read the source file and copy it into the Bronze folder.

In [8]:
# Copy airport data into bronze
# Note that we can copy data files from Internet data sources using a URL:  https://ourairports.com/data/airports.csv

url = 'https://ourairports.com/data/airports.csv'

# copy the data from the URL into the flights folder under bronzePath
mssparkutils.fs.cp(url, f"{bronzePath}/flights/airports.csv", True)

True
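
The destination file name in the cell above is typed by hand. When copying several URLs, it can be derived from the URL instead. A minimal sketch using only the Python standard library follows; `bronze_target_for_url` is a hypothetical helper, and the `abfss://example/Files` value is a placeholder for `bronzePath`.

```python
import posixpath
from urllib.parse import urlparse


def bronze_target_for_url(url: str, bronze_path: str, folder: str) -> str:
    """Build a destination path whose file name is the last segment of the URL path."""
    file_name = posixpath.basename(urlparse(url).path)
    if not file_name:
        raise ValueError(f"URL has no file name component: {url}")
    return f"{bronze_path.rstrip('/')}/{folder}/{file_name}"


target = bronze_target_for_url(
    "https://ourairports.com/data/airports.csv",
    "abfss://example/Files",  # placeholder for bronzePath
    "flights",
)
print(target)
```

The result can then be passed as the second argument to `mssparkutils.fs.cp`, as in the cell above.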

In [None]:
# List bronze files
mssparkutils.fs.ls(f"{bronzePath}/flights")
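
Rather than reading the listing by eye, we can check that the files copied in this section are present. The sketch below assumes each listing entry exposes a `name` attribute, as the `FileInfo` output earlier shows; `missing_files` is a hypothetical helper, and the `SimpleNamespace` listing is a stand-in that deliberately omits one file to show the result.

```python
from types import SimpleNamespace


def missing_files(listing, expected) -> list:
    """Return the expected file names that do not appear in the listing, sorted."""
    present = {entry.name for entry in listing}
    return sorted(set(expected) - present)


# Stand-in listing containing only one of the two expected files.
listing = [SimpleNamespace(name="departuredelays.csv", size=33396236)]
expected = ["departuredelays.csv", "airports.csv"]

print(missing_files(listing, expected))
```

In the notebook, `missing_files(mssparkutils.fs.ls(f"{bronzePath}/flights"), expected)` should return an empty list once both copies have succeeded.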