### ETL in Databricks
This notebook provides basic commands for common ETL steps. Use it to build a medallion ETL pipeline on the financial dataset.

You should end up with:
 - a bronze schema with one raw table per dataset
 - a silver schema with cleaned and formatted tables
 - a gold schema with aggregated tables (to answer the questions on the Notion page)

The notebook will be used as the source of a daily job that refreshes the pipeline (the whole notebook is executed on each run), and a dashboard will be built using the gold tables as source data.
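
The three schemas listed above map onto three-level `catalog.schema.table` names. As a minimal sketch (the helper `table_name` is hypothetical, and `catalog` is simply the placeholder catalog name used in this notebook), a small function can keep table names consistent across layers:

```python
# Hypothetical helper, not part of any Databricks API: builds the
# three-level name catalog.<layer>.<table> for a medallion layer.
LAYERS = ("bronze", "silver", "gold")

def table_name(layer: str, table: str, catalog: str = "catalog") -> str:
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}, expected one of {LAYERS}")
    return f"{catalog}.{layer}.{table}"

# "accounts" is a made-up table name used only for illustration.
print(table_name("bronze", "accounts"))  # catalog.bronze.accounts
```

The result can be passed directly to `saveAsTable` or `spark.read.table`, which avoids typos in layer names when the whole notebook runs as a job.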


In [0]:
%sql
-- Set up the catalog if it does not exist
CREATE CATALOG IF NOT EXISTS catalog;

-- Set up schemas for the medallion architecture (you can also use the GUI)
-- In Databricks, a schema is the same thing as a database
CREATE SCHEMA IF NOT EXISTS catalog.bronze;
CREATE SCHEMA IF NOT EXISTS catalog.silver;
CREATE SCHEMA IF NOT EXISTS catalog.gold;

In [0]:
# Insert your database credentials and JDBC URL
url = "your database url"

raw_df = (spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "your_table_name")
  .option("user", "your_user")
  .option("password", "your_password")
  .load()
)

display(raw_df)  # display the dataframe that was just loaded
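
Hardcoding credentials is risky in a notebook that runs as a scheduled job. One possible sketch, assuming the credentials are stored under the hypothetical keys `jdbc_user` and `jdbc_password`, builds the reader options from a lookup function. On Databricks the lookup could be `dbutils.secrets.get` with a secret scope of your choosing; a plain dictionary stands in for it here so the snippet runs anywhere:

```python
from typing import Callable, Dict

def jdbc_options(url: str, table: str, get_secret: Callable[[str], str]) -> Dict[str, str]:
    """Build options for spark.read.format("jdbc") without literal credentials."""
    return {
        "url": url,
        "dbtable": table,
        "user": get_secret("jdbc_user"),          # hypothetical key name
        "password": get_secret("jdbc_password"),  # hypothetical key name
    }

# Stand-in secret store, for illustration only.
fake_secrets = {"jdbc_user": "your_user", "jdbc_password": "your_password"}
opts = jdbc_options("your database url", "your_table_name", fake_secrets.__getitem__)

# In the notebook (assumption: a secret scope named "my_scope" exists):
#   opts = jdbc_options(url, "your_table_name",
#                       lambda key: dbutils.secrets.get(scope="my_scope", key=key))
#   raw_df = spark.read.format("jdbc").options(**opts).load()
```

Passing the lookup as a function keeps the option-building logic testable outside Databricks.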

In [0]:
# Read the JSON file from a volume
# json_schema must be defined beforehand (a StructType or a DDL-formatted string)
json_df = spark.read.schema(json_schema).json("/Volumes/path/to/file.json")
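
The JSON read above expects a `json_schema` that the notebook does not define. `spark.read.schema` accepts either a `StructType` or a DDL-formatted string, so one way to define it, sketched here with made-up column names and types, is to build the DDL string from a mapping:

```python
# Column names and types are placeholders; replace them with the
# fields that actually appear in your JSON file.
columns = {
    "id": "BIGINT",
    "name": "STRING",
    "amount": "DOUBLE",
}

# Join the mapping into a DDL-formatted schema string.
json_schema = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
print(json_schema)  # id BIGINT, name STRING, amount DOUBLE
```

An explicit schema avoids a schema-inference pass over the file and keeps column types stable between daily runs.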

In [0]:
# Write your raw data to a bronze table; overwrite mode replaces the previous contents on each run
raw_df.write.mode("overwrite").saveAsTable("catalog.schema.bronze_table")

In [0]:
from pyspark.sql import functions as F

# Read from your bronze (raw) table
bronze_df = spark.read.table("catalog.schema.bronze_table")

# Transform your bronze dataframe to silver (clean, format, remove nulls, etc.)
df_silver = (bronze_df
             .withColumn("date_example_formatted", F.to_timestamp(F.col("date_example"), "yyyy-MM-dd HH:mm:ss"))
             # etc.
)

# Confirm results
display(df_silver)
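
The pattern `yyyy-MM-dd HH:mm:ss` passed to `to_timestamp` uses Spark's datetime pattern letters. For a quick sanity check of sample values outside Spark, the equivalent Python `strptime` format is `%Y-%m-%d %H:%M:%S`. A minimal sketch with made-up sample values (the helper `parse_or_none` is hypothetical):

```python
from datetime import datetime

SPARK_PATTERN = "yyyy-MM-dd HH:mm:ss"  # pattern used in the cell above
PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S"    # strptime equivalent

def parse_or_none(value: str):
    """Return a datetime, or None if the value does not match the format."""
    try:
        return datetime.strptime(value, PYTHON_FORMAT)
    except ValueError:
        return None

print(parse_or_none("2024-01-31 13:45:00"))  # 2024-01-31 13:45:00
print(parse_or_none("31/01/2024"))           # None
```

Values that do not match the pattern are worth finding before the silver write, since depending on the ANSI mode setting Spark either returns null or raises an error for them.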

In [0]:
# Write the silver dataframe to a silver table
df_silver.write.mode("overwrite").saveAsTable("catalog.schema.silver_table")

In [0]:
from pyspark.sql import functions as F

# Transform your silver dataframe to gold (aggregated, joined, etc.)
silver_df = spark.read.table("catalog.schema.silver_table")

gold_df = (silver_df
           .groupBy("address")  # example grouping column
           .agg(F.sum("total_debt").alias("total_debt"))
           .orderBy(F.desc("total_debt"))
)

display(gold_df)
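
To make the aggregation concrete, here is what the `groupBy` / `sum` / `orderBy` chain computes, written in plain Python over a few made-up rows (the addresses and amounts are illustrative, not taken from the financial dataset):

```python
from collections import defaultdict

rows = [
    {"address": "A Street", "total_debt": 100.0},
    {"address": "B Avenue", "total_debt": 250.0},
    {"address": "A Street", "total_debt": 300.0},
]

# Equivalent of groupBy("address") followed by sum("total_debt")
totals = defaultdict(float)
for row in rows:
    totals[row["address"]] += row["total_debt"]

# Equivalent of orderBy(desc("total_debt"))
gold_rows = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(gold_rows)  # [('A Street', 400.0), ('B Avenue', 250.0)]
```

The gold table holds one row per address with its summed debt, largest first, which is the shape a dashboard query would read.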

In [0]:
# Write the gold dataframe to a gold table
gold_df.write.mode("overwrite").saveAsTable("catalog.schema.gold_table")