# Pipeline and Medallion Architecture

In this notebook we are going to combine the ideas from:
- [Example medallion architecture](https://docs.databricks.com/aws/en/lakehouse/medallion#example-medallion-architecture)
- [Load data with Lakeflow Declarative Pipelines](https://docs.databricks.com/aws/en/ldp/load)
- [Manage data quality with pipeline expectations](https://docs.databricks.com/aws/en/ldp/expectations)

In [None]:
from pyspark.sql import functions as F
from pyspark import pipelines as dp

# Landing zone directories to read the raw data

- In this example we use generated (fake) data stored in a managed volume.
- In a real scenario, the landing zone could be:
    - A cloud object storage path: S3, ADLS, GCS
    - An external volume path: access existing cloud object storage using volume-like paths
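To make the options above concrete, here is a small sketch of what the base locations typically look like and how a dataset folder can be appended to them. Every catalog, schema, bucket, container and account name is made up for illustration, and `dataset_directory` is a hypothetical helper, not part of any Databricks API.

```python
# Hypothetical base locations; every bucket, container and catalog name is made up.
EXAMPLE_LANDING_ZONES = {
    "volume": "/Volumes/my_catalog/my_schema/landing_zone",
    "s3": "s3://my-bucket/landing_zone",
    "adls": "abfss://my-container@myaccount.dfs.core.windows.net/landing_zone",
    "gcs": "gs://my-bucket/landing_zone",
}


def dataset_directory(base: str, dataset: str) -> str:
    """Join a base location and a dataset folder with exactly one slash."""
    return f"{base.rstrip('/')}/{dataset.strip('/')}"


for kind, base in EXAMPLE_LANDING_ZONES.items():
    print(kind, "->", dataset_directory(base, "customers"))
```

Because the rest of the notebook only sees a directory string, switching from a volume to a cloud path should only require changing the configured base location.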

In [None]:
volume = spark.conf.get("landing_zone_volume", "")
bronze_schema = spark.conf.get("bronze_schema", "")
silver_schema = spark.conf.get("silver_schema", "")
gold_schema = spark.conf.get("gold_schema", "")

customers_directory = f"{volume}/customers"
products_directory = f"{volume}/products"
transactions_directory = f"{volume}/transactions"
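Because each `spark.conf.get(..., "")` call above falls back to an empty string, a missing setting would silently produce a path such as `/customers` or a table name such as `.customers_raw`. A minimal sketch of an early check is shown below. It uses a plain dictionary as a stand-in for the pipeline configuration, and `require_setting` is a hypothetical helper introduced only for this illustration.

```python
def require_setting(settings: dict, key: str) -> str:
    """Return a non-empty configuration value or raise a descriptive error."""
    value = (settings.get(key) or "").strip()
    if not value:
        raise ValueError(f"Pipeline configuration '{key}' is missing or empty")
    return value


# Stand-in for the pipeline configuration; the values are hypothetical.
settings = {
    "landing_zone_volume": "/Volumes/my_catalog/my_schema/landing_zone",
    "bronze_schema": "bronze",
    "silver_schema": "",  # deliberately empty to show the failure mode
}

print(require_setting(settings, "bronze_schema"))
try:
    require_setting(settings, "silver_schema")
except ValueError as err:
    print(err)
```

In the notebook itself, the same idea could be applied to the values returned by `spark.conf.get` before the directory strings are built.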

# Bronze Tables

In the bronze layer we are supposed to only:

- Load the raw data into tables
- Avoid transformations, changes and filters in the data.
- Keep the data in the original format.

### Customers table

In [None]:
@dp.table(name=f"{bronze_schema}.customers_raw")
def bronze_customers():
  """
    Returns a Spark Streaming Dataframe that uses AutoLoader to incrementally read the customer data
  """

  SCHEMA_HINTS = "customer_id STRING, name STRING, country STRING, registration_date DATE, customer_segment STRING"
  
  df = (
      spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaHints", SCHEMA_HINTS)
          .load(customers_directory)
  )
  return df


### Products table

In [None]:
@dp.table(name=f"{bronze_schema}.products_raw")
def bronze_products():
  """
    Returns a Spark Streaming Dataframe that uses AutoLoader to incrementally read the product data
  """

  SCHEMA_HINTS = "product_id STRING, product_name STRING, category STRING, price DOUBLE, cost DOUBLE"

  df = (
      spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaHints", SCHEMA_HINTS)
          .load(products_directory)
  )
  return df

### Transactions table

In [None]:
@dp.table(name=f"{bronze_schema}.transactions_raw")
def bronze_transactions():
  """
    Returns a Spark Streaming Dataframe that uses AutoLoader to incrementally read the transaction data
  """

  SCHEMA_HINTS = "transaction_id STRING, customer_id STRING, product_id STRING, quantity_id STRING, category STRING, price DOUBLE, cost DOUBLE"

  df = (
      spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .option("cloudFiles.schemaHints", SCHEMA_HINTS)
          .load(transactions_directory)
  )
  return df
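The three bronze readers above differ only in file format, schema hints and directory. As a sketch of how that repetition could be factored out, the hypothetical helper below builds the option dictionary once per table. It is plain Python and does not touch Spark.

```python
def auto_loader_options(file_format: str, schema_hints: str) -> dict:
    """Build the two Auto Loader options used by each bronze reader in this notebook."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaHints": schema_hints,
    }


# File formats as used by the three bronze tables above.
BRONZE_FORMATS = {"customers": "csv", "products": "json", "transactions": "parquet"}

customer_options = auto_loader_options(
    BRONZE_FORMATS["customers"],
    "customer_id STRING, name STRING, country STRING, "
    "registration_date DATE, customer_segment STRING",
)
print(customer_options)
```

Inside a table function, the dictionary could then be passed in one call, for example `spark.readStream.format("cloudFiles").options(**customer_options).load(customers_directory)`.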

# Silver Tables

In this stage you are supposed to:

- Clean the data.
- Transform the data.
- Apply business and data quality rules to the data.

The only rule in this example is that no nulls may reach the downstream tables, so we drop the records that violate it.
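To illustrate what dropping invalid records means, here is a pure-Python sketch of the behaviour: rows that satisfy the rule are kept, and the rest are discarded and counted. It only mimics the idea on a list of dictionaries and is not how the pipeline engine implements expectations. The function `keep_valid_rows` and the sample records are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def keep_valid_rows(
    rows: Iterable[dict], predicate: Callable[[dict], bool]
) -> Tuple[List[dict], int]:
    """Return the rows that satisfy the rule and the number of rows dropped."""
    kept, dropped = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped


# Hypothetical sample records; the second one has a null customer_id.
customers = [
    {"customer_id": "C1", "name": "Ana"},
    {"customer_id": None, "name": "Bob"},
    {"customer_id": "C3", "name": "Eve"},
]

valid, dropped_count = keep_valid_rows(
    customers, lambda row: row["customer_id"] is not None
)
print(valid, dropped_count)
```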

### Create silver customers table with an expectation

In [None]:
""" 
@dp.table(name=f"{silver_schema}.customers_cleaned")
@dp.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
def silver_customers():
  # Read Bronze table
  df = spark.readStream.table(f"{bronze_schema}.customers_raw")

  # Drop the Auto Loader generated column, no longer needed on silver.
  df = df.drop("_rescued_data")
  return df
"""

### Create silver orders table with an expectation

In [None]:
"""
@dp.table(name=f"{silver_schema}.orders_cleaned")
@dp.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
  # Read Bronze table
  df = spark.readStream.table(f"{bronze_schema}.orders_raw")

  # Drop the Auto Loader generated column, no longer needed on silver.
  df = df.drop("_rescued_data")
  return df

"""

### Join the silver tables

In [None]:
"""
@dp.table(name=f"{silver_schema}.customers_orders")
def customer_orders():
    # Silver tables
    customers_cleaned_df = spark.readStream.table(f"{silver_schema}.customers_cleaned")
    orders_cleaned_df = spark.readStream.table(f"{silver_schema}.orders_cleaned")

    # Join
    df = customers_cleaned_df.join(orders_cleaned_df, on="customer_id", how="inner")
    return df
"""
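As a reminder of what `how="inner"` implies for the joined table, the sketch below reproduces inner-join semantics on small in-memory lists: only customers with at least one matching order appear, and orders without a matching customer are left out. The helper `inner_join` and the sample rows are hypothetical and independent of Spark.

```python
from collections import defaultdict


def inner_join(left: list, right: list, key: str) -> list:
    """Return merged rows for every left/right pair sharing the same key value."""
    right_by_key = defaultdict(list)
    for row in right:
        right_by_key[row[key]].append(row)
    joined = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample data: C2 has no orders, and O3 points at an unknown customer.
customers = [{"customer_id": "C1", "name": "Ana"}, {"customer_id": "C2", "name": "Bob"}]
orders = [
    {"order_id": "O1", "customer_id": "C1"},
    {"order_id": "O2", "customer_id": "C1"},
    {"order_id": "O3", "customer_id": "C9"},
]

customers_orders = inner_join(customers, orders, "customer_id")
print(customers_orders)
```

If unmatched customers should still appear downstream, a left join would be the alternative to consider.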