In [0]:
"""

- Create the Silver layer customer table if it does not exist.
- Identify the maximum `last_updated` timestamp from the Silver layer table.
- Create a temporary view `bronze_incremental` to pull new data from the Bronze layer.
- Apply business rules and transformations:
  - Validate email addresses (email must not be null).
  - Keep only records whose age is between 18 and 100.
  - Create customer segments based on `total_purchases`.
  - Calculate days since customer registration.
  - Remove records with negative `total_purchases`.
- Merge the transformed data into the Silver layer customer table based on customer ID.
In short, this notebook loads the data from the Bronze layer, cleans and transforms it, and then saves it into the Silver layer.

"""

In [0]:
"""
- Switches to the 'globalretail_silver' database.
- Creates the 'silver_customers' table if it does not exist, with columns for customer details, segmentation, and metadata.
"""
spark.sql("USE globalretail_silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver_customers (
    customer_id STRING,
    name STRING,
    email STRING,
    country STRING,
    customer_type STRING,
    registration_date DATE,
    age INT,
    gender STRING,
    total_purchases INT,
    customer_segment STRING,
    days_since_registration INT,
    last_updated TIMESTAMP)
""")

In [0]:
"""
- Retrieves the maximum 'last_updated' timestamp from the 'silver_customers' table.
- Sets 'last_processed_timestamp' to the retrieved value, or to a default if no value exists.
"""
# Get the last processed timestamp from silver layer
last_processed_df = spark.sql("SELECT MAX(last_updated) as last_processed FROM silver_customers")
last_processed_timestamp = last_processed_df.collect()[0]['last_processed']

if last_processed_timestamp is None:
    last_processed_timestamp = "1900-01-01T00:00:00.000+00:00"
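
The fallback logic above can be pulled into a small helper so it can be checked without a Spark session. This is only a sketch: `resolve_watermark` and `DEFAULT_WATERMARK` are hypothetical names that do not appear in the notebook, and the default simply mirrors the 1900-01-01 literal used above.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical constant mirroring the notebook's "1900-01-01T00:00:00.000+00:00" default.
DEFAULT_WATERMARK = datetime(1900, 1, 1, tzinfo=timezone.utc)


def resolve_watermark(max_last_updated: Optional[datetime]) -> datetime:
    """Return the Silver table's high-water mark, or the default for an empty table."""
    if max_last_updated is None:
        return DEFAULT_WATERMARK
    return max_last_updated


# First run: MAX(last_updated) on an empty table is NULL, which collect() returns as None.
first_run = resolve_watermark(None)

# Later run: the timestamp of the previous load becomes the lower bound.
later_run = resolve_watermark(datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc))
```

With the default in place, the first run treats every Bronze record as new, since any realistic `ingestion_timestamp` is later than the year 1900.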

In [0]:
"""
- Creates or replaces a temporary view named 'bronze_incremental' containing only new records from the Bronze layer.
- The view filters records from 'globalretail_bronze.bronze_customer' where 'ingestion_timestamp' is greater than the last processed timestamp.
- A temporary view in Databricks is a logical table that exists only within the current notebook session and is not persisted to the database.
- This allows downstream processing of only the new or updated customer records since the last pipeline run.
"""
# Create a temporary view of incremental bronze data
spark.sql(f"""
CREATE OR REPLACE TEMPORARY VIEW bronze_incremental AS
SELECT *
FROM globalretail_bronze.bronze_customer c
WHERE c.ingestion_timestamp > '{last_processed_timestamp}'
""")

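The filter in the `bronze_incremental` view can be mirrored in plain Python to see which rows it keeps. This is a minimal sketch under stated assumptions: `select_incremental` is a hypothetical helper, and the sample rows are made up for illustration and carry only the two columns that matter here.

```python
from datetime import datetime


def select_incremental(rows, watermark):
    """Keep rows ingested strictly after the watermark (mirrors the view's WHERE clause)."""
    return [row for row in rows if row["ingestion_timestamp"] > watermark]


# Hypothetical Bronze rows, not real data.
bronze_rows = [
    {"customer_id": "C1", "ingestion_timestamp": datetime(2024, 5, 1, 9, 0)},
    {"customer_id": "C2", "ingestion_timestamp": datetime(2024, 5, 2, 9, 0)},
    {"customer_id": "C3", "ingestion_timestamp": datetime(2024, 5, 3, 9, 0)},
]
watermark = datetime(2024, 5, 2, 9, 0)

new_rows = select_incremental(bronze_rows, watermark)
new_ids = [row["customer_id"] for row in new_rows]
```

Because the comparison is strict (`>`), a row stamped exactly at the watermark (C2 here) is not picked up again.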
In [0]:
"""
- Displays the contents of the 'bronze_incremental' temporary view containing new or updated customer records from the Bronze layer.
"""
display(spark.sql("select * from bronze_incremental"))

In [0]:
"""
- Validate email addresses (email must not be null).
- Keep only records whose age is between 18 and 100.
- Create customer_segment: 'High Value' if total_purchases > 10000, 'Medium Value' if > 5000, else 'Low Value'.
- Calculate the number of days since the customer registered in the system.
- Remove junk records where total_purchases is negative.
"""

In [0]:
"""
- Creates or replaces a temporary view 'silver_incremental' with transformed customer data from 'bronze_incremental'.
- Applies business rules:
    - Validates email is not null.
    - Ensures age is between 18 and 100.
    - Removes records with negative total_purchases.
    - Segments customers by total_purchases.
    - Calculates days since registration using DATEDIFF(endDate, startDate): returns the number of days from startDate to endDate.
      Example: DATEDIFF(CURRENT_DATE(), registration_date) gives days since registration.
    - Adds current timestamp as last_updated using CURRENT_TIMESTAMP().
    - Other useful functions:
        - DATE_ADD(date, days): adds days to a date.
        - DATE_SUB(date, days): subtracts days from a date.
        - DATE_FORMAT(date, format): formats a date as a string.
        - TIMESTAMPDIFF(unit, start, end): difference between two timestamps in specified units.
"""
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW silver_incremental AS
SELECT
    customer_id,
    name,
    email,
    country,
    customer_type,
    registration_date,
    age,
    gender,
    total_purchases,
    CASE
        WHEN total_purchases > 10000 THEN 'High Value'
        WHEN total_purchases > 5000 THEN 'Medium Value'
        ELSE 'Low Value'
    END AS customer_segment,
    DATEDIFF(CURRENT_DATE(), registration_date) AS days_since_registration,
    CURRENT_TIMESTAMP() AS last_updated
FROM bronze_incremental
WHERE 
    age BETWEEN 18 AND 100
    AND email IS NOT NULL
    AND total_purchases >= 0
""")

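The business rules in the `silver_incremental` view can also be written as plain Python functions, which makes the boundary cases easy to check. This is an illustrative sketch, not the notebook's implementation: `customer_segment`, `passes_rules`, and `days_since_registration` are hypothetical helpers that mirror the CASE expression, the WHERE clause, and the DATEDIFF call.

```python
from datetime import date
from typing import Optional


def customer_segment(total_purchases: int) -> str:
    """Mirror the CASE expression: thresholds are exclusive (> 10000, > 5000)."""
    if total_purchases > 10000:
        return "High Value"
    if total_purchases > 5000:
        return "Medium Value"
    return "Low Value"


def passes_rules(age: Optional[int], email: Optional[str],
                 total_purchases: Optional[int]) -> bool:
    """Mirror the WHERE clause; a NULL in any compared column drops the row."""
    if age is None or email is None or total_purchases is None:
        return False
    return 18 <= age <= 100 and total_purchases >= 0


def days_since_registration(registration_date: date, today: date) -> int:
    """Mirror DATEDIFF(CURRENT_DATE(), registration_date)."""
    return (today - registration_date).days


# Boundary values: exactly 10000 is still 'Medium Value', exactly 5000 is 'Low Value'.
segments = [customer_segment(v) for v in (10001, 10000, 5000)]
kept = passes_rules(age=18, email="a@example.com", total_purchases=0)
days = days_since_registration(date(2024, 1, 1), date(2024, 1, 31))
```

Note that the email rule is only a null check, as in the SQL above; it does not verify the address format.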
In [0]:
"""
- Displays the transformed customer data in 'silver_incremental'.
- Business rules applied:
    - New 'customer_segment' column added: 'High Value' if total_purchases > 10000, 'Medium Value' if > 5000, else 'Low Value'.
    - Only includes records where age is between 18 and 100.
    - Only includes records where email is not null.
    - Removes records with negative total_purchases.
"""
display(spark.sql("select * from silver_incremental"))

In [0]:
"""
Summary:
- This script merges cleaned and transformed customer data from the 'silver_incremental' view into the 'silver_customers' table in the Silver layer.
- The merge operation uses 'customer_id' as the unique key.
- If a customer_id already exists in the target table, the record is updated; if not, a new record is inserted.

Why use MERGE instead of APPEND:
- Using MERGE ensures that existing records are updated with the latest information, preventing duplicate entries for the same customer.
- APPEND would add new rows regardless of duplicates, leading to data redundancy and inconsistency.
- MERGE keeps the load idempotent with respect to row count: re-running the script does not create duplicate rows for the same customer_id; it only updates or inserts as needed.
"""

spark.sql("""
MERGE INTO silver_customers target
USING silver_incremental source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET *
WHEN NOT MATCHED THEN
    INSERT *
""")

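The difference between MERGE and APPEND described above can be seen in a small in-memory model. This is a sketch under stated assumptions: `merge_by_key` and `append` are hypothetical helpers operating on lists of dicts, not Spark or Delta APIs, and the sample rows are made up.

```python
def merge_by_key(target, source, key="customer_id"):
    """Upsert: overwrite rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row  # matched -> update, not matched -> insert
    return list(merged.values())


def append(target, source):
    """Blind append: every source row is added, duplicates included."""
    return target + source


# Hypothetical table contents for illustration.
silver = [{"customer_id": "C1", "customer_segment": "Low Value"}]
incremental = [
    {"customer_id": "C1", "customer_segment": "High Value"},
    {"customer_id": "C2", "customer_segment": "Medium Value"},
]

after_first = merge_by_key(silver, incremental)
after_second = merge_by_key(after_first, incremental)  # re-run with the same input
appended_twice = append(append(silver, incremental), incremental)
```

Re-running the merge leaves two rows, while appending twice yields five. One difference from the real statement: this model silently keeps the last source row per key, whereas a MERGE on a Delta table can fail when several source rows match the same target row, so the source may need deduplicating by `customer_id` first.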
In [0]:
"""
- Displays the total number of records in the 'silver_customers' table.
"""
display(spark.sql("select count(*) from silver_customers"))