# Silver Layer – Development Notebook

## Purpose
This notebook is used to develop the **actual Silver transformation logic**
from Bronze tables.

Tables created here are **development tables only** and are NOT consumed
directly by Gold or Power BI.

Final promotion to shared Silver (`sl_*`) is controlled by the PO.

---

## Allowed Sources (READ ONLY)
This notebook may read from:
- `br_customers`
- `br_geolocation`
- `br_orders`
- `br_order_items`
- `br_order_payments`
- `br_order_reviews`
- `br_products`
- `br_sellers`
- `br_product_category_translation`

---

## Output Tables (WRITE ONLY)
This notebook must write to **development Silver tables**:

Naming convention: `sl_dev_<table_name>`

Examples:
- `sl_dev_orders`
- `sl_dev_order_items`
- `sl_dev_order_reviews`
- `sl_dev_sellers`

These tables are used for development and validation only.

---

## Contract Requirements
- Grain must match the Silver contract (same as stub tables)
- Column names must match the agreed Silver schema
- No business aggregations (metrics belong in Gold)
- Deduplication and data type standardization are allowed
- Timestamp parsing and null handling should be finalized here

---

## Do NOT
- Do not overwrite `sl_stub_*`
- Do not overwrite `sl_*`
- Do not write to any `gold_*` table
- Do not connect Power BI to Silver tables

---

## Promotion
Once logic is finalized, tables will be promoted by the PO:

```sql
CREATE OR REPLACE TABLE sl_orders AS
SELECT * FROM sl_dev_orders;
```

Promotion is a controlled step to ensure downstream stability.


### Cleaning Process

##### Objective
Analyse Seller Daily Performance 

At a seller × day level, the pipeline must answer:

- How many orders does each seller receive per day?

- How many orders are delivered?

- How many deliveries are late versus on time?

- What is the on-time delivery rate?

- What is the average delivery duration?

- What is the average early delivery duration?
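
This notebook does not compute these metrics (aggregations belong in Gold), but a small worked example may help pin down what the questions mean. The sketch below uses plain Python on three hypothetical orders for one seller and one purchase day. The definitions are assumptions, not agreed business rules: an order counts as on time when it reaches the customer on or before the estimated delivery date, delivery duration runs from purchase to customer delivery, and early duration is the gap between actual and estimated delivery for on-time orders.

```python
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86400

# Hypothetical orders: (purchase, delivered to customer or None, estimated delivery)
orders = [
    (datetime(2017, 11, 20), datetime(2017, 12, 4), datetime(2017, 12, 21)),  # early
    (datetime(2017, 11, 20), datetime(2017, 12, 2), datetime(2017, 11, 30)),  # late
    (datetime(2017, 11, 20), None, datetime(2017, 12, 10)),                   # not delivered
]


def days(delta):
    """Convert a timedelta to fractional days."""
    return delta.total_seconds() / SECONDS_PER_DAY


delivered = [o for o in orders if o[1] is not None]
on_time = [o for o in delivered if o[1] <= o[2]]   # assumed definition of "on time"
late = [o for o in delivered if o[1] > o[2]]

order_count = len(orders)
on_time_rate = len(on_time) / len(delivered)       # toy data always has deliveries
avg_delivery_days = mean(days(d - p) for p, d, _ in delivered)
avg_early_days = mean(days(e - d) for _, d, e in on_time)

print(order_count, len(delivered), len(late), on_time_rate)
print(avg_delivery_days, avg_early_days)
```

Every input used here is one of the timestamp columns kept in the orders table, which is why null handling and timestamp parsing need to be settled at the Silver layer.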

##### Tables selected: 
1. Sellers 
2. Orders 
3. Order_items
4. Order_reviews

##### Actions performed on the data set 
- Inspect the column names and the number of rows and columns in each table
- Check for null values and fill with '0' where applicable
- Check for duplicates and drop them where applicable
- Recast data types where applicable
- Join tables using inner and left joins

##### No columns were added or renamed

##### Result
- Joined table name (view only): sl_seller_orders

    -  Cleaned Seller table name: sl_dev_sellers

    -  Cleaned Orders table name: sl_dev_orders

    -  Cleaned Order Items table name: sl_dev_order_items         

    -  Cleaned Order Reviews table name: sl_dev_order_reviews  

- Column names:

    1. "order_id"

    2. "order_item_id"

    3. "product_id"

    4. "seller_id"

    5. "seller_city"

    6. "seller_state"  

    7. "seller_zip_code_prefix"

    8. "customer_id"

    9. "price"

    10. "freight_value"

    11. "order_status"

    12. "order_purchase_timestamp"

    13. "order_approved_at"

    14. "order_estimated_delivery_date"

    15. "order_delivered_carrier_date"

    16. "order_delivered_customer_date"

    17. "shipping_limit_date"


In [98]:
# To load the SELLERS table and see the first 5 rows

br_sellers = spark.read.table("dbo.br_sellers")
br_sellers.show(5, truncate=False)

+--------------------------------+----------------------+-----------------+------------+
|seller_id                       |seller_zip_code_prefix|seller_city      |seller_state|
+--------------------------------+----------------------+-----------------+------------+
|3442f8959a84dea7ee197c632cb2df15|13023                 |campinas         |SP          |
|d1b65fc7debc3361ea86b5f14c68d2e2|13844                 |mogi guacu       |SP          |
|ce3ad9de960102d0677a81f5d0bb7b2d|20031                 |rio de janeiro   |RJ          |
|c0f3eea2e14555b6faeea3dd58c1b1c3|04195                 |sao paulo        |SP          |
|51a04a8a6bdcb23deccc82b0b80742cf|12914                 |braganca paulista|SP          |
+--------------------------------+----------------------+-----------------+------------+
only showing top 5 rows



In [99]:
# Number of rows and columns in SELLER table 

sellers_row = br_sellers.count()
sellers_col = len(br_sellers.columns)
print(sellers_row, sellers_col)

3095 4


In [100]:
# Check whether seller_id is unique or contains duplicates

from pyspark.sql.functions import countDistinct

br_sellers = spark.read.table("dbo.br_sellers")

total_rows = br_sellers.count()

distinct_sellers = (
    br_sellers
        .select(countDistinct("seller_id"))
        .collect()[0][0]
)

print("rows:", total_rows, "distinct seller_id:", distinct_sellers)

rows: 3095 distinct seller_id: 3095


In [101]:
# check for any null values in SELLER table

from pyspark.sql.functions import col

br_sellers_cleaned = br_sellers.where(
        col('seller_id').isNotNull() &
        col('seller_zip_code_prefix').isNotNull() &
        col('seller_city').isNotNull() &
        col('seller_state').isNotNull() 
)

print(br_sellers_cleaned.count())     
print(br_sellers.count())              # since the number of rows match, there are no null values 

3095


3095
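
Listing every column in the `where` clause works, but the list has to be kept in sync with the schema by hand. As an alternative sketch (not the method used in this notebook), `DataFrame.dropna(how="any")` drops rows that have a null in any column, so the difference between the two counts is the number of affected rows. The helper name `count_rows_with_nulls` is hypothetical.

```python
def count_rows_with_nulls(df):
    """Return how many rows of a Spark DataFrame contain at least one null."""
    total = df.count()
    complete = df.dropna(how="any").count()  # rows with no null in any column
    return total - complete
```

Given the matching counts above, `count_rows_with_nulls(br_sellers)` should return 0. Note that for numeric columns `dropna` also treats NaN as missing, which does not matter here because all Bronze seller columns are strings.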


In [102]:
# Drop column "seller_zip_code_prefix" from the table as it is not required.


sl_dev_sellers = br_sellers_cleaned.drop("seller_zip_code_prefix")

print(sl_dev_sellers)

DataFrame[seller_id: string, seller_city: string, seller_state: string]


In [103]:
# To load the ORDERS table and see the first 5 rows 

br_orders = spark.read.table('dbo.br_orders')
print(br_orders)
br_orders.show(5, truncate=False, vertical=True)

DataFrame[order_id: string, customer_id: string, order_status: string, order_purchase_timestamp: string, order_approved_at: string, order_delivered_carrier_date: string, order_delivered_customer_date: string, order_estimated_delivery_date: string]


-RECORD 0---------------------------------------------------------
 order_id                      | 55c27ec4136ec6d13577c7fb34c62b68 
 customer_id                   | 5040cce1f64706863160f0a3b757a894 
 order_status                  | delivered                        
 order_purchase_timestamp      | 2017-11-20 11:46:40              
 order_approved_at             | 2017-11-20 12:07:40              
 order_delivered_carrier_date  | 2017-11-21 22:11:12              
 order_delivered_customer_date | 2017-12-04 21:42:06              
 order_estimated_delivery_date | 2017-12-21 00:00:00              
-RECORD 1---------------------------------------------------------
 order_id                      | afa7ff555249234a1316e4b88f5f5aa3 
 customer_id                   | 7daa8a3a2fa25e50b04909c1235f4e2f 
 order_status                  | delivered                        
 order_purchase_timestamp      | 2017-12-05 11:22:23              
 order_approved_at             | 2017-12-05 11:33:24          

In [104]:
# Number of rows and columns in ORDERS table 

orders_row = br_orders.count()
orders_columns = len(br_orders.columns)

print(orders_row, orders_columns)

99441 8


In [105]:
# Check if there are any null values in the ORDERS table
# Keep only the rows where every column is populated, then count them

from pyspark.sql.functions import col  

br_orders_non_null = br_orders.where(
    col("order_id").isNotNull() &
    col("customer_id").isNotNull() &
    col("order_status").isNotNull() & 
    col("order_purchase_timestamp").isNotNull() &
    col("order_approved_at").isNotNull() &
    col("order_delivered_carrier_date").isNotNull() &
    col("order_delivered_customer_date").isNotNull() &
    col("order_estimated_delivery_date").isNotNull()
)

print(br_orders_non_null.count())   # fewer rows than br_orders means some rows contain nulls

96461


In [106]:
# Count the nulls per column to find where the null values are located

from pyspark.sql import functions as F

col_with_nulls = [
    "order_status",
    "order_purchase_timestamp",
    "order_approved_at",
    "order_estimated_delivery_date",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
]

null_counts = br_orders.select([
    F.count(F.when(F.col(c).isNull(), 1)).alias(f"null_{c}")
    for c in col_with_nulls
])

null_counts.show(truncate=False, vertical=True)




-RECORD 0----------------------------------
 null_order_status                  | 0    
 null_order_purchase_timestamp      | 0    
 null_order_approved_at             | 160  
 null_order_estimated_delivery_date | 0    
 null_order_delivered_carrier_date  | 1783 
 null_order_delivered_customer_date | 2965 
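
The row-level count and the per-column counts above can be reconciled with a little arithmetic. The sketch below uses only the numbers printed by the notebook; the variable names are illustrative.

```python
total_rows = 99441        # rows in br_orders
fully_populated = 96461   # rows with no null in any column

null_counts = {
    "order_approved_at": 160,
    "order_delivered_carrier_date": 1783,
    "order_delivered_customer_date": 2965,
}

rows_with_any_null = total_rows - fully_populated          # 2980
null_cells = sum(null_counts.values())                     # 4908
extra_null_cells = null_cells - rows_with_any_null         # 1928

print(rows_with_any_null, null_cells, extra_null_cells)
```

Because the per-column total (4908) exceeds the number of affected rows (2980), some rows must be null in more than one column. One plausible reading, which the notebook does not verify, is that orders not yet shipped or delivered are missing several of these timestamps at once.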



In [107]:
# Recasting of data types from string to timestamp

from pyspark.sql.functions import col  

timestamp_cols = [
        'order_purchase_timestamp',
        'order_approved_at',
        'order_delivered_carrier_date',
        'order_delivered_customer_date',
        'order_estimated_delivery_date'
]

for t in timestamp_cols:
    br_orders = br_orders.withColumn(t, col(t).cast('timestamp'))

print(br_orders)


DataFrame[order_id: string, customer_id: string, order_status: string, order_purchase_timestamp: timestamp, order_approved_at: timestamp, order_delivered_carrier_date: timestamp, order_delivered_customer_date: timestamp, order_estimated_delivery_date: timestamp]


In [108]:
# Check if there are any duplicated values for ORDERS table 
# For customer_id and order_id only 

from pyspark.sql.functions import countDistinct

sl_dev_orders = br_orders
total_rows = sl_dev_orders.count()

distinct_cust_id = (
    sl_dev_orders
        .select(countDistinct("customer_id"))
        .collect()[0][0]
)

distinct_order_id = (
    sl_dev_orders
        .select(countDistinct("order_id"))
        .collect()[0][0]
)

print("rows:", total_rows, 
      "distinct customer_id:", distinct_cust_id, 
      "distinct_order_id", distinct_order_id
)


rows: 99441 distinct customer_id: 99441 distinct_order_id 99441


In [109]:
# To load the ORDER_ITEMS table and see the first 5 rows 

br_order_items = spark.read.table('dbo.br_order_items')
br_order_items.show(5, truncate=False, vertical=True)   # show() prints by itself and returns None

-RECORD 0-----------------------------------------------
 order_id            | 45193dbb3e96a6b68bcd317c6f840fb5 
 order_item_id       | 2                                
 product_id          | a52c53c58fd2105adfe80a817dfa5a76 
 seller_id           | 3b15288545f8928d3e65a8f949a28291 
 shipping_limit_date | 2017-10-02 16:15:08              
 price               | 106.99                           
 freight_value       | 17.25                            
-RECORD 1-----------------------------------------------
 order_id            | 4519ce49b67354e83892e20c66e45b65 
 order_item_id       | 1                                
 product_id          | e09134e776e503444db67bd5b239b56a 
 seller_id           | 6fd52c528dcb38be2eea044946b811f8 
 shipping_limit_date | 2018-05-02 02:15:20              
 price               | 129.89                           
 freight_value       | 9.26                             
-RECORD 2-----------------------------------------------
 order_id            | 4519d054

In [110]:
# Number of rows and columns in ORDER_ITEMS table 

br_order_items_rows = br_order_items.count()
br_order_items_columns = len(br_order_items.columns) 
print(br_order_items_rows, br_order_items_columns)

112650 7


In [111]:
# Check if there are any null values in the ORDER_ITEMS table

from pyspark.sql.functions import col  

br_order_items = spark.read.table('dbo.br_order_items')

br_order_items_cleaned = br_order_items.where(
    col("order_id").isNotNull() &
    col("order_item_id").isNotNull() &
    col("product_id").isNotNull() & 
    col("seller_id").isNotNull() &
    col("shipping_limit_date").isNotNull() &
    col("price").isNotNull() &
    col("freight_value").isNotNull() 
)

print(br_order_items_cleaned.count())     # No null values 

112650


In [112]:
# Check for any duplicated values 
# Repeated order_id values are expected because one order can contain more than one item. 

from pyspark.sql.functions import countDistinct

total_rows = br_order_items_cleaned.count()

distinct_order_item_id = (
    br_order_items_cleaned
        .select(countDistinct("order_item_id"))
        .collect()[0][0]
)

distinct_order_id = (
    br_order_items_cleaned
        .select(countDistinct("order_id"))
        .collect()[0][0]
)

distinct_product_id = (
    br_order_items_cleaned
        .select(countDistinct("product_id"))
        .collect()[0][0]
)

distinct_seller_id = (
    br_order_items_cleaned
        .select(countDistinct("seller_id"))
        .collect()[0][0]
)


print("rows:", total_rows, 
      "distinct order_item_id:", distinct_order_item_id, 
      "distinct_order_id", distinct_order_id,
      "distinct_product_id", distinct_product_id,
      "distinct_seller_id", distinct_seller_id      
)


rows: 112650 distinct order_item_id: 21 distinct_order_id 98666 distinct_product_id 32951 distinct_seller_id 3095


In [113]:
# Drop column 'shipping_limit_date' as it is not required

sl_dev_order_items = br_order_items_cleaned.drop('shipping_limit_date')

print(sl_dev_order_items)

DataFrame[order_id: string, order_item_id: string, product_id: string, seller_id: string, price: string, freight_value: string]


In [114]:
# Recasting of data types from string to int and double

from pyspark.sql.functions import col, round 

sl_dev_order_items = (sl_dev_order_items 
        .withColumn('order_item_id', col('order_item_id').cast('int'))
        .withColumn('price', col('price').cast('double'))
        .withColumn('freight_value', round(col('freight_value'), 2).cast('double'))
)
        
print(sl_dev_order_items)

DataFrame[order_id: string, order_item_id: int, product_id: string, seller_id: string, price: double, freight_value: double]


In [115]:
# Load data from the REVIEWS table

sl_dev_order_reviews = spark.read.table('br_reviews')

sl_dev_order_reviews.show(5, truncate=False, vertical=True)   # show() prints by itself and returns None

-RECORD 0-----------------------------------------------------------------------------------------------------------------------
 review_id               | 7bc2406110b926393aa56f80a40eba40                                                                     
 order_id                | 73fc7af87114b39712e6da79b0a377eb                                                                     
 review_score            | 4                                                                                                    
 review_comment_title    | NULL                                                                                                 
 review_comment_message  | NULL                                                                                                 
 review_creation_date    | 2018-01-18 00:00:00                                                                                  
 review_answer_timestamp | 2018-01-18 21:46:59                                                   

In [116]:
# Keep only 'order_id' and 'review_score' in REVIEWS table and drop all other columns

sl_dev_order_reviews = sl_dev_order_reviews.drop('review_id',
                                                 'review_comment_title',
                                                 'review_comment_message',
                                                 'review_creation_date',
                                                 'review_answer_timestamp')

print(sl_dev_order_reviews)

DataFrame[order_id: string, review_score: string]


In [117]:
# Recasting of data type in REVIEWS table

from pyspark.sql.functions import col

sl_dev_order_reviews = sl_dev_order_reviews.withColumn('review_score', col('review_score').cast('int'))

print(sl_dev_order_reviews)

DataFrame[order_id: string, review_score: int]


In [118]:
# Number of rows and columns in REVIEWS table 

rows = sl_dev_order_reviews.count()
columns = len(sl_dev_order_reviews.columns)

print("rows:", rows, "columns:", columns) 


rows: 99224 columns: 2


In [119]:
# Check for null values in REVIEWS table (rows containing nulls are filtered out)

from pyspark.sql.functions import col

sl_dev_order_reviews = sl_dev_order_reviews.where(
    col('order_id').isNotNull() &
    col('review_score').isNotNull()
)

print(sl_dev_order_reviews.count())


99224


In [120]:
# Check for duplication in 'order_id' in REVIEWS table 

from pyspark.sql.functions import countDistinct

total_rows = sl_dev_order_reviews.count()

distinct_order_id = sl_dev_order_reviews.select(countDistinct('order_id')).collect()[0][0]

print("total_rows:", total_rows, "distinct_order_id:", distinct_order_id)

total_rows: 99224 distinct_order_id: 98673
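
The gap between the two numbers (99224 - 98673 = 551) shows that `order_id` is not unique in the reviews table, but not which orders are affected. A possible way to list them is sketched below. The helper name `duplicated_keys` is hypothetical, and the notebook itself does not decide how these rows should be resolved.

```python
def duplicated_keys(df, key):
    """Return the key values that occur more than once, with their row counts."""
    return (
        df.groupBy(key)
          .count()                           # adds a column named "count"
          .filter("count > 1")               # keep only repeated keys
          .orderBy("count", ascending=False)
    )
```

Calling `duplicated_keys(sl_dev_order_reviews, 'order_id').show(10)` would display the most repeated orders first, which helps when checking the table against the grain required by the Silver contract.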


In [1]:
# Cleaned output of 4 Silver tables: sl_dev_sellers, sl_dev_orders, sl_dev_order_items, sl_dev_order_reviews

# Output of sl_dev_sellers 
print("Sellers Table Schema")
sl_dev_sellers.printSchema()
print("Total rows:", sl_dev_sellers.count())
print("Total columns:", len(sl_dev_sellers.columns)) 

print("\n")

# Output of sl_dev_orders
print("Orders Table Schema")
sl_dev_orders.printSchema()
print("Total rows:", sl_dev_orders.count())
print("Total columns:", len(sl_dev_orders.columns))

print("\n")

# Output for sl_dev_order_items
print("Order Items Table")
sl_dev_order_items.printSchema()
print('Total rows:', sl_dev_order_items.count())
print('Total columns:', len(sl_dev_order_items.columns))

print("\n")

# Output for sl_dev_order_reviews
print("Order Reviews Table")
sl_dev_order_reviews.printSchema()
print('Total rows:', sl_dev_order_reviews.count())
print('Total columns:', len(sl_dev_order_reviews.columns))

Sellers Table Schema
root
 |-- seller_id: string (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- seller_state: string (nullable = true)

Total rows: 3095
Total columns: 3


Orders Table Schema
root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)



Total rows: 99441
Total columns: 8


Order Items Table
root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)

Total rows: 112650
Total columns: 6


Order Reviews Table
root
 |-- order_id: string (nullable = true)
 |-- review_score: integer (nullable = true)

Total rows: 99224
Total columns: 2


In [122]:
# Write 4 Silver tables to Lakehouse: sl_dev_sellers, sl_dev_orders, sl_dev_order_items, sl_dev_order_reviews

# sl_dev_sellers
sl_dev_sellers.write.mode('overwrite').format('delta').saveAsTable('sl_dev_sellers')

# sl_dev_orders
sl_dev_orders.write.mode('overwrite').format('delta').saveAsTable('sl_dev_orders')

# sl_dev_order_items
sl_dev_order_items.write.mode('overwrite').format('delta').saveAsTable('sl_dev_order_items')

# sl_dev_order_reviews
sl_dev_order_reviews.write.mode('overwrite').format('delta').saveAsTable('sl_dev_order_reviews')


#### Perform sanity checks on promoted silver tables to ensure data is the same as cleaned silver tables 

In [42]:
# Check that row counts, column counts and schemas match between each promoted sl_* table and its sl_dev_* source

# List of (sl_table, dev_table, name) tuples
table_comparison = [
    ('sl_sellers', 'sl_dev_sellers', 'Seller'),
    ('sl_orders', 'sl_dev_orders', 'Orders'),
    ('sl_order_items', 'sl_dev_order_items', 'Order Items'),
    ('sl_order_reviews', 'sl_dev_order_reviews', 'Order Reviews')
]

for sl_table, dev_table, name in table_comparison:
    df_silver = spark.read.table(sl_table)     # Load tables 
    df_dev = spark.read.table(dev_table)

    rows_silver = df_silver.count()            # Getting rows and columns for comparison
    columns_silver = len(df_silver.columns)
    rows_dev = df_dev.count()
    columns_dev = len(df_dev.columns)

    print(f"Silver {name} Table")
    print(f"{rows_silver}, {columns_silver}")
    print(f"Silver Dev {name} Table")
    print(f"{rows_dev}, {columns_dev}")

    # Schema comparison
    if df_silver.schema != df_dev.schema:
        print(f"⚠️ Schema mismatch between {sl_table} and {dev_table}")
    else:
        print(f"✅ Schema Match between {sl_table} and {dev_table}")

    print("\n")
 



Silver Seller Table
3095, 3
Silver Dev Seller Table
3095, 3
✅ Schema Match between sl_sellers and sl_dev_sellers


Silver Orders Table
99441, 8
Silver Dev Orders Table
99441, 8
✅ Schema Match between sl_orders and sl_dev_orders


Silver Order Items Table
112650, 6
Silver Dev Order Items Table
112650, 6
✅ Schema Match between sl_order_items and sl_dev_order_items


Silver Order Reviews Table
99224, 2
Silver Dev Order Reviews Table
99224, 2
✅ Schema Match between sl_order_reviews and sl_dev_order_reviews
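
Matching row counts and schemas do not by themselves show that the rows are identical. A stricter check, sketched here as a suggestion rather than as part of the agreed promotion process, compares the contents in both directions with `DataFrame.exceptAll`, which keeps duplicate rows in the comparison. The helper name `tables_match` is hypothetical.

```python
def tables_match(df_a, df_b):
    """Return True when both Spark DataFrames hold the same rows, duplicates included."""
    only_in_a = df_a.exceptAll(df_b).count()   # rows in A with no counterpart in B
    only_in_b = df_b.exceptAll(df_a).count()   # rows in B with no counterpart in A
    return only_in_a == 0 and only_in_b == 0
```

For example, `tables_match(spark.read.table('sl_orders'), spark.read.table('sl_dev_orders'))` should be `True` immediately after promotion. The comparison scans both tables in full, so it is better suited to a one-off validation than to a frequently scheduled job.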


