# Delta Table Filter

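This notebook reads data from the **Bronze Delta Lake tables** and writes it to the Silver layer in Delta format. The processing loop currently copies each table as-is; any filtering belongs between the read and the write.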

**Important**: Ensure that the filtering logic aligns with the business requirements for data transformation.
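
One way to keep filtering rules visible and reviewable is to declare them per table as Spark SQL expressions. The sketch below is only an illustration: `TABLE_FILTERS`, `apply_table_filter`, and every column name in it are hypothetical and would need to be replaced with rules that match the real schema and business requirements.

```python
# Hypothetical per-table filter rules, written as Spark SQL expressions.
# The column names are illustrative assumptions, not taken from the real schema.
TABLE_FILTERS = {
    "Customers": "CustomerID IS NOT NULL",
    "Orders": "OrderID IS NOT NULL AND TotalAmount >= 0",
}


def apply_table_filter(df, table, filters=TABLE_FILTERS):
    """Return df filtered by the rule for `table`, or df unchanged if no rule exists."""
    condition = filters.get(table)
    if condition is None:
        return df
    # DataFrame.filter accepts a SQL expression string as well as a Column
    return df.filter(condition)
```

In the processing loop, a call such as `df = apply_table_filter(df, table)` could sit between the read and the write, so tables without a rule are still copied unchanged.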

In [None]:
import os

# JARs required for ABFS (Azure Data Lake Storage Gen2) access.
# --jars takes one comma-separated list with no spaces between entries.
# This variable must be set before the SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--jars /path/to/your/jars/hadoop-common-3.3.6.jar,"
    "/path/to/your/jars/hadoop-azure-3.3.6.jar,"
    "/path/to/your/jars/azure-storage-8.6.6.jar,"
    "/path/to/your/jars/jetty-client-9.4.43.v20210629.jar,"
    "/path/to/your/jars/jetty-http-9.4.43.v20210629.jar,"
    "/path/to/your/jars/jetty-io-9.4.43.v20210629.jar,"
    "/path/to/your/jars/jetty-util-9.4.43.v20210629.jar,"
    "/path/to/your/jars/jetty-util-ajax-9.4.43.v20210629.jar "
    "--packages io.delta:delta-spark_2.12:3.0.0 "
    "pyspark-shell"
)
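
The hand-written string above is easy to break with a stray space or a missing comma. As a possible alternative, the same value could be assembled from a list of JAR file names. `build_submit_args` and `JAR_NAMES` are hypothetical names introduced for this sketch; the directory is the same placeholder used in the cell above.

```python
import os


def build_submit_args(jar_dir, jar_names, packages):
    """Assemble a PYSPARK_SUBMIT_ARGS string from a JAR directory and file names."""
    base = jar_dir.rstrip("/")
    # --jars expects a comma-separated list without spaces
    jars = ",".join(f"{base}/{name}" for name in jar_names)
    return f"--jars {jars} --packages {','.join(packages)} pyspark-shell"


JAR_NAMES = [
    "hadoop-common-3.3.6.jar",
    "hadoop-azure-3.3.6.jar",
    "azure-storage-8.6.6.jar",
    "jetty-client-9.4.43.v20210629.jar",
    "jetty-http-9.4.43.v20210629.jar",
    "jetty-io-9.4.43.v20210629.jar",
    "jetty-util-9.4.43.v20210629.jar",
    "jetty-util-ajax-9.4.43.v20210629.jar",
]

submit_args = build_submit_args(
    "/path/to/your/jars", JAR_NAMES, ["io.delta:delta-spark_2.12:3.0.0"]
)
# Must be set before the SparkSession is created
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```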

In [2]:
# Init spark session with Delta Lake support
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("Bronze to Silver: Delta Table Filter")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.schema.autoMerge.enabled", "true")  # Allow schema evolution
    .config("spark.sql.parquet.enableVectorizedReader", "false")  # Disable vectorized reader to avoid type conflicts
    .getOrCreate())

print("✅ Spark session with Delta Lake support initialized")

25/07/22 16:41:32 WARN Utils: Your hostname, lenovo-slim resolves to a loopback address: 127.0.1.1; using 192.168.199.13 instead (on interface wlp2s0)
25/07/22 16:41:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/bnguyen/Desktop/DE_project/venv/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/bnguyen/.ivy2/cache
The jars for the packages stored in: /home/bnguyen/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d351d405-16bf-4855-9465-707005b762c9;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.0.0 in central
	found io.delta#delta-storage;3.0.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 134ms :: artifacts dl 4ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.0.0 from central in [default]
	io.delta#delta-storage;3.0.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   

✅ Spark session with Delta Lake support initialized


In [None]:
# Configure Azure storage access keys for bronze and silver

# Bronze access key
spark.conf.set(
    "fs.azure.account.key.YOUR_BRONZE_STORAGE_ACCOUNT.dfs.core.windows.net",
    "YOUR_AZURE_BRONZE_STORAGE_KEY"
)

# Silver access key
spark.conf.set(
    "fs.azure.account.key.YOUR_SILVER_STORAGE_ACCOUNT.dfs.core.windows.net",
    "YOUR_AZURE_SILVER_STORAGE_KEY"
)

print("✅ Azure storage credentials configured")

✅ Azure storage credentials configured
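
Pasting real keys in place of the placeholders above risks committing them with the notebook. A minimal sketch of one alternative is to read each key from an environment variable. `account_key_conf` and the environment variable name in the usage comment are hypothetical, introduced only for this example.

```python
import os


def account_key_conf(storage_account, env_var, environ=os.environ):
    """Return the (conf_key, secret) pair for a storage account,
    reading the secret from an environment variable."""
    secret = environ.get(env_var)
    if not secret:
        raise KeyError(f"Environment variable {env_var} is not set")
    conf_key = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    return conf_key, secret


# Usage with the existing SparkSession (the variable name is an assumption):
# key, secret = account_key_conf("YOUR_BRONZE_STORAGE_ACCOUNT", "AZURE_BRONZE_STORAGE_KEY")
# spark.conf.set(key, secret)
```

Failing early on a missing variable gives a clearer error than an authentication failure on the first table read.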


In [None]:
# Define storage accounts and containers for Delta Lake
storage_account_bronze = "YOUR_BRONZE_STORAGE_ACCOUNT"
bronze_delta_container = "bronze-delta"  # Source: Existing Delta tables
storage_account_silver = "YOUR_SILVER_STORAGE_ACCOUNT"
silver_delta_container = "silver-delta"  # Target: Delta format container

tables = ["Customers", "Products", "ProductCategories", "Sellers", "Addresses", 
          "Inventory", "ShoppingCarts", "CartItems", "Orders", "OrderItems", 
          "Payments", "PaymentMethods", "OrderStatus", "Reviews", "Reasons"]

print(f" Processing {len(tables)} tables in Delta Lake format")
print(f" Source: {bronze_delta_container} container (Delta format)")
print(f" Target: {silver_delta_container} container (Delta format)")

 Processing 9 tables in Delta Lake format
 Source: bronze-delta container (Delta format)
 Target: silver-delta container (Delta format)
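
Each table is addressed by an `abfss://` URI made of the container, the storage account, and the table name, which the processing loop builds inline for both layers. A small helper could keep that format in one place; `abfss_path` is a hypothetical name used for this sketch.

```python
def abfss_path(container, storage_account, table):
    """Build the ADLS Gen2 URI for a table folder inside a container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{table}"


# Example with placeholder values
example_path = abfss_path("silver-delta", "YOUR_SILVER_STORAGE_ACCOUNT", "Customers")
```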


In [None]:
# Process all tables: Copy from bronze-delta to silver-delta (Both Delta format)
# This reads from existing Delta tables and outputs to silver-delta container

print("🚀 Starting Delta Lake processing...\n")

for table in tables:
    bronze_delta_path = f"abfss://{bronze_delta_container}@{storage_account_bronze}.dfs.core.windows.net/{table}"
    silver_delta_path = f"abfss://{silver_delta_container}@{storage_account_silver}.dfs.core.windows.net/{table}"

    print(f"📋 Processing {table}...")
    
    try:
        # Read from bronze-delta using Delta format
        df = spark.read.format("delta").load(bronze_delta_path)
        record_count = df.count()
        
        print(f"   Read {record_count} records from bronze-delta")
        
        # Write to silver-delta using Delta format with overwrite mode
        df.write.format("delta").mode("overwrite").save(silver_delta_path)
        
        print(f"   Written {record_count} records to silver-delta")
        print(f"   Destination: {silver_delta_path}\n")
        
    except Exception as e:
        print(f"   Error processing {table}: {e}\n")

🚀 Starting Delta Lake processing...

📋 Processing Customers...


25/07/22 16:41:55 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

   Read 7000 records from bronze-delta


25/07/22 16:42:00 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

   Written 7000 records to silver-delta
   Destination: abfss://silver-delta@mysilver.dfs.core.windows.net/Customers

📋 Processing Products...
