# 🚀 04. FULL PIPELINE DEMO

## 🎯 OBJECTIVES:
- Run the entire ETL pipeline from start to finish
- Understand the data flow through the 3 layers
- Monitor and track progress

## 📚 CONTENTS:
1. Setup & Preparation
2. Generate Raw Data
3. Run RAW Layer ETL
4. Run STAGING Layer ETL
5. Run PRODUCTION Layer ETL
6. End-to-End Validation
7. Pipeline Summary

In [1]:
import sys
sys.path.append('../scripts')

import pandas as pd
import time
from datetime import datetime, timedelta
from pathlib import Path

from db_connector import DatabaseConnector
from etl_raw import RawLayerETL
from etl_stg import StagingLayerETL
from etl_prod import ProdLayerETL

print("✅ Libraries imported!")

✅ Libraries imported!


## 1. SETUP & PREPARATION

In [2]:
print("🔧 SETUP & PREPARATION")
print("="*70)

# Initialize database connection
db = DatabaseConnector()

# Test connection
result = db.read_sql("SELECT current_database(), version()")
print(f"\n✅ Connected to: {result['current_database'][0]}")
print(f"PostgreSQL version: {result['version'][0][:50]}...")

2025-12-20 09:31:11,570 - db_connector - INFO - Database connector initialized for data_engineer@postgres
2025-12-20 09:31:11,585 - db_connector - INFO - Query executed, DataFrame shape: (1, 2)


🔧 SETUP & PREPARATION

✅ Connected to: data_engineer
PostgreSQL version: PostgreSQL 15.15 on x86_64-pc-linux-musl, compiled...


In [3]:
# Check current state of all layers
print("\n📊 Current State of All Layers:")
print("-" * 70)

layers = {
    'RAW': ['customers', 'products', 'orders', 'order_items'],
    'STAGING': ['customers', 'products', 'orders', 'order_items'],
    'PROD': ['daily_sales', 'monthly_sales', 'daily_category_metrics', 
             'daily_product_metrics', 'customer_metrics']
}

for layer, tables in layers.items():
    print(f"\n{layer} Layer:")
    for table in tables:
        try:
            count_query = f"SELECT COUNT(*) as count FROM {layer.lower()}.{table}"
            count = db.read_sql(count_query)['count'][0]
            print(f"  {table:.<35} {count:>10,} rows")
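        except Exception as e:
            # Show the exception type so a missing table is distinguishable from a connection error
            print(f"  {table:.<35} {'ERROR':>10} ({type(e).__name__})")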

2025-12-20 09:31:11,635 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,645 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,664 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)



📊 Current State of All Layers:
----------------------------------------------------------------------

RAW Layer:
  customers..........................     11,116 rows
  products...........................     36,500 rows
  orders.............................    119,726 rows


2025-12-20 09:31:11,700 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,705 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,710 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,721 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,741 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,745 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,749 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,753 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,762 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:11,768 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


  order_items........................    360,327 rows

STAGING Layer:
  customers..........................     10,680 rows
  products...........................        100 rows
  orders.............................    110,791 rows
  order_items........................    333,217 rows

PROD Layer:
  daily_sales........................        364 rows
  monthly_sales......................         12 rows
  daily_category_metrics.............      2,548 rows
  daily_product_metrics..............     35,837 rows
  customer_metrics...................     10,680 rows
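
The cell above builds each `COUNT(*)` query by interpolating the layer and table name into an f-string. That is fine for the hard-coded lists used here, but it becomes risky if the names ever come from user input or configuration. A minimal sketch of a guard, assuming a hypothetical helper `build_count_query` (not part of `db_connector`), could look like this:

```python
import re

# Plain SQL identifiers only: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_count_query(schema: str, table: str) -> str:
    """Return a COUNT(*) query after checking both names are plain identifiers."""
    for name in (schema, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT COUNT(*) AS count FROM {schema}.{table}"


print(build_count_query("raw", "customers"))
```

The returned string can be passed to `db.read_sql(...)` exactly as in the loop above.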


## 2. GENERATE RAW DATA (Optional)

In [4]:
print("📁 CHECK RAW DATA FILES")
print("="*70)

raw_data_dir = Path('../raw_data')

if raw_data_dir.exists():
    print("\n✅ Raw data directory exists")
    
    for entity in ['customers', 'products', 'orders', 'order_items']:
        entity_dir = raw_data_dir / entity
        if entity_dir.exists():
            partitions = sorted([d.name for d in entity_dir.iterdir() if d.is_dir()])
            print(f"\n{entity}:")
            print(f"  Partitions: {len(partitions)}")
            if partitions:
                print(f"  Range: {partitions[0]} → {partitions[-1]}")
else:
    print("\n⚠️ Raw data not found!")
    print("\nTo generate raw data, run:")
    print("  python scripts/generate_raw_data.py --test-mode")
    print("\nOr for full data:")
    print("  python scripts/generate_raw_data.py --start-date 2025-01-01 --end-date 2025-01-31")

📁 CHECK RAW DATA FILES

✅ Raw data directory exists

customers:
  Partitions: 365
  Range: 2025-01-01 → 2025-12-31

products:
  Partitions: 365
  Range: 2025-01-01 → 2025-12-31

orders:
  Partitions: 364
  Range: 2025-01-02 → 2025-12-31

order_items:
  Partitions: 364
  Range: 2025-01-02 → 2025-12-31
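
The listing shows 365 partitions for `customers` and `products` but 364 for `orders` and `order_items`, whose range starts on 2025-01-02. Printing only the count and the first and last names would not reveal a gap in the middle of the range. The sketch below, assuming partition directories are named with ISO dates as the output suggests, lists every date without a partition. `missing_partitions` is a hypothetical helper, not something the project scripts provide.

```python
from datetime import date, timedelta


def missing_partitions(partitions, start, end):
    """Return ISO dates in [start, end] that have no partition directory."""
    present = set(partitions)
    missing = []
    current = start
    while current <= end:
        if current.isoformat() not in present:
            missing.append(current.isoformat())
        current += timedelta(days=1)
    return missing


# Hypothetical partition names for a short range.
orders_partitions = ["2025-01-02", "2025-01-03", "2025-01-04"]
print(missing_partitions(orders_partitions, date(2025, 1, 1), date(2025, 1, 4)))
```

In the notebook, the `partitions` list built in the cell above could be passed in directly.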


## 3. RUN RAW LAYER ETL

In [5]:
print("🔵 STEP 1: RAW LAYER ETL")
print("="*70)

etl_raw = RawLayerETL(db)

# Track start time
start_time = time.time()

# Ingest all tables
raw_results = {}
tables = ['customers', 'products', 'orders', 'order_items']

for table in tables:
    print(f"\n📥 Ingesting {table}...")
    result = etl_raw.ingest_table(table, incremental=True)
    raw_results[table] = result
    print(f"  ✅ Partitions: {result['partitions_processed']}, Rows: {result['total_rows']:,}")

# Calculate elapsed time
elapsed = time.time() - start_time
print(f"\n⏱️ RAW Layer ETL completed in {elapsed:.2f} seconds")

🔵 STEP 1: RAW LAYER ETL

📥 Ingesting customers...


2025-12-20 09:31:15,450 - db_connector - INFO - Query executed, DataFrame shape: (365, 1)
2025-12-20 09:31:15,451 - etl_raw - INFO - Incremental mode: 0 new partitions to ingest


  ✅ Partitions: 0, Rows: 0

📥 Ingesting products...


2025-12-20 09:31:16,632 - db_connector - INFO - Query executed, DataFrame shape: (365, 1)
2025-12-20 09:31:16,635 - etl_raw - INFO - Incremental mode: 0 new partitions to ingest


  ✅ Partitions: 0, Rows: 0

📥 Ingesting orders...


2025-12-20 09:31:18,286 - db_connector - INFO - Query executed, DataFrame shape: (364, 1)
2025-12-20 09:31:18,288 - etl_raw - INFO - Incremental mode: 0 new partitions to ingest


  ✅ Partitions: 0, Rows: 0

📥 Ingesting order_items...


2025-12-20 09:31:19,869 - db_connector - INFO - Query executed, DataFrame shape: (364, 1)
2025-12-20 09:31:19,871 - etl_raw - INFO - Incremental mode: 0 new partitions to ingest


  ✅ Partitions: 0, Rows: 0

⏱️ RAW Layer ETL completed in 5.63 seconds
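
Every table reports 0 partitions and 0 rows because the run is incremental and all partitions were already loaded. The log line "Incremental mode: 0 new partitions to ingest" follows a query that returned one row per known partition. One plausible way to pick the work to do is a set difference between what is on disk and what is already recorded. This is only a sketch of that idea. `select_new_partitions` is a hypothetical name, and the actual logic inside `RawLayerETL.ingest_table` is not shown in this notebook.

```python
def select_new_partitions(available, ingested):
    """Return partitions present on disk but not yet loaded, in sorted order."""
    return sorted(set(available) - set(ingested))


on_disk = ["2025-01-01", "2025-01-02", "2025-01-03"]
already_loaded = ["2025-01-01", "2025-01-02"]
print(select_new_partitions(on_disk, already_loaded))
```

When both lists are identical, as in the run above, the result is empty and nothing is ingested.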


In [6]:
# Verify RAW layer
print("\n🔍 Verify RAW Layer:")
print("-" * 70)

for table in tables:
    count_query = f"SELECT COUNT(*) as count FROM raw.{table}"
    count = db.read_sql(count_query)['count'][0]
    print(f"  {table:.<35} {count:>10,} rows")

2025-12-20 09:31:19,887 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:19,894 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:19,906 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:19,926 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)



🔍 Verify RAW Layer:
----------------------------------------------------------------------
  customers..........................     11,116 rows
  products...........................     36,500 rows
  orders.............................    119,726 rows
  order_items........................    360,327 rows


## 4. RUN STAGING LAYER ETL

In [7]:
print("\n🟢 STEP 2: STAGING LAYER ETL")
print("="*70)

etl_stg = StagingLayerETL(db)

# Track start time
start_time = time.time()

# Transform all tables
stg_results = etl_stg.transform_all()

# Calculate elapsed time
elapsed = time.time() - start_time

print(f"\n✅ STAGING Layer ETL Results:")
for table, result in stg_results.items():
    print(f"\n{table}:")
    print(f"  Rows loaded: {result['rows']:,}")
    if 'dups_removed' in result:
        print(f"  Duplicates removed: {result['dups_removed']}")
    if 'nulls_removed' in result:
        print(f"  Nulls removed: {result['nulls_removed']}")

print(f"\n⏱️ STAGING Layer ETL completed in {elapsed:.2f} seconds")

2025-12-20 09:31:19,941 - etl_stg - INFO - ETL STAGING LAYER - STARTING
2025-12-20 09:31:19,943 - etl_stg - INFO - Transforming customers...
2025-12-20 09:31:20,024 - db_connector - INFO - Query executed, DataFrame shape: (11116, 6)



🟢 STEP 2: STAGING LAYER ETL


2025-12-20 09:31:20,122 - db_connector - INFO - Query executed successfully
2025-12-20 09:31:20,122 - db_connector - INFO - Truncated: staging.customers
2025-12-20 09:31:20,705 - db_connector - INFO - Written 10680 rows to staging.customers
2025-12-20 09:31:20,705 - etl_stg - INFO - Customers: 11116 raw -> 10680 stg
2025-12-20 09:31:20,706 - etl_stg - INFO -   Duplicates removed: 0
2025-12-20 09:31:20,706 - etl_stg - INFO -   Invalid emails removed: 224
2025-12-20 09:31:20,707 - etl_stg - INFO -   Nulls removed: 436
2025-12-20 09:31:20,709 - etl_stg - INFO - Transforming products...
2025-12-20 09:31:20,755 - db_connector - INFO - Query executed, DataFrame shape: (100, 5)
2025-12-20 09:31:20,774 - db_connector - INFO - Query executed successfully
2025-12-20 09:31:20,775 - db_connector - INFO - Truncated: staging.products
2025-12-20 09:31:20,787 - db_connector - INFO - Written 100 rows to staging.products
2025-12-20 09:31:20,788 - etl_stg - INFO - Products: 100 raw -> 100 stg
2025-12-20 


✅ STAGING Layer ETL Results:

customers:
  Rows loaded: 10,680
  Duplicates removed: 0
  Nulls removed: 436

products:
  Rows loaded: 100

orders:
  Rows loaded: 110,791

order_items:
  Rows loaded: 333,217

⏱️ STAGING Layer ETL completed in 37.56 seconds
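
The staging log reports three kinds of removal for customers: nulls, invalid emails, and duplicates. The following standard-library sketch illustrates one way such a cleaning step can count each kind separately. It is not the `StagingLayerETL` implementation: the `customer_id` column, the email pattern, and the `invalid_emails_removed` key are assumptions made for illustration, while `dups_removed` and `nulls_removed` mirror the result keys printed above.

```python
import re

# Deliberately simple pattern; the real validation rule is not shown in the notebook.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean_customers(rows):
    """Drop rows with a missing email, an invalid email, or a repeated customer_id."""
    stats = {"nulls_removed": 0, "invalid_emails_removed": 0, "dups_removed": 0}
    seen_ids = set()
    cleaned = []
    for row in rows:
        email = row.get("email")
        if email is None:
            stats["nulls_removed"] += 1
        elif not EMAIL_PATTERN.fullmatch(email):
            stats["invalid_emails_removed"] += 1
        elif row["customer_id"] in seen_ids:
            stats["dups_removed"] += 1
        else:
            seen_ids.add(row["customer_id"])
            cleaned.append(row)
    return cleaned, stats


# Made-up sample rows, one per removal category plus one valid row.
sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "not-an-email"},
    {"customer_id": 1, "email": "a@example.com"},
]
cleaned, stats = clean_customers(sample)
print(len(cleaned), stats)
```

Because each row is counted in at most one category, the three counters plus the number of cleaned rows always add up to the input size in this sketch.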


In [8]:
# Verify STAGING layer
print("\n🔍 Verify STAGING Layer:")
print("-" * 70)

for table in tables:
    count_query = f"SELECT COUNT(*) as count FROM staging.{table}"
    count = db.read_sql(count_query)['count'][0]
    print(f"  {table:.<35} {count:>10,} rows")

2025-12-20 09:31:57,513 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:57,519 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:57,529 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:31:57,556 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)



🔍 Verify STAGING Layer:
----------------------------------------------------------------------
  customers..........................     10,680 rows
  products...........................        100 rows
  orders.............................    110,791 rows
  order_items........................    333,217 rows


## 5. RUN PRODUCTION LAYER ETL

In [9]:
print("\n🟡 STEP 3: PRODUCTION LAYER ETL")
print("="*70)

etl_prod = ProdLayerETL(db)

# Track start time
start_time = time.time()

# Build all production tables
prod_results = etl_prod.build_all()

# Calculate elapsed time
elapsed = time.time() - start_time

print(f"\n✅ PRODUCTION Layer ETL Results:")
for table, result in prod_results.items():
    print(f"\n{table}:")
    print(f"  Rows created: {result['rows']:,}")

print(f"\n⏱️ PRODUCTION Layer ETL completed in {elapsed:.2f} seconds")

2025-12-20 09:31:57,570 - etl_prod - INFO - ETL PRODUCTION LAYER - STARTING
2025-12-20 09:31:57,572 - etl_prod - INFO - Building daily_sales...



🟡 STEP 3: PRODUCTION LAYER ETL


2025-12-20 09:31:57,957 - db_connector - INFO - Query executed, DataFrame shape: (364, 5)
2025-12-20 09:31:57,969 - db_connector - INFO - Query executed successfully
2025-12-20 09:31:57,971 - db_connector - INFO - Truncated: prod.daily_sales
2025-12-20 09:31:57,999 - db_connector - INFO - Written 364 rows to prod.daily_sales
2025-12-20 09:31:58,000 - etl_prod - INFO - Daily Sales: 364 days aggregated
2025-12-20 09:31:58,001 - etl_prod - INFO - Building monthly_sales...
2025-12-20 09:31:58,601 - db_connector - INFO - Query executed, DataFrame shape: (12, 7)
2025-12-20 09:31:58,625 - db_connector - INFO - Query executed successfully
2025-12-20 09:31:58,626 - db_connector - INFO - Truncated: prod.monthly_sales
2025-12-20 09:31:58,639 - db_connector - INFO - Written 12 rows to prod.monthly_sales
2025-12-20 09:31:58,639 - etl_prod - INFO - Monthly Sales: 12 months aggregated
2025-12-20 09:31:58,640 - etl_prod - INFO - Building daily_category_metrics...
2025-12-20 09:31:59,114 - db_connector


✅ PRODUCTION Layer ETL Results:

daily_sales:
  Rows created: 364

monthly_sales:
  Rows created: 12

daily_category_metrics:
  Rows created: 2,548

daily_product_metrics:
  Rows created: 35,837

customer_metrics:
  Rows created: 10,680

⏱️ PRODUCTION Layer ETL completed in 5.88 seconds
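
The production step collapses order-level rows into one row per day (364 days) with revenue, order count, and average order value. The sketch below shows that kind of aggregation on made-up data using only the standard library. The output field names follow the `prod.daily_sales` columns queried later in this notebook, and the input shape (date, amount pairs) is an assumption. The real `ProdLayerETL.build_all` logic is not shown here and may differ.

```python
from collections import defaultdict


def aggregate_daily_sales(orders):
    """Group (order_date, amount) pairs into one summary row per day."""
    totals = defaultdict(lambda: {"total_revenue": 0.0, "total_orders": 0})
    for order_date, amount in orders:
        totals[order_date]["total_revenue"] += amount
        totals[order_date]["total_orders"] += 1
    rows = []
    for order_date in sorted(totals):
        day = totals[order_date]
        rows.append({
            "order_date": order_date,
            "total_revenue": round(day["total_revenue"], 2),
            "total_orders": day["total_orders"],
            "avg_order_value": round(day["total_revenue"] / day["total_orders"], 2),
        })
    return rows


# Made-up orders: two on one day, one on the next.
sample_orders = [("2025-01-02", 100.0), ("2025-01-02", 50.0), ("2025-01-03", 30.0)]
daily_rows = aggregate_daily_sales(sample_orders)
for row in daily_rows:
    print(row)
```

Computing `avg_order_value` as revenue divided by order count agrees with the figures shown later in the notebook, for example 1,395,882.75 / 335 ≈ 4,166.81.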


In [10]:
# Verify PRODUCTION layer
print("\n🔍 Verify PRODUCTION Layer:")
print("-" * 70)

prod_tables = ['daily_sales', 'monthly_sales', 'daily_category_metrics', 
               'daily_product_metrics', 'customer_metrics']

for table in prod_tables:
    count_query = f"SELECT COUNT(*) as count FROM prod.{table}"
    count = db.read_sql(count_query)['count'][0]
    print(f"  {table:.<35} {count:>10,} rows")

2025-12-20 09:32:03,453 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,459 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,465 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,475 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,481 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)



🔍 Verify PRODUCTION Layer:
----------------------------------------------------------------------
  daily_sales........................        364 rows
  monthly_sales......................         12 rows
  daily_category_metrics.............      2,548 rows
  daily_product_metrics..............     35,837 rows
  customer_metrics...................     10,680 rows


## 6. END-TO-END VALIDATION

In [11]:
print("\n✅ END-TO-END VALIDATION")
print("="*70)

# Compare row counts across layers
validation_query = """
SELECT 
    'customers' as entity,
    (SELECT COUNT(*) FROM raw.customers) as raw_count,
    (SELECT COUNT(*) FROM staging.customers) as staging_count,
    (SELECT COUNT(*) FROM prod.customer_metrics) as prod_count
UNION ALL
SELECT 
    'products' as entity,
    (SELECT COUNT(*) FROM raw.products) as raw_count,
    (SELECT COUNT(*) FROM staging.products) as staging_count,
    0 as prod_count
UNION ALL
SELECT 
    'orders' as entity,
    (SELECT COUNT(*) FROM raw.orders) as raw_count,
    (SELECT COUNT(*) FROM staging.orders) as staging_count,
    (SELECT COUNT(*) FROM prod.daily_sales) as prod_count
UNION ALL
SELECT 
    'order_items' as entity,
    (SELECT COUNT(*) FROM raw.order_items) as raw_count,
    (SELECT COUNT(*) FROM staging.order_items) as staging_count,
    0 as prod_count
"""

validation_df = db.read_sql(validation_query)
validation_df['data_loss_%'] = ((validation_df['raw_count'] - validation_df['staging_count']) / validation_df['raw_count'] * 100).round(2)

print("\n📊 Row Count Comparison:")
display(validation_df)

2025-12-20 09:32:03,549 - db_connector - INFO - Query executed, DataFrame shape: (4, 4)



✅ END-TO-END VALIDATION

📊 Row Count Comparison:


|   | entity | raw_count | staging_count | prod_count | data_loss_% |
|---|---|---|---|---|---|
| 0 | customers | 11116 | 10680 | 10680 | 3.92 |
| 1 | products | 36500 | 100 | 0 | 99.73 |
| 2 | orders | 119726 | 110791 | 364 | 7.46 |
| 3 | order_items | 360327 | 333217 | 0 | 7.52 |
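
The `data_loss_%` column is `(raw_count - staging_count) / raw_count * 100`, rounded to two decimals. The sketch below reproduces the four values from the counts in the table and adds a guard for an empty raw table, which the pandas expression above does not have. `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(raw_count, staging_count):
    """Percentage of raw rows that did not reach staging, rounded to 2 places."""
    if raw_count == 0:
        return 0.0
    return round((raw_count - staging_count) / raw_count * 100, 2)


# Counts copied from the row count comparison.
counts = {
    "customers": (11116, 10680),
    "products": (36500, 100),
    "orders": (119726, 110791),
    "order_items": (360327, 333217),
}
for entity, (raw, staging) in counts.items():
    print(f"{entity}: {reduction_pct(raw, staging)}")
```

The 99.73% figure for products deserves a careful reading. 36,500 raw rows equals 100 products times 365 partitions, which suggests the raw layer holds one snapshot of the catalog per day and staging keeps a single row per product. That is an inference from the numbers shown here, not something the notebook states, but if it holds, the reduction reflects snapshot consolidation rather than rejected records.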


In [12]:
# Revenue validation
print("\n💰 Revenue Validation:")
print("-" * 70)

revenue_check_query = """
WITH staging_revenue AS (
    SELECT SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as total
    FROM staging.orders o
    JOIN staging.order_items oi ON o.order_id = oi.order_id
    WHERE o.order_status = 'completed'
),
prod_revenue AS (
    SELECT SUM(total_revenue) as total
    FROM prod.daily_sales
)
SELECT 
    s.total as staging_revenue,
    p.total as prod_revenue,
    ABS(s.total - p.total) as difference,
    CASE 
        WHEN ABS(s.total - p.total) < 0.01 THEN '✅ MATCH'
        ELSE '❌ MISMATCH'
    END as status
FROM staging_revenue s, prod_revenue p
"""

revenue_check = db.read_sql(revenue_check_query)
display(revenue_check)


💰 Revenue Validation:
----------------------------------------------------------------------


2025-12-20 09:32:03,662 - db_connector - INFO - Query executed, DataFrame shape: (1, 4)


|   | staging_revenue | prod_revenue | difference | status |
|---|---|---|---|---|
| 0 | 293959300.0 | 293959300.0 | 0.002 | ✅ MATCH |
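
The check recomputes revenue per order item as `quantity * unit_price * (1 - discount_percent/100)` and accepts a difference below 0.01. The sketch below expresses the same formula and tolerance in Python with `Decimal`, which avoids binary floating-point rounding on currency values. The function names are hypothetical. In PostgreSQL, dividing one integer by another truncates, so the SQL expression relies on `discount_percent` being a numeric column. The column type is not shown in this notebook.

```python
from decimal import Decimal


def line_revenue(quantity, unit_price, discount_percent):
    """Revenue for one order item: quantity * unit_price * (1 - discount/100)."""
    return quantity * unit_price * (1 - discount_percent / Decimal(100))


def totals_match(a, b, tolerance=Decimal("0.01")):
    """True when two totals differ by less than the tolerance."""
    return abs(a - b) < tolerance


# Made-up item: 2 units at 10.00 with a 10% discount.
print(line_revenue(2, Decimal("10.00"), Decimal("10")))
print(totals_match(Decimal("293959300.000"), Decimal("293959300.002")))
```

A small nonzero difference such as the 0.002 shown above is typical of summing rounded values in a different order, which is why the comparison uses a tolerance instead of strict equality.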


In [13]:
# Data quality checks
print("\n🔍 Data Quality Checks:")
print("-" * 70)

quality_checks = [
    ("Duplicate emails in staging.customers", 
     "SELECT COUNT(*) - COUNT(DISTINCT email) as duplicates FROM staging.customers"),
    
    ("NULL emails in staging.customers", 
     "SELECT COUNT(*) as nulls FROM staging.customers WHERE email IS NULL"),
    
    ("Invalid order amounts", 
     "SELECT COUNT(*) as invalid FROM staging.orders WHERE total_amount < 0"),
    
    ("Orphaned order_items", 
     """SELECT COUNT(*) as orphans FROM staging.order_items oi 
        WHERE NOT EXISTS (SELECT 1 FROM staging.orders o WHERE o.order_id = oi.order_id)""")
]

for check_name, query in quality_checks:
    result = db.read_sql(query)
    value = result.iloc[0, 0]
    status = "✅ PASS" if value == 0 else f"❌ FAIL ({value})"
    print(f"  {check_name:.<50} {status}")


🔍 Data Quality Checks:
----------------------------------------------------------------------


2025-12-20 09:32:03,704 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,708 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 09:32:03,728 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


  Duplicate emails in staging.customers............. ✅ PASS
  NULL emails in staging.customers.................. ✅ PASS
  Invalid order amounts............................. ✅ PASS


2025-12-20 09:32:03,778 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


  Orphaned order_items.............................. ✅ PASS
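
Each check is a query that returns a violation count, and zero means pass. The loop above only prints the status. For an automated run, it can be useful to collect the failures so the caller can stop the pipeline. The sketch below demonstrates this with an in-memory SQLite database standing in for PostgreSQL. The tables, their columns, and the sample rows are made up, and `run_checks` is a hypothetical helper.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("CREATE TABLE staging.orders (order_id INTEGER)")
conn.execute("CREATE TABLE staging.order_items (item_id INTEGER, order_id INTEGER)")
conn.executemany("INSERT INTO staging.orders VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO staging.order_items VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 99)],  # item 12 points at a missing order
)

ORPHAN_CHECK = """
SELECT COUNT(*) FROM staging.order_items oi
WHERE NOT EXISTS (SELECT 1 FROM staging.orders o WHERE o.order_id = oi.order_id)
"""


def run_checks(connection, checks):
    """Run each (name, query) pair and return {name: violation_count} for failures."""
    failures = {}
    for name, query in checks:
        value = connection.execute(query).fetchone()[0]
        if value != 0:
            failures[name] = value
    return failures


failures = run_checks(conn, [("Orphaned order_items", ORPHAN_CHECK)])
print(failures)
```

An empty dictionary means every check passed. A caller could raise an exception when it is non-empty to fail the run early.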


## 7. PIPELINE SUMMARY

In [14]:
print("\n" + "="*70)
print("📊 PIPELINE EXECUTION SUMMARY")
print("="*70)

summary_query = """
SELECT 
    'RAW' as layer,
    (SELECT COUNT(*) FROM raw.customers) as customers,
    (SELECT COUNT(*) FROM raw.products) as products,
    (SELECT COUNT(*) FROM raw.orders) as orders,
    (SELECT COUNT(*) FROM raw.order_items) as order_items
UNION ALL
SELECT 
    'STAGING' as layer,
    (SELECT COUNT(*) FROM staging.customers) as customers,
    (SELECT COUNT(*) FROM staging.products) as products,
    (SELECT COUNT(*) FROM staging.orders) as orders,
    (SELECT COUNT(*) FROM staging.order_items) as order_items
UNION ALL
SELECT 
    'PROD' as layer,
    (SELECT COUNT(*) FROM prod.customer_metrics) as customers,
    0 as products,
    (SELECT COUNT(*) FROM prod.daily_sales) as orders,
    0 as order_items
"""

summary_df = db.read_sql(summary_query)
display(summary_df)

2025-12-20 09:32:03,841 - db_connector - INFO - Query executed, DataFrame shape: (3, 5)



📊 PIPELINE EXECUTION SUMMARY


|   | layer | customers | products | orders | order_items |
|---|---|---|---|---|---|
| 0 | RAW | 11116 | 36500 | 119726 | 360327 |
| 1 | STAGING | 10680 | 100 | 110791 | 333217 |
| 2 | PROD | 10680 | 0 | 364 | 0 |


In [15]:
# Sample queries from production
print("\n📈 Sample Business Metrics:")
print("-" * 70)

# Top 5 days by revenue
print("\n🏆 Top 5 Days by Revenue:")
top_days_query = """
SELECT order_date, total_revenue, total_orders, avg_order_value
FROM prod.daily_sales
ORDER BY total_revenue DESC
LIMIT 5
"""
display(db.read_sql(top_days_query))

2025-12-20 09:32:03,859 - db_connector - INFO - Query executed, DataFrame shape: (5, 4)



📈 Sample Business Metrics:
----------------------------------------------------------------------

🏆 Top 5 Days by Revenue:


|   | order_date | total_revenue | total_orders | avg_order_value |
|---|---|---|---|---|
| 0 | 2025-12-23 | 1395882.75 | 335 | 4166.81 |
| 1 | 2025-11-21 | 1337894.02 | 336 | 3981.83 |
| 2 | 2025-02-24 | 1337375.75 | 335 | 3992.17 |
| 3 | 2025-04-12 | 1324190.84 | 337 | 3929.35 |
| 4 | 2025-11-18 | 1321001.49 | 351 | 3763.54 |


In [16]:
# Top 5 customers by lifetime value
print("\n👥 Top 5 Customers by Lifetime Value:")
top_customers_query = """
SELECT customer_name, total_orders, total_revenue, avg_order_value
FROM prod.customer_metrics
ORDER BY total_revenue DESC
LIMIT 5
"""
display(db.read_sql(top_customers_query))

2025-12-20 09:32:03,886 - db_connector - INFO - Query executed, DataFrame shape: (5, 4)



👥 Top 5 Customers by Lifetime Value:


|   | customer_name | total_orders | total_revenue | avg_order_value |
|---|---|---|---|---|
| 0 | Belinda Mccullough | 51 | 211259.92 | 4142.35 |
| 1 | James Gilbert | 45 | 202498.59 | 4499.97 |
| 2 | Michelle Harris | 43 | 201554.25 | 4687.31 |
| 3 | Sydney White | 45 | 198595.12 | 4413.22 |
| 4 | Kenneth Parks Jr. | 46 | 189273.0 | 4114.63 |


In [17]:
# Category performance
print("\n📦 Category Performance:")
category_query = """
SELECT category, SUM(total_revenue) as revenue, SUM(total_orders) as orders
FROM prod.daily_category_metrics
GROUP BY category
ORDER BY revenue DESC
"""
display(db.read_sql(category_query))


📦 Category Performance:


2025-12-20 09:32:03,907 - db_connector - INFO - Query executed, DataFrame shape: (7, 3)


|   | category | revenue | orders |
|---|---|---|---|
| 0 | Electronics | 66200428.22 | 37325 |
| 1 | Home | 51824647.91 | 34480 |
| 2 | Clothing | 41728361.38 | 25528 |
| 3 | Sports | 37776215.19 | 27174 |
| 4 | Books | 37711151.29 | 27253 |
| 5 | Toys | 29410999.99 | 18687 |
| 6 | Food | 29307482.58 | 20565 |


# 🎓 KEY TAKEAWAYS

## ✅ Pipeline Flow:
```
Parquet Files → RAW Layer → STAGING Layer → PRODUCTION Layer
     ↓              ↓              ↓                ↓
  Raw data    Immutable      Cleaned         Aggregated
              + Metadata    + Validated      + Metrics
```

## 📊 Data Transformation:
- **RAW**: Append-only, with metadata tracking
- **STAGING**: Cleaned, deduplicated, validated
- **PRODUCTION**: Aggregated, business-ready metrics

## 🔄 Best Practices:
1. Always validate data at each layer
2. Track row counts and data loss
3. Verify revenue and key metrics
4. Monitor execution time
5. Check data quality constraints
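
For the "monitor execution time" practice, the notebook repeats the same `start_time = time.time()` and `elapsed = ...` pattern in each layer's cell. A small context manager can capture that once. This is a sketch with a hypothetical `timed` helper, and the `time.sleep` call is only a placeholder for a real ETL step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are timed too.
        results[label] = time.perf_counter() - start


timings = {}
with timed("RAW", timings):
    time.sleep(0.01)  # placeholder for etl_raw.ingest_table(...)
print(f"RAW Layer ETL completed in {timings['RAW']:.2f} seconds")
```

Collecting all durations in one dictionary makes it easy to print a per-layer timing summary at the end of the run.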

## 🔄 Next Steps:
- Open `05_data_quality_checks.ipynb` for detailed quality validation
- Open `06_troubleshooting_guide.ipynb` if you encounter issues

In [18]:
print("\n" + "="*70)
print("🎉 FULL PIPELINE EXECUTION COMPLETE!")
print("="*70)
print("\n✅ All layers processed successfully")
print("✅ Data quality validated")
print("✅ Business metrics ready for analysis")


🎉 FULL PIPELINE EXECUTION COMPLETE!

✅ All layers processed successfully
✅ Data quality validated
✅ Business metrics ready for analysis
