# 🏁 00. SETUP & OVERVIEW - ETL Pipeline Architecture

## 🎯 OBJECTIVES:
- Understand the 3-layer ETL architecture (Raw → Staging → Production)
- Verify the environment and the database connection
- Get an overview of the data flow

## 📚 CONTENTS:
1. 3-layer ETL architecture
2. Database connection check
3. Data overview
4. Overall workflow

# 🏗️ 3-LAYER ETL ARCHITECTURE

```
raw_data/          →    raw schema      →    staging schema    →    prod schema
(Parquet files)         (Immutable)          (Cleaned)              (Aggregated)

customers/              raw.customers   →    staging.customers →    prod.customer_metrics
products/               raw.products    →    staging.products
orders/                 raw.orders      →    staging.orders    →    prod.daily_sales
order_items/            raw.order_items →    staging.order_items    prod.monthly_sales
                                                                     prod.daily_category_metrics
                                                                     prod.daily_product_metrics
```

## 📋 THE LAYERS:

### 1️⃣ **RAW Layer** (Bronze)
- **Purpose**: Store the original data, unchanged
- **Characteristics**:
  - Contains duplicates, nulls, and format errors
  - Has metadata columns (`_ingested_at`, `_source_file`, `_partition_date`)
  - Immutable (never edited, append-only)
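
As a rough illustration of the metadata columns listed above, the sketch below tags each incoming record at ingestion time. It is not the project's ingestion code: the function name `tag_raw_records`, the file name `part-0.parquet`, and the assumption that the partition date is the name of the file's parent directory are all hypothetical.

```python
from datetime import date, datetime, timezone
from pathlib import PurePosixPath


def tag_raw_records(records, source_file):
    """Return copies of `records` with the three raw-layer metadata columns added.

    Assumes a hypothetical layout <entity>/<YYYY-MM-DD>/<file>, so the
    partition date is read from the name of the file's parent directory.
    """
    path = PurePosixPath(source_file)
    partition_date = date.fromisoformat(path.parent.name)
    ingested_at = datetime.now(timezone.utc)
    # Build new dicts instead of mutating the input: raw rows are append-only.
    return [
        {
            **record,
            "_ingested_at": ingested_at,
            "_source_file": str(path),
            "_partition_date": partition_date,
        }
        for record in records
    ]


rows = [{"customer_id": 1, "email": "a@example.com"}]
tagged = tag_raw_records(rows, "raw_data/customers/2025-01-01/part-0.parquet")
print(tagged[0]["_partition_date"])  # 2025-01-01
```

Keeping the source file and partition date on every row makes it possible to trace a bad staging record back to the file it came from.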

### 2️⃣ **STAGING Layer** (Silver)
- **Purpose**: Hold data that has been cleaned and standardized
- **Characteristics**:
  - Duplicates removed
  - Data validated (email format, value ranges)
  - Text standardized (names capitalized)
  - Referential integrity enforced
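
The staging rules above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`customer_id`, `name`, `email`, `order_id`), the simplified email pattern, and the "first valid row wins" deduplication policy are hypothetical and are not taken from the project's `DataCleaner`.

```python
import re

# Deliberately simple pattern for illustration; real email validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_customers(rows):
    """Keep the first valid row per customer_id and title-case the name."""
    seen = set()
    cleaned = []
    for row in rows:
        customer_id = row["customer_id"]
        if customer_id in seen:
            continue  # duplicate of a row we already kept
        email = row.get("email")
        if not email or not EMAIL_RE.match(email):
            continue  # fails validation
        seen.add(customer_id)
        name = (row.get("name") or "").strip().title()
        cleaned.append({**row, "name": name})
    return cleaned


def enforce_customer_fk(orders, customers):
    """Drop orders whose customer_id has no matching cleaned customer."""
    valid_ids = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] in valid_ids]


raw_customers = [
    {"customer_id": 1, "name": "  alice nguyen ", "email": "alice@example.com"},
    {"customer_id": 1, "name": "alice nguyen", "email": "alice@example.com"},
    {"customer_id": 2, "name": "bob tran", "email": "not-an-email"},
]
raw_orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
]

customers = clean_customers(raw_customers)
orders = enforce_customer_fk(raw_orders, customers)
print(customers)
print(orders)
```

Because customer 2 is rejected for its email, order 11 is also dropped by the referential-integrity step, which is one way staging row counts end up lower than raw row counts.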

### 3️⃣ **PRODUCTION Layer** (Gold)
- **Purpose**: Hold aggregated data that is ready for business use
- **Characteristics**:
  - Aggregated metrics
  - Denormalized
  - Optimized for reporting
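
To make "aggregated metrics" concrete, here is a minimal sketch of rolling cleaned orders up to one row per day, in the spirit of `prod.daily_sales`. The actual columns of that table are not shown in this notebook, so `order_date`, `total_amount`, `order_count`, and `revenue` are assumed names.

```python
from collections import defaultdict
from decimal import Decimal


def daily_sales(orders):
    """Aggregate orders into one row per order_date, sorted by date."""
    totals = defaultdict(lambda: {"order_count": 0, "revenue": Decimal("0")})
    for order in orders:
        day = totals[order["order_date"]]
        day["order_count"] += 1
        # Decimal avoids float rounding drift when summing money.
        day["revenue"] += Decimal(order["total_amount"])
    return [{"order_date": d, **totals[d]} for d in sorted(totals)]


staging_orders = [
    {"order_id": 1, "order_date": "2025-01-02", "total_amount": "19.99"},
    {"order_id": 2, "order_date": "2025-01-02", "total_amount": "5.01"},
    {"order_id": 3, "order_date": "2025-01-03", "total_amount": "7.50"},
]

for row in daily_sales(staging_orders):
    print(row)
```

The result has one row per day instead of one row per order, which is the kind of reduction visible later in the notebook when `prod.daily_sales` holds 364 rows.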

In [1]:
# Import libraries
import sys
sys.path.append('../scripts')

import pandas as pd
import numpy as np
from datetime import datetime
from pathlib import Path

from db_connector import DatabaseConnector
from data_cleaner import DataCleaner
from validators import DataValidator

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 1. DATABASE CONNECTION CHECK

In [2]:
print("🔌 Checking database connection...")

db = DatabaseConnector()

# Test query: ask the server which database, user, and version we are connected to
result = db.read_sql("SELECT current_database(), current_user, version()")
print("\n✅ Connection successful!")
print(f"Database: {result['current_database'][0]}")
print(f"User: {result['current_user'][0]}")
print(f"Version: {result['version'][0][:50]}...")

2025-12-20 08:45:35,750 - db_connector - INFO - Database connector initialized for data_engineer@postgres
2025-12-20 08:45:35,803 - db_connector - INFO - Query executed, DataFrame shape: (1, 3)


🔌 Checking database connection...

✅ Connection successful!
Database: data_engineer
User: dataengineer
Version: PostgreSQL 15.15 on x86_64-pc-linux-musl, compiled...


## 2. SCHEMA CHECK

In [3]:
print("📂 Checking schemas...")

schemas_query = """
SELECT schema_name 
FROM information_schema.schemata 
WHERE schema_name IN ('raw', 'staging', 'prod')
ORDER BY schema_name
"""

schemas = db.read_sql(schemas_query)
display(schemas)

2025-12-20 08:45:40,293 - db_connector - INFO - Query executed, DataFrame shape: (3, 1)


📂 Checking schemas...


|   | schema_name |
|---|-------------|
| 0 | prod        |
| 1 | raw         |
| 2 | staging     |
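
The listing above is checked by eye. A small helper can turn it into an explicit pass/fail check; the sketch below is a hypothetical addition (the name `missing_schemas` is not part of the project's scripts) that compares the schema names found against the three expected ones.

```python
EXPECTED_SCHEMAS = {"raw", "staging", "prod"}


def missing_schemas(found, expected=EXPECTED_SCHEMAS):
    """Return the expected schema names that are absent from `found`, sorted."""
    return sorted(set(expected) - set(found))


print(missing_schemas(["prod", "raw", "staging"]))  # []
print(missing_schemas(["raw"]))                     # ['prod', 'staging']
```

In this notebook it could presumably be called as `missing_schemas(schemas['schema_name'])`, since iterating a pandas Series yields its values.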


## 3. TABLES IN EACH SCHEMA

In [4]:
print("📊 Checking tables...\n")

for schema in ['raw', 'staging', 'prod']:
    try:
        # Get table names
        table_names_query = f"""
        SELECT table_name
        FROM information_schema.tables 
        WHERE table_schema = '{schema}'
        ORDER BY table_name
        """
        tables = db.read_sql(table_names_query)
        
        print(f"{schema.upper()} Schema:")
        print("-" * 50)
        
        # Count the rows of every table in this schema
        for table_name in tables['table_name']:
            count_query = f"SELECT COUNT(*) as count FROM {schema}.{table_name}"
            count = db.read_sql(count_query)['count'][0]
            print(f"  {table_name:.<30} {count:>10,} rows")
        
        print()
            
    except Exception as e:
        print(f"  ⚠️ Error: {e}\n")

2025-12-20 08:45:41,209 - db_connector - INFO - Query executed, DataFrame shape: (4, 1)
2025-12-20 08:45:41,219 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,285 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,308 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,318 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,323 - db_connector - INFO - Query executed, DataFrame shape: (4, 1)
2025-12-20 08:45:41,335 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,377 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,400 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,403 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,408 - db_connector - INFO - Query executed, DataFrame shape: (5, 1)
2025-12-20 08:45:41,417 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,423 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,446 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,449 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)
2025-12-20 08:45:41,454 - db_connector - INFO - Query executed, DataFrame shape: (1, 1)


📊 Checking tables...

RAW Schema:
--------------------------------------------------
  customers.....................     11,116 rows
  order_items...................    360,327 rows
  orders........................    119,726 rows
  products......................     36,500 rows

STAGING Schema:
--------------------------------------------------
  customers.....................     10,680 rows
  order_items...................    333,217 rows
  orders........................    110,791 rows
  products......................        100 rows

PROD Schema:
--------------------------------------------------
  customer_metrics..............     10,680 rows
  daily_category_metrics........      2,548 rows
  daily_product_metrics.........     35,837 rows
  daily_sales...................        364 rows
  monthly_sales.................         12 rows


## 4. RAW DATA FILES CHECK

In [5]:
print("📁 Checking raw data files...")

raw_data_dir = Path('../raw_data')

if raw_data_dir.exists():
    for entity in ['customers', 'products', 'orders', 'order_items']:
        entity_dir = raw_data_dir / entity
        if entity_dir.exists():
            # Each sub-directory of an entity is one date partition
            partitions = sorted([d.name for d in entity_dir.iterdir() if d.is_dir()])
            print(f"\n{entity}:")
            print(f"  Total partitions: {len(partitions)}")
            if partitions:
                print(f"  First: {partitions[0]}")
                print(f"  Last: {partitions[-1]}")
else:
    print("⚠️ Raw data directory not found!")
    print("Run: python scripts/generate_raw_data.py --test-mode")

📁 Checking raw data files...

customers:
  Total partitions: 365
  First: 2025-01-01
  Last: 2025-12-31

products:
  Total partitions: 365
  First: 2025-01-01
  Last: 2025-12-31

orders:
  Total partitions: 364
  First: 2025-01-02
  Last: 2025-12-31

order_items:
  Total partitions: 364
  First: 2025-01-02
  Last: 2025-12-31
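
The listing shows 364 partitions for `orders` and `order_items`, starting on 2025-01-02, against 365 for `customers` and `products`. The notebook does not say whether that one-day difference is intentional. A sketch like the following, with the hypothetical helper `missing_partitions`, would report exactly which dates are absent from a range; the partition list here is simulated, not read from disk.

```python
from datetime import date, timedelta


def missing_partitions(partition_names, start, end):
    """Return the dates in [start, end] that have no YYYY-MM-DD partition."""
    present = {date.fromisoformat(name) for name in partition_names}
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in expected if day not in present]


# Simulate the orders listing: 364 daily partitions from 2025-01-02 to 2025-12-31.
orders_partitions = [
    (date(2025, 1, 2) + timedelta(days=i)).isoformat() for i in range(364)
]

gaps = missing_partitions(orders_partitions, date(2025, 1, 1), date(2025, 12, 31))
print(gaps)  # [datetime.date(2025, 1, 1)]
```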


## 5. DATA FLOW VISUALIZATION

In [6]:
print("="*70)
print("📊 DATA FLOW SUMMARY")
print("="*70)

flow_query = """
SELECT 
    'RAW' as layer,
    (SELECT COUNT(*) FROM raw.customers) as customers,
    (SELECT COUNT(*) FROM raw.products) as products,
    (SELECT COUNT(*) FROM raw.orders) as orders,
    (SELECT COUNT(*) FROM raw.order_items) as order_items
UNION ALL
SELECT 
    'STAGING' as layer,
    (SELECT COUNT(*) FROM staging.customers) as customers,
    (SELECT COUNT(*) FROM staging.products) as products,
    (SELECT COUNT(*) FROM staging.orders) as orders,
    (SELECT COUNT(*) FROM staging.order_items) as order_items
UNION ALL
SELECT 
    'PROD' as layer,
    (SELECT COUNT(*) FROM prod.customer_metrics) as customers,
    0 as products,
    (SELECT COUNT(*) FROM prod.daily_sales) as orders,
    0 as order_items
"""

try:
    flow_df = db.read_sql(flow_query)
    display(flow_df)
except Exception as e:
    print(f"⚠️ Query failed (a table may not exist yet): {e}")

2025-12-20 08:45:54,584 - db_connector - INFO - Query executed, DataFrame shape: (3, 5)


📊 DATA FLOW SUMMARY


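|   | layer   | customers | products | orders | order_items |
|---|---------|-----------|----------|--------|-------------|
| 0 | RAW     | 11116     | 36500    | 119726 | 360327      |
| 1 | STAGING | 10680     | 100      | 110791 | 333217      |
| 2 | PROD    | 10680     | 0        | 364    | 0           |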
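
The RAW and STAGING rows are easier to compare as retention ratios, meaning the share of raw rows that survive cleaning. The sketch below just does that arithmetic on the counts shown above; the `retention` helper is an illustrative addition.

```python
raw_counts = {"customers": 11116, "products": 36500, "orders": 119726, "order_items": 360327}
staging_counts = {"customers": 10680, "products": 100, "orders": 110791, "order_items": 333217}


def retention(raw, staging):
    """Fraction of raw rows that remain in staging, per entity."""
    return {name: staging[name] / raw[name] for name in raw}


for name, ratio in retention(raw_counts, staging_counts).items():
    print(f"{name:<12} {ratio:7.2%}")
```

Customers, orders, and order items each keep roughly 92% to 96% of their raw rows. Products drop to about 0.27%, and 36,500 is exactly 100 × 365, which would be consistent with a 100-product catalogue being re-ingested in each of the 365 daily partitions and then deduplicated, although the notebook does not state this.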


# 🎯 OVERALL WORKFLOW

## Step 1: Generate Raw Data
```bash
python scripts/generate_raw_data.py --test-mode
```

## Step 2: Run ETL Pipeline
```bash
# Option 1: Run each layer separately
make etl-run-raw    # Raw layer
make etl-run-stg    # Staging layer
make etl-run-prod   # Production layer

# Option 2: Run full pipeline
make etl-run-full
```
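
When the layers are run one at a time, a later layer should not run if an earlier one failed. The following is a minimal sketch of that ordering from Python, assuming the three `make` targets above are run from the project root. The `run_layers` helper and its injectable `run` parameter are hypothetical and exist only to keep the example testable without invoking `make`.

```python
import subprocess

# Targets taken from the commands above, in dependency order.
LAYER_TARGETS = ("etl-run-raw", "etl-run-stg", "etl-run-prod")


def run_layers(targets=LAYER_TARGETS, run=subprocess.run):
    """Run each make target in order and stop at the first non-zero exit code.

    Returns the list of targets that completed successfully.
    """
    completed = []
    for target in targets:
        result = run(["make", target])
        if result.returncode != 0:
            break  # do not build a layer on top of a failed one
        completed.append(target)
    return completed
```

Calling `run_layers()` with the defaults would execute the real targets; comparing the returned list with `LAYER_TARGETS` shows whether the full sequence finished.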

## Step 3: Explore Data
- Notebook 01: Raw layer exploration
- Notebook 02: Staging transformation
- Notebook 03: Production aggregation

## Step 4: Validate Data Quality
- Notebook 05: Data quality checks

In [7]:
print("\n✅ Setup complete! Ready to explore ETL pipeline.")
print("\n📚 Next steps:")
print("  1. Run: python scripts/generate_raw_data.py --test-mode")
print("  2. Open: 01_raw_layer_exploration.ipynb")


✅ Setup complete! Ready to explore ETL pipeline.

📚 Next steps:
  1. Run: python scripts/generate_raw_data.py --test-mode
  2. Open: 01_raw_layer_exploration.ipynb
