# 🔧 06. TROUBLESHOOTING GUIDE

## 🎯 OBJECTIVES:
- Diagnose and fix common errors
- Debug the ETL pipeline
- Apply recovery strategies

## 📚 CONTENTS:
1. Common Errors & Solutions
2. Data Quality Issues
3. ETL Pipeline Failures
4. Performance Issues
5. Recovery Procedures
6. Diagnostic Queries
7. Health Check Summary

In [None]:
import sys
sys.path.append('../scripts')

import pandas as pd
import psycopg2
from pathlib import Path
from datetime import datetime

from db_connector import DatabaseConnector

print("✅ Libraries imported!")

## 1. COMMON ERRORS & SOLUTIONS

### ❌ Error 1: Connection Refused
```
psycopg2.OperationalError: could not connect to server: Connection refused
```

**Causes:**
- PostgreSQL service not running
- Wrong host/port in .env
- Firewall blocking connection

**Solutions:**

In [None]:
print("🔍 DIAGNOSE: Connection Issues")
print("="*70)

# Check if PostgreSQL is running
import subprocess

try:
    # pg_isready ships with the PostgreSQL client tools (macOS/Linux).
    # With no arguments it probes the libpq default host and port,
    # which is not necessarily the server configured in .env.
    result = subprocess.run(['pg_isready'], capture_output=True, text=True)
    if result.returncode == 0:
        print("✅ PostgreSQL is running")
    else:
        print("❌ PostgreSQL is not accepting connections")
        print("\nTo start PostgreSQL:")
        print("  macOS: brew services start postgresql")
        print("  Linux: sudo systemctl start postgresql")
        print("  Windows: net start postgresql-x64-14")
except FileNotFoundError:
    print("⚠️ pg_isready not found. Check PostgreSQL installation.")

In [None]:
# Check .env configuration
print("\n🔍 Check .env Configuration:")
print("-" * 70)

from dotenv import load_dotenv
import os

load_dotenv()

required_vars = ['DB_HOST', 'DB_PORT', 'DB_NAME', 'DB_USER', 'DB_PASSWORD']

for var in required_vars:
    value = os.getenv(var)
    if value:
        # Mask password
        display_value = '***' if var == 'DB_PASSWORD' else value
        print(f"  {var:.<20} {display_value}")
    else:
        print(f"  {var:.<20} ❌ NOT SET")

In [None]:
# Test connection
print("\n🔍 Test Database Connection:")
print("-" * 70)

try:
    db = DatabaseConnector()
    result = db.read_sql("SELECT version()")
    print("✅ Connection successful!")
    print(f"\nPostgreSQL version: {result['version'][0][:50]}...")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\n💡 Solutions:")
    print("  1. Check if PostgreSQL is running")
    print("  2. Verify .env configuration")
    print("  3. Check firewall settings")
    print("  4. Verify database exists: psql -l")
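
When `pg_isready` is not installed, a plain TCP probe can at least show whether anything is listening on the configured host and port. The sketch below uses only the standard library. The helper name `port_is_open` and the fallback values `localhost` and `5432` are assumptions for illustration, and an open port only shows that something accepts connections there, not that it is PostgreSQL or that the credentials work.

```python
import os
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures
        return False


if __name__ == "__main__":
    host = os.getenv("DB_HOST", "localhost")   # assumed fallback
    port = int(os.getenv("DB_PORT", "5432"))   # assumed fallback
    state = "open" if port_is_open(host, port) else "closed or unreachable"
    print(f"{host}:{port} is {state}")
```

A closed port together with a running service usually points at the host/port values in `.env` or at a firewall rule, which matches the causes listed for this error.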

### ❌ Error 2: Table Does Not Exist
```
psycopg2.errors.UndefinedTable: relation "raw.customers" does not exist
```

**Solutions:**

In [None]:
print("🔍 DIAGNOSE: Missing Tables")
print("="*70)

db = DatabaseConnector()

# Check schemas
schemas_query = """
SELECT schema_name 
FROM information_schema.schemata 
WHERE schema_name IN ('raw', 'staging', 'prod')
"""

schemas = db.read_sql(schemas_query)['schema_name'].tolist()

print("\nSchemas:")
for schema in ['raw', 'staging', 'prod']:
    status = "✅" if schema in schemas else "❌"
    print(f"  {status} {schema}")

if len(schemas) < 3:
    print("\n💡 Solution: Run schema initialization")
    print("  make db-init")
    print("  or: psql -d your_db -f sql/init_schema.sql")

In [None]:
# Check tables in each schema
print("\n🔍 Check Tables:")
print("-" * 70)

expected_tables = {
    'raw': ['customers', 'products', 'orders', 'order_items'],
    'staging': ['customers', 'products', 'orders', 'order_items'],
    'prod': ['daily_sales', 'monthly_sales', 'daily_category_metrics', 
             'daily_product_metrics', 'customer_metrics']
}

missing_tables = []

for schema, tables in expected_tables.items():
    if schema not in schemas:
        continue
        
    print(f"\n{schema.upper()} Schema:")
    for table in tables:
        check_query = f"""
        SELECT COUNT(*) 
        FROM information_schema.tables 
        WHERE table_schema = '{schema}' AND table_name = '{table}'
        """
        exists = db.read_sql(check_query).iloc[0, 0] > 0
        status = "✅" if exists else "❌"
        print(f"  {status} {table}")
        
        if not exists:
            missing_tables.append(f"{schema}.{table}")

if missing_tables:
    print("\n💡 Solution: Create missing tables")
    print("  make db-init")

### ❌ Error 3: Duplicate Key Violation
```
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint
```

**Solutions:**

In [None]:
print("🔍 DIAGNOSE: Duplicate Keys")
print("="*70)

# Check for duplicates in staging
duplicate_checks = [
    ("staging.customers", "customer_id"),
    ("staging.customers", "email"),
    ("staging.products", "product_id"),
    ("staging.orders", "order_id")
]

print("\nDuplicate Check:")
for table, column in duplicate_checks:
    # COUNT(column) skips NULLs, so missing values are not reported as duplicates
    query = f"""
    SELECT COUNT({column}) - COUNT(DISTINCT {column}) as duplicates
    FROM {table}
    """
    try:
        result = db.read_sql(query)
        dups = result['duplicates'][0]
        status = "✅" if dups == 0 else f"❌ {dups} duplicates"
        print(f"  {table}.{column:.<20} {status}")
    except Exception as e:
        print(f"  {table}.{column:.<20} ⚠️ Error: {e}")

In [None]:
# Find duplicate records
print("\n🔍 Find Duplicate Records:")
print("-" * 70)

# Example: Find duplicate emails
dup_query = """
SELECT email, COUNT(*) as count
FROM staging.customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY count DESC
LIMIT 5
"""

try:
    duplicates = db.read_sql(dup_query)
    if len(duplicates) > 0:
        print("\n❌ Found duplicate emails:")
        display(duplicates)

        print("\n💡 Solution: Remove duplicates")
        print("  Option 1: Truncate and reload staging")
        print("    TRUNCATE staging.customers CASCADE;")
        print("    make etl-run-stg")
        print("\n  Option 2: Delete duplicates keeping first")
        print("    DELETE FROM staging.customers WHERE customer_id IN (")
        print("      SELECT customer_id FROM (")
        print("        SELECT customer_id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) as rn")
        print("        FROM staging.customers")
        print("      ) t WHERE rn > 1")
        print("    );")
    else:
        print("✅ No duplicate emails found")
except Exception as e:
    print(f"⚠️ Error: {e}")
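
Option 2 can be tried safely before touching staging. The sketch below replays the same `ROW_NUMBER()` delete against an in-memory SQLite table as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer). The table layout and sample rows are invented for illustration. It also adds `customer_id` as a tie-breaker so that the surviving row is deterministic when two rows share a `created_at`, and it assumes `customer_id` itself is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "2025-01-01"),
        (2, "a@example.com", "2025-01-02"),  # later duplicate of customer 1
        (3, "b@example.com", "2025-01-01"),
    ],
)

# Rank rows within each email by age; every row ranked after the first is removed.
conn.execute(
    """
    DELETE FROM customers WHERE customer_id IN (
        SELECT customer_id FROM (
            SELECT customer_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email ORDER BY created_at, customer_id
                   ) AS rn
            FROM customers
        ) AS t WHERE rn > 1
    )
    """
)

remaining = [
    row[0]
    for row in conn.execute("SELECT customer_id FROM customers ORDER BY customer_id")
]
print(remaining)
```

If `customer_id` can repeat as well, deleting by `customer_id` would also remove the row meant to be kept, so that case needs a different key.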

## 2. DATA QUALITY ISSUES

In [None]:
print("🔍 DIAGNOSE: Data Quality Issues")
print("="*70)

# Check for common data quality issues
quality_checks = [
    ("NULL emails", "SELECT COUNT(*) FROM staging.customers WHERE email IS NULL"),
    ("Invalid emails", "SELECT COUNT(*) FROM staging.customers WHERE email NOT LIKE '%@%.%'"),
    ("Negative prices", "SELECT COUNT(*) FROM staging.products WHERE price < 0"),
    ("Future dates", "SELECT COUNT(*) FROM staging.orders WHERE order_date > CURRENT_DATE"),
    ("Orphaned orders", 
     """SELECT COUNT(*) FROM staging.orders o 
        WHERE NOT EXISTS (SELECT 1 FROM staging.customers c WHERE c.customer_id = o.customer_id)""")
]

print("\nData Quality Issues:")
issues_found = False

for check_name, query in quality_checks:
    try:
        result = db.read_sql(query)
        count = result.iloc[0, 0]
        if count > 0:
            print(f"  ❌ {check_name}: {count} records")
            issues_found = True
        else:
            print(f"  ✅ {check_name}: OK")
    except Exception as e:
        print(f"  ⚠️ {check_name}: Error - {e}")

if issues_found:
    print("\n💡 Solutions:")
    print("  1. Check data generation logic in generate_raw_data.py")
    print("  2. Review transformation logic in etl_stg.py")
    print("  3. Re-run ETL pipeline: make etl-run-full")
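
The counts above say how many rows are bad, not which ones. The same rules can be applied in Python to locate the offending records. This is a sketch: the dict keys mirror the column names used in the queries, `bad_emails` and `orphaned_orders` are hypothetical helpers, and the email test is a deliberately loose equivalent of the `LIKE '%@%.%'` pattern rather than real address validation.

```python
import re
from typing import Dict, Iterable, List

# Loose equivalent of SQL LIKE '%@%.%': an '@' followed later by a '.'
EMAIL_SHAPE = re.compile(r"@.*\.", re.DOTALL)


def bad_emails(customers: Iterable[Dict]) -> List[Dict]:
    """Customers whose email is missing or does not match the loose shape."""
    return [
        c for c in customers
        if c.get("email") is None or not EMAIL_SHAPE.search(c["email"])
    ]


def orphaned_orders(orders: Iterable[Dict], customers: Iterable[Dict]) -> List[Dict]:
    """Orders whose customer_id matches no customer (the NOT EXISTS check)."""
    known_ids = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known_ids]


sample_customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "not-an-email"},
    {"customer_id": 3, "email": None},
]
sample_orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},
]

print([c["customer_id"] for c in bad_emails(sample_customers)])
print([o["order_id"] for o in orphaned_orders(sample_orders, sample_customers)])
```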

## 3. ETL PIPELINE FAILURES

In [None]:
print("🔍 DIAGNOSE: ETL Pipeline Status")
print("="*70)

# Check data flow through pipeline
pipeline_query = """
SELECT 
    'RAW' as layer,
    (SELECT COUNT(*) FROM raw.customers) as customers,
    (SELECT COUNT(*) FROM raw.orders) as orders,
    (SELECT COUNT(*) FROM raw.order_items) as order_items
UNION ALL
SELECT 
    'STAGING' as layer,
    (SELECT COUNT(*) FROM staging.customers) as customers,
    (SELECT COUNT(*) FROM staging.orders) as orders,
    (SELECT COUNT(*) FROM staging.order_items) as order_items
UNION ALL
SELECT 
    'PROD' as layer,
    (SELECT COUNT(*) FROM prod.customer_metrics) as customers,
    (SELECT COUNT(*) FROM prod.daily_sales) as orders,
    0 as order_items
"""

try:
    pipeline_status = db.read_sql(pipeline_query)
    display(pipeline_status)
    
    # Analyze pipeline
    raw_customers = pipeline_status.loc[pipeline_status['layer'] == 'RAW', 'customers'].values[0]
    stg_customers = pipeline_status.loc[pipeline_status['layer'] == 'STAGING', 'customers'].values[0]
    prod_customers = pipeline_status.loc[pipeline_status['layer'] == 'PROD', 'customers'].values[0]
    
    print("\n📊 Pipeline Analysis:")
    print("-" * 70)

    if raw_customers == 0:
        print("❌ RAW layer is empty")
        print("💡 Solution: Generate raw data")
        print("  python scripts/generate_raw_data.py --test-mode")
    elif stg_customers == 0:
        print("❌ STAGING layer is empty")
        print("💡 Solution: Run staging ETL")
        print("  make etl-run-stg")
    elif prod_customers == 0:
        print("❌ PROD layer is empty")
        print("💡 Solution: Run production ETL")
        print("  make etl-run-prod")
    else:
        print("✅ All layers have data")

        # Check data loss
        loss_pct = (raw_customers - stg_customers) / raw_customers * 100
        print(f"\nData loss RAW → STAGING: {loss_pct:.1f}%")

        if loss_pct > 20:
            print("⚠️ High data loss detected!")
            print("💡 Check data quality issues in staging transformation")

except Exception as e:
    print(f"❌ Error: {e}")
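
The analysis above only compares customer counts. The same percentage can be computed for every table that appears in both layers. The sketch below does this on plain dictionaries with made-up counts. `loss_percent` is a hypothetical helper that returns `None` instead of dividing by zero when the RAW count is 0, and the 20% threshold is the one used in the cell above.

```python
from typing import Optional


def loss_percent(raw_count: int, staging_count: int) -> Optional[float]:
    """Share of RAW rows that did not reach STAGING, in percent."""
    if raw_count == 0:
        return None
    return (raw_count - staging_count) / raw_count * 100


# Invented counts; in the notebook they would come from pipeline_status
raw_counts = {"customers": 1000, "orders": 5000, "order_items": 12000}
staging_counts = {"customers": 950, "orders": 3500, "order_items": 12000}

for table, raw_n in raw_counts.items():
    pct = loss_percent(raw_n, staging_counts.get(table, 0))
    label = "n/a" if pct is None else f"{pct:.1f}%"
    flag = "  <-- above 20% threshold" if pct is not None and pct > 20 else ""
    print(f"{table:<12} {label}{flag}")
```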

## 4. PERFORMANCE ISSUES

In [None]:
print("🔍 DIAGNOSE: Performance Issues")
print("="*70)

# Check table sizes
size_query = """
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname IN ('raw', 'staging', 'prod')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
"""

try:
    sizes = db.read_sql(size_query)
    print("\nTable Sizes:")
    display(sizes)
except Exception as e:
    print(f"⚠️ Error: {e}")

In [None]:
# Check indexes
index_query = """
SELECT 
    schemaname,
    tablename,
    indexname,
    indexdef
FROM pg_indexes
WHERE schemaname IN ('raw', 'staging', 'prod')
ORDER BY schemaname, tablename
"""

try:
    indexes = db.read_sql(index_query)
    print("\nIndexes:")
    display(indexes)
    
    if len(indexes) == 0:
        print("\n⚠️ No indexes found!")
        print("💡 Consider adding indexes for better performance:")
        print("  CREATE INDEX idx_orders_customer ON staging.orders(customer_id);")
        print("  CREATE INDEX idx_orders_date ON staging.orders(order_date);")
        print("  CREATE INDEX idx_order_items_order ON staging.order_items(order_id);")
except Exception as e:
    print(f"⚠️ Error: {e}")
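
Table sizes and indexes show where slowness could come from; timing each step shows where it actually is. A minimal sketch using `time.perf_counter` follows. `timed` is a hypothetical helper, and the `sum(...)` call stands in for a `db.read_sql(...)` call or an ETL step.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label: str):
    """Record and print the wall-clock duration of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are still timed
        timings[label] = time.perf_counter() - start
        print(f"{label}: {timings[label]:.3f}s")


with timed("placeholder step"):
    sum(range(100_000))
```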

## 5. RECOVERY PROCEDURES

### 🔄 Scenario 1: Reset Entire Pipeline

**When to use:** Complete data corruption or major schema changes

```bash
# 1. Drop and recreate schemas
make db-reset

# 2. Initialize schemas
make db-init

# 3. Generate fresh data
python scripts/generate_raw_data.py --test-mode

# 4. Run full pipeline
make etl-run-full
```

In [None]:
# Quick reset function (use with caution!)
def reset_layer(layer_name):
    """
    Reset a specific layer by truncating all tables
    WARNING: This will delete all data in the layer!
    """
    print(f"⚠️ WARNING: This will delete all data in {layer_name} layer!")
    print("Uncomment the code below to execute.")
    
    # Uncomment to execute
    # db = DatabaseConnector()
    # 
    # # Get all tables in layer
    # tables_query = f"""
    # SELECT table_name 
    # FROM information_schema.tables 
    # WHERE table_schema = '{layer_name}'
    # """
    # tables = db.read_sql(tables_query)['table_name'].tolist()
    # 
    # # Truncate each table
    # for table in tables:
    #     truncate_query = f"TRUNCATE {layer_name}.{table} CASCADE;"
    #     db.execute(truncate_query)
    #     print(f"  ✅ Truncated {layer_name}.{table}")
    # 
    # print(f"\n✅ {layer_name} layer reset complete!")

# Example usage (commented out for safety)
# reset_layer('staging')
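
Commenting code in and out is easy to get wrong. An alternative, sketched here, keeps the destructive path behind an explicit `confirm=True` argument and a whitelist of layer names. `build_truncate_statements` is a hypothetical helper that only builds the SQL; executing it (for example through the `db.execute` call referenced in the comments above) is left to the caller.

```python
from typing import List

ALLOWED_LAYERS = {"raw", "staging", "prod"}


def quote_ident(name: str) -> str:
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_truncate_statements(
    layer: str, tables: List[str], confirm: bool = False
) -> List[str]:
    if layer not in ALLOWED_LAYERS:
        raise ValueError(f"Unknown layer: {layer!r}")
    if not confirm:
        raise RuntimeError("Refusing to build TRUNCATE statements without confirm=True")
    return [
        f"TRUNCATE {quote_ident(layer)}.{quote_ident(table)} CASCADE;"
        for table in tables
    ]


for stmt in build_truncate_statements("staging", ["customers", "orders"], confirm=True):
    print(stmt)
```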

### 🔄 Scenario 2: Reload Specific Date Range

**When to use:** Data issues in specific date range

In [None]:
def reload_date_range(start_date, end_date):
    """
    Reload data for specific date range
    """
    print(f"🔄 Reloading data from {start_date} to {end_date}")
    print("\nSteps:")
    print(f"  1. Delete from RAW: DELETE FROM raw.customers WHERE _partition_date BETWEEN '{start_date}' AND '{end_date}';")
    print(f"  2. Delete from STAGING: DELETE FROM staging.customers WHERE signup_date BETWEEN '{start_date}' AND '{end_date}';")
    print(f"  3. Delete from PROD: DELETE FROM prod.daily_sales WHERE order_date BETWEEN '{start_date}' AND '{end_date}';")
    print(f"  4. Re-generate raw data: python scripts/generate_raw_data.py --start-date {start_date} --end-date {end_date}")
    print("  5. Re-run ETL: make etl-run-full")

# Example
reload_date_range('2025-01-01', '2025-01-07')
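
Because the dates end up inside SQL text, it is worth validating them first. The sketch below parses both bounds with `datetime.strptime`, rejects an inverted range, and returns the DELETE statements with `%s` placeholders plus a parameter tuple, which is the placeholder style psycopg2 uses. `build_range_deletes` is a hypothetical helper; the table and column names are taken from the steps printed above.

```python
from datetime import datetime
from typing import List, Tuple

RANGE_TARGETS = [
    ("raw.customers", "_partition_date"),
    ("staging.customers", "signup_date"),
    ("prod.daily_sales", "order_date"),
]


def build_range_deletes(
    start_date: str, end_date: str
) -> List[Tuple[str, Tuple[str, str]]]:
    # strptime raises ValueError for anything that is not YYYY-MM-DD
    start = datetime.strptime(start_date, "%Y-%m-%d").date()
    end = datetime.strptime(end_date, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    params = (start.isoformat(), end.isoformat())
    return [
        (f"DELETE FROM {table} WHERE {column} BETWEEN %s AND %s;", params)
        for table, column in RANGE_TARGETS
    ]


for sql, params in build_range_deletes("2025-01-01", "2025-01-07"):
    print(sql, params)
```

Only the date values are passed as parameters; the table and column names come from the fixed `RANGE_TARGETS` list, so no user input reaches the identifier positions.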

### 🔄 Scenario 3: Fix Specific Table

**When to use:** Issues in one specific table

In [None]:
def fix_table(table_name):
    """
    Fix specific table by reloading from previous layer
    """
    print(f"🔧 Fixing {table_name}")
    print("\nSteps:")

    if table_name.startswith('staging.'):
        print(f"  1. Truncate: TRUNCATE {table_name} CASCADE;")
        print("  2. Reload from RAW: make etl-run-stg")
    elif table_name.startswith('prod.'):
        print(f"  1. Truncate: TRUNCATE {table_name};")
        print("  2. Rebuild from STAGING: make etl-run-prod")
    else:
        # Anything else is treated as a RAW table
        print("  1. Delete raw data files")
        print(f"  2. Truncate: TRUNCATE {table_name};")
        print("  3. Re-generate: python scripts/generate_raw_data.py")
        print("  4. Re-ingest: make etl-run-raw")

# Example
fix_table('staging.customers')

## 6. DIAGNOSTIC QUERIES

In [None]:
print("🔍 USEFUL DIAGNOSTIC QUERIES")
print("="*70)

diagnostic_queries = {
    "Row counts by layer": """
        SELECT 'RAW' as layer, COUNT(*) FROM raw.customers
        UNION ALL
        SELECT 'STAGING', COUNT(*) FROM staging.customers
        UNION ALL
        SELECT 'PROD', COUNT(*) FROM prod.customer_metrics
    """,
    
    "Latest ingestion time": """
        SELECT MAX(_ingested_at) as latest_ingestion
        FROM raw.customers
    """,
    
    "Date range in data": """
        SELECT 
            MIN(order_date) as first_date,
            MAX(order_date) as last_date,
            COUNT(DISTINCT order_date) as total_days
        FROM staging.orders
    """,
    
    "Revenue by layer": """
        SELECT 
            'STAGING' as layer,
            SUM(total_amount) as total_revenue
        FROM staging.orders
        WHERE order_status = 'completed'
        UNION ALL
        SELECT 
            'PROD' as layer,
            SUM(total_revenue) as total_revenue
        FROM prod.daily_sales
    """
}

for query_name, query in diagnostic_queries.items():
    print(f"\n📊 {query_name}:")
    print("-" * 70)
    try:
        result = db.read_sql(query)
        display(result)
    except Exception as e:
        print(f"⚠️ Error: {e}")

## 7. HEALTH CHECK SUMMARY

In [None]:
print("\n" + "="*70)
print("🏥 SYSTEM HEALTH CHECK")
print("="*70)

health_checks = []

# 1. Database connection
try:
    db.read_sql("SELECT 1")
    health_checks.append(("Database Connection", "✅ OK"))
except Exception:
    health_checks.append(("Database Connection", "❌ FAILED"))

# 2. Schemas exist
try:
    schemas = db.read_sql("SELECT schema_name FROM information_schema.schemata WHERE schema_name IN ('raw', 'staging', 'prod')")['schema_name'].tolist()
    if len(schemas) == 3:
        health_checks.append(("Schemas", "✅ OK (3/3)"))
    else:
        health_checks.append(("Schemas", f"⚠️ PARTIAL ({len(schemas)}/3)"))
except Exception:
    health_checks.append(("Schemas", "❌ FAILED"))

# 3. Data in RAW
try:
    count = db.read_sql("SELECT COUNT(*) as c FROM raw.customers")['c'][0]
    if count > 0:
        health_checks.append(("RAW Layer", f"✅ OK ({count:,} rows)"))
    else:
        health_checks.append(("RAW Layer", "⚠️ EMPTY"))
except Exception:
    health_checks.append(("RAW Layer", "❌ FAILED"))

# 4. Data in STAGING
try:
    count = db.read_sql("SELECT COUNT(*) as c FROM staging.customers")['c'][0]
    if count > 0:
        health_checks.append(("STAGING Layer", f"✅ OK ({count:,} rows)"))
    else:
        health_checks.append(("STAGING Layer", "⚠️ EMPTY"))
except Exception:
    health_checks.append(("STAGING Layer", "❌ FAILED"))

# 5. Data in PROD
try:
    count = db.read_sql("SELECT COUNT(*) as c FROM prod.daily_sales")['c'][0]
    if count > 0:
        health_checks.append(("PROD Layer", f"✅ OK ({count:,} rows)"))
    else:
        health_checks.append(("PROD Layer", "⚠️ EMPTY"))
except Exception:
    health_checks.append(("PROD Layer", "❌ FAILED"))

# 6. Data quality (COUNT(email) skips NULLs, so only real duplicates are counted)
try:
    dups = db.read_sql("SELECT COUNT(email) - COUNT(DISTINCT email) as d FROM staging.customers")['d'][0]
    if dups == 0:
        health_checks.append(("Data Quality", "✅ OK (no duplicates)"))
    else:
        health_checks.append(("Data Quality", f"⚠️ ISSUES ({dups} duplicates)"))
except Exception:
    health_checks.append(("Data Quality", "❌ FAILED"))

# Print results
print("\nHealth Check Results:")
for check, status in health_checks:
    print(f"  {check:.<30} {status}")

# Overall status: count entries by the marker at the start of each status string
failed = sum(1 for _, status in health_checks if "❌" in status)
warnings = sum(1 for _, status in health_checks if "⚠️" in status)

print("\n" + "="*70)
if failed == 0 and warnings == 0:
    print("🎉 ALL SYSTEMS OPERATIONAL")
elif failed == 0:
    print(f"⚠️ SYSTEM OPERATIONAL WITH {warnings} WARNING(S)")
else:
    print(f"❌ SYSTEM ISSUES DETECTED: {failed} FAILED, {warnings} WARNING(S)")
print("="*70)
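
The six checks share one shape: run something, then map the result or the exception to a status string. A table-driven variant is sketched below. `run_check`, `row_count_status`, and the probes are hypothetical, the counts are invented, and in the notebook each probe would wrap a `db.read_sql(...)` call.

```python
from typing import Callable, List, Tuple


def run_check(name: str, probe: Callable[[], str]) -> Tuple[str, str]:
    """Run probe(); return (name, status). Any exception becomes FAILED."""
    try:
        return name, probe()
    except Exception as exc:
        return name, f"FAILED ({type(exc).__name__})"


def row_count_status(count: int) -> str:
    return f"OK ({count:,} rows)" if count > 0 else "EMPTY"


def broken_probe() -> str:
    raise ConnectionError("database unreachable")  # simulated failure


check_results: List[Tuple[str, str]] = [
    run_check("RAW Layer", lambda: row_count_status(1200)),
    run_check("STAGING Layer", lambda: row_count_status(0)),
    run_check("Database Connection", broken_probe),
]

for name, status in check_results:
    print(f"  {name:.<30} {status}")
```

Keeping the exception type in the status string makes a failed check easier to tell apart from an empty table when reading the summary.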

# 🎓 KEY TAKEAWAYS

## 🔧 Common Issues:
1. **Connection Issues**: Check PostgreSQL service, .env config
2. **Missing Tables**: Run `make db-init`
3. **Duplicate Keys**: Review deduplication logic
4. **Data Quality**: Check transformation rules
5. **Performance**: Add indexes, optimize queries

## 🔄 Recovery Strategies:
1. **Full Reset**: Drop and recreate everything
2. **Layer Reset**: Truncate and reload specific layer
3. **Date Range**: Reload specific dates
4. **Table Fix**: Fix individual table

## 📚 Useful Commands:
```bash
# Database
make db-init          # Initialize schemas
make db-reset         # Reset database

# ETL
make etl-run-full     # Run full pipeline
make etl-run-raw      # Run RAW layer only
make etl-run-stg      # Run STAGING layer only
make etl-run-prod     # Run PROD layer only

# Data Generation
python scripts/generate_raw_data.py --test-mode
python scripts/generate_raw_data.py --start-date 2025-01-01 --end-date 2025-01-31
```

## 💡 Best Practices:
1. Always backup before major changes
2. Test on small data first
3. Monitor logs during ETL runs
4. Validate data after each layer
5. Document any manual fixes

In [None]:
print("\n✅ Troubleshooting Guide Complete!")
print("\n💡 If you still have issues:")
print("  1. Check logs in logs/ directory")
print("  2. Review ETL scripts in scripts/ directory")
print("  3. Check SQL schemas in sql/ directory")
print("  4. Run health check above")