# Week 6 - SQL and Python Integration Part 1: Database Connections

## Learning Objectives
By the end of this lesson, you will be able to:
1. Establish PostgreSQL database connections from Python using SQLAlchemy
2. Connect to cloud databases (Supabase) for real-world data analysis
3. Execute SQL queries from Python notebooks using real e-commerce data
4. Understand the relationship between SQL databases and Python DataFrames
5. Implement proper connection management and error handling
6. Compare SQL and Pandas approaches for business analytics

## Business Context: Bridging SQL and Python

In modern business environments, data often lives in **cloud databases** while analysis happens in **Python**. The ability to seamlessly bridge these two worlds is essential for:

- **Real-time Data Access** - Connect directly to live business systems
- **Scalability** - Handle enterprise-scale datasets
- **Collaboration** - Multiple analysts accessing the same data source
- **Performance** - Leverage database engines for heavy computation
- **Integration** - Combine SQL's querying power with Python's analytical capabilities

Today we'll master connecting Python to **PostgreSQL databases** using **Supabase** (a cloud database platform) and work with real Olist e-commerce data that's already stored in the cloud.

In [23]:
# Import required libraries for PostgreSQL database connectivity
import pandas as pd
import numpy as np
import sqlalchemy
from sqlalchemy import create_engine, text, inspect
from datetime import datetime, timedelta
import warnings
import os
from dotenv import load_dotenv

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

# Supabase PostgreSQL Database Configuration from environment variables
DATABASE_CONFIG = {
    'host': os.getenv('POSTGRES_HOST'),
    'port': int(os.getenv('POSTGRES_PORT', 5432)),
    'database': os.getenv('POSTGRES_DATABASE'),
    'user': os.getenv('POSTGRES_USER'),
    'password': os.getenv('POSTGRES_PASSWORD'),
    'connection_timeout': 30,
    'echo': False  # Set to True to see SQL queries
}

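# Verify that all required environment variables were loaded
required_keys = ['host', 'database', 'user', 'password']
missing_keys = [key for key in required_keys if not DATABASE_CONFIG[key]]
if missing_keys:
    raise ValueError(f"Missing required database settings {missing_keys}. Please check your .env file.")

# PostgreSQL connection URL. URL.create() escapes special characters in the
# password (such as '@' or '/'), which a plain f-string would not.
POSTGRES_URL = sqlalchemy.engine.URL.create(
    drivername="postgresql",
    username=DATABASE_CONFIG['user'],
    password=DATABASE_CONFIG['password'],
    host=DATABASE_CONFIG['host'],
    port=DATABASE_CONFIG['port'],
    database=DATABASE_CONFIG['database'],
)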

print("🐘 PostgreSQL-Python Integration Environment Ready!")
print(f"SQLAlchemy version: {sqlalchemy.__version__}")
print(f"Pandas version: {pd.__version__}")
print("✅ Connecting to Supabase PostgreSQL Database...")
print("🗄️ Real Olist E-commerce & Marketing data awaits!")
print("🔒 Database credentials loaded securely from .env file")

🐘 PostgreSQL-Python Integration Environment Ready!
SQLAlchemy version: 2.0.41
Pandas version: 2.3.0
✅ Connecting to Supabase PostgreSQL Database...
🗄️ Real Olist E-commerce & Marketing data awaits!
🔒 Database credentials loaded securely from .env file
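
One detail worth knowing when a connection URL is assembled from separate credentials: characters such as `@`, `/`, or `:` in a password change how the URL is parsed unless they are percent-encoded. The sketch below uses only the standard library to show the effect. The credentials, the host name, and the helper `build_postgres_url` are made up for illustration and are not part of the course code.

```python
from urllib.parse import quote_plus, unquote_plus, urlsplit


def build_postgres_url(user, password, host, port, database):
    """Assemble a PostgreSQL URL, percent-encoding the user name and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials: the password contains '@', '/' and ':'.
user, password = "analyst", "p@ss/w:rd"

# Naive interpolation: the parser ends the host section at the first '/'.
raw_url = f"postgresql://{user}:{password}@db.example.com:5432/postgres"
raw_parts = urlsplit(raw_url)

# Escaped version: the special characters no longer act as URL delimiters.
safe_url = build_postgres_url(user, password, "db.example.com", 5432, "postgres")
safe_parts = urlsplit(safe_url)

print("raw host: ", raw_parts.hostname)   # not the intended host
print("safe host:", safe_parts.hostname)
print("password round-trips:", unquote_plus(safe_parts.password) == password)
```

SQLAlchemy's `URL.create()` applies this kind of escaping itself when the parts are passed separately, so it is usually the simpler option inside a SQLAlchemy project.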


## 1. SQLAlchemy Basics and PostgreSQL Connection

**SQLAlchemy** is one of Python's most widely used database toolkits. It provides:
- **Connection Management**: Handle database connections efficiently
- **SQL Query Execution**: Run SQL directly from Python
- **ORM (Object-Relational Mapping)**: Map Python objects to database tables
- **Database Abstraction**: Work with different databases using the same API

**PostgreSQL** is an enterprise-grade database that excels at:
- **Complex Queries**: Advanced SQL features like window functions, CTEs
- **Scalability**: Handle millions of rows efficiently  
- **Data Integrity**: ACID compliance for business-critical data
- **JSON Support**: Store and query semi-structured data
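
Running SQL from Python usually involves passing Python values into a query. A minimal sketch of bound parameters follows, using the standard-library `sqlite3` module and an in-memory table as a stand-in for PostgreSQL (an assumption made so the example runs without credentials). The `:state` placeholder style is the same one SQLAlchemy's `text()` accepts; the table contents and the helper `count_customers` are invented.

```python
import sqlite3

# In-memory SQLite database standing in for the cloud PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, customer_state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "SP"), ("c2", "SP"), ("c3", "RJ")],
)


def count_customers(connection, state):
    """Count customers in one state, passing the value as a bound parameter."""
    query = "SELECT COUNT(*) FROM customers WHERE customer_state = :state"
    return connection.execute(query, {"state": state}).fetchone()[0]


sp_count = count_customers(conn, "SP")

# A malicious-looking value is compared as plain data, never executed as SQL.
injected_count = count_customers(conn, "SP' OR '1'='1")

print(sp_count, injected_count)
conn.close()
```

With the `execute_query` helper defined later in this lesson, the equivalent call would pass the values through its `params` argument, for example `params={"state": "SP"}`.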

In [24]:
# PostgreSQL Connection Functions (Functional Programming Approach)

def create_database_engine():
    """
    Create and configure PostgreSQL database engine with optimal settings.
    Returns configured SQLAlchemy engine.
    """
    try:
        engine = create_engine(
            POSTGRES_URL,
            echo=DATABASE_CONFIG['echo'],
            pool_size=5,
            max_overflow=10,
            pool_timeout=DATABASE_CONFIG['connection_timeout'],
            pool_recycle=3600,
            connect_args={
                "connect_timeout": DATABASE_CONFIG['connection_timeout'],
                "application_name": "Python_Data_Analysis_Course"
            }
        )
        
        # Test connection
        with engine.connect() as conn:
            result = conn.execute(text("SELECT version()"))
            version = result.scalar()
            print("✅ PostgreSQL connection established successfully")
            print(f"🐘 Database version: {version[:50]}...")
        
        return engine
        
    except Exception as e:
        print(f"❌ PostgreSQL connection failed: {e}")
        print("🔧 Troubleshooting tips:")
        print("  • Check your internet connection")
        print("  • Verify database credentials")
        print("  • Ensure Supabase database is running")
        raise

def get_table_info(engine):
    """
    Get comprehensive information about all tables in the Olist schemas.
    Returns dictionary with schema-qualified table information.
    """
    inspector = inspect(engine)
    
    # Define the schemas we're interested in
    target_schemas = ['olist_marketing_data_set', 'olist_sales_data_set']
    
    table_info = {}
    
    print("📋 Discovering database schema...")
    
    # Get all available schemas first
    try:
        with engine.connect() as conn:
            result = conn.execute(text("""
                SELECT schema_name 
                FROM information_schema.schemata 
                WHERE schema_name IN ('olist_marketing_data_set', 'olist_sales_data_set')
            """))
            available_schemas = [row[0] for row in result]
            print(f"🔍 Found schemas: {available_schemas}")
    except Exception as e:
        print(f"⚠️ Could not query schemas: {e}")
        available_schemas = target_schemas  # Fallback to expected schemas
    
    # Query tables in each schema
    for schema in available_schemas:
        try:
            tables = inspector.get_table_names(schema=schema)
            print(f"\n📊 Schema '{schema}' contains {len(tables)} tables:")
            
            for table in tables:
                schema_qualified_name = f"{schema}.{table}"
                try:
                    with engine.connect() as conn:
                        # Get row count
                        result = conn.execute(text(f'SELECT COUNT(*) FROM "{schema}"."{table}"'))
                        row_count = result.scalar()
                        
                        # Get column information
                        columns = inspector.get_columns(table, schema=schema)
                        
                        table_info[schema_qualified_name] = {
                            'schema': schema,
                            'table': table,
                            'rows': row_count,
                            'columns': [col['name'] for col in columns],
                            'column_types': {col['name']: str(col['type']) for col in columns}
                        }
                        
                        print(f"  📋 {table}: {row_count:,} rows, {len(columns)} columns")
                        
                except Exception as e:
                    print(f"  ⚠️ Could not access {schema}.{table}: {e}")
                    continue
                    
        except Exception as e:
            print(f"⚠️ Could not access schema '{schema}': {e}")
            continue
    
    return table_info

def execute_query(engine, query, params=None):
    """
    Execute a SQL query with proper error handling and return results as DataFrame.
    """
    try:
        with engine.connect() as conn:
            # pandas accepts params=None, so no separate branch is needed
            return pd.read_sql(text(query), conn, params=params)
            
    except Exception as e:
        print(f"❌ Query execution failed: {e}")
        print(f"📝 Query: {query[:100]}...")
        raise

def get_sample_data(engine, table_name, limit=5):
    """
    Get sample data from a schema-qualified table for exploration.
    table_name should be in format 'schema.table'
    """
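    # Identifiers cannot be bound as parameters, so they are quoted inline;
    # the row limit is a value and is passed as a bound parameter instead.
    if '.' in table_name:
        schema, table = table_name.split('.', 1)
        query = f'SELECT * FROM "{schema}"."{table}" LIMIT :limit'
    else:
        query = f'SELECT * FROM "{table_name}" LIMIT :limit'
    
    return execute_query(engine, query, params={'limit': int(limit)})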

def get_table_schema(engine, table_name):
    """
    Get detailed schema information for a specific table.
    table_name should be in format 'schema.table'
    """
    if '.' in table_name:
        schema, table = table_name.split('.', 1)
    else:
        schema, table = None, table_name
        
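    inspector = inspect(engine)
    columns = inspector.get_columns(table, schema=schema)
    # For PostgreSQL, get_columns() does not report primary keys;
    # ask the inspector for the primary-key constraint instead.
    pk_columns = inspector.get_pk_constraint(table, schema=schema).get('constrained_columns') or []
    
    schema_df = pd.DataFrame([
        {
            'column_name': col['name'],
            'data_type': str(col['type']),
            'nullable': col['nullable'],
            'default': col.get('default'),
            'primary_key': col['name'] in pk_columns
        }
        for col in columns
    ])
    
    return schema_df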

def close_database_engine(engine):
    """
    Properly close database connections.
    """
    if engine:
        engine.dispose()
        print("🔒 PostgreSQL connections closed")

# Create database engine and connect to Supabase
print("🚀 Connecting to Supabase PostgreSQL Database...")
db_engine = create_database_engine()

# Display database information
print("\n📊 Olist E-commerce Database Overview:")
db_info = get_table_info(db_engine)

print(f"\n🗃️ Total tables discovered: {len(db_info)}")
total_rows = sum(info['rows'] for info in db_info.values())
print(f"📏 Total rows across all tables: {total_rows:,}")

🚀 Connecting to Supabase PostgreSQL Database...
✅ PostgreSQL connection established successfully
🐘 Database version: PostgreSQL 17.4 on aarch64-unknown-linux-gnu, comp...

📊 Olist E-commerce Database Overview:
📋 Discovering database schema...
🔍 Found schemas: ['olist_marketing_data_set', 'olist_sales_data_set']

📊 Schema 'olist_marketing_data_set' contains 2 tables:
  📋 olist_marketing_qualified_leads_dataset: 8,000 rows, 4 columns
  📋 olist_closed_deals_dataset: 380 rows, 14 columns

📊 Schema 'olist_sales_data_set' contains 9 tables:
  📋 olist_order_reviews_dataset: 98,410 rows, 7 columns
  📋 olist_order_items_dataset: 112,650 rows, 7 columns
  📋 olist_order_payments_dataset: 103,886 rows, 5 columns
  📋 olist_customers_dataset: 99,441 rows, 5 columns
  📋 olist_orders_dataset: 99,441 rows, 8 columns
  📋 olist_sellers_dataset: 3,095 rows, 4 columns
  📋 product_category_name_translation: 73 rows, 2 columns
  📋 olist_products_dataset: 3

## 2. Exploring the Database Schema

Let's explore the structure of our Olist e-commerce database to understand the business data model.
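
Exploring a schema means placing schema and table names into SQL text, and identifiers cannot be supplied as bound parameters. A small sketch of standard SQL identifier quoting is shown below. `quote_identifier` and `qualified_name` are hypothetical helpers written for this illustration; they are not part of SQLAlchemy or of the lesson's code.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    if not name:
        raise ValueError("identifier must be a non-empty string")
    return '"' + name.replace('"', '""') + '"'


def qualified_name(table_name):
    """Turn 'schema.table' (or just 'table') into a safely quoted name."""
    parts = table_name.split(".", 1)
    return ".".join(quote_identifier(part) for part in parts)


orders_name = qualified_name("olist_sales_data_set.olist_orders_dataset")
odd_name = qualified_name('odd"name')

print(orders_name)
print(odd_name)
```

Names returned by the SQLAlchemy inspector come from the database itself, so this mainly matters when a table name arrives from user input or a configuration file.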

In [25]:
# Real Business Data Exploration: Olist E-commerce Dataset
print("🔍 Real Business Data Exploration")
print("\n" + "="*60)

# Now we have actual data! Let's explore the real Olist e-commerce business model
sales_tables = [table for table in db_info.keys() if table.startswith('olist_sales_data_set.')]
marketing_tables = [table for table in db_info.keys() if table.startswith('olist_marketing_data_set.')]

print(f"\n📊 OLIST SALES DATASET - Brazilian E-commerce Marketplace")
print(f"Found {len(sales_tables)} core business tables:")

# Display actual business tables with their purpose
table_descriptions = {
    'olist_customers_dataset': 'Customer demographics and location data',
    'olist_orders_dataset': 'Order lifecycle and delivery tracking',
    'olist_order_items_dataset': 'Product items within each order',
    'olist_order_payments_dataset': 'Payment methods and transaction values',
    'olist_order_reviews_dataset': 'Customer satisfaction scores and feedback',
    'olist_products_dataset': 'Product catalog with categories and dimensions',
    'olist_sellers_dataset': 'Marketplace seller information and locations',
    'olist_geolocation_dataset': 'Geographic coordinates for Brazilian ZIP codes',
    'product_category_name_translation': 'Portuguese to English category translations'
}

for table in sales_tables:
    if table in db_info:
        info = db_info[table]
        table_name = info['table']
        description = table_descriptions.get(table_name, 'Business dataset')
        print(f"  • {table_name:<35} {info['rows']:>8,} rows")
        print(f"    {description}")

print(f"\n📈 OLIST MARKETING DATASET")
print(f"Found {len(marketing_tables)} marketing tables:")

for table in marketing_tables:
    if table in db_info:
        info = db_info[table]
        table_name = info['table']
        print(f"  • {table_name:<35} {info['rows']:>8,} rows")

# Let's examine the core business relationships
if sales_tables:
    main_orders_table = 'olist_sales_data_set.olist_orders_dataset'
    print(f"\n📋 Core Business Schema - Orders Table Structure:")
    orders_schema = get_table_schema(db_engine, main_orders_table)
    display(orders_schema)
    
    print(f"\n📦 Real Order Data Sample:")
    orders_sample = get_sample_data(db_engine, main_orders_table, 3)
    display(orders_sample)

print("\n💡 Real Business Model Understanding:")
print("  🏪 Olist is a Brazilian e-commerce marketplace (like Amazon)")
print("  🛒 Connects sellers with customers across Brazil")
print("  📦 Handles logistics, payments, and customer service")
print("  📊 Rich dataset: 100K+ orders, 32K+ products, 3K+ sellers")
print("  🌍 Geographic coverage: All Brazilian states")
print("  💰 Business metrics: R$ 13.6M+ in sales, 140+ avg order value")
print(f"  📅 Time period: 2016-2018 Brazilian e-commerce data")

🔍 Real Business Data Exploration


📊 OLIST SALES DATASET - Brazilian E-commerce Marketplace
Found 9 core business tables:
  • olist_order_reviews_dataset           98,410 rows
    Customer satisfaction scores and feedback
  • olist_order_items_dataset            112,650 rows
    Product items within each order
  • olist_order_payments_dataset         103,886 rows
    Payment methods and transaction values
  • olist_customers_dataset               99,441 rows
    Customer demographics and location data
  • olist_orders_dataset                  99,441 rows
    Order lifecycle and delivery tracking
  • olist_sellers_dataset                  3,095 rows
    Marketplace seller information and locations
  • product_category_name_translation         73 rows
    Portuguese to English category translations
  • olist_products_dataset                32,951 rows
    Product catalog with categories and dimensions
  • olist_geolocation_dataset           1,000,163 rows
    Geog

| | column_name | data_type | nullable | default | primary_key |
|---|---|---|---|---|---|
| 0 | order_id | TEXT | False | | False |
| 1 | customer_id | TEXT | True | | False |
| 2 | order_status | TEXT | True | | False |
| 3 | order_purchase_timestamp | TIMESTAMP | True | | False |
| 4 | order_approved_at | TIMESTAMP | True | | False |
| 5 | order_delivered_carrier_date | TIMESTAMP | True | | False |
| 6 | order_delivered_customer_date | TIMESTAMP | True | | False |
| 7 | order_estimated_delivery_date | TIMESTAMP | True | | False |



📦 Real Order Data Sample:


| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 |



💡 Real Business Model Understanding:
  🏪 Olist is a Brazilian e-commerce marketplace (like Amazon)
  🛒 Connects sellers with customers across Brazil
  📦 Handles logistics, payments, and customer service
  📊 Rich dataset: 100K+ orders, 32K+ products, 3K+ sellers
  🌍 Geographic coverage: All Brazilian states
  💰 Business metrics: R$ 13.6M+ in sales, 140+ avg order value
  📅 Time period: 2016-2018 Brazilian e-commerce data


## 3. Running SQL Queries from Python

Now let's execute SQL queries directly from Python and see how they work with real cloud data.
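
Several queries in this section compute each group's share of the total with `COUNT(*) * 100.0 / SUM(COUNT(*)) OVER ()`. The sketch below spells that expression out in two steps, using the standard-library `sqlite3` module (window functions need SQLite 3.25 or newer) and a few invented rows as a stand-in for the Olist customers table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, customer_state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "SP"), ("c2", "SP"), ("c3", "SP"), ("c4", "RJ")],
)

# Step 1 (CTE): one row per state with its count.
# Step 2: SUM(...) OVER () adds up those counts across all rows, so every
# row can see the grand total without the result being collapsed.
share_query = """
WITH state_counts AS (
    SELECT customer_state, COUNT(*) AS customer_count
    FROM customers
    GROUP BY customer_state
)
SELECT
    customer_state,
    customer_count,
    ROUND(customer_count * 100.0 / SUM(customer_count) OVER (), 1) AS percentage
FROM state_counts
ORDER BY customer_count DESC
"""

shares = conn.execute(share_query).fetchall()
for state, count, pct in shares:
    print(f"{state}: {count} customers ({pct}%)")
conn.close()
```

Because the window total is computed before `LIMIT` is applied, the percentages in a top-N query still refer to all rows, not only to the rows shown.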

In [None]:
# Real Business Intelligence SQL Queries
print("🔍 Real Business Intelligence with SQL")
print("\n" + "="*60)

# Now we can run actual business analysis instead of guessing column names!
print("\n📋 Example 1: Customer Distribution Analysis")
print("Business Question: Where are our customers located?")

if sales_tables:
    # Real customer geographic analysis
    customer_geo_query = """
    SELECT 
        customer_state,
        COUNT(*) as customer_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as percentage
    FROM "olist_sales_data_set"."olist_customers_dataset"
    WHERE customer_state IS NOT NULL
    GROUP BY customer_state
    ORDER BY customer_count DESC
    LIMIT 8
    """
    
    try:
        customer_analysis = execute_query(db_engine, customer_geo_query)
        print("✅ Customer Geographic Distribution:")
        display(customer_analysis)
        
        # Report the top state from the query result instead of assuming it
        top_state = customer_analysis.iloc[0]
        print(f"\n💡 Business Insights:")
        print(f"  • {top_state['customer_state']} leads with {top_state['customer_count']:,} customers")
        print(f"  • Top 3 states account for {customer_analysis.head(3)['percentage'].sum():.1f}% of customers")
        print(f"  • Geographic concentration in Southeast Brazil")
    except Exception as e:
        print(f"❌ Query failed: {e}")

print("\n" + "-"*60)
print("\n📊 Example 2: Order Status Pipeline Analysis")
print("Business Question: What's our order fulfillment performance?")

if sales_tables:
    # Real order status analysis
    order_status_query = """
    SELECT 
        order_status,
        COUNT(*) as order_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as percentage
    FROM "olist_sales_data_set"."olist_orders_dataset"
    GROUP BY order_status
    ORDER BY order_count DESC
    """
    
    try:
        order_status_analysis = execute_query(db_engine, order_status_query)
        print("✅ Order Status Distribution:")
        display(order_status_analysis)
        
        delivered_rate = order_status_analysis[order_status_analysis['order_status'] == 'delivered']['percentage'].iloc[0]
        # Total orders is the sum over all statuses, not the count of the top status
        total_orders = order_status_analysis['order_count'].sum()
        print(f"\n💡 Operational Insights:")
        print(f"  • {delivered_rate}% successful delivery rate")
        print(f"  • {total_orders:,} total orders processed")
        print(f"  • Strong operational performance with minimal cancellations")
    except Exception as e:
        print(f"❌ Query failed: {e}")

print("\n" + "-"*60)
print("\n💰 Example 3: Revenue and Payment Analysis")
print("Business Question: What are our key financial metrics?")

if sales_tables:
    # Real revenue analysis
    revenue_query = """
    SELECT 
        payment_type,
        COUNT(*) as transaction_count,
        SUM(payment_value::numeric) as total_revenue,
        AVG(payment_value::numeric) as avg_payment_value
    FROM "olist_sales_data_set"."olist_order_payments_dataset"
    WHERE payment_value IS NOT NULL
    GROUP BY payment_type
    ORDER BY total_revenue DESC
    """
    
    try:
        revenue_analysis = execute_query(db_engine, revenue_query)
        print("✅ Payment Method Analysis:")
        display(revenue_analysis)
        
        total_revenue = revenue_analysis['total_revenue'].sum()
        # Name the top payment method from the result instead of assuming it
        top_method = revenue_analysis.iloc[0]
        print(f"\n💡 Financial Insights:")
        print(f"  • Total revenue: R$ {total_revenue:,.2f}")
        print(f"  • Top payment method ({top_method['payment_type']}): {top_method['total_revenue']/total_revenue*100:.1f}% of revenue")
        print(f"  • Brazilian payment preferences clearly visible")
    except Exception as e:
        print(f"❌ Query failed: {e}")

print("\n" + "-"*60)

In [None]:
# Real Business Relationships: JOIN Analysis
print("\n🔗 Real Business Relationships: JOIN Analysis")
print("Business Question: How do our core business entities connect?")

if sales_tables:
    print("\n📋 Customer-Order-Product Relationship Analysis")
    
    # Real business relationship query showing actual data flow
    customer_order_analysis_query = """
    SELECT 
        c.customer_state,
        COUNT(DISTINCT c.customer_id) as unique_customers,
        COUNT(DISTINCT o.order_id) as total_orders,
        COUNT(oi.order_item_id) as total_items,
        SUM(oi.price::numeric) as total_item_value,
        AVG(oi.price::numeric) as avg_item_price
    FROM "olist_sales_data_set"."olist_customers_dataset" c
    INNER JOIN "olist_sales_data_set"."olist_orders_dataset" o 
        ON c.customer_id = o.customer_id
    INNER JOIN "olist_sales_data_set"."olist_order_items_dataset" oi 
        ON o.order_id = oi.order_id
    WHERE c.customer_state IS NOT NULL 
        AND o.order_status = 'delivered'
        AND oi.price IS NOT NULL
    GROUP BY c.customer_state
    ORDER BY total_item_value DESC
    LIMIT 10
    """
    
    try:
        relationship_analysis = execute_query(db_engine, customer_order_analysis_query)
        print("✅ Customer-Order-Product JOIN Analysis:")
        display(relationship_analysis)
        
        if len(relationship_analysis) > 0:
            top_state = relationship_analysis.iloc[0]
            print(f"\n🔍 JOIN Analysis Insights:")
            print(f"  • Top revenue state: {top_state['customer_state']}")
            print(f"  • {top_state['unique_customers']:,} customers generated R$ {top_state['total_item_value']:,.2f}")
            print(f"  • Average item price: R$ {top_state['avg_item_price']:,.2f}")
            print(f"  • The tables link through their shared key columns (customer_id, order_id)")
    except Exception as e:
        print(f"❌ JOIN analysis failed: {e}")

    print("\n" + "-"*50)
    print("\n⭐ Product Performance with Reviews Analysis")
    
    # Real product-review relationship analysis
    product_review_query = """
    SELECT 
        p.product_category_name,
        t.product_category_name_english,
        COUNT(DISTINCT p.product_id) as unique_products,
        COUNT(r.review_id) as total_reviews,
        AVG(r.review_score::numeric) as avg_review_score,
        SUM(oi.price::numeric) as category_revenue
    FROM "olist_sales_data_set"."olist_products_dataset" p
    INNER JOIN "olist_sales_data_set"."olist_order_items_dataset" oi 
        ON p.product_id = oi.product_id
    INNER JOIN "olist_sales_data_set"."olist_order_reviews_dataset" r 
        ON oi.order_id = r.order_id
    LEFT JOIN "olist_sales_data_set"."product_category_name_translation" t
        ON p.product_category_name = t.product_category_name
    WHERE p.product_category_name IS NOT NULL
        AND r.review_score IS NOT NULL
        AND oi.price IS NOT NULL
    GROUP BY p.product_category_name, t.product_category_name_english
    ORDER BY category_revenue DESC
    LIMIT 8
    """
    
    try:
        product_performance = execute_query(db_engine, product_review_query)
        print("✅ Product Category Performance (with Reviews):")
        display(product_performance)
        
        print(f"\n💡 Multi-Table JOIN Insights:")
        print(f"  • Successfully joined 4 tables: products → order_items → reviews → translations")
        print(f"  • Real business metrics: revenue + customer satisfaction")
        print(f"  • Shows complex relationships in e-commerce data")
    except Exception as e:
        print(f"❌ Product performance analysis failed: {e}")

print("\n" + "-"*60)
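
One thing to check in multi-table joins like the product and review query above: when an order has several items and several reviews, joining both tables on `order_id` repeats each item once per review, which inflates sums such as revenue. The sketch below reproduces the effect on invented rows with the standard-library `sqlite3` module and shows one possible remedy, aggregating reviews per order before joining. Whether the Olist data contains enough multi-review orders to change the results noticeably is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id TEXT, price REAL);
CREATE TABLE order_reviews (order_id TEXT, review_score INTEGER);
INSERT INTO order_items VALUES ('o1', 10.0), ('o1', 20.0), ('o2', 5.0);
INSERT INTO order_reviews VALUES ('o1', 5), ('o2', 4), ('o2', 2);
""")

# Direct join: the single item of order o2 appears twice (once per review).
naive_revenue = conn.execute("""
    SELECT SUM(oi.price)
    FROM order_items oi
    JOIN order_reviews r ON oi.order_id = r.order_id
""").fetchone()[0]

# Collapse reviews to one row per order first, then join.
safe_revenue = conn.execute("""
    WITH review_per_order AS (
        SELECT order_id, AVG(review_score) AS avg_score
        FROM order_reviews
        GROUP BY order_id
    )
    SELECT SUM(oi.price)
    FROM order_items oi
    JOIN review_per_order r ON oi.order_id = r.order_id
""").fetchone()[0]

print(f"naive: {naive_revenue}, de-duplicated: {safe_revenue}")
conn.close()
```

A quick sanity check for any join-based revenue figure is to compare it with `SUM(price)` taken from the items table alone.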

In [None]:
# Example 4: Marketing funnel analysis
print("\n📈 Example 4: Marketing Analysis")
print("Business Question: How effective are our marketing channels?")

if marketing_tables:
    main_marketing_table = marketing_tables[0]
    schema, table = main_marketing_table.split('.', 1)
    
    # Examine marketing table structure
    marketing_sample = get_sample_data(db_engine, main_marketing_table, 1)
    print(f"\nActual columns in {table}:")
    print(list(marketing_sample.columns))
    
    marketing_columns = db_info[main_marketing_table]['columns']
    
    # Look for relevant marketing columns
    channel_cols = [col for col in marketing_columns if any(keyword in col.lower() for keyword in ['origin', 'source', 'channel', 'medium'])]
    lead_cols = [col for col in marketing_columns if any(keyword in col.lower() for keyword in ['lead', 'mql', 'conversion'])]
    type_cols = [col for col in marketing_columns if any(keyword in col.lower() for keyword in ['type', 'category', 'segment'])]
    
    print(f"\nChannel-related columns: {channel_cols}")
    print(f"Lead-related columns: {lead_cols}")
    print(f"Type/Category columns: {type_cols}")
    
    # Try analysis with available columns
    analysis_col = channel_cols[0] if channel_cols else (type_cols[0] if type_cols else None)
    
    if analysis_col:
        # Build marketing analysis query with schema qualification
        marketing_query = f"""
        SELECT 
            "{analysis_col}" as marketing_dimension,
            COUNT(*) as total_records,
            ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
        FROM "{schema}"."{table}"
        WHERE "{analysis_col}" IS NOT NULL
        GROUP BY "{analysis_col}"
        ORDER BY total_records DESC
        LIMIT 10
        """
        
        try:
            marketing_analysis = execute_query(db_engine, marketing_query)
            print(f"\n✅ Marketing Analysis by {analysis_col.replace('_', ' ').title()}:")
            display(marketing_analysis)
            
            if len(marketing_analysis) > 0:
                top_dimension = marketing_analysis.iloc[0]
                print(f"\n💡 Marketing Insights:")
                print(f"  • Top {analysis_col.replace('_', ' ')}: {top_dimension['marketing_dimension']} ({top_dimension['percentage']}%)")
                print(f"  • Categories shown (query is limited to 10): {len(marketing_analysis)}")
                print(f"  • Records in top category: {top_dimension['total_records']:,}")
        except Exception as e:
            print(f"❌ Marketing analysis failed: {e}")
    else:
        # Fallback: show basic table statistics
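        print("⚠️ No obvious analysis columns found, showing basic statistics:")
        
        # COUNT(DISTINCT *) is not valid PostgreSQL; count the rows of a
        # SELECT DISTINCT subquery instead
        basic_marketing_query = f"""
        SELECT 
            COUNT(*) as total_records,
            (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM "{schema}"."{table}") AS d) as unique_records
        FROM "{schema}"."{table}"
        """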
        
        try:
            basic_stats = execute_query(db_engine, basic_marketing_query)
            print("✅ Basic Marketing Table Statistics:")
            display(basic_stats)
        except Exception as e:
            print(f"❌ Basic statistics failed: {e}")
            print("Showing sample data instead:")
            display(marketing_sample)

else:
    print("⚠️ No marketing tables found in the database")

print("\n" + "-"*60)
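
PostgreSQL does not accept `COUNT(DISTINCT *)`. One portable way to count distinct rows is to count the output of a `SELECT DISTINCT *` subquery. A minimal sketch follows, again using the standard-library `sqlite3` module as a stand-in; the `leads` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (mql_id TEXT, origin TEXT)")
conn.executemany(
    "INSERT INTO leads VALUES (?, ?)",
    [("m1", "social"), ("m2", "email"), ("m2", "email")],  # one duplicate row
)

# The subquery removes duplicate rows; the outer COUNT(*) counts what is left.
# The alias (AS d) is included because older PostgreSQL versions require one.
total_records, unique_records = conn.execute("""
    SELECT
        (SELECT COUNT(*) FROM leads) AS total_records,
        (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM leads) AS d) AS unique_records
""").fetchone()

print(total_records, unique_records)
conn.close()
```

A difference between the two numbers indicates fully duplicated rows, which is often worth checking before aggregating.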

## 4. Advanced SQL Features

Let's explore more sophisticated SQL queries that are common in business intelligence scenarios.

In [None]:
# Advanced SQL: Real Time-Series Analysis with Brazilian E-commerce Data
print("🧠 Advanced SQL Analysis: Real Time-Series & Window Functions")
print("\n" + "="*60)

print("\n📅 Example: Brazilian E-commerce Growth Trends")
print("Business Question: How did Olist's business grow month-by-month?")

if sales_tables:
    # Real time-series analysis with actual order dates
    time_analysis_query = """
    SELECT 
        DATE_TRUNC('month', order_purchase_timestamp) as month,
        COUNT(*) as monthly_orders,
        COUNT(DISTINCT customer_id) as unique_customers,
        LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', order_purchase_timestamp)) as prev_month_orders,
        CASE 
            WHEN LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', order_purchase_timestamp)) IS NOT NULL
            THEN ROUND(
                (COUNT(*) - LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', order_purchase_timestamp))) * 100.0 / 
                LAG(COUNT(*), 1) OVER (ORDER BY DATE_TRUNC('month', order_purchase_timestamp)), 
                1
            )
            ELSE NULL
        END as month_over_month_growth
    FROM "olist_sales_data_set"."olist_orders_dataset"
    WHERE order_purchase_timestamp IS NOT NULL
        AND order_purchase_timestamp >= '2017-01-01'
        AND order_purchase_timestamp < '2019-01-01'
    GROUP BY DATE_TRUNC('month', order_purchase_timestamp)
    ORDER BY month
    """
    
    try:
        time_analysis = execute_query(db_engine, time_analysis_query)
        print("\n✅ Monthly Growth Analysis (2017-2018):")
        display(time_analysis)
        
        if len(time_analysis) > 1:
            # Calculate business insights
            avg_monthly_orders = time_analysis['monthly_orders'].mean()
            peak_month = time_analysis.loc[time_analysis['monthly_orders'].idxmax()]
            
            print(f"\n📈 Brazilian E-commerce Trends:")
            print(f"  • Average monthly orders: {avg_monthly_orders:.0f}")
            print(f"  • Peak month: {peak_month['month']} with {peak_month['monthly_orders']:,} orders")
            print(f"  • Growth rate analysis shows seasonal e-commerce patterns")
            print(f"  • Window functions (LAG) enable month-over-month calculations")
    except Exception as e:
        print(f"❌ Time analysis failed: {e}")

    print("\n" + "-"*50)
    print("\n🌍 Geographic Analysis with Window Functions")
    print("Business Question: How do Brazilian states rank by business performance?")
    
    # Advanced window functions for regional ranking
    regional_ranking_query = """
    SELECT 
        c.customer_state,
        COUNT(DISTINCT o.order_id) as total_orders,
        SUM(oi.price::numeric + oi.freight_value::numeric) as total_revenue,
        ROUND(AVG(oi.price::numeric), 2) as avg_item_price,
        RANK() OVER (ORDER BY COUNT(DISTINCT o.order_id) DESC) as order_rank,
        RANK() OVER (ORDER BY SUM(oi.price::numeric + oi.freight_value::numeric) DESC) as revenue_rank,
        ROUND(
            SUM(oi.price::numeric + oi.freight_value::numeric) * 100.0 / 
            SUM(SUM(oi.price::numeric + oi.freight_value::numeric)) OVER(), 2
        ) as revenue_percentage
    FROM "olist_sales_data_set"."olist_customers_dataset" c
    INNER JOIN "olist_sales_data_set"."olist_orders_dataset" o ON c.customer_id = o.customer_id
    INNER JOIN "olist_sales_data_set"."olist_order_items_dataset" oi ON o.order_id = oi.order_id
    WHERE c.customer_state IS NOT NULL 
        AND o.order_status = 'delivered'
        AND oi.price IS NOT NULL
        AND oi.freight_value IS NOT NULL
    GROUP BY c.customer_state
    ORDER BY total_revenue DESC
    LIMIT 10
    """
    
    try:
        regional_analysis = execute_query(db_engine, regional_ranking_query)
        print("\n✅ Regional Performance Ranking:")
        display(regional_analysis)
        
        print(f"\n🏆 Geographic Business Insights:")
        top_state = regional_analysis.iloc[0]
        print(f"  • #1 Revenue State: {top_state['customer_state']} (R$ {top_state['total_revenue']:,.2f})")
        print(f"  • Market concentration: Top 3 states = {regional_analysis.head(3)['revenue_percentage'].sum():.1f}% of revenue")
        print(f"  • Window functions enable sophisticated ranking and percentage calculations")
        print(f"  • Real business intelligence for Brazilian market analysis")
    except Exception as e:
        print(f"❌ Regional analysis failed: {e}")

print("\n💡 Advanced SQL Features Demonstrated:")
print("  • DATE_TRUNC() for time-series grouping with real dates")
print("  • LAG() window function for month-over-month growth analysis")
print("  • RANK() for competitive regional analysis")
print("  • SUM() OVER() for percentage calculations across result set")
print("  • Complex JOINs with real business logic")
print("  • Brazilian e-commerce seasonality and geographic patterns")

print("\n" + "-"*60)
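
The ranking and share-of-total pattern used in the regional query can be tried without a database connection. The following is a minimal sketch, assuming an in-memory SQLite database (version 3.25 or later, which supports window functions) as a stand-in for the Supabase PostgreSQL instance. The `state_sales` table and its values are invented for illustration and are not Olist data.

```python
import sqlite3

# Hypothetical toy data: (state, revenue) rows standing in for joined order items.
rows = [
    ("SP", 500.0), ("SP", 300.0),
    ("RJ", 150.0),
    ("MG", 50.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_sales (state TEXT, revenue REAL)")
conn.executemany("INSERT INTO state_sales VALUES (?, ?)", rows)

query = """
WITH state_totals AS (
    -- Aggregate first, so the window functions see one row per state
    SELECT state, SUM(revenue) AS total_revenue
    FROM state_sales
    GROUP BY state
)
SELECT
    state,
    total_revenue,
    RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank,
    ROUND(total_revenue * 100.0 / SUM(total_revenue) OVER (), 2) AS revenue_percentage
FROM state_totals
ORDER BY total_revenue DESC
"""
ranking = conn.execute(query).fetchall()
conn.close()

for state, total, rank, pct in ranking:
    print(f"{rank}. {state}: {total:,.2f} ({pct}% of revenue)")
```

`SUM(...) OVER ()` with an empty window computes the grand total on every row, which is what turns each state's revenue into a percentage without a second query.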

In [None]:
# Advanced SQL: Real CTEs for Business Intelligence
print("\n💼 Advanced SQL: CTEs for Real Business Intelligence")
print("Business Question: Can we build a customer lifetime value model?")

if sales_tables:
    # Real CTE analysis using actual Brazilian e-commerce data
    print("\n📊 Customer Lifetime Value Analysis (using CTEs)")
    
    clv_analysis_query = """
    WITH customer_metrics AS (
        -- Calculate per-customer metrics
        SELECT 
            c.customer_id,
            c.customer_state,
            COUNT(DISTINCT o.order_id) as total_orders,
            SUM(oi.price::numeric + oi.freight_value::numeric) as lifetime_value,
            AVG(oi.price::numeric) as avg_item_price,
            MIN(o.order_purchase_timestamp) as first_order_date,
            MAX(o.order_purchase_timestamp) as last_order_date
        FROM "olist_sales_data_set"."olist_customers_dataset" c
        INNER JOIN "olist_sales_data_set"."olist_orders_dataset" o ON c.customer_id = o.customer_id
        INNER JOIN "olist_sales_data_set"."olist_order_items_dataset" oi ON o.order_id = oi.order_id
        WHERE o.order_status = 'delivered'
            AND oi.price IS NOT NULL 
            AND oi.freight_value IS NOT NULL
        GROUP BY c.customer_id, c.customer_state
    ),
    customer_segments AS (
        -- Segment customers based on behavior
        SELECT 
            customer_id,
            customer_state,
            total_orders,
            lifetime_value,
            avg_item_price,
            CASE 
                WHEN lifetime_value >= (SELECT AVG(lifetime_value) * 2 FROM customer_metrics) THEN 'High Value'
                WHEN total_orders > 1 THEN 'Repeat Customer'
                ELSE 'Single Purchase'
            END as customer_segment
        FROM customer_metrics
    ),
    state_performance AS (
        -- Aggregate by state and segment
        SELECT 
            customer_state,
            customer_segment,
            COUNT(*) as customer_count,
            AVG(lifetime_value) as avg_clv,
            SUM(lifetime_value) as total_state_revenue
        FROM customer_segments
        GROUP BY customer_state, customer_segment
    )
    SELECT 
        customer_state,
        customer_segment,
        customer_count,
        ROUND(avg_clv, 2) as avg_customer_lifetime_value,
        ROUND(total_state_revenue, 2) as segment_revenue
    FROM state_performance
    WHERE customer_state IN ('SP', 'RJ', 'MG', 'RS', 'PR')  -- Top 5 states
    ORDER BY customer_state, segment_revenue DESC
    """
    
    try:
        clv_analysis = execute_query(db_engine, clv_analysis_query)
        print("\n✅ Customer Lifetime Value by State & Segment:")
        display(clv_analysis)
        
        print(f"\n🎯 CTE Business Intelligence Insights:")
        # Analyze the segments
        high_value_customers = clv_analysis[clv_analysis['customer_segment'] == 'High Value']
        if len(high_value_customers) > 0:
            best_high_value_state = high_value_customers.loc[high_value_customers['avg_customer_lifetime_value'].idxmax()]
            print(f"  • Best high-value state: {best_high_value_state['customer_state']}")
            print(f"  • High-value CLV: R$ {best_high_value_state['avg_customer_lifetime_value']:,.2f}")
        
        print(f"  • CTEs enabled complex 3-step analysis: metrics → segments → aggregation")
        print(f"  • Real business model: customer segmentation for Brazilian e-commerce")
    except Exception as e:
        print(f"❌ CTE analysis failed: {e}")

    print("\n" + "-"*50)
    print("\n🔄 Advanced CTE: Product Category Performance Analysis")
    
    category_performance_query = """
    WITH product_sales AS (
        -- Product-level sales metrics
        SELECT 
            p.product_category_name,
            p.product_id,
            SUM(oi.price::numeric) as product_revenue,
            COUNT(oi.order_id) as times_sold,
            AVG(r.review_score::numeric) as avg_review_score
        FROM "olist_sales_data_set"."olist_products_dataset" p
        INNER JOIN "olist_sales_data_set"."olist_order_items_dataset" oi ON p.product_id = oi.product_id
        LEFT JOIN "olist_sales_data_set"."olist_order_reviews_dataset" r ON oi.order_id = r.order_id
        WHERE p.product_category_name IS NOT NULL
            AND oi.price IS NOT NULL
        GROUP BY p.product_category_name, p.product_id
    ),
    category_summary AS (
        -- Category-level aggregation
        SELECT 
            product_category_name,
            COUNT(DISTINCT product_id) as unique_products,
            SUM(product_revenue) as category_revenue,
            AVG(avg_review_score) as category_satisfaction,
            SUM(times_sold) as total_units_sold
        FROM product_sales
        WHERE avg_review_score IS NOT NULL
        GROUP BY product_category_name
    ),
    category_ranking AS (
        -- Add performance rankings
        SELECT 
            product_category_name,
            unique_products,
            ROUND(category_revenue, 2) as category_revenue,
            ROUND(category_satisfaction, 2) as avg_satisfaction_score,
            total_units_sold,
            RANK() OVER (ORDER BY category_revenue DESC) as revenue_rank,
            RANK() OVER (ORDER BY category_satisfaction DESC) as satisfaction_rank
        FROM category_summary
    )
    SELECT 
        product_category_name,
        unique_products,
        category_revenue,
        avg_satisfaction_score,
        total_units_sold,
        revenue_rank,
        satisfaction_rank
    FROM category_ranking
    ORDER BY category_revenue DESC
    LIMIT 10
    """
    
    try:
        category_analysis = execute_query(db_engine, category_performance_query)
        print("\n✅ Product Category Performance (Revenue vs Satisfaction):")
        display(category_analysis)
        
        print(f"\n📈 Category Performance Insights:")
        top_revenue = category_analysis.iloc[0]
        best_satisfaction = category_analysis.loc[category_analysis['avg_satisfaction_score'].idxmax()]
        
        print(f"  • Top revenue category: {top_revenue['product_category_name']} (R$ {top_revenue['category_revenue']:,.2f})")
        print(f"  • Highest satisfaction: {best_satisfaction['product_category_name']} ({best_satisfaction['avg_satisfaction_score']:.2f} stars)")
        print(f"  • Complex CTEs: sales → category → ranking analysis")
        print(f"  • Real business intelligence: revenue vs customer satisfaction trade-offs")
    except Exception as e:
        print(f"❌ Category performance analysis failed: {e}")

print("\n💡 Advanced SQL Features Demonstrated:")
print("  • Multi-level CTEs for complex business logic")
print("  • Customer segmentation with CASE statements")
print("  • Subqueries for dynamic threshold calculations")
print("  • Window functions (RANK) for revenue and satisfaction ranking")
print("  • Real business intelligence metrics (CLV, satisfaction, performance)")
print("  • Brazilian e-commerce insights with actual marketplace data")

print("\n🎓 Educational Value:")
print("  • Students see real SQL patterns used in business intelligence")
print("  • Complex queries broken down into logical, reusable CTEs")
print("  • Actual business insights from real e-commerce operations")
print("  • Practical application of advanced SQL concepts")
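
The three-step CTE chain (metrics, then segments, then aggregation) can be reproduced on a toy table to see how each step feeds the next. This is a minimal sketch, assuming an in-memory SQLite database in place of PostgreSQL. The `orders` table and its rows are hypothetical, and the segmentation rule mirrors the `CASE` expression in the CLV query.

```python
import sqlite3

# Hypothetical orders: (customer_id, order_id, amount). Not Olist data.
orders = [
    ("c1", "o1", 100.0), ("c1", "o2", 900.0),  # repeat buyer, high spend
    ("c2", "o3", 50.0), ("c2", "o4", 60.0),    # repeat buyer, low spend
    ("c3", "o5", 40.0),                        # single purchase
    ("c4", "o6", 50.0),                        # single purchase
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, order_id TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

query = """
WITH customer_metrics AS (          -- step 1: one row per customer
    SELECT customer_id,
           COUNT(DISTINCT order_id) AS total_orders,
           SUM(amount) AS lifetime_value
    FROM orders
    GROUP BY customer_id
),
customer_segments AS (              -- step 2: label each customer
    SELECT customer_id,
           lifetime_value,
           CASE
               WHEN lifetime_value >= (SELECT AVG(lifetime_value) * 2 FROM customer_metrics)
                   THEN 'High Value'
               WHEN total_orders > 1 THEN 'Repeat Customer'
               ELSE 'Single Purchase'
           END AS customer_segment
    FROM customer_metrics
)
SELECT customer_segment,            -- step 3: aggregate per segment
       COUNT(*) AS customer_count,
       SUM(lifetime_value) AS segment_revenue
FROM customer_segments
GROUP BY customer_segment
ORDER BY segment_revenue DESC
"""
segments = conn.execute(query).fetchall()
conn.close()

for segment, count, revenue in segments:
    print(f"{segment}: {count} customer(s), revenue {revenue:,.2f}")
```

Because `CASE` branches are evaluated in order, a repeat buyer whose spend is at least twice the average is labelled 'High Value' rather than 'Repeat Customer'; reordering the branches would change the segment sizes.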

In [None]:
# Advanced SQL Example: Common Table Expressions (CTEs)
print("\n💼 Advanced SQL: CTEs for Complex Business Logic")
print("Business Question: Can we segment our data for deeper insights?")

if sales_tables:
    main_sales_table = sales_tables[1]
    schema, table = main_sales_table.split('.', 1)
    
    # Build a more complex query with CTEs
    available_columns = db_info[main_sales_table]['columns']
    
    # Look for different types of columns
    numeric_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['price', 'value', 'amount', 'cost'])]
    location_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['state', 'city', 'region'])]
    customer_cols = [col for col in available_columns if 'customer' in col.lower()]
    
    print(f"\nNumeric value columns found: {numeric_cols}")
    print(f"Location columns found: {location_cols}")
    print(f"Customer columns found: {customer_cols}")
    
    if numeric_cols and location_cols:
        # Build CTE query for business segmentation with schema qualification
        cte_query = f"""
        WITH regional_stats AS (
            SELECT 
                "{location_cols[0]}" as region,
                COUNT(*) as total_records,
                AVG("{numeric_cols[0]}") as avg_value,
                STDDEV("{numeric_cols[0]}") as value_stddev
            FROM "{schema}"."{table}"
            WHERE "{location_cols[0]}" IS NOT NULL 
                AND "{numeric_cols[0]}" IS NOT NULL
                AND "{numeric_cols[0]}" > 0
            GROUP BY "{location_cols[0]}"
        ),
        regional_segments AS (
            SELECT 
                region,
                total_records,
                ROUND(avg_value, 2) as avg_value,
                CASE 
                    WHEN avg_value > (SELECT AVG(avg_value) FROM regional_stats) THEN 'High Value'
                    WHEN total_records > (SELECT AVG(total_records) FROM regional_stats) THEN 'High Volume'
                    ELSE 'Standard'
                END as segment
            FROM regional_stats
        )
        SELECT 
            segment,
            COUNT(*) as region_count,
            SUM(total_records) as total_records,
            ROUND(AVG(avg_value), 2) as segment_avg_value
        FROM regional_segments
        GROUP BY segment
        ORDER BY segment_avg_value DESC
        """
        
        try:
            cte_analysis = execute_query(db_engine, cte_query)
            print("\n✅ Regional Segmentation Analysis (using CTEs):")
            display(cte_analysis)
            
            print(f"\n🎯 Segmentation Insights:")
            for _, row in cte_analysis.iterrows():
                print(f"  • {row['segment']}: {row['region_count']} regions, avg value: {row['segment_avg_value']}")
        except Exception as e:
            print(f"❌ CTE analysis failed: {e}")
            print("Trying simpler segmentation...")
            
            # Fallback: simpler analysis
            simple_cte_query = f"""
            WITH location_summary AS (
                SELECT 
                    "{location_cols[0]}" as location,
                    COUNT(*) as record_count,
                    AVG("{numeric_cols[0]}") as avg_value
                FROM "{schema}"."{table}"
                WHERE "{location_cols[0]}" IS NOT NULL 
                    AND "{numeric_cols[0]}" IS NOT NULL
                GROUP BY "{location_cols[0]}"
            )
            SELECT 
                COUNT(*) as total_locations,
                ROUND(AVG(record_count), 0) as avg_records_per_location,
                ROUND(AVG(avg_value), 2) as overall_avg_value
            FROM location_summary
            """
            
            try:
                simple_cte = execute_query(db_engine, simple_cte_query)
                print("\n✅ Simple Location Analysis:")
                display(simple_cte)
            except Exception as e2:
                print(f"❌ Simple CTE also failed: {e2}")
    
    elif customer_cols and location_cols:
        # Alternative CTE with customer and location data
        customer_cte_query = f"""
        WITH customer_location_stats AS (
            SELECT 
                "{location_cols[0]}" as location,
                COUNT(DISTINCT "{customer_cols[0]}") as unique_customers,
                COUNT(*) as total_records
            FROM "{schema}"."{table}"
            WHERE "{location_cols[0]}" IS NOT NULL 
                AND "{customer_cols[0]}" IS NOT NULL
            GROUP BY "{location_cols[0]}"
        )
        SELECT 
            location,
            unique_customers,
            total_records,
            ROUND(total_records * 1.0 / unique_customers, 2) as records_per_customer
        FROM customer_location_stats
        ORDER BY unique_customers DESC
        LIMIT 10
        """
        
        try:
            customer_cte = execute_query(db_engine, customer_cte_query)
            print("\n✅ Customer-Location Analysis (using CTEs):")
            display(customer_cte)
        except Exception as e:
            print(f"❌ Customer CTE analysis failed: {e}")
    
    else:
        print("⚠️ Insufficient columns for CTE segmentation analysis")
        print("Available for analysis:")
        print(f"  • Numeric columns: {len(numeric_cols)}")
        print(f"  • Location columns: {len(location_cols)}")
        print(f"  • Customer columns: {len(customer_cols)}")

else:
    print("⚠️ No sales tables available for CTE analysis")

print("\n💡 Advanced SQL Features Demonstrated:")
print("  • Window functions (LAG, OVER) for time-series analysis")
print("  • Date functions (DATE_TRUNC) for temporal grouping")
print("  • CTEs for complex multi-step business logic")
print("  • CASE statements for business rule implementation")
print("  • Subqueries for dynamic threshold calculations")
print("  • Schema-qualified table references for multi-schema databases")
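
The generic CTE cell builds its SQL by interpolating discovered column names into f-strings. Values can be bound as parameters, but identifiers cannot, so one defensive option is to check every identifier against a known catalogue and quote it before interpolation. The sketch below illustrates that idea; `quote_identifier`, `build_group_count_query`, and the `catalogue` dictionary are hypothetical names, and the catalogue only imitates the shape of `db_info[table]['columns']`.

```python
def quote_identifier(name: str) -> str:
    """Quote a SQL identifier PostgreSQL-style, doubling any embedded double quotes."""
    if not name or "\x00" in name:
        raise ValueError("invalid identifier")
    return '"' + name.replace('"', '""') + '"'


def build_group_count_query(qualified_table: str, column: str, known_columns: dict) -> str:
    """Build a GROUP BY query using only identifiers found in an allow-list.

    known_columns maps 'schema.table' to a list of column names (an assumed
    shape, chosen to resemble the notebook's db_info dictionary).
    """
    if qualified_table not in known_columns:
        raise ValueError(f"unknown table: {qualified_table}")
    if column not in known_columns[qualified_table]:
        raise ValueError(f"unknown column: {column}")
    schema, table = qualified_table.split(".", 1)
    col = quote_identifier(column)
    return (
        f"SELECT {col} AS category, COUNT(*) AS record_count "
        f"FROM {quote_identifier(schema)}.{quote_identifier(table)} "
        f"WHERE {col} IS NOT NULL GROUP BY {col}"
    )


# Hypothetical catalogue of tables and columns.
catalogue = {
    "olist_sales_data_set.olist_customers_dataset": ["customer_id", "customer_state"],
}
sql = build_group_count_query(
    "olist_sales_data_set.olist_customers_dataset", "customer_state", catalogue
)
print(sql)
```

The allow-list check does most of the work here; the quoting step is a second line of defence for names that contain unusual characters.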

## 5. SQL vs Pandas: When to Use Each Approach

Let's compare the strengths of SQL versus pandas for different types of data operations.

In [None]:
def compare_sql_vs_pandas_approaches():
    """
    Compare SQL and pandas approaches for different types of analysis.
    """
    print("⚡ SQL vs Pandas: Strategic Comparison")
    print("\n" + "="*60)
    
    # Example 1: Simple aggregation comparison
    print("\n📊 Example 1: Simple Aggregation")
    print("Task: Count records by category")
    
    if sales_tables:
        main_sales_table = sales_tables[0]
        schema, table = main_sales_table.split('.', 1)
        
        # Find a categorical column for grouping
        available_columns = db_info[main_sales_table]['columns']
        location_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['state', 'city', 'region'])]
        
        if location_cols:
            print("\n🗄️ SQL Approach:")
            sql_agg_query = f"""
            SELECT 
                "{location_cols[0]}" as category,
                COUNT(*) as record_count,
                ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
            FROM "{schema}"."{table}"
            WHERE "{location_cols[0]}" IS NOT NULL
            GROUP BY "{location_cols[0]}"
            ORDER BY record_count DESC
            LIMIT 5
            """
            
            try:
                sql_result = execute_query(db_engine, sql_agg_query)
                print("✅ SQL Result:")
                display(sql_result)
                
                print("\n🐼 Pandas Equivalent (conceptual):")
                print("""
                # If we had the data in a pandas DataFrame:
                pandas_result = (
                    df.groupby('category')['record_id']
                    .count()
                    .sort_values(ascending=False)
                    .head(5)
                )
                # Then calculate percentages:
                pandas_result_pct = pandas_result / pandas_result.sum() * 100
                """)
                
                # Now let's actually demonstrate with the SQL result
                if len(sql_result) > 0:
                    print("\n🔄 Converting SQL result to pandas for further analysis:")
                    # Calculate additional statistics using pandas
                    total_records = sql_result['record_count'].sum()
                    avg_records = sql_result['record_count'].mean()
                    std_records = sql_result['record_count'].std()
                    
                    print(f"  • Total records: {total_records:,}")
                    print(f"  • Average per category: {avg_records:.1f}")
                    print(f"  • Standard deviation: {std_records:.1f}")
                    print(f"  • Coefficient of variation: {(std_records/avg_records)*100:.1f}%")
                    
            except Exception as e:
                print(f"❌ SQL aggregation failed: {e}")
    
    # Example 2: When SQL excels
    print("\n" + "-"*40)
    print("\n📅 Example 2: When SQL Excels - Date Operations")
    
    if sales_tables:
        main_sales_table = sales_tables[0]
        schema, table = main_sales_table.split('.', 1)
        available_columns = db_info[main_sales_table]['columns']
        date_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['date', 'timestamp', 'time'])]
        
        if date_cols:
            print("\n🗄️ SQL Approach (Superior for date functions):")
            sql_date_query = f"""
            SELECT 
                EXTRACT(YEAR FROM "{date_cols[0]}") as year,
                EXTRACT(QUARTER FROM "{date_cols[0]}") as quarter,
                COUNT(*) as quarterly_records
            FROM "{schema}"."{table}"
            WHERE "{date_cols[0]}" IS NOT NULL
            GROUP BY EXTRACT(YEAR FROM "{date_cols[0]}"), EXTRACT(QUARTER FROM "{date_cols[0]}")
            ORDER BY year, quarter
            LIMIT 8
            """
            
            try:
                sql_date_result = execute_query(db_engine, sql_date_query)
                print("✅ SQL Date Analysis:")
                display(sql_date_result)
                
                print("\n💡 SQL Advantage: Date extraction and grouping in one step")
                print("🐼 Pandas equivalent would require:")
                print("  df['year'] = df['date'].dt.year")
                print("  df['quarter'] = df['date'].dt.quarter")
                print("  result = df.groupby(['year', 'quarter']).size()")
                
            except Exception as e:
                print(f"❌ SQL date analysis failed: {e}")
    
    # Analysis summary
    print("\n" + "="*60)
    print("\n🎯 When to Use SQL vs Pandas:")
    
    print("\n🗄️ Use SQL when:")
    print("  • Working with large datasets (millions of rows)")
    print("  • Need complex JOINs across multiple tables/schemas")
    print("  • Performing set operations (UNION, INTERSECT, EXCEPT)")
    print("  • Using window functions for analytics")
    print("  • Implementing business logic with CASE statements")
    print("  • Need database-level performance optimization")
    print("  • Data lives in cloud databases (like our Supabase setup)")
    
    print("\n🐼 Use Pandas when:")
    print("  • Dataset fits comfortably in memory")
    print("  • Need statistical analysis (correlation, regression)")
    print("  • Data cleaning and transformation tasks")
    print("  • Creating visualizations")
    print("  • Machine learning feature engineering")
    print("  • Iterative data exploration and experimentation")
    
    print("\n🔄 Best Practice: Hybrid Approach")
    print("  1. Use SQL for data extraction and initial processing")
    print("  2. Use pandas for analysis, statistics, and visualization")
    print("  3. Leverage each tool's strengths for optimal performance")
    print("  4. Handle schema-qualified table names in SQL")
    print("  5. Convert SQL results to pandas for advanced analytics")
    
    return "SQL excels at data processing, pandas excels at analysis"

# Run the comparison
comparison_insights = compare_sql_vs_pandas_approaches()
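
The hybrid approach (aggregate in SQL, then analyse the small result in Python) can be sketched end to end with the standard library. This example assumes an in-memory SQLite table as a stand-in for the cloud database and uses the `statistics` module in place of pandas so that it runs without extra dependencies. The `customers` table and its row counts are invented.

```python
import sqlite3
import statistics

# Hypothetical customer rows; the state codes are real, the counts are invented.
customers = [("SP",)] * 6 + [("RJ",)] * 3 + [("MG",)] * 3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_state TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)", customers)

# Step 1 (SQL): let the database reduce raw rows to one row per category.
per_state = conn.execute(
    """
    SELECT customer_state, COUNT(*) AS record_count
    FROM customers
    GROUP BY customer_state
    ORDER BY record_count DESC, customer_state
    """
).fetchall()
conn.close()

# Step 2 (Python): summary statistics on the small aggregated result.
counts = [count for _, count in per_state]
total_records = sum(counts)
avg_records = statistics.mean(counts)
std_records = statistics.stdev(counts)  # sample standard deviation, like pandas' .std()
cv_percent = std_records / avg_records * 100

print(f"Total: {total_records}, mean: {avg_records:.1f}, CV: {cv_percent:.1f}%")
```

Only three rows cross the boundary between the database and Python here, regardless of how many raw rows the table holds, which is the main performance argument for doing the aggregation in SQL first.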

## 6. Error Handling and Best Practices

Production database applications require robust error handling and connection management.
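
As a warm-up for these ideas, the sketch below shows a bound parameter, deterministic cleanup, and a caught syntax error in a few lines. It assumes the standard-library `sqlite3` driver and an invented `customers` table rather than the Supabase connection. SQLite's named placeholder style (`:state`) happens to match the `:name` style used by SQLAlchemy's `text()`; `count_by_state` is a hypothetical helper.

```python
import sqlite3
from contextlib import closing


def count_by_state(conn, state):
    """Count customers in a state; the value is bound, never pasted into the SQL."""
    sql = "SELECT COUNT(*) FROM customers WHERE customer_state = :state"
    with closing(conn.cursor()) as cur:  # cursor is closed even if execute raises
        cur.execute(sql, {"state": state})
        return cur.fetchone()[0]


with closing(sqlite3.connect(":memory:")) as conn:  # connection is closed on exit
    conn.execute("CREATE TABLE customers (customer_state TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?)", [("SP",), ("SP",), ("RJ",)])

    sp_count = count_by_state(conn, "SP")
    # A hostile-looking value is treated as plain data and simply matches nothing.
    injected_count = count_by_state(conn, "SP' OR '1'='1")

    try:
        conn.execute("SELCT * FROM customers")  # deliberate typo
        error_name = None
    except sqlite3.OperationalError as exc:
        error_name = type(exc).__name__

print(sp_count, injected_count, error_name)
```

Catching the driver's specific exception class, rather than a bare `Exception`, keeps unrelated bugs (such as a `NameError` in the surrounding code) from being silently reported as database errors.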

In [None]:
def demonstrate_error_handling():
    """
    Demonstrate proper error handling techniques for database operations.
    """
    print("🛡️ Database Error Handling and Best Practices")
    print("\n" + "="*60)
    
    # Example 1: Handling SQL syntax errors
    print("\n❌ Example 1: SQL Syntax Error Handling")
    try:
        # Intentional syntax error
        result = execute_query(db_engine, """
            SELCT * FROM "non_existent_schema"."non_existent_table"  -- Missing 'E' in SELECT
            WHERE some_column = 'value'
            LIMIT 5
        """)
    except Exception as e:
        print(f"✅ Caught SQL syntax error: {type(e).__name__}")
        print(f"   Error message: {str(e)[:100]}...")
    
    # Example 2: Handling schema/table not found
    print("\n🔍 Example 2: Schema/Table Not Found Error")
    try:
        if sales_tables:
            main_sales_table = sales_tables[0]
            schema, table = main_sales_table.split('.', 1)
            result = execute_query(db_engine, f"""
                SELECT customer_id, nonexistent_column 
                FROM "{schema}"."nonexistent_table"
                LIMIT 5
            """)
    except Exception as e:
        print(f"✅ Caught schema/table error: {type(e).__name__}")
        print(f"   Error message: {str(e)[:100]}...")
    
    # Example 3: Parameterized queries (SQL injection prevention)
    print("\n🔒 Example 3: Safe Parameterized Queries")
    
    def safe_data_lookup(engine, table_name, column_name, value):
        """
        Safely query data using parameterized queries.
        Note: Schema and table names can't be parameterized, so validate them first.
        """
        try:
            # Validate table exists (schema-qualified)
            if table_name not in db_info:
                raise ValueError(f"Table {table_name} not found")
            
            # Validate column exists
            if column_name not in db_info[table_name]['columns']:
                raise ValueError(f"Column {column_name} not found in {table_name}")
            
            # Split schema and table for proper quoting
            if '.' in table_name:
                schema, table = table_name.split('.', 1)
                table_ref = f'"{schema}"."{table}"'
            else:
                table_ref = f'"{table_name}"'
            
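            # Use a bound parameter for the value. ":name" is the SQLAlchemy text() style;
            # this assumes execute_query wraps the query in text(), as safe_database_query does.
            query = f"""
                SELECT COUNT(*) as record_count
                FROM {table_ref}
                WHERE "{column_name}" = :search_value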
            """
            result = execute_query(engine, query, params={'search_value': value})
            return result
        except Exception as e:
            print(f"❌ Query failed: {e}")
            return pd.DataFrame()
    
    # Test safe query
    if sales_tables:
        main_sales_table = sales_tables[0]
        available_columns = db_info[main_sales_table]['columns']
        location_cols = [col for col in available_columns if any(keyword in col.lower() for keyword in ['state', 'city', 'region'])]
        
        if location_cols:
            # Get a real value first
            schema, table = main_sales_table.split('.', 1)
            sample_query = f'SELECT DISTINCT "{location_cols[0]}" FROM "{schema}"."{table}" WHERE "{location_cols[0]}" IS NOT NULL LIMIT 1'
            try:
                sample_value = execute_query(db_engine, sample_query)
                if len(sample_value) > 0:
                    test_value = sample_value.iloc[0, 0]
                    safe_result = safe_data_lookup(db_engine, main_sales_table, location_cols[0], test_value)
                    if len(safe_result) > 0:
                        print(f"✅ Safe query returned {safe_result.iloc[0, 0]} records for '{test_value}'")
            except Exception as e:
                print(f"⚠️ Could not test safe query: {e}")
    
    # Example 4: Connection management with context managers
    print("\n🔌 Example 4: Proper Connection Management")
    
    def safe_database_query(engine, query, params=None):
        """
        Function-based safe database operations using context manager pattern.
        """
        try:
            with engine.connect() as conn:
                if params:
                    result = pd.read_sql(text(query), conn, params=params)
                else:
                    result = pd.read_sql(text(query), conn)
                return result
        except Exception as e:
            print(f"❌ Database error occurred: {type(e).__name__}: {e}")
            return pd.DataFrame()
    
    # Use function for safe operations
    try:
        result = safe_database_query(
            db_engine, 
            "SELECT 'Schema-aware connection test successful' as message"
        )
        if not result.empty:
            print(f"✅ Safe query function successful: {result.iloc[0, 0]}")
    except Exception as e:
        print(f"❌ Safe query function caught error: {e}")
    
    # Example 5: Data validation
    print("\n✅ Example 5: Data Validation Best Practices")
    
    def validate_query_result(df, expected_columns=None, min_rows=0):
        """
        Validate query results meet business requirements.
        """
        validations = []
        
        # Check if DataFrame is empty
        if df.empty:
            validations.append("❌ Query returned no data")
        else:
            validations.append(f"✅ Query returned {len(df):,} rows")
        
        # Check minimum row count
        if len(df) < min_rows:
            validations.append(f"⚠️ Row count ({len(df)}) below minimum ({min_rows})")
        
        # Check expected columns
        if expected_columns:
            missing_cols = set(expected_columns) - set(df.columns)
            if missing_cols:
                validations.append(f"❌ Missing columns: {missing_cols}")
            else:
                validations.append("✅ All expected columns present")
        
        # Check for null values in key columns
        if not df.empty:
            null_counts = df.isnull().sum()
            if null_counts.any():
                validations.append(f"⚠️ Null values found: {dict(null_counts[null_counts > 0])}")
            else:
                validations.append("✅ No null values detected")
        
        return validations
    
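    # Test validation
    validations = []  # default, so the final return works even if no table is available or the query fails
    if sales_tables: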
        main_sales_table = sales_tables[0]
        schema, table = main_sales_table.split('.', 1)
        
        test_query = f"""
            SELECT *
            FROM "{schema}"."{table}"
            LIMIT 10
        """
        
        try:
            test_data = execute_query(db_engine, test_query)
            validations = validate_query_result(
                test_data, 
                expected_columns=list(test_data.columns)[:3],  # Check first 3 columns
                min_rows=5
            )
            
            print("Query validation results:")
            for validation in validations:
                print(f"  {validation}")
        except Exception as e:
            print(f"❌ Validation test failed: {e}")
    
    return validations

# Run error handling demonstration
error_handling_results = demonstrate_error_handling()

print("\n📚 Database Best Practices Summary:")
print("  🔒 Always use parameterized queries to prevent SQL injection")
print("  🛡️ Implement comprehensive error handling for all database operations")
print("  🔌 Use connection context managers to ensure proper resource cleanup")
print("  ✅ Validate query results before processing in business logic")
print("  📊 Log query performance for optimization opportunities")
print("  🔄 Implement retry logic for transient connection issues")
print("  📝 Document query patterns and business logic for team maintenance")
print("  🗂️ Handle schema-qualified table names properly in multi-schema databases")
print("  🔍 Validate schema and table existence before executing queries")
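
The summary recommends retry logic for transient connection issues, which the demonstration above does not show. One common shape for it is exponential backoff; the following is a minimal sketch under that assumption. `run_with_retry` and `flaky_query` are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception the real driver raises (with SQLAlchemy that would likely be `sqlalchemy.exc.OperationalError`).

```python
import time


def run_with_retry(operation, retries=3, base_delay=0.5,
                   retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation(); on a transient error wait base_delay * 2**attempt, then retry.

    retry_on lists the exception types treated as transient (an assumption
    that should be adapted to the driver in use).
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


delays = []
result = run_with_retry(flaky_query, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["count"], delays)
```

Only errors listed in `retry_on` are retried, so a SQL syntax error or a missing table fails immediately instead of being attempted several times.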

## 7. Key Takeaways and Next Steps

### What We've Accomplished:

1. **PostgreSQL Database Connection**
   - Connected to Supabase cloud PostgreSQL database
   - Established professional connection patterns with SQLAlchemy
   - Implemented proper resource management and error handling

2. **SQL Query Execution from Python**
   - Basic data exploration and filtering
   - Complex business intelligence with JOINs
   - Advanced analytics with window functions and CTEs

3. **Real-World Data Integration**
   - Worked with actual Olist e-commerce and marketing datasets
   - Adapted queries to real schema structures
   - Handled data quality issues and missing values

4. **Production-Ready Practices**
   - Error handling and validation
   - Parameterized queries for security
   - Connection pooling and resource management

### Business Value:

- **Real-time Analysis**: Connect directly to live business systems
- **Scalability**: Handle enterprise-scale datasets efficiently
- **Performance**: Leverage database engines for heavy computation
- **Security**: Proper authentication and query sanitization
- **Collaboration**: Multiple analysts accessing the same cloud data source

### When to Use SQL vs Pandas:

**Use SQL for:**
- Data extraction from large datasets
- Complex joins across multiple tables
- Window functions and analytical queries
- Business logic implementation with CASE statements
- Database-level performance optimization

**Use Pandas for:**
- Statistical analysis and modeling
- Data cleaning and transformation
- Visualization preparation
- Machine learning feature engineering
- Iterative data exploration

### Next Session Preview:
In our next sessions, we'll explore:
- Advanced SQL patterns for business intelligence
- Real-time data pipeline automation
- Combining SQL analytics with interactive visualizations
- Building automated reporting systems

**🎉 You now have the fundamental skills to connect Python to cloud databases and perform enterprise-level data analysis!**

## 8. Practice Exercise

**Your Challenge! 🚀**

**Business Scenario**: The Olist analytics team wants to understand the relationship between their marketing efforts and customer behavior. Your task is to create an analysis that bridges the marketing and sales datasets.

**Your Task**: Create a comprehensive analysis that combines both datasets to answer business questions.

**Requirements**:
1. **Data Exploration**: Explore both olist_sales_data_set and olist_marketing_data_set
2. **Schema Analysis**: Document the structure and relationships between datasets
3. **Business Intelligence**: Create queries that provide actionable insights
4. **Error Handling**: Implement proper error handling for your queries
5. **Best Practices**: Use parameterized queries and validation

**Specific Questions to Answer**:
- What is the structure of each dataset?
- How can these datasets be connected?
- What insights can we derive about customer acquisition and behavior?
- Which marketing channels or strategies show the most promise?

**Deliverable**: A comprehensive analysis with SQL queries, data validation, and business insights.

In [None]:
# Your practice exercise solution here

def comprehensive_business_analysis():
    """
    Your challenge: Create a comprehensive analysis bridging marketing and sales data.
    
    Business Goal: Understand the relationship between marketing efforts and customer behavior.
    
    Implementation steps:
    1. Explore both datasets thoroughly
    2. Identify connection points between datasets
    3. Create business intelligence queries
    4. Validate results and handle errors
    5. Generate actionable insights
    """
    
    print("🎯 Comprehensive Business Analysis Challenge")
    print("📊 Goal: Bridge marketing and sales data for business insights")
    print("\n" + "="*60)
    
    # Step 1: Dataset Exploration
    print("\n📋 Step 1: Dataset Structure Analysis")
    
    # TODO: Explore olist_sales_data_set structure
    # Think about: What columns are available? What do they represent?
    
    # TODO: Explore olist_marketing_data_set structure  
    # Think about: How does this relate to sales data?
    
    # Step 2: Connection Analysis
    print("\n🔗 Step 2: Identify Dataset Relationships")
    
    # TODO: Find common columns or keys between datasets
    # Think about: How can we join these datasets?
    
    # Step 3: Business Intelligence Queries
    print("\n💼 Step 3: Business Intelligence Analysis")
    
    # TODO: Create queries that answer business questions:
    # - Which marketing channels are most effective?
    # - What's the customer journey from lead to purchase?
    # - How do marketing efforts correlate with sales performance?
    
    # Step 4: Advanced Analytics
    print("\n📈 Step 4: Advanced Business Insights")
    
    # TODO: Use advanced SQL features:
    # - Window functions for trend analysis
    # - CTEs for complex business logic
    # - Statistical functions for performance metrics
    
    # Step 5: Validation and Error Handling
    print("\n✅ Step 5: Data Validation and Quality Checks")
    
    # TODO: Implement proper error handling and data validation
    
    # Step 6: Business Recommendations
    print("\n🎯 Step 6: Strategic Business Recommendations")
    
    # TODO: Synthesize findings into actionable business insights
    
    return None

print("💡 Hints for Your Analysis:")
print("  • Start by examining the schema of both datasets")
print("  • Look for common identifiers (seller_id, customer_id, etc.)")
print("  • Use SQL JOINs to combine datasets where appropriate")
print("  • Focus on metrics that matter to business decision-makers")
print("  • Always validate your results and handle potential errors")

print("\n🔍 Analysis Framework:")
print("  1. Data Discovery: Understand what data is available")
print("  2. Relationship Mapping: How datasets connect")
print("  3. Business Metrics: What KPIs can we calculate?")
print("  4. Trend Analysis: How do metrics change over time?")
print("  5. Insights Generation: What actions should the business take?")

print("\n📊 Expected Deliverables:")
print("  • Dataset structure documentation")
print("  • Relationship mapping between datasets")
print("  • Business intelligence SQL queries")
print("  • Data quality assessment")
print("  • Strategic recommendations based on findings")

# Uncomment to run your solution:
# comprehensive_analysis_results = comprehensive_business_analysis()

# Remember to close the database connection when done
# close_database_engine(db_engine)

In [None]:
# Clean up database connection
print("🔒 Closing database connection...")
close_database_engine(db_engine)
print("✅ Session complete!")