# Capstone Project 5: Database Analytics Dashboard

## Building a Business Intelligence Dashboard with SQL, Pandas, and Matplotlib

This capstone project combines database skills with data analysis and visualization to build a comprehensive analytics dashboard for an e-commerce business.

### What We're Building
An end-to-end analytics solution that:
- Creates and populates a realistic SQLite database
- Queries data using SQL for business insights
- Loads results into Pandas for advanced analysis
- Visualizes findings in a multi-panel dashboard

### Skills Used
- **SQLite**: Database creation, schema design, data insertion
- **SQL Queries**: SELECT, JOIN, GROUP BY, aggregations, subqueries
- **Pandas**: Data loading, transformation, analysis
- **Matplotlib**: Multi-panel dashboards, various chart types

### Prerequisites
- Module 3: Data Visualization with Matplotlib
- Module 4: Data Analysis with Pandas
- Module 7: Database Programming with SQLite

---
## Part 1: Project Setup and Database Creation

We'll create a SQLite database with a realistic e-commerce schema including:
- **customers**: Customer information
- **products**: Product catalog with categories
- **orders**: Order header information
- **order_items**: Individual items within each order
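
One detail worth knowing before defining these tables: SQLite parses `FOREIGN KEY` clauses but, by default, does not enforce them unless `PRAGMA foreign_keys = ON` is issued on each connection. The sketch below uses a throwaway in-memory database with two simplified tables (not the full project schema), and the helper name `orphan_insert_allowed` is hypothetical.

```python
import sqlite3


def orphan_insert_allowed(enforce: bool) -> bool:
    """Return True if an order pointing at a nonexistent customer can be inserted."""
    conn = sqlite3.connect(':memory:')
    try:
        if enforce:
            # Must be set per connection, before any transaction starts
            conn.execute('PRAGMA foreign_keys = ON')
        conn.execute('CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)')
        conn.execute('''
            CREATE TABLE orders (
                order_id INTEGER PRIMARY KEY,
                customer_id INTEGER NOT NULL,
                FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
            )
        ''')
        try:
            # Customer 999 does not exist
            conn.execute('INSERT INTO orders VALUES (1, 999)')
            return True
        except sqlite3.IntegrityError:
            return False
    finally:
        conn.close()


print("Without pragma:", orphan_insert_allowed(enforce=False))
print("With pragma:   ", orphan_insert_allowed(enforce=True))
```

With a default SQLite build, the orphan row is accepted only when the pragma is off, so the foreign keys in this project act as documentation unless the pragma is enabled.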

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import random
import os

# Configuration
np.random.seed(42)
random.seed(42)
# Note: the 'seaborn-v0_8-*' style names require Matplotlib 3.6 or newer
plt.style.use('seaborn-v0_8-whitegrid')

# Database file path
DB_PATH = 'ecommerce_analytics.db'

# Remove existing database for fresh start
if os.path.exists(DB_PATH):
    os.remove(DB_PATH)
    print(f"Removed existing database: {DB_PATH}")

print("Libraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")

In [None]:
def create_database_schema(db_path: str) -> None:
    """
    Create the e-commerce database schema.
    
    Parameters
    ----------
    db_path : str
        Path to the SQLite database file
    
    Notes
    -----
    Creates four tables:
    - customers: Customer master data
    - products: Product catalog
    - orders: Order headers
    - order_items: Order line items
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    # Customers table
    cursor.execute('''
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            first_name TEXT NOT NULL,
            last_name TEXT NOT NULL,
            email TEXT UNIQUE NOT NULL,
            city TEXT,
            state TEXT,
            country TEXT DEFAULT 'USA',
            registration_date DATE,
            customer_segment TEXT
        )
    ''')
    
    # Products table
    cursor.execute('''
        CREATE TABLE products (
            product_id INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL,
            category TEXT NOT NULL,
            subcategory TEXT,
            unit_price REAL NOT NULL,
            cost_price REAL NOT NULL,
            stock_quantity INTEGER DEFAULT 0,
            supplier TEXT
        )
    ''')
    
    # Orders table
    cursor.execute('''
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            order_date DATE NOT NULL,
            ship_date DATE,
            status TEXT DEFAULT 'Completed',
            shipping_method TEXT,
            payment_method TEXT,
            FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        )
    ''')
    
    # Order items table
    cursor.execute('''
        CREATE TABLE order_items (
            item_id INTEGER PRIMARY KEY,
            order_id INTEGER NOT NULL,
            product_id INTEGER NOT NULL,
            quantity INTEGER NOT NULL,
            unit_price REAL NOT NULL,
            discount_percent REAL DEFAULT 0,
            FOREIGN KEY (order_id) REFERENCES orders(order_id),
            FOREIGN KEY (product_id) REFERENCES products(product_id)
        )
    ''')
    
    # Create indexes for better query performance
    cursor.execute('CREATE INDEX idx_orders_customer ON orders(customer_id)')
    cursor.execute('CREATE INDEX idx_orders_date ON orders(order_date)')
    cursor.execute('CREATE INDEX idx_items_order ON order_items(order_id)')
    cursor.execute('CREATE INDEX idx_items_product ON order_items(product_id)')
    
    conn.commit()
    conn.close()
    print("Database schema created successfully!")


# Create the schema
create_database_schema(DB_PATH)

---
## Part 2: Populating the Database with Realistic Data

We'll generate realistic sample data that includes:
- Seasonal sales patterns (holiday spikes)
- Growth trends over time
- Different customer segments with varying behaviors
- Product categories with different price points
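
The growth trend can be simulated by drawing each order's position in the date range from a Beta distribution whose mass leans toward 1. The sketch below is a standalone illustration of that idea; it uses NumPy's `default_rng` generator (an assumption made here for reproducibility, separate from the seeding used in the project code) and the same 2023-2024 range.

```python
from datetime import datetime, timedelta

import numpy as np

rng = np.random.default_rng(42)

start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
days_range = (end_date - start_date).days

# Beta(2, 1.5) has mean 2 / (2 + 1.5), about 0.571, so offsets fall
# past the midpoint of the range on average
fractions = rng.beta(2, 1.5, size=10_000)
sample_dates = [start_date + timedelta(days=int(f * days_range)) for f in fractions]

share_2024 = sum(d.year == 2024 for d in sample_dates) / len(sample_dates)
print(f"Mean fraction of range: {fractions.mean():.3f}")
print(f"Share of sampled dates in 2024: {share_2024:.1%}")
```

More than half of the sampled dates should fall in the second year, which is what produces the upward drift in monthly order counts.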

In [None]:
def generate_customers(n_customers: int = 75) -> list:
    """
    Generate realistic customer data.
    
    Parameters
    ----------
    n_customers : int
        Number of customers to generate
    
    Returns
    -------
    list
        List of customer tuples for database insertion
    """
    first_names = [
        'James', 'Mary', 'John', 'Patricia', 'Robert', 'Jennifer', 'Michael', 'Linda',
        'William', 'Elizabeth', 'David', 'Barbara', 'Richard', 'Susan', 'Joseph', 'Jessica',
        'Thomas', 'Sarah', 'Charles', 'Karen', 'Christopher', 'Nancy', 'Daniel', 'Lisa',
        'Matthew', 'Betty', 'Anthony', 'Margaret', 'Mark', 'Sandra', 'Donald', 'Ashley',
        'Steven', 'Kimberly', 'Paul', 'Emily', 'Andrew', 'Donna', 'Joshua', 'Michelle',
        'Kenneth', 'Dorothy', 'Kevin', 'Carol', 'Brian', 'Amanda', 'George', 'Melissa'
    ]
    
    last_names = [
        'Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis',
        'Rodriguez', 'Martinez', 'Hernandez', 'Lopez', 'Gonzalez', 'Wilson', 'Anderson',
        'Thomas', 'Taylor', 'Moore', 'Jackson', 'Martin', 'Lee', 'Perez', 'Thompson',
        'White', 'Harris', 'Sanchez', 'Clark', 'Ramirez', 'Lewis', 'Robinson', 'Walker',
        'Young', 'Allen', 'King', 'Wright', 'Scott', 'Torres', 'Nguyen', 'Hill', 'Flores'
    ]
    
    cities_states = [
        ('New York', 'NY'), ('Los Angeles', 'CA'), ('Chicago', 'IL'), ('Houston', 'TX'),
        ('Phoenix', 'AZ'), ('Philadelphia', 'PA'), ('San Antonio', 'TX'), ('San Diego', 'CA'),
        ('Dallas', 'TX'), ('San Jose', 'CA'), ('Austin', 'TX'), ('Jacksonville', 'FL'),
        ('Fort Worth', 'TX'), ('Columbus', 'OH'), ('Charlotte', 'NC'), ('Seattle', 'WA'),
        ('Denver', 'CO'), ('Boston', 'MA'), ('Portland', 'OR'), ('Atlanta', 'GA')
    ]
    
    segments = ['Standard', 'Premium', 'VIP', 'New']
    segment_weights = [0.50, 0.30, 0.10, 0.10]
    
    customers = []
    used_emails = set()
    
    for i in range(1, n_customers + 1):
        first = random.choice(first_names)
        last = random.choice(last_names)
        
        # Generate unique email
        base_email = f"{first.lower()}.{last.lower()}"
        email = f"{base_email}@email.com"
        counter = 1
        while email in used_emails:
            email = f"{base_email}{counter}@email.com"
            counter += 1
        used_emails.add(email)
        
        city, state = random.choice(cities_states)
        
        # Registration date (last 3 years)
        days_ago = random.randint(0, 1095)
        reg_date = (datetime(2024, 12, 31) - timedelta(days=days_ago)).strftime('%Y-%m-%d')
        
        segment = random.choices(segments, weights=segment_weights)[0]
        
        customers.append((
            i, first, last, email, city, state, 'USA', reg_date, segment
        ))
    
    return customers


# Generate customers (they are inserted into the database later)
customers = generate_customers(75)
print(f"Generated {len(customers)} customers")
print(f"Sample: {customers[0]}")

In [None]:
def generate_products() -> list:
    """
    Generate product catalog with realistic items.
    
    Returns
    -------
    list
        List of product tuples for database insertion
    """
    products_data = [
        # Electronics
        ('Wireless Bluetooth Headphones', 'Electronics', 'Audio', 79.99, 45.00, 150, 'TechSupply Co'),
        ('USB-C Hub 7-in-1', 'Electronics', 'Accessories', 49.99, 22.00, 200, 'TechSupply Co'),
        ('Mechanical Keyboard RGB', 'Electronics', 'Peripherals', 129.99, 65.00, 80, 'TechSupply Co'),
        ('Wireless Mouse Ergonomic', 'Electronics', 'Peripherals', 39.99, 18.00, 250, 'TechSupply Co'),
        ('4K Webcam Pro', 'Electronics', 'Video', 149.99, 75.00, 60, 'VisionTech'),
        ('Portable SSD 1TB', 'Electronics', 'Storage', 99.99, 55.00, 120, 'DataStore Inc'),
        ('Smart Watch Fitness', 'Electronics', 'Wearables', 199.99, 95.00, 90, 'WearTech'),
        
        # Home & Kitchen
        ('Coffee Maker Programmable', 'Home & Kitchen', 'Appliances', 89.99, 42.00, 75, 'HomeGoods Ltd'),
        ('Air Fryer 5.5L', 'Home & Kitchen', 'Appliances', 119.99, 58.00, 60, 'HomeGoods Ltd'),
        ('Knife Set 15-Piece', 'Home & Kitchen', 'Cookware', 79.99, 35.00, 100, 'KitchenPro'),
        ('Non-Stick Pan Set', 'Home & Kitchen', 'Cookware', 69.99, 30.00, 85, 'KitchenPro'),
        ('Blender High-Speed', 'Home & Kitchen', 'Appliances', 149.99, 70.00, 50, 'HomeGoods Ltd'),
        
        # Clothing & Apparel
        ('Running Shoes Pro', 'Clothing', 'Footwear', 129.99, 55.00, 120, 'SportStyle'),
        ('Casual Sneakers', 'Clothing', 'Footwear', 79.99, 35.00, 180, 'SportStyle'),
        ('Winter Jacket Insulated', 'Clothing', 'Outerwear', 189.99, 85.00, 70, 'OutdoorGear'),
        ('Yoga Pants Premium', 'Clothing', 'Activewear', 59.99, 25.00, 200, 'FitWear'),
        ('Cotton T-Shirt Pack (3)', 'Clothing', 'Basics', 34.99, 12.00, 300, 'BasicWear'),
        
        # Sports & Outdoors
        ('Yoga Mat Premium', 'Sports', 'Fitness', 49.99, 20.00, 150, 'FitGear'),
        ('Dumbbell Set Adjustable', 'Sports', 'Fitness', 199.99, 90.00, 40, 'FitGear'),
        ('Camping Tent 4-Person', 'Sports', 'Outdoor', 249.99, 120.00, 35, 'OutdoorGear'),
        ('Hiking Backpack 50L', 'Sports', 'Outdoor', 89.99, 40.00, 80, 'OutdoorGear'),
        ('Resistance Bands Set', 'Sports', 'Fitness', 29.99, 10.00, 250, 'FitGear'),
        
        # Books & Media
        ('Python Programming Guide', 'Books', 'Technology', 44.99, 18.00, 100, 'BookWorld'),
        ('Data Science Handbook', 'Books', 'Technology', 54.99, 22.00, 80, 'BookWorld'),
        ('Business Strategy Essentials', 'Books', 'Business', 29.99, 12.00, 120, 'BookWorld'),
        ('Cooking Masterclass', 'Books', 'Lifestyle', 39.99, 16.00, 90, 'BookWorld'),
    ]
    
    products = []
    for i, p in enumerate(products_data, 1):
        products.append((i,) + p)
    
    return products


# Generate products
products = generate_products()
print(f"Generated {len(products)} products")
print(f"Sample: {products[0]}")

In [None]:
def generate_orders_and_items(
    n_orders: int = 300,
    n_customers: int = 75,
    n_products: int = 26
) -> tuple:
    """
    Generate orders and order items with realistic patterns.
    
    Parameters
    ----------
    n_orders : int
        Number of orders to generate
    n_customers : int
        Number of customers in the database
    n_products : int
        Number of products in the database
    
    Returns
    -------
    tuple
        (orders_list, order_items_list)
    
    Notes
    -----
    Includes two simulated patterns:
    - Higher sales in Q4 (holiday season)
    - Growth trend over time (order dates skewed toward recent months)
    """
    shipping_methods = ['Standard', 'Express', 'Next Day', 'Free Shipping']
    payment_methods = ['Credit Card', 'Debit Card', 'PayPal', 'Apple Pay']
    statuses = ['Completed', 'Completed', 'Completed', 'Completed', 'Shipped', 'Processing']
    
    # Product prices keyed by product_id; must match the catalog in generate_products()
    product_prices = {
        1: 79.99, 2: 49.99, 3: 129.99, 4: 39.99, 5: 149.99, 6: 99.99, 7: 199.99,
        8: 89.99, 9: 119.99, 10: 79.99, 11: 69.99, 12: 149.99,
        13: 129.99, 14: 79.99, 15: 189.99, 16: 59.99, 17: 34.99,
        18: 49.99, 19: 199.99, 20: 249.99, 21: 89.99, 22: 29.99,
        23: 44.99, 24: 54.99, 25: 29.99, 26: 39.99
    }
    
    orders = []
    order_items = []
    item_id = 1
    
    # Date range: 2 years of data
    start_date = datetime(2023, 1, 1)
    end_date = datetime(2024, 12, 31)
    
    for order_id in range(1, n_orders + 1):
        # Generate order date with seasonal weighting
        # More orders in Q4, slight growth trend
        if random.random() < 0.30:  # 30% chance of Q4 date
            # Q4 months: October, November, December
            month = random.choice([10, 11, 12])
            year = random.choice([2023, 2024])
            day = random.randint(1, 28)
            order_date = datetime(year, month, day)
        else:
            # Otherwise pick a date from the full range; Beta(2, 1.5) skews
            # the offset toward recent dates to mimic a growth trend
            days_range = (end_date - start_date).days
            day_offset = int(np.random.beta(2, 1.5) * days_range)
            order_date = start_date + timedelta(days=day_offset)
        
        customer_id = random.randint(1, n_customers)
        
        # Ship date (1-7 days after order)
        ship_days = random.randint(1, 7)
        ship_date = order_date + timedelta(days=ship_days)
        
        orders.append((
            order_id,
            customer_id,
            order_date.strftime('%Y-%m-%d'),
            ship_date.strftime('%Y-%m-%d'),
            random.choice(statuses),
            random.choice(shipping_methods),
            random.choice(payment_methods)
        ))
        
        # Generate 1-5 items per order
        n_items = random.choices([1, 2, 3, 4, 5], weights=[0.35, 0.30, 0.20, 0.10, 0.05])[0]
        order_products = random.sample(range(1, n_products + 1), min(n_items, n_products))
        
        for product_id in order_products:
            quantity = random.choices([1, 2, 3], weights=[0.70, 0.20, 0.10])[0]
            unit_price = product_prices[product_id]
            
            # Occasional discounts (20% chance)
            discount = 0
            if random.random() < 0.20:
                discount = random.choice([5, 10, 15, 20])
            
            order_items.append((
                item_id,
                order_id,
                product_id,
                quantity,
                unit_price,
                discount
            ))
            item_id += 1
    
    return orders, order_items


# Generate orders and items
orders, order_items = generate_orders_and_items(300, 75, 26)
print(f"Generated {len(orders)} orders")
print(f"Generated {len(order_items)} order items")
print(f"Average items per order: {len(order_items) / len(orders):.2f}")

In [None]:
def populate_database(db_path: str, customers: list, products: list, 
                      orders: list, order_items: list) -> None:
    """
    Insert all generated data into the database.
    
    Parameters
    ----------
    db_path : str
        Path to the SQLite database
    customers : list
        Customer data tuples
    products : list
        Product data tuples
    orders : list
        Order data tuples
    order_items : list
        Order item data tuples
    """
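    conn = sqlite3.connect(db_path)
    # SQLite only enforces FOREIGN KEY constraints when this pragma is on
    conn.execute('PRAGMA foreign_keys = ON')
    cursor = conn.cursor()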
    
    # Insert customers
    cursor.executemany('''
        INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', customers)
    print(f"Inserted {len(customers)} customers")
    
    # Insert products
    cursor.executemany('''
        INSERT INTO products VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    ''', products)
    print(f"Inserted {len(products)} products")
    
    # Insert orders
    cursor.executemany('''
        INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)
    ''', orders)
    print(f"Inserted {len(orders)} orders")
    
    # Insert order items
    cursor.executemany('''
        INSERT INTO order_items VALUES (?, ?, ?, ?, ?, ?)
    ''', order_items)
    print(f"Inserted {len(order_items)} order items")
    
    conn.commit()
    conn.close()
    print("\nDatabase populated successfully!")


# Populate the database
populate_database(DB_PATH, customers, products, orders, order_items)

---
## Part 3: Data Exploration with SQL

Let's explore our database using SQL queries to understand the data structure and basic statistics.

In [None]:
def run_query(query: str, db_path: str = DB_PATH) -> pd.DataFrame:
    """
    Execute a SQL query and return results as a DataFrame.
    
    Parameters
    ----------
    query : str
        SQL query to execute
    db_path : str
        Path to the SQLite database
    
    Returns
    -------
    pd.DataFrame
        Query results as a DataFrame
    """
    conn = sqlite3.connect(db_path)
    try:
        # Close the connection even if the query raises an error
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


# Basic table exploration
print("="*60)
print("DATABASE OVERVIEW")
print("="*60)

# Count records in each table
tables = ['customers', 'products', 'orders', 'order_items']
for table in tables:
    count = run_query(f"SELECT COUNT(*) as count FROM {table}")['count'][0]
    print(f"{table}: {count} records")
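
Table names cannot be bound as SQL parameters, so formatting them into the query string is acceptable for a hardcoded list like the one above. Values are different: they should be passed through placeholders. The sketch below shows one way to do that with the `params` argument of `pd.read_sql_query`, using a small in-memory table invented for the example.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)')
conn.executemany(
    'INSERT INTO orders VALUES (?, ?)',
    [(1, 'Completed'), (2, 'Shipped'), (3, 'Completed')],
)

# The ? placeholder is filled by the driver, not by string formatting
completed = pd.read_sql_query(
    'SELECT order_id FROM orders WHERE status = ? ORDER BY order_id',
    conn,
    params=('Completed',),
)
conn.close()

print(completed)
```

A helper such as `run_query` could accept an optional `params` argument and forward it in the same way.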

In [None]:
# Preview each table
print("\n" + "="*60)
print("CUSTOMERS TABLE (First 5 rows)")
print("="*60)
run_query("SELECT * FROM customers LIMIT 5")

In [None]:
print("PRODUCTS TABLE (First 5 rows)")
print("="*60)
run_query("SELECT * FROM products LIMIT 5")

In [None]:
print("ORDERS TABLE (First 5 rows)")
print("="*60)
run_query("SELECT * FROM orders LIMIT 5")

In [None]:
print("ORDER ITEMS TABLE (First 5 rows)")
print("="*60)
run_query("SELECT * FROM order_items LIMIT 5")

In [None]:
# Aggregate statistics
print("\n" + "="*60)
print("AGGREGATE STATISTICS")
print("="*60)

# Customer segments distribution
print("\nCustomer Segments:")
run_query('''
    SELECT 
        customer_segment,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM customers), 1) as percentage
    FROM customers
    GROUP BY customer_segment
    ORDER BY count DESC
''')

In [None]:
# Product categories
print("Product Categories:")
run_query('''
    SELECT 
        category,
        COUNT(*) as product_count,
        ROUND(AVG(unit_price), 2) as avg_price,
        ROUND(AVG(unit_price - cost_price), 2) as avg_margin
    FROM products
    GROUP BY category
    ORDER BY avg_price DESC
''')

In [None]:
# Order date range
print("Order Date Range:")
run_query('''
    SELECT 
        MIN(order_date) as first_order,
        MAX(order_date) as last_order,
        COUNT(*) as total_orders,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM orders
''')

---
## Part 4: Business Questions with SQL

Now let's answer key business questions using SQL queries.
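
The queries in this part compute revenue as `quantity * unit_price * (1 - discount_percent/100)`. That expression relies on `discount_percent` being declared `REAL`: SQLite stores integers inserted into a `REAL` column as floating point, so the division is not truncated. The sketch below, on a throwaway table with hypothetical column names, shows what would happen with an `INTEGER` column and how dividing by `100.0` avoids the problem either way.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (disc_real REAL, disc_int INTEGER)')
conn.execute('INSERT INTO demo VALUES (10, 10)')

real_frac, int_frac, safe_frac = conn.execute(
    'SELECT disc_real / 100, disc_int / 100, disc_int / 100.0 FROM demo'
).fetchone()
conn.close()

# REAL column: floating-point division; INTEGER column: truncated to 0
print("REAL / 100:     ", real_frac)
print("INTEGER / 100:  ", int_frac)
print("INTEGER / 100.0:", safe_frac)

# Line revenue for 2 units at 79.99 with a 10% discount
line_revenue = 2 * 79.99 * (1 - real_frac)
print("Line revenue:", round(line_revenue, 2))
```

Writing `/ 100.0` in the queries would be a defensive choice if the column type were ever changed.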

In [None]:
# Question 1: Top 10 Customers by Revenue
print("="*60)
print("TOP 10 CUSTOMERS BY REVENUE")
print("="*60)

top_customers_query = '''
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name as customer_name,
        c.customer_segment,
        c.city,
        COUNT(DISTINCT o.order_id) as total_orders,
        ROUND(SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as total_revenue
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY c.customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
'''

top_customers = run_query(top_customers_query)
top_customers

In [None]:
# Question 2: Best-Selling Products
print("="*60)
print("TOP 10 BEST-SELLING PRODUCTS")
print("="*60)

best_products_query = '''
    SELECT 
        p.product_id,
        p.product_name,
        p.category,
        SUM(oi.quantity) as units_sold,
        ROUND(SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as revenue,
        ROUND(SUM(oi.quantity * (oi.unit_price * (1 - oi.discount_percent/100) - p.cost_price)), 2) as gross_profit
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id
    ORDER BY revenue DESC
    LIMIT 10
'''

best_products = run_query(best_products_query)
best_products

In [None]:
# Question 3: Monthly Sales Trends
print("="*60)
print("MONTHLY SALES TRENDS")
print("="*60)

monthly_sales_query = '''
    SELECT 
        strftime('%Y-%m', o.order_date) as month,
        COUNT(DISTINCT o.order_id) as orders,
        COUNT(DISTINCT o.customer_id) as unique_customers,
        SUM(oi.quantity) as items_sold,
        ROUND(SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY strftime('%Y-%m', o.order_date)
    ORDER BY month
'''

monthly_sales = run_query(monthly_sales_query)
monthly_sales

In [None]:
# Question 4: Average Order Value
print("="*60)
print("AVERAGE ORDER VALUE ANALYSIS")
print("="*60)

aov_query = '''
    WITH order_totals AS (
        SELECT 
            o.order_id,
            c.customer_segment,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as order_total
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY o.order_id, c.customer_segment
    )
    SELECT 
        customer_segment,
        COUNT(*) as order_count,
        ROUND(AVG(order_total), 2) as avg_order_value,
        ROUND(MIN(order_total), 2) as min_order,
        ROUND(MAX(order_total), 2) as max_order
    FROM order_totals
    GROUP BY customer_segment
    ORDER BY avg_order_value DESC
'''

aov_by_segment = run_query(aov_query)
aov_by_segment

In [None]:
# Question 5: Customer Purchase Frequency
print("="*60)
print("CUSTOMER PURCHASE FREQUENCY")
print("="*60)

frequency_query = '''
    WITH customer_orders AS (
        SELECT 
            c.customer_id,
            c.customer_segment,
            COUNT(o.order_id) as order_count
        FROM customers c
        LEFT JOIN orders o ON c.customer_id = o.customer_id
        GROUP BY c.customer_id, c.customer_segment
    )
    SELECT 
        CASE 
            WHEN order_count = 0 THEN '0 orders'
            WHEN order_count = 1 THEN '1 order'
            WHEN order_count BETWEEN 2 AND 3 THEN '2-3 orders'
            WHEN order_count BETWEEN 4 AND 6 THEN '4-6 orders'
            ELSE '7+ orders'
        END as frequency_bucket,
        COUNT(*) as customer_count,
        ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM customers), 1) as percentage
    FROM customer_orders
    GROUP BY frequency_bucket
    ORDER BY 
        CASE frequency_bucket
            WHEN '0 orders' THEN 1
            WHEN '1 order' THEN 2
            WHEN '2-3 orders' THEN 3
            WHEN '4-6 orders' THEN 4
            ELSE 5
        END
'''

purchase_frequency = run_query(frequency_query)
purchase_frequency

In [None]:
# Question 6: Sales by Category
print("="*60)
print("SALES BY PRODUCT CATEGORY")
print("="*60)

category_sales_query = '''
    SELECT 
        p.category,
        COUNT(DISTINCT oi.order_id) as orders_with_category,
        SUM(oi.quantity) as units_sold,
        ROUND(SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as revenue,
        ROUND(SUM(oi.quantity * (oi.unit_price * (1 - oi.discount_percent/100) - p.cost_price)), 2) as gross_profit,
        ROUND(SUM(oi.quantity * (oi.unit_price * (1 - oi.discount_percent/100) - p.cost_price)) * 100.0 / 
              SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 1) as profit_margin_pct
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.category
    ORDER BY revenue DESC
'''

category_sales = run_query(category_sales_query)
category_sales

---
## Part 5: Loading Data into Pandas

Now let's combine SQL filtering with Pandas analysis. We'll explore when to use SQL vs Pandas.
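
As a small illustration of the trade-off, the sketch below computes the same per-category total twice: once inside the database, and once in Pandas after loading every row. The `sales` table and its values are made up for the example.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (category TEXT, amount REAL)')
conn.executemany(
    'INSERT INTO sales VALUES (?, ?)',
    [('Books', 30.0), ('Books', 45.0), ('Sports', 50.0), ('Sports', 200.0)],
)

# Option 1: aggregate inside the database and load only the summary rows
sql_totals = pd.read_sql_query(
    'SELECT category, SUM(amount) AS total FROM sales '
    'GROUP BY category ORDER BY category',
    conn,
)

# Option 2: load every row, then aggregate in Pandas
all_rows = pd.read_sql_query('SELECT * FROM sales', conn)
pandas_totals = (
    all_rows.groupby('category', as_index=False)['amount']
    .sum()
    .rename(columns={'amount': 'total'})
)
conn.close()

print(sql_totals)
print(pandas_totals)
```

Both approaches give the same numbers; the difference is how much data crosses from the database into memory, which matters more as tables grow.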

In [None]:
# Load full tables for Pandas analysis
conn = sqlite3.connect(DB_PATH)

# Load all tables
df_customers = pd.read_sql_query("SELECT * FROM customers", conn)
df_products = pd.read_sql_query("SELECT * FROM products", conn)
df_orders = pd.read_sql_query("SELECT * FROM orders", conn)
df_items = pd.read_sql_query("SELECT * FROM order_items", conn)

conn.close()

print("Loaded DataFrames:")
print(f"  customers: {df_customers.shape}")
print(f"  products: {df_products.shape}")
print(f"  orders: {df_orders.shape}")
print(f"  order_items: {df_items.shape}")

In [None]:
# Convert date columns
df_orders['order_date'] = pd.to_datetime(df_orders['order_date'])
df_orders['ship_date'] = pd.to_datetime(df_orders['ship_date'])
df_customers['registration_date'] = pd.to_datetime(df_customers['registration_date'])

# Create a comprehensive orders DataFrame by merging
df_order_details = df_items.merge(df_orders, on='order_id')
df_order_details = df_order_details.merge(df_products, on='product_id')
df_order_details = df_order_details.merge(
    df_customers[['customer_id', 'customer_segment', 'city', 'state']], 
    on='customer_id'
)

# Calculate line item revenue
# (unit_price_x is the price on the order item; unit_price_y is the catalog price from products)
df_order_details['line_revenue'] = (
    df_order_details['quantity'] * 
    df_order_details['unit_price_x'] * 
    (1 - df_order_details['discount_percent'] / 100)
)

# Calculate profit (discounted revenue minus cost of goods sold)
df_order_details['line_profit'] = (
    df_order_details['line_revenue'] - 
    df_order_details['quantity'] * df_order_details['cost_price']
)

print(f"Combined order details: {df_order_details.shape}")
df_order_details.head()
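
The `unit_price_x` column used above comes from a name collision: both `order_items` and `products` have a `unit_price` column, so `merge` appends its default `_x` and `_y` suffixes. The sketch below, on two tiny hand-made frames, shows the default behavior and an optional alternative with the `suffixes` argument; the suffix names `_sold` and `_list` are just illustrative choices.

```python
import pandas as pd

items = pd.DataFrame({
    'product_id': [1, 2],
    'quantity': [2, 1],
    'unit_price': [71.99, 49.99],   # price recorded on the order item
})
catalog = pd.DataFrame({
    'product_id': [1, 2],
    'unit_price': [79.99, 49.99],   # current catalog price
    'cost_price': [45.0, 22.0],
})

# Default: overlapping column names get _x (left) and _y (right)
default_merge = items.merge(catalog, on='product_id')

# Explicit suffixes make the origin of each column clearer
named_merge = items.merge(catalog, on='product_id', suffixes=('_sold', '_list'))

print(list(default_merge.columns))
print(list(named_merge.columns))
```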

In [None]:
# When to use SQL vs Pandas:
print("="*60)
print("SQL vs PANDAS: WHEN TO USE EACH")
print("="*60)

print("""
USE SQL WHEN:
- Filtering large datasets before loading (reduces memory)
- Complex JOINs across multiple tables
- Aggregations on full tables (database optimized)
- Data stays in database (reporting, ETL)

USE PANDAS WHEN:
- Data already loaded in memory
- Complex transformations (string manipulation, custom functions)
- Statistical analysis and modeling
- Iterative exploration and visualization
- Working with time series (resampling, rolling windows)

COMBINE BOTH:
- Use SQL to filter/aggregate, then Pandas for analysis
- Pre-filter with SQL WHERE clause, analyze with Pandas
""")

In [None]:
# Example: SQL filtering + Pandas analysis
# Get only 2024 data for analysis

recent_orders_query = '''
    SELECT 
        o.order_id,
        o.order_date,
        o.customer_id,
        c.customer_segment,
        p.category,
        p.product_name,
        oi.quantity,
        oi.unit_price,
        oi.discount_percent,
        oi.quantity * oi.unit_price * (1 - oi.discount_percent/100) as revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2024-01-01'
'''

df_2024 = run_query(recent_orders_query)
df_2024['order_date'] = pd.to_datetime(df_2024['order_date'])

print(f"2024 line items loaded: {len(df_2024)}")
print("\nRevenue by segment (2024):")
df_2024.groupby('customer_segment')['revenue'].agg(['sum', 'mean', 'count']).round(2)

---
## Part 6: Data Visualization Dashboard

Now let's create a comprehensive analytics dashboard using subplots.
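
For readers new to subplot grids, here is a minimal sketch of the layout idea: `plt.subplots(2, 3)` returns a 2-D array of axes indexed as `axes[row, col]`, and any panel can be turned into a text box by hiding its axis. The `Agg` backend line is an assumption so the sketch runs without a display; drop it when working in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

# axes is a 2-D NumPy array; index it as axes[row, col]
for row in range(2):
    for col in range(3):
        axes[row, col].set_title(f'Panel ({row}, {col})')

# A panel can hold text instead of a chart by hiding its axis
axes[1, 2].axis('off')
axes[1, 2].text(0.5, 0.5, 'KPI text goes here', ha='center', va='center')

fig.tight_layout()
print(axes.shape)
plt.close(fig)
```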

In [None]:
# Prepare data for dashboard

# 1. Monthly revenue data
monthly_data = run_query('''
    SELECT 
        strftime('%Y-%m', o.order_date) as month,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY month
    ORDER BY month
''')

# 2. Top products data
top_products_data = run_query('''
    SELECT 
        p.product_name,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.product_id
    ORDER BY revenue DESC
    LIMIT 8
''')

# 3. Customer segments data
segment_data = run_query('''
    SELECT 
        c.customer_segment,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY c.customer_segment
''')

# 4. Order values for histogram
order_values = run_query('''
    SELECT 
        o.order_id,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as order_total
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    GROUP BY o.order_id
''')

# 5. Category sales by year
category_by_year = run_query('''
    SELECT 
        p.category,
        strftime('%Y', o.order_date) as year,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    JOIN orders o ON oi.order_id = o.order_id
    GROUP BY p.category, year
''')

print("Data prepared for dashboard visualization")

In [None]:
# Create the main dashboard
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('E-Commerce Analytics Dashboard', fontsize=20, fontweight='bold', y=1.02)

# 1. Revenue Over Time (Line Chart)
ax1 = axes[0, 0]
ax1.plot(range(len(monthly_data)), monthly_data['revenue'], 
         marker='o', linewidth=2, markersize=5, color='#2E86AB')
ax1.fill_between(range(len(monthly_data)), monthly_data['revenue'], alpha=0.3, color='#2E86AB')
ax1.set_xlabel('Month')
ax1.set_ylabel('Revenue ($)')
ax1.set_title('Monthly Revenue Trend', fontsize=12, fontweight='bold')
# Set x-ticks to show every 3rd month
tick_positions = range(0, len(monthly_data), 3)
ax1.set_xticks(tick_positions)
ax1.set_xticklabels([monthly_data['month'].iloc[i] for i in tick_positions], rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# 2. Top Products (Horizontal Bar Chart)
ax2 = axes[0, 1]
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(top_products_data)))
# Shorten product names for display
short_names = [name[:20] + '...' if len(name) > 20 else name for name in top_products_data['product_name']]
bars = ax2.barh(range(len(top_products_data)), top_products_data['revenue'], color=colors)
ax2.set_yticks(range(len(top_products_data)))
ax2.set_yticklabels(short_names)
ax2.set_xlabel('Revenue ($)')
ax2.set_title('Top 8 Products by Revenue', fontsize=12, fontweight='bold')
ax2.invert_yaxis()

# 3. Customer Segments (Pie Chart)
ax3 = axes[0, 2]
colors_pie = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']
explode = [0.02] * len(segment_data)
wedges, texts, autotexts = ax3.pie(
    segment_data['revenue'], 
    labels=segment_data['customer_segment'],
    autopct='%1.1f%%',
    colors=colors_pie,
    explode=explode,
    startangle=90
)
ax3.set_title('Revenue by Customer Segment', fontsize=12, fontweight='bold')

# 4. Order Value Distribution (Histogram)
ax4 = axes[1, 0]
ax4.hist(order_values['order_total'], bins=30, edgecolor='black', alpha=0.7, color='#2E86AB')
ax4.axvline(order_values['order_total'].mean(), color='red', linestyle='--', 
            label=f"Mean: ${order_values['order_total'].mean():.2f}")
ax4.axvline(order_values['order_total'].median(), color='green', linestyle='--',
            label=f"Median: ${order_values['order_total'].median():.2f}")
ax4.set_xlabel('Order Value ($)')
ax4.set_ylabel('Frequency')
ax4.set_title('Order Value Distribution', fontsize=12, fontweight='bold')
ax4.legend()

# 5. Sales by Category (Stacked Bar by Year)
ax5 = axes[1, 1]
pivot_category = category_by_year.pivot(index='category', columns='year', values='revenue').fillna(0)
pivot_category.plot(kind='bar', stacked=True, ax=ax5, colormap='viridis', edgecolor='white')
ax5.set_xlabel('Category')
ax5.set_ylabel('Revenue ($)')
ax5.set_title('Category Revenue by Year', fontsize=12, fontweight='bold')
ax5.tick_params(axis='x', rotation=45)
ax5.legend(title='Year')

# 6. KPI Summary Box
ax6 = axes[1, 2]
ax6.axis('off')

# Calculate KPIs
total_revenue = df_order_details['line_revenue'].sum()
total_orders = df_orders.shape[0]
total_customers = df_customers.shape[0]
avg_order_value = order_values['order_total'].mean()
total_profit = df_order_details['line_profit'].sum()
profit_margin = (total_profit / total_revenue) * 100

kpi_text = f"""
KEY PERFORMANCE INDICATORS
{'='*35}

Total Revenue:      ${total_revenue:,.2f}

Total Orders:       {total_orders:,}

Total Customers:    {total_customers:,}

Avg Order Value:    ${avg_order_value:,.2f}

Gross Profit:       ${total_profit:,.2f}

Profit Margin:      {profit_margin:.1f}%
"""

ax6.text(0.1, 0.5, kpi_text, transform=ax6.transAxes, fontsize=12,
         verticalalignment='center', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))

plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nDashboard saved as 'dashboard.png'")

---
## Part 7: Advanced Analytics

This part goes beyond headline metrics with four analyses: customer cohort retention, products frequently bought together, seasonal revenue by category, and a year-over-year revenue comparison.

In [None]:
# Customer Cohort Analysis
print("="*60)
print("CUSTOMER COHORT ANALYSIS")
print("="*60)

# Get customer first purchase month (cohort)
cohort_query = '''
    WITH customer_first_order AS (
        SELECT 
            customer_id,
            MIN(strftime('%Y-%m', order_date)) as cohort_month
        FROM orders
        GROUP BY customer_id
    ),
    customer_orders AS (
        SELECT 
            o.customer_id,
            cfo.cohort_month,
            strftime('%Y-%m', o.order_date) as order_month,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
        FROM orders o
        JOIN customer_first_order cfo ON o.customer_id = cfo.customer_id
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY o.customer_id, cfo.cohort_month, order_month
    )
    SELECT 
        cohort_month,
        order_month,
        COUNT(DISTINCT customer_id) as customers,
        SUM(revenue) as total_revenue
    FROM customer_orders
    GROUP BY cohort_month, order_month
    ORDER BY cohort_month, order_month
'''

cohort_data = run_query(cohort_query)
print(f"Cohort data shape: {cohort_data.shape}")
cohort_data.head(10)
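
One detail of the revenue expression is worth checking against the schema from Part 1. SQLite truncates the result when both operands of `/` are integers, so `discount_percent/100` evaluates to 0 for any whole-number discount below 100 if the column stores integers. The sketch below demonstrates this on a throwaway in-memory table; `toy_discounts`, `pct_int`, and `pct_real` are hypothetical names, not part of the project database.

```python
import sqlite3

# Throwaway in-memory database; nothing here touches the project file
conn = sqlite3.connect(":memory:")
try:
    # Hypothetical table with the same 5% discount stored under two column types
    conn.execute("CREATE TABLE toy_discounts (pct_int INTEGER, pct_real REAL)")
    conn.execute("INSERT INTO toy_discounts VALUES (5, 5)")
    int_div, real_div, safe_div = conn.execute(
        "SELECT pct_int / 100, pct_real / 100, pct_int / 100.0 FROM toy_discounts"
    ).fetchone()
finally:
    conn.close()

print(f"INTEGER / 100   -> {int_div}")
print(f"REAL / 100      -> {real_div}")
print(f"INTEGER / 100.0 -> {safe_div}")
```

If the project's column turns out to be INTEGER, dividing by `100.0` is one way to force floating-point division.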

In [None]:
# Create cohort retention matrix
def calculate_months_since_cohort(row):
    """Calculate months between cohort and order."""
    cohort = datetime.strptime(row['cohort_month'], '%Y-%m')
    order = datetime.strptime(row['order_month'], '%Y-%m')
    return (order.year - cohort.year) * 12 + (order.month - cohort.month)

cohort_data['months_since'] = cohort_data.apply(calculate_months_since_cohort, axis=1)

# Pivot for retention matrix
retention_matrix = cohort_data.pivot_table(
    index='cohort_month',
    columns='months_since',
    values='customers',
    aggfunc='sum'
).fillna(0)

# Retention % = customers active N months after first purchase / cohort size (column 0)
retention_pct = retention_matrix.div(retention_matrix[0], axis=0) * 100

# Plot cohort heatmap (limited to first 6 months)
fig, ax = plt.subplots(figsize=(12, 8))

# Limit columns for readability
cols_to_show = [c for c in retention_pct.columns if c <= 6]
retention_display = retention_pct[cols_to_show].tail(12)  # Last 12 cohorts

cmap = plt.cm.YlGnBu  # sequential colormap: darker cells mean higher retention

im = ax.imshow(retention_display.values, cmap=cmap, aspect='auto')

# Add labels
ax.set_xticks(range(len(cols_to_show)))
ax.set_xticklabels([f'Month {c}' for c in cols_to_show])
ax.set_yticks(range(len(retention_display)))
ax.set_yticklabels(retention_display.index)

# Add percentage text
for i in range(len(retention_display)):
    for j in range(len(cols_to_show)):
        value = retention_display.iloc[i, j]
        if value > 0:
            text_color = 'white' if value > 50 else 'black'
            ax.text(j, i, f'{value:.0f}%', ha='center', va='center', 
                   color=text_color, fontsize=9)

ax.set_xlabel('Months Since First Purchase')
ax.set_ylabel('Cohort (First Purchase Month)')
ax.set_title('Customer Retention by Cohort', fontsize=14, fontweight='bold')

plt.colorbar(im, label='Retention %')
plt.tight_layout()
plt.show()
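
The retention logic is easier to verify on a table small enough to check by hand. The following is a minimal sketch in pure Pandas, assuming a hypothetical `toy_orders` frame with three customers; it is not the project data, and it computes the month offset with vectorized arithmetic instead of a row-wise `apply`.

```python
import pandas as pd

# Hypothetical toy data: three customers, orders spread over three months
toy_orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02", "2024-02-14"]
    ),
})

# Cohort = month of each customer's first order
toy_orders["first_order"] = toy_orders.groupby("customer_id")["order_date"].transform("min")
toy_orders["cohort_month"] = toy_orders["first_order"].dt.strftime("%Y-%m")

# Whole months between the cohort month and the order month
month_index = toy_orders["order_date"].dt.year * 12 + toy_orders["order_date"].dt.month
cohort_index = toy_orders["first_order"].dt.year * 12 + toy_orders["first_order"].dt.month
toy_orders["months_since"] = month_index - cohort_index

# Distinct active customers per cohort and month offset
toy_retention = (
    toy_orders.groupby(["cohort_month", "months_since"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Divide each row by its cohort size (month 0)
toy_retention_pct = toy_retention.div(toy_retention[0], axis=0) * 100
print(toy_retention_pct)
```

In this toy data the 2024-01 cohort has two customers. One returns in month 1 and the other in month 2, so both offsets show 50% retention.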

In [None]:
# Product Correlation Analysis - What's bought together?
print("="*60)
print("PRODUCT CORRELATION: FREQUENTLY BOUGHT TOGETHER")
print("="*60)

# Find products commonly purchased in the same order
product_pairs_query = '''
    SELECT 
        p1.product_name as product_1,
        p2.product_name as product_2,
        COUNT(*) as times_bought_together
    FROM order_items oi1
    JOIN order_items oi2 ON oi1.order_id = oi2.order_id AND oi1.product_id < oi2.product_id
    JOIN products p1 ON oi1.product_id = p1.product_id
    JOIN products p2 ON oi2.product_id = p2.product_id
    GROUP BY p1.product_id, p2.product_id
    HAVING COUNT(*) >= 3
    ORDER BY times_bought_together DESC
    LIMIT 15
'''

product_pairs = run_query(product_pairs_query)
product_pairs
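
The self-join can be cross-checked in plain Python. The sketch below counts co-purchased pairs on a hypothetical `toy_items` frame. It counts each pair at most once per order, which matches the SQL `COUNT(*)` only when a product appears on at most one line of an order.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical toy order lines (not the project data)
toy_items = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 3],
    "product_id": [10, 20, 30, 10, 20, 20, 30],
})

pair_counts = Counter()
for _, products in toy_items.groupby("order_id")["product_id"]:
    # Sorting mirrors the oi1.product_id < oi2.product_id condition,
    # so (10, 20) and (20, 10) are treated as the same pair
    for pair in combinations(sorted(set(products)), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```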

In [None]:
# Seasonal Trends Analysis
print("="*60)
print("SEASONAL TRENDS ANALYSIS")
print("="*60)

seasonal_query = '''
    SELECT 
        CASE 
            WHEN CAST(strftime('%m', o.order_date) AS INTEGER) IN (12, 1, 2) THEN 'Winter'
            WHEN CAST(strftime('%m', o.order_date) AS INTEGER) IN (3, 4, 5) THEN 'Spring'
            WHEN CAST(strftime('%m', o.order_date) AS INTEGER) IN (6, 7, 8) THEN 'Summer'
            ELSE 'Fall'
        END as season,
        p.category,
        COUNT(DISTINCT o.order_id) as orders,
        SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    GROUP BY season, p.category
'''

seasonal_data = run_query(seasonal_query)

# Pivot for visualization
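seasonal_pivot = seasonal_data.pivot(index='category', columns='season', values='revenue')
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
# reindex (rather than plain column selection) avoids a KeyError if a season has no orders
seasonal_pivot = seasonal_pivot.reindex(columns=season_order).fillna(0)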

# Plot seasonal heatmap
fig, ax = plt.subplots(figsize=(10, 6))

im = ax.imshow(seasonal_pivot.values, cmap='YlOrRd', aspect='auto')

ax.set_xticks(range(len(season_order)))
ax.set_xticklabels(season_order)
ax.set_yticks(range(len(seasonal_pivot)))
ax.set_yticklabels(seasonal_pivot.index)

# Add revenue values
for i in range(len(seasonal_pivot)):
    for j in range(len(season_order)):
        value = seasonal_pivot.iloc[i, j]
        text_color = 'white' if value > seasonal_pivot.values.mean() else 'black'
        ax.text(j, i, f'${value/1000:.1f}K', ha='center', va='center', 
               color=text_color, fontsize=10)

ax.set_xlabel('Season')
ax.set_ylabel('Product Category')
ax.set_title('Category Revenue by Season', fontsize=14, fontweight='bold')

plt.colorbar(im, label='Revenue ($)')
plt.tight_layout()
plt.show()
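
The same month-to-season rule can be applied on the Pandas side when the dates are already in a DataFrame. This is a small sketch with hypothetical names (`MONTH_TO_SEASON`, `toy_dates`); an ordered categorical keeps the seasons in calendar order rather than alphabetical order.

```python
import pandas as pd

SEASON_ORDER = ["Winter", "Spring", "Summer", "Fall"]

# Same rule as the SQL CASE expression
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Hypothetical order dates
toy_dates = pd.Series(
    pd.to_datetime(["2024-01-15", "2024-04-01", "2024-07-04", "2024-11-28", "2024-12-25"])
)

toy_seasons = pd.Categorical(
    toy_dates.dt.month.map(MONTH_TO_SEASON),
    categories=SEASON_ORDER,
    ordered=True,
)
season_counts = pd.Series(toy_seasons).value_counts(sort=False)
print(season_counts)
```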

In [None]:
# Year-over-Year Comparison
print("="*60)
print("YEAR-OVER-YEAR COMPARISON")
print("="*60)

yoy_query = '''
    SELECT 
        strftime('%Y', order_date) as year,
        CAST(strftime('%m', order_date) AS INTEGER) as month,
        COUNT(DISTINCT order_id) as orders,
        SUM(revenue) as revenue
    FROM (
        SELECT 
            o.order_id,
            o.order_date,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
        FROM orders o
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY o.order_id
    )
    GROUP BY year, month
    ORDER BY year, month
'''

yoy_data = run_query(yoy_query)

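# Pivot by year; reindex so every year has all 12 months (missing months become 0),
# which keeps the bar heights aligned with the 12 x-positions used below
yoy_pivot = (
    yoy_data.pivot(index='month', columns='year', values='revenue')
    .reindex(range(1, 13))
    .fillna(0)
)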

# Plot YoY comparison
fig, ax = plt.subplots(figsize=(12, 6))

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

x = np.arange(12)
width = 0.35

if '2023' in yoy_pivot.columns:
    bars1 = ax.bar(x - width/2, yoy_pivot['2023'], width, label='2023', color='#2E86AB')
if '2024' in yoy_pivot.columns:
    bars2 = ax.bar(x + width/2, yoy_pivot['2024'], width, label='2024', color='#A23B72')

ax.set_xlabel('Month')
ax.set_ylabel('Revenue ($)')
ax.set_title('Year-over-Year Monthly Revenue Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(months)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Calculate YoY growth
if '2023' in yoy_pivot.columns and '2024' in yoy_pivot.columns:
    total_2023 = yoy_pivot['2023'].sum()
    total_2024 = yoy_pivot['2024'].sum()
    yoy_growth = ((total_2024 - total_2023) / total_2023) * 100
    print(f"\n2023 Total Revenue: ${total_2023:,.2f}")
    print(f"2024 Total Revenue: ${total_2024:,.2f}")
    print(f"Year-over-Year Growth: {yoy_growth:+.1f}%")
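
The total growth figure hides month-level differences. A per-month growth rate falls out of the same pivot, as in this sketch on hypothetical numbers (`toy_yoy` is not the project data):

```python
import pandas as pd

# Hypothetical revenue for two months in two years
toy_yoy = pd.DataFrame({
    "year":    ["2023", "2023", "2024", "2024"],
    "month":   [1, 2, 1, 2],
    "revenue": [1000.0, 2000.0, 1100.0, 1800.0],
})

toy_pivot = toy_yoy.pivot(index="month", columns="year", values="revenue")

# Growth of each 2024 month relative to the same month in 2023
toy_growth_pct = ((toy_pivot["2024"] / toy_pivot["2023"] - 1) * 100).round(1)
print(toy_growth_pct)
```

A month with zero revenue in the base year would produce `inf` here, so real data may need a guard for that case.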

---
## Part 8: Creating Summary Reports

Let's create executive summary statistics and formatted tables.

In [None]:
# Executive Summary Statistics
print("="*70)
print("                    EXECUTIVE SUMMARY REPORT                        ")
print("                 E-Commerce Analytics Dashboard                     ")
print("="*70)

# Overall metrics
total_revenue = df_order_details['line_revenue'].sum()
total_orders = df_orders.shape[0]
total_customers = df_customers.shape[0]
active_customers = df_orders['customer_id'].nunique()
total_products = df_products.shape[0]
total_items = df_items['quantity'].sum()
avg_order_value = order_values['order_total'].mean()
total_profit = df_order_details['line_profit'].sum()
profit_margin = (total_profit / total_revenue) * 100

print("\n## FINANCIAL METRICS")
print("-" * 40)
print(f"Total Revenue:           ${total_revenue:>15,.2f}")
print(f"Gross Profit:            ${total_profit:>15,.2f}")
print(f"Profit Margin:           {profit_margin:>15.1f}%")
print(f"Average Order Value:     ${avg_order_value:>15,.2f}")

print("\n## OPERATIONAL METRICS")
print("-" * 40)
print(f"Total Orders:            {total_orders:>15,}")
print(f"Total Items Sold:        {total_items:>15,}")
print(f"Total Customers:         {total_customers:>15,}")
print(f"Active Customers:        {active_customers:>15,}")
print(f"Products in Catalog:     {total_products:>15,}")

In [None]:
# Formatted Category Performance Table
print("\n## CATEGORY PERFORMANCE")
print("-" * 40)

category_performance = run_query('''
    SELECT 
        p.category as Category,
        COUNT(DISTINCT oi.order_id) as Orders,
        SUM(oi.quantity) as "Units Sold",
        ROUND(SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as Revenue,
        ROUND(SUM(oi.quantity * (oi.unit_price * (1 - oi.discount_percent/100) - p.cost_price)), 2) as Profit,
        ROUND(AVG(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)), 2) as "Avg Sale"
    FROM products p
    JOIN order_items oi ON p.product_id = oi.product_id
    GROUP BY p.category
    ORDER BY Revenue DESC
''')

# Style the DataFrame
def format_currency(x):
    """Format number as currency."""
    return f"${x:,.2f}"

styled_category = category_performance.copy()
styled_category['Revenue'] = styled_category['Revenue'].apply(format_currency)
styled_category['Profit'] = styled_category['Profit'].apply(format_currency)
styled_category['Avg Sale'] = styled_category['Avg Sale'].apply(format_currency)

styled_category

In [None]:
# Customer Segment Analysis
print("\n## CUSTOMER SEGMENT ANALYSIS")
print("-" * 40)

segment_analysis = run_query('''
    WITH customer_metrics AS (
        SELECT 
            c.customer_segment,
            c.customer_id,
            COUNT(DISTINCT o.order_id) as orders,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
        FROM customers c
        LEFT JOIN orders o ON c.customer_id = o.customer_id
        LEFT JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY c.customer_segment, c.customer_id
    )
    SELECT 
        customer_segment as Segment,
        COUNT(*) as Customers,
        SUM(CASE WHEN orders > 0 THEN 1 ELSE 0 END) as "Active Customers",
        ROUND(AVG(orders), 2) as "Avg Orders",
        ROUND(SUM(revenue), 2) as "Total Revenue",
        ROUND(AVG(revenue), 2) as "Avg CLV"
    FROM customer_metrics
    GROUP BY customer_segment
    ORDER BY "Total Revenue" DESC
''')

segment_analysis

In [None]:
# Export visualizations
print("\n## EXPORTING REPORTS")
print("-" * 40)

# Create a summary report figure
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('E-Commerce Monthly Report', fontsize=16, fontweight='bold')

# Top performers
ax1 = axes[0, 0]
top_5 = top_products_data.head(5)
bars = ax1.barh(range(len(top_5)), top_5['revenue'], color=plt.cm.Blues(np.linspace(0.4, 0.8, len(top_5))))
ax1.set_yticks(range(len(top_5)))
ax1.set_yticklabels([n[:18] + '...' if len(n) > 18 else n for n in top_5['product_name']])
ax1.set_xlabel('Revenue ($)')
ax1.set_title('Top 5 Products', fontweight='bold')
ax1.invert_yaxis()

# Revenue trend
ax2 = axes[0, 1]
ax2.plot(range(len(monthly_data)), monthly_data['revenue'], marker='o', linewidth=2, color='#2E86AB')
ax2.fill_between(range(len(monthly_data)), monthly_data['revenue'], alpha=0.3, color='#2E86AB')
ax2.set_title('Revenue Trend', fontweight='bold')
ax2.set_ylabel('Revenue ($)')
tick_pos = range(0, len(monthly_data), 4)
ax2.set_xticks(tick_pos)
ax2.set_xticklabels([monthly_data['month'].iloc[i] for i in tick_pos], rotation=45, ha='right')

# Segment breakdown
ax3 = axes[1, 0]
segment_data_sorted = segment_data.sort_values('revenue', ascending=False)
colors = ['#2E86AB', '#A23B72', '#F18F01', '#C73E1D']
ax3.pie(segment_data_sorted['revenue'], labels=segment_data_sorted['customer_segment'],
        autopct='%1.1f%%', colors=colors, startangle=90)
ax3.set_title('Revenue by Segment', fontweight='bold')

# Monthly growth
ax4 = axes[1, 1]
monthly_growth = monthly_data['revenue'].pct_change() * 100
colors = ['green' if x >= 0 else 'red' for x in monthly_growth.fillna(0)]
ax4.bar(range(len(monthly_growth)), monthly_growth.fillna(0), color=colors, alpha=0.7)
ax4.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax4.set_title('Month-over-Month Growth %', fontweight='bold')
ax4.set_ylabel('Growth %')
ax4.set_xlabel('Month')

plt.tight_layout()
plt.savefig('monthly_report.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nReports exported:")
print("  - dashboard.png")
print("  - monthly_report.png")

---
## Part 9: Challenges (Optional Extensions)

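The cells below work through four optional extensions: advanced SQL with CTEs, window functions for running totals, a simple linear forecast, and RFM segmentation.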

In [None]:
# Challenge 1: Advanced SQL - Using CTEs and Subqueries
print("="*60)
print("CHALLENGE 1: ADVANCED SQL QUERIES")
print("="*60)

# Find customers whose spending is above average for their segment
above_avg_query = '''
    WITH customer_spending AS (
        -- Lifetime spending per customer (customers with at least one order)
        SELECT 
            c.customer_id,
            c.first_name || ' ' || c.last_name as name,
            c.customer_segment,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as total_spending
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY c.customer_id
    ),
    segment_avg AS (
        -- Average of those per-customer totals within each segment
        SELECT 
            customer_segment,
            AVG(total_spending) as avg_spending
        FROM customer_spending
        GROUP BY customer_segment
    )
    SELECT 
        cs.name,
        cs.customer_segment,
        ROUND(cs.total_spending, 2) as spending,
        ROUND(sa.avg_spending, 2) as segment_avg,
        ROUND((cs.total_spending - sa.avg_spending) / sa.avg_spending * 100, 1) as pct_above_avg
    FROM customer_spending cs
    JOIN segment_avg sa ON cs.customer_segment = sa.customer_segment
    WHERE cs.total_spending > sa.avg_spending
    ORDER BY pct_above_avg DESC
    LIMIT 10
'''

print("\nTop 10 Customers Spending Above Segment Average:")
run_query(above_avg_query)
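
Once per-customer totals are in a DataFrame, the same comparison can be made without a join by using `groupby(...).transform`. This is a minimal sketch on a hypothetical `toy_spend` frame; the names and segment labels are made up for illustration.

```python
import pandas as pd

# Hypothetical per-customer totals
toy_spend = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "customer_segment": ["Regular", "Regular", "Premium", "Premium"],
    "total_spending": [100.0, 300.0, 1000.0, 500.0],
})

# transform("mean") broadcasts each segment's average back onto its rows
toy_spend["segment_avg"] = (
    toy_spend.groupby("customer_segment")["total_spending"].transform("mean")
)
toy_spend["pct_above_avg"] = (
    (toy_spend["total_spending"] / toy_spend["segment_avg"] - 1) * 100
).round(1)

above_avg = toy_spend[toy_spend["total_spending"] > toy_spend["segment_avg"]]
print(above_avg.sort_values("pct_above_avg", ascending=False))
```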

In [None]:
# Challenge 2: Running Totals and Rankings
print("="*60)
print("CHALLENGE 2: RUNNING TOTALS AND RANKINGS")
print("="*60)

# Calculate running total of revenue
running_total_query = '''
    WITH monthly_rev AS (
        SELECT 
            strftime('%Y-%m', o.order_date) as month,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as revenue
        FROM orders o
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY month
    )
    SELECT 
        month,
        ROUND(revenue, 2) as monthly_revenue,
        ROUND(SUM(revenue) OVER (ORDER BY month), 2) as running_total,
        ROUND(AVG(revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) as moving_avg_3m
    FROM monthly_rev
    ORDER BY month
'''

print("\nMonthly Revenue with Running Total and 3-Month Moving Average:")
run_query(running_total_query)
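
For comparison, the two window functions have close Pandas counterparts: `cumsum()` for the running total and `rolling(...).mean()` for the moving average. The sketch below uses a hypothetical `toy_monthly` frame. Setting `min_periods=1` mirrors the SQL frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`, which averages fewer rows at the start of the series.

```python
import pandas as pd

# Hypothetical monthly revenue, already aggregated
toy_monthly = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "revenue": [100.0, 200.0, 300.0, 400.0],
}).sort_values("month")

# Running total: equivalent of SUM(revenue) OVER (ORDER BY month)
toy_monthly["running_total"] = toy_monthly["revenue"].cumsum()

# 3-month moving average over the current row and up to two preceding rows
toy_monthly["moving_avg_3m"] = (
    toy_monthly["revenue"].rolling(window=3, min_periods=1).mean()
)
print(toy_monthly)
```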

In [None]:
# Challenge 3: Simple Sales Forecasting with Pandas
print("="*60)
print("CHALLENGE 3: SIMPLE SALES FORECASTING")
print("="*60)

# Get monthly data
monthly_df = monthly_data.copy()
monthly_df['month_idx'] = range(len(monthly_df))

# Fit a linear trend (degree-1 least-squares polynomial) to monthly revenue
x = monthly_df['month_idx'].values
y = monthly_df['revenue'].values

coeffs = np.polyfit(x, y, 1)      # [slope, intercept]
trend_line = np.poly1d(coeffs)    # callable: trend_line(month_idx) -> revenue

# Forecast next 3 months
future_months = np.array([len(x), len(x)+1, len(x)+2])
forecast = trend_line(future_months)

# Plot with forecast
fig, ax = plt.subplots(figsize=(12, 6))

# Historical data
ax.plot(x, y, 'o-', label='Historical', color='#2E86AB', linewidth=2)

# Trend line
ax.plot(x, trend_line(x), '--', label='Trend', color='gray', alpha=0.7)

# Forecast
ax.plot(future_months, forecast, 's--', label='Forecast', color='#A23B72', 
        linewidth=2, markersize=10)

# Illustrative ±10% band around the forecast (a fixed margin, not a statistical confidence interval)
ax.fill_between(future_months, forecast * 0.9, forecast * 1.1, 
                alpha=0.3, color='#A23B72', label='±10% Band')

ax.set_xlabel('Month Index')
ax.set_ylabel('Revenue ($)')
ax.set_title('Revenue Forecast (Linear Trend)', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nForecast for next 3 months:")
for i, f in enumerate(forecast):
    print(f"  Month +{i+1}: ${f:,.2f}")
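
The shaded band in the forecast plot is a fixed ±10% margin. One rough alternative is to size the band from how far the historical points sit from the trend line. The sketch below does this on hypothetical numbers (`toy_y`), using two residual standard deviations. It ignores uncertainty in the fitted slope and intercept, so treat it as an approximation rather than a proper prediction interval.

```python
import numpy as np

# Hypothetical monthly revenue series
toy_y = np.array([100.0, 120.0, 115.0, 140.0, 150.0, 165.0])
toy_x = np.arange(len(toy_y))

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(toy_x, toy_y, 1)
residuals = toy_y - (slope * toy_x + intercept)

# Residual standard deviation, with 2 degrees of freedom used by the fit
resid_std = np.sqrt(np.sum(residuals ** 2) / (len(toy_y) - 2))

# Forecast the next 3 periods with a band of +/- 2 residual standard deviations
future_x = np.arange(len(toy_y), len(toy_y) + 3)
toy_forecast = slope * future_x + intercept
lower = toy_forecast - 2 * resid_std
upper = toy_forecast + 2 * resid_std

for step, (lo, mid, hi) in enumerate(zip(lower, toy_forecast, upper), start=1):
    print(f"Month +{step}: {mid:,.2f}  (band {lo:,.2f} to {hi:,.2f})")
```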

In [None]:
# Challenge 4: Customer RFM Analysis
print("="*60)
print("CHALLENGE 4: RFM (RECENCY, FREQUENCY, MONETARY) ANALYSIS")
print("="*60)

rfm_query = '''
    WITH customer_rfm AS (
        SELECT 
            c.customer_id,
            c.first_name || ' ' || c.last_name as name,
            c.customer_segment,
            julianday('2024-12-31') - julianday(MAX(o.order_date)) as recency_days,
            COUNT(DISTINCT o.order_id) as frequency,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as monetary
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY c.customer_id
    )
    SELECT 
        name,
        customer_segment,
        CAST(recency_days AS INTEGER) as recency_days,
        frequency,
        ROUND(monetary, 2) as monetary,
        CASE 
            WHEN recency_days <= 30 AND frequency >= 5 AND monetary >= 1000 THEN 'Champion'
            WHEN recency_days <= 60 AND frequency >= 3 THEN 'Loyal'
            WHEN recency_days <= 90 AND monetary >= 500 THEN 'Potential Loyalist'
            WHEN recency_days > 180 AND frequency >= 3 THEN 'At Risk'
            WHEN recency_days > 180 THEN 'Hibernating'
            ELSE 'Regular'
        END as rfm_segment
    FROM customer_rfm
    ORDER BY monetary DESC
    LIMIT 20
'''

rfm_results = run_query(rfm_query)
rfm_results

In [None]:
# RFM Segment Distribution
rfm_all_query = '''
    WITH customer_rfm AS (
        SELECT 
            c.customer_id,
            julianday('2024-12-31') - julianday(MAX(o.order_date)) as recency_days,
            COUNT(DISTINCT o.order_id) as frequency,
            SUM(oi.quantity * oi.unit_price * (1 - oi.discount_percent/100)) as monetary
        FROM customers c
        JOIN orders o ON c.customer_id = o.customer_id
        JOIN order_items oi ON o.order_id = oi.order_id
        GROUP BY c.customer_id
    )
    SELECT 
        CASE 
            WHEN recency_days <= 30 AND frequency >= 5 AND monetary >= 1000 THEN 'Champion'
            WHEN recency_days <= 60 AND frequency >= 3 THEN 'Loyal'
            WHEN recency_days <= 90 AND monetary >= 500 THEN 'Potential Loyalist'
            WHEN recency_days > 180 AND frequency >= 3 THEN 'At Risk'
            WHEN recency_days > 180 THEN 'Hibernating'
            ELSE 'Regular'
        END as rfm_segment,
        COUNT(*) as customer_count,
        ROUND(SUM(monetary), 2) as total_value
    FROM customer_rfm
    GROUP BY rfm_segment
    ORDER BY total_value DESC
'''

rfm_segments = run_query(rfm_all_query)

# Visualize RFM segments
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Customer count by segment
colors = plt.cm.Set2(np.linspace(0, 1, len(rfm_segments)))
ax1.bar(rfm_segments['rfm_segment'], rfm_segments['customer_count'], color=colors)
ax1.set_xlabel('RFM Segment')
ax1.set_ylabel('Number of Customers')
ax1.set_title('Customer Count by RFM Segment', fontweight='bold')
ax1.tick_params(axis='x', rotation=45)

# Value by segment
ax2.bar(rfm_segments['rfm_segment'], rfm_segments['total_value'], color=colors)
ax2.set_xlabel('RFM Segment')
ax2.set_ylabel('Total Customer Value ($)')
ax2.set_title('Customer Value by RFM Segment', fontweight='bold')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\nRFM Segment Summary:")
rfm_segments
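
The `CASE` thresholds above are fixed by hand. A common alternative is to score each RFM dimension by quantile so the cut-offs adapt to the data. The sketch below assumes a hypothetical `toy_rfm` frame and a helper named `quantile_score`; neither is part of the project code.

```python
import pandas as pd

# Hypothetical per-customer RFM values
toy_rfm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "recency_days": [5, 40, 200, 15, 90, 365],
    "frequency": [8, 3, 1, 6, 2, 1],
    "monetary": [1500.0, 400.0, 80.0, 900.0, 250.0, 60.0],
})


def quantile_score(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score 1 (worst) to 3 (best) by tercile.

    Ranking with method='first' breaks ties, so the bin edges stay unique.
    """
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=3, labels=[1, 2, 3]).astype(int)


# Fewer days since the last order is better, so recency is scored in reverse
toy_rfm["r_score"] = quantile_score(toy_rfm["recency_days"], higher_is_better=False)
toy_rfm["f_score"] = quantile_score(toy_rfm["frequency"])
toy_rfm["m_score"] = quantile_score(toy_rfm["monetary"])
toy_rfm["rfm_total"] = toy_rfm[["r_score", "f_score", "m_score"]].sum(axis=1)

print(toy_rfm.sort_values("rfm_total", ascending=False))
```

Quantile scores are relative to the current customer base, so a customer's score can change as the data grows even if their own behavior stays the same.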

In [None]:
# Wrap up: summarize output files and key learnings
print("\n" + "="*60)
print("PROJECT COMPLETE")
print("="*60)
print(f"""
Database file: {DB_PATH}
Exported files:
  - dashboard.png
  - monthly_report.png

Key learnings from this project:
1. Database design with proper schema and relationships
2. SQL queries: SELECT, JOIN, GROUP BY, CTEs, window functions
3. Combining SQL with Pandas for analysis
4. Creating multi-panel visualization dashboards
5. Advanced analytics: cohort analysis, RFM, forecasting
6. Generating business reports and summaries
""")

---
## Summary

This capstone project demonstrated a complete analytics workflow:

### Skills Applied
- **Database Design**: Created a normalized SQLite schema with proper relationships
- **Data Generation**: Built realistic sample data with seasonal patterns and trends
- **SQL Queries**: Used SELECT, JOIN, GROUP BY, CTEs, subqueries, and window functions
- **Pandas Integration**: Combined SQL filtering with Pandas analysis
- **Visualization**: Created comprehensive dashboards with multiple chart types
- **Business Analytics**: Performed cohort analysis, RFM segmentation, and forecasting

### Key Takeaways
1. Use SQL for filtering and aggregating large datasets before loading into memory
2. Pandas excels at complex transformations and statistical analysis
3. Matplotlib subplots enable comprehensive dashboard layouts
4. Business questions should drive your analysis approach
5. Always translate technical findings into actionable insights
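
The first takeaway can be illustrated with a minimal sketch: filter and aggregate inside SQLite, then load only the small result into Pandas. The table `toy_sales` and its values are hypothetical, and the `?` placeholder passes the threshold as a parameter instead of formatting it into the SQL string.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database with a hypothetical sales table
conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE TABLE toy_sales (category TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO toy_sales VALUES (?, ?)",
        [("Books", 10.0), ("Books", 15.0), ("Toys", 40.0), ("Toys", 5.0)],
    )

    # Only the aggregated rows cross into Pandas, not the raw sales lines
    toy_summary = pd.read_sql_query(
        """
        SELECT category, SUM(amount) AS revenue
        FROM toy_sales
        WHERE amount >= ?
        GROUP BY category
        ORDER BY revenue DESC
        """,
        conn,
        params=(10.0,),
    )
finally:
    conn.close()

print(toy_summary)
```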

### Next Steps
- Connect to production databases (PostgreSQL, MySQL)
- Build interactive dashboards with Plotly or Dash
- Implement automated reporting pipelines
- Add machine learning for predictive analytics