# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 03 · Notebook 02 — From DataFrames to Databases: Mental Model Mapping
**Instructor:** Amir Charkhi  |  **Goal:** Map your pandas DataFrame mental model onto SQL databases.

> Format: theory → implementation → best practices → real-world application.


## 🎯 The Big Picture

You've mastered pandas. You're comfortable with DataFrames. Now we're adding SQL to your toolkit.

**Why both?**
- **Pandas**: In-memory, flexible, great for exploration
- **SQL**: Scalable, persistent, great for production

Think of them as complementary tools:
- Use SQL to **extract and reduce** data from large sources
- Use pandas to **explore and visualize** the reduced data
- Use SQL to **productionize** your proven analyses

In [None]:
# Setup and imports
import pandas as pd
import numpy as np
import sqlite3
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Custom SQL display function
from IPython.display import Markdown, display

def show_sql(query):
    """Pretty print SQL queries"""
    display(Markdown(f"```sql\n{query}\n```"))

print("✅ Environment ready!")

## 📊 Setting Up Our Data Laboratory

We'll use the same retail dataset from Week 1, but now in both pandas AND SQL!

In [None]:
# Create sample retail data (same structure as Week 1)
np.random.seed(42)

# Generate sample data
n_transactions = 10000
n_customers = 1500
n_products = 200

# Create transactions
transactions = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.randint(1, n_customers + 1, n_transactions),
    'product_id': np.random.randint(1, n_products + 1, n_transactions),
    'quantity': np.random.randint(1, 5, n_transactions),
    'date': pd.date_range('2024-01-01', periods=n_transactions, freq='15min'),
    'store_id': np.random.choice(['NYC', 'LA', 'CHI', 'HOU', 'PHX'], n_transactions)
})

# Create products
categories = ['Electronics', 'Clothing', 'Food', 'Books', 'Sports']
products = pd.DataFrame({
    'product_id': range(1, n_products + 1),
    'product_name': [f'Product_{i}' for i in range(1, n_products + 1)],
    'category': np.random.choice(categories, n_products),
    'price': np.round(np.random.uniform(10, 500, n_products), 2)
})

# Create customers
customers = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'customer_name': [f'Customer_{i}' for i in range(1, n_customers + 1)],
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_customers),
    'signup_date': pd.date_range('2023-01-01', periods=n_customers, freq='6h')  # lowercase 'h': the uppercase 'H' alias is deprecated in pandas 2.2+
})

# Add revenue column
transactions = transactions.merge(products[['product_id', 'price']], on='product_id')
transactions['revenue'] = transactions['quantity'] * transactions['price']

print(f"📦 Created {len(transactions):,} transactions")
print(f"👥 Created {len(customers):,} customers")
print(f"🏷️ Created {len(products):,} products")

# Preview the data
transactions.head()

In [None]:
# Create SQLite database and load our data
conn = sqlite3.connect('retail.db')

# Load data into SQL
transactions.to_sql('transactions', conn, if_exists='replace', index=False)
products.to_sql('products', conn, if_exists='replace', index=False)
customers.to_sql('customers', conn, if_exists='replace', index=False)

# Create a SQLAlchemy engine (optional here: the queries below use the sqlite3 connection `conn`)
engine = create_engine('sqlite:///retail.db')

print("✅ Database created and data loaded!")

# Verify tables
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
print("\n📊 Tables in database:")
for table in tables['name']:
    count = pd.read_sql(f"SELECT COUNT(*) as count FROM {table}", conn).iloc[0, 0]
    print(f"  - {table}: {count:,} rows")

---

## 🔄 Part 1: Basic Operations - SELECT, WHERE, ORDER BY

Let's start with the fundamentals. Most everyday pandas operations have a direct SQL equivalent!

### 1.1 Selecting Columns

The most basic operation - choosing which columns to work with.

In [None]:
# PANDAS: Select specific columns
pandas_result = transactions[['transaction_id', 'customer_id', 'revenue']].head()
print("🐼 Pandas approach:")
print(pandas_result)

print("\n" + "="*50 + "\n")

# SQL: Select specific columns
sql_query = """
SELECT transaction_id, customer_id, revenue
FROM transactions
LIMIT 5
"""

print("🗄️ SQL approach:")
show_sql(sql_query)
sql_result = pd.read_sql(sql_query, conn)
print(sql_result)

print("\n✅ Same columns, same rows here. Note: SQL only guarantees row order when you add ORDER BY!")

### 1.2 Filtering Rows (WHERE clause)

Filtering is where SQL starts to shine with complex conditions.

In [None]:
# PANDAS: Multiple filter conditions
pandas_filter = transactions[
    (transactions['revenue'] > 500) & 
    (transactions['store_id'] == 'NYC')
][['transaction_id', 'revenue', 'store_id']].head()

print("🐼 Pandas filtering:")
print("df[(df['revenue'] > 500) & (df['store_id'] == 'NYC')]")
print(pandas_filter)

print("\n" + "="*50 + "\n")

# SQL: WHERE clause
sql_query = """
SELECT transaction_id, revenue, store_id
FROM transactions
WHERE revenue > 500 
  AND store_id = 'NYC'
LIMIT 5
"""

print("🗄️ SQL filtering:")
show_sql(sql_query)
sql_filter = pd.read_sql(sql_query, conn)
print(sql_filter)

# Pro tip comparison
print("\n💡 Pro Tip: SQL WHERE is often more readable for complex conditions!")
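
The same mapping extends to the other common filter operators. The sketch below is a minimal illustration on hypothetical toy data (`demo`, `demo_conn`, and `demo_transactions` are made-up names, kept separate from the notebook's `transactions` table): `isin()` corresponds to `IN`, `between()` to `BETWEEN` (both inclusive), and a negated `str.startswith()` to `NOT LIKE 'P%'`.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, separate from the notebook's `transactions` table
demo = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5, 6],
    "store_id": ["NYC", "LA", "CHI", "NYC", "PHX", "LA"],
    "revenue": [120.0, 640.0, 80.0, 910.0, 300.0, 55.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)

# pandas: isin() ~ IN, between() ~ BETWEEN (inclusive), ~mask ~ NOT
pandas_rows = demo[
    demo["store_id"].isin(["NYC", "LA"])
    & demo["revenue"].between(100, 700)
    & ~demo["store_id"].str.startswith("P")
]

# SQL: the same three conditions in one WHERE clause
sql_rows = pd.read_sql(
    """
    SELECT transaction_id, store_id, revenue
    FROM demo_transactions
    WHERE store_id IN ('NYC', 'LA')
      AND revenue BETWEEN 100 AND 700
      AND store_id NOT LIKE 'P%'
    ORDER BY transaction_id
    """,
    demo_conn,
)
print(sql_rows)
demo_conn.close()
```

One difference to keep in mind: SQLite's `LIKE` is case-insensitive for ASCII letters by default, while `str.startswith()` is case-sensitive.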

### 1.3 Sorting (ORDER BY)

Sorting is fundamental for rankings and time series analysis.

In [None]:
# PANDAS: Top 10 rows by one column (same result as sort_values('revenue', ascending=False).head(10))
pandas_sorted = transactions.nlargest(10, 'revenue')[['transaction_id', 'customer_id', 'revenue']]

print("🐼 Pandas sorting (top 10 by revenue):")
print("df.nlargest(10, 'revenue')")
print(pandas_sorted)

print("\n" + "="*50 + "\n")

# SQL: ORDER BY
sql_query = """
SELECT transaction_id, customer_id, revenue
FROM transactions
ORDER BY revenue DESC
LIMIT 10
"""

print("🗄️ SQL sorting:")
show_sql(sql_query)
sql_sorted = pd.read_sql(sql_query, conn)
print(sql_sorted)
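
Sorting on several columns with mixed directions follows the same pattern. This is a small sketch on hypothetical toy data (`demo`, `demo_conn`, `demo_transactions` are illustrative names only): `sort_values()` with a list of columns and a matching `ascending` list lines up with a comma-separated `ORDER BY`.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data for illustration
demo = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "store_id": ["NYC", "LA", "NYC", "LA", "CHI"],
    "revenue": [100.0, 300.0, 250.0, 50.0, 75.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)

# pandas: store ascending, then revenue descending within each store
pandas_sorted = demo.sort_values(
    ["store_id", "revenue"], ascending=[True, False]
).reset_index(drop=True)

# SQL: one ORDER BY entry per sort key, each with its own direction
sql_sorted_demo = pd.read_sql(
    """
    SELECT transaction_id, store_id, revenue
    FROM demo_transactions
    ORDER BY store_id ASC, revenue DESC
    """,
    demo_conn,
)
print(sql_sorted_demo)
demo_conn.close()
```

When the sort keys can tie, adding a unique column such as `transaction_id` as the last key makes both results deterministic.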

---

## 🔗 Part 2: Aggregations - GROUP BY

This is where the mental models really start to connect!

### 2.1 Simple Aggregation

In [None]:
# PANDAS: Group by store and calculate metrics
pandas_agg = transactions.groupby('store_id').agg({
    'revenue': ['sum', 'mean', 'count']
}).round(2)

print("🐼 Pandas aggregation:")
print("df.groupby('store_id').agg({'revenue': ['sum', 'mean', 'count']})")
print(pandas_agg)

print("\n" + "="*50 + "\n")

# SQL: GROUP BY with multiple aggregations
sql_query = """
SELECT 
    store_id,
    SUM(revenue) as revenue_sum,
    AVG(revenue) as revenue_mean,
    COUNT(revenue) as revenue_count  -- like pandas 'count': skips NULLs (COUNT(*) counts every row)
FROM transactions
GROUP BY store_id
ORDER BY revenue_sum DESC
"""

print("🗄️ SQL aggregation:")
show_sql(sql_query)
sql_agg = pd.read_sql(sql_query, conn)
print(sql_agg)
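
The dictionary form of `agg()` returns a two-level column index, which makes a side-by-side comparison with SQL awkward. A possible alternative, sketched here on hypothetical toy data (`demo`, `demo_conn`, `demo_transactions` are made-up names), is pandas named aggregation, which yields flat column names that match the SQL aliases. The toy data includes one missing revenue to show how `count` and `COUNT(revenue)` skip it while `COUNT(*)` does not.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data; one revenue value is missing on purpose
demo = pd.DataFrame({
    "store_id": ["NYC", "NYC", "NYC", "LA", "LA"],
    "revenue": [100.0, 250.0, None, 300.0, 60.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)  # NaN is written as NULL

# pandas: named aggregation gives flat, SQL-like column names
pandas_agg_demo = (
    demo.groupby("store_id")
    .agg(
        revenue_sum=("revenue", "sum"),
        revenue_mean=("revenue", "mean"),
        revenue_count=("revenue", "count"),
    )
    .reset_index()
)

# SQL: COUNT(revenue) ignores NULLs, COUNT(*) counts every row
sql_agg_demo = pd.read_sql(
    """
    SELECT
        store_id,
        SUM(revenue)   AS revenue_sum,
        AVG(revenue)   AS revenue_mean,
        COUNT(revenue) AS revenue_count,
        COUNT(*)       AS row_count
    FROM demo_transactions
    GROUP BY store_id
    ORDER BY store_id
    """,
    demo_conn,
)
print(pandas_agg_demo)
print(sql_agg_demo)
demo_conn.close()
```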

### 2.2 Multiple Grouping Columns

In [None]:
# Add date components for better grouping
transactions['date_only'] = transactions['date'].dt.date
transactions['hour'] = transactions['date'].dt.hour

# Update SQL table
transactions.to_sql('transactions', conn, if_exists='replace', index=False)

# PANDAS: Multi-level groupby
pandas_multi = transactions.groupby(['store_id', 'date_only'])['revenue'].sum().head(10)

print("🐼 Pandas multi-level groupby:")
print("df.groupby(['store_id', 'date_only'])['revenue'].sum()")
print(pandas_multi)

print("\n" + "="*50 + "\n")

# SQL: Multiple GROUP BY columns
sql_query = """
SELECT 
    store_id,
    date_only,
    SUM(revenue) as total_revenue
FROM transactions
GROUP BY store_id, date_only
ORDER BY store_id, date_only
LIMIT 10
"""

print("🗄️ SQL multi-level groupby:")
show_sql(sql_query)
sql_multi = pd.read_sql(sql_query, conn)
print(sql_multi)

### 2.3 Filtering After Aggregation (HAVING clause)

This is a key concept - filtering AFTER grouping!

In [None]:
# PANDAS: Filter after groupby
store_totals = transactions.groupby('store_id')['revenue'].sum()
pandas_having = store_totals[store_totals > 100000]

print("🐼 Pandas approach (filter after groupby):")
print("grouped = df.groupby('store_id')['revenue'].sum()")
print("grouped[grouped > 100000]")
print(pandas_having)

print("\n" + "="*50 + "\n")

# SQL: HAVING clause
sql_query = """
SELECT 
    store_id,
    SUM(revenue) as total_revenue
FROM transactions
GROUP BY store_id
HAVING SUM(revenue) > 100000
ORDER BY total_revenue DESC
"""

print("🗄️ SQL approach (HAVING clause):")
show_sql(sql_query)
sql_having = pd.read_sql(sql_query, conn)
print(sql_having)

print("\n💡 Key Insight: WHERE filters rows BEFORE grouping, HAVING filters AFTER grouping!")
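
To make the WHERE-before, HAVING-after ordering concrete, here is a minimal sketch that uses both in one query. The data and names (`demo`, `demo_conn`, `demo_transactions`, the thresholds 50 and 300) are hypothetical. Note how LA would pass the HAVING threshold on its raw total (310) but fails once the row-level filter has removed its small transaction.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data for illustration
demo = pd.DataFrame({
    "store_id": ["NYC", "NYC", "NYC", "LA", "LA", "CHI"],
    "revenue": [200.0, 150.0, 40.0, 280.0, 30.0, 500.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)

# pandas: row filter first, then group, then filter the grouped totals
pandas_where_having = (
    demo[demo["revenue"] > 50]                 # WHERE
    .groupby("store_id")["revenue"].sum()      # GROUP BY + SUM
    .loc[lambda totals: totals > 300]          # HAVING
    .reset_index(name="total_revenue")
)

# SQL: WHERE runs before grouping, HAVING runs after
sql_where_having = pd.read_sql(
    """
    SELECT store_id, SUM(revenue) AS total_revenue
    FROM demo_transactions
    WHERE revenue > 50
    GROUP BY store_id
    HAVING SUM(revenue) > 300
    ORDER BY store_id
    """,
    demo_conn,
)
print(sql_where_having)
demo_conn.close()
```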

---

## 🔗 Part 3: JOINs - Combining Tables

JOINs are SQL's superpower. Let's map them to pandas merge operations!

### 3.1 Inner Join (Default)

In [None]:
# PANDAS: Inner join
pandas_inner = transactions[['transaction_id', 'customer_id', 'product_id', 'revenue']].merge(
    products[['product_id', 'product_name', 'category']],
    on='product_id',
    how='inner'
).head()

print("🐼 Pandas inner join:")
print("df1.merge(df2, on='product_id', how='inner')")
print(pandas_inner)

print("\n" + "="*50 + "\n")

# SQL: INNER JOIN
sql_query = """
SELECT 
    t.transaction_id,
    t.customer_id,
    t.product_id,
    t.revenue,
    p.product_name,
    p.category
FROM transactions t
INNER JOIN products p ON t.product_id = p.product_id
LIMIT 5
"""

print("🗄️ SQL inner join:")
show_sql(sql_query)
sql_inner = pd.read_sql(sql_query, conn)
print(sql_inner)

### 3.2 Left Join

In [None]:
# Take the first 10 customers; a LEFT JOIN keeps every one of them, even any without transactions
all_customers = customers[['customer_id', 'customer_name']].head(10)
customer_revenue = transactions.groupby('customer_id')['revenue'].sum().reset_index()

# PANDAS: Left join
pandas_left = all_customers.merge(
    customer_revenue,
    on='customer_id',
    how='left'
).fillna(0)

print("🐼 Pandas left join (all customers, even without purchases):")
print("customers.merge(revenue, on='customer_id', how='left').fillna(0)")
print(pandas_left)

print("\n" + "="*50 + "\n")

# SQL: LEFT JOIN
sql_query = """
SELECT 
    c.customer_id,
    c.customer_name,
    COALESCE(SUM(t.revenue), 0) as total_revenue
FROM customers c
LEFT JOIN transactions t ON c.customer_id = t.customer_id
WHERE c.customer_id <= 10
GROUP BY c.customer_id, c.customer_name
ORDER BY c.customer_id
"""

print("🗄️ SQL left join:")
show_sql(sql_query)
sql_left = pd.read_sql(sql_query, conn)
print(sql_left)

print("\n💡 COALESCE in SQL = fillna in pandas!")
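
A left join is also the usual way to find the rows that did not match. The sketch below, on hypothetical toy tables (`demo_customers`, `demo_tx`, `demo_conn` are illustrative names), finds customers with no purchases: pandas uses `indicator=True` and keeps the `left_only` rows, while SQL filters on `IS NULL` after the `LEFT JOIN`.

```python
import sqlite3
import pandas as pd

# Hypothetical toy tables: customers 2 and 4 have no transactions
demo_customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "customer_name": ["Ana", "Ben", "Cy", "Di"],
})
demo_tx = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "revenue": [20.0, 35.0, 50.0],
})
demo_conn = sqlite3.connect(":memory:")
demo_customers.to_sql("demo_customers", demo_conn, index=False)
demo_tx.to_sql("demo_tx", demo_conn, index=False)

# pandas: indicator=True adds a `_merge` column describing each row's origin
merged = demo_customers.merge(demo_tx, on="customer_id", how="left", indicator=True)
pandas_no_purchases = merged.loc[
    merged["_merge"] == "left_only", ["customer_id", "customer_name"]
]

# SQL: unmatched rows have NULL in every column from the right-hand table
sql_no_purchases = pd.read_sql(
    """
    SELECT c.customer_id, c.customer_name
    FROM demo_customers c
    LEFT JOIN demo_tx t ON c.customer_id = t.customer_id
    WHERE t.customer_id IS NULL
    ORDER BY c.customer_id
    """,
    demo_conn,
)
print(sql_no_purchases)
demo_conn.close()
```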

### 3.3 Multiple Joins

In [None]:
# PANDAS: Chain multiple merges
pandas_multi_join = transactions[['transaction_id', 'customer_id', 'product_id', 'revenue']].merge(
    products[['product_id', 'product_name', 'category']],
    on='product_id'
).merge(
    customers[['customer_id', 'customer_name', 'city']],
    on='customer_id'
).head()

print("🐼 Pandas multiple joins:")
print("df.merge(products, on='product_id').merge(customers, on='customer_id')")
print(pandas_multi_join)

print("\n" + "="*50 + "\n")

# SQL: Multiple JOINs
sql_query = """
SELECT 
    t.transaction_id,
    c.customer_name,
    c.city,
    p.product_name,
    p.category,
    t.revenue
FROM transactions t
JOIN products p ON t.product_id = p.product_id
JOIN customers c ON t.customer_id = c.customer_id
LIMIT 5
"""

print("🗄️ SQL multiple joins:")
show_sql(sql_query)
sql_multi_join = pd.read_sql(sql_query, conn)
print(sql_multi_join)
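
Chained joins can silently multiply rows when a lookup table contains duplicate keys. As a hedged sketch on hypothetical toy tables (`demo_tx`, `demo_products`, `demo_conn` are made-up names), pandas can guard against this with the `validate` argument of `merge()`, and a `GROUP BY ... HAVING COUNT(*) > 1` query performs the equivalent check in SQL.

```python
import sqlite3
import pandas as pd

# Hypothetical toy tables: product_id 20 appears twice in the lookup table
demo_tx = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "product_id": [10, 20, 10],
    "revenue": [5.0, 7.0, 9.0],
})
demo_products = pd.DataFrame({
    "product_id": [10, 20, 20],
    "product_name": ["Pen", "Cup", "Cup (duplicate)"],
})

# Without a guard, the duplicate key quietly adds a row
joined = demo_tx.merge(demo_products, on="product_id")
print(len(demo_tx), "->", len(joined))  # 3 -> 4

# pandas guard: validate raises MergeError if the right-hand keys are not unique
try:
    demo_tx.merge(demo_products, on="product_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# SQL check: list any key that occurs more than once in the lookup table
demo_conn = sqlite3.connect(":memory:")
demo_products.to_sql("demo_products", demo_conn, index=False)
duplicate_keys = pd.read_sql(
    """
    SELECT product_id, COUNT(*) AS n
    FROM demo_products
    GROUP BY product_id
    HAVING COUNT(*) > 1
    """,
    demo_conn,
)
print(duplicate_keys)
demo_conn.close()
```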

---

## 🚀 Part 4: Advanced Operations - Window Functions

Window functions are incredibly powerful for analytics. Let's see how pandas and SQL compare!

### 4.1 Ranking Functions

In [None]:
# PANDAS: Ranking within groups
pandas_rank = transactions.copy()
pandas_rank['rank_in_store'] = pandas_rank.groupby('store_id')['revenue'].rank(method='dense', ascending=False)
top_per_store = pandas_rank[pandas_rank['rank_in_store'] <= 3][['store_id', 'transaction_id', 'revenue', 'rank_in_store']].sort_values(['store_id', 'rank_in_store'])

print("🐼 Pandas ranking (top 3 transactions per store):")
print("df['rank'] = df.groupby('store_id')['revenue'].rank(method='dense', ascending=False)")
print(top_per_store.head(10))

print("\n" + "="*50 + "\n")

# SQL: Window function with DENSE_RANK() (matches pandas rank(method='dense'))
sql_query = """
WITH ranked_transactions AS (
    SELECT 
        store_id,
        transaction_id,
        revenue,
        DENSE_RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) as rank_in_store
    FROM transactions
)
SELECT *
FROM ranked_transactions
WHERE rank_in_store <= 3
ORDER BY store_id, rank_in_store
LIMIT 10
"""

print("🗄️ SQL window function:")
show_sql(sql_query)
sql_rank = pd.read_sql(sql_query, conn)
print(sql_rank)
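
The `method` argument of pandas `rank()` decides which SQL ranking function it corresponds to. The following sketch uses hypothetical toy data (`demo`, `demo_conn`, `demo_transactions` are illustrative names) with a tie at the top, and assumes an SQLite build with window-function support (version 3.25 or later): `method='min'` lines up with `RANK()`, and `method='dense'` with `DENSE_RANK()`.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data: two NYC transactions tie for first place
demo = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5, 6],
    "store_id": ["NYC", "NYC", "NYC", "NYC", "LA", "LA"],
    "revenue": [500.0, 500.0, 300.0, 100.0, 50.0, 70.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)

# pandas: rank within each store, highest revenue first
by_store = demo.groupby("store_id")["revenue"]
demo["rank_min"] = by_store.rank(method="min", ascending=False).astype(int)
demo["rank_dense"] = by_store.rank(method="dense", ascending=False).astype(int)

# SQL: RANK() leaves gaps after ties, DENSE_RANK() does not
sql_ranks = pd.read_sql(
    """
    SELECT
        transaction_id,
        RANK()       OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rank_min,
        DENSE_RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rank_dense
    FROM demo_transactions
    ORDER BY transaction_id
    """,
    demo_conn,
)
print(demo)
print(sql_ranks)
demo_conn.close()
```

The pandas default, `method='average'`, has no direct counterpart among the standard SQL ranking functions, so it is worth passing `method` explicitly when a SQL translation is planned.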

### 4.2 Running Totals and Moving Averages

In [None]:
# PANDAS: Cumulative sum and rolling average
daily_revenue = transactions.groupby('date_only')['revenue'].sum().reset_index()
daily_revenue = daily_revenue.sort_values('date_only')
daily_revenue['cumulative_revenue'] = daily_revenue['revenue'].cumsum()
daily_revenue['moving_avg_7d'] = daily_revenue['revenue'].rolling(window=7, min_periods=1).mean()

print("🐼 Pandas cumulative and rolling:")
print("df['cumsum'] = df['revenue'].cumsum()")
print("df['rolling_avg'] = df['revenue'].rolling(window=7, min_periods=1).mean()")
print(daily_revenue.head(10))

print("\n" + "="*50 + "\n")

# SQL: Window functions for running totals
sql_query = """
WITH daily_totals AS (
    SELECT 
        date_only,
        SUM(revenue) as daily_revenue
    FROM transactions
    GROUP BY date_only
)
SELECT 
    date_only,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY date_only) as cumulative_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY date_only 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7d
FROM daily_totals
ORDER BY date_only
LIMIT 10
"""

print("🗄️ SQL window functions:")
show_sql(sql_query)
sql_window = pd.read_sql(sql_query, conn)
print(sql_window)

### 4.3 Lead and Lag Operations

In [None]:
# PANDAS: Shift operations for time series
daily_revenue = transactions.groupby('date_only')['revenue'].sum().reset_index().sort_values('date_only')
daily_revenue['prev_day_revenue'] = daily_revenue['revenue'].shift(1)
daily_revenue['next_day_revenue'] = daily_revenue['revenue'].shift(-1)
daily_revenue['day_over_day_change'] = daily_revenue['revenue'] - daily_revenue['prev_day_revenue']

print("🐼 Pandas shift operations:")
print("df['prev'] = df['revenue'].shift(1)")
print("df['next'] = df['revenue'].shift(-1)")
print(daily_revenue.head(10))

print("\n" + "="*50 + "\n")

# SQL: LAG and LEAD functions
sql_query = """
WITH daily_totals AS (
    SELECT 
        date_only,
        SUM(revenue) as daily_revenue
    FROM transactions
    GROUP BY date_only
)
SELECT 
    date_only,
    daily_revenue,
    LAG(daily_revenue, 1) OVER (ORDER BY date_only) as prev_day_revenue,
    LEAD(daily_revenue, 1) OVER (ORDER BY date_only) as next_day_revenue,
    daily_revenue - LAG(daily_revenue, 1) OVER (ORDER BY date_only) as day_over_day_change
FROM daily_totals
ORDER BY date_only
LIMIT 10
"""

print("🗄️ SQL LAG/LEAD:")
show_sql(sql_query)
sql_lag_lead = pd.read_sql(sql_query, conn)
print(sql_lag_lead)
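
When the series contains several groups, the shift has to restart within each group. A minimal sketch on hypothetical toy data (`demo`, `demo_conn`, `demo_daily` are made-up names), assuming an SQLite build with window-function support (3.25 or later): `groupby().shift()` corresponds to `LAG()` with a `PARTITION BY` clause.

```python
import sqlite3
import pandas as pd

# Hypothetical daily revenue per store
demo = pd.DataFrame({
    "store_id": ["NYC", "NYC", "NYC", "LA", "LA"],
    "day": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"],
    "revenue": [100.0, 150.0, 120.0, 80.0, 60.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_daily", demo_conn, index=False)

# pandas: sort first, then shift within each store
pandas_lag = demo.sort_values(["store_id", "day"]).reset_index(drop=True)
pandas_lag["prev_revenue"] = pandas_lag.groupby("store_id")["revenue"].shift(1)
pandas_lag["change"] = pandas_lag["revenue"] - pandas_lag["prev_revenue"]

# SQL: PARTITION BY restarts the window for every store
sql_lag = pd.read_sql(
    """
    SELECT
        store_id,
        day,
        revenue,
        LAG(revenue) OVER (PARTITION BY store_id ORDER BY day) AS prev_revenue,
        revenue - LAG(revenue) OVER (PARTITION BY store_id ORDER BY day) AS change
    FROM demo_daily
    ORDER BY store_id, day
    """,
    demo_conn,
)
print(pandas_lag)
print(sql_lag)
demo_conn.close()
```

The first row of each store has no previous day, so pandas shows `NaN` there and SQL returns `NULL`.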

---

## 🔄 Part 5: Subqueries and CTEs

Complex analytical questions often require multiple steps. Let's see how to structure them!

### 5.1 Subqueries vs Method Chaining

In [None]:
# PANDAS: Method chaining for complex logic
# Find customers whose average order is above the overall average
overall_avg = transactions['revenue'].mean()

pandas_complex = (
    transactions
    .groupby('customer_id')['revenue']
    .mean()
    .reset_index()
    .rename(columns={'revenue': 'avg_revenue'})
    .query('avg_revenue > @overall_avg')  # @ references the Python variable directly
    .sort_values('avg_revenue', ascending=False)
    .head(10)
)

print(f"🐼 Pandas: Customers with avg order > ${overall_avg:.2f}")
print(pandas_complex)

print("\n" + "="*50 + "\n")

# SQL: Using subquery
sql_query = """
SELECT 
    customer_id,
    AVG(revenue) as avg_revenue
FROM transactions
GROUP BY customer_id
HAVING AVG(revenue) > (
    SELECT AVG(revenue) 
    FROM transactions
)
ORDER BY avg_revenue DESC
LIMIT 10
"""

print("🗄️ SQL with subquery:")
show_sql(sql_query)
sql_subquery = pd.read_sql(sql_query, conn)
print(sql_subquery)

### 5.2 Common Table Expressions (CTEs)

CTEs are like creating temporary DataFrames in your SQL query!

In [None]:
# PANDAS: Multi-step analysis
# Step 1: Calculate customer metrics
customer_metrics = transactions.groupby('customer_id').agg({
    'revenue': ['sum', 'mean', 'count']
}).round(2)
customer_metrics.columns = ['total_revenue', 'avg_revenue', 'transaction_count']
customer_metrics = customer_metrics.reset_index()

# Step 2: Categorize customers
customer_metrics['customer_segment'] = pd.cut(
    customer_metrics['total_revenue'],
    bins=[0, 1000, 5000, float('inf')],
    labels=['Low', 'Medium', 'High']
)

# Step 3: Summary by segment
segment_summary = customer_metrics.groupby('customer_segment', observed=True).agg({
    'customer_id': 'count',
    'total_revenue': 'mean'
}).round(2)

print("🐼 Pandas multi-step analysis:")
print(segment_summary)

print("\n" + "="*50 + "\n")

# SQL: Using CTEs for the same analysis
sql_query = """
WITH customer_metrics AS (
    SELECT 
        customer_id,
        SUM(revenue) as total_revenue,
        AVG(revenue) as avg_revenue,
        COUNT(*) as transaction_count
    FROM transactions
    GROUP BY customer_id
),
customer_segments AS (
    SELECT 
        *,
        CASE 
            WHEN total_revenue <= 1000 THEN 'Low'
            WHEN total_revenue <= 5000 THEN 'Medium'
            ELSE 'High'
        END as customer_segment
    FROM customer_metrics
)
SELECT 
    customer_segment,
    COUNT(*) as customer_count,
    AVG(total_revenue) as avg_segment_revenue
FROM customer_segments
GROUP BY customer_segment
ORDER BY avg_segment_revenue
"""

print("🗄️ SQL with CTEs:")
show_sql(sql_query)
sql_cte = pd.read_sql(sql_query, conn)
print(sql_cte)

print("\n💡 CTEs make complex SQL queries readable and modular, just like method chaining in pandas!")

---

## 🚀 Part 6: Performance Considerations

When should you use pandas vs SQL? Let's understand the trade-offs!

In [None]:
import time

# Test 1: Simple filtering
print("📊 Test 1: Simple filtering (revenue > 100)\n")

# Pandas timing (perf_counter is a high-resolution clock suited to short intervals)
start = time.perf_counter()
pandas_result = transactions[transactions['revenue'] > 100]
pandas_time = time.perf_counter() - start
print(f"🐼 Pandas: {len(pandas_result):,} rows in {pandas_time:.4f} seconds")

# SQL timing (includes the cost of building the result DataFrame)
start = time.perf_counter()
sql_result = pd.read_sql("SELECT * FROM transactions WHERE revenue > 100", conn)
sql_time = time.perf_counter() - start
print(f"🗄️ SQL: {len(sql_result):,} rows in {sql_time:.4f} seconds")

print(f"\n⚡ Faster: {'Pandas' if pandas_time < sql_time else 'SQL'} by {abs(pandas_time - sql_time):.4f}s")

print("\n" + "="*50 + "\n")

# Test 2: Complex aggregation
print("📊 Test 2: Complex aggregation (group by store, calculate multiple metrics)\n")

# Pandas timing
# Note: 'std' is left out so both tools compute the same metrics
# (core SQLite has no built-in standard deviation aggregate)
start = time.perf_counter()
pandas_agg = transactions.groupby('store_id').agg({
    'revenue': ['sum', 'mean'],
    'quantity': ['sum', 'mean'],
    'transaction_id': 'count'
})
pandas_time = time.perf_counter() - start
print(f"🐼 Pandas: Aggregated in {pandas_time:.4f} seconds")

# SQL timing (includes the cost of building the result DataFrame)
start = time.perf_counter()
sql_agg = pd.read_sql("""
    SELECT 
        store_id,
        SUM(revenue) as revenue_sum,
        AVG(revenue) as revenue_mean,
        SUM(quantity) as quantity_sum,
        AVG(quantity) as quantity_mean,
        COUNT(*) as transaction_count
    FROM transactions
    GROUP BY store_id
""", conn)
sql_time = time.perf_counter() - start
print(f"🗄️ SQL: Aggregated in {sql_time:.4f} seconds")

print(f"\n⚡ Faster: {'Pandas' if pandas_time < sql_time else 'SQL'} by {abs(pandas_time - sql_time):.4f}s")
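
A single timed run is noisy, so the winner can flip from one execution to the next. One possible refinement, sketched here on hypothetical synthetic data (`demo`, `demo_conn`, `demo_transactions`, and the helper `best_of` are made-up names), is to repeat each measurement with the standard library's `timeit` and report the best run.

```python
import sqlite3
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data, larger than a toy table but still small
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "store_id": rng.choice(["NYC", "LA", "CHI"], size=50_000),
    "revenue": rng.uniform(10, 500, size=50_000),
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)


def pandas_filter():
    return demo[demo["revenue"] > 100]


def sql_filter():
    # Includes the cost of turning the result set into a DataFrame
    return pd.read_sql(
        "SELECT * FROM demo_transactions WHERE revenue > 100", demo_conn
    )


def best_of(func, repeat=5):
    """Run func `repeat` times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


pandas_best = best_of(pandas_filter)
sql_best = best_of(sql_filter)
pandas_row_count = len(pandas_filter())
sql_row_count = len(sql_filter())

print(f"pandas best of 5: {pandas_best:.4f}s ({pandas_row_count:,} rows)")
print(f"SQL best of 5:    {sql_best:.4f}s ({sql_row_count:,} rows)")
demo_conn.close()
```

On data that already sits in memory, pandas usually has the advantage because nothing has to be copied out of the database. The comparison changes once the data lives in a remote database and only a fraction of it is needed.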

### 📊 When to Use Each Tool

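Timings on a 10,000-row table say little about production workloads, so treat the matrix below as rules of thumb drawn from real-world experience rather than as benchmark results: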

In [None]:
decision_matrix = pd.DataFrame({
    'Scenario': [
        'Data exploration & prototyping',
        'Production data pipelines',
        'Complex statistical analysis',
        'Data > 1GB',
        'Real-time dashboards',
        'Ad-hoc business queries',
        'Machine learning features',
        'Data validation & cleaning',
        'Time series manipulation',
        'Joining multiple large tables'
    ],
    'Preferred Tool': [
        '🐼 Pandas',
        '🗄️ SQL',
        '🐼 Pandas',
        '🗄️ SQL',
        '🗄️ SQL',
        '🗄️ SQL',
        '🐼 Pandas → SQL',
        '🐼 Pandas',
        '🐼 Pandas',
        '🗄️ SQL'
    ],
    'Reason': [
        'Interactive, flexible, great for iteration',
        'Scalable, auditable, version-controlled',
        'Rich statistical libraries (scipy, statsmodels)',
        'Memory constraints, let database do the work',
        'Direct queries, no data movement',
        'Standard language, shareable queries',
        'Prototype in pandas, productionize in SQL',
        'Better string/regex operations',
        'Superior datetime handling',
        'Optimized query planner'
    ]
})

print("🎯 DECISION MATRIX: When to Use Pandas vs SQL\n")
for _, row in decision_matrix.iterrows():
    print(f"{row['Preferred Tool']} {row['Scenario']}")
    print(f"    → {row['Reason']}\n")

---

## 💡 Part 7: Hybrid Workflows - Best of Both Worlds

The real power comes from combining pandas and SQL seamlessly!

### 7.1 Using SQL for Data Reduction, Pandas for Analysis

In [None]:
# Hybrid approach: Let SQL do the heavy lifting, pandas for fine-tuning

# Step 1: Use SQL to filter and aggregate large data
sql_query = """
SELECT 
    date_only,
    store_id,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(revenue) as total_revenue,
    AVG(revenue) as avg_transaction
FROM transactions
WHERE revenue > 50  -- Pre-filter in SQL
GROUP BY date_only, store_id
"""

print("Step 1: SQL for heavy lifting")
show_sql(sql_query)

# Execute and get results
daily_store_metrics = pd.read_sql(sql_query, conn)
print(f"\n✅ Reduced to {len(daily_store_metrics):,} rows\n")

# Step 2: Use pandas for complex transformations
print("Step 2: Pandas for complex analysis")

# Convert to datetime
daily_store_metrics['date_only'] = pd.to_datetime(daily_store_metrics['date_only'])

# Add time-based features
daily_store_metrics['day_of_week'] = daily_store_metrics['date_only'].dt.day_name()
daily_store_metrics['is_weekend'] = daily_store_metrics['date_only'].dt.dayofweek.isin([5, 6])

# Calculate store performance ranking by day
daily_store_metrics['daily_rank'] = daily_store_metrics.groupby('date_only')['total_revenue'].rank(ascending=False)

# Show results
print(daily_store_metrics.head(10))

print("\n💡 Best Practice: Use SQL to reduce data volume, pandas for complex transformations!")
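
If even the reduced result is too large to hold comfortably in memory, `pd.read_sql` can return it in pieces through its `chunksize` argument. The sketch below uses hypothetical synthetic data (`demo`, `demo_conn`, `demo_transactions` are illustrative names) and accumulates per-store totals one chunk at a time.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for a large table
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "store_id": rng.choice(["NYC", "LA", "CHI"], size=2_500),
    "revenue": rng.uniform(10, 500, size=2_500).round(2),
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)

# With chunksize set, read_sql returns an iterator of DataFrames
chunked_totals = pd.Series(dtype="float64")
n_chunks = 0
for chunk in pd.read_sql(
    "SELECT store_id, revenue FROM demo_transactions", demo_conn, chunksize=1_000
):
    partial = chunk.groupby("store_id")["revenue"].sum()
    # fill_value=0 handles stores not seen in earlier chunks
    chunked_totals = chunked_totals.add(partial, fill_value=0)
    n_chunks += 1

print(f"Processed {n_chunks} chunks")
print(chunked_totals.sort_index())
demo_conn.close()
```

For a plain sum, a `GROUP BY` in SQL would be simpler. Chunking is more useful when each piece needs pandas-only processing before it is combined.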

### 7.2 Parameterized Queries from Pandas

In [None]:
# Safe parameterized queries - avoid SQL injection!

def get_customer_history(customer_id, min_revenue=0):
    """
    Safely query customer transaction history
    """
    query = """
    SELECT 
        t.transaction_id,
        t.date,
        p.product_name,
        p.category,
        t.revenue
    FROM transactions t
    JOIN products p ON t.product_id = p.product_id
    WHERE t.customer_id = ?
      AND t.revenue > ?
    ORDER BY t.date DESC
    LIMIT 10
    """
    
    # Use parameterized query for safety
    return pd.read_sql(query, conn, params=(customer_id, min_revenue))

# Example usage
customer_data = get_customer_history(customer_id=42, min_revenue=100)
print("🔒 Safe parameterized query result:")
print(customer_data)

print("\n⚠️ NEVER format user-supplied values into SQL strings - always pass them as parameters!")
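
To see why this matters, the sketch below runs the same malicious input through a formatted string and through a parameter. Everything here is hypothetical toy data (`demo`, `demo_conn`, `demo_transactions`, `get_store_rows`). It also shows the named `:store_id` placeholder style, which the `sqlite3` driver accepts together with a dictionary of parameters.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data for illustration
demo = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "store_id": ["NYC", "LA", "NYC", "CHI"],
    "revenue": [120.0, 640.0, 80.0, 910.0],
})
demo_conn = sqlite3.connect(":memory:")
demo.to_sql("demo_transactions", demo_conn, index=False)


def get_store_rows(store_id):
    """Return one store's rows; the value is bound as a parameter, never formatted in."""
    query = """
    SELECT transaction_id, store_id, revenue
    FROM demo_transactions
    WHERE store_id = :store_id
    ORDER BY transaction_id
    """
    return pd.read_sql(query, demo_conn, params={"store_id": store_id})


malicious_input = "NYC' OR '1'='1"

# UNSAFE: the input becomes part of the SQL text and rewrites the condition
unsafe_query = (
    f"SELECT * FROM demo_transactions WHERE store_id = '{malicious_input}'"
)
unsafe_rows = pd.read_sql(unsafe_query, demo_conn)

# SAFE: the input is compared as a plain string and matches no store
safe_rows = get_store_rows(malicious_input)
nyc_rows = get_store_rows("NYC")

print(f"Unsafe query returned {len(unsafe_rows)} rows")
print(f"Parameterized query returned {len(safe_rows)} rows")
demo_conn.close()
```

Placeholders can only stand in for values. Table and column names cannot be bound this way, so they should come from a fixed list in your code rather than from user input.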

### 7.3 Pushing Pandas Operations to SQL

In [None]:
# Sometimes it's better to push operations to the database

# Scenario: Complex filtering that could be done in either tool
stores_of_interest = ['NYC', 'LA']
# Half-open range [start, end): 'date' is stored as a text timestamp such as
# '2024-01-07 09:15:00', so `date <= '2024-01-07'` would drop every row from January 7
date_range = ('2024-01-01', '2024-01-08')

print("Approach 1: Load all data, filter in pandas (DON'T DO THIS)")
print("```python")
print("# This loads ALL data into memory first!")
print("df = pd.read_sql('SELECT * FROM transactions', conn)")
print("df_filtered = df[(df['store_id'].isin(stores)) & (df['date'] >= start)]")
print("```\n")

print("Approach 2: Filter in SQL (DO THIS)")
# Only '?' placeholders are formatted into the string; the values are passed separately
query = f"""
SELECT *
FROM transactions
WHERE store_id IN ({','.join(['?'] * len(stores_of_interest))})
  AND date >= ?
  AND date < ?
"""

show_sql(query)

# Execute with parameters
filtered_data = pd.read_sql(
    query, 
    conn, 
    params=stores_of_interest + list(date_range)
)

print(f"\n✅ Loaded only {len(filtered_data):,} relevant rows instead of {len(transactions):,}!")

---

## 🎯 Practice Exercises

Now it's your turn! Complete these exercises using BOTH pandas and SQL.

### Exercise 1: Customer Segmentation

Find the top 10% of customers by total spending and analyze their behavior.

In [None]:
# TODO: Your pandas solution here
# Hint: Use quantile() to find the 90th percentile threshold

# pandas_solution = ...

print("🐼 Your pandas solution:")

In [None]:
# TODO: Your SQL solution here
# Hint: Use NTILE() or calculate percentiles with window functions

sql_query = """
-- Your SQL here
"""

print("🗄️ Your SQL solution:")
# sql_solution = pd.read_sql(sql_query, conn)

### Exercise 2: Cohort Analysis

Calculate retention by customer signup month.

In [None]:
# TODO: Create a cohort analysis
# 1. Group customers by signup month
# 2. Track their activity in subsequent months
# 3. Calculate retention rates

# Your solution here

### Exercise 3: Product Affinity

Find which products are frequently bought together.

In [None]:
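# TODO: Implement market basket analysis
# Note: each transaction row holds exactly one product, so define a "basket" first,
# e.g. all products bought by the same customer on the same date_only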

# Your solution here

---

## 🎓 Key Takeaways

1. **Mental Model Mapping**:
   - `df.groupby()` → `GROUP BY`
   - `df.merge()` → `JOIN`
   - `df.groupby('key')['col'].rank(method='dense')` → `DENSE_RANK() OVER (PARTITION BY key ...)`
   - `df['col'].shift()` → `LAG()/LEAD()`

2. **When to Use SQL**:
   - Data is in a database (avoid loading unnecessary data)
   - Need to share queries with non-Python users
   - Production pipelines requiring audit trails
   - Working with data larger than memory

3. **When to Use Pandas**:
   - Exploratory data analysis
   - Complex statistical operations
   - Data cleaning and string manipulation
   - Visualization preparation

4. **Best Practices**:
   - Use SQL to reduce data volume first
   - Always use parameterized queries
   - Think in sets (SQL) and vectorized column operations (pandas), not row-by-row loops
   - Document complex queries with CTEs
   - Profile performance for large datasets

5. **Hybrid Approach**:
   - SQL for extraction and reduction
   - Pandas for transformation and analysis
   - SQL for productionization

---

## 🚀 Next Steps

In the next notebook, we'll dive deeper into:
- Data warehouse design patterns
- Star and snowflake schemas
- Optimizing query performance
- Working with cloud data warehouses

Remember: **You don't choose pandas OR SQL - you master BOTH!** 🎯

In [None]:
# Clean up
conn.close()
print("✅ Database connection closed. Great work!")