<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Basic_Query_Optimization/1_Explain.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Explain

## Overview

### 🥅 Analysis Goals

Analyze customer-level revenue, order behavior, and cohort classification to assess purchasing patterns and customer value.  

- **Plan Query Execution:** Use `EXPLAIN` to understand how PostgreSQL will execute the query, identifying potential inefficiencies like sequential scans or costly joins.  
- **Measure Actual Query Performance:** Use `EXPLAIN ANALYZE` to execute the query while collecting real performance metrics, comparing the planner's estimated costs and row counts with the actual timings and row counts to guide optimization.

### 📘 Concepts Covered

- `EXPLAIN`
- `EXPLAIN ANALYZE`

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Replace ipython-sql with jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## EXPLAIN

### 📝 Notes

**`EXPLAIN`**  

- **EXPLAIN**: Displays the execution plan of a SQL query, showing how PostgreSQL will execute it.

- Syntax:  
  ```sql
  EXPLAIN SELECT column FROM table WHERE condition;
  ```
  
- **`EXPLAIN ANALYZE`**: Executes the query and reports actual execution times and row counts alongside the planner's estimates, plus other runtime details.
  ```sql
  EXPLAIN ANALYZE SELECT column FROM table WHERE condition;
  ```

- Helps with query optimization by showing:
  - Index usage
  - Join methods (`Nested Loop`, `Hash Join`, `Merge Join`)
  - Sequential vs. index scans
  - Estimated vs. actual row counts

- **Example Output** (simplified):
  ```
  Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)
  ```
  - `Seq Scan`: PostgreSQL reads the whole table sequentially (no index used).
  - `cost`: Estimated startup cost and total cost, in arbitrary planner units (not milliseconds).
  - `rows`: Estimated number of rows the node returns.
  - `width`: Estimated average row size in bytes.

- **Use Cases**:
  - Debugging slow queries
  - Checking if indexes are being used
  - Understanding query performance bottlenecks
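
To make the fields in the example output concrete, here is a minimal Python sketch that pulls `cost`, `rows`, and `width` out of one line of `EXPLAIN` text. The helper name `parse_estimates` and the regular expression are assumptions made for illustration; they are not part of PostgreSQL or jupysql, and the pattern only targets the default text format.

```python
import re

# Pattern for the parenthesised estimates that EXPLAIN prints for each plan node.
ESTIMATE_RE = re.compile(
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_estimates(plan_line):
    """Extract the planner estimates from one line of EXPLAIN text output.

    parse_estimates is a hypothetical helper written for this example.
    Returns None when the line carries no estimates (e.g. "Hash Cond: ...").
    """
    match = ESTIMATE_RE.search(plan_line)
    if match is None:
        return None
    return {
        "startup_cost": float(match["startup"]),
        "total_cost": float(match["total"]),
        "rows": int(match["rows"]),
        "width": int(match["width"]),
    }


# The simplified example line from the notes above.
line = "Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)"
estimates = parse_estimates(line)
print(estimates)
```

The same pattern should also match the estimate portion of `EXPLAIN ANALYZE` lines, since the actual timings are printed in a second pair of parentheses.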

### 💻 Final Result

- Understand how PostgreSQL plans to execute the query without running it, helping identify potential inefficiencies like sequential scans or costly joins.
- Execute the query while collecting actual performance metrics, allowing comparison between the planner's estimates and the real execution to optimize query performance.

### 💡 Note

The queries below were built in earlier chapters, so the notes here explain the `EXPLAIN` output rather than the queries themselves.

#### Basic Execution Plan

**`EXPLAIN`**

1. Using the last query from the `String Formatting` chapter, add `EXPLAIN` to the beginning of the query. The row counts reported by `EXPLAIN` are planner estimates.
    - **Hash Left Join:** PostgreSQL joins `customer` (104,990 rows) and `sales_data` (37,024 rows).  
    - **Sequential Scan on `customer`:** Scans all rows instead of using an index.  
    - **Hash Aggregation on `sales`:** Groups `sales` (199,873 rows) by `customerkey`, calculating revenue, orders, and cohort year.  
    - **Sequential Scan on `sales`:** No filtering, so the entire table is scanned.  

In [2]:
%%sql

EXPLAIN
WITH sales_data AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS net_revenue,
        COUNT(orderkey) AS num_orders
    FROM sales
    GROUP BY customerkey
)
SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS cleaned_name,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;


Unnamed: 0,QUERY PLAN
0,Hash Left Join (cost=9312.35..15292.71 rows=1...
1,Hash Cond: (c.customerkey = s.customerkey)
2,-> Seq Scan on customer c (cost=0.00..4129...
3,-> Hash (cost=8849.55..8849.55 rows=37024 ...
4,-> Subquery Scan on s (cost=8016.51....
5,-> HashAggregate (cost=8016.51...
6,Group Key: sales.customerkey
7,-> Seq Scan on sales (co...


Below is the query output in pgAdmin.

<img src="../Resources/query_results/7_explain_1.png" alt="Query Results 1" style="width: 70%; height: auto;">
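
Costs in a plan are cumulative: a parent node's cost includes the cost of its children. As a rough illustration using only the two complete cost figures visible in the output above, the sketch below estimates how much of the join's cost comes from building the hash on `sales_data`. The variable names are made up for this example, and costs are planner units, not milliseconds.

```python
# Estimates copied from the EXPLAIN output above, as (startup cost, total cost).
hash_join_cost = (9312.35, 15292.71)  # Hash Left Join (top node)
hash_cost = (8849.55, 8849.55)        # Hash node built from sales_data

# A hash join cannot emit rows until the hash table on its inner side is
# complete, so the Hash node's total cost is part of the join's startup cost.
hash_share_of_startup = hash_cost[1] / hash_join_cost[0]
hash_share_of_total = hash_cost[1] / hash_join_cost[1]

print(f"Hash build share of join startup cost: {hash_share_of_startup:.1%}")
print(f"Hash build share of total plan cost:   {hash_share_of_total:.1%}")
```

Read this way, most of the estimated work sits in scanning and aggregating `sales` before the join starts returning rows.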

2. To view the `EXPLAIN` plan graphically in pgAdmin, click the `E` (Explain) option in the query toolbar. Reading the diagram in execution order:
    - **Sequential Scan on `sales`** – Reads all rows without an index, leading to a full table scan.  
    - **Aggregation on `sales`** – Groups data by `customerkey` to calculate cohort year, net revenue, and order count.  
    - **SubQuery Scan (Inlined CTE)** – Reads the aggregated `sales_data` result as a subquery; a materialized CTE would appear as a `CTE Scan` node instead.  
    - **Hash on `sales_data`** – Builds a hash table for efficient joining.  
    - **Sequential Scan on `customer`** – Reads all rows without an index.  
    - **Hash Left Join (`customer` → `sales_data`)** – Joins `customer` with hashed `sales_data` for efficiency.  

<img src="../Resources/query_results/7_explain_3.gif" alt="Query Results 1" style="width: 90%; height: auto;">

<img src="../Resources/query_results/7_explain_3.png" alt="Query Results 1" style="width: 90%; height: auto;">
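
The same tree can also be inspected programmatically. The sketch below assumes the plan is available as nested dictionaries in the shape that `EXPLAIN (FORMAT JSON)` produces. The `plan` dictionary here is hand-written and simplified from the steps above (costs and row counts are omitted), and `find_seq_scans` is a hypothetical helper, not a pgAdmin or PostgreSQL function.

```python
# Hand-written, simplified copy of the plan described above. The key names
# ("Node Type", "Join Type", "Relation Name", "Plans") follow the JSON plan
# format; costs and row counts are left out on purpose.
plan = {
    "Node Type": "Hash Join",
    "Join Type": "Left",
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "customer"},
        {
            "Node Type": "Hash",
            "Plans": [
                {
                    "Node Type": "Subquery Scan",
                    "Plans": [
                        {
                            "Node Type": "Aggregate",
                            "Plans": [
                                {"Node Type": "Seq Scan", "Relation Name": "sales"}
                            ],
                        }
                    ],
                }
            ],
        },
    ],
}


def find_seq_scans(node):
    """Return the tables read by Seq Scan nodes at or below this node."""
    tables = []
    if node["Node Type"] == "Seq Scan":
        tables.append(node["Relation Name"])
    # Child nodes, when present, are listed under the "Plans" key.
    for child in node.get("Plans", []):
        tables.extend(find_seq_scans(child))
    return tables


print(find_seq_scans(plan))
```

A list like this is a quick way to spot which tables are read in full and might be candidates for an index or a filter.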

#### Full Execution Plan

**`EXPLAIN ANALYZE`**

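1. Add `EXPLAIN ANALYZE` to the beginning of the query. Each time range below is the actual time until a node returns its first row and its last row, including the time spent in its child nodes.
    - **Total Execution Time:** **101.254 ms.**  
    - **Hash Join Execution:** **60.479 to 99.183 ms**; about 60 ms pass before the first joined row because the hash table on `sales_data` must be built first.  
    - **Subquery Scan on `sales_data`:** **50.604 to 57.226 ms**; the CTE is inlined as a subquery, and this range includes scanning and aggregating `sales`.  
    - **Sequential Scan on `sales`:** **0.008 to 10.471 ms**; the full table scan itself takes roughly 10 ms.  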

In [3]:
%%sql

EXPLAIN ANALYZE
WITH sales_data AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS net_revenue,
        COUNT(orderkey) AS num_orders
    FROM sales
    GROUP BY customerkey
)
SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS cleaned_name,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;


Unnamed: 0,QUERY PLAN
0,Hash Left Join (cost=9312.35..15292.71 rows=1...
1,Hash Cond: (c.customerkey = s.customerkey)
2,-> Seq Scan on customer c (cost=0.00..4129...
3,-> Hash (cost=8849.55..8849.55 rows=37024 ...
4,Buckets: 65536 Batches: 1 Memory Usa...
5,-> Subquery Scan on s (cost=8016.51....
6,-> HashAggregate (cost=8016.51...
7,Group Key: sales.customerkey
8,Batches: 1 Memory Usage: ...
9,-> Seq Scan on sales (co...


Below is the query output in pgAdmin.

<img src="../Resources/query_results/7_explain_2.png" alt="Query Results 1" style="width: 80%; height: auto;">
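
To put the timings from the notes above in proportion, this short sketch expresses them as shares of the total execution time. It uses only the figures quoted in those notes and assumes each node ran once (`loops=1`), which the truncated output does not show; the `share` helper is hypothetical.

```python
# Timings (ms) copied from the EXPLAIN ANALYZE notes above.
# Each pair is (time to first row, time to last row) and includes child nodes.
# Assumption: every node ran once (loops=1), so no per-loop scaling is applied.
execution_ms = 101.254
actual_time = {
    "Hash Left Join": (60.479, 99.183),
    "Subquery Scan on sales_data": (50.604, 57.226),
    "Seq Scan on sales": (0.008, 10.471),
}


def share(ms, total=execution_ms):
    """Return ms as a percentage of the total execution time."""
    return round(100 * ms / total, 1)


join_startup = actual_time["Hash Left Join"][0]
scan_total = actual_time["Seq Scan on sales"][1]
subquery_total = actual_time["Subquery Scan on sales_data"][1]

# Time that passes before the join can return its first row.
startup_share = share(join_startup)
# Time inside the subquery subtree that is not the raw table scan
# (largely the HashAggregate grouping step).
aggregation_ms = round(subquery_total - scan_total, 3)

print(f"Join startup: {startup_share}% of execution time")
print(f"Seq Scan on sales: {share(scan_total)}% of execution time")
print(f"Aggregation on top of the scan: {aggregation_ms} ms")
```

Exact numbers vary from run to run, so treat the proportions as a guide to where time goes rather than as fixed measurements.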