<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Basic_Query_Optimization/1_Explain.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Explain

## Overview

### 🥅 Analysis Goals

Analyze customer-level revenue, order behavior, and cohort classification to assess purchasing patterns and customer value.  

- **Plan Query Execution:** Use `EXPLAIN` to understand how PostgreSQL will execute the query, identifying potential inefficiencies like sequential scans or costly joins.  
- **Measure Actual Query Performance:** Use `EXPLAIN ANALYZE` to execute the query while collecting real performance metrics, then compare the planner's estimated costs and row counts with the actual execution times and row counts.

### 📘 Concepts Covered

- `EXPLAIN`
- `EXPLAIN ANALYZE`

[Source Documentation for Using Explain](https://www.postgresql.org/docs/17/using-explain.html)

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## EXPLAIN

### 📝 Notes

**`EXPLAIN`**  

- **EXPLAIN**: Displays the execution plan of a SQL query, showing how PostgreSQL will execute it.

- Syntax:  
  ```sql
  EXPLAIN SELECT column FROM table WHERE condition;
  ```
  
**`EXPLAIN ANALYZE`**  

- **EXPLAIN ANALYZE**: Executes the query and reports actual execution times and row counts alongside the planner's estimates.

- Syntax:  
  ```sql
  EXPLAIN ANALYZE SELECT column FROM table WHERE condition;
  ```

- Helps with query optimization by showing:
  - Index usage
  - Join methods (`Nested Loop`, `Hash Join`, `Merge Join`)
  - Sequential vs. index scans
  - Estimated vs. actual row counts

- **Example Output** (simplified):
  ```
  Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)
  ```
  - `Seq Scan`: PostgreSQL is doing a sequential scan (no index used).
  - `cost`: Estimated startup cost and total cost, in the planner's arbitrary cost units (not time).
  - `rows`: Estimated number of rows.
  - `width`: Estimated row size in bytes.

- **Use Cases**:
  - Debugging slow queries
  - Checking if indexes are being used
  - Understanding query performance bottlenecks
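
To make the fields in the example output concrete, here is a minimal Python sketch that splits a text-format plan line into its parts. The regular expression and the helper name `parse_plan_line` are illustrative assumptions, not part of PostgreSQL or jupysql, and the pattern only targets the simple `(cost=... rows=... width=...)` shape shown above.

```python
import re

# Hypothetical pattern for one text-format plan line, e.g.
# "Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)"
PLAN_LINE = re.compile(
    r"^\s*(?:->\s*)?(?P<node>.+?)\s+"
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_plan_line(line):
    """Return the node label and planner estimates, or None if no match."""
    match = PLAN_LINE.search(line)
    if match is None:
        return None
    return {
        "node": match.group("node"),
        "startup_cost": float(match.group("startup")),  # cost before the first row
        "total_cost": float(match.group("total")),      # cost to return all rows
        "rows": int(match.group("rows")),               # estimated row count
        "width": int(match.group("width")),             # estimated bytes per row
    }


example = "Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)"
parsed = parse_plan_line(example)
print(parsed)
```

Lines without a cost block, such as `Hash Cond: ...`, return `None`, so they can be skipped when scanning a whole plan.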

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Query Performance: Speed of data retrieval
  - Execution Cost: Resources needed to run query
  - Data Processing: How database handles requests
- **💡 Why It Matters**: Optimizes complex customer revenue analysis
    - Reduces query execution time for large customer datasets
    - Lowers computational costs for frequent cohort analysis
    - Enables faster business decision making
    - Improves efficiency of revenue tracking systems
- **🎯 Common Use Cases**: 
  - Performance tuning
  - Cost optimization
  - Query improvement
  - Resource management
- **📈 Related KPIs**: 
  - Query execution time
  - Processing costs
  - Resource utilization
  - Response time

### 📈 Analysis

- Understand how PostgreSQL plans to execute the query without running it, helping identify potential inefficiencies like sequential scans or costly joins.
- Execute the query while collecting actual performance metrics, allowing comparison between estimated and real execution times to optimize query performance.

> **⚠️ Note**: We have already run the queries below in earlier chapters, so the explanations here walk through the results of each `EXPLAIN` plan.

#### Basic Execution Plan

**`EXPLAIN`**

1. Using the last query from the `String Formatting` chapter, add `EXPLAIN` to the beginning of the query. The steps below read the plan from the innermost node outward.

    - Sequential Scan on `customer`: Reads all rows from `customer`. The join has no filter and needs every row, so a full scan is expected here; on its own it does not show whether `customerkey` is indexed.
    - Hash on `customer`: Prepares a hash table of 104,990 rows for efficient lookups in the join.
    - Sequential Scan on `sales`: No filtering is applied, causing PostgreSQL to scan 199,873 rows.
    - Hash Left Join: Joins the `customer` table (104,990 rows) with `sales` (199,873 rows) using a hash join, indicating `customerkey` is used as a join condition.
    - Sort Operation: Orders the joined rows by the grouping fields (listed in the next step) so they can be aggregated group by group.
    - Group Aggregate: Groups data by multiple fields (`customerkey`, `orderdate`, `countryfull`, `age`, `givenname`, `surname`) before performing aggregations.
    - Window Aggregation: A window function is applied to 199,873 rows, increasing the width of each row.
    - Subquery Scan on `cd`: The entire subquery result is scanned (199,873 rows), with an estimated cost between 35,601.24 and 50,591.71.
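
The nesting described above can also be read programmatically. PostgreSQL can return the plan as JSON via `EXPLAIN (FORMAT JSON)`, where each node carries a `Node Type` and an optional `Plans` list of children. The sketch below walks such a tree; `sample_plan` is a hand-written, simplified stand-in that mirrors the node order discussed above (costs and most keys omitted), not real output from this database.

```python
# Hand-written, simplified stand-in for parsed EXPLAIN (FORMAT JSON) output.
# Only node types and relation names are included; real output has many more keys.
sample_plan = [
    {
        "Plan": {
            "Node Type": "WindowAgg",
            "Plans": [{
                "Node Type": "Aggregate",
                "Strategy": "Sorted",  # shown as GroupAggregate in text format
                "Plans": [{
                    "Node Type": "Sort",
                    "Plans": [{
                        "Node Type": "Hash Join",
                        "Join Type": "Left",
                        "Plans": [
                            {"Node Type": "Seq Scan", "Relation Name": "sales"},
                            {
                                "Node Type": "Hash",
                                "Plans": [
                                    {"Node Type": "Seq Scan",
                                     "Relation Name": "customer"}
                                ],
                            },
                        ],
                    }],
                }],
            }],
        }
    }
]


def walk(node, depth=0):
    """Yield (depth, label) for a plan node and all of its children."""
    label = node["Node Type"]
    if "Relation Name" in node:
        label += f" on {node['Relation Name']}"
    yield depth, label
    for child in node.get("Plans", []):
        yield from walk(child, depth + 1)


# Print the tree with indentation; the deepest nodes run first.
for depth, label in walk(sample_plan[0]["Plan"]):
    print("  " * depth + label)
```

Reading from the most indented line upward reproduces the order of the bullets above: scans first, then the hash join, sort, aggregation, and window step.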

In [2]:
%%sql

EXPLAIN
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM
		sales s
	LEFT JOIN customer c ON
		c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	customerkey,
	orderdate,
	total_net_revenue,
	num_orders,
	countryfull,
	age,
	CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
	MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM
	customer_revenue cr;


Unnamed: 0,QUERY PLAN
0,WindowAgg (cost=35601.24..48592.98 rows=19987...
1,-> GroupAggregate (cost=35601.24..43596.16...
2,"Group Key: s.customerkey, s.orderdate,..."
3,-> Sort (cost=35601.24..36100.92 row...
4,"Sort Key: s.customerkey, s.order..."
5,-> Hash Left Join (cost=5442.2...
6,Hash Cond: (s.customerkey ...
7,-> Seq Scan on sales s (...
8,-> Hash (cost=4129.90..4...
9,-> Seq Scan on cust...


Below is the query output.

<img src="../Resources/query_results/7.1_explain_1.png" alt="Query Results 1" style="width: 70%; height: auto;">
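
Because `autopandas` is enabled, the plan comes back as a DataFrame, and pandas shortens long strings with `...` when displaying it. A minimal sketch for printing the full text is shown below. The DataFrame `plan_df` is a small stand-in built by hand (using the example line from the notes plus one hypothetical detail line), and `plan_text` is an illustrative helper name.

```python
import pandas as pd

# Stand-in for the DataFrame returned by an EXPLAIN cell.
# The second line is hypothetical and only illustrates an indented detail row.
plan_df = pd.DataFrame({
    "QUERY PLAN": [
        "Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)",
        "  Filter: (age > 30)",
    ]
})


def plan_text(df, column="QUERY PLAN"):
    """Join every plan row into one multi-line string, untruncated."""
    return "\n".join(df[column].astype(str))


print(plan_text(plan_df))
```

Alternatively, `pd.set_option("display.max_colwidth", None)` removes the column width limit for all DataFrame displays in the session.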

2. Alternatively, you can view the plan in DBeaver by selecting `Explain Execution Plan`. For this you don't need the `EXPLAIN` keyword.

<img src="../Resources/query_results/7.1_explain_1.gif" alt="View Explain using dbeaver" style="width: 90%; height: auto;">

#### Full Execution Plan

**`EXPLAIN ANALYZE`**

1. Add `EXPLAIN ANALYZE` to the beginning of the query. The times below are in milliseconds, and each node's time is cumulative: it includes the time spent in its child nodes.
    - Sequential Scan on customer (c): Full table scan reads all 104,990 customer records in 19.249 ms. No index is used; because the join needs every row, an index is unlikely to change this step.
    - Hash: Creates an in-memory hash table of customer data using 8105kB of memory with 131,072 buckets, which makes the subsequent join efficient.
    - Sequential Scan on sales (s): Scans all 199,873 sales records in 5.348 ms. Again, every row is needed, so a full scan is expected.
    - Hash Left Join: Matches sales to customers using the hash table, processing nearly 200K rows by 72.484 ms, which is reasonable for the data volume.
    - Sort: An external merge sort used 15MB of disk space, meaning the sort did not fit in the memory available for sorting (`work_mem`).
    - GroupAggregate: Consolidates the sorted data into 83,099 final groups by 159.837 ms, down from 199,873 input rows.
    - WindowAgg: Final window calculations complete at 189.804 ms. This is the top node, so its time covers nearly the whole query rather than the window step alone.
    - Query Performance Summary:
        - Planning is fast (0.442 ms) compared with execution (192.769 ms)
        - Main bottlenecks: sorting, grouping, and window calculations
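
`EXPLAIN ANALYZE` reports times in milliseconds, and a node's time includes its children, so a rough per-step figure comes from subtracting the children's times from the parent's. The sketch below does that arithmetic with the figures quoted above. It assumes each figure is the node's cumulative total time with a single loop, and it folds the Sort and Hash steps into their parents because their individual times are not quoted in these notes.

```python
# Cumulative actual times (ms) quoted in the notes above.
cumulative_ms = {
    "WindowAgg": 189.804,
    "GroupAggregate": 159.837,
    "Hash Left Join": 72.484,
    "Seq Scan on sales": 5.348,
    "Seq Scan on customer": 19.249,
}

# Parent -> children, simplified: Sort is folded into GroupAggregate and
# Hash is folded into Hash Left Join (their times are not quoted above).
children = {
    "WindowAgg": ["GroupAggregate"],
    "GroupAggregate": ["Hash Left Join"],
    "Hash Left Join": ["Seq Scan on sales", "Seq Scan on customer"],
}


def self_time_ms(node):
    """Approximate time spent in the node itself, excluding its children."""
    child_total = sum(cumulative_ms[c] for c in children.get(node, []))
    return round(cumulative_ms[node] - child_total, 3)


for node in cumulative_ms:
    print(f"{node:<22} {self_time_ms(node):>8.3f} ms")
```

Under these assumptions the sort-plus-grouping step accounts for the largest share, which is consistent with the bottlenecks listed above.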

In [3]:
%%sql

EXPLAIN ANALYZE
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM
		sales s
	LEFT JOIN customer c ON
		c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
)
SELECT
	customerkey,
	orderdate,
	total_net_revenue,
	num_orders,
	countryfull,
	age,
	CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS cleaned_name,
	MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date,
	EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM
	customer_revenue cr;

Unnamed: 0,QUERY PLAN
0,WindowAgg (cost=35601.24..48592.98 rows=19987...
1,-> GroupAggregate (cost=35601.24..43596.16...
2,"Group Key: s.customerkey, s.orderdate,..."
3,-> Sort (cost=35601.24..36100.92 row...
4,"Sort Key: s.customerkey, s.order..."
5,Sort Method: external merge Dis...
6,-> Hash Left Join (cost=5442.2...
7,Hash Cond: (s.customerkey ...
8,-> Seq Scan on sales s (...
9,-> Hash (cost=4129.90..4...


Below is the query output.

<img src="../Resources/query_results/7.1_explain_analyze.png" alt="Explain analyze results" style="width: 80%; height: auto;">