<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Basic_Query_Optimization/1_Explain.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Explain

## Overview

### 🥅 Analysis Goals

Analyze customer-level revenue, order behavior, and cohort classification to assess purchasing patterns and customer value.  

- **Plan Query Execution:** Use `EXPLAIN` to understand how PostgreSQL will execute the query, identifying potential inefficiencies like sequential scans or costly joins.  
- **Measure Actual Query Performance:** Use `EXPLAIN ANALYZE` to execute the query while collecting real performance metrics, comparing the planner's estimated costs and row counts with actual execution times and row counts.

### 📘 Concepts Covered

- `EXPLAIN`
- `EXPLAIN ANALYZE`

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## EXPLAIN

### 📝 Notes

**`EXPLAIN`**  

- **EXPLAIN**: Displays the execution plan of a SQL query, showing how PostgreSQL will execute it.

- Syntax:  
  ```sql
  EXPLAIN SELECT column FROM table WHERE condition;
  ```
  
**`EXPLAIN ANALYZE`**

- **EXPLAIN ANALYZE**: Executes the query and reports actual execution times and actual row counts alongside the planner's estimates.

- Syntax:  
  ```sql
  EXPLAIN ANALYZE SELECT column FROM table WHERE condition;
  ```

- Helps with query optimization by showing:
  - Index usage
  - Join methods (`Nested Loop`, `Hash Join`, `Merge Join`)
  - Sequential vs. index scans
  - Estimated vs. actual row counts

- **Example Output** (simplified):
  ```
  Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)
  ```
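  - `Seq Scan`: PostgreSQL reads the whole table sequentially instead of using an index.
  - `cost`: Estimated startup cost (`0.00`) and total cost (`18.50`), in arbitrary planner units rather than milliseconds.
  - `rows`: Estimated number of rows the node returns.
  - `width`: Estimated average row size in bytes.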

- **Use Cases**:
  - Debugging slow queries
  - Checking if indexes are being used
  - Understanding query performance bottlenecks
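
To make the fields in the example output concrete, here is a minimal Python sketch that splits one text-format plan line into its parts. The helper name `parse_plan_line` and the regular expression are illustrative assumptions, not part of PostgreSQL or this course; the pattern only covers lines shaped like the example above.

```python
import re

# Pattern for the estimate block that follows a node in text-format EXPLAIN output.
# Assumption: the line looks like "<node>  (cost=A..B rows=N width=W)".
PLAN_ESTIMATE = re.compile(
    r"^\s*(?:->\s*)?(?P<node>.+?)\s+"
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+)\s+"
    r"rows=(?P<rows>\d+)\s+width=(?P<width>\d+)\)"
)


def parse_plan_line(line):
    """Return the node label and planner estimates, or None for detail lines."""
    match = PLAN_ESTIMATE.match(line)
    if match is None:
        return None
    return {
        "node": match["node"],
        "startup_cost": float(match["startup"]),  # cost before the first row
        "total_cost": float(match["total"]),      # cost to return all rows
        "rows": int(match["rows"]),
        "width": int(match["width"]),
    }


estimate = parse_plan_line("Seq Scan on users  (cost=0.00..18.50 rows=850 width=64)")
print(estimate)
```

Detail lines such as `Hash Cond: ...` carry no estimate block, so the helper returns `None` for them.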

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Query Performance: Speed of data retrieval
  - Execution Cost: Resources needed to run query
  - Data Processing: How database handles requests
- **💡 Why It Matters**: Optimizes complex customer revenue analysis
    - Reduces query execution time for large customer datasets
    - Lowers computational costs for frequent cohort analysis
    - Enables faster business decision making
    - Improves efficiency of revenue tracking systems
- **🎯 Common Use Cases**: 
  - Performance tuning
  - Cost optimization
  - Query improvement
  - Resource management
- **📈 Related KPIs**: 
  - Query execution time
  - Processing costs
  - Resource utilization
  - Response time

### 📈 Analysis

- Understand how PostgreSQL plans to execute the query without running it, helping identify potential inefficiencies like sequential scans or costly joins.
- Execute the query while collecting actual performance metrics, then compare the planner's estimates with the measured times and row counts to optimize query performance.

> **⚠️ Note**: The queries below were built in earlier chapters, so the explanations here focus on interpreting the `EXPLAIN` output rather than the queries themselves.

#### Basic Execution Plan

**`EXPLAIN`**

1. Using the last query from the `String Formatting` chapter, add `EXPLAIN` to the beginning of the query. The plan reads as follows, from the outermost node to the innermost:

    - Subquery Scan on `cd`: Scans the entire subquery result (an estimated 199,873 rows), with an estimated startup cost of 35,601.24 and an estimated total cost of 50,591.71.
    - Window Aggregation: Applies the window function to an estimated 199,873 rows, which increases the width of each row.
    - Group Aggregate: Groups data by multiple fields (`customerkey`, `orderdate`, `countryfull`, `age`, `givenname`, `surname`) before performing aggregations.
    - Sort Operation: Sorts the joined rows by the same fields so that the Group Aggregate can process each group in order.
    - Hash Left Join: Joins `sales` (**199,873 rows**) to `customer` (104,990 rows) using a **hash join** with `customerkey` as the join condition.
    - Sequential Scan on `sales`: No filtering is applied, so PostgreSQL scans all **199,873 rows**.
    - Hash on `customer`: Prepares a hash table of 104,990 rows for efficient lookups in the join.
    - Sequential Scan on `customer`: Reads all rows from `customer`. The join needs every customer row, so a sequential scan is the expected choice and does not by itself show that `customerkey` lacks an index.
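
The nodes above form a tree: rows flow from the innermost scans up to the outermost node. The sketch below walks such a tree children-first to show that direction. The nested dict is a hand-written stand-in, not real server output. It borrows the `Node Type`, `Relation Name`, and `Plans` key names used by `EXPLAIN (FORMAT JSON)`, but the labels are copied from the text plan described above, so real JSON output will differ in its details. The function name `bottom_up` is hypothetical.

```python
# Hand-written stand-in for a plan tree (assumption: not actual EXPLAIN output).
PLAN = {
    "Node Type": "Subquery Scan",
    "Plans": [{
        "Node Type": "WindowAgg",
        "Plans": [{
            "Node Type": "GroupAggregate",
            "Plans": [{
                "Node Type": "Sort",
                "Plans": [{
                    "Node Type": "Hash Left Join",
                    "Plans": [
                        {"Node Type": "Seq Scan", "Relation Name": "sales"},
                        {"Node Type": "Hash", "Plans": [
                            {"Node Type": "Seq Scan", "Relation Name": "customer"},
                        ]},
                    ],
                }],
            }],
        }],
    }],
}


def bottom_up(node):
    """Yield node labels children-first, the direction rows flow through the plan."""
    for child in node.get("Plans", []):
        yield from bottom_up(child)
    label = node["Node Type"]
    if "Relation Name" in node:
        label += f" on {node['Relation Name']}"
    yield label


for step, label in enumerate(bottom_up(PLAN), start=1):
    print(step, label)
```

Reading a plan in this order makes it easier to see which node feeds which: both scans feed the join, the join feeds the sort, and so on up to the final subquery scan.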

In [3]:
%%sql

EXPLAIN
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,QUERY PLAN
0,Subquery Scan on cd (cost=35601.24..50591.71 ...
1,-> WindowAgg (cost=35601.24..47093.93 rows...
2,-> GroupAggregate (cost=35601.24..43...
3,"Group Key: s.customerkey, s.orde..."
4,-> Sort (cost=35601.24..36100....
5,"Sort Key: s.customerkey, s..."
6,-> Hash Left Join (cost=...
7,Hash Cond: (s.custom...
8,-> Seq Scan on sale...
9,-> Hash (cost=4129...


Below is the query output.

<img src="../Resources/query_results/7.1_explain_1.png" alt="Query Results 1" style="width: 70%; height: auto;">

2. Alternatively, you can view the `EXPLAIN` plan in DBeaver by selecting `Explain Execution Plan`. It shows the same plan nodes described in step 1.

<img src="../Resources/query_results/7.1_explain_1.gif" alt="View Explain using dbeaver" style="width: 90%; height: auto;">

#### Full Execution Plan

**`EXPLAIN ANALYZE`**

1. Add `EXPLAIN ANALYZE` to the beginning of the query. For each node, the two cost figures are the estimated startup and total cost, and the two time figures are the actual time to the first row and to the last row.
    - Subquery Scan on `cd`: The planner estimated 199,873 rows, but the scan actually returned 83,099 rows. Estimated cost: 35,601.24 (startup) and 50,591.71 (total). Actual time: 121.585 ms (first row) and 191.811 ms (last row).
    - Window Aggregation: A window function is applied to 83,099 rows. Estimated cost: 35,601.24 and 47,093.93. Actual time: 121.580 ms and 177.167 ms.
    - Group Aggregate: Groups data by `customerkey`, `orderdate`, `countryfull`, `age`, `givenname`, `surname`. The estimate was 199,873 rows, but actual rows were 83,099. Actual time: 121.572 ms and 158.194 ms.
    - Sort Operation: Data is sorted by `customerkey`, `orderdate`, `countryfull`, `age`, `givenname`, `surname` using an external merge sort (disk usage: 1505 kB). The estimate of 199,873 rows matched the 199,873 rows actually sorted. Actual time: 121.563 ms and 129.692 ms.
    - Hash Left Join: Joins `sales` (199,873 rows) to `customer` (104,990 rows) using a hash join on `customerkey`. Actual time: 37.958 ms and 69.797 ms.
    - Sequential Scan on `sales`: Full table scan of `sales` (199,873 rows). Actual time: 0.011 ms and 4.966 ms.
    - Hash on `customer`: Builds a hash table for 104,990 rows. Actual time: 37.846 ms.
    - Sequential Scan on `customer`: Reads all 104,990 rows. Actual time: 0.007 ms and 19.627 ms.
    - Planning Time: 0.519 ms.
    - Execution Time: 194.770 ms.
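
Each node's reported time includes the time spent in its children, so the raw figures do not show directly where the work happens. The sketch below estimates the time each node spends on its own by subtracting its children's totals, using the second (larger) time figure listed above for each node. It assumes every node ran once (`loops=1`), which the walkthrough above does not state. The tuple layout and the name `self_times` are illustrative choices.

```python
# (node label, actual total time in ms, children), copied from the list above.
# Assumption: every node executed exactly once (loops=1).
TIMED_PLAN = ("Subquery Scan on cd", 191.811, [
    ("WindowAgg", 177.167, [
        ("GroupAggregate", 158.194, [
            ("Sort", 129.692, [
                ("Hash Left Join", 69.797, [
                    ("Seq Scan on sales", 4.966, []),
                    ("Hash", 37.846, [
                        ("Seq Scan on customer", 19.627, []),
                    ]),
                ]),
            ]),
        ]),
    ]),
])


def self_times(node):
    """Map each node label to its total time minus its children's total times."""
    name, total_ms, children = node
    own_ms = total_ms - sum(child_total for _, child_total, _ in children)
    result = {name: round(own_ms, 3)}
    for child in children:
        result.update(self_times(child))
    return result


times = self_times(TIMED_PLAN)
for name, ms in sorted(times.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {ms:>8.3f} ms")

# Estimated vs. actual rows for the Group Aggregate node (about 2.4x too high).
print(f"row overestimate: {199_873 / 83_099:.2f}x")
```

Under that assumption the Sort node accounts for the largest share of its own time, which is consistent with the external merge sort spilling to disk.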

In [4]:
%%sql

EXPLAIN ANALYZE
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,QUERY PLAN
0,Subquery Scan on cd (cost=35601.24..50591.71 ...
1,-> WindowAgg (cost=35601.24..47093.93 rows...
2,-> GroupAggregate (cost=35601.24..43...
3,"Group Key: s.customerkey, s.orde..."
4,-> Sort (cost=35601.24..36100....
5,"Sort Key: s.customerkey, s..."
6,Sort Method: external merg...
7,-> Hash Left Join (cost=...
8,Hash Cond: (s.custom...
9,-> Seq Scan on sale...


Below is the query output.

<img src="../Resources/query_results/7.1_explain_analyze_1.png" alt="Explain analyze results" style="width: 80%; height: auto;">