<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Basic_Query_Optimization/2_Basic_Optimization.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Basic Optimization

## Overview

### 🥅 Analysis Goals
 
Improve query efficiency and reduce costs by optimizing cohort revenue tracking and lifetime value (LTV) calculations.  
- **Optimize Sales Data Aggregation:** Summarize customer purchases, calculate total revenue, and assign cohort years while keeping the join with `customer` efficient.  

### 📘 Concepts Covered

- Basic query optimization

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Query Optimization

### 📝 Notes

Query Optimization

- Improves SQL performance by reducing execution time and resource usage.
- **Basic Optimization Tips**  
    - Use `INNER JOIN` instead of `LEFT JOIN` when unmatched rows aren’t needed.  
    - Filter early using `WHERE`, not `HAVING`, to reduce processed rows.  
    - Avoid `SELECT *`; select only the required columns.  
    - Use `UNION ALL` instead of `UNION` when duplicates are acceptable, since `UNION` does extra work to remove them.  
    - Pre-filter data before `GROUP BY` and `DISTINCT` to avoid unnecessary calculations.  
    - Replace `OR` conditions with `IN` for better index usage.  
    - Use `EXISTS` instead of `IN` for subqueries on large datasets.  
    - Ensure data types match in comparisons to prevent slow implicit conversions.  
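
A couple of these tips can be checked on toy data. The sketch below is a minimal illustration, assuming an in-memory SQLite database as a stand-in for PostgreSQL; the `sales_2023` and `sales_2024` tables and their values are hypothetical and not part of the course dataset. It shows that `UNION ALL` keeps duplicate rows while `UNION` removes them, and that an `OR` chain and an `IN` list return the same rows.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for PostgreSQL.
# Table names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_2023 (customerkey INTEGER, amount REAL);
    CREATE TABLE sales_2024 (customerkey INTEGER, amount REAL);
    INSERT INTO sales_2023 VALUES (1, 10.0), (2, 20.0);
    INSERT INTO sales_2024 VALUES (2, 20.0), (3, 30.0);
""")

# UNION ALL appends the two result sets without a de-duplication step.
union_all_rows = conn.execute("""
    SELECT customerkey, amount FROM sales_2023
    UNION ALL
    SELECT customerkey, amount FROM sales_2024
""").fetchall()

# UNION additionally removes duplicate rows, which costs extra work.
union_rows = conn.execute("""
    SELECT customerkey, amount FROM sales_2023
    UNION
    SELECT customerkey, amount FROM sales_2024
""").fetchall()

# An OR chain on one column and the equivalent IN list return the same rows.
or_rows = conn.execute("""
    SELECT customerkey FROM sales_2023
    WHERE customerkey = 1 OR customerkey = 2
    ORDER BY customerkey
""").fetchall()
in_rows = conn.execute("""
    SELECT customerkey FROM sales_2023
    WHERE customerkey IN (1, 2)
    ORDER BY customerkey
""").fetchall()

print(len(union_all_rows), len(union_rows))  # 4 3
print(or_rows == in_rows)                    # True
```

Whether either rewrite is actually faster depends on the database and the data, so it is worth confirming with `EXPLAIN` on the real tables.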

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Query Efficiency: Optimized data retrieval methods
  - Resource Management: Efficient use of database resources
  - Performance Scaling: Handling growing data volumes
- **💡 Why It Matters**: Improves business operations and costs
    - Reduces cloud computing costs through efficient queries
    - Enables faster reporting for business decisions
    - Supports analysis of larger customer datasets
    - Allows more frequent cohort analysis updates
    - Makes revenue tracking more cost-effective
- **🎯 Common Use Cases**: 
  - Daily revenue reporting
  - Real-time customer analysis
  - Large-scale cohort tracking
  - Regular performance monitoring
- **📈 Related KPIs**: 
  - Query cost reduction
  - Report generation time
  - System resource savings
  - Analysis turnaround time    

### 📈 Analysis

- Summarizes customer purchases, calculates total revenue, and assigns cohort years while ensuring efficient joins with `customer` (from `1_Explain.ipynb`).  

> **⚠️ Note**: The queries below were covered earlier, so the explanations focus on how each one is optimized.

#### Simple Query Optimization

**Query Optimization**

1. Use `EXPLAIN` on the query to look for optimization opportunities (from the last example in the `1_Explain.ipynb` notebook).
    - Subquery Scan on `cd`: The entire subquery result is scanned (199,873 rows), with an estimated startup cost of 35,601.24 and an estimated total cost of 50,591.71.
    - Window Aggregation: A window function is applied to 199,873 rows, increasing the width of each row.
    - Group Aggregate: Groups data by multiple fields (`customerkey`, `orderdate`, `countryfull`, `age`, `givenname`, `surname`) before performing aggregations.
    - Sort Operation: Orders the grouped data by the same fields before performing further calculations.
    - Hash Left Join: Joins the `customer` table (104,990 rows) with `sales` (**199,873 rows**) using a **hash join**, indicating `customerkey` is used as a join condition.
    - Sequential Scan on `sales`: No filtering is applied, causing PostgreSQL to scan **199,873 rows**.
    - Hash on `customer`: Prepares a hash table of 104,990 rows for efficient lookups in the join.
    - Sequential Scan on `customer`: Reads all rows from `customer`. Because the hash table is built from every customer row, a sequential scan is expected here whether or not `customerkey` is indexed.
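
In each plan line, `cost=A..B` gives two planner estimates: `A` is the startup cost and `B` is the total cost, both in arbitrary planner units rather than milliseconds. The helper below is a small sketch for pulling those two numbers out of a plan line; `parse_cost` is a hypothetical name introduced here for illustration and is not part of any library.

```python
import re
from typing import Tuple

# Matches the "cost=<startup>..<total>" portion of a PostgreSQL EXPLAIN line.
COST_PATTERN = re.compile(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def parse_cost(plan_line: str) -> Tuple[float, float]:
    """Return (startup_cost, total_cost) from one EXPLAIN line (hypothetical helper)."""
    match = COST_PATTERN.search(plan_line)
    if match is None:
        raise ValueError(f"no cost estimate found in: {plan_line!r}")
    return float(match.group(1)), float(match.group(2))


# Top node of the plan discussed above (the rest of the line is truncated in the output).
startup, total = parse_cost("Subquery Scan on cd (cost=35601.24..50591.71 ...")
print(startup, total)  # 35601.24 50591.71
```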

In [3]:
%%sql

EXPLAIN
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,QUERY PLAN
0,Subquery Scan on cd (cost=35601.24..50591.71 ...
1,-> WindowAgg (cost=35601.24..47093.93 rows...
2,-> GroupAggregate (cost=35601.24..43...
3,"Group Key: s.customerkey, s.orde..."
4,-> Sort (cost=35601.24..36100....
5,"Sort Key: s.customerkey, s..."
6,-> Hash Left Join (cost=...
7,Hash Cond: (s.custom...
8,-> Seq Scan on sale...
9,-> Hash (cost=4129...


Below is the query output.

<img src="../Resources/query_results/7.1_explain_1.png" alt="Query Results 1" style="width: 70%; height: auto;">

2. Use `INNER JOIN` in the `customer_revenue` CTE.
    - If every `sales.customerkey` exists in `customer`, change `LEFT JOIN` to `INNER JOIN`.
    - This eliminates unnecessary NULL checks and improves join efficiency.
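
Before making this switch, it helps to confirm that no `sales` row lacks a matching customer, because `INNER JOIN` would silently drop such rows. The sketch below shows one way to check, assuming an in-memory SQLite database as a stand-in for PostgreSQL with hypothetical toy rows; the same anti-join query can be run against the real `sales` and `customer` tables.

```python
import sqlite3

# In-memory SQLite stand-in; rows are hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, givenname TEXT);
    CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER, netprice REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO sales VALUES (100, 1, 10.0), (101, 2, 20.0), (102, 2, 30.0);
""")

# Anti-join: count sales rows whose customerkey has no match in customer.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM sales s
    LEFT JOIN customer c ON c.customerkey = s.customerkey
    WHERE c.customerkey IS NULL
""").fetchone()[0]

# With zero orphans, LEFT JOIN and INNER JOIN return the same number of rows.
left_count = conn.execute("""
    SELECT COUNT(*)
    FROM sales s
    LEFT JOIN customer c ON c.customerkey = s.customerkey
""").fetchone()[0]
inner_count = conn.execute("""
    SELECT COUNT(*)
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
""").fetchone()[0]

print(orphans, left_count, inner_count)  # 0 3 3
```

If the orphan count is not zero, the two joins return different results and `LEFT JOIN` should stay.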

In [5]:
%%sql

WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        c.countryfull,
        c.age,
        c.givenname,
        c.surname
    FROM sales s 
    INNER JOIN customer c ON c.customerkey = s.customerkey -- Update
    GROUP BY
        s.customerkey,
        s.orderdate,
        c.countryfull,
        c.age,
        c.givenname,
        c.surname
),

cohort_data AS (
    SELECT
        cr.*,
        MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
        EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
    FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,customerkey,cohort_year,cleaned_name,num_orders,total_net_revenue,countryfull,age,first_purchase_date,orderdate
0,15,2021,Julian McGuigan,1,2217.41,Australia,55,2021-03-08,2021-03-08
1,180,2018,Gabriel Bosanquet,1,525.31,Australia,65,2018-07-28,2018-07-28
2,180,2018,Gabriel Bosanquet,2,1984.90,Australia,65,2018-07-28,2023-08-28
3,185,2019,Gabrielle Castella,1,1395.52,Australia,40,2019-06-01,2019-06-01
4,243,2016,Maya Atherton,1,287.67,Australia,66,2016-05-19,2016-05-19
...,...,...,...,...,...,...,...,...,...
83094,2099697,2022,Phillipp Maier,3,38.20,United States,54,2022-09-13,2022-09-13
83095,2099711,2016,Katerina Pavlícková,1,2067.75,United States,80,2016-08-13,2016-08-13
83096,2099711,2016,Katerina Pavlícková,1,3940.92,United States,80,2016-08-13,2017-08-14
83097,2099743,2022,Luciana Almonte,2,469.62,United States,21,2022-03-17,2022-03-17


3. Simplify `GROUP BY` in the `customer_revenue` CTE.

    - If `countryfull`, `age`, `givenname`, and `surname` do not change for each `customerkey`, remove them from `GROUP BY` and use `MAX()`.
    - This reduces sorting and aggregation overhead.

> **⚠️ Note**: Wrapping a column in `MAX()` (including non-numeric columns) is safe only when its value is guaranteed to be identical within each group. In that case `MAX()` simply returns that value while keeping the column out of `GROUP BY`, which shrinks the grouping and sort keys.
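
The equivalence can be checked on toy data. The sketch below assumes an in-memory SQLite database as a stand-in for PostgreSQL, with hypothetical rows and a simplified `revenue` column in place of `quantity * netprice * exchangerate`. It compares grouping by every customer column against grouping only by `customerkey` and `orderdate` with `MAX()` on the rest.

```python
import sqlite3

# In-memory SQLite stand-in; rows and the `revenue` column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, countryfull TEXT, age INTEGER);
    CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER, orderdate TEXT, revenue REAL);
    INSERT INTO customer VALUES (1, 'Australia', 55), (2, 'United States', 40);
    INSERT INTO sales VALUES
        (100, 1, '2021-03-08', 10.0),
        (101, 1, '2021-03-08', 5.0),
        (102, 2, '2022-01-15', 7.5),
        (103, 2, '2023-06-01', 2.5);
""")

# Original shape: every selected customer column appears in GROUP BY.
wide_group_by = conn.execute("""
    SELECT s.customerkey, s.orderdate, SUM(s.revenue), c.countryfull, c.age
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY s.customerkey, s.orderdate, c.countryfull, c.age
    ORDER BY s.customerkey, s.orderdate
""").fetchall()

# Optimized shape: group only by the keys and use MAX() for per-customer constants.
narrow_group_by = conn.execute("""
    SELECT s.customerkey, s.orderdate, SUM(s.revenue), MAX(c.countryfull), MAX(c.age)
    FROM sales s
    INNER JOIN customer c ON c.customerkey = s.customerkey
    GROUP BY s.customerkey, s.orderdate
    ORDER BY s.customerkey, s.orderdate
""").fetchall()

print(wide_group_by == narrow_group_by)  # True
```

The results match only because each customer has exactly one `countryfull` and `age`; if those values could differ within a group, `MAX()` would hide the variation.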

In [6]:
%%sql

WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        MAX(c.countryfull) AS countryfull, -- Update  
        MAX(c.age) AS age, -- Update
        MAX(c.givenname) AS givenname, -- Update
        MAX(c.surname) AS surname -- Update
    FROM sales s 
    INNER JOIN customer c ON c.customerkey = s.customerkey    
    GROUP BY
        s.customerkey,
        s.orderdate
),

cohort_data AS (
    SELECT
        cr.customerkey,
        cr.orderdate,
        cr.total_net_revenue,
        cr.num_orders,
        cr.countryfull,
        cr.age,
        cr.givenname,
        cr.surname,
        MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
        EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
    FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,customerkey,cohort_year,cleaned_name,num_orders,total_net_revenue,countryfull,age,first_purchase_date,orderdate
0,15,2021,Julian McGuigan,1,2217.41,Australia,55,2021-03-08,2021-03-08
1,180,2018,Gabriel Bosanquet,1,525.31,Australia,65,2018-07-28,2018-07-28
2,180,2018,Gabriel Bosanquet,2,1984.90,Australia,65,2018-07-28,2023-08-28
3,185,2019,Gabrielle Castella,1,1395.52,Australia,40,2019-06-01,2019-06-01
4,243,2016,Maya Atherton,1,287.67,Australia,66,2016-05-19,2016-05-19
...,...,...,...,...,...,...,...,...,...
83094,2099697,2022,Phillipp Maier,3,38.20,United States,54,2022-09-13,2022-09-13
83095,2099711,2016,Katerina Pavlícková,1,2067.75,United States,80,2016-08-13,2016-08-13
83096,2099711,2016,Katerina Pavlícková,1,3940.92,United States,80,2016-08-13,2017-08-14
83097,2099743,2022,Luciana Almonte,2,469.62,United States,21,2022-03-17,2022-03-17


4. Run the `EXPLAIN` plan again on this updated query to see what's been improved.

    - Filtering earlier with a `WHERE` clause on `sales.orderdate` reduces the rows processed before aggregation (from **199,873 to 37,024**). Note that the query shown below does not include this filter, so add it only if a date restriction fits the analysis.  
    - Switching from `LEFT JOIN` to `INNER JOIN` eliminates unnecessary NULL checks and improves join efficiency.  
    - Simplifying `GROUP BY` by using `MAX()` for values that are constant per customer, such as `givenname` and `surname`, reduces sorting and aggregation complexity.  
    - The planner now chooses a parallel plan (`Gather Merge` with one planned worker, `Partial GroupAggregate`, `Parallel Hash Join`, and `Parallel Seq Scan`), distributing the workload across workers.  
    - The estimated total cost drops from 50,591.71 to 33,859.75 as a result of the reduced aggregation overhead and parallelization.
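
The early-filtering point can be sketched on toy data. The example below assumes an in-memory SQLite database as a stand-in for PostgreSQL; the rows, the simplified `revenue` column, and the `CUTOFF` date are all hypothetical. It shows that a `WHERE` filter applied before `GROUP BY` returns the same rows as a `HAVING` filter applied afterwards, while letting the database discard rows before it aggregates them.

```python
import sqlite3

# In-memory SQLite stand-in; rows, `revenue`, and the cutoff date are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, revenue REAL);
    INSERT INTO sales VALUES
        (1, '2019-05-01', 10.0),
        (1, '2021-03-08', 20.0),
        (2, '2018-07-28', 30.0),
        (2, '2023-08-28', 40.0);
""")

CUTOFF = "2020-01-01"  # hypothetical date restriction

# WHERE removes rows before they are grouped and summed.
filtered_early = conn.execute("""
    SELECT customerkey, orderdate, SUM(revenue)
    FROM sales
    WHERE orderdate >= ?
    GROUP BY customerkey, orderdate
    ORDER BY customerkey, orderdate
""", (CUTOFF,)).fetchall()

# HAVING removes groups only after every row has been grouped and summed.
filtered_late = conn.execute("""
    SELECT customerkey, orderdate, SUM(revenue)
    FROM sales
    GROUP BY customerkey, orderdate
    HAVING orderdate >= ?
    ORDER BY customerkey, orderdate
""", (CUTOFF,)).fetchall()

print(filtered_early == filtered_late)  # True
print(len(filtered_early))              # 2
```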

In [8]:
%%sql

EXPLAIN
WITH customer_revenue AS (
    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey) AS num_orders,
        MAX(c.countryfull) AS countryfull, -- Update  
        MAX(c.age) AS age, -- Update
        MAX(c.givenname) AS givenname, -- Update
        MAX(c.surname) AS surname -- Update
    FROM sales s 
    INNER JOIN customer c ON c.customerkey = s.customerkey    
    GROUP BY
        s.customerkey,
        s.orderdate
),

cohort_data AS (
    SELECT
        cr.customerkey,
        cr.orderdate,
        cr.total_net_revenue,
        cr.num_orders,
        cr.countryfull,
        cr.age,
        cr.givenname,
        cr.surname,
        MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
        EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
    FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;


Unnamed: 0,QUERY PLAN
0,Subquery Scan on cd (cost=23390.59..33859.75 ...
1,-> WindowAgg (cost=23390.59..33211.83 rows...
2,-> Finalize GroupAggregate (cost=233...
3,"Group Key: s.customerkey, s.orde..."
4,-> Gather Merge (cost=23390.59...
5,Workers Planned: 1
6,-> Partial GroupAggregate...
7,Group Key: s.custome...
8,-> Sort (cost=2239...
9,Sort Key: s.cu...


<img src="../Resources/query_results/7.2_basic_optimization_1.png" alt="Basic Optimization 1" style="width: 70%; height: auto;">
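
To put the two plans side by side, the total estimated cost of the top `Subquery Scan on cd` node can be compared before and after the changes. The snippet below is a small sketch using the two totals from the `EXPLAIN` outputs in this notebook. These are planner estimates in arbitrary units, so the percentage describes the change in estimated cost rather than a measured runtime.

```python
# Total estimated costs copied from the top plan node of each EXPLAIN output.
cost_before = 50591.71  # Subquery Scan on cd, original query
cost_after = 33859.75   # Subquery Scan on cd, optimized query

# Relative reduction in the planner's estimated total cost.
reduction = (cost_before - cost_after) / cost_before
print(f"Estimated total cost reduced by {reduction:.1%}")  # Estimated total cost reduced by 33.1%
```

For measured timings rather than estimates, `EXPLAIN ANALYZE` runs the query and reports actual execution times.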