<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/2_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Aggregation in Window Functions

## Overview

### 🥅 Analysis Goals

Explore user-level metrics to understand cohort size and the average lifetime value of users in each cohort.

- **Total Users by Cohort:** Count the unique users in each cohort to gauge its scale. Relative cohort size is essential context when benchmarking revenue.
- **Cohort-Based LTV Analysis:** Assign each customer to a cohort based on their first purchase year, calculate their total lifetime revenue, and compute the average LTV across all customers within each cohort to analyze customer value trends.
- **Filtered Average LTV by Customer:** Aggregate total revenue per customer, filter out low-value transactions before aggregation, and calculate the average lifetime value. Then, filter again to focus only on high-value customers. Ensures meaningful purchases contribute to total revenue while highlighting top-spending users for deeper cohort insights.


### 📘 Concepts Covered

- `COUNT()`
- `AVG()`
- Filtering window functions

In [66]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## CTEs Review

**Common Table Expressions (CTEs)**

- **CTE**: A temporary result set defined within a query, often used to improve readability and manageability of complex queries.
    - CTEs are created using the `WITH` keyword and can be referenced in the main query.
    - They are reusable within the same query and make it easier to break down large queries into smaller, logical steps.

**Syntax**:
- Basic CTE
    - ```sql
      WITH cte_name AS (
        SELECT column1, column2
        FROM table_name
        WHERE condition
      )
      SELECT column1, column2
      FROM cte_name;
      ```

- CTE with multiple definitions
    - ```sql
      WITH cte1 AS (
        SELECT column1, column2
        FROM table1
        WHERE condition
      ),
      cte2 AS (
        SELECT column1, column3
        FROM table2
        WHERE condition
      )
      SELECT cte1.column1, cte2.column3
      FROM cte1
      JOIN cte2 ON cte1.column1 = cte2.column1;
      ```
- **Note**: CTEs are a cleaner alternative to subqueries for improving query organization, especially in PostgreSQL. They can also replace temporary tables in some cases, but they only exist for the duration of the query.

### 📈 Analysis

- 📊 **Daily Revenue Analysis**: Get the average net revenue by day to review CTEs (same example used for subqueries)
- 👥 **Cohort Analysis**: Group users into cohorts based on their first order year (`cohort_year`) to analyze long-term revenue growth
> ⚠️ **Data Note**: Customer table contains `startdt` field but will not be used since historical data (1980-2010) is not available in our dataset

#### Average Net Revenue by Day

**CTE**

1. Using a CTE called `revenue_by_day`, get the average net revenue by day (`orderdate`).  
   - Create a CTE `revenue_by_day` to calculate the net revenue for each sale using `(quantity * netprice * exchangerate)`.  
        - Include `orderdate` in the CTE to associate each sale with its corresponding date.  
   - In the main query: 
        - Calculate the average net revenue per day using `AVG(net_revenue)`.  
        - Group the results by `orderdate` to compute the average for each unique day.  
        - Use `ORDER BY orderdate` to display the results in chronological order.  

In [67]:
%%sql

-- Moved subquery to a CTE
WITH revenue_by_day AS (
    SELECT 
        orderdate, 
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
)

SELECT 
    orderdate,
    AVG(net_revenue)
FROM revenue_by_day
GROUP BY orderdate
ORDER BY orderdate;

Unnamed: 0,orderdate,avg
0,2015-01-01,465.63
1,2015-01-02,736.30
2,2015-01-03,942.70
3,2015-01-05,1240.63
4,2015-01-06,862.49
...,...,...
3289,2024-04-16,784.34
3290,2024-04-17,539.98
3291,2024-04-18,498.40
3292,2024-04-19,967.74


This is what it looked like as a subquery.

In [68]:
%%sql

SELECT 
    orderdate, 
    AVG(net_revenue)
FROM (
    SELECT orderdate, (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
) AS revenue_by_day
GROUP BY orderdate
ORDER BY orderdate;

Unnamed: 0,orderdate,avg
0,2015-01-01,465.63
1,2015-01-02,736.30
2,2015-01-03,942.70
3,2015-01-05,1240.63
4,2015-01-06,862.49
...,...,...
3289,2024-04-16,784.34
3290,2024-04-17,539.98
3291,2024-04-18,498.40
3292,2024-04-19,967.74


### 💡 CTEs > Subqueries 

We prefer using CTEs over subqueries for a few reasons: 
-  📚 **Improved Readability**: CTEs separate logic into named blocks, making complex queries easier to follow and understand compared to deeply nested subqueries.
- ♻️ **Reusability**: CTEs can be referenced multiple times within the same query, avoiding duplication and reducing redundancy compared to repeating subqueries.
- 🐞 **Debugging Friendly**: With CTEs, you can test and debug each logical step independently, whereas subqueries are harder to isolate and analyze.

We'll still use subqueries occasionally, but only for simple logic (e.g., `SELECT column_name FROM table`).
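
The reusability point can be sketched with a CTE that is referenced twice in one query. This is a minimal illustration, not the course's own example: it uses Python's built-in `sqlite3` module in place of PostgreSQL, the three `sales` rows are invented, and `revenue_by_day` here holds a daily total rather than one row per sale.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (orderdate TEXT, quantity INTEGER, netprice REAL, exchangerate REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2015-01-01", 1, 100.0, 1.0),
        ("2015-01-01", 2, 150.0, 1.0),
        ("2015-01-02", 1, 500.0, 1.0),
    ],
)

# revenue_by_day is defined once and referenced twice:
# as the row source and inside the scalar subquery.
query = """
WITH revenue_by_day AS (
    SELECT orderdate, SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY orderdate
)
SELECT
    orderdate,
    net_revenue,
    (SELECT AVG(net_revenue) FROM revenue_by_day) AS avg_daily_revenue
FROM revenue_by_day
ORDER BY orderdate;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With a subquery, the `SELECT ... GROUP BY orderdate` logic would have to be written out in both places.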

---
## COUNT

### 📝 Notes

`COUNT`

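- **COUNT**: Counts the rows (or the non-null values of a column) in each window. `COUNT(DISTINCT ...)` works as a regular aggregate, but PostgreSQL does not support `DISTINCT` inside a window function call, so deduplicate in a CTE first when you need a unique count.
- Syntax: 
  ```sql
    SELECT
      COUNT(*) OVER(
          PARTITION BY partition_expression
      ) AS window_column_alias
    FROM table_name
  ```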
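
As a rough sketch of what the window form does differently from a grouped `COUNT`, the snippet below runs both on a tiny table. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer for window functions) as a stand-in for PostgreSQL, and the four rows are made up for illustration.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (15, "2021-03-08"),
        (180, "2018-07-28"),
        (180, "2023-08-28"),
        (180, "2023-08-28"),
    ],
)

# Window version: every input row is kept and annotated with its partition's count.
window_rows = conn.execute("""
    SELECT customerkey, orderdate,
           COUNT(*) OVER (PARTITION BY customerkey) AS total_orders_per_customer
    FROM sales
    ORDER BY customerkey, orderdate
""").fetchall()

# GROUP BY version: rows are collapsed to one per customer.
grouped_rows = conn.execute("""
    SELECT customerkey, COUNT(*) AS total_orders
    FROM sales
    GROUP BY customerkey
    ORDER BY customerkey
""").fetchall()

print(window_rows)   # 4 rows, one per order
print(grouped_rows)  # 2 rows, one per customer
```

The window version is the one to reach for when the per-order detail needs to stay next to the per-customer count.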

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Cohort: Group of customers who made their first purchase in the same time period
  - User Acquisition: Process of gaining new customers
  - Customer Base: Total number of active customers
- **💡 Why It Matters**: Essential for understanding customer growth and comparing performance across different sized groups (e.g. segments).
- **🎯 Common Use Cases**: Tracking new customer growth, measuring retention rates
- **📈 Related Metrics**: Acquisition rate, retention rate, churn rate

### 📈 Analysis

-  Group customers by the year of their first purchase, then calculate the total number of customers in each cohort to analyze customer retention and growth trends over time.


#### Counts of Orders per Customer

**`COUNT()`**, **`OVER`**, **`PARTITION BY`**

1. Get the total number of orders for each customer from the `sales` table.
   - Select `customerkey` and `orderdate` to show the order details.
   - Use `COUNT(*) OVER(PARTITION BY customerkey)` to count all orders for each customer.
   - Order results by `customerkey` and `orderdate` to group each customer's orders together chronologically.

In [69]:
%%sql 

SELECT 
    customerkey,
    orderdate,
    COUNT(*) OVER(PARTITION BY customerkey) as total_orders_per_customer
FROM sales
ORDER BY customerkey, orderdate
;

Unnamed: 0,customerkey,orderdate,total_orders_per_customer
0,15,2021-03-08,1
1,180,2018-07-28,3
2,180,2023-08-28,3
3,180,2023-08-28,3
4,185,2019-06-01,1
...,...,...,...
199868,2099711,2016-08-13,2
199869,2099711,2017-08-14,2
199870,2099743,2022-03-17,3
199871,2099743,2022-03-17,3


#### Total Count of Customers by Cohort

**`COUNT()`, `OVER`, `PARTITION BY`**

1. Get the cohorts by year from the `orderdate` and the `customerkey` in the `sales` table.
    - Use `EXTRACT(YEAR FROM MIN(orderdate))` to find the earliest year a customer placed an order.
    - Select `customerkey` to associate each customer with their cohort year.
    - Group the data by `customerkey` to calculate the cohort year for each customer.

In [70]:
%%sql

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    customerkey
FROM sales
GROUP BY 
    customerkey

Unnamed: 0,cohort_year,customerkey
0,2018,2044589
1,2021,1603477
2,2017,876049
3,2024,1469222
4,2018,2089398
...,...,...
49482,2019,853617
49483,2016,1573639
49484,2022,1355936
49485,2024,967453


2. Create a CTE to calculate the cohort year for each customer and return all results in the main query.  
   - 🔔 Define a CTE `yearly_cohort` to extract the cohort year for each customer.  
      - Use `EXTRACT(YEAR FROM MIN(orderdate))` in the CTE to determine the earliest order year for each customer.  
      - Group by `customerkey` in the CTE to assign each customer to a single cohort year.  
   - 🔔 In the main query, use `SELECT * FROM yearly_cohort` to return all the results from the CTE.  

In [71]:
%%sql

-- Put query into a CTE
WITH yearly_cohort AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey
    FROM sales
    GROUP BY 
        customerkey
)

-- Added
SELECT * 
FROM yearly_cohort

Unnamed: 0,cohort_year,customerkey
0,2018,2044589
1,2021,1603477
2,2017,876049
3,2024,1469222
4,2018,2089398
...,...,...
49482,2019,853617
49483,2016,1573639
49484,2022,1355936
49485,2024,967453


3. Get total customers by cohort using window functions.
    - First create a CTE to identify each customer's cohort year
        - Use `EXTRACT(YEAR FROM MIN(orderdate))` to determine cohort year
        - Group by `customerkey` to get one cohort year per customer
    - In the main query:
        - Use `COUNT(customerkey) OVER (PARTITION BY cohort_year)` to count customers in each cohort
        - Add `DISTINCT` to show only one row per cohort year
        - Order by cohort year to show progression

> NOTE: You could achieve this more easily with `GROUP BY` in the main query, reusing the `yearly_cohort` CTE (an aggregate such as `MIN()` cannot appear inside `GROUP BY` itself):
> ```sql
> SELECT 
>     cohort_year,
>     COUNT(*) AS total_customers
> FROM yearly_cohort
> GROUP BY cohort_year
> ORDER BY cohort_year;
> ```
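
To check that `DISTINCT` plus a window count and a plain `GROUP BY` over the cohort CTE agree, here is a small sketch. The assumptions: Python's built-in `sqlite3` (SQLite 3.25+) replaces PostgreSQL, `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)`, and the rows are invented.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "2015-03-01"),
        (1, "2016-02-01"),  # repeat order: customer 1 stays in the 2015 cohort
        (2, "2015-06-01"),
        (3, "2016-04-01"),
    ],
)

# Shared CTE: one row per customer with the year of their first order.
cte = """
WITH yearly_cohort AS (
    SELECT customerkey,
           CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year
    FROM sales
    GROUP BY customerkey
)
"""

windowed = conn.execute(cte + """
    SELECT DISTINCT cohort_year,
           COUNT(customerkey) OVER (PARTITION BY cohort_year) AS total_customers
    FROM yearly_cohort
    ORDER BY cohort_year
""").fetchall()

grouped = conn.execute(cte + """
    SELECT cohort_year, COUNT(*) AS total_customers
    FROM yearly_cohort
    GROUP BY cohort_year
    ORDER BY cohort_year
""").fetchall()

print(windowed)
print(grouped)
```

Both queries return one row per cohort year with the same counts; the window version simply needs `DISTINCT` to collapse the repeated rows.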


In [72]:
%%sql

WITH yearly_cohort AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey
    FROM sales
    GROUP BY 
        customerkey
)

SELECT DISTINCT
    cohort_year,
    COUNT(customerkey) OVER (PARTITION BY cohort_year) AS total_customers
FROM yearly_cohort
ORDER BY cohort_year;

Unnamed: 0,cohort_year,total_customers
0,2015,2825
1,2016,3397
2,2017,4068
3,2018,7446
4,2019,7755
5,2020,3031
6,2021,4663
7,2022,9010
8,2023,5890
9,2024,1402


<img src="../Resources/images/3.2_cohort_customers.png" alt="Cohort Customers" width="50%">

4. **BONUS:**  Get active customers by purchase year and cohort using window functions.
    - Create a CTE to identify each customer's cohort year and purchase year
        - Use `MIN(orderdate) OVER (PARTITION BY customerkey)` to determine first purchase date
        - Extract `cohort_year` from the first purchase date
        - Extract `purchase_year` from `orderdate`
        - Add `DISTINCT` to get unique customer-year combinations
    - In the main query:
        - Use `COUNT(*) OVER (PARTITION BY purchase_year, cohort_year)` to count customers
        - Add `DISTINCT` to show only one row per year combination
        - Order by purchase year then cohort year

In [73]:
%%sql

WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
        EXTRACT(YEAR FROM orderdate) AS purchase_year  --moved
    FROM sales
)
SELECT DISTINCT  -- added
    cohort_year,
    purchase_year, --added
    COUNT(*) OVER (PARTITION BY purchase_year, cohort_year) as num_customers  --added
FROM yearly_cohort
ORDER BY 
    purchase_year,
    cohort_year
LIMIT 40

Unnamed: 0,cohort_year,purchase_year,num_customers
0,2015,2015,2825
1,2015,2016,126
2,2016,2016,3397
3,2015,2017,149
4,2016,2017,174
5,2017,2017,4068
6,2015,2018,348
7,2016,2018,374
8,2017,2018,473
9,2018,2018,7446


<img src="../Resources/images/3.2_cohort_customer_yearly.png" alt="Cohort Customers" width="50%">  

> NOTE: Earlier cohorts continue to contribute active customers in later years, but their numbers are small compared to each year's new customer acquisitions.  
> 
> This raises the question: why are more resources not spent on customer retention?
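
The `DISTINCT` inside the CTE of the active-customers query is what turns an order count into a customer count. The sketch below shows the difference on invented rows, assuming Python's built-in `sqlite3` (SQLite 3.25+) in place of PostgreSQL and `strftime('%Y', ...)` in place of `EXTRACT(YEAR FROM ...)`.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "2015-03-01"),
        (1, "2016-02-01"),
        (1, "2016-09-01"),  # second 2016 order by the same customer
        (2, "2015-06-01"),
        (3, "2016-04-01"),
    ],
)

# {distinct} is filled with either "DISTINCT" or "" to compare both variants.
template = """
WITH yearly_cohort AS (
    SELECT {distinct}
        customerkey,
        CAST(strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey)) AS INTEGER) AS cohort_year,
        CAST(strftime('%Y', orderdate) AS INTEGER) AS purchase_year
    FROM sales
)
SELECT DISTINCT
    cohort_year,
    purchase_year,
    COUNT(*) OVER (PARTITION BY purchase_year, cohort_year) AS num_customers
FROM yearly_cohort
ORDER BY purchase_year, cohort_year;
"""
deduplicated = conn.execute(template.format(distinct="DISTINCT")).fetchall()
per_order = conn.execute(template.format(distinct="")).fetchall()

print(deduplicated)  # counts customers
print(per_order)     # counts orders, so customer 1 is counted twice in 2016
```

Without the inner `DISTINCT`, the 2015 cohort appears to have two active customers in 2016 even though only customer 1 ordered that year.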

---
## Average

### 📝 Notes

- `AVG()`: Calculates the average of the values in each window

```sql
  SELECT
    AVG(column_to_average) OVER(
         PARTITION BY partition_expression
    ) AS window_column_alias
  FROM table_name
```
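
A minimal sketch of `AVG()` as a window function follows. It assumes Python's built-in `sqlite3` (SQLite 3.25+) as a stand-in for PostgreSQL; the rows are invented and the hypothetical `revenue` column stands in for `quantity * netprice * exchangerate`.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2021-01-05", 100.0),
        (1, "2021-02-10", 300.0),
        (2, "2021-03-01", 50.0),
    ],
)

# Each order keeps its own value and gains the customer's average next to it.
rows = conn.execute("""
    SELECT customerkey,
           orderdate,
           revenue AS order_value,
           AVG(revenue) OVER (PARTITION BY customerkey) AS avg_order_value
    FROM sales
    ORDER BY customerkey, orderdate
""").fetchall()

for row in rows:
    print(row)
# (1, '2021-01-05', 100.0, 200.0)
# (1, '2021-02-10', 300.0, 200.0)
# (2, '2021-03-01', 50.0, 50.0)
```

Customer 1's two orders both show an average of 200.0, which makes it easy to compare each order against that customer's typical spend.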

### 🔑 Key Concepts
- **📊 Business Terms**:
  - Customer Lifetime Value (LTV): Total revenue generated by a customer over time
  - Average Order Value (AOV): Typical amount spent per transaction
  - Revenue Per User: Average revenue generated by each customer
- **💡 Why It Matters**: Helps identify valuable customer segments and predict future revenue
- **🎯 Common Use Cases**: Calculating customer LTV, analyzing spending patterns
- **📈 Related Metrics**: ARPU (Average Revenue Per User), CAC (Customer Acquisition Cost)

### 📕 Definitions

- **Lifetime Value (LTV)**: the total revenue a customer generates for a business over their entire relationship.


### 📈 Analysis

- Find the average order value per customer.
- Aggregate total revenue per user and calculate the average lifetime value for each cohort. 

#### Average Order Value per Customer
**`AVG()`**, **`OVER`**, **`PARTITION BY`**

1. Calculate the average order value for each customer from the `sales` table.
   - Select `customerkey` and `orderdate` to show the order details.
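   - Calculate `order_value` using `quantity * netprice * exchangerate` for each individual order.
   - Use `AVG(quantity * netprice * exchangerate) OVER(PARTITION BY customerkey)` to compute the average order value across all orders for each customer. The expression is repeated because a column alias such as `order_value` cannot be referenced elsewhere in the same `SELECT` list.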
   - Order results by `customerkey` and `orderdate` to group each customer's orders together chronologically.

In [74]:
%%sql

SELECT 
    customerkey,
    orderdate,
    (quantity * netprice * exchangerate) as order_value,
    AVG(quantity * netprice * exchangerate) OVER(PARTITION BY customerkey) as avg_order_value
FROM sales
ORDER BY customerkey, orderdate
;

Unnamed: 0,customerkey,orderdate,order_value,avg_order_value
0,15,2021-03-08,2217.41,2217.41
1,180,2018-07-28,525.31,836.74
2,180,2023-08-28,71.36,836.74
3,180,2023-08-28,1913.55,836.74
4,185,2019-06-01,1395.52,1395.52
...,...,...,...,...
199868,2099711,2016-08-13,2067.75,3004.34
199869,2099711,2017-08-14,3940.92,3004.34
199870,2099743,2022-03-17,375.57,356.03
199871,2099743,2022-03-17,94.05,356.03


#### Average LTV by Cohort

**`AVG()`, `OVER`, `PARTITION BY`**

1. Get the `cohort_year` and the total revenue for each user.  
   - Use `EXTRACT(YEAR FROM MIN(orderdate))` to calculate the cohort year for each customer.  
   - Group by `customerkey` to ensure the revenue and cohort year are calculated per user.  
   - Calculate the total revenue for each customer using `SUM(quantity * netprice * exchangerate)`.  
   - Select `cohort_year`, `customerkey`, and the total revenue (`total_customer_net_revenue`) to display the results.  

In [75]:
%%sql

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    customerkey,
    SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
FROM sales
GROUP BY 
    customerkey


Unnamed: 0,cohort_year,customerkey,total_customer_net_revenue
0,2018,2044589,2470.73
1,2021,1603477,136.62
2,2017,876049,2601.13
3,2024,1469222,5278.54
4,2018,2089398,98.39
...,...,...,...
49482,2019,853617,903.31
49483,2016,1573639,6973.42
49484,2022,1355936,149.99
49485,2024,967453,5.40


2. Create a CTE to calculate the cohort year for each customer and their net revenue and return all results in the main query.  
   - 🔔 Define a CTE `cohort_analysis` to extract the cohort year for each customer.  
        - Use `EXTRACT(YEAR FROM MIN(orderdate))` in the CTE to determine the earliest order year for each customer. 
        - Calculate the total revenue for each customer using `SUM(quantity * netprice * exchangerate)`.  
        - Group by `customerkey` in the CTE to assign each customer to a single cohort year.  
   - 🔔 In the main query, use `SELECT * FROM cohort_analysis` to return all the results from the CTE.  

In [76]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT 
    *
FROM cohort_analysis
;

Unnamed: 0,cohort_year,customerkey,total_customer_net_revenue
0,2018,2044589,2470.73
1,2021,1603477,136.62
2,2017,876049,2601.13
3,2024,1469222,5278.54
4,2018,2089398,98.39
...,...,...,...
49482,2019,853617,903.31
49483,2016,1573639,6973.42
49484,2022,1355936,149.99
49485,2024,967453,5.40


3. Get the average LTV by each cohort.  
   - Define a CTE `cohort_analysis` to calculate the cohort year and total revenue for each customer.  
        - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Calculate the total revenue for each customer with `SUM(quantity * netprice * exchangerate)`.  
        - Group by `customerkey` to ensure total revenue and cohort year are assigned to each customer.  
   - In the main query: 
        - 🔔 Use `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to calculate the average revenue per customer for each cohort.  
        - 🔔 Select `cohort_year`, `customerkey`, `total_customer_net_revenue` (rename this to `customer_ltv`) and the average total revenue for output.  
        - 🔔 Use `ORDER BY cohort_year, customerkey` to sort the results by cohort and customer.  
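
The two-level shape of this step (aggregate per customer first, then average per cohort with a window) can be sketched on a few invented rows. The assumptions: Python's built-in `sqlite3` (SQLite 3.25+) replaces PostgreSQL, `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)`, and a hypothetical `revenue` column stands in for `quantity * netprice * exchangerate`.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2015-02-01", 100.0),
        (1, "2016-03-01", 200.0),  # later order still counts toward the 2015 cohort
        (2, "2015-07-01", 100.0),
        (3, "2016-01-15", 50.0),
    ],
)

rows = conn.execute("""
    WITH cohort_analysis AS (
        -- Level 1: one row per customer with cohort year and lifetime revenue
        SELECT CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
               customerkey,
               SUM(revenue) AS total_customer_net_revenue
        FROM sales
        GROUP BY customerkey
    )
    -- Level 2: average the per-customer totals within each cohort
    SELECT cohort_year,
           customerkey,
           total_customer_net_revenue AS customer_ltv,
           AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM cohort_analysis
    ORDER BY cohort_year, customerkey
""").fetchall()

for row in rows:
    print(row)
```

The 2015 cohort average is (300 + 100) / 2 = 200, an average of customer totals. Averaging the three underlying orders directly would give about 133.33, which is an order-level metric rather than LTV.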

In [77]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT 
    cohort_year, -- Updated
    customerkey, -- Updated
    total_customer_net_revenue AS customer_ltv, -- Updated
    AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv-- Added
FROM cohort_analysis
ORDER BY -- Added
    cohort_year,
    customerkey
LIMIT 20

Unnamed: 0,cohort_year,customerkey,customer_ltv,avg_cohort_ltv
0,2015,4376,182.0,5271.59
1,2015,4403,9530.35,5271.59
2,2015,4925,6078.08,5271.59
3,2015,5729,192.16,5271.59
4,2015,6048,1903.89,5271.59
5,2015,6705,13133.76,5271.59
6,2015,9440,208.01,5271.59
7,2015,10806,442.09,5271.59
8,2015,12116,9714.29,5271.59
9,2015,12973,253.06,5271.59


<img src="../Resources/images/3.2_customer_ltv.png" alt="Customer LTV & Average LTV" width="50%">

> ⚠️ **Chart Note**: This plots only 20 of our customers for a better visualization.

<img src="../Resources/images/3.2_cohort_ltv.png" alt="Cohort LTV" width="50%">

In [78]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),
yearly AS (
    SELECT
        cohort_year,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM cohort_analysis
)
SELECT DISTINCT *
FROM yearly
ORDER BY cohort_year

Unnamed: 0,cohort_year,avg_cohort_ltv
0,2015,5271.59
1,2016,5404.92
2,2017,5403.08
3,2018,4896.64
4,2019,4731.95
5,2020,3933.32
6,2021,3943.33
7,2022,3315.52
8,2023,2543.18
9,2024,2037.55


---
## Filtering Window Functions

### 📝 Notes

**Filtering Before Window Function**

- Use `WHERE` to filter rows before aggregation. 
- Syntax: 
    ```sql
    SELECT 
        column_name,
        window_function(column_to_aggregate) 
            OVER (PARTITION BY partition_column ORDER BY order_column) AS window_column_alias   
    FROM table_name
    WHERE condition; -- Filters data BEFORE applying window function
    ```
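
Because `WHERE` runs before the window function, it changes which rows the window sees. The sketch below illustrates this with a per-customer `MIN(orderdate)`; it assumes Python's built-in `sqlite3` (SQLite 3.25+) in place of PostgreSQL, `strftime('%Y', ...)` in place of `EXTRACT(YEAR FROM ...)`, and invented rows.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "2018-05-01"),
        (1, "2021-03-01"),
        (2, "2020-07-01"),
    ],
)

# {where} is filled with either "" or a WHERE clause to compare both variants.
template = """
SELECT DISTINCT
    customerkey,
    CAST(strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey)) AS INTEGER) AS first_year
FROM sales
{where}
ORDER BY customerkey;
"""
unfiltered = conn.execute(template.format(where="")).fetchall()
filtered = conn.execute(
    template.format(where="WHERE orderdate >= '2020-01-01'")
).fetchall()

print(unfiltered)  # [(1, 2018), (2, 2020)]
print(filtered)    # [(1, 2021), (2, 2020)]
```

Customer 1's first year moves from 2018 to 2021 once the 2018 order is filtered out, so a pre-filter can silently redefine a window-derived value.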

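**EXAMPLE:** Only look at orders from 2020 onward. Because `WHERE` runs first, `cohort_year` here is the year of each customer's first order on or after 2020-01-01, not necessarily their true first purchase year.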

In [79]:
%%sql

SELECT 
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM sales
WHERE orderdate >= '2020-01-01'  -- Filters BEFORE window function
LIMIT 10

Unnamed: 0,customerkey,cohort_year
0,15,2021
1,180,2023
2,180,2023
3,387,2021
4,387,2021
5,387,2021
6,387,2021
7,387,2021
8,406,2021
9,406,2021


**Filtering After Window Function**

- Use a CTE (or subquery) + `WHERE` to filter based on window function results.  
- Syntax: 
    ```sql
    WITH windowed_data AS (
        SELECT 
            column_name,
            window_function(column_to_aggregate) 
                OVER (PARTITION BY partition_column) AS window_column_alias
        FROM table_name
    )

    SELECT *
    FROM windowed_data
    WHERE window_column_alias condition; -- Filters data AFTER window function
    ```
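
A small sketch of the filter-after pattern follows. It also shows that the window result is repeated on every input row, so the filtered output has one row per order unless it is deduplicated. The assumptions: Python's built-in `sqlite3` (SQLite 3.25+) replaces PostgreSQL, the rows are invented, and a hypothetical `revenue` column stands in for `quantity * netprice * exchangerate`.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, 6000.0),
        (1, 7000.0),
        (2, 4000.0),
    ],
)

# {distinct} is filled with either "" or "DISTINCT" to compare both variants.
template = """
WITH cohort_revenue AS (
    SELECT {distinct}
        customerkey,
        SUM(revenue) OVER (PARTITION BY customerkey) AS total_customer_revenue
    FROM sales
)
SELECT *
FROM cohort_revenue
WHERE total_customer_revenue > 10000  -- filter on the window result
ORDER BY customerkey;
"""
per_order = conn.execute(template.format(distinct="")).fetchall()
per_customer = conn.execute(template.format(distinct="DISTINCT")).fetchall()

print(per_order)     # [(1, 13000.0), (1, 13000.0)]
print(per_customer)  # [(1, 13000.0)]
```

Customer 2 (total 4000) is removed by the outer `WHERE`; customer 1 appears once per order unless `DISTINCT` is added.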

**EXAMPLE:** Filter customers that have total revenue > $10,000

In [80]:
%%sql

WITH cohort_revenue AS (
    SELECT 
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) OVER (PARTITION BY customerkey) AS total_customer_revenue
    FROM sales
)
SELECT *
FROM cohort_revenue
WHERE total_customer_revenue > 10000 -- Filters AFTER window function
ORDER BY cohort_year;  

Unnamed: 0,customerkey,cohort_year,total_customer_revenue
0,1993137,2015,14861.60
1,1993137,2015,14861.60
2,1993137,2015,14861.60
3,365248,2015,18320.46
4,1330586,2015,21228.99
...,...,...,...
40402,480501,2024,16455.81
40403,480501,2024,16455.81
40404,494202,2024,20000.34
40405,494202,2024,20000.34


#### 💡 What about `QUALIFY`?  

- Some databases (**BigQuery, Snowflake**) support `QUALIFY` to filter directly on **window function results**.  
- **PostgreSQL does not support `QUALIFY`**, so we use a CTE or subquery with a `WHERE` clause instead.  

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Revenue Threshold: Minimum transaction value considered significant for analysis
  - High-Value Customer: Customer whose spending exceeds a defined threshold
  - Transaction Quality: Measure of how meaningful or significant a sale is
- **💡 Why It Matters**: 
  - Helps focus analysis on meaningful transactions
  - Removes noise from low-value or insignificant data points
  - Enables targeted analysis of specific customer segments
- **🎯 Common Use Cases**: Identifying premium customers and analyzing high-value purchases
- **📈 Related Metrics**: 
  - High-value transaction rate
  - Average transaction size after filtering

### 📈 Analysis

- Filter out low-value orders (< $500) before aggregating total revenue per user and finding the average lifetime value for each cohort.
- Keep only cohorts whose average lifetime value is high (`avg_cohort_ltv` > 5000).

#### Filter Low-Value Orders - Filter Revenue Before LTV Calculation

**`WHERE`**

1. Filter out line-item orders worth less than `$500`.
   - Define a CTE `cohort_analysis` to calculate the cohort year and total revenue for each customer.  
        - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Calculate the total revenue for each customer with `SUM(quantity * netprice * exchangerate)`.  
        - Group by `customerkey` to ensure total revenue and cohort year are assigned to each customer.  
        - 🔔 In a `WHERE` clause filter only sales where revenue is >= `500`.
   - In the main query:
        - Use `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to calculate the average revenue per customer for each cohort.  
        - Select `cohort_year`, `customerkey`, `total_customer_net_revenue`, and the cohort average (`avg_cohort_ltv`) for output.  
        - Use `ORDER BY cohort_year, customerkey` to sort the results by cohort and customer.   


In [81]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    WHERE (quantity * netprice * exchangerate) >= 500 -- Added
    GROUP BY 
        customerkey
)

SELECT 
    cohort_year, 
    customerkey, 
    total_customer_net_revenue,
    AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
FROM cohort_analysis
ORDER BY 
    cohort_year,
    customerkey
;

Unnamed: 0,cohort_year,customerkey,total_customer_net_revenue,avg_cohort_ltv
0,2015,4403,9435.54,6265.91
1,2015,4925,6056.34,6265.91
2,2015,6048,729.86,6265.91
3,2015,6705,12907.10,6265.91
4,2015,12116,9714.29,6265.91
...,...,...,...,...
38784,2024,2090359,4905.81,2745.37
38785,2024,2092135,1758.24,2745.37
38786,2024,2096470,535.78,2745.37
38787,2024,2096509,817.75,2745.37


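> NOTE: Removing low-value orders (< $500) raised `avg_cohort_ltv` for the 2015 cohort by about $1,000 (from 5,271.59 to 6,265.91). Customers with no order of at least $500 drop out of the result entirely (38,788 rows versus 49,486 before).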
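
The effect of the order-level filter can be reproduced on a toy table. This sketch assumes Python's built-in `sqlite3` (SQLite 3.25+) in place of PostgreSQL, `strftime('%Y', ...)` in place of `EXTRACT(YEAR FROM ...)`, invented rows, and a hypothetical `revenue` column standing in for `quantity * netprice * exchangerate`.

```python
import sqlite3

# Toy data invented for illustration; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2015-02-01", 600.0),
        (1, "2015-08-01", 100.0),  # low-value order
        (2, "2015-05-01", 200.0),  # customer 2 has only a low-value order
    ],
)

# {where} is filled with either "" or the order-level filter.
template = """
WITH cohort_analysis AS (
    SELECT CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
           customerkey,
           SUM(revenue) AS total_customer_net_revenue
    FROM sales
    {where}
    GROUP BY customerkey
)
SELECT cohort_year,
       customerkey,
       total_customer_net_revenue,
       AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
FROM cohort_analysis
ORDER BY cohort_year, customerkey;
"""
all_orders = conn.execute(template.format(where="")).fetchall()
high_value_only = conn.execute(
    template.format(where="WHERE revenue >= 500")
).fetchall()

print(all_orders)       # [(2015, 1, 700.0, 450.0), (2015, 2, 200.0, 450.0)]
print(high_value_only)  # [(2015, 1, 600.0, 600.0)]
```

Two things happen at once: customer 1's LTV shrinks from 700 to 600, and customer 2 disappears, so the cohort average rises from 450 to 600 mostly because the low spender is no longer in the denominator.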

#### High-Value Users in Cohort Analysis

**`WHERE`**

1. Create a new CTE that calculates the `avg_cohort_ltv` and in the main query only return results where the `avg_cohort_ltv` > 5000.
   - Define a CTE (`cohort_analysis`) to calculate the cohort year and total revenue for each customer.  
     - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.  
     - Calculate the total revenue for each customer with `SUM(quantity * netprice * exchangerate)`.  
     - Group by `customerkey` to ensure total revenue and cohort year are assigned to each customer.  
   - 🔔 Create another CTE (`cohort_summary`) to calculate `avg_cohort_ltv` using a window function.  
     - Use `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year)` to compute the average lifetime value per cohort.  
     - Retain `customerkey` to allow for further analysis at the user level.
     - Also return the `total_customer_net_revenue` and rename to `customer_ltv`.
   - Filter results in the main query.
     - Select `cohort_year`, `customerkey`, `customer_ltv` and `avg_cohort_ltv`.  
     - 🔔 Use `WHERE avg_cohort_ltv > 5000` to keep only customers in cohorts whose average LTV is above the threshold.  
     - Sort results with `ORDER BY cohort_year, customerkey` for clarity.  

In [82]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

-- Added 
cohort_summary AS (
    SELECT 
        cohort_year, 
        customerkey, 
        total_customer_net_revenue AS customer_ltv,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM cohort_analysis
)

-- Added 
SELECT DISTINCT
    cohort_year,
    avg_cohort_ltv
FROM cohort_summary
-- WHERE avg_cohort_ltv > 5000 
ORDER BY 
    cohort_year

Unnamed: 0,cohort_year,avg_cohort_ltv
0,2015,5271.59
1,2016,5404.92
2,2017,5403.08
3,2018,4896.64
4,2019,4731.95
5,2020,3933.32
6,2021,3943.33
7,2022,3315.52
8,2023,2543.18
9,2024,2037.55


In [83]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

-- Added 
cohort_summary AS (
    SELECT 
        cohort_year, 
        customerkey, 
        total_customer_net_revenue AS customer_ltv,
        AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM cohort_analysis
)

-- Added 
SELECT 
    cohort_year,
    customerkey,
    customer_ltv,
    avg_cohort_ltv
FROM cohort_summary
WHERE avg_cohort_ltv > 5000 
ORDER BY 
    cohort_year,
    customerkey

Unnamed: 0,cohort_year,customerkey,customer_ltv,avg_cohort_ltv
0,2015,4376,182.00,5271.59
1,2015,4403,9530.35,5271.59
2,2015,4925,6078.08,5271.59
3,2015,5729,192.16,5271.59
4,2015,6048,1903.89,5271.59
...,...,...,...,...
10285,2017,2096866,8208.79,5403.08
10286,2017,2096994,2149.18,5403.08
10287,2017,2098189,8276.54,5403.08
10288,2017,2098471,4243.49,5403.08
