<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/1_Syntax.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Window Functions Syntax

## Overview

### 🥅 Analysis Goals


Analyze revenue growth trends by cohort to understand how total sales (net revenue) evolve over time.

- **Total Net Revenue by Day**: Get the total net revenue by day and compare each sale transaction against it to find its percentage share of that day's revenue.
- **Cumulative Revenue by Cohort**: Group users into cohorts based on their first order year (called `cohort_year`) to analyze long-term revenue growth.

### 📘 Concepts Covered

- `SUM()`
- `OVER()` 
- `PARTITION BY`

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Syntax

### 📝 Notes

`window_function OVER (PARTITION BY)`

- **Why Use Window Functions?**
  - They let you perform calculations across a set of table rows related to the current row.
  - Unlike aggregate functions, they don't group the results into a single output row.
  - They allow you to easily partition and order data within the query, making them great for calculating things like running totals, ranks, or averages within partitions.

- **Syntax:**
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
        ) AS window_column_alias
    FROM table_name;
    ```

    - `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses such as `ORDER BY`; an empty `OVER()` treats the whole result set as one window.
    - `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition.

In [66]:
%%sql

SELECT 
    customerkey,
    orderkey,
    linenumber,
    (quantity * netprice * exchangerate) as net_revenue,
    AVG(quantity * netprice * exchangerate) OVER() as avg_net_revenue_all_orders,
    AVG(quantity * netprice * exchangerate) OVER(PARTITION BY customerkey) as avg_net_revenue_this_customer
FROM sales
ORDER BY customerkey
LIMIT 5;

Unnamed: 0,customerkey,orderkey,linenumber,net_revenue,avg_net_revenue_all_orders,avg_net_revenue_this_customer
0,15,2259001,0,2217.41,1032.69,2217.41
1,180,1305016,0,525.31,1032.69,836.74
2,180,3162018,0,71.36,1032.69,836.74
3,180,3162018,1,1913.55,1032.69,836.74
4,185,1613010,0,1395.52,1032.69,1395.52


**Why I Like Window Functions**
  - Window functions are great for calculating things like running totals, ranks, or averages within partitions.

> **NOTE:** This is a preview of what we'll cover in this chapter; it is shown here to demonstrate the **power of window functions**.

In [30]:
%%sql

SELECT 
    customerkey as customer,
    orderdate,  -- Added to make running totals more meaningful
    (quantity * netprice * exchangerate) as net_revenue,
    ROW_NUMBER() OVER(
        PARTITION BY customerkey 
        ORDER BY quantity * netprice * exchangerate DESC
    ) as order_rank,
    SUM(quantity * netprice * exchangerate) OVER(
        PARTITION BY customerkey 
        ORDER BY orderdate
    ) as customer_running_total,
    SUM(quantity * netprice * exchangerate) OVER(PARTITION BY customerkey) as customer_net_revenue,
    (quantity * netprice * exchangerate) / SUM(quantity * netprice * exchangerate) OVER(PARTITION BY customerkey) * 100 as pct_customer_revenue
FROM sales
ORDER BY customerkey, orderdate
LIMIT 10;

Unnamed: 0,customer,orderdate,net_revenue,order_rank,customer_running_total,customer_net_revenue,pct_customer_revenue
0,15,2021-03-08,2217.41,1,2217.41,2217.41,100.0
1,180,2018-07-28,525.31,2,525.31,2510.22,20.93
2,180,2023-08-28,1913.55,1,2510.22,2510.22,76.23
3,180,2023-08-28,71.36,3,2510.22,2510.22,2.84
4,185,2019-06-01,1395.52,1,1395.52,1395.52,100.0
5,243,2016-05-19,287.67,1,287.67,287.67,100.0
6,387,2018-12-21,619.77,3,2370.54,4655.84,13.31
7,387,2018-12-21,1608.1,1,2370.54,4655.84,34.54
8,387,2018-12-21,97.05,7,2370.54,4655.84,2.08
9,387,2018-12-21,45.62,8,2370.54,4655.84,0.98
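
The repeated `customer_running_total` values above (for example, both 2023-08-28 rows of customer 180) follow from the default window frame: when `ORDER BY` is present and no frame is given, rows that tie on the ordering value are peers and are summed together. Below is a minimal sketch of that behavior, assuming made-up numbers and Python's built-in `sqlite3` as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer). The table contents and the `running_total_*` aliases are illustrative, not the course data.

```python
import sqlite3

# Hypothetical toy data: one customer with two sales on the same date.
rows = [
    (180, "2018-07-28", 500),
    (180, "2023-08-28", 1900),
    (180, "2023-08-28", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, net_revenue INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

query = """
SELECT
    orderdate,
    net_revenue,
    -- Default frame: rows tied on orderdate are peers and share one total
    SUM(net_revenue) OVER (
        PARTITION BY customerkey
        ORDER BY orderdate
    ) AS running_total_default,
    -- Explicit ROWS frame with a tie-breaker: the total advances row by row
    SUM(net_revenue) OVER (
        PARTITION BY customerkey
        ORDER BY orderdate, net_revenue DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total_rows
FROM sales
ORDER BY orderdate, net_revenue DESC
"""
running_totals = conn.execute(query).fetchall()
for row in running_totals:
    print(row)
```

If a strictly row-by-row running total is wanted, an explicit `ROWS` frame plus a tie-breaking sort column is one way to get it.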


### 🔑 Key Concepts
- **📊 Business Terms**: Window Function (row-level calculation), Partition (group of related rows), Revenue Share
- **💡 Why It Matters**: Enables detailed analysis while maintaining transaction-level granularity
- **🎯 Common Use Cases**: 
  - Market share calculations
  - Revenue distribution analysis
  - Performance comparisons within groups
- **📈 Related KPIs**: Market share %, revenue distribution

### 📈 Analysis

- Calculate the total net revenue and total net revenue by day.

#### Calculate Total Net Revenue

**`SUM`**, **`OVER`**

1. Get the `orderdate` and calculate the total net revenue for the `sales` table.
    - Select `orderdate` so each row shows the date of its order (nothing is grouped; every row of `sales` is kept).
    - Use a `SUM` window function with `OVER()` to compute `total_net_revenue` for all orders in the `sales` table.
    - Order the results by `orderdate` to present the data chronologically.

In [4]:
%%sql

SELECT
    orderdate,
    SUM(quantity * netprice * exchangerate) OVER() AS total_net_revenue
FROM 
    sales
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_net_revenue
0,2015-01-01,206407538.58
1,2015-01-01,206407538.58
2,2015-01-01,206407538.58
3,2015-01-01,206407538.58
4,2015-01-01,206407538.58
...,...,...
199868,2024-04-20,206407538.58
199869,2024-04-20,206407538.58
199870,2024-04-20,206407538.58
199871,2024-04-20,206407538.58


#### Calculate Total Net Revenue by Day

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Get the `orderdate` and calculate net revenue (for each order) and total net revenue (by each day) from the `sales` table.
    - Select `orderdate` so each row shows the date of its order; the window function uses it to define the daily partitions.
    - 🔔 Calculate `net_revenue` by multiplying `quantity`, `netprice`, and `exchangerate` to obtain the revenue for each individual sale.
    - Use a `SUM` window function with `OVER(PARTITION BY orderdate)` to compute `daily_net_revenue` for all orders sharing the same `orderdate`.
    - Order the results by `orderdate` to present the data chronologically.
    - Calculate the percentage of total revenue share for each order by dividing the `net_revenue` by the `daily_net_revenue` and multiplying by 100.

In [62]:
%%sql

SELECT
    orderdate,
    orderkey * 10 + linenumber AS order_line_item,
    (quantity * netprice * exchangerate) AS net_revenue,
    SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS daily_net_revenue, -- Added
    100 * (quantity * netprice * exchangerate) / SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS pct_daily_revenue
FROM 
    sales
ORDER BY
    orderdate,
    pct_daily_revenue DESC
LIMIT 10

Unnamed: 0,orderdate,order_line_item,net_revenue,daily_net_revenue,pct_daily_revenue
0,2015-01-01,10043,2395.1,11640.8,20.58
1,2015-01-01,10061,1552.32,11640.8,13.34
2,2015-01-01,10022,1302.91,11640.8,11.19
3,2015-01-01,10020,1146.75,11640.8,9.85
4,2015-01-01,10050,975.16,11640.8,8.38
5,2015-01-01,10021,950.25,11640.8,8.16
6,2015-01-01,10041,578.52,11640.8,4.97
7,2015-01-01,10081,574.05,11640.8,4.93
8,2015-01-01,10001,423.28,11640.8,3.64
9,2015-01-01,10040,263.11,11640.8,2.26


<img src="../Resources/images/3.1_daily_revenue.png" alt="Daily Revenue Share" width="50%">

### 💡 Why not use GROUP BY instead? 

- Window functions are good when you need both row-level information and aggregated values.
- **Limitation of `GROUP BY`:** Grouping by `orderdate` can tell you the net revenue per order date, but it aggregates at the order date level, so you lose individual order details. 
- Adding window functions lets us make calculations like the percentage of revenue share for each order (what we'll be doing next).
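
To make the contrast concrete, here is a small sketch that runs both approaches on the same rows. It assumes three made-up sales lines and uses Python's built-in `sqlite3` in place of PostgreSQL (SQLite 3.25 or newer for window functions); the values are hypothetical and only the column names mirror the queries above.

```python
import sqlite3

# Hypothetical toy data: (orderdate, order_line_item, net_revenue)
rows = [
    ("2015-01-01", 10043, 60),
    ("2015-01-01", 10061, 40),
    ("2015-01-02", 20010, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (orderdate TEXT, order_line_item INTEGER, net_revenue INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# GROUP BY: one row per date, individual order lines are gone
grouped = conn.execute("""
SELECT orderdate, SUM(net_revenue) AS daily_net_revenue
FROM sales
GROUP BY orderdate
ORDER BY orderdate
""").fetchall()

# Window function: every order line is kept, with the daily total alongside
windowed = conn.execute("""
SELECT
    orderdate,
    order_line_item,
    net_revenue,
    SUM(net_revenue) OVER (PARTITION BY orderdate) AS daily_net_revenue,
    100.0 * net_revenue / SUM(net_revenue) OVER (PARTITION BY orderdate) AS pct_daily_revenue
FROM sales
ORDER BY orderdate, pct_daily_revenue DESC
""").fetchall()

print(len(grouped), "grouped rows;", len(windowed), "windowed rows")
```

The grouped result has two rows (one per date), while the windowed result still has all three order lines, each carrying its share of the day's revenue.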

---
## Subqueries Review

**Subqueries**

- **Subquery**: a query nested inside another query. 
    - Subqueries let you perform complex queries by using the result of one query as input for another. 
    - It can be used in clauses like `SELECT`, `FROM`, `WHERE`, and `HAVING`.

**Syntax**:
- In `SELECT` clause
    - ```sql
      SELECT 
        column1, 
        column2, 
        (SELECT single_value_expression FROM table_name WHERE condition) AS alias_name
      FROM main_table
      WHERE condition;
      ```
- In `WHERE` clause 
    - ```sql
      SELECT column1, column2
      FROM table_name
      WHERE column_name operator (SELECT column_name FROM table_name WHERE condition);
      ```
- In `FROM` clause
    - ```sql
      SELECT alias_name.column1, alias_name.column2
      FROM (
        SELECT column1, column2 
        FROM table_name 
        WHERE condition
      ) AS alias_name
      WHERE condition
      ```
- There are more ways to use subqueries, such as with `EXISTS`, `NOT EXISTS`, correlated subqueries, and in `HAVING`, but these are the most common.
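
The three placements above can be tried on a tiny table. This is a sketch under stated assumptions: the rows are invented, the `daily` alias is hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical toy data: (orderdate, net_revenue)
rows = [("2015-01-01", 60), ("2015-01-01", 40), ("2015-01-02", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, net_revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# SELECT clause: a scalar subquery attaches the grand total to every row
in_select = conn.execute("""
SELECT
    orderdate,
    net_revenue,
    (SELECT SUM(net_revenue) FROM sales) AS total_net_revenue
FROM sales
ORDER BY orderdate, net_revenue DESC
""").fetchall()

# WHERE clause: keep rows above the overall average (50.0 for this data)
in_where = conn.execute("""
SELECT orderdate, net_revenue
FROM sales
WHERE net_revenue > (SELECT AVG(net_revenue) FROM sales)
""").fetchall()

# FROM clause: aggregate first, then filter the derived table
in_from = conn.execute("""
SELECT daily.orderdate, daily.daily_net_revenue
FROM (
    SELECT orderdate, SUM(net_revenue) AS daily_net_revenue
    FROM sales
    GROUP BY orderdate
) AS daily
WHERE daily.daily_net_revenue > 60
""").fetchall()

print(in_select, in_where, in_from, sep="\n")
```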

### 📈 Analysis

- Get the average net revenue by month (this is for reviewing subqueries).
- Calculate for each transaction the percentage of total revenue share.

#### Average Net Revenue by Month

**Subquery**

1. Using a subquery, get the average net revenue by day (`orderdate`).  
   - Create a subquery `revenue_by_day` to calculate the net revenue for each sale using `(quantity * netprice * exchangerate)`.  
        - Include `orderdate` in the subquery to associate each sale with its corresponding date.  
   - In the main query, calculate the average net revenue per day using `AVG(net_revenue)`.  
        - Group the results by `orderdate` to compute the average for each unique day.  
        - Use `ORDER BY orderdate` to display the results in chronological order.  

In [6]:
%%sql

SELECT 
    orderdate,
    AVG(net_revenue) AS avg_net_revenue
FROM (
    SELECT orderdate, (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
) AS revenue_by_day
GROUP BY orderdate
ORDER BY orderdate;

Unnamed: 0,orderdate,avg_net_revenue
0,2015-01-01,465.63
1,2015-01-02,736.30
2,2015-01-03,942.70
3,2015-01-05,1240.63
4,2015-01-06,862.49
...,...,...
3289,2024-04-16,784.34
3290,2024-04-17,539.98
3291,2024-04-18,498.40
3292,2024-04-19,967.74


2. Get the average net revenue by month using `TO_CHAR`.  
   - Create a subquery `revenue_by_day` to calculate the net revenue for each sale using `(quantity * netprice * exchangerate)`.  
        - Include `orderdate` in the subquery to associate each sale with its corresponding date.  
   - In the main query, calculate the average net revenue per month using `AVG(net_revenue)`.  
        - 🔔 Get the month from `orderdate` using `TO_CHAR`.
        - Group the results by `year_month` to compute the average for each month.  
        - Use `ORDER BY year_month` to display the results in chronological order.  

In [18]:
%%sql

SELECT 
    TO_CHAR(orderdate, 'YYYY-MM') AS year_month,
    AVG(net_revenue) AS avg_net_revenue
FROM (
    SELECT orderdate, (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
) AS revenue_by_day
GROUP BY year_month
ORDER BY year_month;

Unnamed: 0,year_month,avg_net_revenue
0,2015-01,791.94
1,2015-02,1051.15
2,2015-03,1074.07
3,2015-04,918.67
4,2015-05,1055.06
...,...,...
107,2023-12,819.63
108,2024-01,834.37
109,2024-02,849.48
110,2024-03,785.55


<img src="../Resources/images/3.1_monthly_avg_rev.png" alt="Avg Net Revenue" width="50%">

#### Percentage of Total Revenue Share

**Subquery**

1. Calculate for each transaction the percentage of total revenue share.
    - 🔔 Select all columns (`orderdate`, `order_line_item`, `net_revenue`, and `daily_net_revenue`) from the subquery.  
    - 🔔 Calculate `pct_daily_revenue` by dividing `net_revenue` by `daily_net_revenue` and multiplying by 100 for each row.  
    - Use a subquery that calculates `net_revenue` as the product of `quantity`, `netprice`, and `exchangerate`.  
        - Use a window function (`SUM` with `PARTITION BY orderdate`) in the subquery to compute `daily_net_revenue` for each `orderdate`.  
    - Order the final results by `orderdate`, then by `pct_daily_revenue` in descending order.  

In [65]:
%%sql

SELECT
    *,
    100 * net_revenue / daily_net_revenue AS pct_daily_revenue
FROM 
    -- Use query from previous section (Calculate Total Net Revenue by Day) as a subquery
    (
    SELECT
        orderdate,
        orderkey * 10 + linenumber AS order_line_item,
        (quantity * netprice * exchangerate) AS net_revenue,
        SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS daily_net_revenue
    FROM
        sales
    ) AS revenue_by_day
ORDER BY
    orderdate,
    pct_daily_revenue DESC
LIMIT 10

Unnamed: 0,orderdate,order_line_item,net_revenue,daily_net_revenue,pct_daily_revenue
0,2015-01-01,10043,2395.1,11640.8,20.58
1,2015-01-01,10061,1552.32,11640.8,13.34
2,2015-01-01,10022,1302.91,11640.8,11.19
3,2015-01-01,10020,1146.75,11640.8,9.85
4,2015-01-01,10050,975.16,11640.8,8.38
5,2015-01-01,10021,950.25,11640.8,8.16
6,2015-01-01,10041,578.52,11640.8,4.97
7,2015-01-01,10081,574.05,11640.8,4.93
8,2015-01-01,10001,423.28,11640.8,3.64
9,2015-01-01,10040,263.11,11640.8,2.26


<img src="../Resources/images/3.1_daily_revenue.png" alt="Daily Revenue Share" width="50%">

---
## SUM

### 📝 Notes

`SUM`

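- **SUM**: Adds up the values of an expression. Used with `OVER()`, it returns the total for each row's window instead of collapsing the rows.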
- Syntax: 
  ```sql
    SELECT
      SUM(expression) OVER(
          PARTITION BY partition_expression
      ) AS window_column_alias
    FROM table_name;
  ```

### 🔑 Key Concepts

- **📊 Business Terms**: 
  - Cumulative Revenue: Running total of sales over time
  - Cohort Revenue: Total revenue by customer group
- **💡 Why It Matters**: Reveals revenue patterns and cohort performance
- **🎯 Common Use Cases**: Tracking growth and cohort analysis
- **📈 Related KPIs**: Growth rate, revenue run rate, performance trends  

#### 📕 Definitions

- **Cohort analysis**: Examines the behavior of specific groups over time.  
- **Cohort**: A group of people or items sharing a common characteristic. 

###  📈 Analysis

- Group users into cohorts based on their first order year (called `cohort_year`) and calculate the cumulative and monthly net revenue.
> ⚠️ **Data Note**: Customer table contains `startdt` field but this will not be used since historical data (1980-2010) is not available in our dataset

In [9]:
%%sql 

SELECT 
    customerkey,
    orderdate,
    COUNT(*) OVER() AS total_rows  -- empty OVER(): counts every row in sales, not rows per customer
FROM sales
ORDER BY customerkey, orderdate
;

Unnamed: 0,customerkey,orderdate,total_rows
0,15,2021-03-08,199873
1,180,2018-07-28,199873
2,180,2023-08-28,199873
3,180,2023-08-28,199873
4,185,2019-06-01,199873
...,...,...,...
199868,2099711,2016-08-13,199873
199869,2099711,2017-08-14,199873
199870,2099743,2022-03-17,199873
199871,2099743,2022-03-17,199873
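
Every row above shows the same count because an empty `OVER()` treats the whole table as one window. A per-customer count would need `PARTITION BY customerkey`. Here is a minimal sketch of the difference, assuming four made-up rows and Python's built-in `sqlite3` as a stand-in for PostgreSQL; the `rows_per_customer` alias is hypothetical.

```python
import sqlite3

# Hypothetical toy data: (customerkey, orderdate)
rows = [
    (15, "2021-03-08"),
    (180, "2018-07-28"),
    (180, "2023-08-28"),
    (180, "2023-08-28"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

counts = conn.execute("""
SELECT
    customerkey,
    orderdate,
    COUNT(*) OVER ()                         AS total_rows,        -- whole table
    COUNT(*) OVER (PARTITION BY customerkey) AS rows_per_customer  -- one window per customer
FROM sales
ORDER BY customerkey, orderdate
""").fetchall()

for row in counts:
    print(row)
```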


#### Calculate Cumulative Revenue by Cohort

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Get the cohorts by year from the `orderdate` and calculate `net_revenue`.
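    - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))`. Because the query groups by `orderdate`, `MIN(orderdate)` is that group's own date, so `cohort_year` here is the order year rather than a customer's first order year.  
    - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
    - Group the results by `orderdate` and `net_revenue` using `GROUP BY`; rows that share both values collapse into one.  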
    - Sort the output by `cohort_year` and `orderdate` using `ORDER BY`.  
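
What `MIN(orderdate)` returns depends on the grouping. The sketch below compares grouping by `orderdate` (as in this step) with grouping by `customerkey`, which yields a first order year per customer. It assumes three invented sales rows and Python's built-in `sqlite3`; SQLite's `strftime('%Y', ...)` is used here as a stand-in for PostgreSQL's `EXTRACT(YEAR FROM ...)`.

```python
import sqlite3

# Hypothetical toy data: (customerkey, orderdate, net_revenue)
rows = [
    (1, "2015-03-01", 100),
    (1, "2016-05-10", 50),
    (2, "2016-01-15", 80),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, net_revenue INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Grouped by orderdate: MIN(orderdate) is just the group's own date
by_orderdate = conn.execute("""
SELECT CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year, orderdate
FROM sales
GROUP BY orderdate, net_revenue
ORDER BY orderdate
""").fetchall()

# Grouped by customerkey: MIN(orderdate) is each customer's first order
by_customer = conn.execute("""
SELECT customerkey, CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year
FROM sales
GROUP BY customerkey
ORDER BY customerkey
""").fetchall()

print(by_orderdate)
print(by_customer)
```

In this toy data, customer 1's 2016 order is labeled 2016 by the first query, even though that customer first ordered in 2015.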

In [10]:
%%sql 

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    orderdate,
    net_revenue
ORDER BY  
    cohort_year,
    orderdate

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,950.25
1,2015,2015-01-01,108.75
2,2015,2015-01-01,1146.75
3,2015,2015-01-01,37.51
4,2015,2015-01-01,63.49
...,...,...,...
199080,2024,2024-04-20,1871.37
199081,2024,2024-04-20,187.02
199082,2024,2024-04-20,40.46
199083,2024,2024-04-20,8.35


2. Put the query into a subquery.
    - Use a subquery to encapsulate the logic for calculating `cohort_year` and `net_revenue`.
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to find the earliest year for each `orderdate` group.
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.
        - Group the data by `orderdate` and `net_revenue` using `GROUP BY`.
    - 🔔 Outer query: 
        - Select `cohort_year`, `orderdate`, and `net_revenue` from the subquery in the main query.
        - Move the `ORDER BY cohort_year, orderdate` to the main query to ensure sorting is applied to the final result.
    - **Why use a Subquery:** To clean and prepare the raw data first, avoiding unnecessary grouping by `quantity`, `netprice`, and `exchangerate`, and to keep the query modular and easy to read.

In [11]:
%%sql

SELECT 
    cohort_year,
    orderdate,
    net_revenue
FROM ( --previous query as a subquery
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
) AS revenue_by_day
ORDER BY  -- Move order by here
    cohort_year,
    orderdate;

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,950.25
1,2015,2015-01-01,108.75
2,2015,2015-01-01,1146.75
3,2015,2015-01-01,37.51
4,2015,2015-01-01,63.49
...,...,...,...
199080,2024,2024-04-20,1871.37
199081,2024,2024-04-20,187.02
199082,2024,2024-04-20,40.46
199083,2024,2024-04-20,8.35


3. Calculate the monthly revenue.
    - Use a subquery to calculate `cohort_year` and `net_revenue` for each transaction.  
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to identify the earliest year for each `orderdate` group.  
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
        - Group the results in the subquery by `orderdate` and `net_revenue` to ensure distinct combinations of these values.  
    - In the outer query, select `cohort_year` and `orderdate` from the subquery for further analysis.  
        - 🔔 Calculate `monthly_revenue` using `SUM(net_revenue)` to aggregate total revenue for each month. 
        - 🔔 Get the order year month from `orderdate` using `TO_CHAR`.
        - Group by `cohort_year` and `year_month` to finalize the monthly revenue calculation.  
        - Sort the results by `cohort_year` and `year_month` using `ORDER BY`.  

In [19]:
%%sql

SELECT 
    cohort_year,
    TO_CHAR(orderdate, 'YYYY-MM') AS year_month, -- Update
    SUM(net_revenue) AS monthly_revenue -- Added
FROM (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    ) AS revenue_by_day
GROUP BY 
    cohort_year, 
    year_month
ORDER BY 
    cohort_year, 
    year_month;


Unnamed: 0,cohort_year,year_month,monthly_revenue
0,2015,2015-01,383920.37
1,2015,2015-02,705828.98
2,2015,2015-03,330943.44
3,2015,2015-04,160767.00
4,2015,2015-05,548252.85
...,...,...,...
107,2023,2023-12,2912914.47
108,2024,2024-01,2672785.66
109,2024,2024-02,3537863.87
110,2024,2024-03,1687688.64


4. Calculate the cumulative revenue by cohort using a window function.
    - Use a subquery to calculate `cohort_year` and `net_revenue` for each transaction.  
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to identify the earliest year for each `orderdate` group.  
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
        - Group the results in the subquery by `orderdate` and `net_revenue` to ensure distinct combinations of these values.   
    - In the outer query, select `cohort_year` and `orderdate` from the subquery for further analysis.  
        - Calculate `monthly_revenue` using `SUM(net_revenue)` to aggregate total revenue for each month.  
        - 🔔 Use a window function to calculate `cumulative_revenue` as `SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY TO_CHAR(orderdate, 'YYYY-MM'))`, summing the monthly revenue progressively within each cohort year.  
        - Get the order year month from `orderdate` using `TO_CHAR`.
        - Sort the results by `cohort_year` and `year_month` using `ORDER BY`.  
        - **Why nest `SUM()` inside the window function:** The inner `SUM(net_revenue)` is a regular aggregate that `GROUP BY` evaluates first, producing one monthly revenue value per `cohort_year` and `year_month`. The outer `SUM(...) OVER(...)` is the window function; it runs over those grouped rows and adds the monthly values in `year_month` order within each `cohort_year`. A plain `SUM(net_revenue) OVER(...)` would fail in this grouped query because `net_revenue` is neither grouped nor aggregated.  
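
The nested `SUM(SUM(...)) OVER(...)` pattern can be checked on a handful of rows. This is a minimal sketch, assuming made-up values and Python's built-in `sqlite3` as a stand-in for PostgreSQL (SQLite 3.25 or newer). To keep it short, the hypothetical `revenue_by_day` table already stores `cohort_year` and `year_month` as columns instead of deriving them.

```python
import sqlite3

# Hypothetical toy data: (cohort_year, year_month, net_revenue)
rows = [
    (2015, "2015-01", 100),
    (2015, "2015-01", 50),
    (2015, "2015-02", 30),
    (2016, "2016-01", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_by_day (cohort_year INTEGER, year_month TEXT, net_revenue INTEGER)"
)
conn.executemany("INSERT INTO revenue_by_day VALUES (?, ?, ?)", rows)

cumulative = conn.execute("""
SELECT
    cohort_year,
    year_month,
    SUM(net_revenue) AS monthly_revenue,        -- aggregate: one value per group
    SUM(SUM(net_revenue)) OVER (                -- window: running total of those values
        PARTITION BY cohort_year
        ORDER BY year_month
    ) AS cumulative_revenue
FROM revenue_by_day
GROUP BY cohort_year, year_month
ORDER BY cohort_year, year_month
""").fetchall()

for row in cumulative:
    print(row)
```

The running total restarts at the first month of each `cohort_year`, which matches the reset visible between the 2023 and 2024 rows in the notebook output.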

In [20]:
%%sql

SELECT 
    cohort_year,
    TO_CHAR(orderdate, 'YYYY-MM') AS year_month,
    SUM(net_revenue) AS monthly_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY TO_CHAR(orderdate, 'YYYY-MM')) AS cumulative_revenue -- Added
FROM  (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
) AS revenue_by_day
GROUP BY 
    cohort_year, 
    year_month
ORDER BY 
    cohort_year, 
    year_month
;

Unnamed: 0,cohort_year,year_month,monthly_revenue,cumulative_revenue
0,2015,2015-01,383920.37,383920.37
1,2015,2015-02,705828.98,1089749.35
2,2015,2015-03,330943.44,1420692.78
3,2015,2015-04,160767.00,1581459.78
4,2015,2015-05,548252.85,2129712.63
...,...,...,...,...
107,2023,2023-12,2912914.47,33021117.03
108,2024,2024-01,2672785.66,2672785.66
109,2024,2024-02,3537863.87,6210649.54
110,2024,2024-03,1687688.64,7898338.17


<img src="../Resources/images/3.1_monthly_cumulative_rev.png" alt="Processing & Revenue" width="50%">

> ⚠️ **Chart Note**: This only plots the 2023 `cohort_year`.

In [69]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        customerkey,
        MIN(orderdate) AS first_order_date,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year
    FROM sales
    GROUP BY customerkey
),
yearly_totals AS (
    -- Calculate total revenue for each year across all cohorts
    SELECT 
        EXTRACT(YEAR FROM orderdate) AS revenue_year,
        SUM(quantity * netprice * exchangerate) AS total_year_revenue
    FROM sales
    GROUP BY EXTRACT(YEAR FROM orderdate)
)

SELECT 
    c.cohort_year,
    EXTRACT(YEAR FROM s.orderdate) AS revenue_year,
    SUM(s.quantity * s.netprice * s.exchangerate) AS yearly_revenue,
    -- Running total by cohort
    SUM(SUM(s.quantity * s.netprice * s.exchangerate)) 
        OVER (PARTITION BY c.cohort_year ORDER BY EXTRACT(YEAR FROM s.orderdate)) AS cohort_revenue,
    -- Calculate percentage using the yearly_totals CTE
    100.0 * SUM(s.quantity * s.netprice * s.exchangerate) / yt.total_year_revenue AS pct_of_yearly_revenue
FROM sales s
JOIN cohort_analysis c ON s.customerkey = c.customerkey
JOIN yearly_totals yt ON EXTRACT(YEAR FROM s.orderdate) = yt.revenue_year
GROUP BY 
    c.cohort_year, 
    EXTRACT(YEAR FROM s.orderdate),  -- group by the expression; a bare `revenue_year` resolves to the input column yt.revenue_year
    yt.total_year_revenue
ORDER BY c.cohort_year, revenue_year;

In [71]:
%%sql

WITH first_purchase AS (
    SELECT 
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
    FROM sales
)
SELECT 
    fp.cohort_year,
    EXTRACT(YEAR FROM s.orderdate) AS purchase_year,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_revenue
FROM sales s
JOIN first_purchase fp ON s.customerkey = fp.customerkey
GROUP BY 
    fp.cohort_year,
    EXTRACT(YEAR FROM s.orderdate)
ORDER BY 
    cohort_year, 
    purchase_year;

Unnamed: 0,cohort_year,purchase_year,total_revenue
0,2015,2015,44488293.79
1,2015,2016,3830704.79
2,2015,2017,4529453.66
3,2015,2018,10005638.74
4,2015,2019,11190652.79
5,2015,2020,3819952.99
6,2015,2021,7806913.46
7,2015,2022,14027358.57
8,2015,2023,10554136.42
9,2015,2024,3042643.53


<img src="../Resources/images/3.1_cohort_year_rev.png" alt="Processing & Revenue" width="50%">