<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/1_Syntax.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Window Functions Syntax

## Overview

### 🥅 Analysis Goals


Analyze revenue growth trends by cohort to understand how total sales (net revenue) evolve over time.

- **Total Net Revenue by Day**: Get the total net revenue for each day alongside every sales transaction to find each transaction's share of that day's revenue.
- **Cumulative Revenue by Cohort**: Group users into cohorts based on their first order year (`cohort_year`) to analyze long-term revenue growth.

### 📘 Concepts Covered

Basic syntax: 
- `SUM()`
- `OVER()` 
- `PARTITION BY`

### 📕 Definitions

- **Cohort analysis** - Examines the behavior of specific groups over time.  
- **Cohort** - A group of people or items sharing a common characteristic.  
- **Time series** - Data tracked in sequence over time.  
- **Retention** - Keeping users, customers, or items over time.  


In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql when installed above)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## Syntax

### 📝 Notes

`window_function OVER (PARTITION BY)`

- **Why Use Window Functions?**
  - They let you perform calculations across a set of table rows related to the current row.
  - Unlike aggregate functions, they don't group the results into a single output row.
  - They allow you to easily partition and order data within the query, making them great for calculating things like running totals, ranks, or averages within partitions.


- **Syntax:**
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
        ) AS window_column_alias
    FROM table_name;
    ```

    - `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses, such as `ORDER BY`.
    - `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition.
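
To make the contrast with a plain aggregate concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than the course's PostgreSQL database. The `sales_demo` table and its values are made up for illustration, and the example assumes the bundled SQLite is version 3.25 or later (the first release with window functions).

```python
import sqlite3

# In-memory toy table; `sales_demo` is hypothetical, not the Contoso schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [("2015-01-01", 100.0), ("2015-01-01", 300.0), ("2015-01-02", 50.0)],
)

# GROUP BY collapses each day into a single output row.
grouped = conn.execute(
    """
    SELECT orderdate, SUM(net_revenue)
    FROM sales_demo
    GROUP BY orderdate
    ORDER BY orderdate
    """
).fetchall()

# The window version keeps every row and attaches the day's total to it.
windowed = conn.execute(
    """
    SELECT
        orderdate,
        net_revenue,
        SUM(net_revenue) OVER (PARTITION BY orderdate) AS total_net_revenue
    FROM sales_demo
    ORDER BY orderdate, net_revenue
    """
).fetchall()

print(grouped)   # one row per day
print(windowed)  # one row per transaction, each carrying its day's total
```

The grouped query returns two rows, while the windowed query returns all three transactions.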

### 💻 Final Result

- Show how revenue grows over time so the business can track the long-term contribution of each cohort.
  - Extract `cohort_year` from `orderdate` to group users by the year they first ordered.
  - Calculate daily revenue for each cohort.
  - Use a window function to compute the cumulative revenue for each cohort, ordered by order date.

#### Calculate Total Net Revenue by Day

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Calculate `net_revenue` for each transaction, then use `SUM() OVER (PARTITION BY orderdate)` to attach the day's total net revenue to every row.

In [20]:
%%sql

SELECT
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue,
    SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS total_net_revenue
FROM 
    sales
ORDER BY
    orderdate

Unnamed: 0,orderdate,net_revenue,total_net_revenue
0,2015-01-01,63.49,11640.80
1,2015-01-01,423.28,11640.80
2,2015-01-01,108.75,11640.80
3,2015-01-01,1146.75,11640.80
4,2015-01-01,950.25,11640.80
...,...,...,...
199868,2024-04-20,914.61,96879.43
199869,2024-04-20,150.18,96879.43
199870,2024-04-20,147.78,96879.43
199871,2024-04-20,2019.62,96879.43


2. Calculate each transaction's share of the day's total revenue by dividing `net_revenue` by `total_net_revenue`.

In [21]:
%%sql

SELECT
    orderdate,
    net_revenue,
    total_net_revenue,
    net_revenue / total_net_revenue AS revenue_share
FROM 
    (
    SELECT
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue,
        SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS total_net_revenue
    FROM
        sales
    ) AS daily_totals  -- a subquery alias is required before PostgreSQL 16
ORDER BY
    orderdate

Unnamed: 0,orderdate,net_revenue,total_net_revenue,revenue_share
0,2015-01-01,63.49,11640.80,0.01
1,2015-01-01,423.28,11640.80,0.04
2,2015-01-01,108.75,11640.80,0.01
3,2015-01-01,1146.75,11640.80,0.10
4,2015-01-01,950.25,11640.80,0.08
...,...,...,...,...
199868,2024-04-20,914.61,96879.43,0.01
199869,2024-04-20,150.18,96879.43,0.00
199870,2024-04-20,147.78,96879.43,0.00
199871,2024-04-20,2019.62,96879.43,0.02
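
A window function can also sit inside an arithmetic expression, so the share can be computed without a subquery. The following is a minimal sketch, again using `sqlite3` with a hypothetical `sales_demo` table and made-up values rather than the Contoso data.

```python
import sqlite3

# In-memory toy table; names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?, ?)",
    [("2015-01-01", 25.0), ("2015-01-01", 75.0), ("2015-01-02", 40.0)],
)

# Divide each row's revenue by the window total for its day.
shares = conn.execute(
    """
    SELECT
        orderdate,
        net_revenue,
        net_revenue * 100.0
            / SUM(net_revenue) OVER (PARTITION BY orderdate) AS revenue_share_pct
    FROM sales_demo
    ORDER BY orderdate, net_revenue
    """
).fetchall()

for row in shares:
    print(row)
```

Multiplying by 100 expresses the fraction as a percentage; dropping that factor gives the same kind of value as the `revenue_share` column.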


---
## SUM

### 📝 Notes

- `SUM()`: Returns the sum of the values in each window partition.

```sql
SELECT
    SUM(expression) OVER (
        PARTITION BY partition_expression
    ) AS window_column_alias
FROM table_name;
```

### 💻 Final Result

- Return the cumulative revenue for each cohort year, ordered by order date.

#### Calculate Cumulative Revenue by Cohort

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Get the cohorts by year from the `orderdate` and calculate `net_revenue`.
    - Note: The `customer` table has a `startdt` column that may record when each customer started, but our sales data does not go back that far (nothing falls within the 1980 - 2010 range), so we will not use it.

In [23]:
%%sql 

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    orderdate,
    net_revenue
ORDER BY  
    cohort_year,
    orderdate

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,37.51
1,2015,2015-01-01,2395.10
2,2015,2015-01-01,58.73
3,2015,2015-01-01,423.28
4,2015,2015-01-01,262.80
...,...,...,...
199080,2024,2024-04-20,144.77
199081,2024,2024-04-20,17.13
199082,2024,2024-04-20,215.40
199083,2024,2024-04-20,14.02


2. Put the query into a CTE.
    - **Why a CTE?** We need the row-level data first and can then aggregate it in the outer query. Aggregating in the same query would force us to group by `quantity`, `netprice`, and `exchangerate`, which is not what we want.

In [25]:
%%sql

-- Updated query
WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    )
SELECT 
    *
FROM cohort_analysis
ORDER BY  -- Move order by here
    cohort_year,
    orderdate;

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,37.51
1,2015,2015-01-01,2395.10
2,2015,2015-01-01,58.73
3,2015,2015-01-01,423.28
4,2015,2015-01-01,262.80
...,...,...,...
199080,2024,2024-04-20,144.77
199081,2024,2024-04-20,17.13
199082,2024,2024-04-20,215.40
199083,2024,2024-04-20,14.02


3. Calculate the daily revenue

In [26]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    )
SELECT 
    cohort_year,
    orderdate,
    SUM(net_revenue) AS daily_revenue -- Added
FROM cohort_analysis
GROUP BY 
    cohort_year, 
    orderdate
ORDER BY 
    cohort_year, 
    orderdate;


Unnamed: 0,cohort_year,orderdate,daily_revenue
0,2015,2015-01-01,11498.37
1,2015,2015-01-02,5890.40
2,2015,2015-01-03,19796.67
3,2015,2015-01-05,12406.27
4,2015,2015-01-06,10349.87
...,...,...,...
3289,2024,2024-04-16,25098.99
3290,2024,2024-04-17,32938.67
3291,2024,2024-04-18,28408.76
3292,2024,2024-04-19,48386.88


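4. Calculate the cumulative revenue by cohort using a window function.
    - **Why the nested `SUM()`?** The inner `SUM(net_revenue)` is the `GROUP BY` aggregate that produces `daily_revenue`; the outer `SUM(...) OVER (...)` is the window function that accumulates those daily totals within each `cohort_year`. With a single `SUM()`, we would only get the daily revenue, not the running total by cohort year.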

In [29]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
)
SELECT 
    cohort_year,
    orderdate,
    SUM(net_revenue) AS daily_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY orderdate) AS cumulative_revenue -- Added
FROM cohort_analysis
GROUP BY 
    cohort_year, 
    orderdate
ORDER BY 
    cohort_year, 
    orderdate
;

Unnamed: 0,cohort_year,orderdate,daily_revenue,cumulative_revenue
0,2015,2015-01-01,11498.37,11498.37
1,2015,2015-01-02,5890.40,17388.77
2,2015,2015-01-03,19796.67,37185.44
3,2015,2015-01-05,12406.27,49591.71
4,2015,2015-01-06,10349.87,59941.58
...,...,...,...,...
3289,2024,2024-04-16,25098.99,8175575.82
3290,2024,2024-04-17,32938.67,8208514.49
3291,2024,2024-04-18,28408.76,8236923.24
3292,2024,2024-04-19,48386.88,8285310.13
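
The nested aggregate is easy to check on a small table. This is a minimal sketch using `sqlite3` and a hypothetical `cohort_demo` table with made-up values; note that it is the `ORDER BY` inside `OVER` that turns the partition total into a running total.

```python
import sqlite3

# In-memory toy table; names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_demo (cohort_year INTEGER, orderdate TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO cohort_demo VALUES (?, ?, ?)",
    [
        (2015, "2015-01-01", 10.0),
        (2015, "2015-01-01", 20.0),
        (2015, "2015-01-02", 5.0),
        (2016, "2016-01-01", 7.0),
    ],
)

running = conn.execute(
    """
    SELECT
        cohort_year,
        orderdate,
        SUM(net_revenue) AS daily_revenue,   -- aggregate per GROUP BY group
        SUM(SUM(net_revenue)) OVER (
            PARTITION BY cohort_year
            ORDER BY orderdate
        ) AS cumulative_revenue              -- running total of the daily sums
    FROM cohort_demo
    GROUP BY cohort_year, orderdate
    ORDER BY cohort_year, orderdate
    """
).fetchall()

for row in running:
    print(row)
```

The running total restarts at the first row of the 2016 cohort because the window is partitioned by `cohort_year`.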


5. **Bonus**: Convert `orderdate` to a `YYYY-MM` month so that revenue is aggregated by month instead of by day.

In [30]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        TO_CHAR(orderdate, 'YYYY-MM') AS order_month,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        order_month,
        net_revenue
)
SELECT 
    cohort_year,
    order_month,
    SUM(net_revenue) AS daily_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY order_month) AS cumulative_revenue -- Added
FROM cohort_analysis
GROUP BY 
    cohort_year, 
    order_month
ORDER BY 
    cohort_year, 
    order_month
;

Unnamed: 0,cohort_year,order_month,daily_revenue,cumulative_revenue
0,2015,2015-01,383259.62,383259.62
1,2015,2015-02,704101.87,1087361.49
2,2015,2015-03,330470.13,1417831.62
3,2015,2015-04,160267.54,1578099.16
4,2015,2015-05,546999.18,2125098.34
...,...,...,...,...
107,2023,2023-12,2803841.12,31799814.31
108,2024,2024-01,2564371.92,2564371.92
109,2024,2024-02,3392378.75,5956750.67
110,2024,2024-03,1656202.99,7612953.66


### 💡 Why not use GROUP BY instead? 

- Window functions are good when you need both row-level information and aggregated values.
- **Limitation of `GROUP BY`:** Grouping by state can tell you the average age and customer count per state, but it aggregates at the state level, so you lose individual customer details. This makes it impossible to identify specific customers who are younger than the state average for targeted campaigns.
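
The customer example can be sketched as follows. The `customer_demo` table, its columns, and its rows are hypothetical and exist only for this illustration; the query runs on `sqlite3`. Since a window function cannot appear in a `WHERE` clause, the average is computed in a subquery and filtered in the outer query.

```python
import sqlite3

# In-memory toy table; names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_demo (name TEXT, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer_demo VALUES (?, ?, ?)",
    [
        ("Ana", "CA", 30),
        ("Ben", "CA", 50),
        ("Cy", "NY", 40),
        ("Di", "NY", 20),
        ("Ed", "NY", 60),
    ],
)

younger = conn.execute(
    """
    SELECT name, state, age, state_avg_age
    FROM (
        SELECT
            name,
            state,
            age,
            -- Each customer row keeps its detail and gains the state average
            AVG(age) OVER (PARTITION BY state) AS state_avg_age
        FROM customer_demo
    ) AS with_avg
    WHERE age < state_avg_age
    ORDER BY state, name
    """
).fetchall()

for row in younger:
    print(row)
```

A `GROUP BY state` query could return the averages, but not the individual customers who fall below them.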