<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/1_Syntax.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Window Functions Syntax

## Overview

### 🥅 Analysis Goals


Analyze revenue growth trends by cohort to understand how total sales (net revenue) evolve over time.

- **Total Net Revenue by Day**: Get the total net revenue by day alongside each sale transaction to find each transaction's share of that day's revenue.
- **Cumulative Revenue by Cohort**: Group customers into cohorts based on their first order year (called `cohort_year`) to analyze long-term revenue growth.

### 📘 Concepts Covered

Basic syntax: 
- `SUM()`
- `OVER()` 
- `PARTITION BY`

### 📕 Definitions

- **Cohort analysis**: Examines the behavior of specific groups over time.  
- **Cohort**: A group of people or items sharing a common characteristic.  
- **Time series**: Data tracked in sequence over time.  


In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Syntax

### 📝 Notes

`window_function OVER (PARTITION BY)`

- **Why Use Window Functions?**
  - They let you perform calculations across a set of table rows related to the current row.
  - Unlike aggregate functions, they don't group the results into a single output row.
  - They allow you to easily partition and order data within the query, making them great for calculating things like running totals, ranks, or averages within partitions.


- **Syntax:**
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
        ) AS window_column_alias
    FROM table_name;
    ```

    - `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses such as `ORDER BY`; an empty `OVER()` treats the whole result set as one window.
    - `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition separately.
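
To make the difference between an empty `OVER()` and `OVER(PARTITION BY ...)` concrete, here is a minimal sketch. It does not use the course database: it assumes Python's built-in `sqlite3` module (with a SQLite version that supports window functions, 3.25 or later) and a hypothetical four-row `sales` table with a precomputed `net_revenue` column.

```python
import sqlite3

# Hypothetical four-row table standing in for the course's `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2015-01-01", 10.0), ("2015-01-01", 30.0),
     ("2015-01-02", 20.0), ("2015-01-02", 40.0)],
)

rows = conn.execute("""
    SELECT
        orderdate,
        net_revenue,
        SUM(net_revenue) OVER () AS total_all,                       -- one window: every row
        SUM(net_revenue) OVER (PARTITION BY orderdate) AS total_day  -- one window per date
    FROM sales
    ORDER BY orderdate, net_revenue
""").fetchall()

for row in rows:
    print(row)
```

Every input row is kept; `total_all` repeats the grand total on each row, while `total_day` repeats the total of the row's own date.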

### 💻 Final Result

- Showing how revenue grows over time helps track the long-term contribution of each cohort.

#### Calculate Total Net Revenue

**`SUM`**, **`OVER`**

1. Get the `orderdate` and calculate the total net revenue for the `sales` table.
    - Select the `orderdate` so each row shows the date of its order.
    - Use a `SUM` window function with an empty `OVER()` to compute `total_net_revenue` across all orders in the `sales` table; the same grand total is repeated on every row.
    - Order the results by `orderdate` to present the data chronologically.

In [4]:
%%sql

SELECT
    orderdate,
    SUM(quantity * netprice * exchangerate) OVER() AS total_net_revenue
FROM 
    sales
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_net_revenue
0,2015-01-01,206407538.58
1,2015-01-01,206407538.58
2,2015-01-01,206407538.58
3,2015-01-01,206407538.58
4,2015-01-01,206407538.58
...,...,...
199868,2024-04-20,206407538.58
199869,2024-04-20,206407538.58
199870,2024-04-20,206407538.58
199871,2024-04-20,206407538.58


#### Calculate Total Net Revenue by Day

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Get the `orderdate` and calculate net revenue (for each order) and total net revenue (for each day) from the `sales` table.
    - Select the `orderdate` so each row shows the date of its order.
    - 🔔 Calculate `net_revenue` by multiplying `quantity`, `netprice`, and `exchangerate` to obtain the revenue for each individual sale.
    - Use a `SUM` window function with `OVER(PARTITION BY orderdate)` to compute `total_net_revenue` for all orders sharing the same `orderdate`.
    - Order the results by `orderdate` to present the data chronologically.

In [20]:
%%sql

SELECT
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue,
    SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS total_net_revenue -- Added
FROM 
    sales
ORDER BY
    orderdate

Unnamed: 0,orderdate,net_revenue,total_net_revenue
0,2015-01-01,63.49,11640.80
1,2015-01-01,423.28,11640.80
2,2015-01-01,108.75,11640.80
3,2015-01-01,1146.75,11640.80
4,2015-01-01,950.25,11640.80
...,...,...,...
199868,2024-04-20,914.61,96879.43
199869,2024-04-20,150.18,96879.43
199870,2024-04-20,147.78,96879.43
199871,2024-04-20,2019.62,96879.43


### 💡 Why not use GROUP BY instead? 

- Window functions are useful when you need both row-level information and aggregated values.
- **Limitation of `GROUP BY`:** Grouping by `orderdate` gives you the net revenue per order date, but it collapses the rows to one per date, so you lose the individual order details. 
- Adding window functions lets us make calculations like the percentage of revenue share for each order (what we'll be doing next).
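
The trade-off can be seen side by side in a small sketch. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) and a hypothetical four-row `sales` table, not the course database.

```python
import sqlite3

# Hypothetical stand-in for the `sales` table with a precomputed net_revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2015-01-01", 10.0), ("2015-01-01", 30.0),
     ("2015-01-02", 20.0), ("2015-01-02", 40.0)],
)

# GROUP BY: one row per date, individual orders are gone.
grouped = conn.execute("""
    SELECT orderdate, SUM(net_revenue) AS total_net_revenue
    FROM sales
    GROUP BY orderdate
    ORDER BY orderdate
""").fetchall()

# Window function: every order row is kept, with the daily total attached.
windowed = conn.execute("""
    SELECT
        orderdate,
        net_revenue,
        SUM(net_revenue) OVER (PARTITION BY orderdate) AS total_net_revenue
    FROM sales
    ORDER BY orderdate, net_revenue
""").fetchall()

print(len(grouped), "rows with GROUP BY")
print(len(windowed), "rows with the window function")
```

Both queries compute the same daily totals; only the windowed version still has `net_revenue` for each order, which is what a per-order revenue share needs.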

---
## Subqueries Review

**Subqueries**

- **Subquery**: a query nested inside another query. 
    - Subqueries let you perform complex queries by using the result of one query as input for another. 
    - It can be used in clauses like `SELECT`, `FROM`, `WHERE`, and `HAVING`.

**Syntax**:
- In `SELECT` clause
    - ```sql
      SELECT 
        column1, 
        column2, 
        (SELECT single_value_expression FROM table_name WHERE condition) AS alias_name
      FROM main_table
      WHERE condition;
      ```
- In `WHERE` clause 
    - ```sql
      SELECT column1, column2
      FROM table_name
      WHERE column_name operator (SELECT column_name FROM table_name WHERE condition);
      ```
- In `FROM` clause
    - ```sql
      SELECT alias_name.column1, alias_name.column2
      FROM (
        SELECT column1, column2 
        FROM table_name 
        WHERE condition
      ) AS alias_name
      WHERE condition;
      ```
- There are more ways to use subqueries, such as with `EXISTS`, `NOT EXISTS`, correlated subqueries, and in `HAVING`, but these are the most common.
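
As a rough illustration of the three placements above, the sketch below runs one subquery in each clause. It assumes Python's built-in `sqlite3` module and a hypothetical four-row `sales` table; the table contents and result names are made up for the example.

```python
import sqlite3

# Hypothetical stand-in for the `sales` table with a precomputed net_revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2015-01-01", 10.0), ("2015-01-01", 30.0),
     ("2015-01-02", 20.0), ("2015-01-02", 40.0)],
)

# Subquery in SELECT: attach the overall average to every row.
in_select = conn.execute("""
    SELECT
        orderdate,
        net_revenue,
        (SELECT AVG(net_revenue) FROM sales) AS avg_all
    FROM sales
    ORDER BY orderdate, net_revenue
""").fetchall()

# Subquery in WHERE: keep only the rows above the overall average.
in_where = conn.execute("""
    SELECT orderdate, net_revenue
    FROM sales
    WHERE net_revenue > (SELECT AVG(net_revenue) FROM sales)
    ORDER BY orderdate, net_revenue
""").fetchall()

# Subquery in FROM: aggregate over a derived table.
in_from = conn.execute("""
    SELECT orderdate, AVG(net_revenue) AS avg_net_revenue
    FROM (
        SELECT orderdate, net_revenue
        FROM sales
    ) AS revenue_by_day
    GROUP BY orderdate
    ORDER BY orderdate
""").fetchall()

print(in_select)
print(in_where)
print(in_from)
```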

### 💻 Final Result

- Get the average net revenue by day (this is for reviewing subqueries).
- Calculate for each transaction the percentage of total revenue share.

#### Average Net Revenue by Day

**Subquery**

1. Using a subquery, get the average net revenue by day (`orderdate`).  
   - Create a subquery `revenue_by_day` to calculate the net revenue for each sale using `(quantity * netprice * exchangerate)`.  
        - Include `orderdate` in the subquery to associate each sale with its corresponding date.  
   - In the main query, calculate the average net revenue per day using `AVG(net_revenue)`.  
        - Group the results by `orderdate` to compute the average for each unique day.  
        - Use `ORDER BY orderdate` to display the results in chronological order.  

In [14]:
%%sql

SELECT 
    orderdate, 
    AVG(net_revenue) AS avg_net_revenue
FROM (
    SELECT orderdate, (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
) AS revenue_by_day
GROUP BY orderdate
ORDER BY orderdate;

Unnamed: 0,orderdate,avg_net_revenue
0,2015-01-01,465.63
1,2015-01-02,736.30
2,2015-01-03,942.70
3,2015-01-05,1240.63
4,2015-01-06,862.49
...,...,...
3289,2024-04-16,784.34
3290,2024-04-17,539.98
3291,2024-04-18,498.40
3292,2024-04-19,967.74


#### Percentage of Total Revenue Share

**Subquery**

1. For each transaction, calculate its share of the day's total revenue (as a fraction between 0 and 1).
    - 🔔 Select `orderdate`, `net_revenue`, and `total_net_revenue` from the subquery.  
    - 🔔 Calculate `revenue_share` by dividing `net_revenue` by `total_net_revenue` for each row.  
    - Use a subquery that calculates `net_revenue` as the product of `quantity`, `netprice`, and `exchangerate`.  
        - Use a window function (`SUM` with `PARTITION BY orderdate`) in the subquery to compute `total_net_revenue` for each `orderdate`.  
    - Order the final results by `orderdate`.  

In [7]:
%%sql

SELECT
    orderdate,
    net_revenue,
    total_net_revenue,
    net_revenue / total_net_revenue AS revenue_share
FROM 
    -- Use the query from the previous section (Calculate Total Net Revenue by Day) as a subquery
    (
    SELECT
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue,
        SUM(quantity * netprice * exchangerate) OVER(PARTITION BY orderdate) AS total_net_revenue
    FROM
        sales
    ) AS daily_revenue -- Alias required by PostgreSQL versions before 16
ORDER BY
    orderdate

Unnamed: 0,orderdate,net_revenue,total_net_revenue,revenue_share
0,2015-01-01,63.49,11640.80,0.01
1,2015-01-01,423.28,11640.80,0.04
2,2015-01-01,108.75,11640.80,0.01
3,2015-01-01,1146.75,11640.80,0.10
4,2015-01-01,950.25,11640.80,0.08
...,...,...,...,...
199868,2024-04-20,914.61,96879.43,0.01
199869,2024-04-20,150.18,96879.43,0.00
199870,2024-04-20,147.78,96879.43,0.00
199871,2024-04-20,2019.62,96879.43,0.02
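
One way to sanity-check a revenue share column is to confirm that the shares within each day add up to 1. The sketch below does that on a hypothetical four-row `sales` table using Python's built-in `sqlite3` module (SQLite 3.25 or later assumed); it mirrors the shape of the query above but is not run against the course database.

```python
import sqlite3

# Hypothetical stand-in for the `sales` table with a precomputed net_revenue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2015-01-01", 10.0), ("2015-01-01", 30.0),
     ("2015-01-02", 20.0), ("2015-01-02", 60.0)],
)

shares = conn.execute("""
    SELECT
        orderdate,
        net_revenue,
        total_net_revenue,
        net_revenue / total_net_revenue AS revenue_share
    FROM (
        SELECT
            orderdate,
            net_revenue,
            SUM(net_revenue) OVER (PARTITION BY orderdate) AS total_net_revenue
        FROM sales
    ) AS daily_revenue
    ORDER BY orderdate, net_revenue
""").fetchall()

# Add up the shares per day; each day should come out at 1.
share_by_day = {}
for orderdate, _net, _total, share in shares:
    share_by_day[orderdate] = share_by_day.get(orderdate, 0.0) + share

print(share_by_day)
```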


---
## SUM

### 📝 Notes

`SUM`

- **SUM**: Adds up the values of an expression across all rows in the window.
- Syntax: 
  ```sql
  SELECT
      SUM(expression) OVER (
          PARTITION BY partition_expression
      ) AS window_column_alias
  FROM table_name;
  ```

### 💻 Final Result

- Group customers into cohorts based on their first order year (called `cohort_year`) to analyze long-term revenue growth.
- **Note**: The `customer` table has a `startdt` column, which may be when each customer started. Its values fall between 1980 and 2010, a range our sales data doesn't cover, so we won't use it.

#### Calculate Cumulative Revenue by Cohort

**`SUM`**, **`OVER`**, **`PARTITION BY`**

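1. Get the cohorts by year from the `orderdate` and calculate `net_revenue`.
    - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to get the earliest year for each group. Because the query groups by `orderdate`, the earliest date in each group is that `orderdate`, so `cohort_year` here is simply the year of the order.  
    - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
    - Group the results by `orderdate` and `net_revenue` using `GROUP BY`. Rows that share both values collapse into one, which is why the result has 199,084 rows rather than the 199,872 rows in `sales`.  
    - Sort the output by `cohort_year` and `orderdate` using `ORDER BY`.  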

In [23]:
%%sql 

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    orderdate,
    net_revenue
ORDER BY  
    cohort_year,
    orderdate

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,37.51
1,2015,2015-01-01,2395.10
2,2015,2015-01-01,58.73
3,2015,2015-01-01,423.28
4,2015,2015-01-01,262.80
...,...,...,...
199080,2024,2024-04-20,144.77
199081,2024,2024-04-20,17.13
199082,2024,2024-04-20,215.40
199083,2024,2024-04-20,14.02
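
The query above derives `cohort_year` from each order's own date. If the cohort should instead follow each customer's first order, one possible approach is a `MIN(orderdate)` window function partitioned by customer. The sketch below is only an illustration: it assumes a `customerkey` column (not shown in this notebook), a hypothetical three-row table, and Python's built-in `sqlite3` module, where `strftime('%Y', ...)` stands in for PostgreSQL's `EXTRACT(YEAR FROM ...)`.

```python
import sqlite3

# Hypothetical table; `customerkey` is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2015-03-01", 10.0), (1, "2016-05-01", 20.0), (2, "2016-02-01", 30.0)],
)

cohorts = conn.execute("""
    SELECT
        customerkey,
        orderdate,
        -- Year of the customer's earliest order, repeated on each of their rows
        CAST(
            strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey))
            AS INTEGER
        ) AS cohort_year
    FROM sales
    ORDER BY customerkey, orderdate
""").fetchall()

for row in cohorts:
    print(row)
```

Customer 1's 2016 order stays in the 2015 cohort because the window looks at all of that customer's orders, not just the current row.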


2. Put the query into a subquery.
    - Use a subquery to encapsulate the logic for calculating `cohort_year` and `net_revenue`.
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to find the earliest year for each `orderdate` group.
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.
        - Group the data by `orderdate` and `net_revenue` using `GROUP BY`.
    - 🔔 Outer query: 
        - Select `cohort_year`, `orderdate`, and `net_revenue` from the subquery in the main query.
        - Move the `ORDER BY cohort_year, orderdate` to the main query to ensure sorting is applied to the final result.
    - **Why use a Subquery:** To clean and prepare the raw data first, avoiding unnecessary grouping by `quantity`, `netprice`, and `exchangerate`, and to keep the query modular and easy to read.

In [8]:
%%sql

SELECT 
    cohort_year,
    orderdate,
    net_revenue
FROM ( --previous query as a subquery
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
) AS cohort_revenue
ORDER BY  -- Move order by here
    cohort_year,
    orderdate;

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2015,2015-01-01,37.51
1,2015,2015-01-01,2395.10
2,2015,2015-01-01,58.73
3,2015,2015-01-01,423.28
4,2015,2015-01-01,262.80
...,...,...,...
199080,2024,2024-04-20,144.77
199081,2024,2024-04-20,17.13
199082,2024,2024-04-20,215.40
199083,2024,2024-04-20,14.02


3. Calculate the monthly revenue.
    - Use a subquery to calculate `cohort_year` and `net_revenue` for each transaction.  
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to identify the earliest year for each `orderdate` group.  
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
        - Group the results in the subquery by `orderdate` and `net_revenue` to ensure distinct combinations of these values.  
    - In the outer query, select `cohort_year` and `orderdate` from the subquery for further analysis.  
        - 🔔 Get the order year and month from `orderdate` using `DATE_TRUNC('month', orderdate)`, aliased as `year_month`.
        - 🔔 Calculate `monthly_revenue` using `SUM(net_revenue)` to aggregate total revenue for each month. 
        - Group by `cohort_year` and `year_month` to finalize the monthly revenue calculation.  
        - Sort the results by `cohort_year` and `year_month` using `ORDER BY`.  

In [9]:
%%sql

SELECT 
    cohort_year,
    DATE_TRUNC('month', orderdate) AS year_month, -- Update
    SUM(net_revenue) AS monthly_revenue -- Added
FROM (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    ) AS cohort_revenue
GROUP BY 
    cohort_year, 
    year_month
ORDER BY 
    cohort_year, 
    year_month;


Unnamed: 0,cohort_year,year_month,monthly_revenue
0,2015,2015-01-01 00:00:00-08:00,383920.37
1,2015,2015-02-01 00:00:00-08:00,705828.98
2,2015,2015-03-01 00:00:00-08:00,330943.44
3,2015,2015-04-01 00:00:00-07:00,160767.00
4,2015,2015-05-01 00:00:00-07:00,548252.85
...,...,...,...
107,2023,2023-12-01 00:00:00-08:00,2912914.47
108,2024,2024-01-01 00:00:00-08:00,2672785.66
109,2024,2024-02-01 00:00:00-08:00,3537863.87
110,2024,2024-03-01 00:00:00-08:00,1687688.64


4. Calculate the cumulative revenue by cohort using a window function.
    - Use a subquery to calculate `cohort_year` and `net_revenue` for each transaction.  
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to identify the earliest year for each `orderdate` group.  
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
        - Group the results in the subquery by `orderdate` and `net_revenue` to ensure distinct combinations of these values.   
    - In the outer query, select `cohort_year` and `orderdate` from the subquery for further analysis.  
        - Calculate `monthly_revenue` using `SUM(net_revenue)` to aggregate total revenue for each month.  
        - 🔔 Use a window function to calculate `cumulative_revenue` as `SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY DATE_TRUNC('month', orderdate))`, summing the monthly revenue progressively within each cohort year.  
        - Get the order year and month from `orderdate` using `DATE_TRUNC`.
        - Sort the results by `cohort_year` and `year_month` using `ORDER BY`.  
        - **Why nest `SUM()` inside the window function:** The inner `SUM(net_revenue)` is a regular aggregate that calculates the monthly revenue for each `cohort_year` and `year_month` group. The outer `SUM(...) OVER(...)` is the window function: it runs after the grouping and adds up those monthly totals in month order within each `cohort_year`. A window function in a grouped query can't reference the raw `net_revenue` column, so without the inner `SUM()` the query would fail rather than produce a running total.  

In [30]:
%%sql

SELECT 
    cohort_year,
    DATE_TRUNC('month', orderdate) AS year_month,
    SUM(net_revenue) AS monthly_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY DATE_TRUNC('month', orderdate)) AS cumulative_revenue -- Added
FROM  (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
) AS cohort_revenue
GROUP BY 
    cohort_year, 
    year_month
ORDER BY 
    cohort_year, 
    year_month
;

Unnamed: 0,cohort_year,year_month,monthly_revenue,cumulative_revenue
0,2015,2015-01-01 00:00:00-08:00,383920.37,383920.37
1,2015,2015-02-01 00:00:00-08:00,705828.98,1089749.35
2,2015,2015-03-01 00:00:00-08:00,330943.44,1420692.78
3,2015,2015-04-01 00:00:00-07:00,160767.00,1581459.78
4,2015,2015-05-01 00:00:00-07:00,548252.85,2129712.63
...,...,...,...,...
107,2023,2023-12-01 00:00:00-08:00,2912914.47,33021117.03
108,2024,2024-01-01 00:00:00-08:00,2672785.66,2672785.66
109,2024,2024-02-01 00:00:00-08:00,3537863.87,6210649.54
110,2024,2024-03-01 00:00:00-08:00,1687688.64,7898338.17
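
The nested `SUM(SUM(...)) OVER(...)` pattern is easier to follow on a handful of rows. The sketch below assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) and a hypothetical `orders` table that already has `cohort_year` and `year_month` columns; it is an illustration of the pattern, not the course query.

```python
import sqlite3

# Hypothetical table with the cohort and month already worked out per order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (cohort_year INTEGER, year_month TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(2015, "2015-01", 100.0), (2015, "2015-01", 50.0), (2015, "2015-02", 25.0),
     (2016, "2016-01", 10.0), (2016, "2016-02", 5.0), (2016, "2016-02", 5.0)],
)

running = conn.execute("""
    SELECT
        cohort_year,
        year_month,
        SUM(net_revenue) AS monthly_revenue,       -- aggregate: one value per group
        SUM(SUM(net_revenue)) OVER (               -- window: running total of those values
            PARTITION BY cohort_year
            ORDER BY year_month
        ) AS cumulative_revenue
    FROM orders
    GROUP BY cohort_year, year_month
    ORDER BY cohort_year, year_month
""").fetchall()

for row in running:
    print(row)
```

The running total restarts when `cohort_year` changes, which matches the reset between the 2023 and 2024 rows in the output above.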


5. **Bonus**: Cast the truncated `orderdate` to a date to remove the timestamp and make it easier to read.
    - Use a subquery to calculate `cohort_year` and `net_revenue`.  
        - Extract `cohort_year` using `EXTRACT(YEAR FROM MIN(orderdate))` to identify the earliest year for each `orderdate` group.  
        - Calculate `net_revenue` as `quantity * netprice * exchangerate`.  
        - Group the results in the subquery by `orderdate` and `net_revenue` to ensure distinct combinations of these values.   
    - In the outer query, select `cohort_year` and `year_month` for further aggregation and analysis.  
        - Compute `monthly_revenue` as `SUM(net_revenue)` to aggregate revenue for each month.  
        - 🔔 Truncate `orderdate` to the month using `DATE_TRUNC`, then add `::date` to drop the time and time zone portion for a cleaner format. 
        - Use a window function to calculate `cumulative_revenue` as `SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY DATE_TRUNC('month', orderdate))`, producing a running total of revenue for each cohort by month.  
        - 🔔 Group the final query by `cohort_year` and `DATE_TRUNC('month', orderdate)`. We group by the `DATE_TRUNC` expression instead of the `year_month` alias because the alias now refers to the cast value, while the window function's `ORDER BY` still uses the uncast `DATE_TRUNC` expression, which must appear in `GROUP BY`.  
        - 🔔 Sort the results by `cohort_year` and `DATE_TRUNC('month', orderdate)` using `ORDER BY`.  

In [4]:
%%sql

SELECT 
    cohort_year,
    (DATE_TRUNC('month', orderdate))::date AS year_month, -- Update
    SUM(net_revenue) AS monthly_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY DATE_TRUNC('month', orderdate)) AS cumulative_revenue
FROM  (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
) AS cohort_revenue
GROUP BY 
    cohort_year, 
    DATE_TRUNC('month', orderdate) --Update
ORDER BY 
    cohort_year, 
    DATE_TRUNC('month', orderdate) --Update
;

Unnamed: 0,cohort_year,year_month,monthly_revenue,cumulative_revenue
0,2015,2015-01-01,383920.37,383920.37
1,2015,2015-02-01,705828.98,1089749.35
2,2015,2015-03-01,330943.44,1420692.78
3,2015,2015-04-01,160767.00,1581459.78
4,2015,2015-05-01,548252.85,2129712.63
...,...,...,...,...
107,2023,2023-12-01,2912914.47,33021117.03
108,2024,2024-01-01,2672785.66,2672785.66
109,2024,2024-02-01,3537863.87,6210649.54
110,2024,2024-03-01,1687688.64,7898338.17
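
For readers trying these ideas outside PostgreSQL, month truncation looks different in other engines. As a rough sketch, SQLite has no `DATE_TRUNC`, so the example below uses its `date(..., 'start of month')` modifier as a stand-in for `DATE_TRUNC('month', orderdate)::date`. The two sample dates are made up, and the code assumes Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two made-up order dates; date(..., 'start of month') snaps each to the 1st.
months = conn.execute("""
    SELECT
        orderdate,
        date(orderdate, 'start of month') AS year_month
    FROM (
        SELECT '2015-03-17' AS orderdate
        UNION ALL
        SELECT '2024-04-20'
    ) AS sample_dates
    ORDER BY orderdate
""").fetchall()

for row in months:
    print(row)
```

As with the `::date` cast in PostgreSQL, the result is a plain date with no time or time zone portion.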
