<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/1_Syntax.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Window Functions Syntax

## Overview

### 🥅 Analysis Goals


Run a cohort analysis of customer lifetime value (LTV).
- **Analysis 1**: Cumulative revenue by cohort

### 📘 Concepts Covered

Basic syntax: 
- `SUM()`
- `OVER()` 
- `PARTITION BY`

### 📕 Definitions

- **Cohort analysis** - Examines the behavior of specific groups over time.  
- **Cohort** - A group of people or items sharing a common characteristic.  
- **Time series** - Data tracked in sequence over time.  
- **Retention** - Keeping users, customers, or items over time.  


In [17]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (jupysql)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format



---
## Syntax

### 📝 Notes

`window_function OVER (PARTITION BY)`

- **Why Use Window Functions?**
  - They let you perform calculations across a set of table rows related to the current row.
  - Unlike aggregate functions, they don't group the results into a single output row.
  - They allow you to easily partition and order data within the query, making them great for calculating things like running totals, ranks, or averages within partitions.


- **Syntax:**
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
        ) AS window_column_alias
    FROM table_name;
    ```

    - `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses, such as `ORDER BY`.
    - `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition.
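
To make the contrast with `GROUP BY` concrete, here is a minimal sketch that runs both forms on a tiny in-memory table. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer), and the `toy_sales` table and its three rows are hypothetical, not part of the course database.

```python
import sqlite3

# Hypothetical toy data; this is not the contoso_100k database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (customerkey INTEGER, net_revenue REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0)],
)

# GROUP BY collapses each customer into a single output row.
grouped = conn.execute(
    """
    SELECT customerkey, SUM(net_revenue) AS total
    FROM toy_sales
    GROUP BY customerkey
    ORDER BY customerkey
    """
).fetchall()

# The window function keeps every input row and repeats the
# partition total on each row of that partition.
windowed = conn.execute(
    """
    SELECT
        customerkey,
        net_revenue,
        SUM(net_revenue) OVER (PARTITION BY customerkey) AS total
    FROM toy_sales
    ORDER BY customerkey, net_revenue
    """
).fetchall()

print(grouped)   # 2 rows: one per customer
print(windowed)  # 3 rows: one per sale, each carrying its customer's total
```

The grouped query returns one row per customer, while the windowed query returns one row per sale with the customer total attached.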

### 💻 Final Result

- Show how revenue accumulates over time, so a business can track the long-term contribution of each cohort.
  - Extracts `cohort_year` from `orderdate` to group sales into yearly cohorts.
  - Calculates daily revenue for each cohort.
  - Uses a window function to compute the cumulative revenue for each cohort, ordered by `orderdate`.

#### Calculate Total Revenue by Customer

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Get `customerkey`, `orderdate`, and `net_revenue` from the `sales` table.

In [36]:
%%sql 

SELECT 
    customerkey,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue
FROM sales

Unnamed: 0,customerkey,orderdate,net_revenue
0,947009,...,63.49
1,947009,...,423.28
2,1772036,...,108.75
3,1518349,...,1146.75
4,1518349,...,950.25
...,...,...,...
199868,664396,...,914.61
199869,664396,...,150.18
199870,267690,...,147.78
199871,267690,...,2019.62


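2. Use `SUM()` with `OVER(PARTITION BY customerkey)` to show each customer's total revenue on every row.

In [42]:
%%sql 

WITH sales_data AS (
    SELECT 
        customerkey,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
)

SELECT 
    customerkey,
    orderdate,
    net_revenue,
    -- No ORDER BY inside OVER(), so this is the customer's total, repeated on each row
    SUM(net_revenue) OVER(PARTITION BY customerkey) AS cumulative_revenue -- Added
FROM sales_data
-- Not required by the window function; this GROUP BY only collapses exact duplicate rows
GROUP BY
    customerkey,
    orderdate,
    net_revenue
ORDER BY
    customerkey,
    orderdate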

Unnamed: 0,customerkey,orderdate,net_revenue,cumulative_revenue
0,15,2021-03-08,2217.41,2217.41
1,180,2018-07-28,525.31,2510.22
2,180,2023-08-28,1913.55,2510.22
3,180,2023-08-28,71.36,2510.22
4,185,2019-06-01,1395.52,1395.52
...,...,...,...,...
199824,2099711,2016-08-13,2067.75,6008.67
199825,2099711,2017-08-14,3940.92,6008.67
199826,2099743,2022-03-17,94.05,1068.08
199827,2099743,2022-03-17,375.57,1068.08
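
The query above partitions by `customerkey` without an `ORDER BY` inside `OVER()`, so every row of a customer shows the same total. Adding `ORDER BY` inside the window turns the same `SUM()` into a running total. The sketch below shows both side by side; it assumes Python's built-in `sqlite3` (SQLite 3.25 or newer) as a stand-in for PostgreSQL, and the `toy_sales` table is hypothetical.

```python
import sqlite3

# Hypothetical toy data; this is not the contoso_100k database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        (1, "2023-01-01", 10.0),
        (1, "2023-01-02", 20.0),
        (1, "2023-01-03", 30.0),
        (2, "2023-01-01", 5.0),
    ],
)

rows = conn.execute(
    """
    SELECT
        customerkey,
        orderdate,
        net_revenue,
        -- Whole-partition total: identical on every row of a customer
        SUM(net_revenue) OVER (PARTITION BY customerkey) AS customer_total,
        -- Running total: accumulates in orderdate order within a customer
        SUM(net_revenue) OVER (
            PARTITION BY customerkey ORDER BY orderdate
        ) AS running_total
    FROM toy_sales
    ORDER BY customerkey, orderdate
    """
).fetchall()

for row in rows:
    print(row)
```

With the default window frame, rows that share the same `orderdate` within a partition are treated as peers and receive the same running total, which matters when a customer has several orders on one day.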


#### Calculate Cumulative Revenue by Cohort

**`SUM`**, **`OVER`**, **`PARTITION BY`**

1. Derive `cohort_year` from `orderdate` and calculate `net_revenue`.
    - Note: The `customer` table has a `startdt` column that may record when each customer started, but its values (1980 to 2010) fall outside the range of our sales data, so we don't use it.
    - Note: Because the query groups by `orderdate`, `MIN(orderdate)` equals `orderdate` within each group, so `cohort_year` here is the year of the order itself.

In [30]:
%%sql 

SELECT 
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    orderdate,
    (quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    orderdate,
    net_revenue

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2019,2019-03-22,456.87
1,2017,2017-11-11,171.55
2,2019,2019-07-04,951.70
3,2018,2018-10-22,381.63
4,2019,2019-02-25,380.21
...,...,...,...
199080,2022,2022-03-06,991.82
199081,2023,2023-11-26,221.10
199082,2022,2022-11-05,38.85
199083,2022,2022-05-14,163.63
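
A common alternative definition of a cohort is the year of each customer's first order. The sketch below shows one possible way to compute that with `MIN() OVER (PARTITION BY customerkey)`; it is an illustration under assumptions, not the query used in this notebook. It runs on Python's built-in `sqlite3` (SQLite 3.25 or newer) with a hypothetical `toy_sales` table, and uses SQLite's `strftime('%Y', ...)` where PostgreSQL would use `EXTRACT(YEAR FROM ...)`.

```python
import sqlite3

# Hypothetical toy data; this is not the contoso_100k database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (customerkey INTEGER, orderdate TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        (1, "2021-05-01", 10.0),
        (1, "2022-02-10", 20.0),
        (2, "2022-03-15", 5.0),
    ],
)

rows = conn.execute(
    """
    SELECT
        customerkey,
        orderdate,
        net_revenue,
        -- Earliest order date per customer, reduced to its year
        CAST(
            strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey))
            AS INTEGER
        ) AS cohort_year
    FROM toy_sales
    ORDER BY customerkey, orderdate
    """
).fetchall()

for row in rows:
    print(row)
```

Customer 1's 2022 order stays in the 2021 cohort here, because the cohort is fixed by the first order rather than by each order's own date.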


2. Put the query into a CTE.

In [31]:
%%sql

-- Updated query
WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    )
SELECT 
    *
FROM cohort_analysis;

Unnamed: 0,cohort_year,orderdate,net_revenue
0,2019,2019-03-22,456.87
1,2017,2017-11-11,171.55
2,2019,2019-07-04,951.70
3,2018,2018-10-22,381.63
4,2019,2019-02-25,380.21
...,...,...,...
199080,2022,2022-03-06,991.82
199081,2023,2023-11-26,221.10
199082,2022,2022-11-05,38.85
199083,2022,2022-05-14,163.63


3. Calculate the daily revenue for each cohort.

In [33]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
    )
SELECT 
    cohort_year,
    orderdate,
    SUM(net_revenue) AS daily_revenue -- Added
FROM cohort_analysis
GROUP BY 
    cohort_year, 
    orderdate
ORDER BY 
    cohort_year, 
    orderdate;


Unnamed: 0,cohort_year,orderdate,daily_revenue
0,2015,2015-01-01,11498.37
1,2015,2015-01-02,5890.40
2,2015,2015-01-03,19796.67
3,2015,2015-01-05,12406.27
4,2015,2015-01-06,10349.87
...,...,...,...
3289,2024,2024-04-16,25098.99
3290,2024,2024-04-17,32938.67
3291,2024,2024-04-18,28408.76
3292,2024,2024-04-19,48386.88


4. Calculate the cumulative revenue by cohort using a window function.

In [34]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        orderdate,
        (quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY
        orderdate,
        net_revenue
)
SELECT 
    cohort_year,
    orderdate,
    SUM(net_revenue) AS daily_revenue,
    SUM(SUM(net_revenue)) OVER(PARTITION BY cohort_year ORDER BY orderdate) AS cumulative_revenue -- Added
FROM cohort_analysis
GROUP BY 
    cohort_year, 
    orderdate
ORDER BY 
    cohort_year, 
    orderdate;


Unnamed: 0,cohort_year,orderdate,daily_revenue,cumulative_revenue
0,2015,2015-01-01,11498.37,11498.37
1,2015,2015-01-02,5890.40,17388.77
2,2015,2015-01-03,19796.67,37185.44
3,2015,2015-01-05,12406.27,49591.71
4,2015,2015-01-06,10349.87,59941.58
...,...,...,...,...
3289,2024,2024-04-16,25098.99,8175575.82
3290,2024,2024-04-17,32938.67,8208514.49
3291,2024,2024-04-18,28408.76,8236923.24
3292,2024,2024-04-19,48386.88,8285310.13
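
The nested `SUM(SUM(net_revenue)) OVER (...)` can look odd at first. The inner `SUM` is the ordinary aggregate evaluated per `GROUP BY` group (the daily revenue), and the outer `SUM ... OVER` is the window function that accumulates those daily values. The sketch below compares the nested form with an equivalent two-step version; it assumes Python's built-in `sqlite3` (SQLite 3.25 or newer) as a stand-in for PostgreSQL, and the `toy_sales` table with a precomputed `cohort_year` column is hypothetical.

```python
import sqlite3

# Hypothetical toy data; this is not the contoso_100k database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (cohort_year INTEGER, orderdate TEXT, net_revenue REAL)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        (2015, "2015-01-01", 10.0),
        (2015, "2015-01-01", 5.0),
        (2015, "2015-01-02", 20.0),
        (2016, "2016-01-01", 7.0),
    ],
)

# Nested form: aggregate per day, then accumulate the daily sums per cohort.
nested = conn.execute(
    """
    SELECT
        cohort_year,
        orderdate,
        SUM(net_revenue) AS daily_revenue,
        SUM(SUM(net_revenue)) OVER (
            PARTITION BY cohort_year ORDER BY orderdate
        ) AS cumulative_revenue
    FROM toy_sales
    GROUP BY cohort_year, orderdate
    ORDER BY cohort_year, orderdate
    """
).fetchall()

# Two-step form: compute daily revenue in a CTE, then apply the window.
two_step = conn.execute(
    """
    WITH daily AS (
        SELECT cohort_year, orderdate, SUM(net_revenue) AS daily_revenue
        FROM toy_sales
        GROUP BY cohort_year, orderdate
    )
    SELECT
        cohort_year,
        orderdate,
        daily_revenue,
        SUM(daily_revenue) OVER (
            PARTITION BY cohort_year ORDER BY orderdate
        ) AS cumulative_revenue
    FROM daily
    ORDER BY cohort_year, orderdate
    """
).fetchall()

print(nested == two_step)
for row in nested:
    print(row)
```

The two-step version is longer, but it can be easier to read because the aggregation and the window calculation are separated.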
