<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/1_Syntax.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Window Functions Syntax

## Overview

### 🥅 Analysis Goals


Run a customer segmentation analysis to compare purchasing behavior by continent:
- **Calculate the average customer age**: Measure each customer's age in months since their first purchase and average it per continent, so individual customers can be compared with their regional peers for targeted insights.

### 📘 Concepts Covered

Basic syntax:
- `OVER()`
- `PARTITION BY`

### 📕 Definitions

- **Cohort analysis** - Examines the behavior of specific groups over time.  
- **Cohort** - A group of people or items sharing a common characteristic.  
- **Time series** - Data tracked in sequence over time.  
- **Retention** - Keeping users, customers, or items over time.  


In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql when running in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Syntax

### 📝 Notes

`window_function OVER (PARTITION BY)`

- **Why Use Window Functions?**
  - They let you perform calculations across a set of table rows related to the current row.
  - Unlike aggregate functions, they don't group the results into a single output row.
  - They allow you to easily partition and order data within the query, making them great for calculating things like running totals, ranks, or averages within partitions.


- **Syntax:**
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
        ) AS window_column_alias
    FROM table_name;
    ```

    - `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses such as `ORDER BY`. An empty `OVER()` treats the whole result set as a single window.
    - `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition separately.
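
The difference between an aggregate with `GROUP BY` and the same aggregate used as a window function can be sketched on a toy table. This is a minimal illustration, not part of the course database: it uses Python's built-in `sqlite3` (SQLite 3.25 or later is assumed for window function support) as a stand-in for PostgreSQL, and the `customer_age` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical toy data: (customerkey, continent, age_months)
rows = [
    (1, "Europe", 10),
    (2, "Europe", 20),
    (3, "Australia", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_age (customerkey INTEGER, continent TEXT, age_months INTEGER)"
)
conn.executemany("INSERT INTO customer_age VALUES (?, ?, ?)", rows)

# GROUP BY collapses each continent into a single output row.
grouped = conn.execute("""
    SELECT continent, AVG(age_months)
    FROM customer_age
    GROUP BY continent
    ORDER BY continent
""").fetchall()

# OVER (PARTITION BY ...) keeps every input row and attaches the partition average.
windowed = conn.execute("""
    SELECT customerkey, continent, age_months,
           AVG(age_months) OVER (PARTITION BY continent) AS avg_age_continent
    FROM customer_age
    ORDER BY customerkey
""").fetchall()

print(grouped)   # one row per continent
print(windowed)  # one row per customer, each carrying its continent's average
```

The grouped query returns two rows for three customers, while the windowed query returns all three customers, with both European customers sharing the same `avg_age_continent`.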

### 💻 Final Result

- Calculate each customer's age in months since their first purchase, along with the average for their continent, so individual customers can be compared with their regional peers for targeted insights.

#### Calculate Average Customer Age by Continent

**`OVER`**, **`PARTITION BY`**

1. Get the date of each customer's first order.
    - Note: The `customer` table has a `startdt` column that may record when the customer started, but the sales data does not go back that far, so we will not use it.

In [6]:
%%sql

SELECT 
	customerkey, 
	MIN(orderdate) AS first_order_date
	--AVG(age) OVER(PARTITION BY state) AS avg_age_state
FROM sales
GROUP BY	
	customerkey
ORDER BY 
	customerkey

Unnamed: 0,customerkey,first_order_date
0,15,2021-03-08
1,180,2018-07-28
2,185,2019-06-01
3,243,2016-05-19
4,387,2018-12-21
...,...,...
49482,2099619,2018-09-11
49483,2099656,2023-05-11
49484,2099697,2022-09-13
49485,2099711,2016-08-13


2. Find the current age of the customer using `AGE`. The function returns an interval, which the output below displays in days.

In [8]:
%%sql

SELECT 
	customerkey, 
	MIN(orderdate) AS first_order_date,
    AGE(CURRENT_DATE, MIN(orderdate)) AS age_customer
	--AVG(age) OVER(PARTITION BY state) AS avg_age_state
FROM sales
GROUP BY	
	customerkey
ORDER BY 
	customerkey

Unnamed: 0,customerkey,first_order_date,age_customer
0,15,2021-03-08,1401 days
1,180,2018-07-28,2357 days
2,185,2019-06-01,2048 days
3,243,2016-05-19,3156 days
4,387,2018-12-21,2214 days
...,...,...,...
49482,2099619,2018-09-11,2313 days
49483,2099656,2023-05-11,608 days
49484,2099697,2022-09-13,851 days
49485,2099711,2016-08-13,3071 days


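3. Get the current age in months.
    - Note: `EXTRACT(MONTH FROM ...)` returns only the months component of the interval (0 to 11), not the total number of months elapsed. For example, customer 387 below placed their first order on 2018-12-21 yet shows 0. To get total months, add `EXTRACT(YEAR FROM ...) * 12` to the months component.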

In [9]:
%%sql

SELECT 
	customerkey, 
	MIN(orderdate) AS first_order_date,
    EXTRACT (MONTH FROM AGE(CURRENT_DATE, MIN(orderdate))) AS age_customer
	--AVG(age) OVER(PARTITION BY state) AS avg_age_state
FROM sales
GROUP BY	
	customerkey
ORDER BY 
	customerkey

Unnamed: 0,customerkey,first_order_date,age_customer
0,15,2021-03-08,10
1,180,2018-07-28,5
2,185,2019-06-01,7
3,243,2016-05-19,7
4,387,2018-12-21,0
...,...,...,...
49482,2099619,2018-09-11,4
49483,2099656,2023-05-11,8
49484,2099697,2022-09-13,4
49485,2099711,2016-08-13,5
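
The values above range from 0 to 11 because `EXTRACT(MONTH FROM AGE(...))` reports only the months component of the interval. The sketch below shows the gap between that component and the total months elapsed, using two first-order dates from the output. It is a simplified calculation in plain Python, not PostgreSQL's `AGE` itself; the `total_months` helper is hypothetical, and the reference date `2025-01-14` is an assumption chosen to be consistent with the outputs shown above.

```python
from datetime import date


def total_months(first_order: date, as_of: date) -> int:
    """Count the whole months elapsed between first_order and as_of."""
    months = (as_of.year - first_order.year) * 12 + (as_of.month - first_order.month)
    # If as_of falls earlier in its month than the order day, the last month is incomplete.
    if as_of.day < first_order.day:
        months -= 1
    return months


# Assumed reference date (stands in for CURRENT_DATE when the notebook was run).
as_of = date(2025, 1, 14)

for first_order in (date(2021, 3, 8), date(2018, 12, 21)):
    months = total_months(first_order, as_of)
    # months % 12 corresponds to the months component reported in the table above.
    print(first_order, months, months % 12)
```

Under that assumed date, the customer with a first order on 2021-03-08 has 46 total months (component 10), and the one from 2018-12-21 has 72 total months (component 0). Averages built on the component alone therefore describe position within the year, not customer tenure.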


4. Add in the continent from the `customer` table.

In [12]:
%%sql

SELECT 
	s.customerkey, 
    c.continent,
	MIN(s.orderdate) AS first_order_date,
    EXTRACT (MONTH FROM AGE(CURRENT_DATE, MIN(s.orderdate))) AS age_customer
	--AVG(EXTRACT (MONTH FROM AGE(CURRENT_DATE, MIN(orderdate)))) OVER(PARTITION BY state) AS avg_age_state
FROM sales AS s
LEFT JOIN customer AS c ON c.customerkey = s.customerkey
GROUP BY	
	s.customerkey,
    c.continent
ORDER BY 
	s.customerkey

Unnamed: 0,customerkey,continent,first_order_date,age_customer
0,15,Australia,2021-03-08,10
1,180,Australia,2018-07-28,5
2,185,Australia,2019-06-01,7
3,243,Australia,2016-05-19,7
4,387,Australia,2018-12-21,0
...,...,...,...,...
49482,2099619,North America,2018-09-11,4
49483,2099656,North America,2023-05-11,8
49484,2099697,North America,2022-09-13,4
49485,2099711,North America,2016-08-13,5


5. Calculate the average customer age per continent on each row using a window function.

In [14]:
%%sql

SELECT 
	s.customerkey, 
    c.continent,
	MIN(s.orderdate) AS first_order_date,
    EXTRACT (MONTH FROM AGE(CURRENT_DATE, MIN(s.orderdate))) AS age_customer,
	AVG(EXTRACT (MONTH FROM AGE(CURRENT_DATE, MIN(s.orderdate)))) OVER(PARTITION BY c.continent) AS avg_age_continent
FROM sales AS s
LEFT JOIN customer AS c ON c.customerkey = s.customerkey
GROUP BY	
	s.customerkey,
    c.continent
ORDER BY 
	s.customerkey

Unnamed: 0,customerkey,continent,first_order_date,age_customer,avg_age_continent
0,15,Australia,2021-03-08,10,5.3076275939427930
1,180,Australia,2018-07-28,5,5.3076275939427930
2,185,Australia,2019-06-01,7,5.3076275939427930
3,243,Australia,2016-05-19,7,5.3076275939427930
4,387,Australia,2018-12-21,0,5.3076275939427930
...,...,...,...,...,...
49482,2099619,North America,2018-09-11,4,5.3207605238860582
49483,2099656,North America,2023-05-11,8,5.3207605238860582
49484,2099697,North America,2022-09-13,4,5.3207605238860582
49485,2099711,North America,2016-08-13,5,5.3207605238860582


### 📊 Real Life Example (Bonus)

How would we actually use this?

- Segment customers based on purchasing behavior relative to their regional peers.  
- For a retention strategy targeting newer customers in continents with high average customer ages:  
  - Use a CTE and window functions to calculate the average customer "age" (time since first purchase) for each continent, while keeping individual ages in the final output.  
  - This approach enables a side-by-side comparison of individual customers with their continental average, allowing businesses to identify customers who are newer or more established than their regional peers.  
- Focus on customers who are significantly younger than their continent’s average for targeted onboarding or retention campaigns, ensuring efforts are tailored to specific lifecycle stages.  

**Note:** This query demonstrates how window functions enable multi-level analysis by aggregating data while retaining individual detail, a key skill for real-world customer segmentation and analysis.

In [16]:
%%sql

WITH customer_age_data AS (
    SELECT 
        s.customerkey, 
        c.continent,
        MIN(s.orderdate) AS first_order_date,
        EXTRACT(MONTH FROM AGE(CURRENT_DATE, MIN(s.orderdate))) AS age_customer
    FROM sales AS s
    LEFT JOIN customer AS c 
        ON c.customerkey = s.customerkey
    GROUP BY
        s.customerkey,
        c.continent
)
SELECT 
    customerkey, 
    continent, 
    first_order_date, 
    age_customer,
    ROUND(AVG(age_customer) OVER(PARTITION BY continent), 1) AS avg_age_continent,
    age_customer - AVG(age_customer) OVER(PARTITION BY continent) AS age_diff
FROM customer_age_data
ORDER BY customerkey;


Unnamed: 0,customerkey,continent,first_order_date,age_customer,avg_age_continent,age_diff
0,15,Australia,2021-03-08,10,5.3,4.6923724060572070
1,180,Australia,2018-07-28,5,5.3,-0.3076275939427930
2,185,Australia,2019-06-01,7,5.3,1.6923724060572070
3,243,Australia,2016-05-19,7,5.3,1.6923724060572070
4,387,Australia,2018-12-21,0,5.3,-5.3076275939427930
...,...,...,...,...,...,...
49482,2099619,North America,2018-09-11,4,5.3,-1.3207605238860582
49483,2099656,North America,2023-05-11,8,5.3,2.6792394761139418
49484,2099697,North America,2022-09-13,4,5.3,-1.3207605238860582
49485,2099711,North America,2016-08-13,5,5.3,-0.3207605238860582
