# Window Functions Syntax

## Overview

### 🥅 Analysis Goals

- Identify high-revenue states and segment younger customers for targeted campaigns.
    - Calculate the average age by state to allow comparisons of individual ages to their regional peers.
    - Calculate the total revenue by state to identify regions with high revenue potential.
- Focus on segments of younger customers in high-revenue states for ad campaigns, targeting demographics that may respond better to age-appropriate marketing.

### 📘 Concepts Covered

- Window functions basic syntax
- `PARTITION BY`
- Aggregate functions with window functions

## Syntax

### 📝 Notes

- Window functions let you perform calculations across a set of table rows related to the current row.
- Unlike aggregate functions used with `GROUP BY`, they do not collapse the rows into a single output row per group.
- They make it easy to partition and order data within a query, which is useful for running totals, ranks, or averages within partitions (more on this later).

Syntax
- `OVER()`: Defines the window for the function. It can include `PARTITION BY` and other clauses such as `ORDER BY`.
- `PARTITION BY`: Divides the result set into partitions. The function is then applied to each partition.
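
To make the two pieces of syntax concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer, which supports window functions) as a stand-in for the PostgreSQL database used in this notebook, and the five-row `customer` table is hypothetical toy data, not the course dataset.

```python
import sqlite3

# Hypothetical toy rows standing in for the course's `customer` table.
rows = [
    (1, "AB", 20),
    (2, "AB", 40),
    (3, "CA", 30),
    (4, "CA", 50),
    (5, "CA", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, state TEXT, age INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

# OVER () with nothing inside: one window spanning every row.
# OVER (PARTITION BY state): one window per state.
result = conn.execute(
    """
    SELECT
        customerkey,
        state,
        age,
        AVG(age) OVER () AS avg_age_all,
        AVG(age) OVER (PARTITION BY state) AS avg_age_state
    FROM customer
    ORDER BY customerkey
    """
).fetchall()

for row in result:
    print(row)
```

All five input rows come back. Every row carries the overall average (42.0), while `avg_age_state` is 30.0 for the `AB` rows and 50.0 for the `CA` rows.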

### 💻 Final Result

#### Average Age by State

**`AVG`, `OVER`, `PARTITION BY`**

1. Using a window function, return the average age by state for only customers in the 'North America' continent.
    1. Return the following columns:
        1. `customerkey`
        2. `state`
        3. `age`
    2. In the window function, use `AVG` on `age` and `PARTITION BY` the state.
    3. Filter for only records in 'North America'.

In [1]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

In [2]:
%%sql 

SELECT 
	customerkey, 
	state,
	age,
	AVG(age) OVER(PARTITION BY state) AS avg_age_state
FROM customer
WHERE
	continent = 'North America'

customerkey,state,age,avg_age_state
225614,AB,25,51.588552188552185
340528,AB,27,51.588552188552185
303721,AB,73,51.588552188552185
303757,AB,23,51.588552188552185
340138,AB,46,51.588552188552185
304019,AB,50,51.588552188552185
304083,AB,21,51.588552188552185
213648,AB,41,51.588552188552185
339805,AB,73,51.588552188552185
213658,AB,60,51.588552188552185


#### Total Customers by State

**`COUNT`, `OVER`, `PARTITION BY`**

1. Add another column with a window function that calculates the total number of customers by state.
    1. Use `COUNT()` on `customerkey`.
    2. `PARTITION BY` the state.
    3. Name this column as `total_customers_state`.

In [3]:
%%sql

SELECT 
    customerkey, 
    state, 
    age,
    AVG(age) OVER(PARTITION BY state) AS avg_age_state,
    COUNT(customerkey) OVER(PARTITION BY state) AS total_customers_state
FROM customer
WHERE   
    continent = 'North America'

customerkey,state,age,avg_age_state,total_customers_state
225614,AB,25,51.588552188552185,1485
340528,AB,27,51.588552188552185,1485
303721,AB,73,51.588552188552185,1485
303757,AB,23,51.588552188552185,1485
340138,AB,46,51.588552188552185,1485
304019,AB,50,51.588552188552185,1485
304083,AB,21,51.588552188552185,1485
213648,AB,41,51.588552188552185,1485
339805,AB,73,51.588552188552185,1485
213658,AB,60,51.588552188552185,1485
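
The query above counts customers with `COUNT(customerkey)`. The sketch below illustrates why `COUNT` rather than `SUM` is the right aggregate for this column: `SUM` would add up the key values themselves. It assumes Python's built-in `sqlite3` module (SQLite 3.25+) in place of PostgreSQL, and the keys and states are hypothetical toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(101, "AB"), (102, "AB"), (103, "AB"), (201, "CA")],  # toy rows
)

# COUNT tallies the rows in each partition; SUM adds the key values,
# which is not a meaningful quantity for an identifier column.
result = conn.execute(
    """
    SELECT
        customerkey,
        state,
        COUNT(customerkey) OVER (PARTITION BY state) AS total_customers_state,
        SUM(customerkey) OVER (PARTITION BY state) AS sum_of_keys
    FROM customer
    ORDER BY customerkey
    """
).fetchall()

for row in result:
    print(row)
```

For the three `AB` rows, `total_customers_state` is 3 while `sum_of_keys` is 306 (101 + 102 + 103).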


2. Add filters to get only states with a large customer base and customers who are under the average age for their state. 
    1. Put the query into a CTE (`customer_data`), which lets us use these calculated values to filter in the main query. Then, in the main query, filter for specific customers.
    2. In the `WHERE` clause, include `age < avg_age_state` to select customers younger than their state’s average.
    3. Add an additional condition with `AND` to ensure that `total_customers_state` is over 1000, focusing on states with a large customer base.

In [11]:
%%sql

WITH customer_data AS (
    SELECT 
        customerkey, 
        state, 
        age, 
        AVG(age) OVER(PARTITION BY state) AS avg_age_state,
        COUNT(customerkey) OVER(PARTITION BY state) AS total_customers_state,
        continent
    FROM customer
    WHERE   
        continent = 'North America'
)

SELECT 
    customerkey, 
    state, 
    age, 
    avg_age_state,
    total_customers_state
FROM customer_data
WHERE 
    age < avg_age_state
    AND total_customers_state > 1000; -- Example threshold for high-customer-count states

customerkey,state,age,avg_age_state,total_customers_state
225614,AB,25,51.588552188552185,1485
340528,AB,27,51.588552188552185,1485
303757,AB,23,51.588552188552185,1485
340138,AB,46,51.588552188552185,1485
304019,AB,50,51.588552188552185,1485
304083,AB,21,51.588552188552185,1485
213648,AB,41,51.588552188552185,1485
339470,AB,35,51.588552188552185,1485
339201,AB,30,51.588552188552185,1485
304489,AB,19,51.588552188552185,1485
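
The CTE is needed because a window function cannot appear directly in a `WHERE` clause: rows are filtered before window functions are evaluated. The sketch below shows both attempts side by side. It assumes Python's built-in `sqlite3` module (SQLite 3.25+) in place of PostgreSQL, hypothetical toy data, and a toy threshold of 2 customers standing in for 1000. The exact error message differs between databases, so only the fact that an error occurs is captured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "AB", 20), (2, "AB", 40), (3, "CA", 30), (4, "CA", 50), (5, "CA", 70)],
)

# Attempt 1: filter on the window function directly in WHERE.
# The database rejects this, so we record the error instead of crashing.
direct_error = None
try:
    conn.execute(
        "SELECT customerkey FROM customer "
        "WHERE age < AVG(age) OVER (PARTITION BY state)"
    ).fetchall()
except sqlite3.Error as exc:
    direct_error = exc

# Attempt 2: compute the window values in a CTE, then filter on them
# as ordinary columns in the outer query.
younger = conn.execute(
    """
    WITH customer_data AS (
        SELECT
            customerkey,
            state,
            age,
            AVG(age) OVER (PARTITION BY state) AS avg_age_state,
            COUNT(customerkey) OVER (PARTITION BY state) AS total_customers_state
        FROM customer
    )
    SELECT customerkey, state, age
    FROM customer_data
    WHERE age < avg_age_state
      AND total_customers_state > 2  -- toy threshold standing in for 1000
    ORDER BY customerkey
    """
).fetchall()

print("direct filter failed:", direct_error is not None)
print(younger)
```

With this toy data, `AB` has only two customers and is dropped by the threshold, and the only `CA` customer below the state average of 50 is customer 3.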


### 💡 Why not use GROUP BY instead? 

- Window functions are good when you need both row-level information and aggregated values.
- **Limitation of `GROUP BY`:** Grouping by state can tell you the average age and customer count per state, but it aggregates at the state level, so you lose individual customer details. This makes it impossible to identify specific customers who are younger than the state average for targeted campaigns.

In [17]:
%%sql

SELECT 
    state,
    AVG(age) AS avg_age_state,
    COUNT(customerkey) AS total_customers_state
FROM customer
WHERE 
    continent = 'North America'
GROUP BY 
    state
HAVING
    AVG(age) < 55
    AND COUNT(customerkey) > 1000;

state,avg_age_state,total_customers_state
QC,51.58186244674376,1643
MI,51.833709131905294,1774
ON,52.13272684861208,4287
VA,52.484052532833026,1066
GA,51.911454102355805,1231
PA,51.96132870599901,2017
IL,52.134909596662034,2157
AB,51.588552188552185,1485
NJ,52.054845980465814,1331
CA,51.423012200252415,4754
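
To see the trade-off on a small scale, the sketch below runs three variants on hypothetical toy data: a plain `GROUP BY`, a `GROUP BY` subquery joined back to the customer rows, and the window function. It assumes Python's built-in `sqlite3` module (SQLite 3.25+) in place of PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "AB", 20), (2, "AB", 40), (3, "CA", 30), (4, "CA", 50), (5, "CA", 70)],
)

# GROUP BY: one row per state, individual customers are gone.
grouped = conn.execute(
    """
    SELECT state, AVG(age) AS avg_age_state, COUNT(customerkey) AS total_customers_state
    FROM customer
    GROUP BY state
    ORDER BY state
    """
).fetchall()

# Getting row-level detail back from GROUP BY requires a join.
joined = conn.execute(
    """
    SELECT c.customerkey, c.state, c.age, g.avg_age_state
    FROM customer AS c
    JOIN (
        SELECT state, AVG(age) AS avg_age_state
        FROM customer
        GROUP BY state
    ) AS g ON g.state = c.state
    ORDER BY c.customerkey
    """
).fetchall()

# The window function produces the same rows in a single pass of SQL.
windowed = conn.execute(
    """
    SELECT customerkey, state, age,
           AVG(age) OVER (PARTITION BY state) AS avg_age_state
    FROM customer
    ORDER BY customerkey
    """
).fetchall()

print(len(grouped), "grouped rows vs", len(windowed), "windowed rows")
print(joined == windowed)
```

The grouped query returns two rows (one per state), while the joined and windowed queries both return all five customers with identical values. The window version simply avoids the extra subquery and join.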


### 📊 Example for targeted marketing:

- Segment customers based on characteristics relative to their regional peers.
- For an ad campaign targeting younger customers in states with a high number of customers:
    - Use window functions to calculate each customer’s age difference from the state average and the total number of customers in each state.
    - This allows you to keep individual ages while also accessing the state’s average age and total customer count for refined segmentation.
- Focus on customers who are younger than their state’s average in high-customer-count areas, creating more targeted and impactful ads.

**Note:** This query shows real-life applications of window functions for practical marketing analysis.

In [19]:
%%sql

SELECT 
    customerkey, 
    state, 
    age, 
    ROUND(avg_age_state, 1) AS avg_age_state,
    ROUND(age_diff, 0) AS age_diff,
    total_customers_state
FROM 
    -- Calculate avg_age by state, age difference, and total customers by state
    (
        SELECT 
            customerkey, 
            state, 
            age, 
            AVG(age) OVER(PARTITION BY state) AS avg_age_state,
            age - AVG(age) OVER(PARTITION BY state) AS age_diff,
            COUNT(customerkey) OVER(PARTITION BY state) AS total_customers_state
        FROM customer
    ) AS subquery
WHERE age_diff < -5 -- More than 5 years younger than the state average
AND total_customers_state > 1000; -- Example threshold for high customer count states


customerkey,state,age,avg_age_state,age_diff,total_customers_state
303299,AB,43,51.6,-9,1485
256292,AB,43,51.6,-9,1485
331269,AB,29,51.6,-23,1485
330941,AB,41,51.6,-11,1485
256608,AB,36,51.6,-16,1485
256753,AB,34,51.6,-18,1485
330353,AB,45,51.6,-7,1485
330244,AB,35,51.6,-17,1485
329787,AB,37,51.6,-15,1485
329607,AB,39,51.6,-13,1485
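
The `age_diff < -5` condition can be checked by hand on a small table. The sketch below uses four hypothetical customers in one state whose average age is exactly 40, so the differences are easy to verify. It assumes Python's built-in `sqlite3` module (SQLite 3.25+) in place of PostgreSQL, and it leaves out the customer-count threshold to keep the focus on the age filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    # Toy ages chosen so the state average is exactly 40.
    [(1, "AB", 30), (2, "AB", 35), (3, "AB", 40), (4, "AB", 55)],
)

segment = conn.execute(
    """
    SELECT customerkey, age, avg_age_state, age_diff
    FROM (
        SELECT
            customerkey,
            age,
            AVG(age) OVER (PARTITION BY state) AS avg_age_state,
            age - AVG(age) OVER (PARTITION BY state) AS age_diff
        FROM customer
    ) AS subquery
    WHERE age_diff < -5
    ORDER BY customerkey
    """
).fetchall()

print(segment)
```

The differences are -10, -5, 0, and 15, and only customer 1 is returned. Customer 2, who is exactly 5 years below the average, is excluded because the comparison is strict; `age_diff <= -5` would include them.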
