<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/2_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Aggregation in Window Functions

## Overview

### 🥅 Analysis Goals

- **Total Users by Cohort:** Count the unique customers in each cohort to gauge its scale. Relative cohort size is essential context when benchmarking revenue.
- **Cohort-Based LTV Analysis:** Assign each customer to a cohort based on their first purchase year, then calculate the average LTV across all customers in each cohort to analyze customer value trends.
- **Filtered Average LTV by Customer:** Aggregate total revenue per customer, then filter the results by order year.


### 📘 Concepts Covered

- `COUNT()`
- `AVG()`
- Filtering window functions

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Update package installer
    !sudo apt-get update -qq > /dev/null 2>&1

    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COUNT

### 📝 Notes

`COUNT`

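- **COUNT**: Counts the non-null values of an expression (or all rows with `COUNT(*)`). `DISTINCT` can be added to a regular aggregate `COUNT` to get the unique count, but PostgreSQL does not support `DISTINCT` inside a window function call.
- Syntax:
  ```sql
  SELECT
      COUNT(column_name) OVER (
          PARTITION BY partition_expression
      ) AS window_column_alias
  FROM table_name
  ```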
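
As a minimal sketch of what `COUNT(...) OVER (PARTITION BY ...)` returns, the snippet below runs the pattern against a tiny in-memory SQLite table rather than the course's PostgreSQL database. The rows are made up for illustration, and it assumes the SQLite bundled with Python is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# In-memory SQLite database with a made-up miniature `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "2021-03-01"), (2, "2018-05-10"), (2, "2023-01-15"), (2, "2023-07-04")],
)

# The window function keeps every row and attaches the partition's row count.
rows = conn.execute(
    """
    SELECT
        customerkey,
        orderdate,
        COUNT(*) OVER (PARTITION BY customerkey) AS rows_per_customer
    FROM sales
    ORDER BY customerkey, orderdate
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, no rows are collapsed: customer 2 still appears three times, each time with a count of 3.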

### 🔑 Key Concepts
- **📊 Business Terms**:
  - Customer Cohort: Group of customers who made their first purchase in the same time period
  - User Acquisition: Process of gaining new customers
  - Customer Base: Total number of active customers
- **💡 Why It Matters**: Essential for understanding customer growth and comparing performance across different sized groups (e.g. segments).
- **🎯 Common Use Cases**: Tracking new customer growth, measuring retention rates
- **📈 Related Metrics**: Acquisition rate, retention rate, churn rate

### 📈 Analysis

-  Group customers by the year of their first purchase, then calculate the total number of customers in each cohort to analyze customer retention and growth trends over time.


#### Total Count of Customers by Cohort

**`COUNT()`, `OVER`, `PARTITION BY`**

1. Get the cohorts by year from the `orderdate` and the `customerkey` in the `sales` table.
    - Use `EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey))` to find the earliest year a customer placed an order.
    - Select `customerkey` to associate each customer with their cohort year.

In [None]:
%%sql

SELECT
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM sales

Unnamed: 0,customerkey,cohort_year
0,15,2021
1,180,2018
2,180,2018
3,180,2018
4,185,2019
...,...,...
199868,2099711,2016
199869,2099711,2016
199870,2099743,2022
199871,2099743,2022


2. Add in the purchase year using `EXTRACT`.
    - Use `EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey))` to find the earliest year a customer placed an order.
    - Select `customerkey` to associate each customer with their cohort year.
    - 🔔 Get the `purchase_year` using `EXTRACT(YEAR FROM orderdate) AS purchase_year`

In [None]:
%%sql

SELECT
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
    EXTRACT(YEAR FROM orderdate) AS purchase_year -- Added
FROM sales

Unnamed: 0,customerkey,cohort_year,purchase_year
0,15,2021,2021
1,180,2018,2018
2,180,2018,2023
3,180,2018,2023
4,185,2019,2019
...,...,...,...
199868,2099711,2016,2017
199869,2099711,2016,2016
199870,2099743,2022,2022
199871,2099743,2022,2022


3. Get total customers by cohort using window functions.
    - 🔔 First create a CTE to identify each customer's cohort year
        - Use `EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey))` to find the earliest year a customer placed an order.
        - Select `customerkey` to associate each customer with their cohort year.
        - Get the `purchase_year` using `EXTRACT(YEAR FROM orderdate) AS purchase_year`
        - Use `DISTINCT` for the SELECT statement.
    - In the main query:
        - Select the `cohort_year` and `purchase_year`
        - Use `COUNT(customerkey) OVER (PARTITION BY purchase_year, cohort_year)` to count customers in each cohort
        - Add `DISTINCT` to show only one row per cohort year and purchase year
        - Order by cohort year to show progression

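> NOTE: You could achieve this more easily with `GROUP BY`. Window functions are not allowed inside `GROUP BY`, so the cohort year is computed in a CTE first:
> ```sql
> WITH first_purchase AS (
>    SELECT
>       customerkey,
>       EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year
>    FROM sales
>    GROUP BY customerkey
> )
> SELECT
>    fp.cohort_year,
>    EXTRACT(YEAR FROM s.orderdate) AS purchase_year,
>    COUNT(DISTINCT s.customerkey) AS num_customers
> FROM sales s
> JOIN first_purchase fp ON fp.customerkey = s.customerkey
> GROUP BY
>    fp.cohort_year,
>    purchase_year
> ```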
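
To check that a window-function query and a `GROUP BY` query can produce the same cohort counts, here is a small sketch on made-up data. It uses an in-memory SQLite database as a stand-in for PostgreSQL, so `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)`; that substitution and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "2015-02-01"), (1, "2015-08-01"), (1, "2016-03-01"),
        (2, "2015-06-01"),
        (3, "2016-01-10"), (3, "2016-11-20"),
    ],
)

# Window-function version: DISTINCT customer/year rows, then COUNT() OVER.
window_sql = """
WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        CAST(strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey)) AS INTEGER) AS cohort_year,
        CAST(strftime('%Y', orderdate) AS INTEGER) AS purchase_year
    FROM sales
)
SELECT DISTINCT
    cohort_year,
    purchase_year,
    COUNT(customerkey) OVER (PARTITION BY purchase_year, cohort_year) AS num_customers
FROM yearly_cohort
ORDER BY cohort_year, purchase_year
"""

# GROUP BY version: cohort year per customer first, then a plain aggregate.
group_sql = """
WITH first_purchase AS (
    SELECT
        customerkey,
        CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year
    FROM sales
    GROUP BY customerkey
)
SELECT
    fp.cohort_year,
    CAST(strftime('%Y', s.orderdate) AS INTEGER) AS purchase_year,
    COUNT(DISTINCT s.customerkey) AS num_customers
FROM sales AS s
JOIN first_purchase AS fp ON fp.customerkey = s.customerkey
GROUP BY fp.cohort_year, purchase_year
ORDER BY fp.cohort_year, purchase_year
"""

window_rows = conn.execute(window_sql).fetchall()
group_rows = conn.execute(group_sql).fetchall()
print(window_rows)
print(group_rows)
```

Both queries return one row per (cohort year, purchase year) pair with the number of distinct customers who purchased in that year.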


In [None]:
%%sql

-- Put previous query into CTE
WITH yearly_cohort AS (
    SELECT DISTINCT -- Updated
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
        EXTRACT(YEAR FROM orderdate) AS purchase_year
    FROM sales
)

-- Added
SELECT DISTINCT
    cohort_year,
    purchase_year,
    COUNT(customerkey) OVER (PARTITION BY purchase_year, cohort_year) AS num_customers
FROM yearly_cohort
ORDER BY cohort_year, purchase_year;

Unnamed: 0,cohort_year,purchase_year,num_customers
0,2015,2015,2825
1,2015,2016,126
2,2015,2017,149
3,2015,2018,348
4,2015,2019,388
5,2015,2020,171
6,2015,2021,295
7,2015,2022,600
8,2015,2023,499
9,2015,2024,146


<img src="https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/Resources/images/3.2_cohort_customer_yearly.png?raw=1" alt="Cohort Customers" width="50%">  

> NOTE: Earlier cohorts continue to contribute to customer growth in future years, but their impact compared to new customer acquisitions is minimal.  
>
> This raises the question: why are more resources not spent on customer retention?

---
## Average

### 📝 Notes

- `AVG()`: Calculates the average of the values

```sql
SELECT
    AVG(column_name) OVER (
        PARTITION BY partition_expression
    ) AS window_column_alias
FROM table_name
```
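
The sketch below illustrates `AVG(...) OVER (PARTITION BY ...)` applied to per-customer totals, the same shape as a cohort LTV calculation. It runs on an in-memory SQLite database with made-up rows; the column names mirror the notebook's `sales` table, and `strftime('%Y', ...)` stands in for PostgreSQL's `EXTRACT(YEAR FROM ...)` as an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customerkey INTEGER, orderdate TEXT, "
    "quantity INTEGER, netprice REAL, exchangerate REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2015-01-05", 2, 50.0, 1.0),   # customer 1: 100
        (1, "2016-02-01", 1, 100.0, 1.0),  # customer 1: +100
        (2, "2015-03-01", 1, 400.0, 1.0),  # customer 2: 400
        (3, "2016-07-01", 3, 10.0, 1.0),   # customer 3: 30
    ],
)

rows = conn.execute(
    """
    WITH yearly_cohort AS (
        -- One row per customer: first purchase year and total revenue
        SELECT
            customerkey,
            CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
            SUM(quantity * netprice * exchangerate) AS customer_ltv
        FROM sales
        GROUP BY customerkey
    )
    SELECT
        cohort_year,
        customerkey,
        customer_ltv,
        AVG(customer_ltv) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM yearly_cohort
    ORDER BY cohort_year, customerkey
    """
).fetchall()

for row in rows:
    print(row)
```

Each customer keeps their own `customer_ltv`, while `avg_cohort_ltv` repeats the cohort-wide average on every row of that cohort.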

### 🔑 Key Concepts
- **📊 Business Terms**:
  - Customer Lifetime Value (LTV): Total revenue generated by a customer over time
  - Average Order Value (AOV): Typical amount spent per transaction
  - Revenue Per User: Average revenue generated by each customer
- **💡 Why It Matters**: Helps identify valuable customer segments and predict future revenue
- **🎯 Common Use Cases**: Calculating customer LTV, analyzing spending patterns
- **📈 Related Metrics**: ARPU (Average Revenue Per User), CAC (Customer Acquisition Cost)

### 📕 Definitions

- **Lifetime Value (LTV)**: the total revenue a customer generates for a business over their entire relationship.


### 📈 Analysis
- Aggregate total revenue per user and calculate the average lifetime value for each cohort.

#### Average LTV by Cohort

**`AVG()`, `OVER`, `PARTITION BY`**

1. Get the `cohort_year` and the total revenue for each user.  
   - Use `EXTRACT(YEAR FROM MIN(orderdate))` to calculate the cohort year for each customer.  
   - Group by `customerkey` to ensure the revenue and cohort year are calculated per user.  
   - Calculate the total revenue for each customer using `SUM(quantity * netprice * exchangerate)`.  
   - Select `customerkey`, `cohort_year`, and the total revenue (aliased as `customer_ltv`) to display the results.  

In [None]:
%%sql

SELECT
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
    SUM(quantity * netprice * exchangerate) AS customer_ltv
FROM sales
GROUP BY
    customerkey

Unnamed: 0,customerkey,cohort_year,customer_ltv
0,2044589,2018,2470.73
1,1603477,2021,136.62
2,876049,2017,2601.13
3,1469222,2024,5278.54
4,2089398,2018,98.39
...,...,...,...
49482,853617,2019,903.31
49483,1573639,2016,6973.42
49484,1355936,2022,149.99
49485,967453,2024,5.40


2. Create a CTE to calculate the cohort year for each customer and their net revenue and return all results in the main query.  
   - 🔔 Define a CTE `yearly_cohort` to extract the cohort year for each customer.  
        - Use `EXTRACT(YEAR FROM MIN(orderdate))` in the CTE to determine the earliest order year for each customer.
        - Calculate the total revenue for each customer using `SUM(quantity * netprice * exchangerate)`.  
        - Group by `customerkey` in the CTE to assign each customer to a single cohort year.  
   - 🔔 In the main query, use `SELECT * FROM yearly_cohort` to return all the results from the CTE.  

In [None]:
%%sql

WITH yearly_cohort AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS customer_ltv
    FROM sales
    GROUP BY
        customerkey
)

SELECT
    *
FROM yearly_cohort
;

Unnamed: 0,customerkey,cohort_year,customer_ltv
0,2044589,2018,2470.73
1,1603477,2021,136.62
2,876049,2017,2601.13
3,1469222,2024,5278.54
4,2089398,2018,98.39
...,...,...,...
49482,853617,2019,903.31
49483,1573639,2016,6973.42
49484,1355936,2022,149.99
49485,967453,2024,5.40


3. Get each customer's LTV alongside the average LTV for their cohort.
   - In the CTE:
        - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Calculate the total revenue for each customer with `SUM(quantity * netprice * exchangerate)` and alias it as `customer_ltv`.  
        - Group by `customerkey` to ensure total revenue and cohort year are assigned to each customer.  
   - In the main query:
        - 🔔 Use `AVG(customer_ltv) OVER (PARTITION BY cohort_year)` to calculate the average LTV per customer for each cohort.  
        - 🔔 Select `cohort_year`, `customerkey`, `customer_ltv`, and the cohort average (`avg_cohort_ltv`) for output.  
        - 🔔 Use `ORDER BY cohort_year, customerkey` to sort the results by cohort and customer.  

In [None]:
%%sql

WITH yearly_cohort AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS customer_ltv
    FROM sales
    GROUP BY
        customerkey
)

SELECT
    cohort_year, -- Updated
    customerkey, -- Updated
    customer_ltv, -- Updated
    AVG(customer_ltv) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv -- Added
FROM yearly_cohort
ORDER BY -- Added
    cohort_year,
    customerkey
LIMIT 20

Unnamed: 0,cohort_year,customerkey,customer_ltv,avg_cohort_ltv
0,2015,4376,182.0,5271.59
1,2015,4403,9530.35,5271.59
2,2015,4925,6078.08,5271.59
3,2015,5729,192.16,5271.59
4,2015,6048,1903.89,5271.59
5,2015,6705,13133.76,5271.59
6,2015,9440,208.01,5271.59
7,2015,10806,442.09,5271.59
8,2015,12116,9714.29,5271.59
9,2015,12973,253.06,5271.59


<img src="https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/Resources/images/3.2_customer_ltv.png?raw=1" alt="Customer LTV & Average LTV" width="50%">

> ⚠️ **Chart Note**: This plots only 20 of our customers for a better visualization.


4. (BONUS) Get the average LTV for each cohort.
    - Create a CTE `cohort_summary` to calculate the average LTV for each cohort.
    - Select the `cohort_year` and the average LTV for each cohort.
    - Order by `cohort_year`.
    - Use `DISTINCT` to avoid duplicate rows.

In [None]:
%%sql

WITH yearly_cohort AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS customer_ltv
    FROM sales
    GROUP BY
        customerkey
), cohort_summary AS (
    SELECT
        cohort_year,
        customerkey,
        customer_ltv,
        AVG(customer_ltv) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
    FROM yearly_cohort
    ORDER BY
        cohort_year,
        customerkey
)
SELECT DISTINCT
    cohort_year,
    avg_cohort_ltv
FROM cohort_summary
ORDER BY
    cohort_year

Unnamed: 0,cohort_year,avg_cohort_ltv
0,2015,5271.59
1,2016,5404.92
2,2017,5403.08
3,2018,4896.64
4,2019,4731.95
5,2020,3933.32
6,2021,3943.33
7,2022,3315.52
8,2023,2543.18
9,2024,2037.55


<img src="https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/Resources/images/3.2_cohort_ltv.png?raw=1" alt="Cohort LTV" width="50%">

---
## Filtering Window Functions

### 📝 Notes

**Filtering Before the Window Function**

- Use `WHERE` to filter rows before aggregation.
- Syntax:
    ```sql
    SELECT
        column_name,
        window_function(column_to_aggregate)
            OVER (PARTITION BY partition_column) AS window_column_alias   
    FROM table_name
    WHERE condition; -- Filters data BEFORE applying window function
    ```

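**EXAMPLE:** Build cohorts only from orders placed in 2020 or later. Customers who first purchased in earlier years still appear, but they are assigned to the year of their first order on or after 2020-01-01 (customer `180` moves from the 2018 cohort to 2023).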

In [None]:
%%sql

SELECT
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM sales
WHERE orderdate >= '2020-01-01'  -- Filters BEFORE window function
LIMIT 10

Unnamed: 0,customerkey,cohort_year
0,15,2021
1,180,2023
2,180,2023
3,387,2021
4,387,2021
5,387,2021
6,387,2021
7,387,2021
8,406,2021
9,406,2021
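
The effect of filtering before the window function can be reproduced on a few made-up rows. This is a sketch against an in-memory SQLite database, not the course data: `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)`, and the `COHORT_SQL` template is a hypothetical helper for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "2018-05-10"), (1, "2023-01-15"), (2, "2021-03-01")],
)

# Hypothetical query template; only the WHERE clause differs between runs.
COHORT_SQL = """
SELECT DISTINCT
    customerkey,
    CAST(strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey)) AS INTEGER) AS cohort_year
FROM sales
{where_clause}
ORDER BY customerkey
"""

# No filter: the window sees every order, so customer 1 belongs to 2018.
unfiltered = conn.execute(COHORT_SQL.format(where_clause="")).fetchall()

# WHERE runs first: the 2018 order is gone before MIN() is evaluated.
filtered = conn.execute(
    COHORT_SQL.format(where_clause="WHERE orderdate >= '2020-01-01'")
).fetchall()

print(unfiltered)
print(filtered)
```

Customer 1 shifts from the 2018 cohort to 2023 because the window function only ever sees the rows that survive the `WHERE` clause.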


**Filtering After the Window Function**

- Use a CTE or subquery plus `WHERE` to filter based on window function results.  
- Syntax:
    ```sql
    WITH windowed_data AS (
        SELECT
            column_name,
            window_function(column_to_aggregate)
                OVER (PARTITION BY partition_column) AS window_column_alias
        FROM table_name
    )

    SELECT *
    FROM windowed_data
    WHERE window_column_alias condition; -- Filters data AFTER window function
    ```

**EXAMPLE:** Keep only customers whose first purchase (cohort year) was in 2020 or later. The cohort year is computed from each customer's full order history, and the filter is applied afterwards.

In [None]:
%%sql

WITH cohort AS (
  SELECT
      customerkey,
      EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
  FROM sales
)
SELECT *
FROM cohort
WHERE cohort_year >= 2020  -- Filters AFTER window function

Unnamed: 0,customerkey,cohort_year
0,15,2021
1,406,2021
2,406,2021
3,545,2023
4,545,2023
...,...,...
81365,2099697,2022
81366,2099697,2022
81367,2099743,2022
81368,2099743,2022
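
A small sketch of filtering after the window function, again on made-up rows in an in-memory SQLite database (an assumption of this example; `strftime('%Y', ...)` stands in for `EXTRACT(YEAR FROM ...)`). It also shows that SQLite, like PostgreSQL, rejects a window function placed directly in `WHERE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "2018-05-10"), (1, "2023-01-15"), (2, "2021-03-01"), (2, "2021-09-09")],
)

# Compute the cohort year over ALL orders, then filter on the result.
rows = conn.execute(
    """
    WITH cohort AS (
        SELECT
            customerkey,
            CAST(strftime('%Y', MIN(orderdate) OVER (PARTITION BY customerkey)) AS INTEGER) AS cohort_year
        FROM sales
    )
    SELECT *
    FROM cohort
    WHERE cohort_year >= 2020
    ORDER BY customerkey
    """
).fetchall()
print(rows)

# A window function written directly in WHERE is not allowed.
try:
    conn.execute(
        "SELECT customerkey FROM sales "
        "WHERE MIN(orderdate) OVER (PARTITION BY customerkey) >= '2020-01-01'"
    )
    direct_filter_rejected = False
except sqlite3.OperationalError as exc:
    direct_filter_rejected = True
    print("rejected:", exc)
```

Customer 1 is excluded entirely, including their 2023 order, because their cohort year (2018) was computed before the filter ran.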


#### 💡 What about `QUALIFY`?  

- Some databases (**BigQuery, Snowflake**) support `QUALIFY` to filter directly on **window function results**.  
- **PostgreSQL does not support `QUALIFY`**, so we use a CTE or subquery with a `WHERE` clause instead.  

---

## Window Functions and Group By

### 📝 Notes
### 🚫 Avoid Combining `GROUP BY` with Window Functions  

When using **window functions** in SQL, it's not recommended to combine them directly with `GROUP BY`. This is because:  

- **Conflicting Aggregations**: `GROUP BY` collapses rows into groups, and window functions are evaluated *after* that collapse, so they see the grouped rows rather than the original ones. This can lead to unexpected results (for example, a per-customer `COUNT(*) OVER` that always returns 1) or errors.  
- **Better Alternatives**: Use **Common Table Expressions (CTEs)** or **subqueries** to first apply the window function, then perform `GROUP BY` in a separate step for clarity and correctness.  
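
The evaluation order can be seen on a three-row example. This sketch uses an in-memory SQLite database and a made-up `amount` column in place of `quantity * netprice * exchangerate`; both are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

# Window function and GROUP BY in the same SELECT:
# the window only sees one (already grouped) row per customer.
combined = conn.execute(
    """
    SELECT
        customerkey,
        AVG(amount) AS avg_order_value,
        COUNT(*) OVER (PARTITION BY customerkey) AS total_orders
    FROM sales
    GROUP BY customerkey
    ORDER BY customerkey
    """
).fetchall()

# Window function first (in a CTE), GROUP BY second.
separated = conn.execute(
    """
    WITH customer_orders AS (
        SELECT
            customerkey,
            amount,
            COUNT(*) OVER (PARTITION BY customerkey) AS total_orders
        FROM sales
    )
    SELECT
        customerkey,
        total_orders,
        AVG(amount) AS avg_order_value
    FROM customer_orders
    GROUP BY customerkey, total_orders
    ORDER BY customerkey
    """
).fetchall()

print(combined)
print(separated)
```

In the combined query `total_orders` is 1 for both customers, while the two-step version reports 2 orders for customer 1 and 1 for customer 2.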


### 📈 Analysis
- Calculate the total number of orders for each customer using a window function
- Combine this with `GROUP BY` to also get each customer's average order value


In [None]:
%%sql

SELECT
    customerkey,
    AVG(quantity * netprice * exchangerate) as net_revenue,
    COUNT(*) OVER(PARTITION BY customerkey) as total_orders  -- This will give wrong results!
FROM sales
GROUP BY customerkey

Unnamed: 0,customerkey,net_revenue,total_orders
0,15,2217.41,1
1,180,836.74,1
2,185,1395.52,1
3,243,287.67,1
4,387,517.32,1
...,...,...,...
49482,2099619,838.74,1
49483,2099656,800.36,1
49484,2099697,12.73,1
49485,2099711,3004.34,1


The same problem appears with `COUNT(customerkey)`: the window function still runs on the already-grouped rows, so `total_orders` is 1 for every customer:

In [None]:
%%sql

SELECT
    customerkey,
    AVG(quantity * netprice * exchangerate) as avg_order_value,
    COUNT(customerkey) OVER(PARTITION BY customerkey) as total_orders
FROM sales
GROUP BY customerkey

Unnamed: 0,customerkey,avg_order_value,total_orders
0,15,2217.41,1
1,180,836.74,1
2,185,1395.52,1
3,243,287.67,1
4,387,517.32,1
...,...,...,...
49482,2099619,838.74,1
49483,2099656,800.36,1
49484,2099697,12.73,1
49485,2099711,3004.34,1

Computing the window function first in a CTE, then grouping in the outer query, returns the correct counts:


In [None]:
%%sql

WITH customer_orders AS (
    SELECT
        customerkey,
        quantity * netprice * exchangerate as order_value,
        COUNT(customerkey) OVER(PARTITION BY customerkey) as total_orders
    FROM sales
)
SELECT
    customerkey,
    total_orders,
    AVG(order_value) as avg_order_value
FROM customer_orders
GROUP BY
    customerkey,
    total_orders
ORDER BY
    customerkey


Unnamed: 0,customerkey,total_orders,avg_order_value
0,15,1,2217.41
1,180,3,836.74
2,185,1,1395.52
3,243,1,287.67
4,387,9,517.32
...,...,...,...
49482,2099619,8,838.74
49483,2099656,13,800.36
49484,2099697,3,12.73
49485,2099711,2,3004.34


In [4]:
##Average Quantity by Store (3.2.1) - Problem
##Analyze how much quantity each customer orders and how it compares to the average quantity ordered at both the customer level and the store level.

    #Select customerkey, storekey, and quantity from the sales table to view individual transactions.
    #Use two window functions:
        #One to calculate the average quantity ordered per customer
        #One to calculate the average quantity ordered per store
    #This helps you understand whether a customer tends to order more or less than the store average, and whether certain stores have higher or lower order quantities on average.
    #Order the results by customerkey and storekey for easier comparison.

%%sql



SELECT
  customerkey,
  storekey,
  quantity,
  AVG (quantity) OVER (PARTITION BY customerkey) as customer_avg,
  AVG (quantity) OVER (PARTITION BY storekey) as store_avg

FROM
  sales
ORDER BY customerkey,storekey

Unnamed: 0,customerkey,storekey,quantity,customer_avg,store_avg
0,15,999999,5,5.0000000000000000,3.1424813230317349
1,180,50,2,2.0000000000000000,3.1489785749875436
2,180,50,3,2.0000000000000000,3.1489785749875436
3,180,999999,1,2.0000000000000000,3.1424813230317349
4,185,50,3,3.0000000000000000,3.1489785749875436
...,...,...,...,...,...
199868,2099711,670,6,3.5000000000000000,3.1854493580599144
199869,2099711,999999,1,3.5000000000000000,3.1424813230317349
199870,2099743,540,2,3.0000000000000000,3.0697674418604651
199871,2099743,610,6,3.0000000000000000,3.1370725854829034


In [16]:
##Customer Orders in 2022 (3.2.2) - Problem
##Problem Statement

#Identify how many unique orders each customer placed in 2022 to better understand purchasing frequency across the customer base.

    #First, use a CTE to select distinct customerkey, orderkey, and orderdate from the sales table, filtered to only include orders from 2022.
    #Then, use a window function to count the total number of orders per customer.
    #Finally, order the results by total_orders in descending order, then by customerkey to highlight the most active customers first.
    #This allows you to rank and compare customers based on how frequently they placed orders during the year.

%%sql

WITH unique_customer AS (

SELECT DISTINCT
  customerkey,
  orderkey,
  orderdate
FROM
  sales
WHERE orderdate BETWEEN '2022-01-01' AND '2022-12-31')



SELECT
  customerkey,
  orderkey,
  COUNT (*) OVER (PARTITION BY customerkey) as orders,
  orderdate
FROM
  unique_customer
ORDER BY orders DESC,customerkey


Unnamed: 0,customerkey,orderkey,orders,orderdate
0,368817,2700016,5,2022-05-23
1,368817,2918016,5,2022-12-27
2,368817,2822027,5,2022-09-22
3,368817,2711006,5,2022-06-03
4,368817,2917044,5,2022-12-26
...,...,...,...,...
18890,2099380,2824060,1,2022-09-24
18891,2099511,2822044,1,2022-09-22
18892,2099603,2631007,1,2022-03-15
18893,2099697,2813044,1,2022-09-13


In [25]:
##Customer LTV by Birth Decade (3.2.3) - Problem
##Problem Statement

#Calculate the lifetime value (LTV) for each individual customer and compare it to the average LTV of other customers in the same birth decade. This will help identify which customers—and which generations—tend to spend more over time.

    #Use a CTE to calculate the total revenue (LTV) for each customerkey, based on their full purchase history.
    #Extract the birth decade of each customer using their birthday, and include it in the aggregation.
    #In the main query, calculate the average LTV for each birth decade using a window function.
    #Filter to include only customers with LTV greater than 1000.
    #Order the results by birth_decade and customerkey.

%%sql

with total_revenue_customer AS (

SELECT
  s.customerkey,
  EXTRACT (DECADE from c.birthday) * 10 as generation,
  SUM (s.netprice*s.quantity*s.exchangerate) as total_revenue
FROM
  sales s
  LEFT JOIN customer c ---INNER JOIN was used in the solution
  ON s.customerkey = c.customerkey
GROUP BY
  s.customerkey, generation)

SELECT
  *,
  AVG (total_revenue) OVER (PARTITION BY generation) as avg_per_generation
FROM total_revenue_customer
WHERE total_revenue > 1000
ORDER BY generation, customerkey


Unnamed: 0,customerkey,generation,total_revenue,avg_per_generation
0,649,1930,4063.09,5586.39
1,2268,1930,1243.54,5586.39
2,2599,1930,8608.97,5586.39
3,3706,1930,1759.19,5586.39
4,4713,1930,1993.40,5586.39
...,...,...,...,...
35580,2092434,2000,6611.90,5868.09
35581,2093202,2000,1504.75,5868.09
35582,2093263,2000,2440.68,5868.09
35583,2093736,2000,4049.01,5868.09


In [39]:
## Revenue by Gender Analysis (3.2.4) - Problem

##Calculate the total net revenue generated by each individual customer between 2015 and 2020, and compare it to the average revenue of other customers within the same gender group.

    #Use a CTE (net_revenue_base) to compute net_revenue per transaction (quantity * netprice * exchangerate).
    #Filter the data in that CTE to only include sales from 2015 to 2020 using orderdate.
    #In the next CTE (customer_sales), join the net_revenue_base CTE with the customer table to attach the gender column.
        #This CTE needs to determine the total revenue (total_revenue) associated with each customer.
        #Ensure the data prepared by this CTE accurately represents each unique customer only once before proceeding to the next step.
    #In the final SELECT statement:
        #Calculate the avg_revenue_by_gender by finding the average of the total_revenue values for the unique customers within each gender group (Hint: A window function like AVG(...) OVER (PARTITION BY gender) is suitable here, operating on the prepared data from customer_sales).
        #Compute a revenue_vs_group ratio (e.g., as a percentage) comparing each customer's total_revenue to their calculated avg_revenue_by_gender.
    #Return customerkey, gender, total_revenue, avg_revenue_by_gender, and revenue_vs_group.


%%sql

WITH net_revenue_base AS (
SELECT
  customerkey,
  (netprice*quantity*exchangerate) as revenue
FROM
  sales
WHERE
  orderdate BETWEEN '2015-01-01' AND '2020-12-31'),


customer_sales AS (
SELECT
  nrb.customerkey,
  SUM (nrb.revenue) as total_revenue_per_customer,
  c.gender
FROM net_revenue_base nrb
INNER JOIN customer c
ON nrb.customerkey = c.customerkey
GROUP BY nrb.customerkey, c.gender )

SELECT
  *,
  AVG (total_revenue_per_customer) OVER (PARTITION BY gender) as avg_revenue_gender,
  total_revenue_per_customer / AVG (total_revenue_per_customer) OVER (PARTITION BY gender) as revenue_vs_group_ratio
FROM customer_sales



Unnamed: 0,customerkey,total_revenue_per_customer,gender,avg_revenue_gender,revenue_vs_group_ratio
0,2099711,6008.67,female,3460.79,1.74
1,185,1395.52,female,3460.79,0.40
2,243,287.67,female,3460.79,0.08
3,387,2370.54,female,3460.79,0.68
4,957,567.12,female,3460.79,0.16
...,...,...,...,...,...
28517,1657577,4362.60,male,3458.78,1.26
28518,1212224,4754.82,male,3458.78,1.37
28519,2044999,358.20,male,3458.78,0.10
28520,1990450,7262.05,male,3458.78,2.10
