<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Conditional_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Aggregation

**Product focused**

## Overview

### 🥅 Analysis Goals

Run an exploratory data analysis (EDA) of the products and categories ordered, using the `sales` table:
- Compare total net revenue by product category for orders in 2023 and 2022.
- Pivot total sales so that 2023 and 2022 appear side by side.
- Categorize sales as low or high and pivot the totals by category and year.

### 📘 Concepts Covered

- `SUM` Review
- `SUM` with `CASE WHEN`
- Pivot with Multiple CASE WHEN Statements

---

In [1]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the `sql` extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Set up the connection parameters for this notebook
import psycopg2
import pandas as pd
import numpy as np

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)

---
## SUM Review

### 📝 Notes

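- `SUM` adds up the values of an expression across all rows in each group defined by `GROUP BY`.
- Net revenue for a row is `quantity * netprice * exchangerate`.

### 💻 Final Result

- Return total net revenue by order date for 2023, then by product category for 2023 and for 2022.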

#### Total Net Revenue by Day in 2023

**`SUM`**

1. Find the net revenue by orderdate for 2023 orders.

In [3]:
%%sql

SELECT
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue -- Added
FROM
    sales s
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY   
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,net_revenue
0,2023-01-01,30140.799315
1,2023-01-02,107847.490191
2,2023-01-03,192655.596657
3,2023-01-04,189451.707871
4,2023-01-05,216573.229817
...,...,...
359,2023-12-27,141981.336234
360,2023-12-28,138772.189742
361,2023-12-29,85913.440327
362,2023-12-30,165917.019796


#### Total Net Revenue by Product Category in 2023 and 2022

**`SUM`**

1. Find the total net revenue by the product category for 2023 orders.

In [None]:
%%sql

SELECT
    p.categoryname AS category_name, -- Added
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey -- Added
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname -- Update
ORDER BY
    p.categoryname -- Update

Unnamed: 0,category_name,net_revenue
0,Audio,688690.2
1,Cameras and camcorders,1983546.0
2,Cell phones,6002148.0
3,Computers,11650870.0
4,Games and Toys,270375.0
5,Home Appliances,5919993.0
6,"Music, Movies and Audio Books",2180768.0
7,TV and Video,4412178.0


2. Find the total net revenue by the product category for 2022 orders.

In [5]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

Unnamed: 0,category_name,net_revenue
0,Audio,766938.2
1,Cameras and camcorders,2382533.0
2,Cell phones,8119665.0
3,Computers,17862210.0
4,Games and Toys,316127.3
5,Home Appliances,6612447.0
6,"Music, Movies and Audio Books",2989297.0
7,TV and Video,5815337.0


---
## SUM with CASE WHEN

### 📝 Notes

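- Wrapping a `CASE WHEN` expression in `SUM` adds up only the rows that meet the condition. Rows that don't match return `NULL`, which `SUM` ignores.
- One conditional `SUM` per year replaces the two separate queries from the previous section with a single query that has one column per year.

### 💻 Final Result

- Return total sales by category with 2023 and 2022 side by side as columns.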

#### Total Sales by Category and Year (2022 vs 2023)

**`CASE WHEN` and `SUM`**

1. Pivot to get the total sales by category and compare 2023 with 2022.

In [6]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2023_total_sales,y2022_total_sales
0,Audio,688690.2,766938.2
1,Cameras and camcorders,1983546.0,2382533.0
2,Cell phones,6002148.0,8119665.0
3,Computers,11650870.0,17862210.0
4,Games and Toys,270375.0,316127.3
5,Home Appliances,5919993.0,6612447.0
6,"Music, Movies and Audio Books",2180768.0,2989297.0
7,TV and Video,4412178.0,5815337.0
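
The same pattern can be tried without the course database. Below is a minimal sketch using Python's built-in `sqlite3` module and a made-up `sales_demo` table (the table name, columns, and values are hypothetical, not taken from `contoso_100k`). It only illustrates that a `CASE WHEN` with no `ELSE` yields `NULL` for non-matching rows, which `SUM` skips, so each column totals a single year.

```python
import sqlite3

# Hypothetical demo rows: (category, order year, net revenue)
rows = [
    ("Audio", 2022, 100.0),
    ("Audio", 2023, 40.0),
    ("Audio", 2023, 60.0),
    ("Computers", 2022, 500.0),
    ("Computers", 2023, 300.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_demo (category TEXT, order_year INTEGER, net_revenue REAL)"
)
conn.executemany("INSERT INTO sales_demo VALUES (?, ?, ?)", rows)

# One conditional SUM per year turns the years into columns
pivot_sql = """
SELECT
    category,
    SUM(CASE WHEN order_year = 2023 THEN net_revenue END) AS y2023_total_sales,
    SUM(CASE WHEN order_year = 2022 THEN net_revenue END) AS y2022_total_sales
FROM sales_demo
GROUP BY category
ORDER BY category
"""

pivot = conn.execute(pivot_sql).fetchall()
for row in pivot:
    print(row)
```

Each output row holds the category followed by its 2023 and 2022 totals, which is the same shape as the result above.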


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

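- Combine conditions with `AND` inside `CASE WHEN` to pivot on more than one dimension at once (here, sale size and year).
- `PERCENTILE_CONT` provides a data-driven cutoff for labeling sales.

### 💻 Final Result

- Return low and high total sales by category for 2023 and 2022.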

#### Categorize as Low and High for Total Sale

**`PERCENTILE_CONT`**

1. Find the median value of a single sale made in 2022 and 2023.

To categorize sales as low or high, we'll use the median (50th percentile) as the cutoff, so the split comes from the data instead of a guess:

- **Low**: Below the median.
- **High**: At or above the median.

The same approach extends to three categories (low, moderate, high) by using the 25th percentile (Q1) and 75th percentile (Q3) as cutoffs:

- **Low**: Below the 25th percentile (Q1).
- **Moderate**: Between the 25th and 75th percentiles (Q1 and Q3).
- **High**: Above the 75th percentile (Q3).

To calculate a percentile:
- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th) by estimating values between sorted data points.  
- Syntax:
```sql
SELECT 
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;
```
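
To make "estimating values between sorted data points" concrete, here is a small Python sketch of linear interpolation between the two nearest sorted values, which is how a continuous percentile is commonly computed. The function name `percentile_cont` and the sample values are illustrative only. This is not PostgreSQL's implementation.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile via linear interpolation (illustrative sketch)."""
    # Ignore missing values, mirroring how SQL aggregates skip NULLs
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    # Fractional position of the requested percentile in the sorted data
    position = fraction * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Interpolate between the two neighbouring data points
    return data[lower] + (data[upper] - data[lower]) * weight

sample = [10, 20, 30, 40]  # hypothetical values
median = percentile_cont(sample, 0.5)
q1 = percentile_cont(sample, 0.25)
print(median, q1)
```

With an even number of values the median falls between the two middle points, which is why the result can be a value that does not appear in the data.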

In [7]:
%%sql 

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * exchangerate)) AS median
FROM
    sales s
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
;

Unnamed: 0,median
0,398.0


2. **Validate data**. Confirm the median another way by recomputing it in Python.

In [9]:
import warnings

# Suppress the pandas warning about non-SQLAlchemy connections (set before the query runs)
warnings.filterwarnings(
    'ignore',
    category=UserWarning,
    message=".*only supports SQLAlchemy connectable.*"
)

# SQL query to fetch the net revenue of each sale
query = '''
SELECT 
    s.quantity * s.netprice * s.exchangerate AS net_revenue
FROM 
    sales s
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
ORDER BY 
    net_revenue;
'''

# Fetch data into a pandas DataFrame
sales_df = pd.read_sql_query(query, connection)

# Calculate the median (50th percentile)
median = np.percentile(sales_df['net_revenue'], 50)

print(f"Median (50th percentile): {median}")

Median (50th percentile): 398.0


3. Pivot by category, labeling each sale as low (below the median of 398) or high (at or above it) for sales in 2022 and 2023, then total the sale amounts for each label.

In [10]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398 THEN (s.quantity * s.netprice * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 THEN (s.quantity * s.netprice * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,low_total_sales,high_total_sales
0,Audio,402589.0,1053039.0
1,Cameras and camcorders,237874.0,4128205.0
2,Cell phones,1544149.0,12577660.0
3,Computers,1215131.0,28297950.0
4,Games and Toys,438083.0,148419.3
5,Home Appliances,396058.4,12136380.0
6,"Music, Movies and Audio Books",1260767.0,3909298.0
7,TV and Video,436613.6,9790901.0
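
The query above uses a single cutoff. A three-way split (low, moderate, high) at the 25th and 75th percentiles would add one more conditional column. Below is a minimal Python sketch of that labeling with made-up sale amounts (none of the values come from `contoso_100k`). It assumes `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between sorted data points.

```python
import statistics

# Hypothetical sale amounts, for illustration only
sale_amounts = [50.0, 120.0, 300.0, 398.0, 450.0, 900.0, 1500.0, 2400.0, 5000.0]

# n=4 returns the three quartile cut points: Q1, median, Q3
q1, median, q3 = statistics.quantiles(sale_amounts, n=4, method="inclusive")

def label_sale(amount):
    """Label a sale as low (< Q1), moderate (Q1 to Q3), or high (> Q3)."""
    if amount < q1:
        return "low"
    if amount <= q3:
        return "moderate"
    return "high"

# Equivalent of one conditional SUM per label
totals = {"low": 0.0, "moderate": 0.0, "high": 0.0}
for amount in sale_amounts:
    totals[label_sale(amount)] += amount

print(q1, median, q3)
print(totals)
```

In SQL, each key of `totals` would correspond to one `SUM(CASE WHEN ... END)` column, with the Q1 and Q3 values taken from `PERCENTILE_CONT(0.25)` and `PERCENTILE_CONT(0.75)`.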


4. Add the year to the conditions to pivot by category, sales label, and year, comparing 2023 vs 2022 sales for each label.

In [11]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,y2023_low_total_sales,y2023_high_total_sales,y2022_low_total_sales,y2022_high_total_sales
0,Audio,180251.127651,508439.1,222337.826357,544600.4
1,Cameras and camcorders,104869.455535,1878677.0,133004.539868,2249528.0
2,Cell phones,729699.394352,5272448.0,814449.529033,7305216.0
3,Computers,590790.306691,11060080.0,624340.418719,17237870.0
4,Games and Toys,206103.364757,64271.6,231979.632972,84147.67
5,Home Appliances,176261.350547,5743732.0,219797.072756,6392650.0
6,"Music, Movies and Audio Books",574958.761525,1605809.0,685808.485356,2303489.0
7,TV and Video,164275.353074,4247903.0,272338.286919,5542998.0


### 💡 Note

You can also pivot with other aggregate functions, such as `AVG`, `MIN`, and `MAX`, though it is less common. The pattern is the same as in the last example: keep the `CASE WHEN` expression and swap `SUM` for the other aggregate.
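
As a rough illustration of pivoting with `AVG`, `MIN`, and `MAX`, here is a minimal sketch that runs against an in-memory SQLite database via Python's built-in `sqlite3` module. The `sales_demo` table and its values are hypothetical stand-ins, not the course's `sales` table.

```python
import sqlite3

# Hypothetical demo rows: (category, order year, net revenue)
rows = [
    ("Audio", 2022, 100.0),
    ("Audio", 2023, 40.0),
    ("Audio", 2023, 60.0),
    ("Computers", 2022, 500.0),
    ("Computers", 2022, 700.0),
    ("Computers", 2023, 300.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_demo (category TEXT, order_year INTEGER, net_revenue REAL)"
)
conn.executemany("INSERT INTO sales_demo VALUES (?, ?, ?)", rows)

# Same CASE WHEN pattern as with SUM, wrapped in other aggregates
stats_sql = """
SELECT
    category,
    AVG(CASE WHEN order_year = 2023 THEN net_revenue END) AS y2023_avg_sale,
    MIN(CASE WHEN order_year = 2023 THEN net_revenue END) AS y2023_min_sale,
    MAX(CASE WHEN order_year = 2022 THEN net_revenue END) AS y2022_max_sale
FROM sales_demo
GROUP BY category
ORDER BY category
"""

stats = conn.execute(stats_sql).fetchall()
for row in stats:
    print(row)
```

Leaving out `ELSE` matters more here than with `SUM`: an `ELSE 0` would pull zeros into the `AVG` and `MIN` calculations, while the implicit `NULL` is skipped by all three aggregates.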