<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Conditional_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Aggregation

**Product focused**

## Overview

### 🥅 Analysis Goals

Run an exploratory data analysis (EDA) of the products and categories ordered, using the `sales` table:
- Compare total sales of products ordered in 2023 and 2022.
- Find total sales in 2023 and 2022.
- Categorize sales as low, moderate, or high, and pivot the sales by category and year.

### 📘 Concepts Covered

- `SUM` Review
- `SUM` with `CASE WHEN`
- Pivot with Multiple CASE WHEN Statements

---

In [21]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (jupysql)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Set up the connection parameters for this notebook
import psycopg2
import pandas as pd
import numpy as np

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)



---
## SUM Review

### 📝 Notes

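- `SUM` adds up a numeric expression across all rows in each group defined by `GROUP BY`.
- Filtering by year in the `WHERE` clause returns one year per query, so comparing 2023 with 2022 this way takes two queries.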

### 💻 Final Result

- Return the total sales for each product category, first for 2023 and then for 2022.

#### Total Sales by Category for 2022 and 2023

**`SUM` with `GROUP BY`**

1. Find the total sales for each entry by multiplying `quantity` by `unitprice` and `exchangerate` (since not all sales are made in `USD`). All three columns come from the `sales` table.

2. Filter the data to only return data from 2023 and return the `categoryname`.

In [23]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    p.categoryname,
    s.quantity,
    s.unitprice,
    s.quantity * s.unitprice * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' --Added
ORDER BY
    s.orderkey

Unnamed: 0,orderkey,orderdate,customerkey,storekey,productkey,categoryname,quantity,unitprice,total_sale_amount
0,2923000,2023-01-01,239821,90,1581,"Music, Movies and Audio Books",7,219.00,2075.421390
1,2923001,2023-01-01,1025340,999999,2013,Home Appliances,1,665.94,553.762407
2,2923002,2023-01-01,686958,120,1602,"Music, Movies and Audio Books",3,179.99,506.254273
3,2923002,2023-01-01,686958,120,349,Computers,1,383.00,359.085480
4,2923002,2023-01-01,686958,120,1644,"Music, Movies and Audio Books",1,57.88,54.265973
...,...,...,...,...,...,...,...,...,...
37512,3287006,2023-12-31,334289,999999,1440,Cell phones,1,189.00,250.438230
37513,3287007,2023-12-31,1398584,480,786,Computers,2,11.50,23.000000
37514,3287007,2023-12-31,1398584,480,1463,Cell phones,3,293.00,879.000000
37515,3287007,2023-12-31,1398584,480,1527,Cell phones,1,268.00,268.000000


3. Aggregate the data to get the total sales by category.
    - Remove other columns except for category
    - Aggregate by category

In [24]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * s.unitprice * s.exchangerate) AS total_sale_amount -- Added
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname -- Added
ORDER BY
    p.categoryname -- Added

Unnamed: 0,category_name,total_sale_amount
0,Audio,730647.9
1,Cameras and camcorders,2107966.0
2,Cell phones,6383098.0
3,Computers,12373770.0
4,Games and Toys,286481.7
5,Home Appliances,6317839.0
6,"Music, Movies and Audio Books",2321667.0
7,TV and Video,4699135.0


4. For 2022, change the date filter in the `WHERE` clause to cover 2022 instead of 2023.

In [25]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * s.unitprice * s.exchangerate) AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

Unnamed: 0,category_name,total_sale_amount
0,Audio,814438.9
1,Cameras and camcorders,2539234.0
2,Cell phones,8623592.0
3,Computers,19003050.0
4,Games and Toys,335683.2
5,Home Appliances,7026622.0
6,"Music, Movies and Audio Books",3178853.0
7,TV and Video,6187337.0
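
The two queries above differ only in their date filter. As a minimal, self-contained sketch of that pattern, the snippet below uses Python's built-in `sqlite3` module with a few made-up rows. The `toy_sales` table, its column names, and its values are hypothetical and are not taken from `contoso_100k`.

```python
import sqlite3

# Hypothetical toy rows: (category, order_date, quantity, unit_price, exchange_rate)
TOY_ROWS = [
    ("Audio", "2022-03-01", 2, 10.0, 1.0),
    ("Audio", "2023-05-10", 1, 20.0, 1.5),
    ("Computers", "2022-07-15", 1, 500.0, 1.0),
    ("Computers", "2023-01-02", 2, 400.0, 1.0),
    ("Computers", "2023-11-30", 1, 100.0, 0.5),
]


def make_toy_db():
    """Create an in-memory database holding the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toy_sales ("
        "category TEXT, order_date TEXT, quantity INTEGER, "
        "unit_price REAL, exchange_rate REAL)"
    )
    conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?, ?, ?)", TOY_ROWS)
    return conn


def total_sales_by_category(conn, year):
    """Sum quantity * unit_price * exchange_rate per category for one year."""
    query = """
        SELECT
            category,
            SUM(quantity * unit_price * exchange_rate) AS total_sale_amount
        FROM toy_sales
        WHERE order_date BETWEEN ? AND ?
        GROUP BY category
        ORDER BY category
    """
    return conn.execute(query, (f"{year}-01-01", f"{year}-12-31")).fetchall()


conn = make_toy_db()
totals_2022 = total_sales_by_category(conn, 2022)
totals_2023 = total_sales_by_category(conn, 2023)
print(totals_2022)
print(totals_2023)
```

Each call returns one year, so a year-over-year comparison still needs the two result sets to be lined up by hand.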


---
## SUM with CASE WHEN

### 📝 Notes

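- A `CASE WHEN` expression inside `SUM` passes a row's value to the aggregate only when the condition is true. With no `ELSE` branch, non-matching rows yield `NULL`, which `SUM` ignores.
- One conditional `SUM` per year puts both years side by side in a single query, replacing the two separate `WHERE`-filtered queries.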

### 💻 Final Result

- Return the total sales for each product category, with 2023 and 2022 as separate columns.

#### Total Sales by Category and Year

**`CASE WHEN` and `SUM`**

1. Pivot to get the total sales by category and compare 2023 with 2022.

In [26]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.unitprice * s.exchangerate) END) AS y2023_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.unitprice * s.exchangerate) END) AS y2022_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2023_total_sales,y2022_total_sales
0,Audio,730647.9,814438.9
1,Cameras and camcorders,2107966.0,2539234.0
2,Cell phones,6383098.0,8623592.0
3,Computers,12373770.0,19003050.0
4,Games and Toys,286481.7,335683.2
5,Home Appliances,6317839.0,7026622.0
6,"Music, Movies and Audio Books",2321667.0,3178853.0
7,TV and Video,4699135.0,6187337.0
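
To see what the `CASE WHEN` inside `SUM` does when a category has no rows for a given year, here is a small sketch using Python's built-in `sqlite3` module. The `toy_sales` table and its values are hypothetical. The `amount` column stands in for `quantity * unitprice * exchangerate`, and the zero-filled column is an illustrative variation that the query above does not use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (category TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        ("Audio", "2022-03-01", 20.0),
        ("Audio", "2023-05-10", 30.0),
        ("Computers", "2022-07-15", 500.0),
        ("Computers", "2023-01-02", 800.0),
        ("Games", "2022-09-09", 5.0),  # deliberately has no 2023 sales
    ],
)

PIVOT_QUERY = """
SELECT
    category,
    -- No ELSE branch: non-matching rows become NULL and are skipped by SUM
    SUM(CASE WHEN order_date BETWEEN '2023-01-01' AND '2023-12-31' THEN amount END) AS y2023_total_sales,
    SUM(CASE WHEN order_date BETWEEN '2022-01-01' AND '2022-12-31' THEN amount END) AS y2022_total_sales,
    -- ELSE 0: a category with no matching rows reports 0 instead of NULL
    SUM(CASE WHEN order_date BETWEEN '2023-01-01' AND '2023-12-31' THEN amount ELSE 0 END) AS y2023_zero_filled
FROM toy_sales
GROUP BY category
ORDER BY category
"""

pivot_rows = conn.execute(PIVOT_QUERY).fetchall()
for row in pivot_rows:
    print(row)
```

In this sketch the `Games` row comes back with `None` (SQL `NULL`) for 2023 in the first column and `0` in the zero-filled column. Which of the two is preferable depends on whether "no sales" and "zero sales" should look different downstream.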


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

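- Each pivoted column is a `SUM` over a `CASE WHEN` whose condition selects a sale-size range and, optionally, a year.
- `PERCENTILE_CONT` gives data-driven cutoffs for the low, moderate, and high ranges.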

### 💻 Final Result

- Return the total sales for each product category, split into low, moderate, and high sales for 2023 and 2022.

#### Categorize as Low, Moderate and High for Total Sale

**`PERCENTILE_CONT` and `CASE WHEN`**

1. Find the minimum, 25th percentile, 75th percentile, and maximum for a single total sale made in 2022 and 2023.

To categorize sales into low, moderate, and high, we'll use the 25th percentile (Q1) and 75th percentile (Q3). It lets us segment the data into three meaningful categories, instead of guessing:

- **Low**: Below the 25th percentile (Q1).
- **Moderate**: Between the 25th and 75th percentiles (Q1 and Q3).
- **High**: Above the 75th percentile (Q3).

To calculate the 25th and 75th percentiles, use `PERCENTILE_CONT`:
- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th) by estimating values between sorted data points.  
- Syntax:
```sql
SELECT 
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;
```
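
To make "estimating values between sorted data points" concrete, the sketch below reimplements the linear interpolation in plain Python. The function name `percentile_cont` and the sample lists are illustrative assumptions. This is a sketch of the interpolation rule, not PostgreSQL's actual implementation.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile of the non-NULL values.

    The target position in the sorted data is fraction * (n - 1), counting
    from 0. If it falls between two data points, the result is interpolated
    between them in proportion to the fractional part of the position.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    data = sorted(v for v in values if v is not None)  # NULLs are ignored
    if not data:
        return None
    position = fraction * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return data[lower] + (data[upper] - data[lower]) * weight


print(percentile_cont([1, 2, 3, 4], 0.5))        # halfway between 2 and 3
print(percentile_cont([10, 20, 30, 40], 0.75))   # a quarter of the way from 30 to 40
```

With four values, the 75th percentile sits at position 2.25, so the result is 30 plus a quarter of the gap to 40.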

In [27]:
%%sql 

SELECT
    MIN(s.quantity * s.unitprice * exchangerate) AS minimum_sales,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY (s.quantity * s.unitprice * exchangerate)) AS q1_sales,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY (s.quantity * s.unitprice * exchangerate)) AS q3_sales,
    MAX(s.quantity * s.unitprice * exchangerate) AS maximum_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
;

Unnamed: 0,minimum_sales,q1_sales,q3_sales,maximum_sales
0,0.864576,118.0,1128.0,38082.66084


2. **Validate data**. Cross-check the percentiles by recomputing them in Python with NumPy.


In [28]:
import warnings

# Suppress the pandas warning about non-SQLAlchemy connections (psycopg2).
# The filter is registered before the query runs so that it takes effect.
warnings.filterwarnings(
    'ignore',
    category=UserWarning,
    message=".*only supports SQLAlchemy connectable.*"
)

# SQL query to fetch the total sale amount of every sale in 2022 and 2023
query = '''
SELECT 
    s.quantity * s.unitprice * exchangerate AS total_sale_amount
FROM 
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
ORDER BY 
    total_sale_amount;
'''

# Fetch data into a Pandas DataFrame
sales_df = pd.read_sql_query(query, connection)


# Calculate percentiles
q1 = np.percentile(sales_df, 25)
q3 = np.percentile(sales_df, 75)

print(f"25th Percentile (Q1): {q1}")
print(f"75th Percentile (Q3): {q3}")

25th Percentile (Q1): 118.0
75th Percentile (Q3): 1128.0


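3. Pivot by category, classifying each sale in 2022 and 2023 as low, moderate, or high, and sum the sale amounts in each bucket. The query below hard-codes cutoffs of 114.59 and 1064.00, which differ from the Q1 (118.0) and Q3 (1128.0) values returned in step 1; substitute the step 1 values to apply the percentile definition exactly.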

In [29]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) < 114.59 THEN (s.quantity * s.unitprice * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 114.59 AND (s.quantity * s.unitprice * exchangerate) < 1064.00 THEN (s.quantity * s.unitprice * exchangerate) END) AS mod_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 1064.00 THEN (s.quantity * s.unitprice * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,low_total_sales,mod_total_sales,high_total_sales
0,Audio,48645.534494,978843.4,517597.8
1,Cameras and camcorders,22033.305577,916169.3,3708998.0
2,Cell phones,211053.242184,5222102.0,9573535.0
3,Computers,113554.564758,5161734.0,26101530.0
4,Games and Toys,198526.011477,383192.8,40446.02
5,Home Appliances,34252.624856,1672944.0,11637260.0
6,"Music, Movies and Audio Books",281043.885642,2853328.0,2366148.0
7,TV and Video,21475.979833,1773837.0,9091159.0
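
Because each cutoff appears several times in the query, one option is to pass the cutoffs as query parameters so that every condition uses the same values. The sketch below shows this with Python's built-in `sqlite3` module. The `toy_sales` table, its rows, and the parameter names `low_cutoff` and `high_cutoff` are hypothetical. The cutoff values are copied from the query above.

```python
import sqlite3

# Hypothetical toy data; amount stands in for quantity * unitprice * exchangerate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [
        ("Audio", 50.0),
        ("Audio", 200.0),
        ("Audio", 1500.0),
        ("Computers", 100.0),
        ("Computers", 2000.0),
        ("Computers", 3000.0),
    ],
)

# Each named parameter is defined once and reused in every condition.
BUCKET_PIVOT = """
SELECT
    category,
    SUM(CASE WHEN amount < :low_cutoff THEN amount END) AS low_total_sales,
    SUM(CASE WHEN amount >= :low_cutoff AND amount < :high_cutoff THEN amount END) AS mod_total_sales,
    SUM(CASE WHEN amount >= :high_cutoff THEN amount END) AS high_total_sales
FROM toy_sales
GROUP BY category
ORDER BY category
"""

cutoffs = {"low_cutoff": 114.59, "high_cutoff": 1064.00}
bucket_rows = conn.execute(BUCKET_PIVOT, cutoffs).fetchall()
for row in bucket_rows:
    print(row)
```

The `:name` placeholder style is specific to `sqlite3`. With `psycopg2`, the equivalent named placeholder is written `%(low_cutoff)s`.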


4. Add the year to each condition to pivot by category, sales label, and year, so that 2023 and 2022 sales can be compared for each label.

In [30]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) < 114.59
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 114.59 AND (s.quantity * s.unitprice * exchangerate) < 1064.00 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2023_mod_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 1064.00 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2023_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) < 114.59 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 114.59 AND (s.quantity * s.unitprice * exchangerate) < 1064.00 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2022_mod_total_sales,
    SUM(CASE WHEN (s.quantity * s.unitprice * exchangerate) >= 1064.00 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.unitprice * exchangerate) END) AS y2022_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,y2023_low_total_sales,y2023_mod_total_sales,y2023_high_total_sales,y2022_low_total_sales,y2022_mod_total_sales,y2022_high_total_sales
0,Audio,20323.109896,444781.1,265543.7,28322.424598,534062.4,252054.1
1,Cameras and camcorders,10786.24239,415790.4,1681389.0,11247.063187,500378.9,2027608.0
2,Cell phones,95435.786047,2472805.0,3814857.0,115617.456137,2749297.0,5758678.0
3,Computers,57090.218488,2284592.0,10032090.0,56464.34627,2877142.0,16069440.0
4,Games and Toys,90668.263703,179538.8,16274.62,107857.747774,203654.0,24171.39
5,Home Appliances,16079.263003,754234.9,5547525.0,18173.361852,918709.0,6089740.0
6,"Music, Movies and Audio Books",133228.202369,1262702.0,925736.7,147815.683273,1590626.0,1440411.0
7,TV and Video,7811.140475,721944.1,3969380.0,13664.839358,1051893.0,5121779.0
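
The six `SUM(CASE WHEN ...)` columns above follow one template: a sales-label condition combined with a year condition. As a hedged sketch, the column list can be generated from a list of years and a mapping of labels to conditions. The helper `build_pivot_query`, the `toy_sales` table, and its rows are hypothetical. The cutoffs and column names mirror the query above.

```python
import sqlite3

YEARS = [2023, 2022]
# label -> SQL condition on the sale amount (cutoffs copied from the query above)
BUCKETS = {
    "low": "amount < 114.59",
    "mod": "amount >= 114.59 AND amount < 1064.00",
    "high": "amount >= 1064.00",
}


def build_pivot_query(years, buckets, table="toy_sales"):
    """Build one SUM(CASE WHEN ...) column per (year, label) pair.

    The pieces are interpolated directly into the SQL text, so this is only
    appropriate for trusted constants like the ones above, never user input.
    """
    columns = []
    for year in years:
        for label, condition in buckets.items():
            columns.append(
                f"SUM(CASE WHEN {condition} "
                f"AND order_date BETWEEN '{year}-01-01' AND '{year}-12-31' "
                f"THEN amount END) AS y{year}_{label}_total_sales"
            )
    select_list = ",\n    ".join(columns)
    return (
        f"SELECT\n    category,\n    {select_list}\n"
        f"FROM {table}\nGROUP BY category\nORDER BY category"
    )


# Hypothetical toy data; amount stands in for quantity * unitprice * exchangerate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (category TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        ("Audio", "2022-02-01", 50.0),
        ("Audio", "2023-02-01", 200.0),
        ("Audio", "2023-06-01", 1500.0),
        ("Audio", "2023-07-01", 300.0),
    ],
)

cursor = conn.execute(build_pivot_query(YEARS, BUCKETS))
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```

Adding another year or label then means changing `YEARS` or `BUCKETS` rather than copying another pair of lines, at the cost of a query that is harder to read at a glance than the hand-written version.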
