<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Conditional_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Aggregation

**Product focused**

## Overview

### 🥅 Analysis Goals

- Explore the products and categories ordered in the `sales` table (EDA):
    - Find total sales by product category for 2023 and for 2022.
    - Compare the 2023 and 2022 totals side by side.
- The end goal is to identify how each product category's sales in 2023 compare with 2022, both overall and by sale size (low, moderate, high).

### 📘 Concepts Covered

General concepts we’re going to cover

- Aggregation Review
- `SUM` with `CASE WHEN`
- Pivot with multiple `CASE WHEN` statements

---

In [1]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab, ipython-sql otherwise)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Set up the connection parameters for this notebook
import psycopg2
import pandas as pd
import numpy as np

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)

---
## SUM Review

### 📝 Notes

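- `SUM` adds up a numeric expression across all rows in each `GROUP BY` group; a `WHERE` clause limits which rows are included.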

### 💻 Final Result

- Return the total sales for each product category in 2023, then repeat for 2022.

#### Problem Description

**`SUM` and `GROUP BY`**

1. Find the total sale amount for each row of the `sales` table by multiplying `quantity` by the `price` from the `product` table and by `exchangerate` (since not all sales are made in `USD`).

In [2]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,storekey,productkey,quantity,price,total_sale_amount
1000,2015-01-01,947009,400,48,1,149.95,96.2004225
1000,2015-01-01,947009,400,460,1,299.9,192.400845
1001,2015-01-01,1772036,430,1730,2,77.68,155.36
1002,2015-01-01,1518349,660,955,4,196.9,787.6
1002,2015-01-01,1518349,660,62,7,181.0,1267.0
1002,2015-01-01,1518349,660,1050,3,312.0,936.0
1002,2015-01-01,1518349,660,1608,1,109.99,109.99
1003,2015-01-01,1317097,510,85,3,99.99,299.97
1004,2015-01-01,254117,80,128,2,143.4,332.203308
1004,2015-01-01,254117,80,2079,1,665.94,771.3649614000001


2. Filter the data to orders from 2023 and add the `categoryname` column.

In [3]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    p.categoryname,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' --Added
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,storekey,productkey,categoryname,quantity,price,total_sale_amount
2923000,2023-01-01,239821,90,1581,"Music, Movies and Audio Books",7,219.0,2075.42139
2923001,2023-01-01,1025340,999999,2013,Home Appliances,1,665.94,553.762407
2923002,2023-01-01,686958,120,1602,"Music, Movies and Audio Books",3,179.99,506.2542732
2923002,2023-01-01,686958,120,349,Computers,1,383.0,359.08548
2923002,2023-01-01,686958,120,1644,"Music, Movies and Audio Books",1,57.88,54.2659728
2923003,2023-01-01,1889683,470,371,Computers,3,599.0,1797.0
2923003,2023-01-01,1889683,470,1605,"Music, Movies and Audio Books",6,289.99,1739.94
2923003,2023-01-01,1889683,470,1258,Cameras and camcorders,1,39.99,39.99
2923003,2023-01-01,1889683,470,1976,Home Appliances,3,899.0,2697.0
2923004,2023-01-01,55996,999999,2467,Home Appliances,3,30.99,136.78769069999998


3. Aggregate the data to get the total sales by category.
    - Remove all columns except the category
    - Aggregate by category

In [4]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * p.price * s.exchangerate) AS total_sale_amount -- Added
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname -- Added
ORDER BY
    p.categoryname -- Added

category_name,total_sale_amount
Audio,730647.8724823
Cameras and camcorders,2107965.6327211987
Cell phones,6383097.762667844
Computers,12373767.735130329
Games and Toys,286481.6953874804
Home Appliances,6317839.183700321
"Music, Movies and Audio Books",2321667.2394959824
TV and Video,4699134.796674995


4. To get the 2022 totals, change the date filter in the `WHERE` clause to 2022.

In [5]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * p.price * s.exchangerate) AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

category_name,total_sale_amount
Audio,854127.3322440994
Cameras and camcorders,2429201.739937798
Cell phones,7342863.472145045
Computers,15548062.129970036
Games and Toys,351464.6304658012
Home Appliances,7374114.8490392305
"Music, Movies and Audio Books",2814693.739286461
TV and Video,6338489.860811012
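
The two filtered queries above can be reproduced on a toy dataset. The following is a minimal sketch, assuming an in-memory SQLite database in place of PostgreSQL, with hypothetical rows and a hypothetical helper `totals_for_year`. Dates are stored as ISO text, so the `::date` cast is dropped.

```python
import sqlite3

# In-memory stand-in for the course database (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (productkey INTEGER PRIMARY KEY, categoryname TEXT, price REAL);
CREATE TABLE sales (orderdate TEXT, productkey INTEGER, quantity INTEGER, exchangerate REAL);
INSERT INTO product VALUES (1, 'Audio', 10.0), (2, 'Computers', 100.0);
INSERT INTO sales VALUES
    ('2022-03-01', 1, 2, 1.0),
    ('2023-05-10', 1, 1, 1.0),
    ('2023-07-04', 2, 3, 0.5),
    ('2023-11-20', 2, 1, 1.0);
""")


def totals_for_year(year):
    """Total sale amount per category for one calendar year."""
    query = """
        SELECT p.categoryname, SUM(s.quantity * p.price * s.exchangerate)
        FROM sales s
        LEFT JOIN product p ON s.productkey = p.productkey
        WHERE s.orderdate BETWEEN ? AND ?
        GROUP BY p.categoryname
        ORDER BY p.categoryname
    """
    return conn.execute(query, (f"{year}-01-01", f"{year}-12-31")).fetchall()


totals_2023 = totals_for_year(2023)
totals_2022 = totals_for_year(2022)
print(totals_2023)
print(totals_2022)
```

In this toy data `Computers` has no 2022 sales, so it is missing from the 2022 result entirely. Comparing years this way means running two queries and lining up the rows by hand.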


---
## SUM with CASE WHEN

### 📝 Notes

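- Wrapping a `CASE WHEN` expression inside `SUM` adds up only the rows that meet the condition, so a single query can return one column per condition. Rows that don't match produce `NULL`, which `SUM` ignores.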

### 💻 Final Result

- Return the total sales for each category, with 2023 and 2022 side by side as separate columns.

#### Total Sales by Category and Year

**`CASE WHEN` and `SUM`**

1. Pivot to get the total sales by category and compare 2023 with 2022.

In [6]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * s.exchangerate) END) AS y2023_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * s.exchangerate) END) AS y2022_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

category,y2023_total_sales,y2022_total_sales
Audio,730647.8724822998,854127.3322440991
Cameras and camcorders,2107965.6327211983,2429201.7399377986
Cell phones,6383097.76266784,7342863.472145045
Computers,12373767.73513032,15548062.129970033
Games and Toys,286481.69538748014,351464.63046580146
Home Appliances,6317839.183700321,7374114.849039231
"Music, Movies and Audio Books",2321667.2394959824,2814693.73928646
TV and Video,4699134.796674997,6338489.86081101
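
To see how the pivot behaves when a category has no rows for one of the years, here is a minimal sketch on hypothetical data, again assuming SQLite in place of PostgreSQL. The table `sale_lines` and its precomputed `amount` column are illustrative stand-ins for the joined `sales` and `product` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale_lines (category TEXT, orderdate TEXT, amount REAL);
INSERT INTO sale_lines VALUES
    ('Audio',     '2022-03-01',  20.0),
    ('Audio',     '2023-05-10',  10.0),
    ('Computers', '2023-07-04', 150.0),
    ('Computers', '2023-11-20', 100.0);
""")

# One pass over the table; each SUM only sees the rows its CASE lets through.
pivot = conn.execute("""
    SELECT
        category,
        SUM(CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN amount END) AS y2023,
        SUM(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN amount END) AS y2022,
        SUM(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN amount ELSE 0 END) AS y2022_zero_filled
    FROM sale_lines
    GROUP BY category
    ORDER BY category
""").fetchall()

for row in pivot:
    print(row)
```

Without an `ELSE` branch, a category with no matching rows returns `NULL` (`None` in Python) rather than zero. Adding `ELSE 0` is one way to get a zero instead. Which is preferable depends on whether "no sales" and "zero sales" should be distinguishable.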


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

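- Conditions inside `CASE WHEN` can be combined with `AND`, so each pivoted column can depend on more than one condition (e.g. sale size and year).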

### 💻 Final Result

- Return the total sales for each category split into low, moderate, and high sales, for 2023 and 2022.

#### Categorize as Low, Moderate and High for Total Sale

**`PERCENTILE_CONT`, `CASE WHEN`, and `SUM`**

1. Find the minimum, 25th percentile, 75th percentile, and maximum of the individual sale amounts in 2022 and 2023.

To categorize sales into low, moderate, and high, we'll use the 25th percentile (Q1) and 75th percentile (Q3). This segments the data into three meaningful categories based on the data itself, instead of guessed cutoffs:

- **Low**: Below the 25th percentile (Q1).
- **Moderate**: Between the 25th and 75th percentiles (Q1 and Q3).
- **High**: Above the 75th percentile (Q3).
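
The three labels can be written as a small function. This is a sketch only: `label_sale` is a hypothetical helper, and the boundary handling (a value exactly at Q1 counts as Moderate, exactly at Q3 as High) is an assumption chosen to match the `<` and `>=` comparisons used in the pivot query later in this section.

```python
def label_sale(amount, q1, q3):
    """Label one sale amount using the quartile cutoffs q1 and q3."""
    if amount < q1:
        return "Low"
    if amount < q3:
        return "Moderate"
    return "High"


# Cutoffs rounded the same way as in the pivot query later in this section.
q1, q3 = 114.59, 1064.00
labels = [label_sale(amount, q1, q3) for amount in (50.0, 114.59, 500.0, 1064.00)]
print(labels)
```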

To calculate the percentiles (here the 25th and 75th):
- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th) by interpolating between adjacent values in the sorted data.
- Syntax:
```sql
SELECT 
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;
```
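
To make "interpolating between adjacent values" concrete, here is a minimal Python sketch of linear-interpolation percentiles. The function name `percentile_cont` is borrowed for illustration only; this is not PostgreSQL's implementation, and the sample values are hypothetical.

```python
import math
import statistics


def percentile_cont(values, fraction):
    """Linearly interpolated percentile of a non-empty list, 0 <= fraction <= 1."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # fractional zero-based index
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower  # how far to move from the lower to the upper value
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


amounts = [10.0, 20.0, 40.0, 80.0]  # hypothetical sale amounts
q1 = percentile_cont(amounts, 0.25)
median = percentile_cont(amounts, 0.5)
q3 = percentile_cont(amounts, 0.75)
print(q1, median, q3)

# The standard library's "inclusive" quartiles use the same interpolation.
print(statistics.quantiles(amounts, n=4, method="inclusive"))
```

With four values, the 25th percentile sits at fractional index 0.75, i.e. three quarters of the way from 10 to 20, so none of the three results is an actual data point.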

In [19]:
%%sql 

SELECT
    MIN(s.quantity * p.price * exchangerate) AS minimum_sales,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q1_sales,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q3_sales,
    MAX(s.quantity * p.price * exchangerate) AS maximum_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
;

minimum_sales,q1_sales,q3_sales,maximum_sales
0.864576,114.589244675,1064.0,37037.222285


2. **Validate the data**. Recalculate the percentiles in Python to confirm the SQL result.

In [20]:
# SQL Query to fetch data
query = '''
SELECT 
    s.quantity * p.price * exchangerate AS total_sale_amount
FROM 
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
ORDER BY 
    total_sale_amount;
'''

import warnings

# Suppress the Pandas warning about using a raw psycopg2 connection.
# The filter has to be registered before the query runs to take effect.
warnings.filterwarnings(
    'ignore',
    category=UserWarning,
    message=".*only supports SQLAlchemy connectable.*"
)

# Fetch data into a Pandas DataFrame
sales_df = pd.read_sql_query(query, connection)

# Calculate percentiles on the sale amount column
q1 = np.percentile(sales_df['total_sale_amount'], 25)
q3 = np.percentile(sales_df['total_sale_amount'], 75)

print(f"25th Percentile (Q1): {q1}")
print(f"75th Percentile (Q3): {q3}")

25th Percentile (Q1): 114.589244675
75th Percentile (Q3): 1064.0


3. Pivot by category and label each sale as low, moderate, or high using the 25th and 75th percentiles of sales in 2022 and 2023 (rounded to `114.59` and `1064.00`). Then get the total sale amount for each label.

In [17]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 114.59 THEN (s.quantity * p.price * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 114.59 AND (s.quantity * p.price * exchangerate) < 1064.00 THEN (s.quantity * p.price * exchangerate) END) AS mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 1064.00 THEN (s.quantity * p.price * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

category,low_total_sales,mod_total_sales,high_total_sales
Audio,45969.24364080004,989486.8335323985,549319.1275532001
Cameras and camcorders,22088.60888519998,919585.0044553,3595493.7593185008
Cell phones,213232.41172610116,5325644.986252799,8187083.836833905
Computers,121006.3471812001,5329935.284563575,22470888.23335549
Games and Toys,201038.58898816,396222.2009611201,40685.535904
Home Appliances,33174.97112309997,1677420.9750328902,11981358.086583484
"Music, Movies and Audio Books",286290.37461500184,2790983.5849920614,2059087.0191754012
TV and Video,19586.119747599987,1760333.6261631963,9257704.911575217
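
One property worth checking is that the three conditions cover every sale exactly once, so the low, moderate, and high totals add up to the overall total. The sketch below checks this on hypothetical data, assuming SQLite in place of PostgreSQL and an illustrative `sale_lines` table with a precomputed `amount` column.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale_lines (category TEXT, amount REAL);
INSERT INTO sale_lines VALUES
    ('Audio', 50.0), ('Audio', 114.59), ('Audio', 2000.0),
    ('Computers', 90.0), ('Computers', 10.0), ('Computers', 500.0), ('Computers', 1064.0);
""")

rows = conn.execute("""
    SELECT
        category,
        SUM(CASE WHEN amount < 114.59 THEN amount END) AS low_total,
        SUM(CASE WHEN amount >= 114.59 AND amount < 1064.00 THEN amount END) AS mod_total,
        SUM(CASE WHEN amount >= 1064.00 THEN amount END) AS high_total,
        SUM(amount) AS all_total
    FROM sale_lines
    GROUP BY category
    ORDER BY category
""").fetchall()

for category, low, mod, high, overall in rows:
    print(category, low, mod, high, overall)
    # No sale is double-counted or dropped by the three conditions.
    assert math.isclose(low + mod + high, overall)
```

Pairing `<` on one bucket with `>=` on the next is what keeps boundary values (here `114.59` and `1064.0`) from landing in two buckets or in none.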


4. Add the year to each condition so the pivot covers category, sale label, and year, allowing a comparison of 2023 vs. 2022 sales for each label.

In [18]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 114.59
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 114.59 AND (s.quantity * p.price * exchangerate) < 1064.00 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 1064.00 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_high_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 114.59 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 114.59 AND (s.quantity * p.price * exchangerate) < 1064.00 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 1064.00 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

category,y2023_low_total_sales,y2023_mod_total_sales,y2023_high_total_sales,y2022_low_total_sales,y2022_mod_total_sales,y2022_high_total_sales
Audio,20323.109895699992,444781.0624753999,265543.7001112001,25646.1337451,544705.7710569997,283775.427442
Cameras and camcorders,10786.242390299998,415790.3580089003,1681389.0323219995,11302.36649489999,503794.6464464001,1914104.7269965005
Cell phones,95435.78604670016,2472804.5601229984,3814857.4164980976,117796.62567940015,2852840.4261297905,4372226.420335799
Computers,57090.218487800026,2284591.8917785,10032085.624864,63916.12869340008,3045343.392785094,12438802.60849149
Games and Toys,90668.26370327995,179538.80774420005,16274.623939999998,110370.32528487987,216683.39321691997,24410.911964
Home Appliances,16079.263003399985,754234.9155321986,5547525.005164703,17095.70811969998,923186.0595006978,6433833.0814188
"Music, Movies and Audio Books",133228.2023688998,1262702.2980729928,925736.7390541004,153062.17224610006,1528281.2869190886,1133350.2801213
TV and Video,7811.140474800001,721944.0762298997,3969379.579970296,11774.979272799996,1038389.549933299,5288325.3316049
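
The six columns above are every combination of three sale labels and two years. As an optional aside, the sketch below generates the same column pattern from two small collections. This is a different technique from writing the query by hand, shown under assumptions: SQLite stands in for PostgreSQL, `sale_lines` is a hypothetical table, and string-built SQL is only acceptable here because every piece is a hard-coded constant rather than user input.

```python
import sqlite3

# Hard-coded label conditions and years (never build SQL from user input this way).
buckets = {
    "low": "amount < 114.59",
    "mod": "amount >= 114.59 AND amount < 1064.00",
    "high": "amount >= 1064.00",
}
years = [2023, 2022]

columns = [
    f"SUM(CASE WHEN {condition} "
    f"AND orderdate BETWEEN '{year}-01-01' AND '{year}-12-31' "
    f"THEN amount END) AS y{year}_{name}_total_sales"
    for year in years
    for name, condition in buckets.items()
]
query = (
    "SELECT category,\n    "
    + ",\n    ".join(columns)
    + "\nFROM sale_lines\nGROUP BY category\nORDER BY category"
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale_lines (category TEXT, orderdate TEXT, amount REAL);
INSERT INTO sale_lines VALUES
    ('Audio',     '2023-02-01',   50.0),
    ('Audio',     '2022-02-01',  500.0),
    ('Audio',     '2021-06-01', 9999.0),
    ('Computers', '2023-03-01', 2000.0),
    ('Computers', '2022-03-01',   90.0);
""")

cursor = conn.execute(query)
names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(names)
for row in rows:
    print(row)
```

The 2021 row fails every `CASE` condition, so it contributes to no column. This is why the query works without a `WHERE` clause on the date.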
