<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Sum_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Aggregation

**Product focused**

## Overview

### 🥅 Analysis Goals

Run an exploratory data analysis (EDA) of the products and product categories ordered in the `sales` table:
- Find total sales in 2023 and 2022.
- Compare total sales of products ordered in 2023 and 2022.
- Categorize sales as low or high, then pivot the sales by category and year.

### 📘 Concepts Covered

- `SUM` Review
- `SUM` with `CASE WHEN`
- Pivot with Multiple CASE WHEN Statements

---

In [20]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Import the libraries used in this notebook
import psycopg2
import pandas as pd
import numpy as np

# Display pandas numbers to two decimal places
pd.options.display.float_format = '{:.2f}'.format

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## SUM Review

### 📝 Notes

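- `SUM` adds up the values of a column or expression across the rows in each group defined by `GROUP BY`.
- Net revenue for each row of `sales` is calculated here as `quantity * netprice * exchangerate`.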

### 💻 Final Result

- Find the total net revenue by day in 2023 and by product category in 2023 and 2022.

#### Total Net Revenue by Day in 2023

**`SUM`**

1. Find the net revenue by orderdate for 2023 orders.

In [25]:
%%sql

SELECT
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue -- Added
FROM
    sales s
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY   
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,net_revenue
0,2023-01-01,30140.799315
1,2023-01-02,107847.490191
2,2023-01-03,192655.596657
3,2023-01-04,189451.707871
4,2023-01-05,216573.229817
...,...,...
359,2023-12-27,141981.336234
360,2023-12-28,138772.189742
361,2023-12-29,85913.440327
362,2023-12-30,165917.019796


#### Total Net Revenue by Product Category in 2023 and 2022

**`SUM`**

1. Find the total net revenue by the product category for 2023 orders.

In [30]:
%%sql

SELECT
    p.categoryname AS category_name, -- Added
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey -- Added
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname -- Update
ORDER BY
    p.categoryname -- Update

Unnamed: 0,category_name,net_revenue
0,Audio,688690.18
1,Cameras and camcorders,1983546.29
2,Cell phones,6002147.63
3,Computers,11650867.21
4,Games and Toys,270374.96
5,Home Appliances,5919992.87
6,"Music, Movies and Audio Books",2180768.13
7,TV and Video,4412178.23


2. Find the total net revenue by the product category for 2022 orders.

In [31]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

Unnamed: 0,category_name,net_revenue
0,Audio,766938.21
1,Cameras and camcorders,2382532.56
2,Cell phones,8119665.07
3,Computers,17862213.49
4,Games and Toys,316127.3
5,Home Appliances,6612446.68
6,"Music, Movies and Audio Books",2989297.28
7,TV and Video,5815336.61


---
## SUM with CASE WHEN

### 📝 Notes

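- A `CASE WHEN` expression inside `SUM` contributes a row's value only when the condition is true. With no `ELSE` branch, non-matching rows return `NULL`, which `SUM` ignores.
- Writing one `SUM(CASE WHEN ...)` per year places the yearly totals side by side as columns, which is the pivot.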

### 💻 Final Result

- Compare total sales of products by category ordered in 2023 and 2022.

#### Total Sales by Category and Year (2022 vs 2023)

**`CASE WHEN` and `SUM`**

1. Pivot to get the total sales by category and compare 2023 with 2022.

In [32]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2022_total_sales,y2023_total_sales
0,Audio,766938.21,688690.18
1,Cameras and camcorders,2382532.56,1983546.29
2,Cell phones,8119665.07,6002147.63
3,Computers,17862213.49,11650867.21
4,Games and Toys,316127.3,270374.96
5,Home Appliances,6612446.68,5919992.87
6,"Music, Movies and Audio Books",2989297.28,2180768.13
7,TV and Video,5815336.61,4412178.23
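
To see the mechanics on a small scale, here is a minimal sketch that runs the same `SUM(CASE WHEN ...)` pattern against an in-memory SQLite database from Python. The `toy_sales` table, its columns, and its rows are made up for illustration and are not part of the Contoso data. SQLite stands in for PostgreSQL only so the example is self-contained.

```python
import sqlite3

# Hypothetical toy data: (category, orderdate, net_revenue)
rows = [
    ("Audio", "2022-03-01", 100.0),
    ("Audio", "2022-07-15", 50.0),
    ("Audio", "2023-02-10", 80.0),
    ("Computers", "2022-05-20", 1200.0),
    ("Computers", "2023-01-05", 900.0),
    ("Computers", "2023-11-30", 300.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (category TEXT, orderdate TEXT, net_revenue REAL)"
)
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", rows)

# One SUM(CASE WHEN ...) per year: rows outside the year yield NULL,
# and SUM skips NULLs, so each column only totals its own year.
pivot = conn.execute(
    """
    SELECT
        category,
        SUM(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31'
                 THEN net_revenue END) AS y2022_total_sales,
        SUM(CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31'
                 THEN net_revenue END) AS y2023_total_sales
    FROM toy_sales
    GROUP BY category
    ORDER BY category
    """
).fetchall()

for row in pivot:
    print(row)
```

Each category ends up on a single row with one column per year, instead of one row per category and year.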


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

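- Combine conditions inside a single `CASE WHEN` with `AND` (for example, a sale-amount condition and a date range) to pivot on more than one dimension at once.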

### 💻 Final Result

- Categorize sales as low or high and pivot the sales by category and year.

#### Categorize as Low and High for Total Sale

**`SUM`**, **`CASE WHEN`**

1. Categorize each sale as low (under 1,000) or high (1,000 or more), then find the total sales by category for each group.

In [33]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 1000 THEN (s.quantity * s.netprice * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 1000 THEN (s.quantity * s.netprice * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,low_total_sales,high_total_sales
0,Audio,970542.98,485085.41
1,Cameras and camcorders,884178.45,3481900.4
2,Cell phones,5173880.4,8947932.31
3,Computers,4937765.59,24575315.1
4,Games and Toys,547757.88,38744.39
5,Home Appliances,1581307.97,10951131.58
6,"Music, Movies and Audio Books",2973461.1,2196604.3
7,TV and Video,1704582.92,8522931.91


2. Get the total sales by category, sale amount (low or high), and year (2022 vs 2023).

In [34]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 1000
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 1000
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 1000
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 1000 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,y2022_low_total_sales,y2022_high_total_sales,y2023_low_total_sales,y2023_high_total_sales
0,Audio,531436.42,235501.79,439106.56,249583.63
1,Cameras and camcorders,489304.45,1893228.11,394874.0,1588672.29
2,Cell phones,2728570.0,5391095.08,2445310.4,3556837.24
3,Computers,2732432.34,15129781.15,2205333.25,9445533.96
4,Games and Toys,292055.19,24072.11,255702.69,14672.27
5,Home Appliances,872122.52,5740324.16,709185.46,5210807.42
6,"Music, Movies and Audio Books",1655321.2,1333976.08,1318139.91,862628.22
7,TV and Video,1001097.14,4814239.47,703485.79,3708692.44
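
Because every sale falls into exactly one of the two buckets, the low and high columns for a year should add up to that year's total from the **SUM with CASE WHEN** section. Below is a small sketch of that sanity check, using a few 2022 values copied from the outputs above. The dictionary layout and the `max_gap` helper are illustrative only, and the tolerance is an assumption to allow for the two-decimal rounding of the displayed values.

```python
# Values copied from the notebook outputs: category -> (total, low, high) for 2022
y2022 = {
    "Audio": (766938.21, 531436.42, 235501.79),
    "Cameras and camcorders": (2382532.56, 489304.45, 1893228.11),
    "Cell phones": (8119665.07, 2728570.0, 5391095.08),
    "Computers": (17862213.49, 2732432.34, 15129781.15),
}


def max_gap(table):
    """Return the largest absolute difference between total and low + high."""
    return max(abs(total - (low + high)) for total, low, high in table.values())


gap = max_gap(y2022)
print(f"Largest gap between total and low + high: {gap:.2f}")

# Assumed tolerance: displayed values are rounded to two decimals
TOLERANCE = 0.05
print("Consistent" if gap < TOLERANCE else "Mismatch, check the CASE conditions")
```

A large gap would suggest the `CASE WHEN` conditions overlap or leave some rows uncovered.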


##### Optional: Find Median

Rather than guessing a threshold, we can split sales into low and high at the median. The median is the middle value when the values in a set are sorted from low to high. Based on the median, we'll categorize each sale as either low or high:

- **Low**: Below the median.
- **High**: At or above the median.

The median is also called the 50th percentile, which means roughly half of the data falls below it and half above it.

To calculate the 50th percentile:
- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th), interpolating between sorted data points when the percentile falls between two values.
- Syntax:
```sql
SELECT 
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;
```
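
To make the "interpolating between sorted data points" idea concrete, here is a minimal Python sketch of a continuous percentile. It assumes the usual linear-interpolation definition (which is how the PostgreSQL documentation describes `PERCENTILE_CONT`). The function below and the sample values are hypothetical and only illustrate the idea; they are not PostgreSQL's implementation.

```python
import math
import statistics


def percentile_cont(values, fraction):
    """Continuous percentile using linear interpolation between sorted values."""
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Blend the two neighbouring values according to the fractional part
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


sample = [10.0, 20.0, 30.0, 40.0]  # hypothetical values

# With an even number of values, the median falls between the two middle values
print(percentile_cont(sample, 0.5))
print(statistics.median(sample))
```

With an odd number of values the position lands exactly on a data point, so no interpolation is needed.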

In [35]:
%%sql 

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * exchangerate)) AS median
FROM
    sales s
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
;

Unnamed: 0,median
0,398.0


2. **Validate data.** Cross-check the median by calculating it a second way, using Python.


In [36]:
# SQL Query to fetch data
query = '''
SELECT 
    s.quantity * s.netprice * exchangerate AS net_revenue
FROM 
    sales s
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
ORDER BY 
    net_revenue;
'''

import warnings

# Suppress the pandas warning about using a raw psycopg2 connection
# (the filter must be registered before the query runs)
warnings.filterwarnings(
    'ignore',
    category=UserWarning,
    message=".*only supports SQLAlchemy connectable.*"
)

# Fetch data into a pandas DataFrame
sales_df = pd.read_sql_query(query, connection)

# Calculate the 50th percentile of the net revenue column
median = np.percentile(sales_df['net_revenue'], 50)

print(f"50th Percentile (Median): {median}")

50th Percentile (Median): 398.0


3. Pivot by category and categorize each sale as low or high based on the median (50th percentile) of sales in 2022 and 2023. Then get the total sales amount for each group.

In [37]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398 THEN (s.quantity * s.netprice * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 THEN (s.quantity * s.netprice * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,low_total_sales,high_total_sales
0,Audio,402588.95,1053039.44
1,Cameras and camcorders,237874.0,4128204.85
2,Cell phones,1544148.92,12577663.79
3,Computers,1215130.73,28297949.97
4,Games and Toys,438083.0,148419.27
5,Home Appliances,396058.42,12136381.13
6,"Music, Movies and Audio Books",1260767.25,3909298.16
7,TV and Video,436613.64,9790901.19


4. Add the year to the pivot so that, for each category, low and high sales in 2022 can be compared with those in 2023.

In [38]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,y2022_low_total_sales,y2022_high_total_sales,y2023_low_total_sales,y2023_high_total_sales
0,Audio,222337.83,544600.39,180251.13,508439.06
1,Cameras and camcorders,133004.54,2249528.02,104869.46,1878676.83
2,Cell phones,814449.53,7305215.55,729699.39,5272448.24
3,Computers,624340.42,17237873.07,590790.31,11060076.9
4,Games and Toys,231979.63,84147.67,206103.36,64271.6
5,Home Appliances,219797.07,6392649.61,176261.35,5743731.52
6,"Music, Movies and Audio Books",685808.49,2303488.8,574958.76,1605809.37
7,TV and Video,272338.29,5542998.32,164275.35,4247902.87


5. **Bonus:** Make the median dynamic so it is calculated in the query instead of being entered manually.

In [16]:
%%sql

-- Calculate the median values
WITH median_value AS (
    SELECT 
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * exchangerate)) AS median
    FROM sales s
    WHERE orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
)

-- Pivot the data by category, low and high sales, and year
SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < mv.median
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= mv.median
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < mv.median
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= mv.median 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
    CROSS JOIN median_value mv
GROUP BY
    category
ORDER BY
    category;

Unnamed: 0,category,y2022_low_total_sales,y2022_high_total_sales,y2023_low_total_sales,y2023_high_total_sales
0,Audio,222337.826357,544600.4,180251.127651,508439.1
1,Cameras and camcorders,133004.539868,2249528.0,104869.455535,1878677.0
2,Cell phones,814449.529033,7305216.0,729699.394352,5272448.0
3,Computers,624340.418719,17237870.0,590790.306691,11060080.0
4,Games and Toys,231979.632972,84147.67,206103.364757,64271.6
5,Home Appliances,219797.072756,6392650.0,176261.350547,5743732.0
6,"Music, Movies and Audio Books",685808.485356,2303489.0,574958.761525,1605809.0
7,TV and Video,272338.286919,5542998.0,164275.353074,4247903.0
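
The key move in the bonus query is the `CROSS JOIN` against a one-row CTE, which attaches the same threshold to every sales row. Here is a minimal sketch of that pattern on an in-memory SQLite database. The `toy_sales` table and its rows are made up, and because SQLite has no `PERCENTILE_CONT`, the sketch swaps in `AVG` as a stand-in threshold. Only the join pattern carries over, not the statistic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (category TEXT, net_revenue REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [  # hypothetical rows
        ("Audio", 100.0),
        ("Audio", 500.0),
        ("Computers", 300.0),
        ("Computers", 700.0),
    ],
)

result = conn.execute(
    """
    -- One-row CTE holding the threshold (AVG used here in place of the median)
    WITH threshold AS (
        SELECT AVG(net_revenue) AS cutoff FROM toy_sales
    )
    SELECT
        s.category,
        SUM(CASE WHEN s.net_revenue <  t.cutoff THEN s.net_revenue END) AS low_total_sales,
        SUM(CASE WHEN s.net_revenue >= t.cutoff THEN s.net_revenue END) AS high_total_sales
    FROM toy_sales s
    CROSS JOIN threshold t  -- every sales row gets the same cutoff value
    GROUP BY s.category
    ORDER BY s.category
    """
).fetchall()

for row in result:
    print(row)
```

Since the CTE returns exactly one row, the `CROSS JOIN` does not multiply the number of sales rows.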


---
## Pivot with Other Aggregation Functions

You can also pivot with other aggregate functions, though they are not used as frequently as `SUM` or `COUNT`. As an example, we'll pivot by the average, minimum, and maximum using the query from the **SUM with CASE WHEN** section, repeated below. Essentially, we'll replace `SUM` with `AVG`, `MIN`, and `MAX`.

```sql
SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;
```

1. Pivoting to find the averages.

In [13]:
%%sql 

SELECT
    p.categoryname AS category,
    AVG(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_avg_sales,
    AVG(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_avg_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2022_avg_sales,y2023_avg_sales
0,Audio,392.29576,425.379978
1,Cameras and camcorders,1210.021616,1210.956219
2,Cell phones,722.197374,623.275974
3,Computers,1565.624813,1292.386823
4,Games and Toys,81.287556,80.829586
5,Home Appliances,1755.361476,1886.549673
6,"Music, Movies and Audio Books",386.61372,334.57627
7,TV and Video,1535.605124,1687.902917
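
One detail matters more with `AVG` than with `SUM`: the `CASE WHEN` should have no `ELSE 0`. Without an `ELSE`, non-matching rows become `NULL` and are left out of the average. With `ELSE 0`, they are counted as zero-value sales and pull the average down. Here is a minimal sketch on an in-memory SQLite database; the `toy_sales` table, its `order_year` column, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (order_year INTEGER, net_revenue REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [(2022, 100.0), (2022, 300.0), (2023, 200.0), (2023, 600.0)],  # made-up rows
)

no_else, with_else = conn.execute(
    """
    SELECT
        -- 2023 rows become NULL and are ignored by AVG
        AVG(CASE WHEN order_year = 2022 THEN net_revenue END),
        -- 2023 rows become 0 and are included in the average
        AVG(CASE WHEN order_year = 2022 THEN net_revenue ELSE 0 END)
    FROM toy_sales
    """
).fetchone()

print(f"Without ELSE: {no_else}")
print(f"With ELSE 0:  {with_else}")
```

The same caution applies to `MIN`, where an `ELSE 0` would report 0 as the minimum whenever a group contains rows from other years.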


2. Pivoting to find the minimums.

In [14]:
%%sql 

SELECT
    p.categoryname AS category,
    MIN(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_min_sales,
    MIN(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_min_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2022_min_sales,y2023_min_sales
0,Audio,9.307088,10.848666
1,Cameras and camcorders,6.738685,5.977
2,Cell phones,2.5284,2.284764
3,Computers,0.8265,0.752181
4,Games and Toys,2.832147,3.488148
5,Home Appliances,4.0419,4.5409
6,"Music, Movies and Audio Books",7.285707,6.912539
7,TV and Video,41.301263,42.295818


3. Pivoting to find the maximums.

In [15]:
%%sql 

SELECT
    p.categoryname AS category,
    MAX(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_max_sales,
    MAX(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_max_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,y2022_max_sales,y2023_max_sales
0,Audio,3473.35872,2730.8664
1,Cameras and camcorders,15008.392476,13572.0
2,Cell phones,7692.36888,8912.2179
3,Computers,38082.66084,27611.59941
4,Games and Toys,5202.013683,3357.303936
5,Home Appliances,31654.545559,32915.59146
6,"Music, Movies and Audio Books",5415.192063,3804.909492
7,TV and Video,30259.410607,27503.115401


There are other aggregate functions you can pivot by, but we won't go into depth on them in this course. The others you can use are listed below (some may not work, depending on the SQL dialect you're using):

- `VARIANCE`  
- `VAR_POP`  
- `VAR_SAMP`  
- `STDDEV`  
- `STDDEV_POP`  
- `STDDEV_SAMP`  
- `ARRAY_AGG`  
- `STRING_AGG`  
- `BOOL_AND`  
- `BOOL_OR`  