<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Sum_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Other Aggregations

## Overview

### 🥅 Analysis Goals

Continue exploring the products and categories ordered in the `sales` table.
- Analyze average, median, minimum, and maximum net revenue to understand category performance across different perspectives (central tendency, extremes, and distribution).
- Compare these metrics for 2022 and 2023 to highlight growth, decline, or stability in category revenues year-over-year.

### 📘 Concepts Covered

- `AVG`
- `MIN`
- `MAX`
- Median with `PERCENTILE_CONT`

---

In [20]:
import sys
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format


---
## Pivot with Other Aggregation Functions

### 📝 Notes

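You can also pivot with other aggregate functions, though they are used less often than `SUM` or `COUNT`. For example, we'll take the **SUM with Case When** query below and pivot by the average, minimum, and maximum instead, replacing `SUM` with `AVG`, `MIN`, or `MAX`. Because a `CASE` expression without an `ELSE` returns `NULL` for rows outside the date range, and these aggregates ignore `NULL`s, each column summarizes only its own year.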

#### Aggregation Review
- **Average:** The sum of all values divided by the total number of values.  
- **Minimum:** The smallest value in a dataset.  
- **Maximum:** The largest value in a dataset.  

#### Syntax

```sql
SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;
```
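
To see the swap in action without the course database, here is a minimal sketch that runs the same query shape against an in-memory SQLite table. The table `toy_sales`, its columns, the rows, and the helper `pivot_by_year` are all hypothetical and exist only for illustration. SQLite has no `::date` cast, so the dates are stored as ISO-formatted text and compared directly.

```python
import sqlite3

# Hypothetical toy rows: (category, orderdate, net_revenue). Not from contoso_100k.
ROWS = [
    ("Audio", "2022-03-01", 100.0),
    ("Audio", "2022-07-15", 300.0),
    ("Audio", "2023-02-10", 50.0),
    ("Audio", "2023-09-30", 150.0),
    ("Computers", "2022-05-05", 1000.0),
    ("Computers", "2023-06-06", 400.0),
    ("Computers", "2023-11-11", 800.0),
]

ALLOWED = {"SUM", "AVG", "MIN", "MAX"}


def pivot_by_year(conn, agg):
    """Pivot net revenue by year, using `agg` where the template has SUM."""
    if agg not in ALLOWED:
        raise ValueError(f"unsupported aggregate: {agg}")
    query = f"""
        SELECT
            category,
            {agg}(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN net_revenue END) AS y2022,
            {agg}(CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN net_revenue END) AS y2023
        FROM toy_sales
        GROUP BY category
        ORDER BY category
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (category TEXT, orderdate TEXT, net_revenue REAL)")
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", ROWS)

# Only the aggregate name changes between runs; the CASE WHEN pivot stays the same.
for agg in ("SUM", "AVG", "MIN", "MAX"):
    print(agg, pivot_by_year(conn, agg))
```

The whitelist in `ALLOWED` is there because the aggregate name is interpolated into the SQL text rather than passed as a bound parameter.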

#### More Aggregations

Other aggregate functions can be pivoted the same way, but this course doesn't cover them in depth. They include the following (availability varies by SQL dialect):

- `VARIANCE`  
- `VAR_POP`  
- `VAR_SAMP`  
- `STDDEV`  
- `STDDEV_POP`  
- `STDDEV_SAMP`  
- `ARRAY_AGG`  
- `STRING_AGG`  
- `BOOL_AND`  
- `BOOL_OR`  
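
The `_POP` and `_SAMP` suffixes in the list above separate population statistics (divide by `n`) from sample statistics (divide by `n - 1`). As a rough illustration outside SQL, Python's `statistics` module draws the same distinction. The values below are made up, and the mapping to the SQL function names is an analogy rather than a guarantee about any particular database. In PostgreSQL, as far as I know, plain `VARIANCE` and `STDDEV` are aliases for the sample versions.

```python
import statistics

# Hypothetical values, chosen so the population figures come out round.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_pop = statistics.pvariance(values)    # analogous to VAR_POP: divides by n
var_samp = statistics.variance(values)    # analogous to VAR_SAMP: divides by n - 1
stddev_pop = statistics.pstdev(values)    # analogous to STDDEV_POP
stddev_samp = statistics.stdev(values)    # analogous to STDDEV_SAMP

print(f"population variance: {var_pop:.4f}, sample variance: {var_samp:.4f}")
print(f"population stddev:   {stddev_pop:.4f}, sample stddev:   {stddev_samp:.4f}")
```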

### 💻 Final Result

- Find the average, minimum, and maximum net revenue by category for 2023 and 2022.

#### Average Net Revenue by Category

**`AVG`**

1. Find the average net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - Use `AVG` to calculate the average net revenue for each year.
   - Group the data by `categoryname` to get average revenue by category.
   - Order the results alphabetically by `categoryname`.
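
The steps above depend on the `CASE` expression having no `ELSE` branch: rows outside the year become `NULL`, and `AVG` and `MIN` skip `NULL`s. The sketch below, on a hypothetical in-memory SQLite table named `toy_sales`, contrasts that with an `ELSE 0` variant, which counts out-of-range rows as zero-revenue sales and distorts the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, net_revenue REAL)")
# Hypothetical rows: two sales in 2022 and one in 2023.
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [("2022-01-10", 100.0), ("2022-08-20", 300.0), ("2023-04-04", 50.0)],
)

in_2022 = "orderdate BETWEEN '2022-01-01' AND '2022-12-31'"

avg_null, avg_zero, min_null, min_zero = conn.execute(f"""
    SELECT
        AVG(CASE WHEN {in_2022} THEN net_revenue END),         -- 2023 row becomes NULL and is skipped
        AVG(CASE WHEN {in_2022} THEN net_revenue ELSE 0 END),  -- 2023 row is counted as a 0
        MIN(CASE WHEN {in_2022} THEN net_revenue END),
        MIN(CASE WHEN {in_2022} THEN net_revenue ELSE 0 END)
    FROM toy_sales
""").fetchone()

print("AVG without ELSE:", avg_null, "| with ELSE 0:", avg_zero)
print("MIN without ELSE:", min_null, "| with ELSE 0:", min_zero)
```

With `SUM`, an `ELSE 0` branch would be harmless because adding zero changes nothing. With `AVG`, `MIN`, and the median, it changes the answer.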

In [13]:
%%sql 

SELECT
    p.categoryname AS category,
    AVG(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_avg_sales,
    AVG(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_avg_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | y2022_avg_sales | y2023_avg_sales |
|---|---|---|---|
| 0 | Audio | 392.29576 | 425.379978 |
| 1 | Cameras and camcorders | 1210.021616 | 1210.956219 |
| 2 | Cell phones | 722.197374 | 623.275974 |
| 3 | Computers | 1565.624813 | 1292.386823 |
| 4 | Games and Toys | 81.287556 | 80.829586 |
| 5 | Home Appliances | 1755.361476 | 1886.549673 |
| 6 | Music, Movies and Audio Books | 386.61372 | 334.57627 |
| 7 | TV and Video | 1535.605124 | 1687.902917 |
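
One way to read the averages above against the year-over-year goal is to turn each pair into a percentage change. This is a small sketch, not part of the course queries: the figures are copied from four rows of the result above, and the helper `pct_change` is a hypothetical name.

```python
# (2022 average, 2023 average) copied from the AVG results above.
avg_net_revenue = {
    "Audio": (392.29576, 425.379978),
    "Cell phones": (722.197374, 623.275974),
    "Computers": (1565.624813, 1292.386823),
    "Home Appliances": (1755.361476, 1886.549673),
}


def pct_change(previous, current):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0


changes = {
    category: pct_change(y2022, y2023)
    for category, (y2022, y2023) in avg_net_revenue.items()
}

for category, change in changes.items():
    print(f"{category}: {change:+.1f}%")
```

On these figures, the Audio average rose by roughly 8% while the Computers average fell by roughly 17%.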


#### Minimum Net Revenue by Category

**`MIN`**

1. Find the minimum net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - Use `MIN` to calculate the minimum net revenue for each year.
   - Group the data by `categoryname` to get minimum revenue by category.
   - Order the results alphabetically by `categoryname`.

In [14]:
%%sql 

SELECT
    p.categoryname AS category,
    MIN(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_min_sales,
    MIN(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_min_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | y2022_min_sales | y2023_min_sales |
|---|---|---|---|
| 0 | Audio | 9.307088 | 10.848666 |
| 1 | Cameras and camcorders | 6.738685 | 5.977 |
| 2 | Cell phones | 2.5284 | 2.284764 |
| 3 | Computers | 0.8265 | 0.752181 |
| 4 | Games and Toys | 2.832147 | 3.488148 |
| 5 | Home Appliances | 4.0419 | 4.5409 |
| 6 | Music, Movies and Audio Books | 7.285707 | 6.912539 |
| 7 | TV and Video | 41.301263 | 42.295818 |


#### Maximum Net Revenue by Category

**`MAX`**

1. Find the maximum net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `CASE WHEN` to calculate the net revenue only for 2022 and 2023:
     - For 2022, include sales where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include sales where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - Use `MAX` to calculate the maximum net revenue for each year.
   - Group the data by `categoryname` to get maximum revenue by category.
   - Order the results alphabetically by `categoryname`.

In [15]:
%%sql 

SELECT
    p.categoryname AS category,
    MAX(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_max_sales,
    MAX(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_max_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

| | category | y2022_max_sales | y2023_max_sales |
|---|---|---|---|
| 0 | Audio | 3473.35872 | 2730.8664 |
| 1 | Cameras and camcorders | 15008.392476 | 13572.0 |
| 2 | Cell phones | 7692.36888 | 8912.2179 |
| 3 | Computers | 38082.66084 | 27611.59941 |
| 4 | Games and Toys | 5202.013683 | 3357.303936 |
| 5 | Home Appliances | 31654.545559 | 32915.59146 |
| 6 | Music, Movies and Audio Books | 5415.192063 | 3804.909492 |
| 7 | TV and Video | 30259.410607 | 27503.115401 |


---
## Pivot with Median

### 📝 Notes

#### Review
The median is the middle value when you sort a set of values from low to high. With an even number of values, it is the average of the two middle values.

The median is also the 50th percentile: half of the values fall at or below it and half fall at or above it.
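
A quick way to check this definition is Python's built-in `statistics.median`. The lists below are made-up values. The last one adds a single outlier to show why the median is often reported alongside the average.

```python
import statistics

odd_values = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9 -> one middle value
even_values = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7    -> two middle values

median_odd = statistics.median(odd_values)
median_even = statistics.median(even_values)  # mean of the two middle values

# One extreme value moves the mean a lot but barely affects the median.
skewed = [10, 12, 11, 13, 1000]
skewed_median = statistics.median(skewed)
skewed_mean = statistics.mean(skewed)

print(median_odd, median_even)
print(skewed_median, skewed_mean)
```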


#### Calculate Median in PostgreSQL

`PERCENTILE_CONT`

- **`PERCENTILE_CONT`** calculates a percentile (e.g., 25th, 50th, 75th), interpolating between adjacent sorted values when the requested percentile falls between two data points.  
- Syntax:
  ```sql
  SELECT 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
  FROM table_name
  WHERE column_name IS NOT NULL;
  ```
- Note: Some SQL dialects have a dedicated `MEDIAN()` function, but PostgreSQL doesn't.
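
To make the interpolation concrete, here is a small pure-Python sketch of a continuous percentile. It assumes the usual linear-interpolation rule (position `fraction * (n - 1)` in the zero-based sorted list) and skips `None` the way SQL aggregates skip `NULL`. The function is a hypothetical stand-in written for this note, not PostgreSQL's actual implementation.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile via linear interpolation between sorted neighbors."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    data = sorted(v for v in values if v is not None)  # None plays the role of NULL
    if not data:
        return None  # no non-null input
    position = fraction * (len(data) - 1)  # zero-based, possibly fractional
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(data[lower])  # landed exactly on a data point
    weight = position - lower
    return data[lower] + (data[upper] - data[lower]) * weight


print(percentile_cont([1, 2, 3, 4], 0.5))         # halfway between 2 and 3
print(percentile_cont([1, 2, 3, 4], 0.25))        # three quarters of the way from 1 to 2
print(percentile_cont([10, None, 30, 20], 0.5))   # None is ignored
```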

### 💻 Final Result

- Find the median net revenue for 2023 and 2022 by category.

#### Median Net Revenue by Category

**`PERCENTILE_CONT`**, **`WITHIN GROUP`**

1. Find the median net revenue across 2022 and 2023.
   - Use the `PERCENTILE_CONT(0.5)` function to calculate the median value (50th percentile) of `net revenue` in the specified date range.
   - Define `net revenue` as the product of `quantity`, `netprice`, and `exchangerate`.
   - Filter rows in the `WHERE` clause where `orderdate` is between `2022-01-01` and `2023-12-31`.

In [None]:
%%sql 

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * s.exchangerate)) AS median
FROM
    sales s
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31';

2. Find the median net revenue for 2022 vs 2023 by category.

   - Join the `sales` table (`s`) with the `product` table (`p`) on `productkey`.
   - Use `PERCENTILE_CONT(0.5)` to calculate the median for each year within categories:
     - For 2022, include `net revenue` where `orderdate` is between `2022-01-01` and `2022-12-31`.
     - For 2023, include `net revenue` where `orderdate` is between `2023-01-01` and `2023-12-31`.
   - Use `CASE WHEN` to separate calculations for 2022 and 2023.
   - Group the data by `categoryname` to calculate medians for each category.
   - Order the results alphabetically by `categoryname`.

In [None]:
%%sql

SELECT
    p.categoryname AS category,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (CASE 
        WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) 
    END)) AS y2022_median_sales,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (CASE 
        WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) 
    END)) AS y2023_median_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;
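
For readers who want to check the logic of the query above without a database, the same grouping can be sketched in plain Python. The rows and the helper `median_pivot` are hypothetical. Rows outside the two years are dropped, mirroring how the `CASE` expression turns them into `NULL`s that `PERCENTILE_CONT` ignores, and a category with no rows in a year gets `None`, as it would get `NULL` in SQL.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (category, order_year, net_revenue). Not real course data.
rows = [
    ("Audio", 2022, 100.0),
    ("Audio", 2022, 300.0),
    ("Audio", 2022, 500.0),
    ("Audio", 2023, 50.0),
    ("Audio", 2023, 150.0),
    ("Computers", 2022, 1000.0),
    ("Computers", 2021, 9999.0),  # outside both years: ignored, like a NULL from CASE
]


def median_pivot(rows, years=(2022, 2023)):
    """Return {category: {year: median or None}}, one entry per category."""
    buckets = defaultdict(lambda: {year: [] for year in years})
    for category, year, revenue in rows:
        per_year = buckets[category]  # every category gets an entry, as with GROUP BY
        if year in per_year:
            per_year[year].append(revenue)
    return {
        category: {
            year: (median(values) if values else None)
            for year, values in per_year.items()
        }
        for category, per_year in sorted(buckets.items())  # ORDER BY category
    }


result = median_pivot(rows)
for category, medians in result.items():
    print(category, medians)
```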
