<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/2_Conditional_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Conditional Aggregation

**Product focused**

## Overview

### 🥅 Analysis Goals

- Explore (EDA) the products and product categories ordered in the `sales` table.
    - Calculate total sales by category for 2023 and 2022.
    - Compare the 2023 and 2022 totals side by side.
- The end goal is to see how each category's sales are distributed across years and across low, moderate, and high sale amounts.

### 📘 Concepts Covered

General concepts we’re going to cover:

- Aggregation Review
- `SUM` with `CASE WHEN`
- Percentiles with `PERCENTILE_CONT`

---

In [13]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Set up the connection parameters for this notebook
import psycopg2
import pandas as pd
import numpy as np

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## SUM

### 📝 Notes

- `SUM` adds up the non-null values of an expression within each group defined by `GROUP BY`.

### 💻 Final Result

- Return the total sales for each product category in 2023, then repeat for 2022.

#### Total Sales by Category

**`SUM` and `GROUP BY`**

1. Find the total sale amount for each order line by multiplying `quantity` (from the `sales` table) by `price` (from the `product` table) and by `exchangerate` (since not all sales are made in `USD`).

In [14]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,storekey,productkey,quantity,price,total_sale_amount
1000,2015-01-01,947009,400,48,1,149.95,96.2004225
1000,2015-01-01,947009,400,460,1,299.9,192.400845
1001,2015-01-01,1772036,430,1730,2,77.68,155.36
1002,2015-01-01,1518349,660,955,4,196.9,787.6
1002,2015-01-01,1518349,660,62,7,181.0,1267.0
1002,2015-01-01,1518349,660,1050,3,312.0,936.0
1002,2015-01-01,1518349,660,1608,1,109.99,109.99
1003,2015-01-01,1317097,510,85,3,99.99,299.97
1004,2015-01-01,254117,80,128,2,143.4,332.203308
1004,2015-01-01,254117,80,2079,1,665.94,771.3649614000001
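
As a quick sanity check on the formula, the first output row can be reproduced by hand. The sketch below is only illustrative: `total_sale_amount` is a hypothetical helper, and the exchange rate of 0.64155 for order 1000 is an assumption inferred by dividing the row's `total_sale_amount` by `quantity * price`, since `exchangerate` itself is not shown in the output.

```python
import math

def total_sale_amount(quantity, price, exchangerate):
    """Mirror the SQL expression s.quantity * p.price * s.exchangerate."""
    return quantity * price * exchangerate

# First output row: quantity=1, price=149.95.
# ASSUMPTION: exchangerate=0.64155 is inferred from the output, not read from the table.
row_total = total_sale_amount(1, 149.95, 0.64155)
print(round(row_total, 7))

# Order 1001: the total equals quantity * price, which implies an exchange rate of 1.0.
usd_total = total_sale_amount(2, 77.68, 1.0)
print(usd_total)

# Compare against the values printed in the query output (floats, so use a tolerance).
print(math.isclose(row_total, 96.2004225), math.isclose(usd_total, 155.36))
```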


2. Filter to orders from 2023 only and add `categoryname` to the output.

In [15]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    p.categoryname,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,storekey,productkey,categoryname,quantity,price,total_sale_amount
2923000,2023-01-01,239821,90,1581,"Music, Movies and Audio Books",7,219.0,2075.42139
2923001,2023-01-01,1025340,999999,2013,Home Appliances,1,665.94,553.762407
2923002,2023-01-01,686958,120,1602,"Music, Movies and Audio Books",3,179.99,506.2542732
2923002,2023-01-01,686958,120,349,Computers,1,383.0,359.08548
2923002,2023-01-01,686958,120,1644,"Music, Movies and Audio Books",1,57.88,54.2659728
2923003,2023-01-01,1889683,470,371,Computers,3,599.0,1797.0
2923003,2023-01-01,1889683,470,1605,"Music, Movies and Audio Books",6,289.99,1739.94
2923003,2023-01-01,1889683,470,1258,Cameras and camcorders,1,39.99,39.99
2923003,2023-01-01,1889683,470,1976,Home Appliances,3,899.0,2697.0
2923004,2023-01-01,55996,999999,2467,Home Appliances,3,30.99,136.78769069999998


3. Aggregate the data to get the total sales by category.
    - Remove every column except the category.
    - Group by category.

In [16]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * p.price * s.exchangerate) AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

category_name,total_sale_amount
Audio,730647.8724822998
Cameras and camcorders,2107965.6327211987
Cell phones,6383097.762667838
Computers,12373767.735130329
Games and Toys,286481.69538748043
Home Appliances,6317839.183700321
"Music, Movies and Audio Books",2321667.239495982
TV and Video,4699134.796674995


4. For 2022, run the same query with the date filter in the `WHERE` clause changed to 2022.

In [17]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * p.price * s.exchangerate) AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

category_name,total_sale_amount
Audio,854127.3322440994
Cameras and camcorders,2429201.739937799
Cell phones,7342863.472145041
Computers,15548062.129970036
Games and Toys,351464.6304658014
Home Appliances,7374114.8490392305
"Music, Movies and Audio Books",2814693.739286459
TV and Video,6338489.86081101


---
## SUM with CASE WHEN

#### Total Sales by Category and Year

**`CASE WHEN` and `SUM`**

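1. Move the year filter out of the `WHERE` clause and into a `CASE WHEN` inside each `SUM`, so the 2023 and 2022 totals come back as separate columns from a single query.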

In [18]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * s.exchangerate) END) AS y2023_total_sales,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * s.exchangerate) END) AS y2022_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

category,y2023_total_sales,y2022_total_sales
Audio,730647.8724822997,854127.3322440994
Cameras and camcorders,2107965.632721198,2429201.739937798
Cell phones,6383097.762667838,7342863.47214504
Computers,12373767.735130329,15548062.129970033
Games and Toys,286481.6953874804,351464.6304658014
Home Appliances,6317839.183700321,7374114.849039233
"Music, Movies and Audio Books",2321667.239495984,2814693.7392864595
TV and Video,4699134.796674995,6338489.860811011
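
The pivot works because a `CASE` with no `ELSE` returns `NULL` for rows that fail the condition, and `SUM` skips `NULL`s. A minimal sketch of that behavior, assuming a hypothetical `toy_sales` table in an in-memory SQLite database rather than the Contoso data (SQLite has no `::date` cast, so dates are stored as ISO text and compared as strings):

```python
import sqlite3

# Toy data (hypothetical): orderdate, category, precomputed sale amount
rows = [
    ("2022-03-01", "Audio", 100.0),
    ("2022-07-15", "Computers", 900.0),
    ("2023-01-10", "Audio", 50.0),
    ("2023-05-20", "Audio", 25.0),
    ("2023-09-09", "Computers", 400.0),
    ("2021-12-31", "Audio", 999.0),  # outside both years: NULL in both CASE branches
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", rows)

# Conditional aggregation: one pass over the table, one column per year
pivot = conn.execute("""
    SELECT
        category,
        SUM(CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN amount END) AS y2023_total_sales,
        SUM(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN amount END) AS y2022_total_sales
    FROM toy_sales
    GROUP BY category
    ORDER BY category
""").fetchall()

# The same 2023 totals using the WHERE-filter approach
filtered_2023 = conn.execute("""
    SELECT category, SUM(amount)
    FROM toy_sales
    WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY category
    ORDER BY category
""").fetchall()

for category, y2023, y2022 in pivot:
    print(category, y2023, y2022)
```

One difference worth keeping in mind: with the `WHERE` approach a category with no rows in the year disappears from the result, while with conditional aggregation it stays and its total is `NULL`.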


#### Total Sales by Category and Sale Size

**`PERCENTILE_CONT`, `CASE WHEN`, and `SUM`**

1. Find the minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum.

To categorize sales into low, moderate, and high, we'll use the 25th percentile (Q1) and 75th percentile (Q3). Quartiles give data-driven cutoffs for three meaningful categories, instead of guessed thresholds:

- **Low**: Below the 25th percentile (Q1).
- **Moderate**: Between the 25th and 75th percentiles (Q1 and Q3).
- **High**: Above the 75th percentile (Q3).
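
The three tiers above can be sketched as a small function. This is only an illustration: `sale_tier` is a hypothetical helper, the default cutoffs are the Q1 and Q3 values that the 2023 percentile query in this section returns, and treating amounts exactly equal to Q1 or Q3 as "Moderate" is an assumption, since the list does not say which side the boundaries fall on.

```python
# Quartile values taken from the 2023 percentile query output in this section
Q1_SALES = 109.99
Q3_SALES = 1055.9386167

def sale_tier(amount, q1=Q1_SALES, q3=Q3_SALES):
    """Return 'Low', 'Moderate', or 'High' for one sale amount.

    ASSUMPTION: amounts equal to Q1 or Q3 count as 'Moderate'.
    """
    if amount < q1:
        return "Low"
    if amount <= q3:
        return "Moderate"
    return "High"

for amount in (96.20, 109.99, 553.76, 2075.42):
    print(amount, sale_tier(amount))
```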

To calculate the percentiles (25th, 50th, and 75th), use `PERCENTILE_CONT`:
- **`PERCENTILE_CONT`** calculates a continuous percentile (e.g., 25th, 50th, 75th) by linearly interpolating between adjacent values in the sorted data.
- Syntax:
```sql
SELECT 
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;
```
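
To make the interpolation concrete, here is a minimal Python sketch of the same idea. The function name `percentile_cont` and the `sample` values are illustrative assumptions; the function mimics the linear interpolation between neighboring sorted values and is not PostgreSQL's actual implementation.

```python
import math

def percentile_cont(values, fraction):
    """Linearly interpolated percentile of the given values, for 0 <= fraction <= 1."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile_cont needs at least one value")
    position = fraction * (len(ordered) - 1)  # fractional zero-based position
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower                 # how far to move toward the upper neighbor
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

sample = [10.0, 20.0, 30.0, 40.0]  # hypothetical sale amounts
q1 = percentile_cont(sample, 0.25)
median = percentile_cont(sample, 0.5)
q3 = percentile_cont(sample, 0.75)
print(q1, median, q3)
```

With four values, the 25th percentile sits three quarters of the way from the first value to the second, so the result can be a number that never appears in the data.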

In [19]:
%%sql 

SELECT
    MIN(s.quantity * p.price * exchangerate) AS minimum_sales,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q1_sales,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS median_sales, -- Median 
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q3_sales,
    MAX(s.quantity * p.price * exchangerate) AS maximum_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
;

minimum_sales,q1_sales,median_sales,q3_sales,maximum_sales
0.864576,109.99,390.81733600000007,1055.9386167,33040.27776


2. **Validate data**. Cross-check the SQL percentiles by pulling the 2023 sale amounts into Python and recomputing them with `np.percentile`.

In [20]:
# SQL Query to fetch data
query = '''
SELECT 
    s.quantity * p.price * exchangerate AS total_sale_amount
FROM 
    sales s
JOIN 
    product p 
ON 
    s.productkey = p.productkey
WHERE 
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY 
    total_sale_amount;
'''

# Fetch data into a Pandas DataFrame
sales_df = pd.read_sql_query(query, connection)

# Calculate percentiles on the sale-amount column
amounts = sales_df['total_sale_amount']
q1 = np.percentile(amounts, 25)
median = np.percentile(amounts, 50)
q3 = np.percentile(amounts, 75)

print(f"25th Percentile (Q1): {q1}")
print(f"Median (50th Percentile): {median}")
print(f"75th Percentile (Q3): {q3}")

25th Percentile (Q1): 109.99
Median (50th Percentile): 390.81733600000007
75th Percentile (Q3): 1055.9386167




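3. Pivot by category and bucket each 2023 sale as low, moderate, or high. Note that this query uses 109.99 (Q1) as the lower cutoff and 390.82 (the rounded median) as the upper cutoff; to follow the Q1/Q3 definition above, replace 390.82 with the Q3 value (1055.94).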

In [21]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 THEN (s.quantity * p.price * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 THEN (s.quantity * p.price * exchangerate) END) AS mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 THEN (s.quantity * p.price * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

category,low_total_sales,mod_total_sales,high_total_sales
Audio,18517.14886419999,153086.74323179998,559043.9803863
Cameras and camcorders,10450.695272499996,88050.94860670003,2009463.988842
Cell phones,92977.39448570008,614590.9325350003,5675529.435647107
Computers,53725.46274800002,511920.61936420016,11808121.6530181
Games and Toys,87521.74379977997,120820.3867505,78139.56483720001
Home Appliances,14312.664980999987,147530.55217780007,6155995.966541506
"Music, Movies and Audio Books",120542.6485763998,441908.8425569007,1759215.748362697
TV and Video,5783.283243400001,153111.92470640002,4540239.588725196
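
Because the three `CASE WHEN` conditions are mutually exclusive and together cover every non-null amount, the three bucket columns should add back up to each category's 2023 total. A small check, using figures copied from the output above and from the earlier `SUM` ... `GROUP BY` output for 2023 (only three categories are included to keep the sketch short):

```python
import math

# (low, moderate, high) bucket totals copied from the output above
buckets_2023 = {
    "Audio": (18517.14886419999, 153086.74323179998, 559043.9803863),
    "Computers": (53725.46274800002, 511920.61936420016, 11808121.6530181),
    "Games and Toys": (87521.74379977997, 120820.3867505, 78139.56483720001),
}

# 2023 category totals copied from the earlier SUM ... GROUP BY output
totals_2023 = {
    "Audio": 730647.8724822998,
    "Computers": 12373767.735130329,
    "Games and Toys": 286481.69538748043,
}

for category, parts in buckets_2023.items():
    # Floating-point sums differ in the last digits, so compare with a tolerance
    matches = math.isclose(sum(parts), totals_2023[category], abs_tol=1e-6)
    print(f"{category}: buckets sum to category total -> {matches}")
```

If a check like this fails, the usual cause is overlapping or gapped boundaries, for example using `<=` in one condition and `>=` in the next.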


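4. Add the year to each `CASE WHEN` condition to pivot by category, sale-amount bucket, and year. The date filter moves from the `WHERE` clause into each condition so that 2023 and 2022 can both be returned by one query.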

In [22]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2023_high_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * p.price * exchangerate) END) AS y2022_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

category,y2023_low_total_sales,y2023_mod_total_sales,y2023_high_total_sales,y2022_low_total_sales,y2022_mod_total_sales,y2022_high_total_sales
Audio,18517.14886419999,153086.74323179995,559043.9803863001,22601.280879699996,190833.1095489,640692.9418154999
Cameras and camcorders,10450.695272499996,88050.94860670003,2009463.9888419996,10517.544616799994,114247.81607410005,2304436.3792469
Cell phones,92977.39448570012,614590.9325349999,5675529.435647105,114541.59651480013,714091.8891926997,6514229.986437498
Computers,53725.462748000034,511920.6193642,11808121.653018106,62006.52292680007,613193.0739515007,14872862.5330917
Games and Toys,87521.74379977993,120820.38675050004,78139.56483720001,106773.48944287989,137926.85501021994,106764.2860127
Home Appliances,14312.664980999987,147530.55217780007,6155995.966541506,15544.51740389998,175970.80389230017,7182599.527743014
"Music, Movies and Audio Books",120542.64857639978,441908.8425569006,1759215.7483626967,134520.7174466998,537019.2360523001,2143153.785787492
TV and Video,5783.283243400001,153111.92470639996,4540239.588725198,9966.4511265,237284.6163497,6091238.793334808
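
The six columns above follow one pattern (year × bucket), so they can also be generated instead of typed by hand. The sketch below is one possible approach, not part of the original notebook: `pivot_columns`, `TIERS`, and the `toy_sales` table are hypothetical names, the cutoffs are the ones used in the query above, and it runs against an in-memory SQLite table with ISO text dates rather than the Contoso database.

```python
import sqlite3

# Hypothetical bucket definitions, using the cutoffs from the query above
TIERS = {
    "low": "amount < 109.99",
    "mod": "amount >= 109.99 AND amount < 390.82",
    "high": "amount >= 390.82",
}

def pivot_columns(years, tiers=TIERS):
    """Build one SUM(CASE WHEN ...) column per (year, bucket) pair."""
    columns = []
    for year in years:
        in_year = f"orderdate BETWEEN '{year}-01-01' AND '{year}-12-31'"
        for tier, condition in tiers.items():
            columns.append(
                f"SUM(CASE WHEN {condition} AND {in_year} THEN amount END) "
                f"AS y{year}_{tier}_total_sales"
            )
    return ",\n    ".join(columns)

query = f"""
SELECT
    category,
    {pivot_columns([2023, 2022])}
FROM toy_sales
GROUP BY category
ORDER BY category
"""

# Toy data (hypothetical) in an in-memory SQLite table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        ("2023-02-01", "Audio", 50.0),
        ("2023-03-01", "Audio", 200.0),
        ("2023-04-01", "Audio", 1000.0),
        ("2022-02-01", "Audio", 75.0),
        ("2022-06-01", "Audio", 500.0),
    ],
)
cursor = conn.execute(query)
column_names = [col[0] for col in cursor.description]
result = cursor.fetchall()
print(column_names)
print(result)
```

The years are interpolated directly into the SQL text, so this pattern is only appropriate for trusted, hard-coded values. A bucket with no matching rows comes back as `NULL` (`None` in Python) rather than 0.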
