<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/3_Pivot_With_Multiple_Case_When.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pivot with Multiple Case When Statements

## Overview

### 🥅 Analysis Goals

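Pivot sales data with multiple `CASE WHEN` statements to compare 2022 and 2023 side by side.

- Compare net revenue and distinct customer counts for 2022 vs 2023 by order date
- Split sales into low and high using the median sale amount
- Pivot total sales by product category, low/high label, and year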

### 📘 Concepts Covered

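- `CASE WHEN` inside `SUM`, `COUNT`, and `COUNT(DISTINCT ...)`
- `PERCENTILE_CONT` to find the median
- A CTE with `CROSS JOIN` to apply a calculated threshold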

---

In [None]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

---
## Different Aggregations

### 📝 Notes

**Pivot with multiple `CASE WHEN` statements for different aggregation functions** combines pivot operations like `COUNT` and `SUM`, each using separate `CASE WHEN` conditions.

- Syntax:

  ```sql
  SELECT
    COUNT(CASE WHEN condition THEN column END) AS count_alias,
    SUM(CASE WHEN condition THEN column ELSE 0 END) AS sum_alias
  FROM table_name;
  ```

- Example:

  ```sql
  SELECT
    COUNT(CASE WHEN category = 'A' THEN user_id END) AS category_a_users,
    SUM(CASE WHEN category = 'A' THEN revenue ELSE 0 END) AS category_a_revenue
  FROM user_data;
  ```

  This counts the users and sums the revenue for category A.
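
A minimal sketch of the example above, run against an in-memory SQLite database through Python's built-in `sqlite3` module so it needs no PostgreSQL setup. The `user_data` rows are made up for illustration, and one row has a NULL `user_id` to show how `COUNT` and `SUM` treat it differently.

```python
import sqlite3

# Hypothetical toy data; the table contents are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_data (user_id INTEGER, category TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO user_data VALUES (?, ?, ?)",
    [(1, "A", 10.0), (2, "A", 20.0), (None, "A", 5.0), (3, "B", 40.0)],
)

query = """
SELECT
    COUNT(CASE WHEN category = 'A' THEN user_id END) AS category_a_users,
    SUM(CASE WHEN category = 'A' THEN revenue ELSE 0 END) AS category_a_revenue
FROM user_data;
"""
category_a_users, category_a_revenue = conn.execute(query).fetchone()
print(category_a_users, category_a_revenue)
```

The row with a NULL `user_id` still adds to the sum but not to the count, because `COUNT(expression)` only counts non-NULL values.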

### 💻 Final Result

- Return net revenue and distinct customer counts for 2022 and 2023 as separate columns, grouped by order date.

In [None]:
%%sql

SELECT
    s.orderdate,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2022_net_revenue,
    COUNT(DISTINCT CASE WHEN s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN s.customerkey END) AS y2022_customers,
    SUM(CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS y2023_net_revenue,
    COUNT(DISTINCT CASE WHEN s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN s.customerkey END) AS y2023_customers
FROM
    sales s
GROUP BY   
    s.orderdate
ORDER BY
    s.orderdate
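
The year pivot in the query above can be tried on a toy table. This is a rough sketch using Python's `sqlite3` module; the `toy_sales` table, its columns, and its rows are hypothetical, dates are stored as ISO text, and the result is collapsed to a single row (no `GROUP BY`) to keep the year columns easy to read.

```python
import sqlite3

# Hypothetical toy data; not the Contoso dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (orderdate TEXT, customerkey INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?)",
    [
        ("2022-03-01", 1, 50.0),
        ("2022-07-15", 1, 30.0),  # same customer orders twice in 2022
        ("2022-11-20", 2, 20.0),
        ("2023-02-10", 1, 70.0),
    ],
)

query = """
SELECT
    SUM(CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN revenue END) AS y2022_net_revenue,
    COUNT(DISTINCT CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN customerkey END) AS y2022_customers,
    SUM(CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN revenue END) AS y2023_net_revenue,
    COUNT(DISTINCT CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN customerkey END) AS y2023_customers
FROM toy_sales;
"""
year_pivot = conn.execute(query).fetchone()
print(year_pivot)
```

Customer 1 orders twice in 2022 but is counted once, since `COUNT(DISTINCT ...)` ignores both duplicates and the NULLs produced for rows outside the year.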

---
## Multiple CASE WHEN Statements

### 📝 Notes

- **Pivot with multiple `CASE WHEN` statements in the same aggregation function** applies multiple `CASE WHEN` conditions within a single aggregation, such as `COUNT`.

  - Syntax:

    ```sql
    SELECT
      COUNT(
        CASE 
          WHEN condition1 THEN column
          WHEN condition2 THEN column
        END
      ) AS alias
    FROM table_name;
    ```

  - Example:

    ```sql
    SELECT
      COUNT(
        CASE 
          WHEN category = 'A' THEN user_id
          WHEN category = 'B' THEN user_id
        END
      ) AS category_a_b_users
    FROM user_data;
    ```

    This counts users where the category is either 'A' or 'B'.
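
As a quick check of the example above, here is a small sketch using Python's `sqlite3` module with made-up `user_data` rows. It also runs an `IN ('A', 'B')` version for comparison, which is an alternative spelling added here for illustration; when every `WHEN` branch returns the same column, the two forms give the same count.

```python
import sqlite3

# Hypothetical toy data; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_data (user_id INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO user_data VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C"), (4, "A")],
)

multi_when = """
SELECT COUNT(
    CASE
        WHEN category = 'A' THEN user_id
        WHEN category = 'B' THEN user_id
    END
) AS category_a_b_users
FROM user_data;
"""
in_list = """
SELECT COUNT(CASE WHEN category IN ('A', 'B') THEN user_id END) AS category_a_b_users
FROM user_data;
"""
multi_when_count = conn.execute(multi_when).fetchone()[0]
in_list_count = conn.execute(in_list).fetchone()[0]
print(multi_when_count, in_list_count)
```

Separate `WHEN` branches become more useful when each branch returns a different value, for example a different label or column per condition.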

### 💻 Final Result

- Return total sales by product category, split into low and high sale amounts (relative to the median) and by year (2022 vs 2023).

#### Categorize as Low and High for Total Sale

**`SUM`**, **`CASE WHEN`**, **`PERCENTILE_CONT`**

1. Categorize sales as low or high using the median sale amount as the cutoff, which is more meaningful than guessing a threshold. This first query calculates the median for 2022 and 2023 sales.

- **Low**: Below the median.
- **High**: At or above the median.
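
To make the low/high split concrete, here is a toy sketch in Python with `sqlite3`. The four sale amounts and the `toy_sales` table are hypothetical. SQLite has no `PERCENTILE_CONT`, so the median is computed with Python's `statistics.median` and passed in as a parameter; the comparison uses `<` for low and `>=` for high, as the queries in this section do.

```python
import sqlite3
from statistics import median

# Hypothetical sale amounts, illustrative only.
amounts = [100.0, 250.0, 400.0, 900.0]

# With an even number of rows, the median is the mean of the two middle
# values, which is also what PERCENTILE_CONT(0.5) interpolates to.
threshold = median(amounts)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (amount REAL)")
conn.executemany("INSERT INTO toy_sales VALUES (?)", [(a,) for a in amounts])

low_total, high_total = conn.execute(
    """
    SELECT
        SUM(CASE WHEN amount < ? THEN amount END) AS low_total_sales,
        SUM(CASE WHEN amount >= ? THEN amount END) AS high_total_sales
    FROM toy_sales;
    """,
    (threshold, threshold),
).fetchone()
print(threshold, low_total, high_total)
```

Every row falls into exactly one bucket, so the low and high totals always add up to the overall total.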

In [None]:
%%sql 

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * exchangerate)) AS median
FROM
    sales s
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
;

2. Pivot by category, labeling each sale as low or high relative to the median from step 1 (hard-coded as 398 in the query), and total the sale amounts for 2022 and 2023.

In [None]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398 THEN (s.quantity * s.netprice * exchangerate) END) AS low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 THEN (s.quantity * s.netprice * exchangerate) END) AS high_total_sales
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2022-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

3. Add the year to the pivot so each category shows low and high total sales for 2022 and 2023 side by side.

In [None]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < 398
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= 398 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    category
ORDER BY
    category;

4. **Bonus:** Make the median dynamic by calculating it in a CTE instead of entering it manually.

In [None]:
%%sql

-- Calculate the median sale amount for 2022 and 2023
WITH median_value AS (
    SELECT 
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * s.netprice * exchangerate)) AS median
    FROM sales s
    WHERE orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
)

-- Pivot the data by category, low and high sales, and year
SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < mv.median
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= mv.median
        AND orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2022_high_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) < mv.median
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_low_total_sales,
    SUM(CASE WHEN (s.quantity * s.netprice * exchangerate) >= mv.median 
        AND orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * exchangerate) END) AS y2023_high_total_sales
FROM    
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
    CROSS JOIN median_value mv
GROUP BY
    category
ORDER BY
    category;
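
The CTE plus `CROSS JOIN` pattern can be sketched on toy data with Python's `sqlite3` module. Everything here is an assumption for illustration: the `toy_sales` table, its rows, and the `py_median` aggregate, which is a hypothetical helper registered from Python because SQLite has no `PERCENTILE_CONT`. The year conditions are left out to keep the sketch short.

```python
import sqlite3
from statistics import median


class PyMedian:
    """Hypothetical aggregate: collects non-NULL values and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, PyMedian)
conn.execute("CREATE TABLE toy_sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?)",
    [("Audio", 100.0), ("Audio", 900.0), ("Games", 250.0), ("Games", 400.0)],
)

query = """
-- One-row CTE holding the threshold
WITH median_value AS (
    SELECT py_median(amount) AS median FROM toy_sales
)
SELECT
    t.category,
    SUM(CASE WHEN t.amount < mv.median THEN t.amount END) AS low_total_sales,
    SUM(CASE WHEN t.amount >= mv.median THEN t.amount END) AS high_total_sales
FROM toy_sales t
CROSS JOIN median_value mv
GROUP BY t.category
ORDER BY t.category;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the CTE returns exactly one row, the `CROSS JOIN` attaches the same median to every sale without changing the row count, so the threshold updates automatically whenever the underlying data changes.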