<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/1_Basic_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Count & Sum Aggregation

## Count Overview

### 🥅 Analysis Goals

- **Total daily customers in 2023**: Examine daily trends to understand customer activity and identify peak periods.
- **Customer location (continent)**: Explore customer distribution by continent to assess regional trends.

### 📘 Concepts Covered

- `COUNT` Review
- `COUNT` with `CASE WHEN`
- Pivot with Multiple CASE WHEN Statements

[Source Documentation on Aggregate Functions.](https://www.postgresql.org/docs/9.5/functions-aggregate.html)

---

## What the heck is Pivoting?

#### Definition  
- Transforming data from a **long format (rows)** to a **wide format (columns)** for better analysis.  

#### Examples  

**Before Pivoting (Long Format):**  

| Date       | Category | Sales  |
|------------|---------|--------|
| 2024-01-01 | A       | 100    |
| 2024-01-01 | B       | 200    |
| 2024-01-02 | A       | 150    |

**After Pivoting (Wide Format):**  

| Date       | A Sales | B Sales |
|------------|--------|--------|
| 2024-01-01 | 100    | 200    |
| 2024-01-02 | 150    | NULL   |

#### Key Benefits  
✅ Easier to read & analyze  
✅ Reduces redundancy in reports  
✅ Enables quick comparisons across categories  


---

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Update package installer
    !sudo apt-get update -qq > /dev/null 2>&1

    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'colab_db' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COUNT Review

### 📝 Notes

`COUNT`

- **COUNT** counts the number of rows that match a specified condition or counts all rows when no condition is provided.

- Syntax:

  ```sql
  COUNT(column_name)
  ```

- Example: `COUNT(user_id)` counts all non-NULL values in the `user_id` column. `COUNT(*)` counts all rows, including those with NULL values.

### 📈 Analysis

- Get the total unique customers by day in 2023. Let's us take a closer look at daily trends to understand customer activity and identify peak periods.

#### Total Unique Customers by Day

**`COUNT`**

1. Count by day the total number of customers.
    - Use `COUNT(customerkey)` to count all customer entries, including duplicates for customers who appear multiple times on the same day.
    - Group data by `orderdate` to aggregate counts per day.
    - Sort the results in chronological order by `orderdate`.

In [2]:
%%sql

SELECT
    orderdate,
    COUNT(customerkey) AS total_customers
FROM
    sales
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2015-01-01,25
1,2015-01-02,8
2,2015-01-03,21
3,2015-01-05,10
4,2015-01-06,12
...,...,...
3289,2024-04-16,32
3290,2024-04-17,61
3291,2024-04-18,57
3292,2024-04-19,50


2. Instead let's get the count of **unique** customers by day.
    - 🔔 Use `COUNT(DISTINCT customerkey)` to ensure each customer is counted only once per day, even if they appear multiple times.
    - Group data by `orderdate` to get daily unique customer counts.
    - Sort the results in chronological order by `orderdate`.

In [3]:
%%sql

SELECT
    orderdate,
    COUNT(DISTINCT customerkey) AS total_customers -- Update
FROM
    sales
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2015-01-01,9
1,2015-01-02,6
2,2015-01-03,11
3,2015-01-05,4
4,2015-01-06,5
...,...,...
3289,2024-04-16,14
3290,2024-04-17,22
3291,2024-04-18,25
3292,2024-04-19,19


3. Calculate only for orders in 2023.

    - Use a `WHERE` clause to filter `orderdate` to the year 2023 (from `'2023-01-01'` to `'2023-12-31'`).
    - 🔔 Use `COUNT(DISTINCT customerkey)` to count unique customers per day within the specified year.
    - Group data by `orderdate` to get daily unique customer counts for 2023.
    - Sort the results in chronological order by `orderdate`.

In [4]:
%%sql

SELECT
    orderdate,
    COUNT(DISTINCT customerkey) AS total_customers
FROM
    sales
WHERE  -- Added
    orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2023-01-01,12
1,2023-01-02,49
2,2023-01-03,64
3,2023-01-04,78
4,2023-01-05,87
...,...,...
359,2023-12-27,73
360,2023-12-28,75
361,2023-12-29,55
362,2023-12-30,91


---
## Pivot with COUNT

### 📝 Notes

`COUNT(CASE WHEN)`

- **Pivot with COUNT (using `CASE WHEN` statements)** enables pivoting data by counting rows based on conditional logic.

- Syntax:

  ```sql
  COUNT(DISTINCT CASE WHEN condition THEN column END) AS alias
  ```

- Example:

  ```sql
  SELECT
    COUNT(DISTINCT CASE WHEN status = 'active' THEN user_id END) AS active_users
  FROM users;
  ```

### 💻 Analysis

- Return the unique customers by day for customer continent. This helps us understand our customer demographics better. Specifically our customer distribution by continent to assess regional trends.

### Total Customers by Customer Continent
**`CASE WHEN`, `COUNT`**

1. Confirm the unique continents in the `customer` table.

    - Use `SELECT DISTINCT` to retrieve a list of unique values in the `continent` column.
    - This query ensures no duplicates in the output, showing each continent listed only once.

In [5]:
%%sql

SELECT DISTINCT
    continent
FROM
    customer

Unnamed: 0,continent
0,Australia
1,North America
2,Europe


2. Pivot the data by the unique number of customers who ordered between 2023-01-01 and 2023-12-31 by the continent.

    - Use `COUNT(DISTINCT ...)` to calculate the unique customers who placed orders, broken down by continent.
    - Use `CASE WHEN` within `COUNT` to conditionally count customers for specific continents:
        - `c.continent = 'Europe'` for European customers.
        - `c.continent = 'North America'` for North American customers.
        - `c.continent = 'Australia'` for Australian customers.
    - Filter the data to only include orders from 2023 using a `WHERE` clause (`orderdate BETWEEN '2023-01-01' AND '2023-12-31'`).
    - Group data by `orderdate` to aggregate daily customer counts by continent.
    - Sort the results by `orderdate` in chronological order.

In [6]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' THEN s.customerkey END) AS au_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,eu_customers,na_customers,au_customers
0,2023-01-01,6,5,1
1,2023-01-02,15,31,3
2,2023-01-03,17,44,3
3,2023-01-04,28,46,4
4,2023-01-05,22,57,8
...,...,...,...,...
359,2023-12-27,26,41,6
360,2023-12-28,24,44,7
361,2023-12-29,19,32,4
362,2023-12-30,25,50,16


<img src="https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/Resources/images/1.1_customer_continent.png?raw=1" alt="Continent" width="50%">

---

## Sum Aggregation

### 🥅 Analysis Goals

Perform EDA on product categories and their net revenue from the sales table to uncover general trends and understand the dataset. Specifically:

- **Total net revenue in 2023 and 2022**: Compare yearly revenue trends to identify overall growth or decline.
- **Net revenue by product categories in 2023 and 2022**: Explore which categories contribute most to revenue across two years.

### 📘 Concepts Covered

- `SUM` Review
- `SUM` with `CASE WHEN`
- `BETWEEN` with `DATE`

[Source Documentation on Aggregate Functions.](https://www.postgresql.org/docs/9.5/functions-aggregate.html)

---
## SUM Review

### 📝 Notes

`SUM`

- **SUM** adds up all numeric values in a specified column, excluding NULL values.

- Syntax:

  ```sql
  SUM(column_name)
  ```

- Example:

  ```sql
  SELECT SUM(order_amount) AS total_revenue
  FROM orders;
  ```

### 📈 Analysis

- Find the total net revenue by day in 2023.
- Calculate the total net revenue by category in 2022 and 2023. Compare yearly revenue trends to identify overall growth or decline.

#### Total Net Revenue by Day in 2023

**`SUM`**

1. Find the net revenue by orderdate for 2023 orders.

    - Use `SUM(quantity * netprice * exchangerate)` to calculate the net revenue for each day.
    - Filter orders to include only dates in 2023 using `WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'`.
    - Group data by `orderdate` to calculate daily revenue.
    - Sort the results by `orderdate` in chronological order.

In [7]:
%%sql

SELECT
    orderdate,
    SUM(quantity * netprice * exchangerate) AS net_revenue -- Added
FROM
    sales
WHERE
    orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,net_revenue
0,2023-01-01,30140.80
1,2023-01-02,107847.49
2,2023-01-03,192655.60
3,2023-01-04,189451.71
4,2023-01-05,216573.23
...,...,...
359,2023-12-27,141981.34
360,2023-12-28,138772.19
361,2023-12-29,85913.44
362,2023-12-30,165917.02


#### Total Net Revenue by Product Category in 2022 and 2023

**`SUM`**

1. Find the total net revenue by the product category for 2023 orders.

    - Use `SUM(quantity * netprice * exchangerate)` to calculate net revenue for each product category.
    - 🔔 Join the `sales` table with the `product` table on `productkey` to access `categoryname`.
    - Filter orders to include only dates in 2023 using `WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'`.
    - 🔔 Group data by `categoryname` to calculate revenue by category.
    - 🔔 Sort results alphabetically by `categoryname`.

In [8]:
%%sql

SELECT
    p.categoryname AS category_name, -- Added
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey -- Added
WHERE
    s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    p.categoryname -- Update
ORDER BY
    p.categoryname -- Update

Unnamed: 0,category_name,net_revenue
0,Audio,688690.18
1,Cameras and camcorders,1983546.29
2,Cell phones,6002147.63
3,Computers,11650867.21
4,Games and Toys,270374.96
5,Home Appliances,5919992.87
6,"Music, Movies and Audio Books",2180768.13
7,TV and Video,4412178.23


2. Find the total net revenue by the product category for 2022 orders.

    - Use `SUM(quantity * netprice * exchangerate)` to calculate net revenue for each product category.
    - Join the `sales` table with the `product` table on `productkey` to access `categoryname`.
    - 🔔 Filter orders to include only dates in 2022 using `WHERE orderdate BETWEEN '2022-01-01' AND '2022-12-31'`.
    - Group data by `categoryname` to calculate revenue by category.
    - Sort results alphabetically by `categoryname`.

In [9]:
%%sql

SELECT
    p.categoryname AS category_name,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
WHERE
    s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname

Unnamed: 0,category_name,net_revenue
0,Audio,766938.21
1,Cameras and camcorders,2382532.56
2,Cell phones,8119665.07
3,Computers,17862213.49
4,Games and Toys,316127.3
5,Home Appliances,6612446.68
6,"Music, Movies and Audio Books",2989297.28
7,TV and Video,5815336.61


---
## SUM with CASE WHEN

### 📝 Notes

`SUM(CASE WHEN)`

- **Pivot with SUM (using `CASE WHEN` statements)** enables pivoting data by summing values based on conditional logic.

- Syntax:

  ```sql
  SUM(CASE WHEN condition THEN column ELSE 0 END) AS alias
  ```

- Example:

  ```sql
  SELECT
    SUM(CASE WHEN region = 'North' THEN sales END) AS north_sales,
    SUM(CASE WHEN region = 'South' THEN sales END) AS south_sales
  FROM sales_data;
  ```

### 📈 Analysis

- Compare total net revenue of products by category ordered in 2023 and 2022. Explore which categories contribute most to revenue across two years. ◊

#### Total Net Revenue by Category and Year (2022 vs 2023)

**`CASE WHEN` and `SUM`**

1. Pivot to get the total net revenue by category and compare 2023 with 2022.

    - Use `SUM` with `CASE WHEN` to calculate separate revenue totals for 2022 and 2023:
        - `CASE WHEN orderdate BETWEEN '2022-01-01' AND '2022-12-31'` for 2022 revenue.
        - `CASE WHEN orderdate BETWEEN '2023-01-01' AND '2023-12-31'` for 2023 revenue.
    - Join the `sales` to `product` table using `LEFT JOIN` to group by `categoryname`.
    - Group data by `categoryname` to provide a category-based comparison.
    - Sort results alphabetically by `categoryname`.

In [10]:
%%sql

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS total_net_revenue_2022,
    SUM(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31' THEN (s.quantity * s.netprice * s.exchangerate) END) AS total_net_revenue_2023
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY
    p.categoryname
ORDER BY
    p.categoryname;

Unnamed: 0,category,total_net_revenue_2022,total_net_revenue_2023
0,Audio,766938.21,688690.18
1,Cameras and camcorders,2382532.56,1983546.29
2,Cell phones,8119665.07,6002147.63
3,Computers,17862213.49,11650867.21
4,Games and Toys,316127.3,270374.96
5,Home Appliances,6612446.68,5919992.87
6,"Music, Movies and Audio Books",2989297.28,2180768.13
7,TV and Video,5815336.61,4412178.23


<img src="https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/Resources/images/1.2_category_year.png?raw=1" alt="Continent" width="50%">


In [21]:
# Customer Gender Distribution by Store (1.1.1) - Problem
# Calculate the total number of unique male and female customers who made purchases in each store in 2023. This will help in understanding the gender distribution of customers across different stores.
#Use COUNT with CASE WHEN to count male and female customers separately.
#Group the results by storecode and order them by storecode.

%%sql
select
  st.storekey,
  c.customerkey,
  c.gender,
  case when
    c.gender = 'female', count distinct (customerkey)
    else count distinct (customerkey)

from sales as s
inner join store as st
  on s.storekey = st.storekey
inner join customer as c
  on s.customerkey = c.customerkey

where extract (year from s.orderdate) = 2023

group by c.gender


Unnamed: 0,storekey,customerkey,gender,orderdate
0,90,239821,female,2023-01-01
1,999999,1025340,female,2023-01-01
2,120,686958,male,2023-01-01
3,120,686958,male,2023-01-01
4,120,686958,male,2023-01-01
...,...,...,...,...
37512,999999,759196,male,2023-12-30
37513,999999,1328627,male,2023-12-30
37514,480,1398584,female,2023-12-31
37515,480,1398584,female,2023-12-31


In [14]:
%%sql
SELECT *
FROM store
LIMIT 10; -- Limiting to 10 rows for brevity

Unnamed: 0,storekey,storecode,geoareakey,countrycode,countryname,state,opendate,closedate,description,squaremeters,status
0,10,1,1,AU,Australia,Australian Capital Territory,2008-01-01,,Contoso Store Australian Capital Territory,595.0,
1,20,2,3,AU,Australia,Northern Territory,2008-01-12,2016-07-07,Contoso Store Northern Territory,665.0,Closed
2,30,3,5,AU,Australia,South Australia,2012-01-07,2015-08-08,Contoso Store South Australia,2000.0,Restructured
3,35,3,5,AU,Australia,South Australia,2015-12-08,,Contoso Store South Australia,3000.0,
4,40,4,6,AU,Australia,Tasmania,2010-01-01,,Contoso Store Tasmania,2000.0,
5,50,5,7,AU,Australia,Victoria,2015-12-09,,Contoso Store Victoria,2000.0,
6,60,6,8,AU,Australia,Western Australia,2010-01-01,,Contoso Store Western Australia,2000.0,
7,70,7,12,CA,Canada,New Brunswick,2007-05-07,2014-03-09,Contoso Store New Brunswick,1105.0,Restructured
8,72,7,12,CA,Canada,New Brunswick,2015-01-11,2018-02-02,Contoso Store New Brunswick,1500.0,Restructured
9,74,7,12,CA,Canada,New Brunswick,2018-06-02,,Contoso Store New Brunswick,3500.0,


In [15]:
%%sql
SELECT *
FROM sales
LIMIT 10; -- Limiting to 10 rows for brevity

Unnamed: 0,orderkey,linenumber,orderdate,deliverydate,customerkey,storekey,productkey,quantity,unitprice,netprice,unitcost,currencycode,exchangerate
0,1000,0,2015-01-01,2015-01-01,947009,400,48,1,112.46,98.97,57.34,GBP,0.64
1,1000,1,2015-01-01,2015-01-01,947009,400,460,1,749.75,659.78,382.25,GBP,0.64
2,1001,0,2015-01-01,2015-01-01,1772036,430,1730,2,54.38,54.38,25.0,USD,1.0
3,1002,0,2015-01-01,2015-01-01,1518349,660,955,4,315.04,286.69,144.88,USD,1.0
4,1002,1,2015-01-01,2015-01-01,1518349,660,62,7,135.75,135.75,62.43,USD,1.0
5,1002,2,2015-01-01,2015-01-01,1518349,660,1050,3,499.2,434.3,229.57,USD,1.0
6,1002,3,2015-01-01,2015-01-01,1518349,660,1608,1,65.99,58.73,33.65,USD,1.0
7,1003,0,2015-01-01,2015-01-01,1317097,510,85,3,74.99,74.99,34.48,USD,1.0
8,1004,0,2015-01-01,2015-01-01,254117,80,128,2,114.72,113.57,58.49,CAD,1.16
9,1004,1,2015-01-01,2015-01-01,254117,80,2079,1,499.45,499.45,165.48,CAD,1.16


In [16]:
%%sql
SELECT *
FROM customer
LIMIT 10; -- Limiting to 10 rows for brevity

Unnamed: 0,customerkey,geoareakey,startdt,enddt,continent,gender,title,givenname,middleinitial,surname,...,zipcode,country,countryfull,birthday,age,occupation,company,vehicle,latitude,longitude
0,15,4,1990-09-10,2034-07-29,Australia,male,Mr.,Julian,A,McGuigan,...,4357,AU,Australia,1965-03-24,55,Border Patrol agent,Cut Rite Lawn Care,2000 Peugeot Kart Up,-27.83,151.17
1,23,8,1995-08-11,2045-01-26,Australia,female,Ms.,Rose,H,Dash,...,6055,AU,Australia,1990-05-10,30,Agricultural and food scientist,Rack N Sack,2005 Volvo XC90,-31.92,116.05
2,36,2,1992-03-12,2044-05-14,Australia,female,Ms.,Annabelle,J,Townsend,...,2304,AU,Australia,1964-07-16,56,Special education teacher,id Boutiques,1999 Lancia Lybra,-32.88,151.71
3,120,6,1983-07-23,2033-08-09,Australia,male,Mr.,Jamie,H,Hetherington,...,7256,AU,Australia,1946-12-11,74,Dental laboratory technician,Showbiz Pizza Place,2006 Dodge Durango,-39.77,144.02
4,180,7,1987-11-26,2026-10-14,Australia,male,Mr.,Gabriel,P,Bosanquet,...,3505,AU,Australia,1955-04-24,65,Administrative support specialist,Dubrow's Cafeteria,1995 Morgan Plus 4,-34.13,142.14
5,185,2,1990-08-01,2029-05-28,Australia,female,Mrs.,Gabrielle,B,Castella,...,2469,AU,Australia,1980-02-23,40,Management dietitian,d.e.m.o.,1997 Alpina B6,-29.01,152.84
6,189,7,2008-07-05,2017-11-01,Australia,female,Ms.,Hayley,C,Jull,...,3377,AU,Australia,1960-04-18,60,Sculptor,Asian Plan,2006 Alpina B5,-37.34,142.91
7,210,2,1980-09-28,2030-05-08,Australia,female,Mrs.,Natalie,L,Hilder,...,2632,AU,Australia,1950-11-23,70,Motel desk clerk,Enrich Garden Services,1995 Ford Fairlane,-36.84,149.05
8,225,7,1985-06-25,2017-09-25,Australia,male,Mr.,Hunter,J,Hutchins,...,3763,AU,Australia,1978-07-20,42,Teletype operator,Mr. Good Buys,2012 Lexus GX,-37.51,145.44
9,243,2,1982-02-07,2027-02-09,Australia,female,Ms.,Maya,J,Atherton,...,2446,AU,Australia,1954-05-15,66,Pilates instructor,Franklin Music,1995 Chevrolet Caprice,-31.36,152.39
