<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/1_Count_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Count Aggregation

## Overview

### 🥅 Analysis Goals

- Explore the `sales` table and its customers (EDA):
    - Compare customers who ordered in 2022 and 2023.
    - Look at the location (continent) where customers are based.

### 📘 Concepts Covered

- `COUNT` Review
- `COUNT` with `CASE WHEN`
- Pivot with multiple `CASE WHEN` statements

---

In [2]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the `sql` extension for SQL magic (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

---
## Sales Table Review

### 📝 Notes

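- Each row in `sales` is one product line of an order, so the same `orderkey` can appear more than once.
- `price` lives in the `product` table and customer attributes live in the `customer` table, so both are joined to `sales` on their keys.

### 💻 Final Result

- Return each sales row with its `total_sale_amount` and the customer's continent and gender.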

#### Review Sales Table

**`LEFT JOIN` / Calculated Column**

1. Find the total sale amount for each row by multiplying `quantity` (from the `sales` table) by `price` (from the `product` table) and by `exchangerate` (since not all sales are made in USD).

In [16]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,storekey,productkey,quantity,price,total_sale_amount
1000,2015-01-01,947009,400,48,1,149.95,96.2004225
1000,2015-01-01,947009,400,460,1,299.9,192.400845
1001,2015-01-01,1772036,430,1730,2,77.68,155.36
1002,2015-01-01,1518349,660,955,4,196.9,787.6
1002,2015-01-01,1518349,660,62,7,181.0,1267.0
1002,2015-01-01,1518349,660,1050,3,312.0,936.0
1002,2015-01-01,1518349,660,1608,1,109.99,109.99
1003,2015-01-01,1317097,510,85,3,99.99,299.97
1004,2015-01-01,254117,80,128,2,143.4,332.203308
1004,2015-01-01,254117,80,2079,1,665.94,771.3649614000001


2. Join the `customer` table to add customer information such as continent and gender.

In [15]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    c.continent, -- Added
    c.gender, -- Added
    s.productkey,
    s.quantity,
    p.price,
    s.quantity * p.price * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN product p ON s.productkey = p.productkey
    LEFT JOIN customer c ON s.customerkey = c.customerkey
ORDER BY
    s.orderkey

orderkey,orderdate,customerkey,continent,gender,productkey,quantity,price,total_sale_amount
1000,2015-01-01,947009,Europe,male,48,1,149.95,96.2004225
1000,2015-01-01,947009,Europe,male,460,1,299.9,192.400845
1001,2015-01-01,1772036,North America,female,1730,2,77.68,155.36
1002,2015-01-01,1518349,North America,female,955,4,196.9,787.6
1002,2015-01-01,1518349,North America,female,62,7,181.0,1267.0
1002,2015-01-01,1518349,North America,female,1050,3,312.0,936.0
1002,2015-01-01,1518349,North America,female,1608,1,109.99,109.99
1003,2015-01-01,1317097,North America,male,85,3,99.99,299.97
1004,2015-01-01,254117,North America,male,128,2,143.4,332.203308
1004,2015-01-01,254117,North America,male,2079,1,665.94,771.3649614000001
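
To make the behavior of the two `LEFT JOIN`s concrete, here is a minimal sketch that runs the same query shape on a few made-up rows. It uses Python's built-in `sqlite3` module with an in-memory database instead of the notebook's PostgreSQL connection, and every row value below is hypothetical. It shows that each `sales` row is kept even when no matching `product` or `customer` row exists, in which case the joined columns (and anything calculated from them) come back as `NULL`.

```python
import sqlite3

# Toy stand-ins for the notebook's tables (hypothetical rows, SQLite instead of PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (productkey INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, continent TEXT, gender TEXT);
    CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER, productkey INTEGER,
                        quantity INTEGER, exchangerate REAL);
    INSERT INTO product VALUES (1, 10.0), (2, 25.0);
    INSERT INTO customer VALUES (100, 'Europe', 'male');
    INSERT INTO sales VALUES
        (1000, 100, 1, 2, 0.5),   -- product and customer both match
        (1001, 999, 2, 1, 1.0),   -- customerkey 999 has no row in customer
        (1002, 100, 3, 4, 1.0);   -- productkey 3 has no row in product
""")

rows = conn.execute("""
    SELECT
        s.orderkey,
        c.continent,
        p.price,
        s.quantity * p.price * s.exchangerate AS total_sale_amount
    FROM
        sales s
        LEFT JOIN product p ON s.productkey = p.productkey
        LEFT JOIN customer c ON s.customerkey = c.customerkey
    ORDER BY
        s.orderkey
""").fetchall()

# All three sales rows survive; unmatched lookups appear as None (NULL).
for row in rows:
    print(row)
```

With an `INNER JOIN`, orders 1001 and 1002 would be dropped from the result instead of appearing with missing values.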


---
## COUNT Review

### 📝 Notes

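- `COUNT(DISTINCT column)` counts each distinct non-`NULL` value once, so a customer with several order lines on the same day is counted once for that day.

### 💻 Final Result

- Return the number of distinct customers per order date for a single year (2023, then 2022).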

#### Total Customers by Day

**`COUNT(DISTINCT ...)`**

1. Count by day how many distinct customers there were in 2023.

In [5]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT s.customerkey) AS customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE  
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

orderdate,customer
2023-01-01,12
2023-01-02,49
2023-01-03,64
2023-01-04,78
2023-01-05,87
2023-01-06,57
2023-01-07,99
2023-01-08,10
2023-01-09,43
2023-01-10,49


2. Update the date filter to count unique customers by day in 2022.

In [17]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT s.customerkey) AS customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE  
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

orderdate,customer
2022-01-01,86
2022-01-02,9
2022-01-03,39
2022-01-04,51
2022-01-05,62
2022-01-06,67
2022-01-07,44
2022-01-08,78
2022-01-09,7
2022-01-10,33
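
The difference between counting rows and counting distinct customers is easy to see on a tiny example. The sketch below is illustrative only: it uses Python's built-in `sqlite3` module with an in-memory table rather than the notebook's PostgreSQL database, and the four `sales` rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (orderdate TEXT, customerkey INTEGER, productkey INTEGER);
    INSERT INTO sales VALUES
        ('2023-01-01', 1, 10),
        ('2023-01-01', 1, 11),  -- same customer, second order line
        ('2023-01-01', 2, 10),
        ('2023-01-02', 3, 12);
""")

rows = conn.execute("""
    SELECT
        orderdate,
        COUNT(*) AS order_lines,                 -- every row in the group
        COUNT(DISTINCT customerkey) AS customer  -- each customer once per day
    FROM
        sales
    WHERE
        orderdate BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY
        orderdate
    ORDER BY
        orderdate
""").fetchall()

print(rows)  # [('2023-01-01', 3, 2), ('2023-01-02', 1, 1)]
```

On 2023-01-01 there are three order lines but only two customers, which is why the queries above use `COUNT(DISTINCT s.customerkey)`.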


---
## Pivot with COUNT

### 📝 Notes

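- A `CASE WHEN` without an `ELSE` returns `NULL` when its condition is not met, and `COUNT` ignores `NULL` values. Wrapping the `CASE` expression in `COUNT(DISTINCT ...)` therefore counts only the customers that match the condition.

### 💻 Final Result

- Return the number of distinct customers per order date, with one column per continent.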

#### Total Customers by Day and Continent

**`COUNT` with `CASE WHEN`**

1. Find the distinct continents of the customers.

In [9]:
%%sql 

SELECT DISTINCT continent
FROM customer

continent
Europe
North America
Australia


2. Pivot the daily count of distinct customers who ordered between 2022-01-01 and 2023-12-31 into one column per continent.

In [10]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customer,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customer,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' THEN s.customerkey END) AS au_customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

orderdate,eu_customer,na_customer,au_customer
2022-01-01,29,52,5
2022-01-02,4,4,1
2022-01-03,10,28,1
2022-01-04,16,33,2
2022-01-05,18,40,4
2022-01-06,19,42,6
2022-01-07,11,26,7
2022-01-08,29,45,4
2022-01-09,1,6,0
2022-01-10,16,16,1
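
To see why the pivot works, it can help to look at what the `CASE` expression returns for each row before it is aggregated. The following sketch is a hedged illustration using Python's built-in `sqlite3` module and made-up rows (not the Contoso data): the first query shows the per-row `CASE` result, the second wraps the same expression in `COUNT(DISTINCT ...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, continent TEXT);
    CREATE TABLE sales (orderdate TEXT, customerkey INTEGER);
    INSERT INTO customer VALUES (1, 'Europe'), (2, 'North America'), (3, 'Europe');
    INSERT INTO sales VALUES
        ('2022-01-01', 1), ('2022-01-01', 1), ('2022-01-01', 2),
        ('2022-01-02', 3);
""")

# Step 1: the CASE expression keeps the key for European customers, NULL otherwise.
per_row = conn.execute("""
    SELECT
        s.orderdate,
        s.customerkey,
        CASE WHEN c.continent = 'Europe' THEN s.customerkey END AS eu_key
    FROM
        sales s
        LEFT JOIN customer c ON s.customerkey = c.customerkey
    ORDER BY
        s.orderdate, s.customerkey
""").fetchall()
print(per_row)
# [('2022-01-01', 1, 1), ('2022-01-01', 1, 1), ('2022-01-01', 2, None), ('2022-01-02', 3, 3)]

# Step 2: COUNT(DISTINCT ...) skips the NULLs and counts each remaining key once.
pivot = conn.execute("""
    SELECT
        s.orderdate,
        COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customer,
        COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customer
    FROM
        sales s
        LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY
        s.orderdate
    ORDER BY
        s.orderdate
""").fetchall()
print(pivot)  # [('2022-01-01', 1, 1), ('2022-01-02', 1, 0)]
```

Customer 1 ordered twice on 2022-01-01 but is counted once, and a day with no North American customers yields `0` rather than a missing row.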


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

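- Combine conditions with `AND` inside a single `CASE WHEN` to pivot on more than one attribute at once.

### 💻 Final Result

- Return the number of distinct customers per order date, with one column per continent and gender combination.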

#### Unique Customers by Gender and Continent

**`CASE WHEN` with Multiple Conditions**

1. Find the distinct genders of the customers.

In [11]:
%%sql 

SELECT DISTINCT gender
FROM customer

gender
female
male


2. Count the unique customers per day, with one column per gender.

In [12]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.gender = 'male' THEN s.customerkey END) AS male_customers,
    COUNT(DISTINCT CASE WHEN c.gender = 'female' THEN s.customerkey END) AS female_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

orderdate,male_customers,female_customers
2022-01-01,40,46
2022-01-02,7,2
2022-01-03,20,19
2022-01-04,23,28
2022-01-05,38,24
2022-01-06,35,32
2022-01-07,27,17
2022-01-08,29,49
2022-01-09,5,2
2022-01-10,14,19
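
One way to sanity-check a pivot like this is to compare the sum of the pivoted columns with the unpivoted distinct count for the same day. The sketch below is an assumption-laden illustration: it uses Python's built-in `sqlite3` module, made-up rows, and a hypothetical customer whose `gender` is `NULL` to show what a shortfall looks like. It is not a statement about the actual Contoso data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE sales (orderdate TEXT, customerkey INTEGER);
    INSERT INTO customer VALUES (1, 'male'), (2, 'female'), (3, 'female'), (4, NULL);
    INSERT INTO sales VALUES
        ('2022-01-01', 1), ('2022-01-01', 2), ('2022-01-01', 2),
        ('2022-01-02', 3), ('2022-01-02', 4);
""")

rows = conn.execute("""
    SELECT
        s.orderdate,
        COUNT(DISTINCT s.customerkey) AS total_customers,
        COUNT(DISTINCT CASE WHEN c.gender = 'male' THEN s.customerkey END) AS male_customers,
        COUNT(DISTINCT CASE WHEN c.gender = 'female' THEN s.customerkey END) AS female_customers
    FROM
        sales s
        LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY
        s.orderdate
    ORDER BY
        s.orderdate
""").fetchall()

# Days where the pivoted columns do not add up to the total.
mismatches = [r for r in rows if r[1] != r[2] + r[3]]
print(mismatches)  # [('2022-01-02', 2, 0, 1)]
```

A mismatch means some customers fell into none of the `CASE WHEN` branches, for example because the attribute is `NULL` or has a value that was not listed.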


3. Count the unique customers per day in 2022 and 2023, with one column per continent and gender combination.

In [14]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_au_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_au_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

orderdate,male_eu_customers,male_na_customers,male_au_customers,female_eu_customers,female_na_customers,female_au_customers
2022-01-01,12,26,2,17,26,3
2022-01-02,4,3,0,0,1,1
2022-01-03,7,13,0,3,15,1
2022-01-04,4,18,1,12,15,1
2022-01-05,11,24,3,7,16,1
2022-01-06,8,22,5,11,20,1
2022-01-07,6,16,5,5,10,2
2022-01-08,11,18,0,18,27,4
2022-01-09,1,4,0,0,2,0
2022-01-10,7,6,1,9,10,0
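
Writing one `CASE WHEN` per combination gets repetitive as the number of values grows. As a rough sketch (not part of the original notebook), the pivot columns could be generated in Python from the distinct values found earlier. The alias abbreviations (`eu`, `na`, `au`) mirror the column names used above; the variable names are hypothetical. The values are hard-coded here, so plain string formatting is fine, but values from an untrusted source should not be interpolated into SQL this way.

```python
from itertools import product

# Values taken from the SELECT DISTINCT queries above; the short
# abbreviations are only used to build the column aliases.
continents = {"Europe": "eu", "North America": "na", "Australia": "au"}
genders = ["male", "female"]

# One COUNT(DISTINCT CASE WHEN ...) expression per gender/continent pair.
columns = [
    f"COUNT(DISTINCT CASE WHEN c.continent = '{continent}' "
    f"AND c.gender = '{gender}' THEN s.customerkey END) "
    f"AS {gender}_{abbr}_customers"
    for gender, (continent, abbr) in product(genders, continents.items())
]

column_sql = ",\n    ".join(columns)

# Assemble the same PostgreSQL query shape used in the cell above.
query = (
    "SELECT\n"
    "    s.orderdate,\n"
    f"    {column_sql}\n"
    "FROM\n"
    "    sales s\n"
    "    LEFT JOIN customer c ON s.customerkey = c.customerkey\n"
    "WHERE\n"
    "    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'\n"
    "GROUP BY\n"
    "    s.orderdate\n"
    "ORDER BY\n"
    "    s.orderdate"
)

print(query)
```

The printed text can be pasted into a `%%sql` cell. The columns come out in the same order as the hand-written query: the three male columns first, then the three female columns.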
