<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/1_Count_Aggregation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Count Aggregation

## Overview

### 🥅 Analysis Goals

- Run an exploratory data analysis (EDA) of the `sales` table and its customers:
    - Compare the customers who ordered in 2022 and 2023.
    - Look at the location (continent) customers are based in.

### 📘 Concepts Covered

- `COUNT` Review
- `COUNT` with `CASE WHEN`
- Pivot with Multiple `CASE WHEN` Statements

---

In [1]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (`%sql` / `%%sql`)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

---
## Sales Table Review

### 📝 Notes

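- Each row of `sales` is one line item of an order, so the same `orderkey` can appear on several rows.
- Multiplying by `exchangerate` accounts for sales that were not made in `USD`.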

### 💻 Final Result

- Return every sales line item with its total sale amount, plus the continent and gender of the customer who placed it.

#### Review Sales Table

**`LEFT JOIN` / Calculated Columns**

1. Find the total sales for each entry by multiplying `quantity` by `netprice` and `exchangerate` (since not all sales are made in `USD`).

In [2]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    s.storekey,
    s.productkey,
    s.quantity,
    s.netprice,
    s.quantity * s.netprice * s.exchangerate AS total_sale_amount
FROM
    sales s
ORDER BY
    s.orderkey

Unnamed: 0,orderkey,orderdate,customerkey,storekey,productkey,quantity,netprice,total_sale_amount
0,1000,2015-01-01,947009,400,48,1,98.9670,63.492279
1,1000,2015-01-01,947009,400,460,1,659.7800,423.281859
2,1001,2015-01-01,1772036,430,1730,2,54.3760,108.752000
3,1002,2015-01-01,1518349,660,955,4,286.6864,1146.745600
4,1002,2015-01-01,1518349,660,62,7,135.7500,950.250000
...,...,...,...,...,...,...,...,...
199868,3398034,2024-04-20,664396,999999,1651,7,139.1913,914.612113
199869,3398034,2024-04-20,664396,999999,1646,1,159.9900,150.182613
199870,3398035,2024-04-20,267690,999999,1575,2,53.6712,147.778282
199871,3398035,2024-04-20,267690,999999,415,5,293.4000,2019.618900
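
As a minimal sketch of the calculation above, the following uses an in-memory SQLite table instead of the course database. The two rows and the `0.5` exchange rate are made-up values for illustration only; the column names mirror the `sales` table.

```python
import sqlite3

# In-memory stand-in for the sales table (hypothetical rows, not course data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (orderkey INTEGER, quantity INTEGER, netprice REAL, exchangerate REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 2, 10.0, 1.0),  # exchange rate of 1.0, amount is unchanged
        (2, 3, 20.0, 0.5),  # illustrative non-USD sale with a made-up rate
    ],
)

# Same expression as the notebook query: quantity * netprice * exchangerate
rows = conn.execute(
    """
    SELECT orderkey, quantity * netprice * exchangerate AS total_sale_amount
    FROM sales
    ORDER BY orderkey
    """
).fetchall()
print(rows)  # [(1, 20.0), (2, 30.0)]
```

With an exchange rate of `1.0` the total is simply `quantity * netprice`, which matches the rows in the output above where `total_sale_amount` equals `quantity` times `netprice`.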


2. Join the `customer` table to get customer info such as the continent and gender of the customer.

In [3]:
%%sql

SELECT
    s.orderkey,
    s.orderdate,
    s.customerkey,
    c.continent, -- Added
    c.gender, -- Added
    s.productkey,
    s.quantity,
    s.quantity * s.netprice * s.exchangerate AS total_sale_amount
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
ORDER BY
    s.orderkey

Unnamed: 0,orderkey,orderdate,customerkey,continent,gender,productkey,quantity,total_sale_amount
0,1000,2015-01-01,947009,Europe,male,48,1,63.492279
1,1000,2015-01-01,947009,Europe,male,460,1,423.281859
2,1001,2015-01-01,1772036,North America,female,1730,2,108.752000
3,1002,2015-01-01,1518349,North America,female,955,4,1146.745600
4,1002,2015-01-01,1518349,North America,female,62,7,950.250000
...,...,...,...,...,...,...,...,...
199868,3398034,2024-04-20,664396,Europe,female,1651,7,914.612113
199869,3398034,2024-04-20,664396,Europe,female,1646,1,150.182613
199870,3398035,2024-04-20,267690,North America,male,1575,2,147.778282
199871,3398035,2024-04-20,267690,North America,male,415,5,2019.618900
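
The query above uses a `LEFT JOIN`, which keeps every `sales` row even when no matching `customer` row exists. The sketch below illustrates that behavior on a toy in-memory SQLite database; the rows, including the unmatched `customerkey` 99, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER, continent TEXT, gender TEXT);
    CREATE TABLE sales (orderkey INTEGER, customerkey INTEGER);
    INSERT INTO customer VALUES (1, 'Europe', 'male');
    -- customerkey 99 has no row in customer (hypothetical orphan sale)
    INSERT INTO sales VALUES (100, 1), (101, 99);
    """
)

# LEFT JOIN keeps the unmatched sale and fills customer columns with NULL
left_rows = conn.execute(
    """
    SELECT s.orderkey, c.continent
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    ORDER BY s.orderkey
    """
).fetchall()

# INNER JOIN drops the unmatched sale entirely
inner_rows = conn.execute(
    """
    SELECT s.orderkey, c.continent
    FROM sales s
    INNER JOIN customer c ON s.customerkey = c.customerkey
    ORDER BY s.orderkey
    """
).fetchall()

print(left_rows)   # [(100, 'Europe'), (101, None)]
print(inner_rows)  # [(100, 'Europe')]
```

If every sale has a matching customer, both joins return the same rows; the difference only shows up for unmatched keys.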


---
## COUNT Review

### 📝 Notes

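- `COUNT(DISTINCT column)` counts each unique non-NULL value once, so a customer with several line items on the same day is counted once for that day.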

### 💻 Final Result

- Return the number of distinct customers per order date, first for 2023 and then for 2022.

#### Total Customers by Day

**`COUNT(DISTINCT)` / Unique Customers per Day**

1. Count by day how many distinct customers there were in 2023.

In [4]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT s.customerkey) AS customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE  
    s.orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,customer
0,2023-01-01,12
1,2023-01-02,49
2,2023-01-03,64
3,2023-01-04,78
4,2023-01-05,87
...,...,...
359,2023-12-27,73
360,2023-12-28,75
361,2023-12-29,55
362,2023-12-30,91
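
To see why the query uses `COUNT(DISTINCT s.customerkey)` rather than `COUNT(*)`, here is a small sketch on an in-memory SQLite table. The four rows are invented for illustration: customer 1 has two line items on the same day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-01", 1),
        ("2023-01-01", 1),  # same customer, second line item
        ("2023-01-01", 2),
        ("2023-01-02", 3),
    ],
)

# COUNT(*) counts rows; COUNT(DISTINCT ...) counts unique customers
rows = conn.execute(
    """
    SELECT
        orderdate,
        COUNT(*) AS line_items,
        COUNT(DISTINCT customerkey) AS customers
    FROM sales
    GROUP BY orderdate
    ORDER BY orderdate
    """
).fetchall()
print(rows)  # [('2023-01-01', 3, 2), ('2023-01-02', 1, 1)]
```

On 2023-01-01 there are three line items but only two distinct customers, which is the number the daily customer count should report.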


2. Update date filter to count unique customers by day in 2022.

In [5]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT s.customerkey) AS customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE  
    s.orderdate::date BETWEEN '2022-01-01' AND '2022-12-31' -- Updated
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate


---
## Pivot with COUNT

### 📝 Notes

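- A `CASE WHEN` with no `ELSE` returns `NULL` for rows that do not match, and `COUNT` ignores `NULL`, so each `COUNT(DISTINCT CASE WHEN ...)` column counts only the customers that meet its condition.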

### 💻 Final Result

- Return one row per order date with a separate distinct-customer count column for each continent.

#### Total Customers by Day and Continent

**`COUNT` with `CASE WHEN` / Pivot by Continent**

1. Find the distinct continents of the customers.

In [6]:
%%sql 

SELECT DISTINCT continent
FROM customer

Unnamed: 0,continent
0,Europe
1,North America
2,Australia


2. Pivot the number of distinct customers who ordered between 2022-01-01 and 2023-12-31 into one column per continent.

In [7]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customer,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customer,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' THEN s.customerkey END) AS au_customer
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,eu_customer,na_customer,au_customer
0,2022-01-01,29,52,5
1,2022-01-02,4,4,1
2,2022-01-03,10,28,1
3,2022-01-04,16,33,2
4,2022-01-05,18,40,4
...,...,...,...,...
724,2023-12-27,26,41,6
725,2023-12-28,24,44,7
726,2023-12-29,19,32,4
727,2023-12-30,25,50,16
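
The pivot above relies on `CASE WHEN` returning `NULL` when there is no `ELSE` branch, because `COUNT` skips `NULL` values. The sketch below demonstrates this on toy in-memory SQLite data; the rows are hypothetical, and the `eu_with_else_zero` column is included only to show what goes wrong when `ELSE 0` is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER, continent TEXT);
    CREATE TABLE sales (orderdate TEXT, customerkey INTEGER);
    INSERT INTO customer VALUES (1, 'Europe'), (2, 'North America'), (3, 'Europe');
    INSERT INTO sales VALUES
        ('2022-01-01', 1), ('2022-01-01', 1),
        ('2022-01-01', 2), ('2022-01-01', 3);
    """
)

rows = conn.execute(
    """
    SELECT
        s.orderdate,
        -- No ELSE: non-European rows become NULL and are ignored by COUNT
        COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customer,
        COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customer,
        -- ELSE 0: the value 0 is counted as one extra "customer"
        COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey ELSE 0 END) AS eu_with_else_zero
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY s.orderdate
    ORDER BY s.orderdate
    """
).fetchall()
print(rows)  # [('2022-01-01', 2, 1, 3)]
```

Two European customers ordered that day, but the `ELSE 0` variant reports three because `0` is a non-NULL value that `COUNT(DISTINCT ...)` treats as one more distinct entry.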


---
## Pivot with Multiple CASE WHEN Statements

### 📝 Notes

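- Combine conditions with `AND` inside a single `CASE WHEN` to pivot on more than one attribute at once, for example continent and gender.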

### 💻 Final Result

- Return one row per order date with a distinct-customer count column for each gender and continent combination.

#### Unique Customers by Gender and Continent

**Multiple `CASE WHEN` Conditions / Pivot by Gender and Continent**

1. Find the distinct genders of the customers.

In [8]:
%%sql 

SELECT DISTINCT gender
FROM customer

Unnamed: 0,gender
0,female
1,male


2. Get the count of unique customers by day by gender.

In [9]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.gender = 'male' THEN s.customerkey END) AS male_customers,
    COUNT(DISTINCT CASE WHEN c.gender = 'female' THEN s.customerkey END) AS female_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,male_customers,female_customers
0,2022-01-01,40,46
1,2022-01-02,7,2
2,2022-01-03,20,19
3,2022-01-04,23,28
4,2022-01-05,38,24
...,...,...,...
724,2023-12-27,40,33
725,2023-12-28,36,39
726,2023-12-29,31,24
727,2023-12-30,51,40


3. Get the unique customers by day in 2022 and 2023, broken out by continent and gender.

In [10]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' 
        AND c.gender = 'male' THEN s.customerkey END) AS male_au_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' 
        AND c.gender = 'female' THEN s.customerkey END) AS female_au_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate::date BETWEEN '2022-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,male_eu_customers,male_na_customers,male_au_customers,female_eu_customers,female_na_customers,female_au_customers
0,2022-01-01,12,26,2,17,26,3
1,2022-01-02,4,3,0,0,1,1
2,2022-01-03,7,13,0,3,15,1
3,2022-01-04,4,18,1,12,15,1
4,2022-01-05,11,24,3,7,16,1
...,...,...,...,...,...,...,...
724,2023-12-27,12,23,5,14,18,1
725,2023-12-28,8,23,5,16,21,2
726,2023-12-29,8,19,4,11,13,0
727,2023-12-30,17,25,9,8,25,7
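
The six pivot columns above follow one pattern, so they can also be generated rather than typed by hand. The sketch below is one possible way to do that in Python, run against a toy in-memory SQLite database. The short prefixes (`eu`, `na`, `au`) and the sample rows are assumptions for illustration; the generated SQL has the same shape as the notebook query, minus the date filter.

```python
import sqlite3
from itertools import product

# Labels come from the DISTINCT queries above; the prefixes are illustrative
continents = {"eu": "Europe", "na": "North America", "au": "Australia"}
genders = ["male", "female"]

# One COUNT(DISTINCT CASE WHEN ...) column per gender/continent pair.
# The values are hard-coded constants, so string formatting is safe here;
# never build SQL this way from user input.
columns = [
    f"COUNT(DISTINCT CASE WHEN c.continent = '{continent}' "
    f"AND c.gender = '{gender}' THEN s.customerkey END) AS {gender}_{abbr}_customers"
    for gender, (abbr, continent) in product(genders, continents.items())
]
query = (
    "SELECT s.orderdate, " + ", ".join(columns) + " "
    "FROM sales s LEFT JOIN customer c ON s.customerkey = c.customerkey "
    "GROUP BY s.orderdate ORDER BY s.orderdate"
)

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER, continent TEXT, gender TEXT);
    CREATE TABLE sales (orderdate TEXT, customerkey INTEGER);
    INSERT INTO customer VALUES
        (1, 'Europe', 'male'), (2, 'North America', 'female'),
        (3, 'Australia', 'female'), (4, 'Europe', 'male');
    INSERT INTO sales VALUES
        ('2022-01-01', 1), ('2022-01-01', 1), ('2022-01-01', 4),
        ('2022-01-01', 2), ('2022-01-02', 3);
    """
)

cursor = conn.execute(query)
names = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(names)
print(rows)  # [('2022-01-01', 2, 0, 0, 0, 1, 0), ('2022-01-02', 0, 0, 0, 0, 0, 1)]
```

Generating the columns keeps the query consistent when a new continent or gender value appears, at the cost of making the SQL less readable at a glance than the handwritten version.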
