<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/1_Pivot_With_Case_Statements/1.1_Basic_Aggregation_Count.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Count Aggregation

## Overview

### 🥅 Analysis Goals

Perform exploratory data analysis (EDA) to understand customer characteristics in recent orders, focusing on trends and demographics within the dataset. Specifically:

- **Total daily customers in 2023**: Examine daily trends to understand customer activity and identify peak periods.
- **Customer location (continent)**: Explore customer distribution by continent to assess regional trends.
- **Customer gender**: Analyze gender demographics to understand overall customer composition.

### 📘 Concepts Covered

- `COUNT` Review
- `COUNT` with `CASE WHEN`
- Pivot with Multiple CASE WHEN Statements

---

## What the heck is Pivoting?

#### Definition  
- Transforming data from a **long format (rows)** to a **wide format (columns)** for better analysis.  

#### Examples  

**Before Pivoting (Long Format):**  

| Date       | Category | Sales  |
|------------|---------|--------|
| 2024-01-01 | A       | 100    |
| 2024-01-01 | B       | 200    |
| 2024-01-02 | A       | 150    |

**After Pivoting (Wide Format):**  

| Date       | A Sales | B Sales |
|------------|--------|--------|
| 2024-01-01 | 100    | 200    |
| 2024-01-02 | 150    | NULL   |

#### Key Benefits  
✅ Easier to read & analyze  
✅ Reduces redundancy in reports  
✅ Enables quick comparisons across categories  
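
The long-to-wide transformation in the tables above can be sketched with one conditional aggregate per category. This is a minimal illustration, not part of the course database: it uses Python's built-in `sqlite3` module with an in-memory table, and the names `daily_sales`, `sale_date`, `a_sales`, and `b_sales` are assumptions made up for the example. `SUM` is used here only because the sample table holds sales amounts.

```python
import sqlite3

# Hypothetical in-memory table mirroring the "long format" example above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, category TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [("2024-01-01", "A", 100), ("2024-01-01", "B", 200), ("2024-01-02", "A", 150)],
)

# One CASE WHEN per category turns category rows into columns
pivot_rows = conn.execute(
    """
    SELECT
        sale_date,
        SUM(CASE WHEN category = 'A' THEN sales END) AS a_sales,
        SUM(CASE WHEN category = 'B' THEN sales END) AS b_sales
    FROM daily_sales
    GROUP BY sale_date
    ORDER BY sale_date
    """
).fetchall()

print(pivot_rows)
# [('2024-01-01', 100, 200), ('2024-01-02', 150, None)]
```

The `None` for category B on 2024-01-02 corresponds to the `NULL` cell in the wide table: no row matched, so the aggregate had nothing to sum.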


---

In [3]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Replace ipython-sql with jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COUNT Review

### 📝 Notes

`COUNT`

- **COUNT** returns the number of rows in a group. `COUNT(*)` counts every row, while `COUNT(column_name)` counts only the rows where that column is not NULL.

- Syntax:

  ```sql
  COUNT(column_name)
  ```

- Example: `COUNT(user_id)` counts all non-NULL values in the `user_id` column. `COUNT(*)` counts all rows, including those with NULL values.
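
To make the NULL behavior concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table and its rows are hypothetical and exist only for this illustration; the same distinction between `COUNT(*)` and `COUNT(column_name)` applies in PostgreSQL.

```python
import sqlite3

# Hypothetical table with one NULL user_id
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (None,), (2,)])

# COUNT(*) counts rows; COUNT(user_id) skips the NULL
all_rows, non_null_ids = conn.execute(
    "SELECT COUNT(*), COUNT(user_id) FROM users"
).fetchone()

print(all_rows, non_null_ids)
# 4 3
```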

### 💻 Final Result

- Get the total unique customers by day in 2023. This lets us take a closer look at daily trends to understand customer activity and identify peak periods.

#### Total Unique Customers by Day

**`COUNT`**

1. Count by day the total number of customers.
    - Use `COUNT(customerkey)` to count all customer entries, including duplicates for customers who appear multiple times on the same day.
    - Group data by `orderdate` to aggregate counts per day.
    - Sort the results in chronological order by `orderdate`.

In [11]:
%%sql

SELECT
    orderdate,
    COUNT(customerkey) AS total_customers
FROM
    sales
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2015-01-01,25
1,2015-01-02,8
2,2015-01-03,21
3,2015-01-05,10
4,2015-01-06,12
...,...,...
3289,2024-04-16,32
3290,2024-04-17,61
3291,2024-04-18,57
3292,2024-04-19,50


2. Instead, let's get the count of **unique** customers by day.
    - 🔔 Use `COUNT(DISTINCT customerkey)` to ensure each customer is counted only once per day, even if they appear multiple times.
    - Group data by `orderdate` to get daily unique customer counts.
    - Sort the results in chronological order by `orderdate`.

In [12]:
%%sql

SELECT
    orderdate,
    COUNT(DISTINCT customerkey) AS total_customers -- Update
FROM
    sales
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2015-01-01,9
1,2015-01-02,6
2,2015-01-03,11
3,2015-01-05,4
4,2015-01-06,5
...,...,...
3289,2024-04-16,14
3290,2024-04-17,22
3291,2024-04-18,25
3292,2024-04-19,19
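
The gap between the two result sets comes from customers who appear on several rows for the same day. A toy reproduction, assuming a made-up four-row `sales` table in an in-memory SQLite database (not the course's `contoso_100k` data), might look like this:

```python
import sqlite3

# Hypothetical sales rows: customer 1 appears twice on 2023-01-01
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-01", 1),
        ("2023-01-01", 1),
        ("2023-01-01", 2),
        ("2023-01-02", 3),
    ],
)

# Compare the plain count with the distinct count for each day
daily_counts = conn.execute(
    """
    SELECT
        orderdate,
        COUNT(customerkey) AS customer_rows,
        COUNT(DISTINCT customerkey) AS unique_customers
    FROM sales
    GROUP BY orderdate
    ORDER BY orderdate
    """
).fetchall()

print(daily_counts)
# [('2023-01-01', 3, 2), ('2023-01-02', 1, 1)]
```

On the first day the plain count is 3 because customer 1 has two rows, while the distinct count is 2.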


3. Calculate only for orders in 2023.

    - Use a `WHERE` clause to filter `orderdate` to the year 2023 (from `'2023-01-01'` to `'2023-12-31'`).
    - 🔔 Use `COUNT(DISTINCT customerkey)` to count unique customers per day within the specified year.
    - Group data by `orderdate` to get daily unique customer counts for 2023.
    - Sort the results in chronological order by `orderdate`.

In [13]:
%%sql

SELECT
    orderdate,
    COUNT(DISTINCT customerkey) AS total_customers
FROM
    sales
WHERE  -- Added
    orderdate BETWEEN '2023-01-01' AND '2023-12-31' 
GROUP BY
    orderdate
ORDER BY
    orderdate

Unnamed: 0,orderdate,total_customers
0,2023-01-01,12
1,2023-01-02,49
2,2023-01-03,64
3,2023-01-04,78
4,2023-01-05,87
...,...,...
359,2023-12-27,73
360,2023-12-28,75
361,2023-12-29,55
362,2023-12-30,91
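
`BETWEEN` includes both endpoints, so orders on 2023-01-01 and 2023-12-31 are kept. The sketch below checks this on a hypothetical four-row table in SQLite, where dates are stored as ISO-formatted text (an assumption of this example; the course database runs on PostgreSQL), and compares it with an equivalent half-open range.

```python
import sqlite3

# Hypothetical rows sitting on both sides of the 2023 boundaries
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2022-12-31", 1), ("2023-01-01", 2), ("2023-12-31", 3), ("2024-01-01", 4)],
)

# Inclusive range, as used in the query above
between_dates = [
    row[0]
    for row in conn.execute(
        """
        SELECT orderdate FROM sales
        WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'
        ORDER BY orderdate
        """
    )
]

# Half-open range that selects the same calendar year
half_open_dates = [
    row[0]
    for row in conn.execute(
        """
        SELECT orderdate FROM sales
        WHERE orderdate >= '2023-01-01' AND orderdate < '2024-01-01'
        ORDER BY orderdate
        """
    )
]

print(between_dates)
# ['2023-01-01', '2023-12-31']
print(between_dates == half_open_dates)
# True
```

The two forms agree when the column holds plain dates. If the column were a timestamp, the half-open form would likely be the safer choice, since `BETWEEN ... AND '2023-12-31'` would stop at midnight at the start of the last day.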


---
## Pivot with COUNT

### 📝 Notes

`COUNT(CASE WHEN)`

- **Pivot with COUNT (using `CASE WHEN` statements)** enables pivoting data by counting rows based on conditional logic.

- Syntax:

  ```sql
  COUNT(DISTINCT CASE WHEN condition THEN column END) AS alias
  ```

- Example:

  ```sql
  SELECT 
    COUNT(DISTINCT CASE WHEN status = 'active' THEN user_id END) AS active_users
  FROM users;
  ```
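
The pattern works because a `CASE` expression without an `ELSE` branch returns NULL for non-matching rows, and `COUNT` ignores NULLs. The following sketch shows both steps on a hypothetical `users` table in an in-memory SQLite database; the rows are invented for illustration.

```python
import sqlite3

# Hypothetical users table: user 1 appears twice, user 2 is inactive
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "active"), (1, "active"), (2, "inactive"), (3, "active")],
)

# Step 1: the CASE expression alone yields NULL (None) for non-matching rows
case_values = [
    row[0]
    for row in conn.execute(
        "SELECT CASE WHEN status = 'active' THEN user_id END FROM users ORDER BY rowid"
    )
]

# Step 2: COUNT(DISTINCT ...) skips the NULL and collapses the duplicate
active_users = conn.execute(
    "SELECT COUNT(DISTINCT CASE WHEN status = 'active' THEN user_id END) FROM users"
).fetchone()[0]

print(case_values)
# [1, 1, None, 3]
print(active_users)
# 2
```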

### 💻 Analysis

- Return the unique customers by day, first by customer continent and then by customer gender. This helps us understand our customer demographics better: the distribution by continent shows regional trends, and the split by gender shows overall customer composition.

### Total Customers by Customer Continent
**`CASE WHEN`, `COUNT`**

1. Confirm the unique continents in the `customer` table.

    - Use `SELECT DISTINCT` to retrieve a list of unique values in the `continent` column.
    - This query ensures no duplicates in the output, showing each continent listed only once.

In [7]:
%%sql

SELECT DISTINCT
    continent
FROM 
    customer

Unnamed: 0,continent
0,Europe
1,North America
2,Australia


2. Pivot the number of unique customers who ordered between 2023-01-01 and 2023-12-31 by continent.

    - Use `COUNT(DISTINCT ...)` to calculate the unique customers who placed orders, broken down by continent.
    - Use `CASE WHEN` within `COUNT` to conditionally count customers for specific continents:
        - `c.continent = 'Europe'` for European customers.
        - `c.continent = 'North America'` for North American customers.
        - `c.continent = 'Australia'` for Australian customers.
    - Filter the data to only include orders from 2023 using a `WHERE` clause (`orderdate BETWEEN '2023-01-01' AND '2023-12-31'`).
    - Group data by `orderdate` to aggregate daily customer counts by continent.
    - Sort the results by `orderdate` in chronological order.

In [8]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia' THEN s.customerkey END) AS au_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,eu_customers,na_customers,au_customers
0,2023-01-01,6,5,1
1,2023-01-02,15,31,3
2,2023-01-03,17,44,3
3,2023-01-04,28,46,4
4,2023-01-05,22,57,8
...,...,...,...,...
359,2023-12-27,26,41,6
360,2023-12-28,24,44,7
361,2023-12-29,19,32,4
362,2023-12-30,25,50,16


<img src="../Resources/images/1.1_customer_continent.png" alt="Continent" width="50%">
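
The continent pivot can be traced by hand on a few rows. This sketch rebuilds the same query shape against tiny, made-up `sales` and `customer` tables in SQLite; the keys, dates, and continents are assumptions chosen so that each count is easy to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, continent TEXT)")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")

# Hypothetical customers and orders
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Europe"), (2, "North America"), (3, "Australia"), (4, "Europe")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-01", 1), ("2023-01-01", 1), ("2023-01-01", 2),
        ("2023-01-02", 4), ("2023-01-02", 3), ("2023-01-02", 1),
    ],
)

# Same structure as the query above: one conditional distinct count per continent
continent_pivot = conn.execute(
    """
    SELECT
        s.orderdate,
        COUNT(DISTINCT CASE WHEN c.continent = 'Europe' THEN s.customerkey END) AS eu_customers,
        COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customers,
        COUNT(DISTINCT CASE WHEN c.continent = 'Australia' THEN s.customerkey END) AS au_customers
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY s.orderdate
    ORDER BY s.orderdate
    """
).fetchall()

print(continent_pivot)
# [('2023-01-01', 1, 1, 0), ('2023-01-02', 2, 0, 1)]
```

A continent with no orders on a given day shows up as 0 rather than NULL, because `COUNT` over only NULL values returns 0.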

---
### Unique Customers by Gender

**`CASE WHEN`**, **`COUNT`**

1. Confirm the unique options in the `gender` column of the `customer` table.
    - Use `SELECT DISTINCT` to retrieve all unique values in the `gender` column.
    - This provides a list of all possible gender options stored in the data without duplicates.

In [9]:
%%sql

SELECT DISTINCT gender
FROM customer

Unnamed: 0,gender
0,female
1,male


2. Get the count of unique customers by day, split by gender.

    - Use `COUNT(DISTINCT ...)` to calculate the unique customer count by gender.
    - Use `CASE WHEN` within `COUNT` to conditionally count customers by gender:
        - `c.gender = 'male'` for male customers.
        - `c.gender = 'female'` for female customers.
    - Filter the data to include only orders from 2023 using the `WHERE` clause (`orderdate BETWEEN '2023-01-01' AND '2023-12-31'`).
    - Group data by `orderdate` to aggregate daily unique customer counts by gender.
    - Sort the results in chronological order by `orderdate`.

In [10]:
%%sql

SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.gender = 'male' THEN s.customerkey END) AS male_customers,
    COUNT(DISTINCT CASE WHEN c.gender = 'female' THEN s.customerkey END) AS female_customers
FROM
    sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE
    s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY
    s.orderdate
ORDER BY
    s.orderdate

Unnamed: 0,orderdate,male_customers,female_customers
0,2023-01-01,4,8
1,2023-01-02,21,28
2,2023-01-03,33,31
3,2023-01-04,34,44
4,2023-01-05,50,37
...,...,...,...
359,2023-12-27,40,33
360,2023-12-28,36,39
361,2023-12-29,31,24
362,2023-12-30,51,40


<img src="../Resources/images/1.1_customer_gender.png" alt="Gender" width="50%">
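
One way to sanity-check a pivot like this is to compare the pivoted columns with the overall distinct count for each day. The sketch below does that on hypothetical tables in SQLite. It assumes every customer has exactly one `gender` value and that the value is never NULL; if either assumption fails, the columns would not add up to the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, gender TEXT)")
conn.execute("CREATE TABLE sales (orderdate TEXT, customerkey INTEGER)")

# Hypothetical customers and orders
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "male"), (2, "female"), (3, "female")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2023-01-01", 1), ("2023-01-01", 2), ("2023-01-01", 2),
        ("2023-01-01", 3), ("2023-01-02", 1),
    ],
)

# Pivoted counts next to the unpivoted total for the same day
gender_pivot = conn.execute(
    """
    SELECT
        s.orderdate,
        COUNT(DISTINCT CASE WHEN c.gender = 'male' THEN s.customerkey END) AS male_customers,
        COUNT(DISTINCT CASE WHEN c.gender = 'female' THEN s.customerkey END) AS female_customers,
        COUNT(DISTINCT s.customerkey) AS total_customers
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY s.orderdate
    ORDER BY s.orderdate
    """
).fetchall()

# Days where the gender columns do not add up to the total
mismatched_days = [row for row in gender_pivot if row[1] + row[2] != row[3]]

print(gender_pivot)
# [('2023-01-01', 1, 2, 3), ('2023-01-02', 1, 0, 1)]
print(mismatched_days)
# []
```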