# Count Aggregation

## Overview

### 🥅 Analysis Goals

- Use conditional count aggregation to explore 2023 sales in the Contoso dataset:
    - Find percentile thresholds for total sales with `PERCENTILE_CONT`
    - Count orders per product category in low, moderate, and high sales buckets
    - Count unique customers per bucket with `COUNT(DISTINCT ...)`
- The end goal is to identify which product categories have the most low, moderate, and high value sales.

### 📘 Concepts Covered

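General concepts we’re going to cover:

- Percentiles with `PERCENTILE_CONT`
- Conditional counts with `COUNT` or `SUM` and `CASE WHEN`
- Unique counts with `COUNT(DISTINCT ...)`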

---

In [None]:
import sys
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the `sql` extension for SQL magic (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

In [37]:
import psycopg2
import pandas as pd
import numpy as np

# Database connection parameters
connection = psycopg2.connect(
    dbname='contoso_100k',
    user='postgres',
    password='password',
    host='localhost',
    port='5432'
)

In [32]:
%config SqlMagic.named_parameters = "disabled"

---
## Major Topic

### 📝 Notes

- `COUNT` ignores `NULL`, so a `CASE WHEN` without an `ELSE` branch counts only the rows that match its condition.

### 💻 Final Result

- Return, for each product category, the number of 2023 orders and unique customers with low, moderate, and high total sales.

#### Problem Description

**`COUNT` with `CASE WHEN` / Conditional Counting**

1. Use `PERCENTILE_CONT` to find the sales thresholds, then wrap a `CASE WHEN` expression inside `COUNT` so that only the rows in each sales bucket are counted.

**Basic Query**

Find a percentile of a column (here the median, `0.5`) with `PERCENTILE_CONT`:

```sql
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) AS median
FROM table_name
WHERE column_name IS NOT NULL;

```
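
`PERCENTILE_CONT` returns a continuous percentile: when the requested fraction falls between two rows, it interpolates linearly between them. The following is a minimal pure-Python sketch of that rule, not PostgreSQL's actual implementation. The helper name `percentile_cont` and the sample values are made up for illustration and do not come from the dataset.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile, mirroring how PERCENTILE_CONT behaves.

    None entries are skipped, much as the SQL example filters out NULLs.
    """
    data = sorted(v for v in values if v is not None)
    if not data:
        raise ValueError("no non-null values to aggregate")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")

    # Fractional position of the requested percentile in the sorted data
    position = fraction * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)

    # Interpolate between the two neighbouring values
    weight = position - lower
    return data[lower] + (data[upper] - data[lower]) * weight


sample = [10, 20, None, 30, 40]  # hypothetical values
print(percentile_cont(sample, 0.5))   # 25.0
print(percentile_cont(sample, 0.25))  # 17.5
```

With four non-null values, the median sits halfway between the second and third, which is why the result is 25.0 rather than one of the inputs.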

We want to categorize total sales (`quantity * price * exchangerate`) as low, moderate, or high. First we need to determine the cutoff values that separate these categories.

- low: from the minimum up to the moderate cutoff
- moderate: from the moderate cutoff up to the high cutoff
- high: at or above the high cutoff
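
The bucketing rule above can be sketched as a small Python function. The function name `sales_bucket` is hypothetical, and the default cutoffs (109.99 and 390.82) are the values hard-coded in this notebook's later queries; treat them as assumptions that would change with the data.

```python
def sales_bucket(amount, moderate_from=109.99, high_from=390.82):
    """Classify a total sale amount as 'low', 'moderate', or 'high'.

    Lower bounds are inclusive, matching the >= comparisons in the SQL queries.
    """
    if amount < moderate_from:
        return "low"
    if amount < high_from:
        return "moderate"
    return "high"


# Hypothetical amounts chosen to sit on and around the cutoffs
for amount in (0.86, 109.99, 390.81, 390.82, 33040.28):
    print(amount, sales_bucket(amount))
```

Because each boundary value belongs to the higher bucket, every amount falls into exactly one category.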

In [42]:
%%sql 

SELECT
    MIN(s.quantity * p.price * exchangerate) AS minimum_sales,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q1_sales,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS median_sales, -- Median 
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY (s.quantity * p.price * exchangerate)) AS q3_sales,
    MAX(s.quantity * p.price * exchangerate) AS maximum_sales
FROM
    sales s
JOIN
    product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
;

minimum_sales,q1_sales,median_sales,q3_sales,maximum_sales
0.864576,109.99,390.81733600000007,1055.9386167,33040.27776


**Validate data.** Cross-check the SQL percentiles by recomputing them in Python with NumPy.

In [43]:
# SQL Query to fetch data
query = '''
SELECT 
    s.quantity * p.price * exchangerate AS total_sale_amount
FROM 
    sales s
JOIN 
    product p 
ON 
    s.productkey = p.productkey
WHERE 
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY 
    total_sale_amount;
'''

# Fetch data into a Pandas DataFrame
sales_df = pd.read_sql_query(query, connection)
sales_df


Unnamed: 0,total_sale_amount
0,0.864576
1,0.868053
2,0.874123
3,0.950000
4,0.950000
...,...
37512,30973.727207
37513,31024.269000
37514,31253.540229
37515,32915.591460


In [44]:
# Calculate percentiles on the sales amount column
amounts = sales_df['total_sale_amount']
q1 = np.percentile(amounts, 25)
median = np.percentile(amounts, 50)
q3 = np.percentile(amounts, 75)

print(f"25th Percentile (Q1): {q1}")
print(f"Median (50th Percentile): {median}")
print(f"75th Percentile (Q3): {q3}")

25th Percentile (Q1): 109.99
Median (50th Percentile): 390.81733600000007
75th Percentile (Q3): 1055.9386167
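
As a further cross-check that needs no third-party packages, Python's standard library offers `statistics.quantiles`. With `method="inclusive"` it interpolates linearly between data points, which should agree with NumPy's default and with `PERCENTILE_CONT`. The sketch below uses made-up numbers rather than the notebook's data.

```python
import statistics

# Hypothetical sale amounts, small enough to verify by hand
amounts = [1.0, 2.0, 3.0, 4.0]

# n=4 splits the data into quartiles and returns the three cut points
q1, median, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
print(q1, median, q3)  # 1.75 2.5 3.25
```

The default `method="exclusive"` uses a different interpolation rule, so it would not be expected to reproduce the SQL results.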


In [45]:
%%sql 

SELECT
    p.categoryname AS category,
    COUNT(CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 THEN orderkey END) AS num_low_total_sales,
    COUNT(CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 THEN orderkey END) AS num_mod_total_sales,
    COUNT(CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 THEN orderkey END) AS num_high_total_sales
FROM
    sales s
JOIN
    product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

category,num_low_total_sales,num_mod_total_sales,num_high_total_sales
Audio,299,685,635
Cameras and camcorders,182,358,1098
Cell phones,2318,2443,4869
Computers,977,2098,5940
Games and Toys,2622,597,126
Home Appliances,214,626,2298
"Music, Movies and Audio Books",2630,2014,1874
TV and Video,74,622,1918


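**Alternative method when `COUNT(DISTINCT ...)` isn't needed.** `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` returns the same row counts as `COUNT(CASE WHEN ... END)`. It is sometimes described as less resource intensive than `COUNT`, but any difference depends on the database engine.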

In [46]:
%%sql 

SELECT
    p.categoryname AS category,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 THEN 1 ELSE 0 END) AS num_low_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 THEN 1 ELSE 0 END) AS num_mod_total_sales,
    SUM(CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 THEN 1 ELSE 0 END) AS num_high_total_sales
FROM
    sales s
JOIN
    product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

category,num_low_total_sales,num_mod_total_sales,num_high_total_sales
Audio,299,685,635
Cameras and camcorders,182,358,1098
Cell phones,2318,2443,4869
Computers,977,2098,5940
Games and Toys,2622,597,126
Home Appliances,214,626,2298
"Music, Movies and Audio Books",2630,2014,1874
TV and Video,74,622,1918
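
The two query styles give matching counts because `COUNT` skips the `NULL` that a `CASE` without `ELSE` produces, while `SUM` adds a 1 for each match and a 0 otherwise. A minimal sketch of that equivalence follows. It uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without a server, and the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("Audio", 50.0), ("Audio", 200.0), ("Audio", 500.0),
        ("Games", 20.0), ("Games", 30.0),
    ],
)

rows = conn.execute("""
    SELECT
        category,
        -- COUNT ignores the NULL returned when the condition is false
        COUNT(CASE WHEN amount < 109.99 THEN 1 END) AS low_via_count,
        -- SUM adds 1 per matching row and 0 per non-matching row
        SUM(CASE WHEN amount < 109.99 THEN 1 ELSE 0 END) AS low_via_sum,
        COUNT(CASE WHEN amount >= 390.82 THEN 1 END) AS high_via_count,
        SUM(CASE WHEN amount >= 390.82 THEN 1 ELSE 0 END) AS high_via_sum
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
conn.close()

print(rows)  # [('Audio', 1, 1, 1, 1), ('Games', 2, 2, 0, 0)]
```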


**Advanced Query**

Get the number of unique customers with low, moderate, and high total sales in each category.

In [48]:
%%sql 

SELECT
    p.categoryname AS category,
    COUNT(DISTINCT CASE WHEN (s.quantity * p.price * exchangerate) < 109.99 THEN customerkey END) AS low_total_sales_customer,
    COUNT(DISTINCT CASE WHEN (s.quantity * p.price * exchangerate) >= 109.99 AND (s.quantity * p.price * exchangerate) < 390.82 THEN customerkey END) AS mod_total_sales_customer,
    COUNT(DISTINCT CASE WHEN (s.quantity * p.price * exchangerate) >= 390.82 THEN customerkey END) AS high_total_sales_customer
FROM
    sales s
JOIN
    product p ON s.productkey = p.productkey
WHERE
    orderdate::date BETWEEN '2023-01-01' AND '2023-12-31' 
GROUP BY
    category
ORDER BY
    category;

category,low_total_sales_customer,mod_total_sales_customer,high_total_sales_customer
Audio,295,672,619
Cameras and camcorders,181,349,1061
Cell phones,2104,2210,4068
Computers,935,1921,4720
Games and Toys,2368,580,126
Home Appliances,211,610,2098
"Music, Movies and Audio Books",2376,1862,1749
TV and Video,74,603,1781
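
The customer counts are lower than the order counts because `COUNT(DISTINCT ...)` counts each `customerkey` once per bucket, however many orders that customer placed. A customer can still appear in more than one bucket. The sketch below illustrates both points with an in-memory SQLite database as a stand-in for PostgreSQL; the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customerkey INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, 10.0), (1, 20.0),    # customer 1: two low orders
        (2, 30.0), (2, 500.0),   # customer 2: one low and one high order
        (3, 600.0),              # customer 3: one high order
    ],
)

row = conn.execute("""
    SELECT
        -- Counts every low order, including repeats by the same customer
        COUNT(CASE WHEN amount < 109.99 THEN customerkey END) AS low_orders,
        -- Counts each customer with at least one low order once
        COUNT(DISTINCT CASE WHEN amount < 109.99 THEN customerkey END) AS low_customers,
        COUNT(DISTINCT CASE WHEN amount >= 390.82 THEN customerkey END) AS high_customers
    FROM orders
""").fetchone()
conn.close()

print(row)  # (3, 2, 2)
```

Customer 2 is counted in both the low and high buckets, so the bucket totals can add up to more than the number of distinct customers overall.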
