<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/3_Ranking.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Ranking Functions

## Overview

**Product Analysis Focused**

### 🥅 Analysis Goals

- Use ranking window functions on the Contoso sales dataset to rank customers and products by month.
    - Number rows with `ROW_NUMBER()`
    - Rank customers by monthly net revenue
    - Rank products by monthly sales with `RANK()` and `DENSE_RANK()`
- The end goal is to identify the top 5 customers and the top 5 products for each month.

### 📘 Concepts Covered

Ranking functions and the window clauses that control them:

- `ROW_NUMBER()`
- `RANK()` and `DENSE_RANK()`
- `PARTITION BY` and `ORDER BY` inside `OVER ()`

In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Order by & Row Number

### 📝 Notes

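`ROW_NUMBER()`

- Gives each row a unique, sequential number, which can serve as its rank.
- Syntax
    - `ROW_NUMBER()`: Assigns a unique number to each row within a partition, starting at 1.
    - `PARTITION BY`: Splits the rows into groups; the numbering restarts in each partition.
    - `ORDER BY`: Orders rows within each partition, which determines the number each row receives.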
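The syntax above can be tried without the course database. The sketch below uses Python's built-in `sqlite3` module instead of PostgreSQL, assuming the bundled SQLite is version 3.25 or newer (the release that added window functions). The `toy_sales` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical toy data (customer, month, revenue); not from the Contoso database.
rows = [
    ("A", "2024-01", 300.0),
    ("B", "2024-01", 500.0),
    ("C", "2024-01", 100.0),
    ("A", "2024-02", 50.0),
    ("B", "2024-02", 20.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_sales (customer TEXT, month TEXT, revenue REAL)")
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", rows)

query = """
SELECT customer,
       month,
       revenue,
       ROW_NUMBER() OVER (
           PARTITION BY month      -- numbering restarts for each month
           ORDER BY revenue DESC   -- highest revenue gets number 1
       ) AS row_num
FROM toy_sales
ORDER BY month, row_num
"""
numbered = conn.execute(query).fetchall()
for row in numbered:
    print(row)

conn.close()
```

Each month starts again at 1, and within a month the customer with the highest revenue receives the lowest number.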

### 💻 Final Result

- Return the top 5 customers by total net revenue for each month.

#### Number Every Row

**`ROW_NUMBER()`**

1. Use `ROW_NUMBER()` with an empty `OVER ()` to number every row in `sales`. With no `ORDER BY` in the window, the order in which the numbers are assigned is not guaranteed.

In [3]:
%%sql

SELECT 
    customerkey,
    ROW_NUMBER() OVER () AS product_rank
FROM sales

Unnamed: 0,customerkey,product_rank
0,947009,1
1,947009,2
2,1772036,3
3,1518349,4
4,1518349,5
...,...,...
199868,664396,199869
199869,664396,199870
199870,267690,199871
199871,267690,199872


#### Top Customers per Month

**`ROW_NUMBER()` with `PARTITION BY`**

1. Use `PARTITION BY` to restart the numbering each month and `ORDER BY` to rank customers by total net revenue, from highest to lowest.

Step 1: Rank customers by total net revenue within each month.

In [7]:
%%sql

SELECT customerkey,
       DATE_TRUNC('month', orderdate) AS month,
       SUM(quantity * netprice * exchangerate) AS total_net_revenue,
       ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
FROM sales
GROUP BY 
    month,
    customerkey
ORDER BY 
    month, 
    customerkey
;

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,130500,2015-01-01 00:00:00-08:00,2246.91,65
1,133386,2015-01-01 00:00:00-08:00,876.31,105
2,143847,2015-01-01 00:00:00-08:00,50.00,190
3,200729,2015-01-01 00:00:00-08:00,1789.65,73
4,202969,2015-01-01 00:00:00-08:00,9.09,198
...,...,...,...,...
82412,2066845,2024-04-01 00:00:00-07:00,792.61,149
82413,2072527,2024-04-01 00:00:00-07:00,3025.98,38
82414,2076477,2024-04-01 00:00:00-07:00,993.38,135
82415,2091108,2024-04-01 00:00:00-07:00,118.00,214


Step 2: Only get top 5 customers for each month.

In [10]:
%%sql

WITH ranked_customers AS (
    SELECT customerkey,
        DATE_TRUNC('month', orderdate) AS month,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue,
        ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
    FROM sales
    GROUP BY 
        month,
        customerkey
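)

-- Sort in the outer query: an ORDER BY inside the CTE does not guarantee the order of the final result
SELECT
*
FROM ranked_customers
WHERE customer_rank <= 5
ORDER BY 
    month, 
    customerkey
;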

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,208950,2015-01-01 00:00:00-08:00,10314.49,2
1,810590,2015-01-01 00:00:00-08:00,9581.11,3
2,1198381,2015-01-01 00:00:00-08:00,10745.86,1
3,1473073,2015-01-01 00:00:00-08:00,9117.21,4
4,1928189,2015-01-01 00:00:00-08:00,8535.84,5
...,...,...,...,...
555,205498,2024-04-01 00:00:00-07:00,12460.21,2
556,206927,2024-04-01 00:00:00-07:00,12276.19,3
557,318017,2024-04-01 00:00:00-07:00,29449.55,1
558,1136286,2024-04-01 00:00:00-07:00,10971.84,4
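
The filter on `customer_rank` has to sit outside the CTE because a window function cannot be used directly in a `WHERE` clause. Below is a minimal sketch of the same top-N-per-group pattern, run with Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer) rather than the course's PostgreSQL setup. The `toy_orders` table, its rows, and `TOP_N` are hypothetical. For readability the sketch aggregates in one CTE and ranks in a second one, which differs slightly from the single-CTE query above.

```python
import sqlite3

# Hypothetical toy orders (customer, month, amount); not from the Contoso database.
orders = [
    ("A", "2024-01", 100.0), ("A", "2024-01", 250.0),
    ("B", "2024-01", 300.0),
    ("C", "2024-01", 50.0),
    ("A", "2024-02", 10.0),
    ("B", "2024-02", 40.0), ("B", "2024-02", 5.0),
    ("C", "2024-02", 80.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_orders (customer TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO toy_orders VALUES (?, ?, ?)", orders)

TOP_N = 2  # hypothetical cutoff; the notebook query uses 5

query = """
WITH monthly AS (
    -- Step 1: total revenue per customer per month
    SELECT customer, month, SUM(amount) AS revenue
    FROM toy_orders
    GROUP BY customer, month
),
ranked AS (
    -- Step 2: number customers within each month, highest revenue first
    SELECT customer, month, revenue,
           ROW_NUMBER() OVER (PARTITION BY month ORDER BY revenue DESC) AS customer_rank
    FROM monthly
)
-- Step 3: filter on the rank in the outer query
SELECT customer, month, revenue, customer_rank
FROM ranked
WHERE customer_rank <= ?
ORDER BY month, customer_rank
"""
top_customers = conn.execute(query, (TOP_N,)).fetchall()
for row in top_customers:
    print(row)

conn.close()
```

Because `ROW_NUMBER()` never repeats a number within a partition, this pattern returns at most `TOP_N` rows per month.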


---
## RANK and DENSE_RANK

### 📝 Notes

`RANK()` and `DENSE_RANK()`

- Syntax
    - `RANK()`: Assigns the same rank to rows with identical values and skips the following ranks after ties (e.g., 1, 2, 2, 4).
    - `DENSE_RANK()`: Like `RANK()`, assigns the same rank to rows with identical values, but does not skip ranks after ties (e.g., 1, 2, 2, 3).

### 💻 Final Result

- Return the top 5 products by total sales for each calendar month.



#### Top Products per Month

**`RANK()` / `DENSE_RANK()`**

1. Rank products by total sales within each calendar month, then keep the top 5. `EXTRACT(MONTH FROM ...)` returns only the month number (1 to 12), so sales from the same month in different years are combined.

Write using `RANK()`:

In [6]:
%%sql

WITH ranked_products AS (
    SELECT pd.productname AS product_name,
           EXTRACT(MONTH FROM sl.orderdate) AS month,
           SUM(sl.quantity * sl.unitprice) AS total_sales,
           RANK() OVER (PARTITION BY EXTRACT(MONTH FROM sl.orderdate) ORDER BY SUM(sl.quantity * sl.unitprice) DESC) AS product_rank
    FROM sales sl
    LEFT JOIN product pd ON sl.productkey = pd.productkey
    GROUP BY pd.productname, EXTRACT(MONTH FROM sl.orderdate)
)

SELECT 
    product_name, 
    month, total_sales, product_rank
FROM ranked_products
WHERE product_rank <= 5
ORDER BY month, product_rank;


product_name,month,total_sales,product_rank
Contoso Projector 1080p X981 White,1,254898.0,1
Adventure Works Desktop PC2.33 XD233 Black,1,209304.0,2
WWI Desktop PC2.33 X2330 Brown,1,193909.0,3
WWI Desktop PC2.33 X2330 Silver,1,188395.0,4
"Adventure Works 52"" LCD HDTV X590 Black",1,186904.3555,5
WWI Desktop PC2.33 X2330 Brown,2,267888.5,1
Adventure Works Desktop PC2.33 XD233 Brown,2,248064.0,2
Adventure Works Desktop PC2.33 XD233 Black,2,240796.5,3
WWI Desktop PC2.33 X2330 Black,2,234804.5,4
Adventure Works Desktop PC2.33 XD233 Silver,2,229168.5,5


Write using `DENSE_RANK()`:

In [7]:
%%sql

WITH ranked_products AS (
    SELECT pd.productname AS product_name,
           EXTRACT(MONTH FROM sl.orderdate) AS month,
           SUM(sl.quantity * sl.unitprice) AS total_sales,
           DENSE_RANK() OVER (PARTITION BY EXTRACT(MONTH FROM sl.orderdate) ORDER BY SUM(sl.quantity * sl.unitprice) DESC) AS product_rank
    FROM sales sl
    LEFT JOIN product pd ON sl.productkey = pd.productkey
    GROUP BY pd.productname, EXTRACT(MONTH FROM sl.orderdate)
)

SELECT 
    product_name, 
    month, total_sales, product_rank
FROM ranked_products
WHERE product_rank <= 5
ORDER BY month, product_rank;


product_name,month,total_sales,product_rank
Contoso Projector 1080p X981 White,1,254898.0,1
Adventure Works Desktop PC2.33 XD233 Black,1,209304.0,2
WWI Desktop PC2.33 X2330 Brown,1,193909.0,3
WWI Desktop PC2.33 X2330 Silver,1,188395.0,4
"Adventure Works 52"" LCD HDTV X590 Black",1,186904.3555,5
WWI Desktop PC2.33 X2330 Brown,2,267888.5,1
Adventure Works Desktop PC2.33 XD233 Brown,2,248064.0,2
Adventure Works Desktop PC2.33 XD233 Black,2,240796.5,3
WWI Desktop PC2.33 X2330 Black,2,234804.5,4
Adventure Works Desktop PC2.33 XD233 Silver,2,229168.5,5


What's the difference between `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`?

1. `ROW_NUMBER()` 
    - Even if two rows have the same value, they get different, consecutive numbers. Which tied row comes first is arbitrary unless the `ORDER BY` includes a tie-breaker.
    - Example: If three products have the same sales amount, they’ll be ranked 1, 2, and 3 in sequence.

    | Sales | ROW_NUMBER() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 2            |
    | 400   | 3            |
    | 300   | 4            |


2. `RANK()`
    - Rows with identical values receive the same rank, and the ranks the tied rows would otherwise occupy are skipped.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 4.

    | Sales | RANK()       |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 1            |
    | 400   | 3            |
    | 300   | 4            |


3. `DENSE_RANK()`
    - Rows with identical values receive the same rank, and the next rank continues sequentially without gaps.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 2.

    | Sales | DENSE_RANK() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 1            |
    | 400   | 2            |
    | 300   | 3            |

**Summary**

- The same comparison in table form.

| Function     | Description                                                                                | Tie Handling                         | Ranks for Sales Values (500, 500, 400, 300) |
|--------------|--------------------------------------------------------------------------------------------|--------------------------------------|---------------------------------------------|
| ROW_NUMBER() | Assigns a unique, sequential number to each row without regard for ties.                   | No ties; each row gets a unique rank | 1, 2, 3, 4                                  |
| RANK()       | Assigns the same rank to identical values but skips ranks after ties.                      | Same rank for ties; skips next ranks | 1, 1, 3, 4                                  |
| DENSE_RANK() | Assigns the same rank to identical values and continues sequentially without skipping.     | Same rank for ties; no skipped ranks | 1, 1, 2, 3                                  |
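
The three functions can also be compared side by side in one query. The sketch below reproduces the sales values 500, 500, 400, 300 from the tables above, again using Python's built-in `sqlite3` module (assuming SQLite 3.25 or newer) instead of PostgreSQL. The `toy_products` table and the product labels `P1` to `P4` are hypothetical. The `ROW_NUMBER()` window adds `product` as a tie-breaker so that its numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_products (product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO toy_products VALUES (?, ?)",
    [("P1", 500), ("P2", 500), ("P3", 400), ("P4", 300)],  # hypothetical rows
)

query = """
SELECT product,
       sales,
       ROW_NUMBER() OVER (ORDER BY sales DESC, product) AS row_num,   -- always unique
       RANK()       OVER (ORDER BY sales DESC)          AS rnk,       -- ties share a rank, gaps follow
       DENSE_RANK() OVER (ORDER BY sales DESC)          AS dense_rnk  -- ties share a rank, no gaps
FROM toy_products
ORDER BY sales DESC, product
"""
comparison = conn.execute(query).fetchall()
for row in comparison:
    print(row)

conn.close()
```

Reading down the last three columns gives 1, 2, 3, 4 for `ROW_NUMBER()`, 1, 1, 3, 4 for `RANK()`, and 1, 1, 2, 3 for `DENSE_RANK()`, matching the summary table.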