<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/3_Ranking.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Ranking Functions

## Overview

**Product Analysis Focused**

### 🥅 Analysis Goals

- Use window ranking functions to explore the `contoso_100k` sales data:
    - Average customer net revenue by cohort year
    - Row numbering across the `sales` table
    - Customer ranking by monthly net revenue
- The end goal is to identify the top 5 customers by net revenue for each month.

### 📘 Concepts Covered

This notebook covers:

- `ORDER BY` inside a window definition
- `ROW_NUMBER()`
- `RANK()` and `DENSE_RANK()`

In [20]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic (ipython-sql locally, jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


---
## Order By

### 📝 Notes

`ORDER BY`

- **ORDER BY**: Orders rows within each partition for the function.
- Syntax
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
            ORDER BY column_name
        ) AS window_column_alias
    FROM table_name;
    ```
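
The effect of `ORDER BY` inside a window is easiest to see on a tiny table. The sketch below is not part of the course database: it assumes an in-memory SQLite database (version 3.25 or later, which supports window functions) and a hypothetical `cohort_demo` table. It compares an average with no window `ORDER BY`, one ordered by the partition column, and one ordered by `customerkey`.

```python
import sqlite3

# Hypothetical toy data: (cohort_year, customerkey, revenue)
rows = [
    (2015, 1, 100.0),
    (2015, 2, 300.0),
    (2016, 3, 50.0),
    (2016, 4, 150.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_demo (cohort_year INTEGER, customerkey INTEGER, revenue REAL)"
)
conn.executemany("INSERT INTO cohort_demo VALUES (?, ?, ?)", rows)

query = """
SELECT
    cohort_year,
    customerkey,
    -- No ORDER BY: the average covers the whole partition
    AVG(revenue) OVER (PARTITION BY cohort_year) AS avg_no_order,
    -- ORDER BY the partition column: all rows are peers, same result
    AVG(revenue) OVER (PARTITION BY cohort_year ORDER BY cohort_year) AS avg_by_cohort,
    -- ORDER BY another column: the average becomes a running average
    AVG(revenue) OVER (PARTITION BY cohort_year ORDER BY customerkey) AS avg_by_customer
FROM cohort_demo
ORDER BY cohort_year, customerkey
"""
order_by_result = conn.execute(query).fetchall()
conn.close()

for row in order_by_result:
    print(row)
```

With an `ORDER BY` in the window, the default frame runs from the start of the partition to the current row and its peers, which is why the last column accumulates while the first two stay constant within a cohort.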

### 💻 Final Result

- Return the average net revenue per customer for each cohort year.

#### Average Customer Revenue by Cohort, Ordered by Cohort

**`ORDER BY`**

1. Get the average net revenue per customer for each cohort.
   - Define a CTE `cohort_analysis` to calculate the cohort year and total net revenue for each customer.
        - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.
        - Calculate the total net revenue for each customer with `SUM(quantity * netprice * exchangerate)`.
        - Group by `customerkey` so that each customer gets one cohort year and one revenue total.
   - In the main query, use `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year ORDER BY cohort_year)` to calculate the average revenue per customer for each cohort.
        - Every row in a partition has the same `cohort_year`, so this `ORDER BY` treats all rows as peers and the average still covers the whole cohort.
        - Select `cohort_year`, `customerkey`, and the average total revenue for output.
        - The query below has no final `ORDER BY`, so row order is not guaranteed; add `ORDER BY cohort_year, customerkey` to sort the results by cohort and customer.

In [24]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT 
    cohort_year, 
    customerkey,
    AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year ORDER BY cohort_year) AS avg_ltv --Updated
FROM cohort_analysis

Unnamed: 0,cohort_year,customerkey,avg_ltv
0,2015,1088780,5271.59
1,2015,1404475,5271.59
2,2015,928010,5271.59
3,2015,492702,5271.59
4,2015,341576,5271.59
...,...,...,...
49482,2024,1406861,2037.55
49483,2024,841578,2037.55
49484,2024,994228,2037.55
49485,2024,1032701,2037.55


---
## Row Number

### 📝 Notes

`ROW_NUMBER`

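- **ROW_NUMBER**: Assigns a unique, sequential number to each row within a partition. With an empty `OVER ()`, the whole result set is one partition and the numbering order is not guaranteed.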
- Syntax:
    ```sql
    ROW_NUMBER() OVER ()
    ```
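
As a minimal sketch of `ROW_NUMBER()` with both `PARTITION BY` and `ORDER BY`, the example below again assumes an in-memory SQLite database (3.25 or later) and a hypothetical `monthly_demo` table rather than the course data. The numbering restarts at 1 for each month, and `customerkey` is added as a tiebreaker so the result is deterministic.

```python
import sqlite3

# Hypothetical toy data: (month, customerkey, revenue)
rows = [
    ("2015-01", 10, 500.0),
    ("2015-01", 11, 900.0),
    ("2015-01", 12, 700.0),
    ("2015-02", 10, 200.0),
    ("2015-02", 13, 400.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_demo (month TEXT, customerkey INTEGER, revenue REAL)")
conn.executemany("INSERT INTO monthly_demo VALUES (?, ?, ?)", rows)

query = """
SELECT
    month,
    customerkey,
    revenue,
    -- Numbering restarts for each month; highest revenue gets 1
    ROW_NUMBER() OVER (
        PARTITION BY month
        ORDER BY revenue DESC, customerkey
    ) AS row_num
FROM monthly_demo
ORDER BY month, row_num
"""
row_number_result = conn.execute(query).fetchall()
conn.close()

for row in row_number_result:
    print(row)
```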

### 💻 Final Result

- Number every row in `sales`, then rank customers by net revenue within each month and keep the top 5.

#### Number Every Row in Sales

**`ROW_NUMBER`**

1. Use `ROW_NUMBER() OVER ()` to assign a sequential number to every row in `sales`.

In [12]:
%%sql

SELECT 
    orderkey,
    customerkey,
    ROW_NUMBER() OVER () AS sales_row
FROM sales

Unnamed: 0,orderkey,customerkey,sales_row
0,1000,947009,1
1,1000,947009,2
2,1001,1772036,3
3,1002,1518349,4
4,1002,1518349,5
...,...,...,...
199868,3398034,664396,199869
199869,3398034,664396,199870
199870,3398035,267690,199871
199871,3398035,267690,199872


#### Top 5 Customers by Net Revenue per Month

**`ROW_NUMBER`** with `PARTITION BY` and `ORDER BY`

1. Partition by month and order by total net revenue (descending) so that each customer gets a position within their month, then filter to the top 5.

Step 1: Rank customers by total net revenue within each month

In [7]:
%%sql

SELECT customerkey,
       DATE_TRUNC('month', orderdate) AS month,
       SUM(quantity * netprice * exchangerate) AS total_net_revenue,
       ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
FROM sales
GROUP BY 
    month,
    customerkey
ORDER BY 
    month, 
    customerkey
;

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,130500,2015-01-01 00:00:00-08:00,2246.91,65
1,133386,2015-01-01 00:00:00-08:00,876.31,105
2,143847,2015-01-01 00:00:00-08:00,50.00,190
3,200729,2015-01-01 00:00:00-08:00,1789.65,73
4,202969,2015-01-01 00:00:00-08:00,9.09,198
...,...,...,...,...
82412,2066845,2024-04-01 00:00:00-07:00,792.61,149
82413,2072527,2024-04-01 00:00:00-07:00,3025.98,38
82414,2076477,2024-04-01 00:00:00-07:00,993.38,135
82415,2091108,2024-04-01 00:00:00-07:00,118.00,214


Step 2: Only get top 5 customers for each month.

In [16]:
%%sql

WITH ranked_customers AS (
    SELECT customerkey,
        DATE_TRUNC('month', orderdate) AS month,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue,
        ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
    FROM sales
    GROUP BY 
        month,
        customerkey
    ORDER BY 
        month, 
        customerkey
)

SELECT
*
FROM ranked_customers
WHERE customer_rank <= 5
ORDER BY month, customer_rank

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,1198381,2015-01-01 00:00:00-08:00,10745.86,1
1,208950,2015-01-01 00:00:00-08:00,10314.49,2
2,810590,2015-01-01 00:00:00-08:00,9581.11,3
3,1473073,2015-01-01 00:00:00-08:00,9117.21,4
4,1928189,2015-01-01 00:00:00-08:00,8535.84,5
...,...,...,...,...
555,318017,2024-04-01 00:00:00-07:00,29449.55,1
556,205498,2024-04-01 00:00:00-07:00,12460.21,2
557,206927,2024-04-01 00:00:00-07:00,12276.19,3
558,1136286,2024-04-01 00:00:00-07:00,10971.84,4


---
## RANK

### 📝 Notes

`RANK`

- **RANK**: Assigns the same rank to rows with identical values but skips ranks after ties (e.g., 1, 2, 2, 4).
- Syntax:
    ```sql
    RANK() OVER (ORDER BY column_name)
    ```
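
One consequence of `RANK()` sharing ranks across ties is that a "top N" filter can return more than N rows. The sketch below assumes an in-memory SQLite database (3.25 or later) and a hypothetical `rank_demo` table: two customers tie for second place, so filtering on rank 2 or better keeps three rows.

```python
import sqlite3

# Hypothetical toy data: (customerkey, revenue); customers 2 and 3 tie
rows = [(1, 900.0), (2, 700.0), (3, 700.0), (4, 500.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rank_demo (customerkey INTEGER, revenue REAL)")
conn.executemany("INSERT INTO rank_demo VALUES (?, ?)", rows)

query = """
SELECT customerkey, revenue, revenue_rank
FROM (
    SELECT
        customerkey,
        revenue,
        -- Ranks are 1, 2, 2, 4: rank 3 is skipped after the tie
        RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
    FROM rank_demo
)
WHERE revenue_rank <= 2
ORDER BY revenue_rank, customerkey
"""
rank_result = conn.execute(query).fetchall()
conn.close()

for row in rank_result:
    print(row)
```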

### 💻 Final Result

- Return the top 5 customers by net revenue for each month, with tied customers sharing a rank.

#### Top 5 Customers by Net Revenue per Month

**`RANK`**

1. Replace `ROW_NUMBER()` with `RANK()` in the previous query so that customers with equal monthly net revenue share the same rank.

Write using `RANK()`:

In [17]:
%%sql

WITH ranked_customers AS (
    SELECT customerkey,
        DATE_TRUNC('month', orderdate) AS month,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue,
        RANK() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
    FROM sales
    GROUP BY 
        month,
        customerkey
    ORDER BY 
        month, 
        customerkey
)

SELECT
*
FROM ranked_customers
WHERE customer_rank <= 5
ORDER BY month, customer_rank

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,1198381,2015-01-01 00:00:00-08:00,10745.86,1
1,208950,2015-01-01 00:00:00-08:00,10314.49,2
2,810590,2015-01-01 00:00:00-08:00,9581.11,3
3,1473073,2015-01-01 00:00:00-08:00,9117.21,4
4,1928189,2015-01-01 00:00:00-08:00,8535.84,5
...,...,...,...,...
555,318017,2024-04-01 00:00:00-07:00,29449.55,1
556,205498,2024-04-01 00:00:00-07:00,12460.21,2
557,206927,2024-04-01 00:00:00-07:00,12276.19,3
558,1136286,2024-04-01 00:00:00-07:00,10971.84,4


---
## DENSE RANK

### 📝 Notes

`DENSE_RANK`

- **DENSE_RANK**: Similar to RANK(), it assigns the same rank to rows with identical values but does not skip ranks after ties (e.g., 1, 2, 2, 3).
- Syntax:
    ```sql
    DENSE_RANK() OVER (ORDER BY column_name)
    ```
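
Because `DENSE_RANK()` leaves no gaps, filtering on it selects the top N distinct values rather than the top N rows. The sketch below assumes an in-memory SQLite database (3.25 or later) and a hypothetical `dense_rank_demo` table: asking for the top 3 revenue levels returns all four customers, since two of them share the second level.

```python
import sqlite3

# Hypothetical toy data: (customerkey, revenue); customers 2 and 3 tie
rows = [(1, 900.0), (2, 700.0), (3, 700.0), (4, 500.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dense_rank_demo (customerkey INTEGER, revenue REAL)")
conn.executemany("INSERT INTO dense_rank_demo VALUES (?, ?)", rows)

query = """
SELECT customerkey, revenue, revenue_rank
FROM (
    SELECT
        customerkey,
        revenue,
        -- Ranks are 1, 2, 2, 3: no rank is skipped after the tie
        DENSE_RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
    FROM dense_rank_demo
)
WHERE revenue_rank <= 3
ORDER BY revenue_rank, customerkey
"""
dense_rank_result = conn.execute(query).fetchall()
conn.close()

for row in dense_rank_result:
    print(row)
```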


### 💻 Final Result

- Return the top 5 customers by net revenue for each month, with tied customers sharing a rank and no gaps in the ranking.

#### Top 5 Customers by Net Revenue per Month

**`DENSE_RANK`**

1. Replace `RANK()` with `DENSE_RANK()` in the previous query so that ranks stay consecutive after ties.

Write using `DENSE_RANK()`

In [18]:
%%sql

WITH ranked_customers AS (
    SELECT customerkey,
        DATE_TRUNC('month', orderdate) AS month,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue,
        DENSE_RANK() OVER (PARTITION BY DATE_TRUNC('month', orderdate) ORDER BY SUM(quantity * netprice * exchangerate) DESC) AS customer_rank
    FROM sales
    GROUP BY 
        month,
        customerkey
    ORDER BY 
        month, 
        customerkey
)

SELECT
*
FROM ranked_customers
WHERE customer_rank <= 5
ORDER BY month, customer_rank

Unnamed: 0,customerkey,month,total_net_revenue,customer_rank
0,1198381,2015-01-01 00:00:00-08:00,10745.86,1
1,208950,2015-01-01 00:00:00-08:00,10314.49,2
2,810590,2015-01-01 00:00:00-08:00,9581.11,3
3,1473073,2015-01-01 00:00:00-08:00,9117.21,4
4,1928189,2015-01-01 00:00:00-08:00,8535.84,5
...,...,...,...,...
555,318017,2024-04-01 00:00:00-07:00,29449.55,1
556,205498,2024-04-01 00:00:00-07:00,12460.21,2
557,206927,2024-04-01 00:00:00-07:00,12276.19,3
558,1136286,2024-04-01 00:00:00-07:00,10971.84,4


### 💡 What's the difference between `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`

1. `ROW_NUMBER()` 
    - Even if two rows have the same value, they will get different, consecutive ranks.
    - Example: If three products have the same sales amount, they’ll be ranked 1, 2, and 3 in sequence.

    | Sales | ROW_NUMBER() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 2            |
    | 400   | 3            |
    | 300   | 4            |


2. `RANK()`
    - Rows with identical values receive the same rank, and the next rank skips ahead by the number of tied rows, leaving a gap.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 4. In the table below, two products tie at rank 1, so the next rank is 3.

    | Sales | RANK()       |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 1            |
    | 400   | 3            |
    | 300   | 4            |


3. `DENSE_RANK()`
    - Rows with identical values receive the same rank, and the next rank continues sequentially without gaps.
    - Example: If three products have the same highest sales amount, they all get rank 1, and the next product will get rank 2.

    | Sales | DENSE_RANK() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 1            |
    | 400   | 2            |
    | 300   | 3            |

**Summary**

- The same comparison in a single table.

| Function     | Description                                                                        | Tie Handling                        | Ranks for Sales Values (500, 500, 400, 300) |
|--------------|------------------------------------------------------------------------------------|-------------------------------------|---------------------------------------------|
| ROW_NUMBER() | Assigns a unique, sequential number to each row without regard for ties.           | No ties; each row gets a unique number | 1, 2, 3, 4                               |
| RANK()       | Assigns the same rank to identical values but skips ranks after ties.              | Same rank for ties; skips next ranks | 1, 1, 3, 4                                 |
| DENSE_RANK() | Assigns the same rank to identical values and continues sequentially without gaps. | Same rank for ties; no skipped ranks | 1, 1, 2, 3                                 |
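
The three functions can also be compared side by side in one query. This sketch assumes an in-memory SQLite database (3.25 or later) and a hypothetical `sales_demo` table holding the sales values 500, 500, 400, and 300 used in the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_demo (sales INTEGER)")
conn.executemany(
    "INSERT INTO sales_demo VALUES (?)", [(500,), (500,), (400,), (300,)]
)

query = """
SELECT
    sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) AS row_num,
    RANK()       OVER (ORDER BY sales DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY sales DESC) AS dense_rank_num
FROM sales_demo
ORDER BY sales DESC, row_num
"""
comparison_result = conn.execute(query).fetchall()
conn.close()

# Columns: sales, ROW_NUMBER, RANK, DENSE_RANK
for row in comparison_result:
    print(row)
```

Which of the two tied rows receives `ROW_NUMBER()` 1 is arbitrary here; add a tiebreaker column to the window `ORDER BY` when the assignment needs to be reproducible.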