<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/3_Ranking.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Ranking Functions

## Overview

**Product Analysis Focused**

### 🥅 Analysis Goals

- Identify the top-spending cohorts.
    - Calculate the average revenue per customer within each cohort year.
    - See how adding `ORDER BY` to the window changes that average.
- Rank customers by their total number of orders using `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`.

### 📘 Concepts Covered

This notebook covers:

- `ORDER BY` inside a window (`OVER`) clause
- `ROW_NUMBER()`
- `RANK()`
- `DENSE_RANK()`

In [27]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format



---
## Order By

### 📝 Notes

`ORDER BY`

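- **ORDER BY**: Orders rows within each partition for the function.
- Rows can be sorted in ascending (`ASC`, the default) or descending (`DESC`) order.
- With an aggregate such as `AVG`, adding `ORDER BY` changes the default window frame so that it runs from the start of the partition through the current row (and any rows tied with it), which turns the result into a running value.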
- Syntax
    ```sql
    SELECT
        window_function() OVER (
            PARTITION BY partition_expression
            ORDER BY column_name --DESC or ASC
        ) AS window_column_alias
    FROM table_name;
    ```

### 💻 Final Result

- Return each customer's cohort year together with the average customer revenue for that cohort, computed with and without `ORDER BY` in the window.

#### Average Customer Revenue by Cohort, Ordered by Customer Revenue

**`ORDER BY`**

1. Get the average revenue per customer for each cohort (reuse the query from the last problem).
   - Define a CTE `cohort_analysis` to calculate the cohort year and total revenue for each customer.
        - Extract the cohort year using `EXTRACT(YEAR FROM MIN(orderdate))`.
        - Calculate the total revenue for each customer with `SUM(quantity * netprice * exchangerate)`.
        - Group by `customerkey` so that each customer gets one cohort year and one total revenue.
   - In the main query, use `AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year ORDER BY total_customer_net_revenue)`. Because the window is ordered by the customer's net revenue, this returns a running average within each cohort, from the lowest-revenue customer upward.
        - Select `cohort_year`, `customerkey`, and the resulting average (`avg_ltv`) for output.

In [28]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT 
    cohort_year, 
    customerkey,
    AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year ORDER BY total_customer_net_revenue) AS avg_ltv --Updated
FROM cohort_analysis
-- No final ORDER BY, so the order of the output rows is not guaranteed

Unnamed: 0,cohort_year,customerkey,avg_ltv
0,2015,1035383,3.14
1,2015,980553,3.90
2,2015,1727718,4.23
3,2015,1967829,4.56
4,2015,227582,4.82
...,...,...,...
49482,2024,556313,1983.58
49483,2024,494202,1996.46
49484,2024,1707989,2009.38
49485,2024,446428,2022.89


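What does it look like without an `ORDER BY`? The window then covers the whole partition, so every customer in a cohort gets the same cohort-wide average.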

In [25]:
%%sql

-- No order by
WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT 
    cohort_year, 
    customerkey,
    AVG(total_customer_net_revenue) OVER (PARTITION BY cohort_year) AS avg_ltv --Updated
FROM cohort_analysis

Unnamed: 0,cohort_year,customerkey,avg_ltv
0,2015,1088780,5271.59
1,2015,1404475,5271.59
2,2015,928010,5271.59
3,2015,492702,5271.59
4,2015,341576,5271.59
...,...,...,...
49482,2024,1406861,2037.55
49483,2024,841578,2037.55
49484,2024,994228,2037.55
49485,2024,1032701,2037.55
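
The two results above can be compared side by side on a small scale. The following is a minimal sketch, not part of the course database: the five rows are made-up toy data standing in for the `cohort_analysis` CTE, and it runs against an in-memory SQLite database through Python's built-in `sqlite3` module (assuming the bundled SQLite is version 3.25 or later, which added window functions) rather than PostgreSQL.

```python
import sqlite3

# Hypothetical toy data standing in for the cohort_analysis CTE:
# (cohort_year, customerkey, total_customer_net_revenue)
rows = [
    (2015, 1, 10.0),
    (2015, 2, 20.0),
    (2015, 3, 60.0),
    (2016, 4, 5.0),
    (2016, 5, 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_analysis ("
    "cohort_year INTEGER, customerkey INTEGER, total_customer_net_revenue REAL)"
)
conn.executemany("INSERT INTO cohort_analysis VALUES (?, ?, ?)", rows)

query = """
SELECT
    cohort_year,
    customerkey,
    -- With ORDER BY: average of all rows up to and including the current one
    AVG(total_customer_net_revenue) OVER (
        PARTITION BY cohort_year
        ORDER BY total_customer_net_revenue
    ) AS running_avg,
    -- Without ORDER BY: average over the whole partition
    AVG(total_customer_net_revenue) OVER (
        PARTITION BY cohort_year
    ) AS cohort_avg
FROM cohort_analysis
ORDER BY cohort_year, total_customer_net_revenue
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In each cohort the running average climbs row by row and only matches the cohort-wide average on the last (highest-revenue) row, which mirrors the pattern in the two outputs above.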


---
## Row Number

### 📝 Notes

`ROW_NUMBER`

- **ROW_NUMBER**: Assigns a unique, sequential number to each row within a partition, in the order given by the window's `ORDER BY`.
- Syntax:
    ```sql
    ROW_NUMBER() OVER (ORDER BY column_name)
    ```
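
Because `ROW_NUMBER()` must give every row a different number, rows that tie on the `ORDER BY` column are numbered in an arbitrary order. A second sort column makes the numbering repeatable. The sketch below illustrates this with a hypothetical `customer_orders` table and made-up counts, run on an in-memory SQLite database via Python's built-in `sqlite3` module (assuming SQLite 3.25 or later) instead of the course's PostgreSQL database.

```python
import sqlite3

# Hypothetical order counts; customers 2 and 3 tie on total_orders
rows = [(1, 31), (2, 26), (3, 26), (4, 1)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_orders (customerkey INTEGER, total_orders INTEGER)"
)
conn.executemany("INSERT INTO customer_orders VALUES (?, ?)", rows)

query = """
SELECT
    customerkey,
    total_orders,
    ROW_NUMBER() OVER (
        ORDER BY total_orders DESC, customerkey  -- customerkey breaks ties
    ) AS row_number_rank
FROM customer_orders
ORDER BY row_number_rank
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Without `customerkey` in the window's `ORDER BY`, customers 2 and 3 would still receive the numbers 2 and 3, but which customer gets which number would not be guaranteed.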

### 💻 Final Result

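- Return each customer's total number of orders with a unique row number, starting from the customer with the most orders.

#### Rank Customers by Order Quantity

**`ROW_NUMBER`**

1. Count the orders for each customer, then assign a row number by descending order count. Tied customers still get different numbers (the two customers with 26 orders get 4 and 5).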

In [80]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC) AS row_number_rank
FROM sales
GROUP BY customerkey;


Unnamed: 0,customerkey,total_orders,row_number_rank
0,1834524,31,1
1,1375597,30,2
2,249557,27,3
3,1495941,26,4
4,459519,26,5
...,...,...,...
49482,1603362,1,49483
49483,618460,1,49484
49484,1313599,1,49485
49485,1842437,1,49486


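#### Rank Year-Month Groups by Total Orders

**`ROW_NUMBER` / `RANK` / `DENSE_RANK`**

1. In a subquery, get the year (`cohort_year`) and month (`order_month`) of each order, grouped by `customerkey`, `order_month`, and `orderkey`. Because `MIN(orderdate)` is taken within each order, `cohort_year` here is the year of the order itself.
2. In the `cohort_totals` CTE, aggregate `total_orders` and the distinct customer count (`user_count`) by `cohort_year` and `order_month`.
3. In the main query, apply `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`, each ordered by `total_orders` descending. In the rows shown there are no ties, so all three functions return the same values.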

In [101]:
%%sql

WITH cohort_totals AS (
    SELECT
        cohort_year,
        order_month,
        SUM(orderkey) AS total_orders, -- as written, this sums the key values; COUNT(orderkey) would count rows instead
        COUNT(DISTINCT customerkey) AS user_count
    FROM (
        SELECT
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
            customerkey,
            DATE_TRUNC('month', orderdate) AS order_month,
            orderkey
        FROM sales
        GROUP BY
            customerkey,
            order_month,
            orderkey
    ) cohort_analysis
    GROUP BY cohort_year, order_month
)
SELECT
    cohort_year,
    order_month,
    total_orders,
    user_count,
    ROW_NUMBER() OVER (ORDER BY total_orders DESC) AS row_number_rank,
    RANK() OVER (ORDER BY total_orders DESC) AS revenue_rank,
    DENSE_RANK() OVER (ORDER BY total_orders DESC) AS dense_rank
FROM cohort_totals;

Unnamed: 0,cohort_year,order_month,total_orders,user_count,row_number_rank,revenue_rank,dense_rank
0,2023,2023-02-01 00:00:00-08:00,5875473666,1946,1,1,1
1,2024,2024-02-01 00:00:00-08:00,5832194997,1718,2,2,2
2,2022,2022-12-01 00:00:00-08:00,5779022351,1960,3,3,3
3,2022,2022-09-01 00:00:00-07:00,4941116687,1731,4,4,4
4,2022,2022-10-01 00:00:00-07:00,4940313221,1705,5,5,5
...,...,...,...,...,...,...,...
107,2015,2015-05-01 00:00:00-07:00,32136055,236,108,108,108
108,2015,2015-02-01 00:00:00-08:00,14082939,291,109,109,109
109,2015,2015-03-01 00:00:00-08:00,9915444,139,110,110,110
110,2015,2015-04-01 00:00:00-07:00,8791189,78,111,111,111


---
## Rank

### 📝 Notes

`RANK`

- **RANK**: Assigns the same rank to rows with identical values in the window's `ORDER BY` and skips ranks after ties (e.g., 1, 2, 2, 4).
- Syntax:
    ```sql
    RANK() OVER (ORDER BY column_name)
    ```

### 💻 Final Result

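- Return each customer's total number of orders with a rank, where customers with the same order count share a rank.

#### Rank Customers by Order Quantity

**`RANK`**

1. Count the orders for each customer, then rank customers by descending order count. Tied customers share a rank and the following rank is skipped (the two customers with 26 orders both get rank 4).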

In [82]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS rank_rank
FROM sales
GROUP BY customerkey;


Unnamed: 0,customerkey,total_orders,rank_rank
0,1834524,31,1
1,1375597,30,2
2,249557,27,3
3,1495941,26,4
4,459519,26,4
...,...,...,...
49482,1603362,1,39985
49483,618460,1,39985
49484,1313599,1,39985
49485,1842437,1,39985



---
## Dense Rank

### 📝 Notes

`DENSE_RANK`

- **DENSE_RANK**: Like `RANK()`, it assigns the same rank to rows with identical values in the window's `ORDER BY`, but it does not skip ranks after ties (e.g., 1, 2, 2, 3).
- Syntax:
    ```sql
    DENSE_RANK() OVER (ORDER BY column_name)
    ```


### 💻 Final Result

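- Return each customer's total number of orders with a dense rank, where customers with the same order count share a rank and no ranks are skipped.

#### Rank Customers by Order Quantity

**`DENSE_RANK`**

1. Count the orders for each customer, then rank customers by descending order count. Tied customers share a rank and no rank is skipped, so the lowest rank (28) equals the number of distinct order counts.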

In [83]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    DENSE_RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS dense_rank_rank
FROM sales
GROUP BY customerkey;


Unnamed: 0,customerkey,total_orders,dense_rank_rank
0,1834524,31,1
1,1375597,30,2
2,249557,27,3
3,1495941,26,4
4,459519,26,4
...,...,...,...
49482,1603362,1,28
49483,618460,1,28
49484,1313599,1,28
49485,1842437,1,28



### 💡 What's the difference between `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`?

1. `ROW_NUMBER()`
    - Even if two rows have the same value, they get different, consecutive numbers.
    - Example: If two products both have sales of 500, they are numbered 1 and 2, and the next product is numbered 3.

    | Sales | ROW_NUMBER() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 2            |
    | 400   | 3            |
    | 300   | 4            |

2. `RANK()`
    - Rows with identical values receive the same rank, and the next rank skips ahead by the number of tied rows.
    - Example: If two products tie for the highest sales amount, both get rank 1, and the next product gets rank 3.

    | Sales | RANK() |
    |-------|--------|
    | 500   | 1      |
    | 500   | 1      |
    | 400   | 3      |
    | 300   | 4      |

3. `DENSE_RANK()`
    - Rows with identical values receive the same rank, and the next rank continues sequentially without gaps.
    - Example: If two products tie for the highest sales amount, both get rank 1, and the next product gets rank 2.

    | Sales | DENSE_RANK() |
    |-------|--------------|
    | 500   | 1            |
    | 500   | 1            |
    | 400   | 2            |
    | 300   | 3            |

**Summary table**

- The same comparison in a single table.

| Function     | Description                                                                       | Tie Handling                         | Result for Sales Values (500, 500, 400, 300) |
|--------------|-----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------|
| ROW_NUMBER() | Assigns a unique, sequential number to each row without regard for ties.          | No ties; each row gets a unique number | 1, 2, 3, 4                                 |
| RANK()       | Assigns the same rank to identical values and skips ranks after ties.             | Same rank for ties; skips next ranks | 1, 1, 3, 4                                   |
| DENSE_RANK() | Assigns the same rank to identical values and continues without skipping ranks.   | Same rank for ties; no skipped ranks | 1, 1, 2, 3                                   |
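
The sales values used in the tables above (500, 500, 400, 300) can be run through all three functions at once as a quick check. This is a minimal sketch using a hypothetical `product_sales` table in an in-memory SQLite database via Python's built-in `sqlite3` module (assuming SQLite 3.25 or later); the three functions follow the same tie-handling rules described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding only the example sales values
conn.execute("CREATE TABLE product_sales (sales INTEGER)")
conn.executemany(
    "INSERT INTO product_sales VALUES (?)",
    [(500,), (500,), (400,), (300,)],
)

query = """
SELECT
    sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) AS row_number_rank,
    RANK()       OVER (ORDER BY sales DESC) AS rank_rank,
    DENSE_RANK() OVER (ORDER BY sales DESC) AS dense_rank_rank
FROM product_sales
ORDER BY sales DESC, row_number_rank
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Reading down the columns gives 1, 2, 3, 4 for `ROW_NUMBER()`, 1, 1, 3, 4 for `RANK()`, and 1, 1, 2, 3 for `DENSE_RANK()`.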

In [79]:
%%sql
SELECT 
    customerkey,
    COUNT(orderkey) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(orderkey) DESC) AS row_number_rank,
    RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS rank_rank,
    DENSE_RANK() OVER (ORDER BY COUNT(orderkey) DESC) AS dense_rank_rank
FROM sales
GROUP BY customerkey;


Unnamed: 0,customerkey,total_orders,row_number_rank,rank_rank,dense_rank_rank
0,1834524,31,1,1,1
1,1375597,30,2,2,2
2,249557,27,3,3,3
3,1495941,26,4,4,4
4,459519,26,5,4,4
...,...,...,...,...,...
49482,1603362,1,49483,39985,28
49483,618460,1,49484,39985,28
49484,1313599,1,49485,39985,28
49485,1842437,1,49486,39985,28
