<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/3_Windows_Functions/4_Lag_Lead.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Lag / Lead

### 🥅 Analysis Goals

- Use window functions to compare average customer lifetime value (LTV) across cohorts in the Contoso sales data:
    - Group customers into cohorts by the year of their first order
    - Compute total revenue, customer count, and average LTV per cohort
    - Compare each cohort with the previous cohort and with the first cohort
- The end goal is to identify how average LTV changes from one cohort year to the next.

### 📘 Concepts Covered

General concepts we’re going to cover

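- `LAG()`: read a value from a previous row in the window
- `LEAD()`: read a value from a following row in the window
- `FIRST_VALUE()`: read a value from the first row of the window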

---

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the SQL magic extension (provided by jupysql in Colab)
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## LAG

### 📝 Notes

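- `LAG()`: Retrieves a value from a previous row in the same result set. By default it looks one row back in the order set by the window's `ORDER BY`, and it returns `NULL` when no previous row exists.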
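
To make the behavior of `LAG()` concrete, here is a minimal sketch that runs the function against a tiny in-memory table. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so it runs without the Contoso database (this assumes a SQLite build of 3.25 or newer, which added window functions). The `yearly` table and its values are made up for illustration.

```python
import sqlite3

def lag_demo():
    """Return (yr, revenue, prev_revenue) rows from a small made-up table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and values, not part of the Contoso dataset
    conn.execute("CREATE TABLE yearly (yr INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO yearly VALUES (?, ?)",
        [(2021, 100.0), (2022, 150.0), (2023, 120.0)],
    )
    rows = conn.execute(
        """
        SELECT
            yr,
            revenue,
            LAG(revenue) OVER (ORDER BY yr) AS prev_revenue
        FROM yearly
        ORDER BY yr
        """
    ).fetchall()
    conn.close()
    return rows

for row in lag_demo():
    print(row)
# (2021, 100.0, None)
# (2022, 150.0, 100.0)
# (2023, 120.0, 150.0)
```

The first row has no previous row, so `LAG()` returns `NULL`, which Python shows as `None`.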

### 💻 Final Result

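- Return each cohort's average LTV alongside the previous cohort's average LTV, the change between the two, and the change relative to the first cohort.

#### Problem Description

**`LAG()` / Comparing a Cohort With the Previous One**

1. Build `cohort_analysis`: assign each customer to a cohort by the year of their first order and sum their net revenue.
2. Build `cohort_totals`: aggregate revenue and customer counts per cohort to get the average LTV.
3. Use `LAG()` to pull the previous cohort's average LTV onto each row, and `FIRST_VALUE()` to pull the first cohort's average LTV as a fixed baseline.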

In [3]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_customer_net_revenue
    FROM sales
    GROUP BY 
        customerkey
)

SELECT *
FROM cohort_analysis

Unnamed: 0,cohort_year,customerkey,total_customer_net_revenue
0,2018,2044589,2470.73
1,2021,1603477,136.62
2,2017,876049,2601.13
3,2024,1469222,5278.54
4,2018,2089398,98.39
...,...,...,...
49482,2019,853617,903.31
49483,2016,1573639,6973.42
49484,2022,1355936,149.99
49485,2024,967453,5.40


In [7]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_totals AS (
    SELECT
        cohort_year,
        SUM(total_net_revenue) AS total_cohort_revenue,
        COUNT(DISTINCT customerkey) AS total_customers,
        SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS avg_revenue_per_customer    
    FROM cohort_analysis
    GROUP BY
        cohort_year
)

SELECT *
FROM cohort_totals

Unnamed: 0,cohort_year,total_cohort_revenue,total_customers,avg_revenue_per_customer
0,2015,14892230.47,2825,5271.59
1,2016,18360521.74,3397,5404.92
2,2017,21979733.96,4068,5403.08
3,2018,36460385.42,7446,4896.64
4,2019,36696243.88,7755,4731.95
5,2020,11921900.97,3031,3933.32
6,2021,18387736.18,4663,3943.33
7,2022,29872808.3,9010,3315.52
8,2023,14979328.33,5890,2543.18
9,2024,2856649.33,1402,2037.55


In [9]:
%%sql

WITH cohort_analysis AS (
    SELECT 
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        customerkey,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue
    FROM sales
    GROUP BY 
        customerkey
),

cohort_totals AS (
    SELECT
        cohort_year,
        SUM(total_net_revenue) AS total_cohort_revenue,
        COUNT(DISTINCT customerkey) AS total_customers,
        SUM(total_net_revenue) / COUNT(DISTINCT customerkey) AS avg_ltv   
    FROM cohort_analysis
    GROUP BY
        cohort_year
)

SELECT 
    cohort_year,
    avg_ltv,
    LAG(avg_ltv) OVER (ORDER BY cohort_year) AS prev_cohort_ltv,
    avg_ltv - LAG(avg_ltv) OVER (ORDER BY cohort_year) AS ltv_change,
    FIRST_VALUE(avg_ltv) OVER (ORDER BY cohort_year) AS first_cohort_ltv,
    avg_ltv - FIRST_VALUE(avg_ltv) OVER (ORDER BY cohort_year) AS ltv_change_from_first
FROM cohort_totals

Unnamed: 0,cohort_year,avg_ltv,prev_cohort_ltv,ltv_change,first_cohort_ltv,ltv_change_from_first
0,2015,5271.59,,,5271.59,0.0
1,2016,5404.92,5271.59,133.34,5271.59,133.34
2,2017,5403.08,5404.92,-1.84,5271.59,131.5
3,2018,4896.64,5403.08,-506.44,5271.59,-374.95
4,2019,4731.95,4896.64,-164.69,5271.59,-539.64
5,2020,3933.32,4731.95,-798.62,5271.59,-1338.26
6,2021,3943.33,3933.32,10.0,5271.59,-1328.26
7,2022,3315.52,3943.33,-627.81,5271.59,-1956.07
8,2023,2543.18,3315.52,-772.34,5271.59,-2728.41
9,2024,2037.55,2543.18,-505.63,5271.59,-3234.03
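
The absolute `ltv_change` above can also be expressed as a percentage of the previous cohort's value. The following is a rough sketch of that idea, not part of the original notebook: it copies the first four `avg_ltv` values from the output above into an in-memory SQLite table (used as a stand-in for PostgreSQL, assuming SQLite 3.25+) and reuses one named window for both `LAG()` calls. The `pct_change` column name and the `ltv_pct_change` helper are illustrative.

```python
import sqlite3

# avg_ltv values copied from the first four rows of the output above
COHORT_LTV = [(2015, 5271.59), (2016, 5404.92), (2017, 5403.08), (2018, 4896.64)]

def ltv_pct_change(rows):
    """Return (cohort_year, pct_change) where pct_change compares to the previous cohort."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cohort_totals (cohort_year INTEGER, avg_ltv REAL)")
    conn.executemany("INSERT INTO cohort_totals VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT
            cohort_year,
            -- NULL for the first cohort because LAG() has no previous row
            ROUND(100.0 * (avg_ltv - LAG(avg_ltv) OVER w) / LAG(avg_ltv) OVER w, 2) AS pct_change
        FROM cohort_totals
        WINDOW w AS (ORDER BY cohort_year)
        ORDER BY cohort_year
        """
    ).fetchall()
    conn.close()
    return result

for row in ltv_pct_change(COHORT_LTV):
    print(row)
```

The `WINDOW w AS (...)` clause is also available in PostgreSQL, so the same pattern could be applied to the final `SELECT` above to avoid repeating `OVER (ORDER BY cohort_year)`.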


## LEAD

### 📝 Notes

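- `LEAD()`: Retrieves a value from a following row in the same result set. By default it looks one row ahead in the order set by the window's `ORDER BY`, and it returns `NULL` when no following row exists.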
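
As a small illustration of `LEAD()`, the sketch below looks one row ahead in a made-up monthly table, again using Python's built-in `sqlite3` as a stand-in for PostgreSQL (assuming SQLite 3.25+). The `monthly` table, its values, and the `next_or_zero` column are hypothetical. The second `LEAD()` call passes the optional offset and default arguments so the last row gets `0.0` instead of `NULL`.

```python
import sqlite3

def lead_demo():
    """Return (sales_month, revenue, next_revenue, next_or_zero) rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and values, not part of the Contoso dataset
    conn.execute("CREATE TABLE monthly (sales_month TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO monthly VALUES (?, ?)",
        [("2024-01", 200.0), ("2024-02", 260.0), ("2024-03", 240.0)],
    )
    rows = conn.execute(
        """
        SELECT
            sales_month,
            revenue,
            LEAD(revenue) OVER (ORDER BY sales_month) AS next_revenue,
            LEAD(revenue, 1, 0.0) OVER (ORDER BY sales_month) AS next_or_zero
        FROM monthly
        ORDER BY sales_month
        """
    ).fetchall()
    conn.close()
    return rows

for row in lead_demo():
    print(row)
# ('2024-01', 200.0, 260.0, 260.0)
# ('2024-02', 260.0, 240.0, 240.0)
# ('2024-03', 240.0, None, 0.0)
```

Whether a default of `0.0` is appropriate depends on the analysis; leaving the value as `NULL` is often safer because it cannot be mistaken for a real month with zero revenue.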

### 💻 Final Result

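- Return monthly revenue alongside the revenue of the neighboring month, plus the first sale date for each product category.

#### Problem Description

**`LAG()` / `LEAD()` / `FIRST_VALUE()` Practice Scenarios**

1. Aggregate revenue by month, then use `LAG()` and `LEAD()` to place the previous and next month's revenue on each row. Use `FIRST_VALUE()` to find the earliest sale date within each product category.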

Scenario 1: Use LAG() to calculate month-over-month revenue change.

In [None]:
%%sql

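-- Revenue is computed as quantity * netprice * exchangerate, matching the cohort queries above
SELECT
    DATE_TRUNC('month', s.orderdate) AS sales_month,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_revenue,
    LAG(SUM(s.quantity * s.netprice * s.exchangerate)) OVER (ORDER BY DATE_TRUNC('month', s.orderdate)) AS previous_month_revenue,
    SUM(s.quantity * s.netprice * s.exchangerate)
        - LAG(SUM(s.quantity * s.netprice * s.exchangerate)) OVER (ORDER BY DATE_TRUNC('month', s.orderdate)) AS revenue_change
FROM
    sales s
GROUP BY
    DATE_TRUNC('month', s.orderdate)
ORDER BY
    sales_month;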


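Scenario 2: Use LEAD() to show the next month's revenue alongside each month. LEAD() reads a value that already exists in a later row; it does not forecast.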

In [None]:
%%sql

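-- Revenue is computed as quantity * netprice * exchangerate, matching the cohort queries above
SELECT
    DATE_TRUNC('month', s.orderdate) AS sales_month,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_revenue,
    LEAD(SUM(s.quantity * s.netprice * s.exchangerate)) OVER (ORDER BY DATE_TRUNC('month', s.orderdate)) AS next_month_revenue
FROM
    sales s
GROUP BY
    DATE_TRUNC('month', s.orderdate)
ORDER BY
    sales_month;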


Scenario 3: Use FIRST_VALUE() to find the first recorded sale for each product category.

In [None]:
%%sql

-- Assumption: the Contoso `product` table stores `productname` and `categoryname` directly;
-- adjust the table and column names if your schema differs
SELECT
    p.categoryname AS category,
    p.productname AS product,
    FIRST_VALUE(s.orderdate) OVER (PARTITION BY p.categoryname ORDER BY s.orderdate) AS first_sale_date
FROM
    sales s
JOIN
    product p ON s.productkey = p.productkey
ORDER BY
    category, first_sale_date;
