<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/5_Views/2_View_Project.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Views Project

## Overview

### 🥅 Analysis Goals

Analyze cohort revenue and lifetime value (LTV) to uncover monthly trends, mid-term fluctuations, future potential, and long-term customer value patterns.
- **Cohort Revenue Insights:**  
  - Track cumulative revenue up to each month to measure cohort growth over time.  
  - Calculate remaining cumulative revenue from each order month to analyze future revenue potential.  
  - Evaluate average LTV for each cohort using cumulative revenue while incorporating a 3-month rolling average.

### 📘 Concepts Covered

- Use views

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## Views

### 📝 Notes

`CREATE VIEW`

- **Why Use Views in PostgreSQL?**  
  - Simplifies complex queries by storing them as reusable, named objects.  
  - Ensures consistency and readability when multiple queries rely on the same logic.  
  - Enhances security by restricting access to specific rows/columns.  
  - Improves maintainability by centralizing changes to the query logic.

- **Syntax:**  
    ```sql
    CREATE VIEW view_name AS
    SELECT
        column1,
        column2,
        column3
    FROM table_name
    WHERE condition;
    ```
    - `CREATE VIEW view_name AS`: Creates a new view with the specified name.
    - `SELECT`: Defines the query that runs each time the view is referenced. A regular view stores the query definition, not its results.
    - `WHERE`: (Optional) Filters the rows the view returns.
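
As a minimal sketch of the syntax above, the snippet below creates a view and shows that it reflects new rows in the underlying table. It assumes SQLite (through Python's built-in `sqlite3`) as a stand-in for PostgreSQL so it runs without the course database; the `orders` table and `customer_totals` view are hypothetical names used only for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table, not part of the contoso_100k database
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10.0), ('b', 25.0), ('a', 5.0);

    -- The view stores the query, not its results
    CREATE VIEW customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer;
""")

query = "SELECT customer, total FROM customer_totals ORDER BY customer"
before = conn.execute(query).fetchall()

# A new row in the base table shows up the next time the view is queried.
conn.execute("INSERT INTO orders VALUES ('b', 1.0)")
after = conn.execute(query).fetchall()

print(before)  # [('a', 15.0), ('b', 25.0)]
print(after)   # [('a', 15.0), ('b', 26.0)]
```

Because the view is re-evaluated on every query, any change to the shared logic only has to be made in one place.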


### 💻 Final Result

- Calculates the average lifetime value (LTV) for each cohort based on cumulative revenue and the number of months since the cohort's first order.
  - Computes a 3-month rolling average LTV to analyze recent changes in customer value.
  - Provides insights into overall customer value trends and mid-term customer activity for cohorts.

#### Average and 3-Month Rolling LTV

**`CREATE VIEW`**

1. Put the previous query into a CTE named `cohort_summary` and get the cumulative revenue using a window function partitioned by cohort year and ordered by order month.  

   - Define a CTE `cohort_summary` to calculate the total monthly revenue for each cohort.  
        - Use `SUM(total_net_revenue)` to aggregate the total revenue per cohort and month.  
        - **🔔**: Get the order month from `orderdate` using `DATE_TRUNC('month', orderdate)`, cast the result with `::date` to make it easier to read, and alias it as `year_month`.  
        - Group the CTE by `cohort_year` and `year_month` to summarize the data at the cohort and monthly levels.  
   - **🔔**: In the main query, use `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` to calculate the cumulative revenue for each cohort up to the current month.  
        - Apply `PARTITION BY cohort_year` to ensure the cumulative calculation is done separately for each cohort.  
        - Order by `year_month` within each cohort to maintain chronological order.  
        - Select `cohort_year`, `year_month`, and `cumulative_revenue` for the final output.  
        - Use `ORDER BY cohort_year, year_month` to display results in a sorted and logical order.  

In [2]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
)
    
SELECT
    cohort_year,
    year_month,
    SUM(total_revenue) OVER (
        PARTITION BY cohort_year
        ORDER BY year_month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- Changed 
    ) AS cumulative_revenue
FROM cohort_summary
ORDER BY 
    cohort_year, 
    year_month

| | cohort_year | year_month | cumulative_revenue |
|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 |
| 1 | 2015 | 2015-02-01 | 1090466.78 |
| 2 | 2015 | 2015-03-01 | 1423428.37 |
| 3 | 2015 | 2015-04-01 | 1584195.37 |
| 4 | 2015 | 2015-05-01 | 2132828.00 |
| ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 |
| 108 | 2024 | 2024-01-01 | 2677498.55 |
| 109 | 2024 | 2024-02-01 | 6219821.10 |
| 110 | 2024 | 2024-03-01 | 7912675.99 |
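
To see how the running total resets for each cohort, here is a small sketch that replays the same window function on a handful of rows. It assumes SQLite (via Python's `sqlite3`, which needs SQLite 3.25 or later for window functions) in place of PostgreSQL. The monthly revenue figures are not queried from the database: they are back-derived by differencing consecutive `cumulative_revenue` values in the output shown here, rounded to cents.

```python
import sqlite3

# Monthly revenue back-derived from the displayed cumulative values
# (assumption: differences of the rounded output, not the raw data).
monthly = [
    (2015, "2015-01-01", 384092.66),
    (2015, "2015-02-01", 706374.12),
    (2015, "2015-03-01", 332961.59),
    (2024, "2024-01-01", 2677498.55),
    (2024, "2024-02-01", 3542322.55),
    (2024, "2024-03-01", 1692854.89),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_summary "
    "(cohort_year INTEGER, year_month TEXT, total_revenue REAL)"
)
conn.executemany("INSERT INTO cohort_summary VALUES (?, ?, ?)", monthly)

rows = conn.execute("""
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue
    FROM cohort_summary
    ORDER BY cohort_year, year_month
""").fetchall()

# Round to cents for display, since REAL sums carry floating-point noise.
cumulative = [(cohort, month, round(value, 2)) for cohort, month, value in rows]
for row in cumulative:
    print(row)
```

The 2024 rows start again from their own January value, which is the effect of `PARTITION BY cohort_year`.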


2. Put the previous main query into a CTE called `rolling_ltv` and select all of the results in the main query.  

   - Define a CTE `cohort_summary` to calculate the total monthly revenue for each cohort.  
        - Use `SUM(total_net_revenue)` to aggregate the total revenue per month, grouped by `cohort_year` and `DATE_TRUNC('month', orderdate)::date`.  
   - **🔔**: Add another CTE `rolling_ltv` to calculate the cumulative revenue for each cohort.  
        - Use a window function `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` to compute the cumulative revenue up to the current month for each cohort.  
        - Order the cumulative results by `cohort_year` and `year_month` within the `rolling_ltv` CTE. SQL does not guarantee that an `ORDER BY` inside a CTE carries through to the outer query, so add an `ORDER BY` to the main query if the final row order matters.  
   - **🔔**: In the main query, use `SELECT * FROM rolling_ltv` to display all results, including `cohort_year`, `year_month`, and `cumulative_revenue`.  

In [3]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
),

-- Moved main query to CTE
rolling_ltv AS (
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 
        ) AS cumulative_revenue
    FROM cohort_summary
    ORDER BY
        cohort_year, 
        year_month
)

SELECT *
FROM rolling_ltv

| | cohort_year | year_month | cumulative_revenue |
|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 |
| 1 | 2015 | 2015-02-01 | 1090466.78 |
| 2 | 2015 | 2015-03-01 | 1423428.37 |
| 3 | 2015 | 2015-04-01 | 1584195.37 |
| 4 | 2015 | 2015-05-01 | 2132828.00 |
| ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 |
| 108 | 2024 | 2024-01-01 | 2677498.55 |
| 109 | 2024 | 2024-02-01 | 6219821.10 |
| 110 | 2024 | 2024-03-01 | 7912675.99 |
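
Moving the query into a CTE is a pure refactor, so the rows should not change. The sketch below checks that on a few hypothetical rows, assuming SQLite (via Python's `sqlite3`) as a stand-in for PostgreSQL. It places `ORDER BY` on the outermost query in both variants, which is the one position where SQL guarantees the ordering of the final result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical rows, inserted out of order on purpose
    CREATE TABLE cohort_summary (cohort_year INTEGER, year_month TEXT, total_revenue REAL);
    INSERT INTO cohort_summary VALUES
        (2016, '2016-01-01', 5.0),
        (2015, '2015-02-01', 20.0),
        (2015, '2015-01-01', 10.0);
""")

# The windowed query, shared by both variants below.
window_query = """
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue
    FROM cohort_summary
"""

# Variant 1: window function in the main query.
direct = conn.execute(
    window_query + " ORDER BY cohort_year, year_month"
).fetchall()

# Variant 2: same query wrapped in a CTE, sorted by the outer query.
wrapped = conn.execute(
    "WITH rolling_ltv AS (" + window_query + ") "
    "SELECT * FROM rolling_ltv ORDER BY cohort_year, year_month"
).fetchall()

print(direct == wrapped)  # True
```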


3. Add `COUNT` to count the number of months since the cohort’s first order and modify the main query to call specific columns.  

   - Define a CTE `cohort_summary` to calculate the total monthly revenue for each cohort, grouping by `cohort_year` and `year_month`.  
   - In the second CTE `rolling_ltv`, compute:  
     - `cumulative_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which tracks the total accumulated revenue per cohort over time.  
     - **🔔**: `months_since_start` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which counts the monthly rows from the cohort's first month through the current one (so the first month is 1).
   - The main query selects `cumulative_revenue` and `months_since_start` for each `cohort_year` and `year_month`.  

In [4]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
),

rolling_ltv AS (
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year 
            ORDER BY year_month 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        COUNT(*) OVER ( -- Added
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS months_since_start
    FROM cohort_summary
    ORDER BY
        cohort_year, 
        year_month
)

SELECT
    cohort_year,
    year_month,
    cumulative_revenue,
    months_since_start
FROM rolling_ltv;

| | cohort_year | year_month | cumulative_revenue | months_since_start |
|---|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 | 1 |
| 1 | 2015 | 2015-02-01 | 1090466.78 | 2 |
| 2 | 2015 | 2015-03-01 | 1423428.37 | 3 |
| 3 | 2015 | 2015-04-01 | 1584195.37 | 4 |
| 4 | 2015 | 2015-05-01 | 2132828.00 | 5 |
| ... | ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 | 12 |
| 108 | 2024 | 2024-01-01 | 2677498.55 | 1 |
| 109 | 2024 | 2024-02-01 | 6219821.10 | 2 |
| 110 | 2024 | 2024-03-01 | 7912675.99 | 3 |
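
One detail worth checking is what `COUNT(*) OVER (...)` actually counts. With this frame it counts rows, so it behaves like `ROW_NUMBER()` and matches calendar months only when every month has at least one row. The sketch below illustrates this with a hypothetical cohort that has no orders in March, assuming SQLite (via Python's `sqlite3`) in place of PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical cohort with a gap: no row for 2015-03
    CREATE TABLE cohort_summary (cohort_year INTEGER, year_month TEXT, total_revenue REAL);
    INSERT INTO cohort_summary VALUES
        (2015, '2015-01-01', 100.0),
        (2015, '2015-02-01', 200.0),
        (2015, '2015-04-01', 300.0);
""")

rows = conn.execute("""
    SELECT
        year_month,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS months_since_start,
        ROW_NUMBER() OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
        ) AS row_num
    FROM cohort_summary
    ORDER BY year_month
""").fetchall()

for row in rows:
    print(row)
# April is the fourth calendar month but gets a count of 3,
# because only three rows exist up to that point.
```

In the output shown here the counts line up with the calendar (for example, 2023-12 has a count of 12), so the distinction only matters for data with missing months.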


4. Calculate the rolling average LTV using: `cumulative_revenue / months_since_start`.  

   - Define a CTE `cohort_summary` to calculate the total monthly revenue for each cohort, grouping by `cohort_year` and `year_month`.  
   - In the second CTE `rolling_ltv`, compute:  
     - `cumulative_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which tracks the total accumulated revenue per cohort over time.  
     - `months_since_start` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which counts the number of months since the cohort's start.
   - The main query selects `cumulative_revenue` and `months_since_start` for each `cohort_year` and `year_month`.  
   - **🔔**: In the main query, calculate the rolling average LTV by dividing `cumulative_revenue` by `months_since_start`.  
        - Select `cohort_year`, `year_month`, `cumulative_revenue`, `months_since_start`, and the calculated `rolling_avg_ltv`.  

In [5]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
),

rolling_ltv AS (
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year 
            ORDER BY year_month 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS months_since_start
    FROM cohort_summary
    ORDER BY
        cohort_year, 
        year_month
)

SELECT
    cohort_year,
    year_month,
    cumulative_revenue,
    months_since_start,
    cumulative_revenue / months_since_start AS rolling_avg_ltv -- Added
FROM rolling_ltv;

| | cohort_year | year_month | cumulative_revenue | months_since_start | rolling_avg_ltv |
|---|---|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 | 1 | 384092.66 |
| 1 | 2015 | 2015-02-01 | 1090466.78 | 2 | 545233.39 |
| 2 | 2015 | 2015-03-01 | 1423428.37 | 3 | 474476.12 |
| 3 | 2015 | 2015-04-01 | 1584195.37 | 4 | 396048.84 |
| 4 | 2015 | 2015-05-01 | 2132828.00 | 5 | 426565.60 |
| ... | ... | ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 | 12 | 2759047.13 |
| 108 | 2024 | 2024-01-01 | 2677498.55 | 1 | 2677498.55 |
| 109 | 2024 | 2024-02-01 | 6219821.10 | 2 | 3109910.55 |
| 110 | 2024 | 2024-03-01 | 7912675.99 | 3 | 2637558.66 |
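
As a quick sanity check on the formula, the sketch below recomputes `rolling_avg_ltv` in plain Python from the first five displayed rows. The numbers are copied from the output shown here (already rounded to cents); nothing is queried from the database.

```python
# (cumulative_revenue, months_since_start, rolling_avg_ltv) copied from
# the first five rows of the displayed output.
displayed = [
    (384092.66, 1, 384092.66),
    (1090466.78, 2, 545233.39),
    (1423428.37, 3, 474476.12),
    (1584195.37, 4, 396048.84),
    (2132828.00, 5, 426565.60),
]

# cumulative_revenue / months_since_start, rounded to cents
computed = [round(cumulative / months, 2) for cumulative, months, _ in displayed]

# Compare each recomputed value with the value shown in the output.
matches = [value == shown for value, (_, _, shown) in zip(computed, displayed)]

print(computed)
print(all(matches))  # True
```

Since the denominator grows by one every month, this metric is an expanding average over the cohort's whole history rather than a fixed-width window.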


5. Add two new columns in the `rolling_ltv` CTE to get the rolling 3-month revenue.  

   - Define a CTE `cohort_summary` to calculate monthly total revenue per cohort, grouping by `cohort_year` and `year_month`.  
   - Use the second CTE `rolling_ltv` to compute the following metrics per cohort:  
        - `cumulative_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which tracks the total revenue accrued by the cohort over time.  
        - `months_since_start` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which counts the number of months since the cohort's start.  
        - **🔔**: `rolling_3_month_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`, which captures total revenue for the current month and the two preceding months.  
        - **🔔**: `rolling_3_month_num_months` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`, which counts the number of months in the rolling 3-month period (fewer than 3 for a cohort's first two months).  
   - In the main query, include:  
        - `cumulative_revenue` and `months_since_start`.  
        - `rolling_avg_ltv`, calculated as `cumulative_revenue / months_since_start`.  
        - **🔔**: `rolling_3_month_revenue` and `rolling_3_month_num_months` to display the rolling metrics.  

In [6]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
),

rolling_ltv AS (
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year 
            ORDER BY year_month 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS months_since_start,
        SUM(total_revenue) OVER ( -- Added 
            PARTITION BY cohort_year 
            ORDER BY year_month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_revenue,
        COUNT(*) OVER ( -- Added 
            PARTITION BY cohort_year 
            ORDER BY year_month 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_num_months
    FROM cohort_summary
    ORDER BY
        cohort_year,
        year_month
)

SELECT
    cohort_year,
    year_month,
    cumulative_revenue,
    months_since_start,
    cumulative_revenue / months_since_start AS rolling_avg_ltv,
    rolling_3_month_revenue, -- Added
    rolling_3_month_num_months -- Added
FROM rolling_ltv;

| | cohort_year | year_month | cumulative_revenue | months_since_start | rolling_avg_ltv | rolling_3_month_revenue | rolling_3_month_num_months |
|---|---|---|---|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 | 1 | 384092.66 | 384092.66 | 1 |
| 1 | 2015 | 2015-02-01 | 1090466.78 | 2 | 545233.39 | 1090466.78 | 2 |
| 2 | 2015 | 2015-03-01 | 1423428.37 | 3 | 474476.12 | 1423428.37 | 3 |
| 3 | 2015 | 2015-04-01 | 1584195.37 | 4 | 396048.84 | 1200102.71 | 3 |
| 4 | 2015 | 2015-05-01 | 2132828.00 | 5 | 426565.60 | 1042361.22 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 | 12 | 2759047.13 | 8179976.91 | 3 |
| 108 | 2024 | 2024-01-01 | 2677498.55 | 1 | 2677498.55 | 2677498.55 | 1 |
| 109 | 2024 | 2024-02-01 | 6219821.10 | 2 | 3109910.55 | 6219821.10 | 2 |
| 110 | 2024 | 2024-03-01 | 7912675.99 | 3 | 2637558.66 | 7912675.99 | 3 |
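
The frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` covers at most three rows and shrinks at the start of each partition. The sketch below replays it on a few months, assuming SQLite (via Python's `sqlite3`, version 3.25 or later) as a stand-in for PostgreSQL. The monthly revenue values are back-derived by differencing the displayed `cumulative_revenue` column, rounded to cents, rather than read from the database.

```python
import sqlite3

# Monthly revenue back-derived from the displayed cumulative values
# (assumption: differences of the rounded output, not the raw data).
monthly = [
    (2015, "2015-01-01", 384092.66),
    (2015, "2015-02-01", 706374.12),
    (2015, "2015-03-01", 332961.59),
    (2015, "2015-04-01", 160767.00),
    (2015, "2015-05-01", 548632.63),
    (2024, "2024-01-01", 2677498.55),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_summary "
    "(cohort_year INTEGER, year_month TEXT, total_revenue REAL)"
)
conn.executemany("INSERT INTO cohort_summary VALUES (?, ?, ?)", monthly)

rows = conn.execute("""
    SELECT
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_revenue,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_num_months
    FROM cohort_summary
    ORDER BY cohort_year, year_month
""").fetchall()

# Round to cents for display, since REAL sums carry floating-point noise.
rolling = [(month, round(revenue, 2), n) for month, revenue, n in rows]
for row in rolling:
    print(row)
```

The April and May sums drop January and then February from the window, and the 2024 row starts a fresh window with a count of 1.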


6. Calculate the rolling 3-month average LTV using: `rolling_3_month_revenue / rolling_3_month_num_months`. Remove the `months_since_start` column.  
    - Define a CTE `cohort_summary` to calculate monthly total revenue per cohort, grouping by `cohort_year` and `year_month`.  
    - Use the second CTE `rolling_ltv` to compute the following metrics per cohort:  
        - `cumulative_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which tracks the total revenue accrued by the cohort over time.  
        - `months_since_start` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, which counts the number of months since the cohort's start.  
        - `rolling_3_month_revenue` using `SUM(total_revenue) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`, which captures total revenue for the current month and the two preceding months.  
        - `rolling_3_month_num_months` using `COUNT(*) OVER (PARTITION BY cohort_year ORDER BY year_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`, which counts the number of months in the rolling 3-month period.  
    - In the main query select `cohort_year`, `year_month`, `cumulative_revenue`, `rolling_avg_ltv`, and `rolling_3_month_avg_ltv`:  
        - Include `cohort_year` and `cumulative_revenue` for reference.  
        - Calculate `rolling_avg_ltv` as `cumulative_revenue / months_since_start`.  
        - **🔔**: Calculate `rolling_3_month_avg_ltv` as `rolling_3_month_revenue / rolling_3_month_num_months`.  
        - **🔔**: Remove the `months_since_start`, `rolling_3_month_revenue`, and `rolling_3_month_num_months` columns from the final output.  


In [7]:
%%sql

WITH cohort_summary AS (
    SELECT
        cohort_year,
        DATE_TRUNC('month', orderdate)::date AS year_month,
        SUM(total_net_revenue) AS total_revenue
    FROM cohort_analysis
    GROUP BY 
        cohort_year, 
        year_month
),

rolling_ltv AS (
    SELECT
        cohort_year,
        year_month,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year 
            ORDER BY year_month 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumulative_revenue,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS months_since_start,
        SUM(total_revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_revenue,
        COUNT(*) OVER (
            PARTITION BY cohort_year
            ORDER BY year_month
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS rolling_3_month_num_months
    FROM cohort_summary
    ORDER BY
        cohort_year,
        year_month
)

SELECT
    cohort_year,
    year_month,
    cumulative_revenue,
    cumulative_revenue / months_since_start AS rolling_avg_ltv,
    rolling_3_month_revenue / rolling_3_month_num_months AS rolling_3_month_avg_ltv -- Added
FROM rolling_ltv;

| | cohort_year | year_month | cumulative_revenue | rolling_avg_ltv | rolling_3_month_avg_ltv |
|---|---|---|---|---|---|
| 0 | 2015 | 2015-01-01 | 384092.66 | 384092.66 | 384092.66 |
| 1 | 2015 | 2015-02-01 | 1090466.78 | 545233.39 | 545233.39 |
| 2 | 2015 | 2015-03-01 | 1423428.37 | 474476.12 | 474476.12 |
| 3 | 2015 | 2015-04-01 | 1584195.37 | 396048.84 | 400034.24 |
| 4 | 2015 | 2015-05-01 | 2132828.00 | 426565.60 | 347453.74 |
| ... | ... | ... | ... | ... | ... |
| 107 | 2023 | 2023-12-01 | 33108565.51 | 2759047.13 | 2726658.97 |
| 108 | 2024 | 2024-01-01 | 2677498.55 | 2677498.55 | 2677498.55 |
| 109 | 2024 | 2024-02-01 | 6219821.10 | 3109910.55 | 3109910.55 |
| 110 | 2024 | 2024-03-01 | 7912675.99 | 2637558.66 | 2637558.66 |
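
Dividing the rolling sum by the rolling count is one way to get the 3-month average. An alternative that is not used in this notebook is to apply `AVG(total_revenue)` over the same frame. The sketch below compares the two, assuming SQLite (via Python's `sqlite3`, version 3.25 or later) in place of PostgreSQL, with monthly revenue back-derived from the displayed cumulative values by differencing (rounded to cents).

```python
import sqlite3

# Monthly revenue for the 2015 cohort, back-derived from the displayed
# cumulative values (assumption: not read from the database).
monthly = [
    (2015, "2015-01-01", 384092.66),
    (2015, "2015-02-01", 706374.12),
    (2015, "2015-03-01", 332961.59),
    (2015, "2015-04-01", 160767.00),
    (2015, "2015-05-01", 548632.63),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_summary "
    "(cohort_year INTEGER, year_month TEXT, total_revenue REAL)"
)
conn.executemany("INSERT INTO cohort_summary VALUES (?, ?, ?)", monthly)

# Same frame as rolling_3_month_revenue and rolling_3_month_num_months.
frame = (
    "PARTITION BY cohort_year ORDER BY year_month "
    "ROWS BETWEEN 2 PRECEDING AND CURRENT ROW"
)

rows = conn.execute(f"""
    SELECT
        SUM(total_revenue) OVER ({frame}) / COUNT(*) OVER ({frame}) AS via_sum_count,
        AVG(total_revenue) OVER ({frame}) AS via_avg
    FROM cohort_summary
    ORDER BY cohort_year, year_month
""").fetchall()

# Round both columns to cents before comparing.
via_sum_count = [round(a, 2) for a, _ in rows]
via_avg = [round(b, 2) for _, b in rows]

print(via_sum_count)
print(via_sum_count == via_avg)  # True
```

Keeping the sum and count as separate columns, as the notebook does, makes the intermediate values easy to inspect before they are collapsed into the average.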
