<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/8_Project/1_Final_Project.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Final Project

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Switch from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


## Background

You're a **data analyst at an e-commerce company**. Your stakeholders on the marketing and finance teams need insights to improve customer retention and maximize revenue. They have three key questions:

1️⃣ **Who are our most valuable customers?** (Customer Segmentation)

2️⃣ **Which customers haven’t purchased recently?** (Retention Analysis)

3️⃣ **How do different customer groups generate long-term revenue?** (Cohort-Based LTV)

Your job is to create a structured analysis using SQL that answers these questions and provides actionable insights for the business.

## Analysis

### 0️⃣ Data Cleaning & Preprocessing

#### Overview

- Before starting the analysis, create a **cleaned view** so every later query reads from the same consistent data.
- The view combines each customer's trimmed full name, country, and age with their cohort year, first purchase date, and total net revenue per order date.

#### Query Steps

1. Start from the query behind `cohort_analysis` and add `MIN(orderdate) AS first_purchase_date`: compute it in the `cohort` CTE and select it in the main query.

In [3]:
%%sql

WITH cohort AS (
    SELECT 
        customerkey,
        EXTRACT(year FROM MIN(orderdate)) AS cohort_year,
        MIN(orderdate) AS first_purchase_date
    FROM sales
    GROUP BY sales.customerkey
        )
        
SELECT s.customerkey,
    c.cohort_year,
    s.orderdate,
    c.first_purchase_date,
    sum(s.quantity::double precision * s.netprice * s.exchangerate) AS total_net_revenue
FROM sales s
    LEFT JOIN cohort c ON c.customerkey = s.customerkey
GROUP BY 
    s.customerkey, 
    c.cohort_year, 
    s.orderdate, 
    c.first_purchase_date;

Unnamed: 0,customerkey,cohort_year,orderdate,first_purchase_date,total_net_revenue
0,15,2021,2021-03-08,2021-03-08,2217.41
1,180,2018,2018-07-28,2018-07-28,525.31
2,180,2018,2023-08-28,2018-07-28,1984.90
3,185,2019,2019-06-01,2019-06-01,1395.52
4,243,2016,2016-05-19,2016-05-19,287.67
...,...,...,...,...,...
83094,2099697,2022,2022-09-13,2022-09-13,38.20
83095,2099711,2016,2016-08-13,2016-08-13,2067.75
83096,2099711,2016,2017-08-14,2016-08-13,3940.92
83097,2099743,2022,2022-03-17,2022-03-17,469.62
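
To see what the `cohort` CTE computes without the full database, here is a minimal sketch that runs the same pattern against an in-memory SQLite table. The rows are invented for illustration, and SQLite's `strftime` stands in for PostgreSQL's `EXTRACT(year FROM ...)`, so treat it as an approximation of the query above rather than a drop-in replacement.

```python
import sqlite3

# Toy stand-in for the `sales` table; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, orderdate TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (180, "2018-07-28"),
        (180, "2023-08-28"),
        (243, "2016-05-19"),
    ],
)

# Same idea as the `cohort` CTE: one row per customer with the earliest
# order date and the year of that date.
# SQLite has no EXTRACT(), so strftime('%Y', ...) is used here instead.
cohort_rows = conn.execute(
    """
    SELECT
        customerkey,
        CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
        MIN(orderdate) AS first_purchase_date
    FROM sales
    GROUP BY customerkey
    ORDER BY customerkey
    """
).fetchall()

for row in cohort_rows:
    print(row)
```

Customer 180 has two orders, yet the result holds a single row for that customer, carrying the earlier date and its year.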


2. Wrap the main query in a CTE named `cohort_data`, alias it as `cd`, and select each column explicitly.

In [4]:
%%sql

WITH cohort AS (
    SELECT 
        customerkey,
        EXTRACT(year FROM MIN(orderdate)) AS cohort_year,
        MIN(orderdate) AS first_purchase_date
    FROM sales
    GROUP BY sales.customerkey
),

-- Put query into a CTE
cohort_data AS (
	SELECT 
		s.customerkey,
		c.cohort_year,
		s.orderdate,
		c.first_purchase_date,
    	sum(s.quantity::double precision * s.netprice * s.exchangerate) AS total_net_revenue
	FROM sales s
		LEFT JOIN cohort c ON c.customerkey = s.customerkey
	GROUP BY 
		s.customerkey, 
		c.cohort_year, 
		s.orderdate, 
		c.first_purchase_date
) 

-- Added
SELECT
	cd.customerkey,
	cd.cohort_year,
	cd.first_purchase_date,
	cd.orderdate,
	cd.total_net_revenue
FROM cohort_data cd 

Unnamed: 0,customerkey,cohort_year,first_purchase_date,orderdate,total_net_revenue
0,15,2021,2021-03-08,2021-03-08,2217.41
1,180,2018,2018-07-28,2018-07-28,525.31
2,180,2018,2018-07-28,2023-08-28,1984.90
3,185,2019,2019-06-01,2019-06-01,1395.52
4,243,2016,2016-05-19,2016-05-19,287.67
...,...,...,...,...,...
83094,2099697,2022,2022-09-13,2022-09-13,38.20
83095,2099711,2016,2016-08-13,2016-08-13,2067.75
83096,2099711,2016,2016-08-13,2017-08-14,3940.92
83097,2099743,2022,2022-03-17,2022-03-17,469.62


3. `LEFT JOIN` the `customer` table on `customerkey` to get the customer's `givenname` and `surname`, concatenated into a single name, along with the customer's `age` and `countryfull`.

In [5]:
%%sql

WITH cohort AS (
    SELECT 
        customerkey,
        EXTRACT(year FROM MIN(orderdate)) AS cohort_year,
        MIN(orderdate) AS first_purchase_date
    FROM sales
    GROUP BY sales.customerkey
),

-- Put query into a CTE
cohort_data AS (
	SELECT 
		s.customerkey,
		c.cohort_year,
		s.orderdate,
		c.first_purchase_date,
    	sum(s.quantity::double precision * s.netprice * s.exchangerate) AS total_net_revenue
	FROM sales s
		LEFT JOIN cohort c ON c.customerkey = s.customerkey
	GROUP BY 
		s.customerkey, 
		c.cohort_year, 
		s.orderdate, 
		c.first_purchase_date
)

SELECT
	cd.customerkey,
	CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS customer_name, -- Added
    c.countryfull, -- Added
    c.age, -- Added
	cd.cohort_year,
	cd.first_purchase_date,
	cd.orderdate,
	cd.total_net_revenue
FROM cohort_data cd 
LEFT JOIN customer c ON c.customerkey = cd.customerkey -- Added
;

Unnamed: 0,customerkey,customer_name,countryfull,age,cohort_year,first_purchase_date,orderdate,total_net_revenue
0,15,Julian McGuigan,Australia,55,2021,2021-03-08,2021-03-08,2217.41
1,180,Gabriel Bosanquet,Australia,65,2018,2018-07-28,2018-07-28,525.31
2,180,Gabriel Bosanquet,Australia,65,2018,2018-07-28,2023-08-28,1984.90
3,185,Gabrielle Castella,Australia,40,2019,2019-06-01,2019-06-01,1395.52
4,243,Maya Atherton,Australia,66,2016,2016-05-19,2016-05-19,287.67
...,...,...,...,...,...,...,...,...
83094,2099697,Phillipp Maier,United States,54,2022,2022-09-13,2022-09-13,38.20
83095,2099711,Katerina Pavlícková,United States,80,2016,2016-08-13,2016-08-13,2067.75
83096,2099711,Katerina Pavlícková,United States,80,2016,2016-08-13,2017-08-14,3940.92
83097,2099743,Luciana Almonte,United States,21,2022,2022-03-17,2022-03-17,469.62
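
The join-and-trim step can be tried in isolation. The sketch below uses an in-memory SQLite database with invented rows and hypothetical table contents. One deliberate difference: it concatenates with `||` because `CONCAT` is not available in every SQLite version, and `||` returns `NULL` when a side is `NULL`, whereas PostgreSQL's `CONCAT` ignores `NULL` arguments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cohort_data (customerkey INTEGER, total_net_revenue REAL);
    CREATE TABLE customer (customerkey INTEGER, givenname TEXT, surname TEXT);
    INSERT INTO cohort_data VALUES (1, 100.0), (2, 50.0);
    -- Customer 1 has stray whitespace; customer 2 has no customer row at all.
    INSERT INTO customer VALUES (1, '  Maya ', ' Atherton  ');
    """
)

# LEFT JOIN keeps every revenue row, even when no customer record matches.
joined_rows = conn.execute(
    """
    SELECT
        cd.customerkey,
        TRIM(c.givenname) || ' ' || TRIM(c.surname) AS customer_name,
        cd.total_net_revenue
    FROM cohort_data cd
    LEFT JOIN customer c ON c.customerkey = cd.customerkey
    ORDER BY cd.customerkey
    """
).fetchall()

for row in joined_rows:
    print(row)
```

The unmatched customer survives the join with a `NULL` name, which is the reason for choosing `LEFT JOIN` over `INNER JOIN` here: revenue rows are never dropped because of missing customer attributes.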


4. Create a view called `cleaned_customer`.

In [6]:
%%sql

--CREATE VIEW cleaned_customer AS -- commented out: rerunning CREATE VIEW fails once the view exists
WITH cohort AS (
    SELECT 
        customerkey,
        EXTRACT(year FROM MIN(orderdate)) AS cohort_year,
        MIN(orderdate) AS first_purchase_date
    FROM sales
    GROUP BY sales.customerkey
),

-- Put query into a CTE
cohort_data AS (
	SELECT 
		s.customerkey,
		c.cohort_year,
		s.orderdate,
		c.first_purchase_date,
    	sum(s.quantity::double precision * s.netprice * s.exchangerate) AS total_net_revenue
	FROM sales s
		LEFT JOIN cohort c ON c.customerkey = s.customerkey
	GROUP BY 
		s.customerkey, 
		c.cohort_year, 
		s.orderdate, 
		c.first_purchase_date
)

SELECT
	cd.customerkey,
	CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS customer_name, -- Added
    c.countryfull, -- Added
    c.age, -- Added
	cd.cohort_year,
	cd.first_purchase_date,
	cd.orderdate,
	cd.total_net_revenue
FROM cohort_data cd 
LEFT JOIN customer c ON c.customerkey = cd.customerkey -- Added
;

Unnamed: 0,customerkey,customer_name,countryfull,age,cohort_year,first_purchase_date,orderdate,total_net_revenue
0,15,Julian McGuigan,Australia,55,2021,2021-03-08,2021-03-08,2217.41
1,180,Gabriel Bosanquet,Australia,65,2018,2018-07-28,2018-07-28,525.31
2,180,Gabriel Bosanquet,Australia,65,2018,2018-07-28,2023-08-28,1984.90
3,185,Gabrielle Castella,Australia,40,2019,2019-06-01,2019-06-01,1395.52
4,243,Maya Atherton,Australia,66,2016,2016-05-19,2016-05-19,287.67
...,...,...,...,...,...,...,...,...
83094,2099697,Phillipp Maier,United States,54,2022,2022-09-13,2022-09-13,38.20
83095,2099711,Katerina Pavlícková,United States,80,2016,2016-08-13,2016-08-13,2067.75
83096,2099711,Katerina Pavlícková,United States,80,2016,2016-08-13,2017-08-14,3940.92
83097,2099743,Luciana Almonte,United States,21,2022,2022-03-17,2022-03-17,469.62
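
The `CREATE VIEW` line is left commented out in the cell above. The sketch below illustrates why rerunning a plain `CREATE VIEW` is a problem, using an in-memory SQLite database as a stand-in for PostgreSQL. The table contents and the view name `customer_revenue` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customerkey INTEGER, netprice REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

create_view_sql = """
    CREATE VIEW customer_revenue AS
    SELECT customerkey, SUM(netprice) AS total_net_revenue
    FROM sales
    GROUP BY customerkey
"""
conn.execute(create_view_sql)

# Running the same CREATE VIEW a second time fails because the name is taken.
try:
    conn.execute(create_view_sql)
    second_create_failed = False
except sqlite3.OperationalError:
    second_create_failed = True

# The view itself can be queried like a table.
view_rows = conn.execute(
    "SELECT customerkey, total_net_revenue FROM customer_revenue ORDER BY customerkey"
).fetchall()
print(second_create_failed, view_rows)
```

In PostgreSQL, `CREATE OR REPLACE VIEW` is one way to make such a cell safe to rerun, as long as the existing output columns keep their names, types, and order.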


### 1️⃣ Customer Segmentation (Who Are Our Most Valuable Customers?)

#### Overview
- Categorize customers based on their total lifetime value (LTV).
- Assign customers to **High, Mid, and Low-value** groups using CASE WHEN.

💡 **Business Use:** Enables targeted marketing and personalized experiences
- Provide VIP benefits to high-value customers (early access, premium service)
- Create targeted upgrade paths for mid-value customers through personalized promotions
- Design re-engagement campaigns for low-value customers to increase purchase frequency
- Optimize marketing spend based on customer segment potential

#### Query Steps

1. Get the customer's lifetime value (LTV). 

In [7]:
%%sql 

SELECT
    customerkey,
    SUM(total_net_revenue) AS total_ltv
FROM cleaned_customer
GROUP BY customerkey

Unnamed: 0,customerkey,total_ltv
0,15,2217.41
1,180,2510.22
2,185,1395.52
3,243,287.67
4,387,4655.84
...,...,...
49482,2099619,6709.94
49483,2099656,10404.68
49484,2099697,38.20
49485,2099711,6008.67


2. Get the 25th and 75th percentiles of the LTV. These thresholds let us segment the customers (similar to the notebook [3_Advanced_Segmentation.ipynb](../1_Pivot_With_Case_Statements/3_Advanced_Segmentation.ipynb)).
    - High-Value: Customers in the top 25% (above the 75th percentile)
    - Mid-Value: Customers in the middle 50% (25th to 75th percentile, inclusive)
    - Low-Value: Customers in the bottom 25% (below the 25th percentile)

In [10]:
%%sql 

-- Put previous main query into a CTE
WITH customer_ltv AS (
    SELECT
        customerkey,
        SUM(total_net_revenue) AS total_ltv
    FROM cleaned_customer
    GROUP BY customerkey
)

SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS percentile_25th,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS percentile_75th
FROM customer_ltv;

Unnamed: 0,percentile_25th,percentile_75th
0,843.59,5584.04
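
`PERCENTILE_CONT` interpolates linearly between neighboring values instead of picking an existing row. As a rough cross-check outside the database, Python's `statistics.quantiles` with `method="inclusive"` applies the same interpolation rule. The LTV values below are invented for illustration.

```python
from statistics import quantiles

# Invented lifetime values for illustration only.
toy_ltv = [100.0, 200.0, 400.0, 800.0]

# method="inclusive" interpolates linearly at position p * (n - 1) in the
# sorted data, which is the rule PERCENTILE_CONT(p) follows.
percentile_25th, median, percentile_75th = quantiles(
    toy_ltv, n=4, method="inclusive"
)

print(percentile_25th, percentile_75th)
```

With four values, the 25th percentile sits at position 0.75, three quarters of the way from 100 to 200, so neither threshold has to coincide with an actual customer's LTV.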


3. Using the 25th and 75th percentile, we can now segment the customers into High, Mid, and Low-value segments.

In [11]:
%%sql

WITH customer_ltv AS (
    SELECT
        customerkey,
        SUM(total_net_revenue) AS total_ltv
    FROM cleaned_customer
    GROUP BY customerkey
),

-- Put previous main query into a CTE
customer_segments AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS percentile_25th,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS percentile_75th
    FROM customer_ltv
)

-- Add the segments to the main query
SELECT
    c.customerkey,
    c.total_ltv,
    CASE
        WHEN c.total_ltv > percentile_75th THEN 'High-Value'
        WHEN c.total_ltv BETWEEN percentile_25th AND percentile_75th THEN 'Mid-Value'
        ELSE 'Low-Value'
    END AS customer_segment
FROM customer_ltv c,
    customer_segments cs;

Unnamed: 0,customerkey,total_ltv,customer_segment
0,15,2217.41,Mid-Value
1,180,2510.22,Mid-Value
2,185,1395.52,Mid-Value
3,243,287.67,Low-Value
4,387,4655.84,Mid-Value
...,...,...,...
49482,2099619,6709.94,High-Value
49483,2099656,10404.68,High-Value
49484,2099697,38.20,Low-Value
49485,2099711,6008.67,High-Value
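
The `CASE` expression can be mirrored in a few lines of Python to make the boundary handling explicit. This is a sketch for reasoning about the logic, not part of the pipeline. The thresholds are the rounded values displayed in the percentile output, and `assign_segment` is a hypothetical helper name.

```python
def assign_segment(total_ltv, percentile_25th, percentile_75th):
    """Mirror the CASE expression: strictly above the 75th percentile is
    High-Value, the inclusive 25th-75th range is Mid-Value, the rest Low-Value."""
    if total_ltv > percentile_75th:
        return "High-Value"
    if percentile_25th <= total_ltv <= percentile_75th:
        return "Mid-Value"
    return "Low-Value"


# Thresholds as displayed (rounded) in the percentile query output.
P25, P75 = 843.59, 5584.04

# Three customers taken from the query output shown in this notebook.
sample_ltv = {15: 2217.41, 243: 287.67, 2099619: 6709.94}
segments = {key: assign_segment(ltv, P25, P75) for key, ltv in sample_ltv.items()}

print(segments)
```

Because `BETWEEN` is inclusive on both ends, a customer whose LTV equals the 75th percentile exactly lands in Mid-Value, not High-Value.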


4. Get the total revenue and customer count for each customer segment.

In [14]:
%%sql

WITH customer_ltv AS (
    SELECT
        customerkey,
        SUM(total_net_revenue) AS total_ltv
    FROM cleaned_customer
    GROUP BY customerkey
),

customer_segments AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS percentile_25th,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS percentile_75th
    FROM customer_ltv
),

-- Put previous main query into a CTE
segment_values AS (
    SELECT
        c.customerkey,
        c.total_ltv,
        CASE
            WHEN c.total_ltv > percentile_75th THEN 'High-Value'
            WHEN c.total_ltv BETWEEN percentile_25th AND percentile_75th THEN 'Mid-Value'
            ELSE 'Low-Value'
        END AS customer_segment
    FROM customer_ltv c,
    customer_segments cs
)

SELECT
    customer_segment,
    SUM(total_ltv) AS total_ltv,
    COUNT(customerkey) AS customer_count
FROM segment_values
GROUP BY customer_segment
ORDER BY total_ltv DESC
;

Unnamed: 0,customer_segment,total_ltv,customer_count
0,High-Value,135429277.27,12372
1,Mid-Value,66636451.79,24743
2,Low-Value,4341809.53,12372


#### 📊 Key Findings

- High-value segment (25% of customers) drives 66% of revenue ($135.4M)
    - 12,372 customers (25% of 49,487 total customers)
    - $135.4M / $206.4M total revenue = 66%
- Mid-value segment (50% of customers) generates 32% of revenue ($66.6M)
    - 24,743 customers (50% of 49,487 total customers)
    - $66.6M / $206.4M total revenue = 32%
- Low-value segment (25% of customers) accounts for 2% of revenue ($4.3M)
    - 12,372 customers (25% of 49,487 total customers)
    - $4.3M / $206.4M total revenue = 2%
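
The percentages above follow directly from the segment totals in the query output. A short check, using only those reported figures:

```python
# Revenue and customer counts copied from the segment query output.
segment_totals = {
    "High-Value": (135429277.27, 12372),
    "Mid-Value": (66636451.79, 24743),
    "Low-Value": (4341809.53, 12372),
}

total_revenue = sum(revenue for revenue, _ in segment_totals.values())
total_customers = sum(count for _, count in segment_totals.values())

# Share of revenue and share of customers per segment, as whole percentages.
shares = {
    segment: (
        round(100 * revenue / total_revenue),
        round(100 * count / total_customers),
    )
    for segment, (revenue, count) in segment_totals.items()
}

print(f"Total revenue: ${total_revenue / 1e6:.1f}M across {total_customers} customers")
for segment, (revenue_pct, customer_pct) in shares.items():
    print(f"{segment}: {revenue_pct}% of revenue from {customer_pct}% of customers")
```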

<img src="../8_Project/2.1_customer_segementation.png" alt="Customer Segmentation by LTV" style="width: 70%; height: auto;">

### 2️⃣ Cohort-Based LTV (How Do Customer Groups Generate Long-Term Revenue?)

#### Overview
- Track **cumulative revenue per customer cohort** over time.
- Use **window functions** to calculate lifetime value trends.
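
The query cell for this section is empty, so the following is only an illustrative sketch of the technique the overview names, not the project's actual query. It computes a running revenue total per cohort with `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` on an in-memory SQLite table. The table name `cohort_revenue` and all rows are invented, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_revenue "
    "(cohort_year INTEGER, order_year INTEGER, revenue REAL)"
)
# Invented yearly revenue per cohort, for illustration only.
conn.executemany(
    "INSERT INTO cohort_revenue VALUES (?, ?, ?)",
    [
        (2021, 2021, 100.0),
        (2021, 2022, 40.0),
        (2021, 2023, 10.0),
        (2022, 2022, 80.0),
        (2022, 2023, 20.0),
    ],
)

# Running total of revenue within each cohort, ordered by order year.
cumulative_rows = conn.execute(
    """
    SELECT
        cohort_year,
        order_year,
        SUM(revenue) OVER (
            PARTITION BY cohort_year
            ORDER BY order_year
        ) AS cumulative_revenue
    FROM cohort_revenue
    ORDER BY cohort_year, order_year
    """
).fetchall()

for row in cumulative_rows:
    print(row)
```

`PARTITION BY cohort_year` restarts the running total for each cohort, so the 2022 cohort starts again from its own first-year revenue.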

💡 **Business Use:** Prevents customer churn through timely intervention
- Launch personalized win-back campaigns based on past purchase behavior
- Proactively engage high-value customers showing declining activity
- Create time-sensitive offers for customers approaching churn threshold
- Use insights to improve product offerings and customer experience

#### Query Steps

In [None]:
%%sql

#### 📊 Key Findings

- 30% of high-value customers showing decline
- 45-day average churn warning window
- 35% win-back success rate with targeted offers

### 3️⃣ Retention Analysis (Who Hasn’t Purchased Recently?)

#### Overview

- Identify customers at risk of churning.
- Use `ROW_NUMBER()` to track last purchase while capturing revenue insights.
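
The query cell for this section is empty, so the sketch below only illustrates the `ROW_NUMBER()` pattern the overview describes, using an in-memory SQLite table. The table name `cleaned_customer_toy`, its rows, and the six-month cutoff are all assumptions made for the example, not values taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cleaned_customer_toy "
    "(customerkey INTEGER, orderdate TEXT, total_net_revenue REAL)"
)
# Invented purchase history, for illustration only.
conn.executemany(
    "INSERT INTO cleaned_customer_toy VALUES (?, ?, ?)",
    [
        (1, "2023-01-10", 50.0),
        (1, "2024-03-01", 70.0),
        (2, "2022-05-05", 30.0),
        (3, "2024-04-15", 20.0),
    ],
)

# Rank each customer's orders from newest to oldest, keep the newest one
# (rn = 1), and flag it as churned when it is more than six months older
# than the latest order date in the data (an assumed threshold).
last_purchases = conn.execute(
    """
    WITH ranked AS (
        SELECT
            customerkey,
            orderdate,
            ROW_NUMBER() OVER (
                PARTITION BY customerkey
                ORDER BY orderdate DESC
            ) AS rn
        FROM cleaned_customer_toy
    )
    SELECT
        customerkey,
        orderdate AS last_purchase_date,
        CASE
            WHEN orderdate < date(
                (SELECT MAX(orderdate) FROM cleaned_customer_toy), '-6 months'
            ) THEN 'Churned'
            ELSE 'Active'
        END AS customer_status
    FROM ranked
    WHERE rn = 1
    ORDER BY customerkey
    """
).fetchall()

for row in last_purchases:
    print(row)
```

Measuring recency against the latest order date in the data, instead of today's date, keeps the result stable for a historical dataset. In PostgreSQL the cutoff would typically be written with an `INTERVAL` instead of SQLite's `date(..., '-6 months')`.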

💡 **Business Use:** Optimizes acquisition and retention strategies
- Focus marketing budget on channels producing highest-LTV customers
- Set appropriate customer acquisition costs based on expected lifetime value
- Develop targeted retention programs for highest-potential segments
- Forecast revenue more accurately using cohort performance patterns

#### Query Steps

In [None]:
%%sql

#### 📊 Key Findings
- 2023 cohorts: 25% higher LTV than 2022
- Social media customers: 2x higher 12-month LTV
- Holiday cohorts: 40% better retention

## Conclusion

Below are the strategic recommendations based on the analysis.

1. **High-Value Focus** ($100K opportunity)
   - Launch premium membership program
   - Deploy churn early warning system
   - Implement proactive service outreach

2. **Acquisition Optimization**
   - Increase social media investment (2x LTV)
   - Optimize seasonal timing
   - Adjust CAC by channel performance

3. **Retention Enhancement**
   - Launch segment-specific reactivation campaigns
   - Create automated upgrade paths
   - Develop targeted loyalty programs