<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/7_Basic_Query_Optimization/3_Project_Customer_Retention.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 3️⃣ Retention Analysis (Who Hasn’t Purchased Recently?)

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'colab_db' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

## Background

You're a **data analyst at an e-commerce company**. Your stakeholders on marketing & finance teams need insights to improve customer retention and maximize revenue. They have three key questions:

1️⃣ **Who are our most valuable customers?** (Customer Segmentation)

2️⃣ **How do different customer groups generate long-term revenue?** (Cohort-Based LTV) 

3️⃣ **Which customers haven’t purchased recently?** (Retention Analysis)

Your job is to create a structured analysis using SQL that answers these questions and provides actionable insights for the business.

## Analysis

#### Overview

- Identify customers who have churned.
- Use `ROW_NUMBER()` to track last purchase while capturing revenue insights.

**📊 Business Terms**: 
- Active Customer: Customer who made a purchase within the last 6 months
- Churned Customer: Customer who hasn't made a purchase in over 6 months
- Last Purchase Date: Most recent transaction date for each customer
- Churn Period: 6-month inactivity threshold

**💡 Why It Matters**: Helps track customer retention and engagement
- Identifies at-risk customers before they fully churn
- Enables targeted re-engagement campaigns
- Measures effectiveness of retention strategies
- Provides insights into customer lifecycle

💼 **Example Use Cases:** Optimizes acquisition and retention strategies
- Focus marketing budget on channels producing highest-LTV customers
- Set appropriate customer acquisition costs based on expected lifetime value
- Develop targeted retention programs for highest-potential segments
- Forecast revenue more accurately using cohort performance patterns

#### Query Steps

1. From the `orderdate` using a windows function assign a row number to each order using `ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC)`. 

In [40]:
%%sql

SELECT
    customerkey,
    orderdate AS last_purchase_date,
    (quantity * netprice * exchangerate) AS last_net_revenue,
    ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn
FROM sales

Unnamed: 0,customerkey,last_purchase_date,last_net_revenue,rn
0,15,2021-03-08,2217.41,1
1,180,2023-08-28,71.36,1
2,180,2023-08-28,1913.55,2
3,180,2018-07-28,525.31,3
4,185,2019-06-01,1395.52,1
...,...,...,...,...
199868,2099711,2017-08-14,3940.92,1
199869,2099711,2016-08-13,2067.75,2
199870,2099743,2023-02-11,598.46,1
199871,2099743,2022-03-17,375.57,2


2. Put the last query into a CTE called `customer_last_purchase`, return `customerkey`, `last_purchase_date`, and `last_net_revenue` columns. Filter where `rn` is 1, to get the customer's last purchase and use a subquery to filter out customers whose *first* purchase was in 2024 (meaning they would be recent customer's and active by default).

In [41]:
%%sql

-- Put previous query into a CTE
WITH customer_last_purchase AS (
    SELECT
        customerkey,
        orderdate AS last_purchase_date,
        quantity * netprice * COALESCE(exchangerate, 1) AS last_net_revenue,
        ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn
    FROM sales
)
SELECT
    clp.customerkey,
    clp.last_purchase_date,
    clp.last_net_revenue
FROM customer_last_purchase clp
WHERE  -- Added
    clp.rn = 1 -- Filter to only return last purchase
    AND clp.customerkey IN ( -- Filter out customers who made only one purchase in 2024
            SELECT customerkey
            FROM sales
            GROUP BY customerkey
            HAVING MIN(orderdate) < '2024-01-01'
        )
;

Unnamed: 0,customerkey,last_purchase_date,last_net_revenue
0,15,2021-03-08,2217.41
1,180,2023-08-28,71.36
2,185,2019-06-01,1395.52
3,243,2016-05-19,287.67
4,387,2023-11-16,30.51
...,...,...,...
48080,2099619,2020-07-10,544.59
48081,2099656,2024-02-06,193.56
48082,2099697,2022-09-13,4.74
48083,2099711,2017-08-14,3940.92


3. Get the last order date from the `sales` table.

In [5]:
%%sql

    SELECT
        MAX(orderdate)
    FROM sales

Unnamed: 0,max
0,2024-04-20


4. Add a `CASE WHEN` statement to assign a status to the customer based on their last purchase. If their last purchase was within 6 moths of `2024-04-20` then they are considered 'Active', otherwise they are considered 'Churned'.

> **⚠️ Note**: Typically we'd use 'CURRENT_DATE' instead of 2024-04-20 to be more dynamic but since this data isn't up-to-date we are using 2024-04-20.

In [47]:
%%sql

WITH customer_last_purchase AS (
    SELECT
        customerkey,
        orderdate AS last_purchase_date,
        quantity * netprice * COALESCE(exchangerate, 1) AS last_net_revenue,
        ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn
    FROM sales
)
SELECT
    clp.customerkey,
    clp.last_purchase_date,
    clp.last_net_revenue,
    CASE -- Added
        WHEN clp.last_purchase_date < '2024-04-20'::date - INTERVAL '6 months' THEN 'Churned'
        ELSE 'Active'
    END AS customer_status
FROM customer_last_purchase clp
WHERE 
    clp.rn = 1
    AND clp.customerkey IN (
            SELECT customerkey
            FROM sales
            GROUP BY customerkey
            HAVING MIN(orderdate) < '2024-01-01'
        )
;

Unnamed: 0,customerkey,last_purchase_date,last_net_revenue,customer_status
0,15,2021-03-08,2217.41,Churned
1,180,2023-08-28,71.36,Churned
2,185,2019-06-01,1395.52,Churned
3,243,2016-05-19,287.67,Churned
4,387,2023-11-16,30.51,Active
...,...,...,...,...
48080,2099619,2020-07-10,544.59,Churned
48081,2099656,2024-02-06,193.56,Active
48082,2099697,2022-09-13,4.74,Churned
48083,2099711,2017-08-14,3940.92,Churned


5. Put the main query into a CTE and name it `churned_customers`. In the main query return the `customer_status` and the `COUNT` of the `customerkey`. Also calculate the percentage of churned customers over the total and use `ROUND` to round the results to two decimal places.

> **⚠️ Note**: Why are we using a windows function instead of a subquery like `(COUNT(customerkey)/ (SELECT COUNT(customerkey) FROM churned_customers))`? Window functions (OVER) are more efficient than subqueries for calculating totals as they compute in a single pass, avoiding additional table scans and maintaining consistency with the main query's filters.

In [48]:
%%sql

WITH customer_last_purchase AS (
    SELECT
        customerkey,
        orderdate AS last_purchase_date,
        quantity * netprice * COALESCE(exchangerate, 1) AS last_net_revenue,
        COUNT(*) OVER (PARTITION BY customerkey) as purchase_count,
        ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn
    FROM sales
),

-- Put previous main query into a CTE
churned_customers AS (
SELECT
    clp.customerkey,
    clp.last_purchase_date,
    clp.last_net_revenue,
    CASE
        WHEN clp.last_purchase_date < '2024-04-20'::date - INTERVAL '6 months' THEN 'Churned'
        ELSE 'Active'
    END AS customer_status
FROM customer_last_purchase clp
WHERE 
    clp.rn = 1
    AND clp.customerkey IN (
            SELECT customerkey
            FROM sales
            GROUP BY customerkey
            HAVING MIN(orderdate) < '2024-01-01'
        )
)

-- Added
SELECT
    customer_status,
    COUNT(DISTINCT customerkey) AS num_customers,
    ROUND(COUNT(DISTINCT customerkey) / SUM(COUNT(DISTINCT customerkey)) OVER(),2) AS status_percentage
FROM churned_customers
GROUP BY customer_status;


Unnamed: 0,customer_status,num_customers,status_percentage
0,Active,5613,0.12
1,Churned,42472,0.88


<img src="../Resources/images/7.3_customer_churn.png" alt="Customers Churned" style="width: 70%; height: auto;">

6. To get the `cohort_year` use an `INNER JOIN` in the `churned_customers` CTE. Then in the main query return the `cohort_year`. Update the `status_precentage` to `PARTITION BY cohort_year`.

> **⚠️ Note**: COUNT(DISTINCT customerkey) is necessary when grouping by cohort_year because customers might appear in multiple cohort records, ensuring we don't double-count customers within each cohort's percentage calculation.

In [49]:
%%sql

WITH customer_last_purchase AS (
    SELECT
        customerkey,
        orderdate AS last_purchase_date,
        quantity * netprice * COALESCE(exchangerate, 1) AS last_net_revenue,
        ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn,
        MIN(orderdate) OVER (PARTITION BY customerkey) AS first_purchase_date
    FROM sales
),

churned_customers AS (
    SELECT
        clp.customerkey,
        clp.last_purchase_date,
        clp.last_net_revenue,
        ca.cohort_year,
        CASE
            WHEN clp.last_purchase_date < '2024-04-20'::date - INTERVAL '6 months' THEN 'Churned'
            ELSE 'Active'
        END AS customer_status
    FROM customer_last_purchase clp
    INNER JOIN cohort_analysis ca ON clp.customerkey = ca.customerkey -- Added
    WHERE 
        clp.rn = 1
        AND clp.customerkey IN (
            SELECT customerkey
            FROM sales
            GROUP BY customerkey
            HAVING MIN(orderdate) < '2024-01-01'
        )
)

SELECT
    cohort_year, -- Added
    customer_status,
    COUNT(customerkey) AS num_customers,
    ROUND(COUNT(customerkey) / (SUM(COUNT(customerkey)) OVER(PARTITION BY cohort_year)), 2) AS status_percentage
FROM churned_customers
GROUP BY 
    cohort_year, -- Added
    customer_status
ORDER BY 
    cohort_year, -- Added
    customer_status;

Unnamed: 0,cohort_year,customer_status,num_customers,status_percentage
0,2015,Active,772,0.13
1,2015,Churned,5087,0.87
2,2016,Active,965,0.14
3,2016,Churned,5853,0.86
4,2017,Active,1251,0.15
5,2017,Churned,7021,0.85
6,2018,Active,2183,0.15
7,2018,Churned,12259,0.85
8,2019,Active,2010,0.14
9,2019,Churned,11928,0.86


<img src="../Resources/images/7.3_customer_churn_cohort_year.png" alt="Customer Churn by Cohort Year" style="width: 70%; height: auto;">

#### 📊 Key Findings

- **Older Cohorts (2015-2019) Have High Churn (85-88%)**  
  - Long-term retention remains weak, with most cohorts stabilizing below 15% active.  
  - Reactivation efforts for high-value churned users may be more effective than broad retention strategies.  

- **Recent Cohorts (2022-2023) Show Improved Retention**  
  - **2023 cohort (33% active)** has the strongest retention among older cohorts, indicating potential improvements in customer experience.  
  - **2022 cohort (18% active)** performs slightly better than previous years, suggesting some retention gains.  

- **Retention Drops Consistently After 2-3 Years**  
  - Active rates stabilize between 12-15% for older cohorts, reinforcing the need for stronger early engagement.  
  - Without intervention, newer cohorts are likely to follow a similar churn pattern.  