<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/1_Handling_Nulls.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Handling Nulls

## Overview

### 🥅 Analysis Goals

Understand customer-level revenue, cohort classification, and order behavior to assess customer value and purchasing patterns.
- **Handle Missing Values**: Replace NULL values with appropriate defaults using COALESCE to ensure data consistency and accurate analysis
- **Clean Revenue Data**: Calculate total net revenue and number of orders per customer, replacing NULL values with 0 for accurate aggregation
- **Cohort Analysis**: Join customer data with sales information using LEFT JOIN to maintain all customer records while analyzing purchase patterns

### 📘 Concepts Covered

- `COALESCE`
- `NULLIF`

[Source Documentation for Conditional Expressions](https://www.postgresql.org/docs/17/functions-conditional.html)

In [2]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COALESCE

### 📝 Notes

**`COALESCE()`**

- **COALESCE**: Returns the first non-null value from a list of expressions.

- Syntax:

  ```sql
  SELECT COALESCE(expression1, expression2, ..., default_value);
  ```

- Used to replace `NULL` values with a default. Common in reporting and data cleaning, such as filling missing values with a placeholder.
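
A minimal sketch of the first-non-null rule described above. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL, which is an assumption made only so the snippet runs without the course database; `COALESCE` follows the same rule in both engines.

```python
import sqlite3

# In-memory SQLite database, used only for illustration (not the course's PostgreSQL setup)
conn = sqlite3.connect(":memory:")

# COALESCE scans its arguments left to right and returns the first non-NULL one
first_non_null = conn.execute("SELECT COALESCE(NULL, NULL, 'fallback')").fetchone()[0]

# A non-NULL first argument is returned unchanged; the default is ignored
unchanged = conn.execute("SELECT COALESCE(42, 0)").fetchone()[0]

# If every argument is NULL, the result is still NULL (None in Python)
all_null = conn.execute("SELECT COALESCE(NULL, NULL)").fetchone()[0]

print(first_non_null, unchanged, all_null)
conn.close()
```

The last case is worth remembering: `COALESCE` only helps if at least one argument, usually the final default, can never be `NULL`.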

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Lifetime Value (LTV): Total revenue generated by a customer over time
  - Net Revenue: Total revenue after accounting for all adjustments
- **💡 Why It Matters**: Ensures consistent customer data analysis
    - Enables accurate customer revenue tracking
    - Prevents missing data from skewing analysis results
    - Maintains data integrity in customer-level calculations
- **🎯 Common Use Cases**: 
  - Customer name standardization
  - Revenue calculations
  - Customer data cleaning
- **📈 Related KPIs**: 
  - Customer net revenue
  - Customer count
  - Data completeness metrics

### 📈 Analysis

- Return each customer's cleaned LTV.
- Calculate each customer's total net revenue.

#### Cleaned Customer's LTV

**`COALESCE`**

1. Write a query that gets the LTV for each customer.
   - Selects `customerkey` to group revenue calculations by customer.
   - Calculates `ltv` using `SUM(quantity * netprice * exchangerate)`.
   - Uses `GROUP BY customerkey` to aggregate revenue per customer.

In [None]:
%%sql
    
SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS ltv
FROM sales
GROUP BY
    customerkey

2. Put the query into a CTE (`sales_data`), then `LEFT JOIN` this CTE onto the `customer` table to return every customer's key and cleaned LTV.
   - Defines `sales_data` as a CTE that calculates `ltv` per customer.
   - In the main query:
        - 🔔 Performs a `LEFT JOIN` from `customer` to `sales_data` to retain all customers, even those without sales.
        - 🔔 Uses `COALESCE(s.ltv, 0)` to ensure customers without sales show `0` LTV instead of `NULL`.

In [None]:
%%sql

-- Put query into a CTE
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS ltv
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.ltv,
    COALESCE(s.ltv, 0) AS cleaned_ltv
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/images/6.1_customer_ltv.png" alt="Customer LTV" width="50%">

> ⚠️ **Chart Note**: This plots only 15 of our customers for better visualization.
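
To see why the `LEFT JOIN` produces `NULL` and how `COALESCE` fills it, here is a small self-contained sketch. The three customers and their sales rows are made up for illustration, and SQLite (via Python's `sqlite3`) stands in for PostgreSQL; only the table and column names are taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: customer 3 exists but has no rows in sales
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY);
    CREATE TABLE sales (
        customerkey INTEGER,
        quantity INTEGER,
        netprice REAL,
        exchangerate REAL
    );
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO sales VALUES (1, 2, 10.0, 1.0), (1, 1, 5.0, 1.0), (2, 3, 4.0, 0.5);
""")

rows = conn.execute("""
    WITH sales_data AS (
        SELECT customerkey, SUM(quantity * netprice * exchangerate) AS ltv
        FROM sales
        GROUP BY customerkey
    )
    SELECT c.customerkey, s.ltv, COALESCE(s.ltv, 0) AS cleaned_ltv
    FROM customer c
    LEFT JOIN sales_data s ON c.customerkey = s.customerkey
    ORDER BY c.customerkey
""").fetchall()

for row in rows:
    print(row)  # customer 3 has ltv None but cleaned_ltv 0
conn.close()
```

Customer 3 has no match in `sales_data`, so the join leaves `ltv` as `NULL`, and `COALESCE` is what turns that into `0` in `cleaned_ltv`.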

#### Cleaned Customer's Net Revenue

**`COALESCE`**

1. Using our `cohort_analysis` view from the last lesson, get cohort information.
   - Select `customerkey`, `cohort_year`, `num_orders`, and `total_net_revenue`.

In [None]:
%%sql

SELECT
    ca.customerkey,
    ca.cohort_year,
    ca.num_orders,
    ca.total_net_revenue
FROM cohort_analysis ca;


2. Using `LEFT JOIN`, join the cohort analysis view onto the customer table to return every customer's cohort information.
   - Select `customerkey` (from the customer table), `cohort_year`, `num_orders`, and `total_net_revenue` (from the cohort analysis view).    
   - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.

In [None]:
%%sql

SELECT
    c.customerkey, -- Get customer key from customer table
    ca.cohort_year,
    ca.num_orders,
    ca.total_net_revenue
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey;


3. Use `COALESCE` to replace `NULL` values with `0`.
   - Select `customerkey` (from the customer table), `cohort_year`, `num_orders`, and `total_net_revenue` (from the cohort analysis view).    
   - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
   - 🔔 Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.

In [None]:
%%sql

SELECT
    c.customerkey,
    ca.cohort_year,
    COALESCE(ca.num_orders, 0) AS num_orders, -- Updated
    COALESCE(ca.total_net_revenue, 0) AS total_net_revenue -- Updated
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey;

---
## NULLIF

### 📝 Notes

**`NULLIF`**

- **NULLIF**: Returns `NULL` if two expressions are equal; otherwise, returns the first expression.

- Syntax:

  ```sql
  SELECT NULLIF(expression1, expression2);
  ```

- Helps prevent division by zero by returning `NULL` instead of causing an error.
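
A minimal sketch of these rules, again using Python's built-in `sqlite3` as an assumed stand-in for PostgreSQL. One caveat: SQLite quietly returns `NULL` for an unguarded division by zero, whereas PostgreSQL raises a `division by zero` error, so the snippet only demonstrates the values `NULLIF` produces, not the error it avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Equal arguments: NULLIF returns NULL (None in Python)
equal_args = conn.execute("SELECT NULLIF(0, 0)").fetchone()[0]

# Different arguments: NULLIF returns the first argument
different_args = conn.execute("SELECT NULLIF(5, 0)").fetchone()[0]

# Guarded division: the denominator becomes NULL, so the whole expression is NULL
guarded = conn.execute("SELECT 100.0 / NULLIF(0, 0)").fetchone()[0]

print(equal_args, different_args, guarded)
conn.close()
```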

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Cohort Analysis: Grouping customers by acquisition year
  - Average Order Value: Revenue per order for each customer
  - Customer Orders: Number of transactions per customer
- **💡 Why It Matters**: Enables accurate customer behavior analysis
    - Prevents division by zero errors in calculations
    - Allows proper calculation of average order values
    - Helps identify customer purchasing patterns
    - Provides clear view of customer order frequency
- **🎯 Common Use Cases**: 
  - Average order calculations
  - Customer cohort analysis
  - Order pattern analysis
- **📈 Related KPIs**: 
  - Average order value
  - Order frequency
  - Cohort metrics

### 📈 Analysis

- Convert customers with `0` LTV to `NULL`.
- Calculate each customer's average order value.

#### Convert Zero LTV to NULL

**`NULLIF`**

1. Replace `COALESCE` with `NULLIF` so that an LTV of `0` is returned as `NULL`.
   - Defines `sales_data` as a CTE that calculates `ltv` per customer.
   - In the main query:
        - 🔔 Performs a `LEFT JOIN` from `customer` to `sales_data` to retain all customers, even those without sales.
        - 🔔 Uses `NULLIF(s.ltv, 0)` to return `NULL` when a customer's LTV equals `0`; customers without sales are already `NULL` after the `LEFT JOIN`.

In [None]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS ltv
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.ltv,
    NULLIF(s.ltv, 0) AS cleaned_ltv -- Updated
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;
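
The sketch below runs the same query shape on made-up data to show what `NULLIF(s.ltv, 0)` changes. The rows are hypothetical (customer 2 has a single sale priced at `0`, customer 3 has no sales), and SQLite via Python's `sqlite3` is an assumed stand-in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: customer 2 has a zero-value sale, customer 3 has none
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY);
    CREATE TABLE sales (
        customerkey INTEGER,
        quantity INTEGER,
        netprice REAL,
        exchangerate REAL
    );
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO sales VALUES (1, 2, 10.0, 1.0), (2, 1, 0.0, 1.0);
""")

rows = conn.execute("""
    WITH sales_data AS (
        SELECT customerkey, SUM(quantity * netprice * exchangerate) AS ltv
        FROM sales
        GROUP BY customerkey
    )
    SELECT c.customerkey, s.ltv, NULLIF(s.ltv, 0) AS cleaned_ltv
    FROM customer c
    LEFT JOIN sales_data s ON c.customerkey = s.customerkey
    ORDER BY c.customerkey
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

In the raw `ltv` column, customer 2 (`0`) and customer 3 (`NULL`) are distinguishable; after `NULLIF` both show `NULL`, while customer 1 keeps its value.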

#### Customer Average Order Value

**`NULLIF`**

1. Add `SUM` to `num_orders` and `total_net_revenue` to get the total number of orders and total net revenue for each customer.
   - Select `customerkey` (from the customer table) and `cohort_year`, `num_orders`, and `total_net_revenue` (from the cohort analysis view).
   - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
   - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
   - 🔔 Add `SUM` to `num_orders` and `total_net_revenue`.
   - 🔔 `GROUP BY` the `customerkey` and `cohort_year`.

In [None]:
%%sql

SELECT
    c.customerkey,
    ca.cohort_year,
    SUM(COALESCE(ca.num_orders, 0)) AS num_orders, -- Updated
    SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue -- Updated
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
GROUP BY 
    c.customerkey, 
    ca.cohort_year

2. Calculate each customer's average order value.
   - Select `customerkey` (from the customer table) and `cohort_year`, `num_orders`, and `total_net_revenue` (from the cohort analysis view).
   - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
   - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
   - Add `SUM` to `num_orders` and `total_net_revenue`.
   - 🔔 Calculate `avg_order_value` by dividing the summed `total_net_revenue` by the summed `num_orders`, wrapping `num_orders` in `NULLIF(..., 0)` to prevent division by zero.
   - `GROUP BY` the `customerkey` and `cohort_year`.

In [None]:
%%sql

SELECT
    c.customerkey,
    ca.cohort_year,
    SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
    SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
    SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value  -- Added: Prevents division by zero
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
GROUP BY 
    c.customerkey, 
    ca.cohort_year

<img src="../Resources/images/6.1_customer_avg_rev.png" alt="Customer Average Revenue" width="50%">

> ⚠️ **Chart Note**: This plots only 15 of our customers for better visualization.
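
The following sketch reproduces the average order value query on made-up data. The `cohort_analysis` table here is a hypothetical stand-in for the course view, with invented rows, and SQLite via Python's `sqlite3` is assumed in place of PostgreSQL. SQLite would return `NULL` for an unguarded division by zero anyway, whereas PostgreSQL raises an error, so the `NULLIF` guard matters most on the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: customer 2 has no rows in cohort_analysis
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY);
    CREATE TABLE cohort_analysis (
        customerkey INTEGER,
        cohort_year INTEGER,
        num_orders INTEGER,
        total_net_revenue REAL
    );
    INSERT INTO customer VALUES (1), (2);
    INSERT INTO cohort_analysis VALUES (1, 2020, 1, 30.0), (1, 2020, 2, 60.0);
""")

rows = conn.execute("""
    SELECT
        c.customerkey,
        ca.cohort_year,
        SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
        SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
        SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value
    FROM customer c
    LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
    GROUP BY c.customerkey, ca.cohort_year
    ORDER BY c.customerkey
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Customer 1 averages 90 in revenue over 3 orders. Customer 2 shows `0` orders and `0` revenue thanks to `COALESCE`, but a `NULL` average, because an average over zero orders is undefined rather than zero.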

### 💡 What's the difference between `COALESCE` and `NULLIF`?

- `NULLIF(expr1, expr2)`: Returns NULL if `expr1 = expr2`, otherwise returns `expr1` (used to nullify specific values).  
- `COALESCE(expr1, expr2, ...)`: Returns the first non-NULL value from a list (used to replace NULLs with defaults).  
- **Difference:** `NULLIF` creates NULLs, while `COALESCE` replaces NULLs.
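
A side-by-side sketch of the two functions applied to the same column. The `ltv_demo` table and its three values are hypothetical, and SQLite via Python's `sqlite3` is assumed as a stand-in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with one positive, one zero, and one missing LTV
conn.execute("CREATE TABLE ltv_demo (ltv REAL)")
conn.executemany("INSERT INTO ltv_demo VALUES (?)", [(25.0,), (0,), (None,)])

rows = conn.execute("""
    SELECT
        ltv,
        COALESCE(ltv, 0) AS nulls_replaced,  -- NULL becomes 0
        NULLIF(ltv, 0) AS zeros_nullified    -- 0 becomes NULL
    FROM ltv_demo
    ORDER BY rowid
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The positive value passes through both functions unchanged; only the `0` and `NULL` rows differ between the two output columns.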