<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/1_Handling_Nulls.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Handling Nulls

## Overview

### 🥅 Analysis Goals

Understand customer-level revenue, cohort classification, and order behavior to assess customer value and purchasing patterns.  

- **Clean Up Customer Names:** Standardize customer names by replacing null values with 'Unknown' to ensure data consistency and prevent missing values from impacting analysis.  
- **Cleaned Customer's Net Revenue:** Calculate total net revenue for each customer and join it with customer data to understand individual spending behavior and revenue contribution.  
- **Cohort Customer Value:** Determine each customer’s acquisition cohort, total revenue, number of orders, and average order value to analyze customer lifetime value and purchasing trends.

### 📘 Concepts Covered

- `COALESCE`
- `NULLIF`

---
## COALESCE

### 📝 Notes

**`COALESCE()`**

- **COALESCE**: Returns the first non-null value from a list of expressions.

- Syntax:

  ```sql
  SELECT COALESCE(expression1, expression2, ..., default_value);
  ```

- Used to replace `NULL` values with a default. Common in reporting and data cleaning, such as filling missing values with a placeholder.
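
To make the "first non-null value" rule concrete, here is a minimal sketch on inline toy data. It assumes PostgreSQL-style `VALUES` syntax, and the `nickname` column is hypothetical, used only to show a fallback chain with more than two arguments.

```sql
-- Toy rows built inline; nothing here reads from the course database.
SELECT
    nickname,
    givenname,
    -- Checks nickname first, then givenname, then falls back to the literal
    COALESCE(nickname, givenname, 'Unknown') AS display_name
FROM (
    VALUES
        ('Ace', 'Alice'),  -- nickname present
        (NULL,  'Bob'),    -- nickname missing, givenname present
        (NULL,  NULL)      -- both missing
) AS t (nickname, givenname);
```

The three rows resolve to `Ace`, `Bob`, and `Unknown`: `COALESCE` stops at the first argument that is not `NULL`.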

### 💻 Final Result

- Standardizes customer names by replacing null values with 'Unknown' to ensure consistency and avoid missing data issues in analysis.
- Calculates each customer's total net revenue and joins it with customer data to understand spending behavior at an individual level.

#### Clean Customer Names

**`COALESCE`**

1. Use `COALESCE` on the customer's `givenname` and `surname`.
    - Selects `customerkey` to keep customer identification.  
    - Uses `COALESCE(givenname, 'Unknown')` to replace `NULL` values in `givenname`.  
    - Uses `COALESCE(surname, 'Unknown')` to replace `NULL` values in `surname`.

In [None]:
SELECT
    customerkey,
    COALESCE(givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(surname, 'Unknown') AS cleaned_surname
FROM customer;

<img src="../Resources/query_results/6_handling_nulls_1.png" alt="Query Results 1" style="width: 60%; height: auto;">

#### Cleaned Customer's Net Revenue

**`COALESCE`**

1. Write a query that gets the total net revenue for each customer.
   - Selects `customerkey` to group revenue calculations by customer.  
   - Calculates `net_revenue` using `SUM(quantity * netprice * exchangerate)`.  
   - Uses `GROUP BY customerkey` to aggregate revenue per customer.  

In [None]:
SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    customerkey;

<img src="../Resources/query_results/6_handling_nulls_4.png" alt="Query Results 4" style="width: 50%; height: auto;">

2. Put the query into a CTE (`sales_data`), then `LEFT JOIN` this CTE onto the customer table to return every customer's cleaned name and their net revenue. 
   - Defines `sales_data` as a CTE that calculates `net_revenue` per customer.  
   - Performs a `LEFT JOIN` on `customer` to retain all customers, even those without sales.  
   - Uses `COALESCE(c.givenname, 'Unknown')` and `COALESCE(c.surname, 'Unknown')` to replace missing names.  
   - Uses `COALESCE(s.net_revenue, 0)` to ensure customers without sales show `0` revenue instead of `NULL`.

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_handling_nulls_5.png" alt="Query Results 5" style="width: 70%; height: auto;">
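
To see where the `NULL` revenue values come from, the sketch below repeats the same join on two tiny inline tables. The names `toy_customer` and `toy_sales` and their rows are made up for illustration, and PostgreSQL-style `VALUES` syntax is assumed.

```sql
WITH toy_customer (customerkey, givenname) AS (
    VALUES
        (1, 'Alice'),
        (2, NULL),     -- missing name
        (3, 'Cara')    -- no matching sales row below
),
toy_sales (customerkey, net_revenue) AS (
    VALUES
        (1, 120.50),
        (2, 75.00)
)
SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    s.net_revenue                    AS raw_net_revenue,  -- NULL when the join finds no match
    COALESCE(s.net_revenue, 0)       AS net_revenue       -- same value with NULL replaced by 0
FROM toy_customer c
LEFT JOIN toy_sales s ON c.customerkey = s.customerkey
ORDER BY c.customerkey;
```

Customer 3 has no row in `toy_sales`, so the `LEFT JOIN` fills `raw_net_revenue` with `NULL` and `COALESCE` turns it into `0`. Customer 2 shows the name fallback, returning `Unknown`.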

---
## NULLIF

### 📝 Notes

**`NULLIF`**

- **NULLIF**: Returns `NULL` if two expressions are equal; otherwise, returns the first expression.

- Syntax:

  ```sql
  SELECT NULLIF(expression1, expression2);
  ```

- Commonly used to guard division: `x / NULLIF(y, 0)` evaluates to `NULL` when `y` is `0`, instead of raising a division-by-zero error.
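
A minimal sketch of that guard on inline toy data follows. The column names `revenue` and `orders` are illustrative, and PostgreSQL-style `VALUES` syntax is assumed.

```sql
SELECT
    revenue,
    orders,
    NULLIF(orders, 0)           AS orders_or_null,   -- NULL only when orders = 0
    revenue / NULLIF(orders, 0) AS avg_order_value   -- dividing by NULL yields NULL, not an error
FROM (
    VALUES
        (100.0, 4),
        (50.0,  0)
) AS t (revenue, orders);
```

The first row divides normally (100 / 4 evaluates to 25). In the second row `NULLIF(orders, 0)` returns `NULL`, so the division returns `NULL` instead of failing.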

### 💻 Final Result

- Determines each customer's acquisition cohort and calculates their total revenue, number of orders, and average order value to analyze customer lifetime value and purchasing patterns.

#### Cohort Customer Value

**`NULLIF`**

1. Get the total number of orders by customer in the `sales_data` CTE and use `COALESCE` on `num_orders` in the main query. 
   - Defines `sales_data` as a CTE that calculates `net_revenue` and `num_orders` per customer.  
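   - Uses `COUNT(orderkey)` to count the rows in `sales` with a non-null `orderkey` for each customer. If an order can span several rows, `COUNT(DISTINCT orderkey)` counts unique orders instead.  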
   - Performs a `LEFT JOIN` on `customer` to retain all customers, even those without orders.  
   - Uses `COALESCE(s.num_orders, 0)` to replace `NULL` values with `0` for customers without orders.  

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS num_orders -- Added
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_handling_nulls_6.png" alt="Query Results 6" style="width: 70%; height: auto;">
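
Whether `COUNT(orderkey)` equals the number of distinct orders depends on how many rows each order occupies in `sales`, which these notes do not state. The sketch below uses made-up rows (and a hypothetical `linenumber` column) to show how the two counts can differ. PostgreSQL-style `VALUES` syntax is assumed.

```sql
WITH toy_sales (orderkey, linenumber, customerkey) AS (
    VALUES
        (1001, 0, 1),  -- order 1001 has two rows
        (1001, 1, 1),
        (1002, 0, 1),
        (1003, 0, 2)
)
SELECT
    customerkey,
    COUNT(orderkey)          AS order_rows,      -- counts every row with a non-null orderkey
    COUNT(DISTINCT orderkey) AS distinct_orders  -- counts each orderkey once
FROM toy_sales
GROUP BY customerkey
ORDER BY customerkey;
```

For customer 1 this returns 3 rows but 2 distinct orders. For customer 2 both counts are 1. If each order in the real table occupies exactly one row, the two expressions agree.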

2. Return the average order value for each customer as `net_revenue / num_orders`, wrapping `num_orders` in `NULLIF` to prevent division by zero.
   - Reuses the `sales_data` CTE to retrieve `net_revenue` and `num_orders` per customer.  
   - Uses `COALESCE(s.num_orders, 0)`, aliased as `total_orders`, to display `0` for customers without orders.  
   - Calculates `avg_order_value` using `s.net_revenue / NULLIF(s.num_orders, 0)`. `NULLIF` returns `NULL` when `num_orders = 0`, so the division yields `NULL` instead of raising a division-by-zero error. Customers without any sales have `NULL` in both columns after the `LEFT JOIN`, so their `avg_order_value` is also `NULL`.  

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  -- Added: Prevents division by zero
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_handling_nulls_7.png" alt="Query Results 7" style="width: 70%; height: auto;">
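
Customers with no sales end up with a `NULL` `avg_order_value`. If a report should show `0` instead, one option (a presentation choice, not something the lesson prescribes) is to wrap the division in `COALESCE`. The sketch below uses made-up inline tables and assumes PostgreSQL-style `VALUES` syntax.

```sql
WITH toy_customer (customerkey) AS (
    VALUES (1), (2)
),
toy_sales_data (customerkey, net_revenue, num_orders) AS (
    VALUES (1, 200.00, 4)  -- customer 2 has no sales row
)
SELECT
    c.customerkey,
    s.num_orders                                         AS raw_num_orders,
    s.net_revenue / NULLIF(s.num_orders, 0)              AS avg_order_value,
    COALESCE(s.net_revenue / NULLIF(s.num_orders, 0), 0) AS avg_order_value_or_zero
FROM toy_customer c
LEFT JOIN toy_sales_data s ON c.customerkey = s.customerkey
ORDER BY c.customerkey;
```

Customer 1 gets 50 in both average columns. Customer 2 gets `NULL` in `avg_order_value` and `0` in `avg_order_value_or_zero`. Keeping `NULL` preserves the distinction between "no orders" and "orders worth nothing", so which form to use depends on the report.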

3. In the `sales_data` CTE, find the cohort year by extracting the year from the minimum `orderdate`, then select `cohort_year` in the main query.
   - Adds `EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year` to determine the first purchase year per customer.  
   - Groups by `customerkey` to calculate `cohort_year`, `net_revenue`, and `num_orders`.  
   - Performs a `LEFT JOIN` on `customer` to retain all customers.  
   - Selects `cohort_year` in the main query to include each customer's first purchase year.  
   - Uses `NULLIF(s.num_orders, 0)` in `avg_order_value` calculation to prevent division by zero.

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, -- Extract cohort year
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year, -- Added: cohort year comes from the sales_data CTE
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  -- Prevents division by zero
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_handling_nulls_8.png" alt="Query Results 8" style="width: 90%; height: auto;">
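
The cohort calculation can be checked in isolation on a few made-up rows. This sketch assumes PostgreSQL-style `VALUES` syntax; `toy_sales` and its dates are illustrative only.

```sql
WITH toy_sales (customerkey, orderdate) AS (
    VALUES
        (1, DATE '2021-11-03'),
        (1, DATE '2023-02-14'),  -- later order, does not change the cohort
        (2, DATE '2022-06-30')
)
SELECT
    customerkey,
    MIN(orderdate)                    AS first_order_date,
    EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year  -- year of the earliest order
FROM toy_sales
GROUP BY customerkey
ORDER BY customerkey;
```

Customer 1 lands in the 2021 cohort despite ordering again in 2023, and customer 2 lands in 2022. In the full query, a customer with no sales has no row in `sales_data`, so their `cohort_year` is `NULL` after the `LEFT JOIN`.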