<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/1_Handling_Nulls.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Handling Nulls

## Overview

### 🥅 Analysis Goals

Understand customer-level revenue, cohort classification, and order behavior to assess customer value and purchasing patterns.  

- **Clean Up Customer Names:** Standardize customer names by replacing null values with 'Unknown' to ensure data consistency and prevent missing values from impacting analysis.  
- **Clean Customer’s Net Revenue:** Calculate total net revenue for each customer and join it with customer data to understand individual spending behavior and revenue contribution.  
- **Cohort Customer Value:** Determine each customer’s acquisition cohort, total revenue, number of orders, and average order value to analyze customer lifetime value and purchasing trends.

### 📘 Concepts Covered

- `COALESCE`
- `NULLIF`

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## COALESCE

### 📝 Notes

**`COALESCE()`**

- **COALESCE**: Returns the first non-null value from a list of expressions.

- Syntax:

  ```sql
  SELECT COALESCE(expression1, expression2, ..., default_value);
  ```

- Used to replace `NULL` values with a default. Common in reporting and data cleaning, such as filling missing values with a placeholder.
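
To see the "first non-null value" rule in isolation, here is a minimal sketch that runs `COALESCE` against an in-memory SQLite database from Python. SQLite is only a stand-in so the snippet runs without the course database (the notebook itself uses PostgreSQL), and the variable names are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used only for illustration (the notebook uses PostgreSQL).
conn = sqlite3.connect(":memory:")

# COALESCE checks its arguments left to right and returns the first non-NULL one.
coalesce_row = conn.execute(
    """
    SELECT
        COALESCE(NULL, NULL, 'fallback') AS first_non_null,  -- skips both NULLs
        COALESCE('kept', 'fallback')     AS already_set,     -- first value is not NULL
        COALESCE(NULL, NULL)             AS all_null         -- nothing to fall back on
    """
).fetchone()

print(coalesce_row)  # ('fallback', 'kept', None)
conn.close()
```

When every argument is `NULL`, the result is still `NULL`, which is why a literal default is usually placed last.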

### 💻 Final Result

- Standardizes customer names by replacing null values with 'Unknown' to ensure consistency and avoid missing data issues in analysis.
- Calculates each customer's total net revenue and joins it with customer data to understand spending behavior at an individual level.

#### Clean Customer Names

**`COALESCE`**

1. Use `COALESCE` on each customer's `givenname` and `surname`.
    - Selects `customerkey` to keep customer identification.  
    - Uses `COALESCE(givenname, 'Unknown')` to replace `NULL` values in `givenname`.  
    - Uses `COALESCE(surname, 'Unknown')` to replace `NULL` values in `surname`.

In [2]:
%%sql

SELECT
    customerkey,
    COALESCE(givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(surname, 'Unknown') AS cleaned_surname
FROM customer;

Unnamed: 0,customerkey,cleaned_givenname,cleaned_surname
0,15,Julian,McGuigan
1,23,Rose,Dash
2,36,Annabelle,Townsend
3,120,Jamie,Hetherington
4,180,Gabriel,Bosanquet
...,...,...,...
104985,2099639,Miroslav,Slach
104986,2099656,Wilfredo,Lozada
104987,2099697,Phillipp,Maier
104988,2099711,Katerina,Pavlícková


#### Cleaned Customer's Net Revenue

**`COALESCE`**

1. Write a query that returns the total net revenue for each customer.
   - Selects `customerkey` to group revenue calculations by customer.  
   - Calculates `net_revenue` using `SUM(quantity * netprice * exchangerate)`.  
   - Uses `GROUP BY customerkey` to aggregate revenue per customer.  

In [3]:
%%sql

SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    customerkey;

Unnamed: 0,customerkey,net_revenue
0,2044589,2470.73
1,1603477,136.62
2,876049,2601.13
3,1469222,5278.54
4,2089398,98.39
...,...,...
49482,853617,903.31
49483,1573639,6973.42
49484,1355936,149.99
49485,967453,5.40


2. Put the query into a CTE (`sales_data`), then `LEFT JOIN` this CTE onto the customer table to return every customer's cleaned name and their net revenue. 
   - Defines `sales_data` as a CTE that calculates `net_revenue` per customer.  
   - In the main query:
        - 🔔 Performs a `LEFT JOIN` on `customer` to retain all customers, even those without sales.  
        - 🔔 Uses `COALESCE(c.givenname, 'Unknown')` and `COALESCE(c.surname, 'Unknown')` to replace missing names.  
        - 🔔 Uses `COALESCE(s.net_revenue, 0)` to ensure customers without sales show `0` revenue instead of `NULL`.

In [4]:
%%sql

-- Put query into a CTE
WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cleaned_givenname,cleaned_surname,net_revenue
0,15,Julian,McGuigan,2217.41
1,23,Rose,Dash,0.00
2,36,Annabelle,Townsend,0.00
3,120,Jamie,Hetherington,0.00
4,180,Gabriel,Bosanquet,2510.22
...,...,...,...,...
104985,2099639,Miroslav,Slach,0.00
104986,2099656,Wilfredo,Lozada,10404.68
104987,2099697,Phillipp,Maier,38.20
104988,2099711,Katerina,Pavlícková,6008.67
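
The same pattern can be checked by hand on a tiny dataset. The sketch below rebuilds the query against an in-memory SQLite database with three made-up customers; the rows are hypothetical and SQLite is only a stand-in for the PostgreSQL database used in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: customer 2 has no given name and no sales,
# customer 3 has no surname.
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER, givenname TEXT, surname TEXT);
    CREATE TABLE sales (customerkey INTEGER, quantity INTEGER, netprice REAL, exchangerate REAL);

    INSERT INTO customer VALUES (1, 'Ada', 'Lovelace'), (2, NULL, 'Hopper'), (3, 'Alan', NULL);
    INSERT INTO sales VALUES (1, 2, 10.0, 1.0), (1, 1, 5.0, 2.0), (3, 4, 2.5, 1.0);
    """
)

query = """
WITH sales_data AS (
    SELECT
        customerkey,
        SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY customerkey
)
SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown')   AS cleaned_surname,
    COALESCE(s.net_revenue, 0)       AS net_revenue
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey
ORDER BY c.customerkey
"""

toy_rows = conn.execute(query).fetchall()
for row in toy_rows:
    print(row)
conn.close()
```

Customer 2 has no matching row in `sales_data`, so the `LEFT JOIN` produces a `NULL` revenue that `COALESCE` turns into `0`, while the missing names become `'Unknown'`.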


---
## NULLIF

### 📝 Notes

**`NULLIF`**

- **NULLIF**: Returns `NULL` if two expressions are equal; otherwise, returns the first expression.

- Syntax:

  ```sql
  SELECT NULLIF(expression1, expression2);
  ```

- Helps prevent division by zero by returning `NULL` instead of causing an error.
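
A minimal sketch of `NULLIF` on literal values, again using an in-memory SQLite database from Python as a stand-in. One caveat: SQLite returns `NULL` for a division by zero on its own, whereas PostgreSQL raises a `division by zero` error, so the guard matters more in the notebook's database than in this toy setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

nullif_row = conn.execute(
    """
    SELECT
        NULLIF(5, 0)         AS unchanged,      -- 5 <> 0, so the first argument is returned
        NULLIF(0, 0)         AS nulled,         -- 0 = 0, so the result is NULL
        100.0 / NULLIF(4, 0) AS safe_division,  -- ordinary division
        100.0 / NULLIF(0, 0) AS null_result     -- dividing by NULL yields NULL
    """
).fetchone()

print(nullif_row)  # (5, None, 25.0, None)
conn.close()
```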

### 💻 Final Result

- Determines each customer's acquisition cohort and calculates their total revenue, number of orders, and average order value to analyze customer lifetime value and purchasing patterns.

#### Cohort Customer Value

**`NULLIF`**

1. Get the total number of orders by customer in the `sales_data` CTE and use `COALESCE` on `num_orders` in the main query. 
   - Defines `sales_data` as a CTE that calculates `net_revenue` and `num_orders` per customer.  
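        - 🔔 Uses `COUNT(orderkey)` to count each customer's orders. This counts every `sales` row with a non-null `orderkey`, so if one order can span several rows, `COUNT(DISTINCT orderkey)` would be needed to count unique orders.  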
   - In the main query:
        - Performs a `LEFT JOIN` on `customer` to retain all customers, even those without orders.  
        - Uses `COALESCE(s.num_orders, 0)` to replace `NULL` values with `0` for customers without orders.  

In [5]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS num_orders -- Added
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cleaned_givenname,cleaned_surname,net_revenue,num_orders
0,15,Julian,McGuigan,2217.41,1
1,23,Rose,Dash,0.00,0
2,36,Annabelle,Townsend,0.00,0
3,120,Jamie,Hetherington,0.00,0
4,180,Gabriel,Bosanquet,2510.22,3
...,...,...,...,...,...
104985,2099639,Miroslav,Slach,0.00,0
104986,2099656,Wilfredo,Lozada,10404.68,13
104987,2099697,Phillipp,Maier,38.20,3
104988,2099711,Katerina,Pavlícková,6008.67,2


2. Return the average order value for each customer by calculating `net_revenue / num_orders`, wrapping `num_orders` in `NULLIF` to prevent division by zero.
   - Defines `sales_data` as a CTE that calculates `net_revenue` and `num_orders` per customer.  
   - In the main query:
        - Uses `COALESCE(s.num_orders, 0)` to display `0` for customers without orders.  
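        - 🔔 Calculates `avg_order_value` using `s.net_revenue / NULLIF(s.num_orders, 0)`. `NULLIF` turns a `0` count into `NULL`, so the division returns `NULL` instead of raising a division-by-zero error. Customers with no sales rows already have a `NULL` `num_orders` from the `LEFT JOIN`, so their `avg_order_value` is `NULL` as well.  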

In [6]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  -- Added: Prevents division by zero
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cleaned_givenname,cleaned_surname,net_revenue,total_orders,avg_order_value
0,15,Julian,McGuigan,2217.41,1,2217.41
1,23,Rose,Dash,0.00,0,
2,36,Annabelle,Townsend,0.00,0,
3,120,Jamie,Hetherington,0.00,0,
4,180,Gabriel,Bosanquet,2510.22,3,836.74
...,...,...,...,...,...,...
104985,2099639,Miroslav,Slach,0.00,0,
104986,2099656,Wilfredo,Lozada,10404.68,13,800.36
104987,2099697,Phillipp,Maier,38.20,3,12.73
104988,2099711,Katerina,Pavlícková,6008.67,2,3004.34


3. In the `sales_data` CTE, find the cohort year by extracting the year from the minimum `orderdate`, then select `cohort_year` in the main query.
   - Defines `sales_data` as a CTE that calculates `cohort_year`, `net_revenue`, and `num_orders` per customer.
        - Groups by `customerkey` so each aggregate is computed per customer.  
        - 🔔 Adds `EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year` to determine the first purchase year per customer.  
   - In the main query:
        - Performs a `LEFT JOIN` on `customer` to retain all customers.  
        - 🔔 Selects `cohort_year` to include each customer's first purchase year (`NULL` for customers without sales).  
        - Uses `NULLIF(s.num_orders, 0)` in the `avg_order_value` calculation to prevent division by zero.

In [8]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, -- Extract cohort year
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    COALESCE(c.givenname, 'Unknown') AS cleaned_givenname,
    COALESCE(c.surname, 'Unknown') AS cleaned_surname,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  -- Prevents division by zero
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cohort_year,cleaned_givenname,cleaned_surname,net_revenue,total_orders,avg_order_value
0,15,2021,Julian,McGuigan,2217.41,1,2217.41
1,23,,Rose,Dash,0.00,0,
2,36,,Annabelle,Townsend,0.00,0,
3,120,,Jamie,Hetherington,0.00,0,
4,180,2018,Gabriel,Bosanquet,2510.22,3,836.74
...,...,...,...,...,...,...,...
104985,2099639,,Miroslav,Slach,0.00,0,
104986,2099656,2023,Wilfredo,Lozada,10404.68,13,800.36
104987,2099697,2022,Phillipp,Maier,38.20,3,12.73
104988,2099711,2016,Katerina,Pavlícková,6008.67,2,3004.34
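
To make the cohort logic easy to verify by hand, here is a rough sketch on two made-up customers, run in an in-memory SQLite database. Several things are assumptions for illustration only: the toy rows, a single `revenue` column standing in for `quantity * netprice * exchangerate`, and `strftime('%Y', ...)` standing in for `EXTRACT(YEAR FROM ...)`, which SQLite does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: customer 1 has two orders, customer 2 has none.
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER);
    CREATE TABLE sales (customerkey INTEGER, orderkey INTEGER, orderdate TEXT, revenue REAL);

    INSERT INTO customer VALUES (1), (2);
    INSERT INTO sales VALUES
        (1, 100, '2019-06-01', 40.0),
        (1, 101, '2021-02-15', 60.0);
    """
)

cohort_rows = conn.execute(
    """
    WITH sales_data AS (
        SELECT
            customerkey,
            -- SQLite stand-in for EXTRACT(YEAR FROM MIN(orderdate))
            CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
            SUM(revenue)    AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY customerkey
    )
    SELECT
        c.customerkey,
        s.cohort_year,
        COALESCE(s.net_revenue, 0) AS net_revenue,
        COALESCE(s.num_orders, 0)  AS total_orders,
        s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
    FROM customer c
    LEFT JOIN sales_data s ON c.customerkey = s.customerkey
    ORDER BY c.customerkey
    """
).fetchall()

for row in cohort_rows:
    print(row)
conn.close()
```

Customer 1 lands in the 2019 cohort with an average order value of 50, while customer 2 keeps `NULL` for both `cohort_year` and `avg_order_value` because there is no first purchase to measure from.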


### 💡 What's the difference between `COALESCE` and `NULLIF`?

- `NULLIF(expr1, expr2)`: Returns NULL if `expr1 = expr2`, otherwise returns `expr1` (used to nullify specific values).  
- `COALESCE(expr1, expr2, ...)`: Returns the first non-NULL value from a list (used to replace NULLs with defaults).  
- **Difference:** `NULLIF` creates NULLs, while `COALESCE` replaces NULLs.
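
Because one function creates NULLs and the other replaces them, the two are often chained. The sketch below, run in an in-memory SQLite database with a hypothetical `people` table, treats empty strings as missing by first converting them to `NULL` with `NULLIF` and then filling every `NULL` with `COALESCE`. Whether empty names actually occur in the course data is not something this notebook shows, so treat this purely as an illustration of the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with a real name, an empty string, and a NULL.
conn.execute("CREATE TABLE people (givenname TEXT)")
conn.executemany("INSERT INTO people VALUES (?)", [("Rose",), ("",), (None,)])

combined = conn.execute(
    """
    SELECT COALESCE(NULLIF(givenname, ''), 'Unknown') AS cleaned_givenname
    FROM people
    ORDER BY rowid
    """
).fetchall()

print(combined)  # [('Rose',), ('Unknown',), ('Unknown',)]
conn.close()
```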