<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/2_String_Formatting.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# String Formatting

## Overview

### 🥅 Analysis Goals

Standardize customer name formatting and analyze daily revenue trends to ensure data consistency, improve reporting accuracy, and gain insights into customer behavior.  

- **Standardized Full Name & Revenue Analysis:** Concatenates first and last names for consistency and readability while calculating customer average order value.  

- **Uppercase Name Formatting & Revenue Analysis:** Converts names to uppercase for uniformity in reporting and case-insensitive operations while ensuring revenue analysis maintains standardized uppercase names for consistent data processing.  

- **Lowercase Name Formatting & Revenue Analysis:** Converts names to lowercase for datasets requiring a uniform format while analyzing customer average order value.  

- **Trimmed Name Formatting & Project Preparation:** Removes leading and trailing spaces to clean up inconsistencies in name fields while preparing data for the final project.  

### 📘 Concepts Covered

- `CONCAT`
- `UPPER`
- `LOWER`
- `TRIM`

In [3]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas number to two decimal places
pd.options.display.float_format = '{:.2f}'.format



---
## CONCAT

### 📝 Notes

**`CONCAT`**

- **CONCAT**: Combines two or more strings into a single string.

- Syntax:

  ```sql
  SELECT CONCAT(string1, string2, ...);
  ```

- In PostgreSQL, `CONCAT` ignores `NULL` arguments (treating them as empty strings), so one `NULL` input does not make the whole result `NULL`, unlike the `||` operator.
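
To make the `NULL` behavior concrete, here is a minimal Python sketch that models how PostgreSQL's `CONCAT` treats `NULL` (represented as `None`) compared with the `||` operator. The helper names `pg_concat` and `pg_pipe_concat` are hypothetical and exist only for this illustration; they are not part of any library.

```python
from typing import Optional


def pg_concat(*parts: Optional[object]) -> str:
    """Model of PostgreSQL CONCAT: NULL (None) arguments are skipped."""
    return "".join(str(p) for p in parts if p is not None)


def pg_pipe_concat(left: Optional[str], right: Optional[str]) -> Optional[str]:
    """Model of the || operator: a NULL on either side yields NULL."""
    if left is None or right is None:
        return None
    return left + right


full_name = pg_concat("Rose", " ", "Dash")      # both parts present
only_separator = pg_concat(None, " ", None)     # NULL names leave just the space
piped = pg_pipe_concat("Rose", None)            # NULL propagates with ||
```

One consequence worth keeping in mind: when both name parts are `NULL`, `CONCAT(givenname, ' ', surname)` still returns the separator space rather than `NULL`.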

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Full Name: Combined first and last name
  - Cohort Analysis: Customer grouping by start year
  - Customer Value: Revenue and order patterns
- **💡 Why It Matters**: Links customer identity to revenue patterns
    - Combines names for clear customer identification
    - Tracks revenue patterns by named customer
    - Connects purchase behavior to specific customers
    - Enables cohort analysis with clean customer data
- **🎯 Common Use Cases**: 
  - Customer revenue tracking
  - Named cohort analysis
  - Customer value assessment
- **📈 Related KPIs**: 
  - Customer lifetime value
  - Cohort revenue
  - Order frequency per customer

### 📈 Analysis

- Standardizes customer names by concatenating first and last names.  
- Analyzes customer purchase behavior by calculating their cohort year, total net revenue, and order frequency.  

#### Standardized Customer Name

**`CONCAT`**

1. Use `CONCAT` to combine the customer's `givenname` and `surname`.
    - Joins `givenname` and `surname` into a single string with a space in between for consistency.  
    - Standardizes name formatting to improve readability and ensure uniform data representation.  

In [4]:
%%sql

SELECT
    customerkey,
    CONCAT(givenname, ' ', surname) AS cleaned_name
FROM customer;

Unnamed: 0,customerkey,cleaned_name
0,15,Julian McGuigan
1,23,Rose Dash
2,36,Annabelle Townsend
3,120,Jamie Hetherington
4,180,Gabriel Bosanquet
...,...,...
104985,2099639,Miroslav Slach
104986,2099656,Wilfredo Lozada
104987,2099697,Phillipp Maier
104988,2099711,Katerina Pavlícková


#### Average Order Value with Customer Name

**`CONCAT`**

1. Using the final query from `Handling_Nulls`, add the customer's name by combining `givenname` and `surname`.
    - 🔔 Use `CONCAT(ca.givenname, ' ', ca.surname)` to combine the first and last names. Because the names come from the view, customers with no orders have no name here, and `cleaned_name` holds only the separator space.
    - Select `customerkey` (from the customer table), `cohort_year` (from the cohort analysis view), `num_orders`, and `total_net_revenue`.
    - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
    - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
    - Add `SUM` to `num_orders` and `total_net_revenue`.
    - 🔔 `GROUP BY` the `customerkey`, `cohort_year`, `givenname`, and `surname`.
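
The numeric part of these steps can be tried on a toy dataset. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, the two tables are simplified, and the values are made up. It shows why a customer with no orders ends up with `0` orders, `0` revenue, and a `NULL` average order value.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER PRIMARY KEY);
    CREATE TABLE cohort_analysis (
        customerkey INTEGER,
        num_orders INTEGER,
        total_net_revenue REAL
    );
    -- Hypothetical values: customer 23 has no rows in cohort_analysis.
    INSERT INTO customer VALUES (15), (23), (180);
    INSERT INTO cohort_analysis VALUES (15, 1, 100.0), (180, 1, 500.0), (180, 2, 2000.0);
""")

rows = conn.execute("""
    SELECT
        c.customerkey,
        SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
        SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
        SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value
    FROM customer c
    LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
    GROUP BY c.customerkey
    ORDER BY c.customerkey
""").fetchall()

for row in rows:
    print(row)
```

The `LEFT JOIN` keeps customer 23 even though it has no match, `COALESCE` turns its missing totals into `0`, and the average stays `NULL` (`None` in Python) because there is nothing to divide.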


In [5]:
%%sql

SELECT
    c.customerkey,
    CONCAT(ca.givenname, ' ', ca.surname) AS cleaned_name, -- Added
    ca.cohort_year,
    SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
    SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
    SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
GROUP BY 
    c.customerkey, 
    ca.cohort_year, 
    ca.givenname, -- Added
    ca.surname -- Added

Unnamed: 0,customerkey,cleaned_name,cohort_year,num_orders,total_net_revenue,avg_order_value
0,15,Julian McGuigan,2021,1,2217.41,2217.41
1,23,,,0,0.00,
2,36,,,0,0.00,
3,120,,,0,0.00,
4,180,Gabriel Bosanquet,2018,3,2510.22,836.74
...,...,...,...,...,...,...
104985,2099639,,,0,0.00,
104986,2099656,Wilfredo Lozada,2023,13,10404.68,800.36
104987,2099697,Phillipp Maier,2022,3,38.20,12.73
104988,2099711,Katerina Pavlícková,2016,2,6008.67,3004.34


<img src="../Resources/images/6.2_customer_avg_rev.png" alt="Customer Average Revenue" width="50%">

> ⚠️ **Chart Note**: This plots only 15 of our customers for better visualization.

---
## UPPER

### 📝 Notes

**`UPPER()`**

- **UPPER**: Converts a string to uppercase.

- Syntax:

  ```sql
  SELECT UPPER(string_column);
  ```

- Useful for standardizing text, such as converting names or codes to uppercase for comparison.
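
As a small illustration of the comparison use case, the sketch below applies `UPPER` to both sides of a filter so that differently cased spellings match. It is only a sketch: it runs against an in-memory SQLite database through Python's `sqlite3` module rather than PostgreSQL, and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customerkey INTEGER, surname TEXT)")
# Hypothetical rows: the same surname stored with inconsistent casing.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Dash"), (2, "DASH"), (3, "dash"), (4, "Maier")],
)

# Uppercasing both the column and the search term makes the match case-insensitive.
matches = conn.execute(
    "SELECT customerkey FROM customer WHERE UPPER(surname) = UPPER(?) ORDER BY customerkey",
    ("dash",),
).fetchall()

print(matches)
```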

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Name Standardization: Uppercase formatting
  - Revenue Attribution: Linking sales to customers
  - Customer Behavior: Purchase patterns over time
- **💡 Why It Matters**: Standardizes analysis of customer value
    - Standardize display formats
    - Accurately tracks revenue by customer
    - Enables reliable cohort comparisons
    - Maintains data consistency across revenue analysis
- **🎯 Common Use Cases**: 
  - Revenue reporting by customer
  - Standardized cohort analysis
  - Customer behavior tracking
- **📈 Related KPIs**: 
  - Net revenue per customer
  - Cohort performance
  - Customer order patterns

### 📈 Analysis

- Converts customer names to uppercase for uniformity.  
- Evaluate customer spending trends by calculating their cohort year, total net revenue, and order frequency.

#### Uppercase Name Formatting

**`UPPER`**

1. Use `UPPER` to uppercase `givenname`, `surname`, and the combined full name (built with `CONCAT`).
    - Applies `UPPER` to `givenname` and `surname` to ensure all characters are capitalized.  
    - Creates a **fully uppercase** `full_name` by applying `UPPER` within `CONCAT`.  

In [6]:
%%sql

SELECT
    customerkey,
    UPPER(givenname) AS uppercase_givenname,
    UPPER(surname) AS uppercase_surname,
    CONCAT(UPPER(givenname), ' ', UPPER(surname)) AS uppercase_full_name    
FROM customer;

Unnamed: 0,customerkey,uppercase_givenname,uppercase_surname,uppercase_full_name
0,15,JULIAN,MCGUIGAN,JULIAN MCGUIGAN
1,23,ROSE,DASH,ROSE DASH
2,36,ANNABELLE,TOWNSEND,ANNABELLE TOWNSEND
3,120,JAMIE,HETHERINGTON,JAMIE HETHERINGTON
4,180,GABRIEL,BOSANQUET,GABRIEL BOSANQUET
...,...,...,...,...
104985,2099639,MIROSLAV,SLACH,MIROSLAV SLACH
104986,2099656,WILFREDO,LOZADA,WILFREDO LOZADA
104987,2099697,PHILLIPP,MAIER,PHILLIPP MAIER
104988,2099711,KATERINA,PAVLíCKOVá,KATERINA PAVLíCKOVá
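
The last row of this output shows `PAVLíCKOVá`: the accented letters stayed lowercase. In PostgreSQL, the result of `UPPER` depends on the database's locale settings, so a likely explanation (an assumption, not something the notebook confirms) is that this database uses a locale such as `C`, where only ASCII letters are case-mapped. The sketch below models that difference in Python; `ascii_upper` is a hypothetical helper written for this illustration.

```python
def ascii_upper(text: str) -> str:
    """Uppercase only the ASCII letters a-z and leave every other character as-is."""
    return "".join(ch.upper() if "a" <= ch <= "z" else ch for ch in text)


# "Pavlícková", written with escapes so the accented letters are unambiguous.
name = "Pavl\u00edckov\u00e1"

ascii_only = ascii_upper(name)   # accented letters are left untouched
unicode_aware = name.upper()     # Python's str.upper maps the accented letters too

print(ascii_only, unicode_aware)
```

If uppercase names are used as join or comparison keys, it is worth checking how the target database handles non-ASCII characters before relying on them.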


#### Avg Order Value With Uppercase Names

**`UPPER`**

1. Using the query from `CONCAT`, convert the combined customer name to uppercase.
    - 🔔 Use `CONCAT(UPPER(givenname), ' ', UPPER(surname))` to ensure name consistency across reports.
    - Select `customerkey` (from the customer table), `cohort_year` (from the cohort analysis view), `num_orders`, and `total_net_revenue`.
    - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
    - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
    - Add `SUM` to `num_orders` and `total_net_revenue`.
    - `GROUP BY` the `customerkey`, `cohort_year`, `givenname`, and `surname`.

In [7]:
%%sql

SELECT
    c.customerkey,
    CONCAT(UPPER(ca.givenname), ' ', UPPER(ca.surname)) AS cleaned_name, -- Updated
    ca.cohort_year,
    SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
    SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
    SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
GROUP BY 
    c.customerkey, 
    ca.cohort_year, 
    ca.givenname,
    ca.surname

Unnamed: 0,customerkey,cleaned_name,cohort_year,num_orders,total_net_revenue,avg_order_value
0,15,JULIAN MCGUIGAN,2021,1,2217.41,2217.41
1,23,,,0,0.00,
2,36,,,0,0.00,
3,120,,,0,0.00,
4,180,GABRIEL BOSANQUET,2018,3,2510.22,836.74
...,...,...,...,...,...,...
104985,2099639,,,0,0.00,
104986,2099656,WILFREDO LOZADA,2023,13,10404.68,800.36
104987,2099697,PHILLIPP MAIER,2022,3,38.20,12.73
104988,2099711,KATERINA PAVLíCKOVá,2016,2,6008.67,3004.34


---
## LOWER

### 📝 Notes

**`LOWER()`**

- **LOWER**: Converts a string to lowercase.  
- Syntax:  
  ```sql
  SELECT LOWER(string_column);
  ```
- Useful for standardizing text, such as making email addresses or usernames case-insensitive for comparisons.
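
To illustrate the email use case, the sketch below counts distinct addresses before and after applying `LOWER`. Everything here is assumed for the example: the `contact` table and its `email` column are hypothetical (the course database is not queried), and SQLite via Python's `sqlite3` module stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (email TEXT)")
# Hypothetical rows: the first two differ only in letter case.
conn.executemany(
    "INSERT INTO contact VALUES (?)",
    [("Rose.Dash@example.com",), ("rose.dash@EXAMPLE.com",), ("jamie@example.com",)],
)

# Without normalization, the two spellings count as different values.
exact_count = conn.execute("SELECT COUNT(DISTINCT email) FROM contact").fetchone()[0]

# Lowercasing first collapses them into a single value.
normalized_count = conn.execute(
    "SELECT COUNT(DISTINCT LOWER(email)) FROM contact"
).fetchone()[0]

print(exact_count, normalized_count)
```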

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Customer Identification: Lowercase standardization
  - Purchase History: Order patterns over time
  - Revenue Tracking: Sales by customer
- **💡 Why It Matters**: Ensures consistent revenue analysis
    - Standardizes customer name matching
    - Enables accurate revenue attribution
    - Maintains cohort integrity
    - Links purchase patterns to unique customers
- **🎯 Common Use Cases**: 
  - Customer purchase analysis
  - Revenue pattern identification
  - Cohort tracking
- **📈 Related KPIs**: 
  - Customer revenue trends
  - Order frequency
  - Cohort value metrics

### 📈 Analysis

- Converts customer names to lowercase.  
- Assesses customer revenue by calculating their cohort year, total net revenue, and order frequency.  

#### Lowercase Name Formatting

**`LOWER`**

1. Use `LOWER` to lowercase `givenname`, `surname`, and the combined full name (built with `CONCAT`).
    - Applies `LOWER` to both `givenname` and `surname` to enforce lowercase formatting.  
    - Concatenates `LOWER(givenname)` and `LOWER(surname)` into a fully lowercase `full_name`.  

In [8]:
%%sql

SELECT
    customerkey,
    LOWER(givenname) AS lowercase_givenname,
    LOWER(surname) AS lowercase_surname,
    CONCAT(LOWER(givenname), ' ', LOWER(surname)) AS lowercase_full_name    
FROM customer;

Unnamed: 0,customerkey,lowercase_givenname,lowercase_surname,lowercase_full_name
0,15,julian,mcguigan,julian mcguigan
1,23,rose,dash,rose dash
2,36,annabelle,townsend,annabelle townsend
3,120,jamie,hetherington,jamie hetherington
4,180,gabriel,bosanquet,gabriel bosanquet
...,...,...,...,...
104985,2099639,miroslav,slach,miroslav slach
104986,2099656,wilfredo,lozada,wilfredo lozada
104987,2099697,phillipp,maier,phillipp maier
104988,2099711,katerina,pavlícková,katerina pavlícková


#### Avg Order Value With Lowercase Names

**`LOWER`**


1. Using the query from `UPPER`, convert the combined customer name to lowercase.
    - 🔔 Use `CONCAT(LOWER(givenname), ' ', LOWER(surname))` to maintain a fully lowercase name format.
    - Select `customerkey` (from the customer table), `cohort_year` (from the cohort analysis view), `num_orders`, and `total_net_revenue`.
    - Use `LEFT JOIN` to join the cohort analysis view onto the customer table.
    - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
    - Add `SUM` to `num_orders` and `total_net_revenue`.
    - `GROUP BY` the `customerkey`, `cohort_year`, `givenname`, and `surname`.

In [9]:
%%sql

SELECT
    c.customerkey,
    CONCAT(LOWER(ca.givenname), ' ', LOWER(ca.surname)) AS cleaned_name, -- Updated
    ca.cohort_year,
    SUM(COALESCE(ca.num_orders, 0)) AS num_orders,
    SUM(COALESCE(ca.total_net_revenue, 0)) AS total_net_revenue,
    SUM(ca.total_net_revenue) / SUM(NULLIF(ca.num_orders, 0)) AS avg_order_value
FROM customer c
LEFT JOIN cohort_analysis ca ON c.customerkey = ca.customerkey
GROUP BY 
    c.customerkey, 
    ca.cohort_year, 
    ca.givenname,
    ca.surname

Unnamed: 0,customerkey,cleaned_name,cohort_year,num_orders,total_net_revenue,avg_order_value
0,15,julian mcguigan,2021,1,2217.41,2217.41
1,23,,,0,0.00,
2,36,,,0,0.00,
3,120,,,0,0.00,
4,180,gabriel bosanquet,2018,3,2510.22,836.74
...,...,...,...,...,...,...
104985,2099639,,,0,0.00,
104986,2099656,wilfredo lozada,2023,13,10404.68,800.36
104987,2099697,phillipp maier,2022,3,38.20,12.73
104988,2099711,katerina pavlícková,2016,2,6008.67,3004.34


---
## TRIM

### 📝 Notes

**`TRIM()`**

- **TRIM**: Removes leading and/or trailing spaces (or specified characters) from a string.  
- Syntax:  
  ```sql
  SELECT TRIM([BOTH | LEADING | TRAILING] 'characters' FROM string_column);
  ```
- Default behavior removes spaces from both ends of a string.  
- Useful for cleaning up user input, formatting text, or ensuring consistent comparisons.  
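
The three `TRIM` modes can be modeled in a few lines of Python. This is a sketch of the semantics, not PostgreSQL itself: `sql_trim` is a hypothetical helper, and it relies on Python's `strip` family, which also removes any run of the listed characters from the chosen end(s).

```python
from typing import Optional


def sql_trim(text: Optional[str], characters: str = " ", where: str = "BOTH") -> Optional[str]:
    """Model of TRIM([BOTH | LEADING | TRAILING] characters FROM text)."""
    if text is None:
        return None  # NULL in, NULL out
    if where == "LEADING":
        return text.lstrip(characters)
    if where == "TRAILING":
        return text.rstrip(characters)
    if where == "BOTH":
        return text.strip(characters)
    raise ValueError(f"unknown trim mode: {where}")


both_ends = sql_trim("  Rose  ")                      # default: spaces, both ends
leading_only = sql_trim("xxRosexx", "x", "LEADING")   # only the left side is cleaned
trailing_only = sql_trim("xxRosexx", "x", "TRAILING") # only the right side is cleaned
tab_kept = sql_trim("\tRose ")                        # the default removes spaces, not tabs
```

Note the last case: the default only strips space characters, so tabs or newlines in user input need to be listed explicitly.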

### 🔑 Key Concepts
- **📊 Business Terms**: 
  - Data Accuracy: Clean customer names
  - Revenue Mapping: Linking sales to customers
  - Customer Analysis: Purchase behavior study
- **💡 Why It Matters**: Ensures accurate customer revenue analysis
    - Prevents duplicate customer records
    - Enables precise revenue attribution
    - Maintains clean cohort groupings
    - Improves customer behavior analysis
- **🎯 Common Use Cases**: 
  - Clean revenue reporting
  - Accurate cohort analysis
  - Customer behavior tracking
- **📈 Related KPIs**: 
  - Revenue accuracy
  - Customer uniqueness
  - Cohort integrity

### 📈 Analysis

- Removes leading and trailing spaces from names.  
- Ensures accurate customer data for our project.

#### Trimmed Name Formatting

**`TRIM`**

1. Use `TRIM` to clean up `givenname` and `surname`, then combine them into a full name.
    - Uses `TRIM(givenname)` and `TRIM(surname)` to remove unnecessary leading and trailing spaces.  
    - Prevents inconsistencies caused by extra spaces in user-entered data.  
    - Creates a `full_name` by concatenating trimmed names to ensure clean formatting.  
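
The course data shown here looks the same before and after trimming, so the effect is easier to see on deliberately messy input. The sketch below is an assumption-laden illustration: the row is invented, SQLite (through Python's `sqlite3` module) stands in for PostgreSQL, and `||` is used in place of `CONCAT` because no value is `NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (givenname TEXT, surname TEXT)")
# Hypothetical user-entered values with stray spaces around them.
conn.execute("INSERT INTO customer VALUES (' Rose', 'Dash  ')")

raw_name, trimmed_name = conn.execute("""
    SELECT
        givenname || ' ' || surname AS raw_full_name,
        TRIM(givenname) || ' ' || TRIM(surname) AS trimmed_full_name
    FROM customer
""").fetchone()

print(repr(raw_name), repr(trimmed_name))
```

The untrimmed name would not compare equal to `'Rose Dash'`, which is how stray spaces end up splitting one customer into several groups.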

In [10]:
%%sql

SELECT
    customerkey,
    TRIM(givenname) AS trimmed_givenname,
    TRIM(surname) AS trimmed_surname,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS trimmed_full_name  
FROM customer;

Unnamed: 0,customerkey,trimmed_givenname,trimmed_surname,trimmed_full_name
0,15,Julian,McGuigan,Julian McGuigan
1,23,Rose,Dash,Rose Dash
2,36,Annabelle,Townsend,Annabelle Townsend
3,120,Jamie,Hetherington,Jamie Hetherington
4,180,Gabriel,Bosanquet,Gabriel Bosanquet
...,...,...,...,...
104985,2099639,Miroslav,Slach,Miroslav Slach
104986,2099656,Wilfredo,Lozada,Wilfredo Lozada
104987,2099697,Phillipp,Maier,Phillipp Maier
104988,2099711,Katerina,Pavlícková,Katerina Pavlícková


#### Create a View of the Cleaned Data

**`TRIM`**

1. Take the query used to create the `cohort_analysis` view (in `1_View_Intro.ipynb`) and split it into two CTEs: its inner query becomes `customer_revenue`, and its main query becomes `cohort_data`.
    - Use a CTE (`customer_revenue`) to preprocess customer revenue metrics. 
    - Define a CTE (`cohort_data`) to get the `first_purchase_date` and `cohort_year`.
    - In the main query: 
         - 🔔 Use `CONCAT(TRIM(givenname), ' ', TRIM(surname))` to ensure a clean, trimmed name format.  
         - Select `customerkey`, `cohort_year`, `givenname`, `surname`, `num_orders`, and `total_net_revenue`.  
         - Use `LEFT JOIN` to join the `cohort_data` CTE onto the customer table.
         - Use `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
         - Add `SUM` to `num_orders` and `total_net_revenue`.
         - `GROUP BY` the `customerkey`, `cohort_year`, `givenname`, and `surname`.
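
What the `cohort_data` CTE computes with `MIN(orderdate) OVER (PARTITION BY customerkey)` can be sketched in plain Python: every row keeps its own `orderdate` but also receives the earliest order date of its customer. The order dates below are taken from the outputs in this notebook; the variable names are illustrative only.

```python
from datetime import date

# (customerkey, orderdate) pairs, one per customer per order date.
orders = [
    (15, date(2021, 3, 8)),
    (180, date(2018, 7, 28)),
    (180, date(2023, 8, 28)),
]

# Earliest order date per customer: the "partition" minimum.
first_purchase = {}
for customerkey, orderdate in orders:
    first_purchase[customerkey] = min(orderdate, first_purchase.get(customerkey, orderdate))

# Like a window function, keep every input row and attach the per-customer values.
cohort_rows = [
    (customerkey, orderdate, first_purchase[customerkey], first_purchase[customerkey].year)
    for customerkey, orderdate in orders
]

for row in cohort_rows:
    print(row)
```

Unlike `GROUP BY`, the window function does not collapse rows, which is why customer 180 still appears twice with the same `cohort_year`.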

In [11]:
%%sql

WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    c.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    SUM(COALESCE(cd.num_orders, 0)) AS total_orders,
    SUM(COALESCE(cd.total_net_revenue, 0)) AS total_net_revenue,
    SUM(COALESCE(cd.total_net_revenue, 0)) / NULLIF(SUM(COALESCE(cd.num_orders, 0)), 0) AS avg_order_value  -- Added: Prevents division by zero
FROM customer c
LEFT JOIN cohort_data cd ON c.customerkey = cd.customerkey
GROUP BY c.customerkey, cd.cohort_year, cd.givenname, cd.surname
;

Unnamed: 0,customerkey,cohort_year,cleaned_name,total_orders,total_net_revenue,avg_order_value
0,15,2021,Julian McGuigan,1,2217.41,2217.41
1,23,,,0,0.00,
2,36,,,0,0.00,
3,120,,,0,0.00,
4,180,2018,Gabriel Bosanquet,3,2510.22,836.74
...,...,...,...,...,...,...
104985,2099639,,,0,0.00,
104986,2099656,2023,Wilfredo Lozada,13,10404.68,800.36
104987,2099697,2022,Phillipp Maier,3,38.20,12.73
104988,2099711,2016,Katerina Pavlícková,2,6008.67,3004.34


#### 💡 Why aren't we just using the `cohort_analysis` view?

Why are we pasting in our `cohort_data` CTE instead of just calling the `cohort_analysis` view in the main query? 

Three reasons:

1. 🚀 **Performance Overhead** – Stacking views increases query complexity, which can slow execution and make optimization harder.  
2. 🛠️ **Maintainability Issues** – Debugging becomes difficult, and changes in one view can break dependent views.  
3. ✅ **Better Alternatives** – Use materialized views, flatten queries, or CTEs for efficiency and clarity.

2. Remove `SUM` from `total_orders` and `total_net_revenue`, drop `avg_order_value`, and select only from `cohort_data`.
    - Use a CTE (`customer_revenue`) to preprocess customer revenue metrics. 
    - Define a CTE (`cohort_data`) to get the `first_purchase_date` and `cohort_year`.
    - In the main query: 
        - Select `customerkey`, `cohort_year`, `num_orders`, `total_net_revenue`.
        - Uses `CONCAT(TRIM(givenname), ' ', TRIM(surname))` to ensure a clean, trimmed name format.
        - 🔔 Remove `LEFT JOIN` to only get data from `cohort_data`. 

In [13]:
%%sql

WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, -- Updated to use cohort_data
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS total_orders, -- Updated to remove SUM
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue -- Updated to remove SUM
FROM cohort_data cd
;

Unnamed: 0,customerkey,cohort_year,cleaned_name,total_orders,total_net_revenue
0,15,2021,Julian McGuigan,1,2217.41
1,180,2018,Gabriel Bosanquet,1,525.31
2,180,2018,Gabriel Bosanquet,2,1984.90
3,185,2019,Gabrielle Castella,1,1395.52
4,243,2016,Maya Atherton,1,287.67
...,...,...,...,...,...
83094,2099697,2022,Phillipp Maier,3,38.20
83095,2099711,2016,Katerina Pavlícková,1,2067.75
83096,2099711,2016,Katerina Pavlícková,1,3940.92
83097,2099743,2022,Luciana Almonte,2,469.62


3. Add in `countryfull`, `age`, `first_purchase_date`, and `orderdate`.
    - Use a CTE (`customer_revenue`) to preprocess customer revenue metrics. 
    - Define a CTE (`cohort_data`) to get the `first_purchase_date` and `cohort_year`.
    - In the main query: 
        - Select `customerkey`, `cohort_year`, `num_orders`, `total_net_revenue`.
        - Uses `CONCAT(TRIM(givenname), ' ', TRIM(surname))` to ensure a clean, trimmed name format.
        - Add `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
        - 🔔 Add `countryfull`, `age`, `first_purchase_date`, and `orderdate` from `cohort_data`.

In [14]:
%%sql

WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull, -- Added
    cd.age, -- Added 
    cd.first_purchase_date, -- Added 
    cd.orderdate -- Added 
FROM cohort_data cd;

Unnamed: 0,customerkey,cohort_year,cleaned_name,num_orders,total_net_revenue,countryfull,age,first_purchase_date,orderdate
0,15,2021,Julian McGuigan,1,2217.41,Australia,55,2021-03-08,2021-03-08
1,180,2018,Gabriel Bosanquet,1,525.31,Australia,65,2018-07-28,2018-07-28
2,180,2018,Gabriel Bosanquet,2,1984.90,Australia,65,2018-07-28,2023-08-28
3,185,2019,Gabrielle Castella,1,1395.52,Australia,40,2019-06-01,2019-06-01
4,243,2016,Maya Atherton,1,287.67,Australia,66,2016-05-19,2016-05-19
...,...,...,...,...,...,...,...,...,...
83094,2099697,2022,Phillipp Maier,3,38.20,United States,54,2022-09-13,2022-09-13
83095,2099711,2016,Katerina Pavlícková,1,2067.75,United States,80,2016-08-13,2016-08-13
83096,2099711,2016,Katerina Pavlícková,1,3940.92,United States,80,2016-08-13,2017-08-14
83097,2099743,2022,Luciana Almonte,2,469.62,United States,21,2022-03-17,2022-03-17


4. Create a view of the cleaned data.
    - 🔔 Use `CREATE OR REPLACE VIEW` to create or replace a view of the cleaned data and name it `cleaned_customer`.
    - Use a CTE (`customer_revenue`) to preprocess customer revenue metrics. 
    - Define a CTE (`cohort_data`) to get the `first_purchase_date` and `cohort_year`.
    - In the main query: 
        - Select `customerkey`, `cohort_year`, `num_orders`, `total_net_revenue`, `countryfull`, `age`, `first_purchase_date`, and `orderdate` from `cohort_data`.
        - Uses `CONCAT(TRIM(givenname), ' ', TRIM(surname))` to ensure a clean, trimmed name format.
        - Uses `COALESCE` on `num_orders` and `total_net_revenue` to replace `NULL` values with `0`.
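
A view stores the query, not its results, so it reflects later changes to the underlying tables. The sketch below demonstrates this on a toy table. It is an illustration under stated assumptions: SQLite (through Python's `sqlite3` module) stands in for PostgreSQL, the view name `cleaned_customer_demo` and the padded names are made up, and because SQLite has no `CREATE OR REPLACE VIEW`, the sketch drops and recreates the view instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customerkey INTEGER, givenname TEXT, surname TEXT);
    INSERT INTO customer VALUES (15, ' Julian', 'McGuigan ');

    -- SQLite has no CREATE OR REPLACE VIEW, so drop and recreate instead.
    DROP VIEW IF EXISTS cleaned_customer_demo;
    CREATE VIEW cleaned_customer_demo AS
    SELECT
        customerkey,
        TRIM(givenname) || ' ' || TRIM(surname) AS cleaned_name
    FROM customer;
""")

query = "SELECT cleaned_name FROM cleaned_customer_demo ORDER BY customerkey"
before = conn.execute(query).fetchall()

# A row added after the view was created still shows up, already cleaned.
conn.execute("INSERT INTO customer VALUES (23, 'Rose ', ' Dash')")
after = conn.execute(query).fetchall()

print(before, after)
```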

In [15]:
%%sql

CREATE OR REPLACE VIEW cleaned_customer AS
WITH customer_revenue AS (
	SELECT
		s.customerkey,
		s.orderdate,
		SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
		COUNT(s.orderkey) AS num_orders,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
	FROM sales s 
	LEFT JOIN customer c ON c.customerkey = s.customerkey
	GROUP BY
		s.customerkey,
		s.orderdate,
		c.countryfull,
		c.age,
		c.givenname,
		c.surname
),

cohort_data AS (
	SELECT
		cr.*,
		MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey) AS first_purchase_date,
		EXTRACT(YEAR FROM MIN(cr.orderdate) OVER (PARTITION BY cr.customerkey)) AS cohort_year
	FROM customer_revenue cr 
)

SELECT
    cd.customerkey, 
    cd.cohort_year,
    CONCAT(TRIM(cd.givenname), ' ', TRIM(cd.surname)) AS cleaned_name, 
    COALESCE(cd.num_orders, 0) AS num_orders,
    COALESCE(cd.total_net_revenue, 0) AS total_net_revenue,
    cd.countryfull,
    cd.age,
    cd.first_purchase_date,
    cd.orderdate
FROM cohort_data cd;

Query the view: 

In [16]:
%%sql

SELECT * FROM cleaned_customer;

Unnamed: 0,customerkey,cohort_year,cleaned_name,num_orders,total_net_revenue,countryfull,age,first_purchase_date,orderdate
0,15,2021,Julian McGuigan,1,2217.41,Australia,55,2021-03-08,2021-03-08
1,180,2018,Gabriel Bosanquet,1,525.31,Australia,65,2018-07-28,2018-07-28
2,180,2018,Gabriel Bosanquet,2,1984.90,Australia,65,2018-07-28,2023-08-28
3,185,2019,Gabrielle Castella,1,1395.52,Australia,40,2019-06-01,2019-06-01
4,243,2016,Maya Atherton,1,287.67,Australia,66,2016-05-19,2016-05-19
...,...,...,...,...,...,...,...,...,...
83094,2099697,2022,Phillipp Maier,3,38.20,United States,54,2022-09-13,2022-09-13
83095,2099711,2016,Katerina Pavlícková,1,2067.75,United States,80,2016-08-13,2016-08-13
83096,2099711,2016,Katerina Pavlícková,1,3940.92,United States,80,2016-08-13,2017-08-14
83097,2099743,2022,Luciana Almonte,2,469.62,United States,21,2022-03-17,2022-03-17
