<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/2_String_Formatting.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# String Formatting

## Overview

### 🥅 Analysis Goals

Standardize customer name formatting and analyze revenue trends to ensure data consistency, improve reporting accuracy, and gain insights into customer behavior.  

- **Standardized Full Name & Revenue Analysis:** Concatenates first and last names for consistency and readability while calculating customer acquisition cohort, net revenue, and order frequency to assess customer value and purchasing trends.  

- **Uppercase Name Formatting & Revenue Analysis:** Converts names to uppercase for uniform reporting and case-insensitive comparisons, then reruns the revenue analysis with the uppercase names.  

- **Lowercase Name Formatting & Revenue Analysis:** Converts names to lowercase for datasets that require a uniform lowercase format, then reruns the revenue analysis with the lowercase names.  

- **Trimmed Name Formatting & Revenue Analysis:** Removes leading and trailing spaces from name fields, then reruns the revenue analysis with the trimmed names so that stray whitespace does not cause mismatches.  

### 📘 Concepts Covered

- `CONCAT`
- `UPPER`
- `LOWER`
- `TRIM`

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# If running in Google Colab, install PostgreSQL and restore the database
if 'google.colab' in sys.modules:
    # Install PostgreSQL
    !sudo apt-get install postgresql -qq > /dev/null 2>&1

    # Start PostgreSQL service (suppress output)
    !sudo service postgresql start > /dev/null 2>&1

    # Set password for the 'postgres' user to avoid authentication errors (suppress output)
    !sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'password';" > /dev/null 2>&1

    # Create the 'contoso_100k' database (suppress output)
    !sudo -u postgres psql -c "CREATE DATABASE contoso_100k;" > /dev/null 2>&1

    # Download the PostgreSQL .sql dump
    !wget -q -O contoso_100k.sql https://github.com/lukebarousse/Int_SQL_Data_Analytics_Course/releases/download/v.0.0.0/contoso_100k.sql

    # Restore the dump file into the PostgreSQL database (suppress output)
    !sudo -u postgres psql contoso_100k < contoso_100k.sql > /dev/null 2>&1

    # Shift libraries from ipython-sql to jupysql
    !pip uninstall -y ipython-sql > /dev/null 2>&1
    !pip install jupysql > /dev/null 2>&1

# Load the sql extension for SQL magic
%load_ext sql

# Connect to the PostgreSQL database
%sql postgresql://postgres:password@localhost:5432/contoso_100k

# Enable automatic conversion of SQL results to pandas DataFrames
%config SqlMagic.autopandas = True

# Disable named parameters for SQL magic
%config SqlMagic.named_parameters = "disabled"

# Display pandas floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format

---
## CONCAT

### 📝 Notes

**`CONCAT`**

- **CONCAT**: Combines two or more strings into a single string.

- Syntax:

  ```sql
  SELECT CONCAT(string1, string2, ...);
  ```

- Treats `NULL` arguments as empty strings, so one missing value does not turn the whole result into `NULL`.
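
The `NULL` handling noted above can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a portable stand-in for PostgreSQL, which is an assumption of this example: its `||` operator propagates `NULL` in the same way, while `pg_style_concat` is a hypothetical helper registered under the name `CONCAT` to mimic the NULL-skipping behaviour.

```python
import sqlite3


def pg_style_concat(*parts):
    """Hypothetical helper: join the arguments, treating NULL (None) as ''."""
    return "".join(str(p) for p in parts if p is not None)


conn = sqlite3.connect(":memory:")
# Register the helper as CONCAT; -1 means "any number of arguments".
conn.create_function("CONCAT", -1, pg_style_concat)

row = conn.execute(
    """
    SELECT
        'Rose' || ' ' || NULL     AS with_pipes,   -- NULL propagates
        CONCAT('Rose', ' ', NULL) AS with_concat   -- NULL is skipped
    """
).fetchone()

print(row)
```

The first value comes back as `None`, while the second keeps the text but leaves a trailing space, which is one reason to pair `CONCAT` with `TRIM` when a name part may be missing.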

### 💻 Final Result

- Standardizes customer names by concatenating first and last names for better readability and consistency across reports.  
- Analyzes customer purchase behavior by calculating their cohort year, total net revenue, and order frequency to understand long-term value.  

#### Standardized Customer Name

**`CONCAT`**

1. Use `CONCAT` to combine the customer's `givenname` and `surname`.
    - Joins `givenname` and `surname` into a single string with a space in between for consistency.  
    - Standardizes name formatting to improve readability and ensure uniform data representation.  

In [4]:
%%sql

SELECT
    customerkey,
    CONCAT(givenname, ' ', surname) AS cleaned_name
FROM customer;

Unnamed: 0,customerkey,cleaned_name
0,15,Julian McGuigan
1,23,Rose Dash
2,36,Annabelle Townsend
3,120,Jamie Hetherington
4,180,Gabriel Bosanquet
...,...,...
104985,2099639,Miroslav Slach
104986,2099656,Wilfredo Lozada
104987,2099697,Phillipp Maier
104988,2099711,Katerina Pavlícková


#### Customer Revenue And Cohort Analysis

**`CONCAT`**

1. Starting from the final query in `Handling_Nulls`, add a cleaned customer name by combining `givenname` and `surname`.
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - Uses `CONCAT(c.givenname, ' ', c.surname)` to combine the first and last names.  
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  
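
The steps above can be tried on a toy dataset. This is a minimal sketch that assumes SQLite (via Python's `sqlite3`) instead of PostgreSQL, so `strftime('%Y', ...)` replaces `EXTRACT(YEAR FROM ...)` and `||` replaces `CONCAT`. The two customers and their orders are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (customerkey INTEGER, givenname TEXT, surname TEXT);
    CREATE TABLE sales (
        orderkey INTEGER, customerkey INTEGER, orderdate TEXT,
        quantity INTEGER, netprice REAL, exchangerate REAL
    );
    -- Made-up rows: customer 1 has two orders, customer 2 has none.
    INSERT INTO customer VALUES (1, 'Ada', 'Example'), (2, 'Bob', 'Sample');
    INSERT INTO sales VALUES
        (100, 1, '2021-03-01', 2, 10.0, 1.0),
        (101, 1, '2022-07-15', 1, 30.0, 1.0);
    """
)

query = """
WITH sales_data AS (
    SELECT
        customerkey,
        -- SQLite has no EXTRACT; strftime plays the same role here.
        CAST(strftime('%Y', MIN(orderdate)) AS INTEGER) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS net_revenue,
        COUNT(orderkey) AS num_orders
    FROM sales
    GROUP BY customerkey
)
SELECT
    c.customerkey,
    s.cohort_year,
    c.givenname || ' ' || c.surname AS cleaned_name,
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey
ORDER BY c.customerkey;
"""

rows = conn.execute(query).fetchall()
for r in rows:
    print(r)
```

The customer without sales still appears because of the `LEFT JOIN`. Revenue and order count fall back to `0` through `COALESCE`, while the cohort year and average order value stay `NULL` (`None` in Python).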


In [5]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, 
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(c.givenname, ' ', c.surname) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cohort_year,cleaned_name,net_revenue,total_orders,avg_order_value
0,15,2021,Julian McGuigan,2217.41,1,2217.41
1,23,,Rose Dash,0.00,0,
2,36,,Annabelle Townsend,0.00,0,
3,120,,Jamie Hetherington,0.00,0,
4,180,2018,Gabriel Bosanquet,2510.22,3,836.74
...,...,...,...,...,...,...
104985,2099639,,Miroslav Slach,0.00,0,
104986,2099656,2023,Wilfredo Lozada,10404.68,13,800.36
104987,2099697,2022,Phillipp Maier,38.20,3,12.73
104988,2099711,2016,Katerina Pavlícková,6008.67,2,3004.34


---
## UPPER

### 📝 Notes

**`UPPER()`**

- **UPPER**: Converts a string to uppercase.

- Syntax:

  ```sql
  SELECT UPPER(string_column);
  ```

- Useful for standardizing text, such as converting names or codes to uppercase for comparison.
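
The sample output in this section shows `PAVLíCKOVá`, where the accented letters stay lowercase. One plausible explanation, which is an assumption and not confirmed by the notebook, is that the database's locale settings only map ASCII letters. The sketch below models that difference in plain Python. `ASCII_UPPER` is a hypothetical translation table, not part of any database API.

```python
import string

# Translation table that uppercases only the ASCII letters a-z.
ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)

surname = "Pavlícková"

ascii_only = surname.translate(ASCII_UPPER)  # accented letters are left as-is
unicode_aware = surname.upper()              # Python's Unicode-aware uppercasing

print(ascii_only)
print(unicode_aware)
```

If uppercase names are used as comparison keys, it is worth checking a few non-ASCII names first to see which behaviour the database exhibits.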

### 💻 Final Result

- Converts customer names to uppercase for uniformity when performing case-insensitive comparisons or standardizing display formats.  
- Evaluates customer spending trends while ensuring name formatting is consistent in uppercase for standardization in reports.  

#### Uppercase Name Formatting

**`UPPER`**

1. Use `UPPER` to convert `givenname`, `surname`, and the full name (built with `CONCAT`) to uppercase.
    - Applies `UPPER` to `givenname` and `surname` to ensure all characters are capitalized.  
    - Creates a **fully uppercase** `full_name` by applying `UPPER` within `CONCAT`.  

In [7]:
%%sql

SELECT
    customerkey,
    UPPER(givenname) AS uppercase_givenname,
    UPPER(surname) AS uppercase_surname,
    CONCAT(UPPER(givenname), ' ', UPPER(surname)) AS uppercase_full_name    
FROM customer;

Unnamed: 0,customerkey,uppercase_givenname,uppercase_surname,uppercase_full_name
0,15,JULIAN,MCGUIGAN,JULIAN MCGUIGAN
1,23,ROSE,DASH,ROSE DASH
2,36,ANNABELLE,TOWNSEND,ANNABELLE TOWNSEND
3,120,JAMIE,HETHERINGTON,JAMIE HETHERINGTON
4,180,GABRIEL,BOSANQUET,GABRIEL BOSANQUET
...,...,...,...,...
104985,2099639,MIROSLAV,SLACH,MIROSLAV SLACH
104986,2099656,WILFREDO,LOZADA,WILFREDO LOZADA
104987,2099697,PHILLIPP,MAIER,PHILLIPP MAIER
104988,2099711,KATERINA,PAVLíCKOVá,KATERINA PAVLíCKOVá


#### Revenue Analysis With Uppercase Names

**`UPPER`**

1. Starting from the query in the `CONCAT` section, convert both parts of the combined customer name to uppercase. 
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - 🔔 Uses `CONCAT(UPPER(givenname), ' ', UPPER(surname))` to ensure name consistency across reports.   
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [8]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, 
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(UPPER(c.givenname), ' ', UPPER(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cohort_year,cleaned_name,net_revenue,total_orders,avg_order_value
0,15,2021,JULIAN MCGUIGAN,2217.41,1,2217.41
1,23,,ROSE DASH,0.00,0,
2,36,,ANNABELLE TOWNSEND,0.00,0,
3,120,,JAMIE HETHERINGTON,0.00,0,
4,180,2018,GABRIEL BOSANQUET,2510.22,3,836.74
...,...,...,...,...,...,...
104985,2099639,,MIROSLAV SLACH,0.00,0,
104986,2099656,2023,WILFREDO LOZADA,10404.68,13,800.36
104987,2099697,2022,PHILLIPP MAIER,38.20,3,12.73
104988,2099711,2016,KATERINA PAVLíCKOVá,6008.67,2,3004.34


---
## LOWER

### 📝 Notes

**`LOWER()`**

- **LOWER**: Converts a string to lowercase.  
- Syntax:  
  ```sql
  SELECT LOWER(string_column);
  ```
- Useful for standardizing text, such as making email addresses or usernames case-insensitive for comparisons.
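
A case-insensitive comparison of this kind can be sketched as follows. The example assumes SQLite through Python's `sqlite3` module as a stand-in for PostgreSQL. The `users` table and its rows are hypothetical and are not part of the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table; the course's customer table is not used here.
    CREATE TABLE users (username TEXT);
    INSERT INTO users VALUES ('JSmith'), ('jsmith'), ('Jsmith'), ('mlee');
    """
)

search = "JSMITH"

# Plain equality is case-sensitive, so no stored spelling matches.
exact = conn.execute(
    "SELECT COUNT(*) FROM users WHERE username = ?", (search,)
).fetchone()[0]

# Lowercasing both sides makes every spelling of the name match.
normalized = conn.execute(
    "SELECT COUNT(*) FROM users WHERE LOWER(username) = LOWER(?)", (search,)
).fetchone()[0]

print(exact, normalized)
```

Applying `LOWER` to both sides of the comparison is the important part. Lowercasing only the column would still miss a search term typed in mixed case.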

### 💻 Final Result

- Converts customer names to lowercase to ensure consistency in case-sensitive operations or data exports requiring lowercase formatting.  
- Assesses customer revenue while maintaining lowercase formatting for uniformity in datasets where lowercase naming is preferred.  

#### Lowercase Name Formatting

**`LOWER`**

1. Use `LOWER` to convert `givenname`, `surname`, and the full name (built with `CONCAT`) to lowercase.
    - Applies `LOWER` to both `givenname` and `surname` to enforce lowercase formatting.  
    - Concatenates `LOWER(givenname)` and `LOWER(surname)` into a fully lowercase `full_name`.  

In [9]:
%%sql

SELECT
    customerkey,
    LOWER(givenname) AS lowercase_givenname,
    LOWER(surname) AS lowercase_surname,
    CONCAT(LOWER(givenname), ' ', LOWER(surname)) AS lowercase_full_name    
FROM customer;

Unnamed: 0,customerkey,lowercase_givenname,lowercase_surname,lowercase_full_name
0,15,julian,mcguigan,julian mcguigan
1,23,rose,dash,rose dash
2,36,annabelle,townsend,annabelle townsend
3,120,jamie,hetherington,jamie hetherington
4,180,gabriel,bosanquet,gabriel bosanquet
...,...,...,...,...
104985,2099639,miroslav,slach,miroslav slach
104986,2099656,wilfredo,lozada,wilfredo lozada
104987,2099697,phillipp,maier,phillipp maier
104988,2099711,katerina,pavlícková,katerina pavlícková


#### Revenue Analysis With Lowercase Names

**`LOWER`**

1. Starting from the query in the `UPPER` section, convert both parts of the combined customer name to lowercase. 
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - 🔔 Uses `CONCAT(LOWER(givenname), ' ', LOWER(surname))` to maintain a fully lowercase name format.  
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [10]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(LOWER(c.givenname), ' ', LOWER(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cohort_year,cleaned_name,net_revenue,total_orders,avg_order_value
0,15,2021,julian mcguigan,2217.41,1,2217.41
1,23,,rose dash,0.00,0,
2,36,,annabelle townsend,0.00,0,
3,120,,jamie hetherington,0.00,0,
4,180,2018,gabriel bosanquet,2510.22,3,836.74
...,...,...,...,...,...,...
104985,2099639,,miroslav slach,0.00,0,
104986,2099656,2023,wilfredo lozada,10404.68,13,800.36
104987,2099697,2022,phillipp maier,38.20,3,12.73
104988,2099711,2016,katerina pavlícková,6008.67,2,3004.34


---
## TRIM

### 📝 Notes

**`TRIM()`**

- **TRIM**: Removes leading and/or trailing spaces (or specified characters) from a string.  
- Syntax:  
  ```sql
  SELECT TRIM([BOTH | LEADING | TRAILING] 'characters' FROM string_column);
  ```
- Default behavior removes spaces from both ends of a string.  
- Useful for cleaning up user input, formatting text, or ensuring consistent comparisons.  
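
The three trimming directions can be compared side by side. This sketch assumes SQLite through Python's `sqlite3` module, which does not accept the `TRIM(... FROM ...)` spelling shown above. It uses the function forms `TRIM`, `LTRIM`, and `RTRIM` instead, and the comments name the standard-syntax counterpart of each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

padded = "  Rose  "
row = conn.execute(
    """
    SELECT
        TRIM(?)               AS both_ends,      -- like TRIM(BOTH FROM s)
        LTRIM(?)              AS leading_only,   -- like TRIM(LEADING FROM s)
        RTRIM(?)              AS trailing_only,  -- like TRIM(TRAILING FROM s)
        TRIM('xxRosexx', 'x') AS custom_chars    -- like TRIM(BOTH 'x' FROM s)
    """,
    (padded, padded, padded),
).fetchone()

print(row)
```

Only the first and last columns come back as a bare `Rose`. The one-sided variants keep the padding on the side they do not touch.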

### 💻 Final Result

- Removes leading and trailing spaces from names to clean up inconsistencies caused by extra whitespace in data entry.  
- Ensures accurate revenue analysis while cleaning name fields by trimming extra spaces to prevent discrepancies in joins and reporting.  

#### Trimmed Name Formatting

**`TRIM`**

1. Use `TRIM` to clean `givenname` and `surname`, then combine the trimmed values into a full name with `CONCAT`.
    - Uses `TRIM(givenname)` and `TRIM(surname)` to remove unnecessary leading and trailing spaces.  
    - Prevents inconsistencies caused by extra spaces in user-entered data.  
    - Creates a `full_name` by concatenating trimmed names to ensure clean formatting.  
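
The course data shows no visible change after trimming, so the effect is easier to see on padded values. The sketch below uses made-up rows and assumes SQLite (via Python's `sqlite3`) as a stand-in for PostgreSQL, with `||` in place of `CONCAT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Made-up rows with stray spaces, as might come from manual data entry.
    CREATE TABLE customer (customerkey INTEGER, givenname TEXT, surname TEXT);
    INSERT INTO customer VALUES (1, '  Rose', 'Dash  '), (2, 'Rose', 'Dash');
    """
)

rows = conn.execute(
    """
    SELECT
        customerkey,
        givenname || ' ' || surname             AS raw_full_name,
        TRIM(givenname) || ' ' || TRIM(surname) AS trimmed_full_name
    FROM customer
    ORDER BY customerkey
    """
).fetchall()

# Count how many distinct spellings each approach produces.
raw_names = {r[1] for r in rows}
trimmed_names = {r[2] for r in rows}
print(len(raw_names), len(trimmed_names))
```

Without trimming, the same person appears under two different name strings. After trimming, both rows collapse to one spelling, which matters when names are grouped or compared.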

In [11]:
%%sql

SELECT
    customerkey,
    TRIM(givenname) AS trimmed_givenname,
    TRIM(surname) AS trimmed_surname,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS trimmed_full_name  
FROM customer;

Unnamed: 0,customerkey,trimmed_givenname,trimmed_surname,trimmed_full_name
0,15,Julian,McGuigan,Julian McGuigan
1,23,Rose,Dash,Rose Dash
2,36,Annabelle,Townsend,Annabelle Townsend
3,120,Jamie,Hetherington,Jamie Hetherington
4,180,Gabriel,Bosanquet,Gabriel Bosanquet
...,...,...,...,...
104985,2099639,Miroslav,Slach,Miroslav Slach
104986,2099656,Wilfredo,Lozada,Wilfredo Lozada
104987,2099697,Phillipp,Maier,Phillipp Maier
104988,2099711,Katerina,Pavlícková,Katerina Pavlícková


#### Revenue Analysis With Trimmed Names

**`TRIM`**

1. Starting from the query in the `LOWER` section, trim extra spaces from the customer name instead of changing its case. 
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - 🔔 Uses `TRIM` within `CONCAT` to eliminate excess whitespace while maintaining a standardized name format.   
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [12]:
%%sql

WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

Unnamed: 0,customerkey,cohort_year,cleaned_name,net_revenue,total_orders,avg_order_value
0,15,2021,Julian McGuigan,2217.41,1,2217.41
1,23,,Rose Dash,0.00,0,
2,36,,Annabelle Townsend,0.00,0,
3,120,,Jamie Hetherington,0.00,0,
4,180,2018,Gabriel Bosanquet,2510.22,3,836.74
...,...,...,...,...,...,...
104985,2099639,,Miroslav Slach,0.00,0,
104986,2099656,2023,Wilfredo Lozada,10404.68,13,800.36
104987,2099697,2022,Phillipp Maier,38.20,3,12.73
104988,2099711,2016,Katerina Pavlícková,6008.67,2,3004.34
