<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Int_SQL_Data_Analytics_Course/blob/main/6_Data_Cleaning/2_Strings.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# String Formatting

## Overview

### 🥅 Analysis Goals

Standardize customer name formatting and analyze revenue trends to ensure data consistency, improve reporting accuracy, and gain insights into customer behavior.  

- **Standardized Full Name & Revenue Analysis:** Concatenates first and last names for consistency and readability while calculating customer acquisition cohort, net revenue, and order frequency to assess customer value and purchasing trends.  

- **Uppercase Name Formatting & Revenue Analysis:** Converts names to uppercase for uniformity in reporting and case-insensitive operations while ensuring revenue analysis maintains standardized uppercase names for consistent data processing.  

- **Lowercase Name Formatting & Revenue Analysis:** Converts names to lowercase for datasets requiring a uniform format while analyzing revenue trends with consistently formatted names to improve data clarity.  

- **Trimmed Name Formatting & Revenue Analysis:** Removes leading and trailing spaces to clean up inconsistencies in name fields while ensuring revenue calculations remain accurate, preventing discrepancies due to extra whitespace.  

### 📘 Concepts Covered

- `CONCAT`
- `UPPER`
- `LOWER`
- `TRIM`

---
## CONCAT

### 📝 Notes

**`CONCAT`**

- **CONCAT**: Combines two or more strings into a single string.

- Syntax:

  ```sql
  SELECT CONCAT(string1, string2, ...);
  ```

- In PostgreSQL, treats `NULL` arguments as empty strings, so a single missing value does not turn the whole result into `NULL`.
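
To see the `NULL` handling in isolation, here is a minimal sketch that needs no tables. It assumes PostgreSQL, where `CONCAT` skips `NULL` arguments while the `||` operator propagates them; the literal values are made up for illustration.

```sql
-- Assumes PostgreSQL; 'Ada' is a made-up sample value.
SELECT
    CONCAT('Ada', ' ', NULL) AS with_concat,  -- NULL argument is skipped
    'Ada' || ' ' || NULL     AS with_pipes;   -- any NULL operand makes the result NULL
```

Under these assumptions, `with_concat` should come back as `'Ada '` (note the trailing space left by the separator) and `with_pipes` as `NULL`. Other databases may behave differently, so it is worth checking the dialect in use.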

### 💻 Final Result

- Standardizes customer names by concatenating first and last names for better readability and consistency across reports.  
- Analyzes customer purchase behavior by calculating their cohort year, total net revenue, and order frequency to understand long-term value.  

#### Standardized Customer Name

**`CONCAT`**

1. Use `CONCAT` to combine the customer's `givenname` and `surname`.
    - Joins `givenname` and `surname` into a single string with a space in between for consistency.  
    - Standardizes name formatting to improve readability and ensure uniform data representation.  

In [None]:
SELECT
    customerkey,
    CONCAT(givenname, ' ', surname) AS cleaned_name
FROM customer;

<img src="../Resources/query_results/6_string_formatting_1.png" alt="Query Results 1" style="width: 50%; height: auto;">

#### Customer Revenue And Cohort Analysis

**`CONCAT`**

1. Using the final query from `Handling_Nulls`, clean up the customer name by combining the `givenname` and `surname`.
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - Uses `CONCAT(c.givenname, ' ', c.surname)` to combine the first and last names.  
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  


In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, 
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(c.givenname, ' ', c.surname) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_string_formatting_2.png" alt="Query Results 2" style="width: 80%; height: auto;">
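
The `COALESCE` and `NULLIF` guards used in the query above can be tried on their own. The sketch below assumes PostgreSQL and uses a hypothetical inline `VALUES` list as a stand-in for the joined rows, so the numbers are illustrative and not taken from the course dataset.

```sql
-- Assumes PostgreSQL; the VALUES rows are made-up stand-ins for the join output.
SELECT
    customerkey,
    COALESCE(net_revenue, 0)            AS net_revenue,     -- NULL becomes 0
    COALESCE(num_orders, 0)             AS total_orders,    -- NULL becomes 0
    net_revenue / NULLIF(num_orders, 0) AS avg_order_value  -- NULL instead of a division-by-zero error
FROM (
    VALUES
        (1, 250.0, 5),     -- customer with orders
        (2, NULL, NULL),   -- customer with no matching sales row
        (3, 0.0, 0)        -- hypothetical zero-order row
) AS t(customerkey, net_revenue, num_orders);
```

Customer 1 gets an average of 50, while customers 2 and 3 get `NULL` for `avg_order_value` rather than an error, because `NULLIF` turns a zero divisor into `NULL` and dividing by `NULL` yields `NULL`.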

---
## UPPER

### 📝 Notes

**`UPPER()`**

- **UPPER**: Converts a string to uppercase.

- Syntax:

  ```sql
  SELECT UPPER(string_column);
  ```

- Useful for standardizing text, such as converting names or codes to uppercase for comparison.
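
As a quick illustration of the comparison use case, the sketch below compares two spellings of the same code with and without `UPPER`. It assumes a default, case-sensitive collation, and the literals are made up.

```sql
-- Made-up literals; no tables required. Assumes a case-sensitive collation.
SELECT
    'usd' = 'USD'               AS raw_match,         -- false: the comparison is case-sensitive
    UPPER('usd') = UPPER('USD') AS normalized_match;  -- true: both sides become 'USD'
```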

### 💻 Final Result

- Converts customer names to uppercase for uniformity when performing case-insensitive comparisons or standardizing display formats.  
- Evaluates customer spending trends while ensuring name formatting is consistent in uppercase for standardization in reports.  

#### Uppercase Name Formatting

**`UPPER`**

1. Use `UPPER` to uppercase the `givenname`, the `surname`, and the `full_name` (which is created using `CONCAT`).
    - Applies `UPPER` to `givenname` and `surname` to ensure all characters are capitalized.  
    - Creates a **fully uppercase** `full_name` by applying `UPPER` within `CONCAT`.  

In [None]:
SELECT
    customerkey,
    UPPER(givenname) AS uppercase_givenname,
    UPPER(surname) AS uppercase_surname,
    CONCAT(UPPER(givenname), ' ', UPPER(surname)) AS uppercase_full_name    
FROM customer;

<img src="../Resources/query_results/6_string_formatting_3.png" alt="Query Results 3" style="width: 80%; height: auto;">

#### Revenue Analysis With Uppercase Names

**`UPPER`**

1. Using the query from `CONCAT`, clean up the customer name by making the combined customer name uppercase.
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - Uses `CONCAT(UPPER(givenname), ' ', UPPER(surname))` to ensure name consistency across reports.   
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year, 
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(UPPER(c.givenname), ' ', UPPER(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value  
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_string_formatting_4.png" alt="Query Results 4" style="width: 80%; height: auto;">

---
## LOWER

### 📝 Notes

**`LOWER()`**

- **LOWER**: Converts a string to lowercase.  
- Syntax:  
  ```sql
  SELECT LOWER(string_column);
  ```
- Useful for standardizing text, such as making email addresses or usernames case-insensitive for comparisons.
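
A minimal sketch of why this matters when counting or de-duplicating values: the same address typed with different capitalization counts as several distinct values until it is lowercased. The email addresses are made up, and the inline `VALUES` list assumes PostgreSQL.

```sql
-- Made-up email addresses; assumes PostgreSQL-style VALUES lists.
SELECT
    COUNT(DISTINCT email)        AS raw_distinct,        -- 3: each spelling counts separately
    COUNT(DISTINCT LOWER(email)) AS normalized_distinct  -- 1: all spellings collapse to one address
FROM (
    VALUES
        ('Jane.Doe@example.com'),
        ('jane.doe@example.com'),
        ('JANE.DOE@EXAMPLE.COM')
) AS t(email);
```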

### 💻 Final Result

- Converts customer names to lowercase to ensure consistency in case-sensitive operations or data exports requiring lowercase formatting.  
- Assesses customer revenue while maintaining lowercase formatting for uniformity in datasets where lowercase naming is preferred.  

#### Lowercase Name Formatting

**`LOWER`**

1. Use `LOWER` to lowercase the `givenname`, the `surname`, and the `full_name` (which is created using `CONCAT`).
    - Applies `LOWER` to both `givenname` and `surname` to enforce lowercase formatting.  
    - Concatenates `LOWER(givenname)` and `LOWER(surname)` into a fully lowercase `full_name`.  

In [None]:
SELECT
    customerkey,
    LOWER(givenname) AS lowercase_givenname,
    LOWER(surname) AS lowercase_surname,
    CONCAT(LOWER(givenname), ' ', LOWER(surname)) AS lowercase_full_name    
FROM customer;

<img src="../Resources/query_results/6_string_formatting_5.png" alt="Query Results 5" style="width: 80%; height: auto;">

#### Revenue Analysis With Lowercase Names

**`LOWER`**

1. Using the query from `UPPER`, clean up the customer name by making the combined customer name lowercase.
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - Uses `CONCAT(LOWER(givenname), ' ', LOWER(surname))` to maintain a fully lowercase name format.  
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(LOWER(c.givenname), ' ', LOWER(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_string_formatting_6.png" alt="Query Results 6" style="width: 80%; height: auto;">

---
## TRIM

### 📝 Notes

**`TRIM()`**

- **TRIM**: Removes leading and/or trailing spaces (or specified characters) from a string.  
- Syntax:  
  ```sql
  SELECT TRIM([BOTH | LEADING | TRAILING] 'characters' FROM string_column);
  ```
- Default behavior removes spaces from both ends of a string.  
- Useful for cleaning up user input, formatting text, or ensuring consistent comparisons.  
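
The syntax variants above can be compared side by side. This is a small sketch using made-up literals, written against PostgreSQL; the expected values in the comments follow from the standard `TRIM` semantics.

```sql
-- Made-up literals; no tables required.
SELECT
    TRIM('  Ada  ')                   AS default_both,   -- 'Ada'
    TRIM(LEADING ' ' FROM '  Ada  ')  AS leading_only,   -- 'Ada  '
    TRIM(TRAILING ' ' FROM '  Ada  ') AS trailing_only,  -- '  Ada'
    TRIM(BOTH 'x' FROM 'xxAdaxx')     AS custom_chars;   -- 'Ada'
```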

### 💻 Final Result

- Removes leading and trailing spaces from names to clean up inconsistencies caused by extra whitespace in data entry.  
- Ensures accurate revenue analysis while cleaning name fields by trimming extra spaces to prevent discrepancies in joins and reporting.  

#### Trimmed Name Formatting

**`TRIM`**

1. Use `TRIM` to trim the `givenname`, the `surname`, and the `full_name` (which is created using `CONCAT`).
    - Uses `TRIM(givenname)` and `TRIM(surname)` to remove unnecessary leading and trailing spaces.  
    - Prevents inconsistencies caused by extra spaces in user-entered data.  
    - Creates a `full_name` by concatenating trimmed names to ensure clean formatting.  

In [None]:
SELECT
    customerkey,
    TRIM(givenname) AS trimmed_givenname,
    TRIM(surname) AS trimmed_surname,
    CONCAT(TRIM(givenname), ' ', TRIM(surname)) AS trimmed_full_name  
FROM customer;

<img src="../Resources/query_results/6_string_formatting_7.png" alt="Query Results 7" style="width: 80%; height: auto;">
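
To illustrate how stray whitespace can cause mismatches in comparisons and joins, the sketch below checks equality and length before and after trimming. The surname is a made-up literal, and the behavior shown assumes PostgreSQL `text` comparisons.

```sql
-- Made-up literal; assumes PostgreSQL text semantics.
SELECT
    ' Smith' = 'Smith'       AS raw_match,       -- false: the leading space makes the strings differ
    TRIM(' Smith') = 'Smith' AS trimmed_match,   -- true
    LENGTH(' Smith ')        AS raw_length,      -- 7
    LENGTH(TRIM(' Smith '))  AS trimmed_length;  -- 5
```

Comparing `LENGTH` before and after `TRIM` is also a simple way to find rows that contain extra whitespace.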

#### Revenue Analysis With Trimmed Names

**`TRIM`**

1. Using the query from `LOWER`, clean up the customer name by removing any extra spaces.
    - Use a CTE (`sales_data`) to preprocess customer revenue metrics. 
        - Extracts each customer’s **cohort year** using `EXTRACT(YEAR FROM MIN(orderdate))`.  
        - Aggregates total **net revenue** per customer using `SUM(quantity * netprice * exchangerate)`.  
        - Counts the **number of orders** per customer using `COUNT(orderkey)`.  
    - In the main query: 
        - Uses `TRIM` within `CONCAT` to eliminate excess whitespace while maintaining a standardized name format.   
        - Uses `COALESCE` to replace null revenue and order values with `0` to avoid missing data issues.  
        - Computes average order value by dividing `net_revenue` by `num_orders`, handling division by zero with `NULLIF`.  

In [None]:
WITH sales_data AS (
        SELECT
            customerkey,
            EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
            SUM(quantity * netprice * exchangerate) AS net_revenue,
            COUNT(orderkey) AS num_orders
        FROM sales
        GROUP BY
            customerkey
)

SELECT
    c.customerkey,
    s.cohort_year,
    CONCAT(TRIM(c.givenname), ' ', TRIM(c.surname)) AS cleaned_name, -- Added
    COALESCE(s.net_revenue, 0) AS net_revenue,
    COALESCE(s.num_orders, 0) AS total_orders,
    s.net_revenue / NULLIF(s.num_orders, 0) AS avg_order_value
FROM customer c
LEFT JOIN sales_data s ON c.customerkey = s.customerkey;

<img src="../Resources/query_results/6_string_formatting_8.png" alt="Query Results 8" style="width: 80%; height: auto;">