# Spark SQL Workshop: Advanced join & group by techniques

## Logistics 

### Date & Time
**June 21st, 2025**  
1:00 PM - 2:00 PM EST (10:00 AM - 11:00 AM PST)

### What You Will Learn
- How to use **JOINs** to validate data and identify underlying data issues
- How to use advanced aggregation functions & check data quality with **GROUP BY** 

### Who This Workshop Is For

**Prerequisites:**
- SQL basics, especially JOIN & GROUP BY basics ([see basics here](./basics.ipynb))
- Basic understanding of **[fact and dimension tables](https://www.startdataengineering.com/post/advanced-sql/#4-data-modeling--data-flow)**
- GitHub codespaces or Docker compose (if running locally)
    
**Perfect for:**
- People with some experience in SQL
- People who work with SQL regularly

**Not suitable for:**
- People who don't know SQL basics, especially JOIN & GROUP BY basics ([see basics here](./basics.ipynb))
- People looking for topics other than advanced JOIN and GROUP BY techniques

### How to Join
- **Format:** YouTube live workshop with hands-on coding
- **Participation:** You are expected to code along
- **Interaction:** Live Q&A session included
- **Practice:** Exercises provided

### Workshop Link

**[YouTube live link](https://www.youtube.com/watch?v=OPBhvZOq7oo)**

### Feedback
Feedback form link (TBD)

## Setup data

In [None]:
%%bash
python ./generate_data.py
python ./run_ddl.py

In [None]:
!jupyter labextension install jupyterlab-mermaid

In [None]:
spark

In [None]:
%%sql
use prod.db

## [Quick refresher] Facts & Dimensions

1. `Fact` tables contain information about how dimensions interact with each other in real life. Example: An order fact is an interaction between a customer and a seller involving one or more products. E.g. `Lineitem` & `Orders`.
2. `Dimension` tables store data for a business entity (e.g., customer, product, partner, etc). These tables describe the ‘who’ and ‘what’ types of questions. For example, which stores had the highest revenue yesterday? In this question, stores will be the dimension. E.g. `Customer`, `Supplier`

The term analytical querying usually refers to aggregating numerical (spend, count, sum, avg) data from the fact table for specific dimension attribute(s) (e.g., name, nation, date, month) from the dimension tables.

Some examples of analytical queries are
1. Who are the top 10 suppliers (by totalprice) in the past year?
2. What are the average sales per nation per year?
3. How do customer market segments perform (sales) month-over-month?

**Example**

![Analytical query](./images/analytical_qry.png)

In [None]:
%%sql
SELECT 
    YEAR(o.o_orderdate) as order_year,
    c.c_name,
    AVG(o.o_totalprice) as avg_order_price
FROM orders o
LEFT JOIN customer c ON o.o_custkey = c.c_custkey
GROUP BY YEAR(o.o_orderdate), c.c_custkey, c.c_name
ORDER BY 1 desc, c_custkey
LIMIT 10

```mermaid
flowchart LR
    subgraph A["Orders (Fact Table) - Wider"]
        A1["o_custkey: 1001 | o_orderdate: 2023-01-15 | o_totalprice: 75000.00"]
        A2["o_custkey: 1002 | o_orderdate: 2023-02-20 | o_totalprice: 120000.00"]
        A3["o_custkey: 1001 | o_orderdate: 2023-03-10 | o_totalprice: 95000.00"]
        A4["o_custkey: 1003 | o_orderdate: 2023-04-05 | o_totalprice: 85000.00"]
        A5["o_custkey: 9999 | o_orderdate: 2023-05-12 | o_totalprice: 50000.00"]
    end
    
    subgraph B["Customer (Dimension Table)"]
        B1["c_custkey: 1001 | c_name: John Smith"]
        B2["c_custkey: 1002 | c_name: Jane Doe"]
        B3["c_custkey: 1003 | c_name: Bob Wilson"]
    end
    
    subgraph C["Result"]
        C1["order_year: 2023 | c_name: John Smith | avg_order_price: 85000.00"]
        C2["order_year: 2023 | c_name: Jane Doe | avg_order_price: 120000.00"]
        C3["order_year: 2023 | c_name: Bob Wilson | avg_order_price: 85000.00"]
        C4["order_year: 2023 | c_name: NULL | avg_order_price: 50000.00"]
    end
    
    A1 -.->|matches| B1
    A2 -.->|matches| B2
    A3 -.->|matches| B1
    A4 -.->|matches| B3
    A5 -.->|no match| C4
    
    B1 -->|contributes to| C1
    B2 -->|contributes to| C2
    B3 -->|contributes to| C3
    
    style A fill:#e74c3c,stroke:#c0392b,color:#fff,stroke-width:3px
    style B fill:#3498db,stroke:#2980b9,color:#fff,stroke-width:2px
    style C fill:#27ae60,stroke:#229954,color:#fff,stroke-width:2px
```

## [Joins] can be used to validate data and identify underlying data issues

- While `joins` are typically used to combine tables, they can also be used to inspect data and compute data diffs (rows present in one table but not the other).

- When joining tables, there is usually one table called the `driver/base` table to which other tables are joined.
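
The data diff idea can be sketched outside Spark. The example below is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine so it runs without a Spark session. The tables `source_orders` and `target_orders` and their rows are hypothetical, made up for illustration. The SQL itself is plain and should carry over to Spark SQL with little or no change.

```python
import sqlite3

# Hypothetical tables, invented for this sketch (not part of the workshop data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (o_orderkey INTEGER, o_totalprice REAL);
    CREATE TABLE target_orders (o_orderkey INTEGER, o_totalprice REAL);
    INSERT INTO source_orders VALUES (1, 100.0), (2, 200.0), (3, 300.0);
    INSERT INTO target_orders VALUES (2, 200.0), (3, 350.0), (4, 400.0);
""")

diff_sql = """
    -- Rows in the source that never made it to the target
    SELECT 'missing_in_target' AS issue, s.o_orderkey AS o_orderkey
    FROM source_orders s
    LEFT JOIN target_orders t ON s.o_orderkey = t.o_orderkey
    WHERE t.o_orderkey IS NULL
    UNION ALL
    -- Rows in the target that have no counterpart in the source
    SELECT 'missing_in_source' AS issue, t.o_orderkey AS o_orderkey
    FROM target_orders t
    LEFT JOIN source_orders s ON t.o_orderkey = s.o_orderkey
    WHERE s.o_orderkey IS NULL
    UNION ALL
    -- Rows present on both sides whose values disagree
    SELECT 'value_mismatch' AS issue, s.o_orderkey AS o_orderkey
    FROM source_orders s
    JOIN target_orders t ON s.o_orderkey = t.o_orderkey
    WHERE s.o_totalprice <> t.o_totalprice
    ORDER BY 2
"""

diff_rows = conn.execute(diff_sql).fetchall()
for issue, orderkey in diff_rows:
    print(issue, orderkey)
```

Here `source_orders` acts as the driver table for the first and third checks, and `target_orders` for the second. An empty result would mean the two tables agree on both keys and values.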


### Find data in a table that is not part of another table with `anti join`

- When you need to get rows that are in one table but not in another, use `anti join`

- You can get the rows from the left table that do not have any matches in the right table


#### Exercise ( 5 min )
1. In the query below, get all the rows from the `orders` CTE that are not in the `lineitem` CTE.

```mermaid
flowchart LR
    subgraph A["Orders Table"]
        A1["`**o_orderkey | o_custkey**
        001 | 1001
        002 | 1002
        003 | 1003
        004 | 1004
        005 | 1005`"]
    end
    
    subgraph B["LineItem Table"]
        B1["`**l_orderkey | l_partkey**
        001 | 2001
        003 | 2003
        005 | 2004`"]
    end
    
    subgraph C["Result"]
        C1["`**o_orderkey | o_custkey**
        002 | 1002
        004 | 1004`"]
    end
    
    A1 -->|ANTI JOIN| C1
    B1 -.->|"ON o_orderkey = l_orderkey"| C1
    
    style A fill:#3498db,stroke:#2980b9,color:#fff
    style B fill:#27ae60,stroke:#229954,color:#fff
    style C fill:#e74c3c,stroke:#c0392b,color:#fff
```

In [None]:
%%sql
WITH orders AS (
 SELECT 001 as o_orderkey, 1001 as o_custkey
 UNION ALL
 SELECT 002 as o_orderkey, 1002 as o_custkey
 UNION ALL
 SELECT 003 as o_orderkey, 1003 as o_custkey
 UNION ALL
 SELECT 004 as o_orderkey, 1004 as o_custkey
 UNION ALL
 SELECT 005 as o_orderkey, 1005 as o_custkey
),
lineitem AS (
 SELECT 001 as l_orderkey, 2001 as l_partkey
 UNION ALL
 SELECT 003 as l_orderkey, 2003 as l_partkey
 UNION ALL
 SELECT 005 as l_orderkey, 2004 as l_partkey
)
SELECT o.*
FROM orders o
ANTI JOIN lineitem l ON o.o_orderkey = l.l_orderkey
-- Alternative approach; when you don't have anti join
-- LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey WHERE l.l_orderkey IS NULL

    
### Find data in a table that is closest in time to another table with `asof join` (not available in Spark)

- When you need to get the row that is closest in time to the current row

- Usually used when you need to get the "latest" price, or state from a fact table. Not really used to join dimensions.


#### Exercise ( 5 min )
1. In the query below, get the `symbol, company, listing_date` from the `stock` CTE and, for each stock, get its price as of the `listing_date`.

*Note:* Assume the `price_tracker` CTE is a fact table to which every change in a stock's price is appended (typically at millisecond granularity, but for simplicity we keep it at the day level).

```mermaid
flowchart LR
    subgraph A["Stock Table"]
        A1["AAPL | Apple Inc. | 2024-01-01"]
        A2["GOOGL | Alphabet Inc. | 2024-01-15"]
        A3["MSFT | Microsoft Corp. | 2024-02-01"]
    end
    
    subgraph B["Price Tracker Table"]
        B1["AAPL | 150.00 | 2024-01-10"]
        B2["AAPL | 155.00 | 2024-01-20"]
        B3["GOOGL | 2800.00 | 2024-01-25"]
        B4["GOOGL | 2850.00 | 2024-02-05"]
        B5["MSFT | 400.00 | 2024-02-10"]
    end
    
    subgraph C["Result"]
        C1["AAPL | Apple Inc. | 155.00 | 2024-01-20"]
        C2["GOOGL | Alphabet Inc. | 2850.00 | 2024-02-05"]
        C3["MSFT | Microsoft Corp. | 400.00 | 2024-02-10"]
    end
    
    A1 -.->|matches| B1
    A1 -->|latest match| B2
    A2 -.->|matches| B3
    A2 -->|latest match| B4
    A3 -->|matches| B5
    
    B2 -->|result| C1
    B4 -->|result| C2
    B5 -->|result| C3
    
    style A fill:#3498db,stroke:#2980b9,color:#fff
    style B fill:#27ae60,stroke:#229954,color:#fff
    style C fill:#e74c3c,stroke:#c0392b,color:#fff
```

In [None]:
%%sql
WITH stock AS (
 SELECT 'AAPL' as symbol, 'Apple Inc.' as company, '2024-01-01' as listing_date
 UNION ALL
 SELECT 'GOOGL' as symbol, 'Alphabet Inc.' as company, '2024-01-15' as listing_date
 UNION ALL
 SELECT 'MSFT' as symbol, 'Microsoft Corp.' as company, '2024-02-01' as listing_date
),
price_tracker AS (
 SELECT 'AAPL' as symbol, 150.00 as price, '2024-01-10' as price_date
 UNION ALL
 SELECT 'AAPL' as symbol, 155.00 as price, '2024-01-20' as price_date -- this is the latest price for apple and will be picked
 UNION ALL
 SELECT 'GOOGL' as symbol, 2800.00 as price, '2024-01-25' as price_date
 UNION ALL
 SELECT 'GOOGL' as symbol, 2850.00 as price, '2024-02-05' as price_date -- this is the latest price for google and will be picked
 UNION ALL
 SELECT 'MSFT' as symbol, 400.00 as price, '2024-02-10' as price_date -- this is the latest price for microsoft and will be picked
),
ranked_prices AS (
  SELECT s.symbol, s.company, s.listing_date, p.price, p.price_date,
         ROW_NUMBER() OVER (PARTITION BY s.symbol, s.listing_date 
                           ORDER BY p.price_date DESC) as rn
  FROM stock s
  JOIN price_tracker p ON s.symbol = p.symbol 
  WHERE s.listing_date <= p.price_date
)
SELECT symbol, company, listing_date, price, price_date
FROM ranked_prices
WHERE rn = 1

**Hint:** Get all the prices for a symbol and filter to the latest one


### Joins are used to validate referential integrity (aka are `foreign key` relationships valid)

- In a data warehouse some tables are created sooner than others

- When you join a quick table with a slow table, you can lose data

- For example, if your orders data arrives much quicker than your customer data, unmatched orders will either get NULLs for the customer columns (left join) or be dropped from the output (inner join)

- Usually an `UNKNOWN` catch-all is used; you can also re-run the pipeline to reconcile once the slow data lands


#### Exercise ( 10 min )

1. Assume you have to join the orders table (loaded into the warehouse every 5 minutes) with the customer table (loaded into the warehouse every 6 hours). How do you ensure that the results of your join are **complete**?

    **Hint**: Start by defining what complete means.

2. What will you do if you find the table(s) are incomplete?

In [None]:
%%sql
WITH latest_orders AS (
 SELECT * FROM orders
 UNION ALL
 SELECT 
   9999999 as o_orderkey,
   8888888 as o_custkey,
   'O' as o_orderstatus,
   1500000.00 as o_totalprice,
   '2024-06-14' as o_orderdate,
   '1-URGENT' as o_orderpriority,
   'Clerk#000000999' as o_clerk,
   0 as o_shippriority,
   'New order for non-existent customer' as o_comment
)
SELECT o.*
    , c.* -- What would you do with these NULLs?
FROM latest_orders o
LEFT JOIN customer c ON o.o_custkey = c.c_custkey WHERE c.c_custkey IS NULL

**Discussion**:
  
Typically, fact tables arrive faster than dimensions (the modelled dimension tables, not the raw data). As a result, fact tables may contain dimension IDs that have not yet been loaded into, or updated in, the dimension table.

Depending on the use case there are 3 main ways of dealing with this scenario:

1. Left join dimension data to fact table and fill up NULLs with `UNKNOWN` or similar when reporting.
2. Do an inner join to only keep data that is fully available.
3. Reprocess the join pipeline multiple times so initially you will have `UNKNOWN` and on a future re-run you will have the dimension data and use this to overwrite existing data.
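
The three options above can be compared side by side. This is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine so it runs without Spark. The tiny `orders` and `customer` tables and the `'Late Customer'` row are hypothetical, made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    INSERT INTO orders VALUES (1, 1001), (2, 1002), (3, 8888888);
    -- Customer 8888888 has not landed in the dimension table yet
    INSERT INTO customer VALUES (1001, 'John Smith'), (1002, 'Jane Doe');
""")

LEFT_JOIN_SQL = """
    SELECT o.o_orderkey, COALESCE(c.c_name, 'UNKNOWN') AS c_name
    FROM orders o
    LEFT JOIN customer c ON o.o_custkey = c.c_custkey
    ORDER BY o.o_orderkey
"""
INNER_JOIN_SQL = """
    SELECT o.o_orderkey, c.c_name
    FROM orders o
    JOIN customer c ON o.o_custkey = c.c_custkey
    ORDER BY o.o_orderkey
"""

# Option 1: keep every order and label the missing dimension as UNKNOWN
left_join_rows = conn.execute(LEFT_JOIN_SQL).fetchall()

# Option 2: keep only orders whose customer is already available
inner_join_rows = conn.execute(INNER_JOIN_SQL).fetchall()

# Option 3: the slow dimension row lands later; re-running the same
# left join replaces UNKNOWN with the real customer name
conn.execute("INSERT INTO customer VALUES (8888888, 'Late Customer')")
rerun_rows = conn.execute(LEFT_JOIN_SQL).fetchall()

print(left_join_rows)   # order 3 shows UNKNOWN
print(inner_join_rows)  # order 3 is dropped
print(rerun_rows)       # order 3 shows Late Customer
```

Which option fits depends on whether downstream consumers prefer complete row counts with placeholder attributes, or fewer rows that are fully populated.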


### Common data issues that create bad outputs when joining

- Ensure that your table(s) have a single grain before joining them.

- Handle slow and fast data joins based on use case

- Be careful if your join keys have NULLs: `NULL = NULL` does not evaluate to true, so rows with NULL keys never match
                                
- Be mindful of applying functions in join criteria, they can impact performance significantly
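
Two of the issues above (mixed grain and NULL join keys) are easy to reproduce. This is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine so it runs without Spark. The rows are hypothetical: customer `1001` is deliberately duplicated, and one order and one customer have a NULL key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL);
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    INSERT INTO orders VALUES (1, 1001, 100.0), (2, 1002, 50.0), (3, NULL, 25.0);
    -- Customer 1001 appears twice, so this table is NOT one row per customer
    INSERT INTO customer VALUES
        (1001, 'John Smith'), (1001, 'John Smith'),
        (1002, 'Jane Doe'), (NULL, 'No Key');
""")

# Ground truth: total order value straight from the fact table
true_total = conn.execute("SELECT SUM(o_totalprice) FROM orders").fetchone()[0]

# After the join, order 1 is counted twice (fan-out from the duplicated customer)
# and order 3 disappears (its NULL key matches nothing, not even the NULL customer)
joined_total = conn.execute("""
    SELECT SUM(o.o_totalprice)
    FROM orders o
    JOIN customer c ON o.o_custkey = c.c_custkey
""").fetchone()[0]

# Grain check before joining: any key with more than one row is a problem
duplicate_keys = conn.execute("""
    SELECT c_custkey, COUNT(*) AS cnt
    FROM customer
    WHERE c_custkey IS NOT NULL
    GROUP BY c_custkey
    HAVING COUNT(*) > 1
""").fetchall()

print(true_total)      # 175.0
print(joined_total)    # 250.0
print(duplicate_keys)  # [(1001, 2)]
```

If matching NULL keys to each other is actually the desired behaviour, Spark SQL offers the null-safe equality operator `<=>` for the join condition.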

## [Group bys] can be used for more than reporting

- Quickly check distribution of dimensions (date, state, etc)

- Check unique key constraints; most warehouses let you define a primary key (PK) but don't enforce it



#### Exercise ( 5 min )

Based on the customer counts per nation, do you think the data is representative of the real world?


In [None]:
%%sql
select n.n_name as nation_name
    , count(*) as num_customers
    from customer c
    left join nation n 
    on c.c_nationkey = n.n_nationkey
group by n.n_name
order by num_customers desc
limit 10

The above results don't seem to make sense, as we would not generally expect the number of customers from `china` and from other countries to be roughly the same. This would raise red flags in a real-life scenario.

**However**, our data is fake: it is created by a tool that uses a normal distribution.

#### Exercise ( 5 min ) 

How would you use `group by` to check that the c_custkey column in the customer table is unique?

In [None]:
%%sql
select c_custkey
, count(*) as cnt
from customer 
group by 1
having cnt > 1
limit 5

### Aggregation functions beyond the standard count/min/max/avg/sum

- Statistical agg: Functions like correlation, sampling, standard deviation, skew, etc

- Collection agg: Functions to combine values into nested data types, e.g., array_agg, collect_set, etc

- Approximation agg: Functions that are fast by sacrificing accuracy, e.g., approx_count_distinct, approx_percentile

- Convenience agg: Functions that make common usages easier, e.g., count_if, bool_or, etc

- ROLLUP, CUBE, and GROUPING SETS are shorthand for combining multiple GROUP BYs in one query, typically used for reporting

While you can try to write your own logic to replicate some of the above functions, built-in functions are generally stable and well tested.

In [None]:
%%sql
select 
    year(o_orderdate) as yr
    , sum(case when o_orderpriority = '5-LOW' then 1 else 0 end) as num_low_orders
    , count_if(o_orderpriority = '5-LOW') as num_low_orders_easy -- Convenience agg

from orders
group by 1
order by 1 desc

In [None]:
%%sql
SELECT l_orderkey
   , collect_list(l_linenumber) as line_number
   , collect_list(struct(
        l_linenumber as line_number,
        l_quantity as quantity, 
        l_extendedprice as price
    )) as line_details -- Structured output
FROM lineitem
GROUP BY 1
ORDER BY l_orderkey
LIMIT 10

```mermaid
flowchart LR
    subgraph A["LineItem Table (Input)"]
        A1["orderkey: 1 | linenumber: 1 | quantity: 17 | price: 26734.03"]
        A2["orderkey: 1 | linenumber: 2 | quantity: 36 | price: 57191.40"]
        A3["orderkey: 1 | linenumber: 3 | quantity: 8 | price: 14254.80"]
        A4["orderkey: 2 | linenumber: 1 | quantity: 38 | price: 39447.04"]
        A5["orderkey: 3 | linenumber: 1 | quantity: 45 | price: 47301.30"]
        A6["orderkey: 3 | linenumber: 2 | quantity: 49 | price: 69947.99"]
    end
    
    subgraph C["Result (Aggregated)"]
        C1["orderkey: 1<br/>line_number: [1,2,3]<br/>line_details: [struct(1,17,26734), struct(2,36,57191), ...]"]
        C2["orderkey: 2<br/>line_number: [1]<br/>line_details: [struct(1,38,39447)]"]
        C3["orderkey: 3<br/>line_number: [1,2]<br/>line_details: [struct(1,45,47301), struct(2,49,69947)]"]
    end
    
    A1 -.-> C1
    A2 -.-> C1
    A3 -.-> C1
    A4 --> C2
    A5 -.-> C3
    A6 -.-> C3
    
    style A fill:#3498db,stroke:#2980b9,color:#fff
    style C fill:#27ae60,stroke:#229954,color:#fff
    
```

In [None]:
%%sql
WITH order_details AS (
  SELECT l_orderkey
     , collect_list(l_linenumber) as line_number
     , collect_list(struct(
          l_linenumber as line_number,
          l_quantity as quantity, 
          l_extendedprice as price
      )) as line_details
  FROM lineitem
  GROUP BY 1
)
SELECT 
    l_orderkey,
    exploded_detail.line_number,
    exploded_detail.quantity,
    exploded_detail.price
    -- , explode(line_number) as individual_line_number
FROM order_details
LATERAL VIEW explode(line_details) t AS exploded_detail
    ORDER BY l_orderkey
LIMIT 20

#### Exercise ( 5 min )

Try the above query after uncommenting the `, explode(line_number)` line. What do you think is happening?

### Group by variations for reporting

In [None]:
%%sql
CREATE OR REPLACE TEMPORARY VIEW sales AS
SELECT 'North' as region, 'Electronics' as category, 100 as amount
UNION ALL
SELECT 'North' as region, 'Clothing' as category, 50 as amount
UNION ALL
SELECT 'South' as region, 'Electronics' as category, 80 as amount
UNION ALL
SELECT 'South' as region, 'Clothing' as category, 70 as amount;

In [None]:
%%sql
-- ROLLUP: Hierarchical aggregation (region -> category -> total)
SELECT region, category, SUM(amount) as total_sales
FROM sales
GROUP BY ROLLUP(region, category)
ORDER BY region, category;

In [None]:
%%sql
-- CUBE: All possible combinations
SELECT region, category, SUM(amount) as total_sales
FROM sales
GROUP BY CUBE(region, category)
ORDER BY region, category;


In [None]:
%%sql
-- GROUPING SETS: Custom combinations
SELECT region, category, SUM(amount) as total_sales
FROM sales
GROUP BY GROUPING SETS (
 (region, category),  -- detailed
 (region),           -- by region only
 ()                  -- grand total only
)
ORDER BY region, category;
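
To see what these shorthands expand to, the sketch below spells out the same three grouping levels (detailed, by region, grand total) as separate `GROUP BY` queries glued together with `UNION ALL`. It is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine so it runs without Spark. The `sales` table mirrors the temporary view defined above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, category TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 'Electronics', 100), ('North', 'Clothing', 50),
        ('South', 'Electronics', 80), ('South', 'Clothing', 70);
""")

# Long-hand equivalent of GROUPING SETS ((region, category), (region), ()),
# which is also what ROLLUP(region, category) produces
expanded_rows = conn.execute("""
    SELECT region, category, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region, category          -- detailed
    UNION ALL
    SELECT region, NULL AS category, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region                    -- by region only
    UNION ALL
    SELECT NULL AS region, NULL AS category, SUM(amount) AS total_sales
    FROM sales                         -- grand total
""").fetchall()

for row in expanded_rows:
    print(row)
```

The NULLs in `region` and `category` mark the subtotal and grand total rows, just as they do in the ROLLUP output. CUBE would add one more branch grouped by `category` alone.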

### Gotchas when doing group bys: duplication, incorrect data types, additive/non-additive numbers, etc

- Are you using GROUP BY to remove duplicates? This usually indicates a problem with your underlying data model

- Ensure that the numbers you are aggregating are of the right data type (e.g., a number stored as a string)

- Be mindful of additive and non-additive numbers
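
The data type gotcha is worth seeing once. This is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in engine so it runs without Spark. The `payments` table and its `amount_text` column are hypothetical, made up to show a number stored as a string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- amount_text holds numbers, but the column type is a string
    CREATE TABLE payments (payment_id INTEGER, amount_text TEXT);
    INSERT INTO payments VALUES (1, '9'), (2, '10'), (3, '100');
""")

# Strings compare character by character, so '9' sorts after '100'
max_as_text = conn.execute(
    "SELECT MAX(amount_text) FROM payments"
).fetchone()[0]

# Casting to a numeric type first gives the intended answer
max_as_number = conn.execute(
    "SELECT MAX(CAST(amount_text AS INTEGER)) FROM payments"
).fetchone()[0]

print(repr(max_as_text))  # '9'
print(max_as_number)      # 100
```

The same lexicographic comparison applies to `MIN`, `MAX`, and `ORDER BY` on string columns in most engines, so it is safer to cast explicitly before aggregating.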

#### Exercise ( 5 min )

Inspect the query below. Is the logic correct? If not, what is wrong with it?

How would you fix it?

In [None]:
%%sql
-- CTE: Unique suppliers per day
WITH daily_suppliers AS (
 SELECT 
   DATE(l_shipdate) as ship_date,
   COUNT(DISTINCT l_suppkey) as daily_unique_suppliers
 FROM lineitem
 GROUP BY DATE(l_shipdate)
)
SELECT 
 YEAR(d.ship_date) as ship_year,
 SUM(d.daily_unique_suppliers) as yearly_total
FROM daily_suppliers d
GROUP BY YEAR(d.ship_date)
ORDER BY ship_year;

In [None]:
%%sql
-- CTE: Unique suppliers per day
WITH daily_suppliers AS (
 SELECT 
   DATE(l_shipdate) as ship_date,
   COUNT(DISTINCT l_suppkey) as daily_unique_suppliers
 FROM lineitem
 GROUP BY DATE(l_shipdate)
),

-- Unique suppliers per year (CORRECT way)
yearly_suppliers AS (
 SELECT 
   YEAR(l_shipdate) as ship_year,
   COUNT(DISTINCT l_suppkey) as yearly_unique_suppliers
 FROM lineitem
 GROUP BY YEAR(l_shipdate)
)

-- WRONG: Trying to sum daily unique suppliers to get yearly total
SELECT 
 YEAR(d.ship_date) as ship_year,
 SUM(d.daily_unique_suppliers) as wrong_yearly_total,  -- This is WRONG!
 y.yearly_unique_suppliers as correct_yearly_total
FROM daily_suppliers d
JOIN yearly_suppliers y ON YEAR(d.ship_date) = y.ship_year
GROUP BY YEAR(d.ship_date), y.yearly_unique_suppliers
ORDER BY ship_year;

## Recommended reading

1. [SQL for data engineers](https://www.startdataengineering.com/post/improve-sql-skills-de/)
2. [SQL or Python for data processing](https://www.startdataengineering.com/post/sql-v-python/)
3. [dbt tutorial](https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/)
4. [Build a data project with step-by-step instructions](https://www.startdataengineering.com/post/de-proj-step-by-step/)
