In [None]:
%%bash
python ./generate_data.py
python ./run_ddl.py

In [None]:
!jupyter labextension install jupyterlab-mermaid

**Prerequisites**:

  1. [SQL Basics: Join and Group By basics](./basics.ipynb)

In [7]:
spark

In [9]:
%%sql
use prod.db

## [Quick refresher] Facts & Dimensions

1. `Fact` tables contain information about how dimensions interact with each other in real life. Example: An order fact is an interaction between a customer and a seller involving one or more products. E.g. `Lineitem` & `Orders`.
2. `Dimension` tables store data for a business entity (e.g., customer, product, partner, etc). These tables answer the ‘who’ and ‘what’ types of questions. For example, which stores had the highest revenue yesterday? In this question, stores will be the dimension. E.g. `Customer`, `Supplier`.

The term analytical querying usually refers to aggregating numerical (spend, count, sum, avg) data from the fact table for specific dimension attribute(s) (e.g., name, nation, date, month) from the dimension tables.

Some examples of analytical queries are:
1. Who are the top 10 suppliers (by totalprice) in the past year?
2. What are the average sales per nation per year?
3. How do customer market segments perform (sales) month-over-month?

**Example**

![Analytical query](./images/analytical_qry.png)

In [6]:
# Add simple SQL demonstrating join & group by 
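
One way to fill in this placeholder is sketched below. It is a minimal, hypothetical example: it uses Python's built-in `sqlite3` as a stand-in for Spark so that it runs anywhere, and the rows are made up (only the table and column names mimic the notebook's schema). The fact table (`orders`) is joined to a dimension (`customer`), and the numeric column is aggregated per dimension attribute.

```python
import sqlite3

# In-memory database with a tiny fact table (orders) and dimension (customer).
# All rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER, c_name TEXT, c_mktsegment TEXT);
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL);
INSERT INTO customer VALUES
  (1, 'Customer#1', 'BUILDING'),
  (2, 'Customer#2', 'MACHINERY'),
  (3, 'Customer#3', 'BUILDING');
INSERT INTO orders VALUES
  (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0), (13, 3, 30.0);
""")

# Join the fact to the dimension, then aggregate per dimension attribute.
segment_sales = conn.execute("""
SELECT c.c_mktsegment,
       COUNT(*)            AS num_orders,
       SUM(o.o_totalprice) AS total_sales
FROM orders o
JOIN customer c ON o.o_custkey = c.c_custkey
GROUP BY c.c_mktsegment
ORDER BY total_sales DESC
""").fetchall()

print(segment_sales)
```

The same `SELECT` should work unchanged in a `%%sql` cell against the real `orders` and `customer` tables.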

## Joins can be used to validate data and identify underlying data issues

- While `joins` are typically used to combine tables, they can also be used to inspect data and find the differences between two tables.

- When joining tables, there is usually one table called the `driver/base` table to which other tables are joined.


### Find data in a table that is not part of another table with `anti join`

- When you need to get rows that are in one table but not in another, use `anti join`

- It returns the rows from the left table that do not have any matches in the right table


In [15]:
%%sql
WITH orders AS (
 SELECT 001 as o_orderkey, 1001 as o_custkey
 UNION ALL
 SELECT 002 as o_orderkey, 1002 as o_custkey
 UNION ALL
 SELECT 003 as o_orderkey, 1003 as o_custkey
 UNION ALL
 SELECT 004 as o_orderkey, 1004 as o_custkey
 UNION ALL
 SELECT 005 as o_orderkey, 1005 as o_custkey
),
lineitem AS (
 SELECT 001 as l_orderkey, 2001 as l_partkey
 UNION ALL
 SELECT 003 as l_orderkey, 2003 as l_partkey
 UNION ALL
 SELECT 005 as l_orderkey, 2004 as l_partkey
)
SELECT o.*
FROM orders o
ANTI JOIN lineitem l ON o.o_orderkey = l.l_orderkey
-- Alternative approach; when you don't have anti join
-- LEFT JOIN lineitem l ON o.o_orderkey = l.l_orderkey WHERE l.l_orderkey IS NULL

o_orderkey,o_custkey
2,1002
4,1004


```mermaid
flowchart LR
    subgraph A["Orders Table"]
        A1["`**o_orderkey | o_custkey**
        001 | 1001
        002 | 1002
        003 | 1003
        004 | 1004
        005 | 1005`"]
    end
    
    subgraph B["LineItem Table"]
        B1["`**l_orderkey | l_partkey**
        001 | 2001
        003 | 2003
        005 | 2004`"]
    end
    
    subgraph C["Result"]
        C1["`**o_orderkey | o_custkey**
        002 | 1002
        004 | 1004`"]
    end
    
    A1 -->|ANTI JOIN| C1
    B1 -.->|"ON o_orderkey = l_orderkey"| C1
    
    style A fill:#3498db,stroke:#2980b9,color:#fff
    style B fill:#27ae60,stroke:#229954,color:#fff
    style C fill:#e74c3c,stroke:#c0392b,color:#fff
```

    
### Find data in a table that is closest in time to another table with `asof join`

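- When you need the row that is closest in time to the current row, use an `asof join`

- Usually used when you need to get the "latest" price or state from a fact table. It is rarely used to join dimensions.

- The query below emulates it with a regular join plus `ROW_NUMBER()`, keeping the latest price on or after each listing date.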


In [18]:
%%sql
WITH stock AS (
 SELECT 'AAPL' as symbol, 'Apple Inc.' as company, '2024-01-01' as listing_date
 UNION ALL
 SELECT 'GOOGL' as symbol, 'Alphabet Inc.' as company, '2024-01-15' as listing_date
 UNION ALL
 SELECT 'MSFT' as symbol, 'Microsoft Corp.' as company, '2024-02-01' as listing_date
),
price_tracker AS (
 SELECT 'AAPL' as symbol, 150.00 as price, '2024-01-10' as price_date
 UNION ALL
 SELECT 'AAPL' as symbol, 155.00 as price, '2024-01-20' as price_date -- this is the latest price for apple and will be picked
 UNION ALL
 SELECT 'GOOGL' as symbol, 2800.00 as price, '2024-01-25' as price_date
 UNION ALL
 SELECT 'GOOGL' as symbol, 2850.00 as price, '2024-02-05' as price_date -- this is the latest price for google and will be picked
 UNION ALL
 SELECT 'MSFT' as symbol, 400.00 as price, '2024-02-10' as price_date -- this is the latest price for microsoft and will be picked
),
ranked_prices AS (
  SELECT s.symbol, s.company, s.listing_date, p.price, p.price_date,
         ROW_NUMBER() OVER (PARTITION BY s.symbol, s.listing_date 
                           ORDER BY p.price_date DESC) as rn
  FROM stock s
  JOIN price_tracker p ON s.symbol = p.symbol 
  WHERE s.listing_date <= p.price_date
)
SELECT symbol, company, listing_date, price, price_date
FROM ranked_prices
WHERE rn = 1

symbol,company,listing_date,price,price_date
AAPL,Apple Inc.,2024-01-01,155.0,2024-01-20
GOOGL,Alphabet Inc.,2024-01-15,2850.0,2024-02-05
MSFT,Microsoft Corp.,2024-02-01,400.0,2024-02-10


```mermaid
flowchart LR
    subgraph A["Stock Table"]
        A1["AAPL | Apple Inc. | 2024-01-01"]
        A2["GOOGL | Alphabet Inc. | 2024-01-15"]
        A3["MSFT | Microsoft Corp. | 2024-02-01"]
    end
    
    subgraph B["Price Tracker Table"]
        B1["AAPL | 150.00 | 2024-01-10"]
        B2["AAPL | 155.00 | 2024-01-20"]
        B3["GOOGL | 2800.00 | 2024-01-25"]
        B4["GOOGL | 2850.00 | 2024-02-05"]
        B5["MSFT | 400.00 | 2024-02-10"]
    end
    
    subgraph C["Result"]
        C1["AAPL | Apple Inc. | 155.00 | 2024-01-20"]
        C2["GOOGL | Alphabet Inc. | 2850.00 | 2024-02-05"]
        C3["MSFT | Microsoft Corp. | 400.00 | 2024-02-10"]
    end
    
    A1 -.->|matches| B1
    A1 -->|latest match| B2
    A2 -.->|matches| B3
    A2 -->|latest match| B4
    A3 -->|latest match| B5
    
    B2 -->|result| C1
    B4 -->|result| C2
    B5 -->|result| C3
    
    style A fill:#3498db,stroke:#2980b9,color:#fff
    style B fill:#27ae60,stroke:#229954,color:#fff
    style C fill:#e74c3c,stroke:#c0392b,color:#fff
```

**Exercise: Scenario 10 min**

1. Assume you have to join the orders table (loaded into the warehouse every 5 minutes) and the customer table (loaded into the warehouse every 6h). How do you ensure that the results of your join are **complete**? Hint: Start by defining what complete means.
2. What will you do if you find the table(s) are incomplete?


### Joins can be used to validate referential integrity (i.e., whether `foreign key` relationships are valid)

- In a data warehouse, some tables are loaded sooner than others

- When you join a quick table with a slow table, you will lose data

- For example, if your orders data arrives much quicker than customer data, the unmatched orders will either get NULL customer columns (left join) or be dropped from the output (inner join)

- Usually an `UNKNOWN` catch-all value is used. You can also re-run the pipeline to reconcile once the slow data lands


In [29]:
%%sql
WITH latest_orders AS (
 SELECT * FROM orders
 UNION ALL
 SELECT 
   9999999 as o_orderkey,
   8888888 as o_custkey,
   'O' as o_orderstatus,
   1500000.00 as o_totalprice,
   '2024-06-14' as o_orderdate,
   '1-URGENT' as o_orderpriority,
   'Clerk#000000999' as o_clerk,
   0 as o_shippriority,
   'New order for non-existent customer' as o_comment
)
SELECT o.*
    , c.* -- What would you do with these NULLs?
FROM latest_orders o
LEFT JOIN customer c ON o.o_custkey = c.c_custkey WHERE c.c_custkey IS NULL

o_orderkey,o_custkey,o_orderstatus,o_totalprice,o_orderdate,o_orderpriority,o_clerk,o_shippriority,o_comment,c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,c_mktsegment,c_comment
9999999,8888888,O,1500000.0,2024-06-14,1-URGENT,Clerk#000000999,0,New order for non-existent customer,,,,,,,,
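
A minimal sketch of the `UNKNOWN` catch-all approach follows. It assumes Python's built-in `sqlite3` as a stand-in for Spark and uses made-up rows; the `needs_reconcile` flag is a hypothetical column added here to mark rows that should be revisited once the slow customer data lands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
INSERT INTO customer VALUES (1001, 'Customer#1001');
-- The second order references a customer that has not been loaded yet.
INSERT INTO orders VALUES (1, 1001), (9999999, 8888888);
""")

rows = conn.execute("""
SELECT o.o_orderkey,
       COALESCE(c.c_name, 'UNKNOWN') AS customer_name,   -- catch-all for missing customers
       c.c_custkey IS NULL           AS needs_reconcile  -- hypothetical flag for a later re-run
FROM orders o
LEFT JOIN customer c ON o.o_custkey = c.c_custkey
ORDER BY o.o_orderkey
""").fetchall()

for row in rows:
    print(row)
```

Keeping the left join means no order is dropped, while the flag makes it easy to find and refresh the placeholder rows later.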



### Common data issues that create bad outputs when joining

- Ensure that your table(s) have a single grain before joining them.

- Handle slow and fast data joins based on use case

- Be careful if your join keys have NULLs: `NULL = NULL` does not evaluate to true, so rows with NULL keys never match each other

- Be mindful of applying functions in join criteria, as they can impact performance significantly
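
Two of the issues above can be reproduced in a few lines. This is a hedged sketch with made-up rows, again using Python's built-in `sqlite3` in place of Spark: a customer table that does not have one row per `c_custkey` inflates totals after the join, and a NULL join key silently drops a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL);
CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
INSERT INTO orders VALUES (1, 1001, 100.0), (2, NULL, 40.0);
-- Customer 1001 appears twice (not a single grain); one customer row has a NULL key.
INSERT INTO customer VALUES (1001, 'A'), (1001, 'A (duplicate)'), (NULL, 'no key');
""")

# NULL = NULL is not true, so order 2 never matches the NULL-keyed customer row.
null_matches = conn.execute("""
SELECT COUNT(*)
FROM orders o
JOIN customer c ON o.o_custkey = c.c_custkey
WHERE o.o_custkey IS NULL
""").fetchone()[0]

# The duplicated customer row fans out order 1, doubling its revenue.
joined_total = conn.execute("""
SELECT SUM(o.o_totalprice)
FROM orders o
JOIN customer c ON o.o_custkey = c.c_custkey
""").fetchone()[0]

true_total = conn.execute(
    "SELECT SUM(o_totalprice) FROM orders WHERE o_custkey = 1001"
).fetchone()[0]

print(null_matches, joined_total, true_total)
```

Comparing a pre-join total with the post-join total, as done here, is a cheap check for accidental fan-out.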

## Group bys can be used for more than reporting

- Quickly check distribution of dimensions (date, state, etc)

- Check unique key constraints. Most warehouses let you define primary keys but don't enforce them



In [39]:
%%sql
select n.n_name as nation_name
    , count(*) as num_customers
    from customer c
    left join nation n 
    on c.c_nationkey = n.n_nationkey
group by n.n_name
order by num_customers desc
limit 10

nation_name,num_customers
CHINA,3088
INDONESIA,3085
CANADA,3049
MOZAMBIQUE,3048
UNITED STATES,3046
KENYA,3045
IRAQ,3026
ROMANIA,3024
EGYPT,3018
RUSSIA,3015


**Question**: Judging by the customer numbers, do you think the data is representative of the real world?

The above results don't seem to make sense: we would not generally expect the number of customers from `CHINA` and other countries to be so similar. This would raise red flags in a real-life scenario.

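**However**, the fake data here is produced by a generator that spreads customers almost evenly across nations (closer to a uniform than a normal distribution), which explains the similar counts.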

**Question**: How would you use `group by` to check that the `c_custkey` column in the customer table is unique?

In [40]:
%%sql
select c_custkey
, count(*) as cnt
from customer 
group by 1
having cnt > 1
limit 5

c_custkey,cnt


### Aggregation functions beyond the standard count/min/max/avg/sum

- Statistical agg: Functions like correlation, sampling, standard deviation, skew, etc

- Collection agg: Functions to combine values into nested data types, e.g., array_agg, collect_set, etc

- Approximation agg: Functions that are fast by sacrificing accuracy, e.g., approx_count_distinct and approx_percentile in Spark (approx_distinct and approx_most_frequent in Trino)

- Convenience agg: Functions that make common usages easier, e.g., count_if, bool_or, etc

- `ROLLUP`, `CUBE`, and `GROUPING SETS` are shorthand for combining multiple GROUP BYs in one query, typically used for reporting
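
To make two of these concrete, here is a small sketch with made-up rows, using Python's built-in `sqlite3` in place of Spark. SQLite lacks `count_if` and `ROLLUP`, so the sketch spells out what they are shorthand for: a conditional count written as `SUM(CASE ...)`, and a roll-up written as per-group rows unioned with a grand-total row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (o_orderkey INTEGER, o_orderstatus TEXT, o_totalprice REAL);
INSERT INTO orders VALUES (1, 'O', 100.0), (2, 'F', 50.0), (3, 'O', 25.0);
""")

# Conditional count: the long form of count_if(o_totalprice > 40).
big_orders = conn.execute(
    "SELECT SUM(CASE WHEN o_totalprice > 40 THEN 1 ELSE 0 END) FROM orders"
).fetchone()[0]

# Long form of GROUP BY ROLLUP (o_orderstatus):
# one row per status plus a grand-total row whose status is NULL.
rollup_rows = conn.execute("""
SELECT o_orderstatus, SUM(o_totalprice) AS total FROM orders GROUP BY o_orderstatus
UNION ALL
SELECT NULL, SUM(o_totalprice) FROM orders
""").fetchall()

print(big_orders)
print(rollup_rows)
```

In Spark the second query can be written as a single `GROUP BY ROLLUP (o_orderstatus)`, which avoids scanning the table twice.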

### Gotchas when doing group bys: duplication, incorrect data types, additive/non-additive numbers, etc

- Are you using `group by` to remove duplicates? This usually indicates a problem with your underlying data model

- Ensure that the numbers you are aggregating are of the right data type (e.g., a number stored as a string will sort and compare as text)

- Be mindful of additive and non-additive numbers: sums and counts can be re-aggregated, but averages, ratios, and distinct counts generally cannot
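
The last two gotchas are easy to demonstrate. The sketch below uses made-up rows and Python's built-in `sqlite3` in place of Spark; the `sales` table and its `amount_text` column are hypothetical. It shows that `MAX` on numbers stored as text compares character by character, and that averaging per-nation averages gives a different answer from averaging the raw rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (nation TEXT, amount_text TEXT);  -- amounts wrongly stored as text
INSERT INTO sales VALUES ('CANADA', '9'), ('CANADA', '100'), ('KENYA', '10');
""")

# Wrong data type: text is compared character by character, so '9' beats '100'.
max_as_text = conn.execute("SELECT MAX(amount_text) FROM sales").fetchone()[0]
max_as_number = conn.execute(
    "SELECT MAX(CAST(amount_text AS INTEGER)) FROM sales"
).fetchone()[0]

# Non-additive measure: the average of per-nation averages ignores group sizes.
avg_of_avgs = conn.execute("""
SELECT AVG(nation_avg)
FROM (
  SELECT AVG(CAST(amount_text AS INTEGER)) AS nation_avg
  FROM sales
  GROUP BY nation
)
""").fetchone()[0]
true_avg = conn.execute(
    "SELECT AVG(CAST(amount_text AS INTEGER)) FROM sales"
).fetchone()[0]

print(max_as_text, max_as_number)
print(avg_of_avgs, true_avg)
```

When a non-additive metric must be rolled up, one option is to carry its additive parts (here, the sum and the row count) and divide at the final level.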