In [None]:
%%bash
python ./generate_data.py
python ./run_ddl.py

**Prerequisites**:

  1. [SQL Basics: Join and Group By basics](./basics.ipynb)

## [Quick refresher] Facts & Dimensions

1. `Fact` tables contain information about how dimensions interact with each other in real life. For example, an order fact is an interaction between a customer and a seller involving one or more products. E.g. `Lineitem` & `Orders`.
2. `Dimension` tables store data for a business entity (e.g., customer, product, partner, etc.). These tables answer the 'who' and 'what' types of questions. For example, which stores had the highest revenue yesterday? In this question, stores are the dimension. E.g. `Customer` & `Supplier`.

The term analytical querying usually refers to aggregating numerical data (e.g., spend) from the fact table with functions such as count, sum, and avg, grouped by specific dimension attribute(s) (e.g., name, nation, date, month) from the dimension tables.

Some examples of analytical queries are:
1. Who are the top 10 suppliers (by totalprice) in the past year?
2. What are the average sales per nation per year?
3. How do customer market segments perform (sales) month-over-month?

**Example**

![Analytical query](./images/analytical_qry.png)

In [6]:
# Add simple SQL demonstrating join & group by 
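
As a minimal sketch of the join and group by pattern described above, the snippet below joins a fact table to a dimension table and aggregates sales per nation. It uses Python's built-in `sqlite3` with a tiny in-memory dataset; the table and column names mimic TPC-H naming but the rows are made up for illustration and are not the data produced by `generate_data.py`.

```python
import sqlite3

# Hypothetical in-memory tables; names follow TPC-H conventions, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT, c_nation TEXT);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER, o_totalprice REAL);
    INSERT INTO customer VALUES (1, 'Alice', 'FRANCE'), (2, 'Bob', 'FRANCE'), (3, 'Chen', 'CHINA');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 25.0), (13, 3, 300.0);
""")

# Aggregate a numerical fact (o_totalprice) by a dimension attribute (c_nation).
sales_by_nation = conn.execute("""
    SELECT c.c_nation,
           COUNT(*)            AS num_orders,
           SUM(o.o_totalprice) AS total_sales
    FROM orders AS o
    JOIN customer AS c ON o.o_custkey = c.c_custkey
    GROUP BY c.c_nation
    ORDER BY total_sales DESC
""").fetchall()

print(sales_by_nation)  # [('CHINA', 1, 300.0), ('FRANCE', 3, 175.0)]
```

Here `orders` is the driver table and `customer` supplies the attribute to group by.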

## Joins can be used to validate data and identify underlying data issues

- While `joins` are typically used to combine tables, they can also be used to inspect data and compute data diffs.

- When joining tables, there is usually one table called the `driver/base` table to which other tables are joined.

### Find data in a table that is not part of another table with `anti join`

- When you need the rows that are in one table but not in another, use an `anti join`.

- An `anti join` returns the rows from the left table that do not have any matches in the right table.
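
A minimal sketch of an anti join, assuming an engine without dedicated `ANTI JOIN` syntax: SQLite (used here via Python's `sqlite3`) has none, so the same result is expressed with `LEFT JOIN ... WHERE ... IS NULL`. The tables and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti join emulation: keep left rows for which the right side found no match.
customers_without_orders = conn.execute("""
    SELECT c.c_custkey, c.c_name
    FROM customer AS c
    LEFT JOIN orders AS o ON o.o_custkey = c.c_custkey
    WHERE o.o_orderkey IS NULL
    ORDER BY c.c_custkey
""").fetchall()

print(customers_without_orders)  # [(2, 'Bob')]
```

A `NOT EXISTS` subquery is an equivalent way to write the same check.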
    
### Find data in a table that is closest in time to another table with `asof join`

- Use an `asof join` when you need the row that is closest in time to the current row.

- Usually used when you need to get the "latest" price or state from a fact table. It is rarely used to join dimensions.
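
The sketch below emulates an asof join in SQLite, which has no `ASOF JOIN` keyword, by using a correlated subquery. It assumes "closest in time" means the most recent price at or before each trade; the `trades` and `prices` tables are hypothetical and not part of the notebook's dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (symbol TEXT, price_ts TEXT, price REAL);
    CREATE TABLE trades (trade_id INTEGER, symbol TEXT, trade_ts TEXT);
    INSERT INTO prices VALUES
        ('AAPL', '2024-01-01 09:00', 100.0),
        ('AAPL', '2024-01-01 10:00', 101.0),
        ('AAPL', '2024-01-01 11:00', 102.0);
    INSERT INTO trades VALUES
        (1, 'AAPL', '2024-01-01 09:30'),
        (2, 'AAPL', '2024-01-01 10:00'),
        (3, 'AAPL', '2024-01-01 08:00');
""")

# For each trade, pick the price row with the latest timestamp <= the trade time.
# ISO-formatted timestamps stored as text compare correctly as strings.
trades_with_price = conn.execute("""
    SELECT t.trade_id, p.price_ts, p.price
    FROM trades AS t
    LEFT JOIN prices AS p
      ON p.symbol = t.symbol
     AND p.price_ts = (
         SELECT MAX(p2.price_ts)
         FROM prices AS p2
         WHERE p2.symbol = t.symbol
           AND p2.price_ts <= t.trade_ts
     )
    ORDER BY t.trade_id
""").fetchall()

for row in trades_with_price:
    print(row)
```

Trade 3 happens before any price exists, so the left join leaves its price columns as NULL. Engines with native asof support let you express this more directly and usually more efficiently.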

### Joins are used to validate referential integrity (i.e., are `foreign key` relationships valid?)

- In a data warehouse, some tables are loaded sooner than others.

- When you join a fast-arriving table with a slow-arriving table, you will lose data.

- For example, if your orders data arrives much sooner than customer data, your joins will either produce NULLs (left join) or drop the unmatched rows from the output (inner join).

- Usually an `UNKNOWN` catch-all value is used. You can also re-run the pipeline to reconcile when the slow data lands.
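
To illustrate the fast/slow data problem, here is a small sketch in which one order references a customer that has not landed yet. The data is made up; the `UNKNOWN` label is applied with `COALESCE`, which is one common way to implement a catch-all, not necessarily how your pipeline does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    -- Order 12 points at customer 99, whose row has not arrived yet.
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
""")

# Inner join silently drops the order with the missing customer.
inner_count = conn.execute("""
    SELECT COUNT(*)
    FROM orders AS o
    JOIN customer AS c ON o.o_custkey = c.c_custkey
""").fetchone()[0]

# Left join keeps every order and labels the missing customer as UNKNOWN.
orders_with_customer = conn.execute("""
    SELECT o.o_orderkey, COALESCE(c.c_name, 'UNKNOWN') AS customer_name
    FROM orders AS o
    LEFT JOIN customer AS c ON o.o_custkey = c.c_custkey
    ORDER BY o.o_orderkey
""").fetchall()

print(inner_count)           # 2
print(orders_with_customer)  # [(10, 'Alice'), (11, 'Bob'), (12, 'UNKNOWN')]
```

Comparing the inner join count against the row count of `orders` is a quick referential integrity check.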

### Common data issues that create bad outputs when joining

- Ensure that your table(s) have a single grain before joining them.

- Handle slow and fast data joins based on your use case.

- Be careful if your join keys have NULLs: `NULL = NULL` does not evaluate to true, so rows with NULL keys never match each other in a join.
                                
- Be mindful of applying functions in join criteria, they can impact performance significantly
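
Two of these issues are easy to reproduce. The sketch below, on invented data, shows NULL join keys failing to match and a mixed-grain dimension table inflating a sum. The null-safe comparison uses SQLite's `IS` operator; other engines typically spell this `IS NOT DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_keys (k INTEGER);
    CREATE TABLE right_keys (k INTEGER);
    INSERT INTO left_keys VALUES (1), (NULL);
    INSERT INTO right_keys VALUES (1), (NULL);

    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_totalprice REAL);
    -- Two rows for customer 1 (e.g. one per address): not a single grain.
    CREATE TABLE customer_address (c_custkey INTEGER, c_address TEXT);
    INSERT INTO orders VALUES (10, 1, 100.0);
    INSERT INTO customer_address VALUES (1, 'Home'), (1, 'Office');
""")

# NULL keys never satisfy an equality join condition.
plain_matches = conn.execute(
    "SELECT COUNT(*) FROM left_keys AS l JOIN right_keys AS r ON l.k = r.k"
).fetchone()[0]

# SQLite's IS treats two NULLs as equal, so the NULL rows pair up too.
null_safe_matches = conn.execute(
    "SELECT COUNT(*) FROM left_keys AS l JOIN right_keys AS r ON l.k IS r.k"
).fetchone()[0]

# Joining to a table with repeated keys duplicates the order row.
inflated_total = conn.execute("""
    SELECT SUM(o.o_totalprice)
    FROM orders AS o
    JOIN customer_address AS c ON o.o_custkey = c.c_custkey
""").fetchone()[0]

print(plain_matches, null_safe_matches)  # 1 2
print(inflated_total)                    # 200.0, although the true total is 100.0
```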

## Group bys can be used for more than reporting

- Quickly check distribution of dimensions (date, state, etc)

- Check unique key constraints. Most warehouses let you define primary keys but do not enforce them.
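
Both checks can be sketched with plain `GROUP BY` queries. The example below uses a hypothetical `orders` table declared without a primary key, to mimic a warehouse that does not enforce uniqueness.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY on purpose, so a duplicate o_orderkey can slip in.
    CREATE TABLE orders (o_orderkey INTEGER, o_orderdate TEXT);
    INSERT INTO orders VALUES
        (10, '2024-01-01'),
        (11, '2024-01-01'),
        (11, '2024-01-02'),
        (12, '2024-01-02'),
        (13, '2024-01-02');
""")

# Distribution check: how many rows per date?
rows_per_date = conn.execute("""
    SELECT o_orderdate, COUNT(*) AS num_rows
    FROM orders
    GROUP BY o_orderdate
    ORDER BY o_orderdate
""").fetchall()

# Uniqueness check: any key that appears more than once violates the intended PK.
duplicate_keys = conn.execute("""
    SELECT o_orderkey, COUNT(*) AS num_rows
    FROM orders
    GROUP BY o_orderkey
    HAVING COUNT(*) > 1
""").fetchall()

print(rows_per_date)   # [('2024-01-01', 2), ('2024-01-02', 3)]
print(duplicate_keys)  # [(11, 2)]
```

An empty result from the second query means the key is unique.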

### Aggregation functions beyond the standard count/min/max/avg/sum

- Statistical agg: Functions like correlation, sampling, standard deviation, skew, etc

- Collection agg: Functions to combine values into nested data types, e.g., array_agg, collect_set, etc

- Approximation agg: Functions that are fast by sacrificing accuracy, e.g., approx_distinct, approx_most_frequent

- Convenience agg: Functions that make common usages easier, e.g., count_if, bool_or, etc

- `ROLLUP`, `CUBE`, and `GROUPING SETS` are shorthand for combining multiple GROUP BYs, typically used for reporting
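
SQLite does not provide `count_if`, `bool_or`, or `ROLLUP`, so the sketch below only emulates them to show what they compute: a conditional sum for `count_if`, `MAX` over a boolean expression for `bool_or`, and a `UNION ALL` of two GROUP BY levels for a one-column `ROLLUP`. The `sales` table is hypothetical; on an engine that supports these functions you would call them directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (nation TEXT, amount REAL);
    INSERT INTO sales VALUES ('FRANCE', 100.0), ('FRANCE', 50.0), ('CHINA', 25.0);
""")

# Convenience aggregates, emulated:
#   count_if(amount > 40) -> SUM(CASE WHEN amount > 40 THEN 1 ELSE 0 END)
#   bool_or(amount > 90)  -> MAX(amount > 90)  (SQLite booleans are 0/1)
convenience = conn.execute("""
    SELECT nation,
           SUM(CASE WHEN amount > 40 THEN 1 ELSE 0 END) AS big_orders,
           MAX(amount > 90)                             AS any_over_90
    FROM sales
    GROUP BY nation
    ORDER BY nation
""").fetchall()

# ROLLUP(nation) emulated: per-nation totals plus a grand total row (nation = NULL).
rollup = dict(conn.execute("""
    SELECT nation, SUM(amount) FROM sales GROUP BY nation
    UNION ALL
    SELECT NULL, SUM(amount) FROM sales
""").fetchall())

print(convenience)  # [('CHINA', 0, 0), ('FRANCE', 2, 1)]
print(rollup)
```

In `rollup`, the `None` key holds the grand total (175.0) alongside the per-nation subtotals.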

### Gotchas when doing group bys: duplication, incorrect data types, additive/non-additive numbers, etc

- If you are using GROUP BY to remove duplicates, this usually indicates a problem with your underlying data model.

- Ensure that the numbers you are aggregating are of the right data type (e.g., a number stored as a string).

- Be mindful of additive and non-additive numbers: sums and counts can be added up across groups, but averages, ratios, and distinct counts cannot.
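
The data type and additivity gotchas can be reproduced in a few lines. This sketch uses a hypothetical `line_items` table where the price was mistakenly stored as text, and compares an average of per-nation averages against the true overall average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- price_text is a number stored as a string (the data type gotcha).
    CREATE TABLE line_items (nation TEXT, price_text TEXT);
    INSERT INTO line_items VALUES ('FRANCE', '9'), ('FRANCE', '10'), ('CHINA', '20');
""")

# MAX on text compares character by character, so '9' beats '20' and '10'.
max_as_text = conn.execute("SELECT MAX(price_text) FROM line_items").fetchone()[0]
max_as_number = conn.execute(
    "SELECT MAX(CAST(price_text AS INTEGER)) FROM line_items"
).fetchone()[0]

# Averages are non-additive: averaging the per-nation averages ignores group sizes.
avg_of_avgs = conn.execute("""
    SELECT AVG(nation_avg)
    FROM (
        SELECT AVG(CAST(price_text AS REAL)) AS nation_avg
        FROM line_items
        GROUP BY nation
    )
""").fetchone()[0]
true_avg = conn.execute(
    "SELECT AVG(CAST(price_text AS REAL)) FROM line_items"
).fetchone()[0]

print(max_as_text, max_as_number)  # 9 20
print(avg_of_avgs, true_avg)       # 14.75 13.0
```

To re-aggregate an average correctly, carry the additive parts (sum and count) through each level and divide at the end.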