# The objective of a data warehouse is to enable analyzing historical data

In [1]:
! python ../setup.py

Cleaning up (if any existing) tpch db file tpch.db
The file tpch.db does not exist.
Creating TPCH input data at tpch.db
Cleaning up (if any existing) sqlite3 db file example.db
The file example.db does not exist.
Creating sqlite database file at example.db


In [2]:
import duckdb
import pandas as pd

%load_ext sql
conn = duckdb.connect("tpch.db")
%sql conn --alias duckdb

ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.

In [None]:
%%sql
show tables;

## Understanding how your business works is critical

- Understand the data flow from the product/upstream team
- Create a conceptual data flow diagram
- Use TPC-H as the example

## Data modeling refers to how your data is stored for historical analytics


The term analytical querying usually refers to aggregating numerical data (spend, count, sum, avg) from the fact table for specific dimension attribute(s) (e.g., name, nation, date, month) from the dimension tables. Some examples of analytical questions are:

1. Who are the top 10 suppliers (by totalprice) in the past year?

2. What are the average sales per nation per year?

3. How do customer market segments perform (sales) month-over-month?

The questions above ask about **historically aggregating data from the fact tables for one or more business entities (dimensions)**.
Consider the example analytical question below and notice the facts and dimensions.

![Analytical query](./images/im_2.png)

When we dissect the above analytical query, we see that it involves:

1. Joining the fact data with dimension table(s) to get the dimension attributes such as name, region, & brand. In our example, we join the orders fact table with the customer dimension table.

2. **Modifying granularity** (aka rollup, `GROUP BY`) of the joined table to the dimension(s) in question. In our example, this refers to `GROUP BY custkey, YEAR(orderdate)`.
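The two steps above can be sketched on a toy dataset. This is a minimal illustration, not the notebook's actual query: it uses Python's built-in `sqlite3` instead of the DuckDB connection so it runs without extra installs, the rows are made up, and `strftime('%Y', ...)` is SQLite's stand-in for `YEAR(orderdate)`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (custkey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        orderkey INTEGER PRIMARY KEY,
        custkey INTEGER,
        orderdate TEXT,
        totalprice REAL
    );
    -- Made-up rows, for illustration only
    INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2');
    INSERT INTO orders VALUES
        (10, 1, '1995-01-15', 100.0),
        (11, 1, '1995-06-30', 50.0),
        (12, 1, '1996-02-01', 25.0),
        (13, 2, '1995-03-10', 75.0);
    """
)

rows = conn.execute(
    """
    SELECT
        c.custkey,
        c.name,
        strftime('%Y', o.orderdate) AS order_year,  -- SQLite stand-in for YEAR(orderdate)
        SUM(o.totalprice) AS total_spend
    FROM orders AS o
    JOIN customer AS c ON o.custkey = c.custkey  -- step 1: join fact to dimension
    GROUP BY c.custkey, c.name, order_year       -- step 2: roll up to customer-year grain
    ORDER BY c.custkey, order_year
    """
).fetchall()

for row in rows:
    print(row)
```

Four order rows collapse into three customer-year rows: the grain changes from one row per order to one row per customer per year.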



### Kimball dimensional modeling is by far the most popular among all options

### Real life events are called facts

**Facts**: Each row in a fact table represents a business process that occurred. For example, in our data warehouse, each row in the `orders` fact table represents an individual order, and each row in the `lineitem` fact table represents an item sold as part of an order. Each fact row has a unique identifier; in our case, it's `orderkey` for orders and the combination of `orderkey` and `linenumber` for lineitem.


A fact table's **grain (aka granularity, level)** refers to what a row in a fact table represents. For example, in our checkout process, we can have two fact tables, one for the order and another for the individual items in the order. The items table will have one row per item purchased, whereas the order table will have one row per order made.


In [None]:
%%sql

-- calculating the totalprice of an order (with orderkey = 1) from its individual items
SELECT
    orderkey,
    -- Formula to calculate price paid after discount & tax
    round(sum(extendedprice * (1 - discount) * (1 + tax)), 2) AS totalprice
FROM
    lineitem
WHERE
    orderkey = 1
GROUP BY
    orderkey;

/*
 orderkey | totalprice
----------+------------
        1 |  172799.56
*/

-- The totalprice of an order (with orderkey = 1)
SELECT
    orderkey,
    totalprice
FROM
    orders
WHERE
    orderkey = 1;


### Dimension = Someone/something that interacts with your business

**Dimension**: Each row in a dimension table represents a business entity that is important to the business. For example, a car parts seller's data warehouse will have a `customer` dimension table, where each row represents an individual customer. Other examples of dimension tables in a car parts seller's data warehouse would be the `supplier` & `part` tables. Techniques such as [SCD2](https://www.startdataengineering.com/post/how-to-join-fact-scd2-tables/#what-is-an-scd2-table-and-why-use-it) are used to store data whose values can change over time (e.g., a customer's address).
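To make the SCD2 idea concrete, here is a minimal sketch of one common layout, where each version of a customer gets its own row with a validity window. The table name `dim_customer_scd2`, the `valid_from`/`valid_to` columns, and the rows are hypothetical, and `sqlite3` is used only so the example is runnable on its own.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical SCD2 layout: one row per version of a customer
    CREATE TABLE dim_customer_scd2 (
        custkey INTEGER,
        address TEXT,
        valid_from TEXT,
        valid_to TEXT  -- a far-future date marks the current version
    );
    INSERT INTO dim_customer_scd2 VALUES
        (1, '12 Old Street', '1994-01-01', '1995-07-01'),
        (1, '98 New Avenue', '1995-07-01', '9999-12-31');

    CREATE TABLE orders (orderkey INTEGER, custkey INTEGER, orderdate TEXT);
    INSERT INTO orders VALUES (10, 1, '1995-01-15'), (11, 1, '1996-02-01');
    """
)

# Join each order to the customer version that was valid on the order date.
# ISO-8601 date strings compare correctly as text.
rows = conn.execute(
    """
    SELECT o.orderkey, o.orderdate, d.address
    FROM orders AS o
    JOIN dim_customer_scd2 AS d
        ON o.custkey = d.custkey
        AND o.orderdate >= d.valid_from
        AND o.orderdate < d.valid_to
    ORDER BY o.orderkey
    """
).fetchall()

for row in rows:
    print(row)
```

Each order picks up the address the customer had when the order was placed, so historical reports stay correct after the customer moves.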


## Most tech companies follow the 3-hop architecture

Most data teams have their version of the 3-hop architecture. For example, dbt has its own version (staging, intermediate, marts), and Databricks has the medallion (bronze, silver, gold) architecture.

You may be wondering why we need this data flow architecture when we can get the results easily with a simple query.

While this is a simple example, in most real-world projects you want a standard, cleaned, and modeled dataset (bronze and silver) that can be used to create specialized datasets for end users (gold). The sections below describe how our data will flow.

### Bronze: Extract raw data and confine it to standard names and data types

Since our dataset has data from the customer, nation, region, order, and lineitem input datasets, we will bring that data into bronze tables. We will keep their names the same as the input datasets.

Let's explore the input datasets and create our bronze datasets.
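A minimal sketch of what a bronze step could look like, assuming a raw extract that arrives with upstream column names and everything typed as text. The `raw_customer` table, its `C_`-prefixed columns, and the rows are hypothetical, and `sqlite3` stands in for the notebook's DuckDB connection.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw extract: upstream column names, every value as text
    CREATE TABLE raw_customer (
        C_CUSTKEY TEXT,
        C_NAME TEXT,
        C_NATIONKEY TEXT,
        C_ACCTBAL TEXT
    );
    INSERT INTO raw_customer VALUES
        ('1', 'Customer#000000001', '15', '711.56'),
        ('2', 'Customer#000000002', '13', '121.65');

    -- Bronze keeps the input dataset's name but enforces
    -- standard column names and data types
    CREATE TABLE customer AS
    SELECT
        CAST(C_CUSTKEY AS INTEGER) AS custkey,
        C_NAME AS name,
        CAST(C_NATIONKEY AS INTEGER) AS nationkey,
        CAST(C_ACCTBAL AS REAL) AS acctbal
    FROM raw_customer;
    """
)

bronze_rows = conn.execute(
    "SELECT custkey, name, nationkey, acctbal FROM customer ORDER BY custkey"
).fetchall()

for row in bronze_rows:
    print(row)
```

The same pattern would repeat for nation, region, orders, and lineitem: one bronze table per input dataset, with no joins or business logic yet.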

### Silver: Model data for analytics

In the silver layer, the datasets are modeled using one of the popular styles (e.g., Kimball, Data Vault, etc.). We will use Kimball's dimensional model, as it is the most commonly used one and can account for many use cases.

**Data modeling**

We will create the following datasets

1. **dim_customer**: A customer-level table with all the necessary attributes of a customer. We will join nation and region data to cleaned_customer_df to get all the attributes associated with a customer.
2. **fct_orders**: An order-level fact (an event that happened) table. This will be the same as cleaned_orders_df, since the orders table has all the necessary details about the order and the keys (like customer_key) that associate it with dimension tables.
3. **fct_lineitem**: A lineitem (items that are part of an order) fact table. This table will be the same as cleaned_lineitem_df, since the lineitem table has all the lineitem-level details and the keys (like partkey and suppkey) that associate it with dimension tables.
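The modeling plan above can be sketched in SQL as follows. This is an illustration under assumptions: the notebook works with dataframes such as cleaned_customer_df, whereas this sketch uses small made-up bronze tables in `sqlite3`, and the output column names (`customer_name`, `nation_name`, `region_name`) are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Tiny made-up bronze tables
    CREATE TABLE region (regionkey INTEGER, name TEXT);
    CREATE TABLE nation (nationkey INTEGER, name TEXT, regionkey INTEGER);
    CREATE TABLE customer (custkey INTEGER, name TEXT, nationkey INTEGER);
    CREATE TABLE orders (
        orderkey INTEGER, custkey INTEGER, orderdate TEXT, totalprice REAL
    );
    INSERT INTO region VALUES (1, 'AMERICA'), (3, 'EUROPE');
    INSERT INTO nation VALUES (24, 'UNITED STATES', 1), (6, 'FRANCE', 3);
    INSERT INTO customer VALUES (1, 'Customer#1', 24), (2, 'Customer#2', 6);
    INSERT INTO orders VALUES (10, 1, '1995-01-15', 100.0);

    -- dim_customer: customer enriched with its nation and region attributes
    CREATE TABLE dim_customer AS
    SELECT
        c.custkey,
        c.name AS customer_name,
        n.name AS nation_name,
        r.name AS region_name
    FROM customer AS c
    LEFT JOIN nation AS n ON c.nationkey = n.nationkey
    LEFT JOIN region AS r ON n.regionkey = r.regionkey;

    -- fct_orders: same as the cleaned orders data, since it already
    -- carries the order details and the key to dim_customer
    CREATE TABLE fct_orders AS SELECT * FROM orders;
    """
)

dim_rows = conn.execute("SELECT * FROM dim_customer ORDER BY custkey").fetchall()
fct_rows = conn.execute("SELECT * FROM fct_orders").fetchall()
print(dim_rows)
print(fct_rows)
```

fct_lineitem would follow the same pass-through pattern as fct_orders.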


### Gold: Create tables for end-users

The gold layer contains datasets required for the end user. The user-required datasets are fact tables joined with dimension tables aggregated to the necessary grain. In real-world projects, multiple teams/users ask for datasets with differing grains from the same underlying fact and dimension tables. While you can join the necessary tables and aggregate them individually for each ask, it leads to repeated code and joins.

To avoid this issue, companies typically do the following:

1. **OBT (One Big Table)**: This is a fact table with multiple dimension tables left joined to it.
2. **Pre-aggregated table**: The OBT rolled up to the grain requested by the end user/team. The pre-aggregated dataset is the dataset that the end user accesses. By providing end users with the exact columns they need, we can ensure that all the metrics are in one place and that issues due to incorrect metric calculations by end users are significantly reduced. These tables act as our end users' SOT (source of truth).

#### OBT: Join the fact table with all its dimensions

In our example, we have two fact tables, fct_orders and fct_lineitem. Since we only have one dimension, dim_customer, we can join fct_orders and dim_customer to create wide_orders. For our use case, we can keep fct_lineitem as wide_lineitem.

That said, we can easily see a case where we might need to join parts and supplier data with fct_lineitem to get wide_lineitem. But since our use case doesn't require this, we can skip it!

Let's create our OBTs.
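A minimal sketch of building wide_orders, assuming simplified versions of fct_orders and dim_customer with made-up rows (the column names are hypothetical, and `sqlite3` stands in for the notebook's DuckDB connection). One order deliberately points at a customer that is missing from the dimension to show why a left join is used.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        custkey INTEGER, customer_name TEXT, nation_name TEXT
    );
    CREATE TABLE fct_orders (
        orderkey INTEGER, custkey INTEGER, orderdate TEXT, totalprice REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Customer#1', 'FRANCE');
    INSERT INTO fct_orders VALUES
        (10, 1, '1995-01-15', 100.0),
        (11, 99, '1995-02-01', 40.0);  -- custkey 99 has no dimension row

    -- OBT: the fact table left joined with its dimension
    CREATE TABLE wide_orders AS
    SELECT
        o.orderkey,
        o.custkey,
        o.orderdate,
        o.totalprice,
        c.customer_name,
        c.nation_name
    FROM fct_orders AS o
    LEFT JOIN dim_customer AS c ON o.custkey = c.custkey;
    """
)

wide_rows = conn.execute("SELECT * FROM wide_orders ORDER BY orderkey").fetchall()
for row in wide_rows:
    print(row)
```

The left join keeps every fact row, so order 11 survives with empty customer attributes instead of silently disappearing, which an inner join would cause.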


#### Pre-aggregated tables: Aggregate OBTs to stakeholder-specific grain

According to our data requirements, we need data from customer, orders, and lineitem. Since we already have customer and order data in wide_orders, we can join that with wide_lineitem to get the necessary data.

We can call the final dataset customer_outreach_metrics (read this article that discusses the importance of naming).

Let's create our final dataset in Python
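A minimal sketch of the roll-up to customer grain, run from Python. The metric columns (`num_orders`, `total_spend`, `num_items`) are hypothetical placeholders, since the actual data requirements are not listed here, and the wide tables are reduced to a few made-up columns and rows in `sqlite3`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the notebook's DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE wide_orders (
        orderkey INTEGER, custkey INTEGER, customer_name TEXT, totalprice REAL
    );
    CREATE TABLE wide_lineitem (
        orderkey INTEGER, linenumber INTEGER, quantity INTEGER
    );
    INSERT INTO wide_orders VALUES
        (10, 1, 'Customer#1', 100.0),
        (11, 1, 'Customer#1', 50.0),
        (12, 2, 'Customer#2', 75.0);
    INSERT INTO wide_lineitem VALUES
        (10, 1, 2), (10, 2, 3), (11, 1, 1), (12, 1, 4);

    CREATE TABLE customer_outreach_metrics AS
    WITH order_items AS (
        -- Roll lineitem up to order grain first, so joining it to
        -- wide_orders does not duplicate order rows
        SELECT orderkey, COUNT(*) AS num_items
        FROM wide_lineitem
        GROUP BY orderkey
    )
    SELECT
        o.custkey,
        o.customer_name,
        COUNT(*) AS num_orders,
        SUM(o.totalprice) AS total_spend,
        SUM(i.num_items) AS num_items
    FROM wide_orders AS o
    LEFT JOIN order_items AS i ON o.orderkey = i.orderkey
    GROUP BY o.custkey, o.customer_name;
    """
)

metrics = conn.execute(
    "SELECT * FROM customer_outreach_metrics ORDER BY custkey"
).fetchall()
for row in metrics:
    print(row)
```

Aggregating wide_lineitem to order grain before the join matters: joining the raw lineitem rows directly would repeat each order once per item and inflate `total_spend`.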
