# Dimensional modelling concepts - Four step design process

## Business requirements 

- Understand the business by talking to business representatives. 
    - What are the KPIs?
    - What are some business issues that need to be solved with data?
    - What analytical needs will this model support?

## Data realities

- Talk with source system experts to understand:
    - The type of data that is available.
    - How often the models change.
    - Which systems the data comes from and how it is stored.
    - Common problems with the data.
    - How often the data is updated.

## Collaborative modelling workshops

- Dimensional models should be designed with help from experts and data governance representatives from the business.
- The model should be derived from a series of interactive workshops with these experts.
- Models shouldn't be designed in isolation, since the designer may not understand the business requirements or the data realities.

## Four step design process

1. Select the business process
    - Identify operational activities to be modeled.
    - Identify events that capture performance metrics (for fact table).
    - Each process is a row in the enterprise data warehouse bus matrix.
2. Declare the grain
    - Establish a contract on what a single fact table row represents.
    - Facts/dimensions must be consistent with the grain.
    - Each grain = a separate physical table.
3. Identify the dimensions
    - Who, what, where, when, why, how.
    - Must contain descriptive attributes used by BI applications for filters.
4. Identify the facts
    - Each fact table row corresponds to one business process event.
    - Measure values are almost always numeric.
    - List facts consistent with the table grain.
    - After meeting business requirements, adapt to data realities.
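
As a rough illustration of the bus matrix mentioned in step 1, the sketch below models it as a plain Python mapping from each business process to the dimensions it uses. Only the retail sales row reflects the case study in these notes. The other two processes, the `vendor` dimension, and the names `BUS_MATRIX`, `shared_dimensions` and `render_matrix` are hypothetical and exist only for this example.

```python
# Hypothetical bus matrix: one row per business process, one column per dimension.
DIMENSIONS = ["date", "store", "product", "promotion", "vendor"]

BUS_MATRIX = {
    "retail sales": {"date", "store", "product", "promotion"},
    "store inventory": {"date", "store", "product"},  # assumed process
    "vendor deliveries": {"date", "store", "product", "vendor"},  # assumed process
}


def shared_dimensions(matrix):
    """Return the dimensions that every business process in the matrix uses."""
    return set.intersection(*matrix.values())


def render_matrix(matrix, dimensions):
    """Format the matrix as text, marking each dimension a process uses with an X."""
    header = f"{'process':<20}" + "".join(f"{name:<11}" for name in dimensions)
    lines = [header]
    for process, used in matrix.items():
        marks = "".join(f"{'X' if name in used else '-':<11}" for name in dimensions)
        lines.append(f"{process:<20}{marks}")
    return "\n".join(lines)


print(render_matrix(BUS_MATRIX, DIMENSIONS))
print(sorted(shared_dimensions(BUS_MATRIX)))
```

Reading along a row shows which dimensions a single process needs. Reading down a column shows which processes would share a dimension table.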

## Case study: Retail Sales

**Context**: Large grocery chain with multiple stores spread across multiple states. Each store has multiple departments: grocery, frozen foods, meats, produce, bakery, floral, health/beauty. Individual products are identified by stock keeping units (SKUs).

**Data collection**: Several operational systems. Cash registers collect customer purchases, and point of sale (POS) systems collect data related to SKUs. Customer receipts contain a copy of: store number, cashier identifier and name, product SKU, product name, cost, discount information, coupon information, total cost, item count, transaction number, transaction time and a receipt number. Vendor deliveries are also tracked at each store's back door.

**Business interests**: Logistics of ordering, stocking and selecting products while maximizing profits. Profit comes from charging as much as possible for products while keeping customers happy. Management and marketing make decisions about pricing and promotions, which can greatly impact sales. Promotions include temporary price reductions, coupons and ads.

1. Business process: Retail sales transactions captured by the point of sale systems. The objective is to analyze **what products** are selling, in **what stores**, with **what promotions** and **when**.
2. Grain: Analyzing the selected business process, we see that the grain is a **single product in a sales transaction**. Nothing in the proposal requires drilling down deeper than an individual product in a sale.
3. Dimensions: Several keywords recur throughout the analysis: **stores**, **products**, **transactions**, **promotions**, **cashiers**, **payment methods** and **dates**.
4. Facts: The facts are the metrics collected at the grain, which is a single product in a sales transaction. From the receipt we can identify: quantity sold, regular unit price, unit discount, net paid unit price (regular price - discount), extended discount (quantity * discount) and extended sales dollar amount (quantity * net unit price). Some systems also provide a standard dollar cost for each product.
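
The arithmetic behind these facts can be sketched in a few lines of Python. This is only an illustration: the function name `receipt_line_facts` and the sample values are made up, the net unit price is assumed to be the regular price minus the unit discount, and gross profit is assumed to be extended sales minus extended cost. The dictionary keys mirror the fact table columns defined later in these notes.

```python
from decimal import Decimal


def receipt_line_facts(quantity, regular_unit_price, discount_unit_price, unit_cost):
    """Derive the extended facts for one product on one sales transaction."""
    # Assumption: net unit price = regular unit price - unit discount.
    net_unit_price = regular_unit_price - discount_unit_price
    extended_sales = quantity * net_unit_price
    extended_cost = quantity * unit_cost
    return {
        "sales_quantity": quantity,
        "net_unit_price": net_unit_price,
        "extended_discount_dollars": quantity * discount_unit_price,
        "extended_sales_dollars": extended_sales,
        "extended_cost_dollars": extended_cost,
        # Assumption: gross profit = extended sales - extended cost.
        "extended_gross_profit_dollars": extended_sales - extended_cost,
    }


# Made-up receipt line: 3 units at 2.50 each, 0.50 off per unit, 1.20 cost per unit.
facts = receipt_line_facts(
    quantity=3,
    regular_unit_price=Decimal("2.50"),
    discount_unit_price=Decimal("0.50"),
    unit_cost=Decimal("1.20"),
)
for name, value in facts.items():
    print(f"{name}: {value}")
```

`Decimal` is used instead of `float` so that the dollar amounts stay exact in the example.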

Below, we outline a basic diagram containing what could be the start of a dimensional model for the proposed case:

![Retail Sales Partial Dimensional Model](https://github.com/gustavom2998/engineering_notes/blob/main/books/data_warehouse_toolkit/images/2_1.png?raw=true)

## SQL Snippets
As we cover topics that can be implemented, we will use [DuckDB](https://duckdb.org/) to run SQL scripts that test, validate, and implement these ideas. DuckDB is an in-process relational DBMS designed for OLAP workloads, much as SQLite is for OLTP workloads.

We can start by implementing basic definitions of the dimensions in our case study, even though they might change in the near future. For now, we will use the `CREATE TABLE` statement together with table constraints to define primary and foreign keys. Take a look at the documentation for a brief overview of [create table](https://duckdb.org/docs/sql/statements/create_table) and [table constraints](https://duckdb.org/docs/sql/constraints).

Below, we list the table definitions for our dimension and fact tables.

```python
# Install DuckDB if not installed
%pip install duckdb
```



```python
import duckdb

# Start duckdb connection
db = duckdb.connect("retail_sales.db")
```

```python
# Create the date dimension
date_dim_ddl = """
CREATE OR REPLACE TABLE date_dim (
    date_id INTEGER PRIMARY KEY,
    date_value DATE
)
"""

db.execute(date_dim_ddl)
```

```python
# Create the store dimension
store_dim_ddl = """
CREATE OR REPLACE TABLE store_dim (
    store_id INTEGER PRIMARY KEY,
    store_name VARCHAR
)
"""

db.execute(store_dim_ddl)
```

```python
# Create the cashier dimension
cashier_dim_ddl = """
CREATE OR REPLACE TABLE cashier_dim (
    cashier_id INTEGER PRIMARY KEY,
    cashier_name VARCHAR
)
"""

db.execute(cashier_dim_ddl)
```

```python
# Create the product dimension
product_dim_ddl = """
CREATE OR REPLACE TABLE product_dim (
    product_id INTEGER PRIMARY KEY,
    product_name VARCHAR
)
"""

db.execute(product_dim_ddl)
```

```python
# Create the promotion dimension
promotion_dim_ddl = """
CREATE OR REPLACE TABLE promotion_dim (
    promotion_id INTEGER PRIMARY KEY,
    promotion_type VARCHAR
)
"""

db.execute(promotion_dim_ddl)
```

```python
# Create the payment method dimension
payment_method_dim_ddl = """
CREATE OR REPLACE TABLE payment_method_dim (
    payment_method_id INTEGER PRIMARY KEY,
    payment_method VARCHAR
)
"""

db.execute(payment_method_dim_ddl)
```

```python
# Create the retail sales fact table
retail_sales_ddl = """
CREATE OR REPLACE TABLE retail_sales (
    -- Foreign keys to the dimension tables
    date_id INTEGER REFERENCES date_dim(date_id),
    store_id INTEGER REFERENCES store_dim(store_id),
    cashier_id INTEGER REFERENCES cashier_dim(cashier_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    promotion_id INTEGER REFERENCES promotion_dim(promotion_id),
    payment_method_id INTEGER REFERENCES payment_method_dim(payment_method_id),
    -- Transaction number taken from the receipt
    pos_transaction_id INTEGER,
    -- Facts measured at the grain
    sales_quantity UINTEGER,
    regular_unit_price FLOAT,
    discount_unit_price FLOAT,
    net_unit_price FLOAT,
    extended_discount_dollars FLOAT,
    extended_sales_dollars FLOAT,
    extended_cost_dollars FLOAT,
    extended_gross_profit_dollars FLOAT,
    -- Key follows the declared grain: one row per product per sales transaction.
    -- A key built only from the dimension keys would reject the same product sold
    -- in two transactions on the same day by the same cashier.
    PRIMARY KEY (pos_transaction_id, product_id)
)
"""

db.execute(retail_sales_ddl)
```

```python
# Commit any pending changes and close the connection
db.commit()
db.close()
```


## Reference

Personal notes for educational purposes based on the book [The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling](https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802) by Ralph Kimball and Margy Ross, 3rd edition.


DuckDB. Why DuckDB? Retrieved April 30, 2023, from [https://duckdb.org/why_duckdb](https://duckdb.org/why_duckdb).