<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 10: Dimensional Data Warehouse Design** 
_Data Driven Analytics_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- 

### **Skills / Know how to ...**
- 

--------
## **LESSON 10 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/ZSVRiOfodDY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

## **BIG PICTURE: The Holy Grail of Data Warehousing**

Data warehousing as a universal data repository has been a dream since the term first emerged in the 1980s, with Ralph Kimball among its earliest champions. Analysts had by that point built up quite a repertoire of models for just about any kind of analysis. They could create classification and regression trees (decision trees, random forests, etc.). They could do linear, nonlinear, kernel, and logistic regression with large datasets. They could solve optimization problems with thousands of variables and tens of thousands of constraints. Even the neural network models at the core of the latest and greatest deep learning techniques were pretty mature by 1994.  

What was missing was data! Well, sort of. What we had was lots of transactional data locked up in siloed mainframe systems. Online banking (via ATMs) had been a reality since the 1970s, credit card transactions since the 1960s, and the air traffic control system since the 1950s. These systems were great at capturing event data, one event at a time. They could even handle bulk transactions and reporting, if you could wait for the job to run overnight. And heaven forbid that you might want data in a different format, or data that wasn't already included in a standard report.  

However, analytical data is not the same as transactional data. Among the differences: 
- **Temporal Scope:** Emphasis on historical data rather than real-time operations
- **Diversity:** Drawing on multiple sources instead of a single transaction system
- **Data Quality:** "good enough to run the company" was not good enough for analytical modeling
- **Performance Tradeoffs:** Focus on computational speed (on desktop computers) instead of data throughput (on shared mainframes)
- **Access:** Providing analysts with direct access to transaction systems is both insecure and expensive

Even back then the ultimate solutions were known, though not remotely close to being available. Consider, for example, this figure from Kimball's seminal book *The Data Warehouse Lifecycle Toolkit*, published in 1998 and based on original work **from the mid-1980s**.

![Kimball's Data Warehouse](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L10_Kimball_Data_Warehouse_Elements.png)

All of the elements of a modern data pipeline are there. The book even articulated the steps of the ETL process in detail. It was all there ... to be realized *someday*. 

Someday is now. With commodity data storage, ample computing power, ready-made software for just about any kind of modeling, and the analytical results to attract attention from management, data infrastructure is finally seen as what it should have been all along: a critical resource upon which the company relies to make it stand out against the competition. 

That's what the vendors tell us, anyway. 

In this lesson we will explore the **Dimensional Data Warehouse** model first proposed by Kimball et al. all those years ago. We will also consider how its mass adoption has influenced the SQL standard in recent years, with the addition of extensions like  **arrays**, **structs**, and  **partitions** that relax fundamental assumptions of the relational data model. 

---
## **The Star Schema Pattern**

The star schema used by dimensional data warehouses is what we call a **design pattern**, a standard solution to a standard problem. By giving each pattern a name, designers can discuss their work with others without having to explain their decisions anew each time. As design patterns pass into common usage, they form a kind of shorthand design language that works in a given problem domain. 

The star schema pattern solves the problem of the impedance mismatch between how data is recorded and how it is used by analysts. It reduces all data down to measures (facts) and labels (dimensions). Since *all information* takes that form eventually anyway, the pattern strikes a nice balance between structure and general applicability. 

In our dimensional model, we end up with several moderately-sized **dimension tables** surrounding a large (possibly massive) **fact table**:
- The dimensions are somewhat timeless and immutable. They define the **context** for the facts. Even when the facts themselves may change over time, the dimensions remain relatively static. 
- The facts are somewhat volatile, with new **measures** continually added and others redefined to suit the ever-changing needs of the analysts. If there is a way to precompute a statistic or other measure so analysts don't have to, then do it. If a given measure is no longer needed or is misleading, then redefine or remove it. 
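
To make the shape of the pattern concrete, here is a minimal sketch of a star schema built in an in-memory SQLite database. The table and column names are illustrative assumptions, not the actual NBA PlayFacts design; the point is only that the fact table carries one foreign key per dimension plus a few measures.

```python
import sqlite3

# Hypothetical, simplified schema: names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive, relatively static.
CREATE TABLE player_dim (player_id INTEGER PRIMARY KEY, player_name TEXT, position TEXT);
CREATE TABLE team_dim   (team_id   INTEGER PRIMARY KEY, team_name   TEXT, conference TEXT);
CREATE TABLE game_dim   (game_id   INTEGER PRIMARY KEY, game_date   TEXT, arena TEXT);

-- Fact table: one foreign key per dimension, followed by the measures.
CREATE TABLE play_facts (
    play_id   INTEGER PRIMARY KEY,
    player_id INTEGER REFERENCES player_dim(player_id),
    team_id   INTEGER REFERENCES team_dim(team_id),
    game_id   INTEGER REFERENCES game_dim(game_id),
    points    INTEGER,
    rebounds  INTEGER
);
""")

# List the tables that now make up the "star".
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Adding a new measure later means adding a column to `play_facts`; the dimension tables do not need to change.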

Once again, here is the NBA PlayFacts warehouse, this time noting some of the key features. We will use it as an example, starting with the dimensions before moving on to the facts. We will also explore variations on the general star schema pattern that fit certain use cases. 
![NBA PlayFacts Dim DW](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L10_Star_Schema_Notes.png)



### **Dimension Tables**
Dimensions represent the things that we use to describe a given fact. In other words, the dimensions define the lens through which we view each fact. They are what give it context and meaning. 

In theory the **dimensions exist before the fact data is collected**. So, how do we know what they are before we have any data? A good starting point is the [Five Ws framework](https://en.wikipedia.org/wiki/Five_Ws) (plus How) used by journalists and storytellers the world over:
- **Who was involved?** People, roles, etc. 
- **What happened?** Event types, outcomes, etc. 
- **When did it happen?** Timing or place in a sequence
- **Where did it happen?** Location, which may be conceptual rather than physical
- **Why did it happen?** Intent, cause, etc. 
- **How did it happen?** Steps, sequential logic, etc. 

Though there is some disagreement about this, the usual recommendation is that dimension tables be fully denormalized. Since they are often fairly small (relative to the fact table) and don't change much, there is little chance of creating anomalies over time. So, while it may be tempting to, for example, normalize out zip codes and cities from a location dimension, there is no real need, especially when it would require an unnecessary table join.
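
As a rough illustration of a denormalized dimension, the sketch below keeps city, state, and zip code as plain columns on a single location table. The table name, columns, and rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical denormalized dimension: city and zip code repeat across rows
-- instead of living in their own lookup tables.
CREATE TABLE location_dim (
    location_id INTEGER PRIMARY KEY,
    arena_name  TEXT,
    city        TEXT,
    state       TEXT,
    zip_code    TEXT
);
INSERT INTO location_dim VALUES
    (1, 'Arena One',   'Springfield', 'XX', '00001'),
    (2, 'Arena Two',   'Springfield', 'XX', '00001'),
    (3, 'Arena Three', 'Shelbyville', 'XX', '00002');
""")

# Grouping by city needs no join because the city is already on the row.
rows = conn.execute("""
    SELECT city, zip_code, COUNT(*) AS arenas
    FROM location_dim
    GROUP BY city, zip_code
    ORDER BY city
""").fetchall()
print(rows)
```

The repeated city and zip code values cost a little storage, but the table is small and rarely updated, so the usual update anomalies are unlikely to bite.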

As we can see in the NBA example, the first question (who is involved in a given play) is answered with three dimensions: 
- the individual player who gets credited with each event
- the lineup of players on the court at the time
- the team whose play is being reflected by the facts

It is done this way to support different, independently-calculated measures:
- the counting stats (points, rebounds, assists, etc.) for an individual player, lineup, or team
- the total playing time for each player or lineup on the court, even when they are not generating counting stats

An often overlooked but potentially tricky aspect of dimension design is whether dimensions are allowed to overlap. In other words, can the same dimension be represented two different ways? Can we combine dimensions to create a third uber-dimension? 

Like a lot of things, it depends. For example, we could combine geo-location data (addresses) with organizational hierarchy (offices, districts, regions, etc.) into a single dimension if that is how the data is usually grouped. The result is several levels of **granularity**, all stored in one dimension table. For a different analytical application, however, we may want to keep things more separated, especially if the geographic locations overlap. For example, the same physical location may mix personnel from multiple functions or divisions within a company. Then it would not make sense to treat locations as points in the organizational hierarchy. 

We will go into this idea more in the Pro Tips section, where we will discuss the peculiar logic of longitudinal segmentation (i.e., slicing time). 

### **Fact Tables**
Fact tables exist at the intersection of the dimension tables. Each fact is labeled with foreign keys to the dimension tables, usually one key per dimension. The rest of the columns are measures that can be used in aggregate calculations.  

What makes a good measure? Anything from which we can calculate a descriptive statistic: 
- For text data, we generally are limited to the text itself and counts of some sort. We may, for example, count the number of times the word "no" appears, how many sentences there are, etc. 
- For numerical data we can use all of the usual statistics like mean, maximum, minimum, etc. 
- For temporal data (dates and times), we may calculate elapsed times, inter-event times, cumulative times, etc. that can be treated like numerical data
- For binary data (pictures, etc.), the options are very limited, though one may be able to apply a machine learning technique to generate numerical digests that can be aggregated.  
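
The sketch below shows what aggregating a few of these measure types might look like. The `play_facts` columns and the sample rows are assumptions invented for the example: a numerical measure is summed, a temporal measure (elapsed seconds) is averaged, and a text column is reduced to a count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical fact rows: foreign keys first, then the measures.
CREATE TABLE play_facts (
    player_id       INTEGER,
    game_id         INTEGER,
    points          INTEGER,  -- numerical measure
    elapsed_seconds REAL,     -- temporal measure, treated as a number
    description     TEXT      -- text, useful mainly for counting
);
INSERT INTO play_facts VALUES
    (1, 10, 2, 14.0, 'made layup'),
    (1, 10, 0, 22.5, 'missed jumper'),
    (1, 10, 3,  9.5, 'made three'),
    (2, 10, 0, 30.0, 'missed layup'),
    (2, 10, 2, 18.0, 'made dunk');
""")

rows = conn.execute("""
    SELECT player_id,
           SUM(points)                     AS total_points,
           AVG(elapsed_seconds)            AS avg_elapsed,
           SUM(description LIKE 'missed%') AS missed_shots
    FROM play_facts
    GROUP BY player_id
    ORDER BY player_id
""").fetchall()

for row in rows:
    print(row)
```

In SQLite a comparison such as `description LIKE 'missed%'` evaluates to 1 or 0, so summing it counts the matching rows. Other database engines may need a `CASE` expression for the same effect.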

Interestingly, the measures can only be as granular as the dimensions allow. In other words, if a given dimension only has 3 possible labels, then that dimension can only divide up the facts three ways. Each additional dimension, however, increases the granularity accordingly. If every dimension were to have two possible values, then a single dimension could divide the facts into 2 groups. With two dimensions, we could divide into up to 4 groups. With three dimensions, we could generate up to 8 groups, etc. 
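
The arithmetic above can be checked with a few lines of Python. The three two-valued dimensions here are hypothetical; the maximum number of groups is simply the number of distinct label combinations.

```python
from itertools import product

# Three hypothetical dimensions, each with exactly two possible labels.
dimensions = {
    "team": ["home", "away"],
    "half": ["first", "second"],
    "outcome": ["made", "missed"],
}

labels = list(dimensions.values())
group_counts = []
for k in range(1, len(labels) + 1):
    # Every combination of labels across the first k dimensions is one potential group.
    cells = list(product(*labels[:k]))
    group_counts.append(len(cells))

print(group_counts)  # [2, 4, 8]
```

These are upper bounds: a combination only shows up as a group if at least one fact actually carries that set of labels.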

One way to visualize this is with a (hyper-) cube, with each dimension on a side. Each fact is *binned* inside one of the smaller cubes at the intersection of the dimensions. For the NBA PlayFacts cube below, each fact is binned based on the game, team, and player. Thus, with only three dimensions, we would only be able to generate boxscore stats for full games. In order to get statistics within a game (e.g., for the last two minutes of each period) we would need to include the play segment dimension. (Don't ask about how we'd show a 4-dimensional cube. Just know that we can.)

![Data Cube](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L10_DataCube_wide.png)

> **Heads Up:** It is sometimes difficult to distinguish dimensions from measures when source data is numerical. For example, is the time on the clock (i.e., seconds remaining in the period) a measure or a dimension? It is a measure, in the sense that it captures the passage of time, but it is also a dimension, in that it records when a given event happened. The key when examining any given quantity is to ask whether you would ever aggregate it (sum, average, etc.) or just cite it. In a basketball game it is the latter, so we separate it out into the play segment dimension. The clock *interval* between events (elapsed time), however, is something that we can sum up by quarter, player, etc. Thus, it belongs on the fact table. 
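
To see the difference between the clock reading and the clock interval, consider this small sketch. The events are invented, each recorded as a period number and the seconds remaining in that period. The clock reading is only ever cited, while the derived interval is the quantity worth summing.

```python
# Hypothetical play-by-play data: (period, seconds remaining on the clock).
events = [(1, 720), (1, 702), (1, 675), (1, 660), (2, 720), (2, 690)]

elapsed_by_period = {}
for (prev_period, prev_clock), (period, clock) in zip(events, events[1:]):
    if period != prev_period:
        continue  # the clock resets between periods, so no interval spans them
    interval = prev_clock - clock  # seconds elapsed since the previous event
    elapsed_by_period[period] = elapsed_by_period.get(period, 0) + interval

print(elapsed_by_period)  # {1: 60, 2: 30}
```

Summing the raw clock readings (720 + 702 + ...) would produce a number with no meaning, which is a good sign that the reading belongs in a dimension rather than among the measures.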

### **Do we really need dimension *tables*?**
One of the advantages of keeping dimensional data in separate tables is that it can significantly reduce storage costs by eliminating redundant data. However, with the advent of cloud-based data storage at commodity prices, storage becomes less important than performance. Thus, we may choose to denormalize everything into a single table that doesn't require any expensive joins. It's simple enough. If we already have the data in a normalized form, then we would just need to join in every table and select every column to generate a new "one table fits all" data warehouse. The dimensions would still be there, just as columns instead of tables.
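
A minimal sketch of the "one table fits all" approach follows, assuming a made-up `team_dim` and `play_facts` pair. The join is paid once, when the wide table is built, and later queries read the dimension as an ordinary column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized source tables.
CREATE TABLE team_dim   (team_id INTEGER PRIMARY KEY, team_name TEXT);
CREATE TABLE play_facts (play_id INTEGER PRIMARY KEY, team_id INTEGER, points INTEGER);
INSERT INTO team_dim   VALUES (1, 'Team A'), (2, 'Team B');
INSERT INTO play_facts VALUES (100, 1, 2), (101, 2, 3), (102, 1, 3);

-- Join everything once and store the result as a single wide table.
CREATE TABLE play_facts_wide AS
SELECT f.play_id, f.points, t.team_name
FROM play_facts AS f
JOIN team_dim AS t ON t.team_id = f.team_id;
""")

# The dimension is now just a column, so no join is needed at query time.
rows = conn.execute("""
    SELECT team_name, SUM(points) AS total_points
    FROM play_facts_wide
    GROUP BY team_name
    ORDER BY team_name
""").fetchall()
print(rows)
```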

While that sounds great in theory, in practice there are two good reasons for creating separate dimension tables: 
- If care is taken to **use just foreign keys in the `GROUP BY` clause**, then it can actually be faster to query multiple tables than a single table. This is because the joins will happen **after** the grouping has reduced the data to something manageable. The incremental performance cost of the join is then practically nil, especially if there are a small number of groups in the result set. 
- Dimension tables provide opportunities to add in static descriptive data. For example, we could add in the seating capacity or age of a given basketball arena if we treat it as a dimension table instead of just a column. 
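
Both points can be sketched in one query. The example below assumes a hypothetical `arena_dim` with a static `capacity` column. It groups the facts by foreign key in a subquery first and only then joins the handful of resulting rows to the dimension table. Whether a particular database actually executes the steps in that order is up to its query planner, so treat the performance claim as something to measure rather than assume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension with static descriptive data (capacity).
CREATE TABLE arena_dim (arena_id INTEGER PRIMARY KEY, arena_name TEXT, capacity INTEGER);
CREATE TABLE play_facts (
    play_id  INTEGER PRIMARY KEY,
    arena_id INTEGER REFERENCES arena_dim(arena_id),
    points   INTEGER
);
INSERT INTO arena_dim VALUES (1, 'Arena One', 18000), (2, 'Arena Two', 20000);
INSERT INTO play_facts (arena_id, points) VALUES (1, 2), (1, 3), (2, 2), (1, 1), (2, 3);
""")

query = """
    SELECT d.arena_name, d.capacity, f.total_points
    FROM (
        -- Step 1: group on the foreign key only, shrinking the facts to one row per arena.
        SELECT arena_id, SUM(points) AS total_points
        FROM play_facts
        GROUP BY arena_id
    ) AS f
    -- Step 2: join the small grouped result to the dimension for its labels.
    JOIN arena_dim AS d ON d.arena_id = f.arena_id
    ORDER BY d.arena_name
"""
rows = conn.execute(query).fetchall()
print(rows)
```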

Whether either of these advantages is relevant depends on the situation. As a general rule, unless you have a good reason not to, it is best to create dimension tables instead of dimension columns. 

### **Snowflakes and Galaxies**
 


---
## **Rollups and Drilldowns**

---
## **NBA LineupFacts Data Mart**









---
## **PRO TIPS: How to Handle Longitudinal Dimensions**





---
## **SQL AND BEYOND: Arrays, Structs, and Partitions**

---
## **Congratulations! You've made it to the end of Lesson 10.**

Next week we will consider alternative data models that can improve flexibility and performance. 



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.