<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 5: Table Design and Normalization** 
_The science of bulletproofing your tables._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Tradeoffs that every designer makes
- The many kinds of keys and how they are used
- Table normalization and normal forms
- Table denormalization and when to use it
 

### **Skills / Know how to ...**
- Break a large table into normalized tables
- Use relational notation to describe table schema
- Detect when a choice of keys will potentially corrupt data
- Denormalize data in SQL to suit the needs of data analysts

--------
## **LESSON 5 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/rsCrjQck_jQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

### **Run this boilerplate code before continuing on.** 
 

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016



'Connected: buan6510student@lahman2016'

**Rerun this code as needed to keep your software up to date and database connection fresh.**  

---
## **BIG PICTURE: Why programmers suck at database design**
In some ways, programming is the most arrogant profession of all. Software is inherently malleable in ways that nothing else can possibly be. Like science fiction writers, programmers can alter the laws of physics to suit whatever the needs are at the time. However, unlike fiction writers, programmers can go a step further by building *and running* the universes they design. Software is not a simulation or a movie; it is the very embodiment of whatever the programmer wants it to be. That is real power, just in a very narrow universe. 

These days most programmers learn the craft by building *apps* of one sort or another. An app is all about process, handling whatever actions the user chooses. If there is data involved then it rarely exists beyond a single use of the app, perhaps as a cache or maybe as a message to be sent to a server somewhere in the cloud. Beyond that scope, the app programmer generally does not care. It's all beyond their control anyway. 

It is in the cloud that the persistent part of the software exists. If data is to be stored and shared among many users then it is a *systems* programmer who will design and build the necessary middleware and data repositories. For these programmers the world is less about the dynamics of the app and more about the permanent structures needed to keep it running. 

Some programmers, who like to call themselves *full stack developers*, do both frontend app and backend server development. However, if you dig even a little bit into their knowledge base, you will likely find that they are 80% frontend and 20% backend. They know just enough about the backend to keep the apps running but don't really like doing it very much. Instead, they are always looking for shortcuts so they can make the visible part of the app that much nicer. 

The same kind of milieu is common in data science, where the sexy frontend stuff that everybody sees is the models and the visuals. Like app developers, data scientists see data management as a chore. To them everything is just better if each project has a massive dataset (table) that they can build models from. If there are any bugs in the data then they will just program around them. Why not? The tools make it easy to do so. 

Where does this leave us? In a world where fewer and fewer programmers *really* understand database design. There just isn't enough to get excited about when one can get so much instant gratification from a UI tweak or running a fancy new machine learning algorithm. Honestly, who can blame them? Nobody is going to pat them on the back for getting the backend right but everybody will exclaim in excitement when an analytical model unearths a previously unknown insight and then distills it down to *just the right story*. 

That said, always be on the lookout for data errors that can't be programmed around. Sometimes they make the difference between being right and dead wrong. 

In this lesson we will learn about table design, starting with the tradeoffs a designer invariably has to make before moving on to the normatively *correct* techniques of normalization and table decomposition. We will conclude with a discussion of when to throw out correctness in favor of convenience, speed, or analyst preferences. 

---
## **Design Tradeoffs**
### **The Eternal Questions**

Design is about making decisions. If we make the right decisions, then the right systems get built and everybody is blissfully happy. We might not get the credit but people are happy nonetheless. If we make the wrong decisions then everybody is upset *at us*. 

So what is the right way to design a system? Well, if that were answerable in a paragraph, then it wouldn't be design. We have to consider what is being asked of the system, what solutions are available, and what we can afford. In other words, it comes down to tradeoffs and priorities. 

We will now take a look at a few eternal data design priorities, in what should be increasing order of importance for most applications. However, your mileage may vary depending on the situation. 

### **Minimizing Space**
In the old days before big data, storage was often the most expensive part of a computer system. Programmers would do just about any amount of programming to avoid buying new storage hardware. They would literally count characters to minimize the number of bytes a given file required on disk. 

To this end, they came up with some tricks that often shaved off kilobytes without having to resort to file compression. A few examples:
- Repeating fields, where each line of a file only recorded what was different from the line above.
- Cryptic codes in place of long strings of text. Often they were hardwired into the programs, working like magic incantations when used by people in the know. 
- Overloading fields so that multiple facts could be stored in one field. 
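
To make two of these tricks concrete, here is a minimal sketch of what a program had to do to read such files back. The field layout, the status codes, and the sample values are all hypothetical, invented purely for illustration.

```python
# Hypothetical lookup table for a one-character status code.
STATUS_CODES = {"A": "active", "S": "suspended", "C": "closed"}


def expand_repeating_fields(rows):
    """Fill each blank field with the value from the row above it."""
    expanded = []
    previous = None
    for row in rows:
        if previous is not None:
            row = [cur if cur != "" else prev for cur, prev in zip(row, previous)]
        expanded.append(row)
        previous = row
    return expanded


def unpack_account_field(packed):
    """Split an overloaded field into the three facts it hides.

    Assumed layout: 1-char status code, 2-char state, 4-digit year opened.
    """
    return {
        "status": STATUS_CODES[packed[0]],
        "state": packed[1:3],
        "year_opened": int(packed[3:7]),
    }


# Blank strings mean "same as the line above" (made-up sample data).
compact_rows = [["Smith", "CT", "100"], ["", "", "250"], ["Jones", "", "75"]]
full_rows = expand_repeating_fields(compact_rows)
account = unpack_account_field("ACT1998")
print(full_rows)
print(account)
```

Notice that neither file can be understood without the program that wrote it. The bytes saved on disk are paid for with knowledge that lives only in code.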

You can see this same kind of thinking today in the messages passed between the frontend app and a server. However, network bandwidth is becoming so plentiful that even this last bastion of space efficiency is rarely worth worrying about. 

Space is cheap and getting cheaper. 

### **Maximizing Calculation Speed**
Along the same lines as with space, raw speed has historically been prioritized over correctness. Long ago it was because computers were so slow. These days it is because we ask so much more of our computer systems. If we can shave 5% of the computing time off a given operation that will be performed billions of times, then it is well worth it to do so.

Relevant techniques for raw speed include:
- Precomputing whatever can be done in advance, even when it swells storage with redundant data.
- Approximating results whenever 100% fidelity is not strictly necessary.
- Locating data closer to each user, even when it means some data will be out of sync with others.
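
The first technique, precomputing, can be sketched in a few lines. The `SalesLedger` class and its sample amounts below are hypothetical. The sketch keeps a redundant running total next to the raw data so that reading the total never requires a scan, and it also shows how that redundancy can quietly go stale.

```python
from dataclasses import dataclass, field


@dataclass
class SalesLedger:
    """Hypothetical ledger that stores a precomputed total with the raw data."""
    amounts: list = field(default_factory=list)
    cached_total: int = 0  # redundant copy of sum(amounts)

    def add_sale(self, amount):
        self.amounts.append(amount)
        self.cached_total += amount  # keep the cache in step with the data

    def total_fast(self):
        return self.cached_total  # constant time, no scan

    def total_slow(self):
        return sum(self.amounts)  # always correct, but scans every row


ledger = SalesLedger()
for amount in (100, 250, 75):
    ledger.add_sale(amount)
in_sync = ledger.total_fast() == ledger.total_slow()

# Any change that bypasses add_sale leaves the cached total stale.
ledger.amounts[0] = 999
stale = ledger.total_fast() != ledger.total_slow()
print(in_sync, stale)
```

The fast path wins on speed only as long as every write goes through `add_sale`. That is the tradeoff in miniature: redundant data buys speed at the risk of disagreement.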

Of course, computers keep getting faster. Even so, expect speed to remain a priority, because demand for raw speed will likely grow faster than we can build bigger and faster hardware.  

### **Maximizing Coherency**
Coherency is the ability to make sense of the data. Do all the facts fit together to tell coherent stories? Is each fact expressed in the best possible way? 

Generally, data coherency has been the domain of data modelers, who are more concerned with the stories than the data itself:
- What are the entities being tracked?
- What data is collected about each one? 
- How do the entities relate to each other? 

These sorts of questions never get old. They focus on the same things that app developers and data scientists care about. 

We will touch on some of these questions in this lesson, then devote the bulk of Lesson 6 to entity relationship modeling. 

### **Minimizing Risk of Data Corruption**
Data integrity is an essential quality that never gets old. It is literally seeking to put the truth (and only the truth) into our databases. It is getting harder and harder to achieve, however. 

Big data is ugly data. It often comes in corrupted, forcing the database system to clean it up before it can be stored. If the system is going to do that then it needs to have a goal, a definition of what *correct* and *clean* are. If data can't be fixed then the system should reject it rather than accept a lie as the truth. 
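
As a rough sketch of what "reject it rather than accept a lie" can look like, the example below declares a few rules in an in-memory SQLite table and lets the database refuse rows that break them. The `player_season` table, its columns, and the sample rows are hypothetical and are not taken from the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The constraints are the definition of "correct" for this hypothetical table.
conn.execute("""
    CREATE TABLE player_season (
        player_id TEXT    NOT NULL,
        year_id   INTEGER NOT NULL,
        at_bats   INTEGER NOT NULL CHECK (at_bats >= 0),
        hits      INTEGER NOT NULL CHECK (hits >= 0),
        PRIMARY KEY (player_id, year_id),
        CHECK (hits <= at_bats)
    )
""")

incoming = [
    ("player01", 2016, 500, 150),  # clean
    ("player01", 2016, 500, 150),  # duplicate key
    ("player02", 2016, 400, -5),   # negative hits
    ("player03", 2016, 100, 120),  # more hits than at-bats
]

accepted, rejected = [], []
for row in incoming:
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO player_season VALUES (?, ?, ?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError as err:
        rejected.append((row, str(err)))

print(len(accepted), "accepted;", len(rejected), "rejected")
```

Only the first row survives. The other three never reach the table, so no later query has to guess which version of the facts is true.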

It is this last design priority that is at the heart of table design and normalization. If we design our tables so that they follow a few (not-so-easy) rules, then we can avoid the vast majority of data corruption errors or, as we will call them, **data anomalies**. 
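
One common kind of data anomaly, usually called an update anomaly, can be reproduced in a few lines. The `roster` table and its made-up rows below are only an illustration: the team's city is repeated on every player row, so a change that touches one row leaves the table contradicting itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical single-table design: team_city is repeated for every player.
conn.execute("CREATE TABLE roster (player TEXT, team TEXT, team_city TEXT)")
conn.executemany(
    "INSERT INTO roster VALUES (?, ?, ?)",
    [
        ("Player A", "Hawks", "Fairfield"),
        ("Player B", "Hawks", "Fairfield"),
    ],
)

# The team moves, but the update only reaches one of its rows.
conn.execute("UPDATE roster SET team_city = 'Bridgeport' WHERE player = 'Player A'")

cities = conn.execute(
    "SELECT DISTINCT team_city FROM roster WHERE team = 'Hawks' ORDER BY team_city"
).fetchall()
print(cities)  # one team, two different cities
```

Nothing in this table design stops the second city from appearing. Storing each team's city exactly once, in a table of its own, would make the contradiction impossible to express.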

---
## **Relational Notation and Functional Dependencies**


---
## **Keys**
 


---
## **Normal Forms**
 


---
## **PRO TIPS: How to selectively _denormalize_**



---
## **SQL AND BEYOND: EAV Models and NoSQL**

---
## **Congratulations! You've made it to the end of Lesson 5.**

In this lesson our treatment of table design  



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.