Relational Database Modeling
===

> Those who cannot remember the past are condemned to repeat it.

-- _The Life of Reason_, by George Santayana

By The End of This Session You Will Know:
---
- A brief history of relational database management systems (RDBMS)
- DB Design 101
- The basics of Relational Algebra
- The common types of keys
- 1st, 2nd, and 3rd Normal forms
- ACID (the other kind)
- Object-relational mapping (ORM), aka THE HORROR!

---
What is a relational database management system (RDBMS)?

> A program that lets you create, update, and administer a relational database. 

---
DB Evolution
---

* 1960s
    - Hierarchical data structure (IBM IMS)
    - Network data structure (CODASYL)

* 1970s
    - Relational data model
        * A Relational Model of Data for Large Shared Data Banks – E. F. Codd [1970]
    - System R (IBM), Ingres (Berkeley)

* 1980s
    - Commercialization of RDBMS (relational database management systems)
        * Oracle, Sybase, IBM DB2, Informix
    - SQL (Structured Query Language)
    - ACID (**A**tomic, **C**onsistent, **I**solated, **D**urable)

* 1990s
    - PC RDBMS
        * Paradox, Microsoft SQL Server & Access
    - Larger DBs, driven by internet
    - Consolidation among commercial DB vendors

* 2000s
    - Commercialization of Open Source RDBMS
        * MySQL, Postgres
    - Evolving requirements expose RDBMS limitations
        * Storing complex and dynamic objects
        * Processing increasing data volumes
        * Analyzing massive amounts of data



---
Design
---
![](images/entity.png)

For every entity, create a corresponding table.

For every relationship, create a table.
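
As a rough sketch of these two rules, assume a hypothetical model with two entities, `student` and `course`, and one relationship between them, `enrollment`. None of these names come from the diagram above; they are made up for illustration. Using Python's built-in `sqlite3` module, it might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per entity (hypothetical entities)
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- One table per relationship: which student takes which course
    CREATE TABLE enrollment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course_id  INTEGER NOT NULL REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")

# List the tables that now exist in the database
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The relationship table holds nothing but references to the two entity tables, which is what lets one student take many courses and one course have many students.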

---
Relational Algebra
---

Relational algebra is the pure mathematical theory behind any specific RDBMS.

It has 5 primitive operators:

1. Selection
2. Projection (i.e., picking a subset of the available columns)
3. Cartesian product (also called the cross product or cross join)
4. Set union
5. Set difference

These are the first principles. If you are rolling your own (RYO) DB, you should know them inside and out.
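
To make the five operators concrete, here is a minimal sketch that models a relation as a Python `set` of tuples. The helper names (`select`, `project`, `cartesian`) and the sample data are assumptions for illustration, not part of any library; union and difference map directly onto Python's own set operators.

```python
from itertools import product

def select(rows, predicate):
    """Selection: keep the tuples that satisfy a predicate."""
    return {r for r in rows if predicate(r)}

def project(rows, indexes):
    """Projection: keep only the chosen columns (duplicates collapse)."""
    return {tuple(r[i] for i in indexes) for r in rows}

def cartesian(a, b):
    """Cartesian product: every tuple of a paired with every tuple of b."""
    return {x + y for x, y in product(a, b)}

# Hypothetical sample relations: (employee, skill) and (employee, office)
employees = {("Brown", "Typing"), ("Jones", "Typing"), ("Jones", "Whittling")}
offices = {("Jones", "114 Main Street")}

typists = select(employees, lambda r: r[1] == "Typing")
names = project(employees, [0])
pairs = cartesian(names, offices)
everyone = names | project(offices, [0])   # set union
no_office = names - project(offices, [0])  # set difference
```

Note that `project` returns fewer tuples than it was given whenever two rows agree on the kept columns. That is the set semantics of the algebra, and it differs from SQL, which keeps duplicates unless you ask for `DISTINCT`.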

[Introduction to Relational Algebra](http://en.wikipedia.org/wiki/Relational_algebra) <br>
[Deeper dive into relational algebra](http://www.tutorialspoint.com/dbms/relational_algebra.htm)

Keys
-------------
* A **candidate key** of a relation is a minimal superkey for that relation; that is, a set of attributes such that:
    1. the relation does not have two distinct tuples (*i.e.* rows or records in common database language) with the same values for these attributes (which means that the set of attributes is a superkey)
    2. there is no proper subset of these attributes for which (1) holds (which means that the set is minimal).  
* A **primary key** is a candidate key that is chosen to uniquely identify each tuple (row) in a relation (table). *N.B.* A **candidate key** is a logical construct. A **primary key** is an implementation detail.
* A **foreign key** (when enforced) indicates a dependency between two relations (tables).
    - it usually (though not necessarily) refers to a candidate key of the other table
    - referential integrity holds when every foreign key value matches an existing candidate key value in the referenced table

![Referential Integrity](https://s3-us-west-2.amazonaws.com/dsci6007/assets/Referential_integrity_broken.png)
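
A small sketch of what "when enforced" means in practice, using Python's built-in `sqlite3`. The `artist` and `album` tables and their rows are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; other systems have different defaults.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE album (
        album_id  INTEGER PRIMARY KEY,
        title     TEXT,
        artist_id INTEGER REFERENCES artist(artist_id)
    )
""")
conn.execute("INSERT INTO artist VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO album VALUES (10, 'Valid Album', 1)")

try:
    # artist_id 99 does not exist, so this row would break referential integrity
    conn.execute("INSERT INTO album VALUES (11, 'Orphan Album', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

Without the `PRAGMA` line the orphan row would be accepted silently, which is how a database ends up in the broken state pictured above.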

---
Normalization
---

The objectives of normalization were stated as follows by Codd:

> 1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
> 2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
> 3. To make the relational model more informative to users;
> 4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.  

Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34

First normal form (1NF)  
------------------------
A relation is in *first normal form* if the domain of each attribute contains only **atomic** values, and the value of each attribute contains only a **single value** from that domain.

***The following scenario illustrates how a database design might violate first normal form.***

*Taken from [http://en.wikipedia.org/wiki/First_normal_form](http://en.wikipedia.org/wiki/First_normal_form)*

Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

**Customer**

| Customer ID | First Name | Surname   | Telephone Number |
| -----------:| ---------- | --------- | ---------------- |
|         123 | Robert     | Ingram    | 555-861-2025     |
|         456 | Jane       | Wright    | 555-403-1659     |
|         789 | Maria      | Fernandez | 555-808-9633     |

The designer then becomes aware of a requirement to record **multiple** telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

**Customer**

| Customer ID | First Name | Surname   | Telephone Number |
| -----------:| ---------- | --------- | ---------------- |
|         123 | Robert     | Ingram    | 555-861-2025     |
|         456 | Jane       | Wright    | 555-403-1659 <br> 555-776-4100 |
|         789 | Maria      | Fernandez | 555-808-9633     |

Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical _relational database management system_ will not allow fields in a table to contain multiple values in this way.

### A design that complies with 1NF

A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.

**Customer Name**

| Customer ID | First Name | Surname   |
| -----------:| ---------- | --------- |
|         123 | Robert     | Ingram    |
|         456 | Jane       | Wright    |
|         789 | Maria      | Fernandez |


**Customer Telephone Number**

| Customer ID | Telephone Number |
| ---:| --- |
| 123 | 555-861-2025 |
| 456 | 555-403-1659 |
| 456 | 555-776-4100 |
| 789 | 555-808-9633 |

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer.
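
Here is one way the two-table design could be sketched with Python's built-in `sqlite3`. The table and column names are assumptions. Making `telephone` the primary key of the child table is one possible way to express "each telephone number belongs to one, and only one customer".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_name (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        surname     TEXT
    );
    CREATE TABLE customer_telephone (
        telephone   TEXT PRIMARY KEY,  -- a number belongs to exactly one customer
        customer_id INTEGER NOT NULL REFERENCES customer_name(customer_id)
    );
""")
conn.executemany("INSERT INTO customer_name VALUES (?, ?, ?)", [
    (123, "Robert", "Ingram"),
    (456, "Jane", "Wright"),
    (789, "Maria", "Fernandez"),
])
conn.executemany(
    "INSERT INTO customer_telephone (customer_id, telephone) VALUES (?, ?)", [
        (123, "555-861-2025"),
        (456, "555-403-1659"),
        (456, "555-776-4100"),
        (789, "555-808-9633"),
    ])

# One-to-many: a join returns one row per telephone number
janes_numbers = [row[0] for row in conn.execute("""
    SELECT t.telephone
    FROM customer_name AS n
    JOIN customer_telephone AS t ON t.customer_id = n.customer_id
    WHERE n.surname = 'Wright'
    ORDER BY t.telephone
""")]
print(janes_numbers)
```

Every field holds a single atomic value, and adding a third number for Jane is just one more row rather than a change to an existing field.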

Second normal form (2NF) 
------------------------
A table is in 2NF *if and only if* it is in 1NF and every *non-prime attribute* of the table is dependent on *the whole of a candidate key*.

***Consider a table describing employees' skills:***

**Employees' Skills**

| Employee 	|Skill 	|Current Work Location|
| --------- | --------- | --------------- |
| Brown 	|Light Cleaning 	|73 Industrial Way|
| Brown 	|Typing 	|73 Industrial Way|
| Harrison 	|Light Cleaning 	|73 Industrial Way|
| Jones 	|Shorthand 	|114 Main Street|
| Jones 	|Typing 	|114 Main Street|
| Jones 	|Whittling 	|114 Main Street|

Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.
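
The reasoning above can be checked mechanically against the sample rows. The sketch below brute-forces every attribute combination and keeps the minimal ones that identify each row uniquely. One caveat: a key is really a statement about all data the table may ever hold, so sample data can only rule candidates out, never prove them. The function names are made up for this example.

```python
from itertools import combinations

attributes = ("Employee", "Skill", "Current Work Location")
rows = [
    ("Brown", "Light Cleaning", "73 Industrial Way"),
    ("Brown", "Typing", "73 Industrial Way"),
    ("Harrison", "Light Cleaning", "73 Industrial Way"),
    ("Jones", "Shorthand", "114 Main Street"),
    ("Jones", "Typing", "114 Main Street"),
    ("Jones", "Whittling", "114 Main Street"),
]

def is_superkey(indexes, rows):
    """True if no two rows share the same values for these columns."""
    projected = {tuple(r[i] for i in indexes) for r in rows}
    return len(projected) == len(rows)

def candidate_keys(attributes, rows):
    """Minimal superkeys, found by trying the smallest combinations first."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(range(len(attributes)), size):
            # Skip anything that contains an already-found key: it is not minimal
            if any(set(key) <= set(combo) for key in keys):
                continue
            if is_superkey(combo, rows):
                keys.append(combo)
    return [tuple(attributes[i] for i in key) for key in keys]

keys = candidate_keys(attributes, rows)
print(keys)
```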

*Taken from [http://en.wikipedia.org/wiki/Second_normal_form](http://en.wikipedia.org/wiki/Second_normal_form)*

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "**Shorthand**" and "**Typing**" records and not update his "**Whittling**" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

**Employees' Skills**

| Employee 	|Skill 	|Current Work Location|
| - | - | - |
| Brown 	|Light Cleaning 	|73 Industrial Way|
| Brown 	|Typing 	|73 Industrial Way|
| Harrison 	|Light Cleaning 	|73 Industrial Way|
| Jones 	|Shorthand 	|*414 Brannon Street* |
| Jones 	|Typing 	|*414 Brannon Street* |
| Jones 	|Whittling 	|114 Main Street|
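
The partial dependency and the anomaly can both be expressed as a check on a functional dependency. This is a minimal sketch; `holds` is a hypothetical helper, and the rows are abbreviated to the columns that matter here.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same left-hand side must always map to the same right-hand side
        if seen.setdefault(key, value) != value:
            return False
    return True

original = [
    {"Employee": "Brown", "Skill": "Light Cleaning", "Location": "73 Industrial Way"},
    {"Employee": "Brown", "Skill": "Typing", "Location": "73 Industrial Way"},
    {"Employee": "Harrison", "Skill": "Light Cleaning", "Location": "73 Industrial Way"},
    {"Employee": "Jones", "Skill": "Shorthand", "Location": "114 Main Street"},
    {"Employee": "Jones", "Skill": "Typing", "Location": "114 Main Street"},
    {"Employee": "Jones", "Skill": "Whittling", "Location": "114 Main Street"},
]

# The partial update: two of Jones' three rows are changed
partially_updated = [dict(row) for row in original]
for row in partially_updated:
    if row["Employee"] == "Jones" and row["Skill"] != "Whittling":
        row["Location"] = "414 Brannon Street"

before = holds(original, ["Employee"], ["Location"])
after = holds(partially_updated, ["Employee"], ["Location"])
print(before, after)
```

Location depends on Employee alone, which is only part of the key {Employee, Skill}. That is the 2NF violation, and it is why a partial update can leave the table contradicting itself.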

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:


**Employees**

| Employee 	|Current Work Location
| --- | ---
| Brown 	|73 Industrial Way
| Harrison 	|73 Industrial Way
| Jones 	|114 Main Street


**Employees' Skills**

|Employee 	|Skill
| --- | ---
|Brown 	|Light Cleaning
|Brown 	|Typing
|Harrison 	|Light Cleaning
|Jones 	|Shorthand
|Jones 	|Typing
|Jones 	|Whittling

Neither of these tables can suffer from update anomalies.

Not all 2NF tables are free from update anomalies, however. This brings us to...

Third normal form (3NF) 
------------------------
3NF was originally defined by E.F. Codd in 1971.

A table is in 3NF *if and only if* it is in 2NF and every *non-prime attribute* of the table is *non-transitively* (i.e. directly) dependent on *every superkey* of that table.

Or...

> "[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

* Requiring existence of "the key" ensures that the table is in 1NF
* Requiring that non-key attributes be dependent on "the whole key" ensures 2NF
* Requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF

***An example of a 2NF table that fails to meet the requirements of 3NF is:***

**Tournament Winners**

|Tournament 	|Year 	|Winner 	|Winner Date of Birth
| --- | --- | --- | ---
|Indiana Invitational 	|1998 	|Al Fredrickson 	|21 July 1975
|Cleveland Open 	|1999 	|Bob Albertson 	|28 September 1968
|Des Moines Masters 	|1999 	|Al Fredrickson 	|21 July 1975
|Indiana Invitational 	|1999 	|Chip Masterson 	|14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute **Winner Date of Birth** is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

*Taken from [http://en.wikipedia.org/wiki/Third_normal_form](http://en.wikipedia.org/wiki/Third_normal_form)*

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

**Tournament Winners**

|Tournament 	|Year 	|Winner
| --- | --- | ---
|Indiana Invitational 	|1998 	|Al Fredrickson
|Cleveland Open 	|1999 	|Bob Albertson
|Des Moines Masters 	|1999 	|Al Fredrickson
|Indiana Invitational 	|1999 	|Chip Masterson

**Player Dates of Birth**

|Player 	|Date of Birth
| --- | ---
|Chip Masterson 	|14 March 1977
|Al Fredrickson 	|21 July 1975
|Bob Albertson 	|28 September 1968


Update anomalies cannot occur in these tables, which are both in 3NF.
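
A sketch of the split design using Python's built-in `sqlite3`, with assumed table and column names. The corrected date of birth in the `UPDATE` is invented purely to show that a fact stored once only needs to be changed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (
        player        TEXT PRIMARY KEY,
        date_of_birth TEXT
    );
    CREATE TABLE tournament_winner (
        tournament TEXT,
        year       INTEGER,
        winner     TEXT REFERENCES player(player),
        PRIMARY KEY (tournament, year)
    );
""")
conn.executemany("INSERT INTO player VALUES (?, ?)", [
    ("Chip Masterson", "14 March 1977"),
    ("Al Fredrickson", "21 July 1975"),
    ("Bob Albertson", "28 September 1968"),
])
conn.executemany("INSERT INTO tournament_winner VALUES (?, ?, ?)", [
    ("Indiana Invitational", 1998, "Al Fredrickson"),
    ("Cleveland Open", 1999, "Bob Albertson"),
    ("Des Moines Masters", 1999, "Al Fredrickson"),
    ("Indiana Invitational", 1999, "Chip Masterson"),
])

# Hypothetical correction: one UPDATE touches the single place the fact lives
conn.execute(
    "UPDATE player SET date_of_birth = ? WHERE player = ?",
    ("22 July 1975", "Al Fredrickson"))

# Joining the tables back together shows the change on every one of Al's wins
als_birthdays = {row[0] for row in conn.execute("""
    SELECT p.date_of_birth
    FROM tournament_winner AS w
    JOIN player AS p ON p.player = w.winner
    WHERE w.winner = 'Al Fredrickson'
""")}
print(als_birthdays)
```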

> I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.

Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152

---
Tidy Data
---

![](images/HadleyObama.png)
Single-handedly keeping R relevant.

What it means to have "tidy" data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

These are the same things I have been saying, only in plain English.
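
As a small illustration in plain Python, suppose a hypothetical "wide" table stores one column per treatment. The treatment is really a variable hiding in the column names, so tidying means turning each (patient, treatment) measurement into its own row. The data and the `melt` helper below are invented for this sketch.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment
wide = [
    {"patient": "A", "treatment_a": 5, "treatment_b": 7},
    {"patient": "B", "treatment_a": 3, "treatment_b": 4},
]

def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                tidy.append({
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                })
    return tidy

tidy = melt(wide, "patient", "treatment", "result")
for row in tidy:
    print(row)
```

The tidy result has one column per variable (`patient`, `treatment`, `result`) and one row per observation, which is the shape a normalized table would have too.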

---
ACID
---

A set of properties that guarantee that database transactions are processed reliably.

What is a transaction?

> A single logical operation on the data.

One common example is the transfer of funds from one bank account to another. _Even though it involves multiple changes, such as debiting one account and crediting another, it is a single transaction._

![](images/acid.jpg)

- Atomicity: each transaction must be "all or nothing"
- Consistency: any transaction will bring the database from one valid state to another
- Isolation: concurrent execution of transactions results in the same system state that would be obtained if the transactions were executed serially
- Durability: once a transaction has been committed, its results are stored permanently
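
Atomicity is the easiest of the four to see in code. The sketch below replays the bank transfer with Python's built-in `sqlite3`. The `account` table, the names, the amounts, and the `CHECK` rule against negative balances are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        name    TEXT PRIMARY KEY,
        balance INTEGER CHECK (balance >= 0)  -- no overdrafts allowed
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move amount between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer the credit to `bob` has already run by the time the debit is rejected. The rollback is what keeps that half-finished work from ever becoming visible.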
 
![](images/uptime.jpg)

This makes sense for single systems that are updated every night (think banking in the 70s). It makes less sense in the distributed web world.

![](images/mfbt.jpg)

<!--

----
Why RDMS and SQL are fundmentally flawed
----

- The world is not Boolean
- Scaling
- Beggin the need for OOP <-> Relational, aka ORMs

---
The World is nonlinear (and non-Boolean)
---

Real World -> Web Application -> Database -> SQL -> Report -> Decision about real world

> Each row within a table is a declaration about a fact in the world, and SQL allows for operator-efficient data retrieval of those facts using predicate logic to create inferences from those facts. 

wtf!?

---
Scaling
---

Given ACID requirements, it is non-trival to scale.

The progressiong is
1. Larger servers
2. Sharding
3. Complex/distrubted systems
-->

---
ORMs 
---

An RDBMS's basic abstraction is the relation: a table with rows.

Python's basic abstraction is Object Oriented Programming (OOP): classes with attributes and methods.

> In Python "Everything is an object"

```python
# None is an object
isinstance(None, object)  # True
```

### Different abstractions in OOP and relational models.

Object Oriented Programming (OOP) abstractions:
    
- Identity
- State
- Behavior
- Encapsulation

Relational model abstractions:

- Relation
- Attribute
- Tuple

There are also different terms for the same concept.

Web Applications:

- Create
- Read
- Update
- Delete

DBs:

- Insert
- Select
- Update
- Delete

There is an impedance mismatch, aka different mental models.

An ORM is a translation device.
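
To show what "translation" means, here is a deliberately tiny hand-rolled mapper built on `sqlite3` and `dataclasses`. It is a toy under stated assumptions, not how any real ORM is implemented and not any real ORM's API. The `Customer` class, the `CustomerMapper` class, and the changed surname are all hypothetical.

```python
import sqlite3
from dataclasses import astuple, dataclass

@dataclass
class Customer:
    customer_id: int
    first_name: str
    surname: str

class CustomerMapper:
    """Toy mapper: translates Customer objects to and from table rows."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS customer "
            "(customer_id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT)")

    def create(self, customer):       # Create -> INSERT
        self.conn.execute(
            "INSERT INTO customer VALUES (?, ?, ?)", astuple(customer))

    def read(self, customer_id):      # Read -> SELECT
        row = self.conn.execute(
            "SELECT customer_id, first_name, surname "
            "FROM customer WHERE customer_id = ?", (customer_id,)).fetchone()
        return Customer(*row) if row else None

    def update(self, customer):       # Update -> UPDATE
        self.conn.execute(
            "UPDATE customer SET first_name = ?, surname = ? WHERE customer_id = ?",
            (customer.first_name, customer.surname, customer.customer_id))

    def delete(self, customer_id):    # Delete -> DELETE
        self.conn.execute(
            "DELETE FROM customer WHERE customer_id = ?", (customer_id,))

mapper = CustomerMapper(sqlite3.connect(":memory:"))
mapper.create(Customer(456, "Jane", "Wright"))

jane = mapper.read(456)   # a row comes back as an object
jane.surname = "Smith"    # hypothetical change, made on the object...
mapper.update(jane)       # ...and translated back into SQL
reloaded = mapper.read(456)

mapper.delete(456)
gone = mapper.read(456)
```

Even this toy shows where the mismatch bites: the object in memory and the row in the table are two copies of the same fact, and nothing but the mapper keeps them in sync.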

In Python, a widely used ORM is [SQLAlchemy](http://www.sqlalchemy.org/)

[ORMs](https://en.wikipedia.org/wiki/Object-relational_mapping)

---
### Takehome message

RDBMS is a "solved" problem. <!--Hire someone to do it for you.-->

Just Avoid Bad Decisions.

---
Summary
---
- We learned about the world of RDBMS (mathematical elegance...until reality hits).
- A little bit of math (relational algebra) and a lot of rules.
- It is important that your data is in 3rd normal form (or people will laugh at you).
- RDBMSs do not pair well with OOP. ORMs are bandaids on gaping flesh wounds.

<br>
---