<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 4: The Relational Model** 
_A mostly gentle introduction to the mathematics of data modeling._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The elements of the relational data model
- Coherent relations
- SQL `SELECT` queries as Relational Algebra 
- Importance of Data Integrity

### **Skills / Know how to ...**
- Assess data integrity by examining data and basic assumptions
- Identify (and eliminate) duplicate table rows
- Embed SQL into Python code (without `%%sql` magic)

--------
## **LESSON 4 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear TODO NEW VIDEO
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/rsCrjQck_jQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

---
## **BIG PICTURE: Why Spreadsheets are not Relational Databases (but also kind of are)**
So many analysts get their first taste of data analysis in MS Excel (or Google Sheets, its online doppelganger). In the first sitting they'll type some data into a couple of columns, then maybe add some formatting and perhaps column totals. Then they learn about copying cells with calculations in them, the proper use of `$` to freeze the row or column addresses, etc. Eventually, they'll perhaps learn about defining cell ranges as "databases" that can be dragged into PivotTables. It all seems so easy! Why not just stick with that? 

The problem is that for all of the expressive power of Excel, it is a poor substitute for a *real* database (and no, MS Access doesn't count). Some issues:
- Any cell can have data of any type. So, we can't predict what is in a cell without looking inside. 
- The data is only semi-structured. The meaning of a given row or column can vary from row to row and column to column within the same data set. At the top of the spreadsheet the third column might be home addresses but lower down it might be average home values. 
- Spreadsheets have limited capacity. If you have 2 million rows of data then MS Excel just won't work. 
- Using `lookup()` or other similar means to combine data from multiple sheets is slow and error prone. There is no way to *really* know you got the cell references right in your `vlookup()` call except to know in advance what values the lookup should return. 
- While spreadsheets do allow named ranges, almost nobody actually uses them. So, you have to know which rows and columns make up a given range (say `$B2:$F$1093`). Then if you copy a formula that uses those ranges to another cell, you have to be careful about the `$` in the cell addresses. 

Consider for example, the sheet below:

![Spreadsheet bugs](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Spreadsheet_Bugs.png)

A few issues:
- Why did `countif()` in column C get Greta Life's customer count wrong?   
  >*Because of a missing `$`.*
- Why might the `vlookup()` in column H have gone so terribly wrong?  
 >*Because the staff were not sorted alphabetically once Brock Lee was hired.*
- What could go wrong if we added more customers?  
 >*If we just tack customers onto the end of columns F-H, does the customer count get updated too? It depends on how the `countif()` range is defined.*
- How about if we hired Brock Lee's dad, who is also named Brock Lee? What might happen then? How would we fix it?   
 >*The `vlookup()` in column H would get confused if there are two Brock Lees. The names would have to be disambiguated using something like Brock Lee Jr and Brock Lee Sr. That in turn could mess up the data in column G. If we forget to update even one name in that column then our data is corrupted.*

None of these problems are actually inherent to the functions or cells themselves. With different data the bugs might just go away. 

In fact, the *internal* data representation used by spreadsheet software is actually pretty rock solid. After all, spreadsheets have ...
- unique cell addresses (A1, B2, etc.) 
- well-behaved data types (general, number, etc.) 
- a stable data structure (fixed rows and columns) that does not change when we add new data
- traceable cell dependencies (see below)

 ![Spreadsheet trace precedents](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_excel_trace_precedents.png)

That's pretty much the definition of a relational database, buried deep inside of MS Excel but with lots of user interface conveniences that actually make it more likely that data will get corrupted. **So, while it *is possible* to use a spreadsheet as a crude database, it takes lots of discipline to keep the data clean and consistently structured.**

In this lesson we will consider the relational database model, which was designed specifically to prevent just these sorts of **integrity** issues. Rather than rely on end users to *just know* when they are corrupting their own data we will learn to apply a few rules that prevent data corruption in the first place.

---
## **Tables as Relations**

The Relational Data Model was introduced by E. F. Codd at IBM in 1970. It is based on so-called Relational Algebra, which defines a set of rules and operations that should apply to tabular data (a.k.a., relations). SQL (or Sequel back then) was the programming language IBM used to implement the relational model.  

### **Terminology and Equivalencies**
In the time before the relational model, all data existed in files. In order to read or write data, a program had to implement a file system. A computer operating system, for example, has a file system for just this purpose. Apps make use of it to access data storage. 

In old-school data **files**, each line of text (it was always text or at least text-encoded) was called a **record**. Each record had some (possibly variable) number of **fields**, each representing one datum. Fields were delimited with a separator character, often a `tab` because it was not likely to be present in the data. A modern example is the so-called CSV file, where CSV stands for *comma-separated values*.
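To make records and fields concrete, here is a minimal sketch that reads tab-delimited text with Python's built-in `csv` module. The file contents are made up for illustration (the department names and counts are assumptions, not course data).

```python
import csv
import io

# Hypothetical file contents: one record per line, fields separated by tabs.
raw_text = "Greta Life\tSales\t3\nBrock Lee\tSupport\t1\n"

# csv.reader splits each record (line) into a list of fields.
reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
records = [fields for fields in reader]

for fields in records:
    print(len(fields), fields)
```

Note that every field comes back as text. Nothing in the file itself says that the third field is supposed to be a number.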

One step up from a file is a data set, where the file is explicitly in a tabular format, with rows and columns. This is the model on which SQL is built. Note that we implement the relational model in SQL, but it is possible to do things in SQL that technically violate the rules of the relational model. 

When working with *data models*, possibly before we actually have any data, we don't actually refer to tables. Instead, we talk of *entity types*, *instances*, and *attributes*. We will get into this deeper in Lesson 6. However, it is generally acknowledged that each entity type corresponds to a table, an instance is a row, and an attribute is a column.

| Relation    | Tuple    | Attribute |
|:---------   |:---------|:--------- |
| File        | Record   | Field     |
| Table       | Row      | Column    |
| Entity Type | Instance | Attribute |

As shown in the table above, the relational model formalizes these general notions using mathematical language: **relations**, **tuples**, and **attributes**. We will get into the exact meaning of these terms, but for now you can just think of them as synonymous with tables, rows, and columns.  

### **Sets, Mappings, and Relations**

You may have learned about mathematical relations in middle school, probably in a lesson about sets. A set is a collection of items (numbers, words, pictures, cats, ...) without duplication. The items can be of mixed types (e.g., cats and animated gifs) as long as no item is represented twice. 

![Set with cats and gifs](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Set.png)

If we have multiple sets then we can *map* one set to another. The most familiar kind of mapping is a function, which maps a *domain* set (i.e., all possible function inputs) to a *range* set (i.e., all possible outputs). 

![Functional Mapping](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Functions.png)

In a **functional** (single-valued) mapping, each item in the domain maps to exactly one item in the range. In other words, each time the function is called with a given input, the function always returns the same output. If the set of inputs is finite, we can replace any calculation with a table, with one row for each input value and its associated output value. The mathematical name for such pairings is *tuple*, which we'll come back to in a bit. 
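The idea of replacing a calculation with a table can be sketched in a few lines of Python. The `square` function and its four-item domain are hypothetical, chosen only to keep the table small.

```python
def square(x):
    """A hypothetical function used only for illustration."""
    return x * x

domain = [0, 1, 2, 3]  # a finite set of inputs

# One (input, output) pair per domain item: the function written as a table.
table = [(x, square(x)) for x in domain]

# A dict built from the pairs replaces the calculation with a lookup.
lookup = dict(table)

print(table)
print(lookup[3] == square(3))
```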

![General Relational Mapping](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Relations.png)

A **relation** is a more general kind of mapping where items in the domain can map to *multiple items* in the range. As with a functional mapping, we can represent relations as tables, only this time we may need multiple rows per input item. As long as we can capture each mapping arrow as a pair, we can call the relationship a relation.  
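Here is a small sketch of a relation stored as a set of pairs, one pair per mapping arrow. The staff names come from the spreadsheet example earlier in the lesson, while the customer labels are invented for illustration.

```python
# Hypothetical relation: staff members mapped to the customers they serve.
serves = {
    ("Greta Life", "Customer A"),
    ("Greta Life", "Customer B"),
    ("Brock Lee", "Customer C"),
}

# One domain item may appear in several pairs (rows).
greta_customers = sorted(c for (s, c) in serves if s == "Greta Life")
print(greta_customers)

# The mapping is functional only if no domain item appears in more than one pair.
domain_items = [s for (s, _) in serves]
is_functional = len(domain_items) == len(set(domain_items))
print(is_functional)
```

Because Greta Life maps to two customers, this particular relation is not a function.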

>It often confuses people when they find out that the Relational Model is about relations, not relation*ships*. Roll with it. In the end it doesn't matter much ... unless you are a mathematician. 

We can extend the pairwise mappings to allow multiple range sets (or *codomains*). 

![Multiple Codomains](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Relation_Multiple_Codomains.png)

The result is that as we add more codomains, the pairwise mappings become triplets, quadruplets, quintuplets, ... or, as we call them generally, *tuples* that can have any number of values. Depending on the mappings, it is possible that some tuples will contain empty values, but that is allowed in the relational model. 

Going back to the table / row / column terminology, each row is a tuple and each column is a named attribute. A table is then a set of rows (tuples). Within each tuple, every item in the domain is mapped to an item in each of the attribute codomains (columns). Further, there can be no row duplication; otherwise the relation would violate the first rule of sets. 
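As a rough sketch, Python's `namedtuple` gives us tuples with named attributes, and a `set` of them behaves like a relation. The `Staff` type, its attribute names, and the values are all hypothetical.

```python
from collections import namedtuple

# Hypothetical relation with one domain (name) and two codomains.
Staff = namedtuple("Staff", ["name", "department", "office"])

rows = [
    Staff("Greta Life", "Sales", "B12"),
    Staff("Brock Lee", "Support", None),   # an empty value is allowed
    Staff("Greta Life", "Sales", "B12"),   # accidental duplicate row
]

# A relation is a set of tuples, so the duplicate row collapses.
relation = set(rows)

print(len(rows), len(relation))
print(Staff._fields)
```

The list keeps all three rows, but the set keeps only two. That is the "no row duplication" rule in action.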

### **Coherent Relations**
A relation (table) is said to be *coherent* if:
- each row describes one domain entity (i.e., not a composite) 
- each column is one attribute 
- there are no duplicate rows or columns
- there is just one fact per (row, column)
- row and column order don't affect interpretation

Coherence in this case is in the mathematical sense. It just means that we can translate the relation's tuples back to the original set mappings. 

Without coherence, the rest of the relational model falls apart. These are the minimal requirements for making sense of what a table represents. 
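One of the coherence rules, no duplicate rows, is easy to check with SQL embedded in Python. The sketch below uses the standard-library `sqlite3` module with an in-memory database. The `staff` table and its contents are assumptions made up for the example, and grouping on every column is just one common way to spot duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Greta Life", "Sales"), ("Brock Lee", "Support"), ("Brock Lee", "Support")],
)

# Group on every column; any group with more than one member is a duplicate.
duplicates = conn.execute(
    """
    SELECT name, department, COUNT(*) AS copies
    FROM staff
    GROUP BY name, department
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)

# SELECT DISTINCT returns the rows with the duplicates removed.
distinct_rows = conn.execute(
    "SELECT DISTINCT name, department FROM staff ORDER BY name"
).fetchall()
print(distinct_rows)
conn.close()
```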

---
## **Relational Algebra**

### **About Algebra**
**An algebra** $-$ note the phrasing $-$ is a mathematical system that implements operations like addition, subtraction, multiplication, and division. Generally, we define a kind of algebra based on the data it operates on and perhaps special rules that only apply to that kind of data. The **Elementary Algebra** you learned in school involves basic arithmetic operations on numerical data: 1+1=2, 2x3=6, ...

In our discussion of boolean expressions in Lesson 2 we encountered **Boolean Algebra** that operates on `0` and `1` data where
- `OR` $\Leftrightarrow$ `+`
- `AND` $\Leftrightarrow$ $\times$
- `NOT` $\Leftrightarrow$ `-`
- `1 OR 1 = 1`$\Leftrightarrow$ `1+1=1`
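These equivalences can be checked by brute force over all `0`/`1` inputs. In the sketch below the helper names are hypothetical, `+` is capped at 1 so that `1+1=1`, and `NOT` is read as `1 - a`, which is one assumed interpretation of the minus sign.

```python
from itertools import product

def b_or(a, b):
    return min(a + b, 1)   # addition, capped at 1 so that 1 + 1 = 1

def b_and(a, b):
    return a * b           # multiplication

def b_not(a):
    return 1 - a           # assumed reading of the minus sign

# Compare each arithmetic version with Python's own boolean operators.
for a, b in product([0, 1], repeat=2):
    assert b_or(a, b) == int(bool(a) or bool(b))
    assert b_and(a, b) == int(bool(a) and bool(b))
for a in [0, 1]:
    assert b_not(a) == int(not a)

print(b_or(1, 1))
```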

Yet another common algebra operates on sets. The **Set Algebra** operators are shown in the table below.

| Operator | Symbol | Meaning |
| :------  | :----: | :----- |
| Union    | $\cup$ | `A` $\cup$ `B` returns all items in `A` **or** `B`.|
| Intersection | $\cap$ | `A` $\cap$ `B` returns all items in `A` **and** `B`.|
| Difference | $-$ | `A` $-$ `B` returns all items in `A` **and not** in `B`.|
| Product | $\times$ | `A` $\times$ `B` is the set of all possible pairs `(a,b)` where `a` is in set `A` and `b` is in set `B`.| 

A couple of remarks:
- If set `A` has 2 items and set `B` has 3 items, then `A` $\times$ `B` has 6 items. That's where the name `Product` comes from. 
- Boolean Algebra is a special case of Set Algebra where every set (boolean expression) is a singleton containing a `1` or `0` (for True or False). The dead giveaway here is the presence of **and**, **or**, and **and not** in the last column. 
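Python's built-in `set` type supports the first three operators directly, and `itertools.product` supplies the fourth. The two sets below are arbitrary examples with 2 and 3 items.

```python
from itertools import product

A = {1, 2}
B = {2, 3, 4}

union = A | B              # items in A or B
intersection = A & B       # items in A and B
difference = A - B         # items in A and not in B
pairs = set(product(A, B)) # all possible (a, b) pairs

print(union, intersection, difference)
print(len(pairs) == len(A) * len(B))
```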

### **Relational Operators**
Relational algebra is a set of operations that can be applied to any relation (table). Like boolean algebra, relational algebra is a kind of extension of set algebra where the items are tuples within coherent relations. 

When applied to a relation, each operator produces a **resultset**, which is itself a relation. 

#### **Restrict**
The **Restrict** operator chooses which tuples to include in the resultset. It is equivalent to the SQL `WHERE` clause. 
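A minimal sketch of Restrict, using SQL embedded in Python through the standard-library `sqlite3` module. The `staff` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Greta Life", "Sales"), ("Brock Lee", "Support")],
)

# Restrict: keep only the tuples that satisfy the WHERE condition.
restricted = conn.execute(
    "SELECT * FROM staff WHERE department = 'Sales'"
).fetchall()
print(restricted)
conn.close()
```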

#### **Project**
The **Project** operator indicates which attributes (columns) to include in the tuples. It is equivalent to the column list in the SQL `SELECT` clause.  
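The sketch below projects a hypothetical `staff` table onto a single column. It also shows one place where SQL and the relational model part ways: a plain column list can return duplicate rows, so `DISTINCT` is needed to get a true set back. The table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Greta Life", "Sales"), ("Brock Lee", "Support"), ("Anna Lytics", "Sales")],
)

# Project: keep only the department column (duplicates survive in SQL).
projected = conn.execute("SELECT department FROM staff").fetchall()

# DISTINCT removes the repeats, giving a proper relation.
projected_distinct = conn.execute(
    "SELECT DISTINCT department FROM staff ORDER BY department"
).fetchall()

print(len(projected), projected_distinct)
conn.close()
```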

#### **Product**
The **Product** operator calculates the cross product of two relations. It is equivalent to a list of tables in the SQL `FROM` clause. 
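As a quick check, listing two tables in `FROM` with no join condition pairs every row of one with every row of the other. The tables `a` and `b` below are hypothetical, with 2 and 3 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (y TEXT)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [("p",), ("q",), ("r",)])

# Product: every row of a paired with every row of b.
pairs = conn.execute("SELECT x, y FROM a, b").fetchall()
print(len(pairs))
conn.close()
```

With 2 rows in one table and 3 in the other, the resultset has 2 x 3 = 6 rows.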

#### **Except and Union**
The **Except** operator calculates the *difference* between two relations with similar tuples. The **Union** operator *adds* one set of tuples to another. We covered these at the end of Lesson 3.  
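Here is a short sketch of both operators in SQLite, run from Python. The two single-column tables and the names in them are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_staff (name TEXT)")
conn.execute("CREATE TABLE new_staff (name TEXT)")
conn.executemany(
    "INSERT INTO old_staff VALUES (?)", [("Greta Life",), ("Brock Lee",)]
)
conn.executemany(
    "INSERT INTO new_staff VALUES (?)", [("Brock Lee",), ("Anna Lytics",)]
)

# Except: tuples in old_staff that are not in new_staff.
only_old = conn.execute(
    "SELECT name FROM old_staff EXCEPT SELECT name FROM new_staff"
).fetchall()

# Union: tuples from both tables, with duplicates removed.
everyone = conn.execute(
    "SELECT name FROM old_staff UNION SELECT name FROM new_staff ORDER BY name"
).fetchall()

print(only_old)
print(everyone)
conn.close()
```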

#### **Chaining: Joins and Subqueries**

Since relational operators always produce relational resultsets, we can feed the results of one operator to the next in a chain of operators. 

So, for example, a table join is actually three operations. Let `TableA` and `TableB` be two relations, then a table join is equivalent to:  

>`TableA` **Product** `TableB` **Restrict** join-conditions **Project** columns

In fact, that is exactly how an implicit join expresses it. In SQL we have the `JOIN` operator so we don't accidentally forget to do the **Restrict** and **Project** after the **Product**. Otherwise it is totally redundant. 
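To see the equivalence, the sketch below runs the same join twice: once as an implicit join (tables listed in `FROM`, join condition in `WHERE`) and once with `JOIN ... ON`. The `staff` and `customers` tables, their columns, and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staff_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE customers (customer TEXT, staff_id INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)", [(1, "Greta Life"), (2, "Brock Lee")]
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Customer A", 1), ("Customer B", 1), ("Customer C", 2)],
)

# Implicit join: product in FROM, join condition in WHERE, columns in SELECT.
implicit = conn.execute(
    """
    SELECT name, customer
    FROM staff, customers
    WHERE staff.staff_id = customers.staff_id
    ORDER BY customer
    """
).fetchall()

# Explicit join: the same three steps bundled into JOIN ... ON.
explicit = conn.execute(
    """
    SELECT name, customer
    FROM staff JOIN customers ON staff.staff_id = customers.staff_id
    ORDER BY customer
    """
).fetchall()

print(implicit == explicit)
conn.close()
```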

Similarly, a SQL subquery is just a relation (in the form of a virtual table) that we insert in the chain of relational operators that make up a SQL query. 

---
## **Three Kinds of Data Integrity**

---
## **PRO TIPS: How to find duplicate rows**



---
## **SQL AND BEYOND: pandas DataFrames**


## **Congratulations! You've made it to the end of Lesson 4.**

In this lesson we covered the essential mathematical theory that underlies the relational database model. In the next two lessons we will apply what we've learned to database design.  



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.