<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 6: Entity Relationship Modeling** 
_A Visual Approach to Database Design._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Tradeoffs that every designer makes
- Table normalization and normal forms
- The Entity Attribute Value database model

### **Skills / Know how to ...**
- Break a large table into smaller *normalized* tables
- Use relational notation to describe table schema
- Detect when a choice of keys will potentially corrupt data

--------
## **LESSON 5 HIGHLIGHTS**

#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/rsCrjQck_jQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

---
## **BIG PICTURE: Data modeling is not about data**

**Data modeling is something we do *before* we have data.** It's the design activity that makes it possible to collect data for a given purpose. So, data modeling is about what we will do with the data and what we can expect to find as we collect it. The data itself ends up being a collection of artifacts that gets us from data collection to data usage. 

So, whenever we start designing a new database (i.e., the receptacle for our data), the most important questions are:
- What exists now?
- What is the data about?
- What is the data going to be used for?
- How long does the data need to be kept?
- Who is going to use the data? 

Only after we get answers to these questions can we begin to think about the tables, columns, keys, constraints, etc. For example, let's consider the "What exists now?" question. What exists now may include any or all of the following: 
- a paper process used to collect dead tree data and produce reports
- people whose jobs are to work with the current data sources every day, perhaps with an organizational hierarchy that determines data access permissions
- online systems that need data in a specific format or produce data in a (not necessarily the same) format
- end users that don't currently know they need the data 
- a laundry list of complaints about the current system (or lack of one)
- ...

It goes on and on. 

The purpose of data modeling is to capture as much detail about the data requirements as possible before we make any technology decisions we might regret later. If, for example, we find that some data needs to be separate from other data because of different sources or user access permissions, then we need to take that into account when designing our tables. Similarly, if some data is permanent and other data is only useful for a couple of weeks, then perhaps we will want to keep them separate as well. 

In this lesson we will use Entity Relationship Modeling to capture data requirements in an intuitive way that can be explained to clients and power users. It is like writing a draft of the user manual for the data before anything else. 

---
## **Entity Relationship Diagrams**

Entity Relationship models have been around almost as long as relational databases. Peter Chen first developed the theory in the mid-1970s, a time when relational databases were still pretty exotic, to solve a serious conceptual problem. People understood how to read and write files but following all the rules needed to build relational databases was a challenge.  

ER diagrams (ERDs) are designed to be just specific enough to be useful without introducing any unnecessary details that are best left for later. Consider this ERD from Lesson 1:

![Cleaners ERD](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L6_erd1.png)

- **Each box represents an entity class or type.** Any given type can model any number of instances. So, for example, the `Customer` entity (type) can represent millions of customers (instances). 
- **The attributes are listed inside the box.** Traditionally, primary key attributes (identifiers like `customer_id`) are listed at the top and the foreign key attributes at the bottom, with non-key attributes in between. If an attribute is both a primary key and a foreign key, then group it with the primary key attributes. 
  > Note: in the earliest stages of the design process, we may omit the attributes entirely, focusing on just the entities and relationships.  
- **Relationships are shown with lines connecting the boxes.** The notation at each end of a connecting line represents how many entity instances there are at that end. The relationship shown in the diagram indicates that:
  - each `Customer` instance can generate zero or more `Invoice` instances
  - each `Invoice` instance is generated by one and only one `Customer`

With this much detail we can say:
- what tables we will need in the database
- what columns are on each table
- what foreign keys are needed and what tables they refer to

What we don't care about (yet) includes 
- the data types of the attributes
- how the keys are generated and managed
- how many rows of data are to be in each table
- how the tables will be populated with data
- how data will be updated or deleted from the tables

These things are for logical design. For now we are just interested in the general database structure, not its implementation. We need to get the basic concepts right before diving into the details. 
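
Looking ahead, here is a minimal sketch of where the `Customer`/`Invoice` diagram could eventually lead, using Python's built-in `sqlite3` module. Only `customer_id` comes from the discussion above; the other column names (`name`, `invoice_id`, `total`) and all of the data types are placeholder assumptions, since those details belong to logical design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- identifier, listed first
    name        TEXT                   -- hypothetical non-key attribute
);
CREATE TABLE invoice (
    invoice_id  INTEGER PRIMARY KEY,   -- hypothetical identifier
    total       REAL,                  -- hypothetical non-key attribute
    -- NOT NULL + REFERENCES: each invoice belongs to one and only one customer
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bo')")
conn.execute("INSERT INTO invoice VALUES (10, 25.0, 1), (11, 40.0, 1)")

# Zero or more invoices per customer: the LEFT JOIN keeps customers with none.
invoice_counts = dict(conn.execute("""
    SELECT c.customer_id, COUNT(i.invoice_id)
    FROM customer AS c
    LEFT JOIN invoice AS i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id
"""))

# An invoice that points at a customer that does not exist should be refused.
try:
    conn.execute("INSERT INTO invoice VALUES (12, 5.0, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

In this sketch customer 1 has two invoices and customer 2 has none, which is exactly what the "zero or more" end of the relationship permits.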

 


## **ERDs as Conceptual Storytelling**

The key insight of entity relationship modeling $-$ the one that makes it so useful for database design $-$ is that **databases exist to tell stories**. They describe things that people care about enough to record for later. Entity-relationship diagramming is a visual language for telling database stories before we actually have data.

In the visual language of ERDs there are just two kinds of sentences:
- Description: entity A **is described by** attributes X, Y, and Z 
- Action: entity A **acts upon** entity B

In these stories, the **entities and attributes are the descriptive nouns**, while the **relationships are the action verbs**. 

We will take these one at a time and then discuss some of the more interesting special cases. 




## **Entities and Attributes**

In principle, any particular thing could be considered an entity or an attribute. Often it's hard to tell which, at least at first, before we know the context for our database stories. 

The difference between an entity and an attribute is that entities always have a unique identity within the **domain model** (i.e., context). Recall that in Lesson 4 we defined the relational model using set mappings from **domains** to **codomains**: 
![Relation mappings from Lesson 4](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L4_Relations.png)

The items in the domain set are the entities represented by a table. The mappings to the codomains are the attributes. While entities are by definition unique, we could map any given **domain entity** to **attribute values** in any number of codomains. Like the entities, the attribute values themselves are unique but the **attribute mappings** have no such constraint. We could map several entities to the same attribute value or even map an entity to itself (i.e., where codomain = domain) several times if we like. It is in this sense that **entities are unique** while **attribute mappings** are not.
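
One way to picture the distinction is with Python dictionaries, where the keys play the role of domain entities and the values play the role of codomain attribute values. The customer identifiers and cities below are made up purely for illustration.

```python
# Domain: entity identifiers. Each key can appear only once.
# Codomain: attribute values (hypothetical cities), which are free to repeat.
home_city = {
    "cust-1": "Springfield",
    "cust-2": "Riverton",
    "cust-3": "Springfield",   # two entities map to the same attribute value
}

# A mapping where the codomain is the domain itself:
# several customers can point back at the same customer.
referred_by = {
    "cust-2": "cust-1",
    "cust-3": "cust-1",
}

entities = list(home_city)
values = list(home_city.values())

entities_unique = len(entities) == len(set(entities))   # True
mappings_unique = len(values) == len(set(values))       # False
```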

To determine if a given item is an entity, we look for **identifier attributes** that make sense in the story context. For example, does a bank need to track individual dollar bills (via serial numbers) or just the number of dollars in a person's account? If the former then dollars are entities. Otherwise, in the absence of a useful identifier, dollar amounts are just values used to describe something else (a customer's account, a financial transaction, etc.).

Besides the need for unique identifiers, there is another sense in which entities must be unique. **Each entity is always one thing and only one thing.** If an entity is composed of multiple things, then it is actually an **associative entity** that connects other things together. Each of these connected things is an entity, as is the association itself. 

Let's consider, for example, a marriage between two people. Is that one thing (a marriage) or three things (marriage plus two spouses)? In early versions of the database story it may be just a marriage license with a date, a few signatures, and an id number. However, as the story gets fleshed out, one may find the need to represent the people getting married. While it is tempting to just add the details about these people to the marriage license itself, this makes the idea of a marriage license much more complex than it has to be. It also complicates questions like "Has Toby been married before?" or "Is Toby *currently married* to somebody else?" Instead of looking up Toby (and finding licenses), we would have to scan through licenses to find any with Toby's name. A much better, more savvy way to do it is to **register each person** separately and then **reference** them on the marriage license. As long as the licenses are **indexed** by person $-$ easily done because each person has a unique identifier $-$ we could find any of Toby's previous marriages on file with one lookup.
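
The "register, reference, index" idea can be sketched with Python's built-in `sqlite3` module. All table names, column names, and sample rows here are hypothetical; the point is only that the license refers to people by identifier and that the references are indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Register each person once, with a unique identifier.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT
);
-- The license references people instead of copying their details.
CREATE TABLE marriage_license (
    license_id INTEGER PRIMARY KEY,
    issued_on  TEXT,
    spouse1_id INTEGER REFERENCES person(person_id),
    spouse2_id INTEGER REFERENCES person(person_id)
);
-- Index the licenses by person so lookups do not scan every license.
CREATE INDEX idx_license_spouse1 ON marriage_license(spouse1_id);
CREATE INDEX idx_license_spouse2 ON marriage_license(spouse2_id);

INSERT INTO person VALUES (1, 'Toby'), (2, 'Alex'), (3, 'Sam');
INSERT INTO marriage_license VALUES
    (100, '2001-06-01', 1, 2),
    (101, '2015-09-12', 3, 1);
""")

def licenses_for(person_id):
    """Return the ids of every license that names the given person."""
    rows = conn.execute(
        """SELECT license_id FROM marriage_license
           WHERE spouse1_id = ? OR spouse2_id = ?
           ORDER BY license_id""",
        (person_id, person_id),
    )
    return [license_id for (license_id,) in rows]

toby_licenses = licenses_for(1)   # Toby appears on both licenses
```

Answering "Has Toby been married before?" becomes a single query by Toby's identifier rather than a search through names written on licenses.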

Once we have the entities worked out, the attributes are usually pretty easy to suss out. All we need to know in an ERD is what they are named, and whether any of them are identifiers (primary keys) or references (foreign keys). Data types and other constraints (e.g., allowed combinations of attribute values) can wait until logical design. 






---
## **Relationships**
Relationships are how the entities in our stories interact. They provide the action. How that action plays out depends on the natures of the entities and their interactions. 

Continuing our marriage example, here are three different scenarios, each as a different ERD:

![Three Marriage Proposals](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L6_Marriage_ERD.png)

Scenario A represents the case where people are just names on a marriage license. Each person is listed as either `spouse1` or `spouse2`. This is a very static story with no action. It is almost as if nobody will ever read the story after it is written.

Scenario B represents a person-centric case, where all that matters is who is currently married to whom. (It's a rare unary mapping of an entity domain onto itself.) Notice the phrasing of the text on the relationship. It describes the action (or more like a condition, depending on your views of marriage) of being married to another person. It is assumed that if person X is married to person Y, then the relationship would be noted on both entities. If a person is unmarried, then the `spouseID` would be left blank.

Scenario C separates out the marriage license from the people getting married. In this case we can add more attributes to the `Marriage License` and `Person` entities to indicate things like birthdates, divorce dates, etc. Generally, this is likely how your state records marriages. It is up to them to enforce the "can't have more than one marriage at a time" law. 

> Note that Scenario C allows for marriage to be between more than 2 people. Why? Because standard ERD graphic language doesn't distinguish between "more than 1" and "many". That marriage is between two people is a constraint that would be handled in the logical design phase of the database development.
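
To make the note above concrete, here is one hedged sketch of how the "exactly two people" rule might be checked later, during logical design. It assumes Scenario C is implemented with a linking table, here called `license_party`, which is a hypothetical name along with every other table, column, and sample row in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE marriage_license (license_id INTEGER PRIMARY KEY, issued_on TEXT);
-- Hypothetical linking table: one row per person named on a license.
CREATE TABLE license_party (
    license_id INTEGER REFERENCES marriage_license(license_id),
    person_id  INTEGER REFERENCES person(person_id),
    PRIMARY KEY (license_id, person_id)
);

INSERT INTO person VALUES (1, 'Toby'), (2, 'Alex'), (3, 'Sam');
INSERT INTO marriage_license VALUES
    (100, '2001-06-01'), (101, '2015-09-12'), (102, '2020-02-02');
-- License 101 names three people and license 102 names nobody;
-- the structure of the tables allows both.
INSERT INTO license_party VALUES (100, 1), (100, 2), (101, 1), (101, 2), (101, 3);
""")

# Find licenses that break the "exactly two people" rule.
violations = conn.execute("""
    SELECT l.license_id, COUNT(p.person_id) AS parties
    FROM marriage_license AS l
    LEFT JOIN license_party AS p ON p.license_id = l.license_id
    GROUP BY l.license_id
    HAVING COUNT(p.person_id) <> 2
    ORDER BY l.license_id
""").fetchall()
```

The table structure alone accepts any number of people per license, so the rule has to be stated and enforced separately, which is why the ERD leaves it for later.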


### **Degrees**

The degree of a relationship is how many entity types are being connected. 
By far the most common relationship is between two entity types. That is called a **binary relationship**. 

![Binary Relationship](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L6_Binary_Relationship.png)

The relationship above is a many-to-many binary relationship. There can be just about any number of entities on either side of the relationship. However, there are only two entity types. 

A **unary relationship** is one where the associations are among entities with the same entity type. 

![Unary Relationship](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L6_Unary_Relationship.png)

One can conceive of a unary relationship between rail cars on a train. Each car has 0 or 1 cars immediately before it and 0 or 1 cars immediately after it. Either way, there is only one entity type: every rail car is a car. 
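
A unary relationship usually ends up as a table that refers to itself. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `rail_car` table; the column names and the three sample cars are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.executescript("""
CREATE TABLE rail_car (
    car_id      INTEGER PRIMARY KEY,
    kind        TEXT,
    -- Nullable: a car has 0 or 1 cars immediately after it.
    -- UNIQUE:   a car has 0 or 1 cars immediately before it.
    next_car_id INTEGER UNIQUE REFERENCES rail_car(car_id)
);
-- Insert back to front so each referenced car already exists.
INSERT INTO rail_car VALUES (3, 'caboose', NULL);
INSERT INTO rail_car VALUES (2, 'boxcar', 3);
INSERT INTO rail_car VALUES (1, 'engine', 2);
""")

def cars_from(car_id):
    """Follow next_car_id links from the given car to the end of the train."""
    order = []
    while car_id is not None:
        kind, next_id = conn.execute(
            "SELECT kind, next_car_id FROM rail_car WHERE car_id = ?", (car_id,)
        ).fetchone()
        order.append(kind)
        car_id = next_id
    return order

train = cars_from(1)
```

Both ends of the foreign key live in the same table, which is what makes the relationship unary.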

> We can have higher degree relationships (ternary, quaternary, etc.) as well. These are not very common and can be confusing to interpret. Often in such cases we will use an associative entity that represents the relationship itself. We only mention them in passing here because you may see one in the wild someday. 

### **Maximum Cardinalities**
### **Minimum Cardinalities**



---
## **Special Cases**

### **Strong Entities and Weak Entities**

Parent Child

HAS A

### **Subtypes**

IS A

### **Twinning**

### **Associative Entities**

---
## **Movies Tonight: A Case Study**

This is the start of a four-part case that will run through Lesson 8. 









---
## **PRO TIPS: How to sanitize data before storage**








---
## **SQL AND BEYOND: EAV Models**

One easy way to always keep your data normalized is to use a single table with four columns:
- An **entity type**
- An **entity** identifier
- An **attribute** identifier
- A **value** to store

We can, in fact, normalize any set of tables down to this one design, which we call EAV. 

For example, in an EAV database the relational database tables

**<center>Staff</center>**

| eid | name |
|-----|------|
| 1 | Barb Ackue |
| 2 | Buck Kinnear |


**<center>Contacts</center>**

| contactid | eid | email | usage |
|-----------|-----|-------|-------|
| 1 | 1 | backue@acmesales.com | work |
| 2 | 1 | barb.ackue@gmail.com | home |
| 3 | 2 | bkinnear@acmesales.com | work |
| 4 | 2 | buckkinnear2315@hotmail.com | home |

would be stored as 

| type | id | attribute | value |
|------|----|-----------|-------|
| Staff | 1 | name | Barb Ackue |
| Staff | 2 | name | Buck Kinnear |
| Contact | 1 | eid | 1 |
| Contact | 1 | email | backue@acmesales.com |
| Contact | 1 | usage | work |
| ... | ... | ... | ... |

 > **Heads up:** EAV is a special case of the key-value store model used by some NoSQL databases, where a composite like (type, id, attribute) is used for the keys. 
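
The conversion from ordinary tables to EAV rows is mechanical. Here is a minimal Python sketch that flattens the `Staff` and `Contacts` tables above; the `to_eav` helper is a hypothetical function written for this example, not part of any library.

```python
staff = [
    {"eid": 1, "name": "Barb Ackue"},
    {"eid": 2, "name": "Buck Kinnear"},
]
contacts = [
    {"contactid": 1, "eid": 1, "email": "backue@acmesales.com", "usage": "work"},
    {"contactid": 2, "eid": 1, "email": "barb.ackue@gmail.com", "usage": "home"},
    {"contactid": 3, "eid": 2, "email": "bkinnear@acmesales.com", "usage": "work"},
    {"contactid": 4, "eid": 2, "email": "buckkinnear2315@hotmail.com", "usage": "home"},
]

def to_eav(entity_type, rows, id_column):
    """Yield one (type, id, attribute, value) tuple per non-key, non-missing cell."""
    for row in rows:
        for attribute, value in row.items():
            # The identifier becomes the id column, and missing values are simply skipped.
            if attribute == id_column or value is None:
                continue
            yield (entity_type, row[id_column], attribute, value)

eav_rows = (
    list(to_eav("Staff", staff, "eid"))
    + list(to_eav("Contact", contacts, "contactid"))
)
```

Six rows in the original tables become fourteen EAV rows, and the contact's `eid` foreign key turns into an ordinary attribute/value pair.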

EAV databases have a few advantages:
- The single table design never changes.
- There is flexibility to handle new kinds of information easily; just add a new entity type (schema) with a few attributes and a data type for the values.
- If there are lots of attributes with missing data then there is no need to store the NULL values.

However, there are also some disadvantages:
- The table will have an enormous number of rows; to assemble the facts about even one entity might require dozens of rows.
- It is really hard to enforce referential integrity rules because foreign keys are just data like anything else.
- Because the schemas are so flexible, it is hard to know exactly what facts are knowable about a given entity without actually assembling it.
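
The first two disadvantages can be seen in a small sketch using Python's built-in `sqlite3` module. The `eav` table layout follows the four columns described above; the `assemble` helper and the dangling `Contact 5` row are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        type TEXT, id INTEGER, attribute TEXT, value TEXT,
        PRIMARY KEY (type, id, attribute)
    )
""")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?, ?)", [
    ("Staff", 1, "name", "Barb Ackue"),
    ("Contact", 1, "eid", "1"),
    ("Contact", 1, "email", "backue@acmesales.com"),
    ("Contact", 1, "usage", "work"),
    # Nothing stops this "foreign key" from naming a Staff id that does not exist.
    ("Contact", 5, "eid", "99"),
])

def assemble(entity_type, entity_id):
    """Collect every attribute row for one entity into a single dict."""
    rows = conn.execute(
        "SELECT attribute, value FROM eav WHERE type = ? AND id = ?",
        (entity_type, entity_id),
    )
    return dict(rows)

contact = assemble("Contact", 1)   # three rows are read to rebuild one contact

# Referential integrity has to be checked by hand, after the fact.
dangling = conn.execute("""
    SELECT c.id FROM eav AS c
    WHERE c.type = 'Contact' AND c.attribute = 'eid'
      AND NOT EXISTS (
          SELECT 1 FROM eav AS s
          WHERE s.type = 'Staff' AND s.id = CAST(c.value AS INTEGER)
      )
""").fetchall()
```

The database accepted the reference to staff member 99 without complaint; only the hand-written query afterward reveals the problem.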

The best-known application for EAV is electronic medical records, where the database has to store test results that may have any number of attributes. In such a situation, having total flexibility to store whatever data is available in whatever format is needed is very useful. However, not many applications need so much flexibility. 




 







  

 








---
## **Congratulations! You've made it to the end of Lesson 6.**

We will continue our coverage of database design in the next lesson, building on the ER diagrams introduced here.



## **On your way out ... Be sure to save your work**.
In Google Drive, drag this notebook file into your `BUAN6510` folder so you can find it next time.