In [4]:
%reload_ext sql
#%sql postgresql://postgres:postpost@localhost:5433/ensembl
%sql postgresql://<USERNAME>@localhost/ensembl 

'Connected: postgres@ensembl'

# Intro to Database Design

Before we can decide on the structure of our tables, we need to do a *requirements* analysis on the data and how it will be used.

There are three main steps we need to consider when we design a database.

1. Requirements Analysis
   - type of data, types of queries, performance requirements
2. Conceptual Database Design
   - high level description/constraints of the data (ER Model)
3. Logical Database Design
   - convert conceptual design into a database schema

## Requirements Analysis

In requirements analysis, we need to understand the needs of our users. Here is an incomplete list.

1. Who is our user?
2. What problem do they want to solve?
3. What data is available? Can it be used to answer the problem?
4. What sort of queries will they need to understand the data?
5. Will we need to clean the data?
6. How much data will they generate? How quickly does the database need to be updated?

Our example is going to be a restaurant that wants to keep track of its orders and customers for a loyalty program.

1. **Who is our user?** Our user is the restaurant owner. 

2. **What problem do they want to solve?** The primary question they want to answer is: how frequently does a customer order food, and how much do they order? The answer will help them decide what kinds of bonuses to offer loyal customers.

3. **What data is available? Can it be used to answer the problem?** The data comes from two places: the *Point of Sale* terminal, which records each sale with a timestamp, and the *customer* database.

4. **What sort of queries will they need to understand the data?** They will need to join the customer information and the sales information in order to tabulate the spending habits of all customers.

5. **Will we need to clean the data?**

6. **How much data will they generate? How quickly does the database need to be updated?**

## Data Modeling (Conceptual Design)

Data modeling is the process of understanding the relationships between the different kinds of data in a model. We need to understand how the data naturally groups together and how those groups depend on each other.

For example, for a set of customers who order food, we could have everything in a single table. But it doesn't make sense to model it this way, because there is data that doesn't actually group together.

What about having a `customers` table and an `orders` table? These are the *entities* in our data model. In the data modeling process, we need to decide what *fields* are going to belong to our entities.

The type of info in the `customers` table:

```
- customer_id
- name
- address
- phone_number
```

That leads to an Entity-Relationship (ER) diagram for our customer table:

![](docs/image/customer.png)

The type of info in the `orders` table:

```
- customer_id
- order_id
- order_date
- order_total
```

Here's the ER Diagram for the `orders` table:

![](docs/image/orders.png)

# Our final conceptual model

We know that we have to connect these two tables in order to do useful things with the data. The final part of our Entity-Relationship diagram is describing the relationship between entities.

We know that a `customer` **has** multiple `orders`, so let's add that as a relationship between the two entities. We add this relationship with a diamond.

![](docs/image/er-full.png)

Now we are done with our conceptual model and can move ahead with the *logical design* of our database.

# Relationships

We've been skirting around the idea of *relationships* between tables. We want to carefully define this in terms of the *cardinality* of the relationship.

Based on our entities, there are three main types of relationship cardinality that we need to think about. 

I'm not going to go over the visual notation for relationships (also known as *crow's foot* notation), because I don't think it's that helpful at this point. If you want to learn more, check out: https://www.calebcurry.com/cardinality-and-modality/

- One to One
    - example: Principal to School: One principal runs one school
- One to Many
    - example: Teacher to Student: One teacher has many students
- Many to Many
    - example: Clubs to Students: Many students may belong to many clubs
    


# Your Turn

What is the cardinality of the following relationships? Discuss with a buddy or small group. 

- Bosses to Employees
- Genus to Species
- Books to Genres
- Person to Birthplace

# Why Cardinality Matters

Why does this matter? It has to do with how our relational tables interact with each other. 

If we had a `school` table and a `student` table (one-to-many), that means for each row in the `school` table, there are many rows in the `student` table that map to it. 

If we wanted to delete *one* row in `school`, we would probably delete *many* rows in `student`.

*Many* to *Many* relationships can be really hard to manage, and if you can avoid using them in your data model, you should. 

Say we have a `clubs` table and a `student` table. If we deleted a row from `clubs`, which rows in `student` should we delete? Things get very complicated with many-to-many relationships when it comes to the integrity of the data.

We'll talk more about this when we talk about Database Normalization.
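
One common way to represent a many-to-many relationship is a third *linking* table that holds one row per pairing. The sketch below is only an illustration: the `membership` table, the column names, and the sample students are made up, and SQLite (through Python's built-in `sqlite3` module) stands in for Postgres so that it runs without a database server.

```python
import sqlite3

# In-memory SQLite database; an assumption made so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE club    (club_id    INTEGER PRIMARY KEY, name TEXT);

    -- Hypothetical linking table: one row per (student, club) membership.
    CREATE TABLE membership (
        student_id INTEGER REFERENCES student (student_id),
        club_id    INTEGER REFERENCES club (club_id),
        PRIMARY KEY (student_id, club_id)
    );

    INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO club    VALUES (1, 'Chess'), (2, 'Drama');
    INSERT INTO membership VALUES (1, 1), (1, 2), (2, 1);
""")

# Count how many clubs each student belongs to.
clubs_per_student = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM student AS s
    JOIN membership AS m ON m.student_id = s.student_id
    GROUP BY s.student_id, s.name
    ORDER BY s.student_id
""").fetchall()
print(clubs_per_student)
```

With this layout, removing a club only requires removing its rows in `membership`; the `student` rows stay untouched.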

# Logical Design

## What datatype for these columns?

One of the steps we need to do to convert our conceptual model into a logical one is decide on the datatypes for each column.

Postgres has a lot of different data types: https://www.postgresql.org/docs/9.5/datatype.html

Some of the most useful ones:

- [Numeric Types](https://www.postgresql.org/docs/9.5/datatype-numeric.html)
    - `INTEGER`, `NUMERIC`
- [Character Types](https://www.postgresql.org/docs/9.5/datatype-character.html)
    - `CHARACTER`, `TEXT` 
- [Date/Time Types](https://www.postgresql.org/docs/9.5/datatype-datetime.html)
    - `DATE`, `TIME`, `INTERVAL`
- [Boolean Types](https://www.postgresql.org/docs/9.5/datatype-boolean.html)
    - `BOOLEAN`

Other useful ones:
    
- [Universally Unique Identifiers (UUID)](https://www.postgresql.org/docs/9.5/datatype-uuid.html)
- [JSON](https://www.postgresql.org/docs/9.5/datatype-json.html)

There are modifiers that we can add to limit the amount of storage space used by the database. For example, if we are storing a Social Security Number (SSN) as a string in the form `123-45-6789`, we know it has 11 characters (9 digits and 2 hyphens). So we could specify the datatype as `CHARACTER(11)`. There is also a `VARYING` modifier: `CHARACTER VARYING(n)` stores strings of up to `n` characters.

# Your Turn

For each *field* in the `orders` table, decide on a datatype based on the above datatypes. You may have to look at the documentation for the above datatypes to decide this.  Don't worry about the modifiers for now.

```
- customer_id
- order_id
- order_date
- order_total
```

# Our Final Schema

Now that we have decided on our data types, we can convert our ER diagram into a *database schema*, which precisely identifies the fields and their data types in each table.

A Database Administrator (DBA) can start building the database by writing the SQL that creates our tables.

![](docs/image/customerexample.png)

# Now We Can Build Our Database!

Now that we have our schema, we can actually start building our database using `CREATE TABLE` statements.

# Primary Keys and Tables

A primary key is a unique identifier for a row in a table. Its value must be unique across all rows in the table and cannot be `NULL`. A primary key is an example of a *constraint* on a database table.

Most tables have a primary key; they are usually based on the *Entity* the table is modeled on. For example, the primary key for the `customer` table is `customer_id`. We shouldn't have rows in our table that have duplicate `customer_id`.

Do we have to generate `customer_id` ourselves? We can add these manually, but databases have mechanisms for generating them automatically. In Postgres, the `SERIAL` type is one such mechanism.

In [17]:
%%sql
DROP TABLE IF EXISTS customer;
CREATE TABLE customer
    (
    customer_id SERIAL PRIMARY KEY,
    name CHARACTER VARYING,
    address TEXT,
    phone_number CHARACTER VARYING
    );

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.


[]

# Question

What is the primary key for the `gene` table? (It's not defined as a primary key here, but we could add this as a constraint).

```
CREATE TABLE gene
  (
      ensembl_gene_id character(25),
      gene_strand integer,
      gene_end integer,
      gene_start integer,
      chromosome character varying,
      gene_symbol character varying
  );
```

# INSERT INTO

Now we've created our `customer` table. How do we get data into it?

Let's add a customer into it. We can do this with the `INSERT INTO` statement.

We first define what columns we are inserting into, and then we can insert a tuple of values into those columns.

Note we don't need to add a `customer_id` entry. It's automatically filled out because we made that field a `SERIAL PRIMARY KEY`.

In [18]:
%%sql
INSERT INTO customer(name, address, phone_number)
VALUES
   ('Mary Worth', '1555 Charterstone Lane', '555-555-5555');

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


[]

In [13]:
%sql SELECT * FROM customer

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


customer_id,name,address,phone_number
1,Mary Worth,1555 Charterstone Lane,555-555-5555


We can add multiple rows to our `customer` table by providing multiple tuples:

In [19]:
%%sql
INSERT INTO customer(name, address, phone_number)
VALUES
   ('Ian Cameron', '1554 Charterstone Lane', '555-555-1554'),
   ('Toby Cameron', '1554 Charterstone Lane', '555-555-1554'),
   ('Wilbur Weston', '1533 Charterstone Lane', '555-555-1533'),
   ('Jeff Cory', '1511 Charterstone Lane', '555-555-1511');

 * postgresql://postgres:***@localhost:5433/ensembl
4 rows affected.


[]

In [20]:
%sql SELECT * FROM customer;

 * postgresql://postgres:***@localhost:5433/ensembl
5 rows affected.


customer_id,name,address,phone_number
1,Mary Worth,1555 Charterstone Lane,555-555-5555
2,Ian Cameron,1554 Charterstone Lane,555-555-1554
3,Toby Cameron,1554 Charterstone Lane,555-555-1554
4,Wilbur Weston,1533 Charterstone Lane,555-555-1533
5,Jeff Cory,1511 Charterstone Lane,555-555-1511
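
The primary key keeps working for us during inserts: ids are assigned automatically, and an id that is already taken is rejected. Here is a minimal sketch of both behaviours. It assumes SQLite (Python's built-in `sqlite3` module) as a stand-in for Postgres so it can run without a server, and it uses `INTEGER PRIMARY KEY` where our Postgres table uses `SERIAL PRIMARY KEY`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY is SQLite's rough counterpart to SERIAL PRIMARY KEY.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        address TEXT,
        phone_number TEXT
    )
""")

# No customer_id supplied: the database assigns one.
cur = conn.execute("INSERT INTO customer (name) VALUES ('Mary Worth')")
generated_id = cur.lastrowid

# Reusing an existing customer_id violates the primary key constraint.
try:
    conn.execute(
        "INSERT INTO customer (customer_id, name) VALUES (?, 'Ian Cameron')",
        (generated_id,),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(generated_id, duplicate_rejected)
```

Postgres reports the same kind of problem as a unique violation on the primary key.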


# FOREIGN KEYs

Remember when we talked about *relationships* between tables?

We strictly define these *relationships* using `FOREIGN KEY`s in our table definitions. To define a column as a foreign key, we write `FOREIGN KEY` followed by the column name in parentheses, and then `REFERENCES` followed by the table and the column name that it refers to.

In [43]:
%%sql
DROP TABLE IF EXISTS orders;

CREATE TABLE orders
  (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE,
    order_total MONEY,
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
  );


 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.


[]

In [45]:
%%sql
INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(2,'2020-01-02', 20.50),
(1, '2020-01-03', 10.00),
(2, '2020-01-03', 30.50);


 * postgresql://postgres:***@localhost:5433/ensembl
3 rows affected.


[]

In [49]:
%sql SELECT * FROM orders

 * postgresql://postgres:***@localhost:5433/ensembl
4 rows affected.


order_id,customer_id,order_date,order_total
1,2,2020-01-02,$20.50
2,1,2020-01-03,$10.00
3,2,2020-01-03,$30.50
5,10,2020-01-02,$10.50
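
With both tables populated, we can go back to the restaurant owner's original question: how often does each customer order, and how much do they spend? The sketch below joins the two tables and aggregates per customer. It is an illustration under assumptions: SQLite (Python's built-in `sqlite3`) stands in for Postgres, `REAL` stands in for `MONEY`, and only two customers and three orders are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        order_date  TEXT,  -- dates kept as ISO strings in this sketch
        order_total REAL   -- stand-in for the Postgres MONEY type
    );
    INSERT INTO customer VALUES (1, 'Mary Worth'), (2, 'Ian Cameron');
    INSERT INTO orders (customer_id, order_date, order_total) VALUES
        (2, '2020-01-02', 20.50),
        (1, '2020-01-03', 10.00),
        (2, '2020-01-03', 30.50);
""")

# One row per customer: number of orders and total amount spent.
summary = conn.execute("""
    SELECT c.name,
           COUNT(o.order_id)  AS n_orders,
           SUM(o.order_total) AS total_spent
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY total_spent DESC
""").fetchall()
print(summary)
```

The `JOIN ... ON` clause relies on the same `customer_id` column that the foreign key ties together.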


# Constraints: Reinforcing Integrity

What if we try to insert an order for a customer that doesn't exist in the `customer` table? The DBMS returns an error if we try to do this. Because of our foreign key constraint, the `customer_id` has to exist in the `customer` table before we can insert an order that refers to it.

When you run this, it's easiest to look at the very bottom for the SQL error message.

In [47]:
%%sql
INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);


 * postgresql://postgres:***@localhost:5433/ensembl


IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "orders" violates foreign key constraint "orders_customer_id_fkey"
DETAIL:  Key (customer_id)=(10) is not present in table "customer".

[SQL: INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);]
(Background on this error at: http://sqlalche.me/e/gkpj)

In [48]:
%%sql
INSERT INTO customer(customer_id, name, address, phone_number)
VALUES
   (10,'Tommy Beedie', '1530 Charterstone Lane', '555-555-1530');

INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.
1 rows affected.


[]

# Try This

Insert a new customer below without specifying the `customer_id`, then query the `customer` table again. What is your new row's `customer_id`?

In [None]:
%%sql

# Database Dumps

Database dumps consist of all the SQL you need to recreate a database on another server, or instance. These database dumps can be used to provide *mirror* servers for data that is accessed across the world.

Postgres provides a command-line tool, `pg_dump`, to dump a database.
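
To get a feel for what a dump contains, here is a small sketch of the idea. It does not call the Postgres dump tool; as an assumption for the sake of a runnable example, it uses SQLite through Python's built-in `sqlite3` module, whose `iterdump()` method plays a similar role. The dump is just SQL text, and replaying it on an empty database recreates the data.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer (name) VALUES ('Mary Worth'), ('Ian Cameron');
""")

# iterdump() yields the SQL statements needed to rebuild the database.
dump_sql = "\n".join(source.iterdump())
print(dump_sql)

# Replaying the dump on a fresh connection recreates the same rows.
mirror = sqlite3.connect(":memory:")
mirror.executescript(dump_sql)
mirror_rows = mirror.execute(
    "SELECT customer_id, name FROM customer ORDER BY customer_id"
).fetchall()
print(mirror_rows)
```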

# Other Constraints

There are lots of other constraints we can place on tables that are extremely useful.

`NOT NULL` is one constraint that is used very often, for required fields. For example, we could add `NOT NULL` to the `name` column:

```
CREATE TABLE customer
    (
    customer_id SERIAL PRIMARY KEY,
    name CHARACTER VARYING NOT NULL,
    address TEXT,
    phone_number CHARACTER VARYING
    );
```
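
We can check that a required field is really enforced by trying to insert a row that leaves it out. This is a minimal sketch, assuming SQLite (Python's built-in `sqlite3`) as a stand-in for Postgres; the wording of the error message differs between the two, but both refuse the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        address TEXT,
        phone_number TEXT
    )
""")

try:
    # No name supplied, so the column would be NULL.
    conn.execute("INSERT INTO customer (address) VALUES ('1555 Charterstone Lane')")
    error_message = None
except sqlite3.IntegrityError as exc:
    error_message = str(exc)

print(error_message)
```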

`ON DELETE CASCADE` is also extremely useful. We can add this to the end of our `FOREIGN KEY` constraint so that if we delete a customer from our `customer` table, their orders will also be removed from the `orders` table.

```
CREATE TABLE orders
  (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE,
    order_total MONEY,
    FOREIGN KEY (customer_id) REFERENCES 
        customer (customer_id) ON DELETE CASCADE
  );
```

There are many more types of constraints on databases, and you can learn about them here: https://www.postgresql.org/docs/8.2/ddl-constraints.html

# Transactions

The operations we have run on the database so far are all examples of *transactions*.

A transaction is defined as a *unit of work* on the database. By default, every SQL statement we run is executed as its own transaction. This could be an `ALTER TABLE`, an `INSERT INTO`, an `UPDATE` or a `DELETE FROM`.

The point of a transaction is that it is *atomic*: either all of its changes are applied or none of them are. Transactions are also *isolated*, meaning each one runs independently of other transactions. This is especially important when multiple people are modifying the contents of a table at once, and it is fundamental to keeping the *integrity* of the data within the database intact.

- In databases, there is a standard called [ACID](https://en.wikipedia.org/wiki/ACID) that is applied to transactions
    - atomicity
    - consistency 
    - isolation 
    - durability
    
Being ACID compliant is very important for ensuring the integrity of the data in the database. 

ACID compliance is also at the heart of what are called *stored procedures*, which bundle a series of SQL statements so that they can act like a single transaction in the database. The database uses locks so that these steps run consecutively (right after each other) without interference from other transactions, to further ensure integrity.
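
Atomicity is easiest to see when something goes wrong halfway through. In the sketch below, two inserts are grouped into one transaction; the second one fails, so the first one is undone as well. Assumptions: SQLite (Python's built-in `sqlite3`) stands in for Postgres, the tables are cut down to a few columns, and customer id `99` is a made-up id that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        order_total REAL
    );
""")

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Mary Worth')")
        # Customer 99 does not exist, so this statement fails...
        conn.execute("INSERT INTO orders (customer_id, order_total) VALUES (99, 10.00)")
except sqlite3.IntegrityError:
    pass

# ...and the first insert is rolled back along with it.
customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(customer_count)
```

In Postgres, the same grouping is written with `BEGIN;` and `COMMIT;` around the statements.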

# ALTER TABLE

What if we wanted to add a column to the table, or add a foreign key constraint on a field? Do we have to delete everything?

Nope. We can use `ALTER TABLE` to add columns or add constraints. 

More info about `ALTER TABLE` here: https://www.postgresql.org/docs/9.1/sql-altertable.html

In [5]:
%%sql 
    ALTER TABLE customer 
        ALTER COLUMN name SET NOT NULL;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.


[]

Adding an `ON DELETE CASCADE` rule to our `FOREIGN KEY` requires us to first `DROP CONSTRAINT` on our foreign key and then add it back. I know the name of the foreign key constraint because I used the `psql` shell and typed in `\d orders`.

In [9]:
%%sql
ALTER TABLE orders
DROP CONSTRAINT orders_customer_id_fkey;

ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
        ON DELETE CASCADE;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.


[]
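
Adding a column works the same way and leaves existing rows in place. A minimal sketch, assuming SQLite (Python's built-in `sqlite3`) as a stand-in for Postgres and a hypothetical `email` column that is not part of our schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer (name) VALUES ('Mary Worth');
""")

# Add the hypothetical email column; the existing row is kept.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")

rows = conn.execute("SELECT customer_id, name, email FROM customer").fetchall()
print(rows)
```

The new column is empty (`NULL`, shown as `None` in Python) for rows that existed before the change.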

# DELETE FROM

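`DELETE FROM` removes rows from a table. Always include a `WHERE` clause to choose which rows to remove; without one, every row in the table is deleted.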

```
%%sql
DELETE FROM orders
    WHERE customer_id = 10;

SELECT * FROM orders;
```

# Your Turn

Use `DELETE FROM` to remove `customer_id` `1` from the `customer` table. Show that the `orders` table has changed.

In [54]:
%%sql


# What you learned today

This was a big topic! Thanks for sticking with it.

- The Different Phases of the Database Design Process
- `CREATE TABLE` and the different datatypes
- Relationships and why cardinality matters
- Primary and Foreign Keys and constraints
- `INSERT INTO`, `ALTER TABLE`, and `DELETE FROM`
- Using constraints to enforce database integrity
- Why transactions are important to databases