In [4]:
%reload_ext sql
%sql postgresql://postgres:postpost@localhost:5433/ensembl
#%sql postgresql://<USERNAME>@localhost/ensembl 

'Connected: postgres@ensembl'

# Intro to Database Design

Before we can decide on the structure of our tables, we need to do a *requirements* analysis on the data and how it will be used.

There are three main steps we need to consider when we design a database.

1. Requirements Analysis
2. Conceptual Database Design
3. Logical Database Design

## Requirements Analysis

In requirements analysis, we need to understand the needs of our users: what sort of queries will they need to run to understand the data?

Our example is going to be a Chinese restaurant that wants to keep track of its orders and customers for a loyalty program.

The primary query they want answers two questions: how frequently does a customer order food, and how much do they order? The answers will help them decide what kinds of bonuses to offer loyal customers.

## Data Modeling (Conceptual Design)

Data modeling is the process of understanding the relationships between the different kinds of data in a model. We need to understand how the data naturally groups together and how those groups depend on each other.

For example, consider a set of customers who order food. We could put everything in a single table, but it doesn't make sense to model it this way, because some of the data doesn't actually group together.

What about having a `customers` table and an `orders` table?

The type of info in the `customers` table:

```
- customer_id
- name
- address
- phone_number
```

That leads to an Entity-Relationship (ER) diagram:



The type of info in the `orders` table:

```
- customer_id
- order_id
- order_date
- order_total
```


# Logical Design

## What datatype for these columns?

One thing about databases: they take up a lot of disk space.

There are ways to reduce this. If we know that a data field only ever holds 10 digits, for example (such as a phone number), we can allocate only that much storage when we define it.

Postgres has a lot of different data types: https://www.postgresql.org/docs/9.5/datatype.html

Some of the most useful ones:

- [Numeric Types](https://www.postgresql.org/docs/9.5/datatype-numeric.html)
    - `INTEGER`, `NUMERIC`
- [Character Types](https://www.postgresql.org/docs/9.5/datatype-character.html)
    - `CHARACTER`, `TEXT` 
- [Date/Time Types](https://www.postgresql.org/docs/9.5/datatype-datetime.html)
    - `DATE`, `TIME`, `INTERVAL`
- [Boolean Types](https://www.postgresql.org/docs/9.5/datatype-boolean.html)
    - `BOOLEAN`

Other useful ones:
    
- [Universally Unique Identifiers (UUID)](https://www.postgresql.org/docs/9.5/datatype-uuid.html)
- [JSON](https://www.postgresql.org/docs/9.5/datatype-json.html)

# Your Turn

For each column in the `orders` table, decide on a datatype based on the above datatypes. 

```
- customer_id
- order_id
- order_date
- order_total
```

# Relationships

We've been skirting around the idea of *relationships* between tables. We want to carefully define this in terms of the *cardinality* of the relationship.

Based on our entities, there are three main types of relationship cardinality that we need to think about. 

- One to One
    - example: Student to Student ID Number: One student has one ID number, and each ID number belongs to one student
- One to Many
    - example: Teacher to Student: One teacher has many students
- Many to Many
    - example: Clubs and Students: One student may belong to many clubs, and one club may have many students
    
Why does this matter? It has to do with how our relational tables interact with each other. 

If we had a `school` table and a `student` table (one-to-many), that means for each row in the `school` table, there are many rows in the `student` table that map to it. 

If we wanted to delete a row in `school`, we would probably also need to delete the many rows in `student` that refer to it.

*Many* to *Many* relationships can be really hard to manage, and if you can avoid using them in your data model, you should. 

If we deleted a row from `clubs`, which rows in `students` should we delete? Things get very complicated in terms of many-to-many relationships and the integrity of the data.

We'll talk more about this when we talk about Database Normalization.
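
To make the one-to-many `school`/`student` case concrete, here is a minimal sketch of what deleting a "one" row can do to the "many" rows. It makes several assumptions: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for our Postgres server, the table layouts and names are hypothetical, and it uses `ON DELETE CASCADE`, which is one possible policy and not something this notebook sets up elsewhere.

```python
import sqlite3

# In-memory SQLite database: a hypothetical stand-in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

conn.execute("CREATE TABLE school (school_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name TEXT,
        school_id INTEGER REFERENCES school (school_id) ON DELETE CASCADE
    )
    """
)

# One school row maps to many student rows (illustrative data).
conn.execute("INSERT INTO school (school_id, name) VALUES (1, 'North High'), (2, 'South High')")
conn.execute("INSERT INTO student (name, school_id) VALUES ('Ana', 1), ('Ben', 1), ('Cy', 2)")

# Deleting one school row also removes every student row that references it.
conn.execute("DELETE FROM school WHERE school_id = 1")
remaining_students = conn.execute("SELECT name, school_id FROM student").fetchall()
print(remaining_students)
```

Only the student attached to the surviving school is left. Without the cascade rule, the database would instead refuse the delete while students still referenced the school.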

# Your Turn

What is the cardinality of the following relationships? Discuss with a buddy or small group. 

- Bosses to Employees
- Genus to Species
- Books to Genres
- Person to Birthplace

# Our Final ER diagram

![](docs/image/customerexample.png)

# Primary Keys and Tables

A primary key is a unique identifier of a row in the data. Its value must be unique to that row and must not appear in any other row of the table. A primary key is an example of a *constraint* on a database table.

Most tables have a primary key; they are usually based on the *Entity* the table is modeled on. For example, the primary key for the `customer` table is `customer_id`. We shouldn't have rows in our table that have duplicate `customer_id`.

Do we have to generate `customer_id` ourselves? We can add these values manually, but databases have mechanisms for generating them automatically. In Postgres, declaring the column as `SERIAL` does this.

In [17]:
%%sql
DROP TABLE IF EXISTS customer;
CREATE TABLE customer
    (
    customer_id SERIAL PRIMARY KEY,
    name CHARACTER VARYING,
    address TEXT,
    phone_number CHARACTER VARYING
    );

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.


[]

# Question

What is the primary key for the `gene` table? (It's not defined as a primary key here, but we could add this as a constraint).

```
CREATE TABLE gene
  (
      ensembl_gene_id character(25),
      gene_strand integer,
      gene_end integer,
      gene_start integer,
      chromosome character varying,
      gene_symbol character varying
  );
```

# INSERT INTO

Now we've created our `customer` table. Let's add a customer to it. We can do this with the `INSERT INTO` statement.

We first define what columns we are inserting into, and then we can insert a tuple of values into those columns.

Note we don't need to add a `customer_id` entry. It's automatically filled out because we made that field a `SERIAL PRIMARY KEY`.

In [18]:
%%sql
INSERT INTO customer(name, address, phone_number)
VALUES
   ('Mary Worth', '1555 Charterstone Lane', '555-555-5555');

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


[]

In [13]:
%sql SELECT * FROM customer

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


customer_id,name,address,phone_number
1,Mary Worth,1555 Charterstone Lane,555-555-5555


We can add multiple rows to our `customer` table by providing multiple tuples:

In [19]:
%%sql
INSERT INTO customer(name, address, phone_number)
VALUES
   ('Ian Cameron', '1554 Charterstone Lane', '555-555-1554'),
   ('Toby Cameron', '1554 Charterstone Lane', '555-555-1554'),
   ('Wilbur Weston', '1533 Charterstone Lane', '555-555-1533'),
   ('Jeff Cory', '1511 Charterstone Lane', '555-555-1511');

 * postgresql://postgres:***@localhost:5433/ensembl
4 rows affected.


[]

In [20]:
%sql SELECT * FROM customer;

 * postgresql://postgres:***@localhost:5433/ensembl
5 rows affected.


customer_id,name,address,phone_number
1,Mary Worth,1555 Charterstone Lane,555-555-5555
2,Ian Cameron,1554 Charterstone Lane,555-555-1554
3,Toby Cameron,1554 Charterstone Lane,555-555-1554
4,Wilbur Weston,1533 Charterstone Lane,555-555-1533
5,Jeff Cory,1511 Charterstone Lane,555-555-1511
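
The uniqueness guarantee of a primary key can be checked directly by trying to insert a duplicate id. The sketch below is an illustration under assumptions, not part of the notebook's database: it uses Python's built-in `sqlite3` module with an in-memory database, and `INTEGER PRIMARY KEY` is SQLite's closest analogue to the `SERIAL PRIMARY KEY` we used in Postgres.

```python
import sqlite3

# In-memory SQLite database: a hypothetical stand-in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- SQLite analogue of SERIAL PRIMARY KEY
        name TEXT,
        address TEXT,
        phone_number TEXT
    )
    """
)
conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Mary Worth')")

try:
    # Reusing customer_id 1 violates the primary key constraint.
    conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Ian Cameron')")
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print("Rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(row_count)
```

The second insert is refused and the table still holds a single row. Postgres reports the same kind of violation, though its error message is worded differently.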


# FOREIGN KEYs

Remember when we talked about *relationships* between the tables in our relational database? 

We strictly define these *relationships* using `FOREIGN KEY`s in our table definitions. To define a column as a foreign key, we write the `FOREIGN KEY` keyword followed by the column name in parentheses, and then `REFERENCES` followed by the table and the column that it refers to.

In [43]:
%%sql
DROP TABLE IF EXISTS orders;

CREATE TABLE orders
  (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_date DATE,
    order_total MONEY,
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
  );


 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.


[]

In [45]:
%%sql
INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(2,'2020-01-02',20.50),
(1, '2020-01-03', 10.00),
(2, '2020-01-03', 30.50);


 * postgresql://postgres:***@localhost:5433/ensembl
3 rows affected.


[]

In [46]:
%sql SELECT * FROM orders

 * postgresql://postgres:***@localhost:5433/ensembl
3 rows affected.


order_id,customer_id,order_date,order_total
1,2,2020-01-02,$20.50
2,1,2020-01-03,$10.00
3,2,2020-01-03,$30.50
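
With both tables linked by `customer_id`, we can sketch the query from our requirements analysis: how often does each customer order, and how much do they spend? This is a minimal illustration under assumptions. It uses Python's built-in `sqlite3` module with an in-memory database in place of Postgres, and because SQLite has no `MONEY` or `DATE` type, the hypothetical `order_total_cents` column stores integer cents and dates are ISO strings.

```python
import sqlite3

# In-memory SQLite database: a hypothetical stand-in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        order_date TEXT,            -- ISO date strings stand in for DATE
        order_total_cents INTEGER   -- integer cents stand in for MONEY
    );
    INSERT INTO customer (customer_id, name) VALUES
        (1, 'Mary Worth'), (2, 'Ian Cameron');
    INSERT INTO orders (customer_id, order_date, order_total_cents) VALUES
        (2, '2020-01-02', 2050),
        (1, '2020-01-03', 1000),
        (2, '2020-01-03', 3050);
    """
)

# Count orders and sum totals per customer by joining on the foreign key.
summary = conn.execute(
    """
    SELECT c.name,
           COUNT(o.order_id) AS n_orders,
           SUM(o.order_total_cents) AS total_cents
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
    """
).fetchall()
print(summary)
```

Each result row pairs a customer name with their order count and total spend. The `SELECT` itself should also run against our Postgres tables if `order_total_cents` is swapped for `order_total`.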


# Database Dumps

Database dumps consist of all the SQL you need to recreate a database on another server, or instance. These database dumps can be used to provide *mirror* servers for data that is accessed

# Constraints: Reinforcing Integrity

What if we try to insert an order for a customer that doesn't exist in the `customer` table? The DBMS returns an error. Because of our foreign key constraint, the `customer_id` has to exist in the `customer` table before we can insert an order that refers to it.

In [47]:
%%sql
INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);


 * postgresql://postgres:***@localhost:5433/ensembl


IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "orders" violates foreign key constraint "orders_customer_id_fkey"
DETAIL:  Key (customer_id)=(10) is not present in table "customer".

[SQL: INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);]
(Background on this error at: http://sqlalche.me/e/gkpj)

In [48]:
%%sql
INSERT INTO customer(customer_id, name, address, phone_number)
VALUES
   (10,'Tommy Beedie', '1530 Charterstone Lane', '555-555-1530');

INSERT INTO orders(customer_id, order_date, order_total)
VALUES
(10,'2020-01-02',10.50);

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.
1 rows affected.


[]

# Try This

Insert a new customer below without specifying the `customer_id`, then query the `customer` table again. What is its new id?

In [None]:
%%sql

# Transactions

The operations we have run on the database so far are *transactions*. A transaction is a *unit of work* on the database. By default, each SQL statement we run, such as an `ALTER TABLE`, an `INSERT INTO`, or a `DELETE FROM`, is its own transaction, and several statements can also be grouped into a single transaction.

A transaction is *atomic*: either all of its changes are applied or none are. It is also *isolated*: it behaves as if it runs independently of other transactions. This is especially important when multiple people are modifying the contents of a table at once, and it is fundamental to keeping the *integrity* of the data within the database intact. 

- In databases, there is a set of properties called [ACID](https://en.wikipedia.org/wiki/ACID) that transactions are expected to satisfy:
    - atomicity
    - consistency 
    - isolation 
    - durability
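
Atomicity can be seen in a small experiment: group two inserts into one transaction, let the second one fail, and check that the first one is undone as well. This is a minimal sketch under assumptions. It uses Python's built-in `sqlite3` module with an in-memory database in place of Postgres, the simplified table layouts are hypothetical, and it relies on the `with conn:` block, which commits on success and rolls back when an exception escapes.

```python
import sqlite3

# In-memory SQLite database: a hypothetical stand-in for the Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript(
    """
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer (customer_id),
        order_total_cents INTEGER
    );
    INSERT INTO customer (customer_id, name) VALUES (1, 'Mary Worth');
    """
)

try:
    with conn:  # one transaction: commit on success, roll back on error
        # Valid on its own: customer 1 exists.
        conn.execute("INSERT INTO orders (customer_id, order_total_cents) VALUES (1, 1000)")
        # Customer 10 does not exist, so this violates the foreign key.
        conn.execute("INSERT INTO orders (customer_id, order_total_cents) VALUES (10, 1050)")
except sqlite3.IntegrityError as err:
    print("Transaction rolled back:", err)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(order_count)
```

Even though the first insert was valid, the `orders` table ends up empty, because the failed second insert rolled back the whole unit of work.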


