In [1]:
%reload_ext sql

##Connect to the database
%sql postgresql://postgres:postpost@localhost:5433/ensembl
#%sql postgresql://<USERNAME>@localhost/ensembl            
  

'Connected: postgres@ensembl'

# Learning Objectives

- **Understand** database anomalies and how they lead to dirty data
- **Understand** what problems database normalization solves
- **Identify** whether a table is in 1st Normal Form (1NF) and how to fix it if it isn't
- **Identify** whether a table is in 2nd Normal Form (2NF) and how to fix it if it isn't
- **Identify** whether a table is in 3rd Normal Form (3NF) and how to fix it if it isn't
- **Understand** the tradeoffs between normalization and database complexity

# Why does data get dirty in a database?

Unforeseen data dependencies can make our data dirty and inconsistent.

Oftentimes, we are given a dataset in a form that has dependencies.

What kind of dependencies are we talking about?

- Non-unique rows in our database
- Rows that have repeated primary keys
- Different entities and their associated info stored in the same table

## Motivating Example

Let's start with a table of genes and pathways:






In [10]:
%%sql
  DROP TABLE IF EXISTS gene_and_pathway;
  CREATE TABLE gene_and_pathway 
    (
      ensembl_gene_id CHARACTER (25) NOT NULL,
      gene_symbol CHARACTER (30) NOT NULL,
      -- pathway columns are nullable so that a gene can be stored without a pathway
      reactome_pathway_id CHARACTER (25),
      reactome_pathway_name CHARACTER VARYING
  );  
  -- the CSV lists gene_symbol first, so the COPY column list must follow that order
  COPY gene_and_pathway(gene_symbol, ensembl_gene_id, reactome_pathway_id, reactome_pathway_name)
FROM 'c:/Code/BMI535slides/data/pathway_gene.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.
7 rows affected.


[]

In [11]:
%sql SELECT * FROM gene_and_pathway

 * postgresql://postgres:***@localhost:5433/ensembl
7 rows affected.


ensembl_gene_id,gene_symbol,reactome_pathway_id,reactome_pathway_name
ENSG00000198793,JAK1,R-HSA-168256,Immune System
ENSG00000198793,JAK1,R-HSA-913531,Interferon Signaling
ENSG00000142208,AKT1,R-HSA-168256,Immune System
ENSG00000142208,AKT1,R-HSA-382551,Transport of small molecules
ENSG00000162434,MTOR,R-HSA-168256,Immune System
ENSG00000168610,STAT3,R-HSA-168256,Immune System
ENSG00000168610,STAT3,R-HSA-382551,Transport of small molecules


# QUESTION

What is the cardinality (one-to-one, one-to-many, or many-to-many) between `gene_symbol` and `reactome_pathway_name`?

# Insertion Anomaly

In [22]:
%%sql
    INSERT INTO gene_and_pathway(ensembl_gene_id, gene_symbol)
    VALUES ('ENSG00000140285', 'FGF7');

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


[]

In [23]:
%%sql
    SELECT * FROM gene_and_pathway;

 * postgresql://postgres:***@localhost:5433/ensembl
8 rows affected.


id,ensembl_gene_id,gene_symbol,reactome_pathway_id,reactome_pathway_name
1,ENSG00000198793,JAK1,R-HSA-168256,Immune System
2,ENSG00000198793,JAK1,R-HSA-913531,Interferon Signaling
3,ENSG00000142208,AKT1,R-HSA-168256,Immune System
4,ENSG00000142208,AKT1,R-HSA-382551,Transport of small molecules
5,ENSG00000162434,MTOR,R-HSA-168256,Immune System
6,ENSG00000168610,STAT3,R-HSA-168256,Immune System
7,ENSG00000168610,STAT3,R-HSA-382551,Transport of small molecules
8,ENSG00000140285,FGF7,,


Uh oh! We've introduced an *insertion* anomaly: because we only added the gene information (`ensembl_gene_id` and `gene_symbol`), the new row has no pathway information at all.

# Deletion Anomaly

Something equally bad happens if we try to delete `JAK1` from our table. Which pathway was lost when we deleted it?

In [24]:
%%sql
    DELETE FROM gene_and_pathway WHERE gene_symbol = 'JAK1';

 * postgresql://postgres:***@localhost:5433/ensembl
2 rows affected.


[]

In [26]:
%sql SELECT * FROM gene_and_pathway;

 * postgresql://postgres:***@localhost:5433/ensembl
6 rows affected.


id,ensembl_gene_id,gene_symbol,reactome_pathway_id,reactome_pathway_name
3,ENSG00000142208,AKT1,R-HSA-168256,Immune System
4,ENSG00000142208,AKT1,R-HSA-382551,Transport of small molecules
5,ENSG00000162434,MTOR,R-HSA-168256,Immune System
6,ENSG00000168610,STAT3,R-HSA-168256,Immune System
7,ENSG00000168610,STAT3,R-HSA-382551,Transport of small molecules
8,ENSG00000140285,FGF7,,
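
The deletion anomaly can also be checked programmatically. The sketch below is only an illustration: it assumes an in-memory SQLite database as a stand-in for the PostgreSQL connection used in this notebook, and the helper name `pathway_names` is made up for this example. It compares the set of pathway names before and after deleting `JAK1`.

```python
import sqlite3

# Rows copied from the gene_and_pathway table shown earlier.
ROWS = [
    ("ENSG00000198793", "JAK1", "R-HSA-168256", "Immune System"),
    ("ENSG00000198793", "JAK1", "R-HSA-913531", "Interferon Signaling"),
    ("ENSG00000142208", "AKT1", "R-HSA-168256", "Immune System"),
    ("ENSG00000142208", "AKT1", "R-HSA-382551", "Transport of small molecules"),
    ("ENSG00000162434", "MTOR", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-382551", "Transport of small molecules"),
]

# Assumption: SQLite in memory stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene_and_pathway (
           ensembl_gene_id TEXT NOT NULL,
           gene_symbol TEXT NOT NULL,
           reactome_pathway_id TEXT,
           reactome_pathway_name TEXT)"""
)
conn.executemany("INSERT INTO gene_and_pathway VALUES (?, ?, ?, ?)", ROWS)


def pathway_names(connection):
    """Return the set of pathway names currently stored in the table."""
    cursor = connection.execute(
        "SELECT DISTINCT reactome_pathway_name FROM gene_and_pathway"
    )
    return {name for (name,) in cursor}


before = pathway_names(conn)
conn.execute("DELETE FROM gene_and_pathway WHERE gene_symbol = 'JAK1'")
after = pathway_names(conn)

# Pathways that no longer appear anywhere in the table.
lost = before - after
print(lost)
```

Any pathway that only `JAK1` belonged to ends up in `lost`: deleting a gene silently deleted our only record of that pathway.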


# What Can We Do About These Anomalies?

In short, we can ensure that our data is *normalized* before we add it to our database. 

We have a *many-to-many* relationship in our table between the *gene* entity and the *pathway* entity. That is, a gene can participate in many pathways, and a pathway can include many genes.
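
One way to confirm the cardinality is to count in both directions. This is a minimal sketch, again assuming an in-memory SQLite database in place of PostgreSQL; the variable names are illustrative only.

```python
import sqlite3

# Rows copied from the gene_and_pathway table shown earlier.
ROWS = [
    ("ENSG00000198793", "JAK1", "R-HSA-168256", "Immune System"),
    ("ENSG00000198793", "JAK1", "R-HSA-913531", "Interferon Signaling"),
    ("ENSG00000142208", "AKT1", "R-HSA-168256", "Immune System"),
    ("ENSG00000142208", "AKT1", "R-HSA-382551", "Transport of small molecules"),
    ("ENSG00000162434", "MTOR", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-382551", "Transport of small molecules"),
]

# Assumption: SQLite in memory stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene_and_pathway (
           ensembl_gene_id TEXT, gene_symbol TEXT,
           reactome_pathway_id TEXT, reactome_pathway_name TEXT)"""
)
conn.executemany("INSERT INTO gene_and_pathway VALUES (?, ?, ?, ?)", ROWS)

# How many distinct pathways does each gene appear in?
pathways_per_gene = dict(conn.execute(
    """SELECT gene_symbol, COUNT(DISTINCT reactome_pathway_id)
       FROM gene_and_pathway GROUP BY gene_symbol"""
))

# How many distinct genes does each pathway contain?
genes_per_pathway = dict(conn.execute(
    """SELECT reactome_pathway_name, COUNT(DISTINCT ensembl_gene_id)
       FROM gene_and_pathway GROUP BY reactome_pathway_name"""
))

# Many-to-many: some gene has several pathways AND some pathway has several genes.
many_to_many = (
    max(pathways_per_gene.values()) > 1 and max(genes_per_pathway.values()) > 1
)
print(pathways_per_gene, genes_per_pathway, many_to_many)
```

If only one of the two maxima were greater than 1, the relationship would be one-to-many instead.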

# The 1st Normal Form (1NF)

Our journey to normalizing the data starts with the first normal form.

- Data is stored in tables with rows uniquely identified by a primary key
- Data within each table is stored in individual columns in its most reduced form
- There are no repeating groups (such as reactome_pathway_id1, reactome_pathway_id2)

Does our data meet these criteria? Taking each in turn:

1. No. We don't meet this criterion, since no primary key has been defined that uniquely identifies each row.
2. Yes. This appears to be the case, as we don't have any fields that contain multiple values.
3. Yes. There are no repeating columns.

OK, let's try this again, fixing criterion 1. One way to get a unique primary key is to make a *composite* key, which combines multiple columns into a single primary key. We can do that by adding a `PRIMARY KEY` constraint to our table.


In [43]:
%%sql
  DROP TABLE IF EXISTS gene_and_pathway;
  CREATE TABLE gene_and_pathway 
    (
      ensembl_gene_id CHARACTER (25) NOT NULL,
      gene_symbol CHARACTER (30) NOT NULL,
      reactome_pathway_id CHARACTER (25),
      reactome_pathway_name CHARACTER VARYING, 
      PRIMARY KEY (ensembl_gene_id, reactome_pathway_id)
  );  
  COPY gene_and_pathway(gene_symbol, ensembl_gene_id, reactome_pathway_id, reactome_pathway_name)
FROM 'c:/Code/BMI535slides/data/pathway_gene.csv' DELIMITER ',' CSV HEADER;


 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.
7 rows affected.


[]

In [44]:
%sql SELECT * FROM gene_and_pathway;

 * postgresql://postgres:***@localhost:5433/ensembl
7 rows affected.


ensembl_gene_id,gene_symbol,reactome_pathway_id,reactome_pathway_name
ENSG00000198793,JAK1,R-HSA-168256,Immune System
ENSG00000198793,JAK1,R-HSA-913531,Interferon Signaling
ENSG00000142208,AKT1,R-HSA-168256,Immune System
ENSG00000142208,AKT1,R-HSA-382551,Transport of small molecules
ENSG00000162434,MTOR,R-HSA-168256,Immune System
ENSG00000168610,STAT3,R-HSA-168256,Immune System
ENSG00000168610,STAT3,R-HSA-382551,Transport of small molecules
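
To see what the composite key buys us, the following sketch tries to insert a gene/pathway pair that is already present. It assumes an in-memory SQLite database rather than PostgreSQL; in both systems a duplicate key is expected to be rejected, although the exact error type differs.

```python
import sqlite3

# Rows copied from the gene_and_pathway table shown above.
ROWS = [
    ("ENSG00000198793", "JAK1", "R-HSA-168256", "Immune System"),
    ("ENSG00000198793", "JAK1", "R-HSA-913531", "Interferon Signaling"),
    ("ENSG00000142208", "AKT1", "R-HSA-168256", "Immune System"),
    ("ENSG00000142208", "AKT1", "R-HSA-382551", "Transport of small molecules"),
    ("ENSG00000162434", "MTOR", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-382551", "Transport of small molecules"),
]

# Assumption: SQLite in memory stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene_and_pathway (
           ensembl_gene_id TEXT NOT NULL,
           gene_symbol TEXT NOT NULL,
           reactome_pathway_id TEXT NOT NULL,
           reactome_pathway_name TEXT,
           PRIMARY KEY (ensembl_gene_id, reactome_pathway_id))"""
)
conn.executemany("INSERT INTO gene_and_pathway VALUES (?, ?, ?, ?)", ROWS)

# Re-inserting an existing (gene, pathway) pair violates the composite key.
try:
    conn.execute("INSERT INTO gene_and_pathway VALUES (?, ?, ?, ?)", ROWS[0])
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

(row_count,) = conn.execute("SELECT COUNT(*) FROM gene_and_pathway").fetchone()
print(rejected, row_count)
```

The key only guarantees that each gene/pathway pair appears once; it does nothing about the gene and pathway details being repeated across rows.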


Hmm. Now we have unique rows defined by our primary key, but we can see we still have repeating information across rows. 

For example, we can see that multiple genes participate in the `Immune System` pathway, and that we repeat the `reactome_pathway_id` (`R-HSA-168256`) and the `reactome_pathway_name` (`Immune System`) once for each of those genes. 

We also have repeating information in the `ensembl_gene_id` and `gene_symbol` fields. This is bad, and as we've seen it can lead to anomalies.

# Second Normal Form (2NF)

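We can take it further and normalize this table to 2nd Normal Form (2NF). The requirements for 2NF are:

- Everything from 1NF
- Only data that relates to a table's primary key is stored in each table; that is, every non-key column depends on the whole primary key, not just part of it

So, does the above table satisfy the 2NF conditions?

No. We are storing two different entities in one table: *gene* and *pathway*. `gene_symbol` depends only on `ensembl_gene_id`, and `reactome_pathway_name` depends only on `reactome_pathway_id`, so each depends on just one half of our composite key. We want everything in the *gene* table to depend only on `ensembl_gene_id`, and everything in the *pathway* table to depend only on `reactome_pathway_id`.

So what we really need to do is take our table apart into multiple tables, and split our composite key (`ensembl_gene_id` and `reactome_pathway_id`) so that each part identifies the rows of its own table.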
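
These partial dependencies can be checked directly against the data. The sketch below assumes an in-memory SQLite database in place of PostgreSQL, and `determines` is a hypothetical helper written for this example. It tests whether each value of one column maps to exactly one value of another.

```python
import sqlite3

# Rows copied from the gene_and_pathway table shown above.
ROWS = [
    ("ENSG00000198793", "JAK1", "R-HSA-168256", "Immune System"),
    ("ENSG00000198793", "JAK1", "R-HSA-913531", "Interferon Signaling"),
    ("ENSG00000142208", "AKT1", "R-HSA-168256", "Immune System"),
    ("ENSG00000142208", "AKT1", "R-HSA-382551", "Transport of small molecules"),
    ("ENSG00000162434", "MTOR", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-382551", "Transport of small molecules"),
]

# Assumption: SQLite in memory stands in for the PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE gene_and_pathway (
           ensembl_gene_id TEXT, gene_symbol TEXT,
           reactome_pathway_id TEXT, reactome_pathway_name TEXT)"""
)
conn.executemany("INSERT INTO gene_and_pathway VALUES (?, ?, ?, ?)", ROWS)


def determines(connection, key, col):
    """Return True if every value of `key` maps to exactly one value of `col`.

    Column names are interpolated into the SQL text, so only pass trusted names.
    """
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT {key} FROM gene_and_pathway
            GROUP BY {key}
            HAVING COUNT(DISTINCT {col}) > 1
        ) AS violations
    """
    (n_violations,) = connection.execute(query).fetchone()
    return n_violations == 0


# Each non-key column is fixed by only ONE half of the composite key.
gene_part = determines(conn, "ensembl_gene_id", "gene_symbol")
pathway_part = determines(conn, "reactome_pathway_id", "reactome_pathway_name")
# By contrast, a gene ID alone does not fix the pathway name.
cross = determines(conn, "ensembl_gene_id", "reactome_pathway_name")
print(gene_part, pathway_part, cross)
```

A check like this only shows that the sample data is consistent with a dependency; whether the dependency really holds is a question about what the columns mean.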

Let's try separating the `pathway` information out first:

In [51]:
%%sql
    DROP TABLE IF EXISTS pathway;
    CREATE TABLE pathway AS
        (SELECT DISTINCT reactome_pathway_id, reactome_pathway_name
        FROM gene_and_pathway);
    ALTER TABLE pathway
        ADD COLUMN id SERIAL PRIMARY KEY;
    SELECT * FROM pathway;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
3 rows affected.
Done.
3 rows affected.


reactome_pathway_id,reactome_pathway_name,id
R-HSA-382551,Transport of small molecules,1
R-HSA-168256,Immune System,2
R-HSA-913531,Interferon Signaling,3


In [52]:
%%sql
    DROP TABLE IF EXISTS ensembl_gene;
    CREATE TABLE ensembl_gene AS
        (SELECT DISTINCT ensembl_gene_id, gene_symbol
        FROM gene_and_pathway);
    ALTER TABLE ensembl_gene
       ADD COLUMN id SERIAL PRIMARY KEY;
    SELECT * FROM ensembl_gene;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
4 rows affected.
Done.
4 rows affected.


The final table we need to create defines the relationship between our `ensembl_gene` table and our `pathway` table. This is especially important given the many-to-many relationship between `ensembl_gene` and `pathway`. This table is called a *bridge table* and we want it to map the `id` column from both tables.

The first thing we need to do is grab each distinct pair of `ensembl_gene_id` and `reactome_pathway_id` and put them into a new table.

In [53]:
%%sql
    DROP TABLE IF EXISTS gene_to_pathway;
    CREATE TABLE gene_to_pathway AS
    (
        SELECT DISTINCT ensembl_gene_id, reactome_pathway_id
        FROM gene_and_pathway
    );
    SELECT * FROM gene_to_pathway;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.
7 rows affected.
7 rows affected.


ensembl_gene_id,reactome_pathway_id
ENSG00000162434,R-HSA-168256
ENSG00000198793,R-HSA-913531
ENSG00000142208,R-HSA-168256
ENSG00000198793,R-HSA-168256
ENSG00000168610,R-HSA-382551
ENSG00000168610,R-HSA-168256
ENSG00000142208,R-HSA-382551


Now that we have that, we can build our bridge table by selecting the `id`s from the `ensembl_gene` and `pathway` tables with two `LEFT JOIN`s:

In [55]:
%%sql
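DROP TABLE IF EXISTS g2p;
CREATE TABLE g2p AS
(SELECT g.id AS gene_id, p.id AS path_id FROM
    -- alias the source table as "pairs" so it is not confused with the new g2p table
    gene_to_pathway AS pairs
    LEFT JOIN pathway AS p
        ON pairs.reactome_pathway_id = p.reactome_pathway_id
    LEFT JOIN ensembl_gene AS g
        ON pairs.ensembl_gene_id = g.ensembl_gene_id);
SELECT * FROM g2p;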



 * postgresql://postgres:***@localhost:5433/ensembl
7 rows affected.


gene_id,path_id
4,2
2,3
3,2
2,2
1,1
1,2
3,1
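
With the data split into three tables, the two anomalies from earlier should no longer occur. The sketch below illustrates this under stated assumptions: it uses an in-memory SQLite database rather than PostgreSQL, and it adds `UNIQUE` and foreign key constraints that the tables built above do not have.

```python
import sqlite3

# Rows copied from the original gene_and_pathway table.
ROWS = [
    ("ENSG00000198793", "JAK1", "R-HSA-168256", "Immune System"),
    ("ENSG00000198793", "JAK1", "R-HSA-913531", "Interferon Signaling"),
    ("ENSG00000142208", "AKT1", "R-HSA-168256", "Immune System"),
    ("ENSG00000142208", "AKT1", "R-HSA-382551", "Transport of small molecules"),
    ("ENSG00000162434", "MTOR", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-168256", "Immune System"),
    ("ENSG00000168610", "STAT3", "R-HSA-382551", "Transport of small molecules"),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of PostgreSQL
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript(
    """
    CREATE TABLE ensembl_gene (
        id INTEGER PRIMARY KEY,
        ensembl_gene_id TEXT NOT NULL UNIQUE,
        gene_symbol TEXT NOT NULL);
    CREATE TABLE pathway (
        id INTEGER PRIMARY KEY,
        reactome_pathway_id TEXT NOT NULL UNIQUE,
        reactome_pathway_name TEXT NOT NULL);
    CREATE TABLE g2p (
        gene_id INTEGER NOT NULL REFERENCES ensembl_gene (id),
        path_id INTEGER NOT NULL REFERENCES pathway (id),
        PRIMARY KEY (gene_id, path_id));
    """
)

for gene_id, symbol, path_id, path_name in ROWS:
    # OR IGNORE skips genes and pathways that were already stored.
    conn.execute(
        "INSERT OR IGNORE INTO ensembl_gene (ensembl_gene_id, gene_symbol) VALUES (?, ?)",
        (gene_id, symbol),
    )
    conn.execute(
        "INSERT OR IGNORE INTO pathway (reactome_pathway_id, reactome_pathway_name) VALUES (?, ?)",
        (path_id, path_name),
    )
    conn.execute(
        """INSERT INTO g2p (gene_id, path_id)
           SELECT g.id, p.id FROM ensembl_gene AS g, pathway AS p
           WHERE g.ensembl_gene_id = ? AND p.reactome_pathway_id = ?""",
        (gene_id, path_id),
    )

# Insertion: a gene with no known pathway is just a row in ensembl_gene.
conn.execute(
    "INSERT INTO ensembl_gene (ensembl_gene_id, gene_symbol) VALUES (?, ?)",
    ("ENSG00000140285", "FGF7"),
)

# Deletion: remove JAK1's links first, then the gene itself.
conn.execute(
    "DELETE FROM g2p WHERE gene_id = (SELECT id FROM ensembl_gene WHERE gene_symbol = 'JAK1')"
)
conn.execute("DELETE FROM ensembl_gene WHERE gene_symbol = 'JAK1'")

pathways_left = {
    name for (name,) in conn.execute("SELECT reactome_pathway_name FROM pathway")
}
gene_symbols = {s for (s,) in conn.execute("SELECT gene_symbol FROM ensembl_gene")}
(link_count,) = conn.execute("SELECT COUNT(*) FROM g2p").fetchone()
print(pathways_left, gene_symbols, link_count)
```

In this layout `FGF7` is stored without any half-empty row, and the pathway that only `JAK1` belonged to is still recorded in `pathway` after the gene is gone.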


## To do later

Reconstruct the original table from the `ensembl_gene`, `pathway` and `g2p` tables.
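
One possible approach to this exercise is to join the bridge table back to both entity tables. The sketch below assumes an in-memory SQLite database instead of PostgreSQL. The `pathway` and `g2p` rows are copied from the outputs above; the `ensembl_gene` ids are inferred by matching the `gene_to_pathway` rows to the `g2p` rows, since that table's output is not shown.

```python
import sqlite3

# ensembl_gene ids are inferred from the g2p output (an assumption).
GENES = [
    (1, "ENSG00000168610", "STAT3"),
    (2, "ENSG00000198793", "JAK1"),
    (3, "ENSG00000142208", "AKT1"),
    (4, "ENSG00000162434", "MTOR"),
]
# Copied from the pathway table output.
PATHWAYS = [
    (1, "R-HSA-382551", "Transport of small molecules"),
    (2, "R-HSA-168256", "Immune System"),
    (3, "R-HSA-913531", "Interferon Signaling"),
]
# Copied from the g2p table output: (gene_id, path_id).
G2P = [(4, 2), (2, 3), (3, 2), (2, 2), (1, 1), (1, 2), (3, 1)]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of PostgreSQL
conn.executescript(
    """
    CREATE TABLE ensembl_gene (id INTEGER PRIMARY KEY, ensembl_gene_id TEXT, gene_symbol TEXT);
    CREATE TABLE pathway (id INTEGER PRIMARY KEY, reactome_pathway_id TEXT, reactome_pathway_name TEXT);
    CREATE TABLE g2p (gene_id INTEGER, path_id INTEGER);
    """
)
conn.executemany("INSERT INTO ensembl_gene VALUES (?, ?, ?)", GENES)
conn.executemany("INSERT INTO pathway VALUES (?, ?, ?)", PATHWAYS)
conn.executemany("INSERT INTO g2p VALUES (?, ?)", G2P)

# Follow each bridge row to its gene and its pathway.
reconstructed = conn.execute(
    """SELECT g.ensembl_gene_id, g.gene_symbol,
              p.reactome_pathway_id, p.reactome_pathway_name
       FROM g2p
       JOIN ensembl_gene AS g ON g.id = g2p.gene_id
       JOIN pathway AS p ON p.id = g2p.path_id
       ORDER BY g.gene_symbol, p.reactome_pathway_id"""
).fetchall()

for row in reconstructed:
    print(row)
```

The join returns one row per bridge entry, so the seven gene/pathway rows of the original table come back without loss.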

# Third Normal Form (3NF)

The requirements for 3NF are:

- Everything from 2NF
- There are no in-table (transitive) dependencies between the columns in each table; that is, no non-key column is determined by another non-key column
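
A transitive dependency is easiest to see in a small example. The sketch below is purely hypothetical: the `curator_id` and `curator_email` columns and their values are invented for illustration and are not part of the gene and pathway data, and an in-memory SQLite database is assumed in place of PostgreSQL. Here `curator_email` depends on `curator_id`, which in turn depends on the gene's key.

```python
import sqlite3

# Hypothetical rows: curator_id and curator_email are invented for this example.
GENES = [
    ("ENSG00000198793", "JAK1", "c1", "ana@example.org"),
    ("ENSG00000142208", "AKT1", "c1", "ana@example.org"),
    ("ENSG00000162434", "MTOR", "c2", "ben@example.org"),
    ("ENSG00000168610", "STAT3", "c2", "ben@example.org"),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite instead of PostgreSQL
conn.execute(
    """CREATE TABLE gene_flat (
           ensembl_gene_id TEXT PRIMARY KEY,
           gene_symbol TEXT NOT NULL,
           curator_id TEXT NOT NULL,
           curator_email TEXT NOT NULL)"""
)
conn.executemany("INSERT INTO gene_flat VALUES (?, ?, ?, ?)", GENES)

# Transitive dependency: ensembl_gene_id -> curator_id -> curator_email.
# To reach 3NF, move the curator details into their own table.
conn.executescript(
    """
    CREATE TABLE curator AS
        SELECT DISTINCT curator_id, curator_email FROM gene_flat;
    CREATE TABLE gene AS
        SELECT ensembl_gene_id, gene_symbol, curator_id FROM gene_flat;
    """
)

# An email change is now a single-row update instead of one update per gene.
conn.execute(
    "UPDATE curator SET curator_email = 'ana@example.com' WHERE curator_id = 'c1'"
)

emails = dict(conn.execute(
    """SELECT g.gene_symbol, c.curator_email
       FROM gene AS g JOIN curator AS c ON c.curator_id = g.curator_id"""
))
(curator_count,) = conn.execute("SELECT COUNT(*) FROM curator").fetchone()
print(emails, curator_count)
```

After the split, each curator's email is stored once, so two genes with the same curator cannot disagree about it.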


Table design is not the only source of dirty data. Other causes include:

- Inconsistent data collection/lack of protocols
- Lack of auditing the database tables

https://www.winshuttle.com/blog/6ways-dirty-data/  
https://towardsdatascience.com/what-is-dirty-data-d96abbdf254e

# Does Database Normalization Help?

https://support.microsoft.com/en-us/help/283878/description-of-the-database-normalization-basics 
https://www.itprotoday.com/sql-server/sql-design-why-you-need-database-normalization
http://agiledata.org/essays/dataNormalization.html

