# Introduction to SQL with PostgreSQL

This notebook is meant to be a short introduction to relational database queries using SQL.

Note that you will need to load the database first (see that notebook for more information).

Make sure that you change <USERNAME> below to your own username!

In [None]:
%reload_ext sql

##Connect to the database
#%sql postgresql://postgres@localhost/ensembl


#%sql postgresql://postgres:postpost@localhost:5433/ensembl
%sql postgresql://<USERNAME>@localhost/ensembl

# SELECTing from a table in the database

All SQL queries have the following form:

```
SELECT *columns*  <- required
FROM *table*      <- required
WHERE *criteria*  <- this clause is optional  
```

The very first thing we can do is return all the columns from a table using `*`. But remember, we need a `FROM` clause as well to make our query complete. We also add a `LIMIT` clause so that only the first 20 rows are returned.

In [6]:
%sql SELECT * FROM gene LIMIT 20;

 * postgresql://postgres:***@localhost:5433/ensembl
20 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000198888,1,4262,3307,MT,MT-ND1
ENSG00000198763,1,5511,4470,MT,MT-ND2
ENSG00000198804,1,7445,5904,MT,MT-CO1
ENSG00000198712,1,8269,7586,MT,MT-CO2
ENSG00000228253,1,8572,8366,MT,MT-ATP8
ENSG00000198899,1,9207,8527,MT,MT-ATP6
ENSG00000198938,1,9990,9207,MT,MT-CO3
ENSG00000198840,1,10404,10059,MT,MT-ND3
ENSG00000212907,1,10766,10470,MT,MT-ND4L
ENSG00000198886,1,12137,10760,MT,MT-ND4


# WHERE

WHERE is an optional clause that lets us add filtering criteria to our query. As before, we use the LIMIT clause to return only the first 20 matching rows.

In [7]:
%sql SELECT * FROM gene WHERE chromosome = '14' LIMIT 20;

 * postgresql://postgres:***@localhost:5433/ensembl
20 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000100505,-1,51096061,50975262,14,TRIM9
ENSG00000139921,1,51257655,51240162,14,TMX1
ENSG00000139926,1,51730727,51489100,14,FRMD6
ENSG00000172717,1,67228550,67189393,14,FAM71D
ENSG00000100749,1,96931722,96797382,14,VRK1
ENSG00000186469,1,51979342,51826195,14,GNG2
ENSG00000072415,1,67336061,67241342,14,MPP5
ENSG00000100554,-1,67360265,67294371,14,ATP6V1D
ENSG00000134001,1,67386516,67360151,14,EIF2S1
ENSG00000087302,1,52010694,51989514,14,RTRAF


# Boolean Operations

We can chain criteria together by using the `AND`/`OR` boolean operations. 

In [30]:
%%sql
SELECT * FROM gene WHERE gene_end < 100000 AND chromosome = '10'

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000261456,-1,74163,46892,10,TUBB8


## Think about it

Should the result of an AND query be *larger* or *smaller* than the result of the matching OR query?

In [None]:
%%sql
SELECT * FROM gene WHERE gene_end < 100000 OR chromosome = '10'
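
One way to check your answer without the full database is a small sketch using Python's built-in `sqlite3` module. The table below is a toy stand-in for `gene` with made-up rows (the symbols `A` to `D` are hypothetical), so only the relative row counts matter, not the values.

```python
import sqlite3

# Toy stand-in for the gene table: made-up rows, not Ensembl data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_symbol TEXT, chromosome TEXT, gene_end INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        ("A", "10", 50_000),
        ("B", "10", 900_000),
        ("C", "14", 80_000),
        ("D", "14", 700_000),
    ],
)

# AND: a row must satisfy both criteria.
and_rows = conn.execute(
    "SELECT gene_symbol FROM gene "
    "WHERE gene_end < 100000 AND chromosome = '10'"
).fetchall()

# OR: a row only needs to satisfy one of the criteria.
or_rows = conn.execute(
    "SELECT gene_symbol FROM gene "
    "WHERE gene_end < 100000 OR chromosome = '10'"
).fetchall()

print(len(and_rows), len(or_rows))
```

Every row returned by the AND query is also returned by the OR query, which is a useful sanity check on real data too.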

# SELECTING columns

You can select individual columns of each table using the SELECT statement. 

Note that we have to put our chromosome (`14`) in quotes because the column's data type is character, not integer (chromosome names such as `X` and `MT` are not numbers).

In [4]:
%%sql 
SELECT ensembl_gene_id, gene_start, chromosome 
FROM gene
WHERE chromosome = '14'

 * postgresql://postgres:***@localhost:5433/ensembl
19 rows affected.


ensembl_gene_id,gene_start,chromosome
ENSG00000211890,105583731,14
ENSG00000211890,105583731,14
ENSG00000211891,105597691,14
ENSG00000211891,105597691,14
ENSG00000211892,105620506,14
ENSG00000211892,105620506,14
ENSG00000211893,105639559,14
ENSG00000211893,105639559,14
ENSG00000211895,105703995,14
ENSG00000211895,105703995,14


# Your Turn

Write a SELECT statement that returns every column from the `gene` table for rows where `gene_symbol` is `'FGFR2'`.

In [32]:
#Room for your answer here

%sql 

 * postgresql://postgres:***@localhost:5433/ensembl


'Connected: postgres@ensembl'

# Aliases using AS

One useful trick is the *alias*. Aliases become more important as we join tables together, since they:

1. Saves us typing
2. Makes our query more clear and specific. 

`ensembl_gene_id` appears in more than one table. Once we join those tables, Postgres reports the column as ambiguous unless we say which table we mean.

Note that you can use an alias in the SELECT list *before* the point where it is defined, which can be confusing. You just need to define it somewhere in the query (usually in the FROM clause).

When you refer to a column, it is best to prefix it with the *alias* so your query is exact.

In [8]:
%%sql 
SELECT g.ensembl_gene_id, g.gene_start, g.chromosome 
    FROM gene as g
    WHERE g.chromosome = '14' LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,chromosome
ENSG00000100505,50975262,14
ENSG00000139921,51240162,14
ENSG00000139926,51489100,14
ENSG00000172717,67189393,14
ENSG00000100749,96797382,14
ENSG00000186469,51826195,14
ENSG00000072415,67241342,14
ENSG00000100554,67294371,14
ENSG00000134001,67360151,14
ENSG00000087302,51989514,14


We can also rename our columns using AS:

In [None]:
%%sql 
SELECT ensembl_gene_id as ensembl, gene_start, chromosome FROM gene
WHERE chromosome = '14' LIMIT 20;
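
To see both kinds of alias at once, here is a minimal sketch using Python's built-in `sqlite3` module rather than the Postgres connection above. The table is a toy stand-in for `gene`, and the two gene IDs are made up for illustration.

```python
import sqlite3

# Toy stand-in for the gene table: the IDs below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (ensembl_gene_id TEXT, gene_start INTEGER, chromosome TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("GENE_A", 100, "14"), ("GENE_B", 200, "1")],
)

# `g` is a table alias; `ensembl` is a column alias.
cur = conn.execute(
    "SELECT g.ensembl_gene_id AS ensembl, g.gene_start "
    "FROM gene AS g "
    "WHERE g.chromosome = '14'"
)

# cursor.description holds one entry per result column; item 0 is its name.
column_names = [col[0] for col in cur.description]
rows = cur.fetchall()
print(column_names, rows)
```

The column alias changes only the name of the column in the result; the table itself is untouched.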

# Aggregating using COUNT

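The COUNT function in SQL lets us count things. If we use COUNT(ensembl_gene_id), it will count the number of rows in which `ensembl_gene_id` is not NULL. Adding DISTINCT inside the parentheses counts each unique value only once.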

In [21]:
%sql SELECT COUNT(ensembl_transcript_id) FROM transcript;

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


count
168617


In [23]:
%sql SELECT COUNT(DISTINCT ensembl_gene_id) FROM gene2transcript;

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


count
22799
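
The difference between the flavours of COUNT is easiest to see on a tiny table. The sketch below uses Python's built-in `sqlite3` module and a toy version of `gene2transcript` with invented IDs (`G1`, `T1`, and so on), including one missing transcript ID.

```python
import sqlite3

# Toy stand-in for gene2transcript: invented IDs, one NULL transcript.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene2transcript "
    "(ensembl_gene_id TEXT, ensembl_transcript_id TEXT)"
)
conn.executemany(
    "INSERT INTO gene2transcript VALUES (?, ?)",
    [("G1", "T1"), ("G1", "T2"), ("G2", "T3"), ("G3", None)],
)

total_rows, non_null_transcripts, distinct_genes = conn.execute(
    "SELECT COUNT(*), "                          # every row
    "       COUNT(ensembl_transcript_id), "      # rows where the column is not NULL
    "       COUNT(DISTINCT ensembl_gene_id) "    # unique non-NULL values
    "FROM gene2transcript"
).fetchone()

print(total_rows, non_null_transcripts, distinct_genes)
```

Here `COUNT(*)` sees four rows, `COUNT(ensembl_transcript_id)` skips the NULL, and `COUNT(DISTINCT ensembl_gene_id)` counts `G1` only once.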


# GROUP BY

GROUP BY is extremely useful if we want to produce a table of counts. Note that to produce a table with counts, we need to return both `chromosome` and `COUNT(chromosome)`.

In [40]:
%sql SELECT chromosome, COUNT(chromosome) FROM gene GROUP BY chromosome;

 * postgresql://postgres:***@localhost:5433/ensembl
357 rows affected.


chromosome,count
CHR_HSCHR1_5_CTG32_1,3
13,321
CHR_HSCHR17_3_CTG1,8
CHR_HSCHR21_6_CTG1_1,1
CHR_HSCHR1_ALT2_1_CTG32_1,7
CHR_HSCHR1_1_CTG3,7
CHR_HSCHR19KIR_FH08_A_HAP_CTG3_1,9
CHR_HSCHR22_1_CTG7,17
Y,48
CHR_HSCHR19LRC_LRC_S_CTG3_1,19


# ORDER BY

That's great and everything, but that doesn't answer the question of which chromosome has the largest number of mapped genes. 

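By adding an `ORDER BY` clause followed by `DESC`, we can sort our result table in descending order, so the chromosome with the most genes comes first. (Without `DESC`, the default sort order is ascending.)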

In [41]:
%%sql 
    SELECT chromosome, COUNT(chromosome) as count 
        FROM gene 
        GROUP BY chromosome 
        ORDER BY count DESC;

 * postgresql://postgres:***@localhost:5433/ensembl
357 rows affected.


chromosome,count
1,2049
19,1471
11,1309
2,1246
17,1184
3,1076
6,1047
12,1031
7,918
5,885
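
The same GROUP BY and ORDER BY pattern can be tried on a toy table. This is a sketch using Python's built-in `sqlite3` module; the six rows are made up and the column alias `n` is an arbitrary name chosen for the example.

```python
import sqlite3

# Toy stand-in for the gene table: made-up symbols and chromosomes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_symbol TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("A", "1"), ("B", "1"), ("C", "1"), ("D", "2"), ("E", "X"), ("F", "X")],
)

# One output row per chromosome, largest group first.
counts = conn.execute(
    "SELECT chromosome, COUNT(chromosome) AS n "
    "FROM gene "
    "GROUP BY chromosome "
    "ORDER BY n DESC"
).fetchall()

print(counts)
```

Each distinct value of the grouped column produces exactly one row, which is why the real query above returns one row per chromosome or contig name.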


# Your Turn

How many genes are mapped to the `1` (+) strand and how many genes are mapped to the `-1` (-) strand?

In [None]:
#Room for your answer here
%sql

# JOINing the Tables

Of course, our `transcript` and `gene` tables aren't that useful by themselves. We need to integrate information in these tables to produce useful queries.

We will use what are called JOINs on the data, to produce a table that has information about genes and their transcripts.

# The Different JOIN types

I apologize for the different tables here. However, these figures took a long time to make and I don't have the time to make bioinformatics specific ones.

- INNER JOIN

Only retains the rows which are in common between the two joined tables.

![](docs/image/Slide3.JPG)

- LEFT JOIN

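Retains everything in INNER JOIN, plus those rows in the LEFT table that have no match in the RIGHT table. The columns from the RIGHT table are filled with NULL for those rows.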

![](docs/image/Slide4.JPG)

- RIGHT JOIN

Retains everything in INNER JOIN, plus those rows in the RIGHT table that have no match in the LEFT table.

- FULL OUTER JOIN

Retains ALL rows in both tables, regardless of whether they have a match in the other table.
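
The join types above can be compared on two toy tables. This sketch uses Python's built-in `sqlite3` module with invented IDs. Because older SQLite versions lack RIGHT JOIN, the right join is emulated here by swapping the two tables in a LEFT JOIN, which keeps the same rows.

```python
import sqlite3

# Toy tables with invented IDs: G3 has no transcript, T9 has no gene.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (ensembl_gene_id TEXT)")
conn.execute(
    "CREATE TABLE gene2transcript "
    "(ensembl_gene_id TEXT, ensembl_transcript_id TEXT)"
)
conn.executemany("INSERT INTO gene VALUES (?)", [("G1",), ("G2",), ("G3",)])
conn.executemany(
    "INSERT INTO gene2transcript VALUES (?, ?)",
    [("G1", "T1"), ("G1", "T2"), ("G2", "T3"), ("G9", "T9")],
)

on_clause = "ON g.ensembl_gene_id = t2g.ensembl_gene_id"
columns = "g.ensembl_gene_id, t2g.ensembl_transcript_id"

# Only rows with a match on both sides.
inner = conn.execute(
    f"SELECT {columns} FROM gene AS g "
    f"INNER JOIN gene2transcript AS t2g {on_clause}"
).fetchall()

# Everything from gene; unmatched genes get NULL (None) transcripts.
left = conn.execute(
    f"SELECT {columns} FROM gene AS g "
    f"LEFT JOIN gene2transcript AS t2g {on_clause}"
).fetchall()

# Emulated right join: everything from gene2transcript instead.
right = conn.execute(
    f"SELECT {columns} FROM gene2transcript AS t2g "
    f"LEFT JOIN gene AS g {on_clause}"
).fetchall()

print(len(inner), len(left), len(right))
```

In this toy data the unmatched gene `G3` shows up only in the left join and the unmatched transcript `T9` only in the emulated right join.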

## Quiz Yourself

How many rows would there be in the results table above if we RIGHT JOINed instead?

# LEFT JOIN in action

Now, let's try a `LEFT JOIN` on our data.

In [14]:
%%sql 
    SELECT g.ensembl_gene_id, 
        g.gene_start, 
        g.gene_symbol, 
        t2g.ensembl_transcript_id
    FROM gene AS g
    LEFT JOIN gene2transcript AS t2g 
        ON g.ensembl_gene_id = t2g.ensembl_gene_id 
    LIMIT 20;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,gene_symbol,ensembl_transcript_id
ENSG00000198888,3307,MT-ND1,ENST00000361390
ENSG00000198763,4470,MT-ND2,ENST00000361453
ENSG00000198804,5904,MT-CO1,ENST00000361624
ENSG00000198712,7586,MT-CO2,ENST00000361739
ENSG00000228253,8366,MT-ATP8,ENST00000361851
ENSG00000198899,8527,MT-ATP6,ENST00000361899
ENSG00000198938,9207,MT-CO3,ENST00000362079
ENSG00000198840,10059,MT-ND3,ENST00000361227
ENSG00000212907,10470,MT-ND4L,ENST00000361335
ENSG00000198886,10760,MT-ND4,ENST00000361381


You may recall that we have **three** tables: `gene`, `gene2transcript` and `transcript`. To join together `gene` and `transcript`, our `LEFT JOIN` has to include all three tables. 

We can do this by adding another `LEFT JOIN` clause to our query. Here we're adding in the `transcript` table by including the criteria `t2g.ensembl_transcript_id` = `t.ensembl_transcript_id`.

In [36]:
%%sql 
    SELECT g.ensembl_gene_id, g.gene_start, g.gene_symbol, t.ensembl_transcript_id, t.transcript_start
    FROM gene AS g
    LEFT JOIN gene2transcript AS t2g 
        ON g.ensembl_gene_id = t2g.ensembl_gene_id 
    LEFT JOIN transcript as t 
        ON t2g.ensembl_transcript_id = t.ensembl_transcript_id
    LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,gene_symbol,ensembl_transcript_id,transcript_start
ENSG00000198786,12337,MT-ND5,ENST00000361567,12337
ENSG00000278817,131494,AC007325.4,ENST00000613204,131494
ENSG00000281022,133338460,MED22,ENST00000631196,133338460
ENSG00000274053,54643934,NCR1,ENST00000619679,54643934
ENSG00000273506,54911759,NCR1,ENST00000619067,54911759
ENSG00000277667,54160649,AC012314.14,ENST00000622368,54160649
ENSG00000277733,54173854,MBOAT7,ENST00000612567,54173854
ENSG00000276017,72411,AC007325.1,ENST00000617983,72411
ENSG00000280584,133205277,OBP2B,ENST00000630166,133205277
ENSG00000143954,79025686,REG3G,ENST00000498312,79025748


# Use Case: Finding Overlapping Genes

One query we might do is to find all of the overlapping genes in our database. 

That is, we want those genes where 

```
gene1.gene_symbol != gene2.gene_symbol and
gene1.chromosome = gene2.chromosome and
gene1.gene_start <= gene2.gene_end and
gene1.gene_end >= gene2.gene_start
```

How do we do this, since we only have 1 table? We need to do what is called a SELF JOIN. 

Basically, we make two aliases for our table, called `g1` and `g2`. Then we can join these two tables just like any other join. Here we use an `INNER JOIN`.

In [None]:
%%sql
    SELECT g1.gene_symbol AS gene1, 
        g2.gene_symbol AS gene2,
        g1.gene_start as g1_start,
        g2.gene_start as g2_start,
        g1.gene_end as g1_end,
        g2.gene_end as g2_end
    FROM gene AS g1
    INNER JOIN gene AS g2
        ON g1.gene_symbol != g2.gene_symbol 
    WHERE g1.gene_start <= g2.gene_end
        AND g1.gene_end >= g2.gene_start
        AND g1.chromosome = g2.chromosome;
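
Since the full query can be slow, it may help to check the overlap logic on a handful of rows first. The sketch below uses Python's built-in `sqlite3` module and four made-up genes. The `<` variant is not part of the query above; it is one possible way to list each overlapping pair only once.

```python
import sqlite3

# Toy genes with invented coordinates: only A and B overlap on chromosome 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene "
    "(gene_symbol TEXT, chromosome TEXT, gene_start INTEGER, gene_end INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        ("A", "1", 100, 200),
        ("B", "1", 150, 300),
        ("C", "1", 400, 500),
        ("D", "2", 150, 250),  # same coordinates as A/B, different chromosome
    ],
)

overlap_sql = """
    SELECT g1.gene_symbol, g2.gene_symbol
    FROM gene AS g1
    INNER JOIN gene AS g2
        ON g1.gene_symbol {op} g2.gene_symbol
    WHERE g1.chromosome = g2.chromosome
        AND g1.gene_start <= g2.gene_end
        AND g1.gene_end >= g2.gene_start
    ORDER BY g1.gene_symbol, g2.gene_symbol
"""

# != reports each overlap twice: (A, B) and (B, A).
both_directions = conn.execute(overlap_sql.format(op="!=")).fetchall()

# < keeps only the pair whose first symbol sorts first.
each_pair_once = conn.execute(overlap_sql.format(op="<")).fetchall()

print(both_directions, each_pair_once)
```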

# Using EXPLAIN ANALYZE to find out why our query is slow

Because a query may join and scan several tables, the more complicated queries can take much longer to run.

If we add `EXPLAIN ANALYZE` to the beginning of our query, the Database Management System (DBMS) runs the query and reports its execution plan, along with how long each step took. This shows us which steps take the longest.

In [59]:
%%sql
    -- Need to run this for demo purposes
    DROP INDEX IF EXISTS ChrStartEnd;

 * postgresql://postgres:***@localhost:5433/ensembl
Done.


[]

In [55]:
%%sql
    EXPLAIN ANALYZE
    SELECT g1.gene_symbol AS gene1, 
        g2.gene_symbol AS gene2,
        g1.gene_start as g1_start,
        g2.gene_start as g2_start,
        g1.gene_end as g1_end,
        g2.gene_end as g2_end
    FROM gene AS g1
    INNER JOIN gene AS g2
        ON g1.gene_symbol != g2.gene_symbol 
    WHERE g1.gene_start <= g2.gene_end
        AND g1.gene_end >= g2.gene_start
        AND g1.chromosome = g2.chromosome;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


QUERY PLAN
Hash Join (cost=747.98..430934.97 rows=2357115 width=12) (actual time=24.184..8822.094 rows=10846 loops=1)
Hash Cond: ((g1.chromosome)::text = (g2.chromosome)::text)
Join Filter: (((g1.gene_symbol)::text <> (g2.gene_symbol)::text) AND (g1.gene_start <= g2.gene_end) AND (g1.gene_end >= g2.gene_start))
Rows Removed by Join Filter: 21205785
-> Seq Scan on gene g1 (cost=0.00..462.99 rows=22799 width=19) (actual time=0.063..15.625 rows=22799 loops=1)
-> Hash (cost=462.99..462.99 rows=22799 width=19) (actual time=23.785..23.785 rows=22799 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1443kB
-> Seq Scan on gene g2 (cost=0.00..462.99 rows=22799 width=19) (actual time=0.035..10.678 rows=22799 loops=1)
Planning Time: 3.032 ms
Execution Time: 8825.565 ms


# Indexing: making a query faster

Querying takes time when the database has to scan the whole table.

However, if there is a column we query a lot, we can create an **index** for it. This index allows us to query the table faster.

Much like the data structures we use in Python, an *index* is a data structure built for fast lookups. By default, PostgreSQL builds it as a B-tree, which lets the database jump to the matching rows instead of reading every row.

Why don't we index everything? Indexes take up disk space, and every index has to be updated whenever rows are inserted or changed. So we are better off picking and choosing which columns we want to index, based on the queries and searches most commonly run against a table.

There are data engineers whose job it is to tune the database to make queries run faster. Don't worry too much if this is confusing right now. Just know who you're going to talk with when you need a database query to run lightning fast.

In [56]:
%%sql
CREATE INDEX ChrStartEnd on gene(chromosome, gene_start, gene_end)

 * postgresql://postgres:***@localhost:5433/ensembl
Done.


[]

Now try running the query again with `EXPLAIN ANALYZE`. How much faster was it on your machine?

In [57]:
%%sql
    EXPLAIN ANALYZE
    SELECT g1.gene_symbol AS gene1, 
        g2.gene_symbol AS gene2,
        g1.gene_start as g1_start,
        g2.gene_start as g2_start,
        g1.gene_end as g1_end,
        g2.gene_end as g2_end
    FROM gene AS g1
    INNER JOIN gene AS g2
        ON g1.gene_symbol != g2.gene_symbol 
    WHERE g1.gene_start <= g2.gene_end
        AND g1.gene_end >= g2.gene_start
        AND g1.chromosome = g2.chromosome;

 * postgresql://postgres:***@localhost:5433/ensembl
8 rows affected.


QUERY PLAN
Nested Loop (cost=0.29..17921.28 rows=2357115 width=12) (actual time=0.266..2351.437 rows=10846 loops=1)
-> Seq Scan on gene g1 (cost=0.00..462.99 rows=22799 width=19) (actual time=0.069..6.443 rows=22799 loops=1)
-> Index Scan using chrstartend on gene g2 (cost=0.29..0.70 rows=7 width=19) (actual time=0.101..0.101 rows=0 loops=22799)
Index Cond: (((chromosome)::text = (g1.chromosome)::text) AND (gene_start <= g1.gene_end) AND (gene_end >= g1.gene_start))
Filter: ((g1.gene_symbol)::text <> (gene_symbol)::text)
Rows Removed by Filter: 1
Planning Time: 5.373 ms
Execution Time: 2353.030 ms
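
The before-and-after comparison can also be sketched without Postgres. The example below uses Python's built-in `sqlite3` module, whose `EXPLAIN QUERY PLAN` is a rough counterpart of `EXPLAIN ANALYZE` (it shows the plan but does not time it). The table contents, the index name `chr_start_end`, and the helper `plan` are all made up for illustration.

```python
import sqlite3

# Toy stand-in for the gene table, filled with 1000 synthetic rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene "
    "(gene_symbol TEXT, chromosome TEXT, gene_start INTEGER, gene_end INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [(f"g{i}", str(i % 22 + 1), i * 100, i * 100 + 50) for i in range(1000)],
)

query = (
    "SELECT gene_symbol FROM gene "
    "WHERE chromosome = '14' AND gene_start <= 5000"
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan(query)  # no index yet, so the table is scanned
conn.execute(
    "CREATE INDEX chr_start_end ON gene (chromosome, gene_start, gene_end)"
)
after = plan(query)   # the planner can now search the index

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, so the thing to look for is whether the index name appears in it.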


# Indexing and Primary Keys

In our next session, we will be learning some useful terminology for our database: 
- Primary Keys, which uniquely identify a row in a table
- Foreign Keys, which refer to a key in another table

# Useful to know: Subqueries

This has been a lot of SQL to throw at you! So I'm making this last section optional.

Sometimes it's useful to break up complicated queries into subqueries. Basically, any query that returns a table can be used in a FROM clause, but you must give the subquery an alias.

In [61]:
%%sql
SELECT ensembl_gene_id, gene_symbol
    FROM
    (SELECT * FROM gene WHERE chromosome = '1') AS subquery
    WHERE subquery.gene_start >= 1000000

 * postgresql://postgres:***@localhost:5433/ensembl
2040 rows affected.


ensembl_gene_id,gene_symbol
ENSG00000116996,ZP4
ENSG00000270188,MTRNR2L11
ENSG00000203685,STUM
ENSG00000143772,ITPKB
ENSG00000120370,GORAB
ENSG00000117480,FAAH
ENSG00000157978,LDLRAP1
ENSG00000143801,PSEN2
ENSG00000260238,PMF1-BGLAP
ENSG00000242252,BGLAP
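
For a simple case like this one, the subquery returns the same rows as a single query with both criteria in its WHERE clause. The sketch below checks that on a toy table using Python's built-in `sqlite3` module; the three rows are made up.

```python
import sqlite3

# Toy stand-in for the gene table: made-up symbols and coordinates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_symbol TEXT, chromosome TEXT, gene_start INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("A", "1", 500), ("B", "1", 2_000_000), ("C", "2", 3_000_000)],
)

# Filter by chromosome in the subquery, then by position outside it.
nested = conn.execute(
    """
    SELECT gene_symbol
    FROM (SELECT * FROM gene WHERE chromosome = '1') AS subquery
    WHERE subquery.gene_start >= 1000000
    """
).fetchall()

# The same two filters combined with AND in one query.
flat = conn.execute(
    "SELECT gene_symbol FROM gene "
    "WHERE chromosome = '1' AND gene_start >= 1000000"
).fetchall()

print(nested, flat)
```

Subqueries earn their keep when the inner query does something a plain WHERE cannot, such as aggregating before the outer query filters.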


Another useful operator is `UNION`, which combines the results of two queries by stacking the rows of one on top of the rows of the other. `UNION` removes duplicate rows from the combined result; `UNION ALL` keeps them.

In [66]:
%%sql
    (SELECT * FROM gene WHERE gene_symbol = 'TP53') 
    UNION
    (SELECT * FROM gene WHERE gene_symbol = 'FGFR2')

 * postgresql://postgres:***@localhost:5433/ensembl
2 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000066468,-1,121598458,121478334,10,FGFR2
ENSG00000141510,-1,7687550,7661779,17,TP53
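
The duplicate handling of `UNION` is worth seeing once. This sketch uses Python's built-in `sqlite3` module and a two-row toy table, with two queries deliberately written so that both return the same gene.

```python
import sqlite3

# Toy stand-in for the gene table with just two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_symbol TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("TP53", "17"), ("FGFR2", "10")],
)

# Both halves return the TP53 row.
first = "SELECT gene_symbol FROM gene WHERE chromosome = '17'"
second = "SELECT gene_symbol FROM gene WHERE gene_symbol = 'TP53'"

union_rows = conn.execute(f"{first} UNION {second}").fetchall()
union_all_rows = conn.execute(f"{first} UNION ALL {second}").fetchall()

print(union_rows, union_all_rows)
```

`UNION` collapses the repeated row into one, while `UNION ALL` returns it twice.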


# Fix Me!

Why does this UNION query not work? 

Modify the bottom subquery to work with the top subquery.

In [None]:
%%sql
    -- Fix the second SQL statement!
    (SELECT ensembl_gene_id, gene_symbol FROM gene WHERE gene_symbol = 'TP53')
    UNION
    (SELECT* FROM gene WHERE gene_symbol = 'FGFR2')

# What's Next?

We'll learn some of the intricacies of database design and normalization. Stay tuned!

# Acknowledgements

This notebook was adapted from a tutorial by Christina Zheng.