# Introduction to Postgres SQL

This notebook is meant to be a short introduction to relational database queries using SQL.

In [None]:
%reload_ext sql

##Connect to the database
#%sql postgresql://postgres@localhost/ensembl

%sql postgresql://postgres:postpost@localhost:5433/ensembl

# SELECTing from a table in the database



In [6]:
%sql SELECT * FROM gene LIMIT 20;

 * postgresql://postgres:***@localhost:5433/ensembl
20 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000198888,1,4262,3307,MT,MT-ND1
ENSG00000198763,1,5511,4470,MT,MT-ND2
ENSG00000198804,1,7445,5904,MT,MT-CO1
ENSG00000198712,1,8269,7586,MT,MT-CO2
ENSG00000228253,1,8572,8366,MT,MT-ATP8
ENSG00000198899,1,9207,8527,MT,MT-ATP6
ENSG00000198938,1,9990,9207,MT,MT-CO3
ENSG00000198840,1,10404,10059,MT,MT-ND3
ENSG00000212907,1,10766,10470,MT,MT-ND4L
ENSG00000198886,1,12137,10760,MT,MT-ND4


# WHERE

`WHERE` is an optional clause that lets us add filtering criteria to our query.

In [7]:
%sql SELECT * FROM gene WHERE chromosome = '14' LIMIT 20;

 * postgresql://postgres:***@localhost:5433/ensembl
20 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000100505,-1,51096061,50975262,14,TRIM9
ENSG00000139921,1,51257655,51240162,14,TMX1
ENSG00000139926,1,51730727,51489100,14,FRMD6
ENSG00000172717,1,67228550,67189393,14,FAM71D
ENSG00000100749,1,96931722,96797382,14,VRK1
ENSG00000186469,1,51979342,51826195,14,GNG2
ENSG00000072415,1,67336061,67241342,14,MPP5
ENSG00000100554,-1,67360265,67294371,14,ATP6V1D
ENSG00000134001,1,67386516,67360151,14,EIF2S1
ENSG00000087302,1,52010694,51989514,14,RTRAF


# Boolean Operations

We can chain criteria together using the Boolean operators `AND` and `OR`.

In [30]:
%%sql
SELECT * FROM gene WHERE gene_end < 100000 AND chromosome = '10'

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000261456,-1,74163,46892,10,TUBB8


# Think about it

Should an `AND` query return more or fewer rows than the same query using `OR`?

In [None]:
%%sql
SELECT * FROM gene WHERE gene_end < 100000 OR chromosome = '10'

# SELECTing columns

You can select individual columns of each table using the SELECT statement. 

Note that we have to put our chromosome (`'14'`) in single quotes because the column's data type is character, not numeric.

In [4]:
%%sql 
SELECT ensembl_gene_id, gene_start, chromosome 
FROM gene
WHERE chromosome = '14'

 * postgresql://postgres:***@localhost:5433/ensembl
19 rows affected.


ensembl_gene_id,gene_start,chromosome
ENSG00000211890,105583731,14
ENSG00000211890,105583731,14
ENSG00000211891,105597691,14
ENSG00000211891,105597691,14
ENSG00000211892,105620506,14
ENSG00000211892,105620506,14
ENSG00000211893,105639559,14
ENSG00000211893,105639559,14
ENSG00000211895,105703995,14
ENSG00000211895,105703995,14


# Your Turn

Write a `SELECT` statement that returns every column from the `gene` table for rows where `gene_symbol` is `'FGFR2'`.

In [32]:
#Room for your answer here

%sql 


# Aliases using AS

One useful trick is to give a table an alias. This becomes more important as we join tables together, since it:

1. Saves us typing
2. Makes our query clearer. A column such as `ensembl_gene_id` may appear in more than one table, and Postgres reports an ambiguous column reference if we don't say which table we mean.

In [8]:
%%sql 
SELECT g.ensembl_gene_id, g.gene_start, g.chromosome 
FROM gene AS g
WHERE g.chromosome = '14' LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,chromosome
ENSG00000100505,50975262,14
ENSG00000139921,51240162,14
ENSG00000139926,51489100,14
ENSG00000172717,67189393,14
ENSG00000100749,96797382,14
ENSG00000186469,51826195,14
ENSG00000072415,67241342,14
ENSG00000100554,67294371,14
ENSG00000134001,67360151,14
ENSG00000087302,51989514,14
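
To see why qualifying column names matters once two tables are involved, here is a minimal sketch. It assumes Python's built-in `sqlite3` module as a stand-in for the Postgres server, and the table contents (`G1`, `T1`, `GENE_A`) are made-up placeholders, not real Ensembl records. The exact error text differs between databases, but both reject the unqualified column.

```python
import sqlite3

# Toy stand-ins for the notebook's tables; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (ensembl_gene_id TEXT, gene_symbol TEXT);
    CREATE TABLE gene2transcript (ensembl_gene_id TEXT, ensembl_transcript_id TEXT);
    INSERT INTO gene VALUES ('G1', 'GENE_A');
    INSERT INTO gene2transcript VALUES ('G1', 'T1');
""")

join_clause = """
    FROM gene AS g
    JOIN gene2transcript AS t2g
        ON g.ensembl_gene_id = t2g.ensembl_gene_id
"""

# Unqualified: ensembl_gene_id exists in both tables, so the query is rejected.
try:
    conn.execute("SELECT ensembl_gene_id" + join_clause)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Qualified with the aliases: the database knows which table we mean.
rows = conn.execute(
    "SELECT g.ensembl_gene_id, t2g.ensembl_transcript_id" + join_clause
).fetchall()

print(error_message)
print(rows)
```

The aliases `g` and `t2g` keep the qualified version short, which is the "saves us typing" benefit from the list above.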


We can also rename our columns using AS:

In [None]:
%%sql 
SELECT ensembl_gene_id AS ensembl, gene_start, chromosome FROM gene
WHERE chromosome = '14' LIMIT 20;

# Aggregating using COUNT

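The `COUNT` aggregate function in SQL lets us count things. If we use `COUNT(ensembl_gene_id)`, it will count the number of rows in which `ensembl_gene_id` is not NULL. Adding `DISTINCT` counts each unique value only once.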

In [21]:
%sql SELECT COUNT(ensembl_transcript_id) FROM transcript;

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


count
168617


In [23]:
%sql SELECT COUNT(DISTINCT ensembl_gene_id) FROM gene2transcript;

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.


count
22799
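
The difference between the `COUNT` variants is easiest to see on a tiny table. The following is a sketch only: it uses Python's built-in `sqlite3` module instead of the Postgres server, and the four rows are invented placeholders (including one NULL gene ID added purely to show the effect).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene2transcript (ensembl_gene_id TEXT, ensembl_transcript_id TEXT);
    -- Illustrative rows: G1 has two transcripts, and one row has no gene ID.
    INSERT INTO gene2transcript VALUES
        ('G1', 'T1'),
        ('G1', 'T2'),
        ('G2', 'T3'),
        (NULL, 'T4');
""")

count_rows, count_ids, count_distinct = conn.execute("""
    SELECT COUNT(*),                        -- every row
           COUNT(ensembl_gene_id),          -- rows with a non-NULL gene ID
           COUNT(DISTINCT ensembl_gene_id)  -- unique non-NULL gene IDs
    FROM gene2transcript
""").fetchone()

print(count_rows, count_ids, count_distinct)  # 4 3 2
```

This is why the query above uses `COUNT(DISTINCT ensembl_gene_id)` on `gene2transcript`: a gene with several transcripts appears on several rows, and `DISTINCT` counts it once.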


# GROUP BY



In [40]:
%sql SELECT chromosome, COUNT(chromosome) FROM gene GROUP BY chromosome;

 * postgresql://postgres:***@localhost:5433/ensembl
357 rows affected.


chromosome,count
CHR_HSCHR1_5_CTG32_1,3
13,321
CHR_HSCHR17_3_CTG1,8
CHR_HSCHR21_6_CTG1_1,1
CHR_HSCHR1_ALT2_1_CTG32_1,7
CHR_HSCHR1_1_CTG3,7
CHR_HSCHR19KIR_FH08_A_HAP_CTG3_1,9
CHR_HSCHR22_1_CTG7,17
Y,48
CHR_HSCHR19LRC_LRC_S_CTG3_1,19


# Your Turn

How many distinct chromosomes are there in the `gene` table?

In [None]:
#Room for your answer here
%sql

# JOINing the Tables

Of course, our `transcript` and `gene` tables aren't that useful by themselves. We need to integrate information in these tables to produce useful queries.

To do this, we use JOINs, which combine rows from two tables that share a matching key.

In [14]:
%sql SELECT * FROM transcript LIMIT 10

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_transcript_id,transcript_start,transcript_end,transcript_type
ENST00000361390,3307,4262,protein_coding
ENST00000361453,4470,5511,protein_coding
ENST00000361624,5904,7445,protein_coding
ENST00000361739,7586,8269,protein_coding
ENST00000361851,8366,8572,protein_coding
ENST00000361899,8527,9207,protein_coding
ENST00000362079,9207,9990,protein_coding
ENST00000361227,10059,10404,protein_coding
ENST00000361335,10470,10766,protein_coding
ENST00000361381,10760,12137,protein_coding


In [14]:
%%sql 
    SELECT g.ensembl_gene_id, g.gene_start, g.gene_symbol, t2g.ensembl_transcript_id
    FROM gene AS g
    LEFT JOIN gene2transcript AS t2g 
        ON g.ensembl_gene_id = t2g.ensembl_gene_id 
    LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,gene_symbol,ensembl_transcript_id
ENSG00000198888,3307,MT-ND1,ENST00000361390
ENSG00000198763,4470,MT-ND2,ENST00000361453
ENSG00000198804,5904,MT-CO1,ENST00000361624
ENSG00000198712,7586,MT-CO2,ENST00000361739
ENSG00000228253,8366,MT-ATP8,ENST00000361851
ENSG00000198899,8527,MT-ATP6,ENST00000361899
ENSG00000198938,9207,MT-CO3,ENST00000362079
ENSG00000198840,10059,MT-ND3,ENST00000361227
ENSG00000212907,10470,MT-ND4L,ENST00000361335
ENSG00000198886,10760,MT-ND4,ENST00000361381


You may recall that we have **three** tables: `gene`, `gene2transcript` and `transcript`. To join `gene` and `transcript`, our query has to include all three tables, because `gene2transcript` is the mapping table that links a gene ID to its transcript IDs.

We can do this by adding another `LEFT JOIN` clause to our query. Here we're adding in the `transcript` table with the join condition `t2g.ensembl_transcript_id = t.ensembl_transcript_id`.

In [36]:
%%sql 
    SELECT g.ensembl_gene_id, g.gene_start, g.gene_symbol, t.ensembl_transcript_id, t.transcript_start
    FROM gene AS g
    LEFT JOIN gene2transcript AS t2g 
        ON g.ensembl_gene_id = t2g.ensembl_gene_id 
    LEFT JOIN transcript AS t
        ON t2g.ensembl_transcript_id = t.ensembl_transcript_id
    LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_start,gene_symbol,ensembl_transcript_id,transcript_start
ENSG00000198786,12337,MT-ND5,ENST00000361567,12337
ENSG00000278817,131494,AC007325.4,ENST00000613204,131494
ENSG00000281022,133338460,MED22,ENST00000631196,133338460
ENSG00000274053,54643934,NCR1,ENST00000619679,54643934
ENSG00000273506,54911759,NCR1,ENST00000619067,54911759
ENSG00000277667,54160649,AC012314.14,ENST00000622368,54160649
ENSG00000277733,54173854,MBOAT7,ENST00000612567,54173854
ENSG00000276017,72411,AC007325.1,ENST00000617983,72411
ENSG00000280584,133205277,OBP2B,ENST00000630166,133205277
ENSG00000143954,79025686,REG3G,ENST00000498312,79025748
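
The same three-table pattern can be tried without a database server. This sketch assumes Python's built-in `sqlite3` module and a handful of invented rows (`G1`, `T1`, `GENE_A` and so on are placeholders), with the table and column names copied from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (ensembl_gene_id TEXT, gene_symbol TEXT);
    CREATE TABLE gene2transcript (ensembl_gene_id TEXT, ensembl_transcript_id TEXT);
    CREATE TABLE transcript (ensembl_transcript_id TEXT, transcript_type TEXT);
    -- Illustrative rows only: GENE_A has two transcripts, GENE_B has one.
    INSERT INTO gene VALUES ('G1', 'GENE_A'), ('G2', 'GENE_B');
    INSERT INTO gene2transcript VALUES ('G1', 'T1'), ('G1', 'T2'), ('G2', 'T3');
    INSERT INTO transcript VALUES
        ('T1', 'protein_coding'),
        ('T2', 'retained_intron'),
        ('T3', 'protein_coding');
""")

# gene -> gene2transcript -> transcript, one LEFT JOIN per hop.
rows = conn.execute("""
    SELECT g.gene_symbol, t.ensembl_transcript_id, t.transcript_type
    FROM gene AS g
    LEFT JOIN gene2transcript AS t2g
        ON g.ensembl_gene_id = t2g.ensembl_gene_id
    LEFT JOIN transcript AS t
        ON t2g.ensembl_transcript_id = t.ensembl_transcript_id
    ORDER BY g.gene_symbol, t.ensembl_transcript_id
""").fetchall()

for row in rows:
    print(row)
```

Note that `GENE_A` appears on two rows, once per transcript: a join returns one row per matching pair, not one row per gene.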


# The Different JOIN types

- LEFT JOIN
- INNER JOIN
- RIGHT JOIN
- FULL OUTER JOIN
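
As a rough illustration of how two of these differ, the sketch below compares `LEFT JOIN` and `INNER JOIN` on a gene that has no entry in the mapping table. It assumes Python's built-in `sqlite3` module rather than Postgres, and the rows are invented placeholders. The right-hand and full variants are left out of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (ensembl_gene_id TEXT);
    CREATE TABLE gene2transcript (ensembl_gene_id TEXT, ensembl_transcript_id TEXT);
    -- Illustrative rows only: G2 has no transcript in the mapping table.
    INSERT INTO gene VALUES ('G1'), ('G2');
    INSERT INTO gene2transcript VALUES ('G1', 'T1');
""")

template = """
    SELECT g.ensembl_gene_id, t2g.ensembl_transcript_id
    FROM gene AS g
    {join_type} gene2transcript AS t2g
        ON g.ensembl_gene_id = t2g.ensembl_gene_id
    ORDER BY g.ensembl_gene_id
"""

# LEFT JOIN keeps every row of the left table and fills the gaps with NULL.
left_rows = conn.execute(template.format(join_type="LEFT JOIN")).fetchall()

# INNER JOIN keeps only rows that have a match in both tables.
inner_rows = conn.execute(template.format(join_type="INNER JOIN")).fetchall()

print(left_rows)   # [('G1', 'T1'), ('G2', None)]
print(inner_rows)  # [('G1', 'T1')]
```

`RIGHT JOIN` mirrors `LEFT JOIN` by keeping every row of the right-hand table, and `FULL OUTER JOIN` keeps unmatched rows from both sides.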

# Indexing: making a query faster

Querying takes time, because without an index the database has to scan the whole table to find the matching rows.

However, if there is a column we query a lot, we can create an **index** for it.
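
A small sketch of the idea, assuming Python's built-in `sqlite3` module as a stand-in for Postgres. The index name `idx_gene_chromosome`, the helper `query_plan` and the generated rows are all hypothetical. The `CREATE INDEX` statement itself has the same form in Postgres, where `EXPLAIN` plays the role of SQLite's `EXPLAIN QUERY PLAN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (ensembl_gene_id TEXT, chromosome TEXT)")

# Generate 1000 placeholder genes spread over chromosomes '1' to '22'.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [(f"G{i}", str(i % 22 + 1)) for i in range(1000)],
)

query = "SELECT * FROM gene WHERE chromosome = '14'"


def query_plan(sql):
    """Return the planner's description of how it would run `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row holds the readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = query_plan(query)  # full table scan

# Hypothetical index name; the column is the one we filter on most often.
conn.execute("CREATE INDEX idx_gene_chromosome ON gene (chromosome)")
plan_after = query_plan(query)  # lookup through the index

print(plan_before)
print(plan_after)
```

The trade-off is that an index takes extra storage and has to be updated on every insert, so it pays off mainly for columns that are filtered or joined on frequently.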

# Useful to know: Subqueries