# Creating the Database and User

## Getting Started: Don't run this code in anaconda!

First, follow the steps in this article up to `Create a Database`: https://medium.com/@FranckPachot/postgresql-and-jupyter-notebook-e7b68cb6427d

Then, open your Anaconda Prompt (on Windows) or your Terminal (on Mac) and run the following:

## For Windows users

Replace `<USERNAME>` in the commands below with your username on your machine.

```
rem set the database location
rem add PGDATA to your environment variables
set PGDATA=C:\Anaconda\pgdata
mkdir %PGDATA%
pg_ctl initdb
pg_ctl start
createdb <USERNAME>
psql
```

## For Mac users

Replace `<USERNAME>` in the commands below with your username on your machine.

```
#make sure to add this export line to your .bashrc!
#set the database location
export PGDATA=~/pgdata
mkdir $PGDATA
pg_ctl initdb
pg_ctl start
createdb <USERNAME>
#open the postgres prompt
psql 
```

## For both types of users

In the `psql` prompt, type the following:

```
CREATE DATABASE ensembl;
\q
```

Then continue here by running the cells below.

# Loading Data Into Our Database

The first thing we do is load the `sql` extension, which enables us to run SQL statements directly. Make sure that you have the `ipython-sql` package installed, using `conda install -y -c conda-forge ipython-sql` from the Anaconda Prompt. Once the extension is loaded, we connect to our database.

Remember to substitute your username for `<USERNAME>` here!

In [12]:
%reload_ext sql

##Connect to the database
#%sql postgresql://postgres:postpost@localhost:5433/ensembl
%sql postgresql://<USERNAME>@localhost/ensembl            
            

'Connected: postgres@ensembl'
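
If editing the connection string by hand is error-prone, it can be assembled in Python first. This is a minimal sketch, assuming a passwordless local connection on the default port as in the cell above; `build_connection_url` is a hypothetical helper, not part of `ipython-sql`, and `alice` is a made-up username.

```python
import getpass


def build_connection_url(user=None, host="localhost", database="ensembl"):
    """Return a postgresql:// URL of the form used in the %sql cell above.

    Hypothetical helper: falls back to the current OS login name when no
    user is given, which matches the database user created by `createdb`.
    """
    user = user or getpass.getuser()
    return f"postgresql://{user}@{host}/{database}"


# Example with an explicit (made-up) username.
url = build_connection_url(user="alice")
print(url)  # postgresql://alice@localhost/ensembl
```

IPython generally expands `$name` in line magics, so `%sql $url` should connect with the generated string, though it is worth confirming on your own setup.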

In [10]:
pg_version=%sql select version()
print(pg_version)

 * postgresql://postgres:***@localhost:5433/ensembl
1 rows affected.
+------------------------------------------------------------+
|                          version                           |
+------------------------------------------------------------+
| PostgreSQL 12.1, compiled by Visual C++ build 1914, 64-bit |
+------------------------------------------------------------+


# Creating the Database Tables

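Here we load our data into our database. We first need to specify the data type of each column in our tables. Postgres has many data types; here we use fixed-length `character(25)` for the Ensembl IDs, `integer` for strand and coordinates, and `character varying` for free text such as gene symbols.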

## Before you run this

Note that Postgres requires the **absolute** file path to the data files. You will need to change the paths that follow each `FROM` keyword in the `COPY` statements to the absolute paths of the data files on your machine.
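
One way to find the absolute path to paste into a `COPY` statement is to let Python resolve it. This is a minimal sketch, assuming the CSV files sit in a `data` folder under the notebook's working directory; `absolute_csv_path` is a hypothetical helper and the folder name is an assumption.

```python
from pathlib import Path


def absolute_csv_path(filename, data_dir="data"):
    """Return the absolute path to a data file, using forward slashes.

    Hypothetical helper: `data_dir` is assumed to be relative to the
    current working directory. The file does not need to exist for the
    path to be resolved.
    """
    return (Path(data_dir) / filename).resolve().as_posix()


# Print the paths to copy into the three COPY ... FROM lines.
for name in ("ensembl_gene_transcript.csv",
             "ensembl_transcript.csv",
             "ensembl_gene.csv"):
    print(absolute_csv_path(name))
```

Forward slashes are used because they match the style of the Windows paths already shown in the cell below.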

In [27]:
%%sql 
drop table if exists gene;
drop table if exists transcript;
drop table if exists gene2transcript;

CREATE TABLE gene2transcript
(
    ensembl_gene_id character(25),
    ensembl_transcript_id character(25)
);
COPY gene2transcript(ensembl_gene_id, ensembl_transcript_id)
FROM 'c:/Code/BMI535slides/data/ensembl_gene_transcript.csv' DELIMITER ',' CSV HEADER;

CREATE TABLE transcript
  (
     ensembl_transcript_id character(25),
      transcript_start integer,
      transcript_end integer,
      transcript_type character varying
  );
    
COPY transcript(ensembl_transcript_id, transcript_start, transcript_end, transcript_type)
FROM 'c:/Code/BMI535slides/data/ensembl_transcript.csv' DELIMITER ',' CSV HEADER;

CREATE TABLE gene
  (
      ensembl_gene_id character(25),
      gene_strand integer,
      gene_end integer,
      gene_start integer,
      chromosome character varying,
      gene_symbol character varying
  );
    
COPY gene(ensembl_gene_id, gene_strand, gene_end, gene_start, chromosome, gene_symbol) 
FROM 'c:/Code/BMI535slides/data/ensembl_gene.csv' DELIMITER ',' CSV HEADER;


 * postgresql://postgres:***@localhost:5433/ensembl
Done.
Done.
Done.
Done.
168617 rows affected.
Done.
168617 rows affected.
Done.
22799 rows affected.


[]
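
The `rows affected` counts reported by `COPY` can be cross-checked against the source files. This is a minimal sketch, assuming each CSV has exactly one header row (as the `CSV HEADER` option implies); `count_data_rows` is a hypothetical helper, and the in-memory sample stands in for a real file.

```python
import csv
import io


def count_data_rows(csv_file):
    """Count the rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header, if there is one
    return sum(1 for _ in reader)


# Small in-memory sample built from rows shown in this notebook.
sample = io.StringIO(
    "ensembl_gene_id,ensembl_transcript_id\n"
    "ENSG00000198888,ENST00000361390\n"
    "ENSG00000198763,ENST00000361453\n"
)
print(count_data_rows(sample))  # 2
```

For a real file, open it with `open(path, newline="")` and pass the file object in; the result should equal the count that `COPY` reported for that table.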

# Ensuring We Have Loaded Our Data Correctly

Now we're going to run a few SQL commands to confirm that the data loaded correctly. Use this page as a reference when you are doing the exercises.

In [28]:
%sql SELECT * FROM gene LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,gene_strand,gene_end,gene_start,chromosome,gene_symbol
ENSG00000198888,1,4262,3307,MT,MT-ND1
ENSG00000198763,1,5511,4470,MT,MT-ND2
ENSG00000198804,1,7445,5904,MT,MT-CO1
ENSG00000198712,1,8269,7586,MT,MT-CO2
ENSG00000228253,1,8572,8366,MT,MT-ATP8
ENSG00000198899,1,9207,8527,MT,MT-ATP6
ENSG00000198938,1,9990,9207,MT,MT-CO3
ENSG00000198840,1,10404,10059,MT,MT-ND3
ENSG00000212907,1,10766,10470,MT,MT-ND4L
ENSG00000198886,1,12137,10760,MT,MT-ND4


In [24]:
%sql SELECT * FROM transcript LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_transcript_id,transcript_start,transcript_end,transcript_type
ENST00000361390,3307,4262,protein_coding
ENST00000361453,4470,5511,protein_coding
ENST00000361624,5904,7445,protein_coding
ENST00000361739,7586,8269,protein_coding
ENST00000361851,8366,8572,protein_coding
ENST00000361899,8527,9207,protein_coding
ENST00000362079,9207,9990,protein_coding
ENST00000361227,10059,10404,protein_coding
ENST00000361335,10470,10766,protein_coding
ENST00000361381,10760,12137,protein_coding


In [25]:
%sql SELECT * FROM gene2transcript LIMIT 10;

 * postgresql://postgres:***@localhost:5433/ensembl
10 rows affected.


ensembl_gene_id,ensembl_transcript_id
ENSG00000198888,ENST00000361390
ENSG00000198763,ENST00000361453
ENSG00000198804,ENST00000361624
ENSG00000198712,ENST00000361739
ENSG00000228253,ENST00000361851
ENSG00000198899,ENST00000361899
ENSG00000198938,ENST00000362079
ENSG00000198840,ENST00000361227
ENSG00000212907,ENST00000361335
ENSG00000198886,ENST00000361381
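
Beyond eyeballing the first rows, one further check is that every row in `gene2transcript` points at a gene and a transcript that exist. This is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server; the tables are trimmed to a few columns and filled with two rows from the output above.

```python
import sqlite3

# In-memory SQLite database standing in for the ensembl Postgres database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (ensembl_gene_id TEXT, gene_symbol TEXT);
CREATE TABLE transcript (ensembl_transcript_id TEXT, transcript_type TEXT);
CREATE TABLE gene2transcript (ensembl_gene_id TEXT, ensembl_transcript_id TEXT);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [
    ("ENSG00000198888", "MT-ND1"),
    ("ENSG00000198763", "MT-ND2"),
])
conn.executemany("INSERT INTO transcript VALUES (?, ?)", [
    ("ENST00000361390", "protein_coding"),
    ("ENST00000361453", "protein_coding"),
])
conn.executemany("INSERT INTO gene2transcript VALUES (?, ?)", [
    ("ENSG00000198888", "ENST00000361390"),
    ("ENSG00000198763", "ENST00000361453"),
])

# Count mapping rows whose gene or transcript is missing from its table.
orphans = conn.execute("""
SELECT COUNT(*)
FROM gene2transcript AS g2t
LEFT JOIN gene AS g
       ON g.ensembl_gene_id = g2t.ensembl_gene_id
LEFT JOIN transcript AS t
       ON t.ensembl_transcript_id = g2t.ensembl_transcript_id
WHERE g.ensembl_gene_id IS NULL
   OR t.ensembl_transcript_id IS NULL
""").fetchone()[0]
print(orphans)  # 0
```

The `SELECT` statement uses only standard SQL, so it should also run unchanged in a `%%sql` cell against the Postgres tables, where a result of 0 would mean no mapping row is orphaned.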
