# Creating the Database and User

## Getting Started: Don't run this code in Anaconda!

First, follow [these steps](https://medium.com/@FranckPachot/postgresql-and-jupyter-notebook-e7b68cb6427d) up to "Create a database". The same steps are also described below.


It's easiest to do this in a [new conda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html), which can then also be used as a kernel for Jupyter notebooks.

Create the conda environment using the following commands (in **Terminal for Mac users** or **Anaconda Prompt for Windows users**):

    conda create --name db1

    conda activate db1

    conda install -y -c anaconda psycopg2

    conda install -y -c conda-forge ipython-sql

    conda install -y -c conda-forge postgresql

    conda install -y -c conda-forge pgspecial

Enable use of this conda environment as a Jupyter kernel:

    conda install -c anaconda ipykernel

    python -m ipykernel install --user --name db1 --display-name "db1"
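
Before going further, it can help to confirm that the packages installed above are visible from the `db1` environment. The snippet below is a minimal sketch, meant to be run with `python` inside the activated environment. The helper name `missing_modules` is made up for this example, and the list of import names is an assumption based on the install commands above (note that `ipython-sql` is imported as `sql`).

```python
import importlib.util

# Import names of the packages installed above (an assumption based on the
# conda commands). ipython-sql is imported under the name "sql".
REQUIRED_MODULES = ["psycopg2", "sql", "pgspecial", "ipykernel"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported as missing, check that `conda activate db1` was run in the current prompt before retrying the install commands.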

Then, also in your Anaconda Prompt (on Windows) or your Terminal (on Mac), run the following:

## For Windows users

You can also set the `PGDATA` environment variable using the Control Panel instead of the `setx` command shown below.

Make sure to substitute your own username for `<USERNAME>` below. If your Windows username contains a space, use quotes, e.g. `createdb "Ted Laderas"`.

```
#set the database location
#add PGDATA to your environment variables either with the following statement
#or use control panel to add it (it can be user or global)
setx PGDATA C:\Anaconda\pgdata

```
Note: `setx` does not change the current session, so restart Anaconda Prompt after setting the environment variable. Ensure the conda environment you created above is activated in the new prompt.


```
#make the database directory
mkdir %PGDATA%
# initialize the database in the PGDATA Folder
pg_ctl initdb
# start the postgres daemon (process that runs in background)
pg_ctl start
# create a database named after your username so that psql can connect by default
createdb <USERNAME>
# open the psql prompt so we can add a database
psql
```

## For Mac users

Make sure to substitute your own username for `<USERNAME>` below.

```
#set the database location
#add this line to your .bash_profile or .profile
export PGDATA=~/pgdata

#check to ensure PGDATA is set: 
echo $PGDATA 
#if the variable has not been set to the appropriate directory, reload your profile with: 
source ~/.bash_profile
#or
source ~/.profile

#make the database directory
mkdir $PGDATA
pg_ctl initdb
pg_ctl start

#see above for information about username
createdb <USERNAME>
#open the postgres prompt
psql 
```

## For Both Types of Users

In the `psql` prompt, type the following (note the semicolon!):

```
CREATE DATABASE ensembl;
exit
```

Then continue on here, running the cells below. 
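
If something goes wrong later, one thing worth checking is whether `PGDATA` points at an initialized data directory. The snippet below is a minimal sketch of such a check. The helper name `pgdata_status` is made up for this example, and it relies on `initdb` writing a `PG_VERSION` file at the top of the data directory. Run it with `python` from the same terminal session in which `PGDATA` is set, since a notebook started elsewhere may not see the variable.

```python
import os
from pathlib import Path


def pgdata_status(pgdata):
    """Describe the state of a PostgreSQL data directory path (or None)."""
    if not pgdata:
        return "PGDATA is not set"
    data_dir = Path(pgdata).expanduser()
    if not data_dir.is_dir():
        return f"{data_dir} does not exist"
    # initdb writes a PG_VERSION file at the top of the data directory.
    if not (data_dir / "PG_VERSION").is_file():
        return f"{data_dir} exists but is not initialized"
    return f"{data_dir} is initialized"


print(pgdata_status(os.environ.get("PGDATA")))
```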

# Loading Data Into Our Database

The first thing we do is to load the `sql` extension, which enables us to run SQL statements directly. Once that is loaded, we connect to our database.

Remember to sub your username for `<USERNAME>` here! If your username has a space in it, you can use `%20` to substitute for space.
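
The `%20` substitution mentioned above is ordinary URL percent-encoding, so it can be produced with Python's standard library instead of by hand. This is a minimal sketch: the helper name `build_connection_url` is made up for this example, and it assumes a passwordless local connection like the one used in the cell below.

```python
from urllib.parse import quote


def build_connection_url(username, database="ensembl", host="localhost"):
    """Build a passwordless postgresql:// URL, percent-encoding the username."""
    # safe="" makes quote() encode every reserved character, including "/" and "@".
    return f"postgresql://{quote(username, safe='')}@{host}/{database}"


print(build_connection_url("Ted Laderas"))
# prints postgresql://Ted%20Laderas@localhost/ensembl
```

The printed string can then be pasted after `%sql` in the connection cell.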

In [1]:
%reload_ext sql

##Connect to the database
#%sql postgresql://postgres:postpost@localhost:5433/ensembl ##

## update for your computer
%sql postgresql://mooneymi@localhost/ensembl

'Connected: mooneymi@ensembl'

In [2]:
pg_version=%sql select version()
print(pg_version)

 * postgresql://mooneymi@localhost/ensembl
1 rows affected.
+----------------------------------------------------------------------------------------+
|                                        version                                         |
+----------------------------------------------------------------------------------------+
| PostgreSQL 12.9 on x86_64-apple-darwin13.4.0, compiled by clang version 12.0.0, 64-bit |
+----------------------------------------------------------------------------------------+


# Creating the Database Tables

Here we load our data into our database. We first need to specify the data types used in our table. Postgres has a lot of different data types.

## Before you run this

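Note that the `COPY` command reads files on the server side, so you should give it the **absolute** path to each data file (a relative path would be interpreted relative to the server's working directory, not your notebook's). Before running the cell, modify the paths after each `FROM` keyword so that they are the absolute paths to the data files on your machine.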
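
If you are unsure what the absolute path of a data file is, Python can work it out for you. The snippet below is a minimal sketch: the helper name `copy_statement` is made up for this example, and `data/ensembl_gene_transcript.csv` is a placeholder relative path that you would replace with wherever the file lives on your machine. It writes the path with forward slashes, on the assumption that this is also acceptable on Windows and avoids backslash escaping.

```python
from pathlib import Path


def copy_statement(table, columns, csv_path):
    """Return a COPY statement that reads csv_path via its absolute path."""
    absolute = Path(csv_path).expanduser().resolve()
    # as_posix() uses forward slashes; single quotes are doubled for SQL.
    quoted = absolute.as_posix().replace("'", "''")
    return (
        f"COPY {table}({', '.join(columns)})\n"
        f"FROM '{quoted}' DELIMITER ',' CSV HEADER;"
    )


# "data/ensembl_gene_transcript.csv" is a placeholder relative path.
print(copy_statement(
    "gene2transcript",
    ["ensembl_gene_id", "ensembl_transcript_id"],
    "data/ensembl_gene_transcript.csv",
))
```

The printed statement can be pasted into the `%%sql` cell in place of the corresponding `COPY` statement.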

In [None]:
%%sql 

drop table if exists gene;
drop table if exists transcript;
drop table if exists gene2transcript;

CREATE TABLE gene2transcript
(
    ensembl_gene_id character(25),
    ensembl_transcript_id character(25)
);
COPY gene2transcript(ensembl_gene_id, ensembl_transcript_id)
FROM '/Users/klockec/Documents/code/BMI535slides/data/ensembl_gene_transcript.csv' DELIMITER ',' CSV HEADER;

CREATE TABLE transcript
  (
      ensembl_transcript_id character(25),
      transcript_start integer,
      transcript_end integer,
      transcript_type character varying
  );
    
COPY transcript(ensembl_transcript_id, transcript_start, transcript_end, transcript_type)
FROM '/Users/klockec/Documents/code/BMI535slides/data/ensembl_transcript.csv' DELIMITER ',' CSV HEADER;

CREATE TABLE gene
  (
      ensembl_gene_id character(25),
      gene_strand integer,
      gene_end integer,
      gene_start integer,
      chromosome character varying,
      gene_symbol character varying
  );
    
COPY gene(ensembl_gene_id, gene_strand, gene_end, gene_start, chromosome, gene_symbol) 
FROM '/Users/klockec/Documents/code/BMI535slides/data/ensembl_gene.csv' DELIMITER ',' CSV HEADER;


# Ensuring we have loaded our data correctly

Now we're going to run a couple of SQL commands to ensure that we've loaded our data in correctly. Use this page as a reference when you are doing the exercises.

In [None]:
%sql SELECT * FROM gene LIMIT 10;

In [None]:
%sql SELECT * FROM transcript LIMIT 10;

In [None]:
%sql SELECT * FROM gene2transcript LIMIT 10;
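
Looking at the first ten rows shows that the columns line up, but not that every row arrived. One further check is to compare the number of data rows in each CSV file with the result of a `SELECT COUNT(*)` on the matching table. The snippet below is a minimal sketch of the file-side count. The helper name `count_data_rows` is made up for this example.

```python
import csv


def count_data_rows(csv_path):
    """Count the rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, matching COPY ... CSV HEADER
        return sum(1 for _ in reader)
```

For example, the value returned for your copy of `ensembl_gene.csv` (passed by its absolute path) should match the number reported by `%sql SELECT COUNT(*) FROM gene;`.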

# Acknowledgements

This material was adapted from notebooks by Ted Laderas. 