# SQL NOTES

The ESC403 cluster comes with [Catherine Devlin's `%sql`-magic for IPython][1]; this allows you to run SQL queries from the IPython notebook, and intermix them with Python code.

[1]: https://github.com/catherinedevlin/ipython-sql

Before we can use the `%sql` syntax, two steps must be taken:

* Load the IPython-SQL bridge code

In [2]:
%load_ext sql


* Connect to an actual database; this must be the first `%sql` statement. The odd-looking `mivkov@/lustre` syntax is correct and means "connect to the PostgreSQL DB named `lustre` running on *this* host as user `mivkov`"; please replace `mivkov` with your local user name:

In [3]:
%sql postgresql://mivkov@/lustre


u'Connected: mivkov@lustre'
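To see how the connection URL breaks down into its parts, here is a small sketch using Python's standard-library URL parser. It is for illustration only: `%sql` passes the URL to SQLAlchemy, which has its own parser, so treat the field names below as an approximation of how the string is read.

```python
from urllib.parse import urlsplit

# Connection URL from the cell above; replace the user name with your own.
url = urlsplit("postgresql://mivkov@/lustre")

dialect = url.scheme             # database type
user = url.username              # part before the '@'
host = url.hostname              # None: nothing between '@' and '/', i.e. this host
database = url.path.lstrip("/")  # part after the '/'

print("dialect: ", dialect)
print("user:    ", user)
print("host:    ", host)
print("database:", database)
```

The empty host part is what makes the `user@/database` form look unusual.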

Now you can run 1-line SQL queries by prefixing them with `%sql`:

In [56]:
%sql select * from lustre limit 5;

5 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr388,i5105,1384455829,1384455829,4,1653,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00323201_dock009.pdb
usr388,i5105,1384453069,1384453069,4,1378,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00195993_dock004.pdb
usr388,i5105,1384454883,1384454883,4,2038,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00271731_dock012.pdb
usr388,i5105,1384450216,1384450216,4,1873,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00107714_dock008.pdb
usr388,i5105,1384457084,1384457084,4,1873,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00369091_dock017.pdb


It is also possible to run multi-line (or multiple) SQL queries by using the `%%sql` syntax instead.  Note that in this case the SQL instructions *must not* be on the same line as the `%%sql` magic marker; they start on the line after it.

## Selection


The SELECT statement is used to compute expressions in SQL. The result of evaluating a SELECT statement is again a relation (table), so queries can be composed into larger expressions.
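As a rough illustration of composing queries, the sketch below runs one SELECT on top of the result of another. It uses Python's built-in `sqlite3` module with an in-memory database instead of the cluster's PostgreSQL, and the table `files` with its four rows is made up for the example.

```python
import sqlite3

# In-memory SQLite database; the table and its rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table files (usr text, size integer)")
conn.executemany(
    "insert into files values (?, ?)",
    [("usr1", 100), ("usr1", 300), ("usr2", 50), ("usr2", 5000)],
)

# The inner select yields a table (named "big" here);
# the outer select queries that table like any other.
rows = conn.execute("""
    select usr, size
    from (select usr, size from files where size >= 100) as big
    where usr = 'usr1'
    order by size
""").fetchall()

print(rows)
```

The same nested query should also work through `%%sql` against a real table, since subqueries in the `from` clause are standard SQL.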

In [58]:
# Strings: single quotes
%sql select 'something' 

1 rows affected.


?column?
something


In [59]:
%sql select 'some string' as "column name"

1 rows affected.


column name
some string


Multiple expressions can be evaluated by separating them with a comma. They define different columns in the result table.

In [60]:
%sql select 'foo', 'bar' 

1 rows affected.


?column?,?column?_1
foo,bar


In [61]:
# select from a table in the database
%sql select * from lustre_sample ;

`SELECT ... FROM ... WHERE` returns a relation of all rows in a table that satisfy a certain predicate.

* `*` is a shorthand for "all column names".
* `limit INT` shows only the (first) INT rows.
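For readers without access to the cluster, the following sketch reproduces the `where` / `*` / `limit` behaviour on a tiny in-memory SQLite table. The table `files` and its contents are invented for the example. An `order by` is added because, without one, it is not specified which rows `limit` returns.

```python
import sqlite3

# Hypothetical sample data standing in for lustre_sample.
conn = sqlite3.connect(":memory:")
conn.execute("create table files (usr text, size integer, path text)")
conn.executemany("insert into files values (?, ?, ?)", [
    ("usr1", 10, "/scratch/a"),
    ("usr2", 2000, "/scratch/b"),
    ("usr1", 3000, "/scratch/c"),
    ("usr1", 4000, "/scratch/d"),
])

# "*" selects every column; "where" filters rows; "limit 2" keeps at most 2 rows.
cursor = conn.execute(
    "select * from files where usr = 'usr1' order by size limit 2"
)
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```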

In [4]:
%%sql

/* comments are done this way. */
/* must be after %%sql */
-- or like this

/* limit the number of rows printed */

select * 
from lustre_sample
limit 3

3 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr388,i5105,1384794403,1384794403,4,1485,/scratch/bioc/usr388/SEED_HBF/PDB_frag/ZINC00064449_frag_0_seed_3.pdb
usr388,i5105,1384794033,1384794033,4,1540,/scratch/bioc/usr388/SEED_HBF/PDB_frag/ZINC00452805_frag_0_seed_2.pdb
usr388,i5105,1384792540,1384792540,4,1265,/scratch/bioc/usr388/SEED_HBF/PDB_frag/ZINC00134003_frag_0_seed_2.pdb


In [123]:
%%sql 
select * 
from lustre_sample 
where usr='usr388'
limit 3

3 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr388,i5105,1384515907,1384515907,4,1488,/scratch/bioc/usr388/VS_AllNow_libo_3WAT_end/d353/LIBO00362035_dock002.pdb
usr388,i5105,1384515910,1384515910,4,2313,/scratch/bioc/usr388/VS_AllNow_libo_3WAT_end/d353/LIBO00362130_dock006.pdb
usr388,i5105,1384515910,1384515910,4,1488,/scratch/bioc/usr388/VS_AllNow_libo_3WAT_end/d353/LIBO00362156_dock007.pdb


In [125]:
%%sql 
select * 
from lustre_sample 
where size>=100000000000
limit 3

3 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr345,i5535,1386878807,1386905916,111291280,113962044756,/scratch/aim/usr345/bams/cleaned/recal/PA_A948_recal_cleaned.bam
usr345,i5535,1386875748,1386908639,134411624,137637131306,/scratch/aim/usr345/bams/cleaned/recal/PA_A950_recal_cleaned.bam
usr345,i5535,1393549454,1392933763,155772348,159510486088,/scratch/aim/usr345/bams/recal/PA_A950_recal.bam


In [126]:
%%sql
select *
from lustre_sample
where length(path) = 102
limit 3

3 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr345,i5535,1391658149,1391658149,4,773,/scratch/aim/usr345/bwa/output_phase2/PA_A964/C00W1ABXX_F_6_sorted_marked_realigned_fastqc/summary.txt
usr345,i5535,1391651630,1391651630,4,773,/scratch/aim/usr345/bwa/output_phase2/PP_A942/81MD6ABXX_E_4_sorted_marked_realigned_fastqc/summary.txt
usr345,i5535,1391666264,1391666264,4,773,/scratch/aim/usr345/bwa/output_phase2/PP_A942/C002LABXX_F_3_sorted_marked_realigned_fastqc/summary.txt


In [116]:
%%sql

/* select only certain columns */

select usr,size
from lustre_sample
limit 3

3 rows affected.


usr,size
usr264,10942224
usr264,6827216
usr264,13142380


In [128]:
%%sql

/*select only unique rows*/

select distinct usr
from lustre_sample
limit 3


3 rows affected.


usr
usr25
usr324
usr234


In [127]:
%%sql

/* select only unique (usr, size) combinations */

select distinct usr,size
from lustre_sample
limit 3


3 rows affected.


usr,size
us293,258
us293,260
us293,261


## Ordering

In [134]:
%%sql
select distinct usr
from lustre_sample
order by usr asc
/* order by column usr, ascending */
limit 3

3 rows affected.


usr
us293
us319
us320


In [136]:
%%sql
select distinct usr
from lustre_sample
order by usr desc
/* order by column usr, descending */
limit 3

3 rows affected.


usr
usr75
usr394
usr390


## Aggregate functions

By default, aggregation is over all selected rows. 

A GROUP BY clause can be used to specify how rows should be grouped; expressions involving aggregate functions are then computed once per group.

An additional HAVING clause applies a predicate to further select groups based on some expression.
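The difference between the three cases (no grouping, GROUP BY, GROUP BY with HAVING) is easiest to check on data small enough to verify by hand. The sketch below uses Python's `sqlite3` module and a made-up table `files`; it is not the cluster database.

```python
import sqlite3

# Hypothetical sample data: usr1 has 2 files, usr2 has 1, usr3 has 3.
conn = sqlite3.connect(":memory:")
conn.execute("create table files (usr text, size integer)")
conn.executemany("insert into files values (?, ?)", [
    ("usr1", 100), ("usr1", 300),
    ("usr2", 50),
    ("usr3", 10), ("usr3", 20), ("usr3", 30),
])

# No GROUP BY: one aggregate over all rows.
overall = conn.execute("select avg(size) from files").fetchone()[0]

# GROUP BY: one aggregate per user.
per_user = conn.execute(
    "select usr, avg(size) from files group by usr order by usr"
).fetchall()

# HAVING: keep only groups with more than one file.
busy = conn.execute(
    "select usr, avg(size) from files "
    "group by usr having count(*) > 1 order by usr"
).fetchall()

print(overall)
print(per_user)
print(busy)
```

Note that `where` filters individual rows before grouping, whereas `having` filters whole groups after the aggregates have been computed.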

In [138]:
# average
%sql select avg(size) from lustre_sample

1 rows affected.


avg
5170994.219224081


In [139]:
# max
%sql select max(size) from lustre_sample

1 rows affected.


max
346124300004


In [140]:
# min
%sql select min(size) from lustre_sample

1 rows affected.


min
0


In [141]:
# count
%sql select count(size) from lustre_sample

1 rows affected.


count
1519053


In [144]:
%%sql 

/* Grouping rows together first */

select usr, avg(size) 
from lustre_sample 
group by usr 
limit 3

3 rows affected.


usr,avg
usr25,90882260.01653272
usr324,1942829.153247914
usr234,1336169.992317148


In [145]:
%%sql 

/* compute average file size per user, but only consider users that have more than 10000 files on the system */

select usr, avg(size) 
from lustre_sample 
group by usr 
having count(path)>10000
limit 3

3 rows affected.


usr,avg
usr324,1942829.153247914
usr246,3621718.400295508
usr264,21193713.232901487


## Creating, deleting, altering, updating, joining tables


In [32]:
%%sql

--------------------
-- creating tables
--------------------

create table myMovieTable(              -- also possible: create temporary table (deleted after session)
        title         varchar(256),     -- string of up to 256 chars (not padded; char(n) pads with spaces)
        length        bigint,           -- long integer; also integer, smallint
        rating        float ,           -- floating point number; also real, double precision
        release_date  date              -- date; also time, timestamp
        )

Done.


[]

In [33]:
%%sql

-------------------------------
-- inserting stuff into tables
-------------------------------

insert into myMovieTable(title,length,rating,release_date)
VALUES ('Star Wars', 124, 8.7, DATE '1977-05-25');

-- print table out

select * from myMovieTable limit 5

1 rows affected.
1 rows affected.


title,length,rating,release_date
Star Wars,124,8.7,1977-05-25


In [34]:
%%sql

-------------------------------------------------------------------
-- insert stuff into table from some other table with conditions
-------------------------------------------------------------------
-- Here: create new table, copy file paths from lustre_sample where file size > 10000

create temporary table someTable(
            path varchar(256),
            size bigint);

insert into someTable (path, size)
select path,size from lustre_sample
where size>10000;

select * from someTable
limit 5

Done.
547554 rows affected.
5 rows affected.


path,size
/scratch/bioc/usr388/test_1110/d2/outputs/polar_rec_reduc_angle.mol2,323702
/scratch/bioc/usr388/test_1110/d2/outputs/length_hb.gen,887452
/scratch/bioc/usr388/test_1110/d3/outputs/receptor_uhbd.pdb,100039
/scratch/bioc/usr388/test_1110/seed_allres_out/ZINC00027497_frag_0_match.mol2.out,192854
/scratch/bioc/usr388/test_1110/seed_allres_out/ZINC00032209_frag_1_match.mol2.out,192700


In [36]:
#################
# Delete tables
#################
%sql drop table myMovieTable
%sql drop table someTable

Done.
Done.


[]

In [37]:
%%sql

----------------------------------------------------------------------------
-- copy only 1 column from some other table into a table with existing rows
----------------------------------------------------------------------------

-- first, create some table
create temporary table someTable(
            path varchar(256),
            size bigint);


-- give it some values
insert into someTable (path, size)
select path,size from lustre_sample
where size>1000000;


-- add column
alter table someTable add atime bigint;


-- populate additional column
update someTable
set atime=orig.atime from lustre_sample as orig
where someTable.path=orig.path;




Done.
225610 rows affected.
Done.
225610 rows affected.


[]
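The create / insert-select / alter / update sequence above can be tried out on a small scale with the sketch below. It runs on an in-memory SQLite database, and the tables `source` and `subset` are hypothetical. Instead of PostgreSQL's `update ... from`, it fills the new column with a correlated subquery, a swapped-in variant that is more portable across database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical stand-in for lustre_sample
    create table source (path text, size integer, atime integer);
    insert into source values ('/a', 10, 111), ('/b', 2000, 222), ('/c', 3000, 333);

    -- create a table and copy the rows that satisfy a condition
    create table subset (path text, size integer);
    insert into subset (path, size)
        select path, size from source where size > 1000;

    -- add a column (NULL in every existing row)
    alter table subset add column atime integer;

    -- populate it: for each row, look up the matching path in the source table
    update subset
    set atime = (select s.atime from source as s where s.path = subset.path);
""")

rows = conn.execute(
    "select path, size, atime from subset order by path"
).fetchall()
print(rows)
```

This variant assumes that each path occurs only once in the source table; with duplicate paths the subquery would match several rows, and database engines differ in how they handle that.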

## Miscellaneous Notes

* Each table in a SQL database is given a name (at creation time).