The ESC403 cluster comes with [Catherine Devlin's `%sql`-magic for IPython][1]; this allows you to run SQL queries from the IPython notebook, and intermix them with Python code.

[1]: https://github.com/catherinedevlin/ipython-sql

Before we can use the `%sql` syntax, two steps must be taken:

* Load the IPython-SQL bridge code

In [1]:
%load_ext sql


* Connect to an actual database; this must be the first `%sql` statement. The odd-looking `rmurri@/lustre` syntax is correct and means "connect to the PostgreSQL DB named `lustre` running on *this* host as user `rmurri`"; please replace `rmurri` with your local user name:

In [2]:
%sql postgresql://mivkov@/lustre


u'Connected: mivkov@lustre'
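
To see why the connection string reads the way it does, it can help to take the URL apart. The sketch below uses Python's standard `urllib.parse` purely as an illustration of the URL anatomy; `ipython-sql` passes the string on to SQLAlchemy, which does its own parsing, so treat this as an approximation rather than what happens internally.

```python
from urllib.parse import urlsplit

# Illustration only: this is generic URL parsing, not SQLAlchemy's own parser.
url = urlsplit("postgresql://mivkov@/lustre")

scheme = url.scheme              # which kind of database to talk to
user = url.username              # the part before "@"
host = url.hostname              # nothing between "@" and "/", so no host is given
database = url.path.lstrip("/")  # the part after the last "/"

print(scheme, user, host, database)
```

The empty host part is what makes the connection go to the database server on the local machine.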

Now you can run 1-line SQL queries by prefixing them with `%sql`:

In [3]:
%sql select * from lustre limit 5;

5 rows affected.


usr,grp,atime,mtime,blksize,size,path
usr388,i5105,1384452271,1384452271,4,2203,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00170060_dock014.pdb
usr388,i5105,1384452311,1384452311,4,1708,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00171011_dock014.pdb
usr388,i5105,1384452490,1384452490,4,1213,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00174084_dock006.pdb
usr388,i5105,1384452674,1384452674,4,1488,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00179374_dock004.pdb
usr388,i5105,1384453267,1384453267,4,1708,/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00205410_dock007.pdb


It is also possible to run multi-line (or multiple) SQL queries by using the `%%sql` syntax instead.  Note that in this case the SQL instructions *must not* be on the same line as the `%%sql` magic marker:

In [None]:
%%sql
select count(*) from lustre;
select count(distinct usr) from lustre;
select count(distinct grp) from lustre;
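
The position of `distinct` matters in counting queries like these. The following is a minimal sketch using Python's built-in `sqlite3` module and a made-up three-row table called `demo` (both are stand-ins chosen so the example runs anywhere; the cluster itself uses PostgreSQL, where the two forms behave the same way as far as this comparison goes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table demo (usr text)")  # hypothetical toy table
conn.executemany("insert into demo values (?)", [("a",), ("a",), ("b",)])

# DISTINCT is applied to the single-row result of count(), so every row is counted
all_rows = conn.execute("select distinct count(usr) from demo").fetchone()[0]

# DISTINCT inside the aggregate counts each value only once
unique_users = conn.execute("select count(distinct usr) from demo").fetchone()[0]

print(all_rows, unique_users)
```

Only the second form answers the question "how many different users are there?".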

The `%%time` magic prints the time taken to evaluate a cell (which comes in handy when doing performance comparisons):

In [None]:
%%time

import time
time.sleep(5)

----

**Note:** to keep running times low, we will be using table `lustre_sample` throughout, which contains a sample of 5% of the rows of the original `lustre` table.

## 1. Is it possible to convert fields atime and mtime to PostgreSQL's TIMESTAMP type?

Yes, it *is* possible to alter a SQL table definition after the table has been created.  Look at the documentation for the [ALTER TABLE](https://www.tutorialspoint.com/sql/sql-alter-command.htm) statement.

We shall break this down into steps: (1) create a new table, (2) populate it, then (3) alter the definition and (4) fill the new column with values.

In [14]:
# delete any leftover table before creating a new one with the same name
%sql drop table if exists mytable

Done.


[]

In [15]:
%%sql

/*  create table */ 

create temporary table mytable (
                      path varchar(256),
                      size BIGINT, 
                      mtime BIGINT );

Done.


[]

In [16]:
%%sql

/* copy values into it */

insert into mytable(path,size,mtime)
select path,size,mtime from lustre_sample;

1519053 rows affected.


[]

In [17]:
# (3) alter table definition: add a new column for "access time" using the TIMESTAMP type
%sql alter table mytable add atime timestamp;
#%sql alter table mytable add atime bigint;

Done.


[]

In [19]:
%%sql

-- populate additional column

update mytable
set atime=to_timestamp(orig.atime) from lustre_sample as orig
where mytable.path=orig.path;



1519053 rows affected.


[]

Show some data from the table we created:

In [20]:
%%sql
select * from mytable
limit 5

5 rows affected.


path,size,mtime,atime
/scratch/econ/usr357/sp/job600_7.sh,496,1351160127,2012-10-25 10:15:27
/scratch/pci/usr394/lib/cp2k/POTENTIAL,73903,1385110050,2013-11-22 08:47:30
/scratch/econ/fsl/fsl/src/fdt/facalc,5580,1349084113,2012-11-13 19:50:45
/scratch/econ/fsl/fsl/bin/std2imgcoord,2630446,1349088426,2012-11-13 19:50:21
/scratch/iftp/usr264/sphydro/output_00044/part_00044.out00046,1272,1343118217,2012-07-24 08:23:37
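
As a sanity check on what the epoch-to-timestamp conversion does, the same arithmetic can be reproduced in plain Python. This is only a sketch: it takes the `mtime` value of the first row shown above and assumes the result should be expressed in UTC (the time zone used by the database session is an assumption here).

```python
from datetime import datetime, timezone

mtime = 1351160127  # epoch seconds, copied from the first row of the output above

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
converted = datetime.fromtimestamp(mtime, tz=timezone.utc)
formatted = converted.strftime("%Y-%m-%d %H:%M:%S")

print(formatted)  # 2012-10-25 10:15:27
```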


## 2. Can you count the number of files in a given directory?

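Yes, using SQL's `like` string-matching operator, in which the `%`
character matches any sequence of characters (much like `*` in file
names). Note that `%` also matches `/`, so the query below counts the
files in the whole tree under the directory, not only its direct children: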

In [23]:
%%sql
select count(path) from lustre_sample
where path like '/scratch/econ/%'; -- <-- insert directory name here

1 rows affected.


count
32596
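
Because `%` in a `like` pattern also matches `/`, a count like this one covers everything below the directory. The sketch below shows one way to tell the two cases apart, using Python's built-in `sqlite3` and a small made-up `files` table (both are assumptions for the sake of a runnable example; the same `like` / `not like` combination should carry over to PostgreSQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table files (path text)")  # hypothetical toy table
conn.executemany("insert into files values (?)", [
    ("/scratch/econ/a.txt",),
    ("/scratch/econ/sub/b.txt",),
    ("/scratch/econ/sub/deep/c.txt",),
    ("/scratch/bioc/d.txt",),
])

# Everything below /scratch/econ, at any depth
whole_tree = conn.execute(
    "select count(path) from files where path like '/scratch/econ/%'"
).fetchone()[0]

# Only direct children: exclude paths with a further "/" after the prefix
direct_only = conn.execute(
    "select count(path) from files "
    "where path like '/scratch/econ/%' and path not like '/scratch/econ/%/%'"
).fetchone()[0]

print(whole_tree, direct_only)
```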


## 3. Can you find the directory that holds the largest number of files? 

In [100]:
%%sql
create temporary table testing(path varchar(256));


(psycopg2.ProgrammingError) relation "testing" already exists
 [SQL: 'create temporary table testing(path varchar(256));']


In [26]:
%%sql
insert into testing (path)
select path from lustre_sample limit 1;
select * from testing limit 2;

1 rows affected.
1 rows affected.


path
/scratch/bioc/usr388/Vina_5wat/out/ZINC72133399_out.pdbqt


In [101]:
%%sql
select regexp_matches(testing.path,'(\/.*?/)[^/]*?\.\S*') from testing


1 rows affected.


regexp_matches
[u'/scratch/']


In [110]:
%%sql
select regexp_split_to_table(testing.path,'/*/') from testing
--select regexp_matches(testing.path,'/*/') from testing;


7 rows affected.


regexp_split_to_table
scratch
bioc
usr388
Vina_5wat
out
ZINC72133399_out.pdbqt


In [111]:
%%sql
select count(path) from testing
where path like regexp_maches(testing.path,'/*/')

(psycopg2.ProgrammingError) function regexp_maches(character varying, unknown) does not exist
LINE 2: where path like regexp_maches(testing.path,'/*/')
                        ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
 [SQL: "select count(path) from testing\nwhere path like regexp_maches(testing.path,'/*/')"]
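
One possible way forward is to strip the last path component instead of splitting the path into pieces: removing everything from the final `/` onwards leaves the containing directory, which can then be counted. The sketch below does this in plain Python on three sample paths (two taken from the outputs above, one made up and marked as such). The SQL in the comment is an untested suggestion that assumes every row of `lustre_sample` describes a file.

```python
import re
from collections import Counter

paths = [
    "/scratch/bioc/usr388/Vina_5wat/out/ZINC72133399_out.pdbqt",
    "/scratch/bioc/usr388/Vina_5wat/out/other_out.pdbqt",  # made-up sibling file
    "/scratch/econ/usr357/sp/job600_7.sh",
]

def parent_dir(path):
    """Drop the final '/name' component, leaving the containing directory."""
    return re.sub(r"/[^/]*$", "", path)

counts = Counter(parent_dir(p) for p in paths)
busiest, n_files = counts.most_common(1)[0]
print(busiest, n_files)

# Untested PostgreSQL sketch of the same idea:
#   select regexp_replace(path, '/[^/]*$', '') as dir, count(*) as n
#   from lustre_sample group by dir order by n desc limit 1;
```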


## 4. Can you find the directory tree that holds the largest number of files?

Yes or no? *(and why?)*

No, I can't. I cannot figure out how to separate directories from file names and use them as search/count criteria.
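
For what it is worth, one conceivable approach is to cut every path down to its first few directory components and count files per prefix. The question is only meaningful for a fixed depth, since the top-level directory trivially contains everything. The sketch below is plain Python on three paths taken from the outputs above; the helper `tree_prefix` and the choice of depth 2 are illustrative assumptions, not part of the exercise.

```python
from collections import Counter
from pathlib import PurePosixPath

paths = [
    "/scratch/bioc/usr388/Vina_5wat/out/ZINC72133399_out.pdbqt",
    "/scratch/bioc/usr388/VS_AllNow_libo_3WAT/ledock_pose/LIBO00170060_dock014.pdb",
    "/scratch/econ/usr357/sp/job600_7.sh",
]

def tree_prefix(path, depth):
    """Return the first `depth` directory components of `path`, or None if too shallow."""
    parts = PurePosixPath(path).parts  # ('/', 'scratch', 'bioc', ..., 'file')
    dirs = parts[1:-1]                 # drop the root and the file name
    if len(dirs) < depth:
        return None
    return "/" + "/".join(dirs[:depth])

tree_counts = Counter()
for p in paths:
    prefix = tree_prefix(p, 2)
    if prefix is not None:
        tree_counts[prefix] += 1

busiest_tree, n_in_tree = tree_counts.most_common(1)[0]
print(busiest_tree, n_in_tree)
```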
