__This notebook describes the setup of CLdb with a set of E. coli genomes.__

**Notes**

* It is assumed that you have CLdb in your PATH

In [149]:
# path to raw files
## CHANGE THIS!
rawFileDir = "/home/nyoungb2/perl/projects/CLdb/data/Ecoli/"
# directory where the CLdb database will be created
## CHANGE THIS!
workDir = "/home/nyoungb2/t/CLdb_Ecoli/"

In [150]:
# viewing file links
import os
import zipfile
import csv
from IPython.display import FileLinks
# pretty viewing of tables
## get from: http://epmoyer.github.io/ipy_table/
from ipy_table import make_table, apply_theme

__The required files are in '../ecoli_raw/':__

* a loci table
* array files
* genome nucleotide sequences
 * GenBank (preferred) or FASTA format

__Let's look at the provided files for this example:__

In [151]:
FileLinks(rawFileDir)

# Checking that CLdb is installed in PATH

In [152]:
!CLdb -h

Usage:
    CLdb [options] -- subcommand [subcommand_options]

  Options:
    --list
        List all subcommands.

    --perldoc
        Get perldoc of subcommand.

    --sql
        SQL passed to subcommand for limiting queries. (eg., --sql
        'loci.subtype == "I-B" or loci.subtype == "I-C"'). NOTE: The sql
        statement must go in SINGLE quotes!

    --config
        Config file (if not ~/.CLdb)

    --config-params
        List params set by config

    -v Verbose output
    -h This help message

  For more information:
    perldoc CLdb
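
The same check can be made from Python before any subcommand is run. The following is a minimal sketch: the helper name `find_in_path` is hypothetical (it is not part of CLdb), and it only tests whether an executable file named `CLdb` exists in one of the PATH directories.

```python
import os


def find_in_path(name, path=None):
    """Return the full path of executable `name` found in PATH, else None."""
    if path is None:
        path = os.environ.get('PATH', '')
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # must be a regular file that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


exe = find_in_path('CLdb')
if exe is None:
    print('CLdb was not found in PATH')
else:
    print('CLdb found at: {}'.format(exe))
```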



# Setting up the CLdb directory

In [153]:
# this makes the working directory
if not os.path.isdir(workDir):
    os.makedirs(workDir)

In [None]:
# unarchiving files in the raw folder over to the newly made working folder
files = ['array.zip', 'loci.zip', 'GIs.txt.zip']
files = [os.path.join(rawFileDir, x) for x in files]
for f in files:
    if not os.path.isfile(f):
        raise IOError('Cannot find file: {}'.format(f))
    # 'zf' rather than 'zip', to avoid shadowing the built-in zip()
    with zipfile.ZipFile(f) as zf:
        zf.extractall(path=workDir)

print('unzipped raw files:')
FileLinks(workDir)

unzipped raw files:


## Downloading the genome GenBank files using the 'GIs.txt' file

* GIs.txt is just a list of GIs and taxon names.

In [None]:
# making genbank directory
genbankDir = os.path.join(workDir, 'genbank')
if not os.path.isdir(genbankDir):
    os.makedirs(genbankDir)    

# downloading genomes
!cd $genbankDir; \
    CLdb -- accession-GI2fastaGenome -format genbank -fork 5 < ../GIs.txt
    
# checking files
!cd $genbankDir; \
    ls -thlc *.gbk

Writing files to '/home/nyoungb2/t/CLdb_Ecoli/genbank'
Attempting to stream: Escherichia_coli_K-12_W3110 (accession/GI = 388476123)
Attempting to stream: Escherichia_coli_BL21_DE3 (accession/GI = 387825439)
Attempting to stream: Escherichia_coli_K-12_DH10B (accession/GI = 170079663)
Attempting to stream: Escherichia_coli_O157_H7 (accession/GI = 16445223)
Attempting to stream: Escherichia_coli_K-12_MG1655 (accession/GI = 49175990)
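
As a Python alternative to the `ls` check, the downloaded files can be listed with their sizes. This is a rough sketch: the function names are hypothetical, the `.gbk` extension is taken from the `ls` command above, and treating a zero-byte file as a failed download is an assumption.

```python
import glob
import os


def list_genbank_files(genbank_dir, pattern='*.gbk'):
    """Return (file name, size in bytes) for each matching file, sorted by name."""
    paths = sorted(glob.glob(os.path.join(genbank_dir, pattern)))
    return [(os.path.basename(p), os.path.getsize(p)) for p in paths]


def empty_files(file_sizes):
    """Return the names of files whose size is zero (possibly failed downloads)."""
    return [name for name, size in file_sizes if size == 0]


# Example usage (genbankDir is the directory created in the cell above):
# sizes = list_genbank_files(genbankDir)
# print('{} GenBank files; empty: {}'.format(len(sizes), empty_files(sizes)))
```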


# Creating/loading CLdb of E. coli CRISPR data

In [None]:
!CLdb -- makeDB -h

## Making CLdb sqlite file

In [None]:
!cd $workDir; \
    CLdb -- makeDB -r -drop
    
CLdbFile = os.path.join(workDir, 'CLdb.sqlite')
print('CLdb file location: {}'.format(CLdbFile))
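
Since the database is an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below lists whatever tables the file contains. The helper name `list_tables` is hypothetical, and no particular CLdb table names are assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at `db_path`, sorted."""
    # note: sqlite3.connect() creates an empty file if db_path does not exist
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


# Example usage (CLdbFile is defined in the cell above):
# print(list_tables(CLdbFile))
```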

## Setting up CLdb config

* This way, the CLdb script will know where the CLdb database is located.
  * Otherwise, you would have to keep telling the CLdb script where the database is.

In [None]:
s = 'DATABASE = ' + CLdbFile
configFile = os.path.join(os.path.expanduser('~'), '.CLdb')

# NOTE: this overwrites any existing ~/.CLdb
# text mode ('w'), since a str is being written
with open(configFile, 'w') as outFH:
    outFH.write(s)

print('Config file written: {}'.format(configFile))
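
To confirm what was written, the config file can be read back. The help output above lists a `--config-params` option for showing the params set by the config. The sketch below is a plain-Python alternative that assumes only the `KEY = value` line format written in the cell above; the function name `read_cldb_config` is hypothetical.

```python
import os


def read_cldb_config(config_path):
    """Parse 'KEY = value' lines into a dict; lines without '=' are skipped."""
    params = {}
    with open(config_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or '=' not in line:
                continue
            key, value = line.split('=', 1)
            params[key.strip()] = value.strip()
    return params


# Example usage (configFile is defined in the cell above):
# params = read_cldb_config(configFile)
# db = params.get('DATABASE')
# print('DATABASE = {}'.format(db))
# print('file exists: {}'.format(db is not None and os.path.isfile(db)))
```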

## Loading loci

* The next step is loading the loci table.
 * This table contains the user-provided info on each CRISPR-CAS system in the genomes.
 * Let's look at the table before loading it in CLdb

### Checking out the CRISPR loci table

In [None]:
lociFile = os.path.join(workDir, 'loci', 'loci.txt')

# reading in file (text mode; each row becomes a list of fields)
with open(lociFile, 'r') as f:
    tbl = list(csv.reader(f, delimiter='\t'))

# making table
make_table(tbl)
apply_theme('basic')

__Notes on the loci table:__
* As you can see, not all of the fields have values. Some are not required (e.g., 'fasta_file').
* You will get an error if you try to load a table with missing values in required fields.
* For a list of required columns, see the documentation for `CLdb -- loadLoci -h`.
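
Empty required fields can be spotted before loading. The following is a minimal sketch, assuming the first row of the table is a header; the function name is hypothetical, and the list of required columns must come from `CLdb -- loadLoci -h`. In the example, 'subtype' and 'fasta_file' are names that appear in this notebook, while 'locus_id' and all values are made up for illustration.

```python
def find_missing_required(table, required):
    """Return (row number, column) pairs where a required field is empty.

    `table` is a list of rows whose first row is the header, as built from
    csv.reader above. Row numbers are 1-based and count the header as row 1.
    """
    header = table[0]
    missing_columns = [col for col in required if col not in header]
    if missing_columns:
        raise ValueError('Columns not in header: {}'.format(missing_columns))
    problems = []
    for row_number, row in enumerate(table[1:], start=2):
        record = dict(zip(header, row))
        for col in required:
            # a field that is absent or whitespace-only counts as missing
            if not record.get(col, '').strip():
                problems.append((row_number, col))
    return problems


# Hypothetical example: column names and values are for illustration only.
example = [
    ['locus_id', 'subtype', 'fasta_file'],
    ['1', 'I-E', ''],
    ['2', '', 'genome.fasta'],
]
print(find_missing_required(example, required=['locus_id', 'subtype']))
```

With the notebook's own data, the call would be `find_missing_required(tbl, required=[...])`, with the list filled in from the `loadLoci` documentation.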

### Loading loci info into database

In [None]:
!CLdb -- loadLoci -h

In [None]:
!CLdb -- loadLoci < $lociFile

**Notes on loading**

* A lot is going on here:
  1. Various checks on the input files
  2. Extracting the genome fasta sequence from each genbank file 
    * the genome fasta is required
  3. Loading of the loci information into the sqlite database
  
**Notes on the command**

* Why didn't I use the 'required' `-database` flag for `CLdb -- loadLoci`?
  * I didn't have to use the `-database` flag because it is provided via the .CLdb config file that was previously created.

In [None]:
# This is just a quick summary of the database 
## It should show 10 loci for the 'loci' rows
!CLdb -- summary

## Loading CRISPR arrays

* The next step is to load the CRISPR array tables.
* These are tables in 'CRISPRFinder format' that have CRISPR array info.
  * Let's take a look at one of the array files before loading them all.

In [None]:
# an example array file (obtained from CRISPRFinder)
arrayFile = os.path.join(workDir, 'array', 'Ecoli_0157_H7_a1.txt')
!head $arrayFile

__Note: the array file consists of 4 columns:__

1. spacer start
1. spacer sequence
1. direct-repeat sequence
1. direct-repeat stop

All extra columns are ignored.
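
To make the layout concrete, here is a rough sketch of reading such a file while keeping only the first four columns. It assumes tab-delimited text and uses the column order from the note above. The field names and the function name are labels chosen for this sketch, not identifiers used by CLdb, and the example rows are made up.

```python
import csv

# Column order follows the note above; the names are labels for this sketch.
ARRAY_COLUMNS = ('spacer_start', 'spacer_seq', 'dr_seq', 'dr_stop')


def read_array_rows(lines, delimiter='\t'):
    """Yield one dict per row, keeping only the first four columns."""
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < len(ARRAY_COLUMNS):
            continue  # skip blank or short lines
        yield dict(zip(ARRAY_COLUMNS, row[:len(ARRAY_COLUMNS)]))


# Made-up rows for illustration only (the 5th column is ignored).
example = [
    '100\tACGT\tGGTT\t140\textra',
    '141\tTTGA\tGGTT\t181',
]
for record in read_array_rows(example):
    print(record)
```

An open file handle can be passed in place of the list, e.g. `read_array_rows(open(arrayFile))`.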

In [None]:
# loading CRISPR array info
!CLdb -- loadArrays 

In [None]:
# This is just a quick summary of the database 
## It should show 75 spacer & 85 DR entries in the database
!CLdb -- summary

## Loading CAS genes

* Technically, all coding sequences in the region specified in the loci table (CAS_start, CAS_end) will be loaded.
* This requires 2 subcommands:
  1. The 1st gets the gene info
  2. The 2nd loads the info into CLdb

In [None]:
!CLdb -- getGenesInLoci | CLdb -- loadGenes

## Setting array sense strand

* __The strand that is transcribed needs to be defined in order to have the correct sequence for downstream analyses (e.g., blasting spacers and getting PAM regions)__
* __The sense (reading) strand is defined by (order of precedence):__
 * The leader region (if defined; in this case, no).
 * array_start, array_end in the loci table
   * The genome negative strand will be used if array_start > array_end
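
The coordinate-based rule can be written out as a small function. This is only an illustration of the rule as stated above, not CLdb's implementation: the function name is hypothetical, the coordinates are made up, and the leader-region case (which takes precedence when defined) is not modeled.

```python
def strand_from_coords(array_start, array_end):
    """Return '-' if array_start > array_end, otherwise '+'."""
    return '-' if int(array_start) > int(array_end) else '+'


# Made-up coordinates for illustration only.
print(strand_from_coords(1000, 2500))  # prints +
print(strand_from_coords(2500, 1000))  # prints -
```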

In [None]:
!CLdb -- setSenseStrand 

## Spacer and DR clustering

* __Clustering of spacer and/or DR sequences accomplishes:__
 * A method of comparing within and between CRISPRs
 * Reducing redundancy for spacer and DR blasting
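
To illustrate the redundancy point, the sketch below groups sequences that are exactly identical, so that each distinct sequence would only need to be blasted once. This is a deliberately simplified stand-in (case-insensitive exact matching); it does not reproduce how `clusterArrayElements` works. The function name and the example spacers are made up.

```python
from collections import defaultdict


def group_identical(sequences):
    """Map each distinct sequence to the list of IDs that share it.

    `sequences` is an iterable of (sequence ID, sequence string) pairs.
    Comparison is case-insensitive exact matching only.
    """
    groups = defaultdict(list)
    for seq_id, seq in sequences:
        groups[seq.upper()].append(seq_id)
    return dict(groups)


# Made-up spacers for illustration only.
spacers = [('s1', 'ACGTAC'), ('s2', 'acgtac'), ('s3', 'TTGACA')]
groups = group_identical(spacers)
print('{} spacers in {} groups'.format(len(spacers), len(groups)))
```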

In [None]:
!CLdb -- clusterArrayElements -s -r

# Database summary

In [None]:
!CLdb -- summary -name -subtype 

# Next Steps

* [arrayBlast](./arrayBlast.ipynb)
  * Blast spacers (& DRs), get protospacers, PAM regions, mismatches to the protospacer & SEED sequence