## Pipeline to match bacteria and phages via CRISPR spacers

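The goal is to infer which phages infect which bacteria by matching CRISPR spacers found in metagenomic reads against a database of phage genomes. Spacers are short sequences that bacteria acquire from invading phages, so a spacer that matches a phage genome is evidence of a past encounter between that phage and the bacterium carrying the spacer.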

### Step 1: Install the necessary software

We'll use [Crass](http://ctskennerton.github.io/crass/) to find CRISPR spacers in metagenomic data. The [Crass manual](http://ctskennerton.github.io/crass/assets/manual.pdf) has instructions for installation, but here's what worked on Ubuntu: 

* **Install Crass dependencies** 

```
sudo apt-get install libxerces-c3-dev
```

```
sudo add-apt-repository ppa:dns/gnu -y
sudo apt-get update -q
sudo apt-get install --only-upgrade autoconf
```

```
sudo apt install libtool-bin
```

```
wget http://www.zlib.net/zlib-1.2.11.tar.gz
tar -xvzf zlib-1.2.11.tar.gz 
cd zlib-1.2.11/
./configure --prefix=/usr/local/zlib
make
sudo make install
```


* **Install Crass** (download the tarball from [here](http://ctskennerton.github.io/crass/))

```
tar -xf crass-0.3.12.tar.gz 
cd crass-0.3.12/
./autogen.sh 
./configure 
make 
sudo make install
```

* **Install BLAST locally**

See the documentation [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to install BLAST to run on a local machine.

### Step 2: Download some microbiome data in fasta or fastq format

We'll use a dataset from the Human Microbiome Project as an example. The file below is a metagenomic sample from subgingival plaque.

In [None]:
# Download and extract a dataset
!wget ftp://public-ftp.ihmpdcc.org/Illumina/subgingival_plaque/SRS014107.tar.bz2
!tar -xf SRS014107.tar.bz2

### Step 3: Download reference genomes and make BLAST databases

We use NCBI phage, bacteria, and archaea genomes. The accession lists (`.ids` files) are available from [this FTP site](ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/IDS/). Extract the accessions from them with `collect_accessions.py`, then download the genome sequences with `acc2gb.py`.

In [13]:
# Download accession numbers for phages, bacteria, and archaea

!python ../parserscripts/collect_accessions.py Phages.ids > phage_accessions.txt
!python ../parserscripts/collect_accessions.py Bacteria.ids > bacteria_accessions.txt
!python ../parserscripts/collect_accessions.py Archaea.ids > archaea_accessions.txt

In [None]:
# Download genome sequences - this takes a long time for bacteria (there are ~6700 sequences)

!cat phage_accessions.txt | python parserscripts/acc2gb.py your@email.com nuccore fasta > phagegenomes.dat
!cat bacteria_accessions.txt | python parserscripts/acc2gb.py your@email.com nuccore fasta > bacteriagenomes.dat
!cat archaea_accessions.txt | python parserscripts/acc2gb.py your@email.com nuccore fasta > archaeagenomes.dat

In [None]:
# Create BLAST databases: one for bacteria and archaea, one for phages

!makeblastdb -in "phagegenomes.dat" -dbtype nucl -title phagedatabase_May2018 -out phagedb_May2018
!cat bacteriagenomes.dat archaeagenomes.dat | makeblastdb -in - -dbtype nucl -title bacdatabase_May2018 -out bacdb_May2018

The output from creating a BLAST database should look something like this:

```
Building a new DB, current time: 05/11/2018 12:37:11
New DB name:   /Documents/GitHub/phageParser/bacdb_May2018
New DB title:  bacdatabase_May2018
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6990 sequences in 103.433 seconds.
```

### Step 4: Run Crass

Crass accepts fasta or fastq input. The simplest invocation is `crass input_file -o output_directory`.

In [None]:
# Run Crass on one of the downloaded files
!crass SRS014107.denovo_duplicates_marked.trimmed.1.fastq -o SRS014107

### Step 5: Parse Crass output to get spacer sequences

The Crass output we're interested in is the file `crass.crispr` in the output directory. It's an XML file, so we can parse it with Python's `xml.etree.ElementTree`.

Setup:
* create a folder called `spacers` in the Crass output folder
* create a folder called `source_reads` in the Crass output folder
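
The two folders can also be created from the notebook. This is a minimal sketch, assuming the caller passes in the Crass output directory; the helper name `make_output_dirs` is hypothetical.

```python
import os

def make_output_dirs(crass_output_dir):
    """Create the spacers/ and source_reads/ folders if they do not exist yet."""
    paths = []
    for name in ("spacers", "source_reads"):
        # expanduser lets the caller pass a path that starts with '~'
        path = os.path.join(os.path.expanduser(crass_output_dir), name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

For example, `make_output_dirs("SRS014107")` would create both folders inside the output directory used in Step 4.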

In [40]:
import xml.etree.ElementTree as ET

In [41]:
import os

# expand '~' explicitly: ET.parse does not do it
tree = ET.parse(os.path.expanduser('~/crass-0.3.12/SRS014107/crass.crispr'))
root = tree.getroot()

In [121]:
# map each repeat to a dictionary of {read header: read sequence};
# the sequences are filled in later from the original fastq file
read_dict = {}

# map each read header (accession) back to its repeat
read_dict_accessions = {}
    
for child in root: # each top-level child is a CRISPR array
    repeat = child.attrib['drseq'] # the consensus repeat sequence identified by Crass
    spacers = child[0][2] # the spacers associated with that repeat
    source_reads = child[0][0] # the source reads that the spacers and repeats come from
    
    read_dict[repeat] = {}
    
    # create a file with the spacers
    
    with open("spacers/" + repeat + ".fasta", 'w') as spacer_file:
        for spacer in spacers:
            spacer_file.write('>' + spacer.attrib['spid'] + '\n')
            spacer_file.write(spacer.attrib['seq'] + '\n')
    
    for read in source_reads:
        header = read.attrib['accession']
        read_dict[repeat][header] = []
        read_dict_accessions[header] = repeat
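
The positional indexing above (`child[0][2]`, `child[0][0]`) depends on the order of elements in the file. As a sketch of looking elements up by tag name instead, the example below parses a tiny hand-written document. The tag names (`group`, `data`, `sources`, `source`, `spacers`, `spacer`) and the sample values are assumptions made for illustration; check them against your own `crass.crispr` before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; tag names and values are assumed for illustration only
SAMPLE = """
<crispr>
  <group drseq="GTTTCCGTCC" gid="G1">
    <data>
      <sources>
        <source accession="read_1" soid="SO1"/>
        <source accession="read_2" soid="SO2"/>
      </sources>
      <drs/>
      <spacers>
        <spacer seq="ACGTACGT" spid="SP1"/>
        <spacer seq="TTGACCAA" spid="SP2"/>
      </spacers>
    </data>
  </group>
</crispr>
"""

def parse_groups(xml_text):
    """Return {repeat: {"spacers": {spid: seq}, "reads": [accession, ...]}}."""
    root = ET.fromstring(xml_text.strip())
    groups = {}
    for group in root.findall("group"):
        repeat = group.attrib["drseq"]
        # explicit paths avoid picking up same-named elements elsewhere in the group
        spacers = {s.attrib["spid"]: s.attrib["seq"]
                   for s in group.findall("data/spacers/spacer")}
        reads = [s.attrib["accession"]
                 for s in group.findall("data/sources/source")]
        groups[repeat] = {"spacers": spacers, "reads": reads}
    return groups

groups = parse_groups(SAMPLE)
```

Looking up by path keeps the parsing code readable and fails visibly (empty results) if the file layout differs from what was assumed.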
            

In [60]:
# Extract the sequences associated with each repeat for BLASTing
from Bio import SeqIO

In [123]:
import os

# expand '~' explicitly: SeqIO.parse does not do it
fastq_path = os.path.expanduser("~/crass-0.3.12/SRS014107.denovo_duplicates_marked.trimmed.1.fastq")
for seq_record in SeqIO.parse(fastq_path, "fastq"):
    header = seq_record.id
    # keep only the reads that Crass reported as sources of a CRISPR
    if header in read_dict_accessions:
        read_dict[read_dict_accessions[header]][header] = seq_record.seq

In [144]:
# save sequences to a file, one for each repeat

for repeat in read_dict.keys():
    with open("source_reads/" + repeat +'_reads' + ".fasta", 'w') as repeat_file:
        for header, sequence in read_dict[repeat].items():
            repeat_file.write('>' + header + '\n')
            repeat_file.write(str(sequence) + '\n')

### Step 6: BLAST spacers against phage database, BLAST source reads against bacterial database
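
One way to run this step from Python is sketched below: it builds `blastn` commands for a spacer file and a source-read file against the databases created in Step 3. The helper name `blastn_command`, the example repeat, the output file names, and the choice of tabular output (`-outfmt 6`) are assumptions, not part of the original pipeline. Spacers are short, so `-task blastn-short` is a plausible setting for them, but treat that as an assumption to verify on your data.

```python
import subprocess

def blastn_command(query_fasta, db, out_path, short_queries=False):
    """Build the argument list for a blastn search with tabular output."""
    cmd = ["blastn", "-query", query_fasta, "-db", db,
           "-outfmt", "6", "-out", out_path]
    if short_queries:
        # blastn-short is the blastn task meant for queries under 50 bases
        cmd += ["-task", "blastn-short"]
    return cmd

repeat = "GTTTCCGTCC"  # hypothetical repeat sequence from Step 5

# Spacers go against the phage database, source reads against bacteria/archaea
spacer_cmd = blastn_command("spacers/" + repeat + ".fasta", "phagedb_May2018",
                            "spacers_vs_phage.tsv", short_queries=True)
reads_cmd = blastn_command("source_reads/" + repeat + "_reads.fasta",
                           "bacdb_May2018", "reads_vs_bac.tsv")

# To actually run a search (requires a local BLAST installation):
# subprocess.run(spacer_cmd, check=True)
```

Building the command as a list keeps repeat-derived file names from being interpreted by a shell.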

### Step 7: Parse BLAST output
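
A minimal parsing sketch, assuming the searches were run with BLAST's tabular output (`-outfmt 6`), whose 12 default columns are listed in the code. The function name `parse_blast_tabular`, the identity and e-value cutoffs, and the sample lines are illustrative assumptions.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_blast_tabular(handle, min_identity=90.0, max_evalue=1e-3):
    """Yield one dict per hit that passes the (assumed) cutoffs."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        if hit["pident"] >= min_identity and hit["evalue"] <= max_evalue:
            yield hit

# Made-up example lines, not real results
sample = io.StringIO(
    "SP1\tphage_A\t100.000\t30\t0\t0\t1\t30\t1200\t1229\t2e-10\t56.5\n"
    "SP2\tphage_B\t85.000\t20\t3\t0\t1\t20\t500\t519\t0.5\t20.1\n"
)
hits = list(parse_blast_tabular(sample))
```

With these cutoffs only the first sample line survives; suitable thresholds for real spacer hits need to be chosen for the data at hand.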

### Step 8: Create interaction matrix
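
One possible construction, sketched under assumptions: each repeat links the phages hit by its spacers to the bacteria hit by its source reads, so a bacterium and a phage that share a repeat count as one inferred interaction. The function name `interaction_matrix` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def interaction_matrix(repeat_phage_pairs, repeat_bacteria_pairs):
    """Return (bacteria, phages, matrix), where matrix[i][j] counts the
    repeats that link bacteria[i] to phages[j]."""
    phages_by_repeat = defaultdict(set)
    bacteria_by_repeat = defaultdict(set)
    for repeat, phage in repeat_phage_pairs:
        phages_by_repeat[repeat].add(phage)
    for repeat, bacterium in repeat_bacteria_pairs:
        bacteria_by_repeat[repeat].add(bacterium)

    # Sorted labels give the matrix a stable row and column order
    bacteria = sorted({b for bs in bacteria_by_repeat.values() for b in bs})
    phages = sorted({p for ps in phages_by_repeat.values() for p in ps})
    row = {b: i for i, b in enumerate(bacteria)}
    col = {p: j for j, p in enumerate(phages)}

    matrix = [[0] * len(phages) for _ in bacteria]
    for repeat, bs in bacteria_by_repeat.items():
        for b in bs:
            for p in phages_by_repeat.get(repeat, ()):
                matrix[row[b]][col[p]] += 1
    return bacteria, phages, matrix

# Made-up pairs for illustration
bacteria, phages, matrix = interaction_matrix(
    [("R1", "phage_A"), ("R1", "phage_B"), ("R2", "phage_B")],
    [("R1", "bac_X"), ("R2", "bac_X"), ("R2", "bac_Y")],
)
```

Counting shared repeats is only one choice of weight; a binary matrix or a count of individual spacer hits would be equally easy to build from the same pairs.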