## STACKS+BOWTIE+BLAST Pipeline for Population Genomics Analysis

This is the full pipeline that I'm using to analyze my Pacific cod time series data, and I hope to reuse it on future projects as well.

Our lab uses Bowtie and BLAST to filter down our catalog of loci into a de novo reference genome. I wasn't sure that I'd have enough time to do this by the end of the quarter, but now it feels like it's worth a try. I'm going to copy and paste from my other Full Pipeline notebook, but add in the Bowtie and BLAST steps here.

#### Go to working directory

In [1]:
cd /Volumes/Time\ Machine\ Backups/Cod-Time-Series-Data/ 

[Errno 2] No such file or directory: '/Volumes/Time Machine Backups/Cod-Time-Series-Data/'
/Users/natalielowell/Git-repos/FISH546/Cod-Time-Series-Project/Notebooks


#### Adding library identifier to file names

If you are analyzing data that were run on multiple lanes, it may be useful to rename your files so that they carry the unique library identifier (e.g., \_L1 or \_L2), because barcodes will be redundant between libraries. I wrote a [script](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Scripts/add_lib_to_filename.py) that adds this identifier to your filenames.



In [None]:
!python add_lib_to_filename.py process_radtags_out/cod_lib1 _L1

In [None]:
!python add_lib_to_filename.py process_radtags_out/cod_lib2 _L2
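
For readers who don't want to open the script, here is a minimal sketch of the renaming idea. It is not the code in add_lib_to_filename.py. The function name `add_library_suffix` and the choice to insert the suffix before the first dot (so that extensions like `.fq.gz` stay intact) are my assumptions for illustration.

```python
from pathlib import Path


def add_library_suffix(directory, suffix):
    """Insert `suffix` before the extension of every file in `directory`.

    Example: sample_AAGGTT.fq.gz -> sample_AAGGTT_L1.fq.gz
    Returns the list of new paths.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and hidden files such as .DS_Store.
        if not path.is_file() or path.name.startswith("."):
            continue
        # Split at the first dot so multi-part extensions stay together.
        stem, dot, extension = path.name.partition(".")
        if stem.endswith(suffix):
            continue  # already tagged, so re-running is harmless
        new_path = path.with_name(stem + suffix + dot + extension)
        path.rename(new_path)
        renamed.append(new_path)
    return renamed
```

Calling `add_library_suffix("process_radtags_out/cod_lib1", "_L1")` would tag every file in that folder. Files that already end in the suffix are skipped.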

#### Running ``ustacks``

``ustacks`` [documentation](http://catchenlab.life.illinois.edu/stacks/comp/ustacks.php) highlights:

<br>ustacks -t file_type -f file_path [-d] [-r] [-o path] [-i id] [-m min_cov] [-M max_dist] [-p num_threads] [-R] [-H] [-h]
<br>t — input file Type. Supported types: fasta, fastq, gzfasta, or gzfastq.
<br>f — input file path.
<br>o — output path to write results.
<br>i — SQL ID to insert into the output to identify this sample.
<br>m — Minimum depth of coverage required to create a stack (default 2).
<br>M — Maximum distance (in nucleotides) allowed between stacks (default 2).
<br>N — Maximum distance allowed to align secondary reads to primary stacks (default: M + 2).
<br>R — retain unused reads.
<br>H — disable calling haplotypes from secondary reads.
<br>p — enable parallel execution with num_threads threads.
<br>h — display this help message.

<br>
<br>
Running custom python [script for ``ustacks``](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Scripts/pypipe_ustacks.py):

In [None]:
!python pypipe_ustacks.py barcodes_samplenames.txt ./process_radtags_out ./ustacks_out
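
To make the flags above concrete, here is a rough sketch of how one ``ustacks`` call per sample could be assembled in Python. It is not what pypipe_ustacks.py does internally. The sample file names, the `gzfastq` file type, and the thread count are hypothetical. Only flags from the documentation highlights above are used.

```python
import shlex


def build_ustacks_command(sample_file, out_dir, sample_id,
                          min_cov=2, max_dist=2, threads=1,
                          file_type="gzfastq"):
    """Return the ustacks argument list for one sample.

    The defaults for -m and -M mirror the documented defaults.
    """
    return [
        "ustacks",
        "-t", file_type,
        "-f", sample_file,
        "-o", out_dir,
        "-i", str(sample_id),
        "-m", str(min_cov),
        "-M", str(max_dist),
        "-p", str(threads),
    ]


# Hypothetical sample files, for illustration only.
sample_files = [
    "./process_radtags_out/sampleA_L1.fq.gz",
    "./process_radtags_out/sampleB_L2.fq.gz",
]

for sample_id, sample_file in enumerate(sample_files, start=1):
    command = build_ustacks_command(sample_file, "./ustacks_out",
                                    sample_id, threads=4)
    # Print instead of running, so the commands can be checked first.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Giving each sample its own `-i` value matters because that ID is written into the output to identify the sample.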

#### Running ``cstacks``
``cstacks`` documentation highlights:

<br>cstacks -b batch_id -s sample_file [-s sample_file_2 ...] [-o path] [-n num] [-g] [-p num_threads] [--catalog path] [-h]
<br>p — enable parallel execution with num_threads threads.
<br>b — MySQL ID of this batch.
<br>s — TSV file from which to load radtags.
<br>o — output path to write results.
<br>m — include tags in the catalog that match to more than one entry.
<br>n — number of mismatches allowed between sample tags when generating the catalog.
<br>g — base catalog matching on genomic location, not sequence identity.
<br>h — display this help message.

<br>
<br>
Running custom python [script for ``cstacks``](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Scripts/pypipe_cstacks.py):


In [None]:
!python pypipe_cstacks.py new_filenames_shell.txt ustacks_out 10 2 cstacks_out 3 5
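
Unlike ``ustacks``, a single ``cstacks`` call takes every sample at once, with `-s` repeated for each one. Here is a small sketch of how that command could be built, assuming a plain list of sample paths. The sample names, batch ID, and parameter values below are hypothetical, and this is not the logic of pypipe_cstacks.py.

```python
def build_cstacks_command(batch_id, samples, out_dir, mismatches, threads=1):
    """Return the cstacks argument list for one catalog build."""
    command = ["cstacks", "-b", str(batch_id)]
    for sample in samples:
        # -s is repeated once per sample, as in the usage line above.
        command += ["-s", sample]
    command += ["-o", out_dir, "-n", str(mismatches), "-p", str(threads)]
    return command


# Hypothetical values, for illustration only.
example = build_cstacks_command(
    batch_id=1,
    samples=["ustacks_out/sampleA_L1", "ustacks_out/sampleB_L2"],
    out_dir="cstacks_out",
    mismatches=3,
    threads=5,
)
print(" ".join(example))
```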

#### Filter with ``Bowtie``

<br>
First, make a fasta file for ``Bowtie`` from the catalog tags file. You will need to be in the directory that holds your ``cstacks`` output. This step runs a custom python [script](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Scripts/genBOWTIEfasta.py) that Marine wrote. It requires two inputs: the catalog tags file and a list of locus names. I get the names by copying the header of another catalog file into a new text file, where each value is locusnumber_snpnumber and the values are comma separated.

<br>
Here, I made batch_2_loci.txt for this round.
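
As a quick illustration of that file format, here is a minimal sketch that reduces a comma-separated list of locusnumber_snpnumber values to the unique locus numbers. The function name `unique_locus_ids` is hypothetical, and I'm not claiming genBOWTIEfasta.py parses the file this way.

```python
def unique_locus_ids(header_text):
    """Return locus numbers from 'locus_snp' values, in first-seen order."""
    seen = set()
    ordered = []
    for name in header_text.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate trailing commas and blank entries
        # Everything before the first underscore is the locus number.
        locus_id = name.split("_")[0]
        if locus_id not in seen:
            seen.add(locus_id)
            ordered.append(locus_id)
    return ordered


# A locus with two SNPs appears twice in the header but once in the result.
print(unique_locus_ids("10_25,10_61,42_7"))
```

A locus can carry more than one SNP, so the header can list the same locus several times. Deduplicating keeps each sequence from being written to the fasta file more than once.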

In [1]:
cd /Volumes/Time\ Machine\ Backups/Cod-Time-Series-Data/ustacks_out

/Volumes/Time Machine Backups/Cod-Time-Series-Data/ustacks_out


In [3]:
!gzip -d batch_2.catalog.tags.tsv.gz

In [5]:
cd ../scripts

/Volumes/Time Machine Backups/Cod-Time-Series-Data/scripts


In [6]:
!python genBOWTIEfasta.py ../ustacks_out/batch_2_loci.txt ../ustacks_out/batch_2.catalog.tags.tsv


Make a directory for ``Bowtie`` files, store the software there, and navigate to it. Then use ``bowtie-build`` to turn the fasta file into a ``Bowtie`` index that serves as the reference genome. Note that genBOWTIEfasta.py saves the new file in whatever folder you run the script from, so it may be worth editing the script at some point to control where it saves. For now, I'll continue.

In [7]:
cd /Volumes/Time\ Machine\ Backups/Cod-Time-Series-Data

/Volumes/Time Machine Backups/Cod-Time-Series-Data


In [8]:
!mkdir Bowtie

In [18]:
cd /Volumes/Time Machine Backups/Cod-Time-Series-Data/Bowtie/bowtie-1.1.2

/Volumes/Time Machine Backups/Cod-Time-Series-Data/Bowtie/bowtie-1.1.2


In [8]:
!./bowtie-build seqsforBOWTIE.fa batch_2

Settings:
  Output files: "batch_2.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  seqsforBOWTIE.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 219600
Using parameters --bmax 164700 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 164700 --dcv 1024
Constructing suffix-array element generator
Building

Then, align the fasta file to the index built from it, and filter out any sequences that aligned to a sequence other than themselves, using a custom [script](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Scripts/parseBowtie_DD.py) that Dan wrote in our lab.

In [11]:
!./bowtie -f -v 3 --sam --sam-nohead \
batch_2 \
seqsforBOWTIE.fa \
batch_2_BOWTIEout.sam

# reads processed: 6100
# reads with at least one reported alignment: 6100 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 6100 alignments to 1 output stream(s)


In [19]:
cd ../../scripts

/Volumes/Time Machine Backups/Cod-Time-Series-Data/scripts


In [23]:
!python parseBowtie_DD.py ../Bowtie/bowtie-1.1.2/batch_2_BOWTIEout.sam ../Bowtie/bowtie-1.1.2/batch_2_BOWTIEout_filtered.fa

Number of Bowtie output lines read: 6100
Number of sequences written to output: 6100
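
The self-alignment filter can be sketched roughly as below. This is my own illustration, not the code in parseBowtie_DD.py. It assumes headerless SAM with one reported alignment per read (which matches the counts above) and keeps a read only when the reference name equals the read name. The function names are hypothetical.

```python
def self_aligned_records(sam_lines):
    """Yield (name, sequence) for reads that aligned to themselves."""
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 1 QNAME, 2 FLAG, 3 RNAME, ..., 10 SEQ
        qname, flag, rname, seq = fields[0], int(fields[1]), fields[2], fields[9]
        if flag & 4:
            continue  # flag bit 4 marks an unaligned read
        if qname == rname:
            yield qname, seq


def to_fasta(records):
    """Format (name, sequence) pairs as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Tiny made-up example: read 12 hits itself, read 13 hits locus 12.
example = [
    "12\t0\t12\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "13\t0\t12\t1\t255\t4M\t*\t0\t0\tACGA\tIIII\n",
]
print(to_fasta(self_aligned_records(example)), end="")
```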


#### Filter with ``BLAST``

<br>
Change directory to the top-level project directory, here Cod-Time-Series-Data. Then make a directory named Blast and build a ``BLAST`` database from the filtered ``Bowtie`` output. This requires moving the filtered fasta file into the new directory, which I did manually.

Next, we filter out any locus that matches another locus as well as or better than it matches itself. This is meant to remove highly repetitive loci, such as microsatellites, that can interfere with our data analysis.
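
Here is a rough sketch of that filtering rule, assuming the loci have been searched against the database with tabular output (`-outfmt 6`, where column 1 is the query, column 2 is the subject, and column 12 is the bit score). Using the bit score to rank hits is my assumption. So are the function name and the expectation that query and subject IDs are written identically.

```python
def loci_passing_blast_filter(tabular_lines):
    """Return loci whose self-hit strictly beats every hit to another locus."""
    best_self = {}
    best_other = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Track the best self-hit and the best non-self hit separately.
        target = best_self if query == subject else best_other
        target[query] = max(target.get(query, 0.0), bitscore)
    # A tie with another locus counts as a failure ("equally well or better").
    return sorted(
        query for query, score in best_self.items()
        if score > best_other.get(query, 0.0)
    )
```

A locus with no self-hit at all is dropped as well, since there is nothing to compare its other hits against.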

In [24]:
cd ..

/Volumes/Time Machine Backups/Cod-Time-Series-Data


In [25]:
mkdir Blast

In [26]:
cd Blast

/Volumes/Time Machine Backups/Cod-Time-Series-Data/Blast


In [28]:
!makeblastdb -in batch_2_BOWTIEout_filtered.fa \
-parse_seqids \
-dbtype nucl \
-out batch_2_BOWTIEfiltered

/bin/sh: makeblastdb: command not found
