## Stacks Pipeline rerun (batch 1)


This is a rerun of the stacks pipeline for all northeastern Pacific cod samples. 

<br>
Key parameters for this run of stacks: 
1. trimming to 142 bp
2. using bounded SNP model
3. stack depth of 10
4. nucleotide differences between stacks = 3
5. cstacks catalog built with 10 individuals per population
6. locus must be present in 4/9 populations
7. locus must be present in 75% of individuals in a population
8. "v" alignment mode in BOWTIE, with no more than 3 mismatches
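
As a rough illustration of how parameters 6 and 7 combine, here is a minimal sketch of the presence filter. It is my reading of the two thresholds, not the Stacks implementation; the function name, the argument layout and the population labels in the usage note are hypothetical.

```python
# Hypothetical illustration of parameters 6 and 7; not code from Stacks.
MIN_POPULATIONS = 4          # locus must be present in 4 of the 9 populations
MIN_FRACTION_IN_POP = 0.75   # and in 75% of the individuals of a population


def locus_passes(genotyped_counts, population_sizes,
                 min_pops=MIN_POPULATIONS, min_frac=MIN_FRACTION_IN_POP):
    """Return True if a locus meets both presence thresholds.

    genotyped_counts: dict, population -> individuals carrying the locus
    population_sizes: dict, population -> individuals sampled
    """
    pops_present = 0
    for pop, total in population_sizes.items():
        count = genotyped_counts.get(pop, 0)
        # Multiply instead of dividing so integer division never matters.
        if total > 0 and count >= min_frac * total:
            pops_present += 1
    return pops_present >= min_pops
```

For example, with nine populations of 20 individuals each, a locus genotyped in at least 15 individuals in four of them would pass, while one that reaches 15 in only three populations would not.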



### process_radtags --> ustacks --> counting # sequences per file
**3/7/2017**

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/notebooks'

In [2]:
cd ../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts


In [9]:
!python radtags_ustacks_genShell_3-6.py

In [10]:
!head -n 10 radtags_ustacks_3-6.sh

#!/bin/bash

# lane 1
process_radtags -p /media/mfisher5/New\ Volume/Kristen/Data/KOD03 -i gzfastq -y gzfastq -o ../samplesT142 -b barcodesL1.txt -e sbfI -E phred33 -r -c -q -t 142

# lane 2
process_radtags -p /media/mfisher5/New\ Volume/Kristen/Data/AD06 -i gzfastq -y gzfastq -o ../samplesT142 -b barcodesL2_AD.txt -e sbfI -E phred33 -r -c -q -t 142

process_radtags -p /media/mfisher5/New\ Volume/Kristen/Data/WC05 -i gzfastq -y gzfastq -o ../samplesT142 -b barcodesL2_WC.txt -e sbfI -E phred33 -r -c -q -t 142



In [11]:
!head -n 5 sampleList.txt

KOD03_035
KOD03_051
KOD03_052
KOD03_053
KOD03_054


In [None]:
# run in command line
chmod +x ./radtags_ustacks_3-6.sh
./radtags_ustacks_3-6.sh


<br> 
**Number of reads per individual, post-process radtags**

![graph](https://github.com/mfisher5/PCod-US-repo/blob/master/analyses/n_reads_radtags.png?raw=true)
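
Per-file read counts like the ones plotted above could be obtained with something along these lines. This is only a sketch: it assumes standard four-line FASTQ records, and `count_fastq_reads` is a hypothetical helper, not one of the scripts in this repo.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    # Files ending in .gz are read through gzip; everything else as-is.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Looping this over the sample names in `sampleList.txt` would give one count per individual.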

<br>

### cstacks --> sstacks --> populations
**3/8/2017**

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/notebooks'

In [2]:
cd ../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts


In [7]:
!python cstacks_populations_genShell_3-8.py \
samples_for_cstacks.txt \
sampleList.txt

This generated a shell script that runs `cstacks` through `populations`. 

In [8]:
!head cstacks_populations_3-8.sh


#cstacks
cstacks -b 1 -s ../stacks/KOD03_057 -s ../stacks/KOD03_064 -s ../stacks/KOD03_067 -s ../stacks/KOD03_078 -s ../stacks/KOD03_066 -s ../stacks/KOD03_061 -s ../stacks/KOD03_071 -s ../stacks/KOD03_081 -s ../stacks/KOD03_055 -s ../stacks/KOD03_065 -s ../stacks/AD06_017 -s ../stacks/AD06_001 -s ../stacks/AD06_008 -s ../stacks/AD06_048 -s ../stacks/AD06_036 -s ../stacks/AD06_018 -s ../stacks/AD06_011 -s ../stacks/AD06_044 -s ../stacks/AD06_035 -s ../stacks/AD06_041 -s ../stacks/WC05_015 -s ../stacks/WC05_008 -s ../stacks/WC05_018 -s ../stacks/WC05_029 -s ../stacks/WC05_040 -s ../stacks/WC05_022 -s ../stacks/WC05_016 -s ../stacks/WC05_017 -s ../stacks/WC05_032 -s ../stacks/WC05_025 -s ../stacks/HS04_015 -s ../stacks/HS04_016 -s ../stacks/HS04_017 -s ../stacks/HS04_040 -s ../stacks/HS04_008 -s ../stacks/HS04_032 -s ../stacks/HS04_022 -s ../stacks/HS04_006 -s ../stacks/HS04_018 -s ../stacks/HS04_014 -s ../stacks/PI04_017 -s ../stacks/PI04_016 -s ../stacks/PI04_038 -s ../stacks/PI04_0
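
For reference, a line of this shape can be assembled from a sample list roughly as follows. This is a sketch, not the contents of `cstacks_populations_genShell_3-8.py`; the function name is hypothetical, and it only reproduces the part of the command visible above (the output is truncated, so any trailing options are left out).

```python
def build_cstacks_command(samples, stacks_dir="../stacks", batch_id=1):
    """Build the visible part of the cstacks call: one -s entry per sample."""
    parts = ["cstacks", "-b", str(batch_id)]
    for sample in samples:
        parts.extend(["-s", "{}/{}".format(stacks_dir, sample)])
    return " ".join(parts)
```

Calling it with the sample names listed in `samples_for_cstacks.txt` would give the `-s ../stacks/<sample>` chain shown in the script.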

I also created a population map file for `populations`:

In [9]:
!head -n 5 PopMap.txt

KOD03_035	Kodiak03
KOD03_051	Kodiak03
KOD03_052	Kodiak03
KOD03_053	Kodiak03
KOD03_054	Kodiak03
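
A population map in this two-column, tab-separated layout can be derived from the sample names. The sketch below assumes the population is determined by the prefix before the underscore; only the `KOD03` to `Kodiak03` pairing comes from the output above, and the function name is hypothetical.

```python
def population_map_lines(samples, prefix_to_population):
    """Return 'sample<TAB>population' lines, keyed on the sample prefix."""
    lines = []
    for sample in samples:
        prefix = sample.split("_")[0]
        lines.append("{}\t{}".format(sample, prefix_to_population[prefix]))
    return lines
```

The other prefixes would need their own entries in `prefix_to_population`; a missing prefix raises a `KeyError` rather than silently writing an incomplete map.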


In [11]:
!chmod +x cstacks_populations_3-8.sh

In [None]:
#run at command line
./cstacks_populations_3-8.sh


I need to rename the samples folder, since Kristen's reads were 100bp. But I first added the following to my `.gitignore` file: 

`samplesT92/*.fq.gz`
<br>
`samplesT92/*.fq`
<br>
`stacks/*.genepop`
<br>
`stacks/*.fa`




In [1]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo


In [4]:
mv samplesT142 samplesT92



<br>
### Reference genome building
<br>


**Step One:** Build the fasta file for BOWTIE using the loci in the `populations` genepop file and the sequences in the `batch_1.catalog.tags.tsv.gz` file

In [2]:
cd ../scripts/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts


In [3]:
!head genBOWTIEfasta_fromGENEPOP.py

### This python script will create a list of loci from the `populations` output genepop file ###

## ARGUMENTS: 
#ARG 1 - genepop file from `populations`. 
#ARG 2 - the .catalog file output from `cstacks` (unzipped)


import sys

#open the genepop file


In [5]:
!gzip -d ../stacks/batch_1.catalog.tags.tsv.gz

In [7]:
!python genBOWTIEfasta_fromGENEPOP.py \
../stacks/batch_1.genepop \
../stacks/batch_1.catalog.tags.tsv

-----
Reading loci from file:
../stacks/batch_1.genepop
Stacks version 1.44; Genepop version 4.1.3; March 08, 2017

Done reading loci

Using sequences from catalog file:
../stacks/batch_1.catalog.tags.tsv

Writing new fasta file...
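
The logic of this step can be sketched as below. This is not the contents of `genBOWTIEfasta_fromGENEPOP.py`. It rests on several assumptions that should be checked against the actual files: that the second line of the genepop file lists loci separated by commas, that locus names look like `<locus id>_<column>`, and that the catalog tags file is tab-delimited with the locus ID in the third column and the consensus sequence in the tenth. All function names are hypothetical.

```python
def loci_from_genepop(lines):
    """Return unique locus IDs, in order, from genepop lines.

    Assumes line 1 is the header and line 2 is a comma-separated locus list.
    """
    names = [name.strip() for name in lines[1].split(",") if name.strip()]
    seen, loci = set(), []
    for name in names:
        locus = name.split("_")[0]  # assumed "<locus id>_<column>" naming
        if locus not in seen:
            seen.add(locus)
            loci.append(locus)
    return loci


def fasta_records(loci, catalog_lines, id_col=2, seq_col=9):
    """Return FASTA records for catalog rows whose locus ID is in `loci`.

    id_col and seq_col are zero-based column indices (assumed layout).
    """
    wanted = set(loci)
    records = []
    for line in catalog_lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > max(id_col, seq_col) and fields[id_col] in wanted:
            records.append(">{}\n{}".format(fields[id_col], fields[seq_col]))
    return records
```

Writing `"\n".join(records)` to a file would give a FASTA in the same `>locus` / sequence layout as `seqsforBOWTIE.fa`.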


In [8]:
!head seqsforBOWTIE.fa

>1
TGCAGGCAACAGAACGCAGTAGTGTAATGTGGACATCTTAATCACATTTAAGACCTGTTTGAAGCATTTAAACATCGTAAGACACCCTGAGG
>7
TGCAGGAATATAGTTAAAGTTGGTTGATTTGTAGTTTCAAGTATAAACATCTGAGAAATACTTATACGATCAAATGCTTTGTATCTGCCTTA
>8
TGCAGGCCAAGCAGCTCTCATCATGCTGTCATCATTCACTGCAAGTGTCACTGCTATTGTCTGAATGCTACTGCAGTGCCGCCTGTCCCCAA
>9
TGCAGGAGAAAGGCGAGGAAGCCTGTAAGGAGTTTTATCGTGCACTCCACTTGCACGTGGAAGAAGTTTATTACAGCTTGCCCACACGCCTC
>10
TGCAGGGTCGACCTCCCGCTGCTCCCCAGCCGGCTCCGCCCACCGGAGGCCACGCCCACAGAGGCCACGCCCCCCTCAGACGAGGAGGAGGA


In [9]:
!mv seqsforBOWTIE.fa ../stacks/batch_1_seqsforBOWTIE.fa


<br>
**Step Two:** Run BOWTIE filtering from within the stacks folder

In order to run BOWTIE, the program must be located in the folder you are working in. 

In [10]:
cd ../stacks

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks


In [11]:
!mkdir refgenome

In [15]:
!mv refgenome/bowtie-1.1.2 ../../L1L2stacks_m10_boundSNP/bowtie-1.1.2

In [19]:
!rsync ../../L1L2stacks_m10_boundSNP/bowtie-1.1.2 refgenome/bowtie-1.1.2

skipping directory bowtie-1.1.2


That didn't work: `rsync` skips directories unless it is given the recursive flag (`-r`). I manually copied bowtie in instead. 

In [20]:
!mv batch_1_seqsforBOWTIE.fa refgenome/batch_1_seqsforBOWTIE.fa

In [21]:
cd refgenome

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks/refgenome


In [23]:
!bowtie-build batch_1_seqsforBOWTIE.fa batch_1

Settings:
  Output files: "../batch_1.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  batch_1_seqsforBOWTIE.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 282417
Using parameters --bmax 211813 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 211813 --dcv 1024
Constructing suffix-array element generat

In [26]:
!bowtie -f -v 3 --sam --sam-nohead \
batch_1 \
batch_1_seqsforBOWTIE.fa \
batch_1_BOWTIEout.sam

# reads processed: 12279
# reads with at least one reported alignment: 12279 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 12279 alignments to 1 output stream(s)


In [28]:
!python ../../scripts/parseBowtie_DD.py \
batch_1_BOWTIEout.sam \
batch_1_BOWTIE_filtered.fa

Number of Bowtie output lines read: 12279
Number of sequences written to output: 12279
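
I have not reproduced `parseBowtie_DD.py` here, so the following is only a guess at the general shape of such a filter: read the SAM lines, drop anything flagged as unmapped, and write the rest back out as FASTA. The function name and the keep/drop rule are assumptions; the column positions (read name, flag, sequence) follow the SAM format.

```python
def sam_to_fasta_records(sam_lines):
    """Return (lines_read, fasta_records) for mapped reads in SAM lines."""
    lines_read = 0
    records = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, absent when --sam-nohead is used
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        lines_read += 1
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 4:
            continue  # 0x4 marks an unmapped read
        # SEQ is written exactly as stored in the SAM line.
        records.append(">{}\n{}".format(name, seq))
    return lines_read, records
```

In this run every one of the 12279 sequences aligned, so a filter of this kind would pass all of them through, which matches the counts printed above.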



<br>
**Step Three:** BLAST filtering

In [29]:
!makeblastdb -in batch_1_BOWTIE_filtered.fa \
-parse_seqids \
-dbtype nucl \
-out batch_1_BOWTIEfilteredDB



Building a new DB, current time: 03/09/2017 13:57:20
New DB name:   /mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/stacks/refgenome/batch_1_BOWTIEfilteredDB
New DB title:  batch_1_BOWTIE_filtered.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 12279 sequences in 0.257832 seconds.


In [30]:
!blastn -query batch_1_BOWTIE_filtered.fa \
-db batch_1_BOWTIEfilteredDB \
-out batch_1_BOWTIE_BLAST_filtered

In [31]:
!python ../../scripts/checkBlastResults_DD.py \
batch_1_BOWTIE_BLAST_filtered \
batch_1_BOWTIE_filtered.fa \
batch_1_BOWTIE_BLAST_filtered.fa \
batch_1_BOWTIE_BLAST_output_bad.fa


Identifying which loci are 'good' and 'bad' based on BLAST alignments...
Writing 'good' and 'bad' loci to their respective files...


In [33]:
!grep -c "^>" batch_1_BOWTIE_BLAST_filtered.fa
# 12084

12084
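
The same count can be taken in Python, which is handy when comparing several FASTA files at once. A minimal sketch (the function name is hypothetical):

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

With the numbers reported above, 12279 sequences went into BLAST and 12084 remain in the filtered file, so 12279 - 12084 = 195 sequences did not make it into the 'good' file.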



<br>
**Step Four:** Create final SAM file containing the reference database of loci

In [34]:
!bowtie-build batch_1_BOWTIE_BLAST_filtered.fa \
batch_1_ref_genome

Settings:
  Output files: "batch_1_ref_genome.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  batch_1_BOWTIE_BLAST_filtered.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 277932
Using parameters --bmax 208449 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 208449 --dcv 1024
Constructing suffix-array


<br>
**Step Five:** Align process_radtags output to new reference "genome"


*Note-* when aligning, you have to choose a `-v` value, which is the number of mismatches allowed between a read and the reference. This should match the `-M` value chosen in ustacks. 


In [4]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/scripts'

In [5]:
cd ../PCod-US-repo/scripts

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-US-repo/scripts


In [6]:
!head -n 15 RefGenome_BOWTIEalign_genshell3-9.py

###### Generate Shell Script to Align all FastQ Data Files to BOWTIE ref genome ######

## MF 3/9/2017
## For US Cod Data



## At Command Line: python cstacks_populations_genShell_3-8 ARG1 ARG2
##---- ARG1 = complete sample list file
##---- ARG2 = relative path to bowtie ref database, including file name without filetype suffix
##---- ARG3 = relative path to stacks fastq files, output from process_radtags
##---- ARG4 = batch #

############################################################################



In [8]:
!python RefGenome_BOWTIEalign_genshell3-9.py \
sampleList.txt \
../stacks/refgenome/batch_1_ref_genome \
../samplesT92 \
1

In [9]:
!head -n 10 RefGenome_BOWTIEalign_1.sh

#!/bin/bash

bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_035.fq.gz KOD03_035.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_051.fq.gz KOD03_051.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_052.fq.gz KOD03_052.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_053.fq.gz KOD03_053.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_054.fq.gz KOD03_054.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_055.fq.gz KOD03_055.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_056.fq.gz KOD03_056.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_057.fq.gz KOD03_057.sam
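
Lines of this form can be generated from the sample list roughly as follows. This is a sketch rather than the contents of `RefGenome_BOWTIEalign_genshell3-9.py`; the function and parameter names are hypothetical. The option string is kept as a parameter and defaults to what the generated script shows. Note that the bowtie manual spells the no-reverse-complement flag `--norc`, so that string may need adjusting.

```python
def bowtie_align_lines(samples, ref_index, fastq_dir,
                       options="-q -v 3 -norc --sam", suffix=".fq.gz"):
    """Return one bowtie command per sample, writing <sample>.sam."""
    lines = []
    for sample in samples:
        lines.append("bowtie {opts} {ref} {fq_dir}/{name}{suffix} {name}.sam".format(
            opts=options, ref=ref_index, fq_dir=fastq_dir,
            name=sample, suffix=suffix))
    return lines
```

Passing the arguments used in cell `In [8]` (`../stacks/refgenome/batch_1_ref_genome` and `../samplesT92`) gives the same lines as the `head` output above.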


<br>
All fastq files must be unzipped! I pasted in the following code to my bash shell script to unzip my files:


In [None]:
#cd ../samplesT92
#echo "finding all gzipped 'tags' files"
#tags_file_array="$(find . -name '*.fq.gz')"


#echo "unzipping all fastq files"
#for file in $tags_file_array
#do
#	echo $file
#	gzip -d $file
#	echo "file unzipped"
#done

#cd ../scripts

In [11]:
!head -n 20 RefGenome_BOWTIEalign_1.sh

#!/bin/bash

cd ../samplesT92
echo "finding all gzipped 'tags' files"
tags_file_array="$(find . -name '*.fq.gz')"


echo "unzipping all fastq files"
for file in $tags_file_array
do
	echo $file
	gzip -d $file
	echo "file unzipped"
done

cd ../scripts

bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_035.fq.gz KOD03_035.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_051.fq.gz KOD03_051.sam
bowtie -q -v 3 -norc --sam ../stacks/refgenome/batch_1_ref_genome ../samplesT92/KOD03_052.fq.gz KOD03_052.sam
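
The unzip loop could also be written in Python, which avoids shell word-splitting on file names. A minimal sketch, assuming the files sit directly in one directory; `gunzip_all` is a hypothetical helper and, like `gzip -d`, it removes each `.gz` file once it has been decompressed.

```python
import glob
import gzip
import os
import shutil


def gunzip_all(directory, pattern="*.fq.gz"):
    """Decompress every matching file in `directory`; return the new paths."""
    unzipped = []
    for gz_path in sorted(glob.glob(os.path.join(directory, pattern))):
        out_path = gz_path[:-3]  # strip the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(gz_path)  # mirror gzip -d, which deletes the original
        unzipped.append(out_path)
    return unzipped
```

After it runs, the files in the directory end in `.fq` rather than `.fq.gz`.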


In [12]:
!chmod +x RefGenome_BOWTIEalign_1.sh

In [None]:
# ran the shell script at the command line 
#as it was expected to take overnight
./RefGenome_BOWTIEalign_1.sh