## Stacks pipeline synthesis

#### 8/22/2017

<br>
This notebook summarizes the final `stacks` pipeline for the US and Korean cod data. The same settings and flags were used for both data sets to maintain consistency. The workflow for each data set is as follows:


#### Workflow: Combining US and Korean data sets


![img-combined-workflow](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/notebook_pics/combined_workflow.png?raw=true)

![img-sk-workflow2](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/notebook_pics/koreandata_workflowsm_p2.png?raw=true)



<br>
<br>

### process_radtags

In [None]:
#single read (Korea lanes 2,3,5; Alaska lanes)
process_radtags -p $raw_data_file \
-i gzfastq \
-y fastq \
-o $"samplesT"_folder \
-b $barcodes_file \
-e sbfI \
-E phred33 \
-r -c -q \
-t 142  # or -t 92

#paired end (Korea lanes 1, 4)
process_radtags -p $raw_data_folder \
-P \
-i gzfastq \
-y fastq \
-o $"samplesT"_folder \
-b $barcodes_file \
-e sbfI \
-E phred33 \
-r -c -q \
-t 142

### ustacks (building reference)

In [None]:
ustacks -t fastq \
-f $samples_folder/file.fq \
-r -d \
-o $stacks_folder \
-m 5 \
-M 3 \
-p 6 \
--model_type bounded

### cstacks (building reference)

In [None]:
cstacks -b 7 \
-s $stacks_folder/samples \
-n 3 \
-p 6

### sstacks (building reference)

In [None]:
sstacks -b 7 \
-c $stacks_folder \
-s $stacks_folder/sample \
-o $stacks_folder \
-p 6

### populations (building reference)

In [None]:
populations -b 7 \
-P $stacks_folder \
-M $popmap.txt \
-t 36 \
-r 0.75 \
-p 4 \
-m 5  \
--genepop \
--fasta

### bowtie / blast (building reference)

In [None]:
# make fasta file from populations output
python genBOWTIEfasta_fromGENEPOP.py \
$populations_genepop_file \
$catalog_tags_file

In [None]:
# build first bowtie database
bowtie-build $batch_number.fa $batch_number

In [None]:
# align against database
bowtie -f -v 3 -k 5 --sam --sam-nohead \
$batch_number \
$batch_number.fa \
${batch_number}_BOWTIEout.sam

In [None]:
# parse out sequences to discard
python ../../scripts/parseBOWTIE_DD.py \
batch_number_BOWTIEout.sam \
batch_number_BOWTIEfiltered.fa

In [None]:
# build blast database from bowtie filtered file
makeblastdb -in batch_number_BOWTIEfiltered.fa \
-parse_seqids \
-dbtype nucl \
-out batch_number_BOWTIEfilteredDB

In [None]:
# query the blast database
blastn -query batch_number_BOWTIEfiltered.fa \
-db batch_number_BOWTIEfilteredDB \
-out batch_number_BOWTIE_BLAST_filtered

In [None]:
# parse out sequences to discard
python ../../scripts/checkBlastResults_DD.py \
batch_number_BOWTIE_BLAST_filtered \
batch_number_BOWTIEfiltered.fa \
batch_number_BOWTIE_BLAST_filtered.fa \
batch_number_BOWTIE_BLAST_output_bad.fa

In [None]:
# build final bowtie database from double-filtered file
bowtie-build batch_number_BOWTIE_BLAST_filtered.fa \
batch_number_ref_genome

### .sam alignment files

In [None]:
python RefGenome_BOWTIEalign_genshell.py \
$popmap.txt \
$reference_path/batch_number_ref_genome \
$samples_folder \
$batch \
$new_stacks_wgenome_folder

### pstacks

In [None]:
pstacks -t sam \
-f $stacks_wgenome_folder/sample.sam \
-o $stacks_wgenome_folder \
-i IDnumber \
-m 10 \
-p 6 \
--model_type bounded

### cstacks

In [None]:
cstacks -b batch_number \
-s $stacks_wgenome_folder/samples \
-g \
-p 6

### sstacks

In [None]:
sstacks -b batch_number \
-c $stacks_wgenome_folder/batch_number \
-s $stacks_wgenome_folder/sample \
-o $stacks_wgenome_folder \
-p 6

### populations

In [None]:
populations -b batch_number \
-P $stacks_wgenome_folder \
-M popmap.txt \
-r .8 \
-p 3 \
-m 10 \
--write_random_snp \
--genepop --fasta \
-t 36

Still to decide: what should the `-p` flag be? It sets the minimum number of populations in which a locus must be present for that locus to be processed.

I have 9 Korean populations and 9 Alaskan populations. So far I've been using `n_pops/2`, rounded down.
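
As a rough sketch of the `n_pops/2`, rounded down rule, the value for `-p` could be computed in the shell as below. The helper name `min_pops` is hypothetical and not part of `stacks`. Whether `n_pops` should count one region (9) or both regions together (18) is exactly the open question, so both cases are shown.

```shell
# Hypothetical helper (not part of stacks): candidate value for the
# `populations -p` flag, computed as n_pops / 2 rounded down.
# Shell integer arithmetic truncates, which gives the rounding down.
min_pops() {
  echo $(( $1 / 2 ))
}

min_pops 9    # one region on its own (9 populations): prints 4
min_pops 18   # both regions combined (9 Korean + 9 Alaskan): prints 9
```

The result could then be passed along as `-p "$(min_pops 9)"` in the `populations` call, assuming the rule is kept.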