## Sequencing Lane Two Data


This notebook runs the Korean P. cod Lane 2 RADseq data through the stacks pipeline, from `process_radtags` through `populations`. The parameters used in this notebook were established using the Lane 1 data (see Testing Stacks Parameters I and II). 
This lane of sequencing includes important samples to be processed *before* continuing with the rest of lab work, in order to test the effectiveness of the protocols used on fully degraded DNA and 300ng DNA. 

This [Evernote nb](http://www.evernote.com/l/AoqJfZuhLQRLGIQ39UkIkKLxdGYsBSXFxt0/) contains the correct version of all the scripts used here, as well as FastQC output and other visuals.


**Data info:**

Illumina HiSeq 4000 SR 150bp

Run 820, Sample 768

This lane of sequencing contains 72 samples from 60 unique individuals. This includes:
- 12 individuals with fully degraded DNA
- 12 individuals each prepared twice, once with 500ng and once with 300ng DNA (NEW 300ng protocol for testing)


**Programs:** FastQC, stacks v. 1.42 
<br><br><br><br>

### 11/14/2016

**STEP ONE: DOWNLOAD DATA**

In [1]:
cd ../../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [2]:
cd raw_data

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/raw_data


In [3]:
ls

[0m[34;42mL1_PE150[0m/


In [4]:
mkdir L2_SR150

In [3]:
ls

[0m[34;42mL1_PE150[0m/  [34;42mL2_SR150[0m/


In [4]:
cd L2_SR150

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/raw_data/L2_SR150


In [5]:
#not enough space on C: to download; used 
wget -r --no-parent --reject "index.html*" http://gc3fstorage.uoregon.edu/HGGJ2BBXX/768/  
#to download directly into folder on D: instead of C:

In [7]:
ls

[0m[01;32m768_768_S99_L002_R1_001.fastq.gz[0m*  [34;42mFlowcell[0m/


**STEP TWO: CHECK OUT FASTQC**

In [8]:
cd ../../fastqc_v0.11.5/FastQC/

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/fastqc_v0.11.5/FastQC


In [10]:
!./fastqc

^C


Links to output can be found in this [Evernote nb](http://www.evernote.com/l/AoqJfZuhLQRLGIQ39UkIkKLxdGYsBSXFxt0/)

**STEP THREE: `process_radtags`**

I want to run process_radtags and THEN run ustacks, rather than using a shell script for both, because I want to check out several samples post-process_radtags to see if the trimming actually got rid of enough poor quality bases at the end of the read. 

I also won't be running all samples through ustacks at first, since I haven't decided on the parameters at the moment. 

All of the process_radtags samples will be in the `L2samplesT142/` folder. I won't be renaming the lane 1 `samplesT142/` folder to include "L1", since all of my notes refer to it as just `samplesT142`.

In [11]:
cd ../../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [12]:
!mkdir L2samplesT142

In [13]:
ls

[0m[01;32mcstacks_b303[0m*                        [34;42mL2samplesT142[0m/              [34;42msamplesT142[0m/
[34;42mDiagrams[0m/                            [34;42mmf-fish546-2016[0m/            [34;42msamplesT146[0m/
[34;42mfastqc_v0.11.5[0m/                      [01;32mprocess_radtags_nobarcode[0m*  [34;42mscripts[0m/
[01;32milluminaFilter_process_radtags.log[0m*  [34;42mraw_data[0m/                   [34;42mUCstacksL1[0m/


Next, I made a barcode + sampleID text file for process_radtags

In [15]:
cd scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


Now I can create a shell script for process_radtags --> ustacks, and run the `process_radtags` part of it in the command line. Note that instead of changing the file names from barcode --> sample ID myself, I can simply specify a barcode file to `process_radtags` in the format: 
<br>
<br>
barcode \t sampleID
<br>
<br>
So I made a barcode_sampleID text file without the "mv" that I used in Lane 1
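A quick sanity check on the barcode file can catch formatting problems before a long `process_radtags` run. The sketch below assumes the two-column, tab-separated layout described above and that all barcodes in this lane share one length; the function name `parse_barcodes` and the checks it performs are my own illustration, not part of stacks.

```python
def parse_barcodes(lines):
    """Parse 'barcode<TAB>sampleID' lines into a dict, rejecting malformed input."""
    barcodes = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected barcode<TAB>sampleID" % line_number)
        barcode, sample_id = fields
        if barcode in barcodes:
            raise ValueError("line %d: duplicate barcode %s" % (line_number, barcode))
        barcodes[barcode] = sample_id
    # Assumption for this lane: every inline barcode has the same length.
    if len(set(len(b) for b in barcodes)) > 1:
        raise ValueError("barcodes are not all the same length")
    return barcodes
```

Because the function accepts any iterable of lines, an open file handle works directly, e.g. `with open("barcodesL2.txt") as handle: parse_barcodes(handle)`.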

In [22]:
!head barcodesL2.txt

% more barcodes_run01_lane02
CAAAAA	SO022216_01
GACGAC	MU011816_01
TTTGTC	MU012816_05
TGCTTA	MU012816_06
CCCGGT	MU012816_07
GCGACC	MU012816_08
CTTATG	MU012816_09
AGCGCA	MU012816_10
TCGCCA	MU032315_01


In [23]:
!python radtags_ustacks_genShellSR.py barcodesL2.txt

**(1)** trim to 142

In [None]:
!process_radtags -p /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/raw_data/L2_SR150/ \
-i gzfastq -y gzfastq \
-o /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/L2samplesT142 \
-b /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/scripts/barcodesL2.txt \
-e sbfI -E phred33 -r -c -q -t 142

Processing single-end data.

Using Phred+33 encoding for quality scores.

Reads will be truncated to 142bp

Found 1 input file(s).

Searching for single-end, inlined barcodes.

Loaded 72 barcodes (6bp).

Will attempt to recover barcodes with at most 1 mismatches.

Processing file 1 of 1 [768_768_S99_L002_R1_001.fastq.gz]
  377851294 total reads; -50386443 ambiguous barcodes; -65025476 ambiguous RAD-Tags; +41406050 recovered; -2840924 low quality reads; 259598451 retained reads.
  
Closing files, flushing buffers...

Outputing details to log: '/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/L2samplesT142/process_radtags.log'
<br>
<br>
**377851294** total sequences;

  **50386443** ambiguous barcode drops;
  
  **2840924** low quality read drops;
  
  **65025476** ambiguous RAD-Tag drops;
  
**259598451** retained reads.



----------------------------------------------
Retained 68.7% reads
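As a check on the summary above, the retained count and percentage can be reproduced from the logged numbers. This is only arithmetic on the values printed by `process_radtags`; the variable names are mine. Note that the sum only balances when the `+41406050 recovered` reads are left out of the subtraction.

```python
# Counts copied from the process_radtags summary above (trim to 142 bp).
total_reads = 377851294
ambiguous_barcodes = 50386443
ambiguous_radtags = 65025476
low_quality = 2840924

# Reads that survive all three filters.
retained = total_reads - ambiguous_barcodes - ambiguous_radtags - low_quality
percent_retained = 100.0 * retained / total_reads

print(retained)                    # 259598451
print(round(percent_retained, 1))  # 68.7
```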



**Checking FASTQC output per sample**

I opened at least one sample from each of the populations + breeding seasons in FastQC to check out the per base sequence quality toward the end of the sample. 

See [Evernote](http://www.evernote.com/l/AoqJfZuhLQRLGIQ39UkIkKLxdGYsBSXFxt0/)
<br>
<br>
Summary: most of the FastQC output looks fine. In the NON-DEGRADED sequences, per base sequence quality (PBSQ) is not in the red at all, and drops into the yellow between 115 - 125 bp. In several DEGRADED sequences, PBSQ drops into the red at 105-130 bp. 
<br>
MU011816_01: 130

MU012816_05: 110

MU012816_10: 104

MU032315_01: 140

MU032315_02: 130

MU033015_03: 94
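To find a drop-off position like the ones above without opening every sample in FastQC, something like the following could be used. It is a rough sketch, not a reimplementation of FastQC: it assumes Phred+33 quality strings (the encoding reported by `process_radtags` above), and the cutoff of 20 is a placeholder of mine that may not correspond exactly to FastQC's colour bands.

```python
from statistics import median

def first_low_quality_position(quality_strings, threshold=20, offset=33):
    """Return the 1-based read position where the median Phred score first
    drops below `threshold`, or None if it never does."""
    by_position = []
    for qual in quality_strings:
        for index, char in enumerate(qual):
            if index == len(by_position):
                by_position.append([])   # first time this position is seen
            by_position[index].append(ord(char) - offset)
    for index, scores in enumerate(by_position):
        if median(scores) < threshold:
            return index + 1
    return None
```

The input is the fourth line of each FastQ record. This keeps every score in memory, so for a full sample it would make sense to pass in only a subsample of reads.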


These samples also have the smallest file sizes: 
<br>
MU033015_03: 23.1

MU032315_02: 28.9

MU012816_10: 36.2

(MU012816_07: 37.6)

(MU033015_02: 38.8)

(MU012816_09: 39.3)

MU012816_05: 58.6

MU032315_01: 68.8

(SO022216_01: 77.9)

MU011816_01: 92.3

(MU012816_06: 100.7)

(MU012816_08: 144.3)
<br>
<br>
<br>
**(2)** Trim to 110

I'm trimming all of the samples to a much shorter length so that I can look specifically at the fully degraded samples, and see whether more reads are retained (larger file size) once I remove the end of the sequence that might be causing those reads to be filtered out. If I do end up using these trimmed samples later, I would only use the ones from Mukho/Sokcho. This confines the shorter sequences to a single population, and when making comparisons with this population I could focus on SNPs at positions below 110 bp. 

In [5]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [6]:
!mkdir L2samplesT110

In [None]:
!process_radtags -p /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/raw_data/L2_SR150/ \
-i gzfastq -y gzfastq \
-o /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/L2samplesT110 \
-b /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis/scripts/barcodesL2.txt \
-e sbfI -E phred33 -r -c -q -t 110

Processing single-end data.

Using Phred+33 encoding for quality scores.

Reads will be truncated to 110bp

Found 1 input file(s).

Searching for single-end, inlined barcodes.

Loaded 72 barcodes (6bp).

Will attempt to recover barcodes with at most 1 mismatches.

Processing file 1 of 1 [768_768_S99_L002_R1_001.fastq.gz]

  377851294 total reads; -50386443 ambiguous barcodes; -65025476 ambiguous RAD-Tags; +41406050 recovered; -2422281 low quality reads; 260017094 retained reads.
  
Closing files, flushing buffers...

Outputing details to log: '/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/L2samplesT110/process_radtags.log'
<br>
<br>
377851294 total sequences;

  50386443 ambiguous barcode drops;
  
  2422281 low quality read drops;
  
  65025476 ambiguous RAD-Tag drops;
  
260017094 retained reads.

________________________________________________________________________________________

Retained 68.8% reads

### CHECKING PROTOCOL RESULTS (post `process_radtags`)


To compare the effectiveness of the 300ng protocol, I want to compare the number of reads retained for each of the 12 individuals that were run with 500ng and with 300ng of DNA. In order to do this, I'll run a shell script that will use `awk` to extract the number of sequences from each file. 

<br>
<br>
**(1)** Made a shell script to count the sequences in each FastQ file, and output to FastQsequenceCounts.txt

In [2]:
cd ../../scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [3]:
!head countFASTQ_seqs.sh

#!/bin/bash


cd ../L2samplesT142

zcat ../L2samplesT142/GEO020414_8.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts.txt

zcat ../L2samplesT142/GEO020414_9.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts.txt




In [13]:
!tail countFASTQ_seqs.sh

zcat ../L2samplesT142/GEO020414_23_300.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts_300.txt

zcat ../L2samplesT142/GEO020414_24_300.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts_300.txt

zcat ../L2samplesT142/GEO020414_25_300.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts_300.txt

zcat ../L2samplesT142/

In [5]:
!chmod +x countFASTQ_seqs.sh

In [7]:
!./countFASTQ_seqs.sh

^C
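For reference, the `awk` one-liner used in the script can be written out more readably in Python. This is a sketch that mirrors the same six output fields (total reads, reads seen exactly once, % seen once, most common read, its count, its %); the function name `summarize_reads` is hypothetical.

```python
from collections import Counter

def summarize_reads(fastq_lines):
    """Mirror the awk one-liner: tally the sequence line of each 4-line record."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines, start=1):
        if (line_number - 2) % 4 == 0:   # 2nd line of every record
            counts[line.strip()] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads found")
    unique = sum(1 for n in counts.values() if n == 1)
    max_read, max_count = counts.most_common(1)[0]
    return (total, unique, 100.0 * unique / total,
            max_read, max_count, 100.0 * max_count / total)
```

For a gzipped sample, the lines can come from `gzip.open(path, "rt")` in place of `zcat`.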


I'd also like to take a look at the degraded DNA samples, to see how many sequences were retained. I used the same format as above, simply changing the sample names and changing the output to `>> FastQseqCounts_deg.txt`.

I'm a little worried because the files for the degraded samples seem fairly small, especially considering that they are FastQ files, and so should be larger than the Fasta files from Lane 1. 

In [8]:
!head countFASTQ_seqs_deg.sh

#!/bin/bash


cd ../L2samplesT142

zcat MU033015_03.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQseqCounts_deg.txt

zcat MU032315_02.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQseqCounts_deg.txt




In [10]:
!chmod +x countFASTQ_seqs_deg.sh

In [None]:
!./countFASTQ_seqs_deg.sh

**Outcomes**
<br>
For graphs and scripts, see this [Evernote notebook](http://www.evernote.com/l/AoqJfZuhLQRLGIQ39UkIkKLxdGYsBSXFxt0/). 
<br>
<br>
(1) 300ng and 500ng: significant difference in both the total number of sequences and the number of unique sequences, using a paired t-test. The 500ng set had MORE total and unique sequences than the 300ng set. However, graphically, this doesn't look too bad (excluding GEO020414_13, which for some reason had a very low # of reads. Will have to use 300ng for this one).
On average, a 21% increase in total # of reads from 300ng to 500ng (min = 9.3%, max = 43.4%).
On average, a 20% increase in total # of unique reads from 300ng to 500ng (min = 7%, max = 40%).
<br>

(2) Degraded DNA, trimmed to 142 v. 110: highly significant difference in total # of sequences, # of unique sequences, and % of sequences unique (????), but on a graph you can barely tell the difference; much less obvious than the 300 v. 500 comparison.
On average, a 3% increase in # of reads when trimming to 110 bp instead of 142 bp. Will stick with 142.
<br>

(3) Degraded DNA v. Good DNA (500ng): Ouch. Highly significant differences in all four categories. The average difference is 3.31 * 10^6 (good DNA WAY better).
On average, the good quality DNA samples looked at here have 4.59 x as many reads as 500ng of degraded DNA.
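The comparisons above rest on a paired t-test and a per-individual percent increase. A minimal sketch of both calculations is below, using only the standard library. The read counts in it are made up for illustration and are NOT the Lane 2 values; the function name is also mine. It returns the t statistic and degrees of freedom only, so a p-value would still need a t table or a stats package (e.g. `scipy.stats.ttest_rel`).

```python
from math import sqrt
from statistics import mean, stdev

def paired_summary(counts_a, counts_b):
    """Paired t statistic for (b - a) and the mean percent increase from a to b."""
    differences = [b - a for a, b in zip(counts_a, counts_b)]
    n = len(differences)
    t_statistic = mean(differences) / (stdev(differences) / sqrt(n))
    percent_increase = mean([100.0 * (b - a) / a for a, b in zip(counts_a, counts_b)])
    return t_statistic, n - 1, percent_increase

# Hypothetical read counts for three individuals (placeholders, not real data).
reads_300ng = [1000000, 2000000, 1500000]
reads_500ng = [1200000, 2300000, 1900000]
t, df, pct = paired_summary(reads_300ng, reads_500ng)
```

Averaging the per-individual percentages, rather than taking the percent change of the two means, keeps one very large sample from dominating the summary.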
<br>
____________________________________________________
______________________________________________________
<br>
<br>
<br>
<br>

**STEP FOUR: RUN USTACKS**
<br>
<br>
I'll run ustacks on the T142 samples, with the parameters that I THINK will end up being my defaults: 

**-m** 10
**-M** 3

These are identical to the runs on the Lane 1 data in the `stacks_m10` folder

In [14]:
pwd

u'/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts'

In [16]:
# to create a shell script that runs ustacks AND 
# finds seq counts for all fq.gz files, use to determine which samples used for cstacks
!python radtags_ustacks_genShellSR.py barcodesL2.txt

In [17]:
!mv new_radtags_ustacks_shell ../ustacks_shell_11-15.sh

In [19]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [20]:
# deleted process_radtags part
!head ustacks_shell_11-15.sh

#!/bin/bash

#ustacks
cd /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis
ustacks -t gzfastq -f L2samplesT142/SO022216_01.fq.gz -r -d -o stacks_m10 -i 001 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L2samplesT142/MU011816_01.fq.gz -r -d -o stacks_m10 -i 002 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L2samplesT142/MU012816_05.fq.gz -r -d -o stacks_m10 -i 003 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L2samplesT142/MU012816_06.fq.gz -r -d -o stacks_m10 -i 004 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L2samplesT142/MU012816_07.fq.gz -r -d -o stacks_m10 -i 005 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L2samplesT142/MU012816_08.fq.gz -r -d -o stacks_m10 -i 006 -m 10 -M 3 -p 6


In [22]:
!mkdir stacksCombo_m10

In [23]:
!chmod +x ustacks_shell_11-15.sh

In [None]:
!./ustacks_shell_11-15.sh

***Plan going forward...***

Move all of the ustacks files from Lane 1 in the UCstacksL1/stacks_m10 folder into the new stacksCombo_m10/ folder. 

Run this set of samples all the way through the stacks process, following the [flowchart](https://github.com/mfisher5/mf-fish546-2016/blob/master/Diagrams/PopGen_Workflow.md) on github. EXTRA STEPS: Marine's filtering scripts on `populations` genepop output. 


Then, compare between samples of interest:
(1) # loci retained post filtering
(2) % missing genotypes
(3) # individuals that had to be taken out because of missing genotypes
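For comparison (2), the % missing genotypes per individual can be tallied once the genotypes are in memory. The sketch below assumes genotypes have already been parsed out of the genepop file into a dict of sample name to list of genotype strings, and that missing calls are coded as `0000` or `000000` (the usual genepop convention); the parsing step and the sample names are not shown and are hypothetical.

```python
def percent_missing(genotypes, missing_codes=("0000", "000000")):
    """Percent of loci with a missing genotype for each individual."""
    result = {}
    for sample, calls in genotypes.items():
        missing = sum(1 for call in calls if call in missing_codes)
        result[sample] = 100.0 * missing / len(calls)
    return result
```

Individuals above some chosen cutoff could then be listed with a comprehension such as `[s for s, pct in percent_missing(genotypes).items() if pct > cutoff]`.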

### 11/17/2016

**STEP FIVE: RUN CSTACKS**

I ran the same script used in **Checking protocol results (post `process_radtags`)** to count the number of sequences in the gzipped fastq files run through ustacks, and then added these to a spreadsheet with the lane 1 sequence counts. I ordered from largest to smallest, and took the first ten individuals from each population. I ignored the BORYEONG 2007 samples. For populations with fewer than ten individuals, I used all individuals. 

MUK/SOK 2015/2016 (12)

GEO 2014 (33)

GEO 2015 (33)

POH 2015 (28)

NAM 2015 (16)

YS 2016 (7)
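The selection described above was done in a spreadsheet, but the same logic can be sketched in code: sort by read count within each population and keep the top ten, skipping any excluded population. The function and argument names below are hypothetical.

```python
from collections import defaultdict

def pick_for_catalog(read_counts, population_of, per_population=10, skip=()):
    """Return the `per_population` samples with the most reads in each population."""
    by_population = defaultdict(list)
    for sample, count in read_counts.items():
        population = population_of[sample]
        if population not in skip:
            by_population[population].append((count, sample))
    chosen = {}
    for population, members in by_population.items():
        members.sort(reverse=True)   # largest read count first
        chosen[population] = [sample for _, sample in members[:per_population]]
    return chosen
```

A population with fewer than `per_population` individuals simply contributes all of them, which matches the rule used here.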

I moved all of the ustacks files into the stacksCombo_m10 folder, and ran cstacks + sstacks with the following shell script: 

In [1]:
pwd

u'/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/mf-fish546-2016/notebooks'

In [2]:
cd ../../scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [3]:
!head samples_for_cstacks_L2.txt

### batch 1 : 10 individuals per pop except for Yellow Sea, stacksCombo_m10
YS121315_12.fq.gz
YS121315_15.fq.gz
YS121315_16.fq.gz
YS121315_08.1.fa.gz
YS121315_10.1.fa.gz
YS121315_14.1.fa.gz
YS121315_12_300.fq.gz
GE011215_07.1.fa.gz
GE012315_06.1.fa.gz


In [5]:
!head -n 15 cstacks_sstacks_genShellSR.py

####### Generate Shell Script that will run cstacks and sstacks #######
## MF 11/16/2016
## Command Line Arguments: 
#ARG1 = samples_for_cstacks file
#ARG2 = first lane barcode file
#ARG3 = second lane barcode file

##########################################################################


import sys
newfile = open("cstacks_sstacks_shell_11-16.sh", "w")
## cstacks ##
catFile = open(sys.argv[1], "r")	#open the file with your list of samples to use in cstacks



I had to input different barcode files for each lane because lane 1 samples have an additional `.1` extension on them, so they were handled differently in the python script above. 
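One way to sidestep the `.1` difference is to derive the stacks sample prefix from the read file name itself, since stripping `.fq.gz` or `.fa.gz` leaves the lane 1 `.1` in place. The sketch below is an alternative illustration, not what `cstacks_sstacks_genShellSR.py` does; the sstacks flags are copied from the scripts shown in this notebook and the function names are mine.

```python
def sample_prefix(filename):
    """Strip the read-file extension, keeping any lane 1 '.1' suffix."""
    for extension in (".fq.gz", ".fa.gz"):
        if filename.endswith(extension):
            return filename[:-len(extension)]
    raise ValueError("unrecognised file name: %s" % filename)

def sstacks_command(filenames, stacks_dir="stacksCombo_m10", batch=1, threads=6):
    """Assemble one sstacks call in the same shape as the scripts in this notebook."""
    samples = " ".join("-s %s/%s" % (stacks_dir, sample_prefix(f)) for f in filenames)
    return "sstacks -b %d -c %s/batch_%d %s -o %s -p %d" % (
        batch, stacks_dir, batch, samples, stacks_dir, threads)
```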

In [None]:
#generate shell script
!python cstacks_sstacks_genShellSR.py samples_for_cstacks_L2.txt barcodes_L1.txt barcodes_L2.txt

In [None]:
#move to DataAnalysis folder
!mv cstacks_sstacks_shell_11-16.sh ../cstacks_sstacks_shell_11-16.sh

In [6]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [None]:
!chmod +x cstacks_sstacks_shell_11-16.sh
!./cstacks_sstacks_shell_11-16.sh

For some reason, several of the `.matches` files generated by sstacks were 0 bytes, so I ran the following script to rerun these samples:

In [7]:
!head sstacks_11-16.sh

#!/bin/bash
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/GE012315_11.1 -s stacksCombo_m10/GEO020414_11 -s stacksCombo_m10/GEO020414_15_300 -s stacksCombo_m10/GEO020414_17_300 -s stacksCombo_m10/GEO020414_30 -o stacksCombo_m10 -p 6 2>> stacksCombo_m10/sstacks_out_b1.2


I was also worried about the samples previously run through sstacks; they might be missing data without it being obvious, because their files had a nonzero size. So I reran my full sstacks script from WITHIN the stacksCombo_m10 folder. Several files still had 0 bytes, so I had to rerun them by hand from the command line. 

In [8]:
cd stacksCombo_m10

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/stacksCombo_m10


In [9]:
!head sstacks_shell_11-17.sh

#!/bin/bash

sstacks -b 1 -c ./batch_1 -s ./PO010715_06.1 -s ./PO010715_27.1 -s ./PO010715_28.1 -s ./PO010715_29.1 -s ./GE011215_08.1 -s ./GE011215_09.1 -s ./GE011215_14.1 -s ./GE011215_15.1 -s ./NA021015_16.1 -s ./NA021015_21.1 -s ./GE011215_10.1 -s ./GE012315_01.1 -s ./PO020515_05.1 -s ./PO020515_09.1 -s ./PO020515_10.1 -s ./GE011215_07.1 -s ./GE011215_16.1 -s ./GE011215_29.1 -s ./NA021015_02.1 -s ./NA021015_03.1 -s ./NA021015_08.1 -s ./NA021015_13.1 -s ./GE012315_03.1 -s ./GE012315_22.1 -s ./GE012315_04.1 -s ./GE012315_05.1 -s ./GE012315_06.1 -s ./PO010715_19.1 -s ./PO031715_20.1 -s ./NA021015_10.1 -s ./GE011215_20.1 -s ./PO020515_03.1 -s ./PO020515_08.1 -s ./NA021015_17.1 -s ./NA021015_22.1 -s ./PO010715_11.1 -s ./PO020515_16.1 -s ./GE011215_21.1 -s ./GE011215_30.1 -s ./NA021015_14.1 -s ./NA021015_06.1 -s ./NA021015_09.1 -s ./PO020515_17.1 -s ./PO010715_17.1 -s ./PO020515_15.1 -s ./PO010715_10.1 -s ./GE011215_01.1 -s ./GE011215_24.1 -s ./PO031715_13.1 -s ./PO010715_08.1 -s ./PO02

See files `sstacks_out_b1`, `sstacks_out_b1.2` and `sstacks_out_b1.3` to look at the standard error output. Not sure why it won't work; no errors are being thrown!
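Rather than scanning the folder by eye for empty output, the 0-byte files can be listed automatically after each sstacks run. A small sketch, assuming the matches files follow a `*.matches.tsv*` naming pattern (that glob pattern is my assumption and may need adjusting to the actual file names):

```python
import glob
import os

def empty_files(directory, pattern="*.matches.tsv*"):
    """List files matching `pattern` in `directory` that are 0 bytes long."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(p for p in paths if os.path.getsize(p) == 0)
```

This only detects completely empty files; a file that is nonzero but truncated would still need a separate check.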

**STEP SIX: RUN POPULATIONS**

In order to filter the cstacks catalog of loci, I need to make a `.fasta` file of all of the unique loci + their corresponding sequences. I can get the sequences from the `.catalog.tags` file out of cstacks, but I have to run `populations` and get a genepop file to find the unique loci. 

In [10]:
pwd

u'/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/stacksCombo_m10'

In [19]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [20]:
!populations -b 1 \
-P stacksCombo_m10 \
-M scripts/PopMap_stacksCombo_m10.txt \
-t 36 \
-r 0.50 \
-p 2 \
-m 5 \
--genepop

Fst kernel smoothing: off
Bootstrap resampling: off
Percent samples limit per population: 0.5
Locus Population limit: 2
Minimum stack depth: 5
Log liklihood filtering: off; threshold: 0
Minor allele frequency cutoff: 0
Maximum observed heterozygosity cutoff: 1
Applying Fst correction: none.
Parsing population map...
The population map contained 129 samples, 6 population(s), 1 group(s).
Error: Unable to locate any file in input directory 'stacksCombo_m10/'.


WTF???

In [21]:
cd stacksCombo_m10

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/stacksCombo_m10


In [23]:
!zcat batch_1.catalog.alleles.tsv.gz | head

# cstacks version 1.42; catalog generated on 2016-11-16 21:25:45
0	1	1	AG	0	0
0	1	1	AT	0	0
0	1	1	TG	0	0
0	1	1	TT	0	0
0	1	3	AGC	0	0
0	1	3	CCC	0	0
0	1	3	CCG	0	0
0	1	3	CGC	0	0
0	1	5	A	0	0

gzip: stdout: Broken pipe


Maybe the files from cstacks or sstacks are corrupted in some way. Went back and reran everything. STILL outputting three files with no data; this time they were MUKHO samples (whereas last time they were GEOJE2014 samples). 

Tried running the following shell script, which runs each sample individually. 

In [24]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [25]:
!head sstacks_byline_shell_11-17.sh

#!/bin/bash
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/SO022216_01 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU011816_01 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_05 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_06 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_07 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_08 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_09 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU012816_10 -o stacksCombo_m10 -p 6
sstacks -b 1 -c stacksCombo_m10/batch_1 -s stacksCombo_m10/MU032315_01 -o stacksCombo_m10 -p 6


In [None]:
!populations -b 1 \
-P stacksCombo_m10 \
-M scripts/PopMap_stacksCombo_m10.txt \
-t 36 \
-r 0.50 \
-p 2 \
-m 5 \
--genepop

Now I need to generate the bowtie fasta file: 

In [5]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis


In [6]:
cd stacksCombo_m10

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10


In [7]:
!gzip -d batch_1.catalog.tags.tsv.gz

In [8]:
cd ../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [9]:
!python genBOWTIEfasta.py ../stacksCombo_m10/batch_1loci.txt ../stacksCombo_m10/batch_1.catalog.tags.tsv

In [10]:
!head seqsforBOWTIE.fa

>3
TGCAGGACGACCCGCTGGGGCGTGGTGCTCTGCCAAATGGGTGGCAGTGAAGGTCAGGGGTCTCAGGGGACCAGACCCGACCGGCAGAAACTGGCTCAGGGGATGTGGAATGTCACCTCACTGTGGGGAAAGGAGCTTTAGC
>5
TGCAGGGATATTAAATACAGGCACAGGGAACATGGCAGGGTGGAAGAGGATTGGCCTGCGCTGGAACTTTTTATCATAAATTGGAGTCTGCGGAGATAGACTGCCGACTTGAGTGTGCGGTTTGCTCCAACAGCAAAGTGTT
>14
TGCAGGTACACACGCTCAAGTCACGTTGAGGCGTGTACTGTATGTTAAGTACATGTTAAGTAATGGTTAAGTATCCTTCCTCAGACTGAGGAAGGGTTATTCACACTACACGTGTGAATAATTATTTACACAACACGTTTTA
>16
TGCAGGGGCCGGGGACGGGGGTGGTGTGTATGCTCGGACATGTTTGTCGAGAGTTTAGGGGTAGTCTGTCCTGCTTCATGGTGACAGAGATGGCAATGGGAGTGAGTGAGTGCGCTGAGTCGCTGTGTAGTCACAGAGTGCT
>17
TGCAGGTTCAGGATGATCTTGTCCGGCCCAAAGCTGTTACTGGCCACGCAGGTGTAGTTCCCAGAATCCTCGGCTTTCACCGTGCGTATGACCAGGCTGCCGTTGCCGTGGACGCTGCGTCGGCTGTCAATCACAACAGGCG


In [11]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis


In [12]:
cd stacksCombo_m10

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10


**STEP EIGHT: FILTER WITH BOWTIE AND BLAST**

In [13]:
!mkdir BOWTIE

In [14]:
!mv ../scripts/seqsforBOWTIE.fa BOWTIE/seqsforBOWTIE.fa

In [16]:
cd BOWTIE

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10/BOWTIE


In [15]:
!grep ">" seqsforBOWTIE.fa | wc -l #number of unique loci that were retained

14600


In [17]:
ls

[0m[34;42mbowtie-1.1.2[0m/  [01;32mseqsforBOWTIE.fa[0m*


In [18]:
#create bowtie reference
!bowtie-build seqsforBOWTIE.fa batch_1

Settings:
  Output files: "batch_1.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  seqsforBOWTIE.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 518300
Using parameters --bmax 388725 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 388725 --dcv 1024
Constructing suffix-array element generator
Building

In [19]:
ls

[0m[01;32mbatch_1.1.ebwt[0m*  [01;32mbatch_1.3.ebwt[0m*  [01;32mbatch_1.rev.1.ebwt[0m*  [34;42mbowtie-1.1.2[0m/
[01;32mbatch_1.2.ebwt[0m*  [01;32mbatch_1.4.ebwt[0m*  [01;32mbatch_1.rev.2.ebwt[0m*  [01;32mseqsforBOWTIE.fa[0m*


In [21]:
#align seqs.fasta file against itself
!bowtie -f -v 3 --sam --sam-nohead \
batch_1 \
seqsforBOWTIE.fa \
batch_1_BOWTIEout.sam

# reads processed: 14600
# reads with at least one reported alignment: 14600 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 14600 alignments to 1 output stream(s)


In [22]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [24]:
#filter out the sequences that aligned to sequences other than themselves
!python parseBowtie_DD.py ../stacksCombo_m10/BOWTIE/batch_1_BOWTIEout.sam ../stacksCombo_m10/BOWTIE/batch_1_BOWTIEout_filtered.fa

Number of Bowtie output lines read: 14600
Number of sequences written to output: 14600
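The idea behind this filter can be sketched as follows: read the headerless SAM output, collect every reference each query aligned to, and keep only queries whose sole target is themselves. This is an illustration of the logic, not the code of `parseBowtie_DD.py`. It relies on the standard SAM column order (query name in column 1, reference name in column 3, sequence in column 10).

```python
def self_only_loci(sam_lines):
    """Return {locus: sequence} for queries whose only alignment target is themselves."""
    targets = {}
    sequences = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue   # skip blank lines and any SAM header lines
        fields = line.rstrip("\n").split("\t")
        query, reference, sequence = fields[0], fields[2], fields[9]
        targets.setdefault(query, set()).add(reference)
        sequences[query] = sequence
    return {q: sequences[q] for q, refs in targets.items() if refs == {q}}
```

Note that bowtie reports a single alignment per read by default, so a filter like this only sees additional targets when bowtie is run with `-k` or `-a`.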


In [25]:
cd ../stacksCombo_m10

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10


In [26]:
#filter with BLAST
!mkdir BLAST

In [27]:
cd BLAST

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10/BLAST


In [28]:
!mv ../BOWTIE/batch_1_BOWTIEout_filtered.fa batch_1_BOWTIEout_filtered.fa

In [29]:
#make a blast database
!makeblastdb -in batch_1_BOWTIEout_filtered.fa \
-parse_seqids \
-dbtype nucl \
-out batch_1_BOWTIEfiltered



Building a new DB, current time: 11/17/2016 13:49:11
New DB name:   /mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10/BLAST/batch_1_BOWTIEfiltered
New DB title:  batch_1_BOWTIEout_filtered.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 14600 sequences in 0.271607 seconds.


In [30]:
#blast the fasta file against the database (itself)
!blastn -query batch_1_BOWTIEout_filtered.fa \
-db batch_1_BOWTIEfiltered \
-out batch_1_BowtieBlastFiltered

In [31]:
#filter out loci that aligned to loci other than themselves

In [32]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [33]:
!python checkBlastResults_DD.py \
../stacksCombo_m10/BLAST/batch_1_BowtieBlastFiltered \
../stacksCombo_m10/BLAST/batch_1_BOWTIEout_filtered.fa \
../stacksCombo_m10/BLAST/batch_1_BowtieBlastFiltered_GOOD.fa \
../stacksCombo_m10/BLAST/batch_1_BowtieBlastFiltered_BAD.fa


Identifying which loci are 'good' and 'bad' based on BLAST alignments...
Writing 'good' and 'bad' loci to their respective files...


In [34]:
cd ../stacksCombo_m10/BLAST/

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10/BLAST


In [36]:
#count number of loci retained
!grep ">" batch_1_BowtieBlastFiltered_GOOD.fa | wc -l

14303


In [37]:
!grep ">" batch_1_BowtieBlastFiltered_BAD.fa | wc -l

297


In [38]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis/stacksCombo_m10


In [39]:
!gzip batch_1.catalog.tags.tsv

So now I have my reference database in a fasta file!

**STEP NINE: RUN PSTACKS AGAINST NEW REFERENCE DATABASE**

## STOP: 

After groundtruthing the `.matches` file for an individual INCLUDED in the cstacks catalog, I found that the SNPs identified in that file DO NOT match the ACTUAL SNPs in the reads from the individual's `.tags` file. 


Have to stop here. Will try an earlier AND a later version of stacks this weekend to see if the same problem is still occurring. 