# Lane 1 (PE 150) Data
<br>
<br>
This notebook goes through the stacks pipeline, from `process_radtags` to `cstacks`, for the first lane of sequencing received for the Korean Pacific cod.
Cstacks was run for *de novo* assembly of reads into a reference catalog of loci. Sstacks was then run *without* additional filtering / screening of the reference catalog, to become familiar with the process. For future analysis, the parameters used in ustacks and cstacks will have to be optimized, and additional screening will be done to create a reference genome (see project workflow [here](https://github.com/mfisher5/mf-fish546-PCod/blob/master/Diagrams/PopGen_Workflow.md)). 
<br>
<br>
The primary purpose of this notebook is to run `process_radtags` to clean and demultiplex the data, and become more familiar with other steps in the stacks pipeline. This includes writing python scripts to generate shell scripts for one or more steps in the stacks pipeline. 
<br>
<br>
This notebook was translated from Evernote; the original [Evernote notebook](http://www.evernote.com/l/Aop4Hv2efMJKarQTFQ6NtlVES2W7fhalyvQ/) contains original notes written while writing the program, and has the related scripts, with parameters exactly as they were run here, attached. Note that some python scripts listed here that are in my github [scripts/](https://github.com/mfisher5/mf-fish546-2016/tree/master/scripts) folder may have been altered slightly since these analyses were run, since they are reused each time I run that step in the pipeline. 
<br>
<br>
Programs: FastQC, [stacks v. 1.42](http://catchenlab.life.illinois.edu/stacks/)
<br>
<br>



## process_radtags


**Notes on the Data:**

Paired end, 150bp

2 raw data files: R1 = forward, R2 = reverse

File path: D/Pacific\ cod/DataAnalysis/raw_data/L1_PE150 (*in Github: mf-fish546-2016/raw_data/L1_PE150*)

*[Flowcell Summary](https://github.com/mfisher5/mf-fish546-2016/blob/master/raw_data/FlowcellSummaries/Lane1_FlowcellSummary.png)*

## 10/7/2016

**Dealing with paired-end data**: From the stacks manual

________________________________________________________________________________________________________________________________
*If your data are paired-end, Illumina HiSeq data, in a directory called raw:*
<br>
<br>
`~/raw% ls`
<br>
`lane4_NoIndex_L004_R1_001.fastq  lane4_NoIndex_L004_R1_009.fastq  lane4_NoIndex_L004_R2_005.fastq
lane4_NoIndex_L004_R1_002.fastq  lane4_NoIndex_L004_R1_010.fastq  lane4_NoIndex_L004_R2_006.fastq`
<br>
<br>
Then you simply add the **-P flag.** process_radtags understands the Illumina naming scheme and will figure out how to properly pair the files together:

`% process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 -e sbfI -E phred33 -r -c -q`
________________________________________________________________________________________________________________________________


**Trim Length:** FastQC

Opened the original raw data files (not yet demultiplexed to individuals) in FastQC. Fragment length = 151 bp. The sequence quality tends to get pretty low within the last few bp of the forward sequence, so I trimmed to 149. In the reverse sequence the quality gets low much earlier (at about 120 bases), but Dan said that the reverse sequence is usually of poorer quality.

FastQC for forward and reverse sequences [download through github](https://github.com/mfisher5/mf-fish546-2016/tree/master/raw_data/L1_PE150/FastQC)

Or for quick access to the report, see the embedded HTML file in [Evernote](http://www.evernote.com/l/Aop4Hv2efMJKarQTFQ6NtlVES2W7fhalyvQ/)



### Running process_radtags

In [2]:
!pwd

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/mf-fish546-2016/notebooks


In [6]:
#navigate to DataAnalysis folder (in github, mf-fish546-2016)
cd ../../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [None]:
!mkdir samples
!mkdir samplesT140
!mkdir samplesT146
!mkdir samplesT142

**barcodes.txt** A barcodes file is needed for process_radtags. I used the barcodes file I sent in with the sequencing, which is a simple text file with a list of barcodes. For the sake of simplicity in running process_radtags, I saved the barcode file in the root directory `DataAnalysis`, or on Github `mf-fish546-2016`

In [8]:
!head barcodesL1.txt

AAACGG
GCCGTA
ACTCTT
TTCTAG
ATTCCG
CCGCAT
CGAGGC
CGCAGA
GAGAGA
GGGGCG


**(1)** Using `--filter_illumina`, trimmed to 149 bp (forgot about barcodes being trimmed off!! So this doesn't actually remove any bases)

In [None]:
process_radtags -p raw_data/L1_PE150/ \
-P \
-i gzfastq -y gzfasta \
-o samples \
-b barcodesL1.txt \
-e sbfI -E phred33 \
-r -c -q --filter_illumina \
-t 149

532905602 total reads

0 failed Illumina filtered reads;

48761576 ambiguous barcode drops;

248505677 low quality read drops;

235638349 retained reads. (44%)

**! Will not use filter_illumina; too strict on this data !**

**(2)** no `--filter_illumina`, trimmed to 140 bp (remove last 5 bases) (151 - 6 (barcodes) - 5 (trim))

In [None]:
process_radtags -p raw_data/L1_PE150/ \
-P \
-i gzfastq -y gzfasta \
-o samplesT140 \
-b barcodesL1.txt \
-e sbfI -E phred33 \
-r -c -q -t 140

532905602 total reads

48761576 ambiguous barcode drops;

6222291 low quality read drops;

30459515 ambiguous RAD-Tag drops;

447462220 retained reads. (83.97%)
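
The trim-length arithmetic used for the `-t` values in these runs can be sketched in Python. This is only an illustration of the notebook's own calculation (151 bp read, 6 bp barcode); the function names are hypothetical, and it assumes, as the notes above imply, that `-t` refers to the read length left after the barcode is removed.

```python
RAW_READ_LENGTH = 151  # read length reported by FastQC
BARCODE_LENGTH = 6     # inline barcode removed during demultiplexing


def trim_target(bases_to_remove, read_length=RAW_READ_LENGTH,
                barcode_length=BARCODE_LENGTH):
    """Return the -t value that removes `bases_to_remove` bases from the read end."""
    if bases_to_remove < 0:
        raise ValueError("bases_to_remove must be zero or positive")
    return read_length - barcode_length - bases_to_remove


def bases_removed(t_value, read_length=RAW_READ_LENGTH,
                  barcode_length=BARCODE_LENGTH):
    """Return how many bases a given -t value trims (0 if -t exceeds the read)."""
    return max(0, read_length - barcode_length - t_value)


for t_value in (149, 146, 142, 140):
    print(f"-t {t_value}: removes {bases_removed(t_value)} bases")
```

Under these assumptions, any `-t` value above 145 leaves the reads untrimmed, which matches the notes for the 149 and 146 runs.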

## 10/10/2016

**(3)** no filter_illumina, trimmed to 146 bp (remove NO bases)

In [None]:
process_radtags -p raw_data/L1_PE150/ \
-P \
-i gzfastq -y gzfasta \
-o samplesT146 \
-b barcodesL1.txt \
-e sbfI -E phred33 \
-r -c -q -t 146

532905602 total reads

48761576 ambiguous barcode drops;

6222291 low quality read drops;

30459515 ambiguous RAD-Tag drops;

447462220 retained reads. (83.97%)

**(4)** no filter_illumina, trimmed to 142 bp (remove last 3 bases)

In [None]:
process_radtags -p raw_data/L1_PE150/ \
-P \
-i gzfastq -y gzfasta \
-o samplesT142 \
-b barcodesL1.txt \
-e sbfI -E phred33 \
-r -c -q -t 142

532905602 total reads

48761576 ambiguous barcode drops;

6414716 low quality read drops;

30459515 ambiguous RAD-Tag drops;

447269795 retained reads. (83.93%)

**! Will be using this output for ustacks !**
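
To compare the four runs side by side, the retained percentages can be recomputed from the read counts reported above. This is a small sketch; the run labels and the helper name `percent_retained` are my own, while the counts are copied from the `process_radtags` output in this notebook.

```python
TOTAL_READS = 532_905_602  # total reads reported by every run

# Retained reads per run, copied from the process_radtags summaries above.
RETAINED = {
    "(1) --filter_illumina, -t 149": 235_638_349,
    "(2) -t 140": 447_462_220,
    "(3) -t 146": 447_462_220,
    "(4) -t 142": 447_269_795,
}


def percent_retained(retained, total=TOTAL_READS):
    """Return the percentage of reads kept after filtering."""
    return 100 * retained / total


for label, retained in RETAINED.items():
    print(f"{label}: {percent_retained(retained):.2f}% retained")
```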

In [None]:
# to look at the new files generated: 
cd samplesT142
!ls | head -n 10

In [12]:
cd ../ #return to root directory

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


#### Removing unnecessary files

Since I won't be using the files produced by  **(1)** or **(2)** above, I deleted these from my computer. 

### Renaming Files

`process_radtags` has named all of my files by barcode, but it is easier for me to refer to them by sample ID (so that I know which samples belong to which populations).


**(1)** I started out with a .csv file that has all of the barcodes next to their sample IDs. In Excel, I inserted a column before the barcodes and filled it with "mv", then saved the result as a tab-delimited text file, "L1_mv_barcodesTOsample.txt".

In [13]:
cd scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [14]:
!head L1_mv_barcodesTOsample.txt

mv	AAACGG	PO010715_06
mv	GCCGTA	PO010715_27
mv	ACTCTT	PO010715_28
mv	TTCTAG	PO010715_29
mv	ATTCCG	GE011215_08
mv	CCGCAT	GE011215_09
mv	CGAGGC	GE011215_14
mv	CGCAGA	GE011215_15
mv	GAGAGA	NA021015_16
mv	GGGGCG	NA021015_21


**(2)** I ran a python script in the command line, to convert the list of barcodes / samples into a list of barcode / sample fasta files. The python script is specifically for paired end output. Paired end data has four "sample" fasta files output per individual: 1. forward sequences, 2. reverse sequences, 3. removed forward sequences from quality filtering, and 4. removed reverse sequences from quality filtering. Single read data has only (1) and (3). *I also have a script in the folder that handles filename conversions for single read data*

In [None]:
!python filenames_PEconversion.py L1_mv_barcodesTOsample.txt

This outputs two files named "new_filenames1.txt" and "new_filenames2.txt". Since I'm only going to be working with the forward sequences for the rest of the stacks pipeline, I'll only need `new_filenames1.txt` 

In [17]:
!mv new_filenames1.txt L1_filename_conversion.txt
!rm new_filenames2.txt
!head L1_filename_conversion.txt

mv	sample_AAACGG.1.fa.gz	PO010715_06.1.fa.gz
mv	sample_GCCGTA.1.fa.gz	PO010715_27.1.fa.gz
mv	sample_ACTCTT.1.fa.gz	PO010715_28.1.fa.gz
mv	sample_TTCTAG.1.fa.gz	PO010715_29.1.fa.gz
mv	sample_ATTCCG.1.fa.gz	GE011215_08.1.fa.gz
mv	sample_CCGCAT.1.fa.gz	GE011215_09.1.fa.gz
mv	sample_CGAGGC.1.fa.gz	GE011215_14.1.fa.gz
mv	sample_CGCAGA.1.fa.gz	GE011215_15.1.fa.gz
mv	sample_GAGAGA.1.fa.gz	NA021015_16.1.fa.gz
mv	sample_GGGGCG.1.fa.gz	NA021015_21.1.fa.gz
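
The conversion from barcode / sample ID rows to forward-read filename rows can be sketched as below. This is not the actual `filenames_PEconversion.py` script (see the scripts folder for that); it is a hypothetical minimal version that assumes tab-separated `mv`, barcode, sample ID columns and the `sample_BARCODE.1.fa.gz` naming shown in the output above.

```python
def forward_rename_lines(mapping_lines, suffix=".1.fa.gz"):
    """Turn 'mv<TAB>BARCODE<TAB>SAMPLE_ID' rows into rename rows for forward reads."""
    rename_lines = []
    for line in mapping_lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        command, barcode, sample_id = fields
        old_name = f"sample_{barcode}{suffix}"
        new_name = f"{sample_id}{suffix}"
        rename_lines.append("\t".join([command, old_name, new_name]))
    return rename_lines


example_rows = ["mv\tAAACGG\tPO010715_06\n", "mv\tGCCGTA\tPO010715_27\n"]
for row in forward_rename_lines(example_rows):
    print(row)
```

Passing `suffix=".2.fa.gz"` would give the corresponding rows for the reverse reads.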


**(3)** I copied and pasted the contents of the new `L1_filename_conversion.txt` file directly into the command line. This renames the files in *each of the samples folders* by sample ID, rather than by barcode.

In [22]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


In [None]:
cd samplesT142
#copy and paste commands from L1_filename_conversion
cd ../samplesT146
#copy and paste commands from L1_filename_conversion
cd ../
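
As an alternative to pasting the `mv` commands into the terminal, the renames could be applied from Python. This is a sketch only, not what was run here: `apply_renames` is a hypothetical helper, and it assumes each line of the conversion file has the form `mv old_name new_name`. The demo works in a temporary folder so that no real data is touched.

```python
import tempfile
from pathlib import Path


def apply_renames(conversion_lines, folder):
    """Rename files in `folder` according to 'mv old_name new_name' lines."""
    folder = Path(folder)
    renamed = []
    for line in conversion_lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] != "mv":
            continue  # ignore anything that is not a rename row
        _, old_name, new_name = fields
        source, target = folder / old_name, folder / new_name
        # Only rename when the source exists and nothing would be overwritten.
        if source.exists() and not target.exists():
            source.rename(target)
            renamed.append(new_name)
    return renamed


# Demo in a throwaway directory with an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample_AAACGG.1.fa.gz").touch()
    demo_result = apply_renames(
        ["mv\tsample_AAACGG.1.fa.gz\tPO010715_06.1.fa.gz"], tmp
    )
    demo_listing = sorted(path.name for path in Path(tmp).iterdir())
    print(demo_result, demo_listing)
```

Skipping rows whose target already exists makes the helper safe to rerun on a folder that has been partly renamed.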

In [23]:
cd samplesT142

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/samplesT142


In [25]:
ls | head -n 10

[0m[01;32mGE011215_01.1.fa.gz[0m*
[01;32mGE011215_07.1.fa.gz[0m*
[01;32mGE011215_08.1.fa.gz[0m*
[01;32mGE011215_09.1.fa.gz[0m*
[01;32mGE011215_10.1.fa.gz[0m*
[01;32mGE011215_14.1.fa.gz[0m*
[01;32mGE011215_15.1.fa.gz[0m*
[01;32mGE011215_16.1.fa.gz[0m*
[01;32mGE011215_20.1.fa.gz[0m*
[01;32mGE011215_21.1.fa.gz[0m*


In [26]:
cd ../

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis


## 10/11/2016

### Running ustacks

**! For now, I am only running ustacks on the forward reads !**

**(1)** Using the `ustacks_replaceIDs.py` script, I generated all of the ustacks code for each individual forward sample file. You can alter the ustacks code within this script (line 6). This file is designed to keep me from having to copy and paste all of my new sample IDs into the same ustacks code.
<br>
Default Parameters: **-m 5 -M 3**

In [28]:
cd scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [30]:
!python ustacks_replaceIDs.py L1_filename_conversion.txt #generates new_ustacks_shell
!mv new_ustacks_shell ../L1F_ustacks_shell #rename script to make it more informative, move up one directory

**(2)** I also opened the `L1F_ustacks_shell` file and added: "#!/bin/bash" at the top so that it runs as a shell script.

**(3)** I ran the ustacks shell script from the command line.

In [None]:
cd ../
!chmod +x L1F_ustacks_shell
!./L1F_ustacks_shell

This output 4 .tsv.gz files per individual: (1) alleles, (2) models, (3) SNPs, (4) tags

<br>
<br>
I then moved the `L1F_ustacks_shell` into the `scripts/Lane1` folder. 

In [None]:
!mv L1F_ustacks_shell scripts/Lane1/L1F_ustacks_shell
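
The script-generation step in **(1)** and **(2)** above can be sketched as follows. This is not the real `ustacks_replaceIDs.py`; it is a hypothetical minimal version. The `-m 5 -M 3` values come from the notes above, while the input folder, output folder, thread count, and the sequential `-i` IDs are assumptions made for illustration.

```python
def ustacks_script(conversion_lines, sample_dir="samplesT142", out_dir="stacks",
                   min_depth=5, max_dist=3, threads=6):
    """Build a bash script with one ustacks command per renamed forward file."""
    lines = ["#!/bin/bash"]  # shebang so the file runs as a shell script
    sample_number = 1        # ustacks needs a unique integer ID per sample (-i)
    for line in conversion_lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        new_name = fields[2]  # third column holds the sample-ID filename
        lines.append(
            f"ustacks -t gzfasta -f {sample_dir}/{new_name} -o {out_dir} "
            f"-i {sample_number} -m {min_depth} -M {max_dist} -p {threads}"
        )
        sample_number += 1
    return "\n".join(lines) + "\n"


example = ["mv\tsample_AAACGG.1.fa.gz\tPO010715_06.1.fa.gz"]
print(ustacks_script(example))
```

Writing the returned string to a file and running `chmod +x` on it would reproduce the manual steps described above.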

## 10/12/2016

### Running cstacks

**(1)** I do not have a reference genome, so to create my catalog I used 3 individuals from each of the populations represented in the data set (total of 12 individuals; 4 populations = Namhae, Geoje, Pohang, and Yellow Sea Block). I only have 3 individuals from the Yellow Sea population, so they will all be used for the catalog. For the remaining populations, I want to pick the individuals with the most data available, i.e. the largest files in the process_radtags output (in the `samplesT142` folder). Since the files are gzipped, I compared them by the number of lines in each decompressed file, which can be counted with a pipeline on the command line:

In [None]:
#starting in the root directory, DataAnalysis (github = mf-fish546-2016)
cd samplesT142
zcat [sampleID].1.fa.gz | wc -l >> ../new_LineCounts.txt   #using >> appends, rather than overwrites
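
The same line count can also be taken in Python with the standard `gzip` module. This is a sketch of an alternative to the `zcat | wc -l` pipeline, not what was run here; the helper names are hypothetical, and the demo writes a tiny made-up FASTA file to a temporary folder.

```python
import gzip
import tempfile
from pathlib import Path


def count_lines(gz_path):
    """Count lines in a gzipped text file (equivalent to zcat file | wc -l)."""
    with gzip.open(gz_path, "rt") as handle:
        return sum(1 for _ in handle)


def line_counts(folder, pattern="*.1.fa.gz"):
    """Return {filename: line count} for forward-read files in `folder`."""
    counts = {}
    for path in sorted(Path(folder).glob(pattern)):
        if ".rem." in path.name:
            continue  # skip the 'removed reads' files from quality filtering
        counts[path.name] = count_lines(path)
    return counts


# Demo with a two-read FASTA file (invented sequences).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "PO010715_06.1.fa.gz"
    with gzip.open(demo_file, "wt") as handle:
        handle.write(">read1\nACGT\n>read2\nTTGA\n")
    demo_counts = line_counts(tmp)
    print(demo_counts)
```

Because each FASTA record here takes two lines (header plus sequence), the number of reads is half the line count.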

 *(a)* Rather than typing this for every individual, I wrote a script that generates (1) all of the line count commands, and (2) a list of the gzfasta file names, using my original document, "L1_mv_barcodesTOsample.txt".

In [41]:
cd scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [42]:
!python cstacks_generateLineCountCode.py  L1_mv_barcodesTOsample.txt

In [43]:
!head new_cstacks_linecount

zcat PO010715_06.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat PO010715_27.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat PO010715_28.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat PO010715_29.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat GE011215_08.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat GE011215_09.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat GE011215_14.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat GE011215_15.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat NA021015_16.1.fa.gz | wc -l >> ../new_LineCounts.txt
zcat NA021015_21.1.fa.gz | wc -l >> ../new_LineCounts.txt


In [44]:
!head new_cstacks_linecountSamples

PO010715_06.1.fa.gz
PO010715_27.1.fa.gz
PO010715_28.1.fa.gz
PO010715_29.1.fa.gz
GE011215_08.1.fa.gz
GE011215_09.1.fa.gz
GE011215_14.1.fa.gz
GE011215_15.1.fa.gz
NA021015_16.1.fa.gz
NA021015_21.1.fa.gz


In [45]:
cd ../samplesT142

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/samplesT142


In [47]:
#copy and paste the text from file "new_cstacks_linecount" into the command line
#this produces a file "new_LineCounts.txt" in the root directory
cd ../scripts

/mnt/hgfs/Shared Drive D/Pacific cod/DataAnalysis/scripts


In [48]:
!mv new_cstacks_linecount Lane1/Lane1_cstacks_linecount
!mv new_cstacks_linecountSamples Lane1/Lane1_cstacks_linecountSamples

 *(b)* I then copied and pasted the gzfasta file names and the line counts side-by-side in excel: file named `Lane1_sortedFilesbySize` in the `scripts/` folder.
<br> 
I sorted by population and # of lines. From this list, I copied the filenames of the 3 largest files back into a text document, "samples_for_cstacks.txt". NOTE: I labeled the batch number of that group at the top. ALL lines that are NOT the filenames of this particular batch should be hashed out (#) for the next code.

In [54]:
!head samples_for_cstacks.txt

#
#batch 1: 12 individuals (3 from each)
YS121315_14.1.fa.gz
YS121315_10.1.fa.gz
YS121315_08.1.fa.gz
GE011215_30.1.fa.gz
GE011215_15.1.fa.gz
GE011215_14.1.fa.gz
NA021015_06.1.fa.gz
NA021015_13.1.fa.gz
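
The sort-and-pick step done in Excel could also be sketched in Python. This is an illustration only: the helper is hypothetical, it assumes the first two letters of each filename identify the population (GE, NA, PO, YS), and the line counts in the demo are invented placeholders, not real values from this lane.

```python
from collections import defaultdict


def largest_per_population(counts, n=3, prefix_len=2):
    """Pick the `n` files with the most lines from each population prefix."""
    by_population = defaultdict(list)
    for filename, n_lines in counts.items():
        by_population[filename[:prefix_len]].append((n_lines, filename))
    chosen = {}
    for population, entries in by_population.items():
        entries.sort(reverse=True)  # largest line counts first
        chosen[population] = [name for _, name in entries[:n]]
    return chosen


# Placeholder counts for illustration only.
demo_counts = {
    "GE011215_14.1.fa.gz": 300,
    "GE011215_15.1.fa.gz": 400,
    "GE011215_08.1.fa.gz": 100,
    "GE011215_09.1.fa.gz": 200,
    "YS121315_14.1.fa.gz": 50,
}
print(largest_per_population(demo_counts, n=3))
```

A population with fewer than `n` files, such as Yellow Sea here, simply contributes all of its files.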


**(2)** I used the  "samples_for_cstacks.txt" file and the code below to generate the code for cstacks.

In [None]:
!python cstacks_generateCode2.py samples_for_cstacks.txt

**(3)** I **renamed the batch number,** then copied and pasted the generated code into the command line to run.

In [None]:
cd ../ #to root directory
mkdir stacks #make folder "stacks" for output

!cstacks -b 1 \
-s stacks/YS121315_14.1 -s stacks/YS121315_10.1 -s stacks/YS121315_08.1 -s stacks/GE011215_30.1 -s stacks/GE011215_15.1 -s stacks/GE011215_14.1 -s stacks/NA021015_06.1 -s stacks/NA021015_13.1 -s stacks/NA021015_16.1 -s stacks/PO010715_17.1 -s stacks/PO020515_10.1 -s stacks/PO020515_08.1 \
-o stacks -n 3 -p 6

This output 3 files into the stacks folder: (1) batch_1.catalog.alleles, (2) batch_1.catalog.tags, (3) batch_1.catalog.snps.
I copied the standard output from the terminal into a text file, "cstacksOut_batch1".

From this output, it looks like every time a new locus is discovered in any sample it is added, even if the locus is only found in that sample.
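
The command-generation step in **(2)** above can be sketched as follows. This is not the actual `cstacks_generateCode2.py`; it is a hypothetical minimal version that assumes lines starting with `#` are skipped and that the `.fa.gz` extension is dropped to form the `-s` prefix, matching the command shown above. The `-n 3 -p 6` values are taken from that command.

```python
def cstacks_command(sample_lines, batch_id, stacks_dir="stacks",
                    mismatches=3, threads=6):
    """Build a cstacks command from a list of sample filenames."""
    sample_args = []
    for line in sample_lines:
        name = line.strip()
        if not name or name.startswith("#"):
            continue  # skip blank lines and commented-out lines
        if name.endswith(".fa.gz"):
            name = name[: -len(".fa.gz")]  # cstacks takes the file prefix
        sample_args.append(f"-s {stacks_dir}/{name}")
    return (
        f"cstacks -b {batch_id} " + " ".join(sample_args)
        + f" -o {stacks_dir} -n {mismatches} -p {threads}"
    )


example_lines = ["#", "#batch 1: 12 individuals (3 from each)",
                 "YS121315_14.1.fa.gz", "YS121315_10.1.fa.gz"]
print(cstacks_command(example_lines, batch_id=1))
```

Taking the batch number as an argument would avoid having to rename it by hand in the generated command.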

In [None]:
#renamed batch numbers to be more informative (total of 12 individuals in catalog)
cd stacks
!mv batch_1.catalog.alleles batch_12.catalog.alleles
!mv batch_1.catalog.tags batch_12.catalog.tags
!mv batch_1.catalog.snps batch_12.catalog.snps

### Choosing the number of individuals per pop to build a catalog: from Charlie's notes
The number of loci retained much farther down the pipeline was greatest when 10 individuals per population were chosen to build the catalog (tested n = 10, 25, 50, all). Not sure if using fewer than ten would further increase the number of loci retained. This trend held across all error rates, for both corrected and uncorrected genotypes.

So...

I reran everything above with 10 individuals from GEO, POH, NAM, and the 3 individuals from YS. (batch = 33)


In [None]:
cd ../

cstacks -b 33 \
-s stacks/YS121315_14.1 -s stacks/YS121315_10.1 -s stacks/YS121315_08.1 \
-s stacks/GE011215_07.1 -s stacks/GE012315_06.1 -s stacks/GE012315_11.1 -s stacks/GE012315_03.1 -s stacks/GE011215_09.1 -s stacks/GE011215_16.1 -s stacks/GE011215_01.1 -s stacks/GE011215_30.1 -s stacks/GE011215_15.1 -s stacks/GE011215_14.1 \
-s stacks/NA021015_10.1 -s stacks/NA021015_17.1 -s stacks/NA021015_21.1 -s stacks/NA021015_02.1 -s stacks/NA021015_09.1 -s stacks/NA021015_22.1 -s stacks/NA021015_14.1 -s stacks/NA021015_06.1 -s stacks/NA021015_13.1 -s stacks/NA021015_16.1 \
-s stacks/PO020515_03.1 -s stacks/PO010715_10.1 -s stacks/PO020515_14.1 -s stacks/PO010715_11.1 -s stacks/PO020515_05.1 -s stacks/PO020515_17.1 -s stacks/PO020515_15.1 -s stacks/PO010715_17.1 -s stacks/PO020515_10.1 -s stacks/PO020515_08.1 \
-o stacks -n 3 -p 6

This output 3 files into the stacks folder: (1) batch_33.catalog.alleles, (2) batch_33.catalog.tags, (3) batch_33.catalog.snps. I saved the standard output in "cstacksOut_batch33".

**!! NOTE !!** 

I later re-ran this WITHOUT the Yellow Sea samples (batch_303), so this batch ended up being deleted from my files. 