# Purpose

The purpose of this notebook is to document my full application of the Stacks pipeline on my cod data. I've been writing scripts to work on a subset of my files, and now I want to start running them on all of my files.

## ``process_radtags`` 

I don't have to run this on my data because the previous user of the data already did. In my script for running ``process_radtags``, however, I also use this opportunity to make folders for the output of each of the Stacks programs. So I'm doing that manually here. Below, you can see these directories plus two folders of cod data (two lanes of data):

![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%203.10.45%20PM.png?raw=true)
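
I made these folders by hand, but the same step could be sketched in Python. This is only a sketch under assumptions: the function name `make_output_dirs` is hypothetical, and the folder names are the ones I use for the Stacks programs later in this notebook.

```python
import os

# Output folders used later in this notebook, one per Stacks program
STACKS_DIRS = ["process_radtags_out", "ustacks_out", "cstacks_out", "sstacks_out"]

def make_output_dirs(base_dir, names=STACKS_DIRS):
    """Create one output folder per Stacks program under base_dir.

    Folders that already exist are left alone. Returns the paths that were created.
    """
    made = []
    for name in names:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            made.append(path)
    return made
```

Calling it a second time on the same directory creates nothing and returns an empty list, so it is safe to rerun.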

## ``ustacks`` 

I'll be working off of my hard drive, so first I changed into the data directory there.

In [1]:
cd /Volumes/Time\ Machine\ Backups/Cod-Time-Series-Data/ 

/Volumes/Time Machine Backups/Cod-Time-Series-Data


Another issue is that I have two lanes. It'd be easiest to work in a single directory, but the same barcodes are used in both lanes, so the file names would collide. So my first step is to rename the files by adding *_L1* for Library 1 and *_L2* for Library 2. I'll do that in Python, inserting the tag right after the barcode (before the read number and the file extension).
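
As a minimal sketch of that renaming rule (separate from my actual script; the function name `add_lib_tag` is hypothetical), the tag goes after the barcode and before the read number, so `AAACGG_1.fq.gz` becomes `AAACGG_L1_1.fq.gz`:

```python
def add_lib_tag(filename, tag):
    """Insert a library tag after the barcode in a name like BARCODE_READ.fq.gz."""
    stem, ext = filename.split(".", 1)   # e.g. "AAACGG_1" and "fq.gz"
    barcode, read = stem.split("_", 1)   # e.g. "AAACGG" and "1"
    if read.startswith("L"):
        # The part after the barcode already starts with a library tag
        raise ValueError("file appears to be renamed already: " + filename)
    return barcode + tag + "_" + read + "." + ext
```

Raising an error on a name that already carries a tag is the same safeguard as the check in my script: it keeps a second run from stacking tags onto the file names.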

Here's the content of the script ``add_lib_to_filename.py``:

```
# arguments
# [1] directory name with files to rename 
# [2] text you would like to add to the end of each filename before file extensions

# assumptions
# [1] file extension = fq.gz
# [2] you are one directory above the directory that you want to rename the files of
# [3] your files aren't already renamed! --- write an if else loop to make sure it hasn't been run yet when you have the time

import sys
import subprocess

# make a text file that has each thing in this directory on a line; each line should be each file in the lane

string1 = "" # initiate string

string1 += 'ls ' + sys.argv[1] + ' > ./dircontents.txt' # add ls, wd, redirect to text file

firstshell = open("getcontents_shell.txt", "w")
firstshell.write(string1)
firstshell.close()

subprocess.call(["sh getcontents_shell.txt"], shell = True)

# now that file is made for contents of directory, read that file in
contents = open("dircontents.txt", "r")


# second shell = renaming script
rename_w_lib_shell = open("rename_w_lib_shell.txt", "w")

string2 = ""
string2 += "cd " + sys.argv[1] + "\n"


# loop that will break the script if it's already run to make sure you don't rename files into something wrong!
for line in contents:
	linelist = line.strip().split(".")
	if linelist[0][-4] == "L":
		print("CHECK TO SEE IF YOU HAVE ALREADY RENAMED THESE FILES. SCRIPT PAUSED AS WARNING BECAUSE FILES APPEAR RENAMED.")
		sys.exit() # exit script
	else:
		continue
contents.close()



contents = open("dircontents.txt", "r")
for line in contents:
	original = line.strip()
	filename_list = line.strip().split(".") # make list out of whole file name with extensions
	filename_wo_ext = filename_list[0]
	file_ext = "." + filename_list[1] + "." + filename_list[2]
	wordlist = filename_wo_ext.split("_") # split filename without extensions into parts by underscore
	newfilename = wordlist[0] + sys.argv[2] + "_" + wordlist[1]
	newfilename += file_ext
	string2 += "mv" + "\t" + original + "\t" + newfilename + "\n"

rename_w_lib_shell.write(string2)
rename_w_lib_shell.close()
contents.close()

# call renaming shell script
subprocess.call(["sh rename_w_lib_shell.txt"], shell = True)
```

Running the script on each lane looks like this:

In [31]:
!python add_lib_to_filename.py process_radtags_out/cod_lib1 _L1

In [32]:
!python add_lib_to_filename.py process_radtags_out/cod_lib2 _L2

Now my files are renamed with the library tag after the barcode and before the 1 or 2 that marks forward or reverse reads in the paired-end data.
<br><br>
From
<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%206.15.59%20PM.png?raw=true)
<br>
To
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%206.15.48%20PM.png?raw=true)
<br>
<br>
And then I can group the two directories of libraries together.
<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-17%20at%209.09.21%20AM.png?raw=true)
<br>
Now I want to run ``ustacks`` on all the files in this folder. I can use my handy dandy script I wrote a couple of weeks ago. It looks like this:

```
##########################################################################################
# 
### ``ustacks``
#
# PURPOSE: This step aligns identical RAD tags within an individual into stacks & provides data for calling SNPs
# INPUT: fastq or gzfastq files
# OUTPUT: 4 files - alleles, models, snps, tags
# 
# 
### WHEN RUNNING THIS SCRIPT, YOUR INPUTS AT THE COMMAND LINE ARE:
# python  
# {0}[pipeline filename] 
# {1}[barcodes & samples textfile] 
# {2}[start directory] 
# {3}[end directory]
# 
### DEPENDENCIES
# 
# [1] You need a file where first column is barcode and second is unique sample name
# 
### WARNINGS
# 
# [1] ustacks only works when your working directory is one directory above the folders you are
#		calling and storing data in 
# 
##########################################################################################

# --- [A] call necessary modules

import subprocess
import sys 





# --- [B] Rename your files by the sample name

new_file = open("new_filenames_shell.txt", "w") # new txt file
dir = sys.argv[2] # directory files that need names changed
firststr = "cd " + dir + "\n" + "pwd\n"
new_file.write(firststr)

myfile = open(sys.argv[1], "r")				# myfile =  tab delimited file mentioned, called after your script

for line in myfile:						   # loop through each line
	linelist=line.strip().split()		   # splits into list by white space
	barcodefile1 = linelist[0] + "_1.fq.gz" # forward
	barcodefile2 = linelist[0] + "_2.fq.gz" # reverse
	samplefile1 = linelist[1] + "_1.fq.gz" # forward
	samplefile2 = linelist[1] + "_2.fq.gz" # reverse
	newstring = "mv" + "\t" + barcodefile1 + "\t" + samplefile1 + "\n" + "mv" + "\t" + barcodefile2 + "\t" + samplefile2 + "\n"
	# print newstring  # troubleshooting loop
	new_file.write(newstring)

myfile.close()
new_file.close()

# run the script you just made as a shell script to rename your files
subprocess.call(['sh new_filenames_shell.txt'], shell=True)

 



### --- [C] Make shell script to run all samples through command line through ``ustacks``

# ``ustacks`` requires an arbitrary integer for every sample, although unclear how it gets used as it does not become the name of the file


# name your 'from' and 'to' directories that will go in each line of your ustacks shell script
dirfrom = sys.argv[2]
dirto = sys.argv[3]

newfile2 = open("ustacks_shell.txt", "w")	 # make ustacks shell script to run through terminal
myfile = open("new_filenames_shell.txt", "r")	#open the file with a list of barcodes + sample IDs

# dir = sys.argv[2] # directory files that need names changed
# firststr = "cd " + dir + "\n"
# new_file.write(firststr)

dir2 = sys.argv[2] # directory with files that we want to run ustacks on

lines = myfile.readlines()[2:] # skip first two lines because just cd and pwd


ID_int = 1								# start integer counter
for line in lines: 			# for each "mv" line in the renaming shell script
	linelist=line.strip().split()	
	sampID = linelist[2] 					# save the third item (the new sample filename) as "sampID"
	if ID_int < 10: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i 00" + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with 2 leading 0s)
	elif ID_int < 100: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i 0" + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with 1 leading 0)
	else: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i " + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with no leading 0s)
	newfile2.write(ustacks_code)	#append this new line of code to the output file
	ID_int += 1

myfile.close()
newfile2.close()

# run this new script through the terminal
subprocess.call(['sh ustacks_shell.txt'], shell=True)

##########################################################################################

```
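
The three-way branch on `ID_int` in the script is just zero-padding the ID to three digits. A minimal sketch of the same command line built with string formatting instead (the function name `ustacks_command` is hypothetical; the flags are the ones my script already passes to ``ustacks``):

```python
def ustacks_command(sample_file, dir_from, dir_to, sample_id):
    """Build one ustacks command line with the integer ID zero-padded to 3 digits."""
    return ("ustacks -t gzfastq -f {0}/{1} -r -d -o {2} -i {3:03d} -m 5 -M 3 -p 10"
            .format(dir_from, sample_file, dir_to, sample_id))

# Example: the seventh sample gets the ID 007
print(ustacks_command("2005_387_1.fq.gz", "process_radtags_out", "ustacks_out", 7))
```

The `{3:03d}` format pads IDs below 100 with leading zeros and leaves larger IDs unchanged, so no separate branches are needed.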

And I run the script like this:

In [7]:
!python pypipe_ustacks.py barcodes_samplenames.txt ./process_radtags_out ./ustacks_out

/Volumes/Time Machine Backups/Cod-Time-Series-Data/process_radtags_out
mv: AAACGG_L1_1.fq.gz: No such file or directory
mv: AAACGG_L1_2.fq.gz: No such file or directory
mv: GCCGTA_L1_1.fq.gz: No such file or directory
mv: GCCGTA_L1_2.fq.gz: No such file or directory
mv: ACTCTT_L1_1.fq.gz: No such file or directory
mv: ACTCTT_L1_2.fq.gz: No such file or directory
mv: TTCTAG_L1_1.fq.gz: No such file or directory
mv: TTCTAG_L1_2.fq.gz: No such file or directory
mv: ATTCCG_L1_1.fq.gz: No such file or directory
mv: ATTCCG_L1_2.fq.gz: No such file or directory
mv: CCGCAT_L1_1.fq.gz: No such file or directory
mv: CCGCAT_L1_2.fq.gz: No such file or directory
mv: CGAGGC_L1_1.fq.gz: No such file or directory
mv: CGAGGC_L1_2.fq.gz: No such file or directory
mv: CGCAGA_L1_1.fq.gz: No such file or directory
mv: CGCAGA_L1_2.fq.gz: No such file or directory
mv: GAGAGA_L1_1.fq.gz: No such file or directory
mv: GAGAGA_L1_2.fq.gz: No such file or directory
mv: GGGGCG_L1_1.fq.gz: No such file or director

The kernel crashed at some point in the middle of the night, so I don't have all my data :( Going to try to rerun some of it now.

I'm only going to use the forward reads for this class project, so I manually removed the reverse read files that ustacks made and put them in a different folder.

## ``cstacks``

Then I run ``cstacks`` on the 10 samples with the most reads.
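
Picking those samples comes down to counting reads per file and sorting numerically. A minimal sketch, assuming the line counts have already been collected (for example with `gunzip -c file | wc -l`) and that each FASTQ read takes four lines. The function name `top_samples` and the example names and counts are made up for illustration:

```python
def top_samples(sample_names, line_counts, n):
    """Return the n (name, reads) pairs with the most reads, largest first.

    line_counts may be strings straight from `wc -l`; they are converted to
    integers so the sort is numeric rather than alphabetical.
    """
    reads = [(name, int(count) // 4)   # 4 lines per FASTQ read
             for name, count in zip(sample_names, line_counts)]
    reads.sort(key=lambda pair: pair[1], reverse=True)
    return reads[:n]

# Hypothetical names and counts, for illustration only
print(top_samples(["sample_a", "sample_b", "sample_c"], ["400", "1200", "80"], 2))
# prints [('sample_b', 300), ('sample_a', 100)]
```

Converting to `int` matters here: sorted as plain strings, "80" would rank above "1200".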

<br>
My ``cstacks`` script looks like this:

```
##########################################################################################
# 
###--- ``cstacks`` script
#
# PURPOSE: cstacks creates a catalog from a subset of individuals to call SNPs
# INPUT: ustacks output files for specified number of individuals with most sequence reads
# OUTPUT: catalog file + associated files
# 
#### WHEN RUNNING THIS SCRIPT, INPUTS AT THE COMMAND LINE ARE:
# python 
# {0}[pypipe_cstacks.py] 
# {1}[shell script with changed file names] 	
# {2}[directory of input files] 
# {3}[# individuals for cstacks] 
# {4}[batch number] 
# {5}[output directory] 
# {6}[num mismatches allowed] 
# {7}[num threads]
# 
### DEPENDENCIES: 
# [1] Your file names coming out of ustacks cannot have a period other than before file extension (not sure if true! check?)
# 
### WARNINGS:
# [1] If you have to rerun this script, the line counts get appended onto the existing cstacks_linecount.txt! Make sure no file with that name exists first.
# 
##########################################################################################

### --- [A] Call necessary modules

import sys 
import subprocess





### --- [B] Count lines in each sequence file

myfile = open(sys.argv[1], "r")	#open the file with your list of barcodes and sample IDs
lines = myfile.readlines()[2:]


dirfrom = sys.argv[2] # get directory for input files
firststr = "cd " + sys.argv[2] + "\n" # write first line of shell to cd to this directory

filestring = ""
filestring += firststr

samplename_list = [] # to be used in loop later in this script

for line in lines: 					#for each line in the barcode file
	linelist = line.strip().split()		#make a list of character strings broken by tabs
	sampID = linelist[2]				#pick out file name
	samplename_list.append(sampID)
	newstring = "gunzip -c " + sampID + " | wc -l >> ../cstacks_linecount.txt\n" # make line of code to run at command line
	filestring += newstring # add to a list of strings we'll write to a file
myfile.close()


#create a new file where the line-count shell commands will go, write string to file, close
newfile = open("cstacks_linecount_shell.txt", "w")
newfile.write(filestring)
newfile.close()

# run shell script that will calculate line counts
# subprocess.call(['sh cstacks_linecount_shell.txt'], shell=True)
# 

# 
# 
# 
# ### --- [C] Get sample names for specified # of samples with most sequence reads 
# 
linecounts = open("cstacks_linecount.txt", "r") # read in line counts file
linecounts_list = [] # initiate a list for the line counts so I can get.item later

for line in linecounts:
	count = line.strip() # get line count as a string, without padding or newline
	linecounts_list.append(count)

list_samp_ct = [] # initiate empty list
i = 0 # start counter

# --- CHECK^ if lists look normal
# print "sample name list "
# print samplename_list
# print "line counts list "
# print linecounts_list

for item in linecounts_list:
	new_item = [samplename_list[i], linecounts_list[i]]
	list_samp_ct.append(new_item)
	i += 1
	
def getKey(item): # so that sorted will sort numerically by line count (second item in list)
	return int(item[1])
sortedlist = sorted(list_samp_ct, key = getKey, reverse = True)
# print sortedlist # CHECK^

with open('all_sorted_name_counts.txt', 'w') as file:
    file.writelines('\t'.join(i) + '\n' for i in sortedlist) # makes file
    




### --- [D] Write and run shell script for ``cstacks``


# ---------
# I had an automated way to pick the ten samples with most reads, but my ustacks run on 20161108 failed halfway through
# and so I have fewer samples. So I manually picked ten with the most reads of the files that went through ustacks

# so hash comment this out in the future!


samples_for_use = []
foruse = open("files_for_cstacks.txt", "r")
for line in foruse:
	linelist = line.strip().split()
	samples_for_use.append(linelist[0])
foruse.close()



#----------



cstacks_shell = ""
dirstr = "cd " + sys.argv[2] + "\n"
firststr = "cstacks -b " + sys.argv[4] + " "
cstacks_shell += firststr

endrange = int(sys.argv[3]) # set end of range for loop
for i in range(0, endrange):
	filename = samples_for_use[i]
	trmd_filename = filename.rsplit(".",2)[0]
	# print trmd_filename # CHECK^
	string = "-s " + trmd_filename + " "
	cstacks_shell += string
laststr = "-o " + sys.argv[5] + " -n " + sys.argv[6] + " -p " + sys.argv[7]
cstacks_shell += laststr
# print cstacks_shell # CHECK^

cstacks_shell_txt = open("cstacks_shell.txt", "w")
cstacks_shell_txt.write(cstacks_shell)
cstacks_shell_txt.close()

# run shell script
subprocess.call(["sh cstacks_shell.txt"], shell = True)

##########################################################################################


##########################################################################################


### --- DOCUMENTATION FOR CSTACKS

# cstacks -b batch_id -s sample_file [-s sample_file_2 ...] [-o path] [-n num] [-g] [-p num_threads] [--catalog path] [-h]
# p — enable parallel execution with num_threads threads.
# b — MySQL ID of this batch.
# s — TSV file from which to load radtags.
# o — output path to write results.
# m — include tags in the catalog that match to more than one entry.
# n — number of mismatches allowed between sample tags when generating the catalog.
# g — base catalog matching on genomic location, not sequence identity.
# h — display this help messsage.
# Catalog editing:
# 
# --catalog [path] — provide the path to an existing catalog. cstacks will add data to this existing catalog.
# Advanced options:
# 
# --report_mmatches — report query loci that match more than one catalog locus.

```

This is how I would run it (below), except that my ``ustacks`` run failed and only 25% of my files went through, which meant a couple of steps in my script had to be changed. So I did a couple of parts manually at the terminal.

```
!python pypipe_cstacks.py new_filenames_shell.txt ustacks_out 10 1 cstacks_out 3 5
```
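
For reference, the single ``cstacks`` command that the script assembles from those arguments can be sketched as below. The function name `cstacks_command` is hypothetical; the flags (`-b`, `-s`, `-o`, `-n`, `-p`) are the ones listed in the documentation block at the end of my script.

```python
def cstacks_command(batch_id, samples, out_dir, mismatches, threads):
    """Build one cstacks command with a -s flag for every sample in the catalog."""
    parts = ["cstacks", "-b", str(batch_id)]
    for sample in samples:
        # sample is the ustacks output path without the .tags/.snps/.alleles extensions
        parts += ["-s", sample]
    parts += ["-o", out_dir, "-n", str(mismatches), "-p", str(threads)]
    return " ".join(parts)
```

Building the command as a list and joining it at the end avoids the missing or doubled spaces that are easy to introduce when concatenating strings piece by piece.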

And then the output catalog files look like this:

<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-18%20at%202.25.06%20PM.png?raw=true)
<br>

I ran ``ustacks`` on the remaining files overnight through the terminal, so as not to crash Jupyter, and deleted all the ``cstacks`` and ``sstacks`` files so that I can rerun both on everything. This will be batch 2.

In [13]:
!python pypipe_cstacks.py new_filenames_shell.txt ustacks_out 10 2 cstacks_out 3 5

cstacks paramters selected:
  Loci matched based on sequence identity.
  Number of mismatches allowed between stacks: 3
  Gapped alignments: disabled
Constructing catalog from 10 samples.
Initializing new catalog...
  Parsing ustacks_out/2015_101_1.tags.tsv.gz
  Parsing ustacks_out/2015_101_1.snps.tsv.gz
  Parsing ustacks_out/2015_101_1.alleles.tsv.gz
  37987 loci were newly added to the catalog.
Processing sample ustacks_out/2015_101_1 [2 of 10]
  Parsing ustacks_out/2005_464_1.tags.tsv.gz
  Parsing ustacks_out/2005_464_1.snps.tsv.gz
  Parsing ustacks_out/2005_464_1.alleles.tsv.gz
Searching for sequence matches...
  Distance allowed between stacks: 3; searching with a k-mer length of 35 (110 k-mers per read); 5 k-mer hits required.
  37987 loci in the catalog, 3925472 kmers in the catalog hash.
Merging matches into catalog...
  34143 loci were matched to a catalog locus.
  0 loci were matched to a catalog locus using gapped alignments.
  4583 loci were newly added to the catalog.
  18

## ``sstacks``

<br>

Then comes ``sstacks``, where the program matches each individual's reads to the catalog.

My code looks like this:

```
##########################################################################################
# 
### ``sstacks``
#
# PURPOSE: To match individual samples against your catalog for genotyping
# INPUT: catalog TSV files from cstacks + per-sample TSV output files from ustacks
# OUTPUT: match files
# 
# 
### WHEN RUNNING THIS SCRIPT, YOUR INPUTS AT THE COMMAND LINE ARE:
# python  
# {0}[pipeline filename] 
# {1}[shell script w sample names]
# {2}[batch ID number]
# {3}[filepath to directory with catalog filename without file extension]
# {4}[filepath to directory w ustacks output files per sample]
# {5}[number of threads to use]
# 
### DEPENDENCIES
# 
# [1]
# 
### WARNINGS
# 
# [1] 
# 
##########################################################################################

# --- [A] call necessary modules

import subprocess
import sys 





# --- [B] make shell script for sstacks

trim_names = [] # initiate list

rename_shell = open(sys.argv[1], "r") # open file w filenames 
lines = rename_shell.readlines()[2:]

for line in lines:
	linelist = line.strip().split()
	trim_name = linelist[2].rsplit(".",2)[0]
	trim_names.append(trim_name)
# print trim_names # CHECK^

rename_shell.close()

numsamples = len(trim_names)

newfile = open("sstacks_shell.txt", "w") # create new file for shell script

filestring = ""

for i in range(0,numsamples):
	filestring += "sstacks -b " + sys.argv[2] + " -c " + sys.argv[3]
	substr = " -s " + sys.argv[4] + "/" + trim_names[i] + " -p " + sys.argv[5] + "\n"
	filestring += substr	

# print filestring # ^CHECK

newfile.write(filestring)
newfile.close()





# --- [C] run shell script for sstacks

subprocess.call(["sh sstacks_shell.txt"], shell=True)

```
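
The core of that script is one ``sstacks`` command per sample, with the `.fq.gz` extension trimmed off the file name. A minimal sketch of just that step (the function names `strip_fastq_ext` and `sstacks_commands` are hypothetical; the flags are the ones my script uses):

```python
def strip_fastq_ext(filename):
    """Drop the last two extensions, e.g. 2005_387_1.fq.gz -> 2005_387_1."""
    return filename.rsplit(".", 2)[0]

def sstacks_commands(sample_files, batch_id, catalog, sample_dir, threads):
    """Return one sstacks command line per sample file."""
    commands = []
    for name in sample_files:
        sample = sample_dir + "/" + strip_fastq_ext(name)
        commands.append("sstacks -b {0} -c {1} -s {2} -p {3}".format(
            batch_id, catalog, sample, threads))
    return commands
```

Using `rsplit` with a limit of 2 only removes the final two extensions, so a sample name that itself contains a period would be left intact.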

Again, this run on "all" my data sort of failed when ustacks stopped overnight, so I'll need a new way of making a list of samples that I actually want to run sstacks on. Or, perhaps I can run it as is and it will skip those files and report that it couldn't find them. Let's try that...
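
For the first option, a list of the samples that actually made it through ``ustacks`` could be built from the files in the output folder. This is a sketch under assumptions: the function name `samples_with_ustacks_output` is hypothetical, and it assumes every finished sample has a file ending in `.tags.tsv.gz` and that catalog files contain `.catalog.` in their names, as in my logs.

```python
import os

def samples_with_ustacks_output(ustacks_dir, suffix=".tags.tsv.gz"):
    """List sample names that have a tags file in ustacks_dir, skipping the catalog."""
    names = []
    for filename in sorted(os.listdir(ustacks_dir)):
        if filename.endswith(suffix) and ".catalog." not in filename:
            names.append(filename[:-len(suffix)])   # strip the suffix to get the sample name
    return names
```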

Later, on Nov 29-30 I reran ustacks on everything. So trying again, this will be batch 2.

In [17]:
!python pypipe_sstacks.py new_filenames_shell.txt 2 ustacks_out/batch_1 ustacks_out 10 -o sstacks_out

Searching for matches by sequence identity...
  Parsing ustacks_out/batch_1.catalog.tags.tsv.gz
  Parsing ustacks_out/batch_1.catalog.snps.tsv.gz
  Parsing ustacks_out/batch_1.catalog.alleles.tsv.gz
Processing sample 'ustacks_out/2005_387_1' [1 of 1]
  Parsing ustacks_out/2005_387_1.tags.tsv.gz
  Parsing ustacks_out/2005_387_1.snps.tsv.gz
  Parsing ustacks_out/2005_387_1.alleles.tsv.gz
Searching for sequence matches...
20123 stacks compared against the catalog containing 61592 loci.
  19776 matching loci, 1158 contained no verified haplotypes.
  151 loci matched more than one catalog locus and were excluded.
  1007 loci contained SNPs unaccounted for in the catalog and were excluded.
  21083 total haplotypes examined from matching loci, 19679 verified.
Outputing to file ./2005_387_1.matches.tsv.gz
Searching for matches by sequence identity...
  Parsing ustacks_out/batch_1.catalog.tags.tsv.gz
  Parsing ustacks_out/batch_1.catalog.snps.tsv.gz
  Parsing ustacks_out/batch_1.catalog.alleles

So now I have:

<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-18%20at%203.05.39%20PM.png?raw=true)

<br>

## ``populations``

Then populations! I also just learned that my ``ustacks`` and ``sstacks`` files need to be in the same directory, so I manually moved those.
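
I moved the files by hand, but a scripted version might look like this. It is a sketch under assumptions: the function name `move_matches` is hypothetical, and it assumes the ``sstacks`` output files are the ones ending in `.matches.tsv.gz`, as in the log above.

```python
import os
import shutil

def move_matches(src_dir, dest_dir, suffix=".matches.tsv.gz"):
    """Move every sstacks matches file from src_dir into dest_dir.

    Returns the list of file names that were moved.
    """
    moved = []
    for filename in sorted(os.listdir(src_dir)):
        if filename.endswith(suffix):
            shutil.move(os.path.join(src_dir, filename),
                        os.path.join(dest_dir, filename))
            moved.append(filename)
    return moved
```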

Then I run ``populations`` like this:

In [None]:
!populations -b 2 -P ustacks_out -M popmap1.txt -t 10 -r 0.50 -p 2 -m 5 --genepop

Fst kernel smoothing: off
Bootstrap resampling: off
Percent samples limit per population: 0.5
Locus Population limit: 2
Minimum stack depth: 5
Log liklihood filtering: off; threshold: 0
Minor allele frequency cutoff: 0
Maximum observed heterozygosity cutoff: 1
Applying Fst correction: none.
Parsing population map...
The population map contained 26 samples, 3 population(s), 1 group(s).
Reading the catalog...
  Parsing ustacks_out/batch_2.catalog.tags.tsv.gz
  Parsing ustacks_out/batch_2.catalog.snps.tsv.gz
  Parsing ustacks_out/batch_2.catalog.alleles.tsv.gz
Reading matches to the catalog...
  Parsing ustacks_out/2005_387_1.matches.tsv.gz
  Parsing ustacks_out/2005_388_1.matches.tsv.gz
  Parsing ustacks_out/2005_389_1.matches.tsv.gz
  Parsing ustacks_out/2005_457_1.matches.tsv.gz
  Parsing ustacks_out/2005_459_1.matches.tsv.gz
  Parsing ustacks_out/2005_462_1.matches.tsv.gz
  Parsing ustacks_out/2005_463_1.matches.tsv.gz
  Parsing ustacks_out/2005_464_1.matches.tsv.gz
  Parsing ustacks_

ACK i just learned that what I thought was the arbitrary sample ID 