# Purpose

The purpose of this notebook is to document my full application of the Stacks pipeline on my cod data. I've been writing scripts to work on a subset of my files, and now I want to start running them on all of my files.

## ``process_radtags`` 

I don't have to run this on my data because the previous user of the data already did. In my script for running ``process_radtags``, however, I also use this opportunity to make folders for the output of each of the Stacks programs. So I'm doing that manually here. Below, you can see these directories plus two folders of cod data (two lanes of data):

![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%203.10.45%20PM.png?raw=true)

## ``ustacks`` 

I'll be working off of my hard drive, so first I need to change into that directory.

In [1]:
cd /Volumes/Time\ Machine\ Backups/Cod-Time-Series-Data/ 

/Volumes/Time Machine Backups/Cod-Time-Series-Data


Another issue is that I have two lanes. It'd be easiest to work in a single directory, but the two lanes use the same barcodes, so the filenames would collide. So my first step is to rename the files by adding *_L1* for Library 1 and *_L2* for Library 2. I'll do that with a Python script that inserts the tag after the barcode, ahead of the read number and the file extension.

Here's the content of the script ``add_lib_to_filename.py``:

```
# arguments
# [1] directory name with files to rename 
# [2] text you would like to add to the end of each filename before file extensions

# assumptions
# [1] file extension = fq.gz
# [2] you are one directory above the directory that you want to rename the files of
# [3] your files aren't already renamed! --- write an if else loop to make sure it hasn't been run yet when you have the time

import sys
import subprocess

# make a text file that has each thing in this directory on a line; each line should be each file in the lane

string1 = "" # initiate string

string1 += 'ls ' + sys.argv[1] + ' > ./dircontents.txt' # add ls, wd, redirect to text file

firstshell = open("getcontents_shell.txt", "w")
firstshell.write(string1)
firstshell.close()

subprocess.call(["sh getcontents_shell.txt"], shell = True)

# now that file is made for contents of directory, read that file in
contents = open("dircontents.txt", "r")


# second shell = renaming script
rename_w_lib_shell = open("rename_w_lib_shell.txt", "w")

string2 = ""
string2 += "cd " + sys.argv[1] + "\n"


# loop that will break the script if it's already run to make sure you don't rename files into something wrong!
for line in contents:
	linelist = line.strip().split(".")
	if linelist[0][-4] == "L":
		print("CHECK TO SEE IF YOU HAVE ALREADY RENAMED THESE FILES. SCRIPT PAUSED AS WARNING BECAUSE FILES APPEAR RENAMED.")
		sys.exit() # exit script
	else:
		continue
contents.close()



contents = open("dircontents.txt", "r")
for line in contents:
	original = line.strip()
	filename_list = line.strip().split(".") # make list out of whole file name with extensions
	filename_wo_ext = filename_list[0]
	file_ext = "." + filename_list[1] + "." + filename_list[2]
	wordlist = filename_wo_ext.split("_") # split filename without extensions into parts by underscore
	newfilename = wordlist[0] + sys.argv[2] + "_" + wordlist[1]
	newfilename += file_ext
	string2 += "mv" + "\t" + original + "\t" + newfilename + "\n"

rename_w_lib_shell.write(string2)
rename_w_lib_shell.close()
contents.close()

# call renaming shell script
subprocess.call(["sh rename_w_lib_shell.txt"], shell = True)
```
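The same renaming can also be done without writing intermediate shell scripts. Below is a minimal sketch, assuming Python 3 and filenames of the form `BARCODE_READ.fq.gz`. The function names `add_lib_suffix` and `rename_lane` are hypothetical and not part of the script above.

```python
from pathlib import Path


def add_lib_suffix(filename, suffix):
    """Insert suffix after the barcode: AAACGG_1.fq.gz -> AAACGG_L1_1.fq.gz."""
    stem, ext = filename.split(".", 1)     # "AAACGG_1", "fq.gz"
    barcode, read = stem.rsplit("_", 1)    # "AAACGG", "1"
    return f"{barcode}{suffix}_{read}.{ext}"


def rename_lane(directory, suffix):
    """Rename every *.fq.gz file in directory, skipping files already tagged."""
    renamed = []
    # sorted() materializes the listing before any file is renamed
    for path in sorted(Path(directory).glob("*.fq.gz")):
        if f"{suffix}_" in path.name:      # already carries this tag, leave it alone
            continue
        target = path.with_name(add_lib_suffix(path.name, suffix))
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (paths taken from the notebook cells below):
# rename_lane("process_radtags_out/cod_lib1", "_L1")
```

Because files that already carry the tag are skipped, running this twice on the same directory should be harmless, which covers the "already renamed" case the script above guards against by exiting.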

Running the code looks like this:

In [31]:
!python add_lib_to_filename.py process_radtags_out/cod_lib1 _L1

In [32]:
!python add_lib_to_filename.py process_radtags_out/cod_lib2 _L2

Now my files are renamed with the library tag after the barcode and before the 1 or 2 that marks forward and reverse reads for paired-end data.
<br><br>
From
<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%206.15.59%20PM.png?raw=true)
<br>
To
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-16%20at%206.15.48%20PM.png?raw=true)
<br>
<br>
And then I can group the two directories of libraries together.
<br>
![image](https://github.com/nclowell/FISH546/blob/master/Cod-Time-Series-Project/Notebooks/images_for_notebooks/Screen%20Shot%202016-11-17%20at%209.09.21%20AM.png?raw=true)
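Grouping the two lane directories can be scripted as well. This is a rough sketch, assuming Python 3 and that the lane tags have already been added so no two files share a name. The function name `merge_lanes` is hypothetical.

```python
import shutil
from pathlib import Path


def merge_lanes(lane_dirs, dest_dir):
    """Move every *.fq.gz file from each lane directory into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for lane in lane_dirs:
        for path in sorted(Path(lane).glob("*.fq.gz")):
            target = dest / path.name
            if target.exists():
                # refuse to overwrite: a clash means the lanes were not renamed first
                raise FileExistsError(f"{target} already exists")
            shutil.move(str(path), str(target))
            moved.append(target.name)
    return moved


# Example call (directory names taken from the cells above):
# merge_lanes(["process_radtags_out/cod_lib1", "process_radtags_out/cod_lib2"],
#             "process_radtags_out")
```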
<br>
Now I want to run ``ustacks`` on all the files in this folder. I can use my handy dandy script I wrote a couple of weeks ago. It looks like this:

```
##########################################################################################
# 
### ``ustacks``
#
# PURPOSE: This step aligns identical RAD tags within an individual into stacks & provides data for calling SNPs
# INPUT: fastq or gzfastq files
# OUTPUT: 4 files - alleles, models, snps, tags
# 
# 
### WHEN RUNNING THIS SCRIPT, YOUR INPUTS AT THE COMMAND LINE ARE:
# python  
# {0}[pipeline filename] 
# {1}[barcodes & samples textfile] 
# {2}[start directory] 
# {3}[end directory]
# 
### DEPENDENCIES
# 
# [1] You need a file where first column is barcode and second is unique sample name
# 
### WARNINGS
# 
# [1] ustacks only works when your working directory is one directory above the folders you are
#		calling and storing data in 
# 
##########################################################################################

# --- [A] call necessary modules

import subprocess
import sys 





# --- [B] Rename your files by the sample name

new_file = open("new_filenames_shell.txt", "w") # new txt file
dir = sys.argv[2] # directory files that need names changed
firststr = "cd " + dir + "\n" + "pwd\n"
new_file.write(firststr)

myfile = open(sys.argv[1], "r")				# myfile =  tab delimited file mentioned, called after your script

for line in myfile:						   # loop through each line
	linelist=line.strip().split()		   # splits into list by white space
	barcodefile1 = linelist[0] + "_1.fq.gz" # forward
	barcodefile2 = linelist[0] + "_2.fq.gz" # reverse
	samplefile1 = linelist[1] + "_1.fq.gz" # forward
	samplefile2 = linelist[1] + "_2.fq.gz" # reverse
	newstring = "mv" + "\t" + barcodefile1 + "\t" + samplefile1 + "\n" + "mv" + "\t" + barcodefile2 + "\t" + samplefile2 + "\n"
	# print newstring  # troubleshooting loop
	new_file.write(newstring)

myfile.close()
new_file.close()

# run the script you just made as a shell script to rename your files
subprocess.call(['sh new_filenames_shell.txt'], shell=True)

 



### --- [C] Make shell script to run all samples through command line through ``ustacks``

# ``ustacks`` requires an arbitrary integer for every sample, although unclear how it gets used as it does not become the name of the file


# name your 'from' and 'to' directories that will go in each line of your ustacks shell script
dirfrom = sys.argv[2]
dirto = sys.argv[3]

newfile2 = open("ustacks_shell.txt", "w")	 # make ustacks shell script to run through terminal
myfile = open("new_filenames_shell.txt", "r")	# reopen the rename shell script; each mv line ends with a sample filename

# dir = sys.argv[2] # directory files that need names changed
# firststr = "cd " + dir + "\n"
# new_file.write(firststr)

dir2 = sys.argv[2] # directory with files that we want to run ustacks on

lines = myfile.readlines()[2:] # skip first two lines because just cd and pwd


ID_int = 1								# start integer counter; zero padding is added when the command is built
for line in lines: 			# for each mv line in the rename shell script
	linelist=line.strip().split()	
	sampID = linelist[2] 					# third field of the mv line = renamed sample file
	if ID_int < 10: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i 00" + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with 2 leading 0s)
	elif ID_int >= 10 and ID_int < 100: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i 0" + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with 1 leading 0)
	else: 
		ustacks_code = "ustacks -t gzfastq -f " + dirfrom + "/" + sampID + " -r -d -o " + dirto + " -i " + str(ID_int) + " -m 5 -M 3 -p 10" + "\n"
								#create a line of code for ustacks that includes the new sample ID (with no leading 0s)
	newfile2.write(ustacks_code)	#append this new line of code to the output file
	ID_int += 1

myfile.close()
newfile2.close()

# run this new script through the terminal
subprocess.call(['sh ustacks_shell.txt'], shell=True)

##########################################################################################

```
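The three-way branch that pads the sample ID with leading zeros can be collapsed into a single format specifier. This is a minimal sketch, assuming Python 3. The function names `ustacks_command` and `build_commands` are hypothetical, and the `ustacks` flags are copied unchanged from the script above rather than checked against any particular Stacks version.

```python
def ustacks_command(sample_file, dir_from, dir_to, sample_id):
    """Build one ustacks call with the ID zero-padded to three digits."""
    return (
        f"ustacks -t gzfastq -f {dir_from}/{sample_file} -r -d "
        f"-o {dir_to} -i {sample_id:03d} -m 5 -M 3 -p 10"
    )


def build_commands(sample_files, dir_from, dir_to):
    """Number the samples from 1 and return one command string per file."""
    return [
        ustacks_command(name, dir_from, dir_to, i)
        for i, name in enumerate(sample_files, start=1)
    ]


# Example with made-up sample filenames:
# for cmd in build_commands(["s1_1.fq.gz", "s1_2.fq.gz"],
#                           "./process_radtags_out", "./ustacks_out"):
#     print(cmd)
```

The `:03d` specifier pads IDs below 100 and leaves larger numbers untouched, so no comparison logic is needed.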

And I run it like this:

In [None]:
!python pypipe_ustacks.py barcodes_samplenames.txt ./process_radtags_out ./ustacks_out

/Volumes/Time Machine Backups/Cod-Time-Series-Data/process_radtags_out
mv: AAACGG_L1_1.fq.gz: No such file or directory
mv: AAACGG_L1_2.fq.gz: No such file or directory
mv: GCCGTA_L1_1.fq.gz: No such file or directory
mv: GCCGTA_L1_2.fq.gz: No such file or directory
mv: ACTCTT_L1_1.fq.gz: No such file or directory
mv: ACTCTT_L1_2.fq.gz: No such file or directory
mv: TTCTAG_L1_1.fq.gz: No such file or directory
mv: TTCTAG_L1_2.fq.gz: No such file or directory
mv: ATTCCG_L1_1.fq.gz: No such file or directory
mv: ATTCCG_L1_2.fq.gz: No such file or directory
mv: CCGCAT_L1_1.fq.gz: No such file or directory
mv: CCGCAT_L1_2.fq.gz: No such file or directory
mv: CGAGGC_L1_1.fq.gz: No such file or directory
mv: CGAGGC_L1_2.fq.gz: No such file or directory
mv: CGCAGA_L1_1.fq.gz: No such file or directory
mv: CGCAGA_L1_2.fq.gz: No such file or directory
mv: GAGAGA_L1_1.fq.gz: No such file or directory
mv: GAGAGA_L1_2.fq.gz: No such file or directory
mv: GGGGCG_L1_1.fq.gz: No such file or director

Note that a lot of the files aren't my data. I should only have 106 individuals across the two lanes, and since the data are paired-end, 212 of the files should be mine. Also, this kept crashing my Jupyter notebook, so I ran it in the terminal outside the notebook.
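One way to check these numbers before renaming anything is to compare the filenames the barcode file implies against what is actually in the directory. This is a minimal sketch, assuming Python 3 and the `BARCODE_1.fq.gz` / `BARCODE_2.fq.gz` naming used above. The function names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path


def expected_files(barcodes):
    """Return the forward and reverse filename for every barcode."""
    return [f"{b}_{read}.fq.gz" for b in barcodes for read in (1, 2)]


def missing_files(barcodes, directory):
    """List expected files that are not present in directory."""
    present = {p.name for p in Path(directory).iterdir()}
    return [name for name in expected_files(barcodes) if name not in present]


# 106 individuals with paired-end reads should give 212 expected files:
# len(expected_files(my_barcodes)) == 212   # my_barcodes is a placeholder list
```

An empty list from `missing_files` would mean every `mv` in the rename step has a source file to act on.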