## Lanes 1 and 2 combined pipeline


This notebook goes through the Stacks pipeline, from `process_radtags` to `populations`, for the first AND second lanes of sequencing received for the Korean Pacific cod.
Cstacks was run for *de novo* assembly of reads into a reference catalog of loci. Sstacks was then run *with* additional filtering / screening of the reference catalog (through BOWTIE and BLAST). 
<br>
<br>
The primary purpose of this notebook is to generate data from the Lane 1 and Lane 2 individuals that will allow me to assess my protocol for 300ng and degraded DNA. 
<br>
<br>
Programs: FastQC, [stacks v. 1.44](http://catchenlab.life.illinois.edu/stacks/), [bowtie 1.1.2](http://bowtie-bio.sourceforge.net/index.shtml), [ncbi blast-2.5.0](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
<br>
<br>
See [Evernote](http://www.evernote.com/l/AoqKYaI6qvZBprXPcIO4zO-9pMeEMBnm8N4/) for exact versions of scripts



## process_radtags

#### Lane 1
**12/5/2016**

Renamed /L2samples/ --> /L1L2samplesT142/

In [None]:
#ran in terminal
process_radtags -p raw_data/L1_PE150/ \
-P -i gzfastq -y gzfastq \
-o L1L2samplesT142 \
-b scripts/barcodesL1.txt \
-e sbfI -E phred33 -r -c -q -t 142

In [None]:
532905602 total sequences
 48761576 ambiguous barcode drops (9.2%)
  6414716 low quality read drops (1.2%)
 30459515 ambiguous RAD-Tag drops (5.7%)
447269795 retained reads (83.9%)

#### Lane 2

**12/5/2016**

In [None]:
#ran in terminal
process_radtags -p raw_data/L2_SR150/ \
-i gzfastq -y gzfastq \
-o stacksv1.44/L1L2samplesT142 \
-b scripts/barcodesL2.txt \
-e sbfI -E phred33 -r -c -q -t 142

In [None]:
377851294 total sequences
 50386443 ambiguous barcode drops (13.3%)
  2840924 low quality read drops (0.8%)
 65025476 ambiguous RAD-Tag drops (17.2%)
259598451 retained reads (68.7%)
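
As a quick sanity check on the two `process_radtags` summaries above, the percentages can be recomputed from the raw counts. This is only a sketch: the counts are copied from the output shown, and the dictionary keys and the `summarize` helper are names made up for this example.

```python
# Read counts copied from the process_radtags summaries above.
# The key names and the summarize() helper are illustrative only.
lanes = {
    "Lane 1": {
        "total": 532905602,
        "ambiguous_barcode": 48761576,
        "low_quality": 6414716,
        "ambiguous_radtag": 30459515,
        "retained": 447269795,
    },
    "Lane 2": {
        "total": 377851294,
        "ambiguous_barcode": 50386443,
        "low_quality": 2840924,
        "ambiguous_radtag": 65025476,
        "retained": 259598451,
    },
}


def summarize(counts):
    """Return each category as a percentage of the total read count."""
    total = counts["total"]
    parts = {k: v for k, v in counts.items() if k != "total"}
    # The drop categories plus the retained reads should add up to the total.
    if sum(parts.values()) != total:
        raise ValueError("categories do not sum to the total")
    return {k: round(100.0 * v / total, 1) for k, v in parts.items()}


for lane, counts in lanes.items():
    print(lane, summarize(counts))
```

Both lanes add up to their totals; the lower retention in Lane 2 comes mostly from the ambiguous RAD-Tag and ambiguous barcode drops.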

## ustacks --> cstacks

### 12/6/2016


Ran on all output files from process_radtags for both populations in the `L1L2samplesT142/` folder. Output to `L1L2stacks_m10` folder. 

BEFORE I ran the ustacks_cstacks shell, I chose the top 10 individuals from each population by counting the total number of sequences in each fastq file. 

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-2016/notebooks'

In [2]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


I first used a Python script to generate a shell script that counts the sequences in each of the fastq files

In [3]:
!head L1L2_seqCountsgen.py

import sys
lane1 = open(sys.argv[1], "r")
lane2 = open(sys.argv[2], "r")
newshell = open("CountFASTQseqs_12-4.sh", "w")

newshell.write("#!/bin/bash" + "\n\n")

for line in lane1: 
	linelist = line.strip().split()	
	filestring = "zcat L1L2samples/" + linelist[1] + ".1.fq.gz | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' >> FastQsequenceCounts.txt"


In [None]:
!python L1L2_seqCountsgen.py barcodesL1.txt barcodesL2.txt

In [None]:
!mv CountFASTQseqs_12-4.sh ../CountFASTQseqs_12-4.sh

In [None]:
cd ../

In [None]:
!chmod +x CountFASTQseqs_12-4.sh

In [None]:
./CountFASTQseqs_12-4.sh

This output a file, `FastQsequenceCounts_L1L2.txt`.
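
For comparison, the same kind of count can be sketched directly in Python instead of `zcat | awk`. This is not the script I ran; `count_fastq_reads` is a hypothetical helper that assumes standard 4-line FASTQ records and returns only the total and the number of sequences seen exactly once.

```python
import gzip


def count_fastq_reads(path):
    """Return (total, unique) sequence counts for a gzipped FASTQ file.

    Assumes 4-line FASTQ records. 'unique' is the number of sequences
    that occur exactly once, mirroring the awk one-liner's definition.
    """
    seen = {}
    total = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record = sequence
                total += 1
                seq = line.strip()
                seen[seq] = seen.get(seq, 0) + 1
    unique = sum(1 for n in seen.values() if n == 1)
    return total, unique
```

Like the awk version, this keeps every distinct sequence in memory, so it may be slow or memory-hungry on the largest files. If only the total is needed, the dictionary can be dropped.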

I then ran a script to extract only the first count on each line of the output file above, aka only the total sequence count

In [5]:
!head FASTQcounts_totalSeqExtract.py

##### This python script takes the list of Fastq sequence counts and extracts JUST the total sequence count ##############
## MF 12/4/2016

import sys
countsfile = open(sys.argv[1], "r")
newfile = open("FastqTotalSeqCounts.txt", "w")

for line in countsfile: 
	linelist = line.strip().split()
	newfile.write(linelist[0] + "\n")


In [None]:
!python FASTQcounts_totalSeqExtract.py ../FastQTotalSeqCounts_L1L2.txt

This output the file `FastqTotalSeqCounts_L1L2.txt`, containing the list of total sequence counts. I copied this into an excel document next to the sample names (copied in order from the barcode file). 

The excel document: `L1L2_SortedTotalSeqCounts.ods`

I then sorted all samples into populations and ordered them by sequence count. Highlighted samples are those used for cstacks, put into a separate file `samples_for_cstacks_L1L2.txt`. I also made the population map at this stage, `PopMap_L1L2stacks.txt`
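
The spreadsheet sorting above could also be done in a few lines of Python. The sketch below is only an illustration: `top_samples_per_population` is a hypothetical helper, and the sample names and counts are placeholders, not my data.

```python
from collections import defaultdict


def top_samples_per_population(seq_counts, pop_map, n=10):
    """Pick the n samples with the most sequences in each population.

    seq_counts: dict of sample name -> total sequence count
    pop_map:    dict of sample name -> population name
    """
    by_pop = defaultdict(list)
    for sample, total in seq_counts.items():
        by_pop[pop_map[sample]].append((sample, total))
    top = {}
    for pop, rows in by_pop.items():
        # Highest count first; ties are broken by sample name.
        rows.sort(key=lambda r: (-r[1], r[0]))
        top[pop] = [sample for sample, _ in rows[:n]]
    return top


# Placeholder names and counts, for illustration only (not real data).
seq_counts = {"popA_01": 500, "popA_02": 900, "popA_03": 700,
              "popB_01": 300, "popB_02": 800}
pop_map = {name: name.split("_")[0] for name in seq_counts}
print(top_samples_per_population(seq_counts, pop_map, n=2))
```

In practice `seq_counts` would come from the counts file and `pop_map` from the population map, with `n=10` to match the selection described above.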

<br>
<br>

I then made the ustacks --> cstacks shell script. I could have continued to sstacks as well, but wanted to quality control given how glitchy it was. I copied in the ustacks script from the `radtags_ustacks_shell_12-3.sh`, changing the input folder name to `L1L2samplesT142`. I then copied the cstacks code in from the `L1L2_cstacks_sstacks_shell_12-4.sh`. 

In [6]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis


In [7]:
!head ustacks_cstacks_shell_12-6.sh

#!/bin/bash

#ustacks
ustacks -t gzfastq -f L1L2samplesT142/PO010715_06.1.fq.gz -r -d -o L1L2stacks_m10 -i 001 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/PO010715_27.1.fq.gz -r -d -o L1L2stacks_m10 -i 002 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/PO010715_28.1.fq.gz -r -d -o L1L2stacks_m10 -i 003 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/PO010715_29.1.fq.gz -r -d -o L1L2stacks_m10 -i 004 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/GE011215_08.1.fq.gz -r -d -o L1L2stacks_m10 -i 005 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/GE011215_09.1.fq.gz -r -d -o L1L2stacks_m10 -i 006 -m 10 -M 3 -p 6
ustacks -t gzfastq -f L1L2samplesT142/GE011215_14.1.fq.gz -r -d -o L1L2stacks_m10 -i 007 -m 10 -M 3 -p 6


*Note - for some reason the ID numbers > 100 have leading zeros... removed manually from code*

*Also removed process_radtags code generated since this step was already done, but was produced as part of the shell*
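
One way to avoid the stray leading zeros is to let a format specifier do the padding. The sketch below is not the generator script I used; `ustacks_commands` is a hypothetical helper that reuses the flags from the shell lines shown above and assumes every input file ends in `.1.fq.gz` (files with a different suffix would need a different template).

```python
def ustacks_commands(samples, in_dir="L1L2samplesT142", out_dir="L1L2stacks_m10"):
    """Build one ustacks command line per sample, with sequential IDs.

    The flags are copied from the shell script shown above.
    ASSUMPTION: each input file is named <sample>.1.fq.gz.
    """
    template = ("ustacks -t gzfastq -f {in_dir}/{sample}.1.fq.gz -r -d "
                "-o {out_dir} -i {sql_id:03d} -m 10 -M 3 -p 6")
    return [
        template.format(in_dir=in_dir, sample=sample, out_dir=out_dir, sql_id=i)
        for i, sample in enumerate(samples, start=1)
    ]


for command in ustacks_commands(["PO010715_06", "PO010715_27"]):
    print(command)
```

`{sql_id:03d}` pads to three digits only, so IDs 1 to 99 come out as `001` to `099` and IDs of 100 or more are written without any extra zeros.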

In [18]:
!chmod +x ustacks_cstacks_shell_12-6.sh

In [None]:
#ran in terminal
./ustacks_cstacks_shell_12-6.sh

Error saved to `ustacks_cstacks_out_12-6`

<br>
<br>
## sstacks
I then used the original `sstacks_byline_12-4.sh` script to make a new script, `sstacks_byline_12-7.sh`, and ran sstacks on all of the samples. The error output was saved to a `sstacks_out_b1_12-7.txt` file and put into the `L1L2stacks_m10` folder. 

In [9]:
!head sstacks_byline_12-7.sh

#!/bin/bash
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO010715_06.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO010715_27.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO010715_28.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO010715_29.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO020515_05.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO020515_09.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO020515_10.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batch_1 -s L1L2stacks_m10/PO010715_19.1 -o L1L2stacks_m10 -p 6 2>> sstacks_out_b1_12-7
sstacks -b 1 -c L1L2stacks_m10/batc

In [None]:
chmod +x sstacks_byline_12-7.sh

In [None]:
./sstacks_byline_12-7.sh

## populations

### 12/6/2016

I ran populations and tried to have it output a VCF file, but the program kept getting killed because there wasn't enough memory. So I ended up just outputting a genepop and a fasta file. 

In [10]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis'

In [11]:
!head L1L2_populations_12-7.sh

#!/bin/bash

## populations 

populations -b 1 -P L1L2stacks_m10 -M scripts/PopMap_L1L2stacks.txt -t 36 -r 0.75 -p 2 -m 10 --genepop --fasta --fstats


I'm outputting a genepop and fasta file, and also decided to see if I can get F statistics from populations (not the way we do it in the lab, but for class should be close enough to the actual statistics). 

In [4]:
!chmod +x L1L2_populations_12-7.sh

In [None]:
./L1L2_populations_12-7.sh

### Marine's scripts

(1) remove heading from batch_1 catalog.snps file


In [12]:
cd L1L2stacks_m10

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


In [13]:
!gzip -d batch_1.catalog.snps.tsv.gz

(2) add a "_" between Cat and ID in the batch_1 haplotypes.tsv file
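
Steps (1) and (2) were manual edits, but they could be sketched in Python as below. Both helpers are hypothetical, and both rest on assumptions about the file layout that should be checked against the actual files: that the heading of the catalog snps file is a line starting with `#`, and that the haplotypes header begins with the column name `Catalog ID`.

```python
def strip_comment_lines(lines):
    """Drop heading lines from a Stacks .tsv file.

    ASSUMPTION: the heading to remove is any line starting with '#'.
    """
    return [line for line in lines if not line.startswith("#")]


def join_catalog_id_header(header_line):
    """Turn 'Catalog ID' into 'Catalog_ID' in the haplotypes header.

    ASSUMPTION: the first column of the header is named 'Catalog ID'.
    """
    return header_line.replace("Catalog ID", "Catalog_ID", 1)


print(strip_comment_lines(["# heading line\n", "0\t1\t5\n"]))
print(join_catalog_id_header("Catalog ID\tCnt\tsample_1\n"))
```

Writing the results to new files rather than overwriting keeps the original Stacks output intact.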

(3) Run Marine's script to make a biallelic catalog reference

In [14]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [15]:
!python preparing_file_for_correcting_genotypes.py \
../../L1L2stacks_m10/batch_1.haplotypes2.tsv \
../../L1L2stacks_m10/batch_1.biallelic_catalog.tsv \
../../L1L2stacks_m10/batch_1.catalog.snps2.tsv \
1

4 116 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C
6 96 AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC GTA AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC
7 117 G G G G G G G G G G G G A G G G G G G G A G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G
9 33 ACC TCC ACC TCC TCC ACC ACC ACC ACC ACC ACC ACC ACC TCC ACC TCC ACC ACC TCC ACC TCC ACC ACC ACC TCC ACC TCC AC

(5) create a bash script that will unzip all of the individual .tags.tsv files, and then call Marine's genotypes_verif python script.

In [16]:
cd ../../

/mnt/hgfs/Pacific cod/DataAnalysis


In [17]:
!head -n 15 gzip_MBgenotypesverif_BASHshell.sh

#!/bin/bash

### This shell script will unzip all of the individual .tags.tsv files needed for Marine Brieuc's genotypes_verif.py script, then call Marine's python script. Use this bash script AFTER running Marine's script `preparing_file_for_correcting_genotypes.py` ###

## M. Fisher 12/5/2016


#Ask for input from user
echo "This is your current location:"
pwd
echo "Please input the local path to the directory containing the individual 'tags' and 'matches' files"
read DIRECTORY


#(1) Navigate to directory with .tags.tsv files


This is an interactive script, so I had to run it on the command line

In [18]:
!chmod +x gzip_MBgenotypesverif_BASHshell.sh

In [None]:
./gzip_MBgenotypesverif_BASHshell.sh

#input at command line: DIRECTORY = L1L2stacks_m10
#moved into UndercallingHets_MB_CW folder

This script outputs a text file with the biallelic loci, and corrected genotypes for every individual at every locus. 

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-2016/notebooks'

In [2]:
cd ../../L1L2stacks_m10

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


In [11]:
!head -n 2 batch_1.CorrectedGenos_2alleles.txt

	10006	10007	10010	10011	10012	10014	10017	10018	10020	10021	10026	1003	10032	10044	10062	10067	1007	10070	10077	10079	10080	10089	10096	10098	101	1010	10103	10109	10118	1012	10120	10129	1013	10130	10134	10135	10142	10151	10153	10161	10170	10175	10185	10186	10188	10192	10193	10194	10205	10207	10208	10209	10214	1022	10223	10225	10226	10228	1023	10234	10237	10241	10261	10263	10264	10266	10273	10276	1028	10282	1029	10293	10298	103	10306	10307	10309	10312	10320	10333	10337	10338	10340	10344	10350	10359	10360	10368	10369	1037	10370	10373	10377	1038	10385	1039	10390	104	10406	10413	10415	10416	10419	10422	10424	10426	10439	1044	10440	10442	10445	1045	10456	10460	10469	1047	10479	1048	10480	10485	10496	10497	10499	10501	10502	10504	10507	10511	1052	10523	10524	10529	10544	10545	10549	10553	10558	10560	10563	10570	10580	10584	10586	10595	10601	1061	10613	10616	1062	10625	10628	10637	10643	10647	1065	10650	10658	10659	10660	10665	10666	10668	10671	10673	10677	10687	10690	10692	10697	10703	10705

*Note that the genotypes above are in "-", "a", "ab", "b" format*

(6) Convert the file to genepop format using Charlie's script (taken from Eleni's evernote)

In [6]:
cd ../scripts/UndercallingHets_MB_CW

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [7]:
!head Genepop_conversion_corrected.py

## Charles D. Waters <cwaters8@uw.edu>
## 11/01/2013
## This script takes the haplotype file that was output from Marine's Corrected Genotypes script ("genotypes_verif_v2.py") and then converts the genotypes from a,b,ab,and - to "0101","0202","0102", and "0000"

## Argument 1: Input haplotype file of corrected genotypes, which was generated by Marine's genotypes_verif_v2.py script
## Argument 2: Output file in the genepop format
#python Genepop_conversion_corrected.py batch_2_corrected_genotypes_2_alleles.txt batch_2_corrected_genotypes_2_alleles_genepop.txt
import sys




In [8]:
!python Genepop_conversion_corrected.py ../../L1L2stacks_m10/batch_1.CorrectedGenos_2alleles.txt ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic.genepop

In [9]:
cd ../../L1L2stacks_m10/

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


In [10]:
!head batch_1.CorrectedGenos_biallelic.genepop

	10006	10007	10010	10011	10012	10014	10017	10018	10020	10021	10026	1003	10032	10044	10062	10067	1007	10070	10077	10079	10080	10089	10096	10098	101	1010	10103	10109	10118	1012	10120	10129	1013	10130	10134	10135	10142	10151	10153	10161	10170	10175	10185	10186	10188	10192	10193	10194	10205	10207	10208	10209	10214	1022	10223	10225	10226	10228	1023	10234	10237	10241	10261	10263	10264	10266	10273	10276	1028	10282	1029	10293	10298	103	10306	10307	10309	10312	10320	10333	10337	10338	10340	10344	10350	10359	10360	10368	10369	1037	10370	10373	10377	1038	10385	1039	10390	104	10406	10413	10415	10416	10419	10422	10424	10426	10439	1044	10440	10442	10445	1045	10456	10460	10469	1047	10479	1048	10480	10485	10496	10497	10499	10501	10502	10504	10507	10511	1052	10523	10524	10529	10544	10545	10549	10553	10558	10560	10563	10570	10580	10584	10586	10595	10601	1061	10613	10616	1062	10625	10628	10637	10643	10647	1065	10650	10658	10659	10660	10665	10666	10668	10671	10673	10677	10687	10690	10692	10697	10703	10705

*Note that the genotypes are now in "0000", "0101", "0102", "0202" format*
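
The conversion itself is a simple lookup, using the mapping given in the header of Charlie's script ("a", "b", "ab", "-" to "0101", "0202", "0102", "0000"). The sketch below is not his code; `GENEPOP_CODES` and `convert_row` are names made up here, and the row layout (sample name first, then one genotype per locus) is assumed from the header line shown above.

```python
# Mapping taken from the description in Genepop_conversion_corrected.py.
GENEPOP_CODES = {"a": "0101", "b": "0202", "ab": "0102", "-": "0000"}


def convert_row(fields):
    """Convert one row of corrected genotypes to genepop codes.

    ASSUMPTION: fields[0] is the sample name and the remaining fields
    are genotypes in a / b / ab / - format.
    """
    return [fields[0]] + [GENEPOP_CODES[g] for g in fields[1:]]


print(convert_row(["sample_1", "a", "ab", "-", "b"]))
```

Using a dictionary lookup means any unexpected genotype string raises a `KeyError` instead of passing through silently.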

## Filter for MAF

The following scripts allow you to filter the new, corrected `.genepop` file for minor allele frequencies.


(1) Transpose the file so that minor allele frequencies can be calculated, using Dan's python script (from Eleni's evernote)

***Note: Eleni notes that you must first check the INPUT file to make sure that the first line is: "sample" followed by a list of all the sample names as column headers. If it is not, then you need to modify it. Since you can see above that I do not have a "sample" title for the first column, I modified the `batch_1.CorrectedGenos_biallelic.genepop` in Notepad++ by making "sample" the first word in the doc***

In [18]:
cd ../scripts/UndercallingHets_MB_CW

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [19]:
!head transpose.py

## This script was written by Dan Drinan and transposes the rows and columns of a haplotype file
#python transpose.py batch_2_corrected_genotypes_2_alleles_genepop.txt batch_2_corrected_genotypes_2_alleles_genepop_transposed.txt

import sys

input_file = open(sys.argv[1], 'r')

header = True
matrix_of_data = []



In [20]:
!python transpose.py ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_TextEdit.genepop ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_TRANSPOSED.genepop

In [21]:
cd ../../L1L2stacks_m10/

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


In [22]:
!head batch_1.CorrectedGenos_biallelic_TRANSPOSED.genepop

sample PO010715_06.1 PO010715_27.1 PO010715_28.1 PO010715_29.1 PO020515_05.1 PO020515_09.1 PO020515_10.1 PO010715_19.1 PO031715_20.1 PO020515_03.1 PO020515_08.1 PO010715_11.1 PO020515_16.1 PO020515_17.1 PO010715_17.1 PO020515_15.1 PO010715_10.1 PO031715_13.1 PO010715_08.1 PO020515_14.1 PO020515_06 PO031715_23 PO031715_03 PO010715_04 PO020515_01 PO031715_04 PO031715_24 PO010715_12 GE011215_08.1 GE011215_09.1 GE011215_14.1 GE011215_15.1 GE011215_10.1 GE012315_01.1 GE011215_07.1 GE011215_16.1 GE011215_29.1 GE012315_03.1 GE012315_22.1 GE012315_04.1 GE012315_05.1 GE012315_06.1 GE011215_20.1 GE011215_21.1 GE011215_30.1 GE011215_01.1 GE011215_24.1 GE012315_08.1 GE012315_09.1 GE012315_10.1 GE012315_11.1 GE012315_17.1 GE012315_20.1 GEO012315_02 GEO012315_12 GEO012315_18 GEO012315_21 GE011215_18 GE011215_22 GE011215_19 GE011215_28 NA021015_16.1 NA021015_21.1 NA021015_02.1 NA021015_03.1 NA021015_08.1 NA021015_13.1 NA021015_10.1 NA021015_17.1 NA021015_22.1 NA021015_14.1 NA021015_06.1 NA021015_09.1

*Also check the file size; the TRANSPOSED file is the same size as the original .genepop*
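
To show why the "sample" label matters, here is a minimal sketch of a transpose. It is not Dan's script; `parse_table` and `transpose` are hypothetical helpers that assume a tab-delimited file whose header line starts with an empty first field.

```python
def parse_table(text):
    """Split tab-delimited text into rows without stripping leading tabs."""
    rows = [line.split("\t") for line in text.splitlines() if line]
    # The header starts with a tab, so its first field is empty.
    # Stripping that line would shift every locus name one column left.
    if rows[0][0] == "":
        rows[0][0] = "sample"
    return rows


def transpose(rows):
    """Swap rows and columns; every row must have the same length."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("rows have different numbers of fields")
    return [list(column) for column in zip(*rows)]


text = "\tlocus_1\tlocus_2\nsample_1\t0101\t0202\nsample_2\t0102\t0000\n"
for row in transpose(parse_table(text)):
    print(" ".join(row))
```

After the transpose, loci are in rows and individuals in columns, with "sample" as the first word of the header line.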

(2) At this point Eleni saves her text file as a `.csv`, but I'm going to alter the script so that it runs as a text file. 
***Note that you should have individuals clustered by population prior to running the next script***

(3) Run Eleni's script to filter by MAF

***Important - you MUST change the script code.
At Line 30: specify which individuals are in each population by assigning columns to population names (see "rule" for assigning columns). I found column numbers for individuals by going into the TRANSPOSED file, inserting a row at the top of the spreadsheet, and numbering columns from 0 to the end. I didn't save the changes.***

***Next chunk of code: copy those 3 lines of code for each population, renaming the population each time.***


In [33]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [34]:
!head FilterLoci_by_MissingValues_MF_L1L2.py

### Eleni Petrou
#python Eleni_FilterLoci_by_MissingValues.py Marr15_batch1_filteredMAF_genotypes.csv Marr15_batch1_FinalCleanOutput.csv Marr15_batch1_FinalBlacklisted_output.csv

## MF edit 12/8/2016 for Lanes 1 and 2 Pcod samples

import sys

# Open your files for reading and writing
genotypes_file = open(sys.argv[1],'r')
clean_output_file = open(sys.argv[2],'w')


In [37]:
#Will get output to screen if too much missing data in a single population
!python FilterLoci_by_MissingValues_MF_L1L2.py ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_TRANSPOSED.genepop ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_CLEAN.genepop ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_BAD.genepop

In [38]:
cd ../../L1L2stacks_m10/

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


I now have two new files: 

`batch_1.CorrectedGenos_biallelic_CLEAN.genepop`

`batch_1.CorrectedGenos_biallelic_BAD.genepop`


After taking a look at the CLEAN file, it doesn't appear to be working correctly: the same loci are listed multiple times, and all of the individuals have the same genotype at each of those loci. I'm going to convert to a .csv and try the original script. 

**To convert to .csv:**

(1) open the genepop file in Notepad++ and copy all of the text

(2) open a new excel workbook, ctrl-a, and change the formatting to "text" to prevent the leading zeros on the genotypes from disappearing

(3) paste the genepop text into excel, save as `.csv`

(4) *!! In the python script, changed `.split()` to `.split(",")`*
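
The Notepad++ / excel round trip could also be replaced with a short Python conversion, which never interprets the genotypes as numbers and so keeps the leading zeros. This is a sketch rather than what I actually did; `whitespace_to_csv` is a hypothetical helper and assumes the fields are separated by spaces or tabs and contain no whitespace themselves.

```python
import csv


def whitespace_to_csv(in_path, out_path):
    """Rewrite a space- or tab-delimited text file as comma-separated.

    Every field is written as text, so genotype codes such as '0101'
    keep their leading zeros.
    ASSUMPTION: no field contains whitespace.
    """
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.split()
            if fields:  # skip blank lines
                writer.writerow(fields)
```

For example, `whitespace_to_csv("in.genepop", "out.csv")` (placeholder file names) would produce a file that a script using `.split(",")` can read.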

<br>
<br>

### 12/9/2016

Eleni sent me a new version of her code that already handles multiple populations, so I can adapt her script directly instead of guessing, from a one-population script, where to repeat the code for additional populations

In [1]:
cd ../../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [2]:
!head Eleni_filter_by_MinorAlleleFrequency.py

### Eleni Petrou
### June 15, 2015
### This script takes an input haplotype file with loci in rows and individuals in columns and calculated the allele frequencies for each allele at
### each locus. This script is based on Charlie's MAF_corrected_gebotypes script
#cd /mnt/hgfs/D/sequencing_data/Herring_PopulationStructure/output_stacks
#python Eleni_filter_by_MinorAlleleFrequency.py

## MF edited 12/9/2016 for P.cod




In [7]:
#in the new script, output files are specified within the script code
!python Eleni_filter_by_MinorAlleleFrequency_L1L2.py ../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_TRANSPOSED.csv

10014	0.0454545454532	0.954545454517	0.0	0.9999999995	0.0	0.999999999929	0.0399999999984	0.959999999962	0.0370370370357	0.962962962927	0.0	0.999999999833	0.0	0.9999999995
10079	0.0	0.99999999997	0.0	0.0	0.0	0.999999999933	0.0499999999983	0.949999999968	0.0185185185178	0.981481481445	0.0	0.999999999833	0.0	0.9999999995
10080	0.0151515151511	0.984848484819	0.0	0.9999999995	0.0333333333311	0.966666666602	0.0322580645151	0.967741935453	0.0185185185178	0.981481481445	0.0	0.999999999833	0.0	0.0
10103	0.99999999997	0.0	0.9999999995	0.0	0.999999999933	0.0	0.999999999966	0.0	0.999999999963	0.0	0.99999999975	0.0	0.9999999995	0.0
10135	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
10153	0.99999999997	0.0	0.999999999	0.0	0.966666666602	0.0333333333311	0.999999999963	0.0	0.999999999963	0.0	0.999999999833	0.0	0.0	0.0
10186	0.0303030303021	0.969696969668	0.0	0.9999999995	0.0	0.999999999933	0.0499999999983	0.949999999968	0.0370370370357	0.962962962927	0.0	0.999999999833	0.0	0.99999999975
102

I opened the `batch_1_filteredMAF_genotypes.csv` file in excel, and it looks like the loci / genotypes were populated correctly! To find the number of loci retained, I can just look at the line count of the file, minus 1 for the header
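
For reference, this is roughly how allele frequencies can be computed from genepop-coded genotypes at one locus. It is a minimal sketch under my own assumptions, not Eleni's script: the function names are made up, only the codes "0101", "0102", "0202" and "0000" are expected, and no filtering threshold is applied here.

```python
def allele_frequencies(genotypes):
    """Return (freq of allele 01, freq of allele 02) at one locus.

    Genotypes are 4-character genepop codes; '0000' is missing data and
    is skipped. Returns (0.0, 0.0) when no genotype was called.
    Any allele other than '01' or '02' raises a KeyError.
    """
    counts = {"01": 0, "02": 0}
    for genotype in genotypes:
        if genotype == "0000":
            continue
        for allele in (genotype[:2], genotype[2:]):
            counts[allele] += 1
    called = counts["01"] + counts["02"]
    if called == 0:
        return 0.0, 0.0
    return counts["01"] / called, counts["02"] / called


def minor_allele_frequency(genotypes):
    """Frequency of the less common of the two alleles."""
    return min(allele_frequencies(genotypes))


print(minor_allele_frequency(["0101", "0101", "0102", "0000"]))
```

A filter would then keep a locus only if this value clears some cutoff, typically evaluated per population.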

In [8]:
cd ../../L1L2stacks_m10

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10


In [11]:
!cat batch_1_filteredMAF_genotypes.csv | wc -l

8655


A total of **8,654 loci** remain - Eleni had 6,506 remaining after filtering for MAF. 
<br>
<br>
At this point, I would normally run Eleni's script to filter out loci that are missing more than 50% of the genotypes in any one population. HOWEVER, since my degraded samples look like they are missing MOST of their genotypes, I don't want this to affect the filtering. 

So I altered the script so that it copies over all of the Mukho / Sokcho info, but doesn't filter the loci based on those individuals. I also only have 3 individuals from Boryeong and 6 (really 5, since one is a replicate) from Yellow Sea, and I don't want those to sway the loci being filtered, so I left those out as well. 
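
The idea behind the altered filter can be sketched as follows. This is not the script itself: `passes_missing_filter`, the `pop_columns` layout and the population names in the example are all hypothetical, and the 50% cutoff is the one described above.

```python
MISSING = "0000"


def fraction_missing(genotypes):
    """Share of genotypes at a locus that are coded as missing."""
    return sum(1 for g in genotypes if g == MISSING) / len(genotypes)


def passes_missing_filter(row, pop_columns, filter_pops, max_missing=0.5):
    """Check one locus against the missing-data cutoff.

    row:         list of fields for one locus (locus ID, then genotypes)
    pop_columns: dict of population name -> list of column indices in row
    filter_pops: populations that are allowed to blacklist a locus;
                 populations not listed are kept in the output but ignored.
    """
    for pop in filter_pops:
        genotypes = [row[i] for i in pop_columns[pop]]
        if fraction_missing(genotypes) > max_missing:
            return False
    return True


# Hypothetical layout: columns 1-4 are a well-sampled population,
# columns 5-6 are degraded samples excluded from the filter.
pop_columns = {"well_sampled": [1, 2, 3, 4], "degraded": [5, 6]}
row = ["10014", "0101", "0102", "0000", "0101", "0000", "0000"]
print(passes_missing_filter(row, pop_columns, filter_pops=["well_sampled"]))
```

In this example the locus is kept because only the excluded population is mostly missing; adding that population to `filter_pops` would blacklist it.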


In [13]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [15]:
!head MF_FilterLoci_by_MissingValues_L1L2_12-9.py 

import sys

# Open your files for reading and writing
genotypes_file = open(sys.argv[1],'r')
clean_output_file = open(sys.argv[2],'w')
blacklisted_output_file = open(sys.argv[3], 'w')

count = 0
for mystring in genotypes_file:		# Read in each line in the file as a string
	if count == 0: 


In [26]:
!python MF_FilterLoci_by_MissingValues_L1L2_12-9.py ../../L1L2stacks_m10/batch_1_filteredMAF_genotypes.csv ../../L1L2stacks_m10/batch_1_filteredMAF_GOODgenotypes.csv ../../L1L2stacks_m10/batch_1_filteredMAF_BADgenotypes.csv 

processed 8654 loci
Number of loci removed: 137


After removing loci with too much missing data in the populations with sample size n > 10, I have **8,517 loci** remaining (8,654 - 137). 
<br>
<br>

When filtering for individuals with fewer than 50% genotypes called, I will use the output file from this step, `batch_1_filteredMAF_GOODgenotypes.csv`