## Understanding Genotype Error

I see ~6% genotyping error between replicates prepared with two different protocols. This notebook reviews the approaches I am trying in order to reduce that error.

<br>
ONE. Remove loci out of Hardy-Weinberg equilibrium (HWE). The loci with genotype mismatches may be ones that tend to be out of HWE and would be removed anyway. I tested this with Genepop v4.2.

<br>
TWO. Rerun Stacks using the bounded SNP model instead of the default SNP model.

- An article explaining the difference between the two models can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3936987/)
- In the article, they explain that "Allowing high values for the error rate ε (e.g. greater than 10%) increases the likelihood that a locus with a number of alternative reads will be called a homozygous site with excessive error. Reducing the upper ε bound decreases the chance of calling a homozygote when the true genotype is heterozygous". Since Stacks overcalls homozygotes when ε is allowed to be high, we want the error rate to be very low. The problem with lowering the error rate is that it also significantly decreases the number of loci that are retained. To get around this, our lab uses a high error rate (to retain more loci) and then corrects the genotypes (getting rid of the excess of homozygotes) using Marine's script.
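To make the effect of the ε bound concrete, here is a minimal sketch of a likelihood-ratio genotype call with a bounded error rate. It is a simplified illustration of the multinomial model described in the linked article, not the Stacks source code. The function name `call_genotype`, the 3.84 threshold (chi-square, 1 d.f., alpha = 0.05), and the read counts in the usage note are all assumptions chosen for illustration.

```python
import math

CHI2_CRIT = 3.84  # assumed significance threshold: chi-square, 1 d.f., alpha = 0.05


def _xlogy(count, prob):
    """Return count * log(prob), treating 0 * log(0) as 0."""
    if count == 0:
        return 0.0
    if prob <= 0.0:
        return float("-inf")
    return count * math.log(prob)


def call_genotype(n1, n2, n3=0, n4=0, eps_low=0.0, eps_high=1.0):
    """Call one site from nucleotide read counts sorted so n1 >= n2 >= n3 >= n4.

    eps_low and eps_high bound the per-site sequencing error rate epsilon.
    Leaving them at 0 and 1 corresponds to an effectively unbounded estimate.
    """
    n = float(n1 + n2 + n3 + n4)

    def clip(eps):
        return min(max(eps, eps_low), eps_high)

    # Maximum-likelihood epsilon under each hypothesis, clipped to the bounds.
    eps_hom = clip(4.0 / 3.0 * (1.0 - n1 / n))
    eps_het = clip(2.0 * (n3 + n4) / n)

    # Log-likelihoods; the multinomial coefficient is identical and cancels.
    ll_hom = _xlogy(n1, 1.0 - 0.75 * eps_hom) + _xlogy(n2 + n3 + n4, eps_hom / 4.0)
    ll_het = _xlogy(n1 + n2, 0.5 - eps_het / 4.0) + _xlogy(n3 + n4, eps_het / 4.0)

    # Only make a call when the likelihood ratio is significant.
    if 2.0 * abs(ll_hom - ll_het) < CHI2_CRIT:
        return "unknown"
    return "homozygote" if ll_hom > ll_het else "heterozygote"
```

With these made-up counts, a site with 17 reads of one nucleotide and 3 of another is called a homozygote when ε is unbounded (the 3 minor reads are absorbed as error), but `call_genotype(17, 3, eps_high=0.05)` returns `"unknown"`. A 16/4 site goes from `"unknown"` to `"heterozygote"` under the same bound. This is the pattern the quote describes: a tighter upper bound makes it harder to explain away alternative reads as sequencing error.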


<br>
<br>
**(1) Remove loci out of HWE**

I ran Genepop 4.2 on my file `batch_1_filtered_MAF_MissingLoci_Individs.txt`.

This produced a lot of temporary files, which were all deleted at the end of the run, and left me with a single large text file containing all of the p-values.
However, the format of that file makes it difficult to analyze, so I followed these steps to simplify the process of determining which loci are out of HWE:

1. Copy and paste into Excel, convert text to columns, delete the header
2. Save as a .csv file
3. Run a python script that condenses the long text file into a dictionary

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-PCod/notebooks'

In [3]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [5]:
cd ../Analyses/genepop

/mnt/hgfs/Pacific cod/DataAnalysis/Analyses/genepop


In [7]:
!head HWE_genepop_to_list.py

### Loci out of HWE ###
## Use this script to find the loci OUT of HWE from genepop's .txt file output. 
## MF 1/17/2017

## ARG 1 - genepop input file
## To Change in the Script: 
##   1. the end line number - the last line in the section of the genepop output file that has a separate chart for each locus (current is 18828)
##   2. the total number of lines that each locus' chart takes up (current is 19)
##   3. the number of lines to add on each iteration of the "for" loop (#2 - 1, current is 18)
#############################################


In [None]:
!python HWE_genepop_to_list.py batch_1_filtered_MAF_MissingLoci_Individs.txt

This caused some problems.


The script didn't work because some of the loci didn't have enough information to run Fisher's exact test. Their charts did not span the expected number of lines, which threw off the fixed line counting.
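One way to avoid depending on a fixed number of lines per locus is to start a new block whenever a locus header line appears. The sketch below is not the `HWE_genepop_to_list.py` script. The function name `split_locus_blocks` is hypothetical, and the header pattern (a line beginning with `Locus` followed by the locus name) is an assumption that should be checked against the actual Genepop output file.

```python
import re


def split_locus_blocks(lines, header_pattern=r'^\s*Locus\s+"?([^"\s]+)"?'):
    """Group lines into {locus_name: [lines]} using header lines as delimiters.

    header_pattern is an assumed format for the per-locus header line.
    Blocks may have any length, so short "no information" charts do not
    shift the parsing of the loci that follow them.
    """
    header = re.compile(header_pattern)
    blocks = {}
    current = None
    for line in lines:
        match = header.match(line)
        if match:
            # A header line opens a new block for this locus.
            current = match.group(1)
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line.rstrip("\n"))
    return blocks


# Made-up lines for illustration only; not real Genepop output.
demo = [
    'Locus "1001"',
    "popA 0.50",
    "popB 0.01",
    'Locus "1002"',
    "No information.",
    'Locus "1003"',
    "popA 0.20",
]
blocks = split_locus_blocks(demo)
```

Here `blocks["1002"]` holds a single line while `blocks["1001"]` holds two, and both are parsed correctly because nothing relies on every chart being the same length.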

Why is this happening?

-- Those loci have the same genotype in every genotyped individual, plus some missing genotypes. Somewhere in the filtering scripts, the monomorphic loci are not being filtered out.

-- I deleted the monomorphic loci from the first 1110 loci and ended up with 989 polymorphic loci. Extrapolating this to my 8517 final loci, I probably only have ~7573 truly polymorphic loci. (See spreadsheet [here]()).

-- This means that my genotyping error between the 300ng and 500ng replicates is actually higher than I estimated, since the loci being deleted are monomorphic and therefore can't be loci that are mismatched.
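The extrapolation can be checked with a quick proportional estimate. This sketch assumes that 989 of the 1110 checked loci remained after the monomorphic ones were removed, and that the same fraction applies to the full set of 8517 loci. The variable names are only for illustration.

```python
checked_loci = 1110   # loci inspected by hand
retained_loci = 989   # assumed: loci left after removing monomorphic ones
final_loci = 8517     # loci in the final filtered data set

# Fraction of checked loci that were polymorphic, applied to the full set.
polymorphic_fraction = retained_loci / float(checked_loci)
estimated_polymorphic = int(round(polymorphic_fraction * final_loci))
```

This gives an estimate of 7589 loci, in the same range as the ~7573 figure above.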

<br>

I also ran the above script after deleting monomorphic loci up to line 18828 (1110 loci). Of these 1110 loci, only 20 had 4 or more populations out of HWE.
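Once the p-values are in a dictionary, counting the loci that are out of HWE in several populations takes only a few lines. This is a sketch rather than the script used above. The dictionary layout (`{locus: {population: p_value}}`), the function name `loci_out_of_hwe`, and the 0.05 significance cutoff are assumptions. Only the threshold of 4 populations comes from the text.

```python
def loci_out_of_hwe(pvalues, alpha=0.05, min_pops=4):
    """Return loci with at least min_pops populations where p < alpha.

    pvalues is assumed to map locus -> {population: p_value}; a p_value of
    None marks a population where the test could not be run.
    """
    flagged = []
    for locus, by_pop in pvalues.items():
        n_out = sum(1 for p in by_pop.values() if p is not None and p < alpha)
        if n_out >= min_pops:
            flagged.append(locus)
    return sorted(flagged)


# Made-up p-values for illustration only.
demo_pvalues = {
    "1001": {"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04, "E": 0.60},
    "1002": {"A": 0.01, "B": 0.50, "C": None, "D": 0.70, "E": 0.90},
}
flagged = loci_out_of_hwe(demo_pvalues)
```

In this toy input only locus `1001` is flagged, because four of its five populations fall below the cutoff.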


**(2) Rerun Stacks using the bounded SNP model**

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-PCod/notebooks'

In [2]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [5]:
!python ustacks_populations_genShell_L1L2_1-17.py barcodesL1.txt barcodesL2.txt samples_for_cstacks_L1L2.txt

In [6]:
mv ustacks_populations_shell_boundSNP.sh ../ustacks_populations_shell_boundSNP.sh

In [7]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis


In [8]:
!head ustacks_populations_shell_boundSNP.sh

#!/bin/bash
cd /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis
mkdir L1L2stacks_m10_boundSNP

#ustacks
ustacks -t gzfastq -f L1L2samplesT142/PO010715_06.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 001 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_27.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 002 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_28.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 003 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_29.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 004 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/GE011215_08.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 005 -m 10 -M 3 -p 6 --model_type bounded
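For reference, command lines of this shape can be generated with a short loop. The sketch below is not the `ustacks_populations_genShell_L1L2_1-17.py` script. The function name `ustacks_commands` and its arguments are hypothetical, and the flags are copied from the shell output above rather than chosen independently.

```python
def ustacks_commands(sample_files,
                     in_dir="L1L2samplesT142",
                     out_dir="L1L2stacks_m10_boundSNP"):
    """Build one bounded-model ustacks command per sample file.

    Sample IDs are assigned in list order, zero-padded to three digits,
    matching the -i values seen in the generated shell script.
    """
    template = ("ustacks -t gzfastq -f {in_dir}/{name} -r -d -o {out_dir} "
                "-i {sample_id:03d} -m 10 -M 3 -p 6 --model_type bounded")
    return [
        template.format(in_dir=in_dir, name=name, out_dir=out_dir, sample_id=i)
        for i, name in enumerate(sample_files, start=1)
    ]


commands = ustacks_commands(["PO010715_06.1.fq.gz", "PO010715_27.1.fq.gz"])
```

Each element of `commands` is one line that could be written into the shell script after the `mkdir` step.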


In [None]:
!./ustacks_populations_shell_boundSNP.sh