## Understanding Genotype Error

I have ~6% genotyping error between replicates prepared using 2 different protocols. This notebook reviews ways that I am attempting to cut down on this error.

<br>
ONE. Remove loci out of Hardy-Weinberg equilibrium (HWE) - Maybe the loci with genotype mismatches are ones that tend to be out of HWE, and would be removed anyway. Ran Genepop v4.2.

<br>
TWO. rerun stacks using the bounded snp model, as opposed to the default snp model. 

- An article explaining the difference between the two models can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3936987/)
- In the article, they explain that "Allowing high values for the error rate ε (e.g. greater than 10%) increases the likelihood that a locus with a number of alternative reads will be called a homozygous site with excessive error. Reducing the upper ε bound decreases the chance of calling a homozygote when the true genotype is heterozygous". Since stacks undercalls heterozygotes when the error rate is allowed to be high, we want the upper bound on the error rate to be low. The problem with lowering the bound is that it also significantly decreases the number of loci that are retained. To get around this, our lab uses a high error rate (to retain more loci), and then corrects the genotypes (getting rid of the excess of homozygotes) using Marine's script.
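
To make the effect of the ε bound concrete, here is a minimal sketch of a diploid likelihood-ratio genotype call of the kind the article describes. It is a simplified illustration, not Stacks' own code: the function names, the 3.84 cutoff (chi-square, 1 d.f., alpha = 0.05), and the "unknown" label are assumptions made for this example.

```python
import math

CHI2_CRIT = 3.84  # assumed cutoff: chi-square, 1 d.f., alpha = 0.05


def _loglik(counts_probs):
    """Sum count * log(prob), skipping zero counts so 0 * log(0) counts as 0."""
    return sum(c * math.log(p) for c, p in counts_probs if c > 0)


def call_genotype(read_counts, eps_upper=1.0, eps_lower=0.0):
    """Call 'hom', 'het', or 'unknown' from the four nucleotide read counts.

    Assumes at least one read and eps_upper > 0.
    """
    n1, n2, n3, n4 = sorted(read_counts, reverse=True)
    n = n1 + n2 + n3 + n4

    def clamp(eps):
        return min(max(eps, eps_lower), eps_upper)

    # Maximum-likelihood error rate under each hypothesis, then bounded.
    eps_hom = clamp(4.0 * (n - n1) / (3.0 * n))
    eps_het = clamp(2.0 * (n3 + n4) / float(n))

    ll_hom = _loglik([(n1, 1 - 0.75 * eps_hom), (n - n1, eps_hom / 4)])
    ll_het = _loglik([(n1 + n2, 0.5 - eps_het / 4), (n3 + n4, eps_het / 4)])

    # If the two hypotheses are not distinguishable, make no call.
    if 2 * abs(ll_hom - ll_het) < CHI2_CRIT:
        return "unknown"
    return "hom" if ll_hom > ll_het else "het"


print(call_genotype([17, 3, 0, 0]))                  # unbounded error rate
print(call_genotype([17, 3, 0, 0], eps_upper=0.01))  # tightly bounded error rate
```

With 17 reads of one nucleotide and 3 of another, this sketch calls a homozygote when ε is unbounded, makes no call with an upper bound of 0.05, and calls a heterozygote with an upper bound of 0.01, which is the behavior the quoted passage describes.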


<br>
<br>
### (1) Remove loci out of HWE

I ran genepop 4.2 on my file `batch_1_filtered_MAF_MissingLoci_Individs.txt`. 

Genepop wrote a number of temporary files, removed them when it finished, and left a single large text file containing all of the p-values.
The format of that file is difficult to analyze directly, so I followed these steps to work out which loci are out of HWE:

1. Copy and paste into excel, convert text to columns, delete the header
2. Save as a .csv file
3. Run a python script that condenses the long text file into a dictionary

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-PCod/notebooks'

In [3]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [5]:
cd ../Analyses/genepop

/mnt/hgfs/Pacific cod/DataAnalysis/Analyses/genepop


In [7]:
!head HWE_genepop_to_list.py

### Loci out of HWE ###
## Use this script to find the loci OUT of HWE from genepop's .txt file output. 
## MF 1/17/2017

## ARG 1 - genepop input file
## To Change in the Script: 
##   1. the end line number - the last line in the section of the genepop output file that has a separate chart for each locus (current is 18828)
##   2. the total number of lines that each locus' chart takes up (current is 19)
##   3. the number of lines to add on each iteration of the "for" loop (#2 - 1, current is 18)
#############################################


In [None]:
python HWE_genepop_to_list.py batch_1_filtered_MAF_MissingLoci_Individs.txt

**And this caused some problems...**


The script didn't work because some of the loci didn't have enough information for Genepop to run Fisher's exact test, which threw off the script's fixed line counting.
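
One way to avoid depending on a fixed number of lines per locus is to split the file on a header marker instead. The sketch below assumes, hypothetically, that every locus chart begins with a line starting with `Locus`; the real Genepop layout may differ, and `split_locus_blocks` is an illustrative name rather than part of `HWE_genepop_to_list.py`.

```python
def split_locus_blocks(lines, header_prefix="Locus"):
    """Group lines into per-locus blocks keyed by the header line.

    Assumes (hypothetically) that every locus block starts with a line
    beginning with `header_prefix`; blocks may contain any number of lines.
    """
    blocks = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(header_prefix):
            current = stripped
            blocks[current] = []
        elif current is not None and stripped:
            blocks[current].append(stripped)
    return blocks


example = [
    "Locus A", "pop1 0.50", "pop2 0.01", "",
    "Locus B", "No information",
    "Locus C", "pop1 0.20",
]
for header, rows in split_locus_blocks(example).items():
    print(header, len(rows))
```

Because each block is delimited by its header, a locus with a shorter chart no longer shifts the position of every locus after it.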

**WHY IS THIS HAPPENING?**

-- Genepop does not calculate HWE if: (1) there is only one allele present in the population, or (2) there are two alleles present, but the second allele is only represented once. Since our biallelic catalog considers both of these cases as polymorphic loci, these loci *should* be retained as polymorphic; they just can't have HWE calculated.
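
The two conditions above can be checked directly from a population's genotypes. This is a minimal sketch assuming 4-digit Genepop-style genotype codes (two digits per allele, `0000` for missing); the function names are illustrative only.

```python
from collections import Counter


def allele_counts(genotypes, missing="0000"):
    """Count alleles across 4-digit genotypes such as '0102'."""
    counts = Counter()
    for g in genotypes:
        if g == missing:
            continue
        counts[g[:2]] += 1
        counts[g[2:]] += 1
    return counts


def hwe_testable(genotypes):
    """False if only one allele is present, or if the second allele
    appears just once (the two cases described above)."""
    counts = allele_counts(genotypes)
    if len(counts) < 2:
        return False
    second_most_common = sorted(counts.values())[-2]
    return second_most_common >= 2


print(hwe_testable(["0101", "0101", "0000"]))  # one allele only
print(hwe_testable(["0101", "0102", "0101"]))  # second allele seen once
print(hwe_testable(["0102", "0102", "0101"]))  # second allele seen twice
```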

-- In a sample of 1110 loci, genepop was unable to calculate HWE for 121 loci. I groundtruthed two loci which were not used by genepop, to ensure that they were indeed polymorphic - they are (See spreadsheet [here](https://github.com/mfisher5/mf-fish546-PCod/blob/master/Analyses/loci_monomorphic_checks.xlsx)). 

<br>

I also ran the script to pull out HWE p-values after deleting the problematic polymorphic loci up to line 18828 (1110 loci). **Of these 1110 loci, only 20 loci had 4 or more populations out of HWE.**
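
Counting loci that are out of HWE in 4 or more populations can be sketched as below, once the p-values are in a dictionary. The significance cutoff of 0.05 is an assumption (the threshold used is not recorded here), and `loci_out_of_hwe` is a hypothetical helper name.

```python
def loci_out_of_hwe(pvalues_by_locus, alpha=0.05, min_pops=4):
    """Return loci whose HWE p-value is below `alpha` in at least
    `min_pops` populations. `None` marks populations with no test."""
    flagged = []
    for locus, pvals in pvalues_by_locus.items():
        n_out = sum(1 for p in pvals.values() if p is not None and p < alpha)
        if n_out >= min_pops:
            flagged.append(locus)
    return flagged


example = {
    "locus_1": {"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04, "E": 0.50},
    "locus_2": {"A": 0.01, "B": 0.50, "C": 0.60, "D": None, "E": 0.90},
}
print(loci_out_of_hwe(example))
```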


<br>
**Effect on genotype mismatch percentage**

I recalculated genotype mismatches between 300ng and 500ng replicates after removing the 20 loci NOT in HWE. 

The average change in genotype mismatch was 0 percentage points, with per-sample changes ranging from +0.023 (9.59% to 9.62%) to -0.016 (3.21% to 3.19%; 8.15% to 8.13%).

There was no obvious bias toward increasing or decreasing genotype mismatch across samples.
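
For reference, the mismatch comparison can be sketched as a small function. This assumes each replicate is a dictionary of locus name to a 4-digit genotype code with `0000` for missing, and that loci missing in either replicate are skipped; those conventions and the function name are assumptions for illustration, not the spreadsheet calculation actually used.

```python
def mismatch_percent(rep_a, rep_b, exclude=(), missing="0000"):
    """Percent of loci genotyped in both replicates whose calls differ."""
    excluded = set(exclude)
    compared = 0
    mismatched = 0
    for locus, geno_a in rep_a.items():
        geno_b = rep_b.get(locus, missing)
        if locus in excluded or geno_a == missing or geno_b == missing:
            continue
        compared += 1
        if geno_a != geno_b:
            mismatched += 1
    if compared == 0:
        return float("nan")
    return 100.0 * mismatched / compared


rep_300 = {"L1": "0101", "L2": "0102", "L3": "0000", "L4": "0202"}
rep_500 = {"L1": "0101", "L2": "0101", "L3": "0101", "L4": "0202"}
print(mismatch_percent(rep_300, rep_500))                  # all shared loci
print(mismatch_percent(rep_300, rep_500, exclude=["L2"]))  # after dropping a locus
```

Passing the loci out of HWE as `exclude` gives the before and after percentages for one replicate pair.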

<br>
<br>


### (2) Rerun stacks using bounded SNP model

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-PCod/notebooks'

In [2]:
cd ../../scripts

/mnt/hgfs/Pacific cod/DataAnalysis/scripts


In [5]:
!python ustacks_populations_genShell_L1L2_1-17.py barcodesL1.txt barcodesL2.txt samples_for_cstacks_L1L2.txt

In [6]:
mv ustacks_populations_shell_boundSNP.sh ../ustacks_populations_shell_boundSNP.sh

In [7]:
cd ../

/mnt/hgfs/Pacific cod/DataAnalysis


In [8]:
!head ustacks_populations_shell_boundSNP.sh

#!/bin/bash
cd /mnt/hgfs/Shared\ Drive\ D/Pacific\ cod/DataAnalysis
mkdir L1L2stacks_m10_boundSNP

#ustacks
ustacks -t gzfastq -f L1L2samplesT142/PO010715_06.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 001 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_27.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 002 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_28.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 003 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/PO010715_29.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 004 -m 10 -M 3 -p 6 --model_type bounded
ustacks -t gzfastq -f L1L2samplesT142/GE011215_08.1.fq.gz -r -d -o L1L2stacks_m10_boundSNP -i 005 -m 10 -M 3 -p 6 --model_type bounded


In [None]:
./ustacks_populations_shell_boundSNP.sh

**Marine's scripts: Using [this](https://github.com/mfisher5/mf-fish546-PCod/blob/master/notebooks/Lanes%201%20and%202%20combined%20pipeline.ipynb) notebook as template**

In [8]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/Analyses/genepop'

In [9]:
cd ../../L1L2stacks_m10_boundSNP

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10_boundSNP


In [10]:
!gzip -d batch_3.catalog.snps.tsv.gz

In [11]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [13]:
!python preparing_file_for_correcting_genotypes.py \
../../L1L2stacks_m10_boundSNP/batch_3.haplotypes2.tsv \
../../L1L2stacks_m10_boundSNP/batch_3.biallelic_catalog.tsv \
../../L1L2stacks_m10_boundSNP/batch_3.catalog.snps2.tsv \
1

4 116 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C A C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C
6 96 AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC GTA AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC AAC
7 117 G G G G G G G G G G G G A G G G G G G G A G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G
9 33 ACC TCC ACC TCC TCC ACC ACC ACC ACC ACC ACC ACC ACC TCC ACC TCC ACC ACC TCC ACC TCC ACC ACC ACC TCC ACC TCC ACC A

In [15]:
cd ../../

/mnt/hgfs/Pacific cod/DataAnalysis


In [16]:
!head gzip_MBgenotypesverif_BASHshell.sh

#!/bin/bash

### This shell script will unzip all of the individual .tags.tsv files needed for Marine Brieuc's genotypes_verif.py script, then call Marine's python script. Use this bash script AFTER running Marine's script `preparing_file_for_correcting_genotypes.py` ###

## M. Fisher 12/5/2016


#Ask for input from user
echo "This is your current location:"
pwd


In [None]:
#at command line
./gzip_MBgenotypesverif_BASHshell.sh

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/mf-fish546-PCod/notebooks'

In [2]:
cd ../../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [3]:
!python Genepop_conversion_corrected.py \
../../L1L2stacks_m10_boundSNP/batch_3.CorrectedGenos_2alleles.txt \
../../L1L2stacks_m10_boundSNP/batch_3.CorrectedGenos_biallelic.genepop

In [6]:
!python transpose.py \
../../L1L2stacks_m10_boundSNP/batch_3.CorrectedGenos_biallelic_TextEdit.genepop \
../../L1L2stacks_m10_boundSNP/batch_3.CorrectedGenos_biallelic_TRANSPOSED.genepop

Double checked column indices for each population before running the next script. Also changed output files. 

In [8]:
!python Eleni_filter_by_MinorAlleleFrequency_L1L2_boundSNP.py \
../../L1L2stacks_m10_boundSNP/batch_3.CorrectedGenos_biallelic_TRANSPOSED.csv

In [9]:
cd ../../L1L2stacks_m10_boundSNP/

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10_boundSNP


In [10]:
!cat batch_3.filtered_MAF.csv | wc -l

8748


**A total of 8,747 loci remaining**

At this point, I again altered the script below so that it copies over all of the Mukho / Sokcho info, but doesn't filter the loci based on those individuals. 

In [11]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [12]:
!python MF_FilterLoci_by_MissingValues_L1L2_12-9.py \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF.csv \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF_GOODgenotypes.csv \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF_BADgenotypes.csv 

processed 8747 loci
Number of loci removed: 135


In [13]:
cd ../../L1L2stacks_m10_boundSNP/

/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10_boundSNP


In [14]:
mv batch_3.filtered_MAF_GOODgenotypes.csv batch_3.filtered_MAF_MissingLoci.csv

After removing any loci with missing data in populations with sample "n" > 10, I have **8,612 loci remaining**. 


When filtering for individuals with fewer than 50% genotypes called, I used the output file from this step, `batch_3.filtered_MAF_MissingLoci.csv`. 
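
The individual filter amounts to a call-rate check per sample. A minimal sketch, assuming genotypes are stored per individual as lists of 4-digit codes with `0000` for missing (the function name and data layout are hypothetical):

```python
def filter_individuals(genotypes_by_individual, min_call_rate=0.5, missing="0000"):
    """Keep individuals with at least `min_call_rate` of genotypes called."""
    kept = {}
    for individual, genos in genotypes_by_individual.items():
        if not genos:
            continue
        called = sum(1 for g in genos if g != missing)
        if called / float(len(genos)) >= min_call_rate:
            kept[individual] = genos
    return kept


example = {
    "sample_A": ["0101", "0102", "0000", "0202"],  # 75% called
    "sample_B": ["0000", "0000", "0000", "0101"],  # 25% called
}
print(sorted(filter_individuals(example)))
```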

<br>
I generated the fully filtered file, `batch_3.filtered_MAF_MissingLoci_Individs.csv`, then worked in Excel to determine mismatch proportions.

**See results / comparisons to the default SNP model [here](http://www.evernote.com/l/AorKnBdfi5JHNrTXpsNl2cmZjhJvcoLNTr8/)**

<br>
<br>


### (4) Filter Loci with > 20% (not >50%) missing data

The loci that are mismatched may be those with more missing data (somehow?). Since I'm working with a lenient filtering threshold here, I'll make it more strict by getting rid of loci with >20% missing data. 
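
As a rough illustration of the stricter threshold, the sketch below splits loci into retained and removed sets based on the fraction of missing genotypes. It applies the 20% cutoff across all individuals at once, which is an assumption; the actual `MF_FilterLoci_by_MissingValues_L1L2_20p.py` script may apply it per population.

```python
def filter_loci_by_missing(genotypes_by_locus, max_missing=0.20, missing="0000"):
    """Split loci into (good, bad), where bad loci have more than
    `max_missing` of their genotypes missing."""
    good, bad = {}, {}
    for locus, genos in genotypes_by_locus.items():
        if genos:
            frac = sum(1 for g in genos if g == missing) / float(len(genos))
        else:
            frac = 1.0  # no data at all counts as fully missing
        if frac > max_missing:
            bad[locus] = genos
        else:
            good[locus] = genos
    return good, bad


example = {
    "L1": ["0101"] * 9 + ["0000"],      # 10% missing
    "L2": ["0101"] * 7 + ["0000"] * 3,  # 30% missing
}
good, bad = filter_loci_by_missing(example)
print(sorted(good), sorted(bad))
```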


**(1) Default SNP model**

In [15]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/L1L2stacks_m10_boundSNP'

In [16]:
cd ../scripts/UndercallingHets_MB_CW/

/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW


In [17]:
!python MF_FilterLoci_by_MissingValues_L1L2_20p.py \
../../L1L2stacks_m10/batch_1_filteredMAF_genotypes.csv \
../../L1L2stacks_m10/batch_1.filtered_MAF_MissingLoci20.csv \
../../L1L2stacks_m10/batch_1.filtered_MAF_MissingLoci20_BADgenotypes.csv 

processed 8654 loci
Number of loci removed: 1232


**7422 Loci Retained**

New data in `batch_1.filtered_MAF_MissingLoci20_Individs.csv`

<br>
**(2) Bounded SNP model**

In [18]:
!python MF_FilterLoci_by_MissingValues_L1L2_20p.py \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF.csv \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF_MissingLoci20.csv \
../../L1L2stacks_m10_boundSNP/batch_3.filtered_MAF_MissingLoci20_BADgenotypes.csv  

processed 8747 loci
Number of loci removed: 1239


**7508 Loci Retained**

New data in `batch_3.filtered_MAF_MissingLoci20_Individs.csv`

### (5) Rerun DAPCs...

#### (1) after specifying 300ng as a separate population in MAF filtering. 

#### (2) With the bounded SNP model, using loci missing less than 50% of data

<br>
<br>

**(1) MAF Filtering**
One of the justifications for using both protocols despite genotype mismatches is that in the DAPC plot, when 500ng and 300ng are listed as two separate populations, they cluster on top of each other. BUT I put them through the MAF filtering scripts as the same population. Could this bias the DAPC toward clustering them together when they really aren't the same? 

It shouldn't! But I'll test it! 
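
To see why the population grouping could matter at the filtering stage, here is a small sketch of a per-population MAF filter. It assumes a biallelic locus passes if its minor allele frequency reaches the threshold in at least one population, with a threshold of 0.05; both the rule and the threshold are assumptions for illustration and may not match Eleni's script.

```python
from collections import Counter


def minor_allele_freq(genotypes, missing="0000"):
    """Minor allele frequency for a biallelic locus (4-digit genotypes)."""
    counts = Counter()
    for g in genotypes:
        if g != missing:
            counts[g[:2]] += 1
            counts[g[2:]] += 1
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / float(total)


def passes_maf(genotypes_by_pop, threshold=0.05):
    """True if any population reaches the MAF threshold (assumed rule)."""
    return any(minor_allele_freq(g) >= threshold for g in genotypes_by_pop.values())


ng500 = ["0101"] * 30
ng300 = ["0101"] * 8 + ["0102"] * 2

print(passes_maf({"pooled": ng500 + ng300}))         # one population
print(passes_maf({"500ng": ng500, "300ng": ng300}))  # two populations
```

In this toy case the locus fails when the replicates are pooled (MAF 0.025) but passes when 300ng is its own population (MAF 0.10), so the set of loci going into the DAPC can differ between the two groupings.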


In [19]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/scripts/UndercallingHets_MB_CW'

In [21]:
!python Eleni_filter_by_MinorAlleleFrequency_L1L2_300sep.py \
../../L1L2stacks_m10/batch_1.CorrectedGenos_biallelic_TRANSPOSED.csv 

In [22]:
!python MF_FilterLoci_by_MissingValues_L1L2_12-9.py \
../../L1L2stacks_m10/batch_1_300.filteredMAF.csv \
../../L1L2stacks_m10/batch_1_300.filteredMAF_MissingLoci.csv \
../../L1L2stacks_m10/batch_1_300.filteredMAF_MissingLoci_BADgenotypes.csv 

processed 8770 loci
Number of loci removed: 203


Convert to genepop for input to R DAPC program

In [23]:
!python genepop_conversion_forR.py \
../../L1L2stacks_m10/batch_1_300.filteredMAF_MissingLoci_Individs.csv \
../../L1L2stacks_m10/batch_1_300.filteredMAF_MissingLoci_Individs.gen
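
For orientation, the Genepop file layout that the conversion targets is: a title line, one locus name per line, then a `Pop` line before each population, with each individual written as `name ,  genotype genotype ...`. The sketch below writes that layout from in-memory data; it is not the `genepop_conversion_forR.py` script, and the function name and input structure are hypothetical.

```python
def to_genepop(title, loci, populations):
    """Build Genepop-format text.

    `populations` maps population name -> {individual name: [genotypes]},
    with genotypes as 4-digit codes in the same order as `loci`.
    """
    lines = [title]
    lines.extend(loci)
    for individuals in populations.values():
        lines.append("Pop")
        for name, genos in individuals.items():
            lines.append("{} ,  {}".format(name, " ".join(genos)))
    return "\n".join(lines) + "\n"


text = to_genepop(
    "example title",
    ["locus_1", "locus_2"],
    {
        "500ng": {"ind_1": ["0101", "0102"]},
        "300ng": {"ind_2": ["0202", "0000"]},
    },
)
print(text)
```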