## Checking SNP and haplotype genotyping error mismatch

I'm going a little crazy trying to understand how there could be such a discrepancy in genotyping error between the SNP and haplotype Genepop files, so I thought it was worth meticulously double-checking to make sure the error isn't me. It's usually me.

First, I'll take a look at the SNP Genepop file. It looks like this:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Notebooks/images_for_notebooks/snp_gp.png?raw=true)

In [59]:
cd /mnt/hgfs/SHARED_FOLDER/WorkingFolder/Stacks_2

/mnt/hgfs/SHARED_FOLDER/WorkingFolder/Stacks_2


I copied and pasted just the rows for the individuals with replicates into a new file, 'cragigrun1_sampreps_snps_gen.txt', which looks like this:

![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Notebooks/images_for_notebooks/pic_samps_for_snp_gen_error.png?raw=true)

It contains only the rows for samples with replicates, with the locus names at the bottom.

**HOWEVER**, to double-check, I'm going to use a test file that has only the first 11 SNPs and looks like this:
![image](https://github.com/nclowell/RAD_Scallops/blob/master/CRAGIG_run1/Notebooks/images_for_notebooks/test_snps_reps.png?raw=true)
I'm using a test file so that I can introduce a known amount of error and see how well the script detects it.

I introduced the following errors:
* FG101: 1 total mismatch, 1 partial mismatch
* FG102: 2 total mismatches
* FG103: no mismatches
* FG104: 1 total mismatch, 2 partial mismatches

In [60]:
infile = open("test_replicates.txt") # read in file (avoid shadowing the built-in input)
lines = infile.readlines()
infile.close()
biglist = []
for line in lines:
    linelist = line.strip().split() # split each row on whitespace
    biglist.append(linelist) # prepare a list to make an array

# check list of lists is what you expect
print biglist[0][0:9]

['FG101_A,', '0000', '0000', '0000', '0303', '0101', '0202', '0202', '0101']
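
Note that the sample name comes through with its trailing comma (`'FG101_A,'`). A minimal sketch of pulling the name and genotypes apart, assuming the comma is always attached to the name as in the output above; `parse_genepop_row` is a hypothetical helper, not part of the original script, and it is written to run under Python 2 or 3:

```python
def parse_genepop_row(line):
    """Split one Genepop data row into (sample_name, list_of_genotypes)."""
    fields = line.strip().split()
    sample_name = fields[0].rstrip(",")  # drop the trailing comma after the name
    genotypes = fields[1:]               # four-character codes such as '0303'
    return sample_name, genotypes

# Example row rebuilt from the first nine fields printed above
name, genos = parse_genepop_row(
    "FG101_A, 0000 0000 0000 0303 0101 0202 0202 0101"
)
```

If the file instead puts a space before the comma, the comma would land in its own field and this sketch would need adjusting.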


In [75]:
headerline = lines[8]
snp_list = headerline.strip().split(",")

print snp_list[0:11]

['5_75', '5_84', '5_137', '6_16', '6_23', '6_24', '6_40', '6_74', '6_97', '6_112', '6_126']


In [62]:
num_snps = len(snp_list) # number of SNPs named in the header line
print "You have " + str(num_snps) + " SNPs."

You have 11 SNPs.


In [63]:
import numpy as np
myarray = np.array(biglist) # turn the file into an array

Assign the replicate rows to small arrays with two rows each, one array per replicated individual.

In [64]:
FG101 = np.array(biglist[0:2])
FG102 = np.array(biglist[2:4])
FG103 = np.array(biglist[4:6])
FG104 = np.array(biglist[6:8])

Double-check row assignments...

In [65]:
FG101[0:2,0:7]


array([['FG101_A,', '0000', '0000', '0000', '0303', '0101', '0202'],
       ['FG101_B,', '0000', '0000', '0000', '0303', '0101', '0202']], 
      dtype='|S8')
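
The hard-coded slices above only work while the replicate rows stay in this exact order. A sketch of grouping rows by sample ID instead, assuming names follow the `<sample>_<replicate letter>` pattern seen for `FG101_A` and `FG101_B`; `group_replicates` is a hypothetical helper, and the rows below are toy data rather than the real test file:

```python
from collections import OrderedDict

def group_replicates(rows):
    """Group parsed rows by the sample ID before the final underscore."""
    groups = OrderedDict()
    for row in rows:
        name = row[0].rstrip(",")           # 'FG101_A,' -> 'FG101_A'
        sample_id = name.rsplit("_", 1)[0]  # 'FG101_A'  -> 'FG101'
        groups.setdefault(sample_id, []).append(row)
    return groups

# Toy rows for illustration only (not the real genotypes)
rows = [
    ["FG101_A,", "0000", "0303"],
    ["FG101_B,", "0000", "0303"],
    ["FG102_A,", "0101", "0202"],
    ["FG102_B,", "0101", "0202"],
]
groups = group_replicates(rows)
```

Each value in `groups` is then a list of two rows that could be passed to `np.array` in place of the manual slices.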

In [66]:
# put arrays in list
list_of_arrays = []
list_of_arrays.append(FG101)
list_of_arrays.append(FG102)
list_of_arrays.append(FG103)
list_of_arrays.append(FG104)

In [67]:
# initiate counts and lists to track genotyping error
match_genotypes_count = 0
match_genotypes_tags = []
partmismatch_genotypes_count = 0
partmismatch_genotypes_tags = []
totalmismatch_genotypes_count = 0
totalmismatch_genotypes_tags = []
nodata_mismatch_count = 0
nodata_mismatch_tags = []

In [72]:
# sort genotype match/mismatch by sample
bysam_match_tags = []
bysam_partmismatch_tags = []
bysam_totalmismatch_tags = []
bysam_missing_tags = []

timescount = 1
for array in list_of_arrays:
    sam_match_genotypes_count = 0
    sam_match_genotypes_tags = []
    sam_partmismatch_genotypes_count = 0
    sam_partmismatch_genotypes_tags = []
    sam_totalmismatch_genotypes_count = 0
    sam_totalmismatch_genotypes_tags = []
    sam_nodata_mismatch_count = 0
    sam_nodata_mismatch_tags = []
    for i in range(1,num_snps+1):
        print "i = " + str(i)
        A_gen = array[0][i] # get genotypes
        B_gen = array[1][i]
        A_1 = A_gen[0:2] # pull out alleles
        A_2 = A_gen[2:4]
        B_1 = B_gen[0:2]
        B_2 = B_gen[2:4]
        A_list = [] # store alleles in lists
        A_list.append(A_1)
        A_list.append(A_2)
        print "A_list of alleles: " + str(A_list)
        B_list = []
        B_list.append(B_1)
        B_list.append(B_2)
        print "B_list of alleles: " + str(B_list)
        first_test_string = B_1 + B_2
        second_test_string = B_2 + B_1
        match_options = []
        match_options.append(first_test_string)
        match_options.append(second_test_string)
        if A_gen == "0000" or B_gen == "0000": # missing data in either replicate (covers both missing)
            nodata_mismatch_count += 1
            nodata_mismatch_tags.append(snp_list[i])
            sam_nodata_mismatch_count +=1
            sam_nodata_mismatch_tags.append(snp_list[i])
        elif A_gen in match_options: # if perfect match
            match_genotypes_count += 1
            match_genotypes_tags.append(snp_list[i])
            sam_match_genotypes_count += 1
            sam_match_genotypes_tags.append(snp_list[i])
        elif A_1 in B_list or A_2 in B_list: # if one allele is a match
            partmismatch_genotypes_count += 1
            partmismatch_genotypes_tags.append(snp_list[i])
            sam_partmismatch_genotypes_count += 1
            sam_partmismatch_genotypes_tags.append(snp_list[i])
        elif A_1 not in B_list or A_2 not in B_list: # if not a match at all
            totalmismatch_genotypes_count += 1
            totalmismatch_genotypes_tags.append(snp_list[i])
            sam_totalmismatch_genotypes_count += 1
            sam_totalmismatch_genotypes_tags.append(snp_list[i])
        else:
            print "Your loop may not be fully working." 
            print "You thought there were three possible options, but there were more than three. Check!"
        
    bysam_missing_tags.append(sam_nodata_mismatch_tags)
    bysam_match_tags.append(sam_match_genotypes_tags)
    bysam_partmismatch_tags.append(sam_partmismatch_genotypes_tags)
    bysam_totalmismatch_tags.append(sam_totalmismatch_genotypes_tags)

    print "\nSample number " + str(timescount) + " :"
    print "Genotypes with perfect match: " + str(sam_match_genotypes_count) + " : "
    print "Genotypes with partial match: " + str(sam_partmismatch_genotypes_count) + " : "
    print "Genotypes with total mismatch: " + str(sam_totalmismatch_genotypes_count) + " : "
    print "Genotypes with mismatch due to missing data: " + str(sam_nodata_mismatch_count) + " : "
        
    timescount +=1

A_list of alleles: ['00', '00']
B_list of alleles: ['00', '00']
A_list of alleles: ['00', '00']
B_list of alleles: ['00', '00']
A_list of alleles: ['00', '00']
B_list of alleles: ['00', '00']
A_list of alleles: ['03', '03']
B_list of alleles: ['03', '03']
A_list of alleles: ['01', '01']
B_list of alleles: ['01', '01']
A_list of alleles: ['02', '02']
B_list of alleles: ['02', '02']
A_list of alleles: ['02', '02']
B_list of alleles: ['02', '02']
A_list of alleles: ['01', '01']
B_list of alleles: ['01', '01']
A_list of alleles: ['01', '01']
B_list of alleles: ['01', '01']
A_list of alleles: ['02', '02']
B_list of alleles: ['01', '03']
A_list of alleles: ['01', '02']
B_list of alleles: ['02', '02']


IndexError: list index out of range
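
The traceback looks consistent with an off-by-one between two index schemes in the loop. Genotype columns start at index 1 (column 0 is the sample name), but `snp_list` starts at index 0, so `snp_list[i]` runs past the end once `i` reaches `num_snps`, and every tag recorded before that is shifted by one. A sketch of one way to keep the two in step, using hypothetical helpers `classify_pair` and `compare_replicates` (not the original script; runs under Python 2 or 3):

```python
MISSING = "0000"

def classify_pair(a_gen, b_gen):
    """Classify two replicate genotypes given as four-character codes."""
    if a_gen == MISSING or b_gen == MISSING:
        return "missing"
    a_alleles = [a_gen[0:2], a_gen[2:4]]
    b_alleles = [b_gen[0:2], b_gen[2:4]]
    if sorted(a_alleles) == sorted(b_alleles):  # same alleles, either order
        return "match"
    if a_alleles[0] in b_alleles or a_alleles[1] in b_alleles:
        return "partial"                        # exactly one allele shared
    return "total"                              # no alleles shared

def compare_replicates(row_a, row_b, snp_names):
    """Return {snp_name: category}; column 0 of each row is the sample name."""
    results = {}
    for snp_index, snp_name in enumerate(snp_names):
        col = snp_index + 1  # genotype columns are offset by one from snp_names
        results[snp_name] = classify_pair(row_a[col], row_b[col])
    return results

# Last three SNPs of FG101, taken from the allele lists printed above
snps = ["6_97", "6_112", "6_126"]
row_a = ["FG101_A,", "0101", "0202", "0102"]
row_b = ["FG101_B,", "0101", "0103", "0202"]
result = compare_replicates(row_a, row_b, snps)
```

Here `result` labels `6_97` as a match, `6_112` as a total mismatch, and `6_126` as a partial mismatch, which is what the known errors for FG101 call for.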

In [69]:
from __future__ import division

total_excl_mv = match_genotypes_count + partmismatch_genotypes_count + totalmismatch_genotypes_count
print "Total genotypes excluding those with missing data : " + str(total_excl_mv)
print "\nTotal genotypes with perfect match: " + str(match_genotypes_count) + " : " + str((float(match_genotypes_count)/float(total_excl_mv)*100))[0:5] + "%"
print "Total genotypes with partial match: " + str(partmismatch_genotypes_count) + " : " + str((float(partmismatch_genotypes_count)/float(total_excl_mv)*100))[0:5] + "%"
print "Total genotypes with total mismatch: " + str(totalmismatch_genotypes_count) + " : " + str((float(totalmismatch_genotypes_count)/float(total_excl_mv)*100))[0:5] + "%"
print "\nTotal genotypes with mismatch due to missing data: " + str(nodata_mismatch_count)

Total genotypes excluding those with missing data : 37

Total genotypes with perfect match: 31 : 83.78%
Total genotypes with partial match: 2 : 5.405%
Total genotypes with total mismatch: 4 : 10.81%

Total genotypes with mismatch due to missing data: 3


**So, does this match what is known?**

Again, what is known about the test file:
* FG101: 1 total mismatch, 1 partial mismatch
* FG102: 2 total mismatches
* FG103: no mismatches
* FG104: 1 total mismatch, 2 partial mismatches

CRAP. It's not exactly the same! It caught everything correctly except for the one partial mismatch in FG101. Now to try and sort out why that is... Second problem: the genotypes don't add up to 11 like they're supposed to! Ah, this could be a great relief if this is the root of the problem.
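
A quick arithmetic check on the totals, as a sketch. With 4 replicated samples and 11 SNPs there should be 44 comparisons, but the reported counts only sum to 40. The variable names below are mine; the numbers come straight from the summary output and the list of introduced errors:

```python
n_samples = 4
n_snps = 11
expected_comparisons = n_samples * n_snps      # 44

# Counts reported in the summary output
reported = {"match": 31, "partial": 2, "total": 4, "missing": 3}
reported_comparisons = sum(reported.values())  # 40

# Errors deliberately introduced into the test file
introduced = {
    "FG101": {"total": 1, "partial": 1},
    "FG102": {"total": 2, "partial": 0},
    "FG103": {"total": 0, "partial": 0},
    "FG104": {"total": 1, "partial": 2},
}
expected_total = sum(s["total"] for s in introduced.values())      # 4
expected_partial = sum(s["partial"] for s in introduced.values())  # 3

shortfall = expected_comparisons - reported_comparisons  # comparisons never made
missed_partial = expected_partial - reported["partial"]  # partial mismatches missed
```

The shortfall of 4 works out to one comparison per sample, which would be consistent with (though it doesn't prove) the last SNP column being skipped for every sample. The FG101 partial mismatch shows up in the 11th pair of allele lists printed above, so a run that stopped one column short would miss exactly that genotype.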