## Correcting Ne estimates for bias from linkage


When the linkage disequilibrium (LD) method is used to estimate effective population size (Ne) from RAD sequencing data, the large number of markers can bias the estimate downward and produce artificially narrow confidence intervals (Waples et al. 2016).

I'm going to implement a bias correction method from Charlie Waters (original code written by Wes Larson) that removes all pairwise comparisons between loci on the same chromosome when calculating linkage disequilibrium.
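To make the idea concrete, here is a minimal sketch of the pair filter, assuming each locus has already been assigned to a chromosome. It is not the Waters/Larson code: the function name `cross_chromosome_pairs`, the `LG01`-style chromosome labels, and the locus-to-chromosome assignments in the example are all hypothetical.

```python
from itertools import combinations


def cross_chromosome_pairs(locus_to_chrom):
    """Return all locus pairs whose two members map to different chromosomes."""
    loci = sorted(locus_to_chrom)
    return [
        (a, b)
        for a, b in combinations(loci, 2)
        if locus_to_chrom[a] != locus_to_chrom[b]
    ]


# hypothetical assignments, for illustration only
example_map = {"10000": "LG01", "10004": "LG01", "10009": "LG02"}
pairs = cross_chromosome_pairs(example_map)
```

In this toy example the pair (`10000`, `10004`) is dropped because both loci sit on the same chromosome, and the other two pairs are kept.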

<br>
### Steps:
1. Align all loci to the Atlantic cod genome, and filter out any that don't align
2. Run NeEstimator to obtain the Burrows file output
3. Parse the Burrows file (Python script listed below)
4. Re-calculate Ne (R script listed below)



<br>
**Programs:**
- NeEstimator v2
- Python v2.7
- R v3.4.0


**Scripts (unedited):**
- [get_clean_burrows.py]()
- [calc_ne_for_charlie_CW]()

_________________________________________

<br>
#### 4/30/2018

### Align loci in genepop file to genome

In [9]:
cd ../analyses/Ne/NeEstimator/Correction

/mnt/hgfs/PCod-Korea-repo/analyses/Ne/NeEstimator/Correction


#### First, read in a list of loci from the genepop file

In [10]:
###################### objects ###############################
myfile = "batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredCR_nomigrants_south_byyear.txt" # genepop file to read locus names from
###############################################################


# read in locus IDs from the genepop file
infile = open(myfile, "r")
infile.readline() # header

loci_list = []

line = infile.readline()

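# locus names come first; stop at the first "pop" line (or at end of file)
while line and "pop" not in line:
    # note: strip("Locus") removes any of the characters L, o, c, u, s from both ends,
    # so it is only safe when the locus IDs themselves are numeric
    loci_list.append(line.strip().strip("Locus"))
    line = infile.readline()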
infile.close()

In [13]:
loci_list[0:5]

['10000', '10004', '10009', '1001', '10014']

#### Then match those loci to sequences in the stacks catalog.tags.tsv file

In [19]:
###################### objects ###############################
catalog = "../../../../stacks_b8_verif/batch_7.catalog.tags.tsv"
newfile = "batch_8_verif_loci.fa" # parsed output from this script
###############################################################


# write out the locus IDs and the consensus sequences to a new fasta file
seqfile = open(catalog, "r")
outfile = open(newfile, "w")
seqs_added = 0
seqfile.readline()
for line in seqfile:
    linelist = line.strip().split("\t")
    if linelist[2] in loci_list:
        outfile.write(">" + linelist[2] + "\n" + linelist[9] + "\n")
        seqs_added += 1
seqfile.close()
outfile.close()
print "Successfully added sequences for ", seqs_added, " loci."

Successfully added sequences for  5804  loci.


#### Align to Atlantic cod reference

In [15]:
!bowtie2 -f \
-x ../../../../../PCod-Compare-repo/ACod_reference/Gadus_morhua2 \
-U batch_8_verif_loci.fa \
-S batch_8_loci_bowtie2_Acod.sam

5804 reads; of these:
  5804 (100.00%) were unpaired; of these:
    500 (8.61%) aligned 0 times
    4830 (83.22%) aligned exactly 1 time
    474 (8.17%) aligned >1 times
91.39% overall alignment rate


#### Filter for good alignments (mapping quality of at least 10)

In [17]:
!samtools view -Sq 10 batch_8_loci_bowtie2_Acod.sam > batch_8_loci_bowtie2_Acod_filteredMQ.sam
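For reference, `-q 10` tells `samtools view` to skip alignments with a mapping quality (MAPQ, the fifth column of a SAM record) below 10. The sketch below mimics that filter in plain Python to show what is being kept. The function name `filter_sam_by_mapq` and the three example records are hypothetical, and this is not a replacement for samtools.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=10):
    """Yield SAM alignment lines whose MAPQ (column 5) is at least min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines; plain `samtools view` does not print them either
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            yield line


# made-up records: a confident hit, a low-quality hit, and an unaligned read
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "10000\t0\tLG01\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "10004\t0\tLG02\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "1001\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept_loci = [rec.split("\t")[0] for rec in filter_sam_by_mapq(example_sam)]
```

Only the first record passes here; the low-MAPQ alignment and the unaligned read (MAPQ 0) are both dropped.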

#### Get list of aligned loci: write a tab-delimited file of locus and chromosome

In [23]:
###################### objects ###############################
infile = "batch_8_loci_bowtie2_Acod_filteredMQ.sam"
outfile = "batch_8_verif_bowtie2_Acod_filteredMQ_loci_chr_list.txt" # parsed output from this script
###############################################################


# write out each aligned locus ID and the chromosome it aligned to (tab-separated)
infile = open(infile, "r")
outfile = open(outfile, "w")
seqs_added = 0
aligned_loci_list = []

for line in infile:
    linelist = line.strip().split()
    if linelist[0] not in aligned_loci_list:
        outfile.write(linelist[0] + "\t" + linelist[2] + "\n")
        aligned_loci_list.append(linelist[0])
        seqs_added += 1
    elif linelist[0] in aligned_loci_list:
        print "oh no! locus ", linelist[0], " aligned twice!"
infile.close()
outfile.close()
print "Successfully added sequences for ", seqs_added, " loci."

Successfully added sequences for  4319  loci.
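The locus/chromosome file written above is easy to load back into a dictionary for later steps. A minimal sketch, assuming two tab-separated columns per line as written by the cell above; the function name `read_locus_chrom` and the `LG01`-style chromosome labels in the example are hypothetical.

```python
from collections import Counter


def read_locus_chrom(lines):
    """Build a {locus: chromosome} dict from tab-separated 'locus<TAB>chromosome' lines."""
    locus_to_chrom = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        locus, chrom = line.split("\t")
        locus_to_chrom[locus] = chrom
    return locus_to_chrom


# made-up lines in the same layout as the output file
example_lines = ["10000\tLG01\n", "10004\tLG01\n", "10009\tLG02\n"]
locus_to_chrom = read_locus_chrom(example_lines)
loci_per_chrom = Counter(locus_to_chrom.values())  # number of loci on each chromosome
```

With the real data, the same function could be called on an open handle to `batch_8_verif_bowtie2_Acod_filteredMQ_loci_chr_list.txt`.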


#### With list of aligned loci: filter genepop file

In [31]:
cd ../../../../

/mnt/hgfs/PCod-Korea-repo


In [25]:
!python scripts/subsetGenepop.py -h

Do you want to (A) select a random subset of loci or (B-untested) select a specific set of loci? ^C
Traceback (most recent call last):
  File "scripts/subsetGenepop.py", line 30, in <module>
    choice1 = raw_input("Do you want to (A) select a random subset of loci or (B-untested) select a specific set of loci? ")
KeyboardInterrupt
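Since `subsetGenepop.py` is interactive, here is a rough sketch of what filtering a genepop file down to the aligned loci involves. This is not `subsetGenepop.py`; the function name `filter_genepop` is hypothetical, and the sketch assumes the layout read earlier in this notebook: a title line, one locus name per line, then `pop` lines each followed by individuals written as `sample , genotype genotype ...`.

```python
def filter_genepop(lines, keep_loci):
    """Return genepop lines restricted to the loci named in keep_loci."""
    keep_loci = set(keep_loci)
    lines = [l.rstrip("\n") for l in lines]
    out = [lines[0]]  # title line is copied unchanged
    keep_idx = []     # column positions of the loci to keep
    n_loci = 0
    i = 1
    # locus block: one locus name per line, up to the first "pop" line
    while i < len(lines) and lines[i].strip().lower() != "pop":
        name = lines[i].strip()
        if name:
            if name in keep_loci:
                keep_idx.append(n_loci)
                out.append(name)
            n_loci += 1
        i += 1
    # population blocks: keep only the genotype columns of the retained loci
    for line in lines[i:]:
        stripped = line.strip()
        if stripped.lower() == "pop":
            out.append(stripped)
        elif stripped:
            sample, genotypes = stripped.split(",", 1)
            calls = genotypes.split()
            kept = [calls[j] for j in keep_idx]
            out.append(sample.strip() + " , " + " ".join(kept))
    return out


# made-up three-locus file, for illustration only
example = [
    "example title",
    "10000", "10004", "10009",
    "Pop",
    "ind1 , 0101 0102 0202",
    "ind2 , 0102 0101 0000",
]
filtered = filter_genepop(example, ["10000", "10009"])
```

The `pop` check here is an exact, case-insensitive match on the whole line, which is stricter than testing whether `"pop"` appears anywhere in the line.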




### Run NeEstimator

<br>
Although I usually run the NeEstimator executable from the command line, the GUI is needed to get the Burrows output file.
For this test run, I used [this input file](). See settings below:

![img-ne-gui](https://github.com/mfisher5/PCod-Korea-repo/blob/master/nb_pictures/NeEstimator_GUI.png?raw=true)


**Note: the Burrows file appears to report the Ne estimate for only one population.**

<br>
### Parse the Burrows file

In [32]:
cd analyses/Ne/NeEstimator/Correction

/mnt/hgfs/PCod-Korea-repo/analyses/Ne/NeEstimator/Correction


In [34]:
!head get_clean_burrows.py

##function to convert default burrows output to cleaner smaller version for input to R
##Wes Larson
## 4/30/14
##wlarson1@uw.edu
#
## Edited 4/30/2018 by Mary Fisher to take command line arguments
######################################################################################

import sys



In [38]:
!python get_clean_burrows.py batch_8_final_filtered_aligned_nomigrants_south_byyear3Bur.txt

In [39]:
!head batch_8_final_filtered_aligned_nomigrants_south_byyearBur.txt

Output from NeEstimator v.2
Input File: batch_8_final_filtered_aligned_nomigrants_south_byyear.txt

Locus names are listed after their designated numberings
(Up to 10 rightmost characters are printed and only up to 100 names are listed)
    1:10000           2:10004           3:10009           4:1001            5:10020       
    6:10023           7:1003            8:10030           9:10037          10:10040       
   11:10049          12:10055          13:10056          14:10058          15:10061       
   16:10068          17:10070          18:10088          19:10089          20:101         
   21:10101          22:10112          23:10115          24:10123          25:10127       
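The `number:name` listing in the Burrows header can be turned into a lookup table with a short regular expression. This is a sketch only, not part of `get_clean_burrows.py`; the function name `parse_locus_numbering` is hypothetical, and it should be given just the listing lines, not the whole file.

```python
import re


def parse_locus_numbering(lines):
    """Map each integer numbering to its locus name from 'number:name' tokens."""
    numbering = {}
    for line in lines:
        for number, name in re.findall(r"(\d+):(\S+)", line):
            numbering[int(number)] = name
    return numbering


# first two listing lines from the head output above
example_lines = [
    "    1:10000           2:10004           3:10009           4:1001            5:10020       ",
    "    6:10023           7:1003            8:10030           9:10037          10:10040       ",
]
numbering = parse_locus_numbering(example_lines)
```

As the header itself warns, only the rightmost 10 characters of each name are printed and at most 100 names are listed, so this table is incomplete for larger data sets.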


<br>
### Run the R Script `calc_ne.R`