## Identifying Outlier Loci

### stacks batch 8
<br>
This notebook contains the following outlier identifications:

1. all samples
2. southern samples
3. east v. west samples
4. south v. west samples 

These were run using the programs `OutFLANK` and `Bayescan`.
<br>
#### 10/5/2017

### OutFLANK

[Github](https://github.com/whitlock/OutFLANK/blob/master/R/OutFLANK.R) 

[PDF Manual](https://github.com/whitlock/OutFLANK/blob/master/OutFLANK%20readme.pdf)

**OutFLANK output saved in an excel file in the `Results` folder.**

**(1) Convert Genepop file to OutFLANK file format.** Luckily, OutFLANK has a nice R function for this. However, you still need to convert your Genepop file into a specific format before passing it to that R function. The following Python script takes a genepop file and a population map, and outputs three of the inputs for the OutFLANK function `MakeDiploidFSTMat()`. These are: 
1. a file containing a matrix of individuals (rows) x loci (columns) without headings. Alleles are coded in a `0`,`1`, `2`, `9` format. 
2. a file where each locus name is on a new line, as a string. This can be read directly into R as a list.
3. a file where each sample's population name is on a new line (same order as matrix rows). This can also be read directly into R as a list. 
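
As a rough illustration of the `0`/`1`/`2`/`9` coding in item 1, the sketch below converts the genepop genotype strings for one locus into counts of a reference allele. It is not taken from `convert_genepop_to_SNPmat.py`: the function name `encode_locus`, the use of the first allele seen as the reference, and the assumption of two-digit allele codes with `0000` as missing are all illustrative choices. It is written for Python 3.

```python
def encode_locus(genotypes, missing="0000"):
    """Code one locus as copies of a reference allele (0, 1, 2), or 9 if missing.

    Assumes two-digit allele codes, e.g. "0102" = alleles 01 and 02.
    The reference allele is simply the first allele observed (an assumption).
    """
    ref = None
    codes = []
    for gt in genotypes:
        if gt == missing:
            codes.append(9)
            continue
        a1, a2 = gt[:2], gt[2:]
        if ref is None:
            ref = a1
        # each comparison is True/False, so the sum is 0, 1 or 2
        codes.append((a1 == ref) + (a2 == ref))
    return codes


# One locus scored in four individuals (toy genotypes)
print(encode_locus(["0101", "0102", "0000", "0202"]))  # [2, 1, 9, 0]
```

Stacking one such list per locus as columns gives the individuals x loci matrix described above.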

In [1]:
pwd

u'/mnt/hgfs/PCod-Korea-repo/notebooks'

In [2]:
cd ../analyses/

/mnt/hgfs/PCod-Korea-repo/analyses


In [3]:
cd Outliers/

/mnt/hgfs/PCod-Korea-repo/analyses/Outliers


In [4]:
!python convert_genepop_to_SNPmat.py -h

usage: convert_genepop_to_SNPmat.py [-h] [-i INPUT] [-p POPMAP] [-o OUTPUT]
                                    [-ol OUTLOCUSNAMES] [-op OUTPOPNAMES]

produce SNPmat file, and files containing loci / population lists for OutFLANK
outlier analysis.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        genepop file that you want to run through OutFLANK
  -p POPMAP, --popmap POPMAP
                        population map from stacks (each line has sample - tab
                        - population
  -o OUTPUT, --output OUTPUT
                        bash shell script file name. must have file extension
                        .sh
  -ol OUTLOCUSNAMES, --outLocusNames OUTLOCUSNAMES
                        text file with the name of each locus on each line, to
                        be read into R
  -op OUTPOPNAMES, --outPopNames OUTPOPNAMES
                        text file with the name of each samp

In [5]:
!mkdir batch8_verif

In [7]:
!python convert_genepop_to_SNPmat.py \
-i ../../stacks_b8_verif/batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredCR_genepop.txt \
-p ../../scripts/PopMap_Final.txt \
-o batch8_verif/batch_8_verif_final_filtered_SNPmat.txt \
-ol batch8_verif/batch_8_verif_SNPmat_locusnames.txt \
-op batch8_verif/batch_8_verif_SNPmat_popnames.txt

Korean Pacific cod filtered final genepop, stacks batch 8 MF 3/8/2018

Done creating SNPmat file.


<br>
**(2) Run OutFLANK and produce summary file containing outliers.** I used [this R script](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/OutFLANK_KorPCod_MF.R), which is well annotated. 
<br>

In [9]:
!head batch8_verif/allKOR_b8_verif_outflank_outliers.csv

locus,he,fst,meanAlleleFreq,qvals,pv,outlier
10203,0.485766058697014,0.196194621361391,0.415637860082305,0.0340042113406486,0.000148058975358412,1
14546,0.322962912369374,0.255211285854974,0.202479338842975,0.00571792115238723,4.97932175824722e-06,1
17767,0.151628987091046,0.244765201265605,0.0826446280991736,0.0073129286312226,9.55244668809918e-06,1
18723,0.214082029915989,0.226528580463956,0.121900826446281,0.0122998210403379,2.67775494346978e-05,1
1904,0.410707095464015,0.201381191671333,0.711297071129707,0.0319372424078581,0.000111247288503424,1
19221,0.448101398785035,0.303346827157339,0.661087866108787,0.000762883692098093,3.32169967531826e-07,1
2606,0.13062632333857,0.194841378644889,0.929752066115702,0.0340042113406486,0.000145082686416886,1
2694,0.366455405691688,0.227479983525084,0.241596638655462,0.0122998210403379,2.50665753045443e-05,1
3405,0.10140306122449,0.219157320390556,0.946428571428571,0.0143502744216365,3.74898315804728e-05,1



I only have **10 outlier loci**.
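
Rather than counting rows by eye, the flagged loci can be pulled out with Python's `csv` module. This is a minimal sketch, assuming the `outlier` column is written as `1` for flagged loci as in the rows above; the function name `outlier_loci` is hypothetical, and the two sample rows are copied from the `head` output so the example runs on its own (Python 3).

```python
import csv
import io


def outlier_loci(handle):
    """Return the locus IDs whose `outlier` column equals 1."""
    reader = csv.DictReader(handle)
    return [row["locus"] for row in reader if row["outlier"] == "1"]


# Two rows copied from the output above. With the real file, pass
# open("batch8_verif/allKOR_b8_verif_outflank_outliers.csv") instead.
sample = io.StringIO(
    "locus,he,fst,meanAlleleFreq,qvals,pv,outlier\n"
    "10203,0.485766058697014,0.196194621361391,0.415637860082305,"
    "0.0340042113406486,0.000148058975358412,1\n"
    "14546,0.322962912369374,0.255211285854974,0.202479338842975,"
    "0.00571792115238723,4.97932175824722e-06,1\n"
)

loci = outlier_loci(sample)
print(len(loci), loci)  # 2 ['10203', '14546']
```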


#### "Pop" headers by region

In [8]:
!python convert_genepop_to_SNPmat.py \
-i ../../stacks_b8_verif/batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredCR_byregion.txt \
-p ../../scripts/PopMap_Final.txt \
-o batch8_verif/batch_8_verif_final_filtered_byreg_SNPmat.txt \
-ol batch8_verif/batch_8_verif_SNPmat_byreg_locusnames.txt \
-op batch8_verif/batch_8_verif_SNPmat_byreg_popnames.txt

Korean Pacific cod filtered final genepop, by region, stacks batch 8 verif MF 3/8/2018

Done creating SNPmat file.


In [10]:
!head batch8_verif/allKOR_b8_verif_outflank_outliers_byregion.csv

locus,he,fst,meanAlleleFreq,qvals,pv,outlier
10203,0.485766058697014,0.196194621361391,0.415637860082305,0.0340042113406486,0.000148058975358412,1
14546,0.322962912369374,0.255211285854974,0.202479338842975,0.00571792115238723,4.97932175824722e-06,1
17767,0.151628987091046,0.244765201265605,0.0826446280991736,0.0073129286312226,9.55244668809918e-06,1
18723,0.214082029915989,0.226528580463956,0.121900826446281,0.0122998210403379,2.67775494346978e-05,1
1904,0.410707095464015,0.201381191671333,0.711297071129707,0.0319372424078581,0.000111247288503424,1
19221,0.448101398785035,0.303346827157339,0.661087866108787,0.000762883692098093,3.32169967531826e-07,1
2606,0.13062632333857,0.194841378644889,0.929752066115702,0.0340042113406486,0.000145082686416886,1
2694,0.366455405691688,0.227479983525084,0.241596638655462,0.0122998210403379,2.50665753045443e-05,1
3405,0.10140306122449,0.219157320390556,0.946428571428571,0.0143502744216365,3.74898315804728e-05,1


I have the same **10 outliers**.
<br>

**Southern Populations Only**

In [1]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/notebooks'

In [2]:
cd ../analyses/Outliers/

/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/analyses/Outliers


In [6]:
!python convert_genepop_to_SNPmat.py \
-i ../batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC_south.gen \
-p ../../scripts/PopMap_L1-5_mdFilter_b8.txt \
-o batch_8_final_filtered_south_SNPmat.txt \
-ol batch_8_SNPmat_south_locusnames.txt \
-op batch_8_SNPmat_south_popnames.txt

##Korean Pacific cod filtered final genepop, stacks batch 8 MF 9/29/2017

Done creating SNPmat file.


I got **50 outlier loci**.

<br>


**East v. West**

In [27]:
!python convert_genepop_to_SNPmat.py \
-i ../batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC_eastwest.txt \
-p ../../scripts/PopMap_L1-5_mdFilter_b8.txt \
-o batch_8_final_filtered_EW_SNPmat.txt \
-ol batch_8_SNPmat_EW_locusnames.txt \
-op batch_8_SNPmat_EW_popnames.txt

##Korean Pacific cod filtered final genepop, stacks batch 8 MF 9/29/2017

Done creating SNPmat file.


There are **no outliers??** between eastern and western populations. 

This is likely because the neutral FST between the two is so large. OutFLANK relies only on the distribution of FST to detect outliers, so (in theory) it is much harder for a locus to stand out when the background FST is already elevated.
<br>
<br>


**South v. West**

In [28]:
!python convert_genepop_to_SNPmat.py \
-i ../batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC_southwest.txt \
-p ../../scripts/PopMap_L1-5_mdFilter_b8.txt \
-o batch_8_final_filtered_SW_SNPmat.txt \
-ol batch_8_SNPmat_SW_locusnames.txt \
-op batch_8_SNPmat_SW_popnames.txt

##Korean Pacific cod filtered final genepop, stacks batch 8 MF 9/29/2017

Done creating SNPmat file.


Also **no outliers.** Looks like I may not use this program for my data. 


<br>

<br>
<br>
#### 3/8/2018
<br>
### Bayescan
<br>

[Download](http://cmpg.unibe.ch/software/BayeScan/download.html) includes executable scripts and PDF manual. 


**(1) [Download](http://www.cmpg.unibe.ch/software/PGDSpider/) PGDSpider.** Bayescan uses its own input file format. The Bayescan authors suggest using PGDSpider to convert genepop files into this format.

**(2) Convert genepop to Bayescan format.** For SNP data, this can either be a "codominant" file format or a "SNP genotype matrix" (per Bayescan's user manual). They suggest that if you are not directly interested in Fis, you treat SNPs as regular codominant data. In PGDSpider, this is just a matter of choosing the file format and file names for the input and output files, and then selecting "SNP" in two short questions for the SPID file. *Note that using an old SPID file here caused an error; I had to create a new one.*

**(3) Run Bayescan using the Windows GUI.** The GUI looks like this:
 ![img-bayescan](https://github.com/mfisher5/PCod-Korea-repo/blob/master/nb_pictures/bayescan_gui_verif.png?raw=true)
 
I used the following parameters:
 ![img-bayescan-options](https://github.com/mfisher5/PCod-Korea-repo/blob/master/nb_pictures/bayescan_gui_verif_params_p10.png?raw=true)
 
 
 Note that I have changed the default "sample size" to 20K. This is because the PCod paper, Gruenthal et al. (in review), reported using 20,000 iterations. According to the Bayescan manual, the "Number of outputted iterations, default 5000" appears to be "sample size" in the GUI and `-n` on the command line. 
 <br>
 **RUN 1:** prior odds of 10
 <br>
 **RUN 2:** prior odds of 100 (Eleni's computer)
 <br>
 **RUN 3:** prior odds of 1,000
 <br>
 **RUN 4**: prior odds of 10,000
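
For reference, the four runs could also be set up from the command line instead of the GUI. The sketch below only builds and prints the commands; it does not run anything. The `-n`, `-pr_odds`, `-o` and `-od` options are the ones described in the Bayescan manual, but the executable name `bayescan_2.1`, the input file name and the output prefix are placeholders, not the ones used for these runs (Python 3).

```python
def bayescan_commands(input_file, prior_odds, n_iter=20000,
                      exe="bayescan_2.1", out_dir="."):
    """Build one Bayescan command (as an argument list) per prior-odds value."""
    commands = []
    for odds in prior_odds:
        commands.append([
            exe, input_file,
            "-od", out_dir,                       # output directory
            "-o", "run_prodds{}".format(odds),    # output file prefix
            "-n", str(n_iter),                    # number of outputted iterations
            "-pr_odds", str(odds),                # prior odds for the neutral model
        ])
    return commands


# Placeholder input file name; prior odds match RUN 1 through RUN 4
for cmd in bayescan_commands("bayescan_input.txt", [10, 100, 1000, 10000]):
    print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` once the executable path and input file are filled in.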
 
 <br>
 
 ___________________________________
 #### 10/6/2017
 
 So it took about 16 hrs.
 
<br>

**(4) Interpreting Bayescan Output.** This can be done in R. It requires sourcing a [Bayescan R script](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/BAYESCAN_plot_R.r). I also wrote an [R script](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/Bayescan_KorPCod_MF.R) that can be run to (a) write out outlier loci, (b) call the bayescan plot function and (c) produce plots of the posterior distribution for a variety of parameters. I just copied and pasted the results from (a) into a text file.

Since PGDSpider renamed all of my loci, I also need to take the output from Bayescan and match it to my stacks locus IDs. I have 5,804 loci, and PGDSpider numbered the loci from 1 through 5,804, so I'm assuming that this was done in the order that my loci were listed in the genepop file. The code below will do that for the following text file format:

In [9]:
pwd

u'/mnt/hgfs/Pacific cod/DataAnalysis/PCod-Korea-repo/analyses/Outliers'

In [10]:
!head batch_8_BAYESCAN_outliers.txt

### BAYESCAN outliers -- note that locus IDs were assigned by PGD spyder, and don't correspond to actual Locus IDS in genepop file ###
$outliers
 [1]   31   55   88  193  336  350  393  406  469  550  620  673  697 1007 1044 1061
[17] 1120 1142 1296 1305 1343 1356 1382 1437 1452 1572 1582 1682 1799 1840 1855 2005
[33] 2079 2113 2147 2188 2195 2251 2282 2340 2375 2498 2509 2516 2536 2607 2610 2757
[49] 2817 2823 3004 3073 3084 3114 3235 3251 3309 3387 3478 3496 3502 3509 3516 3542
[65] 3610 3614 3618 3653 3712 3743 3886 3969 4196 4306 4324 4326 4662 4668 4709 4734
[81] 4974 5093 5110 5141 5165 5203 5240 5258 5268 5328 5343 5382

$nb_outliers


Python code available [here](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/Outliers/bayescan_to_stacks_locus_IDs.py)

In [2]:
!python bayescan_to_stacks_locus_IDs.py -h

usage: bayescan_to_stacks_locus_IDs.py [-h] [-i INPUT] [-gen GENEPOP]
                                       [-sep SEPARATOR] [-o OUTPUT]
                                       [-head HEADER]

Match bayescan outlier loci IDs to the actual stacks IDs (if PGD spider was
used for file conversion).

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        text file containing plot_bayescan() R consol output
  -gen GENEPOP, --genepop GENEPOP
                        the genepop file used in PGD spyder to create BAYESCAN
                        input file
  -sep SEPARATOR, --separator SEPARATOR
                        are the loci in your genepop file separated by a
                        'comma' or a 'newline'?
  -o OUTPUT, --output OUTPUT
                        output text file
  -head HEADER, --header HEADER
                        header for output text file. should start with #


In [23]:
!python bayescan_to_stacks_locus_IDs.py \
-i batch_8_BAYESCAN_outliers.txt \
-gen ../../stacks_b8_wgenome/batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC.txt \
-sep "newline" \
-o batch_8_BAYESCAN_outliers_stacksIDs.txt \
-head "Batch 8 BAYESCAN outliers; all samples"

reading BAYESCAN outliers..
['31', '55', '88', '193', '336', '350', '393', '406', '469', '550', '620', '673', '697', '1007', '1044', '1061', '1120', '1142', '1296', '1305', '1343', '1356', '1382', '1437', '1452', '1572', '1582', '1682', '1799', '1840', '1855', '2005', '2079', '2113', '2147', '2188', '2195', '2251', '2282', '2340', '2375', '2498', '2509', '2516', '2536', '2607', '2610', '2757', '2817', '2823', '3004', '3073', '3084', '3114', '3235', '3251', '3309', '3387', '3478', '3496', '3502', '3509', '3516', '3542', '3610', '3614', '3618', '3653', '3712', '3743', '3886', '3969', '4196', '4306', '4324', '4326', '4662', '4668', '4709', '4734', '4974', '5093', '5110', '5141', '5165', '5203', '5240', '5258', '5268', '5328', '5343', '5382', '92']
indexing stacks loci...
writing to output...
Done.
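
The index-to-ID lookup described above can be sketched in a few lines. This is not the code in `bayescan_to_stacks_locus_IDs.py`; the function names and the toy inputs are made up for illustration, and it relies on the same assumption stated earlier, that PGDSpider numbers loci in genepop order. It reads only the numbers printed under `$outliers`, so the count under `$nb_outliers` is not treated as a locus index (Python 3).

```python
def parse_r_outliers(lines):
    """Collect the integers printed under `$outliers`, skipping `[n]` row labels."""
    indices = []
    in_block = False
    for line in lines:
        line = line.strip()
        if line.startswith("$"):
            # a new R list element starts; only `$outliers` is of interest
            in_block = (line == "$outliers")
            continue
        if in_block:
            indices.extend(int(tok) for tok in line.split() if tok.isdigit())
    return indices


def genepop_locus_names(lines):
    """Return locus names from a genepop file with one locus per line.

    Skips the title line and stops at the first "Pop" line.
    """
    names = []
    for line in lines[1:]:
        name = line.strip()
        if name.lower() == "pop":
            break
        if name:
            names.append(name)
    return names


# Toy inputs standing in for the R console output and the genepop file
r_output = ["### comment ###", "$outliers", " [1]  2  4", "", "$nb_outliers", "[1] 2"]
genepop = ["title line", "10203", "14546", "17767", "18723", "Pop",
           "ind1,  0101 0102 0101 0202"]

names = genepop_locus_names(genepop)
# PGDSpider indices are 1-based, so subtract 1 to index the Python list
stacks_ids = [names[i - 1] for i in parse_r_outliers(r_output)]
print(stacks_ids)  # ['14546', '18723']
```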


<br>
<br>
**SOUTHERN POPULATIONS ONLY**

Started running on 10/6/2017; estimated run time 6 hours. 

<br>
<br>
**SOUTHERN & EASTERN POPULATIONS** - 11/1/2017

GENEPOP file with only southern and eastern sampling sites (each sampling site its own population)

In [1]:
cd ../analyses/Outliers/

/mnt/hgfs/PCod-Korea-repo/analyses/Outliers


In [3]:
!python bayescan_to_stacks_locus_IDs.py \
-i batch_8_southeast_BAYESCAN_outliers.txt \
-gen ../../stacks_b8_wgenome/batch_8_filteredMAF_filteredIndivids30_filteredLoci_filteredHWE_filteredRepsC.txt \
-sep "newline" \
-o batch_8_southeast_BAYESCAN_outliers_stacksIDs.txt \
-head "Batch 8 BAYESCAN outliers; southern and eastern site samples"

reading BAYESCAN outliers..
You have  15  outlier loci.
indexing stacks loci...
writing to output...
Done.


*Were all of these outliers found using all samples?*

In [9]:
# read outlier locus IDs from the southern / eastern analysis, skipping the header line
infile = open("batch_8_southeast_BAYESCAN_outliers_stacksIDs.txt", "r")
southeast = []
for line in infile:
    if "Batch" not in line:
        southeast.append(line.strip().split())
infile.close()

# read outlier locus IDs from the analysis using all samples
infile2 = open("batch_8_BAYESCAN_outliers_stacksIDs.txt", "r")
all_pops = []
for line in infile2:
    if "Batch" not in line:
        all_pops.append(line.strip().split())
infile2.close()

print "Num. Outlier Loci using all samples: ", len(all_pops)
print "Num. Outlier Loci using southern & eastern samples: ", len(southeast)
print len([i for i in southeast if i in all_pops]), " outlier loci identified between southern & eastern sites also showed up in analysis using all samples."
print "Novel outlier loci in southern / eastern analysis: "
print [i for i in southeast if i not in all_pops]

Num. Outlier Loci using all samples:  93
Num. Outlier Loci using southern & eastern samples:  15
12  outlier loci identified between southern & eastern sites also showed up in analysis using all samples.
Novel outlier loci in southern / eastern analysis: 
[['4202'], ['4359'], ['7277']]
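
The same comparison can be written more compactly with sets. This is a minimal sketch with toy ID lists in place of the two files; the helper name `read_ids` is hypothetical, and it keeps each locus ID as a plain string rather than a one-element list (Python 3).

```python
def read_ids(lines, header_word="Batch"):
    """Return the set of locus IDs, skipping blank lines and the header line."""
    return {ln.strip() for ln in lines if ln.strip() and header_word not in ln}


# Toy stand-ins for the two *_stacksIDs.txt files
southeast_lines = ["#Batch 8 BAYESCAN outliers; southern and eastern site samples",
                   "4202", "1904", "7277"]
all_lines = ["#Batch 8 BAYESCAN outliers; all samples", "1904", "10203"]

southeast_ids = read_ids(southeast_lines)
all_ids = read_ids(all_lines)

print(sorted(southeast_ids & all_ids))  # shared: ['1904']
print(sorted(southeast_ids - all_ids))  # novel: ['4202', '7277']
```

With the real files, the lists of lines would come from `open(...)` on each file; set intersection and difference then replace the two list comprehensions.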


<br>
<br>
<br>

### OutFLANK v. Bayescan

**ALL POPULATIONS**

In [30]:
## parse out outflank outlier locus IDs
outflank = open("allKOR_b8_outflank_outliers.txt", "r")
outflank.readline()

outflank_loci = []

for line in outflank:
    outflank_loci.append(line.strip().split(",")[1])
outflank.close()

## parse out bayescan outlier locus IDs
bay = open("batch_8_BAYESCAN_outliers_stacksIDs.txt", "r")
bay.readline()

bayescan_loci = []
for line in bay:
    bayescan_loci.append(line.strip())
bay.close()

## Identify matching loci
print [i for i in outflank_loci if i in bayescan_loci]

['10203', '14316', '14546', '17767', '18723', '1904', '19221', '2606', '2694', '2705', '3699']


*Note that all outlier loci identified by OutFLANK were also found by Bayescan.*