# Identifying Outlier Loci with Bayescan, across Eastern & Western pops
#### stacks batch 8

This notebook contains procedures and code used to identify outliers in the final filtered genepop file. This includes:

1. East v. West by sampling site (12 populations in input file)
    - priors 100
    - priors 1000
2. East v. West by site in east, by region in west (8 populations in input file)
    - priors 100
    - priors 1000
<br>

**Programs used:**
<br>
`Bayescan v2.1 `
<br>




<br>
## part 1. 
[Download](http://cmpg.unibe.ch/software/BayeScan/download.html) includes executable scripts and PDF manual.

**(1) Download [PGDSpider](http://www.cmpg.unibe.ch/software/PGDSpider/).** Bayescan uses its own input file format. The Bayescan authors suggest using PGDSpider to convert genepop files into this format.

**(2) Convert genepop to Bayescan format.** For SNP data, this can be either a "codominant" file format or a "SNP genotype matrix" (per Bayescan's user manual). The manual suggests that if you are not directly interested in Fis, you treat SNPs as regular codominant data. In PGDSpider, this is just a matter of choosing the file formats and file names for the input and output files, and then selecting "SNP" in two short questions for the SPID file. Note that using an old SPID file here caused an error; I had to create a new one.

### (3) Run Bayescan using the Windows GUI.
I used the following parameters for all runs, with the exception of the prior odds. The prior odds were set to 100 first, then to 1000.

![img-bay-start](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/notebook_pics/bayescan_p100.png?raw=true)
[link to bayescan's verification file](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/outliers/Bayescan/batch_8_BAYESCAN_output_eastwest_Verif.txt)

Note that I changed the default "sample size" to 20,000. This is because the PCod paper, Gruenthal et al. (in review), reported using 20,000 iterations. According to the Bayescan manual, the "Number of outputted iterations, default 5000" appears to be "sample size" in the GUI and "-n" on the command line.
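
For reference, the same settings could be passed on the command line instead of through the GUI. The sketch below only builds and prints the argument list; it does not run anything. The executable name and the output prefix are placeholders I made up, and the flags (`-n`, `-pr_odds`, `-od`, `-o`) are the ones I understand the Bayescan 2.1 manual to document, so check them against your copy of the manual before use.

```python
def bayescan_command(input_file, out_dir, out_prefix, pr_odds,
                     n_iter=20000, executable="bayescan_2.1"):
    """Build (but do not run) a Bayescan command line as a list of arguments."""
    return [
        executable,                # placeholder binary name; depends on your install
        input_file,                # Bayescan-format input file (e.g. from PGDSpider)
        "-od", out_dir,            # output directory
        "-o", out_prefix,          # prefix for output file names (hypothetical here)
        "-n", str(n_iter),         # number of outputted iterations ("sample size" in the GUI)
        "-pr_odds", str(pr_odds),  # prior odds for the neutral model
    ]


# One command per prior-odds setting used in this notebook.
for odds in (100, 1000):
    cmd = bayescan_command("batch_8_eastwest_Bayescan_input", "Bayescan",
                           "eastwest_p{}".format(odds), odds)
    print(" ".join(cmd))
```

To actually launch a run, the list could be handed to `subprocess.check_call`, assuming the executable is on the PATH.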

<br>

These are the input files: 
<br>
[East v. West by site input file]()
<br>
[East v. West by site / by reg input file]()

<br>
<br>


 #### East v. West
 **RUN 1:** by sampling site, prior odds of 100 (3/30/2018)
 <br>
 **RUN 2:** by sampling site, prior odds of 1000 (3/30/2018)
 <br>
 **RUN 3:** by site in east, by region in west, prior odds of 100 
 <br>
 **RUN 4:** by site in east, by region in west, prior odds of 1000 
 


 
<br>
#### 3/24/2018

### (4) Interpreting Bayescan Output.
This can be done in R. I use [this R script](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/Bayescan_KorPCod_MF.R) 

The R script includes two options: 
1.  Use the original Bayescan plotting functions. To do this, you will need the script [Bayescan_plot](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/BAYESCAN_plot_R.r). Since PGDspider changes the loci names, you will then need to (1) copy the R console output which lists outlier loci, and (2) use the python script [bayescan_to_stacks_locus_IDs_outliers.py](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/Outliers/bayescan_to_stacks_locus_IDs_outliers.py) to rename loci. 
    - pro: provides posterior distribution of fst
    - con: harder to customize the outlier plot. labeled loci in outlier plot do not correspond to actual loci names. provides a list of outlier loci names only to console.


2. Use an alternative plotting function I made with ggplot. You will first need to run the python script [bayescan_to_stacks_locus_IDs.py](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/Outliers/bayescan_to_stacks_locus_IDs.py) (see below) to generate the input file for this. 
    - pro: creates an input file that is nicely formatted. uses stacks loci names in R, so loci names on outlier plot correspond to actual loci names. will output a file with outlier loci names and all other information included in Bayescan's FST file. 
    - con: will not provide posterior distribution of fst. 
    
<br>
In this notebook, I went with option 2.
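
As a rough illustration of what "identifying outliers" from the FST file involves, here is a minimal Python sketch (not the R script linked above). It assumes the raw Bayescan `_fst.txt` layout of a header line (`prob log10(PO) qval alpha fst`) followed by one row per locus that starts with the locus index, and it assumes a q-value cutoff of 0.05, which this notebook does not state. The example rows are made up.

```python
def read_bayescan_fst(text):
    """Parse the contents of a Bayescan *_fst.txt file into per-locus dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    loci = []
    for fields in rows:
        # The first field is the locus index; the rest line up with the header.
        record = {"index": int(fields[0])}
        record.update({name: float(val) for name, val in zip(header, fields[1:])})
        loci.append(record)
    return loci


def outliers(loci, q_threshold=0.05):
    """Keep loci whose q-value is at or below the (assumed) threshold."""
    return [loc for loc in loci if loc["qval"] <= q_threshold]


# Synthetic example data, for illustration only.
example = """\
 prob log10(PO) qval alpha fst
1 0.05 -1.28 0.80 0.00 0.02
2 0.99 2.00 0.01 1.50 0.30
3 0.10 -0.95 0.60 0.00 0.03
"""
flagged = outliers(read_bayescan_fst(example))
print([loc["index"] for loc in flagged])
```

With the made-up rows above, only locus 2 passes the cutoff.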

In [1]:
pwd

u'/mnt/hgfs/PCod-Compare-repo/notebooks'

In [2]:
cd ../analyses/Outliers

/mnt/hgfs/PCod-Compare-repo/analyses/Outliers


In [3]:
!python bayescan_to_stacks_locus_IDs.py -h

usage: bayescan_to_stacks_locus_IDs.py [-h] [-i INPUT] [-gen GENEPOP]
                                       [-o OUTPUT] [-s SEPARATOR]

Match bayescan outlier loci IDs to the actual stacks IDs (if PGD spider was
used for file conversion).

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        fst text file output from bayescan
  -gen GENEPOP, --genepop GENEPOP
                        the genepop file used in PGD spyder to create BAYESCAN
                        input file
  -o OUTPUT, --output OUTPUT
                        output text file
  -s SEPARATOR, --separator SEPARATOR
                        separator used in genepop file [comma/newline]
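
The idea behind the renaming step can be sketched as follows. This is not the code in `bayescan_to_stacks_locus_IDs.py`; it is a minimal illustration that assumes (a) a genepop file with one locus name per line between the title line and the first `Pop` line (the "newline" separator case), and (b) that Bayescan numbers loci 1, 2, 3, ... in the same order as the genepop file. The locus names in the example are invented.

```python
def genepop_locus_names(genepop_text):
    """Return locus names from a newline-separated genepop header."""
    names = []
    for line in genepop_text.splitlines()[1:]:  # skip the title line
        stripped = line.strip()
        if stripped.lower() == "pop":           # first population block reached
            break
        if stripped:
            names.append(stripped)
    return names


def index_to_stacks_id(genepop_text):
    """Map 1-based Bayescan locus indices to the original locus names."""
    names = genepop_locus_names(genepop_text)
    return {i: name for i, name in enumerate(names, start=1)}


# Synthetic genepop snippet, for illustration only.
example = """\
made-up genepop title
1001_12
1042_7
2310_33
Pop
ind1, 0101 0102 0202
"""
mapping = index_to_stacks_id(example)
print(mapping[2])
```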


#### East v. West p100

In [6]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_p100_output_fst.txt \
-gen ../alignment/batch_8_final_filtered_aligned_genepop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_Bayescan_p100_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


#### East v. West p1000

In [7]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_p1K_output_fst.txt \
-gen ../alignment/batch_8_final_filtered_aligned_genepop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_Bayescan_p1K_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


#### West p100

In [8]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_west_Bayescan_p100_output_fst.txt \
-gen ../alignment/batch_8_final_filtered_aligned_genepop.txt \
-s "newline" \
-o Bayescan/batch_8_west_Bayescan_p100_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


#### West p1000

In [9]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_west_Bayescan_p1K_output_fst.txt \
-gen ../alignment/batch_8_final_filtered_aligned_genepop.txt \
-s "newline" \
-o Bayescan/batch_8_west_Bayescan_p1K_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.
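
The four conversion cells above differ only in their input and output file names, so they could also be generated from a single list. The sketch below builds the same command lines (file names and flags copied from the cells above) and prints them without running anything; the `RUNS` and `GENEPOP` names are my own.

```python
# (Bayescan FST output, renamed output) pairs, copied from the cells above.
RUNS = [
    ("Bayescan/batch_8_eastwest_p100_output_fst.txt",
     "Bayescan/batch_8_eastwest_Bayescan_p100_fst_stacksIDs.txt"),
    ("Bayescan/batch_8_eastwest_p1K_output_fst.txt",
     "Bayescan/batch_8_eastwest_Bayescan_p1K_fst_stacksIDs.txt"),
    ("Bayescan/batch_8_west_Bayescan_p100_output_fst.txt",
     "Bayescan/batch_8_west_Bayescan_p100_fst_stacksIDs.txt"),
    ("Bayescan/batch_8_west_Bayescan_p1K_output_fst.txt",
     "Bayescan/batch_8_west_Bayescan_p1K_fst_stacksIDs.txt"),
]
GENEPOP = "../alignment/batch_8_final_filtered_aligned_genepop.txt"


def conversion_commands(runs, genepop, separator="newline"):
    """Build one bayescan_to_stacks_locus_IDs.py command per run."""
    return [
        ["python", "bayescan_to_stacks_locus_IDs.py",
         "-i", fst_file, "-gen", genepop, "-s", separator, "-o", out_file]
        for fst_file, out_file in runs
    ]


for cmd in conversion_commands(RUNS, GENEPOP):
    # Printed only; pass cmd to subprocess.check_call to actually run it.
    print(" ".join(cmd))
```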


**Now switch over to R and use the R script [Bayescan_KorPCod_MF](https://github.com/mfisher5/PCod-Korea-repo/blob/master/analyses/R/Bayescan_KorPCod_MF.R)**

<br>
#### OUTLIERS:
So Bayescan didn't detect *any* outliers between east and west, or within the western populations. This is incorrect, based on previous analyses. My guess is that this is because the input files only had two populations. I'm going to change the inputs so that east v. west is broken down into the following populations:
- "West": Korea south, Korea west
- "East": Kodiak, Adak, WashCoast, Hecate Strait, Prince William Sound, Unimak Pass

The west is broken down by sampling site, without temporal replicates listed separately:
- Pohang, Geoje (2014 & 2015), Namhae, Jinhae Bay (early & late), Yellow Sea Block, Boryeong



______________________________________________________________________________________________
## part 2.

#### 3/24/2018

### (3) Run Bayescan using the Windows GUI.
I used the following parameters for all runs, with the exception of the prior odds. The prior odds were set to 100 first, then to 1000.

![img-bay-start](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/notebook_pics/bayescan_p100.png?raw=true)
[link to bayescan's verification file](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/outliers/Bayescan/batch_8_BAYESCAN_output_eastwest_Verif.txt)

Note that I changed the default "sample size" to 20,000. This is because the PCod paper, Gruenthal et al. (in review), reported using 20,000 iterations. According to the Bayescan manual, the "Number of outputted iterations, default 5000" appears to be "sample size" in the GUI and "-n" on the command line.

<br>

These are the input files: 
<br>
[East v. West input file](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/outliers/Bayescan/batch_8_eastwest_Bayescan_input)
<br>
[West input file](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/outliers/Bayescan/batch_8_west_Bayescan_input)

<br>
<br>


 #### East v. West
 **RUN 1:** prior odds of 100 (3/23/2018)
 <br>
 **RUN 2:** prior odds of 1000 (3/23/2018)
 
 #### West
 **RUN 1:** prior odds of 100 (3/23/2018)
 <br>
 **RUN 2:** prior odds of 1000 (3/23/2018)


In [1]:
pwd

u'/mnt/hgfs/PCod-Compare-repo/notebooks'

In [2]:
cd ../analyses/outliers/

/mnt/hgfs/PCod-Compare-repo/analyses/outliers


In [4]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_bypop_Bayescan_p100_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_bypop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_bypop_Bayescan_p100_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


In [5]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_bypop_Bayescan_p1K_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_bypop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_bypop_Bayescan_p1K_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


In [6]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_bypop_Bayescan_p1K_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_bypop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_bypop_Bayescan_p1K_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


In [7]:
!python bayescan_to_stacks_locus_IDs.py \
-i Bayescan/batch_8_eastwest_bypop_Bayescan_p100_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_bypop.txt \
-s "newline" \
-o Bayescan/batch_8_eastwest_bypop_Bayescan_p100_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


<br>
### part 2. outliers

**East v. West**: prior 100 - 318; prior 1000 - 192
<br>
**West**: 2 outliers
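
The drop from 318 to 192 outliers fits how prior odds work. As I understand the Bayescan manual, the prior odds are in favor of the neutral model, so a larger value demands stronger evidence before a locus is called an outlier. The sketch below is just the Bayes' rule arithmetic, not Bayescan's own code, and the Bayes factor of 100 is an arbitrary illustrative value.

```python
def prior_prob_selection(pr_odds):
    """Prior probability of the selection model, given prior odds for neutrality."""
    return 1.0 / (1.0 + pr_odds)


def posterior_prob_selection(bayes_factor, pr_odds):
    """Posterior probability of selection for a given Bayes factor (selection vs. neutral)."""
    posterior_odds = bayes_factor / pr_odds
    return posterior_odds / (1.0 + posterior_odds)


# Same hypothetical locus (Bayes factor of 100) under both prior-odds settings.
for odds in (100, 1000):
    print(odds, prior_prob_selection(odds), posterior_prob_selection(100.0, odds))
```

For the same evidence, the posterior probability of selection is 0.5 with prior odds of 100 but about 0.09 with prior odds of 1000.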

<br>
## part 3.

I'm curious to see how many additional outliers I get when I add in the Salish Sea populations, and when I then run all sampling sites as separate populations v. running all regions as populations. I completed two re-runs of Bayescan.

In [8]:
pwd

u'/mnt/hgfs/PCod-Compare-repo/analyses/outliers'

**All 17 sampling sites; each site its own population**

In [11]:
!python bayescan_to_stacks_locus_IDs.py \
-i batch_8_all_Bayescan_p100_rerun_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_rerun.txt \
-s "newline" \
-o batch_8_all_Bayescan_p100_rerun_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.


**All 17 sampling sites;**
<br>
**West population by region**
<br>
**East population: coastal sites each their own pop; Salish Sea as one pop**

In [12]:
!python bayescan_to_stacks_locus_IDs.py \
-i batch_8_all_byreg_Bayescan_p100_rerun_output_fst.txt \
-gen Bayescan/batch_8_final_filtered_aligned_genepop_eastwest_byreg_rerun.txt \
-s "newline" \
-o batch_8_all_byreg_Bayescan_p100_rerun_fst_stacksIDs.txt

indexing stacks loci...
You have  4286  loci.
copying over BAYESCAN output..
Copied over  4286  loci.
