## Sliding Window Analysis (practice)
### for Manhattan Plots Fst v. Base Position

In this notebook, I will be refining sliding window analysis for my data. 


<br>

Procedure is based on: Hohenlohe PA et al. (2010) PLoS Genetics 6(2):e1000862

Scripts were borrowed from: Charlie Waters, Marine Brieuc, Kot Ono

<br>

Programs used: `R v3.4.0`, `python v2.7`


<br>
### Step One: Alignment of Loci to Chromosome and Pairwise FST calculations

I aligned my Pacific cod sample data to linkage groups in the Atlantic cod genome in the Jupyter notebook [Align to ACod Genome - batches 4 & 8](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/Align%20to%20ACod%20Genome%20-%20Batches%204%20and%208.ipynb).

I calculated Fst using Genepop when creating Manhattan Plots, in the Jupyter notebook [Manhattan Plots - Batches 4 and 8](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/Manhattan%20Plots%20-%20Batches%204%20and%208.ipynb)


<br>
### Step Two: Prepare input file for R Script

The input file needs to be organized like the following:

In [1]:
cd ../analyses/SlidingWindow/Charlie/

/mnt/hgfs/PCod-Compare-repo/analyses/SlidingWindow/Charlie


In [2]:
!head sliding_window_input_2010seg.txt

Locus	fst	chromosome	position
Ot010598_Ots01p	-0.0078	Ots01	0
Ot018821_Ots01p	-0.0087	Ots01	0
Ot036743_Ots01p	0.0134	Ots01	0
Ot051272_Ots01p	-0.012	Ots01	0
Ot052674_Ots01p	0.0548	Ots01	0
Ot055731_Ots01p	-0.0059	Ots01	0
Ot056394_Ots01p	0.0581	Ots01	0
Ot016214_Ots01p	-0.0104	Ots01	0.01
Ot040238_Ots01p	0.0034	Ots01	0.01


I have already parsed my .sam file and the `.ST2` file output from `genepop` (see Jupyter notebook [Manhattan Plots - Batches 4 and 8](https://github.com/mfisher5/PCod-Compare-repo/blob/master/notebooks/Manhattan%20Plots%20-%20Batches%204%20and%208.ipynb))

<br>
This gave me the following files:

In [3]:
cd ../../

/mnt/hgfs/PCod-Compare-repo/analyses


In [4]:
!head ManhattanPlots/batch_4_filteredMQ_filteredAS_aligned_loci.txt

Locus	LG	Position
3	LG21	18027708
6	LG10	5565406
11	LG10	22879220
16	LG21	7743712
18	LG04	17492843
20	LG01	8417786
36	LG06	19545065
45	LG08	19539446
63	LG07	8250504


In [6]:
!head ../stacks_b4_wgenome/batch_4_MB_filteredMAF_filteredLoci50_filteredIndivids_filteredHWE_eastwest_fst_parsed.txt

Locus	Fst
23994	0.0308
19719	0.0358
5259	0.0157
3480	0.2240
22463	0.0678
19716	0.0850
16074	0.0465
11542	0.0395
21633	0.0138


So I can join them using the following code in R:

`setwd("D:/Pacific cod/DataAnalysis/PCod-Compare-repo/analyses/SlidingWindow")`

`infile <- read.delim("../ManhattanPlots/batch_8_filteredMQ_filteredAS_aligned_loci.txt", header=TRUE)`

`head(infile)`

`fstfile <- read.delim("../../stacks_b8_wgenome_r05/batch_8_filteredMAF_filteredLoci30_filteredIndivids_filteredHWE_eastwest_fst_parsed.txt", header=TRUE)`

`head(fstfile)`

`install.packages("dplyr"); library(dplyr)`

`align_data <- left_join(infile, fstfile, by="Locus")`

<br>
<br>
And then replace the column headers so that they match the headers in Charlie's input file. The order of the columns doesn't matter in the R script for sliding window analysis:

`colnames(align_data) <- c("Locus", "chromosome", "position", "fst")`
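
For anyone working from the Python kernel of this notebook instead of R, the join and rename can be sketched with the standard library alone. This is only an illustration: `read_tsv` and `left_join_fst` are hypothetical helper names, and the input column names (`Locus`, `LG`, `Position`, `Fst`) are taken from the `head` output shown above.

```python
import csv


def read_tsv(path):
    """Read a tab-delimited file with a header row into a list of dicts."""
    with open(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def left_join_fst(aligned_rows, fst_rows):
    """Left-join Fst values onto aligned loci using the shared 'Locus' column.

    Output keys match the headers expected by the sliding window script.
    """
    fst_by_locus = dict((row["Locus"], row["Fst"]) for row in fst_rows)
    joined = []
    for row in aligned_rows:
        joined.append({
            "Locus": row["Locus"],
            "chromosome": row["LG"],
            "position": row["Position"],
            # Aligned loci with no Fst estimate are kept and marked "NA".
            "fst": fst_by_locus.get(row["Locus"], "NA"),
        })
    return joined
```

As with `left_join` in R, every aligned locus is kept even when it has no Fst value.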
 
 <br>
 <br>
 

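I then want to write out my new data frame to a file. `write.table` writes row names by default, which is why the output below has an unlabeled first column of row numbers; adding `row.names=FALSE` would leave that column out: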

`write.table(align_data, "batch_8_MB_SLA_input.txt", quote=FALSE, sep="\t")`

In [10]:
pwd

u'/mnt/hgfs/PCod-Compare-repo/analyses'

In [13]:
!head SlidingWindow/batch_4_MB_SLA_input.txt

Locus	chromosome	position	fst
1	3	LG21	18027708	0.039
2	6	LG10	5565406	0.0456
3	11	LG10	22879220	0.0343
4	16	LG21	7743712	0.0414
5	18	LG04	17492843	0.9551
6	20	LG01	8417786	0.0302
7	36	LG06	19545065	0.036
8	45	LG08	19539446	0.0161
9	63	LG07	8250504	0.1582


<br>
<br>
### Step Three: Choose Parameters for Analysis

Sliding window analysis uses a "kernel-smoothing moving average." The width of the window and the weight of each point are defined using a Gaussian function, `exp(-(p-c)^2 / (2*sigma^2))`, where `p` is the position of a SNP and `c` is the position of the window center. Bootstrap resampling is used to assign significance values to moving average values of FST (or whatever population-level statistic you are working with).
<br>
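
To make the weighting concrete, here is a minimal Python sketch of a Gaussian kernel-weighted average at a single window center. It is not the code in `SlidingWindow_MF.R`; the function names are hypothetical, and the `3*sigma` cutoff follows the truncation described in Hohenlohe et al. (2010).

```python
import math


def gaussian_weight(p, c, sigma):
    """Weight of a SNP at position p for a window centered at c."""
    return math.exp(-((p - c) ** 2) / (2.0 * sigma ** 2))


def kernel_smoothed_fst(positions, fst_values, center, sigma):
    """Gaussian-weighted mean Fst at `center`.

    SNPs farther than 3*sigma from the center are ignored.
    Returns None when no SNP falls inside the window.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, fst in zip(positions, fst_values):
        if abs(p - center) > 3 * sigma:
            continue  # outside the truncated window
        w = gaussian_weight(p, center, sigma)
        weighted_sum += w * fst
        weight_total += w
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total
```

SNPs close to the center dominate the average, and a window with no SNPs returns `None` rather than a misleading zero.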

There are three variable parameters that need to be defined: 

1. Sigma - the standard deviation of the Gaussian function. The weighted average window is truncated at `3*sigma` for computational efficiency. Hohenlohe et al. (2010) used sigma = 150 kb.
2. Step size - how many base pairs the window moves over before it repeats the calculation of a weighted average. In the R code, "divisions" is specified instead of step size. Hohenlohe et al. (2010) used a step size = 100 kb. *With a step size of 100 kb, and about 20 million base pairs per linkage group, Hohenlohe et al. (2010) end up with approximately 200 divisions.*
3. Replicates used in bootstrapping - Hohenlohe et al. (2010) tested out 100, 1000, 10000, 1 million, 10 million. Charlie uses 1 million. 
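
The conversion between step size and divisions in item 2 is simple arithmetic. A small sketch, assuming evenly spaced window centers starting at position 0 (how the R script actually places its centers is not shown here, so both function names and the spacing rule are hypothetical):

```python
def divisions_from_step(lg_length_bp, step_bp):
    """Number of divisions implied by a step size on one linkage group."""
    return int(round(float(lg_length_bp) / step_bp))


def window_centers(lg_length_bp, divisions):
    """Evenly spaced window centers from 0 to the end of the linkage group."""
    step = float(lg_length_bp) / divisions
    return [i * step for i in range(divisions + 1)]


# 20 Mb linkage group with a 100 kb step, as in the note above
print(divisions_from_step(20000000, 100000))
```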


<br>


For my first run through of the analysis, I'm going to use the following for the R code:
1. Sigma - 150kb
2. Divisions - 200
3. Bootstrap replicates - 1 million 
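
For intuition about what the bootstrap replicates are doing, here is a rough Python sketch of a significance test for one window. It assumes the resampling scheme is: draw as many Fst values as there are SNPs in the window, with replacement, from the genome-wide set; recompute the weighted average with the same kernel weights; and report the fraction of replicates at or above the observed value. That scheme is my reading of the approach, not a copy of the R script, and `bootstrap_p_value` is a hypothetical name.

```python
import random


def bootstrap_p_value(observed, weights, genome_fst, replicates, rng=None):
    """One-sided bootstrap p-value for a window's weighted mean Fst.

    observed   -- weighted mean Fst actually seen in the window
    weights    -- kernel weights of the SNPs in the window
    genome_fst -- genome-wide Fst values to resample from
    replicates -- number of bootstrap replicates
    """
    if rng is None:
        rng = random.Random()
    weight_total = float(sum(weights))
    at_or_above = 0
    for _ in range(replicates):
        # Resample one genome-wide Fst value per SNP in the window.
        resampled = [rng.choice(genome_fst) for _ in weights]
        mean = sum(w * f for w, f in zip(weights, resampled)) / weight_total
        if mean >= observed:
            at_or_above += 1
    return float(at_or_above) / replicates
```

With 1 million replicates this pure-Python loop would be slow; it is meant to show the logic, not to replace the R code.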


<br>
<br>
### Step Four: Run R code 

[R Script SlidingWindow_MF.R](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/SlidingWindow/SlidingWindow_MF.R)

<br>
**Original Manhattan Plot:**

![img-lg01-manhattan](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/ManhattanPlots/batch_8_manhattan_line_lg01.png?raw=true)

<br>

**SLIDING WINDOW ANALYSIS Using a sigma = 150kb on LG01:**
![img-lg01-150](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/SlidingWindow/batch_8_LG01_SLA_150kb.png?raw=true)

<br>
**SLIDING WINDOW ANALYSIS Using a sigma = 200kb on LG01:**
![img-lg01-200](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/SlidingWindow/batch_8_LG01_SLA_200kb.png?raw=true)

<br>
**SLIDING WINDOW ANALYSIS Using a sigma = 200kb, divisions = 150 on LG01:**

![img-lg01-200-150d](https://github.com/mfisher5/PCod-Compare-repo/blob/master/analyses/SlidingWindow/batch_8_LG01_SLA_200kb_150d.png?raw=true)

In [1]:
d = 150
sigma = 200000
# sliding window size / min. allowable size for sliding window
float(3*sigma) / (float(22510304) / float(d))

3.9981690162869414
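
The cell above divides the truncated half-width of the window (`3*sigma`) by the spacing between window centers (LG01 length / divisions). To compare parameter sets side by side, the same calculation can be wrapped in a small helper. The LG01 length of 22,510,304 bp comes from the cell above; using 200 divisions for the first two parameter sets is an assumption based on the values chosen in Step Three, and the function name is hypothetical.

```python
def halfwidth_to_step_ratio(sigma_bp, divisions, lg_length_bp):
    """Ratio of the truncated window half-width (3*sigma) to the step size."""
    step_bp = float(lg_length_bp) / divisions
    return (3.0 * sigma_bp) / step_bp


LG01_LENGTH_BP = 22510304  # from the cell above

# (sigma in bp, divisions) for the three LG01 plots
for sigma_bp, divisions in [(150000, 200), (200000, 200), (200000, 150)]:
    ratio = halfwidth_to_step_ratio(sigma_bp, divisions, LG01_LENGTH_BP)
    print("sigma=%d divisions=%d ratio=%.2f" % (sigma_bp, divisions, ratio))
```

A larger ratio means neighboring windows overlap more, so the smoothed line changes more gradually from one division to the next.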

<br>

#### Sliding Window Analysis Parameters in the Literature




| Citation | Window Size | Step Size | # Divisions per Chromosome |
|:--------------:|:------------------:|:----------------:|:-------------------------:|
|Hohenlohe et al. (2010) PLoS Genet. - stickleback | 200kb | 100kb | ~200 |
|Larson et al. (2014) Evol. App. - salmon| 5cM | 1cM | ~100-150 (size of LG varies) |
|Karlsen et al. (2013) Mol. Ecol. - Atlantic cod| 4kb | 2kb | NA (concatenated scaffolds of old genome) |
|Nadeau et al. (2012) Phil. Trans. R. Soc. B - butterflies| 10kb | 2kb | NA (concatenated scaffolds of old genome) |
|Johnston et al. (2012) Mol. Ecol. - Atlantic salmon| 20 SNPs | NA | NA |
|Anderson et al. (2012) PLoS - zebrafish| 3 markers | 1 marker | NA |

<br>

| Citation | Total # SNPs | # Loci per Chromosome |
|:--------------:|:------------------:|:----------------:|
|Hohenlohe et al. (2010) PLoS Genet.| 45,000 | NA |
|Larson et al. (2014) Evol. App.| 10,944 | NA |
|Karlsen et al. (2013) Mol. Ecol.| 321,342 | NA |
|Johnston et al. (2012) Mol. Ecol. - Atlantic salmon| 4,353 | NA |
|Anderson et al. (2012) PLoS - zebrafish| ~36,000 | NA |

<br>

*notes:*
* Larson et al. (2014) also required at least two SNPs to be present in a window to run the analysis. 
* Nadeau et al. (2012) removed all moving averages "in which more than 90 per cent of the data were missing (i.e. less than 1 kb were present)"
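
A minimum-SNP rule like the one in the first note is easy to apply before averaging. A short sketch, with hypothetical function names and a default threshold of two SNPs per window:

```python
def snps_in_window(positions, center, half_width):
    """Count SNPs whose position lies within half_width of the window center."""
    return sum(1 for p in positions if abs(p - center) <= half_width)


def passes_min_snps(positions, center, half_width, min_snps=2):
    """True when the window holds enough SNPs to report a moving average."""
    return snps_in_window(positions, center, half_width) >= min_snps
```

Windows that fail the check can be skipped or reported as missing instead of plotted.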