# Exercise I: Frequency-based methods
----
## Variability statistics

We will use individuals sequenced in the 1000 Genomes Project. Even though the individuals were sequenced at low/medium depth, the project was able to obtain good genotype calls for most of the SNPs in the genome (using imputation + external information). Because of computational demands we will only look at a 20 Mb region for a small subset of the individuals. We will use 56 African individuals (YRI) and 60 European individuals (CEU). These are the individuals that overlap with the HapMap project, in which many selection scans in humans have been performed. We will explore the *LCT* gene located at position 136.6 Mb.

Aim: Locate the *LCT* selection signal using phased genotype data

We will use the program selscan, which contains many of the haplotype-based methods used for selection scans. These methods are based on phased genotype data.


In [None]:
## path for this exercise
ThePath=/course/popgen24/cindy/selectionExercises
  
#selscan program folder
SS=/course/popgen24/cindy/selectionExercises/prog/selscan-linux-1.3.0/

#VCF files
ceuVCF=$ThePath/ceuLCT.recode.vcf
yriVCF=$ThePath/yriLCT.recode.vcf
chbVCF=$ThePath/chbLCT.recode.vcf

#genetic map (positions in centimorgans)
MAP=$ThePath/geneticV2.map

### Tajima's Theta (Tajima's π)
---
First, let's look at the variability of the data in the CEU.

- If there has been positive selection at the *LCT* locus, what do you expect?

Estimate Tajima's theta (π) in 10 kb windows.

In [None]:
$SS/selscan --pi --vcf $ceuVCF --pmap --out ceuLCT --threads 8 --pi-win 10000
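
To make concrete what the statistic measures, here is a minimal R sketch that computes π for a single window from a toy haplotype matrix. The matrix `hap` and its values are invented for illustration and are not course data. selscan may scale or report its value differently, so treat this as a sketch of the definition rather than a reimplementation.

```r
# Toy haplotype matrix: rows are haplotypes, columns are SNPs (0/1 alleles).
# These values are made up for illustration only.
hap <- rbind(c(0, 0, 1, 0),
             c(0, 1, 1, 0),
             c(1, 0, 1, 0),
             c(0, 0, 0, 0))

n <- nrow(hap)          # number of sampled haplotypes
p <- colMeans(hap)      # frequency of allele 1 at each SNP

# Per-site heterozygosity with the n/(n-1) small-sample correction,
# summed over the SNPs in the window.
pi_window <- sum(2 * p * (1 - p) * n / (n - 1))

# Cross-check: pi is also the mean number of pairwise differences.
pi_pairwise <- mean(dist(hap, method = "manhattan"))

print(c(pi_window = pi_window, pi_pairwise = pi_pairwise))
```

Both numbers agree, which shows that π is simply the average number of differences between two randomly chosen haplotypes in the window. A selective sweep removes such differences, which is why a drop in π is expected around a selected site.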

**You can plot the results in `R` e.g.**

In [None]:
r<-read.table("ceuLCT.pi.10000bp.out",head=F)
r<-subset(r,V3!=0)
plot(r[,1],r[,3],ylab="pi",xlab="position")
causalSNP <- 136608646
#mark the position of the causal SNP in LCT
abline(v=causalSNP,col="red")
legend("topleft",fill="red","LCT")

- Can you see the reduced variability?

To determine whether it is extreme we can compare with the rest of the region.

In [None]:
#continue in R

hist(r[,3],br=100)
(causalWin<-subset(r,V1<causalSNP & V2>causalSNP))
abline(v=causalWin[,3],col="red")
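
The comparison with the rest of the region can also be summarised as a single number: the fraction of windows whose π is at least as low as in the focal window. The sketch below uses a hypothetical vector `pi_vals` in place of `r[,3]` and simply pretends that the fourth window contains the causal SNP.

```r
# Hypothetical per-window pi values; with the real data use r[,3] instead.
pi_vals <- c(12.1, 9.8, 15.3, 2.4, 11.0, 13.7, 10.2, 8.9, 14.5, 12.8)

# Pretend that window 4 is the one overlapping the causal SNP
# (with the real data this would be causalWin[,3]).
focal <- pi_vals[4]

# Empirical quantile: share of windows with pi <= the focal window.
emp_quantile <- mean(pi_vals <= focal)
print(emp_quantile)
```

A small value means that few windows in the region are as depleted of variation as the focal one. Note that this is an empirical ranking within a 20 Mb region, not a formal p-value.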

- Is the variability in the *LCT* region extreme?

- Try different window sizes to see if it changes your results **(NB! you have to change the file name when reading into `R`)**

### Tajima's *D*

**See how Tajima's *D* compares to π.**
Go to https://genome.ucsc.edu/s/aalbrechtsen/hg18Selection


This is a browser that can be used to view genomic data. With the above link you will view both genes and Tajima's *D* for 3 populations. 
 - Individuals with African descent are named AD
 - Individuals with European descent are named ED
 - Individuals with Chinese descent are named XD

<br/><br/>
You are looking at a random 11Mb region of the genome. Try to get a sense of the values that Tajima's *D* takes along the genome for the 3 populations.
 - You can move to another part of the chromosome by clicking once on the chromosome arm
 - You can also change chromosome in the search field
 - You can zoom in by dragging the mouse
 <figure>
  <img  src="https://raw.githubusercontent.com/populationgenetics/exercises/master/NaturalSelection/browser.png" alt="" width=800 title="">
 </figure>
 


Take note of the highest and lowest values of Tajima's *D* that you observed. 
<br/><br/>
  

Try to find the *SLC45A2* gene (use the search field and choose the top option). This is one of the most strongly selected genes in Europeans. 
 - Is this an extreme value of Tajima's *D*?

<br/><br/>
Now go to the *LCT* locus. 
 - Does this have an extreme value of Tajima's *D* ?
 - What can you conclude on the performance of Tajima’s *D* ?
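
For reference when interpreting the browser track, Tajima's *D* contrasts π with Watterson's estimator based on the number of segregating sites. The function below is a minimal sketch of the standard formula; the name `tajima_d` and its arguments are chosen here for illustration, and the browser track may have been computed with different software and window definitions.

```r
# Tajima's D from the number of haplotypes n, the number of segregating
# sites S, and the mean number of pairwise differences pi_hat in a window.
tajima_d <- function(n, S, pi_hat) {
  i  <- 1:(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)
  # Difference between pi and Watterson's theta, scaled by its standard deviation.
  (pi_hat - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))
}

# Toy example (made-up numbers): 4 haplotypes, 3 segregating sites, pi = 1.5.
d_toy <- tajima_d(n = 4, S = 3, pi_hat = 1.5)
print(d_toy)
```

Negative values indicate an excess of rare variants relative to the neutral expectation, as expected after a sweep, but also after population growth. This is one reason why the statistic can be hard to interpret on its own.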
 


## Population genetic differentiation statistics
----

### $F_{ST}$ and Population Branch Statistics (PBS)

Now that we've tried out single-population statistics, let's see how a selection scan that compares two populations performs.

Hudson's $F_{ST}$ comparing CEU (Europeans) and YRI (West Africans) has been precomputed for the *LCT* region in 10,000 bp windows.


Let's read the precomputed data and plot the results in `R`:

In [None]:
r <- read.table("/course/popgen24/cindy/selectionExercises/run/comboLCT.fst.out",header=T)
causalSNP <- 136608646
#plot FST results
plot(r$POS,r$FST,ylab="Fst");
abline(v=causalSNP,col="red")
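
To show what a windowed Hudson's $F_{ST}$ is built from, here is a minimal sketch of the estimator computed as a ratio of sums over the SNPs in one window. The function name `hudson_fst`, the allele frequencies, and the sample sizes (taken as twice the number of individuals) are assumptions for illustration; the precomputed file may have been produced with different software and settings.

```r
# Hudson's Fst for one window, as a ratio of sums over SNPs.
# p1, p2: allele frequencies per SNP in the two populations
# n1, n2: number of sampled chromosomes in each population
hudson_fst <- function(p1, p2, n1, n2) {
  num <- (p1 - p2)^2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
  den <- p1 * (1 - p2) + p2 * (1 - p1)
  sum(num) / sum(den)
}

# Made-up allele frequencies for three SNPs in one window.
p_ceu <- c(0.90, 0.50, 0.20)
p_yri <- c(0.10, 0.50, 0.30)

# Assumed sample sizes: 60 CEU and 56 YRI individuals, two chromosomes each.
fst_window <- hudson_fst(p_ceu, p_yri, n1 = 120, n2 = 112)
print(fst_window)
```

The first SNP, with very different frequencies in the two populations, dominates the window value. With identical frequencies the estimate is slightly negative because of the sample-size correction.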

Similarly, PBS comparing CEU and CHB (Han Chinese) to YRI has been precomputed for the *LCT* region, using windows of 75 variants each (approximately 10,000 bp).

Let's plot the precomputed PBS results in `R`:


In [None]:
r <- read.table("/course/popgen24/cindy/selectionExercises/run/comboLCT.pbs.out",header=T)
causalSNP <- 136608646
#plot PBS results
plot(r$POS,r$PBS,ylab="PBS");
abline(v=causalSNP,col="red")
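
PBS combines the three pairwise $F_{ST}$ values into a branch length for one focal population. The sketch below assumes that CEU is the focal population and uses made-up $F_{ST}$ values; the function name `pbs` is hypothetical.

```r
# Population branch statistic for a focal population A with two
# comparison populations B and C, from the three pairwise Fst values.
pbs <- function(fst_ab, fst_ac, fst_bc) {
  # Transform each Fst into a divergence time on a log scale.
  t_ab <- -log(1 - fst_ab)
  t_ac <- -log(1 - fst_ac)
  t_bc <- -log(1 - fst_bc)
  # Length of the branch leading to population A.
  (t_ab + t_ac - t_bc) / 2
}

# Made-up values, assuming A = CEU, B = CHB, C = YRI.
pbs_ceu <- pbs(fst_ab = 0.30, fst_ac = 0.40, fst_bc = 0.15)
print(pbs_ceu)
```

A large PBS means that the focal population has diverged from both other populations in that window, which helps place the change on the focal branch instead of on the branch of the comparison population.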

**- How do both of these selection methods compare to Tajima's π/theta and Tajima's *D*?**
 <br/><br/>
 <br/><br/>

# Exercise II: Haplotype-based methods

## Integrated Haplotype Score (iHS)

Let's now see if haplotype homozygosity does a better job than the frequency-based methods.
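
Haplotype homozygosity can be illustrated with extended haplotype homozygosity (EHH): the probability that two randomly chosen haplotypes are identical over a stretch of markers. iHS integrates this quantity over distance separately for the two alleles at a core SNP. The sketch below uses a made-up haplotype matrix and a hypothetical helper `ehh`; it is not how selscan is implemented.

```r
# EHH for a set of haplotypes: the fraction of haplotype pairs that are
# identical across all markers in the matrix.
ehh <- function(hap) {
  n <- nrow(hap)
  keys <- apply(hap, 1, paste, collapse = "")   # one string per haplotype
  counts <- as.vector(table(keys))              # size of each identical group
  sum(choose(counts, 2)) / choose(n, 2)
}

# Made-up haplotypes carrying the same core allele, two flanking SNPs.
hap <- rbind(c(0, 0),
             c(0, 0),
             c(0, 1),
             c(1, 1))

ehh_near <- ehh(hap[, 1, drop = FALSE])  # extend to the first flanking SNP
ehh_far  <- ehh(hap)                     # extend to the second flanking SNP
print(c(near = ehh_near, far = ehh_far))
```

EHH decays as the haplotypes are extended further from the core SNP. After a recent sweep the selected allele sits on unusually long shared haplotypes, so its EHH decays more slowly than that of the other allele.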

Run iHS using the command:

In [None]:
#$SS/selscan --ihs --vcf $ceuVCF --pmap --out ceuLCT --threads 8

The analysis takes a couple of minutes, so instead we will work with a precomputed set of results. You can link them into your folder here:

In [None]:
ln -s $ThePath/run/ceuLCT.ihs* .

The output columns are:

```
<locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>
```

These statistics are affected by the frequency of the SNPs, so we have to normalize them in frequency bins. The default is 100 bins.

In [None]:
$SS/norm --ihs --files ceuLCT.ihs.out --bins 20

With the default of 100 bins there are too few SNPs in each allele-frequency bin for this data set, which is why the command above reduces the number of bins to 20 with the `--bins 20` option.
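
Normalising in frequency bins means that each score is centred and scaled using only the SNPs with a similar allele frequency. The following sketch illustrates the idea on simulated scores; the variable names and the simulated data are assumptions, and the `norm` program may differ in details such as how bin boundaries are chosen.

```r
set.seed(1)

# Simulated data: 200 SNPs, 10 in each of 20 equally wide frequency bins,
# with a raw score whose mean depends on allele frequency.
freq    <- rep((1:20 - 0.5) / 20, each = 10)
ihs_raw <- rnorm(200) + 2 * freq

# Assign each SNP to one of 20 frequency bins.
bins <- cut(freq, breaks = seq(0, 1, length.out = 21), include.lowest = TRUE)

# Standardise within each bin: subtract the bin mean, divide by the bin sd.
ihs_norm <- ave(ihs_raw, bins, FUN = function(x) (x - mean(x)) / sd(x))

# After normalisation the score no longer trends with frequency.
print(round(tapply(ihs_norm, bins, mean), 10))
```

With too many bins some of them contain only a handful of SNPs, and the bin mean and standard deviation become unreliable or undefined. That is the reason for reducing the number of bins on a small region like this one.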

Let's plot the results in `R`:

In [None]:
r <- read.table("ceuLCT.ihs.out.20bins.norm",as.is=T,head=F)
names(r) <- c("locusID", "physicalPos","freq","ihh1","ihh0","unstandardizediHS")
r[which.max(r$ihh1/r$ihh0),]
causalSNP <- 136608646
#plot without frequency standardization
plot(r$physicalPos,r$unstandardizediHS);

## iHS computed as log(ihh1/ihh0)
r$iHS<-log(r$ihh1/r$ihh0)
plot(r$physicalPos,r$iHS,ylab="iHS");
abline(v=causalSNP,col="red")

## causal SNP test statistics vs. rest of region
(causalSite<-r[which(r$physicalPos==causalSNP),])
hist(r$iHS)
abline(v=causalSite$iHS,col="red")


## Cross-population EHH (XP-EHH)

Let's try to use West Africans (YRI) to normalise the iHS with XP-EHH:

In [None]:
#$SS/selscan --xpehh --vcf $ceuVCF --pmap --vcf-ref $yriVCF --out ceuLCT --threads 8

This may take up to 10 minutes so feel free to copy the results instead:

In [None]:
ln -s $ThePath/run/ceuLCT.xpehh* .

We also have to normalize these results:

In [None]:
$SS/norm --xpehh --files ceuLCT.xpehh.out

And once more, you can plot the results in `R`:

In [None]:
r<-read.table("ceuLCT.xpehh.out.norm",head=T,as.is=T,row.names=NULL)
causalSNP <- 136608646
(causalSite<-r[which(r$pos==causalSNP),])                 
plot(r$pos,r$normxpehh)
abline(v=causalSNP,col="red")

#print the site with maximum statistic 
r[which.max(r$normxpehh),]

#plot the distribution
hist(r$normxpehh)
abline(v=causalSite$normxpehh,col="red")

#get the quantile
mean(causalSite$normxpehh>r$normxpehh)

**- Are you more convinced that the site is under selection?**