# Exercises I: Frequency-based methods
----
## Variability statistics

We will use individuals sequenced in the 1000 Genomes Project. Even though the individuals were sequenced at low/medium depth, the project was able to obtain good genotype calls for most of the SNPs in the genome (using imputation + external information). Because of computational demands we will only be looking at a 20 Mb region for a small subset of the individuals. We will use 56 African individuals (YRI) and 60 European individuals (CEU). These are the individuals that overlap with the HapMap project, where many selection scans in humans have been performed. We will explore the *LCT* gene, located at position 136.6 Mb.

Aim: Locate the *LCT* selection signal using phased genotype data

We will use the program selscan, which contains many of the haplotype-based methods used for selection scans. These methods are based on phased genotype data. To save time we have prepared such data for you in VCF format. Let's start by setting up the paths for those files as well as for the selscan program and a file with a genetic map:

In [None]:
## path for this exercise
ThePath=/course/popgen24/cindy/selectionExercises
  
#selscan program folder
SS=/course/popgen24/cindy/selectionExercises/prog/selscan-linux-1.3.0/

#VCF files
ceuVCF=$ThePath/ceuLCT.recode.vcf
yriVCF=$ThePath/yriLCT.recode.vcf
chbVCF=$ThePath/chbLCT.recode.vcf

#genetic map (positions in centimorgans)
MAP=$ThePath/geneticV2.map

And let's quickly look at a line in the VCF file to see how the fact that the data is phased is indicated:

In [None]:
tail -n 1 $ThePath/ceuLCT.recode.vcf

Note that for each genotype for this SNP the alleles for each individual are separated by a vertical pipe "|". The vertical pipe | indicates that the genotype is phased, and is used to indicate which chromosome the alleles are on (if this is a slash / rather than a vertical pipe, it means we don’t know which chromosome they are on).
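To make the difference between the two separators concrete, here is a minimal sketch that counts phased and unphased genotypes in a single VCF record. The record is a made-up toy line (the ID `rs_toy` and the three genotypes are invented for illustration), not a line from the course files.

```shell
# Toy VCF record with three samples (invented for illustration):
# two phased genotypes (0|1, 1|1) and one unphased genotype (0/1)
line=$(printf '2\t136608646\trs_toy\tG\tA\t.\tPASS\t.\tGT\t0|1\t1|1\t0/1')

# In a VCF the genotypes start at column 10; count the fields containing each separator
phased=$(printf '%s\n' "$line" | awk -F'\t' '{n=0; for(i=10;i<=NF;i++) if($i ~ /\|/) n++; print n}')
unphased=$(printf '%s\n' "$line" | awk -F'\t' '{n=0; for(i=10;i<=NF;i++) if($i ~ /\//) n++; print n}')

echo "phased=$phased unphased=$unphased"
```

Running the same two `awk` commands on the output of `tail -n 1` should show whether every genotype in the prepared file is phased.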

### Tajima's Theta (Tajima's π)
---
First let's look at the variability of the data in the CEU.

- If there has been positive selection at the *LCT* locus, what do you expect?

With that in mind, try to estimate Tajima's theta (π) in 10 kb windows with selscan using the following command:

In [None]:


$SS/selscan --pi --vcf $ceuVCF --pmap --out ceuLCT --threads 8 --pi-win 10000
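For orientation, the textbook definition of nucleotide diversity in a window is the average number of pairwise differences between the sampled haplotypes. In the expression below, $n$ is the number of haplotypes (120 for the 60 CEU individuals), $d_{ij}$ is the number of sites in the window at which haplotypes $i$ and $j$ differ, and $p_s$ is the sample frequency of one allele at biallelic site $s$; these symbols are introduced here for illustration. Whether selscan reports the value per window or per base pair is best checked against the output file.

$$
\hat{\pi} \;=\; \frac{1}{\binom{n}{2}} \sum_{i<j} d_{ij} \;=\; \frac{n}{n-1} \sum_{s \,\in\, \text{window}} 2\, p_s \,(1 - p_s)
$$

A selective sweep pushes nearby alleles towards frequency 0 or 1, so each term $2 p_s (1 - p_s)$ shrinks and π drops in the windows around the selected site.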

Let's have a look at the last 10 lines of the output file (which is named after the window size):

In [None]:
tail -n 10 ceuLCT.pi.10000bp.out

- Is it clear what the output contains? (hint: there is one line for each window; the first two columns define the window and the last one holds the π value)

Now try to plot the results in R using e.g. the following command:

In [None]:
r<-read.table("ceuLCT.pi.10000bp.out",head=F)
r<-subset(r,V3!=0)
plot(r[,1]/1e6,r[,3],ylab="pi",xlab="Position (Mb)",pch=20)
causalSNP <- 136608646/1e6
abline(v=causalSNP,col="red")
legend("topleft",lty=1,col="red","LCT")

- Can you see the reduced variability?

To determine whether it is extreme, we can compare it with the π values in the rest of the region using the following R code:

In [None]:
#continue in R

hist(r[,3],br=100, main="Histogram of pi values for all windows", xlab="pi")
causalWin<-subset(r,V1<(causalSNP*1e6)& V2>(causalSNP*1e6))
abline(v=causalWin[,3],col="red")
legend("topright",lty=1,col="red","LCT")
print(paste("The pi value in the window with LCT is",causalWin[,3]))

- Is the variability in the *LCT* region extreme?

- Try different window sizes to see if this changes your results. To do so you need to go back to the code above, change the window size when you estimate π with selscan and rerun all of the code **(NB! you also have to change the name of the results file when reading it into `R`)**
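One way to keep track of the window sizes is a small dry-run loop that prints the selscan command and the matching output file name for each size. This is only a sketch: the three window sizes are arbitrary choices, and the file name pattern is assumed to follow the `ceuLCT.pi.10000bp.out` example above.

```shell
# Dry run: print (do not execute) the selscan command for several window sizes,
# together with the output file to read into R afterwards.
for win in 5000 10000 50000; do
    # assumed naming, following the ceuLCT.pi.10000bp.out file seen above
    out="ceuLCT.pi.${win}bp.out"
    # the single quotes keep $SS and $ceuVCF literal, so the line can be pasted as-is
    printf '$SS/selscan --pi --vcf $ceuVCF --pmap --out ceuLCT --threads 8 --pi-win %s\n' "$win"
    printf '  -> then read "%s" in R\n' "$out"
done
```

Paste the printed command into a cell to run it, then update the file name in the `read.table` call accordingly.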

### Tajima's *D*

**See how Tajima's *D* compares to π.**
Go to https://genome.ucsc.edu/s/aalbrechtsen/hg18Selection


This is a browser that can be used to view genomic data. With the above link you will view both genes and Tajima's *D* for 3 populations. 
 - Individuals with African descent are named AD
 - Individuals with European descent are named ED
 - Individuals with Chinese descent are named XD
<br/><br/>

You are looking at a random 11Mb region of the genome. Try to get a sense of the values that Tajima's *D* takes along the genome for the 3 populations.
 - You can move to another part of the chromosome by clicking once on the chromosome arm
 - You can also change chromosome in the search field
 - You can zoom in by dragging the mouse
 <figure>
  <img  src="https://raw.githubusercontent.com/populationgenetics/exercises/master/NaturalSelection/browser.png" alt="" width=800 title="">
 </figure>
 


Take note of the highest and lowest values of Tajima's *D* that you observed. 
<br/><br/>
  

Try to find the *SLC45A2* gene (use the search field and choose the top option). This is one of the most strongly selected genes in Europeans.
 - Is this an extreme value of Tajima's *D*?

<br/><br/>
Now go to the *LCT* locus.
 - Does this have an extreme value of Tajima's *D*?
 - What can you conclude about the performance of Tajima's *D*?
 


## Population genetic differentiation statistics
----

### $F_{ST}$ and Population Branch Statistics (PBS)

Now that we've tried out single-population statistics, let's see how a selection scan that compares two populations performs.

Hudson's $F_{ST}$ comparing CEU (Europeans) and YRI (West Africans) for the *LCT* region has already been precomputed using 10,000 bp windows.
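As a reminder of what is being plotted, a commonly used form of Hudson's estimator for a window combines the sites as a ratio of sums. Here $p_{1s}$ and $p_{2s}$ are the sample allele frequencies at site $s$ in the two populations and $n_1$, $n_2$ are the numbers of sampled haplotypes; the notation is introduced here for illustration, and the precomputed file may come from a slightly different implementation.

$$
\hat{F}_{ST} \;=\; \frac{\sum_{s} \left[ (p_{1s} - p_{2s})^2 - \dfrac{p_{1s}(1 - p_{1s})}{n_1 - 1} - \dfrac{p_{2s}(1 - p_{2s})}{n_2 - 1} \right]}{\sum_{s} \left[ p_{1s}(1 - p_{2s}) + p_{2s}(1 - p_{1s}) \right]}
$$

Large allele frequency differences between the two populations, relative to the diversity between them, push the value towards 1.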


Let's read the data and plot the results in `R`:

In [None]:
r <- read.table("/course/popgen24/cindy/selectionExercises/run/comboLCT.fst.out",header=T)
causalSNP <- 136608646
#plot FST results
plot(r$POS,r$FST,ylab="Fst");
abline(v=causalSNP,col="red")

Similarly, PBS comparing CEU and CHB (Han Chinese) to YRI for the *LCT* region has been computed using windows of 75 variants each (approximately 10,000 bp).
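For reference, PBS is built from the three pairwise $F_{ST}$ values by first turning each into a branch-length-like quantity $T$ and then isolating the branch leading to the target population. The expression below takes CEU as the target, which is an assumption about how the precomputed file was made.

$$
T_{XY} = -\log\!\left(1 - F_{ST}^{XY}\right), \qquad
\mathrm{PBS}_{\mathrm{CEU}} = \frac{T_{\mathrm{CEU,CHB}} + T_{\mathrm{CEU,YRI}} - T_{\mathrm{CHB,YRI}}}{2}
$$

A high value therefore means that CEU differs strongly from both other populations while those two remain comparatively similar to each other.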

Let's plot the precomputed PBS results in `R`:


In [None]:
r <- read.table("/course/popgen24/cindy/selectionExercises/run/comboLCT.pbs.out",header=T)
causalSNP <- 136608646
#plot PBS results
plot(r$POS,r$PBS,ylab="PBS");
abline(v=causalSNP,col="red")

**- How do both of these selection methods compare to Tajima's π/theta and Tajima's *D*?**
 <br/><br/>
 <br/><br/>

# Exercise II: Haplotype-based methods

## Integrated Haplotype Score (iHS)

Let's now see if haplotype homozygosity does a better job than the frequency-based methods.

Run iHS using the command:

In [None]:
#$SS/selscan --ihs --vcf $ceuVCF --pmap --out ceuLCT --threads 8

The analysis takes a couple of minutes, so instead we will work with a precomputed set of results. You can link them into your folder here:

In [None]:
ln -s $ThePath/run/ceuLCT.ihs* .

The output columns are:

```
<locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>
```
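To get a feel for these columns before moving to `R`, a quick `awk` one-liner can pull out a single site. The sketch below runs on a two-line toy file whose values are invented for illustration; pointing the same `awk` command at `ceuLCT.ihs.out` should work as long as the file follows the column layout above.

```shell
# Toy stand-in for ceuLCT.ihs.out (values are made up; columns follow the layout above)
toy=$(mktemp)
printf 'rs_a\t136600000\t0.25\t0.031\t0.040\t-0.11\n'  > "$toy"
printf 'rs_b\t136608646\t0.74\t0.092\t0.021\t0.64\n' >> "$toy"

# Print the locus ID and the ihh1/ihh0 ratio for the site at the LCT position
pos=136608646
hit=$(awk -v pos="$pos" '$2 == pos {printf "%s %.2f\n", $1, $4/$5}' "$toy")
echo "$hit"
rm -f "$toy"
```

A ratio well above 1 means that haplotypes carrying the '1' allele stay homozygous over a longer distance than those carrying the '0' allele.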

These statistics are affected by the frequency of the SNPs, so we have to normalize within frequency bins. The default is 100 bins.

In [None]:
$SS/norm --ihs --files ceuLCT.ihs.out --bins 20

The default number of bins is too high for this data set, since we do not have enough SNPs in each allele-frequency bin. The command above therefore reduces the number of bins to 20 with the `--bins 20` option.
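The normalization itself is a standard z-score computed separately within each frequency bin. In the sketch below, $b(s)$ is the frequency bin that SNP $s$ falls into, and $\mu_{b}$ and $\sigma_{b}$ are the mean and standard deviation of the unstandardized scores of all SNPs in bin $b$; the notation is introduced here for illustration.

$$
\mathrm{iHS}_{\text{std}}(s) \;=\; \frac{\mathrm{iHS}_{\text{unstd}}(s) - \mu_{b(s)}}{\sigma_{b(s)}}
$$

With too many bins, some bins contain only a handful of SNPs, so $\mu_b$ and $\sigma_b$ are poorly estimated and the standardized scores become noisy.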

Let's plot the results in `R`:

In [None]:
r <- read.table("ceuLCT.ihs.out.20bins.norm",as.is=T,head=F)
names(r) <- c("locusID", "physicalPos","freq","ihh1","ihh0","unstandardizediHS")
r[which.max(r$ihh1/r$ihh0),]
causalSNP <- 136608646
#plot without frequency standardization
plot(r$physicalPos,r$unstandardizediHS);

## iHS recomputed from the integrated haplotype homozygosities: log(ihh1/ihh0)
r$iHS<-log(r$ihh1/r$ihh0)
plot(r$physicalPos,r$iHS,ylab="iHs");
abline(v=causalSNP,col="red")

## causal SNP test statistics vs. rest of region
(causalSite<-r[which(r$physicalPos==causalSNP),])
hist(r$iHS)
abline(v=causalSite$iHS,col="red")


## Cross-population EHH (XP-EHH)

Let's try to use West Africans (YRI) to normalise the iHS with XP-EHH:

In [None]:
#$SS/selscan --xpehh --vcf $ceuVCF --pmap --vcf-ref $yriVCF --out ceuLCT --threads 8

This may take up to 10 minutes so feel free to copy the results instead:

In [None]:
ln -s $ThePath/run/ceuLCT.xpehh* .

We also have to normalize these results:

In [None]:
$SS/norm --xpehh --files ceuLCT.xpehh.out

And once more, you can plot the results in `R`:

In [None]:
r<-read.table("ceuLCT.xpehh.out.norm",head=T,as.is=T,row.names=NULL)
causalSNP <- 136608646
(causalSite<-r[which(r$pos==causalSNP),])                 
plot(r$pos,r$normxpehh)
abline(v=causalSNP,col="red")

#print the site with maximum statistic 
r[which.max(r$normxpehh),]

#plot the distribution
hist(r$normxpehh)
abline(v=causalSite$normxpehh,col="red")

#get the quantile
mean(causalSite$normxpehh>r$normxpehh)

**- Are you more convinced that the site is under selection?**

# PBS 1000 Genomes

Let's explore the genome using PBS. First copy the allele frequency information from the 1000 Genomes Project:

In [None]:
cp -r -sf  /course/popgen24/anders/selectionScan/ ~/
#enter the folder you just copied
cd ~/selectionScan



Load the data into R (you do not need to understand the code):

In [None]:

## load data and function
setwd("~/selectionScan/")

source("server.R")
winSize <- 50000
shinyDir<- "~/selectionScan/tmp/"

shinyPBS<-paste0(shinyDir,"pbs") 
shinyCSV<-paste0(shinyDir,"pbs.csv")


First select 3 populations from

    NAT – Native Americans (Peru + Mexico)
    CHB – East Asian - Han Chinese
    CEU – Central Europeans
    YRI – African - Nigerians

The first population is the one whose branch you are investigating. The two others are the ones you are comparing it to. Choose CEU as the first and CHB and YRI as the two others.

In [None]:
#"NAT", "CEU", "CHB","YRI"

#### choose populations
#pops: 1=NAT, 2=CHB, 3=CEU, 4=YRI
myPops <- c(3,2,4)

First let's get an overview of the whole genome by making a Manhattan plot:


In [None]:
options(repr.plot.width=17, repr.plot.height=8)
#### Manhattan plot of PBS for the chosen populations
PBSmanPlot(myPops)


Note which chromosomes have extreme values. A high value of PBS means a long branch length. To view a single chromosome, use the PBS region plot.

Choose the chromosome with the highest PBS value and set the starting position to -1 to get the whole chromosome, e.g. to see the entire chromosome 1:


In [None]:
options(repr.plot.width=17, repr.plot.height=10)
#modify the code below with your selected chromosome
PBSmanRegion(myPops,chrom=1,start=-1)

Zoom in on the peak by changing the start and end positions, e.g. to see the region between 20 Mb and 21 Mb on chromosome 1:


In [None]:
options(repr.plot.width=17, repr.plot.height=10)

#modify below code
PBSmanRegion(myPops,chrom=1,start=20,end=21)


- Locate the most extreme regions of the genome and zoom in
 - Identify the gene with the highest PBS value.
 - What does the gene do?
 - Try the *LCT* gene (the mutations are located in the adjacent *MCM6* gene). See below for how to get the position
 - How does this compare with Tajima's *D*?

If you have time you can try other genes. Here are the top ones for humans. You can find the location of the genes using, for example, the UCSC browser https://genome-euro.ucsc.edu/cgi-bin/hgGateway (choose the human GRCh37/hg19 genome). Note that some populations cannot be tested because they are not represented in the data, e.g. Tibetans, Ethiopians, Inuit and Siberians.

### EDAR

If you have time, try to see if you can detect selection on the *EDAR* gene in East Asians (*EDAR* was the gene you explored using low-depth sequencing on the first day of the course).