vignettes/pcadapt.Rmd

---
title: Using pcadapt to detect local adaptation
author: Keurcien Luu, Florian Privé, Michael G.B. Blum
output: 
  rmarkdown::html_document:
    css: style.css
vignette: >
  %\VignetteIndexEntry{pcadapt}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
---

**pcadapt** has been developed to detect genetic markers involved in biological adaptation. **pcadapt** provides statistical tools for outlier detection based on Principal Component Analysis (PCA).

In the following, we show how the **pcadapt** package can perform genome scans for selection based on individual genotype data. We show how to run the package using the example **geno3pops** that contains genotype data. A total of 150 individuals coming from three different populations were genotyped at 1,500 diploid markers. Simulations were performed with [simuPOP](http://simupop.sourceforge.net/) using a divergence model assuming that 150 SNPs confer a selective advantage. To run the package on the provided example, just copy and paste shaded `R` chunks.

To run the package, you need to install the package and load it using the following command lines:

```{r, eval = FALSE}
install.packages("pcadapt")
library(pcadapt)
```

```{r, include = FALSE}
library(pcadapt)#Comment
```

# A. Reading genotype data

You should use the `read.pcadapt` function to convert your genotype file to the `bed` format, which is [PLINK binary biallelic genotype table](https://www.cog-genomics.org/plink2/formats#bed). `read.pcadapt` converts different types of files to the `bed` format and returns a character string containing the name of the converted file, which should be used as input for the `pcadapt` function. Supported formats are the following: "pcadapt", "lfmm", "vcf", "bed", "ped", "pool". `pcadapt` files should have individuals in columns, SNPs in lines, and missing values should be encoded by a single character (e.g. 9) different from 0, 1 or 2.

For example, assume your genotype file is called "foo.lfmm" and is located in the directory "path_to_directory", use the following command lines:

```
path_to_file <- "path_to_directory/foo.lfmm"
filename <- read.pcadapt(path_to_file, type = "lfmm")
```

To run the provided example, we retrieve the file location of the example and we use `read.pcadapt` to convert the `lfmm` example to the `pcadapt` format.

```{r}
path_to_file <- system.file("extdata", "geno3pops.lfmm", package = "pcadapt")
filename <- read.pcadapt(path_to_file, type = "lfmm")
```

When working with genotype matrices already loaded in the `R` session, users have to run the `read.pcadapt` function and to specify the `type`argument, which can be `pcadapt` (individuals in columns, SNPs in lines) or `lfmm` (individuals in lines, SNPs in columns).

# B. Choosing the number `K` of Principal Components


The `pcadapt` function performs two successive tasks. First, PCA is performed on the centered and scaled genotype matrix. The second stage consists in computing test statistics and p-values based on the correlations between SNPs and the first `K` principal components (PCs). To run the function `pcadapt`, the user should specify the output returned by the function `read.pcadapt` and the number `K` of principal components to compute.

To choose `K`, principal component analysis should first be performed with a large enough number of principal components (e.g. `K=20`).

```{r}
x <- pcadapt(input = filename, K = 20) 
```

**NB**: by default, data are assumed to be diploid. To specify the ploidy, use the argument `ploidy` (`ploidy=2` for diploid species and `ploidy = 1` for haploid species) in the `pcadapt` function.

### B.1. Scree plot

The 'scree plot' displays in decreasing order the percentage of variance explained by each PC. Up to a constant, it corresponds to the eigenvalues in decreasing order. The ideal pattern in a scree plot is a steep curve followed by a bend and a straight line. The eigenvalues that correspond to random variation lie on a straight line whereas the ones that correspond to population structure lie on a steep curve. We recommend to keep PCs that correspond to eigenvalues to the left of the straight line (Cattell's rule). In the provided example, `K = 2` is the optimal choice for `K`. The plot function displays a scree plot:

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "screeplot")
```

By default, the number of principal components shown in the scree plot is `K`, but it can be reduced via the argument `K`.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "screeplot", K = 10)
```

### B.2. Score plot

Another option to choose the number of PCs is based on the 'score plot' that displays population structure. The score plot displays the projections of the individuals onto the specified principal components. Using the score plot, the choice of `K` can be limited to the values of `K` that correspond to a relevant level of population structure.

When population labels are known, individuals of the same populations can be displayed with the same color using the `pop` argument, which should contain the list of indices of the populations of origin. In the **geno3pops** example, the first population is composed of the first 50 individuals, the second population of the next 50 individuals, and so on. Thus, a vector of indices or characters (population names) that can be provided to the argument `pop` should look like this:

```{r}
# With integers
poplist.int <- c(rep(1, 50), rep(2, 50), rep(3, 50))
# With names
poplist.names <- c(rep("POP1", 50),rep("POP2", 50),rep("POP3", 50))
print(poplist.int)
print(poplist.names)
```

If this field is left empty, the points will be displayed in black. By default, if the values of `i` and `j` are not specified, the projection is done onto the first two principal components.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "scores", pop = poplist.int)
```

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "scores", pop = poplist.names)
```

Looking at population structure beyond `K = 2` confirms the results of the scree plot. The third and the fourth principal components do not ascertain population structure anymore.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "scores", i = 3, j = 4, pop = poplist.names)
```

# C. Computing the test statistic based on PCA

For a given SNP, the test statistic is based on the $z$-scores obtained when regressing SNPs with the `K` principal components. The test statistic for detecting outlier SNPs is the Mahalanobis distance, which is a multi-dimensional approach that measures how distant is a point from the mean. Denoting by $z^j = (z_1^j, \dots, z_K^j)$ the vector of `K` $z$-scores between the $j$-th SNP and the first `K` PCs, the sqaured Mahalanobis distance is defined as

$$D_j^2 = (z^j - \bar{z})\Sigma^{-1}(z^j - \bar{z})$$

where $\bar{z}$ and $\Sigma$ are robust estimates of the mean and of the covariance matrix. Once divided by a constant $\lambda$ called the genomic inflation factor, the scaled squared distances $D_j^2/\lambda$ should have a chi-square distribution with `K` degrees of freedom under the assumption that there are no outlier.

For the **geno3pops** data, it was found in section B that `K=2` corresponds to the optimal choice of the number of PCs.

```{r}
x <- pcadapt(filename, K = 2)
```

In addition to the number `K` of principal components to work with, the user can also set the parameter `min.maf` that corresponds to a threshold of minor allele frequency. By default, the parameter `min.maf` is set to `5%`. P-values of SNPs with a minor allele frequency smaller than the threshold are not computed (`NA` is returned).

The object `x` returned by the function `pcadapt` contains numerical quantities obtained after performing a PCA on the genotype matrix.

```{r ,eval=FALSE}
summary(x)
```

We assume in the following that there are `n` individuals and `L` markers.

- `scores` is a `(n,K)` matrix corresponding to the projections of the individuals onto each PC.

- `singular.values` is a vector containing the `K` ordered squared root of the proportion of variance explained by each PC.

- `loadings` is a `(L,K)` matrix containing the correlations between each genetic marker and each PC.

- `zscores` is a `(L,K)` matrix containing the $z$-scores.

- `af` is a vector of size `L` containing allele frequencies of derived alleles where genotypes of `0` are supposed to code for homozygous for the reference allele.

- `maf` is a vector of size `L` containing minor allele frequencies.

- `chi2.stat` is a vector of size `L` containing the rescaled statistics `stat/gif` that follow a chi-squared distribution with `K` degrees of freedom.

- `gif` is a numerical value corresponding to the genomic inflation factor estimated from `stat`.

- `pvalues` is a vector containing `L` p-values.

- `pass` A list of SNPs indices that are kept after exclusion based on the minor allele frequency threshold.

- `stat` is a vector of size `L` containing squared Mahalanobis distances by default.

All of these elements are accessible using the `$` symbol. For example, the p-values are contained in `x$pvalues`.

# D. Graphical tools

### D.1. Manhattan Plot

A Manhattan plot displays $-\text{log}_{10}$ of the p-values.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x , option = "manhattan")
```

### D.2. Q-Q Plot

The user can also check the expected uniform distribution of the p-values using a Q-Q plot

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "qqplot")
```

This plot confirms that most of the p-values follow the expected uniform distribution. However, the smallest p-values are smaller than expected confirming the presence of outliers.

### D.3. Histograms of the test statistic and of the p-values

An histogram of p-values confirms that most of the p-values follow an uniform distribution. The excess of small p-values indicates the presence of outliers.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
hist(x$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
```

The presence of outliers is also visible when plotting a histogram of the test statistic $D_j$.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x, option = "stat.distribution")
```

# E. Choosing a cutoff for outlier detection

To provide a list of outliers and choose a cutoff for outlier detection, there are several methods that are listed below from the less conservative one to the more conservative one.

## E.1. q-values

The `R` package [qvalue](http://www.bioconductor.org/packages/release/bioc/html/qvalue.html), transforms p-values into q-values. To install and load the package, type the following command lines:

```
## try http if https is not available
source("https://bioconductor.org/biocLite.R")
biocLite("qvalue")
library(qvalue)
```

For a given $\alpha$ (real valued number between $0$ and $1$), SNPs with q-values less than $\alpha$ will be considered as outliers with an expected false discovery rate bounded by $\alpha$.
The false discovery rate is defined as the percentage of false discoveries among the list of candidate SNPs. Here is an example of how to provide a list of candidate SNPs for the **geno3pops** data, for an expected false discovery rate lower than `10%`:

```
library(qvalue)
qval <- qvalue(x$pvalues)$qvalues
alpha <- 0.1
outliers <- which(qval < alpha)
length(outliers)
```

## E.2. Benjamini-Hochberg Procedure

```
padj <- p.adjust(x$pvalues,method="BH")
alpha <- 0.1
outliers <- which(padj < alpha)
length(outliers)
```

## E.3. Bonferroni correction

```
padj <- p.adjust(x$pvalues,method="bonferroni")
alpha <- 0.1
outliers <- which(padj < alpha)
length(outliers)
```

# F. Linkage Disequilibrium (LD) thinning

Linkage Disequilibrium can affect ascertainment of population structure ([Abdellaoui et al. 2013](https://www.nature.com/articles/ejhg201348)). When working with RAD-seq data, it should not be an issue. However, users analyzing dense data such as whole genome sequence data or dense SNP data should account for LD in their genome scans. 

In pcadapt, there is an option to compute PCs after SNP thinning. SNP thinning make use of two parameters: window size (default 200 SNPs) and $r^2$ threshold (default 0.1). The genome scan is then performed by looking at associations between all SNPs and PCs ascertained after SNP thinning.

We provide below an example of data analysis using simulated data where there is a LD. Genotype data are stored in a matrix (individuals in columns and SNPs in lines). An analysis without SNP thinning can be performed as follows
```{r}
path_to_file <- system.file("extdata", "SSMPG2017.rds", package = "pcadapt")
genotypes <- readRDS(path_to_file)
matrix <- read.pcadapt(genotypes, type = "pcadapt")
res <- pcadapt(matrix, K = 20)
plot(res, option = "screeplot")
```

Catell's rule indicates $K=4$.

```{r}
res <- pcadapt(matrix, K = 4)
plot(res)
```

Because the data have ben simulated, we know that outliers do not correspond to regions involved in adaptation but rather correspond to regions of low-recombination (high LD). 

To evaluate if LD might be an issue for your dataset, we recommend to display the loadings (contributions of each SNP to the PC) and to evaluate if the loadings are clustered in a single or several genomic regions.
```{r}
par(mfrow = c(2, 2))
for (i in 1:4)
  plot(res$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
```

Here, the top-left figure shows that PC1 is determined by a single genomic region, which is likely to be a region of strong LD.

We should therefore thin SNPs in order to compute the PCS.

```{r}
res <- pcadapt(matrix, K = 20, LD.clumping = list(size = 200, thr = 0.1))
plot(res, option = "screeplot")
```

After SNP thinning, we choose $K=2$. 

```{r}
res <- pcadapt(matrix, K = 2, LD.clumping = list(size = 200, thr = 0.1))
par(mfrow = c(1, 2))
for (i in 1:2)
  plot(res$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
```

The distribution of the loadings is now evenly distributed, so we can have a look at the genome scan, which correctly identifies regions involved in adaptation.

```{r}
plot(res)
```


# G. Detecting local adaptation with pooled sequencing data

For Pool-seq samples, the package also uses the `read.pcadapt` function. 

We assume that the user provides a matrix of relative frequencies with `n` rows and `L` columns (where `n` is the number of populations and `L` is the number of genetic markers). Assume your frequency file is called "foo" and is located in the directory "path_to_directory", use the following command lines:

```
pool.data <- read.table("path_to_directory/foo")
filename <- read.pcadapt(pool.data, type = "pool")
```

A Pool-seq example is provided in the package, and can be loaded as follows:

```{r,  eval = TRUE}
pool.data <- system.file("extdata", "pool3pops", package = "pcadapt")
filename <- read.pcadapt(pool.data, type = "pool")
```

With Pool-Seq data, the package computes again a Mahalanobis distance based on PCA loadings. 

By default, `pcadapt` function assumes that $K=n-1$. Smaller values of `K` can be provided by using argument `K`. Computation of Mahalanobis distances is performed as follows

```{r,  eval = TRUE}
res <- pcadapt(filename)
#The same as res <- pcadapt(filename,K=2)
summary(res)
```
`res' is a list containing the same elements than when using individual genotype data.

A scree plot can be obtained and be possibly used to reduce `K`. If the number of populations `n` is too small, it is impossible to use the scree plot to choose `K` and the default value of $K=n-1$ should be used.

```{r, eval = TRUE}
plot(res,option="screeplot")
```

A Manhattan plot can be displayed.
```{r, eval = TRUE}
plot(res, option = "manhattan")
```

A list of outliers can be obtained as for individual genotype data
```{r, eval = TRUE}
padj <- p.adjust(res$pvalues, method = "BH")
alpha <- 0.1
outliers <- which(padj < alpha)
length(outliers)
```
The function `get.pc`is also available for pooled-seq data (see H.3).

```{r, eval = TRUE}
get.pc(res,1:150)->aux
print(aux[,2])
```
Most of the 75 first SNPs are more related to the differentiation corresponding to the 1st PC whereas the following 75 SNPs are mostly related to the differentiation corresponding to the 2nd PC (in the simulations, these 2 sets of SNPs have been involved in adaptation occurring in different parts of the population divergence tree).

# H. Miscellaneous

### H.1. Missing values
 
The package accepts missing values. Computation of PCs and of P-values accounts for missing values. Missing values should be encoded by a single character (e.g. 9) different from 0, 1 or 2.

### H.2. Run `pcadapt` on matrices loaded in memory
 
`pcadapt` can be called on matrices loaded in memory instead of using a file. When calling the function `read.pcadapt`, users should specify if individuals are in columns, and SNPs in lines (type = "pcadapt") or if individuals are in lines, and SNPs in columns (type = `lfmm`).

```{r, eval=FALSE}
path_to_file <- system.file("extdata", "SSMPG2017.rds", package = "pcadapt")
genotypes <- readRDS(path_to_file)
print(dim(genotypes))
matrix <- read.pcadapt(genotypes, type = "pcadapt")
res <- pcadapt(matrix, K = 20)
plot(res, option = "screeplot")
```

### H.3. Association between PCs and outliers

It may be interesting to associate outliers with one of the `K` principal component to have indication about evolutionary pressure. The function `get.pc` allows to achieve that:

```
snp_pc <- get.pc(x, outliers)
```

### H.4 Component-wise genome scans

Another possibility is to perform one genome scan for each principal component (component-wise p-values). The test statistics are the loadings, which correspond to the correlations between each PC and each SNP. P-values are computed by making a Gaussian approximation for each PC and by estimating the standard deviation of the null distribution.

```{r}
path_to_file <- system.file("extdata", "geno3pops.lfmm", package = "pcadapt")
filename <- read.pcadapt(path_to_file, type = "lfmm")
x_cw <- pcadapt(filename, K = 2, method = "componentwise")
summary(x_cw$pvalues)
```

`pcadapt` returns `K` vectors of p-values (one for each principal component), all of them being accessible, using the `$` symbol or the `[]` symbol. For example, typing `x_cw$pvalues$p2` or `x_cw$pvalues[,2]` in the `R` console returns the list of p-values associated with the second principal component (provided that `K` is larger than or equal to `2`).

The p-values are computed based on the matrix of loadings. The loadings of the neutral markers are assumed to follow a centered Gaussian distribution. The standard deviation of the Gaussian distribution is estimated using the median absolute deviation.

To display the neutral distribution for the component-wise case, the value of `K` has to be specified.

```{r, fig.width = 7, fig.height = 5, fig.align = 'center'}
plot(x_cw, option = "stat.distribution", K = 2)
```

# Changelog

|Date       | Version          | Changes|
|--------   | ----   | -------------------------------------------------------|
|02-21-2018 | 4.0              | - `read.pcadapt` generates `bed` files instead of `pcadapt` files.|
|           |                  | - Computation of PCA is based on the package `RSpectra`.|
|           |                  | - Missing values are handled by specifying vector-matrix operation in `RSpectra` that accounts for missing values.|
|           |                  | - Includes LD thinning to compute PCs.|
|           |                  | - No more dependency to RcppArmadillo|
|           |                  | - For Pooled-seq data, use Mahalanobis distances based on PCA loadings, no more simulations of individuals genotypes|
|01-15-2017 | 3.1.0            | - Switch from C/Lapack to Rcpp/RcppArmadillo.|
|           |                  | - `pcadapt` can take genotype matrices as input.|
|           |                  | - Modified code for binomial sampling.|
|           |                  | - `pcadapt` argument `clean.files` is now deprecated.|
|           |                  | - `pcadapt` argument `output.filename` is now deprecated.|
|           |                  | - `read.pcadapt` argument `local.env` is now deprecated.|
|           |                  | - Latest update of `vcfR` taken into account.|
|12-20-2016 | 3.0.4            | - Method based on sampling genotypes added to handle pooled-sequencing.|
|10-06-2016 | 3.0.3            | - Option `type="vcfR"` has been added to `read.pcadapt` to overcome some conversion issues occuring with VCF files.|
|           |                   | - Argument `transpose` is now deprecated. Read section A for more details.|
| | |
|06-13-2016 | 3.0.2             | - The function `get.pc` has been added. For each SNP, it returns the most correlated principal component.|
| | |
|06-01-2016 | 3.0.1             | - The `read4pcadapt` is now deprecated, it is now called `read.pcadapt`.|
|           |                   | - Using the `pop` option when plotting scores now provides the color legend.|
| | |
|04-04-2016 | 3.0               | - All analyses are now included in the `R` package. Users should not use the `C` software PCAdapt fast anymore.|
|           |                   | - Big datasets can be handled directly within the `R` session.|
|           |                   | - `read4pcadapt` now converts files to the `pcadapt` format.|
|           |                   | - The first argument of `pcadapt` can be either a small genotype matrix or the output of `read4pcadapt`.|
| | |
|02-12-2016 | 2.2               | - The Mahalanobis distance is now estimated from the z-scores rather than the loadings.|
|           |                   | - Make sure you have downloaded the latest version of the C software PCAdapt (last updated on February, 11th, 2016).|
| | |
|01-05-2016 | 2.1.1             | - Bug fix: vignette header added.|
| | |
|12-18-2015 | 2.1               | - The scaling of the SNP before computing PCA has been changed. Instead of using standard deviation, we now use the squareroot of $p(1-p)$ (haploid data) or of $2p(1-p)$ (diploid data) where $p$ is the minimum allele frequency.|
|           |                   | - Bug fix: the genomic inflation factor has been corrected when `K=1`.|
|           |                   | - Bug fix: a problem due to high proportion of missing data slowing the program has been fixed.|
|           |                   | - Argument `"minmaf"` has been replaced with `"min.maf"`.|
| | |
|           | 2.0.1             | - Vignette corrected: when reading output from the software PCAdapt, we do not mention the deprecated argument `PCAdapt` in the function `pcadapt`.|
|           |                   | - Bug fix: an issue occurring when reading outputs from PCAdapt has been fixed.|
|           |                   | - We mention in the vignette how to use Pool-seq data with the C software PCAdapt fast.|
| | |
|           | 2.0               | - The default test statistic is not the communality statistic anymore but the Mahalanobis distance.|
|           |                   | - Test statistic for Pool-seq data.|

# Main reference

Luu, K., Bazin, E., & Blum, M. G. B. (2017). [pcadapt: an R package to perform genome scans for selection based on principal component analysis. Molecular Ecology Resources 17:67-77](http://biorxiv.org/content/early/2016/07/25/056135).