# Population structure

With the advent of SNP data, it is possible to infer the genetic distance between individuals or populations precisely. As written in the book, one way of doing this is to compare the genotype at each SNP in each individual against every other individual. This comparison produces the so-called covariance matrix, which in genetic terms reflects the number of polymorphisms shared between individuals.
There are many ways to visualize these data. In this tutorial you will use `Principal Component Analysis` (PCA) and the `Admixture` software.
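
As a minimal sketch of the idea above, the following base R code builds a covariance matrix between individuals from a small genotype matrix (0, 1 or 2 copies of the alternative allele) and extracts its eigenvectors. The genotypes and sample names are invented for illustration, and the standardisation shown is one common choice, not necessarily the exact one used by `SNPRelate`.

```r
# Toy genotypes: rows = individuals, columns = SNPs,
# values = copies of the alternative allele (invented for illustration).
geno <- matrix(c(0, 0, 1, 2, 2, 1,
                 0, 1, 0, 2, 2, 2,
                 2, 2, 1, 0, 0, 0,
                 2, 1, 2, 0, 1, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("ind1", "ind2", "ind3", "ind4"),
                               paste0("snp", 1:6)))

# Centre and scale each SNP by its allele frequency (one common standardisation).
p <- colMeans(geno) / 2
z <- sweep(geno, 2, 2 * p, "-")
z <- sweep(z, 2, sqrt(2 * p * (1 - p)), "/")

# Covariance between individuals, averaged over SNPs.
covmat <- z %*% t(z) / ncol(geno)

# The eigenvectors of this matrix give the principal components.
eig <- eigen(covmat, symmetric = TRUE)
round(covmat, 2)
round(eig$vectors[, 1:2], 2)
```

Individuals with similar genotypes (here `ind1` and `ind2`) get a positive covariance, while dissimilar ones (here `ind1` and `ind3`) get a negative one.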

We will use the R package `SNPRelate`, which can easily handle vcf files and perform the PCA. To explore the package's functionality further, see the documentation [here](https://www.rdocumentation.org/packages/SNPRelate/versions/1.6.4).

## How to make this notebook work

* In this notebook we will use both `bash` command-line commands and `R` to set up the folders and run the analysis.
* Because the notebook switches between two languages, you need to choose a kernel every time the language changes. A kernel contains a programming language and the packages needed to run the course material. To choose a kernel, go to the menu at the top of the page, select `Kernel --> Change Kernel`, and pick the one you need. Throughout the notebook, a picture tells you when to change kernel. The two pictures are shown below:

<img src="Images/bash.png" alt="Bash" width="80"> Shift to the `Bash` kernel

<img src="Images/R.png" alt="R" width="80"> Shift to the `R` kernel


## Learning outcomes

At the end of this tutorial you will be able to:

- **Extract** information from a `vcf` file and create a PCA projection
- **Look at the effect of** LD pruning to reveal population structure

## Setting up folders

Here we set up a link to the `Data` folder and create the `Results` folder.

<img src="Images/bash.png" alt="Bash" width="80"> Choose the `Bash` kernel

In [None]:
ln -sf ../Data
mkdir -p Results

In [None]:
unzip Data/hapmap.zip -d  Data

# PCA from VCF data file

<img src="Images/R.png" alt="R" width="80"> Shift to the `R` kernel

In [None]:
library(SNPRelate)
library(ggplot2)

## Import data and calculate PCA

We read the metadata about the samples (geographic locations) and transform the `vcf` file into `gds` format, from which the package `SNPRelate` can calculate the PCA projection of the data.

In [None]:
# Reading the metadata information
info = read.csv("Data/sample_infos_accessionnb.csv", header = TRUE, sep = ';')

# Setting the directory of the VCF file 
vcf.fn <- "Data/Allvariants_135_145_chr2.vcf"

# Transforming the vcf file to gds format
snpgdsVCF2GDS(vcf.fn, "Data/Allvariants_135_145_chr2_2.gds", method="biallelic.only")

# Open the GDS file and calculate the PCA
genofile <- snpgdsOpen("Data/Allvariants_135_145_chr2_2.gds", readonly = FALSE, allow.duplicate = TRUE, allow.fork = TRUE)
pca <- snpgdsPCA(genofile)
summary(pca)

**Q.1** How many individuals and SNPs does this dataset have? What are an eigenvector and an eigenvalue?

The `pca` object just created is a list containing various elements.

In [None]:
names(pca)

We use `pca$eigenvect` to plot the PCA. We also use `pca$sample.id` to match the samples in `pca` with the geographic locations in the metadata.

In [None]:
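# Keep the first five principal components and name the columns PC1..PC5
eigenvectors = as.data.frame(pca$eigenvect[,1:5])
colnames(eigenvectors) = sprintf("PC%s", seq_len(ncol(eigenvectors)))
# Strip the file-name suffix so the sample IDs match the metadata
pca$sample.id = sub("_chr2_piece_dedup", "", pca$sample.id)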

# Matching the sample names with their origin and population
rownames(info) <- info[,"ENA.RUN"]
eigenvectors <- cbind(eigenvectors, info[pca$sample.id, c("population","region")])


In the end, we have created a table called `eigenvectors` containing the PCA coordinates and some metadata.

In [None]:
head(eigenvectors)
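
The row-name matching used to attach the metadata can be sketched on toy data. The accession IDs, population labels and the extra sample `ERR004` below are made up. The point is that indexing a data frame by row names returns rows in the order of the requested IDs, and silently yields `NA` rows for IDs that are not in the metadata.

```r
# Hypothetical metadata table; the accession IDs and labels are made up.
info_toy <- data.frame(ENA.RUN = c("ERR003", "ERR001", "ERR002"),
                       population = c("PopC", "PopA", "PopB"),
                       region = c("Region2", "Region1", "Region1"))
rownames(info_toy) <- info_toy$ENA.RUN

# Sample IDs as they might appear in a PCA object, with a suffix to strip.
ids <- c("ERR001_chr2_piece_dedup", "ERR002_chr2_piece_dedup", "ERR004_chr2_piece_dedup")
ids <- sub("_chr2_piece_dedup", "", ids)

# Row-name indexing returns rows in the order of `ids`, not the order of the table.
matched <- info_toy[ids, c("population", "region")]
matched

# IDs absent from the metadata produce NA rows, so it is worth checking for them.
missing_ids <- ids[!ids %in% rownames(info_toy)]
missing_ids
```

Running a check like the last line on the real `pca$sample.id` is a cheap way to confirm that every sample found its metadata.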

Let's first look at how much of the variance of the data is explained by each eigenvector:

In [None]:
# Variance proportion:
pca$pca_percent <- pca$varprop*100
ggplot( NULL, aes(x=seq(1, length(pca$eigenval)), y=pca$pca_percent, label=sprintf("%0.2f", round(pca$pca_percent, digits = 2))) ) +
        geom_line() + geom_point() + 
        geom_text(nudge_y = .3, nudge_x = 1.5, check_overlap = T)
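
To make the quantity on the y-axis concrete, here is a small sketch with invented eigenvalues. It assumes that the proportion of variance explained by a PC is its eigenvalue divided by the sum of all eigenvalues, which is the usual definition; the numbers are not taken from the dataset.

```r
# Made-up eigenvalues, sorted from largest to smallest as a PCA returns them.
eigenval <- c(4.0, 2.5, 1.5, 1.0, 0.6, 0.4)

# Each PC explains its eigenvalue's share of the total.
varprop <- eigenval / sum(eigenval)
pca_percent <- varprop * 100

data.frame(PC = seq_along(eigenval), percent = round(pca_percent, 2))
```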

**Q.2** How many PCs do we need in order to explain 50% of the variance of the data? Can you make a cumulative plot of the variance explained by the PCs?

## Visualization

We now plot the first two principal components, labelling each point by population and coloring it by region. Only the African samples separate from the rest; the EastAsia and WestEurasia samples overlap and cannot be told apart on these two components.

In [None]:
ggplot(data = eigenvectors, aes(x = PC1, y = PC2, col = region)) + 
        geom_point(size=3,alpha=0.5) + geom_text( aes(label=population), col="black") +
        scale_color_manual(values = c("#FF1BB3","#A7FF5B","#99554D")) +
        theme_bw()

**Q.3** Try to plot PC2 and PC3. Do you see the same patterns? What is the correlation between PC2 and PC3 (hint: use the function `cor()`)?

**Q.4** Try also to color the graph based on population. What do you observe?

## LD pruning

We apply LD pruning to remove SNPs that are in high linkage disequilibrium (LD) with each other, which changes the population structure recovered by the PCA. Note that, according to ([Bergovich et al, 2024, Biorxiv](https://www.biorxiv.org/content/10.1101/2024.05.02.592187v1)), this is not good practice and removes a lot of the population information.

In [None]:
set.seed(1000)

# Prune the SNPs so that the remaining ones have pairwise LD below a threshold of 0.3
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.3)

# Get all selected snp's ids
snpset.id <- unlist(snpset)

pca_pruned <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2)

# Add metadata (using the pruned PCA object throughout)
eigenvectors = as.data.frame(pca_pruned$eigenvect)
colnames(eigenvectors) = sprintf("PC%s", seq_len(ncol(eigenvectors)))
pca_pruned$sample.id = sub("_chr2_piece_dedup", "", pca_pruned$sample.id)
eigenvectors <- cbind(eigenvectors, info[pca_pruned$sample.id, c("population","region")])

#plot
ggplot(data = eigenvectors, aes(x = PC3, y = PC2, col = region, label=population)) + 
        geom_text(hjust=1, vjust=0, angle=45) +
        geom_point(size=3,alpha=0.5) +
        scale_color_manual(values = c("#FF1BB3","#A7FF5B","#99554D")) +
        theme_bw() + coord_flip()
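
The idea behind LD pruning can be illustrated with a simplified sketch in base R. This is not the algorithm used by `snpgdsLDpruning` (which works within a sliding window along the chromosome); it is a plain greedy filter, `prune_toy` is a hypothetical helper, the genotypes are invented, and the squared correlation between genotype dosages is used as a stand-in LD measure.

```r
# Toy genotype matrix (rows = individuals, columns = SNPs); values are invented.
geno <- cbind(snp1 = c(0, 1, 2, 0, 1, 2, 0, 2),
              snp2 = c(0, 1, 2, 0, 1, 2, 1, 2),   # nearly identical to snp1
              snp3 = c(2, 0, 1, 1, 0, 2, 0, 1),
              snp4 = c(1, 1, 0, 2, 2, 0, 1, 1))

# Squared correlation between genotype dosages, used here as a simple LD measure.
r2 <- cor(geno)^2

# Greedy pruning: keep a SNP only if its r2 with every SNP
# already kept is below the threshold.
prune_toy <- function(r2, threshold) {
  keep <- character(0)
  for (snp in colnames(r2)) {
    if (length(keep) == 0 || all(r2[snp, keep] < threshold)) {
      keep <- c(keep, snp)
    }
  }
  keep
}

kept <- prune_toy(r2, threshold = 0.3)
kept
```

With a threshold of 0.3 this toy example keeps `snp1` and `snp3`; a looser threshold of 0.5 also retains `snp4`, which shows how the threshold controls the number of SNPs that survive.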

**Q.5** Implement different LD thresholds (0.1, 0.2, 0.3, 0.4, 0.5). How many SNPs are left after each filtering threshold? Are these SNPs linked?

Now we convert the pruned SNPs from the GDS file into PLINK format, to be used later in the admixture exercise:

In [None]:
# Write the .bed/.bim/.fam files to Results/, where the admixture commands expect them
snpgdsGDS2BED(genofile, "Results/chr2_135_145_flt_prunned.gds", sample.id=NULL, snp.id=snpset.id, snpfirstdim=NULL, verbose=TRUE)

Save the data for later

In [None]:
save(pca, pca_pruned, info, genofile, file = "Results/data.Rdata")

# Admixture Estimation

<img src="Images/bash.png" alt="Bash" width="80"> Choose the `Bash` kernel

`Admixture` is a program for estimating ancestry in a model-based manner from SNP genotype datasets in which the individuals are unrelated. The software requires its input in binary PLINK (`.bed`) format, which is why we converted our data into `.bed`.

Now, with the adjusted format and pruned SNPs, we are ready to run the admixture analysis. We believe that our individuals are derived from three ancestral populations:

In [None]:
admixture Results/chr2_135_145_flt_prunned.gds.bed 3
#move output files with other results
mv chr2_135_145_flt_prunned.gds.3.P chr2_135_145_flt_prunned.gds.3.Q  Results/

**Q.6** Have a look at the Fst between populations, which is printed in the output. Can you guess which populations Pop0, Pop1 and Pop2 refer to?

After running admixture, two output files are generated:

- `Q`: the ancestry fractions

- `P`: the allele frequencies of the inferred ancestral populations
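
How the two outputs relate can be sketched with made-up numbers. The example assumes `Q` has one row per individual and one column per ancestral population, and `P` has one row per SNP and one column per ancestral population. Under the admixture model, an individual's expected allele frequency at a SNP is the mixture of the ancestral frequencies weighted by its ancestry fractions.

```r
# Made-up ancestry fractions for 3 individuals and K = 2 ancestral populations.
Q <- matrix(c(1.0, 0.0,
              0.5, 0.5,
              0.2, 0.8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("ind1", "ind2", "ind3"), c("Pop0", "Pop1")))

# Made-up allele frequencies: one row per SNP, one column per ancestral population.
P <- matrix(c(0.9, 0.1,
              0.3, 0.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("snp1", "snp2"), c("Pop0", "Pop1")))

# Each individual's ancestry fractions sum to 1.
rowSums(Q)

# Expected allele frequency of each individual at each SNP:
# ancestral frequencies weighted by the ancestry fractions.
expected_freq <- Q %*% t(P)
expected_freq
```

An unadmixed individual such as `ind1` simply inherits the frequencies of its single ancestral population, while `ind2` sits halfway between the two.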

Sometimes we have no prior knowledge about `K`. A good way of choosing the best `K` is the cross-validation procedure implemented in admixture, which is run as follows:

In [None]:
for K in 1 2 3 4 5
do 
    admixture --cv Results/chr2_135_145_flt_prunned.gds.bed $K | tee log${K}.out
    mv chr2_135_145_flt_prunned.gds.$K.* log$K.out Results/
done

Have a look at the Cross Validation error of each `K`:

In [None]:
grep -h CV Results/log*.out

Save it in a text file:

In [None]:
grep -h CV Results/log*.out > Results/CV_logs.txt

<img src="Images/R.png" alt="R" width="80"> Shift to the `R` kernel

Look at the distribution of CV error.

In [None]:
library(ggplot2)

CV = read.table('Results/CV_logs.txt')

p <- ggplot(data = CV, aes(x = V3, y = V4, group = 1)) + 
    geom_line() + geom_point() + theme_bw() + 
    labs(x = 'Number of clusters', y = 'Cross validation error')

p
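
In the table read above, the `K` column is text such as `(K=3):`. If a numeric `K` is preferred, for example so that the axis sorts correctly once `K` reaches 10, the values can be parsed as in this sketch. The lines follow the `CV error (K=...): value` format printed by `admixture --cv`; the error values are invented.

```r
# Example lines in the format printed by `admixture --cv`; the error values are invented.
cv_lines <- c("CV error (K=10): 0.61", "CV error (K=2): 0.55", "CV error (K=3): 0.52")

# Pull out K and the error as numbers.
K <- as.integer(sub(".*K=([0-9]+).*", "\\1", cv_lines))
cv_error <- as.numeric(sub(".*: ", "", cv_lines))

# Sort by K so the x-axis runs in numeric order.
CV_toy <- data.frame(K = K, cv_error = cv_error)
CV_toy <- CV_toy[order(CV_toy$K), ]
CV_toy
```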

**Q.7** What do you understand by cross-validation error? Based on this graph, what is the best `K`?

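Next we plot the `Q` estimates. Choose the `K` that makes the most sense to you and substitute it in the file name on the first line of code (right now it is `K=3`); the ordering columns, legend labels and number of colors must match that `K`.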

In [None]:
tbl = read.table("Results/chr2_135_145_flt_prunned.gds.3.Q")
ord = tbl[order(tbl$V1,tbl$V2,tbl$V3),]
bp = barplot(t(as.matrix(ord)), legend.text = c("African", "2", "3"),
            space = c(0.2),
            col=rainbow(3),
            xlab="Population #", 
            ylab="Ancestry",
            border=NA,
            las=2)

Note: Here we order the X-axis by the proportion of the first population component. However, in the HapMap data all individuals show some portion of this component, and individuals are more admixed in general, i.e. they are no longer explained mostly by one component. That ordering is therefore no longer useful for interpreting the plot. Instead, we should keep the original order, since the files are originally ordered by population, and plot each population on the X-axis. This can be achieved with something like the following:

In [None]:
library(dplyr)
library(ggplot2)
load("Results/data.Rdata")

#resize plot
options(repr.plot.width = 12, repr.plot.height = 8)

K=3

tbl = read.table( paste("Results/chr2_135_145_flt_prunned.gds.",K,".Q", sep="") )

origin <- rep( paste( rep("Pop", K), as.character(c(0:(K-1))), sep="" ) , each=dim(tbl)[1] )
population <- info[ pca$sample.id, ]$population
region <- info[ pca$sample.id, ]$region
regpop <- make.unique( paste(region, population) )

tbl <- as.data.frame(unlist(tbl))
colnames(tbl) <- 'Admixture_fraction'
tbl['origin'] = origin
tbl['population'] = rep(population,K)
tbl['region'] = rep(region, K)
tbl['region_population'] = rep(regpop, K)

ggplot(tbl, aes(fill=origin, y=Admixture_fraction, x=region_population)) +
    geom_bar(position="stack", stat="identity") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
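
The reshaping step in the cell above relies on the order in which `unlist()` stacks the `Q` table. A toy sketch (made-up fractions, hypothetical population labels) shows why the component labels use `each` while the per-individual labels are repeated `K` times:

```r
# Made-up Q table: 3 individuals (rows) by K = 2 components (columns).
K <- 2
tbl_toy <- data.frame(V1 = c(0.9, 0.4, 0.1), V2 = c(0.1, 0.6, 0.9))
population <- c("PopA", "PopB", "PopC")

# unlist() stacks the columns one after another: all of V1, then all of V2.
long <- data.frame(Admixture_fraction = unlist(tbl_toy, use.names = FALSE))

# Component labels therefore repeat in blocks (each = n),
# while per-individual labels cycle through the samples K times.
long$origin <- rep(paste0("Pop", 0:(K - 1)), each = nrow(tbl_toy))
long$population <- rep(population, times = K)
long
```

Each individual's fractions still sum to 1 in the long table, which is a quick sanity check that the labels were attached in the right order.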

**Q.8** How many clusters do you identify in this plot? Does that agree with what was found using PCA?

In the following part of this exercise you will do both analyses (PCA and Admixture) using a different dataset. The data come from the HAPMAP Consortium; to learn more about the populations studied in this project access here. The vcf file `hapmap.vcf`, an information file `relationships_w_pops_121708.txt`, and `.bim`, `.bed`, `.fam` files (only to be used if you get stuck during the exercise) are available for the admixture analysis. This dataset is placed here:

`Data/assignment`