

</br>
<font size="12">PCA and population structure</font>




# Simple example of PCA and MDS

First let's perform PCA and MDS on the small genotype matrix from the slides. The code below reads the genotypes into R. 


In [None]:
#read in data from slides
G <-matrix(c(1,0,2,0,2,0,2,1,1,1,0,1,0,2,1,2,1,1,1,1,1,0,1,0,2,0,1,1,0,2,1,2,0,1,0),5,by=T,
           dimnames=list(paste0("IND",1:5),paste0("SNP",1:7)))
nInd <- nrow(G)

print(G)


In [None]:
## run the code to start a quiz
from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day3_PopulationStructure/pca_quiz1.json')


## MDS 

Let's try multidimensional scaling (MDS). First let's calculate the distances between individuals. The simple distance measure from the slides is called the Manhattan distance: the sum of the absolute genotype differences across SNPs.


In [None]:

## continue in R
D<-dist(G,upper=T,diag=T,method="manh")
D



 - How many dimensions are used to represent the distances?
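
To see what `dist` does with `method="manhattan"`, the sketch below computes the distance between the first two individuals by hand and compares it with the entry from the distance matrix. It re-enters the genotype matrix so it can be run on its own; the name `d12` is only used for this illustration.

```r
# Genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

# Manhattan distance between IND1 and IND2:
# sum of the absolute genotype differences over all SNPs
d12 <- sum(abs(G["IND1",] - G["IND2",]))

# The same pair taken from the full distance matrix
D <- as.matrix(dist(G, method="manhattan"))
c(byHand=d12, fromDist=D["IND1","IND2"])
```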

Now let's reduce the number of dimensions to 2 using MDS and plot the results:

In [None]:
 #perform MDS to 2 dimensions
k2<-cmdscale(D,k=2)

cat("\n Dimension reduction to two dimensions")
k2
cat("\n Original distance between individuals:")
dist(G,upper=T,diag=T,method="manha")
cat("\n Distance between individuals from the MDS:")
round(D_k2<- dist(k2,upper=T,diag=T,method="manha"),2)


 - What is the biggest difference between the original distances and the projected?
 
 Let's plot the results

In [None]:
#plot the results
options(repr.plot.width=10, repr.plot.height=10)
plot(k2,pch=16,cex=3,col=1:5+1,ylab="distance 2nd dimension",xlab="distance 1st dimension",main="Multidimensional scaling (MDS)")
points(k2,pch=as.character(1:5), col="white")


 - Can you find any difference between the pairwise distances in the plot and the original pairwise distances?
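
If you want to check your answer numerically, note that classical MDS (`cmdscale`) places the points so that the Euclidean distances between them approximate the input distances. The sketch below is one way to look at this, assuming you ask `cmdscale` for its eigenvalue output with `eig=TRUE`; the names `fit` and `D_mds` are only used here.

```r
# Genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

D <- dist(G, method="manhattan")

# eig=TRUE returns a list with the coordinates (points) and
# two goodness-of-fit values (GOF) for the chosen number of dimensions
fit <- cmdscale(D, k=2, eig=TRUE)
fit$GOF

# Euclidean distances between the points in the 2-dimensional plot
D_mds <- dist(fit$points)

# Difference between the original and the projected distances
round(as.matrix(D) - as.matrix(D_mds), 2)
```

GOF values close to 1 mean that two dimensions capture most of the structure in the distance matrix.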

## PCA
First let's perform PCA directly on the normalized genotypes without calculating the covariance matrix. We use the normalization $\tilde{G}_{ij}=\frac{G_{ij}-2f_j}{\sqrt{2f_j(1-f_j)}}$ where $j$ is the site (SNP) and $i$ is the individual. $f_j=\frac{\sum_{i=1}^n G_{ij}}{2n}$ is the allele frequency and $n$ is the number of individuals.  
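
As a worked example of the formula, the sketch below normalizes a single SNP (SNP1) step by step. It re-enters the genotype matrix so it can be run on its own; the names `g`, `f` and `gtilde` are only used for this illustration.

```r
# Genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

n <- nrow(G)              # number of individuals
g <- G[,"SNP1"]           # genotypes for one SNP

# Allele frequency: allele count divided by the 2n chromosomes
f <- sum(g)/(2*n)

# Subtract the mean genotype (2f) and divide by sqrt(2f(1-f)),
# the expected standard deviation under Hardy-Weinberg proportions
gtilde <- (g - 2*f)/sqrt(2*f*(1-f))

f
round(gtilde, 3)
mean(gtilde)              # zero up to rounding error
```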

 - Why do we normalize the genotypes?
 
  

In [None]:
 #first normalize the data so that the mean and variance are the same for each SNP
normalize <- function(x){
    nInd <- nrow(x)
    avg <- colMeans(x)                  # mean genotype per SNP = 2 * allele frequency
    M <- x - rep(avg,each=nInd)         # center each SNP (column)
    M <- M/sqrt(2*rep(avg/2*(1-avg/2),each=nInd))  # scale by sqrt(2f(1-f)), f = avg/2
    M
 }


cat("Original genotypes\n")
print(G)

cat("\n Normalized genotypes\n")
Gtilde <- normalize(G)
print(Gtilde)

cat("\n Dimension of G-tilde (tilde = ~)")
dim(Gtilde)


cat("\n SVD of G-tilde\n")
 svd <- svd(Gtilde)
 ## print the decomposition Gtilde = U D V^T
 ## u holds the left singular vectors (one row per individual)
 ## d holds the singular values
 print(svd)


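The above is the singular value decomposition (SVD) of the normalized genotypes into the singular values (d) and the left (u) and right (v) singular vectors such that
$\tilde{G}=U\Sigma V^T$
where $\Sigma$ is the diagonal matrix with the values of d on its diagonal. The columns of $U$ are the eigenvectors of $\tilde{G}\tilde{G}^T$ and the squared values of d are the corresponding eigenvalues. 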
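
As a quick check of the decomposition, the sketch below rebuilds $\tilde{G}$ from the three parts and compares the squared values of d with the eigenvalues of $\tilde{G}\tilde{G}^T$. It recomputes everything from the toy matrix so it can be run on its own, and it uses `sweep` for the normalization instead of the `normalize` function; the names `s`, `recon` and `ev` are only used here.

```r
# Genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

# Same normalization as in the exercise: center by 2f, scale by sqrt(2f(1-f))
f <- colMeans(G)/2
Gtilde <- sweep(sweep(G, 2, 2*f, "-"), 2, sqrt(2*f*(1-f)), "/")

s <- svd(Gtilde)

# U %*% Sigma %*% t(V) should give back Gtilde
recon <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(recon - Gtilde))   # largest deviation, zero up to rounding error

# Squared singular values next to the eigenvalues of Gtilde %*% t(Gtilde)
ev <- eigen(Gtilde %*% t(Gtilde), symmetric=TRUE)$values
cbind(d_squared=s$d^2, eigenvalue=ev)
```

Because every SNP has been centered, the last singular value is zero up to rounding error: five centered individuals span at most four dimensions.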

PCA plots in genetics often show either the U matrix or $U\Sigma$. 


In [None]:
plot(svd$u[,1:2],pch=16,cex=3,col=1:5+1,ylab="2. PC",xlab="1. PC", 
     main="Principal component analysis (PCA) U matrix")
points(svd$u[,1:2],pch=as.character(1:5), col="white")

cat("SIGMA (diagonal matrix)")
SIGMA <-  diag(svd$d)
plot(svd$u[,1:2]%*%SIGMA[1:2,1:2],pch=16,cex=3,col=1:5+1,ylab="2. PC",xlab="1. PC", 
     main="Principal component analysis (PCA) UΣ")
points(svd$u[,1:2]%*%SIGMA[1:2,1:2],pch=as.character(1:5), col="white")


 - Compare the two plots. What is the difference? 

 - Compare MDS vs. PCA. Are they capturing the same thing? 

Bonus information:

Unlike MDS, PCA does not remove information as long as you keep all the principal components, so you are actually able to reconstruct your covariance matrix from the principal components.
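
The sketch below illustrates this on the toy matrix, assuming the covariance matrix between individuals is defined as $\tilde{G}\tilde{G}^T/M$ with $M$ SNPs. It recomputes the normalized genotypes so it can be run on its own; the names `C`, `C_all` and `C_top2` are only used for this illustration.

```r
# Genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

# Normalize: center by 2f, scale by sqrt(2f(1-f))
f <- colMeans(G)/2
Gtilde <- sweep(sweep(G, 2, 2*f, "-"), 2, sqrt(2*f*(1-f)), "/")

M <- ncol(G)
C <- Gtilde %*% t(Gtilde) / M     # covariance between individuals (assumed definition)

s <- svd(Gtilde)

# Reconstruction from all PCs: U Sigma^2 U^T / M
C_all  <- s$u %*% diag(s$d^2) %*% t(s$u) / M

# Reconstruction from the top 2 PCs only: an approximation
C_top2 <- s$u[,1:2] %*% diag(s$d[1:2]^2) %*% t(s$u[,1:2]) / M

c(errorAllPCs=max(abs(C - C_all)), errorTop2=max(abs(C - C_top2)))
```

With all PCs the error is zero up to rounding, while the top 2 PCs only give an approximation of the covariance matrix.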

# PCA for wildebeest

Again we will use data from the Blue Wildebeest. To simplify the analysis we have included only one of the Brindled populations (B-Etosha). 

<img src="https://raw.githubusercontent.com/popgenDK/popgenDK.github.io/gh-pages/images/slider/wildeBeastMap.png" alt="image info" />


#  PCA for low depth sequencing using PCAngsd 


# Software and data



### Software
We will be using plink and PCAone for this exercise. First let's check that the software is installed and get the data.

In [None]:
echo --programs that are installed:--
which plink
which PCAone

#make folder if it does not exist already
mkdir -p ~/kenya2024/
mkdir -p ~/kenya2024/admixture

# enter folder
cd ~/kenya2024/admixture


# make links to files and add them to the folder
cp -sf /davidData/data/course/kenyaWorkshop/anders/structure_day3/blue_wildebeest_thin* .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/multiRunK7 .

echo --- files in folder ---
ls 


Let's perform PCA on the whole data set (without LD pruning). 

We will use PCAone first:



In [None]:
PCAone

Running PCAone without arguments shows the options. To run the PCA use the following command:

In [None]:

PCAone  -b blue_wildebeest_thin -o blue_wildebeest_thin

 - Look at the above output. How many SNPs and how many individuals are there?
 
 The default is to calculate the top 10 PCs. If you want more you can use the option `--pc <INT>` to choose a different number. However, let's see what the top PCs capture. 
    
First let's look at the first two PCs as well as the admixture proportions


In [None]:
options(repr.plot.width=12, repr.plot.height=12)

layout(matrix(c(1,1,2,3),nrow=2,by=T),height=c(2,4),width=2:1)
#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")


# Read in inferred admixture proportions
q <- read.table("~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.Q_4")

#read in the population labels (first column of fam file)
tab <- table(pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1])

#make the plot. 
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=c(3,5,8,4,2,6,7))


pca <- read.table("~/kenya2024/admixture/blue_wildebeest_thin.eigvecs")
#layout(matrix(1:2,nrow=1),w=c(4,2))
plot(pca[,1:2],col=as.integer(as.factor(pop))+1,ylab=paste("PC",2),xlab=paste("PC",1),cex.lab=1.5,cex=2,lwd=8)
plot.new()
legend("top",legend=names(tab),bty="n",xpd=T,cex=2,text.col=1:length(tab)+1,text.font=2)


 - What information do you get from the PCA that you don't get from the ADMIXTURE results?
 - Can you identify the admixed individuals?
 
 
 Let's see what the other PCs show. 
 

In [None]:
options(repr.plot.width=12, repr.plot.height=16)

par(mfrow=c(3,2))
for(pc in 0:4)
    plot(pca[,pc*2+1:2],col=as.integer(as.factor(pop))+1,ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8)


 - How many PCs are used to separate the populations?
 - What do you think is captured on PC 7 and 8?
 - What do you think is captured on PC 9 and 10?
 

## Bonus exercise if there is time


Let's try to run the PCA after pruning LD (linkage disequilibrium) from the data. The noLD data was created in the admixture exercise. 
Run the PCA:

In [None]:
PCAone  -b  blue_wildebeest_noLD -o blue_wildebeest_noLD


We can start by comparing the eigenvalues. These are proportional to the variance explained, so a higher value means that the corresponding PC captures more of the variation in the data. 
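
To make "proportional to the variance explained" concrete, here is a small sketch on the toy matrix from the first part of the exercise (not on the wildebeest data). It assumes the eigenvalues are those of $\tilde{G}\tilde{G}^T/M$ with $M$ SNPs; the names `ev` and `propVar` are only used for this illustration.

```r
# Toy genotype matrix from the slides (5 individuals x 7 SNPs)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow=5, byrow=TRUE,
            dimnames=list(paste0("IND",1:5), paste0("SNP",1:7)))

# Normalize: center by 2f, scale by sqrt(2f(1-f))
f <- colMeans(G)/2
Gtilde <- sweep(sweep(G, 2, 2*f, "-"), 2, sqrt(2*f*(1-f)), "/")

# Eigenvalues of the covariance matrix between individuals,
# obtained as squared singular values divided by the number of SNPs
ev <- svd(Gtilde)$d^2 / ncol(G)

# Share of the total variance carried by each PC
propVar <- ev / sum(ev)
round(propVar, 3)
```

The same division applied to a file with only the top 10 eigenvalues would give shares relative to those 10 PCs, not to the total variance.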

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

eigen <- scan("~/kenya2024/admixture/blue_wildebeest_thin.eigvals")
eigenNoLD <- scan("~/kenya2024/admixture/blue_wildebeest_noLD.eigvals")

barplot(rbind(eigen,eigenNoLD),beside=T,col=2:3,legend=c("With LD","Without LD"),
       ylab="eigenvalues",xlab="PC 1:10")

 - Which data set captures the most information about the population structure in the top PCs?
 
 Let's plot the PCs from the two data sets

In [None]:
options(repr.plot.width=12, repr.plot.height=32)

tab <- table(pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1])


pcaNoLD <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.eigvecs")


par(mfrow=c(5,2))
for(pc in 0:4){
 
    plot(pca[,pc*2+1:2],col=as.integer(as.factor(pop))+1,ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8,main="with LD")
    plot(pcaNoLD[,pc*2+1:2],col=as.integer(as.factor(pop))+1,ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8,main="No LD")

    
    }

 - Which of the PCs from the previous analysis capture LD and not population structure?
 - Do you think it is better to perform PCA with or without LD pruning?