# Setup environment

In [None]:
COURSE_PATH=/course/popgen25
DATA_PATH=$COURSE_PATH/pca
SOFTWARE_PATH=$COURSE_PATH/software

echo --programs that are installed:--
type plink
type PCAone


#make folder 
FOLDER=~/popgen25_pca
echo -e "\n--creating folder-- "
echo $FOLDER
mkdir -p $FOLDER

# enter folder
cd $FOLDER

# make symlinks to the current folder and the data folder
ln -sfn $FOLDER ~/current_folder
ln -sfn $DATA_PATH ~/data_folder

# Simple example of PCA and MDS

First let's try to perform PCA and MDS on the small matrix from the slides. The code below reads the genotypes into R.


In [None]:
#read in data from slides (5 individuals in rows, 7 SNPs in columns)
G <-matrix(c(1,0,2,0,2,0,2,1,1,1,0,1,0,2,1,2,1,1,1,1,1,0,1,0,2,0,1,1,0,2,1,2,0,1,0),5,byrow=TRUE,
           dimnames=list(paste0("IND",1:5),paste0("SNP",1:7)))
nInd <- nrow(G)

print(G)

In [None]:
import os
os.chdir(os.path.expanduser("~/data_folder"))

## run the code to start a quiz
from jupyterquiz import display_quiz
display_quiz('pca_quiz1.json')


## MDS 

Let's try to do MDS. First, calculate the pairwise distances between the individuals. The simple distance measure shown in the slides is called the Manhattan distance.
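As a small sketch of what the Manhattan distance means, the distance between two individuals can be computed by hand as the sum of the absolute genotype differences across SNPs. The helper name `manhattan_by_hand` is made up for this illustration; the genotype values are the ones from the slides.

```r
# Genotype matrix from the slides (same values as G in this notebook)
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow = 5, byrow = TRUE,
            dimnames = list(paste0("IND", 1:5), paste0("SNP", 1:7)))

# Manhattan distance between two individuals:
# the sum of absolute genotype differences over all SNPs
d12 <- sum(abs(G["IND1", ] - G["IND2", ]))
d12

# All pairwise distances by hand (manhattan_by_hand is a hypothetical helper)
manhattan_by_hand <- function(x) {
  n <- nrow(x)
  out <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in 1:n) for (j in 1:n) out[i, j] <- sum(abs(x[i, ] - x[j, ]))
  out
}
manual <- manhattan_by_hand(G)

# Should agree with R's built-in dist()
all.equal(manual, as.matrix(dist(G, method = "manhattan")))
```

The explicit double loop is slow for large data, but it makes the definition visible; `dist()` is what you would use in practice.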


In [None]:

## continue in R
D<-dist(G,upper=T,diag=T,method="manhattan")
D



 - How many dimensions are used to represent the distances?

Now let's reduce the number of dimensions to 2 using MDS and plot the results:

In [None]:
k2<-cmdscale(D,k=2)

cat("\n Dimension reduction to two dimensions")
k2
cat("\n Original distance between individuals:")
org <- dist(G,upper=T,diag=T,method="manhattan")
org 

cat("\n Distance between individuals from the MDS:")
round(D_k2<- dist(k2,upper=T,diag=T),2)




In [None]:
#plot the results
 plot(k2,pch=16,cex=3,col=1:5+1,ylab="distance 2nd dimension",
      xlab="distance 1st dimension",main="Multidimensional scaling (MDS)")
 points(k2,pch=as.character(1:5))



 - Can you find any difference between the pairwise distances in the plot and the original pairwise distances?

## PCA
First let's try to perform PCA directly on the normalized genotypes without calculating the covariance matrix.

 - Why do we normalize the genotypes?
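For reference, the normalization carried out by the `normalize` function in the next cell can be written as follows, where $N$ is the number of individuals, $G_{ij}$ is the genotype of individual $i$ at SNP $j$, and $\hat f_j$ is the estimated allele frequency of SNP $j$ (the symbols $N$, $G_{ij}$ and $\hat f_j$ are notation introduced here, not taken from the slides):

$$
\hat f_j = \frac{1}{2N}\sum_{i=1}^{N} G_{ij},
\qquad
M_{ij} = \frac{G_{ij} - 2\hat f_j}{\sqrt{2\hat f_j\,(1-\hat f_j)}}
$$

In the code, `avg` corresponds to $2\hat f_j$, so `avg/2` is $\hat f_j$.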

 

In [None]:
 #first normalize the data so that the mean and variance are the same for each SNP
  normalize <- function(x){
    nInd <- nrow(x)
    avg <- colMeans(x)
    M <- x - rep(colMeans(x),each=nInd)
    M <- M/sqrt(2*rep(avg/2*(1-avg/2),each=nInd))
    M
 }
print(G)
 M <- normalize(G)
print(M)
cat("Dimension of M")
dim(M)

 svd <- svd(M)
 ## print the decomposition M = U D V^T
 ## u and v hold the left and right singular vectors
 ## d holds the singular values
 print(svd)


The above is the decomposition of the normalized genotypes into a diagonal matrix with the singular values (d), and the left (u) and right (v) singular vectors, such that
$M=U\Sigma V^T$
where $\Sigma$ has the values of d on its diagonal. Therefore, we can reconstruct the normalized genotypes from u, d and v:


In [None]:
##make a diagonal matrix with the singular values
SIGMA <-  diag(svd$d)
print(SIGMA)
## using the matrices from the decomposition we can reconstruct our normalized genotypes
M2 <- svd$u%*%SIGMA%*%t(svd$v)
cat("Original normalized genotypes (M):")
round(M,3)
cat("Reconstructed normalized genotypes(M2):")
round(M2,3)

 - Did the reconstruction of the normalized genotypes work?
 - Would you be able to reconstruct the unnormalized (raw) genotypes?

Now try performing PCA based on the covariance matrix instead. To do so we first calculate the covariance matrix:


In [None]:
 ## calculate the covariance matrix
C <- M %*% t(M)
 print(C)


The covariance matrix also shows the relationship between each pair of individuals: the most similar individuals have a high positive value, while the most distant individuals have a negative value. However, unlike a Euclidean distance matrix, the diagonal is not zero; instead it is related to the diversity within each individual.
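A few of these properties can be checked directly. The sketch below rebuilds the normalized genotypes with `sweep()` (an equivalent rewriting of the `normalize` function above, used here only to keep the example self-contained) and then inspects the resulting matrix.

```r
# Genotype matrix from the slides
G <- matrix(c(1,0,2,0,2,0,2,
              1,1,1,0,1,0,2,
              1,2,1,1,1,1,1,
              0,1,0,2,0,1,1,
              0,2,1,2,0,1,0), nrow = 5, byrow = TRUE,
            dimnames = list(paste0("IND", 1:5), paste0("SNP", 1:7)))

# Normalize each SNP: subtract the mean genotype (2f) and scale by sqrt(2f(1-f))
f <- colMeans(G) / 2
M <- sweep(G, 2, 2 * f, "-")
M <- sweep(M, 2, sqrt(2 * f * (1 - f)), "/")

# Individual-by-individual matrix, as in the cell above
C <- M %*% t(M)

# The matrix is symmetric: entry (i,j) equals entry (j,i)
isSymmetric(C)

# The diagonal is each individual's sum of squared normalized genotypes
all.equal(diag(C), rowSums(M^2))

# Each row sums to (numerically) zero because every SNP column of M has mean zero
round(rowSums(C), 10)
```

The last property is one reason why similar individuals get positive values and dissimilar ones get negative values: the entries in each row have to balance out.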

Now let's try to do PCA on this covariance matrix instead

In [None]:
 ## then perform the PCA by eigendecomposition of the covariance matrix
 e <- eigen(C)

 ## print first PC
cat("First principal component:")
 print(e$vectors[,1])
 ## print the eigenvalues
cat("Eigenvalues:")
 print(round(e$values,4))
 ## plot the first 2 PCs for the 5 individuals
 plot(e$vectors[,1:2],pch=16,cex=3,col=1:5+1,ylab="2. PC",
      xlab="1. PC",main="Principal component analysis (PCA)")
 points(e$vectors[,1:2],pch=as.character(1:5))
 




 - Do you get the same results using the covariance matrix as using the normalized genotypes directly?
 - Compare the two plots (MDS vs. PCA). Are they capturing the same thing?

Bonus information:

Unlike MDS, PCA does not remove information as long as you keep all the principal components, so you are actually able to reconstruct your covariance matrix from them.
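To see when information is lost, the sketch below uses a small simulated symmetric matrix (not the course data) and compares a reconstruction from all eigenvectors with one that keeps only the first two. The names `X`, `full` and `rank2` are made up for this illustration.

```r
set.seed(1)

# Simulated data: 5 "individuals" x 7 "sites", and the 5 x 5 matrix X X^T
X <- matrix(rnorm(5 * 7), nrow = 5)
C <- tcrossprod(X)

e <- eigen(C)

# Reconstruction from all eigenvectors and eigenvalues: V SIGMA V^T
full <- e$vectors %*% diag(e$values) %*% t(e$vectors)

# Reconstruction that keeps only the first k = 2 components
k <- 2
rank2 <- e$vectors[, 1:k] %*% diag(e$values[1:k]) %*% t(e$vectors[, 1:k])

# Largest absolute error of each reconstruction
max(abs(C - full))   # numerically zero
max(abs(C - rank2))  # clearly larger than zero
```

Plotting only PC1 and PC2 corresponds to the second case: the plot is a summary, and the remaining components carry whatever it leaves out.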

In [None]:
##continue in R
##make a diagonal matrix with the eigenvalues
SIGMA <- diag(e$values)

## transform the PC back to the original data
## using matrix multiplication V SIGMA Vt
out <- e$vectors %*% SIGMA %*% t(e$vectors)
cat("Reconstructed covariance:")
print(out)
cat("Original covariance:")
print(C)
#close R after you are done

Try to also compare the eigenvalues from the decomposition of the normalized genotypes and from the covariance matrix

In [None]:
cat("Eigenvalues of the covariance matrix:")
 print(round(e$values,4))

cat("Singular values from the normalized genotypes:")
 print(round(svd$d,4))

 - What is the relationship? (hint: try to square one of them by changing the above code)

#  PCA for low depth sequencing using PCAngsd 


In this exercise we will try to use PCAngsd to analyse the same data used in the NGSadmix exercises. 

Genotype likelihoods for variable sites were estimated from BAM files with low depth NGS data from the 1000 Genomes Project:


| Population code | Population                                     | Sample size |
|-----------------|------------------------------------------------|-------------|
| ASW             | HapMap African ancestry individuals from SW US | 61          |
| CEU             | European individuals                           | 99          |
| CHB             | Han Chinese in Beijing                         | 103         |
| YRI             | Yoruba individuals from Nigeria                | 108         |
| MXL             | Mexican individuals from LA California         | 63          |


Copy data to your newly created folder

In [None]:
cp -sf $DATA_PATH/1000G5popsAdmixK3seed3.qopt .
cp -sf $DATA_PATH/1000G5popsAdmixK4seed9.qopt .

##make links to files and add them to the folder
# links to genotype likelihood file ( from admixture analysis )
cp -sf $DATA_PATH/1000G5pops.inputgl.beagle.gz .

# link to population information file
cp -sf $DATA_PATH/1000G5pops.pop.info .

echo -e "\n--- files in folder ---"
ls .

echo -e "\n--programs that are installed:--"
type pcangsd

Look at the first lines of the population information file

In [None]:
head 1000G5pops.pop.info


See the number of individuals for each population from the sample file

In [None]:
# summarize the first column (uniq -c counts runs of identical adjacent lines)
cut -f1 1000G5pops.pop.info |  uniq -c

Count the number of lines in the genotype likelihood file

In [None]:
zcat 1000G5pops.inputgl.beagle.gz | wc -l 

In [None]:
import os
os.chdir(os.path.expanduser("~/data_folder"))

from jupyterquiz import display_quiz
display_quiz('pca_pcangsd_quiz.json')

 
## Run PCAngsd to perform PCA
First let's get a list of the options in PCAngsd


In [None]:
pcangsd -h

Run PCAngsd on your genotype likelihood data using 5 CPU threads (will take ~1 min)

In [None]:
pcangsd -b 1000G5pops.inputgl.beagle.gz -o PCANGSD1000G -t 5

The program estimates the covariance matrix that can then be used for PCA. Look at the output from the program

 - How many significant PCs were used by PCAngsd (see MAP test in output)?

Plot the results in R

In [None]:
# Read covariance matrix estimated by PCAngsd
C <- as.matrix(read.table("~/current_folder/PCANGSD1000G.cov"))

# Read population labels for each individual
pop<-read.table("~/current_folder/1000G5pops.pop.info",stringsAsFactors=T)

# Estimate the eigenvectors (principal components) from the covariance matrix
e <- eigen(C)
plot(e$vectors[,c(1,2)],col=pop[,1],xlab="PC1",ylab="PC2")
legend("left",fill=1:5,levels(pop[,1]))


# Plot the first and third principal components (e was computed above)
plot(e$vectors[,c(1,3)],col=pop[,1],xlab="PC1",ylab="PC3")
legend("left",fill=1:5,levels(pop[,1]))

Compare with the estimated admixture proportions (an NGSadmix analysis)



In [None]:
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/refs/heads/master/visFuns.R")

par(mfrow=2:1)
## read and plot the output from NGSadmix from Tuesday's exercises
pop<-read.table("~/current_folder/1000G5pops.pop.info",as.is=T)
q<-read.table("~/current_folder/1000G5popsAdmixK3seed3.qopt")
# sort individuals by population and within population by admixture proportion
ord<-orderInds(pop = pop[,1], q=q) 

#plot
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",
        ylab="Admixture proportions",main="K=3")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,
     unique(pop[ord,1]),xpd=T) # add population labels
abline(v=cumsum(sapply(unique(pop[ord,1]),
                       function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

## read for K=4
pop<-read.table("~/current_folder/1000G5pops.pop.info",as.is=T)
q<-read.table("~/current_folder/1000G5popsAdmixK4seed9.qopt")
#plot (individuals keep the ordering computed for K=3)
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",
        ylab="Admixture proportions",main="K=4")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),
     xpd=T) # add population labels
abline(v=cumsum(sapply(unique(pop[ord,1]),
                       function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

 - In the PCA plot can you identify the Mexicans with only European ancestry?
 - What about the African American with East Asian ancestry?
 - Based on the PCA would you have reached the same conclusion as the admixture proportions?

## What if we don't use many iterations in PCAngsd

Try the same analysis but with only one iteration of the algorithm, so that the result has not converged. In that case we do not fully take the population structure into account when filling in the missing information.


In [None]:

pcangsd -b 1000G5pops.inputgl.beagle.gz -o PCANGSD1000G_iter0 -t 5 --iter 1 -e 1

Wait for the analysis to finish and then plot the results in R using the code below

In [None]:
# Read covariance matrix estimated by PCAngsd
C <- as.matrix(read.table("~/current_folder/PCANGSD1000G_iter0.cov"))

# Read population labels for each individual
pop<-read.table("~/current_folder/1000G5pops.pop.info",stringsAsFactors=T)

# Estimate the eigenvectors (Principal components) from the covariance matrix
e <- eigen(C)
plot(e$vectors[,1:2],col=pop[,1],xlab="PC1",ylab="PC2")
legend("top",fill=1:5,levels(pop[,1]))


 - Do you see any difference?
 - Would any of your conclusions change? (compared to the previous PCA plot)

## Converting a PCA into admixture proportions
Let's try to use the PCA to infer admixture proportions based on the first 2 principal components. For the optimization we will use a small penalty on the admixture proportions (alpha):


In [None]:

pcangsd -b 1000G5pops.inputgl.beagle.gz -o PCANGSD1000G -t 5 --admix --admix-alpha 50 



Plot the results in R



In [None]:
# Read the admixture proportions estimated from the PCA
q<-read.table("~/current_folder/PCANGSD1000G.admix.4.Q")

# Read population labels for each individual
pop<-read.table("~/current_folder/1000G5pops.pop.info",stringsAsFactors=T)

## Order according to population
ord<-orderInds(pop = pop[,1], q=q) # sort individuals by population and within population by admixture proportion

#plot
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",
        ylab="Admixture proportions")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,
     unique(pop[ord,1]),xpd=T) # add population labels
abline(v=cumsum(sapply(unique(pop[ord,1]),
                       function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)


 - How does this compare to the results from an admixture proportion analysis (the NGSadmix analysis above)?


# PCAngsd and selection

For very recent selection we can look within a group of closely related individuals, for example within Europeans.

**Data:**

 - Genotype likelihoods in Beagle format
 - ~150k random SNPs with maf > 5%
 - Four EU populations with ~100 individuals in each
 - whole genome sequencing
 - depth 2-9X (1000 genome project)

 ```
CEU | Europeans in Utah (British)
GBR | Great Britain
IBS | Iberian/Spain
TSI | Italy
```

First let's link the data files into the working folder


In [None]:
## make links to files and add them to the folder
# links to genotype likelihood file 
cp -sf ${DATA_PATH}/eu1000g.small.beagle.gz .

# link to population information file
cp -sf ${DATA_PATH}/eu1000g.sample.Info .

echo -e "\n--- eu1000* files in folder ---"
ls eu1000*


### Explore the input data. 

Take a quick look at the sample data.

First try to get an overview of the dataset by looking at the information file and making a summary using the following code:
 

In [None]:
# View first lines of sample info file
echo --- First lines in sample info file
head eu1000g.sample.Info

echo --- Count the number of samples from each population
cut -f 2 -d " " eu1000g.sample.Info | sed 1d| sort | uniq -c 

- How many samples from each country?

Now let's have a look at the genotype likelihood (GL) file, which was created with ANGSD. It is a "beagle format" file called eu1000g.small.beagle.gz, and it will be the input file to PCAngsd. The first line in this file is a header line, and after that it contains one line of GLs for each locus. Using the unix command wc we can count the number of lines in the file:



In [None]:
gunzip -c eu1000g.small.beagle.gz | wc -l
 

- Use this to find out how many loci the data set has GLs for.



Next, to get an idea of what the GL file contains try (from the command line) to print the first 9 columns of the first 7 lines of the file:



In [None]:
zcat eu1000g.small.beagle.gz | head -n 7 | cut -f1-9 | column -t

## Ignore the "Broken pipe"

In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3).

All following columns contain genotype likelihoods (three columns for each individual: first the GL for the homozygote for allele1, then the GL for the heterozygote, and then the GL for the homozygote for allele2). Note that the three GL values for each individual sum to one at each site. This is just a normalization of the genotype likelihoods to avoid underflow problems in the beagle software; it does not mean that they are genotype probabilities.
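As a minimal sketch of how to read such a triplet, the R code below picks the genotype with the highest GL for a few individuals at one site. The GL values and the individual names are invented for illustration and are not taken from the course file.

```r
# Hypothetical GLs at one site for three individuals
# (columns follow the beagle order: hom. allele1, heterozygote, hom. allele2)
gl <- matrix(c(0.90, 0.09, 0.01,
               0.05, 0.90, 0.05,
               1/3,  1/3,  1/3),   # equal GLs: the data say nothing about this individual
             nrow = 3, byrow = TRUE,
             dimnames = list(c("IndA", "IndB", "IndC"),
                             c("hom_allele1", "het", "hom_allele2")))

# Each individual's GLs are normalized to sum to one
rowSums(gl)

# Most likely genotype = the column with the largest GL
best <- colnames(gl)[max.col(gl, ties.method = "first")]

# Flag individuals where no genotype is favoured (all three GLs equal)
informative <- apply(gl, 1, max) > 1/3 + 1e-8

data.frame(individual = rownames(gl), best = best, informative = informative)
```

For the third individual the "best" genotype is only an artifact of tie-breaking, which is why it helps to look at the GLs themselves rather than only at the called genotype.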

 - Based on this, what is the most likely genotype of Ind0 in the first locus and in the sixth locus?

### PCAngsd and selection

Run PCAngsd to estimate the covariance matrix while jointly estimating the individual allele frequencies.



In [None]:
pcangsd -b eu1000g.small.beagle.gz -o EUsmall -t 5

This takes around 2 min to run. The program estimates the covariance matrix that can then be used for PCA. Look at the output from the program.

 - The algorithm might only need a low number of PCs to estimate the allele frequencies. How many significant PCs are there (see MAP test in output)?

Now plot the results in R:


In [None]:
 ## R
 cov <- as.matrix(read.table("~/current_folder/EUsmall.cov"))

 e<-eigen(cov)
 ID<-read.table("~/current_folder/eu1000g.sample.Info",head=T,stringsAsFactors=T)
 plot(e$vectors[,1:2],col=ID$POP,xlab="PC1",ylab="PC2")

 legend("topleft",fill=1:4,levels(ID$POP))


 - Does the plot look like you expected? Which populations are close and distant to each other?

Since the European individuals in 1000G are not simple, homogeneous, disjoint populations, it is hard to use PBS/FST or similar statistics to infer selection based on population differences (you will learn about these later). However, PCA offers a good description of the differences between individuals without having to define disjoint groups.

Let's try to infer selection along the genome based on the PCA



In [None]:
pcangsd -b eu1000g.small.beagle.gz -o EUsmall --selection \
    --sites-save --maf 0 -t 5


The analysis takes about two minutes. We also need to keep track of whether a SNP is used in the analysis or not, which can be done based on the output. Create a file with the SNP location info that you will need to plot the results (the third column indicates whether the site is used (1) or not (0)):



In [None]:
# Create file with position and chromosome
paste <(zcat eu1000g.small.beagle.gz| cut -f 1 | sed 's/\_/\t/g' | sed 1d ) \
    EUsmall.sites > EUsmall.sites.info

echo -- first lines of the created file
head  EUsmall.sites.info 

Next, plot the results of the selection scan



In [None]:
#read function for plotting
source("https://raw.githubusercontent.com/aalbrechtsen/Rfun/refs/heads/master/online.R")

# read in test statistics from the selection scan
s <- scan("~/current_folder/EUsmall.selection")

# convert test statistic to p-value
pval<-pchisq(s,1,lower=FALSE)

## make QQ plot to QC the test statistics
qqPlot(pval)

The above is a QQ plot of the p-values from the selection scan. If the test statistic is well behaved, most points will follow the red line and only a few (<1%) will deviate.
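To get a feeling for what a well-behaved QQ plot looks like, the sketch below simulates test statistics with no selection at all, assuming they follow a chi-square distribution with 1 degree of freedom (the same distribution used to convert to p-values above). It draws a generic QQ plot with base R and does not use the course's `qqPlot` function.

```r
set.seed(1)

# Simulated null test statistics (no selection anywhere)
stat <- rchisq(10000, df = 1)
pval <- pchisq(stat, df = 1, lower.tail = FALSE)

# Observed vs expected -log10(p); under the null the p-values are uniform
observed <- sort(-log10(pval))
expected <- sort(-log10(ppoints(length(pval))))

plot(expected, observed, pch = 16, cex = 0.5,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)",
     main = "QQ plot under the null (simulated)")
abline(0, 1, col = "red")

# Fraction of p-values below 0.05; should be close to 0.05 under the null
mean(pval < 0.05)
```

In a real scan, points that lift off the line only at the far right suggest a few true signals, while a curve that departs from the line early suggests that the test statistic is inflated overall.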

 - Did the test perform well?
 
 Finally, let's plot the results of the scan along the genome:  

In [None]:
## read positions (hg38)
data<-read.delim("~/current_folder/EUsmall.sites.info",
                 colClasses=c("factor","integer","integer"),header=FALSE)
names(data)<-c("chr","pos","keep")
data <- subset(data,keep==1)
data$pval <- pval 

## make Manhattan plot
options(repr.plot.width = 10, repr.plot.height = 6)
manPlot(data$pval,chr=as.integer(data$chr))

Let's zoom in on the top hit

In [None]:
# select sites to plot, 0.5Mb on either side of SNP
leadSNPposition <- data$pos[which.max(s)]

region <- subset(data,chr=="chr2" & pos >  leadSNPposition - 5e5 & 
                 pos < leadSNPposition + 5e5)

#plot
locusZoomNoLD(region$pval,chr=2,pos=region$pos,main="LocusZoom",
    geneticMap="~/data_folder/geneticMap/hg38/genetic_map_GRCh38_chr", 
    refGenes="~/data_folder/geneticMap/hg38/refGeneHG38.gz",
    w=which(region$pos==leadSNPposition))

See if you can make sense of the top hit. What do you think is the relevant gene in that locus?