<font size="12">PCA and population structure with called genotypes</font>


# PCA for wildebeest

Again we will use data from the Blue Wildebeest. To simplify the analysis we have included only one of the Brindled populations (B-Etosha).

<img src="https://raw.githubusercontent.com/popgenDK/popgenDK.github.io/gh-pages/images/slider/wildeBeastMap.png" alt="image info" />


#  PCA for called genotypes using PCAone


# Software and data



### Software
We will be using plink and PCAone for this exercise. First, let's check that the software is installed and get the data.

In [None]:
echo --programs that are installed:--
which plink
which PCAone

#make folder if it does not exist already
mkdir -p ~/popgen24/
mkdir -p ~/popgen24/pca2

# enter folder
cd ~/popgen24/pca2

# make links to files and add them to the folder
cp -sf /davidData/data/course/kenyaWorkshop/anders/structure_day3/blue_wildebeest_thin* .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/multiRunK7 .

echo --- files in folder ---
ls 


Let's perform PCA on the whole data set (without LD pruning).

We will use PCAone first:



In [None]:
PCAone

Running PCAone without arguments shows the options. To run the analysis, use the following command:

In [None]:

PCAone  -b blue_wildebeest_thin -o blue_wildebeest_thin

 - Look at the output above. How many SNPs and how many individuals are there?
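
If you want to double-check the numbers reported by PCAone, a minimal sketch is to count lines in the plink files. This assumes the usual binary fileset layout, where the .fam file has one line per individual and the .bim file one line per SNP. The helper name count_plink is made up for this example and is not part of plink or PCAone.

```shell
# count_plink: hypothetical helper, not part of plink or PCAone
# usage: count_plink <prefix>   (expects <prefix>.fam and <prefix>.bim)
count_plink() {
    n_ind=$(wc -l < "$1.fam")   # one line per individual
    n_snp=$(wc -l < "$1.bim")   # one line per SNP
    echo "individuals: $((n_ind))  SNPs: $((n_snp))"
    # individuals per population label (first column of the .fam file)
    awk '{ n[$1]++ } END { for (p in n) print p, n[p] }' "$1.fam" | sort
}

# run on the course data only if the links made earlier are in place
if [ -f blue_wildebeest_thin.fam ]; then
    count_plink blue_wildebeest_thin
fi
```

The per-population counts should match the population labels used to colour the plots later in the exercise.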
 
The default is to calculate the top 10 PCs. If you want more, you can use the option --pc <INT> to choose a different number. For now, let's see what the top PCs capture.

First, let's look at the first two PCs together with the admixture proportions.


In [None]:
options(repr.plot.width=12, repr.plot.height=12)

layout(matrix(c(1,1,2,3),nrow=2,byrow=TRUE),heights=c(2,4),widths=2:1)
#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")


# Read in inferred admixture proportions
q <- read.table("~/popgen24/pca2/multiRunK7/blue_wildebeest_noLD.7.Q_4")

#read in the population labels (first column of fam file)
tab <- table(pop <- read.table("~/popgen24/pca2/blue_wildebeest_thin.fam")[,1])

#make the plot. 
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=c(3,5,8,4,2,6,7))


pca <- read.table("~/popgen24/pca2/blue_wildebeest_thin.eigvecs")
#layout(matrix(1:2,nrow=1),w=c(4,2))
# column 1 (PC 1) goes on the x axis and column 2 (PC 2) on the y axis
plot(pca[,1:2],col=as.integer(as.factor(pop))+1,xlab=paste("PC",1),ylab=paste("PC",2),cex.lab=1.5,cex=2,lwd=8)
plot.new()
legend("top",legend=names(tab),bty="n",xpd=T,cex=2,text.col=1:length(tab)+1,text.font=2)


 - What information do you get from the PCA that you don't get from the ADMIXTURE results?
 - Can you identify the admixed individuals?
 
 
 Let's see what the other PCs show.
 

In [None]:
options(repr.plot.width=12, repr.plot.height=16)

par(mfrow=c(3,2))
for(pc in 0:4)
    plot(pca[,pc*2+1:2],col=as.factor(pop),ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8)


 - How many PCs are used to separate the populations?
 - What do you think is captured on PC 7 and 8?
 - What do you think is captured on PC 9 and 10?
 

## Bonus exercise if there is time


Let's try to run the PCA after pruning LD (linkage disequilibrium) from the data. The noLD data was created in the admixture exercise.
First run the pruning with PCAone and then run PCA on the pruned data.

In [None]:

PCAone -b blue_wildebeest_thin -k 6 --ld-stats 0 --ld-r2 0.1 --ld-bp 1000000

echo --number of variants to be kept --
wc -l pcaone.ld.prune.in
 
echo -e "\n --Extract variants using plink --"
plink --bfile blue_wildebeest_thin --extract pcaone.ld.prune.in --make-bed --out blue_wildebeest_noLD  --chr-set 29


In [None]:
PCAone  -b  blue_wildebeest_noLD -o blue_wildebeest_noLD


We can start by comparing the eigenvalues. These are proportional to the variance explained, so a higher value means that the corresponding PC captures more information about the data.
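
To make "proportional to the variance explained" concrete, a small sketch is to divide each eigenvalue by the sum of the eigenvalues in the file. This assumes the .eigvals file holds one eigenvalue per line. Note that the shares are relative to the top PCs that were computed, not to the total variance in the data. The function name eig_share is made up for this example.

```shell
# eig_share: hypothetical helper that prints each eigenvalue as a share of
# the sum of all eigenvalues in the file (one eigenvalue per line)
eig_share() {
    awk '{ v[NR] = $1; s += $1 }
         END { for (i = 1; i <= NR; i++) printf "PC%d\t%.3f\n", i, v[i] / s }' "$1"
}

# run on the course output only if PCAone has already written the file
if [ -f blue_wildebeest_thin.eigvals ]; then
    eig_share blue_wildebeest_thin.eigvals
fi
```

Because the denominator only covers the PCs that were computed, the shares are best used to compare PCs within one run rather than as absolute percentages.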

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

eigen <- scan("~/popgen24/pca2/blue_wildebeest_thin.eigvals")
eigenNoLD <- scan("~/popgen24/pca2/blue_wildebeest_noLD.eigvals")

barplot(rbind(eigen,eigenNoLD),beside=T,col=2:3,legend=c("With LD","Without LD"),
       ylab="eigen values",xlab="PC 1:10")

 - Which data set captures the most information about the population structure in the top PCs?
 
 Let's plot the PCs from the two data sets.

In [None]:
options(repr.plot.width=12, repr.plot.height=32)

tab <- table(pop <- read.table("~/popgen24/pca2/blue_wildebeest_noLD.fam")[,1])


pcaNoLD <- read.table("~/popgen24/pca2/blue_wildebeest_noLD.eigvecs")


par(mfrow=c(5,2))
for(pc in 0:4){
    plot(pca[,pc*2+1:2],col=as.factor(pop),ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8,main="with LD")
    plot(pcaNoLD[,pc*2+1:2],col=as.factor(pop),ylab=paste("PC",pc*2+2),xlab=paste("PC",pc*2+1),cex.lab=1.5,cex=2,lwd=8,main="No LD")
}

 - Which of the PCs from the previous analysis capture LD and not population structure?
 - Do you think it is better to perform PCA with or without LD pruning?

## Identity by state tree
As a last-minute addition, we can also make the neighbour-joining (NJ) tree mentioned in the lecture. We first compute identity-by-state (IBS) distances between individuals, which are just the proportion of sites where two individuals differ.
We can then load these distances into R and produce an NJ tree with the package ape.
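
As a toy illustration of such a distance, assume genotypes are coded as 0, 1 or 2 copies of an allele and that the distance counts allele differences, so it is the summed absolute difference divided by two times the number of sites. This scaling is an assumption made for the example, so check the plink documentation for the exact definition of 1-ibs. The helper ibs_dist is hypothetical.

```shell
# ibs_dist: hypothetical helper, toy version of a 1-IBS distance
# usage: ibs_dist <genotypes1> <genotypes2>, each a string of 0/1/2 codes
ibs_dist() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)                       # number of sites
        for (i = 1; i <= n; i++) {
            d = substr(a, i, 1) - substr(b, i, 1)
            diff += (d < 0 ? -d : d)        # allele differences at site i
        }
        printf "%.3f\n", diff / (2 * n)     # assumed scaling: 2 alleles per site
    }'
}

ibs_dist 0122 0221   # two of the eight alleles differ, prints 0.250
```

Two identical individuals get a distance of 0 and two individuals that are opposite homozygotes at every site get a distance of 1.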


In [None]:
# get IBS distances with plink
plink --allow-extra-chr --bfile blue_wildebeest_thin --distance square 1-ibs --chr-set 29

In [None]:
library(ape)

# read in distances
m <- as.matrix(read.table("~/popgen24/pca2/plink.mdist", header = F))

# read id of individuals
id <- read.table("~/popgen24/pca2/plink.mdist.id")
rownames(m) <- id$V2
colnames(m) <- id$V2

pops <- c(4,6,8,2,5,7,3)
names(pops) <- unique(id$V1)

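# look up colours by population name (as.character guards against V1 being read as a factor)
plot(nj(m), tip.color = pops[as.character(id$V1)], type = "unrooted", show.tip.label = TRUE)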
add.scale.bar()
legend("bottomright",
       legend = names(pops),
       fill = pops)

 - Try removing the type = "unrooted" argument from the plotting command above and see what happens to the tree.