

</br>
<font size="12">Estimating admixture proportions</font>


For this exercise we will use data from the blue wildebeest. To simplify the analysis we have included only one of the Brindled populations (B-Etosha), but we have included all 3 populations from the Eastern white-bearded subspecies. There are five subspecies.

<img src="https://raw.githubusercontent.com/popgenDK/popgenDK.github.io/gh-pages/images/slider/wildeBeastMap.png" alt="image info" />


# Software and data



### Software
We will be using plink, PCAone and ADMIXTURE for this exercise. First, let's check that the software is installed.

In [None]:

echo --programs that are installed:--
which admixture
which plink
which PCAone




### data

First, let's make a folder in your home directory. Then we will copy the data into your folder.

In [None]:

#make folder 
mkdir -p ~/kenya2024/
mkdir -p ~/kenya2024/admixture

# enter folder
cd ~/kenya2024/admixture



##make links to files and add them to the folder
cp -sf /davidData/data/course/kenyaWorkshop/anders/structure_day3/blue_wildebeest_thin* .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/multiRunK7 .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/allK .


echo --- files in folder ---
ls 

### fam file
The genotype data is stored in binary plink files (`*.bed`, `*.fam`, `*.bim`). Let's first look at the fam file, which describes the individuals in the data.

In [None]:
echo -- number of lines in fam file --
wc -l blue_wildebeest_thin.fam

echo -- first 10 lines fam file --
head blue_wildebeest_thin.fam

echo -- counts of populations/subspecies from first column of fam file --
cut -f1 -d" " blue_wildebeest_thin.fam | sort | uniq -c
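The same per-population count can be sketched in Python. The lines below are made-up examples in the six-column fam layout (family/population label, individual ID, father, mother, sex, phenotype). The labels `PopA` and `PopB` and the helper `count_populations` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Made-up lines in PLINK .fam layout (not the real course data):
# population label, individual ID, father, mother, sex, phenotype
fam_lines = [
    "PopA ind1 0 0 0 -9",
    "PopA ind2 0 0 0 -9",
    "PopB ind3 0 0 0 -9",
]

def count_populations(lines):
    """Count individuals per label in the first column of .fam-style lines."""
    return Counter(line.split()[0] for line in lines if line.strip())

counts = count_populations(fam_lines)
for label, n in sorted(counts.items()):
    print(n, label)
```

With the real file you would read the lines from `blue_wildebeest_thin.fam` instead of the hard-coded list.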

### bim file
Now let's look at the bim file. This is the file that describes the genetic variants.

In [None]:
echo -- number of lines in bim file --
wc -l blue_wildebeest_thin.bim

echo -e "\n-- first 10 lines bim file --"
echo -e "CHR\tvariantID\tCM\tPosition\tallele_1\tallele_2"
head blue_wildebeest_thin.bim

echo -e "\n-- counts number of variants per chromosome from the first column of bim file --"
echo \#Var Chromosome_name
cat blue_wildebeest_thin.bim | cut -f1  | uniq -c


Run the code below to start a quiz.

In [None]:

from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day3_PopulationStructure/admixture_quiz1.json')


## LD pruning

It is recommended to perform LD pruning before running ADMIXTURE. This is often done using plink, which assumes that there is no population structure. However, because we expect a lot of structure in our data, we will use a newer method, PCAone, that corrects for population structure using PCA. To perform LD pruning we must choose the number of PCs, which in this case is **-k 6**: we expect 7 different populations, and 6 PCs are enough to model 7 populations because each PC can separate one more group. We will use an LD threshold of **r2=0.1**, which removes variants that are in LD with any other variant with a squared correlation (r2) above 0.1. Since we don't want to calculate LD between all pairs of variants, we will estimate LD in a sliding window of size **1000000 bp = 1 Mb**.

The command is shown below.




In [None]:
PCAone -b blue_wildebeest_thin -k 6 --ld-stats 0 --ld-r2 0.1 --ld-bp 1000000
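To make the threshold and the window concrete, here is a minimal Python sketch of window-based LD pruning. It is a simplification under stated assumptions: it uses the standard squared Pearson correlation between genotype vectors, not the ancestry-adjusted statistic that PCAone computes, and the greedy keep/drop rule is just one simple choice. The function names and the toy genotypes are hypothetical.

```python
from statistics import mean

def r2(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined, treated as 0 here
    return sxy * sxy / (sxx * syy)

def prune(positions, genotypes, max_r2=0.1, window_bp=1_000_000):
    """Greedy pruning: drop a variant if it is in LD with an already kept
    variant that lies within the window (positions must be sorted)."""
    kept = []
    for i, pos in enumerate(positions):
        in_ld = any(
            pos - positions[j] <= window_bp
            and r2(genotypes[i], genotypes[j]) > max_r2
            for j in kept
        )
        if not in_ld:
            kept.append(i)
    return kept

# Toy data (made up): 4 variants on one chromosome, 6 individuals
positions = [100, 5_000, 900_000, 3_000_000]
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to variant 0 and close by: removed
    [1, 0, 1, 1, 2, 1],  # uncorrelated with variant 0: kept
    [0, 1, 2, 0, 1, 2],  # identical to variant 0 but 3 Mb away: kept
]
kept = prune(positions, genotypes)
print(kept)
```

The last variant shows the effect of the window: it is perfectly correlated with the first one, but the pair is never compared because the two are more than 1 Mb apart.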

The software writes a list of variants that are not in LD with each other. We will extract those sites using plink and create a new set of plink files named blue_wildebeest_noLD.

In [None]:
echo --number of variants to be kept --
wc -l pcaone.ld.prune.in
 
echo -e "\n --Extract variants using plink --"
 plink --bfile blue_wildebeest_thin --extract pcaone.ld.prune.in --make-bed --out blue_wildebeest_noLD  --chr-set 29



## ADMIXTURE

We are now ready to run ADMIXTURE. First, let's look at the options of the program.

In [None]:
admixture --help

As can be seen, we need to provide our plink file and choose a number of ancestral populations. In our case the most relevant number of assumed ancestral populations is probably 7, one for each population. ADMIXTURE uses numerical optimisation that starts from a random guess of the parameters. Therefore, we will specify a seed for the random number generator so that we can reproduce the results (otherwise we would get a different result each time we run it).

To make it run faster we will use 10 CPU threads. Run the command (it will take ~2 min).


In [None]:

admixture --seed 0 -j10 blue_wildebeest_noLD.bed 7
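As a rough illustration of what is being optimised, the sketch below evaluates a log likelihood for given parameters. It assumes the standard admixture model, in which the genotype of individual i at variant j is binomial with 2 draws and allele frequency equal to the sum over k of q_ik * f_jk, and it drops the constant binomial coefficient. The names `Q`, `F`, `G` and `log_likelihood` and all numbers are made up for this example.

```python
import math

def log_likelihood(G, Q, F):
    """Binomial admixture log likelihood (up to an additive constant).

    G[i][j]: genotype (0/1/2) of individual i at variant j
    Q[i][k]: ancestry proportion of individual i from population k
    F[j][k]: allele frequency of variant j in population k
    """
    ll = 0.0
    for i, q_row in enumerate(Q):
        for j, f_row in enumerate(F):
            p = sum(q * f for q, f in zip(q_row, f_row))  # individual allele frequency
            g = G[i][j]
            ll += g * math.log(p) + (2 - g) * math.log(1 - p)
    return ll

# Toy data: 2 individuals, 2 variants, K = 2
G = [[0, 2], [1, 1]]
F = [[0.2, 0.8], [0.9, 0.1]]
Q_good = [[1.0, 0.0], [0.5, 0.5]]  # individual 0 assigned to population 0
Q_bad = [[0.0, 1.0], [0.5, 0.5]]   # individual 0 assigned to the wrong population

ll_good = log_likelihood(G, Q_good, F)
ll_bad = log_likelihood(G, Q_bad, F)
print(ll_good, ll_bad)
```

The better-fitting ancestry proportions give the higher log likelihood, which is the quantity reported at the end of each run.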



In [None]:
# let's see which files it produces
echo -- files sorted by time. last files are the most recent --
ls -tr


You should find two files ending with `.7.Q` and `.7.P`, respectively. These are the estimated ancestry proportions and allele frequencies.
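To get a feel for the layout of a Q file, here is a small Python sketch. It assumes one row per individual and one whitespace-separated column per ancestral population, with each row summing to 1. The values are made up (K = 3 for brevity), and the 0.9 cutoff for calling an individual admixed is an arbitrary choice for illustration.

```python
# Made-up Q-style content: 3 individuals, K = 3
q_text = """0.98 0.01 0.01
0.50 0.49 0.01
0.00 0.10 0.90"""

Q = [[float(v) for v in line.split()] for line in q_text.splitlines()]

summary = []
for row in Q:
    # the ancestry proportions of one individual should sum to 1
    assert abs(sum(row) - 1.0) < 1e-6
    major = max(range(len(row)), key=row.__getitem__)  # largest component (0-based)
    admixed = max(row) < 0.9                           # arbitrary illustrative cutoff
    summary.append((major, admixed))

for i, (major, admixed) in enumerate(summary):
    print(f"individual {i}: major component {major}, admixed: {admixed}")
```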

### plotting admixture proportions
Let's plot the results. For this we will use R.

In [None]:
#make plot wide
library("repr")
options(repr.plot.width=17, repr.plot.height=4.5)

#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

# Read in inferred admixture proportions
q <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.Q")

#read in the population labels (first column of fam file)
table(pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1])

#plot admixture proportions
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=2:8)
legend(0,1.1,fill=2:8,legend=1:7,horiz=TRUE,xpd=TRUE)

 - Do the results look as you expected (why/why not)?
 
 
 There are several hints that the results might be wrong. For example, do you think it is realistic that the 3 Cookson samples are all admixed with the same admixture proportions?
 
 ### EvalAdmix
 
 We can use evalAdmix to see if we can detect problems with the results of the analysis. This method uses the results of ADMIXTURE to predict the genotypes of each individual and then tries to identify individuals or populations with a bad fit.
 
 

In [None]:
evalAdmix -plink blue_wildebeest_noLD -fname blue_wildebeest_noLD.7.P \
-qname blue_wildebeest_noLD.7.Q -o blue_wildebeest_noLD.7.eval -P 10
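The idea of predicting genotypes and looking at what is left over can be sketched as follows. This is a simplified illustration, not evalAdmix's actual implementation: the tool additionally corrects for the fact that the frequencies were estimated from the same individuals, which this sketch skips. All names and numbers are made up.

```python
def predicted(q_row, f_row):
    """Expected genotype under the admixture model: 2 * sum_k q_k * f_k."""
    return 2.0 * sum(q * f for q, f in zip(q_row, f_row))

def residuals(G, Q, F):
    """Observed minus predicted genotype for every individual and variant."""
    return [[G[i][j] - predicted(Q[i], F[j]) for j in range(len(F))]
            for i in range(len(Q))]

def correlation(x, y):
    """Pearson correlation between two residual vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Toy data: 3 individuals, 4 variants, K = 2
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
F = [[0.5, 0.5]] * 4
G = [[2, 0, 2, 0],   # individuals 0 and 1 deviate from the model
     [2, 0, 2, 0],   # in exactly the same way
     [0, 2, 2, 0]]

R = residuals(G, Q, F)
cor_01 = correlation(R[0], R[1])
cor_02 = correlation(R[0], R[2])
print(cor_01, cor_02)
```

If the model fits, residuals of different individuals should be uncorrelated. A clearly positive correlation, as between individuals 0 and 1 here, means the two share something the model has not captured.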



The results from evalAdmix can be plotted in R.

In [None]:
options(repr.plot.width=17, repr.plot.height=12)

r <- as.matrix(read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.eval"))
plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05)


 - Which population does evalAdmix identify as having a bad fit? 


### evaluating convergence
There are several possible explanations for why the fit is bad. The two most common reasons are 1) the choice of K is suboptimal and 2) the algorithm has not converged to the global maximum. ADMIXTURE tries to find the combination of parameters that maximizes the log likelihood. The algorithm does not always give the correct result when the likelihood surface is not concave, as illustrated below.

<img src="https://www.mathsisfun.com/algebra/images/function-max-global.svg" alt="image info" />

To test for convergence we can try many different starting points. If many starting points lead to the same best log likelihood, then we have likely found the global maximum.
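The effect of the starting point can be illustrated with a toy example in Python. The function below is not the ADMIXTURE likelihood. It is a made-up one-dimensional surface with one local and one global maximum, optimised with plain gradient ascent. The names `f`, `gradient` and `climb` are hypothetical.

```python
import random

def f(x):
    """Toy 'log likelihood': local maximum near x = -1, global maximum near x = 1."""
    return -(x * x - 1.0) ** 2 + 0.3 * x

def gradient(x):
    """Derivative of f."""
    return -4.0 * x * (x * x - 1.0) + 0.3

def climb(start, step=0.01, n_iter=5000):
    """Plain gradient ascent from one starting point."""
    x = start
    for _ in range(n_iter):
        x += step * gradient(x)
    return x

runs = []
for seed in range(10):
    # the seed decides the random starting point
    start = random.Random(seed).uniform(-2.0, 2.0)
    end = climb(start)
    runs.append((f(end), seed, end))

# best runs first
for value, seed, end in sorted(runs, reverse=True):
    print(f"seed {seed}: x = {end:.3f}, value = {value:.4f}")
```

Every run ends at one of the two maxima, and which one depends only on where it started. Runs that share the best value support the conclusion that this value is the global maximum.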

To save time we have prerun ADMIXTURE using 10 other seeds. The results are found in the folder multiRunK7




In [None]:
K=7
#mkdir -p multiRunK$K
#for seed in 1 2 3 4 5 6 7 8 9 10
#do
#admixture --seed $seed -j70 blue_wildebeest_noLD.bed $K | tee multiRunK$K/blue_wildebeest_noLD.$K.log_$seed
#mv blue_wildebeest_noLD.$K.Q multiRunK$K/blue_wildebeest_noLD.$K.Q_$seed 
#mv blue_wildebeest_noLD.$K.P multiRunK$K/blue_wildebeest_noLD.$K.P_$seed
#done


ls multiRunK$K

We can extract the log likelihoods for each of the 10 seeds and sort them according to their values.

In [None]:
# numeric sort on the second field: the highest (least negative) log likelihood is printed last
grep ^Loglikelihood multiRunK7/blue_wildebeest_noLD.7.log* | sort -t " " -k 2,2n
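The comparison of runs can also be sketched in Python. The text below mimics the shape of the grep output (file name, then `Loglikelihood:`, then the value), but the numbers are made up and the tolerance of 1 log-likelihood unit for calling two runs equal is an arbitrary choice.

```python
# Made-up lines in the shape of the grep output above
grep_output = """multiRunK7/blue_wildebeest_noLD.7.log_1:Loglikelihood: -1000.50
multiRunK7/blue_wildebeest_noLD.7.log_2:Loglikelihood: -998.25
multiRunK7/blue_wildebeest_noLD.7.log_3:Loglikelihood: -998.30"""

runs = []
for line in grep_output.splitlines():
    name, value = line.rsplit(" ", 1)                     # value is the last field
    seed = int(name.split(":")[0].rsplit("_", 1)[1])      # seed is the file name suffix
    runs.append((float(value), seed))

runs.sort(reverse=True)          # highest log likelihood first
best_ll, best_seed = runs[0]

# number of runs that reached (almost) the same log likelihood as the best one
n_close = sum(1 for ll, _ in runs if best_ll - ll < 1.0)

print("best seed:", best_seed, "log likelihood:", best_ll)
print("runs within 1 unit of the best:", n_close)
```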


 - Which seed has the highest likelihood?
 - Your run used seed 0. Did your run find a local or the global maximum? (You can find the likelihood at the bottom of the output from the program when you ran it.)
 
 
 
 Let's plot the results from the seed with the best likelihood.

In [None]:
options(repr.plot.width=17, repr.plot.height=4.5)


#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")


# Read in inferred admixture proportions
q <- read.table("~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.Q_4")

#read in the population labels (first column of fam file)
pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]

#make the plot. 
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=2:8)

 - Did the results improve?
 - How many individuals can you find that are admixed between two subspecies?
 
 
 Let's see if the fit is better now.

In [None]:
evalAdmix -plink blue_wildebeest_noLD -fname ~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.P_4 \
-qname ~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.Q_4 -o blue_wildebeest_noLD.7.eval_4 -P 10




Let's plot the results.

In [None]:
#make plot wide
library("repr")
options(repr.plot.width=17, repr.plot.height=12)

#read in plotting code (plotAdmix and plotCorRes functions)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

#read in the population labels (first column of fam file)
pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]

r <- as.matrix(read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.eval_4"))
plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05)


 - How is the fit?
 - There are some pairs of individuals with a positive correlation. What do you think is the reason?
 
 ### run for multiple K
 
 
 This will take some time, so we have precomputed it with the code below, using 3 seeds per K.

In [None]:
 
#mkdir -p allK

#for K in 1 2 3 4 5 6 7
#do
#  for seed in 1 2 3 
#  do
#    admixture --seed $seed -j70 blue_wildebeest_noLD.bed $K | tee allK/blue_wildebeest_noLD.$K.log_$seed
#    mv blue_wildebeest_noLD.$K.Q allK/blue_wildebeest_noLD.$K.Q_$seed 
#    mv blue_wildebeest_noLD.$K.P allK/blue_wildebeest_noLD.$K.P_$seed
#  done
#done


ls allK/


We can plot the results in R


In [None]:
options(repr.plot.width=17, repr.plot.height=9)

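# possible K
Kall <- 2:7

# Q files from seed 1, one per K in Kall, named as in the loop above
# (built from K so that the K=1 file cannot shift the indexing)
files <- paste0("~/kenya2024/admixture/allK/blue_wildebeest_noLD.", Kall, ".Q_1")
print(files)

## read Qs
allQ <- list()
for(K in Kall)
    allQ[[K]]<-t(read.table(files[K-min(Kall)+1]))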


source("https://raw.githubusercontent.com/popgenDK/admixturePlot/main/admixFun.R")

pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]
palette(palette()[-1])
plotMulti(allQ,Kall=Kall,as.factor(pop))

 - What determines which populations get their own component first (with low K)?

