

</br>
<font size="12">Estimating admixture proportions</font>


For this exercise we will again use data from the blue wildebeest. To simplify the analyses we have included only one of the brindled wildebeest populations (B-Etosha). We have included one population from each of the five blue wildebeest subspecies, except for the eastern white-bearded wildebeest, for which we included three populations.

<img src="https://raw.githubusercontent.com/popgenDK/popgenDK.github.io/gh-pages/images/slider/wildeBeastMap.png" alt="image info" />


# Software and data



### Software
We will be using plink, PCAone, ADMIXTURE and evalAdmix for this exercise. First let's see if the software is installed.

In [1]:

echo --programs that are installed:--
which admixture
which plink
which PCAone
which evalAdmix




--programs that are installed:--
/usr/bin/admixture
/usr/bin/plink
/usr/bin/PCAone
/usr/bin/evalAdmix


### Data sets

First let's make a folder in your home directory, then we will add links to the data files in that folder.

In [None]:

#make folder 
mkdir -p ~/kenya2024/
mkdir -p ~/kenya2024/admixture

# enter folder
cd ~/kenya2024/admixture



##make links to files and add them to the folder
cp -sf /davidData/data/course/kenyaWorkshop/anders/structure_day3/blue_wildebeest_thin* .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/multiRunK7 .
cp -r -sf  /davidData/data/course/kenyaWorkshop/anders/structure_day3/allK .


echo --- files in folder ---
ls 

### The fam file
The genotype data are stored in binary plink files (*.bed, *.fam, *.bim). Let's first look at the fam file, which describes the individuals in the data.

In [None]:
echo -- number of lines in fam file --
wc -l blue_wildebeest_thin.fam

echo -- first 10 lines fam file --
head blue_wildebeest_thin.fam

echo -- counts of populations/subspecies from first column of fam file --
cut -f1 -d" " blue_wildebeest_thin.fam | sort | uniq -c
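The same count can be sketched in Python, which makes the column layout of a fam file explicit (population/family ID, individual ID, father, mother, sex, phenotype). The three lines below are made-up placeholders: only the labels 'B-Etosha' and 'C-Luangwa' are taken from this exercise, and the sample IDs are hypothetical.

```python
from collections import Counter

# Hypothetical fam lines: population label, sample ID, father, mother, sex, phenotype
fam_text = """\
B-Etosha sample01 0 0 0 -9
B-Etosha sample02 0 0 0 -9
C-Luangwa sample03 0 0 0 -9
"""

def count_populations(text):
    """Count how often each label occurs in the first column."""
    return Counter(line.split()[0] for line in text.splitlines() if line.strip())

counts = count_populations(fam_text)
for label, n in sorted(counts.items()):
    print(n, label)
```

To try it on the real data, the text of blue_wildebeest_thin.fam could be read from disk and passed to `count_populations` instead of the toy string.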

### The bim file
Now let's look into the bim file. This is the file that describes the different genetic variants in the data set.

In [None]:
echo -- number of lines in bim file --
wc -l blue_wildebeest_thin.bim

echo -e "\n-- first 10 lines bim file --"
echo -e "CHR\tvariantID\tCM\tPosition\tallele_1\tallele_2"
head blue_wildebeest_thin.bim

echo -e "\n-- counts number of variants per chromosome from the first column of bim file --"
echo \#Var Chromosome_name
cut -f1 blue_wildebeest_thin.bim | uniq -c


Run the code below to start a quiz. 

In [2]:

from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day3_PopulationStructure/admixture_quiz1.json')



## LD pruning

It is recommended to perform LD pruning before running ADMIXTURE. This is often done on the whole set of individuals using plink, assuming that there is no population structure. However, we expect that there could be a lot of population structure in our data, as the individuals come from many different localities. Therefore, we will use a new method, PCAone, which corrects for population structure using PCA.

To perform LD pruning we have to choose the number of principal components (PCs), which in this case is **-k 6**: we expect 7 different populations, and 6 PCs are enough to model 7 populations because each additional PC can split off one more group. We will use an LD threshold of **r2=0.1**, which removes variants that are in LD with any other variant with a squared correlation coefficient above 0.1. Since we don't want to calculate LD between all pairs of variants in the whole genome, we will estimate LD in a sliding window of size **1000000 bp = 1 Mb**.
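To see why population structure matters for LD pruning, here is a minimal Python sketch with simulated toy genotypes. It is not how PCAone works internally: as a simplifying assumption, the adjustment for structure is mimicked by centring genotypes within two known populations instead of using PCs, and all frequencies and sample sizes are invented for illustration.

```python
import random

def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def genotypes(freq, n, rng):
    """Draw n diploid genotypes (0, 1 or 2 allele copies) for allele frequency freq."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]

def centre(values):
    """Subtract the mean, here used to remove the population-specific mean genotype."""
    m = sum(values) / len(values)
    return [v - m for v in values]

rng = random.Random(1)
n = 500
# Two unlinked variants, drawn independently of each other, in two populations
# with very different allele frequencies (0.9 versus 0.1).
a1, a2 = genotypes(0.9, n, rng), genotypes(0.9, n, rng)
b1, b2 = genotypes(0.1, n, rng), genotypes(0.1, n, rng)

pooled_r2 = r2(a1 + b1, a2 + b2)
adjusted_r2 = r2(centre(a1) + centre(b1), centre(a2) + centre(b2))
print(f"pooled r2 = {pooled_r2:.2f}, structure-adjusted r2 = {adjusted_r2:.3f}")
```

In the pooled sample the two variants look strongly correlated even though they are independent within each population, so pruning without accounting for structure would discard variants that are not in LD at all. PCAone does the adjustment and the windowed pruning for us.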

The command to do so can be seen below.



In [None]:
PCAone -b blue_wildebeest_thin -k 6 --ld-stats 0 --ld-r2 0.1 --ld-bp 1000000

PCAone writes a list of the variants to keep, i.e. variants that are not in LD with each other, to the file pcaone.ld.prune.in. We will extract those sites using plink and create a new set of plink files named blue_wildebeest_noLD.

In [None]:
echo -- number of variants to be kept --
wc -l pcaone.ld.prune.in

echo -e "\n-- Extract variants using plink --"
plink --bfile blue_wildebeest_thin --extract pcaone.ld.prune.in --make-bed --out blue_wildebeest_noLD --chr-set 29



## ADMIXTURE

We are now ready to run ADMIXTURE. First, let's look at the options of the program.

In [None]:
admixture --help

As can be seen in the help text above, we need to give the program our plink file and choose the number of ancestral populations (K). In our case, the most relevant number of assumed ancestral populations is 7, one for each sampling locality. ADMIXTURE uses numerical optimisation that starts from a random initial guess of the parameters. Therefore, we will specify a seed for the random numbers so that we can reproduce the results (otherwise we will get a different result each time we run it).

To make it run faster we will use 10 CPU threads. Run the following (it will take several minutes).


In [None]:

admixture --seed 0 -j10 blue_wildebeest_noLD.bed 7

##if it takes too long then you can stop the run and then copy the results 
##instead with the following code (and adding a # before the command above)
# cp /davidData/data/course/kenyaWorkshop/anders/structure_day3/blue_wildebeest_noLD.7.Q  .


In [None]:
# let's see which files it produced
echo -- files sorted by time. last files are the most recent --
ls -tr


You should find two files ending in .7.Q and .7.P. These contain the estimated ancestry proportions (Q) and the allele frequencies in the 7 inferred populations (P), respectively.
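As a quick illustration of the Q file layout, the Python sketch below parses a small made-up example: one row per individual (in the same order as the fam file), one column per ancestral population, and each row sums to 1. The numbers are invented and use K=3 only to keep the example short.

```python
# Made-up Q file content for 3 individuals and K=3 (not real output)
q_text = """\
0.999980 0.000010 0.000010
0.500000 0.499990 0.000010
0.000010 0.000010 0.999980
"""

q = [[float(v) for v in line.split()] for line in q_text.splitlines()]

majors = []
for i, row in enumerate(q, start=1):
    # index of the largest ancestry proportion, reported as a 1-based component number
    major = max(range(len(row)), key=row.__getitem__) + 1
    majors.append(major)
    print(f"individual {i}: sum={sum(row):.3f}, largest component={major}")
```

The second toy individual shows what an admixed sample looks like: about half of its ancestry is assigned to each of two components.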

### Plotting admixture proportions
Let's plot the results. For this we will use R.


In [None]:
#make plot wide
library("repr")
options(repr.plot.width=17, repr.plot.height=4.5)

#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

# Read in inferred admixture proportions
q <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.Q")

#read in the population labels (first column of fam file)
table(pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1])

#plot admixture proportions
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=2:8)
#legend with one colour per ancestry component (K=7)
legend(0,1.1,fill=2:8,legend=1:7,horiz=TRUE,xpd=TRUE)

 - Do the results look as you expected (why/why not)?

There are several hints that the results might be wrong. For example, do you think it is realistic that the 3 Cookson's wildebeest samples ('C-Luangwa') are all truly admixed with exactly the same admixture proportions?

### EvalAdmix to the rescue
 
To solve this challenge, we can use evalAdmix to diagnose problems with the results of the analysis. This method uses the results from ADMIXTURE to predict the genotypes of each individual, and then tries to identify individuals or populations with a bad fit under the ADMIXTURE model, here with K=7.
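The idea can be sketched in Python with simulated toy data. Under the admixture model the expected genotype of an individual at a variant is 2 * sum_k q_k * f_k, and the residual is the observed genotype minus this prediction. If the model fits, residuals of two individuals should be uncorrelated; if two individuals share ancestry that the model does not capture, their residuals are positively correlated. This is only a simplified illustration: it omits the bias correction that the real tool applies, and all frequencies and individuals are invented.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(geno, q, freqs):
    """Observed genotype minus the model prediction 2 * sum_k q_k * f_k per variant."""
    return [g - 2 * sum(qk * fk for qk, fk in zip(q, f)) for g, f in zip(geno, freqs)]

rng = random.Random(7)
m = 2000  # number of toy variants

def draw(p):
    """One diploid genotype for allele frequency p."""
    return (rng.random() < p) + (rng.random() < p)

f_a = [rng.uniform(0.05, 0.95) for _ in range(m)]  # population A
f_b = [rng.uniform(0.05, 0.95) for _ in range(m)]  # population B
f_c = [rng.uniform(0.05, 0.95) for _ in range(m)]  # population C, missing from the model
model_freqs = list(zip(f_a, f_b))  # the K=2 model only knows A and B

# Two individuals from A, modelled correctly as 100% A
res_a = [residuals([draw(p) for p in f_a], (1.0, 0.0), model_freqs) for _ in range(2)]
# Two individuals from C, wrongly modelled as a 50/50 mix of A and B
res_c = [residuals([draw(p) for p in f_c], (0.5, 0.5), model_freqs) for _ in range(2)]

cor_a = pearson(res_a[0], res_a[1])
cor_c = pearson(res_c[0], res_c[1])
print(f"well-modelled pair: {cor_a:.2f}, badly modelled pair: {cor_c:.2f}")
```

The badly modelled pair mirrors the suspicious pattern in the plot above: individuals that all receive the same mixed proportions may in fact belong to a population that did not get its own component.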
 
 

In [None]:
evalAdmix -plink blue_wildebeest_noLD -fname blue_wildebeest_noLD.7.P \
-qname blue_wildebeest_noLD.7.Q -o blue_wildebeest_noLD.7.eval -P 10



The results from evalAdmix can be plotted in R.

In [None]:
options(repr.plot.width=17, repr.plot.height=12)

r <- as.matrix(read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.eval"))
plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05)


 - Which population(s) does evalAdmix identify as having a bad fit? 


### Evaluating convergence
There are several possible explanations for why the ADMIXTURE model fit is bad. The two most common reasons are 1) the choice of K is sub-optimal, and 2) the algorithm has not converged to the globally best solution. ADMIXTURE tries to find the combination of parameters that maximizes the log likelihood. However, the algorithm does not always find the optimal solution when the likelihood surface is not concave, as illustrated below.

<img src="https://www.mathsisfun.com/algebra/images/function-max-global.svg" alt="image info" />

To test for convergence (i.e. reason (2) mentioned above), we can try many different starting points. If many runs with different starting points consistently lead to the same best log likelihood, then ADMIXTURE has very likely found the globally optimal solution.
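The effect of the starting point can be illustrated with a toy example in Python. The function below is an invented one-dimensional "log likelihood" with a local maximum near x=-1 and the global maximum near x=1; it has nothing to do with the real ADMIXTURE likelihood, and plain gradient ascent stands in for ADMIXTURE's optimiser.

```python
def loglike(x):
    """Toy surface with a local maximum near x=-1 and the global one near x=1."""
    return -(x * x - 1) ** 2 + 0.3 * x

def gradient(x):
    """Derivative of loglike."""
    return -4 * x * (x * x - 1) + 0.3

def climb(start, step=0.01, iterations=2000):
    """Plain gradient ascent: always moves uphill from the starting point."""
    x = start
    for _ in range(iterations):
        x += step * gradient(x)
    return x

starts = [-2.0, -1.5, -0.5, 0.5, 1.5, 2.0]  # play the role of different seeds
runs = [(loglike(climb(s)), s) for s in starts]
for value, start in sorted(runs, reverse=True):
    print(f"start={start:5.1f}  final toy log likelihood={value:.3f}")

best_value, best_start = max(runs)
```

Runs that start on the left end up in the lower peak and report a clearly lower value, while several independent starts agree on the highest value. That agreement is what we look for among the ADMIXTURE runs.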

To save time, we have pre-run ADMIXTURE using 10 other seeds. The results are found in the folder multiRunK7.




In [None]:
K=7
#mkdir -p multiRunK$K
#for seed in 1 2 3 4 5 6 7 8 9 10
#do
#admixture --seed $seed -j70 blue_wildebeest_noLD.bed $K | tee multiRunK$K/blue_wildebeest_noLD.$K.log_$seed
#mv blue_wildebeest_noLD.$K.Q multiRunK$K/blue_wildebeest_noLD.$K.Q_$seed 
#mv blue_wildebeest_noLD.$K.P multiRunK$K/blue_wildebeest_noLD.$K.P_$seed
#done


ls multiRunK$K

We can extract the likelihoods for each of the 10 runs started with different seeds, and sort them according to their values. 

In [None]:
# numeric sort on the log likelihood: the best (highest) value is printed last
grep ^Loglikelihood multiRunK7/blue_wildebeest_noLD.7.log* | sort -t " " -k2,2n
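If you prefer to do the comparison programmatically, a small Python sketch like the one below parses lines of the shape printed by the grep command and reports each run's distance to the best one. The three lines and their values are made up for illustration and are not results from this data set.

```python
import re

# Made-up lines in the shape printed by the grep command (values are not real results)
grep_output = """\
multiRunK7/blue_wildebeest_noLD.7.log_1:Loglikelihood: -1000250.500000
multiRunK7/blue_wildebeest_noLD.7.log_2:Loglikelihood: -1000100.250000
multiRunK7/blue_wildebeest_noLD.7.log_3:Loglikelihood: -1000250.750000
"""

# capture the seed from the file name and the log likelihood value
pattern = re.compile(r"log_(\d+):Loglikelihood:\s*(-?\d+(?:\.\d+)?)")
likelihoods = {int(seed): float(value) for seed, value in pattern.findall(grep_output)}

best_seed = max(likelihoods, key=likelihoods.get)
for seed, value in sorted(likelihoods.items(), key=lambda item: item[1], reverse=True):
    diff = value - likelihoods[best_seed]
    print(f"seed {seed}: {value:.2f} (difference to best: {diff:.2f})")
```

Looking at the differences rather than the raw values makes it easier to see whether several seeds reached (almost) the same maximum.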


 - Which seed has the highest likelihood?
 - Your first run used seed 0. Did that run find a local or the global maximum? (You can find its likelihood at the bottom of the output from when you first ran the program above.)
 
 
Let's try to plot the results from the seed with the best likelihood.

In [None]:
options(repr.plot.width=17, repr.plot.height=4.5)


#read in code to plot admixture proportions ( plotAdmix function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")


# Read in inferred admixture proportions
q <- read.table("~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.Q_4")

#read in the population labels (first column of fam file)
pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]

#make the plot. 
plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=2:8)

 - Did the results improve?
 - How many individuals can you find that are admixed between two subspecies?
 
 
 Let's use evalAdmix to see if the model fit is better now.

In [None]:
evalAdmix -plink blue_wildebeest_noLD -fname ~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.P_4 \
-qname ~/kenya2024/admixture/multiRunK7/blue_wildebeest_noLD.7.Q_4 -o blue_wildebeest_noLD.7.eval_4 -P 10




Let's plot the results.

In [None]:
#make plot wide
library("repr")
options(repr.plot.width=17, repr.plot.height=12)

#read in plotting code (plotCorRes function)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

#read in the population labels (first column of fam file)
pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]

r <- as.matrix(read.table("~/kenya2024/admixture/blue_wildebeest_noLD.7.eval_4"))
plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05)


 - How good is the fit this time?
 - There are a few remaining pairs of individuals with a strong positive correlation. What do you think is the reason?
 
 # Bonus exercise (only if you have time)
 
 ### Running ADMIXTURE for multiple K
 
 
 Running ADMIXTURE for every K takes some time, so we have pre-computed the results with the code below, using 3 seeds per K value.

In [None]:
 
#mkdir -p allK

#for K in 1 2 3 4 5 6 7
#do
#  for seed in 1 2 3 
#  do
#    admixture --seed $seed -j70 blue_wildebeest_noLD.bed $K | tee allK/blue_wildebeest_noLD.$K.log_$seed
#    mv blue_wildebeest_noLD.$K.Q allK/blue_wildebeest_noLD.$K.Q_$seed 
#    mv blue_wildebeest_noLD.$K.P allK/blue_wildebeest_noLD.$K.P_$seed
#  done
#done


ls allK/


We can plot the results in R.


In [None]:
options(repr.plot.width=17, repr.plot.height=9)

# possible K
Kall <- 2:7

# Q files from seed 1; build the names from K so that each K reads its own file
files <- paste0("~/kenya2024/admixture/allK/blue_wildebeest_noLD.",Kall,".Q_1")
print(files)

## read Qs
allQ <- list()
for(K in Kall)
    allQ[[K]]<-t(read.table(files[K-min(Kall)+1]))


source("https://raw.githubusercontent.com/popgenDK/admixturePlot/main/admixFun.R")

pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]
palette(palette()[-1])
plotMulti(allQ,Kall=Kall,as.factor(pop))

 - What determines the order in which populations get their own ancestry component? In other words, is there a logic to which populations are distinguishable at lower K?
 - Which K is the best one?

Let's use evalAdmix to evaluate the fit for each choice of K.


In [None]:
K=2

for K in 1 2 3 4 5 6 7
  do
    echo ---- Running for K=$K -----------------------
    evalAdmix -plink blue_wildebeest_noLD -fname ~/kenya2024/admixture/allK/blue_wildebeest_noLD.$K.P_1 \
    -qname ~/kenya2024/admixture/allK/blue_wildebeest_noLD.$K.Q_1 -o blue_wildebeest_noLD.$K.eval_1 -P 10
  done

Let's plot the results in R.

In [None]:
#make plot wide
library("repr")

#read in plotting code (plotAdmix and plotCorRes functions)
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

#read in the population labels (first column of fam file)
pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]
    options(repr.plot.width=14, repr.plot.height=4)

# Read in inferred admixture proportions
K=2
for(K in 2:7){
   
    q <- read.table(paste0("~/kenya2024/admixture/allK/blue_wildebeest_noLD.",K,".Q_1"))

    #read in the population labels (first column of fam file)
    pop <- read.table("~/kenya2024/admixture/blue_wildebeest_noLD.fam")[,1]

    #make the plot. 
 
    plotAdmix(q,pop=pop,rotatelab=15,padj=0.15,cex.lab=1.4,col=2:8)
    
    options(repr.plot.width=14, repr.plot.height=4)

    r <- as.matrix(read.table(paste0("~/kenya2024/admixture/blue_wildebeest_noLD.",K,".eval_1")))

    plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05,title=paste0("Correlation of residuals with K=",K))
   options(repr.plot.width=14, repr.plot.height=11)

}
#for(K in 2:7){
#    r <- as.matrix(read.table(paste0("~/kenya2024/admixture/blue_wildebeest_noLD.",K,".eval_1")))
#    plotCorRes(r, pop=pop, max_z = 0.25,rotatelabpop =20,adjlab = .05,title=paste0("Correlation of residuals with K=",K))
#}


 - How high a K is needed to get a good fit?
 - At K=4, Selous and Luangwa share the same component. Can you conclude from this that they are the same population? How can you use evalAdmix to help your interpretation?
 