# Exercise: Inference of admixture and population structure

## A. Use of NGSadmix to infer admixture proportions for numerous individuals

In this exercise we will use NGSadmix to analyze an NGS dataset and evalAdmix to assess the results.

### Log in to the server and set paths
First run the following code to set the paths to all programs and data needed:

In [None]:
# Set path to ANGSD program
ANGSD=/emily/program/bin/angsd

# Set path to NGSadmix
NGSadmix=NGSadmix

# Set path to the folder that contains the bam files
BAMFOLDER=/course/popgen23/ida/admixexercise/smallbams

# Make directory for all the results
cd 
mkdir -p admixtureexercise
cd admixtureexercise

## First small example

We will first try to run an NGSadmix analysis of a small dataset consisting of bam files with low depth NGS data from 435 samples of 5 human populations from the 1000 genomes project:


| Population code | Population                                     | Sample size |
|-----------------|------------------------------------------------|-------------|
| ASW             | HapMap African ancestry individuals from SW US | 61          |
| CEU             | European individuals                           | 99          |
| CHB             | Han Chinese in Beijing                         | 103         |
| YRI             | Yoruba individuals from Nigeria                | 108         |
| MXL             | Mexican individuals from LA California         | 64          |


### Make input data using ANGSD

The input to NGSadmix is genotype likelihoods (GLs). Therefore the first step of running an NGSadmix analysis (if all you have are BAM files) is to calculate GLs. So let's start by doing that. First make a file that contains the paths of all the BAM files by finding the names of all BAM files in the relevant folder and saving them in a file called all.files:

In [None]:
find $BAMFOLDER | grep bam$ > all.files

To see the content of the file you made type:

In [None]:
cat all.files

Now calculate GLs from all the BAM files using ANGSD by running the following command in the terminal:

In [None]:
$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out all -P 5

NOTE that this will take a bit of time to run (around a minute). While waiting, try to remember what the different options used mean (you have seen most of them in previous exercises). If you do not remember all of them, then try to ask the person next to you. And if neither of you remember then try to figure it out by looking for help on the ANGSD webpage e.g. [here](http://www.popgen.dk/angsd/index.php/Genotype_Likelihoods), [here](http://www.popgen.dk/angsd/index.php/Major_Minor) and [here](http://www.popgen.dk/angsd/index.php/Filters).



## Explore the input data

Now let's have a look at the GL file that you have created with ANGSD. It is a "beagle format" file called all.beagle.gz - and will be the input file to NGSadmix. The first line in this file is a header line and after that it contains a line for each locus with GLs. By using the unix command wc we can count the number of lines in the file:

In [None]:
gunzip -c all.beagle.gz | wc -l

- Use this to find out how many loci there are GLs for in the data set (remember that the first line is a header).
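
Because the first line of a beagle file is a header, the number of loci is the line count minus one. A minimal sketch of that arithmetic on a made-up three-locus file (the file name `toy.beagle.gz` and all values in it are invented for illustration):

```shell
# Toy beagle-style file (invented values): 1 header line + 3 loci, 2 individuals
tmpdir=$(mktemp -d)
{
  printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\tInd1\tInd1\tInd1\n'
  printf '1_100\t0\t2\t0.80\t0.15\t0.05\t0.33\t0.33\t0.34\n'
  printf '1_200\t1\t3\t0.05\t0.90\t0.05\t0.10\t0.10\t0.80\n'
  printf '1_300\t0\t1\t0.33\t0.33\t0.34\t0.98\t0.01\t0.01\n'
} | gzip -c > "$tmpdir/toy.beagle.gz"

# Count all lines (tr strips the padding some wc versions add)
n_lines=$(gunzip -c "$tmpdir/toy.beagle.gz" | wc -l | tr -d ' ')

# Subtract the header line to get the number of loci
n_loci=$((n_lines - 1))
echo "lines: $n_lines, loci: $n_loci"
```

The same subtraction applies to the line count you get for all.beagle.gz.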

Next, to get an idea of what the GL file contains, try printing the first 9 columns of the first 10 lines of the file from the command line:

In [None]:
zcat all.beagle.gz | awk -v N=10 'NR<=N' | cut -f1-9 | column -t

In general, the first three columns of a beagle file contain the marker name and the two alleles, allele1 and allele2, present at the locus (in beagle A=0, C=1, G=2, T=3). All following columns contain genotype likelihoods (three columns for each individual: first the GL for homozygous allele1, then the GL for heterozygous and then the GL for homozygous allele2). Note that the GL values sum to one per site for each individual. This is just a normalization of the genotype likelihoods to avoid underflow problems in the beagle software; it does not mean that they are genotype probabilities.

- Based on this, what is the most likely genotype for ind0 in the first locus? 
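
The column layout described above can be read mechanically: decode the two allele codes, then pick the largest of the three GLs for an individual. The sketch below does this for the first individual on two invented lines (the marker names and GL values are made up, not taken from all.beagle.gz) and also prints the sum of the three GLs to show the normalization:

```shell
# Toy beagle-style lines (invented values), tab separated:
# marker, allele1, allele2, then the three GLs for Ind0
toy=$(printf 'marker\tallele1\tallele2\tInd0\tInd0\tInd0\n1_100\t0\t2\t0.80\t0.15\t0.05\n1_200\t1\t3\t0.05\t0.90\t0.05\n')

calls=$(printf '%s\n' "$toy" | awk -F'\t' '
  NR > 1 {
    # Decode the beagle allele codes (0=A, 1=C, 2=G, 3=T)
    a1 = substr("ACGT", $2 + 0 + 1, 1)
    a2 = substr("ACGT", $3 + 0 + 1, 1)

    # Find which of columns 4-6 holds the largest GL
    best = 4
    for (i = 5; i <= 6; i++) if ($i + 0 > $best + 0) best = i

    # Column 4 = homozygous allele1, 5 = heterozygous, 6 = homozygous allele2
    geno = (best == 4) ? (a1 a1) : (best == 5) ? (a1 a2) : (a2 a2)
    printf "%s %s sum=%.2f\n", $1, geno, $4 + $5 + $6
  }')
echo "$calls"
```

Keep in mind that the largest GL only identifies the most likely genotype given the reads; with low depth data the three values are often close to each other.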

## Run an analysis of the data with NGSadmix

Now you know what the input looks like. Next, let's perform an NGSadmix analysis of the GLs, assuming the number of ancestral populations, K, is 3:

In [None]:
$NGSadmix -likes all.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o all

- When it is done you will see some output above. While waiting for the analysis to finish, please make sure you understand the command you ran. If you are in doubt, seek help [here](http://www.popgen.dk/software/index.php/NgsAdmix#Parameters). There you can also see what other options you have when you run an NGSadmix analysis.


## Explore the output

The output from the analysis you just ran is three files:

- all.log (a "log file" that summarizes the analysis run)
- all.fopt.gz (an "fopt file", which has a line for each locus that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations)
- all.qopt (a "qopt file", which has a line for each individual that contains an estimate of the individual's ancestry proportion from each of the three assumed ancestral populations).

Let's have a look at them one at a time. First, check the log file by typing

In [None]:
cat all.log

- What is the log likelihood of the estimates achieved by NGSadmix (called "best like" in the log file)?

Next, check the first line of the fopt file by typing:

In [None]:
zcat all.fopt.gz | awk 'NR==1'

- Based on this: what is the estimated allele frequency at the first locus in the three assumed ancestral populations?

Finally, check the 6th line of the qopt file and thus the estimated admixture proportions for the 6th individual by typing:

In [None]:
head -n6 all.qopt | tail -n1

- Based on this: does the individual look admixed?

You can see the ID of this individual by getting the 6th line of the file with the paths of all your original bam files, which you created in the beginning:

In [None]:
head -n6 all.files | tail -n1 

- Based on that ID, which population does the individual come from?
- Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?
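
Under the admixture model that NGSadmix fits, the expected allele frequency for an individual at a locus is the sum over ancestral populations of the individual's admixture proportion times the frequency in that ancestral population. For an unadmixed individual this reduces to the frequency in its single ancestral population. A small sketch of the calculation with invented numbers (the values of `q` and `f` below are not course results):

```shell
# Invented example values:
q="0.20 0.75 0.05"   # one qopt line: admixture proportions for K=3
f="0.10 0.60 0.90"   # one fopt line: allele frequency in each ancestral population

# Multiply the two lines element by element and sum the products
h=$(printf '%s\n%s\n' "$q" "$f" | awk '
  NR == 1 { K = NF; for (k = 1; k <= K; k++) q[k] = $k; next }
  { for (k = 1; k <= K; k++) h += q[k] * $k
    printf "%.3f\n", h }')
echo "expected allele frequency: $h"
```

With these numbers the result is 0.20*0.10 + 0.75*0.60 + 0.05*0.90 = 0.515.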

## Plot the admixture proportion estimates

Finally, try to make a simple plot of the estimated admixture proportions for all the individuals by opening the statistical program R (which you do by typing "R" in the terminal and pressing enter) and then copy-pasting the following code:

In [None]:
# Get ID and pop info for each individual
s<-strsplit(basename(scan("admixtureexercise/all.files",what="character")),"\\.")
pop<-sapply(s,function(x) x[5]) # assumes 1000 Genomes file naming, where the 5th dot-separated field is the population code

# Import some functions to help in visualization
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

# Read in inferred admixture proportions
q<-read.table("admixtureexercise/all.qopt")

# Order individuals by population, and within population by admixture proportion
ord <- orderInds(pop=pop,q=q, popord=c("YRI", "ASW", "CEU", "MXL", "CHB"))

# Make plot            
par(mar=c(7,4,1,1))
barplot(t(q)[,ord],col=c(3,2,4),las=2,xlab="Individuals",ylab="Admixture proportions", space=0, border=NA)
text(sort(tapply(1:length(pop),pop[ord],mean)),-0.05,unique(pop[ord]),xpd=NA)
abline(v=cumsum(sapply(unique(pop[ord]),function(x){sum(pop[ord]==x)})),col=1,lwd=1.2)
            

Note that the order of the individuals in the plot is not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to the population they are sampled from.

- Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
- What does the plot suggest about whether the individuals are admixed?

NB: As you could tell from the number of loci included, the above analysis is based on data from very few loci (we deliberately analyzed data from only a small part of the genome to make sure the analysis ran fast). In the following we will redo the analyses using a larger number of sites.


## More realistic example

Now you know how to make input data for NGSadmix, how to run NGSadmix and what the output looks like. We will now try to analyze a more realistic dataset, using the same samples with a larger number of sites. We have already made the input file with genotype likelihoods for 100 000 sites across the genome, and a file with population info:


- A file with genotype likelihoods from the 100 individuals in beagle format: /course/popgen23/ida/admixexercise/admixinput/1000G5pops100ksites.beagle.gz
- A file with labels that indicate which population they are sampled from: /course/popgen23/ida/admixexercise/admixinput/1000G5pops.pop.info


## Run an analysis of the data with NGSadmix

We start by running an NGSadmix analysis with K=3 (-K 3), using 10 CPU threads (-P 10) and using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05). Furthermore, to make sure we reach the maximum likelihood solution and not a local optimum, we should run 20 independent optimization runs (-seed i for i in 1:20).

(NB: Because this would be too computationally intensive for everyone to run at the same time on the server, we have already run it, and the following code just prints the commands you would need to run.)

In [None]:
inputpath=/course/popgen23/ida/admixexercise/admixinput/1000G5pops.inputgl.beagle.gz
outpath=/course/popgen23/ida/admixexercise/admixoutput
K=3

for i in `seq 1 20`
do
    echo "$NGSadmix -likes $inputpath -K $K -P 10 -minMaf 0.05 -seed $i -o ${outpath}/1000G5popsAdmixK${K}seed${i}"
done


This will produce 20 NGSadmix results with their corresponding output files. In order to assess convergence and find the run with the best log likelihood, we need to check the log likelihood of each run. This command will extract the log likelihood of each run from its log file, add the seed and sort the runs from the best to the worst log likelihood.

In [None]:
outadmix=/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmix
K=3

rm -f allK$K.likes
for i in `seq 1 20`
do 
    cat ${outadmix}K${K}seed$i.log | grep "best like" | awk -F"[ =]" '{print $3}' >> allK$K.likes
done

cat -n allK$K.likes | sort -rhk2

- Does it look like convergence was reached?
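
One rough way to judge convergence is to check whether the best few runs reach nearly the same log likelihood. The sketch below does this on invented values. The tolerance of 2 log-likelihood units and the choice to compare the top 3 runs are assumptions made for illustration, not settings prescribed by the course or by NGSadmix:

```shell
# Invented log likelihoods, one per seed (line number = seed)
tmp=$(mktemp)
printf '%s\n' -1250340.2 -1250101.7 -1250100.9 -1250455.0 -1250101.1 > "$tmp"

# Prefix each value with its seed, sort best (least negative) first, keep the top 3
top3=$(awk '{print NR, $1}' "$tmp" | LC_ALL=C sort -k2,2gr | head -n 3)
echo "$top3"

# Difference between the best and the third-best run
spread=$(echo "$top3" | awk 'NR == 1 {best = $2} NR == 3 {printf "%.1f\n", best - $2}')

tol=2   # assumed tolerance in log-likelihood units
if awk -v s="$spread" -v t="$tol" 'BEGIN { exit !(s <= t) }'; then
  verdict="top 3 runs agree"
else
  verdict="top 3 runs differ"
fi
echo "spread=$spread: $verdict"
```

If the top runs differ by much more than the tolerance you choose, running more seeds is a common remedy.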

We continue by visualizing the results of the run with the highest likelihood:

In [None]:
# Import some functions to help in visualization
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R") 

# Read in info to plot
pop<-read.table("/course/popgen23/ida/admixexercise/admixinput/1000G5pops.pop.info",as.is=T)
q<-read.table("/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmixK3seed3.qopt")

# Sort individuals by population and within populations by admixture proportion
ord<-orderInds(pop = pop[,1], q=q) 

# Make plot
barplot(t(q)[,ord],col=2:10,space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T) # add population labels
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

- Why do you think it looks cleaner than the previous admixture plot we visualized with the same individuals?
- How many populations would you now say are admixed? Which populations seem to be the admixture sources? Does that make sense given what you know of these populations?

## Assessing model fit

We will now use evalAdmix to assess if the ancestries inferred in our admixture results are a good approximation to the correct ancestries.

Again, due to time and computational limitations, we have already run the command and here just provide the command used:

In [None]:
EVALADMIX=evalAdmix

K=3
besti=3
inbgl=/course/popgen23/ida/admixexercise/admixinput/1000G5pops.inputgl.beagle.gz
inadmix=/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmixK${K}seed${besti}
out=/course/popgen23/ida/admixexercise/evaladmixoutput/1000G5pops.K${K}seed${besti}.corres

echo "$EVALADMIX -beagle $inbgl -fname ${inadmix}.fopt.gz -qname ${inadmix}.qopt -o $out -P 20"

We will now visualize the correlation of residuals estimated by evalAdmix, and use it to assess whether the estimated admixture proportions are a good fit to the data.

In [None]:
# Import some functions to help in visualization
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R")

# Read in info to plot
pop<-read.table("/course/popgen23/ida/admixexercise/admixinput/1000G5pops.pop.info",as.is=T)
q<-read.table("/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmixK3seed3.qopt")
r <- as.matrix(read.table("/course/popgen23/ida/admixexercise/evaladmixoutput/1000G5pops.K3seed3.corres"))

# Sort individuals by population and within populations by admixture proportion
ord<-orderInds(pop = pop[,1], q=q) 

# Make plot
plotCorRes(r, pop=pop[,1], ord=ord, max_z = 0.2)

- Is there any population for which the estimated admixture proportions do not seem to have a good fit?
- Looking at the admixture proportions plot, can you think of a reason why that population might not be correctly modelled?
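
If you prefer a number to go with the plot, one option is to average the residual correlations over pairs of individuals from the same population: values near zero suggest a good fit, while clearly positive values suggest structure the model has not captured. The sketch below uses an invented 4x4 matrix and invented labels (`popA`, `popB`). The `NA` diagonal is an assumption about the file layout, and the code only reads the upper triangle so it never touches the diagonal:

```shell
# Invented residual-correlation matrix and population labels (not course results)
tmpdir=$(mktemp -d)
printf '%s\n' 'NA 0.01 -0.02 0.00' \
              '0.01 NA 0.00 -0.01' \
              '-0.02 0.00 NA 0.15' \
              '0.00 -0.01 0.15 NA' > "$tmpdir/toy.corres"
printf '%s\n' popA popA popB popB > "$tmpdir/toy.pop"

# Mean correlation over all pairs of different individuals from the same population
within=$(awk '
  NR == FNR { pop[FNR] = $1; next }        # first file: one label per individual
  { for (j = FNR + 1; j <= NF; j++)        # second file: upper triangle only
      if (pop[FNR] == pop[j]) { sum[pop[j]] += $j; n[pop[j]]++ } }
  END { for (p in sum) printf "%s %.2f\n", p, sum[p] / n[p] }
' "$tmpdir/toy.pop" "$tmpdir/toy.corres" | sort)
echo "$within"
```

In this toy matrix the popB pair has a clearly positive residual correlation, which is the kind of pattern that indicates a poor fit for that population.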

## Trying other values of K

We will now redo the analyses using 4 instead of 3 ancestral populations. We start by doing 20 independent runs of NGSadmix, with the same settings except that this time we use K = 4.

We then collect the likelihoods from the log files and look at them to assess if the optimization has converged to the global maximum likelihood.

Again, the analyses have already been run and we just provide the code to print the commands.

- Can you spot how the code is different from the code for K=3?

In [None]:
inputpath=/course/popgen23/ida/admixexercise/admixinput/1000G5pops.inputgl.beagle.gz
outpath=/course/popgen23/ida/admixexercise/admixoutput
K=4

for i in `seq 1 20`
do
    echo "$NGSadmix -likes $inputpath -K $K -P 10 -minMaf 0.05 -seed $i -o ${outpath}/1000G5popsAdmixK${K}seed${i}"
done

outadmix=/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmix
rm -f allK${K}.likes
for i in `seq 1 20`
do 
    cat ${outadmix}K${K}seed$i.log | grep "best like" | awk -F"[ =]" '{print $3}' >> allK${K}.likes
done

cat -n allK${K}.likes | sort -rhk2

- Does it look like it has converged?

We will now run evalAdmix to assess the model fit of the best admixture run (again, it has been pre run and we just print the command):

In [None]:
EVALADMIX=evalAdmix

K=4
besti=9
inbgl=/course/popgen23/ida/admixexercise/admixinput/1000G5pops.inputgl.beagle.gz
inadmix=/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmixK${K}seed${besti}
out=/course/popgen23/ida/admixexercise/evaladmixoutput/1000G5pops.K${K}seed${besti}.corres

echo "$EVALADMIX -beagle $inbgl -fname ${inadmix}.fopt.gz -qname ${inadmix}.qopt -o $out -P 20"

We will now visualize the estimated admixture proportions and the correlation of residuals to assess their fit:

In [None]:
source("https://raw.githubusercontent.com/GenisGE/evalAdmix/master/visFuns.R") # import some functions to help in visualization
pop<-read.table("/course/popgen23/ida/admixexercise/admixinput/1000G5pops.pop.info",as.is=T)
q<-read.table("/course/popgen23/ida/admixexercise/admixoutput/1000G5popsAdmixK4seed9.qopt")


ord<-orderInds(pop = pop[,1], q=q) # sort individuals by population and within population by admixture proportion

# plot admixture proportions
barplot(t(q)[,ord],col=c(5,4,2,3),space=0,border=NA,xlab="Individuals",ylab="Admixture proportions")
text(sort(tapply(1:nrow(pop),pop[ord,1],mean)),-0.05,unique(pop[ord,1]),xpd=T) # add population labels
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)

r <- as.matrix(read.table("/course/popgen23/ida/admixexercise/evaladmixoutput/1000G5pops.K4seed9.corres"))
plotCorRes(r, pop=pop[,1], ord=ord, max_z = 0.2)

- What population does the new cluster that we have added correspond to?
- Based on the correlation of residuals, would you say adding that cluster has given a significant improvement to the model fit?


# B. Use of fastNGSadmix to infer admixture proportions for 3 samples 
Let's try to use fastNGSadmix. Specifically, let's see if we can use it to infer the ancestry of 3 samples: sample1, sample2 and sample3.

## Setup paths

To do this, first set up some paths by typing this in the terminal:

In [None]:
# Set path for the fastNGSadmix program
fastNGSadmix=fastNGSadmix

# Set path for all input files you will use in this exercise
inputpath=/course/popgen23/ida/admixexercise/fastNGSadmixinput/

## Explore the files with the reference panel
As reference panel we will use data from these 7 populations:

| Population code/name | Description                                    | 
|-----------------|------------------------------------------------|
| French |	French individuals |
| Han	 |  Chinese individuals|
| Chukchi|	Siberian individuals |
| Karitiana	| Native American individuals |
| Papuan |	Individuals from Papua New Guinea, Melanesia |
| Sindhi |	Individuals from India |
| YRI	 | Yoruba individuals from Nigeria |

The files with genotype likelihood (GL) data from sample1, sample2 and sample3 are in beagle format, so exactly the same format as the input files you used for NGSadmix. So let's not spend time looking at those. But before we start analysing the data, have a quick look at the files with the reference panel (nInd.txt and refPanel.txt) so you know how they look (in case you at some point want to create your own reference panel, for which there are scripts that come with fastNGSadmix). You can do this by running the following commands:

In [None]:
# Show the full content of the file nInd.txt
# (a file that has info how many samples from each population the panel consists of)
echo "Content of nInd.txt:"
cat ${inputpath}/nInd.txt | column -t

# Show the top 2 lines of the file refPanel.txt
# (a file that has info about allele frequencies for the 7 populations)
echo ""
echo "First 2 lines of refPanel.txt"
head -n2 ${inputpath}/refPanel.txt | column -t

- How many samples from each population does the reference panel consist of?
- What are the allele frequencies of the first SNP for each of the populations?


## Analyse the samples with fastNGSadmix

Let's try to run fastNGSadmix on the 3 samples one at a time with the following commands:

In [None]:
# Analyse sample1
$fastNGSadmix -likes ${inputpath}/sample1.beagle -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample1 -whichPops all -conv 10 -seed 1

# Analyse sample2
$fastNGSadmix -likes ${inputpath}/sample2.beagle.gz -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample2 -whichPops all -conv 10 -seed 1

# Analyse sample3
$fastNGSadmix -likes ${inputpath}/sample3.beagle.gz -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample3 -whichPops all -conv 10 -seed 1

As you can see, the way to run it is similar to NGSadmix. The option -likes is the same, and -outfiles is the equivalent of -o in NGSadmix. But now we also have -fname and -Nname, which allow you to specify the files with your reference panel. Also notice that you can ask for multiple runs with different starting points using the option -conv, which makes it easier to ensure convergence. And then there is one more parameter that has to be set, namely -whichPops, which allows you to specify that you only want to use a subset of the populations in the reference panel, or that you want to use all populations. So e.g. you can re-analyse sample1 using only 6 of the 7 populations in your reference panel (excluding the French):

In [None]:
# Re-analyse sample1 with a smaller reference panel
$fastNGSadmix -likes ${inputpath}/sample1.beagle -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample1V2 -whichPops Han,YRI,Sindhi,Papuan,Chukchi,Karitiana -conv 10 -seed 1

## Take a look at the output files

The output is very similar to that of NGSadmix. There is no fopt file, but there is a log file and a qopt file.

Try to look in the log files for the four analyses using the command cat, so e.g. for sample1 type:

In [None]:
# Show content of sample1.log 
cat sample1.log

- How many loci are the 4 different analyses based on (this is in the log files and is called Overlap)?

Next try to have a look at the qopt file for sample 1 (which like for NGSadmix contains the estimated admixture proportion for the sample):

In [None]:
# Show content of sample1.qopt
cat sample1.qopt | column -t

- Does the sample look admixed?

## Plot the analysis results

Instead of looking at all the qopt files, open R and plot the results for all 4 analyses:

In [None]:
# Plot results for first analysis of sample1
admix<-read.table("admixtureexercise/sample1.qopt",as.is=T,h=T)
barplot(as.matrix(admix),ylab="Admixture proportion",col="red",ylim=c(0,1),main="Sample1")

# Plot results for 2nd analysis of sample1 (the one where we excluded French from the reference panel)
admix<-read.table("admixtureexercise/sample1V2.qopt",as.is=T,h=T)
barplot(as.matrix(admix),ylab="Admixture proportion",col="red",ylim=c(0,1),main="Sample1 (re-analysed)")

# Plot results for analysis of sample2
admix<-read.table("admixtureexercise/sample2.qopt",as.is=T,h=T)
barplot(as.matrix(admix),ylab="Admixture proportion",col="red",ylim=c(0,1),main="Sample2")

# Plot results for analysis of sample3
admix<-read.table("admixtureexercise/sample3.qopt",as.is=T,h=T)
barplot(as.matrix(admix),ylab="Admixture proportion",col="red",ylim=c(0,1),main="Sample3")

Based on the results: 
- what ancestry do you think the three samples have (ignore the second analysis of sample1 for now)?

Now look at the results of the second analysis of sample1 (for which a different reference panel was used). 

- Why do you think the result depends on the reference panel and what are the consequences?

Finally

- Do you trust the results for sample2 and sample3 given the number of loci it is based on?

In order to investigate this we can let fastNGSadmix run with bootstraps, where we randomly sample (with replacement), the sites the analysis is based on. This tells us something about how susceptible our estimates are to change. Try to run fastNGSadmix with 100 bootstraps for sample2 and sample3:

In [None]:
$fastNGSadmix -likes ${inputpath}/sample2.beagle.gz -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample2boot -whichPops all -boot 100

$fastNGSadmix -likes ${inputpath}/sample3.beagle.gz -fname ${inputpath}/refPanel.txt -Nname ${inputpath}/nInd.txt -outfiles sample3boot -whichPops all -boot 100

Notice that now the FIRST row of each .qopt file is the estimated ancestry based on ALL sites, and that the subsequent rows are the estimates based on the bootstraps.
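
To make that layout concrete, here is a sketch that splits a qopt-style file into the all-sites estimate and the bootstrap rows. The file content is invented (3 populations, 3 bootstraps), and the presence of a header line with population names is an assumption, chosen to match how the R code in this exercise reads these files with `h=T`:

```shell
# Invented qopt-style file: header, estimate from all sites, then 3 bootstrap replicates
tmp=$(mktemp)
printf '%s\n' 'French Han YRI' \
              '0.70 0.10 0.20' \
              '0.68 0.12 0.20' \
              '0.73 0.08 0.19' \
              '0.69 0.11 0.20' > "$tmp"

# First data row (file line 2): estimate based on all sites
full=$(awk 'NR == 2' "$tmp")

# Remaining rows: one per bootstrap replicate
n_boot=$(awk 'NR > 2' "$tmp" | wc -l | tr -d ' ')

# Smallest and largest bootstrap value for the first population column
range=$(awk 'NR > 2 { v = $1 + 0
                      if (min == "" || v < min) min = v
                      if (max == "" || v > max) max = v }
             END { print min, max }' "$tmp")

echo "all sites: $full"
echo "bootstraps: $n_boot, first column ranges over: $range"
```

The minimum and maximum are only a crude summary of the spread; the R code in this exercise uses the 0.025 and 0.975 quantiles instead.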

Now let's try to plot the results for sample3 in R by opening R and typing:

In [None]:
# Plot estimates
admix<-read.table("admixtureexercise/sample3boot.qopt",as.is=T,h=T)
b<-barplot(as.matrix(admix[1,]),main="sample3",ylab="Admixture proportion",col="red",ylim=c(0,1))

# Plot confidence intervals
## - first we take the 0.025 and 0.975 sample quantiles for constructing the confidence interval for our estimates
lower<-as.numeric(apply(admix,2,function(x) quantile(x[2:length(x)],probs=c(0.025))))
upper<-as.numeric(apply(admix,2,function(x) quantile(x[2:length(x)],probs=c(0.975))))

## - then we plot them
segments(b,lower,b,upper)
segments(b-0.2,lower,b+0.2,lower)
segments(b-0.2,upper,b+0.2,upper)

- Can you say with confidence what the ancestry of this sample is?

Let us also plot sample2 with 100 bootstraps:

In [None]:
# Plot estimates
admix<-read.table("admixtureexercise/sample2boot.qopt",as.is=T,h=T)
b<-barplot(as.matrix(admix[1,]),main="sample2",ylab="Admixture proportion",col="red",ylim=c(0,1))

# Plot confidence intervals
## - first we take the 0.025 and 0.975 sample quantiles for constructing the confidence interval for our estimates
lower<-as.numeric(apply(admix,2,function(x) quantile(x[2:length(x)],probs=c(0.025))))
upper<-as.numeric(apply(admix,2,function(x) quantile(x[2:length(x)],probs=c(0.975))))

## - then we plot them
segments(b,lower,b,upper)
segments(b-0.2,lower,b+0.2,lower)
segments(b-0.2,upper,b+0.2,upper)

- Which sample are you more sure about the ancestry of? 
- What explains the difference between the 2 plots?


# C. Try to use the tool admixture instead on a dataset with called genotypes

Please go to 

https://github.com/popgenDK/courses/tree/main/kenya2024/exercises/day3_PopulationStructure

and get the Jupyter notebook called Day3_AdmixtureV2.ipynb and run it. Note that if you get some weird error messages in the left-hand side of the screen when you upload the notebook to emily, don't worry about it; you can most likely get rid of them by pressing "Console".
