# ADMIXTURE (Alexander 2009)

Extract from my PhD thesis: http://dspace.ut.ee/bitstream/handle/10062/82088/molinaro_ludovica.pdf?sequence=6&isAllowed=y

It is possible to summarise the genetic information without a priori information by clustering the target individuals based on their genetic patterns, highlighting population structure within the dataset. 

### Clustering
Given a number of clusters K, clustering algorithms group samples together based on their similarity. They assign each individual to every cluster with a probability of belonging to that cluster, defined as the membership coefficient. Such assignments occur SNP-wise to account for multiple ancestries within one genome. In this way, each individual will have several membership coefficients (one per cluster), each summarising the proportion of DNA for which that individual is most closely related to the other individuals in the corresponding cluster.

## ADMIXTURE:
MANUAL: https://dalexander.github.io/admixture/admixture-manual.pdf

Extract from ADMIXTURE manual:  

ADMIXTURE is a program for estimating ancestry in a model-based manner from large autosomal SNP genotype datasets, where the individuals are unrelated (for example, the individuals in a case-control association study). ADMIXTURE’s input is binary PLINK (.bed), ordinary PLINK (.ped), or EIGENSTRAT (.geno) formatted files and its output is simple space-delimited files containing the parameter estimates.

To use ADMIXTURE, you need an input file and an idea of `K` (a number that might reflect the number of ancestral populations). With binary PLINK input, the .bim and .fam files must be in the same directory as the .bed file.

The general syntax to run admixture is `admixture [options] inputFile K`.

It's general practice to run unsupervised ADMIXTURE for multiple values of K and compare the results.

The usage of K=1 is explained at the end.

## Preparing the dataset

ADMIXTURE does not explicitly model Linkage Disequilibrium (LD), so it is best practice to remove SNPs in LD.

Using the software plink v1.9, we are first going to identify the SNPs that:

- show a pairwise r^2 > `0.1`
- in a genomic window of `50` SNPs,
- shifted by `10` SNPs at the end of each step

The SNPs flagged as exceeding the 0.1 threshold will then be removed.

In [None]:
! plink --bfile dataset/1KGs_chr1_maf \
        --indep-pairwise 50 10 0.1 \
        --out dataset/SNPs_inLD

The command `--indep-pairwise` will create two files, X.prune.in (the SNPs to keep) and X.prune.out (the SNPs to remove), where X is the name we passed to `--out` (here `dataset/SNPs_inLD`).

We will now remove the SNPs that exceed the 0.1 threshold, using the plink command `--exclude` and the .prune.out file created by the `--indep-pairwise` command.

In [None]:
! plink --bfile dataset/1KGs_chr1_maf \
        --exclude dataset/SNPs_inLD.prune.out \
        --make-bed \
        --out dataset/1KGs_chr1_maf_pruned 
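After pruning, it can be worth checking that none of the excluded SNPs are still present. The sketch below runs that check on toy files: the variant IDs and the `toy.*` file names are made up for illustration, and stand in for `SNPs_inLD.prune.out` and the pruned `.bim` file.

```shell
tmp=$(mktemp -d)

# Made-up variant IDs standing in for SNPs_inLD.prune.out
printf 'snp2\nsnp4\n' > "$tmp/toy.prune.out"

# Made-up pruned .bim (columns: chromosome, variant ID, cM, bp, allele 1, allele 2)
cat > "$tmp/toy_pruned.bim" <<'EOF'
1 snp1 0 1000 A G
1 snp3 0 3000 C T
1 snp5 0 5000 G A
EOF

# First file: load the exclusion list. Second file: count .bim rows whose ID (column 2) is on it.
leftover=$(awk 'NR == FNR { drop[$1] = 1; next } ($2 in drop) { n++ } END { print n + 0 }' \
  "$tmp/toy.prune.out" "$tmp/toy_pruned.bim")

echo "excluded variants still present: $leftover"
```

A count of zero means the `--exclude` step removed every listed SNP.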

## Running unsupervised ADMIXTURE
We will now run unsupervised ADMIXTURE from K=1 to K=4. We stop at K=4 because of computational constraints, but ADMIXTURE runs can generally go beyond K=10. The `--cv` flag enables cross-validation (5-fold by default).

In [None]:
! for K in {1..4}; do admixture dataset/1KGs_chr1_maf_pruned.bed ${K} --cv; done
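Because the loop uses `--cv`, each run reports a line of the form `CV error (K=...): ...`, and comparing these across K is a common way to pick a sensible K. The sketch below assumes the screen output of each run was saved to a file named `log${K}.out` (for example by piping the command to `tee`), which the loop above does not do. The log files and CV values here are made up purely to illustrate the parsing.

```shell
logdir=$(mktemp -d)

# Made-up CV lines, one per K, mimicking the line ADMIXTURE prints with --cv
printf 'CV error (K=1): 0.60000\n' > "$logdir/log1.out"
printf 'CV error (K=2): 0.55000\n' > "$logdir/log2.out"
printf 'CV error (K=3): 0.52000\n' > "$logdir/log3.out"
printf 'CV error (K=4): 0.53000\n' > "$logdir/log4.out"

# Collect the CV line from every log
grep -h "CV error" "$logdir"/log*.out

# Rewrite "CV error (K=3): 0.52000" as "3 0.52000", sort by error, keep the smallest
best_K=$(grep -h "CV error" "$logdir"/log*.out \
  | sed 's/.*K=\([0-9]*\)): *\(.*\)/\1 \2/' \
  | sort -k2,2n | head -n 1 | cut -d' ' -f1)

echo "Lowest CV error at K=${best_K}"
```

The K with the lowest CV error is usually treated as a guide rather than a definitive answer, so it is still worth inspecting the results at neighbouring values of K.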

For each K, ADMIXTURE writes two files to the working directory: file.K.Q and file.K.P, where .Q contains the ancestry fractions and .P the allele frequencies of the inferred ancestral populations.
We are going to focus on the .Q file, which contains:
* K columns
* As many rows as samples, in the same order as the .fam file
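The layout described above can be verified with a quick sanity check. This sketch uses the first three samples of the .fam and K=2 .Q files shown in this notebook, copied into temporary `toy.*` files (the file names are placeholders), and checks that the row counts match and that each row of ancestry fractions sums to 1.

```shell
tmp=$(mktemp -d)

# First three samples of the .fam file
cat > "$tmp/toy.fam" <<'EOF'
GBR HG00096 0 0 0 -9
GBR HG00097 0 0 0 -9
GBR HG00099 0 0 0 -9
EOF

# Matching first three rows of the K=2 .Q file
cat > "$tmp/toy.2.Q" <<'EOF'
0.993758 0.006242
0.971634 0.028366
0.987112 0.012888
EOF

n_fam=$(wc -l < "$tmp/toy.fam" | tr -d ' ')
n_q=$(wc -l < "$tmp/toy.2.Q" | tr -d ' ')
n_cols=$(awk 'NR == 1 { print NF }' "$tmp/toy.2.Q")

# Count rows whose fractions do not sum to 1 (small tolerance for rounding)
bad_rows=$(awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i
                  if (s < 0.999 || s > 1.001) n++ }
                END { print n + 0 }' "$tmp/toy.2.Q")

echo "samples=$n_fam Q_rows=$n_q K=$n_cols rows_not_summing_to_1=$bad_rows"
```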

In [5]:
! head 1KGs_chr1_maf_pruned.2.Q

0.993758 0.006242
0.971634 0.028366
0.987112 0.012888
0.980797 0.019203
0.991779 0.008221
0.974260 0.025740
0.999990 0.000010
0.988025 0.011975
0.999990 0.000010
0.978363 0.021637


Let's paste the first two columns of the fam file with the Q file, to get a better understanding of the output

In [7]:
! awk '{print $1,$2}' ../dataset/1KGs_chr1_maf_pruned.fam > POP_ID_list
! for K in {1..4}; do paste POP_ID_list 1KGs_chr1_maf_pruned.${K}.Q > 1KGs_chr1_maf_pruned_ID.${K}.Q; done
! head 1KGs_chr1_maf_pruned_ID.2.Q

GBR HG00096	0.993758 0.006242
GBR HG00097	0.971634 0.028366
GBR HG00099	0.987112 0.012888
GBR HG00100	0.980797 0.019203
GBR HG00101	0.991779 0.008221
GBR HG00102	0.974260 0.025740
GBR HG00103	0.999990 0.000010
GBR HG00105	0.988025 0.011975
GBR HG00106	0.999990 0.000010
GBR HG00107	0.978363 0.021637
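With population labels attached, the per-individual fractions can be averaged by population, which is often easier to read than thousands of rows. The sketch below assumes the same layout as the file above (population, sample ID, then K fractions). The three GBR rows come from the output above, while the `POP2` rows and their sample IDs are made up so that the example has more than one group.

```shell
tmp=$(mktemp -d)

cat > "$tmp/toy_ID.2.Q" <<'EOF'
GBR HG00096 0.993758 0.006242
GBR HG00097 0.971634 0.028366
GBR HG00099 0.987112 0.012888
POP2 ind1 0.100000 0.900000
POP2 ind2 0.200000 0.800000
EOF

# Sum each ancestry column per population (column 1), then divide by the group size
pop_means=$(awk '{
  n[$1]++
  for (i = 3; i <= NF; i++) sum[$1, i] += $i
  if (NF > nf) nf = NF
}
END {
  for (p in n) {
    printf "%s", p
    for (i = 3; i <= nf; i++) printf " %.4f", sum[p, i] / n[p]
    printf "\n"
  }
}' "$tmp/toy_ID.2.Q" | sort)

echo "$pop_means"
```

The output has one line per population, with the mean fraction for each of the K clusters.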


## Basic Plot

In [9]:
! Rscript ../scripts/ADMIXTURE.R 1KGs_chr1_maf_pruned_ID 4

null device 
          1 

## Running supervised ADMIXTURE

Supervised learning mode is enabled with the flag `--supervised` and requires an additional file with a .pop suffix, specifying the ancestries of the reference individuals. The command looks like this:

`admixture --supervised --cv file.bed K`

The .pop file must share its prefix with the .bed file. For example: mydata.bed, mydata.bim, mydata.fam and also mydata.pop.

The .pop file has one line per individual, in the same order as the .fam file. Reference individuals carry their population label (here "AFR", "ASN" or "EUR"), while each individual marked "-" will be described as a mixture of "AFR", "ASN" and "EUR".

Given that in this example we have AFR, ASN and EUR, K will be 3. In this way we force ADMIXTURE to describe each "-" individual as a mixture of 3 ancestries/clusters.
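As a minimal sketch of that layout, the snippet below builds a .pop file from a toy .fam file, following the single-label-per-line format described in the ADMIXTURE manual. The first .fam line is taken from this dataset, while the other three sample IDs (`sample_A`, `sample_B`, `sample_C`) and the `toy.*` file names are made up for illustration.

```shell
tmp=$(mktemp -d)

# Toy .fam: one European, one African, one Asian reference, and one sample to be estimated
cat > "$tmp/toy.fam" <<'EOF'
GBR HG00096 0 0 0 -9
YRI sample_A 0 0 0 -9
CHB sample_B 0 0 0 -9
ASW sample_C 0 0 0 -9
EOF

# One label per line, in .fam order; "-" marks individuals whose ancestry is to be estimated
awk '{
  if ($1 == "YRI") label = "AFR"
  else if ($1 == "GBR") label = "EUR"
  else if ($1 == "CHB") label = "ASN"
  else label = "-"
  print label
}' "$tmp/toy.fam" > "$tmp/toy.pop"

cat "$tmp/toy.pop"

# K for the supervised run is the number of distinct reference labels
K=$(grep -v '^-$' "$tmp/toy.pop" | sort -u | wc -l | tr -d ' ')
echo "K=$K"
```

The .pop file must have exactly as many lines as the .fam file, so comparing the two with `wc -l` is a cheap check before launching the run.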

We can create the `.pop` file, starting from the .fam file already available

In [14]:
! head dataset/1KGs_chr1_maf_pruned.fam

GBR HG00096 0 0 0 -9
GBR HG00097 0 0 0 -9
GBR HG00099 0 0 0 -9
GBR HG00100 0 0 0 -9
GBR HG00101 0 0 0 -9
GBR HG00102 0 0 0 -9
GBR HG00103 0 0 0 -9
GBR HG00105 0 0 0 -9
GBR HG00106 0 0 0 -9
GBR HG00107 0 0 0 -9


In [57]:
! awk '{ \
  if ($1 == "YRI" || $1 == "ESN" || $1 == "LWK") \
    category = "AFR"; \
  else if ($1 == "CEU" || $1 == "GBR" || $1 == "TSI")\
    category = "EUR"; \
  else if ($1 == "CHS" || $1 == "CHB" || $1 == "CDX") \
    category = "ASN"; \
  else \
    category = "-"; \
  print category; \
}' dataset/1KGs_chr1_maf_pruned.fam > dataset/1KGs_chr1_maf_pruned.pop

In [58]:
! admixture dataset/1KGs_chr1_maf_pruned.bed --supervised --cv 3 

****                   ADMIXTURE Version 1.3.0                  ****
****                    Copyright 2008-2015                     ****
****           David Alexander, Suyash Shringarpure,            ****
****                John  Novembre, Ken Lange                   ****
****                                                            ****
****                 Please cite our paper!                     ****
****   Information at www.genetics.ucla.edu/software/admixture  ****

Cross-validation will be performed.  Folds=5.
Random seed: 43
Point estimation method: Block relaxation algorithm
Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
Point estimation will terminate when objective function delta < 0.0001
Estimation of standard errors disabled; will compute point estimates only.
Supervised analysis mode.  Examining .pop file...
Size of G: 3202x31437
Performing five EM steps to prime main algorithm
1 (EM) 	Elapsed: 6.303	Loglikelihood: -9.63468e+07	(delta): 7.9922

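Let's paste the first two columns of the fam file with the Q file, to get a better understanding of the output. Note that the supervised run writes its results to 1KGs_chr1_maf_pruned.3.Q, the same file name used by the unsupervised K=3 run, so that earlier file is overwritten.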

In [59]:
! paste dataset/1KGs_chr1_maf_pruned.fam 1KGs_chr1_maf_pruned.3.Q | awk '{print $1,$2,$7,$8,$9}' > 1KGs_chr1_maf_pruned_supervised_ID.3.Q
! head 1KGs_chr1_maf_pruned_supervised_ID.3.Q

GBR HG00096 0.999980 0.000010 0.000010
GBR HG00097 0.999980 0.000010 0.000010
GBR HG00099 0.999980 0.000010 0.000010
GBR HG00100 0.999980 0.000010 0.000010
GBR HG00101 0.999980 0.000010 0.000010
GBR HG00102 0.999980 0.000010 0.000010
GBR HG00103 0.999980 0.000010 0.000010
GBR HG00105 0.999980 0.000010 0.000010
GBR HG00106 0.999980 0.000010 0.000010
GBR HG00107 0.999980 0.000010 0.000010


## Basic Plot

In [10]:
! Rscript ../scripts/ADMIXTURE.R 1KGs_chr1_maf_pruned_supervised_ID 3

null device 
          1 