# ADMIXTURE (Alexander 2009)

Extract from my PhD thesis: http://dspace.ut.ee/bitstream/handle/10062/82088/molinaro_ludovica.pdf?sequence=6&isAllowed=y

It is possible to summarise the genetic information without a priori information by clustering the target individuals based on their genetic patterns, highlighting population structure within the dataset. 

### Clustering
Given a number of clusters K, clustering algorithms group samples together based on their similarity. As a result, they assign each individual to every cluster with a probability of belonging to that cluster, defined as the membership coefficient. Such assignments occur SNP-wise to account for multiple ancestries within one genome. In this way, each individual will have several membership coefficients that summarise the proportion of DNA for which they are most closely related to the other individuals in each of the K clusters.

## ADMIXTURE:
MANUAL: https://dalexander.github.io/admixture/admixture-manual.pdf

Extract from ADMIXTURE manual:  

ADMIXTURE is a program for estimating ancestry in a model-based manner from large autosomal SNP genotype datasets, where the individuals are unrelated (for example, the individuals in a case-control association study). ADMIXTURE’s input is binary PLINK (.bed), ordinary PLINK (.ped), or EIGENSTRAT (.geno) formatted files and its output is simple space-delimited files containing the parameter estimates.

To use ADMIXTURE, you need an input file and an idea of `K` (a number that might reflect the number of ancestral populations). For binary PLINK input, the .bed file must sit in the same directory as its .bim and .fam files.

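The general syntax to run ADMIXTURE is `admixture <input file> <K>`, optionally with flags such as `--cv`.

It's general practice to run unsupervised ADMIXTURE for multiple values of K, for example from K=1 to K=7, as done in the section "Running unsupervised ADMIXTURE".

The use of K=1 is explained at the end.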

## Preparing the dataset

ADMIXTURE does not explicitly model Linkage Disequilibrium (LD), so it is best practice to remove SNPs in LD.

Using the software plink v1.9, we first identify the SNPs that:

- show a pairwise r^2 > `0.1`
- in a genomic window of `50` SNPs,
- shifted by `10` SNPs at the end of each step

For each pair of SNPs that exceeds the 0.1 threshold, one SNP of the pair will then be removed.
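To make the r^2 threshold concrete, here is a minimal sketch that computes the squared correlation between the allele counts (0/1/2) of two SNPs with awk. The genotypes and the file name `toy_genotypes.txt` are invented for illustration, and the sketch assumes that the pruning r^2 is the squared correlation of genotype counts, which is our understanding of what plink evaluates for each pair in the window.

```shell
# Toy data (invented): one row per individual, two columns = allele counts at SNP A and SNP B
printf '%s\n' "0 0" "1 1" "2 2" "0 1" "1 1" "2 1" > toy_genotypes.txt

# Squared Pearson correlation between the two columns
r2=$(awk '{ n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2 }
     END {
       cov = sxy/n - (sx/n)*(sy/n)      # covariance of the two SNPs
       vx  = sxx/n - (sx/n)*(sx/n)      # variance of SNP A
       vy  = syy/n - (sy/n)*(sy/n)      # variance of SNP B
       printf "%.3f", cov*cov/(vx*vy)
     }' toy_genotypes.txt)
echo "r^2 = $r2"
```

With these toy genotypes the r^2 is well above 0.1, so with the settings above one SNP of this pair would be flagged for removal.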

In [None]:
! plink --bfile ../dataset/1KGs_chr1_maf \
        --indep-pairwise 50 10 0.1 \
        --out ../dataset/SNPs_inLD

The command `--indep-pairwise` creates two files, X.prune.in and X.prune.out, where X is the name we passed to `--out` (here `../dataset/SNPs_inLD`).

We will now remove the SNPs that exceed the 0.1 threshold, using the plink command `--exclude` and the .prune.out file created by the `--indep-pairwise` command.

In [None]:
! plink --bfile ../dataset/1KGs_chr1_maf \
        --exclude ../dataset/SNPs_inLD.prune.out \
        --make-bed \
        --out ../dataset/1KGs_chr1_maf_pruned 

## Running unsupervised ADMIXTURE
We will now run unsupervised ADMIXTURE on our dataset (available in dir dataset/), from K=1 to K=7. Be sure to have installed ADMIXTURE or imported the conda environment available in the conda_env directory.

In [None]:
! for K in {1..7}; do admixture ../dataset/1KGs_chr1_maf_pruned.bed ${K} --cv; done
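The `--cv` flag makes ADMIXTURE report a cross-validation error for each K, which is commonly used to compare values of K. The sketch below shows one way to pick the K with the lowest error. It assumes the screen output of each run was saved to a hypothetical `log${K}.out` file (for instance by appending `| tee log${K}.out` to the command above) and that the error is printed on a line of the form `CV error (K=3): 0.52`. The log files and CV values written here are invented stand-ins, not real results.

```shell
# Toy logs standing in for real ADMIXTURE output; the CV values are invented
mkdir -p toy_logs
printf 'CV error (K=1): 0.55\n' > toy_logs/log1.out
printf 'CV error (K=2): 0.53\n' > toy_logs/log2.out
printf 'CV error (K=3): 0.52\n' > toy_logs/log3.out
printf 'CV error (K=4): 0.54\n' > toy_logs/log4.out

# Collect the CV lines from all logs
grep -h "CV error" toy_logs/log*.out

# Sort numerically on the error (4th field) and keep the smallest
best_line=$(grep -h "CV error" toy_logs/log*.out | LC_ALL=C sort -k4,4n | head -n 1)

# Extract the K value from between "(K=" and ")"
best_K=$(echo "$best_line" | sed 's/.*(K=\([0-9]*\)).*/\1/')
echo "Lowest CV error at K=$best_K"
```

On real data, replace `toy_logs/log*.out` with the path to your own saved logs.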

For each run, ADMIXTURE writes two files to the working directory: `<prefix>.<K>.Q` and `<prefix>.<K>.P`, where .Q contains the ancestry fractions and .P the allele frequencies of the inferred ancestral populations.

In [None]:
! head 1KGs_chr1_maf_pruned.3.Q

Note that the output filenames contain a number, such as the ‘3’ in `1KGs_chr1_maf_pruned.3.Q`. This indicates the number of populations (K) that was assumed for the analysis.
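Each row of a .Q file holds the K membership coefficients of one individual, in the same order as the .fam file, so the values in a row should sum to 1. The sketch below labels the rows with sample IDs and checks the row sums. The files `toy_q.fam` and `toy.3.Q`, the sample IDs and the coefficients are all invented for illustration.

```shell
# Toy stand-ins (invented): three samples and a made-up Q matrix for K=3
printf 'F1 S1 0 0 1 -9\nF2 S2 0 0 2 -9\nF3 S3 0 0 1 -9\n' > toy_q.fam
printf '0.70 0.20 0.10\n0.05 0.90 0.05\n0.333333 0.333333 0.333334\n' > toy.3.Q

# Put the sample ID (column 2 of the .fam) in front of each Q row
awk '{ print $2 }' toy_q.fam > toy_ids.txt
paste -d' ' toy_ids.txt toy.3.Q > toy_labelled.3.Q
cat toy_labelled.3.Q

# Count rows whose coefficients do not sum to 1 (within a small tolerance)
bad_rows=$(awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i
                  d = s - 1; if (d < 0) d = -d
                  if (d > 1e-4) bad++ }
            END { print bad + 0 }' toy.3.Q)
echo "Rows not summing to 1: $bad_rows"
```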

## Basic Plot

## Running supervised ADMIXTURE

Supervised learning mode is enabled with the flag `--supervised` and requires an additional
file with a .pop suffix, specifying the ancestries of the reference individuals. Like this:

admixture --supervised --cv file.bed K

As mentioned, this mode requires an additional file with a .pop suffix. Its prefix should be the same as that of the .bed file. For example: mydata.bed, mydata.bim, mydata.fam and also mydata.pop.

The .pop file has one line per individual, in the same order as the .fam file. Reference individuals carry their population label (here "AFR", "ASN" or "EUR"), and each individual marked "-" will be described as a mixture of "AFR", "ASN" and "EUR".

Given that in this example we have AFR, ASN and EUR, K will be 3. In this way we force ADMIXTURE to describe each '-' individual as a mixture of these 3 ancestries/clusters.
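Since K must match the number of reference labels, it can be handy to count the distinct labels in the .pop file before launching the run. A minimal sketch, using an invented file `toy_supervised.pop` in place of a real .pop file:

```shell
# Toy .pop file (invented): one label per individual, "-" for individuals to be estimated
printf '%s\n' AFR EUR - ASN AFR - EUR > toy_supervised.pop

# Drop the "-" lines, keep each label once, and count them
K=$(grep -v '^-$' toy_supervised.pop | sort -u | wc -l | tr -d ' ')
echo "Reference populations found: K=$K"
```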

We can create the `.pop` file starting from the .fam file already available:

In [None]:
! head ../dataset/1KGs_chr1_maf_pruned.fam
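One possible way to build the `.pop` file is sketched below. It assumes a hypothetical two-column lookup file (sample ID, population label) listing only the reference individuals. Such a file is not part of the dataset shown here, so `toy_labels.txt`, `toy.fam` and the sample IDs are invented for illustration. Every individual in the .fam that is missing from the lookup gets "-".

```shell
# Toy .fam (invented IDs); columns: FID IID PAT MAT SEX PHENO
cat > toy.fam <<'EOF'
F1 S1 0 0 1 -9
F2 S2 0 0 2 -9
F3 S3 0 0 1 -9
F4 S4 0 0 2 -9
EOF

# Hypothetical lookup: sample ID and label, reference individuals only
cat > toy_labels.txt <<'EOF'
S1 AFR
S2 EUR
S4 ASN
EOF

# First file: remember the label of each reference sample.
# Second file: for each .fam row (in order), print its label or "-".
awk 'NR == FNR { label[$1] = $2; next }
     { print (($2 in label) ? label[$2] : "-") }' toy_labels.txt toy.fam > toy.pop
cat toy.pop
```

The output keeps the row order of the .fam file, which is what ADMIXTURE relies on to match labels to individuals. For the real dataset the result would be written next to the .bed file with the same prefix and a .pop suffix.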