# Supervised ADMIXTURE

MANUAL: https://dalexander.github.io/admixture/admixture-manual.pdf

Extract from ADMIXTURE manual:

ADMIXTURE is a program for estimating ancestry in a model-based manner from large autosomal SNP genotype datasets, where the individuals are unrelated (for example, the individuals in a case-control association study).

mkdir SupervisedAdmixture

In [None]:
cd SupervisedAdmixture

### Preparing the dataset

We are going to use the PLINK files to run supervised ADMIXTURE. Our goal is to run supervised ADMIXTURE on the Koksijde samples, using two reference groups: 1) Gauls (France Late Iron Age, LIA) and 2) the Saxon Medieval (SM) Dutch samples.

Let's also add the English Medieval samples from Cambridge as a target, along with Koksijde, to serve as a control.

First, let's load PLINK:

In [None]:
module load plink/1.9-beta6.27

Now, we can subset the dataset, retrieving only the samples of interest, using the PLINK option `--keep`

In [None]:
grep -E 'KoksijdeEMA\.Anc|NedEMA\.Anc|FranceLIA\.Anc|UKCamEMA\.Anc' non_imputed.fam > Samples_2_keep
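
Before subsetting, it can be worth confirming how many samples of each group ended up in the keep-list. The following is a minimal sketch, assuming the group label is in column 1 (as in a `.fam` file); `count_per_group` is a hypothetical helper name, not part of PLINK.

```shell
# Hypothetical helper: count samples per group label (column 1)
# in a PLINK-style sample list such as Samples_2_keep.
count_per_group() {
  awk '{ n[$1]++ } END { for (g in n) print g, n[g] }' "$1" | sort
}

# Run it only if the keep-list from the previous step exists.
if [ -f Samples_2_keep ]; then
  count_per_group Samples_2_keep
fi
```

A group with zero samples will simply be missing from the output, which usually points to a label mismatch in the `grep` pattern.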

In [None]:
plink --bfile non_imputed --keep Samples_2_keep --make-bed --out non_imputed_SupAdm


ADMIXTURE does not explicitly model Linkage Disequilibrium (LD), so it is best practice to remove SNPs in LD.

Using PLINK v1.9, we first identify the SNPs that:

- show a pairwise r^2 > 0.2 (the last value passed to `--indep-pairwise` in the command below)
- in a genomic window of 50 SNPs,
- shifted by 10 SNPs at the end of each step
  
SNPs flagged as exceeding the 0.2 threshold will then be removed.

For supervised ADMIXTURE analyses within the European continent, we need to retain at least 100K SNPs after pruning. Make sure that the `--indep-pairwise` parameters do not remove too many SNPs.

In [None]:
plink --bfile non_imputed_SupAdm --indep-pairwise 50 10 0.2 --out SNPs_inLD

The `--indep-pairwise` command creates two files, `X.prune.in` (SNPs to retain) and `X.prune.out` (SNPs to remove), where `X` is the prefix given to the `--out` option (here `SNPs_inLD`).
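
To check the 100K-SNP requirement before actually pruning, one option is to count the lines of the `.prune.in` file, which lists one retained SNP ID per line. This is a minimal sketch; `enough_snps` is a hypothetical helper, and the 100000 cutoff is taken from the guideline stated earlier.

```shell
# Hypothetical helper: report the number of SNP IDs in a file and
# succeed only if there are at least MIN (second argument) of them.
enough_snps() {
  n=$(awk 'END { print NR }' "$1")
  echo "$1: $n SNPs retained"
  [ "$n" -ge "$2" ]
}

# Run it only if the pruning output exists in the working directory.
if [ -f SNPs_inLD.prune.in ]; then
  enough_snps SNPs_inLD.prune.in 100000 \
    || echo "Too few SNPs: consider relaxing the --indep-pairwise parameters"
fi
```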

We will now remove the SNPs that exceed the r^2 threshold, using the PLINK option `--exclude` and the `SNPs_inLD.prune.out` file created by the `--indep-pairwise` command.

In [None]:
plink --bfile non_imputed_SupAdm --exclude SNPs_inLD.prune.out --make-bed --out non_imputed_SupAdm_pruned
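
As a quick sanity check on the `--exclude` step, one can count how many SNP IDs from the exclusion list are still present in the pruned `.bim` file (SNP IDs are in column 2). The expected answer is 0. This is a sketch only; `leftover_snps` is a hypothetical helper name.

```shell
# Hypothetical helper: count SNP IDs from an exclusion list (one ID per line)
# that still appear in column 2 of a .bim file.
leftover_snps() {
  awk 'NR == FNR { drop[$1]; next } $2 in drop { c++ } END { print c + 0 }' "$1" "$2"
}

# Run it only if the pruned dataset exists in the working directory.
if [ -f non_imputed_SupAdm_pruned.bim ]; then
  echo "Excluded SNPs still present: $(leftover_snps SNPs_inLD.prune.out non_imputed_SupAdm_pruned.bim)"
fi
```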

### Supervised Mode

Supervised learning mode is enabled with the flag `--supervised` and requires an additional file with a .pop suffix. The prefix of the .pop file should be the same as the bed file. 

For example: 
- non_imputed_SupAdm_pruned.bed
- non_imputed_SupAdm_pruned.bim
- non_imputed_SupAdm_pruned.fam
- non_imputed_SupAdm_pruned.pop

In this tutorial, the .pop file lists the samples in the same order as they appear in the .fam file. Each line mirrors the corresponding .fam line, with the ancestry label of the reference individuals replacing the first column and repeated in an additional last column.

From ADMIXTURE manual: if the individual is a population reference, the .pop file line should be a string (beginning with an alphanumeric character) designating the population. If the individual is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to indicate that the ancestry should be estimated.

In [None]:
head non_imputed_SupAdm_pruned.fam

In [None]:
Eng.EM I11567 0 0 0 -9
Eng.EM I11569 0 0 0 -9
Eng.EM I11571 0 0 0 -9
Eng.EM I11573 0 0 0 -9
Eng.EM I11574 0 0 0 -9
Eng.EM I11575 0 0 0 -9
Eng.EM I11576 0 0 0 -9
Eng.EM I11577 0 0 0 -9
Eng.EM I11579 0 0 0 -9
Eng.EM I11581 0 0 0 -9

In [None]:
head non_imputed_SupAdm_pruned.pop

In [None]:
-	I11567	0	0	0	-9	-
-	I11569	0	0	0	-9	-
-	I11571	0	0	0	-9	-
-	I11573	0	0	0	-9	-
-	I11574	0	0	0	-9	-
-	I11575	0	0	0	-9	-
-	I11576	0	0	0	-9	-
-	I11577	0	0	0	-9	-
-	I11579	0	0	0	-9	-
-	I11581	0	0	0	-9	-

Awk can help us create the .pop file:

In [None]:
! awk '{ \
  if ($1 == "FranceLIA.Anc") \
    category = "Gauls"; \
  else if ($1 == "NedEMA.Anc")\
    category = "Saxons"; \
  else \
    category = "-"; \
  print category, $2, $3, $4, $5, $6, category; \
}' non_imputed_SupAdm_pruned.fam > non_imputed_SupAdm_pruned.pop

We can do a quick check of the .pop file, to ensure that it contains the new labels

In [None]:
awk '{print $1}' non_imputed_SupAdm_pruned.pop | sort -u
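
Two further checks may be worthwhile, since each `.pop` line has to correspond to the `.fam` line at the same position: that both files list the same sample IDs in the same order, and that exactly two reference labels are present. The following is a minimal sketch, assuming the sample ID sits in column 2 of both files (as produced by the awk command above); both function names are hypothetical.

```shell
# Hypothetical helper: succeed if the sample ID (column 2) matches, line by
# line, between a .fam file (6 columns) and the tutorial-style .pop file.
# paste joins line i of both files, so the .pop sample ID becomes field 8.
same_sample_order() {
  paste "$1" "$2" | awk '$2 != $8 { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Hypothetical helper: number of distinct reference labels (anything but "-").
count_ref_labels() {
  awk '$1 != "-" { print $1 }' "$1" | sort -u | awk 'END { print NR }'
}

# Run the checks only if the .pop file exists in the working directory.
if [ -f non_imputed_SupAdm_pruned.pop ]; then
  same_sample_order non_imputed_SupAdm_pruned.fam non_imputed_SupAdm_pruned.pop \
    && echo "sample order matches" || echo "sample order differs"
  echo "reference labels: $(count_ref_labels non_imputed_SupAdm_pruned.pop)"
fi
```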

We are now ready to run supervised ADMIXTURE with 2 reference sources. The dataset we are using contains: 
* > 100K SNPs
* A total genotyping rate of 0.59.

In [None]:
module spider admixture

In [None]:
module load admixture-linux/1.3.0

Below you can find the command line to run supervised ADMIXTURE live on the server. **DO NOT RUN IT**  

In [None]:
admixture non_imputed_SupAdm_pruned.bed --supervised --cv 2

We are going to send a SLURM job containing the supervised ADMIXTURE command line:

In [None]:
python ~/tmp_scripts/JobParser.py --command "admixture non_imputed_SupAdm_pruned.bed --supervised --cv 2 " --name sup_admixture --module admixture-linux/1.3.0

In [None]:
sbatch sup_admixture.sh
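
For readers without access to `JobParser.py`, the sketch below writes a comparable batch script by hand. It is an illustration only, not the actual output of `JobParser.py`: the file name `sup_admixture_sketch.sh` and all resource values (CPUs, time limit, log name) are assumptions to adapt to your cluster.

```shell
# Write an illustrative SLURM batch script (NOT the output of JobParser.py;
# the resource values below are placeholders).
cat > sup_admixture_sketch.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=sup_admixture
#SBATCH --output=sup_admixture.%j.log
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00

module load admixture-linux/1.3.0
admixture non_imputed_SupAdm_pruned.bed --supervised --cv 2
EOF

echo "Wrote sup_admixture_sketch.sh; submit it with: sbatch sup_admixture_sketch.sh"
```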

### Plot 

In [None]:
conda activate echo_workshop

In [None]:
Rscript ../tmp_scripts/AdmixturePlot_SupUnsup.R supervised non_imputed_SupAdm_pruned 2 KoksijdeEMA.Anc,NedEMA.Anc,FranceLIA.Anc,UKCamEMA.Anc
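
Alongside the plot, the ancestry fractions can be summarised directly from the ADMIXTURE output. ADMIXTURE writes a `.Q` file named after the input prefix and the number of ancestral populations (here assumed to be `non_imputed_SupAdm_pruned.2.Q`), with one row per sample in `.fam` order. The sketch below averages the two columns per group; `mean_q_by_group` is a hypothetical helper, and which column corresponds to which reference group is not assumed here (the rows of the reference samples should make it clear).

```shell
# Hypothetical helper: mean of the two Q columns per group (column 1 of .fam).
# paste joins line i of the .fam (6 fields) with line i of the .Q (2 fields),
# so the ancestry fractions become fields 7 and 8.
mean_q_by_group() {
  paste "$1" "$2" | awk '
    { a[$1] += $7; b[$1] += $8; n[$1]++ }
    END { for (g in n) printf "%s %.3f %.3f\n", g, a[g] / n[g], b[g] / n[g] }
  ' | sort
}

# Run it only once the ADMIXTURE job has produced the Q file.
if [ -f non_imputed_SupAdm_pruned.2.Q ]; then
  mean_q_by_group non_imputed_SupAdm_pruned.fam non_imputed_SupAdm_pruned.2.Q
fi
```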

### Comparison between Imputed and Non Imputed Data

In [None]:
Rscript comparing_supADMX.R --file1 non_imputed_SupAdm_pruned --file2 imputed_SupAdm_pruned --target KoksijdeEMA.Anc --k 2