Skip to content


pcgoddard edited this page Oct 12, 2017 · 2 revisions


Pagé Goddard

Sep 28, 2017


Tutorial for calculating global ancestry proportions from known ancestral populations using ADMIXTURE.

Sample data: SAGE 2



Input Output
test.bed (plink) test.k.Q (ancestry fractions)
test.bim (plink) test.k.P (allele freq of inferred pop)
test.fam (plink)
test.pop (generate in R)
  • test includes cohort of interest AND reference populations
  • test binary files are 1/2 encoded
  • k = number of reference populations

File Locations

script variables and directories

date="171231" # update current yymmdd
wkdir="$HOME/wkdir" # choose your working directory

# data files
sagegenofile="$datadir/" # I'v set it up this way so that you can make a second object (eg: galagenofile) for another dataset
ancestralgenofile="$ancestraldir/ref.panels" # reference panel genotypes

# all your data should be in PLINK binary format (.bim .bed. fam)

PLINK: format data

ADMIXTURE requires 1 input that contains your population of unknown ancestry and the references of known ancestry. If you do not have a cohort-reference panel file, you must merge them together in PLINK. NB: plink will merge the files in alphabetical order by FID, so you may have your reference panels merged into the middle of the file rather than appended.

# PLINK variables

# merge cohort + reference pops
plink --bfile $sagegenofile --bmerge $ancestralgenofile.bed $ancestralgenofile.bim $ancestralgenofile.fam --make-bed --out $merge

# if error due to missed snps, flip the problematic snps and merge again
plink --bfile $ancestralgenofile --flip $merge_missnp --make-bed --out $ancestry_flip
plink --bfile $sagegenofile --bmerge $ancestry_flip --make-bed --out $merge

# if error persists, remove the snps with --exclude missnp

# recode Cohort + Ref binary files in 1/2 format
plink --bfile $merge --recode 12 --out $myadmix

optional: thin files for admixture, see bottom

R: build popfile

.pop is required for supervised admixture estimation; Each line of the .pop file corresponds to individual listed on the same line number in the .fam file. If the individual is a population reference, the .pop file line should be a string (beginning with an alphanumeric character) designating the population. If the individual is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to indicate that the ancestry should be estimated. The final format should be a single column with no header that lines up with the fam file as shown here:

fam_ID_source popfile
sage -
sage -
sage -
euro_ref ceu
euro_ref ceu
afr_ref yri
afr_ref yri
sage -
sage -
# R variables
date <- "171231" # update current yymmdd
mydat <- "genotype_ancestry_merge_" # same prefix as myadmix
merge <- read.table(file=paste(mydat, date, ".fam", sep=""), header = F, sep = ' ') # .fam file has the samples for your population and reference populations
ceu <- read.table("samples_ceu.txt", header = F, sep = ' ') # european ancestry
yri <- read.table("samples_yri.txt", header = F, sep = ' ') # african ancestry
nam <- read.table("samples_nam.txt", header = F, sep = ' ') # native american ancestry

# make popfile
merge$pop = ifelse(merge$X1 %in% ceu$V1, 'CEU', ifelse(merge$X1 %in% yri$V1, 'YRI', '-')) 
    # if IID is from CEU ref, write CEU; 
    # if IID is from YRI ref, write YRI; 
    # if IID is from NAM ref, write NAM;
    # if IID is in neither ref, it is a SAGE individual with unknwon ancestry; write - as placeholder
popfile <-$pop)

# write out
write.table(popfile, file=paste(mydat, date, ".pop", sep=""), row.names = F, quote = F)
    # check that same prefix as $myadmix input

ADMIXTURE: calculate global ancestry

# ADMIXTURE variables

## required
npop="2" # must reflect number of populations in reference panel

## optional
#bootnum="200" # optional

# admixture script
admixture $myadmix.ped $npop --supervised --seed=$seed -j$nthreads

Parsing ADMIXTURE results

Admixture output:

myadmix.2.P - allele frequencies of the inferred ancestral populations (SNP x ancestry) order: CEU YRI

wc -l $myadmix.2.P
    # 801660 lines
head $myadmix.2.P
    # 0.006379 0.000010
    # 0.827382 0.404886
    # 0.048119 0.021815
    # 0.874641 0.306484

myadmix.2.Q - ancestry fractions for each individual (sample x ancestry) order: CEU YRI

wc -l $myadmix.2.Q
    # 2165 lines
head $myadmix.2.Q
    # 0.142886 0.857114
    # 0.123782 0.876218
    # 0.208400 0.791600
    # 0.109842 0.890158

Reformat results

The following output can be used with programs such as REAP to calculate genetic relatedness matrices

# REAP variables

# transpose genotype files
plink --bfile $sagegenofile --recode 12 transpose --output-missing-genotype 0 --out $reapinput

# reformat
cut -d' ' -f1 $reapinput.tfam > $myID # SAGE ID list
cut -d' ' -f1-2 merged_trial.fam | paste -d' ' - $myadmix.2.Q > admix_out_IDs_${date} # paste global ancestries to IDs; FID IID CEU YRI
    # -d' ' reads space as delimiter
    # -f1-2 selects fields 1 through 2 to extract with cut
    # - directs paste to use the standard input form the pipe instead of a file
grep -Fwf $myID admix_out_IDs_${date} > $admixprop # extract sage only
    # -f read patterns from file1
    # -F read patterns as plain strings
    # -w match patterns as whole word
awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' $admixprop $reapinput.tfam > $admixprop_sorted
    # sorts $admixprop to match SAGE fam file
    # FNR==NR holds true for first named file
    # current line $0 is stored in associative array named x2 indexed by the first field [$1]
    # $1 in x2 will only start after first name file has been read completely
    # looks at the first field of line in second file, and prints the corresponding line from first

output order: FID IID EUR AFR

head $admixprop_sorted
    # CH30380 CH30380 0.153140 0.846860
    # VA30171 VA30171 0.120529 0.879471
    # VA30167 VA30167 0.180591 0.819409
    # CH30357 CH30357 0.134801 0.865199

Plotting results

#### files must be formatted with: race, eur, afr, natam
  # no subjectID, race column must come first. check example inputs

### Call libraries

# set up space
date <- "171231" # update current yymmdd
sage <- read.delim(print("$myadmix.2.Q"))
colnames(sage) <- c("EUR","AFR")
cbind(race = "AA", sage)
    #  race      EUR      AFR
    #  AA 0.142886 0.857114
    #  AA 0.123782 0.876218
    #  AA 0.208400 0.791600
    #  AA 0.109842 0.890158

# order data by increasing european ancestry
eur.sort = sage[order(sage$EUR),]
eur.sort$rank = 1:length(eur.sort$EUR)

# format data into geom_bar input
eur.sort.long = melt(eur.sort, id = c('rank','race'), = 'Ancestry')
    #  race rank  Ancestry    value
    #    AA 1      EUR 0.088553
    #    AA 2      EUR 0.100432
    #    AA 3      EUR 0.134868
    #    AA 4      EUR 0.181657

# create plots
afr.plot <- ggplot(eur.sort.long, aes(x=rank, y=value, fill=Ancestry))
afr.plot <- afr.plot + geom_bar(width=1, stat='identity')           # play with width to see what it does
afr.plot <- afr.plot +  theme(legend.position="bottom")                 # move legend
afr.plot <- afr.plot +  xlab('Individual') + ylab('Genetic ancestry')# label the axes
afr.plot <- afr.plot +  ggtitle('SAGE 2')                 # create a title
afr.plot <- afr.plot +  scale_fill_manual(labels=c("Eur","Afr"),values=c("#cc9999","#660099"))
afr.plot <- afr.plot + theme(axis.text.x=element_blank(),


resulting plot

# QED #

Thinning files

Documentation SEC 2.3

  • speeds up ADMIXTURE


# PLINK variables

# run PLINK
plink --bfile $genofile --indep-pairwise $slidingwindowsize $slidewidth $r2val --make-bed --out $thingenofile

You can then continue through the popfile and admixture steps with the thinned files. NB: Remember to test the order of thingenofile before making popfile

You can’t perform that action at this time.