# Set up environment and data

In [None]:
COURSE_PATH=/course/popgen25
DATA_PATH=$COURSE_PATH/dstats
SOFTWARE_PATH=$COURSE_PATH/software

#make folder 
mkdir -p ~/popgen25_dstat

# enter folder
cd ~/popgen25_dstat

# make symlinks for the current folder and the data folder
ln -sfn ~/popgen25_dstat ~/current_folder
ln -sfn $DATA_PATH ~/data_folder

# Practical on $F$-statistics CPH popgen 2025

In this practical we will explore the $F$-statistics framework using the `R`-package `admixtools`, a fast implementation of $F$-statistics. 

To begin, we load a few R packages required for the practical.

In [None]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(admixtools)
    library(viridis)
    library(ape)
})

## 1. Dataset exploration

The data we are using is a subset of genotype data of modern and ancient humans from the Allen Ancient DNA Resource ([AADR](https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data)). 

Here we read a table containing some minimal metadata for the samples included. The `group_id` column contains the population labels for the individuals. Modern populations have the suffix `.HO`, as they represent SNP genotype data obtained using the Human Origins SNP array. Ancient groups have suffixes indicating their age group and/or culture label. For example, groups with the suffix `_HG` represent hunter-gatherers, groups with the suffix `_N` or similar represent Neolithic populations, and groups with the suffix `_BA` or similar represent Bronze Age populations. Finally, the suffix `.SG` marks genome-wide shotgun-sequenced data, whereas all other ancient data were generated using in-solution capture enrichment at SNP sites.
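As a rough illustration of these naming conventions, the base-R sketch below tags a few labels by suffix. The helper `classify_label` is hypothetical (it is not part of `admixtools` or the course material), and it only recognises the exact suffixes `.HO`, `.SG`, `_HG`, `_N` and `_BA`; the "or similar" variants mentioned above are deliberately left unclassified.

```r
## Hypothetical helper: tag group labels using the suffix conventions described above
classify_label <- function(group_id) {
    ## data type from the trailing ".HO" / ".SG" suffix; anything else is left as "other"
    data_type <- ifelse(grepl("\\.HO$", group_id), "array",
                 ifelse(grepl("\\.SG$", group_id), "shotgun", "other"))
    ## strip the data-type suffix before looking at the age/culture suffix
    base <- sub("\\.(HO|SG)$", "", group_id)
    period <- ifelse(grepl("_HG$", base), "hunter-gatherer",
              ifelse(grepl("_N$", base), "Neolithic",
              ifelse(grepl("_BA$", base), "Bronze Age", NA_character_)))
    data.frame(group_id, data_type, period, stringsAsFactors = FALSE)
}

labels <- c("Yoruba.HO", "Turkey_N.SG", "Russia_Karelia_HG.SG", "Germany_EN_LBK")
res_labels <- classify_label(labels)
res_labels
```

Labels such as `Germany_EN_LBK` carry the period inside the name rather than as a trailing suffix, so this simple rule leaves them as `NA`; the `age_group` column of the metadata table is the more reliable source.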

In [None]:
## load sample metadata
sample_info <- read_tsv("~/data_folder/ho_anc.sample_info.tsv")
head(sample_info)
count(sample_info, age_group, group_id) ## population sample sizes per age group


The pairwise $F_2$ statistics for each population pair have already been pre-computed, allowing for easy use of all other $F$-statistics-based tools in `admixtools`.

In [None]:
## load precomputed f2 data
## please ignore warning message about negative f2 statistic estimates
f2_dir <- "~/data_folder/f2.ho_anc"
f2_data <- f2_from_precomp(f2_dir)

Let's have a quick look at the populations and the number of SNPs included in the pre-computed dataset.

In [None]:
## populations
dimnames(f2_data)[[1]]

## number of SNPs
count_snps(f2_data)

## 2. $F_{2}$ statistics

$F_{2}$ statistics provide a distance measure based on genetic drift between pairs of populations. In `admixtools`, pre-computed data is organized in an array with those $F_{2}$ distances between each pair of populations, separate for each genomic block. 
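To make the distance interpretation concrete, here is a minimal sketch that computes a naive $F_{2}$ as the mean squared allele frequency difference between two populations. The allele frequencies are made-up toy values, and the sketch ignores the correction for finite sample size that a real estimator needs.

```r
## Toy allele frequencies at 5 SNPs in two hypothetical populations A and B
p_A <- c(0.10, 0.50, 0.90, 0.30, 0.70)
p_B <- c(0.20, 0.40, 0.80, 0.30, 0.90)

## Naive F2: average squared allele frequency difference across SNPs
## (no sample-size bias correction; illustration only)
f2_naive <- mean((p_A - p_B)^2)
f2_naive
```

Larger values indicate more genetic drift separating the two populations.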

We can average the pairwise statistics across all blocks to obtain genome-wide distances, and use them to visualize genetic structure between the populations. Here we do that for all modern populations in the dataset, using a heatmap and a neighbor-joining tree.

In [None]:
## heatmap of f2 distances between modern populations
pops <- sample_info %>%
    filter(age_group == "modern") %>%
    distinct(group_id) %>%
    pull(group_id)

f2_pop <- apply(f2_data[pops, pops, ], 1:2, mean) # average across all blocks
diag(f2_pop) <- NA
heatmap(f2_pop, scale = "none", col = viridis(100), cexCol = 0.5, cexRow = 0.5)

## Neighbor-joining tree of f2 distances, rooted with a southern African hunter-gatherer population (Ju|'hoan)
tr <- nj(f2_pop) %>%
    root(outgroup = "Ju_hoan_North.HO")

plot(tr, cex = 0.5, font = 1)

Questions:
- Which populations show long branch lengths? Why might that be?
- Which populations have negative branch lengths? Why might that be?

Task: Using the same approach as above, now explore genetic structure of ancient populations
- heatmap of $F_2$ distances
- NJ, rooted with ancient South African hunter-gatherers (population `South_Africa_1900BP.SG`)

Questions:
- Which population groups show the deepest split in the heatmap clustering?
- Which populations have very long branch lengths? Why might that be?

In [None]:
## Task

## 3. $F_{3}$ statistics 

In this section we will explore the two main applications of $F_{3}$ statistics: Testing for evidence of admixture, and measuring shared genetic drift.

### 3.1. Admixture $F_{3}$

Admixture $F_{3}$ statistics of the form $F_{3}$(X;A,B) test whether there is evidence for admixture in population X. If $F_{3}$(X;A,B) < 0, we can conclude that X is admixed with respect to sources related (possibly deeply!) to populations A and B. To gauge whether the test statistic is significantly negative, a block jackknife across all $F_{2}$ blocks is used to estimate standard errors and a corresponding Z-score.
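The block jackknife idea can be sketched in a few lines of base R. The per-block values below are invented for illustration, and the sketch gives every block equal weight; an actual implementation may weight blocks by their SNP count, so treat this as a simplified version rather than what `qp3pop` does internally.

```r
## Hypothetical per-block f3 estimates (made-up numbers)
block_est <- c(-0.004, -0.002, -0.005, -0.003, -0.006)
n <- length(block_est)

## Leave-one-block-out estimates
loo <- sapply(seq_len(n), function(i) mean(block_est[-i]))

## Point estimate, delete-one jackknife standard error, and Z-score
est <- mean(block_est)
se <- sqrt((n - 1) / n * sum((loo - mean(loo))^2))
z <- est / se

c(est = est, se = se, z = z)
```

A Z-score below roughly -3 is a commonly used threshold for calling the statistic significantly negative, which matches the three-standard-error bars drawn in the plots of this practical.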

In the following example, we use the `qp3pop` function to test whether there is evidence for admixture in African Americans, with Yoruba from Nigeria as one source and each of the other modern populations as the other source.

In [None]:
## Testing for admixture in African Americans
pops_ref <- sample_info %>%
    filter(age_group == "modern") %>%
    distinct(group_id) %>%
    pull(group_id)

## We use qp3pop here with
## African Americans as the admixed (target) population (pop1),
## the Yoruba population as one reference population (pop2),
## and all other modern populations as the second reference population (pop3).
f3 <- qp3pop(f2_data, pop1 = "AfricanAmericans.HO", pop2 = "Yoruba.HO", pop3 = pops_ref) %>%
    arrange(est) %>%
    mutate(pop3 = fct_reorder(pop3, -est))
f3 %>% head()

## plotting the results
ggplot(f3, aes(x = est, y = pop3)) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_errorbarh(aes(xmin = est - 3 * se, xmax = est + 3 * se), height = 0.2) +
    geom_point() +
    theme_bw()


Questions:
- Is there evidence for admixture in African Americans?
- Which populations give significantly negative f3 values?
- Can we conclude anything about the likely true non-African source populations?

Task: Using the same approach as above, now test for admixture in modern target populations using Yamnaya pastoralists from the Eurasian Steppe (`Russia_Samara_EBA_Yamnaya`) and early European farmers (`Germany_EN_LBK`) as source populations.

Questions:
- Is there evidence for admixture?
- Do any European populations have a non-significant result? If so, how can we interpret that?

In [None]:
## Task

### 3.2. Outgroup $F_{3}$

In outgroup $F_{3}$ statistics of the form $F_{3}$(O;A,B), the target population X from the admixture test is replaced by a population O that is a known outgroup to (A,B). The value of $F_{3}$(O;A,B) is then a measure of the genetic drift shared between A and B along the path from the outgroup O to their common ancestor.
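The identity $F_{3}$(O;A,B) = ½ [$F_{2}$(O,A) + $F_{2}$(O,B) − $F_{2}$(A,B)] shows why this measures shared drift. The sketch below checks it on a toy tree with invented branch lengths: `a` and `b` are the private branches of A and B, `s` is the branch they share, and `o` is the branch leading to the outgroup.

```r
## Invented branch lengths on a toy tree (O,(A,B))
a <- 0.01   # private drift of A
b <- 0.02   # private drift of B
s <- 0.05   # drift shared by A and B
o <- 0.03   # drift on the outgroup branch

## F2 distances are sums of branch lengths along the connecting path
f2_OA <- o + s + a
f2_OB <- o + s + b
f2_AB <- a + b

## Outgroup F3 from pairwise F2
f3_OAB <- 0.5 * (f2_OA + f2_OB - f2_AB)
f3_OAB
```

The result equals `o + s`: the outgroup branch contributes the same constant to every comparison, so differences between population pairs reflect the shared branch `s` only.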

In the following example, we use the `qp3pop` function to estimate the genetic drift shared between Sardinians and other modern populations, using an African hunter-gatherer population (Mbuti) as outgroup.

In [None]:
## Use outgroup f3 to estimate genetic drift shared between Sardinians and other modern populations
pops_ref <- sample_info %>%
    filter(age_group == "modern") %>%
    distinct(group_id) %>%
    pull(group_id)

## We use qp3pop here with
## Mbuti as the outgroup population (pop1).
## Sardinians as our test population (pop2).
## and all other modern populations as the reference population (pop3).
f3 <- qp3pop(f2_data, pop1 = "Mbuti.HO", pop2 = "Sardinian.HO", pop3 = pops_ref) %>%
    filter(pop2 != pop3) %>%
    slice_max(est, n = 20) %>%
    arrange(desc(est)) %>%
    mutate(pop3 = fct_reorder(pop3, est))
f3

ggplot(f3, aes(x = est, y = pop3)) +
    geom_errorbarh(aes(xmin = est - 3 * se, xmax = est + 3 * se), height = 0.2) +
    geom_point() +
    theme_bw()

Questions:
- Which modern population shares most drift with Sardinians?

Task: Using the same approach as above, now also include ancient populations in the analysis (as `pop3`).

Questions:
- Which populations now share most drift with Sardinians?
- What do they have in common?

In [None]:
## Task

## 4. $F_{4}$ statistics 

In this section we will explore the two related applications of $F_{4}$ statistics: Testing for treeness and symmetry tests.

### 4.1. Treeness test

Statistics of the form $F_{4}$(A,B;C,D) test whether four populations A, B, C, D are related through a simple unrooted tree. If $F_{4}$(A,B;C,D) = 0 while the other two configurations are positive, $F_{4}$(A,C;B,D) > 0 and $F_{4}$(A,D;B,C) > 0, a simple tree ((A,B),(C,D)) without gene flow is supported. If all three configurations differ significantly from zero, a simple tree is rejected and gene flow must have occurred.
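The expected pattern can be verified on a toy tree using the identity $F_{4}$(A,B;C,D) = ½ [$F_{2}$(A,D) + $F_{2}$(B,C) − $F_{2}$(A,C) − $F_{2}$(B,D)]. The branch lengths below are invented, and `f4_from_f2` is a hypothetical helper written for this sketch, not an `admixtools` function.

```r
## Toy unrooted tree ((A,B),(C,D)): terminal branches plus one internal branch e
br <- c(A = 0.01, B = 0.02, C = 0.03, D = 0.04)
e <- 0.05
pops <- names(br)

## Pairwise F2 = sum of the two terminal branches ...
f2 <- outer(br, br, "+")
dimnames(f2) <- list(pops, pops)
diag(f2) <- 0
## ... plus the internal branch for pairs on opposite sides of the tree
f2[c("A", "B"), c("C", "D")] <- f2[c("A", "B"), c("C", "D")] + e
f2[c("C", "D"), c("A", "B")] <- f2[c("C", "D"), c("A", "B")] + e

## Hypothetical helper: F4 from pairwise F2
f4_from_f2 <- function(p1, p2, p3, p4) {
    0.5 * (f2[p1, p4] + f2[p2, p3] - f2[p1, p3] - f2[p2, p4])
}

f4_tree <- c(
    "A,B;C,D" = f4_from_f2("A", "B", "C", "D"),
    "A,C;B,D" = f4_from_f2("A", "C", "B", "D"),
    "A,D;B,C" = f4_from_f2("A", "D", "B", "C")
)
f4_tree
```

On a perfect tree the configuration matching the topology is zero, and the other two both equal the internal branch length `e`.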

D-statistics are closely related to these $F_{4}$ statistics, differing only by a scaling factor.
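For reference, one common way of writing the two statistics in terms of allele frequencies $a_l, b_l, c_l, d_l$ of populations A, B, C, D at SNP $l = 1, \dots, L$ is given below. This follows the usual textbook definitions; the estimators used in practice add sample-size corrections that are omitted here.

$$
F_{4}(A,B;C,D) = \frac{1}{L} \sum_{l=1}^{L} (a_l - b_l)(c_l - d_l),
\qquad
D(A,B;C,D) = \frac{\sum_{l=1}^{L} (a_l - b_l)(c_l - d_l)}{\sum_{l=1}^{L} (a_l + b_l - 2 a_l b_l)(c_l + d_l - 2 c_l d_l)}
$$

The numerators are the same up to the factor $1/L$, so both statistics have the same sign; the denominator only rescales $D$ to lie between -1 and 1.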

In the following example, we use `qpdstat` to test whether two African populations (Mbuti hunter-gatherers and Yoruba from Nigeria) form a clade with respect to two ancient European populations: Yamnaya pastoralists from the Eurasian Steppe (`Russia_Samara_EBA_Yamnaya`) and early European farmers (`Germany_EN_LBK`).

In [None]:
pops_test <- "Yoruba.HO"

## here we use the `qpdstat` function to compute the three configurations of f4 statistics for the given populations
## 1) f4(Mbuti, Yoruba; Yamnaya, LBK)
## 2) f4(Mbuti, Yamnaya; Yoruba, LBK)
## 3) f4(Mbuti, LBK; Yoruba, Yamnaya)
r <- map_dfr(pops_test, ~ {
    r1 <- qpdstat(f2_data, pop1 = "Mbuti.HO", pop2 = .x, pop3 = "Russia_Samara_EBA_Yamnaya", pop4 = "Germany_EN_LBK")
    r2 <- qpdstat(f2_data, pop1 = "Mbuti.HO", pop2 = "Russia_Samara_EBA_Yamnaya", pop3 = .x, pop4 = "Germany_EN_LBK")
    r3 <- qpdstat(f2_data, pop1 = "Mbuti.HO", pop2 = "Germany_EN_LBK", pop3 = .x, pop4 = "Russia_Samara_EBA_Yamnaya")
    bind_rows(r1, r2, r3) %>%
        mutate(test_pop = .x)
})
r

Questions:
- Does the test support treeness of the two African populations with respect to the two ancient European populations?

Task: Using the same approach as above, replace Yoruba with the Maasai (`Masai.HO`), a nomadic pastoralist population from East Africa

Questions:
- Does the test support treeness of the Maasai and Mbuti with respect to the two ancient European populations?
- If not, what gene flow could explain the results?

In [None]:
## Task

### 4.2. Symmetry test
If we have a known outgroup O, we can use statistics of the form $F_{4}$(O,B;C,D) to test whether population B is symmetrically related to populations (C,D). In the following example we check whether modern populations are symmetrically related to Yamnaya and early farmers, using the Mbuti as outgroup.
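Reading such a test comes down to the sign and the Z-score of each estimate. Since $F_{4}$(O,B;C,D) averages $(o - b)(c - d)$ across SNPs, a positive value indicates excess sharing between B and D (`pop4`), and a negative value excess sharing between B and C (`pop3`). The sketch below applies this rule to made-up estimates for three placeholder populations, using |Z| > 3 as an assumed significance threshold.

```r
## Made-up f4 estimates and standard errors for three placeholder populations
toy <- data.frame(
    pop2 = c("PopX", "PopY", "PopZ"),
    est  = c(0.0030, -0.0024, 0.0004),
    se   = c(0.0005, 0.0006, 0.0005),
    stringsAsFactors = FALSE
)

## Z-score and interpretation (threshold of 3 is an assumption)
toy$z <- toy$est / toy$se
toy$closer_to <- ifelse(toy$z > 3, "pop4",
                 ifelse(toy$z < -3, "pop3", "symmetric (not rejected)"))
toy
```

A non-significant result means symmetry is not rejected, which is weaker than showing that the population is truly symmetrically related.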

In [None]:
pops_test <- sample_info %>%
    filter(age_group == "modern") %>%
    distinct(group_id) %>%
    pull(group_id)

## here we use the `qpdstat` function to check whether our modern test populations 
## are symmetrically related to Yamnaya and LBK
## f4(Mbuti, Test population; Yamnaya, LBK)
f4 <- qpdstat(f2_data, pop1 = "Mbuti.HO", pop2 = pops_test, pop3 = "Russia_Samara_EBA_Yamnaya", pop4 = "Germany_EN_LBK") %>%
    arrange(est) %>%
    mutate(pop2 = fct_reorder(pop2, est))
f4

## plot results
ggplot(f4, aes(x = est, y = pop2)) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_errorbarh(aes(xmin = est - 3 * se, xmax = est + 3 * se), height = 0.2) +
    geom_point() +
    theme_bw()

Questions:
- Which populations share more drift with early farmers?
- Which populations share more drift with Yamnaya?
- Which populations are symmetrically related? Does that mean they form a clade with the outgroup?

## 5. *qpAdm*

In the final part of the exercises we will explore how to estimate admixture proportions using *qpAdm*, a phylogeny-free approach based on sets of $F_4$ statistics between three groups of populations:

- a "target" population for which we want to estimate admixture proportions
- "source" or "left" populations which can potentially contribute ancestry to the "target"
- "outgroup" or "right" populations, differentially related to the "source" and "target" groups

The choice of "right" populations has an important impact on the results, as genetic drift shared differentially with the "source" and "target" populations is the signal that *qpAdm* uses to test for admixture and estimate proportions.
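The core intuition is that the target's $F_4$ statistics with respect to the "right" populations should look like a weighted average of the sources' statistics, with weights summing to one. The sketch below recovers the weight in a noise-free, two-source toy case by least squares. All numbers are invented, and this is a heavily simplified illustration: the real *qpAdm* procedure also accounts for the covariance of the statistics and tests model fit, none of which is shown here.

```r
## Hypothetical f4 statistics of two sources against the same set of right-population pairs
s1 <- c(0.010, -0.004, 0.006, 0.002)
s2 <- c(-0.002, 0.008, 0.001, -0.005)

## Simulated target: 70% source 1, 30% source 2 (no noise)
w_true <- 0.7
target_f4 <- w_true * s1 + (1 - w_true) * s2

## Least-squares weight under the constraint w1 + w2 = 1:
## minimise || target - (w1 * s1 + (1 - w1) * s2) ||^2
d <- s1 - s2
w1 <- sum((target_f4 - s2) * d) / sum(d^2)
weights_toy <- c(source1 = w1, source2 = 1 - w1)
weights_toy
```

If the two source vectors were nearly identical (`d` close to zero), the weight would be poorly determined, which is one way to see why "right" populations must be differentially related to the sources.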

In the following example, we explore the established three-population model of European ancestry (hunter-gatherers, farmers and Steppe pastoralists), with modern English as the target.

In [None]:
## model a European target population with a 3-population model (HG/farmer/Yamnaya)

## define right populations. These include
## 1) South African hunter-gatherer population
## 2) Ust-Ishim hunter-gatherer population
## 3) Kostenki14 Upper Paleolithic European hunter-gatherer population
## 4) Karelia Mesolithic European hunter-gatherer population
## 5) Anatolian Neolithic farmer population
## 6) Kotias Mesolithic Caucasus hunter-gatherer population

right <- c(
    "South_Africa_1900BP.SG", "Russia_Ust_Ishim_HG.DG",
    "Russia_Kostenki14.SG", "Russia_Karelia_HG.SG", 
    "Turkey_N.SG", "Georgia_Kotias.SG")

## define left populations. These include
## 1) Germany LBK Neolithic farmer population
## 2) Russia Samara EBA Yamnaya population
## 3) Hungary Koros Neolithic hunter-gatherer population

left <- c("Germany_EN_LBK", "Russia_Samara_EBA_Yamnaya", "Hungary_EN_HG_Koros")

## define target population
target <- "English.HO"

## here we use the `qpadm` function to compute admixture weights for the 3-population model
res <- qpadm(f2_data, left, right, target)

## results of admixture weights for 3 pop model
res$weights

## results of nested 1- and 2-population models, with the remaining source population contributions forced to 0
res$popdrop

# please ignore warning: solve(): system is singular; attempting approx solution

We can explore two components of the results. 

The `weights` component contains the final estimated admixture proportions for the three population model. 

Questions: 
- Which population has the highest contribution?

The `popdrop` component contains results contrasting the full three-population model with different nested models that use fewer source populations. For example, the row with pattern (`pat`) `001` shows proportions and model fit for a nested model where the contribution of the third source population (`Hungary_EN_HG_Koros` in this model) is forced to zero.
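Following the description above, each character of a pattern corresponds to one source population in the order of `left`, with `1` marking a dropped source. The small helper below, `decode_pat`, is hypothetical and only meant to make the patterns easier to read.

```r
## Source populations in the same order as passed to qpadm
left <- c("Germany_EN_LBK", "Russia_Samara_EBA_Yamnaya", "Hungary_EN_HG_Koros")

## Hypothetical helper: translate a popdrop pattern into kept and dropped sources
decode_pat <- function(pat, left) {
    dropped <- strsplit(pat, "")[[1]] == "1"
    list(kept = left[!dropped], dropped = left[dropped])
}

decode_pat("001", left)
decode_pat("110", left)
```

For instance, `110` would correspond to a single-source model in which only `Hungary_EN_HG_Koros` is retained.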

Questions: 
- Are models with fewer source populations also supported? (Check columns `chisq` and `p` for model fit.)
- If so, which populations are included there?

Task: 
Apply the same model to Sardinians as target population.

Questions:
- Which model is best supported?
- What are the admixture proportions?
- Do Sardinians have Steppe ancestry?

In [None]:
## Task

## 6. Free exercise - Explore admixture in Czech population history

The example dataset contains a large temporal transect of ancient populations from the present-day Czech Republic, along with their modern counterpart. Time permitting, explore their relationships using the tools learned here.