# Day 3, Practical 3

In this exercise we will cover:
 - Generating and displaying pairwise $F_{st}$ values
 - An example of f4 / D-statistics 
    
    
Tools used: plink2, R

The notebooks are editable, so feel free to experiment and change the code to see what happens or write notes in the text cells. Just remember to download the notebooks used here at some point if you want to save them with your own changes included.

In [None]:
### make directory for the exercise
mkdir -p ~/kenya2024/Fstats
cd ~/kenya2024/Fstats

which plink2
which R

We will be using the data set of called genotypes from different blue wildebeest populations, as well as some black wildebeest as an outgroup to compare to, saved in a plink format file set. 

Here is the map from earlier to help show the sampling locations of the different wildebeest populations:
<img src="https://raw.githubusercontent.com/popgenDK/popgenDK.github.io/gh-pages/images/slider/wildeBeastMap.png" alt="image info" />


 - Do you remember what a plink file set (.bed, .bim and .fam) contains?
 

In [None]:
head /davidData/users/thomas/workshop/wildebeest_fst.fam

In [None]:
head /davidData/users/thomas/workshop/wildebeest_fst.bim

Below we have the command used to run the $F_{st}$ estimation:

In [None]:

plink2 --bfile /davidData/users/thomas/workshop/wildebeest_fst --within /davidData/users/thomas/workshop/clusterfile \
    --fst CATPHENO method=hudson --allow-extra-chr --threads 10

 - The cluster file given with "--within /davidData/users/thomas/workshop/clusterfile" tells the program how to split the individuals into groups for comparison. If we did not know up front which samples belonged to which population, can you recall something we have looked at that could help with this?
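
As a reminder of what "method=hudson" estimates, here is a minimal sketch of Hudson's $F_{st}$ estimator computed by hand, as a ratio of sums across SNPs. Everything in it is made up for illustration: the allele frequencies, the sample sizes `n1` and `n2` (numbers of sampled chromosomes) and the file name `toy_freqs.txt`. plink2's own implementation may differ in details such as the handling of missing data.

```shell
# Toy allele frequencies at three SNPs (hypothetical numbers).
# Columns: p1 (frequency in population 1), p2 (frequency in population 2)
cat > toy_freqs.txt <<'EOF'
0.10 0.60
0.50 0.50
0.80 0.30
EOF

# n1 and n2: number of sampled chromosomes per population (hypothetical)
hudson_fst=$(awk -v n1=20 -v n2=20 '
{
    p1 = $1; p2 = $2
    # numerator: squared frequency difference, corrected for sampling noise
    num += (p1 - p2) * (p1 - p2) - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # denominator: chance that two alleles drawn from different populations differ
    den += p1 * (1 - p2) + p2 * (1 - p1)
}
END { printf "%.4f", num / den }   # ratio of sums, not the mean of per-SNP ratios
' toy_freqs.txt)

echo "Hudson Fst (toy data): $hudson_fst"
```

Note that the second SNP, where both populations have the same frequency, contributes a slightly negative numerator. This is why estimates for very similar populations can end up just below zero.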

Then let's have a look at the results:

In [None]:
# some hartebeest samples were also originally included in this data set, but now we can just remove those from
# the output
grep -ve Hartebeest plink2.fst.summary > tmp
mv tmp plink2.fst.summary

# print the results
column -t plink2.fst.summary

 - Which populations are most genetically differentiated? Which are most similar?
 
 - Can you identify a pattern in the $F_{st}$ values between black wildebeest and each of the blue wildebeest populations? Try to see if you can explain this pattern.

Is each of these values large or small? This is difficult to answer without context, as it depends on the type of data you are analyzing, the amount of data and the scope of your study. To provide context, one often looks at a matrix of $F_{st}$ values, which can be visualized using a heatmap. To do this we first need to read the table above into R, transform it into a matrix, and then generate a heatmap using the heatmap.2 function from the gplots package.

In [None]:
options(repr.matrix.max.cols=10, repr.matrix.max.rows=10)
options(repr.plot.width=16, repr.plot.height=16)
library(gplots)

# read the data into R
fst <- read.table("~/kenya2024/Fstats/plink2.fst.summary")
names(fst) <- c("pop1", "pop2", "est")
fst <- fst[fst$pop1 != "Hartebeest" & fst$pop2 != "Hartebeest",]

Here we transform the table from above into a pairwise matrix that contains the exact same information, just in a different format:

In [None]:
mat <- matrix(NA, 8, 8)
mat[lower.tri(mat)] <- fst$est
mat <- t(mat)
mat[lower.tri(mat)] <- fst$est
colnames(mat) <- c( "Amboseli", fst[1:7,2])
rownames(mat) <- c( "Amboseli", fst[1:7,2])
mat
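
The same reshaping, from one row per pair to a square symmetric matrix, can also be sketched directly in the shell. This is only an illustration on a made-up table: the population names `popA`, `popB`, `popC`, the values and the file name `toy_fst.txt` are hypothetical. Unlike the positional fill above, it looks up each pair by name, so it does not depend on the row order of the input.

```shell
# Toy long-format table: pop1, pop2, Fst estimate (hypothetical values)
cat > toy_fst.txt <<'EOF'
popA popB 0.10
popA popC 0.30
popB popC 0.25
EOF

toy_matrix=$(awk '
{
    # remember the order in which population names first appear
    if (!($1 in seen)) { seen[$1] = 1; names[++n] = $1 }
    if (!($2 in seen)) { seen[$2] = 1; names[++n] = $2 }
    # store each value under both orderings so the matrix is symmetric
    fst[$1, $2] = $3
    fst[$2, $1] = $3
}
END {
    # header row with the population names
    printf "pop"
    for (j = 1; j <= n; j++) printf " %s", names[j]
    printf "\n"
    # one row per population, with NA on the diagonal
    for (i = 1; i <= n; i++) {
        printf "%s", names[i]
        for (j = 1; j <= n; j++) {
            if (i == j) printf " NA"
            else printf " %s", fst[names[i], names[j]]
        }
        printf "\n"
    }
}' toy_fst.txt)

echo "$toy_matrix"
```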

In [None]:
heatmap.2(mat, symm=TRUE, trace="none", cexRow=1.5, cexCol=1.5, margins = c(12, 12))

 - Look at the clustering tree produced by this method. Do the different groups relate to each other as we would expect?

 - We can see some discrete levels of values in the color key and in the histogram in the inset plot. What do these correspond to?
 
An important note is that the tree/dendrogram used to order the groups here simply comes from clustering based on the $F_{st}$ values and will not necessarily reflect the true evolutionary history of the groups.



## F4 / D-statistic
For this part of the exercise we use the R package admixtools to compute F4 values. These correspond to the D-statistic mentioned earlier in the day, just with a flipped sign: negative values of F4 correspond to a positive D-statistic and vice versa.

Here we take a look at whether any of the populations of blue wildebeest are more closely related to the black wildebeest than the others, which would indicate gene flow.

In [None]:
library(admixtools)
library(tidyverse)
options(repr.plot.width=16)

# load in f2 values that were pre computed from the plink data set
f2 <- read_f2("/davidData/users/thomas/workshop/wildebeest_fstats_wildebeestref")

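# define the groups, written in a nested form that mirrors the expected tree of relationships
#  (similar to newick format for those familiar); note that R flattens nested c() calls,
#  so what is stored is a plain vector of population names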
tree <- c(c(c(c(c(c('N-Selous', 'C-Luangwa'), 'B-Ethosha'), c(c(c("E-Nairobi", "E-Amboseli"), "E-Monduli"), 'W-Serengeti')), 'black'), "hartebeest"))

# generate f4 values
f4 <- f4(f2, tree, f4mode = FALSE)

# select only values for subtrees where pop3 is black wildebeest and outgroup is hartebeest
f4_sub <- f4[f4$pop4=="hartebeest" & f4$pop3=="black",]

# make new row for each combination of pop 1 and pop 2 where they are switched and their f4 flipped, to make plot look nicer
f4_sub_switched <- f4_sub %>%
                          mutate(
                            temp = .data[["pop1"]],
                            !!"pop1" := .data[["pop2"]],
                            !!"pop2" := temp,
                            across(all_of(c("est", "z" )), ~ . * -1)
                          ) %>%
                          select(-temp)

f4_sub <- bind_rows(f4_sub, f4_sub_switched)

# show resulting subset
f4_sub

In this output, look for values in the column labelled 'z'. Those rows that have values higher than 3 or lower than -3 are usually considered statistically significant. If the value is negative, it means that the population in column pop2 has more alleles in common with black wildebeest than the population in column pop1. If it's positive, the interpretation is the opposite, i.e. pop1 has more alleles in common with black than pop2.
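
To make the sign convention concrete, here is a minimal sketch that computes f4 by hand as the average over SNPs of $(p_1 - p_2)(p_3 - p_4)$, where $p_i$ is the allele frequency in population $i$. The frequencies are made up so that pop1 tracks pop3 more closely than pop2 does, and the file name `toy_f4_freqs.txt` is hypothetical. admixtools works from precomputed f2 blocks and also estimates standard errors and z-scores, which this sketch does not attempt.

```shell
# Toy allele frequencies at three SNPs (hypothetical numbers).
# Columns: p1 p2 p3 p4
cat > toy_f4_freqs.txt <<'EOF'
0.80 0.40 0.90 0.30
0.20 0.50 0.10 0.40
0.60 0.30 0.70 0.20
EOF

# f4(pop1, pop2; pop3, pop4): mean of (p1 - p2) * (p3 - p4) over SNPs
toy_f4=$(awk '
{ sum += ($1 - $2) * ($3 - $4); n++ }
END { printf "%.4f", sum / n }
' toy_f4_freqs.txt)

# same statistic with pop1 and pop2 switched: only the sign changes
toy_f4_switched=$(awk '
{ sum += ($2 - $1) * ($3 - $4); n++ }
END { printf "%.4f", sum / n }
' toy_f4_freqs.txt)

echo "f4(pop1, pop2; pop3, pop4) = $toy_f4"
echo "f4(pop2, pop1; pop3, pop4) = $toy_f4_switched"
```

The positive value in the first line reflects that pop1 shares more alleles with pop3 than pop2 does. Switching pop1 and pop2 flips the sign, which is what the extra rows added with `f4_sub_switched` in the cell above rely on.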

This can also be shown in a plot. Run this plotting code and see if you can make sense of the plot below.

In [None]:
ggplot(f4_sub, aes(est, pop1, fill = pop2)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~pop1, ncol=1, scales="free_y") +
theme_bw() +
xlab("f4 value \n with population 3 as black wildebeest and population 4 as hartebeest") +
theme(axis.text=element_text(size=20), legend.text = element_text(size=30), axis.title=element_text(size=20),
      strip.background = element_blank(), strip.text.x = element_blank(), legend.key.size = unit(2, 'cm'), legend.title = element_text(size=30)) 

 - Which population of blue wildebeest had gene flow with black wildebeest? How do you think this happened, and why did not all blue wildebeest populations have gene flow with black wildebeest?