# Population genetics summer course 2025, Denmark

# Local Ancestry Inference Exercise

## Setup Environment

In [None]:
COURSE_PATH=/course/popgen25
DATA_PATH=${COURSE_PATH}/LocalAncestry
SOFTWARE_PATH=${COURSE_PATH}/software

# go to working folder
mkdir -p ~/popgen25_localancestry
cd ~/popgen25_localancestry


ln -sfn ~/popgen25_localancestry ~/current_folder
ln -sfn ${DATA_PATH} ~/data_folder


In [None]:

FLARE=${SOFTWARE_PATH}/flare.jar
MOSAIC=${SOFTWARE_PATH}/mosaic.R

which bcftools

ls ${FLARE}
ls ${MOSAIC}


## Exercise

#### For this practical we will infer local ancestry in simulated data using two methods: FLARE and MOSAIC. We will compare the outputs from the two methods for different demographic models and show how their performance may differ. 

We will use simulated data from two toy models, with gene flow happening either at 20 or 200 generations ago.

If you're interested in how the data was simulated, please refer to the slendr R package: https://slendr.net/index.html

Let's take a look at the demographic model based on which we simulated genetic data.

One ancestral population "pop" splits into two populations (pop_b and pop_c) 1500 generations ago. At 400 generations ago, pop_mix splits from pop_b, and then subsequent gene flow occurs from pop_c into pop_mix either 20 or 200 generations ago.

##### Gene flow from pop_c into pop_mix 20 generations ago:

In [None]:

import os
from IPython.display import Image
Image(filename=os.path.expanduser('~/data_folder/model_gen_adm_20.png'))


##### Gene flow from pop_c into pop_mix 200 generations ago:

In [None]:

import os
from IPython.display import Image
Image(filename=os.path.expanduser('~/data_folder/model_gen_adm_200.png'))


##### Let's have a look at the files available for each scenario

In [None]:

ls ~/data_folder/model*


##### Focusing on the scenario where gene flow happens 20 generations ago, let's take a look at the metadata file

In [None]:

cat ~/data_folder/model_gen_adm_20_meta.tsv

#### Q: How many samples are taken at each time point and population?

In [None]:

suppressMessages(library(dplyr))

meta <- read.delim('~/data_folder/model_gen_adm_20_meta.tsv')
meta %>% 
    group_by(time, pop) %>%
    tally()

##### Let's do the same for the scenario with gene flow happening 200 generations ago:

In [None]:

suppressMessages(library(dplyr))

meta <- read.delim('~/data_folder/model_gen_adm_200_meta.tsv')
meta %>% 
    group_by(time, pop) %>%
    tally()

We can see that in both demographic scenarios, we have 10 individuals from pop_mix from one time point (0=present), and 50 from pop_b and 50 from pop_c from two different time points (0 and 1200 generations ago).

Let's focus a bit on the sources sampled at two different time points, 0 and 1200 generations before present. Specifically, let's see how differentiated pop_b and pop_c are in each case. 

##### From the simulated data, we have estimated FST values between pop_b and pop_c at both time points. (Remember FST?!) Let's load the estimates and take a look:

In [None]:

cat ~/data_folder/model_gen_adm_20_fst.tsv

#### Q: Do the FST values differ? Why?

Let's keep this difference in the back of our minds for a bit; it might come in handy later in the exercise.
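
As a reminder of what FST measures, here is a minimal Python sketch of Hudson's FST estimator for two populations, combined across sites as a ratio of sums. The allele frequencies and sample sizes below are made up for illustration, and the estimates in the course file may have been computed with a different estimator.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST from per-site allele frequencies in two populations.

    p1, p2: allele frequencies per site; n1, n2: number of sampled chromosomes.
    Numerators and denominators are summed across sites before dividing.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        den += a * (1 - b) + b * (1 - a)
    return num / den


# Hypothetical frequencies: one weakly and one strongly differentiated pair
similar = hudson_fst([0.50, 0.30, 0.80], [0.55, 0.25, 0.75], 100, 100)
diverged = hudson_fst([0.90, 0.10, 0.80], [0.10, 0.85, 0.20], 100, 100)
print(f"similar populations:  FST = {similar:.3f}")
print(f"diverged populations: FST = {diverged:.3f}")
```

Values near 0 mean the two populations have very similar allele frequencies, while values closer to 1 mean they are strongly differentiated.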

##### Time to dive into the true ancestry tracts from the two scenarios. Let's load the tracts for each:

In [None]:

cat ~/data_folder/model_gen_adm_20_tracts.tsv

In [None]:

cat ~/data_folder/model_gen_adm_200_tracts.tsv

OK, this is a lot of numbers. Let's make a plot so that it's easier on the eye.

##### Plot the true ancestry tracts with gene flow happening 20 generations ago:

In [None]:

library(ggplot2)
suppressMessages(library(dplyr))
options(scipen=999)

tracts_20 <- read.delim('~/data_folder/model_gen_adm_20_tracts.tsv')

values = c("pop_b" = "green4", 
           "pop_c" = "cadetblue3")

tracts_20 %>%
  mutate(chrom = paste(name, " (node", node_id, ")")) %>%
  ggplot(aes(x = left, xend = right, y = chrom, yend = chrom, color = source_pop)) +
  geom_segment(linewidth = 5) +
  scale_colour_manual(values = values, name="Source population") + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), 
        panel.grid = element_blank(),
        legend.position = 'bottom',
        strip.text.y = element_text(size = 6)) +
  labs(x = "Position (bp)", y = "Haplotypes") +
  ggtitle("True ancestry tracts for gene flow happening 20 generations ago") +
  facet_grid(name ~ ., scales = "free_y")


Each row is an individual from pop_mix with its two haplotypes. The colour of each segment indicates the ancestry from which it came (either pop_b or pop_c).

##### Now looking at gene flow happening 200 generations ago:

In [None]:

library(ggplot2)
suppressMessages(library(dplyr))
options(scipen=999)

tracts_200 <- read.delim('~/data_folder/model_gen_adm_200_tracts.tsv')

values = c("pop_b" = "green4", 
           "pop_c" = "cadetblue3")

tracts_200 %>%
  mutate(chrom = paste(name, " (node", node_id, ")")) %>%
  ggplot(aes(x = left, xend = right, y = chrom, yend = chrom, color = source_pop)) +
  geom_segment(linewidth = 5) +
  scale_colour_manual(values = values, name="Source population") + 
  theme_minimal() +
  theme(axis.text.y = element_blank(), 
        panel.grid = element_blank(),
        legend.position = 'bottom',
        strip.text.y = element_text(size = 6)) +
  labs(x = "Position (bp)", y = "Haplotypes") +
  ggtitle("True ancestry tracts for gene flow happening 200 generations ago") +
  facet_grid(name ~ ., scales = "free_y")


#### Q: What difference do we see in the tracts in each case? Why?

##### Let's look at the tract length distribution:

In [None]:

tracts <- rbind(tracts_20, tracts_200)

tracts$adm_time_f = factor(tracts$adm_time, levels=c('20', '200'))

ggplot(tracts) +
  geom_histogram(
    aes(x = length, fill=source_pop),
    binwidth = 100000, alpha = 0.75
  ) +
  labs(
    x = "Tract length (bp)", y = "Count",
    title = "Tract length distribution"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_fill_manual(values = values) + 
  theme_bw() +
  theme(legend.position = "none", 
        title =element_text(size=9),
        axis.text=element_text(size=9)) +  
  facet_grid(adm_time_f~source_pop, scales = "free_y")


#### Q: Is this what we would expect? Why do we always have longer tracts of pop_b ancestry relative to pop_c tracts?
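
One way to build intuition here is a simplified single-pulse approximation, sketched below in Python. It assumes that a tract of one ancestry ends at the first recombination event that joins it to the other ancestry, so the rate per Morgan is roughly the number of generations times the fraction of the other ancestry. The admixture proportion `m_c` is a hypothetical value, not taken from the simulation, and the approximation ignores drift and the finite chromosome length.

```python
def mean_tract_length_cm(generations, other_ancestry_fraction):
    """Approximate mean tract length (cM) under a single-pulse model.

    A tract is broken when recombination joins it to the *other* ancestry,
    which happens at a rate of about generations * other_ancestry_fraction
    per Morgan. The mean of the resulting exponential is 1 / rate.
    """
    rate_per_morgan = generations * other_ancestry_fraction
    return 100.0 / rate_per_morgan  # Morgans -> cM


m_c = 0.3  # hypothetical fraction of pop_c ancestry in pop_mix (assumption)

for g in (20, 200):
    len_c = mean_tract_length_cm(g, 1 - m_c)  # pop_c tracts end in pop_b
    len_b = mean_tract_length_cm(g, m_c)      # pop_b tracts end in pop_c
    print(f"{g:>3} generations: pop_b ~ {len_b:.2f} cM, pop_c ~ {len_c:.2f} cM")
```

Under these assumptions, tracts get shorter as the admixture event gets older, and the majority ancestry has the longer tracts.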

### Now let's test some LAI methods. For this exercise, we will use FLARE (Browning et al., 2023) and MOSAIC (Salter-Townshend & Myers, 2019).

The point of this exercise is to see how LAI method performance may vary depending on different aspects of our data, such as sample sizes and how far back the admixture time is.  

You will find the files needed for each software in the following folders:

In [None]:

ls -1 ~/data_folder/FLARE/

echo ""

ls -1 ~/data_folder/MOSAIC/

#### Following the demographic scenarios pictured at the beginning of the exercise, we will infer local ancestry in target individuals from the pop_mix population, using samples from the pop_b and pop_c populations as sources. We will first focus on the case where gene flow happened 20 generations ago, using 20 samples from each of the two source populations (pop_b and pop_c) to "paint" the genomes of 10 individuals in the target population (pop_mix). Both source and target samples are sampled from the present (0 generations ago). We'll look at the results from the other scenarios later on.

### LAI using FLARE

##### FLARE is a recently developed LAI method that has been tailored to run on large datasets with reduced computational time. Let's take a look at the input files needed:

In [None]:
ls -1 ~/data_folder/FLARE/gen_adm_20_source_20_source_time_0/

##### Let's look at each of the files in more detail:

In [None]:

adm20_flare=~/data_folder/FLARE/gen_adm_20_source_20_source_time_0/

#model_gen_adm_20_filt.vcf.gz: simulated VCF

#model_gen_adm_20-genetic_map.txt: genetic map with the following columns: chromosome, rs# or sno identifier, genetic distance (cM units), base-pair position
echo model_gen_adm_20-genetic_map.txt
head ${adm20_flare}/model_gen_adm_20-genetic_map.txt 
echo ""

#source_name_pop_gen_adm_20_source_20_source_time_0.txt: file containing the source sample names (first column) and their population (second column)
echo source_name_pop_gen_adm_20_source_20_source_time_0.txt
head ${adm20_flare}/source_name_pop_gen_adm_20_source_20_source_time_0.txt
echo ""

#target_name_gen_adm_20_source_20_source_time_0.txt: file with target sample names (from pop_mix)
echo target_name_gen_adm_20_source_20_source_time_0.txt
head ${adm20_flare}/target_name_gen_adm_20_source_20_source_time_0.txt


#####  Time to run FLARE (this might take a few minutes)

In [None]:

# Specify parameters

# Admixture time (generations before present)
GEN_ADM=20

# Number of samples from each source population
SOURCE=20

# Time point from which source samples were taken (generations before present)
SOURCE_TIME=0

# Specify path to input files
FOLDER=~/data_folder/FLARE/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}

# Specify input files
MAP=${FOLDER}/model_gen_adm_${GEN_ADM}-genetic_map.txt # Plink format genetic map
VCF=${FOLDER}/model_gen_adm_${GEN_ADM}_filt.vcf.gz # VCF with target and source samples
SOURCE_LIST=${FOLDER}/source_name_pop_gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}.txt # two column file with source sample names and source population name
TARGET_LIST=${FOLDER}/target_name_gen_adm_${GEN_ADM}_source_20_source_time_0.txt # file with target sample names (considered one population)

# Make output folder if not present
if [ ! -d ~/current_folder/flare_out ]; then
    mkdir -p ~/current_folder/flare_out
fi

# Prefix for output files
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output

# Run FLARE
java -Xmx500g -jar ${FLARE} \
ref=${VCF} \
ref-panel=${SOURCE_LIST} \
gt=${VCF} \
map=${MAP} \
gt-samples=${TARGET_LIST} \
gen=${GEN_ADM} \
out=${PREFIX} \
min-mac=1 \
probs=true \
nthreads=4

# index anc.vcf.gz output file
bcftools index -f ~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output.anc.vcf.gz

# What each option represents:
# java -Xmx500g -jar ${FLARE}: run the FLARE jar with a maximum Java heap size of 500 GB
# ref: VCF file with source (reference) samples
# ref-panel: two column file with source sample names and source population name
# gt: VCF with target samples
# map: Plink format genetic map
# gt-samples: file with target sample names (considered one population)
# gen: number of generations since admixture (default is 10 if not specified)
# out: output file prefix
# min-mac: minimum minor allele count in the ref VCF
# probs: specifies whether the posterior ancestry probabilities are reported
# nthreads: number of computational threads

#####  Let's look at the output files:

In [None]:

GEN_ADM=20
SOURCE=20
SOURCE_TIME=0
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output


# List files in output folder
ls ${PREFIX}*


In [None]:

GEN_ADM=20
SOURCE=20
SOURCE_TIME=0
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output

# Look at the global ancestry estimates
#zless -S ${PREFIX}.global.anc.gz
zcat ${PREFIX}.global.anc.gz


In [None]:

GEN_ADM=20
SOURCE=20
SOURCE_TIME=0
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output

# Look at the model output
cat ${PREFIX}.model


In [None]:

GEN_ADM=20
SOURCE=20
SOURCE_TIME=0
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output

# Look at the log file which contains the summary of the analysis
cat ${PREFIX}.log


In [None]:

GEN_ADM=20
SOURCE=20
SOURCE_TIME=0
PREFIX=~/current_folder/flare_out/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}-output

# Look at the file which contains the inferred local ancestry for each allele (we'll only view one sample to make it easier on the eye)
bcftools view -s pop_mix_1 ${PREFIX}.anc.vcf.gz | head -20


Take some time to understand the FORMAT field entries AN1, AN2, ANP1, ANP2.
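
To make these fields concrete, here is a small Python sketch that decodes one sample entry. The record below is made up for illustration. The sketch assumes that AN1 and AN2 are 0-based ancestry indices for the two haplotypes, that ANP1 and ANP2 are comma-separated posterior probabilities in the same ancestry order, and that the index-to-name mapping (here `pop_b=0`, `pop_c=1`) matches the `##ANCESTRY` header line of your own output file. Check the header before relying on it.

```python
def parse_flare_sample(format_field, sample_field, ancestry_names):
    """Decode AN1/AN2/ANP1/ANP2 for one sample at one site."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    result = {}
    for hap in ("1", "2"):
        # Posterior probability of each ancestry for this haplotype
        probs = [float(x) for x in values["ANP" + hap].split(",")]
        result["hap" + hap] = {
            "ancestry": ancestry_names[int(values["AN" + hap])],
            "posterior": dict(zip(ancestry_names, probs)),
        }
    return result


# Hypothetical FORMAT and sample columns (not real FLARE output)
ancestry_names = ["pop_b", "pop_c"]  # assumed order: pop_b=0, pop_c=1
fmt = "GT:AN1:AN2:ANP1:ANP2"
sample = "0|1:0:1:0.98,0.02:0.10,0.90"

calls = parse_flare_sample(fmt, sample, ancestry_names)
for hap, info in calls.items():
    print(hap, info["ancestry"], info["posterior"])
```

In this toy record, haplotype 1 is assigned to pop_b and haplotype 2 to pop_c, each being the ancestry with the highest posterior probability.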

#### Q: What ancestry or ancestries is the first position of sample pop_mix_1 assigned to? Why?

### LAI using MOSAIC

##### MOSAIC is an LAI method which allows for unknown relationships between reference panels (sources) of ancestry. Let's take a look at the input files needed:

In [None]:

ls -1 ~/data_folder/MOSAIC/gen_adm_20_source_20_source_time_0/


In [None]:

adm20_mosaic=~/data_folder/MOSAIC/gen_adm_20_source_20_source_time_0/

#model_gen_adm_20_filt.vcf.gz: simulated VCF file

#pop_bgenofile.1: genotype file of pop_b samples
echo pop_bgenofile.1
head ${adm20_mosaic}/pop_bgenofile.1
echo ""

#pop_cgenofile.1: genotype file of pop_c samples
echo pop_cgenofile.1
head ${adm20_mosaic}/pop_cgenofile.1
echo ""

#pop_mixgenofile.1: genotype file of pop_mix samples
echo pop_mixgenofile.1
head ${adm20_mosaic}/pop_mixgenofile.1
echo ""

#sample.names: file with population in first column and sample names in second column
echo sample.names
cat ${adm20_mosaic}/sample.names
echo ""

#snpfile.1: file with one snp per row and 6 columns: rsID, chr, distance, position, allele ?, allele ?. 
echo snpfile.1
head ${adm20_mosaic}/snpfile.1 
echo ""


In [None]:

adm20_mosaic=~/data_folder/MOSAIC/gen_adm_20_source_20_source_time_0/

#rates.1: Recombination rates file for MOSAIC, has a bit of a weird format compared to normal genetic maps
echo rates.1

#3 rows with #sites, position, cumulative recombination rate (in centiMorgans). 

# the first row shows the number of sites
sed -n '1p' ${adm20_mosaic}/rates.1 

# the second row shows the snp positions
sed -n '2p' ${adm20_mosaic}/rates.1 | cut -d " " -f 1-10

# the third row shows the cumulative recombination rate (in cM) at each snp position
sed -n '3p' ${adm20_mosaic}/rates.1 | cut -d " " -f 1-10
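
If the three-row layout feels unusual, the Python sketch below shows one way to read it and convert it into per-interval recombination rates. The toy content is made up, and the exact form of the first line (here `:sites:4`) is an assumption, so the parser simply extracts the first integer it finds on that line.

```python
import re


def parse_rates(text):
    """Parse a three-row rates file: number of sites, positions, cumulative cM."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_sites = int(re.search(r"\d+", lines[0]).group())
    positions = [int(float(x)) for x in lines[1].split()]
    cum_cm = [float(x) for x in lines[2].split()]
    if not (len(positions) == len(cum_cm) == n_sites):
        raise ValueError("rows 2 and 3 must each have one value per site")
    return positions, cum_cm


# Hypothetical file content (not the course data)
toy = ":sites:4\n1000 5000 12000 20000\n0.0 0.004 0.011 0.019\n"
positions, cum_cm = parse_rates(toy)

# Local recombination rate between adjacent sites, in cM per Mb
rates = [
    (c2 - c1) / ((p2 - p1) / 1e6)
    for (p1, c1), (p2, c2) in zip(
        zip(positions, cum_cm), zip(positions[1:], cum_cm[1:])
    )
]
print(rates)
```

A standard genetic map stores one row per site, whereas this layout stores the same information as two long rows.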


#####  Run MOSAIC. This will take a couple of minutes, so time to take a break!

In [None]:

# Specify parameters

# Admixture time (generations before present)
GEN_ADM=20

# Number of samples from each source population
SOURCE=20

# Time point from which source samples were taken (generations before present)
SOURCE_TIME=0

# Specify path to input files
FOLDER=~/data_folder/MOSAIC/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}

# Specify output path
OUTPUT=~/current_folder/MOSAIC/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}/output

# Remove output folder if it already exists from previous run
if [ -d ${OUTPUT} ]; then
    rm -r ${OUTPUT}
fi

# Create the output folder
mkdir -p ${OUTPUT}
#OUTPUT=LAI/exercise/MOSAIC/gen_adm_${GEN_ADM}_source_${SOURCE}_source_time_${SOURCE_TIME}/output

# specify path to fastfiles temporary folder
mkdir -p ~/.cache/fastfiles

# Specify the number of gridpoints per cM equal to the product of 0.0012 and the # of sites (suggested in Browning et al., 2023)
SITES=$(bcftools view -H ${FOLDER}/model_gen_adm_${GEN_ADM}_filt.vcf.gz | wc -l)
GRID=$(echo "$SITES * 0.0012" | bc) 

Rscript ${MOSAIC} \
'pop_mix' \
${FOLDER}/ \
--number 10 \
--ancestries 2 \
--chromosomes 1:1 \
--GpcM $GRID \
--fastfiles ~/.cache/fastfiles \
--gens ${GEN_ADM} \
--nophase \
--maxcores 16

# Move files to working directory
mv MOSAIC_RESULTS ${OUTPUT}


##### Let's now plot the inferred tracts from the two methods and compare them with the true tracts we extracted from the simulated data. We've formatted the output of each method so that we have all the tracts in the same format. You can find them here, with separate files for haplotypes 1 and 2:

In [None]:

ls ~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_0_prob_0.0_all_tracts_hap*.tsv


##### Look at the format of the hap1 file:

In [None]:

cat ~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_0_prob_0.0_all_tracts_hap1.tsv


##### Plot true and inferred tracts:

In [None]:

# import libraries
library(ggplot2)
suppressMessages(library(dplyr))

tracks_hap1 <- read.delim('~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_0_prob_0.0_all_tracts_hap1.tsv')
tracks_hap2 <- read.delim('~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_0_prob_0.0_all_tracts_hap2.tsv')

values = c("pop_b" = "chartreuse4", 
           "pop_c" = "cyan3")

# Let's plot three samples from the pop_mix population
samples <- c('pop_mix_1', 'pop_mix_2', 'pop_mix_3')

ggplot()+
  geom_segment(data=tracks_hap1 %>% filter(name %in% samples), aes(y = type, yend = type, x = left/1000000, xend = right/1000000, colour=source_pop), linewidth = 12) +
  geom_segment(data=tracks_hap2 %>% filter(name %in% samples), aes(y = type, yend = type, x = left_2/1000000, xend = right_2/1000000, colour=source_pop), linewidth = 12) +
  theme_bw()+
  theme(panel.spacing.y=unit(0.1, "lines"),
        axis.title.x = element_text(margin=margin(t=5)),
        axis.title.y = element_text(margin=margin(r=5)),
        axis.text.x = element_text(color = "black"),
        axis.ticks.x = element_line(linewidth = 0.3),
        plot.margin = margin(r = 0.5, l = 0.1, b = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.y.left = element_text(size = 10, angle = 0),
        legend.position = 'none',
        strip.background =element_rect(fill="gray28"),
        strip.text = element_text(colour = 'white')
  ) +
  scale_color_manual(name = "Ancestry", values = values) +
  ylab("Sample") +
  xlab("Genome position (Mb)") +
  ggtitle("True and inferred ancestry tracts - gen_adm=20 generations ago") +
  facet_grid(name~.)


Each panel is a sample from the pop_mix population (only 3 samples shown in this plot) with its true and inferred ancestry tracts from MOSAIC and FLARE. Note that the two haplotypes have been "merged" into one continuous long chromosome. The colour of each segment indicates the ancestry from which it came (pop_b in green or pop_c in blue).

#### Q: How do the inferred ancestry tracts look compared to the true ones?
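
Beyond judging the plot by eye, the comparison can be quantified. The Python sketch below computes the fraction of base pairs on one haplotype where the inferred ancestry matches the true one. The tracts are made up for illustration, and each set is assumed to be non-overlapping, with the same `left`, `right` and `source_pop` meaning as in the tract files.

```python
def agreement(true_tracts, inferred_tracts):
    """Fraction of truly labelled base pairs with a matching inferred label.

    Each tract is a (left, right, source_pop) tuple. Tracts within a set are
    assumed not to overlap each other.
    """
    total = sum(right - left for left, right, _ in true_tracts)
    matched = 0
    for t_left, t_right, t_pop in true_tracts:
        for i_left, i_right, i_pop in inferred_tracts:
            overlap = min(t_right, i_right) - max(t_left, i_left)
            if overlap > 0 and t_pop == i_pop:
                matched += overlap
    return matched / total


# Hypothetical tracts for a single haplotype (positions in bp)
true_tracts = [(0, 40, "pop_b"), (40, 60, "pop_c"), (60, 100, "pop_b")]
inferred_tracts = [(0, 45, "pop_b"), (45, 55, "pop_c"), (55, 100, "pop_b")]

print(f"agreement: {agreement(true_tracts, inferred_tracts):.2f}")
```

In this toy example the inferred pop_c tract is too short at both ends, so 10 of the 100 base pairs are mislabelled.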

##### Now let's see what the inferred tracts look like if we focus on a scenario where gene flow from pop_c into pop_mix happened 200 generations ago instead. We've already run the local ancestry inference methods and have the tracts ready to plot:

In [None]:

# import libraries
library(ggplot2)
suppressMessages(library(dplyr))

tracks_hap1 <- read.delim('~/data_folder/all_tracts/model_gen_adm_200_source_20_source_time_0_prob_0.0_all_tracts_hap1.tsv')
tracks_hap2 <- read.delim('~/data_folder/all_tracts/model_gen_adm_200_source_20_source_time_0_prob_0.0_all_tracts_hap2.tsv')

values = c("pop_b" = "chartreuse4", 
           "pop_c" = "cyan3")

# Let's plot three samples from the pop_mix population
samples <- c('pop_mix_1', 'pop_mix_2', 'pop_mix_3')

ggplot()+
  geom_segment(data=tracks_hap1 %>% filter(name %in% samples), aes(y = type, yend = type, x = left/1000000, xend = right/1000000, colour=source_pop), linewidth = 12) +
  geom_segment(data=tracks_hap2 %>% filter(name %in% samples), aes(y = type, yend = type, x = left_2/1000000, xend = right_2/1000000, colour=source_pop), linewidth = 12) +
  theme_bw()+
  theme(panel.spacing.y=unit(0.1, "lines"),
        axis.title.x = element_text(margin=margin(t=5)),
        axis.title.y = element_text(margin=margin(r=5)),
        axis.text.x = element_text(color = "black"),
        axis.ticks.x = element_line(linewidth = 0.3),
        plot.margin = margin(r = 0.5, l = 0.1, b = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.y.left = element_text(size = 10, angle = 0),
        legend.position = 'none',
        strip.background =element_rect(fill="gray28"),
        strip.text = element_text(colour = 'white')
  ) +
  scale_color_manual(name = "Ancestry", values = values) +
  ylab("Sample") +
  xlab("Genome position (Mb)") +
  ggtitle("True and inferred ancestry tracts - gen_adm=200 generations ago") +
  facet_grid(name~.)


#### Q: Does the overlap seem to change? Why?

##### Let's now see what happens to the above scenario if we increase the sample size of sources from pop_b and pop_c from 20 to 50. Again, we have re-run the analyses in FLARE and MOSAIC and have the tracts ready to view:

In [None]:

# import libraries
library(ggplot2)
suppressMessages(library(dplyr))

tracks_hap1 <- read.delim('~/data_folder/all_tracts/model_gen_adm_200_source_50_source_time_0_prob_0.0_all_tracts_hap1.tsv')
tracks_hap2 <- read.delim('~/data_folder/all_tracts/model_gen_adm_200_source_50_source_time_0_prob_0.0_all_tracts_hap2.tsv')

values = c("pop_b" = "chartreuse4", 
           "pop_c" = "cyan3")

# Let's plot three samples from the pop_mix population
samples <- c('pop_mix_1', 'pop_mix_2', 'pop_mix_3')

ggplot()+
  geom_segment(data=tracks_hap1 %>% filter(name %in% samples), aes(y = type, yend = type, x = left/1000000, xend = right/1000000, colour=source_pop), linewidth = 12) +
  geom_segment(data=tracks_hap2 %>% filter(name %in% samples), aes(y = type, yend = type, x = left_2/1000000, xend = right_2/1000000, colour=source_pop), linewidth = 12) +
  theme_bw()+
  theme(panel.spacing.y=unit(0.1, "lines"),
        axis.title.x = element_text(margin=margin(t=5)),
        axis.title.y = element_text(margin=margin(r=5)),
        axis.text.x = element_text(color = "black"),
        axis.ticks.x = element_line(linewidth = 0.3),
        plot.margin = margin(r = 0.5, l = 0.1, b = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.y.left = element_text(size = 10, angle = 0),
        legend.position = 'none',
        strip.background =element_rect(fill="gray28"),
        strip.text = element_text(colour = 'white')
  ) +
  scale_color_manual(name = "Ancestry", values = values) +
  ylab("Sample") +
  xlab("Genome position (Mb)") +
  ggtitle("True and inferred ancestry tracts - gen_adm=200 generations ago") +
  facet_grid(name~.)


Seems a lot better!

#### Q: Why does increasing sample size improve the local ancestry inference?

##### One final thing we will check is how the differentiation between sources can impact the local ancestry inference.

##### Focusing again on gene flow happening 20 generations ago, let's look at the metadata file with our sample info from the simulation:

In [None]:

suppressMessages(library(dplyr))

meta <- read.delim('~/data_folder/model_gen_adm_20_meta.tsv')
meta %>% 
    group_by(time, pop) %>%
    tally()


We can see that we have 10 individuals from pop_mix from one time point (0, present), and 50 from pop_b and 50 from pop_c from two different time points (0 and 1200). Let's see what happens if instead of using source samples from the present (0 generations before present), we take them from 1200 generations before present, and re-run the local ancestry inference methods:

In [None]:

# import libraries
library(ggplot2)
suppressMessages(library(dplyr))

tracks_hap1 <- read.delim('~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_1200_prob_0.0_all_tracts_hap1.tsv')
tracks_hap2 <- read.delim('~/data_folder/all_tracts/model_gen_adm_20_source_20_source_time_1200_prob_0.0_all_tracts_hap2.tsv')

values = c("pop_b" = "chartreuse4", 
           "pop_c" = "cyan3")

# Let's plot three samples from the pop_mix population
samples <- c('pop_mix_1', 'pop_mix_2', 'pop_mix_3')

ggplot()+
  geom_segment(data=tracks_hap1 %>% filter(name %in% samples), aes(y = type, yend = type, x = left/1000000, xend = right/1000000, colour=source_pop), linewidth = 12) +
  geom_segment(data=tracks_hap2 %>% filter(name %in% samples), aes(y = type, yend = type, x = left_2/1000000, xend = right_2/1000000, colour=source_pop), linewidth = 12) +
  theme_bw()+
  theme(panel.spacing.y=unit(0.1, "lines"),
        axis.title.x = element_text(margin=margin(t=5)),
        axis.title.y = element_text(margin=margin(r=5)),
        axis.text.x = element_text(color = "black"),
        axis.ticks.x = element_line(linewidth = 0.3),
        plot.margin = margin(r = 0.5, l = 0.1, b = 0.5, unit = "cm"),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        strip.text.y.left = element_text(size = 10, angle = 0),
        legend.position = 'none',
        strip.background =element_rect(fill="gray28"),
        strip.text = element_text(colour = 'white')
  ) +
  scale_color_manual(name = "Ancestry", values = values) +
  ylab("Sample") +
  xlab("Genome position (Mb)") +
  ggtitle("True and inferred ancestry tracts - gen_adm=20 generations ago") +
  facet_grid(name~.)
    

Hmm, it seems like FLARE is having difficulty inferring the tracts (remember that a few plots above it did quite well when using sources from the present, 0 generations ago?).

#### Q: Do you have any idea why this might be the case? (Hint: Take a look at the FST calculations we did at the beginning of the exercise)

So we can see that LAI methods can be sensitive to how old the admixture event was, the sample sizes used and the level of differentiation between the sources used.

### BONUS: Inferring admixture times

Some LAI methods also infer the timing of admixture, or more specifically, how many generations before the age of the target individual(s) the admixture event happened.

##### Let's go back to the output from MOSAIC and check how the inferred times perform for the admixture event happening 20 generations ago:

In [None]:

import os
from IPython.display import Image
Image(filename=os.path.expanduser('~/data_folder/MOSAIC/gen_adm_20_source_20_source_time_0/coancestry_curves.png'))


The above plots are coancestry curves as produced by MOSAIC. They capture information about the lengths of segments inherited from each source population (here shown as numbers 1 and 2). They show how rapidly the ancestry changes as genetic distance increases across the chromosome.

So if we take two locations in the genome separated by a specific distance, we can see whether the ancestry changes or not. If we do that over different genetic distances across the genome, we can construct these curves, which reflect the length of segments from the source groups.

These curves should theoretically follow an exponential decay with rate equal to the number of generations since the admixture event happened (with genetic distance measured in Morgans).

Each of these curves is governed by an admixture parameter lambda (λ), which drives the exponential decay of tract lengths since admixture. For our two-way admixture model, we will average across the 3 estimates of the λ parameter output by MOSAIC in order to report an admixture time.
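
To illustrate the idea of reading λ off an exponential curve, here is a minimal Python sketch that recovers the decay rate from noise-free synthetic curves with a log-linear least-squares fit. This is not MOSAIC's actual fitting procedure. The curve shape (plateau plus amplitude times an exponential), the known plateau, and all numbers are assumptions chosen for illustration.

```python
import math


def fit_decay_rate(dist_cm, values, plateau):
    """Estimate lambda from values = plateau + a * exp(-lambda * d / 100).

    d is in cM, so d / 100 is in Morgans and lambda is in generations.
    Fits a straight line to log|values - plateau| against distance.
    """
    xs = [d / 100.0 for d in dist_cm]
    ys = [math.log(abs(v - plateau)) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


true_lambda = 20  # generations since admixture used to build the toy curves
dist_cm = [0.5 * i for i in range(1, 41)]  # 0.5 to 20 cM

# Three synthetic curves: two decaying (same ancestry), one rising (switch)
amplitudes = {"1-1": 0.8, "1-2": -0.5, "2-2": 0.6}
estimates = []
for pair, amp in amplitudes.items():
    curve = [1.0 + amp * math.exp(-true_lambda * d / 100.0) for d in dist_cm]
    estimates.append(fit_decay_rate(dist_cm, curve, plateau=1.0))

admixture_time = sum(estimates) / len(estimates)
print(f"average lambda across curves: {admixture_time:.1f} generations")
```

With real data the curves are noisy and the plateau has to be estimated as well, which is part of why older admixture events are harder to date.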

The y axis shows the ratio of probabilities of pairs of local ancestries. The x axis is the genetic distance in cM. Looking at the first plot, we can see that the probability of going from ancestry 1 to ancestry 1 at two different genetic sites across the chromosome decreases with increasing genetic distance. Same thing in the 3rd plot going from ancestry 2 to ancestry 2. In the second plot, we see that you're more likely to switch ancestries the further you go down the chromosome. 

The green line is the fitted curve, the black line is the observed ratios across targets and the grey line is the per target ratio.

If we calculate the average across the three coancestry curves we get 20.2 generations before present, which is almost spot on with the actual admixture event!

More info here: https://www.chg.ox.ac.uk/~gav/admixture/2014-science-final/resources/FAQ.pdf (under "What are coancestry curves?")

##### Now let's look at the coancestry curves for the admixture event happening 200 generations ago:

In [None]:

import os
from IPython.display import Image
Image(filename=os.path.expanduser('~/data_folder/MOSAIC/gen_adm_200_source_20_source_time_0/coancestry_curves.png'))


We can see that these curves are much noisier and the admixture time is trickier to infer: because the admixture event is older, many of the ancestry tracts across the genome have been broken down into smaller segments by recombination.

The average inferred admixture time across the three curves is 254.7 generations ago. This is not too far off from the truth, but we should always be cautious when interpreting results, especially for old admixture events.