# Estimating SFS

## Simulations
Start by simulating some data.
To simulate genotypes, specify the number of SNPs and the number of individuals. The simulation first draws a population allele frequency for each site from a beta distribution (sim_af).

In [None]:
curve(dbeta(x,0.5, 2),from=0,to=1,main="Beta distribution for population allele frequency",
ylab="density",xlab="Allele frequency (x)")

In [None]:
# Simulate allele frequencies.
#
# Input is a number of sites, output is a vector of allele frequencies.
sim_af <- function(m)
    rbeta(m, 0.5, 2)

palette(c("goldenrod","purple","blue"))

# Simulate
set.seed(1)
m <- 100000
n <- 10

## simulate population allele frequencies
af <- sim_af(m)
cat("\n simulated true frequencies for the first 6 SNPs\n")
head(af)

hist(af,xlab="population allele frequency",col="orange",
  main="Histogram of simulated allele frequencies",
  ylab="Number of sites")

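As a quick sanity check, the mean of a Beta(a, b) distribution is a/(a+b), so the simulated frequencies should average close to 0.5/(0.5+2) = 0.2. The sketch below redraws the frequencies so that it runs on its own; the names `a`, `b` and `af_check` are only illustrative and are not used elsewhere in the notebook.

```r
# Shape parameters of the beta distribution used by sim_af
a <- 0.5
b <- 2

set.seed(1)
af_check <- rbeta(100000, a, b)

# Theoretical mean versus the mean of the simulated frequencies
expected_mean <- a / (a + b)
c(expected = expected_mean, simulated = mean(af_check))

# Fraction of sites with a population frequency below 5%,
# compared with the beta cumulative distribution function
c(simulated = mean(af_check < 0.05), theoretical = pbeta(0.05, a, b))
```

Most of the mass sits at low frequencies, which is why the histogram is so skewed towards rare alleles.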



### Simulate genotypes

For a site with allele frequency $f_j$, genotypes are simulated from a binomial distribution (equivalent to assuming Hardy-Weinberg equilibrium, HWE): $g_{ij} \sim Binomial(N=2,p=f_j)$, where $N$ is the number of trials and $i$ is the index of the individual.

Based on the genotypes for all individuals, we can calculate the number of derived alleles at site $j$: $d_j=\sum_i g_{ij}$

In [None]:
# Simulate genotypes from allele frequencies.
#
# Input is a vector of m allele frequencies and a number of individuals n,
# output is an n-by-m matrix.
sim_gt <- function(af, n){
  gt <- sapply(af, rbinom, n = n, size = 2)
  rownames(gt) <- paste0("Ind",1:n)
  colnames(gt) <- paste0("SNP",1:length(af))
  gt
}


## simulate genotypes for n individuals based on the frequency
gt <- sim_gt(af, n)
cat("Genotypes for the first 6 simulated SNPs")
gt[,1:6]


cat("\n\nNumber of derived alleles for the first 6 simulated SNPs ")
t(colSums(gt[,1:6]))

cat("\n\nNumber of SNPs with a certain sample allele frequency ")
table(colSums(gt)/2/n)

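To see why a binomial draw with two trials is the same as assuming HWE, compare the binomial probabilities with the usual HWE genotype proportions. This is a minimal sketch for a single illustrative frequency; `f`, `hwe`, `binom`, `g_toy` and `d_toy` are names made up for this example.

```r
# Illustrative allele frequency for one site
f <- 0.3

# HWE proportions for 0, 1 and 2 derived alleles
hwe <- c((1 - f)^2, 2 * f * (1 - f), f^2)

# Binomial probabilities for the same genotypes
binom <- dbinom(0:2, size = 2, prob = f)

rbind(hwe, binom)

# Number of derived alleles for a toy site with four individuals
g_toy <- c(0, 1, 2, 1)
d_toy <- sum(g_toy)
d_toy
```

The two rows are identical, and the toy site carries 4 derived alleles out of 8 chromosomes.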

### Calculate the SFS from the genotypes

We can calculate the SFS from the number of derived alleles at each site. The SFS $\theta=\{\theta_0,\theta_1,\ldots,\theta_{2n}\}$ is simply the fraction of sites in the genome that have $z$ derived alleles, for each $z\in \{0,1,\ldots,2n\}$

In [None]:
# Calculate SFS from genotypes.
#
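# Input is an n-by-m matrix of genotypes, output is an SFS with one entry
# for each derived allele count 0, 1, ..., 2n.
calc_sfs <- function(gt) {
  n <- nrow(gt)
  m <- ncol(gt)
  # Use fixed levels so that empty categories are kept as zeros
  table(factor(colSums(gt), levels = 0:(2 * n))) / m
}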

true_sfs <- calc_sfs(gt)
barplot(true_sfs,xlab="Number of derived alleles (z)",ylab="fraction of sites (theta)",col=1,main="SFS from true genotypes",names=0:(2*n))

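A toy example may help make the SFS definition concrete. The sketch below uses a made-up 3-by-3 genotype matrix (`toy_gt` is hypothetical) and `tabulate()`, so that every category from 0 to 2n is reported even when no site falls in it.

```r
# Toy genotypes: 3 individuals (rows) and 3 sites (columns)
toy_gt <- matrix(c(0, 0, 1,
                   1, 0, 2,
                   0, 0, 2), nrow = 3, byrow = TRUE)

n_toy <- nrow(toy_gt)

# Derived allele count per site: 1, 0 and 5
d_sites <- colSums(toy_gt)

# Count sites in each of the 2n + 1 categories and convert to fractions
toy_sfs <- tabulate(d_sites + 1, nbins = 2 * n_toy + 1) / ncol(toy_gt)
names(toy_sfs) <- 0:(2 * n_toy)
toy_sfs
```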
Next, simulate genotype likelihoods from the true genotypes.
 - Choose an average depth $\bar{D}_i$
 - Choose an error rate $\epsilon$

For this simulation we assume that there are only two possible bases (instead of the usual four, $(A,C,G,T)$).
First the depth is simulated from a Poisson distribution, $D_{ij} \sim Pois(\lambda=\bar{D}_i)$. Then, for each of the $D_{ij}$ reads, we sample one of the two alleles of the genotype. Each sampled allele is switched to the other allele with probability $\epsilon$.

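Under this two-base model, the probability that a read shows the derived allele is $\epsilon$, $0.5$ or $1-\epsilon$ for genotypes 0, 1 and 2, so the genotype likelihood is a binomial probability in the number of derived reads. A minimal sketch for one individual at one site, with illustrative values (4 reads, 1 of them derived; all names are made up for this example):

```r
error <- 0.01
depth <- 4        # number of reads covering the site
alt_reads <- 1    # reads showing the derived allele

# Probability of a derived read given genotype 0, 1 or 2
p_alt <- c(error, 0.5, 1 - error)

# Genotype likelihoods p(X | G = g) for g = 0, 1, 2
gl_one <- dbinom(alt_reads, size = depth, prob = p_alt)
names(gl_one) <- c("g=0", "g=1", "g=2")
gl_one

# Rescaled to sum to one, which makes the relative support easier to read
gl_one / sum(gl_one)
```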
In [None]:
# Simulate genotype likelihoods from genotypes.
#
# Input is an n-by-m matrix of genotypes, output is a length-n list of
# 3-by-m matrices.
sim_gl <- function(gt, error = 0.01, mean_depth = 2, min_depth = 0) {
  n <- nrow(gt)
  m <- ncol(gt)
  e <- c(error, 0.5, 1 - error)
  depths <- pmax(rpois(n * m, mean_depth), min_depth)
  probs <- e[gt + 1]
  alts <- rbinom(n * m, depths, probs)
  gl <- sapply(e, dbinom, x = alts, size = depths)
  lapply(seq(n), function(i) t(gl[seq(i, nrow(gl), n),]))
}

## simulate sequencing data based on the genotypes, the mean sequencing depth and the error rate
set.seed(1)
gl <- sim_gl(gt, mean_depth = 6, error = 0.01)

cat("Genotype likelihoods for individual 1")
gl[[1]]


 - How many sites?
 - Which genotype is the most likely for SNP1?

Call genotypes from genotype likelihoods: $GT = \arg\max_g p(X|g)$

In [None]:
# Calls genotypes from genotype likelihoods with uniform prior.
call_gt <- function(gl) {
  gt <- do.call("rbind", lapply(gl, function(x) apply(x, 2, which.max) - 1))
  rownames(gt) <- paste0("Ind",1:length(gl))
  colnames(gt) <- paste0("SNP",1:ncol(gl[[1]]))
  gt
}

gt_calls <- call_gt(gl)

cat("Genotype calls from GL (first 6 SNPs)")
gt_calls[,1:6]

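It can help to look at what the argmax rule does at very low depth. In the sketch below (illustrative values only, names are hypothetical), a site covered by a single read is always called homozygous, because one read can never favour the heterozygote, whereas a 3/3 split at depth 6 is called heterozygous.

```r
error <- 0.01
p_alt <- c(error, 0.5, 1 - error)

# Three toy cases: (depth 1, 0 derived), (depth 1, 1 derived), (depth 6, 3 derived)
depth <- c(1, 1, 6)
alt <- c(0, 1, 3)

# 3-by-3 matrix: rows are genotypes 0, 1, 2 and columns are the toy cases
toy_gl <- sapply(seq_along(depth), function(k) dbinom(alt[k], depth[k], p_alt))

# Most likely genotype per case
toy_calls <- apply(toy_gl, 2, which.max) - 1
toy_calls
```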
SFS from genotype calls

In [None]:

call_sfs <- calc_sfs(gt_calls)
barplot(call_sfs,xlab="Number of derived alleles",ylab="fraction of sites",col=2,main="SFS from genotype calls")

# Estimate the SFS from genotype likelihoods
The likelihood is
$$p(X|\theta)=\prod_{j=1}^m p(X_j|\theta)=\prod_{j=1}^m \sum_{z=0}^{2n}p(X_j|Z_j=z)p(Z_j=z|\theta)$$
where $\theta=(\theta_0,\theta_1,\ldots,\theta_{2n})$ are the parameters, one for each of the $2n+1$ categories in the SFS, such that $p(Z_j=z|\theta)=\theta_z$, and $p(X_j|Z_j)$ is the SAF.
To estimate the SFS, $\hat{\theta}=\arg\max_{\theta} p(X|\theta)$, we

1. First calculate the SAF from the genotype likelihoods
2. Then use the EM algorithm to find the maximum likelihood estimate

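The likelihood above is usually evaluated on the log scale. A minimal sketch, assuming the SAF is stored as a (2n+1)-by-m matrix with one column per site; the function name `loglik_sfs` and the toy SAF values are made up for illustration:

```r
# Log-likelihood of an SFS given a (2n + 1)-by-m SAF matrix
loglik_sfs <- function(sfs, saf) {
  # sfs * saf multiplies row z of saf by sfs[z]; the column sums are p(X_j | theta)
  sum(log(colSums(sfs * saf)))
}

# Toy SAF for n = 1 individual and m = 2 sites (arbitrary values)
toy_saf <- matrix(c(0.9, 0.1, 0.0,
                    0.2, 0.5, 0.3), nrow = 3)

flat <- rep(1 / 3, 3)
skewed <- c(0.6, 0.3, 0.1)

# The skewed SFS fits these two toy sites better than the flat one
c(flat = loglik_sfs(flat, toy_saf), skewed = loglik_sfs(skewed, toy_saf))
```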
### 1: Calculate SAF likelihoods from genotype likelihoods.
Let $Z_j$ be the number of derived alleles at site $j$. The SAF (sample allele frequency) likelihood is the likelihood of the data given this latent state, $p(X_{j}|Z_j)$. Here $X_j$ is the sequencing data for all individuals at the $j$th site, such that $X_{j}=(X_{1j},X_{2j},\ldots,X_{nj})$. The SAF can be written in terms of genotype likelihoods, $p(X_{ij}|G_{ij}=g_{ij})$, where $g_{ij}\in \{0,1,2\}$ is the number of derived alleles for individual $i$ at site $j$:
$$p(X_{j}|Z_j=z)=\sum_{g_{1j}}\sum_{g_{2j}}\ldots \sum_{g_{nj}}p(G_j=(g_{1j},g_{2j},\ldots,g_{nj})|Z_j=z)\prod_{i=1}^n p(X_{ij}|G_{ij}=g_{ij})$$

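The text does not spell out the term $p(G_j|Z_j=z)$. Reading it off the `choose()` calls in the `calc_saf` code, it appears to be the following, under the assumption that every placement of the $z$ derived alleles on the $2n$ chromosomes is equally likely:

$$p(G_j=(g_{1j},g_{2j},\ldots,g_{nj})|Z_j=z)=\frac{\prod_{i=1}^n \binom{2}{g_{ij}}}{\binom{2n}{z}} \quad \text{if } \sum_{i=1}^n g_{ij}=z, \text{ and } 0 \text{ otherwise}$$

The numerator counts the ways each individual can carry $g_{ij}$ derived alleles on its two chromosomes, and the denominator counts all ways to place $z$ derived alleles among the $2n$ chromosomes.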

In [None]:
# Input is a list of n 3-by-m matrices, output is a (2 * n + 1)-by-m matrix.
calc_saf <- function(gl) {
  n <- length(gl)
  m <- unique(sapply(gl, ncol))
  saf <- matrix(0, nrow = 2 * n + 1, ncol = m)
  # Number of ways to carry 0, 1 or 2 derived alleles on two chromosomes
  w <- choose(2, 0:2)
  saf[1:3, ] <- w * gl[[1]]
  for (i in seq_len(n)[-1]) {
    # Loop invariant: after the ith iteration, the first 2 * i + 1 rows of saf
    # would contain the SAF for the first i individuals if divided by
    # choose(2 * i, 0:(2 * i))
    g <- gl[[i]]
    # Update from the highest count downwards so that rows j and j + 1 still
    # hold the values for the first i - 1 individuals when they are read
    for (j in (2 * i - 1):1) {
      saf[j + 2, ] <- colSums(saf[(j + 2):j, ] * w * g)
    }
    saf[2, ] <- saf[2, ] * g[1, ] + saf[1, ] * 2 * g[2, ]
    saf[1, ] <- saf[1, ] * g[1, ]
  }
  saf / choose(2 * n, 0:(2 * n))
}


saf <- calc_saf(gl)

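For a handful of individuals the SAF sum can also be evaluated by brute force, which is a useful way to check one's understanding of the formula. The sketch below is not the notebook's algorithm: `saf_brute` is a hypothetical helper that enumerates all $3^n$ genotype configurations for a single site, so it is only practical for very small $n$. It assumes the allele-count term is $\prod_i \binom{2}{g_i} / \binom{2n}{z}$, as suggested by the `choose()` calls in `calc_saf`.

```r
# Brute-force SAF for one site.
# gl_site: n-by-3 matrix, row i holds p(X_i | G_i = 0), p(X_i | G_i = 1), p(X_i | G_i = 2)
saf_brute <- function(gl_site) {
  n <- nrow(gl_site)
  # All 3^n genotype configurations, one per row
  configs <- as.matrix(expand.grid(rep(list(0:2), n)))
  saf <- numeric(2 * n + 1)
  for (r in seq_len(nrow(configs))) {
    g <- configs[r, ]
    z <- sum(g)
    # p(G = g | Z = z) under the assumption stated above
    prior <- prod(choose(2, g)) / choose(2 * n, z)
    # Product of the individual genotype likelihoods
    lik <- prod(gl_site[cbind(seq_len(n), g + 1)])
    saf[z + 1] <- saf[z + 1] + prior * lik
  }
  saf
}

# Two individuals with perfectly known genotypes: one heterozygous, one homozygous derived
certain <- rbind(c(0, 1, 0),
                 c(0, 0, 1))
saf_brute(certain)

# Completely uninformative data give the same likelihood for every allele count
saf_brute(matrix(1, nrow = 2, ncol = 3))
```

With perfectly known genotypes only the category $z=3$ has non-zero likelihood, and with uninformative data every category gets the same value.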

### 2: EM algorithm

E-step
$$q(Z_{j}=z)=p(Z_j=z|X_j,\theta)=\frac{p(X_j|Z_j=z)p(Z_j=z|\theta)}{\sum_{z'=0}^{2n} p(X_j|Z_j=z')p(Z_j=z'|\theta)}$$

M-step

$$\theta_{z}^{new}=\frac{\sum_{j=1}^m q(Z_{j}=z)}{\sum_{j=1}^m \sum_{z'=0}^{2n} q(Z_{j}=z')}$$

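One EM iteration can be worked through by hand on a tiny example. The sketch below uses a made-up SAF for $n=1$ individual and $m=2$ sites (`toy_saf`, `theta` and `theta_new` are illustrative names). Because each column of $q$ sums to one, the denominator of the M-step is simply $m$.

```r
# Toy SAF: rows are z = 0, 1, 2 and columns are the two sites (arbitrary values)
toy_saf <- matrix(c(0.9, 0.1, 0.0,
                    0.2, 0.5, 0.3), nrow = 3)

# Start from a flat SFS
theta <- rep(1 / 3, 3)

# E-step: posterior probability of each allele count at each site
post <- theta * toy_saf
post <- sweep(post, 2, colSums(post), `/`)
post

# M-step: average the posteriors over sites
theta_new <- rowSums(post) / ncol(toy_saf)
theta_new
# → 0.55 0.30 0.15
```

With a flat starting SFS the posteriors are just the column-normalised SAF values, so the updated SFS is their average over the two sites.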


In [None]:



# Run an E-step.
#
# Input is SFS and precalculated (2 * n + 1)-by-m SAF matrix,
# output is (2 * n + 1)-by-m matrix of posterior allele count probabilities.
e_step <- function(sfs, saf) {
  post <- sfs * saf
  sweep(post, 2, colSums(post), `/`)
}
# Run an M-step.
#
# Input is (2 * n + 1)-by-m matrix of posterior allele count probabilities,
# output is new SFS.
m_step <- function(post) {
  new <- rowSums(post)
  new / sum(new)
}
# Runs EM for fixed number of iterations.
em <- function(saf, iterations = 50) {

  sfs <- rep(1 / nrow(saf), nrow(saf))
  for (i in seq(iterations)) {
    sfs <- m_step(e_step(sfs, saf))
  }
  sfs
}


em_sfs <- em(saf, iterations = 50)
barplot(em_sfs,xlab="Number of derived alleles",ylab="fraction of sites",col=3,main="SFS from GL (EM algo)")

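The EM above runs for a fixed number of iterations. One possible variation, sketched here under the assumption that a small change in log-likelihood is an acceptable stopping rule, is to track the log-likelihood and stop once it no longer improves. The function `em_tol`, its arguments `tol` and `max_iter`, and the random toy SAF are all made up for illustration. A useful property to check is that the log-likelihood never decreases between iterations.

```r
# EM with a log-likelihood based stopping rule (hypothetical variant)
em_tol <- function(saf, tol = 1e-8, max_iter = 1000) {
  sfs <- rep(1 / nrow(saf), nrow(saf))
  loglik <- numeric(0)
  for (iter in seq_len(max_iter)) {
    # Joint p(X_j | Z_j = z) * theta_z and per-site likelihood p(X_j | theta)
    joint <- sfs * saf
    site_lik <- colSums(joint)
    # Log-likelihood of the SFS used in this iteration
    loglik <- c(loglik, sum(log(site_lik)))
    # E-step followed by M-step
    post <- sweep(joint, 2, site_lik, `/`)
    sfs <- rowSums(post) / ncol(saf)
    # Stop when the improvement over the previous iteration is below tol
    if (iter > 1 && loglik[iter] - loglik[iter - 1] < tol) break
  }
  list(sfs = sfs, loglik = loglik, iterations = iter)
}

# Arbitrary positive toy SAF: 5 categories (n = 2) and 200 sites
set.seed(1)
toy_saf <- matrix(runif(5 * 200), nrow = 5)
fit <- em_tol(toy_saf)
fit$iterations
range(diff(fit$loglik))
```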
Let's compare the SFS estimates

In [None]:





# Plot different SFS
all_sfs <- rbind(
"True" = true_sfs,
"EM" = em_sfs,
"Calls" = call_sfs
)
barplot(all_sfs,beside=TRUE,col=1:3,
  legend=c("True SFS","from GL (EM algorithm)","from called genotypes"),
  names=0:(2*n),ylab="fraction of sites",main="SFS comparison")

  - At which depth is the SFS from called genotypes no longer biased?
  - What happens if you set the sequencing error rate to 0% or 10%?

## Bonus
  - Write the likelihood and EM algorithm for estimating the two-dimensional SFS



