# Estimating haplotype frequencies from genotypes (2 SNPs)

First set the true haplotype frequencies and the number of individuals.
From the haplotype frequencies we can simulate haplotypes, and from the haplotypes we can derive the genotypes.
We will assume that the genotypes at the first SNP are AA, Aa, aa and the genotypes at the second SNP are BB, Bb, bb.
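
To make the haplotype-to-genotype step concrete, here is a minimal sketch. `genoFromHaps` is a hypothetical helper introduced for illustration, not part of the notebook. It codes each genotype as the number of copies of A (SNP 1) and B (SNP 2), and shows that two different haplotype pairs can give the same genotypes, which is why haplotype frequencies cannot simply be counted from genotype data.

```r
# Hypothetical helper: map a pair of haplotypes to the two genotypes,
# coded as the number of A alleles (SNP 1) and B alleles (SNP 2).
genoFromHaps <- function(h1, h2) {
  c(SNP1 = (h1 %in% c("AB", "Ab")) + (h2 %in% c("AB", "Ab")),
    SNP2 = (h1 %in% c("AB", "aB")) + (h2 %in% c("AB", "aB")))
}

genoFromHaps("AB", "ab")  # double heterozygote: SNP1 = 1, SNP2 = 1
genoFromHaps("Ab", "aB")  # same genotypes from a different haplotype pair
genoFromHaps("Ab", "Ab")  # SNP1 = 2, SNP2 = 0
```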




In [1]:
set.seed(1)
#set haplotype frequencies. must sum to 1
# these are the true haplotype frequencies
freqHap <- c(
    freqHapAB = 0.03,
    freqHapaB = 0.20,
    freqHapAb = 0.55,
    freqHapab = 1 -0.03-0.2-0.55
)

#set number of individuals to simulate
N<-1000

## simulate the 2N haplotypes
cat("\nsimulated haplotype pairs")
simHaps <- sample(c("AB","aB","Ab","ab"),2*N,prob=freqHap,replace=TRUE)
# two haplotypes per individual
hapMat <- matrix(simHaps,ncol=2)
table(hapPairs <- paste(hapMat[,1],hapMat[,2],sep="/"))

#genotypes for SNP 1
SNP1 <- hapMat[,1] %in% c("AB","Ab")  + hapMat[,2] %in% c("AB","Ab") # counts haplotypes with A summed over two haplotypes
SNP2 <- hapMat[,1] %in% c("AB","aB")  + hapMat[,2] %in% c("AB","aB") # counts haplotypes with B summed over two haplotypes

#counts of genotype combinations for the two SNPs
cat("\n simulated genotype data as table:\n")
(genotypeTab <- table(c("aa","Aa","AA")[SNP1+1],c("bb","Bb","BB")[SNP2+1]))





simulated haplotype pairs


ab/ab ab/aB ab/Ab ab/AB aB/ab aB/aB aB/Ab aB/AB Ab/ab Ab/aB Ab/Ab Ab/AB AB/ab 
   40    50   107     8    31    32   124     8   126   118   312    12     6 
AB/aB AB/Ab AB/AB 
    6    19     1 


 simulated genotype data as table:


    
      bb  Bb  BB
  aa  40  81  32
  Aa 233 256  14
  AA 312  31   1


First calculate the true population allele frequencies from the true haplotype frequencies.

In [2]:
trueFreq <- data.frame(
    freq_A = freqHap["freqHapAB"] + freqHap["freqHapAb"],
    freq_a = freqHap["freqHapaB"] + freqHap["freqHapab"],
    freq_B = freqHap["freqHapaB"] + freqHap["freqHapAB"],
    freq_b = freqHap["freqHapab"] + freqHap["freqHapAb"]
 )
 trueFreq

          freq_A freq_a freq_B freq_b
freqHapAB   0.58   0.42   0.23   0.77



Compare this to the simulated data by first calculating the frequencies of allele A and allele B at the two SNPs (based on the genotype counts).
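
As a hand check, the sample allele frequencies follow from the margins of the simulated genotype table printed above (344 AA, 503 Aa, 47 BB and 368 Bb individuals among $N=1000$). Here $n_{AA}$, $n_{Aa}$, $n_{BB}$ and $n_{Bb}$ are notation introduced for these counts:

$$\hat f_A=\frac{2n_{AA}+n_{Aa}}{2N}=\frac{2\cdot 344+503}{2000}=0.5955, \qquad \hat f_B=\frac{2n_{BB}+n_{Bb}}{2N}=\frac{2\cdot 47+368}{2000}=0.231$$

Both are close to the true values 0.58 and 0.23.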


In [3]:
cat("est freq A")
( estFreqA <- sum(SNP1)/N/2 )
cat("\nest freq B")
( estFreqB <- sum(SNP2)/N/2 )



est freq A


est freq B




Calculate the haplotype frequencies based on the sampled haplotype pairs.


In [4]:

cat("table of haplotypes pairs")
hapPairsTab <- table(hapPairs)
hapPairsTab

countHaplotypes <- function(hapPairsTab,print=FALSE){
  countHap <- c()
  hapNames <- c("AB","aB","Ab","ab")
  for(x in hapNames){
    pairs <- c(paste(x,hapNames,sep="/"),paste(hapNames,x,sep="/")) # all ordered pairs containing x; the pair x/x is listed twice, so it counts two copies of x
    if(print)
      cat("Hapotype",x,"is in pairs",pairs,"\n")
    countHap[x] <- sum( hapPairsTab[pairs] )
  }
  countHap
}
cat("\nextract columns (notice one haplotype pairs is there twice):\n")
countHap <- countHaplotypes(hapPairsTab,print=TRUE)
cat("\nCounts of haplotypes (from counts of haplotype pairs)")
countHap

## from counts to frequencies
estFreqHap <- countHap/sum(countHap)
cat("\nhaplotype frequencies\n")
round(estFreqHap,4)


table of haplotypes pairs

hapPairs
ab/ab ab/aB ab/Ab ab/AB aB/ab aB/aB aB/Ab aB/AB Ab/ab Ab/aB Ab/Ab Ab/AB AB/ab 
   40    50   107     8    31    32   124     8   126   118   312    12     6 
AB/aB AB/Ab AB/AB 
    6    19     1 


extract columns (notice one haplotype pairs is there twice):
Hapotype AB is in pairs AB/AB AB/aB AB/Ab AB/ab AB/AB aB/AB Ab/AB ab/AB 
Hapotype aB is in pairs aB/AB aB/aB aB/Ab aB/ab AB/aB aB/aB Ab/aB ab/aB 
Hapotype Ab is in pairs Ab/AB Ab/aB Ab/Ab Ab/ab AB/Ab aB/Ab Ab/Ab ab/Ab 
Hapotype ab is in pairs ab/AB ab/aB ab/Ab ab/ab AB/ab aB/ab Ab/ab ab/ab 

Counts of haplotypes (from counts of haplotype pairs)


haplotype frequencies
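
As a cross-check on the counting rule above (a homozygous pair x/x contributes two copies of x), the sketch below recomputes the haplotype counts directly from the printed table of haplotype pairs. The counts are copied from that output. `pairCounts`, `parts` and `hapCounts` are names introduced here for illustration.

```r
# Counts of ordered haplotype pairs, copied from the table printed above
pairCounts <- c("ab/ab" = 40,  "ab/aB" = 50,  "ab/Ab" = 107, "ab/AB" = 8,
                "aB/ab" = 31,  "aB/aB" = 32,  "aB/Ab" = 124, "aB/AB" = 8,
                "Ab/ab" = 126, "Ab/aB" = 118, "Ab/Ab" = 312, "Ab/AB" = 12,
                "AB/ab" = 6,   "AB/aB" = 6,   "AB/Ab" = 19,  "AB/AB" = 1)

# split "x/y" into first and second haplotype (16 x 2 character matrix)
parts <- do.call(rbind, strsplit(names(pairCounts), "/", fixed = TRUE))

haps <- c("AB", "aB", "Ab", "ab")
# copies of each haplotype: occurrences in first position plus second position
hapCounts <- sapply(haps, function(h)
  sum(pairCounts[parts[, 1] == h]) + sum(pairCounts[parts[, 2] == h]))

hapCounts                   # counts of the four haplotypes
hapCounts / sum(hapCounts)  # sample haplotype frequencies
```

The four counts add up to $2N=2000$, and the resulting sample frequencies can be compared with the true frequencies set in the first cell.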


# likelihood
 -  Write a likelihood function where the data are the genotypes for both SNPs and the parameter is the vector of haplotype frequencies.

$L(\theta)=p(X|\theta)=\prod_{i=1}^N p(X_i|\theta)$

where $\theta=(\theta_{AB},\theta_{aB},\theta_{Ab},\theta_{ab})$ are the haplotype frequencies and $X$ is the matrix of data where $X_{i,j}\in \{0,1,2\}$ is the genotype of individual $i$ at site $j\in \{1,2\}$. There are $N$ individuals and 2 sites.

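We can introduce a latent state $Z$, the pair of haplotypes carried by the individual, where each haplotype is one of the four possible haplotypes:
$p(X_i|\theta)=\sum_{z\in\{AB,aB,Ab,ab\}^2}p(X_i|Z=z)p(Z=z|\theta)$
where $z=(z_1,z_2)$ is the ordered pair of haplotypes, so the sum has 16 terms.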

$p(Z=z|\theta)=p(Z_1=z_1|\theta)p(Z_2=z_2|\theta)=\theta_{z_1}\theta_{z_2}$

$p(X_i|Z=z)=0$ if genotypes are not consistent with the haplotypes and 1 otherwise.
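
As a worked example, the double heterozygote $X_i=(1,1)$ is the only genotype combination that is consistent with two different unordered haplotype pairs (AB with ab, and Ab with aB). Summing over the four consistent ordered pairs gives

$$p(X_i=(1,1)|\theta)=2\theta_{AB}\theta_{ab}+2\theta_{Ab}\theta_{aB}$$

With the true frequencies set in the first cell this is $2\cdot 0.03\cdot 0.22+2\cdot 0.55\cdot 0.20=0.2332$. By contrast, $X_i=(2,0)$ is consistent only with the pair Ab/Ab, so $p(X_i=(2,0)|\theta)=\theta_{Ab}^2=0.55^2=0.3025$.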


### Likelihoods for each individual

$p(X_i|\theta)=\sum_{z\in\{AB,aB,Ab,ab\}^2}p(X_i|Z=z)p(Z=z|\theta)$



In [5]:
#parameter theta
(theta <- freqHap)


likeG <- function(g1,g2,theta){
##input: two genotypes (g1 for SNP 1, g2 for SNP 2, each 0, 1 or 2) and theta
##output: p(X_i|theta)

  names(theta) <- c("AB","aB","Ab","ab")

  #p(X_i|theta)=sum_Z p(X|Z)p(Z|theta)
  pX <- 0
  for(z1 in c("AB","aB","Ab","ab")) # sum over two haplotypes
    for(z2 in c("AB","aB","Ab","ab")){
        hapToGeno1 <- z1%in%c("AB","Ab") + z2%in%c("AB","Ab") #haplotype to genotype (HapToGeno is the number of A allele 0,1,2 )
        hapToGeno2 <- z1%in%c("AB","aB") + z2%in%c("AB","aB")
        if(g1 != hapToGeno1 | g2 !=hapToGeno2)
          next
        pX <- pX + theta[z1]*theta[z2]
    }
  pX
}

for(g1 in 0:2)
  for(g2 in 0:2)
  cat("like for individuals with genotype SNP1=",g1," SNP2=",g2," is",likeG(g1,g2,theta),"\n")

#genotypeTab/sum(genotypeTab)

like for individuals with genotype SNP1= 0  SNP2= 0  is 0.0484 
like for individuals with genotype SNP1= 0  SNP2= 1  is 0.088 
like for individuals with genotype SNP1= 0  SNP2= 2  is 0.04 
like for individuals with genotype SNP1= 1  SNP2= 0  is 0.242 
like for individuals with genotype SNP1= 1  SNP2= 1  is 0.2332 
like for individuals with genotype SNP1= 1  SNP2= 2  is 0.012 
like for individuals with genotype SNP1= 2  SNP2= 0  is 0.3025 
like for individuals with genotype SNP1= 2  SNP2= 1  is 0.033 
like for individuals with genotype SNP1= 2  SNP2= 2  is 9e-04 
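
The nine values above are the probabilities of the nine possible genotype combinations, so they should sum to 1. The sketch below checks this with a vectorised alternative to the double loop. It is an illustration under the same model, not the notebook's implementation; `nA`, `nB`, `pairProb` and `genoProb` are names introduced here, and `theta` is set to the true frequencies from the first cell.

```r
theta <- c(AB = 0.03, aB = 0.20, Ab = 0.55, ab = 0.22)  # true frequencies
nA <- c(AB = 1, aB = 0, Ab = 1, ab = 0)  # copies of A on each haplotype
nB <- c(AB = 1, aB = 1, Ab = 0, ab = 0)  # copies of B on each haplotype

pairProb <- outer(theta, theta)  # p(Z = (z1, z2) | theta) for the 16 ordered pairs
g1 <- outer(nA, nA, "+")         # SNP 1 genotype implied by each pair
g2 <- outer(nB, nB, "+")         # SNP 2 genotype implied by each pair

# sum the pair probabilities within each genotype combination (3 x 3 table)
genoProb <- tapply(as.vector(pairProb),
                   list(SNP1 = as.vector(g1), SNP2 = as.vector(g2)), sum)
round(genoProb, 4)
sum(genoProb)  # the nine probabilities sum to 1 (up to rounding error)
```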


### likelihood
$L(\theta)=p(X|\theta)=\prod_{i=1}^N p(X_i|\theta)$
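
The code below works with $\log L(\theta)$ rather than $L(\theta)$ itself. One practical reason is sketched here with made-up per-individual likelihoods of 0.2 (an assumption for illustration, not values from the data): a product of 1000 such probabilities underflows to zero in double precision, while the sum of the logs stays finite.

```r
p <- rep(0.2, 1000)  # 1000 made-up per-individual likelihoods

prod(p)      # the product underflows to 0 in double precision
sum(log(p))  # the sum of logs is finite: 1000 * log(0.2)
```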


In [6]:
data <- cbind(SNP1,SNP2)
cat("Data for the first 6 individuals\n")
head(data)

cat("start likelihood calculations\n")

likelihood <- function(data,theta){
  #input: data ( N x 2 matrix of genotypes) + theta
  #output: log( p(X|theta) )
  N <- nrow(data)
  logLike <- 0
  for(i in 1:N)
    logLike <- logLike + log( likeG(g1=data[i,1],g2=data[i,2],theta) )

 logLike
}
lik <-likelihood(data,theta)
cat("Log likelihood based on true haplotype frequencies is",lik,"\n")


lik <-likelihood(data,rep(0.25,4))
cat("Log likelihood with uniform haplotype frequencies",lik,"\n")


Data for the first 6 individuals


SNP1,SNP2
2,0
1,0
1,0
0,2
2,0
1,1


start likelihood calculations
Log likelihood based on true haplotype frequencies is -1672.014 
Log likelihood with uniform haplotype frequencies -2168.858 


### Faster version
We only need to calculate the individual likelihood for the 9 combinations of genotypes.
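
This works because individuals with the same pair of genotypes have the same likelihood, so the sum over individuals can be grouped by genotype combination. Writing $n_{g_1,g_2}$ (notation introduced here) for the number of individuals with genotype $g_1$ at SNP 1 and $g_2$ at SNP 2, i.e. the entries of the genotype table:

$$\log L(\theta)=\sum_{i=1}^N \log p(X_i|\theta)=\sum_{g_1=0}^{2}\sum_{g_2=0}^{2} n_{g_1,g_2}\,\log p(X=(g_1,g_2)|\theta)$$

The sum on the right has 9 terms instead of $N=1000$.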

In [7]:
cat("data\n")
(dataTab <- genotypeTab)
cat("\nparameter theta\n")
(theta <- freqHap)


likelihoodFast <- function(dataTab,theta){
  #input: dataTab ( 3 x 3 table of genotype counts, rows = SNP1, columns = SNP2 ) + theta
  #output: log( p(X|theta) )

  logLike <- 0
  for(g1 in 0:2)
    for(g2 in 0:2){
      likG <- likeG(g1=g1,g2=g2,theta)
      # log likelihood of this genotype combination times the number of individuals carrying it
      logLike <- logLike + log( likG ) * dataTab[g1+1,g2+1]
      #cat("g1,g2",g1,g2,"likG",likG,"Nind with that genotype",dataTab[g1+1,g2+1],"\n")
    }
  logLike
}
lik <-likelihoodFast(dataTab,theta)
cat("\nLog likelihood based on true haplotype frequencies is",lik,"\n")


lik <-likelihoodFast(dataTab,rep(0.25,4))
cat("\nLog likelihood with uniform haplotype frequencies",lik,"\n")

data


    
      bb  Bb  BB
  aa  40  81  32
  Aa 233 256  14
  AA 312  31   1


parameter theta



Log likelihood based on true haplotype frequencies is -1672.014 

Log likelihood with uniform haplotype frequencies -2168.858 


# EM algorithm


We need to calculate the posterior probability
$q_i(Z=z)=p(Z=z|X_i,\theta^{(n)})$, which can be calculated by
$$p(Z=z|X_i,\theta^{(n)})=\frac{p(X_i|Z=z)p(Z=z|\theta^{(n)})}{\sum_{z'} p(X_i|Z=z')p(Z=z'|\theta^{(n)})}$$
where $p(Z=z|\theta)=\theta_{z_1}\theta_{z_2}$ and

$p(X_i|Z=z)=0$ if genotypes are not consistent with the haplotypes and 1 otherwise.
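
As a small worked example of this posterior, consider a double heterozygote, $X_i=(1,1)$. Only four of the 16 ordered haplotype pairs are consistent with it, so the normalisation runs over those four. The sketch below is illustrative only: it uses the true frequencies from the first cell as $\theta^{(n)}$, and `joint` and `q` are names introduced here.

```r
theta <- c(AB = 0.03, aB = 0.20, Ab = 0.55, ab = 0.22)

# p(X|Z) p(Z|theta) for the four ordered pairs consistent with genotypes (1, 1);
# all other pairs have p(X|Z) = 0 and drop out
joint <- c("AB/ab" = theta[["AB"]] * theta[["ab"]],
           "ab/AB" = theta[["ab"]] * theta[["AB"]],
           "Ab/aB" = theta[["Ab"]] * theta[["aB"]],
           "aB/Ab" = theta[["aB"]] * theta[["Ab"]])

# normalise to get the posterior q_i(Z = z)
q <- joint / sum(joint)
round(q, 4)
```

With these frequencies the phase Ab/aB is far more probable than AB/ab, because the AB haplotype is rare.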


In [8]:
data <- cbind(SNP1,SNP2)
head(data)

cat("\nparameter theta\n")
(theta <- freqHap)


calculateQ<-function(theta,g1,g2){
  #input: theta and two genotypes
  #output: p(Z|X,theta) for 16 haplotype pairs

  #X_i is the genotypes g1 and g2
  # we need to calculate P(X|Z) and p(X|theta)

  #P(X|Z)p(Z|theta)
  names(theta) <- c("AB","aB","Ab","ab")
  pXZ_pZtheta <- c()
  for(z1 in c("AB","aB","Ab","ab")) # sum over two haplotypes
    for(z2 in c("AB","aB","Ab","ab")){
        hapToGeno1 <- z1%in%c("AB","Ab") + z2%in%c("AB","Ab")
        hapToGeno2 <- z1%in%c("AB","aB") + z2%in%c("AB","aB")
        hapPair <- paste(z1,z2,sep="/")
        if(g1 != hapToGeno1 | g2 !=hapToGeno2)
          pXZ_pZtheta[hapPair] <- 0
        else
          pXZ_pZtheta[hapPair] <- theta[z1]*theta[z2]
    }


  ## then we need to normalize by summing over all Z
  # P(X|Z)p(Z|theta)/[ sum( P(X|Z')p(Z'|theta) ) ]
  return( pXZ_pZtheta/sum(pXZ_pZtheta) )

}

calculate9Q <- function(theta){
  #input: theta
  #output: p(Z|X,theta), matrix 9x16. 9 genotype  pairs, 16 haplotype pairs
  ## X are the 9 genotype pairs.


  QforG1G2 <- array(NA,dim=c(3,3,16))
  for(g1 in 0:2)
    for(g2 in 0:2)
      QforG1G2[g1+1,g2+1,] <- calculateQ(theta,g1,g2)

  QforG1G2
}


emStep <- function(theta,data){

  names(theta) <- c("AB","aB","Ab","ab")
  N <- nrow(data)

  ## E step
  Q9 <- calculate9Q(theta)
  QZ <-matrix(0,nrow=nrow(data),ncol=16)
  colnames(QZ) <- names(calculateQ(theta,1,1)) ## give column names for the 16 haplotype pairs
  for(i in 1:N){
    g1=data[i,1]
    g2=data[i,2]
    QZ[i,] <- Q9[g1+1,g2+1,]
  }


  ## M step
  # get the expected number of haplotype pairs (16 possible pairs)
  expectedHaploPairs <-  colSums(QZ)
  # calculate haplotype counts from the pairs (4 haplotypes)
  expectedCountHap <- countHaplotypes(expectedHaploPairs)
  thetaNew <- expectedCountHap/sum(expectedCountHap)
  thetaNew
}

thetaEM <- rep(1/4,4)
for(i in 0:10){
  ll <- likelihood(data,thetaEM)
  cat("iter",i,"theta:",thetaEM,"Loglike",ll,"\n")
  thetaEM <- emStep(thetaEM,data)
}

SNP1,SNP2
2,0
1,0
1,0
0,2
2,0
1,1



parameter theta


iter 0 theta: 0.25 0.25 0.25 0.25 Loglike -2168.858 
iter 1 theta: 0.0875 0.1435 0.508 0.261 Loglike -1728.17 
iter 2 theta: 0.05403413 0.1769659 0.5414659 0.2275341 Loglike -1683.774 
iter 3 theta: 0.03805581 0.1929442 0.5574442 0.2115558 Loglike -1672.219 
iter 4 theta: 0.03241402 0.198586 0.563086 0.205914 Loglike -1670.567 
iter 5 theta: 0.03070986 0.2002901 0.5647901 0.2042099 Loglike -1670.406 
iter 6 theta: 0.03022334 0.2007767 0.5652767 0.2037233 Loglike -1670.392 
iter 7 theta: 0.03008681 0.2009132 0.5654132 0.2035868 Loglike -1670.391 
iter 8 theta: 0.03004868 0.2009513 0.5654513 0.2035487 Loglike -1670.391 
iter 9 theta: 0.03003804 0.200962 0.565462 0.203538 Loglike -1670.391 
iter 10 theta: 0.03003508 0.2009649 0.5654649 0.2035351 Loglike -1670.391 


# Faster version
Instead of summing over all 1000 individuals, we sum over just the 9 possible genotype combinations, weighting each by the number of individuals that carry it.

In [9]:
cat("data\n")
(dataTab <- genotypeTab)

emStepFast <- function(theta,dataTab){
  #input: theta + dataTab ( 3 x 3 table of genotype counts )
  #output: updated theta after one EM iteration

  names(theta) <- c("AB","aB","Ab","ab")

  ## E step
  Q9 <- calculate9Q(theta)
  QZ <-matrix(0,nrow=9,ncol=16)
  colnames(QZ) <- names(calculateQ(theta,1,1)) ## give column names for the 16 haplotype pairs
  i<-1
  for(g1 in 0:2)
    for(g2 in 0:2){
      # posterior for this genotype combination, weighted by its count
      QZ[i,] <- Q9[g1+1,g2+1,]* dataTab[g1+1,g2+1]
      i <- i+1
    }


  ## M step
  # get the expected number of haplotype pairs (16 possible pairs)
  expectedHaploPairs <-  colSums(QZ)
  # calculate haplotype counts from the pairs (4 haplotypes)
  expectedCountHap <- countHaplotypes(expectedHaploPairs)
  thetaNew <- expectedCountHap/sum(expectedCountHap)
  thetaNew
}
cat("\n run EM!!\n")
thetaEM <- rep(0.25,4)
for(i in 0:10){
  ll <- likelihoodFast(dataTab,thetaEM)
  cat("iter",i,"theta:",thetaEM,"Loglike",ll,"\n")
  thetaEM <- emStepFast(thetaEM,dataTab)
}

cat("\n sample freq Hap (not observed)\n")
estFreqHap


data


    
      bb  Bb  BB
  aa  40  81  32
  Aa 233 256  14
  AA 312  31   1


 run EM!!
iter 0 theta: 0.25 0.25 0.25 0.25 Loglike -2168.858 
iter 1 theta: 0.0875 0.1435 0.508 0.261 Loglike -1728.17 
iter 2 theta: 0.05403413 0.1769659 0.5414659 0.2275341 Loglike -1683.774 
iter 3 theta: 0.03805581 0.1929442 0.5574442 0.2115558 Loglike -1672.219 
iter 4 theta: 0.03241402 0.198586 0.563086 0.205914 Loglike -1670.567 
iter 5 theta: 0.03070986 0.2002901 0.5647901 0.2042099 Loglike -1670.406 
iter 6 theta: 0.03022334 0.2007767 0.5652767 0.2037233 Loglike -1670.392 
iter 7 theta: 0.03008681 0.2009132 0.5654132 0.2035868 Loglike -1670.391 
iter 8 theta: 0.03004868 0.2009513 0.5654513 0.2035487 Loglike -1670.391 
iter 9 theta: 0.03003804 0.200962 0.565462 0.203538 Loglike -1670.391 
iter 10 theta: 0.03003508 0.2009649 0.5654649 0.2035351 Loglike -1670.391 

 sample freq Hap (not observed)
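
The loops above run a fixed 11 iterations. A common alternative is to iterate until the log likelihood stops improving. The following is a minimal, self-contained sketch of that idea, not the notebook's implementation: the genotype counts are copied from the table printed above, and `genoProbs`, `logLikTab`, `emStepVec`, `tol` and `maxIter` are hypothetical names introduced here. It uses the same E and M steps, written with `outer()` over the 16 ordered haplotype pairs.

```r
# Genotype counts copied from the table above:
# rows = copies of A (0,1,2), columns = copies of B (0,1,2)
counts <- matrix(c( 40,  81, 32,
                   233, 256, 14,
                   312,  31,  1), nrow = 3, byrow = TRUE)

haps <- c("AB", "aB", "Ab", "ab")
nA <- c(1, 0, 1, 0)  # copies of A on each haplotype
nB <- c(1, 1, 0, 0)  # copies of B on each haplotype

# row/column of `counts` implied by each of the 16 ordered haplotype pairs
idx <- cbind(as.vector(outer(nA, nA, "+")) + 1,
             as.vector(outer(nB, nB, "+")) + 1)

# p(X | theta) for the 9 genotype combinations (3 x 3)
genoProbs <- function(theta) {
  tapply(as.vector(outer(theta, theta)), list(idx[, 1], idx[, 2]), sum)
}

# log likelihood computed from the table of counts
logLikTab <- function(theta) sum(counts * log(genoProbs(theta)))

emStepVec <- function(theta) {
  pairProb <- as.vector(outer(theta, theta))  # p(Z = z | theta)
  # E step: expected number of individuals carrying each ordered pair
  expPairs <- matrix(counts[idx] * pairProb / genoProbs(theta)[idx], 4, 4)
  # M step: haplotype copies from first and second position, normalised
  expHap <- rowSums(expPairs) + colSums(expPairs)
  setNames(expHap / sum(expHap), haps)
}

theta <- setNames(rep(0.25, 4), haps)
tol <- 1e-8      # assumed stopping threshold on the log-likelihood change
maxIter <- 1000  # assumed safety cap on the number of iterations
llOld <- logLikTab(theta)
for (iter in seq_len(maxIter)) {
  theta <- emStepVec(theta)
  llNew <- logLikTab(theta)
  if (llNew - llOld < tol) break  # EM never decreases the log likelihood
  llOld <- llNew
}
round(theta, 4)
llNew
```

Starting from the same uniform `theta`, this should end close to the iteration-10 values in the log above. The threshold `tol` trades run time against precision and would need to be chosen for the application.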


## EM algorithm (proof from class)

If the log likelihood has the form
   
$\log(L(\theta))  =  \sum_i \log \left (\sum_j p(X_i,Z_j|\theta) \right )$

and
$\sum_j \theta_j=1$

and $p(X|Z,\theta)=p(X|Z)$

and $p(Z_j|\theta)=\theta_j$


Then the EM updates are
### E step
$q_i(Z_j)=p(Z_j|X_i,\theta^{(n)})$
   
### M step  
$\theta_j^{(n+1)} = \frac{\sum_i q_i(Z_j)}{\sum_i \sum_j q_i(Z_j)}$
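
A brief sketch of where the M step comes from, using the standard Lagrange-multiplier argument (the symbols $Q$ and $\lambda$ are introduced here and are not from the notes above). Under the stated conditions, the expected complete-data log likelihood is

$$Q(\theta)=\sum_i\sum_j q_i(Z_j)\left[\log p(X_i|Z_j)+\log\theta_j\right]$$

where only the $\log\theta_j$ terms depend on $\theta$. Maximising $Q(\theta)+\lambda\left(1-\sum_j\theta_j\right)$ and setting the derivative with respect to $\theta_j$ to zero gives

$$\frac{\sum_i q_i(Z_j)}{\theta_j}-\lambda=0 \quad\Rightarrow\quad \theta_j=\frac{\sum_i q_i(Z_j)}{\lambda}$$

and the constraint $\sum_j\theta_j=1$ then fixes $\lambda=\sum_i\sum_j q_i(Z_j)$, which is the denominator of the M step.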


# Our likelihood
Our likelihood has the same form.
### Log likelihood
$\log(L(\theta))=\log(p(X|\theta))=\sum_{i=1}^N \log\left(\sum_{z\in\{AB,aB,Ab,ab\}^2}p(X_i|Z=z)p(Z=z|\theta)\right)$


$\theta=(\theta_{AB},\theta_{aB},\theta_{Ab},\theta_{ab})$ so $\sum_{z_a} p(Z_a=z_a|\theta)=1$, where $Z_a$ denotes a single haplotype

and
$$p(X_i|Z,\theta)=p(X_i|Z)$$ because the genotypes are directly determined by the pair of haplotypes
and $$p(Z_a=z_a|\theta)=\theta_{z_a}$$ because the parameter for haplotype $z_a$ is its frequency.


## E
$q_i(Z=z)=p(Z=z|X_i,\theta^{(n)})$  can be calculated by
$$p(Z=z|X_i,\theta^{(n)})=\frac{p(X_i|Z=z)p(Z=z|\theta^{(n)})}{\sum_{z'} p(X_i|Z=z')p(Z=z'|\theta^{(n)})}$$
where $p(Z=z|\theta)=\theta_{z_1}\theta_{z_2}$ and

$p(X_i|Z=z)=0$ if genotypes are not consistent with the haplotypes and 1 otherwise.

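For a single haplotype (instead of a pair), sum over all pairs that contain it. The pair $(z_a,z_a)$ enters both terms, so the quantity below is the expected number of copies of $z_a$ carried by individual $i$: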

$$p(Z_a=z_a|X_i,\theta^{(n)})=\sum_{z'}\left(p(Z=(z_a,z')|X_i,\theta^{(n)}) + p(Z=(z',z_a)|X_i,\theta^{(n)})\right)$$
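
The sketch below illustrates this collapse from pairs to single haplotypes for one individual with genotypes (1,1). It is illustrative only: `theta` is set to the true frequencies from the first cell, and `consistent`, `joint`, `q` and `expCopies` are names introduced here.

```r
theta <- c(AB = 0.03, aB = 0.20, Ab = 0.55, ab = 0.22)
nA <- c(AB = 1, aB = 0, Ab = 1, ab = 0)  # copies of A on each haplotype
nB <- c(AB = 1, aB = 1, Ab = 0, ab = 0)  # copies of B on each haplotype

# p(X|Z): TRUE for the ordered pairs consistent with genotypes (1, 1)
consistent <- outer(nA, nA, "+") == 1 & outer(nB, nB, "+") == 1

# posterior over the 16 ordered pairs (4 x 4 matrix, rows = z1, columns = z2)
joint <- outer(theta, theta) * consistent
q <- joint / sum(joint)

# expected copies of each haplotype: first position plus second position
expCopies <- rowSums(q) + colSums(q)
round(expCopies, 4)
sum(expCopies)  # sums to 2: each individual carries exactly two haplotypes
```

Summed over all individuals, these expected copies therefore total $2N$.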

## M
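$\theta_{z_a}^{(n+1)} = \frac{\sum_i q_i(Z_a=z_a)}{\sum_i \sum_{z'_a} q_i(Z_a=z'_a)}$

where $q_i(Z_a=z_a)=p(Z_a=z_a|X_i,\theta^{(n)})$. The denominator equals $2N$ because each individual carries two haplotypes.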



