# Allele-Specific Expression

RNA sequencing can distinguish transcripts expressed from different copies of genes on homologous chromosomes when single-nucleotide polymorphisms (perhaps silent) distinguish the two alleles. Linkage between these distinctive SNPs and _cis_-regulatory sequences can provide information on regulatory variation within a shared cellular context.

## Null Hypothesis Testing

The null hypothesis in allele-specific expression analysis is that the alleles are expressed equally and so each read is equally likely to be derived from each allele.

Here, we'll take two approaches to get a _p_ value for the null hypothesis of equal expression in situations where just 12.5% of the reads come from one allele and 87.5% from the other. We'll look at this with just 8 reads and then with 32 reads.

### Permutation Testing

First, we'll generate many random sets of data according to the null model and look at the distribution of allele skew in these random data. Our approach to generating random sets of reads is simple: we choose randomly between `0` and `1`, and then count how often we choose `1` by summing the results of this random choice.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
print(np.random.choice(['A', 'C', 'G', 'T']))
reads = np.random.choice([0,1], 8)
print(reads)
print(sum(reads))

Now, we'll generate 10,000 random samples of 8 reads each, and tabulate how many random samples have zero, one, ..., eight reads from the `1` allele.

In [None]:
allele_counts_8 = [0] * 9
print(allele_counts_8)
for i in range(0,10000):
    n_reads = sum(np.random.choice([0,1], 8))
    allele_counts_8[n_reads] += 1
print(allele_counts_8)
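
The same tabulation can be done without an explicit loop. The following is a minimal sketch, assuming NumPy's `default_rng` generator with a fixed seed (chosen here only so the sketch is reproducible); the name `counts_8_vectorized` is introduced for illustration.

```python
import numpy as np

# Assumption: a seeded Generator, used only to make this sketch reproducible.
rng = np.random.default_rng(0)

n_samples = 10000
n_reads = 8

# Each row is one random sample; each entry is 0 or 1 with equal probability.
samples = rng.integers(0, 2, size=(n_samples, n_reads))

# Number of reads from the `1` allele in each sample.
ones_per_sample = samples.sum(axis=1)

# Tabulate how many samples have 0, 1, ..., 8 reads from the `1` allele.
counts_8_vectorized = np.bincount(ones_per_sample, minlength=n_reads + 1)
print(counts_8_vectorized)
```

The exact counts depend on the seed, but they should resemble the loop-based `allele_counts_8` table.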


Next, we'll plot the distribution.

In [None]:
plt.plot(allele_counts_8)

We can use this to ask: in what fraction of random samples do you see 12.5% (one-eighth) or fewer `1` reads? What can you conclude from seeing this kind of skew in a sample of 8 reads?

In [None]:
print(sum(allele_counts_8[0:2]) / sum(allele_counts_8))

What if we had no prior reason to look for a strong skew in one direction versus the other? That is, what if we're interested in situations where the frequency of `1` alleles is either ≤12.5% or ≥87.5%?

In [None]:
print((sum(allele_counts_8[0:2]) + sum(allele_counts_8[7:9])) / sum(allele_counts_8))
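
With only 8 reads, the simulated fractions can be checked by direct counting: under the null model every one of the 2^8 possible read sequences is equally likely. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from math import comb

n = 8
total_outcomes = 2 ** n  # equally likely read sequences under the null model

# One-sided: 0 or 1 reads from the `1` allele.
one_sided = (comb(n, 0) + comb(n, 1)) / total_outcomes

# Two-sided: also count 7 or 8 reads from the `1` allele.
two_sided = one_sided + (comb(n, 7) + comb(n, 8)) / total_outcomes

print(one_sided)  # 9 / 256
print(two_sided)  # 18 / 256
```

The simulated fractions should land close to these values, with some sampling noise from one run to the next.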

We can repeat the same analysis, for 32 reads per random sample.

In [None]:
print(sum(np.random.choice([0,1], 32)))
allele_counts_32 = [0] * 33
for i in range(0,10000):
    n_reads = sum(np.random.choice([0,1], 32))
    allele_counts_32[n_reads] += 1
print(allele_counts_32)
print(allele_counts_32[26])
print(allele_counts_32[27])
plt.plot(allele_counts_32)

Now we can test for a similar skew in our 32-read samples. 

In [None]:
print((sum(allele_counts_32[0:5]) + sum(allele_counts_32[28:33])) / sum(allele_counts_32))

What about a stronger skew: at least 31 reads from one allele and no more than 1 from the other?

In [None]:
print((sum(allele_counts_32[0:2]) + sum(allele_counts_32[31:33])) / sum(allele_counts_32))

## Random Variables

There is a small but non-zero probability of getting this strong skew in the 32-read sample, but we would need to generate a lot of random samples in order to figure out exactly how small. Instead, we can treat the number of reads from the `1` allele as a random variable with a binomial distribution. This isn't always a fair description of biological data, but it's a reasonable starting point here.

### Binomial Distribution

The SciPy package contains a statistics module, `scipy.stats`, which provides a `binom` object for the binomial distribution. We can get the probability
```
P( k successes out of N trials with probability p of success per trial )
```
using
```
binom.pmf(k, N, p)
```
"pmf" here stands for "probability mass function".

For instance, we can ask about exactly 2 "successes" out of 8 "trials" -- think of this as 2 reads from the `1` allele out of 8 reads counted in total. We can also ask about exactly 2 reads from the `1` allele out of 32 total.

In [None]:
!pip3 install scipy
from scipy.stats import binom
print(binom.pmf(2, 8, 0.5))
print(binom.pmf(2, 32, 0.5))

We can also test a range of different k values, making it easy to sum up across many possibilities:

In [None]:
print(binom.pmf(range(0,5), 32, 0.5))
print(sum(binom.pmf(range(0,5), 32, 0.5)))
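
Summing the probability mass function over a tail is the same as evaluating a cumulative probability. As a sketch, the `cdf` and `sf` (survival function) methods of `binom` give the lower and upper tails directly; the variable names below are illustrative.

```python
from scipy.stats import binom

n = 32

# Lower tail, computed two ways: P(k <= 4).
lower_by_sum = sum(binom.pmf(range(0, 5), n, 0.5))
lower_by_cdf = binom.cdf(4, n, 0.5)

# Upper tail: sf(27) is P(k > 27), that is, P(k >= 28).
upper_by_sf = binom.sf(27, n, 0.5)

print(lower_by_sum, lower_by_cdf, upper_by_sf)

# Two-sided probability of a skew at least as strong as 4 out of 32.
print(lower_by_cdf + upper_by_sf)
```

Because the null model uses p = 0.5, the distribution is symmetric and the two tails are equal.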

We can use this to compute a precise value for the very small probability of the 1-or-fewer vs 31-or-more skew in 32 reads.

In [None]:
print(binom.pmf([0,1,31,32], 32, 0.5))
print(sum(binom.pmf([0,1,31,32], 32, 0.5)))

We'd need to run a lot of simulations to find that p-value reliably!
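
To put a rough number on "a lot", the sketch below counts the extreme outcomes exactly and asks how many of them we would expect among 10,000 simulated samples. The variable names are illustrative.

```python
from math import comb

n = 32

# Exact probability of 0, 1, 31, or 32 reads from the `1` allele under the null.
p_extreme = (comb(n, 0) + comb(n, 1) + comb(n, 31) + comb(n, 32)) / 2 ** n
print(p_extreme)

# Expected number of such samples in a 10,000-sample simulation.
expected_hits = 10000 * p_extreme
print(expected_hits)

# Average number of random samples needed to see one such outcome.
samples_per_hit = 1 / p_extreme
print(samples_per_hit)
```

On average, tens of millions of random samples are needed to see this skew even once, and far more to estimate its probability with any precision.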

We can also plot this distribution and compare it to our simulations.

In [None]:
# Use range(0,33) so that all 33 possible counts (0 through 32) are included
plt.plot(binom.pmf(range(0,33), 32, 0.5) * 10000, '.-r')
plt.plot(allele_counts_32, 'o--b')

## Maximum Likelihood Estimation

Instead of simply testing whether allele expression is even, we want to _estimate_ the relative skew in expression. To do this, we will start by making a graph where we consider all possible bias values, and then figure out the likelihood function P( 4 reads out of 32 | bias ). We'll use the `binom.pmf` function again, but now we'll hold the first parameter, _k_, fixed and consider many different values for the third parameter, _p_.

This graph looks similar to others that we've made, but the axes are different -- we'll add x and y axis labels to emphasize this difference.

In [None]:
bias = np.arange(0,1,0.01)
print(bias)
print(binom.pmf(4, 32, bias))
plt.plot(bias, binom.pmf(4, 32, bias))
plt.xlabel('Allele bias')
plt.ylabel('P(4 reads / 32 total)')

As we discussed before, we often want to work with log likelihood functions. In fact, `binom` has the built-in ability to give us a log probability that will be numerically stable when the actual likelihood is a very tiny number, using the `logpmf` method.

In [None]:
plt.figure()
plt.plot(bias, binom.logpmf(4, 32, bias))
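
To illustrate the numerical-stability point, here is a small sketch with hypothetical counts (4 reads out of 3200, chosen only to make the probability extremely small). The direct probability is far below the smallest value a double-precision float can represent, while the log probability remains an ordinary number.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical counts, chosen only to produce a very small probability.
k, n = 4, 3200

p_direct = binom.pmf(k, n, 0.5)   # too small to represent as a normal float
log_p = binom.logpmf(k, n, 0.5)   # log of the same probability

print(p_direct)
print(log_p)
```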

In this plot, the likelihood function looks pretty flat around its peak near 0.125, but we might want to adjust the y-axis to focus on the region of high likelihood -- the default puts a lot of emphasis on parts of the plot where the likelihood is very small.

In [None]:
plt.figure()
plt.plot(bias, binom.logpmf(4, 32, bias))
plt.axis([0.0, 1.0, -10, 0])

We need to find the point on the x-axis where the likelihood is maximized. We can probably guess that this will happen at 0.125, but we can use algorithms from SciPy to find the best likelihood. These methods are typically expressed in terms of _minimization_, and so we'll minimize the negative log likelihood, which is equivalent to finding the maximum of the likelihood.

To do this, we define a function to compute the negative log likelihood, called `negloglik`, and then use `minimize_scalar` from `scipy.optimize` to find the allele skew value that maximizes the likelihood of our data.

In [None]:
from scipy.optimize import minimize_scalar
def negloglik(bias):
    return -binom.logpmf(4, 32, bias)
plt.plot(bias, negloglik(bias))
mle = minimize_scalar(negloglik, bounds=(0,1), method='bounded')
print(mle)
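
For the binomial model, the numerical answer can be checked against a closed form. Up to a constant, the log likelihood is k log(p) + (N - k) log(1 - p); setting its derivative k/p - (N - k)/(1 - p) to zero gives p = k/N. The sketch below compares the two; `negloglik_kn`, `numeric`, and `closed_form` are names introduced here for illustration.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

k, n = 4, 32

def negloglik_kn(bias):
    # Negative log likelihood of observing k reads out of n at this bias.
    return -binom.logpmf(k, n, bias)

numeric = minimize_scalar(negloglik_kn, bounds=(0, 1), method='bounded')
closed_form = k / n

print(numeric.x, closed_form)
```

The optimizer's estimate should agree with k/N to within its tolerance.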

Below, we'll plot the log likelihood functions for:
* 1 read out of 8
* 4 reads out of 32
* 16 reads out of 128

In [None]:
x = np.arange(0,1,0.01)
plt.plot(x, binom.logpmf(1, 8, x), color="magenta")
plt.plot(x, binom.logpmf(4, 32, x), color="black")
plt.plot(x, binom.logpmf(16, 128, x), color="cyan")
plt.axis([0,1,-10,0])
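
One way to quantify how these curves narrow as the read count grows is to measure the range of bias values whose log likelihood lies within a fixed distance of the maximum. This is only a sketch: the cutoff of 2 log units and the grid spacing are arbitrary assumptions made for illustration, not part of the analysis above.

```python
import numpy as np
from scipy.stats import binom

# Grid of bias values, avoiding exactly 0 and 1 where the log likelihood is -inf.
grid = np.linspace(0.001, 0.999, 999)

drop = 2.0  # assumption: arbitrary cutoff below the maximum log likelihood

widths = {}
for k, n in [(1, 8), (4, 32), (16, 128)]:
    loglik = binom.logpmf(k, n, grid)
    # Bias values whose log likelihood is within `drop` of the maximum.
    supported = grid[loglik >= loglik.max() - drop]
    widths[n] = supported.max() - supported.min()
    print(n, widths[n])
```

All three data sets have the same observed fraction, 12.5%, but more reads confine the plausible bias values to a narrower range.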