# Allele-Specific Expression

RNA sequencing can distinguish transcripts expressed from different copies of genes on homologous chromosomes when single-nucleotide polymorphisms (perhaps silent) distinguish the two alleles. Linkage between these distinctive SNPs and _cis_-regulatory sequences can provide information on regulatory variation within a shared cellular context.

## Maximum Likelihood Estimation

Instead of simply testing whether allele expression is even, we want to _estimate_ the relative skew in expression. To do this, we will start by graphing the likelihood function P( 8 reads out of 32 | bias ) across all possible bias values. We'll use the `binom.pmf` function again, but now we'll vary the third parameter, _p_, instead of the first one (the number of successes).

This graph looks similar to others that we've made, but the axes are different -- we'll add x and y axis labels to emphasize this difference.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

bias = np.arange(0,1,0.01)
plt.plot(bias, binom.pmf(8, 32, bias))
plt.xlabel('Allele bias')
plt.ylabel('P(8 reads / 32 total)')

As we discussed before, we often want to work with log likelihood functions. In fact, `binom` has the built-in ability to give us a log probability that will be numerically stable when the actual likelihood is a very tiny number, using the `logpmf` method.

When we plot the log likelihood, the function looks pretty flat around 0.25, but we might want to adjust the y-axis to focus on the region of high likelihood -- the default puts a lot of emphasis on parts of the plot where the likelihood is very small.
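One way to draw the log likelihood with a tighter y-axis is sketched below. The 10-unit window below the peak is an arbitrary choice for illustration, and the bias grid stops short of 0 and 1 because the log likelihood is negative infinity at those endpoints.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Stay strictly inside (0, 1): the log likelihood is -inf at bias = 0 or 1
bias = np.linspace(0.01, 0.99, 99)
loglik = binom.logpmf(8, 32, bias)

plt.plot(bias, loglik)
# Assumed window: show only the top 10 log units below the peak
plt.ylim(loglik.max() - 10, loglik.max() + 0.5)
plt.xlabel('Allele bias')
plt.ylabel('ln P(8 reads out of 32 | bias)')
```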

We need to find the point on the x-axis where the likelihood is maximized. We can probably guess that this will happen at 0.25, but we can use algorithms from SciPy to find the best likelihood. These methods are typically expressed in terms of _minimization_, so we'll minimize the negative log likelihood, which is equivalent to maximizing the likelihood.

To do this, we define a function to compute the negative log likelihood, called `negloglik`, and then use `minimize_scalar` from `scipy.optimize` to find the allele skew value that maximizes the likelihood of our data.

In [None]:
from scipy.optimize import minimize_scalar
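A minimal sketch of this step might look like the following. The bounded search over (0, 1) is an assumption made here to keep the optimizer inside the valid range of bias values; `bias_opt` and `loglik_opt` hold the best bias and its log likelihood.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

def negloglik(bias):
    """Negative log likelihood of 8 reads out of 32 at a given allele bias."""
    return -binom.logpmf(8, 32, bias)

# Bounded search keeps the optimizer between 0 and 1
result = minimize_scalar(negloglik, bounds=(0, 1), method='bounded')
bias_opt = result.x        # bias value that maximizes the likelihood
loglik_opt = -result.fun   # log likelihood at that bias
print(bias_opt, loglik_opt)
```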

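The width of the likelihood peak can be used to compute a confidence interval. For a 95% confidence interval, we find the range of values where the log likelihood is within 1.92 of the best log likelihood. The value 1.92 is half of 3.84, the 95th percentile of a chi-squared distribution with one degree of freedom.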

We can do this in a brute-force sort of way -- start at the best bias value and keep increasing it until the log likelihood drops off too much:
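A brute-force search along these lines could be sketched as below. The step size of 0.001 is an assumed value, and the starting point is set directly to 8/32 so that the example runs on its own.

```python
from scipy.stats import binom

bias_opt = 8 / 32  # maximum likelihood estimate for 8 reads out of 32
loglik_opt = binom.logpmf(8, 32, bias_opt)

step = 0.001  # assumed step size; smaller is more precise but slower
upper = bias_opt
# Keep stepping up while the next value is still within 1.92 of the best
while upper + step < 1 and binom.logpmf(8, 32, upper + step) > loglik_opt - 1.92:
    upper += step
print(upper)
```

The result is only accurate to within one step, which is the main limitation of this approach.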

Alternatively, we can use root-finding algorithms from SciPy to solve for the point where the log likelihood crosses our threshold. That is, we want to solve for _u_ where
```
lnP(8 reads out of 32 | bias = u) = loglik_opt - 1.92
```
or
```
lnP(8 reads out of 32 | bias = u) - loglik_opt + 1.92 = 0
```

In [None]:
from scipy.optimize import root_scalar
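As a sketch, the second form of the equation can be handed to `root_scalar` with a bracket that runs from the best bias up to a value just below 1. The helper name `loglik_gap` and the upper bracket end of 0.999 are assumptions for illustration.

```python
from scipy.optimize import root_scalar
from scipy.stats import binom

bias_opt = 8 / 32  # maximum likelihood estimate for 8 reads out of 32
loglik_opt = binom.logpmf(8, 32, bias_opt)

def loglik_gap(u):
    """Zero where the log likelihood is exactly 1.92 below its maximum."""
    return binom.logpmf(8, 32, u) - loglik_opt + 1.92

# The gap is +1.92 at bias_opt and strongly negative near 1,
# so the bracket contains exactly the crossing we want
upper = root_scalar(loglik_gap, bracket=(bias_opt, 0.999)).root
print(upper)
```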

We can then use the same trick to find the lower bound of the confidence interval by looking below `bias_opt` (between 0 and `bias_opt`) rather than above it.

We can then find the overall confidence interval for our estimate, based on the shape of the likelihood function:
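Putting both bounds together might look like the sketch below. The helper `loglik_gap` and the bracket ends 0.001 and 0.999 are assumptions chosen to avoid the endpoints, where the log likelihood is negative infinity.

```python
from scipy.optimize import root_scalar
from scipy.stats import binom

bias_opt = 8 / 32  # maximum likelihood estimate for 8 reads out of 32
loglik_opt = binom.logpmf(8, 32, bias_opt)

def loglik_gap(u):
    """Zero where the log likelihood is exactly 1.92 below its maximum."""
    return binom.logpmf(8, 32, u) - loglik_opt + 1.92

# Search below the optimum for the lower bound, above it for the upper bound
lower = root_scalar(loglik_gap, bracket=(0.001, bias_opt)).root
upper = root_scalar(loglik_gap, bracket=(bias_opt, 0.999)).root
print(f'bias = {bias_opt:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})')
```

Note that the interval is not symmetric around the best estimate, because the likelihood function itself is not symmetric.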

Below, we'll plot the log likelihood functions for:
* 2 reads out of 8
* 8 reads out of 32
* 25 reads out of 100
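One possible version of this plot is sketched here. Subtracting each curve's peak so that all three top out at 0 is a choice made for easier comparison, as are the dashed line at -1.92 and the y-axis window.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Stay strictly inside (0, 1): the log likelihood is -inf at the endpoints
bias = np.linspace(0.01, 0.99, 99)

curves = {}
for k, n in [(2, 8), (8, 32), (25, 100)]:
    loglik = binom.logpmf(k, n, bias)
    # Shift each curve so its peak sits at 0 (assumed normalization)
    curves[(k, n)] = loglik - loglik.max()
    plt.plot(bias, curves[(k, n)], label=f'{k} reads out of {n}')

# Threshold for the 95% confidence interval
plt.axhline(-1.92, color='gray', linestyle='--')
plt.ylim(-10, 0.5)
plt.xlabel('Allele bias')
plt.ylabel('Log likelihood relative to maximum')
plt.legend()
```

All three data sets have the same best bias of 0.25, but the peak narrows as the total number of reads grows.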