# Allele-Specific Expression

RNA sequencing can distinguish transcripts expressed from different copies of genes on homologous chromosomes when single-nucleotide polymorphisms (perhaps silent) distinguish the two alleles. Linkage between these distinctive SNPs and _cis_-regulatory sequences can provide information on regulatory variation within a shared cellular context.

## Null Hypothesis Testing

The null hypothesis in allele-specific expression analysis is that the alleles are expressed equally and so each read is equally likely to be derived from each allele.

Here, we'll take two approaches to get a _p_ value for the null hypothesis of equal expression in situations where just 25% of the reads come from one allele and 75% from the other. We'll look at this with just 8 reads, with 32 reads, and then with 100 reads.

### Permutation Testing

First, we'll generate many random sets of data according to the null model and look at the distribution of allele skew in these random data. Our approach to generating random sets of reads is simple: we choose randomly between `0` and `1`, and then count how often we choose `1` by summing the results of this random choice.
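As a minimal sketch of that random choice, the block below draws one sample of 8 reads and counts the `1` reads. The generator `rng`, its seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=0)

n_reads = 8

# Each read is 0 or 1 with equal probability; the upper bound 2 is exclusive.
reads = rng.integers(0, 2, size=n_reads)

# Summing the 0/1 choices counts how many reads were `1`.
n_ones = int(reads.sum())

print("reads:", reads)
print("number of 1 reads:", n_ones)
```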

In [None]:
import numpy as np
import matplotlib.pyplot as plt


Now, we'll generate 10,000 random samples of 8 reads each, and tabulate how many random samples have zero, one, ..., eight `1` reads. Print the counts, and then plot the distribution.
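One possible way to do this is sketched below: each row of a 10,000 by 8 array is one random sample, and `np.bincount` tabulates the row sums. The variable names and the seed are assumptions, not part of the original exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

n_samples = 10_000
n_reads = 8

# One row per random sample, one column per read (0 or 1, equally likely).
samples = rng.integers(0, 2, size=(n_samples, n_reads))

# Number of `1` reads in each sample.
ones_per_sample = samples.sum(axis=1)

# counts[k] is the number of samples with exactly k `1` reads (k = 0..8).
counts = np.bincount(ones_per_sample, minlength=n_reads + 1)

for k, c in enumerate(counts):
    print(f"{k} of {n_reads} reads are 1: {c} samples")

# In a notebook the figure renders inline; call plt.show() in a script.
plt.bar(np.arange(n_reads + 1), counts)
plt.xlabel("number of 1 reads in the sample")
plt.ylabel("number of random samples")
plt.title(f"{n_samples} random samples of {n_reads} reads")
```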

We can use this to ask: in what fraction of random samples do you see 25% or fewer `1` reads? What can you conclude from seeing this kind of skew in a sample of 8 reads?
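A sketch of that calculation follows, assuming "25% or fewer" means 2 or fewer `1` reads out of 8. For reference, the exact probability under the null model is 37/256, or about 0.145, so the simulated fraction should land close to that.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

n_samples = 10_000
n_reads = 8

# Number of `1` reads in each of the random samples.
ones_per_sample = rng.integers(0, 2, size=(n_samples, n_reads)).sum(axis=1)

# 25% of 8 reads is 2 reads.
threshold = 0.25 * n_reads

# Fraction of random samples at least this skewed toward `0`.
frac_skewed = np.mean(ones_per_sample <= threshold)

print(f"fraction of samples with <= {threshold:.0f} of {n_reads} reads as 1: "
      f"{frac_skewed:.4f}")
```

A skew that arises by chance in roughly one of every seven random samples is weak evidence against equal expression.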

We can repeat the same analysis, for 32 reads per random sample.

Now, we can ask: in what fraction of 32-read random samples do we see 25% or fewer `1` reads?
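The same sketch, adapted to 32 reads per sample, is shown here. The cutoff is now 8 or fewer `1` reads, and the variable names and seed are again assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

n_samples = 10_000
n_reads = 32

ones_per_sample = rng.integers(0, 2, size=(n_samples, n_reads)).sum(axis=1)

# counts[k] is the number of samples with exactly k `1` reads (k = 0..32).
counts = np.bincount(ones_per_sample, minlength=n_reads + 1)

# 25% of 32 reads is 8 reads.
threshold = 0.25 * n_reads
frac_skewed = np.mean(ones_per_sample <= threshold)

print(counts)
print(f"fraction of samples with <= {threshold:.0f} of {n_reads} reads as 1: "
      f"{frac_skewed:.4f}")
```

With 10,000 random samples, only a few dozen are expected to be this skewed, so the estimate is noisier than in the 8-read case.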

Clearly, the number of reads has a big impact on how often a given skew arises by chance.

In order to explore this further, we'll write a function to run the allele skew analysis using the number of reads as an input parameter.
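Such a function might look like the sketch below. The name `allele_skew` is the one used later in this notebook, but its signature, the `n_samples` and `seed` parameters, and the helper `fraction_skewed` are assumptions made for illustration.

```python
import numpy as np


def allele_skew(n_reads, n_samples=10_000, seed=None):
    """Simulate the null model and tabulate `1` reads per sample.

    Returns an array `counts` of length n_reads + 1, where counts[k] is
    the number of random samples with exactly k `1` reads.
    """
    rng = np.random.default_rng(seed)
    ones_per_sample = rng.integers(0, 2, size=(n_samples, n_reads)).sum(axis=1)
    return np.bincount(ones_per_sample, minlength=n_reads + 1)


def fraction_skewed(counts, max_fraction=0.25):
    """Fraction of samples with at most max_fraction of reads equal to `1`."""
    n_reads = len(counts) - 1
    cutoff = int(np.floor(max_fraction * n_reads))
    return counts[: cutoff + 1].sum() / counts.sum()


for n_reads in (8, 32, 100):
    counts = allele_skew(n_reads, seed=0)
    print(f"{n_reads:3d} reads: fraction with <= 25% as 1 = "
          f"{fraction_skewed(counts):.4f}")
```

For 100 reads the simulated fraction will usually come out as exactly zero, because 10,000 random samples are far too few to observe such a rare skew.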

### Binomial Distribution

There is probably a small but non-zero chance of getting a 25% skew in a 100-read sample -- but we would need to generate a lot of random samples in order to figure out exactly how small. Instead, we could model the number of reads according to the binomial distribution. This isn't always a fair description of biological data, but it's a reasonable starting point here.

Below, we use the binomial probability distribution to plot the _expected_ distribution of counts on top of the actual random sample from our `allele_skew` function.
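A sketch of that comparison for 100-read samples is given here. The simulation is written inline as a stand-in for the notebook's `allele_skew` function, whose exact form is not shown; the expected count for each outcome is the number of samples multiplied by `binom.pmf` with success probability 0.5.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

rng = np.random.default_rng(seed=0)  # arbitrary seed

n_samples = 10_000
n_reads = 100

# Stand-in for the allele_skew simulation: tabulate `1` reads per sample.
ones_per_sample = rng.integers(0, 2, size=(n_samples, n_reads)).sum(axis=1)
counts = np.bincount(ones_per_sample, minlength=n_reads + 1)

# Expected number of samples with k `1` reads under Binomial(n_reads, 0.5).
k = np.arange(n_reads + 1)
expected = n_samples * binom.pmf(k, n_reads, 0.5)

# In a notebook the figure renders inline; call plt.show() in a script.
plt.bar(k, counts, label="random samples")
plt.plot(k, expected, color="black", label="binomial expectation")
plt.xlabel("number of 1 reads in the sample")
plt.ylabel("number of random samples")
plt.legend()
```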

In [None]:
!pip3 install scipy
from scipy.stats import binom

Now, we can compute the probability of getting 25 _or fewer_ `1` reads in a 100-read sample. How many random samples would you typically need to see one skewed this much?
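One way to get this number is the binomial cumulative distribution function, `binom.cdf`, which sums the probabilities of 0 through 25 `1` reads. The sketch below assumes 100 reads and a probability of 0.5 per read under the null hypothesis; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

n_reads = 100
max_ones = 25

# P(X <= 25) for X ~ Binomial(100, 0.5): the one-sided p value.
p_value = binom.cdf(max_ones, n_reads, 0.5)

# The same quantity written out as a sum of individual probabilities.
p_check = binom.pmf(np.arange(max_ones + 1), n_reads, 0.5).sum()

# On average, about 1 / p random samples are needed to see one this skewed.
samples_needed = 1 / p_value

print(f"P(<= {max_ones} of {n_reads} reads are 1) = {p_value:.3g}")
print(f"expected number of random samples per such event: {samples_needed:.3g}")
```

The probability is on the order of one in a few million, which explains why 10,000 random samples are nowhere near enough to estimate it by simulation.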