# Random Variables

With random variables, we can talk about the _event_ that the random variable takes a particular value, and the _probability_ of that event. One fairly simple example of a random variable is the number of `C` nucleotides in a random nucleotide sequence. 

First, we import modules: `numpy` is needed for choosing nucleotides randomly, and `matplotlib.pyplot` will allow us to make some graphs.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In order to look at events like
```
P(exactly 3 Cs in 8 nucleotides)
```
we need to count up all 8-nucleotide sequences and see how many have exactly 3 Cs.

We'll build up our exhaustive list of 8-nucleotide sequences in parts. First, we'll build a list of all 16 dinucleotide sequences.
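A minimal sketch of this first step is shown here. The variable names `nts` and `di_nts` are assumptions chosen for illustration, not names confirmed by the text.

```python
# The four nucleotides (variable names here are illustrative assumptions).
nts = ['A', 'C', 'G', 'T']

# Pair every first base with every second base: 4 x 4 = 16 dinucleotides.
di_nts = []
for first in nts:
    for second in nts:
        di_nts.append(first + second)

print(di_nts)
print(len(di_nts))
```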

We can use nested `for` loops over these dinucleotides (or a list comprehension) to build the full set of 256 distinct 4-nucleotide sequences.
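One way this step could look is sketched below. It rebuilds the dinucleotides with a list comprehension so the cell runs on its own; `di_nts` and `tetra_nts` are assumed names.

```python
nts = ['A', 'C', 'G', 'T']
di_nts = [a + b for a in nts for b in nts]  # 16 dinucleotides

# Join every dinucleotide to every dinucleotide: 16 x 16 = 256 sequences.
tetra_nts = []
for first in di_nts:
    for second in di_nts:
        tetra_nts.append(first + second)

print(len(tetra_nts))
print(tetra_nts[:4])
```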

Finally, we can use this same trick once more to build the full set of 256 × 256 = 65,536 different 8-nucleotide sequences. We won't print this entire list -- we'll just check that the number of entries is correct.
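A sketch of the final doubling step, using list comprehensions throughout. The name `octo_nts` appears later in the text; `di_nts` and `tetra_nts` are assumed by analogy.

```python
nts = ['A', 'C', 'G', 'T']
di_nts = [a + b for a in nts for b in nts]               # 16 sequences
tetra_nts = [a + b for a in di_nts for b in di_nts]      # 256 sequences
octo_nts = [a + b for a in tetra_nts for b in tetra_nts] # 65,536 sequences

# Check the size rather than printing the whole list.
print(len(octo_nts))
```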

We can use a simple `for` loop to count up the number of `C` nucleotides in a sequence:
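A minimal version of such a counting loop, wrapped in a function so it can be reused, might look like this. The function name `count_cs` and the example sequence are hypothetical.

```python
def count_cs(seq):
    """Count the 'C' nucleotides in a sequence with a plain for loop."""
    n_c = 0
    for nt in seq:
        if nt == 'C':
            n_c += 1
    return n_c

# The example sequence has a C at positions 2, 5 and 6.
print(count_cs('ACGTCCAT'))
```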

Now we'll loop over our 65,536-entry list of all 8-base sequences and count how many sequences have zero `C`s, one `C`, and so forth.

We could use a dictionary like we did above, with keys as the number of `C` nucleotides and values as the count of sequences. It could be hard to plot this, however. Instead, we'll use a list. We'll need a list with 9 entries, for zero, one, ..., eight `C` nucleotides. In order for `x[8]` to be defined, we need the length of `x` to be 9, not just 8, since we start counting from 0.
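The list-based tally described above can be sketched as follows. The name `c_counts` is an assumption, and the built-in `str.count()` is used here as a shortcut in place of an explicit counting loop.

```python
nts = ['A', 'C', 'G', 'T']
di_nts = [a + b for a in nts for b in nts]
tetra_nts = [a + b for a in di_nts for b in di_nts]
octo_nts = [a + b for a in tetra_nts for b in tetra_nts]

# Nine slots: c_counts[k] is the number of sequences with exactly k Cs,
# for k = 0, 1, ..., 8.
c_counts = [0] * 9
for seq in octo_nts:
    n_c = seq.count('C')  # shortcut for the explicit counting loop
    c_counts[n_c] += 1

print(c_counts)
```

Using the number of `C`s directly as the list index is what makes the ninth slot necessary: a sequence of eight `C`s increments `c_counts[8]`.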

Now, it's easy to plot these counts using the `plot()` command.

We want to convert these into probabilities instead of counts. To do this, we'll just divide by the total counts in the array, which we can get with the `sum()` function. Since we looped over the 65,536 sequences in `octo_nts`, we expect that sum to be 65536.
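A sketch of the normalization step follows. To keep the cell short, it enumerates the sequences with `itertools.product` rather than the list built up earlier; that substitution and the names `c_counts` and `c_probs` are assumptions.

```python
from itertools import product

# Tally sequences by number of Cs; product() yields each 8-mer as a tuple.
c_counts = [0] * 9
for seq in product('ACGT', repeat=8):
    c_counts[seq.count('C')] += 1

# Divide each count by the total to turn counts into probabilities.
total = sum(c_counts)
c_probs = [count / total for count in c_counts]

print(total)
print(c_probs)
```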

Now we can plot these probabilities. Here, we'll use a second argument to the `plt.plot()` function in order to plot with dots (`o`), solid lines connecting them (`-`), and in blue (`b`).
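The plotting call might look like the sketch below. The axis labels are an addition for readability, and the probabilities are recomputed with `itertools.product` (an assumed shortcut) so the cell is self-contained.

```python
from itertools import product
import matplotlib.pyplot as plt

c_counts = [0] * 9
for seq in product('ACGT', repeat=8):
    c_counts[seq.count('C')] += 1
total = sum(c_counts)
c_probs = [count / total for count in c_counts]

# 'o-b' = circle markers ('o'), solid connecting line ('-'), blue ('b').
plt.plot(c_probs, 'o-b')
plt.xlabel('Number of C nucleotides')
plt.ylabel('Probability')
```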

Next, we'll generate some random sequences. The `np.random.choice()` function makes random choices from an array of alternatives (by default, each alternative is equally likely and can be chosen more than once). In the cell below, we give it a list containing the four nucleotide letters and ask for 8 random choices.

In [None]:
random_nts = np.random.choice(['A', 'C', 'G', 'T'], 8)
print(random_nts)

The output of this choice is a NumPy array, and we want to combine all of its entries into a single string. We can do this using the string `join()` method, which is called on the separator to place between entries. Here we want nothing between the characters, so we join them using an empty string, `''`.

In [None]:
print(''.join(random_nts))

Notice that we _saved_ our random nucleotides above and then joined them together. Every time we run `np.random.choice(...)` we get a new array of 8 nucleotides, chosen randomly:
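As a small illustration, the sketch below makes two separate calls and joins each result. The names `first_seq` and `second_seq` are hypothetical, and no output is shown because it changes from run to run.

```python
import numpy as np

nts = ['A', 'C', 'G', 'T']

# Each call draws a fresh set of 8 nucleotides.
first_seq = ''.join(np.random.choice(nts, 8))
second_seq = ''.join(np.random.choice(nts, 8))

print(first_seq)
print(second_seq)
```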

So, we can make a list of 100 random sequences in a `for` loop that runs the same random choice 100 times and appends the random sequences onto a list:
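A sketch of that loop, with `random_seqs` as an assumed name for the list:

```python
import numpy as np

nts = ['A', 'C', 'G', 'T']

# Run the same random choice 100 times, appending each joined sequence.
random_seqs = []
for _ in range(100):
    random_seqs.append(''.join(np.random.choice(nts, 8)))

print(len(random_seqs))
print(random_seqs[:5])
```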

**Exercise** Complete the loop below in order to count the number of sequences with different numbers of `C`s and convert the counts into probabilities:
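One possible completion is sketched here; it is not the only valid answer. The names `random_seqs`, `random_counts`, and `random_probs` are assumptions, and the random sequences are regenerated so the cell runs on its own.

```python
import numpy as np

nts = ['A', 'C', 'G', 'T']
random_seqs = [''.join(np.random.choice(nts, 8)) for _ in range(100)]

# Tally how many random sequences contain 0, 1, ..., 8 Cs.
random_counts = [0] * 9
for seq in random_seqs:
    n_c = 0
    for nt in seq:
        if nt == 'C':
            n_c += 1
    random_counts[n_c] += 1

# Convert counts to probabilities by dividing by the number of sequences.
random_probs = [count / len(random_seqs) for count in random_counts]

print(random_counts)
print(random_probs)
```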

Now, we can plot our random sequence probabilities and our exhaustive sequence probabilities on the same graph. We'll make the random sequence probabilities distinct using `'.--r'`, which uses small points, a dashed line, and red (instead of big points, solid line, and blue).
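A sketch of the combined plot is below. Both sets of probabilities are recomputed so the cell is self-contained; the names `exact_probs` and `random_probs` and the use of `itertools.product` for the exhaustive enumeration are assumptions.

```python
from itertools import product
import numpy as np
import matplotlib.pyplot as plt

# Exhaustive reference over all 65,536 sequences.
exact_counts = [0] * 9
for seq in product('ACGT', repeat=8):
    exact_counts[seq.count('C')] += 1
exact_probs = [count / sum(exact_counts) for count in exact_counts]

# Estimate from 100 random sequences.
random_counts = [0] * 9
for _ in range(100):
    seq = ''.join(np.random.choice(['A', 'C', 'G', 'T'], 8))
    random_counts[seq.count('C')] += 1
random_probs = [count / 100 for count in random_counts]

plt.plot(exact_probs, 'o-b')    # big points, solid line, blue
plt.plot(random_probs, '.--r')  # small points, dashed line, red
```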

Most of the time, the red graph is pretty similar in shape to the blue one, but not identical. If we analyze more sequences, we'll get a distribution closer to the "ideal" blue curve.

We can easily run the same analysis for 1000 random sequences. We don't even need to store the random sequences.
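For example, the loop can tally each sequence as soon as it is generated and then discard it. In this sketch, `n_seqs`, `counts_1000`, and `probs_1000` are hypothetical names.

```python
import numpy as np

nts = ['A', 'C', 'G', 'T']
n_seqs = 1000

# Count Cs in each random sequence immediately; nothing is stored
# except the nine running totals.
counts_1000 = [0] * 9
for _ in range(n_seqs):
    seq = ''.join(np.random.choice(nts, 8))
    counts_1000[seq.count('C')] += 1

probs_1000 = [count / n_seqs for count in counts_1000]
print(probs_1000)
```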

Of course, there's a third alternative to counting all 65,536 sequences or sampling many random sequences. We know that the number of `C` nucleotides per sequence should follow the _binomial distribution_. The `scipy.stats.binom` object implements the binomial distribution, including its "Probability Mass Function" (PMF), which is just a technical way of saying that it tells you the probability of each possible outcome under the binomial distribution. To compute this probability, we need to know the number of independent trials (N) and the probability of success in each trial (p). Using those, we can compute the probability of exactly k successful trials, using
```
binom.pmf(k, N, p)
```
For instance, the probability of a `C` is 1/4 = 0.25, and we look at 8 nucleotides, so the probability of exactly one `C` when we sample 8 nucleotides independently is computed as follows:

In [None]:
from scipy.stats import binom
binom.pmf(1, 8, 0.25)
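As a cross-check, the same value can be computed by hand from the standard binomial formula, P(k) = C(N, k) · p^k · (1 − p)^(N − k). The sketch below uses only the standard library; the variable names are illustrative.

```python
from math import comb

N, p, k = 8, 0.25, 1

# C(N, k) ways to place the k successes, times the probability of any
# one such arrangement.
prob = comb(N, k) * p**k * (1 - p)**(N - k)
print(prob)
```

The result, roughly 0.267, should agree with `binom.pmf(1, 8, 0.25)`. It also matches the exhaustive count: 17,496 of the 65,536 sequences contain exactly one `C`.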

We can also create a binomial distribution object for a fixed N and p, and then find the probability of each value in an array of k values under that distribution. Below, we create an array `x` of k values from 0 through 8 inclusive, and then compute the probability (PMF) for each one. We plot those against our 65,536-sequence reference, and the two curves should coincide (up to floating-point rounding).
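A sketch of this comparison is below. Calling `binom(8, 0.25)` returns a "frozen" distribution with N and p fixed. The names `c_dist`, `exact_probs`, and `binom_probs` are assumptions, and the exhaustive reference is recomputed with `itertools.product` so the cell runs on its own.

```python
from itertools import product
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Exhaustive reference: fraction of all 8-mers with k Cs.
exact_counts = np.zeros(9)
for seq in product('ACGT', repeat=8):
    exact_counts[seq.count('C')] += 1
exact_probs = exact_counts / exact_counts.sum()

# Frozen binomial distribution with N = 8 trials and p = 0.25.
c_dist = binom(8, 0.25)
x = np.arange(0, 9)          # k = 0, 1, ..., 8
binom_probs = c_dist.pmf(x)

plt.plot(x, exact_probs, 'o-b')
plt.plot(x, binom_probs, '.--r')

# True if the two sets of probabilities agree within rounding error.
print(np.allclose(exact_probs, binom_probs))
```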

Using the binomial distribution and its PMF, we can try out uneven nucleotide distributions. For instance, below we plot three different curves with p = 0.19 (like the 19% C in the yeast genome), p = 0.25 (equal nucleotide frequencies), and p = 0.31 (like the 31% T in the yeast genome). We plot all three binomial PMFs in different colors: cyan (`c`) for 19%, black (`k`) for 25%, and magenta (`m`) for 31%.

We also create a _legend_ for our figure: we include a `label` argument in each of our `plot(...)` calls, capture the "plot handle" returned by each call, and use the `legend(...)` function to create a legend for these handles.

In [None]:
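plot19, = plt.plot(binom.pmf(x, 8, 0.19), 'o-c', label='19%') ## Like 'C' in the yeast genome
plot25, = plt.plot(binom.pmf(x, 8, 0.25), 'o-k', label='25%') ## Equal nucleotide frequencies
plot31, = plt.plot(binom.pmf(x, 8, 0.31), 'o-m', label='31%') ## Like 'T' in the yeast genome
plt.legend(handles=[plot19, plot25, plot31])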