# Notebook 6.2: `np.random` and `np.histogram`

### Required software

In [480]:
# pip install toyplot
# conda install numpy

In [482]:
import numpy as np
import toyplot

### Numpy random

The `numpy.random` package is one of the most useful scientific packages you are likely to use. It will feel familiar because it has many of the same features as the `random` package from the Python standard library, but the numpy version is much more expansive and also much faster. 

In [485]:
# get 10 random integers from 0 to 254 (the upper bound, 255, is excluded)
np.random.randint(0, 255, 10)

array([195, 102, 154,  32, 244,   8, 207, 244, 135, 235])
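
One difference from the standard library is worth checking for yourself: `random.randint` includes its upper bound, while `np.random.randint` excludes it. Below is a minimal sketch of that comparison; the seed and the sample size are arbitrary choices for illustration.

```python
import random
import numpy as np

np.random.seed(123)
random.seed(123)

# numpy: the upper bound is exclusive, so values fall in 0..254
np_vals = np.random.randint(0, 255, 100000)

# standard library: the upper bound is inclusive, so 255 can occur
py_vals = [random.randint(0, 255) for _ in range(100000)]

print(np_vals.min(), np_vals.max())
print(min(py_vals), max(py_vals))
```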

In [509]:
# get 10 nucleotide bases from 'ACGT'
np.random.choice(list("ACGT"), 10)

array(['T', 'G', 'G', 'G', 'G', 'G', 'A', 'A', 'A', 'G'],
      dtype='<U1')

## Use `random.binomial` for masking 

Masking is an effective way to select only a subset of values in an array. This can be used to subsample randomly, or to filter values that meet only a certain criterion. Below are several ways to create a boolean mask to randomly sample values from an array efficiently.



In [515]:
# an array of 1000 sequential ints
arr = np.arange(1000)

#### random binomial trials
Binomial sampling can be thought of like a coin flip, but where you can assign the probability to each outcome like a weighted coin. Below we run 1000 trials (size) of individual coin flips (n=1) where the probability of one outcome (say flipping heads) is 0.1 (p=0.1). This will return an array of binary integers (e.g., `array([0, 0, 1, 1])`) which we will then convert to a boolean type using the `astype()` call. 

In [516]:
# 1000 trials where each has success rate of 0.1
mask = np.random.binomial(n=1, p=0.1, size=1000).astype(bool)

# show the first 50 results
mask[:50]

array([False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False,  True,  True, False, False, False, False,
       False, False, False, False, False], dtype=bool)

#### masking with a boolean array
A boolean array can be used to subselect from another array by selecting only the elements of value `True` in the boolean array. Remember, True is a special keyword in Python, and it is equivalent to the value 1, which is why we were able to convert the 1's and 0's above into True's and False's so easily. Applying the mask from above that selected elements with a probability of 0.1 we see that it reduces the array of 1000 ordered integers into a smaller array of around 100 values. 

In [517]:
# use boolean array to mask (select only elements that are True)
arr[mask]

array([  6,  24,  35,  39,  40,  55,  66,  67,  74,  90,  92, 100, 109,
       116, 123, 127, 134, 139, 146, 179, 184, 188, 200, 209, 216, 218,
       231, 252, 258, 266, 271, 275, 286, 306, 318, 320, 325, 333, 336,
       341, 356, 364, 374, 375, 384, 409, 446, 448, 470, 493, 499, 511,
       512, 522, 528, 535, 540, 548, 550, 554, 556, 560, 571, 579, 582,
       592, 595, 600, 614, 618, 623, 640, 649, 676, 679, 700, 715, 730,
       733, 741, 743, 761, 771, 786, 795, 823, 828, 832, 844, 849, 868,
       874, 885, 893, 906, 916, 920, 930, 951, 958, 965, 969, 973, 981,
       987, 988])
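
Binomial draws are not the only way to build a mask. The sketch below shows two alternatives, assuming the same kind of array of 1000 sequential integers: comparing uniform random floats to a threshold, and testing a criterion on the values themselves. The seed and the "even values" criterion are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(42)
arr = np.arange(1000)

# random mask: each element is kept with probability 0.1
rand_mask = np.random.random(arr.size) < 0.1

# criterion mask: keep only the even values
even_mask = (arr % 2) == 0

# masks can be combined with & (and), | (or), and ~ (not)
both = arr[rand_mask & even_mask]

print(rand_mask.sum(), even_mask.sum(), both.size)
```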

## Use `random.choice` for randomizing


Similar to above, instead of selecting True or False for every cell sometimes we may want to randomly sample values from an array while dictating the exact number that we will get in the end. This can be done with `random.choice`, and has a lot of potential uses in biological programming. One example is in the statistical method called bootstrap resampling. 
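
As a minimal sketch of drawing an exact number of values, assume an illustrative array of 1000 sequential integers and an arbitrary seed. Passing `replace=False` guarantees that no element is drawn twice, whereas the default (`replace=True`) allows repeats.

```python
import numpy as np

np.random.seed(42)
arr = np.arange(1000)

# exactly 100 values, each element drawn at most once
sub = np.random.choice(arr, size=100, replace=False)

# the default samples with replacement, so duplicates can occur
dup = np.random.choice(arr, size=100)

print(sub.size, np.unique(sub).size)
print(dup.size, np.unique(dup).size)
```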

#### Bootstrap resampling (e.g., bootstrapping)
Bootstrapping is a *non-parametric* method for testing the reliability of a measurement by comparing an observed statistic to the same statistic calculated on random re-samplings (with replacement) of the data points from which it was calculated. It provides a way of examining the variance in a statistic without needing to collect an entirely new data set, and without assuming that the data follow a standard statistical distribution, like a normal distribution. Instead, we just re-test the same data set by resampling it. Another way to think about it is that it examines whether a few outlier data points might be driving our results: when you resample data points, the outliers will sometimes be left out, and if they matter the calculated statistic may be very different. Let's try it out.


In [540]:
# create a distribution of measured data points 
data = np.random.randint(0, 255, 1000)

# calculate a statistic on the observed data
dmean = data.mean()

# run one bootstrap replicate (sample with replacement to the same size as original)
boot0 = np.random.choice(data, size=data.size)
bmean = boot0.mean()

# print observed and single bootstrap (they're pretty similar)
print(dmean, bmean)

130.956 120.812


In [541]:
# run 1000 boots using list-comprehension in an array
boots = np.array([np.random.choice(data, size=data.size).mean() for i in range(1000)])


#### plot bootstrap distribution and observed data point
As you can see, our observed statistic falls near the center of our bootstrap distribution, so our results are likely not skewed by large outliers. There is also a fair bit of variation around the mean, which gives us an estimate of the uncertainty in the statistic.

In [542]:
# plot bootstrap distribution of means
c, a, m = toyplot.bars(np.histogram(boots, bins=20), height=200, width=400);

# add a vertical line at the observed data mean 
a.plot(
    [data.mean(), data.mean()],
    [0, 200],
    size=10, 
    color='red');
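
The spread of the bootstrap distribution can also be summarized numerically. The sketch below is one common way to do that, not the only one: it takes the standard deviation of the bootstrap means as an estimate of the standard error, and the middle 95% of the bootstrap means as a percentile interval. The seed is an arbitrary choice, and the data are regenerated here so that the example runs on its own.

```python
import numpy as np

np.random.seed(42)
data = np.random.randint(0, 255, 1000)

# 1000 bootstrap replicates: resample with replacement, record the mean
boots = np.array([
    np.random.choice(data, size=data.size).mean() for _ in range(1000)
])

# the spread of the bootstrap means approximates the standard error of the mean
boot_se = boots.std()

# percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boots, [2.5, 97.5])

print(data.mean(), boot_se, lower, upper)
```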

### Sampling from statistical distributions

For many statistical tests we are interested in comparing observed data to a known statistical distribution, or simulating data under a known statistical distribution to test whether observed data fit to some expected modeled outcome. The binomial distribution that we saw above is one such type of *parametric* model, where we provide a parameter (p; the probability of success in a trial) and simulate random runs under that model. Below we'll try out a few other common models used in biological programming. 

#### The uniform distribution 
The uniform distribution samples numbers with equal frequency within a set range of values (defined by `low` and `high`). This is similar to the `randint` function above, but in this case a `float` is returned, thus it is sampling randomly along all values within and between integers in the selected range. We are saying that all values in this range are equally likely to be sampled. 

In [566]:
# sampling from a Uniform distribution
np.random.uniform(low=0, high=255, size=10)

array([ 242.31732972,   43.34852412,   54.59578087,  220.37432612,
        160.03143117,   50.11474576,  243.55165217,  233.62161613,
        193.39184248,    7.6602638 ])

#### The normal distribution 
This is the standard bell curve, the result of sampling from a distribution with a mean value and some variance around that mean. The normal distribution is thus parameterized with two values, a mean (`loc`) and a standard deviation (`scale`). 

In [617]:
# sample from a Normal distribution
np.random.normal(loc=0, scale=2, size=10)

array([ 0.35779846,  0.61132089,  2.93396862, -0.58236117,  2.17141236,
        0.0748408 , -3.75156962,  2.22350855, -3.7191722 , -1.90037768])
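
With only 10 draws it is hard to see the parameters in the output. A minimal sketch with a larger sample (the size and seed are arbitrary choices) shows the sample mean and standard deviation approaching `loc` and `scale`, and roughly 68% of draws falling within one standard deviation of the mean.

```python
import numpy as np

np.random.seed(42)
vals = np.random.normal(loc=0, scale=2, size=100000)

# with many draws the sample statistics approach the parameters
print(vals.mean(), vals.std())

# fraction of draws within one standard deviation (scale=2) of the mean (loc=0)
within = np.mean(np.abs(vals) < 2)
print(within)
```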

## Histograms
A histogram is a way of *binning* values that are within some range of each other into a discrete category, and is typically used as a way of visualizing large data sets. In your reading, histograms were created using the `matplotlib` library, which internally calls the function `np.histogram` to bin values. I think matplotlib is ugly and prefer the library `toyplot`, so we will do the same using this instead. When we call `np.histogram` on an array of values it returns a tuple of two arrays: the number of values in each bin, and the edges of the bins. Pass these arrays to `toyplot.bars` to plot a histogram like below. Here I add two additional arguments to `np.histogram`: `bins=20` sets the number of bins to 20, and `density=True` returns the values as a probability density as opposed to a count of the number of values in each bin.



In [572]:
arr = np.random.uniform(low=0, high=10, size=100000)
hist, edges = np.histogram(arr, bins=20, density=True)
toyplot.bars((hist, edges), height=200, width=400, label="Uniform distributed random values");

In [581]:
arr = np.random.normal(loc=0, scale=2, size=100000)
hist, edges = np.histogram(arr, bins=20, density=True)
toyplot.bars((hist, edges), height=200, width=400, label="Normal distributed random values");
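
To see what `np.histogram` returns without plotting anything, here is a minimal sketch using uniform values like those above (the seed is an arbitrary choice). Note that 20 bins are bounded by 21 edges, and that with `density=True` it is the bar areas (density times bin width), not the bar heights, that sum to 1.

```python
import numpy as np

np.random.seed(42)
vals = np.random.uniform(low=0, high=10, size=100000)

counts, edges = np.histogram(vals, bins=20)
dens, _ = np.histogram(vals, bins=20, density=True)

# 20 bins are bounded by 21 edges
print(counts.size, edges.size)

# counts sum to the number of values
print(counts.sum())

# densities times bin widths sum to 1
print(np.sum(dens * np.diff(edges)))
```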

#### Exponential distribution

The exponential distribution describes the *waiting time* between events that occur independently and at a fixed rate. For example, we might ask: if the mutation rate is 1e-8, what is the average waiting time between mutations at a single site in the genome? The distribution below shows that often the waiting time is very short, but sometimes it is very long. There is a long tail to the exponential distribution. To think about why this is, consider the relationship of the exponential to the binomial distribution earlier (random trials with success `p`). It only takes one success to end a run of trials, but sometimes many, many trials occur in a row without a successful event happening. These rare runs of failures create the long tail of the exponential distribution.

In [587]:
# mean waiting time is 1/lam where lam is the rate (probability of an event per trial)
arr = np.random.exponential(scale=1/1e-8, size=100000)

# let's divide by 1e6 to get result in units of millions
arr = arr / 1e6
hist, edges = np.histogram(arr, bins=20, density=True)
toyplot.bars((hist, edges), height=200, width=400, 
             label="Exponential distribution",
             xlabel="N trials until success (millions)",
             ylabel="Frequency");

# on average, it takes about 100 million generations for a mutation to occur at a site
arr.mean().astype(int)

100
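
The link between repeated trials and the exponential can be checked directly. Strictly speaking, the number of discrete trials up to the first success follows a *geometric* distribution, which the exponential approximates closely when `p` is small. The sketch below compares the two, using `p = 0.01` (an illustrative value, chosen so that the numbers stay small) and an arbitrary seed.

```python
import numpy as np

np.random.seed(42)
p = 0.01  # illustrative per-trial success probability

# discrete: number of trials up to and including the first success
geo = np.random.geometric(p, size=100000)

# continuous approximation: exponential with mean 1/p
exp = np.random.exponential(scale=1 / p, size=100000)

# both means are close to 1/p
print(geo.mean(), exp.mean())

# the long tail pulls the mean above the median
print(np.median(geo), np.median(exp))
```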

## Multivariate normal distribution

The multivariate normal distribution is a structured distribution in which a `covariance matrix` (shared variance) describes the variance in draws from the distribution as well as the correlation among values sampled for each array. This type of distribution is used commonly in biology in the field of 'phylogenetic comparative methods', where we aim to quantitatively study morphological evolution among groups of species or populations. Using a `covariance matrix` we can represent the `phylogenetic relationships` among species (their shared ancestry) and thus model how similar species are expected to be. In other words, it is a way of modeling the non-independence of species as data points (close relatives are expected to have more similar traits by common descent).

Here we can demonstrate this phenomenon by drawing values from a normal distribution for three different species with different trait means (`[0, 5, 10]`), but dictate that there is a correlation structure among them. Between the first species and the second species the correlation is high (covariance=0.75) while between the first and third species or the second and third it is low (covariance=0.15). A phylogenetic tree is drawn to show what this covariance structure would look like for three species.

As you can see in the first plot below, we generated a random distribution of points for each species over 150 replicates, where each replicate draws a mean trait value for each species. When we look at the data in one dimension it simply looks like three normal distributions of mean trait values drawn across many replicates, but when we compare the distributions in two dimensions we see there is a correlation structure: when the trait mean of species 0 is higher it is also higher in species 1. There is almost no correlation, however, between species 0 and 2 or species 1 and 2 trait means. 

In [621]:
# mean trait values 
mean = np.array([0, 5, 10])

# covariance structure (phylogeny) for three taxa
cov = np.array([
    [1.00, 0.75, 0.15],
    [0.75, 1.00, 0.15],
    [0.15, 0.15, 1.00],
    ])

# tree representation of same covariance structure
#
#     ----------+ 2
#     +
# -----
#     +     ----+ 1
#     ------+
#           ----+ 0
#

In [622]:
# draw values from a MVN (normal distribution with covariance structure)
arr = np.random.multivariate_normal(mean, cov, 150)

In [623]:
# plot in 1-dimension
canvas = toyplot.Canvas(height=200, width=400)
axes = canvas.cartesian(xlabel="trait value", ylabel="count")
m0 = axes.bars(np.histogram(arr[:, 0], bins=10));
m1 = axes.bars(np.histogram(arr[:, 1], bins=10));
m2 = axes.bars(np.histogram(arr[:, 2], bins=10));

In [624]:
# plot pairwise scatterplots
canvas = toyplot.Canvas(height=300, width=300)
axes = canvas.cartesian(xlabel="mean trait value", ylabel="mean trait value")
m0 = axes.scatterplot(arr[:, 0], arr[:, 1]);
m1 = axes.scatterplot(arr[:, 0], arr[:, 2]);
m2 = axes.scatterplot(arr[:, 1], arr[:, 2]);
canvas.legend([
    ("species 0 x 1", m0), 
    ("species 0 x 2", m1), 
    ("species 1 x 2", m2)],
    corner=('bottom-right', 50, 100, 50));
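
The correlation structure can also be checked numerically rather than by eye. The sketch below redraws from the same `mean` and `cov`, but with a larger sample than the 150 replicates above (10000, an arbitrary choice, as is the seed) so that the estimates are less noisy. Because every variance on the diagonal of `cov` is 1, the covariances and correlations are the same numbers here.

```python
import numpy as np

np.random.seed(42)
mean = np.array([0, 5, 10])
cov = np.array([
    [1.00, 0.75, 0.15],
    [0.75, 1.00, 0.15],
    [0.15, 0.15, 1.00],
])

# a larger sample, so the estimated correlations are less noisy
draws = np.random.multivariate_normal(mean, cov, 10000)

# rowvar=False treats each column (species) as a variable
corr = np.corrcoef(draws, rowvar=False)

print(corr.round(2))
print(draws.mean(axis=0).round(2))
```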

## Challenges (just to test yourself, not the assignment)

In [612]:
# sample ten random integers in the range 0-100


In [613]:
# sample ten random floats in the range 0-100


In [625]:
# sample 100 values from a normal distribution with mean 10 and stdev 2


In [626]:
# calculate and print the mean and std of the array generated above

In [627]:
# create a boolean mask of size 100 with 10 True values and 90 False values

In [628]:
# create a boolean mask where each element is randomly drawn True with p=0.5

In [None]:
# apply the boolean mask to an array of normally distributed values to subselect elements.