# Programming Assignment #2 --- Random Variables in SciPy

Now that we know the basics of the Python language and the three big libraries (NumPy, Pandas, and Matplotlib), we are ready to embark on our journey through statistical Python.

This notebook focuses on [SciPy](https://docs.scipy.org/doc/scipy/index.html), which is one of the main scientific computing libraries for Python. Beyond statistics, it provides functionality for many other areas of computational mathematics.

SciPy implements a [huge number](https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions) of probability distributions, including all the ones that we will study in our class. Through SciPy, we can access the mass and density functions of random variables, their cumulative distribution functions, and also their statistics like expectations, variances, and quantiles. Later, we will see that we may also "simulate" datasets from specific probability distributions using SciPy.

## Instructions

1. Read through all sections of the notebook, making sure to run each code cell as you go along.

2. Enter your code in the blank cells containing the comment `# ENTER YOUR CODE IN THIS CELL`. Do **not** delete this comment.

3. Do **not** delete or in any way alter existing cells, or add any cells of your own. If you _must_ add new code cells, please delete them before submitting your assignment.

4. This assignment is due **31 December, 1999 at 11:59pm**. See the submission instructions at the end of the notebook.

## Plotting PMFs

### Description

Let's begin by learning how to plot the probability mass functions of $\mathcal{B}in(n,p)$ random variables using SciPy. We must first make our imports. The `stats` module in SciPy contains the statistical functionality of the library. Let's import it under the alias `ss`:

In [None]:
import scipy.stats as ss

Binomial variables in SciPy are implemented as the `binom` class in `scipy.stats`. As you can see at the [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html), the implementation of a $\mathcal{B}in(n,p)$ random variable takes $n$ and $p$ as shape parameters called `n` and `p`, naturally. Let's instantiate a binomial random variable, called `X`, with `n=12` and `p=0.75`:

In [None]:
X = ss.binom(n=12, p=0.75)

We can access the mass function of `X` through the `pmf` method. Let me show you how to plot it.
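
Before plotting, it may help to see the mass function evaluated at a single value. The following is a minimal sketch, not part of the assignment: it compares `X.pmf(9)` with the binomial formula $\binom{n}{x}p^x(1-p)^{n-x}$. The names `scipy_value` and `formula_value` are my own, chosen only for illustration.

```python
from math import comb

import scipy.stats as ss

# Same frozen variable as above: X ~ Bin(12, 0.75).
n, p = 12, 0.75
X = ss.binom(n=n, p=p)

# Probability that X takes the value 9, straight from SciPy.
scipy_value = X.pmf(9)

# The same probability from the binomial formula C(n, x) p^x (1 - p)^(n - x).
formula_value = comb(n, 9) * p**9 * (1 - p)**(n - 9)

print(scipy_value, formula_value)
```

The two printed numbers should agree up to floating-point rounding. The `pmf` method also accepts a whole NumPy array of values at once, which is what we will rely on when plotting.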

First, because Python doesn't really have a decent built-in method for plotting probability histograms, I wrote my own and called it `prob_hist`. It is defined in the next cell. You do _not_ need to understand the code in this cell; just run it, and move on.

In [None]:
import matplotlib.pyplot as plt

def prob_hist(xvals, yvals, stemwidth=25, title='', xlabel='values',
              ylabel='probability', size=(-1, -1), vline='', hline='', ymax=''):

    _, stems, _ = plt.stem(xvals, yvals, basefmt=' ', markerfmt=' ')
    plt.setp(stems, 'linewidth', stemwidth)
    plt.gca().set_ylim(ymin=0)
    plt.gca().set_xticks(xvals)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)

    if ymax != '':
        plt.gca().set_ylim(ymax=ymax)

    if size != (-1,-1):
        fig = plt.gcf()
        fig.set_size_inches(size)

    if vline != '':
        plt.axvline(x=vline, color='red', linestyle='--')

    if hline != '':
        plt.axhline(y=hline, color='red', linestyle='--')

    plt.tight_layout()

We will plot the mass function of our binomial random variable `X` by passing a range of $x$-values into `prob_hist`, along with the probabilities.

Remember that the range of a $\mathcal{B}in(12,0.75)$ random variable includes the $x$-values

$$
x=0, 1, \ldots, 12.
$$

To implement this range in Python, it is easier to use `np.arange` than the `np.linspace` function we learned in the previous programming assignment, because this range consists of only _integer_ values.

Then, once we have our range of $x$-values, we toss them into the mass function of our random variable `X` by calling `pmf`, and then insert the results into `prob_hist` along with extra parameters setting the label on the $x$-axis and the size of the figure:

In [None]:
import numpy as np

x_vals = np.arange(13) # generate the x-values
probs = X.pmf(x_vals) # toss them into the pmf
prob_hist(x_vals, probs, xlabel=r'$x$', size=(8, 4)) # plot

Pay close attention to the call `np.arange(13)`. This produces a range of integer values in the half-open interval $[0,13)$, so we do, in fact, get all integers $0, 1, \ldots, 12$. If you want to include the left-hand endpoint of the interval explicitly in your call, you can write:

In [None]:
np.arange(0, 13)

See? It's the same thing as calling `np.arange(13)`. You can change the left-hand endpoint also:

In [None]:
np.arange(4, 13)

The things to remember are:

* `np.arange` always **includes** the left-hand endpoint, but **excludes** the right-hand one.

* The default value for the left-hand endpoint is `0`.

* Use `np.arange` when you want a range of _integer_ values. If you want a range of _fractional_ values, use `np.linspace` instead.
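
The points above can be seen side by side in a short sketch. The variable names `ints` and `fracs` are mine, chosen only for illustration.

```python
import numpy as np

# Integers 4, 5, ..., 12: the left endpoint is included, 13 is excluded.
ints = np.arange(4, 13)

# Five evenly spaced values from 0 to 1; linspace includes BOTH endpoints.
fracs = np.linspace(0, 1, 5)

print(ints)
print(fracs)
```

Note the different conventions: the last argument of `np.arange` is an excluded endpoint, while the third argument of `np.linspace` is the _number_ of points to generate (it defaults to 50 if omitted).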



### Problem 1 --- Geometric variables

SciPy's implementation of a $\mathcal{G}eo(p)$ random variable is through the class `geom` in `scipy.stats`. As you can see at the [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.geom.html), the implementation takes $p$ as a shape parameter called `p`.

In one line of code, define a geometric random variable called `X` with `p=0.25`:

In [None]:
# ENTER YOUR CODE IN THIS CELL



Remember, the range of a geometric random variable is infinite:

$$
x=1, 2, 3, \ldots.
$$

Therefore, we will not be able to plot the _entire_ mass function.

In the next code cell, generate [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-1.png?raw=true) plot of the PMF of `X` over the indicated range of $x$-values using `prob_hist` as I demonstrated above. Set the `size` parameter to `(8, 4)`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



### Problem 2 --- Hypergeometric variables

Hypergeometric variables are implemented in `scipy.stats` under the `hypergeom` class. According to the [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html), the implementation of a $\mathcal{HG}eo(M,n,N)$ variable takes $M$, $n$, and $N$ as shape parameters called `M`, `n`, and `N`.

In the next code cell, instantiate a $\mathcal{HG}eo(25, 10, 20)$ random variable `X` and produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-2.png?raw=true) plot of its PMF. Set `stemwidth=10` in your call to `prob_hist`.

In [None]:
# ENTER YOUR CODE IN THIS CELL



## Plotting PDFs

### Description

Plotting the density curves of continuous random variables is somewhat easier than plotting the mass functions of discrete variables since we do not need the custom function `prob_hist`. We can use the standard call to `plt.plot` that we learned in the first programming assignment.

Here's an example of the density curve of $X\sim \mathcal{N}(1, 2^2)$:

In [None]:
X = ss.norm(loc=1, scale=2)
x_vals = np.linspace(-6, 8)
plt.plot(x_vals, X.pdf(x_vals))
plt.xlabel(r'$x$')
plt.ylabel('probability density')
plt.title(r'$X \sim \mathcal{N}(1, 2^2)$')
plt.show()

Notice the following:

* SciPy's implementation of a $\mathcal{N}(\mu,\sigma^2)$ variable goes through the `norm` class in `scipy.stats`. It accepts $\mu$ and $\sigma$ as the parameters `loc` and `scale`.

* I switched from `np.arange` back to `np.linspace` because $X$ is continuous and we need to generate a range of _fractional_ $x$-values.

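* I added a title to the plot by calling `plt.title`. I passed the LaTeX fragment `$X \sim \mathcal{N}(1, 2^2)$` inside a Python string (with single quotes) as the parameter. Prefix such a string with `r` to make it a raw string, so that Python does not treat the backslashes in the LaTeX as escape characters.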
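
Since the first point is easy to get wrong, here is a small sanity check, offered as a sketch rather than as part of the assignment. It confirms that `scale` is the standard deviation $\sigma$ and not the variance $\sigma^2$. The names `peak` and `expected_peak` are my own.

```python
import numpy as np
import scipy.stats as ss

# N(1, 2^2): loc is the mean mu, scale is the standard deviation sigma.
X = ss.norm(loc=1, scale=2)

print(X.mean())  # the mean mu
print(X.std())   # the standard deviation sigma
print(X.var())   # the variance sigma^2

# The density at the mean should equal 1 / (sigma * sqrt(2 * pi)).
peak = X.pdf(1)
expected_peak = 1 / (2 * np.sqrt(2 * np.pi))
print(peak, expected_peak)
```

If you had passed `scale=4` by mistake, the variance reported by `X.var()` would be $16$ rather than $4$.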

### Problem 3 --- Exponential variables

On your own, discover how an $\mathcal{E}xp(\lambda)$ random variable is implemented in SciPy. Pay very, _very_ close attention to the parameter in the implementation!

Then, in the next code cell, produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-3.png?raw=true) plot of the density curve of an $\mathcal{E}xp(1/3)$ random variable called `X`. Be sure to notice the range of $x$-values, the labels on the $x$- and $y$-axes, and the title of the plot.

In [None]:
# ENTER YOUR CODE IN THIS CELL



### Problem 4 --- Beta variables

On your own (again), discover how a $\mathcal{B}eta(\alpha,\beta)$ random variable is implemented in SciPy. Then, in the next code cell, implement _three_ beta variables

$$
X\sim \mathcal{B}eta(2, 3), \quad Y \sim \mathcal{B}eta(3, 3), \quad Z \sim \mathcal{B}eta(4, 3)
$$

and produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-4.png?raw=true) plot of their PDFs. Notice the labels and legend!

In [None]:
# ENTER YOUR CODE IN THIS CELL



## Plotting CDFs

### Description

The cumulative distribution functions of random variables in SciPy are accessed via the `cdf` method.

Let's plot the CDF of a discrete binomial variable:

In [None]:
X = ss.binom(n=12, p=0.75)
x_vals = np.arange(13)
accum_probs = X.cdf(x_vals)
prob_hist(x_vals,
          accum_probs,
          xlabel='$x$',
          ylabel='accumulated probability',
          size=(8, 4)
)

Now, let's plot the CDF of a continuous standard normal variable:

In [None]:
X = ss.norm() # default parameters loc=0 and scale=1
x_vals = np.linspace(-4, 4)
plt.plot(x_vals, X.cdf(x_vals))
plt.xlabel('$x$')
plt.ylabel('accumulated probability')
plt.show()
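
To connect the `cdf` method back to the mass function, here is a minimal sketch for the binomial variable used above. It checks that a CDF value is a running sum of PMF values, and that the probability of an interval is a difference of CDF values. All variable names other than `X` are mine, chosen for illustration.

```python
import numpy as np
import scipy.stats as ss

X = ss.binom(n=12, p=0.75)

# P(X <= 9) from the CDF ...
from_cdf = X.cdf(9)
# ... and as a sum of the mass function over x = 0, 1, ..., 9.
from_pmf = X.pmf(np.arange(10)).sum()

# P(7 < X <= 9) as a difference of CDF values ...
interval_prob = X.cdf(9) - X.cdf(7)
# ... and as the sum of the masses at x = 8 and x = 9.
interval_sum = X.pmf(8) + X.pmf(9)

print(from_cdf, from_pmf)
print(interval_prob, interval_sum)
```

Each printed pair should agree up to floating-point rounding.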

### Problem 5 --- Gamma variables

In the next code cell, produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-5.png?raw=true) plot of the CDF of a $\Gamma(1, 4)$ variable.

In [None]:
# ENTER YOUR CODE IN THIS CELL



### Problem 6 --- Combined plots

In the next code cell, produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-6.png?raw=true) side-by-side plot of the PDF and CDF of a $\Gamma(1,4)$ variable. Here are some tips/directions:

* Set `figsize` to `(8, 4)` in your call to `plt.subplots`.

* Use `plt.suptitle()` to place the title on the entire plot.

* Use the `set_title` method on the axes objects to set the titles on the individual plots. (For example, call `axes[0].set_title()` for the left-hand plot.)

* Notice the labels on the $x$- and $y$-axes, and that the ranges of $x$-values are _different_ in the two plots.
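
If the calls in these tips are unfamiliar, the following sketch shows how they fit together. It deliberately uses a standard normal variable rather than the gamma variable, so the titles, labels, and ranges are my own choices and will _not_ match the target plot.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss

X = ss.norm()  # standard normal, chosen only for illustration
x_vals = np.linspace(-4, 4)

# One row, two columns; axes is an array holding the two axes objects.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# Left-hand plot: the density curve.
axes[0].plot(x_vals, X.pdf(x_vals))
axes[0].set_title('PDF')
axes[0].set_xlabel('$x$')
axes[0].set_ylabel('probability density')

# Right-hand plot: the cumulative distribution function.
axes[1].plot(x_vals, X.cdf(x_vals))
axes[1].set_title('CDF')
axes[1].set_xlabel('$x$')
axes[1].set_ylabel('accumulated probability')

# Title for the entire figure, placed above both plots.
plt.suptitle('Standard normal variable')
plt.tight_layout()
```

Notice that labels on individual axes objects are set with `set_xlabel` and `set_ylabel`, not with `plt.xlabel` and `plt.ylabel`.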

In [None]:
# ENTER YOUR CODE IN THIS CELL




## Expectations, variances and quantiles

### Description

In addition to plotting functionality, the `scipy.stats` module also provides functions and methods to compute various numerical quantities associated with random variables.

For example, suppose we wanted to compute $E(X)$ for $X\sim \mathcal{B}in(12, 0.75)$. Instead of instantiating the random variable `X` with these fixed parameters (this is called _freezing_ the distribution in SciPy), we can directly call `mean` and pass in the parameters:

In [None]:
ss.binom.mean(n=12, p=0.75)

However, if you're going to use the same random variable multiple times, it might be best to instantiate it directly and freeze it. For example, suppose that we wanted to compute both the variance and standard deviation of a variable $X\sim \mathcal{HG}eo(25, 10, 20)$. Here they are:

In [None]:
X = ss.hypergeom(M=25, n=10, N=20)
print(f'The variance is {X.var()}')
print(f'The standard deviation is {X.std()}')


Of course, the standard deviation can always be obtained from the variance by taking the square root:

In [None]:
np.sqrt(X.var())
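
The two calling styles above should always give the same answers. Here is a small sketch that checks this for a $\mathcal{HG}eo(25, 10, 20)$ variable. The names `direct_var` and `frozen_var` are mine, chosen for illustration.

```python
import numpy as np
import scipy.stats as ss

# Direct call: the shape parameters are passed alongside the method.
direct_var = ss.hypergeom.var(M=25, n=10, N=20)

# Frozen variable: the parameters are fixed once, then reused.
X = ss.hypergeom(M=25, n=10, N=20)
frozen_var = X.var()

print(direct_var, frozen_var)

# The standard deviation is the square root of the variance in either style.
print(X.std(), np.sqrt(frozen_var))
```

Freezing costs one extra line, but it saves you from retyping the parameters (and possibly mistyping them) on every call.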

We can also compute quantiles in SciPy through the `ppf` method. (SciPy calls the inverse CDF the "percent point function.") For example, suppose we wanted the $0.95$-quantile of a variable $X\sim \mathcal{N}(1,2^2)$. Then we would call:

In [None]:
ss.norm.ppf(0.95, loc=1, scale=2)

What about the median of a standard normal variable? (We already know the answer, of course.)

In [None]:
ss.norm.ppf(0.5) # default values loc=0 and scale=1
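
Because `ppf` is the inverse of `cdf`, feeding a quantile back into the CDF should return the probability you started with. The sketch below checks this on a frozen $\mathcal{N}(1,2^2)$ variable, and also checks the familiar relation $q = \mu + \sigma z$ between a normal quantile and the corresponding standard normal quantile. The names `q`, `round_trip`, and `z` are my own.

```python
import scipy.stats as ss

X = ss.norm(loc=1, scale=2)

# The 0.95-quantile of X, computed on the frozen variable.
q = X.ppf(0.95)

# Feeding q back into the CDF should return (approximately) 0.95.
round_trip = X.cdf(q)

# The 0.95-quantile of a standard normal variable.
z = ss.norm.ppf(0.95)

print(q, 1 + 2 * z)
print(round_trip)
```

The first line of output should show two matching numbers, and the second should be $0.95$ up to floating-point rounding.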

### Problem 7 --- Statistics of gamma variables

In one line of code, define a random variable $X \sim \Gamma(1,4)$ as `X`:

In [None]:
# ENTER YOUR CODE IN THIS CELL



Now compute its expected value:

In [None]:
# ENTER YOUR CODE IN THIS CELL



Compute its variance:

In [None]:
# ENTER YOUR CODE IN THIS CELL



Compute its standard deviation:

In [None]:
# ENTER YOUR CODE IN THIS CELL



By calling the `ppf` method directly on `X`, compute the $0.25$-quantile of $X$:

In [None]:
# ENTER YOUR CODE IN THIS CELL



### Problem 8 --- Plotting quantiles

The goal in this problem is to produce [this](https://github.com/jmyers7/stats-book-materials/blob/0ef3ab6affa90279aa4fd38b5edad78a2b7e0dcc/img/plot-2-7.png?raw=true) plot displaying the $0.75$-quantile of a random variable $X \sim \mathcal{N}(1, 2^2)$. You will not be working from scratch, however.

Instead, I will give you a template: In the following code cell, replace all the placeholder `None` values with appropriate code to produce the desired plot.

In [None]:
# ENTER YOUR CODE IN THIS CELL

# Define the random variable and the appropriate range of x-values.
X = None
x_vals = None

# Compute the 0.75-quantile and save it as the variable q.
q = None
x_fill_vals = np.linspace(-5, q)

plt.plot(x_vals, X.pdf(x_vals))
plt.axvline(x=q, color='r', linestyle='--')
plt.fill_between(x_fill_vals, X.pdf(x_fill_vals), alpha=0.3)

# Change the labels on the x- and y-axes, and also add the title.
plt.None
plt.None
plt.None

plt.tight_layout()

## Submission instructions

1. At the top of the window, click on "Runtime > Restart and run all" (assuming you're using Google Colab). It will ask whether you are _sure_ that you want to restart---click _yes_!

2. Now scroll through your notebook and check to make sure that all your answers have regenerated properly.

    * If something is amiss, correct the errors and then re-do step 1.

    * If everything looks OK, then do **not** change any of your code and proceed to step 3.

3. Click on "File > Save" to save a copy of "assignment-x.ipynb" somewhere on your local disk.

4. After saving your solutions in step 3, change the file name of "assignment-x.ipynb" by prepending your last name in all lowercase in this format: "last-name-assignment-x.ipynb". For example, I would rename my file as "myers-assignment-x.ipynb". If your last name needs a space, use a hyphen "-".

5. Upload the file "last-name-assignment-x.ipynb" at [this](submission) link **by 31 December, 1999 at 11:59pm**.

