# Discrete Probability

**Topics:** Focusing on only one variable:
* Random variables and distributions
  * Definitions and notation: sample space, event, axioms of probability (probability is non-negative, and probability of sample space is 1).
* Familiarity with common distributions and their properties – which is useful for what? (Categorical, Uniform, Gaussian, Bernoulli, Beta, Dirichlet)
* Coding: jax and matplotlib 
  * Sampling from numpyro distributions
  * Visualizing them

**Outline (Part 1): Discrete Probability**
1. Introduce 3 tasks from mental health data
2. Motivation: Why use probability?
3. Introduce terminology and notation in the context of the task: random variable, outcome, sample space (all possible outcomes), event (set of at least one outcome), probability of event (PMF)
4. Introduce properties of discrete probability: p(sample space) = 1, 0 <= p(event) <= 1.
5. Introduce i.i.d sampling (to simulate many independent observations of the same phenomenon)
6. Re-introduce task in terms of notation
7. In-class exercise 1: students look up common distributions on Wikipedia (listed above), and see if they can match them to one of the tasks introduced earlier. Instruct students to first match distributions to tasks based on obvious properties (support of distribution), and only then based on shape (asking them to relate shape to the problem).
8. In-class exercise 2: students use Jax/NumPyro to plot samples from the distributions and compare that against the empirical distributions from the data. Iteratively, they search for parameters that make the two distributions similar.
9. Point out to students that for discrete distributions, probabilities can be estimated via counting (how many / total).

## Motivation

**Context:** As your first assignment on IHH's ML team, you've been tasked with better understanding their Emergency Room (ER). Since you're new, you'd first like to understand how the ER works at a high level. Specifically, you'd like to answer the following questions:

* Q1: How many beings come to the ER every day?
* Q2: Overall, what conditions do the beings come to the ER for? (e.g. inflamed antenna, fever, etc.)
* Q3: How many beings remain hospitalized overnight?

**Challenge:** None of these questions can be answered with a single number (e.g. the number of beings that come to the ER changes from day to day). So how can we give a deterministic answer to a question whose answer is inherently variable or *stochastic*? Answer: *probability distributions*. Probability distributions are the basic building blocks that we will use to build complex ML systems.

**Outline:** 
1. Introduce and practice the concepts, terminology, and notation behind discrete probability distributions (continuous distributions will be covered next).
2. Answer the above questions using this new toolset.
3. Start to gain familiarity with two important Python libraries we will use throughout the semester: `Jax` and `NumPyro`.

## Terminology and Notation for Discrete Probability

In the spirit of all Computer Science classes, if we want the help of a computer to solve a problem, we need a *language* to precisely specify what we want it to do. Today, we will introduce the language (terminology and notation) from statistics, and we will then translate it into code that a computer can run.

The terminology that we introduce here differs slightly from what you may have seen in a statistics class, because we're homing in on the minimal subset of terminology we need to describe a probabilistic ML model.

**Random Variable (RV):** A variable whose possible values are outcomes of a random phenomenon.
> Example: Let $N$ be an RV describing the number of beings that come into the ER on a given day.

**Sample Space or Support:** The set of all possible values that an RV can take on. For discrete probability, this set must be countable (though this is not important for now).
> Example: The sample space for $N$ is the set $S = \{0, 1, 2, \dots\}$, since any non-negative whole number of beings can (theoretically speaking) come to the ER.

**Probability Mass Function (PMF):** A function mapping the outcome of an RV to the probability with which it occurs. We can write the PMF as a mapping from the sample space to a number on the unit interval: $p: S \rightarrow [0, 1]$.
> Example: Let $p_N(\cdot)$ denote the PMF of $N$, where the dot represents an argument we have not specified. We denote the probability that $N$ takes on a specific value $n$ as follows: $p_N(n)$. If we were told that $p_N(5) = 0.1$, this means that the probability that exactly 5 beings came to the ER is 0.1 (or 10%).

PMFs have one notable property: the probabilities of all outcomes in the sample space must sum to 1.
> Example: Continuing with the above example, we have that $\sum\limits_{n=0}^{\infty} p_N(n) = 1$.
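
To make the PMF notation concrete, here is a minimal sketch that stores a PMF as a lookup table and checks the sum-to-1 property. The probabilities below are made up for illustration (they are not IHH data), and the support is truncated to $\{0, \dots, 9\}$ so that the table is finite; both are assumptions of this sketch.

```python
import jax.numpy as jnp

# Hypothetical PMF for N over the truncated support {0, 1, ..., 9}.
# The numbers are invented for illustration; entry n holds p_N(n).
support = jnp.arange(10)
pmf = jnp.array([0.02, 0.05, 0.10, 0.15, 0.20, 0.10, 0.15, 0.10, 0.08, 0.05])


def p_N(n):
    """Evaluate the PMF at a single outcome n in the support."""
    return pmf[n]


prob_five = float(p_N(5))  # probability that exactly 5 beings come to the ER
total = float(pmf.sum())   # should equal 1 up to floating-point error

print(f"p_N(5) = {prob_five:.2f}")
print(f"sum over the support = {total:.2f}")
```

Here `p_N(5)` plays the role of $p_N(5)$ in the example above, and `pmf.sum()` plays the role of the sum over the sample space.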

**Independent, Identically Distributed (i.i.d):** We say a variable is i.i.d if different observations of the same phenomenon are independent (i.e. they do not affect one another) and if they follow the same PMF. For example, when we flip a coin, previous flips do not affect future flips (i.e. if the coin landed heads, it does not affect its probability of landing heads next), and every time we flip the coin, the probability of it landing heads is the same (so it's identically distributed).
> Example: The notation $N \sim p_N(\cdot)$ signifies that the number of beings coming to the ER is distributed according to $p_N(\cdot)$, and that $N$ is sampled i.i.d.

**Summary of Notation:** 
* Let $R$ denote an RV.
* We denote the PMF of $R$ using $p_R(\cdot)$. Note that $p_R(\cdot)$ is a function that maps possible values $R$ can take on to probabilities (between 0 and 1).
* We call $p_R(r)$ the evaluation of the function at $r$: i.e. what's the probability that $R$ equals the specific value $r$?
* We write $R \sim p_R(\cdot)$ to denote that $R$ is sampled i.i.d from $p_R(\cdot)$.

## Matching Distribution to Scenario

**Exercise 1:** Browse the Wikipedia pages for the following distributions, and determine which one best fits each of Q1-Q3.
* [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution)
* [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution)
* [Binomial](https://en.wikipedia.org/wiki/Binomial_distribution)
* [Geometric](https://en.wikipedia.org/wiki/Geometric_distribution)
* [Poisson](https://en.wikipedia.org/wiki/Poisson_distribution)

Hint: On each Wikipedia page, there's a panel on the right side that summarizes the properties of the distribution (including its support), and provides example plots. 

**Exercise 2:** Once you've decided which distribution best fits each of the questions above, let's see if we can match them to data from IHH. To do this, you will need two things: 

* A function to visualize the data
* A function to visualize the PMF of the distribution. 

When you have both, you can plot the distribution of the observed data against the PMF of the distribution to see how well they align. Note that each distribution has *parameters* that control its shape -- you will have to play with these to find settings that best match.

```python
import jax
import numpyro
import numpyro.distributions as D
```

## Getting Comfortable with Notation