## ACM SIGCHI Summer School on Computational Interaction  
### Inference, optimization and modeling for the engineering of interactive systems  
#### 27th August - 1st September 2018  
#### University of Cambridge, UK  


# Introduction to probabilistic inference

In [None]:
# standard imports
import numpy as np
import matplotlib.pyplot as plt
import sys, os, time
import pandas as pd
%matplotlib inline
import matplotlib as mpl
plt.rc('figure', figsize=(8.0, 4.0), dpi=140)
import scipy.stats


$$\newcommand{\vec}[1]{{\bf #1}} 
\newcommand{\real}{\mathbb{R}}
\newcommand{\expect}[1]{\mathbb{E}[#1]}
\DeclareMathOperator*{\argmin}{arg\,min}
\vec{x}
\real
$$


## Topic purpose
This section will cover probabilistic **inference**. Rather than learning a single set of parameters by optimisation, we can infer probability distributions over possible models that might be compatible with our data.  


Concretely, we'll use Monte Carlo sampling to make it simple and easy (if not very efficient) to work with probabilistic models. We will use these approaches to model **typing behaviour** at the keystroke level, and both make predictions given some data ("how likely is it that this sequence was typed by user X?") and quantify how much confidence we have in those models.



## Outline for section A: Inferring typing behaviour



### Why is this relevant for computational HCI?
We will build **statistical models** of user behavior, and estimate parameters of that model from quantitative observations of data. 

This is **robust** (it appropriately represents uncertainty) and **generative** (it can simulate behaviour compatible with observations).  

# Background
<a id="bayesian"> </a>
# Random variables and distributions
A **random variable** is a variable that can take on different values, but we do not know what value it has; i.e. one that is "unassigned". However, we have some knowledge which captures the possible states the variable could take on, and their corresponding probabilities. Probability theory allows us to manipulate random variables without having to assign them a specific value.

A random variable is written with a capital letter, like $X$.

A random variable might represent the outcome of a dice throw (discrete); whether or not it is raining outside (discrete: binary); the height of a person we haven't met yet (continuous); or the position of a user's hand (continuous, multivariate).

## Distributions
A **probability distribution** defines how likely different states of a random variable are. 

We can see $X$ as the *experiment* and $x$ as the *outcome*, with a function mapping every possible outcome to a probability. We write $P(x)$ as shorthand for $P(X=x)$ (note the case!).

$$P(X=x),\  \text{the probability of random variable X taking on value x}\\
P(X),\  \text{shorthand for the probability distribution of X over all values x }\\
P(x),\  \text{shorthand for probability of specific value X=x }\\
$$
We can see an outcome as a random variable taking on a specific value i.e. $P(X=x)$. Note that we use $P(A)$ to mean the probability of **event** $A$, not the random variable $A$.

### Discrete and continuous
Random variables can be continuous (e.g. the height of a person) or discrete (the value showing on the face of a dice). 

* **Discrete variables** The distribution of a discrete variable is described with a **probability mass function** (PMF) which assigns each outcome a specific probability; imagine a dictionary mapping outcomes to probabilities. The PMF is usually written $f_X(x)$, where $P(X=x) = f_X(x)$.

* **Continuous variables** A continuous variable has a **probability density function** (PDF) which specifies the spread of the probability as a *continuous function* $f_X(x)$. It is **not** the case that $P(X=x) = f_X(x)$ for PDFs.
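
The "dictionary" picture of a PMF can be made concrete. The following is a minimal sketch for a fair six-sided die; the name `die_pmf` is introduced here purely for illustration.

```python
from fractions import Fraction

# PMF of a fair six-sided die as a dictionary: outcome -> probability
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# P(X = 3) is a simple lookup
p_three = die_pmf[3]

# the probabilities of all outcomes must sum to exactly 1
total = sum(die_pmf.values())

print("P(X=3) =", p_three)
print("Sum of PMF =", total)
```

Exact fractions are used so that the sum is exactly 1; with floating point numbers the sum would only be 1 up to rounding error.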

##### Integration to unity
A probability mass function or probability density function *must* sum/integrate to exactly 1, as the random variable under consideration must take on *some* value. Every repetition of an experiment has exactly one outcome.

$$\sum_i f_X(x_i) = 1\quad \text{for PMFs of discrete RVs}$$
$$\int_x f_X(x)\ dx = 1\quad \text{for PDFs of continuous RVs}$$

---
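
Both normalisation conditions can be checked numerically. This is a small sketch, assuming a fair die for the discrete case and a standard normal for the continuous case; `scipy.integrate.quad` is used for the numerical integration.

```python
import numpy as np
import scipy.stats
from scipy.integrate import quad

# discrete: PMF of a fair six-sided die, one entry per outcome
die_pmf = np.full(6, 1.0 / 6.0)
pmf_total = die_pmf.sum()

# continuous: integrate the standard normal PDF over the whole real line
pdf_total, abs_error = quad(scipy.stats.norm.pdf, -np.inf, np.inf)

print("Sum of PMF:      %.6f" % pmf_total)
print("Integral of PDF: %.6f" % pdf_total)
```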

## PMF example: sum of dice rolls

In [None]:
# the PMF of the sum of two dice rolls
def two_dice():
    # form the sum of the cross product of these possibilities
    roll_two = [i+j for i in range(1,7) for j in range(1,7)]
    # now plot the histogram
    # density=True normalises the histogram (the old `normed` argument has been removed)
    pmf, edges, patches = plt.hist(roll_two, density=True, bins=range(1,14))
    print("Sum of PMF %.2f" % np.sum(pmf)) # sum of probability should be *exactly* 1.0
    plt.title("PMF of sum of 2d6 dice")
    plt.xlabel("Sum of rolls x")
    plt.ylabel("P(x)")

In [None]:
two_dice()

## Samples and sampling
**Samples** are observed outcomes of an experiment; we will use the term **observations** synonymously. We can **sample** from a distribution; this means simulating outcomes according to the probability distribution of those variables.


For example, we can sample from the sum of dice PMF by rolling two dice and summing the result. This is a sample or a draw from this distribution.


For discrete random variables, this is easy: we simply produce samples by drawing each outcome according to its probability. For continuous variables, we need to use specific algorithms to draw samples according to a distribution.


In [None]:
# draw samples from the PMF of the sum of two dice rolls
def sample_two_dice():
    roll_two = [i+j for i in range(1,7) for j in range(1,7)]
    pmf = np.histogram(roll_two, density=True, bins=range(1,14))[0]
    cmf  = np.cumsum(pmf) # cumulative sum of the amount of probability in each bin
    # map uniform random numbers through the cumulative function to get outcomes
    uniform_samples = np.random.uniform(0, 1, 200)
    discrete_samples = np.digitize(uniform_samples, cmf) + 1 # compensate for bin starting on 1, not 0

    plt.hist(roll_two, bins=range(1,14), facecolor='C0', density=True, alpha=0.2, label="True PMF")
    plt.hist(discrete_samples, bins=range(1,14), facecolor='none', edgecolor='C1', linewidth=2, density=True, label="Sampled histogram")
    plt.legend()
sample_two_dice()    
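
As an alternative sketch of the same idea, NumPy can draw directly from a discrete distribution if we pass the outcome probabilities explicitly. The seed and the sample count below are arbitrary choices made for illustration.

```python
import numpy as np

# outcomes and probabilities for the sum of two fair dice
sums = np.array([i + j for i in range(1, 7) for j in range(1, 7)])
values, counts = np.unique(sums, return_counts=True)
probs = counts / counts.sum()

# draw samples: each outcome is chosen with its PMF probability
rng = np.random.default_rng(2018)  # arbitrary seed, for repeatability
samples = rng.choice(values, size=10000, p=probs)

# the empirical frequencies approach the PMF as the sample count grows
empirical = np.array([(samples == v).mean() for v in values])
print(np.round(probs, 3))
print(np.round(empirical, 3))
```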

#### Probability density functions (for continuous random variables)
The PDF $f_X(x)$ of a random variable $X$ maps a value $x$ (which might be a real number, or a vector, or any other continuous value) to a single number, the density at that point. It is a function $\real^N \rightarrow \real^+$, where $\real^+$ is the non-negative real numbers.

* A PMF assigns each outcome a probability of at most 1, but the value of a PDF is *not* bounded by 1: $f_X(x)$ can exceed 1 at some points, *as long as the integral of the PDF is 1.*

The value of the PDF at any point is **not** a probability, because the probability of a continuous random variable taking on any specific number must be zero. 

Instead, we can say that the probability of a continuous random variable $X$ lying in a range $[a,b]$ is:
$$\begin{equation} P(X \in [a,b]) = P(a \leq X \leq b)  = \int_a^b f_X(x)\ dx \end{equation}$$
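
To illustrate, the probability of a standard normal variable lying in $[-1, 1]$ can be computed either by integrating the density or from the cumulative distribution function. This is a sketch; the interval and the narrow distribution at the end are arbitrary choices for illustration.

```python
import scipy.stats
from scipy.integrate import quad

X = scipy.stats.norm(0, 1)  # standard normal: mean 0, std. dev. 1
a, b = -1.0, 1.0

# integrate the density over [a, b] numerically
p_integral, abs_error = quad(X.pdf, a, b)

# equivalent: difference of the cumulative distribution function
p_cdf = X.cdf(b) - X.cdf(a)

print("P(a <= X <= b) by integration: %.4f" % p_integral)
print("P(a <= X <= b) by CDF:         %.4f" % p_cdf)

# a density value is not a probability and can exceed 1:
# a narrow normal distribution has a tall peak
narrow = scipy.stats.norm(0, 0.1)
print("Peak density for std. dev. 0.1: %.2f" % narrow.pdf(0))
```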

## PDF example: the normal distribution
The most ubiquitous of all continuous PDFs is the **normal** or **Gaussian** distribution. It assigns probabilities to real values $x \in {\mathbb{R}}$ (in other words, a sample space consisting of all of the real numbers). It has a density given by the PDF:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$


We use a shorthand notation to refer to the distribution of continuous random variables:
$$\begin{equation}X \sim \mathcal{N}(\mu, \sigma^2)\end{equation},$$ 
which is read as 
>"Random variable X is distributed as [N]ormal with mean $\mu$ and variance $\sigma^2$"
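
The density formula can be written out directly and compared against `scipy.stats.norm.pdf`. In this sketch `normal_pdf` is a helper name introduced for illustration, and the values of $\mu$ and $\sigma^2$ are arbitrary. Note that the notation above is parameterised by the *variance*, while `scipy.stats.norm` takes the *standard deviation* as its scale.

```python
import numpy as np
import scipy.stats

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, written directly from the formula."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-4, 4, 9)
mu, sigma2 = 1.0, 4.0

by_hand = normal_pdf(x, mu, sigma2)
# scipy expects the standard deviation (square root of the variance) as scale
by_scipy = scipy.stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))

print(np.allclose(by_hand, by_scipy))
```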

### Location and scale
The normal distribution places most density close to its center $\mu$ (the "mean"), with a spread defined by $\sigma^2$ (the "variance"). These can be thought of as the **location** and **scale** of the density function. Most standard continuous random variable PDFs have a location (where density is concentrated) and scale (how spread out the density is).

In [None]:
import scipy.stats as stats
# Plot the PDF of the normal distribution
def plot_normal():
    # plot the normal (Gaussian) distribution along with a set of points drawn from that distribution
    x = np.linspace(-4,4,100)
    y = stats.norm.pdf(x,0,1) # mean 0, std. dev. 1
    plt.plot(x,y, label="PDF")
    plt.axhline(0, color='k', linewidth=0.2) # axis line
 
    # mark the mean
    plt.text(0, 0.51, r'$\mu$') # raw strings avoid invalid escape sequences
    plt.axvline(0, color='r')
    # highlight one std. dev. to the right
    plt.axvspan(0,1, facecolor='b', alpha=0.1, label="1 std. dev.")
    plt.text(1.2, 0.3, r'$\sigma$')
    # take 1000 random samples and scatter plot them
    samples = stats.norm.rvs(0,1,1000)
    plt.scatter(samples, np.full(samples.shape, .2), s=448, c='b', alpha=0.1, marker='|', label="Samples")
    plt.xlabel("$x$")
    plt.ylabel("$f_X(x)$") # the vertical axis is a density, not a probability
    plt.legend()

In [None]:
plot_normal()

## Joint, conditional, marginal

The **joint probability** of two random variables is written $$P(X,Y)$$ and gives the probability that $X$ and $Y$ take the specific values *simultaneously* (i.e. $P(X=x \land Y=y)$). 


The **marginal probability** is the derivation of $P(X)$ from $P(X,Y)$ by integrating (summing) over all the possible outcomes of $Y$:
$$P(X) = \int_y P(X,Y=y) dy\  \text{for a PDF.}$$
$$P(X) = \sum_y P(X,Y=y)\  \text{for a PMF.}$$


**Marginalisation** just means integration over one or more variables from a joint distribution: it *removes* those variables from the distribution.
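
For a discrete joint distribution stored as a table, marginalisation is just a sum along one axis. The joint PMF below uses made-up numbers, chosen only for illustration.

```python
import numpy as np

# hypothetical joint PMF P(X, Y); rows index X, columns index Y
# X: weather (0 = dry, 1 = raining), Y: umbrella (0 = no, 1 = yes)
joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])

# marginalise out Y: sum over the Y axis (columns) to get P(X)
p_x = joint.sum(axis=1)
# marginalise out X: sum over the X axis (rows) to get P(Y)
p_y = joint.sum(axis=0)

print("P(X) =", p_x)
print("P(Y) =", p_y)
```

Each marginal is itself a valid PMF, so it sums to 1.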

Two random variables are **independent** if they do not have any dependence on each other. If this is the case then the joint distribution is just the product of the individual distributions:
$P(X,Y) = P(X)P(Y).$ This is not true in the general case where the variables have dependence.

The **conditional probability** of $X$ *given* $Y$ is written as $$P(X|Y)$$ and can be computed as $$\begin{equation} P(X|Y) = \frac{P(X,Y)}{P(Y)}. \end{equation}$$ This tells us how likely $X$ is to occur *if we already know*  (or fix) the value of $Y$.
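
The conditional distribution and the independence test can both be computed from a joint table. This sketch uses a small hypothetical joint PMF with made-up numbers.

```python
import numpy as np

# hypothetical joint PMF P(X, Y); rows index X, columns index Y
joint = np.array([[0.5, 0.1],
                  [0.1, 0.3]])
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# P(X | Y): divide each column of the joint by P(Y) for that column
p_x_given_y = joint / p_y[np.newaxis, :]

# each conditional distribution P(X | Y=y) sums to 1 over X
print(p_x_given_y)
print(p_x_given_y.sum(axis=0))

# independence check: is the joint equal to the product of the marginals?
independent = np.allclose(joint, np.outer(p_x, p_y))
print("Independent:", independent)
```

Here the joint is not the product of its marginals, so knowing $Y$ changes what we believe about $X$.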


In [None]:
def joint_marginal(cov):
    # evaluate a 2D normal density with covariance `cov` on a regular grid
    x,y = np.meshgrid(np.linspace(-3,3,50), np.linspace(-3,3,50))
    pos = np.empty(x.shape + (2,))
    pos[:,:,0] = x
    pos[:,:,1] = y
    joint_pdf = scipy.stats.multivariate_normal.pdf(pos, [0,0], cov)
    fig = plt.figure()
    # plot the joint
    ax = fig.add_subplot(2,2,1)
    ax.axis('equal')
    plt.title("Joint p(x,y)")
    ax.pcolor(x,y,joint_pdf, cmap='viridis')
    # plot the marginals
    ax = fig.add_subplot(2,2,3)
    ax.axis('equal')
    plt.title(r"Marginal $P(x) = \int\  P(x,y) dy$")
    # sum over the y axis (unnormalised: the grid spacing is omitted)
    ax.plot(x[0,:], np.sum(joint_pdf, axis=0))
    ax = fig.add_subplot(2,2,2)
    ax.axis('equal')
    plt.title(r"Marginal $P(y) = \int\  P(x,y) dx$")
    ax.plot(np.sum(joint_pdf, axis=1), y[:,0])
    # plot p(x|y): divide each row of the joint (fixed y) by the marginal P(y)
    ax = fig.add_subplot(2,2,4)
    ax.axis('equal')
    plt.title("Conditional $P(x|y) = \\frac{P(x,y)}{P(y)}$")
    marginal_y = np.sum(joint_pdf, axis=1, keepdims=True)
    ax.pcolor(x,y,joint_pdf/marginal_y, cmap='viridis')
# a covariance matrix must be symmetric
joint_marginal([[1,0.5],[0.5,1]])

## Probability theory and Bayesian inference

#### Probability as a calculus of belief
*Bayesians* treat probability as a **calculus of belief**; in this model of thought, probabilities are measures of degrees of belief. $P(A)=0$ means a belief that $A$ cannot be true and $P(A)=1$ is a belief that $A$ is absolutely certain.


#### Probability as the optimal way of representing uncertainty
Other representations of uncertainty are strictly inferior to probabilistic methods *in the sense that* a person, agent, or computer placing "bets" on future events using probabilistic models has the best possible return out of all decision systems when there is uncertainty. 

*Bayesians* allow for belief in states to be combined and manipulated via the rules of probability. The key process in Bayesian logic is *updating of beliefs*. Given some *prior* belief (it's Glasgow, it's not likely to be sunny) and some new evidence (there seems to be a bright reflection inside) we can update our belief to calculate the *posterior* -- our new probability that it is sunny outside. Bayesian inference requires that we accept priors over events, i.e. that we must explicitly quantify our assumptions with probability distributions. 

#### Prior, likelihood, posterior, evidence

We often want to know the probability of some outcome $A$ given some other outcome $B$; that is $P(A|B)$. But we are often in the situation that we can only compute $P(B|A)$. 

In general $P(A|B) \neq P(B|A)$, and the two expressions can be completely different. 

Typically, this type of problem occurs where we:
* want to know the probability of some event given some *evidence* 
* but we only know the probability of the evidence given the event 

**Bayes' rule** gives a consistent way to invert the probability distribution:
$$ \begin{equation} P(A|B) = \frac{P(B|A) P(A)}{P(B)} \end{equation}$$

This follows directly from the axioms of probability. Bayes' Rule is a very important rule, and has some surprising results.

* $P(A|B)$ is called the **posterior** -- what we want to know, or will know after the computation
* $P(B|A)$ is called the **likelihood** -- how likely the event $A$ is to produce the evidence we see
* $P(A)$ is the **prior**  -- how likely the event $A$ is regardless of evidence
* $P(B)$ is the **evidence** -- how likely the evidence $B$ is regardless of the event.

Bayes' rule gives a consistent rule to take some prior belief and combine it with observed data to estimate a new distribution which combines them.

We often phrase this as some **hypothesis** $H$ we want to know, given some **data** $D$ we observe, and we write Bayes' Rule as:
$$ \begin{equation}P(H|D) = \frac{P(D|H) P(H)}{P(D)} \end{equation}$$

In words: (the probability of the hypothesis given the data) is equal to (the probability of the data given the hypothesis) times (the probability of the hypothesis) divided by (the probability of the data).

In other words, if we want to work out how likely a hypothesis is to be true given observations, but we only know how likely we are to have seen those observations if that hypothesis *was* true, we can use Bayes' rule to solve the problem.
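
As a worked sketch, take the hypothesis $H$ to be "it is sunny" and the data $D$ to be "there is a bright reflection inside". All of the numbers below are hypothetical and chosen only for illustration. The evidence $P(D)$ is obtained by marginalising over the two cases: $H$ true and $H$ false.

```python
# hypothetical numbers, chosen only for illustration
p_sunny = 0.2                      # prior P(H): it is sunny
p_reflection_given_sunny = 0.9     # likelihood P(D|H)
p_reflection_given_cloudy = 0.3    # likelihood P(D|not H)

# evidence P(D): total probability of seeing the reflection
p_reflection = (p_reflection_given_sunny * p_sunny
                + p_reflection_given_cloudy * (1 - p_sunny))

# posterior P(H|D) by Bayes' rule
p_sunny_given_reflection = p_reflection_given_sunny * p_sunny / p_reflection

print("P(D)   = %.3f" % p_reflection)
print("P(H|D) = %.3f" % p_sunny_given_reflection)
```

With these numbers, the observation raises the belief that it is sunny from the prior of 0.2 to a posterior of about 0.43.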

## Bayes' rule for combining evidence
Bayes' rule is the correct way to combine prior belief and observation to update beliefs. This can be used to "learn", where "learning" means updating a probability distribution based on observations. 

It has a wide range of applications wherever uncertain information must be fused together, whether from multiple sources (e.g. sensor fusion) or over time (e.g. probabilistic filtering). 

In [None]:
import time
import scipy.stats

def prior_posterior(prior_mean=0, prior_std=1, sonar_std=1, n=10, anim=False):
    mean = prior_mean
    std = prior_std
    var = std*std
    prior = scipy.stats.norm(mean,std)
    evidence = scipy.stats.norm(1, 0.25)
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    
    xs = np.linspace(-5,5,200)
    ax.fill_between(xs, prior.pdf(xs), label="Prior belief", alpha=0.3)
    ax.fill_between(xs, evidence.pdf(xs), label="True generating PDF", alpha=0.3)
    
    sample_var = sonar_std**2 # the *expected* variance of our observations
    # note that changing this allows us to continuously adjust our belief
    # in our observations 
    ax.plot([0,0],[0,-0.1], 'c', alpha=0.7, label="Evidence")
    ax.plot([0,0],[0,-0.1], 'k:', alpha=0.7, label="Posterior belief")
    ax.set_title("Recursive Bayesian estimation")
    
    ax.set_xlabel("x")
    ax.set_ylabel("PDF $f_X(x)$")
    ax.axvline(1.0, label='True')
    ax.legend()
    for i in range(n):
        
        sample = evidence.rvs()
        # single step update for a normal distribution    
        mean = (var * sample + sample_var * mean) / (sample_var + var)
        var = (var*sample_var) / (sample_var+var)     
        
        sample_pdf = scipy.stats.norm(sample, sonar_std).pdf
        
        # plot the sample and the resulting pdf
        ax.plot([sample,sample],[0,-0.5], 'c', alpha=0.7)
        if anim:
            ax.plot(xs,-sample_pdf(xs), 'c', alpha=0.25)
        ax.plot(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), 'k:', alpha=0.25)
        if anim:            
            time.sleep(1.0)
            fig.canvas.draw()
        
        
    ax.fill_between(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), color='g', label="Final posterior", alpha=0.2)
    ax.legend()
    

In [None]:
   
prior_posterior(0,0.75)

In [None]:
   
prior_posterior(0,3)

In [None]:
   
prior_posterior(-3,0.5)

In [None]:
   
prior_posterior(-3,0.5, n=100)
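
The single step update used in `prior_posterior` can be isolated and checked. The sketch below assumes a normal prior and normally distributed observations with known noise variance; `normal_update` is a helper name introduced for illustration. It shows that folding in observations one at a time gives the same posterior as combining them all in one batch.

```python
import numpy as np

def normal_update(mean, var, sample, sample_var):
    """One Bayesian update of a normal belief N(mean, var), given an
    observation with known noise variance sample_var."""
    new_mean = (var * sample + sample_var * mean) / (sample_var + var)
    new_var = (var * sample_var) / (sample_var + var)
    return new_mean, new_var

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
samples = rng.normal(1.0, 0.25, size=20)
prior_mean, prior_var, sample_var = -3.0, 0.5 ** 2, 1.0

# sequential: fold in one observation at a time
mean, var = prior_mean, prior_var
for s in samples:
    mean, var = normal_update(mean, var, s, sample_var)

# batch: combine precisions (inverse variances) in a single step
n = len(samples)
batch_var = 1.0 / (1.0 / prior_var + n / sample_var)
batch_mean = batch_var * (prior_mean / prior_var + samples.sum() / sample_var)

print("sequential: mean %.4f, var %.4f" % (mean, var))
print("batch:      mean %.4f, var %.4f" % (batch_mean, batch_var))
```

The posterior variance shrinks with every observation regardless of the observed values, which is why a confident but wrong prior takes many samples to overcome.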

## Side note: Log probabilities

The probability of multiple **independent** random variables taking on a set of values can be computed from the product:
$$P(x,y,z) = P(x)P(y)P(z)$$
and in general
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i)$$

We often have to compute such products, but multiplying many values $<1$ quickly underflows floating point precision. Instead, we often prefer to manipulate *log probabilities*, which can be summed instead of multiplied:
$$\log P(x_1, \dots, x_n) = \sum_{i=1}^{n} \log P(x_i)$$

This is simply a numerical convenience that avoids underflow. The **log-likelihood** is just $\log P(B|A)$, and is often more convenient to work with than the raw likelihood.
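
A short sketch of the underflow problem: the number of events and their probability are arbitrary values chosen to make the effect visible in double precision.

```python
import numpy as np

# 2000 independent events, each with probability 0.1
probs = np.full(2000, 0.1)

# the direct product underflows to zero in double precision
direct = np.prod(probs)

# the sum of log probabilities is easily representable
log_prob = np.sum(np.log(probs))

print("Direct product:", direct)
print("Log probability: %.2f" % log_prob)
```

The true product is $10^{-2000}$, far below the smallest positive double, while its logarithm is an unremarkable number of around $-4605$.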