# ECE 493 - Probabilistic Reasoning and Decision Making

## Review of Probability Theory <span name="S01"></span>

- *See [Stanford CS228 - Probability Review](https://ermongroup.github.io/cs228-notes/preliminaries/probabilityreview/).*

## Representation <span name="S02"></span>

- **Problem**: *How do we express a probability distribution $p(x_{1}, x_{2}, ..., x_{n})$ that models some real-world phenomenon?*
    - *Naive Complexity*: $O(d^{n})$
- **Solution**: *Representation with Probabilistic Graphical Models + Verifying Independence Assumptions*

### Bayesian Networks (Directed Probabilistic Graphical Model) <span name="S02-1"></span>

#### Definition - What is a Bayesian network?

- A **Bayesian network** is a directed graph $G$ with the following:
    - *Nodes*: A random variable $x_{i}$.
    - *Edges*: A conditional probability distribution (CPD) $p(x_{i} \mid x_{A_{i}})$ per node, specifying the probability of $x_i$ conditioned on its parents' values.

#### Representation - How does a Bayesian network express a probability distribution?

1. Let $p$ be a probability distribution.
2. A naive representation of $p$ can be derived using the chain rule:
$$p(x_{1}, x_{2}, ..., x_{n}) = p(x_{1}) p(x_{2} \mid x_{1}) \cdots p(x_{n} \mid x_{n - 1}, ..., x_{2}, x_{1})$$
3. A Bayesian network representation of $p$ compacts the naive representation by having each factor in the right hand side depend only on a small number of **ancestor variables** $x_{A_{i}}$:
$$p(x_{i} \mid x_{i - 1}, ..., x_{2}, x_{1}) = p(x_{i} \mid x_{A_{i}})$$
    - e.g., Approximate $p(x_{5} \mid x_{4}, x_{3}, x_{2}, x_{1})$ with $p(x_{5} \mid x_{A_{5}})$ where $x_{A_{5}} = \{x_{4}, x_{3}\}$.

#### Space Complexity - How compact is a Bayesian network?

- Consider each of the factors $p(x_{i} \mid x_{A_{i}})$ as a **probability table**:
    - *Rows*: Values of $x_{i}$
    - *Columns*: Values of $x_{A_{i}}$
    - *Cells*: Values of $p(x_{i} \mid x_{A_{i}})$
- If each discrete random variable takes $d$ possible values and has at most $k$ ancestors, then each probability table has at most $O(d^{k + 1})$ entries.
- **Naive Representation Space Complexity**: $O(d^n)$
- **Bayesian Networks Representation Space Complexity**: $O(n \cdot d^{k + 1})$
$$\approx \text{Bayesian Networks Representation} \le \text{Naive Representation}$$
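
As a rough sanity check of the two bounds above, the sketch below counts table entries for both representations. The function names and the example values ($n = 20$, $d = 2$, $k = 3$) are illustrative assumptions, not part of the course material.

```python
def naive_table_size(n: int, d: int) -> int:
    """Entries in one full joint table over n variables with d values each."""
    return d ** n


def bayes_net_table_size(n: int, d: int, k: int) -> int:
    """Upper bound: n tables, each over one variable and at most k conditioning variables."""
    return n * d ** (k + 1)


# Hypothetical network size, chosen only for illustration.
print(naive_table_size(20, 2))         # 1048576
print(bayes_net_table_size(20, 2, 3))  # 320
```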

#### Independence Assumptions - Why are the independence assumptions of a Bayesian network important to identify?

- A Bayesian network expresses a probability distribution $p$ via products of smaller, local conditional probability distributions (one for each variable).
- These smaller, local conditional probability distributions introduce assumptions into the model of $p$ that certain variables are independent.
- **Important Note**: Which independence assumptions are we exactly making by using a Bayesian network?
    - *Correctness: Are these independence assumptions correct?*
    - *Efficiency: Do these independence assumptions efficiently compact the representation?*

#### $3$-Variable Independencies in Directed Graphs - How do you identify independent variables in a $3$-variable Bayesian network?

- Let $x \perp y$ indicate that variables $x$ and $y$ are independent.
- Let $G$ be a Bayesian network with three nodes: $A$, $B$, and $C$.

##### Common Parent

- If $G$ is of the form $A \leftarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $A$ and $C$; once it is observed, there is nothing else that affects $A$'s and $C$'s outcomes.

##### Cascade

- If $G$ equals $A \rightarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $C$; once it is observed, there is nothing else that affects $C$'s outcomes.

##### V-Structure

- If $G$ is $A \rightarrow C \leftarrow B$, then knowing $C$ couples $A$ and $B$.
    - If $C$ is unobserved, then $A \perp B$
    - If $C$ is observed, then $A \not\perp B \mid C$

#### $n$-Variable Independencies in Directed Graphs - How do you identify independent variables in an $n$-variable Bayesian network?

- Let $I(p)$ be the set of all independencies that hold for a probability distribution $p$.
- Let $I(G) = \{(X \perp Y \mid Z) : X, Y \text{ are } d\text{-sep given } Z\}$ be the set of independence statements implied by $d$-separation in $G$.
- If the probability distribution $p$ factorizes over $G$, then $I(G) \subseteq I(p)$ and $G$ is an $I$-map (**independence map**) for $p$.
- **Important Note 1**: Thus, variables that are $d$-separated in $G$ are independent in $p$.
- **Important Note 2**: However, a probability distribution $q$ can factorize over $G$, yet have independencies that are not captured in $G$.
- **Important Caveat**: A Bayesian network cannot perfectly represent all probability distributions.

##### $d$-separation (a.k.a. Directed Separation)

- $Q$ and $W$ are **$d$-separated** when variables $O$ are observed if they are **NOT CONNECTED** by an active path.

##### Active Path

- An undirected path in the Bayesian Network structure $G$ is called **active** given observed variables $O$ if for **EVERY CONSECUTIVE TRIPLE** of variables $X$, $Y$, $Z$ on the path, one of the following holds:
    - **Evidential Trail**: $X \leftarrow Y \leftarrow Z$, and $Y$ is unobserved $Y \not\in O$
    - **Causal Trail**: $X \rightarrow Y \rightarrow Z$, and $Y$ is unobserved $Y \not\in O$
    - **Common Cause**: $X \leftarrow Y \rightarrow Z$, and $Y$ is unobserved $Y \not\in O$
    - **Common Effect**: $X \rightarrow Y \leftarrow Z$, and $Y$ or any of its descendants are observed
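
The two definitions above can be sketched as a brute-force check that enumerates simple undirected paths and tests every consecutive triple. This is a minimal sketch for small graphs only; the graph encoding (a dict from node to its set of children) and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # assumed encoding: node -> set of its children


def parents_of(graph: Graph, node: str) -> Set[str]:
    return {p for p, children in graph.items() if node in children}


def descendants_of(graph: Graph, node: str) -> Set[str]:
    seen: Set[str] = set()
    stack = list(graph.get(node, set()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph.get(cur, set()))
    return seen


def triple_is_active(graph: Graph, x: str, y: str, z: str, observed: Set[str]) -> bool:
    x_into_y = y in graph.get(x, set())  # edge x -> y
    z_into_y = y in graph.get(z, set())  # edge z -> y
    if x_into_y and z_into_y:
        # Common effect: active only if y or one of its descendants is observed.
        return y in observed or bool(descendants_of(graph, y) & observed)
    # Evidential trail, causal trail, or common cause: active only if y is unobserved.
    return y not in observed


def d_separated(graph: Graph, start: str, end: str, observed: Set[str]) -> bool:
    nodes = set(graph) | {c for cs in graph.values() for c in cs}
    neighbors = {n: set(graph.get(n, set())) | parents_of(graph, n) for n in nodes}

    def has_active_path(path: List[str]) -> bool:
        last = path[-1]
        if last == end:
            return True
        for nxt in neighbors[last]:
            if nxt in path:
                continue  # only simple paths are explored
            if len(path) >= 2 and not triple_is_active(graph, path[-2], last, nxt, observed):
                continue  # this consecutive triple blocks the path
            if has_active_path(path + [nxt]):
                return True
        return False

    return not has_active_path([start])
```

For example, with the hypothetical graph `{"A": {"C"}, "B": {"C"}}`, `d_separated(graph, "A", "B", set())` is `True`, while observing `"C"` makes it `False`. Path enumeration is exponential in the worst case, so this is only meant to mirror the definition.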

#### Equivalence - When are two Bayesian networks $I$-equivalent?

- $G_1$ and $G_2$ are **$I$-equivalent**...
    - If they encode the same independencies: $I(G_1) = I(G_2)$.
    - If they have the same skeleton and the same v-structures.
    - If the $d$-separation between variables is the same.
    
##### Skeleton

![Skeleton](images/BN_1.png)

- A **skeleton** is an undirected graph obtained by dropping the directionality of the arrows.
    - (a) is Cascade
    - (b) is Cascade
    - (c) is Common Parent
    - (d) is V-Structure
    - (a), (b), (c), and (d) have the same skeleton.
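
The skeleton and v-structure criterion can be checked mechanically. The sketch below assumes a graph is given as a set of `(parent, child)` edges and that a v-structure only counts when its two parents are not adjacent; the function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Set, Tuple

Edge = Tuple[str, str]  # assumed encoding: (parent, child)


def skeleton(edges: Set[Edge]) -> Set[FrozenSet[str]]:
    """Drop edge directions."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges: Set[Edge]) -> Set[Tuple[FrozenSet[str], str]]:
    """Collect (parent pair, child) for every A -> C <- B with A, B not adjacent."""
    skel = skeleton(edges)
    found = set()
    for child in {c for _, c in edges}:
        parents = sorted(p for p, c in edges if c == child)
        for a, b in combinations(parents, 2):
            if frozenset((a, b)) not in skel:
                found.add((frozenset((a, b)), child))
    return found


def i_equivalent(g1: Set[Edge], g2: Set[Edge]) -> bool:
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)


cascade = {("A", "B"), ("B", "C")}        # A -> B -> C
common_parent = {("B", "A"), ("B", "C")}  # A <- B -> C
v_structure = {("A", "B"), ("C", "B")}    # A -> B <- C

print(i_equivalent(cascade, common_parent))  # True
print(i_equivalent(cascade, v_structure))    # False
```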

#### Example Problem 1 - $d$-separation

![Problem 1 - $d$-separation](images/BN_P1.png)

##### Question

- Are $X_{1}$ and $X_{6}$ $d$-separated given $\{X_{2}, X_{3}\}$?

##### Solution

1. **Path**: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
        - As $X_{2}$ is observed, the *causal trail* does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{1} \rightarrow X_{3} \rightarrow X_{5} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{3} \rightarrow X_{5}$
        - As $X_{3}$ is observed, the *causal trail* does not hold.
    2. *Consecutive Triple*: $X_{3} \rightarrow X_{5} \rightarrow X_{6}$
        - As $X_{5}$ is unobserved, the *causal trail* does hold.
    3. As not all the consecutive triples hold, this path is not *active*.
3. As there are no active paths between $X_{1}$ and $X_{6}$, they are $d$-separated given $\{X_{2}, X_{3}\}$.

#### Example Problem 2 - $d$-separation

![Problem 2 - $d$-separation](images/BN_P2.png)

##### Question

- Are $X_{2}$ and $X_{3}$ $d$-separated given $\{X_{1}, X_{6}\}$?

##### Solution

1. **Path**: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
        - As $X_{1}$ is observed, the *common cause* does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{2} \rightarrow X_{6} \leftarrow X_{5} \leftarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \rightarrow X_{6} \leftarrow X_{5}$
        - As $X_{6}$ is observed, the *common effect* does hold.
    2. *Consecutive Triple*: $X_{6} \leftarrow X_{5} \leftarrow X_{3}$
        - As $X_{5}$ is unobserved, the *evidential trail* does hold.
    3. As all the consecutive triples hold, this path is *active*.
3. As there exists an active path between $X_{2}$ and $X_{3}$, they are not $d$-separated given $\{X_{1}, X_{6}\}$.

### Markov Random Fields (Undirected Probabilistic Graphical Model) <span name="S02-2"></span>

#### Definition - What is a Markov random field?

- A **Markov random field** is an undirected graph $G$ with the following:
    - *Nodes*: A random variable $x_{i}$.
    - *Fully Connected Subgraphs*: An optional factor $\phi_{c}(x_{c})$ per clique, specifying the level of coupling (**potentials**) between all the dependent variables within the clique.
- **Important Note**:
>...SPECIFYING THE LEVEL OF COUPLING BETWEEN ALL THE DEPENDENT VARIABLES WITHIN THE CLIQUE...

#### Representation - How does a Markov random field express a probability distribution?

1. Let $p$ be a probability distribution.
2. A Markov random field representation of $p$ is the following:
$$p(x_{1}, x_{2}, ..., x_{n}) = \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c})$$
    - Where $C$ is the set of cliques of $G$.
    - Where $\phi_{c}$ is a **factor** (nonnegative function) over the variables in a clique.
    - Where $Z$ is a **normalizing constant** that ensures that $p$ sums to one.
$$Z = \sum_{x_{1}, x_{2}, ..., x_{n}} \prod_{c \in C} \phi_{c}(x_{c})$$
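
To make the role of $Z$ concrete, the sketch below evaluates the formulas above by brute force on a tiny model. The factor encoding (a scope of variable indices plus a Python function) and the "agreement" potential are assumptions chosen for illustration; enumeration is only feasible for very small $n$.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# Assumed encoding: (indices of the variables in the clique, potential function).
Factor = Tuple[Tuple[int, ...], Callable[..., float]]


def unnormalized(assignment: Sequence[int], factors: List[Factor]) -> float:
    """Product of all clique potentials for one joint assignment."""
    value = 1.0
    for scope, phi in factors:
        value *= phi(*(assignment[i] for i in scope))
    return value


def partition_function(n_vars: int, n_values: int, factors: List[Factor]) -> float:
    """Z: sum of the unnormalized product over every joint assignment."""
    return sum(
        unnormalized(x, factors) for x in product(range(n_values), repeat=n_vars)
    )


def probability(assignment: Sequence[int], n_values: int, factors: List[Factor]) -> float:
    z = partition_function(len(assignment), n_values, factors)
    return unnormalized(assignment, factors) / z


def agree(a: int, b: int) -> float:
    """Hypothetical potential that prefers neighbouring variables to match."""
    return 2.0 if a == b else 1.0


chain = [((0, 1), agree), ((1, 2), agree)]  # x0 - x1 - x2, binary variables
print(partition_function(3, 2, chain))  # 18.0
```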

#### Space Complexity - How compact is a Markov random field?

##### Factor Product

- Let $A$, $B$, and $C$ be three disjoint sets of variables.
- Let $\phi_{1}(A, B)$ and $\phi_{2}(B, C)$ be two factors.
- Let $\phi_{3}(A, B, C)$ be the **factor product**.
$$\phi_{3}(A, B, C) = \phi_{1}(A, B) \cdot \phi_{2}(B, C)$$
    - Where the two factors are multiplied for common values of $B$.

##### Binary Factor Tables

- Each of the optional factors $\phi_{c}(x_{c})$ can be expressed as a product of **binary factor tables** $\phi(X, Y)$:
    - *Rows*: Values of $X$
    - *Columns*: Values of $Y$
    - *Cells*: Values of $\phi(X, Y)$
- If each variable takes $d$ values, each binary factor table has at most $O(d^{2})$ entries.
- **Markov Random Fields Representation Space Complexity**: $O(E \cdot d^{2})$
    - Where $E$ is the number of edges in a Markov random field.
$$\approx \text{Markov Random Field Representation} \le \text{Naive Representation}$$

#### Markov Random Fields vs. Bayesian Networks - What are the advantages and disadvantages of Markov random fields?

##### Advantages

- *Applicable for Variable Dependencies Without Natural Directionality*
- *Succinctly Express Dependencies Not Easily Expressible in Bayesian Networks*

##### Disadvantages

- *Cannot Express Dependencies Easily Expressible in Bayesian Networks*
    - *e.g., V-Structures*
- *Computing Normalization Constant $Z$ Is NP-Hard*
- *Generally Require Approximation Techniques*
- *Difficult to Interpret*
- *Easier to Construct Bayesian Networks*

#### Moralization - What is moralization?

![Moralization](images/MRF_1.png)

- Bayesian networks are a special case of Markov random fields with factors corresponding to conditional probability distributions and a normalizing constant of one.
- **Moralization**: Bayesian Network $\to$ Markov Random Field
    1. Add side edges to all parents of a given node.
    2. Remove the directionality of all the edges.

#### $n$-Variable Independencies in Undirected Graphs - How do you identify independent variables in an $n$-variable Markov random field?

1. If variables $X$ and $Y$ are connected by a path of unobserved variables, then $X$ and $Y$ are dependent.
2. If variable $X$'s neighbors are all observed, then $X$ is independent of all the other variables.
3. If a set of observed variables forms a cut-set between two halves of the graph, then variables in one half are independent from ones in the other.
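
These rules amount to a reachability test: two variables are separated when every path between them passes through an observed node. A minimal sketch, assuming the graph is an adjacency dict and that neither endpoint is itself observed:

```python
from collections import deque
from typing import Dict, Set


def separated(adjacency: Dict[str, Set[str]], x: str, y: str, observed: Set[str]) -> bool:
    """True if no path of unobserved variables connects x and y."""
    frontier = deque([x])
    visited = {x}
    while frontier:
        node = frontier.popleft()
        for neighbor in adjacency.get(node, set()):
            if neighbor == y:
                return False  # reached y through unobserved nodes only
            if neighbor not in visited and neighbor not in observed:
                visited.add(neighbor)
                frontier.append(neighbor)
    return True


chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}  # hypothetical graph A - B - C
print(separated(chain, "A", "C", {"B"}))  # True
print(separated(chain, "A", "C", set()))  # False
```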

##### Cut-Set Variable Independencies

![Cut-Set Variable Independencies](images/MRF_2.png)

##### Markov Blanket

- The **Markov blanket** $U$ of a variable $X$ is the minimal set of nodes such that $X$ is independent from the rest of the graph if $U$ is observed.
$$X \perp (\mathcal{X} - \{X\} - U) \mid U$$
- In an undirected graph, the Markov blanket is a node's neighborhood.

#### Conditional Random Fields - What are conditional random fields?

##### Definition

- A **conditional random field** is a Markov random field over variables $\mathcal{X} \cup \mathcal{Y}$ which specifies a conditional distribution:
$$
\begin{align}
P(y \mid x) &= \frac{1}{Z(x)} \prod_{c \in C} \phi_{c}(x_{c}, y_{c}) \\
Z(x) &= \sum_{y \in \mathcal{Y}} \prod_{c \in C} \phi_{c}(x_{c}, y_{c})
\end{align}
$$
    - Where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ are **VECTOR-VALUED** variables.
    - Where $Z(x)$ is the partition function.
- **Important Note 1**: A conditional random field results in an instantiation of a new Markov random field for each input $x$.
- **Important Note 2**: A conditional random field is useful for structured prediction in which the output labels are predicted considering the neighboring input samples.
    - *See [Stanford CS228 - Markov Random Fields: Conditional Random Fields (OCR Example)](https://ermongroup.github.io/cs228-notes/representation/undirected/#conditional-random-fields).*

##### Features

- Assume the factors $\phi_{c}(x_{c}, y_{c})$ are of the following form:
$$\phi_{c}(x_{c}, y_{c}) = \exp(w_{c}^{T} f_{c}(x_{c}, y_{c}))$$
    - Where $f_{c}(x_{c}, y_{c})$ can be an arbitrary set of features describing the compatibility between $x_{c}$ and $y_{c}$.
    - Where $w_{c}^{T}$ is the transposed weight vector.
- Accordingly, $f_{c}(x_{c}, y_{c})$ allows arbitrarily complex features.
    - e.g., $f(x, y_{i})$ are features that depend on the entirety of input samples $x$.
    - e.g., $f(y_{i}, y_{i + 1})$ are features that depend on successive pairs of output labels $y$.

#### Conditional Random Fields vs. Markov Random Fields - Why is a conditional random field a special case of Markov random fields?

- If we were to model $p(x, y)$ using a Markov random field, then we need to fit two probability distributions to the data: $p(y \mid x)$ and $p(x)$.
    - *Remember the Product Rule*: $p(x, y) = p(y \mid x) \cdot p(x)$
- However, if all we are interested in is predicting $y$ given $x$, then modeling $p(x)$ is expensive and unnecessary.
$$\text{Prediction} \implies \text{CRF} > \text{MRF}$$

#### Factor Graphs - What is a factor graph? Why does a factor graph exist?

![Factor Graph](images/MRF_3.png)

- A **factor graph** is a bipartite graph where one group is the variables in the distribution being modeled, and the other group is the factors defined on these variables.
    - *Edges Between Factors and Variables*
- **Side Note**: A **bipartite graph** is a graph whose vertices are divided into two disjoint and independent sets.
    - *Set 1: Variables*
    - *Set 2: Factors*
- **Important Note**: Use a factor graph to identify what variables a factor depends on when computing probability distributions.

## Inference <span name="S03"></span>

- **Problem**: *Given a probabilistic model, how do we obtain answers to relevant questions about the world?*
    - **Marginal Inference**: *What is the probability of a given variable in our model after we sum everything else out?*
$$p(y = 1) = \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} p(y = 1, x_{1}, x_{2}, ..., x_{n})$$
        - e.g., What is the overall probability that an email is spam?
        - *Perspective*: We desire to infer the general probability of some real-world phenomenon being observed.
            - i.e., You care more about spam as a whole than specific instances of spam.
    - **Maximum A Posteriori**: *What is the most likely assignment of variables?*
$$\max_{x_{1}, ..., x_{n}} p(y = 1, x_{1}, x_{2}, ..., x_{n})$$
        - e.g., What is the set of words such that an email has the maximum probability of being spam?
        - *Perspective*: We desire to infer the set of conditions that maximizes the probability of some real-world phenomenon being observed.
            - i.e., You care more about identifying indicators of spam than detecting spam.
    - *Naive Complexity*: NP-Hard **(DIFFICULT PROBLEM)**
- **Solution**: *Exact Inference Algorithms & Approximate Inference Algorithms*

### Variable Elimination (Exact Inference Algorithm) <span name="S03-1"></span>

#### Motivation - Why does the variable elimination algorithm exist?

- Let $x_{i}$ be a discrete random variable that takes $k$ possible values.
- **Problem**: *Marginal Inference*
$$p(y = 1) = \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} p(y = 1, x_{1}, x_{2}, ..., x_{n})$$
- **Naive Solution's Time Complexity** (*Exponential*): $O(k^{n})$
    - *See [Rule of Product in Combinatorics](https://en.wikipedia.org/wiki/Rule_of_product)*
- **Variable Elimination Solution's Time Complexity** (*Non-Exponential*): $O(n \cdot k^{M + 1})$
    - *See Below*.
$$\therefore \text{Variable Elimination Solution} \ll \text{Naive Solution}$$

#### Factors - How should a probabilistic graphical model express a probability distribution?

- **Assumption**: *Probabilistic Graphical Models = Product of Factors*
$$p(x_{1}, ..., x_{n}) = \prod_{c \in C} \phi_{c}(x_{c})$$
- **Representation**: A factor can be represented as a *multi-dimensional table* with a cell for each assignment of $x_{c}$.
- *Bayesian Networks*: $\phi$ is Conditional Probability Distribution
- *Markov Random Fields*: $\phi$ is Potentials

#### Factor Product - What is the product operation?

![Example of Factor Product](images/VE_1.png)

- Let $A$, $B$, and $C$ be three disjoint sets of variables.
- Let $\phi_{1}(A, B)$ and $\phi_{2}(B, C)$ be two factors.
- Let $\phi_{3}(A, B, C)$ be the **factor product**.
$$\phi_{3}(A, B, C) = \phi_{1}(A, B) \cdot \phi_{2}(B, C)$$
    - Where the two factors are multiplied for common values of $B$.
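
A factor product over table-valued factors can be sketched as below. The representation (a scope tuple plus a dict keyed by value tuples) and the numbers in the example are assumptions for illustration, not the values in the figure.

```python
from itertools import product
from typing import Dict, Tuple

Scope = Tuple[str, ...]
Table = Dict[Tuple[int, ...], float]


def factor_product(
    scope1: Scope, table1: Table, scope2: Scope, table2: Table, cardinality: Dict[str, int]
) -> Tuple[Scope, Table]:
    """Multiply two factors, matching entries on their shared variables."""
    scope = scope1 + tuple(v for v in scope2 if v not in scope1)
    table: Table = {}
    for values in product(*(range(cardinality[v]) for v in scope)):
        assignment = dict(zip(scope, values))
        key1 = tuple(assignment[v] for v in scope1)
        key2 = tuple(assignment[v] for v in scope2)
        table[values] = table1[key1] * table2[key2]
    return scope, table


# Hypothetical binary factors phi1(A, B) and phi2(B, C).
phi1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}
phi2 = {(0, 0): 0.5, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.2}
scope3, phi3 = factor_product(("A", "B"), phi1, ("B", "C"), phi2, {"A": 2, "B": 2, "C": 2})
print(scope3)           # ('A', 'B', 'C')
print(phi3[(0, 0, 0)])  # 0.25
```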

#### Factor Marginalization - What is the marginalization operation?

![Example of Factor Marginalization](images/VE_2.png)

- Let $A$ and $B$ be two disjoint sets of variables.
- Let $\phi(A, B)$ be a factor.
- Let $\tau(A)$ be the **factor marginalization** of $B$ in $\phi$.
$$\tau(A) = \sum_{B} \phi(A, B)$$
- **Important Note**: $\tau$ does not necessarily need to correspond to a probability distribution.
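
Marginalization over a table-valued factor can be sketched in a few lines. The representation (scope tuple plus a dict keyed by value tuples) and the example numbers are assumptions for illustration; note that the resulting $\tau$ in the example sums to $1.4$, so it is not a probability distribution.

```python
from typing import Dict, Tuple

Scope = Tuple[str, ...]
Table = Dict[Tuple[int, ...], float]


def marginalize(scope: Scope, table: Table, variable: str) -> Tuple[Scope, Table]:
    """Sum a factor over all values of one variable."""
    index = scope.index(variable)
    new_scope = scope[:index] + scope[index + 1:]
    new_table: Table = {}
    for values, value in table.items():
        key = values[:index] + values[index + 1:]
        new_table[key] = new_table.get(key, 0.0) + value
    return new_scope, new_table


# Hypothetical factor phi(A, B); summing out B leaves tau(A).
phi = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0}
tau_scope, tau = marginalize(("A", "B"), phi, "B")
print(tau_scope)  # ('A',)
```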

#### Ordering - What is an ordering?

- An **ordering** $O$ is the sequence of variables by which they will be eliminated.
- Although any ordering can be used, different orderings may dramatically alter the running time of the variable elimination algorithm.
- **Important Note**: *Finding Best Ordering = NP-Hard*

#### Algorithm - How does the variable elimination algorithm work?

- For each variable $X_{i}$ (ordered according to $O$),
    1. Multiply all factors $\Phi_{i}$ containing $X_{i}$.
    2. Marginalize out $X_{i}$ to obtain a new factor $\tau$.
    3. Replace the factors $\Phi_{i}$ with $\tau$.

#### Time Complexity - What is the time complexity of variable elimination?

- **Time Complexity**: $O(n \cdot k^{M + 1})$
    - Where $n$ is the number of variables.
    - Where $M$ is the maximum number of dimensions of any factor $\tau$ formed during the elimination process.

#### Ordering Heuristics - How should you choose an ordering for variable elimination?

- **Minimum Neighbors**: Choose a variable with the fewest dependent variables.
- **Minimum Weight**: Choose variables to minimize the product of the cardinalities of its dependent variables.
- **Minimum Fill**: Choose vertices to minimize the size of the factor that will be added to the graph.

#### Evidence - How do you perform marginal inference given some evidence using variable elimination?

- Given a probability distribution $P(X, Y, E)$ with unobserved variables $X$, query variables $Y$, and observed evidence variables $E$, $P(Y \mid E = e)$ can be calculated using variable elimination.
$$P(Y \mid E = e) = \frac{P(Y, E = e)}{P(E = e)}$$

##### Variable Elimination with Evidence

1. Restrict every factor $\phi(X', Y', E')$ to the entries consistent with the evidence $E = e$.
2. Compute $P(Y, E = e)$ by performing variable elimination over $X$.
3. Compute $P(E = e)$ by performing variable elimination over $Y$.

#### Example Problem 1 - Variable Elimination

![Problem 1 - Variable Elimination](images/VE_P1.png)

- A Bayesian network that models a student's grade on an exam:
    - $g$ is a ternary variable of the student's grade.
    - $d$ is a binary variable of the exam's difficulty.
    - $i$ is a binary variable of the student's intelligence.
    - $l$ is a binary variable of the quality of a reference letter from the professor who taught the course.
    - $s$ is a binary variable of the student's SAT score.
$$p(l, g, i, d, s) = p(l \mid g) \cdot p(s \mid i) \cdot p(i) \cdot p(g \mid i, d) \cdot p(d)$$

##### Question (Marginal Inference)

- What is the probability distribution of the quality of a reference letter from the professor who taught the course?
$$p(l) = \sum_{g} \sum_{i} \sum_{d} \sum_{s} p(l, g, i, d, s)$$

##### Solution (Variable Elimination)

1. Order the variables according to the topological sort of the Bayesian network.
$$d, i, s, g$$
2. Eliminate $d$ with a new factor $\tau_{1}$:
$$
\begin{align}
\tau_{1}(g, i) &= \sum_{d} p(g \mid i, d) \cdot p(d) \\
p(l, g, i, s) &= p(l \mid g) \cdot p(s \mid i) \cdot p(i) \cdot \tau_{1}(g, i)
\end{align}
$$
3. Eliminate $i$ with a new factor $\tau_{2}$:
$$
\begin{align}
\tau_{2}(g, s) &= \sum_{i} p(s \mid i) \cdot p(i) \cdot \tau_{1}(g, i) \\
p(l, g, s) &= p(l \mid g) \cdot \tau_{2}(g, s)
\end{align}
$$
4. Eliminate $s$ with a new factor $\tau_{3}$:
$$
\begin{align}
\tau_{3}(g) &= \sum_{s} \tau_{2}(g, s) \\
p(l, g) &= p(l \mid g) \cdot \tau_{3}(g)
\end{align}
$$
5. Eliminate $g$ with a new factor $\tau_{4}$:
$$
\begin{align}
\tau_{4}(l) &= \sum_{g} p(l \mid g) \cdot \tau_{3}(g) \\
p(l) &= \tau_{4}(l)
\end{align}
$$
6. Expanding $\tau_{i}$:
$$p(l) = \sum_{g} p(l \mid g) \cdot \sum_{s} \sum_{i} p(s \mid i) \cdot p(i) \cdot \sum_{d} p(g \mid i, d) \cdot p(d)$$

##### Time Complexity

- **Naive Solution**: $O(k^{4})$
- **Variable Elimination Solution**: $O(4 \cdot k^{3})$
    - Step 2. takes $O(k^{3})$ steps as the factor product $p(g \mid i, d) \cdot p(d)$ has a $3$-dimensional table representation, and the factor marginalization of $d$ can execute concurrently with the factor product.
    - Step 3. takes $O(k^{3})$ steps as the factor product $p(s \mid i) \cdot p(i) \cdot \tau_{1}(g, i)$ has a $3$-dimensional table representation, and the factor marginalization of $i$ can execute concurrently with the factor product.
    - Step 4. takes $O(k^{2})$ steps for the factor marginalization of $s$, as $\tau_{2}(g, s)$ has a $2$-dimensional table representation.
    - Step 5. takes $O(k^{2})$ steps as the factor product $p(l \mid g) \cdot \tau_{3}(g)$ has a $2$-dimensional table representation, and the factor marginalization of $g$ can execute concurrently with the factor product.
    - As $O(k^{3})$ is the largest step, with $4$ steps, the time complexity is at most $O(4 \cdot k^{3})$.
    - **Thus, with $n = 4$ and $M = 2$, the time complexity is at most $O(n \cdot k^{M + 1}) = O(4 \cdot k^{3})$.**

### MAP Inference <span name="S03-2"></span>

#### Overview - What is MAP inference?

- *See [Inference](#S03).*
- Given a probabilistic graphical model $p(x_{1}, ..., x_{n}) = \prod_{c \in C} \phi_{c}(x_{c})$, MAP inference corresponds to the following optimization problem:
$$\max_{x} \log p(x) = \max_{x} \sum_{c \in C} \theta_{c}(x_{c}) - \log Z$$
    - Where $\theta_{c}(x_{c}) = \log \phi_{c}(x_{c})$.
    
##### Derivation - Why is the MAP inference optimization problem expressed the way it is?


1. All probabilistic graphical models (as BNs and CRFs are special cases of MRFs) have the following representation:
$$p(x) = \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c})$$
    - Where computing $Z$ is NP-Hard.
2. MAP inference desires to infer the set of conditions that maximizes the probability of some real-world phenomenon being observed.
$$\max_{x} p(x) = \max_{x} \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c})$$
3. As $Z$ is expensive to calculate, maximize $\log p(x)$ instead of $p(x)$.
$$\max_{x} \log p(x) = \max_{x} \log \left[ \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c}) \right]$$
4. Simplify using logarithmic identities.
    - $\log(x \times y) = \log(x) + \log(y)$
    - $\log(x \div y) = \log(x) - \log(y)$
$$\max_{x} \log p(x) = \max_{x} \left[ \sum_{c} \log \phi_{c}(x_{c}) - \log Z \right]$$
5. Simplify using maximum identities.
    - $\max_{x}(f(x) \pm c) = \max_{x} f(x) \pm c$ for any constant $c$; here $c = \log Z$ does not depend on $x$.
$$\max_{x} \log p(x) = \max_{x} \sum_{c} \log \phi_{c}(x_{c}) - \log Z$$
6. Let $\theta_{c}(x_{c}) = \log \phi_{c}(x_{c})$.
$$\max_{x} \log p(x) = \max_{x} \sum_{c \in C} \theta_{c}(x_{c}) - \log Z$$


- As $\log Z$ is outside the scope of the maximization, if you desire to infer the set of conditions that maximizes the probability of some real-world phenomenon being observed, then solve the following optimization problem:
$$\arg \max_{x} \log p(x) = \arg \max_{x} \sum_{c \in C} \theta_{c}(x_{c})$$
- **Important Note 1**: Without $Z$, this optimization problem suggests that MAP inference is computationally cheaper than marginal inference questions.
- **Important Note 2**: As maximization and summation both distribute over products, techniques used to solve marginal inference problems can be used to solve MAP inference problems.
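
The point that $Z$ drops out of the $\arg\max$ can be seen in a brute-force sketch that only ever sums log-potentials. The factor encoding and the example potentials are assumptions for illustration, and potentials are assumed strictly positive so that the logarithm is defined.

```python
import math
from itertools import product
from typing import Callable, List, Tuple

# Assumed encoding: (indices of the variables in the clique, positive potential function).
Factor = Tuple[Tuple[int, ...], Callable[..., float]]


def map_assignment(n_vars: int, n_values: int, factors: List[Factor]) -> Tuple[int, ...]:
    """Exhaustively maximize the sum of theta_c = log phi_c; Z is never computed."""

    def score(x: Tuple[int, ...]) -> float:
        return sum(math.log(phi(*(x[i] for i in scope))) for scope, phi in factors)

    return max(product(range(n_values), repeat=n_vars), key=score)


def agree(a: int, b: int) -> float:
    return 2.0 if a == b else 1.0  # hypothetical pairwise potential


def prefer_one(a: int) -> float:
    return 3.0 if a == 1 else 1.0  # hypothetical unary potential on x0


factors = [((0,), prefer_one), ((0, 1), agree), ((1, 2), agree)]
print(map_assignment(3, 2, factors))  # (1, 1, 1)
```

Exhaustive search is exponential in the number of variables, so this only illustrates the objective, not a practical solver.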

#### Graph Cuts - How can MAP inference problems be solved using graph cuts?

- A **graph cut** of an undirected graph $G = (V, E)$ is a partition of $V$ into two disjoint sets $V_{s}$ and $V_{t}$.
- The **min-cut** problem is to find the partition $V_{s}, V_{t}$ that minimize the cost of the graph cut.
    - The cost of a graph cut is the sum of the nonnegative costs of the edges that cross between the two partitions:
$$\text{cost}(V_{s}, V_{t}) = \sum_{v_{1} \in V_{s}, v_{2} \in V_{t}} \text{cost}(v_{1}, v_{2})$$
    - **Time Complexity 1**: $O(\lvert E \rvert \lvert V \rvert \log \lvert V \rvert)$
    - **Time Complexity 2**: $O({\lvert V \rvert}^{3})$
- A MAP inference problem can be reduced into the min-cut problem in certain restricted cases of MRFs with binary variables.

#### Linear Programming - How can MAP inference problems be solved using linear programming?

- An approximate approach to computing the MAP values is to use Integer Linear Programming by introducing:
    - An indicator variable per variable in the PGM.
    - An indicator variable per edge/clique in the PGM.
    - Constraints on consistent values in cliques.

#### Local Search - How can MAP inference problems be solved using local search?

- A heuristic solution that starts with an arbitrary assignment and performs modifications on the joint assignment that locally increase the probability.

#### Branch and Bound - How can MAP inference problems be solved using branch and bound?

- An exhaustive solution that searches over the space of assignments while pruning branches that can be provably shown not to contain a MAP assignment.

#### Simulated Annealing - How can MAP inference problems be solved using simulated annealing?

- A sampling solution that expresses a probability distribution with the following:
$$p_{t}(x) \propto \exp\left( \frac{1}{t} \sum_{c \in C} \theta_{c}(x_{c}) \right)$$
    - Where $t$ is temperature.
        - $t \to \infty$ $\implies$ $p_{t}$ approaches a uniform distribution over assignments.
        - $t \to 0$ $\implies$ $p_{t}$ concentrates its probability mass on a sharp peak at $\arg \max_{x} \sum_{c \in C} \theta_{c}(x_{c})$.
- As the peak is a MAP assignment, a sampling algorithm starting with a high temperature which gradually decreases can eventually find the peak, given a sufficiently slow cooling rate.
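
The effect of the temperature can be checked numerically on a handful of assignments. The sketch below only evaluates $p_{t}$ for a list of hypothetical scores $\sum_{c} \theta_{c}(x_{c})$; it is not a full annealing sampler.

```python
import math
from typing import List


def tempered_distribution(scores: List[float], t: float) -> List[float]:
    """p_t(x) proportional to exp(score(x) / t), normalized over the listed assignments."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / t) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [1.0, 2.0, 3.0]  # hypothetical scores of three assignments
hot = tempered_distribution(scores, t=1000.0)  # close to uniform
cold = tempered_distribution(scores, t=0.01)   # nearly all mass on the best score
print(hot)
print(cold)
```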

### Sampling-Based Inference <span name="S03-3"></span>

#### Motivation - Why do sampling-based inference algorithms exist?

- **Exact Inference Algorithms**: Slow/NP-Hard
- **Approximate Inference Algorithms**: Marginal Inference, MAP Inference, Expectations

##### Expectations $\mathbb{E}[f(X)]$ - Why do we want to estimate expectations of random variables?

- Abstractly, approximate inference algorithms want to estimate the probability of some real-world phenomenon.
- Mathematically, estimating a probability $p(x)$ is a **SPECIALIZATION** of estimating an expectation $\mathbb{E}_{x \sim p}[f(x)] = \sum_{x} f(x)p(x)$.
- If $f(x) = \mathbb{I}_{\lvert x \rvert}$, where $\mathbb{I}_{\lvert x \rvert}$ is an indicator function for event $x$,
$$\mathbb{E}_{x \sim p}[\mathbb{I}_{\lvert x \rvert}] = p(x)$$

#### Multinomial Sampling - How do you sample a discrete CPD?


1. Let $p$ be a multinomial probability distribution with event values $\{x^{1}, ..., x^{k}\}$ and event probabilities $\{\theta_{1}, ..., \theta_{k}\}$.
2. Generate a sample $s$ uniformly from the interval $[0, 1]$.
3. Partition the interval into $k$ subintervals:
$$[0, \theta_{1}), [\theta_{1}, \theta_{1} + \theta_{2}), ..., \left[ \sum_{j = 1}^{i - 1} \theta_{j}, \sum_{j = 1}^{i} \theta_{j} \right)$$
4. If $s$ is in the $i$th interval, then the sampled value is $x^{i}$.


- **Time Complexity**: $O(\log k)$ - *Using Binary Search*
- *Remember the Definition of Conditional Probability*: $p(y \mid x) = \frac{p(x, y)}{p(x)}$
    - $p(x, y)$ is a multinomial probability distribution.
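
The interval lookup in the steps above maps directly onto a binary search over cumulative sums. In this sketch the function name and the optional `s` argument (which lets a caller supply the uniform draw) are assumptions for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence, TypeVar

V = TypeVar("V")


def sample_multinomial(values: Sequence[V], probs: List[float], s: Optional[float] = None) -> V:
    """Return the value whose cumulative-probability interval contains s."""
    if s is None:
        s = random.random()  # uniform draw from [0, 1)
    # Building the cumulative sums is O(k); precompute them to get O(log k) per sample.
    cumulative = list(accumulate(probs))
    index = bisect_right(cumulative, s)  # number of interval upper bounds <= s
    return values[min(index, len(values) - 1)]  # guard against rounding in the last bound


print(sample_multinomial(["a", "b", "c"], [0.2, 0.5, 0.3], s=0.1))   # a
print(sample_multinomial(["a", "b", "c"], [0.2, 0.5, 0.3], s=0.95))  # c
```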

#### Forward Sampling - How do you sample a discrete Bayesian network?


1. Let $G$ be a Bayesian network representing a probability distribution $p(x_{1}, ..., x_{n})$.
2. Sample the variables in a topological order.
3. Sample the successor variables by conditioning these nodes' CPDs on the values sampled for their ancestors.
4. Repeat until all $n$ variables have been sampled.


- **Time Complexity**: $O(n)$
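
A minimal sketch of forward sampling for binary variables follows. The node encoding, the two-node network (`rain`, `wet`), and its CPD values are all hypothetical and exist only to make the example runnable.

```python
import random
from typing import Dict, List, Tuple

# Assumed encoding: (name, parent names, CPD mapping parent values -> P(node = 1)).
Node = Tuple[str, Tuple[str, ...], Dict[Tuple[int, ...], float]]


def forward_sample(nodes: List[Node], rng: random.Random) -> Dict[str, int]:
    """Draw one joint sample; `nodes` must be listed in topological order."""
    sample: Dict[str, int] = {}
    for name, parents, cpd in nodes:
        p_one = cpd[tuple(sample[p] for p in parents)]  # condition on sampled parents
        sample[name] = 1 if rng.random() < p_one else 0
    return sample


network = [
    ("rain", (), {(): 0.2}),
    ("wet", ("rain",), {(0,): 0.1, (1,): 0.9}),
]
rng = random.Random(0)  # fixed seed so runs are repeatable
print(forward_sample(network, rng))
```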

#### Monte Carlo Integration/Estimation - How do you take a large number of samples to estimate expectations?

- *Monte Carlo $\approx$ Large Number of Samples*
$$\mathbb{E}_{x \sim p}[f(x)] \approx I_{T} = \frac{1}{T} \sum_{t = 1}^{T} f(x^{t})$$
    - Where $x^{1}, ..., x^{T}$ are [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) samples drawn according to $p$.
$$
\begin{align}
\mathbb{E}_{x^{1}, ..., x^{T} \sim^{\text{i.i.d.}} p}[I_{T}] &=  \mathbb{E}_{x \sim p}[f(x)] \\
\text{Var}_{x^{1}, ..., x^{T} \sim^{\text{i.i.d.}} p}[I_{T}] &= \frac{1}{T} \text{Var}_{x \sim p}[f(x)]
\end{align}
$$
    - Where the Monte Carlo estimate $I_{T}$ is itself a random variable, as it depends on the randomly drawn samples.
    
##### Implications - What is important about Monte Carlo estimations?

1. $I_{T}$ is an unbiased estimator for $\mathbb{E}_{x \sim p}[f(x)]$.
2. Referencing the [Weak Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), if $T \to \infty$, then $I_{T} \to \mathbb{E}_{x \sim p}[f(x)]$.
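
The estimator $I_{T}$ is just a sample average. The sketch below assumes a fair six-sided die as the distribution $p$ and $f(x) = x$, so the true expectation is $3.5$; the function name is hypothetical.

```python
import random
from typing import Callable


def monte_carlo_estimate(f: Callable[[int], float], sampler: Callable[[], int], T: int) -> float:
    """Average f over T i.i.d. samples drawn by `sampler`."""
    return sum(f(sampler()) for _ in range(T)) / T


rng = random.Random(0)  # fixed seed so runs are repeatable
estimate = monte_carlo_estimate(lambda x: float(x), lambda: rng.randint(1, 6), T=100_000)
print(estimate)  # close to 3.5; the variance shrinks as 1 / T
```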

#### Rejection Sampling - How does rejection sampling work?

- Compute a target probability distribution $p(x)$ by sampling a proposal probability distribution $q(x)$, rejecting samples inconsistent with $p(x)$, and applying the Monte Carlo estimation.
    - **Examples**: *See [Rejection Sampling](https://ermongroup.github.io/cs228-notes/inference/sampling/)*
    - **Disadvantage**: *Ignores Many Samples*

#### Importance Sampling - How does importance sampling work?

- Compute a target probability distribution $p(x)$ by sampling a proposal probability distribution $q(x)$, reweighting samples with $w(x) = \frac{p(x)}{q(x)}$, and applying the Monte Carlo estimation.
$$
\begin{align}
\mathbb{E}_{x \sim p}[f(x)] &= \sum_{x} f(x)p(x) \\
&= \sum_{x} f(x)\frac{p(x)}{q(x)}q(x) \\
&= \mathbb{E}_{x \sim q}[f(x)w(x)] \\
&\approx \frac{1}{T} \sum_{t = 1}^{T} f(x^{t})w(x^{t})
\end{align}
$$
    - **Examples**: *See [Importance Sampling](https://ermongroup.github.io/cs228-notes/inference/sampling/)*
    - **Advantage**: *Uses All Samples*
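
The final line of the derivation can be sketched for a small discrete variable. The target $p$, the uniform proposal $q$, and $f(x) = x$ are hypothetical choices; with them the exact expectation is $0 \cdot 0.1 + 1 \cdot 0.2 + 2 \cdot 0.7 = 1.6$.

```python
import random
from typing import Callable, List


def importance_sampling_estimate(
    f: Callable[[int], float], p: List[float], q: List[float], T: int, rng: random.Random
) -> float:
    """Estimate E_p[f] from T samples of q, each weighted by w(x) = p(x) / q(x)."""
    values = list(range(len(p)))
    samples = rng.choices(values, weights=q, k=T)  # draws from the proposal q
    return sum(f(x) * p[x] / q[x] for x in samples) / T


p = [0.1, 0.2, 0.7]    # hypothetical target
q = [1 / 3, 1 / 3, 1 / 3]  # hypothetical proposal
rng = random.Random(0)
print(importance_sampling_estimate(lambda x: float(x), p, q, 100_000, rng))  # close to 1.6
```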

#### Normalized Importance Sampling - How does normalized importance sampling work?

1. Let $p(x)$ be unknown.
2. Let $\tilde{p}(x) = Z \cdot p(x)$ be known.
3. The weight $w(x) = \frac{\tilde{p}(x)}{q(x)}$ is invalid for unnormalized importance sampling.
4. The **normalizing constant** of the distribution $\tilde{p}(x)$ is the following:
$$\mathbb{E}_{x \sim q}[w(x)] = \sum_{x} q(x)\frac{\tilde{p}(x)}{q(x)} = \sum_{x} \tilde{p}(x) = Z$$
5. The **normalized importance sampling estimator** is the following:
$$
\begin{align}
\mathbb{E}_{x \sim p}[f(x)] &= \sum_{x} f(x)p(x) \\
&= \sum_{x} f(x)\frac{p(x)}{q(x)}q(x) \\
&= \frac{1}{Z} \sum_{x} f(x)\frac{\tilde{p}(x)}{q(x)}q(x) \\
&= \frac{1}{Z} \mathbb{E}_{x \sim q}[f(x)w(x)] \\
&= \frac{\mathbb{E}_{x \sim q}[f(x)w(x)]}{\mathbb{E}_{x \sim q}[w(x)]}
\end{align}
$$
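
In practice both expectations in the final ratio are estimated from the same samples. The sketch below assumes a hypothetical unnormalized target `p_tilde` with $Z = 10$ and a uniform proposal; the function name and return values are illustrative.

```python
import random

def normalized_importance_sampling(f, p_tilde, q, sample_q, T, rng):
    """Estimate E_{x~p}[f(x)] as sum_t f(x^t) w(x^t) / sum_t w(x^t), with x^t ~ q."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(T):
        x = sample_q(rng)
        w = p_tilde[x] / q[x]  # weight built from the unnormalized target
        numerator += f(x) * w
        denominator += w
    # The average weight is a Monte Carlo estimate of the normalizing constant Z.
    return numerator / denominator, denominator / T

# Hypothetical unnormalized target: p_tilde = Z * p with Z = 10.
p_tilde = {0: 1.0, 1: 3.0, 2: 6.0}
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

def sample_q(rng):
    return rng.randrange(3)

def f(x):
    return x

rng = random.Random(0)
estimate, z_hat = normalized_importance_sampling(f, p_tilde, q, sample_q, 20000, rng)
print(estimate, z_hat)
```

The factor $\frac{1}{T}$ cancels in the ratio, so the estimator never needs $Z$ explicitly.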

#### Markov Chain - What is a Markov chain?

- **Markov Chain**: A sequence of random variables $S_{0}, S_{1}, S_{2}, ...$ with each random variable $S_{i} \in \{1, 2, ..., d\}$ taking one of $d$ possible values.
    - *Initial State*: $P(S_{0})$
    - *Subsequent States*: $P(S_{i} \mid S_{i - 1})$
- **Markov Assumption**: $S_{i}$ cannot depend directly on $S_{j}$ where $j < i - 1$.

#### Stationary Distribution - Why is it important for a stationary distribution to exist?

- Let $T_{ij} = P(S_{\text{new}} = i \mid S_{\text{prev}} = j)$ be a $d \times d$ transition probability matrix.
- If the initial state $S_{0}$ is drawn from a vector of probabilities $p_{0}$, the probability $p_{t}$ of ending in **EACH STATE** after $t$ steps is the following:
$$p_{t} = T^{t} p_{0}$$
    - Where $T^{t}$ is the $t$-th matrix power of $T$, i.e., $T$ multiplied by itself $t$ times.
- **Stationary Distribution**: If it exists, the limit $\pi = \lim_{t \to \infty} p_{t}$.
- **Important Note 1**: For inference, we use a Markov chain whose states are joint assignments to the variables in a probabilistic graphical model $p$, constructed so that its stationary distribution equals $p$.
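
The limit $\pi = \lim_{t \to \infty} p_{t}$ can be approximated by applying $T$ repeatedly. The following is a minimal sketch, assuming a hypothetical 2-state transition matrix whose columns sum to 1, matching the convention $T_{ij} = P(S_{\text{new}} = i \mid S_{\text{prev}} = j)$.

```python
def step(T, p):
    """One step of the chain: return T p, where T[i][j] = P(S_new = i | S_prev = j)."""
    d = len(p)
    return [sum(T[i][j] * p[j] for j in range(d)) for i in range(d)]

def distribution_after(T, p0, t):
    """Return p_t = T^t p_0 by applying the transition matrix t times."""
    p = list(p0)
    for _ in range(t):
        p = step(T, p)
    return p

# Hypothetical 2-state chain; each column sums to 1.
T = [[0.9, 0.5],
     [0.1, 0.5]]

pi_a = distribution_after(T, [1.0, 0.0], 100)  # start in state 0
pi_b = distribution_after(T, [0.0, 1.0], 100)  # start in state 1
print(pi_a, pi_b)
```

For this chain both starting points approach $\pi = (5/6, 1/6)$, which satisfies $\pi = T\pi$.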

##### Existence of Stationary Distribution

- **Irreducibility**: It is possible to get from any state $x$ to any other state $x'$ with probability $>0$ in a finite number of steps.
- **Aperiodicity**: It is possible to return to any state at any time, i.e. there exists an $n$ such that for all $i$ and all $n' \ge n$, $P(s_{n'} = i \mid s_{0} = i) > 0$.
- **Important Note 2**: An irreducible and aperiodic finite-state Markov chain has a stationary distribution.

#### Markov Chain Monte Carlo - How do you sample from a MCMC?

1. Let $T$ be a **transition operator** specifying a Markov chain whose stationary distribution is $p$.
2. Let $x_{0}$ be an initial assignment to the variables of $p$.
3. Run the Markov chain from $x_{0}$ for $B$ *burn-in* steps.
    - If $B$ is sufficiently large, the distribution of the current state is approximately the stationary distribution $p$.
4. Run the Markov chain for $N$ *sampling* steps and collect all the states that it visits.
    - The collected states form samples from $p$.

##### Applications of Markov Chain Monte Carlo

1. Use samples for Monte Carlo integration to estimate expectations.
2. Use samples to perform marginal inference.
3. Use the sample with the highest probability to perform MAP inference.

#### Gibbs Sampling - How do you construct a MCMC?


1. Let $x_{1}, ..., x_{n}$ be an ordered set of variables.
2. Let $x^{0} = (x_{1}^{0}, ..., x_{n}^{0})$ be a starting configuration.
3. Repeat until convergence for $t = 1, 2, ...$,
    1. Set $x \gets x^{t - 1}$.
    2. For each variable $x_{i}$,
        1. Sample $x_{i}' \sim p(x_{i} \mid x_{-i})$.
            - Where $x_{-i}$ is all variables in $x$ except $x_{i}$
        2. Update $x \gets (x_{1}, ..., x_{i}', ..., x_{n})$.
    3. Set $x^{t} \gets x$
    

- **Important Note 1**: When $x_{i}$ is updated, its new value is immediately used for sampling other variables $x_{j}$.
- **Important Note 2**: Every iteration of $x^{t}$ is a new sample from $p$.
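
The procedure above can be sketched for a tiny model. This example assumes a hypothetical joint table over two binary variables and adds a burn-in phase as in the MCMC recipe; the names `sample_conditional` and `gibbs` are illustrative. Each conditional $p(x_{i} \mid x_{-i})$ is obtained by renormalizing the joint over the values of $x_{i}$.

```python
import random

# Hypothetical joint distribution over two binary variables (x_1, x_2).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_conditional(joint, x, i, rng):
    """Sample x_i' ~ p(x_i | x_{-i}) by renormalizing the joint over the values of x_i."""
    weights = []
    for value in (0, 1):
        candidate = list(x)
        candidate[i] = value
        weights.append(joint[tuple(candidate)])
    prob_one = weights[1] / (weights[0] + weights[1])
    return 1 if rng.random() < prob_one else 0

def gibbs(joint, x0, burn_in, num_samples, rng):
    """Run Gibbs sweeps from x0; return the states visited after the burn-in sweeps."""
    x = list(x0)
    samples = []
    for t in range(burn_in + num_samples):
        for i in range(len(x)):
            # The new value of x_i is used immediately when sampling later variables.
            x[i] = sample_conditional(joint, x, i, rng)
        if t >= burn_in:
            samples.append(tuple(x))
    return samples

rng = random.Random(0)
samples = gibbs(joint, (0, 0), burn_in=500, num_samples=20000, rng=rng)
freq = {state: samples.count(state) / len(samples) for state in joint}
print(freq)
```

The empirical frequencies should be close to the entries of `joint`, although consecutive Gibbs samples are correlated rather than independent.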

## Learning <span name="S04"></span>

- **Problem**: *Given a dataset $D$ of $m$ i.i.d. samples from some underlying distribution $p^{\ast}$, how do you fit the best model, given a family of models $M$, to make useful predictions?*
    - **Parameter Learning**: *Where the graph structure is known, and we want to estimate the factors.*
    - **Structure Learning**: *Where we want to estimate the graph, i.e., determine from data how the variables depend on each other.*
- **Solution**: *Best Approximation of $p^{\ast}$*
    - **Density Estimation**: *We are interested in the full distribution.*
    - **Specific Prediction Tasks**: *We are using the distribution to make a prediction.*
        - e.g. Is this email spam or not?
    - **Structure or Knowledge Discovery**: *We are interested in the model itself.*
        - e.g. How do some genes interact with each other?

### Maximum Likelihood Estimation <span name="S04-01"></span>

#### Motivation - Why does maximum likelihood estimation exist?

- **Goal**: How do we approximate $p$ as close as possible to $p^{\ast}$?
- **Approach**: When the KL divergence between $p$ and $p^{\ast}$ is minimal, $p$ is as close as possible to $p^{\ast}$.


##### KL Divergence - What is KL divergence?

- **KL Divergence**: How different is one probability distribution from another probability distribution?
$$KL(p^{\ast} \parallel p) = \sum_{x} p^{\ast}(x) \log \frac{p^{\ast}(x)}{p(x)} = -H(p^{\ast}) - \mathbb{E}_{x \sim p^{\ast}}[\log p(x)]$$
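
The identity above can be verified numerically. The following is a minimal sketch, assuming two hypothetical discrete distributions `p_star` and `p` over the same three outcomes, with $p(x) > 0$ wherever $p^{\ast}(x) > 0$.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def expected_log_likelihood(p_star, p):
    """E_{x ~ p_star}[log p(x)]; assumes p(x) > 0 wherever p_star(x) > 0."""
    return sum(ps * math.log(p[x]) for x, ps in p_star.items() if ps > 0)

def kl_divergence(p_star, p):
    """KL(p_star || p) = sum_x p_star(x) log(p_star(x) / p(x))."""
    return sum(ps * math.log(ps / p[x]) for x, ps in p_star.items() if ps > 0)

# Hypothetical distributions over three outcomes.
p_star = {"a": 0.5, "b": 0.3, "c": 0.2}
p = {"a": 0.4, "b": 0.4, "c": 0.2}

lhs = kl_divergence(p_star, p)
rhs = -entropy(p_star) - expected_log_likelihood(p_star, p)
print(lhs, rhs)
```

Because $H(p^{\ast})$ does not depend on $p$, only the expected log-likelihood term changes when the model $p$ changes.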

##### Minimal KL Divergence - When is KL divergence minimal?

- **General Idea**: *Minimizing KL Divergence* $\Longleftrightarrow$ *Maximizing Likelihood*
$$\min KL(p^{\ast} \parallel p) \Longleftrightarrow \max \mathbb{E}_{x \sim p^{\ast}}[\log p(x)]$$
- Because $p^{\ast}$ is unknown, approximate the expected log-likelihood with the empirical log-likelihood using a Monte Carlo estimate.
$$\mathbb{E}_{x \sim p^{\ast}}[\log p(x)] \approx \frac{1}{\lvert D \rvert} \sum_{x \in D} \log p(x)$$

##### Maximum Likelihood Learning - How do you fit the best model using maximum likelihood learning?

- Given a family of models $M$, to fit the best model $p$, compute the following.
$$\max_{p \in M} \mathbb{E}_{x \sim p^{\ast}}[\log p(x)] \approx \max_{p \in M} \frac{1}{\lvert D \rvert} \sum_{x \in D} \log p(x)$$

#### Definition - What is maximum likelihood estimation?

- **Maximum Likelihood Estimation**: Given a data set $D$, choose parameters $\hat{\theta}$ that satisfy the following.
$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta, D)$$
    - i.e., Choose the parameters $\theta$ that best fit the data set $D$.

#### Loss Function - What is a loss function?

- **Loss Function** ($L(x, p)$): A measure of the loss that a model distribution $p$ makes on a particular instance $x$.
    - e.g., *MLE Loss Function*: $L(x, p) = -\log p(x)$
- **Important Note**: Assuming instances are sampled from some distribution $p^{\ast}$, to fit the best model, **MINIMIZE** the expected loss.
$$\mathbb{E}_{x \sim p^{\ast}}[L(x, p)] \approx \frac{1}{\lvert D \rvert} \sum_{x \in D} L(x, p)$$

#### Likelihood Function - What is a likelihood function?

- **Likelihood Function** ($L(\theta, D)$): The probability of observing the i.i.d. samples $D$, viewed as a function of the parameters $\theta$ over all their permissible values.

##### Example - Likelihood Function

1. Let $p(x)$ be a probability distribution where $x \in \{h, t\}$ such that $p(x = h) = \theta$ and $p(x = t) = 1 - \theta$.
2. Let $D = \{h, h, t, h, t\}$ be observed i.i.d. samples.
3. Accordingly, $p(x)$ models the outcome of a biased coin where parameter $\theta$ represents the probability of flipping heads and $1 - \theta$ represents the probability of flipping tails.
4. Express the likelihood function as the following.
$$L(\theta, D) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta) = \theta^{3} \cdot (1 - \theta)^{2}$$
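
To see where this likelihood peaks, the sketch below evaluates the log-likelihood of the coin example on a grid of $\theta$ values. The grid search is only an illustrative assumption; setting the derivative of $3 \log \theta + 2 \log(1 - \theta)$ to zero gives the same answer, $\hat{\theta} = 3/5$.

```python
import math

def log_likelihood(theta, data):
    """log L(theta, D) for i.i.d. coin flips, where 'h' has probability theta."""
    return sum(math.log(theta if x == "h" else 1 - theta) for x in data)

D = ["h", "h", "t", "h", "t"]

# Evaluate on a grid strictly inside (0, 1) to avoid log(0).
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=lambda theta: log_likelihood(theta, D))
print(theta_hat)
```

The maximizer equals the fraction of heads in $D$.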

#### Maximum Likelihood Learning - How does maximum likelihood learning estimate the CPDs in Bayesian networks?


1. Let $p(x) = \prod_{i = 1}^{n} \theta_{x_{i} \mid x_{pa(i)}}$ be a Bayesian network.
    - Where $\theta_{x_{i} \mid x_{pa(i)}}$ are parameters (CPDs) with **UNKNOWN VALUES**.
2. Let $D = \{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$ be i.i.d. samples.
3. Let $L(\theta, D) = \prod_{i = 1}^{n} \prod_{j = 1}^{m} \theta_{x_{i}^{(j)} \mid x_{pa(i)}^{(j)}}$ be the likelihood function.
4. Log and collect like terms of the likelihood function.
$$\log L(\theta, D) = \sum_{i = 1}^{n} \sum_{x_{pa(i)}} \sum_{x_{i}} \#(x_{i}, x_{pa(i)}) \cdot \log \theta_{x_{i} \mid x_{pa(i)}}$$
5. Maximize the (log) likelihood function by decomposing it into separate maximizations for the local conditional distributions.


- **Important Note**: The maximum-likelihood estimates of the parameters (CPDs) have closed-form solutions.
$$\theta_{x_{i} \mid x_{pa(i)}}^{\ast} = \frac{\#(x_{i}, x_{pa(i)})}{\#(x_{pa(i)})}$$
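
The closed-form solution amounts to counting. The following is a minimal sketch, assuming a hypothetical two-node network (`rain` → `wet`) and four made-up observations; the function `fit_cpds` and the data layout are illustrative choices, not part of the notes.

```python
from collections import Counter

def fit_cpds(data, parents):
    """Estimate theta_{x_i | x_pa(i)} = #(x_i, x_pa(i)) / #(x_pa(i)) for each variable.

    data: list of dicts mapping variable name -> observed value.
    parents: dict mapping variable name -> tuple of parent variable names.
    Returns cpds[var][(parent_values, value)] = estimated conditional probability.
    """
    cpds = {}
    for var, pa in parents.items():
        joint_counts = Counter()
        parent_counts = Counter()
        for row in data:
            pa_values = tuple(row[name] for name in pa)
            joint_counts[(pa_values, row[var])] += 1
            parent_counts[pa_values] += 1
        cpds[var] = {
            key: count / parent_counts[key[0]] for key, count in joint_counts.items()
        }
    return cpds

# Hypothetical network rain -> wet with four observations.
parents = {"rain": (), "wet": ("rain",)}
data = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
]
cpds = fit_cpds(data, parents)
print(cpds["rain"])
print(cpds["wet"])
```

Combinations of values that never appear in the data are simply absent from the result, which is one practical limitation of pure counting.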

### Bayesian Learning <span name="S04-02"></span>

#### Motivation - What are some problems with maximum likelihood estimation?

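- A maximum likelihood estimate is a single point estimate that carries no measure of confidence, e.g., 6 heads in 10 flips and 60 heads in 100 flips both give $\hat{\theta} = 0.6$. It treats the parameters as fixed unknowns and assumes that the only source of uncertainty is the variables $X$.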
- **Problem 1**: *Cannot Improve Confidence*
- **Problem 2**: *Cannot Incorporate Prior Knowledge*

#### Definitions - What are a *prior* and a *posterior*?

- **Bayesian Learning**: Explicitly model uncertainty over both variables $X$ and parameters $\theta$ by letting parameters be random variables.
- A **prior** is the earlier probability distribution of parameter $\theta$ **BEFORE** observing data $D$.
- A **posterior** is the later probability distribution of parameter $\theta$ **AFTER** observing data $D$.
$$p(\theta \mid D) = \frac{p(D \mid \theta) p(\theta)}{p(D)} 
\propto p(D \mid \theta) p(\theta)$$
$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$
- **Important Note 1**: Bayes' rule allows prior knowledge to be incorporated into a model's parameters.
- **Important Note 2**: Using Bayes' rule, the numerator is easy to calculate, but the denominator is difficult to calculate.

#### Conjugate Priors - What is a *conjugate prior*?

- A parametric family $\phi$ is **conjugate** for the likelihood $P(D \mid \theta)$ if:
$$P(\theta) \in \phi \implies P(\theta \mid D) \in \phi$$
- **Important Note**: If the normalizing constant of $\phi$ is known, then the denominator in Bayes' rule is easy to calculate.

#### Beta Distribution - What is the Beta distribution?

![Examples of Beta Distribution](images/BL_1.png)

- A **Beta distribution** is parameterized by two hyperparameters $\alpha > 0$ and $\beta > 0$ and has the following continuous probability distribution over $\theta \in [0, 1]$.
$$\theta \sim \text{Beta}(\alpha, \beta) \implies p(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}$$
    - Where the constant $\alpha$ intuitively corresponds to the number of **SUCCESSES** before observing new data.
    - Where the constant $\beta$ intuitively corresponds to the number of **FAILURES** before observing new data.
    - Where the constant $B(\alpha, \beta)$ is a normalizing constant defined by the following.
$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
    - Where the Gamma function $\Gamma(x)$ is the continuous generalization of the factorial function defined by the following.
$$\Gamma(x) = \int_{0}^{\infty} t^{x - 1} e^{-t} dt$$
- **Mean**: $$\text{mean}[\theta] = \frac{\alpha}{\alpha + \beta}$$
- **Variance**: $$\text{var}[\theta] = \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}$$
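
The density, normalizing constant, mean, and variance can be cross-checked numerically. The sketch below assumes hypothetical hyperparameters $\alpha = 3$ and $\beta = 2$ and uses a simple midpoint rule for the integrals; only `math.gamma` from the Python standard library is used.

```python
import math

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1)."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)

def beta_mean(a, b):
    return a / (a + b)

def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

a, b = 3.0, 2.0  # hypothetical hyperparameters
n = 10000
grid = [(k + 0.5) / n for k in range(n)]  # midpoints of n equal-width bins on (0, 1)

total = sum(beta_pdf(t, a, b) for t in grid) / n  # should be close to 1
mean = sum(t * beta_pdf(t, a, b) for t in grid) / n
variance = sum((t - mean) ** 2 * beta_pdf(t, a, b) for t in grid) / n

print(total, mean, variance)
print(beta_mean(a, b), beta_variance(a, b))
```

The numerical integrals should agree with the closed-form mean and variance up to the discretization error of the grid.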

#### Conjugate Priors and Beta Distribution - How do you calculate a posterior with data observed from a binary process?

- The beta distribution is the conjugate prior for the following probability distributions:
    - **Bernoulli**: A discrete Bernoulli random variable, $X$, is the outcome from a single experiment from which this outcome is classified as either a success, $X = 1$ with probability $p$, or a failure, $X = 0$ with probability $1 - p$.
    - **Binomial**: A discrete binomial random variable, $X$, is the number of successful outcomes from a sequence of $n$ independent experiments in which each experiment has an outcome classified as either a success with probability $p$ or a failure with probability $1 - p$.
    - **Geometric**: A discrete geometric random variable, $X$, is the number of Bernoulli trials with probability $p$ needed to get one success.
    - **Negative Binomial**: A discrete negative binomial random variable, $X$, is the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures.
- **Important Note**: To fit the success probability $\theta$ of a binary process, the Beta distribution can be used as follows.
    1. Assign $\text{Beta}(\alpha, \beta)$ as a prior to $\theta$.
    2. Observe data generated by a binary process.
        - If $X \sim \text{Bernoulli}(\theta)$, then the posterior is $\text{Beta}(\alpha + 1, \beta)$ when $X = 1$ or $\text{Beta}(\alpha, \beta + 1)$ when $X = 0$.
        - If $X \sim \text{Binomial}(N, \theta)$, then the posterior is $\text{Beta}(\alpha + X, \beta + N - X)$.
        - If $X \sim \text{Geometric}(\theta)$, where $X$ is the number of trials needed to get one success, then the posterior is $\text{Beta}(\alpha + 1, \beta + X - 1)$.
        - If $X \sim \text{Negative-Binomial}(R, \theta)$, then the posterior is $\text{Beta}(\alpha + X, \beta + R)$.
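
The Bernoulli and Binomial updates can be sketched as simple hyperparameter arithmetic. This example assumes a hypothetical uniform prior $\text{Beta}(1, 1)$ and reuses the coin flips $D = \{h, h, t, h, t\}$ encoded as 1 for heads and 0 for tails; the function names are illustrative.

```python
def update_bernoulli(alpha, beta, x):
    """Posterior hyperparameters after one Bernoulli outcome x (1 = success, 0 = failure)."""
    return alpha + x, beta + (1 - x)

def update_binomial(alpha, beta, successes, n):
    """Posterior hyperparameters after observing `successes` out of n Binomial trials."""
    return alpha + successes, beta + (n - successes)

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

flips = [1, 1, 0, 1, 0]  # h, h, t, h, t
alpha, beta = 1, 1       # hypothetical uniform prior Beta(1, 1)

# Update one observation at a time.
for x in flips:
    alpha, beta = update_bernoulli(alpha, beta, x)

# Update with all observations at once.
batch = update_binomial(1, 1, sum(flips), len(flips))

print((alpha, beta), batch, posterior_mean(alpha, beta))
```

Sequential and batch updates give the same posterior, $\text{Beta}(4, 3)$, whose mean $4/7$ lies between the prior mean $1/2$ and the maximum likelihood estimate $3/5$.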