# ECE 493 - Probabilistic Reasoning and Decision Making

## Review of Probability Theory

- *See [Stanford CS228 - Probability Review](https://ermongroup.github.io/cs228-notes/preliminaries/probabilityreview/).*

## Representation

- **Problem**: *How do we express a probability distribution $p(x_{1}, x_{2}, ..., x_{n})$ that models some real-world phenomenon?*
    - *Naive Complexity*: $O(d^{n})$
- **Solution**: *Representation with Probabilistic Graphical Models + Verifying Independence Assumptions*

### Bayesian Networks (Directed Probabilistic Graphical Model)

#### Definition - What is a Bayesian network?

- A **Bayesian network** is a directed graph $G$ with the following:
    - *Nodes*: One node per random variable $x_{i}$.
    - *Edges*: Directed edges from the parents $x_{A_{i}}$ to $x_{i}$, with one conditional probability distribution (CPD) $p(x_{i} \mid x_{A_{i}})$ per node, specifying the probability of $x_{i}$ conditioned on its parents' values.

#### Representation - How does a Bayesian network express a probability distribution?

1. Let $p$ be a probability distribution.
2. A naive representation of $p$ can be derived using the chain rule:
$$p(x_{1}, x_{2}, ..., x_{n}) = p(x_{1}) p(x_{2} \mid x_{1}) \cdots p(x_{n} \mid x_{n - 1}, ..., x_{2}, x_{1})$$
3. A Bayesian network representation of $p$ compacts the naive representation by having each factor on the right-hand side depend only on a small number of **ancestor variables** $x_{A_{i}}$:
$$p(x_{i} \mid x_{i - 1}, ..., x_{2}, x_{1}) = p(x_{i} \mid x_{A_{i}})$$
    - e.g., Approximate $p(x_{5} \mid x_{4}, x_{3}, x_{2}, x_{1})$ with $p(x_{5} \mid x_{A_{5}})$ where $x_{A_{5}} = \{x_{4}, x_{3}\}$.
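
The factorization above can be sketched for a hypothetical three-variable chain $x_{1} \rightarrow x_{2} \rightarrow x_{3}$ with binary variables. The CPD values and the names `p_x1`, `p_x2_given_x1`, `p_x3_given_x2`, and `joint` are invented for illustration; they are not part of the course material.

```python
from itertools import product

# Hypothetical CPDs for a chain x1 -> x2 -> x3 (all numbers are invented).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [x1][x2]
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [x2][x3]


def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x2)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]


# Because every CPD is normalized, the product sums to one without any
# extra normalizing constant.
total = sum(joint(*xs) for xs in product([0, 1], repeat=3))
```

Here $x_{A_{3}} = \{x_{2}\}$, so the factor for $x_{3}$ never looks at $x_{1}$.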

#### Space Complexity - How compact is a Bayesian network?

- Consider each of the factors $p(x_{i} \mid x_{A_{i}})$ as a **probability table**:
    - *Rows*: Values of $x_{i}$
    - *Columns*: Values of $x_{A_{i}}$
    - *Cells*: Values of $p(x_{i} \mid x_{A_{i}})$
- If each variable takes $d$ values and has at most $k$ ancestors, then each probability table has at most $O(d^{k + 1})$ entries.
- **Naive Representation Space Complexity**: $O(d^n)$
- **Bayesian Networks Representation Space Complexity**: $O(n \cdot d^{k + 1})$
$$\text{Bayesian Network Representation} \ll \text{Naive Representation} \quad \text{when } k \ll n$$
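
To get a feel for the two bounds, the short sketch below plugs in example numbers. The function names and the values of `n`, `d`, and `k` are assumptions chosen for illustration.

```python
def naive_table_size(n, d):
    """Entries in one joint table over n variables with d values each."""
    return d ** n


def bayes_net_table_size(n, d, k):
    """Upper bound: n tables, each with at most d ** (k + 1) entries."""
    return n * d ** (k + 1)


# Example: 20 binary variables, at most 3 parents per node.
n, d, k = 20, 2, 3
print(naive_table_size(n, d))         # 1048576
print(bayes_net_table_size(n, d, k))  # 320
```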

#### Independence Assumptions - Why are the independence assumptions of a Bayesian network important to identify?

- A Bayesian network expresses a probability distribution $p$ via products of smaller, local conditional probability distributions (one for each variable).
- These smaller, local conditional probability distributions introduce assumptions into the model of $p$ that certain variables are independent.
- **Important Note**: Which independence assumptions are we exactly making by using a Bayesian network?
    - *Correctness: Are these independence assumptions correct?*
    - *Efficiency: Do these independence assumptions efficiently compact the representation?*

#### $3$-Variable Independencies in Directed Graphs - How do you identify independent variables in a $3$-variable Bayesian network?

- Let $x \perp y$ indicate that variables $x$ and $y$ are independent.
- Let $G$ be a Bayesian network with three nodes: $A$, $B$, and $C$.

##### Common Parent

- If $G$ is of the form $A \leftarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $A$ and $C$; once it is observed, there is nothing else that affects $A$'s and $C$'s outcomes.

##### Cascade

- If $G$ equals $A \rightarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $C$; once it is observed, there is nothing else that affects $C$'s outcomes.

##### V-Structure

- If $G$ is $A \rightarrow C \leftarrow B$, then knowing $C$ couples $A$ and $B$.
    - If $C$ is unobserved, then $A \perp B$
    - If $C$ is observed, then $A \not\perp B \mid C$
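
The V-structure case can be checked numerically on a toy model. The sketch below assumes a hypothetical distribution in which $A$ and $B$ are independent fair coins and $C = A \lor B$; the helper names `p` and `prob` are invented for this example.

```python
from itertools import product


def p(a, b, c):
    """Joint of the toy model: A, B fair and independent, C = A OR B."""
    return 0.25 if c == (a | b) else 0.0


def prob(**fixed):
    """Sum the joint over all assignments consistent with `fixed`."""
    return sum(
        p(a, b, c)
        for a, b, c in product([0, 1], repeat=3)
        if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items())
    )


# C unobserved: P(A=1, B=1) equals P(A=1) * P(B=1).
marginally_independent = abs(prob(a=1, b=1) - prob(a=1) * prob(b=1)) < 1e-12

# C observed: P(A=1, B=1 | C=1) differs from P(A=1 | C=1) * P(B=1 | C=1).
pc = prob(c=1)
lhs = prob(a=1, b=1, c=1) / pc                       # 1/3
rhs = (prob(a=1, c=1) / pc) * (prob(b=1, c=1) / pc)  # 4/9
```

Since `lhs` and `rhs` differ, observing $C$ couples $A$ and $B$ in this toy model.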

#### $n$-Variable Independencies in Directed Graphs - How do you identify independent variables in an $n$-variable Bayesian network?

- Let $I(p)$ be the set of all independencies that hold for a probability distribution $p$.
- Let $I(G) = \{(X \perp Y \mid Z) : X, Y \text{ are } d\text{-sep given } Z\}$ be the set of independence statements implied by $d$-separation in $G$.
- If the probability distribution $p$ factorizes over $G$, then $I(G) \subseteq I(p)$ and $G$ is an $I$-map (**independence map**) for $p$.
- **Important Note 1**: Thus, variables that are $d$-separated in $G$ are independent in $p$.
- **Important Note 2**: However, a probability distribution $q$ can factorize over $G$, yet have independencies that are not captured in $G$.
- **Important Caveat**: A Bayesian network cannot perfectly represent all probability distributions.

##### $d$-separation (a.k.a. Directed Separation)

- $Q$ and $W$ are **$d$-separated** when variables $O$ are observed if they are **NOT CONNECTED** by an active path.

##### Active Path

- An undirected path in the Bayesian Network structure $G$ is called **active** given observed variables $O$ if for **EVERY CONSECUTIVE TRIPLE** of variables $X$, $Y$, $Z$ on the path, one of the following holds:
    - **Evidential Trail**: $X \leftarrow Y \leftarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Causal Trail**: $X \rightarrow Y \rightarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Common Cause**: $X \leftarrow Y \rightarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Common Effect**: $X \rightarrow Y \leftarrow Z$, and $Y$ or any of its descendants is observed
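
The active-path definition above can be turned into a brute-force check that enumerates every simple path and tests each consecutive triple. This is a minimal sketch, not an efficient algorithm. The edge list `EDGES` is an assumption read off the paths written out in the worked examples in these notes (the figures may contain more edges), and all function names are invented for illustration.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_active(path, edges, observed, children):
    """Check every consecutive triple (x, y, z) on an undirected path."""
    for x, y, z in zip(path, path[1:], path[2:]):
        if (x, y) in edges and (z, y) in edges:
            # Common effect: needs y or one of its descendants observed.
            if y not in observed and not (descendants(children, y) & observed):
                return False
        elif y in observed:
            # Causal trail, evidential trail, common cause: need y unobserved.
            return False
    return True


def simple_paths(neighbors, start, goal, path=None):
    """Yield all simple undirected paths from `start` to `goal`."""
    path = path or [start]
    if start == goal:
        yield path
        return
    for nxt in neighbors.get(start, ()):
        if nxt not in path:
            yield from simple_paths(neighbors, nxt, goal, path + [nxt])


def d_separated(edges, a, b, observed):
    """True if no active path connects a and b given `observed`."""
    edges, observed = set(edges), set(observed)
    children, neighbors = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return not any(
        is_active(p, edges, observed, children)
        for p in simple_paths(neighbors, a, b)
    )


# Assumed edges (parent, child), taken from the paths in the worked examples.
EDGES = [("X1", "X2"), ("X1", "X3"), ("X2", "X6"), ("X3", "X5"), ("X5", "X6")]
print(d_separated(EDGES, "X1", "X6", {"X2", "X3"}))  # True
print(d_separated(EDGES, "X2", "X3", {"X1", "X6"}))  # False
```

Path enumeration is exponential in the worst case, so this is only suitable for small graphs like the ones in these notes.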

#### Equivalence - When are two Bayesian networks $I$-equivalent?

- $G_1$ and $G_2$ are **$I$-equivalent**...
    - If they encode the same independencies: $I(G_1) = I(G_2)$.
    - If they have the same skeleton and the same v-structures.
    - If the $d$-separation between variables is the same.
    
##### Skeleton

![Skeleton](images/BN_1.png)

- A **skeleton** is an undirected graph obtained by dropping the directionality of the arrows.
    - (a) is Cascade
    - (b) is Cascade
    - (c) is Common Parent
    - (d) is V-Structure
    - (a), (b), (c), and (d) have the same skeleton.
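
The "same skeleton and same v-structures" test can be sketched as follows. The node names `X`, `Y`, `Z` are hypothetical, since the figure's labels are not reproduced in the text. The sketch also assumes the usual convention that a v-structure only counts when its two parents are not themselves connected; that restriction is an assumption beyond what is stated above.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a directed edge list."""
    return {frozenset(e) for e in edges}


def v_structures(edges):
    """Triples (parent, child, parent) whose two parents are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    return {
        (a, child, b)
        for child, ps in parents.items()
        for a, b in combinations(sorted(ps), 2)
        if frozenset((a, b)) not in skel
    }


def i_equivalent(edges1, edges2):
    return (
        skeleton(edges1) == skeleton(edges2)
        and v_structures(edges1) == v_structures(edges2)
    )


# Hypothetical three-node graphs, edges written as (parent, child).
cascade_right = [("X", "Y"), ("Y", "Z")]
cascade_left = [("Z", "Y"), ("Y", "X")]
common_parent = [("Y", "X"), ("Y", "Z")]
v_structure = [("X", "Y"), ("Z", "Y")]

print(i_equivalent(cascade_right, cascade_left))   # True
print(i_equivalent(cascade_right, common_parent))  # True
print(i_equivalent(cascade_right, v_structure))    # False
```

All four graphs share a skeleton, but only the last one contains a v-structure, so it is the odd one out.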

#### Example Problem 1 - $d$-separation

![Problem 1](images/BN_P1.png)

##### Question

- Are $X_{1}$ and $X_{6}$ $d$-separated given $\{X_{2}, X_{3}\}$?

##### Solution

1. **Path**: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
        - This triple is a *causal trail*, which requires $X_{2}$ to be unobserved. As $X_{2}$ is observed, the *causal trail* does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{1} \rightarrow X_{3} \rightarrow X_{5} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{3} \rightarrow X_{5}$
        - This triple is a *causal trail*, which requires $X_{3}$ to be unobserved. As $X_{3}$ is observed, the *causal trail* does not hold.
    2. *Consecutive Triple*: $X_{3} \rightarrow X_{5} \rightarrow X_{6}$
        - As $X_{5}$ is unobserved, the *causal trail* does hold.
    3. As not all the consecutive triples hold, this path is not *active*.
3. As there are no active paths between $X_{1}$ and $X_{6}$, they are $d$-separated given $\{X_{2}, X_{3}\}$.

#### Example Problem 2 - $d$-separation

![Problem 2](images/BN_P2.png)

##### Question

- Are $X_{2}$ and $X_{3}$ $d$-separated given $\{X_{1}, X_{6}\}$?

##### Solution

1. **Path**: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
        - This triple is a *common cause*, which requires $X_{1}$ to be unobserved. As $X_{1}$ is observed, the *common cause* does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{2} \rightarrow X_{6} \leftarrow X_{5} \leftarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \rightarrow X_{6} \leftarrow X_{5}$
        - As $X_{6}$ is observed, the *common effect* does hold.
    2. *Consecutive Triple*: $X_{6} \leftarrow X_{5} \leftarrow X_{3}$
        - As $X_{5}$ is unobserved, the *evidential trail* does hold.
    3. As all the consecutive triples hold, this path is *active*.
3. As there exists an active path between $X_{2}$ and $X_{3}$, they are not $d$-separated given $\{X_{1}, X_{6}\}$.

### Markov Random Fields (Undirected Probabilistic Graphical Model)

#### Definition - What is a Markov random field?

- A **Markov random field** is an undirected graph $G$ with the following:
    - *Nodes*: A random variable $x_{i}$.
    - *Fully Connected Subgraphs*: An optional factor $\phi_{c}(x_{c})$ per clique, specifying the level of coupling (**potentials**) between all the dependent variables within the clique.
- **Important Note**:
>...SPECIFYING THE LEVEL OF COUPLING BETWEEN ALL THE DEPENDENT VARIABLES WITHIN THE CLIQUE...

#### Representation - How does a Markov random field express a probability distribution?

1. Let $p$ be a probability distribution.
2. A Markov random field representation of $p$ is the following:
$$p(x_{1}, x_{2}, ..., x_{n}) = \frac{1}{Z} \prod_{c \in C} \phi_{c}(x_{c})$$
    - Where $C$ is the set of cliques of $G$.
    - Where $\phi_{c}$ is a **factor** (nonnegative function) over the variables in a clique.
    - Where $Z$ is a **normalizing constant** that ensures that $p$ sums to one.
$$Z = \sum_{x_{1}, x_{2}, ..., x_{n}} \prod_{c \in C} \phi_{c}(x_{c})$$
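
A brute-force sketch of this definition on a hypothetical chain $x_{1} - x_{2} - x_{3}$ of binary variables is shown below. The potential `phi` (which simply favors agreeing neighbors) and the clique list are invented for illustration.

```python
from itertools import product


def phi(a, b):
    """Hypothetical pairwise potential: agreeing neighbors score higher."""
    return 2.0 if a == b else 1.0


# Cliques of the chain x1 - x2 - x3, given as index pairs into x.
cliques = [(0, 1), (1, 2)]


def unnormalized(x):
    """Product of the clique potentials for one full assignment x."""
    score = 1.0
    for i, j in cliques:
        score *= phi(x[i], x[j])
    return score


assignments = list(product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in assignments)  # normalizing constant
p = {x: unnormalized(x) / Z for x in assignments}

print(Z)  # 18.0
```

Unlike the CPDs of a Bayesian network, the potentials here are not probabilities, so the division by $Z$ is what makes `p` sum to one. Computing $Z$ this way touches all $d^{n}$ assignments.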

#### Space Complexity - How compact is a Markov random field?

##### Factor Product

- Let $A$, $B$, and $C$ be three disjoint sets of variables.
- Let $\phi_{1}(A, B)$ and $\phi_{2}(B, C)$ be two factors.
- Let $\psi(A, B, C)$ be the factor product $\phi_{1} \times \phi_{2}$.
$$\psi(A, B, C) = \phi_{1}(A, B) \cdot \phi_{2}(B, C)$$
    - Where the two factors are multiplied for common values of $B$.
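
As a small illustration of the factor product, the sketch below stores each factor as a dictionary keyed by value tuples. The factor values and the helper name `factor_product` are assumptions made up for this example, with $A$, $B$, and $C$ each reduced to a single binary variable.

```python
from itertools import product


def factor_product(phi1, phi2, a_vals, b_vals, c_vals):
    """psi[(a, b, c)] = phi1[(a, b)] * phi2[(b, c)], matching on b."""
    return {
        (a, b, c): phi1[(a, b)] * phi2[(b, c)]
        for a, b, c in product(a_vals, b_vals, c_vals)
    }


# Hypothetical factor tables over binary variables.
phi1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
phi2 = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}

psi = factor_product(phi1, phi2, [0, 1], [0, 1], [0, 1])
print(psi[(1, 0, 1)])  # phi1[(1, 0)] * phi2[(0, 1)] = 3.0 * 6.0 = 18.0
```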

##### Binary Factor Tables

- In a *pairwise* Markov random field, each of the optional factors $\phi_{c}(x_{c})$ is defined over a single edge and can be stored as a **binary factor table** $\phi(X, Y)$:
    - *Rows*: Values of $X$
    - *Columns*: Values of $Y$
    - *Cells*: Values of $\phi(X, Y)$
- If each variable takes $d$ values, each binary factor table has at most $O(d^{2})$ entries.
- **Markov Random Fields Representation Space Complexity**: $O(E \cdot d^{2})$
    - Where $E$ is the number of edges in a Markov random field.
$$\text{Markov Random Field Representation} \ll \text{Naive Representation} \quad \text{for large } n$$

#### Markov Random Fields vs. Bayesian Networks - What are the advantages and disadvantages of Markov random fields?

##### Advantages

- *Applicable for Variable Dependencies Without Natural Directionality*
- *Succinctly Express Dependencies Not Easily Expressible in Bayesian Networks*

##### Disadvantages

- *Cannot Express Dependencies Easily Expressible in Bayesian Networks*
    - *e.g., V-Structures*
- *Computing Normalization Constant $Z$ Is NP-Hard in General*
- *Generally Require Approximation Techniques*
- *Difficult to Interpret*
- *Easier to Construct Bayesian Networks*

#### Moralization - What is moralization?

![Moralization](images/MRF_1.png)

- Bayesian networks are a special case of Markov random fields with factors corresponding to conditional probability distributions and a normalizing constant of one.
- **Moralization**: Bayesian Network $\to$ Markov Random Field
    1. Add an undirected edge between every pair of parents of a given node.
    2. Remove the directionality of all the edges.
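
The two moralization steps can be sketched as below. The function name `moralize` and the edge-list representation are assumptions for illustration; edges are written as (parent, child) pairs.

```python
from itertools import combinations


def moralize(directed_edges):
    """Return the undirected edge set of the moralized graph."""
    parents = {}
    for u, v in directed_edges:
        parents.setdefault(v, set()).add(u)
    # Step 2: drop edge directions.
    undirected = {frozenset(e) for e in directed_edges}
    # Step 1: connect every pair of parents that share a child.
    for ps in parents.values():
        undirected |= {frozenset(pair) for pair in combinations(ps, 2)}
    return undirected


# V-structure A -> C <- B: moralization adds the edge A - B.
moral = moralize([("A", "C"), ("B", "C")])
```

For the V-structure, the added edge $A - B$ reflects that the factor $p(C \mid A, B)$ involves all three variables and therefore needs a clique containing them.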

#### $n$-Variable Independencies in Undirected Graphs - How do you identify independent variables in an $n$-variable Markov random field?

1. If variables $X$ and $Y$ are connected by a path of unobserved variables, then $X$ and $Y$ are dependent.
2. If variable $X$'s neighbors are all observed, then $X$ is conditionally independent of all the other variables given its neighbors.
3. If a set of observed variables forms a cut-set between two halves of the graph, then variables in one half are independent from ones in the other.

##### Cut-Set Variable Independencies

![Cut-Set Variable Independencies](images/MRF_2.png)

##### Markov Blanket

- The **Markov blanket** $U$ of a variable $X$ is the minimal set of nodes such that $X$ is independent from the rest of the graph if $U$ is observed.
$$X \perp (\mathcal{X} - \{X\} - U) \mid U$$
- In an undirected graph, the Markov blanket is a node's neighborhood.
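
Both the cut-set rule and the Markov blanket reduce to simple graph operations in an undirected graph. The sketch below is a minimal illustration on a hypothetical chain $A - B - C - D$; the function names are invented.

```python
def neighbors_of(edges):
    """Adjacency sets for an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def markov_blanket(edges, x):
    """In an undirected graph, the Markov blanket is the neighborhood."""
    return neighbors_of(edges).get(x, set())


def separated(edges, x, y, observed):
    """True if every path from x to y passes through an observed node."""
    nbrs = neighbors_of(edges)
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False  # reached y through unobserved nodes only
        for nxt in nbrs.get(node, ()):
            if nxt not in seen and nxt not in observed:
                seen.add(nxt)
                stack.append(nxt)
    return True


# Hypothetical chain A - B - C - D.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(separated(edges, "A", "D", {"C"}))  # True: C is a cut-set
print(separated(edges, "A", "D", set()))  # False: unobserved path exists
```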

#### Conditional Random Fields - What are conditional random fields?

##### Definition

- A **conditional random field** is a Markov random field over variables $\mathcal{X} \cup \mathcal{Y}$ which specifies a conditional distribution:
$$
\begin{align}
P(y \mid x) &= \frac{1}{Z(x)} \prod_{c \in C} \phi_{c}(x_{c}, y_{c}) \\
Z(x) &= \sum_{y \in \mathcal{Y}} \prod_{c \in C} \phi_{c}(x_{c}, y_{c})
\end{align}
$$
    - Where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ are **VECTOR-VALUED** variables.
    - Where $Z(x)$ is the partition function.
- **Important Note 1**: A conditional random field results in an instantiation of a new Markov random field for each input $x$.
- **Important Note 2**: A conditional random field is useful for structured prediction in which the output labels are predicted considering the neighboring input samples.
    - *See [Stanford CS228 - Markov Random Fields: Conditional Random Fields (OCR Example)](https://ermongroup.github.io/cs228-notes/representation/undirected/#conditional-random-fields).*

##### Features

- Assume the factors $\phi_{c}(x_{c}, y_{c})$ are of the following form:
$$\phi_{c}(x_{c}, y_{c}) = \exp(w_{c}^{T} f_{c}(x_{c}, y_{c}))$$
    - Where $f_{c}(x_{c}, y_{c})$ can be an arbitrary set of features describing the compatibility between $x_{c}$ and $y_{c}$.
    - Where $w_{c}^{T}$ is the transposed weight vector.
- Accordingly, $f_{c}(x_{c}, y_{c})$ allows arbitrarily complex features.
    - e.g., $f(x, y_{i})$ are features that depend on the entirety of input samples $x$.
    - e.g., $f(y_{i}, y_{i + 1})$ are features that depend on successive pairs of output labels $y$.
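
The log-linear form above can be sketched for a tiny chain-structured conditional random field with binary labels. Everything specific here is an assumption made for illustration: the two indicator features, the weights `W_NODE` and `W_EDGE`, and the brute-force computation of $Z(x)$, which only works for very short sequences.

```python
import math
from itertools import product

LABELS = [0, 1]
W_NODE = [1.5]  # invented weight for the node feature
W_EDGE = [0.5]  # invented weight for the edge feature


def node_features(x_i, y_i):
    """Hypothetical feature: does the label match the binary input?"""
    return [1.0 if x_i == y_i else 0.0]


def edge_features(y_i, y_next):
    """Hypothetical feature: do successive labels agree?"""
    return [1.0 if y_i == y_next else 0.0]


def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))


def score(x, y):
    """Product of factors exp(w^T f) over all node and edge cliques."""
    s = sum(dot(W_NODE, node_features(xi, yi)) for xi, yi in zip(x, y))
    s += sum(dot(W_EDGE, edge_features(a, b)) for a, b in zip(y, y[1:]))
    return math.exp(s)


def conditional(x):
    """P(y | x) for every labeling y, normalized by Z(x)."""
    scores = {y: score(x, y) for y in product(LABELS, repeat=len(x))}
    z_x = sum(scores.values())  # partition function, recomputed per input x
    return {y: s / z_x for y, s in scores.items()}


p = conditional((0, 1, 1))
best = max(p, key=p.get)
print(best)  # (0, 1, 1)
```

Because the sum in `conditional` runs only over labelings $y$, the input $x$ is never modeled; it only selects which Markov random field over $y$ is instantiated.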

#### Conditional Random Fields vs. Markov Random Fields - Why is a conditional random field a special case of Markov random fields?

- If we were to model $p(x, y)$ using a Markov random field, then we need to fit two probability distributions to the data: $p(y \mid x)$ and $p(x)$.
    - *Remember the Chain Rule*: $p(x, y) = p(y \mid x) \cdot p(x)$
- However, if all we are interested in is predicting $y$ given $x$, then modeling $p(x)$ is expensive and unnecessary.
$$\text{Prediction} \implies \text{CRF} > \text{MRF}$$

#### Factor Graphs - What is a factor graph? Why does a factor graph exist?

![Factor Graph](images/MRF_3.png)

- A **factor graph** is a bipartite graph where one group is the variables in the distribution being modeled, and the other group is the factors defined on these variables.
    - *Edges Between Factors and Variables*
- **Side Note**: A **bipartite graph** is a graph whose vertices are divided into two disjoint and independent sets.
    - *Set 1: Variables*
    - *Set 2: Factors*
- **Important Note**: Use a factor graph to identify what variables a factor depends on when computing probability distributions.
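
A factor graph can be stored as a mapping from each factor to the variables in its scope. The sketch below uses hypothetical factor names `f1`, `f2`, `f3` and variables `A`, `B`, `C`, none of which come from the figure.

```python
# Hypothetical factor graph: factor name -> variables in its scope.
factor_scopes = {
    "f1": ("A", "B"),
    "f2": ("B", "C"),
    "f3": ("C",),
}


def variable_to_factors(scopes):
    """Invert the mapping: variable -> set of factors that depend on it."""
    incident = {}
    for factor, variables in scopes.items():
        for var in variables:
            incident.setdefault(var, set()).add(factor)
    return incident


incident = variable_to_factors(factor_scopes)

# Every edge joins one factor to one variable, so the graph is bipartite.
edges = {(f, v) for f, vs in factor_scopes.items() for v in vs}

print(sorted(incident["B"]))  # ['f1', 'f2']
```

Looking up `incident[var]` answers which factors must be multiplied together when a variable is summed out.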

## Inference

- **Problem**: *Given a probabilistic model, how do we obtain answers to relevant questions about the world?*
    - **Marginal Inference**: *What is the probability of a given variable in our model after we sum everything else out?*
    - **Maximum A Posteriori**: *What is the most likely assignment of variables?*
    - *Naive Complexity*: NP-Hard
- **Solution**: TODO.
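
For very small models, both queries in the problem statement can be answered by brute-force enumeration, which makes the exponential cost concrete. The sketch below is only a baseline under invented assumptions: a hypothetical three-variable binary model with pairwise potentials `phi` and an extra preference for $x_{1} = 1$.

```python
from itertools import product


def phi(a, b):
    """Hypothetical pairwise potential: agreeing neighbors score higher."""
    return 2.0 if a == b else 1.0


def unnormalized(x):
    """Chain x1 - x2 - x3 plus an invented unary preference for x1 = 1."""
    return phi(x[0], x[1]) * phi(x[1], x[2]) * (3.0 if x[0] == 1 else 1.0)


assignments = list(product([0, 1], repeat=3))  # d ** n assignments
Z = sum(unnormalized(x) for x in assignments)


def marginal(index, value):
    """Marginal inference: sum out every variable except x[index]."""
    return sum(unnormalized(x) for x in assignments if x[index] == value) / Z


# Maximum a posteriori: the single most likely full assignment.
map_assignment = max(assignments, key=unnormalized)

print(marginal(0, 1))   # 0.75
print(map_assignment)   # (1, 1, 1)
```

Both loops visit all $d^{n}$ assignments, so this approach stops being practical after a few dozen variables.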