# ECE 493 - Probabilistic Reasoning and Decision Making

## Preliminaries - Review of Probability Theory

- *See [Stanford CS228 - Probability Review](https://ermongroup.github.io/cs228-notes/preliminaries/probabilityreview/).*

## Representation

- **Problem**: *How do we express a probability distribution $p(x_{1}, x_{2}, ..., x_{n})$ that models some real-world phenomenon?*
    - *Naive Complexity*: $O(d^{n})$ table entries for $n$ variables that each take $d$ values
- **Solution**: *Representation with Probabilistic Graphical Models + Verifying Independence Assumptions*

### Bayesian Networks (Directed Probabilistic Graphical Model)

#### Definition - What is a Bayesian network?

- A **Bayesian network** is a directed acyclic graph $G = (V, E)$ with the following:
    - *Nodes*: A random variable $x_{i}$ for each node $i \in V$.
    - *Edges*: A directed edge into node $i$ from each of its parents, together with one conditional probability distribution (CPD) $p(x_{i} \mid x_{A_{i}})$ per node, specifying the probability of $x_{i}$ conditioned on its parents' values.

#### Representation - How does a Bayesian network express a probability distribution?

1. Let $p$ be a probability distribution.
2. A naive representation of $p$ can be derived using the chain rule:
$$p(x_{1}, x_{2}, ..., x_{n}) = p(x_{1}) p(x_{2} \mid x_{1}) \cdots p(x_{n} \mid x_{n - 1}, ..., x_{2}, x_{1})$$
3. A Bayesian network representation of $p$ compacts the naive representation by having each factor on the right-hand side depend only on a small number of **ancestor variables** $x_{A_{i}}$:
$$p(x_{i} \mid x_{i - 1}, ..., x_{2}, x_{1}) = p(x_{i} \mid x_{A_{i}})$$
    - e.g., Approximate $p(x_{5} \mid x_{4}, x_{3}, x_{2}, x_{1})$ with $p(x_{5} \mid x_{A_{5}})$ where $x_{A_{5}} = \{x_{4}, x_{3}\}$.
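
The factorization in the steps above can be sketched in a few lines. This is a minimal illustration that assumes a hypothetical three-variable binary network in which $x_{3}$ keeps only $x_{1}$ as its ancestor variable; the CPD numbers and the names `p_x1`, `p_x2_given_x1`, `p_x3_given_x1`, and `joint` are made up for the example.

```python
from itertools import product

# Hypothetical CPDs for binary variables; the numbers are illustrative only.
p_x1 = {0: 0.6, 1: 0.4}                        # p(x1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},          # p(x2 | x1), outer key is x1
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x1 = {0: {0: 0.9, 1: 0.1},          # p(x3 | x1), x2 has been dropped
                 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1[x1][x3]

# The product of local CPDs is a valid distribution: it sums to 1.
total = sum(joint(*xs) for xs in product([0, 1], repeat=3))
print(joint(1, 1, 0))
print(total)
```

Each factor only needs a table indexed by the variable and its chosen ancestors, which is where the space savings come from.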

#### Space Complexity - How compact is a Bayesian network representation?

- Consider each of the factors $p(x_{i} \mid x_{A_{i}})$ as a **probability table**:
    - *Rows*: Values of $x_{i}$
    - *Columns*: Values of $x_{A_{i}}$
    - *Cells*: Values of $p(x_{i} \mid x_{A_{i}})$
- If each variable takes $d$ values and has at most $k$ ancestors, then each probability table has at most $O(d^{k + 1})$ entries.
- **Naive Representation Space Complexity**: $O(d^n)$
- **Bayesian Networks Representation Space Complexity**: $O(nd^{k + 1})$
$$\therefore \text{Bayesian Networks Representation} \ll \text{Naive Representation} \quad \text{when } k \ll n$$
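
To give a feel for the gap between the two bounds, the sketch below plugs in hypothetical sizes ($n = 20$, $d = 2$, $k = 3$). The function names are illustrative, and the Bayesian network count is the upper bound from above, not an exact parameter count.

```python
def naive_table_size(n, d):
    """Entries in a full joint table over n variables with d values each."""
    return d ** n

def bayes_net_table_size(n, d, k):
    """Upper bound on total CPD entries: n tables of at most d**(k + 1) entries."""
    return n * d ** (k + 1)

n, d, k = 20, 2, 3  # hypothetical sizes
print(naive_table_size(n, d))         # 1048576
print(bayes_net_table_size(n, d, k))  # 320
```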

#### Independence Assumptions - Why should we care about the independence assumptions introduced by a Bayesian network?

- A Bayesian network expresses a probability distribution $p$ via products of smaller, local conditional probability distributions (one for each variable).
- A Bayesian network introduces assumptions into the model of $p$ that certain variables are independent.
- **Important Note**: Exactly which independence assumptions are we making by using a Bayesian network model with a given structure described by $G$?
    - *Important for Correctness: Are these independence assumptions valid?*
    - *Important for Efficiency: Are there additional independence assumptions to compact the representation?*

#### $3$-Variable Independencies in Directed Graphs - How do you identify the pairs of independent variables in a $3$-variable Bayesian network?

- Let $x \perp y$ indicate that variables $x$ and $y$ are independent.
- Let $I(p)$ be the set of all independencies that hold for a joint probability distribution $p$.
- Let $G$ be a Bayesian network with three nodes: $A$, $B$, and $C$.

##### Common Parent

- If $G$ is of the form $A \leftarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $A$ and $C$; once it is observed, there is nothing else that affects $A$'s and $C$'s outcomes.
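
The common-parent case can be checked numerically. The sketch below assumes hypothetical CPDs for binary $A$, $B$, $C$ (the numbers are made up), builds the joint $p(b)\,p(a \mid b)\,p(c \mid b)$, and tests both statements by enumeration.

```python
from itertools import product

# Hypothetical CPDs for the structure A <- B -> C (binary variables).
p_b = {0: 0.5, 1: 0.5}
p_a_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # outer key is b
p_c_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # outer key is b

joint = {(a, b, c): p_b[b] * p_a_given_b[b][a] * p_c_given_b[b][c]
         for a, b, c in product([0, 1], repeat=3)}

def prob(**fixed):
    """Sum the joint over all assignments consistent with the fixed values."""
    total = 0.0
    for (a, b, c), p in joint.items():
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += p
    return total

# Given B: p(a, c | b) equals p(a | b) p(c | b) for every assignment.
independent_given_b = all(
    abs(prob(a=a, b=b, c=c) / prob(b=b)
        - (prob(a=a, b=b) / prob(b=b)) * (prob(b=b, c=c) / prob(b=b))) < 1e-12
    for a, b, c in product([0, 1], repeat=3))

# Without B: p(a, c) differs from p(a) p(c) for these CPDs.
independent_marginally = all(
    abs(prob(a=a, c=c) - prob(a=a) * prob(c=c)) < 1e-12
    for a, c in product([0, 1], repeat=2))

print(independent_given_b, independent_marginally)  # True False
```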

##### Cascade

- If $G$ is of the form $A \rightarrow B \rightarrow C$,
    - If $B$ is observed, then $A \perp C \mid B$
    - If $B$ is unobserved, then $A \not\perp C$
- **Intuition**: $B$ contains all the information that determines the outcomes of $C$; once it is observed, there is nothing else that affects $C$'s outcomes.

##### V-Structure

- If $G$ is of the form $A \rightarrow C \leftarrow B$, then observing $C$ couples $A$ and $B$.
    - If $C$ is unobserved, then $A \perp B$
    - If $C$ is observed, then $A \not\perp B \mid C$
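
The coupling effect of a v-structure is often called "explaining away". The sketch below assumes hypothetical CPDs (all numbers are illustrative) in which $C$ is likely to be $1$ whenever either parent is $1$, and compares beliefs about $A$ under different observations.

```python
from itertools import product

# Hypothetical v-structure A -> C <- B with binary variables.
p_a = {0: 0.9, 1: 0.1}
p_b = {0: 0.8, 1: 0.2}
# p(C = 1 | a, b), keyed by (a, b); illustrative numbers only.
p_c1_given_ab = {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}

def joint(a, b, c):
    p_c1 = p_c1_given_ab[(a, b)]
    return p_a[a] * p_b[b] * (p_c1 if c == 1 else 1.0 - p_c1)

def prob_a1(given):
    """p(A = 1 | given), where given maps 'b' and/or 'c' to observed values."""
    def consistent(b, c):
        return given.get("b", b) == b and given.get("c", c) == c
    num = sum(joint(1, b, c)
              for b, c in product([0, 1], repeat=2) if consistent(b, c))
    den = sum(joint(a, b, c)
              for a, b, c in product([0, 1], repeat=3) if consistent(b, c))
    return num / den

print(prob_a1({}))                # prior belief in A = 1
print(prob_a1({"b": 1}))          # same as the prior: A and B are independent
print(prob_a1({"c": 1}))          # observing C raises the belief in A = 1
print(prob_a1({"c": 1, "b": 1}))  # also observing B lowers it: B explains C
```

With $C$ unobserved, learning $B$ leaves the belief in $A$ unchanged; once $C$ is observed, learning $B$ changes it, which is exactly $A \not\perp B \mid C$.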

#### $n$-Variable Independencies in Directed Graphs - How do you identify the pairs of independent variables in an $n$-variable Bayesian network?

##### $d$-separation (a.k.a. Directed Separation)

- Two variables $Q$ and $W$ are **$d$-separated** given a set of observed variables $O$ if no active path connects them.

##### Active Path (a.k.a. Dependent Path)

- An undirected path in the Bayesian Network structure $G$ is called **active** given observed variables $O$ if for every consecutive triple of variables $X$, $Y$, $Z$ on the path, one of the following holds:
    - **Evidential Trail**: $X \leftarrow Y \leftarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Causal Trail**: $X \rightarrow Y \rightarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Common Cause**: $X \leftarrow Y \rightarrow Z$, and $Y$ is unobserved ($Y \not\in O$)
    - **Common Effect**: $X \rightarrow Y \leftarrow Z$, and $Y$ or any of its descendants is observed
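
The four cases can be expressed as a small classifier over one consecutive triple. This is a sketch under assumptions: edges are stored as a set of `(parent, child)` pairs, the triple really is connected in the graph, and the helper names `classify_triple` and `triple_is_open` are hypothetical.

```python
def classify_triple(x, y, z, edges):
    """Name the pattern formed by consecutive path nodes x, y, z.

    edges is a set of directed (parent, child) pairs.
    """
    into_y_from_x = (x, y) in edges  # x -> y (otherwise x <- y)
    into_y_from_z = (z, y) in edges  # z -> y (otherwise z <- y)
    if into_y_from_x and into_y_from_z:
        return "common effect"       # x -> y <- z
    if not into_y_from_x and not into_y_from_z:
        return "common cause"        # x <- y -> z
    if into_y_from_x:
        return "causal trail"        # x -> y -> z
    return "evidential trail"        # x <- y <- z

def triple_is_open(kind, y, observed, y_descendants=frozenset()):
    """Return True if the triple satisfies its condition from the list above."""
    if kind == "common effect":
        return bool(({y} | set(y_descendants)) & set(observed))
    return y not in observed

# Hypothetical graph: X -> Y <- Z, with Y -> D.
edges = {("X", "Y"), ("Z", "Y"), ("Y", "D")}
kind = classify_triple("X", "Y", "Z", edges)
print(kind)                                    # common effect
print(triple_is_open(kind, "Y", set()))        # False
print(triple_is_open(kind, "Y", {"D"}, {"D"})) # True
```

A path is active only if every one of its consecutive triples is open under this rule.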
    
##### Independence Maps

- Let $I(G) = \{(X \perp Y \mid Z) : X, Y \text{ are } d\text{-sep given } Z\}$ be the set of independence statements implied by $d$-separation in $G$.
- If a probability distribution $p$ factorizes over $G$, then $I(G) \subseteq I(p)$.
    - $G$ is an $I$-map (**independence map**) for $p$.
- However, a probability distribution $q$ can factorize over $G$, yet have independencies that are not captured in $G$.
- **Important Note**: A Bayesian network cannot perfectly represent all probability distributions.

#### Equivalence - How can Bayesian networks be equivalent?

- $G_1$ and $G_2$ are **$I$-equivalent**...
    - If they encode the same independencies: $I(G_1) = I(G_2)$.
    - If they have the same skeleton and the same v-structures.
        - A **skeleton** is an undirected graph obtained by dropping the directionality of the arrows.
    - If the $d$-separation between variables is the same.
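
The skeleton and v-structure criterion is easy to check mechanically. The sketch below assumes graphs given as sets of `(parent, child)` pairs and counts only v-structures whose two parents are not directly connected, which is the usual convention for this criterion; the function names are hypothetical.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: each edge as an unordered pair."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def same_skeleton_and_v_structures(edges1, edges2):
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

cascade = {("A", "B"), ("B", "C")}        # A -> B -> C
common_parent = {("B", "A"), ("B", "C")}  # A <- B -> C
v_structure = {("A", "B"), ("C", "B")}    # A -> B <- C

print(same_skeleton_and_v_structures(cascade, common_parent))  # True
print(same_skeleton_and_v_structures(cascade, v_structure))    # False
```

This matches the three-variable cases: the cascade and the common parent imply the same independencies, while the v-structure does not.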

#### Example Problem 1

![Problem 1](images/BN_1.png)

##### Question

- Are $X_{1}$ and $X_{6}$ $d$-separated given $\{X_{2}, X_{3}\}$?

##### Solution

1. **Path**: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{2} \rightarrow X_{6}$
        - This triple is a *causal trail*, which requires $X_{2}$ to be unobserved. As $X_{2}$ is observed, it does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{1} \rightarrow X_{3} \rightarrow X_{5} \rightarrow X_{6}$
    1. *Consecutive Triple*: $X_{1} \rightarrow X_{3} \rightarrow X_{5}$
        - This triple is a *causal trail*, which requires $X_{3}$ to be unobserved. As $X_{3}$ is observed, it does not hold.
    2. *Consecutive Triple*: $X_{3} \rightarrow X_{5} \rightarrow X_{6}$
        - As $X_{5}$ is unobserved, the *causal trail* does hold.
    3. As not all the consecutive triples hold, this path is not *active*.
3. As there are no active paths between $X_{1}$ and $X_{6}$, they are $d$-separated given $\{X_{2}, X_{3}\}$.

#### Example Problem 2

![Problem 2](images/BN_2.png)

##### Question

- Are $X_{2}$ and $X_{3}$ $d$-separated given $\{X_{1}, X_{6}\}$?

##### Solution

1. **Path**: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \leftarrow X_{1} \rightarrow X_{3}$
        - This triple is a *common cause*, which requires $X_{1}$ to be unobserved. As $X_{1}$ is observed, it does not hold.
    2. As not all the consecutive triples hold, this path is not *active*.
2. **Path**: $X_{2} \rightarrow X_{6} \leftarrow X_{5} \leftarrow X_{3}$
    1. *Consecutive Triple*: $X_{2} \rightarrow X_{6} \leftarrow X_{5}$
        - As $X_{6}$ is observed, the *common effect* does hold.
    2. *Consecutive Triple*: $X_{6} \leftarrow X_{5} \leftarrow X_{3}$
        - As $X_{5}$ is unobserved, the *evidential trail* does hold.
    3. As all the consecutive triples hold, this path is *active*.
3. As there exists an active path between $X_{2}$ and $X_{3}$, they are not $d$-separated given $\{X_{1}, X_{6}\}$.
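
Both example answers can be cross-checked by enumerating paths programmatically. The sketch below is a brute-force check, not an efficient algorithm, and its edge list is an assumption: it is inferred only from the paths named in the two solutions, so the figures may contain further nodes or edges.

```python
# Edges inferred from the paths listed in the solutions (an assumption).
EDGES = {("X1", "X2"), ("X1", "X3"), ("X2", "X6"), ("X3", "X5"), ("X5", "X6")}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(start, goal, path=None):
    """Yield every simple path from start to goal, ignoring edge direction."""
    path = path or [start]
    if path[-1] == goal:
        yield path
        return
    for u, v in EDGES:
        for here, nxt in ((u, v), (v, u)):
            if here == path[-1] and nxt not in path:
                yield from simple_paths(start, goal, path + [nxt])

def is_active(path, observed):
    for x, y, z in zip(path, path[1:], path[2:]):
        if (x, y) in EDGES and (z, y) in EDGES:  # common effect
            if not (({y} | descendants(y)) & observed):
                return False
        elif y in observed:                      # trails and common cause
            return False
    return True

def d_separated(a, b, observed):
    return not any(is_active(p, observed) for p in simple_paths(a, b))

print(d_separated("X1", "X6", {"X2", "X3"}))  # True  (Example Problem 1)
print(d_separated("X2", "X3", {"X1", "X6"}))  # False (Example Problem 2)
```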

### Markov Random Fields (Undirected Probabilistic Graphical Model)

TODO

## Inference

- **Problem**: *Given a probabilistic model, how do we obtain answers to relevant questions about the world?*
    - **Marginal Inference**: *What is the probability of a given variable in our model after we sum everything else out?*
    - **Maximum A Posteriori**: *What is the most likely assignment of variables?*
    - *Complexity*: NP-hard in general
- **Solution**: TODO.
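
To make the two queries concrete, the sketch below answers both by brute-force enumeration on a hypothetical three-variable binary model (a cascade with made-up CPDs). It visits all $d^{n}$ assignments, so it is a naive baseline for tiny models rather than an efficient inference algorithm.

```python
from itertools import product

# Hypothetical joint for a cascade x1 -> x2 -> x3 over binary variables.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # outer key is x1
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # outer key is x2

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

assignments = list(product([0, 1], repeat=3))

# Marginal inference: p(x3 = 1), summing out x1 and x2.
marginal_x3 = sum(joint(*xs) for xs in assignments if xs[2] == 1)

# MAP inference: the single most probable joint assignment.
map_assignment = max(assignments, key=lambda xs: joint(*xs))

print(marginal_x3)
print(map_assignment)  # (0, 0, 0)
```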