# Graphical Models Part 2: Markov Random Fields

This document is about Markov Random Fields (MRFs), but not their learning or inference algorithms.



## Interpretation of potential functions
We know that a distribution based on an undirected graph is defined as the normalized product of potential functions defined on the (maximal) cliques of the graph. These potential functions have no interpretation as conditional or marginal probabilities, as discussed on pp. 28 and 31 of Jordan's book (Chapter 2).

This is also discussed in 10-708 (slides for lecture 3, pp. 19)
![image](./interpretation_potential_functions_10708.png)
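
To see concretely why a potential is not a marginal, here is a minimal sketch on a hypothetical three-node chain A - B - C with binary variables. The tables `phi_ab` and `phi_bc` are made-up numbers for illustration, not taken from Jordan's book or the slides.

```python
from itertools import product

# Hypothetical pairwise potentials on the chain A - B - C (binary variables).
phi_ab = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}
phi_bc = {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 1.0}

# Unnormalized measure: product of the clique potentials.
unnorm = {(a, b, c): phi_ab[a, b] * phi_bc[b, c]
          for a, b, c in product((0, 1), repeat=3)}
Z = sum(unnorm.values())  # partition function
joint = {x: v / Z for x, v in unnorm.items()}

# Marginal P(A, B), obtained by summing C out of the joint.
p_ab = {(a, b): sum(joint[a, b, c] for c in (0, 1))
        for a, b in product((0, 1), repeat=2)}

# The potential rescaled to sum to one, for comparison with the marginal.
phi_ab_total = sum(phi_ab.values())
phi_ab_normalized = {k: v / phi_ab_total for k, v in phi_ab.items()}

for key in sorted(p_ab):
    print(key, round(p_ab[key], 4), round(phi_ab_normalized[key], 4))
```

Even after rescaling, `phi_ab` does not match P(A, B), because the marginal also absorbs the influence of `phi_bc` through B.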

## Three kinds of conditional independence statements from MRF

For a BN, we have the **local independencies**, as well as the whole set defined by d-separation. For an MRF, there are three sets of independencies.

### Global independencies

This is the set defined by graph separation, and it corresponds to the set defined by d-separation for a BN.

Koller book (pp. 115):
![image](./indep_global_Koller.png)

10708 slides (pp. 14, lecture 3): 
![image](./indep_global_10708.png)
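
Checking separation in an undirected graph amounts to a reachability test with the conditioning set removed. The sketch below assumes the graph is given as an adjacency dictionary; the function name `is_separated` and the example 4-cycle are hypothetical and chosen only for illustration.

```python
from collections import deque

def is_separated(adj, xs, ys, zs):
    """Return True if every path from xs to ys passes through a node in zs."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    frontier = deque(xs - zs)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in ys:
            return False  # reached ys without crossing zs
        for nbr in adj[node]:
            if nbr not in seen and nbr not in zs:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Hypothetical 4-cycle A - B - C - D - A.
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}

print(is_separated(cycle, {"A"}, {"C"}, {"B", "D"}))  # both paths are blocked
print(is_separated(cycle, {"A"}, {"C"}, {"B"}))       # path through D stays open
```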

### Local independencies

Koller book (pp. 118):
![image](./indep_local_Koller.png)
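
In an MRF the Markov blanket of a node is simply its set of neighbors, so the local independencies can be listed directly from the adjacency structure. This is a small sketch under that definition; `local_independencies` and the chain graph are hypothetical names and data.

```python
def local_independencies(adj):
    """Map each node X to (rest, blanket): X is independent of rest given blanket."""
    nodes = set(adj)
    result = {}
    for x in adj:
        blanket = set(adj[x])          # Markov blanket = neighbors of X
        rest = nodes - blanket - {x}   # everything else in the graph
        result[x] = (rest, blanket)
    return result

# Hypothetical chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
local = local_independencies(chain)
for x in sorted(local):
    rest, blanket = local[x]
    print(x, "indep of", sorted(rest), "given", sorted(blanket))
```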

### Pairwise independencies

Koller book (pp. 118):
![image](./indep_pairwise_Koller.png)
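
The pairwise independencies can be enumerated the same way: one statement per pair of non-adjacent nodes, conditioned on all remaining nodes. A minimal sketch, with a hypothetical function name and example graph:

```python
from itertools import combinations

def pairwise_independencies(adj):
    """List (X, Y, rest) for each non-adjacent pair: X indep of Y given rest."""
    nodes = sorted(adj)
    statements = []
    for x, y in combinations(nodes, 2):
        if y not in adj[x]:
            statements.append((x, y, set(nodes) - {x, y}))
    return statements

# Hypothetical 4-cycle A - B - C - D - A.
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
pairs = pairwise_independencies(cycle)
for x, y, rest in pairs:
    print(x, "indep of", y, "given", sorted(rest))
```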

### Relationship of the three

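Each set of independencies is implied by the previous one: the global independencies imply the local ones, and the local ones imply the pairwise ones. So each set is (literally) weaker than the previous set. For positive distributions, however, the three are equivalent.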

Koller book (pp. 119):
![image](./indep_equi_Koller.png)

10708 slides (pp. 28, lecture 3): 
![image](./indep_equi_10708.png)

## Equivalence theorem

This establishes the equivalence between factorization over the graph structure and the set of conditional independencies read off the graph. We only discuss **positive** distributions.

More specifically, for **positive** distributions, the following two conditions describe the same set of distributions:

1. factorization (P factorizes according to G)
2. conditional independence (G is an I-map of P, i.e. P satisfies the conditional independence statements from G)

Koller book (pp. 115):
![image](./soundness_Koller.png)
This is the soundness part in the next section.
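
The direction from factorization to independence can be checked numerically on a toy case. The sketch below builds a positive distribution from made-up pairwise potentials on a hypothetical 4-cycle A - B - C - D - A and tests conditional independence by brute force. The helpers `marginal` and `cond_indep` are illustrative, not from any library.

```python
from itertools import product

def phi(u, v, same, diff):
    """Hypothetical pairwise potential: one value if u == v, another otherwise."""
    return same if u == v else diff

def unnormalized(a, b, c, d):
    # Product of the four edge potentials of the cycle A - B - C - D - A.
    return (phi(a, b, 3.0, 1.0) * phi(b, c, 1.0, 2.0)
            * phi(c, d, 5.0, 1.0) * phi(d, a, 1.0, 4.0))

states = list(product((0, 1), repeat=4))
Z = sum(unnormalized(*s) for s in states)
P = {s: unnormalized(*s) / Z for s in states}

def marginal(P, idx, vals):
    """Probability that the variables at positions idx take the values vals."""
    return sum(p for s, p in P.items()
               if all(s[k] == v for k, v in zip(idx, vals)))

def cond_indep(P, i, j, given, tol=1e-12):
    """Check X_i indep of X_j given X_given via P(i,j,z) P(z) == P(i,z) P(j,z)."""
    for z in product((0, 1), repeat=len(given)):
        pz = marginal(P, given, z)
        for xi, xj in product((0, 1), repeat=2):
            pij = marginal(P, (i, j) + given, (xi, xj) + z)
            pi = marginal(P, (i,) + given, (xi,) + z)
            pj = marginal(P, (j,) + given, (xj,) + z)
            if abs(pij * pz - pi * pj) > tol:
                return False
    return True

a_c_given_bd = cond_indep(P, 0, 2, (1, 3))  # A, C are separated by {B, D}
a_b_given_cd = cond_indep(P, 0, 1, (2, 3))  # A, B are adjacent
print(a_c_given_bd, a_b_given_cd)
```

The separated pair passes the check, while the adjacent pair A, B fails it for these particular potentials.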

Koller book (pp. 116):
![image](./HC_Koller.png)
This direction (from independencies to factorization) is also called the Hammersley-Clifford theorem. (I think there are many versions of this theorem; I just picked this one based on Koller's book.)

There's an example showing why the previous statement only holds for positive distributions (I have not checked its correctness).

![image](./HC_why_positive_Koller.png)


## Soundness and completeness

This is about whether the set of conditional independence statements read off an MRF graph using separation is exactly the (maximal) set of conditional independence statements that hold for every **positive** P that factorizes according to the given MRF graph.

Soundness: 

Koller book (pp. 115):
![image](./soundness_Koller.png)

Completeness:
![image](./completeness_Koller.png)

Stronger version of completeness: 
![image](./completeness_Koller_2.png)

Soundness + Completeness (10708 slides for lecture 3, pp. 25)
![image](./soundness_completeness_10708.png)

## Alternative, compact representations of MRF

### Factor graph

This representation of an MRF makes explicit the factors (clique potentials) involved in the network, and can reveal some fine-grained structure, although not in terms of additional conditional independence statements. Basically, the same MRF can map to different factor graphs. Factor graphs can be helpful in inference (Junction Tree algorithm).

One example in Koller's book (pp. 123):
![image](./factor_graph_Koller.png)

Factor graphs are also discussed on pp. 36 of Jordan's book (Chapter 2). Jordan mentions that factorization is a richer concept than conditional independence, since different factor graphs can map to the same MRF.
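
A quick way to see that different factor graphs can map to the same MRF: the MRF induced by a factor graph connects every pair of variables that share a factor. The sketch below compares a single three-way factor with three pairwise factors on a hypothetical triangle A, B, C. The function name `induced_mrf_edges` is made up for illustration.

```python
from itertools import combinations

def induced_mrf_edges(factor_scopes):
    """Edges of the MRF induced by a factor graph: link variables sharing a factor."""
    edges = set()
    for scope in factor_scopes:
        for u, v in combinations(sorted(scope), 2):
            edges.add((u, v))
    return edges

# Two hypothetical factor graphs over the same variables.
single_factor = [("A", "B", "C")]                        # one factor on the full clique
pairwise_factors = [("A", "B"), ("B", "C"), ("A", "C")]  # three pairwise factors

print(induced_mrf_edges(single_factor) == induced_mrf_edges(pairwise_factors))
```

Both parameterizations give the same fully connected graph, and hence the same conditional independence statements, even though the pairwise version has fewer free parameters.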


### Log-linear models

In specific application contexts, we may want to represent the MRF factorization as a sum in the exponent rather than as a product of potentials.

Koller's book (pp. 124-125):
![image](./log_linear_Koller_1.png)
![image](./log_linear_Koller_2.png)

I think in many cases, people just assume the features are known and only learn the weights $w_i$.

In theory, a log-linear model can represent any positive distribution, as long as you use the right features.
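
One way to see this is to use one indicator feature per entry of each potential table. The sketch below rewrites a made-up positive potential in log-linear form, assuming the energy sign convention `phi = exp(-w)`. That convention is an assumption here, so check it against the definition in the figures above. All names and numbers are hypothetical.

```python
import math

# Hypothetical positive potential over two binary variables.
phi = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

# One indicator feature per table entry, with weight w = -log(phi).
# Positivity matters: log(0) would be undefined.
weights = {assignment: -math.log(value) for assignment, value in phi.items()}

def feature(assignment, x):
    """Indicator feature: 1 when x equals the given assignment, else 0."""
    return 1.0 if x == assignment else 0.0

def log_linear_value(x):
    """exp(-sum_i w_i f_i(x)), which should reproduce phi[x]."""
    return math.exp(-sum(w * feature(a, x) for a, w in weights.items()))

for x in sorted(phi):
    print(x, phi[x], round(log_linear_value(x), 6))
```

In practice the appeal of log-linear models is the opposite case: a small number of shared features instead of one per table entry.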

## Other stuff

Koller and 10708 also discuss minimal I-maps, etc. But these are mainly about how to learn an MRF from a set of conditional independencies in the data. Currently I'm not interested in such topics, so I just skip them.