In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import sys
sys.path.append('../../pyutils')
import metrics
import utils

# Introduction

In an undirected graphical model, each vertex represents a random variable, and the absence of an edge between two vertices means that the two random variables are conditionally independent given the others. The graph gives a visual way of understanding the joint distribution of the entire set of random variables.  
This graph is also called a Markov random field.  
Sparse graphs have a small number of edges and are convenient for interpretation.  

Each edge is parametrized by its value (or potential), which encodes the strength of the conditional dependence between the random variables.  
The challenges of graphical models are:
- model selection (graph structure)
- estimation of the edge parameters from data (learning)
- computation of marginal probabilities and expectations of the random variables (inference)

# Markov Graphs and Their Properties

A graph $\mathcal{G}$ is a pair $(V,E)$, with $V$ a set of vertices and $E$ a set of edges.  
Two vertices $X$ and $Y$ are adjacent if there is an edge between them: $X \sim Y$.  
A path $X_1,\text{...},X_n$ is a sequence of joined vertices: $X_{i-1} \sim X_i$ for $i=2,\text{...},n$.  
A complete graph is a graph with every pair of vertices joined by an edge.  
A subgraph $U \subseteq V$ is a subset of vertices together with their edges.  

In a Markov graph, the absence of an edge implies that the random variables are conditionally independent given the other variables:
$$\text{No edge joining $X$ and $Y$} \iff X \perp Y | \text{rest}$$

Let $A$, $B$ and $C$ be subgraphs. $C$ separates $A$ and $B$ if every path between $A$ and $B$ intersects $C$.
$$\text{$C$ separates $A$ and $B$} \implies A \perp B | C$$
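
The separation property can be checked mechanically on a small graph. Below is a minimal sketch, assuming the graph is stored as an adjacency dictionary; the function name `separates` and the 4-cycle example are made up for illustration. It runs a breadth-first search from $A$ that is not allowed to enter $C$, and reports separation when no vertex of $B$ is reached.

```python
from collections import deque

def separates(adj, A, B, C):
    """Return True if every path from A to B passes through C."""
    blocked = set(C)
    start = set(A) - blocked
    seen = set(start)
    queue = deque(start)
    while queue:
        u = queue.popleft()
        if u in B:
            return False  # reached B without crossing C
        for v in adj[u]:
            if v not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Hypothetical 4-cycle: 0 - 1 - 2 - 3 - 0
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
sep_both = separates(adj, {0}, {2}, {1, 3})  # both routes are blocked
sep_one = separates(adj, {0}, {2}, {1})      # the route through 3 stays open
print(sep_both, sep_one)
```

In the 4-cycle, vertices 0 and 2 are only separated when both of their common neighbours are in $C$.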

A clique is a complete subgraph. A clique is said to be maximal if no other vertex can be added to it and still yield a clique.  

A probability density function $f$ over $\mathcal{G}$ can be represented as:
$$f(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$$
with $\mathcal{C}$ the set of maximal cliques, and $\psi_C$ the clique potentials. These are not really density functions, but they capture the dependence in $X_C$ by scoring certain $x_C$ higher than others.  
$Z$ is the normalizing constant, also called the partition function:
$$Z = \sum_{x \in \mathcal{X}} \prod_{C \in \mathcal{C}} \psi_C (x_C)$$
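
For a small discrete model, the partition function can be computed by brute-force enumeration. The sketch below assumes a hypothetical chain $X_0 - X_1 - X_2$ of binary variables, whose maximal cliques are its two edges, and an arbitrary illustrative potential `psi` that scores agreeing neighbours twice as high as disagreeing ones.

```python
import itertools
import numpy as np

# Hypothetical chain X0 - X1 - X2; the maximal cliques are the two edges
cliques = [(0, 1), (1, 2)]

def psi(xc):
    # Illustrative potential: agreeing neighbours score 2, others score 1
    return 2.0 if xc[0] == xc[1] else 1.0

def unnormalized(x):
    # Product of the clique potentials for one configuration x
    return np.prod([psi((x[i], x[j])) for i, j in cliques])

states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)          # partition function
f = {x: unnormalized(x) / Z for x in states}      # normalized probabilities
print(Z, f[(0, 0, 0)])
```

The enumeration visits $2^3$ configurations here. The same loop over $2^p$ states quickly becomes impractical as $p$ grows.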

This chapter focuses on pairwise Markov graphs. There is a potential function for each edge, and at most second-order interactions are represented. They need fewer parameters and are easier to work with.

# Undirected Graphical Models for Continuous Variables

We assume the observations have a multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$. Since the Gaussian distribution represents at most second-order relationships, it encodes a pairwise Markov graph.  

The Gaussian distribution has the property that all conditional distributions are also Gaussian.  
Let $\Theta = \Sigma^{-1}$. If $\Theta_{ij}=0$, then variables $i$ and $j$ are conditionally independent given the other variables.
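
A small numerical check of this property, as a sketch with a made-up $3 \times 3$ precision matrix for a chain $X_0 - X_1 - X_2$ (so $\Theta_{02} = 0$). The covariance $\Sigma$ is dense, meaning $X_0$ and $X_2$ are marginally dependent, but the conditional covariance of $(X_0, X_2)$ given $X_1$, obtained with the usual Schur complement formula, is diagonal.

```python
import numpy as np

# Hypothetical precision matrix of a chain: no edge between variables 0 and 2
Theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(Theta)

a, b = [0, 2], [1]  # a: variables of interest, b: conditioning variable
# Conditional covariance: Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
cond_cov = Sigma[np.ix_(a, a)] - Sigma[np.ix_(a, b)] @ np.linalg.solve(
    Sigma[np.ix_(b, b)], Sigma[np.ix_(b, a)])

print(np.around(Sigma, 2))     # Sigma[0, 2] is not zero
print(np.around(cond_cov, 2))  # off-diagonal entry is zero
```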

## Estimation of the Parameters when the Graph Structure is Known

Given some observations of $X$, let's estimate the parameters of the joint distribution ($\mu$ and $\Sigma$). We first suppose that the graph is complete.  
Let's define the empirical covariance matrix $S$:
$$S = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x}) (x_i - \bar{x})^T$$
with $\bar{x}$ the sample mean vector.  

Up to a constant, and after maximizing over $\mu$, the log-likelihood of the data can be written as:
$$l(\Theta) = \log \det \Theta - \text{trace}(S\Theta)$$
The maximum likelihood estimate of $\Sigma$ is $S$.  
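
The following sketch computes the empirical covariance on synthetic data (the mixing matrix and sample size are arbitrary choices) and checks numerically that the inverse of the empirical covariance gives a higher value of $\log \det \Theta - \text{trace}(S\Theta)$ than a perturbed precision matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data; the mixing matrix is an arbitrary choice
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mix

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / len(X)  # empirical covariance, 1/N version

def loglik(Theta, S):
    # log det(Theta) - trace(S Theta); slogdet is more stable than log(det(.))
    _, logdet = np.linalg.slogdet(Theta)
    return logdet - np.trace(S @ Theta)

Theta_hat = np.linalg.inv(S)
best = loglik(Theta_hat, S)
perturbed = loglik(Theta_hat + 0.05 * np.eye(3), S)
print(best, perturbed)
```

Note that `np.cov` uses the $1/(N-1)$ normalization by default; `bias=True` gives the $1/N$ version used here.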

Now if some edges are missing, we maximize $l(\Theta)$ under the constraint that some entries of $\Theta$ are 0.  
We add Lagrange constants for all missing edges:
$$l_C(\Theta) = \log \det \Theta - \text{trace}(S\Theta) - \sum_{(j,k) \notin E} \gamma_{jk}\theta_{jk}$$
This is maximized by solving the following gradient equation:
$$\Theta^{-1} - S - \Gamma = 0$$

with $\Gamma$ the matrix of Lagrange parameters, which is non-zero only for the missing edges.  

We can partition the matrices into 2 parts: part 1 is the first $p-1$ rows and columns, and part 2 is the $p$th row and column.  
With $W = \Theta^{-1}$ and $\beta = -\theta_{12}/\theta_{22}$, so that $w_{12} = W_{11}\beta$, the upper-right block of the equation can be rewritten as:
$$W_{11}\beta - s_{12} - \gamma_{12} = 0$$
The non-zero elements of $\gamma_{12}$ correspond to edges constrained to be $0$. The corresponding rows carry no information and can be removed. We can reduce $\beta$ and $W_{11}$ in the same way, giving the new equation:
$$W^*_{11}\beta^* - s^*_{12} = 0$$
with solution:
$$\hat{\beta}^* = W_{11}^{*-1} s^*_{12}$$
The solution is padded with zeros to give $\hat{\beta}$.

Algorithm $17.1$ page 634

In [53]:
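def part_mats(W, S, j):
    # W_11: W without row and column j; s_12: column j of S without entry j
    W_11 = np.delete(W, j, axis=0)
    W_11 = np.delete(W_11, j, axis=1)
    s_12 = np.delete(S, j, axis=0)[:,j]
    return W_11, s_12

def regroup_mat(W, w_12, j):
    # Write the updated w_12 into column j of W (in place), keeping W[j,j].
    # Row j is refreshed entry by entry when the other columns are updated.
    w_12 = np.insert(w_12, j, W[j,j])
    W[:,j] = w_12
    return W


def edit_mats(G, j, W_11, s_12):
    # Drop the rows/columns of the edges missing at node j
    N = len(G)
    suppr = [i for i in range(N) if G[i,j] == 0]
    # Shift indices to the (N-1)-sized partition, where entry j was removed
    suppr = [x if x < j else x - 1 for x in suppr]
    s_12 = np.delete(s_12, suppr)
    W_11 = np.delete(W_11, suppr, axis=0)
    W_11 = np.delete(W_11, suppr, axis=1)
    return W_11, s_12

def extend_beta(G, j, betar):
    # Pad the reduced solution with zeros at the missing edges
    N = len(G)
    suppr = [i for i in range(N) if G[i,j] == 0]
    suppr = [x if x < j else x - 1 for x in suppr]
    # Fill by position: np.insert would misplace the zeros when a node
    # has several missing edges
    beta = np.zeros(N - 1)
    beta[np.delete(np.arange(N - 1), suppr)] = betar
    return beta

def estim_gauss(S, G, max_iters = 100, tol=1e-16):
    # G is the 0/1 adjacency matrix of the graph, with ones on the diagonal
    N = len(S)
    W = S.astype(float)  # float copy of S, updated in place below

    for it in range(max_iters):

        W_old = W.copy()

        for j in range(N):
            # Partition W and S around variable j
            W_11, s_12 = part_mats(W, S, j)

            # Reduce the system by removing the missing edges
            W_11r, s_12r = edit_mats(G, j, W_11, s_12)

            # Solve the reduced system, then pad the solution with zeros
            betar = np.linalg.inv(W_11r) @ s_12r
            beta = extend_beta(G, j, betar)

            # Update w_12 and write it back into W
            w_12 = W_11 @ beta
            W = regroup_mat(W, w_12, j)

        # Stop when a full cycle over the variables no longer changes W
        if np.linalg.norm(W - W_old) < tol:
            break

    print('Iterations:', it)
    return W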

S = np.array([
    [10, 1, 5, 4],
    [1., 10, 2, 6],
    [5, 2, 10, 3],
    [4, 6, 3, 10]
])

G = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1]
])

W = estim_gauss(S, G)
print(np.around(W,2))
print(np.around(np.linalg.inv(W),2))

Iterations: 10
[[10.    1.    1.31  4.  ]
 [ 1.   10.    2.    0.87]
 [ 1.31  2.   10.    3.  ]
 [ 4.    0.87  3.   10.  ]]
[[ 0.12 -0.01 -0.   -0.05]
 [-0.01  0.1  -0.02 -0.  ]
 [ 0.   -0.02  0.11 -0.03]
 [-0.05  0.   -0.03  0.13]]


## Estimation of the Graph Structure

Reference: Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. [PDF](file:///home/aiw/docs/eosl-refs.pdf)

We can use lasso regularization to estimate $\Sigma$ in a way that drives entries of $\Theta$ to zero, which gives us the graph structure.  
We maximize the penalized log-likelihood:
$$\log \det \Theta - \text{trace}(S\Theta) - \lambda ||\Theta||_1$$

The gradient equation is:
$$\Theta^{-1} - S - \lambda \text{sign}(\Theta) = 0$$

Similarly to the algorithm above, we reach the equation:
$$W_{11}\beta - s_{12} + \lambda \text{sign}(\beta) = 0$$

This problem is similar to linear regression with lasso, and can be solved using the pathwise coordinate descent method.  
Let $V = W_{11}$, the update has the form:
$$\hat{\beta}_j \leftarrow \frac{1}{V_{jj}} S(s_{12j} - \sum_{k \neq j} V_{kj} \hat{\beta}_k, \lambda)$$
with $S$ the soft-threshold operator:
$$S(x,t) = \text{sign}(x)(|x| - t)_+$$

Algorithm $17.2$ page $636$
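
The steps above can be sketched in NumPy as follows. This is an illustrative, unoptimized implementation under stated assumptions, not a reference one: the function names, the stopping rules and the tolerances are my own choices. $W$ is initialized to $S + \lambda I$, each column is updated with the coordinate-descent rule, and $\Theta$ is recovered at the end from $\theta_{12} = -\beta\,\theta_{22}$ and $1/\theta_{22} = w_{22} - w_{12}^T\beta$.

```python
import numpy as np

def soft_threshold(x, t):
    # S(x, t) = sign(x) (|x| - t)_+
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(V, s, lam, beta, n_sweeps=100, tol=1e-10):
    """Cyclical coordinate descent for V beta - s + lam * sign(beta) = 0."""
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(len(s)):
            # s_j - sum_{k != j} V_kj beta_k
            r = s[j] - V[:, j] @ beta + V[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / V[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def graphical_lasso(S, lam, max_iters=100, tol=1e-8):
    p = len(S)
    W = S + lam * np.eye(p)  # the diagonal of W is never changed afterwards
    B = np.zeros((p, p))     # column j stores beta for variable j (warm start)
    for _ in range(max_iters):
        W_old = W.copy()
        for j in range(p):
            idx = np.arange(p) != j
            W_11 = W[np.ix_(idx, idx)]
            B[idx, j] = lasso_cd(W_11, S[idx, j], lam, B[idx, j])
            w_12 = W_11 @ B[idx, j]
            W[idx, j] = w_12
            W[j, idx] = w_12
        if np.max(np.abs(W - W_old)) < tol:
            break
    # Final cycle: recover Theta column by column
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = np.arange(p) != j
        theta_jj = 1.0 / (W[j, j] - W[idx, j] @ B[idx, j])
        Theta[j, j] = theta_jj
        Theta[idx, j] = -B[idx, j] * theta_jj
    return W, Theta

S = np.array([[10., 1, 5, 4], [1, 10, 2, 6], [5, 2, 10, 3], [4, 6, 3, 10]])
W_hat, Theta_hat = graphical_lasso(S, lam=1.0)
W_big, Theta_big = graphical_lasso(S, lam=7.0)
print(np.around(Theta_hat, 3))
print(np.around(Theta_big, 3))
```

With $\lambda$ larger than every off-diagonal entry of $S$ (here $\lambda = 7$), all coefficients are thresholded to zero and the estimated graph has no edges.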

# Undirected Graphical Models for Discrete Variables

Pairwise Markov networks with binary variables are very common. They are called Ising models, or Boltzmann machines.  
The vertices are referred to as nodes or units. The value at each node can either be observed (visible) or unobserved (hidden).  

We consider first the case where all $p$ nodes are visible with edge pairs $(j,k) \in E$. Their joint probability is given by:
$$p(X,\Theta) = \exp \left[ \sum_{(j,k) \in E} \theta_{jk} X_j X_k - \Phi(\Theta) \right]$$

with $\Phi(\Theta)$ the log of the partition function:
$$\Phi(\Theta) = \log \sum_{x \in \mathcal{X}} \left[ \exp \sum_{(j,k) \in E} \theta_{jk} x_j x_k \right]$$

The model includes a constant node $X_0=1$, so that the parameters $\theta_{j0}$ act as intercepts.  
The Ising model implies a logistic form for each node conditional on the others:
$$P(X_j=1| X_{-j} = x_{-j}) = \left( 1 + \exp ( -\theta_{j0} - \sum_{(j,k) \in E} \theta_{jk} x_k) \right) ^{-1}$$
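
The logistic form can be verified by enumeration on a tiny model. In this sketch the parameter values are made up, node 0 is the constant node, and nodes 1 to 3 take values in $\{0, 1\}$. The conditional probability computed from the joint distribution is compared with the logistic expression.

```python
import itertools
import numpy as np

# Hypothetical parameters; theta[j, k] = 0 means there is no edge (j, k)
theta = np.zeros((4, 4))
theta[0, 1] = theta[1, 0] = -0.5  # intercept of node 1
theta[0, 2] = theta[2, 0] = 0.3   # intercept of node 2
theta[1, 2] = theta[2, 1] = 1.0
theta[2, 3] = theta[3, 2] = -0.8

def energy(x):
    # sum over edges (j < k) of theta_jk x_j x_k, with x[0] = 1
    return sum(theta[j, k] * x[j] * x[k]
               for j in range(4) for k in range(j + 1, 4))

# Enumerate all configurations of the three binary nodes
states = [(1,) + s for s in itertools.product([0, 1], repeat=3)]
weights = np.array([np.exp(energy(x)) for x in states])
log_partition = np.log(weights.sum())  # Phi(Theta)
prob = dict(zip(states, weights / weights.sum()))

# P(X_1 = 1 | X_2 = 1, X_3 = 0), first from the joint distribution
cond_joint = prob[(1, 1, 1, 0)] / (prob[(1, 1, 1, 0)] + prob[(1, 0, 1, 0)])
# then from the logistic form
cond_logistic = 1.0 / (1.0 + np.exp(-(theta[1, 0] + theta[1, 2] * 1 + theta[1, 3] * 0)))
print(cond_joint, cond_logistic)
```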

## Estimation of the Parameters when the Graph Structure is Known

Given N observations $x_i$, we can estimate the parameters by maximizing the log-likelihood:
$$l(\Theta) = \sum_{i=1}^N \log P_\Theta(X = x_i)$$

$$
\begin{equation}
\begin{split}
l(\Theta) & = \sum_{i=1}^N \log P_\Theta(X = x_i) \\
& = \sum_{i=1}^N \left( \sum_{(j,k) \in E} \theta_{jk} x_{ij} x_{ik} - \Phi(\Theta) \right)
\end{split}
\end{equation}
$$

Setting the gradient to $0$ gives:
$$\hat{E}(X_j,X_k) - E_\Theta(X_j,X_k) = 0$$
with $\hat{E}(X_j,X_k)$ the expectation taken with respect to the empirical distribution of the data:
$$\hat{E}(X_j, X_k) = \frac{1}{N} \sum_{i=1}^N x_{ij}x_{ik}$$

We can find the maximum likelihood estimates using gradient search or Newton methods, but computing $E_\Theta(X_j,X_k)$ involves enumerating $p(X,\Theta)$ over the $|\mathcal{X}|=2^p$ values of $X$, which is not feasible for large $p$ (over about 30).

When $p$ is large, the gradient is approximated using other methods, such as the mean field approximation or Gibbs sampling.
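
As an illustration, the sketch below compares the exact moments $E_\Theta(X_j X_k)$, obtained by enumeration, with a Gibbs sampling estimate on a tiny model where both are feasible. The parameter values, the number of sweeps and the burn-in length are arbitrary choices; node 0 is the constant node and is never resampled.

```python
import itertools
import numpy as np

def exact_moments(theta):
    """E[X_j X_k] by enumerating all 2^(p-1) configurations (x_0 = 1)."""
    p = len(theta)
    states = np.array([(1,) + s for s in itertools.product([0, 1], repeat=p - 1)],
                      dtype=float)
    # sum_{j<k} theta_jk x_j x_k = 0.5 x^T theta x (symmetric, zero diagonal)
    energies = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    probs = np.exp(energies - energies.max())
    probs /= probs.sum()
    return np.einsum('i,ij,ik->jk', probs, states, states)

def gibbs_moments(theta, n_sweeps=20000, burn_in=1000, seed=0):
    """Monte Carlo estimate of E[X_j X_k] by Gibbs sampling."""
    rng = np.random.default_rng(seed)
    p = len(theta)
    x = np.ones(p)
    total = np.zeros((p, p))
    for sweep in range(n_sweeps + burn_in):
        for j in range(1, p):
            # Logistic conditional of node j given all the other nodes;
            # theta[j, j] = 0, so the current value of x[j] has no effect
            prob_one = 1.0 / (1.0 + np.exp(-theta[j] @ x))
            x[j] = float(rng.random() < prob_one)
        if sweep >= burn_in:
            total += np.outer(x, x)
    return total / n_sweeps

# Hypothetical parameters on 3 binary nodes plus the constant node 0
theta = np.zeros((4, 4))
theta[0, 1] = theta[1, 0] = -0.5
theta[0, 2] = theta[2, 0] = 0.3
theta[1, 2] = theta[2, 1] = 1.0
theta[2, 3] = theta[3, 2] = -0.8

exact = exact_moments(theta)
approx = gibbs_moments(theta)
print(np.around(exact, 3))
print(np.around(approx, 3))
```

The enumeration costs $O(2^p)$, while each Gibbs sweep costs $O(p^2)$ in this dense formulation, at the price of Monte Carlo error.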

## Hidden Nodes

Let's suppose we have a subset of visible variables $X_V$, and the remaining variables $X_H$ are hidden. The log-likelihood becomes:

$$
\begin{equation}
\begin{split}
l(\Theta) & = \sum_{i=1}^N \log P_\Theta(X_V = x_{iV}) \\
& = \sum_{i=1}^N \left( \log \sum_{x_H \in \mathcal{X}_H} \exp \sum_{(j,k) \in E} \theta_{jk} x_{ij} x_{ik} - \Phi(\Theta) \right)
\end{split}
\end{equation}
$$

The gradient becomes:
$$\frac{d l(\Theta)}{d\theta_{jk}} = \hat{E}_V E_\Theta(X_j,X_k|X_V) - E_\Theta(X_j,X_k)$$

It can be computed using Gibbs sampling, but the method can be very slow, even for moderate-sized models. We can add more restrictions to make these computations manageable.

## Estimation of the Graph Structure

As for continuous random variables, we can use lasso to remove edges.  

One solution is to use a penalized log-likelihood, but the gradient computation for dense graphs is not manageable.  

Another solution is to fit an $L_1$-penalized logistic regression model to each node as a function of the other nodes, and then symmetrize the edge parameter estimates.
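
A minimal sketch of this node-wise approach, under several assumptions of mine: the $L_1$-penalized logistic regression is solved with a simple proximal gradient loop (any other solver would do), the intercept is not penalized, and the two estimates of each edge are symmetrized by averaging, which is only one possible rule. The function names and the toy data are made up for illustration.

```python
import numpy as np

def l1_logistic(X, y, lam, n_iters=500):
    """Proximal gradient for mean logistic loss + lam * ||w||_1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Step 1/L, with L an upper bound on the Lipschitz constant of the gradient
    step = 4.0 / (np.linalg.norm(X, 2) ** 2 / n + 1.0)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w = w - step * grad_w
        # Soft-thresholding is the proximal operator of the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
        b = b - step * grad_b  # the intercept is not penalized
    return w, b

def neighborhood_selection(X, lam):
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        w, _ = l1_logistic(X[:, others], X[:, j], lam)
        Theta[j, others] = w
    # Symmetrize by averaging the two estimates of each edge (an assumption)
    return 0.5 * (Theta + Theta.T)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 4)).astype(float)
# Make node 1 copy node 0 most of the time, so that one strong edge exists
flip = rng.random(300) < 0.1
X[:, 1] = np.where(flip, 1 - X[:, 0], X[:, 0])

Theta_hat = neighborhood_selection(X, lam=0.05)
Theta_empty = neighborhood_selection(X, lam=10.0)
print(np.around(Theta_hat, 2))
```

Each regression only involves one node and its candidate neighbours, so the partition function never has to be evaluated.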

## Restricted Boltzmann Machines

An RBM consists of one layer of visible units and one layer of hidden units, with no connections within a layer.  
The restricted form simplifies the Gibbs sampling used to compute the gradient of the log-likelihood.  
Using the contrastive-divergence algorithm, RBMs can be trained rapidly.  
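
Below is a rough sketch of one-step contrastive divergence (CD-1) for a binary RBM, written from the general description of the algorithm rather than from a specific reference implementation. The names (`cd1_step`, `b_vis`, `b_hid`), the learning rate, the layer sizes and the toy data are all assumptions made for illustration. Because there are no connections within a layer, each layer can be sampled in one vectorized step given the other.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b_vis, b_hid, v0, lr, rng):
    """One CD-1 update on a batch v0 of shape (n, n_visible), in place."""
    # Positive phase: hidden units are independent given the visible units
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: a single Gibbs step back to the visible layer and up again
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W + b_hid)
    # Approximate gradient: data statistics minus one-step model statistics
    n = len(v0)
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return np.mean((v0 - v1_prob) ** 2)  # reconstruction error, for monitoring

rng = np.random.default_rng(0)
# Hypothetical toy data: copies of two binary prototypes
prototypes = np.array([[1, 1, 1, 0, 0, 0],
                       [0, 0, 0, 1, 1, 1]], dtype=float)
data = prototypes[rng.integers(0, 2, size=200)]

n_visible, n_hidden = 6, 2
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
errors = [cd1_step(W, b_vis, b_hid, data, lr=0.1, rng=rng) for _ in range(500)]
print(errors[0], errors[-1])
```

The reconstruction error is only a convenient monitoring quantity; it is not the log-likelihood being optimized.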

RBMs can learn to extract interesting features from data.  
We can stack several RBMs together and train the whole joint density model.