In [None]:
import pandas as pd
import numpy as np

# Graphical lasso

The graphical lasso is a method proposed by Friedman et al. in 2007 to estimate a sparse graph by adding a sparsity-inducing ($\ell_1$) penalty to the Gaussian log-likelihood.


This model assumes that the variables we are analyzing follow a multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.

Moreover, it is known that if the $ij$-th entry of the precision matrix $\Theta = \Sigma^{-1}$ (the inverse of the covariance matrix) is zero, then the two variables $i$ and $j$ are conditionally independent given all the other variables.
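To make this concrete, here is a small sketch on a hypothetical 3-variable chain graph (the matrix `theta` is invented for illustration and is not the lab's precision matrix). Variables 0 and 2 are correlated, yet their entry in the precision matrix, and hence their partial correlation, is zero.

```python
import numpy as np

# Hypothetical chain graph 0 - 1 - 2: no direct edge between 0 and 2.
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])

# The covariance is dense: 0 and 2 are marginally correlated through 1.
sigma = np.linalg.inv(theta)

# Partial correlation of i and j given the rest:
# -theta_ij / sqrt(theta_ii * theta_jj)
scale = np.sqrt(np.diag(theta))
partial_corr = -theta / np.outer(scale, scale)
np.fill_diagonal(partial_corr, 1.0)

print("cov(0, 2)          =", sigma[0, 2])
print("partial corr(0, 2) =", partial_corr[0, 2])
```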

Several papers propose different methods to reach an approximate solution of the problem. They are typically based on maximizing the Gaussian log-likelihood, which up to additive constants and a positive scaling factor is given by

$$ \text{log det}\Theta - \text{tr}(S\Theta) $$

where $\Theta$ is the inverse of the covariance matrix, i.e. the unknown graph we want to estimate, and $S$ is the empirical covariance of our data.
If we have a data matrix $X \in \mathbb{R}^{n \times d}$ whose columns have been centered, then $S=\frac{1}{n}X^TX \in \mathbb{R}^{d \times d}$.
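As a quick check of this formula, the sketch below computes $S$ by hand on hypothetical random data (`X_demo` is a stand-in, not the lab's `data.npy`) and compares it with `np.cov`. Note that the columns are centered first, which the formula $\frac{1}{n}X^TX$ implicitly assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # hypothetical data: n=200 samples, d=5 variables

# Center each column, then apply S = X^T X / n
X_centered = X_demo - X_demo.mean(axis=0)
S = X_centered.T @ X_centered / X_demo.shape[0]

# np.cov with bias=True uses the same 1/n normalisation
S_numpy = np.cov(X_demo, rowvar=False, bias=True)
print(np.allclose(S, S_numpy))
```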

Since $\Theta$ is assumed to be sparse, the final objective also imposes a sparsity-inducing $\ell_1$ penalty on its off-diagonal entries:

$$ \hat{\Theta} = \underset{\Theta}{\text{argmin}}\left(\text{tr}(S\Theta) - \text{log det}(\Theta) + \lambda\sum_{j\neq k}|\Theta_{jk}|\right)$$
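The objective above can be evaluated directly with NumPy. The following is only a sketch for inspecting candidate solutions; the function name `graphical_lasso_objective` is hypothetical, and this is not how a solver minimizes the functional.

```python
import numpy as np

def graphical_lasso_objective(theta, S, lam):
    """tr(S theta) - log det(theta) + lam * sum_{j != k} |theta_jk|."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        # log det is only defined for positive-definite theta
        return np.inf
    off_diag_l1 = np.abs(theta).sum() - np.abs(np.diag(theta)).sum()
    return np.trace(S @ theta) - logdet + lam * off_diag_l1

# Hypothetical 2x2 example
theta_demo = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
value = graphical_lasso_objective(theta_demo, np.eye(2), lam=0.1)
print(value)
```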

## In this lab you are going to infer a sparse network in two flavors:

- Supervised
- Unsupervised

**More specifically, you will be given n observations, drawn from a fully specified multivariate Gaussian distribution, whose precision matrix is known. You will infer a precision matrix by maximizing a score in a cross-validation scheme (supervised) and then you will assume you do not know the underlying distribution (*i.e.* the precision matrix) and will try to infer a precision matrix in an unsupervised manner.**

Define the distribution, the number of samples, and the number of variables.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(20)

In [None]:
X = np.load('data.npy')

**The ground-truth precision matrix is loaded into `precision`:**

In [None]:
precision = np.load('precision.npy')

plt.imshow(precision, cmap = 'viridis')
plt.colorbar()

**You are going to use the scikit-learn [GraphicalLasso](https://scikit-learn.org/stable/modules/generated/sklearn.covariance.GraphicalLasso.html) estimator.**

Define a plausible list of regularization parameters `alphas` for the model.
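One possible choice, given only as a sketch, is a logarithmically spaced grid, since the effect of the penalty usually varies over orders of magnitude. The range below is an assumption and may need adjusting to the scale of your data.

```python
import numpy as np

# Hypothetical grid: 10 values between 1e-3 and 1, evenly spaced on a log scale
alphas = np.logspace(-3, 0, 10)
print(alphas)
```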

In [None]:
##Code here



For each hyper-parameter in the list `alphas`, fit a `GraphicalLasso` model to your data and choose the best one according to a score of your choice (**Hint: remember that inferring the right edges is equivalent to inferring the right class in a binary classification problem**). For stability analysis, you could also try the same setting on different splits.
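The loop below is one way this could look, not the reference solution. It assumes the F1 score on the upper-triangular edges as the classification score and a 3-fold split; `X_demo` and `true_precision` are hypothetical stand-ins (a 6-variable chain graph) so that the sketch runs without `data.npy` and `precision.npy`.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the lab's X and precision
d = 6
true_precision = 2.0 * np.eye(d) - 0.8 * (np.eye(d, k=1) + np.eye(d, k=-1))
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal(np.zeros(d), np.linalg.inv(true_precision), size=400)

# Each undirected edge is counted once (upper triangle, diagonal excluded)
upper = np.triu_indices(d, k=1)
true_edges = (true_precision[upper] != 0).astype(int)

alphas = np.logspace(-2, 0, 8)
mean_f1 = []
for alpha in alphas:
    fold_scores = []
    splitter = KFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, _ in splitter.split(X_demo):
        model = GraphicalLasso(alpha=alpha, max_iter=200).fit(X_demo[train_idx])
        # An entry counts as an edge if it is numerically non-zero
        pred_edges = (np.abs(model.precision_[upper]) > 1e-8).astype(int)
        fold_scores.append(f1_score(true_edges, pred_edges, zero_division=0))
    mean_f1.append(np.mean(fold_scores))

best_alpha = alphas[int(np.argmax(mean_f1))]
print("best alpha:", best_alpha)
```

Other scores (precision, recall, balanced accuracy) fit the same loop; F1 is used here only because edges are typically much rarer than non-edges.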

In [None]:
from sklearn.covariance import GraphicalLasso

Define a function that recovers the corresponding adjacency matrix from an arbitrary square matrix.
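A minimal sketch of such a function follows. The name `to_adjacency` and the tolerance `tol` are assumptions: entries whose absolute value is below `tol` are treated as zero, and self-loops are dropped.

```python
import numpy as np

def to_adjacency(matrix, tol=1e-8):
    """Binary adjacency matrix: 1 where |entry| > tol, 0 elsewhere, zero diagonal."""
    matrix = np.asarray(matrix)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError("expected a square matrix")
    adjacency = (np.abs(matrix) > tol).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    return adjacency

# Hypothetical example: the 1e-12 entries are treated as numerical noise
demo = np.array([[ 2.0, -1.0,   0.0],
                 [-1.0,  2.0, 1e-12],
                 [ 0.0, 1e-12,  2.0]])
print(to_adjacency(demo))
```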

Compare the ground-truth adjacency matrix with the inferred one using the **Hamming distance**.
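For two undirected graphs, the Hamming distance can be taken as the number of edges present in one graph but not in the other. The sketch below assumes symmetric adjacency matrices and counts each edge once via the upper triangle; the function name is hypothetical.

```python
import numpy as np

def hamming_distance(adj_true, adj_pred):
    """Number of node pairs whose edge status differs between the two graphs."""
    adj_true = np.asarray(adj_true, dtype=bool)
    adj_pred = np.asarray(adj_pred, dtype=bool)
    if adj_true.shape != adj_pred.shape:
        raise ValueError("adjacency matrices must have the same shape")
    upper = np.triu_indices_from(adj_true, k=1)  # count each undirected edge once
    return int(np.sum(adj_true[upper] != adj_pred[upper]))

# Hypothetical example: one missing edge (1, 2) and one spurious edge (0, 2)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
B = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(hamming_distance(A, B))
```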

### Unsupervised learning of the precision matrix

Assume that you do not know the precision matrix of the underlying data distribution. You need to infer the precision matrix using only your observations. In this setting, probabilistic model selection (or “information criteria”) typically provides an analytical technique for scoring and choosing among candidate models.

You are going to use the **Bayesian Information Criterion (BIC)**, which is appropriate for models fit under the maximum likelihood estimation framework.

It is defined as:

$$BIC = -2LL + \log(N)k$$

where $LL$ is the log-likelihood of the model, $N$ is the number of examples in the training dataset, and $k$ is the number of parameters in the model.

The score as defined above is minimized, i.e. the model with the lowest BIC is selected.

**Define a function for computing the BIC specific to the graphical lasso likelihood:**

$$ \text{log det}\Theta - \text{tr}(S\Theta) $$
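One way to sketch this, under stated assumptions: for $N$ zero-mean Gaussian samples the log-likelihood is $\frac{N}{2}\left(\text{log det}\Theta - \text{tr}(S\Theta)\right)$ up to an additive constant that does not depend on $\Theta$, and $k$ is counted as the $d$ diagonal entries plus the non-zero entries in the upper triangle. Both conventions, as well as the function name and the tolerance `tol`, are assumptions rather than the only valid choice.

```python
import numpy as np

def graphical_lasso_bic(theta, S, n_samples, tol=1e-8):
    """BIC = -2 LL + log(N) k for an estimated precision matrix theta."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        raise ValueError("theta must be positive definite")
    # Gaussian log-likelihood of N samples, additive constant dropped
    log_likelihood = 0.5 * n_samples * (logdet - np.trace(S @ theta))
    d = theta.shape[0]
    # Free parameters: diagonal entries plus non-zero upper-triangular entries
    n_edges = np.count_nonzero(np.abs(theta[np.triu_indices(d, k=1)]) > tol)
    k = d + n_edges
    return -2.0 * log_likelihood + np.log(n_samples) * k

# Hypothetical check with identity matrices: LL = -150, k = 3
bic_demo = graphical_lasso_bic(np.eye(3), np.eye(3), n_samples=100)
print(bic_demo)
```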

**Define a splitting scheme in order to obtain for each split the BIC and for each hyper-parameter an average BIC over the splits. Then plot the average BIC against the parameters.**
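A possible splitting scheme is sketched below, assuming five random 80% subsamples via `ShuffleSplit`. The BIC is computed on each training subsample with the same conventions as above (constant dropped, $k$ = diagonal plus non-zero upper-triangular entries); `X_demo` is a hypothetical stand-in for the lab data.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.model_selection import ShuffleSplit

def bic_score(theta, S, n_samples, tol=1e-8):
    # -2 LL + log(N) k, with LL = N/2 (log det theta - tr(S theta)) up to a constant
    _, logdet = np.linalg.slogdet(theta)
    log_likelihood = 0.5 * n_samples * (logdet - np.trace(S @ theta))
    d = theta.shape[0]
    k = d + np.count_nonzero(np.abs(theta[np.triu_indices(d, k=1)]) > tol)
    return -2.0 * log_likelihood + np.log(n_samples) * k

# Hypothetical data drawn from a 6-variable chain graph
d = 6
true_precision = 2.0 * np.eye(d) - 0.8 * (np.eye(d, k=1) + np.eye(d, k=-1))
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal(np.zeros(d), np.linalg.inv(true_precision), size=400)

alphas = np.logspace(-2, 0, 8)
n_splits = 5
splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)

bic_per_split = np.zeros((n_splits, len(alphas)))
for s, (train_idx, _) in enumerate(splitter.split(X_demo)):
    X_train = X_demo[train_idx]
    S_train = np.cov(X_train, rowvar=False, bias=True)  # empirical covariance, 1/n
    for a, alpha in enumerate(alphas):
        model = GraphicalLasso(alpha=alpha, max_iter=200).fit(X_train)
        bic_per_split[s, a] = bic_score(model.precision_, S_train, len(train_idx))

mean_bic = bic_per_split.mean(axis=0)  # average BIC per hyper-parameter
best_alpha = alphas[int(np.argmin(mean_bic))]
print("alpha minimizing the average BIC:", best_alpha)
```

With matplotlib imported as `plt`, the curve can then be drawn with `plt.semilogx(alphas, mean_bic)`; a log-scaled x-axis matches the log-spaced grid.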

**After selecting the parameter that minimizes the BIC, compare the inferred network with the ground truth in terms of Hamming distance.**