# DBN Structure Inference

The idea is to infer a posterior for the *structure* of a Dynamic Bayesian Network (DBN), given some data.

We formulate this task with the following model:

$$ P(G | X) \propto P(X | G) \cdot P(G) $$

* $P(G)$ is a prior distribution over DBN structures. We'll assume it has the form
$$P(G) \propto \exp \left( -\lambda |G \setminus G^\prime| \right)$$
where $|G \setminus G^\prime|$ denotes the number of edges in $G$ that are not present in some reference graph $G^\prime$, and $\lambda$ controls how strongly such edges are penalized.
* $P(X | G)$ is the marginal likelihood of the DBN structure. That is, it's the likelihood of the DBN after the network parameters have been integrated out, so it scores network *structure* alone.
* If we assume some reasonable priors for network parameters, $P(X|G)$ can be obtained in closed form. In this work, we'll use the following marginal likelihood:
    
    $$P(X | G) \propto \prod_{i=1}^p (1 + n)^{-(2^{|\pi(i)|} - 1)/2} \left( X_i^{+ T} X_i^+ - \frac{n}{n+1} X_i^{+ T} B_i (B_i^T B_i)^{-1} B_i^T X_i^+ \right)^{-\frac{n}{2}}$$ 
    where $X_i^+$ and $B_i$ are matrices obtained from the data, and $\pi(i)$ is the set of parents of variable $i$ in $G$. This marginal likelihood results from an empirical prior over the regression coefficients, and an improper $\propto 1/\sigma^2$ prior for the regression "noise" variables.
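As a rough illustration of how one factor of this product could be evaluated, the sketch below computes the log of a single variable's term. It assumes that $X_i^+$ is stored as a length-$n$ vector and $B_i$ as a matrix with $n$ rows, so $n$ is taken to be the number of observations. The names `log_marginal_factor`, `x_plus` and `n_parents` are hypothetical and do not come from the notebook.

```julia
using LinearAlgebra

# Log of one factor of P(X | G), up to an additive constant.
# Assumption: x_plus is a length-n vector and B has n rows.
function log_marginal_factor(x_plus::Vector{Float64}, B::Matrix{Float64}, n_parents::Int)
    n = length(x_plus)
    total = dot(x_plus, x_plus)                 # X^T X
    if size(B, 2) == 0
        explained = 0.0                         # no parents: nothing is projected
    else
        coeffs = (B' * B) \ (B' * x_plus)       # (B^T B)^{-1} B^T X
        explained = dot(x_plus, B * coeffs)     # X^T B (B^T B)^{-1} B^T X
    end
    penalty = -(2^n_parents - 1) / 2 * log(1 + n)
    fit = -(n / 2) * log(total - n / (n + 1) * explained)
    return penalty + fit
end

x = [1.0, 2.0, 3.0]
no_parent = log_marginal_factor(x, zeros(3, 0), 0)
one_parent = log_marginal_factor(x, reshape([1.0, 2.0, 3.0], 3, 1), 1)
println(one_parent > no_parent)  # a perfectly predictive parent should score higher
```

Working in log space avoids underflow when many factors are multiplied; the log of the full product is the sum of these per-variable terms.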

In [4]:
include("DiGraph.jl")
using Gen
using PyPlot
using .DiGraphs
using LinearAlgebra
using CSV
using DataFrames



## Getting some data

For now, we'll work with some data used by Hill et al. in their 2012 paper, _Bayesian Inference of Signaling Network Topology in a Cancer Cell Line_.

The dataset gives the differential phosphorylation levels of 20 proteins in a cancer cell line perturbed by EGF. This is a well-studied signaling pathway; the goal is to produce a graph describing the dependencies between proteins in this pathway.

In [8]:
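# Recent CSV.jl releases need an explicit sink type as the second argument.
protein_names = CSV.read("data/protein_names.csv", DataFrame);
reference_adjacency = CSV.read("data/prior_graph.csv", DataFrame);
timesteps = CSV.read("data/time.csv", DataFrame);
timeseries_data = CSV.read("data/mukherjee_data.csv", DataFrame)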

| Row | Column1 | AKT.pS473 | AKT.pT308 | AMPK.PT172 | cJUN.pS73 | EGFR.PY1173 |
|---|---|---|---|---|---|---|
| | String | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | MDA-MB-468 0ng EGF (5 min) (1) | 0.530401 | 0.112262 | 0.0073852 | 0.0372281 | 0.0010107 |
| 2 | MDA-MB-468 0ng EGF (5 min) (2) | 0.508459 | 0.103413 | 0.007484 | 0.0376396 | 0.0018878 |
| 3 | MDA-MB-468 0ng EGF (5 min) (3) | 0.567345 | 0.119212 | 0.0071113 | 0.0369495 | 0.0029138 |
| 4 | MDA-MB-468 5ng EGF (5 min) (1) | 0.968135 | 0.262998 | 0.0052351 | 0.0384648 | 0.0088047 |
| 5 | MDA-MB-468 5ng EGF (5 min) (2) | 1.24135 | 0.311034 | 0.0067092 | 0.0334464 | 0.0086412 |
| 6 | MDA-MB-468 5ng EGF (5 min) (3) | 1.13693 | 0.289114 | 0.007687 | 0.0330702 | 0.0072702 |
| 7 | MDA-MB-468 10ng EGF (5 min) (1) | 1.29989 | 0.336902 | 0.01052 | 0.0351666 | 0.0173106 |
| 8 | MDA-MB-468 10ng EGF (5 min) (2) | 1.2357 | 0.291254 | 0.0063072 | 0.0366709 | 0.0185678 |
| 9 | MDA-MB-468 10ng EGF (5 min) (3) | 1.48894 | 0.348485 | 0.0085701 | 0.0374861 | 0.0236866 |
| 10 | MDA-MB-468 20ng EGF (5 min) (1) | 1.36396 | 0.351149 | 0.0099795 | 0.0356612 | 0.0622616 |


## Building the model

Implement the graph prior distribution:

$$P(G) \propto \exp \left( -\lambda |G \setminus G^\prime| \right)$$

In [3]:
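# Distribution over directed graphs, used as the structure prior P(G).
struct GraphPrior <: Gen.Distribution{DiGraph} end
const graphprior = GraphPrior()

# Placeholder sampler: always returns the reference graph, which has maximal prior probability.
Gen.random(gp::GraphPrior, lambda::Float64, reference_graph::DiGraph) = reference_graph

# Count the edges of `g` that are absent from `g_ref`, i.e. |G \ G'|.
function graph_edge_diff(g::DiGraph, g_ref::DiGraph)
    e1 = Set([g.edges[i, :] for i in 1:size(g.edges, 1)])
    e_ref = Set([g_ref.edges[i, :] for i in 1:size(g_ref.edges, 1)])
    return length(setdiff(e1, e_ref))
end

# Unnormalized log prior: -lambda * |G \ G'|.
Gen.logpdf(gp::GraphPrior, graph::DiGraph, lambda::Float64, reference_graph::DiGraph) = -lambda * graph_edge_diff(graph, reference_graph)

# Make the instance callable. Defining `graphprior(...) = ...` directly would clash with the const above.
(gp::GraphPrior)(lambda::Float64, reference_graph::DiGraph) = Gen.random(gp, lambda, reference_graph);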
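To make the behavior of this prior concrete, here is a small sketch that evaluates the same unnormalized log prior on plain edge sets. It stands in for the notebook's `DiGraph` type, whose internals are not shown here; the function name `log_graph_prior` and the tuple representation of edges are assumptions made for illustration only.

```julia
# Edges are (source, target) tuples; this stands in for the notebook's DiGraph type.
# Unnormalized log prior: -lambda times the number of edges of `g` missing from `g_ref`.
function log_graph_prior(g::Set{Tuple{Int,Int}}, g_ref::Set{Tuple{Int,Int}}, lambda::Float64)
    return -lambda * length(setdiff(g, g_ref))
end

g_ref = Set([(1, 2), (2, 3)])
g = Set([(1, 2), (3, 1), (2, 1)])

# `g` has two edges outside the reference graph: (3, 1) and (2, 1).
println(log_graph_prior(g, g_ref, 1.5))
# The reference graph itself incurs no penalty.
println(log_graph_prior(g_ref, g_ref, 1.5))
```

Note that the penalty is one-sided: `g` omits the reference edge `(2, 3)`, and that omission costs nothing under this prior.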

Implement a DBN:

In [1]:
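using Gen
using LinearAlgebra: dot

# One node of the DBN: a linear-Gaussian regression on the node's parents at the previous timestep.
# Assumption: the mean is the weighted sum of the parents' previous values; `noise` is the standard deviation.
@gen function dbn_node(x_prev::Vector{Float64}, parents::Vector{Int}, weights::Vector{Float64}, noise::Float64)
    x_new = @trace(Gen.normal(dot(weights, x_prev[parents]), noise), :x)
    return x_new
end

layer_nodes = Gen.Map(dbn_node)

# One timestep: sample every node given the previous layer.
# Assumption: `noise` holds one standard deviation per node.
@gen function dbn_layer(timestep::Int, x_prev::Vector{Float64}, parents::Vector{Vector{Int}},
                        weights::Vector{Vector{Float64}}, noise::Vector{Float64})
    # Map expects one vector per kernel argument, so repeat x_prev for every node.
    x_prev_repeat = fill(x_prev, length(parents))
    x_new = @trace(layer_nodes(x_prev_repeat, parents, weights, noise), :variables)
    return Float64[x for x in x_new]
end

# Named `unfolded_layers` so it does not clash with the `dbn` function below.
unfolded_layers = Gen.Unfold(dbn_layer)

independent_series = Gen.Map(unfolded_layers)

# Assumption: the initial state, weights and noise are passed in alongside the structure.
@gen function dbn(V::Vector{Int}, parents::Vector{Vector{Int}}, x_init::Vector{Float64},
                  weights::Vector{Vector{Float64}}, noise::Vector{Float64}, T::Int)
    x = @trace(unfolded_layers(T, x_init, parents, weights, noise), :timeseries)
    return x
end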
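The generative process this cell is aiming for can also be written in plain Julia, without Gen, which may help when checking the model's logic. The sketch below assumes a linear-Gaussian transition: each variable at time $t+1$ is a weighted sum of its parents at time $t$ plus Gaussian noise. The function `simulate_dbn` and its argument layout are hypothetical.

```julia
using LinearAlgebra
using Random

# Simulate a linear-Gaussian DBN forward in time.
# Row 1 of the result is x_init; row t + 1 depends only on row t.
function simulate_dbn(rng::AbstractRNG, x_init::Vector{Float64},
                      parents::Vector{Vector{Int}},
                      weights::Vector{Vector{Float64}},
                      noise::Vector{Float64}, T::Int)
    p = length(x_init)
    X = zeros(T + 1, p)
    X[1, :] = x_init
    for t in 1:T, i in 1:p
        # Mean is the weighted sum of the parents' values at the previous step.
        mu = dot(weights[i], X[t, parents[i]])
        X[t + 1, i] = mu + noise[i] * randn(rng)
    end
    return X
end

# Two variables: variable 1 has no parents, variable 2 copies twice variable 1.
parents = [Int[], [1]]
weights = [Float64[], [2.0]]
X = simulate_dbn(MersenneTwister(0), [1.0, 0.0], parents, weights, [0.0, 0.0], 2)
println(X)
```

With the noise set to zero, as in this usage example, the trajectory is deterministic, which makes it easy to verify by hand that parents are read from the previous timestep only.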

Implement our overall model:

In [None]:
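# Sorted parent (in-neighbor) indices of each vertex, in sorted vertex order.
function get_parent_vecs(G)
    return [sort(in_neighbors(G, v)) for v in sort(G.vertices)]
end

# Unfinished: prior over the regression coefficients.
# @gen function coeff_prior() ... end

@gen function data_generator(reference_graph::DiGraph{Int}, Tvec::Vector{Int})

    # Strength of the penalty on edges outside the reference graph.
    lambda = @trace(Gen.gamma(1, 1), :lambda)

    # Sample a structure using the `graphprior` instance (not the `GraphPrior` type).
    G = @trace(graphprior(lambda, reference_graph), :G)
    V = sort(G.vertices)
    parents = get_parent_vecs(G)

    # The remaining steps are not yet specified; they are kept here as placeholders.
    # x_init = @trace(Gen.mvnormal(), :x_init)   # mean and covariance still to be chosen
    # regression_coeffs = @trace(, :beta)        # coefficient prior still to be chosen
    # regression_noise = @trace(, :noise)        # noise prior still to be chosen
    # x = @trace(dbn(V, parents, ))              # remaining arguments and address still to be chosen
end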

I see a big fork in the path, a choice between naivety and cleverness:
* Implement a full Bayesian model which doesn't marginalize the network parameters. This may be simpler to implement, though it will very likely give inferior performance, since inference would then have to explore the parameters as well as the structure.
* Figure out how to implement the marginal likelihood. This will be trickier, but will probably make the difference between the method's practicality and impracticality. I envision the following:
    - Have a cache (glorified dictionary: (variable, parents) => float?) which stores factors of the marginal likelihood as they're computed. There may even be bells and whistles; e.g., we could set a memory limit in order to prevent the cache from ballooning.
    - This cache will be used by a "marginal_likelihood" Distribution object; its `random` function will return some reasonable value, and its `logpdf` function will return the (log) marginal likelihood.
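One possible shape for such a cache is sketched below. It is only an illustration of the idea: the type `FactorCache`, the function `cached_factor!`, and the crude "empty everything when full" memory limit are all assumptions, and the scoring function passed in is a stand-in for the real marginal likelihood factor.

```julia
# Hypothetical cache keyed by (variable, sorted parents).
struct FactorCache
    store::Dict{Tuple{Int,Vector{Int}},Float64}
    max_entries::Int
end
FactorCache(max_entries::Int) = FactorCache(Dict{Tuple{Int,Vector{Int}},Float64}(), max_entries)

# Return the cached factor for (variable, parents), computing it on a miss.
# `compute` is any function (variable, parents) -> Float64.
function cached_factor!(compute, cache::FactorCache, variable::Int, parents::Vector{Int})
    key = (variable, sort(parents))   # sort so parent order does not matter
    haskey(cache.store, key) && return cache.store[key]
    value = compute(variable, key[2])
    # Crude memory limit: drop all entries once the cache is full.
    length(cache.store) >= cache.max_entries && empty!(cache.store)
    cache.store[key] = value
    return value
end

# Stand-in scoring function that counts how often it is actually evaluated.
calls = Ref(0)
score(v, ps) = (calls[] += 1; -1.0 * length(ps))

cache = FactorCache(100)
a = cached_factor!(score, cache, 1, [3, 2])
b = cached_factor!(score, cache, 1, [2, 3])  # same parent set, served from the cache
println((a, b, calls[]))
```

Because the marginal likelihood factorizes over variables, a structure move that changes one variable's parent set only requires recomputing that variable's factor; every other factor can be looked up.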